Friday, 19 September 2025

molmo and pixmo

 [[openweights vlm (molmo and pixmo).pdf]]

1. So Ranjay Krishna's lab has built a very transparent vision-language model: not only the model but the datasets as well. The unique and most important feature of these models is the dataset itself. Normally, when a dataset is prepared for an LLM or VLM, it is generated in bulk from the OpenAI API or from other models like Claude or Gemini, and the model is then trained on that synthetic dataset. The problem is that a model trained on such a dataset, even if open source, is just a distilled version of the proprietary model and cannot go beyond it, especially in the case of vision-language models.
2. So in this Molmo and PixMo paper, they built the dataset for the vision-language model with a very novel approach, since VLM datasets are costly, resource intensive, and hard to keep at high quality compared to LLM datasets. They built a highly detailed image-caption dataset for pre-training, a free-form image Q&A dataset (instead of rigid, fixed captions, explainable, general and natural labels for the images are collected), and an innovative 2D pointing dataset (for pointing at objects, it also does localization based on coordinates). All of these datasets were built without using any external VLMs.
3. Their best model has 72B parameters and, at that time (5 Dec 2024), beat proprietary VLMs like Claude 3.5 Sonnet and Gemini Pro 1.5 on academic benchmarks, coming second only to GPT-4o.
4. This model was the SOTA model in its class of openness, i.e., the class of models that try to build everything in an openly reproducible way; otherwise it is hard to compete with proprietary models.
5. Molmo is specialized in natural image understanding and counting compared to other models, but on advanced reasoning problems the proprietary models still beat it.
6. Their dataset had 712k images, each with a caption of around 200+ words, which wasn't typed up through a crowdsourcing platform; rather, they innovated a very useful technique.
7. They had annotators describe each image in speech for 60-90 seconds and used that spoken explanation as the annotation. This trick of changing the modality during first-hand data collection helped build a very high-quality dataset without using any proprietary VLMs.
8. PixMo is not a single dataset but an array of datasets for pre-training and finetuning. To build the instruction-tuning dataset, they collected free-form data from users in an interactive way for 72k images, with 162k annotations (multiple annotations per image: different comments about a visual object, i.e., free-form annotation).
9. And to ground language and images, they collected 2.3 million grounding annotations from 223,000 images. Instead of bounding boxes (rectangles) or segmentation masks they used single (x, y) points, which made annotation much faster and more feasible, and it also works well for tasks like counting and identification.
10. The system uses a clever HTML-like format where coordinates are scaled from 0-100 regardless of image size: `<point x="10.0" y="10.0" alt="Mt. Rainier">mountain</point>`. This makes the system resolution-agnostic - it works the same whether the image is 100x100 pixels or 4K (see the parsing sketch after this list).
11. They did generate synthetic datasets, but without using any VLM: they only used an LLM to generate code, and built recipe-like synthetic datasets specifically for tasks like clock reading, chart understanding and table understanding; like studying the chemical formula of wine instead of learning about wine by tasting it.
12. Their way of training the model was also very innovative and effective. They used a pre-trained LLM and a vision encoder just like any other VLM, but the special parts were: a two-stage training pipeline (rather than the three-stage one with connector tuning, which means training the connector so the LLM can find structure in noisy data when the dataset isn't good); a novel overlapping multi-crop strategy (instead of sliding over the image grid by grid, it builds overlapping crops, so no context has to be remembered while reading the image grid by grid, and information loss at crop boundaries is prevented); efficient multi-annotation learning (if an image has multiple annotations, training happens in a single pass instead of duplicating the image with different captions); and an improved vision/language connector (the vision-language connector bridges visual and textual understanding in a multimodal model; traditional methods use basic feature stacking, whereas Molmo employs attention-based pooling that preserves spatial relationships, which significantly enhances performance on visual reasoning and counting tasks; see the connector sketch after this list).
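
A tiny sketch of how the 0-100 point markup from item 10 could be decoded back into pixel coordinates (the regex and helper are illustrative assumptions, not Molmo's actual parsing code):

```python
import re

# Matches the <point x="..." y="..." alt="...">...</point> markup shown above.
POINT_RE = re.compile(r'<point x="([\d.]+)" y="([\d.]+)" alt="[^"]*">[^<]*</point>')

def points_to_pixels(text: str, width: int, height: int):
    """Convert 0-100 scaled coordinates into pixel coordinates for a given image size."""
    return [(float(x) / 100.0 * width, float(y) / 100.0 * height)
            for x, y in POINT_RE.findall(text)]

print(points_to_pixels('<point x="10.0" y="10.0" alt="Mt. Rainier">mountain</point>', 4096, 2160))
# -> [(409.6, 216.0)] -- the same markup works for any resolution
```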
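
And a minimal sketch of the attention-based pooling idea from item 12: a small set of learned queries attends over the vision encoder's patch features and the pooled tokens are projected into the LLM's embedding space (dimensions and module choices here are my assumptions, not Molmo's actual connector):

```python
import torch
import torch.nn as nn

class AttentionPoolConnector(nn.Module):
    def __init__(self, vision_dim=1024, llm_dim=4096, n_queries=64, n_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_queries, vision_dim) * 0.02)  # learned queries
        self.pool = nn.MultiheadAttention(vision_dim, n_heads, batch_first=True)
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, patch_features):                     # (B, n_patches, vision_dim)
        B = patch_features.size(0)
        q = self.queries.unsqueeze(0).expand(B, -1, -1)
        pooled, _ = self.pool(q, patch_features, patch_features)  # queries attend over patches
        return self.proj(pooled)                           # (B, n_queries, llm_dim) tokens for the LLM

print(AttentionPoolConnector()(torch.randn(2, 576, 1024)).shape)  # torch.Size([2, 64, 4096])
```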

Wednesday, 10 September 2025

llms

1. Grouped-query attention (GQA) is slightly different from multi-headed attention and exists to reduce cost: in multi-head attention, separate queries, keys and values must be computed for every single head, whereas in grouped-query attention a whole group of query heads shares the same key and value heads, which significantly reduces both compute cost and memory cost.
2. The analogy: MHA is like everyone in the house having an iPhone and each using their own charger, cable and plug, whereas with GQA, if there are 4 people in the house they share 2 chargers, plugs and cables, so the electricity bill goes down (see the GQA sketch after this list).
3. Multi-head latent attention (MLA) is an important concept used in DeepSeek R1/V3 that replaces the multi-head attention of the original transformer architecture. Additionally, DeepSeek's architecture has 61 transformer blocks. Now, on to multi-head latent attention: just as in the GPT-2 architecture each query had its own key and value (let's say 100 dimensions each), in MLA the keys and values are kept in a compressed form (let's say a 10-dimensional vector), which greatly reduces memory usage, and this MLA idea matters more at inference than during training. The query is kept at 100 dimensions because the attention mechanism is applied on it, while the keys and values are the knowledge base, which is preserved even when kept in compressed form.
4. The analogy here matches the idea of a zip file: say there is a 100 MB .mp4 file that is compressed into a 10 MB .zip, but when you extract it everything comes back intact.
5. Inference efficiency is the main concern for real-world deployment, and MLA heavily optimizes both memory and speed there. It is not used in training in the same way, because during training the model needs full precision (see the MLA sketch after this list).
6. In the OLMo 2 model, the placement of the normalization layers was something unique. Initially, the original transformer architecture (decoder part) used post-norm (normalization (LayerNorm) after MHA and normalization after the feed-forward layer); then pre-norm started being used in models like Llama 3 8B and GPT-2 (RMSNorm and LayerNorm respectively, in those models); and finally, in the OLMo 2 7B model, post-norm was placed inside the residual, meaning that where the original transformer architecture had post-norm outside the residual, OLMo did exactly the opposite, so the loss spikes (instability) mostly seen with pre-norm usage did not appear with this post-norm placement.
7. QK-norm was also used in OLMo 2; it is an additional normalization applied to the queries and keys before attention, which helps further stabilize training (see the block sketch after this list).
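
A minimal GQA sketch for points 1-2, assuming 8 query heads sharing 2 key/value heads (illustrative dimensions, not any particular model's configuration):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GroupedQueryAttention(nn.Module):
    def __init__(self, d_model=512, n_q_heads=8, n_kv_heads=2):
        super().__init__()
        self.n_q_heads, self.n_kv_heads = n_q_heads, n_kv_heads
        self.head_dim = d_model // n_q_heads
        self.q_proj = nn.Linear(d_model, n_q_heads * self.head_dim)
        # fewer key/value projections -> smaller KV cache and less compute
        self.k_proj = nn.Linear(d_model, n_kv_heads * self.head_dim)
        self.v_proj = nn.Linear(d_model, n_kv_heads * self.head_dim)
        self.o_proj = nn.Linear(d_model, d_model)

    def forward(self, x):
        B, T, _ = x.shape
        q = self.q_proj(x).view(B, T, self.n_q_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).view(B, T, self.n_kv_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(B, T, self.n_kv_heads, self.head_dim).transpose(1, 2)
        # each K/V head is shared by (n_q_heads // n_kv_heads) query heads
        repeat = self.n_q_heads // self.n_kv_heads
        k = k.repeat_interleave(repeat, dim=1)
        v = v.repeat_interleave(repeat, dim=1)
        attn = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.o_proj(attn.transpose(1, 2).reshape(B, T, -1))

print(GroupedQueryAttention()(torch.randn(2, 5, 512)).shape)  # torch.Size([2, 5, 512])
```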
   
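A toy sketch of the MLA idea from points 3-5: keys and values are cached as a small latent (10-d in the note's example) and projected back up at attention time. RoPE details and causal masking are omitted, so this is only an illustration, not DeepSeek's implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentKVAttention(nn.Module):
    def __init__(self, d_model=100, d_latent=10):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)
        self.kv_down = nn.Linear(d_model, d_latent)   # compress once; only this is cached
        self.k_up = nn.Linear(d_latent, d_model)      # decompress at attention time
        self.v_up = nn.Linear(d_latent, d_model)

    def forward(self, x, kv_cache=None):
        latent = self.kv_down(x)                      # (B, T, d_latent) -> tiny KV cache
        if kv_cache is not None:
            latent = torch.cat([kv_cache, latent], dim=1)
        q, k, v = self.q_proj(x), self.k_up(latent), self.v_up(latent)
        out = F.scaled_dot_product_attention(q.unsqueeze(1), k.unsqueeze(1), v.unsqueeze(1))
        return out.squeeze(1), latent                 # return the latent as the new cache

out, cache = LatentKVAttention()(torch.randn(2, 4, 100))
print(out.shape, cache.shape)   # torch.Size([2, 4, 100]) torch.Size([2, 4, 10])
```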

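And a sketch covering points 6-7: QK-norm inside attention, plus post-norm placed inside the residual branch (assumes `torch.nn.RMSNorm`, available in PyTorch >= 2.4; illustrative only, not OLMo 2's actual code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QKNormAttention(nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.n_heads, self.head_dim = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)
        # QK-norm: extra normalization applied to queries and keys before attention
        self.q_norm = nn.RMSNorm(self.head_dim)
        self.k_norm = nn.RMSNorm(self.head_dim)

    def forward(self, x):
        B, T, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        shape = (B, T, self.n_heads, self.head_dim)
        q = self.q_norm(q.reshape(shape)).transpose(1, 2)
        k = self.k_norm(k.reshape(shape)).transpose(1, 2)
        v = v.reshape(shape).transpose(1, 2)
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.out(y.transpose(1, 2).reshape(B, T, -1))

class PostNormInsideResidualBlock(nn.Module):
    """Post-norm placed inside the residual branch: x = x + norm(sublayer(x))."""
    def __init__(self, d_model, n_heads, d_ff):
        super().__init__()
        self.attn = QKNormAttention(d_model, n_heads)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.norm1, self.norm2 = nn.RMSNorm(d_model), nn.RMSNorm(d_model)

    def forward(self, x):
        x = x + self.norm1(self.attn(x))   # norm after the sublayer, but inside the residual
        x = x + self.norm2(self.ffn(x))
        return x

print(PostNormInsideResidualBlock(64, 4, 128)(torch.randn(2, 6, 64)).shape)  # torch.Size([2, 6, 64])
```
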
Saturday, 6 September 2025

90k parameters


# You Only Need 90K Parameters to Adapt Light: a Light Weight Transformer for Image Enhancement and Exposure Correction

1. The paper addresses the common problem that images taken in difficult lighting conditions (too dark, too bright, under- or over-exposed) look bad and also **degrade the performance** of computer vision algorithms (like object detection).
2. Normally, a camera's internal **Image Signal Processor (ISP)** converts the raw sensor data into the standard image format (sRGB) we usually see. This process involves steps like **color correction** and adjusting brightness/contrast **(gamma correction)**.
3. The researchers propose a new method called the **Illumination Adaptive Transformer (IAT).** Instead of just trying to fix the final image, IAT works by essentially **learning to *adjust the parameters of the ISP process itself* based on the input image**. It breaks down the ISP process and uses a Transformer model (specifically, attention queries) to figure out the **best adjustments** for things like color and gamma needed to correct the lighting.
4. The key advantages highlighted are that IAT is very small (lightweight, only **90k parameters**) and extremely fast (takes only **0.004 seconds per image**). Despite its efficiency, it performs better than current leading methods (State-Of-The-Art) on **standard tests for fixing low-light and exposure problems**. Importantly, fixing the images with IAT also significantly helps other computer vision tasks, like detecting objects or understanding image segments, perform better in these challenging lighting conditions.

**Notes from the Paper Text:**

- **Problem:** Real-world challenging illumination (low light, under/over-exposure) harms visual quality and computer vision task performance.
- **Background:** Cameras use an Image Signal Processor (ISP) to convert raw data to sRGB images, involving steps like color/gamma correction.
- **Proposed Solution:** Illumination Adaptive Transformer (IAT).
    - A lightweight and fast transformer model.
    - Aims to restore normally lit sRGB images from poorly lit inputs.
- **IAT Mechanism:**
    - Decomposes the ISP pipeline conceptually (into local/global components).
    - Uses attention queries to learn and adjust ISP-related parameters (e.g., color correction, gamma correction).
- **IAT Features:**
    - Very lightweight: ~90k parameters.
    - Very fast: ~0.004s processing time per image [inference time].
- **Performance:**
    - Consistently outperforms State-Of-The-Art (SOTA) methods on low-light enhancement and exposure correction benchmarks.
    - Significantly improves downstream tasks (object detection, semantic segmentation) under various lighting conditions.

---

---

## Content Based Image Retrieval

**Content-based image retrieval** (CBIR) is a process in image retrieval where the system searches for images based on their **visual and semantic content**, rather than **metadata or textual descriptions.** It involves extracting features from images, such as color, texture, and shape, and using these features to **compare and rank images in the database**. This technology is often used in applications like facial recognition, image search engines, and medical imaging.
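
A small CBIR sketch, assuming a pretrained torchvision ResNet-18 as the feature extractor and cosine similarity for the ranking (the images below are placeholders; a real system would load them from disk with `Image.open`):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import models
from PIL import Image

weights = models.ResNet18_Weights.DEFAULT
backbone = models.resnet18(weights=weights)
backbone.fc = nn.Identity()            # keep the 512-d global feature instead of class logits
backbone.eval()
preprocess = weights.transforms()      # resize/crop/normalize expected by the backbone

@torch.no_grad()
def embed(img: Image.Image) -> torch.Tensor:
    return F.normalize(backbone(preprocess(img).unsqueeze(0)), dim=-1)

# Placeholder images standing in for a query and a small database.
query = embed(Image.new("RGB", (224, 224), "red"))
db = torch.cat([embed(Image.new("RGB", (224, 224), c)) for c in ("red", "green", "blue")])

scores = (db @ query.T).squeeze(-1)        # cosine similarity (features are L2-normalized)
print(scores.argsort(descending=True))     # database images ranked for the query
```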

xtuner


[[llms]]

# Xtuner 

### Single-turn and multi-turn conversation datasets

A single-turn dataset is effective for simple *FAQ bots and text-classification-related tasks.* 

multi turn conversation dataset is required for applications needing sustained (continuing for long time) interaction like *customer support, mental health counselling and talkbot robots.* 


#### incremental pre-training: 

    training Llama 2 on a Nepali corpus to boost its Nepali language understanding.


**for instruction tuning, the response generation (output) loss is used for weight updates while the loss on the instruction part (system input) is ignored**


*amalgamate:combine*


#### multi-turn conversation dataset sample: 


    <|system|> You are a helpful assistant.
    <|user|> What is the capital of France?
    <|assistant|> Paris is the capital of France.
    <|user|> What's the population?
    <|assistant|> About 67 million people live in France.
    <|user|> Who is the president?
    <|assistant|> Emmanuel Macron is the current president.



**xtuner uses its own method to deal with multi-turn conversation datasets**

i. concatenate the full conversation into one sequence. 
ii. add special **<|user|> and <|assistant|>** tokens to *mark who said what.*

iii. only compute the loss for the *assistant tokens* (a loss mask is used: 1 means compute loss, 0 means ignore); see the sketch below.

iv. training becomes fast and efficient.
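
A rough sketch of assistant-only loss masking on a concatenated sequence (placeholder token ids and mask, not XTuner's actual implementation):

```python
import torch
import torch.nn.functional as F

def masked_lm_loss(logits, labels, loss_mask):
    """logits: (T, vocab), labels: (T,), loss_mask: (T,) with 1 = compute loss, 0 = ignore."""
    labels = labels.clone()
    labels[loss_mask == 0] = -100          # ignore_index skips non-assistant tokens
    return F.cross_entropy(logits, labels, ignore_index=-100)

# Example: only tokens spoken by the assistant contribute to the loss.
logits = torch.randn(6, 32000)
labels = torch.tensor([11, 12, 13, 14, 15, 16])
loss_mask = torch.tensor([0, 0, 0, 1, 1, 1])   # user turn masked out, assistant turn kept
print(masked_lm_loss(logits, labels, loss_mask))
```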


**OpenAI's text-davinci-003 engine was used for dataset generation; the Alpaca dataset was generated by that engine.** It is a single-turn dataset. 



[arxiv_dataset](https://www.kaggle.com/datasets/Cornell-University/arxiv)

[MOSS: an open conversational llm](https://link.springer.com/article/10.1007/s11633-024-1502-8)

A 16B-parameter model which can perform a variety of instructions in *MULTI-TURN INTERACTIONS WITH HUMANS*.

**datasets** are also provided for *sft*

[moss-003-sft](https://github.com/OpenLMLab/MOSS/tree/main/SFT_data)

    a multi-turn dataset, 1.1 million dialogue samples *(full open-source)*


### Preference-aware training

a method to align the model explicitly with human preferences during the training process (RLHF).


#### Spinning up training job with Xtuner

1. SLURM: Simple Linux Utility for Resource Management, 

a fault-tolerant and highly-scalable cluster management and job scheduling system. 

manages resources (CPU, GPU, RAM, nodes in a Linux cluster) 

reference command: **srun**

2. Kubernetes

a container orchestration platform, used in xtuner for orchestrating containerized training jobs across multiple nodes. 
 
######################################################################

**accumulative_counts = 4** *(we do 4 forward/backward passes before stepping the optimizer, so it effectively behaves like a 4× larger batch size; see the sketch below.)*
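
A minimal sketch of what that setting effectively does (dummy model, data and optimizer for illustration; this is not XTuner's internal training loop):

```python
import torch
import torch.nn as nn

# Dummy setup so the loop is runnable; in practice these come from the config.
model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.MSELoss()
data = [(torch.randn(8, 10), torch.randn(8, 1)) for _ in range(8)]

accumulative_counts = 4
optimizer.zero_grad()
for step, (inputs, targets) in enumerate(data):
    loss = criterion(model(inputs), targets) / accumulative_counts  # scale each micro-batch
    loss.backward()                                                 # gradients accumulate in .grad
    if (step + 1) % accumulative_counts == 0:
        optimizer.step()                                            # one optimizer step per 4 passes
        optimizer.zero_grad()
```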


##### norm-based gradient clipping

rescales the gradient vector whenever its norm ||g|| exceeds the threshold (here 1), capping the norm at 1. 

if ||g|| <= 1, the gradient is left unchanged. 
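
In PyTorch this is typically done with the built-in utility right before the optimizer step (`max_norm=1` matches the description above):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)
loss = model(torch.randn(4, 10)).sum()
loss.backward()

# Rescales all gradients in-place only if their total norm exceeds 1.0.
total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
print(total_norm)  # the norm measured *before* clipping
```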



datasets and finetuning

 [[llms]]

# Domain Specific Dataset Curation for Effective Finetuning

[[axolotl]]

# LLM Finetuning Datasets & Methodologies: Comprehensive Technical Guide

(referenced from claude.ai)

## Models Consideration

Qwen2.5-7B, llama2-7b and llama3.2-3b

## Quantization

qlora 4-bit quantization for 7b models and standard lora for 3b models. 


## Datasets

### Instruction tuning datasets

#### Multi Domain datasets

[Alpaca-52k ](https://huggingface.co/datasets/tatsu-lab/alpaca)

alpaca is the format for **single-turn conversation type datasets.**

It can be used for general reasoning and creative writing, with roughly 4-6 hours of training for a 3B model. 

[Ultrachat-200k](https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k)

A multi-turn conversation dataset with *complex reasoning chains* and *natural dialogue*.

[anon8231489123/ShareGPT_Vicuna_unfiltered](https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/blob/main/ShareGPT_V3_unfiltered_cleaned_split_no_imsorry.json) ([text](https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/blob/main/ShareGPT_V3_unfiltered_cleaned_split.json))

90k ChatGPT conversations; for human-like behavior finetuning. 

#### Enhance instruction datasets

[microsoft/orca-math-word-problems-200k](http://huggingface.co/datasets/microsoft/orca-math-word-problems-200k)

for mathematical reasoning with step-by-step solutions. 

up to grade 12. 

format: problem statement → reasoning → final answer. 

### Maths Dataset

1. GSM8K contains 8,500 grade-school maths problems, ranging from basic arithmetic through pre-algebra. 


2. [hendrycks/competition_math](taken down)

 12,500 competition-level problems (algebra, theory, calculus and number theory)

 ### Conversational Assistant Datasets

1. PersonaChat [bavard/personachat_truecased](https://huggingface.co/datasets/bavard/personachat_truecased)
        Contains 160k dialogues with personality traits; good for a **human-like engagement patterns training objective.**
2. Empathetic dialogues [empathetic_dialogues](discarded)

    25k conversations

    for emotional understanding and assistant like behavior development. 

3. BlenderBot3-Dialog [facebook/blended_skill_talk](https://www.kaggle.com/datasets/thedevastator/multi-modal-conversation-data)
    
    76k conversations
    knowledge, empathy, personality and consistency. 


### Specific Assistant Behavior Datasets

1. Assistant Conversations by Anthropic [Anthropic/hh-rlhf](https://huggingface.co/datasets/Anthropic/hh-rlhf)

    161k human-assistant dialogues. 
    helpful, harmless and honest responses, in an RLHF-ready format. 

2. OpenAssistant Conversations [OpenAssistant/oasst1](https://huggingface.co/datasets/OpenAssistant/oasst1)

    161k human-generated conversations. 
    includes multiple languages (might contain Nepali as well)


### Nepali TTS Development

    OpenSLR Nepali [openslr/43](https://openslr.org/43/)
   


## Can we finetune a vision language model on Maths Dataset/pictures?

### InternLM-Math [internlm/internlm2-math-plus-7b](https://huggingface.co/internlm/internlm2-math-plus-7b)

7b and 20b models which are pre-trained with ~100B math-related tokens and *SFT* with
~2M bilingual math supervised data. 

{{minhash and exact number match were used to decontaminate possible test-set leakage.}}

InternLM-Math is a solver, prover, verifier and augmentor. 

It was evaluated for formal math reasoning with this evaluation set [MiniF2F-test](https://github.com/openai/miniF2F)

the dataset contains maths problems (theorem proving) from olympiads as well as high-school and undergraduate maths classes. 


In informal maths reasoning, MATH, MATH-Python and GSM8K are used as evaluation sets. 

InternLM-Math-7b performance: **34.6, 50.9, 78.1**

the 7b model outperforms the deepseek-7b-rl model 


InternLM-Math will be combined with Lean 3 (for theorem proving and maths problem solving). 

[Lean 3](https://lean-lang.org/doc/reference/latest/Elaboration-and-Compilation/) is an interactive theorem prover and functional programming language based on dependent type theory, which means types can depend on terms, enabling expressive formalization of mathematics and programs

#### How does test-set leakage happen?

Future data being used for training in time series, 

and improper cross-validation with repeated use of the test fold during hyperparameter tuning. 

Information from the test fold influences the training process of the model, causing data leakage. 



## Mixture of Experts
    a machine learning architecture where the LLM is divided into multiple networks called experts and a **gating network** dynamically selects and routes each input to one or a few relevant experts. 

    Models like Mixtral-8x7b, Youtube Recommendation system, Z-code, Switch Transformer are based on MOE. 


    Different modes or methods of MoE are top-k routing, 
    top-1 routing (only one expert per input token), 
    expert choice routing (the expert decides which inputs it can handle best), 
    sparse activation/routing (only a subset of experts is activated); a toy routing sketch follows below.

    **Capacity factor** is the hyperparameter that influences how many tokens each expert can handle during training and inference. 
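
A toy top-k routing sketch (a generic illustration of the gating idea described above, not Mixtral's or any production router's code; capacity limits and load-balancing losses are omitted):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    def __init__(self, d_model=64, d_ff=128, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(d_model, n_experts)           # gating network scores the experts
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                                   # x: (n_tokens, d_model)
        scores = self.gate(x)                               # (n_tokens, n_experts)
        weights, idx = scores.topk(self.k, dim=-1)          # keep only the top-k experts per token
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e                    # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out

print(TopKMoE()(torch.randn(5, 64)).shape)  # torch.Size([5, 64])
```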


## Xtuner by InternLM

a finetuning toolkit for large language models; it can finetune 7B models within 8 GB of VRAM. 

Supported models are **internlm, mixtral, llama and qwen**. 

QLORA can be used for finetuning InternLM with publicly available datasets. 

For example
```xtuner train internlm_7b_qlora_oasst_e3```

**Python3.10 support**
```conda create -n xtuner_env python=3.10```
``` pip install -U xtuner```

*Deepspeed* module not found. It can be installed with ```pip install deepspeed```

encountered another issue

{{ raise MissingCUDAException("CUDA_HOME does not exist, unable to compile CUDA op(s)")
      op_builder.builder.MissingCUDAException: CUDA_HOME does not exist, unable to compile CUDA op(s)
      [end of output]}}


The above error was encountered due to the lack of a CUDA compiler. PyTorch installs the CUDA runtime, but *nvcc --version* checks whether the CUDA compiler is installed or not. 

### [CUDA Compiler Installation](https://docs.nvidia.com/cuda/cuda-installation-guide-linux/#ubuntu-installation)

1. ```wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/cuda-keyring_1.1-1_all.deb```

2. ```sudo dpkg -i cuda-keyring_1.1-1_all.deb```

3. ``` sudo apt install cuda-toolkit -y```
4. ``` sudo apt install nvidia-gds```
5. ``` reboot ```


Or

1. ```sudo apt install nvidia-cuda-toolkit``` in Ubuntu 24.04.2 LTS. 





### What are the Nvidia GDS packages?

GDS means GPUDirect Storage, which enables bypassing the CPU in the data path.
It allows **direct memory access (DMA) transfers between GPU and storage devices**




axolotl


## Axolotl (alternative to huggingface/transformers)

It is a tool for full finetuning, parameter-efficient finetuning and alignment techniques, with support for multiple model architectures like llama, mistral, phi, qwen, mixtral-moe, gemma, gpt-j, pythia, etc. 

Support includes fp16/fp32, LoRA, QLoRA, GPTQ and flash attention, plus pre-training, finetuning and preference-based post-training (DPO, ORPO and PRMs). 

Its installation requires **packaging==23.2, setuptools==75.8.0, wheel, ninja along with flash-attn and deepspeed.**

YAML file based finetuning technique. 

### Dataset Format required for pre-training. 

```python
{"text": "first row"}
{"text": "second row"}
...
```
in **.jsonl format.** 

```python
from datasets import load_dataset
```
`load_dataset` loads various dataset formats including *jsonl, csv, arrow, parquet, sql and WebDataset*

### Dataset Format required for SFT

SFT means training model to respond to an instruction or chat input. (chatbots like GPT and Gemini)

Formats supported are **Conversation Dataset and Instruction Dataset** along with *tokenized dataset*

#### Conversation dataset 

It usually contains **role and content** keys. 

The formatting is controlled by a chat_template, which is a Jinja2 template that formats a list of messages into a prompt. 


#### <|im_start|> and <|im_end|>

They are **delimiters**: prompt markers that separate the different speakers, allowing the model to identify which portion belongs to whom.

##### Sharegpt format

{"conversations": [{"from": "...", "value": "..."}]}

##### OpenAI format

{"messages": [{"role": "...", "content": "..."}]}

possible roles are *user, system, assistant*

{{**What do you want to mask?** }}

we can bring our own custom template via: 

**chat_template_jinja: # your template**

#### Instruction Dataset

used for training instruction following models. 

common format

```{"instruction": "...", "input": "...", "output": "..."}```

This is called the Alpaca instruction dataset format. 

Custom instruction prompts are also supported. 

## RLHF

RLHF means a language model optimized through human feedback, which means human preference data (e.g. rankings of candidate responses) is used to guide further optimization of the model.


### Methods for RLHF

#### DPO 

Direct Preference Optimization


#### IPO

Identity preference optimization

#### KTO

Kahneman-Tversky Optimization

#### GRPO

Group relative policy optimization

#### ORPO

Odds ratio preference optimization




komputer vision


# COMPUTER VISION

```python
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
```

The first tuple is the mean for the RGB color channels and the second tuple is the standard deviation. Input pixel values lie in (0, 1) and the normalized outputs lie in (-1, 1). This centers the pixel values around 0 with a spread of roughly 1 after normalization, which improves training stability and model learning convergence. 

**Formula**

```python
normalized_value = (original_value - mean) / std
```

**Before Normalization**

```python
[[[0.0, 0.5, 1.0],
  [0.2, 0.7, 0.9]],
 [[0.1, 0.6, 0.8],
  [0.3, 0.4, 0.5]]]
```

**After Normalization**

```python
[[[-1.0, 0.0, 1.0],
  [-0.6, 0.4, 0.8]],
 [[-0.8, 0.2, 0.6],
  [-0.4, -0.2, 0.0]]]
```

---

---

---

# Image Captioning

***https://readmedium.com/image-caption-model-from-scratch-vit-gpt-94afaae30fb7***

1. patch_size: This variable determines the patch size used in the ViT component. The patch size is the size of the small image patches that are used as input to the ViT model. Here, the patch size is 16x16 pixels.
2. d_model_vit: This variable determines the dimensionality of the output embedding from the ViT component. For flattened patches it is the patch area times the number of color channels (e.g. 16 × 16 × 3 = 768).
3. num_patches: This variable determines the number of patches in the input image. It is calculated by dividing the image size by the patch size along each dimension (e.g. (224 / 16)² = 196 patches for a 224×224 image).
4. softmax_denom_eps: This variable determines a small value added to the denominator of the softmax function to prevent division by zero.

## Patch Embeddings

Patch Embedding is a technique used in computer vision to convert an image into a format that can be fed into a neural network.

Imagine an image is made up of small, non-overlapping squares called patches. Each patch is a small portion of the image, and it can be thought of as a tiny, independent image.

The Patch Embedding process involves:

- **Dividing the image into patches:** The image is divided into a grid of patches, where each patch is a small square portion of the image.
- **Representing each patch as a vector**: Each patch is represented as a vector, which is a list of numbers that describe the patch's color and texture.
- **Flattening the patch vectors:** The patch vectors are flattened into a long, one-dimensional list of numbers.

In Vision Transformers (ViT), Patch Embedding is used to convert the input image into a sequence of patch embeddings, which are then fed into the transformer network. The transformer network processes the patch embeddings in parallel, allowing it to learn global features and relationships in the image.

**PatchEmbeddings class**

The `PatchEmbeddings` class is responsible for creating the patches of an image using a convolutional layer. Here's a step-by-step explanation:

1. **Convolutional layer**: The class uses a convolutional layer (`nn.Conv2d`) to create the patches of the image. The convolutional layer takes the input image and applies a filter to it, resulting in a feature map.
2. **Flatten**: The feature map is then flattened using the `nn.Flatten` layer, which converts the 3D feature map into a 2D tensor.
3. **Permute**: The flattened tensor is then permuted using the `permute` method, which rearranges the dimensions of the tensor. The resulting tensor has shape `(B, N, D_MODEL)`, where `B` is the batch size, `N` is the number of patches, and `D_MODEL` is the dimensionality of the patch embeddings.

```python
self.conv_patch_layer = nn.Conv2d(in_channels=config['channels'],
                                  out_channels=config['d_model'],
                                  kernel_size=config['patch_size'],
                                  stride=config['patch_size'])
```
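
A runnable sketch of the full `PatchEmbeddings` forward pass described in the steps above (config values are illustrative; 16×16 RGB patches give d_model = 768):

```python
import torch
import torch.nn as nn

class PatchEmbeddings(nn.Module):
    def __init__(self, channels=3, patch_size=16, d_model=768):
        super().__init__()
        # Each stride-sized window of the conv produces one patch embedding.
        self.conv_patch_layer = nn.Conv2d(channels, d_model,
                                          kernel_size=patch_size, stride=patch_size)
        self.flatten = nn.Flatten(start_dim=2)   # (B, D_MODEL, H/P, W/P) -> (B, D_MODEL, N)

    def forward(self, x):                        # x: (B, C, H, W)
        x = self.conv_patch_layer(x)
        x = self.flatten(x)
        return x.permute(0, 2, 1)                # (B, N, D_MODEL)

print(PatchEmbeddings()(torch.randn(1, 3, 224, 224)).shape)  # torch.Size([1, 196, 768])
```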

**ViTEmbedding class**

The `ViTEmbedding` class creates the input embeddings for the ViT model by combining both patch and positional embeddings. Here's how it works:

1. **Class token embedding**: The class token embedding is a learnable parameter that represents the class token. The class token is a special token that is used to represent the entire image.
2. **Positional embedding**: The positional embedding is a learnable parameter that represents the position of each patch in the image. The positional embedding is used to capture the spatial relationships between patches.
3. **Patch embeddings layer**: The `PatchEmbeddings` class is used to create the patch embeddings from the input image.
4. **Dropout**: The patch embeddings are then passed through a dropout layer, which randomly sets a fraction of the output elements to zero during training.
5. **Add positional embedding**: The patch embeddings with class token are then added to the positional embedding, resulting in the final input embeddings for the ViT model (a minimal sketch follows below).
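
A minimal `ViTEmbedding` sketch following those steps, building on the `PatchEmbeddings` sketch above (hyperparameters are illustrative, not the referenced article's exact config):

```python
import torch
import torch.nn as nn

class ViTEmbedding(nn.Module):
    def __init__(self, channels=3, patch_size=16, d_model=768, num_patches=196, dropout=0.1):
        super().__init__()
        self.patch_embed = PatchEmbeddings(channels, patch_size, d_model)   # from the sketch above
        self.cls_token = nn.Parameter(torch.zeros(1, 1, d_model))             # learnable class token
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, d_model))  # +1 for class token
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):                                    # x: (B, C, H, W)
        patches = self.patch_embed(x)                        # (B, N, D)
        cls = self.cls_token.expand(x.size(0), -1, -1)       # one class token per image
        tokens = torch.cat([cls, patches], dim=1)            # prepend the class token
        tokens = self.dropout(tokens)
        return tokens + self.pos_embed                       # add positional information

print(ViTEmbedding()(torch.randn(1, 3, 224, 224)).shape)     # torch.Size([1, 197, 768])
```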

## Creating Patch Embeddings using Convolutional Layers

Patch embeddings are created using a two-dimensional convolutional layer. This might seem surprising, as many people think that patch embeddings are created by simply dividing an image into patches and flattening them.

### Why Convolutional Layers?

However, using convolutional layers to create patch embeddings has several advantages:

1. Computational Efficiency: Convolutional layers are highly optimized and come pre-built with deep learning libraries like PyTorch and TensorFlow. This means that they can be used efficiently and effectively, without the need to implement custom patch embedding code.
2. Capturing Different Information: Convolutional layers can capture different types of information from the image, such as edges, textures, and patterns. This is because they are designed to extract features from images, which is exactly what we need to create patch embeddings.

**How it Works**

Here's a step-by-step explanation of how patch embeddings are created using convolutional layers:

1. **Convolutional Layer**: A 2D convolutional layer is applied to the input image. This layer extracts features from the image, such as edges and textures.
2. **Feature Maps**: The convolutional layer produces a feature map, which is a 2D array of values that represent the features extracted from the image.
3. **Flattening**: The feature map is then flattened into a 1D array, which represents the patch embeddings.
4. **Classification Token**: The classification token is appended to the front of the patch embeddings, which represents the entire image.

## Normalization

In reality, the mean of 0 and a standard deviation of 1 are mathematical concepts that are used to normalize the input data. Here's what it means in simple terms:

**Mean of 0**

Imagine you have a dataset of exam scores, and the average score is 80. If you subtract 80 from each score, you get a new set of scores that have a mean of 0. This means that the scores are centered around 0, and there is no longer an overall bias or shift in the data.

In the context of neural networks, normalizing the input data to have a mean of 0 helps to:

- Reduce the effect of outliers or extreme values
- Improve the stability of the model
- Enhance the accuracy of the model

**Standard Deviation of 1**

The standard deviation is a measure of how spread out the data is. If the standard deviation is 1, it means that the data points are relatively close to the mean, and there is not a lot of variation in the data.

In the context of neural networks, normalizing the input data to have a standard deviation of 1 helps to:

- Improve the convergence of the model during training
- Enhance the interpretability of the model
- Reduce the risk of overfitting

rest api

 #rnd #web

[[proposal_rudra#Software Tools]]

# REST [REPRESENTATIONAL STATE TRANSFER] API

REST stands for representational state transfer and is a software architecture style that defines a pattern for client and server communications over a network. Performance, scalability, simplicity and reliability are some of the features of the REST architecture which ease the development of websites and software.

### Constraints

1. Stateless server that doesn’t maintain the state between requests from the client. In simpler words, it doesn’t remember anything about the past requests which keep the task of request and response simple and robust.
2. Independent server and client by decoupling each other allowing the changes and updates integration seamless, making maintenance easier.
3. The data retrieved from the server should be cacheable either by client or the server which reduces the load on server along with improved performance.
4. REST architecture may contain the intermediary layers like helpers between the client and the main server which allow the security, traffic and adding extra features. The client may access the resources on the server indirectly through other layers such as proxy or load balancer. 
5. The server will provide a uniform interface for accessing resources without defining their representation. There is a standard way to make requests and responses no matter what kind of data it is. Those standards include
    1. Resources and URLs: Each individual piece of data has its own unique address.
    2. HTTP Methods: These include standard methods like GET (get/read data), POST (create new data), PUT/PATCH (update existing data) and DELETE (remove data). 
    3. Representations: Data is sent in standard formats like JSON or XML.

These constraints aren't a strict specification, but rather guidelines and best practices for building a web system. The more we adhere to these principles, the more benefits we get, but it's not mandatory to follow every rule. 

## REST APIs and Web Services

REST web service is any web service that adheres to REST architecture constraints. These web services expose their data to the outside world through an API which can be accessed using the REST API with public web URLs. 

Github’s REST API URL: `https://api.github.com/users/<username>`

The data is accessed from a REST API by sending HTTP requests to specific URLs. The API listens for HTTP methods to know which operations to perform on the web service's resources; resources can be accessed and manipulated with these HTTP requests.
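
For example, a minimal call to the GitHub endpoint mentioned above using the `requests` library ("octocat" is just a placeholder username):

```python
import requests

response = requests.get("https://api.github.com/users/octocat")
print(response.status_code)          # 2xx on success
user = response.json()               # JSON representation of the resource
print(user["login"], user.get("public_repos"))
```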

### Status Code

1. 2xx: Successful Operation
2. 3xx: Redirection
3. 4xx: Client Error
4. 5xx: Server Error

### API Endpoints[Door to Web Service]

A REST API exposes a set of public URLs that client applications use to access the resources of a web service. These URLs are called endpoints. Each endpoint is designed for a specific purpose; the endpoint URL selects the web service resource that the HTTP method will interact with.

reinforce


# REINFORCEMENT LEARNING

## Theoretical Foundations of Reinforcement Learning

### Markov Decision Process (MDP)

A Markov process is the simplest child of the Markov family, and is also known as a Markov chain. Imagine a system that you can only observe: what you observe are called states, and the system can switch between the states. The set of all possible states is known as the state space. For a Markov process, the number of possible states needs to be finite. Also, the system cannot be influenced by you, but it can be observed while it changes. 

For example, looking at the simplest model of the weather in some city, we can observe the current day as sunny or rainy, which is our state space. A sequence of observations over time forms a chain of states, such as [sunny, sunny, rainy, sunny, …], and this is called history.

To call such a system a Markov process, it needs to fulfil the Markov property, which means that the future system dynamics from any state have to depend on this state only. The main point of the Markov property is to make every observable state self-contained enough to describe the future of the system. In this case, **only one state is required to model the future dynamics of the system and not the whole history** or, say, the last N states. 

As the system model complies with the Markov property, you can capture the transition probabilities with a transition matrix, which is a square matrix of size N x N, where N is the number of states in our model. Every cell in row i and column j of the matrix contains the probability of the system transitioning from state i to state j. The transition matrix defines the system dynamics. Additionally, a Markov process implies stationarity, meaning there is no external factor influencing the system dynamics over time. 

A state transition graph: circles represent the states, arrows represent the possible transitions, and self-loops represent transitions back to the same state. If the model is in the Coffee state, then its next state depends only on the Coffee state, not on any state before it. 

![image.png](https://prod-files-secure.s3.us-west-2.amazonaws.com/df89fc0b-ce89-458c-88fa-b18c8cdb3e6b/03193b46-bb12-498f-911e-b8878f5c32e5/image.png)

In Markov reward processes, the Markov process is extended a bit by adding a reward value to each transition from state to state. The reward is another square matrix, similar to the transition matrix, with the reward given for transitioning from state i to state j residing in row i and column j. 

![image.png](https://prod-files-secure.s3.us-west-2.amazonaws.com/df89fc0b-ce89-458c-88fa-b18c8cdb3e6b/36c08b38-3e68-463c-b123-bc5b6f007385/image.png)

The return (Gt) is the sum of rewards the agent collects in the future from time t onward. The discount factor (gamma) is applied at every step starting from the point where we calculate the return Gt. The farther a reward is in time, the higher the power of gamma, which means a bigger discount. 
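
Written out in the usual notation (a reconstruction of the formulas shown in the images, not new material):

```latex
G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots
    = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1},
\qquad
V(s) = \mathbb{E}\big[\, G_t \mid S_t = s \,\big]
```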

In RL, the agent uses these rewards to calculate:

1. **Immediate reward** (R) for each step.
2. **Return** (Gt) by summing up discounted future rewards.
3. **Value function** (V(s)) to average the returns for a state.

The State Value [V(s)] is the average return obtained from the Markov reward processes. The equation of the state value is given as: 

![image.png](https://prod-files-secure.s3.us-west-2.amazonaws.com/df89fc0b-ce89-458c-88fa-b18c8cdb3e6b/ad74b21c-9945-4fbe-b64f-d5260391d76c/image.png)

This equation simply represents, **“ If I start at state s, what is the average total reward I can expect over time?”**

V(s) quantifies how **good** a state s is in terms of long-term rewards. In RL, this concept is extended to find the optimal policies that maximize V(s)

![image.png](https://prod-files-secure.s3.us-west-2.amazonaws.com/df89fc0b-ce89-458c-88fa-b18c8cdb3e6b/ad5a668d-8551-456f-bfbf-7deae856c754/image.png)

In the absence of terminal states (sink states) in infinite-horizon problems, with Gamma = 1 the agent becomes completely far-sighted and cares about all future rewards equally, no matter how far they are in the future, which leads the agent to sum all future rewards infinitely.

Gamma = 1 is ideal for finite-horizon problems (tic-tac-toe) but impractical in infinite-horizon problems without a stopping condition. A larger gamma (e.g., 0.9 or 0.99) means the agent considers the **long-term future more**, but rewards farther in time still diminish in value. Gamma < 1 avoids the problem of infinite sums, which is common in infinite-horizon problems. 

---

### Policy

A policy is some set of rules that controls the agent's behavior. The main objective of RL is to gather the maximum possible cumulative return. Mathematically, a policy can be represented by the equation below, 

![image.png](https://prod-files-secure.s3.us-west-2.amazonaws.com/df89fc0b-ce89-458c-88fa-b18c8cdb3e6b/1306c345-53e4-4133-9b89-187b8c9f4f5e/image.png)

where ‘|’ denotes the conditional probability, P denotes the probability, At denotes the action ‘a’ chosen by the agent at time step t, St denotes the state ‘s’ the agent currently in at time t. 

The equation basically asks, **“If I am in state “s”, what action “a” should I take?”**

There are two different types of policy RL. 

1. Deterministic Policy where the action ‘a’ is chosen with certainty.
2. Stochastic Policy where the actions are chosen randomly based on the probability distribution. 

---

---

## Dynamic Programming and Bellman Equation

Dynamic programming consists of two different parts, dynamic and programming: the term dynamic refers to problems with a temporal or sequential aspect, and programming means optimizing the policy mathematically. 

Any problem to be solved with dynamic programming requires the following properties: 

1. Optimal substructure
2. Overlapping sub problems

These properties are satisfied by an MDP (Markov Decision Process), which allows the use of the Bellman optimality equation to create a recursive decomposition of the problem. 

---

Bellman Equation of Optimality applies for two different cases: 

1. Deterministic Case
2. Stochastic Case

### Deterministic Case for Bellman Equation of Optimality

Deterministic cases are problems where the actions have a 100% guaranteed outcome and are not influenced by randomness. The equation for the deterministic case is given by: 

![image.png](https://prod-files-secure.s3.us-west-2.amazonaws.com/df89fc0b-ce89-458c-88fa-b18c8cdb3e6b/d8de4770-54a2-4451-828b-f7c7791dabdd/image.png)

Here:

- ( V^*(s) ) is the optimal value function for state ( s ).
- ( R(s, a) ) is the reward received after taking action ( a ) in state ( s ).
- ( \gamma ) is the discount factor, which determines the importance of future rewards.
- ( s' ) is the next state resulting from taking action ( a ) in state ( s ).

### Stochastic Case for Bellman Equation of Optimality

The outcomes of the actions are governed by the probabilities in the stochastic cases of Bellman Optimality equation. This means that taking an action in a given state can lead to multiple possible next states, each with certain probability. 

The equation for the stochastic case is given as: 

![image.png](https://prod-files-secure.s3.us-west-2.amazonaws.com/df89fc0b-ce89-458c-88fa-b18c8cdb3e6b/e8db5520-6b2f-498b-9a21-83f9b9d92829/image.png)

Here:

- ( V^*(s) ) is the optimal value function for state ( s ).
- ( R(s, a) ) is the expected reward received after taking action ( a ) in state ( s ).
- ( \gamma ) is the discount factor, which determines the importance of future rewards.
- ( P(s' | s, a) ) is the probability of transitioning to state ( s' ) from state ( s ) after taking action (a).
- ( s' ) represents the possible next states.


### Q(s,a) & V(s)

Q(s,a) is known as the Q-value function whereas the V(s) is known as value function. They have the given mathematical connection with each other: 

![image.png](https://prod-files-secure.s3.us-west-2.amazonaws.com/df89fc0b-ce89-458c-88fa-b18c8cdb3e6b/ee9b6fec-56ac-43c8-9e68-7fecad2cd460/image.png)

The key difference between Q(s,a) and V(s) lies in what they evaluate:

- **V(s) - State Value Function:** Represents the expected return starting from state s and following the current policy. It tells us how good it is to be in a particular state.
- **Q(s,a) - State-Action Value Function:** Represents the expected return starting from state s, taking action a, and then following the current policy. It tells us how good it is to take a specific action in a particular state.

The relationship between Q(s,a) and V(s) can be expressed as:

V(s) = max Q(s,a) for all actions **a**

This means the value of a state is equal to the maximum Q-value possible from that state across all possible actions.

### Bellman Equation for General Case

According to Bellman's optimality proof, at every state the agent ends up in, it needs to select the action with the maximum expected reward, which is a sum of the immediate reward and the one-step discounted long-term reward.

![image.png](https://prod-files-secure.s3.us-west-2.amazonaws.com/df89fc0b-ce89-458c-88fa-b18c8cdb3e6b/3553e82c-6346-485a-997a-c526cfc31de4/image.png)

Representation of Q(s,a) recursively: 

![image.png](https://prod-files-secure.s3.us-west-2.amazonaws.com/df89fc0b-ce89-458c-88fa-b18c8cdb3e6b/357aad7a-4bf7-4c32-a859-02d1fcc8045e/image.png)

## Value Iteration Method
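
A toy value-iteration sketch that just applies the Bellman optimality update repeatedly over a small tabular MDP (the transition format `P[s][a] = [(prob, next_state, reward), ...]` is my own assumption for illustration, not code from the source):

```python
import numpy as np

def value_iteration(P, gamma=0.9, tol=1e-6):
    """P[s][a] is a list of (probability, next_state, reward) triples."""
    V = np.zeros(len(P))
    while True:
        V_new = np.empty_like(V)
        for s in range(len(P)):
            # V*(s) = max_a sum_{s'} P(s'|s,a) * (R + gamma * V*(s'))
            V_new[s] = max(
                sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                for a in range(len(P[s]))
            )
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

# Tiny 2-state example: action 0 stays (reward 0), action 1 moves to the other state (reward 1).
P = [
    [[(1.0, 0, 0.0)], [(1.0, 1, 1.0)]],
    [[(1.0, 1, 0.0)], [(1.0, 0, 1.0)]],
]
print(value_iteration(P))   # both states converge to 1 / (1 - gamma) = 10
```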

deepfake


[[computer-vision]]

# DEEPFAKE DETECTION

Deepfake is a technology dedicated to creating highly realistic facial images and videos under specific conditions, which has significant application potential in fields such as entertainment, movie production and digital human creation. In addition to deepfake generation, the corresponding deepfake detection technology continuously evolves to regulate the potential misuse of deepfakes, such as privacy invasion and phishing attacks.

As deepfakes become realistic and widespread across social media, it becomes harder to identify the authenticity of all kinds of information sources. The manipulation of content, such as photography or audio, also raises ethical issues around consent. 

## How are Deepfakes Made?

There are two main ways to create deepfake images, videos and audio. They are listed below: 

1. Generative Adversarial Networks
2. Diffusion Models

1. GAN (Generative Adversarial Networks): A GAN is composed of two models that play a game against each other. The first model, the generator, either selects a real image or video or generates a fake one. The second model, the discriminator, decides whether the image or video is real or fake. The generator wins the game if the discriminator can't tell that the generated content is fake. Playing this game over and over trains the generator to produce realistic content, whilst the discriminator's ability to guess correctly whether the content is real also improves. 
2. Diffusion Models: A diffusion model is trained to restore an image or video to its original state after visual 'noise' has been added. Some diffusion models are trained with guidance such as text prompts encouraging them to generate particular images, whilst others try to decide what the likeliest output will be on their own. The resulting models can 'inpaint' missing patches in an image, filling the gaps with something plausible. Models such as Stable Diffusion and DALL-E 2 are both examples of diffusion models that take text prompts as part of their input. Diffusion models are newer than GANs and likely to become more prominent in deepfake generation, as they are believed to be easier to train than GANs. 

### Deepfake Detection

Deepfakes are becoming increasingly hard to detect due to advances in the generative AI methods used to create them. There are several ways that images, videos and audio can be classified as deepfake, based on the spatial and visual inconsistencies contained in the deepfake content. Video and audio deepfakes can be given away by time-based inconsistencies, such as a mismatch between speech and mouth movements. Deepfake generation methods such as GANs and diffusion models can also leave detectable 'fingerprints' within the pixels of images or videos. 

Deep fake detection opensource projects

1. **Faceswap.dev**
2. https://github.com/shaoanlu/faceswap-GAN; Face tracking/alignment using MTCNN and kalman filter in video conversion

![image.png](https://prod-files-secure.s3.us-west-2.amazonaws.com/df89fc0b-ce89-458c-88fa-b18c8cdb3e6b/aee735a5-eac4-4d5e-b899-fc378944231e/image.png)

1. https://www.youtube.com/watch?v=x2g48Q2I2ZQ
2. https://github.com/Billy1900/Awesome-DeepFake-Learning?tab=readme-ov-file#3-curated-lists

**spatio-temporal action recognition**

https://www.creativebloq.com/features/deepfake-examples

https://github.com/jacobgil/pytorch-grad-cam

https://typeset.io/

# Classification Model Neural Architecture with PyTorch.

Recent evidence shows that network depth is critical to performance, and the main results on the challenging ImageNet dataset all employ very deep models, ranging from 16 to 30 layers deep (https://arxiv.org/pdf/2208.08231).

However, the first obstacle to this is the infamous gradient vanishing and gradient exploding problem, which hinders the convergence of the network. Later, researchers found that this problem can be alleviated by normalizing the input data and by batch normalization, so it is generally not a problem for a dozen-layer network.

Simply stacking layers to increase the depth of the network does not improve its performance. He et al. call this phenomenon the degradation problem, which shows that not all systems are easy to optimize (He et al.).

## Inception-ResNetv1

Inception-ResNet v1 is a hybrid network inspired by Inception and by the performance of ResNet. There are two different versions of Inception-ResNet, V1 and V2, with V2 having a comparatively higher computational cost. 

The Inception-Resnet incorporates the use of 3 different stem modules and reduction blocks. The output of Inception module is added to the input. 

![Schematic for Inception-ResNet v1 AND v2  Network](https://prod-files-secure.s3.us-west-2.amazonaws.com/df89fc0b-ce89-458c-88fa-b18c8cdb3e6b/2ebf2ce2-dad3-44ff-8636-6a14f3106f02/image.png)

Schematic for Inception-ResNet v1 AND v2  Network

The output of the Inception module and the input from the previous layer must have the same dimension, without any alteration. Factorization of the convolution filters becomes much more important to match these dimensions. However, further studies show that the network dies when the number of convolution filters exceeds 1000 (https://iq.opengenus.org/inception-resnet-v1/). This problem was later solved by introducing a concept called activation scaling. 

![    Inception-ResNetv1](https://prod-files-secure.s3.us-west-2.amazonaws.com/df89fc0b-ce89-458c-88fa-b18c8cdb3e6b/b35538f6-71c4-4e6d-b27a-5487e8f42699/image.png)

    Inception-ResNetv1

![Inception-ResNetv2](https://prod-files-secure.s3.us-west-2.amazonaws.com/df89fc0b-ce89-458c-88fa-b18c8cdb3e6b/477b6378-9d9e-410a-9571-75f7653e5e42/image.png)

Inception-ResNetv2

# LSTM (Long Short Term Memory)

## Video Vision Transformer for Video Classification

The VivitForVideoClassification class in the Hugging Face Transformers library provides a PyTorch implementation of the Video Vision Transformer specifically designed for video classification. This class requires a VivitConfig object, which contains all the necessary parameters for the model's architecture and operation. When initializing the model with the configuration, it only loads the structural details of the model instead of the model's weights. For loading the model's weights, the **from_pretrained()** method needs to be used. 

The ViViT model is a powerful architecture for video classification. It processes video frames as sequences of patches. Classification head is attached on top of  the model. It also supports the fine-tuning process by enabling **interpolate_pos_encoding,** which adjusts position embeddings for new resolutions. This allows leveraging pre-trained weights effectively on different datasets like **Kinetics-400.** 

The forward function contains the parameters such as **pixel_values, head_mask, output_attentions, output_hidden_states, interpolate_pos_encoding, return_dict, labels.** 

```python
import numpy as np
import torch
from transformers import VivitForVideoClassification, VivitImageProcessor

# Kinetics-400 checkpoint (assumed available on the Hugging Face Hub).
# VivitForVideoClassification adds the classification head and loss on top of the base VivitModel.
ckpt = "google/vivit-b-16x2-kinetics400"
model = VivitForVideoClassification.from_pretrained(ckpt)
processor = VivitImageProcessor.from_pretrained(ckpt)

# Example video: 32 RGB frames (the model's default clip length), here random data.
video = list(np.random.randint(0, 256, (32, 224, 224, 3), dtype=np.uint8))
inputs = processor(video, return_tensors="pt")

head_mask = None                  # no masking of attention heads
labels = torch.tensor([0])        # example ground-truth class id
output_attentions = False
output_hidden_states = False
interpolate_pos_encoding = False  # set True when finetuning at a new resolution
return_dict = True

# forward pass
outputs = model(
    pixel_values=inputs.pixel_values,
    head_mask=head_mask,
    labels=labels,
    output_attentions=output_attentions,
    output_hidden_states=output_hidden_states,
    interpolate_pos_encoding=interpolate_pos_encoding,
    return_dict=return_dict,
)

loss = outputs.loss      # cross-entropy against `labels`
logits = outputs.logits  # classification scores over the Kinetics-400 classes

print("Model outputs:", logits.shape)
print("Loss:", loss.item())
```

### Logits in PyTorch

The raw outputs from the output layer of a neural network are called logits, which are also known as activations. Deep learning networks at their core are made up of matrix multiplications and non-linearities like ReLU, so these logits can be any real number. Logits cannot be directly interpreted as model scores, which is why an activation function is applied to them before getting the final score. 
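
For example, softmax turns a vector of raw logits into class probabilities (sigmoid would be the binary/multi-label counterpart):

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([2.0, -1.0, 0.5])   # raw, unbounded outputs of the last layer
probs = F.softmax(logits, dim=-1)         # now non-negative and summing to 1
print(probs, probs.sum())
```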

### Init and Forward Method in PyTorch

```python
class MyNeuralNetwork(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(MyNeuralNetwork, self).__init__()
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(hidden_size, output_size)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        out = self.fc1(x)
        out = self.relu(out)
        out = self.fc2(out)
        out = self.sigmoid(out)
        return out
```

**init** is a constructor method used to initialize the parameters of the network. It is executed when an object of the class is created. For example, in PyTorch, this method is used to define the layers of the network, such as convolutional layers, linear layers, activation functions, etc. **forward** is a method that defines the forward pass of the neural network. This method takes the input data and passes it through the layers of the network to produce the output. This method is executed whenever the model is called to make prediction or to compute the loss during training. 

In other words, **init** sets up the network by defining the layers while forward specifies the data flows through the network. Both methods are required to create a neural network in PyTorch and serve different purposes. 

# Model Training on Celebs Dataset

```python
""" """
import os
import cv2
import torch
import numpy as np
from torch import nn
from torchvision import transforms
"""Transforms are common image transformations. They can be chained together using
        Compose. There is also a functional module for transform which provides the 
        fine-grained control over transformations."""
        
from torch.utils.data import Dataset, DataLoader
"""
"""
from facenet_pytorch import InceptionResnetV1
from PIL import Image

class VideoDataset(Dataset):
    def __init__(self, folder_paths, frame_count=20, transform=None):
        self.frame_count = frame_count
        self.transform = transform
        self.videos = []
        self.labels = []
       
        # Process real celebrity videos
        for video_file in os.listdir(folder_paths[0]):
            if video_file.endswith(('.mp4')):
                self.videos.append(os.path.join(folder_paths[0], video_file))
                self.labels.append(0)  # Real
               
        # Process fake celebrity videos
        for video_file in os.listdir(folder_paths[1]):
            if video_file.endswith(('.mp4')):
                self.videos.append(os.path.join(folder_paths[1], video_file))
                self.labels.append(1)  # Fake
               
        # Process YouTube real videos
        for video_file in os.listdir(folder_paths[2]):
            if video_file.endswith(('.mp4')):
                self.videos.append(os.path.join(folder_paths[2], video_file))
                self.labels.append(0)  # Real
   
    def extract_frames(self, video_path):
        frames = []
        cap = cv2.VideoCapture(video_path)
       
        total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
        interval = max(total_frames // self.frame_count, 1)
       
        frame_counter = 0
        while len(frames) < self.frame_count and frame_counter < total_frames:
            ret, frame = cap.read()
            if not ret:
                break
               
            if frame_counter % interval == 0:
                frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
                frame = Image.fromarray(frame)
                if self.transform:
                    frame = self.transform(frame)
                frames.append(frame)
               
            frame_counter += 1
           
        cap.release()
       
        # Pad sequence if necessary
        while len(frames) < self.frame_count:
            frames.append(torch.zeros_like(frames[0]))
           
        return torch.stack(frames)
   
    def __len__(self):
        return len(self.videos)
   
    def __getitem__(self, idx):
        video_path = self.videos[idx]
        frames = self.extract_frames(video_path)
        label = self.labels[idx]
        return frames, torch.tensor(label, dtype=torch.float32)

class DeepFakeDetector(nn.Module):
    def __init__(self, frame_count=20, hidden_size=512):
        super(DeepFakeDetector, self).__init__()
       
        # Load pretrained InceptionResNetV1
        self.feature_extractor = InceptionResnetV1(pretrained='vggface2')
        # Freeze feature extractor parameters
        for param in self.feature_extractor.parameters():
            param.requires_grad = False
           
        # LSTM for sequence processing
        self.lstm = nn.LSTM(
            input_size=512,  # InceptionResNetV1 output size
            hidden_size=hidden_size,
            num_layers=2,
            batch_first=True,
            dropout=0.5
        )
       
        # Final classification layers
        self.classifier = nn.Sequential(
            nn.Linear(hidden_size, 256),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(256, 1),
            nn.Sigmoid()
        )
       
    def forward(self, x):
        batch_size, seq_len, c, h, w = x.size()
       
        # Reshape for feature extraction
        x = x.view(-1, c, h, w)
       
        # Extract features
        features = self.feature_extractor(x)
       
        # Reshape for LSTM
        features = features.view(batch_size, seq_len, -1)
       
        # Process with LSTM
        lstm_out, _ = self.lstm(features)
       
        # Use last LSTM output
        lstm_out = lstm_out[:, -1, :]
       
        # Final classification
        output = self.classifier(lstm_out)
        return output

def train_model(model, train_loader, val_loader, epochs=50, device='cuda'):
    criterion = nn.BCELoss()
    optimizer = torch.optim.Adam(model.parameters())
   
    model = model.to(device)
    best_val_loss = float('inf')
   
    for epoch in range(epochs):
        # Training phase
        model.train()
        train_loss = 0
        for frames, labels in train_loader:
            frames, labels = frames.to(device), labels.to(device)
           
            optimizer.zero_grad()
            outputs = model(frames)
            loss = criterion(outputs.view(-1), labels)
           
            loss.backward()
            optimizer.step()
           
            train_loss += loss.item()
           
        # Validation phase
        model.eval()
        val_loss = 0
        correct = 0
        total = 0
       
        with torch.no_grad():
            for frames, labels in val_loader:
                frames, labels = frames.to(device), labels.to(device)
                outputs = model(frames)
                loss = criterion(outputs.view(-1), labels)
                val_loss += loss.item()
               
                predicted = (outputs.view(-1) > 0.5).float()
                total += labels.size(0)
                correct += (predicted == labels).sum().item()
       
        train_loss /= len(train_loader)
        val_loss /= len(val_loader)
        accuracy = 100 * correct / total
       
        print(f'Epoch {epoch+1}/{epochs}:')
        print(f'Training Loss: {train_loss:.4f}')
        print(f'Validation Loss: {val_loss:.4f}')
        print(f'Validation Accuracy: {accuracy:.2f}%')
       
        # Save best model
        if val_loss < best_val_loss:
            best_val_loss = val_loss
            torch.save(model.state_dict(), 'best_model.pth')

# Example usage
def main():
    # Set device
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    print(device)
   
    # Define transforms
    transform = transforms.Compose([
        transforms.Resize((160, 160)),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406],
                           std=[0.229, 0.224, 0.225])
    ])
   
    # Create datasets
    folder_paths = [
        'datasets/Celeb-real',
        'datasets/Celeb-synthesis',
        'datasets/YouTube-real'
    ]
   
    # Create full dataset
    dataset = VideoDataset(folder_paths, frame_count=20, transform=transform)
   
    # Split dataset
    train_size = int(0.8 * len(dataset))
    val_size = len(dataset) - train_size
    train_dataset, val_dataset = torch.utils.data.random_split(
        dataset, [train_size, val_size]
    )
   
    # Create data loaders
    train_loader = DataLoader(
        train_dataset,
        batch_size=8,
        shuffle=True,
        num_workers=4
    )
    val_loader = DataLoader(
        val_dataset,
        batch_size=8,
        shuffle=False,
        num_workers=4
    )
   
    # Create and train model
    model = DeepFakeDetector()
    train_model(model, train_loader, val_loader, epochs=50, device=device)

if __name__ == '__main__':
    main()
```

### Probability Distribution

A **probability distribution** is the mathematical function that gives the probabilities of occurrence of the possible outcomes for an experiment. It is a mathematical description of a random phenomenon in terms of its sample space and the probabilities of events. The sample space, often denoted **Ω** (omega), is the set of all possible outcomes of the random phenomenon being observed. The sample space may be any set: a set of real numbers, a set of vectors, a set of arbitrary non-numerical values, etc. The sample space of a coin flip would be **Ω** = {"heads", "tails"}.

Defining the probability of a distribution depends on the type of random variable: **discrete or absolutely continuous.** In the discrete case, it is sufficient to specify a probability mass function *p* assigning a probability to each possible outcome. In contrast, when a random variable takes values from a continuum, then by convention any individual outcome is assigned probability zero. For such continuous random variables, only events that include infinitely many outcomes, such as intervals, have probability greater than zero.
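
As a small worked example (a fair coin for the discrete case and the uniform density on [0, 1] for the continuous case, both chosen for illustration):

```latex
\text{Discrete (fair coin): } p(\text{heads}) = p(\text{tails}) = \tfrac{1}{2},
\qquad \sum_{\omega \in \Omega} p(\omega) = 1

\text{Continuous (uniform on } [0,1]\text{): } P(a \le X \le b) = \int_a^b 1 \, dx = b - a,
\qquad P(X = x_0) = 0 \text{ for any single point } x_0
```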

odyssey plan




take three sheets of paper. On the first, plan out the next five years of your life (or whatever you are planning), keeping your current scenario, your situation, and realistic expectations in mind. On the second, write down what you will do if plan A can't happen or fails; this plan should be distinctly different from plan A, not just a minor variation, but it should still align with your interests and values. Finally, on the third, write down the plan assuming that time and money are no object.
for each plan, evaluate the following aspects: resources (do you have the time, money, skills, and contacts needed to execute this plan?), how much you like it, and how confident you are that you can pull it off.

cudas


GPU memory bandwidth is higher than CPU memory bandwidth because GPUs use wide memory bus interfaces (256-bit, 384-bit, 512-bit) whereas CPU memory uses narrow buses (64-bit). GPU memory bandwidth ranges from hundreds of GB/s up to TB/s.

double precision means 64-bit (1 sign bit, 11 exponent bits, and 52 mantissa bits, i.e. 53 bits of significand precision counting the implicit leading bit); single precision means 32-bit (1 sign bit, 8 exponent bits, 23 mantissa bits)

IEEE 754 is the de facto standard for floating-point arithmetic in modern computing, supporting a variety of precisions and covering edge cases for robust numerical computation.
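
A quick way to check these layouts from Python (a small sketch using NumPy's finfo, assuming NumPy is installed):

```python
import numpy as np

# np.finfo exposes the IEEE 754 layout of each floating-point type
for dtype in (np.float32, np.float64):
    info = np.finfo(dtype)
    print(f"{info.dtype}: {info.bits} bits total, "
          f"{info.nexp} exponent bits, {info.nmant} mantissa bits, "
          f"machine epsilon = {info.eps}")
# float32: 32 bits total, 8 exponent bits, 23 mantissa bits, ...
# float64: 64 bits total, 11 exponent bits, 52 mantissa bits, ...
```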

OpenGL (Open Graphics Library) is a graphics API used for rendering 2D and 3D graphics; it is designed for hardware-accelerated rendering.
Direct3D is Microsoft's _proprietary_ API for rendering 2D and 3D graphics on Microsoft platforms (Windows, Xbox).

Since 2007, programs written in CUDA haven't had to go through the graphics part of the GPU; instead, CUDA can talk directly to the special general-purpose part of the chip made for parallel processing.

GPUs are made up of SMs (streaming multiprocessors); each SM contains SPs (streaming processors), which are the basic execution units for arithmetic operations (+ and ×). Teraflops of floating-point operations are performed by these SMs and SPs.

GPUs also have their own global memory with very high bandwidth compared to the CPU's RAM.
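
A minimal sketch of launching work across those SMs from Python (assuming Numba and a CUDA-capable GPU are available; the kernel and launch sizes are illustrative):

```python
import numpy as np
from numba import cuda

@cuda.jit
def vector_add(a, b, out):
    # Each thread handles one element; the blocks are scheduled across the SMs
    i = cuda.grid(1)
    if i < out.size:
        out[i] = a[i] + b[i]

n = 1_000_000
a = np.random.rand(n).astype(np.float32)
b = np.random.rand(n).astype(np.float32)
out = np.zeros_like(a)

threads_per_block = 256
blocks = (n + threads_per_block - 1) // threads_per_block
vector_add[blocks, threads_per_block](a, b, out)  # arrays are copied to GPU global memory
print(np.allclose(out, a + b))
```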

nepal-foreign-employment

 
#anything #smarc 
# STATS ON FOREIGN EMPLOYMENT

1. News reports that 771,000 Nepali youths ventured abroad for foreign employment in FY 2022/23. According to the Foreign Employment Board, more than 600,000 individuals pursued foreign jobs in the year following the pandemic, and this number surged to over 750,000 in the last fiscal year. The impact of the COVID-19 pandemic has led to a surge in Nepali youths seeking employment opportunities abroad, as opportunities within the country have become limited.

Krishna Prasad Bhusal, the information officer of the Ministry of Labor, emphasized that nearly all migrant workers from Nepal fall within the economically productive age group of 18 to 44 years. Over the past three years, half of these migrant workers were between the ages of 25 and 34 years old.
https://myrepublica.nagariknetwork.com/news/over-771-000-youths-sought-foreign-employment-in-fy-2022-23

---

---

1. A news article in Rising Nepal Daily by Dixya Poudel reports that nearly 808,415 Nepali youths left the nation for foreign employment in the year 2023. Over 108,542 Nepali students opted to study abroad in 2023.
https://risingnepaldaily.com/news/37709

---

---

1. Data collected from the Nepal Labour Migration reports of 2073-2080 BS can be found on this site:

https://dofe.gov.np/yearly.aspx

**2080-81 Yearly Labour Migration Report:**

[Document_2024071711480.pdf](https://prod-files-secure.s3.us-west-2.amazonaws.com/df89fc0b-ce89-458c-88fa-b18c8cdb3e6b/e64297a9-dfda-4509-a1b5-c7e12cb5ee9a/Document_2024071711480.pdf)

The Department of Foreign Employment Nepal also provides a portal to search for foreign jobs across the world through this website:

https://foreignjob.dofe.gov.np/

---

---

1. This paper discusses the trends and impacts of foreign employment and remittance inflows in Nepal.
    
    [7+Remitance+20240221.pdf](https://prod-files-secure.s3.us-west-2.amazonaws.com/df89fc0b-ce89-458c-88fa-b18c8cdb3e6b/7be7d503-6afe-4171-8c6b-8c419bbfb157/7Remitance20240221.pdf)
    
     
    
    | **Category** | **Metric** | **Value** | **Period** |
    | --- | --- | --- | --- |
    | GDP Contribution | Remittance Percentage | 22.7% | 2022/23 |
    | Migration Volume | Daily Departures (Pre-COVID) | 1,500 | Pre-2020 |
    | Labor Approvals | Total Approvals | 1.1 million | 2019/20-2021/22 |
    | Gender Distribution (2021) | Male Migrants | 64,903 | 2021 |
    |  | Female Migrants | 7,018 | 2021 |
    | Top Destinations | Saudi Arabia | 30% | 2021/22 |
    |  | Qatar | 29% | 2021/22 |
    | Remittance Trends | 2012 Amount | 434,581.7M NPR | 2012 |
    |  | 2021 Amount | 1,007,307M NPR | 2021 |
    | 10-Year Averages | Annual Migration | 335,828 | 2012-2021 |
    |  | Annual Remittance | 786,731M NPR | 2012-2021 |
2. From another dataset provided by NepalinData, based on the REPORT ON THE STATUS OF NEPALI MIGRANT WORKERS IN THE KEY DESTINATION COUNTRIES 2019, we get the following stats:

| **Category** | **Statistic** | **Value** |
| --- | --- | --- |
| **Labor Permits Issued** | Total labor permits issued by DoFE (2008/09 to 2016/17) | 3,554,683 |
|  | % of permits issued for Malaysia and GCC countries | 86% |
|  | Workers received work permits through EPS to work in South Korea | 45,000 |
| **Skill Levels** | % of labor permits issued in 2014/15 for unskilled labor | 74% |
|  | % of labor permits issued in 2014/15 for semi-skilled labor | 25% |
|  | % of labor permits issued in 2014/15 for skilled workers | 1% |
| **Survey Findings** | % of workers get preliminary information from foreign employment agencies/agents | 50% |
|  | % of workers received prior information from recruitment agencies about earnings, working hours, and obligations | 89% |
|  | % of workers found orientation training very helpful | 37% |
|  | % of workers found orientation training a little helpful | 48% |
|  | Average cost of migration (NPR) | 100,000 (0 to 550,000) |
| **Job Types and Earnings** | Top 10 highest-earning jobs include Chef, Accountant, Foreman, Supervisor, Cook, Mechanic, Salesperson, Driver, General Technician, Storekeeper |  |
|  | Average monthly earnings of Nepali migrant workers (NPR) | 48,000 |
|  | Earnings range (NPR) | 19,000 to 257,000 |
|  | Median earnings (NPR) | 39,000 |
| **Education and Earnings** | Workers with education above grade 12 earn on average (NPR) | 68,000 |
|  | Illiterate workers earn on average (NPR) | 24,000 |
|  | Higher educational attainment leads to higher earnings |  |
| **Work Experience and Earnings** | Experienced workers earn significantly more than less experienced workers |  |
|  | In South Korea, experience does not significantly affect earnings |  |
| **Skill Level and Earnings** | Skilled workers earn between (NPR) | 72,000 to 188,000 |
|  | Professional category workers earn the highest average salary (NPR) | 115,204 |
| **Working Conditions** | Nepali migrant workers work between (hours/day) | 7 to 18 |
|  | Average overtime work (hours) | 3 |
|  | % of workers receive both lodging and food facilities | 56% |
| **Financial Practices** | % of workers receive their salary/wage each month | 94% |
|  | % of workers receive salary/wages through their bank account or ATM | 87% |
|  | % of workers send earnings through money transfer companies | 84% |

---

---

1. Another article on Spotlight Nepal reports that the number of Nepalese going to work abroad decreased by 13 percent in eight months.

https://www.spotlightnepal.com/2024/03/16/nepalese-going-work-abroad-decreased-13-percent-eight-months/

---

---

1. An article from Study Travel reports that more than 112,000 Nepalese students applied to study abroad last year. Students from Nepal looking to study long-term language programmes, vocational courses and university degrees overseas require a No Objection Certificate (NOC), and the government issued 112,593 in the fiscal year covering 17th July 2023 to 15th July 2024.

https://studytravel.network/magazine/news/0/30780

The figure represented a slight decrease compared with 117,563 NOCs issued in 2022/23, but the number of Nepalese students heading abroad is still significantly higher than in previous years. Government data shows that there were 63,259 NOCs issued in 2018/19, and 24,824 provided in 2008/08.

Japan was the most popular destination in the [latest data set](https://moest.gov.np/post/3_669e3ec49d2f4) with 34,371 NOCs issued, followed by Canada with 15,982, Australia (14,372), the UK (13,339) and the USA (11,261).

---

---

1. The latest report from the [IRCC](https://www.canada.ca/en/immigration-refugees-citizenship.html) (Immigration, Refugees, and Citizenship Canada) suggests that nearly 16,000 Nepalese students were issued visas for post-secondary studies in 2023. For comparison, Nepalese students accounted for just 0.2% of Canadian study permits issued for post-secondary studies in 2018.
    
    A report from the [US Department of State-Bureau of Consular Affairs](https://travel.state.gov/content/travel/en/legal/visa-law0/visa-statistics/nonimmigrant-visa-statistics.html) shows that, in just the first nine months of fiscal year 2024, the US already issued a record-high number of F-1 student visas to Nepalese students. 9,200 Nepalese students were issued a student visa from October through June 2024. This is an increase of 61% over full-year 2023, and 49% more than full-year 2022, the previous full-year high.
    
    International student interest in the [UK](https://www.gov.uk/government/statistics/immigration-system-statistics-year-ending-march-2024/why-do-people-come-to-the-uk-to-study) dropped following significant policy changes in 2023. But as the new Labour government begins to establish their views on international education, Nepal may prove to be a critical source of students to buoy the UK’s international education sector moving forward. 
    
    More than 8,500 Nepalese students were issued a main applicant student visa to the UK in 2023, an increase of 83% over the previous year. Critically, **the 2,200 Nepalese students issued a student visa in Q1 2024 represent growth of 27% over Q1 2023**. Over this same time frame, the number of student visas issued to all international students to the UK dropped by 22%. In short, Nepalese students remain highly interested in an education in the UK post-policy changes.
    
    For Australia, the decline in student visas can be seen in the table below:
    
    | Year | Visas Granted (Australia) |
    | --- | --- |
    | 2018 | 17,603 |
    | 2019 | 10,691 |
    | 2020 | 2,747 |
    | 2021 | 3,271 |
    | 2022 | 21,864 |
    | 2023 | 14,530 |
    | 2024 | 4,075 |

Link to article: https://www.applyboard.com/applyinsights-article/how-nepal-will-help-alter-the-international-student-landscape-in-the-coming-decade#f9

vlm

1. Vision language models are large language models that operate on both modalities, text and images, which lets a single model both perceive (feel) and reason (understand, ask why). CLIP, developed by OpenAI as an open-source project, was the pioneer. Alignment methods feel to me like one of the hottest topics in the VLM space. Alignment means teaching the model what a word looks like visually, through: contrastive learning (with correct and incorrect samples, the model learns to keep correct image-text pairs closer and incorrect ones farther apart, i.e. pull correct pairs closer in its internal representation and push incorrect ones apart; see the sketch after this list); cross-modal attention (image and text go through the attention mechanism at the same time, so while processing text the model also learns to attend to, or sample, the relevant parts of the image, e.g. for "What color is the car?" the model focuses on the car in the image); masked language and image modeling (BERT-style guessing of a missing part: give the model an image and remove certain words from its description, so the model has to look at the visual data to predict the word); and supervised alignment with bounding boxes (bounding boxes are created for grounding while training the model, so the model also generates grounded responses at generation time). [[sled-lab]]

2. The biggest problem with VLMs is that they produce visual hallucinations: the LLM component generates responses from its parametric knowledge, which is why grounding is essential for dealing with hallucination.
3. Currently, what we see in the llama-3.2-vision model is that it uses the LLM backbone for the vision-text correlation, which doesn't require CLIP-style pre-training.
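
A minimal sketch of the contrastive-alignment idea from item 1 (a CLIP-style symmetric cross-entropy over image-text similarities; the embeddings below are random stand-ins, not outputs of a real vision or text encoder):

```python
import torch
import torch.nn.functional as F

def clip_style_loss(image_emb, text_emb, temperature=0.07):
    # Normalize so the dot product becomes a cosine similarity
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Similarity matrix: entry (i, j) compares image i with caption j
    logits = image_emb @ text_emb.t() / temperature

    # Matching pairs lie on the diagonal; every other entry is a negative
    targets = torch.arange(len(image_emb))
    loss_i2t = F.cross_entropy(logits, targets)      # pull each image toward its caption
    loss_t2i = F.cross_entropy(logits.t(), targets)  # and each caption toward its image
    return (loss_i2t + loss_t2i) / 2

# Toy batch: 8 image/text pairs already projected into a shared 256-d space
print(clip_style_loss(torch.randn(8, 256), torch.randn(8, 256)))
```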

Thursday, 4 September 2025

rag

1. Train BERT to generate embeddings based on cosine similarity: [SentenceTransformers using Siamese BERT Networks](https://arxiv.org/abs/1908.10084)
2. First divide the documents into chunks and extract their embeddings, then store those embeddings in a vector database. The same embedding model that was used to generate the document embeddings is used to generate an embedding for the query, and the cosine similarity between the query embedding and the embeddings in the vector database is computed. The chunks whose embeddings have the highest cosine similarity (i.e. the smallest angle between the two vectors) are retrieved as their original text, attached to the user's query, and given to the LLM in the prompt, so the LLM can provide a contextual response (see the retrieval sketch after this list).
3. But embeddings can be generated at the token level or at the sentence level: BERT generates token-level embeddings word by word, whereas Sentence-BERT directly creates an embedding for the whole sentence, which is what gets compared against other sentences in a RAG pipeline. BERT (a masked transformer) is trained to generate embeddings that work with cosine similarity, so the semantic and contextual meaning of sentences stays intact in the embeddings, which helps accurate retrieval. To train BERT for those embeddings, different strategies are applied over its layers: some use the [CLS] token, some use max pooling, and in production mean pooling is used, which compresses a sentence into a single fixed-dimension vector.
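
A minimal retrieval sketch along these lines (assuming the `sentence-transformers` package; the `all-MiniLM-L6-v2` checkpoint and the toy chunks are illustrative choices, and a real pipeline would use a proper vector database instead of an in-memory matrix):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # mean-pooling sentence encoder

chunks = [
    "Malaria is transmitted by the bite of infected Anopheles mosquitoes.",
    "CMake is a build-system generator for C and C++ projects.",
    "Remittances contribute a large share of Nepal's GDP.",
]
chunk_emb = model.encode(chunks, convert_to_tensor=True)   # embed and "store" the chunks

query = "How does malaria spread?"
query_emb = model.encode(query, convert_to_tensor=True)

scores = util.cos_sim(query_emb, chunk_emb)[0]             # cosine similarity to every chunk
best = int(scores.argmax())
prompt = f"Context: {chunks[best]}\n\nQuestion: {query}"   # retrieved chunk attached to the query
print(prompt)
```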

Thursday, 28 August 2025

cmake learn

 


  1. `cmake_minimum_required(VERSION 3.16)` sets the minimum CMake version; some features only exist in newer CMake releases, which is why the version has to be specified.
  2. `project(VoiceAssistant)` sets the project name; it should be named according to the project.
  3. `set(CMAKE_CXX_STANDARD 17)` sets whichever C++ standard the program or system needs.
  4. `find_package(PkgConfig REQUIRED)` and `find_package(CURL REQUIRED)` tell CMake to look for where the installed libraries are. If CURL isn't found here the build fails; other dependencies can be listed the same way instead of CURL.
  5. `include_directories(${CMAKE_SOURCE_DIR}/external)` specifies where the header files (.h/.hpp) are; `${CMAKE_SOURCE_DIR}` is the project root.
  6. `set(SOURCES main.cpp audio_manager.cpp transcriber.cpp llm_client.cpp voice_assistant.cpp)` lists all the source files; instead of specifying lots of files one by one, files should be grouped by module.
  7. `add_executable(voice_assistant ${SOURCES})` defines which program to build, i.e. the name of the binary is specified here; `${SOURCES}` pulls in all the cpp files listed above.
  8. `target_link_libraries(voice_assistant CURL::libcurl pthread asound)` includes all the external libraries so they can be used: libcurl for HTTP requests, pthread for multithreading, and asound for ALSA audio on Linux. Other needed libraries could go here instead, such as whisper.cpp.

Wednesday, 27 August 2025

falsophy

Only fiction gives truth; philosophy doesn't give you truth, it's just a justification.

Nietzsche, over-intellectualization or decadence

1. For Nietzsche, instinct (like in animals: direct, decisive, no hesitation) is a life-affirming force.
2. When a person over-analyzes everything, their **energy shifts from acting to just knowing**. The result: hesitation, paralysis, delay.
3. Outwardly, they look civilized, but inside they've lost their primal strength (instinct).
4. Thinking is important, but **thinking must serve instinct, not dominate it.**

  1. Oedipus Rex, the king (killed his father, married his mother (4 children); when they found out, the mother hanged herself; a tragedy between the gods and the people)
  2. Dionysus, ancient Greek god (balance between Apollo and Dionysus)
  3. Nietzsche's idea of the Apollonian vs. the Dionysian
  4. The only philosophy is Plato, it's in our DNA; every philosophy after Plato is commentary on Plato.
  5. Oedipus Complex (a son loves his mother and rivals his father, subconsciously (it's psychological and unconscious, not literal))
  6. Electra Complex (the reverse of the Oedipus Complex)

transformers


What is Transformers?

It is a library of pretrained natural language processing, computer vision, audio, and multimodal models for inference and training. The main features of the Transformers library are the Pipeline, Trainer, and Generate APIs described below.

```bash
pip install 'transformers[torch]'
```

Pipeline

It includes the inference class for many machine learning tasks like text generation, image generation, automatic speech recognition, document answering and so on. ((run inference with pipeline))

```python
from transformers import pipeline

generator = pipeline('text-generation', 'model/name', device='cuda')
generator("The malaria is caused due to", max_length=50)
```

Trainer

It includes the Trainer class, whose configuration supports mixed precision, torch.compile, FlashAttention, and distributed training for PyTorch models. ((finetune the model with Trainer))

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from transformers import DataCollatorWithPadding, TrainingArguments, Trainer
from datasets import load_dataset

model = AutoModelForSequenceClassification.from_pretrained("distilbert/distilbert-base-uncased")
tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased")
dataset = load_dataset("rotten_tomatoes")

# Converting text to token IDs (tensors are built later by the data collator)
def tokenize_dataset(batch):
    return tokenizer(batch['text'])

dataset = dataset.map(tokenize_dataset, batched=True)

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

training_args = TrainingArguments(
    output_dir='./output_directory_class',
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=2,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset['train'],
    eval_dataset=dataset['test'],
    tokenizer=tokenizer,
    data_collator=data_collator,
)

trainer.train()
```

Generate

It allows fast text generation with large language models and vision language models, including support for streaming and multiple decoding strategies.
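
A small sketch of this API (the gpt2 checkpoint and the sampling settings are just illustrative choices):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The malaria is caused due to", return_tensors="pt")
# The decoding strategy is controlled via generation arguments (greedy, sampling, beams, ...)
output_ids = model.generate(**inputs, max_new_tokens=30, do_sample=True, top_p=0.9)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```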


Accelerate Library

The Accelerate library automates device placement, distributed process orchestration, and mixed-precision handling across CPU, GPU, and TPU. Its API wraps the model, optimizer, and data loader so that the training loop scales seamlessly on any hardware with minimal code change.

Process Orchestration

Process orchestration means coordinating training processes across devices and nodes, launching, synchronizing and managing inter-process communication to ensure parallel workloads share gradients, configuration, monitoring and automatic scaling efficiently without manual boilerplate. ((infrastructure-level code for initializing devices, spawning processes, managing precision scaling)) [[#Accelerate Library]]
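
A minimal sketch of that wrapping pattern (the model, optimizer, and data here are toy placeholders):

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

accelerator = Accelerator()  # handles device placement, process setup, mixed precision

model = nn.Linear(10, 1)
optimizer = torch.optim.Adam(model.parameters())
dataloader = DataLoader(TensorDataset(torch.randn(64, 10), torch.randn(64, 1)), batch_size=8)

# prepare() wraps everything for the current hardware / process configuration
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for x, y in dataloader:
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(x), y)
    accelerator.backward(loss)  # replaces loss.backward() so scaling/synchronization is handled
    optimizer.step()
```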


Timm Library

'timm' is a deep learning library that contains SOTA computer vision models, layers, optimizers, utilities, data loaders, augmentations, and training/validation scripts, with the ability to reproduce ImageNet training results. [[#Accelerate Library]] %% pip install timm %%

```python
import timm
import torch

model = timm.create_model('resnet50')
x = torch.randn(1, 3, 224, 224)
model(x).shape
```

Pretrained Models

Each Hugging Face pretrained model inherits from three base classes:

  • PretrainedConfig: model attributes ((number of heads, vocabulary size))
  • PretrainedModel ((the model architecture defined by the configuration file; pretrained models only return raw hidden states, so a model head needs to be used to convert the raw hidden states, e.g. LlamaForCausalLM))
  • Preprocessor ((class for converting raw inputs into numerical inputs for the model, e.g. PreTrainedTokenizer, ImageProcessingMixin))

The AutoClass API is recommended for loading models and preprocessors, since it automatically infers the appropriate architecture for each task.

Note

PyTorch loads weights in torch.float32 by default.


[!NOTE] MobileBERT: MobileBERT introduces an inverted-bottleneck structure to maintain a balance between self-attention and feed-forward networks, achieving a 4.3x size reduction and a 5.5x speedup compared to the base version of BERT.

[!NOTE] BabyLLaMA: BabyLLaMA distills knowledge from multiple teachers (LLaMA, GPT) into a 58M-parameter model and a 345M-parameter model respectively, demonstrating that distillation can exceed the teacher model's performance, particularly under data-constrained conditions.

FlashAttention

Reformer improves the complexity of self-attention from O(N²) to O(N log N) by replacing dot-product attention with one that uses locality-sensitive hashing.

Mixed Precision Training is a technique for enhancing the pre-training efficiency of SLMs and LLMs. This approach leverages low-precision representations for the forward and backward passes while maintaining high precision for weight updates. A few notable works are Automatic Mixed Precision and Brain Floating Point (BFLOAT16). NVIDIA's Hopper architecture supports 8-bit floating-point (FP8) precision, enabling greater computational efficiency for large-scale language models.
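
A minimal PyTorch sketch of the low-precision-forward / high-precision-update pattern (assuming a CUDA GPU; the model and data are toy placeholders):

```python
import torch
from torch import nn

model = nn.Linear(512, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = torch.cuda.amp.GradScaler()            # keeps updates numerically safe in FP32

for _ in range(3):
    x = torch.randn(32, 512, device='cuda')
    target = torch.randint(0, 10, (32,), device='cuda')

    optimizer.zero_grad()
    with torch.cuda.amp.autocast():             # forward pass runs in low precision where safe
        loss = nn.functional.cross_entropy(model(x), target)
    scaler.scale(loss).backward()               # scaled backward pass to avoid FP16 underflow
    scaler.step(optimizer)                      # unscales gradients, then applies the update
    scaler.update()
```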

Hessian Matrix

The Hessian is a matrix that tells you how a function bends. While the gradient tells you the direction of steepest ascent or descent, the Hessian tells you how the steepness itself changes. In machine learning, especially deep learning, it helps us understand how our loss function behaves around minima and how our model might generalize.
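
Concretely, for a scalar loss f(θ) the Hessian collects all second partial derivatives, and its eigenvalues describe the curvature (sharp vs. flat directions) around a minimum:

```latex
H_{ij} = \frac{\partial^2 f}{\partial \theta_i \, \partial \theta_j},
\qquad
H = \nabla^2 f(\theta) =
\begin{pmatrix}
\frac{\partial^2 f}{\partial \theta_1^2} & \cdots & \frac{\partial^2 f}{\partial \theta_1 \, \partial \theta_n} \\
\vdots & \ddots & \vdots \\
\frac{\partial^2 f}{\partial \theta_n \, \partial \theta_1} & \cdots & \frac{\partial^2 f}{\partial \theta_n^2}
\end{pmatrix}
```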

Monolithic Multi-Modal Model

Monolithic multimodal models are single, unified neural networks trained to process and integrate multiple data types (e.g., text, image, audio) within one architecture. Unlike modular systems that use separate models for each modality, monolithic models share parameters and jointly learn cross-modal representations, enabling richer understanding and generation across modalities with end-to-end training and shared context.

Gradient Clipping

Gradient clipping limits how large the gradients can get during backpropagation. If they exceed a threshold, they’re scaled down to prevent instability or exploding gradients. This stabilizes training, especially in deep or recurrent neural networks, by avoiding extreme parameter updates that can cause divergence or NaN losses. It ensures smoother, more controlled learning steps.
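
In PyTorch this is usually one extra line between backward() and step(); a small sketch (the LSTM and the max_norm value are arbitrary choices):

```python
import torch
from torch import nn

model = nn.LSTM(input_size=32, hidden_size=64, batch_first=True)
optimizer = torch.optim.Adam(model.parameters())

x = torch.randn(8, 20, 32)          # batch of 8 sequences, 20 steps, 32 features
output, _ = model(x)
loss = output.pow(2).mean()         # dummy loss just to produce gradients

optimizer.zero_grad()
loss.backward()
# Rescale all gradients so their global L2 norm is at most 1.0
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```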

Universal Logit Distillation Loss

Universal Logit Distillation Loss is a knowledge distillation technique where a student model learns from the logits (pre-softmax outputs) of a teacher model, using a unified, modality-agnostic loss. It’s designed to work across different tasks or modalities (e.g., vision, language, audio), hence the term "universal".

[!NOTE] Logits: Raw output from the last layer of a model before applying softmax.
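
A generic sketch of logit distillation (this is the standard temperature-softened KL loss, shown for intuition rather than the exact universal formulation; the temperature and toy logits are arbitrary):

```python
import torch
import torch.nn.functional as F

def soft_logit_distillation(student_logits, teacher_logits, temperature=2.0):
    # Soften both distributions, then match the student to the teacher
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    # KL divergence, rescaled by T^2 as is conventional in distillation
    return F.kl_div(student_log_probs, teacher_probs, reduction='batchmean') * temperature**2

# Toy batch: 4 examples, 10 classes
teacher_logits = torch.randn(4, 10)
student_logits = torch.randn(4, 10, requires_grad=True)
loss = soft_logit_distillation(student_logits, teacher_logits)
loss.backward()
print(loss.item())
```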

BFLOAT16

BFLOAT16 is a compact floating-point format that keeps the range of FP32 but with lower precision. It speeds up training while avoiding common issues like overflow seen in FP16. That's why it's widely used in TPUs and supported in modern deep learning frameworks. BFLOAT16 keeps the same 8-bit exponent as FP32, so it can represent large and small values just like FP32.
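
A quick way to see that trade-off (a small sketch using torch.finfo):

```python
import torch

for dtype in (torch.float32, torch.float16, torch.bfloat16):
    info = torch.finfo(dtype)
    # bfloat16 keeps float32's range (similar max) but has fewer mantissa bits (much larger eps)
    print(f"{dtype}: max = {info.max:.3e}, eps = {info.eps:.3e}")
```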

GPU Generation and Supported Precision

| Generation | Precision Support |
| --- | --- |
| Volta (V100) | FP16, FP32 |
| Turing (T4) | FP16, INT8 |
| Ampere (A100) | FP16, BFLOAT16, TF32, INT8, FP64 |
| Hopper (H100) | FP8, FP16, BFLOAT16, INT8, FP64 |

Mixture of Experts

Mixture of Experts (MoE) is a neural network architecture that splits the model into multiple “experts” (sub-networks), and during inference or training, only a small number of them are activated based on the input.
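
A toy sketch of the routing idea (top-1 gating over a few small expert networks; real MoE layers add load-balancing losses and capacity limits, and the sizes here are arbitrary):

```python
import torch
from torch import nn

class TinyMoE(nn.Module):
    def __init__(self, dim=32, num_experts=4):
        super().__init__()
        self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_experts))
        self.router = nn.Linear(dim, num_experts)    # scores each expert per token

    def forward(self, x):                            # x: (tokens, dim)
        scores = self.router(x).softmax(dim=-1)
        top_expert = scores.argmax(dim=-1)           # pick one expert per token (top-1 gating)
        out = torch.zeros_like(x)
        for idx, expert in enumerate(self.experts):
            mask = top_expert == idx
            if mask.any():                           # only the selected expert runs for those tokens
                out[mask] = expert(x[mask]) * scores[mask, idx].unsqueeze(-1)
        return out

moe = TinyMoE()
print(moe(torch.randn(16, 32)).shape)  # torch.Size([16, 32])
```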

Usage of MoE

  • GShard (Google): Scaled to 600B+ parameters.
  • Switch Transformer: Efficient MoE with only 1 expert per token.
  • GPT-MoE variants: Sparse expert-based large language models.
  • Vision Models: Used in sparse ViT and multimodal architectures.