Part 2: Training My Own Coding Model — The SFT and DPO Pipeline

This is the full pipeline: PII scan, masking, temporal ordering, dedup, SFT, and DPO pairs.

What the raw data looked like

Over 12 months I collected AI coding conversations from three main sources:

  • Claude Projects exports
  • Cursor IDE logs
  • Codex sessions

Total raw data was around 727MB, about 107k conversations. A lot of it was not safe or not usable.

107,502 conversations total. After scanning for secrets, 95,561 got quarantined. That is 89% of my data flagged for potential API keys, private keys, or AWS credentials.

11,711 conversations survived security checks. 51.75 million tokens. Enough to fill 1,200 copies of The Great Gatsby with nothing but code and error messages.

Reconstruction and temporal ordering

A big chunk of the work was reconstructing the conversations so they were in the right order.

Cursor v2 stores Composer and Agent chats in a different place than the old chat mode did. The Composer bubbles live in:

~/Library/Application Support/Cursor/User/globalStorage/state.vscdb
Table: cursorDiskKV
Keys: composerData:{uuid} and bubbleId:{composer-id}:{bubble-id}

Those bubbles include timestamps. I sorted messages by timestamp so the conversation order is stable before anything else happens.
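
Simplified, the extraction looks something like the sketch below. The query against cursorDiskKV matches what is described above; the field names inside the bubble JSON (like createdAt) are assumptions about the payload, so adjust them to whatever the actual schema exposes.

import json
import sqlite3
from pathlib import Path

DB = Path.home() / "Library/Application Support/Cursor/User/globalStorage/state.vscdb"

def load_composer_bubbles(composer_id: str) -> list[dict]:
    """Pull every bubble for one Composer session and sort it by timestamp."""
    con = sqlite3.connect(DB)
    rows = con.execute(
        "SELECT key, value FROM cursorDiskKV WHERE key LIKE ?",
        (f"bubbleId:{composer_id}:%",),
    ).fetchall()
    con.close()

    bubbles = [json.loads(value) for _, value in rows]
    # "createdAt" is an assumption about the bubble JSON; the point is that
    # ordering comes from the stored timestamp, not from key order.
    bubbles.sort(key=lambda b: b.get("createdAt", 0))
    return bubbles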

For Claude, I grouped raw events by sessionId, dropped sidechain messages for SFT, and chunked long sessions into windows so they fit context limits. This gives me real multi turn conversations, not random message shards.
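
Roughly, that reconstruction looks like the sketch below. It is simplified: the field names (sessionId, isSidechain, timestamp) are assumptions about the export format, and the fixed message-count window here stands in for the context-limit chunking described above.

from collections import defaultdict

def rebuild_sessions(events: list[dict], window: int = 40) -> list[list[dict]]:
    """Group raw Claude events into sessions, drop sidechains, split long ones."""
    sessions = defaultdict(list)
    for ev in events:
        if ev.get("isSidechain"):
            continue  # sidechain drafts are excluded from SFT (they feed DPO later)
        sessions[ev["sessionId"]].append(ev)

    chunks = []
    for msgs in sessions.values():
        msgs.sort(key=lambda m: m["timestamp"])
        # Fixed-size windows stand in for the real context-limit chunking.
        for i in range(0, len(msgs), window):
            chunks.append(msgs[i : i + window])
    return chunks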

Dedup that survives time

After reconstruction and ordering, I deduplicate by a stable trace id derived from the conversation content. That catches duplicates across sources and across time.

Numbers from the build:

  • Duplicates skipped: 13,634
  • Sessions rebuilt from raw Claude logs: 4,747
  • Chunks written from those sessions: 5,007

This matters because the raw logs contain the same thread in multiple places. If you dedup before ordering, the hashes are unstable. Order first, then dedup.
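
The dedup itself boils down to hashing the ordered role/content pairs. The sketch below is simplified (the real normalization may strip more noise), but it shows why the hash is only stable once the ordering is.

import hashlib
import json

def trace_id(messages: list[dict]) -> str:
    """Stable id derived from the ordered conversation content."""
    canonical = json.dumps(
        [(m["role"], m["content"]) for m in messages],
        ensure_ascii=False,
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def dedup(conversations: list[list[dict]]) -> list[list[dict]]:
    """Keep the first occurrence of each trace id, drop the rest."""
    seen, kept = set(), []
    for msgs in conversations:
        tid = trace_id(msgs)
        if tid not in seen:
            seen.add(tid)
            kept.append(msgs)
    return kept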

PII scan and masking

I did not trust myself to manually review 100k conversations. Everything goes through a scan and masking step first.

What I scan for:

  • API keys and tokens
  • Private keys
  • Database URLs
  • Local paths like /Users/sero

Then I apply a redaction policy that masks patterns and rewrites local paths to /<ABS>/.
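
In spirit, the redaction looks like the sketch below. The patterns are illustrative, not the exact ones the pipeline uses; the real policy covers more providers and formats.

import re

# Illustrative patterns only -- the real policy covers more providers.
SECRET_PATTERNS = {
    "GITHUB_TOKEN": re.compile(r"ghp_[A-Za-z0-9]{36}"),
    "HF_TOKEN": re.compile(r"hf_[A-Za-z0-9]{30,}"),
    "OPENAI_KEY": re.compile(r"sk-[A-Za-z0-9_-]{20,}"),
}
LOCAL_PATH = re.compile(r"/Users/[A-Za-z0-9._-]+")

def redact(text: str) -> str:
    text = LOCAL_PATH.sub("/<ABS>", text)           # /Users/sero/... -> /<ABS>/...
    for label, pattern in SECRET_PATTERNS.items():
        text = pattern.sub(f"<{label}>", text)      # mask the secret, leave a marker
    return text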

From the final SFT run:

  • 95,561 conversations quarantined
  • High risk markers quarantined: 79
  • Path rewrites: 75,127
  • Pattern replacements included GitHub tokens, HF tokens, OpenAI keys, Anthropic keys, Slack tokens

Prepared outputs are scanned again and come back clean with 0 hits. The quarantine file only stores row pointers and reasons. No raw text leaks.

Did I train on single pairs?

No. The SFT dataset uses full multi turn conversations. Some sources are only pairs, but most are not. The final SFT output has:

  • Average 8.4 messages per conversation
  • p50 of 2 messages, p90 of 19

The model sees real back and forth, not just single question and answer pairs.
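
For shape, a single training record looks something like this (the content here is invented; the real records may carry extra metadata):

example = {
    "messages": [
        {"role": "user", "content": "The deploy script fails with a nonce error on Sepolia."},
        {"role": "assistant", "content": "Two transactions are probably sharing a nonce. Paste the deploy output."},
        {"role": "user", "content": "...deploy log..."},
        {"role": "assistant", "content": "The second transaction goes out before the first confirms. Await the first receipt, then send the next one."},
    ]
}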

The SFT dataset

Final split:

  • Train: 11,711
  • Validation: 107
  • Test: 123

Domain mix from samples:

  • Solidity and Web3 around 35%
  • TypeScript and Node around 30%
  • Python around 20%
  • SQL around 10%
  • Other around 5%

This matches what I actually work on.

SFT training config

I trained a LoRA adapter on top of NousCoder 14B.

  • QLoRA 4 bit
  • LoRA rank 64, alpha 128, dropout 0.05
  • Target modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
  • Batch size 2, grad accumulation 8
  • Max length 4096, packing enabled
  • Learning rate 2e-5, cosine schedule
  • 3 epochs target, timed out at 2.52
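
In TRL/PEFT terms, that setup looks roughly like the sketch below. It is not the exact training script: the base model id is a placeholder, argument names shift a bit between TRL versions, and the dataset path is assumed.

import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from trl import SFTConfig, SFTTrainer

BASE_MODEL = "<nouscoder-14b-repo-id>"  # placeholder, not the actual HF repo id

# QLoRA: load the base model in 4-bit, train only the LoRA adapter on top.
model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL,
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    ),
    device_map="auto",
)

peft_config = LoraConfig(
    r=64,
    lora_alpha=128,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)

args = SFTConfig(
    output_dir="sft-nouscoder",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    max_seq_length=4096,
    packing=True,
    learning_rate=2e-5,
    lr_scheduler_type="cosine",
    num_train_epochs=3,
)

train_ds = load_dataset("json", data_files="sft_train.jsonl", split="train")
trainer = SFTTrainer(model=model, args=args, train_dataset=train_ds, peft_config=peft_config)
trainer.train()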

Infra:

  • HuggingFace Jobs
  • A100 80GB
  • 18 hours
  • Cost around $47

Results

  • Final loss: 0.685
  • Token accuracy: 81.6%
  • Total tokens: 51.75M

Training timed out before the third epoch finished, but the saved checkpoint still works.

What it learned

The model absorbed my patterns:

  • OpenZeppelin imports (35% of training data was Solidity)
  • ethers.js v6 (not v5, because I debugged the differences)
  • Type annotations in TypeScript
  • Error handling that actually catches things
  • Concise explanations followed by code blocks

First test: "Write a Solidity ERC20 token"

It generated valid code with OpenZeppelin. Used the correct import paths. Included permit functionality (EIP-2612) without being asked. The model remembered my style better than I do.

The Artifacts

Three problems emerged immediately:

  • Besides loss on the training data, how am I supposed to measure the model's performance on unseen data?
  • Is the dataset representative of what I want to do moving forward?
  • If I have been incentivizing the model to learn my style, how do I ensure it doesn't overfit?

Deploying to production

The model lives on my Linux server, served through vllm-studio (https://github.com/0xSero/vllm-studio) at 100+ tokens/second. I use it in opencode and vllm-studio. It suggests code that looks like code I would write, but it lacks the flexibility of a larger model.

DPO pairs and alignment plan

I prepared preference pairs for DPO. These are mined from Claude sidechains where the draft response becomes rejected and the final response becomes chosen.

  • Total pairs: 4,532
  • Train: 4,443
  • Validation: 46
  • Test: 43
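
The mining logic is roughly: whenever an assistant sidechain draft was superseded, pair it with the final assistant answer on the main line of the same session. The sketch below runs on raw sessions (before sidechains are dropped for SFT), reuses the field-name assumptions from earlier, and is simplified rather than the exact miner.

def mine_dpo_pairs(sessions: list[list[dict]]) -> list[dict]:
    """Turn superseded sidechain drafts into (prompt, rejected, chosen) pairs."""
    pairs = []
    for msgs in sessions:
        for i, m in enumerate(msgs):
            if not (m.get("isSidechain") and m["role"] == "assistant"):
                continue
            prompt = [x for x in msgs[:i] if not x.get("isSidechain")]
            final = next(
                (x for x in msgs[i + 1:]
                 if x["role"] == "assistant" and not x.get("isSidechain")),
                None,
            )
            if prompt and final:
                pairs.append({
                    "prompt": prompt,            # shared conversation prefix
                    "rejected": m["content"],    # the draft that got discarded
                    "chosen": final["content"],  # the answer that actually shipped
                })
    return pairs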

Where the model lives

Model card and files:

It is a LoRA adapter, not a merged model. I serve it via vLLM with a base model behind the API.
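
Conceptually, the adapter rides on top of the untouched base weights at inference time. In vLLM's Python API that looks something like this; the real deployment goes through the OpenAI-compatible server in vllm-studio, and the model id and adapter path here are placeholders.

from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# Base weights stay untouched; the LoRA adapter is applied per request.
llm = LLM(model="<base-nouscoder-14b>", enable_lora=True)

out = llm.generate(
    "Write a Solidity ERC20 token",
    SamplingParams(max_tokens=500, temperature=0.7),
    lora_request=LoRARequest("sero-nouscoder", 1, "/path/to/lora-adapter"),
)
print(out[0].outputs[0].text)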

How I use it

I run it through an OpenAI compatible endpoint:

curl https://api.homelabai.org/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer <YOUR_TOKEN>" \
  -d '{
    "model": "sero-nouscoder",
    "messages": [{"role": "user", "content": "Write a Solidity ERC20 token"}],
    "max_tokens": 500,
    "temperature": 0.7
  }'

The economics

This is where it gets interesting:

  • Total project cost: $47
  • Remaining budget: $103 (started with $150)
  • Time investment: 2 days data prep, 18 hours training
  • Result: A coding assistant that knows my preferences

Compare to ChatGPT Pro at $20/month. This cost the equivalent of 2.3 months of subscription, but I own the model. No rate limits. No context windows shared with millions. Actual privacy.

What I am thinking next

  • Can this scale to a distributed set of training datasets sourced from OSS developers?
  • What size model would make this actually useful as a daily driver?
  • How can I better structure the dataset to not just show the model what I've done, but also push it towards better overall behaviour?
  • I need to learn more about RL, SFT, etc.

The verdict

This project changed my mental model of AI. I thought fine tuning required PhDs and data centers. Reality: one person, one weekend, $47.

The model is not perfect. It shows training artifacts. But it codes like me, understands my project patterns, and runs on hardware I control.

