
A real issue here is the lack of training data (at least for LLMs). There's lots of high-quality (and plenty more poor-quality) open source software to train on. There's significantly less open source hardware, and what does exist is mostly front-end design. Good examples of complete testbenches (ones you'd close verification on and take to a production tape-out) are few and far between, and there's basically nothing for modern physical design and back-end considerations (i.e. how you take your design and actually manufacture a chip with it).

Commercial companies that may be interested in AI tools for EDA do have these things, of course, but are any of them going through the expensive process of fine-tuning LLMs with them?

Indeed, perhaps it's important to include a high-quality corpus in pre-training? I doubt anyone wants to train an LLM from scratch for EDA.

Perhaps Nvidia are doing experiments here? They've got the unique combination of access to a decent corpus, cheaper training costs, and in-house know-how.



I fine-tuned an LLM to do Verification IP wiring at an LLM hardware startup. We built the dataset in-house. It was quite effective, actually; with enough investment in expanding the dataset, this is a totally viable application.


I'm curious: did you have to tailor your dataset around instruction-following/reasoning capabilities as well? No conflict of interest on my part (I'm just interested in hobby programming for vintage computers); my understanding comes from Unsloth's fine-tuning guide. [1]

[1] https://docs.unsloth.ai/basics/datasets-guide


No problem - although I'm out of that particular role, it's appropriate to discuss since the company already shared these details in an OpenAI press release a few months back.

I fine-tuned reasoning models (o1-mini and o3-mini) that already had strong instruction-following and reasoning behavior. The dataset I prepared took this into account, but it was just simple prompt/response pairs. Defining the task tightly, ensuring the dataset was high quality, picking the right hyperparameters, and preparing a proper reward function (and modeling it against the API provided) were the keys to success.
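
To make "simple prompt/response pairs" concrete, here's a minimal sketch of what that kind of dataset prep could look like. The pair format follows OpenAI's chat fine-tuning JSONL convention; the wiring task, signal names, and reward heuristic are illustrative assumptions on my part, not the poster's actual data or grader.

```python
# Hypothetical sketch of prompt/response dataset prep for a VIP-wiring task.
# Format follows OpenAI's chat fine-tuning JSONL convention; the example
# spec, wiring, and reward heuristic are made up for illustration.
import json

def make_pair(spec: str, wiring: str) -> dict:
    """One prompt/response training example in chat-message form."""
    return {
        "messages": [
            {"role": "user", "content": f"Wire this VIP to the DUT:\n{spec}"},
            {"role": "assistant", "content": wiring},
        ]
    }

def reward(generated: str, golden: str) -> float:
    """Toy reward: fraction of golden port connections the model reproduced."""
    want = {c.strip() for c in golden.split(",") if c.strip()}
    got = {c.strip() for c in generated.split(",") if c.strip()}
    return len(want & got) / max(len(want), 1)

pairs = [
    make_pair(
        "axi_master_agent m0; DUT ports: clk, rst_n, awaddr, awvalid, awready",
        ".clk(clk), .rst_n(rst_n), .awaddr(m0.awaddr), "
        ".awvalid(m0.awvalid), .awready(m0.awready)",
    ),
]

with open("vip_wiring.jsonl", "w") as f:
    for p in pairs:
        f.write(json.dumps(p) + "\n")
```

The reward here is just set overlap on port connections; a real grader would presumably need to handle renamed nets, hierarchical paths, and compile-checking the result.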


That’s really cool. I’d love to see that process from up close.


> Indeed, perhaps it's important to include a high-quality corpus in pre-training? I doubt anyone wants to train an LLM from scratch for EDA.

That does sound reasonable to me. The main problem (at least for software) is that you can't train on source code alone: comments are written in human language, so you need a corpus of human language as well, so that the LLM learns that alongside the programming language(s). I'd assume the same holds for hardware.

Depending on what you're going for, you could take an existing pre-trained model and continue pretraining it on your EDA corpus. The catch is that you'd then have to reinvent (or lift from elsewhere) the entire fine-tuning dataset and pipeline to restore instruction-following behavior, which is significantly harder than just doing a fine-tune. A sketch of the continued-pretraining step is below.
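
For reference, the continued-pretraining step itself is mechanically straightforward. A minimal sketch with Hugging Face transformers, assuming a causal LM and a plain-text EDA corpus; the model name, file path, and hyperparameters are placeholders, not a tested recipe.

```python
# Minimal sketch of continued pretraining on a domain corpus.
# Model name, corpus path, and hyperparameters are illustrative assumptions.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

base = "meta-llama/Llama-3.1-8B"  # any pretrained causal LM
tok = AutoTokenizer.from_pretrained(base)
tok.pad_token = tok.eos_token  # many base tokenizers ship without a pad token
model = AutoModelForCausalLM.from_pretrained(base)

# Raw RTL/testbench text, one document per line.
ds = load_dataset("text", data_files={"train": "eda_corpus.txt"})["train"]

def tokenize(batch):
    return tok(batch["text"], truncation=True, max_length=2048)

ds = ds.map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="eda-cpt",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,
        num_train_epochs=1,
        learning_rate=2e-5,
        bf16=True,
    ),
    train_dataset=ds,
    # mlm=False gives the standard next-token (causal LM) objective.
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
)
trainer.train()
```

The resulting checkpoint would know the domain but would have drifted from its chat behavior, which is exactly why the fine-tuning pipeline has to be rebuilt afterwards.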



