panabee's comments | Hacker News

VCs are soccer stars, but founders play basketball.

It’s easy to dunk on VCs, but the herd effect is rational after considering the typical VC’s background, the intense competition for good deals, and the job requirements — to prudently deploy capital.

Who wants to pitch their boss on investing $1-10M in a product no one uses, built by a team of anons?

This is not to defend the process, but merely explain it. It’s not so different from customer marketing. To win a VC, first understand the VC.

Once hired, VCs cannot easily be fired, yet they exert immense strategic control.

Nonetheless, many founders interview summer interns harder than VCs.

Heuristic: after removing capital, would you hire the VC to be your boss?

Great VCs are worth the equity and will turbocharge startups. When you find one, don't haggle. Get a fair deal, and get right back to coding.

Bad VCs will destroy companies the same way soccer stars would destroy basketball teams if made the head coach.


The association between pathogens and cancer is under-appreciated, mostly due to limitations in detection methods.

For instance, it is not uncommon for cancer studies to design assays around non-oncogenic strains, or for assays to use primer sequences with binding sites mismatched to a large number of NCBI GenBank genomes.
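To make the primer-mismatch failure mode concrete, binding-site screening can be sketched as a simple Hamming-distance scan. The primer and genome strings below are made-up stand-ins, not real GenBank sequences or validated primers:

```python
# Sketch: screening a PCR primer against reference genomes for binding-site
# mismatches. A primer that matches one strain perfectly can carry mismatches
# against another, silently reducing detection sensitivity for that strain.

def count_mismatches(primer: str, site: str) -> int:
    """Hamming distance between a primer and a candidate binding site."""
    return sum(1 for a, b in zip(primer, site) if a != b)

def best_binding_site(primer: str, genome: str) -> tuple[int, int]:
    """Slide the primer along the genome; return (position, mismatches)
    of the best-matching site."""
    best_pos, best_mm = -1, len(primer) + 1
    for i in range(len(genome) - len(primer) + 1):
        mm = count_mismatches(primer, genome[i:i + len(primer)])
        if mm < best_mm:
            best_pos, best_mm = i, mm
    return best_pos, best_mm

primer = "ACGTACGT"
genomes = {
    "strain_A": "TTTACGTACGTTTT",   # perfect binding site
    "strain_B": "TTTACGAACGTTTT",   # one mismatch in the binding site
}
for name, genome in genomes.items():
    pos, mm = best_binding_site(primer, genome)
    print(name, pos, mm)
```

A real screen would, of course, run something like BLAST against thousands of genomes rather than a toy scan, but the underlying question is the same: how many genomes does the assay quietly fail to see?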

Another example: studies relying on The Cancer Genome Atlas (TCGA), a rich database for cancer investigations. The TCGA made a deliberate tradeoff to standardize quantification of eukaryotic coding transcripts, but at the cost of excluding non-poly(A) transcripts like EBER1/2 and other viral non-coding RNAs -- thus potentially understating viral presence.

Enjoy the rabbit hole. :)


can you translate this to English?


A more accurate title: "Are Cornell Students Meritocratic and Efficiency-Seeking? Evidence from 271 MBA students and 67 Undergraduate Business Students."

This topic is important and the study interesting, but the methods exhibit the same generalizability bias as the famous Dunning-Kruger study.

The referenced MBA students -- and by extension, the elites -- only reflect 271 students across two years, all from the same university.

By analyzing biased samples, we risk misguided discourse on a sensitive subject.

@dang


This is long overdue for biomedicine.

Even Google DeepMind's relabeled MedQA dataset, created for MedGemini in 2024, has flaws.

Many healthcare datasets/benchmarks contain dirty data because accuracy incentives are absent and few annotators are qualified.

We had to pay Stanford MDs to annotate 900 new questions to evaluate frontier models and will release these as open source on Hugging Face for anyone to use. They cover VQA and specialties like neurology, pediatrics, and psychiatry.

If labs want early access, please reach out. (Info in profile.) We are finalizing the dataset format.

Unlike general LLMs, where noise is tolerable and sometimes even desirable, training on incorrect/outdated information may cause clinical errors, misfolded proteins, or drugs with off-target effects.

Complicating matters, shifting medical facts may invalidate training data and model knowledge. What was true last year may be false today. For instance, in April 2024 the U.S. Preventive Services Task Force reversed its longstanding advice and now urges biennial mammograms starting at age 40 -- down from the previous benchmark of 50 -- for average-risk women, citing rising breast-cancer incidence in younger patients.


This is true for every subfield I have been working on for the past 10 years. The dirty secret of ML research is that Sturgeon's law applies to datasets as well - 90% of data out there is crap. I have seen NLP datasets with hundreds of citations that were obviously worthless as soon as you put in the "effort" and actually looked at the samples.


100% agreed. I also advise you not to read many cancer papers, particularly ones investigating viruses and cancer. You would be horrified.

(To clarify: this is not the fault of scientists. This is a byproduct of a severely broken system with the wrong incentives, which encourages publication of papers and not discovery of truth. Hug cancer researchers. They have accomplished an incredible amount while being handcuffed and tasked with decoding the most complex operating system ever designed.)


> this is not the fault of scientists. This is a byproduct of a severely broken system with the wrong incentives, which encourages publication of papers and not discovery of truth

Are scientists not writing those papers? There may be bad incentives, but scientists are responding to those incentives.


That is axiomatically true, but both harsh and useless, given that (as I understand from HN articles and comments) the choice is "play the publishing game as it is" vs "don't be a scientist anymore".


I agree, but there is an important side-effect of this statement: it's possible to criticize science, without criticizing scientists. Or at least without criticizing rank and file scientists.

There are many political issues where activists claim "the science has spoken." When critics respond by saying, "the science system is broken and is spitting out garbage", we have to take those claims very seriously.

That doesn't mean the science is wrong. Even though the climate science system is far from perfect, climate change is real and human-made.

On the other hand, some of the science on gender medicine is not as established as medical associations would have us believe (though this might change in a few years). But that doesn't stop reputable science groups from making false claims.


If we’re not going to hold any other sector of the economy personally responsible for responding to incentives, I don’t know why we’d start with scientists. We’ve excused folks working for Palantir around here - is it that the scientists aren’t getting paid enough for selling out, or are we just throwing rocks in glass houses now?


Valid critique, but one addressing a problem above the ML layer at the human layer. :)

That said, your comment has an implication: in which fields can we trust data if incentives are poor?

For instance, many Alzheimer's papers were undermined after journalists unmasked foundational research as academic fraud. Which conclusions are reliable and which are questionable? Who should decide? Can we design model architectures and training to grapple with this messy reality?

These are hard questions.

ML/AI should help shield future generations of scientists from poor incentives by maximizing experimental transparency and reproducibility.

Apt quote from Supreme Court Justice Louis Brandeis: "Sunlight is the best disinfectant."


Not an answer, but a contributory idea: meta-analysis. There are plenty of strong meta-analyses out there, and one of the things they tend to do is weigh the methodological rigour of each paper along with its overlap with the combined question being analyzed. Could we use this weighting explicitly in the training process?
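As a rough sketch of the idea: per-example loss weights could come from a rigor score assigned to each source study, the way a meta-analysis weights studies by quality. The rigor scores and loss form below are illustrative assumptions, not an established recipe:

```python
# Sketch: weighting training examples by the methodological rigor of the
# study they came from, analogous to meta-analytic weighting. Scores and
# data are made up for illustration.
import math

def weighted_nll(preds, labels, rigor):
    """Binary negative log-likelihood where each example's loss is scaled
    by the rigor score of its source study, then normalized by total weight."""
    total_w = sum(rigor)
    loss = 0.0
    for p, y, w in zip(preds, labels, rigor):
        p_y = p if y == 1 else 1.0 - p
        loss += -w * math.log(max(p_y, 1e-12))  # clamp to avoid log(0)
    return loss / total_w

preds  = [0.9, 0.8, 0.4]   # model probabilities for class 1
labels = [1,   1,   0]
rigor  = [1.0, 0.3, 0.8]   # e.g. large RCT > small cohort > case report
print(round(weighted_nll(preds, labels, rigor), 4))
```

Low-rigor samples still contribute signal, but errors on them cost less, so the model is pulled less strongly toward conclusions from weak studies.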


Thanks. This is helpful. Looking forward to more of your thoughts.

Some nuance:

What happens when the methods are outdated/biased? We highlight a potential case in breast cancer in one of our papers.

Worse, who decides?

To reiterate, this isn’t to discourage the idea. The idea is good and should be considered, but doesn’t escape (yet) the core issue of when something becomes a “fact.”


Scientists are responding to the incentives of a) wanting to do science and b) doing so for the public benefit. There was one game in town for this: the American public grant scheme.

This game is being undermined and destroyed by infamous anti-vaxxer, non-medical expert, non-public-policy expert RFK Jr.[1] The disastrous cuts to the NIH's public grant scheme are likely to amount to $8,200,000,000 ($8.2 billion USD) in terms of years of life lost.[2]

So, should scientists not write those papers? Should they not do science for public benefit? These are the only ways to not respond to the structure of the American public grant scheme. It seems to me that, if we want better outcomes, then we should make incremental improvements to the institutions surrounding the public grant scheme. This seems far more sensible than installing Bobby Brainworms to burn it all down.

[1] https://youtu.be/HqI_z1OcenQ?si=ZtlffV6N1NuH5PYQ

[2] https://jamanetwork.com/journals/jama-health-forum/fullartic...


> This is true for every subfield I have been working on for the past 10 years

Hasn’t data labelling been the bulk of the work in every research endeavour since forever?


If you download datasets for classification from Kaggle or CIFAR, or for search ranking from TREC, it is the same. Typically 1-2% of judgements in that kind of dataset are just wrong, so if you are aiming for the last few points of AUC you have to confront that.
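A quick synthetic illustration of that ceiling: even a model that ranks every true example perfectly loses measurable AUC once ~2% of evaluation labels are flipped. The data here is made up; AUC is computed in pure Python via the rank-sum (Mann-Whitney) formulation:

```python
# Sketch: how ~2% wrong labels in an evaluation set cap measurable AUC.
import random

def auc(scores, labels):
    """Fraction of (positive, negative) pairs the scores order correctly,
    counting ties as half a win -- the Mann-Whitney view of AUC."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

random.seed(0)
n = 2000
truth = [int(random.random() < 0.5) for _ in range(n)]
scores = [float(y) for y in truth]                       # a "perfect" model
noisy = [y if random.random() > 0.02 else 1 - y for y in truth]

print(auc(scores, truth))   # 1.0 against clean labels
print(auc(scores, noisy))   # roughly 0.98 against 2%-flipped labels
```

So a benchmark with 2% bad judgements cannot distinguish models above roughly that ceiling; the "last few points of AUC" are hidden inside the label noise.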


I still want to jump off a bridge whenever someone thinks they can use the twitter post and movie review datasets to train sentiment models for use in completely different contexts.


To elaborate, errors go beyond data and reach into model design. Two simple examples:

1. Nucleotides are a form of tokenization and encode bias. They're not as raw as people assume. For example, classic FASTA treats modified and canonical C as identical. Differences may alter gene expression -- akin to "polish" vs. "Polish".

2. Sickle-cell anemia and other diseases are linked to nucleotide differences. These single nucleotide polymorphisms (SNPs) mean hard attention for DNA matters and single-base resolution is non-negotiable for certain healthcare applications. Latent models have thrived in text-to-image and language, but researchers cannot blindly carry these assumptions into healthcare.
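To make point 1 concrete, here is a toy single-base tokenizer with an extended alphabet. The "M" symbol for 5-methylcytosine is an arbitrary stand-in chosen for illustration, not a standard encoding:

```python
# Sketch: single-base tokenization that keeps methylated cytosine distinct
# from canonical C, instead of collapsing both to "C" as classic FASTA does.
VOCAB = {ch: i for i, ch in enumerate("ACGTM")}  # M = 5-methylcytosine (made up)

def tokenize(seq: str) -> list[int]:
    """One token per base: no merging, so single-nucleotide differences
    (SNPs, methylation) remain visible to the model."""
    return [VOCAB[ch] for ch in seq]

plain      = "ACGTC"
methylated = "ACGTM"   # classic FASTA would render both as "ACGTC"
print(tokenize(plain))       # [0, 1, 2, 3, 1]
print(tokenize(methylated))  # [0, 1, 2, 3, 4]
```

A FASTA-style representation would map both sequences to identical token streams, erasing exactly the distinction that may drive differential gene expression.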

There are so many open questions in biomedical AI. In our experience, confronting them has prompted (pun intended) better inductive biases when designing other types of models.

We need way more people thinking about biomedical AI.


> What was true last year may be false today. For instance, ...

Good example of a medical QA dataset shifting but not a good example of a medical "fact" since it is an opinion. Another way to think about shifting medical targets over time would be things like environmental or behavioral risk factors changing.

Anyway, thank you for putting this dataset together; we certainly need more third-party benchmarks with careful annotation. I think it would be wise to segregate tasks between factual observations of data, population-scale opinions (guidelines/recommendations), and individual-scale opinions (prognosis/diagnosis). Ideally there would eventually be some formal taxonomy for this, like OMOP CDM; maybe one already exists in some dusty corner of PubMed.


What if there is significant disagreement within the medical profession itself? For example, isotretinoin is prescribed for acne in many countries, but in other countries the drug is banned or access is restricted due to adverse side effects.


Would not one approach be to just ensure the system has all the relevant data: side effects, legal constraints, and so on? Then, when making a recommendation, it can account for all factors, not just prior use cases.


If you agree that ML starts with philosophy, not statistics, this is but one example highlighting how biomedicine helps model development, LLMs included.

Every fact is born an opinion.

This challenge exists in most, if not all, spheres of life.


I think an often overlooked aspect of training data curation is the value of accurate but oblique data. Much of the "emergent capabilities" of LLMs come from information embedded in the data, implied or inferred semantic information that is not readily obvious. Extraction of this highly useful information, in contrast to specific factoids, requires a lot of off-axis images of the problem space, like a CT scan of the field of interest. The value of adjacent oblique datasets should not be underestimated.


I noticed this when adding citations to wikipedia.

You may find a definition of what a "skyscraper" is, by some hyperfocused association, but you'll get a bias towards a definite measurement like "skyscrapers are buildings between 700m and 3500m tall", which might be useful for some data-mining project, but is not at all what people mean by it.

The actual definition is not in a specific source but in the way the word is used across other sources, like "the Manhattan skyscraper is one of the most iconic skyscrapers". In the aggregate you learn what it is, but it isn't very citable on its own, which gives WP that pedantic bias.


Synthetic data generation techniques are increasingly being paired with expert validation to scale high-quality biomedical datasets while reducing annotation burden - especially useful for rare conditions where real-world examples are limited.


Centaur Labs does medical data labeling https://centaur.ai/


Isn't labelling medical data for AI illegal as unlicensed medical practice?

Same thing with law data


Paralegals and medical assistants don’t need licenses


I think their question is a good one, and not being taken charitably.

Lets take the medical assistant example.

> Medical assistants are unlicensed, and may only perform basic administrative, clerical and technical supportive services as permitted by law.

If they're labelling data as "tumor" or "not tumor", with any agency in the process, does that fit within their unlicensed scope? Or would that labelling be closer to a diagnosis?

What if the AI is eventually used to diagnose, based on data that was labeled by someone unlicensed? Should there need to be a "chain of trust" of some sort?

I think the answer to liability will be all on the doctor agreeing/disagreeing with the AI...for now.


To answer this, I would think we should consider other cases where someone could do medicine-adjacent work without legally practicing medicine. For example, could they tutor a student and help them? Go through unknown cases and make judgements, explaining their reasoning? As long as they don't oversell their experience in a way that might be considered fraud, I don't think this would be practicing medicine.

It does open something of a loophole. Oh, I wasn't diagnosing a friend, I was helping him label a case just like his as an educational experience. My completely IANAL guess would be that judges would look on it based on how the person is doing it, primarily if they are receiving any compensation or running it like a business.

But wait... the example the OP was talking about is doing it like a business and likely doesn't have any disclaimers properly sent to the AI, so maybe that doesn't help us decide.


A bit simpler, but if they are training the AI to answer law questions or medical questions (specific to a case, and not general), then that's what I would argue is unlicensed practice.

Of course it's the org and not the individual who would be practicing, as labelling itself is not practicing.


No.


Illegal?


The author is a respected voice in tech and a good proxy of investor mindset, but the LLM claims are wrong.

They are not only unsupported by recent research trends and general patterns in ML and computing, but also by emerging developments in China, which the post even mentions.

Nonetheless, the post is thoughtful and helpful for calibrating investor sentiment.


What is wrong about their claims?


Agreed. There is deep potential for ML in healthcare. We need more contributors advancing research in this space. One opportunity as people look around: many priors merit reconsideration.

For instance, genomic data that may seem identical may not actually be identical. In classic biological representations (FASTA), canonical cytosine and methylated cytosine are both collapsed into the letter "C" even though differences may spur differential gene expression.

What's the optimal tokenization algorithm and architecture for genomic models? How about protein binding prediction? Unclear!

There are so many open questions in biomedical ML.

The openness-impact ratio is arguably as high in biomedicine as anywhere else: if you help answer some of these questions, you could save lives.

Hopefully, awesome frameworks like this lower barriers and attract more people.


I'd love to hear more of your thoughts re open questions in biomedical ML. You sound like you have a crisp, nuanced grasp of the landscape, which is rare. That would be very helpful to me, as a CS undergrad (with bio) trying to crystallize research to pursue in bio/ML/GenAI.

Thank you.


Thanks, but no one truly understands biomedicine, let alone biomedical ML.

Feynman's quote -- "A scientist is never certain" -- is apt for biomedical ML.

Context: imagine the human body as the most devilish operating system ever: 10b+ lines of code (more than merely genomics), tight coupling everywhere, zero comments. Oh, and one faulty line may cause death.

Are you more interested in data, ML, or biology (e.g., predicting cancerous mutations or drug toxicology)?

Biomedical data underlies everything and may be the easiest starting point because it's so bad/limited.

We had to pay Stanford doctors to annotate QA questions because existing datasets were so unreliable. (MCQ dataset partially released, full release coming).

For ML, MedGemma from Google DeepMind is open and at the frontier.

Biology mostly requires publishing, but still there are ways to help.

After sharing preferences, I can offer a more targeted path.


ML first, then bio and data. Of course, interconnectedness runs high (e.g., I just read about ML for non-random missingness in medical records), and data is the foundational bottleneck/need across the board.

Interesting anecdote about Stanford doctors annotating QA questions!

Each of your comments gets my mind going... I'm going to think about them more and may ping you on other channels, per your profile. Thanks!


More like alarming anecdote. :) Google did a wonderful job relabeling MedQA, a core benchmark, but even they missed some (e.g., question 448 in the test set remains wrong according to Stanford doctors).

For ML, start with MedGemma. It's a great family. 4B is tiny and easy to experiment with. Pick an area and try finetuning.

Note the new image encoder, MedSigLIP, which leverages another cool Google model, SigLIP. It's unclear if MedSigLIP is the right approach (open question!), but it's innovative and worth studying for newcomers. Follow Lucas Beyer, SigLIP's senior author and now at Meta. He'll drop tons of computer vision knowledge (and entertaining takes).

For bio, read 10 papers in a domain of passion (e.g., lung cancer). If you (or AI) can't find one biased/outdated assumption or method, I'll gift a $20 Starbucks gift card. (Ping on Twitter.) This matters because data is downstream of study design, and of course models are downstream of data.

Starbucks offer open to up to three people.


Thank you both for an illuminating thread. Comments were concise, curious, and dense with information. Most notably, there was respectful disagreement and a levelheaded exchange of perspective.


Thank you! I try, but often I fail.


To provide more color on cancers caused by viruses, the World Health Organization (WHO) estimates that 9.9% of all cancers are attributable to viruses [1].

Cancers with established viral etiology or strong association with viruses include:

- Cervical cancer
- Burkitt lymphoma
- Hodgkin lymphoma
- Gastric carcinoma
- Kaposi’s sarcoma
- Nasopharyngeal carcinoma (NPC)
- NK/T-cell lymphomas
- Head and neck squamous cell carcinoma (HNSCC)
- Hepatocellular carcinoma (HCC)

[1] https://pmc.ncbi.nlm.nih.gov/articles/PMC8831861


Nvidia (NVDA) generates revenue with hardware, but digs moats with software.

The CUDA moat is widely unappreciated and misunderstood. Dethroning Nvidia demands more than SOTA hardware.

OpenAI, Meta, Google, AWS, AMD, and others have long failed to eliminate the Nvidia tax.

Without diving into the gory details, the simple proof is that billions were spent on inference last year by some of the most sophisticated technology companies in the world.

They had the talent and the incentive to migrate, but didn't.

In particular, OpenAI spent $4 billion, 33% more than on training, yet still ran on NVDA. Google owns leading chips and leading models, and could offer the tech talent to facilitate migrations, yet still cannot cross the CUDA moat and convince many inference customers to switch.

People are desperate to quit their NVDA-tine addiction, but they can't for now.

[Edited to include Google, even though Google owns the chips and the models; h/t @onlyrealcuzzo]


The CUDA moat is largely irrelevant for inference. The code needed for inference is small enough that there are e.g. bare-metal CPU only implementations. That isn't what's limiting people from moving fully off Nvidia for inference. And you'll note almost "everyone" in this game are in the process of developing their own chips.


My company recently switched from A100s to MI300s. I can confidently say that in my line of work, there is no CUDA moat. Onboarding took about a month, but afterwards everything was fine.


Alternatives exist, especially for mature and simple models. The point isn't that Nvidia has 100% market share, but rather that they command the most lucrative segment and none of these big spenders have found a way to quit their Nvidia addiction, despite concerted efforts to do so.

For instance, we experimented with AWS Inferentia briefly, but the value prop wasn't sufficient even for ~2022 computer vision models.

The calculus is even worse for SOTA LLMs.

The more you need to eke out performance gains and ship quickly, the more you depend on CUDA and the deeper the moat becomes.


LLM inference is fine on ROCm. llama.cpp and vLLM both have very good ROCm support.

LLM training is also mostly fine. I have not encountered any issues yet.

Most of the CUDA moat comes from people repeating what they heard 5-10 years ago.


> OpenAI, Meta, AWS, AMD, and others have long attempted to eliminate the Nvidia tax, yet failed.

Gemini / Google runs and trains on TPUs.

You have no incentive to infer on AMD if you need to buy a massive Nvidia cluster to train.


Meta trains on Nvidia and infers on AMD. There is incentive if your inference costs are high.


Meta also has a second generation of their own AI accelerator chips designed.


Google was omitted because they own the hardware and the models, but in retrospect, they represent a proof point nearly as compelling as OpenAI. Thanks for the comment.

Google has leading models operating on leading hardware, backed by sophisticated tech talent who could facilitate migrations, yet Google still cannot leap over the CUDA moat and capture meaningful inference market share.

Yes, training plays a crucial role. This is where companies get shoehorned into the CUDA ecosystem, but if CUDA were not so intertwined with performance and reliability, customers could theoretically switch after training.


> yet Google still cannot leap over the CUDA moat and capture meaningful inference market share.

It's almost as if being a first-mover is more important than whether or not you use CUDA.


Both matter quite a bit. The first-mover advantage obviously rewards OEMs in a first-come, first-serve order, but CUDA itself isn't some light switch that OEMs can flick and get working overnight. Everyone would do it if it was easy, and even Google is struggling to find buy-in for their TPU pods and frameworks.

Short-term value has been dependent on how well Nvidia has responded to burgeoning demands. Long-term value is going to be predicated on the number of Nvidia alternatives that exist, and right now the number is still zero.


Google has a self-inflicted wound in the time it takes to get an API key.


The fact that this comment is DOWNVOTED despite being literally 1000% true is evidence that HN is full of loonies.


It's unclear why this drew downvotes, but to reiterate, the comment merely highlights historical facts about the CUDA moat and deliberately refrains from assertions about NVDA's long-term prospects or that the CUDA moat is unbreachable.

With mature models and minimal CUDA dependencies, migration can be justified, but this does not describe most of the LLM inference market today nor in the past.


Nadella is a superb CEO, inarguably among the best of his generation. He believed in OpenAI when no one else did and deserves acclaim for this brilliant investment.

But his "below them, above them, around them" quote on OpenAI may haunt him in 2025/2026.

OAI or someone else will approach AGI-like capabilities (however nebulous the term), fostering the conditions to contest Microsoft's straitjacket.

Of course, OAI is hemorrhaging cash and may fail to create a sustainable business without GPU credits, but the possibility of OAI escaping Microsoft's grasp grows by the day.

Coupled with research and hardware trends, OAI's product strategy suggests the probability of a sustainable business within 1-3 years is far from certain but also higher than commonly believed.

If OAI becomes a $200b+ independent company, it would be against incredible odds given the intense competition and the Microsoft deal. PG's cannibal quote about Altman feels so apt.

It will be fascinating to see how this unfolds.

Congrats to OAI on yet another fantastic release.


To address the downvotes, this comment isn't guaranteeing OAI's success. It merely notes the remarkably elevated probability of OAI escaping Nadella's grip, which was nearly unfathomable 12 months ago.

Even after breaking free, OAI must still contend with intense competition at multiple layers, including UI, application, infrastructure, and research. Moreover, it may need to battle skilled and powerful incumbents in the enterprise space to sustain revenue growth.

While the outcome remains highly uncertain, the progress since the board fiasco last year is incredible.

