Hacker News | hamsterbooster's comments

I would say the opposite. We want to make sure we build our systems in a way that they get better as the foundational models become better.

Our thesis is that foundational models will become good enough and affordable enough to be used in almost all data processing pipelines. We build systems on top of them to manage the workflows, integrations, and data applications that people may want to develop.


Seems like you want foundational models to become better at doing what you want when you give them your "magic" prompt, while not becoming smart enough to not need your magic prompt at all.

I'd need to actually dig into your product to make an informed statement, but my guess is this: if you build your business around AI secret sauce, you're going to get your business eaten and pivot or fail; if you build your business around a UI and the specific integrations/tools that real customers you're already in contact with want right now, you'll be OK.


Thanks! We got quite a few good enterprise leads from Intercom chats.


Thanks! There are still a lot of amazing hardware companies and vertical applications in our YC batch.

We believe that AI is only one part of our product. A significant amount of value comes from building robust integrations with different data sources and managing the business logic that operates on top of this unstructured data.


In many use cases, like flagging documents for compliance issues or processing customer emails, it's challenging to manage this at the vendor level because end customers want the ability to apply business logic and run different analyses.

For data ingestion and mapping, I agree that in an ideal world, we would all have first-party API integrations. However, many industries still rely on PDFs and CSV files to transfer data.


Perhaps I'm misunderstanding the product offering here, but isn't this just throwing PDFs (which also have content that's hard to parse, like formulas, symbols, and large tables, even with OCR) at an LLM with structured outputs and running SQL queries?
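For readers unfamiliar with the pattern being described, here is a minimal sketch of "structured outputs plus SQL": extract JSON matching a fixed schema from a document's text via an LLM, load the rows into a database, and query them. The field names are illustrative assumptions, and `llm_call` stands in for whatever provider API you use.

```python
# Sketch of the "PDF -> LLM structured output -> SQL" pipeline described above.
# The schema (vendor, total, currency) is an assumed example, not any vendor's API.
import json
import sqlite3


def extract_invoice_fields(pdf_text: str, llm_call) -> dict:
    """Ask an LLM to emit JSON matching a fixed schema.

    `llm_call` is any function that takes a prompt string and returns the
    model's text response (assumed to be valid JSON here)."""
    prompt = (
        "Extract these fields from the document as JSON with keys "
        "vendor (string), total (number), currency (string):\n\n" + pdf_text
    )
    return json.loads(llm_call(prompt))


def load_and_query(rows: list) -> list:
    """Load extracted rows into an in-memory table and run an example query."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE invoices (vendor TEXT, total REAL, currency TEXT)")
    conn.executemany(
        "INSERT INTO invoices VALUES (:vendor, :total, :currency)", rows
    )
    return conn.execute(
        "SELECT vendor, SUM(total) FROM invoices GROUP BY vendor"
    ).fetchall()
```

In practice the hard parts sit around this loop: OCR quality, malformed JSON, and validating that the model's output actually matches the schema.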

Isn't it obvious that this is a problem that will eventually be solved by the LLM providers themselves, including the ability to flag and apply business logic on top of the structured outputs?

Like, I'm not sure if this is well known, but LLM providers are under huge pressure to turn a profit and will not hesitate to copy any downstream wrappers out of existence rather than acquire them outright.

It's like selling wrapping tape for the shovel handle for better grip and expecting the shovel makers not to release their new shovels with it built in.

The shovel makers don't even need to do any market research or product development, and the buyers have no incentive to seek out or pay a dedicated third party for what their vendors will release for free and at lower cost, if that makes sense.


This is an understandable misunderstanding. Consider a parallel: why does subscription/recurring billing software exist when payment gateways could solve that problem themselves? The elephant in the room is the complexity further down the funnel, which needs very specific focus and solutions.


Then please elaborate on the "complexities involved down the funnel" and where I am misunderstanding, with examples.


A few that we experience as we’re building Trellis out:

1. Managing end-to-end workflows: integrating with data sources, automatically triggering new runs when new data comes in, and keeping track of the different business logic involved (e.g., classify the type of email and, based on that, apply different extraction logic).

2. Most out-of-the-box solutions only get you 95% of the way there. Customers want the ability to pass in their own data to improve performance and to specify their unique ontology.

3. Building good UI and API support so both technical and non-technical users can use the product.
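Point 1's classify-then-route pattern can be sketched as a small dispatch table: classify each incoming item, then apply the extraction logic registered for that type. The email types and extractor bodies below are assumptions for illustration; in a real pipeline `classify` would typically be an LLM call.

```python
# Hypothetical sketch of routing items to type-specific extraction logic.
from typing import Callable, Dict

EXTRACTORS: Dict[str, Callable[[str], dict]] = {}


def register(email_type: str):
    """Decorator that registers an extractor for a given email type."""
    def wrap(fn):
        EXTRACTORS[email_type] = fn
        return fn
    return wrap


@register("invoice")
def extract_invoice(body: str) -> dict:
    # Placeholder extraction logic for invoice-type emails.
    return {"type": "invoice", "length": len(body)}


@register("support")
def extract_support(body: str) -> dict:
    # Placeholder extraction logic for support-type emails.
    return {"type": "support", "length": len(body)}


def process(body: str, classify: Callable[[str], str]) -> dict:
    """Classify an email, then run the matching extractor."""
    email_type = classify(body)  # e.g. an LLM classification call
    extractor = EXTRACTORS.get(email_type)
    if extractor is None:
        raise ValueError(f"no extraction logic for type {email_type!r}")
    return extractor(body)
```

The registry makes it easy to add new business logic without touching the trigger/ingestion code, which is roughly the workflow-management value being claimed.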


too generic


Thanks for the feedback. We built Trellis based on our experience with ingesting and analyzing unstructured customer calls and chats in a reliable way. We couldn’t find a good solution apart from developing a dedicated ML pipeline, which is quite difficult to maintain.

There are some elements that might resemble Dagster, but I believe the challenging part is constructing validation systems that ensure high accuracy and correct schemas while processing all kinds of complex PDFs and document edge cases. Over the past few weeks, our engineering team has spent a lot of time developing a vision model robust enough to extract nested tables from documents.
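A minimal sketch of the validation idea mentioned above: check each extracted row against an expected schema before accepting it, and flag rows that fail. The field names and rules here are assumptions for illustration, not Trellis's actual schema.

```python
# Hypothetical schema check for LLM extraction output.
SCHEMA = {"vendor": str, "total": float}


def validate(row: dict) -> list:
    """Return a list of problems; an empty list means the row passes."""
    problems = []
    for field, expected in SCHEMA.items():
        if field not in row:
            problems.append(f"missing field: {field}")
        elif not isinstance(row[field], expected):
            problems.append(f"{field}: expected {expected.__name__}")
    # Domain rule, only checked once the shape is valid.
    if not problems and row["total"] < 0:
        problems.append("total must be non-negative")
    return problems
```

Rows that fail would be routed to a human-review queue rather than silently dropped, which is where much of the pipeline complexity lives.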


What is your metric and score? Maybe you have reached perfect reliability, but in my experience information extraction is about 90% accurate in real-life scenarios, and you can't reliably know which 90%.

In critical scenarios companies won't risk using 100% automation, the human is still in the loop, so the cost doesn't go down much.

I work on LLM-based information extraction and use my own evaluation sets. That's how I obtained the 90% figure. I tested on many document types. It looks like magic when you try an invoice in GPT-4o and skim the outputs, but if you spend 15 minutes you find issues.
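For concreteness, a field-level accuracy score like the ~90% figure above might be computed against a hand-labeled gold set roughly as follows. This is a generic sketch, not the commenter's actual evaluation code.

```python
# Hypothetical field-level accuracy over an extraction evaluation set.
def field_accuracy(predictions: list, gold: list) -> float:
    """Fraction of (document, field) pairs where the prediction matches gold.

    `predictions` and `gold` are parallel lists of dicts, one per document."""
    correct = 0
    total = 0
    for pred, truth in zip(predictions, gold):
        for field, expected in truth.items():
            total += 1
            if pred.get(field) == expected:
                correct += 1
    return correct / total if total else 0.0
```

Exact-match per field is a strict metric; real evaluations often add normalization (whitespace, number formats) before comparing, which is itself a source of disagreement about the "true" score.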

Can you risk an OCR error confusing a dot for a comma and sending 1000x more money in a bank transfer, or getting a medical data extraction wrong so that someone suffers, because there was no human in the document ingestion pipeline to see what was happening?


Really neat sandbox! I wonder how this would scale to large amounts of data. Do you plan to support other data sources, like audio and video, as well?


Insitro (led by Daphne Koller) just raised $400M and seems to be making good progress applying AI to drug discovery (https://insitro.com/).


What progress have they made (other than raising more money)?


Totally agree that a library is easier to maintain and easier for developers to use. However, services are a lot easier to monetize and let the company collect any data it wants. I can't think of any billion-dollar company that releases its core library to users.


Libraries can also be a PITA to maintain when you need widespread platform support, including for legacy/crufty systems. I quite enjoy the freedom of owning the underlying platform with cloud deployments versus my on-prem software days.


Hi Peter, thank you for doing this. I am a student founder at a university in the US on a J-1 visa (with the two-year requirement in place). I had an interview with YC last summer and plan to apply again. If I get into YC and plan to continue working on the startup full-time, what would my options be in terms of immigration and visas?


Neither the B-1 nor the O-1 is subject to the 2-year home residence requirement, so one of these might work. You also might want to kick off the waiver process at some point.


This would be an amazing dataset to train ML/CV on. There are images + accurate labels.


What would you have it learn from name, address and photo? Face recognition?


Perhaps some sort of correlation to password strength for dictionary-type attacks.


Ah yes - let’s forget about a person’s right to privacy entirely! Great!


At this point in 2020 there must have been over 1,000 data breaches, oftentimes containing (and reaffirming) the same data, such as a person's address (and across breaches, we can track that same person's data over time, which can be useful).

You can't have a globally connected society, data storage in the billions of records, and a 'right to privacy' at the same time. It's not possible, these breaches just reaffirm that.

How long before the next data leak pops up on HN again?

Privacy died after 2001, and there's nothing to prove otherwise.

Am I wrong?


"Someone did something unethical, and we can probably benefit from it.

And since people will probably do the unethical thing again, and the system which enabled people to do the unethical thing isn't perfect, we don't even really need to worry about whether using that to our benefit is unethical.

In fact, it's been years since we really needed to worry about that sort of thing anyway.

Isn't it just easier to ignore our personal responsibility to try to do the right thing in this situation?"


I think this was mentioned elsewhere, but this is different from many other cases.

In this case, you cannot choose to opt out, not if you wish to drive. Further, this isn't just an ID, but a whole range of info attached to that ID.

As an example, by law in my jurisdiction, I am legally obliged to keep medical info, address, and other info up to date for my license.

I believe breaches of government ID must be held to a far higher standard than a lost credit card. IMO Equifax goes into that slot, for it falls into the "cannot escape" category.

As an example, I have never used Airbnb, due to their mandatory ID requirements, regardless of their claims that uploaded ID is deleted after vetting. After all, Equifax's compromise went on for almost a year, and in such a case each upload, while waiting to be vetted, could be copied before deletion.

So I use alternative services, like VRBO, or Craigslist even. I have choice, options, and am not compelled by law to use any of these services.

As soon as I am compelled, by the threat of violence (arrest, etc.), to do a thing, you'd better perform the utmost due diligence.

And maybe, not give entire databases to others, for profit?

