Hacker News | mo0's comments

They display audio info if tracks from the AudioSwap library were used: https://support.google.com/youtube/answer/94316?hl=en


First we rewrote Echonest's truescore logic in Perl, then altered it slightly and implemented some extra checks to further exclude false positives. We also believe that what they used in the late song/identify API may have been different from what is open-sourced at https://github.com/echonest/echoprint-server
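For context, the scoring idea in the open-sourced Echoprint server is, roughly, a vote over time-offset differences: codes that agree on both hash value and a consistent time shift between query and stored track pile up in one histogram bin. A minimal sketch of that idea (names and data layout are illustrative, not the reference implementation):

```python
from collections import Counter

def true_score(query_codes, stored_codes):
    """Rough sketch of Echoprint-style offset-histogram scoring.

    query_codes / stored_codes: lists of (hash, time_offset) pairs.
    Codes that share a hash AND imply a consistent time shift vote
    for the same bin; the winning bin's size is the score.
    """
    # Index the stored track's codes by hash value.
    stored = {}
    for h, t in stored_codes:
        stored.setdefault(h, []).append(t)

    # For every hash the query shares with the stored track,
    # record the difference of the two time offsets.
    deltas = Counter()
    for h, t in query_codes:
        for t_stored in stored.get(h, []):
            deltas[t_stored - t] += 1

    # A real match concentrates its votes in one (or a few) bins.
    return max(deltas.values()) if deltas else 0
```

A query that is simply a time-shifted copy of a stored track scores its full code count, e.g. `true_score([(1, 0), (2, 5), (3, 9)], [(1, 10), (2, 15), (3, 19)])` returns 3, because every shared hash agrees on a shift of 10.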

We also pack each individual hash before storing it in Elasticsearch, which saved at least 50% of the storage space.
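The comment doesn't say how the packing works; one plausible sketch is serializing the (hash, offset) pairs into a fixed-width binary blob and base64-encoding it for the index, instead of storing them as verbose JSON. The field widths here (32-bit hash, 16-bit offset) are assumptions:

```python
import base64
import struct

RECORD = "<IH"  # assumed: 32-bit hash + 16-bit time offset = 6 bytes
RECORD_SIZE = struct.calcsize(RECORD)

def pack_codes(codes):
    """Pack (hash, offset) pairs into one compact base64 blob.

    base64 keeps the result safe to store in a JSON / Elasticsearch
    string field, at a fraction of the size of per-pair JSON objects.
    """
    blob = b"".join(struct.pack(RECORD, h, t) for h, t in codes)
    return base64.b64encode(blob).decode("ascii")

def unpack_codes(packed):
    """Inverse of pack_codes: recover the list of (hash, offset) pairs."""
    blob = base64.b64decode(packed)
    return [struct.unpack_from(RECORD, blob, i)
            for i in range(0, len(blob), RECORD_SIZE)]
```

Six bytes per code versus a few dozen bytes of JSON per code easily explains a 50%+ saving, though the actual scheme they used may differ.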

Our fingerprint data is quite different from theirs (unreliable ID3 tags, N versions of the same track), which is why we needed some tweaks. So far the matching is still far from perfect...

Whether we will open source the whole thing at some point we don't know yet.


song/identify supported both ENMFP and Echoprint, and AFAIK the Echoprint matching path was exactly the same as what is published on GitHub.

I know that at some point we did adapt the Solr end (for example, we removed the N most frequently occurring codes) as a speed optimization.
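Dropping the N most frequent codes is analogous to dropping stop words in text search: codes that occur in huge numbers of tracks contribute little discriminating power but dominate query cost. A sketch of the idea (the value of N and the data shapes are illustrative):

```python
from collections import Counter

def drop_common_codes(query_codes, code_frequencies, n=100):
    """Drop the n globally most frequent codes ("stop codes") from a query.

    query_codes:      list of (hash, offset) pairs making up the query.
    code_frequencies: Counter of how often each hash appears in the index.
    """
    stop_codes = {h for h, _ in code_frequencies.most_common(n)}
    return [c for c in query_codes if c[0] not in stop_codes]
```

Usage: with index-wide frequencies `Counter({1: 100, 2: 50, 3: 1})` and `n=2`, a query `[(1, 0), (2, 1), (3, 2)]` is pruned to `[(3, 2)]`.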

Many users of Echoprint in the wild have adapted the python matching logic for their use case as well as changed the hash update rate on the codegen. A great modification to watch was Sunlight Labs' "Ad Hawk", which ID'd commercials: https://github.com/sunlightlabs/adhawk


When you say the matching is far from perfect, is that at your end or in the echoprint / echonest code? You made tweaks because you found issues with what they were doing...?


The reason it's far from perfect is likely a combination of both. If the correct song is indexed, there is a high probability that we find the right match. However, if it's not, a false positive can easily happen with a bit of bad luck, both with the default solution and with ours.

Also, when analysing a YouTube video it can happen that in a 30-second snippet only 10 seconds are a matching song and 20 are unrelated, or 15 seconds match one song and the other 15 match a different one, in which case two tracks (or multiple versions of two different tracks) will have relatively OK scores. Deciding what to consider a match (or whether to try different queries for the same or a slightly altered timespan before deciding) is not trivial in these cases. Our changes mostly concern when a match will be considered a match, by altering thresholds and by looking at how a match's truescore relates to the truescores of the other fingerprints. Due to issues like these, specifying a timeframe for analysis will often produce better results.
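One common shape for such a decision rule (the specific thresholds here are illustrative, not the ones from the thread) is to accept the top candidate only if its score both clears an absolute floor and dominates the runner-up by some margin, which rejects exactly the ambiguous half-and-half snippets described above:

```python
def pick_match(scored, min_score=20, min_margin=1.5):
    """Decide whether the best-scoring candidate counts as a match.

    scored: list of (track_id, true_score) pairs for one query.
    Returns the winning track_id, or None if no confident match.
    The margin test rejects cases where two different tracks score
    almost equally, e.g. a snippet that straddles two songs.
    """
    if not scored:
        return None
    ranked = sorted(scored, key=lambda x: x[1], reverse=True)
    best_id, best_score = ranked[0]
    if best_score < min_score:
        return None  # nothing scored well enough
    if len(ranked) > 1 and best_score < min_margin * ranked[1][1]:
        return None  # ambiguous: the runner-up is too close
    return best_id
```

For example, `pick_match([("a", 100), ("b", 10)])` returns `"a"`, while `pick_match([("a", 30), ("b", 25)])` returns `None` because the two candidates are too close to call.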

http://static.echonest.com/echoprint_ismir.pdf


Crawling the internet for mp3s. It took us a couple of months to get to 7m.


That's an absolutely creative way of doing it! Congrats. I'm interested in how you did it. For example, how exactly do you "crawl the internet"? Did you have a bunch of pre-selected sites that you then crawled, or did you actually follow links?

Thanks.


One-click hosters and cloud hosters with sharing options (such as docs.google.com), several music streaming sites, and Usenet. Also, the number of mp3s hosted on regular webservers that are indexed by and easily found through the usual search engines is mind-blowing.


I am also interested in knowing more specifics about your crawling process (if you can divulge).

Did you just crawl random sites and search for .mp3 content on them? Or did you have a set of pre-defined search sites to crawl?


Music used in compilations, ads, intros, etc. are typical use cases. Mixtapes & livesets are of course also great. But when pitch/BPM has been altered by more than 1-2% we currently get a lot of false positives. We are still working on a solution to this.


I think I've read on the Echonest board that the most common solution is to index multiple pitch variants of the same songs. Apparently that's what Shazam does.

Also, the guys from Trax-air.com are doing something pretty similar to you guys, but with pitch/BPM-bending support.
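The multiple-variants approach amounts to enumerating a set of resampling factors (each one changing pitch and tempo together, like a turntable), rendering the audio at each factor, and fingerprinting every rendition. A sketch of just the enumeration step, with step sizes chosen only for illustration:

```python
def speed_variants(step_pct=1.0, max_pct=10.0):
    """Enumerate resampling factors for indexing sped-up/slowed-down
    variants of a track, e.g. ±1% .. ±10% in 1% steps -> 20 extras.

    Each factor would be applied to the audio before running the
    fingerprint codegen on the resulting variant.
    """
    steps = int(max_pct / step_pct)
    return [1 + sign * k * step_pct / 100
            for sign in (-1, 1)
            for k in range(1, steps + 1)]
```

With the defaults this yields 20 factors from 0.90 to 1.10, which is exactly the kind of multiplier on index size discussed below.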


So far we've been hesitant to go that route, as the current 7m fingerprints already come down to 42m+ documents in Lucene. I.e., if we were to save 20 versions of each fingerprint at different BPMs, we'd quickly have a billion+ documents on our hands to be searched on every query (with ~1000-2000 hashes per document).
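The back-of-envelope arithmetic behind that claim: 42m documents over 7m fingerprints means roughly 6 segment documents per fingerprint, so 20 variants per track multiplies the index to about 840 million documents, i.e. close to a billion.

```python
fingerprints = 7_000_000
docs_per_fingerprint = 42_000_000 / fingerprints  # ~6 segment docs each
pitch_variants = 20

total_docs = fingerprints * docs_per_fingerprint * pitch_variants
print(int(total_docs))  # 840000000
```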


Good idea, we'll keep it in mind. Bookmarking the ident page and coming back later to check on the results would work too.

