tk42's comments

Thanks :) Hope it will be of use to people. We don't use Solr, we use Elasticsearch, and we also don't hold the unzipped fingerprints in the same format the default echoprint does. But for this dump we exported everything in the official echoprint format so people can use it with the default cluster.


- Now works with Vimeo & Dailymotion

- Exact time for analysis can be passed

- Added an additional third-party fingerprinting technology (Doreso) to find snippets Echonest can't identify

- False positives reduced drastically, as we can be stricter about thresholds due to the above point

- Works much better on mixtapes with altered BPMs due to the above two points

- Finds and embeds additional sources for identified songs


So far we've been hesitant to go that route, as the current 7m fingerprints come down to 42m+ documents in Lucene. I.e. if we were to save 20 versions of each fingerprint in different BPMs, we'd quickly have about a billion documents on our hands to be searched on every query (with ~1000-2000 hashes per document).
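The back-of-envelope arithmetic behind that concern can be made explicit. Note the per-fingerprint document count and the 20 BPM variants are taken from the comment above; this is just the multiplication, not anything from the actual system:

```python
# Rough index-size estimate using the numbers from the comment above.
fingerprints = 7_000_000
docs_per_fingerprint = 42_000_000 / fingerprints  # ≈ 6 Lucene docs per fingerprint
bpm_variants = 20                                 # hypothetical BPM-shifted copies

projected_docs = fingerprints * docs_per_fingerprint * bpm_variants
print(f"{projected_docs:,.0f}")  # 840,000,000 -- the "about a billion" ballpark
```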


Hug of death occurred faster than expected :) Scaling now.


The user can paste a YouTube URL, which will then be analysed, fingerprinted, and matched against a database of 7+ million audio fingerprints. It doesn't just identify a single song: it can identify multiple songs contained in a single file or video, and it generates a timeline listing which tracks the file contains at which time.

Our matching algorithm is based on the open source echoprint-codegen fingerprinting method, which we have built our own stack around:

- Replaced Solr/Tokyo Tyrant with Elasticsearch

- Reimplemented matching-logic

- Crawlers search multiple sources for audio files to be indexed (MP3s aren't stored long-term, only fingerprinted and then deleted)

- Indexing about 1 new track per second

- Found a method to verify unreliable ID3 tags (in progress; the current database also includes unverified ones)

- MogileFS as primary data store for fingerprints

- Perl everything

We also provide a free music identification API.
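The core idea behind echoprint-style matching against an index like the one described above can be sketched in a few lines. This is a toy illustration of inverted-index voting, not the site's actual Perl/Elasticsearch code; all names and the scoring threshold are made up:

```python
from collections import Counter

index = {}  # hash code -> list of track ids containing that code

def add_track(track_id, codes):
    """Index a track by each distinct hash code in its fingerprint."""
    for code in set(codes):
        index.setdefault(code, []).append(track_id)

def best_match(query_codes, min_score=0.2):
    """Let each query code vote for candidate tracks; return the top scorer."""
    votes = Counter()
    for code in set(query_codes):
        for track_id in index.get(code, ()):
            votes[track_id] += 1
    if not votes:
        return None
    track_id, hits = votes.most_common(1)[0]
    score = hits / len(set(query_codes))  # fraction of query codes matched
    return (track_id, score) if score >= min_score else None

add_track("song-a", [1, 5, 9, 12, 40])
add_track("song-b", [2, 5, 77, 90, 91])
print(best_match([1, 9, 12, 40, 99]))  # ('song-a', 0.8)
```

A real deployment replaces the dict with an Elasticsearch/Lucene inverted index and adds time-offset consistency checks, but the vote-and-score shape is the same.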

Any feedback would be much appreciated!


Does it include music from oft-used music sources in YouTube videos, such as AudioMicro and Incompetech[1]? I'd guess it's not really possible on the AudioMicro front, as it's paid-for music that'd cost you a fortune to index, but it may be worth adding the latter.

[1] http://incompetech.com/music/royalty-free/


Where do you get the MP3s from in the first place, and how long did it take to index 7 million?


Crawling the internet for MP3s. It took us a couple of months to get to 7m.


That's an absolutely creative way of doing it! Congrats. I'm interested in how you did it. For example, how exactly do you "crawl the internet"? Did you have a bunch of sites that you'd pre-selected and then just crawl those, or did you actually follow through on links?

Thanks.


One-click hosters and cloud hosters with sharing options (such as docs.google.com), several music streaming sites, Usenet. Also, the amount of MP3s hosted on regular webservers, indexed by and easily found using the usual search engines, is mind-blowing.


I am also interested in knowing more specifics about your crawling process (if you can divulge).

Did you just crawl random sites and search for .mp3 content on them? Or did you have a set of pre-defined sites to crawl?


Until you posted this clarification, my first thought was: what makes it different from Shazam?

Thanks for clearing that up, good luck with your site!


I've worked on the echoprint-codegen algorithm for my current project (trak.rocks), and I'm curious how you reimplemented the matching logic.

Do you plan to document/open-source your work?


First we rewrote Echonest's truescore logic in Perl, then altered it slightly and implemented some extra checks to further try to exclude false positives. We also believe that what they used in the late song/identify API might have been different from what is open-sourced at https://github.com/echonest/echoprint-server

Also, we pack each individual hash before storing it in Elasticsearch, and gained at least 50% storage space this way.
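The packing trick can be illustrated with a tiny example. The exact on-disk format in their stack isn't public; this just shows how storing a ~20-bit code as raw bytes rather than as an ASCII hex string halves the space, consistent with the "at least 50%" figure above:

```python
import struct

hash_code = 0xABCDE  # a hypothetical 20-bit echoprint-style hash code

as_text = f"{hash_code:08x}"              # "000abcde": 8 ASCII bytes
as_packed = struct.pack(">I", hash_code)  # 4 raw big-endian bytes

print(len(as_text), len(as_packed))  # 8 4 -> packed form is 50% smaller
```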

Our fingerprint data is quite different from theirs (unreliable ID3 tags, N versions of the same track), which is why we needed some tweaks. So far the matching is still far from perfect...

Whether we will open source the whole thing at some point we don't know yet.


song/identify supported both ENMFP and Echoprint, and AFAIK the Echoprint matching path was exactly the same as is published on Github.

I know at some point we did adapt the Solr end (for example, we removed the N most frequently occurring codes) for speed optimizations.

Many users of Echoprint in the wild have adapted the python matching logic for their use case as well as changed the hash update rate on the codegen. A great modification to watch was Sunlight Labs' "Ad Hawk", which ID'd commercials: https://github.com/sunlightlabs/adhawk


When you say the matching is far from perfect, is that at your end or on the part of the echoprint/echonest code? You made tweaks because you found issues with what they were doing...?


The reason for it being far from perfect is likely a combination of both. If the correct song is indexed, there's a high probability we'll find the right match. However, if it's not, with a bit of bad luck a false positive can happen easily with the default solution (and ours too).

Also, when analysing a YouTube video it can happen that in a 30-second snippet only 10 seconds are a matching song and 20 are unrelated, or 15 seconds are one matching song and the other 15 match a different one, in which case two tracks (or multiple versions of two different tracks) will have relatively OK scores. Deciding what to consider a match (or whether to try different queries for the same or a slightly altered timespan before deciding) is not trivial in these cases. Our changes mostly concern when a match will be considered a match: altering thresholds, and how a match's truescore is looked at in relation to other fingerprints' truescores. Due to issues like these, specifying a timeframe for analysis will often produce better results.
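The threshold-plus-relative-score decision described above might look something like the following sketch. The function name, the absolute floor, and the dominance margin are all invented for illustration, not the site's actual values:

```python
def decide_match(scored_candidates, floor=0.30, margin=1.5):
    """scored_candidates: list of (track_id, truescore), best score first.

    Accept the top candidate only if it clears an absolute floor AND
    clearly dominates the runner-up; otherwise report no match.
    """
    if not scored_candidates:
        return None
    best_id, best = scored_candidates[0]
    if best < floor:
        return None  # nothing scored confidently enough
    if len(scored_candidates) > 1:
        _, second = scored_candidates[1]
        if second > 0 and best / second < margin:
            return None  # ambiguous: two tracks score similarly (split snippet?)
    return best_id

print(decide_match([("a", 0.60), ("b", 0.20)]))  # a
print(decide_match([("a", 0.35), ("b", 0.30)]))  # None (ambiguous)
```

The second call shows the mixtape case from the comment: two plausible candidates with close scores are rejected rather than guessed at, which is why retrying with a narrower timespan helps.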

http://static.echonest.com/echoprint_ismir.pdf


Do you have any intuition for whether the echoprint-codegen algorithm would be suitable for saying whether two voice recordings match? One would be a little lossy, the other pretty much perfect.


Echonest can work with voice but is optimized for music, so you might encounter a lot of false positives with it. Check out the Echonest board on Google; it's a recurring topic.


It'd be good to have an option to provide an email address which results could be sent to once it has them, rather than keep checking an open tab.


Good idea, we'll keep it in mind. Bookmarking the ident page and coming back later to check on results will work too.


My software engineering process works as follows:

I try to code as much as possible as early as possible. I throw away lots of stuff and recode it. Besides that, I keep an eye out for stuff that is "similar" and can be abstracted. If someone wants an estimate, I guess as well as possible.

Big code is ideally split into one-person chunks, each with a documented API, but sometimes many people have to work on the same "files". Then big code is split between multiple people who sit nearby and communicate personally while discussing implementations based on technical arguments.

How to make a product out of software is a different story. But I guess it works when you design your product in estimatable pieces and adapt fast to changing requirements.

Also I am pretty sure I forgot one or two things...


But don't forget the cards.


The state of JavaScript in 2015: it still sucks. I just wish the web browser makers would implement a way for other languages to be executed as a replacement for JavaScript. Not gonna happen, I know...


In my mind it's in the same spot as PHP: the language could hardly be worse, but so much quality code is written in both that you begin to accept them.

Lucky if you can work with a disciplined team...


JavaScript, while it was rushed and had questionable design goals, at least was somewhat designed. You can't say that about PHP.

Also, I'm not sure I can get on board with "lots of quality code in PHP": some, sure, but it seems that anyone who knows PHP and another language prefers to write in the other language.


The early PHP releases indeed weren't thought through fully, but recent versions are quite well taken care of.

There are emergent (PSR) standards nicely backed by all the leaders in the community (Symfony, Laravel, Silex, Drupal 8). The Doctrine ORM is also an extremely pleasant way to interact with a relational DB (on par with Python's SQLAlchemy). I very deeply hated PHP, but nowadays I find it fun to develop and ship software written in PHP.

As for JavaScript, one week is nowhere near what I'd call "designing a language." It has quite the same evolutionary history as PHP, except for being rediscovered a year or two ago.


No dude. Sorry, your opinion is wrong. If recent versions were indeed "taken quite good care of", there wouldn't be >5000 functions all existing in the global scope, with completely inconsistent naming and arity patterns. There wouldn't be exactly one non-scalar data type, the so-called "associative array" monstrosity. I could go on. I don't think you'd care.

It's also funny that you listed 4 actors as "all the leaders" when Laravel uses mostly Symfony code, Silex is made by Symfony, and Drupal is known to be terrible.

The last time I had to program something real in PHP, I got caught up trying to debug network calls (all PHP debugging is an exercise in futility; you just never know: is it going to throw an error, return 0, return -1/false/something else, or is it something that gets swallowed by the interpreter?) and tried to use a try/catch to catch all exceptions. So pretty much "catch (Exception $e)". Except that, since I was trying to be good and use PSR, I was using namespaces, and PHP does not search the root namespace unless you are in the root namespace. Which is ridiculous, and PHP is the only language I know of that does that. The really infuriating part, however, was that the interpreter gave no indication whatsoever that something was wrong; it just silently failed. Making it ironic that by dint of using an error construct, I was prevented from seeing errors in my code.

Congratulations on knowing that JavaScript was written in a week. However, if you had actually dug deeper, you would have known that its creator was a bit of a language nerd who was conceiving something different but was forced to shoehorn it into a more Java-esque paradigm at the eleventh hour. When I use JavaScript, I don't have to RTFM to find out what order str_replace takes its arguments in for the 3000th time (and don't give me the string functions/array functions bit, because it isn't actually true). Nope, I can just "string".replace(find, replace), because the language was designed. Whereas PHP to this day seems to be a slipshod cadre of hackers (not in the elite FBI break-in sense, but in the duct tape and baling twine sense) blithely adding whatever feature scratches their particular itch, with no care or concern for the rest of the ecosystem.

I don't think JavaScript is a great language. But it is an understandable language, a predictable language. When I have to use PHP it makes me want to quit my job. When I have to use PHP it makes me want to quit being a software developer.


I look at it the same way.

