
In addition to the classic n-gram approaches everyone is suggesting, it is now fairly easy to use a word2vec (google it) representation of the words instead. Obtain a mapping from each word to a fixed-length array of numbers (either by finding a pretrained word2vec model online or by training one on texts from your own domain), and then just run clustering on those vectors instead of on the raw words.
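To make the idea concrete, here is a rough sketch of the "map words to vectors, then cluster the vectors" step. Everything in it is an assumption for illustration: the `WORD_VECTORS` table is a made-up 3-dimensional toy (real word2vec vectors typically have a few hundred dimensions and would come from a pretrained or self-trained model), and the clustering is a plain k-means written out by hand and seeded with two chosen words so the result is deterministic. Any clustering algorithm could be swapped in.

```python
import math

# Hypothetical toy embeddings. In practice these would be looked up in a
# pretrained word2vec model or one trained on your own corpus.
WORD_VECTORS = {
    "cat":    [0.90, 0.10, 0.00],
    "dog":    [0.80, 0.20, 0.10],
    "kitten": [0.95, 0.05, 0.00],
    "car":    [0.10, 0.90, 0.20],
    "truck":  [0.00, 0.95, 0.10],
    "bus":    [0.10, 0.85, 0.30],
}


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def kmeans(vectors, initial_centroids, iterations=20):
    """Plain k-means; returns the cluster index assigned to each vector."""
    centroids = [list(c) for c in initial_centroids]
    labels = []
    for _ in range(iterations):
        # Assignment step: attach each vector to its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda i: euclidean(v, centroids[i]))
            for v in vectors
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for i in range(len(centroids)):
            members = [v for v, label in zip(vectors, labels) if label == i]
            new_centroids.append(mean(members) if members else centroids[i])
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return labels


def cluster_words(word_vectors, seed_words):
    """Group words by clustering their vectors, one cluster per seed word."""
    words = list(word_vectors)
    vectors = [word_vectors[w] for w in words]
    seeds = [word_vectors[w] for w in seed_words]
    labels = kmeans(vectors, seeds)
    clusters = [[] for _ in seed_words]
    for word, label in zip(words, labels):
        clusters[label].append(word)
    return clusters


clusters = cluster_words(WORD_VECTORS, seed_words=["cat", "car"])
print(clusters)
```

With real embeddings you would only replace the `WORD_VECTORS` table with lookups into your model; the clustering step does not care where the numbers came from. Cosine distance is often preferred over Euclidean for word vectors, so that is worth trying too.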

