Yeah, I felt the same way. Most papers do not say much about features: why they were chosen, or what was learned about the dataset. There is only a high-level architecture of the system, and possibly some comparisons. One of the reasons I wrote the post is that I wanted to shed some light on the parts that books and papers do not focus on.
I got some feedback about proofreading and grammar for the post. I will definitely try to do better next time, possibly by having someone review it. I guess I cannot fix being a non-native speaker, but at least next time I can be more careful about sloppiness and grammar.
So, I tried clustering the outlines, but they are generally quite short and describe the plot rather than what the movies are really about. So I am a little skeptical that it would yield good results for subtitles.
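For concreteness, the pipeline I have in mind is roughly the following (a minimal sketch with scikit-learn; the outlines and cluster count are placeholders, not my actual setup):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans

    # Placeholder outlines; the real ones come from the IMDB data.
    outlines = [
        "A retired hitman takes one last job.",
        "Two friends road-trip across the country.",
        "A detective hunts a serial killer in a rainy city.",
    ]

    # Short texts give very sparse TF-IDF vectors, which is part of the
    # problem: there is little signal beyond a few plot keywords.
    vectors = TfidfVectorizer(stop_words="english").fit_transform(outlines)

    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(vectors)
    print(kmeans.labels_)

With texts this short, clusters tend to form around surface keywords rather than themes, which is why I doubt subtitles would do much better.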
I missed the comment in the previous one, sorry about that.
I am not familiar with the theory, but if I could get the writer information (right now I only have the directors), that would be an interesting aspect of the data to investigate.
1. Yes, I agree with you; as I wrote in the first post, that is mostly due to selection bias. http://bugra.github.io/work/notes/2014-02-15/imdb-top-100K-m...
2. That is true, but I think that holds for __all__ of the movies, not just a subset of them.
3. I am not quite sure about that, but I think it would be quite interesting to look at how that temporal information correlates with the rating of the movie.
That is exactly why I generally prefer median-like averaging methods over mean-like averages in this type of crowdsourced system. Generally speaking, the votes in the middle of the scale are distributed evenly, whereas the ratings at the extremes are not. The movies you gave as examples could be due to the preferences of voters for particular types of movies.
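To make the difference concrete, here is a toy sketch (the vote counts are made up to mimic a polarising film):

    import numpy as np
    from scipy import stats

    # Made-up vote distribution: heavy tails of 1s and 10s, few middling votes.
    votes = np.concatenate([
        np.full(400, 1),
        np.full(150, 5),
        np.full(150, 6),
        np.full(500, 10),
    ])

    print(np.mean(votes))               # ~5.9, pulled around by the extremes
    print(np.median(votes))             # 6, ignores the tails entirely
    print(stats.trim_mean(votes, 0.1))  # ~6.0, mean after trimming 10% per tail

The trimmed mean discounts the tails without ignoring the middle, which is the kind of robustness I mean.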
I don't think it would make much odds for IMDB. Sure, you would maybe negate the overly high proportion of extreme votes, but I don't think that effect is large. I did once take the top films on IMDB and remove the extreme votes, and the effect was marginal: a few films shuffled a place or so. In exchange, if you use the median, polarising films like Twilight could have their ratings change markedly over time, as people decide to start warring over them and the numbers swamp the small number of ambivalent voters.
For your other point, there will be voting patterns for various demographics, and certain groups of people prefer certain movies. But I think the phenomenon I'm describing is a bit too extreme to be merely a bunch of old women disproportionately taking a dislike to Ikiru and then rushing onto the internet to tell the world they hate it. It is the smallest demographic IMDB has, so it's not hurting the ratings of the films in general; it's just an oddity.
1. The timestamps are only years, so at least with the information IMDB provides I cannot do that kind of analysis. It would be quite interesting, though. What time of year a movie is released, and how that affects its overall success, could be another interesting angle; see the sketch after this list.
2. That is a good suggestion, but I think we would need more votes and, more importantly, users whose votes could be taken as a baseline for the quality of a movie.
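For the first point, the year-level version of the analysis would start out as simple as this (a sketch; the columns and numbers are made up, and with month data the same groupby would work per season):

    import pandas as pd

    # Made-up movie table; the real columns would come from the IMDB data.
    movies = pd.DataFrame({
        "year":   [1994, 1994, 2003, 2003, 2012],
        "rating": [8.7, 7.9, 6.5, 7.1, 6.9],
    })

    # Average rating and movie count per release year.
    print(movies.groupby("year")["rating"].agg(["mean", "count"]))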
I am also very interested in the correlation between rating and director. However, I do not have budget information for the movies. It would be great if I could find budget data and combine it with the data I have. I had not thought of that; really good suggestion. Thanks!
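If budget data did turn up, combining it would just be a join; a sketch with made-up titles and figures:

    import pandas as pd

    movies  = pd.DataFrame({"title": ["Ikiru", "Twilight"], "rating": [8.3, 5.2]})
    budgets = pd.DataFrame({"title": ["Ikiru", "Twilight"], "budget": [130000, 37000000]})

    # Inner join on title; real data would need fuzzier matching (year, IMDB id).
    merged = movies.merge(budgets, on="title", how="inner")
    print(merged)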
It is true that as runtime increases, the number of movies decreases. However, if you look at mainstream runtimes (>80 and <100 minutes), movies generally get quite a variety of ratings for a given runtime, if not a uniform spread. On the other hand, movies with longer runtimes generally get a higher number of votes.
You can observe the same behavior in the rating vs. number-of-votes graph: as the number of votes increases, the number of movies decreases. However, rating and number of votes correlate quite strongly.
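The correlation itself is easy to check (a sketch; the numbers are made up, and rank correlation seems safer than Pearson given how skewed vote counts are):

    import pandas as pd
    from scipy import stats

    # Made-up rows standing in for the real IMDB table.
    movies = pd.DataFrame({
        "runtime": [85, 92, 110, 150, 180],
        "votes":   [1200, 5400, 20000, 90000, 250000],
        "rating":  [6.1, 6.4, 7.0, 7.8, 8.2],
    })

    # Spearman rank correlation is robust to the heavy right tail in vote counts.
    print(stats.spearmanr(movies["votes"], movies["rating"]))
    print(stats.spearmanr(movies["runtime"], movies["rating"]))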
I wonder why long movies are consistently rated higher.
My guess would be two reasons. The first is audience selection. Long films are (IMO) more likely to attract, and retain for their entire length, an audience with a prior interest in the film, and therefore an audience more likely to know beforehand whether the movie is generally good or not.
Secondly, psychology comes into play. The more time an audience invests in a film, the more likely they are to seek a positive reward for their time so they don't feel like they have got a bad deal[1]. Thus, they're more likely to rate the film higher than they otherwise would. I also believe this holds true for 'art house' films that are difficult to follow and perhaps less enjoyable than a more mainstream film. Audiences will rate them higher to reassure themselves that they haven't just wasted two hours watching something boring that they don't understand.