How does Google deal with code refactoring? If one splits a code file into two files, how does the algorithm deal with scoring the former file's repository commits? Sure, splitting classes in Java is hard (one file per class), but in languages like C++, C#, and PHP with namespaces it's trivial and happens all the time. (Disclaimer: I read ~80% of the article and comments, and used the browser search function, but found no answer.)
So I actually wrote this article (strange to see it again!)
We don't run the code any more and I would have to go back to the source to check, but I believe it handles renames OK; it doesn't have any special handling for splitting one file into two like that.
It's a somewhat by-the-by issue, as my intuition is that the files that can actually be broken up and successfully refactored are not the same ones that get flagged. The flagged files are the ones that keep churning because no one really knows how to write them any better.
So the follow-up paper that assesses the impact is here [1]
TL;DR: developers just didn't find it useful. Sometimes they knew the code was a hot spot, sometimes they didn't, but knowing a file was a hot spot didn't give them any means of changing it for the better. Imagine a compiler that said "Hey, I think this code you just wrote is probably buggy" but didn't tell you where, and even if you found and fixed the bug, it would keep flagging the file because it had been buggy recently. That's essentially what TWR does. That became understandably frustrating, and we have many other signals developers can act on (e.g. FindBugs); we risked drowning out those useful signals with this one.
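For readers who haven't seen the article: the TWR ("time-weighted risk") idea can be sketched roughly like this. This is a minimal reconstruction from memory of the published description, not Google's actual code; it assumes the sigmoid weighting over normalized commit timestamps described in the original article, so recent bug-fixing commits dominate the score and old ones contribute almost nothing:

```python
import math

def twr_score(bugfix_times, now, t0):
    """Time-weighted risk sketch: each bug-fixing commit touching a
    file contributes more the closer it is to 'now'. Timestamps are
    normalized to [0, 1] over the repo's history, then pushed through
    a steep sigmoid so old fixes contribute almost nothing."""
    score = 0.0
    for t in bugfix_times:
        n = (t - t0) / (now - t0)  # normalize timestamp to [0, 1]
        score += 1.0 / (1.0 + math.exp(-12.0 * n + 12.0))
    return score

# A file fixed recently scores far higher than one with the same
# number of bug fixes long ago:
recent = twr_score([90, 95, 99], now=100, t0=0)
old = twr_score([1, 5, 10], now=100, t0=0)
print(recent > old)  # True
```

This also shows why fixing a hot spot doesn't immediately un-flag it: the recent bug-fixing commits stay in the history, so the score decays only as time passes without further fixes.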
Some teams did find it useful for getting individual team reports so they could focus on places for refactoring efforts, but from a global perspective, it just seemed to frustrate, so it was turned down.
From an academic perspective, I consider the paper one of my most impactful contributions, because it highlights to the bug prediction community some harsh realities that need to be overcome before bug prediction can be useful to humans. So I think the whole project was quite successful... Note that the Rahman algorithm that TWR was based on did pretty well in developer reviews at finding bad code, so it could still be used effectively by automated tools, e.g. test case prioritization, so you hit failures earlier in the test suite. I think automated uses are probably the most fruitful area for bug prediction efforts to focus on in the near-to-mid term.
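As a sketch of that automated use: given per-file risk scores (from TWR, Rahman, or anything else), tests can be reordered so those covering the riskiest files run first. The names and data shapes here are made up for illustration, not from any real test infrastructure:

```python
def prioritize_tests(tests, file_risk):
    """Order tests so those touching the riskiest files run first.
    'tests' maps test name -> list of files it covers; 'file_risk'
    maps file -> predicted bugginess score (higher = riskier)."""
    def risk(test):
        # A test is as risky as the riskiest file it touches.
        return max((file_risk.get(f, 0.0) for f in tests[test]), default=0.0)
    return sorted(tests, key=risk, reverse=True)

tests = {
    "test_parser": ["parser.cc"],
    "test_net": ["net.cc", "util.cc"],
    "test_ui": ["ui.cc"],
}
file_risk = {"net.cc": 0.9, "parser.cc": 0.4, "ui.cc": 0.1}
print(prioritize_tests(tests, file_risk))
# ['test_net', 'test_parser', 'test_ui']
```

The appeal of this setting is that no human has to interpret the score: a mildly accurate ranking still shortens the time to first failure on average, and a wrong ranking costs nothing but ordering.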
I was one of the interviewees for the study (or at least, I remember ranking those three lists as described in the experimental design).
My impression was that the results of the algorithm were pretty accurate, but not very actionable. Very often the files identified were ones the team already knew to be buggy, and there were good reasons they were buggy: e.g. the problem the code was solving was complex, that area of the code was undergoing heavy churn because the problem it solved was a high priority, or the code was ugly but another system was being developed to replace it and it wasn't worth fixing code that was going to be thrown away anyway. In some cases, proposals to fix or refactor the code had been nixed repeatedly by executives.
Basically - not all bugs are created equal. Oftentimes code is buggy because it's important, and the priority is on satisfying user needs rather than fixing bugs.
I work in software reliability (bug finding through dynamic program analysis), which is a domain related to this research.
Most of these machine-learning-based software engineering research tools are built on unrealistic scenarios: full of over-promises, with very little delivered in real life.
Curious why this isn't used any more? It seems like it would have been useful for flagging certain files as worth extended review. Did it not provide the expected benefit(s)? I'm interpreting 'we' as 'Google'...
My interpretation after reading the whole article is that if a file is split into two new files, the history of that hot spot is forgotten. If one new file is created and the old one remains, the old one stays a hot spot only as long as code keeps changing in it.