
Thanks for your added comments; it's really helpful to see a more candid breakdown of others' views of a paper.

>I'm afraid I haven't seen this approach, but it would be interesting. Do you have references?

They are referenced in the paper, in this section:

>Another way to attack this problem is to study the dynamics of a specific algorithm for a specific neural network architecture. Our paper also belongs to this category. Many previous works put assumptions on the input distribution and assume the label is generated according to a planted neural network. Based on these assumptions, one can obtain global convergence of gradient descent for some shallow neural networks [Tian, 2017, Soltanolkotabi, 2017, Brutzkus and Globerson, 2017, Du et al., 2018a, Li and Yuan, 2017, Du et al., 2017b]. Some local convergence results have also been proved [Zhong et al., 2017a,b, Zhang et al., 2018]. In comparison, our paper does not try to recover the underlying neural network. Instead, we focus on the empirical loss minimization problem and rigorously prove that randomly initialized gradient descent can achieve zero training loss.

I had this idea independently but never pursued it for lack of time. That's another reason I like this paper: it references this approach. I still haven't had time to read those references, but from the description in this section it appears they investigate nearly, if not exactly, what I wanted to investigate "some day" :)
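For anyone unfamiliar with the "planted network" setup the quoted passage describes, here is a toy sketch in NumPy: labels are generated by a fixed random two-layer ReLU "teacher" net, and a wider randomly initialized "student" net is trained by plain gradient descent on the squared loss. All widths, step sizes, and the fixed-second-layer choice are illustrative assumptions on my part, not details taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Planted "teacher": labels come from a fixed random two-layer ReLU net.
d, m_teacher, n = 5, 3, 200
W_star = rng.normal(size=(m_teacher, d))
a_star = rng.normal(size=m_teacher)
X = rng.normal(size=(n, d))
y = np.maximum(X @ W_star.T, 0) @ a_star

# Over-parameterized "student": wider net, random init; only the first
# layer W is trained, the second layer a is frozen at random signs
# (a common simplification in this line of analysis).
m = 100
W = rng.normal(size=(m, d))
a = rng.choice([-1.0, 1.0], size=m)

def loss(W):
    pred = np.maximum(X @ W.T, 0) @ a / np.sqrt(m)
    return 0.5 * np.mean((pred - y) ** 2)

lr = 0.1
losses = [loss(W)]
for _ in range(500):
    H = np.maximum(X @ W.T, 0)            # hidden activations, n x m
    pred = H @ a / np.sqrt(m)
    resid = (pred - y) / n                # d(loss)/d(pred), length n
    mask = (X @ W.T > 0).astype(float)    # ReLU derivative
    grad = ((resid[:, None] * mask) * a / np.sqrt(m)).T @ X  # m x d
    W -= lr * grad
    losses.append(loss(W))

print(losses[0], losses[-1])  # training loss should drop substantially
```

The paper's point, as I read the quoted paragraph, is that one can prove the training loss goes to zero without recovering `W_star` itself; this sketch only illustrates the experimental setup, not the proof.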


