As a step in that direction, you can take some inspiration from what the brain does: as you learn things better and better, the knowledge essentially gets pulled down to neuronal layers that are closer to sensory input. This leaves the higher layers more free to do other stuff (potentially reusing the results from the surface layers), which is a step in the direction of optimizing for future flexibility.
It's possible to create rules that operate on an already-trained network and push it in this direction without totally destroying what it's learned, by "fuzzing" the original network to generate a bunch of input/output pairs, and then using that dataset to retrain smaller sub-networks. For instance, if you have a 5-layer network that you've trained on a classification task, you can often use that network as a teacher to train a smaller network to do pretty damn well on the same classification task, even in some cases where training the smaller network directly would have been very difficult. There are several reasons this trick can work, not the least of which is that it's effectively a way to expand the training set dramatically.
NB: the above approach is probably not how you'd implement this; less crude methods that incentivize shallower layers to have more activation than deeper ones would probably work better.
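For concreteness, here's a toy sketch of the crude fuzz-and-retrain loop, i.e. the standard teacher-student setup. The network shapes, learning rate, and step counts are all arbitrary choices for illustration; the "teacher" here is just a fixed random MLP standing in for a trained network:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for an already-trained teacher: a fixed 2-layer MLP.
# (In the text this would be the 5-layer classification network.)
W1, b1 = rng.normal(size=(8, 16)), rng.normal(size=16)
W2, b2 = rng.normal(size=(16, 4)), rng.normal(size=4)

def teacher(x):
    return np.tanh(x @ W1 + b1) @ W2 + b2

# "Fuzz" the teacher: random inputs become an effectively unlimited
# synthetic training set, labeled by the teacher's own outputs.
X = rng.normal(size=(2000, 8))
Y = teacher(X)

# Much smaller student: a single linear layer, fit to the teacher's
# outputs by plain gradient descent on MSE.
Ws, bs = np.zeros((8, 4)), np.zeros(4)
lr = 0.01
for _ in range(500):
    err = X @ Ws + bs - Y
    Ws -= lr * X.T @ err / len(X)
    bs -= lr * err.mean(axis=0)

initial_mse = float(np.mean(Y ** 2))  # student at zero weights
final_mse = float(np.mean((X @ Ws + bs - Y) ** 2))
print(f"mse before: {initial_mse:.3f}  after: {final_mse:.3f}")
```

The student can't match the nonlinear teacher exactly, but its error drops well below the starting point; for classifiers, the same recipe with softened logits is the usual distillation setup.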
I can easily imagine a phased training strategy that oscillates between a) learning new things by making the deeper layers more malleable and the shallower ones fairly rigid, and b) compressing all the data by opening up the shallow layers to change and replaying input/output into itself. I have no idea if there are any benchmarks around this sort of thing, though; benchmarks typically have fixed goals, so the ability to retrain for additional tasks isn't really measured.
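A toy version of that oscillation, with a two-layer net as a stand-in (layer sizes, learning rates, and the replay-anchoring weight are all made up for illustration): phase (a) keeps the shallow layer rigid and trains the deep one on the task; phase (b) snapshots the net's own input/output behavior, then opens up only the shallow layer, using the replayed outputs as an anchor so the earlier learning isn't destroyed.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy net: W_shallow is the layer "close to sensory input",
# W_deep the higher layer.
W_shallow = rng.normal(size=(4, 8))
W_deep = rng.normal(size=(8, 2)) * 0.1

def forward(x, Ws, Wd):
    return np.tanh(x @ Ws) @ Wd

# An arbitrary toy regression task.
X = rng.normal(size=(500, 4))
Y = np.sin(X[:, :2])

def task_mse(Ws, Wd):
    return float(np.mean((forward(X, Ws, Wd) - Y) ** 2))

loss_before = task_mse(W_shallow, W_deep)

# Phase (a): shallow layer rigid, deep layer malleable.
lr = 0.05
for _ in range(300):
    h = np.tanh(X @ W_shallow)
    err = h @ W_deep - Y
    W_deep -= lr * h.T @ err / len(X)

loss_mid = task_mse(W_shallow, W_deep)

# Phase (b): snapshot the net's own input/output behavior...
Y_replay = forward(X, W_shallow, W_deep)
# ...then open up only the shallow layer. The replay term anchors
# the net to its snapshot so this phase can't wreck phase (a).
beta, lr = 0.5, 0.02
for _ in range(500):
    h = np.tanh(X @ W_shallow)
    out = h @ W_deep
    err = (out - Y) + beta * (out - Y_replay)
    grad_h = err @ W_deep.T * (1 - h ** 2)  # backprop through tanh
    W_shallow -= lr * X.T @ grad_h / len(X)

loss_after = task_mse(W_shallow, W_deep)
print(f"{loss_before:.3f} -> {loss_mid:.3f} -> {loss_after:.3f}")
```

In a real version phase (b) would presumably also add pressure for the shallow layer to absorb more of the computation (the "less crude methods" mentioned above); this sketch only shows the freeze/unfreeze-plus-replay mechanics.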