I have a question about how to shape the distribution of a learned variable with the loss function.
I want to train two variables to separate signal from background, and I want the two variables to be uncorrelated on the background. I did this by recasting the signal-background problem as a regression with two targets and adding the correlation between the two outputs (the standard Pearson “r”), computed on the background only, to the loss function. It worked well.
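To make the setup concrete, here is a minimal sketch of the kind of loss described above. It is not the actual training code: the names `pearson_r` and `decorrelated_loss`, the `weight` parameter, the use of MSE for the regression term, and the choice of r² as the penalty are all assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def pearson_r(x: torch.Tensor, y: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Differentiable Pearson correlation coefficient of two 1-D tensors."""
    xc = x - x.mean()
    yc = y - y.mean()
    return (xc * yc).sum() / (xc.norm() * yc.norm() + eps)

def decorrelated_loss(pred, target, is_background, weight=1.0):
    """Regression loss plus a correlation penalty on the background.

    pred:          (N, 2) network outputs (the two learned variables)
    target:        (N, 2) regression targets
    is_background: (N,) boolean mask selecting background rows
    weight:        hypothetical knob balancing the two terms
    """
    mse = F.mse_loss(pred, target)
    bkg = pred[is_background]
    r = pearson_r(bkg[:, 0], bkg[:, 1])
    # r**2 penalizes positive and negative correlation alike.
    return mse + weight * r ** 2
```

Because the penalty only sees r, any arrangement of the background with r close to 0 minimizes it, whether or not the two variables are actually independent.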
The problem is that the network did exactly what it was told and “over” learned the loss function. The background ends up distributed along a “check mark” pattern, which is technically uncorrelated (r is close to zero) even though the two variables are clearly not independent.
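A toy illustration of why r alone is not enough. It uses an idealized, symmetric V shape rather than the real network output, and `pearson_r` is a helper name made up for this sketch.

```python
import torch

def pearson_r(x, y, eps=1e-8):
    """Pearson correlation coefficient of two 1-D tensors."""
    xc = x - x.mean()
    yc = y - y.mean()
    return (xc * yc).sum() / (xc.norm() * yc.norm() + eps)

# Idealized "check mark": y is completely determined by x.
x = torch.linspace(-1.0, 1.0, 201)
y = x.abs()

r_all = pearson_r(x, y)                  # close to 0 over the whole sample
r_left = pearson_r(x[x < 0], y[x < 0])   # close to -1 on the falling arm
r_right = pearson_r(x[x > 0], y[x > 0])  # close to +1 on the rising arm
print(r_all.item(), r_left.item(), r_right.item())
```

The two arms cancel in the global r, which is exactly the loophole the network found. A correlation computed over smaller regions would see the individual arms.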
I decided the right way forward was a loss function that requires not only good signal/background separation and decorrelation, but also that each of the two variables follows a normal distribution. The problem is that I do not know how to express this in the loss function. Options I’ve looked at:
Look at moments of the distribution, like skewness (3rd) and kurtosis (4th). In fact, there are statistical “tests of normality” out there that put moments together in interesting ways. However, my network has enough nodes to learn the functional form of the 3rd- and 4th-order moments, and it trains itself into more geometrical shapes (like the check mark above) that still satisfy them.
Dumb down the network so it can’t learn the specifics of the loss function. I did this by removing nodes, layers, etc. (there is no deep design behind the architecture at the moment). This does help, but it also significantly damages the ability to separate signal from background.
Change the correlation test to run over small regions of the distribution rather than the whole distribution. I’ve not tried this, mostly because I’m worried the network would learn many small “check marks” rather than a single big one.
Compare cumulative distributions directly. There is a sort primitive in PyTorch, and the cumulative probability for a normal distribution is also a PyTorch primitive, so I think I could calculate a probability curve for my distribution and for the normal distribution and then take a mean squared error between them… but this feels like it will be painful. Also, my understanding is that sorting is not differentiable, so I’m worried I’ll run into a limitation of these tools as soon as I start to use the sorted tensor.
That last point got me wondering whether there is a better way to do this, hence this question.
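For reference, a minimal sketch of the sort-based idea from the last option. Several things here are assumptions rather than anything settled: the name `normal_cdf_penalty`, standardizing with the sample mean and standard deviation (so only the shape is tested), and the plotting positions (i + 0.5)/n for the empirical curve. On the differentiability worry: `torch.sort` returns sorted values through which gradients do flow back to the inputs (only the returned indices are non-differentiable), so the sketch runs under autograd. Whether it trains well, or whether the network finds another loophole, is a separate question.

```python
import math
import torch

def normal_cdf_penalty(x: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """MSE between the empirical CDF of x and the standard normal CDF.

    x is standardized first, so this only penalizes a non-normal shape.
    """
    n = x.numel()
    z = (x - x.mean()) / (x.std() + eps)
    z_sorted, _ = torch.sort(z)  # gradients flow through the sorted values
    # Normal CDF evaluated at each sorted sample.
    model_cdf = 0.5 * (1.0 + torch.erf(z_sorted / math.sqrt(2.0)))
    # Empirical CDF at the same points: (i + 0.5) / n for i = 0..n-1.
    empirical_cdf = (torch.arange(n, dtype=x.dtype, device=x.device) + 0.5) / n
    return torch.mean((model_cdf - empirical_cdf) ** 2)

torch.manual_seed(0)
gaussian = torch.randn(5000)
skewed = torch.rand(5000) ** 3  # deliberately non-normal

# The Gaussian sample should score lower than the skewed one.
print(normal_cdf_penalty(gaussian).item(), normal_cdf_penalty(skewed).item())
```

This acts on one variable at a time. It would be applied to each of the two outputs on the background and added to the loss with its own weight, and by itself it says nothing about the joint distribution of the two.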