In my analysis both signal and background have a significant number of negative weights, around 30%.
My question is how to use these weights correctly for training my neural network.
I am fairly sure that simply passing negative weights to the model cannot be correct, because it would mean that negative-weight events always get misclassified.
Therefore I have been training with either only the positive-weight events or the absolute values of the weights. Using absolute weights in particular gives me good results. Unfortunately, when I validate the model with the proper weights, its behaviour can be very inconsistent.
Negative-weight events are removed from the response plots and leave gaps in the distribution if they are not evenly distributed.
Have you ever trained a neural network on samples with a high fraction of negative weights? If so, what do you think is the correct way to do this?
I’m not sure I follow the logic here:
A negative weight for some event in some region of phase space is telling you that events in its neighborhood are over-represented. Within the neural network the negative weight will invert the loss gradient for that event, which should decrease the overall loss gradient into that region of phase space.
In other words, since using weights is a trick generators use to get the correct event density, using them in the network should train a network that assumes the correct event density as well. If you omit the weights or otherwise mess with them by taking absolute values, you’re training a network on an unphysical event density.
If I understand the math correctly, then, at least in Keras, assigning a negative weight just puts a negative sign on that event’s loss. This is why the region does not become less important; instead, the training has an incentive to mislabel events with negative weights.
But this is exactly the point where I am unsure how to handle it.
I do not know how to properly tell the network to decrease the loss gradient instead of messing it up.
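To make the sign argument concrete, here is a minimal NumPy sketch. It is not Keras code; it only assumes, as described above, that the per-event binary cross-entropy is multiplied by the event weight. The function names are made up for illustration.

```python
import numpy as np

def weighted_bce(p, y, w):
    """Per-event weighted binary cross-entropy: w * BCE(y, p)."""
    eps = 1e-12
    p = np.clip(p, eps, 1.0 - eps)
    return -w * (y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

def weighted_bce_grad(p, y, w):
    """Derivative of the weighted loss with respect to the prediction p."""
    return w * (p - y) / (p * (1.0 - p))

# A signal event (y = 1) that the model currently scores at p = 0.7.
p, y = 0.7, 1.0
grad_pos = weighted_bce_grad(p, y, w=+1.0)  # negative: descent raises p
grad_neg = weighted_bce_grad(p, y, w=-1.0)  # positive: descent lowers p

print(grad_pos, grad_neg)
```

Taken alone, the negative-weight event does push the prediction away from its label. Within a batch its gradient is added to those of the neighbouring positive-weight events, so the net effect depends on the sum of weights in that region.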
You’re right about how Keras treats negative weights, but I think this point needs clarification:
Can you clarify what you mean by “mislabel”? A signal event with a negative weight is telling the neural network that events in its neighborhood are background-like. That seems like the correct incentive: if the sum of signal weights is lower than the sum of background weights in some region, the network should classify events there as background, even if, in the absence of weights, there are more signal events populating the region.
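As a toy illustration of the "sum of weights" argument: for a single region of phase space, minimising the weighted cross-entropy over one constant output p gives p = S / (S + B), where S and B are the summed signal and background weights in that region (this holds as a minimum only when both sums are positive). The numbers below are invented for the example.

```python
import numpy as np

# Hypothetical weights for events falling in one small region.
sig_w = np.array([1.0, 1.0, 1.0, -1.0, -1.0])  # 5 signal events, sum = 1
bkg_w = np.array([1.0, 1.0])                   # 2 background events, sum = 2

S, B = sig_w.sum(), bkg_w.sum()

# Closed-form minimiser of  -S*log(p) - B*log(1-p)
p_closed = S / (S + B)

# Cross-check with a brute-force scan over p.
p_grid = np.linspace(0.001, 0.999, 999)
loss = -(S * np.log(p_grid) + B * np.log(1.0 - p_grid))
p_numeric = p_grid[np.argmin(loss)]

# What you would get if the weights were ignored (pure event counts).
unweighted_fraction = len(sig_w) / (len(sig_w) + len(bkg_w))

print(p_closed, p_numeric, unweighted_fraction)
```

With the weights, the region comes out background-like (p below 0.5), although signal events outnumber background events five to two.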
“A signal event with a negative weight is telling the neural network that events in its neighborhood are background-like”
I’m not sure about that. Negative weights are used to correct the differential cross section, and it can very well be that negative-weight signal events are very signal-like. They should be used so that the local pdf is correctly estimated (by subtracting from the contribution of the positive-weight events), whether when computing the loss or performance plots like AUC or significance.
My take on this issue:
- negative-weight events should absolutely be used (with their negative weights) for all performance plots (even if some sklearn functions might choke)
- removing them or taking the absolute value when training might work (i.e. give better performance), but this is accidental for a specific dataset and should not be seen as a general recipe
- algorithms might be made more robust with respect to negative weights, but there will always be problems when negative weights accidentally accumulate (negative pdf!) somewhere in the training feature space
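For the first point, one way around library functions that reject negative weights is to build the ROC curve by hand. The following is a rough sketch in plain NumPy; `weighted_roc_auc` is a made-up helper, it ignores tied scores, and it assumes the summed signal and background weights are both positive.

```python
import numpy as np

def weighted_roc_auc(scores, labels, weights):
    """ROC curve and AUC from signed event weights (ties not treated)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    weights = np.asarray(weights, dtype=float)

    # Scan the threshold from the highest score downwards.
    order = np.argsort(-scores, kind="stable")
    labels, weights = labels[order], weights[order]

    sig_w = np.where(labels == 1, weights, 0.0)
    bkg_w = np.where(labels == 0, weights, 0.0)

    # Cumulative weighted efficiencies, starting from (0, 0).
    tpr = np.concatenate([[0.0], np.cumsum(sig_w) / sig_w.sum()])
    fpr = np.concatenate([[0.0], np.cumsum(bkg_w) / bkg_w.sum()])

    # Trapezoidal integral of tpr over fpr.
    auc = np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0)
    return fpr, tpr, auc

# Toy example: perfectly separated classes, one negative-weight signal event.
scores = [0.9, 0.8, 0.7, 0.2, 0.1]
labels = [1, 1, 1, 0, 0]
weights = [1.0, 1.0, -1.0, 1.0, 1.0]
fpr, tpr, auc = weighted_roc_auc(scores, labels, weights)
print(auc)
```

Note that with negative weights the cumulative efficiencies need not be monotonic (here the signal efficiency temporarily exceeds 1), which is one reason standard tools may refuse such input.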
This might be where I am confused.
My understanding is that negative weights mean that a certain type of event is over-represented. But that does not necessarily mean that, for example, a signal event with a negative weight should be classified as background-like, because there can still be events of the same kind left that might actually be signal.
By using weights I want to tell the network that a certain region is more or less important, but by using negative weights I am telling the network to specifically misclassify events in a certain region.
I am sorry if this is a major misunderstanding but it is where I am stuck at the moment.
Not sure if this helps, but a single event being “background-like” doesn’t mean that it will be classified as background. The loss is determined over a batch, and even with small batches the NN updates are usually small enough that the network will average over nearby events. If you (over)train a large network for long enough it will start to memorize the individual events, but that’s something people generally avoid (by picking the optimal training based on a disjoint validation sample, for example).
My understanding is that (in the limit of large statistics) there should always be more positive signal weight than negative signal weight, everywhere. This means that in general negative-weighted signal events should inhabit a region with an overall positive signal—the NN won’t classify the whole region as background unless there’s more background there than the total weighted signal.
Anyway, as @droussea says the important thing is that you use all the weights for the validation sample. You can do whatever you want with the training sample, so the effect of taking the absolute weights there is something you can test. Does using the weights give you worse classification on the validation sample than taking the absolute value?
This is pretty much what I have been doing so far.
Thank you everybody for your efforts and answers. It was really good to hear some opinions on this.
I was hoping that there was a smarter way out there to handle this.
I only now discovered this forum, and would like to add a bit of info even if this is two weeks old. As David said, by default you should always consider using event weights (even negative ones) because they are there for a reason: they give the proper distributions and cross sections. Now it is also true that, software-wise, negative weights are not always handled properly (not just with NNs, the same is true of some BDT implementations), so we find workarounds. But do not see this as just an ML trick; check what you are doing.
In many cases, events with negative weights look very similar to positive-weight ones (so, just as signal-like or background-like as the positive-weight events in that class, not like the other class), so using all events with the absolute value of their weight is just fine: you just increase your training stats compared to only using positive-weight events. Just make sure your input distributions do indeed look similar between positive- and negative-weight events.
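A quick way to check that similarity for one input variable is to compare the normalised histograms of the positive- and negative-weight events. The sketch below is only an illustration on toy data: `pos_neg_shape_distance` is a made-up helper, and the 30% negative fraction is taken from the original question.

```python
import numpy as np

def pos_neg_shape_distance(x, w, bins=20):
    """Half the summed |difference| between the normalised histograms
    of positive-weight and negative-weight events (0 = identical shapes)."""
    edges = np.histogram_bin_edges(x, bins=bins)
    pos, neg = w > 0, w < 0
    h_pos, _ = np.histogram(x[pos], bins=edges, weights=w[pos])
    h_neg, _ = np.histogram(x[neg], bins=edges, weights=-w[neg])  # use |w|
    h_pos = h_pos / h_pos.sum()
    h_neg = h_neg / h_neg.sum()
    return 0.5 * np.abs(h_pos - h_neg).sum()

rng = np.random.default_rng(0)
n = 20000
w = np.where(rng.random(n) < 0.3, -1.0, 1.0)  # ~30% negative weights

# Case 1: the weight sign is independent of the feature.
x_same = rng.normal(0.0, 1.0, n)
# Case 2: negative-weight events are shifted by one unit.
x_shifted = x_same + np.where(w < 0, 1.0, 0.0)

dist_same = pos_neg_shape_distance(x_same, w)
dist_shifted = pos_neg_shape_distance(x_shifted, w)
print(dist_same, dist_shifted)
```

A small distance for every input suggests that training on absolute weights should not distort the shapes much; a large one is a warning sign. Where to put the threshold is a judgment call and depends on the sample size.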
If the positive- and negative-weight events do look different, it will be more problematic, because using the absolute value or rejecting negative-weight events will change the input shapes. As Dan mentioned, your network works in batches, and with a decent batch size (or leaf size in a tree) you could hope to always have a positive sum of weights, and then most software will handle this gracefully. But if you have a large fraction of negative weights, the risk of having just one negative batch or leaf becomes non-negligible and will start to cause trouble. Technically there are ways to prepare batches that pair negative-weight events with positive-weight events (TMVA used to have such a feature, I think) to ensure your sums are always positive, but it’s maybe too much trouble.
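To get a feeling for how large that risk is for a given batch size, one can count how often a randomly drawn batch ends up with a non-positive weight sum. This is a toy estimate with unit weights of random sign (30% negative, as in the original question), not a statement about any particular framework; `nonpositive_batch_fraction` is a made-up helper.

```python
import numpy as np

def nonpositive_batch_fraction(weights, batch_size, rng):
    """Fraction of randomly formed batches whose weight sum is <= 0."""
    perm = rng.permutation(len(weights))
    n_batches = len(weights) // batch_size
    # Drop the incomplete last batch and reshape into (n_batches, batch_size).
    batches = weights[perm[: n_batches * batch_size]].reshape(n_batches, batch_size)
    return np.mean(batches.sum(axis=1) <= 0.0)

rng = np.random.default_rng(1)
weights = np.where(rng.random(10000) < 0.3, -1.0, 1.0)

frac_small = nonpositive_batch_fraction(weights, batch_size=4, rng=rng)
frac_large = nonpositive_batch_fraction(weights, batch_size=128, rng=rng)
print(frac_small, frac_large)
```

In this toy, very small batches frequently have a non-positive sum, while batches of 128 essentially never do. Real weights with a broad spread of magnitudes can behave worse than unit weights, so it is worth running such a check on the actual sample.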
In the end what matters is the performance on the validation set, using all weights. If you see biases there, it’s possibly because your handling is inappropriate. You can test this easily: train on positive weights only, check for mismodelling on positive validation events, and then do the same with negative weights only (with their absolute value of course). If each separately seems ok, then your handling of the combination is not good.