(P)NN Dev/Test Set Split for HH->bbtautau

Dear ML forum,

As you might know, in HH->bbtautau our final discriminants are based on NN, BDT or PNN models. In our previous publication, and in some of our current analysis channels, we perform a two-fold cross validation, in which half the events are used for training and the other half as a joint dev/test set. We have since expanded the hyperparameter optimisation extensively in some of our analysis channels, using O(100s-1000s) of hyperparameter sets sampled first by random sampling and then by Gaussian Process Regression of the optimisation metric. Our studies so far have kept the same validation strategy, though we have some concern that the extent of the optimisation may now necessitate fully separated train/dev/test sets. Can you offer any advice on whether or not our current strategy will introduce a significant bias in our results?

Many thanks and best wishes,



It’s a good question and one I think is relevant to a lot of analyses.
Here’s a short reply for now outlining my main thoughts.

When training an NN and optimising the hyperparameters, the validation set used in training should be considered part of the “training” data when it comes to applying the model. You are optimising on it to decide when to stop training and which epoch to use, as well as to select the hyperparameters.
These events really should not be used again in the evaluation, as you have tailored your network to them too.
The effect won’t be as drastic as evaluating on the events used for training, but with low statistics it has the potential to cause a problem. In particular, fitting a regression to the validation loss to find a minimum and then using those same events for the final evaluation introduces a risk of overtraining and bias.
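To give a feel for the size of the effect, here is a minimal toy sketch (not the analysis setup). It assumes every hyperparameter set has the same true accuracy, that the dev and test scores fluctuate binomially, and that the fluctuations are independent between hyperparameter sets. That last assumption is pessimistic: in a real scan the scores are evaluated on the same events and are strongly correlated, so the actual optimism would be smaller. The function name and all numbers are made up for illustration.

```python
import numpy as np

def selection_bias(n_trials, n_events, true_acc=0.80, n_repeats=200, seed=0):
    """Mean optimism of the selected model on the dev set and on an independent test set.

    Toy assumption: all hyperparameter sets are equally good (same true_acc),
    so any difference between their scores is pure statistical noise.
    """
    rng = np.random.default_rng(seed)
    shape = (n_repeats, n_trials)
    # Measured accuracy of each hyperparameter set on the dev and test sets.
    dev = rng.binomial(n_events, true_acc, size=shape) / n_events
    test = rng.binomial(n_events, true_acc, size=shape) / n_events
    # Select the hyperparameter set with the best dev score in each pseudo-experiment.
    best = dev.argmax(axis=1)
    rows = np.arange(n_repeats)
    dev_bias = dev[rows, best].mean() - true_acc    # reused dev set: optimistic
    test_bias = test[rows, best].mean() - true_acc  # independent test set: unbiased
    return dev_bias, test_bias

if __name__ == "__main__":
    for n_trials in (10, 100, 1000):
        dev_bias, test_bias = selection_bias(n_trials, n_events=1000)
        print(f"{n_trials:5d} trials: dev bias {dev_bias:+.4f}, test bias {test_bias:+.4f}")
```

In this toy the dev-set optimism grows with the number of hyperparameter sets tried and shrinks as the number of events grows, while the independent test score stays centred on the true value.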

I would always advise using a k-fold approach, choosing k to balance the available resources against the statistics needed for training. In the past I have used a 5-fold training for a 60:20:20 train:val:test split (threefold with roughly 50:17:33 works too).
You can manage your k to really amp up the available training statistics above the default 50% you otherwise would have had, and as long as you only use the validation set in each fold for optimisation, you can use the test set for the final evaluation in the analysis.
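As a rough sketch of how such a rotation could be set up, the snippet below assigns roles from the event number modulo k. The function name `fold_roles` and the use of the event number are assumptions for illustration, not taken from any analysis framework. This particular scheme gives a (k-2)/k : 1/k : 1/k split, so 60:20:20 for k=5; the threefold variant with a smaller validation share would need a finer split of the non-test events, which is not shown here.

```python
import numpy as np

def fold_roles(event_number, k=5):
    """For each of the k models, label every event as 'train', 'val' or 'test'.

    Model i is tested on block i, validated on block (i + 1) % k,
    and trained on the remaining k - 2 blocks.
    """
    block = np.asarray(event_number) % k
    roles = []
    for i in range(k):
        role = np.full(block.shape, "train", dtype="U5")
        role[block == i] = "test"
        role[block == (i + 1) % k] = "val"
        roles.append(role)
    return roles

if __name__ == "__main__":
    roles = fold_roles(np.arange(1000), k=5)
    for i, role in enumerate(roles):
        fractions = {name: float((role == name).mean()) for name in ("train", "val", "test")}
        print(f"model {i}: {fractions}")
```

Each event is a test event for exactly one of the k models, so the full sample can still be used in the final evaluation, with each event scored by the model that never saw it in training or optimisation.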

There are other options, but I think this keeps it closer to your initial approach, with the benefit of increased training statistics in each fold.

Just my thoughts, and this is a point we will discuss in an upcoming AML meeting (pencilled in for 29th October).


Hi Johnny,

Thanks a lot for your swift response. Further discussion in the ML forum would definitely be of interest. I’m aware that split train/dev/test sets are the norm for the reasons you mentioned. However, it is often the case that the dev and test sets are merged in HEP. I’ve read that this is acceptable when the optimisation is limited, overtraining is not expected, or the stats are large enough that fluctuations aren’t important. I suspect that the last point is the motivation in most HEP cases. It would be nice to hear a discussion about the conditions under which merging the dev and test sets is expected to lead to a negligible bias.


I guess by “dev” you mean validation set?

In any case, technically we’re doing this because we’re cutting corners: if stats aren’t a concern then using a separate test set is something you can afford. In practice we often cut this corner in reconstruction or object-id, because much higher stats mitigate the overtraining and because it’s a bit less practical to (for example) remove every b-jet that was used in training from every physics analysis.

As for how we justify this: if the difference in discrimination between the training and testing datasets is much smaller than other uncertainties you can probably consider it negligible. In b-tagging, for example, we often see disagreement much smaller than 1%, which is within the uncertainties the group provides on calibrations. But I wouldn’t recommend this for an analysis where:

  • The statistics are often more limited,
  • There’s no cross check on the result, and
  • It isn’t so difficult to keep the test set separated.

I think in your case it’s probably worth doing it properly, especially if you have limited events to work with.

I think one more thing to consider for an analysis versus a CP tagger is that we don’t derive scale factors or uncertainties on the discriminant either. It goes into a profile likelihood fit, where even a slight bias in one background produces pulls on other NPs. These pulls may be small, but they can have a big impact on free-floating parameters, which can compensate for bad modelling at no cost.

In signal regions where we apply these discriminants, the parameter most likely to be pulled is therefore the signal strength. A small bias can lead to a big change in a weak signal.
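A back-of-the-envelope sketch of that last point, assuming a single-bin counting experiment in which the signal strength absorbs the whole mismodelling (a real profile likelihood fit would share it among the NPs). If the true background in the bin is B(1 + f) while the model predicts B, the fitted signal strength shifts by f·B/S. The yields below are invented purely for illustration.

```python
def mu_shift(s_expected, b_expected, bkg_bias):
    """Shift in fitted signal strength in a one-bin counting experiment.

    Assumes the background yield is mismodelled by the relative amount
    bkg_bias and that the signal strength absorbs the full difference.
    """
    return bkg_bias * b_expected / s_expected

if __name__ == "__main__":
    # Hypothetical yields: 5 expected signal events on 500 background events.
    shift = mu_shift(s_expected=5.0, b_expected=500.0, bkg_bias=0.01)
    print(f"shift in signal strength: {shift:.2f}")
```

With these made-up numbers, a 1% bias in the background moves the signal strength by a full unit, i.e. by as much as the expected signal itself.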