I do! I'm surprised you remember. My PhD supervisor loved that method of analyzing data. He used the 'leave half out' approach quite a lot, where you split the data 50/50 (so train on 5 points, then test on the other 5). You can approximate information you don't have. The idea is this:
say we have 10 data points and we randomly partition the data set into training and testing sets (for instance 8 training points 2 testing points).
If we train the model on the 8 points and it can also account for the 2 points we set aside, there's a good chance it can account for genuinely unseen data.
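As a minimal sketch of that hold-out idea (the data and the simple linear model here are hypothetical, just to show the mechanics of an 8/2 split):

```python
import random

random.seed(0)
# Hypothetical data: y is roughly 2x plus a little noise (10 points total)
data = [(x, 2.0 * x + random.gauss(0, 0.1)) for x in range(10)]

random.shuffle(data)
train, test = data[:8], data[8:]   # 8 training points, 2 held out

# Fit a simple least-squares line y = a*x + b on the training points only
n = len(train)
sx = sum(x for x, _ in train)
sy = sum(y for _, y in train)
sxx = sum(x * x for x, _ in train)
sxy = sum(x * y for x, y in train)
a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
b = (sy - a * sx) / n

# If the fitted line also accounts for the 2 held-out points,
# that's evidence it will generalize to genuinely unseen data.
test_error = sum((y - (a * x + b)) ** 2 for x, y in test) / len(test)
print(a, b, test_error)
```

The held-out error stays small here because the model family actually matches the data; the point is only that the 2 test points never influenced the fit.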
@Alpha_T83 can you chime in? (He has a PhD in data science).
It is indeed statistically valid, but there's a caveat: you need to pick your analysis carefully and do it once. There are many ways to analyze data: if you split 5/5 into training and testing sets but then try 100 different training methods, you will find an excellent 'outcome' on the 5 test points just by chance. That result can be what we call 'overtrained' (overfit). For example, if you had 15 data points and split them 5/5/5, trained on the first 5 with 100 different methods, and picked the one with the best 'outcome' on the second 5, it would actually be likely to do badly on the third, untouched 5.
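That selection effect can be shown with a toy sketch (everything here is hypothetical: the labels are pure noise, so no method can genuinely beat 50% accuracy, and the "100 training methods" are stand-ins that just make random predictions):

```python
import random

random.seed(1)
# Pure-noise binary labels for the 2nd and 3rd groups of 5 points:
# by construction, no model can truly predict them better than chance.
val_labels  = [random.randint(0, 1) for _ in range(5)]  # used to pick the "best" method
test_labels = [random.randint(0, 1) for _ in range(5)]  # never used for selection

def accuracy(preds, labels):
    return sum(p == l for p, l in zip(preds, labels)) / len(labels)

# 100 "trained models": each just guesses randomly on both groups,
# standing in for 100 different training methods.
models = []
for _ in range(100):
    preds_val  = [random.randint(0, 1) for _ in range(5)]
    preds_test = [random.randint(0, 1) for _ in range(5)]
    models.append((preds_val, preds_test))

# Pick the method that looks best on the 2nd group of 5...
best = max(models, key=lambda m: accuracy(m[0], val_labels))
val_acc  = accuracy(best[0], val_labels)
test_acc = accuracy(best[1], test_labels)

# ...its selection score looks excellent, but that doesn't carry over
# to the untouched 3rd group.
print(val_acc, test_acc)
```

With 100 random tries, the winner almost always scores near-perfectly on the selection set, while its score on the untouched set hovers around chance, which is exactly the 5/5/5 trap described above.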