TLDR: I trained an XGBoost model on MMA data I've scraped from various sources, and then tested how well the model would do on test data that it wasn't trained on. The model scored an accuracy of 64% on the test set. This is considerably higher than the 50% you'd expect from random guessing. The model hasn't been used for actual betting yet, but the next step is to do just that.
A more detailed description follows here:
1. I've scraped various MMA websites for UFC, Pride, Bellator, WEC and Strikeforce results. I also scraped the profiles of all the fighters in these events, as well as their opponents' records. This gave me an idea of the quality of competition each fighter has faced. For example, for Jones vs. Smith I have not only the records of Jones and Smith, but also, for each fighter, the combined record of their opponents, indicating how many fights those opponents have had and what their win% is.
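As a rough illustration of the "combined record of their opponents" idea, here is a minimal sketch. The data layout (`records`, `faced`) and the function name are hypothetical and not taken from the actual scraper; in the real data the opponent records would also need to be the ones that applied at the time of each fight.

```python
# Hypothetical data: each opponent's (wins, losses, draws) record,
# plus the list of opponents a fighter has faced so far.
records = {
    "Opponent A": (10, 2, 0),
    "Opponent B": (5, 5, 0),
    "Opponent C": (3, 0, 1),
}
faced = {"Jones": ["Opponent A", "Opponent B", "Opponent C"]}


def combined_opponent_record(fighter, faced, records):
    """Sum the records of everyone `fighter` has faced.

    Returns (total_fights, win_pct) for the combined opponents.
    """
    wins = losses = draws = 0
    for opp in faced.get(fighter, []):
        w, l, d = records.get(opp, (0, 0, 0))
        wins, losses, draws = wins + w, losses + l, draws + d
    total = wins + losses + draws
    win_pct = wins / total if total else 0.0
    return total, win_pct


total_fights, win_pct = combined_opponent_record("Jones", faced, records)
print(total_fights, round(win_pct, 3))
```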
2. I took quite some time cleaning the data and creating derived / aggregated variables. No row contains future information; each row holds only information that was available before the fight. I created lots of variables covering:
- event continent / home advantage
- reach, height, age
- experience (total amount of fights, wins, losses, draws)
- number of finishes (subs, ko's)
- current winning/losing streak, max winning/losing streak
- # of wins in last 1, 3 and 5 fights
- average fighting time
- fighter birth continent
- opponent strength (win% of opponent)
- opponent experience (number of fights of opponent)
- difference between both fighters in the above variables, e.g. reach difference, height difference etc.
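A few of the variables above could be derived roughly like this. This is only a sketch under assumptions: the field names (`reach_cm`, `height_cm`, `age`, `history`) are made up for illustration, and `history` is assumed to hold only results from before the fight, in chronological order, with 1 for a win and 0 for anything else.

```python
def wins_in_last_n(history, n):
    """Count wins among the last n pre-fight results (1 = win, 0 = other)."""
    return sum(history[-n:]) if n > 0 else 0


def current_streak(history):
    """Signed streak: +k for k straight wins, -k for k straight non-wins."""
    if not history:
        return 0
    last = history[-1]
    k = 0
    for result in reversed(history):
        if result != last:
            break
        k += 1
    return k if last == 1 else -k


def fight_features(a, b):
    """Difference features (fighter A minus fighter B) for one fight."""
    return {
        "reach_diff": a["reach_cm"] - b["reach_cm"],
        "height_diff": a["height_cm"] - b["height_cm"],
        "age_diff": a["age"] - b["age"],
        "wins_last_3_diff": wins_in_last_n(a["history"], 3)
        - wins_in_last_n(b["history"], 3),
        "streak_diff": current_streak(a["history"]) - current_streak(b["history"]),
    }


# Made-up example fighters, not real data.
fighter_a = {"reach_cm": 215, "height_cm": 193, "age": 31, "history": [1, 1, 0, 1, 1]}
fighter_b = {"reach_cm": 194, "height_cm": 193, "age": 36, "history": [1, 0, 0]}
features = fight_features(fighter_a, fighter_b)
print(features)
```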
3. I split the data into a training set and a test set. The training set contained the oldest 80% of the fights (since MMA data is temporal, i.e. a fighter's current record contains information about their past performance), and the test set contained the most recent 20%.
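A time-ordered split like this can be sketched as follows. It assumes each fight is a dict with a `date` field, which is a hypothetical layout; the point is only that the data is sorted by date before cutting, rather than shuffled.

```python
from datetime import date


def chronological_split(fights, train_fraction=0.8):
    """Oldest `train_fraction` of fights for training, the rest for testing."""
    ordered = sorted(fights, key=lambda f: f["date"])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]


# Ten made-up fights, deliberately given in reverse order.
fights = [{"id": i, "date": date(2010 + i, 1, 1)} for i in range(9, -1, -1)]
train, test = chronological_split(fights)
print(len(train), len(test))
```

One thing to watch with a cut like this is that fights from the same event can end up on both sides of the boundary; cutting on an event date instead avoids that.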
I used XGBoost as the model, since it has a strong track record compared with other modeling techniques, is quite fast to train, and has been used in several winning Kaggle solutions.
The model outputs a probability of fighter A winning. These probabilities are then converted into classes (i.e. a win probability over .50 means a win, below .50 a loss).
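The conversion from probabilities to classes, and the accuracy that follows from it, can be sketched like this. The probabilities and outcomes are made-up numbers, not model output, and treating a probability of exactly .50 as a loss is an assumption of the sketch.

```python
def to_class(prob_a_wins, threshold=0.5):
    """1 if fighter A is predicted to win, else 0 (exactly 0.5 counts as 0 here)."""
    return 1 if prob_a_wins > threshold else 0


def accuracy(probs, outcomes, threshold=0.5):
    """Share of fights where the predicted class matches the actual outcome."""
    preds = [to_class(p, threshold) for p in probs]
    hits = sum(pred == actual for pred, actual in zip(preds, outcomes))
    return hits / len(outcomes)


# Made-up predicted probabilities and actual results (1 = fighter A won).
probs = [0.67, 0.41, 0.55, 0.30, 0.52]
outcomes = [1, 0, 0, 0, 1]
acc = accuracy(probs, outcomes)
print(acc)
```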
4. Model performance
I reached 65% accuracy on the training set, and 64% on the test set, which contains data the model was NOT trained on. This indicates that the model was not overfitted, since it performed very similarly on the training and test sets.
5. Betting:
The next step is to use the model for betting (for fun). The plan is to bet on MMA fights following the model 100%, no emotions, no gut feelings.
Since I'm only willing to risk 100 euros and like to keep things simple, I'm planning to do fixed bet sizes at 1.5-2% of the bankroll.
Any fight that offers positive EV will be bet on. Say the model gives Cain a 67% win probability over Ngannou, and the decimal odds for Cain are 1.64. Those odds imply a win probability of 1/1.64 = 60.98%, which is lower than the model's 67%. This would be positive EV, and thus a good bet according to the model.
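The EV check and the fixed bet size can be written out as a small sketch. The function names are made up for illustration, and the 2% stake is taken from the starting bankroll of 100 euros, which is one reading of "fixed bet sizes".

```python
def implied_probability(decimal_odds):
    """Win probability implied by decimal odds."""
    return 1.0 / decimal_odds


def expected_value(model_prob, decimal_odds, stake=1.0):
    """Expected profit of a bet: win pays stake * (odds - 1), loss costs the stake."""
    return stake * (model_prob * decimal_odds - 1.0)


def fixed_stake(bankroll, fraction=0.02):
    """Flat stake as a fraction of the bankroll."""
    return bankroll * fraction


odds = 1.64
model_prob = 0.67
stake = fixed_stake(100.0, 0.02)           # 2 euros
implied = implied_probability(odds)        # about 0.6098
ev = expected_value(model_prob, odds, stake)
should_bet = ev > 0
print(round(implied, 4), round(ev, 4), should_bet)
```

With these numbers the expected profit is about 0.20 euros on a 2 euro stake. Keep in mind that bookmaker odds include a margin, so the implied probabilities of both fighters add up to more than 100%.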
6. Future steps:
- Add opening odds to the model. I feel like this could greatly improve the model.
- Add recency (days since last fight). This could help in determining whether ring rust is an actual thing, and if so, it could add predictive power.
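The recency variable could be computed along these lines. This is a sketch only: the function name and the choice to return `None` for a debuting fighter are assumptions, and the dates are made up.

```python
from datetime import date


def days_since_last_fight(fight_dates, upcoming):
    """Days between the most recent earlier fight and `upcoming`; None for a debut."""
    prior = [d for d in fight_dates if d < upcoming]
    if not prior:
        return None
    return (upcoming - max(prior)).days


# Made-up fight dates for one fighter.
past_fights = [date(2017, 7, 29), date(2019, 1, 1)]
layoff = days_since_last_fight(past_fights, date(2019, 3, 2))
print(layoff)
```

Only fights dated before the upcoming one are used, which keeps the variable in line with the rule that a row holds no future information.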
Disclaimer:
I know the model isn't amazing, and the betting strategy is questionable. However, this was mainly a side project to update my webscraping / scripting skills. Now, for entertainment purposes, I'm excited to see what happens first: doubling my bankroll, or losing it all.