Predicting Transfer Fee of a Football Player

Yahya Yavuz
11 min read · Dec 1, 2020


Project Overview

"In 2019, the football transfer market reached unprecedented levels both in terms of the number of transfers and the amount spent on fees."

*James Kitching, Director of Football Regulatory*

The growing football industry can cause clubs to spend more than they can possibly earn. UEFA introduced the Financial Fair Play Regulations to audit clubs' expenses and prevent them from collapsing financially, describing the regulations as a way to improve the overall financial health of European club football. To comply with these rules, valuing a player correctly becomes crucial, whether selling or buying.

FIFA published a report on the 2019 transfer market. The report highlights repeatedly that football clubs spend large amounts of money to stay competitive with each other.

The chart above shows how fast the football transfer market is growing: total spending reached ~7.5 billion USD in 2019.

The table above shows that both the number of transfers and the average fee increase year after year.

This project is about predicting the transfer fees of football players. First, the dataset is constructed by scraping transfer data from Transfermarkt; then all the features used in the model are created.

The motivation of this project is to predict an unknown fee by constructing many features from different pages of the website.

Problem Statement

The aim of the project is to predict transfer fees by scraping data from the well-known website Transfermarkt.com. The transfer fees are the target of the model. The features are all created from different pages of the website, mostly from the players' statistics, achievements, profiles and transfer history. The fees are predicted with one machine learning and one deep learning algorithm.

Metrics

The model has a continuous target, so the main metrics used in the project are R², root mean squared error (RMSE) and mean absolute error (MAE). LightGBM and Keras are used in the project and all the metrics are compared.
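For reference, these three metrics can be computed with scikit-learn as in the short sketch below; the numbers are placeholders, not project results.

```python
# Illustrative sketch of the three evaluation metrics used in the project.
import numpy as np
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

y_true = np.array([2.0, 15.0, 40.0, 7.5])   # hypothetical actual fees (millions)
y_pred = np.array([3.1, 12.4, 35.0, 9.0])   # hypothetical model predictions

r2 = r2_score(y_true, y_pred)                       # share of variance explained
rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # penalizes large errors more
mae = mean_absolute_error(y_true, y_pred)           # average absolute error

print(f"R²: {r2:.2f}, RMSE: {rmse:.2f}, MAE: {mae:.2f}")
```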

Data Gathering

All the transfers with a buy-out fee, the player profiles, all the matches a player played before the transfer date, the achievements and the player transfer histories are scraped using BeautifulSoup. All the detailed scraping code is available in the following GitHub repository.

Below, the motivation for scraping each page and the features derived from it are explained.

Transfers

Transfers are the main table, containing the complete transfer history of every league covered by Transfermarkt. This link shows the transfers on each date. The model development data was gathered by scraping this link and looping over dates from 2016 to 2020.

Finally, transfers with buy-out fees were selected for the model development sample.
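As a rough illustration, the date-by-date scraping loop could look like the sketch below. The URL pattern, query parameter and table selector are assumptions for illustration only; the exact selectors live in the GitHub repository.

```python
# Sketch of the daily scraping loop; the URL and table markup are illustrative
# assumptions, not Transfermarkt's exact structure.
import requests
from bs4 import BeautifulSoup
from datetime import date, timedelta

HEADERS = {"User-Agent": "Mozilla/5.0"}  # Transfermarkt rejects requests without a browser-like UA
BASE_URL = "https://www.transfermarkt.com/transfers/..."  # placeholder for the daily-transfers page

def scrape_transfers_for_day(day):
    """Return raw transfer rows published on a single day."""
    response = requests.get(f"{BASE_URL}?datum={day.isoformat()}", headers=HEADERS)
    soup = BeautifulSoup(response.text, "html.parser")
    rows = []
    for tr in soup.select("table.items tbody tr"):  # assumed table markup
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if cells:
            rows.append({"raw_cells": cells, "date": day})
    return rows

# Loop from January 2016 to November 2020, one day at a time.
all_rows, day = [], date(2016, 1, 1)
while day <= date(2020, 11, 30):
    all_rows.extend(scrape_transfers_for_day(day))
    day += timedelta(days=1)
```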

Transfer Details and Teams Played

Transfer details are the source of a player's transfer history. For each transfer, specific information is provided, such as the age at the time of transfer, the market value at the time of transfer and the remaining contract. The transfer detail page of a player can be reached from here.

The webpage was looped over for the players in the development sample and a transfer history dataframe was created. Several features were derived from it, such as the number of transfers in the history, average market value, total fee and average fee. Flag features were also created: Country_Change_Flag, League_Change_Flag, League_Tier_Up_Flag and League_Tier_Down_Flag.
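A minimal sketch of how the flag and aggregate features could be built with pandas, assuming hypothetical column names for the scraped transfer-history table:

```python
import pandas as pd

# Toy transfer-history table with illustrative column names; the real one is
# built from the scraped transfer-detail pages.
history = pd.DataFrame({
    "player_id":        [1, 1],
    "transfer_date":    ["2017-07-01", "2019-08-15"],
    "from_country":     ["Portugal", "England"],
    "to_country":       ["England", "Spain"],
    "from_league":      ["Liga NOS", "Premier League"],
    "to_league":        ["Premier League", "LaLiga"],
    "from_league_tier": [1, 1],
    "to_league_tier":   [1, 1],
    "market_value":     [20.0, 45.0],
    "fee":              [25.0, 60.0],
})

# Flag features describing each past move.
history["Country_Change_Flag"] = (history["from_country"] != history["to_country"]).astype(int)
history["League_Change_Flag"] = (history["from_league"] != history["to_league"]).astype(int)
history["League_Tier_Up_Flag"] = (history["to_league_tier"] < history["from_league_tier"]).astype(int)
history["League_Tier_Down_Flag"] = (history["to_league_tier"] > history["from_league_tier"]).astype(int)

# Per-player aggregates used as model features.
player_features = history.groupby("player_id").agg(
    n_transfers=("transfer_date", "count"),
    avg_market_value=("market_value", "mean"),
    total_fee=("fee", "sum"),
    avg_fee=("fee", "mean"),
)
```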

Another dataset was also created from this webpage, giving the team a player belonged to in a specific time period.

Player Profile

The player profile provides general information about the footballer, such as citizenship, height, preferred foot and positions. A player's profile page can be reached from here. The screenshot below shows the available information on a player's profile.

A dataset is created from the profile page, including all the information shown in the screenshot above for each player involved in a transfer.

Stats in Club Matches

This was the hard part of the project. Transfermarkt reports stats per season, but a footballer can change his club during the season.

The stats features must reflect a player's attributes before the transfer date in order to predict a future fee, so the per-season stats given by the website can't be used directly in the model. Instead, the individual matches of each player are scraped, and the features are built from the matches played before the transfer date.

All the scraping code is available in the GitHub repository; however, the football matches dataset itself is not on GitHub because it exceeds GitHub's file size limit.

The matches are available in the link and a screenshot of the matches can be seen below.

Player statistics are very important, and some detailed feature engineering is done here. Many features are created, including matches played, matches benched, average minutes, goals, assists, substituted games, injuries, suspensions and match points in the last 5, 10, 20 and 30 games before the transfer date. The dataset used in the model is available in the GitHub repository.
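The "last N matches before the transfer" logic could be sketched roughly as below, assuming a matches DataFrame with hypothetical columns; the real feature set in the repository is much wider.

```python
import pandas as pd

def last_n_match_features(matches, transfer_date, n):
    """Aggregate a player's last n matches played before the transfer date."""
    before = matches[matches["match_date"] < pd.Timestamp(transfer_date)]
    window = before.sort_values("match_date").tail(n)
    return {
        f"matches_played_last{n}": len(window),
        f"avg_minutes_last{n}": window["minutes"].mean(),
        f"goals_last{n}": window["goals"].sum(),
        f"assists_last{n}": window["assists"].sum(),
    }

# Toy match log for one player (column names are illustrative).
matches = pd.DataFrame({
    "match_date": pd.to_datetime(["2019-08-10", "2019-08-17", "2019-08-24"]),
    "minutes":    [90, 75, 90],
    "goals":      [1, 0, 2],
    "assists":    [0, 1, 0],
})

# Build the features for the window sizes used in the project.
features = {}
for n in (5, 10, 20, 30):
    features.update(last_n_match_features(matches, "2019-09-01", n))
```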

Stats in National Matches

A football player can also play for his national team, and national team stats can help to predict the transfer fee. The difficult part of creating this dataset is getting the matches of the players, since they can play at multiple youth levels such as Under-21 and Under-19. Here is the link of national matches. First, the national teams are scraped, then the matches of each national team are scraped to create the national matches dataset.

The feature engineering is the same as for club matches: the same features are created from the national team matches.

Achievements

Finally, achievements are also very important for predicting the transfer fee: the fee can increase if the player wins top goal scorer awards or cups. Hence, achievements are also scraped from the link. A screenshot of the achievements page is provided below.

Two main features are created from this dataset: the number of achievements and the number of distinct achievements.
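A minimal sketch of these two features, assuming a scraped table with one row per player/achievement pair (column names are illustrative):

```python
import pandas as pd

# Toy achievements table built from the scraped achievements pages.
achievements = pd.DataFrame({
    "player_id":   [1, 1, 1, 2],
    "achievement": ["League Champion", "League Champion", "Top Goal Scorer", "Cup Winner"],
})

achievement_features = achievements.groupby("player_id").agg(
    n_achievements=("achievement", "count"),            # total trophies/awards won
    n_distinct_achievements=("achievement", "nunique"),  # how many different kinds
)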

Data Exploration and Data Visualization

After all the data is gathered, it is merged into the final model dataset. After basic checks such as null counts and summary statistics, four questions come to mind, and data visualization is used to answer them (a small plotting sketch follows the list).
* How many players are in the data in each year? (2016–2020)
* What is the average fee in each year?
* How many players exist in each position?
* What is the average fee in each position?
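A rough sketch of the aggregations behind these four plots, using illustrative column names and toy values in place of the merged dataset:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Toy stand-in for the merged model dataset (column names are assumptions).
df = pd.DataFrame({
    "season":   ["17-18", "17-18", "18-19", "19-20"],
    "position": ["Centre-Forward", "Left Midfield", "Centre-Forward", "Second Striker"],
    "fee":      [5.0, 1.2, 20.0, 35.0],
})

fig, axes = plt.subplots(2, 2, figsize=(12, 8))
df.groupby("season")["fee"].size().plot.bar(ax=axes[0, 0], title="Transfers per season")
df.groupby("season")["fee"].mean().plot.bar(ax=axes[0, 1], title="Average fee per season")
df.groupby("position")["fee"].size().plot.bar(ax=axes[1, 0], title="Players per position")
df.groupby("position")["fee"].mean().plot.bar(ax=axes[1, 1], title="Average fee per position")
plt.tight_layout()
plt.show()
```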

In the transfer dataset, there are almost 2K transfers in each of the 17–18, 18–19 and 19–20 seasons. Seasons 15–16 and 20–21 are not fully covered and represent half a season each, since the dataset includes transfers from January 2016 to November 2020.

Average transfer fees are increasing, in line with the FIFA report.

Most transfers in the dataset come from the Centre-Forward position (~20%). Second Striker and Left & Right Midfield positions each make up ~1% of the data.

Although Second Striker is the rarest position in the dataset, its players have the highest average transfer fee. The Left & Right Midfield positions have the lowest values.

Data Preprocessing

After scraping all the data, there are still some preprocessing steps that must be done before modelling.

Some of the time-difference features created in the club and national stats sections have the timedelta64 data type. The number of days was extracted from them, which also converts these features to floats. In the profile dataset, the positions a footballer can play are specified as main position, other position 1 and other position 2. A new variable, the number of positions, was created from these fields: every player has a main position, but some can play in multiple positions.
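A small sketch of these two steps, with hypothetical column names:

```python
import pandas as pd

# Toy frame: one timedelta feature and the three position columns from the profile page.
profile = pd.DataFrame({
    "days_since_last_match": pd.to_timedelta(["12 days", "40 days"]),
    "main_position":   ["Centre-Forward", "Left Winger"],
    "other_position1": ["Second Striker", None],
    "other_position2": [None, None],
})

# timedelta64 -> float number of days, so the model can consume it.
profile["days_since_last_match"] = profile["days_since_last_match"].dt.days.astype(float)

# Number of positions a player can play (the main position always exists).
position_cols = ["main_position", "other_position1", "other_position2"]
profile["n_positions"] = profile[position_cols].notna().sum(axis=1)
```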

In each transfer, the fee should be highly correlated with the time remaining on the player's contract with his existing team. This value is given for some players, but the format needed to be changed, so the remaining months of the existing contract were calculated. Age was also prepared for the model by computing it in months from the string variable.

The height field was cleaned and converted to float. The target variable of the model is the fee amount, given as a string with "m." or "Th." representing millions and thousands respectively; a float fee amount is created by parsing this string.
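A minimal sketch of such a fee parser; the example raw strings are assumptions and may differ slightly from the exact Transfermarkt format:

```python
def parse_fee(raw):
    """Convert a Transfermarkt-style fee string to a float amount in millions."""
    value = raw.replace("€", "").strip()
    if value.endswith("m"):
        return float(value[:-1])            # already in millions
    if value.endswith("Th."):
        return float(value[:-3]) / 1000.0   # thousands -> millions
    return float(value)

print(parse_fee("€25.00m"))   # 25.0
print(parse_fee("€500Th."))   # 0.5
```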

In the dataset, foot is a categorical variable, so dummy variables were created indicating whether the player is left-footed, right-footed or uses both feet. The main position of a player is also categorical, so dummy variables were created for each main position. Each transfer also has country left and country joined fields. A threshold of 50 was defined: a dummy variable was created for each country whose transfer count exceeds the threshold, and countries below 50 were grouped into an "uncommon" category.
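A rough sketch of the grouping and dummy-encoding step, using illustrative column names and toy values:

```python
import pandas as pd

# Toy slice of the merged dataset (values and column names are illustrative).
df = pd.DataFrame({
    "country_joined": ["England", "England", "Andorra"],
    "foot":           ["left", "right", "both"],
    "main_position":  ["Centre-Forward", "Centre-Back", "Goalkeeper"],
})

THRESHOLD = 50  # countries with at most 50 transfers are grouped as "uncommon"
counts = df["country_joined"].value_counts()
common = counts[counts > THRESHOLD].index
df["country_joined"] = df["country_joined"].where(df["country_joined"].isin(common), "uncommon")

# One-hot encode foot, main position and the grouped country field.
df = pd.get_dummies(df, columns=["foot", "main_position", "country_joined"])
```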

Implementation

Let's try to predict the value of the players using the features. LightGBM and Keras were used for prediction, and the results of the two algorithms were then compared.

LightGbm

LightGBM is a gradient boosting framework that uses tree based learning algorithms. It is designed to be distributed and efficient with the following advantages:

* Faster training speed and higher efficiency.

* Lower memory usage.

* Better accuracy.

* Support of parallel and GPU learning.

* Capable of handling large-scale data.

Leaf-wise (Best-first) Tree Growth
Most decision tree learning algorithms grow trees by level (depth)-wise, like the following image:

LightGBM grows trees leaf-wise. It will choose the leaf with max delta loss to grow. Holding #leaf fixed, leaf-wise algorithms tend to achieve lower loss than level-wise algorithms.

Leaf-wise may cause over-fitting when data is small, so LightGBM includes the max_depth parameter to limit tree depth. However, trees still grow leaf-wise even when max_depth is specified.

Source : LightGBM

First, a LightGBM regression model is run with default parameters. Then, hyperparameter optimization is done to find the optimal values of n_estimators, learning_rate, max_depth and num_leaves.
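A sketch of what this tuning step could look like; the grid values below are illustrative, not the ones actually searched.

```python
# Baseline LightGBM regressor followed by a grid search over the four
# parameters mentioned above (grid values are illustrative).
from lightgbm import LGBMRegressor
from sklearn.model_selection import GridSearchCV

baseline = LGBMRegressor()  # default parameters first
# baseline.fit(X_train, y_train)

param_grid = {
    "n_estimators":  [200, 500, 1000],
    "learning_rate": [0.01, 0.05, 0.1],
    "max_depth":     [3, 5, 7],
    "num_leaves":    [15, 31, 63],
}
search = GridSearchCV(
    LGBMRegressor(),
    param_grid,
    scoring="neg_root_mean_squared_error",
    cv=5,
)
# search.fit(X_train, y_train)
# print(search.best_params_)
```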

Grid Search gave the best parameters in the model as:

For the LightGBM model the performance metrics are:

The RMSE is 3.67 million, the model explains 79% of the variance (R² = 0.79) and the mean absolute error is 1.6 million. The top 15 most important features of the optimized model are given below.

The 5 most important variables are the market value, the remaining contract, age, the average fee paid before and a binary variable indicating whether the transfer joins the English league.

Keras

Keras is a deep learning API written in Python, running on top of the machine learning platform TensorFlow. It was developed with a focus on enabling fast experimentation. Being able to go from idea to result as fast as possible is key to doing good research.

Keras is the high-level API of TensorFlow 2.0: an approachable, highly-productive interface for solving machine learning problems, with a focus on modern deep learning. It provides essential abstractions and building blocks for developing and shipping machine learning solutions with high iteration velocity.

Keras empowers engineers and researchers to take full advantage of the scalability and cross-platform capabilities of TensorFlow 2.0: you can run Keras on TPU or on large clusters of GPUs.

Source : Keras

A sequential network with 5 hidden layers and a different number of nodes in each layer was created. A dropout rate of 0.2 was used to randomly drop nodes and prevent overfitting. The network layers and node counts are given below. The model was trained for 25 epochs.
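A sketch of such a network; the layer sizes are illustrative, since the exact node counts come from the original notebook.

```python
# Sequential network: 5 hidden layers, dropout 0.2, trained for 25 epochs.
# n_features and the layer widths are placeholders.
from tensorflow import keras
from tensorflow.keras import layers

n_features = 100  # placeholder for the number of model features

model = keras.Sequential([
    layers.Input(shape=(n_features,)),
    layers.Dense(256, activation="relu"),
    layers.Dropout(0.2),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.2),
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.2),
    layers.Dense(32, activation="relu"),
    layers.Dense(16, activation="relu"),
    layers.Dense(1),  # single continuous output: the transfer fee
])
model.compile(optimizer="adam", loss="mse", metrics=["mae"])
# model.fit(X_train, y_train, epochs=25, validation_split=0.2)
```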

For the Keras prediction the performance metrics are:

Keras explained 81% of the total variance with an RMSE of 3.5 million and an MAE of 1.56 million.

Justification

Two models are examined to predict the transfer fee. Below, the actual values and the predictions are plotted for LightGBM and Keras.

LightGbm vs Keras

The line represents x = y, meaning that points closer to the line are better predictions. From the graph it is hard to tell which model performs better, but LightGBM seems to underestimate more at the high end. Comparing R² also shows that Keras explains slightly more variance than LightGBM, and its RMSE and MAE are lower as well.
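A small sketch of how such an actual-vs-predicted plot with the x = y reference line can be produced, using placeholder arrays instead of the real hold-out predictions:

```python
import numpy as np
import matplotlib.pyplot as plt

y_test = np.array([2.0, 10.0, 25.0, 60.0])      # hypothetical actual fees
pred_lgbm = np.array([3.0, 9.0, 20.0, 48.0])    # hypothetical LightGBM predictions
pred_keras = np.array([2.5, 11.0, 23.0, 55.0])  # hypothetical Keras predictions

plt.scatter(y_test, pred_lgbm, label="LightGBM", alpha=0.6)
plt.scatter(y_test, pred_keras, label="Keras", alpha=0.6)
lims = [0, y_test.max()]
plt.plot(lims, lims, "k--", label="x = y (perfect prediction)")
plt.xlabel("Actual fee (millions)")
plt.ylabel("Predicted fee (millions)")
plt.legend()
plt.show()
```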

Reflection & Improvement

In conclusion, the transfers and many features about the players' profiles, transfer histories, achievements and club and national match statistics were scraped from Transfermarkt. The data gathered from the website was explored and preprocessed for modelling. The fee amount, the target variable of the problem, was modelled with two different algorithms, LightGBM and Keras. The performance metrics were similar, but Keras seemed slightly better at predicting the transfer fee.

The most interesting part of this project was using data science in sports with real data to predict a difficult value. Creating datasets and features about football players from a webpage was exciting. This project gave me the joy of designing a complete data science project even though the dataset was not available at the beginning. The project also attracted a lot of attention among my friends who like football.

There were times when the project made me want to change the subject. The most difficult part was gathering all the data from different sources and creating the features from scratch.

Some additional stats features could improve the model results. The stats features were produced from the last 5, 10, 20 and 30 matches of each player; building them from date windows instead of match counts could improve the model. Also, restricting the development sample to transfers between European clubs might make the model perform better.
