DATA ANALYSIS

INDIAN PREMIER LEAGUE (IPL CRICKET) PERFORMANCE ANALYSIS FOR PLAYERS SELECTION

Institute of Technology Carlow, Ireland

STUDENT NAME: Sumit Kumar Singh

STUDENT NUMBER: C00232333

COURSE NAME: Masters in Data Science

DEPARTMENT: Department of Computing and Networking

COURSE CODE: (CW_SRDAT_M) Y5

SUPERVISOR: Dr. Greg Doyle

DATE OF SUBMISSION: 28 04 2018

CONTENTS

ABSTRACT

INTRODUCTION

INDIAN PREMIER LEAGUE

Role of Analysis in Sports

BACKGROUND AND SIGNIFICANCE

Research Question

RESEARCH METHODOLOGY

Research Design

The Data

ANALYSIS

RESULT TABLE

CONCLUSION

REFERENCES

List of Appendices

Appendix 1: R CODE FOR BATTING REGRESSION

Appendix 2: R CODE FOR BOWLING REGRESSION

Appendix 3: PYTHON CODE FOR BATTING AVERAGE

Appendix 4: PYTHON CODE FOR BOWLING AVERAGE

ABSTRACT

The purpose of this research is to develop models that could help team owners build talented teams at the minimum possible cost. The models are developed using Python and RStudio. Multiple Linear Regression (with the backward elimination rule), K-Nearest Neighbour, Support Vector Machine and Linear Regression are used to predict the performance of players. The models provide a probability measure for the selection of each player, which team owners can use during bidding. They can also help decision makers during the auction by estimating the value of each player and helping to set player salaries.

INTRODUCTION

Cricket is a bat-and-ball game and is the second most popular sport in the world after football. It originated in England and spread around the world through the British Empire. Historically, cricket was organised at two levels (Dalmia, 2010). The first is the intra-country level, in which the two opposing teams come from different provinces of the same country and the players compete to be selected for the national team and represent their country. The second is the international level, in which the two teams represent their countries.

The International Cricket Council (ICC) is the governing body of cricket. It sets the international cricket calendar and the rules and regulations followed by every team in each international or intra-country match played anywhere in the world. The ICC has three levels of membership: full, associate and affiliate. Full members are the leading teams in the game: Australia, England, India, Pakistan, New Zealand, South Africa, Sri Lanka, West Indies, Bangladesh, Zimbabwe, Ireland and Afghanistan. In addition, there are 35 associate members and 59 affiliate members, for a total of 104 members (Dalmia, 2010).

The format of international cricket has evolved over time, and there are now several versions of the game. Originally, only the longer version, called the “Test match”, was played between two countries. It was a multi-day affair: Test matches lasted five days, and that continues to hold today (ICC). Although cricket is known as a gentleman’s game, this format was not popular with many fans because a result was not guaranteed. Many Test matches ended in draws, which many fans found unsatisfying. This is why one-day international (ODI) matches were introduced in the early 1970s. An ODI lasts only one day and produces a result in almost every game. Because a match finishes in a day, the format also made multi-country competitions practical. The first ODI “World Cup” was held in 1975, and the tournament has been held every four years since then.

A third and even shorter version of the game was introduced in the 2000s. It is called Twenty20 cricket, abbreviated as T20 (Cannonier, Panda and Sarangi, 2013). This is a fast-paced format, and a game lasts only about three hours. The first T20 “World Cup” was held in 2007, and its huge success, particularly in the dominant market of India, led to the launch of cricket’s first domestic professional T20 league. This league is called the Indian Premier League, the subject of my assignment.

INDIAN PREMIER LEAGUE

The Indian Premier League (IPL) is the biggest and richest cricket league in the subcontinent and is played every year in India. It is a franchise-based Twenty20 competition with a format similar to that of the football Premier League in England and the NBA in the United States (Board of Control for Cricket in India). The IPL is the most attended cricket league in the world. According to Duff & Phelps, its business value rose to Euro 4.4 billion in 2017 from Euro 3.5 billion the previous year (India Brand Equity Foundation, 2017), representing a three-year compound annual growth rate (CAGR) of 13.9 per cent, and it contributes a minimum of Euro 151 million to India’s GDP each year.

Each franchise’s base value is estimated at around Euro 330 million, so it is of the utmost importance for team owners to select the right players for a winning combination. This is the motivation for my research: to analyse the data and find algorithms that can identify the factors that make a player valuable to a team and help pick the right combination of players to form a winning team, which would ultimately benefit the team owners.

Role of Analysis in Sports

Statistical analysis has played a huge role in many major sports. Team owners, managers, coaches and even sports fans like to rate players and teams and to point out good or bad performances over a period of time (Hopkins et al., 2009). Analysing the performance of players and teams helps owners and coaches understand the potential of the players and identify areas for improvement. Statistics therefore plays a major role in sports analysis. For example, major US college football uses the Bowl Championship Series (BCS) formula to rank teams by taking into account factors such as the number of wins and losses, the strength of the schedule, the margin of victory and whether games were played at home or away, in order to determine which two teams qualify for the final and meet to crown the national champion (Callaghan, Mucha and Porter, 2004). For a university, having its football team play in the championship game is a very big deal because it brings publicity and millions of dollars in revenue.

Another instance can be seen in American basketball, where the number of points a player scores per game is a natural measure of how good the player is. A similar statistic is used in hockey, where a player’s points are calculated as goals plus assists, while goaltenders are rated by goals against average (GAA) (Lees, 2010). In baseball, batters are rated by batting average and pitchers by earned run average (ERA). Similar calculations exist in all sports to determine the best players of the game.

Unlike in other sports, it is not an easy task to determine a good measure of performance in T20 cricket. Few statistical analyses have been conducted to determine what is really needed to win a match in the shortest format of the game.

BACKGROUND AND SIGNIFICANCE

Team selection is a highly critical process in every sport, as players are selected on the basis of their past performance. Forecasting the future from the past is highly subjective and thus requires expert decision making (Kalgotra, Sharda and Chakraborty, 2013). This becomes even more important when a huge amount of money is involved. A number of predictive models have been created for the IPL and in related studies.

Parker, Burns and Natarajan (2008) explored the determinants of player valuations and investigated a number of hypotheses related to the design of the IPL auction, using information on the previous performance, experience and other characteristics of individual players (Parker, Burns and Natarajan, 2008).

Iyer and Sharda (2009) used neural networks to forecast the selection of athletes for cricket teams by predicting their future performance from their past performance. They predicted the selection of cricketers for the 2007 one-day international World Cup. To predict selection, players were categorised as a performer, a moderate or a failure (Iyer and Sharda, 2009).

Karnik (2009) followed a very simple approach to derive hedonic price equations for estimating the bid amount for each cricketer in the Indian Premier League (IPL) auction. He developed price models using data from the 2008 season and successfully tested them against data from the 2009 season. The variables used in the equations were common playing factors such as runs scored, wickets taken and age. He observed a lower rate of return to team owners from the expensive players, which points to inefficiency in how bidders judged the pay levels of players (Karnik, 2009).

Singh, Gupta and Gupta (2011) formulated an integer-programming model for an efficient bidding strategy for the franchises. The model was implemented in a spreadsheet that helped franchises take bidding decisions in real time and overcome the winner’s curse, which is typically associated with normal bidding processes (Singh, Gupta and Gupta, 2010).

Singh (2011) measured the performance of teams in the IPL using a non-parametric mathematical approach called Data Envelopment Analysis (DEA). He used both playing and non-playing factors to analyse the efficiency of the teams in the 2009 season (Singh, 2011).

Kalgotra, Sharda and Chakraborty (2013) created a number of models for predicting the selection of a player based on past performance. The models were developed using SAS Enterprise Miner, and the best performing model was selected based on the misclassification rate on the validation data. The selected model provides a probability measure for the selection of each player, which can be used as a valuation factor in the bidding equation. The models can help decision makers set salaries for the players during the auction (Kalgotra, Sharda and Chakraborty, 2013).

The studies and literature reviewed above suggest that only a few techniques have been used to predict the top players and teams. However, many more machine learning techniques could be explored in this area.

Research Question

The purpose of this study is to analyse the performance of each participating team and its players in the IPL using the data, and to predict the best batsman and bowler using different machine learning models.

RESEARCH METHODOLOGY

Research Design

My research design is a secondary data analysis of the data collected on the Indian Premier League (IPL cricket) by Raghunath. The design of my research is as follows:

The Data

The input data was downloaded from data.world (Raghunath, 2017) and IPLT20.com. It consists of seasonal statistics calculated for each team and player. It covers nine seasons, from 2008 to 2016, comprising data on 577 matches. The most relevant fields are each player’s all-time and yearly performance across all the games he played, and each team’s performance in every game it played.

Before the analysis, the data was cleaned and made consistent using fillna in Python. An additional column named Age, holding each player’s current age, was added to test the influence of this variable on performance. 100 players from the “IPL_Most_Runs” dataset and 100 players from the “IPL_Most_Wickets” dataset were used in the first analysis to find the factors that matter when choosing a player. For the second analysis, the entire “deliveries” dataset was used to make predictions.

Collection of data and Data Management

R, RStudio and Python (Spyder) were used to load the data, which is in CSV format. Unfortunately, the data was incomplete and not fully up to date. Functions such as “fillna” were used to compensate for the missing data.
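The cleaning step can be illustrated with a minimal sketch. It assumes the pandas library; the three-row extract, the column names and the median-fill policy are hypothetical choices for illustration, not necessarily those used in the appendices.

```python
import io
import pandas as pd

# Hypothetical three-row extract; the real input is the set of CSV files.
csv_text = """Player,Mat,Runs,Avg,Age
A,14,450,,29
B,12,,31.5,
C,10,210,23.3,34
"""
df = pd.read_csv(io.StringIO(csv_text))

# Count missing values per column before cleaning.
missing_before = df.isna().sum()

# Assumed policy: replace each numeric gap with the median of its column.
numeric_cols = df.select_dtypes(include="number").columns
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())

print(missing_before)
print(df)
```

Other fill policies (a constant such as 0, or the column mean) use the same fillna call; which one is appropriate depends on what a missing value means in each column.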

Data Analysis Strategies

As the literature review suggests, only a few techniques have been used to predict the top players and teams, and many more machine learning techniques could be explored in this area.

Models such as Linear Regression, Multiple Linear Regression, KNN and SVM have been tested in this research.

Cricket consists of three main aspects: batting, bowling and fielding. Batting is how the team scores runs, and a number of batting statistics speak to how well a player is performing.

Table 2: Description of variables for the batting dataset

| Variable | Level | Variable Description |
|---|---|---|
| Matches (Mat) | Interval | Number of matches played by the player |
| Innings (Inns) | Interval | Number of innings played by the player (rather than matches) |
| Not Out (NO) | Interval | Number of times a player has been not out in one season |
| Runs | Interval | Total number of runs in one season |
| Highest Score (HS) | Interval | Highest runs in an innings by the player |
| Average (Avg) | Interval | Total number of runs a player has scored divided by the number of times he is out |
| Balls Faced (BF) | Interval | Total number of balls faced to score the runs in one season |
| Strike Rate (SR) | Interval | Average number of runs scored per 100 balls faced |
| Century | Interval | Number of hundreds scored in one season |
| Half Century | Interval | Number of fifties scored in one season |
| 4’s | Interval | Number of fours in all the innings he has played in Twenty20 so far |
| 6’s | Interval | Number of sixes in all the innings |

Runs – A run is a basic unit of batting. The basic objective of batting is to score as many runs as possible.

Batting Average – It is the number of runs scored per dismissal, that is, total runs divided by the number of times the batsman has been out, and is a first measure of the potency of a batsman.

Strike Rate – This is the number of runs scored per 100 balls faced. It gives an idea of how fast the batsman is scoring his runs. Since each team faces only a limited number of balls, scoring runs quickly is important.

Not Outs – It is a measure of the number of times a batsman has played an innings and not gotten out or lost his wicket by the time the innings wrapped up. There are various ways in which a batsman can get out. Along with scoring runs, another objective for batsmen is to protect their wicket or remain not out.

Highest Score – It is the highest number of runs a batsman has scored in an innings in his career.

100’s – The hundred-run mark is considered a milestone in cricket and is called a century. Like runs, the number of centuries is a measure of a batsman’s performance.

50’s – The fifty run mark is also considered a milestone and is referred to as a half-century.

Like the other performance measures, the higher the number, the better the batsman.
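The two derived batting measures defined above can be written out as a short sketch. The function names and the example season are hypothetical; only the formulas come from the definitions in Table 2.

```python
def batting_average(runs, innings, not_outs):
    """Runs per dismissal; None when the batsman was never dismissed."""
    dismissals = innings - not_outs
    if dismissals <= 0:
        return None
    return runs / dismissals


def batting_strike_rate(runs, balls_faced):
    """Runs scored per 100 balls faced; None when no ball was faced."""
    if balls_faced <= 0:
        return None
    return 100.0 * runs / balls_faced


# Hypothetical season: 14 innings, 2 not outs, 480 runs off 320 balls.
avg = batting_average(480, 14, 2)
sr = batting_strike_rate(480, 320)
print(avg, sr)
```

Note that not outs raise the average: the same 480 runs over 14 innings with no not outs would give 480 / 14, roughly 34.3, instead of 480 / 12 = 40.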

Bowling is the other major aspect of cricket. Bowling is how the team takes the wickets or gets the other team out. If the bowling team takes 10 wickets in an innings, the other team’s innings is over and the two teams switch roles.

Table 3: Description of variables for the bowling dataset

| Variable | Level | Variable Description |
|---|---|---|
| Matches (Mat) | Interval | Number of matches played by the player |
| Innings (Inns) | Interval | Number of innings played by the player (rather than matches) |
| Overs (Ov) | Interval | Number of overs bowled in one season |
| Runs | Interval | Total number of runs conceded |
| Wickets (Wkts) | Interval | Number of wickets taken |
| Average (Avg) | Interval | Average number of runs conceded per wicket |
| Economy (Econ) | Interval | Average number of runs conceded per over |
| Strike Rate (SR) | Interval | Average number of balls bowled per wicket taken |
| 4w | Interval | Number of innings in which the bowler took at least four wickets |
| 5w | Interval | Number of innings in which the bowler took at least five wickets |

Overs – An over is a set of six valid balls delivered by a single bowler. A valid ball is a ball whose delivery meets certain specified requirements.

Runs – Runs in bowling statistics refers to the number of runs scored off the bowler’s bowling. A lower number of runs indicates a better performance for a bowler.

Wickets – Taking a wicket means getting a batsman out, which can be done in various ways. The objective of bowling is to get the batsmen out, that is, to take their wickets. Thus, a higher number of wickets indicates a good bowling performance.

4W – This refers to four wickets and records how many times a bowler has taken four or more wickets in a particular match, indicating excellent performance.

Average – The average in bowling refers to the number of runs given by the bowler per wicket taken and is a measure of the consistency of the bowler.

Economy Rate – The economy rate is the number of runs given per over. Since one of the objectives of bowling is to not give runs, this metric gives us an idea of the performance of the bowler.
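The bowling measures in Table 3 follow the same pattern and can be sketched as below. The function names and the example figures are hypothetical; balls are used instead of overs as the input so that partial overs need no special handling.

```python
def bowling_average(runs_conceded, wickets):
    """Runs conceded per wicket taken; None when no wicket was taken."""
    return runs_conceded / wickets if wickets > 0 else None


def economy_rate(runs_conceded, balls_bowled):
    """Runs conceded per over, with one over equal to six valid balls."""
    return runs_conceded / (balls_bowled / 6) if balls_bowled > 0 else None


def bowling_strike_rate(balls_bowled, wickets):
    """Balls bowled per wicket taken; None when no wicket was taken."""
    return balls_bowled / wickets if wickets > 0 else None


# Hypothetical season: 300 valid balls (50 overs), 375 runs, 15 wickets.
b_avg = bowling_average(375, 15)
econ = economy_rate(375, 300)
b_sr = bowling_strike_rate(300, 15)
print(b_avg, econ, b_sr)
```

For all three bowling measures a lower value is better, which is the opposite of the batting measures.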

ANALYSIS

Research Question 1: To find the most important factors/variables to recognise a good batsman and a good bowler.

Multiple Linear Regression analysis with the backward elimination rule was carried out to find the most important factors for recognising a good batsman and a good bowler. All non-significant factors were eliminated following the rule. For the batting regression, the variables matches, innings, not outs, highest score, average, balls faced, strike rate, centuries, half-centuries, fours, sixes and age were included. For the bowling regression, the variables matches, innings, overs, runs, average, economy, strike rate, four-wicket hauls, five-wicket hauls and age were included. In both regressions, age was added as an additional variable that was not used in the previous analysis performed by Krittivas Dalmia. The R code for both regressions can be found in the appendices. The results of the regression analysis are given below.
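The backward elimination rule can be sketched in Python as follows. This is an illustrative sketch, not the R code from the appendices: it assumes NumPy, uses synthetic data in place of the batting table, takes a 0.05 significance threshold as an assumption, and approximates the p-values with the normal distribution rather than the exact t distribution that R reports.

```python
import math
import numpy as np


def ols_p_values(X, y):
    """Fit OLS with an intercept; return coefficients and approximate p-values."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    dof = A.shape[0] - A.shape[1]
    sigma2 = (resid @ resid) / dof
    cov = sigma2 * np.linalg.inv(A.T @ A)
    t_stats = beta / np.sqrt(np.diag(cov))
    # Normal approximation to the two-sided p-value (assumption, see lead-in).
    p = np.array([math.erfc(abs(t) / math.sqrt(2)) for t in t_stats])
    return beta, p


def backward_elimination(X, y, names, alpha=0.05):
    """Repeatedly drop the predictor with the largest p-value above alpha."""
    keep = list(range(X.shape[1]))
    while keep:
        _, p = ols_p_values(X[:, keep], y)
        p_predictors = p[1:]  # skip the intercept
        worst = int(np.argmax(p_predictors))
        if p_predictors[worst] <= alpha:
            break
        del keep[worst]
    return [names[i] for i in keep]


# Synthetic stand-in data: Runs depend on BF and X6s; Age is unrelated noise.
rng = np.random.default_rng(0)
n = 100
BF = rng.uniform(50, 400, n)
X6s = rng.uniform(0, 30, n)
Age = rng.uniform(20, 38, n)
Runs = 1.2 * BF + 3.0 * X6s + rng.normal(0, 5, n)

selected = backward_elimination(
    np.column_stack([BF, X6s, Age]), Runs, ["BF", "X6s", "Age"]
)
print(selected)
```

Each pass refits the model on the remaining predictors, so a variable that looks non-significant alongside a correlated variable can become significant once that variable is removed.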

Dataset: Use ‘IPL_Most_Runs’ and ‘IPL_Most_Wickets’ files for this analysis.

Initial and final results for the batting regression:

Figure 1: Initial Result

After applying the backward elimination rule, the final result was:

Figure 2: Final Result

Initial and final results for the bowling regression:

Figure 3: Initial Result

After applying the backward elimination rule, the final result was:

Figure 4: Final Result

Interpretation of Results:

The final results for both the batting and bowling regressions show that Age, which was expected to be an important factor, has no significance in the shorter version of the game. They also show that balls faced (BF), strike rate (SR), fours (X4s) and sixes (X6s) are highly significant for the target variable (Runs) and should be considered when choosing a batsman for a team. Similarly, overs (Ov), average (Avg), and four- and five-wicket hauls (X4w and X5w) are highly significant and should be considered when looking for bowlers.

Research Question 2: To predict batting average and bowling average for the batters and bowlers.

Algorithms such as K-Nearest Neighbour (KNN), Linear Regression (LRM) and Support Vector Machine (SVM) were used to predict the batting and bowling averages. The statistics of David A. Warner were used to predict his batting average from his past performance, and the statistics of Ashish Nehra were used to predict his bowling average from his past performance.
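To make the KNN approach concrete, here is a minimal pure-Python sketch of K-Nearest Neighbour regression with K=2. The features, training rows and query are hypothetical and are not taken from the deliveries data or from the appendix code.

```python
import math


def knn_predict(train_X, train_y, query, k=2):
    """Average the targets of the k training rows closest to the query."""
    by_distance = sorted(
        (math.dist(row, query), target) for row, target in zip(train_X, train_y)
    )
    nearest = by_distance[:k]
    return sum(target for _, target in nearest) / len(nearest)


# Hypothetical per-season features: (strike rate, balls faced per innings).
train_X = [(120.0, 18.0), (135.0, 24.0), (142.0, 27.0), (150.0, 30.0)]
# Hypothetical batting average for each of those seasons.
train_y = [22.0, 31.0, 38.0, 44.0]

prediction = knn_predict(train_X, train_y, query=(145.0, 28.0), k=2)
print(prediction)
```

The two closest seasons to the query are the third and fourth rows, so the prediction is the mean of their averages. Because KNN relies on distances, features on very different scales would normally be standardised first; that step is omitted here for brevity.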

Dataset: Use ‘deliveries.csv’ file for this analysis

Results for Batting Average: David A. Warner's actual batting average is 40.14. The KNN, SVM and Linear Regression models were tested against this value.
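The reference value of 40.14 follows from the usual definition of batting average: runs divided by dismissals (innings minus not outs). A quick check in Python, using the figures supplied to the models in Appendix 3 (115 innings, 4014 runs, 15 not outs); the function name is chosen here for illustration:

```python
def batting_average(runs, innings, not_outs):
    """Runs scored per dismissal; returns 0.0 if the batsman was never dismissed."""
    dismissals = innings - not_outs
    return runs / dismissals if dismissals else 0.0


warner_average = batting_average(runs=4014, innings=115, not_outs=15)
print(round(warner_average, 2))
```

Returning 0.0 for a batsman who was never dismissed mirrors the appendix code, which replaces infinite averages with 0.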

Figure 5: Result of KNN

Figure 6: Result of LRM

Figure 7: Result of SVM

Results for Bowling Average: Ashish Nehra's actual bowling average is 24. The KNN, SVM and Linear Regression models were tested against this value.
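Bowling average is runs conceded per wicket taken. Assuming the first two figures supplied to the models in Appendix 4 are Nehra's wickets (106) and runs conceded (2537), a reading inferred here from the order of the feature columns rather than stated in the source, the reference value of about 24 can be checked as follows:

```python
def bowling_average(runs_conceded, wickets):
    """Runs conceded per wicket taken; None for a bowler with no wickets."""
    return runs_conceded / wickets if wickets else None


nehra_average = bowling_average(runs_conceded=2537, wickets=106)
print(round(nehra_average, 2))
```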

Figure 8: Result of KNN

Figure 9: Result of LRM

Figure 10: Result of SVM

Interpretation of Results:

The results show that the K-Nearest Neighbour algorithm (K = 2) worked better than Linear Regression and Support Vector Machine in predicting both batting and bowling averages. Linear Regression failed drastically in both cases. SVM performed about as well as KNN in predicting the batting average, but failed to do so for the bowling average.

RESULT TABLE

Sr. no | Player | Measure | Real Value | KNN | Linear Regression | SVM
1 | David Warner | Batting Average | 40.14 | 35.96 | -126.24 | 43.67
2 | Ashish Nehra | Bowling Average | 24 | 28.25 | 17.51 | 16.12
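The gap between each model's output and the real value can be read off the result table with a short calculation. The sketch below only restates the table's figures; the dictionary layout and variable names are chosen here for illustration.

```python
results = {
    "David Warner (batting average)": {
        "real": 40.14, "KNN": 35.96, "Linear Regression": -126.24, "SVM": 43.67,
    },
    "Ashish Nehra (bowling average)": {
        "real": 24.0, "KNN": 28.25, "Linear Regression": 17.51, "SVM": 16.12,
    },
}

errors = {}
for player, row in results.items():
    real = row["real"]
    # Absolute error of each model's prediction against the real value.
    errors[player] = {
        model: round(abs(pred - real), 2)
        for model, pred in row.items()
        if model != "real"
    }
    print(player, errors[player])
```

With a single player per task, these gaps illustrate the comparison but cannot by themselves rank the models; the test-set scores printed by the appendix code are the broader measure.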

Conclusion

This research suggests that the K-Nearest Neighbour model performs comparatively better than Linear Regression and Support Vector Machine. KNN gave more accurate estimates of the measures to consider when selecting a player for a team than the LR and SVM models did.

Future work should test other factors, such as captaincy (the ability to lead the team), performance in other formats of the game, playing conditions and venues, to find the best players. Considering all these variables may result in better models.

References

COLIN CANNONIER, B. P., SUDIPTA SARANGI. September 24, 2013. 20-over versus 50-over cricket: is there a difference? Journal of Sports Economics.

DALMIA, K. May 2010. The Indian Premier League: Pay versus Performance. Leonard N. Stern School of Business.

DAVID PARKER, P. B. A. H. N. 2008. Player valuations in the Indian Premier League.

FOUNDATION, I. B. E. 2017. IPL brand valuation rises to USD 5.3 billion: Duff & Phelps [Online]. Available: https://www.ibef.org/news/ipl-brand-valuation-rises-to-usd-53-billion-duff-phelps

ICC. International Cricket Council [Online]. Available: https://www.icc-cricket.com/

INDIA, B. O. C. F. C. I. IPL [Online]. Available: http://www.iplt20.com/auction/2018

KARNIK, A. 2009. Valuing Cricketers Using Hedonic Price Models.

LEES, A. Dec 2010. Technique analysis in sports: a critical review. Journal of Sports Sciences, Volume 20, 2002, 813-828.

PANKUSH KALGOTRA, R. S., GOUTAM CHAKRABORTY. 2013. Predictive Modeling in Sports Leagues: An Application in Indian Premier League. SAS Global Forum 2013.

RAGHUNATH. 2017. IPL Cricket data [Online]. Available: https://data.world/raghu543/ipl-data-till-2016-set-of-csv-files

SANJEET SINGH, S. G., VIBHOR GUPTA. 2010. Dynamic Bidding Strategy for Players Auction in IPL.

SINGH, S. 2011. Measuring the Performance of Teams in the Indian Premier League.

SUBRAMANIAN RAMA IYER, R. S. 2009. Prediction of athletes performance using neural networks: An application in cricket team selection.

THOMAS CALLAGHAN, P. J. M., MASON A. PORTER. 2004. The Bowl Championship Series: A Mathematical Review.

WILLIAM G. HOPKINS, S. W. M., ALAN M. BATTERHAM, JURI HANIN. 2009. Progressive Statistics for Studies in Sports Medicine and Exercise Science. Analysis.

List of Appendices

Appendix 1: R Code for Batting Regression

Appendix 2: R Code for Bowling Regression

Appendix 3: Python Code for Batting Average

Appendix 4: Python Code for Bowling Average

Appendix 1: R CODE FOR BATTING REGRESSION

# Importing the dataset
dataset1 = read.csv('IPL_MOst_Runs.csv')
# Keeping columns 3 to 15 only
dataset1 = dataset1[3:15]

# Filling the missing data using the "mean" function
dataset1$Avg = ifelse(is.na(dataset1$Avg),
                      ave(dataset1$Avg, FUN = function(x) mean(x, na.rm = TRUE)),
                      dataset1$Avg)

# Splitting the dataset into the Training set and Test set
# install.packages('caTools')
library(caTools)
split = sample.split(dataset1$Runs, SplitRatio = 0.8)
training_set1 = subset(dataset1, split == TRUE)
test_set1 = subset(dataset1, split == FALSE)

# Fitting Multiple Linear Regression to the Training set
regressor1 = lm(formula = Runs ~ Mat + Inns + NO + HS + Avg + BF + SR + X100 + X50 + X4s + X6s + Age,
                data = training_set1)
summary(regressor1)

# Predicting the Test set results
y_pred1 = predict(regressor1, newdata = test_set1)

# Building the optimal model using Backward Elimination (SL = 0.05 or 5%)
# X100 has a p-value of 69%, so it is eliminated
regressor1 = lm(formula = Runs ~ Mat + Inns + NO + HS + Avg + BF + SR + X50 + X4s + X6s + Age,
                data = training_set1)
summary(regressor1)

# Mat (matches) has a p-value of 41%, so it is eliminated
regressor1 = lm(formula = Runs ~ Inns + NO + HS + Avg + BF + SR + X50 + X4s + X6s + Age,
                data = training_set1)
summary(regressor1)

# Inns has a p-value of 27%, so it is eliminated
regressor1 = lm(formula = Runs ~ NO + HS + Avg + BF + SR + X50 + X4s + X6s + Age,
                data = training_set1)
summary(regressor1)

# Age has a p-value of 16%, so it is eliminated
regressor1 = lm(formula = Runs ~ NO + HS + Avg + BF + SR + X50 + X4s + X6s,
                data = training_set1)
summary(regressor1)

# HS (Highest Score) is eliminated because it overfits the model
regressor1 = lm(formula = Runs ~ NO + Avg + BF + SR + X50 + X4s + X6s,
                data = training_set1)
summary(regressor1)

Appendix 2: R CODE FOR BOWLING REGRESSION

# Importing the dataset
dataset = read.csv('IPL_Most_Wickets.csv')
# Keeping columns 3 to 14 only
dataset = dataset[3:14]

# Taking care of missing data by filling with the column mean
dataset$Avg = ifelse(is.na(dataset$Avg),
                     ave(dataset$Avg, FUN = function(x) mean(x, na.rm = TRUE)),
                     dataset$Avg)
dataset$SR = ifelse(is.na(dataset$SR),
                    ave(dataset$SR, FUN = function(x) mean(x, na.rm = TRUE)),
                    dataset$SR)

# Splitting the dataset into the Training set and Test set
# install.packages('caTools')
library(caTools)
split = sample.split(dataset$Wkts, SplitRatio = 0.8)
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)

# Fitting Multiple Linear Regression to the Training set
regressor = lm(formula = Wkts ~ Mat + Inns + Ov + Runs + Avg + Econ + SR + X4w + X5w + AGE,
               data = training_set)
summary(regressor)

# Predicting the Test set results
y_pred = predict(regressor, newdata = test_set)
summary(regressor)

# Building the optimal model using Backward Elimination (SL = 0.05 or 5%)
# Mat (matches) has a p-value of 95%, so it is eliminated
regressor = lm(formula = Wkts ~ Inns + Ov + Runs + Avg + Econ + SR + X4w + X5w + AGE,
               data = training_set)
summary(regressor)

# Second elimination: Econ (p-value of 90%)
regressor = lm(formula = Wkts ~ Inns + Ov + Runs + Avg + SR + X4w + X5w + AGE,
               data = training_set)
summary(regressor)

# Runs has a p-value of 59%, so it is removed
regressor = lm(formula = Wkts ~ Inns + Ov + Avg + SR + X4w + X5w + AGE,
               data = training_set)
summary(regressor)

# Inns does not have a significant impact (p-value of 46%)
regressor = lm(formula = Wkts ~ Ov + Avg + SR + X4w + X5w + AGE,
               data = training_set)
summary(regressor)

# SR has a p-value of 22%, which is high enough to eliminate
regressor = lm(formula = Wkts ~ Ov + Avg + X4w + X5w + AGE,
               data = training_set)
summary(regressor)

# AGE has a p-value of 10%, which is high enough to eliminate
regressor = lm(formula = Wkts ~ Ov + Avg + X4w + X5w,
               data = training_set)
summary(regressor)

Appendix 3: PYTHON CODE FOR BATTING AVERAGE

# Predicting batting average of a batsman

# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset
matches = pd.read_csv('C:/Users/Sumit/Desktop/Data Analysis/matches.csv')
deliveries = pd.read_csv('C:/Users/Sumit/Desktop/Data Analysis/deliveries.csv')

# Batsmen runs: total runs per batsman per innings of each match
batsmen = deliveries.groupby(["match_id", "inning", "batting_team", "batsman"])
batsmen = batsmen["batsman_runs"].sum().reset_index()

# Dismissals
dismissals = deliveries[pd.notnull(deliveries['player_dismissed'])]
dismissals = dismissals[['match_id', 'inning', 'player_dismissed', 'dismissal_kind', 'fielder']].rename(columns={'player_dismissed': 'batsman'})
batsmen = batsmen.merge(dismissals, left_on=['match_id', 'inning', 'batsman'], right_on=['match_id', 'inning', 'batsman'], how="left")
batsmen['dismissal_kind'] = batsmen.dismissal_kind.fillna('not_out')
batsmen['fielder'] = batsmen.fielder.fillna('-')

# Number of innings
no_of_innings = batsmen.groupby(['inning', 'batsman']).size().reset_index()
no_of_innings = no_of_innings.groupby('batsman').sum().reset_index()
no_of_innings = no_of_innings.drop('inning', axis=1)
no_of_innings.rename(columns={0: 'no_of_innings'}, inplace=True)

# Dismissal types
dismissal_group = batsmen[['dismissal_kind', 'batsman']]
dismissal_group = dismissal_group.groupby(['batsman', 'dismissal_kind']).size().reset_index()
# The size column is named 0 (an integer), so this rename with the string key '0' leaves it unchanged
dismissal_group.rename(columns={'0': 'No_of_times'}, inplace=True)
not_outs = dismissal_group[dismissal_group['dismissal_kind'] == 'not_out']
not_outs = not_outs.drop('dismissal_kind', axis=1)
not_outs.rename(columns={0: 'no_of_not_outs'}, inplace=True)

# batting_average = runs / (innings - not_outs)
# numeric_only=True keeps text columns (team, dismissal kind, fielder) out of the sum
total_runs = batsmen.groupby('batsman').sum(numeric_only=True).reset_index()
total_runs = total_runs.drop(['inning', 'match_id'], axis=1)
batsmen_overall = no_of_innings.merge(total_runs, on='batsman')
batsmen_overall = batsmen_overall.merge(not_outs, on='batsman')
batsmen_overall['batting_average'] = batsmen_overall['batsman_runs'] / (batsmen_overall['no_of_innings'] - batsmen_overall['no_of_not_outs'])
# A batsman who was never dismissed gets an infinite average; replace it with 0
batsmen_overall = batsmen_overall.replace([np.inf, -np.inf], 0)
batsmen_overall.rename(columns={'batsman': 'player'}, inplace=True)

#A = list(range(len(batsmen_overall)))
#plt.scatter(A, batsmen_overall['batting_average'], color='g')
#plt.xlabel('Player')
#plt.ylabel('Average')
#plt.show()

# Columns: player, innings, runs, no_of_not_outs, batting_average
player_overall = batsmen_overall

# Filling the missing values
playerdf = player_overall
playerdf.fillna(-9999, inplace=True)

# Loading the libraries
# train_test_split is provided by sklearn.model_selection
from sklearn import preprocessing, model_selection, neighbors
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn import svm

# Splitting the dataset into the Training set and Test set
#X1 = playerdf.drop(['player','batting_average','total_wickets','total_runs_in_different_ways','bowling_average','total_balls','economy','total_overs'], axis=1)
X1 = playerdf.drop(['player', 'batting_average'], axis=1)
y1 = np.asarray(playerdf['batting_average'], dtype=float)
X1_train, X1_test, y1_train, y1_test = model_selection.train_test_split(X1, y1, test_size=0.2)

########################################################################
# Batting Average Prediction
## David Warner's statistics and his average = 40.14
example_measures = np.array([115, 4014, 15])  ## I took innings, runs and no_of_not_outs
example_measures = example_measures.reshape(1, -1)

# KNN
n_neighbors = 2
clf1 = neighbors.KNeighborsRegressor(n_neighbors=n_neighbors)
clf1.fit(X1_train.values, y1_train)
# score() returns the coefficient of determination (R^2) on the test set
accuracy1 = clf1.score(X1_test.values, y1_test)
print(accuracy1)
prediction = clf1.predict(example_measures)
print(prediction)

# Linear Regression
clf = LinearRegression()
clf.fit(X1_train.values, y1_train)
accuracy = clf.score(X1_test.values, y1_test)
print(accuracy)
prediction = clf.predict(example_measures)
print(prediction)

# SVM
clf = svm.SVR(kernel='linear')
clf.fit(X1_train.values, y1_train)
accuracy = clf.score(X1_test.values, y1_test)
print(accuracy)
prediction = clf.predict(example_measures)
print(prediction)

APPENDIX 4: PYTHON CODE FOR BOWLING AVERAGE

# Predicting bowling average of a bowler

# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset
matches = pd.read_csv('C:/Users/Sumit/Desktop/Data Analysis/matches.csv')
deliveries = pd.read_csv('C:/Users/Sumit/Desktop/Data Analysis/deliveries.csv')

# Batsmen runs: total runs per batsman per innings of each match
batsmen = deliveries.groupby(["match_id", "inning", "batting_team", "batsman"])
batsmen = batsmen["batsman_runs"].sum().reset_index()

# Dismissals
dismissals = deliveries[pd.notnull(deliveries['player_dismissed'])]
dismissals = dismissals[['match_id', 'inning', 'player_dismissed', 'dismissal_kind', 'fielder']].rename(columns={'player_dismissed': 'batsman'})
batsmen = batsmen.merge(dismissals, left_on=['match_id', 'inning', 'batsman'], right_on=['match_id', 'inning', 'batsman'], how="left")
batsmen['dismissal_kind'] = batsmen.dismissal_kind.fillna('not_out')
batsmen['fielder'] = batsmen.fielder.fillna('-')

# Number of innings
no_of_innings = batsmen.groupby(['inning', 'batsman']).size().reset_index()
no_of_innings = no_of_innings.groupby('batsman').sum().reset_index()
no_of_innings = no_of_innings.drop('inning', axis=1)
no_of_innings.rename(columns={0: 'no_of_innings'}, inplace=True)

# Dismissal types
dismissal_group = batsmen[['dismissal_kind', 'batsman']]
dismissal_group = dismissal_group.groupby(['batsman', 'dismissal_kind']).size().reset_index()
# The size column is named 0 (an integer), so this rename with the string key '0' leaves it unchanged
dismissal_group.rename(columns={'0': 'No_of_times'}, inplace=True)
not_outs = dismissal_group[dismissal_group['dismissal_kind'] == 'not_out']
not_outs = not_outs.drop('dismissal_kind', axis=1)
not_outs.rename(columns={0: 'no_of_not_outs'}, inplace=True)

# batting_average = runs / (innings - not_outs)
# numeric_only=True keeps text columns (team, dismissal kind, fielder) out of the sum
total_runs = batsmen.groupby('batsman').sum(numeric_only=True).reset_index()
total_runs = total_runs.drop(['inning', 'match_id'], axis=1)
batsmen_overall = no_of_innings.merge(total_runs, on='batsman')
batsmen_overall = batsmen_overall.merge(not_outs, on='batsman')
batsmen_overall['batting_average'] = batsmen_overall['batsman_runs'] / (batsmen_overall['no_of_innings'] - batsmen_overall['no_of_not_outs'])
batsmen_overall = batsmen_overall.replace([np.inf, -np.inf], 0)
batsmen_overall.rename(columns={'batsman': 'player'}, inplace=True)

A = list(range(len(batsmen_overall)))
plt.scatter(A, batsmen_overall['batting_average'], color='g')
plt.xlabel('Player')
plt.ylabel('Average')
plt.show()

# Bowlers
bowlers = deliveries[['ball', 'bowler', 'extra_runs', 'total_runs', 'player_dismissed', 'dismissal_kind', 'fielder']]

# Overs: count deliveries per bowler and ball number
total_overs = bowlers.groupby(['bowler', 'ball']).size().reset_index()
total_overs.rename(columns={0: 'no_of_each_balls'}, inplace=True)
total_overs['total_balls'] = 0
for i in range(0, len(total_overs)):
    if total_overs.ball.iloc[i] == 1:
        # On each bowler's ball-1 row, store the sum of this row and the next five rows
        total_overs.loc[i, 'total_balls'] = total_overs['no_of_each_balls'].iloc[i:i + 6].sum()

# Runs conceded
runs_conceded = bowlers.groupby(['bowler', 'total_runs']).size().reset_index()
runs_conceded.rename(columns={0: 'runs_in_different_ways'}, inplace=True)
runs_conceded['total_runs_in_different_ways'] = runs_conceded['runs_in_different_ways'] * runs_conceded['total_runs']
runs_conceded = runs_conceded[['bowler', 'total_runs_in_different_ways']]
runs_conceded = runs_conceded.groupby('bowler').sum().reset_index()

# Wickets (run outs are not credited to the bowler)
wickets = bowlers[['bowler', 'dismissal_kind']]
wickets = wickets.dropna()
wickets = wickets[wickets.dismissal_kind != 'run out']
wickets = wickets.groupby('bowler').size().reset_index()
wickets.rename(columns={0: 'total_wickets'}, inplace=True)

# Bowling average = runs conceded / wickets taken
Bowling_average = runs_conceded.merge(wickets, on='bowler', how='left')
Bowling_average['bowling_average'] = Bowling_average['total_runs_in_different_ways'] / Bowling_average['total_wickets']

# Economy = runs / overs
Economy = runs_conceded.merge(total_overs, on='bowler', how='left')
Economy = Economy[Economy.total_balls != 0]
Economy = Economy[['bowler', 'total_balls', 'total_runs_in_different_ways']].copy()
Economy['total_overs'] = Economy['total_balls'] / 6
Economy['economy'] = Economy['total_runs_in_different_ways'] / Economy['total_overs']

# Columns: player, runs, batting_average, wickets, bowling_average, economy
bowler_overall = wickets.merge(Bowling_average[['bowler', 'total_runs_in_different_ways', 'bowling_average']], on='bowler', how='left')
bowler_overall = bowler_overall.merge(Economy[['bowler', 'total_balls', 'economy', 'total_overs']], on='bowler', how='left')
bowler_overall.rename(columns={'bowler': 'player'}, inplace=True)
player_overall = batsmen_overall.merge(bowler_overall, on='player', how='left')

# Filling the missing values
playerdf11 = bowler_overall
playerdf11.fillna(-9999, inplace=True)

# Loading the libraries
# train_test_split is provided by sklearn.model_selection
from sklearn import preprocessing, model_selection, neighbors
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn import svm

# Splitting the dataset into the Training set and Test set
X1 = playerdf11.drop(['player', 'total_balls', 'bowling_average', 'total_overs'], axis=1)
y1 = np.asarray(playerdf11['bowling_average'], dtype=float)
X1_train, X1_test, y1_train, y1_test = model_selection.train_test_split(X1, y1, test_size=0.2)

########################################################################
# Bowling Average Prediction
## A Nehra statistics and his average = 24
## Feature order of X1: total_wickets, total_runs_in_different_ways, economy
example_measures = np.array([106, 2537, 8])
example_measures = example_measures.reshape(1, -1)

# KNN
n_neighbors = 2
clf1 = neighbors.KNeighborsRegressor(n_neighbors=n_neighbors)
clf1.fit(X1_train.values, y1_train)
# score() returns the coefficient of determination (R^2) on the test set
accuracy1 = clf1.score(X1_test.values, y1_test)
print(accuracy1)
prediction = clf1.predict(example_measures)
print(prediction)

# Linear Regression
clf = LinearRegression()
clf.fit(X1_train.values, y1_train)
accuracy = clf.score(X1_test.values, y1_test)
print(accuracy)
prediction = clf.predict(example_measures)
print(prediction)

# SVM
clf = svm.SVR(kernel='linear')
clf.fit(X1_train.values, y1_train)
accuracy = clf.score(X1_test.values, y1_test)
print(accuracy)
prediction = clf.predict(example_measures)
print(prediction)