Archive for the ‘Sabermetrics’ Category

# install R 
sudo DEBIAN_FRONTEND=noninteractive add-apt-repository ppa:marutter/rrutter
sudo DEBIAN_FRONTEND=noninteractive add-apt-repository ppa:marutter/c2d4u
sudo DEBIAN_FRONTEND=noninteractive apt-get update
sudo DEBIAN_FRONTEND=noninteractive apt-get --yes --force-yes install r-base-core r-base
# install RStudio :
# Free disk space required: around 5 GB
# Mac OS X users should use RStudio instead of R to avoid the following UNIX child process forking error:
# THE_PROCESS_HAS_FORKED_AND_YOU_CANNOT_USE_THIS_COREFOUNDATION_FUNCTIONALITY_YOU_MUST_EXEC__() to debug.
MACHINE_TYPE=`uname -m`
cd /tmp
rm rstudio*.deb
rm index.html
if [ ${MACHINE_TYPE} == 'x86_64' ]; then
 # 64-bit stuff here
sudo DEBIAN_FRONTEND=noninteractive apt-get --yes --force-yes install gdebi-core pandoc libssl0.9.8 libapparmor1
wget --no-check-certificate http://www.rstudio.com/products/rstudio/download/
wget --no-check-certificate `cat index.html|grep -v tar|grep amd64\.deb|cut -d"\"" -f2`
sudo dpkg -i rstudio*.deb
sudo DEBIAN_FRONTEND=noninteractive apt-get --yes --force-yes -f install
else
 # 32-bit stuff here
sudo DEBIAN_FRONTEND=noninteractive apt-get --yes --force-yes install gdebi-core pandoc libssl0.9.8 libapparmor1
wget --no-check-certificate http://www.rstudio.com/products/rstudio/download/
wget --no-check-certificate `cat index.html|grep -v tar|grep i386\.deb|cut -d"\"" -f2`
sudo dpkg -i rstudio*.deb
sudo DEBIAN_FRONTEND=noninteractive apt-get --yes --force-yes -f install
fi
cd $HOME
# troubleshooting information to check the rstudio installation:
uname -m
file /usr/lib/rstudio/bin/rstudio
ldd `which rstudio`
Advertisements

My previous posts were about accessing the lahman database, which you can find here:

https://courses.edx.org/courses/BUx/SABR101x/2T2014/discussion/forum/i4x-BUx-SABR101x-course-2T2014/threads/5398867e95b062beb8000ca4

and here:

https://courses.edx.org/courses/BUx/SABR101x/2T2014/discussion/forum/c31cc5c16ee24e52b1465ba3fad7e2ac/threads/53ac612786d329c9f70004c7

This post describes how to create a locally installed MySQL retrosheet database and how to access the retrosheet MySQL database from R or RStudio. Hope it is of use to someone.

My system configuration has changed a little and is now as follows:

* Dell Latitude E6500 4GB RAM
* GNU/Linux kernel 3.16.0-031600rc2-generic #201406220135 SMP Sun Jun 22 05:36:21 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
* Ubuntu 14.04 LTS 64-bit
* lxqt-session version 0.7.0+bzr200+201406301231~ubuntu14.04.1
* R version 3.1.0 (2014-04-10)
* Platform: x86_64-pc-linux-gnu (64-bit)
* RStudio 0.98.953
* RMySQL R package manually installed via RStudio GUI
* MySQL server Ver 14.14 Distrib 5.6.17, for debian-linux-gnu (x86_64) using EditLine wrapper

I configured the .my.cnf configuration file in my home directory as follows to avoid having to mention any passwords in any .R scripts where MySQL connections are made to the retrosheet database. I did chmod 600 .my.cnf to limit the access rights to the .my.cnf configuration file in my home directory.

[client]
user=root
password=put right password here

[mysql]
user=root
password=put right password here

[mysqldump]
user=root
password=put right password here

[mysqldiff]
user=root

#############################################################
# How to access Retrosheet DB in locally stored MySQL DB: #
#############################################################
# data set sources:
# https://github.com/alexreisner/baseball_data/blob/master/retrosheet/games.sql
# ‘Multiple regular season data compiled in Zip Archives’ downloaded from here:
# http://www.retrosheet.org/gamelogs/

# Reproducible research procedure:

# create local MySQL database called retrosheet by running this command:
# mysqladmin create retrosheet
# run following Terminal command in your home directory:
# git clone https://github.com/alexreisner/baseball_data.git
# run following Terminal command in the location where the file games.sql
# is stored to create correct columnnames in retrosheet database:
# (games.sql is part of previous baseball_data.git git clone operation)
# mysql -u root retrosheet < games.sql
# then unzip
# http://www.retrosheet.org/gamelogs/gl1900_19.zip
# and http://www.retrosheet.org/gamelogs/gl1940_59.zip
# and http://www.retrosheet.org/gamelogs/gl2000_09.zip
# and http://www.retrosheet.org/gamelogs/gl2010_13.zip
# into same location where games.sql is stored
# then run an SQL statement similar to the following one (within MySQL DB, not in R)
# to insert the unzipped .TXT files from gl2010_13.zip into the retrosheet database:
#LOAD DATA LOCAL INFILE ‘GL2010.TXT’
#INTO TABLE games
#FIELDS TERMINATED BY ‘,’
#OPTIONALLY ENCLOSED BY ‘”‘
#LINES TERMINATED BY ‘\n’
#IGNORE 0 LINES;

# repeat import for each GL<year>.TXT file and each gl19YY_ZZ.zip that you are going to use for SQL queries
# after that, you can connect to the local MySQL retrosheet DB via R and execute SQL queries via R

Here is the .R script that I use in Rstudio to connect to a locally installed copy of the retrosheet database running on a MySQL server in Ubuntu 14.04 LTS.

library(RMySQL)
retrosheetDb = dbConnect(MySQL(), user=”root”,
host=”localhost”)
result = dbGetQuery(retrosheetDb,”show databases;”);dbDisconnect(retrosheetDb);
result
retrosheet = dbConnect(MySQL(), user=”root”, db=”retrosheet”,
host=”localhost”)
allTables = dbListTables(retrosheet)
length(allTables)
allTables

Here are sample SQL queries in RStudio:

# beware the use of ” vers ‘ in SQL queries inside RStudio!
# ”works but not “”
# correct examples:

dbGetQuery(retrosheet,”
SELECT visitor_ab, home_ab,
visitor_hr,home_hr,
home,visitor
FROM games
WHERE YEAR(date) >=2009
AND ((visitor = ‘BOS’
OR home = ‘BOS’) OR
(visitor = ‘CLE’
OR home = ‘CLE’))
;”)

dbGetQuery(retrosheet,”
SELECT Year(date) as YR,
Sum(CASE
WHEN ( home_score > visitor_score) THEN 1
ELSE 0
END) AS HM_WINS,
Count(*) AS TOTAL_GMS,
Sum(CASE
WHEN ( home_score > visitor_score) THEN 1
ELSE 0
END) / Count(*) AS HM_WINNING_PCT
FROM games
GROUP BY YR ASC
HAVING YR>1900
LIMIT 40
;”)

When submitting a new SQL query in RStudio, make to sure to insert the line

dbGetQuery(retrosheet,”

right before the SQL query and

;”)

right after the SQL query, if your database name is called retrosheet (like mine).

This post describes how to perform reproducible research using Ubuntu 14.04 LTS, R, RStudio, MySQL and the Lahman database.

My system configuration is as follows:

* Dell Latitude E6500 4GB RAM
* GNU/Linux kernel 3.13.0-29-generic
* Ubuntu 14.04 LTS 64-bit
* lxqt desktop environment version 0.0.1+bzr6+201405140118~ubuntu14.04.1
* R version 3.1.0 (2014-04-10)
* Platform: x86_64-pc-linux-gnu (64-bit)
* RStudio 0.98.507
* RMySQL R package manually installed via RStudio GUI
* MySQL server Ver 14.14 Distrib 5.6.17, for debian-linux-gnu (x86_64) using EditLine wrapper

I configured the .my.cnf configuration file in my home directory as follows to avoid having to mention any passwords in any .R scripts where MySQL connections are made to the Lahman database. I did chmod 600 .my.cnf to limit the access rights to the .my.cnf configuration file in my home directory.

[client]
user=root
password=put right password here

[mysql]
user=root
password=put right password here

[mysqldump]
user=root
password=put right password here

[mysqldiff]
user=root

Here is the .R script that I use in Rstudio to connect to a locally installed copy of the Lahman database running on a MySQL server in Ubuntu 14.04 LTS.

library(RMySQL)
lahmanDb = dbConnect(MySQL(), user=”root”,
host=”localhost”)
result = dbGetQuery(lahmanDb,”show databases;”);dbDisconnect(lahmanDb);
result
lahman = dbConnect(MySQL(), user=”root”, db=”lahman”,
host=”localhost”)
allTables = dbListTables(lahman)
length(allTables)
allTables

Here is a sample SQL query in RStudio:

dbGetQuery(lahman,”
EXPLAIN SELECT teamID, Batting.playerID,
nameFirst AS FIRST_NAME,
nameLast AS LAST_NAME,
SUM(H) as CareerH,
SUM(AB) as CareerAB,
SUM(H)/SUM(AB) as CareerBA,
(SUM(H)+SUM(BB)+SUM(HBP))/(SUM(AB)+SUM(BB)+SUM(HBP)+SUM(SF)) as CareerOBP,
(SUM(H)+SUM(2B)+SUM(2*3B)+SUM(3*HR))/SUM(AB) as CareerSLG,
((SUM(H)+SUM(BB)+SUM(HBP))/(SUM(AB)+SUM(BB)+SUM(HBP)+SUM(SF)) +
(SUM(H)+SUM(2B)+SUM(2*3B)+SUM(3*HR))/SUM(AB)) as CareerOPS,
min(yearID) as Rookie,
max(yearID) as LastSeason
FROM Batting
INNER JOIN Master
# join on primary key playerID in Batting and Master tables
ON Batting.playerID = Master.playerID
WHERE Batting.playerID IN (SELECT DISTINCT Batting.playerID FROM Batting WHERE yearID >= 1980)
GROUP BY Batting.playerID
HAVING CareerAB >= 500
ORDER BY CareerOPS DESC
LIMIT 50
;”)

When submitting a new SQL query in RStudio, make to sure to insert the line

dbGetQuery(lahman,”

right before the SQL query and

;”)

right after the SQL query, if your database name is called lahman (like mine).