reproducible research using Ubuntu 14.04 LTS, R, RStudio, MySQL and retrosheet database

Posted: 2014/07/19 in Data Science, Sabermetrics, Ubuntu

My previous posts were about accessing the lahman database, which you can find here:

https://courses.edx.org/courses/BUx/SABR101x/2T2014/discussion/forum/i4x-BUx-SABR101x-course-2T2014/threads/5398867e95b062beb8000ca4

and here:

https://courses.edx.org/courses/BUx/SABR101x/2T2014/discussion/forum/c31cc5c16ee24e52b1465ba3fad7e2ac/threads/53ac612786d329c9f70004c7

This post describes how to create a locally installed MySQL retrosheet database and how to access the retrosheet MySQL database from R or RStudio. Hope it is of use to someone.

My system configuration has changed a little and is now as follows:

* Dell Latitude E6500 4GB RAM
* GNU/Linux kernel 3.16.0-031600rc2-generic #201406220135 SMP Sun Jun 22 05:36:21 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
* Ubuntu 14.04 LTS 64-bit
* lxqt-session version 0.7.0+bzr200+201406301231~ubuntu14.04.1
* R version 3.1.0 (2014-04-10)
* Platform: x86_64-pc-linux-gnu (64-bit)
* RStudio 0.98.953
* RMySQL R package manually installed via RStudio GUI
* MySQL server Ver 14.14 Distrib 5.6.17, for debian-linux-gnu (x86_64) using EditLine wrapper

I configured the .my.cnf configuration file in my home directory as follows to avoid having to mention any passwords in any .R scripts where MySQL connections are made to the retrosheet database. I did chmod 600 .my.cnf to limit the access rights to the .my.cnf configuration file in my home directory.

[client]
user=root
password=put right password here

[mysql]
user=root
password=put right password here

[mysqldump]
user=root
password=put right password here

[mysqldiff]
user=root

#############################################################
# How to access Retrosheet DB in locally stored MySQL DB: #
#############################################################
# data set sources:
# https://github.com/alexreisner/baseball_data/blob/master/retrosheet/games.sql
# ‘Multiple regular season data compiled in Zip Archives’ downloaded from here:
# http://www.retrosheet.org/gamelogs/

# Reproducible research procedure:

# create local MySQL database called retrosheet by running this command:
# mysqladmin create retrosheet
# run following Terminal command in your home directory:
# git clone https://github.com/alexreisner/baseball_data.git
# run following Terminal command in the location where the file games.sql
# is stored to create correct columnnames in retrosheet database:
# (games.sql is part of previous baseball_data.git git clone operation)
# mysql -u root retrosheet < games.sql
# then unzip
# http://www.retrosheet.org/gamelogs/gl1900_19.zip
# and http://www.retrosheet.org/gamelogs/gl1940_59.zip
# and http://www.retrosheet.org/gamelogs/gl2000_09.zip
# and http://www.retrosheet.org/gamelogs/gl2010_13.zip
# into same location where games.sql is stored
# then run an SQL statement similar to the following one (within MySQL DB, not in R)
# to insert the unzipped .TXT files from gl2010_13.zip into the retrosheet database:
#LOAD DATA LOCAL INFILE ‘GL2010.TXT’
#INTO TABLE games
#FIELDS TERMINATED BY ‘,’
#OPTIONALLY ENCLOSED BY ‘”‘
#LINES TERMINATED BY ‘\n’
#IGNORE 0 LINES;

# repeat import for each GL<year>.TXT file and each gl19YY_ZZ.zip that you are going to use for SQL queries
# after that, you can connect to the local MySQL retrosheet DB via R and execute SQL queries via R

Here is the .R script that I use in Rstudio to connect to a locally installed copy of the retrosheet database running on a MySQL server in Ubuntu 14.04 LTS.

library(RMySQL)
retrosheetDb = dbConnect(MySQL(), user=”root”,
host=”localhost”)
result = dbGetQuery(retrosheetDb,”show databases;”);dbDisconnect(retrosheetDb);
result
retrosheet = dbConnect(MySQL(), user=”root”, db=”retrosheet”,
host=”localhost”)
allTables = dbListTables(retrosheet)
length(allTables)
allTables

Here are sample SQL queries in RStudio:

# beware the use of ” vers ‘ in SQL queries inside RStudio!
# ”works but not “”
# correct examples:

dbGetQuery(retrosheet,”
SELECT visitor_ab, home_ab,
visitor_hr,home_hr,
home,visitor
FROM games
WHERE YEAR(date) >=2009
AND ((visitor = ‘BOS’
OR home = ‘BOS’) OR
(visitor = ‘CLE’
OR home = ‘CLE’))
;”)

dbGetQuery(retrosheet,”
SELECT Year(date) as YR,
Sum(CASE
WHEN ( home_score > visitor_score) THEN 1
ELSE 0
END) AS HM_WINS,
Count(*) AS TOTAL_GMS,
Sum(CASE
WHEN ( home_score > visitor_score) THEN 1
ELSE 0
END) / Count(*) AS HM_WINNING_PCT
FROM games
GROUP BY YR ASC
HAVING YR>1900
LIMIT 40
;”)

When submitting a new SQL query in RStudio, make to sure to insert the line

dbGetQuery(retrosheet,”

right before the SQL query and

;”)

right after the SQL query, if your database name is called retrosheet (like mine).

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s