reproducible research using Ubuntu 14.04 LTS, R, RStudio, MySQL and Lahman database

Posted: 2014/07/19 in Data Science, Sabermetrics, Ubuntu

This post describes how to perform reproducible research using Ubuntu 14.04 LTS, R, RStudio, MySQL and the Lahman database.

My system configuration is as follows:

* Dell Latitude E6500 4GB RAM
* GNU/Linux kernel 3.13.0-29-generic
* Ubuntu 14.04 LTS 64-bit
* lxqt desktop environment version 0.0.1+bzr6+201405140118~ubuntu14.04.1
* R version 3.1.0 (2014-04-10)
* Platform: x86_64-pc-linux-gnu (64-bit)
* RStudio 0.98.507
* RMySQL R package manually installed via RStudio GUI
* MySQL server Ver 14.14 Distrib 5.6.17, for debian-linux-gnu (x86_64) using EditLine wrapper

I configured the .my.cnf configuration file in my home directory as follows to avoid having to mention any passwords in any .R scripts where MySQL connections are made to the Lahman database. I did chmod 600 .my.cnf to limit the access rights to the .my.cnf configuration file in my home directory.

[client]
user=root
password=put right password here

[mysql]
user=root
password=put right password here

[mysqldump]
user=root
password=put right password here

[mysqldiff]
user=root

Here is the .R script that I use in Rstudio to connect to a locally installed copy of the Lahman database running on a MySQL server in Ubuntu 14.04 LTS.

library(RMySQL)
lahmanDb = dbConnect(MySQL(), user=”root”,
host=”localhost”)
result = dbGetQuery(lahmanDb,”show databases;”);dbDisconnect(lahmanDb);
result
lahman = dbConnect(MySQL(), user=”root”, db=”lahman”,
host=”localhost”)
allTables = dbListTables(lahman)
length(allTables)
allTables

Here is a sample SQL query in RStudio:

dbGetQuery(lahman,”
EXPLAIN SELECT teamID, Batting.playerID,
nameFirst AS FIRST_NAME,
nameLast AS LAST_NAME,
SUM(H) as CareerH,
SUM(AB) as CareerAB,
SUM(H)/SUM(AB) as CareerBA,
(SUM(H)+SUM(BB)+SUM(HBP))/(SUM(AB)+SUM(BB)+SUM(HBP)+SUM(SF)) as CareerOBP,
(SUM(H)+SUM(2B)+SUM(2*3B)+SUM(3*HR))/SUM(AB) as CareerSLG,
((SUM(H)+SUM(BB)+SUM(HBP))/(SUM(AB)+SUM(BB)+SUM(HBP)+SUM(SF)) +
(SUM(H)+SUM(2B)+SUM(2*3B)+SUM(3*HR))/SUM(AB)) as CareerOPS,
min(yearID) as Rookie,
max(yearID) as LastSeason
FROM Batting
INNER JOIN Master
# join on primary key playerID in Batting and Master tables
ON Batting.playerID = Master.playerID
WHERE Batting.playerID IN (SELECT DISTINCT Batting.playerID FROM Batting WHERE yearID >= 1980)
GROUP BY Batting.playerID
HAVING CareerAB >= 500
ORDER BY CareerOPS DESC
LIMIT 50
;”)

When submitting a new SQL query in RStudio, make to sure to insert the line

dbGetQuery(lahman,”

right before the SQL query and

;”)

right after the SQL query, if your database name is called lahman (like mine).

Advertisements
Comments
  1. I am truly grateful to tthe owner of this web site who has shared this impressive article
    at at this place.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s