# DataLab Getting Started in R

The Bigstep DataLab is an open data exploration service for data science, analytics and technology experimentation. It is built on our SparkArray and DataLake services and on our highly flexible, high-performance bare-metal infrastructure.

This tutorial assumes some programming experience.

## Uploading Data

A private DataLake (HDFS service) stores the data that the SparkArray uses. To upload data to an HDFS cluster you would typically:
1. Download the Hadoop binaries (2.7.x) from a mirror such as [this one](http://apache.claz.org/hadoop/common/hadoop-2.7.3/hadoop-2.7.3.tar.gz). The archive is rather large (about 240 MB).
2. Unpack the archive.
3. Run `hdfs dfs` commands such as `-ls`:

```
/hadoop-2.7.3/bin/hdfs dfs -ls hdfs://headnodes-8885.cluster-8885.us-private-datalake.7.bigstep.io/
16/09/26 17:18:47 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 3 items
drwxrwxrwx   - hdfs supergroup          0 2016-09-22 13:12 hdfs://headnodes-8885.cluster-8885.us-private-datalake.7.bigstep.io/baseball
drwxrwxrwt   - hdfs supergroup          0 2016-09-22 12:08 hdfs://headnodes-8885.cluster-8885.us-private-datalake.7.bigstep.io/tmp
drwxr-xr-x   - hdfs supergroup          0 2016-09-22 12:08 hdfs://headnodes-8885.cluster-8885.us-private-datalake.7.bigstep.io/user
```


You can also execute the same commands on the master container:

In [None]:
# Download a sample dataset to the container's local disk by running a shell command from R
system("wget http://www.exploredata.net/ftp/MLB2008.csv", intern=TRUE)

In [None]:
# Copy the downloaded file to Bigstep DataLake, using the path specified under the Spark tab in the Bigstep Control Center
system("hdfs dfs -put MLB2008.csv /", intern=TRUE)
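With `intern=TRUE`, `system()` returns the command's output and only issues a warning when the command exits with a non-zero status, so a failed `wget` or `hdfs dfs -put` is easy to miss. The following is a minimal sketch of a wrapper that stops on failure instead. The name `run_checked` is a hypothetical helper, not part of R or of the DataLab image, and the sketch assumes a Unix-like shell on the container.

```r
# Hypothetical helper: run a shell command and stop if it exits with a non-zero status.
run_checked <- function(cmd) {
  # With intern = FALSE, system() returns the command's exit status
  status <- system(cmd, intern = FALSE)
  if (status != 0) {
    stop(sprintf("Command failed with exit status %d: %s", status, cmd))
  }
  invisible(status)
}

# Usage with the commands from this section (left commented out here):
# run_checked("wget http://www.exploredata.net/ftp/MLB2008.csv")
# run_checked("hdfs dfs -put MLB2008.csv /")
```

The trade-off is that the command's output is printed to the console rather than captured as a character vector. If you need the output as an R value, keep `intern=TRUE` and inspect the `"status"` attribute of the result instead.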

## Initialize Spark Context

For all Spark functions to be available, a Spark context has to be initialized in the current notebook.

In [None]:
library(SparkR)
sparkR.session(appName = "R", sparkConfig = list(spark.warehouse.dir=""))


## RDDs

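A Resilient Distributed Dataset (RDD) is a collection of records that is partitioned across multiple servers. It allows the programmer to abstract away the complexity of transforming large volumes of distributed data. SparkR does not expose the RDD API publicly, so the cells below work with a SparkDataFrame, which is built on top of RDDs.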
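To make the idea of partitioned data more concrete, here is a small local illustration in base R. It is not Spark code: it only mimics how a computation can be split into independent per-partition steps whose results are then combined. The number of partitions (4) and all variable names are assumptions chosen for the example.

```r
# Local illustration only: base R, no Spark involved.
values <- 1:100

# Split the data into 4 partitions, as a cluster would spread it over 4 servers
partition_id <- rep(1:4, length.out = length(values))
partitions <- split(values, partition_id)

# "Map" step: each partition computes its own partial result independently
partial_sums <- lapply(partitions, function(p) sum(p^2))

# "Reduce" step: combine the partial results into the final answer
total <- Reduce(`+`, partial_sums)

# The partitioned computation gives the same result as the single-machine one
stopifnot(total == sum(values^2))
print(total)
```

In Spark the partitions live on different machines and the framework handles scheduling, data movement and recovery from failures, but the programmer still only describes the per-partition work and how to combine the results.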

In [None]:
# Download the Lahman baseball database archive
system("wget http://seanlahman.com/files/database/baseballdatabank-master_2016-03-02.zip", intern=TRUE)

# Install unzip, extract the archive, then delete the zip file
system("apt-get install -y unzip", intern=TRUE)
system("unzip baseballdatabank-master_2016-03-02.zip", intern=TRUE)
system("rm -rf baseballdatabank-master_2016-03-02.zip", intern=TRUE)

# Copy the All-Star table to the root of the DataLake
system("hdfs dfs -put baseballdatabank-master/core/AllstarFull.csv /", intern=TRUE)

In [None]:
# Make the Hive scratch directory on HDFS writable, which Spark SQL needs for its session files
system("hdfs dfs -chmod 777 /tmp/hive", intern=TRUE)

In [None]:
# Inspect the environment variables visible to the R kernel
Sys.getenv()

In [None]:
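# Reuse the Spark session started earlier (or start one if none exists)
sc <- sparkR.session()

# Read the All-Star table from the DataLake; header = "true" uses the first line of the file as column names
people <- read.df("/AllstarFull.csv", "csv", header = "true")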


In [None]:
count(people)

In [None]:
first(people)

## DataFrames and SparkRSQL

A SparkDataFrame can also be registered as a temporary view in Spark SQL, which lets you run SQL queries over its data. The `sql` function runs SQL queries programmatically and returns the result as a SparkDataFrame.

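Spark 2.0.0 and later have built-in CSV and JSON readers. The example below reads a JSON file that ships with the Spark distribution:

In [None]:
# Read a JSON file from the container's local filesystem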
dfPeople <- read.df("file:///opt/spark-2.1.0-bin-hadoop2.7/examples/src/main/resources/people.json", "json")

In [None]:
# Register the DataFrame as a SQL temporary view
createOrReplaceTempView(dfPeople, "people")

# SQL statements can be run by using the sql method
teenagers <- sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
head(teenagers)