Summary Statistics in R and Python

I wonder how easy it is to do simple statistics with Python and R, for example calculating the mean, median and other summary statistics. Let’s start with the following very simple data set.

height weight
1.85   85.0
1.80   79.1
1.91   80.5
1.75   80.3
1.77   90.8
1.79   60.9
1.81   81.1
1.69   70.2
1.80   70.9
1.89   90.9
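The table above is shown with whitespace between the columns, while the file used below is comma-separated. As a minimal sketch of one way to produce such a file, assuming a header line with the column names "height" and "weight", the rows can be written with Python's standard csv module:

```python
import csv

# The ten (height, weight) rows from the table above.
rows = [
    (1.85, 85.0), (1.80, 79.1), (1.91, 80.5), (1.75, 80.3), (1.77, 90.8),
    (1.79, 60.9), (1.81, 81.1), (1.69, 70.2), (1.80, 70.9), (1.89, 90.9),
]

# Write a header line followed by one comma-separated line per row.
with open("data.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["height", "weight"])
    writer.writerows(rows)
```

Typing the same lines into a text editor works just as well; the only thing that matters is the header line and the commas between the values.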

Having stored this data in a file called “data.csv” (comma-separated), you can read this file in R using the following single line of code.

hw <- read.csv("data.csv")

The file will be read into variable “hw” of class (type) “data.frame”.

> class(hw)
[1] "data.frame"

Do a quick check that the data was read correctly by evaluating the variable or by accessing single columns of the data frame.

> hw
   height weight
1    1.85   85.0
2    1.80   79.1
3    1.91   80.5
4    1.75   80.3
5    1.77   90.8
6    1.79   60.9
7    1.81   81.1
8    1.69   70.2
9    1.80   70.9
10   1.89   90.9
> hw[,1]
 [1] 1.85 1.80 1.91 1.75 1.77 1.79 1.81 1.69 1.80 1.89
> hw[,2]
 [1] 85.0 79.1 80.5 80.3 90.8 60.9 81.1 70.2 70.9 90.9

Now how can we read the data using Python? This is best handled with the help of a library like pandas. The data will be read into a variable of type DataFrame.

>>> import pandas as pd
>>> hw = pd.read_csv("data.csv")
>>> hw
   height  weight
0    1.85    85.0
1    1.80    79.1
2    1.91    80.5
3    1.75    80.3
4    1.77    90.8
5    1.79    60.9
6    1.81    81.1
7    1.69    70.2
8    1.80    70.9
9    1.89    90.9
>>> hw["height"]
0    1.85
1    1.80
2    1.91
3    1.75
4    1.77
5    1.79
6    1.81
7    1.69
8    1.80
9    1.89
Name: height, dtype: float64
>>> hw["weight"]
0    85.0
1    79.1
2    80.5
3    80.3
4    90.8
5    60.9
6    81.1
7    70.2
8    70.9
9    90.9
Name: weight, dtype: float64
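Printing the whole frame works for ten rows, but for larger files a few programmatic checks are easier to read. The following is a sketch, not the only way to do it; it builds the same data in memory with io.StringIO (an assumption made here so that the example runs without data.csv) and then inspects the shape, column types and missing values:

```python
import io

import pandas as pd

# Same ten rows as above, held in memory so the sketch runs without data.csv.
csv_text = """height,weight
1.85,85.0
1.80,79.1
1.91,80.5
1.75,80.3
1.77,90.8
1.79,60.9
1.81,81.1
1.69,70.2
1.80,70.9
1.89,90.9
"""
hw = pd.read_csv(io.StringIO(csv_text))

print(hw.shape)         # (number of rows, number of columns)
print(hw.dtypes)        # data type of each column
print(hw.isna().sum())  # number of missing values per column
```

If the shape is not 10 rows by 2 columns, or a column is not numeric, the separator or the header line of the file is the first thing to check.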

Now that we have read the data into our programming environment, we can calculate some summary statistics like mean and median. Let’s start with R, which has built-in functions such as mean() and median(); the mean is shown here.

> mean(hw[,1])
[1] 1.806
> mean(hw[,2])
[1] 78.97

For Python, the pandas library provides methods to calculate summary statistics. The trailing digits in the height result are floating-point rounding noise, not a different mean.

>>> hw["height"].mean()
1.8060000000000003
>>> hw["weight"].mean()
78.97
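The mean is only one of the statistics mentioned at the start. As a sketch of how the median and a fuller summary could be obtained with pandas (again using an in-memory copy of the data, which is an assumption made so that the example is self-contained):

```python
import io

import pandas as pd

# In-memory copy of data.csv so the example is self-contained.
csv_text = """height,weight
1.85,85.0
1.80,79.1
1.91,80.5
1.75,80.3
1.77,90.8
1.79,60.9
1.81,81.1
1.69,70.2
1.80,70.9
1.89,90.9
"""
hw = pd.read_csv(io.StringIO(csv_text))

medians = hw.median()    # median of each column
stds = hw.std()          # sample standard deviation of each column
summary = hw.describe()  # count, mean, std, min, quartiles and max

print(medians)
print(stds)
print(summary)
```

In R, median(hw[,1]) and summary(hw) play the corresponding roles.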

As you can see, it is very straightforward to calculate summary statistics with R and Python. Next time, I will look at how to do a linear regression and plot the results.
