North Carolina births
In 2004, the state of North Carolina released a large data set containing information on births recorded in the state. This data set is useful to researchers studying the relationship between the habits and practices of expectant mothers and the birth of their children. We will work with a random sample of observations from this data set.
In R
download.file('http://www.openintro.org/stat/data/nc.RData', destfile = 'nc.RData')
load('nc.RData')
In Python
import pandas as pd
data = pd.read_csv('http://photo.etangkk.com/python/NCbirths.txt', sep='\t')
In PySpark
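Spark doesn’t support reading from a URL, so my workaround is to load the data as a pandas DataFrame first and then convert it to a Spark DataFrame.
import pandas as pd
from pyspark.sql import SQLContext
# assumes an existing SparkContext named sc, as in the PySpark shell
sqlCtx = SQLContext(sc)
pandas_df = pd.read_csv('http://photo.etangkk.com/python/NCbirths.txt', dtype=object, sep='\t')
# convert every column to str so each Spark column gets a single type
for col in pandas_df:
    pandas_df[col] = pandas_df[col].astype(str)
df = sqlCtx.createDataFrame(pandas_df)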
df = df.withColumn("fage_d", df["fage"].cast("int"))
df = df.withColumn("mage_d", df["mage"].cast("int"))
df = df.withColumn("weeks_d", df["weeks"].cast("int"))
df = df.withColumn("visits_d", df["visits"].cast("int"))
df = df.withColumn("gained_d", df["gained"].cast("int"))
df = df.withColumn("weight_d", df["weight"].cast("double"))
df = df.select('fage_d', 'mage_d', 'mature', 'weeks_d', 'premie', 'visits_d', 'marital', 'gained_d', 'weight_d', 'lowbirthweight', 'gender', 'habit', 'whitemom')
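The read-as-strings-then-cast pattern used above can also be tried in pandas alone. The following is a minimal sketch, assuming a hypothetical three-row frame named `raw` that mimics what `astype(str)` produces (missing values become the string 'nan'); the column list `numeric_cols` is also an assumption for illustration.

```python
import pandas as pd

# Hypothetical all-string frame, similar to the result of astype(str)
raw = pd.DataFrame({
    "fage": ["nan", "19", "21"],
    "weight": ["7.63", "7.88", "6.63"],
    "habit": ["nonsmoker", "nonsmoker", "smoker"],
})

numeric_cols = ["fage", "weight"]
typed = raw.copy()
for col in numeric_cols:
    # errors="coerce" yields NaN for anything that is not a usable number
    typed[col] = pd.to_numeric(typed[col], errors="coerce")

print(typed.dtypes)
```

Columns left out of `numeric_cols` keep their string values, which corresponds to the categorical columns that are selected unchanged in the PySpark code.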
In R
dim(nc)
## [1] 1000 13
This command should output ‘[1] 1000 13’, indicating that there are 1000 rows and 13 columns.
Indexing the result of dim() returns the number of rows or the number of columns on its own.
paste('Number of rows', dim(nc)[1], sep=": ")
cat('Number of columns:', dim(nc)[2])
## [1] "Number of rows: 1000"
## Number of columns: 13
Each row is an observation and each column is a variable, so our data set has 1000 observations and 13 variables.
In Python
print(data.shape)
print('Number of observations: %d' % data.shape[0])
print('Number of variables: %d' % data.shape[1])
## (1000, 13)
## Number of observations: 1000
## Number of variables: 13
In PySpark - there is no single function that returns the DataFrame dimensions; the workaround is to find the number of columns and rows separately
print('Number of columns: %d' % len(df.schema.names))
print('Number of rows: %d' % df.count())
Display the data type of each variable in the data frame.
In R
str(nc)
## 'data.frame': 1000 obs. of 13 variables:
## $ fage : int NA NA 19 21 NA NA 18 17 NA 20 ...
## $ mage : int 13 14 15 15 15 15 15 15 16 16 ...
## $ mature : Factor w/ 2 levels "mature mom","younger mom": 2 2 2 2 2 2 2 2 2 2 ...
## $ weeks : int 39 42 37 41 39 38 37 35 38 37 ...
## $ premie : Factor w/ 2 levels "full term","premie": 1 1 1 1 1 1 1 2 1 1 ...
## $ visits : int 10 15 11 6 9 19 12 5 9 13 ...
## $ marital : Factor w/ 2 levels "married","not married": 1 1 1 1 1 1 1 1 1 1 ...
## $ gained : int 38 20 38 34 27 22 76 15 NA 52 ...
## $ weight : num 7.63 7.88 6.63 8 6.38 5.38 8.44 4.69 8.81 6.94 ...
## $ lowbirthweight: Factor w/ 2 levels "low","not low": 2 2 2 2 2 1 2 1 2 2 ...
## $ gender : Factor w/ 2 levels "female","male": 2 2 1 2 1 2 2 2 2 1 ...
## $ habit : Factor w/ 2 levels "nonsmoker","smoker": 1 1 1 1 1 1 1 1 1 1 ...
## $ whitemom : Factor w/ 2 levels "not white","white": 1 1 2 2 1 1 1 1 2 2 ...
In Python
data.dtypes
## fage float64
## mage int64
## mature object
## weeks float64
## premie object
## visits float64
## marital object
## gained float64
## weight float64
## lowbirthweight object
## gender object
## habit object
## whitemom object
## dtype: object
or
data.info()
## <class 'pandas.core.frame.DataFrame'>
## Int64Index: 1000 entries, 1 to 1000
## Data columns (total 13 columns):
## fage 829 non-null float64
## mage 1000 non-null int64
## mature 1000 non-null object
## weeks 998 non-null float64
## premie 998 non-null object
## visits 991 non-null float64
## marital 999 non-null object
## gained 973 non-null float64
## weight 1000 non-null float64
## lowbirthweight 1000 non-null object
## gender 1000 non-null object
## habit 999 non-null object
## whitemom 998 non-null object
## dtypes: float64(5), int64(1), object(7)
## memory usage: 109.4+ KB
## None
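The non-null counts reported by info() can be turned around to count the missing values per column directly. A minimal sketch, assuming a hypothetical three-row excerpt (`sample`) in the same tab-separated layout as the data file, so it runs without a download:

```python
import io
import pandas as pd

# Hypothetical excerpt: empty fields stand for missing values
sample = io.StringIO(
    "fage\tmage\thabit\n"
    "\t13\tnonsmoker\n"
    "19\t15\t\n"
    "21\t15\tsmoker\n"
)
births = pd.read_csv(sample, sep="\t")

# isnull() marks missing cells; summing the booleans counts them per column
missing = births.isnull().sum()
print(missing)
```

On the full data set the same call would be data.isnull().sum().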
In PySpark
df.printSchema()
In R
names(nc)
## [1] "fage" "mage" "mature" "weeks"
## [5] "premie" "visits" "marital" "gained"
## [9] "weight" "lowbirthweight" "gender" "habit"
## [13] "whitemom"
In Python
data.columns.values
## ['fage' 'mage' 'mature' 'weeks' 'premie' 'visits' 'marital' 'gained'
## 'weight' 'lowbirthweight' 'gender' 'habit' 'whitemom']
In PySpark
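There is no dedicated function, but the column names are available through the df.columns attribute (df.schema.names, used earlier to count the columns, returns the same list).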
In R
head(nc, 5)
## fage mage mature weeks premie visits marital gained weight
## 1 NA 13 younger mom 39 full term 10 married 38 7.63
## 2 NA 14 younger mom 42 full term 15 married 20 7.88
## 3 19 15 younger mom 37 full term 11 married 38 6.63
## 4 21 15 younger mom 41 full term 6 married 34 8.00
## 5 NA 15 younger mom 39 full term 9 married 27 6.38
## lowbirthweight gender habit whitemom
## 1 not low male nonsmoker not white
## 2 not low male nonsmoker not white
## 3 not low female nonsmoker white
## 4 not low male nonsmoker white
## 5 not low female nonsmoker not white
In Python
data.head(5)
## fage mage mature weeks premie visits marital gained weight \
## 1 NaN 13 younger mom 39 full term 10 married 38 7.63
## 2 NaN 14 younger mom 42 full term 15 married 20 7.88
## 3 19 15 younger mom 37 full term 11 married 38 6.63
## 4 21 15 younger mom 41 full term 6 married 34 8.00
## 5 NaN 15 younger mom 39 full term 9 married 27 6.38
##
## lowbirthweight gender habit whitemom
## 1 not low male nonsmoker not white
## 2 not low male nonsmoker not white
## 3 not low female nonsmoker white
## 4 not low male nonsmoker white
## 5 not low female nonsmoker not white
In PySpark
df.show(5)
Display the DataFrame as an HTML table so it’s easier to read (display() is provided by notebook environments such as Databricks, not by PySpark itself).
display(df)
variable | description | type |
---|---|---|
fage | father’s age in years | continuous numerical |
mage | mother’s age in years | continuous numerical |
mature | maturity status of mother | categorical |
weeks | length of pregnancy in weeks | discrete numerical |
premie | whether the birth was classified as premature (premie) or full-term | categorical |
visits | number of hospital visits during pregnancy | discrete numerical |
marital | whether mother is married or not married at birth | categorical |
gained | weight gained by mother during pregnancy in pounds | continuous numerical |
weight | weight of the baby at birth in pounds | continuous numerical |
lowbirthweight | whether baby was classified as low birthweight (low) or not (not low) | categorical |
gender | gender of the baby | categorical |
habit | status of mother as nonsmoker or a smoker | categorical |
whitemom | whether mom is white or not white | categorical |
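
R stores the categorical variables listed above as factors, while pandas reads them as plain strings. One way to mirror the R representation is the pandas `category` dtype. This is a sketch on hypothetical rows (the frame `births` and the list `categorical` are assumptions for illustration), not part of the original workflow:

```python
import pandas as pd

# Hypothetical rows using column names and levels from the table above
births = pd.DataFrame({
    "mage": [13, 36, 27],
    "mature": ["younger mom", "mature mom", "younger mom"],
    "habit": ["nonsmoker", "smoker", "nonsmoker"],
})

categorical = ["mature", "habit"]
for col in categorical:
    # astype("category") stores each distinct value once, like an R factor
    births[col] = births[col].astype("category")

print(births.dtypes)
print(births["mature"].cat.categories.tolist())
```

The categories are sorted alphabetically by default, which matches the level order R reports for these variables.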
In R
summary(nc)
## fage mage mature weeks
## Min. :14.00 Min. :13 mature mom :133 Min. :20.00
## 1st Qu.:25.00 1st Qu.:22 younger mom:867 1st Qu.:37.00
## Median :30.00 Median :27 Median :39.00
## Mean :30.26 Mean :27 Mean :38.33
## 3rd Qu.:35.00 3rd Qu.:32 3rd Qu.:40.00
## Max. :55.00 Max. :50 Max. :45.00
## NA's :171 NA's :2
## premie visits marital gained
## full term:846 Min. : 0.0 married :386 Min. : 0.00
## premie :152 1st Qu.:10.0 not married:613 1st Qu.:20.00
## NA's : 2 Median :12.0 NA's : 1 Median :30.00
## Mean :12.1 Mean :30.33
## 3rd Qu.:15.0 3rd Qu.:38.00
## Max. :30.0 Max. :85.00
## NA's :9 NA's :27
## weight lowbirthweight gender habit
## Min. : 1.000 low :111 female:503 nonsmoker:873
## 1st Qu.: 6.380 not low:889 male :497 smoker :126
## Median : 7.310 NA's : 1
## Mean : 7.101
## 3rd Qu.: 8.060
## Max. :11.750
##
## whitemom
## not white:284
## white :714
## NA's : 2
##
##
##
##
In Python - NaN values are excluded
print(data.describe())
print(data['mature'].value_counts())
## fage mage weeks visits gained weight
## count 829.000000 1000.000000 998.000000 991.000000 973.000000 1000.00000
## mean 30.255730 27.000000 38.334669 12.104945 30.325797 7.10100
## std 6.763766 6.213583 2.931553 3.954934 14.241297 1.50886
## min 14.000000 13.000000 20.000000 0.000000 0.000000 1.00000
## 25% 25.000000 22.000000 37.000000 10.000000 20.000000 6.38000
## 50% 30.000000 27.000000 39.000000 12.000000 30.000000 7.31000
## 75% 35.000000 32.000000 40.000000 15.000000 38.000000 8.06000
## max 55.000000 50.000000 45.000000 30.000000 85.000000 11.75000
## younger mom 867
## mature mom 133
## dtype: int64
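The value_counts() call above covers a single variable. A loop over the non-numeric columns gives the counts for all of them; the sketch below assumes a small hypothetical frame `births` and collects the results in a dictionary named `counts`:

```python
import pandas as pd

# Hypothetical rows; None stands for a missing value
births = pd.DataFrame({
    "mage": [13, 36, 27, 31],
    "mature": ["younger mom", "mature mom", "younger mom", "younger mom"],
    "habit": ["nonsmoker", "smoker", None, "nonsmoker"],
})

counts = {}
# exclude=["number"] keeps only the non-numeric (categorical) columns
for col in births.select_dtypes(exclude=["number"]).columns:
    # dropna=False reports missing values as their own row
    counts[col] = births[col].value_counts(dropna=False)
    print(counts[col])
```

Passing dropna=False is a choice made here so that missing values show up in the counts, similar to the NA's row in the R summary() output.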
In PySpark - NaN values make computation of mean and standard deviation fail
df.describe().show()
df.groupBy('mature').count().show()
Python Pandas’ describe() function only gives summary statistics for numerical variables; for categorical variables we need to use the value_counts() function. For a large data set with many categorical variables it is cumbersome to write one statement for each categorical variable, so it would be nice to have an equivalent of R’s summary() function in Python Pandas. Below is my version of the R summary function for Python.
def Rsummary(df):
    # build one column of "key: value" strings per variable, like R's summary()
    stat = pd.DataFrame()
    for i in df.columns:
        if pd.api.types.is_numeric_dtype(df[i]):
            # numerical variable: count, mean, std and quartiles
            right = pd.DataFrame(df[i].describe())
        else:
            # categorical variable: frequency of each level
            right = pd.DataFrame(df[i].value_counts())
        right.reset_index(level=0, inplace=True)
        right.columns = ['key', 'count']
        # append the number of missing values as a final row
        right.loc[max(right.index)+1] = ["NA's", df[i].isnull().sum()]
        right['count'] = right['count'].round(1)
        right[i] = right['key'] + ': ' + right['count'].apply(str)
        stat = pd.concat([stat, right[i]], axis=1)
    print(stat.fillna(''))
Rsummary(data)
## fage mage mature weeks \
## 0 count: 829.0 count: 1000.0 younger mom: 867 count: 998.0
## 1 mean: 30.3 mean: 27.0 mature mom: 133 mean: 38.3
## 2 std: 6.8 std: 6.2 NA's: 0 std: 2.9
## 3 min: 14.0 min: 13.0 min: 20.0
## 4 25%: 25.0 25%: 22.0 25%: 37.0
## 5 50%: 30.0 50%: 27.0 50%: 39.0
## 6 75%: 35.0 75%: 32.0 75%: 40.0
## 7 max: 55.0 max: 50.0 max: 45.0
## 8 NA's: 171.0 NA's: 0.0 NA's: 2.0
##
## premie visits marital gained \
## 0 full term: 846 count: 991.0 not married: 613 count: 973.0
## 1 premie: 152 mean: 12.1 married: 386 mean: 30.3
## 2 NA's: 2 std: 4.0 NA's: 1 std: 14.2
## 3 min: 0.0 min: 0.0
## 4 25%: 10.0 25%: 20.0
## 5 50%: 12.0 50%: 30.0
## 6 75%: 15.0 75%: 38.0
## 7 max: 30.0 max: 85.0
## 8 NA's: 9.0 NA's: 27.0
##
## weight lowbirthweight gender habit whitemom
## 0 count: 1000.0 not low: 889 female: 503 nonsmoker: 873 white: 714
## 1 mean: 7.1 low: 111 male: 497 smoker: 126 not white: 284
## 2 std: 1.5 NA's: 0 NA's: 0 NA's: 1 NA's: 2
## 3 min: 1.0
## 4 25%: 6.4
## 5 50%: 7.3
## 6 75%: 8.1
## 7 max: 11.8
## 8 NA's: 0.0
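For comparison, pandas’ describe() also accepts include="all", which summarizes numerical and categorical columns in one table. It is not a full replacement for the function above: for categorical columns it reports only the number of distinct levels, the most frequent level (top) and its frequency (freq), and it has no NA's row. A minimal sketch on a hypothetical four-row frame `births`:

```python
import pandas as pd

# Hypothetical rows with one numerical and one categorical column
births = pd.DataFrame({
    "weight": [7.63, 7.88, 6.63, 8.00],
    "habit": ["nonsmoker", "nonsmoker", "smoker", "nonsmoker"],
})

# include="all" adds unique/top/freq rows for non-numeric columns
summary = births.describe(include="all")
print(summary)
```

Cells that do not apply to a column, such as the mean of habit, are filled with NaN.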