The Data

North Carolina births

In 2004, the state of North Carolina released a large data set containing information on births recorded in this state. This data set is useful to researchers studying the relation between habits and practices of expectant mothers and the birth of their children. We will work with a random sample of observations from this data set.

Read data from URL

In R

download.file('http://www.openintro.org/stat/data/nc.RData', destfile = 'nc.RData')
load('nc.RData')

In Python

import pandas as pd
data = pd.read_csv('http://photo.etangkk.com/python/NCbirths.txt', sep='\t')
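
If the URL is unreachable, the same read_csv call can be tried on an in-memory string. The sketch below is only an illustration: the three rows and four columns are copied from the first rows of this data set, and the names sample_text and sample are made up for the example.

```python
import io
import pandas as pd

# Hypothetical stand-in for the downloaded file: tab-separated, "NA" marks a missing value
sample_text = (
    "fage\tmage\tweeks\tweight\n"
    "NA\t13\t39\t7.63\n"
    "NA\t14\t42\t7.88\n"
    "19\t15\t37\t6.63\n"
)

# Same call as above, but reading from an in-memory buffer instead of a URL
sample = pd.read_csv(io.StringIO(sample_text), sep="\t")

print(sample.shape)                    # (rows, columns)
print(sample["fage"].isnull().sum())   # "NA" is parsed as a missing value by default
```

Because pandas treats the string NA as missing by default, no extra na_values argument should be needed for this file.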

In PySpark

Spark doesn’t support reading from a URL. My workaround is to load the data as a pandas DataFrame first and then convert it to a Spark DataFrame.

import pandas as pd
from pyspark.sql import SQLContext

pandas_df = pd.read_csv('http://photo.etangkk.com/python/NCbirths.txt', dtype=object, sep='\t')

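# Store every value as a string; the numeric columns are cast back below
for col in pandas_df:
    pandas_df[col] = pandas_df[col].astype(str)

# sc is the SparkContext provided by the PySpark shell or notebook;
# skip the next line if sqlCtx is already defined in your environment
sqlCtx = SQLContext(sc)
df = sqlCtx.createDataFrame(pandas_df)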
df = df.withColumn("fage_d", df["fage"].cast("int"))
df = df.withColumn("mage_d", df["mage"].cast("int"))
df = df.withColumn("weeks_d", df["weeks"].cast("int"))
df = df.withColumn("visits_d", df["visits"].cast("int"))
df = df.withColumn("gained_d", df["gained"].cast("int"))
df = df.withColumn("weight_d", df["weight"].cast("double"))

df = df.select('fage_d', 'mage_d', 'mature', 'weeks_d', 'premie', 'visits_d', 'marital', 'gained_d', 'weight_d', 'lowbirthweight', 'gender', 'habit', 'whitemom')

Data Frame

Dimensions

In R

dim(nc)
## [1] 1000   13

This command should output ‘[1] 1000 13’, indicating that there are 1000 rows and 13 columns.

Indexing the result of dim() returns only the number of rows or only the number of columns.

paste('Number of rows', dim(nc)[1], sep=": ")
cat('Number of columns:', dim(nc)[2])
## [1] "Number of rows: 1000"
## Number of columns: 13

Each row is an observation and each column is a variable, so our data set has 1000 observations and 13 variables.

In Python

print(data.shape)
print('Number of observations: %d' % data.shape[0])
print('Number of variables: %d' % data.shape[1])
## (1000, 13)
## Number of observations: 1000
## Number of variables: 13

In PySpark - there is no single function that returns the dimensions of a DataFrame; the workaround is to find the number of columns and rows separately

print("Number of columns: %d" % len(df.schema.names))
print("Number of rows: %d" % df.count())

Data structure

Show the data type of each variable in the data frame.

In R

str(nc)
## 'data.frame':    1000 obs. of  13 variables:
##  $ fage          : int  NA NA 19 21 NA NA 18 17 NA 20 ...
##  $ mage          : int  13 14 15 15 15 15 15 15 16 16 ...
##  $ mature        : Factor w/ 2 levels "mature mom","younger mom": 2 2 2 2 2 2 2 2 2 2 ...
##  $ weeks         : int  39 42 37 41 39 38 37 35 38 37 ...
##  $ premie        : Factor w/ 2 levels "full term","premie": 1 1 1 1 1 1 1 2 1 1 ...
##  $ visits        : int  10 15 11 6 9 19 12 5 9 13 ...
##  $ marital       : Factor w/ 2 levels "married","not married": 1 1 1 1 1 1 1 1 1 1 ...
##  $ gained        : int  38 20 38 34 27 22 76 15 NA 52 ...
##  $ weight        : num  7.63 7.88 6.63 8 6.38 5.38 8.44 4.69 8.81 6.94 ...
##  $ lowbirthweight: Factor w/ 2 levels "low","not low": 2 2 2 2 2 1 2 1 2 2 ...
##  $ gender        : Factor w/ 2 levels "female","male": 2 2 1 2 1 2 2 2 2 1 ...
##  $ habit         : Factor w/ 2 levels "nonsmoker","smoker": 1 1 1 1 1 1 1 1 1 1 ...
##  $ whitemom      : Factor w/ 2 levels "not white","white": 1 1 2 2 1 1 1 1 2 2 ...

In Python

data.dtypes
## fage              float64
## mage                int64
## mature             object
## weeks             float64
## premie             object
## visits            float64
## marital            object
## gained            float64
## weight            float64
## lowbirthweight     object
## gender             object
## habit              object
## whitemom           object
## dtype: object

or

data.info()
## <class 'pandas.core.frame.DataFrame'>
## Int64Index: 1000 entries, 1 to 1000
## Data columns (total 13 columns):
## fage              829 non-null float64
## mage              1000 non-null int64
## mature            1000 non-null object
## weeks             998 non-null float64
## premie            998 non-null object
## visits            991 non-null float64
## marital           999 non-null object
## gained            973 non-null float64
## weight            1000 non-null float64
## lowbirthweight    1000 non-null object
## gender            1000 non-null object
## habit             999 non-null object
## whitemom          998 non-null object
## dtypes: float64(5), int64(1), object(7)
## memory usage: 109.4+ KB
## None
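
Note that fage, weeks, visits and gained are int in R but float64 in pandas. The likely reason is that NumPy's int64 has no way to represent a missing value, so pandas stores an integer column that contains NaN as float64. A minimal sketch with made-up rows (the names text and sample are hypothetical, and the last step assumes a pandas version that provides the nullable Int64 dtype):

```python
import io
import pandas as pd

# Made-up three-row sample; "NA" marks a missing value
text = "fage\tmage\nNA\t13\n19\t15\n21\t15\n"
sample = pd.read_csv(io.StringIO(text), sep="\t")

print(sample.dtypes)          # fage is float64 (it has a NaN), mage is int64
print(sample.isnull().sum())  # number of missing values per column

# Optional: the nullable integer dtype keeps whole numbers and missing values together
fage_int = sample["fage"].astype("Int64")
print(fage_int.dtype)
```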

In PySpark

df.printSchema()

Names of columns (or variables)

In R

names(nc)
##  [1] "fage"           "mage"           "mature"         "weeks"         
##  [5] "premie"         "visits"         "marital"        "gained"        
##  [9] "weight"         "lowbirthweight" "gender"         "habit"         
## [13] "whitemom"

In Python

data.columns.values
## ['fage' 'mage' 'mature' 'weeks' 'premie' 'visits' 'marital' 'gained'
##  'weight' 'lowbirthweight' 'gender' 'habit' 'whitemom']

In PySpark

The columns attribute returns the column names as a list (df.schema.names, used above for counting columns, gives the same result).

df.columns

Preview data

In R

head(nc, 5)
##   fage mage      mature weeks    premie visits marital gained weight
## 1   NA   13 younger mom    39 full term     10 married     38   7.63
## 2   NA   14 younger mom    42 full term     15 married     20   7.88
## 3   19   15 younger mom    37 full term     11 married     38   6.63
## 4   21   15 younger mom    41 full term      6 married     34   8.00
## 5   NA   15 younger mom    39 full term      9 married     27   6.38
##   lowbirthweight gender     habit  whitemom
## 1        not low   male nonsmoker not white
## 2        not low   male nonsmoker not white
## 3        not low female nonsmoker     white
## 4        not low   male nonsmoker     white
## 5        not low female nonsmoker not white

In Python

data.head(5)
##    fage  mage       mature  weeks     premie  visits  marital  gained  weight  \
## 1   NaN    13  younger mom     39  full term      10  married      38    7.63   
## 2   NaN    14  younger mom     42  full term      15  married      20    7.88   
## 3    19    15  younger mom     37  full term      11  married      38    6.63   
## 4    21    15  younger mom     41  full term       6  married      34    8.00   
## 5   NaN    15  younger mom     39  full term       9  married      27    6.38   
## 
##   lowbirthweight  gender      habit   whitemom  
## 1        not low    male  nonsmoker  not white  
## 2        not low    male  nonsmoker  not white  
## 3        not low  female  nonsmoker      white  
## 4        not low    male  nonsmoker      white  
## 5        not low  female  nonsmoker  not white

In PySpark

df.show(5)

Display the DataFrame as an HTML table so it’s easier to read.

display(df)

Types of Variables

  1. Numerical/Quantitative - numeric values to which arithmetic operations such as +, -, *, / and averaging can be meaningfully applied
  • Continuous - measured values that can take any of the infinitely many values within a given range
  • Discrete - counted values drawn from a specific set of numeric values
  2. Categorical/Qualitative - a limited set of distinct categories; the categories may be coded as numbers, but applying arithmetic operations to them would not make much sense
  • Nominal - no intrinsic ordering to the categories
  • Ordinal - a clear ordering of the categories, such as low, medium and high

| variable | description | type |
|---|---|---|
| fage | father’s age in years | continuous numerical |
| mage | mother’s age in years | continuous numerical |
| mature | maturity status of mother | categorical |
| weeks | length of pregnancy in weeks | discrete numerical |
| premie | whether the birth was classified as premature (premie) or full-term | categorical |
| visits | number of hospital visits during pregnancy | discrete numerical |
| marital | whether mother is married or not married at birth | categorical |
| gained | weight gained by mother during pregnancy in pounds | continuous numerical |
| weight | weight of the baby at birth in pounds | continuous numerical |
| lowbirthweight | whether baby was classified as low birthweight (low) or not (not low) | categorical |
| gender | gender of the baby | categorical |
| habit | status of mother as nonsmoker or a smoker | categorical |
| whitemom | whether mom is white or not white | categorical |
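
The difference between unordered and ordered categories can also be expressed in pandas. The sketch below is one possible way to do it, not something the data set requires: habit uses values from this data set, while level is a hypothetical low/medium/high variable invented for the example.

```python
import pandas as pd

# Unordered categories (values taken from the habit column)
habit = pd.Series(["nonsmoker", "smoker", "nonsmoker"], dtype="category")

# Ordered categories: a hypothetical low/medium/high variable with an explicit order
level = pd.Series(pd.Categorical(
    ["medium", "low", "high"],
    categories=["low", "medium", "high"],
    ordered=True,
))

print(habit.cat.ordered)           # unordered: no ranking between categories
print(level.cat.ordered)           # ordered: low < medium < high
print(level.min(), level.max())    # min and max follow the declared order
print((level > "low").tolist())    # comparisons respect the declared order
```

Arithmetic such as a mean is still not defined for either variable, which matches the description of categorical variables above.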

Summary Statistics

In R

summary(nc)
##       fage            mage            mature        weeks      
##  Min.   :14.00   Min.   :13   mature mom :133   Min.   :20.00  
##  1st Qu.:25.00   1st Qu.:22   younger mom:867   1st Qu.:37.00  
##  Median :30.00   Median :27                     Median :39.00  
##  Mean   :30.26   Mean   :27                     Mean   :38.33  
##  3rd Qu.:35.00   3rd Qu.:32                     3rd Qu.:40.00  
##  Max.   :55.00   Max.   :50                     Max.   :45.00  
##  NA's   :171                                    NA's   :2      
##        premie        visits            marital        gained     
##  full term:846   Min.   : 0.0   married    :386   Min.   : 0.00  
##  premie   :152   1st Qu.:10.0   not married:613   1st Qu.:20.00  
##  NA's     :  2   Median :12.0   NA's       :  1   Median :30.00  
##                  Mean   :12.1                     Mean   :30.33  
##                  3rd Qu.:15.0                     3rd Qu.:38.00  
##                  Max.   :30.0                     Max.   :85.00  
##                  NA's   :9                        NA's   :27     
##      weight       lowbirthweight    gender          habit    
##  Min.   : 1.000   low    :111    female:503   nonsmoker:873  
##  1st Qu.: 6.380   not low:889    male  :497   smoker   :126  
##  Median : 7.310                               NA's     :  1  
##  Mean   : 7.101                                              
##  3rd Qu.: 8.060                                              
##  Max.   :11.750                                              
##                                                              
##       whitemom  
##  not white:284  
##  white    :714  
##  NA's     :  2  
##                 
##                 
##                 
## 

In Python - NaN values are excluded

print(data.describe())
print(data['mature'].value_counts())
##              fage         mage       weeks      visits      gained      weight
## count  829.000000  1000.000000  998.000000  991.000000  973.000000  1000.00000
## mean    30.255730    27.000000   38.334669   12.104945   30.325797     7.10100
## std      6.763766     6.213583    2.931553    3.954934   14.241297     1.50886
## min     14.000000    13.000000   20.000000    0.000000    0.000000     1.00000
## 25%     25.000000    22.000000   37.000000   10.000000   20.000000     6.38000
## 50%     30.000000    27.000000   39.000000   12.000000   30.000000     7.31000
## 75%     35.000000    32.000000   40.000000   15.000000   38.000000     8.06000
## max     55.000000    50.000000   45.000000   30.000000   85.000000    11.75000
## younger mom    867
## mature mom     133
## dtype: int64
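
This is why count differs between columns in the output above (829 for fage versus 1000 for mage): pandas skips missing values when it aggregates. A small sketch with made-up values shows the default behaviour and how to change it:

```python
import numpy as np
import pandas as pd

# Made-up numeric values with one missing entry
s = pd.Series([20.0, np.nan, 40.0])

print(s.count())              # NaN is not counted
print(s.mean())               # mean of the two non-missing values
print(s.mean(skipna=False))   # propagates the missing value instead

# Made-up categorical values with one missing entry
habit = pd.Series(["nonsmoker", "smoker", None, "nonsmoker"])
print(habit.value_counts())              # missing values are dropped by default
print(habit.value_counts(dropna=False))  # adds a row for the missing value
```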

In PySpark - NaN values make computation of mean and standard deviation fail

df.describe().show()
df.groupBy('mature').count().show()

By default, pandas’ describe() function only gives summary statistics for numerical variables; for categorical variables we need the value_counts() function. For a large data set with many categorical variables it is cumbersome to write one statement per variable, so it would be nice to have an equivalent of R’s summary() in pandas. Below is my version of the R summary function for Python.

def Rsummary(df):
    stat = pd.DataFrame()
    for i in df.columns:
        # Numeric columns get describe(); everything else gets category counts
        if pd.api.types.is_numeric_dtype(df[i]):
            right = pd.DataFrame(df[i].describe())
        else:
            right = pd.DataFrame(df[i].value_counts())
        right.reset_index(level=0, inplace=True)
        right.columns = ['key', 'count']
        # Append the number of missing values, as R's summary() does
        right.loc[max(right.index) + 1] = ["NA's", df[i].isnull().sum()]
        right['count'] = right['count'].round(1)
        right[i] = right['key'] + ': ' + right['count'].apply(str)
        stat = pd.concat((stat, right[i]), axis=1)
    print(stat.fillna(''))

Rsummary(data)
##            fage           mage            mature         weeks  \
## 0  count: 829.0  count: 1000.0  younger mom: 867  count: 998.0   
## 1    mean: 30.3     mean: 27.0   mature mom: 133    mean: 38.3   
## 2      std: 6.8       std: 6.2           NA's: 0      std: 2.9   
## 3     min: 14.0      min: 13.0                       min: 20.0   
## 4     25%: 25.0      25%: 22.0                       25%: 37.0   
## 5     50%: 30.0      50%: 27.0                       50%: 39.0   
## 6     75%: 35.0      75%: 32.0                       75%: 40.0   
## 7     max: 55.0      max: 50.0                       max: 45.0   
## 8   NA's: 171.0      NA's: 0.0                       NA's: 2.0   
## 
##            premie        visits           marital        gained  \
## 0  full term: 846  count: 991.0  not married: 613  count: 973.0   
## 1     premie: 152    mean: 12.1      married: 386    mean: 30.3   
## 2         NA's: 2      std: 4.0           NA's: 1     std: 14.2   
## 3                      min: 0.0                        min: 0.0   
## 4                     25%: 10.0                       25%: 20.0   
## 5                     50%: 12.0                       50%: 30.0   
## 6                     75%: 15.0                       75%: 38.0   
## 7                     max: 30.0                       max: 85.0   
## 8                     NA's: 9.0                      NA's: 27.0   
## 
##           weight lowbirthweight       gender           habit        whitemom  
## 0  count: 1000.0   not low: 889  female: 503  nonsmoker: 873      white: 714  
## 1      mean: 7.1       low: 111    male: 497     smoker: 126  not white: 284  
## 2       std: 1.5        NA's: 0      NA's: 0         NA's: 1         NA's: 2  
## 3       min: 1.0                                                              
## 4       25%: 6.4                                                              
## 5       50%: 7.3                                                              
## 6       75%: 8.1                                                              
## 7      max: 11.8                                                              
## 8      NA's: 0.0
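
pandas also has a built-in option that gets part of the way there: describe(include="all") summarises numeric and non-numeric columns in one table. For non-numeric columns it reports count, unique, top and freq rather than a count per category, so it is not a full replacement for the function above. A minimal sketch on a made-up mini data set (the name sample is hypothetical):

```python
import pandas as pd

# Made-up mini data set with one numeric and one categorical column
sample = pd.DataFrame({
    "weight": [7.63, 7.88, 6.63, 8.00],
    "habit": ["nonsmoker", "nonsmoker", "smoker", "nonsmoker"],
})

# One table for both kinds of column; cells that do not apply are NaN
summary = sample.describe(include="all")
print(summary)
```

Here top is the most frequent category and freq is how often it occurs.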