What is Exploratory Data Analysis(EDA)?

EDA is the process of getting to know our data primarily through simple visualizations before fitting a model. As Wickham and Grolemund state, EDA is more an attitude than a scripted list of steps which must be carried out1.

Why is it done?

Two main reasons:

  1. If we collected the data ourselves to solve a problem, we need to determine whether our data is sufficient for solving that problem.

  2. If we didn't collect the data ourselves, we need to have a basic understanding of the type, quantity, quality, and possible relationships between the features in our data.

How do we do it in Python?

While I could use a toy data set, like in my last post, after seeing seeing tweets like this:

and listening to Hugo Bowne-Anderson on DataFramed bemoan the over use of the Iris and Titanic datasets, I'm feeling inspired to use my own data :grin:

As always, I'll follow the steps outlined in Hands-on Machine Learning with Scikit-Learn, Keras & TensorFlow

Step 1: Frame the Problem

"Given a set of features, can we determine how old someone needs to be to read a book?"

Step 2: Get the Data

To answer the question above, I sourced labeled data by scraping Common Sense Media's Book Reviews using BeautifulSoup and then wrote the data to a csv.2

Now that we have our data lets move on to...

Step 3: Explore the Data to Gain Insights (i.e. EDA)

As always, import the essential libraries, then load the data.

#For data manipulation
import pandas as pd
import numpy as np

#For visualization
import seaborn as sns
import matplotlib.pyplot as plt
import missingno as msno 

url = 'https://raw.githubusercontent.com/educatorsRlearners/book-maturity/master/csv/book_info_complete.csv'

df = pd.read_csv(url)

Time to start asking and answering some basic questions:

  • How much data do we have?
df.shape
(5816, 24)

OK, so we have 23 features and one target as well as 5,816 observations.

Why do we have fewer than in the screenshot above?

Because Common Sense Media is constantly adding new reviews to their website, meaning they've added nearly 100 books to their site since I completed my project at the end of March 2020.

  • What type of data do we have?
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5816 entries, 0 to 5815
Data columns (total 24 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   title                           5816 non-null   object 
 1   description                     5816 non-null   object 
 2   plot                            5816 non-null   object 
 3   csm_review                      5816 non-null   object 
 4   need_to_know                    5816 non-null   object 
 5   par_rating                      2495 non-null   float64
 6   kids_rating                     3026 non-null   float64
 7   csm_rating                      5816 non-null   int64  
 8   Author                          5468 non-null   object 
 9   Genre                           5816 non-null   object 
 10  Topics                          3868 non-null   object 
 11  Book type                       5816 non-null   object 
 12  Publisher                       5675 non-null   object 
 13  Publication date                5816 non-null   object 
 14  Publisher's recommended age(s)  4647 non-null   object 
 15  Number of pages                 5767 non-null   float64
 16  Available on                    3534 non-null   object 
 17  Last updated                    5816 non-null   object 
 18  Illustrator                     2490 non-null   object 
 19  Authors                         348 non-null    object 
 20  Awards                          68 non-null     object 
 21  Publishers                      33 non-null     object 
 22  Award                           415 non-null    object 
 23  Illustrators                    61 non-null     object 
dtypes: float64(3), int64(1), object(20)
memory usage: 1.1+ MB

Looks like a mix of strings and floats.

Lets take a closer look.

df.head().T
0 1 2 3 4
title The Third Twin Small Damages The School for Good and Evil, Book 1 Agent of Chaos: The X-Files Origins, Book 1 Crossing Ebenezer Creek
description Gripping thriller skimps on character developm... Luminous story of pregnant teen's summer in Sp... Fractured fairy tale has plenty of twists for ... Series pictures Mulder as teen, captures essen... Heartbreaking novel follows freed slaves on Sh...
plot Twins Ava and Alexa "Lexi" Rios live in an aff... It's the summer of 1996, which 18-year-old Ken... When best friends Sophie and Agatha are stolen... Set in 1979, AGENT OF CHAOS follows a 17-year-... CROSSING EBENEZER CREEK is a YA novel from awa...
csm_review THE THIRD TWIN has an interesting, compelling ... This could well have been a minefield of clich... The School for Good and Evil is no run-of-the-... Popular TV characters don't always make a smoo... Beautifully written and poetically rendered, t...
need_to_know Parents need to know that The Third Twin is a ... Parents need to know that Small Damages is nar... Parents need to know that The School for Good ... Parents need to know that Agent of Chaos: The ... Parents need to know that Crossing Ebenezer Cr...
par_rating 17 NaN 11 NaN NaN
kids_rating 14 14 11 NaN NaN
csm_rating 12 14 8 13 13
Author CJ Omololu Beth Kephart Soman Chainani Kami Garcia Tonya Bolden
Genre Mystery Coming of Age Fairy Tale Science Fiction Historical Fiction
Topics Adventures, Brothers and Sisters, Friendship, ... Friendship, History, Horses and Farm Animals Magic and Fantasy, Princesses, Fairies, Mermai... Magic and Fantasy, Adventures, Great Boy Role ... Friendship, History
Book type Fiction Fiction Fiction Fiction Fiction
Publisher Delacorte Press Philomel HarperCollins Children's Books Imprint Bloomsbury Children's Books
Publication date February 24, 2015 July 19, 2012 May 14, 2013 January 3, 2017 May 30, 2017
Publisher's recommended age(s) 12 - 18 14 - 17 8 - 17 14 - 18 NaN
Number of pages 336 304 496 320 240
Available on Nook, Hardback, iBooks, Kindle Nook, Hardback, iBooks, Kindle Nook, Audiobook (unabridged), Hardback, iBooks... Nook, Audiobook (abridged), Hardback, iBooks, ... Nook, Audiobook (unabridged), Hardback, Kindle
Last updated June 19, 2019 May 06, 2019 October 18, 2017 June 19, 2019 January 18, 2019
Illustrator NaN NaN Iacopo Bruno NaN NaN
Authors NaN NaN NaN NaN NaN
Awards NaN NaN NaN NaN NaN
Publishers NaN NaN NaN NaN NaN
Award NaN NaN NaN NaN NaN
Illustrators NaN NaN NaN NaN NaN

The picture is coming into focus. Again, since I collected the data, I know that the target is csm_rating which is the minimum age Common Sense Media (CSM) says a reader should be for the given book.

Also, we have essentially three types of features:

  • Numeric
    • par_rating : Ratings of the book by parents
    • kids_rating : Ratings of the book by children
    • :dart:csm_rating : Ratings of the books by Common Sense Media
    • Number of pages : Length of the book
    • Publisher's recommended age(s): Self explanatory
  • Date
    • Publication date : When the book was published
    • Last updated: When the book's information was updated

with all other features being text.

To make inspecting a little easier, lets clean those column names. 3

df.columns
Index(['title', 'description', 'plot', 'csm_review', 'need_to_know',
       'par_rating', 'kids_rating', 'csm_rating', 'Author', 'Genre', 'Topics',
       'Book type', 'Publisher', 'Publication date',
       'Publisher's recommended age(s)', 'Number of pages', 'Available on',
       'Last updated', 'Illustrator', 'Authors', 'Awards', 'Publishers',
       'Award', 'Illustrators'],
      dtype='object')
df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_').str.replace('(', '').str.replace(')', '')
df.columns
Index(['title', 'description', 'plot', 'csm_review', 'need_to_know',
       'par_rating', 'kids_rating', 'csm_rating', 'author', 'genre', 'topics',
       'book_type', 'publisher', 'publication_date',
       'publisher's_recommended_ages', 'number_of_pages', 'available_on',
       'last_updated', 'illustrator', 'authors', 'awards', 'publishers',
       'award', 'illustrators'],
      dtype='object')

Much better.

Given the number and variety of features, I'll focus on the numeric features in this post and analyze the text features in a part II.

Therefore, lets subset the data frame work with only the features of interest.

numeric = ['par_rating', 'kids_rating', 'csm_rating', 'number_of_pages', "publisher's_recommended_ages"]
df_numeric = df[numeric]
df_numeric.head()
par_rating kids_rating csm_rating number_of_pages publisher's_recommended_ages
0 17.0 14.0 12 336.0 12 - 18
1 NaN 14.0 14 304.0 14 - 17
2 11.0 11.0 8 496.0 8 - 17
3 NaN NaN 13 320.0 14 - 18
4 NaN NaN 13 240.0 NaN

:thumbsdown: publisher's_recommended_ages is a range instead of a single value.
:thumbsdown: It's the wrong data type (i.e., non-numeric)

df_numeric.dtypes
par_rating                      float64
kids_rating                     float64
csm_rating                        int64
number_of_pages                 float64
publisher's_recommended_ages     object
dtype: object

:thumbsup: We can fix both of those issues.

Given that we only care about the minimum age, we can:

  • split the string on the hyphen
  • keep only the first value since that will be the minimum age recommended by the publisher
  • convert the column to numeric
#Create a column with the minimum age
df_numeric['pub_rating'] = df.loc[:, "publisher\'s_recommended_ages"].str.split("-", n=1, expand=True)[0] 

#Set the column as numeric
df_numeric.loc[:, 'pub_rating'] = pd.to_numeric(df_numeric['pub_rating'])
df_numeric.head().T
0 1 2 3 4
par_rating 17 NaN 11 NaN NaN
kids_rating 14 14 11 NaN NaN
csm_rating 12 14 8 13 13
number_of_pages 336 304 496 320 240
publisher's_recommended_ages 12 - 18 14 - 17 8 - 17 14 - 18 NaN
pub_rating 12 14 8 14 NaN

Now we can drop the unnecessary column.

df_numeric = df_numeric.drop(columns="publisher's_recommended_ages")
df_numeric.head().T
0 1 2 3 4
par_rating 17.0 NaN 11.0 NaN NaN
kids_rating 14.0 14.0 11.0 NaN NaN
csm_rating 12.0 14.0 8.0 13.0 13.0
number_of_pages 336.0 304.0 496.0 320.0 240.0
pub_rating 12.0 14.0 8.0 14.0 NaN

Everything is in order so lets dig in and start by

Inspecting the target

df_numeric.loc[:, 'csm_rating'].describe()
count    5816.000000
mean        9.161623
std         3.881257
min         2.000000
25%         6.000000
50%         9.000000
75%        13.000000
max        17.000000
Name: csm_rating, dtype: float64

Good news! We do not have any missing values for our target! Also, we can see the lowest recommended age for a book is 2 years old, which has to be a picture book, while the highest is 17.

All useful info, but what does our target look like?

df_numeric.loc[:, 'csm_rating'].plot(kind= "hist", 
                             bins=range(2,18),
                             figsize=(24,10),
                             xticks=range(2,18),
                             fontsize=16);

Hmmm. Two thoughts:

First, the distribution is multimodal so when we split the data intotrain-test-validate splits, we'll need to do a stratified random sample. Also, the book recommendations seem to fall into one of three categories: really young readers, (e.g., 5 years old), tweens, and teens or older.

:bulb: Given this distribution, we could simplify our task from predicting an exact age and instead predict an age group. Something to keep in mind for future research.

Moving on.

Missing Values

Looking back at the output from df.info(), its obvious that several features have missing values but let's visualize it to make it clear.

msno.bar(df_numeric);

Good News!

  • There are fewer than 50 missing values for number_of_pages

Bad News!

  • pub_rating is missing a thousand values
  • Nearly half of the kids_rating are missing
  • More than half of the par_rating are missing.

When we get to the cleaning/feature engineering stage, we'll have to decide whether it's better to drop or impute) the missing values. However, before we do that, lets see visualize the data to get a better feel for it.

df_numeric['kids_rating'].plot(kind= "hist", 
                             bins=range(2,18),
                             figsize=(24,10),
                             xticks=range(2,18),
                             fontsize=16);

Hmmm. Looks like the children who wrote the bulk of the reviews think the books they reviewed were suitable for children between the ages of 8 and 14.

What about the parent's ratings?

df_numeric['par_rating'].plot(kind= "hist", 
                             bins=range(2,18),
                             figsize=(24,10),
                             xticks=range(2,18),
                             fontsize=16);

Same shape as the kids but a little less pronounced?

Finally, let's find out what the publishers think.

df_numeric['pub_rating'].plot(kind= "hist", 
                             bins=range(2,18),
                             figsize=(24,10),
                             xticks=range(2,18),
                             fontsize=16);

This looks promising. Let's compare it to our target csm_rating.

df_numeric.loc[:, ['pub_rating','csm_rating']].hist(bins=range(2,18),
                                                    figsize=(20,10));

:thinking: While not identical by any means, both distributions have the same multimodal shape.

Lets see how well our features correlate.

#Create the correlation matrix
corr = df_numeric.corr()

#Generate a mask to over the upper-right side of the matrix
mask = np.zeros_like(corr)
mask[np.triu_indices_from(mask)] = True

#Plot the heatmap with correlations
with sns.axes_style("white"):
    f, ax = plt.subplots(figsize=(8, 6))
    ax = sns.heatmap(corr, mask=mask, annot=True, square=True)

:thinking: The pub_rating is nearly a proxy for the target.

:thinking: The parents and kids ratings correlate more strongly with the target than they do with each other.

Looks like we're going to have to impute those missing values after all.

What about potential outliers?

A final step in EDA is to search for potential outliers which are values significantly different from the norm.

Our dataset could contain outliers for a couple of reasons:

  • data is miskeyed in, (e.g., someone types "100" instead of "10")
  • the observation genuinely is significantly different from the norm 4

The good news is that I chose to scrape Common Sense Media's book reviews because I felt confident in their ratings based on my knowledge of the domain and, given the professionalism of the organization, we can be fairly certain of the veracity of the data.

However, the old maxim of "Trust, but verify" exists for a reason.

This post is already much longer than I had planned so I won't go through all the numeric features, nor the multiple ways of identifying outliers, but a really simple one to plot the relationships between features so lets investigate the relationship between ratings and book length.

df_numeric.plot.scatter(x='number_of_pages', 
                        y="csm_rating",
                        figsize=(24,10),
                        fontsize=16);

We have spotted our first probable outliers: it is inconceivable for a ~400 page book to be meant for a two or three year old, right?

df.query('csm_rating < 6 & number_of_pages > 300')[['title','description']]
title description
2516 Beatrix Potter: The Complete Tales All of Beatrix Potter in one book, pre-K and K.
3158 Splat the Cat Even cats worry about the first day of school.
4076 Eloise: The Ultimate Edition All four stories about the irrepressible Eloise.
4463 A Giant Crush Fun Valentine story stresses self-esteem, frie...
4499 The Complete Tales & Poems of Winnie-the-Pooh Beloved, classic stories and poems in one volume.
4954 Mad About Madeline The stories vary in quality but still delight.

Well so much for that idea :grin:

To quote the AI Guru, instead of simply relying on printouts and plots, you "should always look at your bleeping data."

Summary

This post turned out to be part one of what will likely be at least three posts: EDA for:

  • :ballot_box_with_check: numeric data
  • :black_square_button: categorical data
  • :black_square_button: images (book covers)

Going forward, my key points to remember are:

Does the shape of the data make sense?

Based on my problem statement, I do not need normally distributed data. However, based on the question I'm trying to solve, I might expect the data to fit a certain distribution.

Similarly, are the values what I expect?

What would have happened if the only ratings I had were for 4 year olds? Clearly, I would have made a mistake somewhere along the line and would have to go back and fix it.

Also, I have to ask if the data makes sense or if I have outliers.

What's missing?

There will always be missing values. How many and in which features is going to drive a lot of feature engineering questions.

Speaking of which...

Are all the features I want present?

The numeric features I have are pretty complete, but what would happen if I combined the par_rating with the kids_rating to create a new feature? Would the two features combined be more valuable than either one on its own? Only one way to find out :smile:

Happy coding!