What is Exploratory Data Analysis (EDA)?

While I answered these questions in the last post, since all learning is repetition, I'll do it again :grin:

EDA is an ethos for how we scrutinize data including, but not limited to:

  • what we look for (e.g. shapes, trends, outliers)
  • the approaches we employ (e.g. five-number summary, visualizations)
  • and the decisions we reach¹
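As a quick illustration of the five-number summary mentioned above, here is a minimal sketch. The page counts are made up for the example and are not taken from the book dataset.

```python
import pandas as pd

# Toy data (hypothetical page counts, not from the book dataset)
pages = pd.Series([32, 40, 48, 96, 240, 304, 320, 336, 496])

# Five-number summary: minimum, first quartile, median, third quartile, maximum
five_num = pages.quantile([0, 0.25, 0.5, 0.75, 1.0])
five_num.index = ['min', 'q1', 'median', 'q3', 'max']
print(five_num)
```

`describe()` reports the same five values plus the count, mean, and standard deviation.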

Why is it done?

Two main reasons:

  1. If we collected the data ourselves, we need to know if our data suits our needs or if we need to collect more/different data.

  2. If we didn't collect the data ourselves, we need to interrogate the data to answer the "5 W's":

  • What kind of data do we have (i.e. numeric, categorical)?
  • When was the data collected? There could be more recent data which we could collect which would better inform our model.
  • How much data do we have? Also, how was the data collected?
  • Why was the data collected? The original motivation could highlight potential areas of bias in the data.
  • Who collected the data?

Some of these questions can't be answered by looking at the data alone, which is fine because nothing comes from nothing: someone will know the answers, so all we have to do is know where to look and whom to ask.

How do we do it in Python?

As always, I'll follow the steps outlined in Hands-on Machine Learning with Scikit-Learn, Keras & TensorFlow.

Step 1: Frame the Problem

"Given a set of features, can we determine how old someone needs to be to read a book?"

Step 2: Get the Data

We'll be using the same dataset as in the previous post.

Step 3: Explore the Data to Gain Insights (i.e. EDA)

As always, import the essential libraries, then load the data.

#For data manipulation
import pandas as pd
import numpy as np

#For visualization
import seaborn as sns
import matplotlib.pyplot as plt
import missingno as msno 

url = 'https://raw.githubusercontent.com/educatorsRlearners/book-maturity/master/csv/book_info_complete.csv'

df = pd.read_csv(url)

To review,

How much data do we have?

df.shape
(5816, 24)
  • 23 features
  • one target
  • 5,816 observations

What type of data do we have?

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5816 entries, 0 to 5815
Data columns (total 24 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   title                           5816 non-null   object 
 1   description                     5816 non-null   object 
 2   plot                            5816 non-null   object 
 3   csm_review                      5816 non-null   object 
 4   need_to_know                    5816 non-null   object 
 5   par_rating                      2495 non-null   float64
 6   kids_rating                     3026 non-null   float64
 7   csm_rating                      5816 non-null   int64  
 8   Author                          5468 non-null   object 
 9   Genre                           5816 non-null   object 
 10  Topics                          3868 non-null   object 
 11  Book type                       5816 non-null   object 
 12  Publisher                       5675 non-null   object 
 13  Publication date                5816 non-null   object 
 14  Publisher's recommended age(s)  4647 non-null   object 
 15  Number of pages                 5767 non-null   float64
 16  Available on                    3534 non-null   object 
 17  Last updated                    5816 non-null   object 
 18  Illustrator                     2490 non-null   object 
 19  Authors                         348 non-null    object 
 20  Awards                          68 non-null     object 
 21  Publishers                      33 non-null     object 
 22  Award                           415 non-null    object 
 23  Illustrators                    61 non-null     object 
dtypes: float64(3), int64(1), object(20)
memory usage: 1.1+ MB

Looks like mostly categorical with some numeric.
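If you'd rather not count dtypes by eye, one possible approach is to let pandas split the columns for you. This is a sketch on a made-up toy frame (the names `toy`, `numeric_cols`, and `other_cols` are just for illustration), not the real book data.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the book data (values are made up)
toy = pd.DataFrame({
    'title': ['A', 'B', 'C'],
    'genre': ['Mystery', 'Fairy Tale', 'Mystery'],
    'csm_rating': [12, 8, 13],
    'par_rating': [17.0, np.nan, 11.0],
})

# Columns pandas treats as numbers vs. everything else
numeric_cols = toy.select_dtypes(include='number').columns.tolist()
other_cols = toy.select_dtypes(exclude='number').columns.tolist()

print(numeric_cols)
print(other_cols)
```

Keep in mind that a non-numeric dtype only tells you how pandas stored the column; dates and free text end up in the same bucket as true categories.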

Let's take a closer look.

df.head().T
0 1 2 3 4
title The Third Twin Small Damages The School for Good and Evil, Book 1 Agent of Chaos: The X-Files Origins, Book 1 Crossing Ebenezer Creek
description Gripping thriller skimps on character developm... Luminous story of pregnant teen's summer in Sp... Fractured fairy tale has plenty of twists for ... Series pictures Mulder as teen, captures essen... Heartbreaking novel follows freed slaves on Sh...
plot Twins Ava and Alexa "Lexi" Rios live in an aff... It's the summer of 1996, which 18-year-old Ken... When best friends Sophie and Agatha are stolen... Set in 1979, AGENT OF CHAOS follows a 17-year-... CROSSING EBENEZER CREEK is a YA novel from awa...
csm_review THE THIRD TWIN has an interesting, compelling ... This could well have been a minefield of clich... The School for Good and Evil is no run-of-the-... Popular TV characters don't always make a smoo... Beautifully written and poetically rendered, t...
need_to_know Parents need to know that The Third Twin is a ... Parents need to know that Small Damages is nar... Parents need to know that The School for Good ... Parents need to know that Agent of Chaos: The ... Parents need to know that Crossing Ebenezer Cr...
par_rating 17 NaN 11 NaN NaN
kids_rating 14 14 11 NaN NaN
csm_rating 12 14 8 13 13
Author CJ Omololu Beth Kephart Soman Chainani Kami Garcia Tonya Bolden
Genre Mystery Coming of Age Fairy Tale Science Fiction Historical Fiction
Topics Adventures, Brothers and Sisters, Friendship, ... Friendship, History, Horses and Farm Animals Magic and Fantasy, Princesses, Fairies, Mermai... Magic and Fantasy, Adventures, Great Boy Role ... Friendship, History
Book type Fiction Fiction Fiction Fiction Fiction
Publisher Delacorte Press Philomel HarperCollins Children's Books Imprint Bloomsbury Children's Books
Publication date February 24, 2015 July 19, 2012 May 14, 2013 January 3, 2017 May 30, 2017
Publisher's recommended age(s) 12 - 18 14 - 17 8 - 17 14 - 18 NaN
Number of pages 336 304 496 320 240
Available on Nook, Hardback, iBooks, Kindle Nook, Hardback, iBooks, Kindle Nook, Audiobook (unabridged), Hardback, iBooks... Nook, Audiobook (abridged), Hardback, iBooks, ... Nook, Audiobook (unabridged), Hardback, Kindle
Last updated June 19, 2019 May 06, 2019 October 18, 2017 June 19, 2019 January 18, 2019
Illustrator NaN NaN Iacopo Bruno NaN NaN
Authors NaN NaN NaN NaN NaN
Awards NaN NaN NaN NaN NaN
Publishers NaN NaN NaN NaN NaN
Award NaN NaN NaN NaN NaN
Illustrators NaN NaN NaN NaN NaN

Again, I collected the data so I know the target is csm_rating which is the minimum age Common Sense Media (CSM) says a reader should be for the given book.

Also, we have essentially three types of features:

  • Numeric
    • par_rating : Ratings of the book by parents
    • kids_rating : Ratings of the book by children
    • :dart:csm_rating : Ratings of the books by Common Sense Media
    • Number of pages : Length of the book
    • Publisher's recommended age(s): Self explanatory
  • Date
    • Publication date : When the book was published
    • Last updated: When the book's information was updated on the website

with the rest of the features being categorical and text; these features will be our focus for today.

Step 3.1 Housekeeping

Clean the feature names to make inspection easier.³

df.columns
Index(['title', 'description', 'plot', 'csm_review', 'need_to_know',
       'par_rating', 'kids_rating', 'csm_rating', 'Author', 'Genre', 'Topics',
       'Book type', 'Publisher', 'Publication date',
       'Publisher's recommended age(s)', 'Number of pages', 'Available on',
       'Last updated', 'Illustrator', 'Authors', 'Awards', 'Publishers',
       'Award', 'Illustrators'],
      dtype='object')
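#regex=False treats the space and parentheses as literal characters on every pandas version
df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_', regex=False).str.replace('(', '', regex=False).str.replace(')', '', regex=False)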
df.columns
Index(['title', 'description', 'plot', 'csm_review', 'need_to_know',
       'par_rating', 'kids_rating', 'csm_rating', 'author', 'genre', 'topics',
       'book_type', 'publisher', 'publication_date',
       'publisher's_recommended_ages', 'number_of_pages', 'available_on',
       'last_updated', 'illustrator', 'authors', 'awards', 'publishers',
       'award', 'illustrators'],
      dtype='object')

Much better.
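One thing to notice: the cleaned names still carry an apostrophe (publisher's_recommended_ages). If that bothers you, here is a sketch of a regex-based variant that drops it too. This is an alternative I'm assuming you might want, not what the rest of the post uses, so the column name it produces differs from the one used later.

```python
import pandas as pd

# A few of the raw column names, plus one with stray whitespace for illustration
raw = pd.Index(['Book type', "Publisher's recommended age(s)", ' Number of pages '])

cleaned = (
    raw.str.strip()
       .str.lower()
       .str.replace(r"[()']", '', regex=True)   # drop parentheses and apostrophes
       .str.replace(r'\s+', '_', regex=True)    # turn each run of whitespace into one underscore
)
print(list(cleaned))
```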

Now let's subset the data frame so we only have the features of interest.

Given there are twice as many text features compared to non-text features, and the fact that I'm lazy (er, efficient), I'll create a list of the features I don't want

numeric = ['par_rating', 'kids_rating', 'csm_rating', 'number_of_pages', "publisher's_recommended_ages", "publication_date", "last_updated"]

and use it to keep the features I do want.

df_strings = df.drop(columns=numeric)

Voila!

df_strings.head().T
0 1 2 3 4
title The Third Twin Small Damages The School for Good and Evil, Book 1 Agent of Chaos: The X-Files Origins, Book 1 Crossing Ebenezer Creek
description Gripping thriller skimps on character developm... Luminous story of pregnant teen's summer in Sp... Fractured fairy tale has plenty of twists for ... Series pictures Mulder as teen, captures essen... Heartbreaking novel follows freed slaves on Sh...
plot Twins Ava and Alexa "Lexi" Rios live in an aff... It's the summer of 1996, which 18-year-old Ken... When best friends Sophie and Agatha are stolen... Set in 1979, AGENT OF CHAOS follows a 17-year-... CROSSING EBENEZER CREEK is a YA novel from awa...
csm_review THE THIRD TWIN has an interesting, compelling ... This could well have been a minefield of clich... The School for Good and Evil is no run-of-the-... Popular TV characters don't always make a smoo... Beautifully written and poetically rendered, t...
need_to_know Parents need to know that The Third Twin is a ... Parents need to know that Small Damages is nar... Parents need to know that The School for Good ... Parents need to know that Agent of Chaos: The ... Parents need to know that Crossing Ebenezer Cr...
author CJ Omololu Beth Kephart Soman Chainani Kami Garcia Tonya Bolden
genre Mystery Coming of Age Fairy Tale Science Fiction Historical Fiction
topics Adventures, Brothers and Sisters, Friendship, ... Friendship, History, Horses and Farm Animals Magic and Fantasy, Princesses, Fairies, Mermai... Magic and Fantasy, Adventures, Great Boy Role ... Friendship, History
book_type Fiction Fiction Fiction Fiction Fiction
publisher Delacorte Press Philomel HarperCollins Children's Books Imprint Bloomsbury Children's Books
available_on Nook, Hardback, iBooks, Kindle Nook, Hardback, iBooks, Kindle Nook, Audiobook (unabridged), Hardback, iBooks... Nook, Audiobook (abridged), Hardback, iBooks, ... Nook, Audiobook (unabridged), Hardback, Kindle
illustrator NaN NaN Iacopo Bruno NaN NaN
authors NaN NaN NaN NaN NaN
awards NaN NaN NaN NaN NaN
publishers NaN NaN NaN NaN NaN
award NaN NaN NaN NaN NaN
illustrators NaN NaN NaN NaN NaN

Clearly, the non-numeric data falls into two groups:

  • text
    • description
    • plot
    • csm_review
    • need_to_know
  • categories
    • author/authors
    • genre
    • award/awards
    • etc.

Looking at the output above, so many questions come to mind:

  1. How many missing values do we have?
  2. How long are the descriptions?
  3. What's the difference between csm_review and need_to_know?
  4. Similarly, what's the difference between description and plot?
  5. How many different authors do we have in the dataset?
  6. How many types of books do we have?

and I'm sure more will arise once we start.

Where to start? Let's answer the easiest questions first :grin:

Categories

How many missing values do we have?

A cursory glance at the output above indicates there are potentially a ton of missing values; let's inspect this hunch visually.

msno.bar(df_strings, sort='descending');

Hunch confirmed: 10 of the 17 columns are missing values with some being practically empty.

To get a precise count, we can use sidetable.²

import sidetable

df_strings.stb.missing(clip_0=True, style=True)
              Missing  Total  Percent
publishers       5783  5,816   99.43%
illustrators     5755  5,816   98.95%
awards           5748  5,816   98.83%
authors          5468  5,816   94.02%
award            5401  5,816   92.86%
illustrator      3326  5,816   57.19%
available_on     2282  5,816   39.24%
topics           1948  5,816   33.49%
author            348  5,816    5.98%
publisher         141  5,816    2.42%
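If you don't want to install an extra package, a similar table can be built with plain pandas. The sketch below uses a tiny made-up frame (`toy`) rather than the real data, and the `summary` layout is my own approximation of the sidetable output.

```python
import numpy as np
import pandas as pd

# Toy frame with gaps (values are illustrative only)
toy = pd.DataFrame({
    'author': ['CJ Omololu', np.nan, 'Tonya Bolden', 'Kami Garcia'],
    'authors': [np.nan, 'Marie Lu, Leigh Dragoon', np.nan, np.nan],
    'genre': ['Mystery', 'Science Fiction', 'Historical Fiction', 'Science Fiction'],
})

# Count missing values per column and express them as a share of all rows
missing = toy.isna().sum()
summary = pd.DataFrame({
    'missing': missing,
    'total': len(toy),
    'percent': (missing / len(toy) * 100).round(2),
})

# Keep only columns with at least one missing value, largest first
summary = summary[summary['missing'] > 0].sort_values('missing', ascending=False)
print(summary)
```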

OK, we have lots of missing values and several columns which appear to be measuring similar features (i.e., authors, illustrators, publishers, awards) so let's inspect these features in pairs.

author and authors

Every book has an author, even if the author is "Anonymous," so then why do we essentially have two columns for the same thing?

:thinking: author is for books with a single writer whereas authors is for books with multiple authors like Good Omens.

Let's test that theory.

msno.matrix(df_strings.loc[:, ['author', 'authors']]);

Bazinga!

The missing values in author and authors line up perfectly: wherever one is empty, the other is filled. Let's have a look just in case.

df_strings.loc[df_strings['author'].isna() & df_strings["authors"].notna(), ['title', 'author', 'authors']].head()
title author authors
6 Miss Educated: An Upper Class Novel #2 NaN Hobson Brown, Caroline Says, Taylor Materne
27 Zeroes, Book 1 NaN Scott Westerfeld, Margo Lanagan, Deborah Bianc...
34 Middle School: From Hero to Zero: Middle Schoo... NaN James Patterson, Chris Tebbetts
47 Legend: The Graphic Novel NaN Marie Lu, Leigh Dragoon
108 Survivors: Stranded, Book 3 NaN Jeff Probst, Chris Tebbetts
df_strings.loc[df_strings['author'].notna() & df_strings["authors"].isna(), ['title', 'author', 'authors']].head()
title author authors
0 The Third Twin CJ Omololu NaN
1 Small Damages Beth Kephart NaN
2 The School for Good and Evil, Book 1 Soman Chainani NaN
3 Agent of Chaos: The X-Files Origins, Book 1 Kami Garcia NaN
4 Crossing Ebenezer Creek Tonya Bolden NaN
df_strings.loc[df_strings['author'].notna() & df_strings["authors"].notna(), ['title', 'author', 'authors']].head()
title author authors

My curiosity is satiated.
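The three checks above can also be collapsed into a single test. Here is a minimal sketch on a made-up frame: the XOR (`^`) is true only when exactly one of the two columns is filled in.

```python
import numpy as np
import pandas as pd

# Toy rows mimicking the pattern seen above (illustrative values)
toy = pd.DataFrame({
    'author': ['CJ Omololu', np.nan, 'Tonya Bolden'],
    'authors': [np.nan, 'James Patterson, Chris Tebbetts', np.nan],
})

# True for a row when exactly one of author / authors is present
exactly_one = toy['author'].notna() ^ toy['authors'].notna()
print(exactly_one.all())
```

If this prints `True` on the real data, no row has both columns filled and no row has both empty.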

Now the question is how to successfully merge the two columns?

We could replace the NaN in author with the:

  • values in authors
  • word multiple
  • first author in authors
  • more/most popular of the authors in authors

and I'm sure I could come up with even more if I thought about/Googled it, but the key is to understand that no matter what we choose, it will have consequences when we build our model.³
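To make the first three options concrete, here is a rough sketch on a toy frame. The new column names (`author_all`, `author_flag`, `author_first`) are hypothetical, and I've left out the "most popular author" option because it needs an outside measure of popularity.

```python
import numpy as np
import pandas as pd

# Toy rows: one single-author book and two co-written ones
toy = pd.DataFrame({
    'author': ['Beth Kephart', np.nan, np.nan],
    'authors': [np.nan, 'Marie Lu, Leigh Dragoon', 'Jeff Probst, Chris Tebbetts'],
})

# Option 1: copy the full authors string into the gap
toy['author_all'] = toy['author'].fillna(toy['authors'])

# Option 2: flag co-written books with a single placeholder label
toy['author_flag'] = toy['author'].fillna('multiple')

# Option 3: keep only the first listed name
first_listed = toy['authors'].str.split(',').str[0].str.strip()
toy['author_first'] = toy['author'].fillna(first_listed)

print(toy[['author_all', 'author_flag', 'author_first']])
```

Option 1 adds even more distinct values, option 2 throws away who the writers were, and option 3 sits somewhere in between.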

Next question which comes to mind is:

:thinking: How many different authors are there?

df_strings.loc[:, 'author'].nunique()
2668

Wow! The number of distinct authors is nearly half the number of observations, meaning this feature has high cardinality.

:thinking: Which authors are most represented in the data set?

Let's create a frequency table to find out.

author_counts = df_strings.loc[:, ["title", 'author']].groupby('author').count().reset_index()
author_counts.sort_values('title', ascending=False).head(10)
author title
670 Dr. Seuss 39
2167 Rick Riordan 26
1470 Kevin Henkes 25
1916 Mo Willems 24
1965 Neil Gaiman 18
767 Eoin Colfer 17
2619 Walter Dean Myers 16
2643 William Joyce 16
1103 Jeff Kinney 15
1579 Lemony Snicket 15

Given that I've scraped the data from a website focusing on children, teens, and young adults, the results above only make sense; authors like Dr. Seuss, Eoin Colfer, and Lemony Snicket are famous children's authors whereas Rick Riordan and Walter Dean Myers occupy the teen/young adult space and Neil Gaiman writes across ages.

:thinking: How many authors are only represented once?

That's easy to check.

from matplotlib.ticker import FuncFormatter

ax = author_counts['title'].value_counts(normalize=True).nlargest(5).plot.barh()
ax.invert_yaxis();

#Set the x-axis to a percentage
ax.xaxis.set_major_formatter(FuncFormatter(lambda x, _: '{:.0%}'.format(x))) 

Wow! So approximately 60% of the authors have one title in our data set.

Why does that matter?

When it comes time to build our model we'll need to either label encode, one-hot encode, or hash this feature and whichever we decide to do will end up affecting the model profoundly due to the high cardinality of this feature; however, we'll deal with all this another time :grin:.
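To see why cardinality matters, here is a small sketch comparing two of those encodings on a handful of author names. It uses plain pandas helpers as stand-ins for whatever encoder ends up in the model, and it skips hashing.

```python
import pandas as pd

authors = pd.Series(
    ['Dr. Seuss', 'Mo Willems', 'Dr. Seuss', 'Neil Gaiman', 'Rick Riordan'],
    name='author',
)

# Label encoding: one integer code per distinct author (a single column)
codes, uniques = pd.factorize(authors)

# One-hot encoding: one column per distinct author
one_hot = pd.get_dummies(authors)

print(codes.tolist())   # [0, 1, 0, 2, 3]
print(one_hot.shape)    # (5, 4)
```

With four distinct names, one-hot encoding adds four columns; with the 2,668 distinct authors counted above it would add 2,668. Label encoding stays at one column but implies an ordering between authors that doesn't exist.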

illustrator and illustrators

Missing values can be quite informative.

:thinking: What types of books typically have illustrators?
:bulb: Children's books!

Therefore, if a book's entries for both illustrator and illustrators are blank, that probably means that book doesn't have illustrations which would mean it is more likely to be for older children.

Let's test this theory in the simplest way I can think of :smile:

#Has an illustrator
df.loc[df['illustrator'].notna() | df['illustrators'].notna(), ['csm_rating']].hist();
#Doesn't have an illustrator
df.loc[df['illustrators'].isna() & df["illustrator"].isna(), ['csm_rating']].hist();

:bulb: Who the illustrator is doesn't matter as much as whether there is an illustrator.

Looks like when I do some feature engineering I'll need to create a has_illustrator feature.
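A possible shape for that feature, sketched on made-up rows (the ratings and names below are invented for the example):

```python
import numpy as np
import pandas as pd

# Toy rows: illustrator only, neither, illustrators only
toy = pd.DataFrame({
    'illustrator': ['Iacopo Bruno', np.nan, np.nan],
    'illustrators': [np.nan, np.nan, 'A. Artist, B. Artist'],
    'csm_rating': [8, 14, 5],
})

# True when either illustrator column is filled in
toy['has_illustrator'] = toy['illustrator'].notna() | toy['illustrators'].notna()

# Compare the target across the two groups
median_by_flag = toy.groupby('has_illustrator')['csm_rating'].median()
print(median_by_flag)
```

The same boolean mask is what the two histograms above are already filtering on, so the feature costs nothing extra to compute.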

book_type and genre

These two features should be relatively straightforward but we'll have a quick look anyway.

book_type should be easy because, after a cursory inspection using head above, I'd expect to only see 'fiction' or 'non-fiction' but I'll double check.

ax_book_type = df_strings['book_type'].value_counts().plot.barh();
ax_book_type.invert_yaxis()

Good! The only values I have are the ones I expected but the ratio is highly skewed.

:thinking: What impact will this have on our model?

genre (e.g. fantasy, romance, sci-fi) is a far broader topic than book_type but how many different genres are represented in the data set?

df_strings['genre'].nunique()
45

:roll_eyes: Great

What's the breakdown?

ax_genre = df_strings['genre'].value_counts().plot.barh();
ax_genre.invert_yaxis()

That's not super useful but what if I took the 10 most common genres?

ax_genre_10 = df_strings['genre'].value_counts(normalize=True).nlargest(10).plot.barh();
ax_genre_10.invert_yaxis()

#Set the x axis to percentage
ax_genre_10.xaxis.set_major_formatter(FuncFormatter(lambda x, _: '{:.0%}'.format(x))) 

Hmmm. Looks like approximately half the books fall into one of three genres.

:bulb: To reduce dimensionality, recode any genre outside of the top 10 as 'other'.

Will save that idea for the feature engineering stage.
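For future reference, a minimal sketch of that recoding on a toy Series. I use `top_n = 2` only to keep the example small; the idea above calls for 10.

```python
import pandas as pd

# Toy genre column (illustrative values)
genre = pd.Series(['Fantasy', 'Picture Book', 'Fantasy', 'Mystery',
                   'Poetry', 'Fantasy', 'Picture Book'])

top_n = 2  # the post suggests 10; 2 keeps the toy example readable

# Genres to keep as-is: the top_n most frequent
keep = genre.value_counts().nlargest(top_n).index

# Everything else is replaced with the label 'other'
genre_lumped = genre.where(genre.isin(keep), 'other')
print(genre_lumped.value_counts())
```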

award and awards

Certain awards (e.g. The Caldecott Medal) are only awarded to children's books, whereas others, such as The RITA Award, are only for "mature" readers.

:thinking: Will knowing if a work is an award winner provide insight?
:thinking: Which awards are represented?

award_ax = df_strings['award'].value_counts().plot.barh()
award_ax.invert_yaxis();
awards_ax = df_strings['awards'].str.split(",").explode().str.strip().value_counts().plot.barh()
awards_ax.invert_yaxis()

Hmmmmm. The Caldecott Medal is for picture books so that should mean the target readers are very young; however, we've already seen that "picture books" is the second most common value in genre so being a Caldecott Medal winner won't add much. Also, to be eligible for the other awards, a book needs to be aimed at children 14 or below so that doesn't really tell us much either.

Conclusion: drop this feature.

While I could keep going and analyze publisher, publishers, and available_on, I'd be using the exact same techniques as above so, instead, time to move on to...

Text

description, plot, csm_review, need_to_know

Now for some REALLY fun stuff!

:thinking: How long is each of these observations?

Trying to be as efficient as possible, I'll:

  • make a list of the features I want
variables = ['description', 'plot', 'csm_review', 'need_to_know']
  • write a function to:
    • convert the text to lowercase
    • tokenize the text and remove stop words
    • identify the length of each feature
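from nltk.corpus import stopwords

#Requires a one-time nltk.download('stopwords')
#A set makes the membership test below fast
stop = set(stopwords.words('english'))

def text_process(df, feature): 
    #Lowercase the text
    df.loc[:, feature+'_tokens'] = df.loc[:, feature].apply(str.lower)
    #Split on whitespace and drop stop words
    df.loc[:, feature+'_tokens'] = df.loc[:, feature+'_tokens'].apply(lambda x: [item for item in x.split() if item not in stop])
    #Count the remaining tokens
    df.loc[:, feature+'_len'] = df.loc[:, feature+'_tokens'].apply(len) 
    #df is modified in place; returning it is a convenience
    return df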
  • loop through the list of variables saving it to the data frame
for var in variables: 
    df_text = text_process(df_strings, var)
df_text.iloc[:, -8:].head()
description_tokens description_len plot_tokens plot_len csm_review_tokens csm_review_len need_to_know_tokens need_to_know_len
0 [gripping, thriller, skimps, character, develo... 5 [twins, ava, alexa, "lexi", rios, live, afflue... 117 [third, twin, interesting,, compelling, premis... 89 [parents, need, know, third, twin, murder, mys... 77
1 [luminous, story, pregnant, teen's, summer, sp... 6 [summer, 1996,, 18-year-old, kenzie, planned, ... 56 [could, well, minefield, clichés, nd, preachin... 52 [parents, need, know, small, damages, narrated... 67
2 [fractured, fairy, tale, plenty, twists, fanta... 7 [best, friends, sophie, agatha, stolen, away, ... 72 [school, good, evil, run-of-the-mill, fairy, t... 65 [parents, need, know, school, good, evil, fres... 69
3 [series, pictures, mulder, teen,, captures, es... 8 [set, 1979,, agent, chaos, follows, 17-year-ol... 52 [popular, tv, characters, always, make, smooth... 64 [parents, need, know, agent, chaos:, x-files, ... 64
4 [heartbreaking, novel, follows, freed, slaves,... 7 [crossing, ebenezer, creek, ya, novel, award-w... 116 [beautifully, written, poetically, rendered,, ... 163 [parents, need, know, crossing, ebenezer, cree... 82
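Notice that some tokens above keep their punctuation (interesting, and 1979, for example) because the text is split on whitespace only. If cleaner tokens are needed later, one possible alternative is a regex-based tokenizer. The sketch below is an assumption on my part, not what produced the output above, and it uses a tiny hand-written stop-word set instead of NLTK's list.

```python
import re

# Tiny illustrative stop-word set (an assumption; the post uses NLTK's English list)
stop_words = {'the', 'has', 'an', 'a', 'of', 'and'}

def tokenize(text):
    """Lowercase, keep runs of letters, digits, apostrophes and hyphens, drop stop words."""
    words = re.findall(r"[a-z0-9'-]+", text.lower())
    return [w for w in words if w not in stop_words]

tokens = tokenize('THE THIRD TWIN has an interesting, compelling premise.')
print(tokens)
```

For length comparisons like the ones here the difference is negligible, since the punctuation rides along with a word rather than adding a token.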

:thinking: description seems to be significantly shorter than the other three.

Let's plot them to investigate.

len_columns  = df_text.columns.str.endswith('len')
df_text.loc[:,len_columns].hist();
plt.tight_layout()

Yep - description is significantly shorter but how do the other three compare?

columns = ['plot_len', 'need_to_know_len', 'csm_review_len']
df_text[columns].plot.box()
plt.xticks(rotation='vertical');

Hmmm. Lots of outliers for csm_review but, in general, the three features are of similar lengths.
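The points a box plot draws as outliers are, by matplotlib's default, the values more than 1.5 times the interquartile range beyond the quartiles. To count them rather than eyeball them, something like the sketch below would work. The token counts are made up and stand in for one of the _len columns.

```python
import pandas as pd

# Made-up token counts standing in for one of the *_len columns
lengths = pd.Series([52, 56, 64, 65, 72, 77, 82, 89, 116, 400])

# Quartiles and interquartile range
q1, q3 = lengths.quantile([0.25, 0.75])
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the quartiles
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = lengths[(lengths < lower) | (lengths > upper)]
print(len(outliers), 'outlier(s):', outliers.tolist())
```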

Next Steps

While I could create word clouds to visualize the most frequent words for each feature, or calculate the sentiment of each feature, my stated goal is to identify how old someone should be to read a book and not whether a review is good or bad.

To that end, my curiosity about these features is satiated so I'm ready to move on to another chapter.

Summary

  • :ballot_box_with_check: numeric data
  • :ballot_box_with_check: categorical data
  • :black_square_button: images (book covers)

Two down; one to go!

Going forward, my key points to remember are:

What type of categorical data do I have?

There is a huge difference between ordered (e.g. "bad", "good", "great") and truly nominal data that has no order/ranking like different genres; just because I prefer science fiction to fantasy, it doesn't mean it actually is superior.
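pandas lets you state this difference explicitly. A minimal sketch, using the example labels from the paragraph above:

```python
import pandas as pd

reviews = pd.Series(['good', 'bad', 'great', 'good'])

# Ordered: the category list declares the ranking bad < good < great
ordered = pd.Categorical(reviews, categories=['bad', 'good', 'great'], ordered=True)
print(ordered.codes.tolist())        # [1, 0, 2, 1]
print(ordered.min(), ordered.max())  # bad great

# Nominal: categories exist, but no ranking is implied
nominal = pd.Categorical(['Fantasy', 'Science Fiction', 'Mystery'])
print(nominal.ordered)               # False
```

Comparisons and `min`/`max` only make sense on the ordered version; asking for the "largest" genre on the nominal one raises an error, which is exactly the guard rail you want.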

Are missing values really missing?

Several of the features had missing values which were, in fact, not truly missing; for example, the award and awards features were mostly blank for a very good reason: the book didn't win one of the four awards recognized by Common Sense Media.

In conclusion, both of the points above can be summarized simply as "be sure to get to know your data."

Happy coding!

Footnotes


2. Be sure to check out this excellent post by Jeff Hale for more examples on how to use this package



4. Big Thank You to Chaim Gluck for providing this tip