E is for Exploratory Data Analysis: Categorical Data
What is Exploratory Data Analysis (EDA), why is it done, and how do we do it in Python?
What is Exploratory Data Analysis (EDA)?
I answered these questions in the last post but, since all learning is repetition, I'll do it again.
EDA is an ethos for how we scrutinize data including, but not limited to:
- what we look for (e.g., shapes, trends, outliers)
- the approaches we employ (e.g., five-number summary, visualizations)
- and the decisions we reach¹
Why is it done?
Two main reasons:

1. If we collected the data ourselves, we need to know whether the data suits our needs or whether we need to collect more/different data.
2. If we didn't collect the data ourselves, we need to interrogate the data to answer the "5 W's":
   - What kind of data do we have (e.g., numeric, categorical)?
   - When was the data collected? There could be more recent data that would better inform our model.
   - How much data do we have? Also, how was the data collected?
   - Why was the data collected? The original motivation could highlight potential areas of bias in the data.
   - Who collected the data?
Some of these questions can't be answered by looking at the data alone, which is fine because nothing comes from nothing: someone will know the answers, so all we have to do is know where to look and whom to ask.
How do we do it in Python?
As always, I'll follow the steps outlined in Hands-on Machine Learning with Scikit-Learn, Keras & TensorFlow.
Step 1: Frame the Problem
"Given a set of features, can we determine how old someone needs to be to read a book?"
Step 2: Get the Data
We'll be using the same dataset as in the previous post.
Step 3: Explore the Data to Gain Insights (i.e. EDA)
As always, import the essential libraries, then load the data.
```python
# For data manipulation
import pandas as pd
import numpy as np

# For visualization
import seaborn as sns
import matplotlib.pyplot as plt
import missingno as msno

url = 'https://raw.githubusercontent.com/educatorsRlearners/book-maturity/master/csv/book_info_complete.csv'
df = pd.read_csv(url)
```
To review,
How much data do we have?
```python
df.shape
```
- 23 features
- one target
- 5,816 observations
What type of data do we have?
```python
df.info()
```
Looks like mostly categorical with some numeric.
Let's take a closer look.
```python
df.head().T
```
Again, I collected the data, so I know the target is `csm_rating`: the minimum age Common Sense Media (CSM) says a reader should be for the given book.
Also, we have essentially three types of features:

- Numeric
  - `par_rating`: ratings of the book by parents
  - `kids_rating`: ratings of the book by children
  - `csm_rating`: ratings of the book by Common Sense Media
  - `Number of pages`: length of the book
  - `Publisher's recommended age(s)`: self-explanatory
- Date
  - `Publication date`: when the book was published
  - `Last updated`: when the book's information was updated on the website

with the rest of the features being categorical and text; these features will be our focus for today.
Step 3.1 Housekeeping
Clean the feature names to make inspection easier.⁴
```python
df.columns

# Note: regex=False is needed so '(' and ')' are treated literally
df.columns = (
    df.columns.str.strip()
              .str.lower()
              .str.replace(' ', '_', regex=False)
              .str.replace('(', '', regex=False)
              .str.replace(')', '', regex=False)
)

df.columns
```
Much better.
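As an aside, the chained `str.replace` calls can be collapsed into a single regex pass; a minimal sketch on toy column names mimicking the real ones:

```python
import pandas as pd

# Toy frame with the kind of messy column names in the real data
df = pd.DataFrame(columns=["Number of pages", "Publisher's recommended age(s)"])
df.columns = (
    df.columns.str.strip()
              .str.lower()
              .str.replace(r"[()]", "", regex=True)   # drop both parentheses in one pass
              .str.replace(" ", "_", regex=False)
)
print(list(df.columns))
```

Either way works; the regex version just saves a couple of chained calls.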
Now let's subset the data frame so we only have the features of interest.
Given there are twice as many text features as non-text features, and the fact that I'm ~~lazy~~ efficient, I'll create a list of the features I don't want
```python
numeric = ['par_rating', 'kids_rating', 'csm_rating', 'number_of_pages', "publisher's_recommended_ages", "publication_date", "last_updated"]
```
and use it to keep the features I do want.
```python
df_strings = df.drop(columns=numeric)
```
Voila!
```python
df_strings.head().T
```
Clearly, the non-numeric data falls into two groups:

- text
  - `description`
  - `plot`
  - `csm_review`
  - `need_to_know`
- categories
  - `author`/`authors`
  - `genre`
  - `award`/`awards`
  - etc.
Looking at the output above, so many questions come to mind:

- How many missing values do we have?
- How long are the descriptions?
- What's the difference between `csm_review` and `need_to_know`?
- Similarly, what's the difference between `description` and `plot`?
- How many different authors do we have in the dataset?
- How many types of books do we have?

and I'm sure more will arise once we start.
Where to start? Let's answer the easiest questions first.
Categories
How many missing values do we have?
A cursory glance at the output above indicates there are potentially a ton of missing values; let's inspect this hunch visually.
```python
msno.bar(df_strings, sort='descending');
```
Hunch confirmed: 10 of the 17 columns are missing values, with some being practically empty.
To get a precise count, we can use `sidetable`.²

```python
import sidetable

df_strings.stb.missing(clip_0=True, style=True)
```
OK, we have lots of missing values and several columns which appear to measure similar features (i.e., authors, illustrators, publishers, awards), so let's inspect these features in pairs.
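If `sidetable` isn't available, plain pandas can build a similar missing-value table; a sketch using a toy stand-in for `df_strings`:

```python
import pandas as pd

# Toy frame standing in for df_strings
toy = pd.DataFrame({"author": ["a", None, "c"], "awards": [None, None, "x"]})
missing = (
    toy.isna().sum()                      # missing count per column
       .to_frame("missing")
       .assign(percent=lambda t: t["missing"] / len(toy) * 100)
       .sort_values("missing", ascending=False)
)
print(missing)
```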
`author` and `authors`
Every book has an author, even if the author is "Anonymous," so why do we essentially have two columns for the same thing? My guess: `author` is for books with a single writer, whereas `authors` is for books with multiple writers, like Good Omens.
Let's test that theory.
```python
msno.matrix(df_strings.loc[:, ['author', 'authors']]);
```
Bazinga!
We have a perfect correlation between missing data for `author` and `authors`, but let's have a look just in case.
```python
df_strings.loc[df_strings['author'].isna() & df_strings['authors'].notna(), ['title', 'author', 'authors']].head()
df_strings.loc[df_strings['author'].notna() & df_strings['authors'].isna(), ['title', 'author', 'authors']].head()
df_strings.loc[df_strings['author'].notna() & df_strings['authors'].notna(), ['title', 'author', 'authors']].head()
```
My curiosity is satiated.
Now the question is: how do we successfully merge the two columns?
We could replace the `NaN` values in `author` with:

- the values in `authors`
- the word "multiple"
- the first author in `authors`
- the more/most popular of the authors in `authors`

and I'm sure I could come up with even more if I thought about/Googled it, but the key is to understand that no matter what we choose, it will have consequences when we build our model.³
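To make the first option concrete (fill `author`'s `NaN`s with the values in `authors`), a minimal sketch with toy data might use `combine_first`:

```python
import pandas as pd

# Toy data: one single-author book, one multi-author book
books = pd.DataFrame({
    "author":  ["Dr. Seuss", None],
    "authors": [None, "Neil Gaiman, Terry Pratchett"],
})
# combine_first keeps 'author' where present and falls back to 'authors'
books["author_merged"] = books["author"].combine_first(books["authors"])
print(books["author_merged"].tolist())
```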
The next question which comes to mind is:
How many different authors are there?
```python
df_strings.loc[:, 'author'].nunique()
```
Wow! Nearly half of our observations contain a unique name, meaning this feature has high cardinality.
Which authors are most represented in the data set?
Let's create a frequency table to find out.
```python
author_counts = df_strings.loc[:, ["title", 'author']].groupby('author').count().reset_index()
author_counts.sort_values('title', ascending=False).head(10)
```
Given that I scraped the data from a website focusing on children, teens, and young adults, the results above make sense: authors like Dr. Seuss, Eoin Colfer, and Lemony Snicket are famous children's authors, whereas Rick Riordan and Walter Dean Myers occupy the teen/young-adult space, and Neil Gaiman writes across ages.
How many authors are only represented once?
That's easy to check.
```python
from matplotlib.ticker import FuncFormatter

ax = author_counts['title'].value_counts(normalize=True).nlargest(5).plot.barh()
ax.invert_yaxis()

# Set the x-axis to a percentage
ax.xaxis.set_major_formatter(FuncFormatter(lambda x, _: '{:.0%}'.format(x)))
```
Wow! Approximately 60% of the authors have only one title in our data set.
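That share can also be checked numerically rather than read off a chart; a toy sketch of the idea (on the real data the Series would be the `author` column):

```python
import pandas as pd

# Toy author column: 'a' has two titles, the rest have one each
titles_per_author = pd.Series(["a", "a", "b", "c", "d"]).value_counts()
# Fraction of authors with exactly one title
share_single = (titles_per_author == 1).mean()
print(share_single)  # 0.75: three of the four authors have a single title
```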
Why does that matter?
When it comes time to build our model, we'll need to either label encode, one-hot encode, or hash this feature, and whichever we choose will profoundly affect the model due to this feature's high cardinality; however, we'll deal with all that another time.
`illustrator` and `illustrators`
Missing values can be quite informative.
What types of books typically have illustrators?
Children's books!
Therefore, if a book's entries for both `illustrator` and `illustrators` are blank, that probably means the book doesn't have illustrations, which would mean it is more likely to be for older children.
Let's test this theory in the simplest way I can think of:
```python
# Has an illustrator
df.loc[df['illustrator'].notna() | df['illustrators'].notna(), ['csm_rating']].hist();

# Doesn't have an illustrator
df.loc[df['illustrators'].isna() & df['illustrator'].isna(), ['csm_rating']].hist();
```
Who the illustrator is doesn't matter as much as whether there is an illustrator.
Looks like when I do some feature engineering I'll need to create a `has_illustrator` feature.
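A minimal sketch of what that engineered feature might look like, using toy data and the hypothetical column name `has_illustrator`:

```python
import pandas as pd

books = pd.DataFrame({
    "illustrator":  ["Quentin Blake", None, None],
    "illustrators": [None, "A, B", None],
})
# True if either column has an entry; who the illustrator is doesn't matter
books["has_illustrator"] = books["illustrator"].notna() | books["illustrators"].notna()
print(books["has_illustrator"].tolist())  # [True, True, False]
```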
`book_type`

```python
ax_book_type = df_strings['book_type'].value_counts().plot.barh()
ax_book_type.invert_yaxis()
```
Good! The only values I have are the ones I expected but the ratio is highly skewed.
What impact will this have on our model?
`genre` (e.g., fantasy, romance, sci-fi) is a far broader topic than `book_type`, but how many different genres are represented in the data set?
```python
df_strings['genre'].nunique()
```
Great
What's the breakdown?
```python
ax_genre = df_strings['genre'].value_counts().plot.barh()
ax_genre.invert_yaxis()
```
That's not super useful, but what if I take the 10 most common genres?
```python
ax_genre_10 = df_strings['genre'].value_counts(normalize=True).nlargest(10).plot.barh()
ax_genre_10.invert_yaxis()

# Set the x-axis to a percentage
ax_genre_10.xaxis.set_major_formatter(FuncFormatter(lambda x, _: '{:.0%}'.format(x)))
```
Hmmm. Looks like approximately half the books fall into one of three genres.
To reduce dimensionality, we could recode any genre outside the top 10 as "other". I'll save that idea for the feature-engineering stage.
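A sketch of that recoding idea on toy data (using the top 2 instead of the top 10 to keep the example small):

```python
import pandas as pd

genres = pd.Series(["fantasy", "fantasy", "fantasy", "sci-fi", "sci-fi", "romance"])
top = genres.value_counts().nlargest(2).index   # nlargest(10) on the real data
# Keep the common genres, lump everything else into 'other'
recoded = genres.where(genres.isin(top), "other")
print(recoded.tolist())
```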
`award` and `awards`
Certain awards (e.g., the Caldecott Medal) are only given to children's books, whereas others, namely the RITA Award, are only for "mature" readers. So will knowing whether a work is an award winner provide insight?
Which awards are represented?
```python
award_ax = df_strings['award'].value_counts().plot.barh()
award_ax.invert_yaxis()

awards_ax = df_strings['awards'].str.split(",").explode().str.strip().value_counts().plot.barh()
awards_ax.invert_yaxis()
```
Hmmmmm. The Caldecott Medal is for picture books, so its winners should be aimed at very young readers; however, we've already seen that "picture books" is the second most common value in `genre`, so being a Caldecott Medal winner won't add much. Also, to be eligible for the other awards, a book needs to be aimed at children 14 or below, so they don't really tell us much either.
Conclusion: drop this feature.
While I could keep going and analyze `publisher`, `publishers`, and `available_on`, I'd be using the exact same techniques as above, so instead it's time to move on to...
Text
`description`, `plot`, `csm_review`, `need_to_know`
Now for some REALLY fun stuff!
How long is each of these features?
Trying to be as efficient as possible, I'll:

1. make a list of the features I want

```python
variables = ['description', 'plot', 'csm_review', 'need_to_know']
```

2. write a function to:
   - convert the text to lowercase
   - tokenize the text (a simple whitespace split) and remove stop words
   - record the length of each feature

```python
from nltk.corpus import stopwords

stop = stopwords.words('english')

def text_process(df, feature):
    # Lowercase the raw text
    df.loc[:, feature + '_tokens'] = df.loc[:, feature].apply(str.lower)
    # Simple whitespace tokenization, dropping stop words
    df.loc[:, feature + '_tokens'] = df.loc[:, feature + '_tokens'].apply(
        lambda x: [item for item in x.split() if item not in stop]
    )
    # Token count per observation
    df.loc[:, feature + '_len'] = df.loc[:, feature + '_tokens'].apply(len)
    return df
```

3. loop through the list of variables, saving the result to the data frame

```python
for var in variables:
    df_text = text_process(df_strings, var)

df_text.iloc[:, -8:].head()
```
`description` seems to be significantly shorter than the other three.
Let's plot them to investigate.
```python
len_columns = df_text.columns.str.endswith('len')
df_text.loc[:, len_columns].hist()
plt.tight_layout()
```
Yep: `description` is significantly shorter, but how do the other three compare?
```python
columns = ['plot_len', 'need_to_know_len', 'csm_review_len']
df_text[columns].plot.box()
plt.xticks(rotation='vertical')
```
Hmmm. Lots of outliers for `csm_review` but, in general, the three features are of similar lengths.
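One quick way to count those outliers, rather than eyeball the box plot, is the standard 1.5 × IQR rule; a sketch with toy review lengths:

```python
import pandas as pd

# Toy review lengths with one extreme value
lengths = pd.Series([10, 12, 11, 13, 12, 95])
q1, q3 = lengths.quantile([0.25, 0.75])
iqr = q3 - q1
# Anything beyond 1.5 * IQR from the quartiles is flagged, matching the box-plot whiskers
outliers = lengths[(lengths < q1 - 1.5 * iqr) | (lengths > q3 + 1.5 * iqr)]
print(len(outliers))  # 1
```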
Next Steps
While I could create word clouds to visualize the most frequent words in each feature, or calculate each feature's sentiment, my stated goal is to identify how old someone should be to read a book, not whether a review is good or bad.
To that end, my curiosity about these features is satiated so I'm ready to move on to another chapter.
Summary
This dataset contains three types of data:
- numeric data
- categorical data
- images (book covers)

Two down; one to go!
Going forward, my key points to remember are:
What type of categorical data do I have?
There is a huge difference between ordered data (e.g., "bad", "good", "great") and truly nominal data that has no order/ranking, like different genres; just because I prefer science fiction to fantasy doesn't mean it actually is superior.
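In pandas, that distinction is captured by ordered categoricals; a minimal sketch:

```python
import pandas as pd

# An ordered categorical: "bad" < "good" < "great"
ratings = pd.Series(["bad", "great", "good"]).astype(
    pd.CategoricalDtype(["bad", "good", "great"], ordered=True)
)
# Ordered categories support comparisons; plain nominal categories do not
print((ratings > "bad").tolist())  # [False, True, True]
```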
Are missing values really missing?
Several of the features had missing values which were, in fact, not truly missing; for example, the `award` and `awards` features were mostly blank for a very good reason: the book didn't win one of the four awards recognized by Common Sense Media.
In conclusion, both of the points above can be summarized simply as "be sure to get to know your data."
Happy coding!
Footnotes
1. Adapted from the Engineering Statistics Handbook ↩
2. Be sure to check out this excellent post by Jeff Hale for more examples of how to use this package ↩
3. See this post on Smarter Ways to Encode Categorical Data ↩
4. A big thank you to Chaim Gluck for providing this tip ↩