E is for Exploratory Data Analysis: Categorical Data
What is Exploratory Data Analysis (EDA), why is it done, and how do we do it in Python?
What is Exploratory Data Analysis (EDA)?
I answered these questions in the last post but, since all learning is repetition, I'll do it again.
EDA is an ethos for how we scrutinize data including, but not limited to:
- what we look for (e.g., shapes, trends, outliers)
- the approaches we employ (e.g., five-number summary, visualizations)
- and the decisions we reach¹
Why is it done?
Two main reasons:

1. If we collected the data ourselves, we need to know whether the data suits our needs or whether we need to collect more/different data.
2. If we didn't collect the data ourselves, we need to interrogate the data to answer the "5 W's":
   - What kind of data do we have (e.g., numeric, categorical)?
   - When was the data collected? There could be more recent data that would better inform our model.
   - How much data do we have? Also, how was the data collected?
   - Why was the data collected? The original motivation could highlight potential areas of bias in the data.
   - Who collected the data?
Some of these questions can't be answered by looking at the data alone, which is fine because nothing comes from nothing: someone will know the answers, so all we have to do is know where to look and whom to ask.
How do we do it in Python?
As always, I'll follow the steps outlined in Hands-on Machine Learning with Scikit-Learn, Keras & TensorFlow.
Step 1: Frame the Problem
"Given a set of features, can we determine how old someone needs to be to read a book?"
Step 2: Get the Data
We'll be using the same dataset as in the previous post.
Step 3: Explore the Data to Gain Insights (i.e. EDA)
As always, import the essential libraries, then load the data.
```python
# For data manipulation
import pandas as pd
import numpy as np

# For visualization
import seaborn as sns
import matplotlib.pyplot as plt
import missingno as msno

url = 'https://raw.githubusercontent.com/educatorsRlearners/book-maturity/master/csv/book_info_complete.csv'
df = pd.read_csv(url)
```
To review,
How much data do we have?
```python
df.shape
```
- 23 features
- one target
- 5,816 observations
What type of data do we have?
```python
df.info()
```
Looks like mostly categorical with some numeric.
Let's take a closer look.
```python
df.head().T
```
Again, I collected the data, so I know the target is `csm_rating`: the minimum age Common Sense Media (CSM) says a reader should be for the given book.
Also, we have essentially three types of features:

- Numeric
  - `par_rating`: ratings of the book by parents
  - `kids_rating`: ratings of the book by children
  - `csm_rating`: ratings of the book by Common Sense Media
  - `Number of pages`: length of the book
  - `Publisher's recommended age(s)`: self-explanatory
- Date
  - `Publication date`: when the book was published
  - `Last updated`: when the book's information was updated on the website

with the rest of the features being categorical and text; these features will be our focus for today.
Step 3.1 Housekeeping
Clean the feature names to make inspection easier.⁴
```python
df.columns

# Note: regex=False is needed so '(' and ')' are treated literally
df.columns = (
    df.columns.str.strip()
              .str.lower()
              .str.replace(' ', '_', regex=False)
              .str.replace('(', '', regex=False)
              .str.replace(')', '', regex=False)
)

df.columns
```
Much better.
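As an aside, the chained `str.replace` calls can be collapsed into a single regex pass; a minimal sketch on toy column names mimicking the real ones:

```python
import pandas as pd

# Toy frame with the kind of messy column names in the real data
df = pd.DataFrame(columns=["Number of pages", "Publisher's recommended age(s)"])
df.columns = (
    df.columns.str.strip()
              .str.lower()
              .str.replace(r"[()]", "", regex=True)   # drop both parentheses in one pass
              .str.replace(" ", "_", regex=False)
)
print(list(df.columns))
```

Either way works; the regex version just saves a couple of chained calls.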
Now let's subset the data frame so we only have the features of interest.
Given there are twice as many text features as non-text features, and the fact that I'm ~~lazy~~ efficient, I'll create a list of the features I don't want
```python
numeric = ['par_rating', 'kids_rating', 'csm_rating', 'number_of_pages', "publisher's_recommended_ages", "publication_date", "last_updated"]
```
and use it to keep the features I do want.
```python
df_strings = df.drop(columns=numeric)
```
Voila!
```python
df_strings.head().T
```
Clearly, the non-numeric data falls into two groups:

- text
  - `description`
  - `plot`
  - `csm_review`
  - `need_to_know`
- categories
  - `author`/`authors`
  - `genre`
  - `award`/`awards`
  - etc.
Looking at the output above, so many questions come to mind:

- How many missing values do we have?
- How long are the descriptions?
- What's the difference between `csm_review` and `need_to_know`?
- Similarly, what's the difference between `description` and `plot`?
- How many different authors do we have in the dataset?
- How many types of books do we have?

and I'm sure more will arise once we start.
Where to start? Let's answer the easiest questions first.
Categories
How many missing values do we have?
A cursory glance at the output above indicates there are potentially a ton of missing values; let's inspect this hunch visually.
```python
msno.bar(df_strings, sort='descending');
```
Hunch confirmed: 10 of the 17 columns are missing values, with some being practically empty.
To get a precise count, we can use `sidetable`.²

```python
import sidetable

df_strings.stb.missing(clip_0=True, style=True)
```
OK, we have lots of missing values and several columns which appear to measure similar features (i.e., authors, illustrators, publishers, awards), so let's inspect these features in pairs.
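If `sidetable` isn't available, plain pandas can build a similar missing-value table; a sketch using a toy stand-in for `df_strings`:

```python
import pandas as pd

# Toy frame standing in for df_strings
toy = pd.DataFrame({"author": ["a", None, "c"], "awards": [None, None, "x"]})
missing = (
    toy.isna().sum()                      # missing count per column
       .to_frame("missing")
       .assign(percent=lambda t: t["missing"] / len(toy) * 100)
       .sort_values("missing", ascending=False)
)
print(missing)
```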
`author` and `authors`
Every book has an author, even if the author is "Anonymous," so why do we essentially have two columns for the same thing? My guess: `author` is for books with a single writer, whereas `authors` is for books with multiple writers, like Good Omens.
Let's test that theory.
```python
msno.matrix(df_strings.loc[:, ['author', 'authors']]);
```
Bazinga!
We have a perfect correlation between missing data for `author` and `authors`, but let's have a look just in case.
```python
df_strings.loc[df_strings['author'].isna() & df_strings['authors'].notna(), ['title', 'author', 'authors']].head()
df_strings.loc[df_strings['author'].notna() & df_strings['authors'].isna(), ['title', 'author', 'authors']].head()
df_strings.loc[df_strings['author'].notna() & df_strings['authors'].notna(), ['title', 'author', 'authors']].head()
```
My curiosity is satiated.
Now the question is: how do we successfully merge the two columns?
We could replace the `NaN` values in `author` with:

- the values in `authors`
- the word "multiple"
- the first author in `authors`
- the more/most popular of the authors in `authors`

and I'm sure I could come up with even more if I thought about/Googled it, but the key is to understand that no matter what we choose, it will have consequences when we build our model.³
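To make the first option concrete (fill `author`'s `NaN`s with the values in `authors`), a minimal sketch with toy data might use `combine_first`:

```python
import pandas as pd

# Toy data: one single-author book, one multi-author book
books = pd.DataFrame({
    "author":  ["Dr. Seuss", None],
    "authors": [None, "Neil Gaiman, Terry Pratchett"],
})
# combine_first keeps 'author' where present and falls back to 'authors'
books["author_merged"] = books["author"].combine_first(books["authors"])
print(books["author_merged"].tolist())
```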
The next question which comes to mind is:
How many different authors are there?
```python
df_strings.loc[:, 'author'].nunique()
```
Wow! Nearly half of our observations contain a unique name, meaning this feature has high cardinality.
Which authors are most represented in the data set?
Let's create a frequency table to find out.
```python
author_counts = df_strings.loc[:, ["title", 'author']].groupby('author').count().reset_index()
author_counts.sort_values('title', ascending=False).head(10)
```
Given that I scraped the data from a website focusing on children, teens, and young adults, the results above make sense: authors like Dr. Seuss, Eoin Colfer, and Lemony Snicket are famous children's authors, whereas Rick Riordan and Walter Dean Myers occupy the teen/young-adult space, and Neil Gaiman writes across ages.
How many authors are only represented once?
That's easy to check.
```python
from matplotlib.ticker import FuncFormatter

ax = author_counts['title'].value_counts(normalize=True).nlargest(5).plot.barh()
ax.invert_yaxis()

# Set the x-axis to a percentage
ax.xaxis.set_major_formatter(FuncFormatter(lambda x, _: '{:.0%}'.format(x)))
```
Wow! Approximately 60% of the authors have only one title in our data set.
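That share can also be checked numerically rather than read off a chart; a toy sketch of the idea (on the real data the Series would be the `author` column):

```python
import pandas as pd

# Toy author column: 'a' has two titles, the rest have one each
titles_per_author = pd.Series(["a", "a", "b", "c", "d"]).value_counts()
# Fraction of authors with exactly one title
share_single = (titles_per_author == 1).mean()
print(share_single)  # 0.75: three of the four authors have a single title
```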
Why does that matter?
When it comes time to build our model, we'll need to either label encode, one-hot encode, or hash this feature, and whichever we choose will profoundly affect the model due to this feature's high cardinality; however, we'll deal with all that another time.
`illustrator` and `illustrators`
Missing values can be quite informative.
What types of books typically have illustrators?
Children's books!
Therefore, if a book's entries for both `illustrator` and `illustrators` are blank, that probably means the book doesn't have illustrations, which would mean it is more likely to be for older children.
Let's test this theory in the simplest way I can think of:
```python
# Has an illustrator
df.loc[df['illustrator'].notna() | df['illustrators'].notna(), ['csm_rating']].hist();

# Doesn't have an illustrator
df.loc[df['illustrators'].isna() & df['illustrator'].isna(), ['csm_rating']].hist();
```
Who the illustrator is doesn't matter as much as whether there is an illustrator.
Looks like when I do some feature engineering I'll need to create a `has_illustrator` feature.
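A minimal sketch of what that engineered feature might look like, using toy data and the hypothetical column name `has_illustrator`:

```python
import pandas as pd

books = pd.DataFrame({
    "illustrator":  ["Quentin Blake", None, None],
    "illustrators": [None, "A, B", None],
})
# True if either column has an entry; who the illustrator is doesn't matter
books["has_illustrator"] = books["illustrator"].notna() | books["illustrators"].notna()
print(books["has_illustrator"].tolist())  # [True, True, False]
```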
`book_type`

```python
ax_book_type = df_strings['book_type'].value_counts().plot.barh()
ax_book_type.invert_yaxis()
```
Good! The only values I have are the ones I expected but the ratio is highly skewed.
What impact will this have on our model?
`genre` (e.g., fantasy, romance, sci-fi) is a far broader topic than `book_type`, but how many different genres are represented in the data set?
```python
df_strings['genre'].nunique()
```
Great
What's the breakdown?
```python
ax_genre = df_strings['genre'].value_counts().plot.barh()
ax_genre.invert_yaxis()
```
That's not super useful, but what if I take the 10 most common genres?
```python
ax_genre_10 = df_strings['genre'].value_counts(normalize=True).nlargest(10).plot.barh()
ax_genre_10.invert_yaxis()

# Set the x-axis to a percentage
ax_genre_10.xaxis.set_major_formatter(FuncFormatter(lambda x, _: '{:.0%}'.format(x)))
```
Hmmm. Looks like approximately half the books fall into one of three genres.
To reduce dimensionality, we could recode any genre outside the top 10 as "other". I'll save that idea for the feature-engineering stage.
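A sketch of that recoding idea on toy data (using the top 2 instead of the top 10 to keep the example small):

```python
import pandas as pd

genres = pd.Series(["fantasy", "fantasy", "fantasy", "sci-fi", "sci-fi", "romance"])
top = genres.value_counts().nlargest(2).index   # nlargest(10) on the real data
# Keep the common genres, lump everything else into 'other'
recoded = genres.where(genres.isin(top), "other")
print(recoded.tolist())
```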
`award` and `awards`
Certain awards (e.g., the Caldecott Medal) are only given to children's books, whereas others, namely the RITA Award, are only for "mature" readers. So will knowing whether a work is an award winner provide insight?
Which awards are represented?
```python
award_ax = df_strings['award'].value_counts().plot.barh()
award_ax.invert_yaxis()

awards_ax = df_strings['awards'].str.split(",").explode().str.strip().value_counts().plot.barh()
awards_ax.invert_yaxis()
```
Hmmmmm. The Caldecott Medal is for picture books, so its winners should be aimed at very young readers; however, we've already seen that "picture books" is the second most common value in `genre`, so being a Caldecott Medal winner won't add much. Also, to be eligible for the other awards, a book needs to be aimed at children 14 or below, so they don't really tell us much either.
Conclusion: drop this feature.
While I could keep going and analyze `publisher`, `publishers`, and `available_on`, I'd be using the exact same techniques as above, so instead it's time to move on to...
Text
`description`, `plot`, `csm_review`, `need_to_know`
Now for some REALLY fun stuff!
How long is each of these features?
Trying to be as efficient as possible, I'll:

1. make a list of the features I want

```python
variables = ['description', 'plot', 'csm_review', 'need_to_know']
```

2. write a function to:
   - convert the text to lowercase
   - tokenize the text (a simple whitespace split) and remove stop words
   - record the length of each feature

```python
from nltk.corpus import stopwords

stop = stopwords.words('english')

def text_process(df, feature):
    # Lowercase the raw text
    df.loc[:, feature + '_tokens'] = df.loc[:, feature].apply(str.lower)
    # Simple whitespace tokenization, dropping stop words
    df.loc[:, feature + '_tokens'] = df.loc[:, feature + '_tokens'].apply(
        lambda x: [item for item in x.split() if item not in stop]
    )
    # Token count per observation
    df.loc[:, feature + '_len'] = df.loc[:, feature + '_tokens'].apply(len)
    return df
```

3. loop through the list of variables, saving the result to the data frame

```python
for var in variables:
    df_text = text_process(df_strings, var)

df_text.iloc[:, -8:].head()
```
`description` seems to be significantly shorter than the other three.
Let's plot them to investigate.
```python
len_columns = df_text.columns.str.endswith('len')
df_text.loc[:, len_columns].hist()
plt.tight_layout()
```
Yep: `description` is significantly shorter, but how do the other three compare?
```python
columns = ['plot_len', 'need_to_know_len', 'csm_review_len']
df_text[columns].plot.box()
plt.xticks(rotation='vertical')
```
Hmmm. Lots of outliers for `csm_review` but, in general, the three features are of similar lengths.
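One quick way to count those outliers, rather than eyeball the box plot, is the standard 1.5 × IQR rule; a sketch with toy review lengths:

```python
import pandas as pd

# Toy review lengths with one extreme value
lengths = pd.Series([10, 12, 11, 13, 12, 95])
q1, q3 = lengths.quantile([0.25, 0.75])
iqr = q3 - q1
# Anything beyond 1.5 * IQR from the quartiles is flagged, matching the box-plot whiskers
outliers = lengths[(lengths < q1 - 1.5 * iqr) | (lengths > q3 + 1.5 * iqr)]
print(len(outliers))  # 1
```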
Next Steps
While I could create word clouds to visualize the most frequent words in each feature, or calculate each feature's sentiment, my stated goal is to identify how old someone should be to read a book, not whether a review is good or bad.
To that end, my curiosity about these features is satiated so I'm ready to move on to another chapter.
Summary
This dataset contains three types of data:
- numeric data
- categorical data
- images (book covers)

Two down; one to go!
Going forward, my key points to remember are:
What type of categorical data do I have?
There is a huge difference between ordered data (e.g., "bad", "good", "great") and truly nominal data that has no order/ranking, like different genres; just because I prefer science fiction to fantasy doesn't mean it actually is superior.
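In pandas, that distinction is captured by ordered categoricals; a minimal sketch:

```python
import pandas as pd

# An ordered categorical: "bad" < "good" < "great"
ratings = pd.Series(["bad", "great", "good"]).astype(
    pd.CategoricalDtype(["bad", "good", "great"], ordered=True)
)
# Ordered categories support comparisons; plain nominal categories do not
print((ratings > "bad").tolist())  # [False, True, True]
```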
Are missing values really missing?
Several of the features had missing values which were, in fact, not truly missing; for example, the `award` and `awards` features were mostly blank for a very good reason: the book didn't win one of the four awards recognized by Common Sense Media.
In conclusion, both of the points above can be summarized simply as "be sure to get to know your data."
Happy coding!
Footnotes
1. Adapted from the Engineering Statistics Handbook ↩
2. Be sure to check out this excellent post by Jeff Hale for more examples of how to use this package ↩
3. See this post on Smarter Ways to Encode Categorical Data ↩
4. A big thank you to Chaim Gluck for providing this tip ↩