import pandas as pd
import csv
import matplotlib.pyplot as plt
With a new age of technology comes a new way of communicating. Social media and smartphones seem to dominate the lives of billions across the globe, and with every passing day traditional pastimes seem to fade further into the background. But buried deep within the world of technology is a classic and ageless pastime: reading. No, not captions on your friend's Instagram, not posts on your family's Facebook, not articles from your favorite newsletter; I'm talking about reading books. Good ol' books.
With a vast online community of book lovers, you can see what kinds of books people love to read. And with Amazon being the biggest bookstore in the world, you can see what kinds of books people love to buy.
When it comes to viral videos and the internet sensations they create, there is always something that gets a person to click. But cracking open a book you have to pay for (or downloading that e-book to a Kindle) takes more commitment than a click. So I wanted to know which attributes make a successful, popular, best-selling book.
data = pd.read_csv("amazon best selling books.csv")
The first step in the data science lifecycle is identifying and gathering information. The data comes from a Kaggle spreadsheet, loaded above as a CSV file, that contains the Amazon books that were best sellers over the past decade (specifically 2009 to 2019).
data.head()
Here is a snippet of what the data set looks like. Its columns are Name, Author, User Rating, Reviews, Price, Year, and Genre.
Many people are used to hearing about the New York Times best sellers. Each company builds its rankings on its own metrics. The NYT numbers are based on anticipated sales for the week. Amazon uses a different system: it tracks actual sales by the hour. To find Amazon bestsellers, see the Amazon Sales page on the Author Central feature, which pulls from BookScan. Amazon's numbers are strictly based on which books have the highest sales. Here's a link for some more information on this: https://www.book-editing.com/amazon-nytimes-bestselling-books/
(A few things to note: these are not Kindle e-book prices; they are strictly the prices of the physical copies of these books. Shockingly, the physical copy is cheaper on average for many of these best sellers, and for books across the board in general. I was thoroughly confused by this at first, but here is a brief article that highlights the key reasons why e-books are more expensive.) Link: https://www.makeuseof.com/tag/ebooks-expensive-real-books/
data.plot(x = "Year", y = "Price", kind = "scatter", marker = "*");
plt.title('Best Selling Book Scatter Prices Over Time')
plt.xlabel('Year')
plt.ylabel('Prices')
This scatter plot shows the prices of best-selling books sold on Amazon over time. As you can tell, the data points consolidate over time into a tighter band below the 30 dollar mark. Keep in mind the few books priced around 100 dollars in 2013-2014; they matter for the line charts below. There is also an 82 dollar book from 2009 (the heart of the Great Recession).
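The expensive outliers mentioned above can also be listed directly instead of being read off the plot. This is a minimal sketch, assuming the column names from the data set snippet. The rows are made-up placeholders, not values from the Kaggle file, and the 80 dollar cutoff is an arbitrary choice for illustration.

```python
import pandas as pd

# Made-up placeholder rows that mimic the data set's columns (not real values).
toy = pd.DataFrame({
    "Name": ["Book A", "Book B", "Book C", "Book D"],
    "Price": [12, 105, 9, 82],
    "Year": [2009, 2013, 2014, 2009],
})

# Keep only the rows priced above an arbitrary cutoff, most expensive first.
cutoff = 80
outliers = toy[toy["Price"] > cutoff].sort_values("Price", ascending=False)
print(outliers[["Name", "Year", "Price"]])
```

Swapping `toy` for the real `data` frame should list the handful of books that pull the yearly averages upward.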
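# numeric_only=True skips the text columns (Name, Author, Genre), which newer pandas versions refuse to average
data.groupby('Year').mean(numeric_only=True).plot(y='Price');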
plt.title('Best Selling Book Price Declines')
plt.xlabel('Year')
plt.ylabel('Prices')
This line chart is another visualization of the price of best-selling books sold on Amazon over time.
data[data.Genre == 'Fiction'].groupby('Year').mean(numeric_only=True).plot(y='Price');
plt.title('Fiction Book Price Over Time')
plt.xlabel('Year')
plt.ylabel('Prices')
The chart above shows the decline in pricing for the Fiction genre specifically. It seems that some sharp drops happened periodically throughout the years.
data[data.Genre == 'Non Fiction'].groupby('Year').mean(numeric_only=True).plot(y='Price');
plt.title('Non Fiction Book Prices Over Time')
plt.xlabel('Year')
plt.ylabel('Prices')
The chart above shows pricing for the Non-Fiction genre specifically. Unlike best-selling fiction, non-fiction pricing was in fact on the incline from 2009 to 2014, and then it dropped sharply from 21 dollars per book on average to 11 dollars per book, which is close to a 50% decline. Both genres have seen price declines, and both now rest around the 9-11 dollar mark. The clear distinction is how each genre got to this point.
Fiction had its sharper price drop in the earlier part of the decade, while Non-Fiction lagged behind. Some of that volatility came from a few highly priced books in both categories, as shown in the scatter plot.
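The "close to a 50% decline" figure above is simple arithmetic and can be checked in a couple of lines. This is a small sketch; `percent_change` is a helper name introduced here for illustration, and the two inputs are the approximate averages read off the Non-Fiction chart.

```python
def percent_change(start, end):
    """Return the change from start to end as a percentage of start."""
    return (end - start) / start * 100

# Approximate Non-Fiction average prices before and after the drop.
drop = percent_change(21, 11)
print(f"{drop:.1f}%")  # prints -47.6%
```

So the drop is a little under half, which is what "close to 50%" is rounding up from.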
data[data.Genre == 'Fiction'].groupby('Year')['Price'].mean()
data[data.Genre == 'Non Fiction'].groupby('Year')['Price'].mean()
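The two per-genre averages above can also be lined up side by side, which makes the yearly gap easier to read than two separate outputs. This is one possible sketch using `pivot_table`; the rows are made-up placeholders rather than values from the data set, and the `Gap` column is a name added here for illustration.

```python
import pandas as pd

# Made-up placeholder rows (not real values from the data set).
toy = pd.DataFrame({
    "Genre": ["Fiction", "Fiction", "Non Fiction", "Non Fiction",
              "Fiction", "Non Fiction"],
    "Year": [2009, 2009, 2009, 2009, 2010, 2010],
    "Price": [10, 14, 16, 20, 8, 18],
})

# One row per year, one column per genre, each cell holding the mean price.
price_by_genre = toy.pivot_table(index="Year", columns="Genre",
                                 values="Price", aggfunc="mean")

# Positive values mean Non Fiction was the more expensive genre that year.
price_by_genre["Gap"] = price_by_genre["Non Fiction"] - price_by_genre["Fiction"]
print(price_by_genre)
```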
A quick eyeball of the average prices from 2009 to 2019 shows a clear price differential: Non-Fiction books are consistently more expensive than Fiction books. Research around this topic supports the pattern, and it falls back on basic economics: the publishers bringing these books to market recognize that demand is higher and can therefore charge a higher price (especially in recent years, amid the consistent success of political tell-alls about the White House and such). Here is an article about publishing companies selling more non-fiction than fiction books, which highlights the size of each market.
"Adult non-fiction revenue totalled $6.18 billion across the publishing industry in 2017, while adult fiction revenues reached $4.3 billion, according to Penguin Random House" Source: https://www.forbes.com/sites/adamrowe1/2018/08/30/traditional-publishers-are-selling-way-more-non-fiction-than-fiction/?sh=3864328756d0
Here is a hard count of the books in each genre, further complementing the article linked above.
# Count the non-missing entries in every column, per genre
x = data.groupby("Genre").count()
print(x)
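One caveat about a count like this: it counts rows, so if a title was a best seller in several years it would be counted once per year. Whether that happens in this data set is an assumption, not something checked here. The sketch below uses made-up rows to show how a row count and a distinct-title count can differ.

```python
import pandas as pd

# Made-up placeholder rows: "Book A" and "Book C" appear in more than one year.
toy = pd.DataFrame({
    "Name": ["Book A", "Book A", "Book B", "Book C", "Book C", "Book C"],
    "Genre": ["Fiction", "Fiction", "Fiction",
              "Non Fiction", "Non Fiction", "Non Fiction"],
    "Year": [2009, 2010, 2010, 2009, 2010, 2011],
})

# Rows per genre (one per title per year on the list).
rows_per_genre = toy.groupby("Genre").size()

# Distinct titles per genre, ignoring repeat appearances.
titles_per_genre = toy.groupby("Genre")["Name"].nunique()

print(rows_per_genre)
print(titles_per_genre)
```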
data.groupby('Year').mean(numeric_only=True).plot(y='Reviews');
plt.title('Amazon Reviews of Best Sellers Over Time')
plt.xlabel('Year')
plt.ylabel('# of Reviews')
Over the past decade, from 2009 to 2019, there is a clear upward trend in how many reviews readers leave on the best-selling books they bought. Part of this reflects a shift in the internet itself: there is more activity today than before 2010, when online forums, blogs, and other networks were not as massive.
Also, because Amazon has taken market share from other booksellers over the years, it has consolidated much of the demand for these best-selling books. More buyers of the books means more reviews in turn.
data[data.Genre == 'Non Fiction'].groupby('Year').mean(numeric_only=True).plot(y='Reviews');
plt.title('Reviews of Nonfiction')
plt.xlabel('Year')
plt.ylabel('# of Reviews')
data[data.Genre == 'Fiction'].groupby('Year').mean(numeric_only=True).plot(y='Reviews');
plt.title('Reviews of Fiction')
plt.xlabel('Year')
plt.ylabel('# of Reviews')
Shockingly, the fiction category has far more reviews on average across the board. Aside from the fact that a few fiction books have sold like crazy, the book enthusiasts who lean on their imagination love fiction and are more likely to rave about a book because of the pure entertainment it has to offer. Take 50 Shades of Grey or Harry Potter, both extremely high-selling books that were turned into entire movie series. To more people, these stories are simply more entertaining than the mundane things that happen in the real world. That pulls opinions out of readers, so they are more likely to leave a review.
At its most recent dip in 2018, best-selling fiction still held its own next to best-selling non-fiction, which was at its peak: roughly 12,500 versus 15,000 average reviews, respectively. Currently it is 18,000 versus 14,000, and peak against peak it is 23,500 for fiction versus 15,000 for non-fiction. This shows that there is a higher bar for Amazon book reviews in the fiction category.
data.groupby('Year').mean(numeric_only=True).plot(y='User Rating');
plt.title('Ratings Over Time')
plt.xlabel('Year')
plt.ylabel('User Rating')
Unlike a review, a rating is just a simple 1-5 stars, and explaining why you gave that number of stars is optional. That means every review has a user rating, but not every user rating has a review. With that said, we see a clear uptick in user ratings. On average across both genres, the best-selling books on Amazon scored higher over the past decade. The chart is slightly misleading in that the change looks drastic, but if you look at the y-axis, the average goes from 4.57 at the start of the decade to 4.75 at the end, which is close to a 4% increase (not drastic, but still enough to draw an inference). And because more reviews are being written over this period (close to double or triple on average), you can argue that the larger sample size, by the law of large numbers, leaves fewer bad reviews skewing a book in one direction or the other. Alternatively, I can hypothesize that becoming a best-selling Amazon book is now harder because there are more renowned authors competing, so the quality of a book has to be better to get there.
data[data.Genre == 'Non Fiction'].groupby('Year').mean(numeric_only=True).plot(y='User Rating');
plt.title('(Non Fiction) Ratings Over Time')
plt.xlabel('Year')
plt.ylabel('User Rating')
Non-Fiction ratings reached 4.85 in 2019, up from 4.575, which makes that a roughly 6% increase.
data[data.Genre == 'Fiction'].groupby('Year').mean(numeric_only=True).plot(y='User Rating');
plt.title('(Fiction) Ratings Over Time')
plt.xlabel('Year')
plt.ylabel('User Rating')
Fiction ratings reached 4.90 in 2019, up from 4.58 in 2009, which makes that a roughly 7% increase. That is slightly better than the Non-Fiction ratings improvement.
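Percentages like these can be computed from the grouped averages rather than estimated from the chart axes. This is a minimal sketch under assumptions: the rows are made-up placeholders, so the numbers it prints are not the figures quoted above, and it simply compares the first and last year present in the frame.

```python
import pandas as pd

# Made-up placeholder ratings for two years and two genres (not real values).
toy = pd.DataFrame({
    "Genre": ["Fiction", "Fiction", "Non Fiction", "Non Fiction"] * 2,
    "Year": [2009, 2009, 2009, 2009, 2019, 2019, 2019, 2019],
    "User Rating": [4.4, 4.6, 4.5, 4.5, 4.8, 5.0, 4.6, 4.8],
})

# Mean rating per year (rows) and genre (columns).
yearly = toy.groupby(["Genre", "Year"])["User Rating"].mean().unstack("Genre")

# Percent increase from the first year to the last year, per genre.
first, last = yearly.iloc[0], yearly.iloc[-1]
pct_increase = (last - first) / first * 100
print(pct_increase.round(1))
```

Running the same steps on the real `data` frame would give exact figures to set against the chart-based estimates.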
data.plot(x = "Year", y = "User Rating", kind = "scatter", marker ="*");
plt.title('Ratings')
plt.xlabel('Year')
plt.ylabel('User Rating')
Once again, you can clearly see consolidation occurring over time as more and more ratings condense in the higher region of the graph, highlighting the improvement in the Amazon Best Seller category. There are no outliers, and not a single average rating falls below 4.2.
Many of the attributes of best-selling books seem fairly obvious and clear-cut, like the vast number of user reviews and the extremely high user ratings. In recent years both have been slightly on the uptick, and they continue to trend in that direction, slowly but surely. I assume they will soon level off at an apex and then rise and fall with less predictable fluctuations.
Prices of best-selling books (paperback) on Amazon have been steadily falling over the past decade. A critique of why Amazon book prices have been falling is here (http://states.jsa.org/midwest/2020/02/09/stop-buying-books-from-amazon/). Without getting into too much detail, Amazon's profit margins in other categories of its business are good enough to sustain a small loss in the book department, which comes back to the criteria Amazon uses to rank its books: primarily raw sales volume.
Even with the decent e-book market share that Amazon holds through its flagship Kindle, the traditional physical copy still seems to dominate the market and has not gone out of fashion. https://www.cnbc.com/2019/09/19/physical-books-still-outsell-e-books-and-heres-why.html#:~:text=Publishers%20of%20books%20in%20all,American%20Publishers'%20annual%20report%202019.
Fiction and Non Fiction books share some trends but also show some stark differences; for example, Fiction beats Non Fiction by a fair bit in the number of reviews across the board. Shockingly, though, 310 Non Fiction books made the Amazon Best Sellers list over the past decade versus only 240 Fiction books, which makes that stat even more interesting.
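Collective citations from throughout the project:
https://www.book-editing.com/amazon-nytimes-bestselling-books/
https://www.makeuseof.com/tag/ebooks-expensive-real-books/
https://www.forbes.com/sites/adamrowe1/2018/08/30/traditional-publishers-are-selling-way-more-non-fiction-than-fiction/?sh=3864328756d0
http://states.jsa.org/midwest/2020/02/09/stop-buying-books-from-amazon/
https://www.cnbc.com/2019/09/19/physical-books-still-outsell-e-books-and-heres-why.html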