OUR WORK
Researching Data Science Methodologies in Parsing Book Summaries
THE DATA
Our MVP dataset comes from the “Best Books Ever Dataset,” which Lorena Casanova Lozano and Sergio Costa Planells compiled from Goodreads data as part of their Data Science master's program. We then cleaned, filtered, and formatted the data. Our dataset contains 29,652 books by more than 14,000 authors across a wide range of genres.
ARCHITECTURE
We cleaned our data in Google Colab and loaded it into an S3 bucket on Amazon Web Services. From there, we created EC2 instances with high-powered computing resources and launched Jupyter Lab to run and evaluate our models. We then moved the results back into the S3 bucket and linked them to Google Sheets, where our application runs.
THE MODEL
After analyzing our data, we chose an ensemble approach. For some life events, topic modeling captured the concept well, while others were more nuanced and needed a different approach. We paired a synsets and lemmas model with a fine-tuned BERT model, and ran both on the description field of each book to categorize the books for search in our application.
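As a rough illustration of how a lexical model and a classifier might be combined on a book description, here is a minimal sketch. It is not our production code: the term lists, the `lexical_score` and `ensemble_label` functions, the weight, and the threshold are all hypothetical, and the fine-tuned BERT model is represented only by a callable that returns a probability. In a WordNet-based setup the term sets would be built from synset lemmas rather than hard-coded.

```python
import re
from typing import Callable, Dict, Set

# Hypothetical term sets per life event. In a synsets-and-lemmas model these
# would be expanded from WordNet; they are hard-coded here for illustration.
LIFE_EVENT_TERMS: Dict[str, Set[str]] = {
    "grief": {"grief", "loss", "mourning", "bereavement", "death"},
    "divorce": {"divorce", "separation", "custody"},
}


def tokenize(text: str) -> Set[str]:
    """Lowercase the text and return its set of alphabetic tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))


def lexical_score(description: str, event: str) -> float:
    """Fraction of the event's terms that appear in the description."""
    terms = LIFE_EVENT_TERMS[event]
    return len(terms & tokenize(description)) / len(terms)


def ensemble_label(
    description: str,
    event: str,
    classifier: Callable[[str, str], float],
    weight: float = 0.5,     # assumed weighting between the two models
    threshold: float = 0.5,  # assumed decision cutoff
) -> bool:
    """Combine the lexical score with a classifier probability.

    `classifier` stands in for a fine-tuned model and must return a
    probability in [0, 1] that the description matches the event.
    """
    combined = (
        weight * lexical_score(description, event)
        + (1 - weight) * classifier(description, event)
    )
    return combined >= threshold
```

A weighted average is only one way to combine the two signals; voting or letting each model handle the life events it performs best on are equally plausible designs.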