scrapereads¶

scrapereads.api¶

Simple API to connect to and extract data from Good Reads servers.

class scrapereads.api.GoodReads(verbose=False, sleep=0, user=None)[source]¶

Main API for Good Reads scraping.

It wraps the Author, Book and Quote classes.

static get_author(author_id, encode=None)[source]¶

Get an author in a JSON format.

Parameters:
author_id (string) – name of the author.
encode (string) – encode to ASCII format or not.
Returns:	dict

static get_books(author_id, top_k=10)[source]¶

Get all books in a JSON format from an author.

Parameters:
author_id (string) – name of the author to get.
top_k (int) – number of books to retrieve.
Returns:	list(dict)

static get_quotes(author_id, top_k=10)[source]¶

Get all quotes in a JSON format from an author.

Parameters:
author_id (string) – name of the author to get.
top_k (int) – number of quotes to retrieve.
Returns:	list(dict)

static search_author(author_id)[source]¶

Search for an author on the Good Reads server.

Parameters:	author_id (string) – name of the author to get.
Returns:	Author

static search_book(author_id, book_id)[source]¶

Search for a book on the Good Reads server.

Parameters:
author_id (string) – name of the author who made the book.
book_id (string) – name of the book.
Returns:	Book

static search_books(author_id, top_k=10)[source]¶

Search for books from an author.

Parameters:
author_id (string) – name of the author to get.
top_k (int) – number of books to retrieve.
Returns:	list(Book)

static search_quotes(author_id, top_k=50)[source]¶

Search for quotes on the Good Reads server.

Parameters:
author_id (string) – name of the author who made the quote.
top_k (int) – number of quotes to retrieve.
Returns:	list(Quote)

static set_sleep(sleep)[source]¶

Set the time to wait before connecting to a new page.

Parameters:	sleep (float) – seconds to wait.

static set_user(user)[source]¶

Change the user agent used to connect to the internet.

Parameters:	user (string) – user agent to use with urllib.request.

static set_verbose(verbose)[source]¶

Turn log messages on or off while connecting to pages.

Parameters:	verbose (bool) – if `True` will display a log message each time it is connected to a page.

scrapereads.meta¶

Base classes for Good Reads objects. The base class handles the connection to the Good Reads server.

class scrapereads.meta.AuthorMeta(author_id, author_name=None)[source]¶

Defines an abstract author, from the page info from https://www.goodreads.com/.

author_name: name of the author.
author_id: key id of the author.
base: base page of Good Reads.
href: href page of the author.
url: url page of the author.

to_json()[source]¶

Encode the author to a JSON format.

Returns:	dict

class scrapereads.meta.BookMeta(author_id, book_id, book_name=None, author_name=None, edition=None, year=None)[source]¶

Abstract Book class, used as baseline.

author_name: name of the author.
author_id: key id of the author.
book_name: name of the book.
book_id: key id of the book.
year: year of publication of the book.
edition: edition of the book.
base: base page of Good Reads.
href: href page of the book.
url: url page of the book.

get_author()[source]¶

Get the author linked to the book.

Returns:	Author

register_author(author)[source]¶

Point a book to an Author.

Parameters:	author (Author) – author to link to the book.

to_json(encode='ascii')[source]¶

Encode the book to a JSON format.

Returns:	dict

class scrapereads.meta.GoodReadsMeta[source]¶

Defines the base of all Good Reads objects, that scrape and extract online data.

base: base page of the Good Reads.
href: href of a page.
url: url page of a Good Reads element.

connect(href=None)[source]¶

Connect to a Good Reads page.

Parameters:	href (string, optional) – if provided, connect to the page reference, else connect to the main page.
Returns:	bs4.element.Tag

class scrapereads.meta.QuoteMeta(author_id, quote_id, quote_name=None, text=None, author_name=None, tags=None, likes=None)[source]¶

Defines a quote from the quote page from https://www.goodreads.com/author/quotes/.

quote_id: id of the quote.
book_name: name of the book / title.
quote: text.

get_author()[source]¶

Get the author pointing to the quote.

Returns:	Author

get_book()[source]¶

Get the book pointing to the quote.

Returns:	Book

register_author(author)[source]¶

Point a quote to an Author.

Parameters:	author (Author) – author to link the quote.

register_book(book)[source]¶

Point a quote to a Book.

Parameters:	book (Book) – book to link the quote.

to_json(encode='ascii')[source]¶

Encode the quote to a JSON format.

Returns:	dict

scrapereads.connect¶

A scraper is used to connect to a website and extract data.

scrapereads.connect.connect(url)[source]¶

Connect to a URL.

Parameters:
url (string) – url path.
sleep (float) – number of seconds to sleep before connection.
verbose (bool) – print the url if `True`.
Returns:	soup
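
As a rough illustration of how a delay, a log message and a custom user agent might fit together, here is a minimal sketch built only on the standard library. It is not the scrapereads implementation: `build_request`, `fetch` and the `opener` parameter are hypothetical names introduced for this example, and the sketch returns raw bytes rather than a parsed soup.

```python
import time
import urllib.request


def build_request(url, user_agent=None):
    """Create a Request, attaching a User-Agent header when one is given."""
    headers = {"User-Agent": user_agent} if user_agent else {}
    return urllib.request.Request(url, headers=headers)


def fetch(url, sleep=0.0, verbose=False, user_agent=None,
          opener=urllib.request.urlopen):
    """Wait `sleep` seconds, optionally log the URL, then return the body."""
    if sleep > 0:
        time.sleep(sleep)  # pause before hitting the next page
    if verbose:
        print("Connecting to", url)
    # `opener` is injectable so the function can be exercised offline.
    with opener(build_request(url, user_agent)) as response:
        return response.read()
```

Keeping the opener injectable is a design choice of this sketch only; it lets the function be tried without any network access.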

scrapereads.scrape¶

Scrape quotes, books and authors from Good Reads website.

scrapereads.scrape.get_author_book_author(book_tr)[source]¶

Get the author <a> element from a table <tr> element.

Parameters:	book_tr (bs4.element.Tag) – `<tr>` book element.
Returns:	author name `<a>` element.
Return type:	bs4.element.Tag

Examples::

>>> for book_tr in scrape_author_books(soup):
...     book_author = get_author_book_author(book_tr)
...     print(book_author.text, book_author.get('href'))
    Sylvia Plath https://www.goodreads.com/author/show/4379.Sylvia_Plath
    Sylvia Plath https://www.goodreads.com/author/show/4379.Sylvia_Plath
    Sylvia Plath https://www.goodreads.com/author/show/4379.Sylvia_Plath
    Sylvia Plath https://www.goodreads.com/author/show/4379.Sylvia_Plath
    Sylvia Plath https://www.goodreads.com/author/show/4379.Sylvia_Plath

scrapereads.scrape.get_author_book_date(book_tr)[source]¶

Get the published date from a table <tr> element from an author page.

Parameters:	book_tr (bs4.element.Tag) – `<tr>` book element.
Returns:	date of publication
Return type:	int

Examples::

>>> for book_tr in scrape_author_books(soup):
...     book_date = get_author_book_date(book_tr)
...     print(book_date)
    None
    None
    1958
    2009
    ...

scrapereads.scrape.get_author_book_edition(book_tr)[source]¶

Get the edition <a> element from a table <tr> element from an author page.

Parameters:	book_tr (bs4.element.Tag) – `<tr>` book element.
Returns:	book edition `<a>` element.
Return type:	bs4.element.Tag

Examples::

>>> for book_tr in scrape_author_books(soup):
...     book_edition = get_author_book_edition(book_tr)
...     if book_edition:
...         print(book_edition.text, book_edition.get('href'))
...         print()
    493 editions /work/editions/1385044-the-bell-jar
    80 editions /work/editions/1185316-ariel
    30 editions /work/editions/1003095-the-collected-poems
    45 editions /work/editions/3094683-the-unabridged-journals-of-sylvia-plath
    ...

scrapereads.scrape.get_author_book_ratings(book_tr)[source]¶

Get the ratings <span> element from a table <tr> element from an author page.

Parameters:	book_tr (bs4.element.Tag) – `<tr>` book element.
Returns:	ratings `<span>` element.
Return type:	bs4.element.Tag

Examples::

>>> for book_tr in scrape_author_books(soup):
...     ratings_span = get_author_book_ratings(book_tr)
...     print(ratings_span.contents[-1])
     4.55 avg rating — 2,414 ratings
     3.77 avg rating — 1,689 ratings
     4.28 avg rating — 892 ratings
     4.54 avg rating — 490 ratings
     ...
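
The ratings text printed above can be turned into numbers with a small regular expression. The following is a sketch under the assumption that the text always has the shape shown in the example output; `parse_ratings_text` is a hypothetical helper and not part of scrapereads.

```python
import re

# Matches text such as " 4.55 avg rating — 2,414 ratings".
# The separator between the two parts is matched as any single
# non-space character, so the exact dash used does not matter.
RATINGS_PATTERN = re.compile(
    r"([\d.]+)\s+avg rating\s+\S\s+([\d,]+)\s+ratings?"
)


def parse_ratings_text(text):
    """Return (average, count), or None when the text does not match."""
    match = RATINGS_PATTERN.search(text)
    if match is None:
        return None
    average = float(match.group(1))
    count = int(match.group(2).replace(",", ""))  # drop thousands separators
    return average, count
```

With this sketch, the first line of the example output would yield the pair `(4.55, 2414)`.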

scrapereads.scrape.get_author_book_title(book_tr)[source]¶

Get the book title <a> element from a table <tr> element from an author page.

Parameters:	book_tr (bs4.element.Tag) – `<tr>` book element.
Returns:	book title `<a>` element.
Return type:	bs4.element.Tag

Examples::

>>> for book_tr in scrape_author_books(soup):
...     book_title = get_author_book_title(book_tr)
...     print(book_title.text.strip(), book_title.get('href'))
    The Bell Jar /book/show/6514.The_Bell_Jar
    Ariel /book/show/395090.Ariel
    The Collected Poems /book/show/31426.The_Collected_Poems
    The Unabridged Journals of Sylvia Plath /book/show/11623.The_Unabridged_Journals_of_Sylvia_Plath

scrapereads.scrape.get_author_desc(soup)[source]¶

Get the author description / biography.

Parameters:	soup (bs4.element.Tag) – connection to the author page.
Returns:	long description of the author.
Return type:	str

Examples::

>>> from scrapereads import connect
>>> url = 'https://www.goodreads.com/author/show/1077326'
>>> soup = connect(url)
>>> get_author_desc(soup)
    See also: Robert Galbraith
    Although she writes under the pen name J.K. Rowling, pronounced like rolling,
    her name when her first Harry Potter book was published was simply Joanne Rowling.
    ...

scrapereads.scrape.get_author_info(soup)[source]¶

Get all information from an author (genres, influences, website etc.).

Parameters:	soup (bs4.element.Tag) – author page connection.
Returns:	dict

scrapereads.scrape.get_author_name(soup)[source]¶

Get the author’s name from its main page.

Parameters:	soup (bs4.element.Tag) – connection to the author page.
Returns:	name of the author.
Return type:	string

Examples::

>>> from scrapereads import connect
>>> url = 'https://www.goodreads.com/author/show/1077326'
>>> soup = connect(url)
>>> get_author_name(soup)
    J.K. Rowling

scrapereads.scrape.get_book_quote_page(soup)[source]¶

Find the <a> element pointing to the quote page of a book.

Parameters:	soup (bs4.element.Tag) – connection to the book page.
Returns:	`<a>` element pointing to the quote page.

scrapereads.scrape.get_quote_author_name(quote_div)[source]¶

Get the author’s name from a <div> quote element.

Parameters:	quote_div (bs4.element.Tag) – `<div>` quote element from a quote page.
Returns:	string

scrapereads.scrape.get_quote_book(quote_div)[source]¶

Get the reference (book) from a <div> quote element.

Parameters:	quote_div (bs4.element.Tag) – `<div>` quote element from a quote page.
Returns:	bs4.element.Tag

scrapereads.scrape.get_quote_likes(quote_div)[source]¶

Get the likes <a> tag from a <div> quote element.

Parameters:	quote_div (bs4.element.Tag) – `<div>` quote element from a quote page.
Returns:	`<a>` tag for likes.
Return type:	bs4.element.Tag

scrapereads.scrape.get_quote_name_id(quote_div)[source]¶

Get the name and id of a <div> quote element.

Parameters:	quote_div (bs4.element.Tag) – `<div>` quote element from a quote page.
Returns:	id and name.
Return type:	tuple

scrapereads.scrape.get_quote_text(quote_div)[source]¶

Get the text from a <div> quote element.

Parameters:	quote_div (bs4.element.Tag) – `<div>` quote element to extract the text.
Returns:	string

scrapereads.scrape.scrape_author_books(soup)[source]¶

Retrieve books from an author’s page.

Parameters:	soup (bs4.element.Tag) – connection to an author books page.
Returns:	`<tr>` element.
Return type:	yield bs4.element.Tag

scrapereads.scrape.scrape_quote_tags(quote_div)[source]¶

Scrape tags from a <div> quote element.

Parameters:	quote_div (bs4.element.Tag) – `<div>` quote element from a quote page.
Returns:	yield `<a>` tags

scrapereads.scrape.scrape_quotes(soup)[source]¶

Retrieve all <div> quote elements from a quote page.

Parameters:	soup (bs4.element.Tag) – connection to the quote page.
Returns:	yield bs4.element.Tag

scrapereads.scrape.scrape_quotes_container(soup)[source]¶

Get the quote container from a quote page.

Parameters:	soup (bs4.element.Tag) – connection to the quote page.
Returns:	bs4.element.Tag

scrapereads.utils¶

Utility functions to process names and data.

scrapereads.utils.clean_num(quote)[source]¶

Remove Roman numerals from a quote.

Parameters:	quote (string) – quote.
Returns:	string

scrapereads.utils.name_to_goodreads(name)[source]¶

Process names and convert them to the scrapereads format.

Parameters:	name (string) – name of an author.
Returns:	string

scrapereads.utils.num2roman(num)[source]¶

Convert a number to Roman numerals.

Parameters:	num (int) – number to convert.
Returns:	string
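
For readers unfamiliar with the conversion, a common way to turn an integer into Roman numerals is the greedy subtraction method sketched below. This is an illustration only and may differ from what `num2roman` actually does; `int_to_roman` and its 1 to 3999 range check are assumptions of this example.

```python
def int_to_roman(num):
    """Convert an integer between 1 and 3999 to a Roman numeral string."""
    if not 1 <= num <= 3999:
        raise ValueError("expected an integer between 1 and 3999")
    # Values in descending order, including the subtractive pairs.
    symbols = [
        (1000, "M"), (900, "CM"), (500, "D"), (400, "CD"),
        (100, "C"), (90, "XC"), (50, "L"), (40, "XL"),
        (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I"),
    ]
    parts = []
    for value, symbol in symbols:
        count, num = divmod(num, value)  # how many times the symbol fits
        parts.append(symbol * count)
    return "".join(parts)
```

For instance, `int_to_roman(1994)` returns `'MCMXCIV'`.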

scrapereads.utils.parse_author_href(href)[source]¶

Split an href and retrieve the author’s name and its key.

Parameters:	href (string) – `Good Reads` href pointing to an author page.
Returns:	author’s name and key.
Return type:	tuple
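
The author hrefs shown in the scrape examples end with a segment such as `4379.Sylvia_Plath`. A minimal sketch of splitting such an href follows. The helper name `split_author_href` and the `(name, key)` ordering of the result are assumptions made for illustration, not the documented behaviour of `parse_author_href`.

```python
def split_author_href(href):
    """Split '.../author/show/<key>.<Name>' into a (name, key) pair."""
    # Keep only the last path segment, e.g. '4379.Sylvia_Plath'.
    last_segment = href.rstrip("/").rsplit("/", 1)[-1]
    # The key comes before the first dot, the name after it.
    key, _, name = last_segment.partition(".")
    return name, key
```

When the href carries no name part, as in `https://www.goodreads.com/author/show/1077326`, this sketch returns an empty string for the name.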

scrapereads.utils.process_quote_text(quote_text)[source]¶

Clean up the text from a <div> quote element.

Parameters:	quote_text (string) – quote text to clean.
Returns:	string

scrapereads.utils.remove_punctuation(string_punct)[source]¶

Remove punctuation from a string.

Parameters:	string_punct (string) – string with punctuation.
Returns:	string
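
One standard-library way to strip punctuation is a translation table, sketched below. The name `strip_punctuation` is hypothetical, and this version only removes the ASCII punctuation listed in `string.punctuation`, which may not match what `remove_punctuation` does.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCTUATION_TABLE = str.maketrans("", "", string.punctuation)


def strip_punctuation(text):
    """Return `text` without ASCII punctuation characters."""
    return text.translate(_PUNCTUATION_TABLE)
```

For example, `strip_punctuation("J.K. Rowling")` returns `'JK Rowling'`.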

scrapereads.utils.serialize_dict(dict_raw)[source]¶

Serialize a dictionary in ASCII format so it can be saved as a JSON.

Parameters:	dict_raw (dict) – dictionary to serialize.
Returns:	dict

scrapereads.utils.serialize_list(list_raw)[source]¶

Serialize a list in ASCII format, so it can be saved as a JSON.

Parameters:	list_raw (list) – list to serialize.
Returns:	list
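
Serializing nested dictionaries and lists to ASCII usually comes down to a recursive walk over the structure. The sketch below shows one way to do this; `ascii_text` and `serialize` are hypothetical names, and the accent-folding strategy is an assumption rather than the behaviour of `serialize_dict` or `serialize_list`.

```python
import json
import unicodedata


def ascii_text(text):
    """Fold accented characters to ASCII and drop anything left over."""
    folded = unicodedata.normalize("NFKD", text)
    return folded.encode("ascii", "ignore").decode("ascii")


def serialize(value):
    """Recursively convert every string inside dicts and lists to ASCII."""
    if isinstance(value, str):
        return ascii_text(value)
    if isinstance(value, dict):
        return {serialize(key): serialize(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [serialize(item) for item in value]
    return value  # numbers, booleans and None pass through unchanged


# The result contains only JSON-friendly, ASCII-only values.
print(json.dumps(serialize({"name": "Charlotte Brontë", "tags": ["café"]})))
```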

scrapereads.utils.to_ascii(text)[source]¶

Convert a text to ASCII format.

Parameters:	text (string) – text to process.
Returns:	string
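
A common technique for converting text to ASCII is Unicode decomposition followed by dropping the non-ASCII remainder. The sketch below illustrates it with the standard library; `ascii_fold` is a hypothetical name and the approach is an assumption, not necessarily what `to_ascii` does.

```python
import unicodedata


def ascii_fold(text):
    """Approximate `text` in ASCII by stripping accents and other marks."""
    # NFKD splits 'ë' into 'e' plus a combining diaeresis.
    decomposed = unicodedata.normalize("NFKD", text)
    # Encoding with 'ignore' then drops every non-ASCII code point.
    return decomposed.encode("ascii", "ignore").decode("ascii")
```

Note that characters with no ASCII decomposition, such as a typographic apostrophe, are removed entirely by this sketch rather than replaced.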

scrapereads.reads¶

scrapereads.reads.author¶

Defines an Author from Good Reads. Connect to https://www.goodreads.com/ to extract quotes and books from famous authors.

class scrapereads.reads.author.Author(author_id, author_name=None)[source]¶

Defines an author, from the page info from https://www.goodreads.com/.

name: name of the author.
key: key id of the author.
url: url page of the author.

add_book(book)[source]¶

Add a book to an Author.

Parameters:	book (Book) – book or book’s name to add.

add_quote(quote)[source]¶

Add a quote to an Author.

Parameters:	quote (Quote or string) – quote or text to add.

books(cache=True)[source]¶

Get all books from an author address. This function extracts online data from Good Reads if nothing is already saved in the cache.

Parameters:	cache (bool) – if `True`, will look for cache items only (and won’t scrape online).
Returns:	yield Book

classmethod from_url(url)[source]¶

Construct the class from a URL.

Parameters:	url (string) – url.
Returns:	Author

get_books(top_k=None, cache=True)[source]¶

Get all books from an author address.

Parameters:
top_k (int) – number of books to return.
cache (bool) – if `True`, will look for cache items only (and won’t scrape online).
Returns:	list(Book)
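
Limiting a stream of results to the first `top_k` items can be done with `itertools.islice`. The helper below, `take_top_k`, is a hypothetical illustration of that idea and is not taken from the scrapereads source.

```python
from itertools import islice


def take_top_k(items, top_k=None):
    """Return the first `top_k` items of an iterable, or all if `top_k` is None."""
    # islice stops early, so a lazy generator is not consumed past `top_k`.
    return list(islice(items, top_k))
```

Because `islice` accepts `None` as its stop value, the same call covers both the limited and the unlimited case.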

get_info()[source]¶

Get author information (genres, influences, description etc.)

Returns:	dict

get_quotes(lang=None, top_k=None, cache=True)[source]¶

Get all quotes from an author address.

Parameters:
lang (string) – language to pick up quotes.
top_k (int) – number of quotes to retrieve (ordered by popularity).
cache (bool) – if `True`, will look for cache items only (and won’t scrape online).
Returns:	list(Quote)

get_similar_authors(top_k=None)[source]¶

Get authors similar to this author.

Parameters:	top_k (int) – number of authors to retrieve (ordered by popularity).
Returns:	list(Author)

quotes(cache=True)[source]¶

Yield all quotes from an author address. This function extracts online data from Good Reads if nothing is already saved in the cache.

Parameters:	cache (bool) – if `True`, will look for cache items only (and won’t scrape online).
Returns:	yield Quote
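
One possible reading of the `cache` flag is sketched below: cached items are served when they exist, and the online source is only consulted otherwise. `CachedCollection` and its `fetch_items` callable are hypothetical names, and the exact caching rules of `Author.quotes` may differ from this interpretation.

```python
class CachedCollection:
    """Yield items from a local cache, fetching them only when needed."""

    def __init__(self, fetch_items):
        self._fetch_items = fetch_items  # callable returning an iterable
        self._cache = []

    def items(self, cache=True):
        # Serve from the cache when allowed and when it holds something.
        if cache and self._cache:
            yield from self._cache
            return
        # Otherwise rebuild the cache while yielding fresh items.
        self._cache = []
        for item in self._fetch_items():
            self._cache.append(item)
            yield item
```

In this sketch a second call to `items()` does not trigger another fetch, while `items(cache=False)` forces one.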

search_book(book_id, attr='book_id', cache=True)[source]¶

Search a book from the books saved in the author’s cache.

Parameters:
book_id (string) – book id (or name) to look for.
attr (string, optional) – attribute to search the book from. Options are `'book_id'` and `'book_name'`.
cache (bool) – if `True`, will look for cache items only (and won’t scrape online).
Returns:	Book

search_quote(quote_id, attr='quote_id', cache=True)[source]¶

Search a quote from the quotes saved in the author’s cache.

Parameters:
quote_id (string) – quote id to look for.
attr (string, optional) – attribute to search the quote from. Options are `'quote_id'` and `'quote_name'`.
cache (bool) – if `True`, will look for cache items only (and won’t scrape online).
Returns:	Quote

to_json(encode=None)[source]¶

Encode the author to a JSON format.

Parameters:	encode (string) – encode to ASCII format or not.
Returns:	dict

scrapereads.reads.book¶

Defines a book from an Author.

class scrapereads.reads.book.Book(author_id, book_id, book_name=None, author_name=None, edition=None, year=None, ratings=None)[source]¶

add_quote(quote)[source]¶

Add a quote to the Book, that will be saved in the cache.

Parameters:	quote (Quote) – quote to add.

get_quotes(lang=None, top_k=None, cache=True)[source]¶

Get all quotes from a book address.

Parameters:
lang (string) – language to pick up quotes.
top_k (int) – number of quotes to retrieve (ordered by popularity).
cache (bool) – if `True`, will look for cache items only (and won’t scrape online).
Returns:	list(Quote)

quotes(cache=True)[source]¶

Yield all quotes from a book address. This function extracts online data from Good Reads if nothing is already saved in the cache.

Parameters:	cache (bool) – if `True`, will look for cache items only (and won’t scrape online).
Returns:	yield Quote

to_json(encode='ascii')[source]¶

Encode the book to a JSON format.

Returns:	dict

scrapereads.reads.quote¶

Defines a quote from an Author.

class scrapereads.reads.quote.Quote(author_id, quote_id, text='', quote_name=None, author_name=None, tags=None, likes=None)[source]¶

Defines a quote from the quote page from https://www.goodreads.com/author/quotes/.

to_json(encode='ascii')[source]¶

Encode the quote to a JSON format.

Returns:	dict