scrapereads
scrapereads.api
Simple API to connect and extract data from Good Reads servers.
-
class scrapereads.api.GoodReads(verbose=False, sleep=0, user=None)
Main API for Good Reads scraping. It wraps the Author, Book and Quote classes.
Get an author in a JSON format.
Parameters: - author_id (string) – name of the author.
- encode (string) – encode to ASCII format or not.
Returns: dict
-
static get_books(author_id, top_k=10)
Get all books in a JSON format from an author.
Parameters: - author_id (string) – name of the author to get.
- top_k (int) – number of books to retrieve.
Returns: list(dict)
-
static get_quotes(author_id, top_k=10)
Get all quotes in a JSON format from an author.
Parameters: - author_id (string) – name of the author to get.
- top_k (int) – number of quotes to retrieve.
Returns: list(dict)
Search for an author on the Good Reads server.
Parameters: author_id (string) – name of the author to get.
Returns: Author
-
static search_book(author_id, book_id)
Search for a book on the Good Reads server.
Parameters: - author_id (string) – name of the author who wrote the book.
- book_id (string) – name of the book.
Returns: Book
-
static search_books(author_id, top_k=10)
Search books from an author.
Parameters: - author_id (string) – name of the author to get.
- top_k (int) – number of books to retrieve.
Returns: list(Book)
-
static search_quotes(author_id, top_k=50)
Search quotes on the Good Reads server.
Parameters: - author_id (string) – name of the author of the quote.
- top_k (int) – number of quotes to retrieve.
Returns: Quote
-
static set_sleep(sleep)
Set the time to wait before connecting to a new page.
Parameters: sleep (float) – seconds to wait.
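The sleep setting describes a pause between page requests. As a rough illustration of that idea (not the library's own implementation), the sketch below spaces out successive calls by a fixed delay. The class name Throttle and its clock and sleeper arguments are hypothetical and exist only so the behaviour can be shown with the standard library.

```python
import time


class Throttle:
    """Hypothetical helper: wait at least `delay` seconds between calls."""

    def __init__(self, delay=0.0, clock=time.monotonic, sleeper=time.sleep):
        self.delay = delay
        self._clock = clock        # returns the current time in seconds
        self._sleeper = sleeper    # blocks for the given number of seconds
        self._last = None          # time of the previous call, if any

    def wait(self):
        """Sleep just long enough to respect the delay, then record the call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.delay - (now - self._last)
            if remaining > 0:
                self._sleeper(remaining)
                now = self._clock()
        self._last = now
```

Calling wait() before each page request would keep requests at least delay seconds apart; the first call never sleeps.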
scrapereads.meta
Baseline class for Good Reads objects. This class handles connection to Good Reads server.
-
class scrapereads.meta.AuthorMeta(author_id, author_name=None)
Defines an abstract author, from the page info from https://www.goodreads.com/.
author_name: name of the author.
author_id: key id of the author.
base: base page of Good Reads.
href: href page of the author.
url: url page of the author.
-
class scrapereads.meta.BookMeta(author_id, book_id, book_name=None, author_name=None, edition=None, year=None)
Abstract Book class, used as baseline.
author_name: name of the author.
author_id: key id of the author.
book_name: name of the book.
book_id: key id of the book.
year: year of publication of the book.
edition: edition of the book.
base: base page of Good Reads.
href: href page of the book.
url: url page of the book.
Get the author pointing to the quote.
Returns: Author
Point a quote to an Author.
Parameters: author (Author) – author to link the quote.
-
class scrapereads.meta.GoodReadsMeta
Defines the base of all Good Reads objects, that scrape and extract online data.
base: base page of Good Reads.
href: href of a page.
url: url page of a Good Reads element.
-
class scrapereads.meta.QuoteMeta(author_id, quote_id, quote_name=None, text=None, author_name=None, tags=None, likes=None)
Defines a quote from the quote page from https://www.goodreads.com/author/quotes/.
quote_id: id of the quote.
book_name: name of the book / title.
quote: text.
Get the author pointing to the quote.
Returns: Author
Point a quote to an Author.
Parameters: author (Author) – author to link the quote.
scrapereads.connect
A scraper is used to connect to a website and extract data.
scrapereads.scrape
Scrape quotes, books and authors from the Good Reads website.
Get the author <a> element from a table <tr> element.
Parameters: book_tr (bs4.element.Tag) – <tr> book element.
Returns: author name <a> element.
Return type: bs4.element.Tag
Examples:
>>> for book_tr in scrape_author_books(soup):
...     book_author = get_author_book_author(book_tr)
...     print(book_author.text, book_author.get('href'))
Sylvia Plath https://www.goodreads.com/author/show/4379.Sylvia_Plath
Sylvia Plath https://www.goodreads.com/author/show/4379.Sylvia_Plath
Sylvia Plath https://www.goodreads.com/author/show/4379.Sylvia_Plath
Sylvia Plath https://www.goodreads.com/author/show/4379.Sylvia_Plath
Sylvia Plath https://www.goodreads.com/author/show/4379.Sylvia_Plath
Get the published date from a table <tr> element from an author page.
Parameters: book_tr (bs4.element.Tag) – <tr> book element.
Returns: date of publication
Return type: int
Examples:
>>> for book_tr in scrape_author_books(soup):
...     book_date = get_author_book_date(book_tr)
...     print(book_date)
None
None
1958
2009
...
Get the edition <a> element from a table <tr> element from an author page.
Parameters: book_tr (bs4.element.Tag) – <tr> book element.
Returns: book edition <a> element.
Return type: bs4.element.Tag
Examples:
>>> for book_tr in scrape_author_books(soup):
...     book_edition = get_author_book_edition(book_tr)
...     if book_edition:
...         print(book_edition.text, book_edition.get('href'))
...         print()
493 editions /work/editions/1385044-the-bell-jar
80 editions /work/editions/1185316-ariel
30 editions /work/editions/1003095-the-collected-poems
45 editions /work/editions/3094683-the-unabridged-journals-of-sylvia-plath
...
Get the ratings <span> element from a table <tr> element from an author page.
Parameters: book_tr (bs4.element.Tag) – <tr> book element.
Returns: ratings <span> element.
Return type: bs4.element.Tag
Examples:
>>> for book_tr in scrape_author_books(soup):
...     ratings_span = get_author_book_ratings(book_tr)
...     print(ratings_span.contents[-1])
4.55 avg rating — 2,414 ratings
3.77 avg rating — 1,689 ratings
4.28 avg rating — 892 ratings
4.54 avg rating — 490 ratings
...
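The ratings text shown in the example output is a plain string, so a caller will often want numbers instead. The following is a minimal sketch, assuming the text always follows the pattern seen above (average, the words "avg rating", a separator character, a count, the word "ratings"). The function name parse_ratings is hypothetical and is not part of scrapereads.

```python
import re

# Matches text such as '4.55 avg rating \u2014 2,414 ratings'.
# The separator between the two parts is matched as any single non-space character.
_RATINGS_RE = re.compile(
    r"(\d+(?:\.\d+)?)\s+avg rating\s+\S\s+([\d,]+)\s+ratings?"
)


def parse_ratings(text):
    """Return (average, count) from a ratings string, or None if it does not match."""
    match = _RATINGS_RE.search(text)
    if match is None:
        return None
    average = float(match.group(1))
    count = int(match.group(2).replace(",", ""))  # drop thousands separators
    return average, count
```

Returning None on a mismatch keeps the caller in control when the page layout changes or a book has no ratings yet.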
Get the book title <a> element from a table <tr> element from an author page.
Parameters: book_tr (bs4.element.Tag) – <tr> book element.
Returns: book title <a> element.
Return type: bs4.element.Tag
Examples:
>>> for book_tr in scrape_author_books(soup):
...     book_title = get_author_book_title(book_tr)
...     print(book_title.text.strip(), book_title.get('href'))
The Bell Jar /book/show/6514.The_Bell_Jar
Ariel /book/show/395090.Ariel
The Collected Poems /book/show/31426.The_Collected_Poems
The Unabridged Journals of Sylvia Plath /book/show/11623.The_Unabridged_Journals_of_Sylvia_Plath
Get the author description / biography.
Parameters: soup (bs4.element.Tag) – connection to the author page.
Returns: long description of the author.
Return type: str
Examples:
>>> from scrapereads import connect
>>> url = 'https://www.goodreads.com/author/show/1077326'
>>> soup = connect(url)
>>> get_author_desc(soup)
See also: Robert Galbraith Although she writes under the pen name J.K. Rowling, pronounced like rolling, her name when her first Harry Potter book was published was simply Joanne Rowling. ...
Get all information from an author (genres, influences, website etc.).
Parameters: soup (bs4.element.Tag) – author page connection.
Returns: dict
Get the author’s name from its main page.
Parameters: soup (bs4.element.Tag) – connection to the author page.
Returns: name of the author.
Return type: string
Examples:
>>> from scrapereads import connect
>>> url = 'https://www.goodreads.com/author/show/1077326'
>>> soup = connect(url)
>>> get_author_name(soup)
J.K. Rowling
-
scrapereads.scrape.get_book_quote_page(soup)
Find the <a> element pointing to the quote page of a book.
Parameters: soup (bs4.element.Tag) –
Returns:
Get the author’s name from a <div> quote element.
Parameters: quote_div (bs4.element.Tag) – <div> quote element from a quote page.
Returns: string
-
scrapereads.scrape.get_quote_book(quote_div)
Get the reference (book) from a <div> quote element.
Parameters: quote_div (bs4.element.Tag) – <div> quote element from a quote page.
Returns: bs4.element.Tag
-
scrapereads.scrape.get_quote_likes(quote_div)
Get the likes <a> tag from a <div> quote element.
Parameters: quote_div (bs4.element.Tag) – <div> quote element from a quote page.
Returns: <a> tag for likes.
Return type: bs4.element.Tag
-
scrapereads.scrape.get_quote_name_id(quote_div)
Get the name and id of a <div> quote element.
Parameters: quote_div (bs4.element.Tag) – <div> quote element from a quote page.
Returns: id and name.
Return type: tuple
-
scrapereads.scrape.get_quote_text(quote_div)
Get the text from a <div> quote element.
Parameters: quote_div (bs4.element.Tag) – <div> quote element to extract the text.
Returns: string
Retrieve books from an author’s page.
Parameters: soup (bs4.element.Tag) – connection to an author books page.
Returns: <tr> element.
Return type: yield bs4.element.Tag
Scrape tags from a <div> quote element.
Parameters: quote_div (bs4.element.Tag) – <div> quote element from a quote page.
Returns: yield <a> tags
scrapereads.utils
Utility functions to process names and data.
-
scrapereads.utils.clean_num(quote)
Remove Roman numerals from a quote.
Parameters: quote (string) – quote.
Returns: string
-
scrapereads.utils.name_to_goodreads(name)
Process names and convert them to the scrapereads format.
Parameters: name (string) – name of an author.
Returns: string
-
scrapereads.utils.num2roman(num)
Convert a number to Roman numeral format.
Parameters: num (int) – number to convert.
Returns: string
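For readers unfamiliar with the conversion, here is a minimal sketch of the standard greedy integer-to-Roman-numeral algorithm. It is not taken from the scrapereads source; the name int_to_roman and the 1 to 3999 range limit are assumptions made for this illustration.

```python
def int_to_roman(num):
    """Convert an integer between 1 and 3999 to a Roman numeral string."""
    if not 1 <= num <= 3999:
        raise ValueError("num must be between 1 and 3999")
    # Values in descending order, including the subtractive pairs (CM, XC, IV, ...).
    symbols = [
        (1000, "M"), (900, "CM"), (500, "D"), (400, "CD"),
        (100, "C"), (90, "XC"), (50, "L"), (40, "XL"),
        (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I"),
    ]
    parts = []
    for value, symbol in symbols:
        count, num = divmod(num, value)  # how many times this symbol fits
        parts.append(symbol * count)
    return "".join(parts)
```

For example, int_to_roman(1958) gives "MCMLVIII" and int_to_roman(4) gives "IV".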
Split an href and retrieve the author’s name and its key.
Parameters: href (string) – Good Reads href pointing to an author page.
Returns: author’s name and key.
Return type: tuple
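The author hrefs shown elsewhere in this reference look like https://www.goodreads.com/author/show/4379.Sylvia_Plath, with the key before the first dot and the name after it. The sketch below splits an href of that shape using only the standard library. The function name split_author_href and the (name, key) ordering are assumptions for this example, not the library's confirmed behaviour.

```python
from urllib.parse import urlparse


def split_author_href(href):
    """Return (name, key) from an href like '/author/show/4379.Sylvia_Plath'."""
    path = urlparse(href).path                   # handles full URLs and relative hrefs
    last = path.rstrip("/").rsplit("/", 1)[-1]   # e.g. '4379.Sylvia_Plath'
    key, _, name = last.partition(".")           # split on the first dot only
    return name, key
```

When the href carries only a key, as in https://www.goodreads.com/author/show/1077326, the returned name is an empty string.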
-
scrapereads.utils.process_quote_text(quote_text)
Clean up the text from a <div> quote element.
Parameters: quote_text (string) – quote text to clean.
Returns: string
-
scrapereads.utils.remove_punctuation(string_punct)
Remove punctuation from a string.
Parameters: string_punct (string) – string with punctuation.
Returns: string
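One common way to remove punctuation in Python is a translation table built from string.punctuation. The sketch below shows that approach; it is an illustration under the assumption that only ASCII punctuation matters, and strip_punctuation is a hypothetical name rather than the library function.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def strip_punctuation(text):
    """Return text with ASCII punctuation characters removed."""
    return text.translate(_PUNCT_TABLE)
```

Typographic characters such as curly quotes are not in string.punctuation, so they would pass through unchanged with this approach.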
-
scrapereads.utils.serialize_dict(dict_raw)
Serialize a dictionary in ASCII format so it can be saved as a JSON.
Parameters: dict_raw (dict) –
Returns: dict
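The exact meaning of "ASCII format" is not spelled out here. One plausible reading is that accented characters are reduced to their closest ASCII form before saving. The sketch below illustrates that reading with unicodedata; the function name to_ascii is hypothetical, and the real serialize_dict may behave differently.

```python
import unicodedata


def to_ascii(value):
    """Recursively convert strings in nested dicts and lists to plain ASCII."""
    if isinstance(value, str):
        # Decompose accented characters, then drop anything outside ASCII.
        decomposed = unicodedata.normalize("NFKD", value)
        return decomposed.encode("ascii", "ignore").decode("ascii")
    if isinstance(value, dict):
        return {to_ascii(key): to_ascii(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [to_ascii(item) for item in value]
    return value  # numbers, booleans and None are left untouched
```

Characters with no ASCII decomposition are silently dropped by this sketch, which may or may not be acceptable depending on the data.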
scrapereads.reads
scrapereads.reads.author
Defines an Author from Good Reads.
Connect to https://www.goodreads.com/ to extract quotes and books from famous authors.
Defines an author, from the page info from https://www.goodreads.com/.
name: name of the author.
key: key id of the author.
url: url page of the author.
Add a book to an Author.
Parameters: book (Book) – book or book’s name to add.
Add a quote to an Author.
Parameters: quote (Quote or string) – quote or text to add.
Get all books from an author address. This function extracts online data from Good Reads if nothing is already saved in the cache.
Parameters: cache (bool) – if True, will look for cache items only (and won’t scrape online).
Returns: yield Quote
Construct the class from a url.
Parameters: url (string) – url.
Returns: Author
Get all books from an author address.
Parameters: - top_k (int) – number of books to return.
- cache (bool) – if True, will look for cache items only (and won’t scrape online).
Returns: list(Book)
Get author information (genres, influences, description etc.)
Returns: dict
Get all quotes from an author address.
Parameters: - lang (string) – language to pick up quotes.
- top_k (int) – number of quotes to retrieve (ordered by popularity).
- cache (bool) – if True, will look for cache items only (and won’t scrape online).
Returns: list(Quote)
Get authors similar to this author.
Parameters: top_k (int) – number of authors to retrieve (ordered by popularity).
Returns: list(Author)
Yield all quotes from an author address. This function extracts online data from Good Reads if nothing is already saved in the cache.
Parameters: cache (bool) – if True, will look for cache items only (and won’t scrape online).
Returns: yield Quote
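The cache and top_k parameters described above suggest a cache-first pattern: reuse stored items when they exist, fetch only when the cache is empty or bypassed, and truncate the result to top_k. The sketch below illustrates that pattern in isolation. CachedSource and its fetch callable are hypothetical stand-ins, and the reading of cache=False as "fetch again" is an assumption.

```python
from itertools import islice


class CachedSource:
    """Hypothetical cache-first source of items (for example quotes)."""

    def __init__(self, fetch):
        self._fetch = fetch   # callable returning an iterable of items
        self._cache = []

    def items(self, cache=True):
        """Yield items, fetching them first if the cache is empty or bypassed."""
        if not cache or not self._cache:
            self._cache = list(self._fetch())
        yield from self._cache

    def top(self, top_k=None, cache=True):
        """Return at most top_k items as a list (all items when top_k is None)."""
        return list(islice(self.items(cache=cache), top_k))
```

Separating the generator from the list-returning method mirrors the split between the yielding and list-returning methods documented here, without claiming to reproduce them.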
Search a book from the books saved in the author’s cache.
Parameters: - book_id (string) – book id (or name) to look for.
- attr (string, optional) – attribute to search the book from. Options are 'book_id' and 'book_name'.
- cache (bool) – if True, will look for cache items only (and won’t scrape online).
Returns: Book
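The attr parameter selects which attribute is compared against the search value. A minimal sketch of that kind of lookup over cached objects is given below; BookRecord and search_cached are hypothetical names used only for this illustration and do not come from scrapereads.

```python
from dataclasses import dataclass


@dataclass
class BookRecord:
    """Hypothetical stand-in for a cached book."""
    book_id: str
    book_name: str


def search_cached(items, value, attr="book_id"):
    """Return the first item whose attribute `attr` equals `value`, else None."""
    for item in items:
        if getattr(item, attr, None) == value:
            return item
    return None


# Usage with titles that appear in the examples of this reference.
books = [BookRecord("6514", "The Bell Jar"), BookRecord("395090", "Ariel")]
match = search_cached(books, "Ariel", attr="book_name")
```

Using getattr with a default keeps the lookup from raising when an item lacks the requested attribute.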
Search a quote from the quotes saved in the author’s cache.
Parameters: - quote_id (string) – quote id to look for.
- attr (string, optional) – attribute to search the quote from. Options are 'quote_id' and 'quote_name'.
- cache (bool) – if True, will look for cache items only (and won’t scrape online).
Returns: Book
Encode the author to a JSON format.
Parameters: encode (string) – encode to ASCII format or not.
Returns: dict
scrapereads.reads.book
Defines a book from an Author.
-
class scrapereads.reads.book.Book(author_id, book_id, book_name=None, author_name=None, edition=None, year=None, ratings=None)
-
add_quote(quote)
Add a quote to the Book; it will be saved in the cache.
Parameters: quote (Quote) – quote to add.
-
get_quotes(lang=None, top_k=None, cache=True)
Get all quotes from a book address.
Parameters: - lang (string) – language to pick up quotes.
- top_k (int) – number of quotes to retrieve (ordered by popularity).
- cache (bool) – if True, will look for cache items only (and won’t scrape online).
Returns: list(Quote)
scrapereads.reads.quote
Defines a quote from an Author.