Part 2: How to Extract Data with the Yelp Downloading Algorithm


FoodSpark can easily scale the Yelp downloading algorithm and help you scrape even more information.

This blog explains how the algorithm behind our web scraping services works and what steps are required to build a well-structured algorithm.

The following steps are frequently required when creating a sophisticated algorithm:

  • Start with a basic algorithm that solves a small problem.
  • Scale it up so that it can solve several instances of the same problem.
  • Add layers of complexity, one at a time.

Once these steps are finished, you can gradually add more features, such as machine learning, exploratory data analysis and insight extraction, and visualization.

The Basic Algorithm

This is the code used to extract data from the Yelp page; it should give you an idea of the algorithm before we break it down.

import requests
from bs4 import BeautifulSoup
import time

comment_list = list()
for pag in range(1, 29):
    time.sleep(5)
    URL = "https://www.yelp.com/biz/the-cortez-raleigh?osq=Restaurants&start=" + str(pag*10) + "&sort_by=rating_asc"
    print('downloading page ', pag*10)
    page = requests.get(URL)
    # next step: parsing
    soup = BeautifulSoup(page.content, 'lxml')
    for comm in soup.find("yelp-react-root").find_all("p", {"class": "comment__373c0__Nsutg css-n6i4z7"}):
        comment_list.append(comm.find("span").decode_contents())
        print(comm.find("span").decode_contents())

As you can see, it's quite short and simple to understand if you've worked with Python and some of its modules before.
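If you want to keep what this loop collects, a minimal follow-up sketch (the filename here is a hypothetical choice, not part of the original code) writes each review to a text file:

with open("cortez_reviews.txt", "w", encoding="utf-8") as f:  # hypothetical filename
    for comment in comment_list:
        f.write(comment + "\n")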

Structuring the Algorithm

An efficient way to structure the code is to develop a control panel, which includes:

  • A list of links that the algorithm will attempt to connect to.
  • A function that controls the algorithm's parameters.

At the same time, the algorithm should be written in sections that follow good programming practice:

  • Importing libraries
  • Adding variables
  • Defining the functions
  • Running the algorithm
  • Analyzing and exporting the findings

Importing Libraries

The first step, as with any well-organized program, is to dedicate a small section of code to the libraries used across the entire script. Note that these libraries are not part of Python's standard library, so if they are not already on your system you will need to install them first (pip install requests beautifulsoup4 lxml textblob pandas).

import requests
from bs4 import BeautifulSoup
import time
from textblob import TextBlob
import pandas as pd

Adding Variables

This collection of settings lets you manage which webpages are downloaded and parsed with BeautifulSoup; to keep the example simple, you will use just two restaurants. To direct the scraper effectively, you need three pieces of information per restaurant: the link used to build the connection, the number of review pages you would like to scrape, and the restaurant's name to include in the dataset.

Either a nested list or a dictionary would be the best way to store this information (a dictionary being the equivalent of a JavaScript object, or a NoSQL document if you wish). Once you've become used to dictionaries, they can simplify a lot of your work and make your code more understandable.

rest_dict = [
    {
        "name": "the-cortez-raleigh",
        "link": "https://www.yelp.com/biz/the-cortez-raleigh?osq=Restaurants&start=",
        "pages": 3
    },
    {
        "name": "rosewater-kitchen-and-bar-raleigh",
        "link": "https://www.yelp.com/biz/rosewater-kitchen-and-bar-raleigh?osq=Restaurants&start=",
        "pages": 3
    }
]
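As a quick sanity check (an illustration, not part of the original code), you can preview the URLs that will be built from one entry; this mirrors the rest['link'] + str(pag*10) construction used by the scraper below:

rest = rest_dict[0]
for pag in range(1, rest["pages"]):
    # prints the start=10 and start=20 review pages for the-cortez-raleigh
    print(rest["link"] + str(pag * 10))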

Defining the Functions

The following walkthrough describes every detail of the function. If you just want the code, it is recommended to download it directly: collecting and pasting these separate fragments into your IDE would be a pain when you can do it with a single click.

Now that you have all of the necessary information, you can create the algorithm, which you will call scraper. The rationale is straightforward, as the code follows a set of steps:

def scraper(rest_list):
    all_comment_list = list()
    for rest in rest_list:
        comment_list = list()

It will cycle through all of the dictionaries in the list first.

        for pag in range(1, rest['pages']):

Also add a try statement so that you don't have to start over if an error in the code or a connection problem causes the algorithm to stop. Such safeguards matter because faults are common during web scraping: the code depends on the structure of a website that we have not built ourselves. Without them, you would either have to spend time figuring out where the algorithm stopped and retuning the scraping parameters (assuming you managed to save the data collected so far), or start over entirely.

            try:

To avoid your IP being refused, impose a 5-second delay before initiating each request. When too many queries arrive too quickly, the website often recognizes that the client is not human and denies the connection request; without the try statement, the resulting error would stop the algorithm.

                time.sleep(5)
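If a fixed five-second pause proves too predictable, one common variation (an assumption on our part, not in the original code) is to randomize the delay:

import random
time.sleep(random.uniform(4, 8))  # wait a random 4-8 seconds between requests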

Connect to Yelp and copy the HTML, then repeat for the appropriate number of pages.

                URL = rest['link'] + str(pag*10) + "&sort_by=rating_asc"
                print(rest['name'], 'downloading page ', pag*10)
                page = requests.get(URL)

Convert the HTML into an object that BeautifulSoup can understand. This step must succeed, because it is the only way to extract data using the library's functions.

                # next step: parsing
                soup = BeautifulSoup(page.content, 'lxml')

Next, take the reviews out of this thousand-line string. After a thorough examination of the page source, we were able to determine which HTML elements store the reviews. This code retrieves the content of exactly those elements.

                for comm in soup.find("yelp-react-root").find_all("p", {"class": "comment__373c0__Nsutg css-n6i4z7"}):

Each review for a single restaurant is saved in a list named comment_list; the whole list is later paired with the restaurant's name.

                    comment_list.append(comm.find("span").decode_contents())
                    print(comm.find("span").decode_contents())
            except:
                print("could not work properly!")
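One caveat about the selector used above: class names such as comment__373c0__Nsutg are generated by Yelp's build pipeline and change over time, so the exact string is likely to break eventually. A more tolerant sketch (an assumption on our part, not the article's own code) matches any p element whose class contains the word "comment":

for comm in soup.find("yelp-react-root").find_all(
        "p", class_=lambda c: c and "comment" in c):
    comment_list.append(comm.find("span").decode_contents())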

Before moving on to the next restaurant, the reviews in comment_list are stored in a general list named all_comment_list; the comment_list variable is then reset in the following iteration.

        all_comment_list.append([comment_list, rest['name']])
    return all_comment_list
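For convenience, here are the fragments above assembled into a single runnable function, exactly as they fit together:

def scraper(rest_list):
    all_comment_list = list()
    for rest in rest_list:
        comment_list = list()
        for pag in range(1, rest['pages']):
            try:
                # pause to avoid being flagged as a bot
                time.sleep(5)
                URL = rest['link'] + str(pag*10) + "&sort_by=rating_asc"
                print(rest['name'], 'downloading page ', pag*10)
                page = requests.get(URL)
                # parse the downloaded HTML
                soup = BeautifulSoup(page.content, 'lxml')
                for comm in soup.find("yelp-react-root").find_all("p", {"class": "comment__373c0__Nsutg css-n6i4z7"}):
                    comment_list.append(comm.find("span").decode_contents())
                    print(comm.find("span").decode_contents())
            except:
                print("could not work properly!")
        all_comment_list.append([comment_list, rest['name']])
    return all_comment_list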

Finally, you can execute the algorithm with a single line of code and save all of the results in a list named reviews.

reviews = scraper(rest_dict)
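The TextBlob and pandas imports from the beginning are not used by the scraper itself; they belong to the "analyze and export your findings" step. A minimal sketch of that step (the column and file names are hypothetical, not from the original article) scores each review's sentiment and exports everything to CSV:

rows = []
for comment_list, name in reviews:
    for comment in comment_list:
        # decode_contents() returns inner HTML, so strip any remaining tags
        text = BeautifulSoup(comment, "lxml").get_text()
        rows.append({
            "restaurant": name,
            "review": text,
            "polarity": TextBlob(text).sentiment.polarity,  # -1 (negative) to +1 (positive)
        })

df = pd.DataFrame(rows)
df.to_csv("yelp_reviews.csv", index=False)  # hypothetical output file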

Looking to scale the Yelp downloading algorithm using a Yelp data scraper? Contact FoodSpark today!