6 Ways To Rapidly Collect Massive Datasets in your Apps
What is web scraping?
Web scraping is a technique in which a computer program extracts data from the human-readable output of websites. Web scraping software may access the World Wide Web directly using the Hypertext Transfer Protocol or through a web browser.
While web scraping can be done manually by a software user, the term typically refers to automated processes implemented with a program, bot, or web crawler. It is a form of copying in which specific data is gathered from the web and stored, typically in a central local database, spreadsheet, API, or another format that is more useful to the user, for later retrieval or analysis.
First, the app needs to interpret a web page as data
Web pages are built using text-based markup languages, such as HTML and XHTML, and frequently contain rich, useful data in text form. However, most web pages are designed for human end users rather than for ease of automated use, which can make building specialized tools and software to scrape them a challenging task.
Delphi plus Python is a powerful combination for web scraping
In this tutorial, we’ll build Windows Apps with extensive Web Scraping capabilities by integrating Python’s Web Scraping libraries with Embarcadero’s Delphi, using Python4Delphi (P4D).
P4D empowers Python users with Delphi’s award-winning VCL functionalities for Windows which enables us to build native Windows apps 5x faster. This integration enables us to create a modern GUI with Windows 10 looks and responsive controls for our Python Web Scraping applications. Python4Delphi also comes with an extensive range of demos, use cases, and tutorials.
We’re going to cover the following…
How to use Requests, BeautifulSoup, Instaloader, Snscrape, Tweepy, and Feedparser Python libraries to perform Web Scraping tasks
All of them will be integrated with Python4Delphi to create Windows apps with web scraping capabilities.
Prerequisites
Before we begin, download and install the latest Python for your platform. Then follow the Python4Delphi installation instructions mentioned here. Alternatively, you can check out the easy instructions found in the Getting Started With Python4Delphi video by Jim McKeeth.
Time to get started!
First, open and run our Python GUI using project Demo1 from Python4Delphi with RAD Studio. Then insert the script into the lower Memo, click the Execute button, and get the result in the upper Memo. You can find the Demo1 source on GitHub. The behind-the-scenes details of how Delphi manages to run your Python code in this amazing Python GUI can be found at this link.
How do I Scrape Website’s Data using Python Requests?
“Requests” is a simple yet elegant HTTP library that lets you execute standard HTTP requests extremely easily. Using it, you can pass parameters to requests, add headers, receive and process responses, and execute authenticated requests.
Requests is ready for the demands of building robust and reliable HTTP-speaking applications today, with features such as the following (a short sketch of a few of these features appears after the list):
- Keep-Alive & Connection Pooling
- International Domains and URLs
- Sessions with Cookie Persistence
- Browser-style TLS/SSL Verification
- Basic & Digest Authentication
- Familiar dict-like Cookies
- Automatic Content Decompression and Decoding
- Multi-part File Uploads
- SOCKS Proxy Support
- Connection Timeouts
- Streaming Downloads
- Automatic honoring of .netrc
- Chunked HTTP Requests
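Once Requests is installed (see below), here is a minimal sketch of a few of those features in action: a Session with persistent cookies and connection pooling, a per-request timeout, and a streaming download. The URL and output filename are placeholders for illustration.

import requests

# A Session keeps cookies and pooled connections across requests
with requests.Session() as s:
    s.headers.update({"User-Agent": "my-scraper/1.0"})

    # Per-request timeout (in seconds) so a slow server cannot hang the app
    r = s.get("https://example.com", timeout=10)
    print(r.status_code, r.headers.get("Content-Type"))

    # Streaming download: read the body in chunks instead of loading it all at once
    with s.get("https://example.com", stream=True, timeout=10) as resp:
        with open("example.html", "wb") as f:
            for chunk in resp.iter_content(chunk_size=8192):
                f.write(chunk)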
After installing Python4Delphi properly, you can install Requests using pip from the command prompt:
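pip install requests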
Don’t forget to add the path where Requests is installed to your System Environment Variables.
Example System Environment Variables
C:/Users/YOUR_USERNAME/AppData/Local/Programs/Python/Python38/Lib/site-packages
C:/Users/YOUR_USERNAME/AppData/Local/Programs/Python/Python38/Scripts
C:/Users/YOUR_USERNAME/AppData/Local/Programs/Python/Python38
The following code example uses Requests to get the content, response headers, and status code (run this inside the lower Memo of the Python4Delphi Demo01 GUI):
Example Python Requests
import requests

r = requests.get('https://example.com')
print(r.text)
print(r.headers)
print(r.status_code)
Here is the result in the Python GUI:
Requests is one of the most downloaded Python packages today, pulling in around 14M downloads per week; according to GitHub, Requests is currently depended upon by more than 500,000 repositories. Knowing these facts, you can certainly put your trust in this credible library.
Read more: https://pythongui.org/learn-to-build-a-python-gui-for-working-with-http-requests-using-requests-library-in-a-delphi-windows-app/
How do I Scrape Websites using Python BeautifulSoup?
BeautifulSoup is a library that makes it easy to scrape information from web pages. It sits on top of an HTML or XML parser, providing Pythonic idioms for iterating, searching, and modifying the parse tree. Since 2004, BeautifulSoup has been saving programmers hours or days of work on quick-turnaround screen-scraping projects.
BeautifulSoup is a Python library designed for quick-turnaround projects like screen scraping. Three features make it powerful (a minimal sketch follows the list):
- BeautifulSoup provides a few simple methods and Pythonic idioms for navigating, searching, and modifying a parse tree: a toolkit for dissecting a document and extracting what you need. It doesn’t take much code to write an application.
- BeautifulSoup automatically converts incoming documents to Unicode and outgoing documents to UTF-8. You don’t have to think about encodings unless the document doesn’t specify an encoding and Beautiful Soup can’t detect one. Then you just have to specify the original encoding.
- BeautifulSoup sits on top of popular Python parsers like lxml and html5lib, allowing you to try out different parsing strategies or trade speed for flexibility.
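To see these idioms in action before the bigger example below, here is a minimal sketch that parses a small, made-up HTML snippet with the built-in html.parser and pulls out text and attributes:

from bs4 import BeautifulSoup

html = """
<ul id="books">
  <li class="title"><a href="/b/1">Delphi Basics</a></li>
  <li class="title"><a href="/b/2">Python Recipes</a></li>
</ul>
"""

# Pass the markup and a parser name; 'lxml' or 'html5lib' also work here if installed
soup = BeautifulSoup(html, "html.parser")

# Navigate and search the parse tree with select/find_all
titles = [a.get_text() for a in soup.select("#books .title a")]
links = [a["href"] for a in soup.find_all("a")]
print(titles)  # ['Delphi Basics', 'Python Recipes']
print(links)   # ['/b/1', '/b/2']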
Are you looking for tools to build website scrapers to automate your data-collection process, and build a nice Windows GUI for them? This section will show you how to get started!
Here is how you can get BeautifulSoup
pip install beautifulsoup4
Example of using Python BeautifulSoup to collect and gather weather data
The following is an example of BeautifulSoup for scraping the Austin/San Antonio, TX weather data from the National Weather Service (run this inside the lower Memo of Python4Delphi Demo01 GUI):
# Import libraries
from bs4 import BeautifulSoup
import requests
import pandas as pd

# Read url
page = requests.get("https://forecast.weather.gov/MapClick.php?lat=30.2676&lon=-97.743")

# Download the page and start parsing
soup = BeautifulSoup(page.content, 'html.parser')
seven_day = soup.find(id="seven-day-forecast")
forecast_items = seven_day.find_all(class_="tombstone-container")
tonight = forecast_items[0]

# Extract the name of the forecast item, the short description, and the temperature
period = tonight.find(class_="period-name").get_text()
short_desc = tonight.find(class_="short-desc").get_text()
temp = tonight.find(class_="temp").get_text()

# Extract the title attribute from the img tag
img = tonight.find("img")
desc = img['title']

# Extract all the information from the page
period_tags = seven_day.select(".tombstone-container .period-name")
periods = [pt.get_text() for pt in period_tags]
short_descs = [sd.get_text() for sd in seven_day.select(".tombstone-container .short-desc")]
temps = [t.get_text() for t in seven_day.select(".tombstone-container .temp")]
descs = [d["title"] for d in seven_day.select(".tombstone-container img")]

# Combine our data into a Pandas DataFrame
weather = pd.DataFrame({
    "period": periods,
    "short_desc": short_descs,
    "temp": temps,
    "desc": descs
})

# Print the dataframe
print(weather)
Here is the BeautifulSoup result in the Python GUI:
How do I Scrape Instagram Data using Python Instaloader?
“Instaloader” is a tool to download Instagram pictures (or videos) and retrieve their captions and other metadata.
The following are Instaloader’s main features and functionalities (a small configuration sketch follows the list):
- Downloads public and private profiles, hashtags, user stories, feeds, and saved media.
- Downloads comments, geotags, and captions of each post.
- Automatically detects profile name changes and renames the target directory accordingly.
- Allows fine-grained customization of filters and where to store downloaded media.
- Automatically resumes previously interrupted download iterations.
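As a small configuration sketch (assuming the same public @embarcaderotech profile used later in this section), the Instaloader constructor lets you switch individual kinds of downloads on or off, and you can limit how many posts you fetch:

import instaloader
from itertools import islice

# Switch individual kinds of downloads on or off in the constructor
bot = instaloader.Instaloader(
    download_videos=False,    # skip video files
    download_comments=False,  # skip comment metadata
    save_metadata=True        # keep the JSON metadata for each post
)

profile = instaloader.Profile.from_username(bot.context, "embarcaderotech")

# Only grab the five most recent posts instead of the whole profile
for post in islice(profile.get_posts(), 5):
    bot.download_post(post, target=profile.username)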
Are you looking for tools to build Instagram scrapers to automate your data collecting or information retrieval process, and build a nice GUI for them? This section will show you how to get started!
This section will guide you to combine Python4Delphi with the Instaloader library, inside Delphi and C++Builder, from installing Instaloader with pip to downloading all @embarcaderotech Instagram content instantly!
Here is how you can get Instaloader
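pip install instaloader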
Python’s Instaloader allows us to get any posts from any public profile easily. We just need to use the get_posts() method. We will use this method on the profile of @embarcaderotech.
How to download Instagram posts using Delphi and Python
Let’s download each post’s image/video and caption by looping over the generator object and calling the .download_post() method. Run the following script in the Python4Delphi GUI:
# Import the module
import instaloader

# Create an instance of the Instaloader class
bot = instaloader.Instaloader()

# Load a profile from an Instagram handle
profile = instaloader.Profile.from_username(bot.context, 'embarcaderotech')

# Get all posts in a generator object
posts = profile.get_posts()

# Iterate and download
for index, post in enumerate(posts, 1):
    bot.download_post(post, target=f"{profile.username}_{index}")
It will save each post and create new folders named “embarcaderotech_1” through “embarcaderotech_n”, inside the directory that contains the Python4Delphi Demo01.exe that we use to run all the scripts above.
In each folder, you will see the actual content of the posts of the profile like a video or images. The scripts above are taken and modified from this post.
Instaloader Python4Delphi results
There are a lot of results so it’s not very screenshot-friendly!
Where do the screen-scraping results go?
All the contents will be retrieved automatically to the directory where you save or run the Python4Delphi Demo01 GUI:
You will get all the contents you need in no time (compared with the hard work to download them manually)!
Here is what the contents of each folder look like:
How do I Retrieve Twitter Data using Python Snscrape?
“Snscrape” is a library that allows anyone to scrape social networking services (SNS) without requiring personal API keys. It can return thousands of user profiles, hashtags, contents, or searches in seconds and has powerful and highly customizable tools.
The following services are currently supported:
- Facebook: User profiles, groups, and communities (aka visitor posts)
- Instagram: User profiles, hashtags, and locations
- Reddit: Users, subreddits, and searches (via Pushshift)
- Telegram: Channels
- Twitter: Users, user profiles, hashtags, searches, threads, and list posts
- VKontakte: User profiles
- Weibo (Sina Weibo): User profiles
In this tutorial, we will only focus on using Python Snscrape for Twitter.
How do I get the Python Snscrape library?
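Install it with pip:
pip install snscrape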
Run the following command in cmd to get all tweets by Embarcadero Technologies (@EmbarcaderoTech):
snscrape twitter-user EmbarcaderoTech > twitter-@EmbarcaderoTech.txt
The scraping results will be stored in the twitter-@EmbarcaderoTech.txt file:
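If you prefer to stay inside Python rather than calling the command-line tool, snscrape also exposes a module interface. Here is a minimal sketch (attribute and class names reflect recent snscrape versions and may vary) that writes the same kind of URL list:

import snscrape.modules.twitter as sntwitter

# Iterate over a user's tweets and collect their URLs, like the CLI command above
with open("twitter-@EmbarcaderoTech.txt", "w", encoding="utf-8") as f:
    for i, tweet in enumerate(sntwitter.TwitterUserScraper("EmbarcaderoTech").get_items()):
        f.write(tweet.url + "\n")
        if i >= 999:  # stop after 1,000 tweets for this sketch
            break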
How do I Retrieve Twitter Data using Python Tweepy?
“Tweepy” is an easy-to-use Python library for accessing the Twitter API.
There are limitations to using Tweepy for scraping tweets: the standard API only allows you to retrieve tweets from up to 7 days ago and limits you to 18,000 tweets per 15-minute window. However, combining Tweepy with Snscrape lets you work around these limitations and scrape all the tweets you want, as long as their URLs have already been scraped and stored in a .txt file, as shown in the previous section!
Getting started with Python Tweepy
To get started with Tweepy you’ll need to do the following things:
- Set up a Twitter account if you don’t have one already.
- Using your Twitter account, you will need to apply for Developer Access and then create an application that will generate the API credentials that you will use to access Twitter from Python.
- Install and import the Tweepy package.
Once you’ve done these things, you are ready to begin querying Twitter’s API to see what you can learn about tweets!
Run this pip command to install Tweepy:
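pip install tweepy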
Example Python code showing how to retrieve Twitter tweets
The following code uses Tweepy to retrieve all the @EmbarcaderoTech tweets collected with Snscrape in the previous section (run this inside the lower Memo of the Python4Delphi Demo01 GUI):
# Import libraries
import pandas as pd
import tweepy

# Key & access tokens
consumer_key = "YOUR CONSUMER KEY"
consumer_secret = "YOUR CONSUMER SECRET"
access_token = "YOUR ACCESS TOKEN"
access_token_secret = "YOUR ACCESS TOKEN SECRET"

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth)

# Open your text file/snscrape output
tweet_url = pd.read_csv("twitter-@EmbarcaderoTech.txt", index_col=None, header=None, names=["links"])
print(tweet_url.head())

# Extract the tweet_id using the .split function
af = lambda x: x["links"].split("/")[-1]
tweet_url['id'] = tweet_url.apply(af, axis=1)
print(tweet_url.head())

# Convert our tweet_url Series into a list
ids = tweet_url['id'].tolist()

# Process the ids in batches or chunks
total_count = len(ids)
chunks = (total_count - 1) // 50 + 1

# We only need the username, date, and the tweets themselves, so the query includes just those fields
def fetch_tw(ids):
    list_of_tw_status = api.statuses_lookup(ids, tweet_mode="extended")
    empty_data = pd.DataFrame()
    for status in list_of_tw_status:
        tweet_elem = {"tweet_id": status.id,
                      "screen_name": status.user.screen_name,
                      "Tweet": status.full_text,
                      "Date": status.created_at,
                      "retweet_count": status.retweet_count,
                      "favorite_count": status.favorite_count}
        empty_data = empty_data.append(tweet_elem, ignore_index=True)
    empty_data.to_csv("embarcaderoTech_Tweets.csv", mode="a")

# Loop over our batches, processing 50 entries per iteration
for i in range(chunks):
    batch = ids[i*50:(i+1)*50]
    result = fetch_tw(batch)
Using Python and Tweepy for powerful Twitter scraping
Tweepy Twitter scraping results in an Excel spreadsheet
We successfully scraped all the @EmbarcaderoTech tweets, from 2009 up to the most recent ones, and stored them in the "embarcaderoTech_Tweets.csv" file.
How do I Pull RSS Feed Data using Feedparser?
“Feedparser” or Universal Feed Parser is a library to parse Atom and RSS feeds in Python. feedparser can handle RSS 0.90, Netscape RSS 0.91, Userland RSS 0.91, RSS 0.92, RSS 0.93, RSS 0.94, RSS 1.0, RSS 2.0, Atom 0.3, Atom 1.0, and CDF feeds. It also parses several popular extension modules, including Dublin Core and Apple’s iTunes extensions.
To use feedparser, you will need Python 3.6 or later. feedparser is not meant to run standalone; it is a module for you to use as part of a larger Python program.
feedparser is easy to use; the module is self-contained in a single file, feedparser.py, and it has only one primary public function, parse. parse takes a number of arguments, but only one is required, and it can be a URL, a local filename, or a raw string containing feed data in any format.
Here is how you can get feedparser
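pip install feedparser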
Run the following script to parse data from the Stack Overflow RSS feed:
import feedparser

d = feedparser.parse('http://stackoverflow.com/feeds')
print(d.feed.title)
print(d.feed.title_detail)
print(d.feed.link)
print(d.entries)
Feedparser results in the Python4Delphi GUI
It’s a lot of data, again!
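If you only need a few fields from each entry rather than the raw dump above, a short sketch like this keeps the output readable (the published field is not guaranteed on every feed, hence the .get fallback):

import feedparser

d = feedparser.parse('http://stackoverflow.com/feeds')

# Print just the title, link, and publication date of each entry
for entry in d.entries:
    print(entry.title)
    print(entry.link)
    print(entry.get("published", "no date"))
    print("---")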
Want to know some more? Then check out Python4Delphi which easily allows you to build Python GUIs for Windows using Delphi.