For this assignment, you'll collect some stock data. We'll make use of investing.com to collect information on the most active stocks in the market, through web scraping. We'll supplement this with historical data about these stocks gathered through API requests.
You'll then be responsible for cleaning the data, creating a database from it, and analyzing stocks by querying your database.
You can click here to get the stencil code for Homework 2. Reference this guide for more information about Github and Github Classroom.
The data is located in the data folder. To ensure compatibility with the autograder, you should not modify the stencil unless instructed otherwise. For this assignment, please write your solutions in the respective .py and .sql files. Failing to do so may interfere with the autograder and result in a low grade.
To get started, we’re going to want to collect some data on the most active stocks in the market. Conveniently, investing.com publishes this exact data. To collect this data, you’ll make use of web scraping.
For purposes of this assignment, we've made a copy of this page to keep the data static. Note, some of the data in our static copy is intentionally modified from real stock data to ensure you've cleaned your data and handled edge cases. As such, you will scrape from this URL: https://cs1951a-s21-brown.github.io/resources/stocks_scraping_2021.html
Before scraping, you'll need your code to access this webpage.
You should make use of the requests library to make an HTTP request and collect the HTML. If you're not familiar with the requests library, you can read about it here.
Once you have accessed the HTML and assigned it to some variable, you'll want to scrape it, collecting the following for each stock in the table and storing it in a data structure.
You'll use Beautiful Soup, a Python package, to scrape the HTML. This will require looking at the HTML structure of the investing.com page. You can select various HTML elements on a page by tag name, class name, and/or id. Using inspect element on your web browser, you can check what HTML tags and classes contain the relevant information.
Note: You should collect information from the 50 most active stocks in the investing.com table. This is what the investing.com HTML will contain by default.
Hint: All extracted information will be strings. You’ll want to make sure the price and percent change are floats (e.g., "24.5%" should become the float 24.5) and volume is an integer. Lastly, make sure to normalize the HQ state capitalization (e.g., California and california should be interpreted the same way).
Another Hint: You will probably want to look ahead to the queries you will ultimately ask in Part 4, since they will affect the type of cleaning you need to do.
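The cleaning described in the hints above could be sketched roughly as follows. This is only an illustration, not the expected solution: the function names are hypothetical, and the handling of thousands separators and of K/M/B volume suffixes is an assumption about how the page formats its numbers, so check the actual HTML before relying on it.

```python
def parse_percent(text):
    """Convert a string such as '24.5%' or '+1.2%' to a float."""
    return float(text.strip().rstrip("%").replace(",", ""))


def parse_price(text):
    """Convert a string such as '1,234.50' to a float."""
    return float(text.strip().replace(",", ""))


# Assumed suffixes; check the scraped page for the formats it really uses.
_SUFFIXES = {"K": 1_000, "M": 1_000_000, "B": 1_000_000_000}


def parse_volume(text):
    """Convert a string such as '12,345' or '1.5M' to an integer."""
    cleaned = text.strip().replace(",", "").upper()
    if cleaned and cleaned[-1] in _SUFFIXES:
        return int(round(float(cleaned[:-1]) * _SUFFIXES[cleaned[-1]]))
    return int(float(cleaned))


def normalize_state(text):
    """Make 'california', 'CALIFORNIA' and ' California ' compare equal."""
    return " ".join(text.split()).title()
```

Normalizing every state name to one canonical spelling at scraping time means later SQL queries can group or filter on location without worrying about case.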
In this part of the assignment, note that you are asked to scrape data from a public website. List and explain three scenarios where scraping data may pose a risk to external and internal stakeholders. Feel free to refer to this
Consider the following simple HTML page with an unordered list:
<html>
  <body>
    <h1>Welcome to My Website</h1>
    <ul>
      <li>Coffee</li>
      <li>Tea</li>
      <li>Coke</li>
    </ul>
  </body>
</html>
Imagine we want to get the items in the list. The ul tag indicates an unordered list. We’ll then want to get each list item (list items are in li tags). Specifically, we’ll want to extract the text inside each list item. To do this, we’ll use the following code, where page.text is the HTML of the page.
from bs4 import BeautifulSoup

soup = BeautifulSoup(page.text, 'html.parser')
items = soup.find("ul").find_all("li")
You’ll notice that items is a list of three items, since there are three list items in the unordered list. You’ll also see that items[0].text will give you the text of the first list item!
With investing.com, the stock data you have been asked to look into is public content, but many websites we use daily collect user-generated content. Consider one of the following websites (or a website of your choice) that collects user-generated content and has interesting social/ethical implications:
Rather than using web scraping to collect this data, we’ll make use of an API. You’ll make requests to this API using Python’s requests library.
IEX Trading offers an API with various endpoints that offer information about stocks.
Click here for a walkthrough on creating a free account.
Note: You have a limited number of API calls you can make with a free account. You'll start receiving 402 errors if you reach the limit - if this happens, use a different email address to make a new (free) account (A sneaky trick is to add a suffix to your email address using a "+". For example, if your email is ellie_pavlick@brown.edu, you can sign up with ellie_pavlick+1@brown.edu, ellie_pavlick+2@brown.edu, and so on. This means you don't have to make any new email accounts for this, you can just use the one you already have to generate infinitely many new free accounts.). You probably won't reach the limit, but to check if you're close, you can click the "Message Use" tab on the left side of the API console site. Free accounts give you 50,000 messages - 1 API call usually costs more than 1 message.
We’re going to want to collect two pieces of information for each stock in investing.com's most active stock table:
To do this, you’ll want to make use of the chart endpoint to collect the historical stock pricing.
Then, you will want to parse through the data and average the closing price for each stock.
IMPORTANT: Set the parameter "chartCloseOnly" to True when you request from the chart endpoint to avoid immediately reaching your API call limit! (Read about URL parameters here.)
Using the Previous Day Price endpoint, you should get the previous day adjusted price data for a specific stock. You should not add stocks with no previous price to the table.
Hint: Some stocks from investing.com are not listed on major stock exchanges, and thus the IEX Trading API does not have data on them. In this case, the IEX Trading API will return a 404 status code. Your program should handle this error by disregarding stocks from investing.com if they are not present in the IEX Trading API. That is, these stocks should not be added to the database. You can check the status code of a request by checking requests.get(...).status_code.
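One possible way to structure the status-code check from the hint is a small helper like the one below. The name json_or_none is hypothetical. The helper only assumes that its argument behaves like the object returned by requests.get(...), that is, it has a status_code attribute plus json() and raise_for_status() methods.

```python
def json_or_none(response):
    """Return the parsed JSON body, or None when the stock is unknown.

    `response` is expected to behave like the object returned by
    requests.get(...).
    """
    if response.status_code == 404:
        # The API has no data for this symbol, so the caller should skip it.
        return None
    # Surface any other problem (for example a 402 once the message quota
    # is used up) instead of silently storing bad data.
    response.raise_for_status()
    return response.json()
```

A caller might write data = json_or_none(requests.get(url, params=params)), where url and params are placeholders for whatever endpoint and parameters you are requesting, and then skip the stock whenever data is None.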
Another Hint: When calculating average closing price for the last 5 days, make sure to only include the stocks that have data for at least one of the past five days. (i.e. if a stock doesn't list any of the closing prices for the past five days, including that stock might give you an error.)
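The averaging step, including the empty-data case from the hint above, might look something like this. It is a sketch under an assumption: the chart data for one stock has already been parsed into a list of dictionaries, each with a "close" key. Verify that shape against the real API response; the function name average_close is hypothetical.

```python
def average_close(chart_rows):
    """Average the closing prices in the rows returned for one stock.

    Returns None when no row has a usable closing price, so the caller can
    leave that stock out instead of dividing by zero.
    """
    closes = [
        row["close"]
        for row in chart_rows
        if row.get("close") is not None
    ]
    if not closes:
        return None
    return sum(closes) / len(closes)
```

Returning None rather than 0 keeps "no data" distinguishable from a genuine price, which makes it easy to decide whether the stock belongs in the database.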
Let’s think about missing data for a second. It’s important to remember that the scraped data isn’t all-encompassing. View this artwork by Mimi Onuoha: here
What kind of data may be difficult to get access to? Give one example. Why do you think that is? What is the impact of this missing or inaccessible data? (3 points)
Read the article here
Question: The concept of a “data economy” comes with the perspective that data is a form of economic resource and capital. To what, if any, standards of social responsibility should we hold researchers, companies, and other entities that use data to turn a profit? What price would you put on your own personal data, if any?
Justify your opinion and describe which opinions you agree or disagree with in the article.
You now realize that to truly harness the data, you need to turn it into a database you can query. Using the provided stencil, create a database with these tables:
companies
- symbol, a string of the stock symbol that is the primary key of this table
- name, a string of the company name
- location, a string of the company's HQ location

quotes
- symbol, a string of the stock symbol that is the primary key of this table
- prev_close, the previous closing price of this stock, a number
- price, the current stock price, a number
- avg_price, the average closing price over the last five days, a number
- volume, the volume of this stock as a number
- change_pct, the percent change in the stock’s price today, as a decimal

To create a connection to the database, and a cursor, we include the following lines in the stencil:
# Create connection to database
conn = sqlite3.connect('data.db')
c = conn.cursor()
We also prepare the database for you by clearing out relevant tables if they already exist. This allows you to run your code multiple times and replace your old version of data.
# Delete tables if they exist
c.execute('DROP TABLE IF EXISTS "companies";')
c.execute('DROP TABLE IF EXISTS "quotes";')
To create a database table, you'd do something like this:
c.execute('CREATE TABLE person(person_id int not null, name text)')
conn.commit()
To insert a row into a table, you'd do something like this:
c.execute('INSERT INTO person VALUES (?, ?)', (some_variable, another_variable))
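Putting the create and insert steps together, a self-contained sketch for a table shaped like companies might look as follows. The column types and the sample rows are assumptions made for illustration, and the example uses an in-memory database so that it does not overwrite data.db.

```python
import sqlite3

# An in-memory database keeps this sketch from touching data.db.
conn = sqlite3.connect(":memory:")
c = conn.cursor()

# Column types are assumptions; use whatever the stencil specifies.
c.execute(
    """
    CREATE TABLE companies (
        symbol   TEXT PRIMARY KEY NOT NULL,
        name     TEXT,
        location TEXT
    )
    """
)

# Hypothetical rows standing in for the cleaned scraping results.
rows = [
    ("AAA", "Example Corp", "California"),
    ("BBB", "Sample Inc", "New York"),
]
# executemany runs the same parameterized INSERT once per tuple.
c.executemany("INSERT INTO companies VALUES (?, ?, ?)", rows)
conn.commit()

# Read the rows back to confirm they were stored.
c.execute("SELECT symbol, location FROM companies ORDER BY symbol")
stored = c.fetchall()
```

Using ? placeholders instead of building the SQL string by hand lets sqlite3 handle quoting, which matters when company names contain apostrophes.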
The data is saved into the data.db file. If you want to take a look at it, one way is to use a website such as https://inloop.github.io/sqlite-viewer/ where you can load the file and query data from the tables.
Each SQL statement should be stored in its own file: query1.sql, query2.sql, etc.
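To check a query file locally before submitting, one option is a small helper along these lines. The helper name run_query_file is hypothetical, and the sketch assumes each file contains exactly one SQL statement, as described above.

```python
import sqlite3
from pathlib import Path


def run_query_file(conn, path):
    """Execute the single SQL statement stored in `path` and return all rows."""
    sql = Path(path).read_text()
    return conn.execute(sql).fetchall()
```

For example, run_query_file(sqlite3.connect("data.db"), "query1.sql") would return the rows produced by the first query as a list of tuples.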
After finishing the assignment (and any assignment in the future), run python3 zip_assignment.py in the command line from your assignment directory, and fix any issues brought up by the script.
After the script has been run successfully, you should find the file scraping-submission-1951A.zip in your assignment directory. Please submit this zip file on Gradescope under the respective assignment.
(If you have not signed up for Gradescope already, please refer to this guide.)
Note: Please make sure that the autograder works properly after you have submitted your zip file, just to make sure that you don't run out of your API call allotment exactly at the moment of submission. Failure to ensure that the autograder works properly on your code will result in a low grade.
Made with ♥ by Jacob Meltzer and Tanvir Shahriar (2019 TAs), updated by Natalie Delworth and Nazem Aldroubi (2020 TAs); Daniela Haidar, Daniel Civita Ramirez, JP Champa and Nam Do (2021 Spring TAs), and again by Aakansha Mathur and Nam Do (2021 Summer TAs).
Updated by Daniela Haidar, Benjamin Shih, James Shi, Micah Bruning, Aakansha Mathur in Spring 2022. STA component was updated by Aanchal Seth and Joanna Tasmin in Spring 2022.