DSCI 510: Principles of Programming for Data Science
Final Project Guidelines
In the final project for this class, you will have the opportunity to apply the knowledge and
programming skills you have learned to a real-world problem. Your project should focus on
web scraping (or collecting data through APIs), data cleaning, analysis, and visualization using
Python.
Final Project Due Date: December 19th, 2024 at 4pm PT
Final grades for Fall 2024 are submitted via the Grading and Roster System (GRS) the week after
December 19th, and we must have graded every project by then. We need to set some time aside
to grade your projects, so we have to be strict about this deadline.
Please refer to the Academic Calendar for the specific dates.
Final Project Submission via GitHub Classroom
In order to submit your final project assignment you will need to accept the assignment on our
GitHub Classroom (similar to the lab assignments). With the final assignment repository you
will get a template where you can upload all of your files.
Project Proposal
You may send a one-page proposal document (in PDF format) describing your final project.
This proposal should include the following:
1. Name of your final project and a short synopsis/description (1 paragraph max).
2. What problem are you trying to solve, which question(s) are you trying to answer?
3. How do you intend to collect the data and where on the web is it coming from?
4. What type of data cleaning and/or analysis are you going to perform on the data?
5. What kind of visualizations are you going to use to illustrate your findings?
There is no official due date for the proposal, but the sooner you send it to us the sooner you will
get feedback on it. We will provide feedback and suggest changes if required. This is usually to
test the feasibility of the project and give you a sense of whether you need to scale back because
it is too ambitious or whether you need to do more work in order to improve your grade. Please upload
the original proposal to the same repository as the other files of your final project.
Note: For faster processing, you can send Gleb (gleb@isi.edu), Mia (osultan@usc.edu)
or Zhivar (souratih@usc.edu) an email with the subject “DSCI 510: Final Project Proposal”.
Please also upload your proposal document to the final project GitHub repository. The
email should contain a link to your GitHub repository or the proposal.pdf file itself.
Project Goals and Steps
1. Data Collection (20%)
You should identify websites or web resources from which you will get raw data for your
project. You can either web-scrape data or collect data using publicly available APIs.
This could include news articles, e-commerce websites, social media posts, weather data,
or any other publicly available web content. This step should be sophisticated enough
to demonstrate the techniques you have learned in the class. Use multiple data sources
to compare different data in your analysis. Using Python libraries like BeautifulSoup and
requests, you should be able to write scripts to scrape data from the chosen websites. This
step includes making HTTP requests, handling HTML parsing, and extracting relevant
information.
Please note that if you need to collect data that changes over time, you might want to
set up a script that runs every day and collects the data at a certain time of the day. That
way you can collect enough data to run your analysis for the final project later.
We recommend that you scrape data from static websites, or use publicly available APIs.
If you scrape data from dynamically generated pages, you might run into issues as certain
websites are not keen on giving away their data (think sites like Google, Amazon, etc.).
Please note that some APIs are not free and you need to pay to use them. You should
try to avoid those because, when we are grading your final project, we should be able to
run your code without paying for an API.
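The collection step described above (download a page, parse the HTML, extract the relevant
pieces, and keep one raw copy per day) could be sketched roughly as follows. This is only an
illustration under several assumptions: it uses the standard library (urllib and html.parser)
as stand-ins so that it runs without installing anything, whereas in your project you would
more likely use requests and BeautifulSoup; the choice of <h2> headings as the target, the
User-Agent string, and the function and file names are all hypothetical.

```python
from datetime import date
from html.parser import HTMLParser
from pathlib import Path
from urllib.request import Request, urlopen


class HeadingParser(HTMLParser):
    """Collects the text of every <h2> element (a hypothetical target)."""

    def __init__(self):
        super().__init__()
        self.headings = []
        self._in_h2 = False

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self._in_h2 = True
            self.headings.append("")

    def handle_endtag(self, tag):
        if tag == "h2":
            self._in_h2 = False

    def handle_data(self, data):
        # Text inside nested tags (e.g. <b>) is still part of the heading.
        if self._in_h2:
            self.headings[-1] += data


def fetch_html(url, timeout=10):
    """Download a page as text; raises on network or HTTP errors."""
    request = Request(url, headers={"User-Agent": "dsci510-final-project"})
    with urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def extract_headings(html):
    """Return the non-empty <h2> texts found in an HTML string."""
    parser = HeadingParser()
    parser.feed(html)
    parser.close()
    return [h.strip() for h in parser.headings if h.strip()]


def save_raw_snapshot(html, raw_dir, day=None):
    """Store the untouched page under a date-stamped name, e.g. in data/raw."""
    day = day or date.today()
    raw_dir = Path(raw_dir)
    raw_dir.mkdir(parents=True, exist_ok=True)
    path = raw_dir / f"page_{day.isoformat()}.html"
    path.write_text(html, encoding="utf-8")
    return path
```

Because the file name contains the date, running such a script once per day adds a new
snapshot instead of overwriting the previous one. The scheduling itself (for example with
cron or a similar tool) is left out of this sketch.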
2. Data Cleaning (20%)
Once your data collection is complete, you will need to clean the data in order to be able
to process it. This will involve handling missing values, cleaning HTML tags, removing
duplicates, and converting data into a structured format for analysis in Python. If your
raw data is not in English, you should attempt to translate the data into English as part
of this step.
Depending on the size of your data you can upload both raw and preprocessed data to the
data folder in the repository of your final project.
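The cleaning operations listed in this step (removing HTML tags, handling missing values,
removing duplicates, and writing a structured file) might look roughly like the sketch
below. It is a minimal illustration, not a prescribed method: the field names, the policy of
dropping rows that lack a required field, and the helper names are assumptions, and the
regular expression is a crude tag stripper that would also remove legitimate text that
happens to sit between "<" and ">".

```python
import csv
import html
import re
from pathlib import Path

TAG_RE = re.compile(r"<[^>]+>")  # crude: matches anything between < and >


def strip_html(text):
    """Remove tags, decode entities such as &amp;, and collapse whitespace."""
    no_tags = TAG_RE.sub(" ", text)
    return " ".join(html.unescape(no_tags).split())


def clean_records(records, required=("title",)):
    """Clean string fields, drop rows missing a required field, drop duplicates."""
    cleaned, seen = [], set()
    for record in records:
        row = {k: strip_html(v) if isinstance(v, str) else v
               for k, v in record.items()}
        if any(not row.get(field) for field in required):
            continue  # assumed policy: a row without a required value is unusable
        key = tuple(sorted(row.items()))  # values are assumed to be hashable
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(row)
    return cleaned


def write_csv(rows, path, fieldnames):
    """Write cleaned rows to a CSV file, e.g. under data/processed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)  # None values are written as empty cells
```

Note that duplicates are detected after cleaning, so two raw rows that differ only in
markup are treated as the same record.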
3. Data Analysis (20%)
In this step, you will perform an analysis on the scraped data to gain insights or answer
specific questions. You should perform statistical analyses and generate descriptive statistics
using libraries such as Pandas or NumPy (or any other library you prefer to use). You
should add a detailed description of this step and your specific methods of analysis in the
final report at the end.
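As one small illustration of descriptive statistics, the sketch below summarizes a numeric
column per group. It uses only the standard library statistics module so that it runs
anywhere; with Pandas, a grouped describe() call would give a comparable summary. The column
names ("city", "temp") and the function names are hypothetical.

```python
import statistics
from collections import defaultdict


def describe(values):
    """Basic descriptive statistics for a non-empty sequence of numbers."""
    values = list(values)
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        # The sample standard deviation needs at least two data points.
        "stdev": statistics.stdev(values) if len(values) > 1 else 0.0,
        "min": min(values),
        "max": max(values),
    }


def describe_by_group(rows, group_key, value_key):
    """Group rows (dicts) by one column and summarize another column."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(float(row[value_key]))
    return {name: describe(vals) for name, vals in groups.items()}
```

For example, describe_by_group(rows, "city", "temp") would return one summary dictionary
per city, which can then be reported or passed on to the visualization step.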
4. Data Visualization (20%)
Last but not least, you should create plots, graphs, or charts using Matplotlib, Seaborn,
D3.js, ECharts or any other data visualization library, to effectively communicate your
findings. Visualizations created in this step could be static or interactive. If they are
interactive, you need to describe this interaction and its added value in the final report.
Our team should be able to replicate your interactive visualizations when we are grading
your final projects.
5. Final Report (20%)
Finally, you will submit a final report describing your project and the problem you are trying
to solve or the questions that you are trying to answer. What data did you collect and
how was it collected? What type of data processing/cleaning did you perform? You
will also need to explain your analysis and visualizations. See the Final Report section for
more information.
The percentages used for grading here are a general guideline, but they can be adjusted
based on your project. If your data collection is trivial but the analysis is fairly complicated,
you could score more points in the data analysis step to compensate. Similarly, the complexity of
the final data visualizations could earn additional points if you decide to make your
visualizations more interactive and engaging for the end users.
Project Deliverables
GitHub Repository
We will create an assignment for the final project. You will need to accept the assignment and
commit your code and any additional files (e.g. raw data or processed data) to the repository.
Here is a generic structure of the repository:
github_repository/
    .gitignore
    README.md
    requirements.txt
    data/
        raw/
        processed/
    proposal.pdf
    results/
        images/
        final_report.pdf
    src/
        get_data.py
        clean_data.py
        analyze_data.py
        visualize_results.py
        utils/
And here is a description of what each of the folders/files could contain:
1. proposal.pdf
The project proposal file (PDF). This is what you can send us in advance to see if your
project meets the minimum requirements or if the scope is too large and you need to
scale it back. See the section: Project Proposal.
2. requirements.txt
This file lists all of the external libraries you have used in your project (e.g. pandas,
requests, etc.) and the specific version of each library that you used. You can create this file
manually or generate it from within your virtual (conda) environment.
You can run this command to create the requirements.txt file:
pip freeze > requirements.txt
To install all of the required libraries based on this requirements file, run this command:
pip install -r requirements.txt
3. README.md
This file typically contains installation instructions, or the documentation on how to install
the requirements and ultimately run your project. Here you can explain how to run your
code: how to get the data, how to clean the data, how to run the analysis code and finally
how to produce the visualizations. We have created sections in the README.md file for
you to fill in. Make sure you fill in all of the sections.
Please note that this file is the most important one to us, as we will try to reproduce your results
on our end to verify that everything is working. If there is anything tricky about
the installation of your project, mention it here to make it easier for us to run
your project.
4. data/ directory
Simply put, this folder contains the data that you used in this project.
(a) The raw data folder will have the raw files you downloaded/scraped from the web. It
could contain (not exhaustive) HTML, CSV, XML or JSON files. If your raw data happens
to be too large to upload to GitHub (i.e. larger than 25 MB) then please upload your
data to the USC Google Drive and provide a link to the data in your README.md
file.
(b) The processed data folder will contain your structured files after data cleaning. For
example, you could clean the data and convert them to JSON or CSV files. Your
analysis and visualization code should perform operations on the files in this folder.
Note: Make sure your individual files are less than 25 MB in size. You can use
USC Google Drive if the files are larger than 25 MB. In that case, please provide
a link for us to get to the data in your README.md file.
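To check the size limit before committing, a short script along these lines could list the
files that are too large. It is only a sketch: the function name is hypothetical, and
reading "25 MB" as 25 * 1024 * 1024 bytes is an assumption.

```python
from pathlib import Path

LIMIT_BYTES = 25 * 1024 * 1024  # assumed reading of the 25 MB limit


def find_large_files(data_dir, limit_bytes=LIMIT_BYTES):
    """Return files under data_dir that are not smaller than the limit, largest first."""
    data_dir = Path(data_dir)
    too_large = [
        p for p in data_dir.rglob("*")
        if p.is_file() and p.stat().st_size >= limit_bytes
    ]
    return sorted(too_large, key=lambda p: p.stat().st_size, reverse=True)


if __name__ == "__main__":
    for path in find_large_files("data"):
        print(f"{path} ({path.stat().st_size} bytes) should go to Google Drive")
```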
5. results/ directory
This folder will contain your final project report and any other files you might have as part
of your project. For example, if you choose to create a Jupyter Notebook for your data
visualizations, this notebook file should be in this results folder. If you have any static
images of the data visualizations, those images should go in this folder as well.
6. src/ directory
This folder contains the source code for your project.
(a) get_data.py will download, web-scrape or fetch the data from an API and store it in
the data/raw folder.
(b) clean_data.py will clean the data, transform the data and store structured data files
in the data/processed folder, for example as CSV or JSON files.
(c) analyze_data.py will contain methods used to analyze the data to answer the
project-specific questions.
(d) visualize_results.py will create any data visualizations using matplotlib or any other
library to illustrate the analysis you performed.
(e) utils/ folder should contain any utility functions that your code needs.
This could be something generic such as regular expressions used to clean
the data or to parse and lowercase otherwise case-sensitive information.
7. .gitignore
Last but not least, the .gitignore file is here to keep certain metadata or otherwise
unnecessary files from being added to the repository. This includes files that were used
in development or were created as a by-product but are not necessary to run the
project (for example, cached files added by IDEs like VS Code or PyCharm).
Please note that this project structure is only a suggestion. Feel free to add more files or change
the names of files and folders as you prefer. That being said, please take into account that we
will be looking for the specific files to get the data, clean the data, analyze the data, etc. You can
change this structure or create more files in this repository as you like, but please describe
in your README.md file where everything is located.
Final Report
You’ll find an empty template for the final report document (PDF) in the GitHub repository once
you accept our final project assignment. At the very least, your final report should have the
following sections:
1. What is the name of your project?
(a) Please write it as a research question and provide a short synopsis/description.
(b) What is/are the research question(s) that you are trying to answer?
2. What type of data did you collect?
(a) Specify exactly where the data is coming from.
(b) Describe the approach that you used for data collection.
(c) How many different data sources did you use?
(d) How much data did you collect in total? How many samples?
(e) Describe what changed from your original plan (if anything changed) as well as the
challenges that you encountered and resolved.
3. What kind of analysis and visualizations did you do?
(a) Which analysis techniques did you use, and what are your findings?
(b) Describe the type of data visualizations that you made.
(c) Explain the setup and meaning of each element.
(d) Describe your observations and conclusion.
(e) Describe the impact of your findings.
4. Future Work
(a) Given more time, what would you do in order to further improve your project?
(b) Would you use the same data sources next time? Why or why not?
Your final project report should be no less than 2 and no more than 5 pages including any images
(e.g. of data visualizations) that you want to embed in the report. Please spend a decent amount
of time on the report. Your report is the first file we will read. We will not know how great your
project is if you don’t explain it clearly and in detail.