INFO 370: Introduction to Data Science

email Greg | email Umang | email class
Instructor: Gregory L. Nelson
TA: Umang Seghal
Winter 2018. Section B


In this class you'll learn how to think like a data scientist. You'll learn what data scientists do and how they do it. You'll also learn about the contexts in which a data scientist exists. By the end of the course, you should be able to enter any organization and begin to understand the social and technical contexts in which you help make decisions. If you want to be a great data scientist, this is the course for you.

Learning objectives for the course:
  1. Comprehend the practice of data science as an interactive, iterative process.
  2. Critique the quality of data, models, and results within a decision context.
  3. Consider contextual and critical perspectives in data science.
  4. Have familiarity using computational tools that support data scientists
  5. Understand what data scientists do in various organizational and social contexts

Prerequisites

You should have aspirations to be a data scientist or to work closely with them. Because we'll use data to inform decisions, you should also know:

  • How to use a scripting language (R, Python) to manipulate data
  • How to use the command line
  • How to use Git and GitHub
  • How to access a Web API

The prerequisite course, INFO 201 (the Technical Foundations of Informatics), should be suitable preparation for the above. Refer to the INFO 201 online book to refresh your knowledge of the course.

Office Hours

We are available to talk about jobs, careers, graduate school, research, class, taboos, and anything else. Greg's office hours this quarter are held twice a week, Monday 12:30pm-1:30pm and Wednesday 5:30pm-6:30pm, both at the CSE 3rd floor breakout area next to the stairs (the windowed area with the large whiteboard wall). Umang's office hours this quarter are also twice a week, Tuesday 11am-12pm and Thursday 11am-12pm, both at MGH Commons. Occasionally we need to schedule other things over office hours. To guarantee we'll be around, write to us in advance to secure a time.

Devices in Class

We will use smartphones and laptops throughout the quarter to facilitate activities and project work in class. However, research and student feedback clearly show that using devices for non-class activities harms not only your own learning, but other students' learning as well. Therefore, I only allow device usage during activities that require devices. At all other times, you should not be using your device. We'll help you remember this by announcing when to bring devices out and when to put them away.

Typical Week

  • Sunday: Do reading assignments (read readings, review and run scripts). Complete reflection survey.
  • Monday: Go to class and participate. After class, report struggles online.
  • Tuesday: Do reading assignments (read readings, review and run scripts). Complete reflection survey.
  • Wednesday: Go to class and participate. After class, report struggles online.
  • Wednesday-Saturday: Homework or group project work
  • Sunday: Homework due

Schedule

Week 0 — What is data science?
1/3 Lecture: Data science: welcome and opportunity
  Assigned: Homework 1. Due Fri 1/12.
Week 1 — Decision Making in Data Science
1/8 Lecture and Lab: Data science is a process
1/10 Lecture: Understanding domain and applying decision theory
  Assigned: Homework 2. Due Sun 1/14 before class Weds 1/17.
Week 2 — Framing your analysis
1/15 No class, holiday
1/17 Lecture: Framing: Using data and models for decisions and questions
  Not yet posted: Homework 3: Framing. Due Sun 1/21.
  Assigned: Project Milestone 1: Group formation & initial domain understanding. Due Sun 1/21 Tues 1/23.
Week 3 — Modeling concepts; finding data
1/22 Lecture: Causal Diagrams and Scoping
1/22 Lab: Using Causal Loop Diagram for Scoping
1/24 Lecture: PM1 Review; Ideating, finding and selecting data sources
  Assigned: Project Milestone 2: Refining Framing and Evaluating Feasibility and Potential Impact. Due Weds 1/31.
Week 4 — Collecting and Making Sense of Data
1/29 Lecture: Visualizing Data
1/29 Lab: Review PM2, Decisions: Focus on Choices
1/31 Lecture: Tidying and cleaning data
Week 5 — Visualization; Model Fitting
2/5 Lecture: Web scraping; Exploratory Data Analysis and PM2 examples
2/5 Lab: Web Scraping
2/7 Lecture: Models
Week 6 — Modeling
2/12 Lecture: Modeling as a search for "optimal" parameters
2/12 Lab: Fitting basic models in R
2/14 Lecture: Evaluating quality of model parameters (fitted models)
Week 7 — Interpreting models
2/19 Holiday
2/21 Lecture: Logistic Regression; Evaluating models with cross-validation
Week 8 — Understanding Models
2/26 Lecture: Logistic Regression; simulating decisions using models and residuals
2/26 Lab: Trying different thresholds for logistic regression and simulating decisions
2/28 Lecture: Bias; Simulating decisions using models and residuals
Week 9 — Debugging and Limitations of Models
3/5 Lecture: Debugging strategies for R code and Monte Carlo Simulations
3/7 Lecture: Interpreting models and limitations; reflecting on class projects
Week 10 — Project Fair
3/13 Tues Project Fair: Room EEB037
  • Slides on interviewing and wrap-up: slides
  • Optional but highly recommended reading on limits of statistics: The fourth quadrant
  Homework 10 (Project and Course Reflection). Due 3/14.

Grading

There are 100 points you can earn in this class:

  • Activities (12 points, 0.5 points for each class or lab). Show up and engage to get credit.
  • Reading (15 points, 1 point each). Prove you read and understood the reading.
  • Individual Homework (33 points). Prove you understand important data science topics.
  • Project (40 points, team score). Reach several milestones related to your team data science project.

We will use the iSchool Standard Grading Scale to convert your grade percentage (as shown in Canvas) to a 4.0 scale.

≥ 97% → 4.0   90.5 → 3.5   83.9 → 3.0   78 → 2.5   73 → 2.0*    68 → 1.5   62 → 0.9
95.7 → 3.9    89.2 → 3.4   82.6 → 2.9   77 → 2.4   72 → 1.9     67 → 1.4   61 → 0.8
94.4 → 3.8    87.8 → 3.3   81.3 → 2.8   76 → 2.3   71 → 1.8     65 → 1.2   60 → 0.7***
93.1 → 3.7    86.5 → 3.2   80 → 2.7     75 → 2.2   70 → 1.7**   64 → 1.1   < 60 → 0.0
91.8 → 3.6    85.2 → 3.1   79 → 2.6     74 → 2.1   69 → 1.6     63 → 1.0
*: 2.0 is the minimum grade required for any required INFO course to count towards an informatics degree.
**: The UW requires a 1.7 or better for non-degree requirements for undergraduate courses.
***: 0.7 is the lowest passing grade in an undergraduate course.
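
If you want to estimate your grade yourself, the conversion can be sketched in a few lines of Python. This is only an illustration, not the official calculation: the function names `course_percent` and `to_grade_point` are made up for this example, and it assumes that a percentage falling between two rows of the scale earns the grade of the nearest row at or below it, which the scale does not state explicitly.

```python
# Rows transcribed from the grading scale above as (minimum percentage, grade),
# highest first. The scale lists no row for 1.3, so none appears here.
SCALE = [
    (97.0, 4.0), (95.7, 3.9), (94.4, 3.8), (93.1, 3.7), (91.8, 3.6),
    (90.5, 3.5), (89.2, 3.4), (87.8, 3.3), (86.5, 3.2), (85.2, 3.1),
    (83.9, 3.0), (82.6, 2.9), (81.3, 2.8), (80.0, 2.7), (79.0, 2.6),
    (78.0, 2.5), (77.0, 2.4), (76.0, 2.3), (75.0, 2.2), (74.0, 2.1),
    (73.0, 2.0), (72.0, 1.9), (71.0, 1.8), (70.0, 1.7), (69.0, 1.6),
    (68.0, 1.5), (67.0, 1.4), (65.0, 1.2), (64.0, 1.1), (63.0, 1.0),
    (62.0, 0.9), (61.0, 0.8), (60.0, 0.7),
]


def course_percent(activities, reading, homework, project):
    """Add up the four components; the class has 100 points, so points == percent.

    Each component is capped at its maximum from the list above
    (12, 15, 33, and 40 points).
    """
    return (min(activities, 12) + min(reading, 15)
            + min(homework, 33) + min(project, 40))


def to_grade_point(percent):
    """Return the 4.0-scale grade for a percentage.

    Assumption: values between rows get the nearest row at or below them.
    """
    for minimum, grade in SCALE:
        if percent >= minimum:
            return grade
    return 0.0  # below 60%


if __name__ == "__main__":
    pct = course_percent(activities=10.5, reading=13, homework=28, project=35)
    print(pct, to_grade_point(pct))
```

Canvas remains the source of truth for your percentage; treat this only as a way to see how the four components combine.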

Late work receives no credit unless you can provide a note from a health care professional or provost documenting the reason for your absence, or you make arrangements with the instructor. However, you can miss up to 3 activities without penalty and without documentation. This should be enough to allow for sickness, unavoidable travel, or other personal matters.

If you miss a reading quiz due to sickness, you can make up the quiz credit by writing a 250-500 word critique of the reading and submitting it to your Google Drive folder within a week of the quiz you missed. Title the Google Doc with the class number and "make up quiz". For example, "2.3 make up quiz" is the make-up quiz for week 2, class 3 (the Wednesday lecture).

Activities

Each day in class we'll practice some skill. You'll get 0.5 points if you engage in and complete the activity. How to get credit will depend on the activity; sometimes being present will be enough, sometimes arriving to class on time will be enough, and sometimes you'll have to turn something in.

Reading

To access the readings, you will do the following:

  1. Click on the reading (a link to a Google doc) on the course schedule
  2. Copy the Google Doc to your personal INFO 370 folder (which we shared with you at the beginning of the course). Instructions on making a copy of a file in Google Drive.
  3. Read through the Google Doc. Highlight and comment on any parts which are confusing.
  4. Complete the questions marked "TODO".

You should complete your readings and reflection before the beginning of each lecture (twice a week). The Google Doc in your personal Drive folder is your submission (we are not using Canvas for readings). Each class, you'll come prepared to discuss the assigned reading.

The day that each reading is due, we'll do the following:

  • We clarify confusions based on reading reflections.
  • We give you some questions to answer individually about the assigned reading (a "Reading Quiz").
  • You turn in your answer.
  • You discuss your answers with your neighbor.
  • We discuss the correct answers as a class.

You will receive 0.75 points for completing the reading and reflection before class (on the Google Doc). You will receive up to another 0.25 points for getting the in-class reading quiz correct. We will give partial credit for partially correct answers on the reading quiz, at our discretion. In total, you can receive up to 1 point per reading.
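
As a quick illustration of that arithmetic, here is a minimal Python sketch. The function `reading_points` and its parameter `quiz_fraction_correct` are hypothetical names for this example; in practice, partial credit on the quiz is assigned at our discretion rather than by formula.

```python
def reading_points(completed_before_class, quiz_fraction_correct):
    """Points for one reading: 0.75 for the reflection plus up to 0.25 for the quiz.

    quiz_fraction_correct is a hypothetical stand-in (0.0 to 1.0) for the
    partial credit awarded on the in-class reading quiz.
    """
    if not 0.0 <= quiz_fraction_correct <= 1.0:
        raise ValueError("quiz_fraction_correct must be between 0 and 1")
    reflection = 0.75 if completed_before_class else 0.0
    return reflection + 0.25 * quiz_fraction_correct


if __name__ == "__main__":
    print(reading_points(True, 1.0))   # reflection done, quiz fully correct
    print(reading_points(True, 0.5))   # reflection done, half credit on quiz
    print(reading_points(False, 1.0))  # reflection missed, quiz fully correct
```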

Individual Homework

There will be about one individual homework assignment each week. These assignments are separate from reading assignments and project milestones. They will give you practice and feedback on the skills in a narrower context than your project. They will be due on the nearest Sunday.

All homeworks are due by 11:59:00 PM PST on the specified date.

The goal of the individual homework assignments is to check and deepen your understanding of specific concepts which are critical to your understanding of data science.

Project

The project is split across 8 milestones/assignments, each worth a different amount:

  1. Group formation and initial questions (2 points).
  2. Pilot study (3 points).
  3. Proposal (4 points).
  4. Proposal Review (2 points).
  5. Proposal Revision (3 points).
  6. Project check-in meeting (2 points).
  7. Project Fair (8 points).
  8. Artifact (16 points).
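
To see how the milestones add up to the project's 40 points, here is a small Python sketch. The dictionary simply restates the list above; the helper `project_points` and the idea of an earned fraction per milestone are assumptions made for illustration, not a description of how we grade.

```python
# Milestone point values, restated from the list above.
MILESTONES = {
    "Group formation and initial questions": 2,
    "Pilot study": 3,
    "Proposal": 4,
    "Proposal Review": 2,
    "Proposal Revision": 3,
    "Project check-in meeting": 2,
    "Project Fair": 8,
    "Artifact": 16,
}


def project_points(earned_fraction):
    """Sum project points given a hypothetical fraction (0 to 1) earned per milestone.

    Milestones missing from earned_fraction count as not yet earned.
    """
    return sum(points * earned_fraction.get(name, 0.0)
               for name, points in MILESTONES.items())


if __name__ == "__main__":
    print(sum(MILESTONES.values()))  # total available project points
    print(project_points({"Project Fair": 1.0, "Artifact": 0.5}))
```

Note that the Project Fair and the Artifact together account for 24 of the 40 points.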

All assignments except the Project check-in meeting are due by 11:59:00 PM PST on the specified date.

The goal of the project is for you to practice the process of data science to make or inform a decision, so you can experience the nuances of formulating a good question, setting up process, constraints, and plans in relation to a context. Note, however, that because the timeline for the project is so short, it won't give you a deep, longitudinal experience with data science, nor will it give you practice with massive complexity or scale. I believe these are experiences best left to practice in industry, as they're very difficult to replicate in the artificial setting of school.

Project Teams

Resources

Links to Data Science communities at/near UW:

Links to recommended learning resources (most of which are free)

  • R Graphics Cookbook (Chang, 2012): Practical book on data visualizations in R. Website provides R code with great explanations for common tasks.
  • Data mining with R : learning with case studies (Torgo, 2017): Practical book on data mining (a former buzzword, mostly like data science) using detailed case studies and good commentary. Good for learning R and how to apply data science to problems (without reframing those problems - it takes the questions mostly as given).
  • R for Data Science (Grolemund & Wickham): Free online textbook which teaches you to use R for data science. We use their chapter on exploratory data analysis (Ch 7) and modeling (Part IV).
  • Data + Design (Infoactive): Free and open-source ebook which (beautifully) introduces the fundamentals of data and how to prepare and visualize it. We draw upon their chapter on "Data Fundamentals."
  • Statistical Rethinking (McElreath, 2016): Textbook which provides a nice framing of statistics as engineering, teaching a Bayesian perspective to statistics with R. We use the first 3 chapters to introduce model creation.
  • RStudio Cheat Sheets: Fantastic cheatsheets for anyone learning or using R.
  • Datacamp: Online learning platform for data science that we use for some assignments.

Links to important UW resources:

  • Disability Services Office: If you require disability accommodations for this course, work with the DSO.
  • SafeCampus: Resources and points of contact to promote a safer UW community.