INFO 4871/5871

Web Data Science

INFO 4871/5871 “Web Data Science” is a semester-long course, cross-listed as an undergraduate elective and a graduate course. The internet makes many kinds of information easy to access, and the ability to retrieve, parse, and analyze that information is a valuable skill for data scientists. This course provides an overview of computational tools and practices for transforming web documents and API responses into data for common research designs.

Learning objectives

  • Understand the legal and ethical contours of web data access
  • Navigate and parse common web data formats like XML and JSON
  • Retrieve and automate data extraction from HTML and PDF documents
  • Access popular APIs to collect data for common research designs
  • Understand the methods and research designs for using web tools to audit algorithmic behavior
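As a rough illustration of the XML and JSON objective, the sketch below parses the same small set of records from both formats using only the Python standard library. The records, field names, and helper functions (`rows_from_json`, `rows_from_xml`) are hypothetical and invented for this example; they are not taken from the course materials.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical records, invented for illustration only.
json_text = '{"pages": [{"title": "Example A", "views": 10}, {"title": "Example B", "views": 25}]}'
xml_text = """<pages>
  <page title="Example A" views="10"/>
  <page title="Example B" views="25"/>
</pages>"""


def rows_from_json(text):
    """Return (title, views) tuples from a JSON document with a top-level 'pages' list."""
    return [(p["title"], p["views"]) for p in json.loads(text)["pages"]]


def rows_from_xml(text):
    """Return (title, views) tuples from <page> elements; XML attributes are strings, so cast views."""
    root = ET.fromstring(text)
    return [(p.get("title"), int(p.get("views"))) for p in root.iter("page")]


json_rows = rows_from_json(json_text)
xml_rows = rows_from_xml(xml_text)
```

One difference worth noticing: JSON carries numbers as numbers, while XML attribute values arrive as strings and need an explicit conversion before analysis.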


Module        Week  Skills
Fundamentals  1     Introductions
              2     XML & JSON
              3     Protocols
Structure     4     Static web pages
              5     Archived web pages
              6     Dynamic web pages
              7     PDFs
Dynamics      8     APIs
              9     Wikipedia
              10    Census
              11    Homophily and selection
              12    Automation
Applications  12
              13    Fall Break
              14    Final Projects
              15    Final Projects
              16    Final Projects

Course materials