Workshop Schedule
(In Central Time)

July 26
July 27
July 28
  • 10:00AM - Overview of Cloud Computing Platforms
    Judy Qiu
    @Indiana University 
  • 10:45AM - Introduction to Azure
    Jaliya Ekanayake
    @Indiana University & Microsoft Research 
  • 11:30AM - Break (lunch for Eastern, Central time)
  • 12:30PM - Introduction to DryadLINQ
    Christophe Poulain
    @Microsoft Research 
  • 1:30PM - Iterative MapReduce
    Jaliya Ekanayake
    @Indiana University & Microsoft Research
  • 2:00PM - Break (lunch for Mountain, Pacific time)
  • 3:00PM - Iterative MapReduce (continued)
    Jaliya Ekanayake
    @Indiana University & Microsoft Research
  • 4:30PM - AzureMapReduce
    Thilina Gunarathne
    @Indiana University & IBM Research 
  • 5:00PM - Hands-on & Laboratory Time
  • 7:00PM - Local Activities
July 29
July 30

Big Data for Science Workshop

July 26-30, 2010, NCSA Summer School

Humans are generating, sensing, and harvesting massive amounts of digital data, and many of these unprecedentedly large data sets will be archived in their entirety. We find ourselves surrounded by huge volumes of "data at rest," that is, data written once and destined to live forever. Data movement will become the exception rather than the rule.

Digital data owners will control the data distribution channels via "cloud computing" infrastructure where data is unstructured and devoid of schema, begging for semantic metadata, preservation, and curation. The familiar notions of sequential or random access files no longer apply in the cloud. Instead, developers will write code that mines this mass of unstructured data, extracts what is of interest, and then inserts the resulting data subset into a relational database or other structured data store where it will be analyzed and visualized.

The disciplines at the forefront of this paradigm shift are astroscience, bioscience, geoscience, and the social sciences. Science communities will learn how to manage this morass of data by refining the techniques pioneered by Google and Facebook and, more importantly, by inventing new techniques that meet the specific demands of their scientific disciplines.

As the computing landscape becomes increasingly data-centric, computational scientists will employ new tools based on new models of computation. In a data-intensive world where the sheer volume of data demands new approaches and techniques, the inclination is to move the computation to the data, a basic theme underlying this course. Called the "fourth paradigm" (after theory, experiment, and computation), data-intensive computing is poised to transform scientific research.

Students will learn about:

Participants will get hands-on programming experience with data-intensive computing languages such as MapReduce.
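
For readers new to the MapReduce model named above, the following is a minimal single-process sketch in Python of its three steps (map, group by key, reduce) applied to a word count. It is an illustration only, not workshop material: the function names (map_phase, shuffle, reduce_phase, word_count) and the sample records are hypothetical, and a real framework would run the map and reduce steps in parallel across many machines.

```python
from collections import defaultdict


def map_phase(record):
    """Emit a (word, 1) pair for every word in one input record."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group emitted values by key, as a framework does between the two phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values collected for one key into a single result."""
    return key, sum(values)


def word_count(records):
    """Run map, shuffle, and reduce in sequence over an iterable of strings."""
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    # Hypothetical sample input; each string stands in for one record.
    sample = ["big data for science", "data at rest", "move computation to data"]
    print(word_count(sample))
```

Because each call to map_phase depends only on its own record, and each call to reduce_phase only on its own key, both steps can be distributed to wherever the data is stored.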

Geoffrey C. Fox, distinguished scientist and director, Community Grids Lab, Pervasive Technology Institute, Indiana University

Judy Qiu, assistant director, Community Grids Lab, Pervasive Technology Institute, Indiana University


Course outline:

NOTE: Students are required to provide their own laptops.

The following sites are fully participating in the Big Data for Science course:

The following sites will host remote presenters (with no audience):