Data is all around us. It is being gathered and stored in databases and data warehouses at unprecedented and ever-growing rates, pertaining to every imaginable field of human endeavor -- science, politics, arts, commerce, literature, entertainment, medicine, social media, sports, etc. It is imperative that we develop methods of transforming this overwhelming collection of data into practical, actionable information. One way to do so is to use database query systems to extract factual, or "shallow," knowledge. In this course, however, we will investigate the theory and practice behind extracting patterns or regularities from data that represent "hidden" knowledge. This topic is referred to as data mining, and sometimes as knowledge discovery or predictive analytics.
Our focus will be on several powerful data mining strategies and the various algorithms used to implement those strategies. These algorithms involve machine learning, whereby computers learn important concepts from the data and express them using a variety of accessible models. We will experiment with several different strategies, including classification, numerical estimation, prediction, clustering, and association learning. These concepts will be applied to a variety of domains, including business, the arts, the social sciences, and the natural sciences, always with an emphasis on understanding and predicting human behavior.
We will primarily use an open source tool called WEKA for these investigations, along with Google's n-gram viewer and some features of Excel. The techniques learned in the course will be put to use in a significant team-oriented term project.
Specific topics to be covered include:
- Data mining terminology
- Appreciating the power of data mining through a variety of examples from different domains
- Recognizing how data mining can be used to describe and predict human behavior in different contexts
- Understanding when data mining is an appropriate problem solution
- Finding, assembling, and preparing data for the mining process
- Understanding basic data mining strategies and algorithms and learning how to apply them appropriately
- Recognizing how knowledge of important concepts can be represented and communicated effectively via different types of models and visualizations
- Evaluating the accuracy and effectiveness of models resulting from data mining
- Applying the results of an accurate model to real-world problems relating to human behavior
- Considering the ethical issues raised by data mining technology
- Investigating and discussing how data mining is utilized in society on a daily basis, and how it can enlighten and inform us about our own human culture, past and present
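As a small taste of what evaluating the accuracy of a model can mean, here is a minimal sketch in Python. It is not one of the course tools, and the toy data and function names are hypothetical, chosen only for illustration. It "learns" the simplest possible classifier, one that always predicts the most common outcome seen in the training data, and then measures how often that prediction is right on data held back for testing.

```python
from collections import Counter


def train_majority_baseline(labels):
    """Return the most common label in the training data."""
    return Counter(labels).most_common(1)[0][0]


def accuracy(predicted, actual):
    """Fraction of predictions that match the actual labels."""
    matches = sum(p == a for p, a in zip(predicted, actual))
    return matches / len(actual)


# Hypothetical toy data: did a customer renew a subscription?
train_labels = ["yes", "yes", "no", "yes", "no", "yes"]
test_labels = ["yes", "no", "yes", "yes"]

# "Training" here just means finding the majority outcome.
majority = train_majority_baseline(train_labels)

# The baseline predicts that same outcome for every held-out case.
predictions = [majority] * len(test_labels)
baseline_accuracy = accuracy(predictions, test_labels)

print(f"baseline predicts {majority!r}; held-out accuracy = {baseline_accuracy:.2f}")
```

A baseline like this is a useful yardstick: a model that mines real patterns from the data should be expected to do better than always guessing the majority outcome.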
In short, the primary objective of the course is to build a foundation for using large collections of data to better understand ourselves and the world we live in. This foundation, built upon algorithmic problem solving and special-purpose software, will also give you the critical ability to understand -- and contribute to -- future developments and innovations in data science.
I am very excited about this offering of CSC-272, and am open to your suggestions about directions the course might take. Please feel free to share them with me.