Introduction

Course Topics

In the age of ubiquitous data the ability to effectively access, manipulate, and analyze data to support the improvement of organizational performance has become increasingly valuable. Business analytics is the application of these data skills to business. This book introduces a set of free, powerful tools for working with data based on the python programming language and applies them to the fundamentals of business analytics.

We will first learn how to clean, subset, summarize, reshape, reformat, and visualize data efficiently using python-based tools. This set of skills is sometimes referred to as data munging. It is useful in almost any field.

Next, we will cover two important techniques from inferential statistics: multiple linear regression and logistic regression. These techniques are frequently used in business research and are typically not covered in a first statistics course, so we cover them in this book to extend your statistical knowledge. We also cover them because they provide a nice segue into our final topic: machine learning for predictive analytics.

Machine learning makes use of a variety of underlying algorithms, including statistical models such as linear regression and logistic regression, to learn from historical data and outcomes how to predict numeric or categorical outcomes when presented with new input data.
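To make the idea of learning from historical data concrete, here is a minimal sketch using scikit-learn's LinearRegression. The advertising and sales numbers are made up purely for illustration, and the variable names are hypothetical rather than taken from any dataset used later in the book.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up historical data: advertising spend (input) and sales (outcome).
ad_spend = np.array([[1.0], [2.0], [3.0], [4.0]])
sales = np.array([3.0, 5.0, 7.0, 9.0])  # constructed so that sales = 2 * spend + 1

# Learn the relationship from the historical inputs and outcomes.
model = LinearRegression()
model.fit(ad_spend, sales)

# Predict a numeric outcome for a new, previously unseen input.
predicted = model.predict(np.array([[5.0]]))
print(predicted)
```

The same fit-then-predict pattern applies when the outcome is categorical. In that case a classifier such as logistic regression would take the place of the linear regression model.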

Why Python?

Many of you have no background in programming other than the BUSBIS 100 - Programming Essentials class. Many of you also likely came to Pitt Business with no idea that you would be required to take a series of courses that require python programming. And, I would venture to guess that some (many?) of you may be asking yourselves, “Why do we need to learn python anyway? We’re business students!”

The answer to that question has two parts. In short, businesses are demanding employees with data manipulation and analysis skills (business analytics skills), and python-based tools are among the best tools for performing data manipulation and analysis. One of the reasons Pitt Business is requiring students to take the Business Analytics series of courses is that employers have let us know that they need employees with data manipulation and analysis skills, particularly python skills.

So, why python? I will provide some reasons here. For now, you can take my word for it, but as the course progresses hopefully you will come to some of these same conclusions.

Python for Data Munging

Businesses and other organizations have large amounts of data that they can potentially analyze to describe, assess, and improve their business processes. Unfortunately, data is not always directly usable in its native form. Analysts typically need to clean the data, which refers to tasks such as dealing with missing values, correcting or removing errors, changing data formats, and removing duplicates. After cleaning the data, the analyst may need to filter the data to select a subset, join the data to another set of data, create summary statistics for the entire dataset or for different groups within the dataset, or create new columns of data by transforming the existing data or making calculations based on the existing data. Next, the analyst might need to create some visualizations of the data, such as histograms, scatterplots, bar charts, etc.
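Several of the tasks just described can be sketched in a few lines with pandas. The example below is only an illustration: the order records, column names, and manager names are all made up, and filling a missing quantity with 0 is an assumption chosen for simplicity, not a general recommendation. Visualization is left out to keep the sketch short.

```python
import pandas as pd

# Made-up order records, including one duplicated row and one missing value.
orders = pd.DataFrame({
    "order_id": [1, 2, 2, 3, 4],
    "region": ["East", "West", "West", "East", "West"],
    "units": [10, 5, 5, None, 8],
    "unit_price": [2.0, 3.0, 3.0, 2.0, 3.0],
})

# Clean: remove the duplicated order and fill the missing quantity with 0.
clean = orders.drop_duplicates().copy()
clean["units"] = clean["units"].fillna(0)

# Create a new column calculated from existing columns.
clean["revenue"] = clean["units"] * clean["unit_price"]

# Join: attach a (hypothetical) manager name to each order by region.
regions = pd.DataFrame({"region": ["East", "West"], "manager": ["Ana", "Bo"]})
joined = clean.merge(regions, on="region", how="left")

# Filter: keep only the orders that produced revenue.
sold = joined[joined["revenue"] > 0]

# Summarize: total revenue for each region.
revenue_by_region = sold.groupby("region")["revenue"].sum()
print(revenue_by_region)
```

Each step is a single readable statement, so the code doubles as a record of exactly what was done to the data.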

There are tools for doing these tasks using a graphical user interface (GUI). For example, you might use Microsoft Excel or Power BI. Such tools, however, are typically expensive, and are not as powerful or efficient as python or other code-based tools, such as the programming language R.

Python is a computer programming language. Writing python instructions, or coding in python, is done by writing plain text statements that are then interpreted by the python interpreter. Performing data munging and visualization using python (or other code-based tools) has several advantages over performing such tasks with a GUI application.

  • Python is open-source and free to use - Python is free to use, even for commercial purposes, and the source code for python is freely available, which makes it easier to do security audits on the codebase or customize some aspect of python’s behavior.
  • Code is self-documenting - Once you have written the code to do the data manipulation or analysis you have excellent documentation of what you did to the data, in the form of the code itself! With a GUI application it is more difficult to document what you did to the data, because you would need to keep track of what you did to the data and what menus, checkboxes, dropdowns, etc. you utilized.
  • Analyses done with code are easily repeatable - When you do an analysis with code you can redo the entire analysis simply by re-running the code. With a GUI application you would need to go back and click on all the same menus and dropdowns, check the same boxes, etc. This is a huge advantage for code, because analyses are typically run more than once. One reason for running an analysis more than once is that you realize that you need to revise the analysis in some way. With code you can simply revise relevant parts of the code and then re-run the analysis. Another reason for running an analysis multiple times is that the analysis may need to be done on a periodic basis. For example, a business may need to generate reports every week, or analyze sales every month. Once the code is set for such a periodic analysis it can simply be run on new data each period.
  • Code is easy to store, share, and collaborate on - Code such as python code can be stored on disk without taking up much space, can be easily shared with others, and is amenable to source code control systems such as git, which facilitate code evolution and collaboration on the code by multiple programmers.
  • Python has powerful tools available as packages - Because python is such a nice language to work with, several tools have been developed as add-on packages that make extra functionality available in python code. For example, we will be working with a freely available add-on package called pandas that has very powerful tools for data munging. pandas is used heavily by data analytics professionals in many different fields, and has become an industry-standard tool. Try doing a job search with pandas added as one of the keywords and you will see lots of jobs for which pandas is a required or desired skill. In this course we will also work with several other add-on packages that are widely used for data visualization, statistical analysis, and machine learning.
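The repeatability point in the list above can be illustrated with a small sketch. It uses only plain python, and the function name, region names, and sales figures are all hypothetical. The idea is that the analysis is written once and then re-run, unchanged, on each new period's data.

```python
from collections import defaultdict


def sales_by_region(rows):
    """Return total sales per region for a list of (region, amount) rows."""
    totals = defaultdict(float)
    for region, amount in rows:
        totals[region] += amount
    return dict(totals)


# Made-up data for two reporting periods.
week_1 = [("East", 120.0), ("West", 80.0), ("East", 30.0)]
week_2 = [("East", 95.0), ("West", 140.0)]

# The same analysis code is run on each period's data.
report_1 = sales_by_region(week_1)
report_2 = sales_by_region(week_2)
print(report_1)
print(report_2)
```

If the analysis needs to be revised, only the function changes, and every report can then be regenerated by running the code again.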

Python for Machine Learning

There are many packages available for doing machine learning in python. The one we will use is called scikit-learn. There are also GUI applications available that may be used to do machine learning analyses, but scikit-learn is free, powerful, and widely used in business organizations. Using a code-based tool such as python for machine learning has many of the same advantages as those listed above for using code-based tools for data munging. In addition, as we will see, machine learning is typically conducted using nested layers of iterative steps. Computer code is well suited for implementing nested iterative tasks. With GUI applications the iterative and nested nature of the analyses is necessarily hidden behind the scenes.
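As a rough illustration of what nested iterative steps can look like in code, the sketch below tries several settings of a model and, for each setting, repeatedly trains and tests the model on different splits of the data. The dataset is synthetic, the candidate values are arbitrary, and this is only one possible arrangement of such loops, not necessarily the workflow used later in the book.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

# Synthetic data standing in for historical inputs (X) and outcomes (y).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

candidate_C = [0.01, 0.1, 1.0, 10.0]  # arbitrary regularization settings to compare
kfold = KFold(n_splits=5, shuffle=True, random_state=0)

mean_accuracy = {}
for C in candidate_C:  # outer loop: one pass per model setting
    fold_scores = []
    for train_idx, test_idx in kfold.split(X):  # inner loop: one pass per data split
        model = LogisticRegression(C=C, max_iter=1000)
        model.fit(X[train_idx], y[train_idx])
        fold_scores.append(model.score(X[test_idx], y[test_idx]))
    mean_accuracy[C] = float(np.mean(fold_scores))

# Pick the setting with the highest average accuracy across the splits.
best_C = max(mean_accuracy, key=mean_accuracy.get)
print(mean_accuracy)
print(best_C)
```

scikit-learn also provides helpers that run loops like these for you, but writing them out makes the nesting visible.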

Learning to Code

Computer programming, informally referred to as coding, is sometimes labeled as difficult or tricky. College students required to do coding sometimes ask their instructors, “Why do I have to do this with code? I can do it in Excel [or insert some other GUI application here] and coding is difficult!”

First, as outlined in the sections above, code-based approaches have several advantages over GUI applications. They are flexible, powerful, repeatable, easy to document and communicate, and are typically available for free. Moreover, because of the flexibility and power of code-based approaches you often can do things with code that you cannot do with GUI applications.

Second - and you might be surprised to hear this - accomplishing a task with code is often easier than using a GUI application! Why do you think coding is used so frequently in real-world organizations? It isn’t because the programmers want to show how smart they are; it is because for many tasks coding is the easiest or only way to get the job done!

The caveat is that coding is only easier after you are comfortable with the programming paradigm and have had a little practice. Most of you have grown up using GUI applications, so they feel familiar and comfortable to you. When you need to accomplish a data-related task you will naturally select the tools with which you have the most familiarity and comfort. This class gives you an opportunity to reach a level of comfort and familiarity with coding that enables you to use code-based tools to do some really amazing things!

Here are some tips about how to learn to code:

Look for opportunities to practice!

Would you expect to be able to do a complex dance or gymnastics move, or execute a slick dribble move with a basketball immediately after it was demonstrated by your coach? Of course not! Programming is the same way! To get good at it you need to practice. You will have assignments in this course that give you an opportunity to practice, but you should try to go beyond just completing the assignments. You can review examples in the book or in notebooks posted by the instructor. You can try to do some of the optional practice problems that the instructor will provide, and then compare your work to the solutions. Finally, you should look for opportunities to use programming in your other classes or in your activities or internships. You may have assignments or projects in your other classes that require you to analyze data. See if you can do them with python!

Become comfortable with error messages

Coding syntax is very precise, and if you make an error the interpreter will tell you about it by returning an error message. Don’t let errors throw you off. Even experienced programmers often receive error messages on their first attempt at a coding task. I imagine that you will see me get some error messages when we are doing live coding in class!

When you get an error message, look at the top of the message to see if you can determine what code caused the error, and then scroll to the bottom of the message, where you will usually find a short summary of the error. Sometimes those will be the only clues you need to figure out how to fix the problem. Another thing you can do is look over your code and see if you can find any obvious errors. If those approaches don’t work, you can Google the error message to see if you can find any insight online. If that still doesn’t help, show your code to me or one of the TAs and we will help you troubleshoot.
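Here is a small sketch of what the top and bottom of an error message look like. The dictionary and its contents are made up, and the code deliberately asks for a key that does not exist. Normally python would simply display the error message; here the code captures it with the standard traceback module so that its first and last lines can be printed separately.

```python
import traceback

# Made-up price list; note that "cherry" is not one of the keys.
prices = {"apple": 1.25, "banana": 0.50}

try:
    total = prices["apple"] + prices["cherry"]  # this line causes the error
except KeyError:
    # Capture the full error message as text instead of letting the program stop.
    message = traceback.format_exc()

lines = message.strip().splitlines()
first_line = lines[0]   # the header at the top of the message
last_line = lines[-1]   # the short summary at the bottom of the message
print(first_line)
print(last_line)
```

The last line names the type of error (a KeyError) and the value that caused it ('cherry'), which is often enough to locate the mistake. The lines in between, not printed here, point to the statement that triggered the error.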

As you get more practice you will often know immediately after glancing at the error what went wrong and be able to quickly fix your code.

So, the bottom line is to treat error messages as learning opportunities. Don’t let them cause you to be stressed or to give up.

Develop the habit of working from examples

Don’t feel like you need to write all your code from scratch. Even professional programmers do a lot of their work by starting with existing code and then adapting it to a new purpose. When programmers want to learn how to do a new thing with code they may Google it or ask ChatGPT and then adapt the code examples they find to their specific purpose.

This textbook contains lots of example code. I will be working through other examples in class and sometimes posting still more practice tasks with example solutions. When the time comes to do a task for an assignment the first thing you should do is see if you remember how to do that task. If you don’t remember, look through the textbook and class notes and you are almost sure to find an example that is very similar! Take that example and adapt the code to the specific requirements of the assignment task.

Another way you can utilize this principle is to look at code written by good coders and try to learn from it. You can find many code examples in books and online. If there are a lot of things in the code that are new you should choose one or two, look them up, and see if you can figure out how they work. You can do this by looking them up in the online documentation, and this leads to the next point.

Make use of official documentation

The tools we will be using in this course, including python, pandas, numpy, matplotlib, seaborn, statsmodels, and scikit-learn, all have official websites. These websites have documentation sections to help new users get started, as well as more detailed sections that can be consulted when troubleshooting or learning more advanced techniques. The official documentation for a tool is often better than other books or online tutorials because it is typically written by the developers of the tool, who have intimate familiarity with how it works, and because it is usually kept up to date as the tool changes. When you use online tutorials other than the official documentation you sometimes will encounter code examples that are out of date, that don’t use the tool in the best way, or that are simply bad coding. Developing the habit of using the official documentation for a tool will help you not only in this course but throughout your studies and your career.

Try to enjoy learning to code

Humans, by nature, are wired to enjoy learning. However, being asked to learn as part of a class, a process that requires you to complete defined assignments at specific times and have your work evaluated and graded, can sometimes remove the joy of learning. Try not to let this happen! Just because you are being graded on it doesn’t mean you can’t enjoy it!

By the end of this course you will have some very useful skills that will enable you to do things that most business students cannot do. You will be able to manipulate and summarize data, create interesting visualizations, and build predictive analytics models with python-based tools. The problem-solving and creative building involved in doing these things can be fun, and it is satisfying to see the completed products you create. Don’t let the stress of doing a class for a grade prevent you from enjoying the learning experience and marveling at the things you are learning and doing!