10 DataFrame Methods You Need to Know for Data Science

Zach English
Geek Culture
Published in
6 min readOct 4, 2021

--

Manipulating a Pandas DataFrame is the cornerstone of many data projects. I have provided the 10 most important DataFrame methods and functions that you need to know. Follow along as I explain them and prepare an NFL dataset for a sport betting model!

The fundamentals of any project involving large amounts of data are in the preparation. Cleaning, moving, verifying and morphing the data will set you up for success later on. The 10 methods listed below not in ranked order, they are listed in the order that I used them for this specific use case.

Can we accurately predict who will cover the spread?

I have gathered10 years worth of NFL game data with the end goal of setting up a DataFrame ready to create a model predicting if a team will cover the spread.

In our starting data, each row represents a team and their respective score for each game. In order to get the best data for the model we are going to create several new attributes for our data and then get the home and away teams together on each line. I gathered excel files for the last 10 years in this format from a few quick google searches.

As you can see below, we have some basic information for each game and some stats about the betting odds. Not bad. But we can make it much more useful with a few easy functions. Here is what our starting Data Frame initially looks like after it is read in from our working directory:

Initial NFL DataFrame
  1. where()

Where() is a numpy function that allows us to create a new column in our Data Frame based on parameters in another column. Due to the format of our data, I am going to separate the spread into its own column using where. In the data the spread is stored in the Close column for the home team. It works similar to an IF() statement in excel. If the condition is true then the second value is passed into the new column, if the condition is false then the final value is placed in the new column.

Creating a new Data Frame Column ‘Spread’ with only the spread values

2. gt() and lt()

Greater than and Less than are very useful Pandas functions for comparing values to a baseline or creating an indicator for later use. It returns a boolean value based on the comparison outcome. Here I am creating a new Data Frame column ‘Favorite_ML’. If the original column ‘ML’ is less than 0 then the newly created column will show a value of True. Favorites will have negative money line values and underdogs will have positive ones. Using lt() we can determine which team is favored in each game.

Creating a new Boolean column by using the less than function

3. diff()

Creating a Win Margin column based on the ‘Final’ differences

The diff() function will subtract the value in the Final column for row 2 from row 1. I am using this function to effectively get the winning margin of each game. It subtracts the Home team score from the away team’s score. This means that every odd row will have the correct margin of victory (positive or negative) and each even row will have the difference of two teams that did play each other.

We will filter these out by referencing the even index columns, creating a new DataFrame with those values and then moving the values back into our original DataFrame in a new and final winning margin column with odd index columns filled with zeros.

Filtering out the incorrect winning margin columns

4. drop()

After creating our new winning margin column (‘W_M’) we can drop our old one. You can specify multiple columns as once but we are only trying to remove ‘Win_Margin’.

Dropping the old Winning Margin column

Cleaning as you prepare your data is essential to eliminating noise in the data. It will make creating a model and working with the dataset easier going forward. This brings us the next function:

5. replace()

We are using this function to normalize the team names to the corresponding mascots to improve our data quality. Some teams were listed differently in each year of the data. For example the New York Jets were listed as ‘NYJets’ for several years and ‘NewYork Jets’ for another. Normalizing this will save you time and confusion as you progress. You can see what I did for all 32 teams below:

Cleaning the Team Names with replace()

6. drop_duplicates()

In order to check that I fixed all the team names I created a new DataFrame with only the team names. I then dropped the duplicates to check two things: 1. how many unique values were in the team names and 2. The team names were what I expected them to be. There are 32 teams in the NFL and we ended up with 32 unique Team names in the df_names DataFrame. We can now go back to our original DataFrame and continue.

Verifying that we have 32 unique team names

7 & 8. groupby() and get_group()

Using these functions we are going to split our Data Frame based on if the team is Home (‘H’) or Away (‘V’). By seperating the DataFrame by home or away, we can later merge our newly created home and away DataFrames in order to get both teams in the same row for each game.

The Resulting Home DataFrame :

Home_df first five rows

The Resulting Away Team DataFrame :

Away_df first five rows:

9. merge()

Prior to grouping the data in the previous step, I added a ‘Game ID’ column into the DataFrame. Now on the home and away DataFrames we will have a key to merge them back together on.

Merge is exceptionally useful when for combining data sources and data engineering like this use case. Merge is equivalent to a join in SQL. Common keys to join on include dates, IDS, and other unique identifiers. Once joined, we will call the result of our merged datasets ‘final’.

10. rename()

After the complex work, we are going to wrap up by cleaning up the attribute names. After the merge the attribute names end with ‘_x’ or ‘_y’ to distinguish where the came from. For our purposes we are going to change the x’s to home and y’s to away to make the DataFrame easier to read. In addition, I dropped some redundant columns from the merge. You can do this in the same column but I split it into multiple lines to make it easier to read.

The Final DataFrame!

Cleaned of hard to read attribute names, irrelevant columns and ready to be used in a number of different models! Pasted below is a snipet of the final DataFrame for those of you that followed along!

The Final cleaned DataFrame

Next Steps and Follow Up:

After cleaning and preparing our data, we are ready to take it to the next level and create a model. Stay tuned for a follow up on what I did with this data to predict NFL games.

Thank you for reading! Tweet at me @ZachEng1ish or shoot me an email zach.zlish@gmail.com with any questions!

--

--

Zach English
Geek Culture

Data engineering consultant and aspiring data scientist, writing about tech and finance