Wednesday, April 17, 2024
Analytics Education STEM

Combining multiple datasets into a single data file (Excel or CSV)

The Scenario

I want to analyze data for New England Patriots over the past 10 seasons. The data files are downloaded as a file by year. So, for 2010 through 2019 NFL season, I have 10 separate data files. I need to combine them all into a single file so I can do an all-up analysis,  do various pivoting, charting, etc.

The Solution

I could open each file, copy the entire range, create a new workbook and paste into its new sheet, and repeat for each data file. Or, I could make Python do it for me in seconds without having to even launch Excel and click anywhere! I’ll put all the files in a single directory, then start combining all those files into a new file in that directory that contains all 10 years worth of data.

Code Walkthrough

I’ll use Pandas library’s dataframe features. I’ll also use os library for traversing system directory.

import os
import pandas as pd

Then I need an empty dataframe where my merged content will be held.

new_file = pd.DataFrame()

Let’s get the current working directory first, I’ll save it to restore to it after I’m done, and change to the directory that contains all the separate data files. I’ll pretend they’re in a folder called ‘Stats-2010-2019-patriots’ in C’s root.

startdir=os.getcwd()
datasourcedir="C:\\Stats-2010-2019-patriots"
os.chdir(datasourcedir)

Then I want to load all files with a specific extension from that directory only. In this case, my downloads have the extension XLSX

allfiles = os.listdir(datasourcedir)
xlsx_files = [i for i in allfiles if i.endswith('.xlsx')]

Let’s look at the names to verify that we found all expected files:

print(xlsx_files)

Then I’ll read the files into a dataframe object df:

df=pd.read_excel("2010_sportsref_download_pat.xls.xlsx")
for file in xlsx_files:
     df=pd.read_excel(file) # this creates a df with each file name from the list files
     new_file = pd.concat([new_file, df])

The concat() keeps on appending to the data we opened first as it loops through all matching files one by one, loading their respective data.

Finally, the new file is persisted as ‘combined_data_python.xlsx’ in the same directory:

new_file.to_excel("combined_data_python.xlsx", index=False)

Let’s then restore the original code directory for next session:

os.chdir(startdir)

And that is how it is done 🙂 To see what I did next with the combined file, read this post.


This post is not meant to be a step-by-step, detailed tutorial, instead to serve key concepts, and approaches to problem-solving. Basic->Intermediate technical/coding knowledge is assumed.
If you like more info or source file to learn/use, or need more basic assistance, you may contact me at tanman1129 at hotmail dot com. To support this voluntary effort, you can donate via Paypal from the button in my Home page, or become a Patron via my Patreon site. A supporter/patron gets direct answers on any of my blogs including code/data/source files whenever possible upon request.

Leave a Reply

Your email address will not be published. Required fields are marked *

Back To Top