Confidence, Collaboration, and Coding
Jason Mix
Posted on December 19, 2023
We've just wrapped up our first projects for the Data Science bootcamp at Flatiron School. I'm not sure which was more daunting: the prospect of doing my first data science project ever or the fact that it was to be done in a group. But, knowing what I know now, I wouldn't have had it any other way.
As the project was assigned and explained, my stomach did a flip. Sure, we were given a business question to answer and a data set to use, and, sure, we were given some instructions for how to approach the project. However, this was quite different from anything I had done before. This was not a lab or a guided project in a Jupyter notebook that led me through someone else's thought process for how to answer the business question. There was no solution branch in a GitHub repository to check our approach or our code against. There were no step-by-step instructions for cleaning the data. We were not assigned a statistic to find or a visual to create--we had to decide for ourselves what statistics and visualizations would be useful for answering the business question.
I did not doubt my ability to write code that would remove null values or impute a central value. I knew I could create visualizations in Pandas, MatPlotLib, Seaborn, or Tableau alike. I could calculate statistics and create new columns in a Pandas DataFrame with the best of 'em. I even felt like I had a good understanding of the need to create normalized statistics in order to compare differently-sized data subsets. Yet, with all my skills and confidence, I found myself having a small panic attack when I was faced with the prospect of an open-ended assignment such as this.
I soon found my anxiety at the novelty of this assignment was compounded by the fact that we were working in a group. We did not have a project manager assigning tasks and roles and imposing deadlines. We were on our own to manage ourselves. I was used to working on my own--if I encountered a difficulty I knew I could work through it or find helpful resources. Working in a group, however, meant that we encountered these difficulties as a group. Troubleshooting an issue became much more complicated because we had to communicate the issue amongst ourselves, discuss and make sure everyone was understanding the issue the same way, and then assign roles in addressing it. I found that I could not rely solely on the brainpower and intuition that had gotten me to this point--I needed to communicate, collaborate, and occasionally acquiesce to the majority opinion of my group.
The goal of the project was to provide actionable insights to a businessperson looking to invest in airplanes. We were to use a dataset of aircraft accidents since 1962 to make recommendations that would minimize the investor's risk.
We hit a snag almost right away as a group. I felt that our first task was to clean and filter the data to create a master data set that we would all work from. Other groupmates wanted to throw the dataset into Excel in order to come up with some preliminary findings, get a sense of where we were going, and pick a direction for our project. We would try to make decisions about certain details of the data cleaning, which would lead to questions about where our project was going to go, which would lead back to, "We don't know where our project is going to go because we don't have a master data set yet."
Eventually, to my delight, this all came together. I was able to convince my groupmates of the importance of cleaning the data before determining what the data was going to tell us. At the same time, being able to quickly look at subsets of the data in Excel allowed us to make better decisions about the data cleaning process. For instance, as our dataset included many data points for aircraft other than airplanes (e.g., blimps, hot air balloons, etc.), we needed to filter our dataset to only include airplanes. However, there were many data points that did not specify the type of aircraft. It would have been easy enough to just drop all the rows that had a missing value in the 'Aircraft.Category' column. However, in Excel, it was easier to see a way around losing so much data (thousands of data points, in fact). We noticed that there were many data points where the type of aircraft was missing, but the make and model were present. If we could find another row that specified that make and model as an airplane, then we could deduce that this aircraft was an airplane as well.
This was easy enough to do with Python and Pandas. We made lists of all the makes and models where the 'Aircraft.Category' column specified 'Airplane'. Then we went row by row, checking if the 'Aircraft.Category' value was missing and if the make and model were in the aforementioned lists. If all of the above was true, then we would impute 'Airplane' for the 'Aircraft.Category'. See the code below:
import pandas as pd

#df is the aircraft accident DataFrame loaded earlier in the notebook

#First we filtered the DataFrame to only include rows where the
#Aircraft.Category column specified Airplane
airplane_df = df[df['Aircraft.Category'] == 'Airplane']

#Then we made respective lists of the makes and models of those
#airplanes
airplane_make_list = [make for make in airplane_df['Make']]
airplane_model_list = [model for model in airplane_df['Model']]

#Here we defined a function that checks for airplanes with the
#above makes and models that have a missing value for
#'Aircraft.Category' and imputes 'Airplane' where appropriate
def replace_airplane(row):
    if (pd.isnull(row['Aircraft.Category'])
            and row['Make'] in airplane_make_list
            and row['Model'] in airplane_model_list):
        return 'Airplane'
    else:
        return row['Aircraft.Category']

#Then we applied that function to the original DataFrame
df['Aircraft.Category'] = df.apply(replace_airplane, axis=1)

#Finally, we were able to filter for just airplanes without losing
#thousands of data points needlessly
airplane_df2 = df[df['Aircraft.Category'] == 'Airplane']
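One caveat worth noting: the code above checks the make and the model separately, so a row could be imputed even if that exact make and model never appeared together as an airplane. A stricter variant, which is only a sketch and not what we ran for the project, would match on the (make, model) pair instead. The toy rows and the function name `impute_airplane_by_pair` below are made up for illustration; only the column names come from our dataset.

```python
import pandas as pd

# Toy rows invented for illustration; only the column names match our dataset
toy_df = pd.DataFrame({
    'Make': ['Cessna', 'Cessna', 'Piper', 'Bell', 'Piper', 'Cessna'],
    'Model': ['172', '172', 'PA-28', '206', '206', 'PA-28'],
    'Aircraft.Category': ['Airplane', None, 'Airplane', 'Helicopter', None, None],
})

def impute_airplane_by_pair(df):
    """Fill a missing category only when the exact (Make, Model) pair
    appears elsewhere in the data labeled as an airplane."""
    result = df.copy()
    known = result[result['Aircraft.Category'] == 'Airplane']
    # Set of (make, model) tuples gives fast membership checks
    known_pairs = set(zip(known['Make'], known['Model']))
    pair_is_known = pd.Series(
        [pair in known_pairs for pair in zip(result['Make'], result['Model'])],
        index=result.index,
    )
    mask = result['Aircraft.Category'].isna() & pair_is_known
    result.loc[mask, 'Aircraft.Category'] = 'Airplane'
    return result

imputed_df = impute_airplane_by_pair(toy_df)
print(imputed_df)
```

In this toy example, the second Cessna 172 row gets filled in, while the row with make 'Cessna' and model 'PA-28' stays missing because that pair never shows up as a labeled airplane. The separate make and model checks would have filled that row in. Whether the stricter rule is worth the extra lost rows depends on how messy the make and model columns are.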
This is such a powerful example of the benefit of working in a group. Although we clashed at first, everyone brought a different set of skills and ideas. This allowed us to create a more thorough and accurate dataset and, therefore, more thorough and accurate insights.
I am so grateful for the experience of our first project. Even though I initially felt overwhelmed by the open-ended nature of the project, now I feel like if I got a similar project I would intuitively know what to do. This is a byproduct of being thrown into the deep end, so to speak. I've had this experience of getting a project, not quite knowing how to proceed, yet persevering and getting the project done one step at a time. I feel that this was an initiation that I can continue to build on in future projects.
Despite the fact that our group clashed at times, I know that I learned and grew from the experience. Our differences of opinions and intuitions about how to proceed ended up being a benefit. Furthermore, we learned how to communicate and assign roles/tasks. I very much look forward to our next project--I can't wait to continue to build my confidence and skills in collaboration as well as coding.