SpaceX launch exploration

Analysis and report of SpaceX launches as part of the IBM Data Science capstone project

Data Science


This analysis and report are part of the capstone project of the IBM Data Science Professional Certificate. Here we examine the capabilities of launching and returning rockets in relation to their payload and other factors.

Executive Summary

  • Intro: Data collection (API, Web Scraping), Data Processing
  • Methodology: EDA, EDA with Data Visualization (Folium, Plotly Dash), Predictive Analysis (Classification)
  • Insights and predictions into launch outcomes in relation to payload, launch site

Project background and context

  • The aerospace industry is a cost-intensive business. First stages of rockets are large and expensive.
  • Recovering the first stages can be an immense cost relief.
  • First-stage boosters differ in their capability to transport equipment (payload) to space.


Problems we want to explore

  • Insights into the reuse of rocket stages
  • Determine whether we can predict if the first stage will land, and with it the launch cost.


Methodology

  • Data collection: SpaceX REST API and web scraping
  • Data wrangling: JSON normalized, data sampled, nulls handled, new aggregated data columns created
  • Exploratory data analysis (EDA) using visualization and SQL
  • Interactive visual analytics using Folium and Plotly Dash
  • Predictive analysis using classification models: class label created, data standardized, data split into train and test sets, best hyperparameters searched (SVM, Classification Tree, Logistic Regression)
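The hyperparameter search named in the last step can be illustrated with a small grid search over cross-validated accuracy. This is a minimal plain-Python sketch, not the notebook code: the k-nearest-neighbours classifier, the single synthetic feature, the grid of `k` values, and all function names are assumptions made for illustration.

```python
import random


def knn_predict(train_x, train_y, point, k):
    """Predict 0 or 1 by majority vote among the k nearest training points."""
    nearest = sorted(zip(train_x, train_y), key=lambda pair: abs(pair[0] - point))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if votes * 2 > k else 0


def cross_val_accuracy(xs, ys, k, folds=5):
    """Mean accuracy of k-NN over `folds` interleaved validation folds."""
    scores = []
    for fold in range(folds):
        val_idx = set(range(fold, len(xs), folds))
        train_x = [x for i, x in enumerate(xs) if i not in val_idx]
        train_y = [y for i, y in enumerate(ys) if i not in val_idx]
        hits = sum(knn_predict(train_x, train_y, xs[i], k) == ys[i] for i in val_idx)
        scores.append(hits / len(val_idx))
    return sum(scores) / folds


# Synthetic data (not the project's data): one feature, label 1 above 5.0.
rng = random.Random(0)
xs = [rng.uniform(0, 10) for _ in range(40)]
ys = [1 if x > 5 else 0 for x in xs]

# Grid search: score every candidate k and keep the best one.
grid = [1, 3, 5, 7]
scores = {k: cross_val_accuracy(xs, ys, k) for k in grid}
best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])
```

The same loop applies to any model with tunable settings: each candidate setting is scored on held-out folds, and the best-scoring one is kept.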

Data Collection – API and Web Scraping

  • Data collected from the SpaceX REST API, including additional endpoints
  • Web scraping of Wikipedia (with BeautifulSoup)
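The API part of the collection step might look roughly like the sketch below. It is an illustration only: the endpoint URL, the field names (`rocket`, `id`, `name`), and the helper names are assumptions, and the demo runs on a small inline sample instead of a live request.

```python
import json
from urllib.request import urlopen

# Assumed endpoint of the public SpaceX REST API; adjust if it differs.
LAUNCHES_URL = "https://api.spacexdata.com/v4/launches/past"


def fetch_json(url, timeout=10):
    """Download a URL and decode its JSON body (performs a network call)."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


def attach_rocket_names(launches, rockets):
    """Add a rocket name to each launch, looked up from a second endpoint's data."""
    names_by_id = {rocket["id"]: rocket["name"] for rocket in rockets}
    return [
        {**launch, "rocket_name": names_by_id.get(launch["rocket"])}
        for launch in launches
    ]


# Hypothetical sample standing in for the responses of two endpoints.
sample_launches = json.loads(
    '[{"flight_number": 1, "rocket": "r1"}, {"flight_number": 2, "rocket": "r9"}]'
)
sample_rockets = [{"id": "r1", "name": "Falcon 9"}]

enriched = attach_rocket_names(sample_launches, sample_rockets)
print(enriched)
```

With network access, `fetch_json(LAUNCHES_URL)` would supply the launch list; IDs that have no match in the second data set are kept with a name of `None` rather than dropped.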


Data Wrangling

  • The API requests returned data in JSON format
  • The JSON data was normalized into a dataframe
  • Additional data was acquired through other API endpoints (rockets, launchpads, etc.)
  • Data was sampled (head) and null values were handled
  • The number of launches at each site was calculated
  • A landing outcome column (success class of the landing) was created
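The wrangling steps above can be sketched in plain Python as follows. The project notebooks worked with dataframes; this stand-in uses only the standard library. The record layout, the mean imputation of missing payload masses, and the outcome strings (`"True ASDS"`, `"None None"`, ...) are assumptions for illustration.

```python
from collections import Counter
from statistics import mean

# Hypothetical nested records, shaped like a JSON API response.
raw = [
    {"flight": 1, "site": "CCAFS SLC 40", "payload": {"mass_kg": 500.0}, "outcome": "None None"},
    {"flight": 2, "site": "VAFB SLC 4E", "payload": {"mass_kg": None}, "outcome": "False Ocean"},
    {"flight": 3, "site": "CCAFS SLC 40", "payload": {"mass_kg": 3100.0}, "outcome": "True ASDS"},
]


def flatten(record, prefix=""):
    """Turn nested dicts into one flat dict with dotted keys."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat


rows = [flatten(record) for record in raw]

# Replace missing payload masses with the mean of the known values (assumed strategy).
known = [row["payload.mass_kg"] for row in rows if row["payload.mass_kg"] is not None]
for row in rows:
    if row["payload.mass_kg"] is None:
        row["payload.mass_kg"] = mean(known)

# Count launches per site.
launches_per_site = Counter(row["site"] for row in rows)

# Binary landing class: 1 if the outcome string starts with "True", else 0.
for row in rows:
    row["class"] = 1 if row["outcome"].startswith("True") else 0

print(launches_per_site)
print([row["class"] for row in rows])
```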

Exploratory Data Analysis (EDA) with Data Visualization

  • Variables: Payload Mass, Flight Number, Launch Site, Orbit Type, Success Rate
  • Gained insights about the best launch sites (CCAFS)
  • Success rates in relation to launch sites
  • The overall success rate has been rising since 2013

Flight Number vs. Launch Site

Payload vs. Launch Site

  • The VAFB site has no launches over 10,000 kg
  • CCAFS has the most launches with the heaviest payloads
  • CCAFS launches with payloads under 8,000 kg show a higher failure rate

Success Rate vs. Orbit Type

  • ES-L1, GEO, HEO, SSO: highest success rates, around 100%
  • GTO: lowest success rate, around 50%
  • SO: probably no data or 100% failure
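A success rate per orbit type is simply the share of launches in each group whose landing class is 1. The sketch below shows that calculation on hypothetical rows (not the project's data); the function name and the `orbit` and `class` keys are assumptions.

```python
from collections import defaultdict


def success_rate_by(rows, key):
    """Return {group: share of rows with class == 1} for the given key."""
    totals = defaultdict(int)
    successes = defaultdict(int)
    for row in rows:
        totals[row[key]] += 1
        successes[row[key]] += row["class"]
    return {group: successes[group] / totals[group] for group in totals}


# Hypothetical rows for illustration only.
rows = [
    {"orbit": "GTO", "class": 1},
    {"orbit": "GTO", "class": 0},
    {"orbit": "SSO", "class": 1},
    {"orbit": "LEO", "class": 0},
]
rates = success_rate_by(rows, "orbit")
print(rates)
```

Groups with very few launches produce extreme rates such as 0% or 100%, so the launch count per group is worth checking alongside the rate.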

Launch Success Yearly Trend

  • The success rate starts increasing in 2013
  • Some failures occur after 2017
  • The success rate increases again around 2018

Success rates of launch sites (Folium Map)

  • CCAFS (Cape Canaveral) LC 40: most launches and most failures
  • KSC (Kennedy Space Center) LC 39A: most successful launches
  • CCAFS (Cape Canaveral) SLC 40: fewest launches

Model accuracy for all built classification models (bar chart)

  • Logistic Regression (LR)
  • Support Vector Machine (SVM)
  • KNeighborsClassifier
  • All three reach very similar scores of around 0.833
  • Decision Tree Classifier has the lowest score: 0.722

Predictive Analysis (Classification)

  • The data was transformed and preprocessed
  • A prediction class was calculated (NumPy array)
  • The data was split into two parts: a training set to fit the model and a test set to evaluate it on unseen data
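The preprocessing and the split can be sketched as below. This is a plain-Python illustration under assumptions, not the notebook code: the function names, the 20% test share, the seed, and the synthetic payload feature are all hypothetical.

```python
import random
from statistics import mean, pstdev


def standardize(column):
    """Rescale numbers to zero mean and unit standard deviation (z-scores)."""
    mu, sigma = mean(column), pstdev(column)
    return [(value - mu) / sigma for value in column]


def split_train_test(features, labels, test_size=0.2, seed=2):
    """Shuffle row indices with a fixed seed, then split into train and test."""
    indices = list(range(len(features)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(indices) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    x_train = [features[i] for i in train_idx]
    x_test = [features[i] for i in test_idx]
    y_train = [labels[i] for i in train_idx]
    y_test = [labels[i] for i in test_idx]
    return x_train, x_test, y_train, y_test


# Synthetic stand-in data: one feature (payload mass in kg) and a 0/1 label.
payload_kg = [500.0 + 100.0 * i for i in range(10)]
landed = [i % 2 for i in range(10)]

scaled = standardize(payload_kg)
x_train, x_test, y_train, y_test = split_train_test(scaled, landed)
print(len(x_train), len(x_test))
```

Fixing the seed keeps the split reproducible, so different models are compared on the same held-out rows.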

Confusion Matrix of one of the best performing models

  • This KNN model correctly predicted 12 labels as landed
  • It correctly predicted 3 not-landed labels as not landed
  • It falsely predicted 3 not-landed labels as landed
  • It did not predict any landed label as not landed
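The four counts listed above are enough to derive the usual summary metrics. The short calculation below treats "landed" as the positive class, which is an assumption about the labelling; the variable names are illustrative.

```python
# Counts taken from the four bullets above ("landed" is the positive class).
true_positive = 12   # landed, predicted landed
true_negative = 3    # not landed, predicted not landed
false_positive = 3   # not landed, predicted landed
false_negative = 0   # landed, predicted not landed

total = true_positive + true_negative + false_positive + false_negative
accuracy = (true_positive + true_negative) / total
precision = true_positive / (true_positive + false_positive)
recall = true_positive / (true_positive + false_negative)

print(f"test samples: {total}")
print(f"accuracy:  {accuracy:.3f}")
print(f"precision: {precision:.3f}")
print(f"recall:    {recall:.3f}")
```

The accuracy of 15/18 ≈ 0.833 matches the score reported for the best models. All errors are false positives: the model never misses a real landing, but it is too optimistic about failed ones.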


Conclusion

  • Most successful launches from: Kennedy Space Center Launch Complex 39A (KSC LC 39A)
  • Launch site matters: conditions differ by site due to weather and geographic location
  • Payload mass may make launching or landing more difficult
  • Different orbits result in different success rates
  • Based on all the acquired and processed data, we can predict whether a launch / landing will succeed: the best-performing models score around 0.83

This project was part of the IBM Data Science Professional Certificate.

GitHub repository with all notebooks