Pandas ====== Materials: ---------- - `Code-along Jupyter notebook `__ - `Filled-in Jupyter notebook `__ While NumPy can be used to important data, it is optimized around numerical data. Many data sets include categorical variables. For these data sets, it is best to use a library called ``pandas``, which focuses on creating and manipulating data frames. .. tab:: Python .. code:: python import pandas as pd Read data --------- With ``pandas`` imported, we can read in .csv files with the ``pandas`` function ``read_csv()``. In that function, we can specify the file we want to use with a URL or with the path to a local file as a string. This saves the data in a structure called a DataFrame. .. tab:: Python .. code:: python df = pd.read_csv("https://raw.githubusercontent.com/DeisData/python/master/data/gapminder.csv") # read in data Our data is now saved as a data frame in Python as the variable ``df``. With the data now in the environment, we can take a look at the first few rows with ``df.head()``. .. tab:: Python .. code:: python df.head() .. tab:: Output = =========== ==== ====== ========== =============== ============== ================ ============== =========== \ country year region population life_expectancy age5_surviving babies_per_woman gdp_per_capita gdp_per_day = =========== ==== ====== ========== =============== ============== ================ ============== =========== 0 Afghanistan 1800 Asia 3280000.0 28.21 53.142 7.0 603.0 1.650924 1 Afghanistan 1810 Asia 3280000.0 28.11 53.002 7.0 604.0 1.653662 2 Afghanistan 1820 Asia 3323519.0 28.01 52.862 7.0 604.0 1.653662 3 Afghanistan 1830 Asia 3448982.0 27.90 52.719 7.0 625.0 1.711157 4 Afghanistan 1840 Asia 3625022.0 27.80 52.576 7.0 647.0 1.771389 = =========== ==== ====== ========== =============== ============== ================ ============== =========== We can see that this data frame has several different columns, with information about countries and demography. Summarize data frame -------------------- It is important to understand the data we are working with before we begin analysis. First, let’s look at the dimenions of the data frame using ``df.shape``. It gives the number of rows by the number of columns. .. tab:: Python .. code:: python df.shape .. tab:: Output .. code:: none (14740, 9) This shows that our data frame has 14740 rows by 9 columns. We can also use ``df.columns`` to display the column names. .. tab:: Python .. code:: python df.columns .. tab:: Output .. code:: none Index(['country', 'year', 'region', 'population', 'life_expectancy', 'age5_surviving', 'babies_per_woman', 'gdp_per_capita', 'gdp_per_day'], dtype='object') Categorical variables --------------------- Next, let’s summarize the categorical, non-numerical variables. For instance, we can identify how many unique regions we have in the data set. First, to select a column, we use the notation ``df['COLUMN_NAME']``. .. tab:: Python .. code:: python df['region'] .. tab:: Output .. code:: none 0 Asia 1 Asia 2 Asia 3 Asia 4 Asia ... 14735 Africa 14736 Africa 14737 Africa 14738 Africa 14739 Africa Name: region, Length: 14740, dtype: object To identify unique entries in this column, we can use the ``.unique()`` function. .. tab:: Python .. code:: python df['region'].unique() .. tab:: Output .. code:: none array(['Asia', 'Europe', 'Africa', 'America'], dtype=object) The ``countries`` column has many unique values, so instead of ``.unique()``, we can use ``.nunique()`` to find the number of unique countries in the data set. .. tab:: Python .. code:: python df['country'].nunique() .. tab:: Output .. code:: none 182 Numerical variables ------------------- Numerical columns can be summarized in several ways. Let’s find the mean first. To make things simpler, we’ll just do calculations on the ``population``, ``life_expectancy``, and ``babies_per_woman`` columns. We can put those names in an arrangement called a ``list`` and then specify that list for the columns. .. tab:: Python .. code:: python num_cols = [ 'population', 'life_expectancy', 'babies_per_woman' ] # numerical columns df[num_cols] .. tab:: Output ====== ========== =============== ================ \ population life_expectancy babies_per_woman ====== ========== =============== ================ 0 3280000.0 28.21 7.00 1 3280000.0 28.11 7.00 2 3323519.0 28.01 7.00 3 3448982.0 27.90 7.00 4 3625022.0 27.80 7.00 \.\.\. \.\.\. \.\.\. \.\.\. 14735 14255592.0 51.60 3.64 14736 14565482.0 54.20 3.56 14737 14898092.0 55.70 3.49 14738 15245855.0 57.00 3.41 14739 15602751.0 59.30 3.35 ====== ========== =============== ================ 14740 rows × 3 columns With this set of columns, we can run ``.mean()`` to find the mean of each column. .. tab:: Python .. code:: python df[num_cols].mean() # returns the mean of each column .. tab:: Output .. code:: none population 2.252933e+07 life_expectancy 5.683453e+01 babies_per_woman 4.643472e+00 dtype: float64 If we want a larger variety of summary statistics, we can use the ``.describe()`` method. .. tab:: Python .. code:: python df[num_cols].describe() .. tab:: Output ===== ============ =============== ================ \ population life_expectancy babies_per_woman ===== ============ =============== ================ count 1.474000e+04 14740.000000 14740.000000 mean 2.252933e+07 56.834526 4.643472 std 9.307143e+07 15.868464 1.994833 min 2.128000e+03 4.000000 1.130000 25% 8.990308e+05 44.230000 2.630000 50% 4.063978e+06 60.080000 5.060000 75% 1.218722e+07 70.380000 6.440000 max 1.376049e+09 83.300000 9.220000 ===== ============ =============== ================ We can also break down subgroupings of our data with the method ``.groupby()``. .. tab:: Python .. code:: python grouped_data = df.groupby('region') grouped_data['population'].describe() .. tab:: Output ======= ====== ============ ============ ======= ========== ========= =========== ============ \ count mean std min 25% 50% 75% max ------- ------ ------------ ------------ ------- ---------- --------- ----------- ------------ region \ \ ======= ====== ============ ============ ======= ========== ========= =========== ============ Africa 4293.0 9.181313e+06 1.655128e+07 12522.0 996331.00 3457113.0 9901052.00 1.822020e+08 America 2673.0 1.667833e+07 4.411806e+07 24000.0 331799.00 2843246.0 10061519.00 3.217736e+08 Asia 4212.0 4.604245e+07 1.658010e+08 2128.0 512028.25 4011309.5 19517390.25 1.376049e+09 Europe 3562.0 1.520351e+07 2.463153e+07 61428.0 2308682.00 5186801.5 10638884.75 1.484358e+08 ======= ====== ============ ============ ======= ========== ========= =========== ============ Accessing rows and specific entries ----------------------------------- You can also to access a specific row using ``df.loc[ROW, :]``. The colon specifies to select all columns for that row number. .. tab:: Python .. code:: python df.loc[0, :] # the first row .. tab:: Output .. code:: none country Afghanistan year 1800 region Asia population 3280000.0 life_expectancy 28.21 age5_surviving 53.142 babies_per_woman 7.0 gdp_per_capita 603.0 gdp_per_day 1.650924 Name: 0, dtype: object We can use ``.loc`` to find the value of specific entries, as well. .. tab:: Python .. code:: python df.loc[0, 'country'] # first row entry for column .. tab:: Output .. code:: none 'Afghanistan' Question 1: Summarizing data ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Print out the summary statistics for columns ``age5_surviving``, ``gdp_per_day``, and ``gdp_per_capita``. .. tab:: Python .. code:: python ### your code below: .. collapse:: Solution .. container:: .. tab:: Python .. code:: python df[['age5_surviving', 'gdp_per_day', 'gdp_per_capita']].describe() Manipulate data --------------- Subset by row ~~~~~~~~~~~~~ Sometimes, we want to create a subset of the main data frame based on certain conditions. Below, we take all of the rows where ``babies_per_woman`` is greater or equal to 4 with ``df['babies_per_woman'] >= 4`` and assign this to a new data frame. To check that this was done correctly, we can look at the minimum of the ``babies_per_woman`` column in the new data frame with ``.min()``. .. tab:: Python .. code:: python # take all rows where babies_per_woman is greater or equal to 4 and make a new data frame df_4 = df[df['babies_per_woman'] >= 4] df_4['babies_per_woman'].min() .. tab:: Output .. code:: none 4.0 We can use the following operators to make subsets: - Equals: ``==`` - Not equals: ``!=`` - Greater than, less than: ``>``, ``<`` - Greater than or equal to: ``>=`` - Less than or equal to: ``<=`` We can also subset with categorical variables. Here, we take all rows where the country is Hungary. .. tab:: Python .. code:: python df_hungary = df[df['country'] == 'Hungary'] pd.unique(df_hungary['country']) .. tab:: Output .. code:: none array(['Hungary'], dtype=object) Math ~~~~ If we multiply a data frame by a single number, each value in the column will be muliplied by that value. .. tab:: Python .. code:: python df['babies_per_woman'] * 1000 .. tab:: Output .. code:: none 0 7000.0 1 7000.0 2 7000.0 3 7000.0 4 7000.0 ... 14735 3640.0 14736 3560.0 14737 3490.0 14738 3410.0 14739 3350.0 Name: babies_per_woman, Length: 14740, dtype: float64 We can also do math between columns, since they have the same length. Elements of the same row are added, substacted, multiplied, or divided. Here, we subtract the ``life_expectancy`` column from the ``age5_surviving`` column and assign it to a new column called ``life_difference``. .. tab:: Python .. code:: python df['life_difference'] = df['age5_surviving'] - df['life_expectancy'] print(df['life_difference']) .. tab:: Output .. code:: none 0 24.932 1 24.892 2 24.852 3 24.819 4 24.776 ... 14735 39.200 14736 37.130 14737 35.970 14738 34.900 14739 32.740 Name: life_difference, Length: 14740, dtype: float64 This new column is now reflected in the data frame. .. tab:: Python .. code:: python print(df.columns) .. tab:: Output .. code:: none Index(['country', 'year', 'region', 'population', 'life_expectancy', 'age5_surviving', 'babies_per_woman', 'gdp_per_capita', 'gdp_per_day', 'life_difference'], dtype='object') Question 2: Working with data ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Create a subset of data from Lithuania. Within that subset, calculate the mean GDP per 1000 people across entries. *Hint: Multiply per capita GDP by 1000.* .. tab:: Python .. code:: python ### Your code here: .. collapse:: Solution .. container:: .. tab:: Python .. code:: python df_lth = df[df['country']=='Lithuania'] df_lth['gdp_per_1000'] = 1000 * df_lth['gdp_per_capita'] print(df_lth['gdp_per_1000'].mean()) Create your own data frame -------------------------- To make your own data frame without a .csv, we use the function ``pd.DataFrame()``. There are many ways to use this function to construct a data frame. Here, we show how to convert a dictionary of lists into a data frame. Each list will be its own column, and you need to make sure the lists are all the same length. The keys of each list should be the column names. .. tab:: Python .. code:: python data_dict = { 'a': [1, 3, 5], 'b': ['apple', 'banana', 'apple'], 'c': [-2., -3., -5.] } pd.DataFrame(data_dict) .. tab:: Output = = ====== ==== \ a b c = = ====== ==== 0 1 apple -2.0 1 3 banana -3.0 2 5 apple -5.0 = = ====== ==== You can also use lists of lists or 2D NumPy arrays to create data frames. Each list will be a row, instead of a column, and you will need to specify the column name as another argument in ``pd.DataFrame()`` called ``columns``. .. tab:: Python .. code:: python data_list = [ [1, 'apple', -2.], [3, 'banana', -3.], [5, 'apple', -5.] ] pd.DataFrame(data_list, columns=['a', 'b', 'c']) .. tab:: Output = = ====== ==== \ a b c = = ====== ==== 0 1 apple -2.0 1 3 banana -3.0 2 5 apple -5.0 = = ====== ==== Note: we need to save this as a variable to use it in the future. Sort data frame --------------- To sort the rows in a data frame by the value of a column, we can use the ``.sort_values()`` method. The argument ``by`` requires a list with a column name. Again, if you want to use the sorted version in the future, you need to save it as a new variable. .. tab:: Python .. code:: python my_df = pd.DataFrame(data_list, columns=['a', 'b', 'c']) my_df.sort_values(by=['b']) .. tab:: Output = = ====== ==== \ a b c = = ====== ==== 0 1 apple -2.0 2 5 apple -5.0 1 3 banana -3.0 = = ====== ==== You can also sort descending by specifying the ``ascending=False`` argument. .. tab:: Python .. code:: python my_df.sort_values(by=['b'], ascending=False) .. tab:: Output = = ====== ==== \ a b c = = ====== ==== 1 3 banana -3.0 0 1 apple -2.0 2 5 apple -5.0 = = ====== ==== If desired, multiple column names can be specified, with priority given to those first in the list. .. tab:: Python .. code:: python my_df.sort_values(by=['b', 'a'], ascending=False) .. tab:: Output = = ====== ==== \ a b c = = ====== ==== 1 3 banana -3.0 2 5 apple -5.0 0 1 apple -2.0 = = ====== ==== Add rows -------- There are multiple ways to add a new row to a data frame. The most straightforward way is to use the ``pandas.concat()`` function with a new data frame with just one row. We put the the two data frames into a list, and we set ``axis=0`` to make sure it adds as a row. We will specify ``.reset_index(drop=True)`` to reset row numbers to account for the new row. .. tab:: Python .. code:: python new_row = pd.DataFrame({ 'a': [2], 'b': ['banana'], 'c': [-1.] }) my_df2 = pd.concat([my_df, new_row], axis=0).reset_index(drop=True) print(my_df2) .. tab:: Output .. code:: none a b c 0 1 apple -2.0 1 3 banana -3.0 2 5 apple -5.0 3 2 banana -1.0 You can also use this approach to add multiple rows, as well, by having the new data frame consist of multiple rows. .. tab:: Python .. code:: python new_rows = pd.DataFrame({ 'a': [6, 5], 'b': ['banana', 'orange'], 'c': [-4., -9.] }) my_df3 = pd.concat([my_df2, new_rows], axis=0).reset_index(drop=True) print(my_df3) .. tab:: Output .. code:: none a b c 0 1 apple -2.0 1 3 banana -3.0 2 5 apple -5.0 3 2 banana -1.0 4 6 banana -4.0 5 5 orange -9.0 Join data frames ---------------- A critical tool in data wrangling is combining data frames that share common values, columns, or identifiers. Let’s important two new .csv files and join them. .. tab:: Python .. code:: python surveys_df = pd.read_csv("https://raw.githubusercontent.com/DeisData/python/master/data/surveys.csv", keep_default_na=False, na_values=[""]) species_df = pd.read_csv("https://raw.githubusercontent.com/DeisData/python/master/data/species.csv", keep_default_na=False, na_values=[""]) print(surveys_df.head()) print(species_df.head()) .. tab:: Output .. code:: none record_id month day year plot_id species_id sex hindfoot_length \ 0 1 7 16 1977 2 NL M 32.0 1 2 7 16 1977 3 NL M 33.0 2 3 7 16 1977 2 DM F 37.0 3 4 7 16 1977 7 DM M 36.0 4 5 7 16 1977 3 DM M 35.0 weight 0 NaN 1 NaN 2 NaN 3 NaN 4 NaN species_id genus species taxa 0 AB Amphispiza bilineata Bird 1 AH Ammospermophilus harrisi Rodent 2 AS Ammodramus savannarum Bird 3 BA Baiomys taylori Rodent 4 CB Campylorhynchus brunneicapillus Bird The shared column between these data frames is ``species_id``, so this is the column we will want to join around. Inner Join ~~~~~~~~~~ The pandas function for performing joins is called ``merge()`` and an Inner join is the default option. Inner joins take all rows from both data frames that share values from an identifier column. In our case, this means that our joined data frame will only include rows with species identifiers present in ``species_df`` and ``surveys_df``. |inner join| .. tab:: Python .. code:: python merged_inner = pd.merge(left=surveys_df, right=species_df, left_on='species_id', right_on='species_id') # In this case `species_id` is the only column name in both dataframes, so if we skipped `left_on` # And `right_on` arguments we would still get the same result # What's the size of the output data? print(merged_inner.shape) merged_inner .. tab:: Output .. code:: none (34786, 12) ====== ========= ====== ====== ====== ======= ========== ====== =============== ====== ============= ======== ======= \ record_id month day year plot_id species_id sex hindfoot_length weight genus species taxa ====== ========= ====== ====== ====== ======= ========== ====== =============== ====== ============= ======== ======= 0 1 7 16 1977 2 NL M 32.0 NaN Neotoma albigula Rodent 1 2 7 16 1977 3 NL M 33.0 NaN Neotoma albigula Rodent 2 22 7 17 1977 15 NL F 31.0 NaN Neotoma albigula Rodent 3 38 7 17 1977 17 NL M 33.0 NaN Neotoma albigula Rodent 4 72 8 19 1977 2 NL M 31.0 NaN Neotoma albigula Rodent \.\.\. \.\.\. \.\.\. \.\.\. \.\.\. \.\.\. \.\.\. \.\.\. \.\.\. \.\.\. \.\.\. \.\.\. \.\.\. 34781 28988 12 23 1998 6 CT NaN NaN NaN Cnemidophorus tigris Reptile 34782 35512 12 31 2002 11 US NaN NaN NaN Sparrow sp. Bird 34783 35513 12 31 2002 11 US NaN NaN NaN Sparrow sp. Bird 34784 35528 12 31 2002 13 US NaN NaN NaN Sparrow sp. Bird 34785 35544 12 31 2002 15 US NaN NaN NaN Sparrow sp. Bird ====== ========= ====== ====== ====== ======= ========== ====== =============== ====== ============= ======== ======= 34786 rows × 12 columns The result ``merged_inner`` data frame contains all of the columns from ``surveys_df`` (``record_id``, ``month``, ``day``, etc.) as well as all the columns from ``species_df`` (``species_id``, ``genus``, ``species``, and ``taxa``). Left join ~~~~~~~~~ What if we want to add information from ``species_df`` to ``surveys_df``\ without losing any of the information from ``surveys_df``? In this case, we use a different type of join called a left join, where we keep all rows from the data frame we call left (in our case ``surveys_df``) and only take rows from the right data frame (``species_df``) with species IDs in ``surveys_df``. |left join| A left join is performed in pandas by calling the same ``merge()`` function used for inner join, but using the ``how='left'`` argument. .. tab:: Python .. code:: python merged_left = pd.merge(left=surveys_df, right=species_df, how='left', left_on='species_id', right_on='species_id') merged_left .. tab:: Output ====== ========= ====== ====== ====== ======= ========== ====== =============== ====== ================ ========= ======= \ record_id month day year plot_id species_id sex hindfoot_length weight genus species taxa ====== ========= ====== ====== ====== ======= ========== ====== =============== ====== ================ ========= ======= 0 1 7 16 1977 2 NL M 32.0 NaN Neotoma albigula Rodent 1 2 7 16 1977 3 NL M 33.0 NaN Neotoma albigula Rodent 2 3 7 16 1977 2 DM F 37.0 NaN Dipodomys merriami Rodent 3 4 7 6 1977 7 DM M 36.0 NaN Dipodomys merriami Rodent 4 5 7 16 1977 3 DM M 35.0 NaN Dipodomys merriami Rodent \.\.\. \.\.\. \.\.\. \.\.\. \.\.\. \.\.\. \.\.\. \.\.\. \.\.\. \.\.\. \.\.\. \.\.\. \.\.\. 35544 35545 12 31 2002 15 AH NaN NaN NaN Ammospermophilus harrisi Rodent 35545 35546 12 31 2002 15 AH NaN NaN NaN Ammospermophilus harrisi Rodent 35546 35547 12 31 2002 10 RM F 15.0 14.0 Reithrodontomys megalotis Rodent 35547 35548 12 31 2002 7 DO M 36.0 51.0 Dipodomys ordii Rodent 35548 35549 12 31 2002 5 NaN NaN NaN NaN NaN NaN NaN ====== ========= ====== ====== ====== ======= ========== ====== =============== ====== ================ ========= ======= Export data frame as .csv ------------------------- If you have made modifications to a data set in Python and want to export that to a new .csv, you can easily do that with the ``.to_csv()`` method that all pandas data frames have. .. tab:: Python .. code:: python %%script false --no-raise-error ## comment this line to run the cell df_hungary.to_csv('gapminder_hungary.csv', index=False) # index = False makes sure row names are not saved as their own columns Question 3: Putting it together ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Create two data frames called ``A`` and ``B`` with at least 3 columns and 4 rows. Make one column in both ``A`` and ``B`` an identifier column, with at least one ID present in both data frames. Use a left join with ``A`` as the left data frame, and call the new data frame ``C``. Display the data frame, and export it as a .csv file. .. tab:: Python .. code:: python ### Your code here: .. collapse:: Solution .. container:: .. tab:: Python .. code:: python A = pd.DataFrame({ 'ident': [0, 1, 2, 3], 'size': [2.1, 5.2, 3.1, 1.5], 'location': ['IL', 'MA', 'CA', 'NE'] }) B = pd.DataFrame({ 'ident': [5, 4, 0, 1, 3], 'animal': ['monkey', 'giraffe', 'ape', 'lion', 'fish'], 'sex': ['M', 'F', 'F', 'F', 'M'] }) C = pd.merge(left=A, right=B, how='left', left_on='ident', right_on='ident') print(C) C.to_csv('joined_data.csv', index=False) .. tab:: Output .. code:: none ident size location animal sex 0 0 2.1 IL ape F 1 1 5.2 MA lion F 2 2 3.1 CA NaN NaN 3 3 1.5 NE fish M Resources --------- - `Pandas docs `__ - `Pandas getting started `__ - `Pandas cheatsheet `__ - `PySpark for big data `__ This lesson is adapted from `Software Carpentry `__. .. |inner join| image:: /_static/images/python/pandas/innerjoin.png .. |left join| image:: /_static/images/python/pandas/leftjoin.png