BASIC

In [ ]:
2+2
In [ ]:
print(4+6)
In [ ]:
print 5+5  # Python 2 syntax; fails in Python 3, so use parentheses: print(5+5)
In [ ]:
4  #int
4.3 # float
"hello world" # string
True #boolean
False #boolean
None # undefined
In [ ]:
type(4) # type() is useful for figuring out what kind of data you have
In [ ]:
type(5.4)
In [ ]:
type("hello world")

Converting between types

In [ ]:
int(6.42)  #convert float -> int
In [ ]:
str(5)  # convert int -> string
In [ ]:
#group question 
#How can we make this work? 
print(5 + " is a number")  # does not work: an int and a string cannot be added (TypeError)
In [ ]:
# group question
type(1/2)   # float, int, string?
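One possible way to work through the conversion cells and the two group questions above, as a minimal sketch. The variable names are made up for illustration.

```python
# int() drops the fractional part; str() turns a number into text
whole = int(6.42)
text = str(5)

# Option 1: convert the number to a string before concatenating
message = str(5) + " is a number"

# Option 2: let an f-string do the conversion
message_f = f"{5} is a number"

# In Python 3, / is true division and returns a float even for two ints;
# // is floor division and keeps the int type
half = 1 / 2
floor_half = 1 // 2

print(message, type(half).__name__, floor_half)
```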

Logic in Python

In [ ]:
5 > 4
In [ ]:
# 5 >= 4, 5 <= 4, 5 == 4, 5 < 4, 5> 4
5 == 4 # false
5.0 == 5  # convert, True
5.0 == "5"  # false
In [ ]:
(5 > 4) and (5 < 3)
In [ ]:
#group question
(5 > 4) or (5 < 3)
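The comparisons in this section can be collected into one cell to check the answers. This is only a sketch; the variable names are invented here, and the try/except block is an addition that shows what happens when a number and a string are ordered rather than tested for equality.

```python
same_value = (5.0 == 5)         # int and float compare by numeric value
number_vs_text = (5.0 == "5")   # a number and a string are never equal

# Ordering a number against a string is not defined in Python 3
try:
    5 < "5"
    ordering_error = None
except TypeError as err:
    ordering_error = type(err).__name__

both = (5 > 4) and (5 < 3)      # "and" needs both sides to be True
either = (5 > 4) or (5 < 3)     # "or" needs only one side to be True

print(same_value, number_vs_text, ordering_error, both, either)
```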

Variables

In [ ]:
x = 3.2
y = "hello world"
In [ ]:
helloHappyPeople = "hello world"
In [ ]:
helloHappyPeople  # tab to autocomplete
In [ ]:
z = "3.2"
#group question 
x == z # False or True?
In [ ]:
#group question
x = True
y = False
z = False

if x or y and z:
    print('yes')
else:
    print('no')   

# will the output be yes or no?
In [ ]:
#group question
x or (y and z)
In [ ]:
#group question
x or y or z
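The group questions above come down to operator precedence: "and" binds more tightly than "or". A short sketch that compares the unparenthesized expression with both possible groupings (the result names are made up for illustration):

```python
x, y, z = True, False, False

implicit = x or y and z        # how Python reads the bare expression
and_first = x or (y and z)     # grouping "and" first
or_first = (x or y) and z      # grouping "or" first

print(implicit, and_first, or_first)
```

Only the grouping that matches Python's own reading gives the same result as the bare expression, so adding parentheses is a cheap way to make the intent explicit.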

Tuples/lists/dictionaries

In [ ]:
names_tuples = ('alice','bob','sam')  # a tuple cannot be modified (immutable)
names_list  = ['alice','bob','sam']  # a list can be changed (mutable)
In [ ]:
names_tuples.append('jake')  # fails: tuples have no append method (AttributeError)
In [ ]:
names_list.append('jake')
names_list
In [ ]:
names_list.insert(3,'chelsea')  # chelsea inserted at position 3
names_list
In [ ]:
names_list[1]  # to get bob
In [ ]:
names_list.pop()  # pull from end
In [ ]:
names_list
In [ ]:
names_list.index('sam')  #tell me where sam is in the list
In [ ]:
len(names_list)  # give me the length  --> 3 or 4
In [ ]:
a = [[1,2],[4,5]]
a[1][0]  # will I get 1,2,4, or 5?
In [ ]:
#group question
a = [1,2,3,None,(),[],] # what is len(a)? 4,5,6,7 or error?
In [ ]:
# concat lists
list1 = [1,2,3,4]
list2 = [5,6,7,8]
list1 + list2
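The heading of this section also lists dictionaries, which none of the cells here demonstrate. A minimal sketch of the basic operations follows; the names and ages are made-up illustration data, not part of the lab.

```python
# A dictionary maps keys to values (here: name -> age)
ages = {'alice': 31, 'bob': 27}

ages['sam'] = 45              # add a new key/value pair
bob_age = ages['bob']         # look up a value by key
missing = ages.get('jake')    # .get() returns None instead of raising KeyError
removed = ages.pop('alice')   # remove a key and return its value
names = list(ages)            # remaining keys, in insertion order

print(bob_age, missing, removed, names, len(ages))
```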
In [ ]:
#group question
#What is the answer? 
"apple" + "bana"  

Objects in Python

In [ ]:
class Person:
    
    fullname = None   # this is a field
    weight = 0 # also a field
    height = 0 # also a field
    
    def __init__(self,name,w,h):  # constructor (make a new person)
        self.fullname = name
        self.weight = w
        self.height = h
    
    def getFullName(self):  # give me your name
        return self.fullname
    
    def setWeight(self,newweight):  # update your weight
        self.weight = newweight
        
    def getBMI(self):  # BMI from weight in pounds and height in inches
        return int(703*self.weight/(self.height*self.height))
In [ ]:
alex = Person("alex",150,68)  # make a new instance of person
In [ ]:
jane = Person("jane",130,68)  # make a new instance
In [ ]:
alex.getFullName()   # what is your name?
In [ ]:
jane.getFullName()   # what is your name?
In [ ]:
alex.getBMI()  # what is your BMI?
In [ ]:
alex.setWeight(190)
In [ ]:
alex.getBMI()
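The arithmetic inside getBMI can be checked on its own. The sketch below pulls the formula out into a standalone function; the function name bmi_imperial is made up here, and it assumes the weight is in pounds and the height in inches, which is what the factor 703 implies.

```python
def bmi_imperial(weight_lb, height_in):
    """Body mass index from pounds and inches (703 is the unit conversion factor)."""
    return 703 * weight_lb / (height_in * height_in)

# Same numbers as alex before and after setWeight(190)
before = int(bmi_imperial(150, 68))
after = int(bmi_imperial(190, 68))

print(before, after)
```

Note that int() truncates rather than rounds, so the method reports the whole-number part of the BMI.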

Pandas 101

In [ ]:
import numpy as np  # load the libraries and object definitions we need
import pandas as pd
from pandas import DataFrame, Series
# tell the notebook to draw plots inline, right below the cell
%matplotlib inline
# load the plotting library and call it plt
import matplotlib.pyplot as plt
In [ ]:
# define a small class table with students, years, and grades
myclass = pd.DataFrame({'student':['alice','bob','louis','jen'],
                        'year':[4,4,3,3],
                        'grade':[10,9,10,10]})
In [ ]:
myclass   # show me what the class looks like
In [ ]:
myclass.shape  # how many rows and columns
In [ ]:
myclass.columns  # give me the column names
In [ ]:
myclass.year.unique()  # give me the unique years
In [ ]:
pd.crosstab(myclass.grade,myclass.year)  # count me how many people are in each condition
In [ ]:
plt.hist(myclass.grade)
In [ ]:
pd.crosstab(myclass.year,myclass.grade)  #order reversed (x/y)
In [ ]:
myclass.info()
In [ ]:
myclass.describe()
In [ ]:
myclass.T  # get the transpose
In [ ]:
myclass = myclass.set_index('student')  # make a new dataframe based on myclass, BUT with student as the main/index key
myclass
In [ ]:
yrg = myclass.groupby('year')  # partition into groups I care about
yrg.describe()  # describe them  (statistical props. of each group)
In [ ]:
myclass.grade >= 10 
In [ ]:
goodgrades = myclass.grade >= 10  # created a filter for students with good grades
goodgrades
In [ ]:
myclass[goodgrades]  # apply the filter
In [ ]:
myclass
In [ ]:
myclass.year == 4  # all people who are seniors
In [ ]:
(myclass.year == 4) & (myclass.grade >= 10)   # & is the element-wise "and" for boolean Series; keep the parentheses
In [ ]:
flt = (myclass.year == 4) & (myclass.grade >= 10)  # make the filtering criteria
myclass[flt]
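The filtering cells above can be run as one self-contained sketch. It rebuilds the same small table and adds two related operators that the cells do not show, | for element-wise "or" and ~ for "not"; the result names are made up for illustration.

```python
import pandas as pd

myclass = pd.DataFrame({'student': ['alice', 'bob', 'louis', 'jen'],
                        'year': [4, 4, 3, 3],
                        'grade': [10, 9, 10, 10]}).set_index('student')

# Each comparison yields a boolean Series; & combines them row by row.
# The parentheses are required because & binds more tightly than == and >=.
flt = (myclass.year == 4) & (myclass.grade >= 10)
seniors_with_good_grades = myclass[flt]

# | is the element-wise "or", ~ is the element-wise "not"
either = myclass[(myclass.year == 4) | (myclass.grade >= 10)]
not_seniors = myclass[~(myclass.year == 4)]

print(seniors_with_good_grades)
```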
In [ ]:
olive_oil = pd.read_csv('olive.csv')  # load up the file

1. Setup Environment and Import Data

In [ ]:
import matplotlib.pyplot as plt
import pandas as pd #this is how I usually import pandas
import sys #only needed to determine Python version number
import matplotlib #only needed to determine Matplotlib version number

# Enable inline plotting
%matplotlib inline

1.1 Install statsmodels package.

statsmodels is a Python package that provides various statistical analysis tools. We will use statsmodels for this lab. The package does not come with Anaconda by default, but we can easily install it.

  • Open Terminal (or cmd in Windows)
  • Choose an appropriate option:
    • If python3 is your default python, then type
      conda install statsmodels
    • If python3 is NOT your default python, which means you had to activate your Python environment before each session (e.g., "source activate python3" on Mac or "activate python3" on Windows), then do the following
      conda install --name=python3 statsmodels
      
      where python3 should be whatever the name of your conda environment is.
  • Follow the instructions to complete installation. Then come back to this notebook. The rest of the lab will be done within this notebook.
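After the installation finishes, one way to confirm that the package is visible to the Python running this notebook is to ask the import system for it. This is a minimal sketch; the helper name is_installed is made up for illustration.

```python
import importlib.util

def is_installed(package_name):
    """Return True if the package can be found by the current interpreter."""
    return importlib.util.find_spec(package_name) is not None

print("statsmodels installed:", is_installed("statsmodels"))
```

If this prints False, the install most likely went into a different conda environment than the one the notebook is using.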

1.2 Import File

In [26]:
#Even though this function has many parameters, we will simply pass it the location of the text file.
#Location = C:\Users\fatem_000\OneDrive\Academic\2014 Summer\TA\599-VIS-Fall2016\Lab2\seeds-subset.csv
#Note: Depending on where you save your notebooks, you may need to modify the location above.

Location = r'C:\Users\fara\OneDrive\Academic\2014 Summer\TA\599-VIS-Fall2016\Lab2\olive.csv'
olive_oil = pd.read_csv(Location)  # load up the file

olive_oil
Out[26]:
Unnamed: 0 region area palmitic palmitoleic stearic oleic linoleic linolenic arachidic eicosenoic
0 1.North-Apulia 1 1 1075 75 226 7823 672 36 60 29
1 2.North-Apulia 1 1 1088 73 224 7709 781 31 61 29
2 3.North-Apulia 1 1 911 54 246 8113 549 31 63 29
3 4.North-Apulia 1 1 966 57 240 7952 619 50 78 35
4 5.North-Apulia 1 1 1051 67 259 7771 672 50 80 46
5 6.North-Apulia 1 1 911 49 268 7924 678 51 70 44
6 7.North-Apulia 1 1 922 66 264 7990 618 49 56 29
7 8.North-Apulia 1 1 1100 61 235 7728 734 39 64 35
8 9.North-Apulia 1 1 1082 60 239 7745 709 46 83 33
9 10.North-Apulia 1 1 1037 55 213 7944 633 26 52 30
10 11.North-Apulia 1 1 1051 35 219 7978 605 21 65 24
11 12.North-Apulia 1 1 1036 59 235 7868 661 30 62 44
12 13.North-Apulia 1 1 1074 70 214 7728 747 50 79 33
13 14.North-Apulia 1 1 875 52 243 8018 655 41 79 32
14 15.North-Apulia 1 1 952 49 254 7795 780 50 75 41
15 16.North-Apulia 1 1 1155 98 201 7606 816 32 60 29
16 17.North-Apulia 1 1 943 94 183 7840 788 42 75 31
17 18.North-Apulia 1 1 1278 69 205 7344 957 45 70 28
18 19.North-Apulia 1 1 961 70 195 7958 742 46 75 30
19 20.North-Apulia 1 1 952 77 258 7820 736 43 78 33
20 21.North-Apulia 1 1 1074 67 236 7692 716 56 83 45
21 22.North-Apulia 1 1 995 46 288 7806 679 56 86 40
22 23.North-Apulia 1 1 1056 53 247 7703 700 54 89 51
23 24.North-Apulia 1 1 1065 39 234 7876 703 42 74 26
24 25.North-Apulia 1 1 1065 45 245 7779 696 47 82 38
25 26.Calabria 1 2 1315 139 230 7299 832 42 60 32
26 27.Calabria 1 2 1321 136 217 7174 950 43 63 30
27 28.Calabria 1 2 1359 115 246 7234 874 45 63 18
28 29.Calabria 1 2 1378 111 272 7127 940 46 64 23
29 30.Calabria 1 2 1295 109 245 7253 903 43 62 38
... ... ... ... ... ... ... ... ... ... ... ...
542 543.West-Liguria 3 8 1020 100 290 7620 960 0 10 2
543 544.West-Liguria 3 8 970 90 220 7700 1020 0 0 3
544 545.West-Liguria 3 8 1180 130 220 7450 1010 0 10 2
545 546.West-Liguria 3 8 1060 140 240 7690 850 10 10 1
546 547.West-Liguria 3 8 990 100 250 7630 1030 0 0 3
547 548.West-Liguria 3 8 1010 90 350 7630 940 10 0 3
548 549.West-Liguria 3 8 1040 90 250 7780 820 10 10 1
549 550.West-Liguria 3 8 1040 90 250 7810 810 10 10 2
550 551.West-Liguria 3 8 1020 90 350 7620 920 10 0 3
551 552.West-Liguria 3 8 1020 90 260 7620 1010 0 0 3
552 553.West-Liguria 3 8 1010 90 350 7610 930 10 0 3
553 554.West-Liguria 3 8 920 110 340 7720 910 0 0 3
554 555.West-Liguria 3 8 1030 100 250 7710 900 0 10 2
555 556.West-Liguria 3 8 960 90 300 7820 830 0 0 3
556 557.West-Liguria 3 8 1030 110 210 7810 840 0 0 1
557 558.West-Liguria 3 8 1010 100 240 7710 910 10 20 2
558 559.West-Liguria 3 8 1020 90 240 7800 850 0 0 2
559 560.West-Liguria 3 8 1120 90 300 7650 830 0 10 1
560 561.West-Liguria 3 8 1090 90 290 7710 800 10 0 2
561 562.West-Liguria 3 8 1100 120 280 7630 770 10 10 2
562 563.West-Liguria 3 8 1090 80 240 7820 760 10 0 2
563 564.West-Liguria 3 8 1150 90 250 7720 810 0 10 3
564 565.West-Liguria 3 8 1110 90 230 7810 750 0 10 2
565 566.West-Liguria 3 8 1010 110 210 7720 950 0 0 1
566 567.West-Liguria 3 8 1070 100 220 7730 870 10 10 2
567 568.West-Liguria 3 8 1280 110 290 7490 790 10 10 2
568 569.West-Liguria 3 8 1060 100 270 7740 810 10 10 3
569 570.West-Liguria 3 8 1010 90 210 7720 970 0 0 2
570 571.West-Liguria 3 8 990 120 250 7750 870 10 10 2
571 572.West-Liguria 3 8 960 80 240 7950 740 10 20 2

572 rows × 11 columns

1.3 Import Data

Let us start with the necessary imports and data loading. Just execute every cell below. Remember: a convenient shortcut to run a cell and then jump to the next cell is Shift + Enter.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
# This is the Anscombe's quartet
# Source - Wikipedia: https://en.wikipedia.org/wiki/Anscombe%27s_quartet
from io import StringIO

TESTDATA=StringIO("""X1,Y1,X2,Y2,X3,Y3,X4,Y4
10,8.04,10,9.14,10,7.46,8,6.58
8,6.95,8,8.14,8,6.77,8,5.76
13,7.58,13,8.74,13,12.74,8,7.71
9,8.81,9,8.77,9,7.11,8,8.84
11,8.33,11,9.26,11,7.81,8,8.47
14,9.96,14,8.1,14,8.84,8,7.04
6,7.24,6,6.13,6,6.08,8,5.25
4,4.26,4,3.1,4,5.39,19,12.5
12,10.84,12,9.13,12,8.15,8,5.56
7,4.82,7,7.26,7,6.42,8,7.91
5,5.68,5,4.74,5,5.73,8,6.89""")

df = pd.read_csv(TESTDATA)  # DataFrame.from_csv was removed in pandas 1.0
df
Out[2]:
X1 Y1 X2 Y2 X3 Y3 X4 Y4
0 10 8.04 10 9.14 10 7.46 8 6.58
1 8 6.95 8 8.14 8 6.77 8 5.76
2 13 7.58 13 8.74 13 12.74 8 7.71
3 9 8.81 9 8.77 9 7.11 8 8.84
4 11 8.33 11 9.26 11 7.81 8 8.47
5 14 9.96 14 8.10 14 8.84 8 7.04
6 6 7.24 6 6.13 6 6.08 8 5.25
7 4 4.26 4 3.10 4 5.39 19 12.50
8 12 10.84 12 9.13 12 8.15 8 5.56
9 7 4.82 7 7.26 7 6.42 8 7.91
10 5 5.68 5 4.74 5 5.73 8 6.89

This dataset contains 4 groups of data: (X1, Y1), to (X4, Y4).

It is not hard to notice that X1, X2, and X3 are identical. But that doesn't matter now.
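The claim about identical X columns can be checked in code instead of by eye. A small sketch that rebuilds only the four X columns from the data above and compares them with Series.equals (the variable names are made up here):

```python
import pandas as pd

shared_x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
xs = pd.DataFrame({
    'X1': shared_x,
    'X2': shared_x,
    'X3': shared_x,
    'X4': [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
})

# Series.equals is True only if every element matches
same_12 = xs.X1.equals(xs.X2)
same_13 = xs.X1.equals(xs.X3)
same_14 = xs.X1.equals(xs.X4)

print(same_12, same_13, same_14)
```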

We can take a quick look at the data by creating scatterplots, one per group.

In [27]:
fig, axs = plt.subplots(2, 2, figsize=(12,9))
for i, ax in enumerate(axs.flat):
    j = i + 1
    ax.scatter(df['X%d'%j], df['Y%d'%j])
    ax.set_title('(%d)'%j)
    ax.set_xlabel('X%d'%j)
    ax.set_ylabel('Y%d'%j)
    ax.grid(True)

2. Compute Summary Statistics

Pandas DataFrames and Series come with a number of convenient methods for computing basic summary statistics. Try the following commands. Their names are mostly self-explanatory.

In [3]:
df.mean()
Out[3]:
X1    9.000000
Y1    7.500909
X2    9.000000
Y2    7.500909
X3    9.000000
Y3    7.500000
X4    9.000000
Y4    7.500909
dtype: float64
In [4]:
df.median()
Out[4]:
X1    9.00
Y1    7.58
X2    9.00
Y2    8.14
X3    9.00
Y3    7.11
X4    8.00
Y4    7.04
dtype: float64
In [5]:
df.std()
Out[5]:
X1    3.316625
Y1    2.031568
X2    3.316625
Y2    2.031657
X3    3.316625
Y3    2.030424
X4    3.316625
Y4    2.030579
dtype: float64
In [6]:
df.max()
Out[6]:
X1    14.00
Y1    10.84
X2    14.00
Y2     9.26
X3    14.00
Y3    12.74
X4    19.00
Y4    12.50
dtype: float64
In [7]:
# This one computes the maximum along axis 1 (i.e., across the columns of each row).
df.max(axis=1)
Out[7]:
0     10.00
1      8.14
2     13.00
3      9.00
4     11.00
5     14.00
6      8.00
7     19.00
8     12.00
9      8.00
10     8.00
dtype: float64
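The axis argument is easier to follow on a table small enough to check by hand. A minimal sketch with a made-up two-by-two DataFrame:

```python
import pandas as pd

small = pd.DataFrame({'a': [1, 5],
                      'b': [4, 2]})

col_max = small.max()          # default axis=0: one result per column
row_max = small.max(axis=1)    # axis=1: one result per row

print(col_max.tolist(), row_max.tolist())
```

With the default, the result is indexed by column name; with axis=1 it is indexed by row label, which is why the output above has one entry for each of the 11 rows.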
In [8]:
df.var()
Out[8]:
X1    11.000000
Y1     4.127269
X2    11.000000
Y2     4.127629
X3    11.000000
Y3     4.122620
X4    11.000000
Y4     4.123249
dtype: float64

You can also apply these functions to a single column (i.e., a Series). For example:

In [9]:
df.X1.mean()
Out[9]:
9.0
In [10]:
# This one is slightly more complicated... But you can figure this out easily.
(df.X1 + df.X2).mean()
Out[10]:
18.0
In [11]:
# And finally...
df.describe()
Out[11]:
X1 Y1 X2 Y2 X3 Y3 X4 Y4
count 11.000000 11.000000 11.000000 11.000000 11.000000 11.000000 11.000000 11.000000
mean 9.000000 7.500909 9.000000 7.500909 9.000000 7.500000 9.000000 7.500909
std 3.316625 2.031568 3.316625 2.031657 3.316625 2.030424 3.316625 2.030579
min 4.000000 4.260000 4.000000 3.100000 4.000000 5.390000 8.000000 5.250000
25% 6.500000 6.315000 6.500000 6.695000 6.500000 6.250000 8.000000 6.170000
50% 9.000000 7.580000 9.000000 8.140000 9.000000 7.110000 8.000000 7.040000
75% 11.500000 8.570000 11.500000 8.950000 11.500000 7.980000 8.000000 8.190000
max 14.000000 10.840000 14.000000 9.260000 14.000000 12.740000 19.000000 12.500000

3. Binning, Grouping, and Histograms

To quickly understand the distribution of the data, it is a good idea to bin and group the values and to create histograms.

3.1 Binning and Grouping

Use the following code to create bins based on X1's values, and check the mean value of both X1 and Y1 within each bin.

In [12]:
# Since we have very few data points, I will only use 3 bins (by setting num = 4)
bins_by_x1 = np.linspace(start=3, stop=15, num=4)
groups_by_x1 = df[['X1','Y1']].groupby(pd.cut(df.X1, bins_by_x1))
groups_by_x1.mean()
Out[12]:
X1 Y1
X1
(3, 7] 5.5 5.5000
(7, 11] 9.5 8.0325
(11, 15] 13.0 9.4600

The above result shows the mean value of X1 and Y1 binned by the value of X1.

To understand what is going on in the above commands, feel free to print out the intermediate variables, including:

  • bins_by_x1
  • pd.cut(df.X1, bins_by_x1)
  • groups_by_x1
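The intermediate variables listed above can be inspected with a short, self-contained sketch. It rebuilds only the X1 column; passing observed=False to groupby is an assumption made here so that every bin is reported in bin order.

```python
import numpy as np
import pandas as pd

x1 = pd.Series([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], name='X1')

# num=4 gives 4 evenly spaced edges, which delimit 3 bins
bins_by_x1 = np.linspace(start=3, stop=15, num=4)

# pd.cut replaces each value with the interval (bin) it falls into
labels = pd.cut(x1, bins_by_x1)

# Counting the members of each bin
bin_sizes = x1.groupby(labels, observed=False).size()

print(bins_by_x1)
print(labels.head())
print(bin_sizes)
```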

Also try the following commands:

In [13]:
groups_by_x1.median()
Out[13]:
X1 Y1
X1
(3, 7] 5.5 5.250
(7, 11] 9.5 8.185
(11, 15] 13.0 9.960
In [14]:
groups_by_x1.size().to_frame(name='count')
Out[14]:
count
X1
(3, 7] 4
(7, 11] 4
(11, 15] 3

In addition to using existing aggregate functions (i.e., max, min, mean, median, size, etc.), you can also define custom functions to "apply" to the grouping object.

The following should generate the same result as the previous one. Try to figure out how it works.

In [15]:
groups_by_x1.apply(lambda x: len(x)).to_frame(name='count')
Out[15]:
count
X1
(3, 7] 4
(7, 11] 4
(11, 15] 3
In [16]:
# Or, alternatively, and more confusingly ...
groups_by_x1.apply(lambda x: pd.Series({'count': len(x)}))
Out[16]:
count
X1
(3, 7] 4
(7, 11] 4
(11, 15] 3
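A custom function does not have to count rows. As a further sketch, the hypothetical helper value_range below computes the spread of Y1 (max minus min) within each X1 bin, using the X1 and Y1 columns from the data above; observed=False is an assumption made here so that all bins are kept.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'X1': [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5],
    'Y1': [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68],
})
bins_by_x1 = np.linspace(start=3, stop=15, num=4)
groups_by_x1 = df.groupby(pd.cut(df.X1, bins_by_x1), observed=False)

def value_range(series):
    """Spread of the values in one group."""
    return series.max() - series.min()

# apply() calls value_range once per group, passing that group's Y1 values
y1_range = groups_by_x1['Y1'].apply(value_range)
print(y1_range)
```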

3.2 Creating Histograms

To create a histogram, simply use .hist() on a Series object. It automatically handles binning, grouping, and counting.

By default, hist() generates 10 bins. You may customize this by specifying the bins parameter, for example, hist(bins=5).

Try the following.

In [17]:
df.Y1.hist(bins=5)
Out[17]:
<matplotlib.axes._subplots.AxesSubplot at 0x243f07715f8>
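To see the numbers behind such a histogram without drawing it, NumPy's histogram function can be applied to the same values. This is a sketch that assumes five equal-width bins between the smallest and largest Y1 value, which is what bins=5 requests.

```python
import numpy as np

y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]

# counts[i] is the number of values between edges[i] and edges[i + 1]
counts, edges = np.histogram(y1, bins=5)

print(counts)
print(edges)
```

Five bins always produce six edges, and the counts add up to the number of data points.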

You may also call hist() on a DataFrame object to obtain a panel of histograms, one for each column.

In [18]:
_ = df[['Y1','Y2','Y3','Y4']].hist(bins=5, figsize=(10,6))

In the above example, notice two things:

  • I used figsize=(10, 6) to specify the size of the figure (the unit is inches, although the %matplotlib inline configuration automatically shrinks figures by a predefined ratio).
  • I put "_ =" in front to avoid displaying the return value of df.hist, which I do not care about. If you are curious, it is an array of matplotlib axes objects.
    • In general, _ can be used whenever you want to ignore the return value of a function.
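The same convention works when a function returns several values and only some of them are needed. A minimal sketch with a made-up helper function:

```python
def min_and_max(values):
    """Return the smallest and the largest value as a pair."""
    return min(values), max(values)

# Keep the maximum, discard the minimum by assigning it to _
_, largest = min_and_max([3, 1, 2])

print(largest)
```

The underscore is an ordinary variable name; using it simply signals to the reader that the value is intentionally unused.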

You can also make histograms that show side-by-side bars for multiple variables. For the following example, I used matplotlib's hist function instead of the pandas equivalent, because it makes side-by-side histograms like this easier to produce.

In [19]:
bins = np.linspace(start=2, stop=12, num=6)
plt.hist(df[['Y1','Y2','Y3','Y4']].values, 
         bins, 
         label=['Y1','Y2','Y3','Y4'])
plt.legend(loc="upper left")
plt.grid(True)

3.3 Boxplots

A boxplot (or box-and-whisker diagram) is another good way of depicting the distribution of numerical data: it shows the median, the interquartile range (IQR), and whiskers that reach toward the minimum and maximum, with outliers drawn as separate points.

Using pandas, it is very easy to create a boxplot of multiple variables.

In [20]:
_ = df[['X1','Y1']].boxplot()
C:\Users\fara\Anaconda3\lib\site-packages\ipykernel\__main__.py:1: FutureWarning: 
The default value for 'return_type' will change to 'axes' in a future release.
 To use the future behavior now, set return_type='axes'.
 To keep the previous behavior and silence this warning, set return_type='dict'.
  if __name__ == '__main__':
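The quantities a boxplot draws can also be computed directly. The sketch below does this for Y1; the whisker limits use the factor 1.5, which is matplotlib's default, and the variable names are made up for illustration.

```python
import pandas as pd

y1 = pd.Series([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])

# The box spans the first to third quartile; the line inside is the median
q1, median, q3 = y1.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1

# Points beyond these limits would be drawn as outliers
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr
outliers = y1[(y1 < lower_limit) | (y1 > upper_limit)]

print(q1, median, q3, iqr, len(outliers))
```

The quartiles agree with the 25%, 50%, and 75% rows that df.describe() reports for Y1.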