BASIC¶

2+2

print(4+6)

print 5+5  # doesn't work in Python 3 (SyntaxError); use parentheses: print(5+5)

4  #int
4.3 # float
"hello world" # string
True #boolean
False #boolean
None # no value (Python's null object)

type(4) # type() is useful for figuring out what kind of data you have

type(5.4)

type("hello world")

Converting between types¶

int(6.42)  #convert float -> int

str(5)  # convert int -> string
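A minimal sketch of a few more conversions, to make the behaviour of int() and str() concrete. The variable names are only for illustration.

```python
# Conversions between the basic types shown above.
as_int = int(6.42)        # float -> int drops the fractional part
as_neg = int(-6.42)       # truncation is toward zero, not rounding down
as_str = str(5)           # int -> string
as_float = float("3.2")   # numeric string -> float
rounded = round(6.62)     # use round() when you want the nearest integer

print(as_int, as_neg, as_str, as_float, rounded)
```

Note that int() truncates rather than rounds, which is easy to overlook when converting measurements.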

#group question 
#How can we make this work? 
print(5 + " is a number") # does not work: raises TypeError

# group question
type(1/2)   # float, int, string?

logic in python¶

5 > 4

# 5 >= 4, 5 <= 4, 5 == 4, 5 < 4, 5> 4
5 == 4 # False
5.0 == 5  # True: int and float are compared by value
5.0 == "5"  # False: a number never equals a string

(5 > 4) and (5 < 3)

#group question
(5 > 4) or (5 < 3)

Variables¶

x = 3.2
y = "hello world"

helloHappyPeople = "hello world"

helloHappyPeople  # tab to autocomplete

z = "3.2"
#group question 
x == z # False or True?

#group question
x = True
y = False
z = False

if x or y and z:
    print('yes')
else:
    print('no')   

# will the output be yes or no?

#group question
x or (y and z)

#group question
x or y or z
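The questions above hinge on operator precedence: `and` binds more tightly than `or`. A small sketch with different values (the names a, b, c are hypothetical) shows how the grouping can change the result.

```python
a, b, c = True, True, False

implicit = a or b and c      # parsed as a or (b and c)
explicit = a or (b and c)    # same grouping, written out
regrouped = (a or b) and c   # different grouping, different answer

print(implicit, explicit, regrouped)
```

When in doubt, adding parentheses makes the intended grouping explicit and costs nothing.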

tuples/lists/dictionaries¶

names_tuples = ('alice','bob','sam')  # tuple: cannot be modified (immutable)
names_list  = ['alice','bob','sam']  # list: can be changed (mutable)

names_tuples.append('jake')  # fails with AttributeError: tuples have no append()

names_list.append('jake')
names_list
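A short sketch of what immutability means in practice. It uses a separate, hypothetical variable `names` so it does not disturb the cells above.

```python
names = ('alice', 'bob', 'sam')   # hypothetical tuple, same contents as names_tuples

try:
    names.append('jake')          # tuples have no append method
except AttributeError as err:
    error_message = str(err)

extended = names + ('jake',)      # builds a NEW tuple; the original is unchanged
as_list = list(names)             # or convert to a list when you need to mutate
as_list.append('jake')

print(error_message)
print(names, extended, as_list)
```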

names_list.insert(3,'chelsea')  # chelsea inserted at position 3
names_list

names_list[1]  # to get bob

names_list.pop()  # pull from end

names_list

names_list.index('sam')  #tell me where sam is in the list

len(names_list)  # give me the length  --> 3 or 4

a = [[1,2],[4,5]]
a[1][0]  # will I get 1,2,4, or 5?

#group question
a = [1,2,3,None,(),[],] # what is len(a)? 4,5,6,7 or error?

# concat lists
list1 = [1,2,3,4]
list2 = [5,6,7,8]
list1 + list2

#group question
#What is the answer? 
"apple" + "bana"

Objects in Python¶

class Person:
    
    fullname = None   # this is a field
    weight = 0 # also a field
    height = 0 # also a field
    
    def __init__(self,name,w,h):  # constructor (make a new person)
        self.fullname = name
        self.weight = w
        self.height = h
    
    def getFullName(self):  # give me your name
        return(self.fullname)
    
    def setWeight(self,newweight):  # update your weight
        self.weight = newweight
        
    def getBMI(self):
        return(int(703*self.weight/(self.height*self.height)))

alex = Person("alex",150,68)  # make a new instance of person

jane = Person("jane",130,68)  # make a new instance

alex.getFullName()   # what is your name?

jane.getFullName()   # what is your name?

alex.getBMI()  # what is your BMI?

alex.setWeight(190)

alex.getBMI()
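To check the two getBMI() results by hand, the same arithmetic can be written as a standalone function. This is only a sketch: the name bmi_imperial is hypothetical, and it assumes weight is in pounds and height in inches, which is what the factor 703 implies but the notebook does not state.

```python
def bmi_imperial(weight_lb, height_in):
    """BMI from pounds and inches, truncated to an int like Person.getBMI."""
    return int(703 * weight_lb / (height_in * height_in))

bmi_before = bmi_imperial(150, 68)   # alex at 150
bmi_after = bmi_imperial(190, 68)    # alex after setWeight(190)
print(bmi_before, bmi_after)
```

Because of int(), the value is truncated, so 22.8 is reported as 22.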

Pandas 101¶

import numpy as np  # load the libraries and object definitions we need
import pandas as pd
from pandas import DataFrame, Series
# tell the notebook to render plots inline, below the cell
%matplotlib inline
# load the plotting library and call it plt
import matplotlib.pyplot as plt

# define a small class roster with students, years, and grades
myclass = pd.DataFrame({'student': ['alice', 'bob', 'louis', 'jen'],
                        'year': [4, 4, 3, 3],
                        'grade': [10, 9, 10, 10]})

myclass   # show me what the class looks like

myclass.shape  # how many rows and columns

myclass.columns  # give me the column names

myclass.year.unique()  # give me the unique years

pd.crosstab(myclass.grade,myclass.year)  # count how many students fall into each grade/year combination

plt.hist(myclass.grade)

pd.crosstab(myclass.year,myclass.grade)  #order reversed (x/y)

myclass.info()

myclass.describe()

myclass.T  # get the transpose

myclass = myclass.set_index('student')  # make a new dataframe based on myclass, BUT with student as the main/index key
myclass

yrg = myclass.groupby('year')  # partition into groups I care about
yrg.describe()  # describe them  (statistical props. of each group)
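describe() returns many statistics at once. As a sketch, a single statistic per group can be pulled out directly. The frame is rebuilt here under the hypothetical name `demo` so the block runs on its own.

```python
import pandas as pd

# Rebuild the small roster so this block is self-contained.
demo = pd.DataFrame({'student': ['alice', 'bob', 'louis', 'jen'],
                     'year': [4, 4, 3, 3],
                     'grade': [10, 9, 10, 10]}).set_index('student')

mean_grade = demo.groupby('year')['grade'].mean()   # one mean per year
group_sizes = demo.groupby('year').size()           # number of students per year

print(mean_grade)
print(group_sizes)
```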

myclass.grade >= 10

goodgrades = myclass.grade >= 10  # created a filter for students with good grades
goodgrades

myclass[goodgrades]  # apply the filter

myclass

myclass.year == 4  # all people who are seniors

(myclass.year == 4) & (myclass.grade >= 10)   # & is the element-wise "and" for boolean Series

flt = (myclass.year == 4) & (myclass.grade >= 10)  # make the filtering criteria
myclass[flt]
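The sketch below illustrates why filters are combined with `&`, `|`, and `~` rather than the plain `and`, `or`, and `not` keywords. The frame is rebuilt under the hypothetical name `demo` so the block is self-contained.

```python
import pandas as pd

demo = pd.DataFrame({'student': ['alice', 'bob', 'louis', 'jen'],
                     'year': [4, 4, 3, 3],
                     'grade': [10, 9, 10, 10]})

is_senior = demo['year'] == 4
good = demo['grade'] >= 10

both = demo[is_senior & good]      # element-wise and
either = demo[is_senior | good]    # element-wise or
not_good = demo[~good]             # element-wise not

try:
    is_senior and good             # the keyword needs one True/False, not a Series
except ValueError as err:
    and_error = str(err)

print(both['student'].tolist(), either['student'].tolist(), not_good['student'].tolist())
print(and_error)
```

The parentheses around each comparison in the cells above matter too: `&` binds more tightly than `==` and `>=`, so leaving them out changes what is compared.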

olive_oil = pd.read_csv('olive.csv')  # load up the file

1. Setup Environment and Import Data¶

import matplotlib.pyplot as plt
import pandas as pd #this is how I usually import pandas
import sys #only needed to determine Python version number
import matplotlib #only needed to determine Matplotlib version number

# Enable inline plotting
%matplotlib inline

1.1 Install `statsmodels` package.¶

statsmodels is a Python package that provides various statistical analysis tools. We will use statsmodels for this lab. The package does not come with Anaconda by default, but we can easily install it.

Open Terminal (or cmd in Windows)
Choose an appropriate option:
- If python3 is your default python, then type
```
conda install statsmodels
```
- If python3 is NOT your default python, which means you had to activate your Python environment before each session (e.g., "source activate python3" on Mac or "activate python3" on Windows), then do the following
```
conda install --name=python3 statsmodels
```
  where python3 should be whatever the name of your conda environment is.
Follow the instructions to complete installation. Then come back to this notebook. The rest of the lab will be done within this notebook.
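Once you are back in the notebook, one way to confirm that the installation is visible to the kernel is to ask Python whether it can locate the package. This is a minimal sketch; the helper name is_installed is hypothetical.

```python
import importlib.util

def is_installed(package_name):
    """Return True if the current interpreter can locate the package."""
    return importlib.util.find_spec(package_name) is not None

print("statsmodels installed:", is_installed("statsmodels"))
```

If this prints False, the package was probably installed into a different conda environment than the one running the notebook.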

1.2 Import File¶

#Even though this function has many parameters, we will simply pass it the location of the text file.
#Location = C:\Users\fatem_000\OneDrive\Academic\2014 Summer\TA\599-VIS-Fall2016\Lab2\seeds-subset.csv
#Note: Depending on where you save your notebooks, you may need to modify the location above.

Location = r'C:\Users\fara\OneDrive\Academic\2014 Summer\TA\599-VIS-Fall2016\Lab2\olive.csv'
olive_oil = pd.read_csv(Location)  # load up the file

olive_oil

1.3 Import Data¶

Let us start with the necessary imports and data loading. Just execute every cell below. Remember: a convenient shortcut to run a cell and jump to the next one is Shift + Enter.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

# This is the Anscombe's quartet
# Source - Wikipedia: https://en.wikipedia.org/wiki/Anscombe%27s_quartet
from io import StringIO

TESTDATA=StringIO("""X1,Y1,X2,Y2,X3,Y3,X4,Y4
10,8.04,10,9.14,10,7.46,8,6.58
8,6.95,8,8.14,8,6.77,8,5.76
13,7.58,13,8.74,13,12.74,8,7.71
9,8.81,9,8.77,9,7.11,8,8.84
11,8.33,11,9.26,11,7.81,8,8.47
14,9.96,14,8.1,14,8.84,8,7.04
6,7.24,6,6.13,6,6.08,8,5.25
4,4.26,4,3.1,4,5.39,19,12.5
12,10.84,12,9.13,12,8.15,8,5.56
7,4.82,7,7.26,7,6.42,8,7.91
5,5.68,5,4.74,5,5.73,8,6.89""")

df = pd.read_csv(TESTDATA)  # DataFrame.from_csv was removed from pandas; read_csv is the replacement
df

This dataset contains 4 groups of data: (X1, Y1), to (X4, Y4).

It is not hard to notice that X1, X2, and X3 are identical. But that doesn't matter now.
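The claim about the X columns can be checked rather than eyeballed. As a sketch, the block below copies only the X values from the data above into a hypothetical frame `x` and compares the columns with Series.equals.

```python
import pandas as pd

shared_x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x = pd.DataFrame({
    'X1': shared_x,
    'X2': shared_x,
    'X3': shared_x,
    'X4': [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
})

same_12 = x['X1'].equals(x['X2'])   # compares values element by element
same_13 = x['X1'].equals(x['X3'])
same_14 = x['X1'].equals(x['X4'])

print(same_12, same_13, same_14)
```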

We can do a quick vis of the data by creating scatterplots.

fig, axs = plt.subplots(2, 2, figsize=(12,9))
for i, ax in enumerate(axs.flat):
    j = i + 1
    ax.scatter(df['X%d'%j], df['Y%d'%j])
    ax.set_title('(%d)'%j)
    ax.set_xlabel('X%d'%j)
    ax.set_ylabel('Y%d'%j)
    ax.grid(True)

2. Compute Summary Statistics¶

Pandas DataFrames and Series come with a couple of convenient functions for computing basic summary statistics. Try the following commands. Their meanings are quite self-explanatory.

df.mean()

X1    9.000000
Y1    7.500909
X2    9.000000
Y2    7.500909
X3    9.000000
Y3    7.500000
X4    9.000000
Y4    7.500909
dtype: float64

df.median()

X1    9.00
Y1    7.58
X2    9.00
Y2    8.14
X3    9.00
Y3    7.11
X4    8.00
Y4    7.04
dtype: float64

df.std()

X1    3.316625
Y1    2.031568
X2    3.316625
Y2    2.031657
X3    3.316625
Y3    2.030424
X4    3.316625
Y4    2.030579
dtype: float64

df.max()

X1    14.00
Y1    10.84
X2    14.00
Y2     9.26
X3    14.00
Y3    12.74
X4    19.00
Y4    12.50
dtype: float64

# This one computes the maximum along axis 1, i.e., across the columns, giving one value per row.
df.max(axis=1)

0     10.00
1      8.14
2     13.00
3      9.00
4     11.00
5     14.00
6      8.00
7     19.00
8     12.00
9      8.00
10     8.00
dtype: float64
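The axis argument is easier to see on a tiny table. The sketch below uses a hypothetical two-by-two frame called `small`.

```python
import pandas as pd

small = pd.DataFrame({'a': [1, 5],
                      'b': [4, 2]})

col_max = small.max()         # default axis=0: one value per column
row_max = small.max(axis=1)   # axis=1: one value per row

print(col_max)
print(row_max)
```

The result of the default call is indexed by column name, while the axis=1 result is indexed by row label, which matches the two outputs shown above.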

df.var()

X1    11.000000
Y1     4.127269
X2    11.000000
Y2     4.127629
X3    11.000000
Y3     4.122620
X4    11.000000
Y4     4.123249
dtype: float64

You can also apply these functions to a single column (i.e., a Series). For example:

df.X1.mean()

9.0

# This one is slightly more complicated... But you can figure this out easily.
(df.X1 + df.X2).mean()

18.0

# And finally...
df.describe()

3. Binning, Grouping, and Histograms¶

To quickly understand the distribution of the data, it is a good idea to bin and group the values and to create histograms.

3.1 Binning and Grouping¶

Use the following code to create bins based on X1's values, and check the mean value of both X1 and Y1 within each bin.

# Since we have very few data points, I will only use 3 bins (num=4 edges define 3 bins)
bins_by_x1 = np.linspace(start=3, stop=15, num=4)
groups_by_x1 = df[['X1','Y1']].groupby(pd.cut(df.X1, bins_by_x1))
groups_by_x1.mean()

The above result shows the mean value of X1 and Y1 binned by the value of X1.

To understand what is going on in the above commands, feel free to print out the intermediate variables, including:

bins_by_x1
pd.cut(df.X1, bins_by_x1)
groups_by_x1
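As a rough guide to what those intermediate values look like, here is a self-contained sketch that repeats the binning on the X1 values alone. The names x1, edges, labels, codes, and counts are hypothetical, and `labels=False` is used only to get plain integer bin numbers that are easy to count.

```python
import numpy as np
import pandas as pd

x1 = pd.Series([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5])

edges = np.linspace(start=3, stop=15, num=4)           # 4 edges -> 3 bins
labels = pd.cut(x1, edges)                             # an interval label per value
codes = pd.cut(x1, edges, labels=False).astype(int)    # same bins as integers 0..2
counts = np.bincount(codes.to_numpy())                 # how many values per bin

print(edges)
print(labels.head())
print(counts)
```

By default the intervals are closed on the right, so the value 7 lands in (3, 7] rather than (7, 11].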

Also try the following commands:

groups_by_x1.median()

groups_by_x1.size().to_frame(name='count')

In addition to using existing aggregate functions (i.e., max, min, mean, median, size, etc.), you can also define custom functions to "apply" to the grouping object.

The following should generate the same result as the previous one. Try to figure out how it works.

groups_by_x1.apply(lambda x: len(x)).to_frame(name='count')

# Or, alternatively, and more confusingly ...
groups_by_x1.apply(lambda x: pd.Series({'count': len(x)}))
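A custom function does not have to count rows. As a sketch, the block below computes the spread (max minus min) of Y1 within each bin. It uses agg on one column rather than apply, which is a swapped-in choice, and it passes `observed=True` explicitly; the names data, groups, and value_range are hypothetical.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    'X1': [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5],
    'Y1': [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68],
})

edges = np.linspace(start=3, stop=15, num=4)
groups = data.groupby(pd.cut(data['X1'], edges), observed=True)

def value_range(series):
    """Spread of the values in one group."""
    return series.max() - series.min()

y1_range = groups['Y1'].agg(value_range)   # one number per bin
print(y1_range)
```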

3.2 Creating Histograms¶

To create a histogram, simply use .hist() on a Series object. It automatically handles binning, grouping, and counting.

By default, hist() generates 10 bins. You may customize this by specifying the bins parameter, for example, hist(bins=5).
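To see what bins=5 means numerically, the counts behind such a histogram can be computed without plotting. This sketch uses NumPy's histogram function on the Y1 values, on the assumption that the plot uses the same equal-width binning between the minimum and maximum.

```python
import numpy as np

y1 = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])

counts, edges = np.histogram(y1, bins=5)   # 5 equal-width bins from min to max

print(counts)   # number of values in each bin
print(edges)    # 6 edges delimit 5 bins
```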

Try the following.

df.Y1.hist(bins=5)

<matplotlib.axes._subplots.AxesSubplot at 0x243f07715f8>

You may also call hist() on a DataFrame object to obtain a panel of histograms, one for each column.

_ = df[['Y1','Y2','Y3','Y4']].hist(bins=5, figsize=(10,6))

In the above example, notice two things:

- I used figsize=(10, 6) to specify the size of the figure (the unit is inches, although the %matplotlib inline configuration automatically scales figures down by a predefined ratio).
- I put "_ =" in front to avoid seeing the value returned by df.hist, which I do not care about. If you are curious, it is an array of matplotlib axes objects.
  - In general, _ can be used whenever you want to ignore the return value of a function.

You can also make histograms that have side-by-side bars for multiple variables. For the following example, I used matplotlib's hist function instead of the pandas equivalent, because it makes side-by-side histograms like this easier to create.

bins = np.linspace(start=2, stop=12, num=6)
plt.hist(df[['Y1','Y2','Y3','Y4']].values, 
         bins, 
         label=['Y1','Y2','Y3','Y4'])
plt.legend(loc="upper left")
plt.grid(True)

3.3 Boxplots¶

Boxplots (or box-and-whisker diagrams) are another good way of depicting the distribution of numerical data: they show the median, the interquartile range (IQR), and the extent of the data (whiskers and any outliers).

Using pandas, it is very easy to create a boxplot of multiple variables.

_ = df[['X1','Y1']].boxplot()
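The numbers a boxplot draws can also be computed directly. The sketch below does this for Y1, assuming the common convention that whiskers reach at most 1.5 × IQR beyond the box (matplotlib's default); the variable names are hypothetical.

```python
import pandas as pd

y1 = pd.Series([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])

q1, median, q3 = y1.quantile([0.25, 0.5, 0.75])   # box edges and the middle line
iqr = q3 - q1                                     # height of the box

lower_fence = q1 - 1.5 * iqr                      # points beyond the fences
upper_fence = q3 + 1.5 * iqr                      # would be drawn as outliers
outliers = y1[(y1 < lower_fence) | (y1 > upper_fence)]

print(q1, median, q3, iqr)
print(len(outliers))
```

The quartiles agree with the 25%, 50%, and 75% rows that df.describe() reports for Y1.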


	Unnamed: 0	region	area	palmitic	palmitoleic	stearic	oleic	linoleic	linolenic	arachidic	eicosenoic
0	1.North-Apulia	1	1	1075	75	226	7823	672	36	60	29
1	2.North-Apulia	1	1	1088	73	224	7709	781	31	61	29
2	3.North-Apulia	1	1	911	54	246	8113	549	31	63	29
3	4.North-Apulia	1	1	966	57	240	7952	619	50	78	35
4	5.North-Apulia	1	1	1051	67	259	7771	672	50	80	46
5	6.North-Apulia	1	1	911	49	268	7924	678	51	70	44
6	7.North-Apulia	1	1	922	66	264	7990	618	49	56	29
7	8.North-Apulia	1	1	1100	61	235	7728	734	39	64	35
8	9.North-Apulia	1	1	1082	60	239	7745	709	46	83	33
9	10.North-Apulia	1	1	1037	55	213	7944	633	26	52	30
10	11.North-Apulia	1	1	1051	35	219	7978	605	21	65	24
11	12.North-Apulia	1	1	1036	59	235	7868	661	30	62	44
12	13.North-Apulia	1	1	1074	70	214	7728	747	50	79	33
13	14.North-Apulia	1	1	875	52	243	8018	655	41	79	32
14	15.North-Apulia	1	1	952	49	254	7795	780	50	75	41
15	16.North-Apulia	1	1	1155	98	201	7606	816	32	60	29
16	17.North-Apulia	1	1	943	94	183	7840	788	42	75	31
17	18.North-Apulia	1	1	1278	69	205	7344	957	45	70	28
18	19.North-Apulia	1	1	961	70	195	7958	742	46	75	30
19	20.North-Apulia	1	1	952	77	258	7820	736	43	78	33
20	21.North-Apulia	1	1	1074	67	236	7692	716	56	83	45
21	22.North-Apulia	1	1	995	46	288	7806	679	56	86	40
22	23.North-Apulia	1	1	1056	53	247	7703	700	54	89	51
23	24.North-Apulia	1	1	1065	39	234	7876	703	42	74	26
24	25.North-Apulia	1	1	1065	45	245	7779	696	47	82	38
25	26.Calabria	1	2	1315	139	230	7299	832	42	60	32
26	27.Calabria	1	2	1321	136	217	7174	950	43	63	30
27	28.Calabria	1	2	1359	115	246	7234	874	45	63	18
28	29.Calabria	1	2	1378	111	272	7127	940	46	64	23
29	30.Calabria	1	2	1295	109	245	7253	903	43	62	38
...	...	...	...	...	...	...	...	...	...	...	...
542	543.West-Liguria	3	8	1020	100	290	7620	960	0	10	2
543	544.West-Liguria	3	8	970	90	220	7700	1020	0	0	3
544	545.West-Liguria	3	8	1180	130	220	7450	1010	0	10	2
545	546.West-Liguria	3	8	1060	140	240	7690	850	10	10	1
546	547.West-Liguria	3	8	990	100	250	7630	1030	0	0	3
547	548.West-Liguria	3	8	1010	90	350	7630	940	10	0	3
548	549.West-Liguria	3	8	1040	90	250	7780	820	10	10	1
549	550.West-Liguria	3	8	1040	90	250	7810	810	10	10	2
550	551.West-Liguria	3	8	1020	90	350	7620	920	10	0	3
551	552.West-Liguria	3	8	1020	90	260	7620	1010	0	0	3
552	553.West-Liguria	3	8	1010	90	350	7610	930	10	0	3
553	554.West-Liguria	3	8	920	110	340	7720	910	0	0	3
554	555.West-Liguria	3	8	1030	100	250	7710	900	0	10	2
555	556.West-Liguria	3	8	960	90	300	7820	830	0	0	3
556	557.West-Liguria	3	8	1030	110	210	7810	840	0	0	1
557	558.West-Liguria	3	8	1010	100	240	7710	910	10	20	2
558	559.West-Liguria	3	8	1020	90	240	7800	850	0	0	2
559	560.West-Liguria	3	8	1120	90	300	7650	830	0	10	1
560	561.West-Liguria	3	8	1090	90	290	7710	800	10	0	2
561	562.West-Liguria	3	8	1100	120	280	7630	770	10	10	2
562	563.West-Liguria	3	8	1090	80	240	7820	760	10	0	2
563	564.West-Liguria	3	8	1150	90	250	7720	810	0	10	3
564	565.West-Liguria	3	8	1110	90	230	7810	750	0	10	2
565	566.West-Liguria	3	8	1010	110	210	7720	950	0	0	1
566	567.West-Liguria	3	8	1070	100	220	7730	870	10	10	2
567	568.West-Liguria	3	8	1280	110	290	7490	790	10	10	2
568	569.West-Liguria	3	8	1060	100	270	7740	810	10	10	3
569	570.West-Liguria	3	8	1010	90	210	7720	970	0	0	2
570	571.West-Liguria	3	8	990	120	250	7750	870	10	10	2
571	572.West-Liguria	3	8	960	80	240	7950	740	10	20	2

	X1	Y1
X1
(3, 7]	5.5	5.5000
(7, 11]	9.5	8.0325
(11, 15]	13.0	9.4600

	X1	Y1
X1
(3, 7]	5.5	5.250
(7, 11]	9.5	8.185
(11, 15]	13.0	9.960

	count
X1
(3, 7]	4
(7, 11]	4
(11, 15]	3

	count
X1
(3, 7]	4
(7, 11]	4
(11, 15]	3

	X1	Y1	X2	Y2	X3	Y3	X4	Y4
0	10	8.04	10	9.14	10	7.46	8	6.58
1	8	6.95	8	8.14	8	6.77	8	5.76
2	13	7.58	13	8.74	13	12.74	8	7.71
3	9	8.81	9	8.77	9	7.11	8	8.84
4	11	8.33	11	9.26	11	7.81	8	8.47
5	14	9.96	14	8.10	14	8.84	8	7.04
6	6	7.24	6	6.13	6	6.08	8	5.25
7	4	4.26	4	3.10	4	5.39	19	12.50
8	12	10.84	12	9.13	12	8.15	8	5.56
9	7	4.82	7	7.26	7	6.42	8	7.91
10	5	5.68	5	4.74	5	5.73	8	6.89

	X1	Y1	X2	Y2	X3	Y3	X4	Y4
count	11.000000	11.000000	11.000000	11.000000	11.000000	11.000000	11.000000	11.000000
mean	9.000000	7.500909	9.000000	7.500909	9.000000	7.500000	9.000000	7.500909
std	3.316625	2.031568	3.316625	2.031657	3.316625	2.030424	3.316625	2.030579
min	4.000000	4.260000	4.000000	3.100000	4.000000	5.390000	8.000000	5.250000
25%	6.500000	6.315000	6.500000	6.695000	6.500000	6.250000	8.000000	6.170000
50%	9.000000	7.580000	9.000000	8.140000	9.000000	7.110000	8.000000	7.040000
75%	11.500000	8.570000	11.500000	8.950000	11.500000	7.980000	8.000000	8.190000
max	14.000000	10.840000	14.000000	9.260000	14.000000	12.740000	19.000000	12.500000