Learning objectives:
Here we are using IPython version 3 in the Jupyter Notebook.
We will set up the notebook and load the additional software library we will need: "pandas".
import pandas as pd
The Jupyter Notebook understands shell commands as well, and we'll use this to find our data. An exclamation mark at the start of a line signals that what follows is a shell command rather than Python code.
# print working directory
!pwd
# list files in 'data/'
!ls data/
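The `!` prefix only works inside IPython or the notebook. As a minimal sketch, the same information can be obtained from plain Python with the standard library `os` module; the directory name `data` is taken from the command above and is checked first because it may not exist where you run this.

```python
import os

# equivalent of !pwd: the current working directory
print(os.getcwd())

# equivalent of !ls data/: list the directory only if it is actually there
data_dir = 'data'
if os.path.isdir(data_dir):
    print(sorted(os.listdir(data_dir)))
else:
    print('no directory called', data_dir, 'in', os.getcwd())
```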
We can use the 'read_csv' function of Pandas to read the datafile into a pandas DataFrame. A DataFrame will look familiar to anyone who has used Excel or SPSS.
# read the csv
df = pd.read_csv('data/tf1_visneuro_2016_Feb_02_1234.csv')
# print the first 5 lines of the DataFrame
df.head()
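Besides `head()`, it can help to check how many rows and columns were read and which type pandas inferred for each column. The sketch below uses a tiny made-up CSV held in memory instead of the real data file; the values are illustrative assumptions, only the column names are borrowed from later in this notebook.

```python
import io
import pandas as pd

# hypothetical stand-in for the real data file: a small CSV held in memory
csv_text = """type_of_image,key_resp_3.rt
cat,0.52
dog,0.48
cat,0.61
"""

example = pd.read_csv(io.StringIO(csv_text))

print(example.shape)   # (number of rows, number of columns)
print(example.dtypes)  # the type pandas inferred for each column
```

Running `df.shape` and `df.dtypes` on the real DataFrame works the same way.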
# drop the first row (under the header)
# the inplace option allows us to make changes to the dataframe directly
# rather than having to save the output to another variable
df.drop(0, inplace=True)
df.head()
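To make the effect of `inplace` concrete, here is a small sketch on a made-up DataFrame (the column name `rt` and its values are assumptions for illustration). Note that `drop(0)` removes the row whose index label is 0, not simply the first row in the table.

```python
import pandas as pd

example = pd.DataFrame({'rt': [9.9, 0.5, 0.6]})  # made-up values

# without inplace, drop returns a new DataFrame and leaves the original alone
trimmed = example.drop(0)
print(len(example), len(trimmed))

# with inplace=True the original is modified and nothing is returned
example.drop(0, inplace=True)
print(len(example))
print(list(example.index))  # the remaining rows keep their original labels
```

Because the label 0 no longer exists after the drop, re-running a cell containing `df.drop(0, inplace=True)` raises a `KeyError`.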
# drop all columns that contain only NA values (how='all')
df.dropna(axis=1, inplace=True, how='all')
df.head()
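The `how` option decides how strict the drop is. A minimal sketch on a made-up DataFrame (all column names and values here are illustrative assumptions) shows the difference between `how='all'` and `how='any'`:

```python
import numpy as np
import pandas as pd

example = pd.DataFrame({
    'rt': [0.5, 0.6, np.nan],           # one missing value
    'empty': [np.nan, np.nan, np.nan],  # nothing but missing values
    'image': ['cat', 'dog', 'cat'],     # complete
})

# how='all' removes a column only if every value in it is missing
only_all = example.dropna(axis=1, how='all')
# how='any' removes a column as soon as a single value is missing
any_na = example.dropna(axis=1, how='any')

print(list(only_all.columns))
print(list(any_na.columns))
```

With `how='all'`, as used above, partially filled columns such as `rt` are kept.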
# save out new dataframe as csv
df.to_csv('data/ID001.csv', index=False)
# to read back in you can use:
#df = pd.read_csv('data/ID001.csv')
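The `index=False` argument matters when the file is read back in. As a sketch on a made-up DataFrame, written to an in-memory string rather than to disk, this is what happens with and without it:

```python
import io
import pandas as pd

example = pd.DataFrame({'rt': [0.5, 0.6]}, index=[1, 2])  # made-up values

# to_csv without a path returns the CSV text instead of writing a file
with_index = example.to_csv()              # index written as an unnamed first column
without_index = example.to_csv(index=False)

print(pd.read_csv(io.StringIO(with_index)).columns.tolist())
print(pd.read_csv(io.StringIO(without_index)).columns.tolist())
```

Writing the index adds an extra `Unnamed: 0` column on reading, which is why it is left out above.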
**Now that we have cleaned the data, it might be helpful to visualise it.
To do this we need to import some packages.**
Matplotlib is a Python library that produces quality figures and graphs from data.
Seaborn is a Python library based on matplotlib, which provides additional graphical modifications for matplotlib figures and graphs.
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns # seaborn
# to see the plots in the workspace we use:
%matplotlib inline
# these are seaborn commands, they are used for aesthetics
sns.set_color_codes()
sns.set_style('whitegrid')
Create a basic plot of all the columns
df.plot()
The plot is good, but it doesn't actually tell us much.
Let's use the package matplotlib.pyplot ('plt'), but first, find out what the function 'plot' in the package does.
plt.plot?
To plot a single column (e.g. reaction time), put the name of the column (the header) as a string in square brackets immediately after the DataFrame name, as shown below. The column header acts as the key for selecting that column. In this example, the plot will include a legend.
df['key_resp_3.rt'].plot(legend=True)
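Seaborn (sns) has a nice distribution plot to visualise whether the responses are normally distributed.
# note: distplot is deprecated since seaborn 0.11; histplot(..., kde=True) is the suggested replacement there
sns.distplot(df['key_resp_3.rt'])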
It is clear from the graphs that the first response time is much longer than the rest. This is most likely because it was the first response, so it does not reflect a genuine reaction time. We should filter it out so it doesn't overly affect the result.
df_clean = df[df['key_resp_3.rt'] < .9]
sns.distplot(df_clean['key_resp_3.rt'])
It's still not looking particularly normal but better. Maybe less than 0.3 would be better still...
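Before settling on a cutoff it is worth checking how many rows each one keeps. The sketch below compares the two cutoffs mentioned above on made-up reaction times (the values are assumptions, not the real data); the expression in square brackets is a boolean mask with one True/False entry per row.

```python
import pandas as pd

# made-up reaction times in seconds; the first is an obvious outlier
example = pd.DataFrame({'key_resp_3.rt': [2.4, 0.25, 0.31, 0.28, 0.95, 0.22]})

for cutoff in (0.9, 0.3):
    mask = example['key_resp_3.rt'] < cutoff  # boolean Series, one entry per row
    kept = example[mask]                      # only the rows where the mask is True
    print(cutoff, ':', len(kept), 'of', len(example), 'rows kept')
```

A stricter cutoff gives a tidier distribution but throws away more responses, so the count is a useful sanity check.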
We can plot the reaction time by the type of image by calling groupby with the column name 'type_of_image' in parentheses before selecting the reaction time column.
df_clean.groupby('type_of_image')['key_resp_3.rt'].plot(legend=True)
Creating a variable that groups reaction time by the animal type gives us some more functionality.
animal_type_group = df_clean.groupby('type_of_image')['key_resp_3.rt']
# then we can plot that variable
animal_type_group.plot(legend = True)
This doesn't tell us much, but we can now print out the statistics for that variable.
animal_type_group.mean()
print('Mean: ',animal_type_group.mean())
print('SD: ',animal_type_group.std())
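Several statistics can also be collected into one table with `agg`. This is a minimal sketch on made-up data (the image types and reaction times are assumptions for illustration):

```python
import pandas as pd

example = pd.DataFrame({
    'type_of_image': ['cat', 'cat', 'dog', 'dog'],
    'key_resp_3.rt': [0.25, 0.75, 0.5, 0.5],  # made-up reaction times
})

grouped = example.groupby('type_of_image')['key_resp_3.rt']

# one row per group, one column per statistic
summary = grouped.agg(['mean', 'std', 'count'])
print(summary)
```

The same call on `animal_type_group` would give the mean, standard deviation and number of responses per image type side by side.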
We can also visualise the reaction time by the type of image as a boxplot
df_clean.boxplot('key_resp_3.rt', by='type_of_image')
and save plot to the current working directory
plt.savefig('ID001_boxplot.png')
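With `%matplotlib inline`, figures are normally closed at the end of the cell that drew them, so `plt.savefig` only captures the boxplot if it runs in the same cell. One way around this, sketched here on made-up data, is to create the figure explicitly and save through the figure object. The output file name and the temporary directory are assumptions for illustration.

```python
import os
import tempfile

import matplotlib
matplotlib.use('Agg')  # only needed outside a notebook; omit after %matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd

example = pd.DataFrame({
    'type_of_image': ['cat', 'cat', 'dog', 'dog'],
    'key_resp_3.rt': [0.25, 0.75, 0.5, 0.6],  # made-up reaction times
})

# create the figure ourselves so we keep a handle on it
fig, ax = plt.subplots()
example.boxplot('key_resp_3.rt', by='type_of_image', ax=ax)

# hypothetical output location; save through the figure, not through plt
out_path = os.path.join(tempfile.gettempdir(), 'example_boxplot.png')
fig.savefig(out_path)
plt.close(fig)
```

Because `fig` stays available, `fig.savefig(...)` can also be called from a later cell.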