CSV¶

Reading, writting and analysing CSV files is achived by using the pandas library.

Opening a CSV¶

The reading of CSV files is done by using the read_csv method.

import pandas as pd

data = pd.read_csv('./data/nina34.csv', 
                   delim_whitespace=True,  # use spaces as delimiter
                   skipfooter=3,   # skips the last 2 lines
                   na_values=-99.99,  # sets missing values  
                   engine='python'  # sets engine to Python (default C does not support skip footer)
                   )
data

	JAN	FEB	MAR	APR	MAY	JUN	JUL	AUG	SEP	OCT	NOV	DEC
1948	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1949	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1950	24.55	25.06	25.87	26.28	26.18	26.46	26.29	25.88	25.74	25.69	25.47	25.29
1951	25.24	25.71	26.90	27.58	27.92	27.73	27.60	27.02	27.23	27.20	27.25	26.91
1952	26.67	26.74	27.17	27.80	27.79	27.18	26.53	26.30	26.36	26.26	25.92	26.21
...	...	...	...	...	...	...	...	...	...	...	...	...
2016	29.11	29.01	28.90	28.72	28.23	27.69	26.82	26.28	26.14	25.98	25.94	26.10
2017	26.12	26.67	27.32	28.03	28.30	28.06	27.54	26.70	26.29	26.15	25.74	25.62
2018	25.57	25.97	26.48	27.31	27.73	27.77	27.42	26.94	27.19	27.62	27.61	27.49
2019	27.19	27.46	28.09	28.44	28.48	28.18	27.64	26.90	26.75	27.20	27.22	27.12
2020	27.18	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

73 rows × 12 columns

It returns a pandas.DataFrame object.

To get the names of the line and columns:

data.index

Int64Index([1948, 1949, 1950, 1951, 1952, 1953, 1954, 1955, 1956, 1957, 1958,
            1959, 1960, 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969,
            1970, 1971, 1972, 1973, 1974, 1975, 1976, 1977, 1978, 1979, 1980,
            1981, 1982, 1983, 1984, 1985, 1986, 1987, 1988, 1989, 1990, 1991,
            1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002,
            2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013,
            2014, 2015, 2016, 2017, 2018, 2019, 2020],
           dtype='int64')

data.columns

Index(['JAN', 'FEB', 'MAR', 'APR', 'MAY', 'JUN', 'JUL', 'AUG', 'SEP', 'OCT',
       'NOV', 'DEC'],
      dtype='object')

To display some lines at the beginning or at the end:

data.head(3)

	JAN	FEB	MAR	APR	MAY	JUN	JUL	AUG	SEP	OCT	NOV	DEC
1948	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1949	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1950	24.55	25.06	25.87	26.28	26.18	26.46	26.29	25.88	25.74	25.69	25.47	25.29

data.tail(3)

	JAN	FEB	MAR	APR	MAY	JUN	JUL	AUG	SEP	OCT	NOV	DEC
2018	25.57	25.97	26.48	27.31	27.73	27.77	27.42	26.94	27.19	27.62	27.61	27.49
2019	27.19	27.46	28.09	28.44	28.48	28.18	27.64	26.90	26.75	27.20	27.22	27.12
2020	27.18	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

Data extraction¶

To extract data from the DataFrame, you can either

extract one column
use column/row names
use column/row indexes

Extracting one column¶

To extract a whole column, we can provide a list of column names as follows:

col = data[['JAN', 'FEB']]
col

	JAN	FEB
1948	NaN	NaN
1949	NaN	NaN
1950	24.55	25.06
1951	25.24	25.71
1952	26.67	26.74
...	...	...
2016	29.11	29.01
2017	26.12	26.67
2018	25.57	25.97
2019	27.19	27.46
2020	27.18	NaN

73 rows × 2 columns

Using names¶

Extracting data using column and row names is done by using the loc method.

dataex = data.loc[:, ['JAN', 'FEB']]
dataex

	JAN	FEB
1948	NaN	NaN
1949	NaN	NaN
1950	24.55	25.06
1951	25.24	25.71
1952	26.67	26.74
...	...	...
2016	29.11	29.01
2017	26.12	26.67
2018	25.57	25.97
2019	27.19	27.46
2020	27.18	NaN

73 rows × 2 columns

dataex = data.loc[[1950, 1960], :]
dataex

	JAN	FEB	MAR	APR	MAY	JUN	JUL	AUG	SEP	OCT	NOV	DEC
1950	24.55	25.06	25.87	26.28	26.18	26.46	26.29	25.88	25.74	25.69	25.47	25.29
1960	26.27	26.29	26.98	27.49	27.68	27.24	26.88	26.70	26.44	26.22	26.26	26.22

dataex = data.loc[1950:1953, ['JAN', 'FEB']]
dataex

	JAN	FEB
1950	24.55	25.06
1951	25.24	25.71
1952	26.67	26.74
1953	26.74	27.00

Using indexes¶

Extracting data using column and row names is done by using the iloc method.

dataex = data.iloc[:5, 0]
dataex

    NaN
    NaN
  24.55
  25.24
  26.67
Name: JAN, dtype: float64

dataex = data.iloc[2, :]
dataex

JAN    24.55
FEB    25.06
MAR    25.87
APR    26.28
MAY    26.18
JUN    26.46
JUL    26.29
AUG    25.88
SEP    25.74
OCT    25.69
NOV    25.47
DEC    25.29
Name: 1950, dtype: float64

dataex = data.iloc[slice(2, 6), [0, 1]]
dataex

	JAN	FEB
1950	24.55	25.06
1951	25.24	25.71
1952	26.67	26.74
1953	26.74	27.00

dataex = data.iloc[slice(2, 6), :].loc[:, ['OCT', 'NOV']]
dataex

	OCT	NOV
1950	25.69	25.47
1951	27.20	27.25
1952	26.26	25.92
1953	26.87	26.88

Extracting data arrays¶

To extract the data arrays, use the values attributes.

array = data.values
array.shape

(73, 12)

Plotting¶

pandas comes with some functions to draw quick plots.

import matplotlib.pyplot as plt

l = data.loc[:, ['JAN', 'FEB']].plot()

l = data.loc[1970, :].plot()

l = data.T.loc[:, 1995:2000].plot()

Creating dataframes¶

To create a data frame is done by using the pandas.DataFrame method.

import numpy as np

# init a date object: 10 elements with a 1h interval
date = pd.date_range('1/1/2012', periods=10, freq='H')

x = np.arange(10)
y = np.arange(10)*0.5
cat = ['A']*2 + ['C'] + ['A'] + 3*['B'] + ['C'] + ['D'] + ['A']

data = pd.DataFrame({'xvalue': x,
                     'yvalue': y,
                     'cat': cat},
                     index=date)
data

	xvalue	yvalue	cat
2012-01-01 00:00:00	0	0.0	A
2012-01-01 01:00:00	1	0.5	A
2012-01-01 02:00:00	2	1.0	C
2012-01-01 03:00:00	3	1.5	A
2012-01-01 04:00:00	4	2.0	B
2012-01-01 05:00:00	5	2.5	B
2012-01-01 06:00:00	6	3.0	B
2012-01-01 07:00:00	7	3.5	C
2012-01-01 08:00:00	8	4.0	D
2012-01-01 09:00:00	9	4.5	A

Mathematical operations¶

Mathematical operations can be done by using the available pandas methods. Note that it is done only on numerical types. By default, the mean over all the rows is performed:

datam = data.loc[:, ['xvalue', 'yvalue']].mean()
datam

xvalue    4.50
yvalue    2.25
dtype: float64

But you can also compute means over columns:

# mean over the second dimension (columns)
datam = data.loc[:, ['xvalue', 'yvalue']].mean(axis=1)
datam

2012-01-01 00:00:00    0.00
2012-01-01 01:00:00    0.75
2012-01-01 02:00:00    1.50
2012-01-01 03:00:00    2.25
2012-01-01 04:00:00    3.00
2012-01-01 05:00:00    3.75
2012-01-01 06:00:00    4.50
2012-01-01 07:00:00    5.25
2012-01-01 08:00:00    6.00
2012-01-01 09:00:00    6.75
Freq: H, dtype: float64

There is also the possibility to do some treatments depending on the value of a caterogical variable (here, the column called cat).

data_sorted = data.sort_values(by="cat")
data_sorted

	xvalue	yvalue	cat
2012-01-01 00:00:00	0	0.0	A
2012-01-01 01:00:00	1	0.5	A
2012-01-01 03:00:00	3	1.5	A
2012-01-01 09:00:00	9	4.5	A
2012-01-01 04:00:00	4	2.0	B
2012-01-01 05:00:00	5	2.5	B
2012-01-01 06:00:00	6	3.0	B
2012-01-01 02:00:00	2	1.0	C
2012-01-01 07:00:00	7	3.5	C
2012-01-01 08:00:00	8	4.0	D

You can count the occurrences:

data.groupby("cat").size()

cat
A    4
B    3
C    2
D    1
dtype: int64

data.groupby("cat").mean()

	xvalue	yvalue
cat
A	3.25	1.625
B	5.00	2.500
C	4.50	2.250
D	8.00	4.000

data.groupby("cat").std()

	xvalue	yvalue
cat
A	4.031129	2.015564
B	1.000000	0.500000
C	3.535534	1.767767
D	NaN	NaN

Writting a CSV¶

Writting a CSV file is done by calling the DataFrame.to_csv method.

data.to_csv('data/example.csv', sep=';')

Introduction to Python