Python中的Pandas-十分钟搞定pandas

Python Data Analysis Library 或 pandas 是基于NumPy 的一种工具，该工具是为了解决数据分析任务而创建的。Pandas 纳入了大量库和一些标准的数据模型，提供了高效地操作大型数据集所需的工具。pandas提供了大量能使我们快速便捷地处理数据的函数和方法。你很快就会发现，它是使Python成为强大而高效的数据分析环境的重要因素之一。

Pandas[1] 是python的一个数据分析包，最初由AQR Capital Management于2008年4月开发，并于2009年底开源出来，目前由专注于Python数据包开发的PyData开发team继续开发和维护，属于PyData项目的一部分。Pandas最初被作为金融数据分析工具而开发出来，因此，pandas为时间序列分析提供了很好的支持。 Pandas的名称来自于面板数据（panel data）和python数据分析（data analysis）。panel data是经济学中关于多维数据集的一个术语，在Pandas中也提供了panel的数据类型。

http://pandas.pydata.org

十分钟搞定pandas

pandas

pandas download

sudo easy_install pandas

Matplotlib

Matplotlib 是一个 Python 的 2D绘图库，它以各种硬拷贝格式和跨平台的交互式环境生成出版质量级别的图形。

通过 Matplotlib，开发者可以仅需要几行代码，便可以生成绘图，直方图，功率谱，条形图，错误图，散点图等。

matplotlib download

sudo easy_install matplotlib

>>> import pandas as pd
>>> import numpy as np
>>> import matplotlib.pylot as plt
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ImportError: No module named pylot
>>> import matplotlib.pyplot as plt
>>> s = pd.Series([1,3,5,np,nan,6,8])
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
NameError: name 'nan' is not defined
>>> s = pd.Series([1,3,5,np.nan,6,8])
>>> s
0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64
>>> dates = pd.date_range('20170616',periods=6)
>>> dates
DatetimeIndex(['2017-06-16', '2017-06-17', '2017-06-18', '2017-06-19',
'2017-06-20', '2017-06-21'],
dtype='datetime64[ns]', freq='D')
>>> df = pd.DataFrame(np.random.randn(6,4),index=dates,columns=list('ABCD'))
>>> df
A         B         C         D
2017-06-16  0.106189 -0.188820 -0.278583 -1.599889
2017-06-17 -2.056375  2.038176  1.364875 -1.292091
2017-06-18  1.227845 -0.469293  0.270563  1.139910
2017-06-19 -0.024595  2.890697  0.096862 -0.997049
2017-06-20 -0.966905  0.176658  0.760456  0.511681
2017-06-21 -0.239669  0.260096 -1.502962 -0.713180
>>> df2 = pd.DataFrame({'A':1.,})
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Library/Python/2.7/site-packages/pandas-0.20.1-py2.7-macosx-10.12-intel.egg/pandas/core/frame.py", line 275, in __init__
mgr = self._init_dict(data, index, columns, dtype=dtype)
File "/Library/Python/2.7/site-packages/pandas-0.20.1-py2.7-macosx-10.12-intel.egg/pandas/core/frame.py", line 411, in _init_dict
return _arrays_to_mgr(arrays, data_names, index, columns, dtype=dtype)
File "/Library/Python/2.7/site-packages/pandas-0.20.1-py2.7-macosx-10.12-intel.egg/pandas/core/frame.py", line 5594, in _arrays_to_mgr
index = extract_index(arrays)
File "/Library/Python/2.7/site-packages/pandas-0.20.1-py2.7-macosx-10.12-intel.egg/pandas/core/frame.py", line 5633, in extract_index
raise ValueError('If using all scalar values, you must pass'
ValueError: If using all scalar values, you must pass an index
>>> df2 = pd.DataFrame({'A':1.,'B':pd.Timestamp('20160616'),'C':pd.Series(1,index=list(range(4)),dtype='float32'),'D':np.array([3]*4,dtype='int32'),'E':pd.Categorical(["test","train","test","train"]),'F':'foo'})
>>> df2
A          B    C  D      E    F
0  1.0 2016-06-16  1.0  3   test  foo
1  1.0 2016-06-16  1.0  3  train  foo
2  1.0 2016-06-16  1.0  3   test  foo
3  1.0 2016-06-16  1.0  3  train  foo
>>> df2.dtypes
A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object
>>> df2.<TAB>
File "<stdin>", line 1
df2.<TAB>
^
SyntaxError: invalid syntax
>>> df2.
File "<stdin>", line 1
df2.	
^
SyntaxError: invalid syntax
>>> df.head()
A         B         C         D
2017-06-16  0.106189 -0.188820 -0.278583 -1.599889
2017-06-17 -2.056375  2.038176  1.364875 -1.292091
2017-06-18  1.227845 -0.469293  0.270563  1.139910
2017-06-19 -0.024595  2.890697  0.096862 -0.997049
2017-06-20 -0.966905  0.176658  0.760456  0.511681
>>> df.tail(3)
A         B         C         D
2017-06-19 -0.024595  2.890697  0.096862 -0.997049
2017-06-20 -0.966905  0.176658  0.760456  0.511681
2017-06-21 -0.239669  0.260096 -1.502962 -0.713180
>>> df.index
DatetimeIndex(['2017-06-16', '2017-06-17', '2017-06-18', '2017-06-19',
'2017-06-20', '2017-06-21'],
dtype='datetime64[ns]', freq='D')
>>> df.columns
Index([u'A', u'B', u'C', u'D'], dtype='object')
>>> df.values
array([[ 0.1061891 , -0.18881996, -0.27858265, -1.59988881],
[-2.05637483,  2.03817581,  1.36487471, -1.29209099],
[ 1.22784542, -0.46929269,  0.27056344,  1.13991027],
[-0.02459463,  2.89069652,  0.09686167, -0.99704863],
[-0.96690471,  0.17665782,  0.76045631,  0.51168099],
[-0.23966907,  0.26009644, -1.502962  , -0.7131797 ]])
>>> df.describe()
A         B         C         D
count  6.000000  6.000000  6.000000  6.000000
mean  -0.325585  0.784586  0.118535 -0.491769
std    1.104941  1.354370  0.977340  1.080931
min   -2.056375 -0.469293 -1.502962 -1.599889
25%   -0.785096 -0.097451 -0.184722 -1.218330
50%   -0.132132  0.218377  0.183713 -0.855114
75%    0.073493  1.593656  0.637983  0.205466
max    1.227845  2.890697  1.364875  1.139910
>>> df.I
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Library/Python/2.7/site-packages/pandas-0.20.1-py2.7-macosx-10.12-intel.egg/pandas/core/generic.py", line 2970, in __getattr__
return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'I'
>>> df.T
2017-06-16  2017-06-17  2017-06-18  2017-06-19  2017-06-20  2017-06-21
A    0.106189   -2.056375    1.227845   -0.024595   -0.966905   -0.239669
B   -0.188820    2.038176   -0.469293    2.890697    0.176658    0.260096
C   -0.278583    1.364875    0.270563    0.096862    0.760456   -1.502962
D   -1.599889   -1.292091    1.139910   -0.997049    0.511681   -0.713180
>>> df.sort_index(axis=1,ascending=False)
D         C         B         A
2017-06-16 -1.599889 -0.278583 -0.188820  0.106189
2017-06-17 -1.292091  1.364875  2.038176 -2.056375
2017-06-18  1.139910  0.270563 -0.469293  1.227845
2017-06-19 -0.997049  0.096862  2.890697 -0.024595
2017-06-20  0.511681  0.760456  0.176658 -0.966905
2017-06-21 -0.713180 -1.502962  0.260096 -0.239669
>>> df.sort(column='B')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Library/Python/2.7/site-packages/pandas-0.20.1-py2.7-macosx-10.12-intel.egg/pandas/core/generic.py", line 2970, in __getattr__
return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'sort'
>>> df.sort(columns='B')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Library/Python/2.7/site-packages/pandas-0.20.1-py2.7-macosx-10.12-intel.egg/pandas/core/generic.py", line 2970, in __getattr__
return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'sort'
>>> df.sort(column='B')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Library/Python/2.7/site-packages/pandas-0.20.1-py2.7-macosx-10.12-intel.egg/pandas/core/generic.py", line 2970, in __getattr__
return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'sort'
>>> df['A']
2017-06-16    0.106189
2017-06-17   -2.056375
2017-06-18    1.227845
2017-06-19   -0.024595
2017-06-20   -0.966905
2017-06-21   -0.239669
Freq: D, Name: A, dtype: float64
>>> df[0:3]]
File "<stdin>", line 1
df[0:3]]
^
SyntaxError: invalid syntax
>>> df[0:3]
A         B         C         D
2017-06-16  0.106189 -0.188820 -0.278583 -1.599889
2017-06-17 -2.056375  2.038176  1.364875 -1.292091
2017-06-18  1.227845 -0.469293  0.270563  1.139910
>>> df['20170616':'20170619']
A         B         C         D
2017-06-16  0.106189 -0.188820 -0.278583 -1.599889
2017-06-17 -2.056375  2.038176  1.364875 -1.292091
2017-06-18  1.227845 -0.469293  0.270563  1.139910
2017-06-19 -0.024595  2.890697  0.096862 -0.997049
>>> df.loc[dates[0]]
A    0.106189
B   -0.188820
C   -0.278583
D   -1.599889
Name: 2017-06-16 00:00:00, dtype: float64
>>> df.loc[:['A','B']]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Library/Python/2.7/site-packages/pandas-0.20.1-py2.7-macosx-10.12-intel.egg/pandas/core/indexing.py", line 1328, in __getitem__
return self._getitem_axis(key, axis=0)
File "/Library/Python/2.7/site-packages/pandas-0.20.1-py2.7-macosx-10.12-intel.egg/pandas/core/indexing.py", line 1506, in _getitem_axis
return self._get_slice_axis(key, axis=axis)
File "/Library/Python/2.7/site-packages/pandas-0.20.1-py2.7-macosx-10.12-intel.egg/pandas/core/indexing.py", line 1356, in _get_slice_axis
slice_obj.step, kind=self.name)
File "/Library/Python/2.7/site-packages/pandas-0.20.1-py2.7-macosx-10.12-intel.egg/pandas/core/indexes/datetimes.py", line 1515, in slice_indexer
return Index.slice_indexer(self, start, end, step, kind=kind)
File "/Library/Python/2.7/site-packages/pandas-0.20.1-py2.7-macosx-10.12-intel.egg/pandas/core/indexes/base.py", line 3301, in slice_indexer
kind=kind)
File "/Library/Python/2.7/site-packages/pandas-0.20.1-py2.7-macosx-10.12-intel.egg/pandas/core/indexes/base.py", line 3495, in slice_locs
end_slice = self.get_slice_bound(end, 'right', kind)
File "/Library/Python/2.7/site-packages/pandas-0.20.1-py2.7-macosx-10.12-intel.egg/pandas/core/indexes/base.py", line 3432, in get_slice_bound
slc = self._get_loc_only_exact_matches(label)
File "/Library/Python/2.7/site-packages/pandas-0.20.1-py2.7-macosx-10.12-intel.egg/pandas/core/indexes/base.py", line 3401, in _get_loc_only_exact_matches
return self.get_loc(key)
File "/Library/Python/2.7/site-packages/pandas-0.20.1-py2.7-macosx-10.12-intel.egg/pandas/core/indexes/datetimes.py", line 1435, in get_loc
stamp = Timestamp(key, tz=self.tz)
File "pandas/_libs/tslib.pyx", line 402, in pandas._libs.tslib.Timestamp.__new__ (pandas/_libs/tslib.c:10051)
File "pandas/_libs/tslib.pyx", line 1528, in pandas._libs.tslib.convert_to_tsobject (pandas/_libs/tslib.c:28851)
TypeError: Cannot convert input [['A', 'B']] of type <type 'list'> to Timestamp
>>> df.loc[:,['A','B']]
A         B
2017-06-16  0.106189 -0.188820
2017-06-17 -2.056375  2.038176
2017-06-18  1.227845 -0.469293
2017-06-19 -0.024595  2.890697
2017-06-20 -0.966905  0.176658
2017-06-21 -0.239669  0.260096
>>>