Pandas Series: Fundamentals and Practical Usage of the Basic Data Structure

The most basic object in Pandas is a data structure called Series. Nearly all Pandas code builds on it, so it is worth getting a firm grasp of it first.

What is a Series object?

The official documentation describes the Series object as follows.

One-dimensional ndarray with axis labels  
(including time series)

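Put simply, a Series is a one-dimensional array built on NumPy's ndarray. It differs from a plain ndarray in the following ways.

Its index can use labels other than sequential integers.
The object itself can be given a name.
It can hold time data (it works with Pandas time series).

These points also apply to the differences between Pandas and NumPy in general.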
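
As a minimal sketch of the first two differences, the snippet below wraps the same values in an ndarray and in a Series. The city labels and the name "sales" are made up for this illustration.

```python
import numpy as np
import pandas as pd

# A plain ndarray is addressed by position only.
arr = np.array([100, 250, 10])

# The same values as a Series, with string labels and a name.
# The labels and the name "sales" are hypothetical example data.
sales = pd.Series(arr, index=["Tokyo", "Osaka", "Nagoya"], name="sales")

print(arr[1])          # positional access on the ndarray
print(sales["Osaka"])  # label-based access on the Series
print(sales.name)      # the name attached to the Series itself
```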

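A Series holds one set of data, and joining several Series together gives a DataFrame object. The DataFrame object is covered in the following article.

Understanding DataFrame, the Pandas object for storing data /features/pandas-dataframe.html

The API documentation for the Series object is as follows.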

class pandas.Series(data=None, index=None, dtype=None, name=None, copy=False, fastpath=False)

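Parameters:

Parameter	Type	Description
`data`	array-like, dict, or scalar value	(optional, default None) The data to store in the Series.
`index`	array-like or one-dimensional Index object	(optional, default None) The index label for each data value. Duplicate labels are allowed. If no index is given, labels are assigned in order (0, 1, 2, ...).
`dtype`	numpy.dtype or None	(optional, default None) The data type. If None, it is inferred from the values passed as `data`.
`copy`	bool	(optional, default False) Whether to use the input data by reference (False) or to make a copy (True).
`name`	string or number	(optional, default None) A label for the data set itself.

Most of these can be guessed from the argument names.

The fastpath argument is rarely used and is not mentioned in the official documentation, so it is omitted here.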
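
The following sketch exercises two points from the table: `dtype` overrides the inferred type, and `index` may contain duplicate labels. The labels and values are arbitrary example data.

```python
import pandas as pd

# Integers are given, but dtype forces float64 storage.
# The label "x" is deliberately repeated.
s = pd.Series(data=[1, 2, 3], index=["x", "x", "y"], dtype="float64")

print(s.dtype)            # float64
print(s.index.is_unique)  # False, because "x" appears twice

# Looking up a duplicated label returns every matching row as a Series.
print(s["x"])

# Looking up a unique label returns a single value.
print(s["y"])
```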

With that in mind, let's actually use the Series object.

Trying it out

Creating a Series object

A Series object can store an array-like object, a dictionary (dict), or a single scalar value.

Let's start by creating one from a simple list.

In [1]: import pandas as pd # import the pandas module

In [2]: a = pd.Series([1,2,3]) # create from a list

In [3]: a
Out[3]:
0    1
1    2
2    3
dtype: int64

In [4]: array = [1., 2., 3.] # floats work too, of course

In [5]: b = pd.Series(array)

In [6]: b
Out[6]:
0    1.0
1    2.0
2    3.0
dtype: float64

A Series can also be created from a one-dimensional NumPy array.

In [7]: import numpy as np # import the numpy module

In [8]: np_array = np.array([1,2,3])

In [9]: c = pd.Series(np_array)

In [10]: c
Out[10]:
0    1
1    2
2    3
dtype: int64

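Next, create one from a dictionary. The output below lists the keys in sorted order; with pandas 0.23 or later on Python 3.6 or later, the insertion order of the dictionary is kept instead.

In [15]: dic = {"Tokyo": 100, "Osaka": 250, "Nagoya": 10} # create a dict object

In [16]: d = pd.Series(dic)

In [17]: d # the dict keys become the index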
Out[17]:
Nagoya     10
Osaka     250
Tokyo     100
dtype: int64

A single value can also be passed.

In [20]: e = pd.Series(1) # store just the value 1

In [21]: e
Out[21]:
0    1
dtype: int64

Strings can be stored as well.

In [22]: f = pd.Series(["A", "B", "C"])

In [23]: f
Out[23]:
0    A
1    B
2    C
dtype: object

Values of different types can also be stored together.

In [26]: g = pd.Series(['A', 1, 1.0, None])

In [27]: g
Out[27]:
0       A
1       1
2       1
3    None
dtype: object

None represents a missing value.
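
As a small sketch of what "missing" means in practice, isnull() flags the None entry and dropna() removes it. The Series mirrors the mixed-type example above.

```python
import pandas as pd

g = pd.Series(["A", 1, 1.0, None])

# isnull() returns a boolean Series that is True where a value is missing.
mask = g.isnull()
print(mask)

# dropna() returns a new Series without the missing entries.
print(g.dropna())
```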

Specifying the index

Next, let's specify the index. By default, the labels 0, 1, 2, ... are assigned in the order the values were given. They are generated automatically by the RangeIndex class.

In [29]: series = pd.Series([5,4,3,2,1])

In [30]: series
Out[30]:
0    5
1    4
2    3
3    2
4    1
dtype: int64

In [31]: series.index # show the index
Out[31]: RangeIndex(start=0, stop=5, step=1)

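This means the index is the sequence that starts at start=0, ends at stop=5, and advances with step=1. Note that stop itself is not included in the index.

Because this object only assigns sequential labels, NumPy's np.arange() function can be used in its place.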

In [41]: series_2 = pd.Series([5,4,3,2,1], index=np.arange(5))

In [42]: series_2
Out[42]:
0    5
1    4
2    3
3    2
4    1
dtype: int64

In [43]: series_2.index
Out[43]: Int64Index([0, 1, 2, 3, 4], dtype='int64')

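An Int64Index object is one of the Index classes implemented in Pandas, used when the index consists only of integers. Labels are generally held in a plain Index object, and specialized classes like this one are used in particular cases. (Int64Index was removed in pandas 2.0; newer versions show Index([0, 1, 2, 3, 4], dtype='int64') instead.)

Pandas chooses the kind of index object on its own, so it is enough to know that the class varies with the contents of the index.

Next, let's use strings as index labels.

In [53]: series_3 = pd.Series([5,4,3,2,1], index=['a','b','c','d','e']) # label the values a, b, c, d, e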

In [54]: series_3
Out[54]:
a    5
b    4
c    3
d    2
e    1
dtype: int64

In [55]: series_3.index
Out[55]: Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

That worked. As mentioned earlier, passing a dictionary as the data specifies the index at the same time.

In [56]: dic_2 = {'a':5, 'b':4, 'c':3, 'd':2, 'e':1}

In [57]: series_4 = pd.Series(dic_2)

In [58]: series_4
Out[58]:
a    5
b    4
c    3
d    2
e    1
dtype: int64

In [59]: series_4.index
Out[59]: Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

The index of an existing Series object can also be reassigned.

In [61]: series_2
Out[61]:
0    5
1    4
2    3
3    2
4    1
dtype: int64


In [62]: series_2.index = ['a', 'b', 'c', 'd','e']

In [63]: series_2
Out[63]:
a    5
b    4
c    3
d    2
e    1
dtype: int64

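Also, when a Series is created from a dictionary and an index is given as well, the dictionary is matched against that index: labels with no corresponding key get NaN, and keys that are not in the index are dropped.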

In [65]: dic_2
Out[65]: {'a': 5, 'b': 4, 'c': 3, 'd': 2, 'e': 1}

In [66]: series_5 = pd.Series(dic_2, index=['a','c','e','f'])

In [67]: series_5 # keys not listed in index ('b', 'd') are dropped; 'f' has no value, so it is NaN
Out[67]:
a    5.0
c    3.0
e    1.0
f    NaN
dtype: float64

In [68]: series_5.index = ['a', 'b', 'c', 'd', 'e'] # when reassigning later, the number of labels must equal the number of elements, or an error is raised

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-68-6c858c0842cf> in <module>()
----> 1 series_5.index = ['a', 'b', 'c', 'd', 'e']
(traceback omitted)

ValueError: Length mismatch: Expected axis has 4 elements, new values have 5 elements
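
When the goal is to realign a Series to a different set of labels rather than to rename the existing ones, reindex() is one option. The sketch below contrasts the two; the Series is a shortened stand-in for series_5, not the exact object from the session above.

```python
import pandas as pd

series_5 = pd.Series({"a": 5, "c": 3, "e": 1})

# Assigning to .index only relabels, so the lengths must match.
try:
    series_5.index = ["a", "b", "c", "d", "e"]
except ValueError as exc:
    print("ValueError:", exc)

# reindex() builds a new Series aligned to the requested labels;
# labels that have no value are filled with NaN.
realigned = series_5.reindex(["a", "b", "c", "d", "e"])
print(realigned)
```

The original series_5 is left untouched by both steps: the failed assignment changes nothing, and reindex() returns a new object.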

The `copy` argument

When copy is set to True and a Series is passed as data, a copy is made, so the new Series no longer refers to the original data.

In [89]: array_2 = [20,0,0,0,0]

In [90]: series_6 = pd.Series(array_2)

In [100]: series_6 # the source Series
Out[100]:
0    20
1     0
2     0
3     0
4     0
dtype: int64

In [101]: series_7 = pd.Series(series_6, copy=False) # copy=False here

In [102]: series_8 = pd.Series(series_6, copy=True) # copy=True here

In [103]: series_7
Out[103]:
0    20
1     0
2     0
3     0
4     0
dtype: int64

In [104]: series_7[0] = 11 # change a value in the copy=False Series

In [105]: series_7
Out[105]:
0    11
1     0
2     0
3     0
4     0
dtype: int64

In [106]: series_6 # the change is reflected in the original Series as well
Out[106]:
0    11
1     0
2     0
3     0
4     0
dtype: int64

In [107]: series_8 # with copy=True, such changes have no effect
Out[107]:
0    20
1     0
2     0
3     0
4     0
dtype: int64

In [108]: series_6[0] = 10

In [109]: series_7
Out[109]:
0    10
1     0
2     0
3     0
4     0
dtype: int64

In [110]: series_8
Out[110]:
0    20
1     0
2     0
3     0
4     0
dtype: int64

The copy() method can also create a copy.

In [111]: series_9 = series_6.copy()
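
A short sketch confirming that a Series made with copy() is independent of its source. The variable names are chosen for this example only.

```python
import pandas as pd

original = pd.Series([20, 0, 0, 0, 0])
duplicate = original.copy()  # copy() makes a deep copy by default

# Changing the copy leaves the original as it was.
duplicate.iloc[0] = 11

print(original.iloc[0])   # 20
print(duplicate.iloc[0])  # 11
```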

Extracting elements

Elements can also be extracted. Let's start by specifying the index.

In [70]: series
Out[70]:
0    5
1    4
2    3
3    2
4    1
dtype: int64

In [71]: series[0]
Out[71]: 5

In [72]: series[0:2] # slice notation works too
Out[72]:
0    5
1    4
dtype: int64

A condition can be used to extract only the even values.

In [73]: series[series%2 == 0]
Out[73]:
1    4
3    2
dtype: int64
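
The condition works because the comparison itself produces a boolean Series, which is then used as a mask. A minimal sketch that makes the intermediate mask visible and combines two conditions:

```python
import pandas as pd

series = pd.Series([5, 4, 3, 2, 1])

# The comparison yields one True/False per element.
mask = series % 2 == 0
print(mask.tolist())

# Indexing with the mask keeps only the rows where it is True.
evens = series[mask]
print(evens)

# Conditions are combined with & (and) or | (or); each needs parentheses.
small_evens = series[(series % 2 == 0) & (series < 4)]
print(small_evens)
```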

Extraction also works when the index consists of strings.

In [78]: series_3
Out[78]:
a    5
b    4
c    3
d    2
e    1
dtype: int64

In [79]: series_3['a']
Out[79]: 5

In [80]: series_3[['a','c']] # pass a list to extract several elements
Out[80]:
a    5
c    3
dtype: int64
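
The same lookups can be written with the explicit loc (label) and iloc (position) indexers. One detail worth a sketch: a label slice with loc includes its end point, while a positional slice with iloc does not.

```python
import pandas as pd

series_3 = pd.Series([5, 4, 3, 2, 1], index=["a", "b", "c", "d", "e"])

# Label-based slice: both "a" and "c" are included.
by_label = series_3.loc["a":"c"]
print(by_label)

# Position-based slice: position 2 is excluded, as in plain Python.
by_position = series_3.iloc[0:2]
print(by_position)
```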

Simple operations

A Series object supports NumPy-array-style operations.

In [81]: series
Out[81]:
0    5
1    4
2    3
3    2
4    1
dtype: int64

In [82]: series + 1
Out[82]:
0    6
1    5
2    4
3    3
4    2
dtype: int64

In [83]: series + pd.Series([1,1,2,2,2])
Out[83]:
0    6
1    5
2    5
3    4
4    3
dtype: int64

In [84]: series * 3
Out[84]:
0    15
1    12
2     9
3     6
4     3
dtype: int64

Aggregates such as the sum can be computed as well.

In [87]: series.sum()
Out[87]: 15

In [88]: series.std()
Out[88]: 1.5811388300841898
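
One behavior that differs from NumPy arrays: arithmetic between two Series matches elements by index label, not by position. The sketch below uses made-up labels to show this, along with add() and its fill_value parameter for labels present on only one side.

```python
import pandas as pd

left = pd.Series([1, 2, 3], index=["a", "b", "c"])
right = pd.Series([10, 20, 30], index=["b", "c", "d"])

# Labels found in only one operand ("a" and "d") produce NaN.
total = left + right
print(total)

# add() with fill_value treats a missing side as 0 instead.
filled = left.add(right, fill_value=0)
print(filled)
```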

Applying NumPy functions

Series works closely with NumPy, so numerical processing with NumPy functions is straightforward.

In [85]: np.sum(series) # compute the sum
Out[85]: 15

In [86]: np.log(series)
Out[86]:
0    1.609438
1    1.386294
2    1.098612
3    0.693147
4    0.000000
dtype: float64

Adding and changing values

Values can be added or changed by specifying an index label.

In [147]: series = pd.Series([0,0,0,0,0])

In [148]: series
Out[148]:
0    0
1    0
2    0
3    0
4    0
dtype: int64

In [149]: series[7] = 10

In [150]: series
Out[150]:
0     0
1     0
2     0
3     0
4     0
7    10
dtype: int64

In [151]: series['a'] = 11

In [152]: series
Out[152]:
0     0
1     0
2     0
3     0
4     0
7    10
a    11
dtype: int64

In [153]: series['a'] = 12 # overwriting also works

In [154]: series
Out[154]:
0     0
1     0
2     0
3     0
4     0
7    10
a    12
dtype: int64
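
The same add-or-overwrite behavior can be written with loc, which makes it explicit that the key is an index label. A minimal sketch with a shorter Series:

```python
import pandas as pd

series = pd.Series([0, 0, 0])

# Label 7 does not exist yet, so a new element is appended.
series.loc[7] = 10

# Label 0 already exists, so its value is overwritten.
series.loc[0] = 5

print(series)
```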

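Working with time series data

The time series functionality implemented in Pandas can be combined with a Series.
Let's create some date data and store it in a Series object.

In [177]: data = pd.date_range('2018/05/26', periods=10,freq='D') # create the time series data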

In [178]: data
Out[178]:
DatetimeIndex(['2018-05-26', '2018-05-27', '2018-05-28', '2018-05-29',
               '2018-05-30', '2018-05-31', '2018-06-01', '2018-06-02',
               '2018-06-03', '2018-06-04'],
              dtype='datetime64[ns]', freq='D')

In [179]: date_series = pd.Series(data)

In [180]: date_series
Out[180]:
0   2018-05-26
1   2018-05-27
2   2018-05-28
3   2018-05-29
4   2018-05-30
5   2018-05-31
6   2018-06-01
7   2018-06-02
8   2018-06-03
9   2018-06-04
dtype: datetime64[ns]

In [181]: date_series_2 = pd.Series(np.random.randn(10),index=data) # it can also be used as the index

In [182]: date_series_2
Out[182]:
2018-05-26   -0.335079
2018-05-27    0.099053
2018-05-28   -0.155142
2018-05-29    0.448569
2018-05-30   -0.839239
2018-05-31    0.768965
2018-06-01    0.320166
2018-06-02   -1.122765
2018-06-03    0.331456
2018-06-04   -1.453074
Freq: D, dtype: float64

Time series handling is one of the distinctive features of Pandas, so it will be covered in detail in a separate article.
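
As a brief taste of what a DatetimeIndex allows, the sketch below selects rows with date strings. It uses the same date range as above, but with sequential integers instead of random numbers so that the results are predictable.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2018-05-26", periods=10, freq="D")
ts = pd.Series(np.arange(10), index=dates)

# A full date string selects the single matching day.
one_day = ts.loc["2018-05-28"]

# A year-month string selects every row in that month.
june = ts.loc["2018-06"]

# A slice between two date strings includes both end points.
window = ts.loc["2018-05-30":"2018-06-01"]

print(one_day)
print(june)
print(window)
```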

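Series attributes

A Series has many attributes.
Some were added so that a Series can be used interchangeably with a DataFrame without problems.
The official documentation has a list, so it is reproduced here in the author's own simplified words.
Important ones such as iloc and loc will be covered on a separate page.
The list reflects pandas 0.23; several entries (asobject, base, blocks, data, ftype, ftypes, itemsize, ix, and strides) were later deprecated and are no longer available from pandas 1.0 onward.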

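Attribute	Description
`T`	Returns the transpose. For a Series this is the Series object itself.
`asobject`	Returns the data of the Series converted to an object-dtype array.
`at`	Used as at[label]. Extracts a single value.
`axes`	Returns a list of the axis labels, which for a Series contains only the index.
`base`	If this Series refers to another object's data, returns that object.
`blocks`	(deprecated) Shows an internal property. as_blocks() gives the same output.
`data`	Shows a pointer to the data (its address in memory).
`dtype`	Shows the data type used for the data.
`dtypes`	Same as above.
`flags`	No description in the documentation. Shows memory layout information (NumPy's ndarray has the same attribute).
`ftype`	Shows whether the data is `sparse` or `dense`.
`ftypes`	Same as above.
`hasnans`	Returns whether there are any NaN values. This can speed up various operations (it checks faster than other functions that look for NaN).
`iat`	Accesses the i-th value with iat[i], regardless of the index labels.
`iloc`	Accesses values by integer position rather than by index label. Multiple values can be selected.
`index`	Returns the object used as the index.
`is_monotonic`	Returns True or False for whether the values are monotonically increasing.
`is_monotonic_decreasing`	Returns whether the values are monotonically decreasing.
`is_monotonic_increasing`	Returns whether the values are monotonically increasing.
`is_unique`	Returns True if the Series contains no duplicate values, and False if it does.
`itemsize`	Returns the memory used by a single value of the data, in bytes.
`ix`	(deprecated) Returns a value given either its index label or its position. Because this was ambiguous, iloc and loc are recommended instead.
`loc`	Selects values by index label. Used as loc[label].
`nbytes`	Returns the memory used by the data, in bytes.
`ndim`	Returns the number of dimensions. For a Series this is 1.
`shape`	Returns the shape of the data. For a Series, as for a one-dimensional array, it has the form "(number of items,)".
`size`	Returns the number of items in the data.
`strides`	Returns how many bytes to move in memory to read the next item. For float64 this is 8 bytes.
`values`	Returns only the data.
`empty`	Returns True if the Series holds no data.
`name`	Returns the value given by the name argument.
`real`	Returns only the real part when the data is complex.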

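Let's briefly try each of them. First, T and asobject. T returns the transpose, but since a Series only holds one-dimensional data, the same Series comes back. asobject returns the data of the Series as an object-dtype array.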

In [2]: a = pd.Series([1,2,3,4,5])

In [3]: a.T
Out[3]:
0    1
1    2
2    3
3    4
4    5
dtype: int64

In [4]: a.asobject
Out[4]: array([1, 2, 3, 4, 5], dtype=object)

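Next are at and axes. at extracts the value for a given index label, and axes shows the index. For a Series, axes plays the same role as index, except that the Index comes back inside a list.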

In [5]: b = pd.Series([1,2,3,4,5], index=['a','b','c','d','e'])

In [6]: b.at['b']
Out[6]: 2

In [7]: b.axes
Out[7]: [Index(['a', 'b', 'c', 'd', 'e'], dtype='object')]

Next are base and blocks. base returns the object whose data is being referenced, and blocks shows an internal property. In most cases base returns nothing.

In [14]: f = [2,3,1,2,3]

In [15]: c = pd.Series(f)

In [16]: c.base

In [17]: c.blocks
Out[17]:
{'int64': 0    2
 1    3
 2    1
 3    2
 4    3
 dtype: int64}

data shows the in-memory address of the data held by the Series. dtype shows the type used for the data, and dtypes does the same. With a DataFrame the behavior changes: dtypes shows the data type used in each column.

In [18]: c.data
Out[18]: <memory at 0x1071b41c8>

In [19]: c.dtype
Out[19]: dtype('int64')

In [20]: c.dtypes
Out[20]: dtype('int64')

flags shows information such as the memory layout.

In [21]: c.flags
Out[21]:
  C_CONTIGUOUS : True
  F_CONTIGUOUS : True
  OWNDATA : True
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False

ftype shows whether the data is sparse or dense.
The data here is dense, so dense is returned.
ftypes behaves the same way. For a DataFrame it shows the result for each column.

In [22]: c.ftype
Out[22]: 'int64:dense'

In [23]: c.ftypes
Out[23]: 'int64:dense'

hasnans checks whether any NaN values are present.
It is very fast, much faster than combining other functions, for example isnull() with any(), so it is worth remembering.

In [25]: import numpy as np # import the numpy module

In [26]: random_array = np.random.randn(1000000) # generate one million random numbers

In [27]: random_series = pd.Series(random_array)

In [28]: random_series.at[99000] = np.nan  

In [30]: random_series.hasnans
Out[30]: True

In [34]: random_series.isnull().any()
Out[34]: True

In [35]: %timeit random_series.hasnans
769 ns ± 154 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [36]: %timeit random_series.isnull().any()
11.4 ms ± 346 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

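The timings differ by roughly four orders of magnitude (hundreds of nanoseconds versus about ten milliseconds), so hasnans gives the answer far sooner. Part of this gap may come from hasnans caching its result, so that the repeated %timeit calls measure the cached lookup rather than a fresh scan.

iat and iloc are attributes for accessing values; both select by the position of the element rather than by its index label.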

In [38]: b
Out[38]:
a    1
b    2
c    3
d    4
e    5
dtype: int64

In [39]: b.iat[1]
Out[39]: 2

In [40]: b.iloc[1]
Out[40]: 2
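
The difference between label-based and position-based access is easiest to see when the index holds integers in a non-default order. A minimal sketch with made-up values:

```python
import pandas as pd

# The labels 2, 1, 0 run opposite to the positions 0, 1, 2.
s = pd.Series([10, 20, 30], index=[2, 1, 0])

print(s.loc[0])   # label 0 is the last element: 30
print(s.iloc[0])  # position 0 is the first element: 10

# at and iat follow the same split, but return a single value only.
print(s.at[2])    # label 2: 10
print(s.iat[2])   # position 2: 30
```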

index returns the index labels.

In [41]: b.index
Out[41]: Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

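is_monotonic, is_monotonic_decreasing, and is_monotonic_increasing check for monotonicity: is_monotonic and is_monotonic_increasing check whether the values are monotonically increasing, and is_monotonic_decreasing checks whether they are monotonically decreasing.

In [42]: b.is_monotonic # monotonically increasing? (same check as is_monotonic_increasing)
Out[42]: True

In [43]: b.is_monotonic_decreasing # monotonically decreasing?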
Out[43]: False

In [44]: b.is_monotonic_increasing
Out[44]: True
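
A short sketch of the three possible outcomes, using made-up values. Note that the check is non-strict: repeated values still count as monotonic.

```python
import pandas as pd

rising = pd.Series([1, 2, 2, 5])   # the repeated 2 is allowed
falling = pd.Series([5, 3, 1])
mixed = pd.Series([1, 3, 2])

print(rising.is_monotonic_increasing)   # True
print(falling.is_monotonic_decreasing)  # True

# A Series that goes up and then down is neither.
print(mixed.is_monotonic_increasing)    # False
print(mixed.is_monotonic_decreasing)    # False
```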

is_unique checks that no value appears more than once.

In [45]: b.is_unique
Out[45]: True

In [46]: b.iloc[1] = 1 # overwrite the second element (position 1) to create a duplicate

In [47]: b.is_unique
Out[47]: False

In [48]: b
Out[48]:
a    1
b    1
c    3
d    4
e    5
dtype: int64

itemsize returns the size of each individual value in bytes (1 byte = 8 bits).

In [49]: b.itemsize
Out[49]: 8

ix and loc both extract values.
ix accepts both index labels and positions, whereas loc accepts index labels only.
Supporting both turned out to cause confusion, so ix is now deprecated.

In [50]: b.ix['a']
#...
: DeprecationWarning:
.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated

Out[50]: 1

In [51]: b.ix[1]
Out[51]: 1

In [52]: b.loc['a']
Out[52]: 1

Using ix displays a DeprecationWarning.

nbytes returns the total size of the data in bytes.

In [54]: b.nbytes
Out[54]: 40

ndim returns the number of dimensions. For a Series it is always 1.
shape shows the shape of the data.
size shows the number of data items.

In [55]: b.ndim
Out[55]: 1

In [56]: b.shape
Out[56]: (5,)

In [57]: b.size
Out[57]: 5

strides shows how many bytes to move in memory to reach the adjacent element.

In [58]: b.strides
Out[58]: (8,)
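
The size figures above are related by simple arithmetic: total bytes = number of items × bytes per item. The sketch below checks this, reading the per-item size from the dtype (an approach assumed here because it does not depend on the Series-level itemsize attribute).

```python
import pandas as pd

# dtype is set explicitly so the item size is the same on every platform.
b = pd.Series([1, 2, 3, 4, 5], dtype="int64")

per_item = b.dtype.itemsize  # bytes used by one int64 value
total = b.nbytes             # bytes used by the whole data part

print(per_item)  # 8
print(b.size)    # 5
print(total)     # 40 = 5 * 8
```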

values returns just the data as an array.
For numeric data it is a NumPy ndarray.

In [59]: b.values
Out[59]: array([1, 1, 3, 4, 5])

empty tells whether the Series is empty; it returns True when there is no data.

In [63]: empty_series = pd.Series()

In [64]: empty_series.empty
Out[64]: True

In [65]: b.empty
Out[65]: False

name returns the name given to the Series.

In [66]: name_series = pd.Series([1,2,3], name="name")

In [67]: name_series.name
Out[67]: 'name'

In [68]: b.name

real returns only the real part of the data.
Let's build complex data with NumPy and extract just the real part.

In [69]: array = np.array([1+2j, -2+3j, -1-4j]) # create a complex sequence with NumPy

In [70]: imag_series = pd.Series(array)

In [72]: imag_series.real # only the real part was extracted
Out[72]: array([ 1., -2., -1.])

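Summary

This article summarized how to use the Series object.

The operations covered here have much in common with the DataFrame object, which is a collection of Series objects. Presumably the Series object exists to make it explicit that a one-dimensional array is being handled.

References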

pandas.Series — pandas 0.23.0 documentation
Python for Data Analysis 2nd edition –Wes McKinney (book)
pandas.Int64Index — pandas 0.23.0 documentation
pandas.RangeIndex — pandas 0.23.0 documentation