Pandas教程-数据处理

在数据分析和建模的大部分时间中，都花费在数据准备和处理上，即加载、清理和重新排列数据等。此外，由于Python库，Pandas为我们提供了高性能、灵活和高级的数据处理环境。对于 Pandas，有各种功能可用于有效地处理数据。

层次化索引

为了增强数据处理的能力，我们必须使用一些索引，这有助于根据标签对数据进行排序。因此，层次化索引出现在这里，并被定义为 Pandas 的一个基本特性，可以帮助我们使用多个索引级别。

创建多重索引

在层次化索引中，我们必须为数据创建多个索引。这个例子创建了一个具有多个索引的 Series。

例子:

import pandas as pd  
info = pd.Series([11, 14, 17, 24, 19, 32, 34, 27],  
index = [['x', 'x', 'x', 'x', 'y', 'y', 'y', 'y'],  
['obj1', 'obj2', 'obj3', 'obj4', 'obj1', 'obj2', 'obj3', 'obj4']])  
data

输出:

aobj1   11
obj2   14
obj3   17
     obj4   24 
bobj1   19
obj2   32
obj3   34
obj4  27
dtype: int64

在这里，我们采用了两级索引，即 (x, y) 和 (obj1, obj4)，并且可以使用 'index' 命令查看索引。

info.index

输出:

MultiIndex(levels=[['x', 'y'], ['obj1', 'obj2', 'obj3', 'obj4']],
labels=[[0, 0, 0, 0, 1, 1, 1, 1], [0, 1, 2, 3, 0, 1, 2, 3]])

部分索引

部分索引可以被定义为从分层索引中选择特定索引的一种方法。

下面的代码从数据中提取了 'b'，

import pandas as pd  
info = pd.Series([11, 14, 17, 24, 19, 32, 34, 27],  
index = [['x', 'x', 'x', 'x', 'y', 'y', 'y', 'y'],  
['obj1', 'obj2', 'obj3', 'obj4', 'obj1', 'obj2', 'obj3', 'obj4']])  
info['b']

输出:

obj1   19 
obj2   32 
obj3   34 
obj4   27
dtype: int64

此外，还可以基于内部级别提取数据，即 'obj'。下面的结果定义了 Series 中 'obj2' 的两个可用值。

info[:, 'obj2']

输出:

x   14 
y 32
dtype: int64

数据的展开

展开意味着将行标头更改为列标头。行索引将更改为列索引，因此 Series 将变成 DataFrame。以下是展开数据的示例。

例子:

import pandas as pd  
info = pd.Series([11, 14, 17, 24, 19, 32, 34, 27],  
index = [['x', 'x', 'x', 'x', 'y', 'y', 'y', 'y'],  
['obj1', 'obj2', 'obj3', 'obj4', 'obj1', 'obj2', 'obj3', 'obj4']])  
# unstack on first level i.e. x, y  
#note that data row-labels are x and y  
data.unstack(0)

输出:

ab 
obj1  11   19
obj2  14   32
obj3 17   34 
obj4  24    27
# unstack based on second level i.e. 'obj'
info.unstack(1)

输出:

obj1 obj2 obj3 obj4 
a  11       14      17       24
b  19       32      34      27

'stack()' 操作用于将列索引转换为行索引。在上述代码中，我们可以使用 'stack' 操作将 'obj' 作为列索引转换为行索引。

import pandas as pd  
info = pd.Series([11, 14, 17, 24, 19, 32, 34, 27],  
index = [['x', 'x', 'x', 'x', 'y', 'y', 'y', 'y'],  
['obj1', 'obj2', 'obj3', 'obj4', 'obj1', 'obj2', 'obj3', 'obj4']])  
# unstack on first level i.e. x, y  
#note that data row-labels are x and y  
data.unstack(0)   
d.stack()

输出:

aobj1   11
obj2   14
obj3   17
     obj4   24 
bobj1   19
obj2   32
     obj3   34 
obj4  27
dtype: int64

列索引

请记住，由于列索引需要二维数据，因此列索引仅对 DataFrame 有效（不适用于 Series）。让我们为演示带有多个索引的列创建新 DataFrame。

import numpy as np   
info = pd.DataFrame(np.arange(12).reshape(4, 3),  
index = [['a', 'a', 'b', 'b'], ['one', 'two', 'three', 'four']],   
columns = [['num1', 'num2', 'num3'], ['x', 'y', 'x']] ... )   
info

输出:

num1 num2 num3
x           y             x
a one0 1 2 
two3 4 5
b three 6 7 8 
four 9 10 11 
# display row index   
info.index

输出:

MultiIndex(levels=[['x', 'y'], ['four', 'one', 'three', 'two']], labels=[[0, 0, 1, 1], [1, 3, 2, 0]])
# display column index   
info.columns

输出:

MultiIndex(levels=[['num1', 'num2', 'num3'], ['green', 'red']], labels=[[0, 1, 2], [1, 0, 1]])

交换和排序级别

我们可以轻松地通过使用 'swaplevel' 命令交换索引级别，该命令接受两个级别编号作为输入。

import numpy as np   
info = pd.DataFrame(np.arange(12).reshape(4, 3),  
index = [['a', 'a', 'b', 'b'], ['one', 'two', 'three', 'four']],   
columns = [['num1', 'num2', 'num3'], ['x', 'y', 'x']] ... )   
info.swaplevel('key1', 'key2')   
nnum1 num2 num3   
p                             x                  y              x   
key2 key1   
onea 0 1 2   
twoa 3 4 5   
three b 6 7 8  
four b 9 10 11

我们可以使用 'sort_index' 命令按照 'key2' 名称排序标签。数据将按字母顺序排列。

info.sort_index(level='key2')   
nnum1 num2    num3   
p           x             y             x  
key1 key2   
bfour 9 10 11   
aone 0 1 2  
bthree 6 7 8   
atwo 3 4 5

Pandas教程-数据处理

部分索引

数据的展开

列索引

交换和排序级别

推荐文章

其它