Pandas Tutorial: Python Pandas DataFrame

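A Pandas DataFrame is a widely used data structure: a two-dimensional array with labeled axes (rows and columns). It is the standard way to store data that has two different indexes, a row index and a column index. It has the following properties:

Columns can be of heterogeneous types, such as int, bool, and so on.
It can be viewed as a dictionary of Series objects in which both rows and columns are indexed. The column labels are exposed as "columns" and the row labels as "index".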

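Parameters and descriptions:

data: The input data. It can take several forms, such as an ndarray, Series, map (dict), constant, list, or array.

index: The row labels. If no index is passed, it defaults to np.arange(n), that is, 0 to n-1.

columns: The column labels. If no column labels are passed, they also default to np.arange(n).

dtype: The data type to force on the data; only a single dtype is allowed. If omitted, the type is inferred for each column.

copy: Whether to copy the input data.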
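
The parameters above can be tried out together. The following is a minimal sketch, not a complete reference; the array contents and the labels r1, r2, r3, a, and b are made up for illustration.

```python
import numpy as np
import pandas as pd

values = np.array([[1, 2], [3, 4], [5, 6]])

# No index or columns passed: both default to 0, 1, 2, ...
default_df = pd.DataFrame(values)

# Explicit row labels, column labels, and dtype;
# copy=True asks pandas to copy the input data
labeled_df = pd.DataFrame(
    values,
    index=["r1", "r2", "r3"],
    columns=["a", "b"],
    dtype="float64",
    copy=True,
)

print(default_df)
print(labeled_df)
print(labeled_df.dtypes)
```

Because labeled_df holds its own copy of the data, later changes to values should not show up in it.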


Creating a DataFrame

We can create a DataFrame from any of the following:

dict
list
NumPy array
Series

Creating an empty DataFrame

The following code shows how to create an empty DataFrame in Pandas:

# importing the pandas library  
import pandas as pd  
df = pd.DataFrame()  
print (df)

Output

Empty DataFrame
Columns: []
Index: []

Explanation: In the code above, we first import the pandas library under the alias pd, then assign an empty DataFrame to the variable df. Finally, we print it by passing df to print().

Creating a DataFrame from a list:

We can easily create a DataFrame in Pandas from a list.

# importing the pandas library  
import pandas as pd  
# a list of strings  
x = ['Python', 'Pandas']  
  
# Calling DataFrame constructor on list  
df = pd.DataFrame(x)  
print(df)

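Output

        0
0  Python
1  Pandas

Explanation: In the code above, we define a variable "x" that holds a list of strings and pass it to the DataFrame constructor. Each list element becomes one row in a single column labeled 0.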

Creating a DataFrame from a dict of ndarrays/lists

# importing the pandas library  
import pandas as pd  
info = {'ID' :[101, 102, 103],'Department' :['B.Sc','B.Tech','M.Tech',]}  
df = pd.DataFrame(info)  
print (df)

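Output

    ID Department
0  101       B.Sc
1  102     B.Tech
2  103     M.Tech

Explanation: In the code above, we define a dictionary named "info" that maps ID and Department to lists of equal length. We pass info to the DataFrame constructor, store the result in df, and print it with print(). Each dictionary key becomes a column label.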
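
The example above uses lists as the dictionary values. The sketch below shows the other two cases named earlier: a dict of ndarrays and a plain two-dimensional NumPy array. The Score column and the labels x and y are hypothetical and chosen only for illustration.

```python
import numpy as np
import pandas as pd

# A dict whose values are ndarrays of equal length
info = {
    "ID": np.array([101, 102, 103]),
    "Score": np.array([88.5, 92.0, 79.5]),  # hypothetical column
}
df_from_dict = pd.DataFrame(info)

# A 2-D ndarray: each array row becomes a DataFrame row
arr = np.arange(6).reshape(3, 2)
df_from_array = pd.DataFrame(arr, columns=["x", "y"])

print(df_from_dict)
print(df_from_array)
```

With a dict, the column labels come from the keys; with a bare array, they have to be passed through columns or they fall back to 0, 1, and so on.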

Creating a DataFrame from a dict of Series:

# importing the pandas library  
import pandas as pd  
  
info = {'one' : pd.Series([1, 2, 3, 4, 5, 6], index=['a', 'b', 'c', 'd', 'e', 'f']),  
   'two' : pd.Series([1, 2, 3, 4, 5, 6, 7, 8], index=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])}  
  
d1 = pd.DataFrame(info)  
print (d1)

Output

        one         two
a       1.0          1
b       2.0          2
c       3.0          3
d       4.0          4
e       5.0          5
f       6.0          6
g       NaN          7
h       NaN          8

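Explanation: In the code above, the dictionary "info" holds two Series, each with its own index. We pass info to the DataFrame constructor, store the result in d1, and print it. The resulting index is the union of the two Series indexes, so column one holds NaN for the labels g and h, which is also why its values are shown as floats.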

Column selection

We can select any column from a DataFrame. The code below shows how:

# importing the pandas library  
import pandas as pd  
  
info = {'one' : pd.Series([1, 2, 3, 4, 5, 6], index=['a', 'b', 'c', 'd', 'e', 'f']),  
   'two' : pd.Series([1, 2, 3, 4, 5, 6, 7, 8], index=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])}  
  
d1 = pd.DataFrame(info)  
print (d1 ['one'])

Output

a      1.0
b      2.0
c      3.0
d      4.0
e      5.0
f      6.0
g      NaN
h      NaN
Name: one, dtype: float64

Explanation: In the code above, the dictionary "info" holds two Series, each with its own index. We build the DataFrame d1 from info and select the "one" column with d1['one'], which returns a Series.
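
Selecting one column returns a Series. As a small sketch of the related case, passing a list of column labels returns a DataFrame instead, with the columns in the order given.

```python
import pandas as pd

info = {
    "one": pd.Series([1, 2, 3, 4, 5, 6], index=["a", "b", "c", "d", "e", "f"]),
    "two": pd.Series([1, 2, 3, 4, 5, 6, 7, 8],
                     index=["a", "b", "c", "d", "e", "f", "g", "h"]),
}
d1 = pd.DataFrame(info)

single = d1["one"]           # one label: a Series
subset = d1[["two", "one"]]  # list of labels: a DataFrame

print(type(single).__name__)
print(type(subset).__name__)
print(subset.columns.tolist())
```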

Column addition

We can also add new columns to an existing DataFrame. The code below shows how:

# importing the pandas library  
import pandas as pd  
  
info = {'one' : pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e']),  
   'two' : pd.Series([1, 2, 3, 4, 5, 6], index=['a', 'b', 'c', 'd', 'e', 'f'])}  
  
df = pd.DataFrame(info)  
  
# Add a new column to an existing DataFrame object   
  
print ("Add new column by passing series")  
df['three']=pd.Series([20,40,60],index=['a','b','c'])  
print (df)  
  
print ("Add new column using existing DataFrame columns")  
df['four']=df['one']+df['three']  
  
print (df)

Output

Add new column by passing series
      one     two      three
a     1.0      1        20.0
b     2.0      2        40.0
c     3.0      3        60.0
d     4.0      4        NaN
e     5.0      5        NaN
f     NaN      6        NaN

Add new column using existing DataFrame columns
       one      two       three      four
a      1.0       1         20.0      21.0
b      2.0       2         40.0      42.0
c      3.0       3         60.0      63.0
d      4.0       4         NaN      NaN
e      5.0       5         NaN      NaN
f      NaN       6         NaN      NaN

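Explanation: In the code above, the dictionary "info" holds two Series, each with its own index. We build the DataFrame df from it.

To add a new column to the existing DataFrame, we assign a new Series to df['three']. Its values are aligned on the index, so rows without a matching label get NaN.

We can also derive a new column from existing ones. The column "four" stores the sum of the columns one and three.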
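
Assigning to df['four'] changes df in place. As an alternative sketch, DataFrame.assign() returns a new DataFrame with the extra column and leaves the original untouched. The small frame below is a simplified stand-in for the one used above.

```python
import pandas as pd

df = pd.DataFrame(
    {"one": [1.0, 2.0, 3.0], "three": [20.0, 40.0, 60.0]},
    index=["a", "b", "c"],
)

# assign() builds a new DataFrame; df itself is not modified
with_four = df.assign(four=df["one"] + df["three"])

print(df.columns.tolist())
print(with_four)
```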

Column deletion:

We can also delete any column from an existing DataFrame. This code shows how:

# importing the pandas library  
import pandas as pd  
  
info = {'one' : pd.Series([1, 2], index= ['a', 'b']),   
   'two' : pd.Series([1, 2, 3], index=['a', 'b', 'c'])}  
     
df = pd.DataFrame(info)  
print ("The DataFrame:")  
print (df)  
  
# using the del statement
print ("Delete the first column:")
del df['one']
print (df)
# using the pop method
print ("Delete the another column:")
df.pop('two')
print (df)

Output

The DataFrame:
      one    two
a     1.0     1
b     2.0     2
c     NaN     3

Delete the first column:
     two
a     1
b     2
c     3

Delete the another column:
Empty DataFrame
Columns: []
Index: [a, b, c]

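Explanation:

In the code above, df is built from the info dictionary and printed. We can delete a column from a DataFrame with the del statement or the pop method.

In the first case, we delete the "one" column with del; in the second, we remove the "two" column with pop, which leaves an empty DataFrame that still has the index a, b, c.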
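
Two details are worth a short sketch: pop returns the removed column as a Series, and DataFrame.drop(columns=...) is a third option that returns a new DataFrame instead of modifying the original. The column values here are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"one": [1, 2, 3], "two": [4, 5, 6], "three": [7, 8, 9]})

# pop removes the column in place and hands it back as a Series
removed = df.pop("two")

# drop returns a new DataFrame; df keeps its remaining columns
trimmed = df.drop(columns=["one"])

print(removed.tolist())
print(df.columns.tolist())
print(trimmed.columns.tolist())
```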

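Row selection, addition, and deletion

Row selection:

We can easily select, add, or delete any row. We start with row selection. There are several ways to select rows:

Selection by label:

We can select any row by passing its label to the loc indexer.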

# importing the pandas library  
import pandas as pd  
  
info = {'one' : pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e']),   
   'two' : pd.Series([1, 2, 3, 4, 5, 6], index=['a', 'b', 'c', 'd', 'e', 'f'])}  
  
df = pd.DataFrame(info)  
print (df.loc['b'])

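Output

one    2.0
two    2.0
Name: b, dtype: float64

Explanation: In the code above, the dictionary "info" holds two Series, each with its own index.

To select a row, we pass the row label to loc. The row comes back as a Series, so the integer from column two is upcast to float64.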

Selection by integer position:

Rows can also be selected by passing an integer position to the iloc indexer.

# importing the pandas library  
import pandas as pd  
info = {'one' : pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e']),  
   'two' : pd.Series([1, 2, 3, 4, 5, 6], index=['a', 'b', 'c', 'd', 'e', 'f'])}  
df = pd.DataFrame(info)  
print (df.iloc[3])

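Output

one    4.0
two    4.0
Name: d, dtype: float64

Explanation: In the code above, we define a dictionary named "info" that holds two Series, each with its own index.

To select a row, we pass its zero-based integer position to iloc; position 3 is the fourth row, labeled d.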
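
Both loc and iloc also accept a list of rows or a row/column pair. The following sketch reuses the info dictionary from the examples above and shows the label-based and position-based forms side by side.

```python
import pandas as pd

info = {
    "one": pd.Series([1, 2, 3, 4, 5], index=["a", "b", "c", "d", "e"]),
    "two": pd.Series([1, 2, 3, 4, 5, 6], index=["a", "b", "c", "d", "e", "f"]),
}
df = pd.DataFrame(info)

rows_by_label = df.loc[["a", "c"]]  # several rows by label
cell_by_label = df.loc["b", "two"]  # one cell by row and column label

rows_by_pos = df.iloc[[0, 2]]       # the same rows by position
cell_by_pos = df.iloc[1, 1]         # the same cell by position

print(rows_by_label)
print(cell_by_label, cell_by_pos)
```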

Slicing rows

Multiple rows can also be selected with the ':' operator.

# importing the pandas library  
import pandas as pd  
info = {'one' : pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e']),   
   'two' : pd.Series([1, 2, 3, 4, 5, 6], index=['a', 'b', 'c', 'd', 'e', 'f'])}  
df = pd.DataFrame(info)  
print (df[2:5])

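Output

   one  two
c  3.0    3
d  4.0    4
e  5.0    5

Explanation: In the code above, the slice 2:5 selects the rows at positions 2, 3, and 4 (the end position is excluded), and we print them to the console.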
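
One point that often trips people up: a positional slice excludes its end, while a label slice with loc includes it. A minimal sketch with a single made-up column:

```python
import pandas as pd

df = pd.DataFrame({"one": [1, 2, 3, 4, 5]}, index=list("abcde"))

by_position = df.iloc[2:5]   # positions 2, 3, 4; end excluded
by_label = df.loc["b":"d"]   # labels b, c, d; end included

print(by_position.index.tolist())
print(by_label.index.tolist())
```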

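Adding rows:

We can add new rows to a DataFrame with pd.concat, which places them at the end. Older tutorials use DataFrame.append for this, but that method was deprecated in pandas 1.4 and removed in pandas 2.0.

# importing the pandas library
import pandas as pd
d = pd.DataFrame([[7, 8], [9, 10]], columns = ['x','y'])
d2 = pd.DataFrame([[11, 12], [13, 14]], columns = ['x','y'])
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
d = pd.concat([d, d2])
print (d)

Output

    x   y
0   7   8
1   9  10
0  11  12
1  13  14

Explanation: In the code above, we define two DataFrames, each built from a list of rows. We then use pd.concat to add the rows of d2 to the end of d and print the result. Each DataFrame keeps its original row labels, so the labels 0 and 1 appear twice.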
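
If the repeated row labels are not wanted, pd.concat can renumber the rows. A short sketch using the same two DataFrames:

```python
import pandas as pd

d = pd.DataFrame([[7, 8], [9, 10]], columns=["x", "y"])
d2 = pd.DataFrame([[11, 12], [13, 14]], columns=["x", "y"])

# ignore_index=True discards the old labels and numbers the rows 0..n-1
combined = pd.concat([d, d2], ignore_index=True)

print(combined)
```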

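Deleting rows:

We can delete any row from a DataFrame by its index label. If the label is duplicated, all rows with that label are removed.

# importing the pandas library
import pandas as pd

a_info = pd.DataFrame([[4, 5], [6, 7]], columns = ['x','y'])
b_info = pd.DataFrame([[8, 9], [10, 11]], columns = ['x','y'])

# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
a_info = pd.concat([a_info, b_info])

# Drop rows with label 0
a_info = a_info.drop(0)
print (a_info)

Output

    x   y
1   6   7
1  10  11

Explanation: In the code above, we define two DataFrames, each built from a list of rows, and concatenate them.

We then pass the index label of the rows to delete to drop(). Because the label 0 occurs twice, both rows with that label are removed.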
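
To remove only one of the rows that share a label, the index has to be made unique first. The sketch below contrasts the two cases; using reset_index(drop=True) for this is one possible approach, not the only one.

```python
import pandas as pd

a_info = pd.DataFrame([[4, 5], [6, 7]], columns=["x", "y"])
b_info = pd.DataFrame([[8, 9], [10, 11]], columns=["x", "y"])
stacked = pd.concat([a_info, b_info])

# Label 0 occurs twice, so both rows go
dropped_both = stacked.drop(0)

# After renumbering, label 0 refers to a single row
renumbered = stacked.reset_index(drop=True)
dropped_one = renumbered.drop(0)

print(dropped_both)
print(dropped_one)
```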

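DataFrame functions

Many functions are commonly used with a DataFrame, as listed below:

Function	Description
Pandas DataFrame.append()	Adds the rows of another DataFrame to the end of the given DataFrame. Deprecated in pandas 1.4 and removed in 2.0; use pandas.concat() instead.
Pandas DataFrame.apply()	Applies a user-supplied function along an axis of the DataFrame, that is, to each column or each row.
Pandas DataFrame.assign()	Adds new columns to a DataFrame and returns the result as a new object.
Pandas DataFrame.astype()	Casts a Pandas object to the specified dtype.
pandas.concat()	Concatenates Pandas objects along an axis. It is a top-level function (pd.concat), not a DataFrame method.
Pandas DataFrame.count()	Counts the non-NA cells in each column or row.
Pandas DataFrame.describe()	Computes summary statistics such as percentiles, mean, and standard deviation for the numeric values of a Series or DataFrame.
Pandas DataFrame.drop_duplicates()	Removes duplicate rows from the DataFrame.
Pandas DataFrame.groupby()	Splits the data into groups.
Pandas DataFrame.head()	Returns the first n rows of the object, based on position.
Pandas DataFrame.hist()	Draws a histogram of each numeric column, dividing its values into bins.
Pandas DataFrame.iterrows()	Iterates over the rows as (index, Series) pairs.
Pandas DataFrame.mean()	Returns the mean of the values over the requested axis.
Pandas DataFrame.melt()	Unpivots a DataFrame from wide format to long format.
Pandas DataFrame.merge()	Merges two datasets into one with a database-style join.
Pandas DataFrame.pivot_table()	Aggregates data using calculations such as sum, count, average, max, and min.
Pandas DataFrame.query()	Filters the rows of a DataFrame with a boolean expression.
Pandas DataFrame.sample()	Returns a random sample of rows or columns from the DataFrame.
Pandas DataFrame.shift()	Shifts the values by a given number of periods; often used to subtract the previous row's value from the current one.
Pandas DataFrame.sort_values()	Sorts the DataFrame by its values. The older DataFrame.sort() has been removed from pandas.
Pandas DataFrame.sum()	Returns the sum of the values over the requested axis.
Pandas DataFrame.to_excel()	Exports the DataFrame to an Excel file.
Pandas DataFrame.transpose()	Transposes the index and columns of the DataFrame.
Pandas DataFrame.where()	Keeps the values where a condition holds and replaces the others (with NaN by default).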
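
A few of the functions in the table can be combined in a short sketch. The sales data below (region, units, price) is invented purely for illustration.

```python
import pandas as pd

# Hypothetical sales data
sales = pd.DataFrame({
    "region": ["north", "south", "north", "south"],
    "units": [10, 4, 6, 12],
    "price": [2.5, 3.0, 2.5, 3.0],
})

# assign(): add a derived column
sales = sales.assign(revenue=sales["units"] * sales["price"])

# sort_values() and head(): the two rows with the highest revenue
top = sales.sort_values("revenue", ascending=False).head(2)

# groupby() and sum(): total revenue per region
by_region = sales.groupby("region")["revenue"].sum()

# query(): keep rows that match a boolean expression
big = sales.query("units > 5")

print(top)
print(by_region)
print(big)
```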