Python Data Analysis Techniques: Filtering, Grouping, and Pivot Tables with pandas (Part 2)
Published: 2021-11-23
Related: Python Data Analysis Techniques: Extracting, Cleaning, and Sorting with pandas (Part 1)
Key points:
- String processing
- Filtering data
- Grouping data
- Pivot tables
Loading the data
# -*- coding: utf-8 -*-
# @File : 數據集的處理.py
# @Date : 2018-06-03
import pandas as pd

file = "data/train.csv"
# read_csv already returns a DataFrame, so no extra pd.DataFrame() wrapper is needed
df = pd.read_csv(file)
print(df.head(3))
"""
   PassengerId  Survived  Pclass  ...     Fare Cabin  Embarked
0            1         0       3  ...   7.2500   NaN         S
1            2         1       1  ...  71.2833   C85         C
2            3         1       3  ...   7.9250   NaN         S

[3 rows x 12 columns]
"""
1. String processing on the dataset
# Rename columns (Chinese display labels; English meaning in the trailing comments)
mapping = {
    'PassengerId': '乘客編號',  # passenger ID
    'Survived': '是否獲救',     # survived or not
    'Name': '姓名',             # name
    'Pclass': '船艙等級',       # cabin class
    'Sex': '性別',              # sex
    'Age': '年齡',              # age
    'SibSp': '兄弟姐妹數',      # number of siblings/spouses
    'Parch': '父母小孩數',      # number of parents/children
    'Ticket': '船票',           # ticket
    'Fare': '船票費',           # fare
}
ret = df.rename(columns=mapping)
print(ret.head(3))
"""
   乘客編號  是否獲救  船艙等級  ...     船票費 Cabin  Embarked
0        1        0        3  ...   7.2500   NaN         S
1        2        1        1  ...  71.2833   C85         C
2        3        1        3  ...   7.9250   NaN         S

[3 rows x 12 columns]
"""

# Replace specific string values in a column
ret = df['Sex'].map({'female': '女', 'male': '男'})  # 女 = female, 男 = male
print(ret.head(3))
"""
0    男
1    女
2    女
Name: Sex, dtype: object
"""

# Regex replacement on a column: every 'C' or 'S' in Embarked becomes 'xxx'.
# Related string methods live under the .str accessor:
# contains, split, match, findall, endswith
df['Embarked'] = df['Embarked'].replace(regex='[CS]', value='xxx')
print(df.head(3))
"""
   PassengerId  Survived  Pclass  ...     Fare Cabin  Embarked
0            1         0       3  ...   7.2500   NaN       xxx
1            2         1       1  ...  71.2833   C85       xxx
2            3         1       3  ...   7.9250   NaN       xxx

[3 rows x 12 columns]
"""
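The string methods named in the comment above (contains, split, endswith) are reached through the `.str` accessor. A minimal sketch follows; the three-row `sample` frame is made up for illustration and only imitates the Name and Ticket columns, and `str.extract` is used here as one possible way to keep just the digits of a value.

```python
import pandas as pd

# Hypothetical sample that imitates the Titanic "Name" and "Ticket" columns.
sample = pd.DataFrame({
    "Name": [
        "Braund, Mr. Owen Harris",
        "Heikkinen, Miss. Laina",
        "Allen, Mr. William Henry",
    ],
    "Ticket": ["A/5 21171", "STON/O2. 3101282", "373450"],
})

# contains: boolean mask of names that include the literal text "Miss."
is_miss = sample["Name"].str.contains("Miss.", regex=False)

# split: the surname is the part before the first comma
surname = sample["Name"].str.split(",").str[0]

# extract: keep only the trailing run of digits of each ticket
ticket_digits = sample["Ticket"].str.extract(r"(\d+)$", expand=False)

# endswith: names that end in "Henry"
ends_henry = sample["Name"].str.endswith("Henry")

print(is_miss.tolist())
print(surname.tolist())
print(ticket_digits.tolist())
print(ends_henry.tolist())
```

Each call returns a new Series aligned with the original index, so the result can be assigned back as a column or used directly as a filter mask.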
2. Filtering the dataset
# Combine boolean conditions (==, !=, >, ...) with & (and) and | (or)
ret = df[(df['Sex'] == 'female') & (df['Age'] > 10)]
print(ret.head(3))
"""
   PassengerId  Survived  Pclass  ...     Fare Cabin  Embarked
1            2         1       1  ...  71.2833   C85       xxx
2            3         1       3  ...   7.9250   NaN       xxx
3            4         1       1  ...  53.1000  C123       xxx

[3 rows x 12 columns]
"""

# query(): inside the expression, `== [list]` is a membership test (same as `in`)
ret = df.query('Age == [10, 20]')
print(ret[["Name", "Age"]].head(3))
"""
                               Name   Age
12   Saundercock, Mr. William Henry  20.0
91       Andreasson, Mr. Paul Edvin  20.0
113         Jussila, Miss. Katriina  20.0
"""
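The membership filter above can be written in several equivalent ways. The sketch below compares them on a small made-up frame (`sample` and `ages` are illustrative names, not part of the original script) and also shows negating a mask with `~`.

```python
import pandas as pd

# Hypothetical four-row sample; None becomes NaN in the float column.
sample = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Sex": ["female", "male", "female", "male"],
    "Age": [10.0, 20.0, 35.0, None],
})

ages = [10, 20]

# Three spellings of "Age is one of the listed values"
by_query = sample.query("Age == [10, 20]")     # list literal inside the expression
by_query_var = sample.query("Age in @ages")    # @ refers to a local Python variable
by_isin = sample[sample["Age"].isin(ages)]     # plain boolean-mask version

# ~ negates a boolean mask: everyone who is not female
not_female = sample[~(sample["Sex"] == "female")]

print(by_isin["Name"].tolist())
print(not_female["Name"].tolist())
```

`query` tends to read better for long conditions, while the mask form is easier to build up programmatically; which one to use is mostly a matter of taste.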
3. Categorizing the data
# np.where with only a condition returns a tuple of positional index arrays
import numpy as np
ret = np.where(df['Age'] >= 18)

# apply() with a custom function
def convert_age(age):
    # Note: a missing age (NaN) fails both comparisons and falls through to "老人"
    if 0 < age < 10:
        return "小孩"  # child
    elif age < 30:
        return "大人"  # adult
    else:
        return "老人"  # elderly

df["年齡分類"] = df['Age'].apply(convert_age)  # 年齡分類 = age category
print(df[["Name", "Age", "年齡分類"]].sample(3))
"""
                                             Name   Age 年齡分類
624                   Bowen, Mr. David John "Dai"  21.0    大人
880  Shelley, Mrs. William (Imanita Parrish Hall)  25.0    大人
471                               Cacic, Mr. Luka  38.0    老人
"""
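As a possible alternative to the `apply` approach, the sketch below shows the three-argument form of `np.where` and binning with `pd.cut`. This is not the original author's method; the English labels and the bin edges (0, 10, 30) are assumptions chosen to roughly mirror `convert_age`, and the five-value `ages` Series is made up.

```python
import numpy as np
import pandas as pd

# Hypothetical ages, including one missing value.
ages = pd.Series([5.0, 9.0, 21.0, 38.0, np.nan], name="Age")

# One-argument form: positions where the condition holds
adult_positions = np.where(ages >= 18)[0]

# Three-argument form: choose a label per element.
# NaN >= 18 is False, so a missing age is labelled "minor" here.
is_adult = np.where(ages >= 18, "adult", "minor")

# pd.cut assigns each value to a half-open interval [left, right);
# missing values stay missing instead of landing in the last bucket.
age_group = pd.cut(
    ages,
    bins=[0, 10, 30, np.inf],
    right=False,
    labels=["child", "adult", "senior"],
)

print(adult_positions.tolist())
print(is_adult.tolist())
print(age_group.tolist())
```

The practical difference from the `apply` version is how missing ages are treated: `pd.cut` leaves them as NaN, which is usually easier to spot later than a silently assigned category.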
4. Slicing data and pivot tables
# groupby(): count rows per group
print(df.groupby('Sex')[['Name', 'Sex']].count())
"""
        Name  Sex
Sex
female   314  314
male     577  577
"""

# Slice the data along two keys and aggregate Age within each slice
ret = df.groupby(['Survived', 'Pclass'])['Age'].agg(['size', 'max', 'min', 'mean'])
print(ret)
"""
                 size   max    min       mean
Survived Pclass
0        1         80  71.0   2.00  43.695312
         2         97  70.0  16.00  33.544444
         3        372  74.0   1.00  26.555556
1        1        136  80.0   0.92  35.368197
         2         87  62.0   0.67  25.901566
         3        119  63.0   0.42  20.646118
"""

# Pivot table; string aggregation names avoid the deprecation warnings
# that newer pandas versions raise for np.mean and the builtins min/max
ret = df.pivot_table(columns=['Sex'], index=['Survived', 'Pclass'], values='Age',
                     aggfunc={'Age': ['mean', 'min', 'max']})
print(ret)
"""
                   max             mean               min
Sex             female  male     female       male female   male
Survived Pclass
0        1        50.0  71.0  25.666667  44.581967   2.00  18.00
         2        57.0  70.0  36.000000  33.369048  24.00  16.00
         3        48.0  74.0  23.818182  27.255814   2.00   1.00
1        1        63.0  80.0  34.939024  36.248000  14.00   0.92
         2        55.0  62.0  28.080882  16.022000   2.00   0.67
         3        63.0  45.0  19.329787  22.274211   0.75   0.42
"""
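A pivot table is essentially a groupby whose last grouping key is moved into the columns. The sketch below illustrates that relationship on a small made-up frame (the six rows of `sample` are illustrative, not taken from train.csv), using a single `mean` aggregation to keep the output small.

```python
import pandas as pd

# Hypothetical six-row sample with the same column names as the Titanic data.
sample = pd.DataFrame({
    "Survived": [0, 0, 1, 1, 1, 0],
    "Sex": ["male", "female", "female", "male", "female", "male"],
    "Age": [22.0, 30.0, 26.0, 40.0, 34.0, 28.0],
})

# Pivot table: Survived on the rows, Sex on the columns, mean Age in the cells
pivot = sample.pivot_table(index="Survived", columns="Sex",
                           values="Age", aggfunc="mean")

# Same numbers via groupby, then unstack the Sex level into columns
grouped = sample.groupby(["Survived", "Sex"])["Age"].mean().unstack("Sex")

print(pivot)
print(grouped)
```

The groupby form is handy when further per-group steps are needed before reshaping; `pivot_table` is shorter when the table layout is the end goal.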