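The MLNLP (Machine Learning Algorithms and Natural Language Processing) community is a well-known natural language processing community in China and abroad. Its audience includes NLP master's and doctoral students, university faculty, and industry researchers.
The community's vision is to promote exchange among academia, industry, and enthusiasts in natural language processing and machine learning, and in particular to help beginners make progress.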
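Reposted from | python大本營
Author | zsx_yiyiyi
A compilation of 50 Matplotlib plots that are most useful in data analysis and visualization. This list lets you choose which visualization to show using Python's Matplotlib and Seaborn libraries.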
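1. Correlation
Scatter plot
Bubble plot with encircling
Scatter plot with linear regression line of best fit
Jittering with stripplot
Counts plot
Marginal histogram
Marginal boxplot
Correlogram
Pairwise plot
2. Deviation
Diverging bars
Diverging texts
Diverging dot plot
Diverging lollipop chart with markers
Area chart
3. Ranking
4. Distribution
Histogram for continuous variable
Histogram for categorical variable
Density plot
Density curves with histogram
Joy Plot
Distributed dot plot
Dot + Box plot
Violin plot
Population pyramid
Categorical plots
5. Composition
6. Change
Time series plot
Time series with peaks and troughs annotated
Autocorrelation and partial autocorrelation plot
Cross-correlation plot
Time series decomposition plot
Multiple time series
Plotting with different scales using a secondary Y axis
Time series with error bands
Stacked area chart
Unstacked area chart
Calendar heat map
Seasonal plot
7. Groups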
# !pip install brewer2mpl
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
import warnings; warnings.filterwarnings(action='once')

large = 22; med = 16; small = 12
# 'axes.titlesize' appeared twice in the original dict; only the later value (med) took effect
params = {'legend.fontsize': med,
          'figure.figsize': (16, 10),
          'axes.labelsize': med,
          'axes.titlesize': med,
          'xtick.labelsize': med,
          'ytick.labelsize': med,
          'figure.titlesize': large}
plt.rcParams.update(params)
# On Matplotlib 3.6 and later this style is named 'seaborn-v0_8-whitegrid'
plt.style.use('seaborn-whitegrid')
sns.set_style("white")
%matplotlib inline

# Version
print(mpl.__version__)  #> 3.0.0
print(sns.__version__)  #> 0.9.0
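1. Scatter plot
A scatterplot is the classic, fundamental plot for studying the relationship between two variables. If the data contains multiple groups, you may want to draw each group in a different color. In Matplotlib you can do this conveniently with plt.scatter(), as in the code that follows.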
# Import dataset
midwest = pd.read_csv("https://raw.githubusercontent.com/selva86/datasets/master/midwest_filter.csv")

# Prepare Data
# Create as many colors as there are unique midwest['category']
categories = np.unique(midwest['category'])
colors = [plt.cm.tab10(i/float(len(categories)-1)) for i in range(len(categories))]

# Draw Plot for Each Category
plt.figure(figsize=(16, 10), dpi=80, facecolor='w', edgecolor='k')

for i, category in enumerate(categories):
    plt.scatter('area', 'poptotal',
                data=midwest.loc[midwest.category==category, :],
                s=20, c=colors[i], label=str(category))

# Decorations
plt.gca().set(xlim=(0.0, 0.1), ylim=(0, 90000),
              xlabel='Area', ylabel='Population')

plt.xticks(fontsize=12); plt.yticks(fontsize=12)
plt.title("Scatterplot of Midwest Area vs Population", fontsize=22)
plt.legend(fontsize=12)
plt.show()
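2. Bubble plot with encircling
Sometimes you want to show a group of points within a boundary to emphasize their importance. In this example, you select the records to be encircled from the dataframe and pass them to the encircle() function defined in the code that follows.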
from matplotlib import patches
from scipy.spatial import ConvexHull
import warnings; warnings.simplefilter('ignore')
sns.set_style("white")

# Step 1: Prepare Data
midwest = pd.read_csv("https://raw.githubusercontent.com/selva86/datasets/master/midwest_filter.csv")

# As many colors as there are unique midwest['category']
categories = np.unique(midwest['category'])
colors = [plt.cm.tab10(i/float(len(categories)-1)) for i in range(len(categories))]

# Step 2: Draw Scatterplot with unique color for each category
fig = plt.figure(figsize=(16, 10), dpi=80, facecolor='w', edgecolor='k')

for i, category in enumerate(categories):
    plt.scatter('area', 'poptotal', data=midwest.loc[midwest.category==category, :],
                s='dot_size', c=colors[i], label=str(category), edgecolors='black', linewidths=.5)

# Step 3: Encircling
# https://stackoverflow.com/questions/44575681/how-do-i-encircle-different-data-sets-in-scatter-plot
def encircle(x, y, ax=None, **kw):
    # Draw the convex hull of the points (x, y) as a polygon patch
    if ax is None:
        ax = plt.gca()
    p = np.c_[x, y]
    hull = ConvexHull(p)
    poly = plt.Polygon(p[hull.vertices, :], **kw)
    ax.add_patch(poly)

# Select data to be encircled
midwest_encircle_data = midwest.loc[midwest.state=='IN', :]

# Draw polygon surrounding vertices
encircle(midwest_encircle_data.area, midwest_encircle_data.poptotal, ec="k", fc="gold", alpha=0.1)
encircle(midwest_encircle_data.area, midwest_encircle_data.poptotal, ec="firebrick", fc="none", linewidth=1.5)

# Step 4: Decorations
plt.gca().set(xlim=(0.0, 0.1), ylim=(0, 90000),
              xlabel='Area', ylabel='Population')

plt.xticks(fontsize=12); plt.yticks(fontsize=12)
plt.title("Bubble Plot with Encircling", fontsize=22)
plt.legend(fontsize=12)
plt.show()
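3. Scatter plot with linear regression line of best fit
If you want to understand how two variables change with respect to each other, a line of best fit is the way to go. The plot produced here shows how the line of best fit differs across groups in the data. To disable the grouping and draw a single line of best fit for the whole dataset, remove the hue="cyl" argument from the sns.lmplot() call.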
# Import Data
df = pd.read_csv("https://raw.githubusercontent.com/selva86/datasets/master/mpg_ggplot2.csv")
df_select = df.loc[df.cyl.isin([4,8]), :]

# Plot
sns.set_style("white")
gridobj = sns.lmplot(x="displ", y="hwy", hue="cyl", data=df_select,
                     height=7, aspect=1.6, robust=True, palette='tab10',
                     scatter_kws=dict(s=60, linewidths=.7, edgecolors='black'))

# Decorations
gridobj.set(xlim=(0.5, 7.5), ylim=(0, 50))
plt.title("Scatterplot with line of best fit grouped by number of cylinders", fontsize=20)
plt.show()
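Each regression line in its own column
Alternatively, you can show the line of best fit for each group in its own column. You can do this by setting the col argument (col="cyl" here) inside sns.lmplot().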
# Import Data
df = pd.read_csv("https://raw.githubusercontent.com/selva86/datasets/master/mpg_ggplot2.csv")
df_select = df.loc[df.cyl.isin([4,8]), :]

# Each line in its own column
sns.set_style("white")
gridobj = sns.lmplot(x="displ", y="hwy",
                     data=df_select,
                     height=7,
                     robust=True,
                     palette='Set1',
                     col="cyl",
                     scatter_kws=dict(s=60, linewidths=.7, edgecolors='black'))

# Decorations
gridobj.set(xlim=(0.5, 7.5), ylim=(0, 50))
plt.show()
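4. Jittering with stripplot
Often, several data points have exactly the same X and Y values. As a result, they are drawn on top of one another and hide each other. To avoid this, jitter the points slightly so that you can see each of them. This is convenient to do with seaborn's stripplot().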
# Import Data
df = pd.read_csv("https://raw.githubusercontent.com/selva86/datasets/master/mpg_ggplot2.csv")

# Draw Stripplot
fig, ax = plt.subplots(figsize=(16,10), dpi=80)
# x and y are passed by keyword; recent seaborn releases no longer accept them positionally
sns.stripplot(x=df.cty, y=df.hwy, jitter=0.25, size=8, ax=ax, linewidth=.5)

# Decorations
plt.title('Use jittered plots to avoid overlapping of points', fontsize=22)
plt.show()
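To make the idea of jittering concrete, here is a minimal sketch in plain Python. It is not seaborn's internal implementation: the helper name `jitter`, the fixed seed, and the sample values are assumptions made for illustration. It adds a small uniform random offset to each x value so that coincident points separate, using the same width of 0.25 as the jitter argument in the stripplot call.

```python
import random

def jitter(values, width=0.25, seed=0):
    """Return a copy of `values` with uniform noise in [-width, width] added."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return [v + rng.uniform(-width, width) for v in values]

# Five hypothetical observations that share the same x value and would overlap exactly.
x = [18, 18, 18, 18, 18]
x_jittered = jitter(x)

# Each point moves by at most `width`, so its original value is still recognisable.
max_shift = max(abs(a - b) for a, b in zip(x, x_jittered))
print(len(set(x)), "distinct x before,", len(set(x_jittered)), "after")
print("largest shift:", round(max_shift, 3))
```

The offset is kept small relative to the spacing between neighbouring x values, so the points spread out without drifting into the next category.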
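5. Counts plot
Another way to avoid the overlap problem is to increase the size of each dot according to how many points lie at that location. The larger the dot, the greater the concentration of points around it.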
# Import Data
df = pd.read_csv("https://raw.githubusercontent.com/selva86/datasets/master/mpg_ggplot2.csv")
df_counts = df.groupby(['hwy', 'cty']).size().reset_index(name='counts')

# Draw Stripplot
fig, ax = plt.subplots(figsize=(16,10), dpi=80)
# x and y are passed by keyword; recent seaborn releases no longer accept them positionally
sns.stripplot(x=df_counts.cty, y=df_counts.hwy, size=df_counts.counts*2, ax=ax)

# Decorations
plt.title('Counts Plot - Size of circle is bigger as more points overlap', fontsize=22)
plt.show()
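The counting step behind this plot can be sketched without pandas. The example below uses made-up (cty, hwy) pairs rather than the mpg dataset, and counts how many observations share each location, which is the job done by groupby(['hwy', 'cty']).size() in the code for this plot. The factor of 2 mirrors the counts*2 sizing used there.

```python
from collections import Counter

# Hypothetical (cty, hwy) observations; several pairs repeat.
points = [(18, 26), (18, 26), (21, 29), (18, 26), (21, 29), (15, 20)]

# Count how many observations fall on each distinct (x, y) location.
counts = Counter(points)

# One marker per distinct location, with a size proportional to its count.
sizes = {location: n * 2 for location, n in counts.items()}

for location, size in sorted(sizes.items()):
    print(location, "count =", counts[location], "marker size =", size)
```

Drawing one marker per distinct location, rather than one per observation, is what removes the overlap; the marker size then carries the information that stacking would have hidden.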
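6. Marginal histogram
A marginal histogram has histograms along the X and Y axis variables. It is used to visualize the relationship between X and Y together with the univariate distribution of each. This plot is often used in exploratory data analysis (EDA).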
# Import Data
df = pd.read_csv("https://raw.githubusercontent.com/selva86/datasets/master/mpg_ggplot2.csv")

# Create Fig and gridspec
fig = plt.figure(figsize=(16, 10), dpi=80)
grid = plt.GridSpec(4, 4, hspace=0.5, wspace=0.2)

# Define the axes
ax_main = fig.add_subplot(grid[:-1, :-1])
ax_right = fig.add_subplot(grid[:-1, -1], xticklabels=[], yticklabels=[])
ax_bottom = fig.add_subplot(grid[-1, 0:-1], xticklabels=[], yticklabels=[])

# Scatterplot on main ax
ax_main.scatter('displ', 'hwy', s=df.cty*4, c=df.manufacturer.astype('category').cat.codes, alpha=.9, data=df, cmap="tab10", edgecolors='gray', linewidths=.5)

# Histogram of displ at the bottom
ax_bottom.hist(df.displ, 40, histtype='stepfilled', orientation='vertical', color='deeppink')
ax_bottom.invert_yaxis()

# Histogram of hwy on the right
ax_right.hist(df.hwy, 40, histtype='stepfilled', orientation='horizontal', color='deeppink')

# Decorations
ax_main.set(title='Scatterplot with Histograms \n displ vs hwy', xlabel='displ', ylabel='hwy')
ax_main.title.set_fontsize(20)
for item in ([ax_main.xaxis.label, ax_main.yaxis.label] + ax_main.get_xticklabels() + ax_main.get_yticklabels()):
    item.set_fontsize(14)

xlabels = ax_main.get_xticks().tolist()
ax_main.set_xticklabels(xlabels)
plt.show()
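7. Marginal boxplot
A marginal boxplot serves a similar purpose to the marginal histogram. However, the boxplot helps to pinpoint the median and the 25th and 75th percentiles of X and Y.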
# Import Data
df = pd.read_csv("https://raw.githubusercontent.com/selva86/datasets/master/mpg_ggplot2.csv")

# Create Fig and gridspec
fig = plt.figure(figsize=(16, 10), dpi=80)
grid = plt.GridSpec(4, 4, hspace=0.5, wspace=0.2)

# Define the axes
ax_main = fig.add_subplot(grid[:-1, :-1])
ax_right = fig.add_subplot(grid[:-1, -1], xticklabels=[], yticklabels=[])
ax_bottom = fig.add_subplot(grid[-1, 0:-1], xticklabels=[], yticklabels=[])

# Scatterplot on main ax
ax_main.scatter('displ', 'hwy', s=df.cty*5, c=df.manufacturer.astype('category').cat.codes, alpha=.9, data=df, cmap="Set1", edgecolors='black', linewidths=.5)

# Add a boxplot in each margin (data passed by keyword so the orientation is explicit)
sns.boxplot(y=df.hwy, ax=ax_right, orient="v")
sns.boxplot(x=df.displ, ax=ax_bottom, orient="h")

# Decorations
# Remove axis names for the boxplots
ax_bottom.set(xlabel='')
ax_right.set(ylabel='')

# Main Title, Xlabel and YLabel
ax_main.set(title='Scatterplot with Boxplots \n displ vs hwy', xlabel='displ', ylabel='hwy')

# Set font size of different components
ax_main.title.set_fontsize(20)
for item in ([ax_main.xaxis.label, ax_main.yaxis.label] + ax_main.get_xticklabels() + ax_main.get_yticklabels()):
    item.set_fontsize(14)

plt.show()
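8. Correlogram
A correlogram is used to visually inspect the correlation metric between all possible pairs of numeric variables in a given dataframe (or 2D array).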
# Import Dataset
df = pd.read_csv("https://github.com/selva86/datasets/raw/master/mtcars.csv")

# Correlate the numeric columns only; the file also holds text columns such as 'cars'
corr = df.select_dtypes('number').corr()

# Plot
plt.figure(figsize=(12,10), dpi=80)
sns.heatmap(corr, xticklabels=corr.columns, yticklabels=corr.columns, cmap='RdYlGn', center=0, annot=True)

# Decorations
plt.title('Correlogram of mtcars', fontsize=22)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.show()
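Each cell of a correlogram is a Pearson correlation coefficient. As a rough illustration of what corr() computes for every pair of columns, the sketch below builds the matrix by hand in plain Python. The helper `pearson` and the three small columns are hypothetical and are not taken from mtcars.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical columns used only for illustration.
data = {
    "weight": [2.6, 2.9, 3.2, 3.4, 3.6],
    "mileage": [21.0, 21.4, 18.7, 18.1, 14.3],
    "power": [110, 93, 175, 105, 245],
}

# Build the symmetric correlation matrix that a heatmap would display.
names = list(data)
corr = {a: {b: pearson(data[a], data[b]) for b in names} for a in names}

for a in names:
    print(a, [round(corr[a][b], 2) for b in names])
```

The diagonal is always 1 and the matrix is symmetric, which is why a diverging colormap centred at 0 reads well: the sign and strength of each relationship show up as colour.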
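9. Pairwise plot
The pairwise plot is a favorite in exploratory analysis for understanding the relationship between all possible pairs of numeric variables. It is a must-have tool for bivariate analysis.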
# Load Dataset
df = sns.load_dataset('iris')

# Plot
plt.figure(figsize=(10,8), dpi=80)
sns.pairplot(df, kind="scatter", hue="species", plot_kws=dict(s=80, edgecolor="white", linewidth=2.5))
plt.show()

# Load Dataset
df = sns.load_dataset('iris')

# Plot
plt.figure(figsize=(10,8), dpi=80)
sns.pairplot(df, kind="reg", hue="species")
plt.show()
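Deviation
10. Diverging bars
If you want to see how items vary on a single metric and visualize the order and magnitude of that variance, diverging bars are a great tool. They help to quickly differentiate the performance of groups in your data, and they are intuitive and convey the point instantly.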
# Prepare Data
df = pd.read_csv("https://github.com/selva86/datasets/raw/master/mtcars.csv")
x = df['mpg']
df['mpg_z'] = (x - x.mean())/x.std()
df['colors'] = ['red' if x < 0 else 'green' for x in df['mpg_z']]
df.sort_values('mpg_z', inplace=True)
df.reset_index(inplace=True)

# Draw plot
plt.figure(figsize=(14,10), dpi=80)
plt.hlines(y=df.index, xmin=0, xmax=df.mpg_z, color=df.colors, alpha=0.4, linewidth=5)

# Decorations
plt.gca().set(ylabel='$Model$', xlabel='$Mileage$')
plt.yticks(df.index, df.cars, fontsize=12)
plt.title('Diverging Bars of Car Mileage', fontdict={'size':20})
plt.grid(linestyle='--', alpha=0.5)
plt.show()
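The data preparation for a diverging chart comes down to standardising the metric and colouring by sign. The following plain-Python sketch walks through that arithmetic on a handful of made-up mileage values (not the mtcars data). It uses the sample standard deviation, which is also the pandas default for .std().

```python
import statistics

# Hypothetical mileage values used only for illustration.
mpg = [21.0, 22.8, 18.7, 14.3, 30.4, 15.8]

mean = statistics.mean(mpg)
std = statistics.stdev(mpg)  # sample standard deviation (divides by n - 1)

# Standardise: how many standard deviations each value lies from the mean.
mpg_z = [(value - mean) / std for value in mpg]

# Below-average items are drawn red, the rest green.
colors = ['red' if z < 0 else 'green' for z in mpg_z]

# Sort so the bars run from the most negative to the most positive deviation.
for z, color in sorted(zip(mpg_z, colors)):
    print(f"{z:+.2f} {color}")
```

Because the z-scores are centred on zero, the bars naturally diverge to the left and right of the axis, and sorting them makes the ranking visible at a glance.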
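11. Diverging texts
Diverging texts are similar to diverging bars, and they are preferable if you want to show the value of each item in the chart in a clean, presentable way.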
# Prepare Data
df = pd.read_csv("https://github.com/selva86/datasets/raw/master/mtcars.csv")
x = df['mpg']
df['mpg_z'] = (x - x.mean())/x.std()
df['colors'] = ['red' if x < 0 else 'green' for x in df['mpg_z']]
df.sort_values('mpg_z', inplace=True)
df.reset_index(inplace=True)

# Draw plot
plt.figure(figsize=(14,14), dpi=80)
plt.hlines(y=df.index, xmin=0, xmax=df.mpg_z)
for x, y, tex in zip(df.mpg_z, df.index, df.mpg_z):
    t = plt.text(x, y, round(tex, 2), horizontalalignment='right' if x < 0 else 'left',
                 verticalalignment='center', fontdict={'color':'red' if x < 0 else 'green', 'size':14})

# Decorations
plt.yticks(df.index, df.cars, fontsize=12)
plt.title('Diverging Text Bars of Car Mileage', fontdict={'size':20})
plt.grid(linestyle='--', alpha=0.5)
plt.xlim(-2.5, 2.5)
plt.show()
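12. Diverging dot plot
The diverging dot plot is also similar to diverging bars. However, compared with diverging bars, the absence of bars reduces the contrast and disparity between groups.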
# Prepare Data
df = pd.read_csv("https://github.com/selva86/datasets/raw/master/mtcars.csv")
x = df['mpg']
df['mpg_z'] = (x - x.mean())/x.std()
df['colors'] = ['red' if x < 0 else 'darkgreen' for x in df['mpg_z']]
df.sort_values('mpg_z', inplace=True)
df.reset_index(inplace=True)

# Draw plot
plt.figure(figsize=(14,16), dpi=80)
plt.scatter(df.mpg_z, df.index, s=450, alpha=.6, color=df.colors)
for x, y, tex in zip(df.mpg_z, df.index, df.mpg_z):
    t = plt.text(x, y, round(tex, 1), horizontalalignment='center',
                 verticalalignment='center', fontdict={'color':'white'})

# Decorations
# Lighten borders
plt.gca().spines["top"].set_alpha(.3)
plt.gca().spines["bottom"].set_alpha(.3)
plt.gca().spines["right"].set_alpha(.3)
plt.gca().spines["left"].set_alpha(.3)

plt.yticks(df.index, df.cars)
plt.title('Diverging Dotplot of Car Mileage', fontdict={'size':20})
plt.xlabel('$Mileage$')
plt.grid(linestyle='--', alpha=0.5)
plt.xlim(-2.5, 2.5)
plt.show()
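13. Diverging lollipop chart with markers
A lollipop chart with markers provides a flexible way of visualizing divergence: it emphasizes any significant data points you want to draw attention to and lets you give the reasoning within the chart itself.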
# Prepare Data
df = pd.read_csv("https://github.com/selva86/datasets/raw/master/mtcars.csv")
x = df['mpg']
df['mpg_z'] = (x - x.mean())/x.std()
df['colors'] = 'black'

# color fiat differently
df.loc[df.cars == 'Fiat X1-9', 'colors'] = 'darkorange'
df.sort_values('mpg_z', inplace=True)
df.reset_index(inplace=True)

# Draw plot
import matplotlib.patches as patches

plt.figure(figsize=(14,16), dpi=80)
plt.hlines(y=df.index, xmin=0, xmax=df.mpg_z, color=df.colors, alpha=0.4, linewidth=1)
plt.scatter(df.mpg_z, df.index, color=df.colors, s=[600 if x == 'Fiat X1-9' else 300 for x in df.cars], alpha=0.6)
plt.yticks(df.index, df.cars)
plt.xticks(fontsize=12)

# Annotate
plt.annotate('Mercedes Models', xy=(0.0, 11.0), xytext=(1.0, 11), xycoords='data',
             fontsize=15, ha='center', va='center',
             bbox=dict(boxstyle='square', fc='firebrick'),
             arrowprops=dict(arrowstyle='-[, widthB=2.0, lengthB=1.5', lw=2.0, color='steelblue'), color='white')

# Add Patches
p1 = patches.Rectangle((-2.0, -1), width=.3, height=3, alpha=.2, facecolor='red')
p2 = patches.Rectangle((1.5, 27), width=.8, height=5, alpha=.2, facecolor='green')
plt.gca().add_patch(p1)
plt.gca().add_patch(p2)

# Decorate
plt.title('Diverging Bars of Car Mileage', fontdict={'size':20})
plt.grid(linestyle='--', alpha=0.5)
plt.show()
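14. Area chart
By coloring the area between the axis and the line, the area chart emphasizes not only the peaks and troughs but also how long the highs and lows last. The longer the highs last, the larger the area under the line.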
import numpy as np
import pandas as pd

# Prepare Data
df = pd.read_csv("https://github.com/selva86/datasets/raw/master/economics.csv", parse_dates=['date']).head(100)
x = np.arange(df.shape[0])
y_returns = (df.psavert.diff().fillna(0)/df.psavert.shift(1)).fillna(0) * 100

# Plot
plt.figure(figsize=(16,10), dpi=80)
plt.fill_between(x[1:], y_returns[1:], 0, where=y_returns[1:] >= 0, facecolor='green', interpolate=True, alpha=0.7)
plt.fill_between(x[1:], y_returns[1:], 0, where=y_returns[1:] <= 0, facecolor='red', interpolate=True, alpha=0.7)

# Annotate
plt.annotate('Peak \n1975', xy=(94.0, 21.0), xytext=(88.0, 28),
             bbox=dict(boxstyle='square', fc='firebrick'),
             arrowprops=dict(facecolor='steelblue', shrink=0.05), fontsize=15, color='white')

# Decorations
xtickvals = [str(m)[:3].upper()+"-"+str(y) for y, m in zip(df.date.dt.year, df.date.dt.month_name())]
plt.gca().set_xticks(x[::6])
plt.gca().set_xticklabels(xtickvals[::6], rotation=90, fontdict={'horizontalalignment': 'center', 'verticalalignment': 'center_baseline'})
plt.ylim(-35, 35)
plt.xlim(1, 100)
plt.title("Month Economics Return %", fontsize=22)
plt.ylabel('Monthly returns %')
plt.grid(alpha=0.5)
plt.show()
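The quantity plotted in this chart is a month-over-month percentage change. The sketch below reproduces that calculation in plain Python on a few made-up savings-rate values (not the economics dataset), so the role of diff(), shift(1) and fillna(0) is easier to follow. It then splits the series into the non-negative and non-positive parts that the two fill_between calls shade.

```python
# Hypothetical monthly savings-rate values used only for illustration.
psavert = [12.5, 12.5, 11.7, 12.5, 12.1]

# The first month has no predecessor, so its return is set to 0
# (the pandas expression achieves the same with fillna(0)).
returns = [0.0]
for previous, current in zip(psavert, psavert[1:]):
    returns.append((current - previous) / previous * 100)

# Split into the two regions that the chart shades green and red.
gains = [r if r >= 0 else 0.0 for r in returns]
losses = [r if r <= 0 else 0.0 for r in returns]

for r in returns:
    print(f"{r:+.2f}%")
```

Zero values belong to both regions, which matches the >= 0 and <= 0 conditions in the plotting code and keeps the two fills joined at the axis.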
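Ranking
15. Ordered bar chart
An ordered bar chart conveys the rank order of the items effectively. Adding the value of the metric above each bar also lets the reader get precise information from the chart itself.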
# Prepare Data
df_raw = pd.read_csv("https://github.com/selva86/datasets/raw/master/mpg_ggplot2.csv")
# Mean city mileage per manufacturer
df = df_raw[['cty', 'manufacturer']].groupby('manufacturer').mean()
df.sort_values('cty', inplace=True)
df.reset_index(inplace=True)

# Draw plot
import matplotlib.patches as patches

fig, ax = plt.subplots(figsize=(16,10), facecolor='white', dpi=80)
ax.vlines(x=df.index, ymin=0, ymax=df.cty, color='firebrick', alpha=0.7, linewidth=20)

# Annotate Text
for i, cty in enumerate(df.cty):
    ax.text(i, cty+0.5, round(cty, 1), horizontalalignment='center')

# Title, Label, Ticks and Ylim
ax.set_title('Bar Chart for Highway Mileage', fontdict={'size':22})
ax.set(ylabel='Miles Per Gallon', ylim=(0, 30))
plt.xticks(df.index, df.manufacturer.str.upper(), rotation=60, horizontalalignment='right', fontsize=12)

# Add patches to color the X axis labels
p1 = patches.Rectangle((.57, -0.005), width=.33, height=.13, alpha=.1, facecolor='green', transform=fig.transFigure)
p2 = patches.Rectangle((.124, -0.005), width=.446, height=.13, alpha=.1, facecolor='red', transform=fig.transFigure)
fig.add_artist(p1)
fig.add_artist(p2)
plt.show()
16. Lollipop chart
A lollipop chart serves a similar purpose to an ordered bar chart in a visually pleasing way.
# Prepare Data
df_raw = pd.read_csv("https://github.com/selva86/datasets/raw/master/mpg_ggplot2.csv")
# Mean city mileage per manufacturer
df = df_raw[['cty', 'manufacturer']].groupby('manufacturer').mean()
df.sort_values('cty', inplace=True)
df.reset_index(inplace=True)

# Draw plot
fig, ax = plt.subplots(figsize=(16,10), dpi=80)
ax.vlines(x=df.index, ymin=0, ymax=df.cty, color='firebrick', alpha=0.7, linewidth=2)
ax.scatter(x=df.index, y=df.cty, s=75, color='firebrick', alpha=0.7)

# Title, Label, Ticks and Ylim
ax.set_title('Lollipop Chart for Highway Mileage', fontdict={'size':22})
ax.set_ylabel('Miles Per Gallon')
ax.set_xticks(df.index)
ax.set_xticklabels(df.manufacturer.str.upper(), rotation=60, fontdict={'horizontalalignment': 'right', 'size':12})
ax.set_ylim(0, 30)

# Annotate
for row in df.itertuples():
    ax.text(row.Index, row.cty+.5, s=round(row.cty, 2), horizontalalignment='center', verticalalignment='bottom', fontsize=14)

plt.show()
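17. Dot plot
The dot plot conveys the rank order of the items. Since the dots are aligned along the horizontal axis, you can more easily see how far the points are from one another.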
# Prepare Data
df_raw = pd.read_csv("https://github.com/selva86/datasets/raw/master/mpg_ggplot2.csv")
# Mean city mileage per manufacturer
df = df_raw[['cty', 'manufacturer']].groupby('manufacturer').mean()
df.sort_values('cty', inplace=True)
df.reset_index(inplace=True)

# Draw plot
fig, ax = plt.subplots(figsize=(16,10), dpi=80)
ax.hlines(y=df.index, xmin=11, xmax=26, color='gray', alpha=0.7, linewidth=1, linestyles='dashdot')
ax.scatter(y=df.index, x=df.cty, s=75, color='firebrick', alpha=0.7)

# Title, Label, Ticks and Ylim
ax.set_title('Dot Plot for Highway Mileage', fontdict={'size':22})
ax.set_xlabel('Miles Per Gallon')
ax.set_yticks(df.index)
ax.set_yticklabels(df.manufacturer.str.title(), fontdict={'horizontalalignment': 'right'})
ax.set_xlim(10, 27)
plt.show()
18. Slope chart
A slope chart is best suited to comparing the 'Before' and 'After' positions of a given person or item.
import matplotlib.lines as mlines

# Import Data
df = pd.read_csv("https://raw.githubusercontent.com/selva86/datasets/master/gdppercap.csv")

left_label = [str(c) + ', ' + str(round(y)) for c, y in zip(df.continent, df['1952'])]
right_label = [str(c) + ', ' + str(round(y)) for c, y in zip(df.continent, df['1957'])]
klass = ['red' if (y1-y2) < 0 else 'green' for y1, y2 in zip(df['1952'], df['1957'])]

# draw line
# https://stackoverflow.com/questions/36470343/how-to-draw-a-line-with-matplotlib/36479941
def newline(p1, p2, color='black'):
    ax = plt.gca()
    # Red if the value fell between the two years, green otherwise
    l = mlines.Line2D([p1[0],p2[0]], [p1[1],p2[1]], color='red' if p1[1]-p2[1] > 0 else 'green', marker='o', markersize=6)
    ax.add_line(l)
    return l

fig, ax = plt.subplots(1, 1, figsize=(14,14), dpi=80)

# Vertical Lines
ax.vlines(x=1, ymin=500, ymax=13000, color='black', alpha=0.7, linewidth=1, linestyles='dotted')
ax.vlines(x=3, ymin=500, ymax=13000, color='black', alpha=0.7, linewidth=1, linestyles='dotted')

# Points
ax.scatter(y=df['1952'], x=np.repeat(1, df.shape[0]), s=10, color='black', alpha=0.7)
ax.scatter(y=df['1957'], x=np.repeat(3, df.shape[0]), s=10, color='black', alpha=0.7)

# Line Segments and Annotation
for p1, p2, c in zip(df['1952'], df['1957'], df['continent']):
    newline([1,p1], [3,p2])
    ax.text(1-0.05, p1, c + ', ' + str(round(p1)), horizontalalignment='right', verticalalignment='center', fontdict={'size':14})
    ax.text(3+0.05, p2, c + ', ' + str(round(p2)), horizontalalignment='left', verticalalignment='center', fontdict={'size':14})

# 'Before' and 'After' Annotations
ax.text(1-0.05, 13000, 'BEFORE', horizontalalignment='right', verticalalignment='center', fontdict={'size':18, 'weight':700})
ax.text(3+0.05, 13000, 'AFTER', horizontalalignment='left', verticalalignment='center', fontdict={'size':18, 'weight':700})

# Decoration
ax.set_title("Slopechart: Comparing GDP Per Capita between 1952 vs 1957", fontdict={'size':22})
ax.set(xlim=(0,4), ylim=(0,14000), ylabel='Mean GDP Per Capita')
ax.set_xticks([1,3])
ax.set_xticklabels(["1952", "1957"])
plt.yticks(np.arange(500, 13000, 2000), fontsize=12)

# Lighten borders
plt.gca().spines["top"].set_alpha(.0)
plt.gca().spines["bottom"].set_alpha(.0)
plt.gca().spines["right"].set_alpha(.0)
plt.gca().spines["left"].set_alpha(.0)
plt.show()
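19. Dumbbell plot
A dumbbell plot conveys the 'before' and 'after' positions of various items along with their rank ordering. It is very useful if you want to visualize the effect of a particular project or initiative on different objects.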
import matplotlib.lines as mlines

# Import Data
df = pd.read_csv("https://raw.githubusercontent.com/selva86/datasets/master/health.csv")
df.sort_values('pct_2014', inplace=True)
df.reset_index(inplace=True)

# Func to draw line segment
def newline(p1, p2, color='black'):
    ax = plt.gca()
    l = mlines.Line2D([p1[0],p2[0]], [p1[1],p2[1]], color='skyblue')
    ax.add_line(l)
    return l

# Figure and Axes
fig, ax = plt.subplots(1, 1, figsize=(14,14), facecolor='#f7f7f7', dpi=80)

# Vertical Lines
ax.vlines(x=.05, ymin=0, ymax=26, color='black', alpha=1, linewidth=1, linestyles='dotted')
ax.vlines(x=.10, ymin=0, ymax=26, color='black', alpha=1, linewidth=1, linestyles='dotted')
ax.vlines(x=.15, ymin=0, ymax=26, color='black', alpha=1, linewidth=1, linestyles='dotted')
ax.vlines(x=.20, ymin=0, ymax=26, color='black', alpha=1, linewidth=1, linestyles='dotted')

# Points
ax.scatter(y=df['index'], x=df['pct_2013'], s=50, color='#0e668b', alpha=0.7)
ax.scatter(y=df['index'], x=df['pct_2014'], s=50, color='#a3c4dc', alpha=0.7)

# Line Segments
for i, p1, p2 in zip(df['index'], df['pct_2013'], df['pct_2014']):
    newline([p1, i], [p2, i])

# Decoration
ax.set_facecolor('#f7f7f7')
ax.set_title("Dumbell Chart: Pct Change - 2013 vs 2014", fontdict={'size':22})
ax.set(xlim=(0,.25), ylim=(-1, 27), ylabel='Mean GDP Per Capita')
ax.set_xticks([.05, .1, .15, .20])
# Labels corrected to match the tick positions (the original listed '5%', '15%', '20%', '25%')
ax.set_xticklabels(['5%', '10%', '15%', '20%'])
plt.show()
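Distribution
20. Histogram for continuous variable
A histogram shows the frequency distribution of a given variable. The representation used here groups the frequency bars by a categorical variable, giving better insight into the continuous variable and the categorical variable together.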
# Import Data
df = pd.read_csv("https://github.com/selva86/datasets/raw/master/mpg_ggplot2.csv")

# Prepare data
x_var = 'displ'
groupby_var = 'class'
df_agg = df.loc[:, [x_var, groupby_var]].groupby(groupby_var)
vals = [group_df[x_var].values.tolist() for _, group_df in df_agg]

# Draw
plt.figure(figsize=(16,9), dpi=80)
colors = [plt.cm.Spectral(i/float(len(vals)-1)) for i in range(len(vals))]
n, bins, patches = plt.hist(vals, 30, stacked=True, density=False, color=colors[:len(vals)])

# Decoration
plt.legend({group: col for group, col in zip(np.unique(df[groupby_var]).tolist(), colors[:len(vals)])})
plt.title(f"Stacked Histogram of ${x_var}$ colored by ${groupby_var}$", fontsize=22)
plt.xlabel(x_var)
plt.ylabel("Frequency")
plt.ylim(0, 25)
plt.xticks(ticks=bins[::3], labels=[round(b,1) for b in bins[::3]])
plt.show()
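What makes a stacked histogram valid is that every group is counted against the same bin edges, so the per-group counts can be added bar by bar. The plain-Python sketch below illustrates that bookkeeping with hypothetical displacement values and an arbitrary choice of four bins; it is not how Matplotlib implements hist(), only the same arithmetic written out.

```python
# Hypothetical engine displacements grouped by vehicle class.
groups = {
    "compact": [1.8, 2.0, 2.0, 2.8],
    "suv": [4.0, 4.7, 5.3, 5.7, 4.0],
    "midsize": [2.4, 2.5, 3.0],
}

n_bins = 4
all_values = [v for values in groups.values() for v in values]
lo, hi = min(all_values), max(all_values)
width = (hi - lo) / n_bins

def bin_index(value):
    """Index of the equal-width bin holding `value`; the maximum goes in the last bin."""
    return min(int((value - lo) / width), n_bins - 1)

# Per-group counts share the same bin edges, which is what allows stacking.
counts = {name: [0] * n_bins for name in groups}
for name, values in groups.items():
    for v in values:
        counts[name][bin_index(v)] += 1

# The height of each stacked bar is the sum of the group counts in that bin.
stacked_heights = [sum(counts[name][b] for name in groups) for b in range(n_bins)]
print(counts)
print(stacked_heights)
```

Summing the stacked heights recovers the total number of observations, which is a quick sanity check when the y-axis limit has been set by hand.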
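21. Histogram for categorical variable
A histogram of a categorical variable shows the frequency distribution of that variable. By coloring the bars, you can relate the distribution to another categorical variable represented by the colors.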
# Import Data
df = pd.read_csv("https://github.com/selva86/datasets/raw/master/mpg_ggplot2.csv")

# Prepare data
x_var = 'manufacturer'
groupby_var = 'class'
df_agg = df.loc[:, [x_var, groupby_var]].groupby(groupby_var)
vals = [group_df[x_var].values.tolist() for _, group_df in df_agg]

# Draw
plt.figure(figsize=(16,9), dpi=80)
colors = [plt.cm.Spectral(i/float(len(vals)-1)) for i in range(len(vals))]
n, bins, patches = plt.hist(vals, len(df[x_var].unique()), stacked=True, density=False, color=colors[:len(vals)])

# Decoration
plt.legend({group: col for group, col in zip(np.unique(df[groupby_var]).tolist(), colors[:len(vals)])})
plt.title(f"Stacked Histogram of ${x_var}$ colored by ${groupby_var}$", fontsize=22)
plt.xlabel(x_var)
plt.ylabel("Frequency")
plt.ylim(0, 40)
# bins holds one more edge than there are categories, so drop the last edge to match the labels
plt.xticks(ticks=bins[:-1], labels=np.unique(df[x_var]).tolist(), rotation=90, horizontalalignment='left')
plt.show()
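22. Density plot
Density plots are a commonly used tool for visualizing the distribution of a continuous variable. By grouping them by the 'response' variable, you can inspect the relationship between X and Y. The example here, for illustration, shows how the distribution of city mileage varies with the number of cylinders.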
# Import Data
df = pd.read_csv("https://github.com/selva86/datasets/raw/master/mpg_ggplot2.csv")

# Draw Plot
plt.figure(figsize=(16,10), dpi=80)
# shade=True matches seaborn 0.9.0; newer seaborn releases use fill=True instead
sns.kdeplot(df.loc[df['cyl'] == 4, "cty"], shade=True, color="g", label="Cyl=4", alpha=.7)
sns.kdeplot(df.loc[df['cyl'] == 5, "cty"], shade=True, color="deeppink", label="Cyl=5", alpha=.7)
sns.kdeplot(df.loc[df['cyl'] == 6, "cty"], shade=True, color="dodgerblue", label="Cyl=6", alpha=.7)
sns.kdeplot(df.loc[df['cyl'] == 8, "cty"], shade=True, color="orange", label="Cyl=8", alpha=.7)

# Decoration
plt.title('Density Plot of City Mileage by n_Cylinders', fontsize=22)
plt.legend()
plt.show()
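A density curve of this kind is a kernel density estimate: one small Gaussian bump is placed on every observation and the bumps are averaged. The sketch below shows that idea in plain Python. The helper `gaussian_kde`, the sample values, and the hand-picked bandwidth of 1.5 are all assumptions for illustration; seaborn selects a bandwidth automatically, so its curves will not match this one exactly.

```python
import math

def gaussian_kde(samples, bandwidth):
    """Return a function estimating the density of `samples` with Gaussian kernels."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)

    return density

# Hypothetical city-mileage values for one cylinder group.
cty = [18, 21, 20, 21, 16, 18, 19, 22]
density = gaussian_kde(cty, bandwidth=1.5)  # bandwidth chosen by hand

# Evaluate on a grid from 5.0 to 35.0 and check that the curve integrates to roughly 1.
step = 0.1
grid = [5 + i * step for i in range(301)]
area = sum(density(x) for x in grid) * step
print("approximate area under the curve:", round(area, 3))
```

A smaller bandwidth gives a bumpier curve that follows individual points, while a larger one smooths groups together, which is worth keeping in mind when comparing the shapes of several groups on one axis.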
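23. Density curves with histogram
A density curve with a histogram brings together the collective information conveyed by the two plots, so you can have both in a single figure instead of two.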
# Import Data
df = pd.read_csv("https://github.com/selva86/datasets/raw/master/mpg_ggplot2.csv")

# Draw Plot
plt.figure(figsize=(13,10), dpi=80)
sns.distplot(df.loc[df['class'] == 'compact', "cty"], color="dodgerblue", label="Compact", hist_kws={'alpha':.7}, kde_kws={'linewidth':3})
sns.distplot(df.loc[df['class'] == 'suv', "cty"], color="orange", label="SUV", hist_kws={'alpha':.7}, kde_kws={'linewidth':3})
sns.distplot(df.loc[df['class'] == 'minivan', "cty"], color="g", label="minivan", hist_kws={'alpha':.7}, kde_kws={'linewidth':3})
plt.ylim(0, 0.35)

# Decoration
plt.title('Density Plot of City Mileage by Vehicle Type', fontsize=22)
plt.legend()
plt.show()
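24. Joy Plot
A Joy Plot lets the density curves of different groups overlap; it is a great way to visualize the distribution of a large number of groups relative to each other. It looks pleasing to the eye and conveys the right information clearly. It can be built easily with the joypy package, which is based on matplotlib.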
# !pip install joypy
import joypy

# Import Data
mpg = pd.read_csv("https://github.com/selva86/datasets/raw/master/mpg_ggplot2.csv")

# Draw Plot
plt.figure(figsize=(16,10), dpi=80)
fig, axes = joypy.joyplot(mpg, column=['hwy', 'cty'], by="class", ylim='own', figsize=(14,10))

# Decoration
plt.title('Joy Plot of City and Highway Mileage by Class', fontsize=22)
plt.show()
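25. Distributed dot plot
A distributed dot plot shows the univariate distribution of points segmented by group. The darker the points, the higher the concentration of data points in that region. By coloring the median differently, the real positioning of the groups becomes apparent immediately.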
import matplotlib.patches as mpatches

# Prepare Data
df_raw = pd.read_csv("https://github.com/selva86/datasets/raw/master/mpg_ggplot2.csv")
cyl_colors = {4:'tab:red', 5:'tab:green', 6:'tab:blue', 8:'tab:orange'}
df_raw['cyl_color'] = df_raw.cyl.map(cyl_colors)

# Mean and Median city mileage by make
df = df_raw[['cty', 'manufacturer']].groupby('manufacturer').mean()
df.sort_values('cty', ascending=False, inplace=True)
df.reset_index(inplace=True)
df_median = df_raw[['cty', 'manufacturer']].groupby('manufacturer').median()

# Draw horizontal lines
fig, ax = plt.subplots(figsize=(16,10), dpi=80)
ax.hlines(y=df.index, xmin=0, xmax=40, color='gray', alpha=0.5, linewidth=.5, linestyles='dashdot')

# Draw the Dots
for i, make in enumerate(df.manufacturer):
    df_make = df_raw.loc[df_raw.manufacturer==make, :]
    ax.scatter(y=np.repeat(i, df_make.shape[0]), x='cty', data=df_make, s=75, edgecolors='gray', c='w', alpha=0.5)
    ax.scatter(y=i, x='cty', data=df_median.loc[df_median.index==make, :], s=75, c='firebrick')

# Annotate
ax.text(33, 13, r"$red \; dots \; are \; the \: median$", fontdict={'size':12}, color='firebrick')

# Decorations
red_patch = plt.plot([], [], marker="o", ms=10, ls="", mec=None, color='firebrick', label="Median")
plt.legend(handles=red_patch)
ax.set_title('Distribution of City Mileage by Make', fontdict={'size':22})
ax.set_xlabel('Miles Per Gallon (City)', alpha=0.7)
ax.set_yticks(df.index)
ax.set_yticklabels(df.manufacturer.str.title(), fontdict={'horizontalalignment': 'right'}, alpha=0.7)
ax.set_xlim(1, 40)
plt.xticks(alpha=0.7)
plt.gca().spines["top"].set_visible(False)
plt.gca().spines["bottom"].set_visible(False)
plt.gca().spines["right"].set_visible(False)
plt.gca().spines["left"].set_visible(False)
plt.grid(axis='both', alpha=.4, linewidth=.1)
plt.show()
This article is based on:
https://www.machinelearningplus.com/plots/top-50-matplotlib-visualizations-the-master-plots-python/