Pose Estimation and Action Detection with YOLOv5

1. Overview of Pose Estimation and Action Detection

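Action recognition is the task of identifying the actions people perform in a video. It is challenging because it is affected by many factors, including varying lighting conditions, diverse viewpoints, cluttered backgrounds, and large intra-class variation.^[1]

Research on action recognition dates back to 1973, when Johansson observed experimentally that human motion can be described by the movement of a few major joints^[2]. Consequently, combining and tracking just 10-12 key joints is enough to characterize many actions, such as dancing, walking, and running, so actions can be recognized from the motion of human keypoints.

Pose estimation is a computer vision technique that detects human figures in images and video and determines where each body part appears in the image. In other words, it is the problem of localizing human joints in images and video, and it can also be viewed as searching for a specific pose in the space of all joint configurations. In short, the task of pose estimation is to reconstruct a person's joints and limbs.^[3]

Pose estimation outputs a high-dimensional pose vector that gives the positions of the joints, that is, the localization of a whole set of keypoints. It separates the human foreground from the image background and then reconstructs the person's joints and limbs, which serve as the input to action recognition for identifying actions such as running and jumping.

When we work from pose estimation results, action recognition can be treated as a typical classification problem. Pose estimation yields the positions of the keypoints in the image; these can all be normalized and then fed to a popular classifier to classify the action.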

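2. Pose Estimation Methods

Human pose estimation methods currently fall into two families, top-down and bottom-up. Unlike object detection, keypoint detection algorithms, whether heatmap-based or detector-based, depend heavily on compute and have somewhat longer inference times. In 2022, keypoint detectors built on a YOLO baseline appeared.^[4]

YOLO-Pose (CVPRW 2022) and KAPAO (ECCV 2022) each proposed a novel heatmap-free approach based on the popular YOLO object detection framework^[4]^[5]. YOLO-style pose estimators neither run a detector followed by a second-stage step nor group keypoints from heatmaps. Although they regress keypoints directly, which is a brute-force approach, they have an advantage in processing speed.

Human pose estimation reduces to a single-class detector (for the person class). Each person has $17$ keypoints, and each keypoint is described by a location and a confidence, that is, $(x,\,y,\,conf)$. The $17$ keypoints therefore contribute $51$ elements per anchor. For each anchor, the keypoint head predicts $51$ elements and the box head predicts $6$ elements. For an anchor with $n$ keypoints, the full prediction vector is defined as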

$$
P_v = \{ C_x,\,C_y,\,W,\,H,\,\mathrm{box}_{conf},\,\mathrm{class}_{conf},\,K^1_x,\,K^1_y,\,K^1_{conf},\,\dots,\,K^n_x,\,K^n_y,\,K^n_{conf} \}
$$
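As a rough illustration of this layout, the sketch below splits one flat prediction vector into its box, confidence, and keypoint parts. The function name `split_prediction` and the sample values are hypothetical, and the code assumes the element order given in $P_v$ with $n = 17$.

```python
import numpy as np

NUM_KEYPOINTS = 17  # COCO person keypoints, as stated in the text


def split_prediction(p_v: np.ndarray, num_kpts: int = NUM_KEYPOINTS):
    """Split one flat prediction vector into box, scores and keypoints.

    Hypothetical helper: assumes the order
    [Cx, Cy, W, H, box_conf, class_conf, K1x, K1y, K1conf, ...].
    """
    expected = 6 + 3 * num_kpts
    if p_v.shape != (expected,):
        raise ValueError(f"expected {expected} elements, got {p_v.shape}")
    box = p_v[0:4]  # Cx, Cy, W, H
    box_conf = float(p_v[4])
    class_conf = float(p_v[5])
    keypoints = p_v[6:].reshape(num_kpts, 3)  # rows of (x, y, conf)
    return box, box_conf, class_conf, keypoints


# Made-up vector: 6 box elements followed by 17 * 3 keypoint elements.
example = np.arange(57, dtype=np.float32)
box, box_conf, class_conf, keypoints = split_prediction(example)
print(box.shape, keypoints.shape)  # (4,) (17, 3)
```

With $17$ keypoints the vector has $6 + 3 \times 17 = 57$ elements, which is why the helper rejects any other length.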

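YOLO-Pose uses the keypoint labels of MS COCO 2017. Each line in the label files is the pose annotation of one person. The first value is always $0$, meaning the class is person. The next four values are the normalized box $x$, $y$, width and height, followed by the positions of the $17$ keypoints. Each keypoint is an array of length $3$: the first and second elements are the normalized $x$ and $y$ coordinates, and the third is a flag $v$. $v = 0$ means the keypoint is not labeled (in which case $x = y = v = 0$), $v = 1$ means it is labeled but not visible (occluded), and $v = 2$ means it is labeled and visible.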
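To make the label layout concrete, here is a minimal sketch that parses one such line. The helper name `parse_pose_label` and the sample line are made up for illustration; the code assumes whitespace-separated values in the order described above.

```python
import numpy as np


def parse_pose_label(line: str, num_kpts: int = 17):
    """Parse one label line: class, normalized box, then (x, y, v) triplets."""
    values = [float(v) for v in line.split()]
    expected = 1 + 4 + 3 * num_kpts
    if len(values) != expected:
        raise ValueError(f"expected {expected} values, got {len(values)}")
    class_id = int(values[0])
    box = np.array(values[1:5])  # x, y, width, height (normalized)
    keypoints = np.array(values[5:]).reshape(num_kpts, 3)
    visibility = keypoints[:, 2].astype(int)  # 0, 1 or 2 per keypoint
    return class_id, box, keypoints, visibility


# Made-up line: the first keypoint is visible (v=2), the second is occluded
# (v=1), and the remaining 15 are unlabeled (x = y = v = 0).
sample = "0 0.5 0.5 0.2 0.6 0.48 0.25 2 0.52 0.25 1 " + " ".join(["0 0 0"] * 15)
class_id, box, keypoints, visibility = parse_pose_label(sample)
labeled = int((visibility > 0).sum())
print(class_id, labeled)  # 0 2
```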

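The network outputs a $P_v$ for each anchor. As usual for YOLO, non-maximum suppression is applied to obtain the final output, so for each detected person we end up with a bounding box and keypoint information. The action classifier below uses the normalized keypoint values: we take the person's bounding box and normalize the $17$ keypoints against it, which gives $51$-dimensional training data.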
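Non-maximum suppression can be sketched in a few lines of NumPy. This is a generic greedy IoU-based version, not the exact routine used by any particular YOLO implementation; the function name `nms`, the `(x1, y1, x2, y2)` box format, and the threshold value are assumptions.

```python
import numpy as np


def nms(boxes: np.ndarray, scores: np.ndarray, iou_threshold: float = 0.45):
    """Greedy NMS over (x1, y1, x2, y2) boxes; returns the kept indices."""
    x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]  # highest score first
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        # Intersection of the best box with every remaining box.
        inter_w = np.maximum(
            0.0, np.minimum(x2[best], x2[rest]) - np.maximum(x1[best], x1[rest])
        )
        inter_h = np.maximum(
            0.0, np.minimum(y2[best], y2[rest]) - np.maximum(y1[best], y1[rest])
        )
        inter = inter_w * inter_h
        iou = inter / (areas[best] + areas[rest] - inter)
        # Drop boxes that overlap the best box too much, keep the others.
        order = rest[iou <= iou_threshold]
    return keep


# Two heavily overlapping boxes and one separate box (made-up values).
boxes = np.array(
    [[10, 10, 110, 210], [12, 14, 112, 214], [300, 50, 380, 220]],
    dtype=np.float32,
)
scores = np.array([0.9, 0.8, 0.7], dtype=np.float32)
kept = nms(boxes, scores)
print(kept)  # [0, 2]
```

In a pose model, the keypoints of each kept detection would simply be carried along with its box.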

YOLOv7 Pose^[6] and YOLOv8^[7] now both implement this algorithm and provide pretrained models. Corresponding code examples will be provided later.

2. Pose Estimation with YOLOv5

Download the pretrained ONNX model; this gives the file yolov5s6_pose_640_ti_lite_54p9_82p2.onnx.

Below we run inference with ONNX Runtime.


import os

import cv2
import numpy as np
import onnxruntime

_CLASS_COLOR_MAP = [
    (0, 0, 255),  # Person (blue).
    (255, 0, 0),  # Bear (red).
    (0, 255, 0),  # Tree (lime).
    (255, 0, 255),  # Bird (fuchsia).
    (0, 255, 255),  # Sky (aqua).
    (255, 255, 0),  # Cat (yellow).
]
palette = np.array(
    [
        [255, 128, 0],
        [255, 153, 51],
        [255, 178, 102],
        [230, 230, 0],
        [255, 153, 255],
        [153, 204, 255],
        [255, 102, 255],
        [255, 51, 255],
        [102, 178, 255],
        [51, 153, 255],
        [255, 153, 153],
        [255, 102, 102],
        [255, 51, 51],
        [153, 255, 153],
        [102, 255, 102],
        [51, 255, 51],
        [0, 255, 0],
        [0, 0, 255],
        [255, 0, 0],
        [255, 255, 255],
    ]
)

skeleton = [
    [16, 14],
    [14, 12],
    [17, 15],
    [15, 13],
    [12, 13],
    [6, 12],
    [7, 13],
    [6, 7],
    [6, 8],
    [7, 9],
    [8, 10],
    [9, 11],
    [2, 3],
    [1, 2],
    [1, 3],
    [2, 4],
    [3, 5],
    [4, 6],
    [5, 7],
]

pose_limb_color = palette[
    [9, 9, 9, 9, 7, 7, 7, 0, 0, 0, 0, 0, 16, 16, 16, 16, 16, 16, 16]
]
pose_kpt_color = palette[[16, 16, 16, 16, 16, 0, 0, 0, 0, 0, 0, 9, 9, 9, 9, 9, 9]]
radius = 5

_cache_session = None


def preprocess_image(img: np.ndarray, img_mean=0, img_scale=1 / 255):
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    img = cv2.resize(img, (640, 640), interpolation=cv2.INTER_LINEAR)
    img = (img - img_mean) * img_scale
    img = np.asarray(img, dtype=np.float32)
    img = np.expand_dims(img, 0)
    img = img.transpose(0, 3, 1, 2)
    return img


def model_inference(
    model_path="./yolov5s6_pose_640_ti_lite_54p9_82p2.onnx", input=None
):
    global _cache_session
    if _cache_session is None:
        _cache_session = onnxruntime.InferenceSession(model_path, None)
    input_name = _cache_session.get_inputs()[0].name
    output = _cache_session.run([], {input_name: input})
    return output


def post_process(img: np.ndarray, output: np.ndarray, score_threshold=0.3):
    h, w, _ = img.shape
    img = cv2.resize(img, (640, 640), interpolation=cv2.INTER_LINEAR)
    det_bboxes, det_scores, det_labels, kpts = (
        output[:, 0:4],
        output[:, 4],
        output[:, 5],
        output[:, 6:],
    )
    for idx in range(len(det_bboxes)):
        det_bbox = det_bboxes[idx]
        kpt = kpts[idx]
        # print(det_labels[idx], kpt, det_bbox)
        if det_scores[idx] > score_threshold:
            color_map = _CLASS_COLOR_MAP[int(det_labels[idx])]
            img = cv2.rectangle(
                img,
                (int(det_bbox[0]), int(det_bbox[1])),
                (int(det_bbox[2]), int(det_bbox[3])),
                color_map[::-1],
                2,
            )
            cv2.putText(
                img,
                "id:{}".format(int(det_labels[idx])),
                (int(det_bbox[0] + 5), int(det_bbox[1]) + 15),
                cv2.FONT_HERSHEY_SIMPLEX,
                0.5,
                color_map[::-1],
                2,
            )
            cv2.putText(
                img,
                "score:{:2.1f}".format(det_scores[idx]),
                (int(det_bbox[0] + 5), int(det_bbox[1]) + 30),
                cv2.FONT_HERSHEY_SIMPLEX,
                0.5,
                color_map[::-1],
                2,
            )
            plot_skeleton_kpts(img, kpt)
    img = cv2.resize(img, (w, h), interpolation=cv2.INTER_LINEAR)
    return img, kpts


def plot_skeleton_kpts(img: np.ndarray, kpts, steps=3):
    num_kpts = len(kpts) // steps
    # plot keypoints
    for kid in range(num_kpts):
        r, g, b = pose_kpt_color[kid]
        x_coord, y_coord = kpts[steps * kid], kpts[steps * kid + 1]
        conf = kpts[steps * kid + 2]
        if conf > 0.5:  # Confidence of a keypoint has to be greater than 0.5
            cv2.circle(
                img, (int(x_coord), int(y_coord)), radius, (int(r), int(g), int(b)), -1
            )
    # plot skeleton
    for sk_id, sk in enumerate(skeleton):
        r, g, b = pose_limb_color[sk_id]
        pos1 = (int(kpts[(sk[0] - 1) * steps]), int(kpts[(sk[0] - 1) * steps + 1]))
        pos2 = (int(kpts[(sk[1] - 1) * steps]), int(kpts[(sk[1] - 1) * steps + 1]))
        conf1 = kpts[(sk[0] - 1) * steps + 2]
        conf2 = kpts[(sk[1] - 1) * steps + 2]
        if (
            conf1 > 0.5 and conf2 > 0.5
        ):  # For a limb, both the keypoint confidence must be greater than 0.5
            cv2.line(img, pos1, pos2, (int(r), int(g), int(b)), thickness=2)


def infer_video(video_path: str | int):
    cap = cv2.VideoCapture(video_path)
    if not cap.isOpened():
        print("Error opening video stream or file")
        return
    while cap.isOpened():
        ret, frame = cap.read()
        if ret:
            img = preprocess_image(frame)
            output = model_inference(input=img)[0]
            res, kpts = post_process(frame, output)
            cv2.imshow("frame", res)
            if cv2.waitKey(1) & 0xFF == ord("q"):
                break
        else:
            break
    cap.release()
    cv2.destroyAllWindows()


def build_train_data():
    import pandas as pd

    cols = []
    for p in range(1, 18):
        cols.append("x{}".format(p))
        cols.append("y{}".format(p))
        cols.append("c{}".format(p))
    data = pd.DataFrame(columns=cols)
    i = 0
    data_path = "train"
    for f in os.listdir(f"./data/{data_path}"):
        img_src = cv2.imread(f"./data/{data_path}/{f}")
        img = preprocess_image(img_src)
        output = model_inference(input=img)[0]
        res, kpts = post_process(img_src, output)
        if kpts.size > 0:
            data.loc[i] = kpts[0]  # type: ignore
            i += 1
        cv2.imshow("frame", res)
        if cv2.waitKey(1) & 0xFF == ord("q"):
            return None
    data.to_csv(f"./data/{data_path}.csv", index=False)


def main():
    infer_video(0)
    # build_train_data()


if __name__ == "__main__":
    main()

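To normalize each keypoint against its detection box, add code like the following to the post_process function, after the drawing loop so that the skeleton is still drawn in pixel coordinates:

if kpts.size > 0:
    # Normalize the keypoints of the first detection by its bounding box.
    x1, y1, x2, y2 = map(int, det_bboxes[0])
    # Use separate names here: w and h already hold the original frame size,
    # which post_process still needs for its final resize.
    box_w, box_h = max(x2 - x1, 1), max(y2 - y1, 1)
    kpts[0, 0::3] = (kpts[0, 0::3] - x1) / box_w
    kpts[0, 1::3] = (kpts[0, 1::3] - y1) / box_h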

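To run inference on every file in a folder, modify the build_train_data function; it writes the results to a CSV file.

3. Action Classification

With keypoint data in hand, we can classify actions. One option is the Kaggle yoga pose dataset, which contains 5 different yoga poses with 100-200 samples per pose^[8]. An SVM classifier can be used to classify this data.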

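The SVM classifier workflow is as follows:

Collect the training dataset: gather a set of labeled training samples, where each sample has a label indicating its class.
Feature extraction: extract a feature vector from each sample to describe it.
Standardization: standardize the feature vectors so that they are on the same numeric scale.
Find the optimal hyperplane: solve an optimization problem to find the hyperplane that separates the classes with the largest margin on both sides.
Kernel selection: if the dataset is not linearly separable, use a kernel function to map the data into a higher-dimensional space where it becomes linearly separable.
Parameter tuning: choose suitable parameters, such as the regularization parameter and kernel parameters, to improve classification performance.
Model evaluation: evaluate the model on a test dataset to check how well it generalizes.
Apply the model: use the trained model to classify new, unseen data.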
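
The workflow above maps fairly directly onto scikit-learn. The sketch below is only an illustration on synthetic data: the two Gaussian clusters stand in for real keypoint features, and the parameter values are arbitrary assumptions, not tuned settings.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Synthetic stand-in for 51-dimensional keypoint features, two classes.
X = np.vstack([rng.normal(0.0, 1.0, (100, 51)), rng.normal(3.0, 1.0, (100, 51))])
y = np.array([0] * 100 + [1] * 100)

# Hold out 20% of the samples for evaluation, keeping the class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Standardization and the RBF-kernel SVM are fitted together, so the scaler
# only ever sees the training split.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
print(f"test accuracy: {accuracy:.2f}")
```

Wrapping the scaler and the classifier in one pipeline keeps test data out of the standardization statistics, which is easy to get wrong when the two steps are run by hand.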

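Based on the workflow above, we designed the following training pipeline:

Image frames captured by the camera are preprocessed and passed through the YOLOv7-Pose network to obtain the keypoint data for each image. The data is then reduced to a lower dimension and used to train an SVM classifier, which performs the action detection and determines the specific action.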
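
The text does not say which dimensionality reduction method is used. As one possible reading, the sketch below assumes PCA and places it in front of the SVM; the number of components and the synthetic data are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(1)
# Synthetic stand-in for keypoint vectors (51 values per frame), two actions.
features = np.vstack(
    [rng.normal(0.0, 1.0, (80, 51)), rng.normal(2.0, 1.0, (80, 51))]
)
labels = np.array([0] * 80 + [1] * 80)

# Assumed choice: standardize, then keep 10 principal components.
reducer = make_pipeline(StandardScaler(), PCA(n_components=10))
reduced = reducer.fit_transform(features)
print(reduced.shape)  # (160, 10)

# Train the SVM on the reduced features.
classifier = SVC(kernel="rbf").fit(reduced, labels)
train_accuracy = classifier.score(reduced, labels)
```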

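Below is a binary classification example that separates images of people who have fallen from images of people who have not. Preprocessing is similar: the keypoints need to be normalized against the detection box (see the code above), and the data is saved as CSV files. The example uses grid search to find the best parameters, plots the confusion matrix after training, and prints accuracy, precision, recall and F1 score.

SVM classifier example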

import joblib
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Data
data_fall = pd.read_csv("data/fall.csv")
data_nofall = pd.read_csv("data/nofall.csv")

# Data Preprocessing
data_fall["label"] = 1
data_nofall["label"] = 0
data = pd.concat([data_fall, data_nofall], ignore_index=True)
data = data.dropna()
data = data.sample(frac=1).reset_index(drop=True)
data = data.astype("float64")

# Split data
X = data.drop("label", axis=1)
y = data["label"]
X_train = X[: int(len(X) * 0.8)]
X_test = X[int(len(X) * 0.8) :]
y_train = y[: int(len(y) * 0.8)]
y_test = y[int(len(y) * 0.8) :]
print(X, y)

# SVM
svm = SVC(kernel="rbf", C=1000, gamma=0.001)
svm.fit(X_train, y_train)
y_pred = svm.predict(X_test)

# Evaluation
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

# Plot
sns.heatmap(confusion_matrix(y_test, y_pred), annot=True, cmap="Blues")
plt.show()

# Grid Search
param_grid = {
    "C": [0.1, 1, 10, 100, 1000],
    "gamma": [1, 0.1, 0.01, 0.001, 0.0001],
    "kernel": ["rbf", "poly", "sigmoid", "linear"],
}
grid = GridSearchCV(SVC(), param_grid, refit=True)
grid.fit(X_train, y_train)
print(grid.best_params_)
print(grid.best_estimator_)
grid_predictions = grid.predict(X_test)
print(confusion_matrix(y_test, grid_predictions))
print(classification_report(y_test, grid_predictions))

# Plot
sns.heatmap(confusion_matrix(y_test, grid_predictions), annot=True, cmap="Blues")
plt.show()

# Save model
joblib.dump(grid, "model/svm.pkl")
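
Once saved, the model can be loaded again for inference. The sketch below shows the round trip with a small stand-in classifier written to a temporary directory; the toy data and the temporary path are assumptions. In the script above the saved object is the fitted GridSearchCV instance, which exposes the same predict method.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.svm import SVC

# Toy stand-in for the trained classifier: 51 features, two classes.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.1, (20, 51)), rng.normal(1.0, 0.1, (20, 51))])
y = np.array([0] * 20 + [1] * 20)
clf = SVC(kernel="rbf").fit(X, y)

# Save and reload the model inside a temporary directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "svm.pkl")
    joblib.dump(clf, model_path)
    restored = joblib.load(model_path)

# One new 51-dimensional keypoint vector, shaped as a batch of size 1.
new_sample = np.full((1, 51), 1.0)
prediction = int(restored.predict(new_sample)[0])
print(prediction)
```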

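In this test there were 116 samples per class, split 80% for training and 20% for testing. The final results are as follows.

Best parameters:

| C | gamma | kernel |
| --- | --- | --- |
| 10 | 1 | `rbf` |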

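| Class | precision | recall | f1-score | support |
| --- | --- | --- | --- | --- |
| 0.0 | 0.91 | 0.95 | 0.93 | 21 |
| 1.0 | 0.96 | 0.92 | 0.94 | 26 |
| accuracy | - | - | 0.94 | 47 |
| macro avg | 0.93 | 0.94 | 0.94 | 47 |
| weighted avg | 0.94 | 0.94 | 0.94 | 47 |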
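
The figures in the report can be cross-checked by hand. The confusion matrix counts below are not taken from the source; they are inferred as the integer counts consistent with the reported supports (21 and 26) and recalls, so treat them as an assumption.

```python
# Assumed counts, inferred from the report: rows are true classes 0 and 1,
# columns are predicted classes 0 and 1.
tn, fp = 20, 1  # true class 0 (support 21)
fn, tp = 2, 24  # true class 1 (support 26)

# Class 1 (fall) treated as the positive class.
precision_1 = tp / (tp + fp)
recall_1 = tp / (tp + fn)
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

# Class 0 (no fall) treated as the positive class.
precision_0 = tn / (tn + fn)
recall_0 = tn / (tn + fp)
f1_0 = 2 * precision_0 * recall_0 / (precision_0 + recall_0)

accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"class 0: {precision_0:.2f} {recall_0:.2f} {f1_0:.2f}")  # 0.91 0.95 0.93
print(f"class 1: {precision_1:.2f} {recall_1:.2f} {f1_1:.2f}")  # 0.96 0.92 0.94
print(f"accuracy: {accuracy:.2f}")  # 0.94
```

Under these assumed counts, every per-class value and the overall accuracy round to the numbers in the table.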

Confusion matrix:

[Figure: confusion matrix]

[1] Understanding General Action Recognition: Overview and Categories, https://zhuanlan.zhihu.com/p/103566134
[2] Johansson, G. Visual perception of biological motion and a model for its analysis. Perception & Psychophysics 14, 201–211 (1973). https://doi.org/10.3758/BF03212378
[3] The Difference Between Pose Estimation and Action Recognition (Action Detection, Action Classification), https://cloud.tencent.com/developer/article/2029260
[4] YOLO-Pose: Enhancing YOLO for Multi Person Pose Estimation Using Object Keypoint Similarity Loss, https://arxiv.org/abs/2204.06806
[5] Rethinking Keypoint Representations: Modeling Keypoints and Poses as Objects for Multi-Person Human Pose Estimation, https://arxiv.org/abs/2111.08557
[6] YOLOv7-Pose, GitHub, https://github.com/WongKinYiu/yolov7/tree/pose
[7] YOLOv8, GitHub, https://github.com/ultralytics/ultralytics
[8] Human Pose Estimation with Machine Learning in Python, Deep Learning and Computer Vision (WeChat official account), https://mp.weixin.qq.com/s/D_sTpTp_pkLeO2nrcjgpaA
