“Fight detection”, “sport exercises recognition”, “body VR”… References to these phrases have recently become more frequent in articles, scientific paper abstracts, and posts on LinkedIn. All of them build on one interesting problem in the field of Computer Vision: Pose Estimation.

The main goal of Pose Estimation is to detect keypoints of the human body.

(Photo taken from Pexels)

You can read more about the Pose Estimation tasks in other articles on Medium or in numerous resources on the Internet.


Metrics exist to evaluate machine learning models and to quantify how well an algorithm performs. For Pose Estimation tasks the logic is very simple: either a keypoint was found or it was not. The question is what counts as a “found” point. Let’s answer it below.

This article describes two metrics: Percentage of Detected Joints (PDJ) and Object Keypoint Similarity (OKS).

Percentage of Detected Joints (PDJ)

A detected joint is considered correct if the distance between the predicted and the true joint is within a certain fraction of the bounding box diagonal.

Using the PDJ metric implies that the accuracy of all joints is evaluated with the same error threshold.

Intuitive presentation of PDJ

Intuitive logic: there is a base element, a value that reflects the body size in the image (for example, a person’s height of 300 px). Take a small fraction of this base element, say 5% (300 px * 0.05 = 15 px). Draw a circle with a radius of 15 px centered on the true position of a keypoint (joint). If the predicted joint falls inside the circle, the keypoint counts as detected.

In the original implementation of the metric, the base element is the torso diameter. But when a person turns sideways in a 2D image, the torso diameter shrinks toward 0: the horizontal distance between the shoulder points is close to 0, and so is the distance between the right and left sides of the pelvis. The solution is to take the bounding box diagonal as the base element.
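To make the sideways problem concrete, here is a minimal sketch that compares the two base elements on a made-up side-view pose. The coordinates and the helper names (`bbox_diagonal`, `point_distance`) are assumptions for illustration, and shoulder width is used only as a simple stand-in for a torso-based measure.

```python
import math

def bbox_diagonal(keypoints):
    """Diagonal of the tight axis-aligned box around (x, y) keypoints."""
    xs = [x for x, _ in keypoints]
    ys = [y for _, y in keypoints]
    return math.hypot(max(xs) - min(xs), max(ys) - min(ys))

def point_distance(a, b):
    """Euclidean distance between two (x, y) points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

# Hypothetical side view: left/right shoulders and hips almost overlap in x.
left_shoulder, right_shoulder = (100, 50), (102, 50)
left_hip, right_hip = (101, 150), (103, 150)
ankle = (100, 300)
side_view = [left_shoulder, right_shoulder, left_hip, right_hip, ankle]

shoulder_width = point_distance(left_shoulder, right_shoulder)
diagonal = bbox_diagonal(side_view)

print(f"shoulder width: {shoulder_width:.1f} px")
print(f"bbox diagonal:  {diagonal:.1f} px")
```

With a base element of only a couple of pixels, a 5% threshold becomes a fraction of a pixel and almost every prediction is rejected, while the diagonal still reflects the person's size.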

  • d_i: the Euclidean distance between the ground truth keypoint and the predicted keypoint;
  • bool(condition): a function that returns 1 if the condition is true and 0 if it is false;
  • n: the number of keypoints in the image.
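Read together with the description above, these symbols suggest PDJ = (1/n) * sum over i of bool(d_i < fraction * diagonal). The sketch below is one way to compute it under that reading. The function name `pdj`, the default fraction of 0.05, and the sample coordinates are assumptions for illustration.

```python
import math

def pdj(pred, truth, bbox_diagonal, fraction=0.05):
    """Share of keypoints whose error is below fraction * bbox_diagonal."""
    if len(pred) != len(truth) or not truth:
        raise ValueError("pred and truth must be non-empty and the same length")
    threshold = fraction * bbox_diagonal
    # Each comparison yields True/False, which sum() counts as 1/0.
    detected = sum(
        math.hypot(px - tx, py - ty) < threshold
        for (px, py), (tx, ty) in zip(pred, truth)
    )
    return detected / len(truth)

# Hypothetical keypoints; errors are 5 px, 30 px and 10 px.
truth = [(100, 50), (120, 200), (80, 300)]
pred = [(103, 54), (150, 200), (80, 310)]

# With a 300 px diagonal the threshold is about 15 px,
# so the second keypoint is the only one that is missed.
score = pdj(pred, truth, bbox_diagonal=300)
print(score)
```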

Object Keypoint Similarity (OKS)

“It is calculated from the distance between predicted points and ground truth points normalized by the scale of the person. Scale and Keypoint constant needed to equalize the importance of each keypoint: neck location more precise than hip location.” — http://cocodataset.org/

  • d_i: the Euclidean distance between the ground truth keypoint and the predicted keypoint;
  • s (scale): the square root of the object segment area;
  • k_i: a per-keypoint constant that controls falloff.
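In the COCO definition these symbols combine as OKS = sum over i of exp(-d_i^2 / (2 * s^2 * k_i^2)), averaged over the labeled keypoints. Below is a minimal sketch of that calculation. The function name `oks`, the `visible` argument, and the k values are assumptions for illustration; the k values are not the published COCO constants.

```python
import math

def oks(pred, truth, area, k, visible=None):
    """Mean of exp(-d_i^2 / (2 * s^2 * k_i^2)) over labeled keypoints, with s^2 = area."""
    if visible is None:
        visible = [True] * len(truth)
    similarities = []
    for (px, py), (tx, ty), k_i, is_labeled in zip(pred, truth, k, visible):
        if not is_labeled:
            continue  # unlabeled keypoints do not enter the average
        d_squared = (px - tx) ** 2 + (py - ty) ** 2
        similarities.append(math.exp(-d_squared / (2 * area * k_i ** 2)))
    return sum(similarities) / len(similarities) if similarities else 0.0

# Hypothetical object: segment area 10000 px^2, so s = 100 px.
area = 10000
k = [0.05, 0.2]  # illustrative constants: a tight keypoint and a looser one
truth = [(100, 50), (120, 200)]
close = [(102, 50), (125, 200)]
far = [(1100, 50), (1120, 200)]

print(oks(truth, truth, area, k))  # identical points
print(oks(close, truth, area, k))  # small errors
print(oks(far, truth, area, k))    # errors far beyond s * k_i
```

The same pixel error costs more on a keypoint with a small k_i than on one with a large k_i, which is how the metric treats some joints more strictly than others.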

Constants for joints, calculated by the group of researchers from COCO.

The intuitive logic is the same as in the PDJ metric, but in this implementation each keypoint has its own coefficient (for shoulders and knees, the circles may be larger than for the nose or eyes). In addition, the scale is calculated from the object area (in PDJ it was the bounding box diagonal).

The OKS metric shows ONLY how close the predicted keypoint is to the true keypoint (a value from 0 to 1).

The second part of this metric is Average Precision with a threshold. Papers commonly report it at threshold values of 0.5 and 0.75. Compare the OKS value with the threshold: if OKS is greater than the threshold, the keypoint counts as detected. That’s it.
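The thresholding step can be sketched as follows. This is a simplification: it only counts how many OKS values exceed a threshold, whereas a full Average Precision calculation also ranks predictions by confidence and integrates precision over recall, which is left out here. The function name `fraction_above` and the OKS values are made up for illustration.

```python
def fraction_above(oks_values, threshold):
    """Share of OKS values that are strictly greater than the threshold."""
    if not oks_values:
        raise ValueError("oks_values must not be empty")
    return sum(value > threshold for value in oks_values) / len(oks_values)

# Hypothetical OKS values for four predictions.
oks_values = [0.92, 0.81, 0.64, 0.40]

at_50 = fraction_above(oks_values, 0.5)
at_75 = fraction_above(oks_values, 0.75)
print(at_50, at_75)
```

Raising the threshold from 0.5 to 0.75 demands tighter localization, so fewer predictions pass.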

The OKS metric is harder to compute than the PDJ one.

Perfect predictions will have OKS = 1, and predictions for which all keypoints are off by more than a few standard deviations s·k_i will have OKS ≈ 0.
