The core challenge of 3D cuboid annotation is that depth must be inferred, not measured. Untrained annotators produce inconsistent estimates that introduce systematic training noise.
When you outsource 3D cuboid annotation services to a labeling provider without perspective geometry expertise, you get cuboids that look correct in 2D but misrepresent real-world object dimensions. HabileData is an annotation company where every cuboid annotator completes calibration covering perspective projection, camera coordinate systems, and standard vehicle dimension priors.
Distance, dimensions, and heading – all inferred from a single 2D dashcam frame
An annotator placing a 3D cuboid around a vehicle in a dashcam image must estimate the vehicle’s distance from the camera, its real-world width, length, and height, and its heading direction relative to the camera coordinate system. This estimation relies on perspective geometry – apparent size, vanishing point convergence, and vertical image position.
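The apparent-size cue described above reduces to the pinhole projection relation Z = f·H/h. The sketch below illustrates it; the focal length and the 1.5 m sedan-height prior are illustrative values, not figures from any specific project guideline.

```python
# Depth from apparent size under the pinhole camera model: Z = f * H / h.
# focal_px and the sedan-height prior here are illustrative assumptions.

def depth_from_apparent_height(focal_px: float,
                               real_height_m: float,
                               pixel_height: float) -> float:
    """Estimate object distance from its known real-world height and
    its apparent height in pixels (pinhole projection)."""
    if pixel_height <= 0:
        raise ValueError("pixel height must be positive")
    return focal_px * real_height_m / pixel_height

# A sedan (~1.5 m tall) spanning 75 px under a 1000 px focal length
# projects to roughly 20 m away.
z = depth_from_apparent_height(1000.0, 1.5, 75.0)
```

This is why dimension priors matter: with a known class height, apparent pixel height becomes a direct depth anchor.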
Sedan 4.5m, SUV 4.8m – annotators calibrated on vehicle dimension priors
Every annotator completes calibration covering perspective geometry, camera coordinate frames, and standard vehicle dimension priors – sedan: approximately 4.5m × 1.8m × 1.5m; SUV: approximately 4.8m × 1.9m × 1.7m. We use CVAT 3D and Scale AI’s 3D interface, both of which render the projected cuboid onto the 2D image in real time as parameters are adjusted.
Yaw conventions, truncation rules, and coordinate system – specified per project
Our annotation guideline specifies canonical object dimensions as depth estimation anchors, explicit yaw rotation conventions per class, truncation and occlusion labeling rules, and the coordinate system standard – camera-centric, ego-vehicle-centric, or world-coordinate – required by the target model. No ambiguity is left to individual annotator judgment.
Nine parameters per cuboid – validated through three-stage QA before delivery
Every annotated cuboid includes nine parameters: 3D centre position (x, y, z), three dimensions (length, width, height), and three rotation angles (yaw, pitch, roll). Each annotation is validated against the project guideline through our three-stage QA workflow – no cuboid is delivered without passing all three validation stages.
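The nine parameters above can be sketched as a simple data structure, together with a bird's-eye-view corner computation that shows how centre, footprint, and yaw combine. Field names and the x/z ground-plane convention are illustrative, not a delivery schema.

```python
from dataclasses import dataclass
import math

@dataclass
class Cuboid3D:
    """Nine-parameter 3D cuboid (illustrative field layout)."""
    x: float; y: float; z: float                 # 3D centre position
    length: float; width: float; height: float   # dimensions (m)
    yaw: float; pitch: float; roll: float        # rotation angles (rad)

def bev_corners(c: Cuboid3D) -> list:
    """Four top-down (x, z) footprint corners from centre, length/width,
    and yaw. Pitch and roll are ignored in this bird's-eye-view sketch."""
    cy, sy = math.cos(c.yaw), math.sin(c.yaw)
    offsets = [( c.length / 2,  c.width / 2), ( c.length / 2, -c.width / 2),
               (-c.length / 2, -c.width / 2), (-c.length / 2,  c.width / 2)]
    return [(c.x + dl * cy - dw * sy, c.z + dl * sy + dw * cy)
            for dl, dw in offsets]

# A sedan-sized cuboid centred 20 m ahead with zero yaw:
corners = bev_corners(Cuboid3D(10.0, 0.0, 20.0, 4.5, 1.8, 1.5, 0.0, 0.0, 0.0))
```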
We deliver the full range of annotation capabilities for this technique – each configured to your specific ML framework and output requirements.
3D cuboids placed around all object class instances in the point cloud or image sequence with precise length, width, height, and heading direction encoding. Used to train one-stage detectors (PointPillars, CenterPoint) and two-stage detectors (PointRCNN, PV-RCNN) for AV perception.
High-precision dimensional annotation defining exact physical length, width, and height of objects in metres, referenced to the sensor coordinate frame. Critical for distance estimation models and collision avoidance systems that use physical size as a prior in their detection pipeline.
3D pose annotation for objects that have multiple valid orientations – articulated robots, construction equipment, cargo containers in varying loading positions. Combines cuboid placement with additional orientation parameters beyond the standard heading direction.
Partially obscured objects annotated with estimated full dimensions and occlusion severity flag. Severely occluded objects annotated with truncation flag. Our written occlusion protocol ensures consistent handling across all annotators, which is essential for training models that must detect partially hidden objects reliably.
We contractually apply a ±2° heading direction accuracy standard on all AV safety-critical 3D cuboid projects. This standard separates annotation that is safe to use in L2–L4 perception model training from annotation that introduces systematic trajectory prediction errors.
We annotate in camera, LiDAR, and world coordinate systems, and handle coordinate transformations between them using your extrinsic calibration parameters. Many providers annotate in one coordinate system only — requiring you to perform transformations that introduce additional error.
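A coordinate transformation with extrinsic calibration parameters is a rigid 4×4 homogeneous transform. The sketch below uses a made-up LiDAR-to-camera matrix (a 90° axis remap typical of LiDAR-to-camera conventions); real extrinsics come from your calibration.

```python
import numpy as np

def transform_point(T: np.ndarray, p: np.ndarray) -> np.ndarray:
    """Apply a 4x4 rigid transform to a 3D point via homogeneous coords."""
    ph = np.append(p, 1.0)
    return (T @ ph)[:3]

# Illustrative extrinsic matrix, NOT a real calibration:
T_cam_lidar = np.array([
    [0., -1.,  0., 0.0],   # camera x = -lidar y (right)
    [0.,  0., -1., 0.2],   # camera y = -lidar z (down), 0.2 m offset
    [1.,  0.,  0., 0.0],   # camera z =  lidar x (forward)
    [0.,  0.,  0., 1.0],
])

p_lidar = np.array([10.0, 2.0, -1.0])   # 10 m ahead, 2 m left, 1 m down
p_cam = transform_point(T_cam_lidar, p_lidar)
```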
For camera-LiDAR fusion projects, we annotate both modalities simultaneously with consistent object IDs and dimensions. Annotating them separately and attempting to align afterwards is a common source of cross-modal inconsistency that degrades fusion model performance.
Our written occlusion handling protocol ensures every annotator applies the same three-level classification (visible, partially occluded, severely occluded) consistently. This consistency allows your training pipeline to apply appropriate loss weighting per occlusion level.
We deliver in KITTI, nuScenes, Waymo, and OpenPCDet-compatible formats. If your training framework requires a custom format, provide the specification and we configure export before the project begins.

3D cuboid annotation places a three-dimensional bounding box on an object in a 2D image, capturing the object’s position in 3D space (x, y, z coordinates, including depth from camera), its physical dimensions (length, width, height), and its 3D orientation (yaw, pitch, roll rotation angles). A standard 2D bounding box captures only the object’s 2D location (x, y, width, height) in image pixel coordinates with no depth or orientation information. 3D cuboid annotation provides nine parameters per object compared to four for a 2D bounding box, enabling training data for models that need to estimate object distance, real-world size, and heading direction from camera images alone.
LiDAR point cloud annotation labels actual 3D sensor measurements (x, y, z point coordinates measured by a LiDAR sensor) with 3D bounding boxes and semantic class labels. The depth is directly measured, not estimated. 3D cuboid annotation works on 2D camera images and infers 3D geometry from perspective cues. LiDAR annotation is more geometrically accurate because depth is measured, but it requires LiDAR sensor data. 3D cuboid annotation works with standard camera images and is the cost-effective choice when LiDAR hardware is not available or not justified. For the highest accuracy, teams often use both: LiDAR annotation as ground truth and camera-based 3D cuboids as the model input labels.
We deliver in KITTI 3D object detection format (text file with class, truncation, occlusion, alpha angle, 2D bbox, 3D dimensions, 3D location, rotation_y), nuScenes camera annotation format, COCO 3D extension JSON, Waymo Open Dataset camera format, ARKit and ARCore spatial anchor format, ROS geometry_msgs/PoseStamped, 8-vertex cuboid JSON (eight corner coordinates in 3D space), and fully custom schema. For sensor fusion datasets, we deliver paired annotations in the multi-modal format required by your training pipeline (nuScenes, Waymo, or Argoverse).
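To make the KITTI field layout concrete, here is a minimal parser for one label line. The sample line is illustrative, not from a real dataset; note that KITTI stores dimensions in height, width, length order.

```python
# Parse one line of a KITTI 3D-object label file into named fields.

def parse_kitti_label(line: str) -> dict:
    f = line.split()
    return {
        "type": f[0],
        "truncated": float(f[1]),                 # 0.0 .. 1.0
        "occluded": int(f[2]),                    # 0 .. 3
        "alpha": float(f[3]),                     # observation angle
        "bbox_2d": tuple(map(float, f[4:8])),     # left, top, right, bottom
        "dimensions": tuple(map(float, f[8:11])), # height, width, length (m)
        "location": tuple(map(float, f[11:14])),  # x, y, z in camera frame
        "rotation_y": float(f[14]),               # yaw around camera y-axis
    }

# Illustrative sample line (not real data):
sample = "Car 0.00 1 -1.58 587.0 173.0 614.0 200.0 1.50 1.80 4.50 2.0 1.6 32.0 -1.52"
label = parse_kitti_label(sample)
```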
Yes, with camera calibration parameters provided. Fisheye and wide-angle cameras introduce strong radial distortion that changes the perspective geometry used for depth inference. We apply camera-specific distortion correction and projection models to annotate 3D cuboids correctly in distorted image space, using the provided camera calibration matrix (intrinsic parameters and distortion coefficients) to maintain geometric accuracy. Stereo camera pairs with known baseline distance enable more accurate depth estimation than monocular images.
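The radial distortion correction mentioned above typically follows the Brown–Conrady model, where normalized image coordinates are scaled by a polynomial in the squared radius. This minimal sketch applies the forward model with two illustrative coefficients; real values come from the camera calibration.

```python
# Brown-Conrady radial distortion of normalized image coordinates.
# k1 and k2 are illustrative coefficients, not calibrated values.

def distort(xn: float, yn: float, k1: float, k2: float):
    """Apply radial distortion: points scale by (1 + k1*r^2 + k2*r^4)."""
    r2 = xn * xn + yn * yn
    scale = 1.0 + k1 * r2 + k2 * r2 * r2
    return xn * scale, yn * scale

# With a negative k1 (barrel distortion, typical of wide-angle lenses),
# points are pulled toward the image centre.
xd, yd = distort(0.1, 0.0, -0.2, 0.05)
```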
3D cuboid depth estimation from monocular 2D images is inherently less precise than LiDAR-measured depth because distance must be inferred from perspective cues rather than directly measured. Across our annotation team, we achieve ±15% depth estimation consistency on standard object classes (vehicles at 10–50m range), which is sufficient for training monocular depth estimation models. For applications requiring higher depth accuracy, we recommend stereo camera annotation (using disparity for depth calculation) or supplementing with LiDAR annotation for ground truth calibration.
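Stereo depth from disparity follows Z = f·B/d, which is why a known baseline tightens depth estimates compared to monocular inference. The focal length, baseline, and disparity below are illustrative values.

```python
# Depth from stereo disparity: Z = f * B / d. All values illustrative.

def stereo_depth(focal_px: float, baseline_m: float,
                 disparity_px: float) -> float:
    """Triangulated depth from a rectified stereo pair."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px

# f = 720 px, baseline = 0.54 m, disparity = 18 px -> about 21.6 m
z = stereo_depth(720.0, 0.54, 18.0)
```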
We annotate in camera coordinate system (origin at camera optical centre, Z-axis forward), ego-vehicle coordinate system (origin at rear axle or IMU, X-axis forward), world coordinate system (GPS or map-aligned global frame), and robot base frame (origin at robot mounting point). The coordinate system is specified in the annotation guideline before project start and validated during QA. Coordinate frame misalignment between annotation and model expectation is a common source of training failure, so we confirm this specification with your ML engineering team before annotation begins.
We accept 3D cuboid annotation projects starting from 1,000 images. For first-time clients, we offer a free pilot of 100-200 images so you can evaluate annotation quality, depth consistency, and yaw accuracy on your actual dataset before committing to full-scale production. There is no maximum project ceiling. We have annotated datasets exceeding 500,000 images with 3D cuboid labels for autonomous driving perception programmes.
Occluded objects (partially hidden behind another object) and truncated objects (partially outside the image frame) are annotated with the full estimated 3D cuboid extent, not just the visible portion. Each cuboid includes an occlusion level flag (0 = fully visible, 1 = partly occluded, 2 = largely occluded, 3 = unknown) and a truncation flag (0.0 = not truncated to 1.0 = fully truncated), following the KITTI convention. This enables models to learn to predict full object geometry even from partial visual evidence, which is critical for safety in autonomous driving applications.
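The KITTI-convention flags above map naturally onto an enum, which a training pipeline can use to filter or re-weight samples. The difficulty thresholds in the helper below are assumptions for illustration, not official KITTI difficulty-tier values.

```python
from enum import IntEnum

class Occlusion(IntEnum):
    """KITTI-convention occlusion levels, as described above."""
    FULLY_VISIBLE = 0
    PARTLY_OCCLUDED = 1
    LARGELY_OCCLUDED = 2
    UNKNOWN = 3

def is_hard_sample(occlusion: Occlusion, truncation: float) -> bool:
    """Illustrative 'hard sample' filter in the spirit of KITTI
    difficulty tiers; the thresholds here are assumptions."""
    return occlusion >= Occlusion.LARGELY_OCCLUDED or truncation > 0.5
```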