Detecting Vehicles in Videos with Faster R-CNN: A Step-by-Step Guide

Jan 3, 2025

Computer vision underpins applications from autonomous driving to traffic monitoring, and object detection in video is central to many of them. In this post, I’ll walk you through a project I recently completed: detecting vehicles in a video using a pre-trained Faster R-CNN model.

This post covers the project workflow, challenges faced, and how you can replicate it for your own use cases.

Project Overview

The goal of this project was to detect vehicles in a video, visualize the detections by drawing bounding boxes, and compile the processed frames back into a new video. Using PyTorch’s pre-trained Faster R-CNN model and OpenCV, I implemented a simple yet powerful pipeline.

Here’s what the pipeline looks like:

1. Extract frames from the video.
2. Detect objects in each frame using Faster R-CNN.
3. Visualize detections by overlaying bounding boxes and labels.
4. Recompile the processed frames into a video.

Why Faster R-CNN?

Faster R-CNN is a two-stage object detector known for its accuracy. A Region Proposal Network (RPN) first proposes candidate object regions, and a second stage classifies each region and refines its bounding box. I chose this model because torchvision ships it pre-trained on the COCO dataset, which covers a wide range of object classes, including vehicles such as cars, buses, and trucks.
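
For reference, the vehicle-related classes in the COCO label numbering used by torchvision's detection models can be written down as a small lookup. This is only a sketch: `VEHICLE_LABELS` and `is_vehicle` are hypothetical helpers, not part of torchvision, and which classes count as "vehicles" is a judgment call for your use case.

```python
# COCO category IDs as they appear in the 'labels' output of torchvision
# detection models. Treating all of these as "vehicles" is an assumption;
# trim the dict to fit your use case.
VEHICLE_LABELS = {
    2: "bicycle",
    3: "car",
    4: "motorcycle",
    5: "airplane",
    6: "bus",
    7: "train",
    8: "truck",
    9: "boat",
}

def is_vehicle(label):
    """Return True if a predicted label ID belongs to a vehicle class."""
    return int(label) in VEHICLE_LABELS

print(is_vehicle(3), is_vehicle(1))  # car vs. person: True False
```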

Step-by-Step Implementation

Step 1: Extract Frames from the Video

To process a video, I first converted it into individual frames. This was done using OpenCV, which provides a straightforward way to read and save frames from a video.


# Importing the required libraries
import os
import cv2
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn
from torchvision.transforms import functional as F
from PIL import Image

import warnings
warnings.filterwarnings('ignore')

# Extracting frames from the input video
def extract_frames(video_path, output_folder):
    os.makedirs(output_folder, exist_ok=True)
    video = cv2.VideoCapture(video_path)
    frame_count = 0
    
    while video.isOpened():
        success, frame = video.read()
        if not success:
            break
        frame_path = os.path.join(output_folder, f'frame_{frame_count}.jpg')
        cv2.imwrite(frame_path, frame)
        frame_count += 1
    
    video.release()
    return frame_count
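
One detail worth knowing about names like frame_0.jpg, frame_1.jpg, ..., frame_10.jpg: sorted alphabetically, frame_10.jpg comes before frame_2.jpg. A common workaround is to zero-pad the index. The sketch below illustrates this with a hypothetical helper, `padded_frame_name`; the six-digit width is an assumption that covers up to a million frames.

```python
def padded_frame_name(index, width=6):
    """Build a frame filename whose alphabetical order matches its numeric order."""
    return f"frame_{index:0{width}d}.jpg"

# Unpadded names sort alphabetically, not numerically.
unpadded = sorted(f"frame_{i}.jpg" for i in (1, 2, 10))
print(unpadded)  # ['frame_1.jpg', 'frame_10.jpg', 'frame_2.jpg']

# Zero-padded names sort in frame order.
padded = sorted(padded_frame_name(i) for i in (1, 2, 10))
print(padded)  # ['frame_000001.jpg', 'frame_000002.jpg', 'frame_000010.jpg']
```

If you adopt this naming in extract_frames, any later code that rebuilds a frame path from its index needs to use the same helper.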

Step 2: Load the Faster R-CNN Model

I used torchvision to load a pre-trained fasterrcnn_resnet50_fpn model. The model was set to evaluation mode and moved to the GPU (if available) for faster inference.

# weights="DEFAULT" loads the COCO-pretrained weights (torchvision 0.13+);
# on older versions, use pretrained=True instead
model = fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

Step 3: Detect Vehicles on Each Frame

For each frame, I passed the image through the model and extracted the bounding boxes, labels, and confidence scores. The function returns every detection; low-confidence ones are filtered out in the drawing step.

def detect_vehicles_on_frame(frame_path, model):
    image = Image.open(frame_path).convert("RGB")
    image_tensor = F.to_tensor(image).unsqueeze(0).to(device)
    
    with torch.no_grad():
        outputs = model(image_tensor)
    
    boxes = outputs[0]['boxes'].cpu().numpy()
    labels = outputs[0]['labels'].cpu().numpy()
    scores = outputs[0]['scores'].cpu().numpy()
    return boxes, labels, scores
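
Since the function above returns every detection the model produces, the filtering has to happen somewhere downstream. Here is a minimal sketch of that step, assuming a hypothetical helper `filter_detections` and an illustrative set of label IDs (3 = car, 6 = bus, 8 = truck in torchvision's COCO numbering). It works on plain lists as well as on the NumPy arrays returned above.

```python
def filter_detections(boxes, labels, scores, threshold=0.5, keep_labels=(3, 6, 8)):
    """Keep detections scoring above `threshold` whose label is in `keep_labels`."""
    kept = []
    for box, label, score in zip(boxes, labels, scores):
        if score > threshold and int(label) in keep_labels:
            kept.append((list(box), int(label), float(score)))
    return kept

# Hand-written example values, not real model output.
boxes = [[10, 20, 110, 90], [5, 5, 50, 60], [200, 40, 320, 130]]
labels = [3, 1, 8]           # car, person, truck
scores = [0.92, 0.88, 0.30]

# Only the car survives: the person is not in keep_labels and the
# truck scores below the 0.5 threshold.
print(filter_detections(boxes, labels, scores))
```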

Step 4: Draw Bounding Boxes

Using OpenCV, I visualized the results by overlaying bounding boxes and labels for detected vehicles.

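# COCO label IDs for road vehicles in torchvision's numbering. The full
# COCO_INSTANCE_CATEGORY_NAMES list is not shown in this post, so this
# smaller mapping is an assumption about which classes to draw.
VEHICLE_CLASSES = {2: 'bicycle', 3: 'car', 4: 'motorcycle', 6: 'bus', 8: 'truck'}

def draw_boxes(image, boxes, labels, scores, threshold=0.5):
    for box, label, score in zip(boxes, labels, scores):
        class_name = VEHICLE_CLASSES.get(int(label))
        # Skip low-confidence detections and non-vehicle classes
        if score > threshold and class_name is not None:
            x1, y1, x2, y2 = map(int, box)
            cv2.rectangle(image, (x1, y1), (x2, y2), (0, 255, 0), 2)
            cv2.putText(image, class_name, (x1, y1 - 10), cv2.FONT_HERSHEY_SIMPLEX, 0.9, (255, 255, 255), 2)
    return image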


# Extract the frames first. 'input_clip.mp4' and 'frames' are placeholder
# names; point them at your own video and working folder.
output_folder = 'frames'
num_frames = extract_frames('input_clip.mp4', output_folder)

# Process each frame and draw bounding boxes
output_frames_folder = 'output_frames'
os.makedirs(output_frames_folder, exist_ok=True)

for i in range(num_frames):
    frame_path = os.path.join(output_folder, f'frame_{i}.jpg')
    image = cv2.imread(frame_path)
    boxes, labels, scores = detect_vehicles_on_frame(frame_path, model)
    image_with_boxes = draw_boxes(image, boxes, labels, scores)
    cv2.imwrite(os.path.join(output_frames_folder, f'output_frame_{i}.jpg'), image_with_boxes)
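
One edge case in the drawing step: cv2.putText treats the point it is given as the bottom-left corner of the text, so a label placed 10 pixels above a box that touches the top of the frame ends up off-screen. Below is a sketch of one way to handle that and to show the confidence next to the class name. `label_origin` and `label_text` are hypothetical helpers; the 10-pixel offset mirrors draw_boxes, and the 15-pixel minimum is an assumption.

```python
def label_origin(x1, y1, offset=10, min_y=15):
    """Bottom-left corner for the label text, kept inside the frame.

    A y value near zero would push the label above the top edge of the
    image, so it is clamped to at least `min_y`.
    """
    return int(x1), max(int(y1) - offset, min_y)

def label_text(class_name, score):
    """Combine the class name with its confidence, e.g. 'car 0.92'."""
    return f"{class_name} {score:.2f}"

print(label_origin(40, 120))      # (40, 110)
print(label_origin(40, 3))        # (40, 15)
print(label_text("car", 0.917))   # car 0.92
```

Both results can be passed straight to cv2.putText in place of class_name and (x1, y1 - 10).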

Step 5: Recompile the Video

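After processing all frames, I recompiled them into a video using OpenCV. The output takes its resolution from the processed frames; the frame rate defaults to 30 fps, so pass the source video’s rate as frame_rate if it differs.

def frames_to_video(input_folder, output_video_path, frame_rate=30):
    # Sort by the numeric index in the filename; a plain sorted() would put
    # 'output_frame_10.jpg' before 'output_frame_2.jpg' and scramble the video
    def frame_index(name):
        return int(os.path.splitext(name)[0].rsplit('_', 1)[-1])

    names = sorted((f for f in os.listdir(input_folder) if f.endswith('.jpg')), key=frame_index)
    frames = [os.path.join(input_folder, f) for f in names]
    frame_height, frame_width, _ = cv2.imread(frames[0]).shape
    out = cv2.VideoWriter(output_video_path, cv2.VideoWriter_fourcc(*'mp4v'), frame_rate, (frame_width, frame_height))
    
    for frame_path in frames:
        frame = cv2.imread(frame_path)
        out.write(frame)
    
    out.release()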


# Path to output video
output_video_path = 'output_clip.mp4'
frames_to_video(output_frames_folder, output_video_path)
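
The frame_rate argument matters because it fixes the playback speed: the same frames written at a different rate than the source play faster or slower. The arithmetic is sketched below with made-up numbers. In practice, the source rate can be read with video.get(cv2.CAP_PROP_FPS) on the cv2.VideoCapture object and passed to frames_to_video; `playback_seconds` is a hypothetical helper for illustration only.

```python
def playback_seconds(num_frames, frame_rate):
    """Duration of a clip with `num_frames` frames played at `frame_rate` fps."""
    return num_frames / frame_rate

num_frames = 750   # illustrative: a 30-second clip recorded at 25 fps
source_fps = 25

print(playback_seconds(num_frames, source_fps))  # 30.0
print(playback_seconds(num_frames, 30))          # 25.0, so playback is 1.2x too fast
```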

Challenges Faced

Model Speed: While Faster R-CNN provides accurate results, it’s not the fastest model. Using a GPU significantly improved the inference speed.
Output Video Quality: Ensuring that the output video maintained the same resolution and frame rate as the input required careful handling of frame dimensions.
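
To put a number on the speed issue for your own hardware, you can time a handful of frames and extrapolate. Here is a rough sketch using only the standard library. `time_per_call` and `estimate_total_seconds` are hypothetical helpers, and `fake_detect` merely stands in for a real call such as detect_vehicles_on_frame.

```python
import time

def time_per_call(fn, args_list):
    """Average wall-clock seconds per call of `fn` over the given argument tuples."""
    start = time.perf_counter()
    for args in args_list:
        fn(*args)
    return (time.perf_counter() - start) / len(args_list)

def estimate_total_seconds(seconds_per_frame, num_frames):
    """Extrapolate a per-frame time to the whole video."""
    return seconds_per_frame * num_frames

def fake_detect(frame_path):
    # Stand-in for the real detector so the sketch runs anywhere.
    time.sleep(0.01)

per_frame = time_per_call(fake_detect, [("frame_0.jpg",), ("frame_1.jpg",)])
total = estimate_total_seconds(per_frame, 900)
print(f"{per_frame:.3f} s/frame, roughly {total:.0f} s for 900 frames")
```

When timing on a GPU, it is usually worth discarding the first frame or two, since they include one-off warm-up cost.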

Results

After processing the video, the final output highlights vehicles with bounding boxes and labels. The compiled video provides a clear visualization of detected objects.

What’s Next?

This project can be extended in various ways:

Real-Time Processing: Implement the pipeline for live video feeds.
Custom Training: Fine-tune the Faster R-CNN model on a dataset specific to your domain.
Edge Deployment: Optimize the model for deployment on edge devices.

Closing Thoughts

This project showcases the power of pre-trained models in solving complex problems with minimal effort. Whether you’re working on traffic analysis or building smarter surveillance systems, the techniques shared here can be a starting point for your projects.

Check out the full project code on my GitHub. Let me know your thoughts and feel free to connect with me on LinkedIn for more updates on my projects.