Section 3 — Run Models in Minutes
In this section we load data samples with datasets, run inference with pipeline(), and compare the NVIDIA SegFormer with the TCD SegFormer on a test sample.
3.1 Get Data Samples with datasets
```python
from datasets import load_dataset

# Load the dataset and pull one sample from each split
dataset = load_dataset("restor/tcd-nc")
image_train = dataset["train"][1]["image"]
image_test = dataset["test"][20]["image"]
```
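The `image` fields load as PIL Images (assuming the dataset declares an Image feature, as image datasets on the Hub typically do). Before running inference it can help to confirm the mode and size; a minimal sketch, using a synthetic stand-in image rather than the downloaded sample:

```python
from PIL import Image
import numpy as np

# Synthetic stand-in; the real image_train is also a PIL Image
image = Image.new("RGB", (512, 512), color=(34, 139, 34))

print(image.mode, image.size)  # RGB (512, 512)

arr = np.array(image)          # height x width x channels, uint8
print(arr.shape, arr.dtype)    # (512, 512, 3) uint8
```

The same two checks work unchanged on `image_train` and `image_test`.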
3.2 Perform Inference with transformers
Using pipeline(), we can go from an image to segmentation results in a single call:
```python
from transformers import pipeline

nvidia_segformer_pipeline = pipeline(
    "image-segmentation",
    model="nvidia/segformer-b0-finetuned-ade-512-512"
)

result = nvidia_segformer_pipeline(image_train)

# Each element is a dict with 'label' and 'mask' (PIL Image)
print("Number of classes:", len(result))
# Number of classes: 15
```
```python
import matplotlib.pyplot as plt
import numpy as np

# Combine all masks into a single labeled image
combined = np.zeros_like(np.array(result[0]["mask"]), dtype=np.float32)
for i, r in enumerate(result):
    combined[np.array(r["mask"]) > 0] = i + 1

# Side by side: original image and segmentation masks
fig, ax = plt.subplots(1, 2, figsize=(10, 5))
ax[0].imshow(image_train)
ax[0].set_title("Original (image_train)")
ax[0].axis("off")
ax[1].imshow(combined, cmap="tab20", vmin=0, vmax=len(result))
ax[1].set_title("Segmentation masks")
ax[1].axis("off")
plt.tight_layout()
plt.show()

# Print all identified classes
print("Number of classes:", len(result))
print("Classes:", [r["label"] for r in result])
```
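One subtlety of the combining loop: each mask gets a distinct integer label, and wherever masks overlap, later masks overwrite earlier ones. A toy illustration with two synthetic 4×4 masks (stand-ins for the 255/0 PIL masks the pipeline returns):

```python
import numpy as np

# Two overlapping binary masks standing in for pipeline output masks
mask_a = np.zeros((4, 4), dtype=np.uint8)
mask_a[:2, :] = 255          # top half
mask_b = np.zeros((4, 4), dtype=np.uint8)
mask_b[:, :2] = 255          # left half

combined = np.zeros((4, 4), dtype=np.float32)
for i, mask in enumerate([mask_a, mask_b]):
    combined[mask > 0] = i + 1

# The top-left 2x2 overlap ends up labeled 2: the later mask wins
print(combined)
```

For SegFormer output this rarely matters in practice, since per-pixel argmax masks don't overlap, but it is worth knowing if you reuse the loop with instance masks.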
The NVIDIA model was trained on ADE20K (150 classes of general scenes), so it recognizes many object types — but it was not specifically trained to detect tree cover.
3.3 Compare NVIDIA SegFormer and TCD-SegFormer on image_test
Now let's run both models on the same image_test sample and compare. The TCD SegFormer was fine-tuned specifically for tree cover detection, so it outputs only 2 classes: background and tree.
```python
restor_tcd_pipeline = pipeline(
    "image-segmentation",
    model="restor/tcd-segformer-mit-b0"
)

nvidia_result = nvidia_segformer_pipeline(image_test)
tcd_result = restor_tcd_pipeline(image_test)
```
```python
import numpy as np

# ADE20K vegetation-related classes (see SegFormer docs)
vegetation_labels = {"tree", "grass", "plant", "flower", "palm"}

# Combine all vegetation masks from the NVIDIA model
nvidia_veg = np.zeros_like(np.array(nvidia_result[0]["mask"]))
nvidia_veg_found = []
for r in nvidia_result:
    if r["label"] in vegetation_labels:
        nvidia_veg = np.maximum(nvidia_veg, np.array(r["mask"]))
        nvidia_veg_found.append(r["label"])

# TCD model outputs a single "tree" class
tcd_tree = next(r for r in tcd_result if r["label"] == "tree")

print("NVIDIA vegetation classes found:", nvidia_veg_found)
```
```python
# Side-by-side comparison on image_test
fig, ax = plt.subplots(1, 3, figsize=(15, 5))
ax[0].imshow(image_test)
ax[0].set_title("Original (image_test)")
ax[0].axis("off")
ax[1].imshow(image_test)
ax[1].imshow(nvidia_veg, alpha=0.5)
ax[1].set_title("NVIDIA — vegetation classes")
ax[1].axis("off")
ax[2].imshow(image_test)
ax[2].imshow(tcd_tree["mask"], alpha=0.5)
ax[2].set_title("TCD — tree class")
ax[2].axis("off")
plt.tight_layout()
plt.show()
```
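Beyond eyeballing the plots, a single overlap number makes the comparison concrete. Intersection over union (IoU) between the two binary masks would quantify how much the models agree; a sketch with synthetic masks (to compare the real outputs, pass `nvidia_veg` and `np.array(tcd_tree["mask"])` instead):

```python
import numpy as np

def iou(mask_a, mask_b):
    """IoU between two binary masks (any nonzero pixel counts as foreground)."""
    a = np.asarray(mask_a) > 0
    b = np.asarray(mask_b) > 0
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 0.0
    return np.logical_and(a, b).sum() / union

# Synthetic stand-ins for the two model masks
pred_a = np.zeros((8, 8)); pred_a[:4, :] = 1   # top half
pred_b = np.zeros((8, 8)); pred_b[:, :4] = 1   # left half

# Intersection 16 pixels, union 48 pixels
print(f"IoU: {iou(pred_a, pred_b):.3f}")  # IoU: 0.333
```

Expect a moderate IoU on the real masks: the NVIDIA model's broad vegetation classes cover grass and low plants that the TCD model deliberately excludes.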
3.3.1 Key differences
| | NVIDIA SegFormer | TCD SegFormer |
|---|---|---|
| Training data | ADE20K (general scenes) | Restor TCD (aerial tree cover) |
| Classes | 150 | 2 (tree / background) |
| Output layer | 150 output channels (38,550 params) | 2 output channels (514 params) |
| Total params | 3.75M | 3.71M |
| Best for | General scene understanding | Tree cover detection in aerial imagery |
Both models share the same SegFormer backbone (mit-b0), but differ in their decode head — specifically the final classifier layer, which was re-trained for the target task. This is the power of fine-tuning: adapting a general-purpose model to a specific domain.
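The classifier parameter counts in the table follow directly from the decode head's hidden size: in SegFormer mit-b0 the final classifier is a 1×1 convolution from 256 channels to one channel per class, so it has 256 × num_labels weights plus num_labels biases. A quick check of the table's numbers:

```python
# SegFormer mit-b0 decode head: 1x1 conv, 256 hidden channels -> num_labels
HIDDEN = 256

def classifier_params(num_labels):
    # weights (HIDDEN * num_labels) + biases (num_labels)
    return HIDDEN * num_labels + num_labels

print(classifier_params(150))  # ADE20K head: 38550
print(classifier_params(2))    # TCD head: 514
```

This is also why fine-tuning is cheap here: only these few hundred (or tens of thousands of) parameters are new, while the backbone's millions of parameters start from the pre-trained weights.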
Bonus: Fine-Tuned Models vs. Foundation Models
The models we used in this section are fine-tuned — they were trained on a specific dataset to recognize specific classes. Fine-tuning adapts a general-purpose model to a narrow domain (e.g., tree cover detection) by re-training its final layers on labeled data.
A different approach is foundation models like SAM (Segment Anything Model). Foundation models are pre-trained on massive, diverse datasets and can generalize to new tasks without fine-tuning. SAM 3, the latest version, accepts text prompts — you can ask it to segment "tree" in any image, even if it was never trained on aerial tree cover data.
SAM 3 itself was trained on a massive labeled corpus (11M+ images, 1B+ masks with concept labels), so it learned many concepts during pre-training. It is therefore not truly "zero-shot" in the strict ML sense; a more accurate description is that it works without task-specific fine-tuning: you don't need to collect and label your own dataset.
| | Fine-tuned (SegFormer) | Foundation model (SAM) |
|---|---|---|
| Training | Re-trained on a labeled dataset for a specific task | Pre-trained on massive data, works without task-specific fine-tuning |
| Output | Per-pixel class labels (e.g., "tree", "background") | Masks for prompted objects, no predefined classes |
| Labeling effort | Requires labeled training data | No labeling needed |
| Best for | Domain-specific accuracy | Rapid exploration, annotation, unknown objects |
Try it yourself: SAM on aerial imagery
Download the sample images to upload to the demo:
- Download image_train
- Download image_test
Try SAM 3 Demo — a Hugging Face Space running Meta's latest Segment Anything Model. Upload one of the aerial images from the workshop and give it a text prompt like "tree".
- Does SAM 3 segment trees when prompted with "tree"?
- How does the output compare to the SegFormer results from earlier?