Section 3 — Run Models in Minutes
In this section we load data samples with datasets, run inference with pipeline(), and compare the NVIDIA SegFormer with the TCD SegFormer on a test sample.
3.1 Get Data Samples with datasets
```python
from datasets import load_dataset

# Load the dataset and pull one sample from each split
dataset = load_dataset("restor/tcd-nc")
image_train = dataset["train"][1]["image"]
image_test = dataset["test"][20]["image"]
```
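The `image` fields load as PIL Images (assuming the dataset declares an Image feature, as image datasets on the Hub typically do). Before running inference it can help to confirm the mode and size; a minimal sketch, using a synthetic stand-in image rather than the downloaded sample:

```python
from PIL import Image
import numpy as np

# Synthetic stand-in; the real image_train is also a PIL Image
image = Image.new("RGB", (512, 512), color=(34, 139, 34))

print(image.mode, image.size)  # RGB (512, 512)

arr = np.array(image)          # height x width x channels, uint8
print(arr.shape, arr.dtype)    # (512, 512, 3) uint8
```

The same two checks work unchanged on `image_train` and `image_test`.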
3.2 Perform Inference with transformers
Using pipeline(), we can go from an image to segmentation results in a single call:
```python
from transformers import pipeline

nvidia_segformer_pipeline = pipeline(
    "image-segmentation",
    model="nvidia/segformer-b0-finetuned-ade-512-512"
)

result = nvidia_segformer_pipeline(image_train)

# Each element is a dict with 'label' and 'mask' (PIL Image)
print("Number of classes:", len(result))
# Number of classes: 15
```
```python
import matplotlib.pyplot as plt
import numpy as np

# Combine all masks into a single labeled image
combined = np.zeros_like(np.array(result[0]["mask"]), dtype=np.float32)
for i, r in enumerate(result):
    combined[np.array(r["mask"]) > 0] = i + 1

# Side by side: original image and segmentation masks
fig, ax = plt.subplots(1, 2, figsize=(10, 5))
ax[0].imshow(image_train)
ax[0].set_title("Original (image_train)")
ax[0].axis("off")
ax[1].imshow(combined, cmap="tab20", vmin=0, vmax=len(result))
ax[1].set_title("Segmentation masks")
ax[1].axis("off")
plt.tight_layout()
plt.show()

# Print all identified classes
print("Number of classes:", len(result))
print("Classes:", [r["label"] for r in result])
```
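One subtlety of the combining loop: each mask gets a distinct integer label, and wherever masks overlap, later masks overwrite earlier ones. A toy illustration with two synthetic 4×4 masks (stand-ins for the 255/0 PIL masks the pipeline returns):

```python
import numpy as np

# Two overlapping binary masks standing in for pipeline output masks
mask_a = np.zeros((4, 4), dtype=np.uint8)
mask_a[:2, :] = 255          # top half
mask_b = np.zeros((4, 4), dtype=np.uint8)
mask_b[:, :2] = 255          # left half

combined = np.zeros((4, 4), dtype=np.float32)
for i, mask in enumerate([mask_a, mask_b]):
    combined[mask > 0] = i + 1

# The top-left 2x2 overlap ends up labeled 2: the later mask wins
print(combined)
```

For SegFormer output this rarely matters in practice, since per-pixel argmax masks don't overlap, but it is worth knowing if you reuse the loop with instance masks.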
The NVIDIA model was trained on ADE20K (150 classes of general scenes), so it recognizes many object types — but it was not specifically trained to detect tree cover.
3.3 Compare NVIDIA SegFormer and TCD-SegFormer on image_test
Now let's run both models on the same image_test sample and compare. The TCD SegFormer was fine-tuned specifically for tree cover detection, so it outputs only 2 classes: background and tree.
```python
restor_tcd_pipeline = pipeline(
    "image-segmentation",
    model="restor/tcd-segformer-mit-b0"
)

nvidia_result = nvidia_segformer_pipeline(image_test)
tcd_result = restor_tcd_pipeline(image_test)
```
```python
import numpy as np

# ADE20K vegetation-related classes (see SegFormer docs)
vegetation_labels = {"tree", "grass", "plant", "flower", "palm"}

# Combine all vegetation masks from the NVIDIA model
nvidia_veg = np.zeros_like(np.array(nvidia_result[0]["mask"]))
nvidia_veg_found = []
for r in nvidia_result:
    if r["label"] in vegetation_labels:
        nvidia_veg = np.maximum(nvidia_veg, np.array(r["mask"]))
        nvidia_veg_found.append(r["label"])

# TCD model outputs a single "tree" class
tcd_tree = next(r for r in tcd_result if r["label"] == "tree")

print("NVIDIA vegetation classes found:", nvidia_veg_found)
```
```python
# Side-by-side comparison on image_test
fig, ax = plt.subplots(1, 3, figsize=(15, 5))
ax[0].imshow(image_test)
ax[0].set_title("Original (image_test)")
ax[0].axis("off")
ax[1].imshow(image_test)
ax[1].imshow(nvidia_veg, alpha=0.5)
ax[1].set_title("NVIDIA — vegetation classes")
ax[1].axis("off")
ax[2].imshow(image_test)
ax[2].imshow(tcd_tree["mask"], alpha=0.5)
ax[2].set_title("TCD — tree class")
ax[2].axis("off")
plt.tight_layout()
plt.show()
```
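Beyond eyeballing the plots, a single overlap number makes the comparison concrete. Intersection over union (IoU) between the two binary masks would quantify how much the models agree; a sketch with synthetic masks (to compare the real outputs, pass `nvidia_veg` and `np.array(tcd_tree["mask"])` instead):

```python
import numpy as np

def iou(mask_a, mask_b):
    """IoU between two binary masks (any nonzero pixel counts as foreground)."""
    a = np.asarray(mask_a) > 0
    b = np.asarray(mask_b) > 0
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 0.0
    return np.logical_and(a, b).sum() / union

# Synthetic stand-ins for the two model masks
pred_a = np.zeros((8, 8)); pred_a[:4, :] = 1   # top half
pred_b = np.zeros((8, 8)); pred_b[:, :4] = 1   # left half

# Intersection 16 pixels, union 48 pixels
print(f"IoU: {iou(pred_a, pred_b):.3f}")  # IoU: 0.333
```

Expect a moderate IoU on the real masks: the NVIDIA model's broad vegetation classes cover grass and low plants that the TCD model deliberately excludes.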
3.3.1 Key differences
| | NVIDIA SegFormer | TCD SegFormer |
|---|---|---|
| Training data | ADE20K (general scenes) | Restor TCD (aerial tree cover) |
| Classes | 150 | 2 (tree / background) |
| Output layer | 150 output channels (38,550 params) | 2 output channels (514 params) |
| Total params | 3.75M | 3.71M |
| Best for | General scene understanding | Tree cover detection in aerial imagery |
Both models share the same SegFormer backbone (mit-b0), but differ in their decode head — specifically the final classifier layer, which was re-trained for the target task. This is the power of fine-tuning: adapting a general-purpose model to a specific domain.
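The classifier parameter counts in the table follow directly from the decode head's hidden size: in SegFormer mit-b0 the final classifier is a 1×1 convolution from 256 channels to one channel per class, so it has 256 × num_labels weights plus num_labels biases. A quick check of the table's numbers:

```python
# SegFormer mit-b0 decode head: 1x1 conv, 256 hidden channels -> num_labels
HIDDEN = 256

def classifier_params(num_labels):
    # weights (HIDDEN * num_labels) + biases (num_labels)
    return HIDDEN * num_labels + num_labels

print(classifier_params(150))  # ADE20K head: 38550
print(classifier_params(2))    # TCD head: 514
```

This is also why fine-tuning is cheap here: only these few hundred (or tens of thousands of) parameters are new, while the backbone's millions of parameters start from the pre-trained weights.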
Bonus: Fine-Tuned Models vs. Foundation Models
The models we used in this section are fine-tuned — they were trained on a specific dataset to recognize specific classes. Fine-tuning adapts a general-purpose model to a narrow domain (e.g., tree cover detection) by re-training its final layers on labeled data.
A different approach is foundation models like SAM (Segment Anything Model). Foundation models are pre-trained on massive, diverse datasets and can generalize to new tasks without fine-tuning. SAM 3, the latest version, accepts text prompts — you can ask it to segment "tree" in any image, even if it was never trained on aerial tree cover data.
SAM 3 itself was trained on a massive labeled corpus (11M+ images, 1B+ masks with concept labels), so it learned many concepts during pre-training. It is therefore not truly "zero-shot" in the strict ML sense; a more accurate description is that it works without task-specific fine-tuning: you don't need to collect and label your own dataset.
| | Fine-tuned (SegFormer) | Foundation model (SAM) |
|---|---|---|
| Training | Re-trained on a labeled dataset for a specific task | Pre-trained on massive data, works without task-specific fine-tuning |
| Output | Per-pixel class labels (e.g., "tree", "background") | Masks for prompted objects, no predefined classes |
| Labeling effort | Requires labeled training data | No labeling needed |
| Best for | Domain-specific accuracy | Rapid exploration, annotation, unknown objects |
Try it yourself: SAM on aerial imagery
Download the sample images to upload to the demo:
- Download image_train
- Download image_test
Try SAM 3 Demo — a Hugging Face Space running Meta's latest Segment Anything Model. Upload one of the aerial images from the workshop and give it a text prompt like "tree".
- Does SAM 3 segment trees when prompted with "tree"?
- How does the output compare to the SegFormer results from earlier?