a lot, apparently
It turns out that, this whole time, some visual encoders have been sensitive to processing parameters (e.g., JPEG compression) and acquisition parameters (e.g., camera model). This information can be recovered from the image representations these encoders produce, where it can overshadow an image's semantic content and affect performance on downstream tasks. This blog post gives a brief overview of these findings, which are detailed further in our ICCV 2025 highlight paper *Processing and acquisition traces in visual encoders: What does CLIP know about your camera?*.
Have you ever asked yourself how much your favorite vision model knows about image capture parameters (e.g., the amount of JPEG compression, the camera model, etc.)? Furthermore, could these parameters influence its semantic recognition abilities?
— Vladan Stojnić (@stojnicv.xyz) August 18, 2025 at 7:48 PM
To show that vision encoders incorporate processing and acquisition traces into their embeddings (and to determine which encoders do this the most), we train linear classifiers over embeddings extracted from a variety of vision encoders, with the encoders kept frozen. We categorize the encoders as: contrastive vision-language models (CVLs), supervised models, and self-supervised learning (SSL) models.
The idea is that if you can predict traces from a frozen model's embeddings at rates higher than random chance, then the model is definitely encoding relevant information. Across our experiments, we tune the learning rate and weight decay of each probe.
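To make the probing setup concrete, here is a minimal sketch of a linear probe trained on pre-extracted, frozen embeddings, with a small sweep over learning rate and weight decay. This is not our exact training recipe, and the feature tensors below are random placeholders standing in for real extracted embeddings.

```python
# Minimal linear-probing sketch over frozen-encoder embeddings (illustrative, not the exact recipe).
import torch
import torch.nn as nn

def train_linear_probe(train_feats, train_labels, val_feats, val_labels,
                       num_classes, lr=1e-3, weight_decay=1e-4, epochs=50):
    """Fit a single linear layer on frozen features; return validation accuracy."""
    probe = nn.Linear(train_feats.shape[1], num_classes)
    opt = torch.optim.AdamW(probe.parameters(), lr=lr, weight_decay=weight_decay)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(probe(train_feats), train_labels)
        loss.backward()
        opt.step()
    with torch.no_grad():
        preds = probe(val_feats).argmax(dim=1)
    return (preds == val_labels).float().mean().item()

# Placeholder data standing in for 512-d frozen embeddings and 6-way trace labels.
tr_x, tr_y = torch.randn(1000, 512), torch.randint(0, 6, (1000,))
va_x, va_y = torch.randn(200, 512), torch.randint(0, 6, (200,))

# Small sweep over learning rate and weight decay, keeping the best probe.
best_acc, best_lr, best_wd = max(
    (train_linear_probe(tr_x, tr_y, va_x, va_y, num_classes=6, lr=lr, weight_decay=wd), lr, wd)
    for lr in (1e-2, 1e-3, 1e-4)
    for wd in (0.0, 1e-4, 1e-2)
)
```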
To probe for processing traces, we can take a dataset like ImageNet and process its images ourselves, studying the following parameters:
| parameter | # classes | description |
|---|---|---|
| JPEG compression | 6 | amount of JPEG compression |
| sharpening | 3 | amount of sharpening |
| resizing | 3 | amount of resizing |
| interpolation | 4 | type of interpolation during resize |
For the specific classes we study and their implementation, feel free to check out the supplementary material of our paper.
In creating our probing data for a specific parameter, we seek to balance the classes, so the class each image is processed into is sampled uniformly at random. For example, when we're creating our data for probing JPEG compression, there is a \(\frac{1}{6}\) chance that the image is compressed with quality 95 and chroma subsampling 4:2:0, a \(\frac{1}{6}\) chance that the quality is 95 and the chroma subsampling is 4:4:4, and so on and so forth.
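As a rough sketch of this balanced sampling, the snippet below compresses an image into one of six uniformly sampled JPEG classes using Pillow. Only the two settings mentioned above are taken from the text; the remaining entries are placeholders, and the exact class definitions are in the paper's supplementary material.

```python
# Sketch of balanced probe-data creation for the JPEG-compression parameter.
import random
from PIL import Image

JPEG_CLASSES = [
    {"quality": 95, "subsampling": 2},   # quality 95, chroma subsampling 4:2:0 (mentioned above)
    {"quality": 95, "subsampling": 0},   # quality 95, chroma subsampling 4:4:4 (mentioned above)
    {"quality": 75, "subsampling": 2},   # placeholder class
    {"quality": 75, "subsampling": 0},   # placeholder class
    {"quality": 50, "subsampling": 2},   # placeholder class
    {"quality": 50, "subsampling": 0},   # placeholder class
]

def make_probe_example(src_path, dst_path):
    """Compress an image into a uniformly sampled class and return its label."""
    label = random.randrange(len(JPEG_CLASSES))   # each class has a 1/6 probability
    params = JPEG_CLASSES[label]
    Image.open(src_path).convert("RGB").save(dst_path, format="JPEG", **params)
    return label
```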
The results of our experiments are shown below:
First of all, it’s clear that some visual encoders perform well above random accuracy, meaning that processing traces are being stored in their embeddings. We observe this most strongly in CVLs, which can reach 80% test accuracy on JPEG compression (random chance is 16.67%, more than 4\(\times\) lower!). We also see that supervised ConvNeXts perform well, which is surprising given that the only information available during their training is ImageNet labels. Lastly, we note that SSL models tend to perform the weakest. These observations recur throughout this work.
To probe for acquisition traces, we need a dataset that comes with relevant annotations. Fortunately, these can be extracted from images’ Exif metadata. Unfortunately, many readily available datasets don’t provide this metadata. Thus, we use the Flickr API to collect our own dataset of Flickr images together with their Exif metadata, which we call FlickrExif.
With FlickrExif, we can study the following acquisition attributes:
| parameter | # classes | description |
|---|---|---|
| make | 9 | manufacturer of the camera |
| model (all) | 88 | specific camera model used |
| model (smart) | 12 | specific smartphone used |
| model (smart vs non-smart) | 2 | whether camera is a smartphone |
| exposure | 16 | amount of light captured by sensor |
| aperture | 17 | size of the opening in the lens |
| ISO speed | 16 | camera sensor’s sensitivity to light |
| focal length | 13 | distance from lens to sensor |
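For reference, attributes like these can be read from an image's Exif metadata with Pillow. The sketch below is illustrative of that step only (it is not our collection pipeline) and assumes a fairly recent Pillow version that exposes the `ExifTags.Base` and `ExifTags.IFD` enums.

```python
# Sketch of reading acquisition attributes from Exif metadata with Pillow.
from PIL import Image, ExifTags

def read_acquisition_tags(path):
    exif = Image.open(path).getexif()
    exif_ifd = exif.get_ifd(ExifTags.IFD.Exif)  # sub-IFD holding the shooting parameters
    return {
        "make": exif.get(ExifTags.Base.Make),            # camera manufacturer
        "model": exif.get(ExifTags.Base.Model),          # specific camera model
        "exposure": exif_ifd.get(ExifTags.Base.ExposureTime),
        "aperture": exif_ifd.get(ExifTags.Base.FNumber),
        "iso": exif_ifd.get(ExifTags.Base.ISOSpeedRatings),
        "focal_length": exif_ifd.get(ExifTags.Base.FocalLength),
    }
```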
Now, there’s a possibility that there are correlations between image semantics and acquisition labels. For example, if nighttime photos are commonly shot with higher ISO values, then instead of determining whether encoders are sensitive to the low-level features associated with different ISO values, we might simply end up determining which ones can distinguish between daytime and nighttime photos.
To avoid this, we have two fixes. The first is to limit the number of photos a Flickr user contributes per calendar month in our dataset (if wedding photographer Bob took 1,000 photos in one day with similar camera settings, we’d be in trouble). The second is to simply mask a large portion of the image (90%) to scrub out the semantic information. The first fix is incorporated into our dataset, while we leave the second as the recommended preprocessing procedure for future users.
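As a rough illustration of the second fix, the sketch below zeroes out roughly 90% of an image before feature extraction. It assumes random square patches are masked; the exact masking scheme we use may differ, so treat this only as an indicative example.

```python
# Sketch: mask ~90% of an image by zeroing out randomly chosen square patches.
import numpy as np

def mask_image(img, mask_ratio=0.9, patch=16, seed=0):
    """img: HxWxC uint8 array. Zero out `mask_ratio` of the patches."""
    rng = np.random.default_rng(seed)
    out = img.copy()
    h, w = img.shape[:2]
    gh, gw = h // patch, w // patch
    n_mask = int(round(mask_ratio * gh * gw))
    idx = rng.choice(gh * gw, size=n_mask, replace=False)
    for i in idx:
        r, c = divmod(i, gw)
        out[r * patch:(r + 1) * patch, c * patch:(c + 1) * patch] = 0
    return out
```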
With that established, our results are below:
We observe a very similar pattern in these results. CVLs, with the exception of SigLIPs, perform very well, alongside supervised ConvNeXts. Meanwhile, SSL models seem to encode very little acquisition information.
Now that we’ve (quite clearly) established that some visual encoders do indeed encode processing and acquisition traces, we want to show why you should care. We show an example for each type of trace: kNN classification for processing traces and image retrieval for acquisition traces.
Consider four different types of training sets for a kNN classifier:
We plot the results for each set-up below:
Looking at the CVLs, we can see that performance differs wildly depending on whether it’s the positives or the negatives that match the query’s metadata. In the latter case, the metadata becomes the dominant factor in determining the nearest neighbors, essentially “distracting” the classifier and tanking its performance. Conversely, we see that processing has minimal effect in the case of SSL models, showcasing the lack of potentially distracting processing traces in their embeddings.
Take a look below for a direct visualization of how the top-\(k\) neighbors identified by a CVL ConvNeXt-L are affected by JPEG compression. Making the positives’ metadata match the query’s pushes them closer towards the query, while doing so with the negatives pushes the negatives closer towards the query.
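For concreteness, here is a minimal sketch of the kNN step itself: retrieving a query's top-\(k\) neighbors by cosine similarity over encoder embeddings and voting on the label. The embedding tensors are random placeholders; in practice they come from the frozen encoder, with the gallery images processed to either match or mismatch the query's JPEG settings.

```python
# Sketch of kNN classification over (frozen) encoder embeddings.
import torch
import torch.nn.functional as F

def knn_predict(query_emb, gallery_embs, gallery_labels, k=5):
    q = F.normalize(query_emb, dim=-1)
    g = F.normalize(gallery_embs, dim=-1)
    sims = g @ q                      # cosine similarity to every gallery image
    topk = sims.topk(k).indices       # indices of the k nearest neighbors
    votes = gallery_labels[topk]
    return torch.mode(votes).values.item(), topk

# Placeholder embeddings and labels.
query_emb = torch.randn(512)
gallery_embs = torch.randn(1000, 512)
gallery_labels = torch.randint(0, 10, (1000,))
pred, neighbors = knn_predict(query_emb, gallery_embs, gallery_labels, k=5)
```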
Imagine an image retrieval scenario where the target image was captured with a camera type different from the one used to capture the query image. Now imagine two possible collections: a *same* collection, in which the remaining images (the distractors) were captured with the same camera type as the query, and a *different* collection, in which the distractors were captured with a different camera type.
Similar to the previous subsection, we show that acquisition traces, in this case camera type, can also dominate over and distract from semantic information, affecting retrieval.
To do this, we collect a dataset we dub PairCams, available here. We capture 730 pairs of photos of the exact same subject with nearly identical shooting conditions (e.g. angle, camera orientation, time of day, camera shooting mode), with the only difference being camera type. We experiment with modern smartphones and older digital cameras. Despite the difference in camera, the image pairs contain nearly identical semantic content, meaning that it should be trivial to retrieve one using the other.
We go through every visual encoder we’ve used so far to calculate the recall@\(1\) for each collection:
The \(y=x\) line shows where a perfectly robust visual encoder should lie. If a visual encoder is not prone to being distracted by acquisition parameters as it searches for the best semantic match, then performance should be equal in both settings. What we observe, however, is that CVLs again show an extreme sensitivity to acquisition parameters. A visual encoder with near-perfect recall@\(1\) on the different collection can drop to 0.85 recall@\(1\) in the same setting. Zooming into the black square shows another consistent finding: SSL models are among the most robust, staying much closer to the \(y=x\) line than any other model.
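For reference, recall@\(1\) in this setup can be computed as sketched below: each query counts as a hit if its single nearest neighbor in the collection is its paired target. The embedding tensors and pairing indices are hypothetical placeholders.

```python
# Sketch of recall@1 for PairCams-style retrieval.
import torch
import torch.nn.functional as F

def recall_at_1(query_embs, collection_embs, target_idx):
    """collection_embs holds targets and distractors; target_idx[i] is the
    index of query i's paired target within the collection."""
    q = F.normalize(query_embs, dim=-1)
    c = F.normalize(collection_embs, dim=-1)
    nearest = (q @ c.T).argmax(dim=1)          # top-1 retrieved index per query
    return (nearest == target_idx).float().mean().item()

# Placeholder usage: queries[i] is paired with collection[i]; extra rows are distractors.
q_embs = torch.randn(730, 512)
c_embs = torch.randn(1500, 512)
print(recall_at_1(q_embs, c_embs, target_idx=torch.arange(730)))
```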
The evidence that some visual encoders also encode processing and acquisition traces is strong, but we’re still in the middle of figuring out the possible whys behind all of this. It’s a little difficult to test our theories given the use of private datasets for existing models and the cost of training our own models from scratch. We also have yet to explore mitigation techniques.
This phenomenon of course carries implications. Firstly, the fact that metadata traces can overshadow semantic information raises concerns over the robustness and trustworthiness of our current algorithms. We wouldn’t want a malicious agent messing with a model simply by playing with images’ metadata, especially in critical domains like healthcare or autonomous systems. On the other hand, the fact that this information is recoverable from off-the-shelf models could help digital forensics research or deepfake detection.
Thank you for reading!