a lot, apparently
It turns out that, this whole time, some visual encoders have been sensitive to processing parameters (e.g., JPEG compression) and acquisition parameters (e.g., camera model). This information can be recovered from the image representations these encoders produce, where it can overshadow an image's semantic content and affect performance on downstream tasks. This blog post gives a brief overview of these findings, which are detailed further in our ICCV 2025 highlight paper *Processing and acquisition traces in visual encoders: What does CLIP know about your camera?*.
Have you ever asked yourself how much your favorite vision model knows about image capture parameters (e.g., the amount of JPEG compression, the camera model, etc.)? Furthermore, could these parameters influence its semantic recognition abilities?
— Vladan Stojnić (@stojnicv.xyz) August 18, 2025 at 7:48 PM
To show that vision encoders incorporate processing and acquisition traces into their embeddings (and to determine which encoders do this the most), we train linear classifiers over embeddings extracted from a variety of vision encoders, with the encoders kept frozen. We categorize the encoders as: contrastive vision-language models (CVLs), supervised models, and self-supervised learning (SSL) models.
The idea is that if you can predict traces from a frozen model's embeddings at rates higher than random chance, then the model is definitely encoding relevant information. Across our experiments, we tune the learning rate and weight decay of each probe.
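To make the probing setup concrete, here is a minimal sketch of a linear probe trained on pre-extracted, frozen embeddings, with a small sweep over learning rate and weight decay. This is not our exact training recipe, and the feature tensors below are random placeholders standing in for real extracted embeddings.

```python
# Minimal linear-probing sketch over frozen-encoder embeddings (illustrative, not the exact recipe).
import torch
import torch.nn as nn

def train_linear_probe(train_feats, train_labels, val_feats, val_labels,
                       num_classes, lr=1e-3, weight_decay=1e-4, epochs=50):
    """Fit a single linear layer on frozen features; return validation accuracy."""
    probe = nn.Linear(train_feats.shape[1], num_classes)
    opt = torch.optim.AdamW(probe.parameters(), lr=lr, weight_decay=weight_decay)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(probe(train_feats), train_labels)
        loss.backward()
        opt.step()
    with torch.no_grad():
        preds = probe(val_feats).argmax(dim=1)
    return (preds == val_labels).float().mean().item()

# Placeholder data standing in for 512-d frozen embeddings and 6-way trace labels.
tr_x, tr_y = torch.randn(1000, 512), torch.randint(0, 6, (1000,))
va_x, va_y = torch.randn(200, 512), torch.randint(0, 6, (200,))

# Small sweep over learning rate and weight decay, keeping the best probe.
best_acc, best_lr, best_wd = max(
    (train_linear_probe(tr_x, tr_y, va_x, va_y, num_classes=6, lr=lr, weight_decay=wd), lr, wd)
    for lr in (1e-2, 1e-3, 1e-4)
    for wd in (0.0, 1e-4, 1e-2)
)
```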
To probe for processing traces, we can take a dataset like ImageNet and process its images ourselves, studying the following parameters:
| parameter | # classes | description |
|---|---|---|
| JPEG compression | 6 | amount of JPEG compression |
| sharpening | 3 | amount of sharpening |
| resizing | 3 | amount of resizing |
| interpolation | 4 | type of interpolation during resize |
For the specific classes we study and their implementation, feel free to check out the supplementary material of our paper.
In creating our probing data for a specific parameter, we seek to balance the classes, so the class each image is processed into is sampled uniformly at random. For example, when we're creating our data for probing JPEG compression, there is a \(\frac{1}{6}\) chance that the image is compressed with quality 95 and chroma subsampling 4:2:0, a \(\frac{1}{6}\) chance that the quality is 95 and the chroma subsampling is 4:4:4, and so on and so forth.
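As a rough sketch of this balanced sampling, the snippet below compresses an image into one of six uniformly sampled JPEG classes using Pillow. Only the two settings mentioned above are taken from the text; the remaining entries are placeholders, and the exact class definitions are in the paper's supplementary material.

```python
# Sketch of balanced probe-data creation for the JPEG-compression parameter.
import random
from PIL import Image

JPEG_CLASSES = [
    {"quality": 95, "subsampling": 2},   # quality 95, chroma subsampling 4:2:0 (mentioned above)
    {"quality": 95, "subsampling": 0},   # quality 95, chroma subsampling 4:4:4 (mentioned above)
    {"quality": 75, "subsampling": 2},   # placeholder class
    {"quality": 75, "subsampling": 0},   # placeholder class
    {"quality": 50, "subsampling": 2},   # placeholder class
    {"quality": 50, "subsampling": 0},   # placeholder class
]

def make_probe_example(src_path, dst_path):
    """Compress an image into a uniformly sampled class and return its label."""
    label = random.randrange(len(JPEG_CLASSES))   # each class has a 1/6 probability
    params = JPEG_CLASSES[label]
    Image.open(src_path).convert("RGB").save(dst_path, format="JPEG", **params)
    return label
```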
The results of our experiments are shown below:
First of all, it’s clear that some visual encoders perform well above random accuracy, meaning that processing traces are being stored in their embeddings. We observe this most strongly in CVLs, which can reach 80% test accuracy on JPEG compression (random chance is 16.67%, more than 4\(\times\) lower!). We also see that supervised ConvNeXts perform well, which is surprising given that the only information available during their training is ImageNet labels. Lastly, we note that SSL models tend to perform the weakest. These observations recur throughout this work.
To probe for acquisition traces, we need a dataset that comes with relevant annotations. Fortunately, these can be extracted from images’ Exif metadata. Unfortunately, many readily available datasets don’t provide this metadata. Thus, we use the Flickr API to collect our own dataset of Flickr images together with their Exif metadata, which we call FlickrExif.
With FlickrExif, we can study the following acquisition attributes:
| parameter | # classes | description |
|---|---|---|
| make | 9 | manufacturer of the camera |
| model (all) | 88 | specific camera model used |
| model (smart) | 12 | specific smartphone used |
| model (smart vs non-smart) | 2 | whether camera is a smartphone |
| exposure | 16 | amount of light captured by sensor |
| aperture | 17 | size of the opening in the lens |
| ISO speed | 16 | camera sensor’s sensitivity to light |
| focal length | 13 | distance from lens to sensor |
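For reference, attributes like these can be read from an image's Exif metadata with Pillow. The sketch below is illustrative of that step only (it is not our collection pipeline) and assumes a fairly recent Pillow version that exposes the `ExifTags.Base` and `ExifTags.IFD` enums.

```python
# Sketch of reading acquisition attributes from Exif metadata with Pillow.
from PIL import Image, ExifTags

def read_acquisition_tags(path):
    exif = Image.open(path).getexif()
    exif_ifd = exif.get_ifd(ExifTags.IFD.Exif)  # sub-IFD holding the shooting parameters
    return {
        "make": exif.get(ExifTags.Base.Make),            # camera manufacturer
        "model": exif.get(ExifTags.Base.Model),          # specific camera model
        "exposure": exif_ifd.get(ExifTags.Base.ExposureTime),
        "aperture": exif_ifd.get(ExifTags.Base.FNumber),
        "iso": exif_ifd.get(ExifTags.Base.ISOSpeedRatings),
        "focal_length": exif_ifd.get(ExifTags.Base.FocalLength),
    }
```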
Now, there’s a possibility that there are correlations between image semantics and acquisition labels. For example, if nighttime photos are commonly shot with higher ISO values, then instead of determining whether encoders are sensitive to the low-level features associated with different ISO values, we might simply end up determining which ones can distinguish between daytime and nighttime photos.
To avoid this, we have two fixes. The first is to limit the number of photos a Flickr user contributes per calendar month in our dataset (if wedding photographer Bob took 1,000 photos in one day with similar camera settings, we’d be in trouble). The second is to simply mask a large portion of the image (90%) to scrub out the semantic information. The first fix is incorporated into our dataset, while we leave the second as the recommended preprocessing procedure for future users.
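As a rough illustration of the second fix, the sketch below zeroes out roughly 90% of an image before feature extraction. It assumes random square patches are masked; the exact masking scheme we use may differ, so treat this only as an indicative example.

```python
# Sketch: mask ~90% of an image by zeroing out randomly chosen square patches.
import numpy as np

def mask_image(img, mask_ratio=0.9, patch=16, seed=0):
    """img: HxWxC uint8 array. Zero out `mask_ratio` of the patches."""
    rng = np.random.default_rng(seed)
    out = img.copy()
    h, w = img.shape[:2]
    gh, gw = h // patch, w // patch
    n_mask = int(round(mask_ratio * gh * gw))
    idx = rng.choice(gh * gw, size=n_mask, replace=False)
    for i in idx:
        r, c = divmod(i, gw)
        out[r * patch:(r + 1) * patch, c * patch:(c + 1) * patch] = 0
    return out
```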
With that established, our results are below:
We observe a very similar pattern in these results. CVLs, with the exception of SigLIPs, perform very well, alongside supervised ConvNeXts. Meanwhile, SSL models seem to encode very little acquisition information.
Now that we’ve (quite clearly) established that some visual encoders do indeed encode processing and acquisition traces, we want to show why you should care. We show an example for each type of trace: kNN classification for processing traces and image retrieval for acquisition traces.
Consider four different types of training sets for a kNN classifier:
We plot the results for each set-up below:
Looking at the CVLs, we can see that performance differs wildly depending on whether it’s the positives or the negatives that match the query’s metadata. In the latter case, the metadata becomes the dominant factor in determining the nearest neighbors, essentially “distracting” the classifier and tanking its performance. Conversely, we see that processing has minimal effect in the case of SSL models, showcasing the lack of potentially distracting processing traces in their embeddings.
Take a look below for a direct visualization of how the top-\(k\) neighbors identified by a CVL ConvNeXt-L are affected by JPEG compression. Making the positives’ metadata match the query’s pushes them closer towards the query, while doing so with the negatives pushes the negatives closer towards the query.
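For concreteness, here is a minimal sketch of the kNN step itself: retrieving a query's top-\(k\) neighbors by cosine similarity over encoder embeddings and voting on the label. The embedding tensors are random placeholders; in practice they come from the frozen encoder, with the gallery images processed to either match or mismatch the query's JPEG settings.

```python
# Sketch of kNN classification over (frozen) encoder embeddings.
import torch
import torch.nn.functional as F

def knn_predict(query_emb, gallery_embs, gallery_labels, k=5):
    q = F.normalize(query_emb, dim=-1)
    g = F.normalize(gallery_embs, dim=-1)
    sims = g @ q                      # cosine similarity to every gallery image
    topk = sims.topk(k).indices       # indices of the k nearest neighbors
    votes = gallery_labels[topk]
    return torch.mode(votes).values.item(), topk

# Placeholder embeddings and labels.
query_emb = torch.randn(512)
gallery_embs = torch.randn(1000, 512)
gallery_labels = torch.randint(0, 10, (1000,))
pred, neighbors = knn_predict(query_emb, gallery_embs, gallery_labels, k=5)
```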
Imagine an image retrieval scenario where the target image was captured with a camera type different from the one used to capture the query image. Now imagine two possible collections: a *same* collection, in which the remaining images (the distractors) were captured with the same camera type as the query, and a *different* collection, in which the distractors were captured with a different camera type.
Similar to the previous subsection, we show that acquisition traces, in this case camera type, can also dominate over and distract from semantic information, affecting retrieval.
To do this, we collect a dataset we dub PairCams, available here. We capture 730 pairs of photos of the exact same subject with nearly identical shooting conditions (e.g. angle, camera orientation, time of day, camera shooting mode), with the only difference being camera type. We experiment with modern smartphones and older digital cameras. Despite the difference in camera, the image pairs contain nearly identical semantic content, meaning that it should be trivial to retrieve one using the other.
We go through every visual encoder we’ve used so far to calculate the recall@\(1\) for each collection:
The \(y=x\) line shows where a perfectly robust visual encoder should lie. If a visual encoder is not prone to being distracted by acquisition parameters as it searches for the best semantic match, then performance should be equal in both settings. What we observe, however, is that CVLs again show an extreme sensitivity to acquisition parameters. A visual encoder with near-perfect recall@\(1\) on the different collection can drop to 0.85 recall@\(1\) in the same setting. Zooming into the black square shows another consistent finding: SSL models are among the most robust, staying much closer to the \(y=x\) line than any other model.
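For reference, recall@\(1\) in this setup can be computed as sketched below: each query counts as a hit if its single nearest neighbor in the collection is its paired target. The embedding tensors and pairing indices are hypothetical placeholders.

```python
# Sketch of recall@1 for PairCams-style retrieval.
import torch
import torch.nn.functional as F

def recall_at_1(query_embs, collection_embs, target_idx):
    """collection_embs holds targets and distractors; target_idx[i] is the
    index of query i's paired target within the collection."""
    q = F.normalize(query_embs, dim=-1)
    c = F.normalize(collection_embs, dim=-1)
    nearest = (q @ c.T).argmax(dim=1)          # top-1 retrieved index per query
    return (nearest == target_idx).float().mean().item()

# Placeholder usage: queries[i] is paired with collection[i]; extra rows are distractors.
q_embs = torch.randn(730, 512)
c_embs = torch.randn(1500, 512)
print(recall_at_1(q_embs, c_embs, target_idx=torch.arange(730)))
```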
The evidence that some visual encoders also encode processing and acquisition traces is strong, but we’re still in the middle of figuring out the possible whys behind all of this. It’s a little difficult to test our theories given the use of private datasets for existing models and the cost of training our own models from scratch. We also have yet to explore mitigation techniques.
This phenomenon of course carries implications. Firstly, the fact that metadata traces can overshadow semantic information raises concerns over the robustness and trustworthiness of our current algorithms. We wouldn’t want a malicious agent messing with a model simply by playing with images’ metadata, especially in critical domains like healthcare or autonomous systems. On the other hand, the fact that this information is recoverable from off-the-shelf models could help digital forensics research or deepfake detection.
Thank you for reading!