Publications
Publications by category in reverse chronological order. Generated by jekyll-scholar.
2025
- ICCVW 2025. "From Global to Local: Social Bias Transfer in CLIP." Ryan Ramos, Yusuke Hirota, Yuta Nakashima, and Noa Garcia. In ICCVW, 2025.
The recycling of contrastive language-image pre-trained (CLIP) models as backbones for a large number of downstream tasks calls for a thorough analysis of their transferability implications, especially their well-documented reproduction of social biases and human stereotypes. How do such biases, learned during pre-training, propagate to downstream applications like visual question answering or image captioning? Do they transfer at all? We investigate this phenomenon, referred to as bias transfer in prior literature, through a comprehensive empirical analysis. Firstly, we examine how pre-training bias varies between global and local views of data, finding that bias measurement is highly dependent on the subset of data on which it is computed. Secondly, we analyze correlations between biases in the pre-trained models and the downstream tasks across varying levels of pre-training bias, finding difficulty in discovering consistent trends in bias transfer. Finally, we explore why this inconsistency occurs, showing that under the current paradigm, representation spaces of different pre-trained CLIPs tend to converge when adapted for downstream tasks. We hope this work offers valuable insights into bias behavior and informs future research to promote better bias mitigation practices.
@inproceedings{ramos2025globallocalsocialbias,
  author    = {Ramos, Ryan and Hirota, Yusuke and Nakashima, Yuta and Garcia, Noa},
  title     = {From Global to Local: Social Bias Transfer in CLIP},
  booktitle = {ICCVW},
  year      = {2025},
  url       = {https://arxiv.org/abs/2508.17750},
}
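A central observation in the abstract above is that a bias measurement depends on the subset of data it is computed on. As a rough illustration only (not the paper's actual protocol or metric), the sketch below computes a simple similarity-gap bias score from precomputed, L2-normalized embeddings on the full pool ("global") and on per-subset "local" views; the embeddings, group labels, and subset assignments are hypothetical placeholders.

```python
import torch
import torch.nn.functional as F

def bias_score(image_embeds, group_labels, concept_text_embed):
    """Mean cosine-similarity gap to a concept prompt between two groups (toy metric)."""
    sims = image_embeds @ concept_text_embed                      # similarity per image
    return (sims[group_labels == 0].mean() - sims[group_labels == 1].mean()).item()

# Hypothetical precomputed CLIP-style embeddings standing in for a real dataset.
N, D = 1000, 512
image_embeds = F.normalize(torch.randn(N, D), dim=-1)
group_labels = torch.randint(0, 2, (N,))                          # binary attribute group
subset_ids = torch.randint(0, 5, (N,))                            # e.g. data source of each image
concept = F.normalize(torch.randn(D), dim=-1)                     # embedding of a concept prompt

# "Global" view: bias computed over the whole pool.
print("global bias:", bias_score(image_embeds, group_labels, concept))

# "Local" views: the same score computed per subset can differ substantially.
for s in subset_ids.unique():
    mask = subset_ids == s
    print(f"subset {s.item()} bias:", bias_score(image_embeds[mask], group_labels[mask], concept))
```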
- ICCVW 2025. "Data Leakage in Visual Datasets." Patrick Ramos*, Ryan Ramos*, and Noa Garcia. In ICCVW, 2025.
We analyze data leakage in visual datasets. Data leakage refers to images in evaluation benchmarks that have been seen during training, compromising fair model evaluation. Given that large-scale datasets are often sourced from the internet, where many computer vision benchmarks are publicly available, our efforts are focused on identifying and studying this phenomenon. We characterize visual leakage into different types according to its modality, coverage, and degree. By applying image retrieval techniques, we unequivocally show that all the analyzed datasets present some form of leakage, and that all types of leakage, from severe instances to more subtle cases, compromise the reliability of model evaluation in downstream tasks.
@inproceedings{ramos2025dataleakagevisualdatasets,
  author    = {Ramos*, Patrick and Ramos*, Ryan and Garcia, Noa},
  title     = {Data Leakage in Visual Datasets},
  booktitle = {ICCVW},
  year      = {2025},
  url       = {https://arxiv.org/abs/2508.17416},
}
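A minimal sketch of the kind of retrieval-based check the abstract describes: embed evaluation and training images with the same visual encoder, then flag test images whose nearest training neighbor exceeds a similarity threshold. The encoder, the 0.95 threshold, and the embedding tensors here are illustrative placeholders, not the paper's actual pipeline.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def find_leakage_candidates(test_embeds, train_embeds, threshold=0.95, chunk=1024):
    """Return (test_idx, train_idx, similarity) triples for suspiciously similar pairs."""
    test_embeds = F.normalize(test_embeds, dim=-1)
    train_embeds = F.normalize(train_embeds, dim=-1)
    hits = []
    for start in range(0, test_embeds.size(0), chunk):            # chunk to bound memory
        block = test_embeds[start:start + chunk]
        sims = block @ train_embeds.T                             # cosine similarities
        best_sim, best_idx = sims.max(dim=1)                      # nearest training image
        for i, (s, j) in enumerate(zip(best_sim, best_idx)):
            if s.item() >= threshold:
                hits.append((start + i, j.item(), s.item()))
    return hits

# Hypothetical precomputed embeddings standing in for a benchmark and a training set.
test_embeds = torch.randn(2000, 512)
train_embeds = torch.randn(20000, 512)
print(len(find_leakage_candidates(test_embeds, train_embeds)), "candidate leaked images")
```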
- ICCV 2025. "Processing and acquisition traces in visual encoders: What does CLIP know about your camera?" Ryan Ramos*, Vladan Stojnić*, Giorgos Kordopatis-Zilos, Yuta Nakashima, Giorgos Tolias, and Noa Garcia. In ICCV, 2025.
Prior work has analyzed the robustness of visual encoders to image transformations and corruptions, particularly in cases where such alterations are not seen during training. When this occurs, they introduce a form of distribution shift at test time, often leading to performance degradation. The primary focus has been on severe corruptions that, when applied aggressively, distort useful signals necessary for accurate semantic predictions. We take a different perspective by analyzing parameters of the image acquisition process and transformations that may be subtle or even imperceptible to the human eye. We find that such parameters are systematically encoded in the learned visual representations and can be easily recovered. More strikingly, their presence can have a profound impact, either positively or negatively, on semantic predictions. This effect depends on whether there is a strong correlation or anti-correlation between semantic labels and these acquisition-based or processing-based labels. Our code and data are available at: https://github.com/ryan-caesar-ramos/visual-encoder-traces
@inproceedings{ramos2025processingacquisitiontracesvisual,
  author    = {Ramos*, Ryan and Stojnić*, Vladan and Kordopatis-Zilos, Giorgos and Nakashima, Yuta and Tolias, Giorgos and Garcia, Noa},
  title     = {Processing and acquisition traces in visual encoders: What does CLIP know about your camera?},
  booktitle = {ICCV},
  year      = {2025},
  url       = {https://arxiv.org/abs/2508.10637},
}
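To illustrate the claim that acquisition and processing parameters "can be easily recovered" from learned representations, here is a hedged sketch of a linear probe predicting a processing label (say, a JPEG quality bucket) from frozen encoder features. The features and labels below are synthetic placeholders rather than the paper's data or probing setup.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins: frozen image-encoder features and, per image, the JPEG
# quality bucket it was re-encoded with (an acquisition/processing label).
rng = np.random.default_rng(0)
features = rng.normal(size=(2000, 256)).astype("float32")
jpeg_quality_bucket = rng.integers(0, 4, size=2000)               # 4 illustrative buckets

X_train, X_test, y_train, y_test = train_test_split(
    features, jpeg_quality_bucket, test_size=0.2, random_state=0
)

# A linear probe on frozen features: if such traces are encoded in the representation,
# accuracy will sit well above the 25% chance level (it will not on this random data).
probe = LogisticRegression(max_iter=1000)
probe.fit(X_train, y_train)
print("probe accuracy:", probe.score(X_test, y_test))
```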
- WACV 2025. "No Annotations for Object Detection in Art through Stable Diffusion." Patrick Ramos, Nicolas Gonthier, Selina Khan, Yuta Nakashima, and Noa Garcia. In WACV, 2025.
Object detection in art is a valuable tool for the digital humanities, as it allows for faster identification of objects in artistic and historical images compared to humans. However, annotating such images poses significant challenges due to the need for specialized domain expertise. We present NADA (no annotations for detection in art), a pipeline that leverages diffusion models’ art-related knowledge for object detection in paintings without the need for full bounding box supervision. Our method, which supports both weakly-supervised and zero-shot scenarios and does not require any fine-tuning of its pretrained components, consists of a class proposer based on large vision-language models and a class-conditioned detector based on Stable Diffusion. NADA is evaluated on two artwork datasets, ArtDL 2.0 and IconArt, outperforming prior work in weakly-supervised detection, while being the first work for zero-shot object detection in art. Code is available at https://github.com/patrick-john-ramos/nada
@inproceedings{Ramos_2025_WACV,
  author    = {Ramos, Patrick and Gonthier, Nicolas and Khan, Selina and Nakashima, Yuta and Garcia, Noa},
  title     = {No Annotations for Object Detection in Art through Stable Diffusion},
  booktitle = {WACV},
  year      = {2025},
  url       = {https://arxiv.org/abs/2412.06286},
}
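The abstract describes a two-stage pipeline: a VLM-based class proposer followed by a Stable-Diffusion-based, class-conditioned detector. The skeleton below only mirrors that structure; `propose_classes` and `localize_class` are hypothetical placeholders, not the released NADA code (see the linked repository for the actual implementation).

```python
def detect_without_annotations(image, propose_classes, localize_class):
    """Two-stage structure from the abstract: propose classes, then localize each.

    propose_classes(image) -> list of class names   (VLM-based class proposer)
    localize_class(image, cls) -> list of boxes     (diffusion-based, class-conditioned detector)
    Both callables are hypothetical stand-ins for the stages NADA describes.
    """
    detections = []
    for cls in propose_classes(image):
        for box in localize_class(image, cls):
            detections.append((cls, box))
    return detections

# Dummy usage with toy stand-ins for the two stages.
boxes = detect_without_annotations(
    image=None,
    propose_classes=lambda img: ["Saint Sebastian", "crucifixion"],
    localize_class=lambda img, cls: [(10, 20, 110, 200)],
)
print(boxes)
```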
2023
- MS Thesis. "DistillCLIP: Knowledge Distillation of Contrastive Language-Image Pretrained Models." Patrick Ramos, Raphael Alampay, and Patricia Abu. Ateneo de Manila University, 2023.
Despite CLIP’s performance on vision-language tasks, CLIP’s size limits its deployment in low resource environments. We propose a knowledge distillation scheme to compress a teacher CLIP into a smaller student model we term DistillCLIP. Our framework consists of distilling both intra-modal and inter-modal similarity maps between and within image and text embeddings. DistillCLIP is 43.69% the size of CLIP and has 82.43% its FLOPs. We show that the ability of DistillCLIP to retain teacher performance on zero-shot transfer tasks may depend on the semantic granularity of class labels, preserving only 63.81% of teacher accuracy on average. Meanwhile, DistillCLIP’s linear probe performance matches, and on some datasets surpasses, that of the teacher CLIP with an average retention rate of 100.53%. However, DistillCLIP retains only 12.28% teacher accuracy on average on distribution shift datasets. We also demonstrate that DistillCLIP is able to preserve 99.34% teacher accuracy on video accident recognition in dashcam videos.
@mastersthesis{ramos2023distillclip,
  author  = {Ramos, Patrick and Alampay, Raphael and Abu, Patricia},
  title   = {DistillCLIP: Knowledge Distillation of Contrastive Language-Image Pretrained Models},
  school  = {Ateneo de Manila University},
  year    = {2023},
  address = {Quezon City, Philippines},
}
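The abstract states that the distillation signal consists of intra-modal and inter-modal similarity maps between image and text embeddings. Below is one plausible PyTorch sketch of such a loss, assuming MSE between teacher and student maps; the exact map definitions, temperatures, and weights used by DistillCLIP may differ, and the embeddings come from placeholder encoders.

```python
import torch
import torch.nn.functional as F

def similarity_maps(img_emb, txt_emb):
    """Intra-modal (image-image, text-text) and inter-modal (image-text) similarity maps."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    return img_emb @ img_emb.T, txt_emb @ txt_emb.T, img_emb @ txt_emb.T

def distillation_loss(student_img, student_txt, teacher_img, teacher_txt):
    """MSE between student and teacher similarity maps (one illustrative choice of loss)."""
    s_maps = similarity_maps(student_img, student_txt)
    with torch.no_grad():
        t_maps = similarity_maps(teacher_img, teacher_txt)
    return sum(F.mse_loss(s, t) for s, t in zip(s_maps, t_maps))

# Toy batch: embeddings from hypothetical student/teacher encoders. Dimensions may
# differ per model, but the similarity maps are batch-sized, so they stay comparable.
B = 8
loss = distillation_loss(
    torch.randn(B, 256, requires_grad=True), torch.randn(B, 256, requires_grad=True),
    torch.randn(B, 512), torch.randn(B, 512),
)
loss.backward()
print(loss.item())
```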
- MS Thesis. "Causal Interventions for Robust Visual Question Answering." Ryan Ramos, Raphael Alampay, and Patricia Abu. Ateneo de Manila University, 2023.
Contemporary visual question answering (VQA) models have been shown to exhibit poor out-of-distribution (OOD) generalization ability due to their tendency to learn superficial statistical correlations from training data as opposed to more reliable underlying causal features. This can be addressed by widening the training distribution through data augmentation, but although recent advances have been made in generative modelling and the training of large foundation models, the application of these methods for data augmentation targeting robust VQA remains underexplored. This study proposes a novel approach to ensembling foundation models in order to generate OOD datapoints to widen the distribution of a training dataset. In particular, this study proposes a novel token sampling method to perturb existing image captions into OOD captions, which can then be used to steer a pretrained text-to-image model. The resulting images along with the original questions and answers can then be used to finetune a VQA model that has only been trained on the original training dataset. This method is empirically shown to lead to robustness improvements; with a BLIP pretrained on VQA v2.0, finetuning with the study’s generated data introduces a 7.59% accuracy drop reduction on AQUA and a 1.43% accuracy drop reduction on VizWiz.
@mastersthesis{ramos2023causal,
  author  = {Ramos, Ryan and Alampay, Raphael and Abu, Patricia},
  title   = {Causal Interventions for Robust Visual Question Answering},
  school  = {Ateneo de Manila University},
  year    = {2023},
  address = {Quezon City, Philippines},
}
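The thesis's token sampling method for producing OOD captions is not detailed in the abstract, so the snippet below is only a toy stand-in showing where such a perturbation sits in the pipeline: swap a few caption tokens for words from an out-of-domain vocabulary, then hand the perturbed caption to a text-to-image model. The vocabulary, swap rule, and caption are all illustrative assumptions.

```python
import random

def perturb_caption(caption, replacement_vocab, num_swaps=1, rng=random):
    """Toy stand-in for the thesis's token-sampling perturbation: swap a few tokens
    for out-of-domain words to push the caption OOD. The actual method samples
    tokens with a targeted criterion rather than uniformly at random."""
    tokens = caption.split()
    for _ in range(num_swaps):
        i = rng.randrange(len(tokens))
        tokens[i] = rng.choice(replacement_vocab)
    return " ".join(tokens)

random.seed(0)
ood_vocab = ["sketch", "sculpture", "underwater", "neon", "origami"]   # illustrative only
caption = "a man riding a horse on a beach"
ood_caption = perturb_caption(caption, ood_vocab, num_swaps=2)
print(ood_caption)
# The perturbed caption would then condition an off-the-shelf text-to-image model,
# and the generated image is paired with the original question/answer for finetuning.
```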
- CORES’23. "Knowledge Distillation with Relative Representations for Image Representation Learning." Patrick Ramos, Raphael Alampay, and Patricia Abu. In CORES, 2023.
Relative representations allow the alignment of latent spaces which embed data in extrinsically different manners but with similar relative distances between data points. This ability to compare different latent spaces for the same input lends itself to knowledge distillation techniques. We explore the applicability of relative representations to knowledge distillation by training a student model such that the relative representations of its outputs match the relative representations of the outputs of a teacher model. We test our Relative Representation Knowledge Distillation (RRKD) scheme on supervised and self-supervised image representation learning with MNIST and show that an encoder can be compressed to 47.71% of its original size while maintaining 91.92% of its full performance. We demonstrate that RRKD is competitive with or outperforms other relation-based distillation schemes in traditional distillation setups (CIFAR-10, CIFAR-100, SVHN) and in a transfer learning setting (Stanford Cars, Oxford-IIIT Pets, Oxford Flowers-102). Our results indicate that relative representations are an effective signal for knowledge distillation. Code will be made available at https://github.com/Ramos-Ramos/rrkd.
@inproceedings{ramos2023knowledge,
  title     = {Knowledge Distillation with Relative Representations for Image Representation Learning},
  author    = {Ramos, Patrick and Alampay, Raphael and Abu, Patricia},
  booktitle = {CORES},
  year      = {2023},
  url       = {https://link.springer.com/chapter/10.1007/978-3-031-41630-9_14},
  doi       = {10.1007/978-3-031-41630-9_14},
}
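Relative representations express each embedding as its similarities to a shared set of anchor samples, which makes otherwise incomparable teacher and student spaces comparable. The sketch below shows one plausible form of the RRKD objective described above; the anchors, dimensions, and the MSE loss are illustrative choices, and the released code at the linked repository is the authoritative implementation.

```python
import torch
import torch.nn.functional as F

def relative_representation(embeddings, anchor_embeddings):
    """Cosine similarity of each embedding to a fixed set of anchors: (B, D) -> (B, A)."""
    e = F.normalize(embeddings, dim=-1)
    a = F.normalize(anchor_embeddings, dim=-1)
    return e @ a.T

def rrkd_loss(student_emb, teacher_emb, student_anchor_emb, teacher_anchor_emb):
    """Train the student so its relative representation matches the teacher's."""
    with torch.no_grad():
        target = relative_representation(teacher_emb, teacher_anchor_emb)
    pred = relative_representation(student_emb, student_anchor_emb)
    return F.mse_loss(pred, target)

# Toy example: teacher and student embed the same batch and the same 64 anchor images,
# possibly with different dimensionalities; both relative representations are (B, 64).
B, A = 16, 64
loss = rrkd_loss(torch.randn(B, 128, requires_grad=True), torch.randn(B, 768),
                 torch.randn(A, 128), torch.randn(A, 768))
loss.backward()
print(loss.item())
```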
- CORES’23. "Exploring Text-Guided Synthetic Distribution Shifts for Robust Image Classification." Ryan Ramos, Raphael Alampay, and Patricia Abu. In CORES, 2023.
The empirical risk minimization approach of contemporary machine learning leads to potential failures under distribution shifts. While out-of-distribution data can be used to probe for robustness issues, collecting this at scale in the wild can be difficult given its nature. We propose a novel method to generate this data using pretrained foundation models. We train a language model to generate class-conditioned image captions that minimize their cosine similarity with that of corresponding class images from the original distribution. We then use these captions to synthesize new images with off-the-shelf text-to-image generative models. We show our method’s ability to generate samples from shifted distributions, and demonstrate that the resulting data is useful both for robustness testing and as additional training data that improves generalization.
@inproceedings{ramos2023exploring,
  title     = {Exploring Text-Guided Synthetic Distribution Shifts for Robust Image Classification},
  author    = {Ramos, Ryan and Alampay, Raphael and Abu, Patricia},
  booktitle = {CORES},
  year      = {2023},
  url       = {https://link.springer.com/chapter/10.1007/978-3-031-41630-9_16},
  doi       = {10.1007/978-3-031-41630-9_16},
}
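The quantity the caption generator is trained to minimize, per the abstract, is the cosine similarity between a generated class-conditioned caption and the corresponding class images from the original distribution. A minimal sketch of that objective is below, assuming a shared CLIP-like text/image embedding space; the embeddings are random placeholders and how the signal is fed back to the language model is not shown.

```python
import torch
import torch.nn.functional as F

def shift_objective(caption_embedding, class_image_embeddings):
    """Cosine similarity between a generated caption and the class's original images.
    Lower similarity -> the caption describes the class in a more out-of-distribution way."""
    cap = F.normalize(caption_embedding, dim=-1)
    imgs = F.normalize(class_image_embeddings, dim=-1)
    return (imgs @ cap).mean()

# Hypothetical embeddings from a shared text/image space (e.g. a CLIP-like model).
caption_embedding = torch.randn(512)
class_image_embeddings = torch.randn(200, 512)        # original-distribution images of one class
sim = shift_objective(caption_embedding, class_image_embeddings)
print("similarity to original class images:", sim.item())
```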
2022
- ACLW 2022. "Emotion Analysis of Writers and Readers of Japanese Tweets on Vaccinations." Patrick John Ramos, Kiki Ferawati, Kongmeng Liew, Eiji Aramaki, and Shoko Wakamiya. In ACLW, 2022.
Public opinion in social media is increasingly becoming a critical factor in pandemic control. Understanding the emotions of a population towards vaccinations and COVID-19 may be valuable in convincing members to become vaccinated. We investigated the emotions of Japanese Twitter users towards Tweets related to COVID-19 vaccination. Using the WRIME dataset, which provides emotion ratings for Japanese Tweets sourced from writers (Tweet posters) and readers, we fine-tuned a BERT model to predict levels of emotional intensity. This model achieved a training MSE of 0.356. A separate dataset of 20,254 Japanese Tweets containing COVID-19 vaccine-related keywords was also collected, on which the fine-tuned BERT was used to perform emotion analysis. Afterwards, a correlation analysis between the extracted emotions and a set of vaccination measures in Japan was conducted. The results revealed that surprise and fear were the most intense emotions predicted by the model for writers and readers, respectively, on the vaccine-related Tweet dataset. The correlation analysis also showed that vaccinations were weakly positively correlated with predicted levels of writer joy, writer/reader anticipation, and writer/reader trust.
@inproceedings{ramos-etal-2022-emotion,
  title     = {Emotion Analysis of Writers and Readers of {J}apanese Tweets on Vaccinations},
  author    = {Ramos, Patrick John and Ferawati, Kiki and Liew, Kongmeng and Aramaki, Eiji and Wakamiya, Shoko},
  booktitle = {ACLW},
  year      = {2022},
  url       = {https://aclanthology.org/2022.wassa-1.10},
}
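The final step described in the abstract is a correlation analysis between model-predicted emotion intensities and vaccination measures. Here is a minimal sketch of that step with Pearson correlation on hypothetical daily aggregates; the series, their units, and the choice of a single emotion/measure pair are placeholders rather than the study's data.

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical daily aggregates standing in for the study's data: mean predicted
# emotion intensity over that day's vaccine-related tweets, and a vaccination measure.
rng = np.random.default_rng(0)
days = 120
writer_joy = rng.uniform(0, 3, size=days)                         # predicted intensity, daily mean
daily_vaccinations = rng.integers(1000, 50000, size=days).astype(float)

# The kind of analysis the abstract reports: correlate each predicted emotion series
# with each vaccination measure (here, a single Pearson correlation as an example).
r, p = pearsonr(writer_joy, daily_vaccinations)
print(f"Pearson r = {r:.3f}, p = {p:.3f}")
```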