Selected Publications

Note: The top three computer vision conferences (CVPR, ICCV and ECCV) are highly competitive, with acceptance rates of roughly 20-30%. CVPR is ranked #1 in Google Scholar Metrics among all journals and conferences in Computer Vision & Pattern Recognition.

2022

   Learned Vertex Descent: A New Direction for 3D Human Model Fitting   
E.Corona, G.Pons-Moll, G.Alenyà and F.Moreno-Noguer
European Conference on Computer Vision (ECCV), 2022

Paper  Abstract  Project page  Code  Bibtex

@inproceedings{Corona_eccv2022,
title = {Learned Vertex Descent: A New Direction for 3D Human Model Fitting},
author = {Enric Corona and Gerard Pons-Moll and Guillem Alenyà and Francesc Moreno-Noguer},
booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
year = {2022}
}

We propose a novel optimization-based paradigm for 3D human model fitting on images and scans. In contrast to existing approaches that directly regress the parameters of a low-dimensional statistical body model (e.g. SMPL) from input images, we train an ensemble of per-vertex neural fields. The network predicts, in a distributed manner, the vertex descent direction towards the ground truth, based on neural features extracted at the current vertex projection. At inference, we employ this network, dubbed LVD, within a gradient-descent optimization pipeline until its convergence, which typically occurs in a fraction of a second even when initializing all vertices into a single point. An exhaustive evaluation demonstrates that our approach is able to capture the underlying body of clothed people with very different body shapes, achieving a significant improvement compared to the state of the art. LVD is also applicable to 3D model fitting of humans and hands, for which we show a significant improvement over the state of the art with a much simpler and faster method. Code is released at https://www.iri.upc.edu/people/ecorona/lvd/
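
To make the iterative fitting loop concrete, here is a minimal Python sketch of LVD-style inference, written under stated assumptions: lvd_net is a stand-in for the per-vertex network that maps image features to a 3D descent direction, and sample_features is a hypothetical helper that gathers neural features at the current vertex projections. It illustrates the idea only, not the released code.

import torch

def fit_vertices(lvd_net, sample_features, image_features, num_vertices=6890, steps=50):
    # Initialize all vertices at a single point, as described in the abstract.
    verts = torch.zeros(num_vertices, 3)
    for _ in range(steps):
        feats = sample_features(image_features, verts)   # (V, F) features at current projections
        delta = lvd_net(feats, verts)                    # (V, 3) predicted descent direction
        verts = verts + delta                            # move each vertex toward the surface
    return verts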

   Conditional-Flow NeRF: Accurate 3D Modelling with Reliable Uncertainty Quantification   
J.Shen, A.Agudo, F.Moreno-Noguer and A.Ruiz
European Conference on Computer Vision (ECCV), 2022

Paper  Abstract  Bibtex

@inproceedings{Shen_eccv2022,
title = {Conditional-Flow NeRF: Accurate 3D Modelling with Reliable Uncertainty Quantification},
author = {Jianxiong Shen and Antonio Agudo and Francesc Moreno-Noguer and Adria Ruiz},
booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
year = {2022}
}

A critical limitation of current methods based on Neural Radiance Fields (NeRF) is that they are unable to quantify the uncertainty associated with the learned appearance and geometry of the scene. This information is paramount in real applications such as medical diagnosis or autonomous driving where, to reduce potentially catastrophic failures, the confidence in the model outputs must be included in the decision-making process. In this context, we introduce Conditional-Flow NeRF (CF-NeRF), a novel probabilistic framework to incorporate uncertainty quantification into NeRF-based approaches. For this purpose, our method learns a distribution over all possible radiance fields modelling the scene, which is used to quantify the uncertainty associated with the modelled scene. In contrast to previous approaches enforcing strong constraints over the radiance field distribution, CF-NeRF learns it in a flexible and fully data-driven manner by coupling Latent Variable Modelling and Conditional Normalizing Flows. This strategy allows obtaining reliable uncertainty estimates while preserving model expressivity. Compared to previous state-of-the-art methods proposed for uncertainty quantification in NeRF, our experiments show that the proposed method achieves significantly lower prediction errors and more reliable uncertainty values for synthetic novel view and depth-map estimation.
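
The core intuition, learning a distribution over radiance fields and reading the uncertainty off the spread of its samples, can be illustrated with the rough Python sketch below. sample_radiance_field and render are hypothetical placeholders; the actual method couples latent variable modelling with conditional normalizing flows rather than naive sampling.

import torch

def render_with_uncertainty(sample_radiance_field, render, rays, num_samples=16):
    renders = []
    for _ in range(num_samples):
        field = sample_radiance_field()        # one radiance field drawn from the learned distribution
        renders.append(render(field, rays))    # (num_rays, 3) RGB prediction for this sample
    renders = torch.stack(renders)             # (S, num_rays, 3)
    mean = renders.mean(dim=0)                 # expected color per ray
    uncertainty = renders.var(dim=0).sum(-1)   # per-ray predictive variance as an uncertainty proxy
    return mean, uncertainty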

   PoseScript: 3D Human Poses from Natural Language   
G.Delmas, P.Weinzaepfel, T.Lucas, F.Moreno-Noguer and G.Rogez
European Conference on Computer Vision (ECCV), 2022

Paper  Abstract  Project page  Bibtex

@inproceedings{Delmas_eccv2022,
title = {PoseScript: 3D Human Poses from Natural Language},
author = {Ginger Delmas and Philippe Weinzaepfel and Thomas Lucas and Francesc Moreno-Noguer and Grégory Rogez},
booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
year = {2022}
}

Multi-Person Extreme Motion Prediction   
W.Guo, X.Bie, X.Alameda-Pineda and F.Moreno-Noguer
Conference on Computer Vision and Pattern Recognition (CVPR), 2022

Paper  Abstract  Project page  Bibtex

@inproceedings{Guo_cvpr2022,
title = {Multi-Person Extreme Motion Prediction},
author = {Wen Guo and Xiaoyu Bie and Xavier Alameda-Pineda and Francesc Moreno-Noguer},
booktitle = {Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2022}
}

Human motion prediction aims to forecast future poses given a sequence of past 3D skeletons. While this problem has recently received increasing attention, it has mostly been tackled for single humans in isolation. In this paper, we explore this problem when dealing with humans performing collaborative tasks: we seek to predict the future motion of two interacting persons given two sequences of their past skeletons. We propose a novel cross-interaction attention mechanism that exploits historical information of both persons and learns to predict cross dependencies between the two pose sequences. Since no dataset for training on such interactive situations is available, we collected ExPI (Extreme Pose Interaction), a new lab-based person interaction dataset of professional dancers performing Lindy Hop dancing actions, which contains 115 sequences with 30K frames annotated with 3D body poses and shapes. We thoroughly evaluate our cross-interaction network on ExPI and show that, both in short- and long-term predictions, it consistently outperforms state-of-the-art methods for single-person motion prediction. Our code and dataset are available at: https://team.inria.fr/robotlearn/multi-person-extreme-motion-prediction/
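
As a rough illustration of the cross-interaction attention idea (illustrative dimensions, not the authors' exact architecture), the toy PyTorch module below lets each person's motion embedding attend to the other person's history:

import torch
import torch.nn as nn

class CrossInteractionAttention(nn.Module):
    # Toy cross-attention: person A queries the motion history of person B, and vice versa.
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.attn_ab = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.attn_ba = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, seq_a, seq_b):
        # seq_a, seq_b: (batch, time, dim) embeddings of the two past skeleton sequences
        a_ctx, _ = self.attn_ab(seq_a, seq_b, seq_b)   # A attends to B's history
        b_ctx, _ = self.attn_ba(seq_b, seq_a, seq_a)   # B attends to A's history
        return seq_a + a_ctx, seq_b + b_ctx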

LISA: Learning Implicit Shape and Appearance of Hands   
E.Corona, T.Hogan, M.Vo, F.Moreno-Noguer, C.Sweeney, R.Newcombe and L.Ma
Conference on Computer Vision and Pattern Recognition (CVPR), 2022

Paper  Abstract  Project page  Bibtex

@inproceedings{Corona_cvpr2022,
title = {{LISA}: Learning Implicit Shape and Appearance of Hands},
author = {Enric Corona and Tomas Hogan and Minh Vo and Francesc Moreno-Noguer and Chris Sweeney and Richard Newcombe and Lingni Ma},
booktitle = {Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2022}
}

This paper proposes a do-it-all neural model of human hands, named LISA. The model can capture accurate hand shape and appearance, generalize to arbitrary hand subjects, provide dense surface correspondences, be reconstructed from images in the wild, and can be easily animated. We train LISA by minimizing the shape and appearance losses on a large set of multi-view RGB image sequences annotated with coarse 3D poses of the hand skeleton. For a 3D point in the local hand coordinates, our model predicts the color and the signed distance with respect to each hand bone independently, and then combines the per-bone predictions using the predicted skinning weights. The shape, color, and pose representations are disentangled by design, enabling fine control of the selected hand parameters. We experimentally demonstrate that LISA can accurately reconstruct a dynamic hand from monocular or multi-view sequences, achieving a noticeably higher quality of reconstructed hand shapes compared to baseline approaches. Project page: https://www.iri.upc.edu/people/ecorona/lisa/.
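
The per-bone prediction plus skinning-weight blend described above can be sketched as follows. bone_nets and skin_net are hypothetical components, and the real model additionally expresses the query points in local bone coordinates before evaluating each per-bone network.

import torch

def lisa_query(points, bone_nets, skin_net):
    # points: (N, 3) query points; bone_nets: per-bone MLPs returning (sdf, rgb); skin_net: (N, B) logits.
    weights = torch.softmax(skin_net(points), dim=-1)                    # (N, B) skinning weights
    sdfs, colors = [], []
    for net in bone_nets:
        sdf_b, rgb_b = net(points)                                       # (N,), (N, 3) per-bone predictions
        sdfs.append(sdf_b)
        colors.append(rgb_b)
    sdf = (weights * torch.stack(sdfs, dim=-1)).sum(-1)                  # blended signed distance, (N,)
    rgb = (weights.unsqueeze(1) * torch.stack(colors, dim=-1)).sum(-1)   # blended color, (N, 3)
    return sdf, rgb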

2021

3D Human Pose, Shape and Texture from Low-Resolution Images and Videos 
X.Xu, H.Chen, F.Moreno-Noguer, L.Jeni and F. De la Torre
IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 2021

Paper  Abstract  Bibtex

@article{Xu_pami2021,
title = {3D Human Pose, Shape and Texture from Low-Resolution Images and Videos},
author = {Xiangyu Xu and Hao Chen and Francesc Moreno-Noguer and Laszlo Attila Jeni and Fernando De la Torre},
journal = {IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI)},
issn = {0162-8828},
year = {2021}
}

3D human pose and shape estimation from monocular images has been an active research area in computer vision. Existing deep learning methods for this task rely on high-resolution input, which, however, is not always available in many scenarios such as video surveillance and sports broadcasting. Two common approaches to deal with low-resolution images are applying super-resolution techniques to the input, which may result in unpleasant artifacts, or simply training one model for each resolution, which is impractical in many realistic applications. To address the above issues, this paper proposes a novel algorithm called RSC-Net, which consists of a Resolution-aware network, a Self-supervision loss, and a Contrastive learning scheme. The proposed method is able to learn 3D body pose and shape across different resolutions with one single model. The self-supervision loss enforces scale-consistency of the output, and the contrastive learning scheme enforces scale-consistency of the deep features. We show that both of these new losses provide robustness when learning in a weakly-supervised manner. Moreover, we extend the RSC-Net to handle low-resolution videos and apply it to reconstruct textured 3D pedestrians from low-resolution input. Extensive experiments demonstrate that the RSC-Net can achieve consistently better results than the state-of-the-art methods for challenging low-resolution images.
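
A toy version of the two consistency terms is sketched below, assuming a model that returns (parameters, features) for a (B, C, H, W) image batch; the real contrastive scheme also uses negative pairs from other images, which are omitted here for brevity.

import torch.nn.functional as F

def scale_consistency_losses(model, image, scales=(1.0, 0.5, 0.25)):
    outputs, feats = [], []
    for s in scales:
        img_s = image if s == 1.0 else F.interpolate(image, scale_factor=s, mode='bilinear', align_corners=False)
        params, f = model(img_s)
        outputs.append(params)
        feats.append(F.normalize(f, dim=-1))
    # Self-supervision: low-resolution predictions should match the high-resolution ones.
    loss_out = sum(F.mse_loss(o, outputs[0].detach()) for o in outputs[1:])
    # Simplified contrastive term: features of the same image should stay close across scales.
    loss_feat = sum(1.0 - (f * feats[0]).sum(-1).mean() for f in feats[1:])
    return loss_out, loss_feat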

H3D-Net: Few-Shot High-Fidelity 3D Head Reconstruction   
E.Ramon, G.Triginer, J.Escur, A.Pumarola, J.Garcia, X.Giro-i-Nieto and F.Moreno-Noguer
International Conference on Computer Vision (ICCV), 2021

Paper  Abstract  Project page  Bibtex

@inproceedings{Ramon_iccv2021,
title = {H3D-Net: Few-Shot High-Fidelity 3D Head Reconstruction},
author = {Eduard Ramon and Gil Triginer and Janna Escur and Albert Pumarola and Jaime Garcia and Xavier Giro-i-Nieto and Francesc Moreno-Noguer},
booktitle = {Proceedings of the International Conference on Computer Vision (ICCV)},
year = {2021}
}

Recent learning approaches that implicitly represent surface geometry using coordinate-based neural representations have shown impressive results in the problem of multi-view 3D reconstruction. The effectiveness of these techniques is, however, subject to the availability of a large number (several tens) of input views of the scene, and computationally demanding optimizations. In this paper, we tackle these limitations for the specific problem of few-shot full 3D head reconstruction, by endowing coordinate-based representations with a probabilistic shape prior that enables faster convergence and better generalization when using few input images (down to three). First, we learn a shape model of 3D heads from thousands of incomplete raw scans using implicit representations. At test time, we jointly overfit two coordinate-based neural networks to the scene, one modelling the geometry and another estimating the surface radiance, using implicit differentiable rendering. We devise a two-stage optimization strategy in which the learned prior is used to initialize and constrain the geometry during an initial optimization phase. Then, the prior is unfrozen and fine-tuned to the scene. By doing this, we achieve high-fidelity head reconstructions, including hair and shoulders, and with a high level of detail that consistently outperforms both state-of-the-art 3D Morphable Models methods in the few-shot scenario, and non-parametric methods when large sets of views are available.
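
The two-stage optimization strategy can be sketched as a simple schedule. geometry_net, radiance_net and render_loss are hypothetical stand-ins for the implicit surface network, the radiance network and the differentiable-rendering loss; only the structure of the schedule reflects the description above.

import torch

def two_stage_fit(geometry_net, radiance_net, render_loss, prior_latent, steps=(1000, 2000)):
    latent = prior_latent.clone().requires_grad_(True)
    # Stage 1: the learned head prior constrains the geometry; only the latent code
    # (and the radiance network) are optimized.
    opt = torch.optim.Adam([latent] + list(radiance_net.parameters()), lr=1e-3)
    for _ in range(steps[0]):
        opt.zero_grad()
        render_loss(geometry_net, radiance_net, latent).backward()
        opt.step()
    # Stage 2: the prior is unfrozen and the geometry network is fine-tuned to the scene.
    params = [latent] + list(geometry_net.parameters()) + list(radiance_net.parameters())
    opt = torch.optim.Adam(params, lr=1e-4)
    for _ in range(steps[1]):
        opt.zero_grad()
        render_loss(geometry_net, radiance_net, latent).backward()
        opt.step()
    return latent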

Generating Attribution Maps with Disentangled Masked Backpropagation   
A.Ruiz, A.Agudo and F.Moreno-Noguer
International Conference on Computer Vision (ICCV), 2021

Paper  Abstract  Supplemental  Bibtex

@inproceedings{Ruiz_iccv2021,
title = {Generating Attribution Maps with Disentangled Masked Backpropagation},
author = {Adria Ruiz and Antonio Agudo and Francesc Moreno-Noguer},
booktitle = {Proceedings of the International Conference on Computer Vision (ICCV)},
year = {2021}
}

Attribution map visualization has arisen as one of the most effective techniques to understand the underlying inference process of Convolutional Neural Networks. In this task, the goal is to compute a score for each image pixel related to its contribution to the network output. In this paper, we introduce Disentangled Masked Backpropagation (DMBP), a novel gradient-based method that leverages the piecewise linear nature of ReLU networks to decompose the model function into different linear mappings. This decomposition aims to disentangle the attribution maps into positive, negative and nuisance factors by learning a set of variables masking the contribution of each filter during back-propagation. A thorough evaluation over standard architectures (ResNet50 and VGG16) and benchmark datasets (PASCAL VOC and ImageNet) demonstrates that DMBP generates more visually interpretable attribution maps than previous approaches. Additionally, we quantitatively show that the maps produced by our method are more consistent with the true contribution of each pixel to the final network output.

Neural Cellular Automata Manifold    (Oral)
A.Hernandez, A.Vilalta and F.Moreno-Noguer
Conference on Computer Vision and Pattern Recognition (CVPR), 2021

Paper  Abstract  Bibtex

@inproceedings{Hernandez_cvpr2021,
title = {Neural Cellular Automata Manifold},
author = {Alejandro Hernandez and Armand Vilalta and Francesc Moreno-Noguer},
booktitle = {Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2021}
}

Very recently, the Neural Cellular Automata (NCA) has been proposed to simulate the morphogenesis process with deep networks. NCA learns to grow an image starting from a fixed single pixel. In this work, we show that the neural network (NN) architecture of the NCA can be encapsulated in a larger NN. This allows us to propose a new model that encodes a manifold of NCA, each of them capable of generating a distinct image. Therefore, we are effectively learning an embedding space of CA, which shows generalization capabilities. We accomplish this by introducing dynamic convolutions inside an Auto-Encoder architecture, used for the first time to join two different sources of information: the encoding and the cell's environment information. In biological terms, our approach would play the role of the transcription factors, modulating the mapping of genes into specific proteins that drive cellular differentiation, which occurs right before the morphogenesis. We thoroughly evaluate our approach on a dataset of synthetic emojis and also on real images of CIFAR-10. Our model introduces a general-purpose network, which can be used in a broad range of problems beyond image generation.

SMPLicit: Topology-aware Generative Model for Clothed People   
E.Corona, A.Pumarola, G.Alenyà, G.Pons-Moll and F.Moreno-Noguer
Conference on Computer Vision and Pattern Recognition (CVPR), 2021

Paper  Abstract  Project page  Bibtex

@inproceedings{Corona_cvpr2021,
title = {SMPLicit: Topology-aware Generative Model for Clothed People},
author = {Enric Corona and Albert Pumarola and Guillem Aleny{\`a} and Gerard Pons-Moll and Francesc Moreno-Noguer},
booktitle = {Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2021}
}

In this paper we introduce SMPLicit, a novel generative model to jointly represent body pose, shape and clothing geometry. In contrast to existing learning-based approaches that require training specific models for each type of garment, SMPLicit can represent in a unified manner different garment topologies (e.g. from sleeveless tops to hoodies and open jackets), while controlling other properties like the garment size or tightness/looseness. We show our model to be applicable to a large variety of garments including T-shirts, hoodies, jackets, shorts, pants, skirts, shoes and even hair. The representation flexibility of SMPLicit builds upon an implicit model conditioned on the SMPL human body parameters and a learnable latent space which is semantically interpretable and aligned with the clothing attributes. The proposed model is fully differentiable, allowing for its use in larger end-to-end trainable systems. In the experimental section, we demonstrate SMPLicit can be readily used for fitting 3D scans and for 3D reconstruction in images of dressed people. In both cases we are able to go beyond the state of the art, by retrieving complex garment geometries, handling situations with multiple clothing layers and providing a tool for easy outfit editing. To stimulate further research in this direction, we will make our code and model publicly available at http://www.iri.upc.edu/people/ecorona/smplicit/.

D-NeRF: Neural Radiance Fields for Dynamic Scenes   
A.Pumarola, E.Corona, G.Pons-Moll and F.Moreno-Noguer
Conference on Computer Vision and Pattern Recognition (CVPR), 2021

Paper  Abstract  Project page  Bibtex

@inproceedings{Pumarola_cvpr2021,
title = {D-NeRF: Neural Radiance Fields for Dynamic Scenes},
author = {Albert Pumarola and Enric Corona and Gerard Pons-Moll and Francesc Moreno-Noguer},
booktitle = {Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2021}
}

Neural rendering techniques combining machine learning with geometric reasoning have arisen as one of the most promising approaches for synthesizing novel views of a scene from a sparse set of images. Among these, Neural Radiance Fields (NeRF) stands out, training a deep network to map 5D input coordinates (representing spatial location and viewing direction) into a volume density and view-dependent emitted radiance. However, despite achieving an unprecedented level of photorealism in the generated images, NeRF is only applicable to static scenes, where the same spatial location can be queried from different images. In this paper we introduce D-NeRF, a method that extends neural radiance fields to a dynamic domain, allowing us to reconstruct and render novel images of objects under rigid and non-rigid motions from a single camera moving around the scene. For this purpose we consider time as an additional input to the system, and split the learning process into two main stages: one that encodes the scene into a canonical space and another that maps this canonical representation into the deformed scene at a particular time. Both mappings are simultaneously learned using fully-connected networks. Once the networks are trained, D-NeRF can render novel images, controlling both the camera view and the time variable, and thus, the object movement. We demonstrate the effectiveness of our approach on scenes with objects under rigid, articulated and non-rigid motions. Code, model weights and the dynamic scenes dataset will be released.
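
A minimal sketch of the two-mapping split described above (illustrative module and tensor shapes, not the released implementation): a deformation network maps a point at time t back to the canonical space, and a canonical network predicts density and radiance there.

import torch
import torch.nn as nn

class DNeRFSketch(nn.Module):
    def __init__(self, deform_net, canonical_net):
        super().__init__()
        self.deform_net = deform_net        # hypothetical MLP: (x, t) -> displacement to canonical space
        self.canonical_net = canonical_net  # hypothetical MLP: (x_canonical, view_dir) -> (density, rgb)

    def forward(self, x, view_dir, t):
        # x: (N, 3) sample points, view_dir: (N, 3) viewing directions, t: (N, 1) time stamps
        delta = self.deform_net(torch.cat([x, t], dim=-1))
        x_canonical = x + delta
        return self.canonical_net(x_canonical, view_dir)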

Uncertainty-Aware Camera Pose Estimation from Points and Lines   
A.Vakhitov, L.Ferraz, A.Agudo and F.Moreno-Noguer
Conference on Computer Vision and Pattern Recognition (CVPR), 2021

Paper  Abstract  Code  Bibtex

@inproceedings{Vakhitov_cvpr2021,
title = {Uncertainty-Aware Camera Pose Estimation from Points and Lines},
author = {Alexander Vakhitov and Luis Ferraz and Antonio Agudo and Francesc Moreno-Noguer},
booktitle = {Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2021}
}

Perspective-n-Point-and-Line (PnPL) algorithms aim at fast, accurate, and robust camera localization with respect to a 3D model from 2D-3D feature correspondences, being a major part of modern robotic and AR/VR systems. Current point-based pose estimation methods use only 2D feature detection uncertainties, and the line-based methods do not take uncertainties into account. In our setup, both 3D coordinates and 2D projections of the features are considered uncertain. We propose globally convergent PnP solvers based on EPnP and DLS for uncertainty-aware pose estimation. We also modify the motion-only bundle adjustment to take 3D uncertainties into account. We perform exhaustive synthetic and real experiments on two different visual odometry datasets. The new PnP(L) methods outperform the state-of-the-art on real data in isolation, showing an increase in mean translation accuracy of 12% on a representative subset of KITTI, while the new uncertainty-aware refinement improves pose accuracy for most of the solvers, e.g. decreasing the mean translation error for EPnP by 5% compared to the standard pose refinement on the same dataset. We will release the code of the proposed methods.

2020

GANimation: One-Shot Anatomically Consistent Facial Animation 
A.Pumarola, A.Agudo, A.M.Martinez, A.Sanfeliu and F.Moreno-Noguer
International Journal of Computer Vision (IJCV), 2020

Paper  Abstract  Project page  Bibtex

@article{Pumarola_ijcv2020,
title = {GANimation: One-Shot Anatomically Consistent Facial Animation},
author = {A. Pumarola and A. Agudo and A.M. Martinez and A. Sanfeliu and F. Moreno-Noguer},
journal = {International Journal of Computer Vision (IJCV)},
volume = {128},
issn = {0920-5691},
pages = {698-713},
year = {2020},
month = {March}
}

Recent advances in Generative Adversarial Networks (GANs) have shown impressive results for the task of facial expression synthesis. The most successful architecture is StarGAN, which conditions GANs' generation process with images of a specific domain, namely a set of images of people sharing the same expression. While effective, this approach can only generate a discrete number of expressions, determined by the content and granularity of the dataset. To address this limitation, in this paper we introduce a novel GAN conditioning scheme based on Action Units (AU) annotations, which describes in a continuous manifold the anatomical facial movements defining a human expression. Our approach allows controlling the magnitude of activation of each AU and combining several of them. Additionally, we propose a weakly supervised strategy to train the model, that only requires images annotated with their activated AUs, and exploit a novel self-learned attention mechanism that makes our network robust to changing backgrounds, lighting conditions and occlusions. Extensive evaluation shows that our approach goes beyond competing conditional generators both in the capability to synthesize a much wider range of expressions ruled by anatomically feasible muscle movements, and in the capacity of dealing with images in the wild. The code of this work is publicly available at https://github.com/albertpumarola/GANimation.
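
The self-learned attention mechanism mentioned above amounts to compositing a generated color map with the input frame through a predicted mask. A minimal sketch, assuming a generator that returns (color, attention) for an image and a target AU vector:

import torch

def ganimation_compose(generator, image, target_aus):
    # color: (B, 3, H, W) synthesized appearance; attention: (B, 1, H, W) mask in [0, 1],
    # where values close to 1 keep the original pixels (illustrative interface).
    color, attention = generator(image, target_aus)
    return attention * image + (1.0 - attention) * color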

3D Human Shape and Pose from a Single Low-Resolution Image 
X.Xu, H.Chen, F.Moreno-Noguer, L.Jeni and F. De la Torre
European Conference on Computer Vision (ECCV), 2020

Paper  Abstract  Video  Bibtex

@inproceedings{Xu_eccv2020,
title = {3D Human Shape and Pose from a Single Low-Resolution Image},
author = {Xiangyu Xu and Hao Chen and Francesc Moreno-Noguer and Laszlo Attila Jeni and Fernando De la Torre},
booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
year = {2020}
}

3D human shape and pose estimation from monocular images has been an active area of research in computer vision, having a substantial impact on the development of new applications, from activity recognition to creating virtual avatars. Existing deep learning methods for 3D human shape and pose estimation rely on relatively high-resolution input images; however, high-resolution visual content is not always available in several practical scenarios such as video surveillance and sports broadcasting. Low-resolution images in real scenarios can vary in a wide range of sizes, and a model trained at one resolution does not typically degrade gracefully across resolutions. Two common approaches to solve the problem of low-resolution input are applying super-resolution techniques to the input images, which may result in visual artifacts, or simply training one model for each resolution, which is impractical in many realistic applications. To address the above issues, this paper proposes a novel algorithm called RSC-Net, which consists of a resolution-aware network, a self-supervision loss, and a contrastive learning scheme. The proposed network is able to learn the 3D body shape and pose across different resolutions with a single model. The self-supervision loss encourages scale-consistency of the output, and the contrastive learning scheme enforces scale-consistency of the deep features. We show that both of these new training losses provide robustness when learning 3D shape and pose in a weakly-supervised manner. Extensive experiments demonstrate that the RSC-Net can achieve consistently better results than the state-of-the-art methods for challenging low-resolution images.

GanHand: Predicting Human Grasp Affordances in Multi-Object Scenes    (Oral)
E.Corona, A.Pumarola, G.Alenyà, F.Moreno-Noguer and G.Rogez
Conference on Computer Vision and Pattern Recognition (CVPR), 2020

Paper  Abstract  Project page  Bibtex

@inproceedings{Corona_cvpr2020a,
title = {GanHand: Predicting Human Grasp Affordances in Multi-Object Scenes},
author = {Enric Corona and Albert Pumarola and Guillem Aleny{\`a} and Francesc Moreno-Noguer and Gregory Rogez},
booktitle = {Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2020}
}

The rise of deep learning has brought remarkable progress in estimating hand geometry from images where the hands are part of the scene. This paper focuses on a new problem not explored so far, consisting in predicting how a human would grasp one or several objects, given a single RGB image of these objects. This is a problem with enormous potential in e.g. augmented reality, robotics or prosthetic design. In order to predict feasible grasps, we need to understand the semantic content of the image, its geometric structure and all potential interactions with a hand physical model. To this end, we introduce a generative model that jointly reasons at all these levels and 1) regresses the 3D shape and pose of the objects in the scene; 2) estimates the grasp types; and 3) refines the 51-DoF of a 3D hand model that minimizes a graspability loss. To train this model we build the YCB-Affordance dataset, which contains more than 133k images of 21 objects from the YCB-Video dataset. We have annotated these images with more than 28M plausible 3D human grasps according to a 33-class taxonomy. A thorough evaluation on synthetic and real images shows that our model can robustly predict realistic grasps, even in cluttered scenes with multiple objects in close contact.

C-Flow: Conditional Generative Flow Models for Images and 3D Point Clouds 
A.Pumarola, S.Popov, F.Moreno-Noguer and V.Ferrari
Conference on Computer Vision and Pattern Recognition (CVPR), 2020

Paper  Abstract  Project page  Bibtex

@inproceedings{Pumarola_cvpr2020,
title = {C-Flow: Conditional Generative Flow Models for Images and 3D Point Clouds},
author = {Albert Pumarola and Stefan Popov and Francesc Moreno-Noguer and Vittorio Ferrari},
booktitle = {Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2020}
}

Flow-based generative models have highly desirable properties like exact log-likelihood evaluation and exact latent-variable inference; however, they are still in their infancy and have not received as much attention as alternative generative models. In this paper, we introduce C-Flow, a novel conditioning scheme that brings normalizing flows to an entirely new scenario with great possibilities for multi-modal data modeling. C-Flow is based on a parallel sequence of invertible mappings in which a source flow guides the target flow at every step, enabling fine-grained control over the generation process. We also devise a new strategy to model unordered 3D point clouds that, in combination with the conditioning scheme, makes it possible to address 3D reconstruction from a single image and its inverse problem of rendering an image given a point cloud. We demonstrate our conditioning method to be very adaptable, being also applicable to image manipulation, style transfer and multi-modal image-to-image mapping in a diversity of domains, including RGB images, segmentation maps and edge masks.

Context-aware Human Motion Prediction 
E.Corona, A.Pumarola, G.Alenyà and F.Moreno-Noguer
Conference on Computer Vision and Pattern Recognition (CVPR), 2020

Paper  Abstract  Project page  Bibtex

@inproceedings{Corona_cvpr2020b,
title = {Context-aware Human Motion Prediction},
author = {Enric Corona and Albert Pumarola and Guillem Aleny{\`a} and Francesc Moreno-Noguer},
booktitle = {Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2020}
}

The problem of predicting human motion given a sequence of past observations is at the core of many applications in robotics and computer vision. Current state-of-the-art methods formulate this problem as a sequence-to-sequence task, in which a history of 3D skeletons feeds a Recurrent Neural Network (RNN) that predicts future movements, typically in the order of 1 to 2 seconds. However, one aspect that has been obviated so far is the fact that human motion is inherently driven by interactions with objects and/or other humans in the environment. In this paper, we explore this scenario using a novel context-aware motion prediction architecture. We use a semantic-graph model where the nodes parameterize the human and objects in the scene and the edges their mutual interactions. These interactions are iteratively learned through a graph attention layer, fed with the past observations, which now include both object and human body motions. Once this semantic graph is learned, we inject it into a standard RNN to predict future movements of the human/s and object/s. We consider two variants of our architecture, either freezing the contextual interactions in the future or updating them. A thorough evaluation on the “Whole-Body Human Motion Database” shows that in both cases, our context-aware networks clearly outperform baselines in which the context information is not considered.
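
As a rough sketch of this pipeline (illustrative dimensions and components, not the exact architecture), the past human motion features can attend to the object features of the scene before being fed to a standard RNN:

import torch
import torch.nn as nn

class ContextAwarePredictor(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.graph_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.rnn = nn.GRU(dim, dim, batch_first=True)
        self.head = nn.Linear(dim, dim)

    def forward(self, human_seq, object_feats):
        # human_seq: (B, T, dim) past human motion; object_feats: (B, N, dim) scene objects
        ctx, _ = self.graph_attn(human_seq, object_feats, object_feats)  # learned human-object interactions
        out, _ = self.rnn(human_seq + ctx)                               # temporal model over the fused sequence
        return self.head(out)                                            # future motion features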

2019

Robust Spatio-Temporal Clustering and Reconstruction of Multiple Deformable Bodies
A.Agudo and F.Moreno-Noguer
IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 2019

Paper  Abstract  Bibtex

@article{Agudo_pami2019,
title = {Robust Spatio-Temporal Clustering and Reconstruction of Multiple Deformable Bodies},
author = {A. Agudo and F. Moreno-Noguer},
journal = {IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI)},
volume = {41},
number = {4},
issn = {0162-8828},
pages = {971 - 984},
doi = {10.1109/TPAMI.2018.2823717},
year = {2019}
}

In this paper we present an approach to reconstruct the 3D shape of multiple deforming objects from a collection of sparse, noisy and possibly incomplete 2D point tracks acquired by a single monocular camera. Additionally, the proposed solution estimates the camera motion and reasons about the spatial segmentation (i.e., identifies each of the deforming objects in every frame) and temporal clustering (i.e., splits the sequence into motion primitive actions). This advances competing work, which mainly tackled the problem for one single object and non-occluded tracks. In order to handle several objects at a time from partial observations, we model point trajectories as a union of spatial and temporal subspaces, and optimize the parameters of both modalities, the non-observed point tracks, the camera motion, and the time-varying 3D shape via augmented Lagrange multipliers. The algorithm is fully unsupervised and does not require any training data at all. We thoroughly validate the method on challenging scenarios with several human subjects performing different activities which involve complex motions and close interaction. We show our approach achieves state-of-the-art 3D reconstruction results, while it also provides spatial and temporal segmentation.

3DPeople: Modeling the Geometry of Dressed Humans
A.Pumarola, J.Sanchez, G.P.Choi, A.Sanfeliu and F.Moreno-Noguer
International Conference on Computer Vision (ICCV), 2019

Paper  Abstract  Project page  Bibtex

@inproceedings{Pumarola_iccv2019,
title = {3DPeople: Modeling the Geometry of Dressed Humans},
author = {A. Pumarola and J. Sanchez and G.P.T. Choi and A. Sanfeliu and F. Moreno-Noguer},
booktitle = {Proceedings of the International Conference on Computer Vision (ICCV)},
year = {2019}
}

Recent advances in 3D human shape estimation build upon parametric representations that model very well the shape of the naked body, but are not appropriate to represent the clothing geometry. In this paper, we present an approach to model dressed humans and predict their geometry from single images. We contribute in three fundamental aspects of the problem, namely, a new dataset, a novel shape parameterization algorithm and an end-to-end deep generative network for predicting shape. First, we present 3DPeople, a large-scale synthetic dataset with 2 million photo-realistic images of 80 subjects performing 70 activities and wearing diverse outfits. We annotate the dataset with SMPL body parameters, segmentation masks, skeletons, depth, normal maps and optical flow. All this together makes 3DPeople suitable for a plethora of tasks. We then represent the 3D shapes using 2D geometry images. To build these images we propose a novel spherical area-preserving parameterization algorithm based on the optimal mass transportation method. We show this approach to improve existing spherical maps, which tend to shrink the elongated parts of the full body models, such as the arms and legs, making the geometry images incomplete. Finally, we design a multi-resolution deep generative network that, given an input image of a dressed human, predicts his/her geometry image (and thus the clothed body shape) in an end-to-end manner. We obtain very promising results in jointly capturing body pose and clothing shape, both on synthetic validation data and on in-the-wild images.

Human Motion Prediction via Spatio-Temporal Inpainting 
A.Hernandez, J.Gall and F.Moreno-Noguer
International Conference on Computer Vision (ICCV), 2019

Paper  Abstract  Bibtex

@inproceedings{Hernandez_iccv2019,
title = {Human Motion Prediction via Spatio-Temporal Inpainting},
author = {A. Hernandez and J. Gall and F. Moreno-Noguer},
booktitle = {Proceedings of the International Conference on Computer Vision (ICCV)},
year = {2019}
}

We propose a Generative Adversarial Network (GAN) to forecast 3D human motion given a sequence of past 3D skeleton poses. While recent GANs have shown promising results, they can only forecast plausible motion over relatively short periods of time (a few hundred milliseconds) and typically ignore the absolute position of the skeleton w.r.t. the camera. Our scheme provides long-term predictions (two seconds or more) for both the body pose and its absolute position. Our approach builds upon three main contributions. First, we represent the data using a spatio-temporal tensor of 3D skeleton coordinates, which allows formulating the prediction problem as an inpainting one, for which GANs work particularly well. Secondly, we design an architecture to learn the joint distribution of body poses and global motion, capable of hypothesizing large chunks of the input 3D tensor with missing data. And finally, we argue that the L2 metric, considered so far by most approaches, fails to capture the actual distribution of long-term human motion. We propose two alternative metrics, based on the distribution of frequencies, that are able to capture more realistic motion patterns. Extensive experiments demonstrate our approach to significantly improve the state of the art, while also handling situations in which past observations are corrupted by occlusions, noise and missing frames.
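
A simple frequency-domain metric in the spirit of the ones proposed here (a toy under assumed array shapes, not the exact metrics of the paper) compares the normalized power spectra of predicted and ground-truth joint trajectories:

import numpy as np

def power_spectrum_distance(pred, gt):
    # pred, gt: (time, joints, 3) arrays of 3D joint trajectories.
    ps_pred = np.abs(np.fft.rfft(pred, axis=0)) ** 2      # per-joint power spectra over time
    ps_gt = np.abs(np.fft.rfft(gt, axis=0)) ** 2
    # Normalize each spectrum so the comparison focuses on the distribution of frequencies.
    ps_pred = ps_pred / (ps_pred.sum(axis=0, keepdims=True) + 1e-8)
    ps_gt = ps_gt / (ps_gt.sum(axis=0, keepdims=True) + 1e-8)
    return float(np.mean(np.abs(ps_pred - ps_gt)))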

2018

Force-based Representation for Non-Rigid Shape and Elastic Model Estimation 
A.Agudo and F.Moreno-Noguer
IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 2018

Paper  Abstract  Bibtex

@article{Agudo_pami2018,
title = {Force-based Representation for Non-Rigid Shape and Elastic Model Estimation},
author = {A. Agudo and F. Moreno-Noguer},
journal = {IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI)},
volume = {40},
number = {9},
issn = {0162-8828},
pages = {2137 - 2150},
doi = {10.1109/TPAMI.2017.2676778},
year = {2018}
}

This paper addresses the problem of simultaneously recovering 3D shape, pose and the elastic model of a deformable object from only 2D point tracks in a monocular video. This is a severely under-constrained problem that has been typically addressed by enforcing the shape or the point trajectories to lie on low-rank dimensional spaces. We show that formulating the problem in terms of a low-rank force space that induces the deformation and introducing the elastic model as an additional unknown allows for a better physical interpretation of the resulting priors and a more accurate representation of the actual object's behavior. In order to simultaneously estimate force, pose, and the elastic model of the object we use an expectation maximization strategy, where each of these parameters is successively learned by partial M-steps. Once the elastic model is learned, it can be transferred to similar objects to code its 3D deformation. Moreover, our approach can robustly deal with missing data, and encodes both rigid and non-rigid points under the same formalism. We thoroughly validate the approach on Mocap and real sequences, showing more accurate 3D reconstructions than the state of the art, and additionally providing an estimate of the full elastic model with no a priori information.

Boosted Random Ferns for Object Detection 
M.Villamizar, J.Andrade, A.Sanfeliu and F.Moreno-Noguer 
IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 2018

Paper  Abstract  Bibtex

@article{Villamizar_pami2018,
title = {Boosted Random Ferns for Object Detection},
author = {M. Villamizar and J. Andrade and A. Sanfeliu and F. Moreno-Noguer},
journal = {IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI)},
volume = {40},
number = {2},
issn = {0162-8828},
pages = {272 - 288},
doi = {10.1109/TPAMI.2017.2676778},
year = {2018}
}

In this paper we introduce Boosted Random Ferns (BRFs) to rapidly build discriminative classifiers for learning and detecting object categories. At the core of our approach we use standard random ferns, but we introduce four main innovations that let us bring ferns from an instance to a category level while still retaining efficiency. First, we define binary features in the histogram-of-oriented-gradients domain (as opposed to the intensity domain), allowing for a better representation of intra-class variability. Second, both the positions where ferns are evaluated within the sliding window and the location of the binary features for each fern are not chosen completely at random; instead, we use a boosting strategy to pick the most discriminative combination of them. This is further enhanced by our third contribution, which is to adapt the boosting strategy to enable sharing of binary features among different ferns, yielding high recognition rates at a low computational cost. And finally, we show that training can be performed online, for sequentially arriving images. Overall, the resulting classifier can be very efficiently trained, densely evaluated for all image locations in about 0.1 seconds, and provides detection rates similar to competing approaches that require expensive and significantly slower processing times. We demonstrate the effectiveness of our approach by thorough experimentation on publicly available datasets in which we compare against the state of the art, for tasks of both 2D detection and 3D multi-view estimation.

GANimation: Anatomically-aware Facial Animation from a Single Image     (Oral)
A.Pumarola, A.Agudo, A.M.Martinez, A.Sanfeliu and F.Moreno-Noguer 
European Conference on Computer Vision (ECCV), 2018

Paper  Abstract  Project page  Bibtex

@inproceedings{Pumarola_eccv2018,
title = {GANimation: Anatomically-aware Facial Animation from a Single Image},
author = {A. Pumarola and A. Agudo and A.M. Martinez and A. Sanfeliu and F. Moreno-Noguer},
booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
year = {2018}
}

Recent advances in Generative Adversarial Networks (GANs) have shown impressive results for the task of facial expression synthesis. The most successful architecture is StarGAN, which conditions GANs' generation process with images of a specific domain, namely a set of images of persons sharing the same expression. While effective, this approach can only generate a discrete number of expressions, determined by the content of the dataset. To address this limitation, in this paper we introduce a novel GAN conditioning scheme based on Action Units (AU) annotations, which describes in a continuous manifold the anatomical facial movements defining a human expression. Our approach allows controlling the magnitude of activation of each AU and combining several of them. Additionally, we propose a fully unsupervised strategy to train the model, that only requires images annotated with their activated AUs, and exploit attention mechanisms that make our network robust to changing backgrounds and lighting conditions. Extensive evaluation shows that our approach goes beyond competing conditional generators both in the capability to synthesize a much wider range of expressions ruled by anatomically feasible muscle movements, and in the capacity of dealing with images in the wild.

Geometry-Aware Network for Non-Rigid Shape Prediction from a Single View
A.Pumarola, A.Agudo, L.Porzi, A.Sanfeliu, V.Lepetit and F.Moreno-Noguer 
Conference on Computer Vision and Pattern Recognition (CVPR), 2018

Paper  Abstract  Project page  Bibtex

@inproceedings{Pumarola_cvpr2018b,
title = {Geometry-Aware Network for Non-Rigid Shape Prediction from a Single View},
author = {A. Pumarola and A. Agudo and L. Porzi and A. Sanfeliu and V. Lepetit and F. Moreno-Noguer},
booktitle = {Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2018}
}

We propose a method for predicting the 3D shape of a deformable surface from a single view. By contrast with previous approaches, we do not need a pre-registered template of the surface, and our method is robust to the lack of texture and partial occlusions. At the core of our approach is a geometry-aware deep architecture that tackles the problem as usually done in analytic solutions: first perform 2D detection of the mesh and then estimate a 3D shape that is geometrically consistent with the image. We train this architecture in an end-to-end manner using a large dataset of synthetic renderings of shapes under different levels of deformation, material properties, textures and lighting conditions. We evaluate our approach on a test split of this dataset and available real benchmarks, consistently improving state-of-the-art solutions with a significantly lower computational time.

Unsupervised Person Image Synthesis in Arbitrary Poses     (Spotlight)
A.Pumarola, A.Agudo, A.Sanfeliu and F.Moreno-Noguer 
Conference on Computer Vision and Pattern Recognition (CVPR), 2018

Paper  Abstract  Project page  Bibtex

@inproceedings{Pumarola_cvpr2018a,
title = {Unsupervised Person Image Synthesis in Arbitrary Poses},
author = {A. Pumarola and A. Agudo and A. Sanfeliu and F. Moreno-Noguer},
booktitle = {Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2018}
}

We present a novel approach for synthesizing photorealistic images of people in arbitrary poses using generative adversarial learning. Given an input image of a person and a desired pose represented by a 2D skeleton, our model renders the image of the same person under the new pose, synthesizing novel views of the parts visible in the input image and hallucinating those that are not seen. This problem has recently been addressed in a supervised manner, i.e., during training the ground truth images under the new poses are given to the network. We go beyond these approaches by proposing a fully unsupervised strategy. We tackle this challenging scenario by splitting the problem into two principal subtasks. First, we consider a pose conditioned bidirectional generator that maps back the initially rendered image to the original pose, hence being directly comparable to the input image without the need to resort to any training image. Second, we devise a novel loss function that incorporates content and style terms, and aims at producing images of high perceptual quality. Extensive experiments conducted on the DeepFashion dataset demonstrate that the images rendered by our model are very close in appearance to those obtained by fully supervised approaches.

Image Collection Pop-up: 3D Reconstruction and Clustering of Rigid and Non-Rigid Categories     (Spotlight)
A.Agudo, M.Pijoan and F.Moreno-Noguer 
Conference on Computer Vision and Pattern Recognition (CVPR), 2018

Paper  Abstract  Bibtex

@inproceedings{Agudo_cvpr2018,
title = {Image Collection Pop-up: 3D Reconstruction and Clustering of Rigid and Non-Rigid Categories},
author = {A. Agudo and M. Pijoan and F. Moreno-Noguer},
booktitle = {Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2018}
}

This paper introduces an approach to simultaneously estimate 3D shape, camera pose, and the clustering of objects and deformation types, from partial 2D annotations in a multi-instance collection of images. Furthermore, we can indistinctly process rigid and non-rigid categories. This advances existing work, which only addresses the problem for one single object or, if multiple objects are considered, assumes they are clustered a priori. To handle this broader version of the problem, we model object deformation using a formulation based on multiple unions of subspaces, able to span from small rigid motion to complex deformations. The parameters of this model are learned via Augmented Lagrange Multipliers, in a completely unsupervised manner that does not require any training data at all. Extensive validation is provided in a wide variety of synthetic and real scenarios, including rigid and non-rigid categories with small and large deformations. In all cases our approach outperforms the state of the art in terms of 3D reconstruction accuracy, while also providing clustering results that allow segmenting the images into object instances and their associated type of deformation (or the action the object is performing).

2017

BreakingNews: Article Annotation by Image and Text Processing
A.Ramisa, F.Yan, F.Moreno-Noguer and K.Mikolajczyk
IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 2017

Paper  Abstract  Dataset  Bibtex

@article{Ramisa_pami2017,
title = {BreakingNews: Article Annotation by Image and Text Processing},
author = {A. Ramisa and F. Yan and F. Moreno-Noguer and K. Mikolajczyk},
journal = {IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI)},
volume = {40},
number = {5},
issn = {0162-8828},
pages = {1072 - 1085},
doi = {10.1109/TPAMI.2017.2721945},
year = {2017}
}

Building upon recent Deep Neural Network architectures, current approaches lying in the intersection of computer vision and natural language processing have achieved unprecedented breakthroughs in tasks like automatic captioning or image retrieval. Most of these learning methods, though, rely on large training sets of images associated with human annotations that specifically describe the visual content. In this paper we propose to go a step further and explore the more complex cases where textual descriptions are loosely related to the images. We focus on the particular domain of News articles in which the textual content often expresses connotative and ambiguous relations that are only suggested but not directly inferred from images. We introduce new deep learning methods that address source detection, popularity prediction, article illustration and geolocation of articles. An adaptive CNN architecture is proposed, that shares most of the structure for all the tasks, and is suitable for multitask and transfer learning. Deep Canonical Correlation Analysis is deployed for article illustration, and a new loss function based on Great Circle Distance is proposed for geolocation. Furthermore, we present BreakingNews, a novel dataset with approximately 100K news articles including images, text and captions, and enriched with heterogeneous meta-data (such as GPS coordinates and popularity metrics). We show this dataset to be appropriate to explore all aforementioned problems, for which we provide a baseline performance using various Deep Learning architectures, and different representations of the textual and visual features. We report very promising results and bring to light several limitations of current state-of-the-art in this kind of domain, which we hope will help spur progress in the field.
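
The Great Circle Distance loss used for geolocation can be sketched with a standard haversine formula; the (B, 2) latitude/longitude layout in degrees is an assumption for illustration, not the paper's exact interface.

import torch

def great_circle_loss(pred_latlon, gt_latlon, radius_km=6371.0):
    # pred_latlon, gt_latlon: (B, 2) tensors holding latitude and longitude in degrees.
    pred = torch.deg2rad(pred_latlon)
    gt = torch.deg2rad(gt_latlon)
    dlat = pred[:, 0] - gt[:, 0]
    dlon = pred[:, 1] - gt[:, 1]
    a = torch.sin(dlat / 2) ** 2 + torch.cos(pred[:, 0]) * torch.cos(gt[:, 0]) * torch.sin(dlon / 2) ** 2
    # Distance along the sphere surface, averaged over the batch.
    return (2 * radius_km * torch.asin(torch.clamp(torch.sqrt(a), max=1.0))).mean()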

Combining Local-Physical and Global-Statistical Models for Sequential Deformable Shape from Motion
A.Agudo and F.Moreno-Noguer
International Journal of Computer Vision (IJCV), 2017

Paper  Abstract  Bibtex

@article{Agudo_ijcv2017,
title = {Combining Local-Physical and Global-Statistical Models for Sequential Deformable Shape from Motion},
author = {A. Agudo and F. Moreno-Noguer},
journal = {International Journal of Computer Vision (IJCV)},
volume = {122},
number = {2},
issn = {0920-5691},
pages = {371-387},
doi = {10.1007/s11263-016-0972-8},
year = {2017}
}

In this paper, we simultaneously estimate camera pose and non-rigid 3D shape from a monocular video, using a sequential solution that combines local and global representations. We model the object as an ensemble of particles, each ruled by the linear equation of Newton's second law of motion. This dynamic model is incorporated into a bundle adjustment framework, in combination with simple regularization components that ensure temporal and spatial consistency. The resulting approach allows us to sequentially estimate shape and camera poses, while progressively learning a global low-rank model of the shape that is fed back into the optimization scheme, thus introducing global constraints. The overall combination of local (physical) and global (statistical) constraints yields a solution that is both efficient and robust to several artifacts such as noisy and missing data or sudden camera motions, without requiring any training data at all. Validation is done in a variety of real application domains, including articulated and non-rigid motion, both for continuous and discontinuous shapes. Our on-line methodology yields significantly more accurate reconstructions than competing sequential approaches, being even comparable to the more computationally demanding batch methods.

3D Human Pose Tracking Priors using Geodesic Mixture Models
E.Simo-Serra, C.Torras and F.Moreno-Noguer
International Journal of Computer Vision (IJCV), 2017

Paper  Abstract  Project page  Bibtex

@article{Simo_ijcv2017,
title = {3D Human Pose Tracking Priors using Geodesic Mixture Models},
author = {E. Simo-Serra and C. Torras and F. Moreno-Noguer},
journal = {International Journal of Computer Vision (IJCV)},
volume = {122},
number = {2},
issn = {0920-5691},
pages = {388-408},
doi = {10.1007/s11263-016-0941-2},
year = {2017}
}

We present a novel approach for learning a finite mixture model on a Riemannian manifold in which Euclidean metrics are not applicable and one needs to resort to geodesic distances consistent with the manifold geometry. For this purpose, we draw inspiration from a variant of the expectation-maximization algorithm that uses a minimum message length criterion to automatically estimate the optimal number of components from multivariate data lying in a Euclidean space. In order to use this approach on Riemannian manifolds, we propose a formulation in which each component is defined on a different tangent space, thus avoiding the problems associated with the loss of accuracy produced when linearizing the manifold with a single tangent space. Our approach can be applied to any type of manifold for which it is possible to estimate its tangent space. Additionally, we consider using shrinkage covariance estimation to improve the robustness of the method, especially when dealing with very sparsely distributed samples. We evaluate the approach in a number of situations, going from data clustering on manifolds to combining pose and kinematics of articulated bodies for 3D human pose tracking. In all cases, we demonstrate remarkable improvement compared to several chosen baselines.

3D Human Pose Estimation from a Single Image via Distance Matrix Regression
F.Moreno-Noguer
Conference on Computer Vision and Pattern Recognition (CVPR), 2017

Paper  Abstract  Video  Bibtex

@inproceedings{Moreno_cvpr2017,
title = {3D Human Pose Estimation from a Single Image via Distance Matrix Regression},
author = {F. Moreno-Noguer},
booktitle = {Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2017}
}

This paper addresses the problem of 3D human pose estimation from a single image. We follow a standard two-step pipeline by first detecting the 2D position of the N body joints, and then using these observations to infer 3D pose. For the first step, we use a recent CNN-based detector. For the second step, most existing approaches perform 2N-to-3N regression of the Cartesian joint coordinates. We show that more precise pose estimates can be obtained by representing both the 2D and 3D human poses using N × N distance matrices, and formulating the problem as a 2D-to-3D distance matrix regression. For learning such a regressor we leverage simple Neural Network architectures, which, by construction, enforce positivity and symmetry of the predicted matrices. The approach also has the advantage of naturally handling missing observations and allows hypothesizing the position of non-observed joints. Quantitative results on the Humaneva and Human3.6M datasets demonstrate consistent performance gains over the state of the art. Qualitative evaluation on the in-the-wild images of the LSP dataset, using the regressor learned on Human3.6M, reveals very promising generalization results.
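
The N x N distance-matrix representation at the heart of the method is simple to construct; the snippet below builds the Euclidean distance matrix of a pose, with the 2D-to-3D regressor itself left as a placeholder.

import numpy as np

def distance_matrix(joints):
    # joints: (N, D) array of 2D or 3D joint coordinates; returns the (N, N) EDM.
    diff = joints[:, None, :] - joints[None, :, :]
    return np.linalg.norm(diff, axis=-1)

# Illustrative use: a regressor (not shown) maps the 2D EDM to a 3D EDM, from which
# the 3D pose can be recovered, e.g. via multidimensional scaling.
# edm_2d = distance_matrix(joints_2d)   # network input
# edm_3d = regressor(edm_2d)            # predicted, symmetric and non-negative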

DUST: Dual Union of Spatio-Temporal Subspaces for Monocular Multiple Object 3D Reconstruction
A.Agudo and F.Moreno-Noguer 
Conference on Computer Vision and Pattern Recognition (CVPR), 2017

Paper  Abstract  Suppl. Material  Bibtex

@inproceedings{Agudo_cvpr2017,
title = {DUST: Dual Union of Spatio-Temporal Subspaces for Monocular Multiple Object 3D
Reconstruction},
author = {A. Agudo and F. Moreno-Noguer},
booktitle = {Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2017}
}

We present an approach to reconstruct the 3D shape of multiple deforming objects from incomplete 2D trajectories acquired by a single camera. Additionally, we simultaneously provide spatial segmentation (i.e., we identify each of the objects in every frame) and temporal clustering (i.e., we split the sequence into primitive actions). This advances existing work, which only tackled the problem for one single object and non-occluded tracks. In order to handle several objects at a time from partial observations, we model point trajectories as a union of spatial and temporal subspaces, and optimize the parameters of both modalities, the non-observed point tracks and the 3D shape via augmented Lagrange multipliers. The algorithm is fully unsupervised and results in a formulation which does not need initialization. We thoroughly validate the method on challenging scenarios with several human subjects performing different activities which involve complex motions and close interaction. We show our approach achieves state-of-the-art 3D reconstruction results, while it also provides spatial and temporal segmentation.

3D CNNs on Distance Matrices for Human Action Recognition
A.Hernandez, L.Porzi, S.Rota and F.Moreno-Noguer
ACM Conference on Multimedia (ACMMM), 2017

Paper  Abstract  Bibtex

@inproceedings{Hernandez_acmmm2017,
title = {3D CNNs on Distance Matrices for Human Action Recognition},
author = {A. Hernandez and L. Porzi and S. Rota and F. Moreno-Noguer},
booktitle = {Proceedings of the ACM Conference on Multimedia (ACMMM)},
year = {2017}
}

In this paper we are interested in recognizing human actions from sequences of 3D skeleton data. For this purpose we combine a 3D Convolutional Neural Network with body representations based on Euclidean Distance Matrices (EDMs), which have recently been shown to be very effective at capturing the geometric structure of the human pose. One inherent limitation of the EDMs, however, is that they are defined up to a permutation of the skeleton joints, i.e., randomly shuffling the ordering of the joints yields many different representations. In order to address this issue we introduce a novel architecture that simultaneously, and in an end-to-end manner, learns an optimal transformation of the joints, while optimizing the rest of the parameters of the convolutional network. The proposed approach achieves state-of-the-art results on 3 benchmarks, including the recent NTU RGB-D dataset, for which we improve on previous LSTM-based methods by more than 10 percentage points, also surpassing other CNN-based methods while using almost 1000 times fewer parameters.

Multi-Modal Embedding for Main Product Detection in Fashion (Best Paper Award)
A.Rubio, L.Yu, E.Simo-Serra and F.Moreno-Noguer
Fashion Workshop in International Conference on Computer Vision (ICCVw), 2017

Paper  Abstract  Bibtex

@inproceedings{Rubio_iccvw2017,
title = {Multi-Modal Embedding for Main Product Detection in Fashion},
author = {A. Rubio and L. Yu and E. Simo-Serra and F. Moreno-Noguer},
booktitle = {Proceedings of the International Conference on Computer Vision Workshops (ICCVW)},
year = {2017}
}

We present an approach to detect the main product in fashion images by exploiting the textual metadata associated with each image. Our approach is based on a Convolutional Neural Network and learns a joint embedding of object proposals and textual metadata to predict the main product in the image. We additionally use several complementary classification and overlap losses in order to improve training stability and performance. Our tests on a large-scale dataset taken from eight e-commerce sites show that our approach outperforms strong baselines and is able to accurately detect the main product in a wide diversity of challenging fashion images.

2016

Sequential Non-Rigid Structure from Motion using Physical Priors
A.Agudo, F.Moreno-Noguer, B.Calvo and J.M.M.Montiel
IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 2016

Paper  Abstract  Bibtex

@article{Agudo_pami2016,
title = {Sequential Non-Rigid Structure from Motion using Physical Priors},
author = {A. Agudo and F. Moreno-Noguer and B. Calvo and J.M.M. Montiel},
journal = {IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI)},
volume = {38},
number = {5},
issn = {0162-8828},
pages = {979-994},
doi = {10.1109/TPAMI.2015.2469293},
year = {2016}
}

We propose a new approach to simultaneously recover camera pose and 3D shape of non-rigid and potentially extensible surfaces from a monocular image sequence. For this purpose, we make use of the EKF-SLAM (Extended Kalman Filter based Simultaneous Localization And Mapping) formulation, a Bayesian optimization framework traditionally used in mobile robotics for estimating camera pose and reconstructing rigid scenarios. In order to extend the problem to a deformable domain we represent the object’s surface mechanics by means of Navier’s equations, which are solved using a FEM (Finite Element Method). With these main ingredients, we can further model the material’s stretching, allowing us to go a step further than most current techniques, typically constrained to surfaces undergoing isometric deformations. We extensively validate our approach in both real and synthetic experiments, and demonstrate its advantages with respect to competing methods. More specifically, we show that besides simultaneously retrieving camera pose and non-rigid shape, our approach is adequate for both isometric and extensible surfaces, requires neither batch processing of all the frames nor tracking points over the whole sequence, and runs at several frames per second.

Accurate and Linear Time Pose Estimation from Points and Lines
A.Vakhitov, J.Funke and F.Moreno-Noguer
European Conference on Computer Vision (ECCV), 2016

Paper  Abstract  Code  Bibtex

@inproceedings{Vakhitov_eccv2016,
title = {Accurate and Linear Time Pose Estimation from Points and Lines},
author = {A. Vakhitov and J. Funke and F. Moreno-Noguer},
booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
year = {2016}
}

The Perspective-n-Point (PnP) problem seeks to estimate the pose of a calibrated camera from n 3D-to-2D point correspondences. There are situations, though, where PnP solutions are prone to fail because feature point correspondences cannot be reliably estimated (e.g. scenes with repetitive patterns or with low texture). In such scenarios, one can still exploit alternative geometric entities, such as lines, yielding the so-called Perspective-n-Line (PnL) algorithms. Unfortunately, existing PnL solutions are not as accurate and efficient as their point-based counterparts. In this paper we propose a novel approach to introduce 3D-to-2D line correspondences into a PnP formulation, allowing points and lines to be processed simultaneously. For this purpose we introduce an algebraic line error that can be formulated as linear constraints on the line endpoints, even when these are not directly observable. These constraints can then be naturally integrated within the linear formulations of two state-of-the-art point-based algorithms, the OPnP and the EPnP, allowing them to handle points, lines, or any combination of the two. Exhaustive experiments show that the proposed formulation brings a remarkable boost in performance compared to point-only or line-only solutions, with a negligible computational overhead compared to the original OPnP and EPnP.
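
For intuition, the algebraic line error referred to above can be pictured as the signed distance between a projected 3D endpoint and the homogeneous image line through the two detected 2D endpoints, which is linear in the projected coordinates. The numpy sketch below is only illustrative and does not reproduce the paper's exact formulation; all names and the simple pinhole projection are assumptions.

import numpy as np

def line_residuals(p2d_a, p2d_b, X_a, X_b, R, t, K):
    # p2d_a, p2d_b: detected 2D line endpoints (pixels); X_a, X_b: 3D endpoints;
    # R, t, K: camera rotation, translation and intrinsics.
    l = np.cross(np.append(p2d_a, 1.0), np.append(p2d_b, 1.0))  # homogeneous image line
    l = l / np.linalg.norm(l[:2])                               # normalize so residuals are in pixels
    residuals = []
    for X in (X_a, X_b):
        x = K @ (R @ X + t)                                     # project the 3D endpoint
        x = x / x[2]
        residuals.append(l @ x)                                 # signed point-to-line distance
    return np.array(residuals)

Because each residual is linear in the projected endpoint, constraints of this kind can be stacked together with point constraints in a single linear system, which is the spirit of the integration into OPnP and EPnP described above.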

2015

Dense Segmentation-aware Descriptors
E.Trulls, I.Kokkinos, A.Sanfeliu and F.Moreno-Noguer
Chapter in Dense Image Correspondences for Computer Vision, Eds. C.Liu and T.Hassner, Springer, 2015

Paper  Abstract  Bibtex

@incollection{Trulls_springerchapter2015,
title = {Dense Segmentation-aware Descriptors},
author = {E. Trulls and I. Kokkinos and A. Sanfeliu and F. Moreno-Noguer},
booktitle = {Dense Image Correspondences for Computer Vision},
editor = {Ce Liu and Tal Hassner},
publisher = {Springer},
doi = {http://dx.doi.org/10.1007/978-3-319-23048-1},
year = {2015}
}

Dense descriptors are becoming increasingly popular in a host of tasks, such as dense image correspondence, bag-of-words image classification, and label transfer. However, the extraction of descriptors at generic image points, rather than at select geometric features, e.g. blobs, requires rethinking how to achieve invariance to nuisance parameters. In this work we pursue invariance to occlusions and background changes by introducing segmentation information within dense feature construction. The core idea is to use the segmentation cues to downplay the features coming from image areas that are unlikely to belong to the same region as the feature point. We show how to integrate this idea with dense SIFT, as well as with the dense Scale- and Rotation-Invariant Descriptor (SID). We thereby deliver dense descriptors that are invariant to background changes, rotation and/or scaling. We explore the merit of our technique in conjunction with large displacement motion estimation and wide-baseline stereo, and demonstrate that exploiting segmentation information yields clear improvements.

Non-Rigid Graph Registration using Active Testing Search
E.Serradell, M.A.Pinheiro, R.Sznitman, J.Kybic, F.Moreno-Noguer and P.Fua
IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 2015

Paper  Abstract  Bibtex

@article{Serradell_pami2015,
title = {Non-Rigid Graph Registration using Active Testing Search},
author = {E. Serradell and M.A. Pinheiro and R. Sznitman and J. Kybic and F. Moreno-Noguer and P. Fua},
journal = {IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI)},
volume = {37},
number = {3},
issn = {0162-8828},
pages = {625-638},
doi = {http://doi.ieeecomputersociety.org/10.1109/TPAMI.2014.2343235},
year = {2015}
}

We present a new approach for matching sets of branching curvilinear structures that form graphs embedded in R^2 or R^3 and may be subject to deformations. Unlike earlier methods, ours does not rely on local appearance similarity, nor does it require a good initial alignment. Furthermore, it can cope with non-linear deformations, topological differences, and partial graphs. To handle arbitrary non-linear deformations, we use Gaussian Processes to represent the geometrical mapping relating the two graphs. In the absence of appearance information, we iteratively establish correspondences between points, update the mapping accordingly, and use it to estimate where to find the most likely correspondences that will be used in the next step. To make the computation tractable for large graphs, the set of new potential matches considered at each iteration is not selected at random as in many RANSAC-based algorithms. Instead, we introduce a so-called Active Testing Search strategy that performs a priority search to favor the most likely matches and speed up the process. We demonstrate the effectiveness of our approach first on synthetic cases and then on angiography data, retinal fundus images, and microscopy image stacks acquired at very different resolutions.
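
The geometric mapping mentioned above can be emulated with a generic Gaussian Process regression over the displacements of the correspondences established so far; the sketch below is a stand-in with an RBF kernel, not the paper's formulation, and all parameter names are assumptions.

import numpy as np

def rbf(A, B, length=1.0):
    # Squared-exponential kernel between two point sets.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / length ** 2)

def gp_predict(X_src, X_dst, X_query, noise=1e-3, length=1.0):
    # X_src -> X_dst are the current correspondences; X_query are unmatched source nodes.
    # Returns the posterior mean of their mapped positions under a GP on displacements.
    K = rbf(X_src, X_src, length) + noise * np.eye(len(X_src))
    Ks = rbf(X_query, X_src, length)
    disp = X_dst - X_src                          # identity map as the mean, displacements as residuals
    return X_query + Ks @ np.linalg.solve(K, disp)

In the approach described above, predictions of this kind (together with their uncertainty) are what tell the Active Testing Search where the most promising candidate matches lie.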

DaLI: Deformation and Light Invariant Descriptor
E.Simo-Serra, C.Torras and F.Moreno-Noguer
International Journal of Computer Vision (IJCV), 2015

Paper  Abstract  Project page  Bibtex

@article{Simo_ijcv2015,
title = {{DaLI}: Deformation and Light Invariant Descriptor},
author = {E. Simo-Serra and C. Torras and F. Moreno-Noguer},
journal = {International Journal of Computer Vision (IJCV)},
volume = {115},
number = {2},
issn = {0920-5691},
pages = {135-154},
doi = {https://doi.org/10.1007/s11263-015-0805-1},
year = {2015}
}

Recent advances in 3D shape analysis and recognition have shown that heat diffusion theory can be effectively used to describe local features of deforming and scaling surfaces. In this paper, we show how this description can be used to characterize 2D image patches, and introduce DaLI, a novel feature point descriptor with high resilience to non-rigid image transformations and illumination changes. In order to build the descriptor, 2D image patches are initially treated as 3D surfaces. Patches are then described in terms of a heat kernel signature, which captures both local and global information, and shows a high degree of invariance to non-linear image warps. In addition, by further applying a logarithmic sampling and a Fourier transform, invariance to photometric changes is achieved. Finally, the descriptor is compacted by mapping it onto a low dimensional subspace computed using Principal Component Analysis, allowing for an efficient matching. A thorough experimental validation demonstrates that DaLI is significantly more discriminative and robust to illumination changes and image transformations than state-of-the-art descriptors, even those specifically designed to describe non-rigid deformations.

Discriminative Learning of Deep Convolutional Feature Point Descriptors
E.Simo-Serra, E.Trulls, L.Ferraz, I.Kokkinos, P.Fua and F.Moreno-Noguer
International Conference on Computer Vision (ICCV), 2015

Paper  Abstract  Project page  Bibtex

@inproceedings{Simo_iccv2015,
title = {Discriminative Learning of Deep Convolutional Feature Point Descriptors},
author = {E. Simo-Serra and E. Trulls and L. Ferraz and I. Kokkinos and P. Fua and F. Moreno-Noguer},
booktitle = {Proceedings of the International Conference on Computer Vision (ICCV)},
year = {2015}
}

Deep learning has revolutionized image-level tasks such as classification, but patch-level tasks, such as correspondence, still rely on handcrafted features, e.g. SIFT. In this paper we use Convolutional Neural Networks (CNNs) to learn discriminant patch representations and in particular train a Siamese network with pairs of (non-)corresponding patches. We deal with the large number of potential pairs by combining stochastic sampling of the training set with an aggressive mining strategy biased towards patches that are hard to classify. By using the L2 distance during both training and testing we develop 128-D descriptors whose Euclidean distances reflect patch similarity, and which can be used as a drop-in replacement for any task involving SIFT. We demonstrate consistent performance gains over the state of the art, and generalize well against scaling and rotation, perspective transformation, non-rigid deformation, and illumination changes. Our descriptors are efficient to compute and amenable to modern GPUs, and are publicly available.
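
A toy numpy sketch of the training signal, hinge-style losses on L2 distances with mining of the hardest pairs; the actual network, sampling scheme and hyper-parameters in the paper differ, and every name below is an assumption.

import numpy as np

def pair_losses(desc_a, desc_b, is_match, margin=1.0):
    # desc_a, desc_b: (n, 128) descriptors of patch pairs; is_match: boolean (n,).
    d = np.linalg.norm(desc_a - desc_b, axis=1)                   # per-pair Euclidean distance
    pos = np.where(is_match, d, 0.0)                              # pull corresponding patches together
    neg = np.where(~is_match, np.maximum(0.0, margin - d), 0.0)   # push non-corresponding ones apart
    return pos + neg

def mine_hard(losses, keep_ratio=0.25):
    # Aggressive mining: keep only the hardest (highest-loss) fraction of sampled pairs.
    k = max(1, int(len(losses) * keep_ratio))
    return np.argsort(losses)[-k:]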

Learning Shape, Motion and Elastic Models in Force Space
A.Agudo and F.Moreno-Noguer
International Conference on Computer Vision (ICCV), 2015

Paper  Abstract  Bibtex

@inproceedings{Agudo_iccv2015,
title = {Learning Shape, Motion and Elastic Models in Force Space},
author = {A. Agudo and F. Moreno-Noguer},
booktitle = {Proceedings of the International Conference on Computer Vision (ICCV)},
year = {2015}
}

In this paper, we address the problem of simultaneously recovering the 3D shape and pose of a deformable and potentially elastic object from 2D motion. This is a highly ambiguous problem typically tackled by using low-rank shape and trajectory constraints. We show that formulating the problem in terms of a low-rank force space that induces the deformation allows for a better physical interpretation of the resulting priors and a more accurate representation of the actual object’s behavior. However, this comes at the price of having to estimate, besides force and pose, the elastic model of the object. For this, we use an Expectation Maximization strategy, in which each of these parameters is successively learned within partial M-steps, while robustly dealing with missing observations. We thoroughly validate the approach on both mocap and real sequences, showing more accurate 3D reconstructions than the state-of-the-art, and additionally providing an estimate of the full elastic model with no a priori information.

Simultaneous Pose and Non-rigid Shape with Particle Dynamics
A.Agudo and F.Moreno-Noguer
Conference on Computer Vision and Pattern Recognition (CVPR), 2015

Paper  Abstract  Bibtex

@inproceedings{Agudo_cvpr2015,
title = {Simultaneous Pose and Non-rigid Shape with Particle Dynamics},
author = {A. Agudo and F. Moreno-Noguer},
booktitle = {Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2015}
}

In this paper, we propose a sequential solution to simultaneously estimate camera pose and non-rigid 3D shape from a monocular video. In contrast to most existing approaches that rely on global representations of the shape, we model the object at a local level, as an ensemble of particles, each ruled by the linear equation of Newton's second law of motion. This dynamic model is incorporated into a bundle adjustment framework, in combination with simple regularization components that ensure temporal and spatial consistency of the estimated shape and camera poses. The resulting approach is both efficient and robust to several artifacts such as noisy and missing data or sudden camera motions, while it does not require any training data at all. Validation is done on a variety of real video sequences, including articulated and non-rigid motion, both for continuous and discontinuous shapes. Our system is shown to perform comparably to competing, computationally expensive batch methods and shows a remarkable improvement with respect to sequential ones.

Neuroaesthetics in Fashion: High Performance CRF Model for Cloth Parsing
E.Simo-Serra, S.Fidler, F.Moreno-Noguer and R.Urtasun
Conference on Computer Vision and Pattern Recognition (CVPR), 2015

Paper  Abstract  Project page  Bibtex

@inproceedings{Simo_cvpr2015,
title = {Neuroaesthetics in Fashion: High Performance CRF Model for Cloth Parsing},
author = {E. Simo-Serra and S. Fidler and F. Moreno-Noguer and R. Urtasun},
booktitle = {Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2015}
}

In this paper, we analyze clothing fashion on a large social website. Our goal is to learn and predict how fashionable a person looks in a photograph and to suggest subtle improvements the user could make to improve her/his appeal. We propose a Conditional Random Field model that jointly reasons about several fashionability factors such as the type of outfit and garments the user is wearing, the type of the user, the photograph’s setting (e.g., the scenery behind the user), and the fashionability score. Importantly, our model is able to give rich feedback to the user, conveying which garments or even scenery she/he should change in order to improve fashionability. We demonstrate that our joint approach significantly outperforms a variety of intelligent baselines. We additionally collected a novel heterogeneous dataset with 144,169 user posts containing diverse image, textual and meta information which can be exploited for our task. We also provide a detailed analysis of the data, showing different outfit trends and fashionability scores across the globe and across a span of 6 years.

Combining Geometric, Textual and Visual Features for Predicting Prepositions in Image Descriptions
A.Ramisa, J.Wang, Y.Lu, E.Dellandrea, F.Moreno-Noguer and R.Gaizauskas
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2015

Paper  Abstract  Bibtex

@inproceedings{Ramisa_emnlp2015,
title = {Combining Geometric, Textual and Visual Features for Predicting Prepositions in Image Descriptions},
author = {A. Ramisa and J. Wang and Y. Lu and E. Dellandrea and F. Moreno-Noguer and R. Gaizauskas},
booktitle = {Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP)},
year = {2015}
}

We investigate the role that geometric, textual and visual features play in the task of predicting a preposition that links two visual entities depicted in an image. The task is an important part of the subsequent process of generating image descriptions. We explore the prediction of prepositions for a pair of entities, both in the case when the labels of such entities are known and unknown. In all situations we found clear evidence that all three features contribute to the prediction task.
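
The geometric features mentioned above are typically simple functions of the two entities' bounding boxes. The sketch below shows the kind of quantities one might compute (overlap, relative scale, relative position); the exact feature set used in the paper may differ.

import numpy as np

def box_geometry(box_a, box_b):
    # Boxes given as (x_min, y_min, x_max, y_max).
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    area_a = (ax1 - ax0) * (ay1 - ay0)
    area_b = (bx1 - bx0) * (by1 - by0)
    ix = max(0.0, min(ax1, bx1) - max(ax0, bx0))      # intersection width
    iy = max(0.0, min(ay1, by1) - max(ay0, by0))      # intersection height
    inter = ix * iy
    iou = inter / (area_a + area_b - inter + 1e-9)    # overlap between the two entities
    ca = np.array([(ax0 + ax1) / 2.0, (ay0 + ay1) / 2.0])
    cb = np.array([(bx0 + bx1) / 2.0, (by0 + by1) / 2.0])
    offset = (cb - ca) / np.array([ax1 - ax0 + 1e-9, ay1 - ay0 + 1e-9])  # normalized center offset
    return np.array([iou, area_a / (area_b + 1e-9), offset[0], offset[1]])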

2014

Very Fast Solution to the PnP Problem with Algebraic Outlier Rejection
L.Ferraz, X.Binefa and F.Moreno-Noguer
Conference on Computer Vision and Pattern Recognition (CVPR), 2014

Paper  Abstract  Code  Bibtex

@inproceedings{Ferraz_cvpr2014,
title = {Very Fast Solution to the PnP Problem with Algebraic Outlier Rejection},
author = {L. Ferraz and X. Binefa and F. Moreno-Noguer},
booktitle = {Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR)},
pages = {501-508},
year = {2014}
}

We propose a real-time, outlier-robust and accurate solution to the Perspective-n-Point (PnP) problem. The main advantages of our solution are twofold: first, it integrates outlier rejection within the pose estimation pipeline with a negligible computational overhead; and second, it scales to an arbitrarily large number of correspondences. Given a set of 3D-to-2D matches, we formulate the pose estimation problem as a low-rank homogeneous system whose solution lies in its 1D null space. Outlier correspondences are those rows of the linear system which perturb the null space and are progressively detected by projecting them on an iteratively estimated solution of the null space. Since our outlier removal process is based on an algebraic criterion which does not require computing the full pose and reprojecting back all 3D points on the image plane at each step, we achieve speed gains of more than 100× compared to RANSAC strategies. An extensive experimental evaluation shows that our solution yields accurate results in situations with up to 50% of outliers, and can process more than 1000 correspondences in less than 5ms.
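
The rejection mechanism can be pictured with a small numpy sketch: the rows of the homogeneous system are grouped per correspondence, the 1D null space is re-estimated from the currently accepted rows, and the correspondences whose rows deviate most from it are discarded. This is only an illustration of the idea, not the paper's implementation, and the stopping thresholds are assumptions.

import numpy as np

def algebraic_outlier_rejection(M, blocks, max_iters=10, reject_frac=0.1):
    # M: stacked homogeneous linear system; blocks[i]: row indices belonging to correspondence i.
    inliers = list(range(len(blocks)))
    for _ in range(max_iters):
        rows = np.vstack([M[blocks[i]] for i in inliers])
        v = np.linalg.svd(rows)[2][-1]                 # current estimate of the 1D null space
        res = np.array([np.linalg.norm(M[blocks[i]] @ v) for i in inliers])  # algebraic residuals
        n_drop = int(np.ceil(reject_frac * len(inliers)))
        if n_drop == 0 or len(inliers) - n_drop < 6:   # keep at least a minimal set
            break
        keep = np.argsort(res)[: len(inliers) - n_drop]
        inliers = [inliers[i] for i in keep]
    return inliers, v

No pose is recomputed and no point is reprojected inside the loop, which is where the claimed speed advantage over RANSAC-style schemes comes from.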

Segmentation-aware Deformable Part Models
E.Trulls, S.Tsogkas, I.Kokkinos, A.Sanfeliu and F.Moreno-Noguer
Conference on Computer Vision and Pattern Recognition (CVPR), 2014

Paper  Abstract  Spotlight  Bibtex

@inproceedings{Trulls_cvpr2014,
title = {Segmentation-aware Deformable Part Models},
author = {E. Trulls and S. Tsogkas and I. Kokkinos and A. Sanfeliu and F. Moreno-Noguer},
booktitle = {Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR)},
pages = {168-175},
year = {2014}
}

In this work we propose a technique to combine bottom-up segmentation, coming in the form of SLIC superpixels, with sliding window detectors, such as Deformable Part Models (DPMs). The merit of our approach lies in ‘cleaning up’ the low-level HOG features by exploiting the spatial support of SLIC superpixels; this can be understood as using segmentation to split the feature variation into object-specific and background changes. Rather than committing to a single segmentation we use a large pool of SLIC superpixels and combine them in a scale-, position- and object-dependent manner to build soft segmentation masks. The segmentation masks can be computed fast enough to repeat this process over every candidate window, during training and detection, for both the root and part filters of DPMs. We use these masks to construct enhanced, background-invariant features to train DPMs. We test our approach on the PASCAL VOC 2007, outperforming the standard DPM in 17 out of 20 classes, yielding an average increase of 1.7% AP. Additionally, we demonstrate the robustness of this approach, extending it to dense SIFT descriptors for large displacement optical flow.

A High Performance CRF Model for Cloth Parsing
E.Simo-Serra, S.Fidler, F.Moreno-Noguer and R.Urtasun
Asian Conference on Computer Vision (ACCV), 2014

Paper  Abstract  Project page  Bibtex

@inproceedings{Simo_accv2014,
title = {A High Performance CRF Model for Cloth Parsing},
author = {E. Simo-Serra and S. Fidler and F. Moreno-Noguer and R. Urtasun},
booktitle = {Proceedings of the Asian Conference on Computer Vision (ACCV)},
year = {2014}
}

In this paper we tackle the problem of clothing parsing: Our goal is to segment and classify different garments a person is wearing. We frame the problem as the one of inference in a pose-aware Conditional Random Field (CRF) which exploits appearance, figure/ground segmentation, shape and location priors for each garment as well as similarities between segments, and symmetries between different human body parts. We demonstrate the effectiveness of our approach on the Fashionista dataset and show that we can obtain a significant improvement over the state-of-the-art.

2013

Exhaustive Linearization for Robust Camera Pose and Focal Length Estimation
A.Penate-Sanchez, J.Andrade-Cetto and F.Moreno-Noguer
IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 2013

Paper  Abstract  Code  Bibtex

@article{Penate_pami2013,
title = {Exhaustive Linearization for Robust Camera Pose and Focal Length Estimation},
author = {A. Penate-Sanchez and J. Andrade-Cetto and F. Moreno-Noguer},
journal = {IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI)},
volume = {35},
number = {10},
issn = {0162-8828},
pages = {2387-2400},
doi = {http://doi.ieeecomputersociety.org/10.1109/TPAMI.2013.36},
year = {2013}
}

We propose a novel approach for the estimation of the pose and focal length of a camera from a set of 3D-to-2D point correspondences. Our method compares favorably to competing approaches in that it is both more accurate than existing closed-form solutions, and faster as well as more accurate than iterative ones. Our approach is inspired by the EPnP algorithm, a recent O(n) solution for the calibrated case. Yet, we show that considering the focal length as an additional unknown renders the linearization and relinearization techniques of the original approach no longer valid, especially with large amounts of noise. We present new methodologies to circumvent this limitation, termed exhaustive linearization and exhaustive relinearization, which perform a systematic exploration of the solution space in closed form. The method is evaluated on both real and synthetic data, and our results show that besides producing precise focal length estimation, the retrieved camera pose is almost as accurate as the one computed using the EPnP, which assumes a calibrated camera.

Stochastic Exploration of Ambiguities for Nonrigid Shape Recovery
F.Moreno-Noguer and P.Fua
IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 2013

Paper  Abstract  Video  Bibtex

@article{Moreno_pami2013,
title = {Stochastic Exploration of Ambiguities for Nonrigid Shape Recovery},
author = {F. Moreno-Noguer and P. Fua},
journal = {IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI)},
volume = {35},
number = {2},
issn = {0162-8828},
pages = {463-475},
doi = {http://doi.ieeecomputersociety.org/10.1109/TPAMI.2012.102},
year = {2013}
}

Recovering the 3D shape of deformable surfaces from single images is known to be a highly ambiguous problem because many different shapes may have very similar projections. This is commonly addressed by restricting the set of possible shapes to linear combinations of deformation modes and by imposing additional geometric constraints. Unfortunately, because image measurements are noisy, such constraints do not always guarantee that the correct shape will be recovered. To overcome this limitation, we introduce a stochastic sampling approach to efficiently explore the set of solutions of an objective function based on point correspondences. This allows us to propose a small set of ambiguous candidate 3D shapes and then to use additional image information to choose the best one. As a proof of concept, we use either motion or shading cues to this end and show that we can handle a complex objective function without having to solve a difficult non-linear minimization problem. The advantages of our method are demonstrated on a variety of problems including both real and synthetic data.
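
The exploration step can be caricatured in a few lines of numpy (a sketch of the idea, not the paper's sampler): candidate shapes are drawn as linear combinations of deformation modes and only those whose projections remain close to the observed correspondences are kept as the ambiguous set that motion or shading cues later disambiguate. All names and thresholds below are assumptions.

import numpy as np

def sample_candidate_shapes(mean_shape, modes, project, obs_2d, n_samples=500, tol=2.0):
    # mean_shape: (3N,); modes: (3N, K) deformation basis; project: maps a (3N,) shape
    # to (N, 2) image points; obs_2d: (N, 2) noisy point correspondences.
    keep = []
    for _ in range(n_samples):
        w = np.random.randn(modes.shape[1])            # random modal weights
        shape = mean_shape + modes @ w
        err = np.linalg.norm(project(shape) - obs_2d, axis=1).mean()
        if err < tol:                                  # candidate consistent with the image
            keep.append(w)
    return np.array(keep)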

Dense Segmentation-Aware Descriptors
E.Trulls, I.Kokkinos, A.Sanfeliu and F.Moreno-Noguer
Conference on Computer Vision and Pattern Recognition (CVPR), 2013

Paper  Abstract  Code  Bibtex

@inproceedings{Trulls_cvpr2013,
title = {Dense Segmentation-Aware Descriptors},
author = {E. Trulls and I. Kokkinos and A. Sanfeliu and F. Moreno-Noguer},
booktitle = {Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR)},
pages = {2890-2897},
year = {2013}
}

In this work we exploit segmentation to construct appearance descriptors that can robustly deal with occlusion and background changes. For this, we downplay measurements coming from areas that are unlikely to belong to the same region as the descriptor’s center, as suggested by soft segmentation masks. Our treatment is applicable to any image point, i.e. dense, and its computational overhead is in the order of a few seconds. We integrate this idea with Dense SIFT, and also with Dense Scale and Rotation Invariant Descriptors (SID), delivering descriptors that are densely computable, invariant to scaling and rotation, and robust to background changes. We apply our approach to standard benchmarks on large displacement motion estimation using SIFT-flow and wide-baseline stereo, systematically demonstrating that the introduction of segmentation yields clear improvements.
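
A simplified numpy sketch of the soft-masking idea (illustrative only; the actual descriptors are dense SIFT and SID): each pixel's vote into an orientation histogram is weighted by how similar its soft segmentation value is to that of the descriptor center, so contributions from likely background pixels are downplayed.

import numpy as np

def masked_orientation_histogram(grad_mag, grad_ori, soft_mask, center, n_bins=8):
    # grad_mag, grad_ori, soft_mask: same-sized 2D arrays over the patch;
    # grad_ori in [0, 2*pi); center: (row, col) of the descriptor center within the patch.
    affinity = 1.0 - np.abs(soft_mask - soft_mask[center])        # 1 = same region, 0 = different
    bins = np.floor(grad_ori / (2 * np.pi) * n_bins).astype(int) % n_bins
    hist = np.zeros(n_bins)
    np.add.at(hist, bins.ravel(), (grad_mag * affinity).ravel())  # affinity-weighted votes
    return hist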

A Joint Model for 2D and 3D Pose Estimation from a Single Image
E.Simo-Serra, A.Quattoni, C.Torras and F.Moreno-Noguer
Conference on Computer Vision and Pattern Recognition (CVPR), 2013

Paper  Abstract  Project page  Bibtex

@inproceedings{Simo_cvpr2013,
title = {A Joint Model for 2D and 3D Pose Estimation from a Single Image},
author = {E. Simo-Serra and A. Quattoni and C. Torras and F. Moreno-Noguer},
booktitle = {Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR)},
pages = {3634-3641},
year = {2013}
}

We introduce a novel approach to automatically recover 3D human pose from a single image. Most previous work follows a pipelined approach: initially, a set of 2D features such as edges, joints or silhouettes are detected in the image, and then these observations are used to infer the 3D pose. Solving these two problems separately may lead to erroneous 3D poses when the feature detector has performed poorly. In this paper, we address this issue by jointly solving both the 2D detection and the 3D inference problems. For this purpose, we propose a Bayesian framework that integrates a generative model based on latent variables and discriminative 2D part detectors based on HOGs, and perform inference using evolutionary algorithms. Real experimentation demonstrates competitive results, and the ability of our methodology to provide accurate 2D and 3D pose estimations even when the 2D detectors are inaccurate.

2012

Single Image 3D Human Pose Estimation from Noisy Observations
E.Simo-Serra, A.Ramisa, G.Alenya, C.Torras and F.Moreno-Noguer
Conference on Computer Vision and Pattern Recognition (CVPR), 2012

Paper  Abstract  Project page  Bibtex

@inproceedings{Simo_cvpr2012,
title = {Single Image 3D Human Pose Estimation from Noisy Observations},
author = {E. Simo-Serra and A. Ramisa and G. Alenya and C. Torras and F. Moreno-Noguer},
booktitle = {Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR)},
pages = {2673-2680},
year = {2012}
}

Markerless 3D human pose detection from a single image is a severely underconstrained problem because different 3D poses can have similar image projections. In order to handle this ambiguity, current approaches rely on prior shape models that can only be correctly adjusted if 2D image features are accurately detected. Unfortunately, although current 2D part detection algorithms have shown promising results, they are not yet accurate enough to guarantee a complete disambiguation of the inferred 3D shape. In this paper, we introduce a novel approach for estimating 3D human pose even when observations are noisy. We propose a stochastic sampling strategy to propagate the noise from the image plane to the shape space. This provides a set of ambiguous 3D shapes, which are virtually indistinguishable from their image projections. Disambiguation is then achieved by imposing kinematic constraints that guarantee the resulting pose resembles a 3D human shape. We validate the method on a variety of situations in which state-of-the-art 2D detectors yield either inaccurate estimations or partly miss some of the body parts.

Robust Non-Rigid Registration of 2D and 3D Graphs
E.Serradell, P.Glowacki, J.Kybic, F.Moreno-Noguer and P.Fua
Conference on Computer Vision and Pattern Recognition (CVPR), 2012

Paper  Abstract  Bibtex

@inproceedings{Serradell_cvpr2012,
title = {Robust Non-Rigid Registration of 2D and 3D Graphs},
author = {E. Serradell and P. Glowacki and J. Kybic and F. Moreno-Noguer and P. Fua},
booktitle = {Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR)},
pages = {996-1003},
year = {2012}
}

We present a new approach to matching graphs embedded in R^2 or R^3. Unlike earlier methods, our approach does not rely on the similarity of local appearance features, does not require an initial alignment, can handle partial matches, and can cope with non-linear deformations and topological differences. To handle arbitrary non-linear deformations, we represent them as Gaussian Processes. In the absence of appearance information, we iteratively establish correspondences between graph nodes, update the structure accordingly, and use the current mapping estimate to find the most likely correspondences that will be used in the next iteration. This makes the computation tractable. We demonstrate the effectiveness of our approach first on synthetic cases and then on angiography data, retinal fundus images, and microscopy image stacks acquired at very different resolutions.

Spatiotemporal Descriptor for Wide-Baseline Stereo Reconstruction of Non-Rigid and Ambiguous Scenes
E.Trulls, A.Sanfeliu and F.Moreno-Noguer
European Conference on Computer Vision (ECCV), 2012

Paper  Abstract  Spotlight  Bibtex

@inproceedings{Trulls_eccv2012,
title = {Spatiotemporal Descriptor for Wide-Baseline Stereo Reconstruction of Non-Rigid and Ambiguous Scenes},
author = {E. Trulls and A. Sanfeliu and F. Moreno-Noguer},
booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
volume = {7574},
series = {Lecture Notes in Computer Science},
pages = {441-454},
year = {2012}
}

This paper studies the use of temporal consistency to match appearance descriptors and handle complex ambiguities when computing dynamic depth maps from stereo. Previous attempts have designed 3D descriptors over the space-time volume and have been mostly used for monocular action recognition, as they cannot deal with perspective changes. Our approach is based on a state-of-the-art 2D dense appearance descriptor which we extend in time by means of optical flow priors, and can be applied to wide-baseline stereo setups. The basic idea behind our approach is to capture the changes around a feature point in time instead of trying to describe the spatiotemporal volume. We demonstrate its effectiveness on very ambiguous synthetic video sequences with ground truth data, as well as real sequences.

2011

Simultaneous Correspondence and Non-Rigid 3D Reconstruction of the Coronary Tree from Single X-Ray Images
E.Serradell, A.Romero, R.Leta, C.Gatta and F.Moreno-Noguer
International Conference on Computer Vision (ICCV), 2011

Paper  Abstract  Bibtex

@inproceedings{Serradell_iccv2011,
title = {Simultaneous Correspondence and Non-Rigid 3D Reconstruction of the Coronary Tree from Single X-Ray Images},
author = {E. Serradell and A. Romero and R. Leta and C. Gatta and F. Moreno-Noguer},
booktitle = {Proceedings of the International Conference on Computer Vision (ICCV)},
pages = {850-857},
year = {2011}
}

We present a novel approach to simultaneously reconstruct the 3D structure of a non-rigid coronary tree and estimate point correspondences between an input X-ray image and a reference 3D shape. At the core of our approach lies an optimization scheme that iteratively fits a generative 3D model of increasing complexity and guides the matching process. As a result, and in contrast to existing approaches that assume rigidity or quasi-rigidity of the structure, our method is able to retrieve large non-linear deformations even when the input data is corrupted by the presence of noise and partial occlusions. We extensively evaluate our approach under synthetic and real data and demonstrate a remarkable improvement compared to state-of-the-art.

Deformation and Illumination Invariant Feature Point Descriptor
F.Moreno-Noguer
Conference on Computer Vision and Pattern Recognition (CVPR), 2011

Paper  Abstract  Bibtex

@inproceedings{Moreno_cvpr2011a,
title = {Deformation and Illumination Invariant Feature Point Descriptor},
author = {F. Moreno-Noguer},
booktitle = {Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR)},
pages = {1593-1600},
year = {2011}
}

Recent advances in 3D shape recognition have shown that kernels based on diffusion geometry can be effectively used to describe local features of deforming surfaces. In this paper, we introduce a new framework that allows using these kernels on 2D local patches, yielding a novel feature point descriptor that is both invariant to non-rigid image deformations and illumination changes. In order to build the descriptor, 2D image patches are embedded as 3D surfaces, by multiplying the intensity level by an arbitrarily large and constant weight that favors anisotropic diffusion and retains the gradient magnitude information. Patches are then described in terms of a heat kernel signature, which is made invariant to intensity changes, rotation and scaling. The resulting feature point descriptor is proven to be significantly more discriminative than state of the art ones, even those which are specifically designed for describing non-rigid image deformations.

Probabilistic Simultaneous Pose and Non-Rigid Shape
F.Moreno-Noguer and J.M.Porta
Conference on Computer Vision and Pattern Recognition (CVPR), 2011

Paper  Abstract  Bibtex

@inproceedings{Moreno_cvpr2011b,
title = {Probabilistic Simultaneous Pose and Non-Rigid Shape},
author = {F. Moreno-Noguer and J.M. Porta},
booktitle = {Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR)},
pages = {1289-1296},
year = {2011}
}

We present an algorithm to simultaneously recover non-rigid shape and camera poses from point correspondences between a reference shape and a sequence of input images. The key novel contribution of our approach is in bringing the tools of the probabilistic SLAM methodology from a rigid to a deformable domain. Under the assumption that the shape may be represented as a weighted sum of deformation modes, we show that the problem of estimating the modal weights along with the camera poses may be probabilistically formulated as a maximum a posteriori estimate and solved using an iterative least squares optimization. An extensive evaluation on synthetic and real data shows that our approach has several significant advantages over current approaches, such as performing robustly under large amounts of noise and outliers, and requiring neither points to be tracked over the whole sequence nor initializations close to the ground-truth solution.

2010

Simultaneous Pose, Correspondence and Non-Rigid Shape
J.Sanchez, J.Östlund, P.Fua and F.Moreno-Noguer
Conference on Computer Vision and Pattern Recognition (CVPR), 2010

Paper  Abstract  Bibtex

@inproceedings{Sanchez_cvpr2010,
title = {Simultaneous Pose, Correspondence and Non-Rigid Shape},
author = {J. Sanchez and J. Östlund and P. Fua and F. Moreno-Noguer},
booktitle = {Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR)},
pages = {1189-1196},
year = {2010}
}

Recent works have shown that the 3D shape of non-rigid surfaces can be accurately retrieved from a single image given a set of 3D-to-2D correspondences between that image and another one for which the shape is known. However, existing approaches assume that such correspondences can be readily established, which is not necessarily true when large deformations produce significant appearance changes between the input and the reference images. Furthermore, it is either assumed that the pose of the camera is known, or the estimated solution is pose-ambiguous. In this paper we relax all these assumptions and, given a set of 3D and 2D unmatched points, we present an approach to simultaneously solve their correspondences, compute the camera pose and retrieve the shape of the surface in the input image. This is achieved by introducing weak priors on the pose and shape that we model as Gaussian Mixtures. By combining them into a Kalman filter we can progressively reduce the number of 2D candidates that can be potentially matched to each 3D point, while pose and shape are refined. This lets us perform a complete and efficient exploration of the solution space and retain the best solution.

Efficient Rotation Invariant Object Detection using Boosted Random Ferns
M.Villamizar, F.Moreno-Noguer, J.Andrade-Cetto and A.Sanfeliu
Conference on Computer Vision and Pattern Recognition (CVPR), 2010

Paper  Abstract  Bibtex

@inproceedings{Villamizar_cvpr2010,
title = {Efficient Rotation Invariant Object Detection using Boosted Random Ferns},
author = {M. Villamizar and F. Moreno-Noguer and J. Andrade-Cetto and A. Sanfeliu},
booktitle = {Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR)},
pages = {1038-1045},
year = {2010}
}

We present a new approach for building an efficient and robust classifier for the two-class problem, which localizes objects that may appear in the image under different orientations. In contrast to other works that address this problem using multiple classifiers, each one specialized for a specific orientation, we propose a simple two-step approach with an estimation stage and a classification stage. The estimator yields an initial set of potential object poses that are then validated by the classifier. This methodology allows reducing the time complexity of the algorithm while classification results remain high. The classifier we use in both stages is based on a boosted combination of Random Ferns over local histograms of oriented gradients (HOGs), which we compute during a pre-processing step. Both the use of supervised learning and working in the gradient space make our approach robust while being efficient at run-time. We show these properties by thorough testing on standard databases and on a new database made of motorbikes under planar rotations, and with challenging conditions such as cluttered backgrounds, changing illumination conditions and partial occlusions.
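
A single Random Fern is just a handful of binary tests whose joint outcome indexes a table of class statistics; the boosted classifier above combines many such ferns computed over HOG features. The class below is a generic toy implementation of that building block, not the paper's code.

import numpy as np

class RandomFern:
    # A few random pairwise comparisons on a feature vector; their outcomes form an
    # index into a table of Laplace-smoothed per-class counts.
    def __init__(self, dim, n_tests=8, n_classes=2):
        self.pairs = np.random.randint(0, dim, size=(n_tests, 2))  # feature pairs to compare
        self.counts = np.ones((2 ** n_tests, n_classes))

    def index(self, f):
        bits = (f[self.pairs[:, 0]] > f[self.pairs[:, 1]]).astype(int)
        return int((bits * (2 ** np.arange(len(bits)))).sum())

    def update(self, f, label):
        self.counts[self.index(f), label] += 1          # accumulate training statistics

    def posterior(self, f):
        c = self.counts[self.index(f)]
        return c / c.sum()                              # class posterior for a test sample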

Exploring Ambiguities for Monocular Non-Rigid Shape Estimation
F.Moreno-Noguer, J.M.Porta and P.Fua
European Conference on Computer Vision (ECCV), 2010

Paper  Abstract  Bibtex

@inproceedings{Moreno_eccv2010,
title = {Exploring Ambiguities for Monocular Non-Rigid Shape Estimation},
author = {F. Moreno-Noguer and J.M. Porta and P. Fua},
booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
volume = {6313},
series = {Lecture Notes in Computer Science},
pages = {370-383},
year = {2010}
}

Recovering the 3D shape of deformable surfaces from single images is difficult because many different shapes have very similar projections. This is commonly addressed by restricting the set of possible shapes to linear combinations of deformation modes and by imposing additional geometric constraints. Unfortunately, because image measurements are noisy, such constraints do not always guarantee that the correct shape will be recovered. To overcome this limitation, we introduce an efficient approach to exploring the set of solutions of an objective function based on point-correspondences and to proposing a small set of candidate 3D shapes. This allows the use of additional image information to choose the best one. As a proof of concept, we use either motion or shading cues to this end and show that we can handle a complex objective function without having to solve a difficult non-linear minimization problem.

Combining Geometric and Appearance Priors for Robust Homography Estimation
E.Serradell, M.Özuysal, V.Lepetit, P.Fua and F.Moreno-Noguer
European Conference on Computer Vision (ECCV), 2010

Paper  Abstract  Bibtex

@inproceedings{Serradell_eccv2010,
title = {Combining Geometric and Appearance Priors for Robust Homography Estimation},
author = {E. Serradell and M. Özuysal and V. Lepetit and P. Fua and F. Moreno-Noguer},
booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
volume = {6313},
series = {Lecture Notes in Computer Science},
pages = {58-72},
year = {2010}
}

The homography between a pair of images is typically computed from keypoint correspondences, which are established using image descriptors. When these descriptors are not reliable, either because of repetitive patterns or large amounts of clutter, additional priors need to be considered. The Blind PnP algorithm makes use of geometric priors to guide the search for matches while computing camera pose. Inspired by this, we propose a novel approach for homography estimation that combines geometric priors with appearance priors of ambiguous descriptors. More specifically, for each point we retain its best candidates according to appearance. We then prune the set of potential matches by iteratively shrinking the regions of the image that are consistent with the geometric prior. We can then successfully compute homographies between pairs of images containing highly repetitive patterns and even under oblique viewing conditions.

2009

EPnP: An Accurate O(n) Solution to the PnP Problem
V.Lepetit, F.Moreno-Noguer and P.Fua
International Journal of Computer Vision (IJCV), 2009

Paper  Abstract  Bibtex

@article{Lepetit_ijcv2009,
title = {{EPnP}: An Accurate O(n) Solution to the PnP Problem},
author = {V. Lepetit and F. Moreno-Noguer and P. Fua},
journal = {International Journal of Computer Vision (IJCV)},
volume = {81},
number = {2},
issn = {0920-5691},
pages = {155-166},
doi = {https://doi.org/10.1007/s11263-008-0152-6},
year = {2009}
}

We propose a non-iterative solution to the PnP problem —the estimation of the pose of a calibrated camera from n 3D-to-2D point correspondences— whose computational complexity grows linearly with n. This is in contrast to state-of-the-art methods that are O(n^5) or even O(n^8), without being more accurate. Our method is applicable for all n ≥ 4 and handles properly both planar and non-planar configurations. Our central idea is to express the n 3D points as a weighted sum of four virtual control points. The problem then reduces to estimating the coordinates of these control points in the camera referential, which can be done in O(n) time by expressing these coordinates as weighted sum of the eigenvectors of a 12 × 12 matrix and solving a small constant number of quadratic equations to pick the right weights. Furthermore, if maximal precision is required, the output of the closed-form solution can be used to initialize a Gauss-Newton scheme, which improves accuracy with negligible amount of additional time. The advantages of our method are demonstrated by thorough testing on both synthetic and real-data.
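
The change of variables at the heart of the method is easy to reproduce: every 3D point is written as a barycentric combination of four control points, and the same weights carry over to the unknown camera-frame coordinates. Below is a minimal numpy sketch of that step (illustrative, not the released EPnP code).

import numpy as np

def barycentric_weights(points_w, ctrl_w):
    # points_w: (n, 3) world points; ctrl_w: (4, 3) non-coplanar control points.
    # Returns (n, 4) weights alpha with points_w[i] = alpha[i] @ ctrl_w and sum(alpha[i]) = 1.
    C = np.vstack([ctrl_w.T, np.ones(4)])                 # homogenized control points, (4, 4)
    P = np.vstack([points_w.T, np.ones(len(points_w))])   # homogenized world points, (4, n)
    return np.linalg.solve(C, P).T

# Control points are typically the centroid plus the principal directions of the cloud;
# here an axis-aligned choice is enough to illustrate that the weights exactly reproduce the points.
pts = np.random.rand(20, 3)
ctrl = np.vstack([pts.mean(0), pts.mean(0) + np.eye(3)])
alpha = barycentric_weights(pts, ctrl)
assert np.allclose(alpha @ ctrl, pts) and np.allclose(alpha.sum(1), 1.0)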

Capturing 3D Stretchable Surfaces from Single Images in Closed Form
F.Moreno-Noguer, M.Salzmann, V.Lepetit and P.Fua
Conference on Computer Vision and Pattern Recognition (CVPR), 2009

Paper  Abstract  Video  Bibtex

@inproceedings{Moreno_cvpr2009,
title = {Capturing 3D Stretchable Surfaces from Single Images in Closed Form},
author = {F. Moreno-Noguer and M. Salzmann and V. Lepetit and P. Fua},
booktitle = {Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR)},
pages = {1842-1849},
year = {2009}
}

We present a closed-form solution to the problem of recovering the 3D shape of a non-rigid potentially stretchable surface from 3D-to-2D correspondences. In other words, we can reconstruct a surface from a single image without a priori knowledge of its deformations in that image. State-of-the-art solutions to non-rigid 3D shape recovery rely on the fact that distances between neighboring surface points must be preserved and are therefore limited to inelastic surfaces. Here, we show that replacing the inextensibility constraints by shading ones removes this limitation while still allowing 3D reconstruction in closed-form. We demonstrate our method and compare it to an earlier one using both synthetic and real data.

2008

Dependent Multiple Cue Integration for Robust Tracking
F.Moreno-Noguer, D.Samaras and A.Sanfeliu
IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 2008

Paper  Abstract  Bibtex

@article{Moreno_pami2008,
title = {Dependent Multiple Cue Integration for Robust Tracking},
author = {F. Moreno-Noguer and D. Samaras and A. Sanfeliu},
journal = {IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI)},
volume = {30},
number = {4},
issn = {0162-8828},
pages = {670-685},
doi = {http://doi.ieeecomputersociety.org/10.1109/TPAMI.2007.70727},
year = {2008}
}

We propose a new technique for fusing multiple cues to robustly segment an object from its background in video sequences that suffer from abrupt changes of both illumination and position of the target. Robustness is achieved by the integration of appearance and geometric object features and by their estimation using Bayesian filters, such as Kalman or particle filters. In particular, each filter estimates the state of a specific object feature, conditionally dependent on another feature estimated by a distinct filter. This dependence provides improved target representations, permitting us to segment it out from the background even in nonstationary sequences. Considering that the procedure of the Bayesian filters may be described by a “hypotheses generation - hypotheses correction” strategy, the major novelty of our methodology compared to previous approaches is that the mutual dependence between filters is considered during the feature observation, that is, into the “hypotheses-correction” stage, instead of considering it when generating the hypotheses. This proves to be much more effective in terms of accuracy and reliability. The proposed method is analytically justified and applied to develop a robust tracking system that adapts online and simultaneously the color space where the image points are represented, the color distributions, the contour of the object, and its bounding box. Results with synthetic data and real video sequences demonstrate the robustness and versatility of our method.

Pose Priors for Simultaneously Solving Alignment and Correspondence
F.Moreno-Noguer, V.Lepetit and P.Fua
European Conference on Computer Vision (ECCV), 2008

Paper  Abstract  Code  Bibtex

@inproceedings{Moreno_eccv2008,
title = {Pose Priors for Simultaneously Solving Alignment and Correspondence},
author = {F. Moreno-Noguer and V. Lepetit and P. Fua},
booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
volume = {5303},
series = {Lecture Notes in Computer Science},
pages = {405-418},
year = {2008}
}

Estimating a camera pose given a set of 3D-object and 2D-image feature points is a well understood problem when correspondences are given. However, when such correspondences cannot be established a priori, one must simultaneously compute them along with the pose. Most current approaches to solving this problem are too computationally intensive to be practical. An interesting exception is the SoftPosit algorithm, that looks for the solution as the minimum of a suitable objective function. It is arguably one of the best algorithms but its iterative nature means it can fail in the presence of clutter, occlusions, or repetitive patterns. In this paper, we propose an approach that overcomes this limitation by taking advantage of the fact that, in practice, some prior on the camera pose is often available. We model it as a Gaussian Mixture Model that we progressively refine by hypothesizing new correspondences. This rapidly reduces the number of potential matches for each 3D point and lets us explore the pose space more thoroughly than SoftPosit at a similar computational cost. We will demonstrate the superior performance of our approach on both synthetic and real data.

Closed-Form Solution to Non-Rigid 3D Surface Detection
M.Salzmann, F.Moreno-Noguer, V.Lepetit and P.Fua
European Conference on Computer Vision (ECCV), 2008

Paper  Abstract  Video  Bibtex

@inproceedings{Salzmann_eccv2008,
title = {Closed-Form Solution to Non-Rigid 3D Surface Detection},
author = {M. Salzmann and F. Moreno-Noguer and V. Lepetit and P. Fua},
booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
volume = {5303},
series = {Lecture Notes in Computer Science},
pages = {581-594},
year = {2008}
}

We present a closed-form solution to the problem of recovering the 3D shape of a non-rigid inelastic surface from 3D-to-2D correspondences. This lets us detect and reconstruct such a surface by matching individual images against a reference configuration, which is in contrast to all existing approaches that require initial shape estimates and track deformations from image to image. We represent the surface as a mesh, and write the constraints provided by the correspondences as a linear system whose solution we express as a weighted sum of eigenvectors. Obtaining the weights then amounts to solving a set of quadratic equations accounting for inextensibility constraints between neighboring mesh vertices. Since available closed-form solutions to quadratic systems fail when there are too many variables, we reduce the number of unknowns by expressing the deformations as a linear combination of modes. The overall closed-form solution then becomes tractable even for complex deformations that require many modes.

2007

Active Refocusing of Images and Videos
F.Moreno-Noguer, P.N.Belhumeur and S.K.Nayar
ACM Transactions on Graphics (SIGGRAPH), 2007

Paper  Abstract  Video  Bibtex

@article{Moreno_siggraph2007,
title = {Active Refocusing of Images and Videos},
author = {F. Moreno-Noguer and P.N. Belhumeur and S.K. Nayar},
journal = {ACM Transactions on Graphics (SIGGRAPH)},
volume = {26},
number = {3},
issn = {0730-0301},
pages = {463-475},
doi = {10.1145/1276377.1276461},
year = {2007}
}

We present a system for refocusing images and videos of dynamic scenes using a novel, single-view depth estimation method. Our method for obtaining depth is based on the defocus of a sparse set of dots projected onto the scene. In contrast to other active illumination techniques, the projected pattern of dots can be removed from each captured image and its brightness easily controlled in order to avoid under- or over-exposure. The depths corresponding to the projected dots and a color segmentation of the image are used to compute an approximate depth map of the scene with clean region boundaries. The depth map is used to refocus the acquired image after the dots are removed, simulating realistic depth of field effects. Experiments on a wide variety of scenes, including close-ups and live action, demonstrate the effectiveness of our method.
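
The refocusing itself amounts to blurring each pixel according to how far its estimated depth lies from the chosen focal plane. A crude layered approximation of that idea is sketched below with numpy and scipy (the rendering in the paper is more careful about layer boundaries and occlusions); the parameters are assumptions.

import numpy as np
from scipy.ndimage import gaussian_filter

def refocus(image, depth, focal_depth, blur_scale=2.0, n_layers=8):
    # image: (H, W, 3) array; depth: (H, W) per-pixel depth map.
    # Each depth layer is blurred with a kernel proportional to its distance from the focal plane.
    out = np.zeros_like(image, dtype=float)
    edges = np.linspace(depth.min(), depth.max(), n_layers + 1)
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (depth >= lo) & (depth <= hi)
        sigma = blur_scale * abs((lo + hi) / 2.0 - focal_depth)
        blurred = gaussian_filter(image.astype(float), sigma=(sigma, sigma, 0))
        out[mask] = blurred[mask]
    return out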

Accurate Non-Iterative O(n) Solution to the PnP Problem
F.Moreno-Noguer, V.Lepetit and P.Fua
International Conference on Computer Vision (ICCV), 2007

Paper  Abstract  Code Matlab  Code C++  Bibtex

@inproceedings{Moreno_iccv2007,
title = {Accurate Non-Iterative O(n) Solution to the PnP Problem},
author = {F. Moreno-Noguer and V. Lepetit and P. Fua},
booktitle = {Proceedings of the International Conference on Computer Vision (ICCV)},
pages = {1-8},
year = {2007}
}

We propose a non-iterative solution to the PnP problem —the estimation of the pose of a calibrated camera from n 3D-to-2D point correspondences— whose computational complexity grows linearly with n. This is in contrast to state-of-the-art methods that are O(n^5) or even O(n^8), without being more accurate. Our method is applicable for all n ≥ 4 and handles properly both planar and non-planar configurations. Our central idea is to express the n 3D points as a weighted sum of four virtual control points. The problem then reduces to estimating the coordinates of these control points in the camera referential, which can be done in O(n) time by expressing these coordinates as weighted sum of the eigenvectors of a 12 × 12 matrix and solving a small constant number of quadratic equations to pick the right weights. The advantages of our method are demonstrated by thorough testing on both synthetic and real data.

2005

Integration of Conditionally Dependent Object Features for Robust Figure/Background Segmentation
F.Moreno-Noguer, A.Sanfeliu and D.Samaras
International Conference on Computer Vision (ICCV), 2005

Paper  Abstract  Video  Bibtex

@inproceedings{Moreno_iccv2005,
title = {Integration of Conditionally Dependent Object Features for Robust Figure/Background Segmentation},
author = {F. Moreno-Noguer and A. Sanfeliu and D. Samaras},
booktitle = {Proceedings of the International Conference on Computer Vision (ICCV)},
pages = {1713-1720},
year = {2005}
}

We propose a new technique for fusing multiple cues to robustly segment an object from its background in video sequences that suffer from abrupt changes of both illumination and position of the target. Robustness is achieved by the integration of appearance and geometric object features and by their description using particle filters. Previous approaches assume independence of the object cues or apply the particle filter formulation to only one of the features, and assume a smooth change in the rest, which can prove very limiting, especially when the state of some features needs to be updated using other cues or when their dynamics follow non-linear and unpredictable paths. Our technique offers a general framework to model the probabilistic relationship between features. The proposed method is analytically justified and applied to develop a robust tracking system that adapts online and simultaneously the colorspace where the image points are represented, the color distributions, and the contour of the object. Results with synthetic data and real video sequences demonstrate the robustness and versatility of our method.