We present a vision-inertial system that combines two RGB-Depth devices with a wearable inertial measurement unit to detect activities of daily living. From the multi-view videos, we extract dense trajectories enriched with a histogram-of-normals descriptor computed from the depth cue and bag them into multi-view codebooks. In the subsequent classification step, a multi-class support vector machine with an RBF-χ² kernel combines the descriptors at the kernel level. Action detection in the videos uses a sliding-window approach. From the inertial data, collected by the wearable placed on the user's dominant wrist, we extract acceleration, rotation-angle, and jerk features. During gesture spotting, dynamic time warping is applied, and the alignment costs to a set of pre-selected gesture sub-classes are thresholded to determine possible detections. The outputs of the two modules are combined by late fusion. The system is validated in a real-case scenario with elderly residents of an elder home. Learning-based fusion improves on the results of the single modalities, demonstrating the benefit of the multimodal approach.
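
The gesture-spotting step (dynamic time warping followed by thresholding of the alignment costs) can be illustrated with a minimal sketch. It is not the system's implementation: the function names `dtw_cost` and `spot_gestures`, the Euclidean local distance, the toy one-dimensional templates, and the threshold values are all assumptions made for illustration only.

```python
import numpy as np


def _as_frames(seq):
    # Treat a 1-D input as a sequence of scalar features, one per time step.
    arr = np.asarray(seq, dtype=float)
    return arr.reshape(-1, 1) if arr.ndim == 1 else arr


def dtw_cost(query, template):
    """Accumulated DTW alignment cost (Euclidean local distance, assumed)."""
    q, t = _as_frames(query), _as_frames(template)
    n, m = len(q), len(t)
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(q[i - 1] - t[j - 1])
            # Extend the cheapest of the three admissible predecessor paths.
            acc[i, j] = d + min(acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
    return float(acc[n, m])


def spot_gestures(segment, templates, thresholds):
    """Return (name, cost) pairs for templates whose cost is within threshold."""
    hits = []
    for name, template in templates.items():
        cost = dtw_cost(segment, template)
        if cost <= thresholds[name]:
            hits.append((name, cost))
    return hits


# Hypothetical gesture sub-class templates and per-class thresholds.
templates = {"drink": [0, 1, 2, 1, 0], "wave": [0, 2, 0, 2, 0]}
thresholds = {"drink": 1.0, "wave": 1.0}

# A time-stretched version of the "drink" template.
segment = [0, 0, 1, 2, 2, 1, 0]
print(spot_gestures(segment, templates, thresholds))  # [('drink', 0.0)]
```

In this toy case the segment warps onto the "drink" template at zero cost and is reported as a possible detection, while its cost to "wave" exceeds the threshold. Per-class thresholds are used here because alignment costs are generally not comparable across templates of different length and variability; how the thresholds are actually chosen is not specified above.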