With the emergence of seismic resilience concepts, the objective of seismic building design is shifting from life safety toward limiting earthquake-induced functional consequences. Seismic damage to nonstructural components (NSCs) is a key factor in economic loss assessment and post-earthquake functional recovery. To rapidly assess such damage, this paper leverages video-comprehension techniques to develop a novel deep learning model, the two-pathway vision transformer for damage state recognition (TPViT-DMSR), which recognizes the damage states of freestanding NSCs from video footage. The model comprises a slow pathway and a fast pathway that analyze an object’s movement at different speeds. The two pathways are fused through a unidirectional connection, which allows damage information to be shared more effectively between them. To improve the robustness of the model, a trajectory attention mechanism is integrated. To demonstrate the model’s applicability and efficacy, a comprehensive video dataset describing the movement behavior of various freestanding NSCs is compiled from a series of shaking table tests. On this dataset, the TPViT-DMSR model performs well, achieving a mean average precision of 74.87%. The model is also applied to video footage collected from past earthquake events, and the results indicate that it can deliver a reliable estimate of the damage state of NSCs.