Supplementary Material



Formula Derivation, Proof and Clarification
Derivation of Formula (12) First, we claim that $P(A_{i,h_i}) = 1$ holds for all $i \in [1, 2n_v - 1]$. This is intuitive because $a(i, h_i)$ always points to the root node, and any prediction falls into the interval corresponding to it.
Besides, we can easily point out that
$$P(A_{i,k} A_{i,k+1} \cdots A_{i,h_i}) = P(A_{i,k}) \quad (1)$$
This is quite trivial because $A_{i,k+\Delta k}$ ($\Delta k \ge 0$) always occurs whenever $A_{i,k}$ occurs. Afterwards, we expand the term $P(\tau_s \in \tau_i)$ and simplify it:
$$P(\tau_s \in \tau_i) = P(A_{i,0}) = P(A_{i,0} A_{i,1} \cdots A_{i,h_i}) \quad (2)$$
Therefore, by the chain rule of probability,
$$P(\tau_s \in \tau_i) = P(A_{i,0} \mid A_{i,1} \cdots A_{i,h_i}) \, P(A_{i,1} \mid A_{i,2} \cdots A_{i,h_i}) \cdots P(A_{i,h_i}) \quad (3)$$
and the final formula can be worked out through the cascaded decision navigation procedure.
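The factorization above can be sanity-checked numerically. The sketch below is a toy example, not the paper's model: the intervals and tree depth are made up. Because the interval of a node is nested inside its parent's interval, the direct probability of the leaf event equals the cascaded product of per-level conditionals:

```python
from fractions import Fraction

# Nested intervals along a root-to-leaf path of a binary tree over
# [0, 1): level h (root) down to level 0 (leaf).  A_k is the event that
# a uniformly random point falls inside the level-k interval.
intervals = [
    (Fraction(0), Fraction(1)),        # A_2: root interval, always occurs
    (Fraction(0), Fraction(1, 2)),     # A_1: left child
    (Fraction(1, 4), Fraction(1, 2)),  # A_0: leaf interval
]

def prob(iv):
    lo, hi = iv
    return hi - lo  # P(A_k) for a uniform point on [0, 1)

# Direct probability of the leaf event A_0 ...
p_leaf = prob(intervals[-1])

# ... equals the cascaded product of navigation decisions
# P(A_0 | A_1) * P(A_1 | A_2) * P(A_2): since the intervals are nested,
# each conditional is the ratio of child to parent interval length.
p_cascaded = prob(intervals[0])
for parent, child in zip(intervals, intervals[1:]):
    p_cascaded *= prob(child) / prob(parent)

print(p_leaf, p_cascaded)  # both 1/4
```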
Mathematical induction on Formula (15) In order to verify that the cumulative results of decision navigation on each level conform to the definition of a probability distribution, we need to prove that the probabilities accumulated on every level sum to one. For $h = 0$, the claim holds, since the whole probability mass is assigned to the root. Assume that the probabilities on level $h$ sum to one. Since the navigator distributes the probability of each node over its children, the probabilities on level $h + 1$ sum to the same total, which proves the claim for level $h + 1$. Therefore, the cumulative results of decision navigation on each level conform to the definition of a probability distribution, and we do not need to perform any extra normalization on them.
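The inductive argument can be illustrated with a toy cascaded navigator (a minimal sketch with random gates standing in for the learned split decisions; the depth and gate values are arbitrary):

```python
import random

random.seed(0)

# Toy cascaded navigator over a perfect binary tree: every node splits
# its probability mass between its two children with a gate in [0, 1].
depth = 4
levels = [[1.0]]  # level 0: the root carries probability 1
for _ in range(depth):
    nxt = []
    for p in levels[-1]:
        gate = random.random()  # stand-in for a learned split decision
        nxt.extend([p * gate, p * (1.0 - gate)])
    levels.append(nxt)

# Splitting preserves total mass, so every level sums to 1 and no
# extra normalization is required.
for h, level in enumerate(levels):
    print(h, sum(level))  # each level sums to 1 (up to float error)
```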
Further explanation of signal decomposition The process of signal decomposition can be regarded as an approximation of a wavelet or Fourier transformation. Although the basis function on each level is learnt by a multi-layer perceptron and does not have the properties held by most wavelet or Fourier basis functions, we can still consider it a composition of standard basis functions whose frequencies are equal to or less than the sampling frequency. Therefore, we actually conduct a rough transformation or decomposition by manually specifying the sampling frequency and the coefficients of the functions at each level, and generate a proper composition of various basis functions via learnable parameters.
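The underlying fact that any discretely sampled function is exactly a composition of sinusoids with frequencies up to the sampling (Nyquist) limit can be demonstrated with a discrete Fourier transform. This is a generic NumPy sketch with an arbitrary test signal, unrelated to the paper's learned decomposers:

```python
import numpy as np

n = 64  # number of samples over one period
t = np.arange(n) / n
# Arbitrary sampled function: a Gaussian bump plus a low-frequency sine.
signal = np.exp(-((t - 0.3) ** 2) / 0.01) + 0.2 * np.sin(2 * np.pi * 5 * t)

# rfft writes the sampled signal exactly as a sum of sinusoidal basis
# functions whose frequencies do not exceed the Nyquist frequency.
coeffs = np.fft.rfft(signal)
reconstructed = np.fft.irfft(coeffs, n=n)

print(np.max(np.abs(signal - reconstructed)))  # numerically zero
```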

More Details for Experiment Analysis
In this section, we further elaborate on the failure case in the ActivityNet Captions [2] dataset and discuss different cases in the Charades-STA [1] and TACoS [3] datasets to conduct a comprehensive qualitative evaluation. Moreover, some analysis of the number of parameters and the selection of hyper-parameters is also included in this part.
Analysis of the failure case in the ActivityNet Captions dataset After being flipped or rotated, the patterns (including appearances and motions) of the plausible actions (i.e., rope traverse and declined pull-up) are quite similar to those of the real one (i.e., push-up), as can be observed in Figure 1. In this situation, our model misidentifies these different actions as variants of the push-up under a change of perspective.

Qualitative analysis on the Charades-STA dataset As shown in Figure 2, the success case indicates that our model can not only capture the right subject and object but also correctly identify the target action and relationship in the video. However, in the failure case, where the boy sequentially performs the actions of opening the curtain and window, picking up the phone, and shooting out of the window, our model fails to distinguish between opening the window and shooting out of the window. Looking into the original video, we find that it is also difficult for humans to distinguish them directly at the given resolution, and the action of picking up the phone actually gives a strong hint for inferring the subsequent behavior. This observation suggests that high-resolution video or high-quality features would help to boost the performance. Besides, the ability to perform common-sense reasoning may be beneficial as well.
Qualitative analysis on the TACoS dataset The success case and failure case for the TACoS dataset are illustrated in Figure 3. In this dataset, the scenes in different video segments are almost identical and the various actions are quite similar, which results in only slight differences between adjacent frames. The success case demonstrates that our model can handle this problem well. In the failure case, however, our model fails to capture the significant phrase "the other", thus mistaking the process of peeling the first kiwi for the target result.
Hyper-parameter selection for the frame number Considering the average frame number of these three datasets, Figure 4 shows the impact of different frame numbers on the model performance.

Analysis of the number of parameters The numbers of parameters for 2D-TAN [4] and different variants of our model are shown in Table 1. Due to the parameter-sharing mechanism used in most components, every time the frame number doubles, we only need to add an extra group consisting of a navigator and a decomposer, each of which is essentially a multi-layer perceptron. Therefore, the number of parameters in our model stays almost unchanged as the length of the video increases.
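This scaling behavior can be sketched with a simple parameter count. The layer sizes below are hypothetical (the real navigator and decomposer dimensions are not specified here); the point is only that doubling the frame count adds one tree level, i.e. one extra navigator/decomposer pair, so the parameter count grows logarithmically rather than linearly with video length:

```python
def mlp_params(sizes):
    """Weights plus biases of a fully connected MLP with the given layer sizes."""
    return sum(a * b + b for a, b in zip(sizes, sizes[1:]))

HIDDEN = [512, 256, 512]  # hypothetical per-level MLP architecture

def model_params(num_frames):
    # One navigator MLP and one decomposer MLP per tree level;
    # a binary tree over num_frames leaves has log2(num_frames) levels.
    levels = num_frames.bit_length() - 1
    return levels * 2 * mlp_params(HIDDEN)

# Each doubling of the frame number adds exactly one pair of MLPs.
for frames in (64, 128, 256, 512):
    print(frames, model_params(frames))
```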