\caption{The original SCMS algorithm}\label{alg:scms-original}
\end{algorithm}
Note that algorithm~\ref{alg:scms-original} can be slightly modified to use only the posterior information ($\post_n$), initialized with the prior information ($\prio_n$) before the first iteration. Line~\ref{alg:scms-original-posterior-update} becomes $\post_n^{(l)}\leftarrow\post_n^{(l-1)}+\sum\limits_{m\in\Tg(n)}\beta_{m,n}^{(l)}$ with $\post_n^{(0)}=\prio_n$. The impact on the decoding performance is negligible but the savings in terms of storage space are significant.
Algorithm~\ref{alg:scms-original} can also be transformed into the equivalent algorithm~\ref{alg:scms-modified}, where the CN-to-VN ($\beta$) and VN-to-CN ($\alpha$) message computations are swapped, their initializations are changed and the VN-to-CN messages after the erasure are denoted $\lambda$.
The layered schedule proposed in~\cite{mansour2006turbo} speeds up the propagation of the posterior information and reduces the average number of iterations. Algorithm~\ref{alg:scms-modified-layered} depicts this proposal applied to the modified SCMS algorithm~\ref{alg:scms-modified}.
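For illustration, a minimal C skeleton of a layered decoding loop is sketched below; the function names, the early-termination test and the data layout (a single array of posterior LLRs indexed by variable node) are placeholders chosen for the sketch, not elements of algorithm~\ref{alg:scms-modified-layered}.
\begin{verbatim}
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-row SCMS kernel: reads and immediately refreshes the
 * posterior LLRs Q[n] of the variable nodes connected to row m.          */
extern void scms_row_update(int m, int16_t *Q);
/* Hypothetical early-termination test (all parity checks satisfied).     */
extern bool syndrome_is_zero(const int16_t *Q);

/* Layered schedule skeleton: Q is refreshed after every row, so the rows
 * processed later in the same iteration already see the updated posterior
 * information, which speeds up its propagation and lowers the average
 * number of iterations compared with a flooding schedule.                */
void decode_layered(int16_t *Q, int num_rows, int max_iter)
{
    for (int iter = 0; iter < max_iter; iter++) {
        for (int m = 0; m < num_rows; m++)
            scms_row_update(m, Q);
        if (syndrome_is_zero(Q))
            break;
    }
}
\end{verbatim}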
\caption{The modified and layered SCMS algorithm}\label{alg:scms-modified-layered}
\end{algorithm}
With the layered schedule (algorithm~\ref{alg:scms-modified-layered}), the posterior information $\post_n^{(l)}$ must be stored. The CN-to-VN messages $\beta_{m,n}^{(l)}$, on the other hand, depend only on a limited amount of information about the set of the $\alpha_{m,n}^{(l)}$: the first and second minima of their absolute values, the index of the first minimum, their signs and the product of their signs. The SCMS algorithm also requires that the signs of the $\alpha_{m,n}^{(l)}$ be stored, along with flags indicating that they have been erased. All in all, for each row $m$, the set of the $\alpha_{m,n}^{(l)}$ can be represented in the compact form $r_m^{(l)}=\{m_1,m_2,i,\bm{\alpha^s},sp,\bm{\alpha^e}\}$ (a possible memory layout is sketched after the list):
\begin{itemize}
\item $m_1=\min\limits_{n\in\Tg(m)}\lvert\alpha_{m,n}^{(l)}\rvert$: minimum of the absolute values
\item $i=\argmin\limits_{n\in\Tg(m)}\lvert\alpha_{m,n}^{(l)}\rvert$: index of $m_1$
\item $m_2=\min\limits_{n\in\Tg(m)\setminus i}\lvert\alpha_{m,n}^{(l)}\rvert$: second minimum of the absolute values
\item $\bm{\alpha^s}$, with $\alpha^s_n=\sign(\alpha_{m,n}^{(l)})$: vector of signs
\item $sp=\prod\limits_{n\in\Tg(m)}\sign(\alpha_{m,n}^{(l)})$: product of the signs
\item $\bm{\alpha^e}$, with $\alpha^e_n=\mathrm{true}$ if $\alpha_{m,n}^{(l)}$ has been erased: vector of erasure flags
\end{itemize}
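As an illustration, one possible in-memory layout of $r_m^{(l)}$ is sketched below in C. The field widths (8-bit magnitudes, at most 32 active variable nodes per row) are assumptions made for the sketch, not values prescribed by the algorithm.
\begin{verbatim}
#include <stdint.h>

/* Possible packed layout of r_m = {m1, m2, i, alpha^s, sp, alpha^e} for one
 * row m. The widths (8-bit magnitudes, row degree <= 32) are assumptions.  */
typedef struct {
    uint8_t  m1;      /* first minimum of the |alpha_{m,n}|                 */
    uint8_t  m2;      /* second minimum of the |alpha_{m,n}|                */
    uint8_t  i;       /* position of the first minimum within the row       */
    uint32_t sign;    /* alpha^s: one sign bit per active variable node     */
    uint8_t  sp;      /* product of the signs, stored as one bit (1 = -)    */
    uint32_t erased;  /* alpha^e: one erasure flag per active variable node */
} row_state_t;
\end{verbatim}
Whatever the row degree, such a structure occupies only a few bytes per row, instead of one quantized message per edge.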
Using this compact form, $\beta_{m,n}^{(l)}$ can be computed from $r_m^{(l)}$ as shown in algorithm~\ref{alg:scms-r2b}.
\caption{Computing $\beta_{m,n}^{(l)}$ from $r_m^{(l)}$ compact data structure}\label{alg:scms-r2b}
\end{algorithm}
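For illustration, a C sketch of this reconstruction is given below, using the field names of the \texttt{row\_state\_t} layout sketched earlier. It follows the usual min-sum convention: the magnitude of $\beta_{m,n}^{(l)}$ is $m_2$ at the position of the first minimum and $m_1$ elsewhere, and its sign is the sign product $sp$ with the message's own sign $\alpha^s_n$ removed. The erasure flags $\bm{\alpha^e}$ are left out of this sketch; the authoritative description is algorithm~\ref{alg:scms-r2b}.
\begin{verbatim}
#include <stdint.h>

/* Sketch of beta(r_m, n): rebuild the CN-to-VN message for the k-th active
 * variable node of row m (k indexes the position within the row). Field
 * names follow the row_state_t sketch above; the signed 16-bit return type
 * is an assumption.                                                        */
static inline int16_t beta_from_r(uint8_t m1, uint8_t m2, uint8_t i,
                                  uint32_t sign, uint8_t sp, uint8_t k)
{
    uint8_t mag    = (k == i) ? m2 : m1;   /* extrinsic minimum             */
    uint8_t s_k    = (sign >> k) & 1u;     /* stored sign of the k-th msg   */
    uint8_t s_beta = (sp ^ s_k) & 1u;      /* extrinsic sign (1 = negative) */
    return s_beta ? -(int16_t)mag : (int16_t)mag;
}
\end{verbatim}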
The SCMS algorithm with layered schedule can be reformulated as shown in algorithm~\ref{alg:scms-optimized}, where we denote by $\beta(r_m^{(l)},n)$ the computation of $\beta_{m,n}^{(l)}$ from $r_m^{(l)}$ and $n$ according to algorithm~\ref{alg:scms-r2b}, and by $\lambda_{m,n}^{(l)}$ the $\alpha_{m,n}^{(l)}$ after erasure.
\caption{The memory-optimized SCMS algorithm with layered schedule}\label{alg:scms-optimized}
\end{algorithm}
As can be seen in algorithm~\ref{alg:scms-optimized}, the processing of a row is split into two phases (a rough code sketch follows the list):
\begin{itemize}
\item In a first phase, the new VN-to-CN messages ($\alpha$) are computed (and possibly erased) from the posterior LLRs and the compact representation of the row information; the new compact representation of the row information is then created.
\item In a second phase, the posterior LLRs ($\post$) are updated using the new compact representation of the row information.
\end{itemize}
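The two phases of one row update could look as follows in C (scalar sketch with hypothetical names, reusing the \texttt{row\_state\_t} layout and the \texttt{beta\_from\_r} reconstruction sketched earlier). The quantization, the initialization of $r_m$ for the first iteration and the exact way the erasure rule and the posterior update are written down are assumptions of the sketch; the authoritative description remains algorithm~\ref{alg:scms-optimized}.
\begin{verbatim}
#include <stdint.h>
#include <stdlib.h>

#define MAX_ROW_DEG 32                  /* assumed maximum row degree        */

typedef struct {                        /* compact row state, as sketched above */
    uint8_t  m1, m2, i;
    uint32_t sign;
    uint8_t  sp;
    uint32_t erased;
} row_state_t;

/* beta(r_m, k) reconstruction, as sketched above.                           */
extern int16_t beta_from_r(uint8_t m1, uint8_t m2, uint8_t i,
                           uint32_t sign, uint8_t sp, uint8_t k);

/* Two-phase processing of one row (scalar sketch).
 * Q[]    : posterior LLRs, indexed by variable node.
 * cols[] : indices of the deg variable nodes connected to the row.
 * r      : compact row state, read (previous value) then overwritten.
 * For the first iteration, r is assumed initialized with m1 = m2 = 0 and all
 * erasure flags set, so that the old beta is zero and nothing gets erased.  */
void scms_process_row(int16_t *Q, const int *cols, int deg, row_state_t *r)
{
    int16_t     alpha[MAX_ROW_DEG];
    row_state_t nr = { .m1 = UINT8_MAX, .m2 = UINT8_MAX, .i = 0,
                       .sign = 0, .sp = 0, .erased = 0 };

    /* Phase 1: new VN-to-CN messages from the posteriors and the old compact
     * state, SCMS erasure rule, and construction of the new compact state.  */
    for (int k = 0; k < deg; k++) {
        int16_t beta_old = beta_from_r(r->m1, r->m2, r->i, r->sign, r->sp,
                                       (uint8_t)k);
        alpha[k] = (int16_t)(Q[cols[k]] - beta_old);
        uint8_t s = (alpha[k] < 0);

        /* Self-correction rule assumed here: erase a message whose sign
         * changed since the previous iteration, unless it was already erased. */
        uint8_t old_s      = (uint8_t)((r->sign   >> k) & 1u);
        uint8_t old_erased = (uint8_t)((r->erased >> k) & 1u);
        uint8_t erase      = (uint8_t)((s != old_s) && !old_erased);

        nr.sign |= (uint32_t)s << k;     /* alpha^s: sign of the new alpha   */
        nr.sp   ^= s;                    /* sp: running product of the signs */
        if (erase) nr.erased |= 1u << k; /* alpha^e                          */

        /* The check node sees lambda, i.e. alpha with erased messages at 0. */
        int mag = erase ? 0 : abs((int)alpha[k]);
        if (mag > UINT8_MAX) mag = UINT8_MAX;          /* assumed saturation */
        if ((uint8_t)mag < nr.m1) {
            nr.m2 = nr.m1;
            nr.m1 = (uint8_t)mag;
            nr.i  = (uint8_t)k;
        } else if ((uint8_t)mag < nr.m2) {
            nr.m2 = (uint8_t)mag;
        }
    }

    /* Phase 2: posterior update with the new compact row state (the unerased
     * alpha is used as the base here; algorithm scms-optimized may combine
     * lambda with the new beta instead).                                     */
    for (int k = 0; k < deg; k++)
        Q[cols[k]] = (int16_t)(alpha[k] + beta_from_r(nr.m1, nr.m2, nr.i,
                                                      nr.sign, nr.sp,
                                                      (uint8_t)k));
    *r = nr;
}
\end{verbatim}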
With structured quasi-cyclic parity-check matrices, a first level of parallelism can be exploited to speed up the decoding: all rows of a block row can be processed in parallel because the active variable nodes do not overlap between rows. In algorithm~\ref{alg:scms-optimized}, this means that several iterations of the loop over the rows can be processed in parallel. For 3GPP New Radio LDPC codes, for instance, up to 384 processing elements can run simultaneously.
The two phases of a row update are not independent: the second cannot start before the first one ends. With some codes, however, the first phase of a group of rows can be processed in parallel with the second phase of the previous group of rows. This is the case with the 3GPP New Radio LDPC codes.
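A minimal sketch of the first level of parallelism is given below, using OpenMP for the illustration and the hypothetical per-row kernel from the previous sketch; a hardware decoder would instead map the $Z$ rows of a block row onto up to $Z$ processing elements ($Z\le 384$ for 3GPP New Radio). The overlap of the first phase of one group of rows with the second phase of the previous group is a pipelining opportunity that this simple sketch does not show.
\begin{verbatim}
#include <stdint.h>

typedef struct {                        /* compact row state, as sketched above */
    uint8_t  m1, m2, i;
    uint32_t sign;
    uint8_t  sp;
    uint32_t erased;
} row_state_t;

/* Per-row two-phase kernel, as sketched above (hypothetical signature).     */
extern void scms_process_row(int16_t *Q, const int *cols, int deg,
                             row_state_t *r);

/* One block row of a quasi-cyclic code: its Z rows touch disjoint variable
 * nodes, so they can safely be processed in parallel.                       */
void process_block_row(int16_t *Q, const int *const *cols, const int *deg,
                       row_state_t *r, int Z)
{
    #pragma omp parallel for
    for (int z = 0; z < Z; z++)
        scms_process_row(Q, cols[z], deg[z], &r[z]);
}
\end{verbatim}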
Thanks to these two opportunities for parallel processing, the potential speed-up factor can be as large as $2\times384=768$.
\section{Architecture}