# Leveraging Photonic Interconnects for Scalable and Efficient Fully Homomorphic Encryption

Dewan Saiham, Di Wu, and Sazadur Rahman

Department of Electrical and Computer Engineering, University of Central Florida

{dewan.saiham, di.wu, mohammad.rahman}@ucf.edu

*Abstract*—Fully Homomorphic Encryption (FHE) facilitates secure computations on encrypted data but imposes significant demands on memory bandwidth and computational power. While current FHE accelerators focus on optimizing computation, they often face bandwidth limitations that result in performance bottlenecks, particularly in memory-intensive operations. This paper presents *OptoLink*, a scalable photonic interconnect architecture designed to address these bandwidth and latency challenges in FHE systems. *OptoLink* achieves a throughput of 1.6 TB/s with 128 channels, providing roughly 300 times the bandwidth of conventional electrical interconnects. The proposed architecture improves data throughput and scalability and reduces latency, making it an effective solution for meeting the high memory and data transfer requirements of modern FHE accelerators.

*Keywords*—Fully Homomorphic Encryption, Number Theoretic Transform, Wavelength Division Multiplexing, Memory Acceleration

## I. INTRODUCTION

Fully Homomorphic Encryption (FHE) represents a breakthrough in privacy-preserving computation, allowing encrypted data to be processed without revealing the plaintext. This capability is critical for safeguarding sensitive data in untrusted environments such as cloud computing, healthcare, and financial systems [1]. By performing operations on encrypted inputs and returning encrypted outputs, FHE ensures robust security even in the event of server breaches, as the decryption key remains confidential. Key computational tasks in FHE schemes, including integer-based and ring learning with errors (R-LWE) methods, involve resource-intensive operations such as large-integer and polynomial multiplications [2]. The Number Theoretic Transform (NTT), essential for modular polynomial multiplication, reduces complexity from $O(n^2)$ to $O(n \log n)$ [3]. However, implementing the NTT in hardware is challenging due to high memory bandwidth requirements and complex data access patterns [4]. Hardware acceleration using FPGA, ASIC, and Compute-in-Memory (CiM) platforms has improved efficiency but faces scalability limitations [5], [6].

The high bandwidth demands and intricate memory access patterns of the NTT often lead to read-after-write conflicts [7], which are further exacerbated by the large parameters required for secure FHE [8]. Pipelined and parallel NTT architectures have been proposed to increase efficiency [9], but they frequently lack programmability and adaptability across a range of security requirements. Managing dataflow effectively while resolving memory bandwidth and access conflicts remains a major challenge [10]. Although pipeline stalls have been employed to mitigate memory conflicts, the bandwidth requirements of large FHE parameters surpass the capabilities of traditional electrical interconnects. This necessitates novel interconnect solutions to enable scalable, high-performance FHE systems.

Fig. 1. Computational flow in fully homomorphic encryption (FHE).

To address these challenges, we propose *OptoLink*, a photonic interconnect architecture tailored for FHE applications. By replacing conventional electrical interconnects with optical alternatives, *OptoLink* alleviates bandwidth bottlenecks and simplifies data paths in NTT computations. Utilizing technologies such as wavelength-division multiplexing (WDM) and space-division multiplexing (SDM), photonic interconnects support high-throughput, low-latency communication and scalable multi-chiplet designs. While effective in domains like deep neural networks [11], their potential in FHE architectures remains largely unexplored.

The primary contributions of this work include:

- We identify memory bandwidth limitations as the key bottleneck in existing FHE accelerators, demonstrating that compute acceleration alone cannot ensure scalability.
- We propose *OptoLink*, a photonic interconnect architecture that reduces memory access conflicts and achieves high bandwidth for NTT operations. *OptoLink* supports scalable deployments with data rates tailored to FHE requirements (Sec. III).
- Using photonic process design kits (PDKs) and electronic photonic design automation (EPDA) tools such as Synopsys OptSim and OptoCompiler, we design a scalable *OptoLink* architecture that achieves 1.6 TB/s bandwidth with 128 optical channels, with potential for even higher throughput (Secs. III-D and IV).

The remainder of the paper is structured as follows: Sec. II outlines the background and motivation for *OptoLink*. Sec. III details the proposed architecture and its implementation. Sec. IV presents results and analysis, followed by conclusions in Sec. V.

<sup>\*</sup>This is the author's version of the paper presented at GOMACTech 2025.

## II. BACKGROUND AND MOTIVATION

## A. Number Theoretic Transform (NTT)

The NTT, a finite-field adaptation of the Fast Fourier Transform (FFT), enables efficient polynomial multiplication without the round-off errors associated with floating-point complex arithmetic. For a polynomial $a(x) = \sum_{i=0}^{n-1} a_i x^i$, the NTT is defined as:

$$\tilde{a}_i = \sum_{j=0}^{n-1} a_j \cdot \omega^{i \cdot j} \bmod q, \tag{1}$$

where  $\omega$  is a primitive *n*-th root of unity in the ring  $Z_q$ , satisfying  $\omega^n \equiv 1 \mod q$ , and *q* is a prime number such that  $q \equiv 1 \mod n$ . Polynomial multiplication using NTT involves transforming two polynomials *a* and *b* into the NTT domain, performing point-wise multiplication, and then applying the inverse NTT (INTT) to obtain the final result.

$$c = \mathrm{INTT}(\mathrm{NTT}(a) \circ \mathrm{NTT}(b)), \tag{2}$$

Here  $\circ$  denotes the element-wise multiplication. The INTT, which transforms data back from the NTT domain to its original form, is expressed as follows,

$$a_j = \frac{1}{n} \sum_{i=0}^{n-1} \tilde{a}_i \cdot \omega^{-i \cdot j} \mod q \tag{3}$$

where,  $\omega^{-i \cdot j}$  represents the inverse twiddle factor, and the scaling factor  $n^{-1} \mod q$  completes the transformation. This approach reduces the time complexity from  $O(n^2)$  in naive polynomial multiplication to  $O(n \log n)$ , similar to the FFT, but without floating-point precision errors. The Cooley-Tukey [12] and Gentleman-Sande [13] algorithms are widely used for computing NTT and INTT efficiently, further optimizing the polynomial multiplication process by breaking it down to smaller subproblems through butterfly operations.
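As a minimal sketch of Eqs. (1)-(3), the following Python code evaluates the transforms directly in $O(n^2)$ time rather than with butterfly operations. The parameters $n = 8$, $q = 17$, and $\omega = 9$ are illustrative choices that satisfy $q \equiv 1 \bmod n$; they are not taken from the evaluated design, and the function names are hypothetical. Note that with a plain $n$-th root of unity, Eq. (2) yields the product modulo $x^n - 1$ (a cyclic convolution), so the example zero-pads its inputs to avoid wrap-around.

```python
def ntt(a, omega, q):
    """Direct O(n^2) evaluation of Eq. (1)."""
    n = len(a)
    return [sum(a[j] * pow(omega, i * j, q) for j in range(n)) % q
            for i in range(n)]


def intt(a_tilde, omega, q):
    """Direct O(n^2) evaluation of Eq. (3)."""
    n = len(a_tilde)
    omega_inv = pow(omega, -1, q)  # modular inverse, Python 3.8+
    n_inv = pow(n, -1, q)          # scaling factor n^{-1} mod q
    return [n_inv * sum(a_tilde[i] * pow(omega_inv, i * j, q)
                        for i in range(n)) % q
            for j in range(n)]


def poly_mul(a, b, omega, q):
    """Eq. (2): point-wise product in the NTT domain, then INTT."""
    prod = [x * y % q for x, y in zip(ntt(a, omega, q), ntt(b, omega, q))]
    return intt(prod, omega, q)


# Illustrative parameters (assumptions): 9 is a primitive 8th root of unity mod 17.
N, Q, OMEGA = 8, 17, 9
a = [1, 2, 3, 4, 0, 0, 0, 0]   # 1 + 2x + 3x^2 + 4x^3, zero-padded
b = [5, 6, 7, 0, 0, 0, 0, 0]   # 5 + 6x + 7x^2, zero-padded
c = poly_mul(a, b, OMEGA, Q)   # coefficients of a(x) * b(x) mod 17
```

Because the padded inputs have a combined degree below $n$, `c` holds the ordinary product coefficients reduced modulo $q$.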

## B. Overview of Current FHE Accelerators

Since its inception in 2009 [14], FHE has made significant progress in lowering its original computational overhead, which was $10^9 \times$ that of conventional unencrypted processing. However, FHE operations are still $10{,}000 \times$ to $100{,}000 \times$ slower than plaintext computation, which makes them difficult to use in practice and emphasizes the need for specialized hardware accelerators. GPUs have become a feasible solution thanks to their parallel processing capabilities, which can provide speedups of up to $257 \times$ over CPUs [15]. TensorFHE [16] achieved efficiency comparable to ASIC-based systems, with a $1625.6 \times$ performance boost over CPUs and a $2.9 \times$ improvement over F1+. However, because GPUs are not designed specifically for FHE workloads, memory-intensive operations result in inefficiencies and significant power consumption. FPGAs offer increased flexibility by enabling custom implementations of FHE tasks such as the NTT. Accelerators like HEAX [7] and Poseidon [17] have shown significant performance improvements, with Poseidon offering an improvement of more than $1000 \times$ over GPU solutions. Additionally, designs like FAB optimize resource utilization to handle FHE operations effectively, showcasing the potential of FPGA-based acceleration [18]. ASIC accelerators designed specifically for FHE schemes such as BFV and CKKS perform much better. For example, ARK [19] and CraterLake [20] use innovations like hardware-accelerated bootstrapping and optimized data handling to resolve performance bottlenecks and support deeper computations, leading to significant speedups over GPU-based methods. Nevertheless, their large device sizes, high power consumption, and substantial memory requirements hinder the practical adoption of ASICs.

TABLE I Memory Requirements of current FHE accelerators

| Name            | Hardware Target | Supported Schemes | Bandwidth        |
|-----------------|-----------------|-------------------|------------------|
| 100x [15]       | GPU             | BFV               | 900 GB/s         |
| 100x [15]       | GPU             | CKKS              | 500 GB/s         |
| cryptGPU [21]   | GPU             | MPC               | 1.25 GB/s        |
| TensorFHE [16]  | GPU             | BFV, CKKS         | 2.4 TB/s         |
| HEAX [7]        | FPGA            | CKKS              | 34 GB/s, 64 GB/s |
| Poseidon [17]   | FPGA            | BFV, CKKS         | 460 GB/s         |
| FAB [18]        | FPGA            | BFV, CKKS         | 460 GB/s         |
| F1 [22]         | ASIC            | BFV, CKKS         | 1 TB/s           |
| CraterLake [20] | ASIC            | BFV, CKKS         | 2.4 TB/s         |
| BTS [23]        | ASIC            | BFV, CKKS         | 1 TB/s           |
| ARK [19]        | ASIC            | BFV, CKKS         | 1 TB/s           |

## C. Memory Bottlenecks in FHE Acceleration

Although there has been progress in computation acceleration, memory bandwidth is still a significant constraint in FHE applications [24]. Compared to plaintexts, ciphertexts greatly increase the quantity of data, particularly in schemes like CKKS. This results in frequent memory accesses and severe bandwidth limitations. According to [19], a chip with 40,960 modular multipliers running at 2 GHz and 3 TB/s of HBM3 bandwidth can complete the multiplications in 0.18 ms but needs 2.1 ms for data transfer. This shows that in large-scale FHE workloads, the main bottleneck is not computation but data movement. The intricate memory access patterns of FHE make bandwidth issues worse. Operations like the NTT need large memory allocations for twiddle factors and intermediate results, which frequently exceed on-chip cache capacity and call for frequent off-chip accesses. Static architectures find it difficult to handle the dynamic data dependencies and hardware resource strain caused by the $(n \log n)/2$ butterfly operations in FFT/NTT pipelines [10]. The resource-intensive steps of key-switching increase memory needs even further. The decomposition parameter (dnum) impacts both memory and computation, requiring trade-offs between parallelization techniques such as residue-polynomial-level parallelism (rPLP) and coefficient-level parallelism (CLP). While global NTT communication increases latency in CLP, rPLP introduces additional data exchanges during basis conversion. Achieving hardware that dynamically balances these approaches remains a challenge. Furthermore, off-chip transfers of twiddle factors and intermediate results are frequently required due to limited on-chip memory, which exacerbates latency and power consumption [8].

Fig. 2. Two transmitters and receivers are connected by a WDM photonic interconnect that operates on two distinct wavelengths, $\lambda_1$ and $\lambda_2$.
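To make the $(n \log n)/2$ butterfly count concrete, the following is a minimal sketch of an iterative radix-2 Cooley-Tukey NTT that counts its butterfly operations. It is an illustration under stated assumptions, not the pipeline of any accelerator discussed here: the function names are hypothetical, $n$ is assumed to be a power of two, and the logarithm is taken base 2.

```python
def bit_reverse(i, bits):
    """Reverse the lowest `bits` bits of index i."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r


def ntt_iterative(a, omega, q):
    """Radix-2 decimation-in-time NTT; returns (transform, butterfly count)."""
    n = len(a)
    bits = n.bit_length() - 1
    # Reorder the input so each stage can combine adjacent blocks in place.
    x = [a[bit_reverse(i, bits)] for i in range(n)]
    butterflies = 0
    length = 2
    while length <= n:
        w_len = pow(omega, n // length, q)  # twiddle step for this stage
        half = length // 2
        for start in range(0, n, length):
            w = 1
            for k in range(half):
                u = x[start + k]
                v = x[start + k + half] * w % q
                x[start + k] = (u + v) % q          # butterfly: sum
                x[start + k + half] = (u - v) % q   # butterfly: difference
                w = w * w_len % q
                butterflies += 1
        length *= 2
    return x, butterflies
```

Each of the $\log_2 n$ stages performs $n/2$ butterflies, and every butterfly reads and writes two coefficients plus one twiddle factor, which is where the memory traffic described above originates.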

Because of these demands, FHE accelerators need high-bandwidth interconnects to transport data efficiently. Table I highlights the bandwidth requirements of state-of-the-art FHE accelerators. Meeting these demands presents considerable hurdles: electronic interconnects currently in use cannot achieve such high data transfer rates [24].

## D. Overcoming Bandwidth Limits in FHE

To address bandwidth constraints, chiplet-based FHE accelerators utilize high-bandwidth memory (HBM) technologies like HBM3, offering up to 0.819 TB/s per stack with a 1024-bit data width [25]. Advanced FHE accelerators (Table I) often require multiple HBM3 stacks to meet multi-terabyte bandwidth demands. Photonic interconnects, such as the OptoLink architecture, provide comparable bandwidth over far fewer lanes, achieving 0.8 TB/s with only 64 channels, which reduces bitwidth requirements by $16 \times$ compared to HBM3, and they scale effectively for advanced workloads. OptoLink dynamically multiplexes data, enabling flexible routing, task parallelism, and improved resource utilization with low latency. It supports diverse FHE tasks, including key-switching and NTT operations, without significant architectural changes. The design and experimental validation of OptoLink are discussed in Secs. III-C and IV, highlighting its ability to meet bandwidth needs and improve scalability for privacy-preserving applications.
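The sizing argument above can be sketched as a short back-of-the-envelope calculation. The per-stack HBM3 figure comes from [25] and the 12.5 GB/s per-channel rate from Sec. IV-A; the helper names are hypothetical, and the sketch ignores protocol overhead and utilization, which a real design would have to account for.

```python
import math

HBM3_STACK_GBYTES_PER_S = 819        # 0.819 TB/s per stack [25]
HBM3_BUS_BITS = 1024                 # data width of one HBM3 stack
OPTOLINK_CHANNEL_GBYTES_PER_S = 12.5 # per-channel rate reported in Sec. IV-A


def hbm3_stacks_needed(target_gbytes_per_s):
    """Smallest number of HBM3 stacks whose peak bandwidth meets the target."""
    return math.ceil(target_gbytes_per_s / HBM3_STACK_GBYTES_PER_S)


def optolink_channels_needed(target_gbytes_per_s):
    """Smallest number of optical channels whose aggregate rate meets the target."""
    return math.ceil(target_gbytes_per_s / OPTOLINK_CHANNEL_GBYTES_PER_S)


# Targets taken from Table I, expressed in GB/s.
for name, target in [("Poseidon", 460), ("F1", 1000), ("CraterLake", 2400)]:
    print(name, hbm3_stacks_needed(target), optolink_channels_needed(target))
```

Under these assumptions, a 2.4 TB/s target corresponds to three HBM3 stacks (3072 data lines) or 192 optical channels.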

## III. METHODOLOGY

## A. Photonic Interconnects

Photonic interconnects, which use light rather than conventional electrical signals, are a state-of-the-art method for high-speed data transfer in chip layouts. As depicted in Fig. 2, light generated by an external laser is directed into on-chip waveguides through optical couplers. Micro-ring resonators (MRRs), serving as modulators and filters, are precisely tuned to specific wavelengths. These MRRs, equipped with resistive heaters and thermal tuning systems, maintain stability by compensating for process and thermal variations [26]. Electrical signals are modulated onto light by MRRs, with each signal assigned a unique wavelength. The modulated light propagates through waveguides to the receiver, where specific MRRs filter the signals. Photodetectors (PDs) then convert the optical signals back to electrical form, and transimpedance amplifiers (TIAs) amplify them for reliable data recovery.

Fig. 3. Schematic of a WDM-enabled *OptoLink* network communicating between the NTT module and memory. The NTT module retrieves input data and twiddle factors from memory, and sends the computed outputs back.

To boost data throughput, WDM enables multiple data streams to transmit simultaneously on different wavelengths within a single waveguide. Advanced WDM systems can handle up to 64 wavelengths, each at 10 Gb/s, achieving aggregate bandwidths exceeding 100 Gb/s [27]. Additionally, SDM further increases capacity by employing multiple parallel waveguides. By integrating WDM and SDM, photonic interconnects deliver exceptional bandwidth and energy efficiency. These attributes make them ideal for data-intensive applications like FHE accelerators, where efficient communication between cores and memory is critical.
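The way WDM and SDM compose can be sketched as a simple product of wavelengths, per-wavelength rate, and waveguide count. This is an idealized model that ignores crosstalk, encoding overhead, and guard bands; the function names are hypothetical, and the five-waveguide case is only an example.

```python
def aggregate_gbits_per_s(wavelengths, rate_gbits_per_s, waveguides=1):
    """Ideal aggregate rate: WDM multiplies by wavelengths, SDM by waveguides."""
    return wavelengths * rate_gbits_per_s * waveguides


def gbits_to_gbytes(gbits_per_s):
    """Convert Gb/s to GB/s (8 bits per byte)."""
    return gbits_per_s / 8


# 64 wavelengths at 10 Gb/s each on a single waveguide.
single = aggregate_gbits_per_s(64, 10)
# The same WDM configuration replicated over five parallel waveguides (example only).
five = aggregate_gbits_per_s(64, 10, waveguides=5)
```

In this idealized model, one waveguide with 64 wavelengths at 10 Gb/s carries 640 Gb/s (80 GB/s), and capacity grows linearly with the number of waveguides.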

## B. Single OptoLink Channel

Fig. 3 illustrates the integration of memory and NTT modules within a single *OptoLink* channel, in which photonic interconnects provide fast data transfer. Using WDM, the system sends several signals simultaneously over a single waveguide, with each signal assigned a distinct wavelength. This design substantially boosts bandwidth and ensures seamless communication between the memory controller and the NTT module. Input data, including twiddle factors and coefficients, is stored in memory close to the transmitters. MRRs modulate light at specific wavelengths using analog electrical signals derived from the digital inputs. At the receiving end, wavelength-specific MRRs isolate the signals and route them to PDs for optical-to-electrical conversion. TIAs amplify these signals, which are then processed by comparators to recover the original data. After processing by the NTT module, the output data undergoes similar modulation, transmission, and demodulation and is sent back to the memory controller for subsequent operations. To maximize system efficiency, input and output data use the same wavelengths on different waveguides, which reduces the overall number of wavelengths needed.



Fig. 4. Schematic representation of the *OptoLink* architecture connecting four NTT modules via five waveguides. Wavelengths  $\lambda_1 - \lambda_{16}$  are allocated for input data transmission, while  $\lambda_{17} - \lambda_{24}$  handle output data transmission.

## C. Scalable OptoLink Network Architecture

The *OptoLink* architecture, depicted in Fig. 4, integrates with four NTT modules and uses five waveguides for efficient data and twiddle factor transmission. *Waveguides 1* and *2* carry input data to the NTT modules, while *Waveguides 3* and *4* deliver twiddle factors. Processed outputs are sent back to the memory controller via *Waveguide 5*. Two wavelength groups facilitate communication: $\lambda_1 - \lambda_{16}$ transmit input data and twiddle factors, while $\lambda_{17} - \lambda_{24}$ handle results. Each wavelength corresponds to one bit per channel, with parallel optical channels enabling simultaneous data transmission. Wavelength reuse further enhances the system's scalability and efficiency.
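The allocation described above can be written down as a small data structure. The sketch below is one possible reading of Fig. 4, not the authors' implementation: it assumes that $\lambda_1 - \lambda_{16}$ are reused on all four input-side waveguides, that bit $k$ of a word drives the $k$-th wavelength with simple on-off keying, and all class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Waveguide:
    index: int           # waveguide number as labelled in Fig. 4
    role: str            # what the waveguide carries
    wavelengths: range   # wavelength indices (lambda_1 is index 1)


def build_allocation():
    """Five waveguides: two for input data, two for twiddles, one for results."""
    inputs = range(1, 17)    # lambda_1 .. lambda_16 (reuse is an assumption)
    results = range(17, 25)  # lambda_17 .. lambda_24
    return [
        Waveguide(1, "input data", inputs),
        Waveguide(2, "input data", inputs),
        Waveguide(3, "twiddle factors", inputs),
        Waveguide(4, "twiddle factors", inputs),
        Waveguide(5, "results", results),
    ]


def word_to_wavelengths(word, wavelengths):
    """Map bit k of `word` to the k-th wavelength (1 = light on, 0 = off)."""
    return {lam: (word >> k) & 1 for k, lam in enumerate(wavelengths)}


allocation = build_allocation()
distinct = sorted({lam for wg in allocation for lam in wg.wavelengths})
```

Under this reading, 24 distinct wavelengths serve five waveguides, which is the saving that wavelength reuse is meant to provide.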

Scalability, a critical requirement for FHE accelerators, is a key strength of OptoLink. Increasing the number of optical channels and employing WDM significantly boost throughput while maintaining a compact physical footprint. Research shows that up to 64 wavelengths can be multiplexed within a single waveguide [27], providing substantial bandwidth expansion. However, this scalability introduces challenges in power consumption. Adding more NTT cores and optical channels increases laser power requirements to counteract insertion losses. While MRR tunability enables each transmitter to support multiple receivers, reducing the number of modulators, the cumulative power demand rises as more off-chip lasers, MRRs, and photodetectors are required. Balancing scalability with power and thermal efficiency is essential, particularly for high-throughput FHE workloads. The OptoLink design effectively addresses bandwidth and latency constraints while offering flexibility for diverse applications.

## D. Simulation Platform and Parameters

To assess the effectiveness of the proposed *OptoLink* architecture, optical interconnects were implemented between NTT cores and off-chip memory to address the challenges posed by HEAX [7], an FPGA-based FHE accelerator. HEAX's complex memory-to-NTT module connections highlight the limitations of conventional electronic interconnects, motivating our adoption of photonic solutions for improved efficiency and reduced latency. Photonic parameters such as detector responsivity, modulator insertion loss, and coupling efficiency were included in the Synopsys OptoCompiler simulation of the *OptoLink* design (Table II). These parameters helped determine the laser power requirements, ensuring dependable signal delivery even in the presence of optical imperfections. Furthermore, we used Synopsys Design Compiler to perform timing, power, and area analysis for the electrical network. To evaluate the architecture's scalability under various computing demands, the analysis considered configurations of 4, 8, and 16 NTT modules.

TABLE II PHOTONIC PARAMETERS UTILIZED FOR EVALUATION IN *OptoLink*

| Component     | Value   |
|---------------|---------|
| Laser Source  | 5 dB    |
| Coupler       | 1 dB    |
| Splitter      | 0.2 dB  |
| Waveguide     | 1 dB/cm |
| Ring Drop     | 0.7 dB  |
| Ring Through  | 0.01 dB |
| Photodetector | 0.5 dB  |
| Ring Heating  | 0.32 mW |

Fig. 5. Simulation configuration for a single channel in the OptoLink system
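How the losses in Table II translate into a laser power requirement can be sketched as a simple link budget. This is a rough illustration under several assumptions that are not stated in the text: the receiver sensitivity of -20 dBm, the count of 15 off-resonance rings passed, and the single coupler and splitter on the path are hypothetical, and the 5 dB laser-source entry is read here as a laser efficiency loss.

```python
# Per-component losses from Table II, in dB.
LOSS_DB = {
    "coupler": 1.0,
    "splitter": 0.2,
    "waveguide_per_cm": 1.0,
    "ring_drop": 0.7,
    "ring_through": 0.01,
    "photodetector": 0.5,
}
LASER_SOURCE_DB = 5.0  # Table II entry, interpreted as efficiency loss (assumption)


def path_loss_db(length_cm, rings_passed, couplers=1, splitters=1):
    """Total optical loss along one wavelength's path from laser to detector."""
    return (couplers * LOSS_DB["coupler"]
            + splitters * LOSS_DB["splitter"]
            + length_cm * LOSS_DB["waveguide_per_cm"]
            + rings_passed * LOSS_DB["ring_through"]
            + LOSS_DB["ring_drop"]
            + LOSS_DB["photodetector"])


def required_laser_dbm(sensitivity_dbm, loss_db):
    """Optical power the laser must emit so the detector sees its sensitivity."""
    return sensitivity_dbm + loss_db


def dbm_to_mw(dbm):
    """Convert a power level in dBm to milliwatts."""
    return 10 ** (dbm / 10)


# Hypothetical example: 1000 um (0.1 cm) waveguide, 15 off-resonance rings passed.
loss = path_loss_db(length_cm=0.1, rings_passed=15)
optical_dbm = required_laser_dbm(sensitivity_dbm=-20.0, loss_db=loss)
wall_plug_dbm = optical_dbm + LASER_SOURCE_DB
```

Because losses add in dB, each additional ring, splitter, or centimeter of waveguide raises the required laser power multiplicatively in linear units, which is why insertion loss dominates the scaling discussion.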

## IV. RESULTS AND ANALYSIS

## A. Timing Analysis

Synopsys OptoCompiler was used to simulate two optical channels in data transmission studies to assess the performance of the OptoLink network. Data was sent at 10 Gb/s from a pseudo-random bit sequence (PRBS) generator, which needed 6.4 ns to produce a complete 64-bit sequence. Fig. 6 shows 64-bit data sequences modulated by MRRs onto the 1550 nm (channel 1) and 1551 nm (channel 2) wavelengths, with reliable signal recovery after transmission. Using a $1000\,\mu m$ waveguide, OptoLink achieved a transmission latency of 10 ps, significantly lower than the 3.04 ns required in an electrical network. Each OptoLink channel achieved a data rate of 100 Gb/s, or 12.5 GB/s, for a total bandwidth of 1.6 TB/s with 128 channels, sufficient for FHE workloads. Owing to the architecture's scalability, it can achieve 2.4 TB/s with 192 channels, on par with the NVIDIA A100 [28], and up to 12.8 TB/s with 1024 channels. In contrast, electrical interconnects deliver only 5.26 GB/s at a latency of 3.04 ns with a 128-bit configuration. Even with 1024-bit data sequences, their bandwidth is limited to 42.1 GB/s. Achieving *OptoLink*'s 1.6 TB/s bandwidth electrically at the same latency would require an impractical data width of 4864 bytes (38,912 bits).

Fig. 6. (a-b) Electrical input signals supplied to modulator MRRs across two separate *OptoLink* channels for designated wavelengths. (c-d) Corresponding output signals, after optical-to-electrical conversion by the PD and amplification via the TIA in each channel, observed for the same wavelengths.

The ultra-fast data transfer of *OptoLink* minimizes latency between memory and computational units, alleviating bottlenecks in FHE accelerators. By enabling efficient exchange of data such as coefficients and twiddle factors, it accelerates operations, supports large datasets, and enhances performance for privacy-preserving applications. Its scalability makes it suitable for evolving computational demands.

TABLE III BITRATE COMPARISON OF ELECTRICAL NETWORK AND OptoLink

| Bitwidth | Electrical Network Latency | Electrical Network Bitrate | OptoLink Latency | OptoLink Bitrate |
|----------|----------------------------|----------------------------|------------------|------------------|
| 32       | 3.04 ns                    | 1.32 GB/s                  | 10 ps            | 0.4 TB/s         |
| 64       | 3.04 ns                    | 2.63 GB/s                  | 10 ps            | 0.8 TB/s         |
| 128      | 3.04 ns                    | 5.26 GB/s                  | 10 ps            | 1.6 TB/s         |
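The entries in Table III follow from two simple relations, sketched below: the electrical bitrate is the bitwidth delivered once per 3.04 ns transfer, and the OptoLink bitrate is the channel count times the 12.5 GB/s per-channel rate. The function names are hypothetical, and the model assumes one transfer per latency period with no pipelining, which is our reading of the reported numbers rather than a statement from the simulation setup.

```python
ELECTRICAL_LATENCY_NS = 3.04           # per-transfer latency of the electrical network
OPTOLINK_CHANNEL_GBYTES_PER_S = 12.5   # 100 Gb/s per optical channel


def electrical_gbytes_per_s(bitwidth):
    """Bytes delivered per nanosecond, which is numerically GB/s."""
    return bitwidth / 8 / ELECTRICAL_LATENCY_NS


def optolink_tbytes_per_s(channels):
    """Aggregate OptoLink rate in TB/s for a given channel count."""
    return channels * OPTOLINK_CHANNEL_GBYTES_PER_S / 1000


def electrical_width_bytes(target_gbytes_per_s):
    """Data width an electrical link would need to hit the target at 3.04 ns."""
    return target_gbytes_per_s * ELECTRICAL_LATENCY_NS


rows = [(w, round(electrical_gbytes_per_s(w), 2), optolink_tbytes_per_s(w))
        for w in (32, 64, 128)]
width_for_1600 = electrical_width_bytes(1600)  # about 4864 bytes, i.e. 38,912 bits
```

The same relation gives the 42.1 GB/s figure for a 1024-bit electrical link and shows that matching 1.6 TB/s at 3.04 ns would take a data width of roughly 4864 bytes.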

## B. Power Analysis

The *OptoLink* system's power consumption is largely dictated by its laser source, MRRs, and PDs. The total power consumption can be expressed as,

$$P_{\text{total}} = P_{\text{laser}} + P_{\text{TX}} + P_{\text{RX}},\tag{4}$$

where $P_{\text{laser}}$ accounts for the laser source's power usage, while $P_{\text{TX}}$ and $P_{\text{RX}}$ represent the power consumed by the transmitter and receiver, respectively. Each transmitter includes MRR thermal heating, which consumes approximately 0.32 mW per resonator [29], resulting in $P_{\text{TX}} = 1.22$ mW and $P_{\text{RX}} = 0.92$ mW per optical channel. For a 128-channel *OptoLink* system supporting 4 NTT cores, the estimated power consumption is 6.59 W, scaling to 13.16 W for 8 cores and 26.31 W for 16 cores due to the additional transmitters and receivers required for increased core counts.

In contrast, electrical interconnects consume significantly less power. Under a 128-bit configuration, power consumption is 336.99 $\mu$W for 4 cores, 661.74 $\mu$W for 8 cores, and 1332.31 $\mu$W for 16 cores. For narrower 32-bit configurations, electrical networks consume between 283.89 $\mu$W and 1121.9 $\mu$W, while *OptoLink* requires 1.65 W to 6.58 W. Similarly, under 64-bit configurations, electrical network power ranges from 308.18 $\mu$W to 1232.19 $\mu$W, whereas *OptoLink* consumes 3.29 W to 13.16 W.

Fig. 7. Comparison of electrical network and OptoLink performance across varying numbers of NTT cores. (a) Relationship between bitwidth and bandwidth, and (b) Relationship between bitwidth and power consumption.

The increased power demand of *OptoLink* comes from optical components, with lasers consuming significant energy for stable light, and transmitters/receivers adding overhead due to photodetectors and signal processing. MRRs also require thermal control, further raising power usage. In contrast, electrical interconnects are more power-efficient with simpler designs. Despite the higher power consumption, *OptoLink* offers superior scalability and high data throughput, making it ideal for applications prioritizing performance and bandwidth over energy efficiency.

TABLE IV POWER CONSUMPTION OF ELECTRICAL NETWORK AND OptoLink

| Bitwidth | NTT Cores | Electrical Network Power | OptoLink Power |
|----------|-----------|--------------------------|----------------|
| 32       | 4         | 283.89 µW                | 1.65 W         |
| 32       | 8         | 562.44 µW                | 3.29 W         |
| 32       | 16        | 1121.9 µW                | 6.58 W         |
| 64       | 4         | 308.18 µW                | 3.29 W         |
| 64       | 8         | 619.29 µW                | 6.58 W         |
| 64       | 16        | 1232.19 µW               | 13.16 W        |
| 128      | 4         | 336.99 µW                | 6.59 W         |
| 128      | 8         | 661.74 µW                | 13.16 W        |
| 128      | 16        | 1332.31 µW               | 26.31 W        |
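The scaling trend in Table IV can be checked with a short calculation that divides each reported OptoLink power by the product of bitwidth and core count. The per-link interpretation below is our own reading of the table, not a model stated by the simulation flow, and the names are hypothetical; the table values themselves are copied from Table IV.

```python
# OptoLink power in watts, keyed by (bitwidth, NTT cores), from Table IV.
OPTOLINK_W = {
    (32, 4): 1.65, (32, 8): 3.29, (32, 16): 6.58,
    (64, 4): 3.29, (64, 8): 6.58, (64, 16): 13.16,
    (128, 4): 6.59, (128, 8): 13.16, (128, 16): 26.31,
}
TX_MW = 1.22  # transmitter power per optical channel (Sec. IV-B)
RX_MW = 0.92  # receiver power per optical channel (Sec. IV-B)


def per_link_mw(bitwidth, cores):
    """Reported power divided by (channels x cores), in milliwatts."""
    return OPTOLINK_W[(bitwidth, cores)] * 1000 / (bitwidth * cores)


per_link = {key: per_link_mw(*key) for key in OPTOLINK_W}
spread = max(per_link.values()) - min(per_link.values())
tx_rx_mw = TX_MW + RX_MW
```

Every configuration works out to between 12.8 and 12.9 mW per channel per core, so the reported power is almost exactly linear in both bitwidth and core count. If Eq. (4) is read on the same per-channel, per-core basis (an assumption), the transmitter and receiver account for about 2.14 mW of that, with the remainder falling to the laser term.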

## C. Area Analysis

The area requirements of the proposed OptoLink architecture were analyzed by comparing the footprint of traditional electronic networks with that of the photonic components essential to OptoLink. The electronic network areas were estimated using a 32 nm technology library and realistic process design parameters. For a 128-bit NTT configuration, the area of the electrical network scaled nearly linearly with the number of NTT units, with 4, 8, and 16 NTT units occupying 3097.3 $\mu$m<sup>2</sup>, 5741.2 $\mu$m<sup>2</sup>, and 11861.9 $\mu$m<sup>2</sup>, respectively. *OptoLink*, on the other hand, needs more area because of its photonic elements. According to [30], each photonic transmitter or receiver takes up around 0.0096 mm<sup>2</sup> per wavelength, and the wavelength-selective MRRs add an extra 0.01 mm<sup>2</sup> based on an MRR radius of 5 $\mu$m [31]. Additionally, MRRs require electrical connections, comprising four wires for data transfer and temperature tuning, which increase the overall footprint.

## V. CONCLUSION

The OptoLink architecture addresses key challenges in FHE accelerators by using photonic interconnects for ultra-low latency and high bandwidth. With picosecond-scale latencies and 1.6 TB/s throughput in a 128-channel configuration, it handles large ciphertexts and complex memory access patterns, reducing bottlenecks in tasks like key-switching and NTTs. While photonic components increase power and area demands, these are outweighed by performance gains. Future integration of broadcast-enabled photonic devices and an optimized dataflow will further improve power and area efficiency by reducing the number of wavelengths and waveguides needed, enhancing scalability. In summary, OptoLink is a high-performance, scalable interconnect solution tailored to the demands of FHE systems, enabling faster and more efficient privacy-preserving computations. Ongoing advancements will refine its efficiency and broaden its applicability to data-intensive applications.

## REFERENCES

- [1] C. Marcolla, V. Sucasas, M. Manzano, R. Bassoli, F. H. Fitzek, and N. Aaraj, "Survey on fully homomorphic encryption, theory, and applications," *Proceedings of the IEEE*, vol. 110, no. 10, pp. 1572–1609, 2022.
- [2] C. Gentry and S. Halevi, "Implementing gentry's fully-homomorphic encryption scheme," in Annual international conference on the theory and applications of cryptographic techniques. Springer, 2011, pp. 129–148.
- [3] J. Fan and F. Vercauteren, "Somewhat practical fully homomorphic encryption," Cryptology ePrint Archive, 2012.
- [4] J. H. Cheon, A. Kim, M. Kim, and Y. Song, "Homomorphic encryption for arithmetic of approximate numbers," in Advances in Cryptology–ASIACRYPT 2017: 23rd International Conference on the Theory and Applications of Cryptology and Information Security, Hong Kong, China, December 3-7, 2017, Proceedings, Part I 23. Springer, 2017, pp. 409–437.
- [5] S. S. Roy, F. Turan, K. Jarvinen, F. Vercauteren, and I. Verbauwhede, "Fpga-based high-performance parallel architecture for homomorphic computing on encrypted data," in 2019 IEEE International symposium on high performance computer architecture (HPCA). IEEE, 2019, pp. 387–398.
- [6] B. Reagen, W.-S. Choi, Y. Ko, V. T. Lee, H.-H. S. Lee, G.-Y. Wei, and D. Brooks, "Cheetah: Optimizing and accelerating homomorphic encryption for private inference," in 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA). IEEE, 2021, pp. 26–39.
- [7] M. S. Riazi et al., "Heax: An architecture for computing on encrypted data," in Proceedings of the twenty-fifth international conference on architectural support for programming languages and operating systems, 2020, pp. 1295–1309.
- [8] M. Zhou, Y. Nam, P. Gangwar, W. Xu, A. Dutta, K. Subramanyam, C. Wilkerson, R. Cammarota, S. Gupta, and T. Rosing, "Fhemem: A processing in-memory accelerator for fully homomorphic encryption," *arXiv preprint arXiv:2311.16293*, 2023.
- [9] P. Duong-Ngoc et al., "Efficient k-parallel pipelined ntt architecture for post quantum cryptography," in 2020 International SoC Design Conference (ISOCC), 2020, pp. 212–213.
- [10] J. Zhang *et al.*, "Sok: Fully homomorphic encryption accelerators," ACM Computing Surveys, 2022.
- [11] Y. Li et al., "Spacx: Silicon photonics-based scalable chiplet accelerator for dnn inference," in *IEEE International Symposium on High-Performance Computer Architecture (HPCA)*, 2022, pp. 831–845.
- [12] J. W. Cooley and J. W. Tukey, "An algorithm for the machine calculation of complex fourier series," *Mathematics of computation*, vol. 19, no. 90, pp. 297–301, 1965.
- [13] W. M. Gentleman and G. Sande, "Fast fourier transforms: for fun and profit," in *Proceedings of the November 7-10, 1966, fall joint computer conference*, 1966, pp. 563–578.

- [14] C. Gentry, A fully homomorphic encryption scheme. Stanford university, 2009.
- [15] W. Jung, S. Kim, J. H. Ahn, J. H. Cheon, and Y. Lee, "Over 100x faster bootstrapping in fully homomorphic encryption through memory-centric optimization with gpus," *IACR Transactions on Cryptographic Hardware and Embedded Systems*, pp. 114–148, 2021.
- [16] S. Fan, Z. Wang, W. Xu, R. Hou, D. Meng, and M. Zhang, "Tensorfhe: Achieving practical computation on encrypted data using gpgpu," in 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA). IEEE, 2023, pp. 922–934.
- [17] Y. Yang, H. Zhang, S. Fan, H. Lu, M. Zhang, and X. Li, "Poseidon: Practical homomorphic encryption accelerator," in 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA). IEEE, 2023, pp. 870–881.
- [18] R. Agrawal, L. de Castro, G. Yang, C. Juvekar, R. Yazicigil, A. Chandrakasan, V. Vaikuntanathan, and A. Joshi, "Fab: An fpga-based accelerator for bootstrappable fully homomorphic encryption," in 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA). IEEE, 2023, pp. 882–895.
- [19] J. Kim, G. Lee, S. Kim, G. Sohn, M. Rhu, J. Kim, and J. H. Ahn, "Ark: Fully homomorphic encryption accelerator with runtime data generation and inter-operation key reuse," in 2022 55th IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, 2022, pp. 1237–1254.
- [20] N. Samardzic, A. Feldmann, A. Krastev, N. Manohar, N. Genise, S. Devadas, K. Eldefrawy, C. Peikert, and D. Sanchez, "Craterlake: a hardware accelerator for efficient unbounded computation on encrypted data," in *Proceedings of the 49th Annual International Symposium on Computer Architecture*, 2022, pp. 173–187.
- [21] S. Tan, B. Knott, Y. Tian, and D. J. Wu, "Cryptgpu: Fast privacy-preserving machine learning on the gpu," in 2021 IEEE Symposium on Security and Privacy (SP). IEEE, 2021, pp. 1021–1038.
- [22] N. Samardzic, A. Feldmann, A. Krastev, S. Devadas, R. Dreslinski, C. Peikert, and D. Sanchez, "F1: A fast and programmable accelerator for fully homomorphic encryption," in *MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture*, 2021, pp. 238–252.
- [23] S. Kim, J. Kim, M. J. Kim, W. Jung, J. Kim, M. Rhu, and J. H. Ahn, "Bts: An accelerator for bootstrappable fully homomorphic encryption," in *Proceedings of the 49th annual international symposium on computer architecture*, 2022, pp. 711–725.
- [24] L. de Castro, R. Agrawal, R. Yazicigil, A. Chandrakasan, V. Vaikuntanathan, C. Juvekar, and A. Joshi, "Does fully homomorphic encryption need compute acceleration?" arXiv preprint arXiv:2112.06396, 2021.
- [25] JEDEC. (2024, November) Jedec publishes hbm3 update to the high bandwidth memory (hbm) standard. Accessed: 2024-11-17. [Online]. Available: https://www.jedec.org/news/pressreleases/jedecpublishes-hbm3-update-high-bandwidth-memory-hbm-standard
- [26] D. A. B. Miller, "Device Requirements for Optical Interconnects to Silicon Chips," *Proceedings of the IEEE*, vol. 97, no. 7, pp. 1166–1185, 2009.
- [27] S. Werner, J. Navaridas, and M. Luján, "Designing low-power, low-latency networks-on-chip by optimally combining electrical and optical links," in 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 2017, pp. 265–276.
- [28] J. Choquette, W. Gandhi, O. Giroux, N. Stam, and R. Krashinsky, "Nvidia a100 tensor core gpu: Performance and innovation," *IEEE Micro*, vol. 41, no. 2, pp. 29–35, 2021.
- [29] A. Joshi, C. Batten, Y.-J. Kwon, S. Beamer, I. Shamim, K. Asanovic, and V. Stojanovic, "Silicon-photonic clos networks for global on-chip communication," in 2009 3rd ACM/IEEE International Symposium on Networks-on-Chip. IEEE, 2009, pp. 124–133.
- [30] Y. Thonnart, M. Zid, J. L. Gonzalez-Jimenez, G. Waltener, R. Polster, O. Dubray, F. Lepin, S. Bernabé, S. Menezo, G. Parès *et al.*, "A 10gb/s si-photonic transceiver with 150µw 120µs-lock-time digitally supervised analog microring wavelength stabilization for 1tb/s/mm² die-to-die optical networks," in 2018 IEEE International Solid-State Circuits Conference-(ISSCC). IEEE, 2018, pp. 350–352.
- [31] G. Li, X. Zheng, H. Thacker, J. Yao, Y. Luo, I. Shubin, K. Raj, J. E. Cunningham, and A. V. Krishnamoorthy, "40 gb/s thermally tunable cmos ring modulator," in *The 9th International Conference on Group IV Photonics (GFP)*. IEEE, 2012, pp. 1–3.