Efficient OT Extension and its Impact on Secure Computation



TECHNISCHE UNIVERSITÄT DARMSTADT

Pushing the Communication Barrier of Passive Secure Two-Party Computation

# Michael Zohner (TU Darmstadt)

Joint work with Ghada Dessouky, Ahmad-Reza Sadeghi, Thomas Schneider, Shaza Zeitouni (all TU Darmstadt) Farinaz Koushanfar (UC San Diego)



#### **Applications**





This work: passive security and security parameter *k* = 128 bit



# **Secure Two-Party Computation Protocols**

Yao's garbled circuits protocol

- Function-dependent setup phase
- Constant round
- ≥ 256 bit communication per AND (simplex)

Very fast implementations

- Fairplay ~1 000 Gates/s
- FastGC ~100 000 AND/s
- ObliVM ~3 million AND/s
- Billion Gate ~100 000 AND/s
- Blazing Fast ~500 000 AND/s
- More blazing fast

Passive Active





# **Secure Two-Party Computation Protocols**



#### GMW

- Function-independent setup phase
- $\geq$  256 bit communication per AND (duplex)

Setup Phase: pre-compute multiplication triples (MTs) using OT

ApricOT: passive=active ~7 million OTs/s

| ٠ | ABY: ~3 million AND/s  | Passive |
|---|------------------------|---------|
| • | TinvOT: ~400 000 AND/s | Active  |

- TinyOT: ~400 000 AND/s
- SPDZ: 5 000 Mult/s of 128 bit values

Online Phase: simple computation and small messages but multi-round



#### **Status Quo**



Good news: extremely fast computation

- JustGarble generates ~2GBit/s traffic per thread
- Passive OT extension generates ~1 Gbit/s traffic per thread

Bad news: communication boundary

- LAN connection provides 1Gbit/s
- Lowerbound on linear garbling schemes [ZRE15]
- Online time of GMW very latency dependent

Computation resources scale better than communication

Bottleneck: Communication and round complexity



### **Related Work**



[KK13] outlines efficient 100N OT extension variant

- + Reduces communication per AND in GMW from 256 to 160 bit
- High computation overhead

[IKMOP13,DZ16,TinyTable] uses multi-input tables for secure computation + Reduces communication and rounds in the online phase

- High setup costs

GESS [KK12] multi-round information-theoretic variant of garbled circuits

- + Reduces communication
- Unsure how to extend to arbitrary functionalities



#### What Did We Do?



1) Less communication for GMW

- Further optimize 100N OT extension of [KK13] to compute MTs
- Reduces communication from 256 to 134 bit per AND

2) Less communication and rounds Lookup Table (LUT) representation

- Online LUT (O-LUT): efficient online phase
- Setup LUT (S-LUT): efficient overall evaluation

3) Tool support for generating LUT representations

4) Evaluation on various basic operations



# **Our Results**



EC SPRIDE

Trade more computation for 1) less communication and 2) less rounds

- 2-MT: GMW from 1-out-of-2 OT extension
- N-MT: GMW from 1-out-of-N OT extension [KK13]
- O-LUT: LUT protocol with efficient online phase
- S-LUT: LUT protocol with better overall communication



### Part 1) Less Communication for GMW







# 1002 OT Extension [IKNP03]



Alice holds *m* pairs of messages  $(x_{i,0}, x_{i,1})$ 

Bob holds *m*-bit string *r* and wants to obtain  $x_{i,r_i}$  in *i*-th OT





1002 OT Extension [IKNP03] (Base-OT Step)



Alice and Bob switch roles and perform k base OTs





#### From 1002 OT to 100N OT







#### From 1002 OT to 100N OT



100N OT can be obtained from log N invocations of 1002 OT extension

Example: 1004 OT for  $(x_1, ..., x_4)$ 





# **100N OT Extension [KK13]**







**Generalization to 100N OT Extension [KK13]** 



TECHNISCHE

UNIVERSITÄT DARMSTADT

03.06.16 | Pushing the Communication Barrier in 2PC | Slide 15

# **100N OT Extension [KK13] (Efficiency)**



The codewords need *k* bit Hamming distance (HD)

Efficiency of the [KK13] 100N OT depends on the underlying code

For N = 2: use repetition code

• Same as the [IKNP03] protocol

For  $2 < N \leq 2k$ : use a Walsh Hadamard code

- *h* codewords with *h* bit length and HD *h*/2
- Since we require HD=*k* we have 2*k*=256 bit codewords

For *N* > 2*k*: use linear codes

- Achieves O(k) communication instead of O(k log N)
- Concrete improvements for PSI on 128-bit elements



# **100N OT Extension [KK13]**







03.06.16 | Pushing the Communication Barrier in 2PC | Slide 17



Surprising insight: reducing 100N OT to single bit 1002 OTs saves communication



Best for N=16: requires only 320 bits instead of 512 bits



# [KK13] Downside: Increased Computation



1002 OT extension uses efficient fixed-key AES-128 [BHKR13]

100N OT processes values with >128-bit length

- Too large for AES encryption
- Replace by AES-256 with key schedule [KSS12]  $\rightarrow$  30 times slower

Even worse: 100N OT needs N evaluations while 1002 OT needs 2log N

- For N=16: 60x more computation
- For *N*=256: 480x more computation



# **Optimization 1) Improve Computation**



Idea: Use pipelining of [GNLP15] for AES-256+KS



AES-256+KS only 9 instead of 30 times slower than fixed-key AES-128



#### **Optimization 2) Short Codes**



For 2 < N < 2k, [KK13] uses Walsh-Hadamard code, which is not size-optimal

Improve communication using specific codes for specific *N* 

http://mint.sbg.ac.at/ gives short codes for different parameters

Saves between 25% and 1% communication



Code Sizes for Varying N



Optimization 3) MTs from 100N OT (N-MT)



[KK13] reduces 100N OT to 1002 OT for computing AND gates

Instead: reduce 100N OT to MTs (1004 OT)



Setup Communication Cost for MTs

Best for N=16: reducing communication from 256 to 134 bits per AND





#### Part 2) LUT-based Secure Computation

|   |   |    |            |    |    |    |    |    | У  | <u> </u> |    |    |    |    |    |    |    |
|---|---|----|------------|----|----|----|----|----|----|----------|----|----|----|----|----|----|----|
|   |   | 0  | 1          | 2  | 3  | 4  | 5  | 6  | 7  | 8        | 9  | a  | b  | С  | d  | е  | f  |
|   | 0 | 63 | 7c         | 77 | 7b | f2 | 6b | 6f | c5 | 30       | 01 | 67 | 2b | fe | d7 | ab | 76 |
|   | 1 | ca | 82         | c9 | 7d | fa | 59 | 47 | f0 | ad       | d4 | a2 | af | 9c | a4 | 72 | c0 |
|   | 2 | b7 | fd         | 93 | 26 | 36 | 3f | £7 | cc | 34       | a5 | e5 | f1 | 71 | d8 | 31 | 15 |
|   | 3 | 04 | с7         | 23 | c3 | 18 | 96 | 05 | 9a | 07       | 12 | 80 | e2 | eb | 27 | b2 | 75 |
|   | 4 | 09 | 83         | 2c | 1a | 1b | 6e | 5a | a0 | 52       | 3b | d6 | b3 | 29 | e3 | 2f | 84 |
|   | 5 | 53 | d1         | 00 | ed | 20 | fc | b1 | 5b | 6a       | cb | be | 39 | 4a | 4c | 58 | cf |
|   | 6 | d0 | ef         | aa | fb | 43 | 4d | 33 | 85 | 45       | f9 | 02 | 7f | 50 | 3c | 9f | a8 |
|   | 7 | 51 | a3         | 40 | 8f | 92 | 9d | 38 | f5 | bc       | b6 | da | 21 | 10 | ff | f3 | d2 |
| × | 8 | cd | 0c         | 13 | ec | 5f | 97 | 44 | 17 | c4       | a7 | 7e | 3d | 64 | 5d | 19 | 73 |
|   | 9 | 60 | 81         | 4f | dc | 22 | 2a | 90 | 88 | 46       | ee | b8 | 14 | de | 5e | 0b | db |
|   | а | e0 | 32         | 3a | 0a | 49 | 06 | 24 | 5c | c2       | d3 | ac | 62 | 91 | 95 | e4 | 79 |
|   | b | e7 | c8         | 37 | 6d | 8d | d5 | 4e | a9 | 6c       | 56 | f4 | ea | 65 | 7a | ae | 08 |
|   | С | ba | 78         | 25 | 2e | 1c | a6 | b4 | c6 | e8       | dd | 74 | 1f | 4b | bd | 8b | 8a |
|   | d | 70 | 3e         | b5 | 66 | 48 | 03 | f6 | 0e | 61       | 35 | 57 | b9 | 86 | c1 | 1d | 9e |
|   | е | e1 | <b>f</b> 8 | 98 | 11 | 69 | d9 | 8e | 94 | 9Ъ       | 1e | 87 | e9 | ce | 55 | 28 | df |
|   | f | 8c | a1         | 89 | 0d | bf | e6 | 42 | 68 | 41       | 99 | 2d | 0f | b0 | 54 | bb | 16 |



# **LUT Representation**



Boolean circuits require one round per layer of AND gates

Process multi-input gates to decrease rounds [IKMOP13,DZ16]

- [IKMOP13] introduced one-time truth table (OTTT) protocol
- [DZ16] showed how to pre-compute OTTTs



# **OTTT [IKMOP13] Setup by Trusted Third Party**

**TECHNISCHE** UNIVERSITÄT DARMSTADT

1) Represent as table

2) Rotate table

f(x,y) = x + yy\x

|       | 3 | 2 | 1 | 0 |
|-------|---|---|---|---|
|       | 3 | 2 | 1 | 0 |
| s = 1 | 4 | 3 | 2 | 1 |
|       | 5 | 4 | 3 | 2 |
|       | 6 | 5 | 4 | 3 |

|   |     |   | r = 2 | 2 | -> |
|---|-----|---|-------|---|----|
|   | y\x | 0 | 1     | 2 | 3  |
|   | 0   | 5 | 6     | 3 | 4  |
| _ | 1   | 2 | 3     | 0 | 1  |
|   | 2   | 3 | 4     | 1 | 2  |
|   | 3   | 4 | 5     | 2 | 3  |

y\x

3) Secret-Share

| y\x | 0 | 1 | 2 | 3 |  |  |  |
|-----|---|---|---|---|--|--|--|
| 0   | 4 | 4 | 6 | 2 |  |  |  |
| 1   | 5 | 2 | 4 | 5 |  |  |  |
| 2   | 4 | 3 | 5 | 2 |  |  |  |
| 3   | 0 | 3 | 6 | 5 |  |  |  |
| +   |   |   |   |   |  |  |  |

4) Distribute





# **OTTT (Online Phase)**







### **Pre-Computing OTTTs via Circuits [DZ16]**



Use circuit-based protocols to pre-compute all table entries

1) Represent the function as circuit C

2)  $P_0$  chooses random *r*,  $P_1$  chooses random *s* 

3) For all  $0 \le i, j \le 3$ ,  $P_0$  and  $P_1$  evaluate  $C(r+i, s+j) = (M_0[i,j], M_1[i,j])$ 



For circuits with  $\delta$  input bits requires  $2^{\delta}$  evaluations with  $\leq \delta$ -1 ANDs



# LUT Protocols based on 100N OT



Circuit-based pre-computation of [DZ16] adds great overhead

• For  $\delta$ -input LUTs  $2^{\delta}$  overhead compared to Boolean circuit

Idea: Use 100N OT to obliviously transfer OTTTs

· LUT communication becomes independent of the circuit cost

We outline two LUT protocols:

- Online LUT with low online communication
- Setup LUT with low overall but higher online communication



# **Online LUT (O-LUT)**



Use 100N OT to transfer OTTTs for all possible choices of s

1)  $P_0$  chooses random *r* and  $M_0$  and prepares  $M_{1,s'} = f(i+r,j+s') \oplus M_0$  for all  $0 \le i,j,s' \le 3$ 

2)  $P_0$  and  $P_1$  perform a 1004 OT, where  $P_1$  chooses a random table s



For  $\delta$  inputs the parties have to transfer  $2^{\delta}$  tables of  $2^{\delta}$  bits each



### Setup LUT (S-LUT)



High setup communication for OTTTs with  $\delta$  inputs

- Using circuits: between  $138 \cdot 2^{\delta}$  and  $(\delta 1) \cdot 138 \cdot 2^{\delta}$  bits
- Using 100N OT:  $2^{2\delta}$  bits

Problem: OTTTs are heavy since they require outputs for all possible inputs

Idea: pre-compute 100N OT in the setup phase and only update results in the online phase



# Improving S-LUT Round Complexity



**OT Pre-computation** 



2d rounds

Role Switching [Huang12]





d+1 rounds



# LUT Efficiency (Setup Phase)



**Setup Communication LUTs** 





# LUT Efficiency (Online Phase)



**Online Communication LUTs** 





# LUT Efficiency (Total)





#### **Total Communication LUTs**



# **Communication vs Boolean Circuits**





#### **Total Communication LUTs**

Optimum for O-LUT: 4 inputs (105% of N-MT) Optimum for S-LUT: 7 inputs (45% of N-MT)



# Part 3) Generating LUT Representations







Yet Another Compiler?



Re-doing the work for LUTs is a time-consuming and error-prone task

=> Automate the generation of LUT representations

Idea: FPGAs internally operate on single output LUTs



We use the ABC Logic synthesis tool to generate single output LUTs



#### **Grouping Multi-Output LUTs**



Problem: FPGAs only support LUTs with one output bit

We post-process and group LUTs with the same or similar inputs



S-LUT Communication: 512 bits

S-LUT Communication: 380 bits



#### **Extracting XORs**



Values are bitwise XOR secret-shared

Allows free XOR and evaluation of AND gates using MTs

Example: x = y





#### Part 4) Empirical Comparison







#### Setting



Evaluate communication and rounds for basic operations

- Addition
- Multiplication
- Comparison
- AES S-Box
- Floating-Point Operations

#### Evaluate different approaches

- 2-MT: GMW using 1002 OT extension (260 bits)
- N-MT: GMW using 100N OT extension (138 bits)
- O-LUT: for LUTs with up to 4 inputs
- S-LUT: for LUTs with up to 8 inputs









- Mostly: S-LUT < N-MT < O-LUT < 2-MT</li>
- 2-MT and N-MT perform better for Ripple-carry based circuits
- LUT approaches perform best for tree based structures









- Mostly: S-LUT < O-LUT < MT
- Exception: Multiplication with Ripple-carry addition



## **Evaluation on Applications: AES**



AES encryption of 1 block using 4 threads

• LAN (1 GBit, 0.2 ms latency)

**1 AES Evaluation in LAN** 

WAN (120 Mbit, 100ms latency)







## **Evaluation on Applications: AES**

TECHNISCHE UNIVERSITÄT DARMSTADT

AES encryption of 1 000 blocks using 4 threads

- LAN (1 GBit, 0.2 ms latency)
- WAN (120 Mbit, 100ms latency)



#### 1 000 AES Evaluations in LAN



**Take-Away Message** 



Traded more computation for less communication and rounds

GMW costs ~one ciphertext per AND

LUT protocols can reduce communication and rounds even further



# Efficient OT Extension and its Impact on Secure Computation



TECHNISCHE UNIVERSITÄT DARMSTADT

Pushing the Communication Barrier of Passive Secure Two-Party Computation

# **Questions?**

Contact: http://encrypto.de



03.06.16 | Pushing the Communication Barrier in 2PC | Slide 47

#### References



[BHKR13] M. Bellare, V. Hoang, S. Keelveedhi, P. Rogaway. Efficient Garbling from a fixed-key Cipher. In IEEE S&P 2013.

[DZ16] I. Damgård, R. W. Zakarias. Fast oblivious AES: A dedicated application of the MiniMac protocol. In Africacrypt 2016.

[GLNP15] S. Gueron, Y. Lindell, A. Nof, B. Pinkas. Fast garbling of circuits under standard assumptions. In CCS 2015.

[Huang12] Y. Huang. Practical secure two-party computation. Ph. D. Thesis 2012.

[IKMOP13] Y. Ishai, E. Kushilevitz, S. Meldgaard, C. Orlandi, A. Paskin-Cherniasky. On the power of correlated randomness in secure computation. In TCC 2013.

[IKNP03] Y. Ishai, J. Kilian, K. Nissim, E. Petrank. Extending oblivious transfer efficiently. In CRYPTO 2013.

[KK12] V. Kolesnikov, R. Kumaresan. Improved secure two-party computation via information theoretic garbled circuits. In SCN 2012.

[KK13] V. Kolesnikov, R. Kumaresan. Improved OT extension for transferring short secrets. In CRYPTO 2013.

[KSS12] B. Kreuter, A. Shelat, C. Shen. Billion-gate secure computation with malicious adversaries. In USENIX 2012.

[ZRE15] S. Zahur, M. Rosulek, D. Evans. Two halves make a whole: Reducing data transfer in garbled circuits using half gates. In EUROCRYPT 2015.

