TY - JOUR
T1 - Robust and Resource-Efficient Identification of Two Hidden Layer Neural Networks
AU - Fornasier, Massimo
AU - Klock, Timo
AU - Rauchensteiner, Michael
N1 - Publisher Copyright:
© 2021, The Author(s).
PY - 2022/2
Y1 - 2022/2
N2 - We address the structure identification and the uniform approximation of two fully nonlinear layer neural networks of the type f(x) = 1^T h(B^T g(A^T x)) on R^d, where g = (g1,…,gm0), h = (h1,…,hm1), A = (a1|…|am0) ∈ R^(d×m0) and B = (b1|…|bm1) ∈ R^(m0×m1), from a small number of query samples. The solution of the case of two hidden layers presented in this paper is crucial, as it can be further generalized to deeper neural networks. We approach the problem by actively sampling finite difference approximations to Hessians of the network. Gathering several approximate Hessians allows us to reliably approximate the matrix subspace W spanned by the symmetric tensors a1⊗a1,…,am0⊗am0 formed by the weights of the first layer, together with the entangled symmetric tensors v1⊗v1,…,vm1⊗vm1 formed by suitable combinations of the weights of the first and second layer as vℓ = A G0 bℓ/‖A G0 bℓ‖2, ℓ ∈ [m1], for a diagonal matrix G0 depending on the activation functions of the first layer. The identification of the rank-1 symmetric tensors within W is then performed by solving a robust nonlinear program that maximizes the spectral norm of the competitors constrained to the unit Frobenius sphere. We provide guarantees of stable recovery under a posteriori verifiable conditions. Once the rank-1 symmetric tensors {ai⊗ai, i ∈ [m0]} ∪ {vℓ⊗vℓ, ℓ ∈ [m1]} are computed, we address their correct attribution to the first or second layer (the ai are attributed to the first layer). The attribution to the layers is currently based on semi-heuristic reasoning, but it shows clear potential for reliable execution. Given the correct attribution of the ai and vℓ to their respective layers and the consequent de-parametrization of the network, a suitably adapted gradient descent iteration makes it possible to estimate, up to intrinsic symmetries, the shifts of the activation functions of the first layer and to compute the matrix G0 exactly. Finally, from the vectors vℓ = A G0 bℓ/‖A G0 bℓ‖2 and the ai, one can disentangle the weights bℓ by simple algebraic manipulations. Our method of identifying the weights of the network is fully constructive, with quantifiable sample complexity, and therefore contributes to reducing the black-box nature of the network training phase. We corroborate our theoretical results by extensive numerical experiments, which confirm the effectiveness and feasibility of the proposed algorithmic pipeline.
AB - We address the structure identification and the uniform approximation of two fully nonlinear layer neural networks of the type f(x) = 1^T h(B^T g(A^T x)) on R^d, where g = (g1,…,gm0), h = (h1,…,hm1), A = (a1|…|am0) ∈ R^(d×m0) and B = (b1|…|bm1) ∈ R^(m0×m1), from a small number of query samples. The solution of the case of two hidden layers presented in this paper is crucial, as it can be further generalized to deeper neural networks. We approach the problem by actively sampling finite difference approximations to Hessians of the network. Gathering several approximate Hessians allows us to reliably approximate the matrix subspace W spanned by the symmetric tensors a1⊗a1,…,am0⊗am0 formed by the weights of the first layer, together with the entangled symmetric tensors v1⊗v1,…,vm1⊗vm1 formed by suitable combinations of the weights of the first and second layer as vℓ = A G0 bℓ/‖A G0 bℓ‖2, ℓ ∈ [m1], for a diagonal matrix G0 depending on the activation functions of the first layer. The identification of the rank-1 symmetric tensors within W is then performed by solving a robust nonlinear program that maximizes the spectral norm of the competitors constrained to the unit Frobenius sphere. We provide guarantees of stable recovery under a posteriori verifiable conditions. Once the rank-1 symmetric tensors {ai⊗ai, i ∈ [m0]} ∪ {vℓ⊗vℓ, ℓ ∈ [m1]} are computed, we address their correct attribution to the first or second layer (the ai are attributed to the first layer). The attribution to the layers is currently based on semi-heuristic reasoning, but it shows clear potential for reliable execution. Given the correct attribution of the ai and vℓ to their respective layers and the consequent de-parametrization of the network, a suitably adapted gradient descent iteration makes it possible to estimate, up to intrinsic symmetries, the shifts of the activation functions of the first layer and to compute the matrix G0 exactly. Finally, from the vectors vℓ = A G0 bℓ/‖A G0 bℓ‖2 and the ai, one can disentangle the weights bℓ by simple algebraic manipulations. Our method of identifying the weights of the network is fully constructive, with quantifiable sample complexity, and therefore contributes to reducing the black-box nature of the network training phase. We corroborate our theoretical results by extensive numerical experiments, which confirm the effectiveness and feasibility of the proposed algorithmic pipeline.
KW - Active sampling
KW - Deep neural networks
KW - Deparametrization
KW - Exact identifiability
KW - Frames
KW - Nonconvex optimization on matrix spaces
UR - http://www.scopus.com/inward/record.url?scp=85124173593&partnerID=8YFLogxK
U2 - 10.1007/s00365-021-09550-5
DO - 10.1007/s00365-021-09550-5
M3 - Article
AN - SCOPUS:85124173593
SN - 0176-4276
VL - 55
SP - 475
EP - 536
JO - Constructive Approximation
JF - Constructive Approximation
IS - 1
ER -