Gradient Descent Optimization

Gradient Descent Momentum

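The momentum method also takes past gradients into account when it updates the current parameters. As a result, even if the current gradient is 0, the accumulated past gradients keep the parameters moving forward, much like inertia keeps an object in motion. If every past gradient were counted at full weight, however, SGD would never come to rest. Momentum therefore decays the influence of past gradients at every update.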

$$\begin{aligned} v_t &= \gamma v_{t-1} + \eta \nabla_{\theta_t} J(\theta_t) \\ \theta_{t+1} &= \theta_t - v_t \end{aligned}$$

$\gamma$ (the decay rate) takes a value between 0 and 1 and is usually set to 0.9. This helps SGD escape shallow local minima and explore more of the loss function.
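The update above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the toy objective $J(\theta) = \theta^2$, the starting point, and the value `lr=0.1` are assumptions chosen for the example, and `momentum_step` and `grad_J` are hypothetical names.

```python
def momentum_step(theta, v, grad_fn, lr=0.1, gamma=0.9):
    """One momentum update: v_t = gamma * v_{t-1} + lr * grad(theta_t)."""
    v = gamma * v + lr * grad_fn(theta)
    return theta - v, v


def grad_J(theta):
    # Toy objective (assumed for illustration): J(theta) = theta**2.
    return 2.0 * theta


# Run momentum on the toy objective, starting from theta = 5 with zero velocity.
theta, v = 5.0, 0.0
for _ in range(300):
    theta, v = momentum_step(theta, v, grad_J)

# Even with a zero gradient, the stored velocity still moves the parameter,
# but the velocity itself shrinks by the factor gamma.
coasted_theta, coasted_v = momentum_step(1.0, 0.5, lambda th: 0.0)
print(theta, coasted_theta, coasted_v)
```

The last call shows both halves of the argument: the parameter keeps moving although the gradient is 0, and the velocity is already smaller than before, so the motion eventually dies out.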

Nesterov Accelerated Gradient, NAG

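Momentum derives the next position from the gradient at the current update step. Because of this, inertia can carry the parameters past the optimum. NAG instead evaluates the gradient at the point reached after the momentum step and uses that gradient for the update, which mitigates the problem.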

$$\begin{aligned} v_t &= \gamma v_{t-1} + \eta \nabla_{\theta_t} J(\theta_t - \gamma v_{t-1}) \\ \theta_{t+1} &= \theta_t - v_t \end{aligned}$$

In the NAG update above, the term $\nabla_{\theta_t} J(\theta_t - \gamma v_{t-1})$ applies the gradient at the point reached through inertia. This keeps the benefit of moving quickly under momentum while still braking effectively where the parameters should stop.
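To make the braking effect concrete, the sketch below runs plain momentum and NAG side by side and records how far each one overshoots past the minimum. The toy objective $J(\theta) = \theta^2$ (minimum at 0), the start at $\theta = 5$, and `lr=0.1` are assumptions for illustration; all function names are hypothetical.

```python
def momentum_step(theta, v, grad_fn, lr=0.1, gamma=0.9):
    # Plain momentum: gradient evaluated at the current position theta.
    v = gamma * v + lr * grad_fn(theta)
    return theta - v, v


def nag_step(theta, v, grad_fn, lr=0.1, gamma=0.9):
    # NAG: gradient evaluated at the look-ahead point theta - gamma * v.
    lookahead = theta - gamma * v
    v = gamma * v + lr * grad_fn(lookahead)
    return theta - v, v


def grad_J(theta):
    # Toy objective (assumed): J(theta) = theta**2, minimum at 0.
    return 2.0 * theta


def lowest_point(step_fn, theta0=5.0, steps=50):
    # Start above the minimum and track the lowest theta reached;
    # a negative value means the optimizer overshot past 0.
    theta, v, lowest = theta0, 0.0, theta0
    for _ in range(steps):
        theta, v = step_fn(theta, v, grad_J)
        lowest = min(lowest, theta)
    return lowest


overshoot_momentum = lowest_point(momentum_step)
overshoot_nag = lowest_point(nag_step)
print(overshoot_momentum, overshoot_nag)
```

On this toy problem both methods overshoot the minimum, but the NAG run swings noticeably less far past it than the plain momentum run. This is one example under the stated settings, not a general guarantee.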

Adaptive Gradient, Adagrad

So far, the same learning rate was applied to every parameter. If the update size is instead varied according to how often each parameter has been updated, the learning rate effectively adapts to each parameter's influence.

Adagrad uses the following update rule.

$$\theta_{t+1,i} = \theta_{t,i} - \frac{\eta}{\sqrt{G_{t,i} + \epsilon}} g_{t,i}$$

Here $G_t$ is a diagonal matrix whose $i$-th diagonal element is the sum of the squared gradients with respect to $\theta_i$ up to time $t$. Parameters that have already moved a lot therefore receive relatively small updates, while parameters that have moved little receive large ones. $\epsilon$ is a very small value used to prevent division by zero. Adagrad is built on the idea of scaling each parameter's update relative to its own history, but as training proceeds the update size shrinks noticeably, and eventually the parameters stop moving.
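A minimal sketch of this per-parameter scaling, in plain Python. The two-parameter gradient stream (one parameter receives a gradient every step, the other only every tenth step) and `lr=0.5` are assumptions chosen for illustration; `adagrad_step` is a hypothetical name.

```python
import math


def adagrad_step(theta, G, grad, lr=0.5, eps=1e-8):
    """One Adagrad update; theta, G, grad are lists of equal length."""
    new_theta, new_G = [], []
    for th, g_acc, g in zip(theta, G, grad):
        g_acc = g_acc + g * g  # running sum of squared gradients
        new_G.append(g_acc)
        new_theta.append(th - lr / math.sqrt(g_acc + eps) * g)
    return new_theta, new_G


theta = [0.0, 0.0]
G = [0.0, 0.0]
for t in range(100):
    # Parameter 0 is updated every step, parameter 1 only every 10th step.
    grad = [1.0, 1.0 if t % 10 == 0 else 0.0]
    theta, G = adagrad_step(theta, G, grad)

# Effective per-parameter learning rate after 100 steps.
effective_lr = [0.5 / math.sqrt(g_acc + 1e-8) for g_acc in G]
print(G, effective_lr)
```

The rarely updated parameter ends up with the larger effective learning rate. Both rates can only decrease, because the accumulated sum never shrinks, which is the stalling behavior described above.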

RMSProp

To fix this problem in Adagrad, RMSProp replaces the running sum $G_t$ with an exponential moving average of the squared gradients, $E[g^2]_t$. This lets training maintain a minimum step size.

$$\begin{aligned} E[g^2]_t &= \gamma E[g^2]_{t-1} + (1-\gamma)\left(\nabla J(\theta_t)\right)^2 \\ \theta_{t+1} &= \theta_t - \frac{\alpha}{\sqrt{E[g^2]_t} + \epsilon} \nabla J(\theta_t) \end{aligned}$$

As the equations show, the differentiation between parameters is preserved as training proceeds, while the learning rate is prevented from shrinking steadily toward 0.
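The difference from Adagrad can be checked numerically. The sketch below feeds the same constant gradient to both accumulators and compares the resulting step scale. It assumes a single decay rate, called `gamma` here, for both terms of the moving average; the constant gradient of 1.0 and `lr=0.01` are also assumptions for illustration, and `rmsprop_step` is a hypothetical name.

```python
import math


def rmsprop_step(theta, avg_sq, g, lr=0.01, gamma=0.9, eps=1e-8):
    """One RMSProp update using an exponential moving average of g**2."""
    avg_sq = gamma * avg_sq + (1.0 - gamma) * g * g
    theta = theta - lr / (math.sqrt(avg_sq) + eps) * g
    return theta, avg_sq


lr = 0.01
theta, avg_sq, grad_sum_sq = 0.0, 0.0, 0.0
for _ in range(1000):
    g = 1.0  # constant gradient, assumed for illustration
    theta, avg_sq = rmsprop_step(theta, avg_sq, g, lr=lr)
    grad_sum_sq += g * g  # what Adagrad would accumulate instead

# Factor that multiplies the gradient in each method after 1000 steps.
rmsprop_scale = lr / (math.sqrt(avg_sq) + 1e-8)
adagrad_scale = lr / math.sqrt(grad_sum_sq + 1e-8)
print(rmsprop_scale, adagrad_scale)
```

With a steady gradient, the moving average settles near the squared gradient, so the RMSProp scale stays close to `lr`, while the Adagrad scale keeps shrinking as the sum grows.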

Adaptive Moment Estimation, Adam

Adam combines RMSProp and Momentum: it uses exponential moving averages of both the gradient and the squared gradient to adjust the step size (learning rate). The first moment $E[X]$ is the population mean, and the second moment $E[X^2]$ together with the first moment gives the population variance.
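For reference, the relation between the two moments and the variance mentioned here is the standard identity:

$$\mathrm{Var}(X) = E\left[(X - E[X])^2\right] = E[X^2] - \left(E[X]\right)^2$$

In Adam the two moving averages play the roles of estimates of $E[g]$ and $E[g^2]$ for the gradient $g$; that reading is an interpretation added here, not a statement from the post.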

$$\begin{aligned} m_t &= \beta_1 m_{t-1} + (1-\beta_1) \nabla J(\theta_t) \\ v_t &= \beta_2 v_{t-1} + (1-\beta_2) \left(\nabla J(\theta_t)\right)^2 \end{aligned}$$

To correct the bias, the bias-corrected estimates are computed as follows.

$$\begin{aligned} \hat{m}_t &= \frac{m_t}{1-\beta_1^t} \\ \hat{v}_t &= \frac{v_t}{1-\beta_2^t} \end{aligned}$$
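A short worked example may show why the correction matters. Assume $m_0 = 0$ (the usual initialization, though the post does not state it) and $\beta_1 = 0.9$, and write $g_1$ as shorthand for $\nabla J(\theta_1)$. After the first step:

$$\begin{aligned} m_1 &= \beta_1 m_0 + (1-\beta_1)\, g_1 = 0.1\, g_1 \\ \hat{m}_1 &= \frac{m_1}{1-\beta_1^1} = \frac{0.1\, g_1}{0.1} = g_1 \end{aligned}$$

Without the correction, the first estimate would be ten times smaller than the actual gradient. As $t$ grows, $\beta_1^t$ approaches 0 and the correction factor approaches 1, so it only matters early in training.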

The final update rule is as follows.

$$\theta_{t+1} = \theta_t - \frac{\alpha}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t$$
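Putting the three Adam equations together gives the sketch below. The defaults `lr=0.001`, `beta1=0.9`, `beta2=0.999`, and `eps=1e-8` are the commonly cited values and are not taken from this post; the toy objective $J(\theta) = \theta^2$, the zero initialization of `m` and `v`, and the names `adam_step` and `grad_J` are likewise assumptions for illustration.

```python
import math


def adam_step(theta, m, v, g, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1.0 - beta1) * g        # moving average of gradients
    v = beta2 * v + (1.0 - beta2) * g * g    # moving average of squared gradients
    m_hat = m / (1.0 - beta1 ** t)           # bias-corrected first moment
    v_hat = v / (1.0 - beta2 ** t)           # bias-corrected second moment
    theta = theta - lr / (math.sqrt(v_hat) + eps) * m_hat
    return theta, m, v


def grad_J(theta):
    # Toy objective (assumed for illustration): J(theta) = theta**2.
    return 2.0 * theta


# First step from theta = 1: after bias correction the step is roughly lr.
first_theta, _, _ = adam_step(1.0, 0.0, 0.0, grad_J(1.0), t=1)

# Longer run with a larger learning rate so the toy problem converges quickly.
theta, m, v = 1.0, 0.0, 0.0
for t in range(1, 2001):
    theta, m, v = adam_step(theta, m, v, grad_J(theta), t, lr=0.01)
print(first_theta, theta)
```

Note that the first step has a size of about `lr` regardless of the gradient's magnitude, because $\hat{m}_1 / \sqrt{\hat{v}_1}$ reduces to the sign of the gradient.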

Eric LAB
