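[Concept Summary] What Is a Scheduler? (Deep Learning Training)

This post covers learning rate schedulers.

1. What Is a Scheduler in Deep Learning?

- A scheduler is a strategy for changing the learning rate over time or across training stages.

- The learning rate is a key hyperparameter that determines how large each weight update is during training. It is often fixed to a single value.

- A scheduler does not keep the learning rate at one value. It schedules how the learning rate changes as training progresses, which is where the name comes from.

- A typical design uses a large learning rate at the start of training to approach the optimum quickly, then lowers it gradually so the model can make finer adjustments.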

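- Using a scheduler brings the following benefits.

1) Fast early training

* A relatively large learning rate early on moves the weights quickly, so the model approaches the optimum faster.

2) Stable convergence

* As training progresses, the learning rate is lowered so the weights are updated in smaller steps, which helps the model converge stably.

3) Reduced risk of overfitting

* Gradually lowering the learning rate keeps late updates small and stable, which can help generalization in practice. A scheduler is not a regularizer in itself, so it is usually combined with techniques such as weight decay or early stopping.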
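To make the idea concrete, here is a minimal sketch of where a scheduler sits in a PyTorch training loop. The toy data (y = 2x + 1), the single linear layer, and the choice of ExponentialLR with gamma=0.9 are all assumptions made for illustration; any of the schedulers described in the next section could be dropped in at the same place.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Toy regression data: y = 2x + 1 (illustrative only)
x = torch.linspace(-1, 1, 64).unsqueeze(1)
y = 2 * x + 1

model = nn.Linear(1, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
# Any scheduler can be plugged in here; ExponentialLR is just an example.
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.9)
loss_fn = nn.MSELoss()

lr_history, loss_history = [], []
for epoch in range(20):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()   # update the weights with the current learning rate

    lr_history.append(scheduler.get_last_lr()[0])
    loss_history.append(loss.item())
    scheduler.step()   # then move the learning rate to its next value

print(f"lr:   {lr_history[0]:.4f} -> {lr_history[-1]:.4f}")
print(f"loss: {loss_history[0]:.4f} -> {loss_history[-1]:.4f}")
```

The point to notice is the order: optimizer.step() comes first and scheduler.step() second, once per epoch in this sketch.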

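2. Common Schedulers

- The five schedulers below are among the most widely used.

1) Step Decay

* The learning rate is reduced in discrete steps after a fixed number of epochs.

* In the example below, step_size sets how many epochs pass between reductions, and gamma sets the multiplicative factor applied at each reduction (0.5 means the learning rate drops to 50% of its previous value).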

scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)
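As a quick check of what these two parameters do, the sketch below records the learning rate for 30 epochs. The dummy parameter and the initial learning rate of 0.1 are assumptions made only so that an optimizer can be built; no real training happens.

```python
import torch

# A single dummy parameter is enough to build an optimizer for this demo.
param = torch.nn.Parameter(torch.zeros(1))
optimizer = torch.optim.SGD([param], lr=0.1)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)

step_lrs = []
for epoch in range(30):
    step_lrs.append(optimizer.param_groups[0]["lr"])
    optimizer.step()   # no gradients here; called only to keep the usual order
    scheduler.step()

print(step_lrs[0], step_lrs[10], step_lrs[20])
```

The rate stays at 0.1 for epochs 0 to 9, drops to 0.05 for epochs 10 to 19, and to 0.025 for epochs 20 to 29.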

2) Exponential Decay

* The learning rate decays exponentially as the number of epochs grows.

* In the code below, gamma is the factor the learning rate is multiplied by after every epoch.

scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.9)
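Since the rate is multiplied by gamma every epoch, the value after n epochs is simply the initial rate times gamma to the power n. The short sketch below works through that arithmetic in plain Python; the initial learning rate of 0.1 and the helper name exponential_lr are assumptions for illustration.

```python
import math

initial_lr = 0.1   # assumed starting learning rate
gamma = 0.9        # same value as in the ExponentialLR line above

def exponential_lr(epoch: int) -> float:
    """Closed form of exponential decay: initial_lr * gamma ** epoch."""
    return initial_lr * gamma ** epoch

# Number of epochs until the learning rate has halved
half_life = math.log(0.5) / math.log(gamma)

for epoch in (0, 10, 50):
    print(f"epoch {epoch:>2}: lr = {exponential_lr(epoch):.6f}")
print(f"half-life: {half_life:.2f} epochs")
```

With gamma=0.9 the rate halves roughly every 6.6 epochs, so over a 100-epoch run it becomes very small long before the end. A gamma closer to 1 gives a gentler decay.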

3) Cosine Annealing

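* The learning rate follows a cosine curve from its initial value down to a minimum (eta_min, which is 0 by default). When T_max equals the total number of epochs, the rate only decreases. It rises again only if training continues past T_max. For periodic jumps back to the initial rate, which give the model a chance to leave a poor local optimum, PyTorch provides a separate class, CosineAnnealingWarmRestarts.

* In the code below, T_max=100 means the descent from the initial rate to the minimum takes 100 epochs, which is half of a full cosine period.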

scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)
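The difference between plain cosine annealing and the warm-restart variant is easiest to see side by side. The sketch below is only an illustration: the helper lr_trace, the dummy parameter, the initial rate of 0.1, and the short cycle length of 10 epochs are all assumptions chosen to keep the output readable.

```python
import torch

def lr_trace(make_scheduler, num_epochs, lr=0.1):
    """Record the learning rate at the start of each epoch (hypothetical helper)."""
    param = torch.nn.Parameter(torch.zeros(1))
    optimizer = torch.optim.SGD([param], lr=lr)
    scheduler = make_scheduler(optimizer)
    trace = []
    for _ in range(num_epochs):
        trace.append(optimizer.param_groups[0]["lr"])
        optimizer.step()
        scheduler.step()
    return trace

# Plain cosine annealing: one descent that reaches eta_min after T_max epochs.
plain = lr_trace(
    lambda opt: torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=10, eta_min=0.0),
    num_epochs=11,
)

# Warm restarts: the rate jumps back to its initial value every T_0 epochs.
restarts = lr_trace(
    lambda opt: torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(opt, T_0=10),
    num_epochs=21,
)

print([round(v, 4) for v in plain])
print([round(v, 4) for v in restarts])
```

In the first trace the rate falls from 0.1 to about 0 at epoch 10. In the second it falls in the same way but returns to 0.1 at epochs 10 and 20.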

4) Reduce on Plateau

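* The learning rate is reduced when a monitored metric (for example, the validation loss) has not improved for a certain period. Because the rate drops only when training has actually stalled, the model is nudged toward finer adjustments exactly when it needs them.

* In the code below, patience=10 means the scheduler tolerates 10 epochs without improvement before reducing the rate, and factor=0.1 means the rate is multiplied by 0.1, i.e. reduced to 10% of its current value. Unlike the other schedulers, this one needs the monitored value passed in on every call, e.g. scheduler.step(val_loss).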

scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, patience=10, factor=0.1)
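Here is a small sketch of how the plateau logic plays out. The validation losses are made up, the dummy parameter exists only so that an optimizer can be built, and patience is set to 2 instead of 10 purely to keep the demo short.

```python
import torch

param = torch.nn.Parameter(torch.zeros(1))
optimizer = torch.optim.SGD([param], lr=0.1)
# patience=2 (instead of 10) only to keep the demo short
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", patience=2, factor=0.1
)

# Hypothetical validation losses: one improvement, then a plateau
val_losses = [1.0, 0.8, 0.8, 0.8, 0.8]

plateau_lrs = []
for val_loss in val_losses:
    optimizer.step()
    scheduler.step(val_loss)   # the monitored metric must be passed in
    plateau_lrs.append(optimizer.param_groups[0]["lr"])

print(plateau_lrs)
```

The loss stops improving after the second epoch. The scheduler tolerates two stalled epochs and reduces the rate from 0.1 to about 0.01 on the third one, which is the fifth call in this list.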

5) OneCycleLR

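* The learning rate starts low, rises to max_lr during the first part of training (the first 30% of steps by default), and then anneals down to a value far below the starting rate. With the default settings the starting rate is max_lr / 25. This scheduler is meant to be stepped once per batch rather than once per epoch, which is why it takes steps_per_epoch and epochs.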

scheduler = torch.optim.lr_scheduler.OneCycleLR(optimizer, max_lr=0.1, steps_per_epoch=100, epochs=10)
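Because OneCycleLR is stepped per batch, the loop looks slightly different from the epoch-based schedulers. The sketch below uses the same arguments as the line above and relies on PyTorch's default settings for everything else; the dummy parameter and the empty "training" step are assumptions for illustration.

```python
import torch

param = torch.nn.Parameter(torch.zeros(1))
optimizer = torch.optim.SGD([param], lr=0.1)
steps_per_epoch, epochs = 100, 10
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=0.1, steps_per_epoch=steps_per_epoch, epochs=epochs
)

one_cycle_lrs = []
for epoch in range(epochs):
    for batch in range(steps_per_epoch):
        one_cycle_lrs.append(optimizer.param_groups[0]["lr"])
        optimizer.step()
        scheduler.step()   # once per batch, not once per epoch

peak_step = max(range(len(one_cycle_lrs)), key=one_cycle_lrs.__getitem__)
print(f"start lr: {one_cycle_lrs[0]:.6f}")
print(f"peak lr:  {max(one_cycle_lrs):.6f} at step {peak_step}")
print(f"final lr: {one_cycle_lrs[-1]:.2e}")
```

Note that the lr passed to the optimizer is overwritten: the schedule is defined by max_lr. Calling scheduler.step() more times than steps_per_epoch * epochs raises an error, so the two numbers should match the real training loop.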

- To compare the schedulers above, let's visualize them.

import numpy as np
import matplotlib.pyplot as plt

# Define the number of epochs
epochs = 100

# Step Decay Function
def step_decay(epoch, initial_lr=0.1, drop_rate=0.5, step_size=10):
    return initial_lr * (drop_rate ** np.floor(epoch / step_size))

# Exponential Decay Function
def exp_decay(epoch, initial_lr=0.1, decay_rate=0.9):
    return initial_lr * (decay_rate ** epoch)

# Cosine Annealing Function (minimum learning rate of 0)
def cosine_annealing(epoch, initial_lr=0.1, T_max=100):
    return initial_lr * (1 + np.cos(np.pi * epoch / T_max)) / 2

# Reduce on Plateau (simplified as a step decrease after 50 epochs for illustration;
# the real scheduler reacts to a monitored metric, not to the epoch number)
def reduce_on_plateau(epoch, initial_lr=0.1, reduce_rate=0.1, patience=50):
    if epoch >= patience:
        return initial_lr * reduce_rate
    return initial_lr

# OneCycleLR (simplified: linear warm-up from 0 to max_lr over the first half,
# then cosine decay; steps_per_epoch is unused in this per-epoch approximation)
def one_cycle_lr(epoch, max_lr=0.1, steps_per_epoch=100, epochs=100):
    peak_epoch = epochs // 2
    if epoch < peak_epoch:
        return (max_lr / peak_epoch) * epoch
    else:
        return max_lr * (np.cos(np.pi * (epoch - peak_epoch) / (epochs - peak_epoch)) + 1) / 2

# Generate the learning rate schedules for all methods
epochs_range = np.arange(epochs)
step_decay_lr = [step_decay(e) for e in epochs_range]
exp_decay_lr = [exp_decay(e) for e in epochs_range]
cosine_annealing_lr = [cosine_annealing(e) for e in epochs_range]
reduce_on_plateau_lr = [reduce_on_plateau(e) for e in epochs_range]
one_cycle_lr_vals = [one_cycle_lr(e) for e in epochs_range]

# Plot each learning rate schedule in its own subplot
fig, axes = plt.subplots(3, 2, figsize=(15, 12))
fig.tight_layout(pad=5.0)

# (title, values, line color) for each schedule, in plotting order
schedules = [
    ('Step Decay', step_decay_lr, 'b'),
    ('Exponential Decay', exp_decay_lr, 'g'),
    ('Cosine Annealing', cosine_annealing_lr, 'r'),
    ('Reduce on Plateau', reduce_on_plateau_lr, 'c'),
    ('OneCycleLR', one_cycle_lr_vals, 'm'),
]

for ax, (title, values, color) in zip(axes.flat, schedules):
    ax.plot(epochs_range, values, label=title, color=color)
    ax.set_title(title)
    ax.set_xlabel('Epochs')
    ax.set_ylabel('Learning Rate')
    ax.grid(True)

# Hide the empty subplot (bottom right)
axes[2, 1].axis('off')

plt.show()

* The x-axis shows epochs and the y-axis shows the learning rate, so you can see directly how each scheduler changes the learning rate as the epochs advance.
