DQN 이용해서 2D Bin Packing Problem 풀어보기 (Item 회전)

앞에서 DQN으로 풀어 봤던 2D Bin Packing 문제에서 Item은 처음 주어진 방향으로만 적재가 가능했습니다.

예를 들어 아래와 같은 7x3 크기의 파란색 Item이 있다고 했을 때 90도 회전을 하면 노란색 Item처럼 생각할 수 있는데

Item의 회전이 가능하다면 더욱 뛰어난 품질의 2D Bin Packing 결과를 만들어 낼 수 있습니다.

State정의

State 정의를 위해서 Bin에서 비어있는 칸은 0, 물건이 채워진 칸은 1로 표시 하는 방법을 동일하게 사용합니다.

10x10 의 크기를 갖는 2차원 배열을 0으로 초기화 한 다음, Item이 놓여질 때 마다 1로 업데이트를 하여 State 정의하였습니다.

예를 들어, 10x10 상자에 (7,3) 크기의 물건 하나가 회전 없이 좌측 상단에 놓인 상태는 다음과 같이 표현할 수 있습니다.

Action정의

상자가 10x10 크기이므로 놓을 수 있는 좌표의 개수는 총 10x10 = 100개로 생각할 수 있지만

Item이 90도 방향으로 회전 할 수 있으므로 가능한 Action의 크기는 2배 늘어난 10x10x2 = 200개로 정의 할 수 있습니다. self.action_space_size = bin_width * bin_height * 2

100개의 좌표를 Action Space로 정의하면 아래와 같이 생각할 수 있습니다.

Action 0 = (0, 0)에 놓기

...

Action 99 = (9, 9)에 놓기

Action 100 = (0, 0)에 회전하여 놓기

...

Action 199 = (9, 9)에 회전하여 놓기

아래의 소스를 보면 10x10 크기의 State를 입력으로 받아서 10x10x2 크기의 Action을 출력하는 것을 확인 할 수 있습니다.

class DQN(nn.Module):
    def __init__(self, h, w, outputs):
        super(DQN, self).__init__()
        input_size = h * w
        self.fc1 = nn.Linear(input_size, 512)
        self.fc2 = nn.Linear(512, 512)
        self.fc3 = nn.Linear(512, outputs) #outpus = h*w*2
Python
복사

전체소스

import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import random
import numpy as np
import time
from collections import namedtuple, deque

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

class BinPackingEnv:
    def __init__(self, bin_width, bin_height):
        self.bin_width = bin_width
        self.bin_height = bin_height
        # (수정) 액션 공간을 2배로 확장 (회전 여부 포함)
        self.action_space_size = bin_width * bin_height * 2
        self.items_list = self._generate_items()
        self.reset()

    def _generate_items(self):
        items = []
        items.append((7, 3))
        items.append((7, 3))
        items.append((7, 3))
        items.append((7, 3))
        items.append((4, 4))
        return items

    def reset(self):
        self.bin = np.zeros((self.bin_height, self.bin_width))
        self.items = list(self.items_list)
        # random.shuffle(self.items)
        self.current_item_idx = 0

    def _print_status(self):
        print(self.bin)
        if self.current_item_idx < len(self.items):
            w, h = self.items[self.current_item_idx]
            print(f"다음 물건 크기: {w}x{h} (또는 회전 시 {h}x{w})")

    # (수정) 회전된 아이템에 대한 유효 액션을 계산하도록 수정
    def get_state_and_valid_actions(self):
        state = torch.from_numpy(self.bin).float().unsqueeze(0)

        if self.current_item_idx >= len(self.items):
            return state, torch.zeros(self.action_space_size, dtype=torch.bool)

        item_w, item_h = self.items[self.current_item_idx]

        # 액션 마스크 크기를 2배로 늘림
        valid_actions_mask = torch.zeros(self.action_space_size, dtype=torch.bool)

        # 1. 원래 방향으로 놓는 경우
        for y in range(self.bin_height - item_h + 1):
            for x in range(self.bin_width - item_w + 1):
                if np.all(self.bin[y:y + item_h, x:x + item_w] == 0):
                    action_index = y * self.bin_width + x
                    valid_actions_mask[action_index] = True

        # 2. 회전해서 놓는 경우 (가로/세로가 다른 경우에만)
        if item_w != item_h:
            item_h_rot, item_w_rot = item_w, item_h  # 높이와 너비를 스왑
            for y in range(self.bin_height - item_h_rot + 1):
                for x in range(self.bin_width - item_w_rot + 1):
                    if np.all(self.bin[y:y + item_h_rot, x:x + item_w_rot] == 0):
                        # 회전된 액션은 기본 액션 공간 뒷부분에 위치
                        action_index = (self.bin_width * self.bin_height) + (y * self.bin_width + x)
                        valid_actions_mask[action_index] = True

        return state, valid_actions_mask

    # (수정) 회전된 경우까지 고려하여 가능한 수가 있는지 확인
    def _check_if_any_valid_moves_exist(self):
        if self.current_item_idx >= len(self.items):
            return True

        item_w, item_h = self.items[self.current_item_idx]

        # 원래 방향 검사
        for y in range(self.bin_height - item_h + 1):
            for x in range(self.bin_width - item_w + 1):
                if np.all(self.bin[y:y + item_h, x:x + item_w] == 0):
                    return True

        # 회전된 방향 검사
        if item_w != item_h:
            item_h_rot, item_w_rot = item_w, item_h
            for y in range(self.bin_height - item_h_rot + 1):
                for x in range(self.bin_width - item_w_rot + 1):
                    if np.all(self.bin[y:y + item_h_rot, x:x + item_w_rot] == 0):
                        return True

        return False

    # (수정) 액션에 따라 회전 여부를 결정하고 아이템을 놓도록 수정
    def step(self, action):
        item_w, item_h = self.items[self.current_item_idx]

        base_action_space = self.bin_width * self.bin_height
        is_rotated = action >= base_action_space

        if is_rotated:
            # 회전된 경우, 너비와 높이를 바꿈
            item_w, item_h = item_h, item_w
            coord_action = action - base_action_space
        else:
            coord_action = action

        y, x = np.unravel_index(coord_action, (self.bin_height, self.bin_width))

        self.bin[y:y + item_h, x:x + item_w] = 1  # 넘파이 슬라이싱으로 간단하게 채우기

        reward = float(item_w * item_h)
        self.current_item_idx += 1
        done = False

        if not self._check_if_any_valid_moves_exist():
            if self.current_item_idx >= len(self.items):
                reward += 10  # (수정) 보너스 상향
            else:
                reward -= 10  # (수정) 페널티 상향
            done = True

        if self.current_item_idx >= len(self.items) and not done:
            reward += 10  # (수정) 보너스 상향
            done = True

        return reward, done


class DQN(nn.Module):
    def __init__(self, h, w, outputs):
        super(DQN, self).__init__()
        input_size = h * w
        # (추가) 신경망 레이어 확장
        self.fc1 = nn.Linear(input_size, 512)
        self.fc2 = nn.Linear(512, 512)
        self.fc3 = nn.Linear(512, outputs)

    def forward(self, x):
        x = x.view(x.size(0), -1)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        return self.fc3(x)


Transition = namedtuple('Transition', ('state', 'action', 'next_state', 'reward', 'done'))


class ReplayBuffer:
    def __init__(self, capacity):
        self.memory = deque([], maxlen=capacity)

    def push(self, *args):
        self.memory.append(Transition(*args))

    def sample(self, batch_size):
        return random.sample(self.memory, batch_size)

    def __len__(self):
        return len(self.memory)


BIN_SIZE = 10
# (수정) N_ACTIONS를 2배로 설정
N_ACTIONS = BIN_SIZE * BIN_SIZE * 2
EPISODES = 60000
LEARNING_RATE = 0.0005
GAMMA = 0.99
BATCH_SIZE = 128
REPLAY_MEMORY_SIZE = 10000

env = BinPackingEnv(BIN_SIZE, BIN_SIZE)
policy_net = DQN(BIN_SIZE, BIN_SIZE, N_ACTIONS).to(device)
target_net = DQN(BIN_SIZE, BIN_SIZE, N_ACTIONS).to(device)
target_net.load_state_dict(policy_net.state_dict())
target_net.eval()

optimizer = optim.Adam(policy_net.parameters(), lr=LEARNING_RATE)
loss_function = nn.MSELoss()
replay_buffer = ReplayBuffer(REPLAY_MEMORY_SIZE)

epsilon = 0.3
TARGET_UPDATE = 50


def optimize_model():
    if len(replay_buffer) < BATCH_SIZE:
        return

    transitions = replay_buffer.sample(BATCH_SIZE)
    batch = Transition(*zip(*transitions))

    state_batch = torch.cat(batch.state).to(device)
    action_batch = torch.cat(batch.action).to(device)
    reward_batch = torch.cat(batch.reward).to(device)
    done_batch = torch.cat(batch.done).to(device)

    non_final_next_states_list = [s for s in batch.next_state if s is not None]
    if len(non_final_next_states_list) > 0:
        non_final_next_states = torch.cat(non_final_next_states_list).to(device)
    else:
        # (수정) 입력 채널을 1로 명시
        non_final_next_states = torch.empty(0, 1, BIN_SIZE, BIN_SIZE).to(device)

    predicted_q = policy_net(state_batch).gather(1, action_batch)

    next_state_q_values = torch.zeros(BATCH_SIZE, device=device)
    non_final_mask = ~done_batch

    if non_final_next_states.size(0) > 0:
        next_state_q_values[non_final_mask] = target_net(non_final_next_states).max(1)[0].detach()

    target_q = reward_batch + (GAMMA * next_state_q_values)

    loss = loss_function(predicted_q, target_q.unsqueeze(1))
    optimizer.zero_grad()
    loss.backward()
    for param in policy_net.parameters():  # (추가) Gradient Clipping
        param.grad.data.clamp_(-1, 1)
    optimizer.step()


print("===== 학습 시작 =====")
total_rewards = []
for i_episode in range(EPISODES):
    env.reset()
    done = False
    episode_reward = 0

    while not done:
        state, valid_actions_mask = env.get_state_and_valid_actions()

        if not valid_actions_mask.any():
            break

        if random.random() < epsilon:
            valid_indices = torch.where(valid_actions_mask)[0]
            action = random.choice(valid_indices).item()
        else:
            with torch.no_grad():
                q_values = policy_net(state.to(device))
                q_values = q_values.masked_fill(~valid_actions_mask.to(device), -float('inf'))
                action = q_values.max(1)[1].item()

        reward, done = env.step(action)
        episode_reward += reward

        next_state, _ = env.get_state_and_valid_actions()
        if done:
            next_state = None

        action_tensor = torch.tensor([[action]], dtype=torch.long)
        reward_tensor = torch.tensor([reward], dtype=torch.float)
        done_tensor = torch.tensor([done], dtype=torch.bool)
        replay_buffer.push(state, action_tensor, next_state, reward_tensor, done_tensor)

        optimize_model()

    epsilon = max(0.01, epsilon * 0.99995)

    if i_episode % TARGET_UPDATE == 0:
        target_net.load_state_dict(policy_net.state_dict())

    total_rewards.append(episode_reward)
    if (i_episode + 1) % 100 == 0:
        avg_reward = np.mean(total_rewards[-100:])
        print(f"에피소드 {i_episode + 1}/{EPISODES} | 최근 100 에피소드 평균 보상: {avg_reward:.2f} | Epsilon: {epsilon:.3f}")

print("\n===== 학습 완료! =====\n")

print("===== 실제 적재 테스트 시작 =====")
env.reset()
env._print_status()
time.sleep(1)

done = False
while not done:
    state, valid_actions_mask = env.get_state_and_valid_actions()
    if not valid_actions_mask.any():
        print("\n실패! 더 이상 물건을 놓을 공간이 없습니다.")
        break

    with torch.no_grad():
        q_values = policy_net(state.to(device))
        q_values = q_values.masked_fill(~valid_actions_mask.to(device), -float('inf'))
        action = q_values.max(1)[1].item()

    # (수정) 테스트 출력 시 회전 여부 표시
    base_action_space = BIN_SIZE * BIN_SIZE
    is_rotated = action >= base_action_space
    if is_rotated:
        coord_action = action - base_action_space
        rotation_text = "(90도 회전)"
    else:
        coord_action = action
        rotation_text = ""

    coords = np.unravel_index(coord_action, (BIN_SIZE, BIN_SIZE))
    print(f"\n-> AI의 선택: {coords} 위치에 놓기 {rotation_text}")
    _, done = env.step(action)

    env._print_status()
    time.sleep(1)

placed_items = env.current_item_idx
total_items = len(env.items)
print("\n===== 테스트 종료! =====")
print(f"최종 결과: 총 {total_items}개의 물건 중 {placed_items}개를 적재했습니다.")
print("최종 상자 모습:")
print(env.bin)
Python
복사

학습결과를 시각화 하면 아래와 같습니다.

Using device: cuda
===== 학습 시작 =====
에피소드 100/60000 | 최근 100 에피소드 평균 보상: 50.90 | Epsilon: 0.299
에피소드 200/60000 | 최근 100 에피소드 평균 보상: 56.78 | Epsilon: 0.297
...
에피소드 59800/60000 | 최근 100 에피소드 평균 보상: 108.92 | Epsilon: 0.015
에피소드 59900/60000 | 최근 100 에피소드 평균 보상: 108.14 | Epsilon: 0.015
에피소드 60000/60000 | 최근 100 에피소드 평균 보상: 107.06 | Epsilon: 0.015

===== 학습 완료! =====

===== 실제 적재 테스트 시작 =====
[[0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]
다음 물건 크기: 7x3 (또는 회전 시 3x7)

-> AI의 선택: (np.int64(3), np.int64(0)) 위치에 놓기 (90도 회전)
[[0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [1. 1. 1. 0. 0. 0. 0. 0. 0. 0.]
 [1. 1. 1. 0. 0. 0. 0. 0. 0. 0.]
 [1. 1. 1. 0. 0. 0. 0. 0. 0. 0.]
 [1. 1. 1. 0. 0. 0. 0. 0. 0. 0.]
 [1. 1. 1. 0. 0. 0. 0. 0. 0. 0.]
 [1. 1. 1. 0. 0. 0. 0. 0. 0. 0.]
 [1. 1. 1. 0. 0. 0. 0. 0. 0. 0.]]
다음 물건 크기: 7x3 (또는 회전 시 3x7)

-> AI의 선택: (np.int64(0), np.int64(7)) 위치에 놓기 (90도 회전)
[[0. 0. 0. 0. 0. 0. 0. 1. 1. 1.]
 [0. 0. 0. 0. 0. 0. 0. 1. 1. 1.]
 [0. 0. 0. 0. 0. 0. 0. 1. 1. 1.]
 [1. 1. 1. 0. 0. 0. 0. 1. 1. 1.]
 [1. 1. 1. 0. 0. 0. 0. 1. 1. 1.]
 [1. 1. 1. 0. 0. 0. 0. 1. 1. 1.]
 [1. 1. 1. 0. 0. 0. 0. 1. 1. 1.]
 [1. 1. 1. 0. 0. 0. 0. 0. 0. 0.]
 [1. 1. 1. 0. 0. 0. 0. 0. 0. 0.]
 [1. 1. 1. 0. 0. 0. 0. 0. 0. 0.]]
다음 물건 크기: 7x3 (또는 회전 시 3x7)

-> AI의 선택: (np.int64(0), np.int64(0)) 위치에 놓기 
[[1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
 [1. 1. 1. 0. 0. 0. 0. 1. 1. 1.]
 [1. 1. 1. 0. 0. 0. 0. 1. 1. 1.]
 [1. 1. 1. 0. 0. 0. 0. 1. 1. 1.]
 [1. 1. 1. 0. 0. 0. 0. 1. 1. 1.]
 [1. 1. 1. 0. 0. 0. 0. 0. 0. 0.]
 [1. 1. 1. 0. 0. 0. 0. 0. 0. 0.]
 [1. 1. 1. 0. 0. 0. 0. 0. 0. 0.]]
다음 물건 크기: 7x3 (또는 회전 시 3x7)

-> AI의 선택: (np.int64(7), np.int64(3)) 위치에 놓기 
[[1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
 [1. 1. 1. 0. 0. 0. 0. 1. 1. 1.]
 [1. 1. 1. 0. 0. 0. 0. 1. 1. 1.]
 [1. 1. 1. 0. 0. 0. 0. 1. 1. 1.]
 [1. 1. 1. 0. 0. 0. 0. 1. 1. 1.]
 [1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]]
다음 물건 크기: 4x4 (또는 회전 시 4x4)

-> AI의 선택: (np.int64(3), np.int64(3)) 위치에 놓기 
[[1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]]

===== 테스트 종료! =====
최종 결과: 총 5개의 물건 중 5개를 적재했습니다.
최종 상자 모습:
[[1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]]
Python
복사

Reference:

[1] The Bin Packing Problem by Google, https://developers.google.com/optimization/pack/bin_packing, Creative Commons Attribution 4.0 License