Deep Learning from Scratch 2 (ゼロから作る Deep Learning 2) / Speeding Up word2vec

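Chapter 4: Speeding Up word2vec

These are my reading notes on ゼロから作る Deep Learning (2) 自然言語処理編 (Deep Learning from Scratch 2: Natural Language Processing). The word2vec implemented in Chapter 3 could not handle a large corpus. In this chapter we use an Embedding layer and a sampling technique called Negative Sampling to implement a CBOW model that can also run on a large corpus.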

%sh
# install matplotlib and numpy
apt update && apt install -y python3-pip
pip3 install matplotlib numpy
WARNING: apt does not have a stable CLI interface. Use with caution in scripts.

Hit:1 http://cran.rstudio.com/bin/linux/ubuntu xenial/ InRelease
Hit:2 http://archive.ubuntu.com/ubuntu xenial InRelease
Get:3 http://security.ubuntu.com/ubuntu xenial-security InRelease [107 kB]
Get:4 http://archive.ubuntu.com/ubuntu xenial-updates InRelease [109 kB]
Get:5 http://archive.ubuntu.com/ubuntu xenial-backports InRelease [107 kB]
Fetched 323 kB in 0s (469 kB/s)
Reading package lists...
Building dependency tree...
Reading state information...
66 packages can be upgraded. Run 'apt list --upgradable' to see them.

WARNING: apt does not have a stable CLI interface. Use with caution in scripts.

Reading package lists...
Building dependency tree...
Reading state information...
python3-pip is already the newest version (8.1.1-2ubuntu0.4).
0 upgraded, 0 newly installed, 0 to remove and 66 not upgraded.
Collecting matplotlib
  Downloading https://files.pythonhosted.org/packages/7b/ca/8b55a66b7ce426329ab16419a7eee4eb35b5a3fbe0d002434b339a4a7b09/matplotlib-3.0.0-cp35-cp35m-manylinux1_x86_64.whl (12.8MB)
Collecting numpy
  Downloading https://files.pythonhosted.org/packages/75/22/355e68c80802d6f488223788fbda75c1daab83c3ef609153676c1f17be5f/numpy-1.15.2-cp35-cp35m-manylinux1_x86_64.whl (13.8MB)
Collecting kiwisolver>=1.0.1 (from matplotlib)
  Downloading https://files.pythonhosted.org/packages/7e/31/d6fedd4fb2c94755cd101191e581af30e1650ccce7a35bddb7930fed6574/kiwisolver-1.0.1-cp35-cp35m-manylinux1_x86_64.whl (949kB)
Collecting pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 (from matplotlib)
  Downloading https://files.pythonhosted.org/packages/42/47/e6d51aef3d0393f7d343592d63a73beee2a8d3d69c22b053e252c6cfacd5/pyparsing-2.2.1-py2.py3-none-any.whl (57kB)
Collecting cycler>=0.10 (from matplotlib)
  Using cached https://files.pythonhosted.org/packages/f7/d2/e07d3ebb2bd7af696440ce7e754c59dd546ffe1bbe732c8ab68b9c834e61/cycler-0.10.0-py2.py3-none-any.whl
Collecting python-dateutil>=2.1 (from matplotlib)
  Using cached https://files.pythonhosted.org/packages/cf/f5/af2b09c957ace60dcfac112b669c45c8c97e32f94aa8b56da4c6d1682825/python_dateutil-2.7.3-py2.py3-none-any.whl
Requirement already satisfied (use --upgrade to upgrade): setuptools in /usr/lib/python3/dist-packages (from kiwisolver>=1.0.1->matplotlib)
Collecting six (from cycler>=0.10->matplotlib)
  Downloading https://files.pythonhosted.org/packages/67/4b/141a581104b1f6397bfa78ac9d43d8ad29a7ca43ea90a2d863fe3056e86a/six-1.11.0-py2.py3-none-any.whl
Installing collected packages: kiwisolver, pyparsing, six, cycler, numpy, python-dateutil, matplotlib
Successfully installed cycler-0.10.0 kiwisolver-1.0.1 matplotlib-3.0.0 numpy-1.15.2 pyparsing-2.2.1 python-dateutil-2.7.3 six-1.11.0
You are using pip version 8.1.1, however version 18.0 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.

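Implementing the Embedding layer

  • All we want to do is pick out the rows of the weight matrix that correspond to the word IDs
  • Converting to a one-hot representation and multiplying in a MatMul layer was never necessary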
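
As a quick check of the two points above, the following sketch compares the one-hot route with plain row indexing. The matrix `W` and the word IDs are toy values chosen for illustration, not data from the book.

```python
import numpy as np

# Toy weight matrix: 7 words in the vocabulary, 3-dimensional word vectors.
W = np.arange(21).reshape(7, 3)

word_id = 2

# One-hot route: build a length-7 vector and multiply it with W.
one_hot = np.zeros(7, dtype=W.dtype)
one_hot[word_id] = 1
via_matmul = one_hot @ W

# Embedding route: just index the row.
via_lookup = W[word_id]

print(via_matmul)  # [6 7 8]
print(via_lookup)  # [6 7 8]

# Several IDs at once (a mini-batch) work the same way.
batch_ids = np.array([1, 0, 3, 0])
batch_rows = W[batch_ids]  # shape (4, 3)
```

Both routes give the same row, but the lookup never touches the other rows of `W`, which is what makes it cheap for a large vocabulary.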

%python
import numpy as np

class Embedding:
    def __init__(self, W):
        self.params = [W]
        self.grads = [np.zeros_like(W)]
        self.idx = None

    def forward(self, idx):
        W, = self.params
        self.idx = idx
        out = W[idx]  # pick out the rows for the given word IDs
        return out

    def backward(self, dout):
        dW, = self.grads
        dW[...] = 0
        # add.at accumulates when the same word ID appears more than once in idx
        np.add.at(dW, self.idx, dout)
        return None
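
The backward pass uses `np.add.at` rather than a plain indexed assignment. A minimal sketch of why, assuming a toy batch in which word ID 0 appears twice:

```python
import numpy as np

idx = np.array([0, 2, 0])   # word ID 0 appears twice in the batch
dout = np.ones((3, 2))      # upstream gradient, one row per sample

# Plain assignment: the second write to row 0 overwrites the first.
dW_assign = np.zeros((4, 2))
dW_assign[idx] = dout

# np.add.at: contributions to the same row are accumulated.
dW_add = np.zeros((4, 2))
np.add.at(dW_add, idx, dout)

print(dW_assign[0])  # [1. 1.]
print(dW_add[0])     # [2. 2.]
```

With assignment, one of the two gradients for word 0 would be silently lost.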

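  • Introducing the Embedding layer removes the bottleneck in the input-layer computation
  • Two bottlenecks remain:
    • the product of the hidden-layer neurons and the weight matrix (\(W_{out}\))
    • the computation of the Softmax layer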

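For the softmax function, with a vocabulary of one million words:

$$
y_k = \frac {\exp(s_k)} {\sum_{i=1}^{1000000}{\exp(s_i)}}
$$

The denominator is expensive to compute, because it sums over every word in the vocabulary.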

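  • Negative sampling
    • Approximate the multi-class classification with binary classification
  • The problem we want to solve is choosing the one correct word out of one million words
  • We want to recast this as a binary classification problem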
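
To make the difference in cost concrete, here is a rough sketch under toy assumptions (a vocabulary of 10,000 rather than one million, random weights, and a hypothetical target ID of 42). The multi-class route needs every score; the binary route needs one dot product.

```python
import numpy as np

rng = np.random.default_rng(0)
V, H = 10000, 100            # toy vocabulary size and hidden size
W_out = 0.01 * rng.standard_normal((V, H))
h = rng.standard_normal(H)   # hidden-layer vector for one sample
target = 42                  # hypothetical ID of the correct word

# Multi-class: needs all V scores and a sum over V exponentials.
scores = W_out @ h           # V dot products
scores -= scores.max()       # shift for numerical stability
softmax_prob = np.exp(scores[target]) / np.exp(scores).sum()

# Binary: "is the answer word 42?" needs one dot product and one sigmoid.
score = W_out[target] @ h
sigmoid_prob = 1 / (1 + np.exp(-score))
```

The two probabilities are not equal; the binary form answers a different, much cheaper question, which is the approximation negative sampling relies on.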

%python
class EmbeddingDot:
    def __init__(self, W):
        self.embed = Embedding(W)
        self.params = self.embed.params
        self.grads = self.embed.grads
        self.cache = None
    
    def forward(self, h, idx):
        target_W = self.embed.forward(idx)
        out = np.sum(target_W * h, axis=1)
        
        self.cache = (h, target_W)
        return out
    
    def backward(self, dout):
        h, target_W = self.cache
        dout = dout.reshape(dout.shape[0], 1)
        
        dtarget_W = dout * h
        self.embed.backward(dtarget_W)
        dh = dout * target_W
        return dh
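
The line `np.sum(target_W * h, axis=1)` computes one dot product per sample in the batch. A small sketch with made-up shapes (7 words, 3 dimensions, batch of 4) checks it against an explicit loop:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((7, 3))   # toy output-side weights, one row per word
h = rng.standard_normal((4, 3))   # hidden vectors for a batch of 4
idx = np.array([0, 3, 1, 3])      # one target word ID per sample

target_W = W[idx]                     # shape (4, 3)
out = np.sum(target_W * h, axis=1)    # shape (4,): one score per sample

# The same scores, computed one sample at a time.
expected = np.array([W[idx[i]] @ h[i] for i in range(len(idx))])
```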

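  • In addition to the correct answer (the positive example), sample a few wrong answers (negative examples) and compute their losses as well
  • Negative examples are drawn preferentially from words that occur frequently in the corpus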

%python
import numpy as np

# one value from 0 to 9
print(np.random.choice(10))

# one element from words
words = ['you', 'say', 'goodbye', 'I', 'hello', '.']
print(np.random.choice(words))

# five elements (with replacement)
print(np.random.choice(words, size=5))

# without replacement
print(np.random.choice(words, size=5, replace=False))

# following a probability distribution
p = [0.5, 0.1, 0.05, 0.2, 0.05, 0.1]
print(np.random.choice(words, size=5, p=p))
3
I
['say' 'goodbye' 'you' 'I' 'I']
['goodbye' 'you' 'say' '.' 'hello']
['you' 'I' '.' 'you' 'goodbye']

The negative sampling proposed for word2vec raises the original probability distribution to the power of 0.75:

$$
P'(w_i) = \frac {P(w_i)^{0.75}} {\sum_{j=1}^{n}P(w_j)^{0.75}}
$$

Raising to the power of 0.75 slightly increases the probability of words that occur rarely.
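
A small worked example of this transformation, using an arbitrary three-word distribution chosen only for illustration:

```python
import numpy as np

p = np.array([0.7, 0.29, 0.01])

new_p = np.power(p, 0.75)
new_p /= np.sum(new_p)

print(new_p)  # approximately [0.642 0.332 0.027]
```

The rarest word goes from 1% to roughly 2.7%, while the most frequent word drops from 70% to about 64%.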

CBOW model: implementation

Implement the CBOW model using the Embedding layer and Negative Sampling.

%python
import collections

class UnigramSampler:
    def __init__(self, corpus, power, sample_size):
        self.sample_size = sample_size
        self.vocab_size = None
        self.word_p = None

        counts = collections.Counter()
        for word_id in corpus:
            counts[word_id] += 1

        vocab_size = len(counts)
        self.vocab_size = vocab_size

        self.word_p = np.zeros(vocab_size)
        for i in range(vocab_size):
            self.word_p[i] = counts[i]

        self.word_p = np.power(self.word_p, power)
        self.word_p /= np.sum(self.word_p)

    def get_negative_sample(self, target):
        batch_size = target.shape[0]

        negative_sample = np.zeros((batch_size, self.sample_size), dtype=np.int32)

        for i in range(batch_size):
            p = self.word_p.copy()
            target_idx = target[i]
            p[target_idx] = 0
            p /= p.sum()
            negative_sample[i, :] = np.random.choice(self.vocab_size, size=self.sample_size, replace=False, p=p)

        return negative_sample
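
`get_negative_sample` loops over the batch so that each row can exclude its own target. If that loop becomes a bottleneck, one possible shortcut is to draw the whole batch in a single call. This is only a sketch: `sample_negatives_fast` is a hypothetical helper, it samples with replacement, and it does not exclude the target, so it is an approximation of the sampler above rather than a drop-in equivalent.

```python
import numpy as np

def sample_negatives_fast(word_p, batch_size, sample_size, rng=None):
    """Draw negatives for the whole batch with a single call.

    Unlike the per-row loop, this samples with replacement and does not
    exclude the target, so a row may contain duplicates or the target itself.
    """
    rng = np.random.default_rng() if rng is None else rng
    return rng.choice(len(word_p), size=(batch_size, sample_size),
                      replace=True, p=word_p)

counts = np.array([1, 2, 3, 4, 10], dtype=float)  # toy word counts
word_p = counts ** 0.75
word_p /= word_p.sum()

negatives = sample_negatives_fast(word_p, batch_size=3, sample_size=2,
                                  rng=np.random.default_rng(0))
print(negatives.shape)  # (3, 2)
```

With a large vocabulary the chance of drawing the target by accident is small, which is why this kind of shortcut is usually tolerable.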

%python
def cross_entropy_error(y, t):
    if y.ndim == 1:
        t = t.reshape(1, t.size)
        y = y.reshape(1, y.size)
        
    # if the labels are one-hot vectors, convert them to indices of the correct class
    if t.size == y.size:
        t = t.argmax(axis=1)
             
    batch_size = y.shape[0]

    return -np.sum(np.log(y[np.arange(batch_size), t] + 1e-7)) / batch_size

class SigmoidWithLoss:
    def __init__(self):
        self.params, self.grads = [], []
        self.loss = None
        self.y = None  # output of the sigmoid
        self.t = None  # labels

    def forward(self, x, t):
        self.t = t
        self.y = 1 / (1 + np.exp(-x))

        self.loss = cross_entropy_error(np.c_[1 - self.y, self.y], self.t)

        return self.loss

    def backward(self, dout=1):
        batch_size = self.t.shape[0]

        dx = (self.y - self.t) * dout / batch_size
        return dx
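
The backward pass of `SigmoidWithLoss` reduces to `(y - t) / batch_size`. The sketch below checks that formula against a numerical gradient. It assumes a plain mean binary cross-entropy without the `1e-7` stabiliser used in `cross_entropy_error`, and the inputs are arbitrary toy values.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def loss(x, t):
    # Mean binary cross-entropy over the batch.
    y = sigmoid(x)
    return -np.mean(t * np.log(y) + (1 - t) * np.log(1 - y))

x = np.array([0.5, -1.2, 2.0])
t = np.array([1, 0, 0])

# Analytic gradient, as in SigmoidWithLoss.backward with dout=1.
analytic = (sigmoid(x) - t) / len(x)

# Central-difference numerical gradient.
eps = 1e-5
numeric = np.zeros_like(x)
for i in range(len(x)):
    d = np.zeros_like(x)
    d[i] = eps
    numeric[i] = (loss(x + d, t) - loss(x - d, t)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-7))  # True
```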

%python
class NegativeSamplingLoss:
    def __init__(self, W, corpus, power=0.75, sample_size=5):
        self.sample_size = sample_size
        self.sampler = UnigramSampler(corpus, power, sample_size)
        self.loss_layers = [SigmoidWithLoss() for _ in range(sample_size + 1)]
        self.embed_dot_layers = [EmbeddingDot(W) for _ in range(sample_size + 1)]
        
        self.params, self.grads = [], []
        for layer in self.embed_dot_layers:
            self.params += layer.params
            self.grads += layer.grads
    
    # forward pass
    def forward(self, h, target):
        batch_size = target.shape[0]
        negative_sample = self.sampler.get_negative_sample(target)
        
        # positive example
        score = self.embed_dot_layers[0].forward(h, target)
        correct_label = np.ones(batch_size, dtype=np.int32)
        loss = self.loss_layers[0].forward(score, correct_label)
        
        # negative examples
        negative_label = np.zeros(batch_size, dtype=np.int32)
        for i in range(self.sample_size):
            negative_target = negative_sample[:, i]
            score = self.embed_dot_layers[1 + i].forward(h, negative_target)
            loss += self.loss_layers[1 + i].forward(score, negative_label)
        
        return loss
    
    # backward pass
    def backward(self, dout=1):
        dh = 0
        for l0, l1 in zip(self.loss_layers, self.embed_dot_layers):
            dscore = l0.backward(dout)
            dh += l1.backward(dscore)
        return dh
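
Written as a formula, the loss that `NegativeSamplingLoss.forward` accumulates for a single sample is roughly the following. The notation is my own, not the book's: \(\sigma\) is the sigmoid, \(h\) the hidden vector, \(v_t\) the row of \(W_{out}\) for the target word, \(v_{n_k}\) the row for the \(k\)-th negative sample, and \(K\) is `sample_size`. The small constant added inside the logarithm in the code is ignored here.

$$
L = -\log \sigma(h \cdot v_t) - \sum_{k=1}^{K} \log\bigl(1 - \sigma(h \cdot v_{n_k})\bigr)
$$

For a mini-batch, each of the \(K + 1\) terms is averaged over the batch, because `SigmoidWithLoss` divides by the batch size.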

%python
class CBOW:
    def __init__(self, vocab_size, hidden_size, window_size, corpus):
        V, H = vocab_size, hidden_size
        
        W_in = 0.01 * np.random.randn(V, H).astype('f')
        W_out = 0.01 * np.random.randn(V, H).astype('f')
        
        self.in_layers = []
        for i in range(2 * window_size):
            layer = Embedding(W_in)
            self.in_layers.append(layer)
        self.ns_loss = NegativeSamplingLoss(W_out, corpus, power=0.75, sample_size=5)
        
        layers = self.in_layers + [self.ns_loss]
        self.params, self.grads = [], []
        for layer in layers:
            self.params += layer.params
            self.grads += layer.grads
        
        self.word_vecs = W_in
    
    def forward(self, contexts, target):
        h = 0
        for i, layer in enumerate(self.in_layers):
            h += layer.forward(contexts[:, i])
        h *= 1 / len(self.in_layers)
        loss = self.ns_loss.forward(h, target)
        return loss
    
    def backward(self, dout=1):
        dout = self.ns_loss.backward(dout)
        dout *= 1 / len(self.in_layers)
        for layer in self.in_layers:
            layer.backward(dout)
        return None
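
`CBOW.forward` sums one Embedding lookup per context position and then divides by the number of positions. A sketch with toy weights and a batch of three contexts shows that this is the same as looking up all positions at once and averaging:

```python
import numpy as np

rng = np.random.default_rng(0)
W_in = rng.standard_normal((7, 3))              # toy input-side weights
contexts = np.array([[0, 2], [1, 3], [2, 4]])   # batch of 3, window_size = 1

# Loop form, mirroring CBOW.forward: one lookup per context position.
h_loop = 0
for i in range(contexts.shape[1]):
    h_loop = h_loop + W_in[contexts[:, i]]
h_loop = h_loop / contexts.shape[1]

# Equivalent one-liner: look up all positions, then average over them.
h_mean = W_in[contexts].mean(axis=1)            # shape (3, 3)
```

The layer-per-position form in the class keeps a separate gradient buffer for each position, which the trainer later merges because all layers share `W_in`.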

CBOW model: training

%python
import numpy
import time
import matplotlib.pyplot as plt

def clip_grads(grads, max_norm):
    total_norm = 0
    for grad in grads:
        total_norm += np.sum(grad ** 2)
    total_norm = np.sqrt(total_norm)

    rate = max_norm / (total_norm + 1e-6)
    if rate < 1:
        for grad in grads:
            grad *= rate

def remove_duplicate(params, grads):
    '''
    Merge duplicated (shared) weights in the parameter list into a single entry
    and add up the gradients that belong to that weight
    '''
    params, grads = params[:], grads[:]  # copy list

    while True:
        find_flg = False
        L = len(params)

        for i in range(0, L - 1):
            for j in range(i + 1, L):
                # the weight is shared
                if params[i] is params[j]:
                    grads[i] += grads[j]  # add up the gradients
                    find_flg = True
                    params.pop(j)
                    grads.pop(j)
                # the weight is shared as a transposed matrix (weight tying)
                elif params[i].ndim == 2 and params[j].ndim == 2 and \
                     params[i].T.shape == params[j].shape and np.all(params[i].T == params[j]):
                    grads[i] += grads[j].T
                    find_flg = True
                    params.pop(j)
                    grads.pop(j)

                if find_flg: break
            if find_flg: break

        if not find_flg: break

    return params, grads

class Trainer:
    def __init__(self, model, optimizer):
        self.model = model
        self.optimizer = optimizer
        self.loss_list = []
        self.eval_interval = None
        self.current_epoch = 0

    def fit(self, x, t, max_epoch=10, batch_size=32, max_grad=None, eval_interval=20):
        data_size = len(x)
        max_iters = data_size // batch_size
        self.eval_interval = eval_interval
        model, optimizer = self.model, self.optimizer
        total_loss = 0
        loss_count = 0

        start_time = time.time()
        for epoch in range(max_epoch):
            # shuffle
            idx = numpy.random.permutation(numpy.arange(data_size))
            x = x[idx]
            t = t[idx]

            for iters in range(max_iters):
                batch_x = x[iters*batch_size:(iters+1)*batch_size]
                batch_t = t[iters*batch_size:(iters+1)*batch_size]

                # compute the gradients and update the parameters
                loss = model.forward(batch_x, batch_t)
                model.backward()
                params, grads = remove_duplicate(model.params, model.grads)  # merge shared weights into one
                if max_grad is not None:
                    clip_grads(grads, max_grad)
                optimizer.update(params, grads)
                total_loss += loss
                loss_count += 1

                # evaluation
                if (eval_interval is not None) and (iters % eval_interval) == 0:
                    avg_loss = total_loss / loss_count
                    elapsed_time = time.time() - start_time
                    print('| epoch %d |  iter %d / %d | time %d[s] | loss %.2f'
                          % (self.current_epoch + 1, iters + 1, max_iters, elapsed_time, avg_loss))
                    self.loss_list.append(float(avg_loss))
                    total_loss, loss_count = 0, 0

            self.current_epoch += 1

    def plot(self, ylim=None):
        x = numpy.arange(len(self.loss_list))
        if ylim is not None:
            plt.ylim(*ylim)
        plt.plot(x, self.loss_list, label='train')
        plt.xlabel('iterations (x' + str(self.eval_interval) + ')')
        plt.ylabel('loss')
        plt.show()
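
`Trainer.fit` only calls `clip_grads` when `max_grad` is given, and the training run in this post does not pass it, so clipping is off here. For reference, a minimal sketch of what the helper does, re-implemented compactly with toy gradients whose combined L2 norm is 5:

```python
import numpy as np

def clip_grads(grads, max_norm):
    # Same logic as the notebook's helper: scale all gradients in place
    # when their combined L2 norm exceeds max_norm.
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    rate = max_norm / (total_norm + 1e-6)
    if rate < 1:
        for g in grads:
            g *= rate

grads = [np.array([3.0, 0.0]), np.array([0.0, 4.0])]  # combined norm is 5
clip_grads(grads, max_norm=1.0)

clipped_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
print(round(float(clipped_norm), 3))  # 1.0
```

All gradients are scaled by the same factor, so the direction of the update is preserved and only its length is capped.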

Optimizer: Adam

%python
class Adam:
    '''
    Adam (http://arxiv.org/abs/1412.6980v8)
    '''
    def __init__(self, lr=0.001, beta1=0.9, beta2=0.999):
        self.lr = lr
        self.beta1 = beta1
        self.beta2 = beta2
        self.iter = 0
        self.m = None
        self.v = None
        
    def update(self, params, grads):
        if self.m is None:
            self.m, self.v = [], []
            for param in params:
                self.m.append(np.zeros_like(param))
                self.v.append(np.zeros_like(param))
        
        self.iter += 1
        lr_t = self.lr * np.sqrt(1.0 - self.beta2**self.iter) / (1.0 - self.beta1**self.iter)

        for i in range(len(params)):
            self.m[i] += (1 - self.beta1) * (grads[i] - self.m[i])
            self.v[i] += (1 - self.beta2) * (grads[i]**2 - self.v[i])
            
            params[i] -= lr_t * self.m[i] / (np.sqrt(self.v[i]) + 1e-7)
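
The update in the class above is written in an incremental form. Expanding it shows that it matches the usual exponential moving averages, with the bias correction folded into the step size. The symbols are my own shorthand: \(g_t\) is the gradient at step \(t\), \(\alpha\) is `lr`, \(\theta\) is a parameter, and \(\epsilon\) is the `1e-7` in the code.

$$
\begin{aligned}
m_t &= m_{t-1} + (1-\beta_1)(g_t - m_{t-1}) = \beta_1 m_{t-1} + (1-\beta_1)\, g_t \\
v_t &= v_{t-1} + (1-\beta_2)(g_t^2 - v_{t-1}) = \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2 \\
\alpha_t &= \alpha \, \frac{\sqrt{1-\beta_2^{\,t}}}{1-\beta_1^{\,t}} \\
\theta_t &= \theta_{t-1} - \alpha_t \, \frac{m_t}{\sqrt{v_t} + \epsilon}
\end{aligned}
$$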

%sh
rm -rf deep-learning-from-scratch-2
git clone https://github.com/oreilly-japan/deep-learning-from-scratch-2
ls -lha deep-learning-from-scratch-2/dataset
Cloning into 'deep-learning-from-scratch-2'...
total 2.6M
drwxr-xr-x  2 root root 4.0K Sep 30 03:41 .
drwxr-xr-x 13 root root 4.0K Sep 30 03:41 ..
-rw-r--r--  1 root root 635K Sep 30 03:41 addition.txt
-rw-r--r--  1 root root 2.0M Sep 30 03:41 date.txt
-rw-r--r--  1 root root    0 Sep 30 03:41 __init__.py
-rw-r--r--  1 root root 2.6K Sep 30 03:41 ptb.py
-rw-r--r--  1 root root 1.7K Sep 30 03:41 sequence.py
-rw-r--r--  1 root root  666 Sep 30 03:41 spiral.py
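
The training cell that follows builds (contexts, target) pairs by sliding a window over the corpus. Before running it on PTB, the idea can be checked on a toy corpus. `contexts_and_targets` below is a hypothetical stand-alone helper that follows the same sliding-window logic, and the word IDs stand for the sentence "you say goodbye and i say hello .".

```python
import numpy as np

def contexts_and_targets(corpus, window_size=1):
    # Hypothetical stand-alone helper with the same sliding-window logic.
    targets = corpus[window_size:-window_size]
    contexts = []
    for idx in range(window_size, len(corpus) - window_size):
        left = corpus[idx - window_size:idx]
        right = corpus[idx + 1:idx + 1 + window_size]
        contexts.append(list(left) + list(right))
    return np.array(contexts), np.array(targets)

# "you say goodbye and i say hello ." as word IDs
corpus = np.array([0, 1, 2, 3, 4, 1, 5, 6])
contexts, targets = contexts_and_targets(corpus, window_size=1)

print(targets)         # [1 2 3 4 1 5]
print(contexts.shape)  # (6, 2)
```

Each row of `contexts` holds the `2 * window_size` neighbours of the corresponding target, which is why the CBOW model creates `2 * window_size` Embedding layers.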

%python
import sys
sys.path.append('./deep-learning-from-scratch-2')
from dataset import ptb

def create_contexts_target(corpus, window_size=1):
    '''Create contexts and targets
    :param corpus: corpus (array of word IDs)
    :param window_size: window size (1 means one word on each side of the target)
    :return: NumPy arrays of contexts and targets
    '''
    target = corpus[window_size:-window_size]
    contexts = []

    for idx in range(window_size, len(corpus)-window_size):
        cs = []
        for t in range(-window_size, window_size + 1):
            if t == 0:
                continue
            cs.append(corpus[idx + t])
        contexts.append(cs)

    return np.array(contexts), np.array(target)

# hyperparameters
window_size = 5
hidden_size = 100
batch_size = 100
max_epoch = 10

# load the data
corpus, word_to_id, id_to_word = ptb.load_data('train')
vocab_size = len(word_to_id)

contexts, target = create_contexts_target(corpus, window_size)

model = CBOW(vocab_size, hidden_size, window_size, corpus)
optimizer = Adam()
trainer = Trainer(model, optimizer)

trainer.fit(contexts, target, max_epoch, batch_size)

word_vecs = model.word_vecs

trainer.plot()
z.show(plt, format='svg')
| epoch 1 |  iter 1 / 9295 | time 0[s] | loss 4.16
| epoch 1 |  iter 21 / 9295 | time 1[s] | loss 4.16
| epoch 1 |  iter 41 / 9295 | time 3[s] | loss 4.15
| epoch 1 |  iter 61 / 9295 | time 4[s] | loss 4.13
| epoch 1 |  iter 81 / 9295 | time 6[s] | loss 4.05
| epoch 1 |  iter 101 / 9295 | time 8[s] | loss 3.93
| epoch 1 |  iter 121 / 9295 | time 9[s] | loss 3.78
| epoch 1 |  iter 141 / 9295 | time 11[s] | loss 3.63
| epoch 1 |  iter 161 / 9295 | time 12[s] | loss 3.50
| epoch 1 |  iter 181 / 9295 | time 14[s] | loss 3.37
| epoch 1 |  iter 201 / 9295 | time 16[s] | loss 3.27
| epoch 1 |  iter 221 / 9295 | time 17[s] | loss 3.16
| epoch 1 |  iter 241 / 9295 | time 19[s] | loss 3.08
| epoch 1 |  iter 261 / 9295 | time 20[s] | loss 3.02
| epoch 1 |  iter 281 / 9295 | time 22[s] | loss 2.96
| epoch 1 |  iter 301 / 9295 | time 23[s] | loss 2.93
| epoch 1 |  iter 321 / 9295 | time 25[s] | loss 2.88
| epoch 1 |  iter 341 / 9295 | time 26[s] | loss 2.84
| epoch 1 |  iter 361 / 9295 | time 28[s] | loss 2.81
| epoch 1 |  iter 381 / 9295 | time 30[s] | loss 2.78
| epoch 1 |  iter 401 / 9295 | time 31[s] | loss 2.78
| epoch 1 |  iter 421 / 9295 | time 33[s] | loss 2.75
| epoch 1 |  iter 441 / 9295 | time 34[s] | loss 2.73
| epoch 1 |  iter 461 / 9295 | time 36[s] | loss 2.71
| epoch 1 |  iter 481 / 9295 | time 37[s] | loss 2.69
| epoch 1 |  iter 501 / 9295 | time 39[s] | loss 2.70
| epoch 1 |  iter 521 / 9295 | time 40[s] | loss 2.65
| epoch 1 |  iter 541 / 9295 | time 42[s] | loss 2.66
| epoch 1 |  iter 561 / 9295 | time 44[s] | loss 2.66
| epoch 1 |  iter 581 / 9295 | time 45[s] | loss 2.66
| epoch 1 |  iter 601 / 9295 | time 47[s] | loss 2.65
| epoch 1 |  iter 621 / 9295 | time 48[s] | loss 2.64
| epoch 1 |  iter 641 / 9295 | time 50[s] | loss 2.63
| epoch 1 |  iter 661 / 9295 | time 51[s] | loss 2.65
| epoch 1 |  iter 681 / 9295 | time 53[s] | loss 2.64
| epoch 1 |  iter 701 / 9295 | time 55[s] | loss 2.60
| epoch 1 |  iter 721 / 9295 | time 56[s] | loss 2.62
| epoch 1 |  iter 741 / 9295 | time 58[s] | loss 2.61
| epoch 1 |  iter 761 / 9295 | time 59[s] | loss 2.59
| epoch 1 |  iter 781 / 9295 | time 61[s] | loss 2.59
| epoch 1 |  iter 801 / 9295 | time 62[s] | loss 2.61
| epoch 1 |  iter 821 / 9295 | time 64[s] | loss 2.58
| epoch 1 |  iter 841 / 9295 | time 66[s] | loss 2.58
| epoch 1 |  iter 861 / 9295 | time 67[s] | loss 2.55
| epoch 1 |  iter 881 / 9295 | time 69[s] | loss 2.57
| epoch 1 |  iter 901 / 9295 | time 70[s] | loss 2.55
| epoch 1 |  iter 921 / 9295 | time 72[s] | loss 2.58
| epoch 1 |  iter 941 / 9295 | time 74[s] | loss 2.60
| epoch 1 |  iter 961 / 9295 | time 75[s] | loss 2.56
| epoch 1 |  iter 981 / 9295 | time 77[s] | loss 2.57
| epoch 1 |  iter 1001 / 9295 | time 78[s] | loss 2.55
| epoch 1 |  iter 1021 / 9295 | time 80[s] | loss 2.57
| epoch 1 |  iter 1041 / 9295 | time 81[s] | loss 2.58
| epoch 1 |  iter 1061 / 9295 | time 83[s] | loss 2.55
| epoch 1 |  iter 1081 / 9295 | time 85[s] | loss 2.54
| epoch 1 |  iter 1101 / 9295 | time 86[s] | loss 2.56
| epoch 1 |  iter 1121 / 9295 | time 88[s] | loss 2.54
| epoch 1 |  iter 1141 / 9295 | time 89[s] | loss 2.54
| epoch 1 |  iter 1161 / 9295 | time 91[s] | loss 2.55
| epoch 1 |  iter 1181 / 9295 | time 93[s] | loss 2.53
| epoch 1 |  iter 1201 / 9295 | time 94[s] | loss 2.54
| epoch 1 |  iter 1221 / 9295 | time 96[s] | loss 2.54
| epoch 1 |  iter 1241 / 9295 | time 97[s] | loss 2.52
| epoch 1 |  iter 1261 / 9295 | time 99[s] | loss 2.52
| epoch 1 |  iter 1281 / 9295 | time 100[s] | loss 2.54
| epoch 1 |  iter 1301 / 9295 | time 102[s] | loss 2.52
| epoch 1 |  iter 1321 / 9295 | time 104[s] | loss 2.54
| epoch 1 |  iter 1341 / 9295 | time 105[s] | loss 2.52
| epoch 1 |  iter 1361 / 9295 | time 107[s] | loss 2.54
| epoch 1 |  iter 1381 / 9295 | time 108[s] | loss 2.55
| epoch 1 |  iter 1401 / 9295 | time 110[s] | loss 2.49
| epoch 1 |  iter 1421 / 9295 | time 111[s] | loss 2.50
| epoch 1 |  iter 1441 / 9295 | time 113[s] | loss 2.55
| epoch 1 |  iter 1461 / 9295 | time 115[s] | loss 2.50
| epoch 1 |  iter 1481 / 9295 | time 116[s] | loss 2.51
| epoch 1 |  iter 1501 / 9295 | time 118[s] | loss 2.50
| epoch 1 |  iter 1521 / 9295 | time 119[s] | loss 2.51
| epoch 1 |  iter 1541 / 9295 | time 121[s] | loss 2.52
| epoch 1 |  iter 1561 / 9295 | time 123[s] | loss 2.52
| epoch 1 |  iter 1581 / 9295 | time 124[s] | loss 2.53
| epoch 1 |  iter 1601 / 9295 | time 126[s] | loss 2.52
| epoch 1 |  iter 1621 / 9295 | time 127[s] | loss 2.52
| epoch 1 |  iter 1641 / 9295 | time 129[s] | loss 2.52
| epoch 1 |  iter 1661 / 9295 | time 131[s] | loss 2.48
| epoch 1 |  iter 1681 / 9295 | time 132[s] | loss 2.49
| epoch 1 |  iter 1701 / 9295 | time 134[s] | loss 2.47
| epoch 1 |  iter 1721 / 9295 | time 135[s] | loss 2.50
| epoch 1 |  iter 1741 / 9295 | time 137[s] | loss 2.51
| epoch 1 |  iter 1761 / 9295 | time 139[s] | loss 2.51
| epoch 1 |  iter 1781 / 9295 | time 140[s] | loss 2.47
| epoch 1 |  iter 1801 / 9295 | time 142[s] | loss 2.51
| epoch 1 |  iter 1821 / 9295 | time 143[s] | loss 2.50
| epoch 1 |  iter 1841 / 9295 | time 145[s] | loss 2.51
| epoch 1 |  iter 1861 / 9295 | time 147[s] | loss 2.47
| epoch 1 |  iter 1881 / 9295 | time 148[s] | loss 2.52
| epoch 1 |  iter 1901 / 9295 | time 150[s] | loss 2.49
| epoch 1 |  iter 1921 / 9295 | time 151[s] | loss 2.49
| epoch 1 |  iter 1941 / 9295 | time 153[s] | loss 2.51
| epoch 1 |  iter 1961 / 9295 | time 155[s] | loss 2.50
| epoch 1 |  iter 1981 / 9295 | time 156[s] | loss 2.47
| epoch 1 |  iter 2001 / 9295 | time 158[s] | loss 2.50
| epoch 1 |  iter 2021 / 9295 | time 159[s] | loss 2.47
| epoch 1 |  iter 2041 / 9295 | time 161[s] | loss 2.49
| epoch 1 |  iter 2061 / 9295 | time 163[s] | loss 2.50
| epoch 1 |  iter 2081 / 9295 | time 164[s] | loss 2.49
| epoch 1 |  iter 2101 / 9295 | time 166[s] | loss 2.48
| epoch 1 |  iter 2121 / 9295 | time 167[s] | loss 2.47
| epoch 1 |  iter 2141 / 9295 | time 169[s] | loss 2.49
| epoch 1 |  iter 2161 / 9295 | time 171[s] | loss 2.47
| epoch 1 |  iter 2181 / 9295 | time 172[s] | loss 2.46
| epoch 1 |  iter 2201 / 9295 | time 174[s] | loss 2.48
| epoch 1 |  iter 2221 / 9295 | time 175[s] | loss 2.49
| epoch 1 |  iter 2241 / 9295 | time 177[s] | loss 2.47
| epoch 1 |  iter 2261 / 9295 | time 179[s] | loss 2.48
| epoch 1 |  iter 2281 / 9295 | time 180[s] | loss 2.48
| epoch 1 |  iter 2301 / 9295 | time 182[s] | loss 2.49
| epoch 1 |  iter 2321 / 9295 | time 183[s] | loss 2.47
| epoch 1 |  iter 2341 / 9295 | time 185[s] | loss 2.50
| epoch 1 |  iter 2361 / 9295 | time 187[s] | loss 2.49
| epoch 1 |  iter 2381 / 9295 | time 188[s] | loss 2.47
| epoch 1 |  iter 2401 / 9295 | time 190[s] | loss 2.48
| epoch 1 |  iter 2421 / 9295 | time 191[s] | loss 2.49
| epoch 1 |  iter 2441 / 9295 | time 193[s] | loss 2.45
| epoch 1 |  iter 2461 / 9295 | time 194[s] | loss 2.47
| epoch 1 |  iter 2481 / 9295 | time 196[s] | loss 2.49
| epoch 1 |  iter 2501 / 9295 | time 198[s] | loss 2.45
| epoch 1 |  iter 2521 / 9295 | time 199[s] | loss 2.48
| epoch 1 |  iter 2541 / 9295 | time 201[s] | loss 2.47
| epoch 1 |  iter 2561 / 9295 | time 202[s] | loss 2.45
| epoch 1 |  iter 2581 / 9295 | time 204[s] | loss 2.49
| epoch 1 |  iter 2601 / 9295 | time 206[s] | loss 2.47
| epoch 1 |  iter 2621 / 9295 | time 207[s] | loss 2.46
| epoch 1 |  iter 2641 / 9295 | time 209[s] | loss 2.46
| epoch 1 |  iter 2661 / 9295 | time 210[s] | loss 2.49
| epoch 1 |  iter 2681 / 9295 | time 212[s] | loss 2.43
| epoch 1 |  iter 2701 / 9295 | time 213[s] | loss 2.43
| epoch 1 |  iter 2721 / 9295 | time 215[s] | loss 2.48
| epoch 1 |  iter 2741 / 9295 | time 217[s] | loss 2.44
| epoch 1 |  iter 2761 / 9295 | time 218[s] | loss 2.47
| epoch 1 |  iter 2781 / 9295 | time 220[s] | loss 2.43
| epoch 1 |  iter 2801 / 9295 | time 221[s] | loss 2.47
| epoch 1 |  iter 2821 / 9295 | time 223[s] | loss 2.44
| epoch 1 |  iter 2841 / 9295 | time 225[s] | loss 2.43
| epoch 1 |  iter 2861 / 9295 | time 226[s] | loss 2.44
| epoch 1 |  iter 2881 / 9295 | time 228[s] | loss 2.50
| epoch 1 |  iter 2901 / 9295 | time 229[s] | loss 2.46
| epoch 1 |  iter 2921 / 9295 | time 231[s] | loss 2.44
| epoch 1 |  iter 2941 / 9295 | time 233[s] | loss 2.46
| epoch 1 |  iter 2961 / 9295 | time 234[s] | loss 2.48
| epoch 1 |  iter 2981 / 9295 | time 236[s] | loss 2.43
| epoch 1 |  iter 3001 / 9295 | time 237[s] | loss 2.47
| epoch 1 |  iter 3021 / 9295 | time 239[s] | loss 2.43
| epoch 1 |  iter 3041 / 9295 | time 241[s] | loss 2.43
| epoch 1 |  iter 3061 / 9295 | time 242[s] | loss 2.45
| epoch 1 |  iter 3081 / 9295 | time 244[s] | loss 2.47
| epoch 1 |  iter 3101 / 9295 | time 245[s] | loss 2.46
| epoch 1 |  iter 3121 / 9295 | time 247[s] | loss 2.45
| epoch 1 |  iter 3141 / 9295 | time 249[s] | loss 2.43
| epoch 1 |  iter 3161 / 9295 | time 250[s] | loss 2.44
| epoch 1 |  iter 3181 / 9295 | time 252[s] | loss 2.42
| epoch 1 |  iter 3201 / 9295 | time 253[s] | loss 2.44
| epoch 1 |  iter 3221 / 9295 | time 255[s] | loss 2.44
| epoch 1 |  iter 3241 / 9295 | time 256[s] | loss 2.45
| epoch 1 |  iter 3261 / 9295 | time 258[s] | loss 2.44
| epoch 1 |  iter 3281 / 9295 | time 260[s] | loss 2.43
| epoch 1 |  iter 3301 / 9295 | time 261[s] | loss 2.41
| epoch 1 |  iter 3321 / 9295 | time 263[s] | loss 2.43
| epoch 1 |  iter 3341 / 9295 | time 265[s] | loss 2.45
| epoch 1 |  iter 3361 / 9295 | time 266[s] | loss 2.41
| epoch 1 |  iter 3381 / 9295 | time 268[s] | loss 2.42
| epoch 1 |  iter 3401 / 9295 | time 269[s] | loss 2.44
| epoch 1 |  iter 3421 / 9295 | time 271[s] | loss 2.41
| epoch 1 |  iter 3441 / 9295 | time 272[s] | loss 2.45
| epoch 1 |  iter 3461 / 9295 | time 274[s] | loss 2.43
| epoch 1 |  iter 3481 / 9295 | time 275[s] | loss 2.44
| epoch 1 |  iter 3501 / 9295 | time 277[s] | loss 2.44
| epoch 1 |  iter 3521 / 9295 | time 279[s] | loss 2.46
| epoch 1 |  iter 3541 / 9295 | time 280[s] | loss 2.40
| epoch 1 |  iter 3561 / 9295 | time 282[s] | loss 2.44
| epoch 1 |  iter 3581 / 9295 | time 283[s] | loss 2.43
| epoch 1 |  iter 3601 / 9295 | time 285[s] | loss 2.42
| epoch 1 |  iter 3621 / 9295 | time 286[s] | loss 2.43
| epoch 1 |  iter 3641 / 9295 | time 288[s] | loss 2.41
| epoch 1 |  iter 3661 / 9295 | time 290[s] | loss 2.42
| epoch 1 |  iter 3681 / 9295 | time 291[s] | loss 2.42
| epoch 1 |  iter 3701 / 9295 | time 293[s] | loss 2.42
| epoch 1 |  iter 3721 / 9295 | time 294[s] | loss 2.42
| epoch 1 |  iter 3741 / 9295 | time 296[s] | loss 2.42
| epoch 1 |  iter 3761 / 9295 | time 297[s] | loss 2.42
| epoch 1 |  iter 3781 / 9295 | time 299[s] | loss 2.42
| epoch 1 |  iter 3801 / 9295 | time 301[s] | loss 2.39
| epoch 1 |  iter 3821 / 9295 | time 302[s] | loss 2.45
| epoch 1 |  iter 3841 / 9295 | time 304[s] | loss 2.40
| epoch 1 |  iter 3861 / 9295 | time 305[s] | loss 2.38
| epoch 1 |  iter 3881 / 9295 | time 307[s] | loss 2.40
| epoch 1 |  iter 3901 / 9295 | time 308[s] | loss 2.39
| epoch 1 |  iter 3921 / 9295 | time 310[s] | loss 2.37
| epoch 1 |  iter 3941 / 9295 | time 311[s] | loss 2.42
| epoch 1 |  iter 3961 / 9295 | time 313[s] | loss 2.38
| epoch 1 |  iter 3981 / 9295 | time 315[s] | loss 2.43
| epoch 1 |  iter 4001 / 9295 | time 316[s] | loss 2.39
| epoch 1 |  iter 4021 / 9295 | time 318[s] | loss 2.40
| epoch 1 |  iter 4041 / 9295 | time 319[s] | loss 2.38
| epoch 1 |  iter 4061 / 9295 | time 321[s] | loss 2.40
| epoch 1 |  iter 4081 / 9295 | time 323[s] | loss 2.43
| epoch 1 |  iter 4101 / 9295 | time 324[s] | loss 2.37
| epoch 1 |  iter 4121 / 9295 | time 326[s] | loss 2.41
| epoch 1 |  iter 4141 / 9295 | time 327[s] | loss 2.40
| epoch 1 |  iter 4161 / 9295 | time 329[s] | loss 2.42
| epoch 1 |  iter 4181 / 9295 | time 330[s] | loss 2.38
| epoch 1 |  iter 4201 / 9295 | time 332[s] | loss 2.39
| epoch 1 |  iter 4221 / 9295 | time 334[s] | loss 2.40
| epoch 1 |  iter 4241 / 9295 | time 335[s] | loss 2.38
| epoch 1 |  iter 4261 / 9295 | time 337[s] | loss 2.39
| epoch 1 |  iter 4281 / 9295 | time 338[s] | loss 2.38
| epoch 1 |  iter 4301 / 9295 | time 340[s] | loss 2.40
| epoch 1 |  iter 4321 / 9295 | time 341[s] | loss 2.38
| epoch 1 |  iter 4341 / 9295 | time 343[s] | loss 2.36
| epoch 1 |  iter 4361 / 9295 | time 345[s] | loss 2.39
| epoch 1 |  iter 4381 / 9295 | time 346[s] | loss 2.39
| epoch 1 |  iter 4401 / 9295 | time 348[s] | loss 2.42
| epoch 1 |  iter 4421 / 9295 | time 349[s] | loss 2.40
| epoch 1 |  iter 4441 / 9295 | time 351[s] | loss 2.38
| epoch 1 |  iter 4461 / 9295 | time 353[s] | loss 2.38
| epoch 1 |  iter 4481 / 9295 | time 354[s] | loss 2.39
| epoch 1 |  iter 4501 / 9295 | time 356[s] | loss 2.39
| epoch 1 |  iter 4521 / 9295 | time 357[s] | loss 2.39
| epoch 1 |  iter 4541 / 9295 | time 359[s] | loss 2.37
| epoch 1 |  iter 4561 / 9295 | time 361[s] | loss 2.39
| epoch 1 |  iter 4581 / 9295 | time 362[s] | loss 2.36
| epoch 1 |  iter 4601 / 9295 | time 364[s] | loss 2.38
| epoch 1 |  iter 4621 / 9295 | time 365[s] | loss 2.38
| epoch 1 |  iter 4641 / 9295 | time 367[s] | loss 2.35
| epoch 1 |  iter 4661 / 9295 | time 369[s] | loss 2.39
| epoch 1 |  iter 4681 / 9295 | time 370[s] | loss 2.37
| epoch 1 |  iter 4701 / 9295 | time 372[s] | loss 2.37
| epoch 1 |  iter 4721 / 9295 | time 373[s] | loss 2.37
| epoch 1 |  iter 4741 / 9295 | time 375[s] | loss 2.35
| epoch 1 |  iter 4761 / 9295 | time 377[s] | loss 2.38
| epoch 1 |  iter 4781 / 9295 | time 378[s] | loss 2.39
| epoch 1 |  iter 4801 / 9295 | time 380[s] | loss 2.38
| epoch 1 |  iter 4821 / 9295 | time 381[s] | loss 2.36
| epoch 1 |  iter 4841 / 9295 | time 383[s] | loss 2.37
| epoch 1 |  iter 4861 / 9295 | time 384[s] | loss 2.37
| epoch 1 |  iter 4881 / 9295 | time 386[s] | loss 2.34
| epoch 1 |  iter 4901 / 9295 | time 388[s] | loss 2.33
| epoch 1 |  iter 4921 / 9295 | time 389[s] | loss 2.36
| epoch 1 |  iter 4941 / 9295 | time 391[s] | loss 2.37
| epoch 1 |  iter 4961 / 9295 | time 392[s] | loss 2.33
| epoch 1 |  iter 4981 / 9295 | time 394[s] | loss 2.35
| epoch 1 |  iter 5001 / 9295 | time 395[s] | loss 2.36
| epoch 1 |  iter 5021 / 9295 | time 397[s] | loss 2.38
| epoch 1 |  iter 5041 / 9295 | time 399[s] | loss 2.36
| epoch 1 |  iter 5061 / 9295 | time 400[s] | loss 2.33
| epoch 1 |  iter 5081 / 9295 | time 402[s] | loss 2.35
| epoch 1 |  iter 5101 / 9295 | time 403[s] | loss 2.38
| epoch 1 |  iter 5121 / 9295 | time 405[s] | loss 2.31
| epoch 1 |  iter 5141 / 9295 | time 407[s] | loss 2.36
| epoch 1 |  iter 5161 / 9295 | time 408[s] | loss 2.37
| epoch 1 |  iter 5181 / 9295 | time 410[s] | loss 2.35
| epoch 1 |  iter 5201 / 9295 | time 411[s] | loss 2.33
| epoch 1 |  iter 5221 / 9295 | time 413[s] | loss 2.34
| epoch 1 |  iter 5241 / 9295 | time 414[s] | loss 2.35
| epoch 1 |  iter 5261 / 9295 | time 416[s] | loss 2.36
| epoch 1 |  iter 5281 / 9295 | time 418[s] | loss 2.34
| epoch 1 |  iter 5301 / 9295 | time 419[s] | loss 2.36
| epoch 1 |  iter 5321 / 9295 | time 421[s] | loss 2.34
| epoch 1 |  iter 5341 / 9295 | time 422[s] | loss 2.32
| epoch 1 |  iter 5361 / 9295 | time 424[s] | loss 2.33
| epoch 1 |  iter 5381 / 9295 | time 425[s] | loss 2.34
| epoch 1 |  iter 5401 / 9295 | time 427[s] | loss 2.34
| epoch 1 |  iter 5421 / 9295 | time 428[s] | loss 2.33
| epoch 1 |  iter 5441 / 9295 | time 430[s] | loss 2.33
| epoch 1 |  iter 5461 / 9295 | time 432[s] | loss 2.33
| epoch 1 |  iter 5481 / 9295 | time 433[s] | loss 2.37
| epoch 1 |  iter 5501 / 9295 | time 435[s] | loss 2.34
| epoch 1 |  iter 5521 / 9295 | time 436[s] | loss 2.36
| epoch 1 |  iter 5541 / 9295 | time 438[s] | loss 2.34
| epoch 1 |  iter 5561 / 9295 | time 439[s] | loss 2.33
| epoch 1 |  iter 5581 / 9295 | time 441[s] | loss 2.31
| epoch 1 |  iter 5601 / 9295 | time 442[s] | loss 2.32
| epoch 1 |  iter 5621 / 9295 | time 444[s] | loss 2.31
| epoch 1 |  iter 5641 / 9295 | time 446[s] | loss 2.32
| epoch 1 |  iter 5661 / 9295 | time 447[s] | loss 2.35
| epoch 1 |  iter 5681 / 9295 | time 449[s] | loss 2.35
| epoch 1 |  iter 5701 / 9295 | time 450[s] | loss 2.31
| epoch 1 |  iter 5721 / 9295 | time 452[s] | loss 2.36
| epoch 1 |  iter 5741 / 9295 | time 453[s] | loss 2.33
| epoch 1 |  iter 5761 / 9295 | time 455[s] | loss 2.31
| epoch 1 |  iter 5781 / 9295 | time 456[s] | loss 2.33
| epoch 1 |  iter 5801 / 9295 | time 458[s] | loss 2.34
| epoch 1 |  iter 5821 / 9295 | time 460[s] | loss 2.34
| epoch 1 |  iter 5841 / 9295 | time 461[s] | loss 2.32
| epoch 1 |  iter 5861 / 9295 | time 463[s] | loss 2.36
| epoch 1 |  iter 5881 / 9295 | time 464[s] | loss 2.35
| epoch 1 |  iter 5901 / 9295 | time 466[s] | loss 2.37
| epoch 1 |  iter 5921 / 9295 | time 467[s] | loss 2.34
| epoch 1 |  iter 5941 / 9295 | time 469[s] | loss 2.30
| epoch 1 |  iter 5961 / 9295 | time 471[s] | loss 2.32
| epoch 1 |  iter 5981 / 9295 | time 472[s] | loss 2.31
| epoch 1 |  iter 6001 / 9295 | time 474[s] | loss 2.30
| epoch 1 |  iter 6021 / 9295 | time 475[s] | loss 2.32
| epoch 1 |  iter 6041 / 9295 | time 477[s] | loss 2.33
| epoch 1 |  iter 6061 / 9295 | time 478[s] | loss 2.29
| epoch 1 |  iter 6081 / 9295 | time 480[s] | loss 2.33
| epoch 1 |  iter 6101 / 9295 | time 482[s] | loss 2.31
| epoch 1 |  iter 6121 / 9295 | time 483[s] | loss 2.33
| epoch 1 |  iter 6141 / 9295 | time 485[s] | loss 2.31
| epoch 1 |  iter 6161 / 9295 | time 486[s] | loss 2.29
| epoch 1 |  iter 6181 / 9295 | time 488[s] | loss 2.32
| epoch 1 |  iter 6201 / 9295 | time 489[s] | loss 2.31
| epoch 1 |  iter 6221 / 9295 | time 491[s] | loss 2.33
| epoch 1 |  iter 6241 / 9295 | time 492[s] | loss 2.31
| epoch 1 |  iter 6261 / 9295 | time 494[s] | loss 2.32
| epoch 1 |  iter 6281 / 9295 | time 496[s] | loss 2.26
| epoch 1 |  iter 6301 / 9295 | time 497[s] | loss 2.32
| epoch 1 |  iter 6321 / 9295 | time 499[s] | loss 2.29
| epoch 1 |  iter 6341 / 9295 | time 500[s] | loss 2.32
| epoch 1 |  iter 6361 / 9295 | time 502[s] | loss 2.31
| epoch 1 |  iter 6381 / 9295 | time 503[s] | loss 2.32
| epoch 1 |  iter 6401 / 9295 | time 505[s] | loss 2.28
| epoch 1 |  iter 6421 / 9295 | time 506[s] | loss 2.32
| epoch 1 |  iter 6441 / 9295 | time 508[s] | loss 2.31
| epoch 1 |  iter 6461 / 9295 | time 510[s] | loss 2.28
| epoch 1 |  iter 6481 / 9295 | time 511[s] | loss 2.31
| epoch 1 |  iter 6501 / 9295 | time 513[s] | loss 2.35
| epoch 1 |  iter 6521 / 9295 | time 514[s] | loss 2.30
| epoch 1 |  iter 6541 / 9295 | time 516[s] | loss 2.32
| epoch 1 |  iter 6561 / 9295 | time 517[s] | loss 2.31
| epoch 1 |  iter 6581 / 9295 | time 519[s] | loss 2.31
| epoch 1 |  iter 6601 / 9295 | time 520[s] | loss 2.28
| epoch 1 |  iter 6621 / 9295 | time 522[s] | loss 2.29
| epoch 1 |  iter 6641 / 9295 | time 523[s] | loss 2.30
| epoch 1 |  iter 6661 / 9295 | time 525[s] | loss 2.30
| epoch 1 |  iter 6681 / 9295 | time 527[s] | loss 2.30
| epoch 1 |  iter 6701 / 9295 | time 528[s] | loss 2.30
| epoch 1 |  iter 6721 / 9295 | time 530[s] | loss 2.29
| epoch 1 |  iter 6741 / 9295 | time 531[s] | loss 2.29
| epoch 1 |  iter 6761 / 9295 | time 533[s] | loss 2.30
| epoch 1 |  iter 6781 / 9295 | time 534[s] | loss 2.31
| epoch 1 |  iter 6801 / 9295 | time 536[s] | loss 2.28
| epoch 1 |  iter 6821 / 9295 | time 538[s] | loss 2.27
| epoch 1 |  iter 6841 / 9295 | time 539[s] | loss 2.30
| epoch 1 |  iter 6861 / 9295 | time 541[s] | loss 2.32
| epoch 1 |  iter 6881 / 9295 | time 542[s] | loss 2.31
| epoch 1 |  iter 6901 / 9295 | time 544[s] | loss 2.29
| epoch 1 |  iter 6921 / 9295 | time 545[s] | loss 2.27
| epoch 1 |  iter 6941 / 9295 | time 547[s] | loss 2.28
| epoch 1 |  iter 6961 / 9295 | time 549[s] | loss 2.23
| epoch 1 |  iter 6981 / 9295 | time 550[s] | loss 2.27
| epoch 1 |  iter 7001 / 9295 | time 552[s] | loss 2.29
| epoch 1 |  iter 7021 / 9295 | time 553[s] | loss 2.28
| epoch 1 |  iter 7041 / 9295 | time 555[s] | loss 2.27
| epoch 1 |  iter 7061 / 9295 | time 556[s] | loss 2.29
| epoch 1 |  iter 7081 / 9295 | time 558[s] | loss 2.27
| epoch 1 |  iter 7101 / 9295 | time 560[s] | loss 2.28
| epoch 1 |  iter 7121 / 9295 | time 561[s] | loss 2.29
| epoch 1 |  iter 7141 / 9295 | time 563[s] | loss 2.25
| epoch 1 |  iter 7161 / 9295 | time 564[s] | loss 2.27
| epoch 1 |  iter 7181 / 9295 | time 566[s] | loss 2.29
| epoch 1 |  iter 7201 / 9295 | time 567[s] | loss 2.30
| epoch 1 |  iter 7221 / 9295 | time 569[s] | loss 2.28
| epoch 1 |  iter 7241 / 9295 | time 571[s] | loss 2.28
| epoch 1 |  iter 7261 / 9295 | time 572[s] | loss 2.30
| epoch 1 |  iter 7281 / 9295 | time 574[s] | loss 2.28
| epoch 1 |  iter 7301 / 9295 | time 575[s] | loss 2.27
| epoch 1 |  iter 7321 / 9295 | time 577[s] | loss 2.30
| epoch 1 |  iter 7341 / 9295 | time 578[s] | loss 2.28
| epoch 1 |  iter 7361 / 9295 | time 580[s] | loss 2.29
| epoch 1 |  iter 7381 / 9295 | time 581[s] | loss 2.28
| epoch 1 |  iter 7401 / 9295 | time 583[s] | loss 2.23
| epoch 1 |  iter 7421 / 9295 | time 584[s] | loss 2.26
| epoch 1 |  iter 7441 / 9295 | time 586[s] | loss 2.28
| epoch 1 |  iter 7461 / 9295 | time 588[s] | loss 2.27
| epoch 1 |  iter 7481 / 9295 | time 589[s] | loss 2.29
| epoch 1 |  iter 7501 / 9295 | time 591[s] | loss 2.25
| epoch 1 |  iter 7521 / 9295 | time 592[s] | loss 2.26
| epoch 1 |  iter 7541 / 9295 | time 594[s] | loss 2.27
| epoch 1 |  iter 7561 / 9295 | time 595[s] | loss 2.25
| epoch 1 |  iter 7581 / 9295 | time 597[s] | loss 2.29
| epoch 1 |  iter 7601 / 9295 | time 598[s] | loss 2.30
| epoch 1 |  iter 7621 / 9295 | time 600[s] | loss 2.28
| epoch 1 |  iter 7641 / 9295 | time 602[s] | loss 2.29
| epoch 1 |  iter 7661 / 9295 | time 603[s] | loss 2.26
| epoch 1 |  iter 7681 / 9295 | time 605[s] | loss 2.29
| epoch 1 |  iter 7701 / 9295 | time 606[s] | loss 2.29
| epoch 1 |  iter 7721 / 9295 | time 608[s] | loss 2.24
| epoch 1 |  iter 7741 / 9295 | time 609[s] | loss 2.25
| epoch 1 |  iter 7761 / 9295 | time 611[s] | loss 2.26
| epoch 1 |  iter 7781 / 9295 | time 613[s] | loss 2.27
| epoch 1 |  iter 7801 / 9295 | time 614[s] | loss 2.27
| epoch 1 |  iter 7821 / 9295 | time 616[s] | loss 2.25
| epoch 1 |  iter 7841 / 9295 | time 617[s] | loss 2.30
| epoch 1 |  iter 7861 / 9295 | time 619[s] | loss 2.27
| epoch 1 |  iter 7881 / 9295 | time 621[s] | loss 2.26
| epoch 1 |  iter 7901 / 9295 | time 622[s] | loss 2.27
| epoch 1 |  iter 7921 / 9295 | time 624[s] | loss 2.27
| epoch 1 |  iter 7941 / 9295 | time 625[s] | loss 2.26
| epoch 1 |  iter 7961 / 9295 | time 627[s] | loss 2.25
| epoch 1 |  iter 7981 / 9295 | time 628[s] | loss 2.27
| epoch 1 |  iter 8001 / 9295 | time 630[s] | loss 2.27
| epoch 1 |  iter 8021 / 9295 | time 632[s] | loss 2.28
| epoch 1 |  iter 8041 / 9295 | time 633[s] | loss 2.27
| epoch 1 |  iter 8061 / 9295 | time 635[s] | loss 2.25
| epoch 1 |  iter 8081 / 9295 | time 636[s] | loss 2.29
| epoch 1 |  iter 8101 / 9295 | time 638[s] | loss 2.26
| epoch 1 |  iter 8121 / 9295 | time 640[s] | loss 2.25
| epoch 1 |  iter 8141 / 9295 | time 641[s] | loss 2.24
| epoch 1 |  iter 8161 / 9295 | time 643[s] | loss 2.24
| epoch 1 |  iter 8181 / 9295 | time 644[s] | loss 2.24
| epoch 1 |  iter 8201 / 9295 | time 646[s] | loss 2.24
| epoch 1 |  iter 8221 / 9295 | time 647[s] | loss 2.27
| epoch 1 |  iter 8241 / 9295 | time 649[s] | loss 2.25
| epoch 1 |  iter 8261 / 9295 | time 651[s] | loss 2.27
| epoch 1 |  iter 8281 / 9295 | time 652[s] | loss 2.24
| epoch 1 |  iter 8301 / 9295 | time 654[s] | loss 2.28
| epoch 1 |  iter 8321 / 9295 | time 655[s] | loss 2.25
| epoch 1 |  iter 8341 / 9295 | time 657[s] | loss 2.26
| epoch 1 |  iter 8361 / 9295 | time 658[s] | loss 2.26
| epoch 1 |  iter 8381 / 9295 | time 660[s] | loss 2.26
| epoch 1 |  iter 8401 / 9295 | time 661[s] | loss 2.25
| epoch 1 |  iter 8421 / 9295 | time 663[s] | loss 2.24
| epoch 1 |  iter 8441 / 9295 | time 664[s] | loss 2.25
| epoch 1 |  iter 8461 / 9295 | time 666[s] | loss 2.23
| epoch 1 |  iter 8481 / 9295 | time 667[s] | loss 2.29
| epoch 1 |  iter 8501 / 9295 | time 669[s] | loss 2.23
| epoch 1 |  iter 8521 / 9295 | time 671[s] | loss 2.22
| epoch 1 |  iter 8541 / 9295 | time 672[s] | loss 2.27
| epoch 1 |  iter 8561 / 9295 | time 674[s] | loss 2.23
| epoch 1 |  iter 8581 / 9295 | time 675[s] | loss 2.20
| epoch 1 |  iter 8601 / 9295 | time 677[s] | loss 2.25
| epoch 1 |  iter 8621 / 9295 | time 679[s] | loss 2.23
| epoch 1 |  iter 8641 / 9295 | time 680[s] | loss 2.23
| epoch 1 |  iter 8661 / 9295 | time 682[s] | loss 2.24
| epoch 1 |  iter 8681 / 9295 | time 683[s] | loss 2.25
| epoch 1 |  iter 8701 / 9295 | time 685[s] | loss 2.24
| epoch 1 |  iter 8721 / 9295 | time 686[s] | loss 2.25
| epoch 1 |  iter 8741 / 9295 | time 688[s] | loss 2.25
| epoch 1 |  iter 8761 / 9295 | time 690[s] | loss 2.27
| epoch 1 |  iter 8781 / 9295 | time 691[s] | loss 2.23
| epoch 1 |  iter 8801 / 9295 | time 693[s] | loss 2.23
| epoch 1 |  iter 8821 / 9295 | time 694[s] | loss 2.22
| epoch 1 |  iter 8841 / 9295 | time 696[s] | loss 2.25
| epoch 1 |  iter 8861 / 9295 | time 697[s] | loss 2.20
| epoch 1 |  iter 8881 / 9295 | time 699[s] | loss 2.24
| epoch 1 |  iter 8901 / 9295 | time 701[s] | loss 2.23
| epoch 1 |  iter 8921 / 9295 | time 702[s] | loss 2.26
| epoch 1 |  iter 8941 / 9295 | time 704[s] | loss 2.26
| epoch 1 |  iter 8961 / 9295 | time 705[s] | loss 2.20
| epoch 1 |  iter 8981 / 9295 | time 707[s] | loss 2.22
| epoch 1 |  iter 9001 / 9295 | time 708[s] | loss 2.21
| epoch 1 |  iter 9021 / 9295 | time 710[s] | loss 2.22
| epoch 1 |  iter 9041 / 9295 | time 712[s] | loss 2.23
| epoch 1 |  iter 9061 / 9295 | time 713[s] | loss 2.21
| epoch 1 |  iter 9081 / 9295 | time 715[s] | loss 2.23
| epoch 1 |  iter 9101 / 9295 | time 716[s] | loss 2.21
| epoch 1 |  iter 9121 / 9295 | time 718[s] | loss 2.26
| epoch 1 |  iter 9141 / 9295 | time 720[s] | loss 2.22
| epoch 1 |  iter 9161 / 9295 | time 721[s] | loss 2.20
| epoch 1 |  iter 9181 / 9295 | time 723[s] | loss 2.20
| epoch 1 |  iter 9201 / 9295 | time 724[s] | loss 2.23
| epoch 1 |  iter 9221 / 9295 | time 726[s] | loss 2.21
| epoch 1 |  iter 9241 / 9295 | time 727[s] | loss 2.24
| epoch 1 |  iter 9261 / 9295 | time 729[s] | loss 2.22
| epoch 1 |  iter 9281 / 9295 | time 731[s] | loss 2.20
| epoch 2 |  iter 1 / 9295 | time 732[s] | loss 2.23
| epoch 2 |  iter 21 / 9295 | time 733[s] | loss 2.20
| epoch 2 |  iter 41 / 9295 | time 735[s] | loss 2.21
| epoch 2 |  iter 61 / 9295 | time 737[s] | loss 2.18
| epoch 2 |  iter 81 / 9295 | time 738[s] | loss 2.20
| epoch 2 |  iter 101 / 9295 | time 740[s] | loss 2.17
| epoch 2 |  iter 121 / 9295 | time 741[s] | loss 2.17
| epoch 2 |  iter 141 / 9295 | time 743[s] | loss 2.21
| epoch 2 |  iter 161 / 9295 | time 745[s] | loss 2.19
| epoch 2 |  iter 181 / 9295 | time 746[s] | loss 2.18
| epoch 2 |  iter 201 / 9295 | time 748[s] | loss 2.19
| epoch 2 |  iter 221 / 9295 | time 749[s] | loss 2.18
| epoch 2 |  iter 241 / 9295 | time 751[s] | loss 2.17
| epoch 2 |  iter 261 / 9295 | time 753[s] | loss 2.17
| epoch 2 |  iter 281 / 9295 | time 754[s] | loss 2.16
| epoch 2 |  iter 301 / 9295 | time 756[s] | loss 2.17
| epoch 2 |  iter 321 / 9295 | time 757[s] | loss 2.17
| epoch 2 |  iter 341 / 9295 | time 759[s] | loss 2.18
| epoch 2 |  iter 361 / 9295 | time 761[s] | loss 2.21
| epoch 2 |  iter 381 / 9295 | time 763[s] | loss 2.18
| epoch 2 |  iter 401 / 9295 | time 764[s] | loss 2.20
| epoch 2 |  iter 421 / 9295 | time 766[s] | loss 2.19
| epoch 2 |  iter 441 / 9295 | time 768[s] | loss 2.18
| epoch 2 |  iter 461 / 9295 | time 769[s] | loss 2.16
| epoch 2 |  iter 481 / 9295 | time 771[s] | loss 2.19
| epoch 2 |  iter 501 / 9295 | time 772[s] | loss 2.16
| epoch 2 |  iter 521 / 9295 | time 774[s] | loss 2.19
| epoch 2 |  iter 541 / 9295 | time 776[s] | loss 2.20
| epoch 2 |  iter 561 / 9295 | time 777[s] | loss 2.20
| epoch 2 |  iter 581 / 9295 | time 779[s] | loss 2.16
| epoch 2 |  iter 601 / 9295 | time 780[s] | loss 2.17
| epoch 2 |  iter 621 / 9295 | time 782[s] | loss 2.21
| epoch 2 |  iter 641 / 9295 | time 784[s] | loss 2.19
| epoch 2 |  iter 661 / 9295 | time 785[s] | loss 2.17
| epoch 2 |  iter 681 / 9295 | time 787[s] | loss 2.18
| epoch 2 |  iter 701 / 9295 | time 788[s] | loss 2.13
| epoch 2 |  iter 721 / 9295 | time 790[s] | loss 2.16
| epoch 2 |  iter 741 / 9295 | time 792[s] | loss 2.15
| epoch 2 |  iter 761 / 9295 | time 793[s] | loss 2.17
| epoch 2 |  iter 781 / 9295 | time 795[s] | loss 2.17
| epoch 2 |  iter 801 / 9295 | time 796[s] | loss 2.14
| epoch 2 |  iter 821 / 9295 | time 798[s] | loss 2.19
| epoch 2 |  iter 841 / 9295 | time 800[s] | loss 2.19
| epoch 2 |  iter 861 / 9295 | time 801[s] | loss 2.15
| epoch 2 |  iter 881 / 9295 | time 803[s] | loss 2.13
| epoch 2 |  iter 901 / 9295 | time 804[s] | loss 2.17
| epoch 2 |  iter 921 / 9295 | time 806[s] | loss 2.20
| epoch 2 |  iter 941 / 9295 | time 808[s] | loss 2.15
| epoch 2 |  iter 961 / 9295 | time 809[s] | loss 2.15
| epoch 2 |  iter 981 / 9295 | time 811[s] | loss 2.15
| epoch 2 |  iter 1001 / 9295 | time 812[s] | loss 2.14
| epoch 2 |  iter 1021 / 9295 | time 814[s] | loss 2.12
| epoch 2 |  iter 1041 / 9295 | time 816[s] | loss 2.15
| epoch 2 |  iter 1061 / 9295 | time 817[s] | loss 2.15
| epoch 2 |  iter 1081 / 9295 | time 819[s] | loss 2.16
| epoch 2 |  iter 1101 / 9295 | time 820[s] | loss 2.14
| epoch 2 |  iter 1121 / 9295 | time 822[s] | loss 2.18
| epoch 2 |  iter 1141 / 9295 | time 824[s] | loss 2.16
| epoch 2 |  iter 1161 / 9295 | time 825[s] | loss 2.15
| epoch 2 |  iter 1181 / 9295 | time 827[s] | loss 2.16
| epoch 2 |  iter 1201 / 9295 | time 828[s] | loss 2.16
| epoch 2 |  iter 1221 / 9295 | time 830[s] | loss 2.15
| epoch 2 |  iter 1241 / 9295 | time 832[s] | loss 2.12
| epoch 2 |  iter 1261 / 9295 | time 833[s] | loss 2.16
| epoch 2 |  iter 1281 / 9295 | time 835[s] | loss 2.15
| epoch 2 |  iter 1301 / 9295 | time 836[s] | loss 2.16
| epoch 2 |  iter 1321 / 9295 | time 838[s] | loss 2.14
| epoch 2 |  iter 1341 / 9295 | time 840[s] | loss 2.14
| epoch 2 |  iter 1361 / 9295 | time 841[s] | loss 2.17
| epoch 2 |  iter 1381 / 9295 | time 843[s] | loss 2.17
| epoch 2 |  iter 1401 / 9295 | time 844[s] | loss 2.15
| epoch 2 |  iter 1421 / 9295 | time 846[s] | loss 2.13
| epoch 2 |  iter 1441 / 9295 | time 848[s] | loss 2.17
| epoch 2 |  iter 1461 / 9295 | time 849[s] | loss 2.16
| epoch 2 |  iter 1481 / 9295 | time 851[s] | loss 2.14
| epoch 2 |  iter 1501 / 9295 | time 852[s] | loss 2.14
| epoch 2 |  iter 1521 / 9295 | time 854[s] | loss 2.19
| epoch 2 |  iter 1541 / 9295 | time 856[s] | loss 2.15
| epoch 2 |  iter 1561 / 9295 | time 857[s] | loss 2.18
| epoch 2 |  iter 1581 / 9295 | time 859[s] | loss 2.16
| epoch 2 |  iter 1601 / 9295 | time 860[s] | loss 2.11
| epoch 2 |  iter 1621 / 9295 | time 862[s] | loss 2.17
| epoch 2 |  iter 1641 / 9295 | time 864[s] | loss 2.15
| epoch 2 |  iter 1661 / 9295 | time 865[s] | loss 2.16
| epoch 2 |  iter 1681 / 9295 | time 867[s] | loss 2.13
| epoch 2 |  iter 1701 / 9295 | time 868[s] | loss 2.13
| epoch 2 |  iter 1721 / 9295 | time 870[s] | loss 2.16
| epoch 2 |  iter 1741 / 9295 | time 872[s] | loss 2.17
| epoch 2 |  iter 1761 / 9295 | time 873[s] | loss 2.14
| epoch 2 |  iter 1781 / 9295 | time 875[s] | loss 2.13
| epoch 2 |  iter 1801 / 9295 | time 877[s] | loss 2.17
| epoch 2 |  iter 1821 / 9295 | time 878[s] | loss 2.17
| epoch 2 |  iter 1841 / 9295 | time 880[s] | loss 2.19
| epoch 2 |  iter 1861 / 9295 | time 881[s] | loss 2.10
| epoch 2 |  iter 1881 / 9295 | time 883[s] | loss 2.13
| epoch 2 |  iter 1901 / 9295 | time 885[s] | loss 2.10
| epoch 2 |  iter 1921 / 9295 | time 886[s] | loss 2.12
| epoch 2 |  iter 1941 / 9295 | time 888[s] | loss 2.13
| epoch 2 |  iter 1961 / 9295 | time 890[s] | loss 2.16
| epoch 2 |  iter 1981 / 9295 | time 891[s] | loss 2.15
| epoch 2 |  iter 2001 / 9295 | time 893[s] | loss 2.17
| epoch 2 |  iter 2021 / 9295 | time 894[s] | loss 2.14
| epoch 2 |  iter 2041 / 9295 | time 896[s] | loss 2.18
| epoch 2 |  iter 2061 / 9295 | time 898[s] | loss 2.13
| epoch 2 |  iter 2081 / 9295 | time 899[s] | loss 2.17
| epoch 2 |  iter 2101 / 9295 | time 901[s] | loss 2.15
| epoch 2 |  iter 2121 / 9295 | time 903[s] | loss 2.13
| epoch 2 |  iter 2141 / 9295 | time 904[s] | loss 2.16
| epoch 2 |  iter 2161 / 9295 | time 906[s] | loss 2.12
| epoch 2 |  iter 2181 / 9295 | time 907[s] | loss 2.12
| epoch 2 |  iter 2201 / 9295 | time 909[s] | loss 2.14
| epoch 2 |  iter 2221 / 9295 | time 911[s] | loss 2.13
| epoch 2 |  iter 2241 / 9295 | time 912[s] | loss 2.13
| epoch 2 |  iter 2261 / 9295 | time 914[s] | loss 2.11
| epoch 2 |  iter 2281 / 9295 | time 915[s] | loss 2.14
| epoch 2 |  iter 2301 / 9295 | time 917[s] | loss 2.13
| epoch 2 |  iter 2321 / 9295 | time 919[s] | loss 2.15
| epoch 2 |  iter 2341 / 9295 | time 920[s] | loss 2.15
| epoch 2 |  iter 2361 / 9295 | time 922[s] | loss 2.15
| epoch 2 |  iter 2381 / 9295 | time 924[s] | loss 2.14
| epoch 2 |  iter 2401 / 9295 | time 925[s] | loss 2.10
| epoch 2 |  iter 2421 / 9295 | time 927[s] | loss 2.12
| epoch 2 |  iter 2441 / 9295 | time 928[s] | loss 2.10
| epoch 2 |  iter 2461 / 9295 | time 930[s] | loss 2.14
| epoch 2 |  iter 2481 / 9295 | time 932[s] | loss 2.12
| epoch 2 |  iter 2501 / 9295 | time 933[s] | loss 2.11
| epoch 2 |  iter 2521 / 9295 | time 935[s] | loss 2.13
| epoch 2 |  iter 2541 / 9295 | time 937[s] | loss 2.13
| epoch 2 |  iter 2561 / 9295 | time 938[s] | loss 2.09
| epoch 2 |  iter 2581 / 9295 | time 940[s] | loss 2.14
| epoch 2 |  iter 2601 / 9295 | time 941[s] | loss 2.11
| epoch 2 |  iter 2621 / 9295 | time 943[s] | loss 2.15
| epoch 2 |  iter 2641 / 9295 | time 945[s] | loss 2.13
| epoch 2 |  iter 2661 / 9295 | time 946[s] | loss 2.13
| epoch 2 |  iter 2681 / 9295 | time 948[s] | loss 2.14
| epoch 2 |  iter 2701 / 9295 | time 949[s] | loss 2.13
| epoch 2 |  iter 2721 / 9295 | time 951[s] | loss 2.10
| epoch 2 |  iter 2741 / 9295 | time 952[s] | loss 2.15
| epoch 2 |  iter 2761 / 9295 | time 954[s] | loss 2.11
| epoch 2 |  iter 2781 / 9295 | time 956[s] | loss 2.13
| epoch 2 |  iter 2801 / 9295 | time 957[s] | loss 2.13
| epoch 2 |  iter 2821 / 9295 | time 959[s] | loss 2.13
| epoch 2 |  iter 2841 / 9295 | time 960[s] | loss 2.13
| epoch 2 |  iter 2861 / 9295 | time 962[s] | loss 2.12
| epoch 2 |  iter 2881 / 9295 | time 963[s] | loss 2.14
| epoch 2 |  iter 2901 / 9295 | time 965[s] | loss 2.15
| epoch 2 |  iter 2921 / 9295 | time 967[s] | loss 2.13
| epoch 2 |  iter 2941 / 9295 | time 968[s] | loss 2.16
| epoch 2 |  iter 2961 / 9295 | time 970[s] | loss 2.13
| epoch 2 |  iter 2981 / 9295 | time 971[s] | loss 2.11
| epoch 2 |  iter 3001 / 9295 | time 973[s] | loss 2.14
| epoch 2 |  iter 3021 / 9295 | time 975[s] | loss 2.11
| epoch 2 |  iter 3041 / 9295 | time 976[s] | loss 2.09
| epoch 2 |  iter 3061 / 9295 | time 978[s] | loss 2.14
| epoch 2 |  iter 3081 / 9295 | time 979[s] | loss 2.12
| epoch 2 |  iter 3101 / 9295 | time 981[s] | loss 2.12
| epoch 2 |  iter 3121 / 9295 | time 983[s] | loss 2.11
| epoch 2 |  iter 3141 / 9295 | time 984[s] | loss 2.11
| epoch 2 |  iter 3161 / 9295 | time 986[s] | loss 2.12
| epoch 2 |  iter 3181 / 9295 | time 987[s] | loss 2.10
| epoch 2 |  iter 3201 / 9295 | time 989[s] | loss 2.09
| epoch 2 |  iter 3221 / 9295 | time 990[s] | loss 2.15
| epoch 2 |  iter 3241 / 9295 | time 992[s] | loss 2.16
| epoch 2 |  iter 3261 / 9295 | time 994[s] | loss 2.12
| epoch 2 |  iter 3281 / 9295 | time 995[s] | loss 2.12
| epoch 2 |  iter 3301 / 9295 | time 997[s] | loss 2.11
| epoch 2 |  iter 3321 / 9295 | time 998[s] | loss 2.13
| epoch 2 |  iter 3341 / 9295 | time 1000[s] | loss 2.12
| epoch 2 |  iter 3361 / 9295 | time 1002[s] | loss 2.08
| epoch 2 |  iter 3381 / 9295 | time 1003[s] | loss 2.11
| epoch 2 |  iter 3401 / 9295 | time 1005[s] | loss 2.11
| epoch 2 |  iter 3421 / 9295 | time 1006[s] | loss 2.14
| epoch 2 |  iter 3441 / 9295 | time 1008[s] | loss 2.11
| epoch 2 |  iter 3461 / 9295 | time 1009[s] | loss 2.10
| epoch 2 |  iter 3481 / 9295 | time 1011[s] | loss 2.09
| epoch 2 |  iter 3501 / 9295 | time 1013[s] | loss 2.11
| epoch 2 |  iter 3521 / 9295 | time 1014[s] | loss 2.08
| epoch 2 |  iter 3541 / 9295 | time 1016[s] | loss 2.11
| epoch 2 |  iter 3561 / 9295 | time 1017[s] | loss 2.10
| epoch 2 |  iter 3581 / 9295 | time 1019[s] | loss 2.13
| epoch 2 |  iter 3601 / 9295 | time 1021[s] | loss 2.10
| epoch 2 |  iter 3621 / 9295 | time 1022[s] | loss 2.11
| epoch 2 |  iter 3641 / 9295 | time 1024[s] | loss 2.12
| epoch 2 |  iter 3661 / 9295 | time 1025[s] | loss 2.15
| epoch 2 |  iter 3681 / 9295 | time 1027[s] | loss 2.11
| epoch 2 |  iter 3701 / 9295 | time 1028[s] | loss 2.12
| epoch 2 |  iter 3721 / 9295 | time 1030[s] | loss 2.14
| epoch 2 |  iter 3741 / 9295 | time 1032[s] | loss 2.14
| epoch 2 |  iter 3761 / 9295 | time 1033[s] | loss 2.09
| epoch 2 |  iter 3781 / 9295 | time 1035[s] | loss 2.08
| epoch 2 |  iter 3801 / 9295 | time 1036[s] | loss 2.14
| epoch 2 |  iter 3821 / 9295 | time 1038[s] | loss 2.11
| epoch 2 |  iter 3841 / 9295 | time 1040[s] | loss 2.12
| epoch 2 |  iter 3861 / 9295 | time 1041[s] | loss 2.09
| epoch 2 |  iter 3881 / 9295 | time 1043[s] | loss 2.04
| epoch 2 |  iter 3901 / 9295 | time 1044[s] | loss 2.07
| epoch 2 |  iter 3921 / 9295 | time 1046[s] | loss 2.11
| epoch 2 |  iter 3941 / 9295 | time 1048[s] | loss 2.09
| epoch 2 |  iter 3961 / 9295 | time 1049[s] | loss 2.10
| epoch 2 |  iter 3981 / 9295 | time 1051[s] | loss 2.10
| epoch 2 |  iter 4001 / 9295 | time 1052[s] | loss 2.06
| epoch 2 |  iter 4021 / 9295 | time 1054[s] | loss 2.12
| epoch 2 |  iter 4041 / 9295 | time 1055[s] | loss 2.12
| epoch 2 |  iter 4061 / 9295 | time 1057[s] | loss 2.13
| epoch 2 |  iter 4081 / 9295 | time 1058[s] | loss 2.08
| epoch 2 |  iter 4101 / 9295 | time 1060[s] | loss 2.07
| epoch 2 |  iter 4121 / 9295 | time 1062[s] | loss 2.08
| epoch 2 |  iter 4141 / 9295 | time 1063[s] | loss 2.13
| epoch 2 |  iter 4161 / 9295 | time 1065[s] | loss 2.12
| epoch 2 |  iter 4181 / 9295 | time 1066[s] | loss 2.08
| epoch 2 |  iter 4201 / 9295 | time 1068[s] | loss 2.06
| epoch 2 |  iter 4221 / 9295 | time 1069[s] | loss 2.07
| epoch 2 |  iter 4241 / 9295 | time 1071[s] | loss 2.06
| epoch 2 |  iter 4261 / 9295 | time 1073[s] | loss 2.09
| epoch 2 |  iter 4281 / 9295 | time 1074[s] | loss 2.09
| epoch 2 |  iter 4301 / 9295 | time 1076[s] | loss 2.12
| epoch 2 |  iter 4321 / 9295 | time 1077[s] | loss 2.04
| epoch 2 |  iter 4341 / 9295 | time 1079[s] | loss 2.06
| epoch 2 |  iter 4361 / 9295 | time 1080[s] | loss 2.09
| epoch 2 |  iter 4381 / 9295 | time 1082[s] | loss 2.06
| epoch 2 |  iter 4401 / 9295 | time 1083[s] | loss 2.05
| epoch 2 |  iter 4421 / 9295 | time 1085[s] | loss 2.08
| epoch 2 |  iter 4441 / 9295 | time 1087[s] | loss 2.12
| epoch 2 |  iter 4461 / 9295 | time 1088[s] | loss 2.09
| epoch 2 |  iter 4481 / 9295 | time 1090[s] | loss 2.07
| epoch 2 |  iter 4501 / 9295 | time 1091[s] | loss 2.10
| epoch 2 |  iter 4521 / 9295 | time 1093[s] | loss 2.09
| epoch 2 |  iter 4541 / 9295 | time 1094[s] | loss 2.12
| epoch 2 |  iter 4561 / 9295 | time 1096[s] | loss 2.06
| epoch 2 |  iter 4581 / 9295 | time 1098[s] | loss 2.10
| epoch 2 |  iter 4601 / 9295 | time 1099[s] | loss 2.09
| epoch 2 |  iter 4621 / 9295 | time 1101[s] | loss 2.07
| epoch 2 |  iter 4641 / 9295 | time 1102[s] | loss 2.10
| epoch 2 |  iter 4661 / 9295 | time 1104[s] | loss 2.09
| epoch 2 |  iter 4681 / 9295 | time 1105[s] | loss 2.09
| epoch 2 |  iter 4701 / 9295 | time 1107[s] | loss 2.06
| epoch 2 |  iter 4721 / 9295 | time 1109[s] | loss 2.08
| epoch 2 |  iter 4741 / 9295 | time 1110[s] | loss 2.07
| epoch 2 |  iter 4761 / 9295 | time 1112[s] | loss 2.09
| epoch 2 |  iter 4781 / 9295 | time 1113[s] | loss 2.06
| epoch 2 |  iter 4801 / 9295 | time 1115[s] | loss 2.08
| epoch 2 |  iter 4821 / 9295 | time 1117[s] | loss 2.05
| epoch 2 |  iter 4841 / 9295 | time 1118[s] | loss 2.07
| epoch 2 |  iter 4861 / 9295 | time 1120[s] | loss 2.08
| epoch 2 |  iter 4881 / 9295 | time 1121[s] | loss 2.08
| epoch 2 |  iter 4901 / 9295 | time 1123[s] | loss 2.05
| epoch 2 |  iter 4921 / 9295 | time 1124[s] | loss 2.09
| epoch 2 |  iter 4941 / 9295 | time 1126[s] | loss 2.08
| epoch 2 |  iter 4961 / 9295 | time 1128[s] | loss 2.09
| epoch 2 |  iter 4981 / 9295 | time 1129[s] | loss 2.08
| epoch 2 |  iter 5001 / 9295 | time 1131[s] | loss 2.08
| epoch 2 |  iter 5021 / 9295 | time 1132[s] | loss 2.09
| epoch 2 |  iter 5041 / 9295 | time 1134[s] | loss 2.07
| epoch 2 |  iter 5061 / 9295 | time 1135[s] | loss 2.09
| epoch 2 |  iter 5081 / 9295 | time 1137[s] | loss 2.10
| epoch 2 |  iter 5101 / 9295 | time 1138[s] | loss 2.10
| epoch 2 |  iter 5121 / 9295 | time 1140[s] | loss 2.08
| epoch 2 |  iter 5141 / 9295 | time 1142[s] | loss 2.04
| epoch 2 |  iter 5161 / 9295 | time 1143[s] | loss 2.06
| epoch 2 |  iter 5181 / 9295 | time 1145[s] | loss 2.07
| epoch 2 |  iter 5201 / 9295 | time 1146[s] | loss 2.08
| epoch 2 |  iter 5221 / 9295 | time 1148[s] | loss 2.09
| epoch 2 |  iter 5241 / 9295 | time 1149[s] | loss 2.08
| epoch 2 |  iter 5261 / 9295 | time 1151[s] | loss 2.08
| epoch 2 |  iter 5281 / 9295 | time 1153[s] | loss 2.05
| epoch 2 |  iter 5301 / 9295 | time 1154[s] | loss 2.05
| epoch 2 |  iter 5321 / 9295 | time 1156[s] | loss 2.06
| epoch 2 |  iter 5341 / 9295 | time 1157[s] | loss 2.09
| epoch 2 |  iter 5361 / 9295 | time 1159[s] | loss 2.06
| epoch 2 |  iter 5381 / 9295 | time 1160[s] | loss 2.07
| epoch 2 |  iter 5401 / 9295 | time 1162[s] | loss 2.05
| epoch 2 |  iter 5421 / 9295 | time 1164[s] | loss 2.06
| epoch 2 |  iter 5441 / 9295 | time 1165[s] | loss 2.09
| epoch 2 |  iter 5461 / 9295 | time 1167[s] | loss 2.09
| epoch 2 |  iter 5481 / 9295 | time 1168[s] | loss 2.07
| epoch 2 |  iter 5501 / 9295 | time 1170[s] | loss 2.10
| epoch 2 |  iter 5521 / 9295 | time 1171[s] | loss 2.10
| epoch 2 |  iter 5541 / 9295 | time 1173[s] | loss 2.06
| epoch 2 |  iter 5561 / 9295 | time 1175[s] | loss 2.09
| epoch 2 |  iter 5581 / 9295 | time 1176[s] | loss 2.09
| epoch 2 |  iter 5601 / 9295 | time 1178[s] | loss 2.08
| epoch 2 |  iter 5621 / 9295 | time 1179[s] | loss 2.08
| epoch 2 |  iter 5641 / 9295 | time 1181[s] | loss 2.05
| epoch 2 |  iter 5661 / 9295 | time 1182[s] | loss 2.06
| epoch 2 |  iter 5681 / 9295 | time 1184[s] | loss 2.06
| epoch 2 |  iter 5701 / 9295 | time 1186[s] | loss 2.07
| epoch 2 |  iter 5721 / 9295 | time 1187[s] | loss 2.11
| epoch 2 |  iter 5741 / 9295 | time 1189[s] | loss 2.10
| epoch 2 |  iter 5761 / 9295 | time 1190[s] | loss 2.06
| epoch 2 |  iter 5781 / 9295 | time 1192[s] | loss 2.07
| epoch 2 |  iter 5801 / 9295 | time 1193[s] | loss 2.09
| epoch 2 |  iter 5821 / 9295 | time 1195[s] | loss 2.05
| epoch 2 |  iter 5841 / 9295 | time 1197[s] | loss 2.02
| epoch 2 |  iter 5861 / 9295 | time 1198[s] | loss 2.03
| epoch 2 |  iter 5881 / 9295 | time 1200[s] | loss 2.05
| epoch 2 |  iter 5901 / 9295 | time 1201[s] | loss 2.07
| epoch 2 |  iter 5921 / 9295 | time 1203[s] | loss 2.07
| epoch 2 |  iter 5941 / 9295 | time 1204[s] | loss 2.04
| epoch 2 |  iter 5961 / 9295 | time 1206[s] | loss 2.08
| epoch 2 |  iter 5981 / 9295 | time 1207[s] | loss 2.06
| epoch 2 |  iter 6001 / 9295 | time 1209[s] | loss 2.07
| epoch 2 |  iter 6021 / 9295 | time 1211[s] | loss 2.09
| epoch 2 |  iter 6041 / 9295 | time 1212[s] | loss 2.07
| epoch 2 |  iter 6061 / 9295 | time 1214[s] | loss 2.08
| epoch 2 |  iter 6081 / 9295 | time 1215[s] | loss 2.06
| epoch 2 |  iter 6101 / 9295 | time 1217[s] | loss 2.06
| epoch 2 |  iter 6121 / 9295 | time 1218[s] | loss 2.09
| epoch 2 |  iter 6141 / 9295 | time 1220[s] | loss 2.02
| epoch 2 |  iter 6161 / 9295 | time 1222[s] | loss 2.08
| epoch 2 |  iter 6181 / 9295 | time 1223[s] | loss 2.07
| epoch 2 |  iter 6201 / 9295 | time 1225[s] | loss 2.02
| epoch 2 |  iter 6221 / 9295 | time 1226[s] | loss 2.08
| epoch 2 |  iter 6241 / 9295 | time 1228[s] | loss 2.03
| epoch 2 |  iter 6261 / 9295 | time 1230[s] | loss 2.05
| epoch 2 |  iter 6281 / 9295 | time 1231[s] | loss 2.10
| epoch 2 |  iter 6301 / 9295 | time 1233[s] | loss 2.08
| epoch 2 |  iter 6321 / 9295 | time 1234[s] | loss 2.06
| epoch 2 |  iter 6341 / 9295 | time 1236[s] | loss 2.06
| epoch 2 |  iter 6361 / 9295 | time 1238[s] | loss 2.05
| epoch 2 |  iter 6381 / 9295 | time 1239[s] | loss 2.06
| epoch 2 |  iter 6401 / 9295 | time 1241[s] | loss 2.09
| epoch 2 |  iter 6421 / 9295 | time 1242[s] | loss 2.02
| epoch 2 |  iter 6441 / 9295 | time 1244[s] | loss 2.04
| epoch 2 |  iter 6461 / 9295 | time 1246[s] | loss 2.02
| epoch 2 |  iter 6481 / 9295 | time 1247[s] | loss 2.08
| epoch 2 |  iter 6501 / 9295 | time 1249[s] | loss 2.06
| epoch 2 |  iter 6521 / 9295 | time 1250[s] | loss 2.06
| epoch 2 |  iter 6541 / 9295 | time 1252[s] | loss 2.04
| epoch 2 |  iter 6561 / 9295 | time 1254[s] | loss 2.09
| epoch 2 |  iter 6581 / 9295 | time 1255[s] | loss 2.02
| epoch 2 |  iter 6601 / 9295 | time 1257[s] | loss 2.02
| epoch 2 |  iter 6621 / 9295 | time 1258[s] | loss 2.03
| epoch 2 |  iter 6641 / 9295 | time 1260[s] | loss 2.05
| epoch 2 |  iter 6661 / 9295 | time 1262[s] | loss 2.03
| epoch 2 |  iter 6681 / 9295 | time 1263[s] | loss 2.02
| epoch 2 |  iter 6701 / 9295 | time 1265[s] | loss 2.05
| epoch 2 |  iter 6721 / 9295 | time 1266[s] | loss 2.03
| epoch 2 |  iter 6741 / 9295 | time 1268[s] | loss 2.13
| epoch 2 |  iter 6761 / 9295 | time 1270[s] | loss 2.03
| epoch 2 |  iter 6781 / 9295 | time 1271[s] | loss 2.04
| epoch 2 |  iter 6801 / 9295 | time 1273[s] | loss 2.06
| epoch 2 |  iter 6821 / 9295 | time 1274[s] | loss 2.08
| epoch 2 |  iter 6841 / 9295 | time 1276[s] | loss 2.06
| epoch 2 |  iter 6861 / 9295 | time 1278[s] | loss 2.06
| epoch 2 |  iter 6881 / 9295 | time 1279[s] | loss 2.05
| epoch 2 |  iter 6901 / 9295 | time 1281[s] | loss 2.09
| epoch 2 |  iter 6921 / 9295 | time 1282[s] | loss 2.07
| epoch 2 |  iter 6941 / 9295 | time 1284[s] | loss 2.03
| epoch 2 |  iter 6961 / 9295 | time 1286[s] | loss 2.01
| epoch 2 |  iter 6981 / 9295 | time 1287[s] | loss 2.08
| epoch 2 |  iter 7001 / 9295 | time 1289[s] | loss 2.04
| epoch 2 |  iter 7021 / 9295 | time 1290[s] | loss 2.03
| epoch 2 |  iter 7041 / 9295 | time 1292[s] | loss 2.04
| epoch 2 |  iter 7061 / 9295 | time 1294[s] | loss 2.05
| epoch 2 |  iter 7081 / 9295 | time 1295[s] | loss 2.01
| epoch 2 |  iter 7101 / 9295 | time 1297[s] | loss 2.02
| epoch 2 |  iter 7121 / 9295 | time 1298[s] | loss 2.05
| epoch 2 |  iter 7141 / 9295 | time 1300[s] | loss 2.06
| epoch 2 |  iter 7161 / 9295 | time 1302[s] | loss 2.02
| epoch 2 |  iter 7181 / 9295 | time 1303[s] | loss 2.04
| epoch 2 |  iter 7201 / 9295 | time 1305[s] | loss 2.02
| epoch 2 |  iter 7221 / 9295 | time 1306[s] | loss 2.03
| epoch 2 |  iter 7241 / 9295 | time 1308[s] | loss 2.03
| epoch 2 |  iter 7261 / 9295 | time 1309[s] | loss 2.06
| epoch 2 |  iter 7281 / 9295 | time 1311[s] | loss 2.04
| epoch 2 |  iter 7301 / 9295 | time 1313[s] | loss 2.04
| epoch 2 |  iter 7321 / 9295 | time 1314[s] | loss 2.02
| epoch 2 |  iter 7341 / 9295 | time 1316[s] | loss 2.07
| epoch 2 |  iter 7361 / 9295 | time 1317[s] | loss 2.02
| epoch 2 |  iter 7381 / 9295 | time 1319[s] | loss 2.04
| epoch 2 |  iter 7401 / 9295 | time 1320[s] | loss 2.02
| epoch 2 |  iter 7421 / 9295 | time 1322[s] | loss 2.03
| epoch 2 |  iter 7441 / 9295 | time 1324[s] | loss 2.05
| epoch 2 |  iter 7461 / 9295 | time 1325[s] | loss 2.01
| epoch 2 |  iter 7481 / 9295 | time 1327[s] | loss 2.04
| epoch 2 |  iter 7501 / 9295 | time 1328[s] | loss 2.00
| epoch 2 |  iter 7521 / 9295 | time 1330[s] | loss 2.06
| epoch 2 |  iter 7541 / 9295 | time 1332[s] | loss 2.05
| epoch 2 |  iter 7561 / 9295 | time 1333[s] | loss 2.03
| epoch 2 |  iter 7581 / 9295 | time 1335[s] | loss 2.03
| epoch 2 |  iter 7601 / 9295 | time 1336[s] | loss 2.05
| epoch 2 |  iter 7621 / 9295 | time 1338[s] | loss 2.06
| epoch 2 |  iter 7641 / 9295 | time 1339[s] | loss 2.05
| epoch 2 |  iter 7661 / 9295 | time 1341[s] | loss 2.06
| epoch 2 |  iter 7681 / 9295 | time 1343[s] | loss 2.06
| epoch 2 |  iter 7701 / 9295 | time 1344[s] | loss 2.04
| epoch 2 |  iter 7721 / 9295 | time 1346[s] | loss 2.06
| epoch 2 |  iter 7741 / 9295 | time 1347[s] | loss 2.02
| epoch 2 |  iter 7761 / 9295 | time 1349[s] | loss 2.04
| epoch 2 |  iter 7781 / 9295 | time 1350[s] | loss 2.00
| epoch 2 |  iter 7801 / 9295 | time 1352[s] | loss 2.03
| epoch 2 |  iter 7821 / 9295 | time 1354[s] | loss 2.06
| epoch 2 |  iter 7841 / 9295 | time 1355[s] | loss 2.02
| epoch 2 |  iter 7861 / 9295 | time 1357[s] | loss 1.99
| epoch 2 |  iter 7881 / 9295 | time 1358[s] | loss 2.04
| epoch 2 |  iter 7901 / 9295 | time 1360[s] | loss 2.01
| epoch 2 |  iter 7921 / 9295 | time 1362[s] | loss 2.04
| epoch 2 |  iter 7941 / 9295 | time 1363[s] | loss 2.04
| epoch 2 |  iter 7961 / 9295 | time 1365[s] | loss 2.05
| epoch 2 |  iter 7981 / 9295 | time 1367[s] | loss 2.08
| epoch 2 |  iter 8001 / 9295 | time 1368[s] | loss 2.03
| epoch 2 |  iter 8021 / 9295 | time 1370[s] | loss 2.05
| epoch 2 |  iter 8041 / 9295 | time 1371[s] | loss 2.00
| epoch 2 |  iter 8061 / 9295 | time 1373[s] | loss 2.02
| epoch 2 |  iter 8081 / 9295 | time 1375[s] | loss 2.03
| epoch 2 |  iter 8101 / 9295 | time 1376[s] | loss 1.99
| epoch 2 |  iter 8121 / 9295 | time 1378[s] | loss 2.06
| epoch 2 |  iter 8141 / 9295 | time 1379[s] | loss 2.03
| epoch 2 |  iter 8161 / 9295 | time 1381[s] | loss 2.03
| epoch 2 |  iter 8181 / 9295 | time 1383[s] | loss 2.04
| epoch 2 |  iter 8201 / 9295 | time 1384[s] | loss 2.06
| epoch 2 |  iter 8221 / 9295 | time 1386[s] | loss 2.03
| epoch 2 |  iter 8241 / 9295 | time 1387[s] | loss 2.02
| epoch 2 |  iter 8261 / 9295 | time 1389[s] | loss 1.98
| epoch 2 |  iter 8281 / 9295 | time 1391[s] | loss 2.03
| epoch 2 |  iter 8301 / 9295 | time 1392[s] | loss 2.02
| epoch 2 |  iter 8321 / 9295 | time 1394[s] | loss 2.05
| epoch 2 |  iter 8341 / 9295 | time 1395[s] | loss 2.05
| epoch 2 |  iter 8361 / 9295 | time 1397[s] | loss 2.02
| epoch 2 |  iter 8381 / 9295 | time 1399[s] | loss 2.05
| epoch 2 |  iter 8401 / 9295 | time 1400[s] | loss 2.02
| epoch 2 |  iter 8421 / 9295 | time 1402[s] | loss 2.07
| epoch 2 |  iter 8441 / 9295 | time 1404[s] | loss 2.02
| epoch 2 |  iter 8461 / 9295 | time 1405[s] | loss 2.04
| epoch 2 |  iter 8481 / 9295 | time 1407[s] | loss 2.02
| epoch 2 |  iter 8501 / 9295 | time 1409[s] | loss 2.04
| epoch 2 |  iter 8521 / 9295 | time 1410[s] | loss 2.04
| epoch 2 |  iter 8541 / 9295 | time 1412[s] | loss 2.04
| epoch 2 |  iter 8561 / 9295 | time 1413[s] | loss 2.02
| epoch 2 |  iter 8581 / 9295 | time 1415[s] | loss 2.05
| epoch 2 |  iter 8601 / 9295 | time 1417[s] | loss 2.03
| epoch 2 |  iter 8621 / 9295 | time 1418[s] | loss 2.05
| epoch 2 |  iter 8641 / 9295 | time 1420[s] | loss 1.99
| epoch 2 |  iter 8661 / 9295 | time 1421[s] | loss 2.06
| epoch 2 |  iter 8681 / 9295 | time 1423[s] | loss 2.03
| epoch 2 |  iter 8701 / 9295 | time 1425[s] | loss 2.03
| epoch 2 |  iter 8721 / 9295 | time 1426[s] | loss 2.04
| epoch 2 |  iter 8741 / 9295 | time 1428[s] | loss 2.02
| epoch 2 |  iter 8761 / 9295 | time 1429[s] | loss 2.02
| epoch 2 |  iter 8781 / 9295 | time 1431[s] | loss 2.01
| epoch 2 |  iter 8801 / 9295 | time 1433[s] | loss 2.06
| epoch 2 |  iter 8821 / 9295 | time 1434[s] | loss 2.01
| epoch 2 |  iter 8841 / 9295 | time 1436[s] | loss 1.98
| epoch 2 |  iter 8861 / 9295 | time 1437[s] | loss 2.04
| epoch 2 |  iter 8881 / 9295 | time 1439[s] | loss 2.05
| epoch 2 |  iter 8901 / 9295 | time 1441[s] | loss 2.03
| epoch 2 |  iter 8921 / 9295 | time 1442[s] | loss 2.03
| epoch 2 |  iter 8941 / 9295 | time 1444[s] | loss 2.01
| epoch 2 |  iter 8961 / 9295 | time 1445[s] | loss 2.02
| epoch 2 |  iter 8981 / 9295 | time 1447[s] | loss 2.03
| epoch 2 |  iter 9001 / 9295 | time 1449[s] | loss 2.04
| epoch 2 |  iter 9021 / 9295 | time 1450[s] | loss 2.03
| epoch 2 |  iter 9041 / 9295 | time 1452[s] | loss 2.03
| epoch 2 |  iter 9061 / 9295 | time 1454[s] | loss 2.01
| epoch 2 |  iter 9081 / 9295 | time 1455[s] | loss 2.02
| epoch 2 |  iter 9101 / 9295 | time 1457[s] | loss 2.06
| epoch 2 |  iter 9121 / 9295 | time 1459[s] | loss 2.01
| epoch 2 |  iter 9141 / 9295 | time 1460[s] | loss 2.02
| epoch 2 |  iter 9161 / 9295 | time 1462[s] | loss 2.05
| epoch 2 |  iter 9181 / 9295 | time 1464[s] | loss 2.04
| epoch 2 |  iter 9201 / 9295 | time 1465[s] | loss 2.04
| epoch 2 |  iter 9221 / 9295 | time 1467[s] | loss 2.02
| epoch 2 |  iter 9241 / 9295 | time 1468[s] | loss 2.00
| epoch 2 |  iter 9261 / 9295 | time 1470[s] | loss 2.02
| epoch 2 |  iter 9281 / 9295 | time 1472[s] | loss 2.02
| epoch 3 |  iter 1 / 9295 | time 1473[s] | loss 1.98
| epoch 3 |  iter 21 / 9295 | time 1475[s] | loss 1.91
| epoch 3 |  iter 41 / 9295 | time 1476[s] | loss 1.95
| epoch 3 |  iter 61 / 9295 | time 1478[s] | loss 1.94
| epoch 3 |  iter 81 / 9295 | time 1479[s] | loss 1.97
| epoch 3 |  iter 101 / 9295 | time 1481[s] | loss 1.95
| epoch 3 |  iter 121 / 9295 | time 1483[s] | loss 1.95
| epoch 3 |  iter 141 / 9295 | time 1484[s] | loss 1.92
| epoch 3 |  iter 161 / 9295 | time 1486[s] | loss 2.01
| epoch 3 |  iter 181 / 9295 | time 1487[s] | loss 1.94
| epoch 3 |  iter 201 / 9295 | time 1489[s] | loss 1.92
| epoch 3 |  iter 221 / 9295 | time 1491[s] | loss 1.94
| epoch 3 |  iter 241 / 9295 | time 1492[s] | loss 1.95
| epoch 3 |  iter 261 / 9295 | time 1494[s] | loss 1.96
| epoch 3 |  iter 281 / 9295 | time 1495[s] | loss 1.92
| epoch 3 |  iter 301 / 9295 | time 1497[s] | loss 1.96
| epoch 3 |  iter 321 / 9295 | time 1498[s] | loss 1.95
| epoch 3 |  iter 341 / 9295 | time 1500[s] | loss 1.96
| epoch 3 |  iter 361 / 9295 | time 1502[s] | loss 1.93
| epoch 3 |  iter 381 / 9295 | time 1503[s] | loss 1.95
| epoch 3 |  iter 401 / 9295 | time 1505[s] | loss 1.97
| epoch 3 |  iter 421 / 9295 | time 1506[s] | loss 1.96
| epoch 3 |  iter 441 / 9295 | time 1508[s] | loss 1.95
| epoch 3 |  iter 461 / 9295 | time 1510[s] | loss 1.94
| epoch 3 |  iter 481 / 9295 | time 1511[s] | loss 1.96
| epoch 3 |  iter 501 / 9295 | time 1513[s] | loss 1.94
| epoch 3 |  iter 521 / 9295 | time 1514[s] | loss 1.98
| epoch 3 |  iter 541 / 9295 | time 1516[s] | loss 1.95
| epoch 3 |  iter 561 / 9295 | time 1517[s] | loss 1.92
| epoch 3 |  iter 581 / 9295 | time 1519[s] | loss 1.97
| epoch 3 |  iter 601 / 9295 | time 1521[s] | loss 1.96
| epoch 3 |  iter 621 / 9295 | time 1522[s] | loss 1.98
| epoch 3 |  iter 641 / 9295 | time 1524[s] | loss 1.93
| epoch 3 |  iter 661 / 9295 | time 1525[s] | loss 1.95
| epoch 3 |  iter 681 / 9295 | time 1527[s] | loss 1.97
| epoch 3 |  iter 701 / 9295 | time 1529[s] | loss 1.91
| epoch 3 |  iter 721 / 9295 | time 1530[s] | loss 1.91
| epoch 3 |  iter 741 / 9295 | time 1532[s] | loss 1.99
| epoch 3 |  iter 761 / 9295 | time 1533[s] | loss 1.94
| epoch 3 |  iter 781 / 9295 | time 1535[s] | loss 1.98
| epoch 3 |  iter 801 / 9295 | time 1536[s] | loss 1.97
| epoch 3 |  iter 821 / 9295 | time 1538[s] | loss 1.92
| epoch 3 |  iter 841 / 9295 | time 1540[s] | loss 1.95
| epoch 3 |  iter 861 / 9295 | time 1541[s] | loss 1.98
| epoch 3 |  iter 881 / 9295 | time 1543[s] | loss 1.95
| epoch 3 |  iter 901 / 9295 | time 1544[s] | loss 1.95
| epoch 3 |  iter 921 / 9295 | time 1546[s] | loss 1.96
| epoch 3 |  iter 941 / 9295 | time 1548[s] | loss 1.92
| epoch 3 |  iter 961 / 9295 | time 1549[s] | loss 1.93
| epoch 3 |  iter 981 / 9295 | time 1551[s] | loss 1.96
| epoch 3 |  iter 1001 / 9295 | time 1552[s] | loss 1.93
| epoch 3 |  iter 1021 / 9295 | time 1554[s] | loss 1.96
| epoch 3 |  iter 1041 / 9295 | time 1556[s] | loss 1.94
| epoch 3 |  iter 1061 / 9295 | time 1557[s] | loss 1.96
| epoch 3 |  iter 1081 / 9295 | time 1559[s] | loss 1.95
| epoch 3 |  iter 1101 / 9295 | time 1560[s] | loss 2.00
| epoch 3 |  iter 1121 / 9295 | time 1562[s] | loss 1.93
| epoch 3 |  iter 1141 / 9295 | time 1564[s] | loss 1.96
| epoch 3 |  iter 1161 / 9295 | time 1565[s] | loss 1.99
| epoch 3 |  iter 1181 / 9295 | time 1567[s] | loss 1.94
| epoch 3 |  iter 1201 / 9295 | time 1568[s] | loss 1.96
| epoch 3 |  iter 1221 / 9295 | time 1570[s] | loss 1.98
| epoch 3 |  iter 1241 / 9295 | time 1571[s] | loss 1.90
| epoch 3 |  iter 1261 / 9295 | time 1573[s] | loss 1.95
| epoch 3 |  iter 1281 / 9295 | time 1575[s] | loss 1.93
| epoch 3 |  iter 1301 / 9295 | time 1576[s] | loss 1.98
| epoch 3 |  iter 1321 / 9295 | time 1578[s] | loss 1.92
| epoch 3 |  iter 1341 / 9295 | time 1579[s] | loss 2.00
| epoch 3 |  iter 1361 / 9295 | time 1581[s] | loss 1.94
| epoch 3 |  iter 1381 / 9295 | time 1583[s] | loss 1.95
| epoch 3 |  iter 1401 / 9295 | time 1584[s] | loss 1.98
| epoch 3 |  iter 1421 / 9295 | time 1586[s] | loss 1.94
| epoch 3 |  iter 1441 / 9295 | time 1587[s] | loss 1.96
| epoch 3 |  iter 1461 / 9295 | time 1589[s] | loss 1.97
| epoch 3 |  iter 1481 / 9295 | time 1591[s] | loss 1.97
| epoch 3 |  iter 1501 / 9295 | time 1592[s] | loss 1.93
| epoch 3 |  iter 1521 / 9295 | time 1594[s] | loss 1.95
| epoch 3 |  iter 1541 / 9295 | time 1595[s] | loss 1.96
| epoch 3 |  iter 1561 / 9295 | time 1597[s] | loss 1.97
| epoch 3 |  iter 1581 / 9295 | time 1598[s] | loss 1.92
| epoch 3 |  iter 1601 / 9295 | time 1600[s] | loss 1.95
| epoch 3 |  iter 1621 / 9295 | time 1602[s] | loss 1.96
| epoch 3 |  iter 1641 / 9295 | time 1603[s] | loss 1.94
| epoch 3 |  iter 1661 / 9295 | time 1605[s] | loss 1.94
| epoch 3 |  iter 1681 / 9295 | time 1606[s] | loss 1.91
| epoch 3 |  iter 1701 / 9295 | time 1608[s] | loss 1.97
| epoch 3 |  iter 1721 / 9295 | time 1609[s] | loss 1.93
| epoch 3 |  iter 1741 / 9295 | time 1611[s] | loss 1.96
| epoch 3 |  iter 1761 / 9295 | time 1613[s] | loss 1.97
| epoch 3 |  iter 1781 / 9295 | time 1614[s] | loss 1.96
| epoch 3 |  iter 1801 / 9295 | time 1616[s] | loss 1.91
| epoch 3 |  iter 1821 / 9295 | time 1617[s] | loss 1.95
| epoch 3 |  iter 1841 / 9295 | time 1619[s] | loss 1.89
| epoch 3 |  iter 1861 / 9295 | time 1621[s] | loss 1.94
| epoch 3 |  iter 1881 / 9295 | time 1622[s] | loss 1.92
| epoch 3 |  iter 1901 / 9295 | time 1624[s] | loss 1.97
| epoch 3 |  iter 1921 / 9295 | time 1625[s] | loss 1.91
| epoch 3 |  iter 1941 / 9295 | time 1627[s] | loss 1.95
| epoch 3 |  iter 1961 / 9295 | time 1628[s] | loss 1.96
| epoch 3 |  iter 1981 / 9295 | time 1630[s] | loss 1.96
| epoch 3 |  iter 2001 / 9295 | time 1632[s] | loss 1.93
| epoch 3 |  iter 2021 / 9295 | time 1633[s] | loss 1.96
| epoch 3 |  iter 2041 / 9295 | time 1635[s] | loss 1.94
| epoch 3 |  iter 2061 / 9295 | time 1636[s] | loss 1.93
| epoch 3 |  iter 2081 / 9295 | time 1638[s] | loss 1.97
| epoch 3 |  iter 2101 / 9295 | time 1640[s] | loss 1.98
| epoch 3 |  iter 2121 / 9295 | time 1641[s] | loss 1.94
| epoch 3 |  iter 2141 / 9295 | time 1643[s] | loss 1.94
| epoch 3 |  iter 2161 / 9295 | time 1644[s] | loss 1.97
| epoch 3 |  iter 2181 / 9295 | time 1646[s] | loss 1.93
| epoch 3 |  iter 2201 / 9295 | time 1648[s] | loss 1.92
| epoch 3 |  iter 2221 / 9295 | time 1649[s] | loss 1.91
| epoch 3 |  iter 2241 / 9295 | time 1651[s] | loss 1.92
| epoch 3 |  iter 2261 / 9295 | time 1652[s] | loss 1.99
| epoch 3 |  iter 2281 / 9295 | time 1654[s] | loss 1.94
| epoch 3 |  iter 2301 / 9295 | time 1656[s] | loss 1.96
| epoch 3 |  iter 2321 / 9295 | time 1657[s] | loss 1.91
| epoch 3 |  iter 2341 / 9295 | time 1659[s] | loss 1.92
| epoch 3 |  iter 2361 / 9295 | time 1660[s] | loss 1.94
| epoch 3 |  iter 2381 / 9295 | time 1662[s] | loss 1.95
| epoch 3 |  iter 2401 / 9295 | time 1664[s] | loss 1.89
| epoch 3 |  iter 2421 / 9295 | time 1665[s] | loss 1.95
| epoch 3 |  iter 2441 / 9295 | time 1667[s] | loss 1.92
| epoch 3 |  iter 2461 / 9295 | time 1668[s] | loss 1.94
| epoch 3 |  iter 2481 / 9295 | time 1670[s] | loss 1.93
| epoch 3 |  iter 2501 / 9295 | time 1672[s] | loss 1.97
| epoch 3 |  iter 2521 / 9295 | time 1673[s] | loss 1.92
| epoch 3 |  iter 2541 / 9295 | time 1675[s] | loss 1.93
| epoch 3 |  iter 2561 / 9295 | time 1677[s] | loss 1.95
| epoch 3 |  iter 2581 / 9295 | time 1678[s] | loss 1.92
| epoch 3 |  iter 2601 / 9295 | time 1680[s] | loss 1.96
| epoch 3 |  iter 2621 / 9295 | time 1681[s] | loss 1.96
| epoch 3 |  iter 2641 / 9295 | time 1683[s] | loss 1.93
| epoch 3 |  iter 2661 / 9295 | time 1684[s] | loss 1.92
| epoch 3 |  iter 2681 / 9295 | time 1686[s] | loss 1.93
| epoch 3 |  iter 2701 / 9295 | time 1688[s] | loss 1.95
| epoch 3 |  iter 2721 / 9295 | time 1689[s] | loss 1.99
| epoch 3 |  iter 2741 / 9295 | time 1691[s] | loss 1.90
| epoch 3 |  iter 2761 / 9295 | time 1692[s] | loss 1.90
| epoch 3 |  iter 2781 / 9295 | time 1694[s] | loss 1.90
| epoch 3 |  iter 2801 / 9295 | time 1695[s] | loss 1.93
| epoch 3 |  iter 2821 / 9295 | time 1697[s] | loss 1.90
| epoch 3 |  iter 2841 / 9295 | time 1699[s] | loss 1.96
| epoch 3 |  iter 2861 / 9295 | time 1700[s] | loss 1.95
| epoch 3 |  iter 2881 / 9295 | time 1702[s] | loss 1.95
| epoch 3 |  iter 2901 / 9295 | time 1704[s] | loss 1.93
| epoch 3 |  iter 2921 / 9295 | time 1705[s] | loss 1.93
| epoch 3 |  iter 2941 / 9295 | time 1707[s] | loss 1.89
| epoch 3 |  iter 2961 / 9295 | time 1708[s] | loss 1.96
| epoch 3 |  iter 2981 / 9295 | time 1710[s] | loss 1.93
| epoch 3 |  iter 3001 / 9295 | time 1712[s] | loss 1.94
| epoch 3 |  iter 3021 / 9295 | time 1713[s] | loss 1.91
| epoch 3 |  iter 3041 / 9295 | time 1715[s] | loss 1.92
| epoch 3 |  iter 3061 / 9295 | time 1717[s] | loss 1.93
| epoch 3 |  iter 3081 / 9295 | time 1718[s] | loss 1.93
| epoch 3 |  iter 3101 / 9295 | time 1720[s] | loss 1.93
| epoch 3 |  iter 3121 / 9295 | time 1721[s] | loss 1.95
| epoch 3 |  iter 3141 / 9295 | time 1723[s] | loss 1.94
| epoch 3 |  iter 3161 / 9295 | time 1725[s] | loss 1.95
| epoch 3 |  iter 3181 / 9295 | time 1726[s] | loss 1.88
| epoch 3 |  iter 3201 / 9295 | time 1728[s] | loss 1.96
| epoch 3 |  iter 3221 / 9295 | time 1729[s] | loss 1.95
| epoch 3 |  iter 3241 / 9295 | time 1731[s] | loss 1.91
| epoch 3 |  iter 3261 / 9295 | time 1733[s] | loss 1.97
| epoch 3 |  iter 3281 / 9295 | time 1734[s] | loss 1.94
| epoch 3 |  iter 3301 / 9295 | time 1736[s] | loss 1.94
| epoch 3 |  iter 3321 / 9295 | time 1737[s] | loss 1.94
| epoch 3 |  iter 3341 / 9295 | time 1739[s] | loss 1.93
| epoch 3 |  iter 3361 / 9295 | time 1741[s] | loss 1.93
| epoch 3 |  iter 3381 / 9295 | time 1742[s] | loss 1.93
| epoch 3 |  iter 3401 / 9295 | time 1744[s] | loss 1.93
| epoch 3 |  iter 3421 / 9295 | time 1745[s] | loss 1.96
| epoch 3 |  iter 3441 / 9295 | time 1747[s] | loss 1.92
| epoch 3 |  iter 3461 / 9295 | time 1749[s] | loss 1.91
| epoch 3 |  iter 3481 / 9295 | time 1750[s] | loss 1.93
| epoch 3 |  iter 3501 / 9295 | time 1752[s] | loss 1.94
| epoch 3 |  iter 3521 / 9295 | time 1753[s] | loss 1.92
| epoch 3 |  iter 3541 / 9295 | time 1755[s] | loss 1.96
| epoch 3 |  iter 3561 / 9295 | time 1757[s] | loss 1.93
| epoch 3 |  iter 3581 / 9295 | time 1758[s] | loss 1.95
| epoch 3 |  iter 3601 / 9295 | time 1760[s] | loss 1.94
| epoch 3 |  iter 3621 / 9295 | time 1762[s] | loss 1.91
| epoch 3 |  iter 3641 / 9295 | time 1763[s] | loss 1.92
| epoch 3 |  iter 3661 / 9295 | time 1765[s] | loss 1.92
| epoch 3 |  iter 3681 / 9295 | time 1766[s] | loss 1.92
| epoch 3 |  iter 3701 / 9295 | time 1768[s] | loss 1.93
| epoch 3 |  iter 3721 / 9295 | time 1770[s] | loss 1.89
| epoch 3 |  iter 3741 / 9295 | time 1771[s] | loss 1.93
| epoch 3 |  iter 3761 / 9295 | time 1773[s] | loss 1.90
| epoch 3 |  iter 3781 / 9295 | time 1774[s] | loss 1.96
| epoch 3 |  iter 3801 / 9295 | time 1776[s] | loss 1.96
| epoch 3 |  iter 3821 / 9295 | time 1778[s] | loss 1.94
| epoch 3 |  iter 3841 / 9295 | time 1779[s] | loss 1.96
| epoch 3 |  iter 3861 / 9295 | time 1781[s] | loss 1.90
| epoch 3 |  iter 3881 / 9295 | time 1782[s] | loss 1.94
| epoch 3 |  iter 3901 / 9295 | time 1784[s] | loss 1.91
| epoch 3 |  iter 3921 / 9295 | time 1785[s] | loss 1.93
| epoch 3 |  iter 3941 / 9295 | time 1787[s] | loss 1.95
| epoch 3 |  iter 3961 / 9295 | time 1789[s] | loss 1.94
| epoch 3 |  iter 3981 / 9295 | time 1790[s] | loss 1.93
| epoch 3 |  iter 4001 / 9295 | time 1792[s] | loss 1.94
| epoch 3 |  iter 4021 / 9295 | time 1793[s] | loss 1.92
| epoch 3 |  iter 4041 / 9295 | time 1795[s] | loss 1.93
| epoch 3 |  iter 4061 / 9295 | time 1797[s] | loss 1.93
| epoch 3 |  iter 4081 / 9295 | time 1798[s] | loss 1.95
| epoch 3 |  iter 4101 / 9295 | time 1800[s] | loss 1.91
| epoch 3 |  iter 4121 / 9295 | time 1801[s] | loss 1.92
| epoch 3 |  iter 4141 / 9295 | time 1803[s] | loss 1.92
| epoch 3 |  iter 4161 / 9295 | time 1804[s] | loss 1.96
| epoch 3 |  iter 4181 / 9295 | time 1806[s] | loss 1.94
| epoch 3 |  iter 4201 / 9295 | time 1808[s] | loss 1.91
| epoch 3 |  iter 4221 / 9295 | time 1809[s] | loss 1.96
| epoch 3 |  iter 4241 / 9295 | time 1811[s] | loss 1.92
| epoch 3 |  iter 4261 / 9295 | time 1812[s] | loss 1.92
| epoch 3 |  iter 4281 / 9295 | time 1814[s] | loss 1.93
| epoch 3 |  iter 4301 / 9295 | time 1816[s] | loss 1.93
| epoch 3 |  iter 4321 / 9295 | time 1817[s] | loss 1.98
| epoch 3 |  iter 4341 / 9295 | time 1819[s] | loss 1.92
| epoch 3 |  iter 4361 / 9295 | time 1820[s] | loss 1.92
| epoch 3 |  iter 4381 / 9295 | time 1822[s] | loss 1.96
| epoch 3 |  iter 4401 / 9295 | time 1824[s] | loss 1.97
| epoch 3 |  iter 4421 / 9295 | time 1825[s] | loss 1.94
| epoch 3 |  iter 4441 / 9295 | time 1827[s] | loss 1.90
| epoch 3 |  iter 4461 / 9295 | time 1828[s] | loss 1.89
| epoch 3 |  iter 4481 / 9295 | time 1830[s] | loss 1.92
| epoch 3 |  iter 4501 / 9295 | time 1832[s] | loss 1.93
| epoch 3 |  iter 4521 / 9295 | time 1833[s] | loss 1.94
| epoch 3 |  iter 4541 / 9295 | time 1835[s] | loss 1.93
| epoch 3 |  iter 4561 / 9295 | time 1836[s] | loss 1.90
| epoch 3 |  iter 4581 / 9295 | time 1838[s] | loss 1.91
| epoch 3 |  iter 4601 / 9295 | time 1840[s] | loss 1.93
| epoch 3 |  iter 4621 / 9295 | time 1841[s] | loss 1.94
| epoch 3 |  iter 4641 / 9295 | time 1843[s] | loss 1.90
| epoch 3 |  iter 4661 / 9295 | time 1844[s] | loss 1.93
| epoch 3 |  iter 4681 / 9295 | time 1846[s] | loss 1.92
| epoch 3 |  iter 4701 / 9295 | time 1848[s] | loss 1.94
| epoch 3 |  iter 4721 / 9295 | time 1849[s] | loss 1.93
| epoch 3 |  iter 4741 / 9295 | time 1851[s] | loss 1.92
| epoch 3 |  iter 4761 / 9295 | time 1852[s] | loss 1.92
| epoch 3 |  iter 4781 / 9295 | time 1854[s] | loss 1.93
| epoch 3 |  iter 4801 / 9295 | time 1856[s] | loss 1.90
| epoch 3 |  iter 4821 / 9295 | time 1857[s] | loss 1.91
| epoch 3 |  iter 4841 / 9295 | time 1859[s] | loss 1.93
| epoch 3 |  iter 4861 / 9295 | time 1860[s] | loss 1.90
| epoch 3 |  iter 4881 / 9295 | time 1862[s] | loss 1.94
| epoch 3 |  iter 4901 / 9295 | time 1864[s] | loss 1.95
| epoch 3 |  iter 4921 / 9295 | time 1865[s] | loss 1.93
| epoch 3 |  iter 4941 / 9295 | time 1867[s] | loss 1.91
| epoch 3 |  iter 4961 / 9295 | time 1868[s] | loss 1.93
| epoch 3 |  iter 4981 / 9295 | time 1870[s] | loss 1.92
| epoch 3 |  iter 5001 / 9295 | time 1872[s] | loss 1.92
| epoch 3 |  iter 5021 / 9295 | time 1873[s] | loss 1.89
| epoch 3 |  iter 5041 / 9295 | time 1875[s] | loss 1.94
| epoch 3 |  iter 5061 / 9295 | time 1876[s] | loss 1.95
| epoch 3 |  iter 5081 / 9295 | time 1878[s] | loss 1.92
| epoch 3 |  iter 5101 / 9295 | time 1880[s] | loss 1.93
| epoch 3 |  iter 5121 / 9295 | time 1881[s] | loss 1.95
| epoch 3 |  iter 5141 / 9295 | time 1883[s] | loss 1.94
| epoch 3 |  iter 5161 / 9295 | time 1884[s] | loss 1.87
| epoch 3 |  iter 5181 / 9295 | time 1886[s] | loss 1.90
| epoch 3 |  iter 5201 / 9295 | time 1887[s] | loss 1.93
| epoch 3 |  iter 5221 / 9295 | time 1889[s] | loss 1.96
| epoch 3 |  iter 5241 / 9295 | time 1891[s] | loss 1.89
| epoch 3 |  iter 5261 / 9295 | time 1892[s] | loss 1.93
| epoch 3 |  iter 5281 / 9295 | time 1894[s] | loss 1.91
| epoch 3 |  iter 5301 / 9295 | time 1895[s] | loss 1.88
| epoch 3 |  iter 5321 / 9295 | time 1897[s] | loss 1.93
| epoch 3 |  iter 5341 / 9295 | time 1899[s] | loss 1.90
| epoch 3 |  iter 5361 / 9295 | time 1900[s] | loss 1.89
| epoch 3 |  iter 5381 / 9295 | time 1902[s] | loss 1.92
| epoch 3 |  iter 5401 / 9295 | time 1903[s] | loss 1.92
| epoch 3 |  iter 5421 / 9295 | time 1905[s] | loss 1.92
| epoch 3 |  iter 5441 / 9295 | time 1907[s] | loss 1.91
| epoch 3 |  iter 5461 / 9295 | time 1908[s] | loss 1.93
| epoch 3 |  iter 5481 / 9295 | time 1910[s] | loss 1.93
| epoch 3 |  iter 5501 / 9295 | time 1911[s] | loss 1.91
| epoch 3 |  iter 5521 / 9295 | time 1913[s] | loss 1.90
| epoch 3 |  iter 5541 / 9295 | time 1915[s] | loss 1.91
| epoch 3 |  iter 5561 / 9295 | time 1916[s] | loss 1.87
| epoch 3 |  iter 5581 / 9295 | time 1918[s] | loss 1.91
| epoch 3 |  iter 5601 / 9295 | time 1919[s] | loss 1.91
| epoch 3 |  iter 5621 / 9295 | time 1921[s] | loss 1.90
| epoch 3 |  iter 5641 / 9295 | time 1923[s] | loss 1.91
| epoch 3 |  iter 5661 / 9295 | time 1924[s] | loss 1.94
| epoch 3 |  iter 5681 / 9295 | time 1926[s] | loss 1.89
| epoch 3 |  iter 5701 / 9295 | time 1927[s] | loss 1.92
| epoch 3 |  iter 5721 / 9295 | time 1929[s] | loss 1.91
| epoch 3 |  iter 5741 / 9295 | time 1930[s] | loss 1.89
| epoch 3 |  iter 5761 / 9295 | time 1932[s] | loss 1.93
| epoch 3 |  iter 5781 / 9295 | time 1934[s] | loss 1.95
| epoch 3 |  iter 5801 / 9295 | time 1935[s] | loss 1.90
| epoch 3 |  iter 5821 / 9295 | time 1937[s] | loss 1.92
| epoch 3 |  iter 5841 / 9295 | time 1939[s] | loss 1.94
| epoch 3 |  iter 5861 / 9295 | time 1940[s] | loss 1.90
| epoch 3 |  iter 5881 / 9295 | time 1942[s] | loss 1.92
| epoch 3 |  iter 5901 / 9295 | time 1943[s] | loss 1.89
| epoch 3 |  iter 5921 / 9295 | time 1945[s] | loss 1.92
| epoch 3 |  iter 5941 / 9295 | time 1946[s] | loss 1.90
| epoch 3 |  iter 5961 / 9295 | time 1948[s] | loss 1.92
| epoch 3 |  iter 5981 / 9295 | time 1950[s] | loss 1.94
| epoch 3 |  iter 6001 / 9295 | time 1951[s] | loss 1.91
| epoch 3 |  iter 6021 / 9295 | time 1953[s] | loss 1.91
| epoch 3 |  iter 6041 / 9295 | time 1954[s] | loss 1.95
| epoch 3 |  iter 6061 / 9295 | time 1956[s] | loss 1.89
| epoch 3 |  iter 6081 / 9295 | time 1958[s] | loss 1.90
| epoch 3 |  iter 6101 / 9295 | time 1959[s] | loss 1.90
| epoch 3 |  iter 6121 / 9295 | time 1961[s] | loss 1.91
| epoch 3 |  iter 6141 / 9295 | time 1962[s] | loss 1.94
| epoch 3 |  iter 6161 / 9295 | time 1964[s] | loss 1.89
| epoch 3 |  iter 6181 / 9295 | time 1966[s] | loss 1.91
| epoch 3 |  iter 6201 / 9295 | time 1967[s] | loss 1.93
| epoch 3 |  iter 6221 / 9295 | time 1969[s] | loss 1.90
| epoch 3 |  iter 6241 / 9295 | time 1970[s] | loss 1.88
| epoch 3 |  iter 6261 / 9295 | time 1972[s] | loss 1.93
| epoch 3 |  iter 6281 / 9295 | time 1974[s] | loss 1.93
| epoch 3 |  iter 6301 / 9295 | time 1975[s] | loss 1.94
| epoch 3 |  iter 6321 / 9295 | time 1977[s] | loss 1.91
| epoch 3 |  iter 6341 / 9295 | time 1978[s] | loss 1.93
| epoch 3 |  iter 6361 / 9295 | time 1980[s] | loss 1.95
| epoch 3 |  iter 6381 / 9295 | time 1981[s] | loss 1.91
| epoch 3 |  iter 6401 / 9295 | time 1983[s] | loss 1.89
| epoch 3 |  iter 6421 / 9295 | time 1985[s] | loss 1.93
| epoch 3 |  iter 6441 / 9295 | time 1986[s] | loss 1.91
| epoch 3 |  iter 6461 / 9295 | time 1988[s] | loss 1.90
| epoch 3 |  iter 6481 / 9295 | time 1989[s] | loss 1.93
| epoch 3 |  iter 6501 / 9295 | time 1991[s] | loss 1.89
| epoch 3 |  iter 6521 / 9295 | time 1992[s] | loss 1.89
| epoch 3 |  iter 6541 / 9295 | time 1994[s] | loss 1.93
| epoch 3 |  iter 6561 / 9295 | time 1995[s] | loss 1.93
| epoch 3 |  iter 6581 / 9295 | time 1997[s] | loss 1.92
| epoch 3 |  iter 6601 / 9295 | time 1999[s] | loss 1.88
| epoch 3 |  iter 6621 / 9295 | time 2000[s] | loss 1.90
| epoch 3 |  iter 6641 / 9295 | time 2002[s] | loss 1.93
| epoch 3 |  iter 6661 / 9295 | time 2003[s] | loss 1.90
| epoch 3 |  iter 6681 / 9295 | time 2005[s] | loss 1.91
| epoch 3 |  iter 6701 / 9295 | time 2006[s] | loss 1.94
| epoch 3 |  iter 6721 / 9295 | time 2008[s] | loss 1.89
| epoch 3 |  iter 6741 / 9295 | time 2010[s] | loss 1.94
| epoch 3 |  iter 6761 / 9295 | time 2011[s] | loss 1.94
| epoch 3 |  iter 6781 / 9295 | time 2013[s] | loss 1.89
| epoch 3 |  iter 6801 / 9295 | time 2014[s] | loss 1.94
| epoch 3 |  iter 6821 / 9295 | time 2016[s] | loss 1.91
| epoch 3 |  iter 6841 / 9295 | time 2017[s] | loss 1.87
| epoch 3 |  iter 6861 / 9295 | time 2019[s] | loss 1.94
| epoch 3 |  iter 6881 / 9295 | time 2021[s] | loss 1.92
| epoch 3 |  iter 6901 / 9295 | time 2022[s] | loss 1.87
| epoch 3 |  iter 6921 / 9295 | time 2024[s] | loss 1.89
| epoch 3 |  iter 6941 / 9295 | time 2025[s] | loss 1.91
| epoch 3 |  iter 6961 / 9295 | time 2027[s] | loss 1.91
| epoch 3 |  iter 6981 / 9295 | time 2028[s] | loss 1.90
| epoch 3 |  iter 7001 / 9295 | time 2030[s] | loss 1.91
| epoch 3 |  iter 7021 / 9295 | time 2031[s] | loss 1.90
| epoch 3 |  iter 7041 / 9295 | time 2033[s] | loss 1.89
| epoch 3 |  iter 7061 / 9295 | time 2035[s] | loss 1.93
| epoch 3 |  iter 7081 / 9295 | time 2036[s] | loss 1.89
| epoch 3 |  iter 7101 / 9295 | time 2038[s] | loss 1.92
| epoch 3 |  iter 7121 / 9295 | time 2039[s] | loss 1.87
| epoch 3 |  iter 7141 / 9295 | time 2041[s] | loss 1.90
| epoch 3 |  iter 7161 / 9295 | time 2042[s] | loss 1.89
| epoch 3 |  iter 7181 / 9295 | time 2044[s] | loss 1.89
| epoch 3 |  iter 7201 / 9295 | time 2046[s] | loss 1.91
| epoch 3 |  iter 7221 / 9295 | time 2047[s] | loss 1.95
| epoch 3 |  iter 7241 / 9295 | time 2049[s] | loss 1.87
| epoch 3 |  iter 7261 / 9295 | time 2050[s] | loss 1.89
| epoch 3 |  iter 7281 / 9295 | time 2052[s] | loss 1.88
| epoch 3 |  iter 7301 / 9295 | time 2054[s] | loss 1.93
| epoch 3 |  iter 7321 / 9295 | time 2055[s] | loss 1.93
| epoch 3 |  iter 7341 / 9295 | time 2057[s] | loss 1.93
| epoch 3 |  iter 7361 / 9295 | time 2058[s] | loss 1.91
| epoch 3 |  iter 7381 / 9295 | time 2060[s] | loss 1.91
| epoch 3 |  iter 7401 / 9295 | time 2062[s] | loss 1.90
| epoch 3 |  iter 7421 / 9295 | time 2063[s] | loss 1.89
| epoch 3 |  iter 7441 / 9295 | time 2065[s] | loss 1.90
| epoch 3 |  iter 7461 / 9295 | time 2066[s] | loss 1.88
| epoch 3 |  iter 7481 / 9295 | time 2068[s] | loss 1.95
| epoch 3 |  iter 7501 / 9295 | time 2070[s] | loss 1.90
| epoch 3 |  iter 7521 / 9295 | time 2071[s] | loss 1.96
| epoch 3 |  iter 7541 / 9295 | time 2073[s] | loss 1.92
| epoch 3 |  iter 7561 / 9295 | time 2074[s] | loss 1.88
| epoch 3 |  iter 7581 / 9295 | time 2076[s] | loss 1.90
| epoch 3 |  iter 7601 / 9295 | time 2078[s] | loss 1.91
| epoch 3 |  iter 7621 / 9295 | time 2079[s] | loss 1.90
| epoch 3 |  iter 7641 / 9295 | time 2081[s] | loss 1.87
| epoch 3 |  iter 7661 / 9295 | time 2082[s] | loss 1.93
| epoch 3 |  iter 7681 / 9295 | time 2084[s] | loss 1.89
| epoch 3 |  iter 7701 / 9295 | time 2086[s] | loss 1.90
| epoch 3 |  iter 7721 / 9295 | time 2087[s] | loss 1.88
| epoch 3 |  iter 7741 / 9295 | time 2089[s] | loss 1.91
| epoch 3 |  iter 7761 / 9295 | time 2090[s] | loss 1.89
| epoch 3 |  iter 7781 / 9295 | time 2092[s] | loss 1.90
| epoch 3 |  iter 7801 / 9295 | time 2093[s] | loss 1.95
| epoch 3 |  iter 7821 / 9295 | time 2095[s] | loss 1.87
| epoch 3 |  iter 7841 / 9295 | time 2097[s] | loss 1.91
| epoch 3 |  iter 7861 / 9295 | time 2098[s] | loss 1.91
| epoch 3 |  iter 7881 / 9295 | time 2100[s] | loss 1.89
| epoch 3 |  iter 7901 / 9295 | time 2101[s] | loss 1.88
| epoch 3 |  iter 7921 / 9295 | time 2103[s] | loss 1.92
| epoch 3 |  iter 7941 / 9295 | time 2105[s] | loss 1.92
| epoch 3 |  iter 7961 / 9295 | time 2106[s] | loss 1.89
| epoch 3 |  iter 7981 / 9295 | time 2108[s] | loss 1.88
| epoch 3 |  iter 8001 / 9295 | time 2109[s] | loss 1.89
| epoch 3 |  iter 8021 / 9295 | time 2111[s] | loss 1.90
| epoch 3 |  iter 8041 / 9295 | time 2113[s] | loss 1.93
| epoch 3 |  iter 8061 / 9295 | time 2114[s] | loss 1.84
| epoch 3 |  iter 8081 / 9295 | time 2116[s] | loss 1.89
| epoch 3 |  iter 8101 / 9295 | time 2117[s] | loss 1.89
| epoch 3 |  iter 8121 / 9295 | time 2119[s] | loss 1.88
| epoch 3 |  iter 8141 / 9295 | time 2121[s] | loss 1.88
| epoch 3 |  iter 8161 / 9295 | time 2122[s] | loss 1.84
| epoch 3 |  iter 8181 / 9295 | time 2124[s] | loss 1.92
| epoch 3 |  iter 8201 / 9295 | time 2125[s] | loss 1.92
| epoch 3 |  iter 8221 / 9295 | time 2127[s] | loss 1.88
| epoch 3 |  iter 8241 / 9295 | time 2128[s] | loss 1.89
| epoch 3 |  iter 8261 / 9295 | time 2130[s] | loss 1.88
| epoch 3 |  iter 8281 / 9295 | time 2132[s] | loss 1.87
| epoch 3 |  iter 8301 / 9295 | time 2133[s] | loss 1.86
| epoch 3 |  iter 8321 / 9295 | time 2135[s] | loss 1.87
| epoch 3 |  iter 8341 / 9295 | time 2136[s] | loss 1.87
| epoch 3 |  iter 8361 / 9295 | time 2138[s] | loss 1.87
| epoch 3 |  iter 8381 / 9295 | time 2140[s] | loss 1.94
| epoch 3 |  iter 8401 / 9295 | time 2141[s] | loss 1.92
| epoch 3 |  iter 8421 / 9295 | time 2143[s] | loss 1.90
| epoch 3 |  iter 8441 / 9295 | time 2144[s] | loss 1.89
| epoch 3 |  iter 8461 / 9295 | time 2146[s] | loss 1.92
| epoch 3 |  iter 8481 / 9295 | time 2147[s] | loss 1.91
| epoch 3 |  iter 8501 / 9295 | time 2149[s] | loss 1.89
| epoch 3 |  iter 8521 / 9295 | time 2151[s] | loss 1.89
| epoch 3 |  iter 8541 / 9295 | time 2152[s] | loss 1.85
| epoch 3 |  iter 8561 / 9295 | time 2154[s] | loss 1.87
| epoch 3 |  iter 8581 / 9295 | time 2155[s] | loss 1.91
| epoch 3 |  iter 8601 / 9295 | time 2157[s] | loss 1.91
| epoch 3 |  iter 8621 / 9295 | time 2159[s] | loss 1.91
| epoch 3 |  iter 8641 / 9295 | time 2160[s] | loss 1.86
| epoch 3 |  iter 8661 / 9295 | time 2162[s] | loss 1.90
| epoch 3 |  iter 8681 / 9295 | time 2163[s] | loss 1.90
| epoch 3 |  iter 8701 / 9295 | time 2165[s] | loss 1.88
| epoch 3 |  iter 8721 / 9295 | time 2167[s] | loss 1.90
| epoch 3 |  iter 8741 / 9295 | time 2168[s] | loss 1.87
| epoch 3 |  iter 8761 / 9295 | time 2170[s] | loss 1.88
| epoch 3 |  iter 8781 / 9295 | time 2171[s] | loss 1.90
| epoch 3 |  iter 8801 / 9295 | time 2173[s] | loss 1.86
| epoch 3 |  iter 8821 / 9295 | time 2175[s] | loss 1.87
| epoch 3 |  iter 8841 / 9295 | time 2176[s] | loss 1.86
| epoch 3 |  iter 8861 / 9295 | time 2178[s] | loss 1.91
| epoch 3 |  iter 8881 / 9295 | time 2180[s] | loss 1.92
| epoch 3 |  iter 8901 / 9295 | time 2181[s] | loss 1.89
| epoch 3 |  iter 8921 / 9295 | time 2183[s] | loss 1.91
| epoch 3 |  iter 8941 / 9295 | time 2185[s] | loss 1.91
| epoch 3 |  iter 8961 / 9295 | time 2186[s] | loss 1.86
| epoch 3 |  iter 8981 / 9295 | time 2188[s] | loss 1.91
| epoch 3 |  iter 9001 / 9295 | time 2189[s] | loss 1.91
| epoch 3 |  iter 9021 / 9295 | time 2191[s] | loss 1.90
| epoch 3 |  iter 9041 / 9295 | time 2193[s] | loss 1.88
| epoch 3 |  iter 9061 / 9295 | time 2194[s] | loss 1.88
| epoch 3 |  iter 9081 / 9295 | time 2196[s] | loss 1.86
| epoch 3 |  iter 9101 / 9295 | time 2197[s] | loss 1.90
| epoch 3 |  iter 9121 / 9295 | time 2199[s] | loss 1.85
| epoch 3 |  iter 9141 / 9295 | time 2200[s] | loss 1.88
| epoch 3 |  iter 9161 / 9295 | time 2202[s] | loss 1.86
| epoch 3 |  iter 9181 / 9295 | time 2204[s] | loss 1.90
| epoch 3 |  iter 9201 / 9295 | time 2205[s] | loss 1.89
| epoch 3 |  iter 9221 / 9295 | time 2207[s] | loss 1.88
| epoch 3 |  iter 9241 / 9295 | time 2208[s] | loss 1.87
| epoch 3 |  iter 9261 / 9295 | time 2210[s] | loss 1.89
| epoch 3 |  iter 9281 / 9295 | time 2211[s] | loss 1.93
| epoch 4 |  iter 1 / 9295 | time 2213[s] | loss 1.86
| epoch 4 |  iter 21 / 9295 | time 2214[s] | loss 1.80
| epoch 4 |  iter 41 / 9295 | time 2216[s] | loss 1.84
| epoch 4 |  iter 61 / 9295 | time 2217[s] | loss 1.82
| epoch 4 |  iter 81 / 9295 | time 2219[s] | loss 1.81
| epoch 4 |  iter 101 / 9295 | time 2221[s] | loss 1.79
| epoch 4 |  iter 121 / 9295 | time 2222[s] | loss 1.82
| epoch 4 |  iter 141 / 9295 | time 2224[s] | loss 1.84
| epoch 4 |  iter 161 / 9295 | time 2225[s] | loss 1.84
| epoch 4 |  iter 181 / 9295 | time 2227[s] | loss 1.79
| epoch 4 |  iter 201 / 9295 | time 2228[s] | loss 1.80
| epoch 4 |  iter 221 / 9295 | time 2230[s] | loss 1.81
| epoch 4 |  iter 241 / 9295 | time 2232[s] | loss 1.83
| epoch 4 |  iter 261 / 9295 | time 2233[s] | loss 1.78
| epoch 4 |  iter 281 / 9295 | time 2235[s] | loss 1.83
| epoch 4 |  iter 301 / 9295 | time 2237[s] | loss 1.83
| epoch 4 |  iter 321 / 9295 | time 2238[s] | loss 1.82
| epoch 4 |  iter 341 / 9295 | time 2240[s] | loss 1.83
| epoch 4 |  iter 361 / 9295 | time 2241[s] | loss 1.80
| epoch 4 |  iter 381 / 9295 | time 2243[s] | loss 1.81
| epoch 4 |  iter 401 / 9295 | time 2245[s] | loss 1.81
| epoch 4 |  iter 421 / 9295 | time 2246[s] | loss 1.80
| epoch 4 |  iter 441 / 9295 | time 2248[s] | loss 1.82
| epoch 4 |  iter 461 / 9295 | time 2249[s] | loss 1.84
| epoch 4 |  iter 481 / 9295 | time 2251[s] | loss 1.84
| epoch 4 |  iter 501 / 9295 | time 2252[s] | loss 1.79
| epoch 4 |  iter 521 / 9295 | time 2254[s] | loss 1.83
| epoch 4 |  iter 541 / 9295 | time 2256[s] | loss 1.80
| epoch 4 |  iter 561 / 9295 | time 2257[s] | loss 1.79
| epoch 4 |  iter 581 / 9295 | time 2259[s] | loss 1.85
| epoch 4 |  iter 601 / 9295 | time 2260[s] | loss 1.84
| epoch 4 |  iter 621 / 9295 | time 2262[s] | loss 1.83
| epoch 4 |  iter 641 / 9295 | time 2264[s] | loss 1.84
| epoch 4 |  iter 661 / 9295 | time 2265[s] | loss 1.83
| epoch 4 |  iter 681 / 9295 | time 2267[s] | loss 1.80
| epoch 4 |  iter 701 / 9295 | time 2268[s] | loss 1.81
| epoch 4 |  iter 721 / 9295 | time 2270[s] | loss 1.80
| epoch 4 |  iter 741 / 9295 | time 2271[s] | loss 1.84
| epoch 4 |  iter 761 / 9295 | time 2273[s] | loss 1.82
| epoch 4 |  iter 781 / 9295 | time 2275[s] | loss 1.77
| epoch 4 |  iter 801 / 9295 | time 2276[s] | loss 1.78
| epoch 4 |  iter 821 / 9295 | time 2278[s] | loss 1.80
| epoch 4 |  iter 841 / 9295 | time 2279[s] | loss 1.78
| epoch 4 |  iter 861 / 9295 | time 2281[s] | loss 1.83
| epoch 4 |  iter 881 / 9295 | time 2282[s] | loss 1.80
| epoch 4 |  iter 901 / 9295 | time 2284[s] | loss 1.82
| epoch 4 |  iter 921 / 9295 | time 2286[s] | loss 1.83
| epoch 4 |  iter 941 / 9295 | time 2287[s] | loss 1.86
| epoch 4 |  iter 961 / 9295 | time 2289[s] | loss 1.81
| epoch 4 |  iter 981 / 9295 | time 2290[s] | loss 1.82
| epoch 4 |  iter 1001 / 9295 | time 2292[s] | loss 1.79
| epoch 4 |  iter 1021 / 9295 | time 2293[s] | loss 1.80
| epoch 4 |  iter 1041 / 9295 | time 2295[s] | loss 1.80
| epoch 4 |  iter 1061 / 9295 | time 2297[s] | loss 1.79
| epoch 4 |  iter 1081 / 9295 | time 2298[s] | loss 1.79
| epoch 4 |  iter 1101 / 9295 | time 2300[s] | loss 1.80
| epoch 4 |  iter 1121 / 9295 | time 2301[s] | loss 1.82
| epoch 4 |  iter 1141 / 9295 | time 2303[s] | loss 1.79
| epoch 4 |  iter 1161 / 9295 | time 2305[s] | loss 1.82
| epoch 4 |  iter 1181 / 9295 | time 2306[s] | loss 1.83
| epoch 4 |  iter 1201 / 9295 | time 2308[s] | loss 1.82
| epoch 4 |  iter 1221 / 9295 | time 2309[s] | loss 1.83
| epoch 4 |  iter 1241 / 9295 | time 2311[s] | loss 1.80
| epoch 4 |  iter 1261 / 9295 | time 2313[s] | loss 1.83
| epoch 4 |  iter 1281 / 9295 | time 2314[s] | loss 1.86
| epoch 4 |  iter 1301 / 9295 | time 2316[s] | loss 1.81
| epoch 4 |  iter 1321 / 9295 | time 2317[s] | loss 1.83
| epoch 4 |  iter 1341 / 9295 | time 2319[s] | loss 1.84
| epoch 4 |  iter 1361 / 9295 | time 2321[s] | loss 1.80
| epoch 4 |  iter 1381 / 9295 | time 2322[s] | loss 1.81
| epoch 4 |  iter 1401 / 9295 | time 2324[s] | loss 1.82
| epoch 4 |  iter 1421 / 9295 | time 2326[s] | loss 1.81
| epoch 4 |  iter 1441 / 9295 | time 2327[s] | loss 1.76
| epoch 4 |  iter 1461 / 9295 | time 2329[s] | loss 1.84
| epoch 4 |  iter 1481 / 9295 | time 2330[s] | loss 1.84
| epoch 4 |  iter 1501 / 9295 | time 2332[s] | loss 1.83
| epoch 4 |  iter 1521 / 9295 | time 2334[s] | loss 1.80
| epoch 4 |  iter 1541 / 9295 | time 2335[s] | loss 1.80
| epoch 4 |  iter 1561 / 9295 | time 2337[s] | loss 1.84
| epoch 4 |  iter 1581 / 9295 | time 2338[s] | loss 1.81
| epoch 4 |  iter 1601 / 9295 | time 2340[s] | loss 1.81
| epoch 4 |  iter 1621 / 9295 | time 2342[s] | loss 1.83
| epoch 4 |  iter 1641 / 9295 | time 2343[s] | loss 1.85
| epoch 4 |  iter 1661 / 9295 | time 2345[s] | loss 1.78
| epoch 4 |  iter 1681 / 9295 | time 2346[s] | loss 1.80
| epoch 4 |  iter 1701 / 9295 | time 2348[s] | loss 1.83
| epoch 4 |  iter 1721 / 9295 | time 2350[s] | loss 1.85
| epoch 4 |  iter 1741 / 9295 | time 2351[s] | loss 1.82
| epoch 4 |  iter 1761 / 9295 | time 2353[s] | loss 1.82
| epoch 4 |  iter 1781 / 9295 | time 2354[s] | loss 1.84
| epoch 4 |  iter 1801 / 9295 | time 2356[s] | loss 1.82
| epoch 4 |  iter 1821 / 9295 | time 2358[s] | loss 1.85
| epoch 4 |  iter 1841 / 9295 | time 2359[s] | loss 1.80
| epoch 4 |  iter 1861 / 9295 | time 2361[s] | loss 1.81
| epoch 4 |  iter 1881 / 9295 | time 2363[s] | loss 1.80
| epoch 4 |  iter 1901 / 9295 | time 2364[s] | loss 1.83
| epoch 4 |  iter 1921 / 9295 | time 2366[s] | loss 1.81
| epoch 4 |  iter 1941 / 9295 | time 2367[s] | loss 1.77
| epoch 4 |  iter 1961 / 9295 | time 2369[s] | loss 1.82
| epoch 4 |  iter 1981 / 9295 | time 2371[s] | loss 1.84
| epoch 4 |  iter 2001 / 9295 | time 2372[s] | loss 1.86
| epoch 4 |  iter 2021 / 9295 | time 2374[s] | loss 1.82
| epoch 4 |  iter 2041 / 9295 | time 2375[s] | loss 1.81
| epoch 4 |  iter 2061 / 9295 | time 2377[s] | loss 1.83
| epoch 4 |  iter 2081 / 9295 | time 2379[s] | loss 1.82
| epoch 4 |  iter 2101 / 9295 | time 2380[s] | loss 1.83
| epoch 4 |  iter 2121 / 9295 | time 2382[s] | loss 1.83
| epoch 4 |  iter 2141 / 9295 | time 2383[s] | loss 1.81
| epoch 4 |  iter 2161 / 9295 | time 2385[s] | loss 1.78
| epoch 4 |  iter 2181 / 9295 | time 2387[s] | loss 1.84
| epoch 4 |  iter 2201 / 9295 | time 2388[s] | loss 1.80
| epoch 4 |  iter 2221 / 9295 | time 2390[s] | loss 1.82
| epoch 4 |  iter 2241 / 9295 | time 2391[s] | loss 1.80
| epoch 4 |  iter 2261 / 9295 | time 2393[s] | loss 1.82
| epoch 4 |  iter 2281 / 9295 | time 2394[s] | loss 1.78
| epoch 4 |  iter 2301 / 9295 | time 2396[s] | loss 1.84
| epoch 4 |  iter 2321 / 9295 | time 2398[s] | loss 1.80
| epoch 4 |  iter 2341 / 9295 | time 2399[s] | loss 1.79
| epoch 4 |  iter 2361 / 9295 | time 2401[s] | loss 1.82
| epoch 4 |  iter 2381 / 9295 | time 2402[s] | loss 1.85
| epoch 4 |  iter 2401 / 9295 | time 2404[s] | loss 1.82
| epoch 4 |  iter 2421 / 9295 | time 2406[s] | loss 1.79
| epoch 4 |  iter 2441 / 9295 | time 2407[s] | loss 1.85
| epoch 4 |  iter 2461 / 9295 | time 2409[s] | loss 1.82
| epoch 4 |  iter 2481 / 9295 | time 2410[s] | loss 1.83
| epoch 4 |  iter 2501 / 9295 | time 2412[s] | loss 1.85
| epoch 4 |  iter 2521 / 9295 | time 2413[s] | loss 1.81
| epoch 4 |  iter 2541 / 9295 | time 2415[s] | loss 1.84
| epoch 4 |  iter 2561 / 9295 | time 2417[s] | loss 1.78
| epoch 4 |  iter 2581 / 9295 | time 2418[s] | loss 1.80
| epoch 4 |  iter 2601 / 9295 | time 2420[s] | loss 1.86
| epoch 4 |  iter 2621 / 9295 | time 2421[s] | loss 1.77
| epoch 4 |  iter 2641 / 9295 | time 2423[s] | loss 1.81
| epoch 4 |  iter 2661 / 9295 | time 2425[s] | loss 1.80
| epoch 4 |  iter 2681 / 9295 | time 2426[s] | loss 1.82
| epoch 4 |  iter 2701 / 9295 | time 2428[s] | loss 1.82
| epoch 4 |  iter 2721 / 9295 | time 2429[s] | loss 1.75
| epoch 4 |  iter 2741 / 9295 | time 2431[s] | loss 1.84
| epoch 4 |  iter 2761 / 9295 | time 2432[s] | loss 1.83
| epoch 4 |  iter 2781 / 9295 | time 2434[s] | loss 1.83
| epoch 4 |  iter 2801 / 9295 | time 2436[s] | loss 1.80
| epoch 4 |  iter 2821 / 9295 | time 2437[s] | loss 1.85
| epoch 4 |  iter 2841 / 9295 | time 2439[s] | loss 1.83
| epoch 4 |  iter 2861 / 9295 | time 2440[s] | loss 1.81
| epoch 4 |  iter 2881 / 9295 | time 2442[s] | loss 1.81
| epoch 4 |  iter 2901 / 9295 | time 2444[s] | loss 1.84
| epoch 4 |  iter 2921 / 9295 | time 2445[s] | loss 1.81
| epoch 4 |  iter 2941 / 9295 | time 2447[s] | loss 1.84
| epoch 4 |  iter 2961 / 9295 | time 2448[s] | loss 1.84
| epoch 4 |  iter 2981 / 9295 | time 2450[s] | loss 1.82
| epoch 4 |  iter 3001 / 9295 | time 2451[s] | loss 1.84
| epoch 4 |  iter 3021 / 9295 | time 2453[s] | loss 1.85
| epoch 4 |  iter 3041 / 9295 | time 2455[s] | loss 1.85
| epoch 4 |  iter 3061 / 9295 | time 2456[s] | loss 1.83
| epoch 4 |  iter 3081 / 9295 | time 2458[s] | loss 1.81
| epoch 4 |  iter 3101 / 9295 | time 2459[s] | loss 1.78
| epoch 4 |  iter 3121 / 9295 | time 2461[s] | loss 1.78
| epoch 4 |  iter 3141 / 9295 | time 2463[s] | loss 1.81
| epoch 4 |  iter 3161 / 9295 | time 2464[s] | loss 1.82
| epoch 4 |  iter 3181 / 9295 | time 2466[s] | loss 1.81
| epoch 4 |  iter 3201 / 9295 | time 2467[s] | loss 1.81
| epoch 4 |  iter 3221 / 9295 | time 2469[s] | loss 1.84
| epoch 4 |  iter 3241 / 9295 | time 2471[s] | loss 1.81
| epoch 4 |  iter 3261 / 9295 | time 2472[s] | loss 1.79
| epoch 4 |  iter 3281 / 9295 | time 2474[s] | loss 1.79
| epoch 4 |  iter 3301 / 9295 | time 2476[s] | loss 1.80
| epoch 4 |  iter 3321 / 9295 | time 2477[s] | loss 1.84
| epoch 4 |  iter 3341 / 9295 | time 2479[s] | loss 1.89
| epoch 4 |  iter 3361 / 9295 | time 2480[s] | loss 1.84
| epoch 4 |  iter 3381 / 9295 | time 2482[s] | loss 1.85
| epoch 4 |  iter 3401 / 9295 | time 2484[s] | loss 1.81
| epoch 4 |  iter 3421 / 9295 | time 2485[s] | loss 1.78
| epoch 4 |  iter 3441 / 9295 | time 2487[s] | loss 1.81
| epoch 4 |  iter 3461 / 9295 | time 2488[s] | loss 1.79
| epoch 4 |  iter 3481 / 9295 | time 2490[s] | loss 1.80
| epoch 4 |  iter 3501 / 9295 | time 2491[s] | loss 1.80
| epoch 4 |  iter 3521 / 9295 | time 2493[s] | loss 1.82
| epoch 4 |  iter 3541 / 9295 | time 2495[s] | loss 1.81
| epoch 4 |  iter 3561 / 9295 | time 2496[s] | loss 1.79
| epoch 4 |  iter 3581 / 9295 | time 2498[s] | loss 1.75
| epoch 4 |  iter 3601 / 9295 | time 2499[s] | loss 1.83
| epoch 4 |  iter 3621 / 9295 | time 2501[s] | loss 1.81
| epoch 4 |  iter 3641 / 9295 | time 2503[s] | loss 1.81
| epoch 4 |  iter 3661 / 9295 | time 2504[s] | loss 1.81
| epoch 4 |  iter 3681 / 9295 | time 2506[s] | loss 1.78
| epoch 4 |  iter 3701 / 9295 | time 2507[s] | loss 1.77
| epoch 4 |  iter 3721 / 9295 | time 2509[s] | loss 1.80
| epoch 4 |  iter 3741 / 9295 | time 2511[s] | loss 1.83
| epoch 4 |  iter 3761 / 9295 | time 2512[s] | loss 1.79
| epoch 4 |  iter 3781 / 9295 | time 2514[s] | loss 1.80
| epoch 4 |  iter 3801 / 9295 | time 2515[s] | loss 1.81
| epoch 4 |  iter 3821 / 9295 | time 2517[s] | loss 1.82
| epoch 4 |  iter 3841 / 9295 | time 2519[s] | loss 1.77
| epoch 4 |  iter 3861 / 9295 | time 2520[s] | loss 1.83
| epoch 4 |  iter 3881 / 9295 | time 2522[s] | loss 1.82
| epoch 4 |  iter 3901 / 9295 | time 2523[s] | loss 1.78
| epoch 4 |  iter 3921 / 9295 | time 2525[s] | loss 1.83
| epoch 4 |  iter 3941 / 9295 | time 2527[s] | loss 1.83
| epoch 4 |  iter 3961 / 9295 | time 2528[s] | loss 1.85
| epoch 4 |  iter 3981 / 9295 | time 2530[s] | loss 1.81
| epoch 4 |  iter 4001 / 9295 | time 2531[s] | loss 1.78
| epoch 4 |  iter 4021 / 9295 | time 2533[s] | loss 1.83
| epoch 4 |  iter 4041 / 9295 | time 2535[s] | loss 1.79
| epoch 4 |  iter 4061 / 9295 | time 2536[s] | loss 1.79
| epoch 4 |  iter 4081 / 9295 | time 2538[s] | loss 1.79
| epoch 4 |  iter 4101 / 9295 | time 2539[s] | loss 1.82
| epoch 4 |  iter 4121 / 9295 | time 2541[s] | loss 1.79
| epoch 4 |  iter 4141 / 9295 | time 2543[s] | loss 1.78
| epoch 4 |  iter 4161 / 9295 | time 2544[s] | loss 1.81
| epoch 4 |  iter 4181 / 9295 | time 2546[s] | loss 1.81
| epoch 4 |  iter 4201 / 9295 | time 2547[s] | loss 1.80
| epoch 4 |  iter 4221 / 9295 | time 2549[s] | loss 1.82
| epoch 4 |  iter 4241 / 9295 | time 2551[s] | loss 1.83
| epoch 4 |  iter 4261 / 9295 | time 2552[s] | loss 1.78
| epoch 4 |  iter 4281 / 9295 | time 2554[s] | loss 1.83
| epoch 4 |  iter 4301 / 9295 | time 2555[s] | loss 1.81
| epoch 4 |  iter 4321 / 9295 | time 2557[s] | loss 1.87
| epoch 4 |  iter 4341 / 9295 | time 2559[s] | loss 1.77
| epoch 4 |  iter 4361 / 9295 | time 2560[s] | loss 1.78
| epoch 4 |  iter 4381 / 9295 | time 2562[s] | loss 1.81
| epoch 4 |  iter 4401 / 9295 | time 2563[s] | loss 1.79
| epoch 4 |  iter 4421 / 9295 | time 2565[s] | loss 1.82
| epoch 4 |  iter 4441 / 9295 | time 2567[s] | loss 1.80
| epoch 4 |  iter 4461 / 9295 | time 2568[s] | loss 1.81
| epoch 4 |  iter 4481 / 9295 | time 2570[s] | loss 1.81
| epoch 4 |  iter 4501 / 9295 | time 2571[s] | loss 1.82
| epoch 4 |  iter 4521 / 9295 | time 2573[s] | loss 1.82
| epoch 4 |  iter 4541 / 9295 | time 2575[s] | loss 1.84
| epoch 4 |  iter 4561 / 9295 | time 2576[s] | loss 1.84
| epoch 4 |  iter 4581 / 9295 | time 2578[s] | loss 1.83
| epoch 4 |  iter 4601 / 9295 | time 2579[s] | loss 1.80
| epoch 4 |  iter 4621 / 9295 | time 2581[s] | loss 1.81
| epoch 4 |  iter 4641 / 9295 | time 2583[s] | loss 1.77
| epoch 4 |  iter 4661 / 9295 | time 2584[s] | loss 1.84
| epoch 4 |  iter 4681 / 9295 | time 2586[s] | loss 1.83
| epoch 4 |  iter 4701 / 9295 | time 2587[s] | loss 1.81
| epoch 4 |  iter 4721 / 9295 | time 2589[s] | loss 1.84
| epoch 4 |  iter 4741 / 9295 | time 2590[s] | loss 1.81
| epoch 4 |  iter 4761 / 9295 | time 2592[s] | loss 1.80
| epoch 4 |  iter 4781 / 9295 | time 2594[s] | loss 1.79
| epoch 4 |  iter 4801 / 9295 | time 2595[s] | loss 1.76
| epoch 4 |  iter 4821 / 9295 | time 2597[s] | loss 1.79
| epoch 4 |  iter 4841 / 9295 | time 2598[s] | loss 1.86
| epoch 4 |  iter 4861 / 9295 | time 2600[s] | loss 1.82
| epoch 4 |  iter 4881 / 9295 | time 2602[s] | loss 1.82
| epoch 4 |  iter 4901 / 9295 | time 2603[s] | loss 1.80
| epoch 4 |  iter 4921 / 9295 | time 2605[s] | loss 1.81
| epoch 4 |  iter 4941 / 9295 | time 2607[s] | loss 1.86
| epoch 4 |  iter 4961 / 9295 | time 2608[s] | loss 1.77
| epoch 4 |  iter 4981 / 9295 | time 2610[s] | loss 1.81
| epoch 4 |  iter 5001 / 9295 | time 2611[s] | loss 1.82
| epoch 4 |  iter 5021 / 9295 | time 2613[s] | loss 1.79
| epoch 4 |  iter 5041 / 9295 | time 2615[s] | loss 1.80
| epoch 4 |  iter 5061 / 9295 | time 2616[s] | loss 1.81
| epoch 4 |  iter 5081 / 9295 | time 2618[s] | loss 1.82
| epoch 4 |  iter 5101 / 9295 | time 2619[s] | loss 1.80
| epoch 4 |  iter 5121 / 9295 | time 2621[s] | loss 1.84
| epoch 4 |  iter 5141 / 9295 | time 2623[s] | loss 1.79
| epoch 4 |  iter 5161 / 9295 | time 2624[s] | loss 1.76
| epoch 4 |  iter 5181 / 9295 | time 2626[s] | loss 1.79
| epoch 4 |  iter 5201 / 9295 | time 2627[s] | loss 1.83
| epoch 4 |  iter 5221 / 9295 | time 2629[s] | loss 1.80
| epoch 4 |  iter 5241 / 9295 | time 2630[s] | loss 1.78
| epoch 4 |  iter 5261 / 9295 | time 2632[s] | loss 1.82
| epoch 4 |  iter 5281 / 9295 | time 2634[s] | loss 1.82
| epoch 4 |  iter 5301 / 9295 | time 2635[s] | loss 1.77
| epoch 4 |  iter 5321 / 9295 | time 2637[s] | loss 1.76
| epoch 4 |  iter 5341 / 9295 | time 2639[s] | loss 1.79
| epoch 4 |  iter 5361 / 9295 | time 2640[s] | loss 1.82
| epoch 4 |  iter 5381 / 9295 | time 2642[s] | loss 1.81
| epoch 4 |  iter 5401 / 9295 | time 2643[s] | loss 1.77
| epoch 4 |  iter 5421 / 9295 | time 2645[s] | loss 1.78
| epoch 4 |  iter 5441 / 9295 | time 2646[s] | loss 1.79
| epoch 4 |  iter 5461 / 9295 | time 2648[s] | loss 1.81
| epoch 4 |  iter 5481 / 9295 | time 2650[s] | loss 1.83
| epoch 4 |  iter 5501 / 9295 | time 2651[s] | loss 1.78
| epoch 4 |  iter 5521 / 9295 | time 2653[s] | loss 1.78
| epoch 4 |  iter 5541 / 9295 | time 2654[s] | loss 1.78
| epoch 4 |  iter 5561 / 9295 | time 2656[s] | loss 1.81
| epoch 4 |  iter 5581 / 9295 | time 2658[s] | loss 1.82
| epoch 4 |  iter 5601 / 9295 | time 2659[s] | loss 1.78
| epoch 4 |  iter 5621 / 9295 | time 2661[s] | loss 1.80
| epoch 4 |  iter 5641 / 9295 | time 2662[s] | loss 1.79
| epoch 4 |  iter 5661 / 9295 | time 2664[s] | loss 1.78
| epoch 4 |  iter 5681 / 9295 | time 2665[s] | loss 1.80
| epoch 4 |  iter 5701 / 9295 | time 2667[s] | loss 1.81
| epoch 4 |  iter 5721 / 9295 | time 2669[s] | loss 1.80
| epoch 4 |  iter 5741 / 9295 | time 2670[s] | loss 1.82
| epoch 4 |  iter 5761 / 9295 | time 2672[s] | loss 1.86
| epoch 4 |  iter 5781 / 9295 | time 2673[s] | loss 1.81
| epoch 4 |  iter 5801 / 9295 | time 2675[s] | loss 1.83
| epoch 4 |  iter 5821 / 9295 | time 2676[s] | loss 1.81
| epoch 4 |  iter 5841 / 9295 | time 2678[s] | loss 1.82
| epoch 4 |  iter 5861 / 9295 | time 2680[s] | loss 1.78
| epoch 4 |  iter 5881 / 9295 | time 2681[s] | loss 1.82
| epoch 4 |  iter 5901 / 9295 | time 2683[s] | loss 1.82
| epoch 4 |  iter 5921 / 9295 | time 2684[s] | loss 1.84
| epoch 4 |  iter 5941 / 9295 | time 2686[s] | loss 1.81
| epoch 4 |  iter 5961 / 9295 | time 2687[s] | loss 1.80
| epoch 4 |  iter 5981 / 9295 | time 2689[s] | loss 1.80
| epoch 4 |  iter 6001 / 9295 | time 2691[s] | loss 1.79
| epoch 4 |  iter 6021 / 9295 | time 2692[s] | loss 1.82
| epoch 4 |  iter 6041 / 9295 | time 2694[s] | loss 1.84
| epoch 4 |  iter 6061 / 9295 | time 2695[s] | loss 1.78
| epoch 4 |  iter 6081 / 9295 | time 2697[s] | loss 1.82
| epoch 4 |  iter 6101 / 9295 | time 2699[s] | loss 1.81
| epoch 4 |  iter 6121 / 9295 | time 2700[s] | loss 1.79
| epoch 4 |  iter 6141 / 9295 | time 2702[s] | loss 1.82
| epoch 4 |  iter 6161 / 9295 | time 2703[s] | loss 1.79
| epoch 4 |  iter 6181 / 9295 | time 2705[s] | loss 1.81
| epoch 4 |  iter 6201 / 9295 | time 2706[s] | loss 1.80
| epoch 4 |  iter 6221 / 9295 | time 2708[s] | loss 1.78
| epoch 4 |  iter 6241 / 9295 | time 2710[s] | loss 1.78
| epoch 4 |  iter 6261 / 9295 | time 2711[s] | loss 1.83
| epoch 4 |  iter 6281 / 9295 | time 2713[s] | loss 1.77
| epoch 4 |  iter 6301 / 9295 | time 2714[s] | loss 1.77
| epoch 4 |  iter 6321 / 9295 | time 2716[s] | loss 1.81
| epoch 4 |  iter 6341 / 9295 | time 2718[s] | loss 1.80
| epoch 4 |  iter 6361 / 9295 | time 2719[s] | loss 1.84
| epoch 4 |  iter 6381 / 9295 | time 2721[s] | loss 1.84
| epoch 4 |  iter 6401 / 9295 | time 2722[s] | loss 1.81
| epoch 4 |  iter 6421 / 9295 | time 2724[s] | loss 1.78
| epoch 4 |  iter 6441 / 9295 | time 2725[s] | loss 1.79
| epoch 4 |  iter 6461 / 9295 | time 2727[s] | loss 1.82
| epoch 4 |  iter 6481 / 9295 | time 2729[s] | loss 1.80
| epoch 4 |  iter 6501 / 9295 | time 2730[s] | loss 1.80
| epoch 4 |  iter 6521 / 9295 | time 2732[s] | loss 1.84
| epoch 4 |  iter 6541 / 9295 | time 2733[s] | loss 1.82
| epoch 4 |  iter 6561 / 9295 | time 2735[s] | loss 1.78
| epoch 4 |  iter 6581 / 9295 | time 2736[s] | loss 1.82
| epoch 4 |  iter 6601 / 9295 | time 2738[s] | loss 1.84
| epoch 4 |  iter 6621 / 9295 | time 2740[s] | loss 1.84
| epoch 4 |  iter 6641 / 9295 | time 2741[s] | loss 1.79
| epoch 4 |  iter 6661 / 9295 | time 2743[s] | loss 1.83
| epoch 4 |  iter 6681 / 9295 | time 2744[s] | loss 1.80
| epoch 4 |  iter 6701 / 9295 | time 2746[s] | loss 1.83
| epoch 4 |  iter 6721 / 9295 | time 2747[s] | loss 1.77
| epoch 4 |  iter 6741 / 9295 | time 2749[s] | loss 1.80
| epoch 4 |  iter 6761 / 9295 | time 2751[s] | loss 1.77
| epoch 4 |  iter 6781 / 9295 | time 2752[s] | loss 1.79
| epoch 4 |  iter 6801 / 9295 | time 2754[s] | loss 1.82
| epoch 4 |  iter 6821 / 9295 | time 2755[s] | loss 1.79
| epoch 4 |  iter 6841 / 9295 | time 2757[s] | loss 1.83
| epoch 4 |  iter 6861 / 9295 | time 2758[s] | loss 1.79
| epoch 4 |  iter 6881 / 9295 | time 2760[s] | loss 1.81
| epoch 4 |  iter 6901 / 9295 | time 2762[s] | loss 1.80
| epoch 4 |  iter 6921 / 9295 | time 2763[s] | loss 1.79
| epoch 4 |  iter 6941 / 9295 | time 2765[s] | loss 1.79
| epoch 4 |  iter 6961 / 9295 | time 2766[s] | loss 1.79
| epoch 4 |  iter 6981 / 9295 | time 2768[s] | loss 1.79
| epoch 4 |  iter 7001 / 9295 | time 2769[s] | loss 1.81
| epoch 4 |  iter 7021 / 9295 | time 2771[s] | loss 1.78
| epoch 4 |  iter 7041 / 9295 | time 2773[s] | loss 1.79
| epoch 4 |  iter 7061 / 9295 | time 2774[s] | loss 1.82
| epoch 4 |  iter 7081 / 9295 | time 2776[s] | loss 1.80
| epoch 4 |  iter 7101 / 9295 | time 2777[s] | loss 1.80
| epoch 4 |  iter 7121 / 9295 | time 2779[s] | loss 1.80
| epoch 4 |  iter 7141 / 9295 | time 2781[s] | loss 1.77
| epoch 4 |  iter 7161 / 9295 | time 2782[s] | loss 1.84
| epoch 4 |  iter 7181 / 9295 | time 2784[s] | loss 1.81
| epoch 4 |  iter 7201 / 9295 | time 2785[s] | loss 1.82
| epoch 4 |  iter 7221 / 9295 | time 2787[s] | loss 1.81
| epoch 4 |  iter 7241 / 9295 | time 2788[s] | loss 1.77
| epoch 4 |  iter 7261 / 9295 | time 2790[s] | loss 1.83
| epoch 4 |  iter 7281 / 9295 | time 2792[s] | loss 1.81
| epoch 4 |  iter 7301 / 9295 | time 2793[s] | loss 1.79
| epoch 4 |  iter 7321 / 9295 | time 2795[s] | loss 1.86
| epoch 4 |  iter 7341 / 9295 | time 2796[s] | loss 1.79
| epoch 4 |  iter 7361 / 9295 | time 2798[s] | loss 1.77
| epoch 4 |  iter 7381 / 9295 | time 2800[s] | loss 1.80
| epoch 4 |  iter 7401 / 9295 | time 2801[s] | loss 1.81
| epoch 4 |  iter 7421 / 9295 | time 2803[s] | loss 1.79
| epoch 4 |  iter 7441 / 9295 | time 2804[s] | loss 1.81
| epoch 4 |  iter 7461 / 9295 | time 2806[s] | loss 1.80
| epoch 4 |  iter 7481 / 9295 | time 2807[s] | loss 1.81
| epoch 4 |  iter 7501 / 9295 | time 2809[s] | loss 1.81
| epoch 4 |  iter 7521 / 9295 | time 2810[s] | loss 1.79
| epoch 4 |  iter 7541 / 9295 | time 2812[s] | loss 1.84
| epoch 4 |  iter 7561 / 9295 | time 2814[s] | loss 1.80
| epoch 4 |  iter 7581 / 9295 | time 2815[s] | loss 1.83
| epoch 4 |  iter 7601 / 9295 | time 2817[s] | loss 1.82
| epoch 4 |  iter 7621 / 9295 | time 2818[s] | loss 1.81
| epoch 4 |  iter 7641 / 9295 | time 2820[s] | loss 1.80
| epoch 4 |  iter 7661 / 9295 | time 2821[s] | loss 1.80
| epoch 4 |  iter 7681 / 9295 | time 2823[s] | loss 1.80
| epoch 4 |  iter 7701 / 9295 | time 2825[s] | loss 1.79
| epoch 4 |  iter 7721 / 9295 | time 2826[s] | loss 1.82
| epoch 4 |  iter 7741 / 9295 | time 2828[s] | loss 1.79
| epoch 4 |  iter 7761 / 9295 | time 2829[s] | loss 1.81
| epoch 4 |  iter 7781 / 9295 | time 2831[s] | loss 1.79
| epoch 4 |  iter 7801 / 9295 | time 2832[s] | loss 1.79
| epoch 4 |  iter 7821 / 9295 | time 2834[s] | loss 1.81
| epoch 4 |  iter 7841 / 9295 | time 2836[s] | loss 1.82
| epoch 4 |  iter 7861 / 9295 | time 2837[s] | loss 1.80
| epoch 4 |  iter 7881 / 9295 | time 2839[s] | loss 1.84
| epoch 4 |  iter 7901 / 9295 | time 2840[s] | loss 1.83
| epoch 4 |  iter 7921 / 9295 | time 2842[s] | loss 1.79
| epoch 4 |  iter 7941 / 9295 | time 2844[s] | loss 1.80
| epoch 4 |  iter 7961 / 9295 | time 2845[s] | loss 1.80
| epoch 4 |  iter 7981 / 9295 | time 2847[s] | loss 1.81
| epoch 4 |  iter 8001 / 9295 | time 2848[s] | loss 1.80
| epoch 4 |  iter 8021 / 9295 | time 2850[s] | loss 1.79
| epoch 4 |  iter 8041 / 9295 | time 2852[s] | loss 1.82
| epoch 4 |  iter 8061 / 9295 | time 2853[s] | loss 1.82
| epoch 4 |  iter 8081 / 9295 | time 2855[s] | loss 1.79
| epoch 4 |  iter 8101 / 9295 | time 2856[s] | loss 1.84
| epoch 4 |  iter 8121 / 9295 | time 2858[s] | loss 1.75
| epoch 4 |  iter 8141 / 9295 | time 2860[s] | loss 1.78
| epoch 4 |  iter 8161 / 9295 | time 2861[s] | loss 1.80
| epoch 4 |  iter 8181 / 9295 | time 2863[s] | loss 1.78
| epoch 4 |  iter 8201 / 9295 | time 2864[s] | loss 1.78
| epoch 4 |  iter 8221 / 9295 | time 2866[s] | loss 1.77
| epoch 4 |  iter 8241 / 9295 | time 2868[s] | loss 1.78
| epoch 4 |  iter 8261 / 9295 | time 2869[s] | loss 1.82
| epoch 4 |  iter 8281 / 9295 | time 2871[s] | loss 1.80
| epoch 4 |  iter 8301 / 9295 | time 2872[s] | loss 1.80
| epoch 4 |  iter 8321 / 9295 | time 2874[s] | loss 1.77
| epoch 4 |  iter 8341 / 9295 | time 2876[s] | loss 1.79

Plotting the loss

%python
plt.plot(trainer.loss_list)
z.show(plt, format='svg')

Training took about 2 hours on a CPU.

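pickle is part of the Python standard library, so it does not need to be installed with pip. The install attempt in this cell fails, and that is harmless.

%sh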
pip3 install pickle
Collecting pickle
  Could not find a version that satisfies the requirement pickle (from versions: )
No matching distribution found for pickle
You are using pip version 8.1.1, however version 18.0 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.

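%python
import pickle

# Save the word vectors together with both vocabulary mappings.
# The vectors are cast to float16 to keep the file small.
params = {}
params['word_vecs'] = word_vecs.astype(np.float16)
params['word_to_id'] = word_to_id
params['id_to_word'] = id_to_word
pkl_file = 'cbow_params.pkl'
with open(pkl_file, 'wb') as f:
    pickle.dump(params, f, pickle.HIGHEST_PROTOCOL)  # equivalent to protocol -1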

%sh
ls -lha *.pkl
-rw-r--r-- 1 root root 2.2M Sep 30 05:52 cbow_params.pkl
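
As a rough check on what the float16 cast buys, the sketch below pickles a random matrix as float32 and as float16 and compares the sizes. The shape `(10000, 100)` and the names `demo_vecs`, `size32` and `size16` are assumptions made up for this illustration; they are not taken from the training run above.

```python
import pickle

import numpy as np

# Hypothetical stand-in for the trained word vectors (shape is an assumption)
rng = np.random.RandomState(0)
demo_vecs = rng.randn(10000, 100).astype(np.float32)

# Serialize both precisions in memory and compare the byte counts
size32 = len(pickle.dumps(demo_vecs, pickle.HIGHEST_PROTOCOL))
size16 = len(pickle.dumps(demo_vecs.astype(np.float16), pickle.HIGHEST_PROTOCOL))
print('float32: %d bytes, float16: %d bytes' % (size32, size16))

# The price of the smaller file is a small rounding error per element
max_err = np.abs(demo_vecs - demo_vecs.astype(np.float16).astype(np.float32)).max()
print('max rounding error: %g' % max_err)
```

The float16 copy is about half the size. Whether the rounding error matters depends on how the vectors are used afterwards.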

CBOW model: evaluation

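%python
def cos_similarity(x, y, eps=1e-8):
    '''Compute the cosine similarity of two vectors.
    :param x: vector
    :param y: vector
    :param eps: small value that prevents division by zero
    :return: cosine similarity of x and y
    '''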
    nx = x / (np.sqrt(np.sum(x ** 2)) + eps)
    ny = y / (np.sqrt(np.sum(y ** 2)) + eps)
    return np.dot(nx, ny)

def most_similar(query, word_to_id, id_to_word, word_matrix, top=5):
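    '''Search for the words most similar to the query.
    :param query: query word (text)
    :param word_to_id: dictionary mapping words to word IDs
    :param id_to_word: dictionary mapping word IDs to words
    :param word_matrix: matrix of word vectors; row i is assumed to hold the vector of the word with ID i
    :param top: how many of the top-ranked words to print
    '''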
    if query not in word_to_id:
        print('%s is not found' % query)
        return

    print('\n[query] ' + query)
    query_id = word_to_id[query]
    query_vec = word_matrix[query_id]

    vocab_size = len(id_to_word)

    similarity = np.zeros(vocab_size)
    for i in range(vocab_size):
        similarity[i] = cos_similarity(word_matrix[i], query_vec)

    count = 0
    for i in (-1 * similarity).argsort():
        if id_to_word[i] == query:
            continue
        print(' %s: %s' % (id_to_word[i], similarity[i]))

        count += 1
        if count >= top:
            return
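
The loop in `most_similar` calls `cos_similarity` once per vocabulary word. The same ranking can be computed with a single matrix-vector product, as in the sketch below. `most_similar_vectorized` and the `toy_*` data are hypothetical names introduced only for this example. Unlike `most_similar`, it returns the results instead of printing them.

```python
import numpy as np

def most_similar_vectorized(query, word_to_id, id_to_word, word_matrix, top=5, eps=1e-8):
    """Return the `top` (word, cosine similarity) pairs closest to `query`."""
    if query not in word_to_id:
        return []
    # Compute in float32 so that float16 inputs do not lose precision in the norms
    m = word_matrix.astype(np.float32)
    norms = np.sqrt((m ** 2).sum(axis=1)) + eps
    unit = m / norms[:, np.newaxis]              # every row scaled to unit length
    similarity = unit @ unit[word_to_id[query]]  # all cosine similarities at once

    results = []
    for i in (-similarity).argsort():
        if id_to_word[i] == query:
            continue
        results.append((id_to_word[i], float(similarity[i])))
        if len(results) >= top:
            break
    return results

# Toy vocabulary and vectors, made up for illustration
toy_words = ['you', 'we', 'car']
toy_word_to_id = {w: i for i, w in enumerate(toy_words)}
toy_id_to_word = {i: w for i, w in enumerate(toy_words)}
toy_matrix = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])

toy_result = most_similar_vectorized('you', toy_word_to_id, toy_id_to_word, toy_matrix, top=2)
print(toy_result)
```

On the toy data, 'we' ranks above 'car' because its vector points in nearly the same direction as the vector for 'you'.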

%python
with open(pkl_file, 'rb') as f:
    params = pickle.load(f)
    word_vecs = params['word_vecs']
    word_to_id = params['word_to_id']
    id_to_word = params['id_to_word']

word2vec: finding similar words

%python
most_similar('you', word_to_id, id_to_word, word_vecs, top=5)
[query] you
 we: 0.72802734375
 i: 0.68896484375
 your: 0.65234375
 they: 0.61474609375
 someone: 0.5908203125

%python
most_similar('year', word_to_id, id_to_word, word_vecs, top=5)
[query] year
 month: 0.84326171875
 summer: 0.7578125
 week: 0.75537109375
 spring: 0.75341796875
 decade: 0.66552734375

%python
most_similar('car', word_to_id, id_to_word, word_vecs, top=5)
[query] car
 window: 0.62939453125
 luxury: 0.611328125
 truck: 0.6044921875
 cars: 0.58642578125
 auto: 0.5576171875

%python
most_similar('toyota', word_to_id, id_to_word, word_vecs, top=5)
[query] toyota
 seita: 0.64990234375
 nissan: 0.6376953125
 minicomputers: 0.63720703125
 honda: 0.6328125
 coated: 0.61767578125

word2vec: solving analogy problems

%python
def analogy(a, b, c, word_to_id, id_to_word, word_matrix, top=5, answer=None):
    for word in (a, b, c):
        if word not in word_to_id:
            print('%s is not found' % word)
            return

    print('\n[analogy] ' + a + ':' + b + ' = ' + c + ':?')
    a_vec, b_vec, c_vec = word_matrix[word_to_id[a]], word_matrix[word_to_id[b]], word_matrix[word_to_id[c]]
    query_vec = b_vec - a_vec + c_vec
    query_vec = normalize(query_vec)

    similarity = np.dot(word_matrix, query_vec)

    if answer is not None:
        print("==>" + answer + ":" + str(np.dot(word_matrix[word_to_id[answer]], query_vec)))

    count = 0
    for i in (-1 * similarity).argsort():
        if np.isnan(similarity[i]):
            continue
        if id_to_word[i] in (a, b, c):
            continue
        print(' {0}: {1}'.format(id_to_word[i], similarity[i]))

        count += 1
        if count >= top:
            return


def normalize(x):
    if x.ndim == 2:
        s = np.sqrt((x * x).sum(1))
        x /= s.reshape((s.shape[0], 1))
    elif x.ndim == 1:
        s = np.sqrt((x * x).sum())
        x /= s
    return x

The query vector is built by subtracting, adding, and then normalizing: b - a + c, scaled to unit length.
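
The arithmetic behind the analogy query can be followed by hand on a tiny example. The 2-D vectors below are invented for illustration (axis 0 loosely stands for "royalty", axis 1 for "gender"), and `toy_vecs` and `toy_analogy` are hypothetical names. Trained vectors are far less tidy than this.

```python
import numpy as np

# Hand-made 2-D vectors; these are assumptions, not trained values
toy_vecs = {
    'king':  np.array([1.0,  1.0]),
    'man':   np.array([0.0,  1.0]),
    'queen': np.array([1.0, -1.0]),
    'woman': np.array([0.0, -1.0]),
    'apple': np.array([-1.0, 0.2]),
}

def toy_analogy(a, b, c, vecs):
    """Solve a:b = c:? by scoring every other word against b - a + c."""
    query = vecs[b] - vecs[a] + vecs[c]
    query = query / np.linalg.norm(query)  # scale the query to unit length
    scores = {w: float(np.dot(v, query))
              for w, v in vecs.items() if w not in (a, b, c)}
    best = max(scores, key=scores.get)
    return best, scores

best, scores = toy_analogy('king', 'man', 'queen', toy_vecs)
print(best, scores)
```

Here man - king + queen equals (0, -1), which is exactly the toy vector for 'woman', so 'woman' gets the highest score.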

%python
analogy('king', 'man', 'queen', word_to_id, id_to_word, word_vecs)
[analogy] king:man = queen:?
 woman: 5.36328125
 mother: 4.9296875
 a.m: 4.734375
 naczelnik: 4.6875
 father: 4.6640625

%python
analogy('car', 'cars', 'child', word_to_id, id_to_word, word_vecs)
[analogy] car:cars = child:?
 a.m: 6.3046875
 rape: 5.6015625
 daffynition: 5.125
 children: 5.12109375
 incest: 5.0859375

%python
analogy('good', 'better', 'bad', word_to_id, id_to_word, word_vecs)
[analogy] good:better = bad:?
 more: 5.75
 rather: 5.6796875
 less: 5.66796875
 greater: 4.58984375
 faster: 3.826171875
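
The scores printed by `analogy` exceed 1, so they are not cosine similarities. Only the query vector is normalized, so each score is a dot product that still carries the length of the word's own vector. The sketch below shows the difference on made-up rows. `unit_rows` and `toy_rows` are hypothetical names, and normalizing the rows first is one possible variation, not what the code above does.

```python
import numpy as np

def unit_rows(matrix, eps=1e-8):
    """Return a float32 copy of `matrix` with every row scaled to unit length."""
    m = matrix.astype(np.float32)
    norms = np.sqrt((m * m).sum(axis=1, keepdims=True))
    return m / (norms + eps)

# Made-up rows with very different lengths, stored as float16 like the saved vectors
toy_rows = np.array([[3.0, 4.0], [0.5, 0.0], [-6.0, -8.0]], dtype=np.float16)
query = np.array([0.6, 0.8], dtype=np.float32)  # already unit length

raw = toy_rows.astype(np.float32) @ query  # dot products, scaled by each row's length
cosine = unit_rows(toy_rows) @ query       # cosine similarities, bounded by 1 in magnitude
print(raw)
print(cosine)
```

With normalized rows the ranking can change, because long vectors no longer get a boost from their length alone.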

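  • A window size of 2 to 10 and a hidden layer of about 50 to 500 neurons reportedly give good results
  • Distributed representations can be used for transfer learning
  • With an RNN, a sentence can be converted into a fixed-length vector while making use of the distributed representations of its words (Doc2Vec?)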

%md