MNISTの次元削減

MNISTでは主成分分析，t-SNE，UMAPでMNIST訓練データ6万文字の次元削減を試し，結果をカラーでプロットした。しかし，結果のプロットは手書き文字そのもので行うほうがわかりやすい。以下では6万文字を次元削減した上で，全体の約1/50をランダムに選び，元の手書き文字そのものでプロットした。

PCA:

t-SNE:

UMAP:

プログラムはちょっと強引で，もっとエレガントに書けそうだけれど，とりあえず：

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
import tensorflow as tf

rng = np.random.default_rng()

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()

x = x_train.reshape(60000, 28*28)
y = PCA(n_components=2).fit_transform(x)

def toimg(nx, ny, ax, ay):
    img = np.zeros((ny,nx), dtype='uint8')
    x1 = min(ax)
    x2 = max(ax)
    y1 = min(ay)
    y2 = max(ay)
    hx = (nx - 28) / (x2 - x1)
    hy = (ny - 28) / (y2 - y1)
    h = min(hx, hy)
    for k in range(60000):
        if rng.random() < 0.02:
            i1 = int(h * (ax[k] - x1))
            j1 = int(h * (y2 - ay[k]))
            for i in range(28):
                for j in range(28):
                    img[j1+j, i1+i] = min(int(img[j1+j, i1+i]) + int(x_train[k][j,i]), 255)
    return img

img = toimg(1024, 1024, y[:,0], y[:,1])
plt.imshow(img, cmap='gray_r')
plt.imsave("mnistpca.png", img, cmap='gray_r')

t-SNE，UMAPも同様。

追記: Deep TDA というのもあるらしい。Why you should use Topological Data Analysis over t-SNE or UMAP?