ColossalAI: Making Large AI Models Cheaper, Faster and More Accessible



Features

Colossal-AI provides a collection of parallel components for you. We aim to let you write your distributed deep learning models just as you would write a model on your laptop. We provide user-friendly tools to kickstart distributed training and inference in a few lines.

Parallelism strategies

Heterogeneous Memory Management

Friendly Usage

  • Parallelism based on configuration file (see the sketch below)
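
For a flavor of the configuration-file approach, a training run can be described in a plain Python config module. The field names below follow the legacy config-file API and are a sketch, not a definitive reference; adjust them for your Colossal-AI version.

# config.py -- a minimal sketch of a Colossal-AI configuration file
from colossalai.amp import AMP_TYPE

fp16 = dict(mode=AMP_TYPE.TORCH)      # mixed-precision training via torch.cuda.amp
gradient_accumulation = 4             # accumulate gradients over 4 micro-batches
clip_grad_norm = 1.0                  # gradient clipping threshold
parallel = dict(
    pipeline=2,                       # 2-stage pipeline parallelism
    tensor=dict(size=4, mode='2d'),   # 4-way 2D tensor parallelism
)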

Inference

Parallel Training Demo

GPT-3

  • Save 50% GPU resources, and 10.7% acceleration

GPT-2

  • 11x lower GPU memory consumption, and superlinear scaling efficiency with Tensor Parallelism
  • 24x larger model size on the same hardware
  • Over 3x acceleration

BERT

  • 2x faster training, or 50% longer sequence length

PaLM

OPT

  • Open Pretrained Transformer (OPT) is a 175-billion-parameter AI language model released by Meta. Because its pretrained model weights are public, it encourages AI programmers to build various downstream tasks and application deployments.
  • 45% speedup when fine-tuning OPT at low cost, in just a few lines of code. [Example] [Online Serving]

Please visit our documentation and examples for more details.

ViT

  • 14x larger batch size, and 5x faster training with Tensor Parallelism = 64

Recommendation System Models

  • Cached Embedding: uses a software cache to train larger embedding tables with a smaller GPU memory budget.

Single GPU Training Demo

GPT-2

  • 20x larger model size on the same hardware

  • 120x larger model size on the same hardware (RTX 3080)

PaLM

  • 34x larger model size on the same hardware

Inference (Energon-AI) Demo

  • Energon-AI: 50% inference acceleration on the same hardware

  • OPT Serving: Try 175-billion-parameter OPT online services

  • BLOOM: Reduce hardware deployment costs of 176-billion-parameter BLOOM by more than 10 times.

Colossal-AI in the Real World

ColossalChat

 

ColossalChat: An open-source solution for cloning ChatGPT with a complete RLHF pipeline. [code] [blog] [demo]

  • Up to 7.73 times faster for single server training and 1.42 times faster for single-GPU inference

  • Up to 10.3x growth in model capacity on one GPU
  • A mini demo training process requires only 1.62GB of GPU memory (any consumer-grade GPU)

  • Increases fine-tuning model capacity by up to 3.7x on a single GPU
  • Maintains a sufficiently high running speed

AIGC

Acceleration of AIGC (AI-Generated Content) models such as Stable Diffusion v1 and Stable Diffusion v2.

  • Training: Reduce Stable Diffusion memory consumption by up to 5.6x and hardware cost by up to 46x (from A100 to RTX3060).

  • Inference: Reduce inference GPU memory consumption by 2.5x.

Biomedicine

Acceleration of AlphaFold Protein Structure

  • FastFold: accelerates training and inference on GPU clusters, speeds up data processing, and supports inference on sequences containing more than 10,000 residues.

  • xTrimoMultimer: accelerates structure prediction of protein monomers and multimers by 11x.

Installation

Requirements:

  • PyTorch >= 1.11 (PyTorch 2.x in progress)
  • Python >= 3.7
  • CUDA >= 11.0

If you encounter any problems during installation, you may want to raise an issue in this repository.

Install from PyPI

You can easily install Colossal-AI with the following command. By default, we do not build PyTorch extensions during installation.

pip install colossalai

Note: only Linux is supported for now.

However, if you want to build the PyTorch extensions during installation, you can set CUDA_EXT=1.

CUDA_EXT=1 pip install colossalai

Otherwise, CUDA kernels will be built at runtime when they are actually needed.

We also release a nightly version to PyPI on a weekly basis. This allows you to access unreleased features and bug fixes from the main branch. You can install it via

pip install colossalai-nightly

Download From Source

This version of Colossal-AI tracks the main branch of the repository. Feel free to raise an issue if you encounter any problems. :)

git clone https://github.com/hpcaitech/ColossalAI.git
cd ColossalAI

# install colossalai
pip install .

By default, we do not compile CUDA/C++ kernels; Colossal-AI will build them at runtime. If you want to install and enable CUDA kernel fusion (required when using fused optimizers):

CUDA_EXT=1 pip install .

Use Docker

Pull from DockerHub

You can directly pull the docker image from our DockerHub page. The image is automatically uploaded upon release.

Build On Your Own

Run the following command to build a docker image from the provided Dockerfile.

Building Colossal-AI from scratch requires GPU support; you need to use the NVIDIA Docker Runtime as the default when running docker build. More details can be found here. We recommend installing Colossal-AI from our project page directly.

cd ColossalAI
docker build -t colossalai ./docker

Run the following command to start the docker container in interactive mode.

docker run -ti --gpus all --rm --ipc=host colossalai bash

Community

Join the Colossal-AI community on Forum, Slack, and WeChat(微信) to share your suggestions, feedback, and questions with our engineering team.

Contributing

Following the successful examples of BLOOM and Stable Diffusion, all developers and partners with computing power, datasets, or models are welcome to join and build the Colossal-AI community, making efforts towards the era of big AI models!

You may contact us or participate in the following ways:

  1. Leaving a Star ⭐ to show your support. Thanks!
  2. Posting an issue or submitting a PR on GitHub, following the guidelines in Contributing
  3. Sending your official proposal to contact@hpcaitech.com

Thanks so much to all of our amazing contributors!

CI/CD

We leverage the power of GitHub Actions to automate our development, release and deployment workflows. Please check out this documentation on how the automated workflows are operated.

Cite Us

This project is inspired by some related projects (some by our team and some by other organizations). We would like to credit these amazing projects as listed in the Reference List.

To cite this project, you can use the following BibTeX citation.

@article{bian2021colossal,
  title={Colossal-AI: A Unified Deep Learning System For Large-Scale Parallel Training},
  author={Bian, Zhengda and Liu, Hongxin and Wang, Boxiang and Huang, Haichen and Li, Yongbin and Wang, Chuanrui and Cui, Fan and You, Yang},
  journal={arXiv preprint arXiv:2110.14883},
  year={2021}
}

Colossal-AI has been accepted as official tutorials at top conferences such as SC, AAAI, PPoPP, CVPR, and ISC.


Why Colossal-AI

 

Prof. James Demmel (UC Berkeley): Colossal-AI makes training AI models efficient, easy, and scalable.


Download Details:

Author: hpcaitech
Source Code: https://github.com/hpcaitech/ColossalAI 
License: Apache-2.0 license

#AI #deeplearning #big #model #data #parallelism 

ColossalAI: Making Large AI Models Cheaper, Faster and More Accessible

ML-Notebooks: Machine Learning Notebooks

🐙 Machine Learning Notebooks

This repo contains machine learning notebooks for different tasks and applications. The notebooks are meant to be minimal, easily reusable, and extendable. You are free to use them for educational and research purposes.

This repo supports Codespaces!

  • Spin up a new instance by clicking the green "<> Code" button, followed by the "Configure and create codespace" option. Make sure to select the dev container config provided with this repo. This sets up an environment with all the dependencies installed and ready to go.
  • Once the codespace is fully running, you can install all the libraries you need to run the notebooks under the /notebooks folder. Open a terminal and simply run conda create --name myenv --file spec-file.txt to install all the Python libraries, including PyTorch.
  • Activate your environment with conda activate myenv. You might need to run conda init zsh (or the equivalent for your shell) and then close and reopen the terminal.
  • Finally, verify that everything works by opening a notebook such as /notebooks/bow.ipynb.

Getting Started

  • Introduction to Computational Graphs: A basic tutorial to learn about computational graphs
  • PyTorch Hello World!: Build a simple neural network and train it
  • A Gentle Introduction to PyTorch: A detailed explanation introducing PyTorch concepts
  • Counterfactual Explanations: A basic tutorial to learn about counterfactual explanations for explainable AI
  • Linear Regression from Scratch: An implementation of linear regression from scratch using stochastic gradient descent
  • Logistic Regression from Scratch: An implementation of logistic regression from scratch
  • Concise Logistic Regression: A concise implementation of a logistic regression model for binary image classification
  • First Neural Network - Image Classifier: Build a minimal image classifier using MNIST
  • Neural Network from Scratch: An implementation of a simple neural network from scratch
  • Introduction to GNNs: An introduction to Graph Neural Networks; applies a basic GCN to the Cora dataset for node classification

NLP

  • Bag of Words Text Classifier: Build a simple bag of words text classifier
  • Continuous Bag of Words (CBOW) Text Classifier: Build a continuous bag of words text classifier
  • Deep Continuous Bag of Words (Deep CBOW) Text Classifier: Build a deep continuous bag of words text classifier
  • Text Data Augmentation: An introduction to the most commonly used data augmentation techniques for text and their implementation
  • Emotion Classification with Fine-tuned BERT: Emotion classification using a fine-tuned BERT model

Transformers

  • Text Classification using Transformer: An implementation of Attention Mechanism and Positional Embeddings on a text classification task (Kaggle)
  • Neural Machine Translation using Transformer: An implementation of Transformer to translate human readable dates in any format to YYYY-MM-DD format (Kaggle)
  • Feature Tokenizer Transformer: An implementation of Feature Tokenizer Transformer on a classification task (Kaggle)
  • Named Entity Recognition using Transformer: An implementation of Transformer to perform token classification and identify species in PubMed abstracts (Kaggle)
  • Extractive Question Answering using Transformer: An implementation of Transformer to perform extractive question answering (Kaggle)

Computer Vision

  • Siamese Network: An implementation of Siamese Network for finding Image Similarity (Kaggle)
  • Variational Auto Encoder: An implementation of Variational Auto Encoder to generate Augmentations for MNIST Handwritten Digits (Kaggle)
  • Object Detection using Sliding Window and Image Pyramid: A basic object detection implementation using sliding window and image pyramid on top of an image classifier (Kaggle)
  • Object Detection using Selective Search: A basic object detection implementation using selective search on top of an image classifier (Kaggle)

Generative Adversarial Network

  • Deep Convolutional GAN: An implementation of Deep Convolutional GAN to generate MNIST digits (Kaggle)
  • Wasserstein GAN with Gradient Penalty: An implementation of Wasserstein GAN with Gradient Penalty to generate MNIST digits (Kaggle)
  • Conditional GAN: An implementation of Conditional GAN to generate MNIST digits (Kaggle)

If you find any bugs or have any questions regarding these notebooks, please open an issue. We will address it as soon as we can.

Reach out on Twitter if you have any questions.

Please cite the following if you use the code examples in your research:

@misc{saravia2022ml,
  title={ML Notebooks},
  author={Saravia, Elvis and Rastogi, Ritvik},
  journal={https://github.com/dair-ai/ML-Notebooks},
  year={2022}
}

Download Details:

Author: Dair-ai
Source Code: https://github.com/dair-ai/ML-Notebooks 
License: Apache-2.0 license

#python #machinelearning #ai #deeplearning #pytorch 

ML-Notebooks: Machine Learning Notebooks

Guides, Papers, Lecture, Notebooks & Resources for Prompt Engineering

Prompt Engineering Guide

Prompt engineering is a relatively new discipline for developing and optimizing prompts to efficiently use language models (LMs) for a wide variety of applications and research topics. Prompt engineering skills help to better understand the capabilities and limitations of large language models (LLMs). Researchers use prompt engineering to improve the capacity of LLMs on a wide range of common and complex tasks such as question answering and arithmetic reasoning. Developers use prompt engineering to design robust and effective prompting techniques that interface with LLMs and other tools.
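
As a concrete, purely illustrative example, here is a minimal zero-shot classification prompt sent through the OpenAI completions API as it existed in early 2023; the model name and parameters are assumptions for this sketch, not recommendations from the guide.

import openai

# A zero-shot sentiment-classification prompt (illustrative)
prompt = """Classify the text into neutral, negative or positive.
Text: I think the vacation was okay.
Sentiment:"""

response = openai.Completion.create(
    model="text-davinci-003",  # assumed completion model for this sketch
    prompt=prompt,
    max_tokens=5,
    temperature=0,             # deterministic output for classification
)
print(response.choices[0].text.strip())  # e.g. "Neutral"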

Motivated by the high interest in developing with LLMs, we have created this new prompt engineering guide that contains all the latest papers, learning guides, lectures, references, and tools related to prompt engineering.

Happy Prompting!


Guides

The following are a set of guides on prompt engineering developed by us. The guides are a work in progress.


If you are using the guide for your work, please cite us as follows:

@article{Saravia_Prompt_Engineering_Guide_2022,
  author = {Saravia, Elvis},
  journal = {https://github.com/dair-ai/Prompt-Engineering-Guide},
  month = {12},
  title = {{Prompt Engineering Guide}},
  year = {2022}
}

Feel free to open a PR if you think something is missing here. Feedback and suggestions are always welcome; just open an issue!


Announcements / Updates

  • 🎉 We have launched a new web version of the guide here
  • 🎓 Partnered with Sphere to deliver a new course on Prompt Engineering for LLMs
  • 💬 New ChatGPT prompt engineering guide coming soon!
  • 🔥 We reached #1 on Hacker News on 21 Feb 2023
  • 🎉 The Prompt Engineering Lecture went live here
  • 🎓 We're creating a set of comprehensive guides here

Join our Discord

Follow us on Twitter

Subscribe to our Newsletter


Lecture

We have published a 1 hour lecture that provides a comprehensive overview of prompting techniques, applications, and tools.


Download Details:

Author: Dair-ai
Source Code: https://github.com/dair-ai/Prompt-Engineering-Guide 
License: MIT license

#deeplearning #engineering #jupyternotebook #notebooks 

Guides, Papers, Lecture, Notebooks & Resources for Prompt Engineering
Lawson Wehner

OGB: Benchmark Datasets, Data Loaders, Evaluators for Graph ML

Overview

The Open Graph Benchmark (OGB) is a collection of benchmark datasets, data loaders, and evaluators for graph machine learning. Datasets cover a variety of graph machine learning tasks and real-world applications. The OGB data loaders are fully compatible with popular graph deep learning frameworks, including PyTorch Geometric and Deep Graph Library (DGL). They provide automatic dataset downloading, standardized dataset splits, and unified performance evaluation.

OGB aims to provide graph datasets that cover important graph machine learning tasks, diverse dataset scale, and rich domains.

Graph ML Tasks: We cover three fundamental graph machine learning tasks: prediction at the level of nodes, links, and graphs.

Diverse scale: Small-scale graph datasets can be processed within a single GPU, while medium- and large-scale graphs might require multiple GPUs or clever sampling/partition techniques.

Rich domains: Graph datasets come from diverse domains ranging from scientific ones to social/information networks, and also include heterogeneous knowledge graphs.

OGB is an ongoing effort, and we are planning to increase our coverage in the future.

Installation

You can install OGB using Python's package manager pip. If you have previously installed ogb, please make sure you update the version to 1.3.5. The release note is available here.

Requirements

  • Python>=3.6
  • PyTorch>=1.6
  • DGL>=0.5.0 or torch-geometric>=2.0.2
  • Numpy>=1.16.0
  • pandas>=0.24.0
  • urllib3>=1.24.0
  • scikit-learn>=0.20.0
  • outdated>=0.2.0

Pip install

The recommended way to install OGB is using Python's package manager pip:

pip install ogb
python -c "import ogb; print(ogb.__version__)"
# This should print "1.3.5". Otherwise, please update the version by
pip install -U ogb

From source

You can also install OGB from source. This is recommended if you want to contribute to OGB.

git clone https://github.com/snap-stanford/ogb
cd ogb
pip install -e .

Package Usage

We highlight two key features of OGB, namely, (1) easy-to-use data loaders, and (2) standardized evaluators.

(1) Data loaders

We prepare easy-to-use PyTorch Geometric and DGL data loaders. We handle dataset downloading as well as standardized dataset splitting. Below, on PyTorch Geometric, we see that a few lines of code are sufficient to prepare and split the dataset! Needless to say, you can enjoy the same convenience with DGL, as shown in the second sketch below!

from ogb.graphproppred import PygGraphPropPredDataset
from torch_geometric.loader import DataLoader

# Download and process data at './dataset/ogbg_molhiv/'
dataset = PygGraphPropPredDataset(name = 'ogbg-molhiv')

split_idx = dataset.get_idx_split() 
train_loader = DataLoader(dataset[split_idx['train']], batch_size=32, shuffle=True)
valid_loader = DataLoader(dataset[split_idx['valid']], batch_size=32, shuffle=False)
test_loader = DataLoader(dataset[split_idx['test']], batch_size=32, shuffle=False)
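
The DGL equivalent is nearly identical; this sketch mirrors the snippet above using OGB's DGL dataset class and collate function (DGL must be installed).

from ogb.graphproppred import DglGraphPropPredDataset, collate_dgl
from torch.utils.data import DataLoader

# Download and process data at './dataset/ogbg_molhiv/' with the DGL backend
dataset = DglGraphPropPredDataset(name = 'ogbg-molhiv')

split_idx = dataset.get_idx_split()
train_loader = DataLoader(dataset[split_idx['train']], batch_size=32, shuffle=True, collate_fn=collate_dgl)
valid_loader = DataLoader(dataset[split_idx['valid']], batch_size=32, shuffle=False, collate_fn=collate_dgl)
test_loader = DataLoader(dataset[split_idx['test']], batch_size=32, shuffle=False, collate_fn=collate_dgl)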

(2) Evaluators

We also prepare standardized evaluators for easy evaluation and comparison of different methods. The evaluator takes input_dict (a dictionary whose format is specified in evaluator.expected_input_format) as input, and returns a dictionary storing the performance metric appropriate for the given dataset. The standardized evaluation protocol allows researchers to reliably compare their methods.

from ogb.graphproppred import Evaluator

evaluator = Evaluator(name = 'ogbg-molhiv')
# You can learn the input and output format specification of the evaluator as follows.
# print(evaluator.expected_input_format)
# print(evaluator.expected_output_format)
# y_true / y_pred: arrays of shape (num_graphs, num_tasks) holding labels and predictions
input_dict = {'y_true': y_true, 'y_pred': y_pred}
result_dict = evaluator.eval(input_dict) # E.g., {'rocauc': 0.7321}

Citing OGB / OGB-LSC

If you use OGB or OGB-LSC datasets in your work, please cite our papers (Bibtex below).

@article{hu2020ogb,
  title={Open Graph Benchmark: Datasets for Machine Learning on Graphs},
  author={Hu, Weihua and Fey, Matthias and Zitnik, Marinka and Dong, Yuxiao and Ren, Hongyu and Liu, Bowen and Catasta, Michele and Leskovec, Jure},
  journal={arXiv preprint arXiv:2005.00687},
  year={2020}
}
@article{hu2021ogblsc,
  title={OGB-LSC: A Large-Scale Challenge for Machine Learning on Graphs},
  author={Hu, Weihua and Fey, Matthias and Ren, Hongyu and Nakata, Maho and Dong, Yuxiao and Leskovec, Jure},
  journal={arXiv preprint arXiv:2103.09430},
  year={2021}
}

Download Details:

Author: snap-stanford
Source Code: https://github.com/snap-stanford/ogb 
License: MIT license

#machinelearning #deeplearning #dataset #python 

OGB: Benchmark Datasets, Data Loaders, Evaluators for Graph ML
Lawson Wehner

Pytorch_geometric: Graph Neural Network Library for PyTorch

Pytorch Geometric

PyG (PyTorch Geometric) is a library built upon PyTorch to easily write and train Graph Neural Networks (GNNs) for a wide range of applications related to structured data.

It consists of various methods for deep learning on graphs and other irregular structures, also known as geometric deep learning, from a variety of published papers. In addition, it provides easy-to-use mini-batch loaders for operating on many small graphs or a single giant graph, multi-GPU support, DataPipe support, distributed graph learning via Quiver, a large number of common benchmark datasets (based on simple interfaces to create your own), the GraphGym experiment manager, and helpful transforms, both for learning on arbitrary graphs and on 3D meshes or point clouds. Click here to join our Slack community!


Library Highlights

Whether you are a machine learning researcher or first-time user of machine learning toolkits, here are some reasons to try out PyG for machine learning on graph-structured data.

  • Easy-to-use and unified API: All it takes is 10-20 lines of code to get started with training a GNN model (see the next section for a quick tour). PyG is PyTorch-on-the-rocks: It utilizes a tensor-centric API and keeps design principles close to vanilla PyTorch. If you are already familiar with PyTorch, utilizing PyG is straightforward.
  • Comprehensive and well-maintained GNN models: Most of the state-of-the-art Graph Neural Network architectures have been implemented by library developers or authors of research papers and are ready to be applied.
  • Great flexibility: Existing PyG models can easily be extended for conducting your own research with GNNs. Making modifications to existing models or creating new architectures is simple, thanks to its easy-to-use message passing API, and a variety of operators and utility functions.
  • Large-scale real-world GNN models: We focus on the need of GNN applications in challenging real-world scenarios, and support learning on diverse types of graphs, including but not limited to: scalable GNNs for graphs with millions of nodes; dynamic GNNs for node predictions over time; heterogeneous GNNs with multiple node types and edge types.
  • GraphGym integration: GraphGym lets users easily reproduce GNN experiments, is able to launch and analyze thousands of different GNN configurations, and is customizable by registering new modules to a GNN learning pipeline.

Quick Tour for New Users

In this quick tour, we highlight the ease of creating and training a GNN model with only a few lines of code.

Train your own GNN model

In the first glimpse of PyG, we implement the training of a GNN for classifying papers in a citation graph. For this, we load the Cora dataset, and create a simple 2-layer GCN model using the pre-defined GCNConv:

import torch
from torch import Tensor
from torch_geometric.nn import GCNConv
from torch_geometric.datasets import Planetoid

dataset = Planetoid(root='.', name='Cora')

class GCN(torch.nn.Module):
    def __init__(self, in_channels, hidden_channels, out_channels):
        super().__init__()
        self.conv1 = GCNConv(in_channels, hidden_channels)
        self.conv2 = GCNConv(hidden_channels, out_channels)

    def forward(self, x: Tensor, edge_index: Tensor) -> Tensor:
        # x: Node feature matrix of shape [num_nodes, in_channels]
        # edge_index: Graph connectivity matrix of shape [2, num_edges]
        x = self.conv1(x, edge_index).relu()
        x = self.conv2(x, edge_index)
        return x

model = GCN(dataset.num_features, 16, dataset.num_classes)

We can now optimize the model in a training loop, similar to the standard PyTorch training procedure.

import torch.nn.functional as F

data = dataset[0]
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

for epoch in range(200):
    pred = model(data.x, data.edge_index)
    loss = F.cross_entropy(pred[data.train_mask], data.y[data.train_mask])

    # Backpropagation
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

More information about evaluating final model performance can be found in the corresponding example.
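
As a minimal sketch of such an evaluation (reusing the model and data objects from the snippets above, and the standard Planetoid test mask):

model.eval()
with torch.no_grad():
    # Predicted class = argmax over the output logits
    pred = model(data.x, data.edge_index).argmax(dim=-1)
    correct = (pred[data.test_mask] == data.y[data.test_mask]).sum()
    acc = int(correct) / int(data.test_mask.sum())
print(f'Test accuracy: {acc:.4f}')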

Create your own GNN layer

In addition to the easy application of existing GNNs, PyG makes it simple to implement custom Graph Neural Networks (see here for the accompanying tutorial). For example, this is all it takes to implement the edge convolutional layer from Wang et al.:

import torch
from torch import Tensor
from torch.nn import Sequential, Linear, ReLU
from torch_geometric.nn import MessagePassing

class EdgeConv(MessagePassing):
    def __init__(self, in_channels, out_channels):
        super().__init__(aggr="max")  # "Max" aggregation.
        self.mlp = Sequential(
            Linear(2 * in_channels, out_channels),
            ReLU(),
            Linear(out_channels, out_channels),
        )

    def forward(self, x: Tensor, edge_index: Tensor) -> Tensor:
        # x: Node feature matrix of shape [num_nodes, in_channels]
        # edge_index: Graph connectivity matrix of shape [2, num_edges]
        return self.propagate(edge_index, x=x)  # shape [num_nodes, out_channels]

    def message(self, x_j: Tensor, x_i: Tensor) -> Tensor:
        # x_j: Source node features of shape [num_edges, in_channels]
        # x_i: Target node features of shape [num_edges, in_channels]
        edge_features = torch.cat([x_i, x_j - x_i], dim=-1)
        return self.mlp(edge_features)  # shape [num_edges, out_channels]
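
A quick smoke test on random data (shapes here are illustrative, not part of the original example) shows how the layer is applied:

conv = EdgeConv(in_channels=3, out_channels=32)
x = torch.randn(100, 3)                       # 100 nodes with 3 features each
edge_index = torch.randint(0, 100, (2, 500))  # 500 random directed edges
out = conv(x, edge_index)                     # shape: [100, 32]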

Manage experiments with GraphGym

GraphGym allows you to manage and launch GNN experiments, using a highly modularized pipeline (see here for the accompanying tutorial).

git clone https://github.com/pyg-team/pytorch_geometric.git
cd pytorch_geometric/graphgym
bash run_single.sh  # run a single GNN experiment (node/edge/graph-level)
bash run_batch.sh   # run a batch of GNN experiments, using different GNN designs/datasets/tasks

Users are highly encouraged to check out the documentation, which contains additional tutorials on the essential functionalities of PyG, including data handling, creation of datasets and a full list of implemented methods, transforms, and datasets. For a quick start, check out our examples in examples/.

Architecture Overview

PyG provides a multi-layer framework that enables users to build Graph Neural Network solutions at both low and high levels. It comprises the following components:

  • The PyG engine utilizes the powerful PyTorch deep learning framework, as well as additions of efficient CUDA libraries for operating on sparse data, e.g., pyg-lib, torch_scatter, torch_sparse and torch-cluster.
  • The PyG storage handles data processing, transformation and loading pipelines. It is capable of handling and processing large-scale graph datasets, and provides effective solutions for heterogeneous graphs. It further provides a variety of sampling solutions, which enable training of GNNs on large-scale graphs.
  • The PyG operators bundle essential functionalities for implementing Graph Neural Networks. PyG supports important GNN building blocks that can be combined and applied to various parts of a GNN model, ensuring rich flexibility of GNN design.
  • Finally, PyG provides an abundant set of GNN models, and examples that showcase GNN models on standard graph benchmarks. Thanks to its flexibility, users can easily build and modify custom GNN models to fit their specific needs.

Implemented GNN Models

We list currently supported PyG models, layers and operators according to category:

GNN layers: All Graph Neural Network layers are implemented via the nn.MessagePassing interface. A GNN layer specifies how to perform message passing, i.e. by designing different message, aggregation and update functions as defined here. These GNN layers can be stacked together to create Graph Neural Network models.


Pooling layers: Graph pooling layers combine the vectorial representations of a set of nodes in a graph (or a subgraph) into a single vector representation that summarizes the properties of its nodes. They are commonly applied to graph-level tasks, which require combining node features into a single graph representation.


GNN models: Our supported GNN models incorporate multiple message passing layers, and users can directly use these pre-defined models to make predictions on graphs. Unlike simple stacking of GNN layers, these models could involve pre-processing, additional learnable parameters, skip connections, graph coarsening, etc.


GNN operators and utilities: PyG comes with a rich set of neural network operators that are commonly used in many GNN models. They follow an extensible design: It is easy to apply these operators and graph utilities to existing GNN layers and models to further enhance model performance.


Scalable GNNs: PyG supports the implementation of Graph Neural Networks that can scale to large-scale graphs. Such application is challenging since the entire graph, its associated features and the GNN parameters cannot fit into GPU memory. Many state-of-the-art scalability approaches tackle this challenge by sampling neighborhoods for mini-batch training, graph clustering and partitioning, or by using simplified GNN models. These approaches have been implemented in PyG, and can benefit from the above GNN layers, operators and models.


Installation

PyG is available for Python 3.7 to Python 3.10.

Anaconda

You can now install PyG via Anaconda for all major OS/PyTorch/CUDA combinations 🤗 If you have not yet installed PyTorch, install it via conda as described in the official PyTorch documentation. Given that you have PyTorch installed (>=1.8.0), simply run

conda install pyg -c pyg

Pip Wheels

We alternatively provide pip wheels for all major OS/PyTorch/CUDA combinations, see here.

PyTorch 1.13

To install the binaries for PyTorch 1.13.0, simply run

pip install pyg_lib torch_scatter torch_sparse -f https://data.pyg.org/whl/torch-1.13.0+${CUDA}.html
pip install torch_geometric

where ${CUDA} should be replaced by either cpu, cu116, or cu117 depending on your PyTorch installation.

          cpu   cu116  cu117
Linux     ✅     ✅      ✅
Windows   ✅     ✅      ✅
macOS     ✅

For additional but optional functionality, run

pip install torch_cluster torch_spline_conv -f https://data.pyg.org/whl/torch-1.13.0+${CUDA}.html

PyTorch 1.12

To install the binaries for PyTorch 1.12.0, simply run

pip install pyg_lib torch_scatter torch_sparse -f https://data.pyg.org/whl/torch-1.12.0+${CUDA}.html
pip install torch_geometric

where ${CUDA} should be replaced by either cpu, cu102, cu113, or cu116 depending on your PyTorch installation.

          cpu   cu102  cu113  cu116
Linux     ✅     ✅      ✅      ✅
Windows   ✅            ✅      ✅
macOS     ✅

For additional but optional functionality, run

pip install torch_cluster torch_spline_conv -f https://data.pyg.org/whl/torch-1.12.0+${CUDA}.html

Note: Binaries of older versions are also provided for PyTorch 1.4.0, PyTorch 1.5.0, PyTorch 1.6.0, PyTorch 1.7.0/1.7.1, PyTorch 1.8.0/1.8.1, PyTorch 1.9.0, PyTorch 1.10.0/1.10.1/1.10.2, and PyTorch 1.11.0 (following the same procedure). For older versions, you might need to explicitly specify the latest supported version number or install via pip install --no-index in order to prevent a manual installation from source. You can look up the latest supported version number here.

Nightly and Master

In case you want to experiment with the latest PyG features which are not fully released yet, ensure that pyg_lib, torch_scatter and torch_sparse are installed by following the steps mentioned above, and install either the nightly version of PyG via

pip install pyg-nightly

or install PyG from master via

pip install git+https://github.com/pyg-team/pytorch_geometric.git

Cite

Please cite our paper (and the respective papers of the methods used) if you use this code in your own work:

@inproceedings{Fey/Lenssen/2019,
  title={Fast Graph Representation Learning with {PyTorch Geometric}},
  author={Fey, Matthias and Lenssen, Jan E.},
  booktitle={ICLR Workshop on Representation Learning on Graphs and Manifolds},
  year={2019},
}

Feel free to email us if you wish your work to be listed in the external resources. If you notice anything unexpected, please open an issue and let us know. If you have any questions or are missing a specific feature, feel free to discuss them with us. We are motivated to constantly make PyG even better.


Documentation | Paper | Colab Notebooks and Video Tutorials | External Resources | OGB Examples


Download Details:

Author: pyg-team
Source Code: https://github.com/pyg-team/pytorch_geometric 
License: MIT license

#python #pytorch #deeplearning #graph #network 

Pytorch_geometric: Graph Neural Network Library for PyTorch
Michio JP

How to Visualize Neural Networks in Python

A neural network is a series of algorithms that endeavors to recognize underlying relationships in a set of data through a process that mimics the way the human brain operates. In this sense, neural networks refer to systems of neurons, either organic or artificial in nature.

Neural networks can adapt to changing input, so the network generates the best possible result without needing to redesign the output criteria. The concept of neural networks, which has its roots in artificial intelligence, is swiftly gaining popularity in the development of trading systems.

There are several tools and packages that we can use to visualize neural networks. In this tutorial, we will talk about 4 of them:

  • 1: Plot_model from TensorFlow
  • 2: ANN-Visualizer
  • 3: Netron
  • 4: Tensorboard

1: Tensorflow/Keras Plot_model

Keras/TensorFlow comes with a native function to help visualize the components and structure of your artificial neural network. The plot_model() function can be used to visualize any Keras or TensorFlow generated neural network, producing a flow chart of the input, the layers, and the output of the network. plot_model takes the model as input, plus the filename to save the plot to via the to_file argument.

# Load utils
from tensorflow.keras.utils import plot_model
from tensorflow.keras.models import Sequential

# Build your model
model = Sequential()
# ... add your layers here ...

# Visualize the model
plot_model(model, to_file='my_ann_model.png', show_shapes=False)

# Visualize the model, showing the input and output shapes
plot_model(model, to_file='my_ann_model.png', show_shapes=True)

2: ANN-Visualizer

This is another alternative for visualizing the components of a neural network. There can be some challenges when using this tool, but it is quite simple to use.

To install it you can use pip via

pip install ann-visualizer

In order to use Ann-Visualizer you can do the following

from ann_visualizer.visualize import ann_viz
import graphviz

# Usage: render the model to a Graphviz .gv file, then load and display it
ann_viz(model, filename='my_ann_model.gv', title='Artificial Neuron')
graph_file = graphviz.Source.from_file('my_ann_model.gv')
graph_file

3: Netron

Netron is another alternative. It is a standalone, cross-platform desktop application, built with Electron and React. The same team also offers a free online service for visualizing the components of an ANN.

You can install netron as follows

# For Python
pip install netron
# For Linux
snap install netron
# For Mac
brew install netron

To use Netron, save your neural network model as h5 (or another supported format) and then open it in the Netron app or the online service. That is it.
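
A minimal sketch of that workflow in Python, assuming a trained Keras model named model; the netron package can also launch the viewer directly:

# Save the trained Keras model in h5 format
model.save('my_ann_model.h5')

# Launch the Netron viewer on the saved model (opens in the browser)
import netron
netron.start('my_ann_model.h5')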

4: Tensorboard

Last but not least is TensorBoard from TensorFlow. Beyond visualizing the components of a neural network, this library offers tons of other features, such as tracking metrics during training.

To use it, install it via pip as below:

pip install tensorboard
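
A minimal sketch of logging a Keras training run so the model graph and metrics appear in TensorBoard; model, x_train, and y_train are assumed to exist already:

from tensorflow.keras.callbacks import TensorBoard

# Write logs (including the model graph) to ./logs during training
tb_callback = TensorBoard(log_dir='./logs', histogram_freq=1)
model.fit(x_train, y_train, epochs=5, callbacks=[tb_callback])

You can then start the dashboard with tensorboard --logdir ./logs and open it in your browser.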

That wraps up this tutorial on 4 ways to visualize neural networks in Python.

Happy Coding !!!

#python #neuralnetwork #machinelearning #deeplearning  

How to Visualize Neural Networks in Python
Gordon Matlala

State-of-the-art Diffusion Models for Image & Audio Generation

🤗 Diffusers is the go-to library for state-of-the-art pretrained diffusion models for generating images, audio, and even 3D structures of molecules. Whether you're looking for a simple inference solution or training your own diffusion models, 🤗 Diffusers is a modular toolbox that supports both. Our library is designed with a focus on usability over performance, simple over easy, and customizability over abstractions.

🤗 Diffusers offers three core components:

  • State-of-the-art diffusion pipelines that can be run in inference with just a few lines of code.
  • Interchangeable noise schedulers for different diffusion speeds and output quality (see the sketch after this list).
  • Pretrained models that can be used as building blocks, and combined with schedulers, for creating your own end-to-end diffusion systems.
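
For instance, swapping the noise scheduler of a loaded pipeline takes one line; this sketch uses DPMSolverMultistepScheduler, but any compatible scheduler class works the same way.

from diffusers import DiffusionPipeline, DPMSolverMultistepScheduler

pipeline = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
# Reuse the pipeline's existing scheduler configuration while changing the algorithm
pipeline.scheduler = DPMSolverMultistepScheduler.from_config(pipeline.scheduler.config)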

Installation

We recommend installing 🤗 Diffusers in a virtual environment from PyPI or Conda. For more details about installing PyTorch and Flax, please refer to their official documentation.

PyTorch

With pip (official package):

pip install --upgrade diffusers[torch]

With conda (maintained by the community):

conda install -c conda-forge diffusers

Flax

With pip (official package):

pip install --upgrade diffusers[flax]

Apple Silicon (M1/M2) support

Please refer to the How to use Stable Diffusion in Apple Silicon guide.

Quickstart

Generating outputs is super easy with 🤗 Diffusers. To generate an image from text, use the from_pretrained method to load any pretrained diffusion model (browse the Hub for 4000+ checkpoints):

from diffusers import DiffusionPipeline

pipeline = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
pipeline.to("cuda")
pipeline("An image of a squirrel in Picasso style").images[0]

You can also dig into the models and schedulers toolbox to build your own diffusion system:

from diffusers import DDPMScheduler, UNet2DModel
from PIL import Image
import torch
import numpy as np

scheduler = DDPMScheduler.from_pretrained("google/ddpm-cat-256")
model = UNet2DModel.from_pretrained("google/ddpm-cat-256").to("cuda")
scheduler.set_timesteps(50)

sample_size = model.config.sample_size
noise = torch.randn((1, 3, sample_size, sample_size)).to("cuda")
input = noise

for t in scheduler.timesteps:
    with torch.no_grad():
        noisy_residual = model(input, t).sample
        prev_noisy_sample = scheduler.step(noisy_residual, t, input).prev_sample
        input = prev_noisy_sample

image = (input / 2 + 0.5).clamp(0, 1)
image = image.cpu().permute(0, 2, 3, 1).numpy()[0]
image = Image.fromarray((image * 255).round().astype("uint8"))
image

Check out the Quickstart to launch your diffusion journey today!

How to navigate the documentation

  • Tutorial: A basic crash course for learning how to use the library's most important features, like using models and schedulers to build your own diffusion system, and training your own diffusion model.
  • Loading: Guides for how to load and configure all the components (pipelines, models, and schedulers) of the library, as well as how to use different schedulers.
  • Pipelines for inference: Guides for how to use pipelines for different inference tasks, batched generation, controlling generated outputs and randomness, and how to contribute a pipeline to the library.
  • Optimization: Guides for how to optimize your diffusion model to run faster and consume less memory.
  • Training: Guides for how to train a diffusion model for different tasks with different training techniques.

Supported pipelines

Pipeline | Paper | Tasks
alt_diffusion | AltDiffusion | Image-to-Image Text-Guided Generation
audio_diffusion | Audio Diffusion | Unconditional Audio Generation
controlnet | ControlNet with Stable Diffusion | Image-to-Image Text-Guided Generation
cycle_diffusion | Cycle Diffusion | Image-to-Image Text-Guided Generation
dance_diffusion | Dance Diffusion | Unconditional Audio Generation
ddpm | Denoising Diffusion Probabilistic Models | Unconditional Image Generation
ddim | Denoising Diffusion Implicit Models | Unconditional Image Generation
latent_diffusion | High-Resolution Image Synthesis with Latent Diffusion Models | Text-to-Image Generation
latent_diffusion | High-Resolution Image Synthesis with Latent Diffusion Models | Super Resolution Image-to-Image
latent_diffusion_uncond | High-Resolution Image Synthesis with Latent Diffusion Models | Unconditional Image Generation
paint_by_example | Paint by Example: Exemplar-based Image Editing with Diffusion Models | Image-Guided Image Inpainting
pndm | Pseudo Numerical Methods for Diffusion Models on Manifolds | Unconditional Image Generation
score_sde_ve | Score-Based Generative Modeling through Stochastic Differential Equations | Unconditional Image Generation
score_sde_vp | Score-Based Generative Modeling through Stochastic Differential Equations | Unconditional Image Generation
semantic_stable_diffusion | Semantic Guidance | Text-Guided Generation
stable_diffusion_text2img | Stable Diffusion | Text-to-Image Generation
stable_diffusion_img2img | Stable Diffusion | Image-to-Image Text-Guided Generation
stable_diffusion_inpaint | Stable Diffusion | Text-Guided Image Inpainting
stable_diffusion_panorama | MultiDiffusion | Text-to-Panorama Generation
stable_diffusion_pix2pix | InstructPix2Pix | Text-Guided Image Editing
stable_diffusion_pix2pix_zero | Zero-shot Image-to-Image Translation | Text-Guided Image Editing
stable_diffusion_attend_and_excite | Attend and Excite for Stable Diffusion | Text-to-Image Generation
stable_diffusion_self_attention_guidance | Self-Attention Guidance | Text-to-Image Generation
stable_diffusion_image_variation | Stable Diffusion Image Variations | Image-to-Image Generation
stable_diffusion_latent_upscale | Stable Diffusion Latent Upscaler | Text-Guided Super Resolution Image-to-Image
stable_diffusion_2 | Stable Diffusion 2 | Text-to-Image Generation
stable_diffusion_2 | Stable Diffusion 2 | Text-Guided Image Inpainting
stable_diffusion_2 | Depth-Conditional Stable Diffusion | Depth-to-Image Generation
stable_diffusion_2 | Stable Diffusion 2 | Text-Guided Super Resolution Image-to-Image
stable_diffusion_safe | Safe Stable Diffusion | Text-Guided Generation
stable_unclip | Stable unCLIP | Text-to-Image Generation
stable_unclip | Stable unCLIP | Image-to-Image Text-Guided Generation
stochastic_karras_ve | Elucidating the Design Space of Diffusion-Based Generative Models | Unconditional Image Generation
unclip | Hierarchical Text-Conditional Image Generation with CLIP Latents | Text-to-Image Generation
versatile_diffusion | Versatile Diffusion: Text, Images and Variations All in One Diffusion Model | Text-to-Image Generation
versatile_diffusion | Versatile Diffusion: Text, Images and Variations All in One Diffusion Model | Image Variations Generation
versatile_diffusion | Versatile Diffusion: Text, Images and Variations All in One Diffusion Model | Dual Image and Text Guided Generation
vq_diffusion | Vector Quantized Diffusion Model for Text-to-Image Synthesis | Text-to-Image Generation

Contribution

We ❤️ contributions from the open-source community! If you want to contribute to this library, please check out our Contribution guide. You can look out for issues you'd like to tackle to contribute to the library.

Also, say 👋 in our public Discord channel Join us on Discord. We discuss the hottest trends about diffusion models, help each other with contributions, personal projects or just hang out ☕.

Credits

This library concretizes previous work by many different authors and would not have been possible without their great research and implementations. We'd like to thank, in particular, the following implementations which have helped us in our development and without which the API could not have been as polished today:

  • @CompVis' latent diffusion models library, available here
  • @hojonathanho's original DDPM implementation, available here, as well as the extremely useful translation into PyTorch by @pesser, available here
  • @ermongroup's DDIM implementation, available here
  • @yang-song's Score-VE and Score-VP implementations, available here

We also want to thank @heejkoo for the very helpful overview of papers, code and resources on diffusion models, available here as well as @crowsonkb and @rromb for useful discussions and insights.

Citation

@misc{von-platen-etal-2022-diffusers,
  author = {Patrick von Platen and Suraj Patil and Anton Lozhkov and Pedro Cuenca and Nathan Lambert and Kashif Rasul and Mishig Davaadorj and Thomas Wolf},
  title = {Diffusers: State-of-the-art diffusion models},
  year = {2022},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/huggingface/diffusers}}
}

Download Details:

Author: Huggingface
Source Code: https://github.com/huggingface/diffusers 
License: Apache-2.0 license

#python #deeplearning #pytorch #image #generate #hacktoberfest 

State-of-the-art Diffusion Models for Image & Audio Generation
Gordon Matlala

Deep Learning Optimization Library That Makes Distributed Training

Extreme Speed and Scale for DL Training and Inference

DeepSpeed is an easy-to-use deep learning optimization software suite that enables unprecedented scale and speed for Deep Learning Training and Inference. With DeepSpeed you can:

  • Train/Inference dense or sparse models with billions or trillions of parameters
  • Achieve excellent system throughput and efficiently scale to thousands of GPUs
  • Train/Inference on resource constrained GPU systems
  • Achieve unprecedented low latency and high throughput for inference
  • Achieve extreme compression for an unparalleled inference latency and model size reduction with low costs

DeepSpeed's three innovation pillars

DeepSpeed-Training

DeepSpeed offers a confluence of system innovations that have made large-scale DL training effective and efficient, greatly improved ease of use, and redefined the DL training landscape in terms of the scale that is possible. Innovations such as ZeRO, 3D-Parallelism, DeepSpeed-MoE, and ZeRO-Infinity fall under the training pillar. Learn more: DeepSpeed-Training

DeepSpeed-Inference

DeepSpeed brings together innovations in parallelism technology such as tensor, pipeline, expert and ZeRO-parallelism, and combines them with high performance custom inference kernels, communication optimizations and heterogeneous memory technologies to enable inference at an unprecedented scale, while achieving unparalleled latency, throughput and cost reduction. This systematic composition of system technologies for inference falls under the inference pillar. Learn more: DeepSpeed-Inference

DeepSpeed-Compression

To further increase the inference efficiency, DeepSpeed offers easy-to-use and flexible-to-compose compression techniques for researchers and practitioners to compress their models while delivering faster speed, smaller model size, and significantly reduced compression cost. Moreover, SoTA innovations on compression like ZeroQuant and XTC are included under the compression pillar. Learn more: DeepSpeed-Compression


DeepSpeed Software Suite

DeepSpeed Library

The DeepSpeed library (this repository) implements and packages the innovations and technologies in the DeepSpeed Training, Inference and Compression pillars into a single easy-to-use, open-sourced repository. It allows for easy composition of a multitude of features within a single training, inference or compression pipeline. The DeepSpeed library is heavily adopted by the DL community, and has been used to enable some of the most powerful models (see DeepSpeed Adoption).

Model Implementations for Inference (MII)

Model Implementations for Inference (MII) is an open-sourced repository for making low-latency and high-throughput inference accessible to all data scientists by alleviating the need to apply complex system optimization techniques themselves. Out-of-box, MII offers support for thousands of widely used DL models, optimized using DeepSpeed-Inference, that can be deployed with a few lines of code, while achieving significant latency reduction compared to their vanilla open-sourced versions.

DeepSpeed on Azure

DeepSpeed users are diverse and have access to different environments. We recommend trying DeepSpeed on Azure, as it is the simplest and easiest method. The recommended way to try DeepSpeed on Azure is through AzureML recipes. The job submission and data preparation scripts have been made available here. For more details on how to use DeepSpeed on Azure, please follow the Azure tutorial.


DeepSpeed Adoption

DeepSpeed is an important part of Microsoft’s new AI at Scale initiative to enable next-generation AI capabilities at scale, where you can find more information here.

DeepSpeed has been used to train many different large-scale models, below is a list of several examples that we are aware of (if you'd like to include your model please submit a PR):

DeepSpeed has been integrated with several different popular open-source DL frameworks such as:

  • Transformers with DeepSpeed
  • Accelerate with DeepSpeed
  • Lightning with DeepSpeed
  • MosaicML with DeepSpeed
  • Determined with DeepSpeed

Build Pipeline Status

  • NVIDIA: nv-torch12-p40, nv-torch18-v100, nv-torch-latest-v100, nv-inference, nv-nightly
  • AMD: amd
  • PyTorch Nightly: nv-torch-nightly-v100
  • Integrations: nv-transformers-v100, nv-lightning-v100, nv-accelerate-v100
  • Misc: Formatting, pages-build-deployment, Documentation Status

Installation

The quickest way to get started with DeepSpeed is via pip; this will install the latest release of DeepSpeed, which is not tied to specific PyTorch or CUDA versions. DeepSpeed includes several C++/CUDA extensions that we commonly refer to as our 'ops'. By default, all of these extensions/ops will be built just-in-time (JIT) using torch's JIT C++ extension loader, which relies on ninja to build and dynamically link them at runtime.

Requirements

  • PyTorch must be installed before installing DeepSpeed.
  • For full feature support we recommend a version of PyTorch that is >= 1.8 and ideally the latest PyTorch stable release.
  • A CUDA or ROCm compiler such as nvcc or hipcc used to compile C++/CUDA/HIP extensions.
  • The specific GPUs we develop and test against are listed below. This doesn't mean your GPU will not work if it doesn't fall into this category; it's just that DeepSpeed is most well tested on the following:
    • NVIDIA: Pascal, Volta, Ampere, and Hopper architectures
    • AMD: MI100 and MI200

PyPI

We regularly push releases to PyPI and encourage users to install from there in most cases.

pip install deepspeed

After installation, you can validate your install and see which extensions/ops your machine is compatible with via the DeepSpeed environment report.

ds_report

If you would like to pre-install any of the DeepSpeed extensions/ops (instead of JIT compiling), or install pre-compiled ops via PyPI, please see our advanced installation instructions.
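
As a rough sketch of what using the library looks like once installed (model, data_loader, and ds_config.json below are assumed placeholders, not part of this README):

import deepspeed

# Wrap an existing PyTorch model; settings (ZeRO stage, fp16, etc.) live in ds_config.json
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config='ds_config.json',
)

for batch in data_loader:
    loss = model_engine(batch)    # forward pass through the wrapped model
    model_engine.backward(loss)   # DeepSpeed-managed backward pass
    model_engine.step()           # optimizer step (and LR schedule, if configured)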

Windows

Windows is partially supported by DeepSpeed. On Windows you can build a wheel with the following steps; currently only inference mode is supported.

  1. Install pytorch, such as pytorch 1.8 + cuda 11.1
  2. Install visual cpp build tools, such as VS2019 C++ x64/x86 build tools
  3. Launch cmd console with Administrator privilege for creating required symlink folders
  4. Run python setup.py bdist_wheel to build wheel in dist folder

Features

Please check out the DeepSpeed-Training, DeepSpeed-Inference and DeepSpeed-Compression pages for the full set of features offered along each of these three pillars.

Further Reading

All DeepSpeed documentation, tutorials, and blogs can be found on our website: deepspeed.ai

  • Getting Started: First steps with DeepSpeed
  • DeepSpeed JSON Configuration: Configuring DeepSpeed
  • API Documentation: Generated DeepSpeed API documentation
  • Tutorials: Tutorials
  • Blogs: Blogs

Contributing

DeepSpeed welcomes your contributions! Please see our contributing guide for more details on formatting, testing, etc.

Contributor License Agreement

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

Code of Conduct

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.

Publications

  1. Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, Yuxiong He. (2019) ZeRO: memory optimizations toward training trillion parameter models. arXiv:1910.02054 and In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '20).
  2. Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. (2020) DeepSpeed: System Optimizations Enable Training Deep Learning Models with Over 100 Billion Parameters. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD '20, Tutorial).
  3. Minjia Zhang, Yuxiong He. (2020) Accelerating Training of Transformer-Based Language Models with Progressive Layer Dropping. arXiv:2010.13369 and NeurIPS 2020.
  4. Jie Ren, Samyam Rajbhandari, Reza Yazdani Aminabadi, Olatunji Ruwase, Shuangyan Yang, Minjia Zhang, Dong Li, Yuxiong He. (2021) ZeRO-Offload: Democratizing Billion-Scale Model Training. arXiv:2101.06840 and USENIX ATC 2021.
  5. Hanlin Tang, Shaoduo Gan, Ammar Ahmad Awan, Samyam Rajbhandari, Conglong Li, Xiangru Lian, Ji Liu, Ce Zhang, Yuxiong He. (2021) 1-bit Adam: Communication Efficient Large-Scale Training with Adam's Convergence Speed. arXiv:2102.02888 and ICML 2021.
  6. Samyam Rajbhandari, Olatunji Ruwase, Jeff Rasley, Shaden Smith, Yuxiong He. (2021) ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning. arXiv:2104.07857 and SC 2021.
  7. Conglong Li, Ammar Ahmad Awan, Hanlin Tang, Samyam Rajbhandari, Yuxiong He. (2021) 1-bit LAMB: Communication Efficient Large-Scale Large-Batch Training with LAMB's Convergence Speed. arXiv:2104.06069 and HiPC 2022.
  8. Conglong Li, Minjia Zhang, Yuxiong He. (2021) The Stability-Efficiency Dilemma: Investigating Sequence Length Warmup for Training GPT Models. arXiv:2108.06084 and NeurIPS 2022.
  9. Yucheng Lu, Conglong Li, Minjia Zhang, Christopher De Sa, Yuxiong He. (2022) Maximizing Communication Efficiency for Large-scale Training via 0/1 Adam. arXiv:2202.06009.
  10. Samyam Rajbhandari, Conglong Li, Zhewei Yao, Minjia Zhang, Reza Yazdani Aminabadi, Ammar Ahmad Awan, Jeff Rasley, Yuxiong He. (2022) DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale arXiv:2201.05596 and ICML 2022.
  11. Shaden Smith, Mostofa Patwary, Brandon Norick, Patrick LeGresley, Samyam Rajbhandari, Jared Casper, Zhun Liu, Shrimai Prabhumoye, George Zerveas, Vijay Korthikanti, Elton Zhang, Rewon Child, Reza Yazdani Aminabadi, Julie Bernauer, Xia Song, Mohammad Shoeybi, Yuxiong He, Michael Houston, Saurabh Tiwary, Bryan Catanzaro. (2022) Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model arXiv:2201.11990.
  12. Xiaoxia Wu, Zhewei Yao, Minjia Zhang, Conglong Li, Yuxiong He. (2022) Extreme Compression for Pre-trained Transformers Made Simple and Efficient. arXiv:2206.01859 and NeurIPS 2022.
  13. Zhewei Yao, Reza Yazdani Aminabadi, Minjia Zhang, Xiaoxia Wu, Conglong Li, Yuxiong He. (2022) ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers. arXiv:2206.01861 and NeurIPS 2022.
  14. Reza Yazdani Aminabadi, Samyam Rajbhandari, Minjia Zhang, Ammar Ahmad Awan, Cheng Li, Du Li, Elton Zheng, Jeff Rasley, Shaden Smith, Olatunji Ruwase, Yuxiong He. (2022) DeepSpeed Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale. arXiv:2207.00032 and SC 2022.
  15. Zhewei Yao, Xiaoxia Wu, Conglong Li, Connor Holmes, Minjia Zhang, Cheng Li, Yuxiong He. (2022) Random-LTD: Random and Layerwise Token Dropping Brings Efficient Training for Large-scale Transformers. arXiv:2211.11586.
  16. Conglong Li, Zhewei Yao, Xiaoxia Wu, Minjia Zhang, Yuxiong He. (2022) DeepSpeed Data Efficiency: Improving Deep Learning Model Quality and Training Efficiency via Efficient Data Sampling and Routing. arXiv:2212.03597.
  17. Xiaoxia Wu, Cheng Li, Reza Yazdani Aminabadi, Zhewei Yao, Yuxiong He. (2023) Understanding INT4 Quantization for Transformer Models: Latency Speedup, Composability, and Failure Cases. arXiv:2301.12017.
  18. Syed Zawad, Cheng Li, Zhewei Yao, Elton Zheng, Yuxiong He, Feng Yan. (2023) DySR: Adaptive Super-Resolution via Algorithm and System Co-design. ICLR:2023.
  19. Sheng Shen, Zhewei Yao, Chunyuan Li, Trevor Darrell, Kurt Keutzer, Yuxiong He. (2023) Scaling Vision-Language Models with Sparse Mixture of Experts. arXiv:2303.07226.

Videos

  1. DeepSpeed KDD 2020 Tutorial
    1. Overview
    2. ZeRO + large model training
    3. 17B T-NLG demo
    4. Fastest BERT training + RScan tuning
    5. DeepSpeed hands on deep dive: part 1, part 2, part 3
    6. FAQ
  2. Microsoft Research Webinar
  3. DeepSpeed on AzureML
  4. Community Tutorials

Latest News

DeepSpeed trained the world's most powerful language models (MT-530B, BLOOM); learn how.


Download Details:

Author: Microsoft
Source Code: https://github.com/microsoft/DeepSpeed 
License: MIT license

#python #machinelearning #deeplearning #gpu #pytorch 

Deep Learning Optimization Library That Makes Distributed Training

Bringing Stable Diffusion Models to Web Browsers

Web Stable Diffusion

This project brings stable diffusion models onto web browsers. Everything runs inside the browser with no server support. To our knowledge, this is the world’s first stable diffusion model running completely in the browser. Please check out our demo webpage to try it out.

Browser screenshot

We have been seeing amazing progress through AI models recently. Thanks to the open-source effort, developers can now easily compose open-source models to accomplish amazing tasks. Stable diffusion enables the automatic creation of photorealistic images, as well as images in various styles, based on text input. These models are usually big and compute-heavy, which means we have to pipe all computation requests through to (GPU) servers when developing web applications based on them. Additionally, most of the workloads have to run on a specific type of GPU where popular deep-learning frameworks are readily available.

This project takes a step toward changing that status quo and bringing more diversity to the ecosystem. There are many reasons to move some (or all) of the computation to the client side: cost reduction on the service provider side, as well as better personalization and privacy protection. Personal computers (even mobile devices) are developing in a direction that enables such possibilities, and the client side is getting quite powerful. For example, the latest MacBook Pro can have up to 96GB of unified RAM that can be used to store the model weights, and a reasonably powerful GPU to run many of the workloads.

Building special client apps for those applications is one option (which we also support), but wouldn’t it be even more amazing if we could simply open a browser and bring AI natively to a browser tab? The ecosystem shows some level of readiness: WebAssembly allows us to port lower-level runtimes onto the web, and, to solve the compute problem, WebGPU has been maturing quickly and enables native GPU execution in the browser.

We are just seeing necessary elements coming together on the client side, both in terms of hardware and browser ecosystem. Still, there are big hurdles to cross, to name a few:

  • We need to run the models in an environment without the usual GPU-accelerated Python frameworks.
  • Most AI frameworks rely heavily on optimized compute libraries maintained by hardware vendors, so we need to start from scratch. To get the maximum benefit, we might also need to produce variants per client environment.
  • We need careful memory planning so the models fit into memory.

We do not want to only do it for just one model. Instead, we would like to present a repeatable, hackable, composable workflow that enables anyone to easily develop and optimize these models in a Python-first environment and universally deploy them everywhere, including the web.

Get Started

We have a Jupyter notebook that walks you through all the stages, including

  • explains the key points of web ML model deployment and how we meet them,
  • import the stable diffusion model,
  • optimize the model,
  • build the model,
  • deploy the model locally with native GPU runtime, and
  • deploy the model on web with WebGPU runtime.

If you want to go through these steps in command line, please follow the commands below:

Commands

Install TVM Unity. You can either

  • use pip3 install mlc-ai-nightly -f https://mlc.ai/wheels to install the TVM Unity wheel, or
  • follow TVM’s documentation to build from source. Please use git checkout origin/unity to check out the TVM Unity branch after git clone.

To import, optimize and build the stable diffusion model:

By default, build.py takes apple/m2-gpu as the build target:

python3 build.py

You can also specify a CUDA target via

python3 build.py --target cuda

To deploy the model locally with native GPU runtime:

You can substitute the prompt with your own, and optionally use --negative-prompt "Your negative prompt" to specify a negative prompt.

python3 deploy.py --prompt "A photo of an astronaut riding a horse on mars."

To deploy the model on the web with the WebGPU runtime, please refer to the last section, “Deploy on web”, of the walkthrough notebook for the full instructions. We also provide the same plain list of instructions here:

  • Instructions

 

First, let’s install all the prerequisites:

  1. emscripten. It is an LLVM-based compiler that compiles C/C++ source code to WebAssembly.
    • Follow the installation instructions to install the latest emsdk.
    • Source emsdk_env.sh by source path/to/emsdk_env.sh, so that emcc is reachable from PATH and the command emcc works.
  2. Rust.
  3. wasm-pack. It helps build Rust-generated WebAssembly, which is used for the tokenizer in our case.
  4. Install jekyll by following the official guides. It is the package we use for the website.
  5. Install jekyll-remote-theme by command
gem install jekyll-remote-theme
  6. Install Chrome Canary. It is a developer version of Chrome that enables the use of WebGPU.

We can verify a successful installation by trying out emcc, jekyll, and wasm-pack in the terminal.

Then, prepare all the necessary dependencies for web build:

./scripts/prep_deps.sh

We can now build the model for the WebGPU backend and export the executable to disk in the WebAssembly file format, by running

python3 build.py --target webgpu

The last thing to do is setting up the site with

./scripts/local_deploy_site.sh

With the site set up, you can go to localhost:8888/web-stable-diffusion/ in Chrome Canary to try out the demo on your local machine. Don’t forget to use

/Applications/Google\ Chrome\ Canary.app/Contents/MacOS/Google\ Chrome\ Canary --enable-dawn-features=disable_robustness

to launch Chrome Canary to turn off the robustness check from Chrome.

How?

The key technology here is machine learning compilation (MLC). Our solution is built on the shoulders of the open-source ecosystem, including PyTorch, Hugging Face diffusers and tokenizers, Rust, wasm, and WebGPU. The main flow is built on Apache TVM Unity, an exciting ongoing development in the Apache TVM project.

  • We take Runway’s Stable Diffusion v1-5 models from the Hugging Face diffusers library.
  • We use TorchDynamo and Torch FX to capture key model components into an IRModule in TVM (a minimal sketch of this capture step follows this list).
  • Each function in TVM’s IRModule can be further transformed into runnable code that can be deployed universally on any environment supported by the minimum TVM runtime (JavaScript being one of them).
  • TensorIR and MetaSchedule are used to build automated solutions to generate optimized programs. These transformations are tuned on a specific device through native GPU runtimes and then used to generate optimized GPU shaders. We provide a database that records these transformations so new builds can be done without tuning.
  • We build static memory planning optimizations to reuse memory across multiple layers.
  • We use Emscripten and TypeScript to build a TVM web runtime that can deploy the generated modules.
  • We also leverage the wasm port of the Rust tokenizers library from Hugging Face.
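
To make the capture step more concrete, here is a minimal, hypothetical sketch (our own illustration, not this project’s code) of how Torch FX traces a toy module into a graph; the real pipeline feeds such a captured graph into TVM’s IRModule importer, and TinyBlock is an assumed stand-in for a model component.

import torch
import torch.fx


class TinyBlock(torch.nn.Module):
    # Hypothetical stand-in for one stable diffusion component.
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(16, 16)

    def forward(self, x):
        return torch.relu(self.linear(x))


# Symbolic tracing turns the module into an FX GraphModule whose nodes
# (linear, relu) a compiler front end can consume one by one.
gm = torch.fx.symbolic_trace(TinyBlock())
print(gm.graph)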

workflow

All parts of this workflow are done in Python, except, of course, the last part, which builds a 400-line JavaScript app that connects everything together. This interactive development style also makes it fun to bring in new models.

All these are made possible by the open-source ecosystem that we leverage. Specifically, we make heavy use of TVM Unity, an exciting recent development in the TVM project that enables such a Python-first interactive MLC development experience, allowing us to easily compose new optimizations, all in Python, and incrementally bring our app to the web. TVM Unity also provides an easy way to compose new solutions in the ecosystem. For example, we can easily bring other WebGPU shader generators or shader libraries into this workflow in the future.

Comparison with Native GPU Runtime, Limitations, and Opportunities

Besides the WebGPU runtime, we also provide options for native deployment with local GPU runtime. These options can be used both as a tool to deploy on a native environment as well as a reference point to compare native GPU driver performance and WebGPU.

WebGPU works by translating WGSL (WebGPU Shading Language) shaders to native shaders. So, in theory, we can reach a zero gap between the WebGPU runtime and the native environment. If we directly use Chrome to check the current demo on Apple silicon, however, we find a performance degradation of about 3x. This is because Chrome’s WebGPU implementation inserts bound clips for all array index accesses, such that a[i] becomes a[min(i, a.size)]. Ideally, downstream shader compilers should be able to optimize the bound clipping out, but unfortunately that is not the case here. This gap can be closed once the WebGPU implementation becomes more mature, checks the index access range, and drops such clipping.

You can work around this by launching Chrome with a special flag (thanks to the Dawn developers for providing the pointers): exit Chrome completely, then in the command line type

/path/to/chrome-canary --enable-dawn-features=disable_robustness

Then you will find that the execution speed is as fast as the native GPU environment. We anticipate this problem will get resolved as WebGPU matures.

We are just seeing the dawn of what we believe to be an eruption. WebGPU is still evolving (though it is getting close to shipping this year), is only available through Chrome Canary, and can be unstable. It also still comes with limitations, such as FP32-only support (the FP16 shader extension is in the spec but not yet implemented). The stable diffusion demo here requires a GPU with a decent amount of RAM (8GB). We have only tested our solution on Apple silicon so far. There are also opportunities to support advanced optimizations such as FlashAttention and quantization to further improve the performance of the system.

These are opportunities to bring severalfold performance improvements to the current solution. We believe many of them can be tackled in the near future. A single component of this solution can still be useful on its own; for example, one can choose to deploy just the text encoder part of the model. Additionally, the same Python-first development and universal deployment workflow can be used to bring ML models to other environments, such as new hardware or mobile cases. Finally, the same machine learning compilation stack is shared with server-class use cases and can be used to optimize server workloads as well.

Acknowledgement

This project is made possible thanks to collaboration with

CMU School of Computer Science Catalyst MLC OctoML

This project is only possible thanks to the shoulders of the open-source ecosystems that we stand on. We want to thank the Apache TVM community and the developers of the TVM Unity effort. We want to thank the open-source ML community members who make these models publicly available, and the PyTorch and Hugging Face communities that make these models accessible. We would like to thank Mithril Security for the wasm port of the tokenizers library. We also would like to thank the WebAssembly, Emscripten, Rust, and WebGPU communities. Finally, thanks to the Dawn developers, who provided timely answers to questions on Chrome.


Download Details:

Author: mlc-ai
Source Code: https://github.com/mlc-ai/web-stable-diffusion 
License: Apache-2.0 license

#jupyternotebook #deeplearning #webassembly #webgpu #stable 

Bringing Stable Diffusion Models to Web Browsers

Stanford_alpaca: An Instruction-following LLaMA Model

Stanford Alpaca: An Instruction-following LLaMA Model  

This is the repo for the Stanford Alpaca project, which aims to build and share an instruction-following LLaMA model. The repo contains the 52K data used for fine-tuning, the code for generating the data, and the code for fine-tuning the model.

Overview

The current Alpaca model is fine-tuned from a 7B LLaMA model [1] on 52K instruction-following data generated by the techniques in the Self-Instruct [2] paper, with some modifications that we discuss in the next section. In a preliminary human evaluation, we found that the Alpaca 7B model behaves similarly to the text-davinci-003 model on the Self-Instruct instruction-following evaluation suite [2].

Alpaca is still under development, and there are many limitations that have to be addressed. Importantly, we have not yet fine-tuned the Alpaca model to be safe and harmless. We thus encourage users to be cautious when interacting with Alpaca, and to report any concerning behavior to help improve the safety and ethical considerations of the model.

Our initial release contains the data generation procedure, dataset, and training recipe. We intend to release the model weights if we are given permission to do so by the creators of LLaMA. For now, we have chosen to host a live demo to help readers better understand the capabilities and limits of Alpaca, as well as to help us better evaluate Alpaca's performance across a broader audience.

Please read our release blog post for more details about the model, our discussion of the potential harm and limitations of Alpaca models, and our thought process for releasing a reproducible model.

[1]: LLaMA: Open and Efficient Foundation Language Models. Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, Guillaume Lample. https://arxiv.org/abs/2302.13971v1

[2]: Self-Instruct: Aligning Language Model with Self Generated Instructions. Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A. Smith, Daniel Khashabi, Hannaneh Hajishirzi. https://arxiv.org/abs/2212.10560

Data Release

alpaca_data.json contains the 52K instruction-following examples we used for fine-tuning the Alpaca model. This JSON file is a list of dictionaries; each dictionary contains the following fields:

  • instruction: str, describes the task the model should perform. Each of the 52K instructions is unique.
  • input: str, optional context or input for the task. For example, when the instruction is "Summarize the following article", the input is the article. Around 40% of the examples have an input.
  • output: str, the answer to the instruction as generated by text-davinci-003.

We used the following prompts for fine-tuning the Alpaca model:

  • for examples with a non-empty input field:
Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Input:
{input}

### Response:
  • for examples with an empty input field:
Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Response:

During inference (e.g., for the web demo), we use the user instruction with an empty input field (second option).
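
As an illustration, here is a small hypothetical helper (not a utility from the repo) that picks between the two templates above based on whether the input field is empty:

PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:"
)
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:"
)


def format_example(example: dict) -> str:
    # Use the input-bearing template only when the example carries context.
    template = PROMPT_WITH_INPUT if example.get("input") else PROMPT_NO_INPUT
    return template.format(**example)


print(format_example({"instruction": "Name three primary colors.", "input": ""}))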

Data Generation Process

Running the code

We built on the data generation pipeline from self-instruct and made the following modifications:

  • We used text-davinci-003 to generate the instruction data instead of davinci.
  • We wrote a new prompt (prompt.txt) that explicitly states the requirements of instruction generation to text-davinci-003. Note: there is a slight error in the prompt we used, and future users should incorporate the edit in #24
  • We adopted much more aggressive batch decoding, i.e., generating 20 instructions at once, which significantly reduced the cost of data generation.
  • We simplified the data generation pipeline by discarding the difference between classification and non-classification instructions.
  • We only generated a single instance for each instruction, instead of 2 to 3 instances as in [1].

This produced an instruction-following dataset with 52K examples obtained at a much lower cost (less than $500). In a preliminary study, we also found our 52K generated data to be much more diverse than the data released by self-instruct. We plot the figure below (in the style of Figure 2 in the self-instruct paper) to demonstrate the diversity of our data. The inner circle of the plot represents the root verb of the instructions, and the outer circle represents the direct objects.

Fine-tuning

We fine-tune our models using standard Hugging Face training code with the following hyperparameters:

Hyperparameter    Value
Batch size        128
Learning rate     2e-5
Epochs            3
Max length        512
Weight decay      0

Since Hugging Face has not yet officially supported the LLaMA models, we fine-tuned LLaMA with Hugging Face's transformers library by installing it from a particular fork (i.e., a PR yet to be merged). The hash of the specific commit we installed was 68d640f7c368bcaaaecfc678f11908ebbd3d6176.

To reproduce our fine-tuning runs for LLaMA, first install the requirements

pip install -r requirements.txt

Then, install the particular fork of Hugging Face's transformers library.

Below is a command that fine-tunes LLaMA-7B with our dataset on a machine with 4 A100 80G GPUs in FSDP full_shard mode. We were able to reproduce a model of similar quality as the one we hosted in our demo with the following command using Python 3.10. Replace <your_random_port> with a port of your own, <your_path_to_hf_converted_llama_ckpt_and_tokenizer> with the path to your converted checkpoint and tokenizer (following instructions in the PR), and <your_output_dir> with where you want to store your outputs.

torchrun --nproc_per_node=4 --master_port=<your_random_port> train.py \
    --model_name_or_path <your_path_to_hf_converted_llama_ckpt_and_tokenizer> \
    --data_path ./alpaca_data.json \
    --bf16 True \
    --output_dir <your_output_dir> \
    --num_train_epochs 3 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 8 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 2000 \
    --save_total_limit 1 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'LLaMADecoderLayer' \
    --tf32 True

Warning

fsdp_transformer_layer_cls_to_wrap must be set to the name of the specific decoder layer. The LLaMA Hugging Face PR is not stable. Earlier commits used the name LLaMADecoderLayer for their decoder layer (the commit our code is based on uses this name). More recent commits use LlamaDecoderLayer (note the lowercase difference). Not setting fsdp_transformer_layer_cls_to_wrap to the correct name will lead to drastic slowdowns in training.

Side notes

The same script also works for OPT fine-tuning. Here's an example for fine-tuning OPT-6.7B

torchrun --nproc_per_node=4 --master_port=<your_random_port> train.py \
    --model_name_or_path "facebook/opt-6.7b" \
    --data_path ./alpaca_data.json \
    --bf16 True \
    --output_dir <your_output_dir> \
    --num_train_epochs 3 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 8 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 2000 \
    --save_total_limit 1 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'OPTDecoderLayer' \
    --tf32 True

Note that the given training script is meant to be simple and easy to use, and is not particularly optimized. To run on more GPUs, you may prefer to turn down gradient_accumulation_steps to keep a global batch size of 128. The global batch size has not been tested for optimality.

Authors

All grad students below contributed equally and the order is determined by random draw.

All advised by Tatsunori B. Hashimoto. Yann is also advised by Percy Liang and Xuechen is also advised by Carlos Guestrin.

Citation

Please cite the repo if you use the data or code in this repo.

@misc{alpaca,
  author = {Rohan Taori and Ishaan Gulrajani and Tianyi Zhang and Yann Dubois and Xuechen Li and Carlos Guestrin and Percy Liang and Tatsunori B. Hashimoto},
  title = {Stanford Alpaca: An Instruction-following LLaMA model},
  year = {2023},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/tatsu-lab/stanford_alpaca}},
}

Naturally, you should also cite the original LLaMA paper [1] and the Self-Instruct paper [2].

Acknowledgements

We thank Yizhong Wang for his help in explaining the data generation pipeline in Self-Instruct and providing the code for the parse analysis plot. We thank Yifan Mai for helpful support, and members of the Stanford NLP Group as well as the Center for Research on Foundation Models (CRFM) for their helpful feedback.


Download Details:

Author: tatsu-lab
Source Code: https://github.com/tatsu-lab/stanford_alpaca 
License: Apache-2.0 license

#python #deeplearning #language #model #follow 

Stanford_alpaca: An Instruction-following LLaMA Model

DLL: Fast Deep Learning Library (DLL) for C++

Deep Learning Library (DLL) 1.1

DLL is a library that aims to provide a C++ implementation of Restricted Boltzmann Machines (RBM) and Deep Belief Networks (DBN), as well as their convolutional versions. It also supports some more standard neural networks.

Features

  • Restricted Boltzmann Machine
    • Various units: Stochastic binary, Gaussian, Softmax and nRLU units
    • Contrastive Divergence and Persistence Contrastive Divergence
      • CD-1 learning by default
    • Momentum
    • Weight decay
    • Sparsity target
    • Train as Denoising autoencoder
  • Convolutional Restricted Boltzmann Machine
    • Standard version
    • Version with Probabilistic Max Pooling (Honglak Lee)
    • Binary and Gaussian visible units
    • Binary and ReLU hidden units for the standard version
    • Binary hidden units for the Probabilistic Max Pooling version
    • Training with CD-k or PCD-k (only for standard version)
    • Momentum, Weight Decay, Sparsity Target
    • Train as Denoising autoencoder
  • Deep Belief Network
    • Pretraining with RBMs
    • Fine tuning with Conjugate Gradient
    • Fine tuning with Stochastic Gradient Descent
    • Classification with SVM (libsvm)
  • Convolutional Deep Belief Network
    • Pretraining with CRBMs
    • Classification with SVM (libsvm)
  • Input data
    • Input data can be either in containers or in iterators
      • Even if iterators are supported for the SVM classifier, libsvm will move all the data into its in-memory structure.

Building

Note: When you clone the library, you need to clone the sub modules as well, using the --recursive option.

The folder include must be included with the -I option, as well as the etl/include folder.

This library is completely header-only, there is no need to build it.

However, this library makes extensive use of C++11 and C++14, therefore, a recent compiler is necessary to use it. Currently, this library is only tested with g++ 9.3.0.

If for some reason it does not work on one of the supported compilers, contact me and I'll fix it. It should work fine on recent versions of clang.

This has never been tested on Windows. While it should compile on Mingw, I don't expect Visual Studio to be able to compile it for now, although VS 2017 sounds promising. If you have problems compiling this library, I'd be glad to help, but cannot guarantee that this will work on other compilers.

If you want to use the GPU, you should use CUDA 8.0 or later and CUDNN 5.0.1 or later. I haven't tried other versions, but lower versions of CUDA, such as 7, should work, as should higher versions. If you have issues with different versions of CUDA and CUDNN, please open an issue on GitHub.


Download Details:

Author: Wichtounet
Source Code: https://github.com/wichtounet/dll 
License: MIT license

#machinelearning #cpluplus #performance #cpu #deeplearning

DLL: Fast Deep Learning Library (DLL) for C++

How to Build a Deep CNN Image Classifier with Any Images

So...you wanna build your own image classifier eh? Well in this tutorial you're going to learn how to do exactly that...FROM SCRATCH using Python, Tensorflow and Keras. But best yet, you can do it on virtually any dataset. Go on, give it a go!

Chapters
0:00 - Start
0:28 - Explainer 
1:19 - PART 1: Building a Data Pipeline
3:08 - Installing Dependencies
8:30 - Getting Data from Google Images
23:12 - Load Data using Keras Utils
33:22 - PART 2: Preprocessing Data
35:56 - Scaling Images
42:23 - Partitioning the Dataset
47:34 - PART 3: Building the Deep Neural Network
48:21 - Build the Network
1:02:32 - Training the DNN
1:06:37 - Plotting Model Performance
1:09:50 - PART 4: Evaluating Performance
1:10:38 - Evaluating on the Test Partition
1:13:59 - Testing on New Data 
1:20:39 - PART 5: Saving the Model
1:21:08 - Saving the model as h5 file
1:24:43 - Wrap Up

Get the Code https://github.com/nicknochnack/ImageClassification   

Links
Sigmoid Activation: https://en.wikipedia.org/wiki/Sigmoid_function 
Relu Activation: https://en.wikipedia.org/wiki/Rectifier_(neural_networks) 
Image Downloader Extension: https://chrome.google.com/webstore/detail/download-all-images/ifipmflagepipjokmbdecpmjbibjnakm?hl=en 
Conv2D Layer: https://www.tensorflow.org/api_docs/python/tf/keras/layers/Conv2D 
MaxPooling Layer: https://keras.io/api/layers/pooling_layers/max_pooling2d/ 

Subscribe: https://www.youtube.com/@NicholasRenotte/featured 

#machinelearning #deeplearning #python  

How to Build a Deep CNN Image Classifier with Any Images

Data Science Job Interview – Full Mock Interview

This full-length interview will show you what a data science interview is like. This is a great video for anyone currently in the job-market for a data-focused role. It is also a solid video for anyone who wants a better understanding of the machine learning process. They cover topics that include building a dataset for training/testing purposes, feature vectorization, and model implementation details. Consider pausing after the question and thinking about how you would answer them.

⭐️ Contents ⭐️
⌨️ (0:00:00) Video overview & format
⌨️ (0:02:13) Introductory Behavioral questions
⌨️ (0:07:46) Social media platform bot issue task overview
⌨️ (0:15:26) What are some features we should investigate regarding the bot issue?
⌨️ (0:25:02) Classification model implementation details (using feature vectors)
⌨️ (0:41:38) What would a dataset to train models to detect bots look like? How would you approach collecting this data?
⌨️ (0:51:38) Technical implementation details (python libraries, cloud services, etc)
⌨️ (0:56:01) Any questions for me?
⌨️ (1:03:42) Post-interview breakdown & analysis

#datascience #machinelearning #deeplearning #ai #artificialintelligence 

Data Science Job Interview – Full Mock Interview

The First Open-Source Implementation of DeepMind's AlphaTensor

The first open-source implementation of AlphaTensor has been released, opening the door to new developments that could revolutionize the computational performance of deep learning models.

Matrix multiplication is a fundamental operation used in many systems, from neural networks to scientific computing routines. Finding efficient and provably correct matrix multiplication algorithms can have a huge impact on making computation faster and more efficient, but it is a very challenging task. The space of possible algorithms is enormous, and traditional algorithm-discovery methods, such as human-designed heuristics or combinatorial search, are often suboptimal.

DeepMind's recently proposed AI-based automated search solution goes far beyond human intuition. The solution consists of a deep reinforcement learning agent called AlphaTensor, built on top of AlphaZero. The agent is trained to play a single-player game, TensorGame, whose goal is to find computationally efficient matrix multiplication algorithms.

AlphaTensor is particularly good at handling large matrices, decomposing large matrix multiplications into smaller ones. Moreover, AlphaTensor can be used to achieve state-of-the-art matrix multiplication performance once fine-tuned on a specific hardware device.

AlphaTensor has great potential for accelerating deep learning computation. In deep learning, many time-consuming operations can be mapped to matrix multiplications. By using AlphaTensor to optimize these operations, the overall performance of deep learning models can be significantly improved.

OpenAlphaTensor, the first open-source implementation of AlphaTensor, has recently been released, and it could revolutionize the computational power of deep learning models.

The Matrix Multiplication Tensor

For non-experts in matrix multiplication optimization, it may not be obvious how an operation such as matrix multiplication can be mapped to a three-dimensional tensor. I will try to explain it in simple words and with examples.

Let's consider the product C = A * B, where for simplicity both A and B are square matrices of size N. The multiplication operation can be mapped to a 3D tensor of shape (N^2, N^2, N^2). The first tensor dimension represents the flattened matrix A, the second dimension the flattened matrix B, and the third dimension the flattened matrix C.

Each entry of the tensor is binary (either 1 or 0). Note that the tensor represents the multiplication operation itself, so it is independent of the values of the matrices A and B.

Each tensor entry corresponds to a coefficient of the operation. For example, to compute C[1,1], both A[1,1] and B[1,1] must be multiplied. Therefore, the tensor entry [0,0,0], which corresponds to A[1,1], B[1,1], and C[1,1], has value 1. In contrast, A[2,1] is not needed to compute C[1,1]. Thus, the tensor row T[N+1, :, 0] contains only zeros.
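
To make this mapping concrete, here is a minimal sketch (our own illustration, not code from the paper) that builds the multiplication tensor for N=2 and verifies that contracting it with the flattened A and B reproduces A @ B:

import torch

N = 2
T = torch.zeros(N * N, N * N, N * N)
for i in range(N):
    for j in range(N):
        for k in range(N):
            # C[i, j] accumulates A[i, k] * B[k, j], so the corresponding
            # (flattened A, flattened B, flattened C) entry is set to 1.
            T[i * N + k, k * N + j, i * N + j] = 1.0

A = torch.rand(N, N)
B = torch.rand(N, N)
C = torch.einsum("abc,a,b->c", T, A.flatten(), B.flatten()).reshape(N, N)
assert torch.allclose(C, A @ B)  # the tensor encodes matrix multiplication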

The image below shows an example of the tensor for N=2.
 

[Figure: an example of the matrix multiplication tensor for N=2. Image from the DeepMind paper published in Nature.]

As shown in (b) and (c) in the figure above, an algorithm for computing the product can be implemented from a decomposition of the 3D tensor. More specifically, the algorithm below can be used to convert a tensor decomposition (the matrices U, V, W) into a matrix multiplication algorithm.

 
 

[Figure: the parameterized meta-algorithm for computing the matrix product C = AB, presented in the DeepMind paper.]

TensorGame

The problem of finding efficient matrix multiplication algorithms is extremely challenging because the number of possible algorithms to consider is far larger than the number of atoms in the universe, even for small cases of matrix multiplication.

DeepMind turned this problem into a single-player game and called it TensorGame. In this game, the player chooses how to combine different entries of the matrices in order to multiply them. A score is assigned based on the number of operations required to reach the correct multiplication result. The game ends when the zero tensor is reached or when the maximum number of moves has been made. The final factorization is evaluated based on an estimate of the residual rank and certain optimization criteria, such as asymptotic time complexity or practical runtime.

The initial position in TensorGame corresponds to the matrix multiplication tensor expressed in some random basis.

At each step t of the game, the player writes down three vectors (u^(t), v^(t), w^(t)), which define the rank-1 tensor u^(t) ⊗ v^(t) ⊗ w^(t). The game state is updated by subtracting the rank-1 tensor chosen by the player:

S_t = S_{t-1} - u^(t) ⊗ v^(t) ⊗ w^(t)

where S_0 is the matrix multiplication tensor.

If the game ends in p steps, it means that the matrix multiplication tensor can be decomposed into the p rank-1 tensors u^(t) ⊗ v^(t) ⊗ w^(t), i.e., its rank is at most p.

TensorGame can then be interpreted as a rank-decomposition algorithm, and AlphaTensor can be seen as an algorithm for estimating the rank of a tensor.
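
As a sanity check of the game mechanics, the sketch below (our own illustration) plays the trivial strategy on the N=2 tensor from the earlier snippet: each move subtracts one rank-1 term u ⊗ v ⊗ w, and the zero tensor is reached after N^3 = 8 moves, the rank of the standard algorithm (Strassen's algorithm needs only 7).

import torch

N = 2
S = torch.zeros(N * N, N * N, N * N)
for i in range(N):
    for j in range(N):
        for k in range(N):
            S[i * N + k, k * N + j, i * N + j] = 1.0


def one_hot(idx: int, size: int = N * N) -> torch.Tensor:
    e = torch.zeros(size)
    e[idx] = 1.0
    return e


moves = 0
for i in range(N):
    for j in range(N):
        for k in range(N):
            # One move: subtract the rank-1 tensor u ⊗ v ⊗ w from the state.
            u, v, w = one_hot(i * N + k), one_hot(k * N + j), one_hot(i * N + j)
            S -= torch.einsum("i,j,k->ijk", u, v, w)
            moves += 1

assert torch.all(S == 0)
print(f"game solved in {moves} moves")  # 8 = N^3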

The AlphaTensor Architecture

So far we have learned about TensorGame and clarified how its solution can be seen as a matrix multiplication algorithm. Let's now look at the main concepts of AlphaTensor, the algorithm used to play the game.

The AlphaTensor architecture is basically an encoder-decoder transformer architecture, where:

  • the encoder takes as input the game state S_t, the n previous actions taken by the model (usually n=7), and the time index t of the current action. The information is stacked into a tensor of shape (n+1, N^2, N^2, N^2). This tensor is then reshaped and transformed (using three linear layers) into a tensor of shape (N^2, N^2, c), where c is the model's internal dimension.
  • the decoder generates the n_steps actions autoregressively from the embedded vector produced by the encoder. Each action corresponds to a token of the triplet (u, v, w), representing one of the triplets decomposing the game tensor (i.e., reducing its rank).

The model is trained by alternating backpropagation and model acting. The acting is used to generate data that is then used to train the model. In practice, the model is trained with a mixture of synthetically generated data and data generated by the model while playing. The acting step is performed by taking a 3D tensor corresponding to a matrix operation and playing n_actors games on it. Each actor plays a game either in the standard basis or in an alternative basis (a change of basis is applied with a given probability). The results are then collected and can be used in the training step together with the synthetic data.

The acting step is based on AlphaZero's Monte Carlo Tree Search (MCTS), modified to support large action spaces. In short, before choosing an action, n_sims paths are explored starting from the model output, with a maximum future exploration of 5 steps. The probabilities generated by the model are then adjusted to take the generated paths into account. The action with the most promising future path is then chosen to continue the game.

When training the model, the reward is actually a negative reward (a penalty). Its absolute value grows with every extra step required to solve the game. If the model takes m steps to solve a TensorGame, the reward associated with the game is r = -m. If the model is not able to solve the TensorGame within max_rank steps, the reward is computed by estimating the rank of the remaining tensor. The rank is estimated as the sum of the ranks of the matrices composing the tensor; this estimate is an upper bound on the true rank of the tensor.
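
As a hedged sketch of that rank estimate (mirroring the torch.linalg.matrix_rank(state).sum() expression that appears in the simulation code later):

import torch


def residual_rank_upper_bound(S: torch.Tensor) -> int:
    # torch.linalg.matrix_rank applied to a 3D tensor returns the rank of
    # each matrix slice along the first dimension; their sum upper-bounds
    # the tensor rank of S and gives the terminal penalty magnitude.
    return int(torch.linalg.matrix_rank(S).sum())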

When fine-tuning the model, the terminal-state penalty should also take into account the latency of the algorithm produced by the model. The reward formula becomes r_t' = r_t + λ b_t, where r_t is the reward scheme described earlier, b_t is a benchmark reward (non-zero only in the terminal state), and λ is a user-defined coefficient.

 
 

[Figure: speedup (%) of the AlphaTensor-discovered algorithms tailored for GPU and TPU, taken from the DeepMind paper. The speedup is measured relative to standard matrix multiplication (e.g., cuBLAS for GPU) on the same hardware, and the algorithms are compared against the Strassen-square algorithm. Source: DeepMind.]

An Open-Source Implementation of DeepMind's AlphaTensor

I recently released OpenAlphaTensor, the first open-source implementation of AlphaTensor. In this section, I will walk through the implementation. As we discussed earlier, the AlphaTensor architecture is fairly simple and is based on a standard transformer with an encoder-decoder architecture. The most interesting components of AlphaTensor are the first layer in the encoder part and the way actions are sampled.

Let's start with the first encoding layer.

# x.size = (N, T, S, S, S): batch, history length T, and S = N^2
# scalars.size = (N, s)
batch_size = x.shape[0]
S = x.shape[-1]
T = x.shape[1]
# Build three copies of the input, each flattening the history dimension
# together with a different one of the three tensor dimensions.
x1 = x.permute(0, 2, 3, 4, 1).reshape(batch_size, S, S, S * T)
x2 = x.permute(0, 4, 2, 3, 1).reshape(batch_size, S, S, S * T)
x3 = x.permute(0, 3, 4, 2, 1).reshape(batch_size, S, S, S * T)
input_list = [x1, x2, x3]
for i in range(3):
    # Embed the scalars into an (S, S, 1) slab, concatenate it to each copy,
    # then project to the model's internal dimension.
    temp = self.linears_1[i](scalars).reshape(batch_size, S, S, 1)
    input_list[i] = torch.cat([input_list[i], temp], dim=-1)
    input_list[i] = self.linears_2[i](input_list[i])
x1, x2, x3 = input_list

In the snippet above, we show how the input tensor is decomposed into three tensors, which are then used as the query, key, and value inputs of the transformer layer.

  1. Along the three tensor dimensions representing the flattened matrices (A, B, C), the input tensor is flattened along each dimension together with the dimension representing the previous actions. Thus, in each flattened copy of the input tensor, the selected dimension is an aggregation of the last T-1 values and the current value, for all S values of the selected dimension, where S=N^2. Philosophically, it is as if, for each dimension, we focus on what happened in the previous actions along that dimension.
  2. The scalars are mapped into three different spaces of dimension S^2, and then reshaped to be concatenated with the tensors obtained in the previous step. Conceptually, the scalars are mapped to an embedding space of dimension S^2, and the embedded information is then split into S vectors and stacked together, similarly to what happens to text when it is tokenized.
  3. The scalar tokens are concatenated with the restructured input tensor and then passed as input to a linear layer that maps the per-channel scalars + history focus information to the model's internal dimension.

These three steps can be interpreted as a way of giving the model both information about the scalars (such as the TensorGame time step) and a focus on the previous actions for each channel.

As for the way actions are produced, it is interesting to note that AlphaTensor outputs a triplet (u, v, w) whose purpose is to reduce the tensor rank. The three vectors have size S, and since they are concatenated, the model has to produce a vector of size 3*S. AlphaTensor is trained with an RL algorithm, so all possible actions must be expressed as probabilities over an enumerated space, i.e., the model outputs a probability for each possible action. This means that each vector in the 3S space must be mapped to a different action, which leads to an action space of size |F|^(3S), where |F| is the number of different values an element of u, v, w can take. Usually, the values are restricted to (-2, -1, 0, 1, 2), giving a cardinality of 5.

Here a major problem arises: to generate the action probabilities for a product of matrices of size 5, we would need 5^75 * 4 bytes of memory, which means ~10^44 GB. Clearly, we cannot manage such a large action space.

How do we solve the problem? To reduce the memory footprint of the action probabilities, we can split the triplets into smaller chunks, "tokenize" them, and treat the chunks as generated tokens in the transformer architecture, i.e., the tokens are fed to the decoder autoregressively. In the example above, we can split the triplets into 15 chunks, reducing memory consumption to 15 * 5^(75/15) * 4 bytes, i.e., 187.5 KB.
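
As a quick back-of-the-envelope check of those two numbers (our own arithmetic, not code from the repo):

import math

n_values = 5        # elements of u, v, w take values in (-2, -1, 0, 1, 2)
triplet_len = 75    # 3 * S with S = N^2 = 25 for N = 5
n_chunks = 15

full_bytes = n_values ** triplet_len * 4                       # one fp32 logit per action
chunk_bytes = n_chunks * n_values ** (triplet_len // n_chunks) * 4

print(f"full action space: ~10^{int(math.log10(full_bytes)) - 9} GB")  # ~10^44 GB
print(f"chunked logits: {chunk_bytes / 1000:.1f} KB")                  # 187.5 KB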

def _eval_forward(self, e: torch.Tensor):
    bs = e.shape[0]
    future_g = (
        torch.zeros((bs, self.n_samples, self.n_steps)).long().to(e.device)
    )
    ps = torch.ones((bs, self.n_samples)).to(e.device)
    e = e.unsqueeze(1).repeat(1, self.n_samples, 1, 1)

    future_g = future_g.view(-1, self.n_steps)
    ps = ps.view(-1)
    e = e.view(-1, e.shape[-2], e.shape[-1])
    for i in range(self.n_steps):
        o_s, z_s = self.core(future_g[:, : i + 1], e)
        future_g[:, i], p_i = sample_from_logits(o_s[:, i])
        ps *= p_i
    future_g = future_g.view(bs, self.n_samples, self.n_steps)
    ps = ps.view(bs, self.n_samples)
    return (
        future_g,
        ps,
        z_s[:, 0].view(bs, self.n_samples, *z_s.shape[2:]).mean(1),
    )

Above, we show the snippet that generates a full action. In the code, self.core contains the decoder layer, and the tensor e represents the output of the encoder layer. Zero can be considered as the <eos> token of NLP models, and the n_steps actions, representing the n_steps chunks, are generated progressively.

The model returns three quantities:

  1. The generated actions
  2. The probability of the full action
  3. The logits produced while generating the first action (the first chunk), which will be used to compute the model's value.

A few words should be said about the n_samples parameter. The parameter is used in the acting step and allows the model to generate different versions of the triplets, which are then used to explore the action space in the Monte Carlo Tree Search algorithm used in the acting process. The n_samples different actions are sampled according to the policy generated by the model.

The Acting Step

The trickiest part of the whole algorithm is probably the acting step used to solve the TensorGame. The algorithm is not explained in detail in the AlphaTensor paper, since it is based on several previous DeepMind papers that are simply cited and assumed known. Here, I will reconstruct all the missing pieces and explain our implementation step by step.

We can organize the acting step into three different components:

  • Monte Carlo Tree Search
  • Game simulation
  • Improved policy computation

Let's analyze them one by one.

Monte Carlo Tree Search (MCTS)

Monte Carlo Tree Search (MCTS) is a widely used artificial intelligence technique for game playing, particularly in board and video games. The algorithm builds a game tree that simulates potential moves and outcomes, and uses random sampling to estimate the expected reward of each move. It then iteratively selects the move with the highest expected reward and simulates outcomes until it reaches a terminal state or a given stopping condition. The simulations are used to estimate the winning probability of each move and to guide the decision-making process. MCTS has been shown to be effective in complex games where the number of possible moves and outcomes is large, and it has been used in successful game-playing AI systems such as AlphaGo.

AlphaTensor uses a modified version of the original MCTS. In particular, instead of picking the action at random from the whole action space, the action is selected among a subset generated directly by the model (via the n_samples introduced earlier). The policy-update correction is then applied in the improved-policy computation step.

In our implementation, we decided to store all the information about the Monte Carlo tree in a dictionary, keyed by a hashed version of the TensorGame state, with the information related to the state itself as values. Each Monte Carlo step starts from a node and simulates n_sim mini-games, exploring the future with a horizon of 5 moves. If the node has already been explored in previous simulations, n_sim is adjusted to account for the number of previous explorations. The number of visits of each node is stored in the N_s_a tensor, since this tensor contains the visit counts per child action of the node (among those sampled by the model).

from typing import Dict

import torch

# Helpers such as to_hash, extract_present_state, simulate_game,
# _recompose_possible_states and select_future_state are defined elsewhere
# in the OpenAlphaTensor codebase.


def monte_carlo_tree_search(
    model: torch.nn.Module,
    state: torch.Tensor,
    n_sim: int,
    t_time: int,
    n_steps: int,
    game_tree: Dict,
    state_dict: Dict,
):
    """Runs the monte carlo tree search algorithm.

    Args:
        model (torch.nn.Module): The model to use for the simulation.
        state (torch.Tensor): The initial state.
        n_sim (int): The number of simulations to run.
        t_time (int): The current time step.
        n_steps (int): The maximum number of steps to simulate.
        game_tree (Dict): The game tree.
        state_dict (Dict): The dictionary containing the states.
    """
    state_hash = to_hash(extract_present_state(state))
    if state_hash in state_dict:
        with torch.no_grad():
            N_s_a = state_dict[state_hash][3]
            # Discount simulations already spent on this node.
            n_sim -= int(N_s_a.sum())
            n_sim = max(n_sim, 0)

    for _ in range(n_sim):
        simulate_game(model, state, t_time, n_steps, game_tree, state_dict)
    # return next state
    possible_states_dict, _, repetitions, N_s_a, q_values, _ = state_dict[
        state_hash
    ]
    possible_states = _recompose_possible_states(possible_states_dict)
    next_state_idx = select_future_state(
        possible_states, q_values, N_s_a, repetitions, return_idx=True
    )
    next_state = possible_states[next_state_idx]
    return next_state

The code above shows our implementation of the algorithm. For code simplicity, the policy correction is performed in the simulate_game function.

Game Simulation

The simulate_game function is responsible for exploring the tree, which consists of nodes representing particular TensorGame states. It also runs the model whenever a leaf node is encountered, and stores all the node information in the state_dict dictionary. Let's take a closer look at its implementation:

@torch.no_grad()
def simulate_game(
    model,
    state: torch.Tensor,
    t_time: int,
    max_steps: int,
    game_tree: Dict,
    states_dict: Dict,
    horizon: int = 5,
):
    """Simulates a game from a given state.

    Args:
        model: The model to use for the simulation.
        state (torch.Tensor): The initial state.
        t_time (int): The current time step.
        max_steps (int): The maximum number of steps to simulate.
        game_tree (Dict): The game tree.
        states_dict (Dict): The states dictionary.
        horizon (int): The horizon to use for the simulation.
    """
    idx = t_time
    max_steps = min(max_steps, t_time + horizon)
    state_hash = to_hash(extract_present_state(state))
    trajectory = []
    # selection
    while state_hash in game_tree:
        (
            possible_states_dict,
            old_idx_to_new_idx,
            repetition_map,
            N_s_a,
            q_values,
            actions,
        ) = states_dict[state_hash]
        possible_states = _recompose_possible_states(possible_states_dict)
        state_idx = select_future_state(
            possible_states, q_values, N_s_a, repetition_map, return_idx=True
        )
        trajectory.append((state_hash, state_idx))  # state_hash, action_idx
        future_state = extract_present_state(possible_states[state_idx])
        state = possible_states[state_idx]
        state_hash = to_hash(future_state)
        idx += 1

    # expansion
    if idx <= max_steps:
        trajectory.append((state_hash, None))
        if not game_is_finished(extract_present_state(state)):
            state = state.to(model.device)
            scalars = get_scalars(state, idx).to(state.device)
            actions, probs, q_values = model(state, scalars)
            (
                possible_states,
                cloned_idx_to_idx,
                repetitions,
                not_dupl_indexes,
            ) = extract_children_states_from_actions(
                state,
                actions,
            )
            not_dupl_actions = actions[:, not_dupl_indexes].to("cpu")
            not_dupl_q_values = torch.zeros(not_dupl_actions.shape[:-1]).to(
                "cpu"
            )
            N_s_a = torch.zeros_like(not_dupl_q_values).to("cpu")
            present_state = extract_present_state(state)
            states_dict[to_hash(present_state)] = (
                _reduce_memory_consumption_before_storing(possible_states),
                cloned_idx_to_idx,
                repetitions,
                N_s_a,
                not_dupl_q_values,
                not_dupl_actions,
            )
            game_tree[to_hash(present_state)] = [
                to_hash(extract_present_state(fut_state))
                for fut_state in possible_states
            ]
            leaf_q_value = q_values
    else:
        leaf_q_value = -int(torch.linalg.matrix_rank(state).sum())
    # backup
    backward_pass(trajectory, states_dict, leaf_q_value=leaf_q_value)

Each simulation is divided into three parts:

  • Selection
  • Expansion
  • Backup

In the selection part, the simulation runs over the already generated tree nodes, and the next node is selected using the following function:

def select_future_state(
    possible_states: List[torch.Tensor],
    q_values: torch.Tensor,
    N_s_a: torch.Tensor,
    repetitions: Dict[int, list],
    c_1: float = 1.25,
    c_2: float = 19652,
    return_idx: bool = False,
) -> torch.Tensor:
    """Select the future state maximizing the upper confidence bound."""
    # q_values (1, K, 1)
    pi = torch.tensor(
        [
            len(repetitions[i])
            for i in range(len(possible_states))
            if i in repetitions
        ]
    ).to(q_values.device)
    ucb = q_values.reshape(-1) + pi * torch.sqrt(
        torch.sum(N_s_a) / (1 + N_s_a)
    ) * (c_1 + torch.log((torch.sum(N_s_a) + c_2 + 1) / c_2))
    if return_idx:
        return ucb.argmax()
    return possible_states[ucb.argmax()]

In practice, for a given state, the action maximizing the upper confidence bound

ucb(s, a) = Q(s, a) + π(s, a) · sqrt( Σ_b N(s, b) / (1 + N(s, a)) ) · ( c_1 + log( (Σ_b N(s, b) + c_2 + 1) / c_2 ) )

is selected (this is exactly the quantity computed in the snippet above, with c_1 = 1.25 and c_2 = 19652). Here, Q represents the q-values generated by the model, and π represents the sample distribution over the actions sampled using the model policy. N(s, a) represents the visit count of action a from node s.

Once the selection phase reaches a leaf node, if the simulation has not reached a terminal state (in terms of either maximum exploration, i.e., the future horizon, or the end of the game), the model is used to select n_samples alternative nodes (they will be leaf nodes in the following iteration). This is called the expansion phase, since new nodes are added to the tree. Then, no further node is explored in the current simulation, but the leaf q_value is sent to the following simulation phase: backup.

Backup is the final stage of each simulation. During backup, if the leaf node was in a terminal state, the final reward is computed; otherwise, the leaf q-value is used as an estimated reward. The reward is then back-propagated along the simulation trajectory, updating the state q_values and the visit counts N(s, a). In the snippet below, we show the code for the reward back-propagation.

def backward_pass(trajectory, states_dict, leaf_q_value: torch.Tensor):
    """Backward pass of the montecarlo algorithm"""
    reward = 0
    for idx, (state, action_idx) in enumerate(reversed(trajectory)):
        if action_idx is None:  # leaf node
            reward += leaf_q_value
        else:
            (
                _,
                old_idx_to_new_idx,
                _,
                N_s_a,
                q_values,
                _,
            ) = states_dict[state]
            if isinstance(reward, torch.Tensor):
                reward = reward.to(q_values.device)
            action_idx = int(action_idx)
            if action_idx in old_idx_to_new_idx:
                not_dupl_index = old_idx_to_new_idx[int(action_idx)]
            else:
                not_dupl_index = action_idx
            reward -= 1
            q_values[:, not_dupl_index] = (
                N_s_a[:, not_dupl_index] * q_values[:, not_dupl_index] + reward
            ) / (N_s_a[:, not_dupl_index] + 1)
            N_s_a[:, not_dupl_index] += 1

Improved Policy Computation

Once all the simulations have been run and MCTS offers an interesting snapshot of the near future, it is time to update the policies associated with the predicted nodes and return them so they can be used during training. The improved policy, following the method described in Hubert et al., is used to handle large action spaces. In fact, for small search spaces, during MCTS one could randomly sample an action from the action space and evaluate its impact. A similar approach in a much larger action space would cause all trajectories to diverge along different paths, and an infinite number of trajectories would be needed to obtain meaningful statistics and then update the policy. Since here we use sample-MCTS to avoid this variance, i.e., n_samples actions are sampled according to the model policy and MCTS then just selects one of the sampled actions while exploring the tree, we need to take the sampling correction into account when computing the final updated policy that will be used for training the model.

In practice, the improved policy is computed as

π'(s, a) = N(s, a)^(1/τ) / Σ_b N(s, b)^(1/τ)

where

τ = log(Σ_b N(s, b)) / log(N̄) if Σ_b N(s, b) > N̄, and τ = 1 otherwise,

with N̄ (N_bar in the code below) a user-defined visit-count threshold.
 

def compute_improved_policy(
    state_dict: Dict,
    states: List[str],
    model_n_steps: int,
    model_n_logits: int,
    N_bar: int,
):
    """Compute the improved policy given the state_dict and the list of states.
    The improved policy is computed as (N_s_a / N_s_a.sum())^(1/tau) where tau
    is (log(N_s_a.sum()) / log(N_bar)) if N_s_a.sum() > N_bar else 1.
    """
    policies = torch.zeros(len(states), model_n_steps, model_n_logits)
    N_bar = torch.tensor(N_bar)
    for idx, state in enumerate(states):
        N_s_a = state_dict[state][3]
        actions = state_dict[state][5]
        if N_s_a.sum() > N_bar:
            tau = (torch.log(N_s_a.sum()) / torch.log(N_bar)).item()
        else:
            tau = 1
        N_s_a = N_s_a ** (1 / tau)
        improved_policy = N_s_a / N_s_a.sum()
        for sample_id in range(actions.shape[1]):
            action_ids = actions[0, sample_id]
            for step_id, action_id in enumerate(action_ids):
                policies[idx, step_id, action_id] += improved_policy[
                    0, sample_id
                ]
    return policies

Note that in our implementation, after computing the policy from the N_s_a tensor, we have to map it back to the original action tensor. In fact, N_s_a only considers the actions sampled by the model, whereas the final policy must contain probabilities for the unexplored actions as well.

Differences from ChatGPT's Training Algorithm

AlphaTensor is the latest member of DeepMind's AlphaGo/AlphaZero family of AI methods. These methods are based on the Monte Carlo Tree Search (MCTS) algorithm, which DeepMind has refined and improved to tackle increasingly complex tasks. Another AI system, OpenAI's ChatGPT, which has caused a lot of buzz due to its remarkable performance, was trained with a different approach, called Reinforcement Learning with Human Feedback (RLHF).

RLHF is a fine-tuning technique used to align language models with a set of written instructions. It uses human preferences as a reward signal to fine-tune the model, thereby aligning the language model's behavior with the stated preferences of a specific group of people rather than with some broader notion of "human values".

In contrast, MCTS is a tree-based search algorithm used to determine the optimal moves in games. It simulates possible moves and updates the value of each move based on the outcomes, helping to select the best move.

RLHF collects data from human-written demonstrations and human-labeled comparisons between AI models, and trains a reward model to predict the preferences of a given group of people. The reward model is then used to fine-tune the AI models. MCTS, on the other hand, uses simulation and evaluation to determine the best decision.

Although they are different approaches, RLHF and MCTS also share similarities. Both AI techniques employ decision-making and problem-solving methods, and both use trial and error to explore different options and make decisions based on the available information. Both are also iterative processes that improve over time as more information and experience is gathered.

Выбор между RLHF и MCTS зависит от поставленной задачи. RLHF идеально подходит, когда нет четкой метрики для оценки производительности модели, в то время как MCTS доказал свою эффективность в игровых задачах, где знание и исследование будущего дают модели значительное преимущество.

Code optimizations for AlphaTensor training

Implementing the AlphaTensor training algorithm requires finding the right trade-off between training speed and memory consumption. As seen in the model section, simply tokenizing the actions can save a lot of memory, but an overly aggressive reduction of the action space can lead to both a loss of accuracy and a loss of performance. The latter happens because all tokens are generated sequentially, in an autoregressive way, by the model decoder. Inference time therefore grows linearly with the number of tokens per action once the softmax over the action space is no longer the bottleneck.
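
To make the trade-off concrete, here is a small back-of-the-envelope script. The constants are the ones discussed in the model section (coefficients restricted to 5 values, triplets of length 3·S with S = N^2 = 25, float32 probabilities, and a split into 15 tokens); the variable names are ours.

CARDINALITY = 5        # values each coefficient of u, v, w can take
S = 25                 # N^2 for N = 5
TRIPLET_LEN = 3 * S    # length of the concatenated (u, v, w) vector
ITEMSIZE = 4           # float32 bytes

# One probability per element of the monolithic action space.
full_space = CARDINALITY ** TRIPLET_LEN * ITEMSIZE
print(f"monolithic action space: ~{full_space / 1e9:.1e} GB")  # ~1.1e+44 GB

# Tokenized: 15 chunks, each a distribution over 5^(75/15) values.
n_chunks = 15
chunked = n_chunks * CARDINALITY ** (TRIPLET_LEN // n_chunks) * ITEMSIZE
print(f"tokenized action space: {chunked / 1e3:.1f} kB")  # 187.5 kB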

When setting up the AlphaTensor training, the main difficulties were found in dealing with the acting process. If the tensors are not stored in the right format, MCTS can easily cause uncontrolled growth of memory usage. On the other hand, if the number of tensors stored during each simulation is reduced too much, MCTS can spend an unbounded amount of time recomputing the required states.

Let's take as an example the game simulation step, where the game is explored by looking at possible future scenarios. For each state, if we do not save the actions generated by the model and decide to store only the random seed used to sample the actions from the policy, then every time we explore a tree node we would have to recompute the policy and then sample the actions. Clearly, we decided to store the sampled actions, both to save time and to avoid having to manage model sharing between different processes in case the MCTS exploration is parallelized. However, just saving the actions was not enough to obtain a sufficiently efficient acting step. In fact, the time needed to convert the n_steps actions into the (u, v, w) triplet, reduce the game tensor state and build the new 3D tensors from the n_samples actions could easily become the bottleneck of the whole training. Secondly, we did not want to store all the possible future states for each sampled action, since this would have had a huge impact on the memory used by the algorithm. Suppose we set n_samples=32, n=7 and N=5, and recall that N is the size of the square matrix product we want to reduce and n is the number of previous actions remembered by the model. In this situation, each state tensor has shape (8, 25, 25, 25), which multiplied by 32 samples gives 32 · 8 · 25 · 25 · 25 · 4 bytes (≈16 MB) for each node in the graph. Now, considering that each simulation in the expansion phase creates a new node (and n_sim=200), we would end up with a memory consumption of 200 · 32 · 8 · 25 · 25 · 25 · 4 bytes = 3.2 GB for the first MCTS node alone. In the worst case, while exploring max_rank acting nodes (with max_rank=150), this would result in a total memory consumption of 150 · 3.2 GB = 480 GB in RAM (or in GPU memory, if all the tensors were stored on the GPU). We ran the training on a workstation with 128 GB of RAM and 48 GB of GPU memory, so we had to reduce the memory consumption.
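
The memory figures above follow from simple arithmetic; the snippet below just reproduces them from the hyper-parameters quoted in the paragraph.

# Memory footprint of stored future states (float32 entries).
n_samples, slices, S, itemsize = 32, 8, 25, 4  # slices = n + 1, S = N**2
per_node = n_samples * slices * S**3 * itemsize
per_root = 200 * per_node    # n_sim = 200 expansions per MCTS node
worst_case = 150 * per_root  # max_rank = 150 acting steps
print(per_node / 1e6, "MB")    # 16.0 MB
print(per_root / 1e9, "GB")    # 3.2 GB
print(worst_case / 1e9, "GB")  # 480.0 GB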

Since we did not want to increase the execution time, we adopted an optimization that exploits the redundancy in the state tensors being produced. In fact, the sibling tensors share n-1 previous actions, which can therefore be stored once instead of being repeated for each stored tensor. This brings memory down to roughly 2/7 (~28%) of the original, meaning at most about 137 GB to store in the worst case. At that point, by simply pruning the unused parts of the tree (e.g. the non-selected trajectories) and storing the tensors in CPU memory, we were able to avoid any out-of-memory error during training.
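
A minimal sketch of this de-duplication, assuming the (n+1, S, S, S) state layout puts the present state in slice 0, the newest action in slice 1 and the shared older actions afterwards (the real layout in OpenAlphaTensor may differ; both helper names are ours):

from typing import List, Tuple

import torch


def split_shared_history(
    children: List[torch.Tensor],
) -> Tuple[torch.Tensor, List[torch.Tensor]]:
    """Store the n-1 action slices shared by all sibling states only once."""
    shared = children[0][2:]  # n-1 common history slices
    unique = [child[:2].clone() for child in children]  # per-child slices
    return shared, unique


def recompose_state(shared: torch.Tensor, unique: torch.Tensor) -> torch.Tensor:
    """Rebuild the full (n+1, S, S, S) state when a child is actually visited."""
    return torch.cat([unique, shared], dim=0)

Under this layout, 32 children cost 32·2 + 6 = 70 slices instead of 32·8 = 256, i.e. about 27% of the original footprint, in line with the ~2/7 figure above.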

Next steps

Now that OpenAlphaTensor is open source, several exciting avenues for further development open up.

A natural progression is fine-tuning OpenAlphaTensor on target hardware devices. This is expected to lead to highly competitive computational performance. I will be sharing more about OpenAlphaTensor's performance on different hardware on GitHub. At the time of writing, OpenAlphaTensor is being trained.

Another important step forward would be support for remote compilation, allowing users to build algorithms optimized for edge devices. This could be achieved by hosting the OpenAlphaTensor model on a server while the matrix multiplication algorithm is evaluated on different hardware.

It may also be valuable to extend support to different compilers for computing the latency-based reward correction. Different compilers can lead to different optimized algorithms on a given piece of hardware. For example, the DeepMind paper showed promising results using JAX and the XLA compiler on TPUs and Nvidia GPUs. It would be interesting to evaluate this with NCCL on Nvidia GPUs or LLVM on CPUs.

Finally, extending the model and the training algorithm to support larger matrix sizes remains a major open challenge. Currently, OpenAlphaTensor supports a maximum matrix size of 5, but it can still be applied to larger matrix multiplications by splitting them into groups of tiny MMs of size at most 5. This approach is suboptimal; performing the reduction directly on the large tensor corresponding to the full MM could, in theory, lead to better results.
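
For illustration, here is a minimal sketch of the splitting approach, assuming square matrices whose size is an exact multiple of the block size and using np.matmul as a stand-in for a discovered fast kernel of size <= 5 (the kernel and function names are hypothetical):

import numpy as np


def blocked_matmul(A: np.ndarray, B: np.ndarray, block: int = 5, small_mm=np.matmul) -> np.ndarray:
    """Compute A @ B by tiling both operands into block x block sub-matrices."""
    n = A.shape[0]
    C = np.zeros((n, n), dtype=A.dtype)
    for i in range(0, n, block):
        for j in range(0, n, block):
            for k in range(0, n, block):
                # Each tile product would use the discovered fast algorithm.
                C[i:i + block, j:j + block] += small_mm(
                    A[i:i + block, k:k + block], B[k:k + block, j:j + block]
                )
    return C


# Quick check on a 10x10 product built from 5x5 tiles.
A, B = np.random.rand(10, 10), np.random.rand(10, 10)
assert np.allclose(blocked_matmul(A, B), A @ B)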
 
Diego Fiori is the CTO of Nebuly AI, a company committed to making AI optimization part of every developer's toolkit.

Original article source: https://www.kdnuggets.com/

#opensource #alpha #machinelearning #deeplearning 

