Hermann Frami

Dolly: Fine-tuning the GPT-J 6B Model on the Alpaca Dataset

Dolly

This fine-tunes the GPT-J 6B model on the Alpaca dataset using a Databricks notebook. Please note that while GPT-J 6B is Apache 2.0 licensed, the Alpaca dataset is licensed under Creative Commons NonCommercial (CC BY-NC 4.0).

Get Started Training

  • Add the dolly repo to Databricks (under Repos click Add Repo, enter https://github.com/databrickslabs/dolly.git, then click Create Repo).
  • Start a 12.2 LTS ML (includes Apache Spark 3.3.2, GPU, Scala 2.12) single-node cluster with node type having 8 A100 GPUs (e.g. Standard_ND96asr_v4 or p4d.24xlarge).
  • Open the train_dolly notebook in the dolly repo, attach to your GPU cluster, and run all cells. When training finishes, the notebook will save the model under /dbfs/dolly_training.
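Once training completes, the saved checkpoint can be loaded like any Hugging Face model. The snippet below is a hedged sketch, not part of the repo: the exact subdirectory under /dbfs/dolly_training depends on your run, the notebook is assumed to have saved both the model and the tokenizer there, and device_map="auto" assumes the accelerate package is installed.

from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "/dbfs/dolly_training/<your-run-directory>"  # hypothetical path; use your actual output dir
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto")

inputs = tokenizer("Explain what a dataset is.", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))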

Running Unit Tests Locally

pyenv local 3.8.13
python -m venv .venv
. .venv/bin/activate
pip install -r requirements_dev.txt
./run_pytest.sh

Download Details:

Author: Databrickslabs
Source Code: https://github.com/databrickslabs/dolly 
License: Apache-2.0 license

#python #dataset #databricks #notebook 

Lawson Wehner

OGB: Benchmark Datasets, Data Loaders, Evaluators for Graph ML

Overview

The Open Graph Benchmark (OGB) is a collection of benchmark datasets, data loaders, and evaluators for graph machine learning. Datasets cover a variety of graph machine learning tasks and real-world applications. The OGB data loaders are fully compatible with popular graph deep learning frameworks, including PyTorch Geometric and Deep Graph Library (DGL). They provide automatic dataset downloading, standardized dataset splits, and unified performance evaluation.

OGB aims to provide graph datasets that cover important graph machine learning tasks, diverse dataset scale, and rich domains.

Graph ML Tasks: We cover three fundamental graph machine learning tasks: prediction at the level of nodes, links, and graphs.

Diverse scale: Small-scale graph datasets can be processed within a single GPU, while medium- and large-scale graphs might require multiple GPUs or clever sampling/partition techniques.

Rich domains: Graph datasets come from diverse domains ranging from scientific ones to social/information networks, and also include heterogeneous knowledge graphs.

OGB is an on-going effort, and we are planning to increase our coverage in the future.

Installation

You can install OGB using Python's package manager pip. If you have previously installed ogb, please make sure you update the version to 1.3.5. The release note is available here.

Requirements

  • Python>=3.6
  • PyTorch>=1.6
  • DGL>=0.5.0 or torch-geometric>=2.0.2
  • Numpy>=1.16.0
  • pandas>=0.24.0
  • urllib3>=1.24.0
  • scikit-learn>=0.20.0
  • outdated>=0.2.0

Pip install

The recommended way to install OGB is using Python's package manager pip:

pip install ogb
python -c "import ogb; print(ogb.__version__)"
# This should print "1.3.5". Otherwise, please update the version by
pip install -U ogb

From source

You can also install OGB from source. This is recommended if you want to contribute to OGB.

git clone https://github.com/snap-stanford/ogb
cd ogb
pip install -e .

Package Usage

We highlight two key features of OGB, namely, (1) easy-to-use data loaders, and (2) standardized evaluators.

(1) Data loaders

We prepare easy-to-use PyTorch Geometric and DGL data loaders that handle dataset downloading as well as standardized dataset splitting. Below, using PyTorch Geometric, a few lines of code are sufficient to prepare and split the dataset; a DGL sketch follows it.

from ogb.graphproppred import PygGraphPropPredDataset
from torch_geometric.loader import DataLoader

# Download and process data at './dataset/ogbg_molhiv/'
dataset = PygGraphPropPredDataset(name = 'ogbg-molhiv')

split_idx = dataset.get_idx_split() 
train_loader = DataLoader(dataset[split_idx['train']], batch_size=32, shuffle=True)
valid_loader = DataLoader(dataset[split_idx['valid']], batch_size=32, shuffle=False)
test_loader = DataLoader(dataset[split_idx['test']], batch_size=32, shuffle=False)
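
For completeness, here is the analogous DGL setup, a short sketch assuming collate_dgl is importable from ogb.graphproppred as in the project's DGL examples:

from ogb.graphproppred import DglGraphPropPredDataset, collate_dgl
from torch.utils.data import DataLoader

# Download and process data at './dataset/ogbg_molhiv/'
dataset = DglGraphPropPredDataset(name = 'ogbg-molhiv')

split_idx = dataset.get_idx_split()
train_loader = DataLoader(dataset[split_idx['train']], batch_size=32, shuffle=True, collate_fn=collate_dgl)
valid_loader = DataLoader(dataset[split_idx['valid']], batch_size=32, shuffle=False, collate_fn=collate_dgl)
test_loader = DataLoader(dataset[split_idx['test']], batch_size=32, shuffle=False, collate_fn=collate_dgl)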

(2) Evaluators

We also prepare standardized evaluators for easy evaluation and comparison of different methods. The evaluator takes input_dict (a dictionary whose format is specified in evaluator.expected_input_format) as input, and returns a dictionary storing the performance metric appropriate for the given dataset. The standardized evaluation protocol allows researchers to reliably compare their methods.

from ogb.graphproppred import Evaluator

evaluator = Evaluator(name = 'ogbg-molhiv')
# You can learn the input and output format specification of the evaluator as follows.
# print(evaluator.expected_input_format) 
# print(evaluator.expected_output_format) 
input_dict = {'y_true': y_true, 'y_pred': y_pred}
result_dict = evaluator.eval(input_dict) # E.g., {'rocauc': 0.7321}
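
As a self-contained illustration, the snippet below feeds the evaluator dummy arrays; it assumes the ogbg-molhiv evaluator accepts NumPy arrays of shape (num_graphs, num_tasks):

import numpy as np
from ogb.graphproppred import Evaluator

evaluator = Evaluator(name = 'ogbg-molhiv')
y_true = np.random.randint(0, 2, size=(100, 1))  # dummy binary labels
y_pred = np.random.rand(100, 1)                  # dummy prediction scores
print(evaluator.eval({'y_true': y_true, 'y_pred': y_pred}))  # e.g., {'rocauc': 0.51}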

Citing OGB / OGB-LSC

If you use OGB or OGB-LSC datasets in your work, please cite our papers (Bibtex below).

@article{hu2020ogb,
  title={Open Graph Benchmark: Datasets for Machine Learning on Graphs},
  author={Hu, Weihua and Fey, Matthias and Zitnik, Marinka and Dong, Yuxiao and Ren, Hongyu and Liu, Bowen and Catasta, Michele and Leskovec, Jure},
  journal={arXiv preprint arXiv:2005.00687},
  year={2020}
}
@article{hu2021ogblsc,
  title={OGB-LSC: A Large-Scale Challenge for Machine Learning on Graphs},
  author={Hu, Weihua and Fey, Matthias and Ren, Hongyu and Nakata, Maho and Dong, Yuxiao and Leskovec, Jure},
  journal={arXiv preprint arXiv:2103.09430},
  year={2021}
}

Download Details:

Author: snap-stanford
Source Code: https://github.com/snap-stanford/ogb 
License: MIT license

#machinelearning #deeplearning #dataset #python 

Royce Reinger

se(3)-TrackNet: Data-driven 6D Pose Tracking

iros20-6d-pose-tracking

This is the official implementation of our paper "se(3)-TrackNet: Data-driven 6D Pose Tracking by Calibrating Image Residuals in Synthetic Domains", accepted at the International Conference on Intelligent Robots and Systems (IROS) 2020. [PDF]

Abstract: Tracking the 6D pose of objects in video sequences is important for robot manipulation. This task, however, introduces multiple challenges: (i) robot manipulation involves significant occlusions; (ii) data and annotations are troublesome and difficult to collect for 6D poses, which complicates machine learning solutions; and (iii) incremental error drift often accumulates in long-term tracking, necessitating re-initialization of the object's pose. This work proposes a data-driven optimization approach for long-term, 6D pose tracking. It aims to identify the optimal relative pose given the current RGB-D observation and a synthetic image conditioned on the previous best estimate and the object's model. The key contribution in this context is a novel neural network architecture, which appropriately disentangles the feature encoding to help reduce domain shift, and an effective 3D orientation representation via Lie algebra. Consequently, even though the network is trained only with synthetic data, it can work effectively over real images. Comprehensive experiments over benchmarks - existing ones as well as a new dataset with significant occlusions related to object manipulation - show that the proposed approach achieves consistently robust estimates and outperforms alternatives, even though they have been trained with real images. The approach is also the most computationally efficient among the alternatives and achieves a tracking frequency of 90.9 Hz.

Applications: model-based RL, manipulation, AR/VR, human-robot-interaction, automatic 6D pose labeling.

This repo can be used when you have the CAD model of the target object. When such a model is not available, check out our other repo, BundleTrack, which can be used out of the box for 6D pose tracking of novel, unknown objects without needing CAD models.

Bibtex

@article{wen2020se,
   title={se(3)-TrackNet: Data-driven 6D Pose Tracking by Calibrating Image Residuals in Synthetic Domains},
   url={http://dx.doi.org/10.1109/IROS45743.2020.9341314},
   DOI={10.1109/iros45743.2020.9341314},
   journal={2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)},
   publisher={IEEE},
   author={Wen, Bowen and Mitash, Chaitanya and Ren, Baozhang and Bekris, Kostas E.},
   year={2020},
   month={Oct} }

(New) Application to visual feedback control

Some example experiments using se(3)-TrackNet in our recent work "Vision-driven Compliant Manipulation for Reliable, High-Precision Assembly Tasks", RSS 2021.

ycb_packing.gif

cup_stacking_and_charger.gif

industrial_insertion.gif

Supplementary Video:

Click to watch

Results on YCB

occlusion.gif

About YCBInEOAT Dataset

Due to the lack of a suitable dataset for RGB-D-based 6D pose tracking in robotic manipulation, a novel dataset was developed in this work. It has these key attributes:

  • Real manipulation tasks
  • 3 kinds of end-effectors
  • 5 YCB objects
  • 9 videos for evaluation, 7,449 RGB-D frames in total
  • Ground-truth poses annotated for each frame
  • Forward kinematics recorded
  • Camera extrinsic parameters calibrated

A link to download this dataset is provided below under 'Data Download'. Example manipulation sequence:

manipulation1.gif

Current benchmark:

More details are in the paper and supplementary video.

Quick setup

Use Docker and pull the pre-built image (install Docker first if you haven't):

 docker pull wenbowen123/se3_tracknet:latest

Launch the Docker container as below; it is then ready to run:

cd docker
bash run_container.sh

Data Download

syndata_gen.gif

Test on YCB_Video and YCBInEOAT datasets

Please refer to predict.py and predict.sh

Benchmarking

Please refer to eval_ycb.py and eval_ycbineoat.py

Training

  1. Edit the config.yml. Make sure the paths are correct. Other settings need not be changed in most cases.
  2. Then run python train.py.

Generate your own data

Here we take object_models/bunny as an example; prepare your own CAD models in the same way for new objects.

Download the blender file and put it inside this repository folder

Edit dataset_info.yml. The params are self-explanatory. In particular, add the object model, e.g. /home/se3_tracknet/object_models/bunny/1.ply in our example.

Start generation; the output should be saved to /home/se3_tracknet/generated_data/:

python blender_main.py

Generate paired data from neighboring images; the output should be saved to /home/se3_tracknet/generated_data_pair/:

python produce_train_pair_data.py

Example pair:

Now refer to the Training section.

Test in the wild with ROS

python predict_ros.py

For more information

python predict_ros.py --help

Download Details:

Author: Wenbowen123
Source Code: https://github.com/wenbowen123/iros20-6d-pose-tracking 
License: View license

#machinelearning #python #robots #computervision #dataset #3d 


GeoStatsImages.jl: Training Images for Geostatistical Simulation

GeoStatsImages.jl

Training images for geostatistical simulation in Julia.


This package converts famous training images from the geostatistics literature to a standard format for quick experimentation in Julia. It is part of the GeoStats.jl framework and can be used in conjunction with multiple-point simulation solvers.

The author does not hold any copyright on the data. Please give credit to the sources in the table.

Usage

TI = geostatsimage(identifier)

where identifier can be any of the strings listed with the command GeoStatsImages.available()

Preview

Identifier       Type         Data source
WalkerLake       Continuous   Mariethoz & Caers 2014
WalkerLakeTruth  Continuous   Mariethoz & Caers 2014
StoneWall        Continuous   Mariethoz & Caers 2014
Herten           Continuous   Mariethoz & Caers 2014
Lena             Continuous   Mariethoz & Caers 2014
StanfordV        Continuous   Mao & Journel 2014
Gaussian30x10    Continuous   Hoffimann 2020
Strebelle        Categorical  Strebelle 2002
Ellipsoids       Categorical  Mariethoz & Caers 2014
WestCoastAfrica  Categorical  Mariethoz & Caers 2014
Flumy            Categorical  Hoffimann et al 2017
Fluvsim          Categorical  Mariethoz & Caers 2014
Ketton           Categorical  Imperial College Pore-Scale Modelling Group

(Image previews of each training image are shown in the repository README.)

Collections

St. Anthony Falls Laboratory

  • FlumeContinuous
  • FlumeBinary

Contributing

Contributions are very welcome, as are feature requests and suggestions.

If you have any questions, please contact our community on the gitter channel.


Download Details:

Author: JuliaEarth
Source Code: https://github.com/JuliaEarth/GeoStatsImages.jl 
License: MIT license

#julia #earth #images #dataset

Lawson Wehner

Convert Spark RDD into DataFrame and Dataset

In this blog, we will be talking about Spark RDDs, DataFrames, and Datasets, and how we can transform an RDD into a DataFrame or a Dataset.

What is an RDD?

An RDD is an immutable, distributed collection of elements of your data, partitioned across the nodes of your cluster, that can be operated on in parallel with a low-level API offering transformations and actions.

RDDs are so integral to the function of Spark that the entire Spark API can be considered to be a collection of operations to create, transform, and export RDDs. Every algorithm implemented in Spark is effectively a series of transformative operations performed upon data represented as an RDD.

What is a DataFrame?

A DataFrame is a Dataset that is organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. DataFrames can be constructed from a wide array of sources such as structured data files, tables in Hive, external databases, or existing RDDs. The DataFrame API is available in Scala, Java, Python, and R. In Scala and Java, a DataFrame is represented by a Dataset of Rows: in the Scala API, DataFrame is simply a type alias of Dataset[Row], while in the Java API users need to use Dataset<Row> to represent a DataFrame.
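
As a quick aside (the examples in this post are written in Scala), a hedged PySpark sketch of the same RDD-to-DataFrame conversion, with made-up column names and values, looks like this:

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("AppName").getOrCreate()

rdd = spark.sparkContext.parallelize([("Alice", 29), ("Bob", 31)])
df = spark.createDataFrame(rdd, ["name", "age"])  # or: rdd.toDF(["name", "age"])
df.show()  # prints a small two-column table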

What is a Dataset?

A Dataset is a distributed collection of data. Dataset is a new interface added in Spark 1.6 that provides the benefits of RDDs (strong typing, ability to use powerful lambda functions) with the benefits of Spark SQL’s optimized execution engine.

A Dataset can be constructed from JVM objects and then manipulated using functional transformations (map, flatMap, filter, etc.). The Dataset API is available in Scala and Java. Python does not have support for the Dataset API, but due to Python’s dynamic nature, many of the benefits of the Dataset API are already available (i.e., you can naturally access the fields of a row by name: row.columnName).

Working with RDD

Prerequisites: in order to work with RDDs, we need to create a SparkContext object:

val conf: SparkConf =
  new SparkConf()
    .setMaster("local[*]")
    .setAppName("AppName")
    .set("spark.driver.host", "localhost")

val sc: SparkContext = new SparkContext(conf)

There are two common ways to build an RDD:

  • Pass your existing collection to the SparkContext.parallelize method (you will do this mostly for tests or POCs):

scala> val data = Array(1, 2, 3, 4, 5)
data: Array[Int] = Array(1, 2, 3, 4, 5)

scala> val rdd = sc.parallelize(data)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:26

  • Read from external sources:

val lines = sc.textFile("data.txt")
val lineLengths = lines.map(s => s.length)
val totalLength = lineLengths.reduce((a, b) => a + b)

Things get interesting when you want to convert your Spark RDD to a DataFrame. It might not be obvious why you would want to switch to Spark DataFrames or Datasets: you will write less code, the code itself will be more expressive, and there are a lot of out-of-the-box optimizations available for DataFrames and Datasets.

Working with DataFrame

A DataFrame has two main advantages over an RDD: you write less, more expressive code, and you get the out-of-the-box optimizations noted above.

Prerequisites: to work with DataFrames, we will need a SparkSession:

val spark: SparkSession =
  SparkSession
    .builder()
    .appName("AppName")
    .config("spark.master", "local")
    .getOrCreate()

First, let’s sum up the main ways of creating the DataFrame:

  • From an existing RDD using reflection

If you have structured or semi-structured data with simple, unambiguous data types, you can infer the schema using reflection.

import spark.implicits._ // for implicit conversions from Spark RDD to DataFrame

val dataFrame = rdd.toDF()

  • From an existing RDD by programmatically specifying the schema
def dfSchema(columnNames: List[String]): StructType =
  StructType(
    Seq(
      StructField(name = "name", dataType = StringType, nullable = false),
      StructField(name = "age", dataType = IntegerType, nullable = false)
    )
  )

def row(line: List[String]): Row = Row(line(0), line(1).toInt)

val rdd: RDD[String] = ...

val schema = dfSchema(List("name", "age"))
val data = rdd.map(_.split(",").to[List]).map(row)
val dataFrame = spark.createDataFrame(data, schema)
  • Loading data from a structured file (JSON, Parquet, CSV)

val dataFrame = spark.read.json("example.json")
val dataFrame = spark.read.csv("example.csv")
val dataFrame = spark.read.parquet("example.parquet")

  • From an external database via JDBC

val dataFrame = spark.read.jdbc(url, "person", prop)

The DataFrame API is radically different from the RDD API because it is an API for building a relational query plan that Spark’s Catalyst optimizer can then execute.

Working with Dataset

The Dataset API aims to provide the best of both worlds: the familiar object-oriented programming style and compile-time type-safety of the RDD API but with the performance benefits of the Catalyst query optimizer. Datasets also use the same efficient off-heap storage mechanism as the DataFrame API.

The idea behind Dataset “is to provide an API that allows users to easily perform transformations on domain objects, while also providing the performance and robustness advantages of the Spark SQL execution engine”. It competes with RDDs, as they have overlapping functionality.

Let’s say we have a case class. You can create a Dataset by implicit conversion, by hand, from a collection, or from an RDD:

case class FeedbackRow(manager_name: String, response_time: Double, satisfaction_level: Double)

  • By implicit conversion

// create Dataset via implicit conversions
val ds: Dataset[FeedbackRow] = dataFrame.as[FeedbackRow]
val theSameDS = spark.read.parquet("example.parquet").as[FeedbackRow]

  • By hand

// create Dataset by hand
val ds1: Dataset[FeedbackRow] = dataFrame.map {
  row => FeedbackRow(row.getAs[String](0), row.getAs[Double](4), row.getAs[Double](5))
}
  • From collection
import spark.implicits._

case class Person(name: String, age: Long)
val data = Seq(Person("Bob", 21), Person("Mandy", 22), Person("Julia", 19))
val ds = spark.createDataset(data)

  • From RDD

val rdd = sc.textFile("data.txt")
val ds = spark.createDataset(rdd)

Original article source at: https://blog.knoldus.com/

#dataframe #dataset #spark 

Royce Reinger

Vision: Datasets, Transforms and Models Specific to Computer Vision

Torchvision

The torchvision package consists of popular datasets, model architectures, and common image transformations for computer vision.
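
As a minimal illustration of those three pieces together (dataset, transforms, model), the following sketch uses placeholder paths and hyperparameters:

import torch
import torchvision
from torchvision import datasets, transforms

transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=(0.5, 0.5, 0.5), std=(0.5, 0.5, 0.5)),
])

# Downloads CIFAR-10 to ./data on first use.
train_set = datasets.CIFAR10(root="./data", train=True, download=True, transform=transform)
loader = torch.utils.data.DataLoader(train_set, batch_size=64, shuffle=True)

model = torchvision.models.resnet18(num_classes=10)  # untrained, 10 CIFAR-10 classes
images, labels = next(iter(loader))
print(model(images).shape)  # torch.Size([64, 10])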

Installation

We recommend Anaconda as the Python package management system. Please refer to pytorch.org for the details of PyTorch (torch) installation. The following table lists the corresponding torchvision versions and supported Python versions.

torch           torchvision     python
main / nightly  main / nightly  >=3.7, <=3.10
1.13.0          0.14.0          >=3.7, <=3.10
1.12.0          0.13.0          >=3.7, <=3.10
1.11.0          0.12.0          >=3.7, <=3.10
1.10.2          0.11.3          >=3.6, <=3.9
1.10.1          0.11.2          >=3.6, <=3.9
1.10.0          0.11.1          >=3.6, <=3.9
1.9.1           0.10.1          >=3.6, <=3.9
1.9.0           0.10.0          >=3.6, <=3.9
1.8.2           0.9.2           >=3.6, <=3.9
1.8.1           0.9.1           >=3.6, <=3.9
1.8.0           0.9.0           >=3.6, <=3.9
1.7.1           0.8.2           >=3.6, <=3.9
1.7.0           0.8.1           >=3.6, <=3.8
1.7.0           0.8.0           >=3.6, <=3.8
1.6.0           0.7.0           >=3.6, <=3.8
1.5.1           0.6.1           >=3.5, <=3.8
1.5.0           0.6.0           >=3.5, <=3.8
1.4.0           0.5.0           ==2.7, >=3.5, <=3.8
1.3.1           0.4.2           ==2.7, >=3.5, <=3.7
1.3.0           0.4.1           ==2.7, >=3.5, <=3.7
1.2.0           0.4.0           ==2.7, >=3.5, <=3.7
1.1.0           0.3.0           ==2.7, >=3.5, <=3.7
<=1.0.1         0.2.2           ==2.7, >=3.5, <=3.7

Anaconda:

conda install torchvision -c pytorch

pip:

pip install torchvision

From source:

python setup.py install
# or, for OSX
# MACOSX_DEPLOYMENT_TARGET=10.9 CC=clang CXX=clang++ python setup.py install

We don't officially support building from source using pip, but if you do, you'll need to use the --no-build-isolation flag. In case building TorchVision from source fails, install the nightly version of PyTorch following the linked guide on the contributing page and retry the install.

By default, GPU support is built if CUDA is found and torch.cuda.is_available() is true. It's possible to force building GPU support by setting FORCE_CUDA=1 environment variable, which is useful when building a docker image.

Image Backend

Torchvision currently supports the following image backends:

  • Pillow (default)
  • Pillow-SIMD - a much faster drop-in replacement for Pillow with SIMD. If installed, it will be used as the default.
  • accimage - if installed can be activated by calling torchvision.set_image_backend('accimage')
  • libpng - can be installed via conda conda install libpng or any of the package managers for debian-based and RHEL-based Linux distributions.
  • libjpeg - can be installed via conda conda install jpeg or any of the package managers for debian-based and RHEL-based Linux distributions. libjpeg-turbo can be used as well.

Notes: libpng and libjpeg must be available at compilation time in order for the corresponding backends to be available. Make sure they are in the standard library locations; otherwise, add the include and library paths to the environment variables TORCHVISION_INCLUDE and TORCHVISION_LIBRARY, respectively.
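
A small sketch of switching image backends at runtime (this assumes accimage is installed; otherwise image decoding will fail later):

import torchvision

print(torchvision.get_image_backend())   # 'PIL' by default
torchvision.set_image_backend('accimage')
print(torchvision.get_image_backend())   # 'accimage'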

Video Backend

Torchvision currently supports the following video backends:

  • pyav (default) - Pythonic binding for ffmpeg libraries.
  • video_reader - This needs ffmpeg to be installed and torchvision to be built from source. There shouldn't be any conflicting version of ffmpeg installed. Currently, this is only supported on Linux.
conda install -c conda-forge ffmpeg
python setup.py install

Using the models on C++

TorchVision provides an example project for how to use the models on C++ using JIT Script.

Installation From source:

mkdir build
cd build
# Add -DWITH_CUDA=on to enable CUDA support if needed
cmake ..
make
make install

Once installed, the library can be accessed in cmake (after properly configuring CMAKE_PREFIX_PATH) via the TorchVision::TorchVision target:

find_package(TorchVision REQUIRED)
target_link_libraries(my-target PUBLIC TorchVision::TorchVision)

The TorchVision package will also automatically look for the Torch package and add it as a dependency to my-target, so make sure that it is also available to cmake via the CMAKE_PREFIX_PATH.

For an example setup, take a look at examples/cpp/hello_world.

Python linking is disabled by default when compiling TorchVision with CMake; this allows you to run models without any Python dependency. In some special cases where TorchVision's operators are used from Python code, you may need to link to Python. This can be done by passing -DUSE_PYTHON=on to CMake.

TorchVision Operators

In order to get the torchvision operators registered with torch (e.g. for the JIT), all you need to do is ensure that you #include <torchvision/vision.h> in your project.

Documentation

You can find the API documentation on the pytorch website: https://pytorch.org/vision/stable/index.html

Contributing

See the CONTRIBUTING file for how to help out.

Disclaimer on Datasets

This is a utility library that downloads and prepares public datasets. We do not host or distribute these datasets, vouch for their quality or fairness, or claim that you have license to use the dataset. It is your responsibility to determine whether you have permission to use the dataset under the dataset's license.

If you're a dataset owner and wish to update any part of it (description, citation, etc.), or do not want your dataset to be included in this library, please get in touch through a GitHub issue. Thanks for your contribution to the ML community!

Pre-trained Model License

The pre-trained models provided in this library may have their own licenses or terms and conditions derived from the dataset used for training. It is your responsibility to determine whether you have permission to use the models for your use case.

More specifically, SWAG models are released under the CC-BY-NC 4.0 license. See SWAG LICENSE for additional details.

Download Details:

Author: Pytorch
Source Code: https://github.com/pytorch/vision 
License: BSD-3-Clause license

#machinelearning #computer #vision #dataset 


Read SAS® Software Transport Files & Convert Datasets to DataFrames

XPT

NOTE Requires the master version of DataFrames. Get that with Pkg.checkout("DataFrames"). I don't know how to specify that in the REQUIRE file.

About

The XPT package reads SAS® software transport files and converts SAS software datasets to DataFrames. Transport files are assumed to follow the specifications described in the technical note titled "THE RECORD LAYOUT OF A DATA SET IN SAS TRANSPORT (XPORT) FORMAT" available here (pdf).

Datasets are tagged with member type SASDATA in transport files. No other member types are referenced in the tech note, so I am assuming they cannot exist (in a transport file). If this is not the case, you'll get an error. Please file an issue and send me an example of an offending transport file, if possible.

Character variables in a dataset are converted to ASCIIStrings. Missing character variables in SAS datasets are just empty strings, and are treated as such here.

SAS software numeric variables are not standard IEEE Float64s: they can be shorter than 8 bytes and can have missing values (twenty-eight kinds, in fact: ._, ., .a, ..., .z). All numeric variables are converted to Float64s unless they are missing. All missing values are treated as DataArrays.NA.

NOTE Currently, only the first dataset found in a transport file is read and converted to a dataframe, even if the transport file has more than one dataset. If you need to access a dataset after the first in a transport file and I haven't gotten around to adding support for that yet, please file an issue.

Usage

Open a transport file (and process the header information):

xpt = XPTFile("path/to/xpt")

or

f = open("path/to/xpt")
xpt = XPTFile(f)

Convert the first SAS dataset in an xpt file to a dataframe:

df = readdf(xpt)

Future work

  • Convert all datasets in a transport file after the first to julia DataFrames
  • Or only a selection, indexing by name or number.
  • Add some useful tests.
  • Make it go faster. I assume my implementation is slow but I haven't benchmarked it.
  • Subset observations in a dataset by index before converting to DataFrame.
  • Maybe interface with DataStreams to read datasets sequentially.

Download Details:

Author: lendle
Source Code: https://github.com/lendle/XPT.jl 
License: View license

#julia #convert #dataset 


NCDatasets.jl: Load and Create NetCDF Files in Julia

NCDatasets   

NCDatasets allows one to read and create netCDF files. NetCDF data sets and attribute lists behave like Julia dictionaries, and variables like Julia arrays.

The module NCDatasets provides support for the following netCDF CF conventions:

  • _FillValue will be returned as missing (more information)
  • scale_factor and add_offset are applied if present
  • time variables (recognized by the units attribute) are returned as DateTime objects.
  • Support of the CF calendars (standard, gregorian, proleptic gregorian, julian, all leap, no leap, 360 day)
  • The raw data can also be accessed (without the transformations above).
  • Contiguous ragged array representation

Other features include:

  • Support for NetCDF 4 compression and variable-length arrays (i.e. arrays of vectors where each vector can potentially have a different length)
  • The module also includes a utility function ncgen, which generates the Julia code that would produce a netCDF file with the same metadata as a template netCDF file.

Installation

Inside the Julia shell, you can download and install the package by issuing:

using Pkg
Pkg.add("NCDatasets")

Manual

This manual is a quick introduction to using NCDatasets.jl. For more details, you can read the stable or latest documentation.

Explore the content of a netCDF file

Before reading the data from a netCDF file, it is often useful to explore the list of variables and attributes defined in it.

For interactive use, the following commands (without a trailing semicolon) display the content of the file similarly to ncdump -h file.nc:

using NCDatasets
ds = Dataset("file.nc")

This creates the central structure of NCDatasets.jl, Dataset, which represents the contents of the netCDF file (without immediately loading everything into memory). NCDataset is an alias for Dataset.

The following displays the information just for the variable varname:

ds["varname"]

while to get the global attributes you can do:

ds.attrib

which produces a listing like:

Dataset: file.nc
Group: /

Dimensions
   time = 115

Variables
  time   (115)
    Datatype:    Float64
    Dimensions:  time
    Attributes:
     calendar             = gregorian
     standard_name        = time
     units                = days since 1950-01-01 00:00:00
[...]

Load a netCDF file

Loading a variable with known structure can be achieved by accessing the variables and attributes directly by their name.

# The mode "r" stands for read-only. The mode "r" is the default mode and the parameter can be omitted.
ds = Dataset("/tmp/test.nc","r")
v = ds["temperature"]

# load a subset
subdata = v[10:30,30:5:end]

# load all data
data = v[:,:]

# load all data ignoring attributes like scale_factor, add_offset, _FillValue and time units
data2 = v.var[:,:]


# load an attribute
unit = v.attrib["units"]
close(ds)

In the example above, the subset can also be loaded with:

subdata = Dataset("/tmp/test.nc")["temperature"][10:30,30:5:end]

This might be useful in an interactive session. However, the file test.nc is not directly closed (closing the file will be triggered by Julia's garbage collector), which can be a problem if you open many files. On Linux the number of opened files is often limited to 1024 (soft limit). If you write to a file, you should also always close the file to make sure that the data is properly written to the disk.

An alternative way to ensure the file has been closed is to use a do block: the file will be closed automatically when leaving the block.

data =
Dataset(filename,"r") do ds
    ds["temperature"][:,:]
end # ds is closed

Create a netCDF file

The following gives an example of how to create a netCDF file by defining dimensions, variables and attributes.

using NCDatasets
using DataStructures
# This creates a new NetCDF file /tmp/test.nc.
# The mode "c" stands for creating a new file (clobber)
ds = Dataset("/tmp/test.nc","c")

# Define the dimension "lon" and "lat" with the size 100 and 110 resp.
defDim(ds,"lon",100)
defDim(ds,"lat",110)

# Define a global attribute
ds.attrib["title"] = "this is a test file"

# Define the variables temperature with the attribute units
v = defVar(ds,"temperature",Float32,("lon","lat"), attrib = OrderedDict(
    "units" => "degree Celsius"))

# add additional attributes
v.attrib["comments"] = "this is a string attribute with Unicode Ω ∈ ∑ ∫ f(x) dx"

# Generate some example data
data = [Float32(i+j) for i = 1:100, j = 1:110]

# write a single column
v[:,1] = data[:,1]

# write the complete data set
v[:,:] = data

close(ds)

Edit an existing netCDF file

When you need to modify variables or attributes in a netCDF file, you have to open it with the "a" option. Here, for example, we add a global attribute creator to the file created in the previous step.

ds = Dataset("/tmp/test.nc","a")
ds.attrib["creator"] = "your name"
close(ds);

Benchmark

The benchmark loads a variable of the size 1000x500x100 in slices of 1000x500 (applying the scaling of the CF conventions) and computes the maximum of each slice and the average of each maximum over all slices. This operation is repeated 100 times. The code is available at https://github.com/Alexander-Barth/NCDatasets.jl/tree/master/test/perf .

Module            median  minimum  mean   std. dev.
R-ncdf4           0.572   0.550    0.575  0.023
python-netCDF4    0.504   0.498    0.505  0.003
julia-NCDatasets  0.228   0.212    0.226  0.005

All runtimes are in seconds. Julia 1.6.0 (with NCDatasets b953bf5), R 3.4.4 (with ncdf4 1.17) and Python 3.6.9 (with netCDF4 1.5.4). The CPU is an Intel i7-7700.

Filing an issue

When you file an issue, please include sufficient information that would allow somebody else to reproduce the issue, in particular:

Provide the code that generates the issue.

If necessary to run your code, provide the used netCDF file(s).

Make your code and netCDF file(s) as simple as possible (while still showing the error and being runnable). A big thank you to the 5-star-premium-gold users who do not forget this point! 👍🏅🏆

The full error message that you are seeing (in particular file names and line numbers of the stack-trace).

Which version of Julia and NCDatasets are you using? Please include the output of:

versioninfo()
using Pkg
Pkg.installed()["NCDatasets"]

Does NCDatasets pass its test suite? Please include the output of:

using Pkg
Pkg.test("NCDatasets")

Alternative

The package NetCDF.jl from Fabian Gans and contributors is an alternative to this package which supports a more Matlab/Octave-like interface for reading and writing NetCDF files.

Credits

netcdf_c.jl, build.jl and the error handling code of the NetCDF C API are from NetCDF.jl by Fabian Gans (Max-Planck-Institut für Biogeochemie, Jena, Germany) released under the MIT license.

Download Details:

Author: Alexander-Barth
Source Code: https://github.com/Alexander-Barth/NCDatasets.jl 
License: View license

#julia #dataset 

Nat Grady

ICON: Easy Access to Complex Systems Datasets

ICON: easy access to complex systems datasets  

Overview

The ICON R package provides easy-to-use and easy-to-access datasets from the Index of COmplex Networks (ICON) database available at the University of Colorado website. All datasets can be loaded with a single function call and new datasets are being slowly added from ICON at https://icon.colorado.edu. Currently, the ICON R package includes 1,075 complex networks.

Installation

To install the ICON package, run the following R code:

# install from CRAN (older, fewer networks)
install.packages("ICON")

# install development version from GitHub (updated, more networks)
devtools::install_github("rrrlw/ICON")

Sample code

The sample code below demonstrates network visualization using the igraph R package. For a more detailed look at network analysis (using the network R package) and visualization (using the ggnetwork R package), please take a look at the package vignette.

# load ICON package and data frame of available datasets
library("ICON")
data(ICON_data)

# vector of names of available datasets
print(ICON_data$Var_name)

# look at entire data frame in Rstudio
View(ICON_data)

# load the chess dataset for use and look at the first few lines
get_data("chess")
head(chess)

# load another dataset for use
get_data("seed_disperse_beehler")

# plot interaction network using igraph
library("igraph")
my_graph <- graph_from_edgelist(as.matrix(seed_disperse_beehler[, 1:2]), directed = FALSE)
plot(my_graph, vertex.label = NA, vertex.size = 5)

# following plot is generated (exact vertex positioning varies each time code is run)


 

Contribute

See contribution guidelines here. First-timers and beginners are welcome!

Download Details:

Author: rrrlw
Source Code: https://github.com/rrrlw/ICON 
License: View license

#r #network #dataset 

Nat Grady

Visual Interface for Loading Datasets in RStudio from All Installed Packages

datasets.load     

Visual interface for loading datasets in RStudio from all installed (including unloaded) packages.

Demonstration

datasets.load GUI demonstration

Installation

You can install the latest stable version from CRAN.

install.packages('datasets.load')

The development version, to be used at your peril, can be installed from GitHub using the remotes package.

if (!require('remotes')) install.packages('remotes')
remotes::install_github('bquast/datasets.load')

Development

Development takes place on the GitHub page.

https://github.com/bquast/datasets.load

Bugs can be filed on the issues page on GitHub.

https://github.com/bquast/datasets.load/issues

Download Details:

Author: Bquast
Source Code: https://github.com/bquast/datasets.load 

#r #dataset #load #rstudio 

Monty Boehm

MLDataUtils.jl: Utility Package for Generating, Loading, Splitting, and Processing Machine Learning Datasets

MLDataUtils

Utility package for generating, loading, partitioning, and processing Machine Learning datasets. This package serves as an end-user-friendly front end to the data-related JuliaML packages.

Overview

This package is designed to be the end-user-facing front end to all the data-related functionality that is spread out across the JuliaML ecosystem. Most of the following sub-categories are covered by a single back-end package that is specialized for that specific problem. Consequently, if one of the following topics is of special interest to you, make sure to check out the corresponding documentation of that package.

Label Encodings provided by MLLabelUtils.jl


Various tools needed to deal with classification targets of arbitrary format. This includes asserting if the targets are of a desired encoding, inferring the concrete encoding the targets are in and how many classes they represent, and converting from their native encoding to the desired one.

[docs] Infer which encoding some classification targets use.

julia> enc = labelenc([-1,1,1,-1,1])
# MLLabelUtils.LabelEnc.MarginBased{Int64}()

[docs] Assert if some classification targets are of the encoding I need them in.

julia> islabelenc([0,1,1,0,1], LabelEnc.MarginBased)
# false

[docs] Convert targets into a specific encoding that my model requires.

julia> convertlabel(LabelEnc.OneOfK{Float32}, [-1,1,-1,1,1,-1])
# 2×6 Array{Float32,2}:
#  0.0  1.0  0.0  1.0  1.0  0.0
#  1.0  0.0  1.0  0.0  0.0  1.0

[docs] Work with matrices in which the user can choose whether the rows or the columns denote the observations.

julia> convertlabel(LabelEnc.OneOfK{Float32}, Int8[-1,1,-1,1,1,-1], obsdim = 1)
# 6×2 Array{Float32,2}:
#  0.0  1.0
#  1.0  0.0
#  0.0  1.0
#  1.0  0.0
#  1.0  0.0
#  0.0  1.0

[docs] Group observations according to their class-label.

julia> labelmap([0, 1, 1, 0, 0])
# Dict{Int64,Array{Int64,1}} with 2 entries:
#   0 => [1,4,5]
#   1 => [2,3]

[docs] Classify model predictions into class labels appropriate for the encoding of the targets.

julia> classify(-0.3, LabelEnc.MarginBased())
# -1.0

Data Access Pattern provided by MLDataPattern.jl


Native and generic Julia implementations of commonly used data access patterns in Machine Learning. Most notably, we provide a number of patterns for shuffling, partitioning, and resampling data sets of various types and origins. At its core, the package was designed around the key requirement of allowing any user-defined type to serve as a custom data source and/or access pattern in a first-class manner. That said, there was also a lot of attention focused on first-class support for those types that are most commonly employed to represent the data of interest, such as DataFrame and Array.

[docs] Create a lazy data subset of some data.

julia> X = rand(2, 6)
# 2×6 Array{Float64,2}:
#  0.226582  0.933372  0.505208   0.0443222  0.812814  0.11202
#  0.504629  0.522172  0.0997825  0.722906   0.245457  0.000341996

julia> datasubset(X, 2:3)
# 2×2 SubArray{Float64,2,Array{Float64,2},Tuple{Colon,UnitRange{Int64}},true}:
#  0.933372  0.505208
#  0.522172  0.0997825

[docs] Shuffle the observations of a data container.

julia> shuffleobs(X)
# 2×6 SubArray{Float64,2,Array{Float64,2},Tuple{Colon,Array{Int64,1}},false}:
#  0.505208   0.812814  0.11202      0.0443222  0.933372  0.226582
#  0.0997825  0.245457  0.000341996  0.722906   0.522172  0.504629

[docs] Split data into train/test subsets.

julia> train, test = splitobs(X, at = 0.7);

julia> train
# 2×4 SubArray{Float64,2,Array{Float64,2},Tuple{Colon,UnitRange{Int64}},true}:
#  0.226582  0.933372  0.505208   0.0443222
#  0.504629  0.522172  0.0997825  0.722906

julia> test
# 2×2 SubArray{Float64,2,Array{Float64,2},Tuple{Colon,UnitRange{Int64}},true}:
#  0.812814  0.11202
#  0.245457  0.000341996

[docs] Partition data into train/test subsets using stratified sampling.

julia> train, test = stratifiedobs([:a,:a,:b,:b,:b,:b], p = 0.5)
# (Symbol[:b,:b,:a],Symbol[:b,:b,:a])

julia> train
# 3-element SubArray{Symbol,1,Array{Symbol,1},Tuple{Array{Int64,1}},false}:
# :b
# :b
# :a

julia> test
# 3-element SubArray{Symbol,1,Array{Symbol,1},Tuple{Array{Int64,1}},false}:
# :b
# :b
# :a

[docs] Group multiple variables together and treat them as a single data set.

julia> shuffleobs(([1,2,3], [:a,:b,:c]))
# ([3,1,2],Symbol[:c,:a,:b])

[docs] Support my own custom user-defined data container type.

julia> using DataTables, LearnBase

julia> LearnBase.nobs(dt::AbstractDataTable) = nrow(dt)

julia> LearnBase.getobs(dt::AbstractDataTable, idx) = dt[idx,:]

julia> LearnBase.datasubset(dt::AbstractDataTable, idx, ::ObsDim.Undefined) = view(dt, idx)

[docs] Over- or undersample an imbalanced labeled data set.

julia> undersample([:a,:b,:b,:a,:b,:b])
# 4-element SubArray{Symbol,1,Array{Symbol,1},Tuple{Array{Int64,1}},false}:
#  :a
#  :b
#  :b
#  :a

[docs] Repartition a data container using a k-folds scheme.

julia> folds = kfolds([1,2,3,4,5,6,7,8,9,10], k = 5)
# 5-fold MLDataPattern.FoldsView of 10 observations:
#   data: 10-element Array{Int64,1}
#   training: 8 observations/fold
#   validation: 2 observations/fold
#   obsdim: :last

julia> folds[1]
# ([3, 4, 5, 6, 7, 8, 9, 10], [1, 2])

[docs] Iterate over my data one observation or batch at a time.

julia> obsview(([1 2 3; 4 5 6], [:a, :b, :c]))
# 3-element MLDataPattern.ObsView{Tuple{SubArray{Int64,1,Array{Int64,2},Tuple{Colon,Int64},true},SubArray{Symbol,0,Array{Symbol,1},Tuple{Int64},false}},Tuple{Array{Int64,2},Array{Symbol,1}},Tuple{LearnBase.ObsDim.Last,LearnBase.ObsDim.Last}}:
#  ([1,4],:a)
#  ([2,5],:b)
#  ([3,6],:c)

[docs] Prepare sequence data such as text for supervised learning.

julia> txt = split("The quick brown fox jumps over the lazy dog")
# 9-element Array{SubString{String},1}:
# "The"
# "quick"
# "brown"
# ⋮
# "the"
# "lazy"
# "dog"

julia> seq = slidingwindow(i->i+2, txt, 2, stride=1)
# 7-element slidingwindow(::##9#10, ::Array{SubString{String},1}, 2, stride = 1) with element type Tuple{...}:
# (["The", "quick"], "brown")
# (["quick", "brown"], "fox")
# (["brown", "fox"], "jumps")
# (["fox", "jumps"], "over")
# (["jumps", "over"], "the")
# (["over", "the"], "lazy")
# (["the", "lazy"], "dog")

julia> seq = slidingwindow(i->[i-2:i-1; i+1:i+2], txt, 1)
# 5-element slidingwindow(::##11#12, ::Array{SubString{String},1}, 1) with element type Tuple{...}:
# (["brown"], ["The", "quick", "fox", "jumps"])
# (["fox"], ["quick", "brown", "jumps", "over"])
# (["jumps"], ["brown", "fox", "over", "the"])
# (["over"], ["fox", "jumps", "the", "lazy"])
# (["the"], ["jumps", "over", "lazy", "dog"])

Data Processing: This package contains a number of simple pre-processing strategies that are often applied for ML purposes, such as feature centering and rescaling.

Data Generators: When studying a learning algorithm or other ML-related functionality, it is usually of high interest to empirically test the behaviour of the system under specific conditions. Generators can provide the means to fabricate artificial data sets that observe certain attributes, which can help to deepen the understanding of the system under investigation.

Example Datasets: We provide a small number of toy datasets. These are mainly intended for didactic and testing purposes.

Documentation

Check out the latest documentation

Additionally, you can make use of Julia's native docsystem. The following example shows how to get additional information on kfolds within Julia's REPL:

?kfolds

Installation

This package is registered in METADATA.jl and can be installed as usual. Just start up Julia and type the following code snippet into the REPL. It makes use of the native Julia package manager.

import Pkg
Pkg.add("MLDataUtils")

WARNING

This package has been discontinued. Most functionalities have been moved to MLUtils.jl.

Author: JuliaML
Source Code: https://github.com/JuliaML/MLDataUtils.jl 
License: View license

#julia #machinelearning #dataset 


Dataset: Databases for Lazy People

dataset: databases for lazy people

In short, dataset makes reading and writing data in databases as simple as reading and writing JSON files.
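
A minimal sketch of that idea (the table and column names below are made up):

import dataset

db = dataset.connect('sqlite:///:memory:')
table = db['people']

table.insert(dict(name='Ada', age=36))
table.insert(dict(name='Alan', age=41))

print(table.find_one(name='Ada'))              # a single row as a dict-like object
print([row['name'] for row in table.all()])    # ['Ada', 'Alan']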

Read the docs

To install dataset, fetch it with pip:

$ pip install dataset

Note: as of version 1.0, dataset is split into two packages, with the data export features now extracted into a stand-alone package, datafreeze. See the relevant repository here.

Author: pudo
Source Code: https://github.com/pudo/dataset
License: MIT License

#python #dataset #sql 

Brielle Maggio

How to Access an Array in Google BigQuery Dataset Column

In this tutorial, you'll learn how to access an array in a Google BigQuery dataset column. We use the UNNEST() function to do so. This demo is built on the Firebase / Google Analytics dataset from Google BigQuery.
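
For readers who prefer code to video, here is a hedged sketch using the google-cloud-bigquery Python client; the project and table names are placeholders, and the event_params schema follows the standard Firebase / GA4 BigQuery export:

from google.cloud import bigquery

client = bigquery.Client()  # uses your default credentials and project

sql = """
SELECT event_name, ep.key, ep.value.string_value
FROM `my-project.analytics_123456.events_20220301`,
     UNNEST(event_params) AS ep
LIMIT 10
"""

for row in client.query(sql).result():
    print(row.event_name, row.key, row.string_value)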

#bigquery #firebase #dataset 

Elian Harber

Tablib: Python Module for Tabular Datasets in XLS, CSV, JSON, YAML, &c

Tablib: format-agnostic tabular dataset library

_____         ______  ___________ ______
__  /_______ ____  /_ ___  /___(_)___  /_
_  __/_  __ `/__  __ \__  / __  / __  __ \
/ /_  / /_/ / _  /_/ /_  /  _  /  _  /_/ /
\__/  \__,_/  /_.___/ /_/   /_/   /_.___/

Tablib is a format-agnostic tabular dataset library, written in Python.
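
A quick sketch of the core API (the values are made up):

import tablib

data = tablib.Dataset(headers=['first_name', 'last_name'])
data.append(('Kenneth', 'Reitz'))
data.append(('Bessie', 'Monke'))

print(data.export('csv'))   # CSV text
print(data.export('json'))  # JSON text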

Output formats supported:

  • Excel (Sets + Books)
  • JSON (Sets + Books)
  • YAML (Sets + Books)
  • Pandas DataFrames (Sets)
  • HTML (Sets)
  • Jira (Sets)
  • TSV (Sets)
  • ODS (Sets)
  • CSV (Sets)
  • DBF (Sets)

Note that tablib purposefully excludes XML support. It always will. (Note: This is a joke. Pull requests are welcome.)

Tablib documentation is graciously hosted on https://tablib.readthedocs.io

It is also available in the docs directory of the source distribution.

Make sure to check out Tablib on PyPI!

Contribute

Please see the contributing guide.

Author: jazzband
Source Code: https://github.com/jazzband/tablib 
License: MIT License

#python #dataset #json #excel 

HI Python

A Dataset of Python Challenges (Puzzles) for AI Research

Python Programming Puzzles (P3)

This repo contains a dataset of Python programming puzzles which can be used to teach and evaluate an AI's programming proficiency. We present code generated by OpenAI's recently released Codex 12-billion-parameter neural network solving many of these puzzles. We hope this dataset will grow rapidly, and it is already diverse in terms of problem difficulty, domain, and algorithmic tools needed to solve the problems. Please propose a new puzzle, browse newly proposed puzzles, or contribute through pull requests.

To learn more about how well AI systems such as GPT-3 can solve these problems, read our paper:

Programming Puzzles. Tal Schuster, Ashwin Kalyan, Oleksandr Polozov, Adam Tauman Kalai. In Proceedings of the Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (NeurIPS), 2021.

@inproceedings{
schuster2021programming,
title={Programming Puzzles},
author={Tal Schuster and Ashwin Kalyan and Alex Polozov and Adam Tauman Kalai},
booktitle={Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
year={2021},
url={https://openreview.net/forum?id=fe_hCc4RBrg}
}

To reproduce the results in the paper, see the solvers folder.

If you just want to dive right into solving a few puzzles, try the intro notebook at Binder that shows which puzzles the AI baselines solved and which they did not, so you can see how your programming compares.

What is a Python programming puzzle?

Each puzzle takes the form of a Python function that takes an answer as an argument. The answer is an input which makes the function return True. This is called satisfying the puzzle, and that is why the puzzles are all named sat.

def sat(s: str):
    return "Hello " + s == "Hello world"

The answer to the above puzzle is the string "world" because sat("world") returns True. The puzzles range from trivial problems like this, to classic puzzles, to programming competition problems, all the way through open problems in algorithms and mathematics.
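
Because the spec is executable, checking a candidate answer is just a function call. A minimal sketch (the candidate list is made up for illustration):

def sat(s: str):
    return "Hello " + s == "Hello world"

candidates = ["there", "world", "puzzles"]
print([c for c in candidates if sat(c)])  # ['world']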

The classic Towers of Hanoi puzzle can be written as follows:

def sat(moves: List[List[int]]):  
    """
    Eight disks of sizes 1-8 are stacked on three towers, with each tower having disks in order of largest to
    smallest. Move [i, j] corresponds to taking the smallest disk off tower i and putting it on tower j, and it
    is legal as long as the towers remain in sorted order. Find a sequence of moves that moves all the disks
    from the first to last towers.
    """
    rods = ([8, 7, 6, 5, 4, 3, 2, 1], [], [])
    for [i, j] in moves:
        rods[j].append(rods[i].pop())
        assert rods[j][-1] == min(rods[j]), "larger disk on top of smaller disk"
    return rods[0] == rods[1] == []

The shortest answer is a list of 255 moves, so instead we ask for the AI to generate code that outputs an answer. In this case, the codex API generated the following code:

def sol():
    # taken from https://www.geeksforgeeks.org/c-program-for-tower-of-hanoi/
    moves = []
    def hanoi(n, source, temp, dest):
        if n > 0:
            hanoi(n - 1, source, dest, temp)
            moves.append([source, dest])
            hanoi(n - 1, temp, source, dest)
    hanoi(8, 0, 1, 2)
    return moves

This was not on its first try, but that is one of the advantages of puzzles---it is easy for the computer to check its answers, so it can generate many answers until it finds one. For this puzzle, about 1 in 1,000 solutions were satisfactory. Clearly, Codex has seen this problem before in other input formats---it even generated a url! (Upon closer inspection, the website exists and contains Python Tower-of-Hanoi code in a completely different format with different variable names.) On a harder, less-standard Hanoi puzzle variant that requires moving from particular start to end positions, Codex didn't solve it in 10,000 attempts.

Next, consider a puzzle inspired by this easy competitive programming problem from the codeforces.com website:

def sat(inds: List[int], string="Sssuubbstrissiingg"):
    """Find increasing indices to make the substring "substring"""
    return inds == sorted(inds) and "".join(string[i] for i in inds) == "substring"

Codex generated the code below, which when run gives the valid answer [1, 3, 5, 7, 8, 9, 10, 15, 16]. This satisfies the puzzle because it is an increasing list of indices such that, if you join the characters of "Sssuubbstrissiingg" at these indices, you get "substring".

def sol(string="Sssuubbstrissiingg"):
    x = "substring"
    pos = string.index(x[0])
    inds = [pos]
    while True:
        x = x[1:]
        if not x:
            return inds
        pos = string.find(x[0], pos+1)
        if pos == -1:
            return inds
        inds.append(pos)

Again, there are multiple valid answers, and again this was out of many attempts (only 1 success in 10k).

A more challenging puzzle that requires dynamic programming is the longest increasing subsequence problem which we can also describe with strings:

def sat(x: List[int], length=20, s="Dynamic programming solves this classic job-interview puzzle!!!"):
    """Find the indices of the longest substring with characters in sorted order"""
    return all(s[x[i]] <= s[x[i + 1]] and x[i + 1] > x[i] for i in range(length - 1))

Codex didn't solve this one.

The dataset also has a number of open problems in computer science and mathematics. For example, Conway's 99-graph problem is an unsolved problem in graph theory (see also Five $1,000 Problems (Update 2017)):

def sat(edges: List[List[int]]):
    """
    Find an undirected graph with 99 vertices, in which each two adjacent vertices have exactly one common
    neighbor, and in which each two non-adjacent vertices have exactly two common neighbors.
    """
    # first compute neighbors sets, N:
    N = {i: {j for j in range(99) if j != i and ([i, j] in edges or [j, i] in edges)} for i in range(99)}
    return all(len(N[i].intersection(N[j])) == (1 if j in N[i] else 2) for i in range(99) for j in range(i))

Why puzzles? One reason is that, if we can solve them better than human programmers, then we could make progress on some important algorithms problems. But until then, a second reason is that they can be valuable for training and evaluating AI systems. Many programming datasets have been proposed over the years, and several have problems of a similar nature (like programming competition problems). In puzzles, the spec is defined by code, while other datasets usually use a combination of English and a hidden test set of input-output pairs. English-based specs are notoriously ambiguous and test the system's understanding of English. And with input-output test cases, you would have to have solved a puzzle before you pose it, so what is the use there? Code-based specs have the advantage that they are unambiguous: there is no need to debug the AI-generated code or fear that it doesn't do what you want. If it solved the puzzle, then it succeeded by definition.

For more information on the motivation and how programming puzzles can help AI learn to program, see the paper:
Programming Puzzles, by Tal Schuster, Ashwin Kalyan, Alex Polozov, and Adam Tauman Kalai. 2021 (Link to be added shortly)

Click here to browse the puzzles and solutions

The problems in this repo are based on:

Notebooks

The notebooks subdirectory has some relevant notebooks. Intro.ipynb has a dozen puzzles indicating which ones the AI solved and which it did not. Try the notebook at Binder and see how your programming compares to the AI baselines!

Demo.ipynb has the 30 problems completed by our users in a user study. Try the demo notebook and see how your programming compares to the AI baselines!

Hackathon

During a Microsoft hackathon July 27-29, 2020, several people completed 30 user study puzzles. We also had tons of fun making the puzzles in Hackathon_puzzles.ipynb. These are of a somewhat different flavor as they are more often hacks like

def sat(x):
    return x > x

where the type of x is clearly non-standard. The creators of these puzzles include github users: Adam Tauman Kalai, Alec Helbling, Alexander Vorobev, Alexander Wei, Alexey Romanov, Keith Battaochi, Kodai Sudo, Maggie Hei, Mariia Mykhailova, Misha Khodak, Monil Mehta, Philip Rosenfield, Qida Ma, Raj Bhargava, Rishi Jaiswal, Saikiran Mullaguri, Tal Schuster, and Varsha Srinivasan. You can try out the notebook at (link to be added).

Highlights

Numerous trivial puzzles like reversing a list, useful for learning to program

Classic puzzles like:

  • Towers of Hanoi
  • Verbal Arithmetic (solve digit-substitutions like SEND + MORE = MONEY)
  • The Game of Life (e.g., finding oscillators of a given period, some open)
  • Chess puzzles (e.g., knight's tour and n-queen problem variants)

Two-player games

  • Finding optimal strategies for Tic-Tac-Toe, Rock-Paper-Scissors, Mastermind (to add: connect four?)
  • Finding minimax strategies for zero-sum bimatrix games, which is equivalent to linear programming
  • Finding Nash equilibria of general-sum games (open, PPAD complete)

Math and programming competitions

  • International Mathematical Olympiad (IMO) problems
  • International Collegiate Programming Contest (ICPC) problems
  • Competitive programming problems from codeforces.com

Graph theory algorithmic puzzles

  • Shortest path
  • Planted clique (open)

Elementary algebra

  • Solving equations
  • Solving quadratic, cubic, and quartic equations

Number theory algorithmic puzzles:

  • Finding common divisors (e.g., using Euclid's algorithm)
  • Factoring numbers (easy for small factors, over $100k in prizes have been awarded and open for large numbers)
  • Discrete log (again open in general, easy for some)

Lattices

  • Learning parity (typically solved using Gaussian elimination)
  • Learning parity with noise (open)

Compression

  • Compress a given string given the decompression algorithm (but not the compression algorithm), or decompress a given compressed string given only the compression algorithm
  • (to add: compute huffman tree)

Hard math problems

  • Conway's 99-graph problem (open)
  • Finding a cycle in the Collatz process (open)

Contributing

This project welcomes contributions and suggestions. Use your creativity to help teach AI's to program! See our wiki on how to add a puzzle.

Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.

See the datasheet for our dataset.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos are subject to those third-party's policies.

Download Details:
Author: microsoft
Source Code: https://github.com/microsoft/PythonProgrammingPuzzles
License: MIT License

#python #artificial-intelligence #research #dataset 
