Knowledge graphs (KGs) are data structures that store information about different entities (nodes) and their relations (edges). A common approach to using KGs in machine learning tasks is to compute knowledge graph embeddings. DGL-KE is a high-performance, easy-to-use, and scalable package for learning large-scale knowledge graph embeddings. The package is implemented on top of the Deep Graph Library (DGL), and developers can run DGL-KE on CPU machines, GPU machines, and clusters, with a set of popular models including TransE, TransR, RESCAL, DistMult, ComplEx, and RotatE.
Figure: DGL-KE Overall Architecture
Currently DGL-KE supports three tasks:
Training: dglke_train (single machine) or dglke_dist_train (distributed environment).
Evaluation: dglke_eval.
Inference: dglke_predict, or the embedding similarity inference tasks using dglke_emb_sim.
To install the latest version of DGL-KE run:
sudo pip3 install dgl
sudo pip3 install dglke
Train a TransE model on the FB15k dataset by running the following command:
DGLBACKEND=pytorch dglke_train --model_name TransE_l2 --dataset FB15k --batch_size 1000 \
--neg_sample_size 200 --hidden_dim 400 --gamma 19.9 --lr 0.25 --max_step 500 --log_interval 100 \
--batch_size_eval 16 -adv --regularization_coef 1.00E-09 --test --num_thread 1 --num_proc 8
This command will download the FB15k dataset, train the TransE model, and save the trained embeddings to a file.
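After training, you can evaluate the saved embeddings on the test set with dglke_eval. A minimal sketch, assuming the checkpoint was written to the default ckpts/ directory and reusing the hyperparameters from the training command above (adjust the model path to your run):
DGLBACKEND=pytorch dglke_eval --model_name TransE_l2 --dataset FB15k --hidden_dim 400 --gamma 19.9 \
--batch_size_eval 16 --num_thread 1 --num_proc 8 --model_path ckpts/TransE_l2_FB15k_0/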
DGL-KE is designed for learning at scale. It introduces various novel optimizations that accelerate training on knowledge graphs with millions of nodes and billions of edges. Our benchmark on knowledge graphs consisting of over 86M nodes and 338M edges shows that DGL-KE can compute embeddings in 100 minutes on an EC2 instance with 8 GPUs and 30 minutes on an EC2 cluster with 4 machines (48 cores/machine). These results represent a 2×∼5× speedup over the best competing approaches.
Figure: DGL-KE vs GraphVite on FB15k
Figure: DGL-KE vs Pytorch-BigGraph on Freebase
Learn more details with our documentation! If you are interested in the optimizations in DGL-KE, please check out our paper for more details.
If you use DGL-KE in a scientific publication, we would appreciate citations to the following paper:
@inproceedings{DGL-KE,
author = {Zheng, Da and Song, Xiang and Ma, Chao and Tan, Zeyuan and Ye, Zihao and Dong, Jin and Xiong, Hao and Zhang, Zheng and Karypis, George},
title = {DGL-KE: Training Knowledge Graph Embeddings at Scale},
year = {2020},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
booktitle = {Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval},
pages = {739–748},
numpages = {10},
series = {SIGIR '20}
}
Author: awslabs
Source Code: https://github.com/awslabs/dgl-ke
License: Apache-2.0 license
In a conda env with pytorch available, run:
pip install -r requirements.txt
Demo Page: https://huggingface.co/spaces/ChatDoctor/ChatDoctor
It is worth noting that our model has not yet achieved 100% accurate output; please do not apply it to real clinical scenarios.
For those who want to try the online demo, please register for Hugging Face and fill out this form: link.
You can download the following training datasets (a sketch of the expected record format is shown after this list):
200k real conversations between patients and doctors from HealthCareMagic.com HealthCareMagic-200k.
26k real conversations between patients and doctors from icliniq.com icliniq-26k.
5k conversations between patients and physicians generated from ChatGPT (GenMedGPT-5k) and a disease database.
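Since train.py follows the Stanford Alpaca training pipeline, each record in these JSON files is expected to be an Alpaca-style object with instruction, input, and output fields. Below is a minimal sketch of one record; it is an assumption about the schema for illustration, not an excerpt from the datasets, so verify the keys against the downloaded files:
# Hypothetical example record (schema assumed from the Alpaca-style train.py, not copied from the data).
{
    "instruction": "If you are a doctor, please answer the medical questions based on the patient's description.",
    "input": "I have had a persistent dry cough for two weeks. What could be causing it?",
    "output": "A dry cough lasting two weeks can have several possible causes, such as post-viral irritation or allergies. ..."
}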
Our model was first fine-tuned on Stanford Alpaca's data to acquire basic conversational capabilities. Alpaca link
In order to download the checkpoints, fill this form: link. Place the model weights file in the ./pretrained folder.
torchrun --nproc_per_node=4 --master_port=<your_random_port> train.py \
--model_name_or_path <your_path_to_hf_converted_llama_ckpt_and_tokenizer> \
--data_path ./HealthCareMagic-200k.json \
--bf16 True \
--output_dir pretrained \
--num_train_epochs 3 \
--per_device_train_batch_size 4 \
--per_device_eval_batch_size 4 \
--gradient_accumulation_steps 8 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 2000 \
--save_total_limit 1 \
--learning_rate 2e-5 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--fsdp "full_shard auto_wrap" \
--fsdp_transformer_layer_cls_to_wrap 'LLaMADecoderLayer' \
--tf32 True
You can build a ChatDoctor model on your own machine and communicate with it.
python chat.py
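chat.py wraps model loading and generation; if you prefer to script the interaction yourself, the following is a minimal sketch using Hugging Face transformers. The checkpoint path, prompt template, and generation settings are assumptions for illustration, not the repository's exact code:
# Minimal sketch: load the fine-tuned weights (assumed to be in ./pretrained) and generate a reply.
import torch
from transformers import LlamaForCausalLM, LlamaTokenizer

model_path = "./pretrained"  # assumed location of the converted, fine-tuned checkpoint
tokenizer = LlamaTokenizer.from_pretrained(model_path)
model = LlamaForCausalLM.from_pretrained(model_path, torch_dtype=torch.float16, device_map="auto")

# Alpaca-style prompt template (an assumption; chat.py defines the exact format used in training).
prompt = (
    "Below is an instruction that describes a task, paired with an input that provides further context. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\nIf you are a doctor, please answer the medical questions based on the patient's description.\n\n"
    "### Input:\nI have had a headache and mild fever for three days. What should I do?\n\n"
    "### Response:\n"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=256, temperature=0.7, do_sample=True)
print(tokenizer.decode(output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))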
ChatDoctor is a next-generation AI doctor model that is based on the LLaMA model. The goal of this project is to provide patients with an intelligent and reliable healthcare companion that can answer their medical queries and provide them with personalized medical advice.
ChatDoctor is an advanced language model specifically designed for medical applications. It has been trained on a large corpus of medical literature and has a deep understanding of medical terminology, procedures, and diagnoses. This foundation enables it to analyze patients' symptoms and medical history, provide accurate diagnoses, and suggest appropriate treatment options.
The ChatDoctor model is designed to simulate a conversation between a doctor and a patient, using natural language processing (NLP) and machine learning techniques. Patients can interact with the ChatDoctor model through a chat interface, asking questions about their health, symptoms, or medical conditions. The model will then analyze the input and provide a response that is tailored to the patient's unique situation.
One of the key features of the ChatDoctor model is its ability to learn and adapt over time. As more patients interact with the model, it will continue to refine its responses and improve its accuracy. This means that patients can expect to receive increasingly personalized and accurate medical advice over time.
Recent large language models (LLMs) in the general domain, such as ChatGPT, have shown remarkable success in following instructions and producing human-like responses. However, such language models have not been tailored to the medical domain, resulting in poor answer accuracy and inability to give plausible recommendations for medical diagnosis, medications, etc. To address this issue, we collected more than 700 diseases and their corresponding symptoms, required medical tests, and recommended medications, from which we generated 5K doctor-patient conversations. In addition, we obtained 200K real patient-doctor conversations from online Q&A medical consultation sites. By fine-tuning LLMs using these doctor-patient conversations, the resulting models emerge with great potential to understand patients' needs, provide informed advice, and offer valuable assistance in a variety of medical-related fields. The integration of these advanced language models into healthcare can revolutionize the way healthcare professionals and patients communicate, ultimately improving the overall efficiency and quality of patient care and outcomes. In addition, we made public all the source codes, datasets, and model weights to facilitate the further development of dialogue models in the medical field.
The development of instruction-following large language models (LLMs) such as ChatGPT has garnered significant attention due to their remarkable success in instruction understanding and human-like response generation. These auto-regressive LLMs are pre-trained over web-scale natural languages by predicting the next token and then fine-tuned to follow large-scale human instructions. Also, they have shown strong performances over a wide range of NLP tasks and generalizations to unseen tasks, demonstrating their potential as a unified solution for various problems such as natural language understanding, text generation, and conversational AI. However, the exploration of such general-domain LLMs in the medical field remains relatively untapped, despite the immense potential they hold for transforming healthcare communication and decision-making. The specific reason is that the existing models do not learn the medical field in detail, resulting in the models often giving wrong diagnoses and wrong medical advice when playing the role of a doctor. By fine-tuning the large language dialogue model on the data of doctor-patient conversations, the application of the model in the medical field can be significantly improved. Especially in areas where medical resources are scarce, ChatDoctor can be used for initial diagnosis and triage of patients, significantly improving the operational efficiency of existing hospitals.
Since large language models such as ChatGPT are not open source, we used Meta's LLaMA: we first trained a generic conversation model using the 52K instruction-following examples provided by Stanford Alpaca, and then fine-tuned the model on our collected physician-patient conversation dataset. The main contributions of our method are three-fold:
The first step in building a physician-patient conversation dataset is to collect the disease database that serves as the gold standard. Therefore, we collected and organized a database of diseases, which contains about 700 diseases with their relative symptoms, medical tests, and recommended medications. To train high-quality conversation models on an academic budget, we input each message from the disease database separately as a prompt into the ChatGPT API to automatically generate instruction data. It is worth noting that our prompts to the ChatGPT API contain the gold standard of diseases and symptoms, and drugs, so our fine-tuned ChatDoctor is not only able to achieve ChatGPT's conversational fluency but also higher diagnostic accuracy compared to ChatGPT. We finally collected 5K doctor-patient conversation instructions and named it InstructorDoctor-5K.
The generated conversations, while ensuring accuracy, have a low diversity of conversations. Therefore, we also collected about 200k real doctor-patient conversations from an online Q&A based medical advisory service website -- "Health Care Magic." We manually and automatically filtered these data to remove physician and patient names and used language tools to correct grammatical errors in the responses.
We build ChatDoctor utilizing Meta's LLaMA model, a distinguished publicly accessible LLM. Notably, despite having only 7 billion parameters, LLaMA has been reported to attain competitive or superior results compared with the considerably larger GPT-3 (175 billion parameters) on several NLP benchmarks. LLaMA's performance improvement was achieved by amplifying the magnitude of training data, as opposed to parameter quantity. Specifically, LLaMA was trained on 1.4 trillion tokens, procured from publicly accessible data repositories such as CommonCrawl and arXiv documents. We utilize conversation demonstrations synthesized via ChatGPT and subsequently validated by medical practitioners to fine-tune the LLaMA model, in accordance with the Stanford Alpaca training methodology; our model was first fine-tuned on Stanford Alpaca's data to acquire basic conversational capabilities. The fine-tuning process was conducted using 6 A100 GPUs for a duration of 30 minutes. The hyperparameters employed in the training process were as follows: a total batch size of 192, a learning rate of 2e-5, 3 epochs, a maximum sequence length of 512 tokens, a warmup ratio of 0.03, and no weight decay.
We emphasize that ChatDoctor is for academic research only; any commercial or clinical use is prohibited. There are three factors in this decision: First, ChatDoctor is based on LLaMA, which has a non-commercial license, so we necessarily inherit this restriction. Second, our model is not licensed for healthcare-related purposes. Third, we have not designed sufficient safety measures, and the current model still does not guarantee the full correctness of medical diagnoses.
ChatDoctor: A Medical Chat Model Fine-tuned on LLaMA Model using Medical Domain Knowledge
@misc{yunxiang2023chatdoctor,
title={ChatDoctor: A Medical Chat Model Fine-tuned on LLaMA Model using Medical Domain Knowledge},
author={Li Yunxiang and Li Zihan and Zhang Kai and Dan Ruilong and Zhang You},
year={2023},
eprint={2303.14070},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
Web: https://arxiv.org/abs/2303.14070
200k real conversations between patients and doctors from HealthCareMagic.com HealthCareMagic-200k.
26k real conversations between patients and doctors from icliniq.com icliniq-26k.
5k conversations between patients and physicians generated from ChatGPT (GenMedGPT-5k) and a disease database.
Checkpoints of ChatDoctor, fill this form.
Online hugging face demo application form.
Stanford Alpaca data for basic conversational capabilities. Alpaca link.
Yunxiang Li1, Zihan Li2, Kai Zhang3, Ruilong Dan4, You Zhang1
1 University of Texas Southwestern Medical Center, Dallas, USA
2 University of Illinois at Urbana-Champaign, Urbana, USA
3 Ohio State University, Columbus, USA
4 Hangzhou Dianzi University, Hangzhou, China
Author: Kent0n-Li
Source Code: https://github.com/Kent0n-Li/ChatDoctor
License: Apache-2.0 license
kb is a text-oriented minimalist command line knowledge base manager. kb can be considered a quick note collection and access tool oriented toward software developers, penetration testers, hackers, students or whoever has to collect and organize notes in a clean way. Although kb is mainly targeted on text-based note collection, it supports non-text files as well (e.g., images, pdf, videos and others).
The project was born from the frustration of trying to find a good way to quickly access my notes, procedures, cheatsheets and lists (e.g., payloads) but at the same time, keeping them organized. This is particularly useful for any kind of student. I use it in the context of penetration testing to organize pentesting procedures, cheatsheets, payloads, guides and notes.
I found myself too frequently spending time trying to search for that particular payload list quickly, or spending too much time trying to find a specific guide/cheatsheet for a needed tool. kb tries to solve this problem by providing you a quick and intuitive way to access knowledge.
In a few words, kb allows a user to quickly and efficiently:
Basically, kb provides a clean text-based way to organize your knowledge.
You should have Python 3.6 or above installed.
To install the most recent stable version of kb just type:
pip install -U kb-manager
If you want to install the bleeding-edge version of kb (that may have some bugs) you should do:
git clone https://github.com/gnebbia/kb
cd kb
pip install -r requirements.txt
python setup.py install
# or with pip
pip install -U git+https://github.com/gnebbia/kb
Tip for GNU/Linux and MacOS users: For a better user experience, also set the following kb bash aliases:
cat <<EOF > ~/.kb_alias
alias kbl="kb list"
alias kbe="kb edit"
alias kba="kb add"
alias kbv="kb view"
alias kbd="kb delete --id"
alias kbg="kb grep"
alias kbt="kb list --tags"
EOF
echo "source ~/.kb_alias" >> ~/.bashrc
source ~/.kb_alias
Please remember to upgrade kb frequently by doing:
pip install -U kb-manager
Arch Linux users can install kb or kb-git with their favorite AUR Helper.
Stable:
yay -S kb
Dev:
yay -S kb-git
Of course it runs on NetBSD (and on pkgsrc). We can install it from pkgsrc source tree (databases/py-kb) or as a binary package using pkgin:
pkgin in py38-kb
Note that at the moment the package is only available from -current repositories.
To install using homebrew, use:
brew tap gnebbia/kb https://github.com/gnebbia/kb.git
brew install gnebbia/kb/kb
To upgrade with homebrew:
brew update
brew upgrade gnebbia/kb/kb
Windows users should keep in mind these things:
EDITOR=C:\Program Files\Editor\my cool editor.exe -> WRONG!
EDITOR="C:\Program Files\Editor\my cool editor.exe" -> OK!
To set the "EDITOR" Environment variable by using cmd.exe, just issue the following commands, after having inserted the path to your desired text editor:
set EDITOR="C:\path\to\editor\here.exe"
setx EDITOR "\"C:\path\to\editor\here.exe\""
To set the "EDITOR" Environment variable by using Powershell, just issue the following commands, after having inserted the path to your desired text editor:
$env:EDITOR='"C:\path\to\editor\here.exe"'
[System.Environment]::SetEnvironmentVariable('EDITOR','"C:\path\to\editor\here.exe"', [System.EnvironmentVariableTarget]::User)
Open a cmd.exe terminal with administrative rights and paste the following commands:
reg add "HKEY_LOCAL_MACHINE\Software\Microsoft\Command Processor" /v "AutoRun" /t REG_EXPAND_SZ /d "%USERPROFILE%\autorun.cmd"
(
echo @echo off
echo doskey kbl=kb list $*
echo doskey kbe=kb edit $*
echo doskey kba=kb add $*
echo doskey kbv=kb view $*
echo doskey kbd=kb delete --id $*
echo doskey kbg=kb grep $*
echo doskey kbt=kb list --tags $*
)> %USERPROFILE%\autorun.cmd
Open a Powershell terminal and paste the following commands:
@'
function kbl { kb list $args }
function kbe { kb edit $args }
function kba { kb add $args }
function kbv { kb view $args }
function kbd { kb delete --id $args }
function kbg { kb grep $args }
function kbt { kb list --tags $args }
'@ > $env:USERPROFILE\Documents\WindowsPowerShell\profile.ps1
A docker setup has been included to help with development.
To install and start the project with docker:
docker-compose up -d
docker-compose exec kb bash
The container has the aliases included in its .bashrc, so you can use kb in the running container as you would if you installed it on the host directly. The ./docker/data directory on the host is bound to /data in the container, which is also the image's working directory. To interact with the container, place (or symlink) the files on your host into the ./docker/data directory; they can then be seen and used in the /data directory in the container.
A quick demo of a typical scenario using kb:
A quick demo with kb aliases enabled:
A quick demo for non-text documents:
kb list
# or if aliases are used:
kbl
kb list zip
# or if aliases are used:
kbl zip
kb list --category cheatsheet
# or
kb list -c cheatsheet
# or if aliases are used:
kbl -c cheatsheet
kb list --tags "web;pentest"
# or if aliases are used:
kbl --tags "web;pentest"
kb list -v
# or if aliases are used:
kbl -v
kb add ~/Notes/cheatsheets/pytest
# or if aliases are used:
kba ~/Notes/cheatsheets/pytest
kb add ~/ssh_tunnels --title pentest_ssh --category "procedure" \
--tags "pentest;network" --author "gnc" --status "draft"
kb add ~/Notes/cheatsheets/general/* --category "cheatsheet"
kb add --title "ftp" --category "notes" --tags "protocol;network"
# a text editor ($EDITOR) will be launched for editing
kb add --title "my_network_scan" --category "scans" --body "$(nmap -T5 -p80 192.168.1.0/24)"
kb delete --id 2
# or if aliases are used:
kbd 2
kb delete --id 2 3 4
# or if aliases are used:
kbd 2 3 4
kb delete --title zap --category cheatsheet
kb view --id 3
# or
kb view -i 3
# or
kb view 3
# or if aliases are used:
kbv 3
kb view --title "gobuster"
# or
kb view -t "gobuster"
# or
kb view gobuster
kb view -t dirb -n
kb view -i 2 -e
# or if aliases are used:
kbv 2 -e
Editing artifacts involves opening a text editor. Hence, binary files cannot be edited by kb.
The editor can be set by the "EDITOR" environment variable.
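For example, on GNU/Linux or macOS you could add a line like the following to your shell profile (vim is only an illustrative choice):
export EDITOR=vim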
kb edit --id 13
# or if aliases are used:
kbe 13
kb edit --title "git" --category "cheatsheet"
# or
kb edit -t "git" -c "cheatsheet"
# or if git is unique as artifact
kb edit git
kb grep "[bg]zip"
# or if aliases are used:
kbg "[bg]zip"
kb grep -i "[BG]ZIP"
kb grep -v "[bg]zip"
kb grep -m "[bg]zip"
To export the entire knowledge base, do:
kb export
This will generate a .kb.tar.gz archive that can later be imported by kb.
If you want to export only data (so that it can be used in other software):
kb export --only-data
This will export a directory containing a subdirectory for each category and within these subdirectories we will have all the artifacts belonging to that specific category.
kb import archive.kb.tar.gz
NOTE: Importing a knowledge base erases all previous data: everything is deleted and replaced by the imported knowledge base.
kb erase
kb supports custom templates for the artifacts. A template is basically a file using the "toml" format, structured in this way:
TITLES = [ "^#.*", "blue", ]
WARNINGS = [ "!.*" , "yellow",]
COMMENTS = [ ";;.*", "green", ]
Where the first element of each list is a regex and the second element is a color.
Note that by default an artifact is assigned with the 'default' template, and this template can be changed too (look at "Edit a template" subsection).
To list all available templates:
kb template list
To list all the templates containing the string "theory":
kb template list "theory"
Create a new template called "lisp-cheatsheets", note that an example template will be put as example in the editor.
kb template new lisp-cheatsheets
To delete the template called "lisp-cheatsheets" just do:
kb template delete lisp-cheatsheets
To edit the template called "lisp-cheatsheets" just do:
kb template edit lisp-cheatsheets
We can also add a template from an already existing toml configuration file by just doing:
kb template add ~/path/to/myconfig.toml --title myconfig
We can change the template for an existing artifact by ID by using the update command:
kb update --id 2 --template "lisp-cheatsheets"
We can apply the template "lisp-cheatsheets" to all artifacts belonging to the category "lispcode" by doing:
kb template apply "lisp-cheatsheets" --category "lispcode"
We can apply the template "dark" to all artifacts having in their title the string "zip" (e.g., bzip, 7zip, zipper) by doing:
kb template apply "dark" --title "zip" --extended-match
# or
kb template apply "dark" --title "zip" -m
We can always have our queries "contain" the string by using the --extended-match option when using kb template apply.
We can apply the template "light" to all artifacts of the category "cheatsheet" who have as author "gnc" and as status "OK" by doing:
kb template apply "light" --category "cheatsheet" --author "gnc" --status "OK"
kb can be integrated with other tools.
We can integrate kb with rofi: a custom mode has been developed, available in the "misc" directory within this repository.
We can launch rofi with this mode by doing:
rofi -show kb -modi kb:/path/to/rofi-kb-mode.sh
Synchronization with a remote git repository is experimental at the moment. We can initialize our knowledge base against an empty GitHub/GitLab (or other git service) repository by doing:
kb sync init
We can then push our knowledge base to the remote git repository with:
kb sync push
We can pull (e.g., from another machine) our knowledge base from the remote git repository with:
kb sync pull
We can at any time view which remote endpoint our knowledge base is synchronizing to with:
kb sync info
If you want to upgrade kb to the most recent stable release do:
pip install -U kb-manager
If instead you want to update kb to the most recent release (that may be buggy), do:
git clone https://github.com/gnebbia/kb
cd kb
pip install --upgrade .
Q) How do I solve the "AttributeError: module 'attr' has no attribute 's'" error?
A) Uninstall attr and use attrs:
pip uninstall attr
pip uninstall attrs
pip install attrs
pip install -U kb-manager
Date: 2022-09-21
Version: 0.1.7
Author: Gnebbia
Source Code: https://github.com/gnebbia/kb
License: GPL-3.0 license
Empower the team with sharing your knowledge.
Crowi is a Markdown Wiki like:
Install dependencies and build CSS and JavaScript:
$ npm install
More info is here.
Don't use the master branch because it is unstable. Use a released version unless you want to contribute to the project.
Crowi is designed to be set up on Heroku or another PaaS, but you can also start up Crowi with environment variables on your local machine.
$ PASSWORD_SEED=somesecretstring MONGO_URI=mongodb://username:password@localhost/crowi node app.js
or write these settings in a .env file.
PORT: Server port. Default: 3000.
BASE_URL: Server base URL (e.g. https://demo.crowi.wiki/). If this env is not set, it is detected from the accessing URL.
NODE_ENV: production or development.
MONGO_URI: URI to connect to MongoDB. This parameter is also set by MONGOHQ_URL or MONGOLAB_URI.
REDIS_URL: URI to connect to Redis (used for the session store and socket.io). This parameter is also set by REDISTOGO_URL. Use the rediss:// scheme if you want a TLS connection to Redis.
REDIS_REJECT_UNAUTHORIZED: Set "0" if you want to skip the verification of the certificate.
ELASTICSEARCH_URI: URI to connect to Elasticsearch.
PASSWORD_SEED: A password seed used by the password hash generator.
SECRET_TOKEN: A secret key for verifying the integrity of signed cookies.
FILE_UPLOAD: aws (default), local, none
Optional:
MATHJAX: If set to 1, enable the MathJax feature.
PLANTUML_URI: If set to the URL of a PlantUML server, enable the PlantUML feature. e.g. http://localhost:18080
ENABLE_DNSCACHE: If set to true, use an internal DNS cache for Crowi in Linux VMs. (See also: #407)
See: .env.sample
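For example, a minimal .env for a local setup might look like the following sketch (all values are placeholders, not working credentials):
PORT=3000
NODE_ENV=development
MONGO_URI=mongodb://username:password@localhost/crowi
REDIS_URL=redis://localhost:6379
PASSWORD_SEED=somesecretstring
SECRET_TOKEN=anothersecretstring
FILE_UPLOAD=local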
We can use docker-compose for development without complicated settings.
$ docker-compose -f docker-compose.development.yml up
Please try the following commands.
# Stop containers
$ docker-compose -f docker-compose.development.yml stop
# Remove containers
$ docker-compose -f docker-compose.development.yml rm
# Remove images
$ docker-compose -f docker-compose.development.yml images -q | xargs docker rmi -f
# Build images
$ docker-compose -f docker-compose.development.yml build
Author: Crowi
Source Code: https://github.com/crowi/crowi
License: MIT license
Python library for Representation Learning on Knowledge Graphs
Open source library based on TensorFlow that predicts links between concepts in a knowledge graph.
AmpliGraph is a suite of neural machine learning models for relational learning, a branch of machine learning that deals with supervised learning on knowledge graphs.
Use AmpliGraph if you need to:
AmpliGraph's machine learning models generate knowledge graph embeddings, vector representations of concepts in a metric space:
It then combines embeddings with model-specific scoring functions to predict unseen and novel links:
AmpliGraph includes the following submodules:
Create and activate a virtual environment (conda)
conda create --name ampligraph python=3.7
source activate ampligraph
AmpliGraph is built on TensorFlow 1.x. Install from pip or conda:
CPU-only
pip install "tensorflow>=1.15.2,<2.0"
or
conda install tensorflow'>=1.15.2,<2.0.0'
GPU support
pip install "tensorflow-gpu>=1.15.2,<2.0"
or
conda install tensorflow-gpu'>=1.15.2,<2.0.0'
Install the latest stable release from pip:
pip install ampligraph
If instead you want the most recent development version, you can clone the repository and install from source (your local working copy will be on the latest commit on the develop branch). The code snippet below will install the library in editable mode (-e):
git clone https://github.com/Accenture/AmpliGraph.git
cd AmpliGraph
pip install -e .
>> import ampligraph
>> ampligraph.__version__
'1.4.0'
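As a quick sanity check of the API, here is a minimal sketch that trains and evaluates a ComplEx model on FB15k-237. The hyperparameters are illustrative only, not the tuned values used for the benchmark below:
import numpy as np
from ampligraph.datasets import load_fb15k_237
from ampligraph.latent_features import ComplEx
from ampligraph.evaluation import evaluate_performance, mrr_score, hits_at_n_score

# Load the train/valid/test splits of (subject, predicate, object) triples.
X = load_fb15k_237()

# Illustrative hyperparameters; see the documentation for tuned settings.
model = ComplEx(batches_count=50, epochs=100, k=100, eta=20,
                loss='multiclass_nll', optimizer='adam',
                optimizer_params={'lr': 1e-4}, seed=0, verbose=True)
model.fit(X['train'])

# Filter known triples so they are not counted as false negatives during ranking.
filter_triples = np.concatenate([X['train'], X['valid'], X['test']])
ranks = evaluate_performance(X['test'], model=model, filter_triples=filter_triples, verbose=True)

print("MRR: %.3f" % mrr_score(ranks))
print("Hits@10: %.3f" % hits_at_n_score(ranks, n=10))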
AmpliGraph includes implementations of TransE, DistMult, ComplEx, HolE, ConvE, and ConvKB. Their predictive power is reported below and compared against the state-of-the-art results in literature. More details available here.
| Model | FB15K-237 | WN18RR | YAGO3-10 | FB15k | WN18 |
|---|---|---|---|---|---|
| Literature Best | 0.35* | 0.48* | 0.49* | 0.84** | 0.95* |
| TransE (AmpliGraph) | 0.31 | 0.22 | 0.51 | 0.63 | 0.66 |
| DistMult (AmpliGraph) | 0.31 | 0.47 | 0.50 | 0.78 | 0.82 |
| ComplEx (AmpliGraph) | 0.32 | 0.51 | 0.49 | 0.80 | 0.94 |
| HolE (AmpliGraph) | 0.31 | 0.47 | 0.50 | 0.80 | 0.94 |
| ConvE (AmpliGraph) | 0.26 | 0.45 | 0.30 | 0.50 | 0.93 |
| ConvE (1-N, AmpliGraph) | 0.32 | 0.48 | 0.40 | 0.80 | 0.95 |
| ConvKB (AmpliGraph) | 0.23 | 0.39 | 0.30 | 0.65 | 0.80 |
* Timothee Lacroix, Nicolas Usunier, and Guillaume Obozinski. Canonical tensor decomposition for knowledge base completion. In International Conference on Machine Learning, 2869–2878. 2018.
** Kadlec, Rudolf, Ondrej Bajgar, and Jan Kleindienst. "Knowledge base completion: Baselines strike back. " arXiv preprint arXiv:1705.10744 (2017). Results above are computed assigning the worst rank to a positive in case of ties. Although this is the most conservative approach, some published literature may adopt an evaluation protocol that assigns the best rank instead.
The project documentation can be built from your local working copy with:
cd docs
make clean autogen html
See guidelines from AmpliGraph documentation.
If you like AmpliGraph and you use it in your project, why not star the project on GitHub!
If you instead use AmpliGraph in an academic publication, cite as:
@misc{ampligraph,
author= {Luca Costabello and
Sumit Pai and
Chan Le Van and
Rory McGrath and
Nicholas McCarthy and
Pedro Tabacof},
title = {{AmpliGraph: a Library for Representation Learning on Knowledge Graphs}},
month = mar,
year = 2019,
doi = {10.5281/zenodo.2595043},
url = {https://doi.org/10.5281/zenodo.2595043}
}
Join the conversation on Slack
Author: Accenture
Source Code: https://github.com/Accenture/AmpliGraph
License: Apache-2.0 license
Don't know what JavaScript is, or why it is used?
Then this article has been cooked up and served just for you.
JavaScript is a text-based programming language used both on the client-side and server-side that allows you to make web pages interactive. Where HTML and CSS are languages that give structure and style to web pages, JavaScript gives web pages interactive elements that engage a user. Common examples of JavaScript that you might use every day include the search box on Amazon, a news recap video embedded on The New York Times, or refreshing your Twitter feed.
Incorporating JavaScript improves the user experience of the web page by converting it from a static page into an interactive one. To recap, JavaScript adds behavior to web pages.
I mentioned above that JavaScript is a "scripting language." Scripting languages are coding languages used to automate processes that users would otherwise need to execute on their own, step-by-step. Short of scripting, any changes on web pages you visit would require either manually reloading the page or navigating a series of static menus to get to the content you're after.
A scripting language like JavaScript (JS, for those in the know) does the heavy lifting by telling computer programs like websites or web applications to “do something.” In the case of JavaScript, this means telling those dynamic features described earlier to do whatever it is they do — like telling images to animate themselves, photos to cycle through a slideshow, or autocomplete suggestions to respond to prompts. It’s the “script” in JavaScript that makes these things happen seemingly on their own.
#javascript #knowledge #sharing
As software developers, we always want our software to work properly. We'll do everything to improve the software quality. To find the best solution, we are ready to use parallelization or various other optimization techniques. One of these optimization techniques is so-called string interning. It allows users to reduce memory usage. It also makes string comparison faster. However, everything is good in moderation. Interning at every turn is not worth it. Further, I'll show you how not to slip up by creating a hidden bottleneck in the form of the String.Intern method in your application.
In case you’ve forgotten, let me remind you that string is a reference type in C#. Therefore, the string variable itself is just a reference that lies on the stack and stores an address. The address points to an instance of the String class located on the heap.
There are several ways to calculate how many bytes a string object takes on the heap: the version by Jon Skeet and the version by Timur Guev (the latter article is in Russian). In the picture above, I used the second option. Even if this formula is not 100% exact, we can still estimate the size of string objects. For example, about 4.7 million strings (each 100 characters long) are enough to take up 1 GB of RAM. Let's say there's a large number of duplicates among the strings in a program. Then it's worth using the interning functionality built into the framework. Now, why don't we briefly recap what string interning is?
The idea of string interning is to store only one instance of the String type in memory for identical strings. When running an app, the virtual machine creates an internal hash table, called the interning table (sometimes it is called String Pool). This table stores references to each unique string literal declared in the program. In addition, using the two methods described below, we can get and add references to string objects to this table by ourselves. If an application contains numerous strings (which are often identical), it makes no sense to create a new instance of the String class every time. Instead, you can simply refer to an instance of the String type that has already been created on the heap. To get a reference to it, access the interning table. The virtual machine itself interns all string literals in the code (to find more about interning tricks, check this article). We may choose one of two methods: String.Intern and String.IsInterned.
The first one takes a string as input. If there’s an identical string in the interning table, it returns a reference to an object of the String type that already exists on the heap. If there’s no such string in the table, the reference to this string object is added to the interning table. Then, it is returned from the method. The IsInterned method also accepts a string as input and returns a reference from the interning table to an existing object. If there’s no such object, null is returned (everyone knows about the non-intuitive return value of this method).
Using interning, we reduce the number of new string objects by working with existing ones through references obtained via the Intern method. Thus, we do not create a large number of new objects. So, we save memory and improve program performance. After all, many string objects, references to which quickly disappear from the stack, can lead to frequent garbage collection. It will negatively affect the overall program performance. Interned strings won’t disappear up to the end of the process, even if the references to these objects are no longer in the program. This thing is worth paying attention to. Using interning to reduce memory consumption can produce the opposite effect.
Interning strings can boost performance when comparing these very strings. Let’s take a look at the implementation of the String.Equals method:
Before calling the EqualsHelper method, where a character-by-character comparison of strings is performed, the Object.ReferenceEquals method checks for the equality of references. If the strings are interned, the Object.ReferenceEquals method returns true when the strings are equal (without comparing the strings themselves character-by-character). Of course, if the references are not equal, then the EqualsHelper method will be called, and the subsequent character-by-character comparison will occur. After all, the Equals method does not know that we are working with interned strings. Also, if the ReferenceEquals method returns false, we know that the compared strings are different.
If you are sure that the input strings are interned at a specific place in the program, then you can compare them using the Object.ReferenceEquals method. However, it's not the greatest approach. There is always a chance that the code will change in the future. Also, it may be reused in another part of the program. So, non-interned strings can get into it. In this case, when comparing two identical non-interned strings via the ReferenceEquals method, we will assume that they are not identical.
Interning strings for later comparison seems justified only if you plan to compare interned strings quite often. Remember that interning an entire set of strings also takes some time. Therefore, you shouldn’t perform it to compare several instances of strings once.
Well, we've reviewed what string interning is. Now, let's move on to the problem I faced.
In our bug tracker, there was a task created long ago. It required some research on how parallelizing the C++ code analysis can save analysis time. It would be great if the PVS-Studio analyzer worked in parallel on several machines when analyzing a single project. I chose IncrediBuild as the software that allows such parallelization. IncrediBuild allows you to run different processes in parallel on machines located on the same network. For example, you can parallelize source files compiling on different company machines (or in a cloud). Thus, we save time on the building process. Game developers often use this software.
Well, I started working on this task. At first, I selected a project and analyzed it with PVS-Studio on my machine. Then, I ran the analysis using IncrediBuild, parallelizing the analyzer processes on the company’s machines. At the end, I summed up the results of such parallelization. So, having positive results, we’ll offer our clients such solutions to speed up the analysis.
I chose the Unreal Tournament project. We managed to persuade the programmers to install IncrediBuild on their machines. As a result, we had the combined cluster with about 145 cores.
I analyzed the Unreal Tournament project using the compilation monitoring system in PVS-Studio. So, I worked as follows: I ran the CLMonitor.exe program in monitor mode and performed a full build of Unreal Tournament in Visual Studio. Then, after building process, I ran CLMonitor.exe again, but in the analysis launch mode. Depending on the value specified in the PVS-Studio settings for the ThreadCount parameter, CLMonitor.exe simultaneously runs the corresponding number of PVS-Studio.exe child processes at the same time. These processes are engaged in the analysis of each individual source C++ file. One PVS-Studio.exe child process analyzes one source file. After the analysis, it passes the results back to CLMonitor.exe.
Everything is easy: in the PVS-Studio settings, I set the ThreadCount parameter equal to the number of available cores (145). I run the analysis getting ready for 145 processes of PVS-Studio.exe executed in parallel on remote machines. IncrediBuild has Build Monitor, a user-friendly parallelization monitoring system. Using it, you can observe the processes running on remote machines. The same I observed in the process of analysis:
It seemed that nothing could be easier. Relax and watch the analysis process. Then simply record its duration with IncrediBuild and without. However, in practice, it turned out to be a little bit complicated…
During the analysis, I could switch to other tasks. I also could just meditate looking at PVS-Studio.exe running in the Build Monitor window. As the analysis with IncrediBuild ended, I compared its duration with the results of the one without IncrediBuild. The difference was significant. However, the overall result could have been better. It was 182 minutes on one machine with 8 threads and 50 minutes using IncrediBuild with 145 threads. It turned out that the number of threads increased by 18 times. Meanwhile, the analysis time decreased by only 3.5 times. Finally, I glimpsed the result in the Build Monitor window. Scrolling through the report, I noticed something weird. That’s what I saw on the chart:
I noticed that PVS-Studio.exe executed and completed successfully. But then for some reason, the process paused before starting the next one. It happened again and again. Pause after pause. These downtimes led to a noticeable delay and did their bit to prolong the analysis time. At first, I blamed IncrediBuild. Probably it performs some kind of internal synchronization and slows down the launch.
I shared the results with my senior colleague. He didn't jump to conclusions. He suggested looking at what's going on inside our CLMonitor.exe app right when downtime appears on the chart. I ran the analysis again. Then, I noticed the first obvious "failure" on the chart. I connected to the CLMonitor.exe process via the Visual Studio debugger and paused it. Opening the Threads window, my colleague and I saw about 145 suspended threads. Reviewing the places in the code where the execution paused, we saw code lines with similar content:
What do these lines have in common? Each of them uses the String.Intern method. And it seems justified. Because these are the places where CLMonitor.exe handles data from PVS-Studio.exe processes. Data is written to objects of the ErrorInfo type, which encapsulates information about a potential error found by the analyzer. Also, we intern quite reasonable things, namely paths to source files. One source file may contain many errors, so it doesn't make sense for ErrorInfo objects to contain different string objects with the same content. It's fair enough to just refer to a single object on the heap.
Without a second thought, I realized that string interning had been applied at the wrong moment. So, here’s the situation we observed in the debugger. For some reason, 145 threads were hanging on executing the String.Intern method. Meanwhile, the custom task scheduler LimitedConcurrencyLevelTaskScheduler inside CLMonitor.exe couldn’t start a new thread that would later start a new PVS-Studio.exe process. Then, IncrediBuild would have already run this process on the remote machine. After all, from the scheduler’s point of view, the thread has not yet completed its execution. It performs the transformation of the received data from PVS-Studio.exe in ErrorInfo, followed by string interning. The completion of the PVS-Studio.exe process doesn’t mean anything to the thread. The remote machines are idle. The thread is still active. Also, we set the limit of 145 threads, which does not allow the scheduler to start a new one.
A larger value for the ThreadCount parameter would not solve the problem. It would only increase the queue of threads hanging on the execution of the String.Intern method.
We did not want to remove interning at all. It would increase the amount of RAM consumed by CLMonitor.exe. Eventually, we found a fairly simple and elegant solution. We decided to move interning from the thread that runs PVS-Studio.exe to a slightly later place of code execution (in the thread that directly generates the error report).
As my colleague said, we managed to make a very accurate edit of just two lines. Thus, we solved the problem with idle remote machines. So, we ran the analysis again. There were no significant time intervals between PVS-Studio.exe launches. The analysis time decreased from 50 minutes to 26, that is, almost by half. Now, let's take a look at the overall result that we got using IncrediBuild and 145 available cores. The total analysis time decreased by 7 times. It's far better than by 3.5 times.
#csharp #knowledge #c#
Philosophers have long recognized the difference between two types of knowledge: knowing-how and knowing-that, where (roughly and very informally) the former is typically associated with skills and abilities, and the latter is associated with propositions (truths/established facts). In our everyday discourse we use the word 'know' for both types of knowledge, which creates some confusion. So, for example, we say things like:
But there are several fundamental differences between K1 and K2, some of which are listed in the diagram below:
The difference between knowing-how and knowing-that
Below we discuss in some detail these opposing properties as they relate to K1 and K2 given above.
#ai & machine learning #artificial intelligence #knowledge #learning
C# capabilities keep expanding from year to year. New features enrich software development. However, their advantages may not always be so obvious. For example, the good old yield. To some developers, especially beginners, it's like magic - inexplicable, but intriguing. This article shows how yield works and what this peculiar word hides. Have fun reading!
The yield keyword is used to build generators of element sequences. These generators do not create collections. Instead, the sequence stores the current state - and moves on to the next state on command. Thus, memory requirements are minimal and do not depend on the number of elements. It’s not hard to guess that generated sequences can be infinite.
In the simplest scenario, the generator stores the current element and contains a set of commands that must be executed to get a new element. This is often much more convenient than creating a collection and storing all of its elements.
While there is nothing wrong with writing a class to implement the generator’s behavior, yield simplifies creating such generators significantly. You do not have to create new classes - everything works already.
I must point out here that yield is not a feature available exclusively in C#. However, while the concept is the same, in different languages yield may be implemented and used differently. Which is why here’s one more reminder that this article talks about yield only in the context of C#.
#csharp #knowledge
The world around us is changing rapidly. With the arrival of Industrial Revolution 4.0, businesses of all sizes and types are increasingly capitalizing on advanced, intelligent technologies. They are taking advantage of automation to reduce time-consuming, tedious tasks, especially by automating assembly line work. However, as such intelligent technologies perform better than humans at these tasks, businesses must think about knowledge management for their employees. Knowledge is a crucial aspect of achieving high-quality employee performance. The field of knowledge management draws on psychology, epistemology, and cognitive science. Information provided by artificial intelligence can play a crucial role in helping business employees make timely decisions.
Knowledge grows when used and deflates when kept under lock. That’s true! Artificial intelligence provides the mechanisms that enable machines to learn. It allows them to gain, process and utilize knowledge to perform tasks. AI also enables machines to unlock knowledge that can be delivered to humans to improve the decision-making process.
#artificial intelligence #latest news #knowledge
By far, Jenkins is the most adopted tool for continuous integration, owning nearly 50% of the market share. As so many developers are using it, it has excellent community support, like no other Jenkins alternative. With that, it has more than 1,500 plugins available for continuous integration and delivery purposes.
We love and respect Jenkins. After all, it's the first tool we encountered at the beginning of our automation careers. But as things are rapidly changing in the automation field, Jenkins is left behind with its old approach. Even though many developers and companies are using it, most of them aren't happy with it. Having used it ourselves on previous projects, we quickly became frustrated by its lack of functionality, numerous maintenance issues, dependencies, and scaling problems.
We decided to investigate if other developers face the same problems and quickly saw the need to create a tool ourselves. We asked some developers at last year’s AWS Summit in Berlin about this. Most of them told us that they chose Jenkins because it’s free in the first place. However, many of them expressed interest in trying to use some other Jenkins alternative.
When using Jenkins, teams tend to make more mistakes with the delivery and end up with broken pipelines. As a result, they implement inefficient practices, can’t adopt agility well, and lose the flexibility to innovate. When problems come up, you instantly need an expert that will resolve the issues to unblock developers.
Microtica is a Jenkins alternative with modern DevOps in its foundations. Say goodbye to plugins and dependencies, late maintenance nights, backup, scaling, patching issues.
#knowledge #devops automation #microservices
The past two decades ignited the technological revolution we are living today. However, even though companies embraced it in public, most of them have been slow in adapting their workplaces. We’ve all heard about the famous paperless office, but employees are still drowning in a sea of paper. Process re-engineering was another buzz word, but customers still face manual systems with time-consuming processes at the backend. Remote working was the cornerstone of our technological advancement, yet employers were afraid of losing control, and their doubts extinguished these excellent initiatives.
Then the pandemic happened. Everyone was in shock and panic seeped in. The only solution was to switch online, quickly digitise our paperwork and start rethinking our processes. And companies realised that things weren’t so bad after all. As a matter of fact, before the pandemic, only 10% of the workforce primarily worked from home. When it finally passes, this number will increase to 25%. Furthermore, employers are more open to this alternative form of work because it carries several benefits such as downsizing office space, decreasing meeting hours, reducing business travel, and making office hours more flexible.
But there’s another benefit which many employers do not realise. The talent pool of potential employees suddenly increased by several folds. Workers do not need to reside in the country where they are working (if they can work remotely). Some countries like Bermuda, Croatia and many others are even offering working visas. But this pandemic is also opening up new horizons for another sector which has a lot of untapped potential; people with disabilities.
#disability #knowledge #artificial-intelligence #technology #data-science
We have officially entered the post-truth era.
With the rise of deep-fakes, lying politicians, and Surkovian disinformation campaigns, it’s hard to get a handle on what truth even is.
For a few months I was deep in a skeptical hole where I had truly lost grip on what I considered “real”, and I had to claw my way out by getting real silly and coming up with a formal definition that we might all agree with. Truth, I propose, is given by this expression:
That’s it. That’s truth.
I’ll define terms shortly and it’ll be clear that I’m abstracting away many messy details, but I will try to convince you that this basic structure agrees with many of our informal intuitions and is useful for decision-making — potentially even serving as an objective function for the automated scientists of the near future. Even better: perhaps it can be used as an objective function for search-systems or newsfeeds, only returning the top results as ranked by their truth-value.
If you buy the definition wholesale, I’ll show that it has interesting implications that aren’t immediately obvious. If you don’t buy the definition at all, I invite you to put all critiques, counterexamples, and challenges in the comments so I can update accordingly.
The first thing you’ll notice is that truth is a function.
This basic framing already aligns with our intuition in important ways: truth isn’t a value that floats out in the ether all by its lonesome, it’s a property of a statement — which we write here as s. Statements themselves don’t have a truth-value on their own, either: they are always grounded by some set of contexts, D.
Some basic validation by intuition: the statement "the boy runs" doesn't have any truth-value assigned to it until you give it a context. If the context involves a boy running, then the statement is true. If the context does not include a boy running, then it is false. You need both a statement s and a set of contexts D to evaluate truth, even if D means "all possible contexts" (for example, tautologies are true in all contexts!).
It can be argued that we can do away with the notion of contexts if statements were fully qualified (i.e “The boy runs and the time and place is such where there is a boy running”), but since contexts are an easy-to-use shorthand, it’s what we’ll go with.
#science #philosophy #knowledge #data-science
The COVID-19 pandemic we're all facing this year is transforming the workforce throughout the whole world. In fact, according to Stanford University research, the US has become a work-from-home economy, with 42% of the workforce now working from home full-time. While some businesses are finding it difficult to adjust to this modern lifestyle, others adapted immediately. Looking ahead, for a large number of organizations, the inability to adapt could be fatal.
With the pandemic, numerous companies have had to become fully digital. Although the need to deliver software fast and efficiently has always been present, now it’s becoming bigger than ever. However, fast software delivery also brings a lot of error-prone processes.
DevOps is a collection of activities meant to minimize the lead time between a code commit and deployment to production, thus maintaining this process on a high level. This way of working, and even thinking, has been present in software development for a while. However, with the new situation we’ve encountered, its benefits are getting more obvious.
A large number of companies from multiple industries are already using DevOps through various tools and activities. They enjoy a number of benefits, like continuous delivery, agile teams, separated release, and deployment processes. They all lead to an overall improvement of the product and service quality.
As it doesn't seem like we're going back to the office any time soon, DevOps will be more than necessary in the near future.
European tech companies are also transforming due to the COVID-19 pandemic. Many developers’ teams went fully-remote, moving from a single location to many different ones. Also, a lot of developers went to their home countries.
Even though remote working in a normal situation is a very natural and useful thing, the quick shift to a remote setting was a shock for many. It turned out that going remote voluntarily and going remote by force are two completely different things.
Teams had to rapidly adapt to the new normal while maintaining their productivity at a high level. This opened the need for distributed DevOps, forcing teams to become agile even if they weren’t before.
For some teams, this switch was easier than it was for others. IDC UK’s blog post on this topic mentions GitHub and Microsoft as teams that have been gradually moving to a distributed model. On the other hand, Cycloid’s primary model is distributed, so the coronavirus situation didn’t change their way of working. Zapier and Trello are other companies that are remote by default. Moreover, Twitter and Atlassian adapted to the situation by allowing their employees to work from home permanently.
However, these shifts require other types of investments, requiring companies to provide a stronger tech stack for their employees. For example, GitHub and Microsoft are investing in cloud development systems to allow their dev teams easier access to their infrastructure.
Many companies are considering going down this path. In fact, Codefresh conducted research asking teams how the pandemic is affecting their DevOps state. 58% of respondents told Codefresh that due to the pandemic, they are moving parts of their infrastructure to the cloud. Moreover, 17% are migrating their entire stack to the cloud.
So, companies are increasing their DevOps budgets as it positively impacts the productivity of developers. Here are our suggestions on what teams should focus on to become more efficient:
Code deployment can be a complex process. Errors can disrupt the functioning of the entire software, and therefore, the entire organization. However, continuous delivery has come to the rescue.
DevOps is all about continuous delivery. This means that the team always has the source code in a deployable state, deploys software often, and is confident in the deployment pipeline. When a team deploys often, it means that modifications aren’t that big. It also means that they can be done by a few people instead of the entire team. This makes communication much easier.
More trust in the implementation process is often a positive consequence of continuous deployment. This is due to the increased deployment frequency. It doesn't have to be stressful to deploy code to production, so teams can accelerate go-to-market. What is more, they can provide customer value in a shorter period of time.
#knowledge #news #coronavirus #covid19 #devops #pandemic #productivity
DevOps improves software delivery speed and quality through a list of practices that pursue an agile mindset. The terms that first come to mind when you mention DevOps are continuous integration, continuous delivery and deployment, collaboration, automation, and monitoring.
DevOps means different things to different teams. Some teams are all about automation, while others do things manually and still consider that they are doing DevOps. Some consider it a culture and a mindset-shaper.
As DevOps revolves around continuous delivery and fast code shipping, it’s crucial to act quickly without any significant errors. That’s why it’s vital to track the DevOps metrics that can help you achieve this.
To succeed in DevOps, teams use many different tools. That’s why different DevOps metrics are essential for different dev teams.
So, before even beginning with DevOps, your team should determine what DevOps means for them. What is more, teams should also detect their biggest DevOps challenges. Then, it will be easier for them to decide which DevOps metrics they need to monitor more actively to improve and create a more quality software delivery process.
Here are the critical DevOps metrics most teams find important:
It is important to develop and sustain a competitive advantage to offer updates, new functionalities, and technical enhancements with greater quality and accuracy. The opportunity to increase delivery intensity contributes to increased flexibility and better adherence to evolving consumers’ requirements.
The aim should be to allow smaller deployments as frequently as possible. When deployments are smaller, software testing and deployment are much more comfortable.
Regularly measuring deployment frequency will offer greater visibility into which improvements were more successful and which segments require change. A rapid drop in frequency can indicate that other tasks or manual actions are disrupting the workflow. For sustainable growth and development, deployment frequency indicators that suggest minor yet constant changes are optimal.
Going one step further and making testing more manageable can measure both production and non-production deployments. This way, you’ll be able to determine the frequency of your deployments to QA and optimize for early and smaller deployments.
Adding this metric is in Microtica's roadmap. As Microtica provides a build and deployment timeline, we're planning to add a feature to show you your build and deployment frequency.
#knowledge #devops #devops automation #devops metrics #devops tools