Stable: v1.2.1 / Roadmap | F.A.Q.
High-performance inference of OpenAI's Whisper automatic speech recognition (ASR) model:
Supported platforms:
The entire implementation of the model is contained in two source files: the core tensor operations in ggml.c (C) and the transformer model with its high-level C-style API in whisper.cpp (C++).
Having such a lightweight implementation of the model makes it easy to integrate into different platforms and applications. As an example, here is a video of running the model on an iPhone 13 device - fully offline, on-device: whisper.objc
https://user-images.githubusercontent.com/1991296/197385372-962a6dea-bca1-4d50-bf96-1d8c27b98c81.mp4
You can also easily make your own offline voice assistant application: command
https://user-images.githubusercontent.com/1991296/204038393-2f846eae-c255-4099-a76d-5735c25c49da.mp4
Or you can even run it straight in the browser: talk.wasm
The tensor operators are optimized heavily for Apple silicon CPUs. Depending on the computation size, ARM NEON SIMD intrinsics or CBLAS Accelerate framework routines are used. The latter are especially effective for bigger sizes since the Accelerate framework utilizes the special-purpose AMX coprocessor available in modern Apple products.
First, download one of the Whisper models converted in ggml format. For example:
bash ./models/download-ggml-model.sh base.en
Now build the main example and transcribe an audio file like this:
# build the main example
make
# transcribe an audio file
./main -f samples/jfk.wav
For a quick demo, simply run make base.en:
$ make base.en
cc -I. -O3 -std=c11 -pthread -DGGML_USE_ACCELERATE -c ggml.c -o ggml.o
c++ -I. -I./examples -O3 -std=c++11 -pthread -c whisper.cpp -o whisper.o
c++ -I. -I./examples -O3 -std=c++11 -pthread examples/main/main.cpp whisper.o ggml.o -o main -framework Accelerate
./main -h
usage: ./main [options] file0.wav file1.wav ...
options:
-h, --help [default] show this help message and exit
-t N, --threads N [4 ] number of threads to use during computation
-p N, --processors N [1 ] number of processors to use during computation
-ot N, --offset-t N [0 ] time offset in milliseconds
-on N, --offset-n N [0 ] segment index offset
-d N, --duration N [0 ] duration of audio to process in milliseconds
-mc N, --max-context N [-1 ] maximum number of text context tokens to store
-ml N, --max-len N [0 ] maximum segment length in characters
-bo N, --best-of N [5 ] number of best candidates to keep
-bs N, --beam-size N [-1 ] beam size for beam search
-wt N, --word-thold N [0.01 ] word timestamp probability threshold
-et N, --entropy-thold N [2.40 ] entropy threshold for decoder fail
-lpt N, --logprob-thold N [-1.00 ] log probability threshold for decoder fail
-su, --speed-up [false ] speed up audio by x2 (reduced accuracy)
-tr, --translate [false ] translate from source language to english
-di, --diarize [false ] stereo audio diarization
-nf, --no-fallback [false ] do not use temperature fallback while decoding
-otxt, --output-txt [false ] output result in a text file
-ovtt, --output-vtt [false ] output result in a vtt file
-osrt, --output-srt [false ] output result in a srt file
-owts, --output-words [false ] output script for generating karaoke video
-ocsv, --output-csv [false ] output result in a CSV file
-of FNAME, --output-file FNAME [ ] output file path (without file extension)
-ps, --print-special [false ] print special tokens
-pc, --print-colors [false ] print colors
-pp, --print-progress [false ] print progress
-nt, --no-timestamps [true ] do not print timestamps
-l LANG, --language LANG [en ] spoken language ('auto' for auto-detect)
--prompt PROMPT [ ] initial prompt
-m FNAME, --model FNAME [models/ggml-base.en.bin] model path
-f FNAME, --file FNAME [ ] input WAV file path
bash ./models/download-ggml-model.sh base.en
Downloading ggml model base.en ...
ggml-base.en.bin 100%[========================>] 141.11M 6.34MB/s in 24s
Done! Model 'base.en' saved in 'models/ggml-base.en.bin'
You can now use it like this:
$ ./main -m models/ggml-base.en.bin -f samples/jfk.wav
===============================================
Running base.en on all samples in ./samples ...
===============================================
----------------------------------------------
[+] Running base.en on samples/jfk.wav ... (run 'ffplay samples/jfk.wav' to listen)
----------------------------------------------
whisper_init_from_file: loading model from 'models/ggml-base.en.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab = 51864
whisper_model_load: n_audio_ctx = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx = 448
whisper_model_load: n_text_state = 512
whisper_model_load: n_text_head = 8
whisper_model_load: n_text_layer = 6
whisper_model_load: n_mels = 80
whisper_model_load: f16 = 1
whisper_model_load: type = 2
whisper_model_load: mem required = 215.00 MB (+ 6.00 MB per decoder)
whisper_model_load: kv self size = 5.25 MB
whisper_model_load: kv cross size = 17.58 MB
whisper_model_load: adding 1607 extra tokens
whisper_model_load: model ctx = 140.60 MB
whisper_model_load: model size = 140.54 MB
system_info: n_threads = 4 / 10 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
main: processing 'samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...
[00:00:00.000 --> 00:00:11.000] And so my fellow Americans, ask not what your country can do for you, ask what you can do for your country.
whisper_print_timings: fallbacks = 0 p / 0 h
whisper_print_timings: load time = 113.81 ms
whisper_print_timings: mel time = 15.40 ms
whisper_print_timings: sample time = 11.58 ms / 27 runs ( 0.43 ms per run)
whisper_print_timings: encode time = 266.60 ms / 1 runs ( 266.60 ms per run)
whisper_print_timings: decode time = 66.11 ms / 27 runs ( 2.45 ms per run)
whisper_print_timings: total time = 476.31 ms
The command downloads the base.en model converted to custom ggml format and runs the inference on all .wav samples in the folder samples.
For detailed usage instructions, run: ./main -h
Note that the main example currently runs only with 16-bit WAV files, so make sure to convert your input before running the tool. For example, you can use ffmpeg like this:
ffmpeg -i input.mp3 -ar 16000 -ac 1 -c:a pcm_s16le output.wav
If you want some extra audio samples to play with, simply run:
make samples
This will download a few more audio files from Wikipedia and convert them to 16-bit WAV format via ffmpeg.
You can download and run the other models as follows:
make tiny.en
make tiny
make base.en
make base
make small.en
make small
make medium.en
make medium
make large-v1
make large
Model | Disk | Mem | SHA |
---|---|---|---|
tiny | 75 MB | ~125 MB | bd577a113a864445d4c299885e0cb97d4ba92b5f |
base | 142 MB | ~210 MB | 465707469ff3a37a2b9b8d8f89f2f99de7299dac |
small | 466 MB | ~600 MB | 55356645c2b361a969dfd0ef2c5a50d530afd8d5 |
medium | 1.5 GB | ~1.7 GB | fd9727b6e1217c2f614f9b698455c4ffd82463b4 |
large | 2.9 GB | ~3.3 GB | 0f4c8e34f21cf1a914c59d8b3ce882345ad349d6 |
Here is another example of transcribing a 3:24 min speech in about half a minute on a MacBook M1 Pro, using the medium.en model:
Expand to see the result
$ ./main -m models/ggml-medium.en.bin -f samples/gb1.wav -t 8
whisper_init_from_file: loading model from 'models/ggml-medium.en.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab = 51864
whisper_model_load: n_audio_ctx = 1500
whisper_model_load: n_audio_state = 1024
whisper_model_load: n_audio_head = 16
whisper_model_load: n_audio_layer = 24
whisper_model_load: n_text_ctx = 448
whisper_model_load: n_text_state = 1024
whisper_model_load: n_text_head = 16
whisper_model_load: n_text_layer = 24
whisper_model_load: n_mels = 80
whisper_model_load: f16 = 1
whisper_model_load: type = 4
whisper_model_load: mem required = 1720.00 MB (+ 43.00 MB per decoder)
whisper_model_load: kv self size = 42.00 MB
whisper_model_load: kv cross size = 140.62 MB
whisper_model_load: adding 1607 extra tokens
whisper_model_load: model ctx = 1462.35 MB
whisper_model_load: model size = 1462.12 MB
system_info: n_threads = 8 / 10 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
main: processing 'samples/gb1.wav' (3179750 samples, 198.7 sec), 8 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...
[00:00:00.000 --> 00:00:08.000] My fellow Americans, this day has brought terrible news and great sadness to our country.
[00:00:08.000 --> 00:00:17.000] At nine o'clock this morning, Mission Control in Houston lost contact with our Space Shuttle Columbia.
[00:00:17.000 --> 00:00:23.000] A short time later, debris was seen falling from the skies above Texas.
[00:00:23.000 --> 00:00:29.000] The Columbia's lost. There are no survivors.
[00:00:29.000 --> 00:00:32.000] On board was a crew of seven.
[00:00:32.000 --> 00:00:39.000] Colonel Rick Husband, Lieutenant Colonel Michael Anderson, Commander Laurel Clark,
[00:00:39.000 --> 00:00:48.000] Captain David Brown, Commander William McCool, Dr. Kultna Shavla, and Ilan Ramon,
[00:00:48.000 --> 00:00:52.000] a colonel in the Israeli Air Force.
[00:00:52.000 --> 00:00:58.000] These men and women assumed great risk in the service to all humanity.
[00:00:58.000 --> 00:01:03.000] In an age when space flight has come to seem almost routine,
[00:01:03.000 --> 00:01:07.000] it is easy to overlook the dangers of travel by rocket
[00:01:07.000 --> 00:01:12.000] and the difficulties of navigating the fierce outer atmosphere of the Earth.
[00:01:12.000 --> 00:01:18.000] These astronauts knew the dangers, and they faced them willingly,
[00:01:18.000 --> 00:01:23.000] knowing they had a high and noble purpose in life.
[00:01:23.000 --> 00:01:31.000] Because of their courage and daring and idealism, we will miss them all the more.
[00:01:31.000 --> 00:01:36.000] All Americans today are thinking as well of the families of these men and women
[00:01:36.000 --> 00:01:40.000] who have been given this sudden shock and grief.
[00:01:40.000 --> 00:01:45.000] You're not alone. Our entire nation grieves with you,
[00:01:45.000 --> 00:01:52.000] and those you love will always have the respect and gratitude of this country.
[00:01:52.000 --> 00:01:56.000] The cause in which they died will continue.
[00:01:56.000 --> 00:02:04.000] Mankind is led into the darkness beyond our world by the inspiration of discovery
[00:02:04.000 --> 00:02:11.000] and the longing to understand. Our journey into space will go on.
[00:02:11.000 --> 00:02:16.000] In the skies today, we saw destruction and tragedy.
[00:02:16.000 --> 00:02:22.000] Yet farther than we can see, there is comfort and hope.
[00:02:22.000 --> 00:02:29.000] In the words of the prophet Isaiah, "Lift your eyes and look to the heavens
[00:02:29.000 --> 00:02:35.000] who created all these. He who brings out the starry hosts one by one
[00:02:35.000 --> 00:02:39.000] and calls them each by name."
[00:02:39.000 --> 00:02:46.000] Because of His great power and mighty strength, not one of them is missing.
[00:02:46.000 --> 00:02:55.000] The same Creator who names the stars also knows the names of the seven souls we mourn today.
[00:02:55.000 --> 00:03:01.000] The crew of the shuttle Columbia did not return safely to earth,
[00:03:01.000 --> 00:03:05.000] yet we can pray that all are safely home.
[00:03:05.000 --> 00:03:13.000] May God bless the grieving families, and may God continue to bless America.
[00:03:13.000 --> 00:03:19.000] [Silence]
whisper_print_timings: fallbacks = 1 p / 0 h
whisper_print_timings: load time = 569.03 ms
whisper_print_timings: mel time = 146.85 ms
whisper_print_timings: sample time = 238.66 ms / 553 runs ( 0.43 ms per run)
whisper_print_timings: encode time = 18665.10 ms / 9 runs ( 2073.90 ms per run)
whisper_print_timings: decode time = 13090.93 ms / 549 runs ( 23.85 ms per run)
whisper_print_timings: total time = 32733.52 ms
This is a naive example of performing real-time inference on audio from your microphone. The stream tool samples the audio every half a second and runs the transcription continuously. More info is available in issue #10.
make stream
./stream -m ./models/ggml-base.en.bin -t 8 --step 500 --length 5000
https://user-images.githubusercontent.com/1991296/194935793-76afede7-cfa8-48d8-a80f-28ba83be7d09.mp4
Adding the --print-colors argument will print the transcribed text using an experimental color coding strategy to highlight words with high or low confidence.
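To try it, using the base.en model and the JFK sample from the steps above:
./main -m models/ggml-base.en.bin -f samples/jfk.wav --print-colors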
For example, to limit the line length to a maximum of 16 characters, simply add -ml 16:
./main -m ./models/ggml-base.en.bin -f ./samples/jfk.wav -ml 16
whisper_model_load: loading model from './models/ggml-base.en.bin'
...
system_info: n_threads = 4 / 10 | AVX2 = 0 | AVX512 = 0 | NEON = 1 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 |
main: processing './samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...
[00:00:00.000 --> 00:00:00.850] And so my
[00:00:00.850 --> 00:00:01.590] fellow
[00:00:01.590 --> 00:00:04.140] Americans, ask
[00:00:04.140 --> 00:00:05.660] not what your
[00:00:05.660 --> 00:00:06.840] country can do
[00:00:06.840 --> 00:00:08.430] for you, ask
[00:00:08.430 --> 00:00:09.440] what you can do
[00:00:09.440 --> 00:00:10.020] for your
[00:00:10.020 --> 00:00:11.000] country.
The --max-len argument can be used to obtain word-level timestamps. Simply use -ml 1:
./main -m ./models/ggml-base.en.bin -f ./samples/jfk.wav -ml 1
whisper_model_load: loading model from './models/ggml-base.en.bin'
...
system_info: n_threads = 4 / 10 | AVX2 = 0 | AVX512 = 0 | NEON = 1 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 |
main: processing './samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...
[00:00:00.000 --> 00:00:00.320]
[00:00:00.320 --> 00:00:00.370] And
[00:00:00.370 --> 00:00:00.690] so
[00:00:00.690 --> 00:00:00.850] my
[00:00:00.850 --> 00:00:01.590] fellow
[00:00:01.590 --> 00:00:02.850] Americans
[00:00:02.850 --> 00:00:03.300] ,
[00:00:03.300 --> 00:00:04.140] ask
[00:00:04.140 --> 00:00:04.990] not
[00:00:04.990 --> 00:00:05.410] what
[00:00:05.410 --> 00:00:05.660] your
[00:00:05.660 --> 00:00:06.260] country
[00:00:06.260 --> 00:00:06.600] can
[00:00:06.600 --> 00:00:06.840] do
[00:00:06.840 --> 00:00:07.010] for
[00:00:07.010 --> 00:00:08.170] you
[00:00:08.170 --> 00:00:08.190] ,
[00:00:08.190 --> 00:00:08.430] ask
[00:00:08.430 --> 00:00:08.910] what
[00:00:08.910 --> 00:00:09.040] you
[00:00:09.040 --> 00:00:09.320] can
[00:00:09.320 --> 00:00:09.440] do
[00:00:09.440 --> 00:00:09.760] for
[00:00:09.760 --> 00:00:10.020] your
[00:00:10.020 --> 00:00:10.510] country
[00:00:10.510 --> 00:00:11.000] .
The main example provides support for output of karaoke-style movies, where the currently pronounced word is highlighted. Use the -owts argument and run the generated bash script. This requires ffmpeg to be installed.
Here are a few "typical" examples:
./main -m ./models/ggml-base.en.bin -f ./samples/jfk.wav -owts
source ./samples/jfk.wav.wts
ffplay ./samples/jfk.wav.mp4
https://user-images.githubusercontent.com/1991296/199337465-dbee4b5e-9aeb-48a3-b1c6-323ac4db5b2c.mp4
./main -m ./models/ggml-base.en.bin -f ./samples/mm0.wav -owts
source ./samples/mm0.wav.wts
ffplay ./samples/mm0.wav.mp4
https://user-images.githubusercontent.com/1991296/199337504-cc8fd233-0cb7-4920-95f9-4227de3570aa.mp4
./main -m ./models/ggml-base.en.bin -f ./samples/gb0.wav -owts
source ./samples/gb0.wav.wts
ffplay ./samples/gb0.wav.mp4
https://user-images.githubusercontent.com/1991296/199337538-b7b0c7a3-2753-4a88-a0cd-f28a317987ba.mp4
Use the extra/bench-wts.sh script to generate a video in the following format:
./extra/bench-wts.sh samples/jfk.wav
ffplay ./samples/jfk.wav.all.mp4
https://user-images.githubusercontent.com/1991296/223206245-2d36d903-cf8e-4f09-8c3b-eb9f9c39d6fc.mp4
In order to have an objective comparison of the performance of the inference across different system configurations, use the bench tool. The tool simply runs the Encoder part of the model and prints how much time it took to execute it. The results are summarized in the following GitHub issue:
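To run the benchmark locally (a minimal sketch; the model path assumes the base.en model downloaded earlier, and the exact flags can be checked with ./bench -h):
make bench
./bench -m ./models/ggml-base.en.bin -t 4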
The original models are converted to a custom binary format. This allows everything needed to be packed into a single file:
You can download the converted models using the models/download-ggml-model.sh script or manually from here:
For more details, see the conversion script models/convert-pt-to-ggml.py or the README in models.
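As a rough sketch of how the conversion is typically invoked (the exact arguments here are an assumption; consult the script and the models README for the authoritative usage), the script takes the original PyTorch checkpoint, a path to a checkout of the OpenAI whisper repository, and an output directory:
python3 models/convert-pt-to-ggml.py ~/.cache/whisper/medium.pt /path/to/openai-whisper ./models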
There are various examples of using the library for different projects in the examples folder. Some of the examples are even ported to run in the browser using WebAssembly. Check them out!
Example | Web | Description |
---|---|---|
main | whisper.wasm | Tool for translating and transcribing audio using Whisper |
bench | bench.wasm | Benchmark the performance of Whisper on your machine |
stream | stream.wasm | Real-time transcription of raw microphone capture |
command | command.wasm | Basic voice assistant example for receiving voice commands from the mic |
talk | talk.wasm | Talk with a GPT-2 bot |
whisper.objc | iOS mobile application using whisper.cpp | |
whisper.swiftui | SwiftUI iOS / macOS application using whisper.cpp | |
whisper.android | Android mobile application using whisper.cpp | |
whisper.nvim | Speech-to-text plugin for Neovim | |
generate-karaoke.sh | Helper script to easily generate a karaoke video of raw audio capture | |
livestream.sh | Livestream audio transcription | |
yt-wsp.sh | Download + transcribe and/or translate any VOD (original) |
If you have any kind of feedback about this project, feel free to use the Discussions section and open a new topic. You can use the Show and tell category to share your own projects that use whisper.cpp. If you have a question, make sure to check the Frequently asked questions (#126) discussion.
Author: ggerganov
Source Code: https://github.com/ggerganov/whisper.cpp
License: MIT license
We’ve recently reviewed the best books for learning SQL, PHP, JavaScript, Python, and Node.js. In this article, we’ll review the best books for learning HTML.
HTML is a fundamental building block of the Web, and familiarity with HTML is a must-have skill for any aspiring web developer. And we're talking about a lot more than simply structuring documents, as there's no limit to the amount of interactivity you can create with APIs in HTML5.
We’ve rounded up the best HTML books for beginners and advanced coders. Get ready to dive into the language of the Web! 🤓
HTML stands for “HyperText Markup Language”. It’s a language for describing web pages using plain text.
HTML is the primary language used to create web pages, and it's a very easy language to learn. It's easy to read and understand, and it's also easy to write. HTML is the language that web browsers parse and render in order to display web pages.
HTML is used to create the structure of a web page, including forms, embedding of videos and images, and creating links to other web pages.
HTML is an essential language for web developers to learn, as it’s used to create the structure of web pages. If you’re looking to get started in web development, a great place to start is learning HTML.
Note: you can explore a wide variety of HTML topics through the HTML articles on SitePoint. And if you’re stuck on an HTML issue, our friendly forum experts will help you get it sorted in no time.
HTML might be easy to learn, but as the foundation of the Web, it’s also a vast and ever-evolving technology.
Here are some things to consider when picking an HTML book:
Also consider that, while HTML5 first appeared back in 2008, companion technologies like JavaScript and Web APIs have progressed a lot since then. So if you're looking for a book covering multiple disciplines, make sure to pick one that's up to date in all respects.
Let’s first look at books for HTML beginners.
With over 30 years of experience, best-selling author Jennifer Robbins is one of the first professional designers of the Web, and the co-founder of the Artifact Conference for web designers and developers.
In HTML5 Pocket Reference she presents an alphabetical listing of every HTML element and attribute, markup examples, notes indicating the differences between HTML5 and HTML 4.01, and an overview of HTML5 APIs. Not bad for one of the shortest books in this list!
The HTML5 Pocket Reference is part of O’Reilly’s Pocket Reference series of over 34 books, and this fifth edition includes updates regarding the HTML5.1 Working Draft, and the WHATWG standards.
Prolific author Mike McGrath covers all of the basics of HTML in this extremely fun, concise, and accessible book that’s on its ninth edition, and counting! 👏
Why do I say HTML in Easy Steps — the highest rated book in this list — is fun?
You can get a taste of the book’s layout in an online sample.
HTML in Easy Steps is part of the In Easy Steps series, which includes over 200 titles. If you find you like this introduction to HTML, you can take your skills further and see HTML in context with HTML, CSS & JavaScript in Easy Steps from the same series and by the same author (480 pages, 4.6/5, July 2020).
Mark Pilgrim is a developer advocate, a former Google employee, and an author who wrote the acclaimed Dive into HTML5, which is a free online book.
Part of O’Reilly’s Up and Running series, HTML5: Up and Running is essentially the printed version of Dive into HTML5, which also happens to be a relatively short read.
It must be noted that HTML5: Up and Running has some lower ratings, partly because it’s a bit outdated — the author having abandoned the project some time ago. That said, you can still check out the online version first before you decide whether or not it’s your cup of tea.
Let’s now look at some books that are particularly appropriate for children who want to learn HTML.
A unique offering from Union Square & Co., HTML for Babies is a board book that introduces the fundamentals of HTML to youngsters while making extensive use of colors and huge fonts to overemphasize HTML syntax.
And while 16 pages surely is too short for adults (many of the bad reviews are about that), for babies and toddlers it may be just right.
If this sounds like a book you might be interested in, check out the other offerings in this five-book Code Babies series.
Get Coding! is a series of two books for children between nine and 12 years old, written by the Young Rewired State, a non-profit that helps young people to become digital makers.
The first book, Get Coding!, is a full-color introduction to HTML, CSS and JavaScript. It includes a step-by-step guide to building a website, an app and a game. It currently ranks #10 best seller in CSS, and it has over 2,600 ratings, making it the second most popular book in this list!
Young Rewired State's YouTube channel also has plenty of videos that serve as companion content to the book.
Best Intermediate and Advanced HTML Books
Let’s now look at some books for intermediate to advanced users of HTML.
J.D. Gauchat is a writer, programmer and entrepreneur, and his books are popular among web developers and tech professionals.
In HTML5 for Masterminds, he covers HTML5 in depth and provides step-by-step instructions on how to create responsive websites and applications with HTML5.
Besides all the fundamentals, the book covers a number of modern Web APIs, such as:
… and many more
HTML5 for Masterminds is part of the four-book series For Masterminds, and it’s one of the longest books in this list.
Terry Felke-Morris is a college professor emerita of web design and development, the author of multiple web development books, and the author of this comprehensive book on web development and design.
In Web Development & Design Foundations with HTML5, she teaches the basics and more of HTML5 and its related technologies, such as CSS3 and JavaScript, to help readers create websites — all while having a somewhat academic yet approachable angle to it.
The book is also the largest HTML book in our list, and in its ninth edition (wow! 👏). It also includes updates on HTML5.1 and HTML5.2. It’s a best seller, standing at #2 in XHTML, and you can safely consider it an indispensable guide for web development and design newbies.
Some readers will prefer to learn HTML, CSS and/or JavaScript all at once. There's good reason for this, because most of the time these three technologies go hand in hand with one another. So the next few books we'll look at present two or more of these languages in tandem.
Web design instructor David DuRocher presents a comprehensive yet simplified guide in HTML and CSS QuickStart Guide.
As described on the back cover, you’ll learn the following:
- Modern web design fundamentals, how to use the powerful combination of HTML5 and CSS3
- Site structure and responsive design principles, how to format HTML and CSS for all devices
- How to incorporate forms, multimedia elements, and captivating animations into your projects
- How to effectively produce HTML documents using industry-standard tools such as GitHub
- HTML and CSS elements, formatting, padding, gradients, menus, testing, debugging, and more
HTML and CSS QuickStart Guide is a best seller #4 in XHTML and #9 in CSS. It’s labeled as “Great on Kindle” (a distinction very few technical books get), and it even has an audiobook version!
Jon Duckett is a well-known author of books about web design and programming. His book HTML & CSS: Design and Build Websites is the most rated HTML/CSS book in this list by quite a margin, and also one of the best rated.
This book is over ten years old, and yet its content is still relevant today. It’s also beautifully designed, with full-color illustrations and screen captures. (See a sample chapter.)
It’s a #1 best seller in CSS, #2 in Web Design, and #2 in Computer Programming, and has a companion website with code samples for every chapter, and plenty of extras.
If you like this book, there’s also JavaScript and jQuery and PHP & MySQL by the same author and in the same style, both with fantastic reviews.
The Murach’s series is well known for its lengthy and well written books for learning programming and software development, and Murach’s HTML5 and CSS3 is no exception.
With one of the highest ratings on this list, this fifth-edition book is a best seller #9 in CSS. It’s an update to the fourth edition, which has over 400 ratings.
Aside from reference aids, the main sections include:
Co-author Zak Ruvalcaba also has a few courses on Udemy that you might find useful.
In HTML, CSS, and JavaScript, heavyweight authors Julie Meloni and Jennifer Kyrnin integrate these languages with examples that you can use as a reference, or use as a starting point for your own projects.
Part of the Sams Teach Yourself series — which boasts over 200 books — this third edition includes recent updates to the HTML5 and CSS3 standards.
A relatively large book, HTML, CSS, and JavaScript is also one of the most comprehensive, providing plentiful illustrations.
In HTML5 and CSS3: All-in-One for Dummies, bestselling author Andy Harris covers a lot of ground in web development. As is typical of the For Dummies series, the subject is presented in a very approachable and down-to-earth manner.
Here’s a list of eight “books” contained in this #5 best seller in XHTML:
To be honest, this series is a little too legacy for my taste, as it focuses strongly on technologies like PHP and MySQL, which have been losing ground for quite some time against NoSQL databases and pure JavaScript frameworks such as React, Angular, or Vue. But hey, to each their own!
By the way, Andy also has a handful of courses on Udemy.
Paul McFedries is an author, serial technical writer, and trainer who specializes in Windows, web development, and programming. His books have sold over four million copies, and he has written over 90 titles for Microsoft Press, Wiley, and other publishers.
In Web Design Playground, Paul McFedries takes the reader on a journey into HTML and CSS, and the book is full of interactive exercises and full-color illustrations to help the reader learn — covering basics like creating web pages, to more advanced topics like styling with CSS and deploying web pages.
In Responsive Web Design with HTML 5 & CSS, author Jessica Minnick (an IT instructor at Pasco-Hernando State College in New Port Richey, Florida) makes a thorough review of responsive web design best practices, as well as covering HTML5 and CSS3, thus providing a comprehensive introduction to web development.
This is one of the books in the Shelly Cashman series, and it’s on its ninth edition! It’s also the #6 best seller in CSS category on Amazon. So it’s a pretty safe bet.
Veteran designer and co-founder of the Artifact Conference, Jennifer Robbins hits us with a #1 best seller in XHTML, #2 in JavaScript, and #2 in CSS.
Learning Web Design is an amazingly comprehensive, full-color book that covers HTML5, CSS3, web graphics, and JavaScript.
New in the fifth edition:
The companion website includes exercise materials for working along with the book, supplemental articles for further reading, links with resources listed on the book, and even instructor support!
If you want to look for a quick reference guide that covers the very basics of HTML5 in the most concise way, the HTML5 Pamphlet, by BarCharts, is the most rated resource in this list (4.6/5, 375 ratings).
Part of the Quick Study Computer series (which offers over 60 charts), this fold-out guide provides an overview of HTML5, including the major elements, attributes, and the newest HTML5 features. It also provides brief yet accurate context for the HTML5 document structure.
SitePoint Premium gives you access to the SitePoint Library, with a whole section dedicated to HTML and CSS books and courses, including HTML5 & CSS3 for the Real World: Second Edition, by Alexis Goldstein, Louis Lazaris, and Estelle Weyl.
There’s also the following series by Jens Oliver Meiert:
As the building block of the modern Web, it’s important to understand HTML’s syntax and how it works in order to create the best possible user experience. With the basics of HTML under your belt, you can start to create web pages, and then go on to build upon that foundation with other technologies.
Hopefully this list of HTML books will help you get started, and help you along the road to your design and development goals. 💻🕸
Original article source at: https://www.sitepoint.com/
We understand the struggles you're going through. You've spent hundreds of dollars making a professional video for your top-selling product. The video portrays your product expertly, and you already know it has the components you need to boost clicks, conversions, and purchases. Here, an Amazon consulting service provider will guide you properly.
But it is consistently turned down. Unsure why your Sponsored Brand Video Ad is failing Amazon's verification procedure? We've outlined every cause for rejection, as well as best practices for making videos that make viewers want to stop scrolling. It is also advised that you enlist the assistance of an Amazon consulting services provider.
#1: Incorporating External Links Or Personal Data
It is completely forbidden to use any web links, URLs, or CTAs that send customers to websites other than Amazon.com. This also includes promoting your website or social media account. There should be no place in the videos where private information like a phone number, email address, or home address is provided.
#2: Including CTAs With Restrictions
Only include calls to action that Amazon has given the go-ahead for. Do not request that customers take actions that Amazon does not support, for example asking them to follow, subscribe, comment, enable notifications, or look at video descriptions.
#3: Emphasizing Information About Prices And Promotions
You cannot include prices, promotions (such as buy one get one free), discount claims (such as 10% off if you use a coupon), or information that is time-sensitive in your videos, just like in A+ Content and Brand Story. This also applies to shipment details or guarantees like “free shipping,” “superfast shipping,” etc.
#4: Requesting Client Feedback
You are not allowed to request customer reviews in your video. Videos are meant to inform viewers about the qualities and advantages of your product; hence Amazon explicitly forbids requesting feedback in them. Similarly, your video cannot use customer reviews (including star ratings), even if they come from Amazon.
#5: Utilizing Amazon’s Original Creative Works
Unless you have Amazon's specific written consent, you should not use any Amazon-related logos or icons in your video. The Amazon arrow icon is frequently used by vendors to provide information about their packaging (pack of 2, combo pack). It is completely forbidden to include these kinds of icons in videos, listing images, or A+ Content.
#6: Excessive Competition
It is unacceptable to demonstrate how much better than the competition your product is. “Product X is considerably better than Product Y, which is a piece of crap,” as an illustration.
#7: Poor Copying Mistakes
According to Amazon consulting service providers, your video may be rejected if the title, description, or copy contains spelling or grammar issues, too much or the wrong kind of punctuation, or inconsistent capitalization.
#8: Obstructing The Mute Button
The mute button sits in the bottom right corner of the screen, which Amazon regards as a "not safe" location. The mute button must not be covered by or surrounded by any creative features, such as text blocks, icons, or brand logos.
#9: Unsupported Language
Your video must use the predominant language of the marketplace where the advertisement will be displayed. If you are selling in the USA, your video cannot contain text in Chinese, Spanish, or any other language. It is permitted to use terminology that has entered the English dictionary or that is widely used. You can get the best advice by working with an experienced Amazon marketing consultant.
#10: Incorrect Medical Claims
Avoid making untrue or overly boastful health or medical claims. For instance, "Take our vitamin and reduce weight in 2 weeks." Do not assert or imply that the product can treat or prevent disease. For instance, "Can help reduce stress associated with ADHD," or "Can help with the look of acne scarring."
Conclusion
Follow Tech2Globe to learn more about this subject. Our Amazon consulting services experts are always ready to offer our clients the best support.
A tool for finding distinguishing terms in corpora and displaying them in an interactive HTML scatter plot. Points corresponding to terms are selectively labeled so that they don't overlap with other labels or points.
Below is an example of using Scattertext to visualize terms used in the 2012 American political conventions. The 2,000 most party-associated unigrams are displayed as points in the scatter plot. Their x- and y-axes are the dense ranks of their usage by Republican and Democratic speakers, respectively.
import scattertext as st
df = st.SampleCorpora.ConventionData2012.get_data().assign(
parse=lambda df: df.text.apply(st.whitespace_nlp_with_sentences)
)
corpus = st.CorpusFromParsedDocuments(
df, category_col='party', parsed_col='parse'
).build().get_unigram_corpus().compact(st.AssociationCompactor(2000))
html = st.produce_scattertext_explorer(
corpus,
category='democrat', category_name='Democratic', not_category_name='Republican',
minimum_term_frequency=0, pmi_threshold_coefficient=0,
width_in_pixels=1000, metadata=corpus.get_df()['speaker'],
transform=st.Scalers.dense_rank
)
open('./demo_compact.html', 'w').write(html)
The HTML file written would look like the image below. Click on it for the actual interactive visualization.
Jason S. Kessler. Scattertext: a Browser-Based Tool for Visualizing how Corpora Differ. ACL System Demonstrations. 2017. Link to paper: arxiv.org/abs/1703.00565
@inproceedings{kessler2017scattertext,
author = {Kessler, Jason S.},
title = {Scattertext: a Browser-Based Tool for Visualizing how Corpora Differ},
booktitle = {Proceedings of ACL-2017 System Demonstrations},
year = {2017},
address = {Vancouver, Canada},
publisher = {Association for Computational Linguistics},
}
Install Python 3.4 or higher and run:
$ pip install scattertext
If you cannot (or don't want to) install spaCy, substitute nlp = spacy.load('en') lines with nlp = scattertext.WhitespaceNLP.whitespace_nlp. Note, this is not compatible with word_similarity_explorer, and the tokenization and sentence boundary detection capabilities will be low-performance regular expressions. See demo_without_spacy.py for an example.
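A minimal sketch of that substitution, reusing the sample convention data and the CorpusFromPandas flow shown later in this README:
import scattertext as st
from scattertext.WhitespaceNLP import whitespace_nlp

convention_df = st.SampleCorpora.ConventionData2012.get_data()

# Build the corpus with the regex-based tokenizer in place of a spaCy model
corpus = st.CorpusFromPandas(convention_df,
                             category_col='party',
                             text_col='text',
                             nlp=whitespace_nlp).build()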
It is recommended you install jieba, spacy, empath, astropy, flashtext, gensim and umap-learn in order to take full advantage of Scattertext.
Scattertext should mostly work with Python 2.7, but it may not.
The HTML outputs look best in Chrome and Safari.
The name of this project is Scattertext. "Scattertext" is written as a single word and should be capitalized. When used in Python, the package scattertext should be imported under the name st, i.e., import scattertext as st.
This is a tool that's intended for visualizing what words and phrases are more characteristic of a category than others.
Consider the example at the top of the page.
Looking at this may seem overwhelming. In fact, it's a relatively simple visualization of word use during the 2012 political conventions. Each dot corresponds to a word or phrase mentioned by Republicans or Democrats during their conventions. The closer a dot is to the top of the plot, the more frequently it was used by Democrats. The further right a dot, the more that word or phrase was used by Republicans. Words frequently used by both parties, like "of" and "the" and even "Mitt", tend to occur in the upper-right-hand corner. Although very low frequency words have been hidden to preserve computing resources, a word that neither party used, like "giraffe", would be in the bottom-left-hand corner.
The interesting things happen close to the upper-left and lower-right corners. In the upper-left corner, words like "auto" (as in auto bailout) and "millionaires" are frequently used by Democrats but infrequently or never used by Republicans. Likewise, terms frequently used by Republicans and infrequently by Democrats occupy the bottom-right corner. These include "big government" and "olympics", referring to the Salt Lake City Olympics in which Gov. Romney was involved.
Terms are colored by their association. Those that are more associated with Democrats are blue, and those more associated with Republicans red.
Terms that are most characteristic of both sets of documents are displayed on the far right of the visualization.
The inspiration for this visualization came from Dataclysm (Rudder, 2014).
Scattertext is designed to help you build these graphs and efficiently label points on them.
The documentation (including this readme) is a work in progress. Please see the tutorial below as well as the PyData 2017 Tutorial.
Poking around the code and tests should give you a good idea of how things work.
The library covers some novel and effective term-importance formulas, including Scaled F-Score.
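For instance, once a corpus has been built (as in the tutorial further down this README), the Scaled F-Score of every term for a chosen category can be read straight off the corpus object:
term_freq_df = corpus.get_term_freq_df()
term_freq_df['Democratic Score'] = corpus.get_scaled_f_scores('democrat')
print(term_freq_df.sort_values(by='Democratic Score', ascending=False).index[:10])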
New in Scattertext 0.1.0, one can use a dataframe for term/metadata positions and other term-specific data. We can also use it to determine term-specific information which is shown after a term is clicked.
Note that it is possible to disable the use of document categories in Scattertext, as we shall see in this example.
This example covers plotting term dispersion against word frequency and identifying the terms which are most and least dispersed given their frequencies. Under Rosengren's S dispersion measure (Gries 2021), terms tend to increase in their dispersion scores as they get more frequent. We'll see how we can both plot this effect and factor out the effect of frequency.
This, along with a number of other dispersion metrics presented in Gries (2021), is available and documented in the Dispersion class, which we'll use later in this section.
Let's start by creating a Convention corpus, but we'll use the CorpusWithoutCategoriesFromParsedDocuments factory to ensure that no categories are included in the corpus. If we try to find document categories, we'll see that all documents have the category '_'.
import scattertext as st
df = st.SampleCorpora.ConventionData2012.get_data().assign(
parse=lambda df: df.text.apply(st.whitespace_nlp_with_sentences))
corpus = st.CorpusWithoutCategoriesFromParsedDocuments(
df, parsed_col='parse'
).build().get_unigram_corpus().remove_infrequent_words(minimum_term_count=6)
corpus.get_categories()
# Returns ['_']
Next, we'll create a dataframe for all terms we'll plot. We'll just start by creating a dataframe where we capture the frequency of each term and various dispersion metrics. These will be shown after a term is activated in the plot.
dispersion = st.Dispersion(corpus)
dispersion_df = dispersion.get_df()
dispersion_df.head(3)
Which returns
Frequency Range SD VC Juilland's D Rosengren's S DP DP norm KL-divergence
thank 363 134 3.108113 1.618274 0.707416 0.694898 0.391548 0.391560 0.748808
you 1630 177 12.383708 1.435902 0.888596 0.898805 0.233627 0.233635 0.263337
so 549 155 3.523380 1.212967 0.774299 0.822244 0.283151 0.283160 0.411750
These are discussed in detail in Gries 2021.
We'll use Rosengren's S to find the dispersion of each term. It's a metric designed for corpus parts (convention speeches in our case) of varying length. Here, n is the number of documents in the corpus, s_i is the percentage of the corpus's tokens found in document i, v_i is the term's count in document i, and f is the term's total count in the corpus.
Rosengren's S: $S = \frac{\left(\sum_{i=1}^{n} \sqrt{s_i \, v_i}\right)^2}{f}$
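As a minimal sketch of the computation (assuming a plain document-term count matrix rather than a Scattertext corpus; the Dispersion class above computes this for you):
import numpy as np

def rosengrens_s(doc_term_counts):
    # doc_term_counts: array of shape (n_documents, n_terms) holding raw counts
    s = doc_term_counts.sum(axis=1) / doc_term_counts.sum()  # s_i: share of all corpus tokens in document i
    v = doc_term_counts                                      # v_i: count of each term in document i
    f = doc_term_counts.sum(axis=0)                          # f: total count of each term in the corpus
    return (np.sqrt(s[:, None] * v).sum(axis=0) ** 2) / f    # one S value per term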
In order to start plotting, we'll need to add coordinates for each term to the data frame.
To use the dataframe_scattertext function, you need, at a minimum, a dataframe with 'X' and 'Y' columns.
The Xpos and Ypos columns indicate the positions of the original X and Y values on the scatterplot, and need to be between 0 and 1. Functions in st.Scalers perform this scaling. Absent Xpos or Ypos, st.Scalers.scale would be used.
Here is a sample of the available scaling functions:
- st.Scalers.scale(vec): rescales the vector so that the minimum value is 0 and the maximum is 1.
- st.Scalers.log_scale(vec): rescales the log of the vector.
- st.Scalers.dense_rank(vec): rescales the dense rank of the vector.
- st.Scalers.scale_center_zero_abs(vec): rescales a vector with both positive and negative values such that the 0 value in the original vector is plotted at 0.5, negative values are projected from [-argmax(abs(vec)), 0] to [0, 0.5], and positive values are projected from [0, argmax(abs(vec))] to [0.5, 1].
dispersion_df = dispersion_df.assign(
X=lambda df: df.Frequency,
Xpos=lambda df: st.Scalers.log_scale(df.X),
Y=lambda df: df["Rosengren's S"],
Ypos=lambda df: st.Scalers.scale(df.Y),
)
Note that the Ypos column here is not necessary since Y would automatically be scaled. Finally, since we are not distinguishing between categories, we can set ignore_categories=True.
We can now plot this graph using the dataframe_scattertext function:
html = st.dataframe_scattertext(
corpus,
plot_df=dispersion_df,
metadata=corpus.get_df()['speaker'] + ' (' + corpus.get_df()['party'].str.upper() + ')',
ignore_categories=True,
x_label='Log Frequency',
y_label="Rosengren's S",
y_axis_labels=['Less Dispersion', 'Medium', 'More Dispersion'],
)
Which yields (click for an interactive version):
Note that we can see various dispersion statistics under a term's name, in addition to the standard usage statistics. To customize the statistics which are displayed, set the term_description_column=[...] parameter with a list of column names to be displayed.
One issue with this dispersion chart, which tends to be common to dispersion metrics in general, is that dispersion and frequency tend to have a high correlation, but with a complex, non-linear curve. Depending on the metric, this correlation curve could be a power law, linear, sigmoidal, or, typically, something else.
In order to factor out this correlation, we can predict the dispersion from frequency using a non-parametric regressor, and see which terms have the highest and lowest residuals with respect to their expected dispersions based on their frequencies.
In this case, we'll use a KNN regressor with 10 neighbors to predict Rosengren's S from term frequencies (dispersion_df.X and .Y, respectively), and compute the residual.
We'll use the residual to color points, with a neutral color for residuals around 0 and other colors for positive and negative values. We'll add a column in the data frame for point colors, and call it ColorScore. It is populated with values between 0 and 1, with 0.5 as a neutral color on the d3 interpolateWarm color scale. We use st.Scalers.scale_center_zero_abs, discussed above, to make this transformation.
from sklearn.neighbors import KNeighborsRegressor
dispersion_df = dispersion_df.assign(
Expected=lambda df: KNeighborsRegressor(n_neighbors=10).fit(
df.X.values.reshape(-1, 1), df.Y
).predict(df.X.values.reshape(-1, 1)),
Residual=lambda df: df.Y - df.Expected,
ColorScore=lambda df: st.Scalers.scale_center_zero_abs(df.Residual)
)
Now we are ready to plot our colored dispersion chart. We assign the ColorScore column name to the color_score_column parameter in dataframe_scattertext.
Additionally, we'd like to populate the two term lists on the left with terms that have high and low residual values, indicating terms which have the most dispersion relative to their frequency-expected level and the least. We can do this with the left_list_column parameter. We can specify the upper and lower term list names using the header_names parameter. Finally, we can spiff up the plot by adding an appealing background color.
html = st.dataframe_scattertext(
corpus,
plot_df=dispersion_df,
metadata=corpus.get_df()['speaker'] + ' (' + corpus.get_df()['party'].str.upper() + ')',
ignore_categories=True,
x_label='Log Frequency',
y_label="Rosengren's S",
y_axis_labels=['Less Dispersion', 'Medium', 'More Dispersion'],
color_score_column='ColorScore',
header_names={'upper': 'Lower than Expected', 'lower': 'More than Expected'},
left_list_column='Residual',
background_color='#e5e5e3'
)
Which yields (click for an interactive version):
While you should learn Python to fully use Scattertext, I've put some of the basic functionality in a command-line tool. The tool is installed when you follow the procedure laid out above.
Run $ scattertext --help from the command line to see the full usage information. Here's a quick example of how to use vanilla Scattertext on a CSV file. The file needs to have at least two columns, one containing the text to be analyzed, and another containing the category. In the example CSV below, the columns are text and party, respectively.
The example below processes the CSV file and writes the resulting HTML visualization to cli_demo.html.
Note that the parameter --minimum_term_frequency=8 omits terms that occur fewer than 8 times, and --regex_parser indicates that a simple regular expression parser should be used in place of spaCy. The flag --one_use_per_doc indicates that term frequency should be calculated by counting no more than one occurrence of a term in a document.
If you'd like to parse non-English text, you can use the --spacy_language_model argument to configure which spaCy language model the tool will use. The default is 'en' and you can see the others available at https://spacy.io/docs/api/language-models.
$ curl -s https://cdn.rawgit.com/JasonKessler/scattertext/master/scattertext/data/political_data.csv | head -2
party,speaker,text
democrat,BARACK OBAMA,"Thank you. Thank you. Thank you. Thank you so much.Thank you.Thank you so much. Thank you. Thank you very much, everybody. Thank you.
$
$ scattertext --datafile=https://cdn.rawgit.com/JasonKessler/scattertext/master/scattertext/data/political_data.csv \
> --text_column=text --category_column=party --metadata_column=speaker --positive_category=democrat \
> --category_display_name=Democratic --not_category_display_name=Republican --minimum_term_frequency=8 \
> --one_use_per_doc --regex_parser --outputfile=cli_demo.html
The following code creates a stand-alone HTML file that analyzes words used by Democrats and Republicans in the 2012 party conventions, and outputs some notable term associations.
First, import Scattertext and spaCy.
>>> import scattertext as st
>>> import spacy
>>> from pprint import pprint
Next, assemble the data you want to analyze into a Pandas data frame. It should have at least two columns, the text you'd like to analyze, and the category you'd like to study. Here, the text column contains convention speeches while the party column contains the party of the speaker. We'll eventually use the speaker column to label snippets in the visualization.
>>> convention_df = st.SampleCorpora.ConventionData2012.get_data()
>>> convention_df.iloc[0]
party democrat
speaker BARACK OBAMA
text Thank you. Thank you. Thank you. Thank you so ...
Name: 0, dtype: object
Turn the data frame into a Scattertext Corpus to begin analyzing it. To look for differences in parties, set the category_col parameter to 'party', and use the speeches, present in the text column, as the texts to analyze by setting the text_col parameter. Finally, pass a spaCy model in to the nlp argument and call build() to construct the corpus.
# Turn it into a Scattertext Corpus
>>> nlp = spacy.load('en')
>>> corpus = st.CorpusFromPandas(convention_df,
... category_col='party',
... text_col='text',
... nlp=nlp).build()
Let's see characteristic terms in the corpus, and terms that are most associated with Democrats and Republicans. See slides 52 to 59 of the Turning Unstructured Content into Kernels of Ideas talk for more details on these approaches.
Here are the terms that differentiate the corpus from a general English corpus.
>>> print(list(corpus.get_scaled_f_scores_vs_background().index[:10]))
['obama',
'romney',
'barack',
'mitt',
'obamacare',
'biden',
'romneys',
'hardworking',
'bailouts',
'autoworkers']
Here are the terms that are most associated with Democrats:
>>> term_freq_df = corpus.get_term_freq_df()
>>> term_freq_df['Democratic Score'] = corpus.get_scaled_f_scores('democrat')
>>> pprint(list(term_freq_df.sort_values(by='Democratic Score', ascending=False).index[:10]))
['auto',
'america forward',
'auto industry',
'insurance companies',
'pell',
'last week',
'pell grants',
"women 's",
'platform',
'millionaires']
And Republicans:
>>> term_freq_df['Republican Score'] = corpus.get_scaled_f_scores('republican')
>>> pprint(list(term_freq_df.sort_values(by='Republican Score', ascending=False).index[:10]))
['big government',
"n't build",
'mitt was',
'the constitution',
'he wanted',
'hands that',
'of mitt',
'16 trillion',
'turned around',
'in florida']
Now, let's write the scatter plot to a stand-alone HTML file. We'll make the y-axis category "democrat", and name the category "Democratic" with a capital "D" for presentation purposes. We'll name the other category "Republican" with a capital "R". All documents in the corpus without the category "democrat" will be considered Republican. We set the width of the visualization in pixels, and label each excerpt with the speaker using the metadata parameter. Finally, we write the visualization to an HTML file.
>>> html = st.produce_scattertext_explorer(corpus,
... category='democrat',
... category_name='Democratic',
... not_category_name='Republican',
... width_in_pixels=1000,
... metadata=convention_df['speaker'])
>>> open("Convention-Visualization.html", 'wb').write(html.encode('utf-8'))
Below is what the webpage looks like. Click it and wait a few minutes for the interactive version.
Scattertext can also be used to visualize the category association of a variety of different phrase types. The word "phrase" denotes any single or multi-word collocation.
PyTextRank, created by Paco Nathan, is an implementation of a modified version of the TextRank algorithm (Mihalcea and Tarau 2004). It uses a graph centrality algorithm to extract a scored list of the most prominent phrases in a document. Here, we use the named entities recognized by spaCy. As of spaCy version 2.2, these are from an NER system trained on OntoNotes 5.
Please install pytextrank ($ pip3 install pytextrank) before continuing with this tutorial.
To use it, build a corpus as normal, but make sure you use spaCy to parse each document, as opposed to a built-in whitespace_nlp-type tokenizer. Note that adding PyTextRank to the spaCy pipeline is not needed, as it will be run separately by the PyTextRankPhrases object. We'll reduce the number of phrases displayed in the chart to 2000 using the AssociationCompactor. The phrases generated will be treated like non-textual features since their document scores will not correspond to word counts.
import pytextrank, spacy
import scattertext as st
nlp = spacy.load('en')
nlp.add_pipe("textrank", last=True)
convention_df = st.SampleCorpora.ConventionData2012.get_data().assign(
parse=lambda df: df.text.apply(nlp),
party=lambda df: df.party.apply({'democrat': 'Democratic', 'republican': 'Republican'}.get)
)
corpus = st.CorpusFromParsedDocuments(
convention_df,
category_col='party',
parsed_col='parse',
feats_from_spacy_doc=st.PyTextRankPhrases()
).build(
).compact(
st.AssociationCompactor(2000, use_non_text_features=True)
)
Note that the terms present in the corpus are named entities, and, as opposed to frequency counts, their scores are the eigencentrality scores assigned to them by the TextRank algorithm. Running corpus.get_metadata_freq_df('') will return, for each category, the sums of terms' TextRank scores. The dense ranks of these scores will be used to construct the scatter plot.
term_category_scores = corpus.get_metadata_freq_df('')
print(term_category_scores)
'''
Democratic Republican
term
our future 1.113434 0.699103
your country 0.314057 0.000000
their home 0.385925 0.000000
our government 0.185483 0.462122
our workers 0.199704 0.210989
her family 0.540887 0.405552
our time 0.510930 0.410058
...
'''
Before we construct the plot, let's define some helper variables. Since the aggregate TextRank scores aren't particularly interpretable, we'll display the per-category rank of each score in the metadata_description field. These will be displayed after a term is clicked.
import numpy as np

term_ranks = np.argsort(np.argsort(-term_category_scores, axis=0), axis=0) + 1
metadata_descriptions = {
term: '<br/>' + '<br/>'.join(
'<b>%s</b> TextRank score rank: %s/%s' % (cat, term_ranks.loc[term, cat], corpus.get_num_metadata())
for cat in corpus.get_categories())
for term in corpus.get_metadata()
}
We can construct term scores in a couple of ways. One is a standard dense-rank difference, a score used in most of the two-category contrastive plots here, which will give us the most category-associated phrases. Another is to use the maximum category-specific score, which will give us the most prominent phrases in each category, regardless of their prominence in the other category. We'll take both approaches in this tutorial; let's compute the second kind of score, the category-specific prominence, below.
category_specific_prominence = term_category_scores.apply(
lambda r: r.Democratic if r.Democratic > r.Republican else -r.Republican,
axis=1
)
Now we're ready to output this chart. Note that we use a dense_rank transform, which places identically scaled phrases atop each other. We use category_specific_prominence as scores, and set sort_by_dist to False to ensure the phrases displayed on the right-hand side of the chart are ranked by the scores and not by distance to the upper-left or lower-right corners. Since matching phrases are treated as non-text features, we encode them as single-phrase topic models and set the topic_model_preview_size to 0 to indicate the topic model list shouldn't be shown. Finally, we ensure the full documents are displayed. Note the documents will be displayed in order of phrase-specific score.
html = st.produce_scattertext_explorer(
corpus,
category='Democratic',
not_category_name='Republican',
minimum_term_frequency=0,
pmi_threshold_coefficient=0,
width_in_pixels=1000,
transform=st.dense_rank,
metadata=corpus.get_df()['speaker'],
scores=category_specific_prominence,
sort_by_dist=False,
use_non_text_features=True,
topic_model_term_lists={term: [term] for term in corpus.get_metadata()},
topic_model_preview_size=0,
metadata_descriptions=metadata_descriptions,
use_full_doc=True
)
The most associated terms in each category make some sense, at least on a post hoc analysis. When referring to (then) Governor Romney, Democrats used his surname "Romney" in their most central mentions of him, while Republicans used the more familiar and humanizing "Mitt". As for President Obama, the phrase "Obama" didn't show up as a top term in either party, but the first name "Barack" was one of the most central phrases in Democratic speeches, mirroring "Mitt".
Alternatively, we can use the dense rank difference in scores to color phrase-points and determine the top phrases to be displayed on the right-hand side of the chart. Instead of setting scores
as category-specific prominence scores, we set term_scorer=RankDifference()
to inject a way of determining term scores into the scatter plot creation process.
html = produce_scattertext_explorer(
corpus,
category='Democratic',
not_category_name='Republican',
minimum_term_frequency=0,
pmi_threshold_coefficient=0,
width_in_pixels=1000,
transform=dense_rank,
use_non_text_features=True,
metadata=corpus.get_df()['speaker'],
term_scorer=RankDifference(),
sort_by_dist=False,
topic_model_term_lists={term: [term] for term in corpus.get_metadata()},
topic_model_preview_size=0,
metadata_descriptions=metadata_descriptions,
use_full_doc=True
)
Phrasemachine from AbeHandler (Handler et al. 2016) uses regular expressions over sequences of part-of-speech tags to identify noun phrases. This has the advantage over spaCy's NP-chunking in that it tends to isolate meaningful, large noun phrases which are free of appositives.
As opposed to PyTextRank, we'll just use counts of these phrases, treating them like any other term.
import spacy
from scattertext import SampleCorpora, PhraseMachinePhrases, dense_rank, RankDifference, AssociationCompactor, produce_scattertext_explorer
from scattertext.CorpusFromPandas import CorpusFromPandas
corpus = (CorpusFromPandas(SampleCorpora.ConventionData2012.get_data(),
category_col='party',
text_col='text',
feats_from_spacy_doc=PhraseMachinePhrases(),
nlp=spacy.load('en', parser=False))
.build().compact(AssociationCompactor(4000)))
html = produce_scattertext_explorer(corpus,
category='democrat',
category_name='Democratic',
not_category_name='Republican',
minimum_term_frequency=0,
pmi_threshold_coefficient=0,
transform=dense_rank,
metadata=corpus.get_df()['speaker'],
term_scorer=RankDifference(),
width_in_pixels=1000)
In order to visualize Empath (Fast et al., 2016) topics and categories instead of terms, we'll need to create a Corpus
of extracted topics and categories rather than unigrams and bigrams. To do so, use the FeatsFromOnlyEmpath
feature extractor. See the source code for examples of how to make your own.
When creating the visualization, pass the use_non_text_features=True
argument into produce_scattertext_explorer
. This will instruct it to use the labeled Empath topics and categories instead of looking for terms. Since the documents returned when a topic or category label is clicked will be in order of the document-level category-association strength, setting use_full_doc=True
makes sense, unless you have enormous documents. Otherwise, the first 300 characters will be shown.
(New in 0.0.26). Ensure you include topic_model_term_lists=feat_builder.get_top_model_term_lists()
in produce_scattertext_explorer
to ensure it bolds passages of snippets that match the topic model.
>>> feat_builder = st.FeatsFromOnlyEmpath()
>>> empath_corpus = st.CorpusFromParsedDocuments(convention_df,
... category_col='party',
... feats_from_spacy_doc=feat_builder,
... parsed_col='text').build()
>>> html = st.produce_scattertext_explorer(empath_corpus,
... category='democrat',
... category_name='Democratic',
... not_category_name='Republican',
... width_in_pixels=1000,
... metadata=convention_df['speaker'],
... use_non_text_features=True,
... use_full_doc=True,
... topic_model_term_lists=feat_builder.get_top_model_term_lists())
>>> open("Convention-Visualization-Empath.html", 'wb').write(html.encode('utf-8'))
Scattertext also includes a feature builder to explore the relationship between General Inquirer Tag Categories and document categories. We'll use a slightly different approach, looking at the relationship of GI Tag Categories to political parties by using the Z-scores of the Log-Odds-Ratio with Uninformative Dirichlet Priors (Monroe 2008). We'll use the produce_frequency_explorer
plot variation to visualize this relationship, setting the x-axis as the number of times a word in the tag category occurs, and the y-axis as the z-score.
For more information on the General Inquirer, please see the General Inquirer Home Page.
We'll use the same data set as before, except we'll use the FeatsFromGeneralInquirer
feature builder.
>>> general_inquirer_feature_builder = st.FeatsFromGeneralInquirer()
>>> corpus = st.CorpusFromPandas(convention_df,
... category_col='party',
... text_col='text',
... nlp=st.whitespace_nlp_with_sentences,
... feats_from_spacy_doc=general_inquirer_feature_builder).build()
Next, we'll call produce_frequency_explorer
in a similar way to how we called produce_scattertext_explorer
in the previous section. There are a few differences, however. First, we specify the LogOddsRatioUninformativeDirichletPrior
term scorer, which scores the relationships between the categories. The grey_threshold
indicates that points scoring between [-1.96, 1.96] (i.e., p > 0.05) should be colored gray. The argument metadata_descriptions=general_inquirer_feature_builder.get_definitions()
indicates that a dictionary mapping the tag name to a string definition is passed. When a tag is clicked, the definition in the dictionary will be shown below the plot, as shown in the image following the snippet.
>>> html = st.produce_frequency_explorer(corpus,
... category='democrat',
... category_name='Democratic',
... not_category_name='Republican',
... metadata=convention_df['speaker'],
... use_non_text_features=True,
... use_full_doc=True,
... term_scorer=st.LogOddsRatioUninformativeDirichletPrior(),
... grey_threshold=1.96,
... width_in_pixels=1000,
... topic_model_term_lists=general_inquirer_feature_builder.get_top_model_term_lists(),
... metadata_descriptions=general_inquirer_feature_builder.get_definitions())
Moral Foundations Theory proposes six psychological constructs as building blocks of moral thinking, as described in Graham et al. (2013). These foundations are, as described on moralfoundations.org: care/harm, fairness/cheating, loyalty/betrayal, authority/subversion, sanctity/degradation, and liberty/oppression. Please see the site for a more in-depth discussion of these foundations.
Frimer et al. (2019) created the Moral Foundations Dictionary 2.0, a lexicon of terms which invoke a moral foundation either as a virtue (favorable toward the foundation) or a vice (in opposition to the foundation).
This dictionary can be used in the same way as the General Inquirer. In this example, we plot the Cohen's d scores of foundation-word usage, relative to the frequencies with which words involving those foundations were invoked.
We can first load the corpus as normal, and use st.FeatsFromMoralFoundationsDictionary()
to extract features.
import scattertext as st
convention_df = st.SampleCorpora.ConventionData2012.get_data()
moral_foundations_feats = st.FeatsFromMoralFoundationsDictionary()
corpus = st.CorpusFromPandas(convention_df,
category_col='party',
text_col='text',
nlp=st.whitespace_nlp_with_sentences,
feats_from_spacy_doc=moral_foundations_feats).build()
Next, let's use the Cohen's d term scorer to analyze the corpus and output a data frame of Cohen's d association scores.
cohens_d_scorer = st.CohensD(corpus).use_metadata()
term_scorer = cohens_d_scorer.set_categories('democrat', ['republican'])
term_scorer.get_score_df()
Which yields the following data frame:
| | cohens_d | cohens_d_se | cohens_d_z | cohens_d_p | hedges_r | hedges_r_se | hedges_r_z | hedges_r_p | m1 | m2 | count1 | count2 | docs1 | docs2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| care.virtue | 0.662891 | 0.149425 | 4.43629 | 4.57621e-06 | 0.660257 | 0.159049 | 4.15129 | 1.65302e-05 | 0.195049 | 0.12164 | 760 | 379 | 115 | 54 |
| care.vice | 0.24435 | 0.146025 | 1.67335 | 0.0471292 | 0.243379 | 0.152654 | 1.59432 | 0.0554325 | 0.0580005 | 0.0428358 | 244 | 121 | 80 | 41 |
| fairness.virtue | 0.176794 | 0.145767 | 1.21286 | 0.112592 | 0.176092 | 0.152164 | 1.15725 | 0.123586 | 0.0502469 | 0.0403369 | 225 | 107 | 71 | 39 |
| fairness.vice | 0.0707162 | 0.145528 | 0.485928 | 0.313509 | 0.0704352 | 0.151711 | 0.464273 | 0.321226 | 0.00718627 | 0.00573227 | 32 | 14 | 21 | 10 |
| authority.virtue | -0.0187793 | 0.145486 | -0.12908 | 0.551353 | -0.0187047 | 0.15163 | -0.123357 | 0.549088 | 0.358192 | 0.361191 | 1281 | 788 | 122 | 66 |
| authority.vice | -0.0354164 | 0.145494 | -0.243422 | 0.596161 | -0.0352757 | 0.151646 | -0.232619 | 0.591971 | 0.00353465 | 0.00390602 | 20 | 14 | 14 | 10 |
| sanctity.virtue | -0.512145 | 0.147848 | -3.46399 | 0.999734 | -0.51011 | 0.156098 | -3.26788 | 0.999458 | 0.0587987 | 0.101677 | 265 | 309 | 74 | 48 |
| sanctity.vice | -0.108011 | 0.145589 | -0.74189 | 0.770923 | -0.107582 | 0.151826 | -0.708585 | 0.760709 | 0.00845048 | 0.0109339 | 35 | 28 | 23 | 20 |
| loyalty.virtue | -0.413696 | 0.147031 | -2.81367 | 0.997551 | -0.412052 | 0.154558 | -2.666 | 0.996162 | 0.259296 | 0.309776 | 1056 | 717 | 119 | 66 |
| loyalty.vice | -0.0854683 | 0.145549 | -0.587213 | 0.72147 | -0.0851287 | 0.151751 | -0.560978 | 0.712594 | 0.00124518 | 0.00197022 | 5 | 5 | 5 | 4 |
This data frame gives us Cohen's d scores (and their standard errors and z-scores), Hedge's r scores (ditto), the mean document-length-normalized topic usage per category (where the in-focus category is m1 [in this case, Democrats] and the out-of-focus category is m2), the raw number of words used for each topic (count1 and count2), and the number of documents in each category containing the topic (docs1 and docs2).
Note that Cohen's d is the difference of m1 and m2 divided by their pooled standard deviation.
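For reference, the textbook pooled-standard-deviation form of this statistic is $d = \frac{m_1 - m_2}{s_{\mbox{pooled}}}$, where $s_{\mbox{pooled}} = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}}$ and $n_1$, $n_2$, $s_1$, $s_2$ are the per-category document counts and standard deviations of the normalized topic usages. The exact estimator used here follows Shinichi and Cuthill (2017) and may include small-sample corrections, so treat this as a reference formula rather than the implementation.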
Now, let's plot the d-scores of foundations vs. their frequencies.
html = st.produce_frequency_explorer(
corpus,
category='democrat',
category_name='Democratic',
not_category_name='Republican',
metadata=convention_df['speaker'],
use_non_text_features=True,
use_full_doc=True,
term_scorer=st.CohensD(corpus).use_metadata(),
grey_threshold=0,
width_in_pixels=1000,
topic_model_term_lists=moral_foundations_feats.get_top_model_term_lists(),
metadata_descriptions=moral_foundations_feats.get_definitions()
)
Often the terms of most interest are ones that are characteristic of the corpus as a whole. These are terms which occur frequently in all sets of documents being studied, but are relatively infrequent compared to general term frequencies.
We can produce a plot with a characteristic score on the x-axis and class-association scores on the y-axis using the function produce_characteristic_explorer
.
Corpus characteristicness is the difference in dense term ranks between the words in all of the documents in the study and a general English-language frequency list. See this Talk on Term-Class Association Scores for a more thorough explanation.
import scattertext as st
corpus = (st.CorpusFromPandas(st.SampleCorpora.ConventionData2012.get_data(),
category_col='party',
text_col='text',
nlp=st.whitespace_nlp_with_sentences)
.build()
.get_unigram_corpus()
.compact(st.ClassPercentageCompactor(term_count=2,
term_ranker=st.OncePerDocFrequencyRanker)))
html = st.produce_characteristic_explorer(
corpus,
category='democrat',
category_name='Democratic',
not_category_name='Republican',
metadata=corpus.get_df()['speaker']
)
open('demo_characteristic_chart.html', 'wb').write(html.encode('utf-8'))
In addition to words, phrases, and topics, we can make each point correspond to a document. Let's first create a corpus object for the 2012 Conventions data set. This explanation follows demo_pca_documents.py.
import pandas as pd
from sklearn.feature_extraction.text import TfidfTransformer
import scattertext as st
from scipy.sparse.linalg import svds
convention_df = st.SampleCorpora.ConventionData2012.get_data()
convention_df['parse'] = convention_df['text'].apply(st.whitespace_nlp_with_sentences)
corpus = (st.CorpusFromParsedDocuments(convention_df,
category_col='party',
parsed_col='parse')
.build()
.get_stoplisted_unigram_corpus())
Next, let's add the document names as meta data in the corpus object. The add_doc_names_as_metadata
function takes an array of document names, and populates a new corpus' meta data with those names. If two documents have the same name, it appends a number (starting with 1) to the name.
corpus = corpus.add_doc_names_as_metadata(corpus.get_df()['speaker'])
Next, we find tf-idf scores for the corpus' term-document matrix, run sparse SVD, and add the first two dimensions of the left singular vectors to a projection data frame as the x- and y-coordinates, indexing it on the corpus' meta data, which corresponds to the document names.
embeddings = TfidfTransformer().fit_transform(corpus.get_term_doc_mat())
u, s, vt = svds(embeddings, k=3, maxiter=20000, which='LM')
projection = pd.DataFrame({'term': corpus.get_metadata(), 'x': u.T[0], 'y': u.T[1]}).set_index('term')
Finally, set scores as 1 for Democrats and 0 for Republicans, rendering Republican documents as red points and Democratic documents as blue. For more on the produce_pca_explorer
function, see Using SVD to visualize any kind of word embeddings.
category = 'democrat'
scores = (corpus.get_category_ids() == corpus.get_categories().index(category)).astype(int)
html = st.produce_pca_explorer(corpus,
category=category,
category_name='Democratic',
not_category_name='Republican',
metadata=convention_df['speaker'],
width_in_pixels=1000,
show_axes=False,
use_non_text_features=True,
use_full_doc=True,
projection=projection,
scores=scores,
show_top_terms=False)
Click for an interactive version
Cohen's d is a popular metric used to measure effect size. The definitions of Cohen's d and Hedge's r from (Shinichi and Cuthill 2017) are implemented in Scattertext.
>>> convention_df = st.SampleCorpora.ConventionData2012.get_data()
>>> corpus = (st.CorpusFromPandas(convention_df,
... category_col='party',
... text_col='text',
... nlp=st.whitespace_nlp_with_sentences)
... .build()
... .get_unigram_corpus())
We can create a term scorer object to examine the effect sizes and other metrics.
>>> term_scorer = st.CohensD(corpus).set_categories('democrat', ['republican'])
>>> term_scorer.get_score_df().sort_values(by='cohens_d', ascending=False).head()
cohens_d cohens_d_se cohens_d_z cohens_d_p hedges_r hedges_r_se hedges_r_z hedges_r_p m1 m2
obama 1.187378 0.024588 48.290444 0.000000e+00 1.187322 0.018419 64.461363 0.0 0.007778 0.002795
class 0.855859 0.020848 41.052045 0.000000e+00 0.855818 0.017227 49.677688 0.0 0.002222 0.000375
middle 0.826895 0.020553 40.232746 0.000000e+00 0.826857 0.017138 48.245626 0.0 0.002316 0.000400
president 0.820825 0.020492 40.056541 0.000000e+00 0.820786 0.017120 47.942661 0.0 0.010231 0.005369
barack 0.730624 0.019616 37.245725 6.213052e-304 0.730589 0.016862 43.327800 0.0 0.002547 0.000725
Our calculation of Cohen's d is not directly based on term counts. Rather, we divide each document's term counts by the total number of terms in the document before calculating the statistics. m1
and m2
are, respectively, the mean portions of words in speeches made by Democrats and Republicans that were the term in question. The effect size (cohens_d
) is the difference between these means divided by the pooled standard deviation. cohens_d_se
is the standard error of the statistic, while cohens_d_z
and cohens_d_p
are the Z-scores and p-values indicating the statistical significance of the effect. Corresponding columns are present for Hedge's r, an unbiased version of Cohen's d.
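As a rough sketch (not Scattertext's internal code) of the normalization described above, the per-document proportions can be computed by dividing each row of the term-document matrix by that document's total term count, assuming the corpus object built above:
import numpy as np
X = corpus.get_term_doc_mat()               # documents x terms, raw counts
doc_lengths = np.asarray(X.sum(axis=1))     # total number of terms in each document
proportions = X.multiply(1. / doc_lengths)  # each row now holds per-document term proportions
# m1 and m2 above are then the per-category means of these proportions for each term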
>>> st.produce_frequency_explorer(
corpus,
category='democrat',
category_name='Democratic',
not_category_name='Republican',
term_scorer=st.CohensD(corpus),
metadata=convention_df['speaker'],
grey_threshold=0
)
Click for an interactive version.
Bi-Normal Separation (BNS) (Forman, 2008) was added in version 0.1.8. A variation of BNS is used where $F^{-1}(tpr) - F^{-1}(fpr)$ is not taken as an absolute value but kept as a signed difference. This allows terms strongly indicative of true positives and of false positives to receive a high or a low score, respectively. Note that tpr and fpr are scaled to lie within $[\alpha, 1-\alpha]$, where $\alpha \in [0, 1]$. In Forman (2008) and earlier literature, $\alpha=0.0005$. In personal correspondence, Forman kindly suggested using $\frac{1.}{\mbox{minimum(positives, negatives)}}$; I have implemented this as $\alpha=\frac{1.}{\mbox{minimum documents in least frequent category}}$.
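The following is a minimal sketch of the formula described above (not the BNSScorer implementation), reading "scaled to lie within" as clamping the rates into $[\alpha, 1-\alpha]$; the function and parameter names are hypothetical, with term_pos_docs and term_neg_docs being a term's document counts in the positive and negative categories:
import numpy as np
from scipy.stats import norm
def bns_score(term_pos_docs, term_neg_docs, n_pos_docs, n_neg_docs):
    # alpha taken from the least frequent category, per the convention above
    alpha = 1. / min(n_pos_docs, n_neg_docs)
    # true and false positive rates, clamped away from 0 and 1
    tpr = np.clip(term_pos_docs / n_pos_docs, alpha, 1 - alpha)
    fpr = np.clip(term_neg_docs / n_neg_docs, alpha, 1 - alpha)
    # signed difference of inverse normal CDFs, kept as a difference rather than an absolute value
    return norm.ppf(tpr) - norm.ppf(fpr)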
corpus = (st.CorpusFromPandas(convention_df,
category_col='party',
text_col='text',
nlp=st.whitespace_nlp_with_sentences)
.build()
.get_unigram_corpus()
.remove_infrequent_words(3, term_ranker=st.OncePerDocFrequencyRanker))
term_scorer = (st.BNSScorer(corpus).set_categories('democrat'))
print(term_scorer.get_score_df().sort_values(by='democrat BNS'))
html = st.produce_frequency_explorer(
corpus,
category='democrat',
category_name='Democratic',
not_category_name='Republican',
scores=term_scorer.get_score_df()['democrat BNS'].reindex(corpus.get_terms()).values,
metadata=lambda c: c.get_df()['speaker'],
minimum_term_frequency=0,
grey_threshold=0,
y_label=f'Bi-normal Separation (alpha={term_scorer.alpha})'
)
BNS Scored terms using an algorithmically found alpha.
We can train a classifier to produce a prediction score for each document. Classifiers and regressors often use features beyond the ones represented in Scattertext, be they n-gram, topic, extra-linguistic, or neural.
We can use Scattertext to visualize the correlations between unigrams (or really any feature representation) and the document scores produced by a model.
In the following example, we train a linear SVM using unigram and bi-gram features on the entire convention data set, use the model to make a prediction on each document, and finally use Pearson's $r$ to correlate unigram features with each document's distance from the SVM decision boundary.
from sklearn.svm import LinearSVC
import scattertext as st
df = st.SampleCorpora.ConventionData2012.get_data().assign(
parse=lambda df: df.text.apply(st.whitespace_nlp_with_sentences)
)
corpus = st.CorpusFromParsedDocuments(
df, category_col='party', parsed_col='parse'
).build()
X = corpus.get_term_doc_mat()
y = corpus.get_category_ids()
clf = LinearSVC()
clf.fit(X=X, y=y==corpus.get_categories().index('democrat'))
doc_scores = clf.decision_function(X=X)
compactcorpus = corpus.get_unigram_corpus().compact(st.AssociationCompactor(2000))
plot_df = st.Correlations().set_correlation_type(
'pearsonr'
).get_correlation_df(
corpus=compactcorpus,
document_scores=doc_scores
).reindex(compactcorpus.get_terms()).assign(
X=lambda df: df.Frequency,
Y=lambda df: df['r'],
Xpos=lambda df: st.Scalers.dense_rank(df.X),
Ypos=lambda df: st.Scalers.scale_center_zero_abs(df.Y),
SuppressDisplay=False,
ColorScore=lambda df: df.Ypos,
)
html = st.dataframe_scattertext(
compactcorpus,
plot_df=plot_df,
category='democrat',
category_name='Democratic',
not_category_name='Republican',
width_in_pixels=1000,
metadata=lambda c: c.get_df()['speaker'],
unified_context=False,
ignore_categories=False,
color_score_column='ColorScore',
left_list_column='ColorScore',
y_label="Pearson r (correlation to SVM document score)",
x_label='Frequency Ranks',
header_names={'upper': 'Top Democratic',
'lower': 'Top Republican'},
)
Scattertext relies on a set of general-domain English word frequencies when computing unigram characteristic
scores. When running Scattertext on non-English data or in a specific domain, the quality of these scores will degrade.
Ensure that you are on Scattertext 0.1.6 or higher.
To remedy this, one can add a custom set of background scores to a Corpus-like object, using the Corpus.set_background_corpus
function. The function takes a pd.Series
object, indexed on terms with numeric count values.
By default, Scaled F-Score is used to rank how characteristic terms are.
The example below illustrates using Polish background word frequencies.
First, we produce a Series object mapping Polish words to their frequencies using a list from the https://github.com/oprogramador/most-common-words-by-language repo.
import pandas as pd
polish_word_frequencies = pd.read_csv(
'https://raw.githubusercontent.com/hermitdave/FrequencyWords/master/content/2016/pl/pl_50k.txt',
sep = ' ',
names = ['Word', 'Frequency']
).set_index('Word')['Frequency']
Note the composition of the Series
>>> polish_word_frequencies
Word
nie 5875385
to 4388099
się 3507076
w 2723767
na 2309765
Name: Frequency, dtype: int64
Next, we build a DataFrame, review_df
, consisting of documents which appear (to a non-Polish speaker) to be positive and negative hotel reviews from the https://klejbenchmark.com/tasks/ corpus (Kocoń, et al. 2019). Note this data is under a CC BY-NC-SA 4.0 license. The reviews are labeled as "__label__meta_plus_m" and "__label__meta_minus_m". We will use Scattertext to compare these positive and negative reviews.
# imports needed for this snippet and the ones that follow
import io
import re
from urllib.request import urlopen
from zipfile import ZipFile
import spacy
import scattertext as st
nlp = spacy.blank('pl')
nlp.add_pipe('sentencizer')
with ZipFile(io.BytesIO(urlopen(
'https://klejbenchmark.com/static/data/klej_polemo2.0-in.zip'
).read())) as zf:
review_df = pd.read_csv(zf.open('train.tsv'), sep='\t')[
lambda df: df.target.isin(['__label__meta_plus_m', '__label__meta_minus_m'])
].assign(
Parse = lambda df: df.sentence.apply(nlp)
)
Next, we wish to create a ParsedCorpus
object from review_df
. In preparation, we first assemble a list of Polish stopwords from the stopwords repository. We also create the not_a_word
regular expression to filter out terms which do not contain a letter.
polish_stopwords = {
stopword for stopword in
urlopen(
'https://raw.githubusercontent.com/bieli/stopwords/master/polish.stopwords.txt'
).read().decode('utf-8').split('\n')
if stopword.strip()
}
not_a_word = re.compile(r'^\W+$')
With these present, we can build a corpus from review_df
with the category being the binary "target" column. We reduce the term space to unigrams and then run filter_out
, which takes a function to determine if a term should be removed from the corpus. The function identifies terms which are in the Polish stoplist or which do not contain a letter. Finally, terms occurring fewer than 20 times in the corpus are removed.
We set the background frequency Series we created earlier as the background corpus.
corpus = st.CorpusFromParsedDocuments(
review_df,
category_col='target',
parsed_col='Parse'
).build(
).get_unigram_corpus(
).filter_out(
lambda term: term in polish_stopwords or not_a_word.match(term) is not None
).remove_infrequent_words(
minimum_term_count=20
).set_background_corpus(
polish_word_frequencies
)
Note that a minimum word count of 20 was chosen to ensure that only around 2,000 terms would be displayed.
>>> corpus.get_num_terms()
2023
Running get_term_and_background_counts
shows us total term counts in the corpus compared to background frequency counts. We limit this to only those terms which occur in the corpus.
>>> corpus.get_term_and_background_counts()[
... lambda df: df.corpus > 0
... ].sort_values(by='corpus', ascending=False)
background corpus
m 341583838.0 4819.0
hotelu 33108.0 1812.0
hotel 297974790.0 1651.0
doktor 154840.0 1534.0
polecam 0.0 1438.0
... ... ...
szoku 0.0 21.0
badaniem 0.0 21.0
balkonu 0.0 21.0
stopnia 0.0 21.0
wobec 0.0 21.0
Interestingly, the term "polecam" appears very frequently in the corpus but does not appear at all in the background corpus, making it highly characteristic. Judging from Google Translate, it appears to mean something related to "recommend".
We are now ready to display the plot.
html = st.produce_scattertext_explorer(
corpus,
category='__label__meta_plus_m',
category_name='Plus-M',
not_category_name='Minus-M',
minimum_term_frequency=1,
width_in_pixels=1000,
transform=st.Scalers.dense_rank
)
We can change the formula which is used to produce the Characteristic scores using the characteristic_scorer
parameter to produce_scattertext_explorer
.
It takes an instance of a descendant of the CharacteristicScorer
class. See DenseRankCharacteristicness.py for an example of how to make your own.
Here is an example of plotting with a modified characteristic scorer:
html = st.produce_scattertext_explorer(
corpus,
category='__label__meta_plus_m',
category_name='Plus-M',
not_category_name='Minus-M',
minimum_term_frequency=1,
transform=st.Scalers.dense_rank,
characteristic_scorer=st.DenseRankCharacteristicness(),
)
Note that numbers show up as more characteristic when using the Dense Rank Difference. It may be that they occur unusually frequently in this corpus, or perhaps the background word frequencies undercounted numbers.
Word productivity is one strategy for plotting word-based charts describing an uncategorized corpus.
Productivity is defined in Schumann (2016) as the entropy of the n-grams which contain a term. For the entropy computation, the probability of an n-gram with respect to the term whose productivity is being calculated is the frequency of the n-gram divided by the term's frequency.
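As a toy illustration of this entropy computation (hypothetical counts, not the library's implementation), consider a term with frequency 232 and three bigrams containing it:
import numpy as np
term_frequency = 232                  # hypothetical frequency of the term "thank"
ngram_counts = {'thank you': 180,     # hypothetical counts of n-grams containing the term
                'i thank': 40,
                'thank the': 12}
# probability of each n-gram with respect to the term, per the definition above
probs = np.array([count / term_frequency for count in ngram_counts.values()])
productivity = -np.sum(probs * np.log2(probs))  # entropy of this distribution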
Since productivity highly correlates with frequency, the recommended metric to plot is the dense rank difference between frequency and productivity.
The snippet below plots words in the convention corpus based on their log frequency and their productivity.
The function st.whole_corpus_productivity_scores
returns a DataFrame giving each word's productivity.
Productivity scores should be calculated on a Corpus
-like object which contains a complete set of unigrams and at least bigrams. This corpus should not be compacted before the productivity score calculation.
The terms with lower productivity have more limited usage (e.g., "thank" for "thank you", "united" for "united states"), while the terms with higher productivity occur in a wider variety of contexts ("getting", "actually", "political", etc.).
import spacy
import scattertext as st
corpus_no_cat = st.CorpusWithoutCategoriesFromParsedDocuments(
st.SampleCorpora.ConventionData2012.get_data().assign(
Parse=lambda df: [x for x in spacy.load('en_core_web_sm').pipe(df.text)]),
parsed_col='Parse'
).build()
compact_corpus_no_cat = corpus_no_cat.get_stoplisted_unigram_corpus().remove_infrequent_words(9)
plot_df = st.whole_corpus_productivity_scores(corpus_no_cat).assign(
RankDelta = lambda df: st.RankDifference().get_scores(
a=df.Productivity,
b=df.Frequency
)
).reindex(
compact_corpus_no_cat.get_terms()
).dropna().assign(
X=lambda df: df.Frequency,
Xpos=lambda df: st.Scalers.log_scale(df.Frequency),
Y=lambda df: df.RankDelta,
Ypos=lambda df: st.Scalers.scale(df.RankDelta),
)
html = st.dataframe_scattertext(
compact_corpus_no_cat.whitelist_terms(plot_df.index),
plot_df=plot_df,
metadata=lambda df: df.get_df()['speaker'],
ignore_categories=True,
x_label='Rank Frequency',
y_label="Productivity",
left_list_column='Ypos',
color_score_column='Ypos',
y_axis_labels=['Least Productive', 'Average Productivity', 'Most Productive'],
header_names={'upper': 'Most Productive', 'lower': 'Least Productive', 'right': 'Characteristic'},
horizontal_line_y_position=0
)
Let's now turn our attention to a novel term scoring metric, Scaled F-Score. We'll examine this on a unigram version of the Rotten Tomatoes corpus (Pang et al. 2002). It contains excerpts of positive and negative movie reviews.
Please see Scaled F Score Explanation for a notebook version of this analysis.
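The snippets below assume a corpus built from the bundled Rotten Tomatoes sample data, with 'fresh' and 'rotten' reviews mapped to 'Positive' and 'Negative' categories. The construction here is a sketch of that setup (mirroring the Rotten Tomatoes corpus construction used later in this document); rdf is assumed to be the same dataframe referenced as rdf['movie_name'] further down.
import scattertext as st
rdf = st.SampleCorpora.RottenTomatoes.get_data()
rdf.category = rdf.category.apply(
    lambda x: {'rotten': 'Negative', 'fresh': 'Positive', 'plot': 'Plot'}[x])
corpus = st.CorpusFromPandas(rdf,
                             category_col='category',
                             text_col='text',
                             nlp=st.whitespace_nlp_with_sentences).build()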
from scipy.stats import hmean
term_freq_df = corpus.get_unigram_corpus().get_term_freq_df()[['Positive freq', 'Negative freq']]
term_freq_df = term_freq_df[term_freq_df.sum(axis=1) > 0]
term_freq_df['pos_precision'] = (term_freq_df['Positive freq'] * 1./
(term_freq_df['Positive freq'] + term_freq_df['Negative freq']))
term_freq_df['pos_freq_pct'] = (term_freq_df['Positive freq'] * 1.
/term_freq_df['Positive freq'].sum())
term_freq_df['pos_hmean'] = (term_freq_df
.apply(lambda x: (hmean([x['pos_precision'], x['pos_freq_pct']])
if x['pos_precision'] > 0 and x['pos_freq_pct'] > 0
else 0), axis=1))
term_freq_df.sort_values(by='pos_hmean', ascending=False).iloc[:10]
If we plot term frequency on the x-axis and the percentage of a term's occurrences which are in positive documents (i.e., its precision) on the y-axis, we can see that low-frequency terms have a much higher variation in the precision. Given these terms have low frequencies, the harmonic means are low. Thus, the only terms which have a high harmonic mean are extremely frequent words which tend to all have near average precisions.
freq = term_freq_df.pos_freq_pct.values
prec = term_freq_df.pos_precision.values
html = st.produce_scattertext_explorer(
corpus.remove_terms(set(corpus.get_terms()) - set(term_freq_df.index)),
category='Positive',
not_category_name='Negative',
not_categories=['Negative'],
x_label = 'Portion of words used in positive reviews',
original_x = freq,
x_coords = (freq - freq.min())/freq.max(),
x_axis_values = [int(freq.min()*1000)/1000.,
int(freq.max() * 1000)/1000.],
y_label = 'Portion of documents containing word that are positive',
original_y = prec,
y_coords = (prec - prec.min())/prec.max(),
y_axis_values = [int(prec.min() * 1000)/1000.,
int((prec.max()/2.)*1000)/1000.,
int(prec.max() * 1000)/1000.],
scores = term_freq_df.pos_hmean.values,
sort_by_dist=False,
show_characteristic=False
)
file_name = 'not_normed_freq_prec.html'
open(file_name, 'wb').write(html.encode('utf-8'))
IFrame(src=file_name, width = 1300, height=700)
from scipy.stats import norm
def normcdf(x):
return norm.cdf(x, x.mean(), x.std())
term_freq_df['pos_precision_normcdf'] = normcdf(term_freq_df.pos_precision)
term_freq_df['pos_freq_pct_normcdf'] = normcdf(term_freq_df.pos_freq_pct.values)
term_freq_df['pos_scaled_f_score'] = hmean([term_freq_df['pos_precision_normcdf'], term_freq_df['pos_freq_pct_normcdf']])
term_freq_df.sort_values(by='pos_scaled_f_score', ascending=False).iloc[:10]
freq = term_freq_df.pos_freq_pct_normcdf.values
prec = term_freq_df.pos_precision_normcdf.values
html = st.produce_scattertext_explorer(
corpus.remove_terms(set(corpus.get_terms()) - set(term_freq_df.index)),
category='Positive',
not_category_name='Negative',
not_categories=['Negative'],
x_label = 'Portion of words used in positive reviews (norm-cdf)',
original_x = freq,
x_coords = (freq - freq.min())/freq.max(),
x_axis_values = [int(freq.min()*1000)/1000.,
int(freq.max() * 1000)/1000.],
y_label = 'documents containing word that are positive (norm-cdf)',
original_y = prec,
y_coords = (prec - prec.min())/prec.max(),
y_axis_values = [int(prec.min() * 1000)/1000.,
int((prec.max()/2.)*1000)/1000.,
int(prec.max() * 1000)/1000.],
scores = term_freq_df.pos_scaled_f_score.values,
sort_by_dist=False,
show_characteristic=False
)
term_freq_df['neg_precision_normcdf'] = normcdf((term_freq_df['Negative freq'] * 1./
(term_freq_df['Negative freq'] + term_freq_df['Positive freq'])))
term_freq_df['neg_freq_pct_normcdf'] = normcdf((term_freq_df['Negative freq'] * 1.
/term_freq_df['Negative freq'].sum()))
term_freq_df['neg_scaled_f_score'] = hmean([term_freq_df['neg_precision_normcdf'], term_freq_df['neg_freq_pct_normcdf']])
term_freq_df['scaled_f_score'] = 0
term_freq_df.loc[term_freq_df['pos_scaled_f_score'] > term_freq_df['neg_scaled_f_score'],
'scaled_f_score'] = term_freq_df['pos_scaled_f_score']
term_freq_df.loc[term_freq_df['pos_scaled_f_score'] < term_freq_df['neg_scaled_f_score'],
'scaled_f_score'] = 1-term_freq_df['neg_scaled_f_score']
term_freq_df['scaled_f_score'] = 2 * (term_freq_df['scaled_f_score'] - 0.5)
term_freq_df.sort_values(by='scaled_f_score', ascending=True).iloc[:10]
is_pos = term_freq_df.pos_scaled_f_score > term_freq_df.neg_scaled_f_score
freq = term_freq_df.pos_freq_pct_normcdf*is_pos - term_freq_df.neg_freq_pct_normcdf*~is_pos
prec = term_freq_df.pos_precision_normcdf*is_pos - term_freq_df.neg_precision_normcdf*~is_pos
def scale(ar):
return (ar - ar.min())/(ar.max() - ar.min())
def close_gap(ar):
ar[ar > 0] -= ar[ar > 0].min()
ar[ar < 0] -= ar[ar < 0].max()
return ar
html = st.produce_scattertext_explorer(
corpus.remove_terms(set(corpus.get_terms()) - set(term_freq_df.index)),
category='Positive',
not_category_name='Negative',
not_categories=['Negative'],
x_label = 'Frequency',
original_x = freq,
x_coords = scale(close_gap(freq)),
x_axis_labels = ['Frequent in Neg',
'Not Frequent',
'Frequent in Pos'],
y_label = 'Precision',
original_y = prec,
y_coords = scale(close_gap(prec)),
y_axis_labels = ['Neg Precise',
'Imprecise',
'Pos Precise'],
scores = (term_freq_df.scaled_f_score.values + 1)/2,
sort_by_dist=False,
show_characteristic=False
)
We can use st.ScaledFScorePresets
as a term scorer to display terms' Scaled F-Score on the y-axis and term frequencies on the x-axis.
html = st.produce_frequency_explorer(
corpus.remove_terms(set(corpus.get_terms()) - set(term_freq_df.index)),
category='Positive',
not_category_name='Negative',
not_categories=['Negative'],
term_scorer=st.ScaledFScorePresets(beta=1, one_to_neg_one=True),
metadata = rdf['movie_name'],
grey_threshold=0
)
Scaled F-Score is not the only scoring method included in Scattertext. Please click on one of the links below to view a notebook which describes how other class association scores work and can be visualized through Scattertext.
New in 0.0.2.73 is the delta JS-Divergence scorer DeltaJSDivergence
(Gallagher et al. 2020), and its corresponding compactor, JSDCompactor. See demo_deltajsd.py
for an example usage.
New in 0.0.2.72
Scattertext was originally set up to visualize corpora objects, which are connected sets of documents and terms. The "compaction" process allows users to eliminate terms which may not be associated with a category, using a variety of feature selection methods. The issue with this is that the terms eliminated during the selection process are not taken into account when scaling term positions.
This issue can be mitigated by using the position-select-plot process, where term positions are pre-determined before the selection process is made.
Let's first use the 2012 conventions corpus, update the category names, and create a unigram corpus.
import scattertext as st
import numpy as np
df = st.SampleCorpora.ConventionData2012.get_data().assign(
parse=lambda df: df.text.apply(st.whitespace_nlp_with_sentences)
).assign(party=lambda df: df['party'].apply({'democrat': 'Democratic', 'republican': 'Republican'}.get))
corpus = st.CorpusFromParsedDocuments(
df, category_col='party', parsed_col='parse'
).build().get_unigram_corpus()
category_name = 'Democratic'
not_category_name = 'Republican'
Next, let's create a dataframe consisting of the original counts and their log-scale positions.
def get_log_scale_df(corpus, y_category, x_category):
term_coord_df = corpus.get_term_freq_df('')
# Log scale term counts (with a smoothing constant) as the initial coordinates
coord_columns = []
for category in [y_category, x_category]:
col_name = category + '_coord'
term_coord_df[col_name] = np.log(term_coord_df[category] + 1e-6) / np.log(2)
coord_columns.append(col_name)
# Scale these coordinates to between 0 and 1
min_offset = term_coord_df[coord_columns].min(axis=0).min()
for coord_column in coord_columns:
term_coord_df[coord_column] -= min_offset
max_offset = term_coord_df[coord_columns].max(axis=0).max()
for coord_column in coord_columns:
term_coord_df[coord_column] /= max_offset
return term_coord_df
# Get term coordinates from original corpus
term_coordinates = get_log_scale_df(corpus, category_name, not_category_name)
print(term_coordinates)
Here is a preview of the term_coordinates
dataframe. The Democrat
and Republican
columns contain the term counts, while the _coord
columns contain their logged coordinates. Visualizing 7,973 terms is difficult (but possible) for people running Scattertext on most computers.
Democratic Republican Democratic_coord Republican_coord
term
thank 158 205 0.860166 0.872032
you 836 794 0.936078 0.933729
so 337 212 0.894681 0.873562
much 84 76 0.831380 0.826820
very 62 75 0.817543 0.826216
... ... ... ... ...
precinct 0 2 0.000000 0.661076
godspeed 0 1 0.000000 0.629493
beauty 0 1 0.000000 0.629493
bumper 0 1 0.000000 0.629493
sticker 0 1 0.000000 0.629493
[7973 rows x 4 columns]
We can visualize this full data set by running the following code block. We'll create a custom Javascript function to populate the tooltip with the original term counts, and create a Scattertext Explorer where the x and y coordinates and original values are specified from the data frame. Additionally, we can use show_diagonal=True
to draw a dashed diagonal line across the plot area.
You can click the chart below to see the interactive version. Note that it will take a while to load.
# The tooltip JS function. Note that d is the term data object, and ox and oy are the original x- and y-
# axis counts.
get_tooltip_content = ('(function(d) {return d.term + "<br/>' + not_category_name + ' Count: " ' +
'+ d.ox +"<br/>' + category_name + ' Count: " + d.oy})')
html_orig = st.produce_scattertext_explorer(
corpus,
category=category_name,
not_category_name=not_category_name,
minimum_term_frequency=0,
pmi_threshold_coefficient=0,
width_in_pixels=1000,
metadata=corpus.get_df()['speaker'],
show_diagonal=True,
original_y=term_coordinates[category_name],
original_x=term_coordinates[not_category_name],
x_coords=term_coordinates[category_name + '_coord'],
y_coords=term_coordinates[not_category_name + '_coord'],
max_overlapping=3,
use_global_scale=True,
get_tooltip_content=get_tooltip_content,
)
Next, we can visualize the compacted version of the corpus. The compaction, using ClassPercentageCompactor
, selects terms which appear frequently in each category. The term_count
parameter, set to 2, is used to determine the percentage threshold for terms to keep in a particular category. This is done by calculating the percentile of terms (types) in each category which appear more than two times. We find the smallest such percentile, and only include terms which occur above that percentile in a given category.
Note that this compaction leaves only 2,828 terms. This number is much easier for Scattertext to display in a browser.
# Select terms which appear a minimum threshold in both corpora
compact_corpus = corpus.compact(st.ClassPercentageCompactor(term_count=2))
# Only take term coordinates of terms remaining in corpus
term_coordinates = term_coordinates.loc[compact_corpus.get_terms()]
html_compact = st.produce_scattertext_explorer(
compact_corpus,
category=category_name,
not_category_name=not_category_name,
minimum_term_frequency=0,
pmi_threshold_coefficient=0,
width_in_pixels=1000,
metadata=corpus.get_df()['speaker'],
show_diagonal=True,
original_y=term_coordinates[category_name],
original_x=term_coordinates[not_category_name],
x_coords=term_coordinates[category_name + '_coord'],
y_coords=term_coordinates[not_category_name + '_coord'],
max_overlapping=3,
use_global_scale=True,
get_tooltip_content=get_tooltip_content,
)
Occasionally, only term frequency statistics are available. This may happen in the case of very large, lost, or proprietary data sets. TermCategoryFrequencies
is a corpus representation that can accept this sort of data, along with any categorized documents that happen to be available.
Let's use the Corpus of Contemporary American English as an example.
We'll construct a visualization to analyze the difference between spoken American English and English that occurs in fiction.
df = (pd.read_excel('https://www.wordfrequency.info/files/genres_sample.xls')
.dropna()
.set_index('lemma')[['SPOKEN', 'FICTION']]
.iloc[:1000])
df.head()
'''
SPOKEN FICTION
lemma
the 3859682.0 4092394.0
I 1346545.0 1382716.0
they 609735.0 352405.0
she 212920.0 798208.0
would 233766.0 229865.0
'''
Transforming this into a visualization is extremely easy. Just pass a dataframe indexed on terms, with columns indicating category counts, into the TermCategoryFrequencies
constructor.
term_cat_freq = st.TermCategoryFrequencies(df)
And call produce_scattertext_explorer
normally:
html = st.produce_scattertext_explorer(
term_cat_freq,
category='SPOKEN',
category_name='Spoken',
not_category_name='Fiction',
)
If you'd like to incorporate some documents into the visualization, you can add them to the TermCategoryFrequencies
object.
First, let's extract some example Fiction and Spoken documents from the sample COCA corpus.
import requests, zipfile, io
coca_sample_url = 'http://corpus.byu.edu/cocatext/samples/text.zip'
zip_file = zipfile.ZipFile(io.BytesIO(requests.get(coca_sample_url).content))
document_df = pd.DataFrame(
[{'text': zip_file.open(fn).read().decode('utf-8'),
'category': 'SPOKEN'}
for fn in zip_file.filelist if fn.filename.startswith('w_spok')][:2]
+ [{'text': zip_file.open(fn).read().decode('utf-8'),
'category': 'FICTION'}
for fn in zip_file.filelist if fn.filename.startswith('w_fic')][:2])
And we'll pass the document_df
dataframe into TermCategoryFrequencies
via the document_category_df
parameter. Ensure the dataframe has two columns, 'text' and 'category'. Afterward, we can call produce_scattertext_explorer
(or your visualization function of choice) normally.
doc_term_cat_freq = st.TermCategoryFrequencies(df, document_category_df=document_df)
html = st.produce_scattertext_explorer(
doc_term_cat_freq,
category='SPOKEN',
category_name='Spoken',
not_category_name='Fiction',
)
Word representations have recently become a hot topic in NLP. While lots of work has been done visualizing how terms relate to one another given their scores (e.g., http://projector.tensorflow.org/), none to my knowledge has been done visualizing how we can use these to examine how document categories differ.
In this example given a query term, "jobs", we can see how Republicans and Democrats talk about it differently.
In this configuration of Scattertext, words are colored by their similarity to a query phrase.
This is done using spaCy-provided GloVe word vectors (trained on the Common Crawl corpus). The cosine distance between vectors is used, with mean vectors used for phrases.
The calculation of the most similar terms associated with each category is a simple heuristic. First, sets of terms closely associated with a category are found. Second, these terms are ranked based on their similarity to the query, and the top-ranked terms are displayed to the right of the scatterplot.
A term is considered associated if its p-value is less than 0.05. P-values are determined using Monroe et al. (2008)'s difference in the weighted log-odds-ratios with an uninformative Dirichlet prior. This is the only model-based method discussed in Monroe et al. that does not rely on a large, in-domain background corpus. Since we are scoring bigrams in addition to the unigrams scored by Monroe, the size of the corpus would have to be larger to have high enough bigram counts for proper penalization. This function relies on the Dirichlet distribution's parameter alpha, a vector, which is uniformly set to 0.01.
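For reference, the z-score described above is, following Monroe et al. (2008), roughly $\hat{\zeta}_w = \hat{\delta}_w / \sqrt{\sigma^2(\hat{\delta}_w)}$, where $\hat{\delta}_w = \log\frac{y_w^i + \alpha_w}{n^i + \alpha_0 - y_w^i - \alpha_w} - \log\frac{y_w^j + \alpha_w}{n^j + \alpha_0 - y_w^j - \alpha_w}$ and $\sigma^2(\hat{\delta}_w) \approx \frac{1}{y_w^i + \alpha_w} + \frac{1}{y_w^j + \alpha_w}$. Here $y_w^i$ is the count of word $w$ in category $i$, $n^i$ is the total word count in category $i$, $\alpha_w$ is the prior parameter for word $w$ (uniformly 0.01, as noted above), and $\alpha_0 = \sum_w \alpha_w$. This is a paraphrase of Monroe et al.'s formulation rather than the exact code path in Scattertext.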
Here is the code to produce such a visualization.
>>> from scattertext import word_similarity_explorer
>>> html = word_similarity_explorer(corpus,
... category='democrat',
... category_name='Democratic',
... not_category_name='Republican',
... target_term='jobs',
... minimum_term_frequency=5,
... pmi_threshold_coefficient=4,
... width_in_pixels=1000,
... metadata=convention_df['speaker'],
... alpha=0.01,
... max_p_val=0.05,
... save_svg_button=True)
>>> open("Convention-Visualization-Jobs.html", 'wb').write(html.encode('utf-8'))
Scattertext can interface with Gensim Word2Vec models. For example, here's a snippet from demo_gensim_similarity.py
which illustrates how to train and use a word2vec model on a corpus. Note the similarities produced reflect quirks of the corpus, e.g., "8" tends to refer to the 8% unemployment rate at the time of the convention.
import spacy
from gensim.models import word2vec
from scattertext import SampleCorpora, word_similarity_explorer_gensim, Word2VecFromParsedCorpus
from scattertext.CorpusFromParsedDocuments import CorpusFromParsedDocuments
nlp = spacy.en.English()
convention_df = SampleCorpora.ConventionData2012.get_data()
convention_df['parsed'] = convention_df.text.apply(nlp)
corpus = CorpusFromParsedDocuments(convention_df, category_col='party', parsed_col='parsed').build()
model = word2vec.Word2Vec(size=300,
alpha=0.025,
window=5,
min_count=5,
max_vocab_size=None,
sample=0,
seed=1,
workers=1,
min_alpha=0.0001,
sg=1,
hs=1,
negative=0,
cbow_mean=0,
iter=1,
null_word=0,
trim_rule=None,
sorted_vocab=1)
html = word_similarity_explorer_gensim(corpus,
category='democrat',
category_name='Democratic',
not_category_name='Republican',
target_term='jobs',
minimum_term_frequency=5,
pmi_threshold_coefficient=4,
width_in_pixels=1000,
metadata=convention_df['speaker'],
word2vec=Word2VecFromParsedCorpus(corpus, model).train(),
max_p_val=0.05,
save_svg_button=True)
open('./demo_gensim_similarity.html', 'wb').write(html.encode('utf-8'))
How Democrats and Republicans talked differently about "jobs" in their 2012 convention speeches.
We can use Scattertext to visualize alternative types of word scores and ensure that zero scores are greyed out. Use the sparse_explorer
function to accomplish this, and see its source code for more details.
>>> from sklearn.linear_model import Lasso
>>> from scattertext import sparse_explorer
>>> html = sparse_explorer(corpus,
... category='democrat',
... category_name='Democratic',
... not_category_name='Republican',
... scores = corpus.get_regression_coefs('democrat', Lasso(max_iter=10000)),
... minimum_term_frequency=5,
... pmi_threshold_coefficient=4,
... width_in_pixels=1000,
... metadata=convention_df['speaker'])
>>> open('./Convention-Visualization-Sparse.html', 'wb').write(html.encode('utf-8'))
You can also use custom term positions and axis labels. For example, you can base terms' y-axis positions on a regression coefficient and their x-axis on term frequency and label the axes accordingly. The one catch is that axis positions must be scaled between 0 and 1.
First, let's define two scaling functions: scale
to project positive values to [0,1], and zero_centered_scale
to project real values to [0,1], with negative values always <0.5, and positive values always >0.5.
>>> def scale(ar):
... return (ar - ar.min()) / (ar.max() - ar.min())
...
>>> def zero_centered_scale(ar):
... ar[ar > 0] = scale(ar[ar > 0])
... ar[ar < 0] = -scale(-ar[ar < 0])
... return (ar + 1) / 2.
Next, let's compute and scale term frequencies and L2-penalized regression coefficients. We'll hang on to the original coefficients and allow users to view them by mousing over terms.
>>> from sklearn.linear_model import LogisticRegression
>>> import numpy as np
>>>
>>> frequencies_scaled = scale(np.log(term_freq_df.sum(axis=1).values))
>>> scores = corpus.get_logreg_coefs('democrat',
... LogisticRegression(penalty='l2', C=10, max_iter=10000, n_jobs=-1))
>>> scores_scaled = zero_centered_scale(scores)
Finally, we can write the visualization. Note the use of the x_coords
and y_coords
parameters to store the respective coordinates, the scores
and sort_by_dist
arguments to register the original coefficients and use them to rank the terms in the right-hand list, and the x_label
and y_label
arguments to label axes.
>>> html = produce_scattertext_explorer(corpus,
... category='democrat',
... category_name='Democratic',
... not_category_name='Republican',
... minimum_term_frequency=5,
... pmi_threshold_coefficient=4,
... width_in_pixels=1000,
... x_coords=frequencies_scaled,
... y_coords=scores_scaled,
... scores=scores,
... sort_by_dist=False,
... metadata=convention_df['speaker'],
... x_label='Log frequency',
... y_label='L2-penalized logistic regression coef')
>>> open('demo_custom_coordinates.html', 'wb').write(html.encode('utf-8'))
The Emoji analysis capability displays a chart of the category-specific distribution of Emoji. Let's look at a new corpus, a set of tweets. We'll build a visualization showing how men and women use emoji differently.
Note: the following example is implemented in demo_emoji.py
.
First, we'll load the dataset and parse it using NLTK's tweet tokenizer. Note, install NLTK before running this example. It will take some time for the dataset to download.
import nltk, urllib.request, io, agefromname, zipfile
import scattertext as st
import pandas as pd
with zipfile.ZipFile(io.BytesIO(urllib.request.urlopen(
'http://followthehashtag.com/content/uploads/USA-Geolocated-tweets-free-dataset-Followthehashtag.zip'
).read())) as zf:
df = pd.read_excel(zf.open('dashboard_x_usa_x_filter_nativeretweets.xlsx'))
nlp = st.tweet_tokenzier_factory(nltk.tokenize.TweetTokenizer())
df['parse'] = df['Tweet content'].apply(nlp)
df.iloc[0]
'''
Tweet Id 721318437075685382
Date 2016-04-16
Hour 12:44
User Name Bill Schulhoff
Nickname BillSchulhoff
Bio Husband,Dad,GrandDad,Ordained Minister, Umpire...
Tweet content Wind 3.2 mph NNE. Barometer 30.20 in, Rising s...
Favs NaN
RTs NaN
Latitude 40.7603
Longitude -72.9547
Country US
Place (as appears on Bio) East Patchogue, NY
Profile picture http://pbs.twimg.com/profile_images/3788000007...
Followers 386
Following 705
Listed 24
Tweet language (ISO 639-1) en
Tweet Url http://www.twitter.com/BillSchulhoff/status/72...
parse Wind 3.2 mph NNE. Barometer 30.20 in, Rising s...
Name: 0, dtype: object
'''
Next, we'll use the AgeFromName package to find the probability of each user's gender given their first name. First, we'll find a dataframe indexed on first names that contains the probability that someone with that first name is male (male_prob
).
male_prob = agefromname.AgeFromName().get_all_name_male_prob()
male_prob.iloc[0]
'''
hi 1.00000
lo 0.95741
prob 1.00000
Name: aaban, dtype: float64
'''
Next, we'll extract the first names of each user, and use the male_prob
data frame to find users whose names indicate there is at least a 90% chance they are either male or female, label those users, and create a new data frame df_mf
with only those users.
df['first_name'] = df['User Name'].apply(lambda x: x.split()[0].lower() if type(x) == str and len(x.split()) > 0 else x)
df_aug = pd.merge(df, male_prob, left_on='first_name', right_index=True)
df_aug['gender'] = df_aug['prob'].apply(lambda x: 'm' if x > 0.9 else 'f' if x < 0.1 else '?')
df_mf = df_aug[df_aug['gender'].isin(['m', 'f'])]
The key to this analysis is to construct a corpus using only the emoji extractor st.FeatsFromSpacyDocOnlyEmoji
which builds a corpus only from emoji and not from anything else.
corpus = st.CorpusFromParsedDocuments(
df_mf,
parsed_col='parse',
category_col='gender',
feats_from_spacy_doc=st.FeatsFromSpacyDocOnlyEmoji()
).build()
Next, we'll run this through a standard produce_scattertext_explorer
visualization generation.
html = st.produce_scattertext_explorer(
corpus,
category='f',
category_name='Female',
not_category_name='Male',
use_full_doc=True,
term_ranker=st.OncePerDocFrequencyRanker,
sort_by_dist=False,
metadata=(df_mf['User Name']
+ ' (@' + df_mf['Nickname'] + ') '
+ df_mf['Date'].astype(str)),
width_in_pixels=1000
)
open("EmojiGender.html", 'wb').write(html.encode('utf-8'))
SentencePiece tokenization is a subword tokenization technique which relies on a language model to produce an optimized tokenization. It has been used in large, transformer-based contextual language models.
Be sure to run $ pip install sentencepiece
before running this example.
First, let's load the political convention data set as normal.
import tempfile
import re
import scattertext as st
convention_df = st.SampleCorpora.ConventionData2012.get_data()
convention_df['parse'] = convention_df.text.apply(st.whitespace_nlp_with_sentences)
Next, let's train a SentencePiece tokenizer based on this data. The train_sentence_piece_tokenizer
function trains a SentencePieceProcessor on the data set and returns it. You can of course use any SentencePieceProcessor.
def train_sentence_piece_tokenizer(documents, vocab_size):
'''
:param documents: list-like, a list of str documents
:vocab_size int: the size of the vocabulary to output
:return sentencepiece.SentencePieceProcessor
'''
import sentencepiece as spm
sp = None
with tempfile.NamedTemporaryFile(delete=True) as tempf:
with tempfile.NamedTemporaryFile(delete=True) as tempm:
tempf.write(('\n'.join(documents)).encode())
spm.SentencePieceTrainer.Train(
'--input=%s --model_prefix=%s --vocab_size=%s' % (tempf.name, tempm.name, vocab_size)
)
sp = spm.SentencePieceProcessor()
sp.load(tempm.name + '.model')
return sp
sp = train_sentence_piece_tokenizer(convention_df.text.values, vocab_size=2000)
Next, let's add the SentencePiece tokens as metadata when creating our corpus. In order to do this, pass a FeatsFromSentencePiece
instance into the feats_from_spacy_doc
parameter. Pass the SentencePieceProcessor into the constructor.
corpus = st.CorpusFromParsedDocuments(convention_df,
parsed_col='parse',
category_col='party',
feats_from_spacy_doc=st.FeatsFromSentencePiece(sp)).build()
Now we can create the SentencePiece token scatter plot.
html = st.produce_scattertext_explorer(
corpus,
category='democrat',
category_name='Democratic',
not_category_name='Republican',
sort_by_dist=False,
metadata=convention_df['party'] + ': ' + convention_df['speaker'],
term_scorer=st.RankDifference(),
transform=st.Scalers.dense_rank,
use_non_text_features=True,
use_full_doc=True,
)
Suppose you'd like to audit or better understand weights or importances given to bag-of-words features by a classifier.
It's easy to do this with Scattertext if you use a Scikit-learn-style classifier.
For example, the Lightning package makes available high-performance linear classifiers which have Scikit-learn-compatible interfaces.
First, let's import sklearn
's text feature extraction classes, the 20 Newsgroup corpus, Lightning's Primal Coordinate Descent classifier, and Scattertext. We'll also fetch the training portion of the Newsgroup corpus.
from lightning.classification import CDClassifier
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import scattertext as st
newsgroups_train = fetch_20newsgroups(
subset='train',
remove=('headers', 'footers', 'quotes')
)
Next, we'll tokenize our corpus twice: once into tf-idf features which will be used to train the classifier, and another time into n-gram counts that will be used by Scattertext. It's important that both vectorizers share the same vocabulary, since we'll need to apply the weight vector from the model onto our Scattertext Corpus.
vectorizer = TfidfVectorizer()
tfidf_X = vectorizer.fit_transform(newsgroups_train.data)
count_vectorizer = CountVectorizer(vocabulary=vectorizer.vocabulary_)
Next, we use the CorpusFromScikit
factory to build a Scattertext Corpus object. Ensure the X
parameter is a document-by-feature matrix. The argument to the y
parameter is an array of class labels. Each label is an integer representing a different news group. The feature_vocabulary
is the vocabulary used by the vectorizers. The category_names
are a list of the 20 newsgroup names, which serve as class-label names. The raw_texts
is a list of the raw newsgroup texts.
corpus = st.CorpusFromScikit(
X=count_vectorizer.fit_transform(newsgroups_train.data),
y=newsgroups_train.target,
feature_vocabulary=vectorizer.vocabulary_,
category_names=newsgroups_train.target_names,
raw_texts=newsgroups_train.data
).build()
Now, we can train the model on tfidf_X
and the categorical response variable, and capture the feature weights for category 0 ("alt.atheism").
clf = CDClassifier(penalty="l1/l2",
loss="squared_hinge",
multiclass=True,
max_iter=20,
alpha=1e-4,
C=1.0 / tfidf_X.shape[0],
tol=1e-3)
clf.fit(tfidf_X, newsgroups_train.target)
term_scores = clf.coef_[0]
Finally, we can create a Scattertext plot. We'll use the Monroe-style visualization, and automatically select around 4000 terms that encompass the set of frequent terms, terms with high absolute scores, and terms that are characteristic of the corpus.
html = st.produce_frequency_explorer(
corpus,
'alt.atheism',
scores=term_scores,
use_term_significance=False,
terms_to_include=st.AutoTermSelector.get_selected_terms(corpus, term_scores, 4000),
metadata = ['/'.join(fn.split('/')[-2:]) for fn in newsgroups_train.filenames]
)
Let's take a look at the performance of the classifier:
from sklearn.metrics import f1_score
newsgroups_test = fetch_20newsgroups(subset='test',
remove=('headers', 'footers', 'quotes'))
X_test = vectorizer.transform(newsgroups_test.data)
pred = clf.predict(X_test)
f1 = f1_score(pred, newsgroups_test.target, average='micro')
print("Microaveraged F1 score", f1)
Microaveraged F1 score 0.662108337759. Not bad over a ~0.05 baseline.
Please see Signo for an introduction to semiotic squares.
Some variants of the semiotic square-creator can be seen in this notebook, which studies words and phrases in headlines that had low or high Facebook engagement and were published by either BuzzFeed or the New York Times: http://nbviewer.jupyter.org/github/JasonKessler/PuPPyTalk/blob/master/notebooks/Explore-Headlines.ipynb
The idea behind the semiotic square is to express the relationship between two opposing concepts and concepts within a larger domain of discourse. Examples of opposed concepts are life and death, male and female, or, in our example, positive and negative sentiment. Semiotic squares are composed of four "corners": the upper two corners are the opposing concepts, while the bottom corners are the negations of those concepts.
Circumscribing the negation of a concept involves finding everything in the domain of discourse that isn't associated with the concept. For example, in the life-death opposition, one can consider the universe of discourse to be all animate beings, real and hypothetical. The not-alive category will cover dead things, but also hypothetical entities like fictional characters or sentient AIs.
In building lexicalized semiotic squares, we consider concepts to be documents labeled in a corpus. Documents, in this setting, can belong to one of three categories: two labels corresponding to the opposing concepts, and a neutral category indicating that a document is in the same domain as the opposition but does not fall into either of the opposing categories.
In the example below positive and negative movie reviews are treated as the opposing categories, while plot descriptions of the same movies are treated as the neutral category.
Terms associated with one of the two opposing categories (relative only to the other) are listed as being associated with that category. Terms associated with a neutral category (e.g., not positive) are terms which are associated with the disjunction of the opposite category and the neutral category. For example, not-positive terms are those most associated with the set of negative reviews and plot descriptions vs. positive reviews.
Common terms among adjacent corners of the square are also listed.
An HTML-rendered square is accompanied by a scatter plot. Points on the plot are terms. The x-axis is the Z-score of the association with one of the opposed concepts. The y-axis is the Z-score of how associated a term is with the neutral set of documents relative to the opposed set. A point's red-blue color indicates the term's opposed-concept association, while the more desaturated a term is, the more it is associated with the neutral set of documents.
import scattertext as st
movie_df = st.SampleCorpora.RottenTomatoes.get_data()
movie_df.category = movie_df.category.apply\
(lambda x: {'rotten': 'Negative', 'fresh': 'Positive', 'plot': 'Plot'}[x])
corpus = st.CorpusFromPandas(
movie_df,
category_col='category',
text_col='text',
nlp=st.whitespace_nlp_with_sentences
).build().get_unigram_corpus()
semiotic_square = st.SemioticSquare(
corpus,
category_a='Positive',
category_b='Negative',
neutral_categories=['Plot'],
scorer=st.RankDifference(),
labels={'not_a_and_not_b': 'Plot Descriptions', 'a_and_b': 'Reviews'}
)
html = st.produce_semiotic_square_explorer(semiotic_square,
category_name='Positive',
not_category_name='Negative',
x_label='Fresh-Rotten',
y_label='Plot-Review',
neutral_category_name='Plot Description',
metadata=movie_df['movie_name'])
There are a number of other types of semiotic square construction functions.
A frequently requested feature of Scattertext has been the ability to visualize topic models. While this capability has existed in some forms (e.g., the Empath visualization), I've finally gotten around to implementing a concise API for such a visualization. There are three main ways to visualize topic models using Scattertext. The first is the simplest: manually entering topic models and visualizing them. The second uses a Scikit-Learn pipeline to produce the topic models for visualization. The third is a novel topic modeling technique, based on finding terms similar to a custom set of seed terms.
If you have already created a topic model, simply structure it as a dictionary. This dictionary is keyed on strings which serve as topic titles and are displayed in the main scatterplot. The values are lists of words that belong to that topic. The words in each topic list are bolded when they appear in a snippet.
Note that currently, there is no support for keyword scores.
For example, one might manually enter the following topic model to explore in the Convention corpus:
topic_model = {
'money': ['money','bank','banks','finances','financial','loan','dollars','income'],
'jobs':['jobs','workers','labor','employment','worker','employee','job'],
'patriotic':['america','country','flag','americans','patriotism','patriotic'],
'family':['mother','father','mom','dad','sister','brother','grandfather','grandmother','son','daughter']
}
We can use the FeatsFromTopicModel
class to transform this topic model into one which can be visualized using Scattertext. This is used just like any other feature builder, and we pass the topic model object into produce_scattertext_explorer
.
import scattertext as st
topic_feature_builder = st.FeatsFromTopicModel(topic_model)
topic_corpus = st.CorpusFromParsedDocuments(
convention_df,
category_col='party',
parsed_col='parse',
feats_from_spacy_doc=topic_feature_builder
).build()
html = st.produce_scattertext_explorer(
topic_corpus,
category='democrat',
category_name='Democratic',
not_category_name='Republican',
width_in_pixels=1000,
metadata=convention_df['speaker'],
use_non_text_features=True,
use_full_doc=True,
pmi_threshold_coefficient=0,
topic_model_term_lists=topic_feature_builder.get_top_model_term_lists()
)
Since topic modeling using document-level co-occurrence generally produces poor results, I've added a SentencesForTopicModeling
class which allows clustering by co-occurrence at the sentence level. It requires a ParsedCorpus
object to be passed to its constructor, and creates a term-sentence matrix internally.
Next, you can create a topic model dictionary like the one above by passing in a Scikit-Learn clustering or dimensionality reduction pipeline. The only constraint is the last transformer in the pipeline must populate a components_
attribute.
The num_terms_per_topic
parameter specifies how many terms should be included in each topic.
In the following example, we'll use NMF to cluster a stoplisted, unigram corpus of documents, and use the topic model dictionary to create a FeatsFromTopicModel
, just like before.
Note that in produce_scattertext_explorer
, we make the topic_model_preview_size
20 in order to show a preview of the first 20 terms in the topic in the snippet view as opposed to the default 10.
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
convention_df = st.SampleCorpora.ConventionData2012.get_data()
convention_df['parse'] = convention_df['text'].apply(st.whitespace_nlp_with_sentences)
unigram_corpus = (st.CorpusFromParsedDocuments(convention_df,
category_col='party',
parsed_col='parse')
.build().get_stoplisted_unigram_corpus())
topic_model = st.SentencesForTopicModeling(unigram_corpus).get_topics_from_model(
Pipeline([
('tfidf', TfidfTransformer(sublinear_tf=True)),
('nmf', (NMF(n_components=100, alpha=.1, l1_ratio=.5, random_state=0)))
]),
num_terms_per_topic=20
)
topic_feature_builder = st.FeatsFromTopicModel(topic_model)
topic_corpus = st.CorpusFromParsedDocuments(
convention_df,
category_col='party',
parsed_col='parse',
feats_from_spacy_doc=topic_feature_builder
).build()
html = st.produce_scattertext_explorer(
topic_corpus,
category='democrat',
category_name='Democratic',
not_category_name='Republican',
width_in_pixels=1000,
metadata=convention_df['speaker'],
use_non_text_features=True,
use_full_doc=True,
pmi_threshold_coefficient=0,
topic_model_term_lists=topic_feature_builder.get_top_model_term_lists(),
topic_model_preview_size=20
)
A surprisingly easy way to generate good topic models is to use a term scoring formula to find words that are associated with sentences where a seed word occurs vs. where one doesn't occur.
Given a custom term list, the SentencesForTopicModeling.get_topics_from_terms
will generate a series of topics. Note that the dense rank difference (RankDifference
) works particularly well for this task, and is the default parameter.
term_list = ['obama', 'romney', 'democrats', 'republicans', 'health', 'military', 'taxes',
'education', 'olympics', 'auto', 'iraq', 'iran', 'israel']
unigram_corpus = (st.CorpusFromParsedDocuments(convention_df,
category_col='party',
parsed_col='parse')
.build().get_stoplisted_unigram_corpus())
topic_model = (st.SentencesForTopicModeling(unigram_corpus)
.get_topics_from_terms(term_list,
scorer=st.RankDifference(),
num_terms_per_topic=20))
topic_feature_builder = st.FeatsFromTopicModel(topic_model)
# The remaining code is identical to the two examples above; a sketch follows.
# See demo_word_list_topic_model.py for the complete example.
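For reference, here is a minimal sketch of those remaining steps, mirroring the NMF example above (it assumes the convention_df and topic_feature_builder objects defined above):
topic_corpus = st.CorpusFromParsedDocuments(
    convention_df,
    category_col='party',
    parsed_col='parse',
    feats_from_spacy_doc=topic_feature_builder
).build()
html = st.produce_scattertext_explorer(
    topic_corpus,
    category='democrat',
    category_name='Democratic',
    not_category_name='Republican',
    width_in_pixels=1000,
    metadata=convention_df['speaker'],
    use_non_text_features=True,
    use_full_doc=True,
    pmi_threshold_coefficient=0,
    topic_model_term_lists=topic_feature_builder.get_top_model_term_lists()
)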
Scattertext makes it easy to create word-similarity plots using projections of word embeddings as the x and y-axes. In the example below, we create a stop-listed Corpus with only unigram terms. The produce_projection_explorer
function uses Gensim by default to create word embeddings and then projects them to two dimensions using Uniform Manifold Approximation and Projection (UMAP).
UMAP is chosen over T-SNE because it can employ the cosine similarity between two word vectors instead of just the euclidean distance.
convention_df = st.SampleCorpora.ConventionData2012.get_data()
convention_df['parse'] = convention_df['text'].apply(st.whitespace_nlp_with_sentences)
corpus = (st.CorpusFromParsedDocuments(convention_df, category_col='party', parsed_col='parse')
.build().get_stoplisted_unigram_corpus())
html = st.produce_projection_explorer(corpus, category='democrat', category_name='Democratic',
not_category_name='Republican', metadata=convention_df.speaker)
In order to use custom word embedding functions or projection functions, pass models into the word2vec_model
and projection_model
parameters. In order to use T-SNE, for example, use projection_model=sklearn.manifold.TSNE()
.
import umap
from gensim.models.word2vec import Word2Vec
html = st.produce_projection_explorer(corpus,
word2vec_model=Word2Vec(size=100, window=5, min_count=10, workers=4),
projection_model=umap.UMAP(min_dist=0.5, metric='cosine'),
category='democrat',
category_name='Democratic',
not_category_name='Republican',
metadata=convention_df.speaker)
Term positions can also be determined by the positions of terms according to the output of principal component analysis, and produce_projection_explorer
also supports this functionality. We'll look at how axes transformations ("scalers" in Scattertext terminology) can make it easier to inspect the output of PCA.
We'll use the 2012 Conventions corpus for these visualizations. Only unigrams occurring in at least three documents will be considered.
>>> convention_df = st.SampleCorpora.ConventionData2012.get_data()
>>> convention_df['parse'] = convention_df['text'].apply(st.whitespace_nlp_with_sentences)
>>> corpus = (st.CorpusFromParsedDocuments(convention_df,
... category_col='party',
... parsed_col='parse')
... .build()
... .get_stoplisted_unigram_corpus()
... .remove_infrequent_words(minimum_term_count=3, term_ranker=st.OncePerDocFrequencyRanker))
Next, we use scikit-learn's tf-idf transformer to find very simple, sparse embeddings for all of these words. Since we input a #docs x #terms matrix to the transformer, we transpose its output to get a proper term-embedding matrix, where each row corresponds to a term, and the columns correspond to document-specific tf-idf scores.
>>> from sklearn.feature_extraction.text import TfidfTransformer
>>> embeddings = TfidfTransformer().fit_transform(corpus.get_term_doc_mat())
>>> embeddings.shape
(189, 2159)
>>> corpus.get_num_docs(), corpus.get_num_terms()
(189, 2159)
>>> embeddings = embeddings.T
>>> embeddings.shape
(2159, 189)
Given these sparse embeddings, we can apply sparse singular value decomposition to extract three factors. SVD factorizes the term-embedding matrix into three matrices, U, Σ, and VT. Importantly, U provides the factor loadings for each term, VT provides them for each document, and Σ is a vector of the singular values.
>>> from scipy.sparse.linalg import svds
>>> U, S, VT = svds(embeddings, k = 3, maxiter=20000, which='LM')
>>> U.shape
(2159, 3)
>>> S.shape
(3,)
>>> VT.shape
(3, 189)
We'll look at the first two factors, plotting each term such that its x-axis position is its loading on the first factor and its y-axis position is its loading on the second. To do this, we make a "projection" data frame, where the x and y columns store the first two columns of U, and key the data frame on each term. This controls the term positions on the chart.
>>> x_dim = 0; y_dim = 1;
>>> import pandas as pd
>>> projection = pd.DataFrame({'term':corpus.get_terms(),
... 'x':U.T[x_dim],
... 'y':U.T[y_dim]}).set_index('term')
We'll use the produce_pca_explorer
function to visualize these. Note that we include the projection object and specify which dimensions were used for x and y (x_dim and y_dim) so they can be labeled in the interactive visualization.
html = st.produce_pca_explorer(corpus,
category='democrat',
category_name='Democratic',
not_category_name='Republican',
projection=projection,
metadata=convention_df['speaker'],
width_in_pixels=1000,
x_dim=x_dim,
y_dim=y_dim)
Click for an interactive visualization.
We can easily re-scale the plot in order to make more efficient use of space. For example, passing in scaler=scale_neg_1_to_1_with_zero_mean
will make all four quadrants take equal area.
html = st.produce_pca_explorer(corpus,
category='democrat',
category_name='Democratic',
not_category_name='Republican',
projection=projection,
metadata=convention_df['speaker'],
width_in_pixels=1000,
scaler=st.scale_neg_1_to_1_with_zero_mean,
x_dim=x_dim,
y_dim=y_dim)
Click for an interactive visualization.
To export the content of a scattertext explorer object (ScattertextStructure) to matplotlib you can use produce_scattertext_pyplot
. The function returns a matplotlib.figure.Figure
object which can be visualized using plt.show
or plt.savefig
as in the example below.
Note that installation of textalloc==0.0.3 and matplotlib>=3.6.0 is required before running this.
convention_df = st.SampleCorpora.ConventionData2012.get_data().assign(
parse = lambda df: df.text.apply(st.whitespace_nlp_with_sentences)
)
corpus = st.CorpusFromParsedDocuments(convention_df, category_col='party', parsed_col='parse').build()
scattertext_structure = st.produce_scattertext_explorer(
corpus,
category='democrat',
category_name='Democratic',
not_category_name='Republican',
minimum_term_frequency=5,
pmi_threshold_coefficient=8,
width_in_pixels=1000,
return_scatterplot_structure=True,
)
fig = st.produce_scattertext_pyplot(scattertext_structure)
fig.savefig('pyplot_export.png', format='png')
Please see the examples in the PyData 2017 Tutorial on Scattertext.
Cozy: The Collection Synthesizer (Loncaric 2016) was used to help determine which terms could be labeled without overlapping a circle or another label. It automatically built a data structure to efficiently store and query the locations of each circle and labeled term.
The Cozy script used to build rectangle-holder.js was:
fields ax1 : long, ay1 : long, ax2 : long, ay2 : long
assume ax1 < ax2 and ay1 < ay2
query findMatchingRectangles(bx1 : long, by1 : long, bx2 : long, by2 : long)
assume bx1 < bx2 and by1 < by2
ax1 < bx2 and ax2 > bx1 and ay1 < by2 and ay2 > by1
And it was called using
$ python2.7 src/main.py <script file name> --enable-volume-trees \
--js-class RectangleHolder --enable-hamt --enable-arrays --js rectangle_holder.js
Added code to ensure that term statistics will show up even if no documents are present in the visualization.
Better axis labeling (see demo_axis_crossbars_and_labels.py).
Pytextrank compatibility
Ensured Pandas 1.0 compatibility, fixing Issue #51, and fixed the scikit-learn stopwords import issue in #49.
Added AssociationCompactorByRank, TermCategoryRanker, and the terms_to_show parameter.
Added use_categories_as_metadata_and_replace_terms to TermDocMatrix.
Added get_metadata_doc_count_df and get_metadata_count_mat to TermDocMatrix.
Added produce_pairplot.
Added ScatterChart.hide_terms(terms: iter[str]), which enables selected terms to be hidden from the chart.
Added ScatterChartData.score_transform to specify the function which changes an original score into a value between 0 and 1 used for term coloring.
Added alternative_term_func to produce_scattertext_explorer, which allows you to inject a function that activates when a term is clicked.
Added HedgesR, an unbiased version of Cohen's d, which is a subclass of CohensD.
Added the frequency_transform parameter to produce_frequency_explorer. This defaults to a log transform, but allows you to order terms along the x-axis any way your heart desires.
Added show_category_headings=True to produce_scattertext_explorer. Setting this to False suppresses the list of categories which will be displayed in the term context area.
Added the div_name argument to produce_scattertext_explorer and name-spaced important divs and classes by div_name in the HTML templates and Javascript.
Added show_cross_axes=True to produce_scattertext_explorer. Setting this to False prevents the cross axes from being displayed if show_axes is True.
TermDocMatrix.get_metadata_freq_df now accepts the label_append argument, which by default adds ' freq' to the end of each column.
TermDocMatrix.get_num_cateogires returns the number of categories in a term-document matrix.
Added the following methods: TermDocMatrixWithoutCategories.get_num_metadata and TermDocMatrix.use_metadata_as_categories.
The unified_context argument in produce_scattertext_explorer lists all contexts in a single column. This lets you see snippets organized by multiple categories in a single column. See demo_unified_context.py for an example.
Added a series of objects to handle uncategorized corpora, a section on Document-Based Scatterplots, and the add_doc_names_as_metadata function. CategoryColorAssigner was also added to assign colors to qualitative categories.
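As a rough illustration of a few of the display options above, here is a minimal sketch reusing the Conventions corpus built in the earlier examples (the parameter values are illustrative only):
html = st.produce_scattertext_explorer(
    corpus,
    category='democrat',
    category_name='Democratic',
    not_category_name='Republican',
    metadata=convention_df['speaker'],
    show_category_headings=False,   # hide the category list in the term context area
    show_cross_axes=False,          # prevent the cross axes from being displayed when show_axes is True
    div_name='my_scattertext_div'   # name-space the generated divs and classes
)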
A number of new term scoring approaches were added, including RelativeEntropy (a direct implementation of Frankhauser et al. (2014)) and ZScores, an implementation of the Z-Score model used in Frankhauser et al.
TermDocMatrix.get_metadata_freq_df()
returns a metadata-doc corpus.
CorpusBasedTermScorer.set_ranker
allows you to use a different term ranker when finding corpus-based scores. This not only lets these scorers work with metadata, but also allows you to integrate once-per-document counts.
Fixed produce_projection_explorer
such that it can work with a predefined set of term embeddings. This can allow, for example, the easy exploration of one-hot-encoded term embeddings in addition to arbitrary lower-dimensional embeddings.
Added add_metadata
to TermDocMatrix
in order to inject meta data after a TermDocMatrix object has been created.
Made sure tooltip never started above the top of the web page.
Added DomainCompactor
.
Fixed bug #31, enabling context to show when metadata value is clicked.
Enabled the display of terms in topic models in the explorer, along with the display of customized topic models. Please see Visualizing topic models for an overview of the additions.
Removed pkg_resources from Phrasemachine, corrected demo_phrase_machine.py
Now compatible with Gensim 3.4.0.
Added characteristic explorer, produce_characteristic_explorer
, to plot terms with their characteristic scores on the x-axis and their class-association scores on the y-axis. See Ordering Terms by Corpus Characteristicness for more details.
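A minimal sketch of such a call, assuming the Conventions corpus from the earlier examples (only the documented function name and the standard explorer parameters are used here):
html = st.produce_characteristic_explorer(
    corpus,
    category='democrat',
    category_name='Democratic',
    not_category_name='Republican',
    metadata=convention_df['speaker']
)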
Added TermCategoryFrequencies
in response to Issue 23. Please see Visualizing differences based on only term frequencies for more details.
Added x_axis_labels
and y_axis_labels
parameters to produce_scattertext_explorer
. These let you include evenly-spaced string axis labels on the chart, as opposed to just "Low", "Medium" and "High". These rely on d3's ticks function, which can behave unpredictably. Caveat usor.
Semiotic Squares now look better, and have customizable labels.
Incorporated the General Inquirer lexicon. For non-commercial use only. The lexicon is downloaded from their homepage at the start of each use. See demo_general_inquierer.py
.
Incorporated Phrasemachine from AbeHandler (Handler et al. 2016). For the license, please see PhraseMachineLicense.txt
. For an example, please see demo_phrase_machine.py
.
Added CompactTerms
for removing redundant and infrequent terms from term document matrices. These occur if a word or phrase is always part of a larger phrase; the shorter phrase is considered redundant and removed from the corpus. See demo_phrase_machine.py
for an example.
Added FourSquare
, a pattern that allows for the creation of a semiotic square with separate categories for each corner. Please see demo_four_square.py
for an early example.
Finally, added a way to easily perform T-SNE-style visualizations on a categorized corpus. This uses, by default, the umap-learn package. Please see demo_tsne_style.py.
Fixed to ScaledFScorePresets(one_to_neg_one=True)
, added UnigramsFromSpacyDoc
.
Now, when using CorpusFromPandas
, a CorpusDF
object is returned, instead of a Corpus
object. This new type of object keeps a reference to the source data frame, and returns it via the CorpusDF.get_df()
method.
The factory CorpusFromFeatureDict
was added. It allows you to directly specify term counts and metadata item counts within the dataframe. Please see test_corpusFromFeatureDict.py
for an example.
Added a semiotic square creator.
The idea is to build a semiotic square that contrasts two categories in a Term Document Matrix while using other categories as neutral categories.
See Creating semiotic squares for an overview on how to use this functionality and semiotic squares.
Added a parameter to disable the display of the top-terms sidebar, e.g., produce_scattertext_explorer(..., show_top_terms=False, ...)
.
An interface to part of the subjectivity/sentiment dataset from Bo Pang and Lillian Lee. ``A Sentimental Education: Sentiment Analysis Using Subjectivity Summarization Based on Minimum Cuts''. ACL. 2004. See SampleCorpora.RottenTomatoes
.
Fixed bug that caused tooltip placement to be off after scrolling.
Made category_name
and not_category_name
optional in produce_scattertext_explorer
etc.
Created the ability to customize tooltips via the get_tooltip_content
argument to produce_scattertext_explorer
etc., control axes labels via x_axis_values
and y_axis_values
. The color_func
parameter is a Javascript function to control color of a point. Function takes a parameter which is a dictionary entry produced by ScatterChartExplorer.to_dict
and returns a string.
Integration with Scikit-Learn's text-analysis pipeline led to the creation of the CorpusFromScikit
and TermDocMatrixFromScikit
classes.
The AutoTermSelector class was added to automatically suggest terms to appear in the visualization.
This can make it easier to show large data sets, and removes fiddling with the various minimum term frequency parameters.
For an example of how to use CorpusFromScikit
and AutoTermSelector
, please see demo_sklearn.py
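As a quick sketch, mirroring the 20-newsgroup example earlier in this document (the same vectorizers, labels, and term_scores objects are assumed):
corpus = st.CorpusFromScikit(
    X=count_vectorizer.fit_transform(newsgroups_train.data),
    y=newsgroups_train.target,
    feature_vocabulary=vectorizer.vocabulary_,
    category_names=newsgroups_train.target_names,
    raw_texts=newsgroups_train.data
).build()
terms_to_include = st.AutoTermSelector.get_selected_terms(corpus, term_scores, 4000)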
Also, I updated the library and examples to be compatible with spaCy 2.
Fixed bug when processing single-word documents, and set the default beta to 2.
Added produce_frequency_explorer
function, and added the PEP 396-compliant __version__
attribute as mentioned in #19. Fixed bug when creating visualizations with more than two possible categories. Now, by default, category names will not be title-cased in the visualization, but will retain their original case.
If you'd still like to do this, use ScatterChart (or a descendant).to_dict(..., title_case_names=True)
. Fixed DocsAndLabelsFromCorpus
for Py 2 compatibility.
Fixed bugs in chinese_nlp
when jieba has already been imported and in p-value computation when performing log-odds-ratio w/ prior scoring.
Added demo for performing a Monroe et. al (2008) style visualization of log-odds-ratio scores in demo_log_odds_ratio_prior.py
.
Breaking change: pmi_filter_thresold
has been replaced with pmi_threshold_coefficient
.
Added Emoji and Tweet analysis. See Emoji analysis.
Characteristic terms fall back to "Most frequent" if no terms used in the chart are present in the background corpus.
Fixed top-term calculation for custom scores.
Set scaled f-score's default beta to 0.5.
Added --spacy_language_model
argument to the CLI.
Added the alternative_text_field
option in produce_scattertext_explorer
to show an alternative text field when showing contexts in the interactive HTML visualization.
Updated ParsedCorpus.get_unigram_corpus
to allow for continued alternative_text_field
functionality.
Added the ability for Scattertext to use noun chunks instead of unigrams and bigrams through the FeatsFromSpacyDocOnlyNounChunks
class. In order to use it, run your favorite Corpus
or TermDocMatrix
factory, and pass in an instance of the class as a parameter:
st.CorpusFromParsedDocuments(..., feats_from_spacy_doc=st.FeatsFromSpacyDocOnlyNounChunks())
Fixed a bug in corpus construction that occurs when the last document has no features.
Now you don't have to install tinysegmenter to use Scattertext, but you do need to install it if you want to parse Japanese. The previous hard dependency caused a problem when Scattertext was being installed on Windows.
Added TermDocMatrix.get_corner_score
, giving an improved version of the Rudder Score. Exposing whitespace_nlp_with_sentences
. It's a lightweight, bad-regex sentence splitter built atop a bad-regex tokenizer that somewhat apes spaCy's API. Use it if you don't have spaCy and the English model downloaded, or if you care more about memory footprint and speed than accuracy.
It's not compatible with word_similarity_explorer
but is compatible with word_similarity_explorer_gensim.
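For example, a corpus can be built without spaCy by passing it as the nlp argument, mirroring the movie-review example above:
movie_df = st.SampleCorpora.RottenTomatoes.get_data()
corpus = st.CorpusFromPandas(
    movie_df,
    category_col='category',
    text_col='text',
    nlp=st.whitespace_nlp_with_sentences
).build()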
Tweaked scaled f-score normalization.
Fixed Javascript bug when clicking on '$'.
Fixed bug in Scaled F-Score computations, and changed computation to better score words that are inversely correlated to category.
Added Word2VecFromParsedCorpus
to automate training Gensim word vectors from a corpus, and word_similarity_explorer_gensim
to produce the visualization.
See demo_gensim_similarity.py
for an example.
Added the d3_url
and d3_scale_chromatic_url
parameters to produce_scattertext_explorer
. This provides a way to manually specify the paths to "d3.js" (i.e., the file from "https://cdnjs.cloudflare.com/ajax/libs/d3/4.6.0/d3.min.js") and "d3-scale-chromatic.v1.js" (i.e., the file from "https://d3js.org/d3-scale-chromatic.v1.min.js").
This is important if you're getting the error:
Javascript error adding output!
TypeError: d3.scaleLinear is not a function
See your browser Javascript console for more details.
It also lets you use Scattertext if you're serving in an environment with no (or a restricted) external Internet connection.
For example, if "d3.min.js" and "d3-scale-chromatic.v1.min.js" were present in the current working directory, calling the following code would reference them locally instead of the remote Javascript files. See Visualizing term associations for code context.
>>> html = st.produce_scattertext_explorer(corpus,
... category='democrat',
... category_name='Democratic',
... not_category_name='Republican',
... width_in_pixels=1000,
... metadata=convention_df['speaker'],
... d3_url='d3.min.js',
... d3_scale_chromatic_url='d3-scale-chromatic.v1.min.js')
Fixed a bug in 0.0.2.6.0 that transposed default axis labels.
Added a Japanese mode to Scattertext. See demo_japanese.py
for an example of how to use Japanese. Please run pip install tinysegmenter
to parse Japanese.
Also, the chinese_mode
boolean parameter in produce_scattertext_explorer
has been renamed to asian_mode
.
For an example, see the output of demo_japanese.py.
Custom term positions and axis labels. Although not recommended, you can visualize different metrics on each axis in visualizations similar to Monroe et al. (2008). Please see Custom term positions for more info.
Enhanced the visualization of query-based categorical differences, a.k.a. the word_similarity_explorer
function. When run, a plot is produced that contains category-associated terms colored in either red or blue hues, and terms not associated with either class colored in greyscale and slightly smaller. The intensity of each color indicates association with the query term.
Some minor bug fixes, and added a minimum_not_category_term_frequency
parameter. This fixes a problem with visualizing imbalanced datasets. It sets a minimum number of times a word that does not appear in the target category must appear before it is displayed.
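A minimal sketch of how this parameter might be passed, reusing the Conventions corpus from the earlier examples (the threshold value of 5 is illustrative, not a recommendation):
html = st.produce_scattertext_explorer(
    corpus,
    category='democrat',
    category_name='Democratic',
    not_category_name='Republican',
    minimum_not_category_term_frequency=5,  # non-target-category words must appear at least 5 times
    metadata=convention_df['speaker']
)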
Added TermDocMatrix.remove_entity_tags
method to remove entity type tags from the analysis.
Fixed the matched-snippet display problem (issue #9), and fixed a Python 2 issue in creating a visualization using a ParsedCorpus
prepared via CorpusFromParsedDocuments
, mentioned in the latter part of the issue #8 discussion.
Again, Python 2 is supported in experimental mode only.
Corrected example links on this Readme.
Fixed a bug in Issue 8 where the HTML visualization produced by produce_scattertext_html
would fail.
Fixed a couple issues that rendered Scattertext broken in Python 2. Chinese processing still does not work.
Note: Use Python 3.4+ if you can.
Fixed links in Readme, and made regex NLP available in CLI.
Added the command line tool, and fixed a bug related to Empath visualizations.
Ability to see how a particular term is discussed differently between categories through the word_similarity_explorer
function.
Specialized mode to view sparse term scores.
Fixed a bug that was caused by repeated values in background unigram counts.
Added true alphabetical term sorting in visualizations.
Added an optional save-as-SVG button.
Added the option of showing characteristic terms (from the full set of documents) being considered. The option (show_characteristic
in produce_scattertext_explorer
) is on by default, but currently unavailable for Chinese. If you know of a good Chinese wordcount list, please let me know. The algorithm used to produce these is F-Score.
See this and the following slide for more details
Added document and word count statistics to main visualization.
Added preliminary support for visualizing Empath (Fast 2016) topics and categories instead of emotions. See the tutorial for more information.
Improved term-labeling.
Addition of strip_final_period
param to FeatsFromSpacyDoc
to deal with spaCy tokenization of all-caps documents that can leave periods at the end of terms.
I've added support for Chinese, including the ChineseNLP class, which uses a RegExp-based sentence splitter and Jieba for word segmentation. To use it, see the demo_chinese.py
file. Note that CorpusFromPandas
currently does not support ChineseNLP.
In order for the visualization to work, set the asian_mode
flag to True
in produce_scattertext_explorer
.
Author: JasonKessler
Source Code: https://github.com/JasonKessler/scattertext
License: Apache-2.0 license
1675500600
Are you a social media marketer who wants to better focus your time, effort, and budget? It's time for some new social media analytics tools!
Wondering which of your social media tactics are working? Want to better focus your time, effort, and budget? You need a social media analytics tool.
In this article, I’ll cover some of the best free social media analytics tools available, along with some paid options (for the true nerds who want to dive deep on the data and see real returns).
Social media analytics tools help you create performance reports to share with your team, stakeholders, and boss — to figure out what’s working and what’s not. They should also provide the historical data you need to assess your social media marketing strategy on both macro and micro levels.
They can help you answer questions like:
Key benefits: Performance data from every social network in one place with easy-to-understand reports
Paid or free? Paid tool
Skill level: Beginner to intermediate
Best for: Business owners who run their own social media, social media managers at small-to-medium sized businesses, marketing teams
Most social media management platforms have built-in analytics tools. I hope you’ll forgive me for saying Hootsuite’s reporting capabilities are my favorite. But it’s the tool I know and love best.
Imagine Twitter analytics, Instagram analytics, Facebook analytics, Pinterest analytics, and LinkedIn analytics all in one place. Hootsuite Analytics offers a complete picture of all your social media efforts, so you don’t have to check each platform individually.
It saves time by making it easy to compare results across networks.
Ever spend a bunch of time writing and designing a social post only to have it fall completely flat? There could be a lot of reasons for that. But one of the most common reasons this happens is posting at the wrong time. A.k.a. Posting when your target audiences are not online or not interested in engaging with you.
This is why our Best Time to Publish tool is one of the most popular features of Hootsuite Analytics. It looks at your unique historical social media data and recommends the most optimal times to post based on three different goals:
Most social media analytics tools will only recommend posting times based on engagement. Or they’ll use data from universal benchmarks, instead of your unique performance history.
Other cool things you can do with Hootsuite Analytics:
On top of all of that, Hootsuite won the 2022 MarTech Breakthrough Award for Best Overall Social Media Management Platform!
And, according to reviews at least, the social media analytics tools were a big part of that win:
“Makes social media so much easier!
The ease of scheduling posts is amazing. The analytics for reporting are incredible. You can create you own personalized reports.”
– Melissa R. Social Media Manager
Need help setting realistic goals? Or maybe you’re not a fan of manually collecting data for audits and SWOT analyses?
With Hootsuite’s social media benchmarking, you can find out how others in your industry are doing on social and compare your results with just a few clicks.
To get industry benchmarks, follow these steps:
That’s it! Now you can see how your results compare to average performance stats within your industry. You can set up custom timeframes, switch between networks — Instagram, Facebook, Twitter, LinkedIn, and TikTok — and look up benchmarks for the following metrics:
… and more.
You will also find resources to improve your content performance right in the summary section:
And, if you need to present your results to your team, boss, or other stakeholders, you can easily download your comparison report as a PDF file.
Hootsuite Analytics is included in the Hootsuite Professional plan, which you can try for free for 30 days.
Key benefit: See how much traffic and leads flow to your website from your social media channels
Paid or free: Free tool
Skill level: all skill levels
Best for: all social media professionals should be familiar with Google Analytics, but especially those who work for a web-based business
You’ve probably heard of Google Analytics already. That’s because it’s one of the best free tools to use to learn about your website visitors. And if you’re a social marketer who likes to drive traffic to your website, then it’s an invaluable resource to have in your back pocket.
While it’s not a social media reporting tool per se, you can use it to set up reports that will help you:
With these data points, you’ll be able to get the most out of your social media campaigns and effectively strategize for the future. No social media strategy is complete without Google Analytics.
Learn more: How to use Google Analytics to track social media success
Key benefit: Fully customizable reporting that can draw data from all major social media networks.
Paid or free: Paid tool
Skill level: intermediate
Best for: social media managers
RivalIQ was designed to let social media managers be data scientists, without the pesky certification. RivalIQ delivers on-demand analytical data, alerts, and custom reports from major social media platforms.
Easily conduct a competitive analysis or a complete social media audit with RivalIQ’s in-depth reporting. Better still, you can actually present your findings directly to your director, stakeholders, and marketing team with fully-customizable charts, graphics, and dashboards.
But RivalIQ isn’t just for finding the big picture! Comprehensive social post analytics lets you see exactly which posts work for each platform and identify why they work. Know exactly whether it was the hashtags, time of day, post type, or which network’s audience led to success. Then take that knowledge and double down for more success!
Pro tip: Getting owned by the competition? With RivalIQ you can find all the same info above, but from their social media accounts. If you can’t beat ’em, join ’em (then beat ’em at their own game)!
Learn more: Try a demo or start your free trial with RivalIQ
Key benefits: Analyze brand sentiment and customer demographics in real time, alongside all your other social media performance data
Free or paid: Paid tool
Skill level: Intermediate to advanced
Best for: Social media professionals, PR and communications teams, small to large social media teams
Hootsuite Insights is a powerful enterprise-level social listening tool that doubles as an analytics tool.
It goes beyond Hootsuite Analytics, tracking your earned social mentions so you can measure social sentiment and improve customer experience.
It also analyzes data about your audience demographics like gender, location, and language. You can compare demographics across networks, or look at the aggregate picture of your audience for all networks combined.
This is a tool that really tells you a lot about your audience — and how they feel about you. It can tell you whether a spike in mentions is a victory or a disaster. And it can help you capitalize or avoid either one, respectively.
Key benefits: Track and analyze data from more than 95 million sources, including blogs, forums, and review sites, as well as social networks
Free or paid: Paid tool
Skill level: Beginner to intermediate
Best for: PR and communications teams, social media marketers who focus on engagement and brand monitoring
Brandwatch is a powerful tool with five easy-to-use social media analytics report templates:
Learn more: You can add Brandwatch to your Hootsuite dashboard
Key benefits: Monitor conversations from more than 150 million sources to analyze engagement, potential reach, comments, sentiment, and emotions
Free or paid: Paid tool
Skill level: intermediate to advanced
Best for: social media managers, PR and communications teams, brand monitors, product marketers, researchers
Talkwalker offers analytics related to social conversations beyond your owned social properties, including:
You can filter by region, demographics, device, type of content, and more.
Talkwalker is especially useful to spot activity peaks in conversations about your brand. This can help you determine the best times for your brand to post on social media.
Learn more: You can add Talkwalker to your Hootsuite dashboard
Key benefits: In-depth automated social media reports and dashboards for all platforms
Free or paid: Paid tool
Skill level: intermediate to advanced
Best for: Enterprise-level businesses and organizations
Keyhole lets you report on everything: social media campaigns, brand mentions and interactions, hashtag impact, and even influencer campaign results. But that’s not all!
You can drill down into your impressions, reach, share of voice, and even analyze your competitor’s social media strategies.
If you’re utilizing influencer marketing as part of your strategy, Keyhole has reporting capabilities that will let you identify the ideal influencers to work with.
Best of all? Keyhole allows you to effectively never work in a spreadsheet again. Nice!
Key benefits: Analyze the YouTube performance of multiple channels
Free or paid: Paid tool (free for Hootsuite Enterprise users)
Skill level: all skill levels
Best for: YouTube marketers and creators, social media managers who run a YouTube channel alongside other social channels
The Channelview Insights App adds YouTube analytics to the Hootsuite dashboard.
With this integration, you can analyze your YouTube video and channel performance alongside all your other social media channels. You can also schedule automatic, regular reports.
Easily see the following metrics in one place:
Key benefit: Track mentions, keywords, and sentiment across multiple languages on social channels and elsewhere on the web.
Free or paid: Paid tool
Skill level: Beginner to intermediate
Best for: PR and communications teams, brand monitoring teams, product marketers, researchers at small to medium-sized businesses.
Want to get a big picture view of what’s being said about your brand on the internet? Mentionlytics is a great entry into the world of social media monitoring — especially if you run a global business in more than one language.
Other things you can do with Mentionlytics:
Key benefit: tracks Instagram analytics, including Instagram Story analytics
Free or paid: Paid (or free for Hootsuite Enterprise users)
Skill level: All skill levels
Best for: Instagram marketers
Alert all the Instagram marketers. Panoramiq Insights is perfect for Hootsuite free users or pro users who want to get deeper insights on their Stories in particular. (Just download the app from our App Library).
Among other things, Panoramiq Insights lets you:
We’ve created a free social media analytics template you can use to collect data about your performance on the various social networks. It’s a great place to start if you aren’t ready to invest in a tool that will automatically collect data for you. Simply download it, make a copy, and start customizing it with your own data.
For more information on how to share your analytics data effectively, check out our post on how to create a smart and simple social media report.
Social media analytics is the collection and analysis of performance data that helps you measure the success of your social media strategy. It includes tracking metrics like engagement, reach, likes, and many more across all your social channels.
Hootsuite Analytics is one of the best social media analytics tools on the market. It helps social media managers, marketers, and business owners track metrics from all major networks (Instagram, Facebook, Twitter, LinkedIn, TikTok, Pinterest, YouTube) in one easy-to-use dashboard. Hootsuite users can also generate beautiful custom results with just a few clicks.
Social media analytics tells you if your tactics and social strategies are working, and helps you better focus your time, effort, and budgets to reach the best results.
Most social media networks have free built-in analytics tools — but the easiest way to track social media analytics from multiple accounts and networks in one place is to use a social media management tool like Hootsuite.
Track your social media performance and maximize your budget with Hootsuite. Publish your posts and analyze the results in the same, easy-to-use dashboard. Try it free today.
Original article source at: https://blog.hootsuite.com/
1675147452
What trends will dominate in the next decade? Will they be AI, VR, .NET trends or Blockchain development trends?
We live in a time where technological advancements are happening at an unprecedented rate. The pace of change has never been faster. In the last fifteen years alone, we’ve seen the rise of smartphones, social media platforms, cloud computing, virtual reality, artificial intelligence, and much more.
As these technologies continue to evolve, new ones will emerge. And more and more web development companies that offer software powered by these technologies appeared.
In this article, we would like to be more down-to-earth and speak about the technologies right in front of us. And the topic is .NET technology and .NET trends. It features multiple usage options: web enterprise and eCommerce platforms, the Internet of Things development, high-quality graphic android games, and more! This technology is in the top 10 most popular and loved technologies, according to the stack overflow survey 2022.
The .NET Framework is an application platform developed by Microsoft Corporation (MS) that allows developers to create applications that run across multiple operating systems, including Windows, Mac OS X, Linux, iOS, Android, etc. The .NET Framework includes several components such as ASP.NET, ADO.NET, WPF, WCF, MVC, Entity Framework, LINQ, etc. The .NET Framework provides a set of tools and libraries designed to simplify application programming. These include the common language runtime (CLR), the Common Language Infrastructure (CLI), type safety, managed code, garbage collection, and security features.
The .Net technology has been widely adopted in enterprise software development. It is also used in many other areas, including web services, mobile devices, desktop applications, games, and embedded systems. More than 300 million lines of source code are written using this technology worldwide. Furthermore, there are more than 7 million .NET developers worldwide.
The .NET framework is a modern and fast platform for building applications because it includes several technical features and characteristics that make it efficient and easy to use. Key features include:
In general, .NET Core, a subset of .NET Framework, has been designed to be lightweight, modular, and high-performance. It allows building software quicker and makes the software development lifecycle shorter. It’s optimized for server-side scenarios and can be used to build high-performance, scalable and responsive applications.
.NET 6 has enhanced security features such as Write XOR Execute (W^X) and Intel's Control-flow Enforcement Technology (CET). CET is available with the latest Intel and AMD processors and can protect your device from control-flow attacks. .NET 6 supports CET for Windows x64 applications; however, you must enable this feature explicitly.
In the same way, the W^X feature reduces the attack surface by ensuring that memory pages are never writable and executable at the same time. The W^X feature is enabled by default on Apple systems, but not on Windows systems. Moreover, even with W^X enabled, you should still invest in application security testing.
Friendly
It works with a variety of frameworks and libraries. With ASP.NET Core, it is possible to build a sizable online application incorporating Angular and JavaScript.
.NET has two parts, the .NET Framework, which is a proprietary software framework developed by Microsoft that runs primarily on Microsoft Windows, and .NET Core. .NET Core is open-source and can be used to build open-source applications.
.NET Core is an open-source, cross-platform and can be used to build applications for Windows, Linux, and macOS. Microsoft and the .NET community on GitHub maintain it.
Software development utilizing the Microsoft.Net Framework is known as “.Net Development.” The Microsoft Windows (PC) and Microsoft Windows CE platforms support the usage of a collection of software components known as the.NET Framework to create and execute applications. The .NET Framework offers a common programming interface for creating software for servers, desktops, mobile devices, portable devices, and other platforms.
The Framework’s ability to provide an API (application programming interface) to access common services in other applications, like databases or web servers, is a crucial component. Before giving those traditional systems access to the data contained in their database, a normal application can use the API to interact with traditional systems like Access or SQL Server.
It is difficult to predict which technological trend will grow more in the future. But it is obvious that .NET will be here as it is the essential technology revolution and will become more popular in the coming years.
Nonetheless, when it comes to the wages of .NET developers, the industry saw a 2.1% rise in average salary in 2020 and 2.3% in 2021, followed by a further increase of about 3.3% in 2022.
A .NET developer can pull down roughly $105,000 per year. There is also an increasing trend of .NET developers working offsite, which means you can earn a handsome package from the comfort of your home. According to Upwork, 41.8% of American developers will prefer to work in remote settings by 2025.
The need for .Net developers will rise as more businesses adopt this technology for their operations and goods due to ongoing .NET trends. Over time, the .Net has grown significantly and is now a widely used programming language among programmers and software developers. Additionally, numerous businesses and governments from throughout the world use it. As more companies utilize this technology in their goods and services, the number of web developers will likewise rise.
.NET is a popular, ever-growing development platform that lets developers create cloud-based, mobile, and desktop applications for Windows, iOS, or Android. You’ll find .NET libraries for just about every programming language and platform.
Now we would like to move on to the top 9 .NET trends. It's not our first article about trends, so check this article out:
But what will it be like in 2023? What will .NET trends dominate at that time? Here are some predictions:
Blazor, an advanced component of the .NET web development framework, enables .NET development companies to produce web apps in C#. It offers reusable web components for working with HTML and CSS. Blazor allows you to develop server-side and client-side code in the same language.
Furthermore, by running client events directly on the server, Blazor may be linked to SignalR to run real-time apps and provide users with a unified experience. This is one of the most useful .NET trends and will help developers overcome common challenges and learn best coding practices.
Blazor allows the creation of rich interactive UIs using C#. The most popular frameworks, like Angular, React, and Vue, use JavaScript for building interactive interfaces. Using Blazor means leaving them behind, which is a tough transition for developers.
The Internet of Things (IoT) is a broad term that refers to any device connected to the internet. IoT has become more popular over time, and it’s no wonder why: It’s a powerful way for businesses to collect data about their customers and respond in real-time when necessary. As a result, IoT is becoming an important part of many companies’ strategies for growth.
The .NET framework is one of the most popular platforms for developing IoT applications because it can handle multiple languages like C# or VB.NET easily while still giving developers access to all .NET APIs (Application Programming Interfaces). This is one of the latest .NET trends in 2023.
This makes it easier than ever for developers who want access across multiple platforms, such as Windows 10 IoT Core and Windows Server 2016 Essentials, together with tooling such as Visual Studio Code and the Microsoft Azure Mobile Services SDKs, which also support mobile app development for iOS and Android devices.
Additionally, you may build, test, distribute, and manage applications by directly accessing auxiliary components using the .NET Nano and Meadow IoT frameworks.
IoT applications that present the current state of the hardware and software of each smart device through a single interface can be made with .NET. All popular boards, including Raspberry Pi, Arduino, Pine A64, and others, are compatible with its internal architecture without a hitch.
This framework is a new .NET trend designed to develop applications powered by machine learning, using the latest advances in deep learning and artificial intelligence technology from Microsoft Research.
For medium-sized and large organizations, automating operations and processing multidimensional data are fast becoming necessities. And for these reasons, utilizing machine learning skills is a great option. ML.NET can be used to build unique algorithms, parts, and modules for embedding detection, identification, and self-optimization characteristics.
ML.NET is a cross-platform, open-source machine learning framework for .NET developers and a unique .NET trend. Making it easy to create ML models, use them to make predictions, and integrate them into your apps.
ML.NET helps and supports multiple data types (XML files) and formats (JSON files). Moreover, a variety of prebuilt algorithms in the core library can be used without writing your own code
Native applications can be made using the .NET Multi-platform App UI (MAUI). The main programming languages for developing apps that can run on iOS, Windows, Android, and macOS via a shareable source code are C# and XAML.
.NET MAUI is a new framework for building native, cross-platform UI applications. It is built on top of modern .NET, the successor to .NET Core.
With MAUI, you may exchange both single and multiple files between devices, protect data using the key-value combination technique, use text-to-speech modules, and perform real-time network, bandwidth, and latency analysis.
Software developers can safeguard the availability, confidentiality, and integrity of data using .NET. Benefits include an integrated HTTPS module, the ability to configure role-based security, and the prevention of scripting attacks by inspecting each packet and user request.
2023 is going to be a great year for .NET development. With the emergence of new technologies like IoT, ML.NET, and many more, this year will be very exciting for developers using .NET. High-tech security systems are in high demand, especially in the IT industry, and .NET is considered one of the most secure platforms for developing apps.
Almost all developers face the risk of cyberattacks during software development. The .NET framework allows these developers to improve the security of their software while boosting its performance.
Microsoft is currently releasing new .NET Core updates. Its capabilities can be used to design REST APIs, full-stack applications, and client-server architectures with bi-directional data transmission. You may also create autonomous Docker containers using a microservices framework to operate your application.
The number of searches for .NET Core-related topics has increased over time, according to Google Trends, which shows more people are interested in these topics than in those linked to classic ASP.NET.
Several aspects, such as versatility, security, and execution speed, account for the renown of ASP.NET Core development. Further, it supports an extensive range of programming languages, such as C#, F#, and more.
All users must upgrade to .NET 6 because Microsoft no longer supports .NET 5.0. It is an LTS release that ensures hot reload, intelligent code auditing, and diagnostics with Visual Studio 2022, in addition to stable support for up to three years. Additionally, Crossgen2 in .NET 6 can speed up application loading.
A unified platform for browser, cloud, desktop, Internet of Things, and mobile applications is also provided. The underlying infrastructure has also been modified to accommodate the requirements of all app kinds and facilitate code reuse across all of your apps.
According to Microsoft’s official announcement on October 11, 2022, .NET 6 is a Long-Term Support (LTS) release and will be supported for three years; with this announcement, it has become one of the top .NET trends in 2023. Additionally, it works with a variety of operating systems, including Windows Arm64, Apple Silicon, and macOS.
Although Mono is currently available for iOS and Android, its appeal is still restricted to a small group of users because of particular problems with performance, compatibility, and inadequate documentation.
But by providing the most dynamic and effective cross-platform framework for creating Android, iOS, and Windows mobile apps, Xamarin has addressed this void and has become the latest .NET trend.
The .NET Core technology created and maintained by Microsoft is required for creating cross-platform applications using the Xamarin framework. Based on .NET trends data, we can see that 11% of developers worldwide choose Xamarin for their app development projects.
.NET Command-Line Interface (CLI) is a command-line tool for working with .NET applications and one of the latest .NET trends. It can be used from the command line or through simple scripting. It supports creating, executing, and debugging C# and VB projects.
The .NET Command-Line Interface (CLI) is a tool for script-based automation of Microsoft tools and technologies. It provides cross-platform support for PowerShell, SQL Server Management Studio (SSMS), Database Management Studio (DBMS), and Integration Services.
The .NET Command-Line Interface (CLI) is a user-friendly command-line interface (CLI) for installing and using the .NET Framework. Think of it as your own personal copy of Visual Studio.
The traditional Visual Studio “installing” process has been replaced with an executable file that can be run on any system to install a localized set of commands that allow you to perform the most common development tasks.
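For a feel of the workflow, here is a minimal command sequence; the project name MyApp is just an illustration:
# create, run, and publish a console app with the .NET CLI
dotnet new console -o MyApp
cd MyApp
dotnet run
dotnet publish -c Release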
Additional reading you might find interesting:
SumatoSoft delivers exemplary services in 10+ business domains and works with all types of technologies like big data, blockchain, artificial intelligence, web, mobile, and progressive app development. We also can make a great contribution to any project that requires .NET custom software development. Every project we undertake starts with nuanced business analysis. We have more than 150 successful projects in various industries like eCommerce software, Elearning development, Finance, Real Estate, Logistics software, Travel, and more.
We are ready to build your next .NET project. Contact us to get a free quote!
As a developer, you know that .NET is one of the fastest-growing development platforms in the world, and there's no better time to learn it than now. .NET is an open-source framework for building applications and services for Windows, Linux, and macOS platforms.
Microsoft created .NET in 2000 as a successor to their earlier COM (Component Object Model) technology, which allowed software components to be developed separately from each other but still work together seamlessly.
Over time it has become one of the most popular development frameworks for enterprise applications because of its high-level support for reuse across multiple languages (C#/.NET Framework), powerful features like LINQ, which makes data querying easier than ever before, and seamless integration with SQL Server databases through Entity Framework Core libraries or Entity Framework Data Services endpoints.
All this makes developing apps quicker than ever before! The future of .NET looks bright, and we can’t wait to see what it brings.
1675127640
Conferences.digital is the best way to watch the latest and greatest videos from your favourite developer conferences for free on your Mac. Either search specifically for conferences, talks, speakers or topics or simply browse through the catalog - you can add talks to your watchlist to save for later, favourite or continue watching where you left off.
Some of you will see the warning in the screenshot below about the app "can't be opened because it is from an unidentified developer" when you first try to run it.
This is because I am not registered with Apple as part of their developer program. This is the first macOS app that I've ever built, and I don't plan to release it in the Mac App Store, so I don't want to pay the fees to register with Apple.
To bypass Apple's restriction, right-click the app and select "Open" the first time you use it or open your System Preferences application, click on the Security & Privacy selection, and there will be an "Open Anyway" button you can click.
If you have any other issues check out #9.
If you have any ideas on how to improve the app, or you would like a new feature, you can submit a pull request or create an issue here.
If you want to submit a pull request, make sure you read the contribution guidelines first.
Please be advised that this project is still at an early stage and a work in progress.
If you encounter any bugs please create an issue here and be as descriptive as possible.
Conferences.digital is written in Swift 5. Compatible with macOS 10.12.2+
As soon as new conferences/talks have been added, it will be announced on Twitter.
Download the latest release here.
Author: Zagahr
Source Code: https://github.com/zagahr/Conferences.digital
License: BSD-2-Clause license
1673431254
WEDNESDAY, 04/01/2023 - 10:16 (GMT+0700)
Whether you are an investor just entering the market or an experienced investor looking to try out young cryptocurrencies, pay attention to and consider this list of the top 9 cryptocurrencies for long-term investment, according to Analytics Insight.
It is no surprise that the list of the best cryptocurrencies to buy starts with bitcoin. Although it is relatively volatile and has to contend with the risk of tough regulations imposed by several governments around the world, what makes bitcoin worth a long-term investment is its wide and ever-growing global acceptance. So if you are wondering whether investing in bitcoin is worthwhile, the answer is that bitcoin is relatively well suited to long-term investment.
The second-largest cryptocurrency by market capitalization has seen a large increase in value over time, as high as roughly 800% over the past year.
The role of this cryptocurrency in the expansion of decentralized finance (DeFi) is worth mentioning, as it is the main reason why ethereum is so widely accepted and heavily invested in.
Polygon is an ethereum-compatible blockchain scaling and interoperability platform that serves as a framework for creating interconnected blockchain networks. Polygon is highly regarded by many investors because it has evolved to the point of overcoming some of ethereum's major shortcomings, such as throughput, delayed transactions, and a lack of community control.
In addition, Polygon is currently quite affordable and easy to buy, while having strong upside potential in the future.
Dogecoin is a meme cryptocurrency created in 2013 that has been used as an official payment method for some Tesla merchandise, and SpaceX has even helped position dogecoin as a promising cryptocurrency. It was also among the most economical cryptocurrencies to buy in 2021.
The kind of returns this meme coin delivered to investors in just a few months was once a phenomenon, with gains of up to 8,000% in a few months. Billionaire Elon Musk is one of the most famous entrepreneurs to have expressed support for dogecoin over the years.
Binance coin attracts investors' attention through the Binance cryptocurrency exchange. Binance coin has been and is being used for trading as well as for paying fees.
To date, Binance holds the record as the world's largest cryptocurrency exchange, with more than 1.4 million transactions per second as of April 2021, so binance coin has the potential to grow further in the near future, and investing in it may pay off.
Polkadot is a suitable cryptocurrency if you are looking for large returns with a small investment. Its ability to seamlessly connect heterogeneous blockchain networks is the reason why hundreds of projects are being built on the Polkadot ecosystem.
Another name on the list of the most promising cryptocurrencies for long-term investment is Cardano. Cardano is a digital currency that has recorded impressive growth in value over quite a long period of time.
Solana, which operates with a combination of proof-of-stake and proof-of-history mechanisms, has delivered huge returns to investors. Whether it is decentralized finance (DeFi), decentralized applications (DApps), or smart contracts, Solana can meet your needs.
The stablecoin tether is well known for its stability and low risk, but after the market developments of 2022, many investors remain somewhat concerned and unsure whether tether is really as low-risk as it was in the past.
Overall, investing in bitcoin, cryptocurrencies, or any digital currency on this list carries a certain amount of risk. The crypto market is highly volatile, and even economic and financial experts cannot be entirely certain about the opportunity to make money. Therefore, even after choosing the most promising cryptocurrencies, you should only invest a small portion of your capital.
1673242279
Crypto and cryptocurrency exchange development have gained widespread popularity recently, with more and more people looking to invest in and trade digital assets. With the increasing demand for cryptocurrency exchange development services, many apps and exchanges are now available. This article will look at the ten best crypto exchanges of 2022 based on factors such as ease of use and overall benefits.
Description
Coinbase is a popular cryptocurrency exchange and wallet platform that allows users to buy, sell and store a wide range of digital assets, including Bitcoin, Ethereum, Litecoin, and more. It is well-known for its ease of use and security features, making it an excellent choice for both beginner and experienced cryptocurrency investors.
Benefits
Coinbase is known for its ease of use and security features, making it an excellent choice for both beginner and experienced cryptocurrency investors. It offers a user-friendly interface and supports various payment methods, including bank transfers and credit/debit cards. Also, it has insurance for digital assets held on the platform, providing users with an extra layer of protection.
Description
Binance is a leading cryptocurrency exchange software offering a wide range of trading pairs and features, including margin, futures, and spot trading. It is known for its low fees, fast transaction speeds, and user-friendly interface, making it a popular choice for traders of all experience levels.
Benefits
Binance is known for its low fees, fast transaction speeds, and user-friendly interface, making it a popular choice for traders of all experience levels. It also strongly focuses on security, with advanced measures in place to protect user assets. Also, it has a wide range of trading pairs and supports a variety of payment methods.
Description
Kraken is a trusted cryptocurrency exchange that offers a wide range of trading pairs and features, including margin trading and futures trading. It is known for its security measures, including the use of cold storage for the majority of its digital assets.
Benefits
Kraken is known for its security measures, including the use of cold storage for the majority of its digital assets. It also has a user-friendly interface and low fees, making it a popular choice for traders of all experience levels. In addition, it supports a wide range of payment methods and strongly focuses on regulatory compliance.
Description
Bitfinex is a popular cryptocurrency exchange development solution offering a wide range of trading pairs and features, including margin and futures trading. It is known for its advanced trading tools and low fees, making it a popular choice for experienced traders.
Benefits
Bitfinex is known for its advanced trading tools and low fees, making it a popular choice for experienced traders. It also has a user-friendly interface and supports a variety of payment methods. In addition, it strongly focuses on security and uses advanced measures to protect user assets.
Description
Gemini is a trusted cryptocurrency exchange known for its security measures and regulatory compliance. It offers various trading pairs and features, including margin and futures trading.
Benefits
Gemini is known for its strong focus on security and compliance, making it a great choice for users concerned about their assets' safety. It also has a user-friendly interface and low fees, making it a popular choice for traders of all experience levels. In addition, it supports a wide range of payment methods.
Description
Bitstamp is a reputable cryptocurrency exchange software that offers a wide range of trading pairs and features, including margin trading and futures trading. It is known for its low fees and user-friendly interface, making it a popular choice for both beginner and experienced traders.
Benefits
Bitstamp is known for its low fees and user-friendly interface, making it a popular choice for both beginner and experienced traders. It also strongly focuses on security and uses advanced measures to protect user assets. The exchange supports a variety of payment methods and has a reputation for reliability.
Description
Robinhood is a popular trading app that recently added support for cryptocurrency trading. It is known for its user-friendly interface and zero-fee trading, making it an excellent choice for beginner investors.
Benefits
Robinhood is known for its user-friendly interface and zero-fee trading, making it a great choice for beginner investors. It also supports various payment methods and strongly focuses on security. The exchange offers a range of educational resources to help users learn about investing.
Description
eToro is a social trading platform that provides various financial assets, including cryptocurrency. It is known for its copy trading feature, allowing users to copy more experienced traders' trades automatically.
Benefits
eToro is known for its copy trading feature, which allows users to automatically copy the trades of more experienced traders. This can be an excellent way for beginner investors to learn from, and potentially profit from, the strategies of more experienced traders. eToro also has a user-friendly interface and supports a variety of payment methods. In addition, it has a strong focus on security and regulatory compliance.
Description
BlockFi is a financial services company that offers a range of products, including a cryptocurrency interest account that allows users to earn interest on their digital assets. It is known for its high-interest rates and secure storage of assets.
Benefits
BlockFi is known for its high-interest rates on its cryptocurrency interest account, making it a great way for users to earn passive income on their digital assets. It has secure storage of assets and a user-friendly interface. Also, it offers a range of financial products, including loans and trading services, making it a one-stop shop for all your cryptocurrency financial needs.
Description
Ledger is a hardware wallet manufacturer that offers a range of products for securely storing and managing cryptocurrency assets. It is known for its security measures and easy-to-use interface, making it a popular choice for both beginner and experienced cryptocurrency investors.
Benefits
Ledger is known for its security measures and easy-to-use interface, making it a popular choice for both beginners and experienced cryptocurrency investors. Its hardware wallets offer an extra layer of protection for users' digital assets, as they are stored offline and are not connected to the internet. Ledger also offers a range of products to suit different needs and budgets, including budget-friendly options for beginners and more advanced options for experienced users.
There are various factors to consider when choosing the best crypto exchange for you:
Different exchanges have different fees for buying, selling, and trading cryptocurrencies. Compare the fees of several exchanges to see which offers the best deal.
Security is an important consideration when choosing a crypto exchange. Look for exchanges with stringent security measures, such as two-factor authentication and cold storage for digital assets.
Consider the user experience of the exchange. Look for exchanges with a user-friendly interface that is easy to navigate.
Different exchanges support different payment methods, such as bank transfers, credit/debit cards, and online payment processors. Make sure to choose an exchange that supports your preferred payment method.
Not all exchanges support all cryptocurrencies. Ensure that the exchange you are considering supports the specific cryptocurrency you are interested in buying or selling.
Look for an exchange with a good reputation in the industry. Read reviews and do your research to get a sense of the exchange's track record.
If you encounter any issues with your transaction or account, it is vital to have access to reliable customer support. Look for an exchange with a good track record of customer support.
Considering these factors, you can settle on the best crypto exchange for your needs. You can also research and compare the features and fees of multiple exchanges before making a decision.
There are many great crypto exchanges to choose from in 2022. Whether you are an experienced trader or a beginner, there is an option that will suit your needs. Do your research and consider the features and fees of each platform before deciding. Using a reputable and user-friendly platform lets you feel confident in your cryptocurrency investments and trades.
Or, if you are planning to partner with a cryptocurrency exchange development company to develop your exchange, cool! Look for the best one suiting your business and crypto exchange needs!
1672989018
Here are our top picks for the Top 10 Web 3.0 Business Ideas for 2023.
#1 Decentralized Games
The revolution in decentralized games plays a major role in the world of Web 3.0. It fuels the delivery of innovative games that give gamers control in multiple respects. A player can take complete control of their game, not just their moves within it, but also the ability to create customized 3D digital venues. So if you are considering Web 3.0 for your business, you can give priority to decentralized games, as this idea ranks in first place.
#2 Decentralized Apps
The urge to build decentralized apps on Web 3.0 remains constant among investor circles. The reason is quite simple: decentralized applications empower the core infrastructure of Web 3.0. Many experts consider innovative dApps on Web 3.0 a way to expand a business's profit margin by making its application more interactive, engaging, transparent, trustworthy, independent, and secure.
#3 Non-Fungible Tokens
NFTs in Web 3.0 are a buzzword today. Businesses are moving towards building their own brand name in the form of NFTs. Here is some solid proof of that statement: the well-known brand Nike has entered the NFT space.
It was their first step toward turning their NFT wearable collection into billions of dollars. Perhaps you can roll out your own NFT on Web 3.0 and become the next billionaire.
#4 Decentralized Exchanges
The influence of Web 3.0 on interoperable exchange is also vital, as it promises to offer 100% accuracy and transparency to users on the internet. A lack of trust and security in centralized exchanges is the underlying cause of users shifting drastically towards decentralized exchanges.
Web 3.0 becomes a solution provider to this kind of issue and welcomes internet users by inviting them to launch a successful trading business backed with the support of blockchain technology.
#5 Decentralized Wallets
Crypto is a hot topic that is probably on the mind of every business enthusiast. If so, decentralized wallet development is a must. A reliable wallet acts as storage that safely holds your users' assets, which they can access whenever they need to.
Ensure that you research the types of wallets while choosing your business model, or simply ask an expert about it.
#6 Metaverses
Web 3.0 and the metaverse complement each other, collaborating to offer a better user experience. As of now, netizens use the internet mainly for browsing, but the metaverse lets users live in a virtual 3D environment. It creates a new lifestyle for internet users by providing a virtual environment in which users can socially interact with one another.
#7 Decentralized Social Networks
Social networking platforms are poised to rule the next generation of internet users. The probability of converting clicks into leads is higher on social media compared to other mediums. From a creator's point of view, they gain a way to monetize their work in a secure and transparent manner.
#8 DAOs
The world is preparing for the revolution of DAOs in the web3 community. Decentralized autonomous organizations let users take complete control when making decisions about a business. In a blockchain-backed DAO, community members play a major role in making crucial decisions. Votes in the web3 community can alter the way businesses operate.
#9 Decentralized Streaming Platforms
Exhibiting the work of talented artists and creators is going to be a challenge in 2023. Decentralized streaming platforms act as a strong alternative that supports creators in all aspects by allowing them to connect directly with their audience. Exceptional work from creators has reached the public ever since decentralized streaming platforms set their footprint on Web 3.0.
#10 Decentralized Finance - DeFi
DeFi and Web 3.0 as a whole shape the future of financial operations online. The solid goal of DeFi solutions is to offer internet users a better experience when exchanging their assets on a P2P network. People affirm its growth by steadily engaging with DeFi platforms rather than relying on third parties like banks and governments.
Everybody is interested in Web3 and wishes to seize this big opportunity as soon as possible. We at Assetfinx, an industry-leading Web3 development company, are privileged to support individuals who wish to start a business in Web 3.0. Our team anticipates that the sector will continue to expand in 2023.
Get Consultation!
Phone/ Whatsapp: 638 430 1100
Mail: contact@assetfinx.net
1670502174
Build Rest Api Project With Express & MongoDB | CRUD API | Node.js Tutorial for Beginners #9
In this video we will continue to build our contact management REST API project using Express & MongoDB, implementing project-wide error handling, MongoDB setup, and CRUD operations for our contacts resource.
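As a rough idea of what such a resource can look like, here is a minimal sketch of a contacts API using Express and Mongoose; the model, routes, and connection string below are illustrative assumptions, not taken from the video's repository:
const express = require("express");
const mongoose = require("mongoose");
// Illustrative schema/model -- the real project may differ
const Contact = mongoose.model("Contact", new mongoose.Schema({
  name: { type: String, required: true },
  email: String,
  phone: String,
}));
const app = express();
app.use(express.json());
app.get("/api/contacts", async (req, res) => {
  res.json(await Contact.find()); // Read all contacts
});
app.post("/api/contacts", async (req, res) => {
  res.status(201).json(await Contact.create(req.body)); // Create a contact
});
app.delete("/api/contacts/:id", async (req, res) => {
  await Contact.findByIdAndDelete(req.params.id); // Delete one contact
  res.status(204).end();
});
// Placeholder connection string for a local MongoDB instance
mongoose.connect("mongodb://localhost:27017/contacts")
  .then(() => app.listen(3000, () => console.log("API listening on port 3000")));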
⭐️ Support my channel⭐️ https://www.buymeacoffee.com/dipeshmalvia
⭐️ GitHub link for Reference ⭐️ https://github.com/dmalvia/Express_MongoDB_Rest_API_Tutorial
⭐️ Node.js for beginners Playlist ⭐️ https://youtube.com/playlist?list=PLTP3E5bPW796_icZanMqhdg7i0Cl7Y51F
🔥 Video contents... ENJOY 👇
⭐️ JavaScript ⭐️
🔗 Social Medias 🔗
⭐️ Tags ⭐️ - Node.js, Express & MongoDB Project - Build Rest API Project Express & MongoDB - Express CRUD API Tutorial - Node.Js & Express Crash Course
⭐️ Hashtags ⭐️ #nodejs #express #beginners #tutorial
Disclaimer: It doesn't feel good to have a disclaimer in every video, but this is how the world is right now. All videos are for educational purposes; use them wisely. Any video may contain a slight mistake, so please make decisions based on your own research. This video is not forcing anything on you.
https://youtu.be/niw5KSO94YI
1670076960
Today, JavaScript is at the core of virtually all modern web applications. That’s why JavaScript issues, and finding the mistakes that cause them, are at the forefront for web developers.
Powerful JavaScript-based libraries and frameworks for single page application (SPA) development, graphics and animation, and server-side JavaScript platforms are nothing new. JavaScript has truly become ubiquitous in the world of web app development and is therefore an increasingly important skill to master.
At first, JavaScript may seem quite simple. And indeed, to build basic JavaScript functionality into a web page is a fairly straightforward task for any experienced software developer, even if they’re new to JavaScript. Yet the language is significantly more nuanced, powerful, and complex than one would initially be led to believe. Indeed, many of JavaScript’s subtleties lead to a number of common problems that keep it from working—10 of which we discuss here—that are important to be aware of and avoid in one’s quest to become a master JavaScript developer.
There's no shortage of confusion among JavaScript developers regarding JavaScript's this keyword. As JavaScript coding techniques and design patterns have become increasingly sophisticated over the years, there's been a corresponding increase in the proliferation of self-referencing scopes within callbacks and closures, which are a fairly common source of "this/that confusion" causing JavaScript issues.
Consider this example code snippet:
Game.prototype.restart = function () {
this.clearLocalStorage();
this.timer = setTimeout(function() {
this.clearBoard(); // What is "this"?
}, 0);
};
Executing the above code results in the following error:
Uncaught TypeError: undefined is not a function
Why? It's all about context. The reason you get the above error is because, when you invoke setTimeout(), you are actually invoking window.setTimeout(). As a result, the anonymous function being passed to setTimeout() is being defined in the context of the window object, which has no clearBoard() method.
A traditional, old-browser-compliant solution is to simply save your reference to this in a variable that can then be inherited by the closure; e.g.:
Game.prototype.restart = function () {
this.clearLocalStorage();
var self = this; // Save reference to 'this', while it's still this!
this.timer = setTimeout(function(){
self.clearBoard(); // Oh OK, I do know who 'self' is!
}, 0);
};
Alternatively, in newer browsers, you can use the bind() method to pass in the proper reference:
Game.prototype.restart = function () {
this.clearLocalStorage();
this.timer = setTimeout(this.reset.bind(this), 0); // Bind to 'this'
};
Game.prototype.reset = function(){
this.clearBoard(); // Ahhh, back in the context of the right 'this'!
};
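In modern JavaScript, an arrow function is often the simplest option, since arrow functions do not bind their own this and instead inherit it from the enclosing scope. A minimal sketch of the same restart method:
Game.prototype.restart = function () {
  this.clearLocalStorage();
  this.timer = setTimeout(() => {
    this.clearBoard(); // Arrow functions inherit 'this' from restart(), so this now works
  }, 0);
};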
As discussed in our JavaScript Hiring Guide, a common source of confusion among JavaScript developers (and therefore a common source of bugs) is assuming that JavaScript creates a new scope for each code block. Although this is true in many other languages, it is not true in JavaScript. Consider, for example, the following code:
for (var i = 0; i < 10; i++) {
/* ... */
}
console.log(i); // What will this output?
If you guess that the console.log() call would either output undefined or throw an error, you guessed incorrectly. Believe it or not, it will output 10. Why?
In most other languages, the code above would lead to an error because the "life" (i.e., scope) of the variable i would be restricted to the for block. In JavaScript, though, this is not the case and the variable i remains in scope even after the for loop has completed, retaining its last value after exiting the loop. (This behavior is known, incidentally, as variable hoisting.)
Support for block-level scopes in JavaScript is available via the let keyword. The let keyword has been widely supported by browsers and back-end JavaScript engines like Node.js for years now.
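For example, a quick sketch of the difference (the variable name j is just for illustration):
for (let j = 0; j < 10; j++) {
  /* ... */
}
console.log(typeof j); // "undefined" -- with 'let', 'j' is scoped to the loop block only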
If that’s news to you, it’s worth taking the time to read up on scopes, prototypes, and more.
Memory leaks are almost inevitable JavaScript issues if you’re not consciously coding to avoid them. There are numerous ways for them to occur, so we’ll just highlight a couple of their more common occurrences.
Consider the following code:
var theThing = null;
var replaceThing = function () {
var priorThing = theThing; // Hold on to the prior thing
var unused = function () {
// 'unused' is the only place where 'priorThing' is referenced,
// but 'unused' never gets invoked
if (priorThing) {
console.log("hi");
}
};
theThing = {
longStr: new Array(1000000).join('*'), // Create a 1MB object
someMethod: function () {
console.log(someMessage);
}
};
};
setInterval(replaceThing, 1000); // Invoke `replaceThing' once every second
If you run the above code and monitor memory usage, you'll find that you've got a significant memory leak, leaking a full megabyte per second! And even a manual Garbage Collector (GC) doesn't help. So it looks like we are leaking longStr every time replaceThing is called. But why?
Let’s examine things in more detail:
Each theThing object contains its own 1MB longStr object. Every second, when we call replaceThing, it holds on to a reference to the prior theThing object in priorThing. But we still wouldn't think this would be a problem, since each time through, the previously referenced priorThing would be dereferenced (when priorThing is reset via priorThing = theThing;). And moreover, priorThing is only referenced in the main body of replaceThing and in the function unused which is, in fact, never used.
So again we're left wondering why there is a memory leak here.
To understand what's going on, we need to better understand the inner workings of JavaScript. The typical way that closures are implemented is that every function object has a link to a dictionary-style object representing its lexical scope. If both functions defined inside replaceThing actually used priorThing, it would be important that they both get the same object, even if priorThing gets assigned to over and over, so both functions share the same lexical environment. But as soon as a variable is used by any closure, it ends up in the lexical environment shared by all closures in that scope. And that little nuance is what leads to this gnarly memory leak.
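One way to break the chain, sketched below under the assumption that nothing else needs priorThing after the function returns, is to clear the reference explicitly at the end of replaceThing:
var theThing = null;
var replaceThing = function () {
  var priorThing = theThing;
  var unused = function () {
    if (priorThing) {
      console.log("hi");
    }
  };
  theThing = {
    longStr: new Array(1000000).join('*'),
    someMethod: function () {
      console.log("someMessage");
    }
  };
  priorThing = null; // Clear the slot in the shared lexical environment so the
                     // previous theThing (and its 1MB longStr) can be collected
};
setInterval(replaceThing, 1000);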
Consider this code fragment:
function addClickHandler(element) {
element.click = function onClick(e) {
alert("Clicked the " + element.nodeName)
}
}
Here, onClick has a closure that keeps a reference to element (via element.nodeName). By also assigning onClick to element.click, the circular reference is created; i.e.: element → onClick → element → onClick → element…
Interestingly, even if element is removed from the DOM, the circular self-reference above would prevent element and onClick from being collected, and hence, a memory leak.
JavaScript’s memory management (and, in particular, garbage collection) is largely based on the notion of object reachability.
The following objects are assumed to be reachable and are known as "roots": objects that are globally accessible (such as the global window object in a browser) and objects referenced from the current call stack, that is, the local variables and parameters of the currently executing function and of the functions in its call chain.
Objects are kept in memory at least as long as they are accessible from any of the roots through a reference, or a chain of references.
There is a Garbage Collector in the browser that cleans memory occupied by unreachable objects; in other words, objects will be removed from memory if and only if the GC believes that they are unreachable. Unfortunately, it’s fairly easy to end up with defunct “zombie” objects that are no longer in use but that the GC still thinks are “reachable.”
One of the conveniences in JavaScript is that it will automatically coerce any value being referenced in a boolean context to a boolean value. But there are cases where this can be as confusing as it is convenient. Some of the following, for example, have been known to be troublesome for many a JavaScript developer:
// All of these evaluate to 'true'!
console.log(false == '0');
console.log(null == undefined);
console.log(" \t\r\n" == 0);
console.log('' == 0);
// And these do too!
if ({}) // ...
if ([]) // ...
With regard to the last two, despite being empty (which might lead one to believe that they would evaluate to false), both {} and [] are in fact objects, and any object will be coerced to a boolean value of true in JavaScript, consistent with the ECMA-262 specification.
As these examples demonstrate, the rules of type coercion can sometimes be clear as mud. Accordingly, unless type coercion is explicitly desired, it's typically best to use === and !== (rather than == and !=), so as to avoid any unintended side effects of type coercion. (== and != automatically perform type conversion when comparing two things, whereas === and !== do the same comparison without type conversion.)
And completely as a side point, since we're talking about type coercion and comparisons, it's worth mentioning that comparing NaN with anything (even NaN!) will always return false. You therefore cannot use the equality operators (==, ===, !=, !==) to determine whether a value is NaN or not. Instead, use the built-in global isNaN() function:
console.log(NaN == NaN); // False
console.log(NaN === NaN); // False
console.log(isNaN(NaN)); // True
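A related point worth knowing: the global isNaN() first coerces its argument to a number, so non-numeric strings also report true, whereas the newer Number.isNaN() checks for the actual NaN value without any coercion:
console.log(isNaN("foo"));        // true  -- "foo" coerces to NaN before the check
console.log(Number.isNaN("foo")); // false -- no coercion; "foo" is not the NaN value
console.log(Number.isNaN(NaN));   // true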
JavaScript makes it relatively easy to manipulate the DOM (i.e., add, modify, and remove elements), but does nothing to promote doing so efficiently.
A common example is code that adds a series of DOM Elements one at a time. Adding a DOM element is an expensive operation. Code that adds multiple DOM elements consecutively is inefficient and likely not to work well.
One effective alternative when multiple DOM elements need to be added is to use document fragments instead, thereby improving efficiency and performance.
For example:
var div = document.getElementById("my_div"); // getElementById returns a single element we can append to
var fragment = document.createDocumentFragment();
for (var e = 0; e < elems.length; e++) { // elems previously set to list of elements
fragment.appendChild(elems[e]);
}
div.appendChild(fragment.cloneNode(true));
In addition to the inherently improved efficiency of this approach, creating attached DOM elements is expensive, whereas creating and modifying them while detached and then attaching them yields much better performance.
Incorrect use of function definitions inside for loops is another common mistake. Consider this code:
var elements = document.getElementsByTagName('input');
var n = elements.length; // Assume we have 10 elements for this example
for (var i = 0; i < n; i++) {
elements[i].onclick = function() {
console.log("This is element #" + i);
};
}
Based on the above code, if there were 10 input elements, clicking any of them would display "This is element #10"! This is because, by the time onclick is invoked for any of the elements, the above for loop will have completed and the value of i will already be 10 (for all of them).
Here’s how we can correct the aforementioned problems with JavaScript to achieve the desired behavior:
var elements = document.getElementsByTagName('input');
var n = elements.length; // Assume we have 10 elements for this example
var makeHandler = function(num) { // Outer function
return function() { // Inner function
console.log("This is element #" + num);
};
};
for (var i = 0; i < n; i++) {
elements[i].onclick = makeHandler(i+1);
}
In this revised version of the code, makeHandler is immediately executed each time we pass through the loop, each time receiving the then-current value of i+1 and binding it to a scoped num variable. The outer function returns the inner function (which also uses this scoped num variable) and the element's onclick is set to that inner function. This ensures that each onclick receives and uses the proper i value (via the scoped num variable).
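Alternatively, declaring the loop variable with let gives each iteration its own binding, so the handlers capture the value you would expect; a minimal sketch:
var elements = document.getElementsByTagName('input');
for (let i = 0; i < elements.length; i++) {
  elements[i].onclick = function() {
    console.log("This is element #" + (i + 1)); // each iteration has its own 'i'
  };
}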
A surprisingly high percentage of JavaScript developers fail to fully understand, and therefore to fully leverage, the features of prototypal inheritance.
Here’s a simple example. Consider this code:
BaseObject = function(name) {
if (typeof name !== "undefined") {
this.name = name;
} else {
this.name = 'default'
}
};
Seems fairly straightforward. If you provide a name, use it; otherwise, set the name to 'default'. For instance:
var firstObj = new BaseObject();
var secondObj = new BaseObject('unique');
console.log(firstObj.name); // -> Results in 'default'
console.log(secondObj.name); // -> Results in 'unique'
But what if we were to do this:
delete secondObj.name;
We’d then get:
console.log(secondObj.name); // -> Results in 'undefined'
But wouldn’t it be nicer for this to revert to ‘default’? This can easily be done, if we modify the original code to leverage prototypal inheritance, as follows:
BaseObject = function (name) {
if(typeof name !== "undefined") {
this.name = name;
}
};
BaseObject.prototype.name = 'default';
With this version, BaseObject inherits the name property from its prototype object, where it is set (by default) to 'default'. Thus, if the constructor is called without a name, the name will default to default. And similarly, if the name property is removed from an instance of BaseObject, the prototype chain will then be searched and the name property will be retrieved from the prototype object where its value is still 'default'. So now we get:
var thirdObj = new BaseObject('unique');
console.log(thirdObj.name); // -> Results in 'unique'
delete thirdObj.name;
console.log(thirdObj.name); // -> Results in 'default'
Let’s define a simple object, and create an instance of it, as follows:
var MyObject = function() {}
MyObject.prototype.whoAmI = function() {
console.log(this === window ? "window" : "MyObj");
};
var obj = new MyObject();
Now, for convenience, let's create a reference to the whoAmI method, presumably so we can access it merely by whoAmI() rather than the longer obj.whoAmI():
var whoAmI = obj.whoAmI;
And just to be sure everything looks copacetic, let's print out the value of our new whoAmI variable:
console.log(whoAmI);
Outputs:
function () {
console.log(this === window ? "window" : "MyObj");
}
Okay, cool. Looks fine.
But now, look at the difference when we invoke obj.whoAmI() vs. our convenience reference whoAmI():
obj.whoAmI(); // Outputs "MyObj" (as expected)
whoAmI(); // Outputs "window" (uh-oh!)
What went wrong? When we did the assignment var whoAmI = obj.whoAmI;, the new variable whoAmI was being defined in the global namespace. As a result, its value of this is window, not the obj instance of MyObject!
Thus, if we really need to create a reference to an existing method of an object, we need to be sure to do it within that object's namespace, to preserve the value of this. One way of doing this would be as follows:
var MyObject = function() {}
MyObject.prototype.whoAmI = function() {
console.log(this === window ? "window" : "MyObj");
};
var obj = new MyObject();
obj.w = obj.whoAmI; // Still in the obj namespace
obj.whoAmI(); // Outputs "MyObj" (as expected)
obj.w(); // Outputs "MyObj" (as expected)
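Another option, if you do want a standalone reference, is to bind the method to the instance explicitly:
var whoAmIBound = obj.whoAmI.bind(obj); // 'this' is permanently bound to obj
whoAmIBound(); // Outputs "MyObj" (as expected)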
Another common problem involves providing a string as the first argument to setTimeout or setInterval. For starters, let's be clear on something here: doing so is not itself a mistake per se. It is perfectly legitimate JavaScript code. The issue here is more one of performance and efficiency. What is rarely explained is that if you pass in a string as the first argument to setTimeout or setInterval, it will be passed to the function constructor to be converted into a new function. This process can be slow and inefficient, and is rarely necessary.
The alternative to passing a string as the first argument to these methods is to instead pass in a function. Let’s take a look at an example.
Here, then, would be a fairly typical use of setInterval and setTimeout, passing a string as the first parameter:
setInterval("logTime()", 1000);
setTimeout("logMessage('" + msgValue + "')", 1000);
The better choice would be to pass in a function as the initial argument; e.g.:
setInterval(logTime, 1000); // Passing the logTime function to setInterval
setTimeout(function() { // Passing an anonymous function to setTimeout
logMessage(msgValue); // (msgValue is still accessible in this scope)
}, 1000);
As explained in our JavaScript Hiring Guide, "strict mode" (i.e., including 'use strict'; at the beginning of your JavaScript source files) is a way to voluntarily enforce stricter parsing and error handling on your JavaScript code at runtime, as well as making it more secure.
While, admittedly, failing to use strict mode is not a “mistake” per se, its use is increasingly being encouraged and its omission is increasingly becoming considered bad form.
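As a quick illustration of the kind of bug strict mode catches (the function and variable names below are just for illustration), assigning to an undeclared variable silently creates a global in non-strict code, but throws in strict mode:
'use strict';
function setDone() {
  doneFlag = true; // ReferenceError in strict mode: 'doneFlag' was never declared
}                  // (without strict mode, this would silently create a global)
setDone();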
Here are some key benefits of strict mode:
Eliminates this coercion. Without strict mode, a reference to a this value of null or undefined is automatically coerced to the global object. This can cause many frustrating bugs. In strict mode, referencing a this value of null or undefined throws an error.
Disallows duplicate property names or parameter values. Strict mode throws an error when it detects a duplicate named property in an object (e.g., var object = {foo: "bar", foo: "baz"};) or a duplicate named argument for a function (e.g., function foo(val1, val2, val1){}), thereby catching what is almost certainly a bug in your code that you might otherwise have wasted lots of time tracking down.
Makes eval() safer. There are some differences in the way eval() behaves in strict mode and in nonstrict mode. Most significantly, in strict mode, variables and functions declared inside an eval() statement are not created in the containing scope. (They are created in the containing scope in nonstrict mode, which can also be a common source of problems with JavaScript.)
Throws errors on invalid use of delete. The delete operator (used to remove properties from objects) cannot be used on nonconfigurable properties of the object. Nonstrict code will fail silently when an attempt is made to delete a nonconfigurable property, whereas strict mode will throw an error in such a case.
As is true with any technology, the better you understand why and how JavaScript works and doesn't work, the more solid your code will be and the more you'll be able to effectively harness the true power of the language. Conversely, lack of proper understanding of JavaScript paradigms and concepts is indeed where many JavaScript problems lie.
Thoroughly familiarizing yourself with the language’s nuances and subtleties is the most effective strategy for improving your proficiency and increasing your productivity. Avoiding many common JavaScript mistakes will help when your JavaScript is not working.
Original article source at: https://www.toptal.com/
1669876946
Today, the Internet of Things (IoT) is a flourishing and rapidly developing industry, with IoT technology penetrating nearly every aspect of our society. More and more companies are getting interested in IoT benefits, with new innovations being introduced every year. And while experts predicted the industry's expansion years ago, IoT as a field still has a lot of room for growth and numerous opportunities ahead of it.
According to a 2015 report from the McKinsey Global Institute, the potential worldwide economic impact of IoT could reach up to $11.1 trillion annually by 2025 if the right conditions occur (policies, implementations, etc.). According to research conducted by McKinsey & Company in 2019, IoT growth varies by field, but the progress is undeniable and there are positive predictions for the near future.
There are already quite a few established IoT companies from all over the world offering services to those interested in such technologies. In this article, we will look at the top IoT companies right now. First, we will explain how we chose the top IoT companies on our list. Then, we will present these top 40 companies. At the end of the article, you will find answers to the most common questions regarding IoT companies.
To compile this list of top IoT companies, we considered a variety of factors. There is not one company that can be considered a single leader in the field, so you need to think of your goals and priorities when choosing which company to work with. When compiling our list of top IoT companies, we took into account:
Without further ado, here are the top IoT companies worldwide:
Source: https://sumatosoft.com/
SumatoSoft is one of the top IoT companies, offering a range of IoT software development services for different needs and situations. The company builds custom software solutions for businesses of different sizes: it builds MVPs for startups but also takes on complex enterprise applications that require well-conceived business analysis.
SumatoSoft offers industry-focused IoT solutions for healthcare, retail, manufacturing, smart homes & cities, and automotive domains. These IoT solutions include remote patient monitoring, warehouse automation, fleet management, robotics, smart traffic lights, and more. Every solution SumatoSoft builds comes with great security and scalability for future changes in terms of new features, fleet expansion, new users, and increased workload.
SumatoSoft IoT services include:
The SumatoSoft team has built 150 custom software solutions for clients in 27 countries across 11 industries. After more than 10 years on the market, the company has become a reliable technical partner to its clients, demonstrating a 98% client satisfaction rate with the quality of services they provide.
Key company characteristics:
Founding year: 2012
Size of company: 50-99
Location of company: Boston, United States
Source: https://rightinformation.com/
Right Information is one of the top IoT companies with over two decades of experience. Unlike many other IoT companies that provide resources only, Right Information offers services that span the full cycle of IoT product and service development. The company focuses on the end result and utilizes the latest engineering and data science practices to create some of the best solutions in the field. This is precisely why it is so valued by its past customers and is still rapidly expanding into new markets.
Right Information is a company that partners with brands and startups globally, so it doesn’t matter where your business is located. The company’s experts can provide consultations at the beginning stages of development for clients who don’t have prior experience with IoT. This allows the company’s team to find the best approach to a specific problem the client wants to solve by implementing an IoT solution. Over the years, Right Information has partnered with companies from industries ranging from automotive to biotechnology.
Key company characteristics:
Founding year: 2001
Size of company: 10-49
Location of company: Wroclaw, Poland
Source: https://www.apptomate.co/
apptomate is a highly experienced and established company based in three countries (USA, Germany, and India). In addition to offering IoT services, the company also works on a wide variety of projects that involve other technologies. You may even have a project that requires a combination of these: AI, machine learning, RPA, DevOps, hybrid mobile apps, enterprise apps, Magento 2 solutions, and more. In other words, it is the ideal company for those who want to develop multiple products or services.
apptomate has a history of working on different e-commerce projects as well as projects that involve cloud applications, digital marketing, and quality assurance and testing among other things. The company also offers consulting services for those who are uncertain about what they need to get developed in their specific situation. First and foremost, apptomate focuses on user experience which enables the company to create IoT products and services that serve the needs of their client’s target audience.
Key company characteristics:
Founding year: 2009
Size of company: 100-249
Location of company: San Pablo, United States
Source: https://www.verypossible.com/
Very is an AI and IoT design company that provides services for building smart manufacturing, consumer electronics, and connected health and wellness systems. The company is known for delivering the finished product in a timely fashion and has a reputation for providing a particularly good customer experience. Their past clients say that the company is easy to work with and the team is always ready to change things on the go at the client’s request, handling projects with both efficiency and agility.
Very offers IoT services that include product design, mobile app development, hardware engineering, software development, machine learning, and support and maintenance. The company appeared on Inc. magazine’s annual 5000 list of the fastest-growing US private companies five years in a row and was certified by Great Places to Work in 2021. It rightfully takes its spot as one of the best IoT companies with a proven track record of excellence.
Key company characteristics:
Founding year: 2011
Size of company: 50-99 employees
Location of company: Bozeman, United States
Source: https://intellias.com/
Intellias is an IoT company with many years of experience that has worked with some of the biggest companies in the world such as KIA, LG, Siemens, and HelloFresh. The company partners with startups and businesses across the globe to deliver high-quality end products and services based on the specific needs of each client. The services they offer include digital consulting, engineering, emerging technology, and operations. Similarly to some other IoT companies, they focus on helping businesses make the digital transition.
Because the company has been around for a long time, it has been continuously adopting new practices and adding more services to the ones they already offer. This is why they even offer services for developing blockchain-based solutions. You can also involve other specialists for your project when working with them by hiring professional writers from the writing services reviews site Trust My Paper. Other technologies they work with include cybersecurity, cloud services, DevOps, RPA, experience design, and more. And the industries they focus on are automotive, transportation, finances, agriculture, and media.
Key company characteristics:
Founding year: 2002
Size of company: 1000+
Location of company: Lviv, Ukraine
Source: https://www.iomico.com/
iomico is one of the top IoT companies that works with technologies such as machine learning, edge AI, and computer vision. The company has experience with advanced networks that include Wi-Fi, Bluetooth, Ethernet, GSM, and others. Though the company’s team is not big, they house some of the best professionals that work at every stage of IoT development. They work with clients to create IoT solutions from scratch and guide their customers through the entire process.
The three main areas of focus that iomico operates in are electronics, programming, and industrial design. Within these areas, they develop all kinds of solutions and utilize a wide variety of technologies, including thermal analysis, signal integrity analysis, ASIC engineering, PCB engineering, embedded Android development, media streaming development, image/video processing, and many more. The company’s track record is quite good with many happy clients that keep coming back for more.
Key company characteristics:
Founding year: 2012
Size of company: 10-49
Location of company: Sunnyvale, United States
Source: https://10pearls.com/
10Pearls is an IoT company focused on innovation and creativity. The company’s goal is to anticipate IoT development trends in the tech world and enable businesses to adopt the latest technologies to accommodate their needs. 10Pearls develops solutions that include software and applications for mobile and web use as well as offers services in development, user experience, quality assurance, security, and DevOps and SecOps among others. Over the years, the company has firmly established itself as a leader in its field.
10Pearls has been recognized and awarded by the likes of Gartner, Forrester, Inc., and others. In fact, Inc. magazine included it on its 5000 list of fastest-growing US private companies four years in a row. 10Pearls is also notable for working with some of the latest and emerging technologies, including AI, voice and language processing, AR and VR, and chatbots. The company also works with a variety of platforms, including Salesforce, Sitecore, SharePoint, ServiceNow, Micro Focus, and more.
Key company characteristics:
Founding year: 2004
Size of company: 250-999
Location of company: Vienna, United States
Source: https://oril.co/
Oril is one of the top IoT companies that closely follows tech and upcoming trends of IoT to deliver high-quality products and services that are in line with the most recent developments in the industry. While the company is younger than most leaders in the field, it has already worked on quite a few big projects and is rapidly expanding every year. Right now, it specializes in industries such as fintech, health and fitness, and automotive, but it plans to expand to other fields too.
Oril’s main goal is to create impactful solutions for its clients. The company’s team is very diverse and includes professionals from different backgrounds and with different experiences. This enables their specialists to solve client problems with proficiency and efficiency. You can also hire your own writers from the custom writing reviews site Best Essays Education to work with you and Oril’s team on your project. Oril’s services include user experience, product development, and digital transformation along with IoT development. Additionally, they also work on development projects for companies from the PropTech field.
Key company characteristics:
Founding year: 2015
Size of company: 10-49
Location of company: Brooklyn, United States
Source: https://www.convrtx.com/
ConvrtX is one of the leading IoT companies working with enterprises and businesses from different industries. The company prioritizes its clients’ brands and helps them build a stronger image through digital solutions. At the same time, ConvrtX is focused on providing high-quality customer service which is why the company is known for its outstanding reputation. Their communication is truly unparalleled and adds to the fact that the company is regularly delivering high-quality solutions that meet its clients’ needs and demands.
In other words, ConvrtX is focused on providing a personalized experience to each business it works with making it ideal for organizations that have unusual or very specific project requirements. After delivering the final product or service, ConvrtX also continues providing support to its clients and even ensuring that they get the latest updates for their solutions. The company’s holistic approach has earned it a number of awards and a good name in the IoT industry.
Key company characteristics:
Founding year: 2015
Size of company: 100-249
Location of company: Toronto, Canada
Source: https://pharaohsoft.com/
Pharaoh Soft is another leader in the IoT field that helps companies of all sizes to transfer their processes to the digital environment and continue operating even more efficiently. The company is known for working on projects that require agility and scalability while also providing high-quality end solutions. Pharaoh Soft is one of the youngest companies on this list, but it has grown rapidly over the years and will likely continue doing so in the nearest future.
Pharaoh Soft offers a wide variety of services related to e-commerce, staff augmentation, business intelligence, product design, mobile app development, and many more. Their past clients range from global companies to small businesses from all kinds of industries. Pharaoh Soft has also received a number of awards and certifications over the years with experts praising the company’s approach to IoT and other technologies they work with.
Key company characteristics:
Founding year: 2014
Size of company: 10-49
Location of company: Cairo, Egypt
#11 Cogniteq
#12 AJProTech
#13 Integra Sources
#14 rhipe
#15 Gritstone Technologies
#16 NIX United
#17 Proxi.cloud
#18 Neoito
#19 tecblic
#20 Ficode Technologies
#21 SoluLab
#22 Quality Wolves
#23 HQSoftware
#24 Prompt Softech
#25 Eastern Peak
#26 vCloud Tech
#27 EurecaApps
#28 TechAvidus
#29 Softqube Technologies
#30 Djangostars
#31 BEETSOFT
#32 DO OK
#33 Zehntech Technologies
#34 EMed HealthTech Pvt Ltd
#35 INEXTURE Solutions LLP
#36 Kinetica Systems
#37 XME.digital solutions
#38 ThinkPalm Technologies Pvt Ltd
#39 Oodles AI – AI Development Company
#40 KOMPANIONS
There are many successful IoT companies working all over the world right now. Some of these are fairly young but have already established themselves as leaders in their field. This list of top IoT companies will help you find the right one for your own needs.
We at SumatoSoft provide IoT services of our own and have been helping businesses in different industries to realize their ambitions for 10 years now. Don’t hesitate to check out our portfolio and get in touch with us!
There is no one company that is currently leading in IoT. In this article, we presented the top IoT companies from all over the world that are offering a variety of services. Depending on your needs and where you are based, you may prefer one company over the other.
There are different types of IoT devices (e.g. consumer IoT, military IoT, etc.), so the most popular device will vary depending on the category of IoT you are interested in. Some of the most widely used consumer IoT devices are smart home tech (e.g. voice controllers), home security systems, and others.
We listed the top IoT companies worldwide, but when it comes to Europe, the top companies are SumatoSoft, Right Information, Intellias, Cogniteq, Proxi.cloud, Ficode Technologies, HQSoftware, Eastern Peak, EurecaApps, Djangostars, DO OK, and others.
1667998168
Global retail eCommerce sales have grown consistently, without a downturn, over the last decade. Online sales, which reached $5.2 trillion worldwide in 2021, are expected to increase to $8.1 trillion by 2026, a rise of roughly 56%. Similar positive growth can be seen in the revenue that online stores are generating.
These eye-catching figures indicate why eCommerce matters and are compelling retailers to accelerate the shift to eCommerce store development and capture the financial gains that an online venture can multiply. The gains are attractive, but starting an eCommerce store is no easy feat. A lot more goes into a successful eCommerce business, from getting the fundamentals right to paving the road to success. You will find extensive details on eCommerce store development in an eCommerce app development guide.
This guide suggests everything is done seamlessly if you embrace a robust eCommerce platform that brings winning results to the table. With multiple choices available for an eCommerce platform, it’s tricky to choose the best fit as the excellent features of the eCommerce platform can either make or break your e-commerce dream.
Still not persuaded that the right eCommerce platform makes a real difference to the investment required and the profits it brings in return? Here's the answer:
An online store is the backbone of successful businesses today. If you do not have an e-commerce platform, you are losing a big chunk of business to your competitors. Back in the day, businesses had to develop their own platforms. That was time-consuming and costly and required dedicated development teams, and scalability and integration with the other software a business used were constant issues.
The advent of e-commerce platforms has revolutionized the digital landscape and made running an e-commerce business accessible to all for a fraction of the former cost. A platform eliminates the need to hire a team of eCommerce pros to build a store from scratch, which is a time-expensive process. It also takes care of maintenance, infrastructure, and other add-ons that would otherwise keep retailers from focusing on core business activities.
eCommerce platforms let retailers operate with peace of mind, because the platform vendor handles all of this. Retailers can get an eCommerce marketplace built and focus on how to fuel business growth. If you want to dig deeper into how to build a multi-million marketplace business model, the guide will help. Even so, the importance of selecting the best e-commerce platform for your specific business requirements cannot be ignored.
At this point, you know how to start a retail online store but may be unaware of the features that keep the store at the top of customers' minds. With thousands of eCommerce platforms out there, choosing the best one is confusing. What metrics do you use to compare different eCommerce platforms intelligently? Which features should you consider? Is the pricing right? And the biggest question on every e-tailer's mind: which is the best eCommerce platform available today?
For example, a value-added IT reseller will need features that a fashion eCommerce website may not. However, some features are an absolute must-have for running a successful eCommerce business. Ensure that the e-commerce software you choose does not skimp on these 11 features, and compare them before choosing a platform.
A strong catalog, fair prices, multiple shipping options, and a multitude of payment gateways make checkout a breeze. Once the order is placed, customers wait eagerly for the product they ordered, and they grow impatient when delivery is delayed. The solution is an order tracking feature that lets customers follow the order from the moment the seller confirms it.
A link to track the order at every step is sent to customers in the order shipment email, from where they can keep tabs on it. The reason for any unexpected delay is also noted, which helps customers stay patient while waiting for their order. eCommerce development companies never forget to integrate this feature.
Push notifications are a powerful weapon: they gain the upper hand over generic messages sent to customers, which are most often ignored or left unread. Push notifications are tailored to every individual customer's demographics, browsing pattern, and buying history, which increases the chances that they are read and acted on.
When they are sent at the right time, after tracking a user's specific eCommerce activity, conversion rates and ROI increase by a large margin. For instance, sending a discount or about-to-sell-out notification for a product the user is browsing nudges them to shop and encourages impulse buying.
Select an eCommerce platform that enables push notification API integration to increase the store's sales and profits.
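To make the idea concrete, here is a minimal Python sketch of the trigger logic described above: sending a personalized low-stock notification for a product a customer has been browsing. The PushClient object, its send method, and the field names are hypothetical placeholders for whatever push notification API your platform exposes.

LOW_STOCK_THRESHOLD = 5

def notify_if_low_stock(push_client, customer, product):
    """Send a tailored push when a browsed product is about to sell out."""
    if product["stock"] <= LOW_STOCK_THRESHOLD:
        push_client.send(                      # hypothetical push API call
            device_token=customer["device_token"],
            title=f"Only {product['stock']} left!",
            body=f"The {product['name']} you viewed is almost sold out.",
        )

# Usage (with a hypothetical client object):
# notify_if_low_stock(PushClient(api_key="..."), customer, product)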
Chatbots are AI-powered assistants that are trained in tracking user activities, connecting the dots, and then providing a tailored response. Mostly, eCommerce stores are implementing them to enhance customer support services that improve response rates and customer satisfaction.
Going a step further, they resolve queries 24/7, 365 days a year, well beyond office hours. They work wonderfully, resolving around 80% of customer queries related to order tracking, refunds, payment issues, and more. Chatbots deployed on messaging platforms engage customers at scale, since their open rate is about 98%. The bots can also send offers or reminders for abandoned shopping carts, which boosts sales.
Check the eCommerce platform's feature checklist to see whether chatbot integration, which has become a need of the hour, is provided.
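As a rough illustration, the sketch below shows a keyword-based intent router in Python for the query types mentioned above (order tracking, refunds, payments). A production chatbot would use an NLP or intent-recognition service; the keywords and canned responses here are purely illustrative.

INTENT_KEYWORDS = {
    "order_tracking": ["track", "where is my order", "shipment", "delivery"],
    "refund": ["refund", "money back", "return"],
    "payment": ["payment failed", "charged twice", "card declined"],
}

RESPONSES = {
    "order_tracking": "You can track your order with the link in your shipment email.",
    "refund": "Refunds are processed after the returned item is received.",
    "payment": "Sorry about that! Please share your order ID and we'll check the payment.",
    "fallback": "Let me connect you with a human agent.",
}

def answer(message: str) -> str:
    """Return a canned response for the first matching intent."""
    text = message.lower()
    for intent, keywords in INTENT_KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            return RESPONSES[intent]
    return RESPONSES["fallback"]

print(answer("Where is my order? I placed it last week."))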
More than 40% of traffic coming to an eCommerce store is from mobile devices, which makes it essential to optimize the store for mobile. It helps the store gain traction with users who prefer browsing and buying on mobile devices. Making the eCommerce store mobile-responsive also eliminates the need to manage marketing campaigns, product information, catalogs, listings, and so on separately for different screen sizes, reducing time, cost, and resource investment.
Responsive design ensures the store looks consistent everywhere and delivers a great experience to users regardless of the device they are using. Go with a platform that supports responsive themes and templates that can be customized to your eCommerce design needs.
Customers are already reluctant to buy from online stores because of perceived risks of financial data being stolen or lost. Assuring customers that their sensitive data is safe when they transact with the store is important. The platform must enable end-to-end encryption, integrate SSL certificates, take data backups, ensure PCI DSS compliance, and implement fraud prevention techniques, followed by security audits at regular intervals, so that eCommerce transactions and stored data remain safe.
Consider a platform that provides two-factor authentication and security plugin integration so that no one can put a dent in the website's security.
No business can guarantee a particular degree of growth within a specific amount of time. When the eCommerce store does grow, it requires additional support from the platform as user preferences, market trends, and technologies evolve. The platform must support expansion, including customizing APIs, extending storage, and automating processes as business needs change.
A platform that facilitates customization and scalability ensures the store keeps performing with no downtime. Also, be clear about the fees the vendor will charge for increased storage and additional feature integrations.
The product catalog is the Holy Grail of your e-commerce business. Today's consumers demand a consistent, rich online buying experience, and a robust catalog will offer them that and ensure they return to your store. Your product catalog should complement the requirements of your sector. For example, if you sell IT products, your catalog should let you bundle products, up-sell, and cross-sell. For fashion products, your catalog should offer customers different sizes and colors.
Another important feature of a powerful product catalog is its rich content. The product images, description and attributes should be consistent. This allows you to easily offer an omnichannel experience to your customers.
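A simple way to picture such a catalog is as a data model with variants (for size and color) and bundles (for up-selling). The Python sketch below is illustrative only; the field names are assumptions, not any particular platform's schema.

from dataclasses import dataclass, field

@dataclass
class Variant:
    sku: str
    size: str
    colour: str
    price: float

@dataclass
class Product:
    name: str
    base_price: float
    description: str = ""
    images: list[str] = field(default_factory=list)
    variants: list[Variant] = field(default_factory=list)

@dataclass
class Bundle:
    name: str
    products: list[Product]
    discount_pct: float          # incentive that powers the up-sell / cross-sell

    def price(self) -> float:
        base = sum(p.base_price for p in self.products)
        return round(base * (1 - self.discount_pct / 100), 2)

# Example: a fashion product with variants and an IT bundle
shirt = Product("Linen shirt", 39.0, variants=[Variant("SH-S-BLU", "S", "blue", 39.0)])
office_kit = Bundle("Starter office kit", [Product("Laptop", 900.0), Product("Wireless mouse", 25.0)], discount_pct=10)
print(office_kit.price())   # (900 + 25) * 0.9 = 832.5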
In today’s dynamic and competitive e-commerce market, prices are never set in stone. Customers are ALWAYS looking for deals and bargains. Running promotions is a great way of attracting customers. You will definitely run promotions, sales and promo codes, so it’s better to choose an e-commerce platform that allows you to do so easily and automates the calculations.
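For a sense of what "automating the calculations" means, here is a minimal Python sketch that applies a storewide sale plus a promo code to a cart subtotal. The codes, rates, and shipping fee are made-up examples, not a real platform's pricing rules.

PROMO_CODES = {"WELCOME10": 0.10, "FREESHIP": 0.0}   # code -> extra discount rate
FREE_SHIPPING_CODES = {"FREESHIP"}
SALE_DISCOUNT = 0.20                                  # 20% storewide sale
SHIPPING_FEE = 4.99

def checkout_total(cart_subtotal: float, promo_code: str = "") -> float:
    """Apply the sale, then any promo code, then shipping."""
    total = cart_subtotal * (1 - SALE_DISCOUNT)
    if promo_code in PROMO_CODES:
        total *= (1 - PROMO_CODES[promo_code])
    shipping = 0.0 if promo_code in FREE_SHIPPING_CODES else SHIPPING_FEE
    return round(total + shipping, 2)

print(checkout_total(100.0, "WELCOME10"))   # 100 -> 80 -> 72, plus 4.99 shipping = 76.99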
Customers usually have shipping carrier preferences depending on delivery time, packaging and handling etc. A lot of cart abandonments occur because customers don’t find their preferred carrier.
While it may not be possible to offer every shipping carrier under the sun, it is advisable to offer at least a couple of major carriers, so customers feel in control of their orders.
Another reason for cart abandonment at checkout is the unavailability of familiar payment gateways. Online shoppers usually prefer one specific payment type: some have loyalty rewards programs, and some may trust only locally popular payment options. If you are a global entity, you MUST offer the local payment options consumers trust and know. Do not lose a confirmed sale because of a lack of payment options. E-commerce software that offers multiple payment gateways according to your geographical location will be a big asset to your e-commerce business.
Not paying taxes is a legal offense, and if you are a global entity it is best if your e-commerce platform can automate tax calculations for you. Failure to do so can lead to legal repercussions in different countries, and we're guessing you do not want that! The best e-commerce platforms out there offer automated tax and accounting calculations.
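A bare-bones version of such automation is a lookup of the tax rate for the buyer's region, as in the Python sketch below. The rates shown are illustrative placeholders, not tax advice; a real platform keeps them up to date per jurisdiction.

TAX_RATES = {        # region code -> VAT / sales tax rate (illustrative values)
    "DE": 0.19,
    "GB": 0.20,
    "US-CA": 0.0725,
}

def price_with_tax(net_price: float, region: str) -> float:
    """Return the gross price for the buyer's region, or fail loudly if unknown."""
    rate = TAX_RATES.get(region)
    if rate is None:
        raise ValueError(f"No tax rate configured for region {region!r}")
    return round(net_price * (1 + rate), 2)

print(price_with_tax(50.0, "DE"))   # 50 * 1.19 = 59.5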
Millennials trust user-generated content 50% more than any other type of content. We live in a millennial world, and we need to adapt. Consumers today check online reviews before making any decision, be it buying a product, choosing a restaurant, or planning a holiday. You will lose out heavily if your e-commerce platform does not allow users to rate you or leave testimonials, reviews, or comments.
Facebook and Instagram are where the world’s biggest audience is. A good e-commerce platform will allow seamless social media integrations to give your customers a truly omnichannel buying experience. This is again where a solid and consistent product catalog is of immense importance.
Google Merchant allows your products to be listed in Google Shopping results. Google Shopping generates high traffic, and as an e-commerce business, you want that! More and more consumers use Google to search for products, and you want yours to show up in those results. Choose an e-commerce platform that integrates well with Google Merchant.
Conclusion
eCommerce is all about how your store appeals to visitors and how it interacts with them. That comes down to a wise decision made after evaluating all your business needs and your customers' expectations. Choose the platform on the back of a committed analysis; the right decision will always guide a business toward achievement and success.
If you are a little unsure about eCommerce platform selection, get in touch with eCommerce development companies that have teams of business consultants and technical analysts who can brainstorm your project needs and then suggest the platform that best meets them.
You can also consider the aforementioned tips to pave the way for e-commerce success. Choose wisely and compare all the e-commerce platforms before deciding.
1667904337
As demand grows, the number of machine learning libraries available in Python continues to rise, with new libraries getting added each month. With this in mind, we’ve compiled a list of the 10 best Python libraries for machine learning that you can use today, with more on the horizon for 2022. This collection of powerful tools covers everything from data visualization to natural language processing (NLP) to anomaly detection and prediction, making it easier than ever before to build your own machine learning model with Python.
NumPy is a library for scientific computing, and it's one of the most commonly used libraries among machine learning practitioners. It provides powerful features, such as multi-dimensional arrays and matrices, for data processing and statistical analysis. These features are essential to machine learning with large datasets. NumPy also provides a variety of functions that allow users to efficiently perform operations on these arrays, including linear algebra operations like matrix multiplication.
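A quick example of the multi-dimensional arrays and linear algebra operations mentioned above:

import numpy as np

data = np.array([[1.0, 2.0], [3.0, 4.0]])   # 2x2 matrix
weights = np.array([[0.5], [0.25]])         # 2x1 column vector

product = data @ weights                    # matrix multiplication
print(product)                              # [[1. ] [2.5]]
print(data.mean(axis=0))                    # column means: [2. 3.]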
The scikit-learn library is one of the most popular machine learning libraries for Python. It has many algorithms, including support for classification and regression. If you're just getting started with machine learning and want a library that can do both classification and regression, scikit-learn is an excellent choice.
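A short example showing both classification and regression with scikit-learn's built-in estimators:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import train_test_split

# Classification on the built-in iris dataset
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("classification accuracy:", clf.score(X_test, y_test))

# Regression on a trivial linear relationship y = 3x + 2
X_reg = np.arange(10).reshape(-1, 1)
y_reg = 3 * X_reg.ravel() + 2
reg = LinearRegression().fit(X_reg, y_reg)
print("learned slope:", reg.coef_[0])   # ~3.0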
SciPy is a collection of libraries for scientific computing. It includes modules for data analysis, optimization, integration and visualization. SciPy is written in Python and makes use of NumPy arrays, matplotlib plots and other state-of-the-art numerical computing techniques.
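A small example of SciPy's optimization and integration modules:

import numpy as np
from scipy import integrate, optimize

# Minimize f(x) = (x - 3)^2; the minimum is at x = 3
result = optimize.minimize(lambda x: (x[0] - 3) ** 2, x0=[0.0])
print("minimum at:", result.x)

# Integrate sin(x) from 0 to pi; the exact answer is 2
value, error = integrate.quad(np.sin, 0, np.pi)
print("integral:", value)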
TensorFlow is a machine learning library designed to make it easier to create neural networks. It was originally developed by researchers and engineers working on the Google Brain Team within Google's AI organization.
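A tiny example of TensorFlow's core building blocks, tensors and automatic differentiation, which underpin its neural-network tooling:

import tensorflow as tf

x = tf.Variable(3.0)
with tf.GradientTape() as tape:
    y = x ** 2 + 2 * x           # y = x^2 + 2x
grad = tape.gradient(y, x)        # dy/dx = 2x + 2 = 8 at x = 3
print(grad.numpy())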
Pandas is one of the most popular Python libraries used by data scientists and analysts. It provides data structures and operations for manipulating numerical tables. This library also supports reading and writing to relational databases, working with missing values, as well as statistical functions like descriptive statistics, estimation, hypothesis testing, and more.
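A brief example of building a DataFrame, filling a missing value, and computing descriptive statistics:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "category": ["toys", "toys", "books", "books"],
    "units_sold": [120, 95, np.nan, 60],
    "price": [9.99, 14.50, 4.25, 7.00],
})

# Replace the missing value with the column mean
df["units_sold"] = df["units_sold"].fillna(df["units_sold"].mean())

print(df.describe())                          # descriptive statistics
print(df.groupby("category")["price"].mean()) # average price per category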
Keras is a high-level neural networks API, written in Python and capable of running on top of either TensorFlow or Theano. It was developed with a focus on enabling fast experimentation. Being able to go from idea to result with the least possible delay is key to doing good research. This library provides simple abstractions for creating deep learning models, without sacrificing flexibility. For example, it allows users to mix training routines from different libraries, choose pre-made operations or create their own custom ops.
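A minimal example of defining and training a small network with the Keras Sequential API (shown here via the tf.keras implementation, on random placeholder data just to demonstrate the workflow):

import numpy as np
from tensorflow import keras

model = keras.Sequential([
    keras.layers.Dense(16, activation="relu", input_shape=(4,)),
    keras.layers.Dense(3, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Random placeholder data: 100 samples, 4 features, 3 classes
X = np.random.rand(100, 4)
y = np.random.randint(0, 3, size=100)
model.fit(X, y, epochs=3, verbose=0)
print(model.predict(X[:1]))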
Theano is a Python library that allows developers to define, optimize, and evaluate mathematical expressions involving multi-dimensional arrays efficiently. It can be used to perform various tasks such as linear algebra, calculus, probability theory, and more.
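A tiny example of defining, compiling, and evaluating a symbolic expression with Theano (note that Theano itself is no longer actively developed):

import theano
import theano.tensor as T

x = T.dscalar("x")
y = T.dscalar("y")
f = theano.function([x, y], x ** 2 + y)   # compile f(x, y) = x^2 + y

print(f(3.0, 1.0))   # 10.0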
PyTorch is an open-source machine learning library. PyTorch provides a foundation in deep learning, enables fast prototyping, and has been used to build research systems.
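A short example of PyTorch tensors, a linear layer, and autograd:

import torch
import torch.nn as nn

x = torch.randn(8, 4)            # batch of 8 samples with 4 features
model = nn.Linear(4, 2)          # simple linear layer
out = model(x)

loss = out.pow(2).mean()         # dummy loss for illustration
loss.backward()                  # autograd computes gradients
print(model.weight.grad.shape)   # torch.Size([2, 4])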
Gensim is a Python library that can be used to perform natural language processing tasks. It is designed to work well with large text corpora like Wikipedia, but it also works well with other forms of textual data. Gensim makes use of Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA) to extract semantic topics from text.
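A small example that extracts two LDA topics from a toy corpus with Gensim:

from gensim import corpora, models

texts = [
    ["cpu", "gpu", "memory", "benchmark"],
    ["gpu", "training", "model", "benchmark"],
    ["recipe", "flour", "oven", "baking"],
    ["oven", "baking", "cake", "flour"],
]

dictionary = corpora.Dictionary(texts)               # word <-> id mapping
corpus = [dictionary.doc2bow(text) for text in texts]

lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10)
for topic_id, words in lda.print_topics():
    print(topic_id, words)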
NLTK is a machine learning library that specializes in natural language processing. NLTK includes a suite of text processing programs, libraries, and modules that provide easy access to the most common natural language processing tasks. These include tokenizing texts into words, sentences, and paragraphs; removing common English stopwords like the, and, or a; stemming words by reducing them to their base form (for example, sports becomes sport); and even annotating texts with part-of-speech tags.
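A quick example covering the tasks described above: tokenization, stopword removal, stemming, and part-of-speech tagging (the resource downloads are only needed once):

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download("punkt")
nltk.download("stopwords")
nltk.download("averaged_perceptron_tagger")

text = "The players were running faster than the other sports teams."
tokens = nltk.word_tokenize(text)

stop_words = set(stopwords.words("english"))
filtered = [t for t in tokens if t.lower() not in stop_words]

stemmer = PorterStemmer()
stems = [stemmer.stem(t) for t in filtered]   # e.g. "running" -> "run", "sports" -> "sport"

print(filtered)
print(stems)
print(nltk.pos_tag(tokens))                   # part-of-speech tags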