In this openSUSE Leap 15.3 review video, we take a look at this distro's features, specifications, and interface.
The GNOME 3.34.7 desktop is being used in this video.
Enjoy the video!
#opensuse #developer
msgpack.php
A pure PHP implementation of the MessagePack serialization format.
The recommended way to install the library is through Composer:
composer require rybakit/msgpack
To pack values you can either use an instance of a Packer:
$packer = new Packer();
$packed = $packer->pack($value);
or call a static method on the MessagePack class:
$packed = MessagePack::pack($value);
In the examples above, the method pack automatically packs a value depending on its type. However, not all PHP types can be uniquely translated to MessagePack types. For example, the MessagePack format defines map and array types, which are represented by a single array type in PHP. By default, the packer will pack a PHP array as a MessagePack array if it has sequential numeric keys starting from 0, and as a MessagePack map otherwise:
$mpArr1 = $packer->pack([1, 2]); // MP array [1, 2]
$mpArr2 = $packer->pack([0 => 1, 1 => 2]); // MP array [1, 2]
$mpMap1 = $packer->pack([0 => 1, 2 => 3]); // MP map {0: 1, 2: 3}
$mpMap2 = $packer->pack([1 => 2, 2 => 3]); // MP map {1: 2, 2: 3}
$mpMap3 = $packer->pack(['a' => 1, 'b' => 2]); // MP map {a: 1, b: 2}
However, sometimes you need to pack a sequential array as a MessagePack map. To do this, use the packMap method:
$mpMap = $packer->packMap([1, 2]); // {0: 1, 1: 2}
Here is a list of type-specific packing methods:
$packer->packNil(); // MP nil
$packer->packBool(true); // MP bool
$packer->packInt(42); // MP int
$packer->packFloat(M_PI); // MP float (32 or 64)
$packer->packFloat32(M_PI); // MP float 32
$packer->packFloat64(M_PI); // MP float 64
$packer->packStr('foo'); // MP str
$packer->packBin("\x80"); // MP bin
$packer->packArray([1, 2]); // MP array
$packer->packMap(['a' => 1]); // MP map
$packer->packExt(1, "\xaa"); // MP ext
Check the "Custom types" section below on how to pack custom types.
The Packer object supports a number of bitmask-based options for fine-tuning the packing process (the defaults are DETECT_STR_BIN, DETECT_ARR_MAP and FORCE_FLOAT64):
Name | Description |
---|---|
FORCE_STR | Forces PHP strings to be packed as MessagePack UTF-8 strings |
FORCE_BIN | Forces PHP strings to be packed as MessagePack binary data |
DETECT_STR_BIN | Detects MessagePack str/bin type automatically |
FORCE_ARR | Forces PHP arrays to be packed as MessagePack arrays |
FORCE_MAP | Forces PHP arrays to be packed as MessagePack maps |
DETECT_ARR_MAP | Detects MessagePack array/map type automatically |
FORCE_FLOAT32 | Forces PHP floats to be packed as 32-bits MessagePack floats |
FORCE_FLOAT64 | Forces PHP floats to be packed as 64-bits MessagePack floats |
The type detection mode (DETECT_STR_BIN/DETECT_ARR_MAP) adds some overhead, which can be noticed when you pack large (16- and 32-bit) arrays or strings. However, if you know the value type in advance (for example, you only work with UTF-8 strings and/or associative arrays), you can eliminate this overhead by forcing the packer to use the appropriate type, which will save it from running the auto-detection routine. Another option is to explicitly specify the value type. The library provides two auxiliary classes for this, Map and Bin. Check the "Custom types" section below for details.
Examples:
// detect str/bin type and pack PHP 64-bit floats (doubles) to MP 32-bit floats
$packer = new Packer(PackOptions::DETECT_STR_BIN | PackOptions::FORCE_FLOAT32);
// these will throw MessagePack\Exception\InvalidOptionException
$packer = new Packer(PackOptions::FORCE_STR | PackOptions::FORCE_BIN);
$packer = new Packer(PackOptions::FORCE_FLOAT32 | PackOptions::FORCE_FLOAT64);
To unpack data you can either use an instance of a BufferUnpacker:
$unpacker = new BufferUnpacker();
$unpacker->reset($packed);
$value = $unpacker->unpack();
or call a static method on the MessagePack class:
$value = MessagePack::unpack($packed);
If the packed data is received in chunks (e.g. when reading from a stream), use the tryUnpack method, which attempts to unpack data and returns an array of unpacked messages (if any) instead of throwing an InsufficientDataException:
while ($chunk = ...) {
$unpacker->append($chunk);
if ($messages = $unpacker->tryUnpack()) {
return $messages;
}
}
If you want to unpack from a specific position in a buffer, use seek:
$unpacker->seek(42); // set position equal to 42 bytes
$unpacker->seek(-8); // set position to 8 bytes before the end of the buffer
To skip bytes from the current position, use skip:
$unpacker->skip(10); // set position to 10 bytes ahead of the current position
To get the number of remaining (unread) bytes in the buffer:
$unreadBytesCount = $unpacker->getRemainingCount();
To check whether the buffer has unread data:
$hasUnreadBytes = $unpacker->hasRemaining();
If needed, you can remove already read data from the buffer by calling:
$releasedBytesCount = $unpacker->release();
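For stream-based consumers, release() pairs naturally with append() and tryUnpack(). Here is a minimal sketch (the $stream variable and the handleMessage() callback are placeholders, not part of the library):
use MessagePack\BufferUnpacker;
$unpacker = new BufferUnpacker();
while (!feof($stream)) {
    $chunk = fread($stream, 8192);
    if ($chunk === false || $chunk === '') {
        break;
    }
    $unpacker->append($chunk);
    foreach ($unpacker->tryUnpack() as $message) {
        handleMessage($message); // placeholder for your own processing
    }
    $unpacker->release(); // free the bytes that have already been unpacked
}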
With the read method you can read raw (packed) data:
$packedData = $unpacker->read(2); // read 2 bytes
Besides the above methods, BufferUnpacker provides type-specific unpacking methods, namely:
$unpacker->unpackNil(); // PHP null
$unpacker->unpackBool(); // PHP bool
$unpacker->unpackInt(); // PHP int
$unpacker->unpackFloat(); // PHP float
$unpacker->unpackStr(); // PHP UTF-8 string
$unpacker->unpackBin(); // PHP binary string
$unpacker->unpackArray(); // PHP sequential array
$unpacker->unpackMap(); // PHP associative array
$unpacker->unpackExt(); // PHP MessagePack\Type\Ext object
The BufferUnpacker object supports a number of bitmask-based options for fine-tuning the unpacking process (the default is BIGINT_AS_STR):
Name | Description |
---|---|
BIGINT_AS_STR | Converts overflowed integers to strings [1] |
BIGINT_AS_GMP | Converts overflowed integers to GMP objects [2] |
BIGINT_AS_DEC | Converts overflowed integers to Decimal\Decimal objects [3] |
1. The binary MessagePack format has the unsigned 64-bit integer as its largest integer data type, but PHP does not support such integers, which means that an overflow can occur during unpacking.
2. Make sure the GMP extension is enabled.
3. Make sure the Decimal extension is enabled.
Examples:
$packedUint64 = "\xcf"."\xff\xff\xff\xff"."\xff\xff\xff\xff";
$unpacker = new BufferUnpacker($packedUint64);
var_dump($unpacker->unpack()); // string(20) "18446744073709551615"
$unpacker = new BufferUnpacker($packedUint64, UnpackOptions::BIGINT_AS_GMP);
var_dump($unpacker->unpack()); // object(GMP) {...}
$unpacker = new BufferUnpacker($packedUint64, UnpackOptions::BIGINT_AS_DEC);
var_dump($unpacker->unpack()); // object(Decimal\Decimal) {...}
In addition to the basic types, the library provides functionality to serialize and deserialize arbitrary types. This can be done in several ways, depending on your use case. Let's take a look at them.
If you need to serialize an instance of one of your classes into one of the basic MessagePack types, the best way to do this is to implement the CanBePacked interface in the class. A good example of such a class is the Map type that comes with the library. This type is useful when you want to explicitly specify that a given PHP array should be packed as a MessagePack map without triggering the automatic type-detection routine:
$packer = new Packer();
$packedMap = $packer->pack(new Map([1, 2, 3]));
$packedArray = $packer->pack([1, 2, 3]);
More type examples can be found in the src/Type directory.
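For illustration, a sketch of a custom class implementing the interface might look like this (the Money class is hypothetical; the single pack(Packer $packer): string method is what the bundled Map type suggests the interface requires):
use MessagePack\CanBePacked;
use MessagePack\Packer;

final class Money implements CanBePacked
{
    public function __construct(private int $amount, private string $currency) {}

    public function pack(Packer $packer): string
    {
        // serialize the object as a plain MessagePack map
        return $packer->packMap(['amount' => $this->amount, 'currency' => $this->currency]);
    }
}

$packed = (new Packer())->pack(new Money(100, 'EUR'));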
As with type objects, type transformers are only responsible for serializing values. They should be used when you need to serialize a value that does not implement the CanBePacked interface. Examples of such values could be instances of built-in or third-party classes that you don't own, or non-objects such as resources.
A transformer class must implement the CanPack interface. To use a transformer, it must first be registered in the packer. Here is an example of how to serialize PHP streams into the MessagePack bin format type using one of the supplied transformers, StreamTransformer:
$packer = new Packer(null, [new StreamTransformer()]);
$packedBin = $packer->pack(fopen('/path/to/file', 'r+'));
More type transformer examples can be found in the src/TypeTransformer directory.
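As a sketch, a transformer for DateTimeInterface values might look like the following (the class itself is hypothetical; CanPack::pack is assumed to return null for values the transformer does not handle, so the packer can fall through to other transformers):
use MessagePack\CanPack;
use MessagePack\Packer;

final class DateTimeTransformer implements CanPack
{
    public function pack(Packer $packer, $value): ?string
    {
        // pack DateTime values as ISO-8601 strings, decline everything else
        return $value instanceof \DateTimeInterface
            ? $packer->packStr($value->format(\DateTimeInterface::ATOM))
            : null;
    }
}

$packer = new Packer(null, [new DateTimeTransformer()]);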
In contrast to the cases described above, extensions are intended to handle extension types and are responsible for both serialization and deserialization of values (types).
An extension class must implement the Extension interface. To use an extension, it must first be registered in the packer and the unpacker.
The MessagePack specification divides extension types into two groups: predefined and application-specific. Currently, there is only one predefined type in the specification, Timestamp.
Timestamp
The Timestamp extension type is a predefined type. Support for this type in the library is provided through the TimestampExtension class. This class is responsible for handling Timestamp objects, which represent the number of seconds and an optional adjustment in nanoseconds:
$timestampExtension = new TimestampExtension();
$packer = new Packer();
$packer = $packer->extendWith($timestampExtension);
$unpacker = new BufferUnpacker();
$unpacker = $unpacker->extendWith($timestampExtension);
$packedTimestamp = $packer->pack(Timestamp::now());
$timestamp = $unpacker->reset($packedTimestamp)->unpack();
$seconds = $timestamp->getSeconds();
$nanoseconds = $timestamp->getNanoseconds();
When using the MessagePack class, the Timestamp extension is already registered:
$packedTimestamp = MessagePack::pack(Timestamp::now());
$timestamp = MessagePack::unpack($packedTimestamp);
Application-specific extensions
In addition, the format can be extended with your own types. For example, to make the built-in PHP DateTime objects first-class citizens in your code, you can create a corresponding extension, as shown in the example. Please note that custom extensions have to be registered with a unique extension ID (an integer from 0 to 127).
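As a rough sketch, such an extension might look like this (the class, the extension ID 42, and the exact method signatures are assumptions modeled on the bundled TimestampExtension):
use MessagePack\BufferUnpacker;
use MessagePack\Extension; // exact namespace assumed
use MessagePack\Packer;

final class DateTimeExtension implements Extension
{
    public function getType(): int
    {
        return 42; // application-specific extension ID (0-127), chosen arbitrarily
    }

    public function pack(Packer $packer, $value): ?string
    {
        // pack DateTimeImmutable values as an ext-typed ISO-8601 string
        return $value instanceof \DateTimeImmutable
            ? $packer->packExt($this->getType(), $value->format(\DateTimeInterface::ATOM))
            : null;
    }

    public function unpackExt(BufferUnpacker $unpacker, int $extLength)
    {
        // the payload is the ISO-8601 string written by pack()
        return new \DateTimeImmutable($unpacker->read($extLength));
    }
}

$extension = new DateTimeExtension();
$packer = (new Packer())->extendWith($extension);
$unpacker = (new BufferUnpacker())->extendWith($extension);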
More extension examples can be found in the examples/MessagePack directory.
To learn more about how extension types can be useful, check out this article.
If an error occurs during packing/unpacking, a PackingFailedException or an UnpackingFailedException will be thrown, respectively. In addition, an InsufficientDataException can be thrown during unpacking. An InvalidOptionException will be thrown if an invalid option (or a combination of mutually exclusive options) is used.
Run tests as follows:
vendor/bin/phpunit
Also, if you already have Docker installed, you can run the tests in a docker container. First, create a container:
./dockerfile.sh | docker build -t msgpack -
The command above will create a container named msgpack with a PHP 8.1 runtime. You may change the default runtime by defining the PHP_IMAGE environment variable:
PHP_IMAGE='php:8.0-cli' ./dockerfile.sh | docker build -t msgpack -
See a list of various images here.
Then run the unit tests:
docker run --rm -v $PWD:/msgpack -w /msgpack msgpack
To ensure that the unpacking works correctly with malformed/semi-malformed data, you can use a testing technique called fuzzing. The library ships with a help file (target) for PHP-Fuzzer, which can be used as follows:
php-fuzzer fuzz tests/fuzz_buffer_unpacker.php
To check performance, run:
php -n -dzend_extension=opcache.so \
-dpcre.jit=1 -dopcache.enable=1 -dopcache.enable_cli=1 \
tests/bench.php
Example output
Filter: MessagePack\Tests\Perf\Filter\ListFilter
Rounds: 3
Iterations: 100000
=============================================
Test/Target Packer BufferUnpacker
---------------------------------------------
nil .................. 0.0030 ........ 0.0139
false ................ 0.0037 ........ 0.0144
true ................. 0.0040 ........ 0.0137
7-bit uint #1 ........ 0.0052 ........ 0.0120
7-bit uint #2 ........ 0.0059 ........ 0.0114
7-bit uint #3 ........ 0.0061 ........ 0.0119
5-bit sint #1 ........ 0.0067 ........ 0.0126
5-bit sint #2 ........ 0.0064 ........ 0.0132
5-bit sint #3 ........ 0.0066 ........ 0.0135
8-bit uint #1 ........ 0.0078 ........ 0.0200
8-bit uint #2 ........ 0.0077 ........ 0.0212
8-bit uint #3 ........ 0.0086 ........ 0.0203
16-bit uint #1 ....... 0.0111 ........ 0.0271
16-bit uint #2 ....... 0.0115 ........ 0.0260
16-bit uint #3 ....... 0.0103 ........ 0.0273
32-bit uint #1 ....... 0.0116 ........ 0.0326
32-bit uint #2 ....... 0.0118 ........ 0.0332
32-bit uint #3 ....... 0.0127 ........ 0.0325
64-bit uint #1 ....... 0.0140 ........ 0.0277
64-bit uint #2 ....... 0.0134 ........ 0.0294
64-bit uint #3 ....... 0.0134 ........ 0.0281
8-bit int #1 ......... 0.0086 ........ 0.0241
8-bit int #2 ......... 0.0089 ........ 0.0225
8-bit int #3 ......... 0.0085 ........ 0.0229
16-bit int #1 ........ 0.0118 ........ 0.0280
16-bit int #2 ........ 0.0121 ........ 0.0270
16-bit int #3 ........ 0.0109 ........ 0.0274
32-bit int #1 ........ 0.0128 ........ 0.0346
32-bit int #2 ........ 0.0118 ........ 0.0339
32-bit int #3 ........ 0.0135 ........ 0.0368
64-bit int #1 ........ 0.0138 ........ 0.0276
64-bit int #2 ........ 0.0132 ........ 0.0286
64-bit int #3 ........ 0.0137 ........ 0.0274
64-bit int #4 ........ 0.0180 ........ 0.0285
64-bit float #1 ...... 0.0134 ........ 0.0284
64-bit float #2 ...... 0.0125 ........ 0.0275
64-bit float #3 ...... 0.0126 ........ 0.0283
fix string #1 ........ 0.0035 ........ 0.0133
fix string #2 ........ 0.0094 ........ 0.0216
fix string #3 ........ 0.0094 ........ 0.0222
fix string #4 ........ 0.0091 ........ 0.0241
8-bit string #1 ...... 0.0122 ........ 0.0301
8-bit string #2 ...... 0.0118 ........ 0.0304
8-bit string #3 ...... 0.0119 ........ 0.0315
16-bit string #1 ..... 0.0150 ........ 0.0388
16-bit string #2 ..... 0.1545 ........ 0.1665
32-bit string ........ 0.1570 ........ 0.1756
wide char string #1 .. 0.0091 ........ 0.0236
wide char string #2 .. 0.0122 ........ 0.0313
8-bit binary #1 ...... 0.0100 ........ 0.0302
8-bit binary #2 ...... 0.0123 ........ 0.0324
8-bit binary #3 ...... 0.0126 ........ 0.0327
16-bit binary ........ 0.0168 ........ 0.0372
32-bit binary ........ 0.1588 ........ 0.1754
fix array #1 ......... 0.0042 ........ 0.0131
fix array #2 ......... 0.0294 ........ 0.0367
fix array #3 ......... 0.0412 ........ 0.0472
16-bit array #1 ...... 0.1378 ........ 0.1596
16-bit array #2 ........... S ............. S
32-bit array .............. S ............. S
complex array ........ 0.1865 ........ 0.2283
fix map #1 ........... 0.0725 ........ 0.1048
fix map #2 ........... 0.0319 ........ 0.0405
fix map #3 ........... 0.0356 ........ 0.0665
fix map #4 ........... 0.0465 ........ 0.0497
16-bit map #1 ........ 0.2540 ........ 0.3028
16-bit map #2 ............. S ............. S
32-bit map ................ S ............. S
complex map .......... 0.2372 ........ 0.2710
fixext 1 ............. 0.0283 ........ 0.0358
fixext 2 ............. 0.0291 ........ 0.0371
fixext 4 ............. 0.0302 ........ 0.0355
fixext 8 ............. 0.0288 ........ 0.0384
fixext 16 ............ 0.0293 ........ 0.0359
8-bit ext ............ 0.0302 ........ 0.0439
16-bit ext ........... 0.0334 ........ 0.0499
32-bit ext ........... 0.1845 ........ 0.1888
32-bit timestamp #1 .. 0.0337 ........ 0.0547
32-bit timestamp #2 .. 0.0335 ........ 0.0560
64-bit timestamp #1 .. 0.0371 ........ 0.0575
64-bit timestamp #2 .. 0.0374 ........ 0.0542
64-bit timestamp #3 .. 0.0356 ........ 0.0533
96-bit timestamp #1 .. 0.0362 ........ 0.0699
96-bit timestamp #2 .. 0.0381 ........ 0.0701
96-bit timestamp #3 .. 0.0367 ........ 0.0687
=============================================
Total 2.7618 4.0820
Skipped 4 4
Failed 0 0
Ignored 0 0
With JIT:
php -n -dzend_extension=opcache.so \
-dpcre.jit=1 -dopcache.jit_buffer_size=64M -dopcache.jit=tracing -dopcache.enable=1 -dopcache.enable_cli=1 \
tests/bench.php
Example output
Filter: MessagePack\Tests\Perf\Filter\ListFilter
Rounds: 3
Iterations: 100000
=============================================
Test/Target Packer BufferUnpacker
---------------------------------------------
nil .................. 0.0005 ........ 0.0054
false ................ 0.0004 ........ 0.0059
true ................. 0.0004 ........ 0.0059
7-bit uint #1 ........ 0.0010 ........ 0.0047
7-bit uint #2 ........ 0.0010 ........ 0.0046
7-bit uint #3 ........ 0.0010 ........ 0.0046
5-bit sint #1 ........ 0.0025 ........ 0.0046
5-bit sint #2 ........ 0.0023 ........ 0.0046
5-bit sint #3 ........ 0.0024 ........ 0.0045
8-bit uint #1 ........ 0.0043 ........ 0.0081
8-bit uint #2 ........ 0.0043 ........ 0.0079
8-bit uint #3 ........ 0.0041 ........ 0.0080
16-bit uint #1 ....... 0.0064 ........ 0.0095
16-bit uint #2 ....... 0.0064 ........ 0.0091
16-bit uint #3 ....... 0.0064 ........ 0.0094
32-bit uint #1 ....... 0.0085 ........ 0.0114
32-bit uint #2 ....... 0.0077 ........ 0.0122
32-bit uint #3 ....... 0.0077 ........ 0.0120
64-bit uint #1 ....... 0.0085 ........ 0.0159
64-bit uint #2 ....... 0.0086 ........ 0.0157
64-bit uint #3 ....... 0.0086 ........ 0.0158
8-bit int #1 ......... 0.0042 ........ 0.0080
8-bit int #2 ......... 0.0042 ........ 0.0080
8-bit int #3 ......... 0.0042 ........ 0.0081
16-bit int #1 ........ 0.0065 ........ 0.0095
16-bit int #2 ........ 0.0065 ........ 0.0090
16-bit int #3 ........ 0.0056 ........ 0.0085
32-bit int #1 ........ 0.0067 ........ 0.0107
32-bit int #2 ........ 0.0066 ........ 0.0106
32-bit int #3 ........ 0.0063 ........ 0.0104
64-bit int #1 ........ 0.0072 ........ 0.0162
64-bit int #2 ........ 0.0073 ........ 0.0174
64-bit int #3 ........ 0.0072 ........ 0.0164
64-bit int #4 ........ 0.0077 ........ 0.0161
64-bit float #1 ...... 0.0053 ........ 0.0135
64-bit float #2 ...... 0.0053 ........ 0.0135
64-bit float #3 ...... 0.0052 ........ 0.0135
fix string #1 ....... -0.0002 ........ 0.0044
fix string #2 ........ 0.0035 ........ 0.0067
fix string #3 ........ 0.0035 ........ 0.0077
fix string #4 ........ 0.0033 ........ 0.0078
8-bit string #1 ...... 0.0059 ........ 0.0110
8-bit string #2 ...... 0.0063 ........ 0.0121
8-bit string #3 ...... 0.0064 ........ 0.0124
16-bit string #1 ..... 0.0099 ........ 0.0146
16-bit string #2 ..... 0.1522 ........ 0.1474
32-bit string ........ 0.1511 ........ 0.1483
wide char string #1 .. 0.0039 ........ 0.0084
wide char string #2 .. 0.0073 ........ 0.0123
8-bit binary #1 ...... 0.0040 ........ 0.0112
8-bit binary #2 ...... 0.0075 ........ 0.0123
8-bit binary #3 ...... 0.0077 ........ 0.0129
16-bit binary ........ 0.0096 ........ 0.0145
32-bit binary ........ 0.1535 ........ 0.1479
fix array #1 ......... 0.0008 ........ 0.0061
fix array #2 ......... 0.0121 ........ 0.0165
fix array #3 ......... 0.0193 ........ 0.0222
16-bit array #1 ...... 0.0607 ........ 0.0479
16-bit array #2 ........... S ............. S
32-bit array .............. S ............. S
complex array ........ 0.0749 ........ 0.0824
fix map #1 ........... 0.0329 ........ 0.0431
fix map #2 ........... 0.0161 ........ 0.0189
fix map #3 ........... 0.0205 ........ 0.0262
fix map #4 ........... 0.0252 ........ 0.0205
16-bit map #1 ........ 0.1016 ........ 0.0927
16-bit map #2 ............. S ............. S
32-bit map ................ S ............. S
complex map .......... 0.1096 ........ 0.1030
fixext 1 ............. 0.0157 ........ 0.0161
fixext 2 ............. 0.0175 ........ 0.0183
fixext 4 ............. 0.0156 ........ 0.0185
fixext 8 ............. 0.0163 ........ 0.0184
fixext 16 ............ 0.0164 ........ 0.0182
8-bit ext ............ 0.0158 ........ 0.0207
16-bit ext ........... 0.0203 ........ 0.0219
32-bit ext ........... 0.1614 ........ 0.1539
32-bit timestamp #1 .. 0.0195 ........ 0.0249
32-bit timestamp #2 .. 0.0188 ........ 0.0260
64-bit timestamp #1 .. 0.0207 ........ 0.0281
64-bit timestamp #2 .. 0.0212 ........ 0.0291
64-bit timestamp #3 .. 0.0207 ........ 0.0295
96-bit timestamp #1 .. 0.0222 ........ 0.0358
96-bit timestamp #2 .. 0.0228 ........ 0.0353
96-bit timestamp #3 .. 0.0210 ........ 0.0319
=============================================
Total 1.6432 1.9674
Skipped 4 4
Failed 0 0
Ignored 0 0
You may change default benchmark settings by defining the following environment variables:
Name | Default |
---|---|
MP_BENCH_TARGETS | pure_p,pure_u (see a list of available targets) |
MP_BENCH_ITERATIONS | 100_000 |
MP_BENCH_DURATION | not set |
MP_BENCH_ROUNDS | 3 |
MP_BENCH_TESTS | -@slow (see a list of available tests) |
For example:
export MP_BENCH_TARGETS=pure_p
export MP_BENCH_ITERATIONS=1000000
export MP_BENCH_ROUNDS=5
# a comma separated list of test names
export MP_BENCH_TESTS='complex array, complex map'
# or a group name
# export MP_BENCH_TESTS='-@slow' // @pecl_comp
# or a regexp
# export MP_BENCH_TESTS='/complex (array|map)/'
Another example, benchmarking both the library and the PECL extension:
MP_BENCH_TARGETS=pure_p,pure_u,pecl_p,pecl_u \
php -n -dextension=msgpack.so -dzend_extension=opcache.so \
-dpcre.jit=1 -dopcache.enable=1 -dopcache.enable_cli=1 \
tests/bench.php
Example output
Filter: MessagePack\Tests\Perf\Filter\ListFilter
Rounds: 3
Iterations: 100000
===========================================================================
Test/Target Packer BufferUnpacker msgpack_pack msgpack_unpack
---------------------------------------------------------------------------
nil .................. 0.0031 ........ 0.0141 ...... 0.0055 ........ 0.0064
false ................ 0.0039 ........ 0.0154 ...... 0.0056 ........ 0.0053
true ................. 0.0038 ........ 0.0139 ...... 0.0056 ........ 0.0044
7-bit uint #1 ........ 0.0061 ........ 0.0110 ...... 0.0059 ........ 0.0046
7-bit uint #2 ........ 0.0065 ........ 0.0119 ...... 0.0042 ........ 0.0029
7-bit uint #3 ........ 0.0054 ........ 0.0117 ...... 0.0045 ........ 0.0025
5-bit sint #1 ........ 0.0047 ........ 0.0103 ...... 0.0038 ........ 0.0022
5-bit sint #2 ........ 0.0048 ........ 0.0117 ...... 0.0038 ........ 0.0022
5-bit sint #3 ........ 0.0046 ........ 0.0102 ...... 0.0038 ........ 0.0023
8-bit uint #1 ........ 0.0063 ........ 0.0174 ...... 0.0039 ........ 0.0031
8-bit uint #2 ........ 0.0063 ........ 0.0167 ...... 0.0040 ........ 0.0029
8-bit uint #3 ........ 0.0063 ........ 0.0168 ...... 0.0039 ........ 0.0030
16-bit uint #1 ....... 0.0092 ........ 0.0222 ...... 0.0049 ........ 0.0030
16-bit uint #2 ....... 0.0096 ........ 0.0227 ...... 0.0042 ........ 0.0046
16-bit uint #3 ....... 0.0123 ........ 0.0274 ...... 0.0059 ........ 0.0051
32-bit uint #1 ....... 0.0136 ........ 0.0331 ...... 0.0060 ........ 0.0048
32-bit uint #2 ....... 0.0130 ........ 0.0336 ...... 0.0070 ........ 0.0048
32-bit uint #3 ....... 0.0127 ........ 0.0329 ...... 0.0051 ........ 0.0048
64-bit uint #1 ....... 0.0126 ........ 0.0268 ...... 0.0055 ........ 0.0049
64-bit uint #2 ....... 0.0135 ........ 0.0281 ...... 0.0052 ........ 0.0046
64-bit uint #3 ....... 0.0131 ........ 0.0274 ...... 0.0069 ........ 0.0044
8-bit int #1 ......... 0.0077 ........ 0.0236 ...... 0.0058 ........ 0.0044
8-bit int #2 ......... 0.0087 ........ 0.0244 ...... 0.0058 ........ 0.0048
8-bit int #3 ......... 0.0084 ........ 0.0241 ...... 0.0055 ........ 0.0049
16-bit int #1 ........ 0.0112 ........ 0.0271 ...... 0.0048 ........ 0.0045
16-bit int #2 ........ 0.0124 ........ 0.0292 ...... 0.0057 ........ 0.0049
16-bit int #3 ........ 0.0118 ........ 0.0270 ...... 0.0058 ........ 0.0050
32-bit int #1 ........ 0.0137 ........ 0.0366 ...... 0.0058 ........ 0.0051
32-bit int #2 ........ 0.0133 ........ 0.0366 ...... 0.0056 ........ 0.0049
32-bit int #3 ........ 0.0129 ........ 0.0350 ...... 0.0052 ........ 0.0048
64-bit int #1 ........ 0.0145 ........ 0.0254 ...... 0.0034 ........ 0.0025
64-bit int #2 ........ 0.0097 ........ 0.0214 ...... 0.0034 ........ 0.0025
64-bit int #3 ........ 0.0096 ........ 0.0287 ...... 0.0059 ........ 0.0050
64-bit int #4 ........ 0.0143 ........ 0.0277 ...... 0.0059 ........ 0.0046
64-bit float #1 ...... 0.0134 ........ 0.0281 ...... 0.0057 ........ 0.0052
64-bit float #2 ...... 0.0141 ........ 0.0281 ...... 0.0057 ........ 0.0050
64-bit float #3 ...... 0.0144 ........ 0.0282 ...... 0.0057 ........ 0.0050
fix string #1 ........ 0.0036 ........ 0.0143 ...... 0.0066 ........ 0.0053
fix string #2 ........ 0.0107 ........ 0.0222 ...... 0.0065 ........ 0.0068
fix string #3 ........ 0.0116 ........ 0.0245 ...... 0.0063 ........ 0.0069
fix string #4 ........ 0.0105 ........ 0.0253 ...... 0.0083 ........ 0.0077
8-bit string #1 ...... 0.0126 ........ 0.0318 ...... 0.0075 ........ 0.0088
8-bit string #2 ...... 0.0121 ........ 0.0295 ...... 0.0076 ........ 0.0086
8-bit string #3 ...... 0.0125 ........ 0.0293 ...... 0.0130 ........ 0.0093
16-bit string #1 ..... 0.0159 ........ 0.0368 ...... 0.0117 ........ 0.0086
16-bit string #2 ..... 0.1547 ........ 0.1686 ...... 0.1516 ........ 0.1373
32-bit string ........ 0.1558 ........ 0.1729 ...... 0.1511 ........ 0.1396
wide char string #1 .. 0.0098 ........ 0.0237 ...... 0.0066 ........ 0.0065
wide char string #2 .. 0.0128 ........ 0.0291 ...... 0.0061 ........ 0.0082
8-bit binary #1 ........... I ............. I ........... F ............. I
8-bit binary #2 ........... I ............. I ........... F ............. I
8-bit binary #3 ........... I ............. I ........... F ............. I
16-bit binary ............. I ............. I ........... F ............. I
32-bit binary ............. I ............. I ........... F ............. I
fix array #1 ......... 0.0040 ........ 0.0129 ...... 0.0120 ........ 0.0058
fix array #2 ......... 0.0279 ........ 0.0390 ...... 0.0143 ........ 0.0165
fix array #3 ......... 0.0415 ........ 0.0463 ...... 0.0162 ........ 0.0187
16-bit array #1 ...... 0.1349 ........ 0.1628 ...... 0.0334 ........ 0.0341
16-bit array #2 ........... S ............. S ........... S ............. S
32-bit array .............. S ............. S ........... S ............. S
complex array ............. I ............. I ........... F ............. F
fix map #1 ................ I ............. I ........... F ............. I
fix map #2 ........... 0.0345 ........ 0.0391 ...... 0.0143 ........ 0.0168
fix map #3 ................ I ............. I ........... F ............. I
fix map #4 ........... 0.0459 ........ 0.0473 ...... 0.0151 ........ 0.0163
16-bit map #1 ........ 0.2518 ........ 0.2962 ...... 0.0400 ........ 0.0490
16-bit map #2 ............. S ............. S ........... S ............. S
32-bit map ................ S ............. S ........... S ............. S
complex map .......... 0.2380 ........ 0.2682 ...... 0.0545 ........ 0.0579
fixext 1 .................. I ............. I ........... F ............. F
fixext 2 .................. I ............. I ........... F ............. F
fixext 4 .................. I ............. I ........... F ............. F
fixext 8 .................. I ............. I ........... F ............. F
fixext 16 ................. I ............. I ........... F ............. F
8-bit ext ................. I ............. I ........... F ............. F
16-bit ext ................ I ............. I ........... F ............. F
32-bit ext ................ I ............. I ........... F ............. F
32-bit timestamp #1 ....... I ............. I ........... F ............. F
32-bit timestamp #2 ....... I ............. I ........... F ............. F
64-bit timestamp #1 ....... I ............. I ........... F ............. F
64-bit timestamp #2 ....... I ............. I ........... F ............. F
64-bit timestamp #3 ....... I ............. I ........... F ............. F
96-bit timestamp #1 ....... I ............. I ........... F ............. F
96-bit timestamp #2 ....... I ............. I ........... F ............. F
96-bit timestamp #3 ....... I ............. I ........... F ............. F
===========================================================================
Total 1.5625 2.3866 0.7735 0.7243
Skipped 4 4 4 4
Failed 0 0 24 17
Ignored 24 24 0 7
With JIT:
MP_BENCH_TARGETS=pure_p,pure_u,pecl_p,pecl_u \
php -n -dextension=msgpack.so -dzend_extension=opcache.so \
-dpcre.jit=1 -dopcache.jit_buffer_size=64M -dopcache.jit=tracing -dopcache.enable=1 -dopcache.enable_cli=1 \
tests/bench.php
Example output
Filter: MessagePack\Tests\Perf\Filter\ListFilter
Rounds: 3
Iterations: 100000
===========================================================================
Test/Target Packer BufferUnpacker msgpack_pack msgpack_unpack
---------------------------------------------------------------------------
nil .................. 0.0001 ........ 0.0052 ...... 0.0053 ........ 0.0042
false ................ 0.0007 ........ 0.0060 ...... 0.0057 ........ 0.0043
true ................. 0.0008 ........ 0.0060 ...... 0.0056 ........ 0.0041
7-bit uint #1 ........ 0.0031 ........ 0.0046 ...... 0.0062 ........ 0.0041
7-bit uint #2 ........ 0.0021 ........ 0.0043 ...... 0.0062 ........ 0.0041
7-bit uint #3 ........ 0.0022 ........ 0.0044 ...... 0.0061 ........ 0.0040
5-bit sint #1 ........ 0.0030 ........ 0.0048 ...... 0.0062 ........ 0.0040
5-bit sint #2 ........ 0.0032 ........ 0.0046 ...... 0.0062 ........ 0.0040
5-bit sint #3 ........ 0.0031 ........ 0.0046 ...... 0.0062 ........ 0.0040
8-bit uint #1 ........ 0.0054 ........ 0.0079 ...... 0.0062 ........ 0.0050
8-bit uint #2 ........ 0.0051 ........ 0.0079 ...... 0.0064 ........ 0.0044
8-bit uint #3 ........ 0.0051 ........ 0.0082 ...... 0.0062 ........ 0.0044
16-bit uint #1 ....... 0.0077 ........ 0.0094 ...... 0.0065 ........ 0.0045
16-bit uint #2 ....... 0.0077 ........ 0.0094 ...... 0.0063 ........ 0.0045
16-bit uint #3 ....... 0.0077 ........ 0.0095 ...... 0.0064 ........ 0.0047
32-bit uint #1 ....... 0.0088 ........ 0.0119 ...... 0.0063 ........ 0.0043
32-bit uint #2 ....... 0.0089 ........ 0.0117 ...... 0.0062 ........ 0.0039
32-bit uint #3 ....... 0.0089 ........ 0.0118 ...... 0.0063 ........ 0.0044
64-bit uint #1 ....... 0.0097 ........ 0.0155 ...... 0.0063 ........ 0.0045
64-bit uint #2 ....... 0.0095 ........ 0.0153 ...... 0.0061 ........ 0.0045
64-bit uint #3 ....... 0.0096 ........ 0.0156 ...... 0.0063 ........ 0.0047
8-bit int #1 ......... 0.0053 ........ 0.0083 ...... 0.0062 ........ 0.0044
8-bit int #2 ......... 0.0052 ........ 0.0080 ...... 0.0062 ........ 0.0044
8-bit int #3 ......... 0.0052 ........ 0.0080 ...... 0.0062 ........ 0.0043
16-bit int #1 ........ 0.0089 ........ 0.0097 ...... 0.0069 ........ 0.0046
16-bit int #2 ........ 0.0075 ........ 0.0093 ...... 0.0063 ........ 0.0043
16-bit int #3 ........ 0.0075 ........ 0.0094 ...... 0.0062 ........ 0.0046
32-bit int #1 ........ 0.0086 ........ 0.0122 ...... 0.0063 ........ 0.0044
32-bit int #2 ........ 0.0087 ........ 0.0120 ...... 0.0066 ........ 0.0046
32-bit int #3 ........ 0.0086 ........ 0.0121 ...... 0.0060 ........ 0.0044
64-bit int #1 ........ 0.0096 ........ 0.0149 ...... 0.0060 ........ 0.0045
64-bit int #2 ........ 0.0096 ........ 0.0157 ...... 0.0062 ........ 0.0044
64-bit int #3 ........ 0.0096 ........ 0.0160 ...... 0.0063 ........ 0.0046
64-bit int #4 ........ 0.0097 ........ 0.0157 ...... 0.0061 ........ 0.0044
64-bit float #1 ...... 0.0079 ........ 0.0153 ...... 0.0056 ........ 0.0044
64-bit float #2 ...... 0.0079 ........ 0.0152 ...... 0.0057 ........ 0.0045
64-bit float #3 ...... 0.0079 ........ 0.0155 ...... 0.0057 ........ 0.0044
fix string #1 ........ 0.0010 ........ 0.0045 ...... 0.0071 ........ 0.0044
fix string #2 ........ 0.0048 ........ 0.0075 ...... 0.0070 ........ 0.0060
fix string #3 ........ 0.0048 ........ 0.0086 ...... 0.0068 ........ 0.0060
fix string #4 ........ 0.0050 ........ 0.0088 ...... 0.0070 ........ 0.0059
8-bit string #1 ...... 0.0081 ........ 0.0129 ...... 0.0069 ........ 0.0062
8-bit string #2 ...... 0.0086 ........ 0.0128 ...... 0.0069 ........ 0.0065
8-bit string #3 ...... 0.0086 ........ 0.0126 ...... 0.0115 ........ 0.0065
16-bit string #1 ..... 0.0105 ........ 0.0137 ...... 0.0128 ........ 0.0068
16-bit string #2 ..... 0.1510 ........ 0.1486 ...... 0.1526 ........ 0.1391
32-bit string ........ 0.1517 ........ 0.1475 ...... 0.1504 ........ 0.1370
wide char string #1 .. 0.0044 ........ 0.0085 ...... 0.0067 ........ 0.0057
wide char string #2 .. 0.0081 ........ 0.0125 ...... 0.0069 ........ 0.0063
8-bit binary #1 ........... I ............. I ........... F ............. I
8-bit binary #2 ........... I ............. I ........... F ............. I
8-bit binary #3 ........... I ............. I ........... F ............. I
16-bit binary ............. I ............. I ........... F ............. I
32-bit binary ............. I ............. I ........... F ............. I
fix array #1 ......... 0.0014 ........ 0.0059 ...... 0.0132 ........ 0.0055
fix array #2 ......... 0.0146 ........ 0.0156 ...... 0.0155 ........ 0.0148
fix array #3 ......... 0.0211 ........ 0.0229 ...... 0.0179 ........ 0.0180
16-bit array #1 ...... 0.0673 ........ 0.0498 ...... 0.0343 ........ 0.0388
16-bit array #2 ........... S ............. S ........... S ............. S
32-bit array .............. S ............. S ........... S ............. S
complex array ............. I ............. I ........... F ............. F
fix map #1 ................ I ............. I ........... F ............. I
fix map #2 ........... 0.0148 ........ 0.0180 ...... 0.0156 ........ 0.0179
fix map #3 ................ I ............. I ........... F ............. I
fix map #4 ........... 0.0252 ........ 0.0201 ...... 0.0214 ........ 0.0167
16-bit map #1 ........ 0.1027 ........ 0.0836 ...... 0.0388 ........ 0.0510
16-bit map #2 ............. S ............. S ........... S ............. S
32-bit map ................ S ............. S ........... S ............. S
complex map .......... 0.1104 ........ 0.1010 ...... 0.0556 ........ 0.0602
fixext 1 .................. I ............. I ........... F ............. F
fixext 2 .................. I ............. I ........... F ............. F
fixext 4 .................. I ............. I ........... F ............. F
fixext 8 .................. I ............. I ........... F ............. F
fixext 16 ................. I ............. I ........... F ............. F
8-bit ext ................. I ............. I ........... F ............. F
16-bit ext ................ I ............. I ........... F ............. F
32-bit ext ................ I ............. I ........... F ............. F
32-bit timestamp #1 ....... I ............. I ........... F ............. F
32-bit timestamp #2 ....... I ............. I ........... F ............. F
64-bit timestamp #1 ....... I ............. I ........... F ............. F
64-bit timestamp #2 ....... I ............. I ........... F ............. F
64-bit timestamp #3 ....... I ............. I ........... F ............. F
96-bit timestamp #1 ....... I ............. I ........... F ............. F
96-bit timestamp #2 ....... I ............. I ........... F ............. F
96-bit timestamp #3 ....... I ............. I ........... F ............. F
===========================================================================
Total 0.9642 1.0909 0.8224 0.7213
Skipped 4 4 4 4
Failed 0 0 24 17
Ignored 24 24 0 7
Note that the msgpack extension (v2.1.2) doesn't support ext, bin and UTF-8 str types.
The library is released under the MIT License. See the bundled LICENSE file for details.
Author: rybakit
Source Code: https://github.com/rybakit/msgpack.php
License: MIT License
Node.js client for the official ChatGPT API.
This package is a Node.js wrapper around ChatGPT by OpenAI. TS batteries included. ✨
March 1, 2023
The official OpenAI chat completions API has been released, and it is now the default for this package! 🔥
Method | Free? | Robust? | Quality? |
---|---|---|---|
ChatGPTAPI | ❌ No | ✅ Yes | ✅️ Real ChatGPT models |
ChatGPTUnofficialProxyAPI | ✅ Yes | ☑️ Maybe | ✅ Real ChatGPT |
Note: We strongly recommend using ChatGPTAPI since it uses the officially supported API from OpenAI. We may remove support for ChatGPTUnofficialProxyAPI in a future release.
- ChatGPTAPI - Uses the gpt-3.5-turbo-0301 model with the official OpenAI chat completions API (official, robust approach, but it's not free)
- ChatGPTUnofficialProxyAPI - Uses an unofficial proxy server to access ChatGPT's backend API in a way that circumvents Cloudflare (uses the real ChatGPT and is pretty lightweight, but relies on a third-party server and is rate-limited)

To run the CLI, you'll need an OpenAI API key:
export OPENAI_API_KEY="sk-TODO"
npx chatgpt "your prompt here"
By default, the response is streamed to stdout, the results are stored in a local config file, and every invocation starts a new conversation. You can use -c to continue the previous conversation and --no-stream to disable streaming.
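For example, a follow-up question in the same conversation with streaming disabled might look like this (a sketch using the flags described above; the prompts are placeholders):
npx chatgpt "what is the capital of France?"
npx chatgpt -c --no-stream "and what is its population?"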
Under the hood, the CLI uses ChatGPTAPI with text-davinci-003 to mimic ChatGPT.
Usage:
$ chatgpt <prompt>
Commands:
<prompt> Ask ChatGPT a question
rm-cache Clears the local message cache
ls-cache Prints the local message cache path
For more info, run any command with the `--help` flag:
$ chatgpt --help
$ chatgpt rm-cache --help
$ chatgpt ls-cache --help
Options:
-c, --continue Continue last conversation (default: false)
-d, --debug Enables debug logging (default: false)
-s, --stream Streams the response (default: true)
-s, --store Enables the local message cache (default: true)
-t, --timeout Timeout in milliseconds
-k, --apiKey OpenAI API key
-n, --conversationName Unique name for the conversation
-h, --help Display this message
-v, --version Display version number
npm install chatgpt
Make sure you're using node >= 18 so fetch is available (or node >= 14 if you install a fetch polyfill).
To use this module from Node.js, you need to pick between two methods:
Method | Free? | Robust? | Quality? |
---|---|---|---|
ChatGPTAPI | ❌ No | ✅ Yes | ✅️ Real ChatGPT models |
ChatGPTUnofficialProxyAPI | ✅ Yes | ☑️ Maybe | ✅ Real ChatGPT |
- ChatGPTAPI - Uses the gpt-3.5-turbo-0301 model with the official OpenAI chat completions API (official, robust approach, but it's not free). You can override the model, completion params, and system message to fully customize your assistant.
- ChatGPTUnofficialProxyAPI - Uses an unofficial proxy server to access ChatGPT's backend API in a way that circumvents Cloudflare (uses the real ChatGPT and is pretty lightweight, but relies on a third-party server and is rate-limited).
Both approaches have very similar APIs, so it should be simple to swap between them.
Note: We strongly recommend using ChatGPTAPI since it uses the officially supported API from OpenAI. We may remove support for ChatGPTUnofficialProxyAPI in a future release.
Sign up for an OpenAI API key and store it in your environment.
import { ChatGPTAPI } from 'chatgpt'
async function example() {
const api = new ChatGPTAPI({
apiKey: process.env.OPENAI_API_KEY
})
const res = await api.sendMessage('Hello World!')
console.log(res.text)
}
You can override the default model (gpt-3.5-turbo-0301) and any OpenAI chat completion params using completionParams:
const api = new ChatGPTAPI({
apiKey: process.env.OPENAI_API_KEY,
completionParams: {
temperature: 0.5,
top_p: 0.8
}
})
If you want to track the conversation, you'll need to pass the parentMessageId like this:
const api = new ChatGPTAPI({ apiKey: process.env.OPENAI_API_KEY })
// send a message and wait for the response
let res = await api.sendMessage('What is OpenAI?')
console.log(res.text)
// send a follow-up
res = await api.sendMessage('Can you expand on that?', {
parentMessageId: res.id
})
console.log(res.text)
// send another follow-up
res = await api.sendMessage('What were we talking about?', {
parentMessageId: res.id
})
console.log(res.text)
You can add streaming via the onProgress handler:
const res = await api.sendMessage('Write a 500 word essay on frogs.', {
// print the partial response as the AI is "typing"
onProgress: (partialResponse) => console.log(partialResponse.text)
})
// print the full text at the end
console.log(res.text)
You can add a timeout using the timeoutMs option:
// timeout after 2 minutes (which will also abort the underlying HTTP request)
const response = await api.sendMessage(
'write me a really really long essay on frogs',
{
timeoutMs: 2 * 60 * 1000
}
)
If you want to see more info about what's actually being sent to OpenAI's chat completions API, set the debug: true option in the ChatGPTAPI constructor:
const api = new ChatGPTAPI({
apiKey: process.env.OPENAI_API_KEY,
debug: true
})
We default to a basic systemMessage. You can override this in either the ChatGPTAPI constructor or sendMessage:
const res = await api.sendMessage('what is the answer to the universe?', {
systemMessage: `You are ChatGPT, a large language model trained by OpenAI. You answer as concisely as possible for each response. If you are generating a list, do not have too many items.
Current date: ${new Date().toISOString()}\n\n`
})
Note that we automatically handle appending the previous messages to the prompt and attempt to optimize for the available tokens (which defaults to 4096).
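If you need a different token budget, the constructor is assumed to expose maxModelTokens and maxResponseTokens options for tuning this behavior; a sketch under that assumption:
import { ChatGPTAPI } from 'chatgpt'

const api = new ChatGPTAPI({
  apiKey: process.env.OPENAI_API_KEY,
  // assumption: these options control the context/response token split
  maxModelTokens: 4000,
  maxResponseTokens: 1000
})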
Usage in CommonJS (Dynamic import)
async function example() {
// To use ESM in CommonJS, you can use a dynamic import
const { ChatGPTAPI } = await import('chatgpt')
const api = new ChatGPTAPI({ apiKey: process.env.OPENAI_API_KEY })
const res = await api.sendMessage('Hello World!')
console.log(res.text)
}
The API for ChatGPTUnofficialProxyAPI is almost exactly the same. You just need to provide a ChatGPT accessToken instead of an OpenAI API key.
import { ChatGPTUnofficialProxyAPI } from 'chatgpt'
async function example() {
const api = new ChatGPTUnofficialProxyAPI({
accessToken: process.env.OPENAI_ACCESS_TOKEN
})
const res = await api.sendMessage('Hello World!')
console.log(res.text)
}
See demos/demo-reverse-proxy for a full example:
npx tsx demos/demo-reverse-proxy.ts
ChatGPTUnofficialProxyAPI messages also contain a conversationId in addition to parentMessageId, since the ChatGPT webapp can't reference messages across different conversations.
You can override the reverse proxy by passing apiReverseProxyUrl:
const api = new ChatGPTUnofficialProxyAPI({
accessToken: process.env.OPENAI_ACCESS_TOKEN,
apiReverseProxyUrl: 'https://your-example-server.com/api/conversation'
})
Known reverse proxies run by community members include:
Reverse Proxy URL | Author | Rate Limits | Last Checked |
---|---|---|---|
https://chat.duti.tech/api/conversation | @acheong08 | 120 req/min by IP | 2/19/2023 |
https://gpt.pawan.krd/backend-api/conversation | @PawanOsman | ? | 2/19/2023 |
Note: info on how the reverse proxies work is not being published at this time in order to prevent OpenAI from disabling access.
To use ChatGPTUnofficialProxyAPI, you'll need an OpenAI access token from the ChatGPT webapp. To do this, you can use any of the following methods, which take an email and password and return an access token:
These libraries work with email + password accounts (i.e., they do not support accounts where you auth via Microsoft / Google).
Alternatively, you can manually get an accessToken by logging in to the ChatGPT webapp and then opening https://chat.openai.com/api/auth/session, which will return a JSON object containing your accessToken string.
Access tokens last for days.
Note: using a reverse proxy will expose your access token to a third party. There shouldn't be any adverse effects from this, but please consider the risks before using this method.
See the auto-generated docs for more info on methods and parameters.
Most of the demos use ChatGPTAPI. It should be pretty easy to convert them to use ChatGPTUnofficialProxyAPI if you'd rather use that approach. The only thing that needs to change is how you initialize the API with an accessToken instead of an apiKey.
To run the included demos:
Set OPENAI_API_KEY in .env.
A basic demo is included for testing purposes:
npx tsx demos/demo.ts
A demo showing the onProgress handler:
npx tsx demos/demo-on-progress.ts
The on-progress demo uses the optional onProgress parameter to sendMessage to receive intermediary results as ChatGPT is "typing".
A conversation demo:
npx tsx demos/demo-conversation.ts
A persistence demo shows how to store messages in Redis:
npx tsx demos/demo-persistence.ts
Any keyv adapter is supported for persistence, and there are overrides if you'd like to use a different way of storing / retrieving messages.
Note that persisting messages is required for remembering the context of previous conversations beyond the scope of the current Node.js process, since by default, we only store messages in memory. Here's an external demo of using a completely custom database solution to persist messages.
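A sketch of wiring up a keyv-backed store (assuming the constructor accepts a messageStore option, as the persistence demo suggests; the Redis URL is a placeholder):
import Keyv from 'keyv'
import { ChatGPTAPI } from 'chatgpt'

// requires the @keyv/redis adapter to be installed for a redis:// URL
const messageStore = new Keyv('redis://localhost:6379')

const api = new ChatGPTAPI({
  apiKey: process.env.OPENAI_API_KEY,
  messageStore
})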
Note: Persistence is handled automatically when using ChatGPTUnofficialProxyAPI because it is connecting indirectly to ChatGPT.
All of these awesome projects are built using the chatgpt package. 🤯
If you create a cool integration, feel free to open a PR and add it to the list.
- Make sure you're using node >= 14 and that fetch is available (or a fetch polyfill is installed).
- If you're calling chatgpt from client-side code, we recommend using it only from your backend API.

Previous Updates
Feb 19, 2023
We now provide three ways of accessing the unofficial ChatGPT API, all of which have tradeoffs:
Method | Free? | Robust? | Quality? |
---|---|---|---|
ChatGPTAPI | ❌ No | ✅ Yes | ☑️ Mimics ChatGPT |
ChatGPTUnofficialProxyAPI | ✅ Yes | ☑️ Maybe | ✅ Real ChatGPT |
ChatGPTAPIBrowser (v3) | ✅ Yes | ❌ No | ✅ Real ChatGPT |
Note: I recommend that you use either ChatGPTAPI or ChatGPTUnofficialProxyAPI.
- ChatGPTAPI - Uses text-davinci-003 to mimic ChatGPT via the official OpenAI completions API (most robust approach, but it's not free and doesn't use a model fine-tuned for chat)
- ChatGPTUnofficialProxyAPI - Uses an unofficial proxy server to access ChatGPT's backend API in a way that circumvents Cloudflare (uses the real ChatGPT and is pretty lightweight, but relies on a third-party server and is rate-limited)
- ChatGPTAPIBrowser - (deprecated; v3.5.1 of this package) Uses Puppeteer to access the official ChatGPT webapp (uses the real ChatGPT, but very flaky, heavyweight, and error prone)

Feb 5, 2023
OpenAI has disabled the leaked chat model we were previously using, so we're now defaulting to text-davinci-003, which is not free.
We've found several other hidden, fine-tuned chat models, but OpenAI keeps disabling them, so we're searching for alternative workarounds.
Feb 1, 2023
This package no longer requires any browser hacks; it is now using the official OpenAI completions API with a leaked model that ChatGPT uses under the hood. 🔥
import { ChatGPTAPI } from 'chatgpt'
const api = new ChatGPTAPI({
apiKey: process.env.OPENAI_API_KEY
})
const res = await api.sendMessage('Hello World!')
console.log(res.text)
Please upgrade to chatgpt@latest (at least v4.0.0). The updated version is significantly more lightweight and robust compared with previous versions. You also don't have to worry about IP issues or rate limiting.
Huge shoutout to @waylaidwanderer for discovering the leaked chat model!
If you run into any issues, we do have a pretty active Discord with a bunch of ChatGPT hackers from the Node.js & Python communities.
Lastly, please consider starring this repo and following me on Twitter to help support the project.
Thanks && cheers, Travis
Author: Transitive-bullshit
Source Code: https://github.com/transitive-bullshit/chatgpt-api
License: MIT license
FHIR_DB
This is really just a wrapper around Sembast_SQFLite - so all of the heavy lifting was done by Alex Tekartik. I highly recommend that if you have any questions about working with this package that you take a look at Sembast. He's also just a super nice guy, and even answered a question for me when I was deciding which sembast version to use. As usual, ResoCoder also has a good tutorial.
I have an interest in low-resource settings and thus a specific reason to be able to store data offline. To encourage this use, there are a number of other packages I have created based around the data format FHIR. FHIRĀ® is the registered trademark of HL7 and is used with the permission of HL7. Use of the FHIR trademark does not constitute endorsement of this product by HL7.
So, while not absolutely necessary, I highly recommend that you use some sort of interface class. This adds the benefit of more easily handling errors, plus if you change to a different database in the future, you don't have to change the rest of your app, just the interface.
I've used something like this in my projects:
class IFhirDb {
IFhirDb();
final ResourceDao resourceDao = ResourceDao();
Future<Either<DbFailure, Resource>> save(Resource resource) async {
Resource resultResource;
try {
resultResource = await resourceDao.save(resource);
} catch (error) {
return left(DbFailure.unableToSave(error: error.toString()));
}
return right(resultResource);
}
Future<Either<DbFailure, List<Resource>>> returnListOfSingleResourceType(
String resourceType) async {
List<Resource> resultList;
try {
resultList =
await resourceDao.getAllSortedById(resourceType: resourceType);
} catch (error) {
return left(DbFailure.unableToObtainList(error: error.toString()));
}
return right(resultList);
}
Future<Either<DbFailure, List<Resource>>> searchFunction(
String resourceType, String searchString, String reference) async {
List<Resource> resultList;
try {
resultList =
await resourceDao.searchFor(resourceType, searchString, reference);
} catch (error) {
return left(DbFailure.unableToObtainList(error: error.toString()));
}
return right(resultList);
}
}
I like this because in case there's an i/o error or something, it won't crash your app. Then, you can call this interface in your app like the following:
final patient = Patient(
resourceType: 'Patient',
name: [HumanName(text: 'New Patient Name')],
birthDate: Date(DateTime.now()),
);
final saveResult = await IFhirDb().save(patient);
This will save your newly created patient to the locally embedded database.
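Reading data back goes through the same interface; here is a sketch using the returnListOfSingleResourceType function defined above (fold comes from the dartz Either type the interface returns):
final listResult = await IFhirDb().returnListOfSingleResourceType('Patient');
listResult.fold(
  (failure) => print('Db failure: $failure'),
  (patients) => print('Found ${patients.length} patients'),
);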
IMPORTANT: this database will expect that all previously created resources have an id. When you save a resource, it will check to see if that resource type has already been stored. (Each resource type is saved in its own store in the database.) It will then check if there is an ID. If there's no ID, it will create a new one for that resource (along with metadata on version number and creation time), save it, and return the resource. If it already has an ID, it will copy the old version of the resource into a _history store. It will then update the metadata of the new resource and save that version into the appropriate store for that resource. If, for instance, we have a previously created patient:
{
"resourceType": "Patient",
"id": "fhirfli-294057507-6811107",
"meta": {
"versionId": "1",
"lastUpdated": "2020-10-16T19:41:28.054369Z"
},
"name": [
{
"given": ["New"],
"family": "Patient"
}
],
"birthDate": "2020-10-16"
}
If we then update the last name to 'Provider', the above version of the patient will be kept in _history, while the 'Patient' store in the db will hold the updated version:
{
"resourceType": "Patient",
"id": "fhirfli-294057507-6811107",
"meta": {
"versionId": "2",
"lastUpdated": "2020-10-16T19:45:07.316698Z"
},
"name": [
{
"given": ["New"],
"family": "Provider"
}
],
"birthDate": "2020-10-16"
}
This way we can keep track of all previous versions of all resources (which is obviously important in medicine).
Most of the interactions (saving, deleting, etc.) work the way you'd expect. The only difference is search. Because Sembast is NoSQL, we can search on any of the fields in a resource. If in our interface class we have the following function:
Future<Either<DbFailure, List<Resource>>> searchFunction(
String resourceType, String searchString, String reference) async {
List<Resource> resultList;
try {
resultList =
await resourceDao.searchFor(resourceType, searchString, reference);
} catch (error) {
return left(DbFailure.unableToObtainList(error: error.toString()));
}
return right(resultList);
}
You can search for all immunizations of a certain patient:
searchFunction(
'Immunization', 'patient.reference', 'Patient/$patientId');
This function will search through all entries in the 'Immunization' store. It will look at all 'patient.reference' fields, and return any that match 'Patient/$patientId'.
The last thing I'll mention is that this is a password-protected db, using AES-256 encryption (although it can also use Salsa20). Anytime you use the db, you have the option of using a password for encryption/decryption. Remember, if you set up the database using encryption, you will only be able to access it using that same password. When you're ready to change the password, you will need to call the update-password function. If we again assume we created a change-password method in our interface, it might look something like this:
class IFhirDb {
IFhirDb();
final ResourceDao resourceDao = ResourceDao();
...
Future<Either<DbFailure, Unit>> updatePassword(String oldPassword, String newPassword) async {
try {
await resourceDao.updatePw(oldPassword, newPassword);
} catch (error) {
return left(DbFailure.unableToUpdatePassword(error: error.toString()));
}
return right(Unit);
}
}
You don't have to use a password, and in that case, it will save the db file as plain text. If you want to add a password later, it will encrypt it at that time.
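With that method in place, rotating the password might look like this (a sketch; fold again comes from the dartz Either type, and the passwords are placeholders):
final result = await IFhirDb().updatePassword('oldPassword', 'newPassword');
result.fold(
  (failure) => print('Could not update password: $failure'),
  (_) => print('Password updated'),
);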
After using this for a while in an app, I've realized that it needs to be able to store data apart from just FHIR resources, at least on occasion. For this, I've added a second class for all versions of the database called GeneralDao. This is similar to the ResourceDao, but with fewer options. So, in order to save something, it would look like this:
await GeneralDao().save('password', {'new':'map'});
await GeneralDao().save('password', {'new':'map'}, 'key');
The difference between these two options is that the first one will generate a key for the map being stored, while the second will store the map using the key provided. Both will return the key after successfully storing the map.
Other functions available include:
// deletes everything in the general store
await GeneralDao().deleteAllGeneral('password');
// delete specific entry
await GeneralDao().delete('password','key');
// returns map with that key
await GeneralDao().find('password', 'key');
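Putting save and find together, a round trip might look like this (a sketch; the map contents are placeholders):
// save returns the generated key, which find() can then use
final key = await GeneralDao().save('password', {'some': 'data'});
final storedMap = await GeneralDao().find('password', key);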
FHIRĀ® is a registered trademark of Health Level Seven International (HL7) and its use does not constitute an endorsement of products by HL7Ā®
Run this command:
With Flutter:
$ flutter pub add fhir_db
This will add a line like this to your package's pubspec.yaml (and run an implicit flutter pub get):
dependencies:
fhir_db: ^0.4.3
Alternatively, your editor might support flutter pub get. Check the docs for your editor to learn more.
Now in your Dart code, you can use:
import 'package:fhir_db/dstu2.dart';
import 'package:fhir_db/dstu2/fhir_db.dart';
import 'package:fhir_db/dstu2/general_dao.dart';
import 'package:fhir_db/dstu2/resource_dao.dart';
import 'package:fhir_db/encrypt/aes.dart';
import 'package:fhir_db/encrypt/salsa.dart';
import 'package:fhir_db/r4.dart';
import 'package:fhir_db/r4/fhir_db.dart';
import 'package:fhir_db/r4/general_dao.dart';
import 'package:fhir_db/r4/resource_dao.dart';
import 'package:fhir_db/r5.dart';
import 'package:fhir_db/r5/fhir_db.dart';
import 'package:fhir_db/r5/general_dao.dart';
import 'package:fhir_db/r5/resource_dao.dart';
import 'package:fhir_db/stu3.dart';
import 'package:fhir_db/stu3/fhir_db.dart';
import 'package:fhir_db/stu3/general_dao.dart';
import 'package:fhir_db/stu3/resource_dao.dart';
import 'package:fhir/r4.dart';
import 'package:fhir_db/r4.dart';
import 'package:flutter/material.dart';
import 'package:test/test.dart';
Future<void> main() async {
WidgetsFlutterBinding.ensureInitialized();
final resourceDao = ResourceDao();
// await resourceDao.updatePw('newPw', null);
await resourceDao.deleteAllResources(null);
group('Playing with passwords', () {
test('Playing with Passwords', () async {
final patient = Patient(id: Id('1'));
final saved = await resourceDao.save(null, patient);
await resourceDao.updatePw(null, 'newPw');
final search1 = await resourceDao.find('newPw',
resourceType: R4ResourceType.Patient, id: Id('1'));
expect(saved, search1[0]);
await resourceDao.updatePw('newPw', 'newerPw');
final search2 = await resourceDao.find('newerPw',
resourceType: R4ResourceType.Patient, id: Id('1'));
expect(saved, search2[0]);
await resourceDao.updatePw('newerPw', null);
final search3 = await resourceDao.find(null,
resourceType: R4ResourceType.Patient, id: Id('1'));
expect(saved, search3[0]);
await resourceDao.deleteAllResources(null);
});
});
final id = Id('12345');
group('Saving Things:', () {
test('Save Patient', () async {
final humanName = HumanName(family: 'Atreides', given: ['Duke']);
final patient = Patient(id: id, name: [humanName]);
final saved = await resourceDao.save(null, patient);
expect(saved.id, id);
expect((saved as Patient).name?[0], humanName);
});
test('Save Organization', () async {
final organization = Organization(id: id, name: 'FhirFli');
final saved = await resourceDao.save(null, organization);
expect(saved.id, id);
expect((saved as Organization).name, 'FhirFli');
});
test('Save Observation1', () async {
final observation1 = Observation(
id: Id('obs1'),
code: CodeableConcept(text: 'Observation #1'),
effectiveDateTime: FhirDateTime(DateTime(1981, 09, 18)),
);
final saved = await resourceDao.save(null, observation1);
expect(saved.id, Id('obs1'));
expect((saved as Observation).code.text, 'Observation #1');
});
test('Save Observation1 Again', () async {
final observation1 = Observation(
id: Id('obs1'),
code: CodeableConcept(text: 'Observation #1 - Updated'));
final saved = await resourceDao.save(null, observation1);
expect(saved.id, Id('obs1'));
expect((saved as Observation).code.text, 'Observation #1 - Updated');
expect(saved.meta?.versionId, Id('2'));
});
test('Save Observation2', () async {
final observation2 = Observation(
id: Id('obs2'),
code: CodeableConcept(text: 'Observation #2'),
effectiveDateTime: FhirDateTime(DateTime(1981, 09, 18)),
);
final saved = await resourceDao.save(null, observation2);
expect(saved.id, Id('obs2'));
expect((saved as Observation).code.text, 'Observation #2');
});
test('Save Observation3', () async {
final observation3 = Observation(
id: Id('obs3'),
code: CodeableConcept(text: 'Observation #3'),
effectiveDateTime: FhirDateTime(DateTime(1981, 09, 18)),
);
final saved = await resourceDao.save(null, observation3);
expect(saved.id, Id('obs3'));
expect((saved as Observation).code.text, 'Observation #3');
});
});
group('Finding Things:', () {
test('Find 1st Patient', () async {
final search = await resourceDao.find(null,
resourceType: R4ResourceType.Patient, id: id);
final humanName = HumanName(family: 'Atreides', given: ['Duke']);
expect(search.length, 1);
expect((search[0] as Patient).name?[0], humanName);
});
test('Find 3rd Observation', () async {
final search = await resourceDao.find(null,
resourceType: R4ResourceType.Observation, id: Id('obs3'));
expect(search.length, 1);
expect(search[0].id, Id('obs3'));
expect((search[0] as Observation).code.text, 'Observation #3');
});
test('Find All Observations', () async {
final search = await resourceDao.getResourceType(
null,
resourceTypes: [R4ResourceType.Observation],
);
expect(search.length, 3);
final idList = [];
for (final obs in search) {
idList.add(obs.id.toString());
}
expect(idList.contains('obs1'), true);
expect(idList.contains('obs2'), true);
expect(idList.contains('obs3'), true);
});
test('Find All (non-historical) Resources', () async {
final search = await resourceDao.getAll(null);
expect(search.length, 5);
final patList = search.toList();
final orgList = search.toList();
final obsList = search.toList();
patList.retainWhere(
(resource) => resource.resourceType == R4ResourceType.Patient);
orgList.retainWhere(
(resource) => resource.resourceType == R4ResourceType.Organization);
obsList.retainWhere(
(resource) => resource.resourceType == R4ResourceType.Observation);
expect(patList.length, 1);
expect(orgList.length, 1);
expect(obsList.length, 3);
});
});
group('Deleting Things:', () {
test('Delete 2nd Observation', () async {
await resourceDao.delete(
null, null, R4ResourceType.Observation, Id('obs2'), null, null);
final search = await resourceDao.getResourceType(
null,
resourceTypes: [R4ResourceType.Observation],
);
expect(search.length, 2);
final idList = [];
for (final obs in search) {
idList.add(obs.id.toString());
}
expect(idList.contains('obs1'), true);
expect(idList.contains('obs2'), false);
expect(idList.contains('obs3'), true);
});
test('Delete All Observations', () async {
await resourceDao.deleteSingleType(null,
resourceType: R4ResourceType.Observation);
final search = await resourceDao.getAll(null);
expect(search.length, 2);
final patList = search.toList();
final orgList = search.toList();
patList.retainWhere(
(resource) => resource.resourceType == R4ResourceType.Patient);
orgList.retainWhere(
(resource) => resource.resourceType == R4ResourceType.Organization);
expect(patList.length, 1);
expect(patList.length, 1);
});
test('Delete All Resources', () async {
await resourceDao.deleteAllResources(null);
final search = await resourceDao.getAll(null);
expect(search.length, 0);
});
});
group('Password - Saving Things:', () {
test('Save Patient', () async {
await resourceDao.updatePw(null, 'newPw');
final humanName = HumanName(family: 'Atreides', given: ['Duke']);
final patient = Patient(id: id, name: [humanName]);
final saved = await resourceDao.save('newPw', patient);
expect(saved.id, id);
expect((saved as Patient).name?[0], humanName);
});
test('Save Organization', () async {
final organization = Organization(id: id, name: 'FhirFli');
final saved = await resourceDao.save('newPw', organization);
expect(saved.id, id);
expect((saved as Organization).name, 'FhirFli');
});
test('Save Observation1', () async {
final observation1 = Observation(
id: Id('obs1'),
code: CodeableConcept(text: 'Observation #1'),
effectiveDateTime: FhirDateTime(DateTime(1981, 09, 18)),
);
final saved = await resourceDao.save('newPw', observation1);
expect(saved.id, Id('obs1'));
expect((saved as Observation).code.text, 'Observation #1');
});
test('Save Observation1 Again', () async {
final observation1 = Observation(
id: Id('obs1'),
code: CodeableConcept(text: 'Observation #1 - Updated'));
final saved = await resourceDao.save('newPw', observation1);
expect(saved.id, Id('obs1'));
expect((saved as Observation).code.text, 'Observation #1 - Updated');
expect(saved.meta?.versionId, Id('2'));
});
test('Save Observation2', () async {
final observation2 = Observation(
id: Id('obs2'),
code: CodeableConcept(text: 'Observation #2'),
effectiveDateTime: FhirDateTime(DateTime(1981, 09, 18)),
);
final saved = await resourceDao.save('newPw', observation2);
expect(saved.id, Id('obs2'));
expect((saved as Observation).code.text, 'Observation #2');
});
test('Save Observation3', () async {
final observation3 = Observation(
id: Id('obs3'),
code: CodeableConcept(text: 'Observation #3'),
effectiveDateTime: FhirDateTime(DateTime(1981, 09, 18)),
);
final saved = await resourceDao.save('newPw', observation3);
expect(saved.id, Id('obs3'));
expect((saved as Observation).code.text, 'Observation #3');
});
});
group('Password - Finding Things:', () {
test('Find 1st Patient', () async {
final search = await resourceDao.find('newPw',
resourceType: R4ResourceType.Patient, id: id);
final humanName = HumanName(family: 'Atreides', given: ['Duke']);
expect(search.length, 1);
expect((search[0] as Patient).name?[0], humanName);
});
test('Find 3rd Observation', () async {
final search = await resourceDao.find('newPw',
resourceType: R4ResourceType.Observation, id: Id('obs3'));
expect(search.length, 1);
expect(search[0].id, Id('obs3'));
expect((search[0] as Observation).code.text, 'Observation #3');
});
test('Find All Observations', () async {
final search = await resourceDao.getResourceType(
'newPw',
resourceTypes: [R4ResourceType.Observation],
);
expect(search.length, 3);
final idList = [];
for (final obs in search) {
idList.add(obs.id.toString());
}
expect(idList.contains('obs1'), true);
expect(idList.contains('obs2'), true);
expect(idList.contains('obs3'), true);
});
test('Find All (non-historical) Resources', () async {
final search = await resourceDao.getAll('newPw');
expect(search.length, 5);
final patList = search.toList();
final orgList = search.toList();
final obsList = search.toList();
patList.retainWhere(
(resource) => resource.resourceType == R4ResourceType.Patient);
orgList.retainWhere(
(resource) => resource.resourceType == R4ResourceType.Organization);
obsList.retainWhere(
(resource) => resource.resourceType == R4ResourceType.Observation);
expect(patList.length, 1);
expect(orgList.length, 1);
expect(obsList.length, 3);
});
});
group('Password - Deleting Things:', () {
test('Delete 2nd Observation', () async {
await resourceDao.delete(
'newPw', null, R4ResourceType.Observation, Id('obs2'), null, null);
final search = await resourceDao.getResourceType(
'newPw',
resourceTypes: [R4ResourceType.Observation],
);
expect(search.length, 2);
final idList = [];
for (final obs in search) {
idList.add(obs.id.toString());
}
expect(idList.contains('obs1'), true);
expect(idList.contains('obs2'), false);
expect(idList.contains('obs3'), true);
});
test('Delete All Observations', () async {
await resourceDao.deleteSingleType('newPw',
resourceType: R4ResourceType.Observation);
final search = await resourceDao.getAll('newPw');
expect(search.length, 2);
final patList = search.toList();
final orgList = search.toList();
patList.retainWhere(
(resource) => resource.resourceType == R4ResourceType.Patient);
orgList.retainWhere(
(resource) => resource.resourceType == R4ResourceType.Organization);
expect(patList.length, 1);
expect(patList.length, 1);
});
test('Delete All Resources', () async {
await resourceDao.deleteAllResources('newPw');
final search = await resourceDao.getAll('newPw');
expect(search.length, 0);
await resourceDao.updatePw('newPw', null);
});
});
}
Download Details:
Author: MayJuun
Source Code: https://github.com/MayJuun/fhir/tree/main/fhir_db
1669952228
In this tutorial, you'll learn: What is Dijkstra's Algorithm and how Dijkstra's algorithm works with the help of visual guides.
You can use algorithms in programming to solve specific problems through a set of precise instructions or procedures.
Dijkstra's algorithm is one of many graph algorithms you'll come across. It is used to find the shortest path from a fixed node to all other nodes in a graph.
There are different representations of Dijkstra's algorithm. You can either find the shortest path between two nodes, or the shortest path from a fixed node to the rest of the nodes in a graph.
In this article, you'll learn how Dijkstra's algorithm works with the help of visual guides.
Before we dive into more detailed visual examples, you need to understand how Dijkstra's algorithm works.
Although the theoretical explanation may seem a bit abstract, it'll help you understand the practical aspect better.
In a given graph containing different nodes, we are required to get the shortest path from a given node to the rest of the nodes.
These nodes can represent any object like the names of cities, letters, and so on.
Between each node is a number denoting the distance between two nodes, as you can see in the image below:
We usually work with two arrays: one for visited nodes, and another for unvisited nodes. You'll learn more about the arrays in the next section.
When a node is visited, the algorithm calculates how long it took to get to the node and stores the distance. If a shorter path to a node is found, the initial value assigned for the distance is updated.
Note that a node cannot be visited twice.
The algorithm keeps running until all the nodes have been visited.
In this section, we'll take a look at a practical example that shows how Dijkstra's algorithm works.
Here's the graph we'll be working with:
We'll use the table below to put down the visited nodes and their distance from the fixed node:
NODE | SHORTEST DISTANCE FROM FIXED NODE |
---|---|
A | ∞ |
B | ∞ |
C | ∞ |
D | ∞ |
E | ∞ |
Visited nodes = []
Unvisited nodes = [A,B,C,D,E]
Above, we have a table showing each node and the shortest distance from that node to the fixed node. We are yet to choose the fixed node.
Note that the distance for each node in the table is currently denoted as infinity (∞). This is because we don't know the shortest distance yet.
We also have two arrays: visited and unvisited. Whenever a node is visited, it is added to the visited nodes array.
Let's get started!
To simplify things, I'll break the process down into iterations. You'll see what happens in each step with the aid of diagrams.
The first iteration might seem confusing, but that's totally fine. Once we start repeating the process in each iteration, you'll have a clearer picture of how the algorithm works.
Step #1 - Pick an unvisited node
We'll choose A as the fixed node. So we'll find the shortest distance from A to every other node in the graph.
We're going to give A a distance of 0 because it is the initial node. So the table would look like this:
NODE | SHORTEST DISTANCE FROM FIXED NODE |
---|---|
A | 0 |
B | ∞ |
C | ∞ |
D | ∞ |
E | ∞ |
Step #2 - Find the distance from current node
The next thing to do after choosing a node is to find the distance from it to the unvisited nodes around it.
The two unvisited nodes directly linked to A are B and C.
To get the distance from A to B:
0 + 4 = 4
0 being the value of the current node (A), and 4 being the distance between A and B in the graph.
To get the distance from A to C:
0 + 2 = 2
Step #3 - Update table with known distances
In the last step, we got 4 and 2 as the values of B and C respectively. So we'll update the table with those values:
NODE | SHORTEST DISTANCE FROM FIXED NODE |
---|---|
A | 0 |
B | 4 |
C | 2 |
D | ∞ |
E | ∞ |
Step #4 - Update arrays
At this point, the first iteration is complete. We'll move node A to the visited nodes array:
Visited nodes = [A]
Unvisited nodes = [B,C,D,E]
Before we proceed to the next iteration, you should know the following:
Step #1 - Pick an unvisited node
We have four unvisited nodes: [B,C,D,E]. So how do you know which node to pick for the next iteration?
Well, we pick the node with the smallest known distance recorded in the table. Here's the table:
NODE | SHORTEST DISTANCE FROM FIXED NODE |
---|---|
A | 0 |
B | 4 |
C | 2 |
D | ∞ |
E | ∞ |
So we're going with node C.
Step #2 - Find the distance from current node
To find the distance from the current node to the fixed node, we have to consider the nodes linked to the current node.
The nodes linked to the current node are A and B.
But A has been visited in the previous iteration, so it will not be linked to the current node. That leaves only B.
To find the distance from C to B:
2 + 1 = 3
Here, 2 is the recorded distance for node C, while 1 is the distance between C and B in the graph.
Step #3 - Update table with known distances
In the last step, we got the value of B to be 3. In the first iteration, it was 4.
We're going to update the distance in the table to 3.
NODE | SHORTEST DISTANCE FROM FIXED NODE |
---|---|
A | 0 |
B | 3 |
C | 2 |
D | ∞ |
E | ∞ |
So, A --> B = 4 (First iteration).
A --> C --> B = 3 (Second iteration).
The algorithm has helped us find the shortest path to B from A.
Step #4 - Update arrays
We're done with the last visited node. Let's add it to the visited nodes array:
Visited nodes = [A,C]
Unvisited nodes = [B,D,E]
Step #1 - Pick an unvisited node
We're down to three unvisited nodes: [B,D,E]. From the table, B has the shortest known distance.
To restate what is going on in the diagram above:
Step #2 - Find the distance from current node
The nodes linked to the current node are D and E.
B (the current node) has a value of 3. Therefore,
For node D, 3 + 3 = 6.
For node E, 3 + 2 = 5.
Step #3 - Update table with known distances
NODE | SHORTEST DISTANCE FROM FIXED NODE |
---|---|
A | 0 |
B | 3 |
C | 2 |
D | 6 |
E | 5 |
Step #4 - Update arrays
Visited nodes = [A,C,B]
Unvisited nodes = [D,E]
Step #1 - Pick an unvisited node
Like other iterations, we'll go with the unvisited node with the shortest known distance. That is E.
Step #2 - Find the distance from current node
According to our table, E has a value of 5.
For D in the current iteration,
5 + 5 = 10.
The value obtained for D here is 10, which is greater than the recorded value of 6 from the previous iteration. For this reason, we'll not update the table.
Step #3 - Update table with known distances
Our table remains the same:
NODE | SHORTEST DISTANCE FROM FIXED NODE |
---|---|
A | 0 |
B | 3 |
C | 2 |
D | 6 |
E | 5 |
Step #4 - Update arrays
Visited nodes = [A,C,B,E]
Unvisited nodes = [D]
Step #1 - Pick an unvisited node
We're currently left with one node in the unvisited array: D.
Step #2 - Find the distance from current node
The algorithm has gotten to the last iteration. This is because all nodes linked to the current node have been visited already so we can't link to them.
Step #3 - Update table with known distances
Our table remains the same:
NODE | SHORTEST DISTANCE FROM FIXED NODE |
---|---|
A | 0 |
B | 3 |
C | 2 |
D | 6 |
E | 5 |
At this point, we have updated the table with the shortest distance from the fixed node to every other node in the graph.
Step #4 - Update arrays
Visited nodes = [A,C,B,E,D]
Unvisited nodes = []
As can be seen above, we have no nodes left to visit. Using Dijkstra's algorithm, we've found the shortest distance from the fixed node to the other nodes in the graph.
The pseudocode example in this section was taken from Wikipedia. Here it is:
function Dijkstra(Graph, source):
    for each vertex v in Graph.Vertices:
        dist[v] ← INFINITY
        prev[v] ← UNDEFINED
        add v to Q
    dist[source] ← 0

    while Q is not empty:
        u ← vertex in Q with min dist[u]
        remove u from Q

        for each neighbor v of u still in Q:
            alt ← dist[u] + Graph.Edges(u, v)
            if alt < dist[v]:
                dist[v] ← alt
                prev[v] ← u

    return dist[], prev[]
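To make the pseudocode concrete, here is a rough translation into R. Everything in it (the dijkstra() function, the adjacency-matrix representation, and the use of Inf to mean "no edge") is a choice made for this sketch rather than something from the article itself:
dijkstra <- function(adj, source) {
  n <- nrow(adj)
  dist <- rep(Inf, n)              # shortest known distance to each node
  prev <- rep(NA_integer_, n)      # previous node on the best path
  dist[source] <- 0
  unvisited <- seq_len(n)          # plays the role of Q in the pseudocode
  while (length(unvisited) > 0) {
    u <- unvisited[which.min(dist[unvisited])]  # smallest known distance
    unvisited <- setdiff(unvisited, u)
    for (v in unvisited) {
      if (is.finite(adj[u, v])) {  # only relax real edges
        alt <- dist[u] + adj[u, v]
        if (alt < dist[v]) {
          dist[v] <- alt
          prev[v] <- u
        }
      }
    }
  }
  list(dist = dist, prev = prev)
}
# The worked example's graph: A-B = 4, A-C = 2, C-B = 1, B-D = 3, B-E = 2, D-E = 5.
adj <- matrix(Inf, 5, 5, dimnames = list(LETTERS[1:5], LETTERS[1:5]))
adj["A", "B"] <- adj["B", "A"] <- 4
adj["A", "C"] <- adj["C", "A"] <- 2
adj["C", "B"] <- adj["B", "C"] <- 1
adj["B", "D"] <- adj["D", "B"] <- 3
adj["B", "E"] <- adj["E", "B"] <- 2
adj["D", "E"] <- adj["E", "D"] <- 5
dijkstra(adj, source = 1)$dist
## [1] 0 3 2 6 5
The distances match the final table: 0 for A, 3 for B, 2 for C, 6 for D, and 5 for E.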
Here are some of the common applications of Dijkstra's algorithm:
In this article, we talked about Dijkstra's algorithm. It is used to find the shortest distance from a fixed node to all other nodes in a graph.
We started by giving a brief summary of how the algorithm works.
We then had a look at an example that further explained Dijkstra's algorithm in steps using visual guides.
We concluded with a pseudocode example and some of the applications of Dijkstra's algorithm.
Happy coding!
Original article source at https://www.freecodecamp.org
#algorithm #datastructures
1667421600
1 Introduction
What follows is an account of my experiences from about one year of roughly daily R usage. It began as a list of things that I liked and disliked about the language, but grew to be something huge. Once the list exceeded ten thousand words, I knew that it must be published. By the time it was ready, it had nearly tripled in length. It took five months of weekends just to get it all in R Markdown.
This isn't an attack on R or a pitch for anything else. It is only an account of what I've found to be right and wrong with the language. Although the length of my list of what is wrong far exceeds that of what is right, that may be my failing rather than R's. I suspect that my list of what R does right will grow as I learn other languages and begin to miss some of R's benefits. I welcome any attempts to correct this or any other errors that you find. Some major errors will have slipped in somewhere or other.
To start, I must issue a warning: This document is huge. I have tried to keep everything contained in small sections, such that the reader has plenty of points where they can pause and return to the document later, but the word count is still far higher than I'm happy with. I have tried to not be too petty, but every negative point in here comes from an honest position of frustration. There are some things that I really love about R. I've even devoted an entire section to them. However, if there is one point that I really want this document to get across, it's that R is filled to the brim with small madnesses. Although I can name a few major issues with R, its ultimate problem is the sum of its little problems. This document couldn't be short.
Also, on the topic of the sections in this document, watch out for all of the internal links. Nothing in R Markdown makes them look distinct from external ones, so you might lose your place if you don't take care to open all of your links in a new tab/window.
Before I say anything nasty about R, a show of good faith is in order. In my year with R, I have done the following:
At minimum, I can say with confidence that unless I happen to pick up an R-focused statistics textbook (the R FAQ has some tempting items), I've already done all of the R-related reading that I ever plan to do. All that is left for me is to use the language more and more. I hope that this section shows that I've given it a good chance before writing this review of it.
I am not an R expert. I freely admit that I am lacking in the following regards:
- R's formula syntax (e.g. foo ~ log(bar) * bar^2), the plot() function, and factor variables: I am less familiar with these than I ought to be. I saw a lot of them during my degree, but have long since forgotten them and have never needed to really pick them back up. For similar reasons, I have nothing to say on how hard it can sometimes be to read data in to R.
- data.table: From what little I've seen, it's a real pleasure. More practice with ggplot2, the wider Tidyverse, and R Markdown is also in order. If I continue to use R, I will gradually master these. For now, it suffices to say that my experience with base R far exceeds my knowledge of both the Tidyverse and many other well-loved packages. If I've missed any gems, let me know.
- dplyr: Much of my R usage has been without it loaded.
- Package development (e.g. with roxygen2): I have no plans for this.

The above list is unlikely to be exhaustive. I'm not against reading another book about R as a programming language, but Advanced R seems to be the only one that anyone ever mentions. For the foreseeable future, the main thing that I plan to do to improve my evaluation of R is to learn Python. I'll probably read a book on it.
You'd be a fool to read this without some experience of R. I don't think that I've written anything that requires an expert level of understanding, but you're unlikely to get much out of this document without at least a basic idea of R. I've also mentioned the Tidyverse a few times without giving it much introduction, particularly its tibble package. If you care enough about R to consider reading this document, then you really ought to be familiar with the most popular parts of the Tidyverse. It's rare for any discussion of R to go long without some mention of purrr, dplyr or magrittr.
This document started out as personal notes that I had no intention of publishing. There's a good chance that I might have copy and pasted someone's example from somewhere and totally forgot that it wasn't my own. If you spot any plagiarism, let me know.
My overall feelings about R are tough to quantify. As I mentioned near the start, its ultimate problem is the sum of its little problems. However, if I must speak generally, then I think that the problem with R is that it's always some mix of the following:
When it's anything but #3, R is great. Statisticians and mathematicians love it for #1 and programmers love it for #2 and #4. If it weren't for #3, R would be an amazing (albeit domain-specific) language, but #3 is such a big factor that it makes the language unpredictable, inconsistent, and infuriating. Mixed with #4, it makes being an R novice hellish. It gives me little doubt that R is not the ideal tool for many of the jobs that it wants to do, but #1 and #2 leave me with equally little doubt that R can be a very good tool.
3 What R Does Right
As a final show of good faith, here is what I think R does right. In summary, along with having some great functional programming toys, R has some domain-specific tools that can work excellently when they're in their element. Whatever the faults of R, it's always going to be my first choice for some problems.
R wants to be a mathematics and statistics tool. Many of its fundamental design choices support this. For example, vectors are primitive types and R isn't at all shy about giving you a table or matrix as output. Similarly, the base libraries are packed with maths and stats functions that are usually a good combination of relevant, generic, and helpful. Some examples:
Lots of stats is made easy. Commands like boxplot(data) or quantile(data) just work, and there are lots of handy functions like colSums(), table(), cor(), or summary().
R is the language of research-level statistics. If it's stats, R either has it built-in or has a library for it. It's impossible to visit a statistics Q&A website and not see R code. For this reason alone, R will never truly die.
The generic functions in the base stats library work magic. Whenever you try to print or summarise a model from there, you're going to get all of the details that you could ever realistically ask for, and you're going to get them presented in a very helpful way. For example
model <- lm(mpg ~ wt, data = mtcars)
print(model)
##
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
##
## Coefficients:
## (Intercept) wt
## 37.285 -5.344
summary(model)
##
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.5432 -2.3647 -0.1252 1.4096 6.8727
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.2851 1.8776 19.858 < 2e-16 ***
## wt -5.3445 0.5591 -9.559 1.29e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.046 on 30 degrees of freedom
## Multiple R-squared: 0.7528, Adjusted R-squared: 0.7446
## F-statistic: 91.38 on 1 and 30 DF, p-value: 1.294e-10
shows us plenty of useful information and works just as well even if we change to another type of model. Your mileage may vary with packages, but it usually works as expected. Other examples are easy to come by, e.g. plot(model).
The rules for subsetting data, although requiring mastery, are extremely expressive. Coupled with sub-assignment tricks like result[result < 0.5] <- 0, which often do exactly what you think they will, you can really save yourself a lot of work. Being able to demand precisely which parts of your data you want to see or change is a really great feature.
The factor and ordered data types are definitely the sort of tools that I want to have in a stats language. They're a bit unpredictable, but they're great when they work.
It's no surprise that an R terminal has fully replaced my OS's built-in calculator. It's my first choice for any arithmetical task. When checking a gaming problem, I once opened R and used (0.2 * seq(1000, 1300, 50) + 999) / seq(1000, 1300, 50). That would've been several lines in many other languages. Furthermore, a general-purpose language that was capable of the same would've had a call to something long-winded like math.vec.seq() rather than just seq(). I find the cumulative functions, e.g. cumsum() and cummax(), similarly enjoyable.
How many other languages have matrix algebra fully built-in? Solving systems of linear equations is just solve().
The rep() function is outstandingly versatile. I'd give examples, but those found in its documentation are more than sufficient. Open up R and run example(rep) if you want to see them. If tricks like cbind(rep(1:6, each = 6), rep(1:6, times = 6)) have yet to become second nature, then you're really missing out.
On top of replacing your computer's calculator, R can replace your graphing calculator as well. Unless you need to tinker with the axes or stop the asymptotes causing you problems (problems that your graphing calculator would give you anyway), functions like curve(x / (x^3 + 9), -10, 10) do exactly what you would expect, exactly how you would expect it.
These seem like trivial features, but the language's deep integration of them is extremely beneficial for manipulating and presenting your data. They assist subsetting, variable creation, plotting, printing, and even metaprogramming.
The ability to name the components of vectors, e.g. c(Fizz=3, Buzz=5), is a nice trick for toy programs. The same syntax is used to much greater effect with lists, data frames, and S4 objects. However, it's good to show how far you can get with even the most basic case. Here's my submission for a General FizzBuzz task:
namedGenFizzBuzz <- function(n, namedNums)
{
factors <- sort(namedNums)#Required by the task: We must go from least factor to greatest.
for(i in 1:n)
{
isFactor <- i %% factors == 0
print(if(any(isFactor)) paste0(names(factors)[isFactor], collapse = "") else i)
}
}
namedNums <- c(Fizz=3, Buzz=5, Baxx=7)#Notice that we can name our inputs without a function call.
namedGenFizzBuzz(105, namedNums)
I've little doubt that an R guru could improve this, but the amount of expressiveness in each line is already impressive. A lot of that is owed to R's love for names.
Having a tabular data type in your base library (the data frame) is very handy for when you want a nice way to present your results without having to bother importing anything. Due to this and the aforementioned ability to name vectors, my output in coding challenges often looks nicer than most other people's.
I like how data frames are constructed. Even if you don't know any R at all, it's pretty obvious what data.frame(who = c("Alice", "Bob"), height = c(1.2, 2.3)) produces and what adding the row.names = c("1st subject", "2nd subject") argument would do.
As a non-trivial example of how far these features can get you, I've had some good fun making alists out of syntactically valid expressions and using only those alists to build a data frame where both the expressions and their evaluated values are shown:
expressions <- alist(-x ^ p, -(x) ^ p, (-x) ^ p, -(x ^ p))
x <- c(-5, -5, 5, 5)
p <- c(2, 3, 2, 3)
output <- data.frame(x,
p,
setNames(lapply(expressions, eval), sapply(expressions, deparse)),
check.names = FALSE)
print(output, row.names = FALSE)
## x p -x^p -(x)^p (-x)^p -(x^p)
## -5 2 -25 -25 25 -25
## -5 3 125 125 125 125
## 5 2 -25 -25 25 -25
## 5 3 -125 -125 -125 -125
(stolen from my submission here). Did you notice that the output knew the names of x and p without being told them? Did you also notice that a similar thing happened after our call to curve() earlier on? Finally, did you notice how easy it was to get such neat output?
I've already admitted a great deal of ignorance of this topic, but there are some parts of R's ecosystem that I'm happy to call outstanding. The below are all things that I'm sure to miss in other languages.
- corrplot: It has fewer than ten functions, but it only needed one to blow my mind. Once you've even as much as read the introduction, you will never try to read a correlation matrix again.
- ggplot2: I'm not experienced enough to know what faults it has, but it's fun to use. That single fact makes it blow any other graphing software that I've used out of the water: It's fun.
- magrittr: It sold me on pipes. I'd say that any package that makes you consider changing your programming style is automatically outstanding. However, the real reason why I love it is because whenever I've run bigLongExpression() in my console and decided that I really wanted foo() of it, it's so much easier to press the up arrow and type CTRL+SHIFT+M+"foo" than it is to do anything that results in foo(bigLongExpression()) appearing. Maybe there's a keyboard shortcut that I never learned, but this isn't the only reason why I love magrittr. I'll say more about it much later.
- R Markdown has served me well in writing this document. It's buggier than I'd like, rarely has helpful error messages, and does things that I can't explain or fix even after setting a bounty on Stack Overflow, but it's still a great way to make a document from R. It's the closest thing that I know of to an R user's LaTeX. I had to wait on this bug fix before I could start numbering my sections. Hopefully it didn't break anything.

When it's not causing you problems, the vectorization can be the best thing about the language:
The vector recycling rules are powerful when mastered. Expressions like c("x", "y")[rep(c(1, 2), times = 4)] let you do a lot with only a little work. My favourite ever FizzBuzz could well be
x <- paste0(rep("", 100), c("", "", "Fizz"), c("", "", "", "", "Buzz"))
cat(ifelse(x == "", 1:100, x), sep = "\n")
I wish that I could claim credit for that, but I stole it from an old version of this page and improved it a little.
Basically everything is a vector, so R comes with some great vector-manipulation tools like ifelse() (seen above) and makes it very easy to use a function on a collection. Can you believe that mtcars / 20 actually works?
Tricks like array / seq_along(array) save a lot of loop writing.
Even simple things like being able to subtract a vector from a constant (e.g. 10 - 1:5) and get a sensible result are a gift when doing mathematics.
Vectorization of functions is sometimes very useful, particularly when it lets you do what should've been two loops worth of work in one line. You'd be amazed by how often you can get away with calling foo(1:100) without needing to vectorize foo() yourself.
R's done a good job of harnessing the power of functional languages while maintaining a C-like syntax. It makes no secret of being inspired by Scheme and has reaped many of its benefits.
It's impossible to not notice that functions are first-class in R. You're almost forced to learn functional programming idioms like mapping functions, higher-order functions, and anonymous functions. This is a good thing. Where else do you find a language with enough useful higher-order functions for the community to be able to discourage new users from writing loops? Some examples:
- Map(), Filter(), and Reduce(). Once you're used to them, you can write some very expressive code.
- apply() and tapply() can produce some very concise code, as can related functions like by().
- lapply(listOfFuns, function(f) f(1:10)) is entirely valid. It calls each function in listOfFuns with the entire vector 1:10 as their first argument.
- Vectorize(foo)(1:100) is not particularly hard to understand, but I'd struggle to name another language that lets me do the same thing with so much ease.

Not only are functions first-class in R, environments are too. You therefore have lots of control over what environment an expression is evaluated in. This is an amazing source of power that tends to scare off beginners, but I cannot overstate how much of an asset it can be. If you're not familiar with the below, look it up. You will not regret it.
- with() and within() can generate them on the fly. I've seen this called "data masking". Advanced R has a whole chapter on it. It lets you do things like "treat this list of functions as if it were a namespace, so I can write code that uses function names that I wouldn't dare use elsewhere". This can also be used with data. For example, tapply(mtcars$mpg, list(mtcars$cyl, mtcars$gear), mean) uses mtcars far too many times, but with(mtcars, tapply(mpg, list(cyl, gear), mean)) gives us an easy fix. Ad-hoc namespaces are an amazing thing to have, particularly when using functions that don't have a data argument (e.g. plot()).
- Modelling functions like lm() use the data-masking facilities that I've just described, as do handy functions like subset(). This saves incredible amounts of typing and massively increases the readability of your stats code. For example, aggregate(mpg ~ cyl + gear, mtcars, mean) returns very similar output to my above calls to tapply() without needing the complexity of using with(). It also allows for ridiculously concise code like aggregate(. ~ cyl + gear, mtcars, mean).
- Can you guess what transform(airquality, new = -Ozone, Temp = (Temp-32)/1.8) does? Being able to do all of that in one line is outstanding. Without R allowing developers to add new functions like this, the Tidyverse would've been impossible.

You might have spotted a pattern by now. R often lets you do very much with very little.
Generic function OO is pretty nice to have, even if I wouldn't use anything more complicated than S3. Being able to call foo(whatever) and be confident that it's going to do what I mean is always nice. Some positives of R's approach are:
As mentioned earlier on, S3 is used excellently in the base R stats library. Functions like print(), plot(), and summary() almost always tell me everything that I wanted to know, and tell it to me with great clarity.
When you're not trapped by the technicalities, S3 is an outstandingly simple tool that does exactly what R needs it to do. Have a look at all of the methods that the pre-loaded libraries define for plot()
methods(plot)
## [1] plot.acf* plot.data.frame* plot.decomposed.ts*
## [4] plot.default plot.dendrogram* plot.density*
## [7] plot.ecdf plot.factor* plot.formula*
## [10] plot.function plot.hclust* plot.histogram*
## [13] plot.HoltWinters* plot.isoreg* plot.lm*
## [16] plot.medpolish* plot.mlm* plot.ppr*
## [19] plot.prcomp* plot.princomp* plot.profile.nls*
## [22] plot.raster* plot.spec* plot.stepfun
## [25] plot.stl* plot.table* plot.ts
## [28] plot.tskernel* plot.TukeyHSD*
## see '?methods' for accessing help and source code
Because a statistician often only needs to dispatch on the type of model being used, S3 is the perfect tool to make functions like plot() easy to extend, meaning that it's easy to make it give your users exactly what they want. This isn't just theoretical either. The output for methods(plot) gets a lot longer if I go through my list of packages and start loading some random number of them. Go try it yourself!
S3 generics and objects are very easy to write. The trade-off is that they don't do anything to protect you from yourself. However, being able to tell R to shut up and do what I want it to is a nice part of S3.
I like the idea of S3's group generics, but I don't like not being able to make my own. However, I think that you can do it for S4.
I have it on good authority that biology people often need to dispatch on more than one type of model at a time. This means that they shower the S4 object system with greater praise than what I've just given S3. Apparently, the bioconductor package is the outstanding example of their love of it.
S4 has multiple inheritance and multiple dispatch. I'm not going to say that multiple inheritance is a good thing, but it's not always found in other OOP systems.
RC and the R6 package are about as close as you're ever going to get to having Java-like OOP in a mostly functional language.
Some of the syntax is nice:
- The : operator is handy for things like for(i in 1:20){...}.
- The for loop syntax is always the same: for(element in vector){...}. This means that there is no difference between the typical "do n times" case like for(i in 1:n) and the "for every member of this collection" case like for(v in sample(20)). I appreciate the consistency.
- The ... notation has a very nice "do what I mean" feel, particularly when you're playing around with anonymous functions.
- Thanks to repeat loops, you never need to write while(TRUE).
- array[c(i, j)] <- array[c(j, i)] swaps elements i and j in a very clean way.
- Assignments can be chained, e.g. Alice <- Bob <- "Married". The best examples are when you do something like lastElement <- output[lastIndex <- lastIndex + 1] <- foo, letting you avoid having to do anything twice.
- Whatever my complaints about <- and <<-, having environments use a subset of the list syntax was a very good idea. It was a similarly good idea to have a lot of R's internals (e.g. quoted function calls) be pairlists. This lets them be manipulated in exactly the same way as lists. The similarities between lists, pairlists, environments, and data frames go deeper than you may expect. For example, the eval() function lets you evaluate an expression in the specified environment, but it's happy to take any of the data types that I've just listed in place of an environment. At times, R almost lets you forget that lists and environments aren't the same thing.
- Given setClass() and setGeneric(), you can probably guess what the corresponding function for methods is called.
- The built-in letters and LETTERS come in handy surprisingly often. You'll see me use them a lot.
- There are handy binning functions like findInterval() and cut().
- The na.print argument to print(), trivial as it is, can be a thing of beauty.
- There are finally blocks in tryCatch(). The only real oddity of the system is that its conditionals are treated as functions of the error, meaning that you will have to write strange code like tryCatch(code, error = function(unused) "error", warning = function(unused) "warning"). However, this is the price that you pay for being able to use code like tryCatch(code, myError = function(e) paste0(e$message, " was triggered by ", e$call, ". Try ", e$suggestion)). As a final point of interest, I've heard that R's condition handling system is one of the best copies of Common Lisp's, which I've heard awesome things about.

4 What R Does Wrong
This is where this document starts to get long. Brace yourself. I really don't want to give off the impression that I hate R, but there are just too many things wrong with it. Again, R's ultimate problem is the sum of its small madnesses. No language is perfectly consistent or without compromises, but R's choices of compromises and inconsistencies are utterly unpredictable. I could deal with a handful of problems like the many that will follow, but this is far more than a handful.
We'll start gentle. R's list type is an unavoidable part of the language, but it's very strange. As the following examples show, it's frequently a special case that you can rarely avoid.
https://stackoverflow.com/questions/2050790/ does a good job of demonstrating that the list type is not like anything that another language would prepare you for. It and its many answers are very much worth a read.
Lists are the parent class of data frames. Data frames are mandatory for anyone who wants to do stats in R and most of the problems with lists are inherited by data frames. This makes the oddities of lists unavoidable.
Particularly when extracting single elements of lists, you need to be vigilant for whether R is going to give what you wanted or the list containing what you wanted. Most of this comes down to learning the distinction between [ and [[, and between sapply() and lapply(). It's not too bad, but it's a complication.
Because they won't attempt to coerce your inputs to a common type and because, unless you count matrices, you cannot nest vectors (e.g. c(c(1, 2), c(3, 4)) is 1:4), lists are what you're most likely to use when you want to put two or more vectors in one place. A lot of your lists will therefore be nested structures. This is not inherently a problem, but extracting elements from nested structures is hard, both in a general sense and specifically for R's lists. R does little to help you with this. Give https://stackoverflow.com/q/9624169/ and some of its answers a read. Why does this simple question get seven different answers? Do we really need libraries, anonymous functions, or knowing that [ is a function, just for what ought to be a commonplace operation?
Some common R functions do not work properly with lists. Some functions, like sort() and order(), will not work at all, even if your list only contains numbers, and others will work but misbehave. For example, what do you expect to get from c(someList, someVectorWithManyElements)? You might expect a list that is now one item longer. Instead, you get your original list with every element of the vector appended to it in new slots, i.e. a list that is length(someVectorWithManyElements) longer.
c(list(1, 2, 3), LETTERS[1:5])
## [[1]]
## [1] 1
##
## [[2]]
## [1] 2
##
## [[3]]
## [1] 3
##
## [[4]]
## [1] "A"
##
## [[5]]
## [1] "B"
##
## [[6]]
## [1] "C"
##
## [[7]]
## [1] "D"
##
## [[8]]
## [1] "E"
The same output is given by append(). To get list(1, 2, 3, LETTERS[1:5]), you must do something like x <- list(1, 2, 3); x[[4]] <- LETTERS[1:5].
Note the use of [[4]] and not [4] above. Using [4] gets you a warning and the output list(1, 2, 3, "A"). The [ version is intended for cases like x[4:8] <- LETTERS[1:5], which gives the same output as c() did above. The [/[[ distinction is a beginner's nightmare, as is R's tendency to give you many ways to do the same thing.
Primarily due to the commonality of data frames, R has a handful of functions that are essentially "foo, but the list version". lapply() is the most obvious example.
A few functions, such as strsplit(), can catch you off guard by returning a list when there's no obvious reason why a vector or matrix wouldn't have done. For strsplit() in particular, I think that the idea is that it's designed to be used on character vectors of lengths greater than one. However, in my experience, I almost always want a length-one version. I'd far rather have that function and lapply()/sapply()/whatever it as need be than have to constantly use strsplit("foo")[[1]]. Similarly, some functions, e.g. merge(), insist on returning data frames even when the inputs were matrices. Coercing these unwanted outputs in to what you actually wanted is often harder than it has any right to be.
I think that the ultimate problem with lists is that the right way to use them is not easy to guess from your knowledge of the language's other constructs. If everything in R worked like lists do, or if lists weren't so common, then you wouldn't really mind. As it is, you'll often make mistakes with lists and have to guess your way through correcting them. This isn't terrible. It's just annoying.
R's strings suck. The overarching problem is that because there is no language-level distinction between character vectors and their individual elements, R's vectorization means that almost everything that you want to do with a string needs to be done by knowing the right function to use (rather than by using R's ordinary syntax). I find that the correct functions can be hard to find and use. Although it doesn't fix many of these issues, the common sentiment of "just use stringr/stringi" is difficult to dismiss.
Technically, R doesn't even have a type for strings. You would want a string to be a vector of characters, but R's characters are already vectors, so R can't have a normal string type. Despite this, the documentation often uses the word "string". The language definition will tell you how to make sense of that, but I don't think that information is found anywhere in the sorts of documentation that you'll find in your IDE.
It's a pain to have to account for how R has two types of empty string: character(0) and "".
Character vectors aren't trivially split in to the characters that make each element. For example, "dog"[1] is "dog" because "dog" is a vector of length one. The idiomatic way to split up a string in to its characters, strsplit("dog", ""), returns a list, so rather than just getting the "d" from "dog" by doing "dog"[1], you have to do something like unlist(strsplit("dog", ""))[1] or strsplit("dog", "")[[1]][1]. The substr() function can serve you better for trivial jobs, but you often need strsplit().
Here's a challenge: Find the function that checks if "es" is in "test". You'll be on for a while.
Many of R's best functions for handling strings expect you to know regex and are all documented in the same place (grep {base}, titled "Pattern Matching and Replacement"). If you don't know regex (exactly what I'd expect of R's target audience), then you're thrice damned:
- You won't know that the functions documented under ?grep are probably what you need.
- A glance at their documentation suggests that they're difficult materials and therefore presumably not required for your task.
- The help leaves you asking questions like "what does regexpr() mean and how does it relate to regexec()?", leaving you with no straws to clutch at.
answer to this question to the base R answers. Or better yet, use gregexpr()
or gregexec()
for any task and then tell me with a straight face that you both understand their outputs and find them easy to work with.
gregexpr("a", c("greatgreat", "magic", "not"))
## [[1]]
## [1] 4 9
## attr(,"match.length")
## [1] 1 1
## attr(,"index.type")
## [1] "chars"
## attr(,"useBytes")
## [1] TRUE
##
## [[2]]
## [1] 2
## attr(,"match.length")
## [1] 1
## attr(,"index.type")
## [1] "chars"
## attr(,"useBytes")
## [1] TRUE
##
## [[3]]
## [1] -1
## attr(,"match.length")
## [1] -1
## attr(,"index.type")
## [1] "chars"
## attr(,"useBytes")
## [1] TRUE
The most useful function for printing strings counter-intuitively seems to be cat() rather than print() or format(). For example, print() ignores your \n characters. The only time where print() comes in really handy for string-related stuff is when your inputs are either quoted or lists. In both cases, print() accepts these but cat() does not. Without significant coercion (mostly deparse()), I've yet to find a way to mix \n with quoted input. Most of my attempts to do anything clever that mixes printing and lists end with me giving up and using a data frame.
Without defining a new operator, you can't add strings in the way that other languages have taught you to, i.e. "a"+"b". John Chambers is against fixing this. I'm not convinced that he's wrong, but it is annoying.
If you're converting numbers to characters, or using a function like nchar() that's meant for characters but accepts numbers, a shocking number of things will break when your numbers get big enough for R to automatically start using scientific notation.
nchar(10000)
## [1] 5
nchar(100000)
## [1] 5
a <- 10000
nchar(a) == nchar(a * 100)
## [1] TRUE
You're supposed to use format() to coerce these sorts of numbers in to characters, but you won't know about that until something breaks, and nchar()'s documentation doesn't seem to mention it (try ?as.character). The format() function also has a habit of requiring you to set the right flags to get what you want. trim = TRUE comes up a lot. If you're using a package or unfamiliar function, you're forced to check to see if the author dealt with these issues before you use their work. I'd rather just have a generic nchar()-like function that does what I mean. Would you believe that nchar()'s documentation says it's a generic? It's not lying, and it later tells you that nchar() coerces non-characters to characters, but R sure does know how to mess with your expectations.
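For what it's worth, here's the workaround once you know to look for it; scientific = FALSE is a real format() flag, but you'd have to find it yourself:
a <- 10000
nchar(format(a * 10, scientific = FALSE))    # now counts the actual digits
## [1] 6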
R has some problems with its general facilities for manipulating variables. Some of the following will be seen every time that you use R.
It lacks both i++ and x += i. It also lacks anything that would make these unnecessary, such as Python's enumerate.
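To spell out what you're left with (nothing here is special syntax, just ordinary assignment):
i <- 0
i <- i + 1    # the closest thing to i++
x <- 10
x <- x + i    # the closest thing to x += i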
One day, you'll be tripped up by R's hierarchy of how it likes to simplify mixed types outside of lists. The basics are documented with the c() function. For example, c(2, "2") returns c("2", "2"). An exercise from Advanced R presents a few troubling cases:
- Why is 1 == "1" true?
- Why is -1 < FALSE true?
- Why is "one" < 2 false?
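Running them shows the coercions at work: the number becomes a character in the first case, FALSE becomes 0 in the second, and 2 becomes "2" and is compared lexicographically in the third.
1 == "1"
## [1] TRUE
-1 < FALSE
## [1] TRUE
"one" < 2
## [1] FALSE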
?
x <- diag(3)
x
## [,1] [,2] [,3]
## [1,] 1 0 0
## [2,] 0 1 0
## [3,] 0 0 1
typeof(x)
## [1] "double"
class(x)
## [1] "matrix" "array"
attributes(x)
## $dim
## [1] 3 3
str(x)
## num [1:3, 1:3] 1 0 0 0 1 0 0 0 1
dput(x) #Dirty trick, don't use in practice.
## structure(c(1, 0, 0, 0, 1, 0, 0, 0, 1), dim = c(3L, 3L))
Among these, str() is the closest. However, you can see that it doesn't give you all of the class information. This doesn't improve if you add non-implicit classes to x, but I'm avoiding that topic for as long as I can.
R likes to use "double" and "numeric" almost interchangeably. You've just seen one such example (str(x) vs typeof(x)).
Integers are almost second class. ?integer suggests that they're mostly for talking to other languages, but the problem seems to go deeper than that. It's as if R tries to avoid integers unless you tell it not to. For example, 4 is a double, not an integer. Why? Unless you're very careful, any integer that you give to R will eventually be coerced to a double.
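A quick demonstration; the L suffix is how you explicitly ask for an integer:
typeof(4)        # a double, despite looking like an integer
## [1] "double"
typeof(4L)       # the L suffix gives an actual integer
## [1] "integer"
typeof(4L + 1)   # arithmetic with a double silently loses it again
## [1] "double"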
There's no trivial way to express a condition like 1 < x < 5. In a maths language, I'd expect that exact syntax to work. There's probably a good reason why it doesn't, and it's not at all hard to build an equivalent condition, but it still annoys me from time to time. I suspect that the <- syntax is to blame.
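The workaround is unglamorous:
x <- 3
1 < x & x < 5    # what you must write instead of 1 < x < 5
## [1] TRUE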
The distinction between <- and = is something that you'd have to look up. I'd try to explain the difference, but from what I've gathered, the difference only matters when using = rather than <- causes bugs. Like most R users, I've picked up the habit of "use = only for the arguments of functions and use <- everywhere else".
<- was designed for keyboards that don't exist any more. It's a pain to type on a modern system. IDEs can fix this.
The day that you accidentally have < rather than <- without it throwing an error will be an awful one. The reverse can also happen. For example, there are two things that you could have meant by if(x<-2).
Y<--2 is a terrible way to have to say "set Y to be equal to negative two". Y<<--2 is even worse.
<<- is probably the only good thing about the convention of using <-, but it's only useful if you either know what a closure is and have reason to use one or if you're some sort of guru with R's first-class environments. You can sometimes use <<- to great effect without deliberately writing a closure, but it always feels like a hack because you're implicitly using one. For example, replicate(5, x <- x+1) and replicate(5, x <<- x+1) are very different, with the <<- case being a very cool trick,
x <- 1
replicate(5, x <- x+1)
## [1] 2 2 2 2 2
x
## [1] 1
replicate(5, x <<- x+1)
## [1] 2 3 4 5 6
x
## [1] 6
but it only works because replicate() quietly wraps its second argument in an anonymous function.
The idiomatic way to add an item to the end of a collection is a[length(a) + 1] <- "foo". This is rather verbose and a bit unpredictable when adding a collection to a list.
A quote from the language definition: "supplied arguments and default arguments are treated differently". This usually doesn't trip you up, but you're certain to discover it on the first day that you use eval(). It has parent.frame() as one of its default arguments, but calling eval() with that argument supplied manually will produce different results than letting it be supplied by default.
x <- 1
(function(x) eval(quote(x + 1)))(100)
## [1] 101
(function(x) eval(quote(x + 1), envir = parent.frame()))(100)
## [1] 2
An easier-to-discover example can be found in section 8.3.19 of The R Inferno.
Argument names can be partially matched. See this link for some examples. I can't tell if it's disgusting or awesome, but it's definitely dangerous. If I called f(n = 1), I probably didn't mean f(nukeEarth = 1)! At least it throws an error if it fails to partially match (e.g. if there were multiple valid partial matches). More on that when I cover the $ operator.
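A demonstration with a made-up function (f and nukeEarth are mine, not from the linked examples):
f <- function(nukeEarth) nukeEarth
f(n = 1)    # n partially matches nukeEarth, so no error
## [1] 1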
The ... argument doesn't make its users throw errors when they've been called with arguments that they don't have or, even worse, arguments that you misspelled. Advanced R has a great example in its Functions chapter. Would you have guessed that sum(1, 2, 3, na.omit = TRUE) returns 7, not 6? Similarly, the Functionals chapter shows how this can lead to baffling errors and what strange things you have to do to help your users avoid them.
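The sum() example is easy to check: the misspelled argument falls into ... and TRUE is summed as 1 (the argument you actually wanted is na.rm).
sum(1, 2, 3, na.omit = TRUE)    # na.omit isn't an argument of sum()
## [1] 7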
NaN, NULL, and NA have been accused of inconsistencies and illogical outputs, making it impossible to form a consistent mental model of them.
For many other examples, see section 8 of The R Inferno.
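A few of the usual suspects, for flavour:
NA == NA        # comparisons with NA return NA
## [1] NA
NaN == NaN      # not FALSE, as IEEE arithmetic might suggest, but NA
## [1] NA
length(NULL)    # NULL is a zero-length object...
## [1] 0
sum(NULL)       # ...that quietly vanishes in arithmetic
## [1] 0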
R has some strange ideas about switch statements:
It's not a special form of any kind; its syntax is that of a normal function call. If I'm being consistent in my formatting, then I should be calling it "R's switch() function".
It's only about 70 lines of C code, suggesting that it can't be all that optimised.
R doesn't have one switch statement, it has two. There is one where it switches on a numeric input and another for characters. The numeric version makes the strange assumption that the first argument (i.e. the argument being switched on) can be safely translated to a set of cases that must follow an ordering like "if input is 1, do the first option, if 2, do the second...". There is no flexibility like letting you start at 2, having jumps higher than 1, or letting you supply a default case. Reread that last one: R has a switch without defaults! It's frankly the worst switch that I've ever seen. The other version, the one that switches on characters, is more sensible. I'd give examples, but I don't know how to demonstrate a non-feature.
As is a trend in R, both versions of switch are capable of silently doing nothing. For example, these do nothing:
switch(3, "foo", "bar")
switch("c", a = "foo", b = "bar")
print(switch("c", a = "foo", b = "bar")) #Showing off the return value.
## NULL
and they do it silently, returning NULL. I'd expect a warning message informing me of this, but there is no such thing. If you want that behaviour, then you have to write it yourself, e.g. switch("c", a = "foo", b = "bar", stop("Invalid input")) or switch("c", a = "foo", b = "bar", warning("Invalid input")). You can't do that with the numeric version, because R has a switch without defaults.
Now for the nasty stuff. R's rules for selecting elements, deleting elements, and any other sort of subsetting require mastery. They're extremely powerful once mastered, but until that point, they make using R a nightmare. For a stats language, this is unforgivable. Before mentioning any points that are best put in their own subsections, I'll cover some more general points:
You never quite know whether you want to use x, the name of x, or a function like subset(), which(), Find(), Position(), or match(). R's operators make this even more of a mess. You either want $, [, @ or even [[. Making the wrong choice of [x], [x,], [,x] or [[x]] is another frequent source of error. You will get used to it eventually, but your hair will not survive the journey. Similar stories can be found about the apply family.
The [[ operator has been accused of inconsistent behaviour. Advanced R covers this better than I could. The short version is that it sometimes returns NULL and other times throws an error. Personally, I've never noticed these because I've rarely tried to subset NULL and I don't see any reason why you would use [[ on an atomic vector. As far as I know, [ does the same job. The only exception that I can think of is if your atomic vector was named. For example:
a <- c(Alice = 1, Bob = 2)
a["Alice"]
## Alice
## 1
a[["Alice"]]
## [1] 1
When assigning to anything more than one-dimensional, object and object[] behave differently. Compare:
(c <- b <- diag(3))
## [,1] [,2] [,3]
## [1,] 1 0 0
## [2,] 0 1 0
## [3,] 0 0 1
b[] <- 5
c <- 5
b
## [,1] [,2] [,3]
## [1,] 5 5 5
## [2,] 5 5 5
## [3,] 5 5 5
c
## [1] 5
This kind of makes sense, but it will trip you up.
The syntax for deleting elements of collections by index can be rather verbose. You can't just pop out an element; you have to write vect <- vect[-index] or vect <- vect[-c(index, nextIndex, ...)].
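For example, popping out the second element of a vector looks like this:
vect <- c(10, 20, 30, 40)
vect <- vect[-2]
vect
## [1] 10 30 40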
R is 1-indexed, but accessing element 0 of a vector gives an empty vector rather than an error. This probably makes sense considering that index -1 deletes element 1, but it's a clear source of major errors.
With the sole exception of environments, every named object in R is allowed to have duplicate names. I guarantee that this will one day break your subsetting (e.g. see section 8.1.19 of The R Inferno). Fortunately, the constructor for data frames has a check.names argument that corrects duplicates by default. Unfortunately, it does this silently, so you may not notice that some of your column names have been changed. Another oddity is that many functions that work on data frames, most notably [, will silently correct duplicated names even if you told the original data frame not to do so. Why even let me have duplicated names if you're going to make it so hard to keep them?
data.frame(x = 1:3, x = 11:13)
## x x.1
## 1 1 11
## 2 2 12
## 3 3 13
#Notice the x.1? You didn't ask for that. To get x twice, you need this:
correctNames <- data.frame(x = 1:3, x = 11:13, check.names = FALSE)
correctNames
## x x
## 1 1 11
## 2 2 12
## 3 3 13
correctNames[1:3, ]#As expected.
## x x
## 1 1 11
## 2 2 12
## 3 3 13
correctNames[1:2]#What?
## x x.1
## 1 1 11
## 2 2 12
## 3 3 13
Not only is this behaviour inconsistent, it is silent; no warnings or errors are thrown by the above code. Tibbles are much better about this:
library(tibble)
#We can't repeat our original first line, because tibble(x = 1:5, x = 11:15) throws an error:
## > tibble(x = 1:5, x = 11:15)
## Error: Column name `x` must not be duplicated.
## Use .name_repair to specify repair.
#We follow the error's advice.
#The .name_repair argument provides a few useful options, so we must pick one.
correctNames <- tibble(x = 1:5, x = 11:15, .name_repair = "minimal")
correctNames
## # A tibble: 5 × 2
## x x
## <int> <int>
## 1 1 11
## 2 2 12
## 3 3 13
## 4 4 14
## 5 5 15
correctNames[1:3,]#Good
## # A tibble: 3 × 2
## x x
## <int> <int>
## 1 1 11
## 2 2 12
## 3 3 13
correctNames[1:2]#Still good!
## # A tibble: 5 × 2
## x x
## <int> <int>
## 1 1 11
## 2 2 12
## 3 3 13
## 4 4 14
## 5 5 15
This may seem like an isolated example. It isn't. A related example is that the check.names argument in data.frame() is very insistent on silently doing things, even to the point of overruling arguments that you explicitly set. For example, these column names aren't what I asked for.
as.data.frame(list(1, 2, 3, 4, 5), col.names = paste("foo=bar", 6:10))
## foo.bar.6 foo.bar.7 foo.bar.8 foo.bar.9 foo.bar.10
## 1 1 2 3 4 5
as.data.frame(list(1, 2, 3, 4, 5), col.names = paste("foo=bar", 6:10), check.names = FALSE)#The fix.
## foo=bar 6 foo=bar 7 foo=bar 8 foo=bar 9 foo=bar 10
## 1 1 2 3 4 5
I think R does this to ensure your names are suitable for subsetting. Subsetting with non-unique or non-syntactic column names could be a pain, but the decision to not inform the user of this correction is baffling. Even if you're fortunate enough to notice the silent changes, the lack of a warning message will leave you with no idea how to correct them. You could perhaps argue that duplicated names are the user's fault and they deserve what they get, but that argument falls apart for non-syntactic names. Who hasn't put a space or an equals sign in their column names before? Mind, tibbles aren't much better when it comes to non-syntactic names. Neither tibble("Example col" = 4) nor data.frame("Example col" = 4) warns you of the name change.
For what I believe to be memory reasons, objects of greater than one dimension are stored in column order rather than row order. Quick, what output do you expect from matrix(1:9, 3, 3)?
matrix(1:9, 3, 3)
## [,1] [,2] [,3]
## [1,] 1 4 7
## [2,] 2 5 8
## [3,] 3 6 9
This gives us a matrix with first row c(1, 4, 7). This goes against the usual English convention of reading from left to right. It is also inconsistent with functions like apply(), where MARGIN = 1 corresponds to the by-row version and MARGIN = 2 is for by-column (if R privileges columns, shouldn't they be the = 1 case?). This means that you can never really be sure if R is working in column order or row order. This is bad enough on its own, but it can also be a source of subtle bugs when working with matrices. Many mathematical functions don't see any difference between a matrix and its transpose.
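If you wanted row order, matrix() does at least have a byrow argument, although you have to remember to ask for it:
matrix(1:9, 3, 3, byrow = TRUE)
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6
## [3,]    7    8    9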
There is no nice way to access the last element of a vector. The idiomatic way is x[length(x)]. The only good part of this is that x[length(x) - 0:n] is a very nice way to get the last n + 1 elements. You could use tail(), but Stack Overflow tells me it's very slow.
The sort() and order() functions are the main ways to sort stuff in R. If you're trying to sort some data by a particular variable, then R counter-intuitively wants you to use order() rather than sort(). The syntax for order() doesn't help matters. It returns a permutation, so rather than order(x, params), you will want x[order(params),]. My only explanation for this is that it makes order() much easier to use with the with() function. For example, data[with(data, order(col1, col2, col3)),] is perhaps more pleasant to write than the hypothetical order(data, data$col1, data$col2, data$col3). The Tidyverse's dplyr solves these problems: dplyr::arrange(data, col1, col2, col3) does what you think. I'd much rather use arrange(mtcars, cyl, disp) than mtcars[with(mtcars, order(cyl, disp)),].
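To make the base R idiom concrete, here is mtcars sorted by cylinder count and then displacement (only the relevant columns are shown):
head(mtcars[order(mtcars$cyl, mtcars$disp), c("cyl", "disp")])
##                cyl  disp
## Toyota Corolla   4  71.1
## Honda Civic      4  75.7
## Fiat 128         4  78.7
## Fiat X1-9        4  79.0
## Lotus Europa     4  95.1
## Datsun 710       4 108.0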
The order() case above illustrates another frequent annoyance with subsetting. Rather than asking for what you want, you often need to generate a vector that matches up to it. A collection of booleans (R calls these "logical vectors") is one of the most common ways to do this, with duplicated() being a typical example.
head(Nile)
## [1] 1120 1160 963 1210 1160 1160
duplicated(head(Nile))
## [1] FALSE FALSE FALSE FALSE TRUE TRUE
head(Nile)[duplicated(head(Nile))]
## [1] 1160 1160
This means that you will usually be asking for items[bools] (and maybe [,bools] or [bools,]…) in order to get the items that you want. There is great power in being able to do this, but having to do it is annoying and can catch you off guard. For example, what do you expect lower.tri() to return when called on a matrix? What you wanted from lower.tri(mat) is probably what you get from mat[lower.tri(mat)]. Also, don't expect a helpful error message if your construction of bools is wrong. As I'll discuss later on, the vector recycling rules will often make an incorrect construction give sensible-looking output.
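To make the lower.tri() point concrete:
(m <- matrix(1:9, 3))
##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9
lower.tri(m)#A logical matrix, probably not what you were after.
##       [,1]  [,2]  [,3]
## [1,] FALSE FALSE FALSE
## [2,]  TRUE FALSE FALSE
## [3,]  TRUE  TRUE FALSE
m[lower.tri(m)]#The below-diagonal values themselves.
## [1] 2 3 6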
For reasons that I cannot explain, aperm(x, params) is the correct syntax, not x[aperm(params)] or anything like it. I think that it's trying to be consistent with R's ideas of how to manipulate matrices, but it's yet another source of confusion. I don't want to have to think about whether I'm treating my data like a matrix or like a data frame.
Good luck trying to figure out how to find a particular sequence of elements within a vector. For example, try finding if/where the unbroken sequence 1:3 has occurred in sample(6, 100, replace = TRUE). You're best off just writing the for loop.
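For what it's worth, the loop is short. Here's a minimal sketch (the names x, pattern, and hits are mine, not any standard API):
x <- sample(6, 100, replace = TRUE)
pattern <- 1:3
hits <- integer(0)
for (i in seq_len(length(x) - length(pattern) + 1)) {
  if (all(x[i:(i + length(pattern) - 1)] == pattern)) hits <- c(hits, i)
}
hits#Starting positions of any matches; integer(0) if there are none.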
This one isn't too bad, but it's worth a mention. Combining operations can lead to some counter-intuitive results:
If a <- 1:5, what do you expect to get from a[-1] <- 12:15? Do you expect a[1] to be removed or not? This is great once you know how it works, but it's confusing to a beginner.
a <- 1:5
a[-1] <- 12:15
a
## [1] 1 12 13 14 15
Because data[-index] can be used to remove elements and data["colName"] can be used to select elements, you might expect data[-"colName"] or data[-c("colName1", "colName2")] to work. You would be wrong. Both throw errors.
## > mtcars[-"wt"]
## Error in -"wt" : invalid argument to unary operator
Attempting to remove both by index and by name at the same time will never work. For example, mtcars[-c(1, "cyl")] is an error and mtcars[c(1, "cyl")] <- NULL will only remove the cyl variable. Weirdly enough, I can't actually show this mtcars[c(1, "cyl")] <- NULL example. R itself is perfectly happy to run it, but R Markdown isn't. What happens is that c(1, "cyl") is coerced to c("1", "cyl"). After this, R does not inform you that there is no "1" column to remove.
Now for the serious stuff…
This issue is notorious: R likes to remove unnecessary dimensions from your data in ways that are not easily predicted, forcing you to waste time preventing them. Rumour has it that this can be blamed on S being designed as a calculator rather than as a programming language. I can't cite that, but it's easy to believe. No programmer would include any of the below in a programming language.
Unless you add , drop=FALSE to all of your data selection/deletion lines, you run the risk of having all of your code that expects your data to have a particular structure unexpectedly break. This gives no errors or warnings. Compare:
(mat <- cbind(1:4, 4:1))
## [,1] [,2]
## [1,] 1 4
## [2,] 2 3
## [3,] 3 2
## [4,] 4 1
mat[, -1]
## [1] 4 3 2 1
mat[, -1, drop=FALSE]
## [,1]
## [1,] 4
## [2,] 3
## [3,] 2
## [4,] 1
and you will see that one of these is not a matrix. Data frames have the same issue unless you do all of your subsetting in a 1D form.
mat <- cbind(1:4, 4:1)
(frame <- as.data.frame(mat))
## V1 V2
## 1 1 4
## 2 2 3
## 3 3 2
## 4 4 1
frame[, -1]
## [1] 4 3 2 1
frame[, -1, drop=FALSE]
## V2
## 1 4
## 2 3
## 3 2
## 4 1
frame[-1]#1D subsetting
## V2
## 1 4
## 2 3
## 3 2
## 4 1
The Tidyverse, specifically tibble, does its best to remove this.
library(tibble)
mat <- cbind(1:4, 4:1)
(tib <- as_tibble(mat))
## Warning: The `x` argument of `as_tibble.matrix()` must have unique column names if `.name_repair` is omitted as of tibble 2.0.0.
## Using compatibility `.name_repair`.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.
## # A tibble: 4 × 2
## V1 V2
## <int> <int>
## 1 1 4
## 2 2 3
## 3 3 2
## 4 4 1
tib[, -1]
## # A tibble: 4 × 1
## V2
## <int>
## 1 4
## 2 3
## 3 2
## 4 1
tib[, -1, drop=FALSE]
## # A tibble: 4 × 1
## V2
## <int>
## 1 4
## 2 3
## 3 2
## 4 1
tib[-1]
## # A tibble: 4 × 1
## V2
## <int>
## 1 4
## 2 3
## 3 2
## 4 1
tib[, -1, drop=TRUE]
## [1] 4 3 2 1
You can think of tibbles as having drop=FALSE as their default. I can't explain why base R doesn't do the same. It's got to either be some sort of compromise for matrix algebra or for making working in your console nicer. A telling quote describes drop=TRUE as "probably an unwise decision on our part long ago, but now one of those back-compatibility burdens that are unlikely to be changed."
The drop argument is even stranger than I'm letting on. Its defaults differ depending on whether there may only be one column remaining or if there may only be one row. To quote the documentation (?"[.data.frame"): "The default is to drop if only one column is left, but not to drop if only one row is left". Unlike the previous point, I can sort of make sense of this. For example, a single column can only ever be one type (even if that may be a container for mixed types, such as a list), but a single row could easily be a mix of types. Dropping a row of mixed types will just give you a really ugly list, so you'd much rather have a data frame. With a column, it's only with years of experience that the community has realised that they probably still want the data frame; it's nowhere near as obvious that the vector is not preferable.
As you can tell by taking a close look at the documentation for [ and that of [.data.frame, the drop argument does not do the same thing for arrays and matrices as it does for data frames. This means that my earlier example could be dishonest. However, the confusion that you would need to overcome in order to check whether I've been dishonest is so great that it proves that there's definitely something wrong with the drop argument.
You may think that object and object[,] are the same thing. They are not. You would expect and get an error if object is one-dimensional. However, if it's a data frame or matrix with one of its dimensions having size 1, then you do not get an error, and object and object[,] are very different.
library(tibble)
colMatrix <- matrix(1:3)
colMatrix
## [,1]
## [1,] 1
## [2,] 2
## [3,] 3
colMatrix[,]
## [1] 1 2 3
rowMatrix <- matrix(1:3, ncol = 3)
rowMatrix
## [,1] [,2] [,3]
## [1,] 1 2 3
rowMatrix[,]
## [1] 1 2 3
colFrame <- as.data.frame(colMatrix)
colFrame
## V1
## 1 1
## 2 2
## 3 3
colFrame[,]
## [1] 1 2 3
rowFrame <- as.data.frame(rowMatrix)
rowFrame
## V1 V2 V3
## 1 1 2 3
rowFrame[,]
## V1 V2 V3
## 1 1 2 3
colTib <- as_tibble(colMatrix)
colTib
## # A tibble: 3 × 1
## V1
## <int>
## 1 1
## 2 2
## 3 3
colTib[,]
## # A tibble: 3 × 1
## V1
## <int>
## 1 1
## 2 2
## 3 3
rowTib <- as_tibble(rowMatrix)
rowTib
## # A tibble: 1 × 3
## V1 V2 V3
## <int> <int> <int>
## 1 1 2 3
rowTib[,]
## # A tibble: 1 × 3
## V1 V2 V3
## <int> <int> <int>
## 1 1 2 3
Can you guess why? It's because the use of [ makes R check if it should be dropping dimensions. This makes object and object[,,drop=FALSE] equivalent, whereas object[,] is a vector rather than whatever it was originally. Tibbles, of course, don't have this issue.
If you've struggled to read this section, then you're probably now aware of another point: it's really easy to get the commas for drop=FALSE mixed up. What do you think data[4, drop=FALSE] is? If data is a data frame, you get column 4 and a warning that the drop argument was ignored. Did you expect row 4? Whether you did or not, you should be able to see why somebody may come to the opposite answer. Although I see no sensible alternative, the drop argument needing its own comma is terrible syntax for a language where a stray comma is the difference between your data's life and death. This is made even worse by the syntax for [ occasionally needing stray commas. Expressions like data[4,] are commonplace in R, so it's far too easy to forget that you needed the extra comma for the drop argument.
The $ operator is both silently hazardous and redundant:
As an S3 generic, you can never be certain that $ does what you want it to when you use it on a class from a package. For example, it's common knowledge that base R's $ and the Tidyverse's $ are not the same thing. In fact, $ does not even behave consistently in base R. Compare the following partial matching behaviour:
library(tibble)
list(Bob = 5, Dobby = 7)$B
## [1] 5
env <- list2env(list(Bob = 5, Dobby = 7))
env$B
## NULL
data.frame(Bob = 5, Dobby = 7)$B
## [1] 5
tibble(Bob = 5, Dobby = 7)$B
## Warning: Unknown or uninitialised column: `B`.
## NULL
For what it's worth, replacing Dobby with Bobby gives more consistent results.
library(tibble)
list(Bob = 5, Bobby = 7)$B
## NULL
env <- list2env(list(Bob = 5, Bobby = 7))
env$B
## NULL
data.frame(Bob = 5, Bobby = 7)$B
## NULL
tibble(Bob = 5, Bobby = 7)$B
## Warning: Unknown or uninitialised column: `B`.
## NULL
In theory, I should note that [ and [[ are also S3 generics and therefore should share this issue. Aside from the drop issues above, I rarely notice such misbehaviour in practice.
Consistency aside, partial matching is inherently dangerous. data$Pen might give the Penetration column if you forgot that you removed the Pen column. By default, R does not give you any warnings when partial matches happen, so you won't have any idea that you got the wrong column.
The documentation for $ points out its redundancy in base R: "x$name is equivalent to x[["name", exact = FALSE]]". In other words, even if I want the behaviour of $, I can get it with [[. Another benefit of [[ is that it will only partially match if you tell it to (use exact = FALSE). That matters because…
The partial matching of $ can be even worse than I've just described. If there are multiple valid partial matches, rather than get any of them, you get NULL. This is what happened with the Bob/Bobby example above. To give another example, mtcars$di and mtcars$dr both give sensible output because there is only one valid partial match, but mtcars$d is just NULL. I'm largely okay with this behaviour, but you don't even get a warning!
mtcars$di
## [1] 160.0 160.0 108.0 258.0 360.0 225.0 360.0 146.7 140.8 167.6 167.6 275.8
## [13] 275.8 275.8 472.0 460.0 440.0 78.7 75.7 71.1 120.1 318.0 304.0 350.0
## [25] 400.0 79.0 120.3 95.1 351.0 145.0 301.0 121.0
mtcars$dr
## [1] 3.90 3.90 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 3.92 3.07 3.07 3.07 2.93
## [16] 3.00 3.23 4.08 4.93 4.22 3.70 2.76 3.15 3.73 3.08 4.08 4.43 3.77 4.22 3.62
## [31] 3.54 4.11
mtcars$d
## NULL
Tibbles try to fix the partial-matching issues of $ by completely disallowing partial matching. They will not partially match even if you tell them to with [[, exact=FALSE]]. If you try to partially match anyway, it will give you a warning and return NULL. I sometimes wonder if it should be an error.
library(tibble)
mtTib <- as_tibble(mtcars)
mtTib$di
## Warning: Unknown or uninitialised column: `di`.
## NULL
mtTib$dr
## Warning: Unknown or uninitialised column: `dr`.
## NULL
mtTib$d
## Warning: Unknown or uninitialised column: `d`.
## NULL
mtcars[["d", exact = FALSE]]
## NULL
mtTib[["d", exact = FALSE]]
## Warning: `exact` ignored.
## NULL
On the base R side, there is a global option that makes $ give you warnings whenever partial matching happens. It's disabled by default. Common sense suggests it should be otherwise.
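For reference, the option is warnPartialMatchDollar (warnPartialMatchArgs and warnPartialMatchAttr are its siblings):
options(warnPartialMatchDollar = TRUE)
mtcars$mp#Now produces a warning that 'mp' was partially matched to 'mpg'.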
The $ operator is another case of R quietly changing your data structures. For example, I would call mtcars$mpg unreadable.
mtcars$mpg
## [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4
## [16] 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7
## [31] 15.0 21.4
typeof(mtcars$mpg)
## [1] "double"
You probably wanted mtcars["mpg"]:
mtcars["mpg"]
## mpg
## Mazda RX4 21.0
## Mazda RX4 Wag 21.0
## Datsun 710 22.8
## Hornet 4 Drive 21.4
## Hornet Sportabout 18.7
## Valiant 18.1
## Duster 360 14.3
## Merc 240D 24.4
## Merc 230 22.8
## Merc 280 19.2
## Merc 280C 17.8
## Merc 450SE 16.4
## Merc 450SL 17.3
## Merc 450SLC 15.2
## Cadillac Fleetwood 10.4
## Lincoln Continental 10.4
## Chrysler Imperial 14.7
## Fiat 128 32.4
## Honda Civic 30.4
## Toyota Corolla 33.9
## Toyota Corona 21.5
## Dodge Challenger 15.5
## AMC Javelin 15.2
## Camaro Z28 13.3
## Pontiac Firebird 19.2
## Fiat X1-9 27.3
## Porsche 914-2 26.0
## Lotus Europa 30.4
## Ford Pantera L 15.8
## Ferrari Dino 19.7
## Maserati Bora 15.0
## Volvo 142E 21.4
typeof(mtcars["mpg"])
## [1] "list"
and you definitely did not want mtcars[, "mpg"] or mtcars[["mpg"]], which both give the same output as using $.
mtcars[, "mpg"]
## [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4
## [16] 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7
## [31] 15.0 21.4
mtcars[["mpg"]]
## [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4
## [16] 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7
## [31] 15.0 21.4
Would you have guessed that? Tibbles share the above behaviour with $ and [[, but keep ["name"] and [, "name"] identical due to their promise to not drop dimensions with [.
The $ operator does not have any uses beyond selection. For example, there is no way to combine $ with operators like - and there's no way to pass arguments like drop=FALSE to it.
$ is not allowed for atomic vectors like c(fizz=3, buzz=5), unlike [ and [[. This is particularly annoying when dealing with named matrices because you end up having to use mat[, "x"] where mat$x should have done.
Section 8.1.21 of The R Inferno: there exists a $<- operator. You hardly ever see it used. The R Inferno points out that it does not do partial matching, even for lists, unlike $. This is actually documented behaviour (in fact, ?Extract mentions it twice), but I challenge you to find it. I can see why it would be difficult to make a $<- with partial matching, but making $<- inconsistent with $ is just laughable.
In conclusion, once you know the difference between ["colname"] and [, "colname"], $ is only useful if it's making your code cleaner, saving you typing, or if you actually want the partial matching. Personally, I'm uncomfortable with the inherent risks of partial matching, so $ is only really useful for interactive use and my IDE's auto-completion. That might even be its intended job. But if that is the case, nobody warns you of it.
When dealing with any sort of collection, any of the following mistakes can give indistinguishable results. This can make your debugging so messy that by the time that you're done, you don't know what was broken.
Trying to select an incorrect sequence of elements. This can be caused by : or seq() misbehaving or by simple user error. A tiny bit more on that later.
The vector recycling rules silently causing the vector that you used to select elements to be recycled in an undesired way. More on that later.
Selecting an out-of-bounds value. You almost always don't get any error or warning when you do this. For example, both out-of-bounds positive numbers and logical vectors that are longer than the vector that you're subsetting silently return NA for the inappropriate values.
length(LETTERS)
## [1] 26
LETTERS[c(1, 5, 20, 100)]
## [1] "A" "E" "T" NA
LETTERS[rep(TRUE, 100)]
## [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R"
## [19] "S" "T" "U" "V" "W" "X" "Y" "Z" NA NA NA NA NA NA NA NA NA NA
## [37] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [55] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [73] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [91] NA NA NA NA NA NA NA NA NA NA
Again, as with many of the issues that we've mentioned recently, this happens silently.
Accessing/subsetting a collection in the wrong way. For example, wrongly using any of [c(x, y)], [x, y], or [cbind(x, y)], selecting [x] rather than [x, ], [[x]], or [, x], using the wrong rbind()/cbind(), or an error in your call to anything like subset() or within().
Selecting element 0.
Any sort of off-by-one error, e.g. a modulo mistake of any sort, a genuine off-by-one, or R's 1-indexing causing you to trip up.
Misuse of searching functions like which(), duplicated(), or match().
This list also reveals another issue with subsetting: there are too many ways to do it…
…and they don't all work everywhere. For example, there's a wide range of tools for using names to work with lists and data frames, but very few of them work for named atomic vectors (which includes named matrices).
The $ operator simply does not work.
Although namedVector["name"] can be used for subsetting and subassignment, namedVector["name"] <- NULL throws an error. For a list or data frame, this would have deleted the selected data points.
typeof(letters)
## [1] "character"
named <- setNames(letters, LETTERS)
tail(named)
## U V W X Y Z
## "u" "v" "w" "x" "y" "z"
named["Z"]
## Z
## "z"
named["Z"] <- "Super!"
tail(named)
## U V W X Y Z
## "u" "v" "w" "x" "y" "Super!"
#So subsetting and subassignment work just fine. However, for NULL...
## > named["Z"] <- NULL
## Error in named["Z"] <- NULL : replacement has length zero
#But for a data frame, this is just fine.
(data <- data.frame(A = 1, B = 2, Z = 3))
## A B Z
## 1 1 2 3
data["Z"] <- NULL
data
## A B
## 1 1 2
Incidentally, anyAtomicVector[index] <- NULL is also an error, e.g. LETTERS[22] <- NULL.
Sorry, did I say that namedVector["name"] works for subsetting?
a <- diag(3)
colnames(a) <- LETTERS[1:3]
a
## A B C
## [1,] 1 0 0
## [2,] 0 1 0
## [3,] 0 0 1
a["A"]
## [1] NA
a["Z"]
## [1] NA
Long story short, named atomic vectors make a distinction between names and colnames that data frames do not.
a <- diag(3)
colnames(a) <- LETTERS[1:3]
colnames(a)
## [1] "A" "B" "C"
names(a)
## NULL
names(mtcars)
## [1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear"
## [11] "carb"
colnames(mtcars)
## [1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear"
## [11] "carb"
identical(names(mtcars), colnames(mtcars))
## [1] TRUE
So what happens when you give an atomic vector plain old names rather than colnames? For a non-matrix, it works fine (see the named <- setNames(letters, LETTERS) example above). For a matrix (and presumably for any array, but let's not get into that distinction), it's a little bit more complicated. Look closely at this output before reading further.
a <- diag(3)
(a <- setNames(a, LETTERS[1:3]))
## [,1] [,2] [,3]
## [1,] 1 0 0
## [2,] 0 1 0
## [3,] 0 0 1
## attr(,"names")
## [1] "A" "B" "C" NA NA NA NA NA NA
a["A"]
## A
## 1
a["Z"]#For a data frame, this would be an error...
## <NA>
## NA
When you try to give an atomic vector ordinary names, R will only try to name it element-by-element (even if said vector has dimensions). Data frames, on the other hand, treat names as colnames. R ultimately sees named matrices as named atomic vectors that happen to have a second dimension. This means that you can subset them with both ["name"] and [, "name"] and get different results.
a <- setNames(diag(3), LETTERS[1:3])
colnames(a) <- LETTERS[1:3]
a
## A B C
## [1,] 1 0 0
## [2,] 0 1 0
## [3,] 0 0 1
## attr(,"names")
## [1] "A" "B" "C" NA NA NA NA NA NA
a["A"]
## A
## 1
a["Z"]
## <NA>
## NA
a[, "A"]
## [1] 1 0 0
#I'd love to show a[, "Z"], but it throws the error "Error in a[, "Z"] : subscript out of bounds".
#This is clearly consistent with a["Z"] and my earlier bits on out-of-bounds stuff not throwing errors.
Of course, ["name"] and [, "name"] aren't identical for data frames either, but let's not get back into talking about the drop argument. Starting to see what I mean about R being inconsistent?
You cannot use named atomic vectors to generate environments. This means that awesome tricks like within(data, remove(columnIDoNotWant, anotherColumn)) work for lists and data frames but not for named atomic vectors.
#Data frames are fine.
head(within(mtcars, remove("mpg")))
## cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 6 225 105 2.76 3.460 20.22 1 0 3 1
#Named atomic vectors are not.
## > within(setNames(letters, LETTERS), remove("Z"))
## Error in UseMethod("within") :
## no applicable method for 'within' applied to an object of class "character"
When you want to work with the names of named atomic vectors, you probably want to access their names directly and use expressions like namedVect[!names(namedVect) %in% c("remove", "us")].
namedVect <- setNames(letters, LETTERS)
namedVect[!names(namedVect) %in% c("A", "Z")]
## B C D E F G H I J K L M N O P Q R S T U
## "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s" "t" "u"
## V W X Y
## "v" "w" "x" "y"
However, this is a bad habit for non-atomic vectors because, unless you take the precautions mentioned earlier, [ likes to remove duplicated names and unnecessary dimensions from your data.
Don't think that functional programming will save you from my previous point. The base library's higher-order functions don't play nice with the names() function. I think it's got something to do with lapply() using X[[i]] under the hood (see its documentation).
namedVect <- setNames(letters, LETTERS)
Filter(function(x) names(x) == "A", namedVect)
## named character(0)
head(lapply(namedVect, function(x) names(x) == "A"))
## $A
## logical(0)
##
## $B
## logical(0)
##
## $C
## logical(0)
##
## $D
## logical(0)
##
## $E
## logical(0)
##
## $F
## logical(0)
head(sapply(namedVect, function(x) names(x) == "A"))
## $A
## logical(0)
##
## $B
## logical(0)
##
## $C
## logical(0)
##
## $D
## logical(0)
##
## $E
## logical(0)
##
## $F
## logical(0)
Did you notice that Filter and lapply's arguments are in inconsistent orders? A little bit more on that much later.
From the above few points, you can see that it's hard to find a way to manipulate named atomic vectors by their names that is safe both for them and for other named objects. The only one that comes to mind is to use [ with the aforementioned precautions. That's bad enough on its own (it makes R feel unsafe and inconsistent), but it also makes named atomic vectors feel like an afterthought. I find that most of my code that makes extended use of named atomic vectors comes out looking disturbingly unidiomatic. A little bit more on that when I talk about matrices.
I've already given a few examples of R either silently doing nothing or silently doing what you don't want. Let's have a few more:
Again, much of what I've listed in the Indistinguishable Errors and Removing Dimensions sections occurs silently.
As documented here, negative out-of-bounds values are silently disregarded when deleting elements. For example, if you have x <- 1:10, then x[-20] returns an unmodified version of x without warning or error.
x <- 1:10
x[20]
## [1] NA
x
## [1] 1 2 3 4 5 6 7 8 9 10
x[-20]
## [1] 1 2 3 4 5 6 7 8 9 10
identical(x, x[-20])
## [1] TRUE
Given that x[20] is NA (a questionable decision in and of itself), is this the behaviour that you expected?
Subassigning NULL to a column that your data does not have gives no warning or error. For example, trying to access mtcars["weight"] is an error, but mtcars["weight"] <- NULL silently does nothing. $ and $<- have the same issue.
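A quick demonstration of that silence:
copy <- mtcars
copy["weight"] <- NULL#No such column, but no complaint either.
identical(copy, mtcars)
## [1] TRUE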
Using within() to remove unwanted columns from your data, e.g. within(data, rm(colName1, colName2)), does nothing to any columns with duplicated names. Again, no warning or error…
dupe <- cbind(mtcars, foo = 3, foo = 4)
head(dupe)
## mpg cyl disp hp drat wt qsec vs am gear carb foo foo
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 3 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 3 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 3 4
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 3 4
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 3 4
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 3 4
head(within(dupe, rm(carb, foo)))
## mpg cyl disp hp drat wt qsec vs am gear foo foo
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 4 4
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 4 4
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 4 4
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 4 4
By the way, cbind() doesn't silently correct duplicated column names. By now, you probably expected otherwise. This is documented behaviour, but I don't think that anyone ever bothered to read the docs for cbind().
Using subset() rather than within() is sometimes suggested for operations like what I was trying to do in the previous point. For example, you can remove columns with subset(data, select = -c(colName1, colName2)). However, for duplicated names, I'd argue that subset() is even weirder than within(). With subset(), attempting to remove a duplicated column by name will only remove the first such column, and removing any non-duplicated column will change the names of your duplicated columns.
#First, I'll show subset() working as normal and save us some space.
mtcars2 <- subset(mtcars, mpg > 25, select = -c(cyl, disp, hp, wt))
mtcars2
## mpg drat qsec vs am gear carb
## Fiat 128 32.4 4.08 19.47 1 1 4 1
## Honda Civic 30.4 4.93 18.52 1 1 4 2
## Toyota Corolla 33.9 4.22 19.90 1 1 4 1
## Fiat X1-9 27.3 4.08 18.90 1 1 4 1
## Porsche 914-2 26.0 4.43 16.70 0 1 5 2
## Lotus Europa 30.4 3.77 16.90 1 1 5 2
dupe <- cbind(mtcars2, foo = 3, foo = 4, foo = 5)
dupe
## mpg drat qsec vs am gear carb foo foo foo
## Fiat 128 32.4 4.08 19.47 1 1 4 1 3 4 5
## Honda Civic 30.4 4.93 18.52 1 1 4 2 3 4 5
## Toyota Corolla 33.9 4.22 19.90 1 1 4 1 3 4 5
## Fiat X1-9 27.3 4.08 18.90 1 1 4 1 3 4 5
## Porsche 914-2 26.0 4.43 16.70 0 1 5 2 3 4 5
## Lotus Europa 30.4 3.77 16.90 1 1 5 2 3 4 5
subset(dupe, select = -foo)#Names have silently changed and only one foo was dropped.
## mpg drat qsec vs am gear carb foo foo.1
## Fiat 128 32.4 4.08 19.47 1 1 4 1 4 5
## Honda Civic 30.4 4.93 18.52 1 1 4 2 4 5
## Toyota Corolla 33.9 4.22 19.90 1 1 4 1 4 5
## Fiat X1-9 27.3 4.08 18.90 1 1 4 1 4 5
## Porsche 914-2 26.0 4.43 16.70 0 1 5 2 4 5
## Lotus Europa 30.4 3.77 16.90 1 1 5 2 4 5
subset(dupe, select = -c(foo, foo))#Identical to previous.
## mpg drat qsec vs am gear carb foo foo.1
## Fiat 128 32.4 4.08 19.47 1 1 4 1 4 5
## Honda Civic 30.4 4.93 18.52 1 1 4 2 4 5
## Toyota Corolla 33.9 4.22 19.90 1 1 4 1 4 5
## Fiat X1-9 27.3 4.08 18.90 1 1 4 1 4 5
## Porsche 914-2 26.0 4.43 16.70 0 1 5 2 4 5
## Lotus Europa 30.4 3.77 16.90 1 1 5 2 4 5
subset(dupe, select = -carb)#Foo's names have silently changed, despite us not touching foo!
## mpg drat qsec vs am gear foo foo.1 foo.2
## Fiat 128 32.4 4.08 19.47 1 1 4 3 4 5
## Honda Civic 30.4 4.93 18.52 1 1 4 3 4 5
## Toyota Corolla 33.9 4.22 19.90 1 1 4 3 4 5
## Fiat X1-9 27.3 4.08 18.90 1 1 4 3 4 5
## Porsche 914-2 26.0 4.43 16.70 0 1 5 3 4 5
## Lotus Europa 30.4 3.77 16.90 1 1 5 3 4 5
subset(dupe, select = -c(carb, foo))#Names have silently changed and only one foo was dropped.
## mpg drat qsec vs am gear foo foo.1
## Fiat 128 32.4 4.08 19.47 1 1 4 4 5
## Honda Civic 30.4 4.93 18.52 1 1 4 4 5
## Toyota Corolla 33.9 4.22 19.90 1 1 4 4 5
## Fiat X1-9 27.3 4.08 18.90 1 1 4 4 5
## Porsche 914-2 26.0 4.43 16.70 0 1 5 4 5
## Lotus Europa 30.4 3.77 16.90 1 1 5 4 5
I think that the worst example here is subset(dupe, select = -carb). I didn't touch foo, so why change it? I'd rather have within()'s silent inaction than subset()'s silent sabotage.
Needless to say, there will be more examples of R silently misbehaving later on in this document. This was just a good place to throw in a few that are specific to subsetting.
This should be easy, shouldn't it? Go through the data and only give me the bits that have the property that I'm asking for. What could possibly go wrong? Turns out, it's quite a lot. Even predicates as simple as "does the element equal x?" are a minefield. I understand why these examples are the way that they are (really, I do), but how to delete unwanted elements is one of the first things that you're going to want to learn in a stats language. For something that you're going to want to be able to do on day one of using R, there are far too many pitfalls.
You might think that setdiff() is sufficient for removing data (it's certainly the first tool that a mathematician would reach for), but it has the side-effect of removing duplicate entries from the original vector and destroying your data structures by applying as.vector() to them.
Nile
## Time Series:
## Start = 1871
## End = 1970
## Frequency = 1
## [1] 1120 1160 963 1210 1160 1160 813 1230 1370 1140 995 935 1110 994 1020
## [16] 960 1180 799 958 1140 1100 1210 1150 1250 1260 1220 1030 1100 774 840
## [31] 874 694 940 833 701 916 692 1020 1050 969 831 726 456 824 702
## [46] 1120 1100 832 764 821 768 845 864 862 698 845 744 796 1040 759
## [61] 781 865 845 944 984 897 822 1010 771 676 649 846 812 742 801
## [76] 1040 860 874 848 890 744 749 838 1050 918 986 797 923 975 815
## [91] 1020 906 901 1170 912 746 919 718 714 740
setdiff(Nile, 1160)#Not a time series any more.
## [1] 1120 963 1210 813 1230 1370 1140 995 935 1110 994 1020 960 1180 799
## [16] 958 1100 1150 1250 1260 1220 1030 774 840 874 694 940 833 701 916
## [31] 692 1050 969 831 726 456 824 702 832 764 821 768 845 864 862
## [46] 698 744 796 1040 759 781 865 944 984 897 822 1010 771 676 649
## [61] 846 812 742 801 860 848 890 749 838 918 986 797 923 975 815
## [76] 906 901 1170 912 746 919 718 714 740
setdiff(Nile, 0)#Hey, where did the other 1160s go?
## [1] 1120 1160 963 1210 813 1230 1370 1140 995 935 1110 994 1020 960 1180
## [16] 799 958 1100 1150 1250 1260 1220 1030 774 840 874 694 940 833 701
## [31] 916 692 1050 969 831 726 456 824 702 832 764 821 768 845 864
## [46] 862 698 744 796 1040 759 781 865 944 984 897 822 1010 771 676
## [61] 649 846 812 742 801 860 848 890 749 838 918 986 797 923 975
## [76] 815 906 901 1170 912 746 919 718 714 740
It's safer when you're dealing with names, e.g. data[setdiff(names(data), "nameOfThingToDelete")]:
head(mtcars)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
head(mtcars[setdiff(names(mtcars), "wt")])
## mpg cyl disp hp drat qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 20.22 1 0 3 1
but anything that's only sometimes safe doesn't fill me with confidence.
Because which() is an extremely intuitive function for extracting/changing subsets of your data and for dealing with missing values (see The R Inferno, section 8.1.12), it is one of the first things that a beginner will learn about. However, although your intuition is screaming for you to do it, you almost never want to use data <- data[-which(data==thingToDelete)]. When which() finds no matches, it evaluates to something of length 0. This makes data[-which(data==thingToDelete)] also return something of length 0, deleting your data.
Nile
## Time Series:
## Start = 1871
## End = 1970
## Frequency = 1
## [1] 1120 1160 963 1210 1160 1160 813 1230 1370 1140 995 935 1110 994 1020
## [16] 960 1180 799 958 1140 1100 1210 1150 1250 1260 1220 1030 1100 774 840
## [31] 874 694 940 833 701 916 692 1020 1050 969 831 726 456 824 702
## [46] 1120 1100 832 764 821 768 845 864 862 698 845 744 796 1040 759
## [61] 781 865 845 944 984 897 822 1010 771 676 649 846 812 742 801
## [76] 1040 860 874 848 890 744 749 838 1050 918 986 797 923 975 815
## [91] 1020 906 901 1170 912 746 919 718 714 740
Nile[-which(Nile==1160)]#This is fine.
## [1] 1120 963 1210 813 1230 1370 1140 995 935 1110 994 1020 960 1180 799
## [16] 958 1140 1100 1210 1150 1250 1260 1220 1030 1100 774 840 874 694 940
## [31] 833 701 916 692 1020 1050 969 831 726 456 824 702 1120 1100 832
## [46] 764 821 768 845 864 862 698 845 744 796 1040 759 781 865 845
## [61] 944 984 897 822 1010 771 676 649 846 812 742 801 1040 860 874
## [76] 848 890 744 749 838 1050 918 986 797 923 975 815 1020 906 901
## [91] 1170 912 746 919 718 714 740
which(Nile==11600)
## integer(0)
Nile[-which(Nile==11600)]#This is not.
## numeric(0)
What you probably expected was which() leaving your data unchanged when it has not found a match. You might also have expected a warning or error, but surely you've learned your lesson by now? Anyway, section 8.1.13 of The R Inferno offers some ways to get this behaviour, but the only practical-looking suggestion is data[!(data %in% thingToDelete)]. I think that you can get away with removing the parentheses there.
Nile[!Nile %in% 1160]
## [1] 1120 963 1210 813 1230 1370 1140 995 935 1110 994 1020 960 1180 799
## [16] 958 1140 1100 1210 1150 1250 1260 1220 1030 1100 774 840 874 694 940
## [31] 833 701 916 692 1020 1050 969 831 726 456 824 702 1120 1100 832
## [46] 764 821 768 845 864 862 698 845 744 796 1040 759 781 865 845
## [61] 944 984 897 822 1010 771 676 649 846 812 742 801 1040 860 874
## [76] 848 890 744 749 838 1050 918 986 797 923 975 815 1020 906 901
## [91] 1170 912 746 919 718 714 740
Nile[!Nile %in% 11600]
## [1] 1120 1160 963 1210 1160 1160 813 1230 1370 1140 995 935 1110 994 1020
## [16] 960 1180 799 958 1140 1100 1210 1150 1250 1260 1220 1030 1100 774 840
## [31] 874 694 940 833 701 916 692 1020 1050 969 831 726 456 824 702
## [46] 1120 1100 832 764 821 768 845 864 862 698 845 744 796 1040 759
## [61] 781 865 845 944 984 897 822 1010 771 676 649 846 812 742 801
## [76] 1040 860 874 848 890 744 749 838 1050 918 986 797 923 975 815
## [91] 1020 906 901 1170 912 746 919 718 714 740
That's mostly okay. However, identical(Nile, Nile[!Nile %in% 11600]) is FALSE. Can you guess why? It's like R has no always-safe ways to subset.
At least removing elements that are equal to a particular number is simple for vectors. Even for lists, it's just data[data!=x]. It's maybe not what a beginner would guess ("I have to write data twice?"), but it's simple enough.
For removing a vector from a list of vectors, you're going to want to learn some functional programming idioms. Not hard if you're a programmer, but shouldn't this be easier in a stats and maths tool? Anyway, you probably want Filter(function(x) all(x != vectorToDelete), data). You can also do it with the apply family, but I don't see why you would.
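Note that all(x != vectorToDelete) will itself run into the recycling rules if the lengths differ, so a sketch using identical() as the predicate is arguably safer (the variable names here are mine):
vectorToDelete <- 4:6
data <- list(1:3, 4:6, 1:3)
Filter(function(x) !identical(x, vectorToDelete), data)
## [[1]]
## [1] 1 2 3
##
## [[2]]
## [1] 1 2 3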
Removing what you don't want from a data frame largely comes down to mastering the subsetting rules, a nightmare that I've spent the previous few thousand words covering. I often end up with very ugly lines like outcomes[outcomes$playerChoice == playerChoice & outcomes$computerChoice == computerChoice, "outcome"]. Before you ask, subset(), with(), and within() aren't good enough either. I've already mentioned some of their issues, but more on them later.
Overall, it's like R has no safe ways to subset. What is safe for one job is often unsafe, invalid, or inconsistent with another. R's huge set of subsetting tools is useful (maybe even good) once mastered, but until then you're forced to adopt a guess-and-check style of programming and pray that you get a useful error/warning message when you get something wrong. Worse still, these prayers are rarely answered and, in the cases where R silently does something that you didn't want, they're outright mocked. Do you understand how damning that is for a stats language? I can't stress this point enough. Subsetting in R should be easy and intuitive. Instead, it's something that I've managed to produce thousands of words of complaints about, and it still trips me up with alarming regularity, despite my clear knowledge of the correct way to do things. If I want a vector of consonants, you can bet that I'm going to write letters[-c("a", "e", "i", "o", "u")], letters[-which(letters == c("a", "e", "i", "o", "u"))], and letters[c("a", "e", "i", "o", "u") %in% letters] before remembering the right way to do it. If I'm still making those mistakes for something simple, then I can only imagine what it's like for a true beginner doing something complicated.
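For the record, the version that actually works is the %in% idiom from earlier:
letters[!letters %in% c("a", "e", "i", "o", "u")]
## [1] "b" "c" "d" "f" "g" "h" "j" "k" "l" "m" "n" "p" "q" "r" "s" "t" "v" "w"
## [19] "x" "y" "z"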
You've heard the good, now for the bad. R's vectorization is probably the best thing about the language and it will work miracles when you're doing mathematics. However, it will trip you up in other areas. A lot of these points are minor, but when they cause you problems, their source can be tough to track down. This is because R is working as intended and therefore not giving you any warnings or errors (spotting a pattern?). Furthermore, if you have correctly identified that you have a vectorization problem, then pretty much any function in R could be to blame, because most of R's functions are vectorized.
The commonality of vectors leads to some new syntax that must be memorised. For example, if(x|y) and if(x||y) are very different, and using && rather than & can be fatal. Compare the following:
mtcars[mtcars$mpg < 20 && mtcars$hp > 150,]
## Warning in mtcars$mpg < 20 && mtcars$hp > 150: 'length(x) = 32 > 1' in coercion
## to 'logical(1)'
## [1] mpg cyl disp hp drat wt qsec vs am gear carb
## <0 rows> (or 0-length row.names)
mtcars[mtcars$mpg < 20 & mtcars$hp > 150,]
## mpg cyl disp hp drat wt qsec vs am gear carb
## Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
## Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
## Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
## Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
## Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
## Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
## Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
## Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
## Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
## Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
## Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
## Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
Personally, I find that it's easy to remember to use && for if but I often forget to use & for subsetting. It looks like version 4.1.4 is going to make || and && throw warnings.
The if statements accept vectors of length greater than 1 as their predicate, but will only pay attention to the very first element. This throws a warning, and there is a global option to make it an error instead, but I can't see why R accepts such predicates at all. Why would I ever use if(c(TRUE, FALSE)) to mean "if the first element of my vector is true, then…"? This is also what the && and || syntax is for (e.g. c(TRUE, FALSE) && c(TRUE, FALSE) is TRUE), but I still don't see why anyone would use several logical vectors and only be interested in their first elements.
When dealing with anything 2D, you need to be very careful to not mix up any of length(), lengths(), nrow(), or ncol(). In particular, length() is so inconsistent that I'm unsure why they let it work for 2D structures (probably something to do with it being an internal generic). For example, the length of a data frame is its number of columns and the length of a matrix is its number of elements.
(a <- diag(4))
## [,1] [,2] [,3] [,4]
## [1,] 1 0 0 0
## [2,] 0 1 0 0
## [3,] 0 0 1 0
## [4,] 0 0 0 1
(b <- as.data.frame(a))
## V1 V2 V3 V4
## 1 1 0 0 0
## 2 0 1 0 0
## 3 0 0 1 0
## 4 0 0 0 1
length(a)
## [1] 16
length(b)
## [1] 4
Vectors are collections and therefore inherit the previous section's issues about selecting elements.
Because virtually everything is already a vector, you never know what to use when you want a collection or anything nested. Lists? Arrays? c()? Data frames? One of cbind()/rbind()? Matrices? You get used to it eventually, but it takes a while to understand the differences.
Some functions are vectorized in such a way that you're forced to remember the difference between how they behave for n length-one vectors and how they behave for the corresponding single vector of length n. For example, paste("Alice", "Bob", "Charlie") is not the same as paste(c("Alice", "Bob", "Charlie")).
paste("Alice", "Bob", "Charlie")
## [1] "Alice Bob Charlie"
paste(c("Alice", "Bob", "Charlie"))
## [1] "Alice" "Bob" "Charlie"
paste("Alice", "Bob", "Charlie", collapse = "")
## [1] "Alice Bob Charlie"
paste(c("Alice", "Bob", "Charlie"), collapse = "")
## [1] "AliceBobCharlie"
I'm not saying that this doesn't make sense, but it is a source of unpredictability.
Another unpredictable example: what does max(100:200, 250:350, 276) return? You might be surprised to discover that the output is the single number 350, rather than a vector of many outputs.
max(100:200, 250:350, 276)
## [1] 350
The fix for this isn't some collapse-like argument like it is for paste(); it's an entirely different function: pmax(). Why?
pmax(100:200, 250:350, 276)
## [1] 276 276 276 276 276 276 276 276 276 276 276 276 276 276 276 276 276 276
## [19] 276 276 276 276 276 276 276 276 276 277 278 279 280 281 282 283 284 285
## [37] 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303
## [55] 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321
## [73] 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339
## [91] 340 341 342 343 344 345 346 347 348 349 350
A further annoyance comes from how many things behave differently on vectors of length one. For example, sample(1:5) is exactly the same as sample(5), which is bound to give you bugs when you use sample(5:n) for a changing n.
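A quick demonstration of the trap (output omitted because sample() is random):
sample(5:7)#A permutation of 5, 6, 7, as you'd expect.
sample(5:5)#Not just 5. This is sample(5): a permutation of 1:5!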
R has rules for recycling vector elements when you try to get it to do something with several vectors that don't all have the same length. You saw this abused when I gave the x <- paste0(rep("", 100), c("", "", "Fizz"), c("", "", "", "", "Buzz")) FizzBuzz example. When recycling occurs, R only throws a warning if the longest vector's length is not a multiple of the others'. For example, neither Map(sum, 1:6, 1:3) nor that FizzBuzz line warns you that recycling has occurred, but Map(sum, 1:6, 1:4) will.
Map(sum, 1:6, 1:3)
## [[1]]
## [1] 2
##
## [[2]]
## [1] 4
##
## [[3]]
## [1] 6
##
## [[4]]
## [1] 5
##
## [[5]]
## [1] 7
##
## [[6]]
## [1] 9
Map(sum, 1:6, 1:4)
## Warning in mapply(FUN = f, ..., SIMPLIFY = FALSE): longer argument not a
## multiple of length of shorter
## [[1]]
## [1] 2
##
## [[2]]
## [1] 4
##
## [[3]]
## [1] 6
##
## [[4]]
## [1] 8
##
## [[5]]
## [1] 6
##
## [[6]]
## [1] 8
The first case (where no warnings are given) can be an unexpected source of major error. The authors of the Tidyverse seem to agree with me. For example, you're only allowed to recycle vectors of length 1 when constructing a tibble, so tibble(1:4, 1:2) will throw a clear error message whereas data.frame(1:4, 1:2) silently recycles the second argument. Similarly, map2(1:6, 1:3, sum) is an error, but map2(1:6, 1, sum) is not.
library(tibble)
## > tibble(1:4, 1:2)
## Error: Tibble columns must have compatible sizes.
## * Size 4: Existing data.
## * Size 2: Column at position 2.
## ℹ Only values of size one are recycled.
## Run `rlang::last_error()` to see where the error occurred.
data.frame(1:4, 1:2)
## X1.4 X1.2
## 1 1 1
## 2 2 2
## 3 3 1
## 4 4 2
library(purrr)
## > map2(1:6, 1:3, sum)
## Error: Mapped vectors must have consistent lengths:
## * `.x` has length 6
## * `.y` has length 3
map2(1:6, 1, sum)
## [[1]]
## [1] 2
##
## [[2]]
## [1] 3
##
## [[3]]
## [1] 4
##
## [[4]]
## [1] 5
##
## [[5]]
## [1] 6
##
## [[6]]
## [1] 7
Section 8.1.6 of The R Inferno: the recycling of vectors lets you attempt to do things that look correct to a novice and make sense to a master, but are almost certainly not what was wanted. For example, c(4, 6) == 1:10 is TRUE only in its sixth element. The recycling rules turn it into c(4, 6, 4, 6, 4, 6, 4, 6, 4, 6) == 1:10. Again, there is no warning given to the user unless the longest vector's length is not a multiple of the other's. In this case, what you wanted was probably c(4, 6) %in% 1:10, maybe with a call to all().
c(4, 6) == 1:10
## [1] FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
c(4, 6, 4, 6, 4, 6, 4, 6, 4, 6) == 1:10
## [1] FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
c(4, 6) %in% 1:10
## [1] TRUE TRUE
all(c(4, 6) %in% 1:10)
## [1] TRUE
Some functions don't recycle in the way that you would expect. For example, read the documentation for strsplit() and ask yourself if you expect strsplit("Alice", c("l", "c")) and strsplit("Alice", "l") to give the same output. If you think that they don't, you're wrong. If you expected the first option to warn you about the "c" part not being used, you're sane, but wrong. If you want to see how the second argument is supposed to work, re-run the earlier code with c("Alice", "Boblice") as your first argument.
strsplit("Alice", c("l", "c"))
## [[1]]
## [1] "A" "ice"
strsplit("Alice", "l")
## [[1]]
## [1] "A" "ice"
strsplit(c("Alice", "Boblice"), c("l", "c"))
## [[1]]
## [1] "A" "ice"
##
## [[2]]
## [1] "Bobli" "e"
Remember what I said about needing to generate the correct logical vector when you want to subset a collection? Logical vectors are also recycled when subsetting collections. Because this vector recycling does not always throw warnings or errors, it's a new Hell. I'm honestly not sure if the exact rules for when this does/doesn't throw warnings/errors are documented anywhere. The language definition claims that using a logical vector to subset a longer vector follows the same rules as when you're using two such vectors for arithmetic (i.e. you get a warning if the larger of the two's length isn't a multiple of the smaller's). However, I know this to be false.
a <- 1:10
a + rep(1, 9) #Arithmetic; Gives a warning.
## Warning in a + rep(1, 9): longer object length is not a multiple of shorter
## object length
## [1] 2 3 4 5 6 7 8 9 10 11
a[rep(TRUE, 9)] #Logical subsetting; 10 results without warning.
## [1] 1 2 3 4 5 6 7 8 9 10
a[c(TRUE, FALSE, TRUE)] #The mask is recycled to length 10 without warning, giving 7 results. Shouldn't it either warn or only consider the first 3 elements?
## [1] 1 3 4 6 7 9 10
I'll take this chance to repeat my claim that this is extremely powerful if used correctly, but the potential for errors slipping through unnoticed is huge. This toy example isn't so bad, but wait until these errors creep into your dataset with 50 rows and columns, leaving you with no idea where it all went wrong. The first time this really caught me out was when I used the same logical vector for two similar datasets of slightly different sizes. I had hoped that if anything went wrong, I'd get an error. Because I didn't, I continued on without knowing that half of my data was now ruined.
Logical vectors also recycle NA without warning. I can't point to any documentation that contradicts this, but it will always catch you off guard. On the bright side, this is consistent with the addition and subsetting rules for numeric vectors with NAs.
arithmetic <- c(2, NA)
arithmetic + c(11, 12, 13, 14) #Keeps NA and recycles.
## [1] 13 NA 15 NA
logic <- c(TRUE, FALSE, TRUE, NA)
LETTERS[logic]
## [1] "A" "C" NA "E" "G" NA "I" "K" NA "M" "O" NA "Q" "S" NA "U" "W" NA "Y"
LETTERS[arithmetic] #Keeps NA and recycling is not expected.
## [1] "B" NA
You sometimes have to tell R that you wanted to work on the entire vector rather than its elements. For example, rep(matrix(1:4, nrow = 2, ncol = 2), 5) will not repeat the matrix 5 times; it will repeat its elements 5 times. The fix is to use rep(list(matrix(1:4, nrow = 2, ncol = 2)), 5) instead.
m <- matrix(1:4, nrow = 2, ncol = 2)
rep(m, 5)
## [1] 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4
rep(list(m), 5)
## [[1]]
## [,1] [,2]
## [1,] 1 3
## [2,] 2 4
##
## [[2]]
## [,1] [,2]
## [1,] 1 3
## [2,] 2 4
##
## [[3]]
## [,1] [,2]
## [1,] 1 3
## [2,] 2 4
##
## [[4]]
## [,1] [,2]
## [1,] 1 3
## [2,] 2 4
##
## [[5]]
## [,1] [,2]
## [1,] 1 3
## [2,] 2 4
Similarly, you might think that vect %in% listOfVectors will work, but it will instead check if the elements of vect are elements of listOfVectors. Again, the solution is to wrap the vector in a list. For example, you want list(1:4) %in% list(5:10, 10:15, 1:4), not 1:4 %in% list(5:10, 10:15, 1:4).
list(1:4) %in% list(5:10, 10:15, 1:4)
## [1] TRUE
1:4 %in% list(5:10, 10:15, 1:4)
## [1] FALSE FALSE FALSE FALSE
You might be surprised that the last result was entirely FALSE. After all, some of 1:4 is in the last element of the list. I'll leave that one to you.
Again, for the most part, these aren't major issues. I don't particularly like the inconsistency between functions like paste() and max(), but the only true minefield is the vector recycling rules. When they silently do things that you don't want, you're screwed.
R makes no secret of being essentially half a century of patches for S. Many things disagree, lack any clear conventions, or are just plain bad, but show no signs of changing. Because so many packages depend on these inconsistencies, I don't think that they will ever be removed from base R. R could be salvaged if its means of helping you manage the inconsistency were up to scratch, e.g. the documentation, the function/argument names, or the warning/error messages, but they're not. It's therefore hard to guess about anything or to help yourself when you've guessed wrong. These sound like minor complaints, but R can be so poor in these regards that it becomes a deal-breaker for the entire language. If there's one thing that will make you quit R forever, it's this. It may sound like I'm being harsh, but I'm not alone in saying it. Both Advanced R and The R Inferno can barely go a section without pointing out an inconsistency in R.
Really, this is R's biggest issue. You can get used to the arcane laws powering R's subsetting and vectorization, the abnormalities of its variable manipulations, and its tendency to do dangerous things without warning you. However, this is the one thing that you can never learn to live with. R is openly, dangerously, and eternally inconsistent and also does a poor job of helping you live with that. In the very worst cases, you can't find the relevant documentation, the thing that's conceptually close to what you're after doesn't link to it, the examples are as poor as they are few, the documentation is simultaneously incomplete and filled with irrelevant information while assuming familiarity with something alien, the error messages don't tell you what line threw the errors that your inevitable misunderstandings caused, the dissimilarity between what you're working with and the rest of the language makes it impossible to guess where you've slipped up, there's undocumented behaviour that you need to look at the C code to discover, and you know that none of this will ever be fixed!
These issues tend to overlap, but I've done my best to split this up into sections that cover each aspect of this problem. All in all, this section came out shorter than I expected. However, I hope that I have made the magnitude of some of these points clear.
If R had outstanding documentation, then I could live with its inconsistencies. Sadly, it doesnāt. The documentation does almost nothing to help you in this regard and has more than its fair share of issues:
Some of the docs are ancient and therefore have examples that are either terrible, few in number, or non-existent. The references in these docs suggest that this is a disease inherited from S, but sometimes it's really unforgivable:
The docs for if, while, repeat, break, and next give no examples of any of them. They're explained in the actual text, but I expect the Examples section to give examples!
The quantile() function's docs are an extreme example of this. A similar sin can be found in the docs for lm() and glm(). However, their See Also sections link to a lot of functions that use them in their own examples, so I can just barely forgive this.
Some of the docs have no examples at all, e.g. UseMethod(), vcov(), and xtfrm().
Some of the docs will document many seemingly identical things and not tell you how they differ. For example, can you tell from the documentation if thereās a difference between rm()
and remove()
? An even worse case is trying to figure out the difference between resid()
and residuals()
. The documentation correctly tells you that one is an alias for another, but then it tells you that resid()
is intended to encourage you to not do a certain thing. This implies that residuals()
does not have that same intention, incorrectly hinting that they might have different behaviour.
In some of the standard libraries, you can find functions without any documentation. For example, MASS::as.fraction()
is totally undocumented.
The R Language Definition is incomplete. I imagine that this will really bother some people on principle alone. Personally, I would be satisfied if it were incomplete in the sense of "each section is complete and correct, but the document is missing many key sections". However, it's really more like a rough draft. It has sentences that stop mid-word, prompts for where to write something later, and lots of information that is either clearly incomplete or very out of date.
A lot of Rās base functions are not written in R, so if you really want to understand how an R function works, you need to learn an extra language. I find that a lot of the power users have gotten used to reading the C source code for a lot of R. That wouldnāt be so bad, butā¦
For a long time, I didnāt know why many of my technical questions on Stack Overflow were answered by direct reference to Rās code, without any mention of its documentation. I eventually learned that Rās functions occasionally have undocumented behaviour, meaning that you canāt trust anything other than the code. For example:
Where do the docs tell you that the expr
argument in replicate()
gets wrapped in an anonymous function, meaning that you canāt use it to do <-
variable assignment to its calling environment (e.g. code like n <- 0; replicate(5, n <- n + 1)
does not change n
)? You might just spot this if you check the R code, but even then itās not clear.
replicate
## function (n, expr, simplify = "array")
## sapply(integer(n), eval.parent(substitute(function(...) expr)),
## simplify = simplify)
## <bytecode: 0x55735f5f7800>
## <environment: namespace:base>
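To see the consequence for yourself, here's a quick toy sketch of my own; each iteration assigns to a fresh local n inside that hidden wrapper function, so the outer counter never moves:
n <- 0
replicate(5, n <- n + 1) #Each call assigns to a new local n inside the wrapper.
## [1] 1 1 1 1 1
n #The n in the calling environment is untouched.
## [1] 0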
Where do rep()
ās docs tell you that itās a special kind of generic where your extensions to it wonāt dispatch properly? Even the R code ā function (x, ...) .Primitive("rep")
ā wonāt help you here.
Where do lapply()
and Filter()
ās docs tell you that they donāt play nice with the names()
function? Again, even the R code wonāt help here.
lapply
## function (X, FUN, ...)
## {
## FUN <- match.fun(FUN)
## if (!is.vector(X) || is.object(X))
## X <- as.list(X)
## .Internal(lapply(X, FUN))
## }
## <bytecode: 0x55735ddf97a8>
## <environment: namespace:base>
In the same vein as the choose()
example, functions in the base stats library do not always tell you which calculation method they used. This can make you falsely assume that a figure was calculated exactly. For example, prop.test()
computes an approximation, but the only mention of this in its documentation is the See Also section saying ābinom.test()
for an exact test of a binomial hypothesisā. Not only is this in a terrible place, it only suggests that an approximation has been used in prop.test()
. The details of the approximation are left for the reader to guess.
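If you want to see the gap for yourself, here's a hedged sketch; both functions are real base stats functions, and nothing in prop.test()'s printout admits to approximating:
#Same data, two tests. binom.test() is exact; prop.test() uses a
#chi-squared approximation, so the two p-values differ slightly.
binom.test(7, 20, p = 0.5)$p.value
prop.test(7, 20, p = 0.5)$p.value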
Some functions act very strangely because theyāre designed with S compatibility in mind. This issue goes on to damage the documentation for said functions. For example, have a look at the docs for the seq()
function. It wonāt tell you what seq_along()
does, but it will tell you what to use seq_along()
instead of! Iāll let Stack Overflow explain seq.int()
ās documentation issues. Said documentation is so poor that Iāve been scared out of using the function. I really donāt know why R pays this price: Who is still using S? Another example is the **
operator. Iāll let the Arithmetic Operators documentation (try ?'**'
) speak for itself on that one. Its three sentences on the topic are **
ās only documentation. Given that you shouldnāt use it, it would be harsh for me to say more. For further reading, I will only give this.
As the previous example shows, backwards compatibility is a priority for R. This means that its inconsistencies will almost certainly never be fixed. Things would be better if the docs did a better job of helping you, but this section demonstrates ad nauseam that they do not. One wonders if thereās ever been any real interest in fixing it.
Some docs assume stats knowledge even when there should be no need to. If you donāt know what āsweeping outā is, you will never understand the docs for sweep()
. I find rmultinom()
's docs to be similarly lacking. It talks about "the typical multinomial experiment" as if you'll know what that is. Its Details section tells you the mathematical technicalities, but if I wanted that then I would've gone to Wikipedia. All that they had to do was give an example about a biased die and that would've told the reader all that they need to know. A similar case can be made about rbinom()
, but I can forgive that on the grounds of āwho uses R without knowing at least that much stats?ā.
The docs often do a bad job of linking to other relevant functions. For example, match()'s docs don't tell you about Position(), subset(), which(), or the various grep things; mapply()'s don't tell you about Map(); and rbinom()'s don't tell you about rmultinom().
I sometimes canāt understand how to search for functions in the documentation. For example, Filter()
ās docs are in the āfunprog {base}ā category, but putting ?funprog
into R won't return those docs. Another oddity is that it's sometimes case sensitive. For example, ?Extract
works but ?extract
doesnāt. In case you missed it, there is no Extract()
or extract()
function.
I find that the documentation tries to cover too many functions at once. For example, in order to understand any particular function in the funprog or grep documentation, youāre probably going to have to go as far as understanding all of them. The worst case is the Condition Handling and Recovery documentation (?tryCatch
), which lists about 30 functions, forever dooming me to never really understand any more of Rās exception system than stop()
and stopifnot()
. A much smaller example is that both abs()
and sqrt()
are documented in the same place, despite barely having anything in common and not sharing this documentation with anything else. This issue also compromises the quality of the examples that are given. For example, the funprog documentation gives no examples of how to use Map()
, Find()
, or Position()
, something that never would have happened if they were alone in their own documentation pages. Then again, which()
and arrayInd()
are the only functions in their documentation, and arrayInd()
has no examples, so maybe Iām giving R too much credit. After all, like I hinted at earlier, even totally fundamental stuff like lists have more functions in their documentation than examples.
The docs sometimes spend a distracting amount of time comparing their subjects to other languages that you might not know. The best example is the funprog docs, which are needlessly cluttered with mentions of Common Lisp. A close second to this is the documentation for pairlists, which even in the language definition have little more description than "Pairlist objects are similar to Lisp's dotted-pair lists". My favourite example is probably "regexpr and gregexpr with perl = TRUE allow Python-style named captures", if only because it manages to mention two languages in a totally unexpected way. I should also mention that I've already complained about how some functions are so obsessed with S compatibility that both their documentation and functionality are compromised. As a final but forgivable case, sprintf()
is deliberately about C-style stuff and therefore never shuts up about C, making the R documentation pretty difficult for anyone who doesnāt know C.
If pairlists are not really intended for use by normal users, why are they documented in the exact same place as normal lists, which are critical to normal R usage?
Guidelines for unusual operators, such as using [
as a function, are rather hard to find in the documentation. One example that I found particularly annoying is in the names()
documentation. It canāt make its mind up about whether it wants to talk about the names(x) <- value
version or the "names<-"(x, value)
version. The only place where itās apparent that thereās a meaningful difference between the two is in the second part of the Values section, which says:
"For names<-, the updated object. (Note that the value of names(x) <- value is that of the assignment, value, not the return value from the left-hand side.)"
Don't get me wrong, R's documentation isn't terrible. Its primary issue is that it does a poor job of helping you navigate R's inconsistencies. If the examples were plentiful and the docs for each function linked to plenty of other related functions without themselves being cluttered with mentions of other functions and languages, then it would go a long way towards stopping R from tripping people up.
There are several inconsistencies in Rās functions and how you use them. This means that you either have to adopt a guess-and-check style of coding or constantly double-check the documentation before using a lot of Rās functions. Neither are satisfactory.
There are a few too many functions that have names synonymous with ādo more than onceā. Thereās replicate()
, repeat
loops, and rep()
. Good luck remembering which does what.
Why do we have both structure()
and str()
or seq()
and sequence()
, all of which are different, while having rm()
/remove()
and residuals()
/resid()
, which are not? The potential for confusion is obvious: If I were to write a new function, Pos()
, should you or should you not assume that itās an alias for Position()
?
There is no consistent convention for function names in the base libraries, even for related functions. I struggle to think of a function-naming scheme that isnāt found somewhere in R. For example, the documentation for mean()
links to both colMeans()
and weighted.mean()
. Similarly, the seq()
documentation contains both seq.int()
and seq_len()
. I also donāt like how thereās both readline()
and readLines()
or nrow()
and NROW()
. Or how about all.equal()
and anyDuplicated()
? Thereās even all of those functions with leading capitals like Vectorize()
or the funprog stuff. I could go onā¦
The above issue gets even worse if we discuss functions that youād expect to exist but donāt. For example, we have write()
but not read()
(the equivalent is probably scan()
).
Argument names are also inconsistent. Most of the apply family calls its function argument FUN
, but rapply()
and the funprog stuff use f
.
Related functions sometimes expect their arguments to be given in a different order. For example, except for mapply()
, the entire apply family wants the data to come before the function, whereas all of the funprog functions (e.g. Map()
, Filter()
, etc), want the reverse. When you realise that you picked the wrong function for a job, this makes rewriting your code infuriating.
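A two-line sketch of the flip, with sqrt() standing in for whatever function you actually care about:
sapply(1:3, sqrt) #apply family: data first, function second.
## [1] 1.000000 1.414214 1.732051
Map(sqrt, 1:3) #funprog: function first, data second.
## [[1]]
## [1] 1
##
## [[2]]
## [1] 1.414214
##
## [[3]]
## [1] 1.732051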
Functions that should be related in theory are not always related in practice. For example, subset()
is not documented with the Set Operations (union()
, setdiff()
, etc) and works on completely different principles. The Set Operations are the extremely dangerous functions that remove duplicates from their inputs and apply as.vector()
to them. The subset()
function is a non-standard evaluation tool like within()
, making it completely different and dangerous in a different way. Finally, despite it being documented with the Set Operations, none of these warnings apply for is.element()
. I regret every time that I wrote off someoneās advice to use subset()
because of my (entirely reasonable!) assumption that it would be a (dangerous) Set Operation.
Functions with related names sometimes have different effects. For example, here is a damning quote from section 3.2.4 of Advanced R:
"…is.*() function, but these functions need to be used with care. is.logical(), is.integer(), is.double(), and is.character() do what you might expect: they test if a vector is a character, double, integer, or logical. Avoid is.vector(), is.atomic(), and is.numeric(): they don't test if you have a vector, atomic vector, or numeric vector; you'll need to carefully read the documentation to figure out what they actually do."
Similar to the above, from the solutions to Advanced R:
"…as.vector() and is.vector() use different definitions of 'vector'!"
The language can't really decide if it wants you to be using lambdas. The apply family has arguments like ...
and MoreArgs
to make it so you donāt always have to do so, but the funprog stuff gives you no such choice. I almost always find that I want the lambdas, so the apply familyās tools to help you avoid them only serve to complicate the documentation.
As an enjoyable example of how these inconsistencies can ruin your time with R, read the documentation for Vectorize()
. Itās packed with tips for avoiding these pitfalls.
Letās talk about matrices. Iāve already discussed some oddities like how functions like [
, $
and length()
treat them in ways that seem inconsistent with either the rest of the language or your expectations, but letās go deeper:
As covered earlier, matrices want to have rownames and colnames rather than names. This gives us a few more inconsistencies to deal with that I didn't mention at the time. The rest of the language has trained you to use setNames(data, names). When you do this, a copy of data with its names changed is returned and data itself is untouched. However, matrices want colnames(data) <- names and the obvious equivalent for rownames(). This modifies data and does not return it.
a <- b <- diag(3)
(colnames(a) <- c("I", "Return", "Me"))
## [1] "I" "Return" "Me"
a #Changed
## I Return Me
## [1,] 1 0 0
## [2,] 0 1 0
## [3,] 0 0 1
setNames(b, c("I", "Return", "b"))
## [,1] [,2] [,3]
## [1,] 1 0 0
## [2,] 0 1 0
## [3,] 0 0 1
## attr(,"names")
## [1] "I" "Return" "b" NA NA NA NA NA
## [9] NA
b #Not changed
## [,1] [,2] [,3]
## [1,] 1 0 0
## [2,] 0 1 0
## [3,] 0 0 1
Not only are the function names inconsistent (why not colNames()
?), the syntax is wildly so. Also, take a look at the incomprehensible error message that colnames()
gives if you use diag(3)
directly rather than assigning it to a variable beforehand.
a <- diag(3)
colnames(a) <- c("Not", "A", "Problem")
## > colnames(diag(3)) <- c("Big", "Bad", "Bug")
## Error in colnames(diag(3)) <- c("Big", "Bad", "Bug") :
## target of assignment expands to non-language object
## > colnames(a <- diag(3)) <- c("Has", "Similar", "Problem")
## Error in colnames(a <- diag(3)) <- c("Has", "Similar", "Problem") :
## object 'a' not found
setNames()
has no such issue.
setNames(diag(3), c("Works", "Just", "Fine"))
## [,1] [,2] [,3]
## [1,] 1 0 0
## [2,] 0 1 0
## [3,] 0 0 1
## attr(,"names")
## [1] "Works" "Just" "Fine" NA NA NA NA NA NA
setNames(a <- diag(3), c("Works", "Just", "Fine"))
## [,1] [,2] [,3]
## [1,] 1 0 0
## [2,] 0 1 0
## [3,] 0 0 1
## attr(,"names")
## [1] "Works" "Just" "Fine" NA NA NA NA NA NA
In truth, I donāt mind either colnames()
or setNames()
. I just wish that R would pick one way of handling names and stick to it.
Unlike anything else in R that I can think of, matrices are happy to let you work by row and even have dedicated functions for it, with rowSums()
and apply(..., MARGIN = 1)
being the obvious examples. There is a good reason for this difference (matrices are always one type, unlike data frames), but it's still an inconsistency. This inconsistency leads to code that is tough to justify. For instance, I frequently find that I want to treat the output of expand.grid()
as a matrix. unique(t(apply(expand.grid(1:4, 1:4, 1:4, 1:4), 1, sort)))
is one of my recent examples. This isnāt too bad, but I honestly have no idea why I needed the t()
. Experience has taught me not to question it, which is pretty bad in and of itself. R's inconsistency eventually makes you either fall into the habit of not questioning sudden transformations of your data or forces you to become completely paralysed when trying to understand what ought to be trivial operations in your code. Doubts like "is there really no better way? R is supposed to be good with this sort of stuff" become frequent when working by row.
So what happens if, when manipulating a matrix, you write the sapply()
that the rest of the language has taught you to expect? At best, it gets treated like a vector in column-order.
(mat <- matrix(1:9, nrow = 3, byrow = TRUE))
## [,1] [,2] [,3]
## [1,] 1 2 3
## [2,] 4 5 6
## [3,] 7 8 9
sapply(mat, max)
## [1] 1 4 7 2 5 8 3 6 9
At worst, it doesnāt do anything like what you wanted.
mat <- matrix(1:9, nrow = 3, byrow = TRUE)
sapply(mat, sum)
## [1] 1 4 7 2 5 8 3 6 9
The trick for avoiding this is to use numbers as your data argument and let subsetting be the function.
mat <- matrix(1:9, nrow = 3, byrow = TRUE)
sapply(1:3, function(x) sum(mat[x, ]))
## [1] 6 15 24
sapply(1:3, function(x) max(mat[x, ]))
## [1] 3 6 9
Better yet, just use apply()
.
mat <- matrix(1:9, nrow = 3, byrow = TRUE)
apply(mat, MARGIN = 1, sum)
## [1] 6 15 24
apply(mat, MARGIN = 1, max)
## [1] 3 6 9
But why did we have to learn any of this in the first place?
Your turn: What does seq_along(diag(3))
return? 1:3
or 1:9
? What if you added a row? What if you added a column? Or is the name of that function seq.along()
? Are you sure? Tempted to check the docs? Which docs? Feeling helpless? You should!
Many functions that are designed for matrices should be forgotten about everywhere else. Several guides warn against using apply()
on non-matrices and I wouldnāt dare use t()
on a non-matrix. Try t(iris)
.
I always expect c()
of a matrix to work in row-order. It doesnāt. However, thatās probably more the fault of c()
and I than it is of matrices. There are times when I canāt explain c(mtcars)
to myself.
Named matrices are named atomic vectors, so they break in the ways discussed earlier. This puts you in a dilemma when you're using data that's essentially only one type: Do you keep it as a matrix and lose the awesome subsetting powers of a data frame? Or do you make it into a data frame and lose the power to work by row that matrices give you? At times, I'm tempted to forget that I named the matrix in the first place and just manipulate it like a mathematician. None of these solutions are good.
Overall, matrices are so inconsistent with the rest of the language that your matrix-manipulation code never looks right. It leaves you with an awful sense of unease.
Something to mention while weāve still got some bad error messages fresh in our minds: People often say that Rās error messages arenāt very good and Iām starting to agree. Errors like ādim(X)
must have a positive length" are useless when you're not told which function in the offending line threw the error, what X
is, or in the very worst cases, what line the error was even in. This means that almost any error that R throws is going to require you looking through both the result of traceback()
(to find where the error happened) and the documentation (to identify the problematic argument). It seems that this issue gets even worse when you try to do statistics. Warnings like āWarning message: In ks.test(foo, bar)
: ties should not be present for the Kolmogorov-Smirnov test" don't even tell you where the tie was. Was it in one of my arguments? Is it some technical detail of the test? Somewhere safe to ignore? You don't know and R won't tell you unless you study the documentation. Worst comes to worst, you have to read the code or learn the secret for getting traceback() to work on warning messages. And yes, that last bit is something that you have to learn. It makes warning messages a lot harder to debug than errors.
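For the record, the least magical version of that secret uses nothing but documented options() settings: promote warnings to errors, at which point the usual error tooling applies.
options(warn = 2) #Every warning is now upgraded to an error...
#...so traceback() and options(error = recover) can be used on it as usual.
options(warn = 0) #Restore the default behaviour.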
Of course, the more worrying (and frequent?) issue is when R gives you no warnings/errors at all. Iād much rather have a bad error message than none at all, but a bad error message is still annoying.
Maybe you think Iām clutching at straws? I admit, I sometimes wonder if my outrage is unjustified. Letās settle this with a challenge. If you win, then by all means close this document and write me off as a madman. If you lose, then maybe Iāve got a point.
CHALLENGE
Taking into account R's vector recycling rules, figure out how mapply()
ās MoreArgs
and ...
arguments differ and when you would want to pass something as a MoreArgs
argument rather than in the ...
argument. No cheating by going online (trust me, it wonāt help). Solve this without leaving your R IDE. Youāre encouraged to check the documentation.
If my criticisms are true, you will find that mapply()
ās documentation is of little help and that your confidence in your R knowledge is too small to make an educated guess.
HINT 1
Donāt try to cheat by looking at mapply()
's code; most of it is in C and therefore will be of no help to you.
HINT 2
You might think that the documentation for sapply()
will help you, but itāll actually mislead you because mapply()
ās ...
is essentially sapply()
ās X
and sapply()
ās ...
is most like mapply()
ās MoreArgs
.
Solution below. Time to stop scrolling.
SOLUTION
.
.
.
.
.
.
.
How do MoreArgs
and ...
differ?
Itās tough to explain. mapply()
uses the default vector recycling rules for the ...
arguments but reuses every element of MoreArgs
for each call. Because the MoreArgs
argument must be a list and R recycles the elements of lists (e.g. using a length one list as a ...
argument will have the element of that list reused for each call), the difference is subtle to the point of near invisibility. Ultimately, MoreArgs = list(a, b, c)
is equivalent to using list(a)
, list(b)
, and list(c)
as three separate ...
arguments. The answer is therefore that MoreArgs
only exists as syntactic sugar for this ...
case.
When should you use MoreArgs
rather than ...
?
Beyond what Iāve already said, I barely have any idea. If you want to keep some function arguments fixed for each call, then just use an anonymous function. I struggle to invent a useful example of where Iād even consider using MoreArgs
, never mind one that doesnāt look tailor-made to make the anonymous function option look better. The one and only example that the documentation gives for using MoreArgs
does not help here. Their example of mapply(rep, times = 1:4, MoreArgs = list(x = 42))
is identical to mapply(rep, times = 1:4, list(x = 42))
. Read that again: You can get identical functionality by deleting the thing that theyāre trying to demonstrate!
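Don't take my word for it; the check is a one-liner:
identical(
  mapply(rep, times = 1:4, MoreArgs = list(x = 42)),
  mapply(rep, times = 1:4, list(x = 42))
)
## [1] TRUE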
Bonus
Did you notice that the documentation for mapply()
has a notable omission? It doesnāt mention this, but you can call mapply()
without the ...
argument, e.g. mapply(rep, MoreArgs = list(1:4))
. You wonāt get sensible output, but you also donāt get any warnings or errors.
If Iāve won this challenge, then allow me to take a victory lap by making the following point: By giving you the options of using ...
, MoreArgs
, or an anonymous function to do the same task, R gives you plenty of room to confuse yourself without providing any help in its documentation. Either provide fewer options, document them better, or make them so commonplace and consistent within the language that I only need to understand it once in order to understand it everywhere!
On top of many of the things that Iāve already said about the apply family, fans of the Tidyverse, particularly purrr
, often point out the following inconsistencies. They've never bothered me, but they're undeniably correct. It makes me wonder why we can't just give all of the apply family a simplify
argument that takes either TRUE
, FALSE
, or whatever vapply()
would consider a valid FUN.VALUE
argument.
Short of you calling something like as.vector() on its output, apply() has no simplify argument. R version 4.1.0 did something about this (apply() now has a simplify argument), but I've yet to get my head around it.
There is no equivalent of vapply() for any member of the apply family other than sapply(). Among others, neither tapply() nor mapply() has one.
If you think of lapply() as "list in, data frame out" and by() as "data frame in, list out", and so on for sapply() and others, then where is the "array in, list out" function?
Hardly anyone has heard of eapply() and even fewer use it. Just about everyone who uses R for stats has had to invest a few hours getting their head around tapply(), but at least that's worth it. As for the other obscure ones, e.g. simplify2array() and rapply(), I honestly cannot recall ever using them or seeing them used.

There are some community issues that make R harder to learn and work with. Put together with the earlier issues, it means that help can often neither be found inside nor outside of R.
There are several dialects of R to choose between, e.g. base R, the Tidyverse, or the data.table package. Even if you find one that you like, you can bet that you will someday want to use a package that requires another. For example, if your friendly neighbourhood package author made use of the default drop = TRUE argument when manipulating data frames, you're not going to be allowed to use tibbles. Protecting the user from this isn't easy, because both data.tables and tibbles return TRUE for is.data.frame().
R6
gets about as much mention as RC, even though they both do the same job. Depending on when they were made, you also see some popular libraries that are fully committed to some particular OOP system. For example, you will see a lot of S3 in base R, but the Bioconductor
package sticks to S4. Fortunately, all of this only becomes a problem when you want to contribute to these packages. If all goes well, you will never really notice which OOP system has been used; you will just have polymorphic code and not need to question it.
A similar split exists around data.table, but it's much less common.
Loops, particularly for
loops, are considered a code smell. This goes double in the Tidyverse, with the R for Data Science book not even introducing them until chapter 21 (of 30). There are practical reasons for this, mostly in relation to the apply familyās code being written in C and therefore being faster than most R loops. However, it encourages you to do some silly things. For example, you have to make a judgement call between writing a for
loop that is inherently fast but slowed down by being written in R or writing an sapply()
that ought to be slow but is sped up due to sapply()
calling C code. This issue also affects how you present your code. Calls to the apply family are inherently one-liners, so it's difficult to find the right way to present/comment them when they become complex. You either end up introducing unnecessary variables into your code or indenting it in unconventional ways. The Tidyverse's solution, piping, often does the trick, but it openly admits to not being a universal solution. The new |> base R pipe doesn't do the trick either. As any advocate of the Tidyverse will tell you, base R just isn't designed for piping.
R's generic function OOP systems are yet another source of unpredictability and internal inconsistency. They're very cool and I must admit that I've not used them much, but what I've seen when trying to use them has discouraged me. Most of what I'm about to say is about S3, but you'll rarely find much said about R's OOP systems at all. It's not really any surprise. S3, S4, RC, and any of the OOP systems that come from packages are all openly admitted to being bolted on to R rather than something that was part of its design from the early days. Points like the below make discovering this fact unavoidable. Presumably, this is what the Julia fans are talking about when they say that they're the only ones who have a generic function OOP system that is baked in to their language. I've never used Julia or enough S4 or RC to be able to really comment, but I bet they're right.
The class system is a mess and the docs do a poor job of explaining it. Good luck understanding it without a book like Advanced R and a package like sloop
. I believe that this problem is mostly isolated to S3, but Iāve not used enough S4 to be able to say that with any certainty. Here are some problems that youāre likely to encounter early on:
Functions like mode()
and storage.mode()
exist only for compatibility with S. As an R user, they exist only to increase your confusion. This is particularly common when reading the language definition; It never shuts up about the modes of things.
Advanced R makes a strong case for is.numeric()
being inconsistent, particularly regarding its interaction with S3.
The documentation for class()
uses vague statements like āmethod dispatch may use more classes than are returned by class(x)
ā. May? MAY??? What am I supposed to do with that? Where do I look for more info? It mentions .class2()
, but warns you against using it. Why? It doesnāt say! Did you think that .class2()
would have its own documentation somewhere else? It doesnāt! All of the documentation for .class2()
is in the docs for class()
and most of that is a warning to not use it!
Call class()
on a matrix and you will see that it has a few classes. However, is.object()
, which has docs that correctly state that it will return TRUE
for anything with a class attribute, returns FALSE
.
a <- diag(3)
class(a)
## [1] "matrix" "array"
is.object(a)
## [1] FALSE
Why? Because class()
also returns implicit classes ā as detailed in its docs ā which is.object()
ignores because implicit classes arenāt part of the class attribute. The documentation for is.object()
does not mention this fact and the class()
function's output does not tell you which classes are implicit. Can you see the potential for confusion? The docs never lied or even misled, but they make naivety fatal. Maybe that's starting to become a theme.
So what base R function actually returns the non-implicit classes? I think that you have to use attr(foo, "class")
. I say āI thinkā because the documentation for class()
does not offer any help.
Donāt ask what determines the implicit classes of an object or how S3 dispatch occurs with them. Itās far too complicated and not clearly documented in any place that I know of. Itās also whatās used for dispatching on anything without a class attribute, such as matrices. Good luck with that!
Donāt ask what an object is either. The community will tell you that is.object()
is poorly named and the language definition will tell you that pretty much everything in R is an object. However, there are functions like names(x)<-
that will not work on several types of objects (e.g. anything anonymous) despite their documentation saying that x
can be āan R objectā. Iād give examples, but you really donāt want to think too hard about this.
Did I mention that class() always returns something?
The [
and [[
functions like to drop the attributes from your S3 objects, meaning that you almost always have to write a [
and [[
method for them. On the bright side, this is documented behaviour.
The generic functions that youāll find in the base and other common libraries have a few surprises:
abline()
and sample()
can behave differently depending on what sort of input you gave them, but thatās because theyāre hard-coded to do so. If you expect to find some abline.lm()
function or be able to write your own abline.myClass()
method, you'll be disappointed.
Can you predict what the caret library's predict() will do? I'm pretty sure that my Bayesian Statistics lecturer also showed me a few cases where anova()
definitely does not do an ANOVA. I'm out of practice, but I think that the R FAQ gives one such example.
Why does rep(matrix(1:4, nrow = 2, ncol = 2), 5)
treat the input matrix like itās a normal vector? I canāt imagine anyone calling rep()
on a matrix and wanting to work element-by-element rather than repeating the matrix.
You will regularly mistake unrelated functions named foo.bar() for S3 methods
because they look like extensions to foo()
. Open up any standard R library and you will see countless functions written in this form that are not extensions to anything. t()
and t.test()
are the most cited example.This is tough to explain, but Iāll try. If you consult ?"internal generic"
, you will find a big list of functions that you cannot extend properly with S3. Specifically, anything on that list cannot be extended to dispatch on any object for which is.object()
returns FALSE
. For example, writing a rep.matrix()
function that does what my earlier example wanted is easy, but because rep()
is on that list and matrices are not objects in this sense, rep()
will not dispatch to rep.matrix()
when given a matrix. The documentation for rep()
and for the other functions that share this misbehaviour does not do much to help the reader discover this fact. Advanced R has the only good explanation that I've found for this, but it's the sort of thing where you either have to read two chapters or the first edition's OO Field Guide. The short explanation is "internal generics are written in C and therefore only understand non-implicit classes and whatever internal R type the C code ultimately gets fed".
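To see the misbehaviour in action, here's a toy sketch of my own; rep.matrix() is a hypothetical method name that S3's naming rules suggest should work, but the internal generic never dispatches to it:
#A hypothetical method that repeats the whole matrix by stacking copies.
rep.matrix <- function(x, times, ...) do.call(rbind, replicate(times, x, simplify = FALSE))
m <- matrix(1:4, nrow = 2)
rep(m, 2) #No dispatch: matrices have no class attribute, so we get element-by-element.
## [1] 1 2 3 4 1 2 3 4
rep.matrix(m, 2) #Calling the method directly works fine.
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4
## [3,]    1    3
## [4,]    2    4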
Did you notice that length()
is on the list mentioned above? This explains the inconsistency in how it behaves on data frames and matrices. You cannot properly extend internal generics with S3, so you cannot change how length()
behaves on implicitly classed input. Data frames are non-implicitly lists and matrices are non-implicitly atomic vectors, so thatās how length()
treats them. This issue isnāt unique to just data frames and matrices. Advanced R has a comical example in its S3 chapter: Linear models, which are non-implicitly lists, have a length of about 12! My personal favourite is giving it quoted input:
length(quote(5^30))
## [1] 3
length(quote(5^30 + 1))
## [1] 3
length(quote(5^30 + 12))
## [1] 3
length(quote(1))
## [1] 1
length(quote(length(1)))
## [1] 2
You will find issues like this ā i.e. unexpected and tough to explain output ā whenever using or extending most internal generics; length()
is just the easiest example to show.
Did you notice that I lied about data frames? The careful reader will notice that due to having the real data.frame
class, data frames don't have any implicit classes. That's a thing, by the way: having a real class means not having implicit classes.
attr(mtcars, "class")
## [1] "data.frame"
class(mtcars)
## [1] "data.frame"
is.object(mtcars)
## [1] TRUE
This means that length(someDataFrame)
cannot possibly dispatch to some length.list()
internal method. Further inspection reveals that there is no S3 (i.e. non-internal) length.data.frame()
method. What actually happens is that R tries to find length.data.frame()
, fails, and then tries to find length.default()
, only to fail again and get pointed to the internal C code that presumably treats data frames just like lists. This happens even though data frames do not have implicit classes. Enjoying the complexity?
So what happens if you try to write a length.data.frame()
method ā something that is totally allowed because data frames return TRUE
for is.object()
and length()
is an internal generic function ā and have length()
dispatch to it? Youāll probably break R. I once redefined the length of a data frame to be its number of rows and I got a stack usage error. Please, take a few seconds to appreciate all of the complexity that weāve had to work through just for Rās most basic object system.
Much of the above makes the class system ā and therefore S3 dispatch ā impossible to clearly explain. Any real explanation would be so full of exceptions that it would become incomprehensible. The only way to explain it is to ignore the contradictions for as long as possible, meaning that you must be given incorrect information until youāre ready to read about the exceptions. This ultimately means that you cannot even find a good reference manual for the class system, because you never know if youāre reading the whole truth or not. Furthermore, if this is at all representative of the complexity of S3, how can anyone be expected to have the patience to even begin learning S4? I know that itās dishonest to blame S4 for the sins of S3, but I wouldnāt blame any newcomer to Rās OOP for doing so. One wonders if we should start newcomers on S4 and leave S3 until much later. The books by John Chambers take this approach and generally say to stick to S4.
Letās talk about S4. I promise that this will be an easier read than the earlier sections. Iām quite ignorant of S4, as Iāve already admitted to, so Iāve got very little to say. Regardless, the following seems clear:
Everything that Iāve read about S4 gives me the impression that it has far fewer stupid technicalities than S3. If Iām right, then I find that laughable. How have we managed to make S3 more complicated than S4? S3 should be extremely simple, but the technicalities of the previous few sections are too easy to stumble upon.
If Advanced Rās chapter on S4 is to be trusted, then the official documentation for S4 contains a lot of bad advice. Iāve not looked closely, but I have noticed that it shares Rās tendency to put many functions in one page of documentation and then not give examples for many of them. For example, ?getMethod
documents five functions, but only gives examples for two. Similarly, @
has no examples in its documentation.
S4 has some strange semantics. Why name something is() when it is sometimes not a predicate function
? Why does it use an @
operator to do what the rest of R would use $
for?
is(mtcars)
## [1] "data.frame" "list" "oldClass" "vector"
As far as I can tell, S4 doesnāt inform you if there was some ambiguity in your dispatch, such as if it had to pick one option from two equally appropriate potential dispatches. I think that unless there is no appropriate method to dispatch to, it has some internal rules that silently handle these cases, meaning that there is no ambiguity even when there probably should be. In other words, it may misbehave by silently resolving the developerās ambiguities. Without being too spiteful, by now I find it quite easy to believe that R has an OOP system that silently misbehaves.
Section 8.2 of The R Inferno calls the factor and ordered variables āchimerasā. This is exactly the right criticism. Under the hood, theyāre S3 objects with integers as their base type and a character vector ā the levels ā as an attribute. When using these variables, it is difficult to predict if R will treat them as their integer base type, as their character vector levels attribute, or as a factor object. And thatās not even mentioning how the labels come in to it. The R Inferno has said more than I will and gives some examples of their unpredictable behaviour, but here are some points from my own experience:
There is no base R function for extracting the original object from its corresponding factor. To extract your original set of numbers (assuming that they were numbers, if not, you get nonsense) from a factor variable called f
, the documentation tells you to use either as.numeric(levels(f))[f]
or the slower as.numeric(as.character(f))
. Letās use a bit more code than usual and show off what each of these functions do before and after composition:
(withoutLabels <- factor(rep(seq(from = 2, by = 2, to = 10), 3)))
## [1] 2 4 6 8 10 2 4 6 8 10 2 4 6 8 10
## Levels: 2 4 6 8 10
(withLabels <- factor(rep(seq(from = 2, by = 2, to = 10), 3), labels = LETTERS[1:5]))
## [1] A B C D E A B C D E A B C D E
## Levels: A B C D E
fList <- list(withoutLabels, withLabels)
#Just to make sure that we're on the same page, here's the output of str().
#The internal integers are in plain sight.
lapply(fList, str)
## Factor w/ 5 levels "2","4","6","8",..: 1 2 3 4 5 1 2 3 4 5 ...
## Factor w/ 5 levels "A","B","C","D",..: 1 2 3 4 5 1 2 3 4 5 ...
## [[1]]
## NULL
##
## [[2]]
## NULL
#Nothing surprising to start:
lapply(fList, levels)
## [[1]]
## [1] "2" "4" "6" "8" "10"
##
## [[2]]
## [1] "A" "B" "C" "D" "E"
#as.character() returns the non-attribute part of what you get when you print the factor
#i.e. the result of mapping its internal integers to its character vector of levels.
#Notice that these are characters. It's not obvious from printing your factors that
#the non-attribute part becomes a character.
lapply(fList, as.character)
## [[1]]
## [1] "2" "4" "6" "8" "10" "2" "4" "6" "8" "10" "2" "4" "6" "8" "10"
##
## [[2]]
## [1] "A" "B" "C" "D" "E" "A" "B" "C" "D" "E" "A" "B" "C" "D" "E"
#Calling `as.numeric()` on a factor does not return the original numbers.
#It returns the underlying integers.
#Why would you ever want or expect these?
lapply(fList, as.numeric)
## [[1]]
## [1] 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5
##
## [[2]]
## [1] 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5
#Subsetting with factors always treats them as their integer base type.
#If the factor fundamentally has nothing to do with integers
#(e.g. if you made the factor from something that was originally a set of characters),
#then you can expect nonsense.
#If the factor did originally have something to do with integers,
#then you're probably going to be very confused because it hasn't subsetted with
#the numbers that you get from printing the factor.
#In short, it's almost never a good idea, but R lets you do it anyway.
#Now ask yourself: What is the point of having a categorical data type if
#it's not practical to subset with?
lapply(fList, function(f) levels(f)[f])
## [[1]]
## [1] "2" "4" "6" "8" "10" "2" "4" "6" "8" "10" "2" "4" "6" "8" "10"
##
## [[2]]
## [1] "A" "B" "C" "D" "E" "A" "B" "C" "D" "E" "A" "B" "C" "D" "E"
lapply(fList, function(f) as.numeric(levels(f))[f])
## Warning in FUN(X[[i]], ...): NAs introduced by coercion
## [[1]]
## [1] 2 4 6 8 10 2 4 6 8 10 2 4 6 8 10
##
## [[2]]
## [1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
lapply(fList, function(f) as.numeric(as.character(f)))
## Warning in FUN(X[[i]], ...): NAs introduced by coercion
## [[1]]
## [1] 2 4 6 8 10 2 4 6 8 10 2 4 6 8 10
##
## [[2]]
## [1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
As youāve probably noticed by now, factor variables are inherently complex enough that they need you to either carefully read their documentation or be an R master before you can use them with confidence. You cannot tell me that as.numeric(levels(f))[f]
made perfect sense when you read it or that you would have come up with it yourself. Itās arcane. Half of the reason why I let the code speak for itself above, rather than adopting my usual bullet point style, is because I hardly even trust myself to describe them. In fact, even whoever wrote the R FAQ seems to have not mastered the art. In section 7.10, they suggest as.numeric(levels(f))[as.integer(f)]
for the same task as what weāve covered above. Can you see the redundant function call?
When writing example code, factors want to be called f
, just like functions do. This offends me.
On the bright side, it looks like R version 4 is steadily trying to fix factors. Every few updates, we see a minor change. For example, before version 4, you had to pass stringsAsFactors = FALSE
to a lot of functions. This was to stop R creating factors when you hadn't asked for them. It was widely considered extremely annoying because there is nothing in the way that data frames print that signals to the reader that they're looking at a factor variable. For all you knew, you were looking at a character vector. You often would not discover your mistake until you had a serious error.
Personally, Iām afraid to use factor variables. Their unpredictability makes any code that uses them dramatically more complex, even if youāre confident that you know their rules.
The syntactic sugar is a source of problems, often to such a great degree that your best solution is to completely avoid the sugar. I'll start with some small cases before splitting some of the bigger ones into sections.
You usually only see this when dealing with names()
, but having a function that is both a setter and getter is a guaranteed source of confusion and found more than once in R. For example, names(output)
will give you the names of output
, but names(output) <- c("Alice", "Bob")
will change output
ās names (itās sugar for some complicated "names<-"
nonsense).
names(mtcars)
## [1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear"
## [11] "carb"
names(mtcars) <- LETTERS[1:11]
head(mtcars, 2)
## A B C D E F G H I J K
## Mazda RX4 21 6 160 110 3.9 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21 6 160 110 3.9 2.875 17.02 0 1 4 4
Now what do you think names(foo) <- names(bar)
does? Seriously, can you guess? I can think of roughly four realistic guesses. Is it even valid syntax? Hereās the truth:
names(mtcars) <- LETTERS[1:11]
a <- rep(c("example", "text"), length.out = 11)
names(a) <- LETTERS[12:22]
a
## L M N O P Q R S
## "example" "text" "example" "text" "example" "text" "example" "text"
## T U V
## "example" "text" "example"
names(a) <- names(mtcars)
a
## A B C D E F G H
## "example" "text" "example" "text" "example" "text" "example" "text"
## I J K
## "example" "text" "example"
A lot of people seem to make the correct guess here, but syntax shouldnāt leave you guessing. Where possible, I try to stick to setNames()
.
The syntactic sugar sometimes leads to surprising syntax. For example names(output[2]) <- "foo"
doesnāt work, but names(output)[2] <- "foo"
does.
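A quick sketch of the difference, using a toy vector of my own:
x <- c(a = 1, b = 2)
names(x[2]) <- "B" #Renames a temporary copy of x[2]; x itself is unchanged.
x
## a b 
## 1 2 
names(x)[2] <- "B" #Replaces element 2 of the names vector itself.
x
## a B 
## 1 2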
As is extremely well documented, T
and F
can be used in place of TRUE
and FALSE
, but you should never do this because T
and F
are just variables that can be overwritten in your code. Why let us do something that we never should? To my surprise, thereās actually a sensible answer. Section 3.3.3 of the R FAQ says that S had T
and F
as reserved words, but R changed that to allow variables called "T"
and "F"
to appear in your datasets. I can see the reasoning behind both Sās approach and Rās change to it, but I still think that Rās approach of āyou can do this, but never doā is obviously wrong. My suspicion is that the third option of ājust make T
and F
not mean anything until theyāre assigned toā will never be taken, because the current (and dangerous) approach helps with backwards compatibility. I donāt think that itās a good trade.
Although I like Rās many functional programming tools, the temptation to try to use them to solve every problem is very strong. Iāve wasted countless hours trying to pick the right one of sapply()
/lapply()
/mapply()
/Filter()
/Map()
⦠(not to mention their various arguments) when I really shouldāve just written the for
loop. This is more my fault than it is R's, but it's a curse that every intermediate R user will suffer from. It's a price that any R expert will tell you was worth it in the end. However, it's still a price that I don't enjoy paying. It wouldn't be so bad if R had fewer such functions, better error messages, or more consistency between these functions, but we've already discussed that can of worms. Don't think that I'm advocating for purrr
here. It has so many functional programming tools that it arguably makes the situation worse. Iāll cover its costs and benefits later.
The :
operator is absolutely lovely⦠until it screws you. The solution is to prefer the seq()
functions to using :
. Some quick examples:
Stuff like i in 1:n
is great, but if you accidentally have n <= 0
, it silently gives behaviours that you probably donāt want.
1:2
## [1] 1 2
1:1
## [1] 1
1:0
## [1] 1 0
1:-1
## [1] 1 0 -1
seq_len()
is better behaved, so I try to stick to it.
seq_len(2)
## [1] 1 2
seq_len(1)
## [1] 1
seq_len(0)
## integer(0)
#seq_len(-1) is an error.
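The same trap, in loop form:
n <- 0
for(i in 1:n) print(i) #Runs twice, for i = 1 and i = 0.
## [1] 1
## [1] 0
for(i in seq_len(n)) print(i) #Correctly runs zero times.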
:
has operator precedence issues. You might expect stuff like -6/2:3
to generate (-(6/2)):3
i.e. -3:3
. It doesnāt.
-6/2:3 #Treated as -6/(2:3)
## [1] -3 -2
Iāll leave -6/2:6/2
as an exercise for the reader. Iād like to keep things simple and say that the trick is that :
is always evaluated first, but thatās actually not true. Even if weāre only talking about arithmetical operations, exponentiation is done before :
is applied.
3^1:5 #Treated as (3^1):5
## [1] 3 4 5
Can you guess what data[-1:5]
returns? I canāt either, so donāt ever try it. If you must know, itās actually an error.
As Iāve said, seq()
and its related functions usually fix this issue. The only real disappointment with seq()
itself is that its documentation warns against not naming its arguments, so youāre forced to write the long-winded seq(from = 0, to = 100, by = 6)
rather than just seq(0, 100, 6)
.
The documentation for several functions with non-standard evaluation, e.g. with()
and subset()
, explicitly warns the user to not use them when programming. This is a source of a number of problems, both practically and in a meta sense:
First of all, the existence of functions that are not for programming use is abhorrent.
Theyāre excellent syntactic sugar, so I hate not being able to use them! For example, Iād argue that within(data, rm(colName1, colName2))
is the best way to remove unwanted columns from my data: It does not require me to quote or escape my column names, does not require me to put data$
before everything, does not require me to pass that annoying drop = FALSE
argument, warns me if I am trying to remove a column that is not in my data, and reads almost like English. All in all, thatās some major advantages over using [
. They both have some misbehaviour if your column names are duplicated, but thatās not very relevant here.
The documentation for a lot of the functions that use non-standard evaluation warn you to take care with them, but they do very little to tell you how or why. Iāve searched high and low, but I sincerely believe that there is almost nothing in Rās documentation that tells you what can go wrong with these functions. Seriously, if you can find it, let me know. Iāve even read the Thomas Lumley article that some of the docs tell you to check and I still canāt find much of relevance. As with .class2()
, I find Rās habit of putting unexplained warnings in its documentation deeply maddening.
Because you donāt know when itās safe to use these functions and when it isnāt, you feel incredible anger when the perfect solution to your problem is to use one of them. You get in situations like āI could either easily do this with with()
or write out an mapply()
with a long-winded anonymous functionā¦ā and always have to choose to do things the hard way. It makes you want to never use R outside of the REPL. This is one part where the Tidyverse completely destroys base R.
So what can actually go wrong with them? To be honest, I donāt really know. A lot of these functions internally rely on a function called substitute()
, which has special behaviour when it tries to interact with anything defined in the global environment, so itās slightly difficult to invent easy-to-type examples of these functions misbehaving. All that Iāve managed to find is:
subset(data, exampleCol > x)
will misbehave if x
is a column in data
but you intended it to come from the calling environment. There's a sketch of this trap just below.
Of all the problems that I've written about, the ones in this section probably bother me the most. So many of R's problems could be sidestepped if we could fearlessly use with()
and subset()
at all times, but R's nasty habit of not explaining the dangers that it warns you of leaves me in constant paranoia.
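To make the subset() trap above concrete, here's a sketch with made-up column names; the only difference between the two calls is whether the data happens to contain a column called x:
df <- data.frame(exampleCol = 1:5, x = c(2, 2, 2, 9, 9))
x <- 3
subset(df, exampleCol > x) #Compares against the x column, not your x.
##   exampleCol x
## 3          3 2
subset(df[, "exampleCol", drop = FALSE], exampleCol > x) #Only now is it your x.
##   exampleCol
## 4          4
## 5          5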
Some things seem obviously missing from R:
For a Scheme-inspired language, the lack of any tail call optimisation or any macro system is strange. Then again, being a heavily functional language that looks like C is one of the best things about R. If it had tail call optimisation or Lisp-like macros, itād probably start to look more like a weird statistical version of Lisp.
You can only break out of the innermost loop. Unless you refactor, thereās no way to be many loops deep and break out of them all with one command.
R has no do-while
loop. Itās never bothered me, but I think thatās because Iāve never used one in any language. I can see it bothering others, but if I need one, then Iām pretty sure that theyāre trivial to make from a repeat
loop.
Without crude if(FALSE){}
workarounds, thereās no way to comment out blocks. IDEs can fix this.
Outside of packages, R lacks any real dictionary, associative array, or linked list type. The closest that we can get is matching elements to their names like this. Iāve always thought that it seems like a hacky way to get what other languages have built in. You can also do it with environments, which apparently has O(1) lookup, but Iāve never seen anyone do it. That may have something to do with how the base R syntax for creating environments from scratch isnāt as nice as its syntax for creating lists. You have to name and assign each element individually, e.g. e <- new.env(); e$a <- 1; e$b <- 2; e$c <- 3
, rather than just l <- list(a = 1, b = 2, c = 3)
. And if youāre going to use a package to fix this syntax issue, then you might as well just use one that gives you actual hash tables.
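For completeness, a sketch of the environment-as-dictionary trick mentioned above; everything here is base R:
e <- new.env(hash = TRUE) #Environments are hash tables under the hood.
e$apple <- 1
e$banana <- 2
e$apple #Lookup by key.
## [1] 1
ls(e) #Enumerate the keys.
## [1] "apple"  "banana"
mget(c("apple", "banana"), envir = e) #Fetch several values at once, as a list.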
Given that R is a maths/stats language, I find the follow omissions surprising:
combn(1:3, 2)
canāt be convinced to include c(1, 1)
, c(2, 2)
, and c(3, 3)
. expand.grid(1:3, 1:3)
comes close, but that trick generates permutations rather than combinations.
There is no is.square() function.
is.integer() checks for integer typing rather than whether the input is in and of itself an integer. The docs even show that is.integer(1)
is FALSE
. Worse still, these docs actually show you the code for a good is.wholenumber()
function! Why couldn't that be in the base library?
Once you're aware of it, the previous issue starts coming up in weird places. This suggests that R's missing something in its error checks. Take a look:
seq_len(4.8) #Not an error
## [1] 1 2 3 4
1:4.8
## [1] 1 2 3 4
a <- 1:10
a[4.8]
## [1] 4
a[-4.8]
## [1] 1 2 3 5 6 7 8 9 10
sample(4.8)
## [1] 4 3 1 2
The pattern is that R silently truncates numeric indices towards zero.
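For reference, the whole-number check from the is.integer() docs is a one-liner; a sketch of it next to the typing check:

is.wholenumber <- function(x, tol = .Machine$double.eps^0.5) abs(x - round(x)) < tol
is.integer(1)      # integer typing: the literal 1 is a double
## [1] FALSE
is.wholenumber(1)  # a mathematical whole number
## [1] TRUE
is.wholenumber(4.8)
## [1] FALSE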
The base libraries have no obvious dedicated functions for pivoting. You can do it with tapply(), but nothing in the docs would make you guess that. In fact, virtually every occurrence of the word "pivot" in R's docs is talking about chol(). I think that you can pivot/unpivot with stack()/unstack(), but the only time I've ever seen those functions mentioned was in this SQL article.
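To show what the tapply() route looks like, here's a minimal pivot of the built-in warpbreaks data, with wool down the side and tension across the top:

with(warpbreaks, tapply(breaks, list(wool, tension), sum))
## a 2 x 3 matrix: rows A/B (wool), columns L/M/H (tension), cells holding summed breaks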
Admittedly, few if any of these are major, but they're a bit annoying.

And now for everything that I've got left in the bag.
The two language problem: Sooner or later, you'll run into a memory issue, go to Stack Overflow, and be told that the solution is to use a package that lets R talk to C++. Julia claims to have solved this. I don't know if I believe it.
The index in a for loop uses the same environment as its caller, so loops like for(i in 1:10) will overwrite any variable called i in the parent environment and set it to 10 when the loop finishes.
i <- 2000
for(i in 1:10){}
i
## [1] 10
This sounds awful, but I've never encountered it in practice. After all, it sounds like bad practice to use the same variable name for two different things. Apparently for loops also like to strip attributes, breaking S3 objects, but again, I've never encountered this. After all, idiomatic R is to prefer functions like sapply() to for loops.
Advanced R claims that R is a great language for metaprogramming. I cannot deny that the Tidyverse is very strong evidence for that, but who would dare metaprogram a language as poorly documented and as inconsistent as I've claimed R is? Certainly not me. I can't even predict R's behaviour when I'm programming it, never mind metaprogramming! I've regretted most of my attempts at doing so. I usually get tripped up by some quirk of R's string-manipulation facilities and how the strings get parsed as expressions.
For a language that was inspired by Scheme, R's metaprogramming feels very limited. As far as I can tell, aside from the typical operation of building code from text that I'd expect any language to be capable of, it is only used to facilitate the creation of functions that evaluate their arguments in a non-standard way. Usually, this doesn't go any further than creating an ad-hoc environment where the function's arguments make sense, despite said arguments having no meaning in the calling environment. Typical examples are with() and modelling functions like lm(), which let you write code like lm(mpg ~ wt, mtcars). Being able to say "let me tell you what data I want you to treat like an environment, so I can refer to its variables as if they were objects in the calling environment" is great, but it's nowhere near what a Lisp user would expect.
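A tiny illustration of that ad-hoc environment, using the built-in mtcars data (and assuming you haven't defined mpg yourself):

exists("mpg")                 # mpg is not an object in the calling environment
## [1] FALSE
with(mtcars, mean(mpg / wt))  # yet with() lets mpg and wt act like ones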
The plot() function has some strange defaults. For example, you need to have a plot before you can plot individual points, and it often doesn't know what to do in terms of how long/wide its axes should be. I also don't like how "predict mpg from wt" is foo(mpg ~ wt), but "plot mpg on the y-axis and wt on the x-axis" is plot(wt, mpg). I understand why both options are the way that they are, but it creates unpredictability.
I seem to have terrible luck with the documentation for R's libraries. Even when using popular packages that have been around for years, I often find documentation errors that are so basic that I can't explain how they've gone unnoticed. I've seen documentation that reports the wrong return types, imports unnecessary libraries in its example code, and completely fails to mention significant parameters! I try to fix these when I find them, so I can no longer name names, but it's a source of significant annoyance.
Advanced R points out that good R code is rare, but I have a different take on it that I think explains my poor luck with R libraries: statisticians don't want to write code or learn GitHub, and programmers don't want to use any more R than they strictly need to. This means that nobody is really doing any bug fixing or even reporting. On the bright side, this makes it very easy to improve other people's R code and get accepted pull requests.
5 The Tidyverse
As I've already admitted, my knowledge of the Tidyverse is much less than my knowledge of base R. However, no critique of R is complete without at least giving this a mention. Its popularity, along with R version 4.1.0 adopting some of its ideas (pipes and a shorter anonymous function syntax), are clear evidence that it's on to something. Before going into the specific libraries, I've given some general thoughts below. You may also be interested in Hadley Wickham's comments on this section. Particularly with regards to purrr, I was surprised by how much we agree (look inside the changes made in the pull request).
- as.foo() is inconsistent with R's S3 system, but it's what I'd expect to find with a new class called foo in a library. The only solution to this problem is to somehow write code that completely ignores base R, but that becomes impossible as soon as you try to load anyone else's packages.
- Tidyverse packages tend to depend on rlang, be made specifically to work with other Tidyverse packages (see the first paragraph of the manifesto), and deprecate their own functions in favour of functions from different Tidyverse packages.
- The Tidyverse is wary of the ... argument's ability to pass arguments to where they should not have gone. To counter this, most of the Tidyverse functions use argument names that you would never type. This means that without an IDE prompting you, you're going to get a lot of the argument names wrong. I've slipped up with tibble's .name_repair argument a few times. Get it wrong and R probably won't let you know!
- Thanks to the [x]/[x,]/[,x] business, I often guess wrong with functions like base::order(), but I almost never guess wrong with dplyr::arrange().
- The Tidyverse's API is unstable. For example, I was caught out by the .drop argument to unnest() being deprecated. Importing a package when making your own package is already a risky proposition, but issues like this would have me do all of my work in base R even if there was a perfect Tidyverse function for the job.
- A pipe-based design seems to demand that the argument count of each function be minimised, so there is nothing like aggregate() in the Tidyverse. Even something as simple as dplyr::select() has about 10 helper functions in its documentation. I'm willing to be proven wrong here, but everything that I've just said strikes me as obviously true to anyone who has used dplyr or purrr.
- Hadley's comments point to the tidyr package as a strong counterexample to my claim that the argument count must be minimised in a pipe-based design. They also mention that there's no obvious better way to design dplyr::select(). On all counts, I have no counterargument. However, I'm confident that I'm still on to something here, even if my original points are wrong. Pipes must come at a cost, but it appears that I've incorrectly identified what that cost is.

Overall, I'm more than happy to use Tidyverse functions when I'm writing some run-once code or messing around in the REPL, but the unstable API point is a real killer for anything else. In terms of how it compares to base R, I'd rather use quite a few of its packages than their base R equivalents. However, that doesn't mean that it can actually replace base R. I see it as nothing more than a handy set of libraries.
Now for the specific libraries. Assume that I'm ignorant of any that I've skipped.
- dplyr completely nullifies most of these complaints, for data frames at least. This is a huge win for the Tidyverse.
- dplyr::group_by() takes a more SQL-like approach to the problem and feels a lot safer to work with.
- That dplyr will only output a tibble is a relief. There's no need to consider if I need tapply(), by(), or aggregate() for a job, or if I need to coerce my input in order to go in/out of functions like table(). I therefore need to do a lot less guessing. This link demonstrates it better than I can, although the formula solution with aggregate() is in base R's favour.
- dplyr::mutate() is just plain better than base R's transform(). In particular, it allows you to refer to columns that you've just created (the tibble sketch later in this section shows the same trick).
- The documentation for dplyr is rather persuasive. In particular, it gives you the sense that you can make safe guesses about the dplyr functions.
- pivot_wider()'s values_fn argument makes dplyr the only tool that I've ever seen that allows arbitrary functions in a pivot table.
- dplyr functions only accept data frames or objects derived from them. If I'm doing some work with something like stringr, I instinctively want to use a Tidyverse solution to problems like subsetting my inputs. However, if I reach for dplyr::filter(), I get errors due to character vectors not being data frames. This isn't really dplyr's fault and they shouldn't try to fix it, but it's still a minor annoyance.

- Compared to base R's plot(), ggplot2's is much better. You can tell R to do stuff like include a useful legend or grid, but ggplot2 does it by default.
- The one thing I miss is the guessing done by plot(). When I can't be bothered to think about what sort of plot I need, plot() can save me the trouble by making a correct guess. There is no such facility in ggplot2. Hadley's comments have pointed out autoplot(), but I've never gotten it to work. There are no examples in its documentation and I've not found all that much help online.

I don't use dates much, so I don't have much to say. In fact, I don't think that I've mentioned them before now. lubridate appears to be much easier to use than base R dates and times, but I know neither very well. I don't like how base R seems to force you to do any time arithmetic in seconds and date arithmetic in days. I also don't like how it's hard to do arithmetic with times in base R without a date being attached. However, I could be wrong about all of that. I really don't know any of them too well and I've never found much reason to learn. I'd really like to see https://rosettacode.org/wiki/Convert_seconds_to_compound_duration solved in both base R and lubridate
. I'm not saying that it would be hard, but I'd love to see the comparison. Overall, all that I can really say for certain is that experience has shown that when the day comes, I'll have a much easier time learning this package than what base R offers for the same jobs.
Pipes come in very handy, but I've never been completely sold on them, even when viewing teaching examples that are supposed to demonstrate their superiority. I'll admit that there is a time and a place for them (e.g. printing and graphing code), but I think that they only really shine when you've abandoned base R in favour of purrr. After all, base R wasn't built for pipes. I'd even go as far as to say that the people who swear by magrittr and purrr have adopted a completely different paradigm to those who don't, so they end up using totally different tools. For example, a master of the Tidyverse finds Advanced R chapter nine's
by_cyl %>%
map(~ lm(mpg ~ wt, data = .x)) %>%
map(coef) %>%
map_dbl(2)
just as informative as my lapply(by_cyl, function(x) lm(mpg ~ wt, data = x)$coef[[2]]) (or its equivalent sapply() or vapply(), if you really insist). Overall, I think that I can't evaluate magrittr without also evaluating purrr.
As a side-note, the claim that foo %>% bar() is equivalent to bar(foo) appears to be a white lie. Try it with a plotting function that cares about the variable name of its argument. Spot the difference:
plot(Nile)       # the y-axis label is "Nile", taken from the variable's name
library(magrittr)
##
## Attaching package: 'magrittr'
## The following object is masked from 'package:purrr':
##
## set_names
Nile %>% plot()  # not an error, but the y-axis label becomes "."
Nile |> plot()   # the same as plot(Nile), label and all
Don't get me wrong, I like pipes a lot. When you're dealing with data, there's sometimes no way to avoid "do foo() to my data and then do bar()" code. However, you'd be mad to use them all of the time. For people that do use them, all that I can say is that you should take the time to learn all of them and that said time really isn't much. None of them are much more complicated than %>%, and %$% is a handy replacement for with().
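For instance, a quick sketch of the exposition pipe (assuming magrittr is loaded):

mtcars %$% cor(mpg, wt)  # exposes the columns, just like with(mtcars, cor(mpg, wt))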
As a final point, I don't like how much trouble base R's new |> pipe causes me. You can't do x |> foo. You instead need x |> foo(). Also, to use a function where the target argument isn't first, you need to use some nonsense like x |> (function(x) foo(bar, x))(). For example, mtcars |> (function(x) Map(max, x))(). I don't like all of those extra brackets. magrittr can do it with just mtcars %>% (function(x) Map(max, x)) or even mtcars %>% Map(max, .). Regardless, base R's pipe is still new, so perhaps I'm judging it too early. It appears that future versions will expand it.
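For what it's worth, the \(x) lambda shorthand that arrived alongside |> shortens the wrapper a little, although the extra brackets remain; a sketch:

mtcars |> (\(x) Map(max, x))()  # the same as mtcars |> (function(x) Map(max, x))()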
Unlike base R, where I can point to a specific example and explain to you why it's silly, my objections to purrr are mostly philosophical. It certainly does some things much better than base R. For example, I really like the consistency in the map functions. They're a breath of fresh air compared to base R's apply family vs funprog mess. My code would probably be a lot easier to read and modify if I replaced all of my apply family and funprog calls with their purrr equivalents. However, when writing the code in the first place, I'd much rather have the flexibility that the base R functions offer. I also like pluck() and it being built into the map functions, but I've yet to get used to it. Overall, I wouldn't mind using purrr, but I have some major objections:
- map_lgl(), map_int(), map_dbl(), and map_chr() are separate functions. They do exactly the same thing, but will throw an error if they don't get the return type in their name. Why isn't this the job of some general mapping function that takes the desired output type as an argument (e.g. like base R's vapply())? This same issue is found in the entire library. There is no need for the map2() and pmap() functions or their countless _type variants. Just make a general map() function! To steal a point from the TidyverseSkeptic essay, purrr has 178 functions, and 52 are maps. What would you rather learn: a handful of complex map functions (like base R) or 52 simple ones? The only defences that I've seen for the purrr approach are that base R can be a bit verbose, e.g. vapply()'s arguments like FUN.VALUE = logical(1), and that using the most restrictive possible tool for any given job increases the readability of your code. A side-by-side comparison is sketched below.
- I don't like the Tidyverse having made the ~ operator able to form anonymous functions, at least within purrr (it's some funky parsing). I could get used to it, but I don't like how it robs the user of the ability to give the arguments to their anonymous function any meaningful names. This is because the purrr authors thought that the normal anonymous function syntax was too verbose, but I'd argue that they've gone too far and made their syntax too terse. Map(function(x) runif(1), 1:3) is not long or particularly obscure, but map(1:3, ~ runif(1)) crosses the line for me, as does map(data, ~ .x * 2). My example in the previous section, which included map(~ lm(mpg ~ wt, data = .x)), demonstrates another problem: it overloads the ~ operator in a dangerous way. The ~ inside the map() is very different from the ~ in the call to lm().
- Could it be that purrr users don't use a generalised map function because they've written off base R's anonymous function syntax and replaced it with a variant that is so terse that their code becomes unreadable without the names of their functions telling the reader what they're doing?

Overall, I could probably be convinced that purrr's way is better than base R's, but I doubt that purrr's way is the best way.
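As promised, the comparison: the same column summary written with base R's one general vapply() and with one of purrr's type-specific maps (assuming purrr is installed):

# base R: a single function, with the output type passed as an argument
vapply(mtcars, function(col) max(col) - min(col), numeric(1))
# purrr: the output type is baked into the function's name instead
purrr::map_dbl(mtcars, ~ max(.x) - min(.x))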
For me, these both fall into the same box. They're not particularly outstanding, but they're clearly much better than their base R equivalents. I've praised tibble enough already and have said plenty about the state of R's strings. The only thing that I've got left to say is that once you've noticed that tibbles let you use the columns that you've just defined to define other columns, you really start to hate how many extra lines of code you have to write when using data frames for the same task.
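A sketch of that difference (assuming the tibble package is installed, and that no object called a exists in the calling environment):

tibble::tibble(a = 1:3, b = a * 2)  # b is allowed to use the just-defined a
data.frame(a = 1:3, b = a * 2)      # data.frame() looks for a outside itself
## Error: object 'a' not found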
6 Conclusion

If I were being generous, I would say that R teaches you some great lessons about functional programming while being a useful DSL and that its biggest fault is that it tries to do too much, ultimately becoming brutally inconsistent. I'd also say that the Tidyverse is a useful set of packages that, while unable to fix R and certainly not a panacea, do a lot to improve it within their specific domains. However, I'm not that generous.
The most damning thing about R is that much of The R Inferno still holds true. The fact that a nearly decade-old document that credibly compared R to a journey into Hell is still a useful reference manual speaks volumes about the language's attitude to change. To put it plainly, R won't change. If something about R frustrates you today, it always will. That's what kills the language for me. The popularity of the Tidyverse proves that R is broken and the continuing validity of The R Inferno proves that it will stay that way. You may be able to put a blanket of sanity on top of it, as the best packages try to, but you won't fix it. Unless you find said packages so useful that they make R worth it, I find it impossible to argue against jumping ship. My ultimate conclusion on R is that it's good, but doomed by the unshifting weight of the countless little problems that I've documented here. Personally, I'm going to give Python a shot and I wouldn't blame you for doing the same. Let's hope that I don't end up writing a document of this size complaining about that.
All that being said, I have no intention of uninstalling R or going out of my way to avoid it. I'd gladly use it professionally and I've learned enough of its semantic semtex to get really damn good at using R to do in few lines what other languages would do in many. I wasn't joking when I said that it's the best desktop calculator that I've ever used. But would I recommend learning it to anyone else? Absolutely not. We can do so much better.
7 Feedback
In late March 2022, this article suddenly exploded overnight. It was briefly in the top 10 on Hacker News and got about 20,000 views in one day. This came as quite a shock to me, given that I wasn't even done proofreading yet. I've read through as much of the online commentary on this article as I can find. The comments on the Hacker News page are by far the most in-depth. You can find a fair bit on Twitter and Reddit as well, but you'd have to go looking. I found that Twitter had the most positive reception, Reddit was more negative and Hacker News was mixed.
In April, I finished the proofreading and sent this off to R-devel. I am very thankful for their comments and their sincere efforts to help me. I hope that my replies didn't come off as half-hearted. I was simply unequipped to handle the mass of feedback.
I won't single out any particular commenters, but there are some ideas and trends that I feel are worth addressing:
From the negative feedback, I can't help but wonder if some of my examples were too trivial or too petty. This document would have been a lot easier to read and write if I only mentioned big issues. I've made some minor edits to address this, but it's hard to judge. A master would never make some of the mistakes that my subsetting section warns against, but does that mean I shouldn't even mention those issues?
I find it interesting to note what hasn't been criticised. The following examples stand out to me:

- Although a few people found my find "es" in "test" challenge a bit too easy (I think they missed my point), I could only find one person who made any attempt at my mapply() challenge.

A very common objection was that my Ignorance section invalidates much of my commentary. Of course, said ignorance makes me unable to know if they're right or not. The two most common criticisms were that my lack of expertise in the Tidyverse and/or data.table means that I've got nothing worthwhile to say, and that using R as a programming language rather than a statistics tool is fundamentally wrong. All of these criticisms are partly correct. Using R for interactive data analysis is very different from trying to program with it, so such users simply won't encounter many of the issues that I've mentioned. Similarly, swapping base R for the Tidyverse automatically nullifies many of my complaints. You can even go through my table of contents and cross sections off. dplyr and its tibble focus already knock off most of my complaints about base R's variable manipulations, data types, subsetting rules, and vector rules. Don't get me wrong, the Tidyverse has its own problems. For example, I'd hate to develop anything reliant on the Tidyverse's unstable API. However, if you're doing a run-once piece of analysis, then it's probably great. It's just a shame to see so much of R replaced by its packages.
I've perhaps undersold just how good R can be at what it's specialised for. This chain of Hacker News comments seems to get across something that I haven't. I've certainly said that R is a large mathematics and statistics tool that is easy to extend and has clear Scheme inspiration, but the sum of those comments seems to say it better. As for the idea that R is a "Worse is Better" language, I find it appealing but I don't feel qualified to judge. If anything was "Worse is Better", then it was probably S (which would make R "almost the right thing", in that essay's terms). However, I'm not historically knowledgeable enough to know key factors like how simple S's early implementations were. I hear that it was very easy to get running on Unix?
I never made it clear that I understand why backwards compatibility is a priority for R. For example, R code appears in a lot of science papers and you don't want such code to become unrunnable or to change meaning.
As a final point, making the changes to this document to reflect the changes coming in what I presume to be R version 4.1.4 has forced me to question my points about R being unable to change. I've not changed my mind yet, but time will tell. They certainly prove that R can change, but I think the real issue might be that it can't fundamentally change.
Author: ReeceGoding
Source Code: https://github.com/ReeceGoding/Frustration-One-Year-With-R