A Julia package for evaluating distances (metrics) between vectors.
This package also provides optimized functions to compute column-wise and pairwise distances, which are often substantially faster than a straightforward loop implementation. (See the benchmark section below for details).
For Euclidean distance, Squared Euclidean distance, Cityblock distance, Minkowski distance, and Hamming distance, a weighted version is also provided.
The library supports three ways of computation: computing the distance between two iterators/vectors, "zip"-wise computation, and pairwise computation. Each of these computation modes works with arbitrary iterable objects of known size.
Each distance corresponds to a distance type. You can always compute a certain distance between two iterators or vectors of equal length using the following syntax:
r = evaluate(dist, x, y)
r = dist(x, y)
Here, dist is an instance of a distance type: for example, the type for Euclidean distance is Euclidean (more distance types will be introduced in the next section). You can compute the Euclidean distance between x and y as:
r = evaluate(Euclidean(), x, y)
r = Euclidean()(x, y)
Common distances also come with convenience functions for distance evaluation. For example, you may also compute the Euclidean distance between two vectors as:
r = euclidean(x, y)
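To make the three equivalent call styles concrete, here is a minimal sketch; the vectors x and y are made up for illustration:
using Distances
x = [0.0, 3.0]; y = [4.0, 0.0]
r1 = evaluate(Euclidean(), x, y)  # 5.0
r2 = Euclidean()(x, y)            # 5.0, calling the distance instance
r3 = euclidean(x, y)              # 5.0, convenience function
All three calls return the same value.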
Suppose you have two m-by-n matrices X and Y; then you can compute all distances between corresponding columns of X and Y in one batch, using the colwise function, as:
r = colwise(dist, X, Y)
The output r is a vector of length n. In particular, r[i] is the distance between X[:,i] and Y[:,i]. The batch computation typically runs considerably faster than calling evaluate column-by-column.
Note that either of X and Y can be just a single vector -- then the colwise function computes the distance between this vector and each column of the other argument.
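As a quick sketch of colwise with made-up data:
using Distances
X = rand(3, 5); Y = rand(3, 5)
r = colwise(SqEuclidean(), X, Y)  # 5-element vector; r[i] == sqeuclidean(X[:,i], Y[:,i])
x = rand(3)
colwise(Euclidean(), x, Y)        # distance from the single vector x to each column of Y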
Let X and Y have m and n columns, respectively, and the same number of rows. Then the pairwise function with the dims=2 argument computes distances between each pair of columns in X and Y:
R = pairwise(dist, X, Y, dims=2)
In the output, R is a matrix of size (m, n), such that R[i,j] is the distance between X[:,i] and Y[:,j]. Computing distances for all pairs using the pairwise function is often remarkably faster than evaluating each pair individually.
If you just want to compute distances between all columns of a matrix X, you can write:
R = pairwise(dist, X, dims=2)
This statement will result in an m-by-m matrix, where R[i,j] is the distance between X[:,i] and X[:,j]. pairwise(dist, X) is typically more efficient than pairwise(dist, X, X), as the former will take advantage of symmetry when dist is a semi-metric (including metrics).
To compute pairwise distances for matrices with observations stored in rows, use the argument dims=1.
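A small sketch of the pairwise forms discussed above (data made up for illustration):
using Distances
X = rand(4, 3); Y = rand(4, 5)
R = pairwise(Euclidean(), X, Y, dims=2)    # 3×5 matrix of column-pair distances
S = pairwise(Euclidean(), X, dims=2)       # 3×3 symmetric matrix
T = pairwise(Euclidean(), X', Y', dims=1)  # same as R, with observations stored in rows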
If the vector/matrix to store the results is pre-allocated, you may reuse the storage (without creating a new array) using the following syntax (i being either 1 or 2):
colwise!(r, dist, X, Y)
pairwise!(R, dist, X, Y, dims=i)
pairwise!(R, dist, X, dims=i)
Please pay attention to the difference: the functions for in-place computation are colwise! and pairwise! (instead of colwise and pairwise).
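A brief sketch, assuming outputs pre-allocated to the right shape and the argument order shown above:
using Distances
X = rand(4, 3); Y = rand(4, 3)
r = Vector{Float64}(undef, 3)
colwise!(r, SqEuclidean(), X, Y)           # fills r in place
R = Matrix{Float64}(undef, 3, 3)
pairwise!(R, SqEuclidean(), X, Y, dims=2)  # fills R in place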
The distances are organized into a type hierarchy.
At the top of this hierarchy is an abstract type PreMetric, which is defined to be a function d that satisfies:
d(x, x) == 0 for all x
d(x, y) >= 0 for all x, y
SemiMetric is an abstract type that refines PreMetric. Formally, a semi-metric is a pre-metric that is also symmetric:
d(x, y) == d(y, x) for all x, y
Metric is an abstract type that further refines SemiMetric. Formally, a metric is a semi-metric that also satisfies the triangle inequality:
d(x, z) <= d(x, y) + d(y, z) for all x, y, z
This type system has practical significance. For example, when computing pairwise distances between a set of vectors, you need only perform the computation for half of the pairs and can derive the values for the remaining half by leveraging the symmetry of semi-metrics. Note that the SemiMetric and Metric types do not completely follow the mathematical definitions, as they do not require the "distance" to distinguish between points: for these types, x != y does not imply d(x, y) != 0. This relaxation does not change computations in practice.
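The hierarchy can be inspected directly; a small illustrative check:
using Distances
Euclidean() isa Metric        # true
SqEuclidean() isa Metric      # false: squared Euclidean violates the triangle inequality
SqEuclidean() isa SemiMetric  # true, so pairwise can still exploit symmetry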
Each distance corresponds to a distance type. The type name and the corresponding mathematical definitions of the distances are listed in the following table.
type name | convenient syntax | math definition |
---|---|---|
Euclidean | euclidean(x, y) | sqrt(sum((x - y) .^ 2)) |
SqEuclidean | sqeuclidean(x, y) | sum((x - y).^2) |
PeriodicEuclidean | peuclidean(x, y, w) | sqrt(sum(min(mod(abs(x - y), w), w - mod(abs(x - y), w)).^2)) |
Cityblock | cityblock(x, y) | sum(abs(x - y)) |
TotalVariation | totalvariation(x, y) | sum(abs(x - y)) / 2 |
Chebyshev | chebyshev(x, y) | max(abs(x - y)) |
Minkowski | minkowski(x, y, p) | sum(abs(x - y).^p) ^ (1/p) |
Hamming | hamming(k, l) | sum(k .!= l) |
RogersTanimoto | rogerstanimoto(a, b) | 2(sum(a&!b) + sum(!a&b)) / (2(sum(a&!b) + sum(!a&b)) + sum(a&b) + sum(!a&!b)) |
Jaccard | jaccard(x, y) | 1 - sum(min(x, y)) / sum(max(x, y)) |
BrayCurtis | braycurtis(x, y) | sum(abs(x - y)) / sum(abs(x + y)) |
CosineDist | cosine_dist(x, y) | 1 - dot(x, y) / (norm(x) * norm(y)) |
CorrDist | corr_dist(x, y) | cosine_dist(x - mean(x), y - mean(y)) |
ChiSqDist | chisq_dist(x, y) | sum((x - y).^2 / (x + y)) |
KLDivergence | kl_divergence(p, q) | sum(p .* log(p ./ q)) |
GenKLDivergence | gkl_divergence(x, y) | sum(x .* log(x ./ y) - x + y) |
RenyiDivergence | renyi_divergence(p, q, k) | log(sum( p .* (p ./ q) .^ (k - 1))) / (k - 1) |
JSDivergence | js_divergence(p, q) | KL(p, m) / 2 + KL(q, m) / 2 with m = (p + q) / 2 |
SpanNormDist | spannorm_dist(x, y) | max(x - y) - min(x - y) |
BhattacharyyaDist | bhattacharyya(x, y) | -log(sum(sqrt(x .* y)) / sqrt(sum(x) * sum(y))) |
HellingerDist | hellinger(x, y) | sqrt(1 - sum(sqrt(x .* y)) / sqrt(sum(x) * sum(y))) |
Haversine | haversine(x, y, r = 6_371_000) | Haversine formula |
SphericalAngle | spherical_angle(x, y) | Haversine formula |
Mahalanobis | mahalanobis(x, y, Q) | sqrt((x - y)' * Q * (x - y)) |
SqMahalanobis | sqmahalanobis(x, y, Q) | (x - y)' * Q * (x - y) |
MeanAbsDeviation | meanad(x, y) | mean(abs.(x - y)) |
MeanSqDeviation | msd(x, y) | mean(abs2.(x - y)) |
RMSDeviation | rmsd(x, y) | sqrt(msd(x, y)) |
NormRMSDeviation | nrmsd(x, y) | rmsd(x, y) / (maximum(x) - minimum(x)) |
WeightedEuclidean | weuclidean(x, y, w) | sqrt(sum((x - y).^2 .* w)) |
WeightedSqEuclidean | wsqeuclidean(x, y, w) | sum((x - y).^2 .* w) |
WeightedCityblock | wcityblock(x, y, w) | sum(abs(x - y) .* w) |
WeightedMinkowski | wminkowski(x, y, w, p) | sum(abs(x - y).^p .* w) ^ (1/p) |
WeightedHamming | whamming(x, y, w) | sum((x .!= y) .* w) |
Bregman | bregman(F, ∇, x, y; inner=dot) | F(x) - F(y) - inner(∇(y), x - y) |
Note: The formulas above use Julia functions. They are meant to convey the mathematical concepts concisely; the actual implementation may use a faster method. The arguments x and y are iterable objects, typically arrays of real numbers; w is an iterator/array of parameters (like weights or periods); k and l are iterators/arrays of distinct elements of any kind; a and b are iterators/arrays of Bools; and finally, p and q are iterators/arrays forming a discrete probability distribution and are therefore both expected to sum to one.
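As a quick sanity check of a few table entries (vectors made up for illustration):
using Distances, LinearAlgebra
x = [1.0, 0.0, 2.0]; y = [0.0, 1.0, 2.0]
cityblock(x, y) == sum(abs.(x - y))                      # true
cosine_dist(x, y) ≈ 1 - dot(x, y) / (norm(x) * norm(y))  # true
w = [1.0, 2.0, 3.0]
weuclidean(x, y, w) ≈ sqrt(sum((x - y).^2 .* w))         # true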
For efficiency (see the benchmarks below), Euclidean and SqEuclidean make use of BLAS3 matrix-matrix multiplication to calculate distances. This corresponds to the following expansion:
(x-y)^2 == x^2 - 2xy + y^2
However, this equality is not exact in the presence of roundoff error, and in particular when x and y are nearby points the result may be inaccurate. Consequently, Euclidean and SqEuclidean allow you to supply a relative tolerance to force recalculation:
julia> x = reshape([0.1, 0.3, -0.1], 3, 1);
julia> pairwise(Euclidean(), x, x)
1×1 Array{Float64,2}:
7.45058e-9
julia> pairwise(Euclidean(1e-12), x, x)
1×1 Array{Float64,2}:
0.0
The implementation has been carefully optimized based on benchmarks. The script in benchmark/benchmarks.jl
defines a benchmark suite for a variety of distances, under column-wise and pairwise settings.
Here are benchmarks obtained running Julia 1.5 on a computer with a quad-core Intel Core i5-2300K processor @ 3.2 GHz. Extended versions of the tables below can be replicated using the script in benchmark/print_table.jl.
Generically, column-wise distances are computed using a straightforward loop implementation. For [Sq]Mahalanobis, however, specialized methods are provided in Distances.jl, and the table below compares the performance (measured as the average elapsed time per iteration) of the generic and the specialized implementations. The task in each iteration is to compute a specific distance between corresponding columns of two 200-by-10000 matrices.
distance | loop | colwise | gain |
---|---|---|---|
SqMahalanobis | 0.089470s | 0.014424s | 6.2027 |
Mahalanobis | 0.090882s | 0.014096s | 6.4475 |
Generically, pairwise distances are computed using a straightforward loop implementation. For distances for which a major part of the computation is a quadratic form, however, performance can be drastically improved by restructuring the computation and delegating the core part to GEMM in BLAS. The table below compares the performance (measured as the average elapsed time per iteration) of the generic and the specialized implementations provided in Distances.jl. The task in each iteration is to compute a specific distance in a pairwise manner between the columns of a 100-by-200 matrix and a 100-by-250 matrix, which results in a 200-by-250 distance matrix.
distance | loop | pairwise | gain |
---|---|---|---|
SqEuclidean | 0.001273s | 0.000124s | 10.2290 |
Euclidean | 0.001445s | 0.000194s | 7.4529 |
CosineDist | 0.001928s | 0.000149s | 12.9543 |
CorrDist | 0.016837s | 0.000187s | 90.1854 |
WeightedSqEuclidean | 0.001603s | 0.000143s | 11.2119 |
WeightedEuclidean | 0.001811s | 0.000238s | 7.6032 |
SqMahalanobis | 0.308990s | 0.000248s | 1248.1892 |
Mahalanobis | 0.313415s | 0.000346s | 906.1836 |
Download Details:
Author: JuliaStats
Official Website: https://github.com/JuliaStats/Distances.jl
License: MIT
#julia #programming #developer
Many supervised and unsupervised machine learning models, such as K-Nearest Neighbors and K-Means, depend upon the distance between two data points to predict the output. Therefore, the metric we use to compute these distances plays an important role in these models.
A distance metric uses a distance function to quantify the relationship between elements in a dataset.
A good distance metric significantly improves the performance of classification, clustering, and information retrieval. In this article, we will discuss different distance metrics and how they help in machine learning modelling.
So, in this blog, we are going to understand distance metrics such as Euclidean and Manhattan distance, as used in machine learning models, in depth.
Euclidean Distance Metric:
Euclidean Distance represents the shortest distance between two points.
The “Euclidean Distance” between two objects is the distance you would expect in “flat” or “Euclidean” space; it’s named after Euclid, who worked out the rules of geometry on a flat surface.
The Euclidean distance is often the "default" distance used in, e.g., K-nearest neighbors (classification) or K-means (clustering) to find the "k closest points" to a particular sample point. The "closeness" is defined by the difference ("distance") along the scale of each variable, which is converted to a similarity measure.
It is only one of many available options for measuring the distance between two vectors/data objects. However, many classification algorithms, as mentioned above, use it either to train the classifier or to decide the class membership of a test observation, and clustering algorithms (e.g., K-means, K-medoids) use it to assign data objects to clusters.
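To make the "k closest points" idea concrete, here is a hypothetical Julia sketch of picking the single nearest point by Euclidean distance (the data and names are made up):
points = [(0.0, 0.0), (3.0, 4.0), (1.0, 1.0)]
query = (2.0, 2.0)
dist(p, q) = sqrt(sum((p .- q) .^ 2))               # plain Euclidean distance
nearest = argmin([dist(p, query) for p in points])  # 3: (1.0, 1.0) is closest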
Mathematically, it's calculated using Pythagoras' theorem: the square of the total distance between two objects is the sum of the squares of the distances along each perpendicular coordinate.
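For two made-up points in the plane, that calculation looks like this in Julia:
p = (1.0, 2.0); q = (4.0, 6.0)
d = sqrt((q[1] - p[1])^2 + (q[2] - p[2])^2)  # sqrt(9 + 16) = 5.0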
#statistics #distance-metric #euclidean-distance #machine-learning #manhattan-distance
An Analysis Tool for Smart Contracts
This repository is currently maintained by Xiao Liang Yu (@yxliang01). If you encounter any bugs or usage issues, please feel free to create an issue on our issue tracker.
A container with the required dependencies configured can be found here. The image is, however, outdated. We are working on pushing the latest image to Docker Hub for your convenience. If you experience any issue with this image, please try to build a new Docker image from this codebase before opening an issue.
To open the container, install docker and run:
docker pull luongnguyen/oyente && docker run -i -t luongnguyen/oyente
To evaluate the greeter contract inside the container, run:
cd /oyente/oyente && python oyente.py -s greeter.sol
and you are done!
Note: If you need the version of Oyente referred to in the paper, run the container from here.
To run the web interface, execute docker run -w /oyente/web -p 3000:3000 oyente:latest ./bin/rails server, or build and run the image yourself:
docker build -t oyente .
docker run -it -p 3000:3000 -e "OYENTE=/oyente/oyente" oyente:latest
Open a web browser at http://localhost:3000 for the graphical interface.
Create and activate a Python virtualenv:
python -m virtualenv env
source env/bin/activate
Install Oyente via pip:
$ pip2 install oyente
Dependencies:
The following steps require a Linux system. macOS instructions are forthcoming.
$ sudo add-apt-repository ppa:ethereum/ethereum
$ sudo apt-get update
$ sudo apt-get install solc
Download the source code of version z3-4.5.0 and install z3 with its Python bindings:
$ python scripts/mk_make.py --python
$ cd build
$ make
$ sudo make install
pip install requests
pip install web3
#evaluate a local solidity contract
python oyente.py -s <contract filename>
#evaluate a local solidity with option -a to verify assertions in the contract
python oyente.py -a -s <contract filename>
#evaluate a local evm contract
python oyente.py -s <contract filename> -b
#evaluate a remote contract
python oyente.py -ru https://gist.githubusercontent.com/loiluu/d0eb34d473e421df12b38c12a7423a61/raw/2415b3fb782f5d286777e0bcebc57812ce3786da/puzzle.sol
And that's it! Run python oyente.py --help for a list of options.
The accompanying paper explaining the bugs detected by the tool can be found here.
A collection of the utilities that were developed for the paper is in misc_utils. Use them at your own risk - they have mostly been disposable.
generate-graphs.py - Contains a number of functions to get statistics from contracts.
get_source.py - The get_contract_code function can be used to retrieve contract source from EtherScan.
transaction_scrape.py - Contains functions to retrieve up-to-date transaction information for a particular contract.
Note: This is an improved version of the tool used for the paper. Benchmarks are not for direct comparison.
To run the benchmarks, it is best to use the docker container as it includes the blockchain snapshot necessary. In the container, run batch_run.py
after activating the virtualenv. Results are in results.json
once the benchmark completes.
The benchmarks take a long time and a lot of RAM on any but the largest of clusters, so beware.
Some analytics, such as the number of contracts tested and the number of contracts analysed, are collected when running this benchmark.
Check out our contribution guide and the code structure here.
To install evm from go-ethereum (via the Ethereum PPA):
$ sudo apt-get install software-properties-common
$ sudo add-apt-repository -y ppa:ethereum/ethereum
$ sudo apt-get update
$ sudo apt-get install ethereum
Download Details:
Author: enzymefinance
Source Code: https://github.com/enzymefinance/oyente
License: GPL-3.0 license
#blockchain #smartcontract #ethereum
Do you ever feel like for loops are taking over your life and there’s no escape from them? Do you feel trapped by all those loops? Well, fear not! There’s a way out! I’ll show you how to do the FizzBuzz challenge without any for loops at all.
The task of FizzBuzz is to print every number up to 100, but replace numbers divisible by 3 with “Fizz”, numbers divisible by 5 by “Buzz” and numbers that are divisible by both 3 and 5 have to be replaced by “FizzBuzz”.
Solving FizzBuzz with for loops is easy, you can even do this in BigQuery. Here, I’ll show you an alternative way of doing this — without any for loops whatsoever. The solution is Vectorised Functions.
If you already have some experience with R and Python, you've probably come across vectorised functions in base R or via Python's numpy library. Let's see how we can use them similarly in Julia.
Vectorised functions are great as they reduce the clutter often associated with for loops.
Before we dive into solving FizzBuzz let’s see how you can replace a very simple for loop with a vectorized alternative in Julia.
Let's start with a trivial task: given a vector a, add 1 to each element of it.
a = [1,2,3];
for i in 1:length(a)
    a[i] += 1
end
julia> print(a)
[2, 3, 4]
The above gets the job done, but it takes up 3 lines and a lot more characters than needed. If a
was a numpy
array in Python 🐍, you could just do a + 1
and job done. But first, you would have to convert your plain old array to a numpy
array.
Julia has a clever solution. You can use the broadcast operator . to apply an operation — in this case, addition — to all elements of an object. Here it is in action:
a = [1,2,3];
a .+ 1
This gives the same answer as the for loop above. And there's no need to convert your array.
Even better than that, you can broadcast any function of your liking, even your own ones. Here we calculate the area of a circle and then we broadcast it across our array:
function area_of_circle(r)
    return π * r^2
end
a = [1,2,3];
area_of_circle.(a)
Yes, pi is a built-in constant in Julia!
julia> area_of_circle.(a)
3-element Array{Float64,1}:
3.141592653589793
12.566370614359172
28.274333882308138
Now that we know the basics, let’s do FizzBuzz! But remember, no for loops allowed.
We will rephrase our problem a little bit. Instead of printing the numbers, Fizzes, and Buzzes, we'll return all of them together as a vector. I'll break down the solution the same way as in the for loop article [LINK], so if you haven't seen the previous post, now would be a good time to check it out!
First, let’s return the numbers up until n
as a vector:
function fizzbuzz(n)
    return collect(1:n)
end
Here, collect just takes our range operator and evaluates it to an array.
julia> fizzbuzz(5)
5-element Array{Int64,1}:
1
2
3
4
5
This works. Let’s see if we can print Fizz for each number that’s divisible by 3. We can do this by replacing all numbers that are divisible by 3 with a Fizz string.
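The code for this step appears to be missing from the post as captured here; a minimal reconstruction consistent with the output below and with the step-by-step notes that follow might look like this (my sketch, not necessarily the author's exact code):
function fizzbuzz(n)
    numbers = string.(collect(1:n))  # keep everything as strings
    fizzable = rem.(1:n, 3) .== 0    # true where the number is divisible by 3
    numbers[fizzable] .= "Fizz"      # broadcast-assign "Fizz" over the mask
    return numbers
end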
julia> fizzbuzz(7)
7-element Array{String,1}:
"1"
"2"
"Fizz"
"4"
"5"
"Fizz"
"7"
Let's break this down step by step:
1. Why convert the numbers to string? Well, the array of numbers is just that, an array of numbers. We don't want to have numbers and strings mingled in a single object.
2. We use rem.(numbers, 3) to find the remainder of all the numbers.
3. We compare the remainders against zero (.== 0) to get a Boolean mask over the elements.
4. We broadcast-assign the string "Fizz" to the positions where the mask is true.
Feel free to break these steps down and try them in your own Julia REPL!
I know that the use of .= to assign a single value to many elements can be a bit controversial, but I actually quite like it. By explicitly broadcasting the assignment you force yourself to think about the differences between these objects, and everyone who reads your code afterwards will see that one is a vector and the other is a scalar.
Adding the Buzzes is done exactly the same way:
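A sketch by analogy with the Fizz step (again my reconstruction; the post is truncated here):
function fizzbuzz(n)
    numbers = string.(collect(1:n))
    numbers[rem.(1:n, 3) .== 0] .= "Fizz"
    numbers[rem.(1:n, 5) .== 0] .= "Buzz"
    numbers[rem.(1:n, 15) .== 0] .= "FizzBuzz"  # divisible by both 3 and 5
    return numbers
end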
#programming #julia #optimization #coding #vectorization