Knip: Find Unused Files, Dependencies & Exports in Your JS/TS Project

✂️ Knip

Knip finds unused files, dependencies and exports in your JavaScript and TypeScript projects. Less code and fewer dependencies lead to improved performance, less maintenance and easier refactoring.

export const myVar = true;

ESLint handles files in isolation, so it does not know whether myVar is actually used somewhere else. Knip lints the project as a whole, and finds unused exports, files and dependencies.

It's only human to forget to remove things that you no longer use. But how do you find out? Where do you even start looking for things that can be removed?

The dots don't connect themselves. This is where Knip comes in:

  •  Finds unused files, dependencies and exports
  •  Finds used dependencies not listed in package.json
  •  Finds duplicate exports
  •  Finds unused members of classes and enums
  •  Built-in support for monorepos/workspaces
  •  Growing list of built-in plugins
  •  Checks npm scripts for used and unlisted dependencies
  •  Supports JavaScript (without tsconfig.json, or TypeScript allowJs: true)
  •  Features multiple reporters and supports custom reporters
  •  Run Knip as part of your CI environment to detect issues and prevent regressions

Knip shines in both small and large projects. It's a fresh take on keeping your projects clean & tidy!

[image: “An orange cow with scissors, Van Gogh style” - generated with OpenAI]

Migrating to v1.0.0

When coming from version v0.13.3 or before, please see migration to v1.

Announcement: Knip v2

The next major release is upcoming. Please see https://github.com/webpro/knip/issues/73 for the full story. Use npm install knip@next to try it out if you're curious! No changes in configuration necessary. Find the updated documentation at https://github.com/webpro/knip/blob/v2/README.md.

Issues

Are you seeing false positives? Please report them by opening an issue in this repo. Bonus points for linking to a public repository using Knip, or even opening a pull request with a directory and example files in test/fixtures. Correctness and bug fixes have priority over performance and new features.

Also see the FAQ.

Installation

npm install -D knip

Knip supports LTS versions of Node.js, and currently requires at least Node.js v16.17 or v18.6. Knip is cutting edge!

Usage

Knip has good defaults and you can run it without any configuration, but especially larger projects get more out of Knip with a configuration file (or a knip property in package.json). Let's name this file knip.json with these contents (you might want to adjust right away for your project):

{
  "$schema": "https://unpkg.com/knip@1/schema.json",
  "entry": ["src/index.ts"],
  "project": ["src/**/*.ts"]
}

The entry files target the starting point(s) to resolve the rest of the imported code. The project files should contain all files to match against the files resolved from the entry files, including potentially unused files.

Use knip.ts with TypeScript if you prefer:

import type { KnipConfig } from 'knip';

const config: KnipConfig = {
  entry: ['src/index.ts'],
  project: ['src/**/*.ts'],
};

export default config;

If your project uses workspaces (a monorepo), please see workspaces & monorepos.

Then run the checks with npx knip. Or first add this script to package.json:

{
  "scripts": {
    "knip": "knip"
  }
}

Use npm run knip to analyze the project and output unused files, dependencies and exports. Knip works just fine with yarn or pnpm as well.

Command-line options

$ npx knip --help
✂️  Find unused files, dependencies and exports in your JavaScript and TypeScript projects

Usage: knip [options]

Options:
  -c, --config [file]      Configuration file path (default: [.]knip.json[c], knip.js, knip.ts or package.json#knip)
  -t, --tsConfig [file]    TypeScript configuration path (default: tsconfig.json)
  --production             Analyze only production source files (e.g. no tests, devDependencies, exported types)
  --strict                 Consider only direct dependencies of workspace (not devDependencies, not other workspaces)
  --workspace              Analyze a single workspace (default: analyze all configured workspaces)
  --include-entry-exports  Include unused exports in entry files (without `@public`)
  --ignore                 Ignore files matching this glob pattern, can be repeated
  --no-gitignore           Don't use .gitignore
  --include                Report only provided issue type(s), can be comma-separated or repeated (1)
  --exclude                Exclude provided issue type(s) from report, can be comma-separated or repeated (1)
  --dependencies           Shortcut for --include dependencies,unlisted
  --exports                Shortcut for --include exports,nsExports,classMembers,types,nsTypes,enumMembers,duplicates
  --no-progress            Don't show dynamic progress updates
  --reporter               Select reporter: symbols, compact, codeowners, json (default: symbols)
  --reporter-options       Pass extra options to the reporter (as JSON string, see example)
  --no-exit-code           Always exit with code zero (0)
  --max-issues             Maximum number of issues before non-zero exit code (default: 0)
  --debug                  Show debug output
  --debug-file-filter      Filter for files in debug output (regex as string)
  --performance            Measure running time of expensive functions and display stats table
  -h, --help               Print this help text
  -V, --version            Print version

(1) Issue types: files, dependencies, unlisted, exports, nsExports, classMembers, types, nsTypes, enumMembers, duplicates

Examples:

$ knip
$ knip --production
$ knip --workspace packages/client --include files,dependencies
$ knip -c ./config/knip.json --reporter compact
$ knip --reporter codeowners --reporter-options '{"path":".github/CODEOWNERS"}'
$ knip --debug --debug-file-filter '(specific|particular)-module'

More documentation and bug reports: https://github.com/webpro/knip

Screenshots

Here's an example run using the default reporter:

[screenshot: example output with the default reporter]

This example shows more output related to unused and unlisted dependencies:

[screenshot: example output for unused and unlisted dependencies]

Reading the report

The report contains the following types of issues:

  • Unused files: did not find references to this file
  • Unused dependencies: did not find references to this dependency
  • Unlisted or unresolved dependencies: used dependencies, but not listed in package.json (1)
  • Unused exports: did not find references to this exported variable
  • Unused exports in namespaces: did not find direct references to this exported variable (2)
  • Unused exported types: did not find references to this exported type
  • Unused exported types in namespaces: did not find direct references to this exported variable (2)
  • Unused exported enum members: did not find references to this member of the exported enum
  • Unused exported class members: did not find references to this member of the exported class
  • Duplicate exports: the same thing is exported more than once

When an issue type has zero issues, it is not shown.

(1) This includes imports that could not be resolved.

(2) The variable or type is not referenced directly, and has become a member of a namespace. Knip can't find a reference to it, so you can probably remove it.

Output filters

You can --include or --exclude any of the types to slice & dice the report to your needs. Alternatively, they can be added to the configuration (e.g. "exclude": ["dependencies"]).
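
For example, a knip.json that always skips dependency-related checks could extend the earlier configuration like this (a small sketch, not a recommendation):

{
  "$schema": "https://unpkg.com/knip@1/schema.json",
  "entry": ["src/index.ts"],
  "project": ["src/**/*.ts"],
  "exclude": ["dependencies"]
}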

Knip finds issues of type files, dependencies, unlisted and duplicates very fast. Finding unused exports requires deeper analysis (exports, nsExports, classMembers, types, nsTypes, enumMembers).

Use --include to report only specific issue types (the following example commands do the same):

knip --include files --include dependencies
knip --include files,dependencies

Use --exclude to ignore reports you're not interested in:

knip --include files --exclude classMembers,enumMembers

Use --dependencies or --exports as shortcuts to combine groups of related types.

Still not happy with the results? Getting too much output or too many false positives? The FAQ may be useful. Feel free to open an issue and I'm happy to look into it. Also see the next section on how to ignore certain false positives:

Ignore

There are a few ways to tell Knip to ignore certain packages, binaries, dependencies and workspaces. Some examples:

{
  "ignore": ["**/*.d.ts", "**/fixtures"],
  "ignoreBinaries": ["zip", "docker-compose"],
  "ignoreDependencies": ["hidden-package"],
  "ignoreWorkspaces": ["packages/deno-lib"]
}

Now what?

This is the fun part! Knip, knip, knip ✂️

As always, make sure to backup files or use Git before deleting files or making changes. Run tests to verify results.

  • Unused files can be removed.
  • Unused dependencies can be removed from package.json.
  • Unlisted dependencies should be added to package.json.
  • Unused exports and types: remove the export keyword in front of unused exports. Then you can see whether the variable or type is used within the same file. If this is not the case, it can be removed.
  • Duplicate exports can be removed so they're exported only once.

🔁 Repeat the process to reveal new unused files and exports. Sometimes it's so liberating to remove things!

Workspaces & Monorepos

Workspaces and monorepos are handled out-of-the-box by Knip. Every workspace that is part of the Knip configuration will be part of the analysis. Here's an example:

{
  "ignoreWorkspaces": ["packages/ignore-me"],
  "workspaces": {
    ".": {
      "entry": "src/index.ts",
      "project": "src/**/*.ts"
    },
    "packages/*": {
      "entry": "{index,cli}.ts",
      "project": "**/*.ts"
    },
    "packages/my-lib": {
      "entry": "main.js"
    }
  }
}

Note that if you have a root workspace, it must be under workspaces and have the "." key like in the example.

Knip supports workspaces as defined in three possible locations:

  • In the workspaces array in package.json.
  • In the workspaces.packages array in package.json.
  • In the packages array in pnpm-workspace.yaml.

Every directory with a match in workspaces of knip.json is part of the analysis.

Extra "workspaces" not configured as a workspace in the root package.json can be configured as well; Knip is happy to analyze unused dependencies and exports in any directory with a package.json.

Here's some example output when running Knip in a workspace:

[screenshot: example output when running Knip in a workspace]

Plugins

Knip contains a growing list of plugins:

Plugins are enabled automatically based on simple heuristics: most of them check whether one of a few (dev) dependencies is listed in package.json. Once enabled, they add a set of configuration and/or entry files for Knip to analyze. These defaults can be overridden.

Most plugins use one or both of the following file types:

  • config - custom dependency resolvers are applied to the config files
  • entry - files to include with the analysis of the rest of the source code

See each plugin's documentation for its default values.

config

Plugins may include config files. They are parsed by the plugin's custom dependency resolver. Here are some examples to get an idea of how they work and why they are needed:

  • The eslint plugin tells Knip that the "prettier" entry in the array of plugins means that the eslint-plugin-prettier dependency should be installed. Or that the "airbnb" entry in extends requires the eslint-config-airbnb dependency.
  • The storybook plugin understands that core.builder: 'webpack5' in main.js means that the @storybook/builder-webpack5 and @storybook/manager-webpack5 dependencies are required.
  • Static configuration files such as JSON and YAML always require a custom dependency resolver.

Custom dependency resolvers return all referenced dependencies for the configuration files they are given. Knip handles the rest to find which of those dependencies are unused or missing.

entry

Other configuration files use require or import statements to use dependencies, so they can be analyzed like the rest of the source files. These configuration files are also considered entry files.

For plugins related to test files, it's good to know that the following glob patterns are always included by default (see TEST_FILE_PATTERNS in constants.ts):

  • **/*.{test,spec}.{js,jsx,ts,tsx,mjs,cjs}
  • **/__tests__/**/*.{js,jsx,ts,tsx,mjs,cjs}
  • test/**/*.{js,jsx,ts,tsx,mjs,cjs}

Disable a plugin

In case a plugin causes issues, it can be disabled by using false as its value (e.g. "webpack": false).

Create a new plugin

Getting false positives because a plugin is missing? Want to help out? Feel free to add your own plugin! Here's how to get started:

npm run create-plugin -- --name [myplugin]

Production Mode

The default mode for Knip is holistic and targets all project code, including configuration files and tests. Test files usually import production files. This prevents the production files or their exports from being reported as unused, while sometimes both of them can be removed. This is why Knip has a "production mode".

To tell Knip what is production code, add an exclamation mark (!) behind each pattern that is meant for production and use the --production flag. Here's an example:

{
  "entry": ["src/index.ts!", "build/script.js"],
  "project": ["src/**/*.ts!", "build/*.js"]
}

Here's what's included in production mode analysis:

  • Only entry and project patterns suffixed with !.
  • Only entry patterns from plugins exported as PRODUCTION_ENTRY_FILE_PATTERNS (such as Next.js and Gatsby).
  • Only the postinstall and start scripts (e.g. not the test or other npm scripts in package.json).
  • Only exports, nsExports and classMembers are included in the report (types, nsTypes, enumMembers are ignored).

Strict

Additionally, the --strict flag can be used to:

  • Consider only dependencies (not devDependencies) when finding unused or unlisted dependencies.
  • Consider only non-type imports (i.e. ignore import type {}).
  • Assume each workspace is self-contained: it has its own dependencies and does not rely on packages of ancestor workspaces.

Plugins

Plugins also have this distinction. For instance, Next.js entry files for pages (pages/**/*.tsx) and Remix routes (app/routes/**/*.tsx) are production code, while Jest and Playwright entry files (e.g. *.spec.ts) are not. All of this is handled automatically by Knip and its plugins. You only need to point Knip to additional files or custom file locations. The more plugins Knip has, the more projects can be analyzed out of the box!

Paths

Tools like TypeScript, Webpack and Babel support import aliases in various ways. Knip automatically includes compilerOptions.paths from the TypeScript configuration, but does not (yet) automatically find other types of import aliases. They can be configured manually:

{
  "$schema": "https://unpkg.com/knip@1/schema.json",
  "paths": {
    "@lib": ["./lib/index.ts"],
    "@lib/*": ["./lib/*"]
  }
}

Each workspace can also have its own paths configured. Note that Knip paths follow the TypeScript semantics:

  • Each path value is an array of relative paths.
  • Paths without an * are exact matches.

Reporters

Knip provides the following built-in reporters: symbols (the default), compact, codeowners and json.

The compact reporter shows the sorted files first, and then a list of symbols:

[screenshot: example output with the compact reporter]

Custom Reporters

When the provided built-in reporters are not sufficient, a custom reporter can be implemented.

Pass --reporter ./my-reporter, with the default export of that module having this interface:

type Reporter = (options: ReporterOptions) => void;

type ReporterOptions = {
  report: Report;
  issues: Issues;
  cwd: string;
  workingDir: string;
  isProduction: boolean;
  options: string;
};

The data can then be used to write issues to stdout, a JSON or CSV file, or sent to a service.

Find more details and ideas in custom reporters.

Libraries and "unused" exports

Libraries and applications are identical when it comes to files and dependencies: whatever is unused should be removed. Yet libraries usually have exports meant to be used by other libraries or applications. Such public variables and types in libraries can be marked with the JSDoc @public tag:

/**
 * Merge two objects.
 *
 * @public
 */

export const merge = function () {};

Knip does not report public exports and types as unused.

FAQ

Really, another unused file/dependency/export finder?

There are already some great packages available if you want to find unused dependencies OR unused exports.

I love the Unix philosophy ("do one thing well"). But in this case I believe it's efficient to handle multiple concerns in a single tool. When building a dependency graph of the project, an abstract syntax tree for each file, and traversing all of this, why not collect the various issues in one go?

Why so much configuration?

The structure and configuration of projects and their dependencies vary wildly, and no matter how well-balanced, defaults only get you so far. Some implementations and some tools out there have smart or unconventional ways to import code, making things more complicated. That's why Knip tends to require more configuration in larger projects, based on how many dependencies are used and how much the configuration in the project diverges from the defaults.

One important goal of Knip is to minimize the amount of configuration necessary. When false positives are reported and you think there are feasible ways to infer things automatically, reducing the amount of configuration, please open an issue.

How do I handle too many output/false positives?

Too many unused files

When the list of unused files is too long, the gap between the set of entry files and the set of project files needs tweaking. The gap can be narrowed by adding more entry files or reducing the set of project files, for instance by ignoring specific folders that are not related to the source code imported by the entry files.
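
For example, the gap can often be narrowed with configuration along these lines (the folder names are hypothetical):

{
  "entry": ["src/index.ts"],
  "project": ["src/**/*.ts"],
  "ignore": ["src/generated/**", "scripts/**"]
}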

Too many unused dependencies

Dependencies that are only imported in unused files are also marked as unused. So a long list of unused files would be good to remedy first.

When unused dependencies are related to dependencies having a Knip plugin, maybe the config and/or entry files for that dependency are at custom locations. The default values are listed in the plugin's documentation and can be overridden to match the custom location(s).

When the dependencies don't have a Knip plugin yet, please file an issue or create a new plugin.

Too many unused exports

When the project is a library and the exports are meant to be used by consumers of the library, there are two options:

  1. By default, unused exports of entry files are not reported. You could re-export from an existing entry file, or add the containing file to the entry array in the configuration.
  2. The exported values or types can be marked using the JSDoc @public tag.

How to start using Knip in CI while having too many issues to sort out?

Eventually this type of QA only really works when it's tied to an automated workflow. But with too many issues to resolve this might not be feasible right away, especially in a larger existing codebase. Here are a few options that may help:

  • Use --no-exit-code for exit code 0 in CI.
  • Use --include (or --exclude) to report only the issue types that have little or no errors.
  • Use a separate --dependencies and/or --exports Knip command.
  • Use ignore (for files and directories) and ignoreDependencies to filter out some problematic areas.
  • Limit the number of workspaces configured to analyze in knip.json.

All of this is hiding problems, so please make sure to plan for fixing them and/or open issues here for false positives.

Comparison

This table is an ongoing comparison. Based on their docs (please report any mistakes):

Feature                        | knip | depcheck | unimported | ts-unused-exports | ts-prune
Unused files                   | ✅   | -        | ✅         | -                 | -
Unused dependencies            | ✅   | ✅       | ✅         | -                 | -
Unlisted dependencies          | ✅   | ✅       | ✅         | -                 | -
Plugins                        | ✅   | ✅       | ❌         | -                 | -
Unused exports                 | ✅   | -        | -          | ✅                | ✅
Unused class members           | ✅   | -        | -          | -                 | -
Unused enum members            | ✅   | -        | -          | -                 | -
Duplicate exports              | ✅   | -        | -          | ❌                | ❌
Search namespaces              | ✅   | -        | -          | ✅                | ❌
Custom reporters               | ✅   | -        | -          | -                 | -
JavaScript support             | ✅   | ✅       | ✅         | -                 | -
Configure entry files          | ✅   | ❌       | ✅         | ❌                | ❌
Support workspaces/monorepos   | ✅   | ❌       | ❌         | -                 | -
ESLint plugin available        | -    | -        | -          | ✅                | -

✅ = Supported, ❌ = Not supported, - = Out of scope

Migrating from other tools

depcheck

The following commands are similar:

depcheck
knip --dependencies

unimported

The following commands are similar:

unimported
knip --production --dependencies --include files

Also see production mode.

ts-unused-exports

The following commands are similar:

ts-unused-exports
knip --include exports,types,nsExports,nsTypes
knip --exports  # Adds unused enum and class members

ts-prune

The following commands are similar:

ts-prune
knip --include exports,types
knip --exports  # Adds unused exports/types in namespaces and unused enum/class members

TypeScript language services

TypeScript language services could play a major role in most of the "unused" areas, as they have an overview of the project as a whole. This powers things in VS Code like "Find references" or the "Module "./some" declares 'Thing' locally, but it is not exported" message. I think features like "duplicate exports" or "custom dependency resolvers" are userland territory, much like code linters.

Knip?!

Knip is Dutch for a "cut". A Dutch expression is "to be geknipt for something", which means to be perfectly suited for the job. I'm motivated to make knip perfectly suited for the job of cutting projects to perfection! ✂️


Download Details:

Author: Webpro
Source Code: https://github.com/webpro/knip 
License: ISC license

#typescript #lint #dependency #analysis #maintenance


DataFrame: C++ DataFrame for Statistical, Financial, & ML Analysis

DataFrame Documentation / Code Samples

This is a C++ analytical library designed for data analysis similar to libraries in Python and R. For example, you could compare this to Pandas or R data.frame.
You could slice the data in many different ways. You could join, merge, group-by the data. You could run various statistical, summarization, financial, and ML algorithms on the data. You could add your custom algorithms easily. You could multi-column sort, custom pick and delete the data. And more …
DataFrame also includes a large collection of analytical algorithms in the form of visitors. These range from basic stats such as Mean, Std Deviation, and Return, to more involved analyses such as Affinity Propagation, Polynomial Fit, and Fast Fourier Transform of arbitrary length, and include a good collection of trading indicators. You could also easily add your own algorithms.
For basic operations to start you off, see Hello World. For a complete list of features with code samples, see documentation.

I have followed a few principles in this library:
 

  1. Support any type either built-in or user defined without needing new code
  2. Never chase pointers ala linked lists, std::any, pointer to base, ..., including virtual functions
  3. Have all column data in contiguous memory space. Also, be mindful of cache-line aliasing misses between multiple columns
  4. Never use more space than you need ala unions, std::variant, ...
  5. Avoid copying data as much as possible
  6. Use multi-threading but only when it makes sense
  7. Do not attempt to protect the user against garbage in, garbage out

DateTime
The DateTime class included in this library is a very handy object for manipulating date/time with nanosecond precision and multi-timezone capability.
 


Performance

There is a test program dataframe_performance that should give you a sense of how this library performs. As a comparison, there is also a Pandas pandas_performance script that does exactly the same thing.
dataframe_performance.cc uses the DataFrame async interface and is compiled with the gcc (10.3.0) compiler with the -O3 flag. pandas_performance.py is run with Pandas 1.3.2, Numpy 1.21.2 and Python 3.7 on a Xeon E5-2667 v2. What the test program roughly does:
 

  1. Generate ~1.6 billion timestamps (second resolution) and load them into the DataFrame/Pandas as index.
     
  2. Generate ~1.6 billion random numbers for 3 columns with normal, log normal, and exponential distributions and load them into the DataFrame/Pandas.
     
  3. Calculate the mean of each of the 3 columns.
     

Result:

$ python3 benchmarks/pandas_performance.py
Starting ... 1629817655
All memory allocations are done. Calculating means ... 1629817883
6.166675403767268e-05, 1.6487168460770107, 0.9999539627671375
1629817894 ... Done

real    5m51.598s
user    3m3.485s
sys     1m26.292s

$ Release/bin/dataframe_performance
Starting ... 1629818332
All memory allocations are done. Calculating means ... 1629818535
1, 1.64873, 1
1629818536 ... Done
  
real    3m34.241s                                                                                                                      
user    3m14.250s
sys     0m25.983s

The Interesting Part:
 

  1. The Pandas script, I believe, is entirely implemented in Numpy, which is written in C.
  2. In case of Pandas, allocating memory + random number generation takes almost the same amount of time as calculating means.
  3. In case of DataFrame ~90% of the time is spent in allocating memory + random number generation.
  4. You load data once, but calculate statistics many times. So DataFrame, in general, is about ~11x faster than the parts of Pandas that are implemented in Numpy. I leave the parts of Pandas that are purely in Python to the imagination.
  5. Pandas process image at its peak is ~105GB. C++ DataFrame process image at its peak is ~56GB.

Installing using CMake

mkdir [Debug | Release]
cd [Debug | Release]
cmake -DCMAKE_BUILD_TYPE=[Debug | Release] -DHMDF_BENCHMARKS=1 -DHMDF_EXAMPLES=1 -DHMDF_TESTING=1 ..
make
make install

cd [Debug | Release]
make uninstall

Package managers

DataFrame is available on the Conan platform. Add dataframe/x.y.z@ to your requires, where x.y.z is the release version you want to use. Conan will acquire DataFrame, build it from source on your computer, and provide CMake integration support for your projects. See the Conan docs for more information.
Sample conanfile.txt:

[requires]
dataframe/1.22.0@

[generators]
cmake

DataFrame is also available through the Microsoft vcpkg platform:

git clone https://github.com/Microsoft/vcpkg.git
cd vcpkg
bootstrap-vcpkg.[bat|sh]
vcpkg(.exe) integrate install
vcpkg(.exe) install DataFrame

Download Details:

Author: Hosseinmoein
Source Code: https://github.com/hosseinmoein/DataFrame 
License: BSD-3-Clause license

#machinelearning #datascience #statistical #analysis #cpluplus 


PySAL: Python Spatial Analysis Library Meta-Package

Python Spatial Analysis Library

PySAL, the Python spatial analysis library, is an open-source, cross-platform Python library for geospatial data science with an emphasis on geospatial vector data. It supports the development of high-level applications for spatial analysis, such as

  • detection of spatial clusters, hot-spots, and outliers
  • construction of graphs from spatial data
  • spatial regression and statistical modeling on geographically embedded networks
  • spatial econometrics
  • exploratory spatio-temporal data analysis

PySAL Components

PySAL is a family of packages for spatial data science and is divided into four major components:

Lib

The core library solves a wide variety of computational geometry problems, including graph construction from polygonal lattices, lines, and points; construction and interactive editing of spatial weights matrices and graphs; computation of alpha shapes, spatial indices, and spatial-topological relationships; and reading and writing of sparse graph data, as well as pure Python readers of spatial vector data. Unlike other PySAL modules, these functions are exposed together as a single package.

  • libpysal : libpysal provides foundational algorithms and data structures that support the rest of the library. This currently includes the following modules: input/output (io), which provides readers and writers for common geospatial file formats; weights (weights), which provides the main class to store spatial weights matrices, as well as several utilities to manipulate and operate on them; computational geometry (cg), with several algorithms, such as Voronoi tessellations or alpha shapes that efficiently process geometric shapes; and an additional module with example data sets (examples).

Explore

The explore layer includes modules to conduct exploratory analysis of spatial and spatio-temporal data. At a high level, packages in explore are focused on enabling the user to better understand patterns in the data and suggest new interesting questions rather than answer existing ones. They include methods to characterize the structure of spatial distributions (either on networks, in continuous space, or on polygonal lattices). In addition, this domain offers methods to examine the dynamics of these distributions, such as how their composition or spatial extent changes over time.

esda : esda implements methods for the analysis of both global (map-wide) and local (focal) spatial autocorrelation, for both continuous and binary data. In addition, the package increasingly offers cutting-edge statistics about boundary strength and measures of aggregation error in statistical analyses.
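
As a quick illustration of how these pieces fit together, here is a minimal sketch (not taken from the PySAL docs) that computes global Moran's I for a hypothetical polygon layer "regions.shp" with a numeric column "income":

import geopandas as gpd
from libpysal.weights import Queen
from esda.moran import Moran

gdf = gpd.read_file("regions.shp")   # hypothetical input file
w = Queen.from_dataframe(gdf)        # contiguity-based spatial weights
w.transform = "r"                    # row-standardize the weights
mi = Moran(gdf["income"], w)         # global spatial autocorrelation
print(mi.I, mi.p_sim)                # statistic and permutation-based p-value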

giddy : giddy is an extension of esda to spatio-temporal data. The package hosts state-of-the-art methods that explicitly consider the role of space in the dynamics of distributions over time

inequality : inequality provides indices for measuring inequality over space and time. These comprise classic measures such as the Theil T information index and the Gini index in mean deviation form; but also spatially-explicit measures that incorporate the location and spatial configuration of observations in the calculation of inequality measures.

momepy : momepy is a library for quantitative analysis of urban form - urban morphometrics. It aims to provide a wide range of tools for a systematic and exhaustive analysis of urban form. It can work with a wide range of elements, while focused on building footprints and street networks. momepy stands for Morphological Measuring in Python.

pointpats : pointpats supports the statistical analysis of point data, including methods to characterize the spatial structure of an observed point pattern: a collection of locations where some phenomena of interest have been recorded. This includes measures of centrography which provide overall geometric summaries of the point pattern, including central tendency, dispersion, intensity, and extent.

segregation : segregation package calculates over 40 different segregation indices and provides a suite of additional features for measurement, visualization, and hypothesis testing that together represent the state-of-the-art in quantitative segregation analysis.

spaghetti : spaghetti supports the spatial analysis of graphs, networks, topology, and inference. It includes functionality for the statistical testing of clusters on networks, a robust all-to-all Dijkstra shortest path algorithm with multiprocessing functionality, high-performance geometric and spatial computations using geopandas that are necessary for high-resolution interpolation along networks, and the ability to connect near-network observations onto the network.

Model

In contrast to explore, the model layer focuses on confirmatory analysis. In particular, its packages focus on the estimation of spatial relationships in data with a variety of linear, generalized-linear, generalized-additive, nonlinear, multi-level, and local regression models.

mgwr : mgwr provides scalable algorithms for estimation, inference, and prediction using single- and multi-scale geographically-weighted regression models in a variety of generalized linear model frameworks, as well as model diagnostics tools.

spglm : spglm implements a set of generalized linear regression techniques, including Gaussian, Poisson, and logistic regression, that allow for sparse matrix operations in their computation and estimation to lower memory overhead and decrease computation time.

spint : spint provides a collection of tools to study spatial interaction processes and analyze spatial interaction data. It includes functionality to facilitate the calibration and interpretation of a family of gravity-type spatial interaction models, including those with production constraints, attraction constraints, or a combination of the two.

spreg : spreg supports the estimation of classic and spatial econometric models. Currently it contains methods for estimating standard Ordinary Least Squares (OLS), Two Stage Least Squares (2SLS) and Seemingly Unrelated Regressions (SUR), in addition to various tests of homoskedasticity, normality, spatial randomness, and different types of spatial autocorrelation. It also includes a suite of tests for spatial dependence in models with binary dependent variables.

spvcm : spvcm provides a general framework for estimating spatially-correlated variance components models. This class of models allows for spatial dependence in the variance components, so that nearby groups may affect one another. It also provides a general-purpose framework for estimating models using Gibbs sampling in Python, accelerated by the numba package.

tobler : tobler provides functionality for areal interpolation and dasymetric mapping. Its name is an homage to the legendary geographer Waldo Tobler, a pioneer of dozens of spatial analytical methods. tobler includes functionality for interpolating data using area-weighted approaches, regression model-based approaches that leverage remotely-sensed raster data as auxiliary information, and hybrid approaches.

access : access aims to make it easy for analysts to calculate measures of spatial accessibility. This work has traditionally had two challenges: [1] to calculate accurate travel time matrices at scale and [2] to derive measures of access using the travel times and supply and demand locations. access implements classic spatial access models, allowing easy comparison of methodologies and assumptions.

spopt: spopt is an open-source Python library for solving optimization problems with spatial data. Originating from the original region module in PySAL, it is under active development for the inclusion of newly proposed models and methods for regionalization, facility location, and transportation-oriented solutions.

Viz

The viz layer provides functionality to support the creation of geovisualisations and visual representations of outputs from a variety of spatial analyses. Visualization plays a central role in modern spatial/geographic data science. Current packages provide classification methods for choropleth mapping and a common API for linking PySAL outputs to visualization tool-kits in the Python ecosystem.

legendgram : legendgram is a small package that provides "legendgrams", legends that visualize the distribution of observations by color in a given map. These distributional visualizations for map classification schemes assist in analytical cartography and spatial data visualization.

mapclassify : mapclassify provides functionality for Choropleth map classification. Currently, fifteen different classification schemes are available, including a highly-optimized implementation of Fisher-Jenks optimal classification. Each scheme inherits a common structure that ensures computations are scalable and supports applications in streaming contexts.
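
For instance, here is a short sketch (with random values, purely for illustration) of classifying an attribute into five classes using the Fisher-Jenks scheme:

import numpy as np
import mapclassify

y = np.random.lognormal(size=100)    # hypothetical attribute values
fj = mapclassify.FisherJenks(y, k=5)
print(fj.bins)                       # upper bound of each class
print(fj.yb)                         # class assigned to each observation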

splot : splot provides statistical visualizations for spatial analysis. It offers methods for visualizing global and local spatial autocorrelation (through Moran scatterplots and cluster maps), temporal analysis of cluster dynamics (through heatmaps and rose diagrams), and multivariate choropleth mapping (through value-by-alpha maps). A high-level API supports the creation of publication-ready visualizations.

Installation

PySAL is available through Anaconda (in the defaults or conda-forge channel). We recommend installing PySAL from conda-forge:

conda config --add channels conda-forge
conda install pysal

PySAL can also be installed using pip:

pip install pysal

As of version 2.0.0 PySAL has shifted to Python 3 only.

Users who need an older stable version of PySAL that is Python 2 compatible can install version 1.14.3 through pip or conda:

conda install pysal==1.14.3

Documentation

For help on using PySAL, check out the following resources:

Development

As of version 2.0.0, PySAL is now a collection of affiliated geographic data science packages. Changes to the code for any of the subpackages should be directed at the respective upstream repositories, and not made here. Infrastructural changes for the meta-package, like those for tooling, building the package, and code standards, will be considered.

Development is hosted on github.

Discussions of development, as well as help for users, occur on the developer list as well as on gitter.

Getting Involved

If you are interested in contributing to PySAL please see our development guidelines.

Bug reports

To search for or report bugs, please see PySAL's issues.

Build Instructions

To build the meta-package pysal see tools/README.md.


Download Details:

Author: Pysal
Source Code: https://github.com/pysal/pysal 
License: BSD-3-Clause license

#machinelearning #python #analysis #library 


Converts JavaScript to TypeScript & TypeScript to better TypeScript

TypeStat

Converts JavaScript to TypeScript and TypeScript to better TypeScript.

Usage

TypeStat is a CLI utility that modifies TypeScript types in existing code. The built-in mutators will only ever add or remove types and will never change your runtime behavior. TypeStat can:

  • ✨ Convert JavaScript files to TypeScript in a single bound!
  • ✨ Add TypeScript types on files freshly converted from JavaScript to TypeScript!
  • ✨ Infer types to fix --noImplicitAny and --noImplicitThis violations!
  • ✨ Annotate missing nulls and undefineds to get you started with --strictNullChecks!

⚡ To start, the typestat command will launch an interactive guide to setting up a configuration file. ⚡

npx typestat
👋 Welcome to TypeStat! 👋
This will create a new typestat.json for you.
...

Afterwards, use typestat --config typestat.json to convert your files.

Configuration

To get a deeper understanding of TypeStat, read the following docs pages in order:

  1. Usage.md for an explanation of how TypeStat works
  2. Fixes.md for the type of fixes TypeStat will generate mutations for
  3. Types.md for configuring how to work with types in mutations
  4. Filters.md for using tsquery to ignore sections of source files
  5. Custom Mutators.md for including or creating custom mutators

Development

See Development.md. 💖


Download Details:

Author: JoshuaKGoldberg
Source Code: https://github.com/JoshuaKGoldberg/TypeStat 
License: MIT license

#typescript #javascript #static #analysis 


PyMC: Bayesian Modeling in Python

PyMC

PyMC (formerly PyMC3) is a Python package for Bayesian statistical modeling focusing on advanced Markov chain Monte Carlo (MCMC) and variational inference (VI) algorithms. Its flexibility and extensibility make it applicable to a large suite of problems.

Check out the PyMC overview, or one of the many examples! For questions on PyMC, head on over to our PyMC Discourse forum.

Features

  • Intuitive model specification syntax, for example, x ~ N(0,1) translates to x = Normal('x',0,1) (see the sketch after this list)
  • Powerful sampling algorithms, such as the No U-Turn Sampler, allow complex models with thousands of parameters with little specialized knowledge of fitting algorithms.
  • Variational inference: ADVI for fast approximate posterior estimation as well as mini-batch ADVI for large data sets.
  • Transparent support for missing value imputation
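
Here is a minimal sketch of that model specification syntax, assuming a recent PyMC version (v4 or later) where the package is imported as pymc; the data values below are made up purely for illustration:

import numpy as np
import pymc as pm

observed = np.array([0.2, -0.4, 0.3, 0.1])                   # toy observations, purely illustrative

with pm.Model() as model:
    x = pm.Normal("x", mu=0, sigma=1)                        # prior: x ~ N(0, 1)
    y = pm.Normal("y", mu=x, sigma=0.5, observed=observed)   # likelihood around x
    idata = pm.sample(1000, tune=1000)                       # NUTS sampling by default

print(idata.posterior["x"].mean())                           # posterior mean of x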

Getting started

If you already know about Bayesian statistics:

Learn Bayesian statistics with a book together with PyMC

Audio & Video

  • Here is a YouTube playlist gathering several talks on PyMC.
  • You can also find all the talks given at PyMCon 2020 here.
  • The "Learning Bayesian Statistics" podcast helps you discover and stay up-to-date with the vast Bayesian community. Bonus: it's hosted by Alex Andorra, one of the PyMC core devs!

Installation

To install PyMC on your system, follow the instructions on the installation guide.

Citing PyMC

Please choose from the following:

  • Paper (DOI): Probabilistic programming in Python using PyMC3, Salvatier J., Wiecki T.V., Fonnesbeck C. (2016)
  • Zenodo (DOI): a DOI for all versions.
  • DOIs for specific versions are shown on Zenodo and under Releases

Contact

We are using discourse.pymc.io as our main communication channel.

To ask a question regarding modeling or usage of PyMC we encourage posting to our Discourse forum under the “Questions” Category. You can also suggest features in the “Development” Category.

You can also follow us on these social media platforms for updates and other announcements:

To report an issue with PyMC please use the issue tracker.

Finally, if you need to get in touch for non-technical information about the project, send us an e-mail.

Software using PyMC

General purpose

  • Bambi: BAyesian Model-Building Interface (BAMBI) in Python.
  • calibr8: A toolbox for constructing detailed observation models to be used as likelihoods in PyMC.
  • gumbi: A high-level interface for building GP models.
  • SunODE: Fast ODE solver, much faster than the one that comes with PyMC.
  • pymc-learn: Custom PyMC models built on top of pymc3_models/scikit-learn API

Domain specific

  • Exoplanet: a toolkit for modeling of transit and/or radial velocity observations of exoplanets and other astronomical time series.
  • beat: Bayesian Earthquake Analysis Tool.
  • CausalPy: A package focussing on causal inference in quasi-experimental settings.

Please contact us if your software is not listed here.

Papers citing PyMC

See Google Scholar for a continuously updated list.

Contributors

See the GitHub contributor page. Also read our Code of Conduct guidelines for a better contributing experience.

Support

PyMC is a non-profit project under NumFOCUS umbrella. If you want to support PyMC financially, you can donate here.

Professional Consulting Support

You can get professional consulting support from PyMC Labs.

PyMC relies on PyTensor, which provides:

  • Computation optimization and dynamic C or JAX compilation
  • NumPy broadcasting and advanced indexing
  • Linear algebra operators
  • Simple extensibility

Download Details:

Author: Pymc-devs
Source Code: https://github.com/pymc-devs/pymc 
License: View license

#machinelearning #python #statistical #analysis 


DCCA.jl: Julia Module for Detrended Cross-Correlation analysis

Detrended Cross-Correlation Analysis

A module to perform DCCA coefficients analysis. The coefficient rho describes the correlation strength between two time-series depending on time scales. It lies in [-1, 1], 1 being perfect correlations, and -1 perfect anticorrelations.
The package also provides functions returning a 95% confidence interval for the null hypothesis (= "no correlations").

The implementation is based on Zebende G, Et al. DCCA cross-correlation coefficient differentiation: Theoretical and practical approaches (2013), and was tested by reproducing the results of DCCA and DMCA correlations of cryptocurrency markets (2020) from Paulo Ferreira, et al.

Perform a DCCA coefficients computation:

To compute DCCA coefficients, call the rhoDCCA function like: pts, rho = rhoDCCA(timeSeries1, timeSeries2). It has the following parameters:

rhoDCCA(timeSeries1, timeSeries2; box_start = 3, box_stop = div(length(series1),10), nb_pts = 30, order = 1)

Input arguments:

  • timeSeries1, timeSeries2 (Array{Float64,1}): Time series to analyse, need to be of the same length.
  • box_start, box_stop (Int): Start and end points of the analysis. Default respectively to 3 (the minimal possible time-scale) and 1/10th of the data length (past this size the variance gets large).
  • nb_pts (Int): Number of points at which to carry out the analysis. Mostly relevant for plotting.
  • order (Int): Order of the polynomial to use for detrending. If not given, defaults to 1 (linear detrending). If order is too high, overfitting can happen, impacting the results.

Returns:

  • pts (Array{Int,1}): List of points (time-scales) where the analysis is carried out.
  • rho (Array{Float64,1}): Value of the DCCA coefficient at each point in pts.

Get the 95% confidence interval

As a rule of thumb : values of rho in [-0.1,0.1] usually aren't significant.

The confidence intervals provided by this package correspond to the null-hypothesis i.e no correlations. If rho gets outside of this interval it can be considered significant.

To get a fast estimation of the confidence interval, call the empirical_CI function like: pts, ci = empirical_CI(dataLength).

For a more accurate estimation, you can call bootstrap_CI: pts, ci = bootstrap_CI(timeSeries1, timeSeries2; iterations = 200). This operation can be much more demanding (a few minutes). The iterations argument controls the number of repetitions for the bootstrap procedure, the higher the value, the smoother and cleaner the estimation will be, but it will also take longer.

Example of simple analysis:

Calling the DCCA function with random white noise

julia> ts1 = rand(2000)
ts2 = rand(2000)
x, y = rhoDCCA(ts1, ts2)
pts, ci = empirical_CI(length(ts1))

gives the following plot:

a = scatter(x,y, markersize = 7, xscale = :log, title = "Example of DCCA analysis : \n Correlations between two white noise time series", label = "rho coefficients", xlabel = "window sizes", ylabel = "Correlation strength")
plot!(a,pts,ci, color = "red", linestyle = :dot, label = "limits of null-hypothesis")
plot!(a,pts,-ci, color = "red", linestyle = :dot, label = "")
display(a)

As noted previously, the values here lie in [-0.1, 0.1], as expected since we used two series of uncorrelated white noise.

Installation:

julia> using Pkg
Pkg.add("DCCA")

To-do:

  • implement spline detrending?


Download Details:

Author: CNelias
Source Code: https://github.com/CNelias/DCCA.jl 

#julia #cross #correlation #time #analysis 


Bulbea: Deep Learning based Python Library for Stock Market Prediction

Bulbea

“Deep Learning based Python Library for Stock Market Prediction and Modelling.”

Installation

Clone the git repository:

$ git clone https://github.com/achillesrasquinha/bulbea.git && cd bulbea

Install the necessary dependencies:

$ pip install -r requirements.txt

Go ahead and install as follows:

$ python setup.py install

You may have to install TensorFlow:

$ pip install tensorflow     # CPU
$ pip install tensorflow-gpu # GPU - Requires CUDA, CuDNN

Usage

1. Prediction

a. Loading

Create a share object.

>>> import bulbea as bb
>>> share = bb.Share('YAHOO', 'GOOGL')
>>> share.data
# Open        High         Low       Close      Volume  \
# Date                                                                     
# 2004-08-19   99.999999  104.059999   95.959998  100.339998  44659000.0   
# 2004-08-20  101.010005  109.079998  100.500002  108.310002  22834300.0   
# 2004-08-23  110.750003  113.479998  109.049999  109.399998  18256100.0   
# 2004-08-24  111.239999  111.599998  103.570003  104.870002  15247300.0   
# 2004-08-25  104.960000  108.000002  103.880003  106.000005   9188600.0
...

b. Preprocessing

Split your data set into training and testing sets.

>>> from bulbea.learn.evaluation import split
>>> Xtrain, Xtest, ytrain, ytest = split(share, 'Close', normalize = True)

c. Modelling

>>> import numpy as np
>>> Xtrain = np.reshape(Xtrain, (Xtrain.shape[0], Xtrain.shape[1], 1))
>>> Xtest  = np.reshape( Xtest, ( Xtest.shape[0],  Xtest.shape[1], 1))

>>> from bulbea.learn.models import RNN
>>> rnn = RNN([1, 100, 100, 1]) # number of neurons in each layer
>>> rnn.fit(Xtrain, ytrain)
# Epoch 1/10
# 1877/1877 [==============================] - 6s - loss: 0.0039
# Epoch 2/10
# 1877/1877 [==============================] - 6s - loss: 0.0019
...

d. Testing

>>> from sklearn.metrics import mean_squared_error
>>> p = rnn.predict(Xtest)
>>> mean_squared_error(ytest, p)
0.00042927869370525931
>>> import matplotlib.pyplot as pplt
>>> pplt.plot(ytest)
>>> pplt.plot(p)
>>> pplt.show()

2. Sentiment Analysis

Add your Twitter credentials to your environment variables.

export BULBEA_TWITTER_API_KEY="<YOUR_TWITTER_API_KEY>"
export BULBEA_TWITTER_API_SECRET="<YOUR_TWITTER_API_SECRET>"

export BULBEA_TWITTER_ACCESS_TOKEN="<YOUR_TWITTER_ACCESS_TOKEN>"
export BULBEA_TWITTER_ACCESS_TOKEN_SECRET="<YOUR_TWITTER_ACCESS_TOKEN_SECRET>"

And then,

>>> bb.sentiment(share)
0.07580128205128206

Documentation

Detailed documentation is available here.

Dependencies

  1. quandl
  2. keras
  3. tweepy
  4. textblob

Download Details:

Author: Achillesrasquinha
Source Code: https://github.com/achillesrasquinha/bulbea 
License: View license

#machinelearning #python #finance #deeplearning #analysis 


Thunder: Scalable analysis Of Images and Time Series

Thunder

scalable analysis of image and time series data in Python

Thunder is an ecosystem of tools for the analysis of image and time series data in Python. It provides data structures and algorithms for loading, processing, and analyzing these data, and can be useful in a variety of domains, including neuroscience, medical imaging, video processing, and geospatial and climate analysis. It can be used locally, but also supports large-scale analysis through the distributed computing engine spark. All data structures and analyses in Thunder are designed to run identically and with the same API whether local or distributed.

Thunder is designed around modularity and composability — the core thunder package, in this repository, only defines common data structures and read/write patterns, and most functionality is broken out into several related packages. Each one is independently versioned, with its own GitHub repository for organizing issues and contributions.

This readme provides an overview of the core thunder package, its data types, and methods for loading and saving. Tutorials, detailed API documentation, and info about all associated packages can be found at the documentation site.

Install

The core thunder package defines data structures and read/write patterns for images and series data. It is built on numpy, scipy, scikit-learn, and scikit-image, and is compatible with Python 2.7+ and 3.4+. You can install it using:

pip install thunder-python

related packages

Lots of functionality in Thunder, especially for specific types of analyses, is broken out into the following separate packages.

You can install the ones you want with pip, for example

pip install thunder-regression
pip install thunder-registration

example

Here's a short snippet showing how to load an image sequence (in this case random data), median filter it, transform it to a series, detrend and compute a fourier transform on each pixel, then convert it to an array.

import thunder as td

data = td.images.fromrandom()
ts = data.median_filter(3).toseries()
frequencies = ts.detrend().fourier(freq=3).toarray()

usage

Most workflows in Thunder begin by loading data, which can come from a variety of sources and locations, and can be either local or distributed (see below).

The two primary data types are images and series. images are used for collections or sequences of images, and are especially useful when working with movie data. series are used for collections of one-dimensional arrays, often representing time series.

Once loaded, each data type can be manipulated through a variety of statistical operators, including simple statistical aggregations like mean, min, and max, or more complex operations like gaussian_filter, detrend, and subsample. Both images and series objects are wrappers for ndarrays: either a local numpy ndarray or a distributed ndarray using bolt and spark. Calling toarray() on an images or series object at any time returns a local numpy ndarray, which is an easy way to move between Thunder and other Python data analysis tools, like pandas and scikit-learn.
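
As a short sketch building on the example above (random data, local mode):

import thunder as td

data = td.images.fromrandom()        # random image sequence, backed by a local numpy array
smoothed = data.gaussian_filter(2)   # one of the operations mentioned above
local = smoothed.toarray()           # back to a plain numpy ndarray
print(local.shape)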

For a full list of methods on image and series data, see the documentation site.

loading data

Both images and series can be loaded from a variety of data types and locations. For all loading methods, the optional argument engine allows you to specify whether data should be loaded in 'local' mode, which is backed by a numpy array, or in 'spark' mode, which is backed by an RDD.

All loading methods are available on the module for the corresponding data type, for example

import thunder as td

data = td.images.fromtif('/path/to/tifs')
data = td.series.fromarray(somearray)
data_distributed = td.series.fromarray(somearray, engine=sc)

The argument engine can be either None for local use or a SparkContext for distributed use with Spark. And in either case, methods that load from files e.g. fromtif or frombinary can load from either a local filesystem or Amazon S3, with the optional argument credentials for S3 credentials. See the documentation site for a full list of data loading methods.

using with spark

Thunder doesn't require Spark and can run locally without it, but Spark and Thunder work great together! To install and configure a Spark cluster, consult the official Spark documentation. Thunder supports Spark version 1.5+ (currently tested against 2.0.0), and uses the Python API PySpark. If you have Spark installed, you can install Thunder just by calling pip install thunder-python on both the master node and all worker nodes of your cluster. Alternatively, you can clone this GitHub repository, and make sure it is on the PYTHONPATH of both the master and worker nodes.

Once you have a running cluster with a valid SparkContext — this is created automatically as the variable sc if you call the pyspark executable — you can pass it as the engine to any of Thunder's loading methods, and this will load your data in distributed 'spark' mode. In this mode, all operations will be parallelized, and chained operations will be lazily executed.

contributing

Thunder is a community effort! The codebase so far is due to the excellent work of the following individuals:

Andrew Osheroff, Ben Poole, Chris Stock, Davis Bennett, Jascha Swisher, Jason Wittenbach, Jeremy Freeman, Josh Rosen, Kunal Lillaney, Logan Grosenick, Matt Conlen, Michael Broxton, Noah Young, Ognen Duzlevski, Richard Hofer, Owen Kahn, Ted Fujimoto, Tom Sainsbury, Uri Laseron, W J Liddy

If you run into a problem, have a feature request, or want to contribute, submit an issue or a pull request, or come talk to us in the chatroom!

Download Details:

Author: Thunder-project
Source Code: https://github.com/thunder-project/thunder 
License: Apache-2.0 license

#machinelearning #python #analysis #images 


Traces: A Python Library for Unevenly-spaced Time Series Analysis

Traces

A Python library for unevenly-spaced time series analysis.

Why?

Taking measurements at irregular intervals is common, but most tools are primarily designed for evenly-spaced measurements. Also, in the real world, time series have missing observations or you may have multiple series with different frequencies: it can be useful to model these as unevenly-spaced.

Traces was designed by the team at Datascope based on several practical applications in different domains, because it turns out unevenly-spaced data is actually pretty great, particularly for sensor data analysis.

Installation

To install traces, run this command in your terminal:

$ pip install traces

Quickstart: using traces

To see a basic use of traces, let's look at these data from a light switch, also known as Big Data from the Internet of Things.

The main object in traces is a TimeSeries, which you create just like a dictionary, adding the five measurements at 6:00am, 7:45:56am, etc.

>>> time_series = traces.TimeSeries()
>>> time_series[datetime(2042, 2, 1,  6,  0,  0)] = 0 #  6:00:00am
>>> time_series[datetime(2042, 2, 1,  7, 45, 56)] = 1 #  7:45:56am
>>> time_series[datetime(2042, 2, 1,  8, 51, 42)] = 0 #  8:51:42am
>>> time_series[datetime(2042, 2, 1, 12,  3, 56)] = 1 # 12:03:56pm
>>> time_series[datetime(2042, 2, 1, 12,  7, 13)] = 0 # 12:07:13pm

What if you want to know if the light was on at 11am? Unlike a python dictionary, you can look up the value at any time even if it's not one of the measurement times.

>>> time_series[datetime(2042, 2, 1, 11,  0, 0)] # 11:00am
0

The distribution function gives you the fraction of time that the TimeSeries is in each state.

>>> time_series.distribution(
>>>   start=datetime(2042, 2, 1,  6,  0,  0), # 6:00am
>>>   end=datetime(2042, 2, 1,  13,  0,  0)   # 1:00pm
>>> )
Histogram({0: 0.8355952380952381, 1: 0.16440476190476191})

The light was on about 16% of the time between 6am and 1pm.

Adding more data...

Now let's get a little more complicated and look at the sensor readings from forty lights in a house.

How many lights are on throughout the day? The merge function takes the forty individual TimeSeries and efficiently merges them into one TimeSeries where each value is the list of states of all the lights.

>>> trace_list = [... list of forty traces.TimeSeries ...]
>>> count = traces.TimeSeries.merge(trace_list, operation=sum)

We also applied a sum operation to the list of states to get the TimeSeries of the number of lights that are on.

How many lights are on in the building on average during business hours, from 8am to 6pm?

>>> histogram = count.distribution(
>>>   start=datetime(2042, 2, 1,  8,  0,  0),   # 8:00am
>>>   end=datetime(2042, 2, 1,  12 + 6,  0,  0) # 6:00pm
>>> )
>>> histogram.median()
17

The distribution function returns a Histogram that can be used to get summary metrics such as the mean or quantiles.
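For instance, continuing the example above (the mean is mentioned in the text; its exact value depends on the data, so no output is shown here):

>>> histogram.mean()    # average number of lights on from 8am to 6pm
>>> histogram.median()
17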

It's flexible

The measurement points (keys) in a TimeSeries can be in any units as long as they can be ordered. The values can be anything.

For example, you can use a TimeSeries to keep track of the contents of a grocery basket by the number of minutes within a shopping trip.

>>> time_series = traces.TimeSeries()
>>> time_series[1.2] = {'broccoli'}
>>> time_series[1.7] = {'broccoli', 'apple'}
>>> time_series[2.2] = {'apple'}          # puts broccoli back
>>> time_series[3.5] = {'apple', 'beets'} # mmm, beets

To learn more, check the examples and the detailed reference.

More info

Contributing

Contributions are welcome and greatly appreciated! Please visit our guidelines for more info.

Download Details:

Author: Datascopeanalytics
Source Code: https://github.com/datascopeanalytics/traces 
License: MIT license

#machinelearning #python #analysis 

Traces: A Python Library for Unevenly-spaced Time Series Analysis
Royce  Reinger

Royce Reinger

1676504400

Pastas: Analysis of Groundwater Time Series

Pastas: Analysis of Groundwater Time Series

Pastas: what is it?

Pastas is an open source python package for processing, simulating and analyzing groundwater time series. The object oriented structure allows for the quick implementation of new model components. Time series models can be created, calibrated, and analysed with just a few lines of python code with the built-in optimization, visualisation, and statistical analysis tools.
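For illustration, a minimal sketch of that workflow (the file names are placeholders, and the exact API may differ slightly between Pastas versions):

import pandas as pd
import pastas as ps

# observed groundwater heads and a precipitation stress as pandas Series
obs = pd.read_csv("head.csv", index_col=0, parse_dates=True).squeeze()
prec = pd.read_csv("rain.csv", index_col=0, parse_dates=True).squeeze()

ml = ps.Model(obs)                                            # create a time series model
ml.add_stressmodel(ps.StressModel(prec, ps.Gamma(), name="recharge"))
ml.solve()                                                    # calibrate
ml.plot()                                                     # visualise the fit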

Quick installation guide

To install Pastas, a working version of Python 3.8, 3.9 or 3.10 has to be installed on your computer. We recommend using the Anaconda Distribution as it includes most of the python package dependencies and the Jupyter Notebook software to run the notebooks. However, you are free to install any Python distribution you want.

Stable version

To get the latest stable version, use:

pip install pastas

Update

To update pastas, use:

pip install pastas --upgrade

Developers

To get the latest development version, use:

pip install git+https://github.com/pastas/pastas.git@dev#egg=pastas

Related packages

  • Pastastore is a Python package for managing multiple timeseries and pastas models
  • Hydropandas can be used to obtain Dutch timeseries (KNMI, Dinoloket, ..)
  • PyEt can be used to compute potential evaporation from meteorological variables.

Dependencies

Pastas depends on a number of Python packages, all of the necessary ones being installed automatically when using pip. The dependencies required for a minimal installation of Pastas are:

  • numpy>=1.7
  • matplotlib>=3.1
  • pandas>=1.1
  • scipy>=1.8
  • numba>=0.51

To install the most important optional dependencies (the LmFit solver and Latexify for function visualisation) together with Pastas, use:

pip install pastas[full]

or for the development version use:

pip install git+https://github.com/pastas/pastas.git@dev#egg=pastas[full]

How to Cite Pastas?

If you use Pastas in one of your studies, please cite the Pastas article in Groundwater:

To cite a specific version of Pastas, you can use the DOI provided for each official release (>0.9.7) through Zenodo. Click on the link to get the DOI for the Pastas version you used.

  • Collenteur, R., Bakker, M., Caljé, R. & Schaars, F. (XXXX). Pastas: open-source software for time series analysis in hydrology (Version X.X.X). Zenodo. http://doi.org/10.5281/zenodo.1465866

Documentation & Examples

Get in Touch

  • Questions on Pastas can be asked and answered on Github Discussions.
  • Bugs, feature requests and other improvements can be posted as Github Issues.
  • Pull requests will only be accepted on the development branch (dev) of this repository. Please take a look at the developers section on the documentation website for more information on how to contribute to Pastas.

Download Details:

Author: Pastas
Source Code: https://github.com/pastas/pastas 
License: MIT license

#machinelearning #python #analysis 

Pastas: Analysis of Groundwater Time Series
Royce  Reinger

Royce Reinger

1676495520

HCTSA: Highly Comparative Time-series analysis

〰️ hctsa 〰️: highly comparative time-series analysis

hctsa is a Matlab software package for running highly comparative time-series analysis. It extracts thousands of time-series features from a collection of univariate time series and includes a range of tools for visualizing and analyzing the resulting time-series feature matrix, including:

  1. Normalizing and clustering time-series data;
  2. Producing low-dimensional representations of time-series data;
  3. Identifying and interpreting discriminating features between different classes of time series; and
  4. Fitting and evaluating multivariate classification models.

Installation 

For users familiar with git (recommended), please make a fork of the repo and then clone it to your local machine. To update, after setting an upstream remote (git remote add upstream git://github.com/benfulcher/hctsa.git) you can use git pull upstream main.

Users unfamiliar with git can instead download the repository by clicking the green "Code" button then "Download ZIP".

Once downloaded, you can install hctsa by running the install.m script (see docs for details).

Documentation and Wiki 📖

Comprehensive documentation for hctsa, from getting started through to more advanced analyses is on GitBook.

There is also a lot of additional information on the wiki, including:

  • 👉 Information about alternative feature sets (including the much faster catch22), and information about other time-series packages available in R, python, and Julia.
  • 〰️ The accompanying time-series data archive for this project, CompEngine.
  • 💾 Downloadable hctsa feature matrices from time-series datasets with example workflows.
  • 💻 Resources for distributing an hctsa computation on a computing cluster.
  • 📕 A list of publications that have used hctsa to address different research questions.
  • 💁 Frequently asked questions about hctsa and related feature-based time-series analyses.

Acknowledgement 👍

If you use this software, please read and cite these open-access articles:

Feedback, as email, GitHub issues or pull requests, is much appreciated.

For commercial use of hctsa, including licensing and consulting, contact Engine Analytics.

External packages and dependencies

Many features in hctsa rely on external packages and Matlab toolboxes. In the case that some of them are unavailable, hctsa can still be used, but only a reduced set of time-series features will be computed.

hctsa uses the following Matlab Add-On Toolboxes: Statistics and Machine Learning, Signal Processing, Curve Fitting, System Identification, Wavelet, and Econometrics.

The following external time-series analysis code packages are provided with the software (in the Toolboxes directory), and are used by our main feature-extraction algorithms to compute meaningful structural features from time series:

Acknowledgements 👋

Many thanks go to Romesh Abeysuriya for helping with the mySQL database set-up and install scripts, and Santi Villalba for lots of helpful feedback and advice on the software.


Feel free to email me for advice on applications of hctsa 🤓


Download Details:

Author: Benfulcher
Source Code: https://github.com/benfulcher/hctsa 
License: View license

#machinelearning #python #matlab #analysis 

HCTSA: Highly Comparative Time-series analysis

Guide to Python Social Media analysis

Strategic Listening: A Guide to Python Social Media Analysis

Listening is everything—especially when it comes to effective marketing and product design. Gain key market insights from social media data using sentiment analysis and topic modeling in Python.

With a global penetration rate of 58.4%, social media provides a wealth of opinions, ideas, and discussions shared daily. This data offers rich insights into the most important and popular conversation topics among users.

In marketing, social media analysis can help companies understand and leverage consumer behavior. Two common, practical methods are:

  • Topic modeling, which answers the question, “What conversation topics do users speak about?”
  • Sentiment analysis, which answers the question, “How positively or negatively are users speaking about a topic?”

In this article, we use Python for social media data analysis and demonstrate how to gather vital market information, extract actionable feedback, and identify the product features that matter most to clients.

Social Media Analysis Case Study: Smartwatches on Reddit

To prove the utility of social media analysis, let’s perform a product analysis of various smartwatches using Reddit data and Python. Python is a strong choice for data science projects, and it offers many libraries that facilitate the implementation of the machine learning (ML) and natural language processing (NLP) models that we will use.

This analysis uses Reddit data (as opposed to data from Twitter, Facebook, or Instagram) because Reddit is the second most trusted social media platform for news and information, according to the American Press Institute. In addition, Reddit's subforum organization produces “subreddits” where users recommend and criticize specific products; its structure is ideal for product-centered data analysis.

First we use sentiment analysis to compare user opinions on popular smartwatch brands to discover which products are viewed most positively. Then, we use topic modeling to narrow in on specific smartwatch attributes that users frequently discuss. Though our example is specific, you can apply the same analysis to any other product or service.

Preparing Sample Reddit Data

The data set for this example contains the title of the post, the text of the post, and the text of all comments for the most recent 100 posts made in the r/smartwatch subreddit. Our dataset contains the most recent 100 complete discussions of the product, including users' experiences, recommendations about products, and their pros and cons.

To collect this information from Reddit, we will use PRAW, the Python Reddit API Wrapper. First, create a client ID and secret token on Reddit using the OAuth2 guide. Next, follow the official PRAW tutorials on downloading post comments and getting post URLs.
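As a rough sketch of that collection step (credentials are placeholders, and the exact layout of the article's sample data set may differ; the official PRAW tutorials linked above show the canonical approach):

import praw
import pandas as pd

reddit = praw.Reddit(client_id="YOUR_CLIENT_ID",
                     client_secret="YOUR_CLIENT_SECRET",
                     user_agent="smartwatch-analysis")

rows = []
for submission in reddit.subreddit("smartwatch").new(limit=100):
    submission.comments.replace_more(limit=0)        # flatten "load more comments" placeholders
    comment_text = " ".join(c.body for c in submission.comments.list())
    rows.append({"Title": submission.title,
                 "Text": submission.selftext,
                 "Comment_text": comment_text})

data = pd.DataFrame(rows)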

Sentiment Analysis: Identifying Leading Products

To identify leading products, we can examine the positive and negative comments users make about certain brands by applying sentiment analysis to our text corpus. Sentiment analysis models are NLP tools that categorize texts as positive or negative based on their words and phrases. There is a wide variety of possible models, ranging from simple counters of positive and negative words to deep neural networks.

We will use VADER for our example, because it is designed to optimize results for short texts from social networks by using lexicons and rule-based algorithms. In other words, VADER performs well on data sets like the one we are analyzing.

Use the Python ML notebook of your choice (for example, Jupyter) to analyze this data set. We install VADER using pip:

pip install vaderSentiment 

First, we add three new columns to our data set: the compound sentiment values for the post title, post text, and comment text. To do this, iterate over each text and apply VADER's polarity_scores method, which takes a string as input and returns a dictionary with four scores: positivity, negativity, neutrality, and compound.

For our purposes, we’ll use only the compound score—the overall sentiment based on the first three scores, rated on a normalized scale from -1 to 1 inclusive, where -1 is the most negative and 1 is the most positive—in order to characterize the sentiment of a text with a single numerical value:

# Import VADER and pandas
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer 
import pandas as pd

analyzer = SentimentIntensityAnalyzer()

# Load data
data = pd.read_json("./sample_data/data.json", lines=True)

# Initialize lists to store sentiment values 
title_compound = []
text_compound = []
comment_text_compound = []

for title,text,comment_text in zip(data.Title, data.Text, data.Comment_text):
    title_compound.append(analyzer.polarity_scores(title)["compound"])
    
    text_compound.append(analyzer.polarity_scores(text)["compound"])

    comment_text_compound.append(analyzer.polarity_scores(comment_text)["compound"])

# Add the new columns with the sentiment    
data["title_compound"] = title_compound
data["text_compound"] = text_compound
data["comment_text_compound"] = comment_text_compound 

Next, we want to catalog the texts by product and brand; this allows us to determine the sentiment scores associated with specific smartwatches. To do this, we designate a list of product lines we want to analyze, then we verify which products are mentioned in each text:

list_of_products = ["samsung", "apple", "xiaomi", "huawei", "amazfit", "oneplus"]

for column in ["Title","Text","Comment_text"]:
    for product in list_of_products:
        l = []
        for text in data[column]:
            l.append(product in text.lower())
        data["{}_{}".format(column,product)] = l

Certain texts may mention multiple products (for example, a single comment might compare two smartwatches). We can proceed in one of two ways:

  • We can discard those texts.
  • We can split those texts using NLP techniques. (In this case, we would assign a part of the text to each product.)

For the sake of code clarity and simplicity, our analysis discards those texts.
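The discarding step itself is not shown in the article; one possible (hypothetical) implementation simply clears the product flags for texts that mention more than one product, reusing the boolean columns created above:

for column in ["Title", "Text", "Comment_text"]:
    flag_columns = ["{}_{}".format(column, product) for product in list_of_products]
    multiple_products = data[flag_columns].sum(axis=1) > 1   # rows mentioning two or more products
    data.loc[multiple_products, flag_columns] = False        # exclude them from every product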

Sentiment Analysis Results

Now we are able to examine our data and determine the average sentiment associated with various smartwatch brands, as expressed by users:

for product in list_of_products:
    mean = pd.concat([data[data["Title_{}".format(product)]].title_compound,
                      data[data["Text_{}".format(product)]].text_compound,
                      data[data["Comment_text_{}".format(product)]].comment_text_compound]).mean()
    print("{}: {}".format(product, mean))

We observe the following results:

Smartwatch                      | Samsung | Apple  | Xiaomi | Huawei | Amazfit | OnePlus
Sentiment Compound Score (Avg.) | 0.4939  | 0.5349 | 0.6462 | 0.4304 | 0.3978  | 0.8413

Our analysis reveals valuable market information. For example, users in our data set express a more positive sentiment about the OnePlus smartwatch than about the other smartwatches.

Beyond considering average sentiment, businesses should also consider the factors affecting these scores: What do users love or hate about each brand? We can use topic modeling to dive deeper into our existing analysis and produce actionable feedback on products and services.

Topic Modeling: Finding Important Product Attributes

Topic modeling is the branch of NLP that uses ML models to mathematically describe what a text is about. We will limit the scope of our discussion to classical NLP topic modeling approaches, though there are recent advances taking place using transformers, such as BERTopic.

There are many topic modeling algorithms, including non-negative matrix factorization (NMF), sparse principal components analysis (sparse PCA), and latent Dirichlet allocation (LDA). These ML models take a matrix as input and then reduce the dimensionality of the data. The input matrix is structured such that:

  • Each column represents a word.
  • Each row represents a text.
  • Each cell represents the frequency of each word in each text.

These are all unsupervised models that can be used for topic decomposition. The NMF model is commonly used for social media analysis, and is the one we will use for our example, because it allows us to obtain easily interpretable results. It produces an output matrix such that:

  • Each column represents a topic.
  • Each row represents a text.
  • Each cell represents the degree to which a text discusses a specific topic.

Our workflow follows this process:

 

[Figure: The Topic Modeling Process. Start topic modeling analysis → identify and import dependencies → create a corpus of texts → apply the NMF model → analyze the results (both a general analysis and a detailed, sentiment-based analysis) → integrate the results into marketing.]

 

First, we'll apply our NMF model to analyze general topics of interest, and then we'll narrow in on positive and negative topics.

Analyzing General Topics of Interest

We'll look at topics for the OnePlus smartwatch, since it had the highest compound sentiment score. Let's import the required packages providing NMF functionality and common stop words to filter from our text:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.decomposition import NMF

import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

Now, let's create a list with the corpus of texts we will use. We use the scikit-learn ML library's CountVectorizer and TfidfTransformer functions to generate our input matrix:

product = "oneplus"
corpus = pd.concat([data[data["Title_{}".format(product)]].Title,
                      data[data["Text_{}".format(product)]].Text,
                      data[data["Comment_text_{}".format(product)]].Comment_text]).tolist()

count_vect = CountVectorizer(stop_words=stopwords.words('english'), lowercase=True)
x_counts = count_vect.fit_transform(corpus)

feature_names = count_vect.get_feature_names_out()
tfidf_transformer = TfidfTransformer()
x_tfidf = tfidf_transformer.fit_transform(x_counts)

(Note that details about handling n-grams—i.e., alternative spellings and usage such as "one plus"—can be found in my previous article on topic modeling.)

We are ready to apply the NMF model and find the latent topics in our data. Like other dimensionality reduction methods, NMF needs the total number of topics to be set as a parameter (dimension). Here, we choose a 10-topic dimensionality reduction for simplicity, but you could test different values to see what number of topics yields the best unsupervised learning result, for example by maximizing a metric such as the silhouette coefficient or by using the elbow method. We also set a random state for reproducibility:

import numpy as np

dimension = 10
nmf = NMF(n_components = dimension, random_state = 42)
nmf_array = nmf.fit_transform(x_tfidf)

components = [nmf.components_[i] for i in range(len(nmf.components_))]
features = count_vect.get_feature_names_out()
important_words = [sorted(features, key = lambda x: components[j][np.where(features==x)], reverse = True) for j in range(len(components))]

important_words contains lists of words, where each list represents one topic and the words are ordered within a topic by importance. It includes a combination of meaningful and “garbage” topics; this is a common result in topic modeling because it is difficult for the algorithm to successfully cluster all texts into just a few topics.

Examining the important_words output, we notice meaningful topics around words like “budget” or “charge”, which point to features that matter to users when discussing OnePlus smartwatches:

['charge', 'battery', 'watch', 'best', 'range', 'days', 'life', 'android', 'bet', 'connectivity']
['budget', 'price', 'euros', 'buying', 'purchase', 'quality', 'tag', 'worth', 'smartwatch', '100']

Since our sentiment analysis produced a high compound score for OnePlus, we might assume that this means it has a lower cost or better battery life compared to other brands. However, at this point, we don't know whether users view these factors positively or negatively, so let's conduct an in-depth analysis to get tangible answers.

Analyzing Positive and Negative Topics

Our more detailed analysis uses the same concepts as our general analysis, applied separately to positive and negative texts. We will uncover which factors users point to when speaking positively—or negatively—about a product.

Let’s do this for the Samsung smartwatch. We will use the same pipeline but with a different corpus:

  • We create a list of positive texts that have a compound score greater than 0.8.
  • We create a list of negative texts that have a compound score less than 0.

These thresholds were chosen to select the top 20% of positive texts (scores > 0.8) and the top 20% of negative texts (scores < 0), and they produce the strongest results for our smartwatch sentiment analysis:

# First the negative texts.
product = "samsung"
corpus_negative = pd.concat([data[(data["Title_{}".format(product)]) & (data.title_compound < 0)].Title,
                      data[(data["Text_{}".format(product)]) & (data.text_compound < 0)].Text,
                      data[(data["Comment_text_{}".format(product)]) & (data.comment_text_compound < 0)].Comment_text]).tolist()


# Now the positive texts.
corpus_positive = pd.concat([data[(data["Title_{}".format(product)]) & (data.title_compound > 0.8)].Title,
                      data[(data["Text_{}".format(product)]) & (data.text_compound > 0.8)].Text,
                      data[(data["Comment_text_{}".format(product)]) & (data.comment_text_compound > 0.8)].Comment_text]).tolist()

print(corpus_negative)
print(corpus_positive)

We can repeat the same method of topic modeling that we used for general topics of interest to reveal the positive and negative topics. Our results now provide much more specific marketing information: For example, our model's negative corpus output includes a topic about the accuracy of burned calories, while the positive output is about navigation/GPS and health indicators like pulse rate and blood oxygen levels. Finally, we have actionable feedback on aspects of the smartwatch that the users love and areas where the product has room for improvement.

 

[Figure: Word cloud of a Samsung positive topic, created with the wordcloud library. Largest words, in decreasing size: health, pulse, screen, sensor, fitness, exercise, miles, feature, heart, active.]

 

To amplify your data findings, I'd recommend creating a word cloud or another similar visualization of the important topics identified in our tutorial.
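A hedged sketch of how such a word cloud could be generated with the wordcloud and matplotlib libraries, using the top words of one topic (the topic index and word count here are arbitrary choices, not from the original article):

from wordcloud import WordCloud
import matplotlib.pyplot as plt

topic_words = " ".join(important_words[0][:30])     # top words of one topic
cloud = WordCloud(background_color="white").generate(topic_words)

plt.imshow(cloud, interpolation="bilinear")
plt.axis("off")
plt.show()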

Limitless Product Insights With Social Media Analysis

Through our analysis, we understand what users think of a target product and those of its competitors, what users love about top brands, and what may be improved for better product design. Public social media data analysis allows you to make informed decisions regarding business priorities and enhance overall user satisfaction. Incorporate social media analysis into your next product cycle for improved marketing campaigns and product design—because listening is everything.

Original article source at: https://www.toptal.com

#python #analysis 

Guide to Python Social Media analysis

Introduction to Multivariate Regression analysis

Introduction to Multivariate Regression

In today’s world, data is everywhere. Data itself is just facts and figures, and it needs to be explored to yield meaningful information. Hence, data analysis is important. Data analysis is the process of applying statistical and logical techniques to describe, visualize, reduce, summarize, and assess data, turning it into useful information that provides better context.

Data analysis plays a significant role in finding meaningful information that helps businesses make better decisions based on the output.

Along with data analysis, data science also comes into the picture. Data science is a field that combines scientific methods, processes, algorithms, and tools to extract insights from structured and unstructured data, particularly from very large datasets. A range of terms related to data mining, cleaning, analyzing, and interpreting data are often used interchangeably in data science.

Let us look at one of the important models of data science.

 

Regression analysis

Regression analysis is one of the most sought-after methods used in data analysis. It is a supervised machine learning technique. Regression analysis is an important statistical method that allows us to examine the relationship between two or more variables in a dataset.

Regression analysis is a way of mathematically determining which variables have an impact. It answers the questions: Which variables are important? Which can be ignored? How do they interact with each other? And, most importantly, how certain are we about these variables?

We have a dependent variable — the main factor that we are trying to understand or predict. And then we have independent variables — the factors we believe have an impact on the dependent variable.

Simple linear regression is a regression model that estimates the relationship between a dependent variable and an independent variable using a straight line.

On the other hand, Multiple linear regression estimates the relationship between two or more independent variables and one dependent variable. The difference between these two models is the number of independent variables.

Sometimes the above-mentioned regression models will not work. Here’s why.

As noted, regression analysis is mainly used to understand the relationship between a dependent and an independent variable. In the real world, there are plenty of situations where many independent variables are influenced by other variables, so we have to look beyond a single regression model that can only work with one independent variable.

Given these limitations, we want a better model that addresses the disadvantages of simple and multiple linear regression, and that model is multivariate regression. If you are a beginner in the field and wish to learn more such concepts to start your career in machine learning, you can head over to Great Learning Academy and take up the Basics of Machine Learning and Linear Regression courses, which cover the basic concepts required to kick-start your machine learning journey.

Looking to improve your skills in regression analysis? This regression analysis using excel course will teach you all the techniques you need to know to get the most out of your data. You’ll learn how to build models, interpret results, and use regression analysis to make better decisions for your business. Enroll today and get started on your path to becoming a data-driven decision maker!

What is Multivariate Regression?

Multivariate Regression is a supervised machine learning algorithm involving multiple data variables for analysis. Multivariate regression is an extension of multiple regression with one dependent variable and multiple independent variables. Based on the number of independent variables, we try to predict the output.

Multivariate regression tries to find out a formula that can explain how factors in variables respond simultaneously to changes in others.

There are numerous areas where multivariate regression can be used. Let’s look at some examples to understand multivariate regression better.

  1. Praneeta wants to estimate the price of a house. She will collect details such as the location of the house, the number of bedrooms, the size in square feet, and whether amenities are available. Based on these details, the price of the house can be predicted, along with how the variables are interrelated.
  2. An agricultural scientist wants to predict the total crop yield expected for the summer. He collects details on the expected amount of rainfall, the fertilizers to be used, and soil conditions. By building a multivariate regression model, the scientist can predict the crop yield and also understand the relationships among the variables.
  3. If an organization wants to know how much to pay a new hire, it will take into account details such as education level, years of experience, job location, and whether the candidate has a niche skill. Based on this information, the salary of an employee can be predicted, along with how these variables help in estimating it.
  4. Economists can use Multivariate regression to predict the GDP growth of a state or a country based on parameters like total amount spent by consumers, import expenditure, total gains from exports, total savings, etc.
  5. A company wants to predict the electricity bill of an apartment, the details needed here are the number of flats, the number of appliances in usage, the number of people at home, etc. With the help of these variables, the electricity bill can be predicted.

The above examples use multivariate regression, where we have many independent variables and a single dependent variable.

Mathematical equation

The simple linear regression model represents a straight line, meaning y is a function of x. When we have an extra dimension (z), the straight line becomes a plane.

Here, the plane is the function that expresses y as a function of x and z. The linear regression equation can now be expressed as:

y = m1.x + m2.z + c

y is the dependent variable, that is, the variable that needs to be predicted.
x is the first independent variable. It is the first input.

m1 is the slope of x. It lets us know the angle of the line along x.
z is the second independent variable. It is the second input.
m2 is the slope of z. It helps us to know the angle of the line (z).
c is the intercept. A constant that finds the value of y when x and z are 0.

The equation for a model with two input variables can be written as:

y = β0 + β1.x1 + β2.x2

What if there are three input variables? Humans can only visualize up to three dimensions, but in the machine learning world there can be any number of dimensions. The equation for a model with three input variables can be written as:

y = β0 + β1.x1 + β2.x2 + β3.x3

Below is the generalized equation for the multivariate regression model:

y = β0 + β1.x1 + β2.x2 + … + βn.xn

Where n represents the number of independent variables, β0 to βn represent the coefficients, and x1 to xn are the independent variables.
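As a quick illustration of fitting this generalized equation, here is a hedged sketch using scikit-learn on synthetic data (the data, coefficients, and library choice are purely illustrative and not part of the original article):

# Fit y = β0 + β1.x1 + β2.x2 + β3.x3 on synthetic data
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                                  # three independent variables
y = 2.0 + X @ np.array([1.5, -0.7, 0.3]) + rng.normal(scale=0.1, size=100)

model = LinearRegression().fit(X, y)
print(model.intercept_)   # estimate of β0 (≈ 2.0)
print(model.coef_)        # estimates of β1, β2, β3 (≈ 1.5, -0.7, 0.3)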

The multivariate model helps us understand and compare coefficients across the outputs. A smaller cost function value indicates that the multivariate linear regression model fits the data better.


What is Cost Function?

The cost function assigns a penalty whenever the model’s predictions differ from the observed data. Here, the cost is the sum of the squared differences between the predicted and actual values, divided by twice the length of the dataset. A smaller mean squared error implies better performance.

Cost of Multiple Linear regression:
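Written out in the notation of this article (a standard mean-squared-error form consistent with the description above):

Cost = (1 / 2m) * Σ (y_predicted − y_actual)², summed over all m samples in the dataset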

Steps of Multivariate Regression analysis

The steps involved in multivariate regression analysis are feature selection and feature engineering, normalizing the features, selecting the loss function and hypothesis, setting hypothesis parameters, minimizing the loss function, testing the hypothesis, and generating the regression model.

  • Feature selection-
    The selection of features is an important step in multivariate regression. Feature selection, also known as variable selection, means picking the variables that are significant for better model building.
     
  • Normalizing Features-
    We need to scale the features because scaling maintains the general distribution and ratios of the data, which leads to a more efficient analysis. The values of the features may be transformed in the process.
     
  • Select Loss function and Hypothesis-
    The loss function measures the error, that is, how far the hypothesis prediction deviates from the actual values. Here, the hypothesis is the value predicted from the features/variables.
     
  • Set Hypothesis Parameters-
    The hypothesis parameter needs to be set in such a way that it reduces the loss function and predicts well.
     
  • Minimize the Loss Function-
    The loss function needs to be minimized by applying a loss-minimization algorithm to the dataset, which helps in adjusting the hypothesis parameters. Once the loss is minimized, the model can be used for prediction. Gradient descent is one of the algorithms commonly used for loss minimization (see the sketch after this list).
     
  • Test the hypothesis function-
    The hypothesis function needs to be checked as well, since it is what predicts the values. Once this is done, it has to be tested on test data.
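As mentioned in the loss-minimization step above, gradient descent is a common choice. A minimal, self-contained sketch on synthetic data (illustrative only, not from the original article):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                       # three independent variables
y = 2.0 + X @ np.array([1.5, -0.7, 0.3]) + rng.normal(scale=0.1, size=100)

X_b = np.c_[np.ones(len(X)), X]                     # prepend a column of ones for the intercept β0
beta = np.zeros(X_b.shape[1])                       # hypothesis parameters, initialised to zero
learning_rate, m = 0.1, len(y)

for _ in range(1000):
    gradient = X_b.T @ (X_b @ beta - y) / m         # gradient of the (1/2m) * Σ(error²) cost
    beta -= learning_rate * gradient                # adjust the hypothesis parameters

print(beta)                                         # ≈ [2.0, 1.5, -0.7, 0.3]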

Advantages of Multivariate Regression

The most important advantage of multivariate regression is that it helps us understand the relationships among the variables present in the dataset. This, in turn, helps in understanding the correlation between the dependent and independent variables. Multivariate linear regression is a widely used machine learning algorithm.

Disadvantages of Multivariate Regression

  • Multivariate techniques are somewhat complex and require a high level of mathematical calculation.
  • The multivariate regression model’s output is sometimes not easy to interpret, because its loss and error outputs are not identical.
  • This model does not have much scope for smaller datasets. Hence, the same cannot be applied to them. The results are better for larger datasets.

Conclusion

Multivariate regression comes into the picture when we have more than one independent variable and simple linear regression does not work. Real-world data involves multiple variables or features, and when these are present, we require multivariate regression for better analysis.


Original article source at: https://www.mygreatlearning.com

#analysis 

Introduction to Multivariate Regression analysis

LogParser.jl: Julia Package for Parsing Server Log Files

LogParser

LogParser.jl is a package for parsing server logs. Currently, only server logs in the Apache Combined format are supported (although Apache Common may parse as well). Additional log formats may be added in the future.

LogParser.jl will attempt to handle the log format even if it is mangled, returning partial matches where possible. For example, if the end of a log entry is mangled, you may still get back the IP address, timestamp, and other fields that could be parsed.

Code examples

The API for this package is straightforward:

using LogParser

logarray = [...] #Any AbstractArray of Strings

#Parse file
parsed_vals = parseapachecombined.(logarray)

#Convert to DataFrame if desired
parsed_df = DataFrame(parsed_vals)


Download Details:

Author: randyzwitch
Source Code: https://github.com/randyzwitch/LogParser.jl 
License: View license

#julia #log #analysis 

LogParser.jl: Julia Package for Parsing Server Log Files

CallGraphs.jl: Analysis Of Source Callgraphs for Julia

CallGraphs

A package for analyzing source-code callgraphs, particularly of Julia's src/ directory. The main motivation for this package was to aid in finding all functions that might trigger garbage collection by directly or indirectly calling jl_gc_collect; however, the package has broader uses.

Installation

Add with

Pkg.clone("https://github.com/timholy/CallGraphs.jl.git")

You'll also need to have clang++ installed, as well as the corresponding opt tool. On the author's machine, opt is called opt-3.4.

Analyzing a source repository

Extracting the callgraph

An example script is callgraph_jlsrc.bash, which is set to analyze julia's src directory. It should be called from within that directory. You may need to change the OPT variable to match your system. This script can be modified to analyze other code repositories.

This writes a series of *.ll and *.dot files. These *.dot files are then analyzed by the julia code in this repository.

Analyzing the callgraph

The most general approach is

using CallGraphs
cgs = parsedots()   # or supply the dirname
calls, calledby = combine(cgs...)

This will merge data from all the *.dot files in the directory into a single callgraph. parsedots and combine are both described in online help.

Garbage-collection analysis

If your main interest is analyzing the callgraph of julia's garbage collection, you will likely be more interested in

using CallGraphs
gcnames = findgc()
highlight(srcfilename, gcnames)

which produces output that looks like this:

[Image: example of the source highlighting produced by highlight]

Shown in red are all functions that might trigger a call to jl_gc_collect. The general principle is to look for cases where one line's allocation is not protected from a later garbage-collection.

You can save a (crude) emacs highlighting file with

emacs_highlighting(filename, gcnames)

which you can load with M-x load-file after opening a C file.

Download Details:

Author: Timholy
Source Code: https://github.com/timholy/CallGraphs.jl 
License: View license

#julia #analysis #source

CallGraphs.jl: Analysis Of Source Callgraphs for Julia