Morpheus Core | A Library Of The Morpheus Data Science Framework

Introduction

The Morpheus library is designed to facilitate the development of high performance analytical software involving large datasets for both offline and real-time analysis on the Java Virtual Machine (JVM). The library is written in Java 8 with extensive use of lambdas, but is accessible to all JVM languages.

For detailed documentation with examples, see here

Motivation

At its core, Morpheus provides a versatile two-dimensional memory efficient tabular data structure called a DataFrame, similar to that first popularised in R. While dynamically typed scientific computing languages like R, Python & Matlab are great for doing research, they are not well suited for large scale production systems as they become extremely difficult to maintain, and dangerous to refactor. The Morpheus library attempts to retain the power and versatility of the DataFrame concept, while providing a much more type safe and self describing set of interfaces, which should make developing, maintaining & scaling code complexity much easier.

Another advantage of the Morpheus library is that it is extremely good at scaling on multi-core processor architectures given the powerful threading capabilities of the Java Virtual Machine. Many operations on a Morpheus DataFrame can seamlessly be run in parallel by simply calling parallel() on the entity you wish to operate on, much like with Java 8 Streams. Internally, these parallel implementations are based on the Fork & Join framework, and near linear improvements in performance are observed for certain types of operations as CPU cores are added.

Capabilities

A Morpheus DataFrame is a column store structure where each column is represented by a Morpheus Array of which there are many implementations, including dense, sparse and memory mapped versions. Morpheus arrays are optimized and wherever possible are backed by primitive native Java arrays (even for types such as LocalDate, LocalDateTime etc...) as these are far more efficient from a storage, access and garbage collection perspective. Memory mapped Morpheus Arrays, while still experimental, allow very large DataFrames to be created using off-heap storage that are backed by files.

While the complete feature set of the Morpheus DataFrame is still evolving, there are already many powerful APIs to affect complex transformations and analytical operations with ease. There are standard functions to compute summary statistics, perform various types of Linear Regressions, apply Principal Component Analysis (PCA) to mention just a few. The DataFrame is indexed in both the row and column dimension, allowing data to be efficiently sorted, sliced, grouped, and aggregated along either axis.

Data Access

Morpheus also aims to provide a standard mechanism to load datasets from various data providers. The hope is that this API will be embraced by the community in order to grow the catalogue of supported data sources. Currently, providers are implemented to enable data to be loaded from Quandl, The Federal Reserve, The World Bank, Yahoo Finance and Google Finance.

Morpheus at a Glance

A Simple Example

Consider a dataset of motor vehicle characteristics accessible here. The code below loads this CSV data into a Morpheus DataFrame, filters the rows to only include those vehicles that have a power to weight ratio > 0.1 (where weight is converted into kilograms), then adds a column to record the relative efficiency between highway and city mileage (MPG), sorts the rows by this newly added column in descending order, and finally records this transformed result to a CSV file.

DataFrame.read().csv(options -> {
    options.setResource("http://zavtech.com/data/samples/cars93.csv");
    options.setExcludeColumnIndexes(0);
}).rows().select(row -> {
    double weightKG = row.getDouble("Weight") * 0.453592d;
    double horsepower = row.getDouble("Horsepower");
    return horsepower / weightKG > 0.1d;
}).cols().add("MPG(Highway/City)", Double.class, v -> {
    double cityMpg = v.row().getDouble("MPG.city");
    double highwayMpg = v.row().getDouble("MPG.highway");
    return highwayMpg / cityMpg;
}).rows().sort(false, "MPG(Highway/City)").write().csv(options -> {
    options.setFile("/Users/witdxav/cars93m.csv");
    options.setTitle("DataFrame");
});

This example demonstrates the functional nature of the Morpheus API, where many method return types are in fact a DataFrame and therefore allow this form of method chaining. In this example, the methods csv(), select(), add(), and sort() all return a frame. In some cases the same frame that the method operates on, or in other cases a filter or shallow copy of the frame being operated on. The first 10 rows of the transformed dataset in this example looks as follows, with the newly added column appearing on the far right of the frame.

A Regression Example

The Morpheus API includes a regression interface in order to fit data to a linear model using either OLS, WLS or GLS. The code below uses the same car dataset introduced in the previous example, and regresses Horsepower on EngineSize. The code example prints the model results to standard out, which is shown below, and then creates a scatter chart with the regression line clearly displayed.

//Load the data
DataFrame<Integer,String> data = DataFrame.read().csv(options -> {
    options.setResource("http://zavtech.com/data/samples/cars93.csv");
    options.setExcludeColumnIndexes(0);
});

//Run OLS regression and plot 
String regressand = "Horsepower";
String regressor = "EngineSize";
data.regress().ols(regressand, regressor, true, model -> {
    System.out.println(model);
    DataFrame<Integer,String> xy = data.cols().select(regressand, regressor);
    Chart.create().withScatterPlot(xy, false, regressor, chart -> {
        chart.title().withText(regressand + " regressed on " + regressor);
        chart.subtitle().withText("Single Variable Linear Regression");
        chart.plot().style(regressand).withColor(Color.RED).withPointsVisible(true);
        chart.plot().trend(regressand).withColor(Color.BLACK);
        chart.plot().axes().domain().label().withText(regressor);
        chart.plot().axes().domain().format().withPattern("0.00;-0.00");
        chart.plot().axes().range(0).label().withText(regressand);
        chart.plot().axes().range(0).format().withPattern("0;-0");
        chart.show();
    });
    return Optional.empty();
});

==============================================================================================                                   Linear Regression Results                                                             ============================================================================================== Model:                                   OLS    R-Squared:                            0.5360 Observations:                             93    R-Squared(adjusted):                  0.5309 DF Model:                                  1    F-Statistic:                        105.1204 DF Residuals:                             91    F-Statistic(Prob):                  1.11E-16 Standard Error:                      35.8717    Runtime(millis)                           52 Durbin-Watson:                        1.9591                                                 ==============================================================================================   Index     |  PARAMETER  |  STD_ERROR  |  T_STAT   |   P_VALUE   |  CI_LOWER  |  CI_UPPER  | ----------------------------------------------------------------------------------------------  Intercept  |    45.2195  |    10.3119  |   4.3852  |   3.107E-5  |    24.736  |   65.7029  | EngineSize  |    36.9633  |     3.6052  |  10.2528  |  7.573E-17  |    29.802  |   44.1245  | ==============================================================================================

UK House Price Trends

It is possible to access all UK residential real-estate transaction records from 1995 through to current day via the UK Government Open Data initiative. The data is presented in CSV format, and contains numerous columns, including such information as the transaction date, price paid, fully qualified address (including postal code), property type, lease type and so on.

Let us begin by writing a function to load these CSV files from Amazon S3 buckets, and since they are stored one file per year, we provide a parameterized function accordingly. Given the requirements of our analysis, there is no need to load all the columns in the file, so below we only choose to read columns at index 1, 2, 4, and 11. In addition, since the files do not include a header, we re-name columns to something more meaningful to make subsequent access a little clearer.

/**
 * Loads UK house price from the Land Registry stored in an Amazon S3 bucket
 * Note the data does not have a header, so columns will be named Column-0, Column-1 etc...
 * @param year      the year for which to load prices
 * @return          the resulting DataFrame, with some columns renamed
 */
private DataFrame<Integer,String> loadHousePrices(Year year) {
    String resource = "http://prod.publicdata.landregistry.gov.uk.s3-website-eu-west-1.amazonaws.com/pp-%s.csv";
    return DataFrame.read().csv(options -> {
        options.setResource(String.format(resource, year.getValue()));
        options.setHeader(false);
        options.setCharset(StandardCharsets.UTF_8);
        options.setIncludeColumnIndexes(1, 2, 4, 11);
        options.getFormats().setParser("TransactDate", Parser.ofLocalDate("yyyy-MM-dd HH:mm"));
        options.setColumnNameMapping((colName, colOrdinal) -> {
            switch (colOrdinal) {
                case 0:     return "PricePaid";
                case 1:     return "TransactDate";
                case 2:     return "PropertyType";
                case 3:     return "City";
                default:    return colName;
            }
        });
    });
}

Below we use this data in order to compute the median nominal price (not inflation adjusted) of an apartment for each year between 1995 through 2014 for a subset of the largest cities in the UK. There are about 20 million records in the unfiltered dataset between 1993 and 2014, and while it takes a fairly long time to load and parse (approximately 3.5GB of data), Morpheus executes the analytical portion of the code in about 5 seconds (not including load time) on a standard Apple Macbook Pro purchased in late 2013. Note how we use parallel processing to load and process the data by calling results.rows().keys().parallel().

//Create a data frame to capture the median prices of Apartments in the UK'a largest cities
DataFrame<Year,String> results = DataFrame.ofDoubles(
    Range.of(1995, 2015).map(Year::of),
    Array.of("LONDON", "BIRMINGHAM", "SHEFFIELD", "LEEDS", "LIVERPOOL", "MANCHESTER")
);

//Process yearly data in parallel to leverage all CPU cores
results.rows().keys().parallel().forEach(year -> {
    System.out.printf("Loading UK house prices for %s...\n", year);
    DataFrame<Integer,String> prices = loadHousePrices(year);
    prices.rows().select(row -> {
        //Filter rows to include only apartments in the relevant cities
        final String propType = row.getValue("PropertyType");
        final String city = row.getValue("City");
        final String cityUpperCase = city != null ? city.toUpperCase() : null;
        return propType != null && propType.equals("F") && results.cols().contains(cityUpperCase);
    }).rows().groupBy("City").forEach(0, (groupKey, group) -> {
        //Group row filtered frame so we can compute median prices in selected cities
        final String city = groupKey.item(0);
        final double priceStat = group.colAt("PricePaid").stats().median();
        results.data().setDouble(year, city, priceStat);
    });
});

//Map row keys to LocalDates, and map values to be percentage changes from start date
final DataFrame<LocalDate,String> plotFrame = results.mapToDoubles(v -> {
    final double firstValue = v.col().getDouble(0);
    final double currentValue = v.getDouble();
    return (currentValue / firstValue - 1d) * 100d;
}).rows().mapKeys(row -> {
    final Year year = row.key();
    return LocalDate.of(year.getValue(), 12, 31);
});

//Create a plot, and display it
Chart.create().withLinePlot(plotFrame, chart -> {
    chart.title().withText("Median Nominal House Price Changes");
    chart.title().withFont(new Font("Arial", Font.BOLD, 14));
    chart.subtitle().withText("Date Range: 1995 - 2014");
    chart.plot().axes().domain().label().withText("Year");
    chart.plot().axes().range(0).label().withText("Percent Change from 1995");
    chart.plot().axes().range(0).format().withPattern("0.##'%';-0.##'%'");
    chart.plot().style("LONDON").withColor(Color.BLACK);
    chart.legend().on().bottom();
    chart.show();
});

The percent change in nominal median prices for apartments in the subset of chosen cities is shown in the plot below. It shows that London did not suffer any nominal house price decline as a result of the Global Financial Crisis (GFC), however not all cities in the UK proved as resilient. What is slightly surprising is that some of the less affluent northern cities saw a higher rate of appreciation in the 2003 to 2006 period compared to London. One thing to note is that while London did not see any nominal price reduction, there was certainly a fairly severe correction in terms of EUR and USD since Pound Sterling depreciated heavily against these currencies during the GFC.

Visualization

Visualizing data in Morpheus DataFrames is made easy via a simple chart abstraction API with adapters supporting both JFreeChart as well as Google Charts (with others to follow by popular demand). This design makes it possible to generate interactive Java Swing charts as well as HTML5 browser based charts via the same programmatic interface. For more details on how to use this API, see the section on visualization here, and the code here.

Maven Artifacts

Morpheus is published to Maven Central so it can be easily added as a dependency in your build tool of choice. The codebase is currently divided into 5 repositories to allow each module to be evolved independently. The core module, which is aptly named morpheus-core, is the foundational library on which all other modules depend. The various Maven artifacts are as follows:

Morpheus Core

The foundational library that contains Morpheus Arrays, DataFrames and other key interfaces & implementations.

<dependency>
    <groupId>com.zavtech</groupId>
    <artifactId>morpheus-core</artifactId>
    <version>${VERSION}</version>
</dependency>

Morpheus Visualization

The visualization components to display DataFrames in charts and tables.

<dependency>
    <groupId>com.zavtech</groupId>
    <artifactId>morpheus-viz</artifactId>
    <version>${VERSION}</version>
</dependency>

Morpheus Quandl

The adapter to load data from Quandl

<dependency>
    <groupId>com.zavtech</groupId>
    <artifactId>morpheus-quandl</artifactId>
    <version>${VERSION}</version>
</dependency>

Morpheus Google

The adapter to load data from Google Finance

<dependency>
    <groupId>com.zavtech</groupId>
    <artifactId>morpheus-google</artifactId>
    <version>${VERSION}</version>
</dependency>

Morpheus Yahoo

The adapter to load data from Yahoo Finance

<dependency>
    <groupId>com.zavtech</groupId>
    <artifactId>morpheus-yahoo</artifactId>
    <version>${VERSION}</version>
</dependency>

Q&A Forum

A Questions & Answers forum has been setup using Google Groups and is accessible here

Javadocs

Morpheus Javadocs can be accessed online here.

Build Status

A Continuous Integration build server can be accessed here, which builds code after each merge.

Download Details:
Author: zavtech
Source Code: https://github.com/zavtech/morpheus-core
License: Apache-2.0 license

#java

What is GEEK

Buddha Community

Morpheus Core | A Library Of The Morpheus Data Science Framework
Uriah  Dietrich

Uriah Dietrich

1618449987

How To Build A Data Science Career In 2021

For this week’s data science career interview, we got in touch with Dr Suman Sanyal, Associate Professor of Computer Science and Engineering at NIIT University. In this interview, Dr Sanyal shares his insights on how universities can contribute to this highly promising sector and what aspirants can do to build a successful data science career.

With industry-linkage, technology and research-driven seamless education, NIIT University has been recognised for addressing the growing demand for data science experts worldwide with its industry-ready courses. The university has recently introduced B.Tech in Data Science course, which aims to deploy data sets models to solve real-world problems. The programme provides industry-academic synergy for the students to establish careers in data science, artificial intelligence and machine learning.

“Students with skills that are aligned to new-age technology will be of huge value. The industry today wants young, ambitious students who have the know-how on how to get things done,” Sanyal said.

#careers # #data science aspirant #data science career #data science career intervie #data science education #data science education marke #data science jobs #niit university data science

Gerhard  Brink

Gerhard Brink

1624699032

Introduction to Data Libraries for Small Data Science Teams

At smaller companies access to and control of data is one of the biggest challenges faced by data analysts and data scientists. The same is true at larger companies when an analytics team is forced to navigate bureaucracy, cybersecurity and over-taxed IT, rather than benefit from a team of data engineers dedicated to collecting and making good data available.

Creative, persistent analysts find ways to get access to at least some of this data. Through a combination of daily processes to save email attachments, run database queries, and copy and paste from internal web pages one might build up a mighty collection of data sets on a personal computer or in a team shared drive or even a database.

But this solution does not scale well, and is rarely documented and understood by others who could take it over if a particular analyst moves on to a different role or company. In addition, it is a nightmare to maintain. One may spend a significant part of each day executing these processes and troubleshooting failures; there may be little time to actually use this data!

I lived this for years at different companies. We found ways to be effective but data management took up way too much of our time and energy. Often, we did not have the data we needed to answer a question. I continued to learn from the ingenuity of others and my own trial and error, which led me to the theoretical framework that I will present in this blog series: building a self-managed data library.

A data library is _not _a data warehousedata lake, or any other formal BI architecture. It does not require any particular technology or skill set (coding will not be required but it will greatly increase the speed at which you can build and the degree of automation possible). So what is a data library and how can a small data analytics team use it to overcome the challenges I’ve described?

#big data #cloud & devops #data libraries #small data science teams #introduction to data libraries for small data science teams #data science

 iOS App Dev

iOS App Dev

1620466520

Your Data Architecture: Simple Best Practices for Your Data Strategy

If you accumulate data on which you base your decision-making as an organization, you should probably think about your data architecture and possible best practices.

If you accumulate data on which you base your decision-making as an organization, you most probably need to think about your data architecture and consider possible best practices. Gaining a competitive edge, remaining customer-centric to the greatest extent possible, and streamlining processes to get on-the-button outcomes can all be traced back to an organization’s capacity to build a future-ready data architecture.

In what follows, we offer a short overview of the overarching capabilities of data architecture. These include user-centricity, elasticity, robustness, and the capacity to ensure the seamless flow of data at all times. Added to these are automation enablement, plus security and data governance considerations. These points from our checklist for what we perceive to be an anticipatory analytics ecosystem.

#big data #data science #big data analytics #data analysis #data architecture #data transformation #data platform #data strategy #cloud data platform #data acquisition

'Commoditization Is The Biggest Problem In Data Science Education'

The buzz around data science has sent many youngsters and professionals on an upskill/reskilling spree. Prof. Raghunathan Rengasamy, the acting head of Robert Bosch Centre for Data Science and AI, IIT Madras, believes data science knowledge will soon become a necessity.

IIT Madras has been one of India’s prestigious universities offering numerous courses in data science, machine learning, and artificial intelligence in partnership with many edtech startups. For this week’s data science career interview, Analytics India Magazine spoke to Prof. Rengasamy to understand his views on the data science education market.

With more than 15 years of experience, Prof. Rengasamy is currently heading RBCDSAI-IIT Madras and teaching at the department of chemical engineering. He has co-authored a series of review articles on condition monitoring and fault detection and diagnosis. He has also been the recipient of the Young Engineer Award for the year 2000 by the Indian National Academy of Engineering (INAE) for outstanding engineers under the age of 32.

Of late, Rengaswamy has been working on engineering applications of artificial intelligence and computational microfluidics. His research work has also led to the formation of a startup, SysEng LLC, in the US, funded through an NSF STTR grant.

#people #data science aspirants #data science course director interview #data science courses #data science education #data science education market #data science interview

Ananya Gupta

Ananya Gupta

1611381728

What Are The Advantages and Disadvantages of Data Science?

Data Science becomes an important part of today industry. It use for transforming business data into assets that help organizations improve revenue, seize business opportunities, improve customer experience, reduce costs, and more. Data science became the trending course to learn in the industries these days.

Its popularity has grown over the years, and companies have started implementing data science techniques to grow their business and increase customer satisfaction. In online Data science course you learn how Data Science deals with vast volumes of data using modern tools and techniques to find unseen patterns, derive meaningful information, and make business decisions.

Advantages of Data Science:- In today’s world, data is being generated at an alarming rate in all time lots of data is generated; from the users of social networking site, or from the calls that one makes, or the data which is being generated from different business. Because of that reason the huge amount of data the value of the field of Data Science has many advantages.

Some Of The Advantages Are Mentioned Below:-

Multiple Job Options :- Because of its high demand it provides large number of career opportunities in its various fields like Data Scientist, Data Analyst, Research Analyst, Business Analyst, Analytics Manager, Big Data Engineer, etc.

Business benefits: - By Data Science Online Course you learn how data science helps organizations knowing how and when their products sell well and that’s why the products are delivered always to the right place and right time. Faster and better decisions are taken by the organization to improve efficiency and earn higher profits.

Highly Paid jobs and career opportunities: - As Data Scientist continues working in that profile and the salaries of different position are grand. According to a Dice Salary Survey, the annual average salary of a Data Scientist $106,000 per year as we consider data.

Hiring Benefits:- If you have skills then don’t worry this comparatively easier to sort data and look for best of candidates for an organization. Big Data and data mining have made processing and selection of CVs, aptitude tests and games easier for the recruitment group.

Also Read: How Data Science Programs Become The Reason Of Your Success

Disadvantages of Data Science: - If there are pros then cons also so here we discuss both pros and cons which make you easy to choose Data Science Course without any doubts. Let’s check some of the disadvantages of Data Science:-

Data Privacy: - As we know Data is used to increase the productivity and the revenue of industry by making game-changing business decisions. But the information or the insights obtained from the data may be misused against any organization.

Cost:- The tools used for data science and analytics can cost tons to a corporation as a number of the tools are complex and need the people to undergo a knowledge Science training to use them. Also, it’s very difficult to pick the right tools consistent with the circumstances because their selection is predicated on the proper knowledge of the tools also as their accuracy in analyzing the info and extracting information.

#data science training in noida #data science training in delhi #data science online training #data science online course #data science course #data science training