A time series is a series of data points indexed in time. The fact that time series data is ordered makes it unique in the data space because it often displays serial dependence. Serial dependence occurs when the value of a datapoint at one time is statistically dependent on another datapoint in another time. However, this attribute of time series data violates one of the fundamental assumptions of many statistical analyses — that data is statistically independent.
Autocorrelation is a type of serial dependence. Specifically, autocorrelation is when a time series is linearly related to a lagged version of itself. By contrast, correlation is simply when two independent variables are linearly related.
Often, one of the first steps in any data analysis is performing regression analysis. However, one of the assumptions of regression analysis is that the data has no autocorrelation. This can be frustrating because if you try to do a regression analysis on data with autocorrelation, then your analysis will be misleading.
Additionally, some time series forecasting methods (specifically regression modeling) rely on the assumption that there isn’t any autocorrelation in the residuals (the difference between the fitted model and the data). People often use the residuals to assess whether their model is a good fit while ignoring that assumption that the residuals have no autocorrelation (or that the errors are independent and identically distributed or i.i.d). This mistake can mislead people into believing that their model is a good fit when in fact it isn’t. I highly recommend reading this article about How (not) to use Machine Learning for time series forecasting: Avoiding the pitfalls in which the author demonstrates how the increasingly popular LSTM (Long Short Term Memory) Network can appear to be an excellent univariate time series predictor, when in reality it’s just overfitting the data. He goes further to explain how this misconception is the result of accuracy metrics failing due to the presence of autocorrelation.
Finally, perhaps the most compelling aspect of autocorrelation analysis is how it can help us uncover hidden patterns in our data and help us select the correct forecasting methods. Specifically, we can use it to help identify seasonality and trend in our time series data. Additionally, analyzing the autocorrelation function (ACF) and partial autocorrelation function (PACF) in conjunction is necessary for selecting the appropriate ARIMA model for your time series prediction.
For this exercise, I’m using InfluxDB and the InfluxDB Python CL. I am using available data from the National Oceanic and Atmospheric Administration’s (NOAA) Center for Operational Oceanographic Products and Services. Specifically, I will be looking at the water levels and water temperatures of a river in Santa Monica.
curl https://s3.amazonaws.com/noaa.water-database/NOAA_data.txt -o NOAA_data.txt
influx -import -path=NOAA_data.txt -precision=s -database=NOAA_water_database
This analysis and code is included in a jupyter notebook in this repo.
First, I import all of my dependencies.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from influxdb import InfluxDBClient
from statsmodels.graphics.tsaplots import plot_pacf
from statsmodels.graphics.tsaplots import plot_acf
from scipy.stats import linregress
Next, I connect to the client, query my water temperature data, and plot it.
client = InfluxDBClient(host=‘localhost’, port=8086)
h2O = client.query(‘SELECT mean(“degrees”) AS “h2O_temp” FROM “NOAA_water_database”.“autogen”.“h2o_temperature” GROUP BY time(12h) LIMIT 60’)
h2O_points = [p for p in h2O.get_points()]
h2O_df = pd.DataFrame(h2O_points)
h2O_df[‘time_step’] = range(0,len(h2O_df[‘time’]))
Fig 1. H2O temperature vs. timestep
From looking at the plot above, it’s not obviously apparent whether or not our data will have any autocorrelation. For example, I can’t detect the presence of seasonality, which would yield high autocorrelation.
I can calculate the autocorrelation with Pandas.Sereis.autocorr() function which returns the value of the Pearson correlation coefficient. The Pearson correlation coefficient is a measure of the linear correlation between two variables. The Pearson correlation coefficient has a value between -1 and 1, where 0 is no linear correlation, >0 is a positive correlation, and <0 is a negative correlation. Positive correlation is when two variables change in tandem while a negative correlation coefficient means that the variables change inversely. I compare the data with a lag=1 (or data(t) vs. data(t-1)) and a lag=2 (or data(t) vs. data(t-2).
shift_1 = h2O_df[‘h2O_temp’].autocorr(lag=1)
These values are very close to 0, which indicates that there is little to no correlation. However, calculating individual autocorrelation values might not tell the whole story. There might not be any correlation at lag=1, but maybe there is a correlation at lag=15. It’s a good idea to make an autocorrelation plot to compare the values of the autocorrelation function (AFC) against different lag sizes. It’s also important to note that the AFC becomes more unreliable as you increase your lag value. This is because you will compare fewer and fewer observations as you increase the lag value. A general guideline is that the total number of observations (T) should be at least 50, and the greatest lag value (k) should be less than or equal to T/k. Since I have a total of 60 observations, I will only consider the first 20 values of the AFC.
Fig 2. Autocorrelation plot for H2O temperatures
From this plot, we see that values for the ACF are within 95 percent confidence interval (represented by the solid gray line) for lags > 0, which verifies that our data doesn’t have any autocorrelation. At first, I found this result surprising, because usually the air temperature on one day is highly correlated with the temperature the day before. I assumed the same would be true about water temperature. This result reminded me that streams and rivers don’t have the same system behavior as air. I’m no hydrologist, but I know spring-fed streams or snowmelt can often be the same temperature year-round. Perhaps they exhibit a stationary temperature profile day to day where the mean, variance, and autocorrelation are all constant (where autocorrelation is = 0).
The ACF can also be used to uncover and verify seasonality in time series data. Let’s take a look at the water levels from the same dataset.
client = InfluxDBClient(host=‘localhost’, port=8086)
h2O_level = client.query(‘SELECT “water_level” FROM “NOAA_water_database”.“autogen”.“h2o_feet” WHERE “location”=‘santa_monica’ AND time >= ‘2015-08-22 22:12:00’ AND time <= ‘2015-08-28 03:00:00’’)
h2O_level_points = [p for p in h2O_level.get_points()]
h2O_level_df = pd.DataFrame(h2O_level_points)
h2O_level_df[‘time_step’] = range(0,len(h2O_level_df[‘time’]))
Fig 3. H2O level vs. timestep
Just by plotting the data, it’s fairly obvious that seasonality probably exists, evident by the predictable pattern in the data. Let’s verify this assumption by plotting the ACF.
Fig. 4: Autocorrelation plot for H2O levels
From the ACF plot above, we can see that our seasonal period consists of roughly 246 timesteps (where the ACF has the second largest positive peak). While it was easily apparent from plotting time series in Figure 3 that the water level data has seasonality, that isn’t always the case. In Seasonal ARIMA with Python, author Sean Abu shows how he must add a seasonal component to his ARIMA method in order to account for seasonality in his dataset. I appreciated his dataset selection because I can’t detect any autocorrelation in the following figure. It’s a great example of how using ACF can help uncover hidden trends in the data.
Fig. 5: Monthly Ridership vs. Year. Source: Seasonal ARIMA with Python
In order to take a look at the trend of time series data, we first need to remove the seasonality. Lagged differencing is a simple transformation method that can be used to remove the seasonal component of the series. A lagged difference is defined by:
difference(t) = observation(t) – observation(t-interval)2,
where interval is the period. To calculate the lagged difference in the water level data, I used the following function:
def difference(dataset, interval):
diff = list()
for i in range(interval, len(dataset)):
value = dataset[i] - dataset[i - interval]
return pd.DataFrame(diff, columns = [“water_level_diff”])
h2O_level_diff = difference(h2O_level_df[‘water_level’], 246)
h2O_level_diff[‘time_step’] = range(0,len(h2O_level_diff[‘water_level_diff’]))
Fig. 6: Lagged difference for H2O levels
We can now plot the ACF again.
Fig. 7: ACF of lagged difference for H2O levels
It might seem that we still have seasonality in our lagged difference. However, if we pay attention to the y-axis in Figure 5, we can see that the range is very small and all the values are close to 0. This informs us that we successfully removed the seasonality, but there is a polynomial trend. I used seasonal_decompose to verify this.
from statsmodels.tsa.seasonal import seasonal_decompose
from matplotlib import pyplot
result = seasonal_decompose(h2O[‘water_level’], model=‘additive’, freq=250)
Fig. 8. Seasonal Decomposition of H2O levels
Autocorrelation is important because it can help us uncover patterns in our data, successfully select the best prediction model, and correctly evaluate the effectiveness of our model. I hope this introduction to autocorrelation is useful to you.
If you accumulate data on which you base your decision-making as an organization, you should probably think about your data architecture and possible best practices.
If you accumulate data on which you base your decision-making as an organization, you most probably need to think about your data architecture and consider possible best practices. Gaining a competitive edge, remaining customer-centric to the greatest extent possible, and streamlining processes to get on-the-button outcomes can all be traced back to an organization’s capacity to build a future-ready data architecture.
In what follows, we offer a short overview of the overarching capabilities of data architecture. These include user-centricity, elasticity, robustness, and the capacity to ensure the seamless flow of data at all times. Added to these are automation enablement, plus security and data governance considerations. These points from our checklist for what we perceive to be an anticipatory analytics ecosystem.
#big data #data science #big data analytics #data analysis #data architecture #data transformation #data platform #data strategy #cloud data platform #data acquisition
The opportunities big data offers also come with very real challenges that many organizations are facing today. Often, it’s finding the most cost-effective, scalable way to store and process boundless volumes of data in multiple formats that come from a growing number of sources. Then organizations need the analytical capabilities and flexibility to turn this data into insights that can meet their specific business objectives.
This Refcard dives into how a data lake helps tackle these challenges at both ends — from its enhanced architecture that’s designed for efficient data ingestion, storage, and management to its advanced analytics functionality and performance flexibility. You’ll also explore key benefits and common use cases.
As technology continues to evolve with new data sources, such as IoT sensors and social media churning out large volumes of data, there has never been a better time to discuss the possibilities and challenges of managing such data for varying analytical insights. In this Refcard, we dig deep into how data lakes solve the problem of storing and processing enormous amounts of data. While doing so, we also explore the benefits of data lakes, their use cases, and how they differ from data warehouses (DWHs).
This is a preview of the Getting Started With Data Lakes Refcard. To read the entire Refcard, please download the PDF from the link above.
#big data #data analytics #data analysis #business analytics #data warehouse #data storage #data lake #data lake architecture #data lake governance #data lake management
Data management, analytics, data science, and real-time systems will converge this year enabling new automated and self-learning solutions for real-time business operations.
The global pandemic of 2020 has upended social behaviors and business operations. Working from home is the new normal for many, and technology has accelerated and opened new lines of business. Retail and travel have been hit hard, and tech-savvy companies are reinventing e-commerce and in-store channels to survive and thrive. In biotech, pharma, and healthcare, analytics command centers have become the center of operations, much like network operation centers in transport and logistics during pre-COVID times.
While data management and analytics have been critical to strategy and growth over the last decade, COVID-19 has propelled these functions into the center of business operations. Data science and analytics have become a focal point for business leaders to make critical decisions like how to adapt business in this new order of supply and demand and forecast what lies ahead.
In the next year, I anticipate a convergence of data, analytics, integration, and DevOps to create an environment for rapid development of AI-infused applications to address business challenges and opportunities. We will see a proliferation of API-led microservices developer environments for real-time data integration, and the emergence of data hubs as a bridge between at-rest and in-motion data assets, and event-enabled analytics with deeper collaboration between data scientists, DevOps, and ModelOps developers. From this, an ML engineer persona will emerge.
#analytics #artificial intelligence technologies #big data #big data analysis tools #from our experts #machine learning #real-time decisions #real-time analytics #real-time data #real-time data analytics
The COVID-19 pandemic disrupted supply chains and brought economies around the world to a standstill. In turn, businesses need access to accurate, timely data more than ever before. As a result, the demand for data analytics is skyrocketing as businesses try to navigate an uncertain future. However, the sudden surge in demand comes with its own set of challenges.
Here is how the COVID-19 pandemic is affecting the data industry and how enterprises can prepare for the data challenges to come in 2021 and beyond.
#big data #data #data analysis #data security #data integration #etl #data warehouse #data breach #elt
CVDC 2020, the Computer Vision conference of the year, is scheduled for 13th and 14th of August to bring together the leading experts on Computer Vision from around the world. Organised by the Association of Data Scientists (ADaSCi), the premier global professional body of data science and machine learning professionals, it is a first-of-its-kind virtual conference on Computer Vision.
The second day of the conference started with quite an informative talk on the current pandemic situation. Speaking of talks, the second session “Application of Data Science Algorithms on 3D Imagery Data” was presented by Ramana M, who is the Principal Data Scientist in Analytics at Cyient Ltd.
Ramana talked about one of the most important assets of organisations, data and how the digital world is moving from using 2D data to 3D data for highly accurate information along with realistic user experiences.
The agenda of the talk included an introduction to 3D data, its applications and case studies, 3D data alignment, 3D data for object detection and two general case studies, which are-
This talk discussed the recent advances in 3D data processing, feature extraction methods, object type detection, object segmentation, and object measurements in different body cross-sections. It also covered the 3D imagery concepts, the various algorithms for faster data processing on the GPU environment, and the application of deep learning techniques for object detection and segmentation.
#developers corner #3d data #3d data alignment #applications of data science on 3d imagery data #computer vision #cvdc 2020 #deep learning techniques for 3d data #mesh data #point cloud data #uav data