Tia Gottlieb

Cluster analysis on stock selection

Background

Have you ever gotten tired of selecting a stock by poring over hundreds of financial ratios? Or have you grown bored of polishing your technical analysis skills or improving your time series model to make better price predictions? If the answer is YES, you have come to the right place.

In this article, we will go through an experiment to see whether financial ratios across different dimensions really add value to stock selection. Along the way, we will also see how cluster analysis helps us cut through the sea of financial metrics.

Before that, I am also going to show you the procedures to download (1) historical financial ratios of stocks and (2) daily price data of stocks, which could be very useful for anyone playing around with stock data in other projects.

For the original versions of all the code below, you may refer to my GitHub link here.

(1) Download financial indicators of stocks

First, we will use a library called FundamentalAnalysis. For the details of this library, please refer to the website here.

To use this package, we need an API key from FinancialModellingPrep; follow the instructions there to obtain a free API key. Please note that free keys are limited to 250 requests per account, although there is no time limit. So I strongly recommend saving the downloaded data as Excel files for further use; otherwise, the limit is easy to exceed.

Due to this limitation, I set the scope for stock selection in the experiment to the 97 stocks listed below, all of which are components of the Nasdaq 100 Index.
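The original code is embedded from GitHub and not reproduced here. As a rough, hedged sketch of what this download step could look like, assuming the FundamentalAnalysis package exposes key_metrics() and financial_ratios() helpers that return one DataFrame per ticker (and with a shortened placeholder ticker list):

import pandas as pd
import fundamentalanalysis as fa  # pip install FundamentalAnalysis

api_key = "YOUR_FMP_API_KEY"        # free key from FinancialModellingPrep
tickers = ["AAPL", "MSFT", "AMZN"]  # placeholder; the experiment uses the 97 Nasdaq 100 components

# one sheet per ticker; note that every call counts against the 250-request quota
with pd.ExcelWriter("key_metrics.xlsx") as km_writer, \
     pd.ExcelWriter("financial_ratios.xlsx") as fr_writer:
    for ticker in tickers:
        fa.key_metrics(ticker, api_key, period="annual").to_excel(km_writer, sheet_name=ticker)
        fa.financial_ratios(ticker, api_key, period="annual").to_excel(fr_writer, sheet_name=ticker)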

The above code generates two Excel files, (a) key_metrics.xlsx and (b) financial_ratios.xlsx, with each stock in a separate sheet. Both files store various financial indicators for the past 10–20 years, depending on when the companies were listed. We will combine them with the returns and price volatility data in a later step.

[Image: a captured view of the key metrics data for MSFT]

(2) Download stock price data

Next, let's download the price data. Since the above package has a limit on its request quota, we will switch to another free library, yfinance.
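The original snippet is again embedded from GitHub; a minimal sketch of this step with yfinance (the date range and tickers below are my placeholders) might look like:

import yfinance as yf

tickers = ["AAPL", "MSFT", "AMZN"]  # placeholder; use the full 97-ticker list

# download daily OHLCV data for all tickers in one call
data = yf.download(tickers, start="2017-01-01", end="2019-12-31")

# keep only the close price and save it for the later steps
data["Close"].to_excel("price.xlsx")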

The above code downloads the daily price data for all the stocks in the ticker list. The close price is selected to represent the price and is output to an Excel file, price.xlsx.

[Image: a captured view of the downloaded price data]

(3) Combine all the data for use

At last, we would like to combine the three prepared Excel files into a single file for each year in the selected period (2017–2019). I am not planning to go through the details here, since this only requires some basic pandas and numpy skills. For the original code of this section, please refer to **cluster_stocks_data.py** in the GitHub link here.

Nonetheless, I would like to mention some key tricks that are quite useful and applicable even when you are working on other projects.

(a) **dataframe.at[index, column_name]**: instead of getting confused about iloc, loc, or other similar functions, you may try **.at**, which refers directly to a single cell within the dataframe. You can easily set its value with an equal sign.

(b) **dataframe.T**: transposes the dataframe if you want to swap the rows and columns. After applying **.T**, the column names become the index, and vice versa.

(c) **pd.concat()**: combines two dataframes in parallel (horizontally).

(d) **dataframe.fillna(value=N)**: fills the cells containing NaN with a specific value.
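To make these tricks concrete, here is a small self-contained sketch using made-up toy data (not the article's actual dataset):

import numpy as np
import pandas as pd

# toy data: one row per metric, one column per ticker
metrics = pd.DataFrame({"AAPL": [1.3, np.nan], "MSFT": [2.1, 0.4]},
                       index=["currentRatio", "debtToEquity"])

# (a) .at reads or writes a single cell directly
metrics.at["currentRatio", "MSFT"] = 2.2

# (b) .T swaps rows and columns, so the tickers become the index
by_ticker = metrics.T

# (c) pd.concat(..., axis=1) glues two dataframes together horizontally
returns = pd.DataFrame({"return_2019": [0.86, 0.55]}, index=["AAPL", "MSFT"])
combined = pd.concat([by_ticker, returns], axis=1)

# (d) fillna replaces the remaining NaN cells with a fixed value
combined = combined.fillna(value=0)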

After going through these tedious procedures, the end products are three Excel files (2017, 2018, and 2019). Each stores the returns, price volatility, and other financial indicators for every stock in the ticker list for a specific year.

#cluster-analysis #portfolio #data-science #stock-market #finance #data analysis

Buddha Community

Hubify Apps

Back In Stock Notification App for Your Shopify Store

The last thing you want to do is dissatisfy your customers. It is quite disappointing for online shoppers to want to purchase a product, only to discover that it is out of stock.

One thing that is common among Shopify stores is that they often experience stockouts. A stockout occurs when inventory runs out. If you want to handle stockout issues effectively, you should use a Shopify product back-in-stock alerts app.

What can back-in-stock alerts help you do? They let you notify shoppers when products become available again, as long as the shoppers have subscribed through the back-in-stock notification app.

Learn More : https://hubifyapps.com/back-in-stock-notification-app/

#back in stock notification app #back in stock alert #in stock alert #in stock #back in stock #stock alert app

Stock Fundamental Analysis: EDA of SEC’s quarterly data summary

Many investors consider fundamental analysis their secret weapon for beating the stock market. You can perform it using many methods, but they all have one thing in common: they all need data from companies' financial statements.

Luckily, all companies whose stocks are traded on US stock markets must report quarterly to the Securities and Exchange Commission (SEC). Every quarter, the SEC prepares a convenient CSV package to help all the investors in their quest for investment opportunities. Let's explore how to get valuable insights from these .csv files.

In this tutorial, we will use Python's pandas library, which is ideal for parsing CSV files. We will process the data and:

  • explore files in the SEC dump
  • review each column of these files and talk about the most relevant
  • remove **duplicated** data grouped by a key column or multiple columns
  • visualize the data to support our exploration using interactive Plotly charts
  • and much more

As usual, you can follow the code in the notebook shared on GitHub.

vaclavdekanovsky/data-analysis-in-examples (github.com)

SEC Quarterly data

At first, there doesn't seem to be any problem. You simply download the quarterly package from the SEC dataset page, sort the values from the financial statements in descending order, and pick the stocks at the top. The reality isn't that straightforward. Let's have a look and explore the 45.55 MB zip file with all SEC filings for the first quarter of 2020.

The package for every quarter contains 5 files. Here’s an example of 2020 Q1:

  • readme.htm — describes the structure of the files
  • **sub.txt** — master information about the submissions including company identifiers and type of the filing
  • **num.txt** — numeric data for each financial statement and other documents
  • tag.txt — standard taxonomy tags
  • pre.txt — information about how the data from num.txt is displayed in the online presentation

[Image: unzipped files in the SEC quarterly data dump]

This article will only deal with the submission master, because it contains more than enough information for one article. A follow-up story will examine the data in more detail. Let's begin.

2020Q1 Submission files

In the first quarter of 2020, companies submitted 13,560 filings, and sub.txt gathers 36 columns about them.

import os
import pandas as pd

folder = "2020q1"  # directory with the unzipped quarterly dump (adjust to your path)

# load the .csv file into pandas (tab-separated, keep cik as a string)
sub = pd.read_csv(os.path.join(folder, "sub.txt"), sep="\t", dtype={"cik": str})

# explore number of rows and columns
sub.shape
[Out]: (13560, 36)

I always start with a simple function that reviews each column of the data frame, checks the percentage of empty values, and counts how many unique values appear in each column.

Explore the sub.txt file to see what data each column contains
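The overview code itself is embedded as a gist and not reproduced here; a simple function in the same spirit (my own sketch, not the author's original) could be:

def overview(df):
    # for each column: data type, share of missing values, number of unique values
    return pd.DataFrame({
        "dtype": df.dtypes,
        "missing_pct": (df.isna().mean() * 100).round(2),
        "unique_values": df.nunique(),
    })

overview(sub)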

Let me highlight a few important columns in the SEC submission master.

[Image: example of the quick file overview in pandas]

  • adsh — EDGAR accession number that uniquely identifies each report. This value is **never duplicated** in the sub.txt. Example: 0001353283–20–000008 is the code for the 10-K (yearly filing) of Splunk.
  • cik — Central Index Key, a unique key identifying each SEC registrant. E.g. 0001353283 for Splunk. As you can see, the first part of the adsh is the cik.
  • name — the name of the company submitting the quarterly financial data
  • form — the type of the report being submitted

Forms — submission types delivered to the SEC

Based on the analysis, we see that the 2020Q1 submissions contain 23 unique types of financial reports. Investors' primary interest lies in the 10-K report, which covers the annual performance of a publicly traded company. Because this report is, as expected, delivered only once a year, the 10-Q report, which shows quarterly changes in the company's financials, is also important.

  • 10-K Annual report of US-based company
  • 10-Q Quarterly report and maybe
  • 20-F Annual Reports of a foreign company
  • 40-F Annual Reports of a foreign company (Canadian)

Let's see which forms are the most common in the dataset. Plotting the form types in 2020Q1 shows this picture:

Using Plotly's low-level API to produce bar and pie subplots
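The chart-building code is also embedded as a gist; a hedged reconstruction of how such bar and pie subplots can be produced with Plotly's graph_objects API (my own sketch, not the author's exact code) is:

import plotly.graph_objects as go
from plotly.subplots import make_subplots

# count how often each form type appears in the submission master
form_counts = sub["form"].value_counts()

fig = make_subplots(rows=1, cols=2,
                    specs=[[{"type": "xy"}, {"type": "domain"}]],
                    subplot_titles=["Submissions per form type", "Share of form types"])
fig.add_trace(go.Bar(x=form_counts.index, y=form_counts.values), row=1, col=1)
fig.add_trace(go.Pie(labels=form_counts.index, values=form_counts.values), row=1, col=2)
fig.update_layout(showlegend=False)
fig.show()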

[Image: different submission types reported by the companies in 2020Q1, visualized with Plotly]

The dataset contains over 7,000 8-K reports notifying about important events like agreements, layoffs, usage of material, modifications of shareholder rights, changes in senior positions, and more (see the SEC's guideline). Since they are the most common, we should spend some time exploring them.

#stocks #exploratory-data-analysis #python #data-analysis #stock-market #data analysis

Ray Patel

Getting started with Time Series using Pandas

An introductory guide on getting started with time series analysis in Python

Time series analysis is the backbone for many companies, since most businesses work by analyzing their past data to inform their future decisions. Analyzing such data can be tricky, but Python, as a programming language, can help to deal with it. Python has both built-in tools and external libraries, making the whole analysis process both seamless and easy. Python's pandas library is frequently used to import, manage, and analyze datasets in various formats. In this article, however, we'll use it to analyze stock prices and perform some basic time-series operations.
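As a tiny taste of the kind of operations the guide covers, here is a minimal sketch (the file and column names are my own placeholders) of indexing by date, resampling, and computing a rolling mean with pandas:

import pandas as pd

# hypothetical daily close prices with 'Date' and 'Close' columns
prices = pd.read_csv("prices.csv", parse_dates=["Date"], index_col="Date")

monthly_mean = prices["Close"].resample("M").mean()       # month-end averages
rolling_20d = prices["Close"].rolling(window=20).mean()   # 20-day moving average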

#data-analysis #time-series-analysis #exploratory-data-analysis #stock-market-analysis #financial-analysis #getting started with time series using pandas

Feature selection and error analysis while working with spatial data

Did you ever spend hours experimenting with model architecture and parameter tuning before finding out that the algorithm was missing some crucial detail? Find out how to conduct efficient feature selection and error analysis while working with spatial data.


Error analysis is one of the key parts of training any ML model. Despite its importance, I have found myself spending far too many hours experimenting with model architecture and hyperparameter tuning before investigating the errors made by the algorithm. Efficient error analysis requires combining good knowledge of the input data and algorithms with domain knowledge about the problem we are trying to solve.

Working with spatial data makes structuring error analysis easier as you can extract a lot of insights just by mapping your errors. In this article, I will focus on benchmarking Real Estate prices in Warsaw using Random Forest Regression.

The key data source consists of around 25k property sale offers from Warsaw with nearly 100 features. During project development, this data was enriched with additional data sources after the need for extra location information was identified during error analysis. I will demonstrate the key stages of the error-analysis-driven development process.

Whole code and data sources are available on GitHub: https://github.com/Jan-Majewski/Project_Portfolio/blob/master/03_Real_Estate_pricing_in_Warsaw/03_03_Feature_selection_and_error_analysis.ipynb

Access code with nbViewer for full interactivity: https://nbviewer.jupyter.org/github/Jan-Majewski/Project_Portfolio/blob/eb4bb8be0cf79cac979d9411b69d5150270550d5/03_Real_Estate_pricing_in_Warsaw/03_03_Feature_selection_and_error_analysis.ipynb

1. Introduction

The data used can be downloaded from GitHub:

import pandas as pd

df = pd.read_excel(r"https://raw.githubusercontent.com/Jan-Majewski/Project_Portfolio/master/03_Real_Estate_pricing_in_Warsaw/Warsaw_RE_data.xlsx")

After the initial data transformation and basic EDA, which are described in detail in the linked notebook, we end up with a DataFrame called ml_data with 25,240 rows and 89 columns.

ml_data.columns

[Image: property characteristic features available in the base input data]

Going through nearly 100 features might seem difficult at first, but we quickly realize that, apart from 5 numerical features such as area, building year, and the number of rooms and floors, the remaining features are one-hot columns created from categorical data.

We can see that the basic dataset consists only of property-level characteristics, similar to the BostonHousing or CaliforniaHousing datasets. The initial data misses a detailed description of the building location, which in reality is the key price driver.

2. Feature selection

The first challenge is how to select the most important features to make training the regression model easier and to avoid overfitting. Sklearn provides a great function, SelectKBest, to aid us in feature selection. As we are facing a regression problem, I chose the f_regression scoring function.

SelectKBest allows us to find the top features carrying the most information about the variable y, which in our case is unit_price expressed as price per area.

from sklearn.feature_selection import SelectKBest, f_regression

# X: the candidate feature columns of ml_data, y: unit_price (price per area)
bestfeatures = SelectKBest(score_func=f_regression, k="all")
fit = bestfeatures.fit(X, y)
dfscores = pd.DataFrame(fit.scores_)
dfcolumns = pd.DataFrame(X.columns)
# Let's transform the outputs into one DataFrame for readability
featureScores = pd.concat([dfcolumns, dfscores], axis=1)
featureScores.columns = ['Feature', 'Score']  # named 'Feature' so the query below can reference it
featureScores.nlargest(50, 'Score').head(20)

[Image: top 20 features from the base data input]

After several rounds of iteration, I decided to use only features with an f_regression score over 200. With the initial data, 33 features pass this threshold.

top_features=featureScores.query("Score>200").Feature.unique()
top_features

[Image: features used in the initial model]

3. Building initial model

As feature selection and error analysis are the key focus of this article, I chose the Random Forest Regressor as the algorithm: it combines quite good performance with the ability to analyze feature importances. I will use one model with the same hyperparameters across all iterations so that the results are comparable.

To avoid overfitting, I chose a few regularization hyperparameters. They could probably be tuned to achieve slightly better results, but hyperparameter tuning would be good material for another article. Please find the model setup below:

[Image: Random Forest Regressor hyperparameters]
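Since the screenshot with the exact setup is not reproduced here, below is a hedged sketch of what a regularized Random Forest Regressor setup might look like; the specific hyperparameter values and the train/test split are my placeholders, not the author's settings:

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# X[top_features]: the selected feature columns, y: unit_price
X_train, X_test, y_train, y_test = train_test_split(X[top_features], y,
                                                    test_size=0.2, random_state=42)

model = RandomForestRegressor(
    n_estimators=500,      # number of trees
    max_depth=12,          # cap tree depth to limit overfitting
    min_samples_leaf=5,    # require several samples per leaf
    max_features="sqrt",   # consider a subset of features at each split
    random_state=42,
    n_jobs=-1,
)
model.fit(X_train, y_train)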

3.1 Investigating model performance and feature importance

As the key goal of my model is to accurately benchmark real estate prices, I aim to get as many properties as possible close to the benchmark. I will analyze model performance as the share of properties from the test set for which the absolute percentage error of the model's forecast falls within the 5%, 10%, and 25% boundaries.
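A minimal sketch of that metric (my own formulation of what the paragraph describes, reusing the hypothetical model and test split from the sketch above):

import numpy as np

def share_within(y_true, y_pred, threshold):
    # share of properties whose absolute percentage error is within the threshold
    ape = np.abs(y_pred - y_true) / y_true
    return (ape <= threshold).mean()

y_pred = model.predict(X_test)
for t in (0.05, 0.10, 0.25):
    print(f"within {t:.0%}: {share_within(y_test, y_pred, t):.1%}")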

#real-estate #error-analysis #spatial-analysis #data-visualization #feature-selection #data analysis