Advanced candlesticks for machine learning

In this article, we will learn how to create tick bars, thoroughly analyze their statistical properties, such as normality of returns and autocorrelation, and explore in which scenarios these bars can replace traditional time-based candlesticks. To demonstrate the applicability of tick bars to predicting cryptocurrency markets, we will base our analysis on a complete dataset of 16 cryptocurrency trading pairs, including the most popular crypto assets such as Bitcoin, Ethereum and Litecoin.

1. Introduction

In a previous article, we investigated why traditional time-based candlesticks are not the most suitable price data format if we are planning to train a machine learning (ML) algorithm. Namely: (1) time-based candlesticks over-sample periods of low activity and under-sample periods of high activity, (2) markets are increasingly controlled by trading algorithms that no longer follow any human-related daylight cycle, (3) time-based candlesticks are ubiquitous among traders and trading bots, which increases competition and, as we will see in this article, (4) time-based candlesticks offer weaker statistical properties. If you missed that article, find the link to it below.

In this article, we'll explore one of the suggested alternative bars: tick bars. Let's dig into them.

2. Creating tick bars

There are at least two main definitions of what a tick is. To quote Investopedia:

A tick is a measure of the minimum up or down movement in the price of a security. A tick can also indicate a change in the price of a security from trade to trade.

When it comes to tick bars, the definition we care about is the second one: in the context of tick bars, a tick is essentially a trade and the price at which that trade was executed on the exchange. A tick bar, or tick candle, is simply the aggregation of a predefined number of ticks. For example, if we want to generate 100-tick bars, we have to keep a store of all trades and, every time we "receive" 100 trades from the exchange, we build a bar or candlestick. The candlesticks are then created by calculating the Open, High, Low, Close and Volume values (often abbreviated as OHLCV).

The open and close values correspond to the price of the first and last trade, respectively. High and low are the maximum and minimum price of all trades in the candle (they may coincide with the open and close). Finally, volume is the sum of all assets exchanged (for example, in the ETH-USD pair, volume is measured as the number of Ethereum traded during the candle). By convention, when a candle closes at a higher price than its open price, we paint it green (or leave it empty), and when the close price is lower than the open price, we paint it red (or fill it with black).

Here is a very simple yet fast Python implementation for creating tick candles:
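A minimal sketch of such a routine, using only the standard library. It assumes trades arrive in chronological order as (price, amount) pairs; the names `Bar` and `make_tick_bars` and this input layout are illustrative assumptions, not taken from the original listing.

```python
from typing import Iterable, List, NamedTuple, Tuple


class Bar(NamedTuple):
    open: float
    high: float
    low: float
    close: float
    volume: float


def make_tick_bars(trades: Iterable[Tuple[float, float]], ticks_per_bar: int) -> List[Bar]:
    """Group chronologically ordered (price, amount) trades into OHLCV tick bars."""
    if ticks_per_bar < 1:
        raise ValueError("ticks_per_bar must be a positive integer")
    bars: List[Bar] = []
    prices: List[float] = []
    volume = 0.0
    for price, amount in trades:
        prices.append(price)
        volume += amount
        # Close the bar as soon as the predefined number of ticks is reached.
        if len(prices) == ticks_per_bar:
            bars.append(Bar(prices[0], max(prices), min(prices), prices[-1], volume))
            prices, volume = [], 0.0
    # Trades left over at the end do not form a complete bar and are dropped.
    return bars


# Hypothetical trades: (price, amount).
trades = [(100.0, 0.5), (101.0, 1.0), (99.5, 0.25), (100.5, 2.0), (102.0, 1.0)]
bars = make_tick_bars(trades, ticks_per_bar=2)
```

With five trades and two ticks per bar, this produces two complete bars; the fifth trade stays unassigned until another trade arrives.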

And here is a visual of how tick bars look compared to standard time-based candlesticks. In this case, we show the 4-hour and 1000-tick bars for the BTC-USD trading pair, as well as all the trade prices between 21 January 2020 and 20 February 2020. Note that, for both types of candlesticks, we show an asterisk each time we sample a bar.

Two main observations about these plots:

Yes, tick candlesticks look pretty ugly. They are chaotic, overlapping and difficult to read, but remember that they are not meant to be human-friendly: they are meant to be machine-friendly.

The main reason they are ugly is that they do their job very well. Look at the asterisks: see how many more asterisks (and more bars) are clustered together during periods of large price changes? And vice versa: when the price doesn't change much, tick bar sampling is much lower. Essentially, we are creating a system where we synchronize the arrival of information to the market (higher activity and price volatility) with the sampling of candlesticks. In the end, we sample more during periods of high activity and less during periods of low activity. Hurray!

3. Statistical properties

What about their statistical properties? Are tick bars better than their traditional time-based counterparts?

We will look at two different properties, (1) serial correlation and (2) normality of returns, for each of the 15 cryptocurrency pairs offered by CryptoDatum.io, including all historical bars from the Bitfinex exchange, and for each of the following time-based and tick candlestick sizes:

Time-based bar sizes: 1 minute, 5 minutes, 15 minutes, 30 minutes, 1 hour, 4 hours, 12 hours, 1 day.

Tick bar sizes: 50, 100, 200, 500, 1000.

3.1 Serial correlation

Serial correlation measures how much each value of a time series is correlated with the following one (lag = 1), or more generally between any value i and the value i + n (lag = n). In our case, we will calculate the serial correlation of log returns, computed as the first difference of the log of candlestick closing prices.
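The two quantities just defined can be sketched in a few lines of plain Python. The function names and the closing prices below are made up for illustration, and the lag-n serial correlation is computed as the Pearson correlation between the series and a copy of itself shifted by n positions.

```python
import math
from typing import List, Sequence


def log_returns(closes: Sequence[float]) -> List[float]:
    """First difference of the log of consecutive closing prices."""
    return [math.log(b) - math.log(a) for a, b in zip(closes, closes[1:])]


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def serial_correlation(series: Sequence[float], lag: int = 1) -> float:
    """Pearson correlation between the series and itself shifted by `lag` (lag >= 1)."""
    return pearson(series[:-lag], series[lag:])


# Hypothetical closing prices that zig-zag up and down.
closes = [100.0, 102.0, 101.0, 103.0, 102.0, 104.0]
returns = log_returns(closes)
rho = serial_correlation(returns, lag=1)
```

Because the example prices alternate between rising and falling, `rho` comes out strongly negative; a series of independent returns would give a value close to 0.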

Ideally, each data point of our series should be an independent observation. If serial correlation exists, the observations are not independent (they depend on each other at lag = 1 or higher), and this has consequences when building regression models, because the errors we observe in our regression will be smaller or larger than the true errors, which will mislead our interpretations and predictions. You can see a very visual explanation of the problem here.

To measure serial correlation, we will calculate the Pearson correlation of the series with a shifted copy of itself (lag = 1, aka first-order correlation). Here are the results:

It turns out that tick bars (labeled as tick-*) generally have lower autocorrelation than time-based candlesticks (labeled as time-*), that is, their Pearson autocorrelation is closer to 0. The difference seems less pronounced for the larger time bars (4h, 12h, 1d), but interestingly even the smallest tick bars (50-tick and 100-tick) yield a very low autocorrelation, which is not the case for the smaller time bars (1 minute, 5 minutes).

Finally, it is interesting to see that several cryptocurrencies (BTC, LTC, ZEC and ZIL) exhibit quite strong negative autocorrelation in several of their time bars. Roberto Pedace comments on negative autocorrelation here:

Autocorrelation, also known as serial correlation, can exist in a regression model when the order of observations in the data is relevant or important. In other words, with time-series (and sometimes panel or longitudinal) data, autocorrelation is a concern. […] No autocorrelation refers to a situation in which there is no identifiable relationship between the values of the error term. […] Although unlikely, negative autocorrelation is also possible. Negative autocorrelation occurs when an error of a given sign tends to be followed by an error of the opposite sign. For example, positive errors are usually followed by negative errors, and negative errors are usually followed by positive errors.

We will perform an additional statistical test called the Durbin-Watson (DW) test, which also diagnoses the presence of serial correlation. The DW statistic ranges from 0 to 4 and is interpreted as follows:

Value	Meaning
DW-statistic << 2	positive serial correlation
DW-statistic ~ 2	no first-order correlation
DW-statistic >> 2	negative serial correlation

Essentially, the closer the statistic is to 2, the lower the serial correlation. Here are the results:

The results are in line with the Pearson autocorrelation test, which strengthens the narrative that tick bars exhibit a slightly lower autocorrelation than time-based candlesticks.
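For reference, the Durbin-Watson statistic is the sum of squared differences between consecutive values divided by the sum of squared values. The sketch below applies it directly to a series treated as regression residuals; how exactly the test was run on the candlestick data is an assumption here, and the two example series are made up to hit both ends of the scale.

```python
from typing import Sequence


def durbin_watson(residuals: Sequence[float]) -> float:
    """Sum of squared successive differences divided by the sum of squares."""
    num = sum((b - a) ** 2 for a, b in zip(residuals, residuals[1:]))
    den = sum(e ** 2 for e in residuals)
    return num / den


# Hypothetical residual series.
alternating = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]  # sign flips at every step
trending = [1.0, 1.0, 1.0, -1.0, -1.0, -1.0]     # sign persists for long runs

dw_alternating = durbin_watson(alternating)  # well above 2: negative serial correlation
dw_trending = durbin_watson(trending)        # well below 2: positive serial correlation
```

For long series the statistic is approximately 2(1 - r), where r is the lag-1 autocorrelation, which is why values near 2 indicate no first-order correlation.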

3.2 Normality of returns

Another statistic we can look at is the normality of returns, that is, whether the distribution of our log returns follows a normal (aka Gaussian) distribution.

There are several tests that we can run to check for normality, and we will perform two of them: the Jarque-Bera test, which checks whether the skewness and kurtosis of the data match those of a normal distribution, and the Shapiro-Wilk test, one of the classic tests to check whether a sample follows a Gaussian distribution.

In both cases, the null hypothesis is that the sample follows a normal distribution. If the null hypothesis is rejected (p-value lower than the significance level, usually 0.05), there is compelling evidence that the sample does not follow a normal distribution.
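To make the decision rule concrete, here is a hand-rolled sketch of the Jarque-Bera test only (Shapiro-Wilk is considerably more involved, and in practice both tests would normally come from a statistics library such as SciPy). It uses the standard definition of the statistic and its asymptotic chi-squared distribution with 2 degrees of freedom; the function name and the synthetic sample are illustrative assumptions.

```python
import math
import random
from typing import Sequence, Tuple


def jarque_bera(sample: Sequence[float]) -> Tuple[float, float]:
    """Return the Jarque-Bera statistic and its asymptotic p-value."""
    n = len(sample)
    mean = sum(sample) / n
    # Central moments (biased, i.e. divided by n).
    m2 = sum((x - mean) ** 2 for x in sample) / n
    m3 = sum((x - mean) ** 3 for x in sample) / n
    m4 = sum((x - mean) ** 4 for x in sample) / n
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2
    statistic = n / 6.0 * (skewness ** 2 + (kurtosis - 3.0) ** 2 / 4.0)
    # Under the null hypothesis the statistic is asymptotically chi-squared
    # with 2 degrees of freedom, whose survival function is exp(-x / 2).
    p_value = math.exp(-statistic / 2.0)
    return statistic, p_value


# Synthetic heavy-tailed sample (cubed Gaussian draws), clearly non-normal.
rng = random.Random(0)
heavy_tailed = [rng.gauss(0.0, 1.0) ** 3 for _ in range(2000)]
jb_stat, jb_p = jarque_bera(heavy_tailed)
reject_normality = jb_p < 0.05
```

Because the p-value relies on an asymptotic approximation, it is less reliable for small samples, which is worth keeping in mind when a test fails to reject normality on only a few hundred bars.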

Let's first look at the p-values for Jarque-Bera:

The results are almost unanimous: log returns do not follow a Gaussian distribution (most p-values < 0.05). Two cryptocurrency pairs (Stellar and Zilliqa) actually seem to follow a Gaussian if we set our significance level at 0.05. Let's take a look at their distributions (kernel density estimates):


Fair enough, some of them may look Gaussian (at least visually). Note, however, that the number of samples (n) is very small (e.g. XLM-USD candle_tick_1000 has n = 195), so I suspect that one of the reasons may be the small sample size, which does not give Jarque-Bera enough evidence to reject the null hypothesis of normality.

In fact, a quick glance at the CryptoDatum.io database shows that the trading pairs XLM-USD and ZIL-USD were launched in May and July 2018, respectively, and appear to have fairly low volume.

Mystery solved? :)

Now let's run the Shapiro-Wilk test to see if it agrees with the previous results:

Damn, Shapiro, didn't they teach you at school not to copy during an exam? Non-normality of returns seems to be the rule, regardless of the type of bar.

4. What have we learned?

Tick candlesticks are created by collecting a predefined number of ticks and calculating the associated OHLCV values.

Tick bars look ugly on a chart, but they do their job well: they sample more during high activity periods and less during low activity periods.

Log returns from tick candlesticks show lower serial correlation than those from time-based candlesticks, even at small sizes (50- and 100-tick bars).

Log returns from both tick and time-based bars do not follow a normal distribution.
