5 Cryptocurrency Developments To Look Forward To In 2019

2018 turned out to be a disaster year for many investors in cryptocurrency. Expectations were high after the crazy price increases of Bitcoin and other cryptocurrencies witnessed prior to and in…

独家优惠奖金 100% 高达 1 BTC + 180 免费旋转

The Quest for the Ultimate Optimizer . Episode 2

Can we find zero in less than 20 iterations ?

Being a continuation of the first episode, it reuses quite a lot of the same code. This means that:

Let’s re-run the last 2 RNNs proposed in the previous notebook.

We concluded the last episode by declaring victory over RMSProp ... but is it the best we can do ?

To do that we need to define what is 0 in our context.
It turns out Numpy offers an easy definition :

Let’s see how it looks

The approach proposed below adapts the gradient range during the optimization, gradually lowering the floor of log(gradients) as the RNN gets more precise.

So, on the plus side, the convergence is initially faster. We also seem to have removed the barrier preventing the RNN going lower than 1e-28. This allows the average result to continue improving, albeit very slowly.
On the minus side, well, we are still nowhere near 0 (i.e. 1e-38) on average.

Before exploring new RNN configurations, let’s try one last trick : instead of minimizing the log of the last result, we can minimize the sum of log of results of all 20 iterations.
In theory, minimizing the last result should also minimize all previous iterations results, as the back propagation of gradients goes through each iterations starting from the last up to the first, so this should not change the results much. However, SGD-like optimizer can behaves rather chaotically, meaning a small delta on the first iteration can result in a big and unpredictable difference 20 iterations later. So, I’m not sure how far up the iterations chain we can back propagate gradients without losing “meaning” (i.e. improvement to the RNN that would generalize to another initialization). Anyway, if this back propagation of gradients through iterations is an issue, adding all the iterations directly into the loss function will take care of it.
An other benefit of this approach is that, if we ever want to reliably get to the “zero machine” within 20 iterations, we need to use a loss function that still gets smaller after the target is reached, which is what this approach does.

Enough talk, let’s give it a go.

We do get another improvement in terms of average end result.
We also seem to have some cases where the optimizers does find the “zero machine” : where the light red area reaches the purple horizontal line. So we are definitely getting closer …

Let’s have a look at what 3 tries of “base” convergence look like.

This is a reminder that, although we have managed to remove the floor that prevented the RNN optimizer to reach the zero machine (1e-38), the average end result is still around 1e-23 … far from our goal of 1e-38

The first problem we highlighted is that we are trying to design a RNN that works as well at y=1 as at y=1e-38, with gradient varying between 1 and exp(-43)≈1e-19 (I should mention that this python confusing convention of writing small numbers like 10–5 with 1e-5 is most unfortunate in our context).
The different implementations of logarithmic preprocessing of the gradients proposed above sort of address the problem by rescaling this huge variation of scale into a linear segment between 0 and 1 so that it is more or less interpretable by the RNN, but it’s never truly scale invariant.
There is probably a much better implementation of this idea of logarithmic preprocessing, but instead of sinking more time into fine tuning this (or digging into Deepmind’s code to see how they cracked this :-), we can try a simpler approach : since the RNN is being fed the past 20 inputs, why not feed it only the ratios of gradients between one step and the next and let it make sense of it all. The information provided should be equivalent and it has the big benefit of being completely scale invariant. It’s actually one of the first ideas I tried. However, I was using the direct result of the RNN as the function to be minimized, and as we have seen, this leads to vanishing gradient if you’re not applying log to the function result.

To be noted : the implementation above devides the gradient by the norm of the previous gradient. Dividing by the previous gradient, without the norm, yields more or less the same results.

Taking another look at the “base” convergence above reminds us that what we are trying to achieve — get through 38 orders of magnitude to find the zero machine within 20 iterations — is both pretty useless in terms of practical applications and pretty tricky as a theoretical exercise. It means we want the RNN to divide the “base” loss function by more than 100 every iteration, on average.
Looking at the convergence of the RNN above, we can see that some iterations provide no gains, and some get through 4 orders of magnitude in one step. It means that, even if we feed into the RNN the ratio of gradients from one iteration to the next, we still get RNN input variations of 4 orders of magnitude (between 1e-4 and 1) which, as we have seen, is not ideal.

We already dealt with this problem with the log casting previously, so why not reuse the same solution and apply it on the ratio of gradient ? In other words, let’s use every tricks we have used so far and see where we get.