Failure-Driven Backtesting
Introduction
Backtesting is the process of testing a strategy against historical market data to see how it would have performed in the past.
I’ll start by repeating the familiar saying: past performance does not guarantee future results. What is mentioned far less often is that achieving even a decent backtest is incredibly hard and expensive.
In this post, I’ll share how I approached this problem in BullSheet, using a set of pragmatic hacks and simulations to crash my backtest results rather than aiming for perfection.
My holding periods are generally measured in months. So if you’re looking for EMA-cross style backtesting for day trading or short-term swing trading, I’m doing the complete opposite.
(Disclaimer: This is not investment advice. This is just the logic I use for my own sanity.)
Let me introduce myself quickly.
I have a Master’s degree in Computer Science and a Bachelor’s degree in Mathematics. I’ve been working as a backend and infrastructure engineer since around 2008 and am currently based in Berlin. But beyond my day job (or lack thereof recently), I’ve been an active investor. That journey spans years of learning, years of trying different things, and some expensive lessons acknowledged along the way, all of which I do not hesitate to share with others.
You can reach out to me via:
Email: bullsheet@anar-bayramov.com
The Setup
Assume you have already done the hard work.
You collected 10 years of individual stock data
You have both fundamental and technical metrics for 10 years
You want to test a very simple active strategy
Example random strategy
At each rebalance (monthly):
Select the 10 S&P 500 stocks with the lowest P/E ratios below 15
Rank candidates by ascending P/E and take the top 10
Exit conditions (whichever happens first):
P/E rises above 30
Position reaches +40% profit
Position reaches −15% loss
Now you want to backtest this strategy over 10 years.
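The rules above can be sketched as two small functions, one for selection and one for exits. The data shapes (`universe` as a list of dicts with `pe` and `return_pct` fields) are hypothetical placeholders, not a real data feed:

```python
# Sketch of the example strategy's rules; data layout is an assumption.
MAX_POSITIONS = 10

def select_candidates(universe):
    """Pick the 10 lowest-P/E stocks with P/E below 15, ranked ascending."""
    cheap = [s for s in universe if s["pe"] is not None and s["pe"] < 15]
    cheap.sort(key=lambda s: s["pe"])  # ascending P/E
    return cheap[:MAX_POSITIONS]

def should_exit(position):
    """Exit on whichever trigger fires first."""
    return (
        position["pe"] > 30            # valuation no longer cheap
        or position["return_pct"] >= 40  # take profit
        or position["return_pct"] <= -15  # cut loss
    )
```

At each monthly rebalance you would run `select_candidates` on that month’s snapshot and `should_exit` on every open position.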
Problem 1: Screening
Most financial screeners only work on current data.
To my knowledge, there is no easy way to fetch “Which stocks had P/E < 15 in 2015?”, and complex screening logic is even harder (my strategy calculates the BullSheet Score from ~30 parameters).
So most likely you need to build a screener that works with historical states of the world, which is essentially what I did with BullSheet.
Example raw data:
Ticker: AA
PE Ratios:
01/11/2015 → 15
02/11/2015 → 16
So to screen correctly, you need point-in-time data, so that every query only uses information that was actually known on that date. This is also what eliminates look-ahead bias.
For the record, you can also buy point-in-time datasets, but spoiler alert: they are quite expensive, and I haven’t tested whether they can be processed without a cluster of computers.
Problem 2: Data Changes
Fundamental data is not static.
Stocks get split. Tickers get renamed. Financial statements are restated.
Earnings are often corrected months later. If you use those corrected numbers in a backtest, you are calculating P/E ratios that did not exist at the time. You are time-traveling with better information than you actually had. Then comes survivorship bias.
If you pick today’s stock universe and run a backtest backward, you silently exclude:
Bankrupt companies
Acquired companies
Companies that slowly shrank
Basically, the strategy could look great because the losers are already eliminated.
Problem 3: Entry/Exit Prices
Daily and weekly candles give you only a rough summary: open, close, high, low, and volume. Your actual trades will happen intraday.
Example:
Open: 100
Close: 90
High: 140
Low: 75
So which price did you actually get filled at?
You can pick one of these prices or use some average, but either way you are approximating reality. The more precision you try to add, the more complexity you introduce.
If you include intraday data in your backtests, your dataset can easily explode into terabytes.
So there is no perfect solution for retail investors, only different types of approximation, each with its own problems.
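Using the candle from the example, here is a sketch of a few common fill-price assumptions. None of these is “the” right answer; they are simply different approximations with different biases:

```python
# One daily candle from the example above.
candle = {"open": 100.0, "high": 140.0, "low": 75.0, "close": 90.0}

def fill_prices(c):
    """Different fill-price approximations for the same daily candle."""
    return {
        "open": c["open"],
        "close": c["close"],
        "typical": round((c["high"] + c["low"] + c["close"]) / 3, 2),  # HLC/3
        "midpoint": round((c["high"] + c["low"]) / 2, 2),
    }
```

For this candle the assumptions already disagree by a wide margin (fills from 90 to 107.5), which is exactly the approximation error the section describes.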
Problem 4: Sentiment
Our simple strategy ignores sentiment, but real humans do not. There are also strategies that explicitly trade on market sentiment.
Imagine a healthcare company:
Perfect fundamentals
Fits your screener
Just failed a clinical trial
The stock collapses 40%. In 3 years, it recovers and goes on to 10x.
But at the time of the crash, the news is screaming about massive losses, with “Stock A is worst pick of the year” headlines everywhere. Buying it feels like catching a falling knife.
Looking at historical data and running backtests does not include that fear. Even after years, I still feel it when trading.
How BullSheet Tackles Backtesting
As an engineer, I am very familiar with the trade-off between “perfection” and “execution.” I did not try to invent a perfect backtest. Instead, I approached the problem from several angles and incorporated those solutions into my overall BullSheet scoring logic.
1. The “Russian Roulette” Factor
This is my favorite hack. I simulate massive stock crash risk without knowing who actually crashed by introducing a Random Death Factor.
My real portfolios typically hold 8 to 12 diversified positions with volatility-based stops. One or two forced crashes are enough to materially distort returns and stress my risk management, and it is entirely possible that no real crash happens during my holding period.
So I make sure one happens. Each year, I randomly select up to 2 holdings and force a −30% drawdown in a random month or quarter. It does not matter whether those specific companies actually crashed in real life; I am mathematically injecting the probability of catastrophic failure back into my backtesting model.
Because I know it is almost impossible to pick only winners in such portfolios.
Yes, I have hard stop losses in my backtests, but they only protect me while the market is open.
This is not theoretical; I have seen it happen in real life:
Just last Friday, one of my gold mining stocks gapped down hard below my stop loss while the market was closed. That company passed all my scores, filters, backtests, and forward tests, and was top 4 in my BullSheet Score. My overall portfolio is still profitable, but that single event caused significant harm and reminded me once again: finding great companies or running a solid strategy does not immunize you from sudden drawdowns.
Risk management in BullSheet operates at both the individual and portfolio level, but the majority of it comes from the portfolio. So if one position hits the Russian roulette in any given month, it can significantly alter results.
For what it is worth, I do not rely on a single simulation. I run the same strategy 10 times and take the average outcome, which smooths the results.
And if a strategy survives this artificial “death,” I mark its backtesting score as robust.
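The “Random Death Factor” can be sketched in a few lines. The `monthly_returns` layout ({ticker: list of 12 monthly returns}) and the in-place overwrite are my own illustrative assumptions, not BullSheet’s actual code:

```python
import random

CRASH = -0.30  # forced drawdown for a "death" event

def inject_random_deaths(monthly_returns, rng=random):
    """Force a -30% month on up to 2 randomly chosen holdings.

    `monthly_returns` is a hypothetical {ticker: [12 monthly returns]}
    map for one simulated year; this is a sketch of the idea only.
    """
    tickers = list(monthly_returns)
    n_victims = rng.randint(1, min(2, len(tickers)))
    for ticker in rng.sample(tickers, k=n_victims):
        month = rng.randrange(len(monthly_returns[ticker]))
        monthly_returns[ticker][month] = CRASH  # overwrite with the crash
    return monthly_returns
```

Running the whole backtest through this injector several times (and averaging, as described below) is what makes a surviving strategy earn the “robust” label.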
2. The “Too Big to Fail” Filter
Survivorship bias is brutal in small-cap stocks. Small companies fail constantly.
In mega-cap stocks, survivorship bias is negligible. BullSheet focuses on companies with a market cap of $2B+ (but strategies run on $3B+, with a $1B buffer to account for companies moving in or out of range).
Maybe I should also add a market-cap-based stop loss, something like “sell if the stock falls below $2B,” but I haven’t had a chance to analyze its consequences in my backtests. Also, companies that rank high in the BullSheet Score generally sit around $10–15B in market cap.
But over the last 10 years, based on my research, the number of companies in this universe that went to zero is statistically low. Yes, it happens, but rarely enough that it doesn’t dominate results.
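One way to read the $3B+/$2B+ split with its $1B buffer is as hysteresis: new positions must clear the higher bar, while existing holdings are only dropped below the lower one, so a company oscillating around $3B isn’t churned in and out. That interpretation, and the helper below, are my own sketch:

```python
# Illustrative "Too Big to Fail" filter with a $1B buffer (hysteresis).
ENTRY_FLOOR = 3_000_000_000  # must be above this to enter the universe
EXIT_FLOOR = 2_000_000_000   # held positions are dropped only below this

def allowed_in_universe(market_cap, already_held):
    """Entry uses the higher floor; exit uses the lower one."""
    if already_held:
        return market_cap >= EXIT_FLOOR
    return market_cap >= ENTRY_FLOOR
```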
3. Entry and Exit Prices
My entry prices are approximated using Typical Price:
Price = (High + Low + Close) / 3
Exit rules:
In my backtests I use a trailing hard stop loss of 15%.
Example: If I’m up 20% at the peak and the stock then drops 15 percentage points, I exit with a 5% profit.
Example: If I’m up 10% at the peak and the stock then drops 15 percentage points, I exit with a 5% loss.
For positions up 45% or more, I tighten trailing stops to 3% to lock in profits.
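A minimal sketch of this trailing-stop logic, assuming the drop is measured in percentage points from the peak gain (which is what the profit/loss examples imply):

```python
def trailing_stop_exit(peak_gain_pct, current_gain_pct):
    """Return True if the trailing stop fires.

    Gains are in percentage points relative to entry. The stop width
    tightens from 15 to 3 points once the position has been up 45%+.
    """
    width = 3 if peak_gain_pct >= 45 else 15
    return (peak_gain_pct - current_gain_pct) >= width
```

For example, a position that peaked at +20% and falls back to +5% exits with a 5% profit, matching the example above.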
In real strategies, I follow volatility based stop losses.
Here is an AI-simplified version of my real stop-loss code:
def _add_stop_loss_value_column(self, df):
    # 2 standard deviations is typically used for a volatility-based
    # stop loss, covering ~95% of normal price noise.
    std_dev_multiplier = 2.0
    # Convert percentage to decimal (e.g., 5.0 -> 0.05)
    volatility_decimal = df["Volatility 1 Month"] / 100
    # Formula: Current Price - (Current Price * Volatility * Multiplier)
    stop_loss_distance = (
        df["Current Price"] * volatility_decimal * std_dev_multiplier
    )
    df["Stop Loss Price"] = (df["Current Price"] - stop_loss_distance).round(2)
    return df
4. The “Ceiling” Test
I always assume my backtests are optimistic, so I treat results as a theoretical maximum and then apply a mental deduction drawn from a beta distribution.
A beta distribution lets me model a penalty that is usually moderate (e.g., 6–8%) but has a “fat tail” where the penalty is occasionally severe (e.g., 14%).
>>> import random
>>> [int(5 + random.betavariate(2, 6) * 10) for _ in range(10)]
[10, 7, 9, 7, 8, 6, 7, 7, 9, 6]
>>> [int(5 + random.betavariate(2, 6) * 10) for _ in range(10)]
[7, 8, 6, 7, 7, 9, 7, 5, 7, 7]
>>> [int(5 + random.betavariate(2, 6) * 10) for _ in range(10)]
[5, 8, 9, 5, 6, 6, 6, 6, 9, 7]
>>> [int(5 + random.betavariate(2, 6) * 10) for _ in range(10)]
[8, 10, 6, 7, 5, 7, 10, 8, 8, 9]
>>> [int(5 + random.betavariate(2, 6) * 10) for _ in range(10)]
[6, 7, 8, 7, 5, 8, 5, 9, 6, 6]
Let’s say my value is 9 percent.
If a survivor-biased backtest shows 30 percent returns, reality might be 21 percent.
If a survivor-biased backtest shows 5 percent returns, reality is probably negative 4 percent.
If a strategy barely beats the market while cheating, it is probably a bad strategy.
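Wrapping the REPL experiment into a function makes the deduction explicit. This is a sketch of the idea; the exact parameters (2, 6, and the 5-point base) come straight from the snippet above:

```python
import random

def ceiling_adjusted_return(backtest_return_pct, rng=random):
    """Subtract a beta-distributed penalty from a backtested return.

    betavariate(2, 6) is skewed low, so the deduction is usually around
    6-8 points but occasionally approaches 14-15.
    """
    penalty = 5 + rng.betavariate(2, 6) * 10  # penalty in (5, 15) percent
    return backtest_return_pct - penalty
```

So a 30 percent backtest result always lands somewhere between 15 and 25 percent after the deduction, matching the worked example above.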
5. Different Timeframes and Several Runs
As I mentioned above, several dynamic values take part in my backtests, which is why I run:
10-year backtests with quarterly rotation
5-year backtests with monthly rotation
Each scenario runs 10 times
Results are averaged
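The run-and-average step is straightforward; the sketch below stands in for a full backtest with a noisy, hypothetical `run_backtest` so the averaging logic is visible:

```python
import random
import statistics

def run_backtest(rng):
    """Stand-in for one randomized backtest run (e.g., with Russian
    Roulette crashes injected); returns an annual return in percent."""
    return 10 + rng.gauss(0, 3)  # hypothetical noisy result

def averaged_backtest(n_runs=10, seed=42):
    """Run the same randomized strategy n_runs times and average."""
    rng = random.Random(seed)
    results = [run_backtest(rng) for _ in range(n_runs)]
    return statistics.mean(results)
```

Averaging over 10 runs damps the variance each random crash injection introduces, which is the “smoothing” described above.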
I currently do not fully model regime changes. That is work in progress.
For now, I tag months as Bull or Bear using S&P 500 and VIX context.
It is imperfect, but better than ignoring regimes completely.
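A crude version of that tagging might look like the following. The thresholds (a 200-day S&P 500 average and a VIX level of 25) are illustrative assumptions on my part, not BullSheet’s actual values:

```python
VIX_FEAR_LEVEL = 25.0  # assumed "elevated fear" threshold

def tag_regime(spx_close, spx_200d_avg, vix_close):
    """Tag a month as Bull or Bear from index trend and volatility.

    Bear when the S&P 500 sits below its long-run average or the VIX
    is elevated; Bull otherwise. Deliberately simplistic.
    """
    if vix_close >= VIX_FEAR_LEVEL or spx_close < spx_200d_avg:
        return "Bear"
    return "Bull"
```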
The Bottom Line
As a retail investor, I had to build my own point-in-time solution. I generate what I call the BullSheet Score, which aggregates signals from 14 separate engines, some of which I’ve covered in my other posts.
Backtesting contributes only about 5% of the total score.
Any strategy that looks promising is forward-tested using paper trading.
My recommendation:
Do not fully trust your backtests: assume your results are inflated to some degree. Strategies that rely purely on technical analysis are easier to backtest, but most of my strategies are fundamentally driven, which makes realistic backtesting much harder.
Simulate against your own favor: randomly assign losses, deduct from profits, multiply drawdowns. If a strategy survives after you deliberately hammer it with artificial losses, then you might actually have something that works in the real world.
What’s Next?
I have covered Valuation (Margin of Safety) and Quality (Revenue Score).
In the next post, I’ll maybe explain another engine or different part of BullSheet. I prefer keeping it casual.
See you soon.

