International Simulation Football League
*PSA - Printable Version



*PSA - slate - 05-01-2023

Code:
2,572 words, ready for grading

Today, a bunch of hub was bubbed about a Sim Testing Guide being leaked. While a small part of the drama was about whether or not the guide was somehow stolen by other teams or whether appropriate “mad props” were given to London for creating it, another large part of the conversation was people worried about “DDSPF16-style” mass sim testing returning to the league. I am writing this article to make the case that, even if mass testing were to return in force, fundamental differences between the 2016 and 2021 iterations of the sim engine mean that its impact on the league this time around shouldn’t be feared.

Some newbies who haven’t yet learned the sim league ways may not fully understand the reason this caused anguish among some older league members. My own interpretation/perspective is that by the end of the DDSPF16 era in Season 26, sim testing had been refined to the point where many believed it created three main issues for the league:

1. Sim testing was very important for team success, since you (as a GM or war room member) put your team at a significant disadvantage by not testing.¹

2. Sim testing was very time-consuming and the most common methods basically required a PC to be devoted to sim testing with no breaks for hours at a time.

3. Sim testing sucked all of the creativity and innovation out of the league because teams could sim almost every possible variant of their strategy and many different depth charts to refine their strategies, player builds, etc. to near-optimal perfection.

I personally agree with this perspective! The grind of sim testing in DDSPF16 was a combination of extremely draining and seemingly very necessary. It led to burnout or near-burnout for a lot of the very heavily invested S25 users I have personally talked to about it, and I believe it made the league worse.

Readers who were fortunate enough to join in the DDSPF21 era shouldn’t be worried about the return of mass sim testing leading to unenjoyable burnout, though. DDSPF21 is very different from DDSPF16 in many ways, and I believe that several of these differences make it extremely difficult for mass sim testing to ever have the same impact on the league that it did in the olden days. Specifically, I think that limitations on the speed of sim testing mean that #1 is much less of a concern, and that the expanded strategy options available and changes to how strategies are input mean that #3 will never be possible.

If anyone cares about my credentials to discuss this topic, I should mention that I was a member of the sim transfer team that led the transition to DDSPF21 in Season 27, have been on the sim balance team since then, and in the dark ages I created an open source Python script that allowed you to automate testing several different strategies in one go - see this previous media piece. Despite how bad for the league I think DDSPF16 mass testing was, I’m still very proud of that work (but even prouder of the work the transfer team did to switch the league to a much better sim engine).

Sample Size Limitations

DDSPF16 had an Exhibition Game feature, where you could pit any two teams in the same league file against each other and the sim would use their current depth charts and strategies to simulate a game, just as if the two teams played in the regular season². The results of this game would be written to the normal game output files, and a post-game menu would pop up allowing you to view the box score or play-by-play, or watch the game. If you closed that post-game menu, you could just… press “Play” again and a second game would run in the exact same manner. You could do this about 500-600 times before memory issues would cause the game to crash. Then, you could export the game data to a file and, without saving the game, re-open the league file and do it all again. This cycle took about 2-3 minutes, allowing for 10K-18K sims an hour.

This is blisteringly fast relative to the method that has been circulating recently. One of the Orange County war room members who has gotten a mass testing method working for DDSPF21³ can test about 150 games per hour. There are only about 48 hours between depth chart submission deadlines⁴, so one individual could test something like 6K-7K games in that entire window, less than the old sim could do in one hour. This is obviously a massive difference, but how important is that difference in scale?
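As a sanity check on those figures, here is the back-of-the-envelope math in a few lines of Python. All rates are the rough numbers quoted above, not precise measurements:

```python
# Rough throughput comparison between the two sim engines.
# All rates are the approximate figures quoted in the article, not measurements.

def games_per_hour(games_per_cycle: float, minutes_per_cycle: float) -> float:
    """Games simulated per hour, given one repeatable sim-and-reload cycle."""
    return games_per_cycle * 60 / minutes_per_cycle

# DDSPF16: ~500-600 games per crash/reload cycle, each cycle taking ~2-3 minutes
ddspf16_low = games_per_hour(500, 3)   # 10,000 games/hour
ddspf16_high = games_per_hour(600, 2)  # 18,000 games/hour

# DDSPF21: ~150 games/hour, ~48 hours between depth chart deadlines
ddspf21_window_total = 150 * 48        # 7,200 games in the entire window

print(ddspf16_low, ddspf16_high, ddspf21_window_total)
```

The new sim's entire 48-hour window buys you fewer games than the old sim could churn out in a single hour, even at the old sim's slow end.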

To answer that, we need to talk about parallel universes statistics!

I originally wrote some long boring section because I think statistics is really important and interesting and I enjoy trying to teach it. But I’m sparing you all that to cut to the point - you can use a freely available online calculator to tell you how many sims you need to run in order to detect a meaningful difference, like this one here:

[Image: image.png]

What this image is saying is that if I had two strategies, one with a 60% win rate and one with a 55% win rate, and I wanted to detect that difference 80% of the time with a pretty low threshold for detection⁵, I would need a sample size of 1,206 games. In DDSPF16, this took five minutes, while in DDSPF21 it takes 8 hours. I will also note that this 5% difference in win rate between two strategies is quite large - when I did testing on DDSPF16 I was mostly changing one or two variables at a time, and the difference was rarely more than 2-3% between the best and worst options.
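If you'd rather not trust a website, that 1,206 figure can be reproduced with the standard normal-approximation formula for comparing two proportions. This is a sketch of what such calculators typically compute (two-sided test, unpooled variance); the exact formula a particular site uses may differ slightly:

```python
import math
from statistics import NormalDist

def two_proportion_sample_size(p1: float, p2: float,
                               alpha: float = 0.10, power: float = 0.80) -> int:
    """Per-group sample size to distinguish win rates p1 and p2 with a
    two-sided test at significance level alpha and the given power."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)       # unpooled binomial variance
    n = (z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2
    return math.ceil(n)

# 60% vs 55% win rate, 90% confidence, 80% power -> 1206 games per strategy
print(two_proportion_sample_size(0.60, 0.55))       # 1206
# At ~150 games/hour in DDSPF21, that's about 8 hours per comparison
print(two_proportion_sample_size(0.60, 0.55) / 150) # ~8.04 hours
```

Note how the required sample scales with the inverse square of the win-rate gap: halving the detectable difference to 2.5% roughly quadruples the games needed.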

So, if one person dedicated all of their time between two strategy deadlines to sim testing on DDSPF21, they might be able to test 5-6 different strategies and have a pretty good shot at detecting whether any of that handful of strategies would give them a 5% or greater boost to their win percentage. That is meaningful, but not game-breaking IMO. Given my experience with testing in DDSPF16, I don’t think teams who aren’t doing this for each individual game are at a significant disadvantage.

I have heard about more advanced methods for the new sim, involving custom league files, that can be used to simulate much larger numbers of games and almost come close to matching the old sim. But still, don’t be afraid, because they unfortunately fall prey to the second main limitation of DDSPF21.

Strategy Coverage Limitations

In the old sim, you had only a few degrees of freedom when creating a gameplan - your depth chart for each of 5 formations on offense and defense (or was it only 4 on defense?), your playbooks used on each down/distance combination, and your passing/blitz ratios on each down/distance combination. Oh, and I think Tempo and Primary Receiver existed as well.

By the late DDSPF16 era a lot of this stuff was fairly optimized without needing additional testing - the best defensive playbooks, the places to play your best players, etc. were pretty well-known. That narrowed the options down a lot and, in combination with the large sample sizes, made it possible to brute-force your way through a huge number of different options and get a very clear picture of the optimal strategy. I don’t know how many teams were really pushing this to its limits, but the Season 26 Sarasota Sailfish were able to run at least 1,000+ sims for pretty much every offensive playbook at every passing ratio on every down and distance in the 48 hours between strat submission deadlines. This is simply never⁶ going to be possible in DDSPF21.

First, there are way more options for changing strategies game to game in DDSPF21, with the Game Planning menu added alongside Play Calling. Testing all combinations of options across both screens, when they interact with one another in complex ways, would be a much more difficult task.

Second, these screens are much harder to automate than they were in DDSPF16. The major innovation of sim-batcher was that it let you leave your computer running overnight and come back to a large dataset of results across a variety of different strategies, without having to manually sit at your computer entering in the depth charts and strategies you wanted to test. This was done using AutoHotkey, tabbing between drop-down menus and using the arrow keys to select the playbook. The menu screens in DDSPF21, to the best of my knowledge, don’t respond to these types of inputs at all, meaning that automating this process would require calibrating a large number of mouse inputs and would generally be a finicky and buggy mess.

All of this is on top of DDSPF21 just taking a much longer time to load new screens, inserting delays at each of these steps. The super-method I hinted at before, as far as I understand, involves creating a custom league with many instances of each team in the matchup you want to test, and then manually changing each of those teams’ strategies to the ones you want to test. I’ve heard through the grapevine a rough estimate that this takes up to one full hour, meaning that if you want to test multiple different strategies there is a significant overhead cost for each that makes it impossible to quickly iterate between them like sim-batcher did.
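To illustrate how that overhead changes the economics, here's a hedged sketch. The one-hour setup and 150 games/hour figures are the rough estimates mentioned above; the 500 games/hour rate for the custom-league method is purely an assumption for illustration, not a measured number:

```python
def hours_to_test(strategies: int, games_each: int,
                  sim_rate: float, setup_hours_each: float) -> float:
    """Total wall-clock hours to test several strategies,
    including a fixed setup cost per strategy."""
    return strategies * (setup_hours_each + games_each / sim_rate)

# Baseline DDSPF21 method: negligible setup, ~150 games/hour
baseline = hours_to_test(5, 1206, 150, 0)       # ~40.2 hours
# Hypothetical custom-league method: faster simming, but ~1 hour setup each
# (500 games/hour is an assumed rate, not a measured one)
custom_league = hours_to_test(5, 1206, 500, 1)  # ~17.1 hours

print(baseline, custom_league)
```

Even under these generous assumptions, each additional strategy carries a fixed hour of setup, which is exactly what kills the rapid iteration that sim-batcher enabled.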

Even without being able to test all the strategies, I think there is still a lot of room outside of the mass testing setup itself to make a difference. A better understanding of the sim would allow a team to choose a better set of strategies to test and be more likely to find improvements than someone randomly choosing things to change. The dream of the sim transfer team is that at some point there could be a predictive element to setting strategies - I think the opposing team is going to do X, so I will do Y with my strategy to counter that. To the extent that anything in DDSPF21 works like that, I think it would also help reward teams for going outside the framework of mass testing, or at least create an even larger array of possibilities to test (since you don’t know exactly what the opposing team is going to submit).

The Future of Testing

Without the ability to test huge numbers of games using a variety of different strategies in advance of every individual game, I think that sim testing burnout is much less likely to be a significant problem with the new sim even if mass testing knowledge becomes widespread. Even though the issue of it being time-intensive isn't quite solved, the value obtained from spending that time is much lower and I believe/hope that people won't waste too much of their time on it when they could instead be shitposting in gen chat. However, I still believe that mass testing can serve many useful purposes in the league that probably don’t detract from people’s enjoyment! If you have gotten your hands on a guide to mass sim testing and are eager to try it out, I would recommend starting with some of these:

Testing with High Value-Add
While there are many decisions made in Game Planning that probably have very hard-to-measure effects on win rates, there are also probably things that make very large differences. I imagine that testing these things, probably on a once-per-season timeframe or so, could be a very useful way to ensure that teams aren’t running blatantly terrible strategies.

I’m hesitant to name specific things because people will assume I have sim insider knowledge about this when I really don’t, but I’m imagining stuff like figuring out depth charts when there are multiple similar options you could go with, or setting high-impact Game Planning options (maybe Passing Preference and DL/LB Role?).

Testing to Gain Intuition
In DDSPF16, partly because the decompiled code was widely available and partly because testing was so optimized, there was a lot of agreement about things like what was generally best to do for strategies, depth charts, player builds, etc. I think that there is much less agreement on this with DDSPF21 even after 15 seasons, and that mass testing is a way to understand “OK when I change this Game Planning setting or alter this player attribute, it broadly has this effect on the game”. Testing to gain insight is fundamentally different from testing to gain an edge in a specific matchup and I think feels more rewarding and much less draining to do.

Testing to Make Money / Earn TPE
Get much better accuracy on weekly predictions and gambling by using a larger sample size!
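As a rough guide to how much accuracy a bigger sample buys you for predictions, the uncertainty in an estimated win probability shrinks with the square root of the number of games simmed. This is plain binomial math; the 60% win rate and the game counts below are chosen purely for illustration:

```python
import math

def win_rate_margin(p: float, n: int, z: float = 1.645) -> float:
    """Approximate 90% confidence margin of error for a win probability p
    estimated from n simmed games (normal approximation)."""
    return z * math.sqrt(p * (1 - p) / n)

# Estimating a ~60% win probability from different sample sizes:
for n in (50, 150, 1200):
    print(f"{n:5d} games -> +/- {win_rate_margin(0.60, n):.1%}")
```

An hour of DDSPF21 testing (~150 games) pins the win probability down to within about ±6.6 percentage points at 90% confidence; getting that margin down much further takes many more hours, since the margin only halves when you quadruple the sample.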

Testing to Fight the Sim Testers
Join the sim balance team and help develop player archetypes / playbooks / strategy options to make it harder for dedicated sim testers to solve the game. Message @Pat if you are interested.

And for the really determined among you, I think this article also lays out what would be the most useful contributions to sim testing. If you want to prove me wrong and unlock the almighty power of the sim, figure out ways to do the things I said aren’t possible with DDSPF21!

And if we as a league really want to be extra safe in preventing DDSPF16-style burnout from becoming an issue, we could always limit strategy submissions to once per week. This would reduce the workload of the sim team as they wouldn't need to enter in new strategies for each team each weeknight, and it could add an element of strategy/planning as teams would use the same strategies against 3-4 different teams who each might have different strengths and weaknesses. Just an idea.

Thanks for reading and happy to answer any questions.



¹ In the comments I guarantee that someone from some team will quote this part and laugh because they won a trophy without doing any sim testing at all. My counterarguments would be (a) the sim is still extremely random, so especially at the DSFL level a team could translate raw TPE into wins without needing much sim testing, especially because (b) you could learn the optimal things to do from other teams without testing, just through discussion / general osmosis - e.g. always blitz at the maximum ratios, use the best defensive playbooks (I want to say it was 3-4 and Nickel, but my memory could be fuzzy), etc.

² I have heard rumors that @iStegosauruz has somehow determined that the exhibition game results were systematically different than regular season game results but in the absence of hard data, and a lack of understanding of how/why Wolverine Studios would even program in such a difference, my inclination is that it may not be true. But I would be interested to learn otherwise!

³ Mad props to the London Royals.

⁴ Weekends mess things up but also updates aren’t processed until Sunday morning so I count that as a wash.

⁵ A 90% confidence level means using a p-value threshold of 0.1, for those of you who remember hypothesis testing from any statistics class you’ve taken.

⁶ I would be willing to place a large amount of money on a user bet regarding this (or even a much much weaker version of it) if you can propose a reasonable end condition where I get paid.


RE: PSA - Sebster - 05-01-2023

commenting to see if parallel universes is included in the word count or not since it’s strickenthrough


RE: PSA - Twenty6 - 05-01-2023

Mad props to the London Royals!


RE: PSA - SwankyPants31 - 05-01-2023

Mad props to you for writing this article Slate


RE: PSA - wizard_literal - 05-01-2023

Bang Bang! Lion Gang!

Love this article. I'm putting my Sim Testing PC up on ebay as we speak (really though, great article (this won't stop me from sim testing though)).


RE: PSA - aeonsjenni - 05-01-2023

Thank you Mad props for giving your perspective on this. It's definitely making me reconsider the approach I want to take when it comes to sim-testing in the future. I had never considered how detrimental a league environment that demanded constant testing could be, and it's something I'm very glad to be warned about.


RE: PSA - Weaves - 05-02-2023

dont tell them about p-hacking


RE: PSA - sakrosankt - 05-02-2023

Great article!

I'm very confident there is the possibility to run larger sets of more than 150 games per hour, but the setup to get to that kind of environment mostly isn't worth the effort. A long time ago I started working on something like that and used it at times, but never finished automating the whole process.

I think you could get up to over 500 games per hour, maybe even close to 1000, but never did the exact count on that. Also, there are some limitations to that method which might never really let you automate the whole process. So it won't be worth the effort, just wanted to state that there is at least one possibility to increase games per hour. And if I get to that conclusion, probably smarter heads than me might finalize it, or come up with even smarter ideas to test in big ways.


RE: PSA - infinitempg - 05-02-2023

slate saw the memes and said "what if I actually took this seriously"

but for real, this is really great work

(05-01-2023, 10:31 PM)slate Wrote: ¹ In the comments I guarantee that someone from some team will quote this part and laugh because they won a trophy without doing any sim testing at all. My counterarguments would be (a) the sim is still extremely random

S22 Yeti tested at 20% (or 30% for half of us, for some weird reason) against OCO in the Ultimus. We won.

S24 and S25 Yeti tested over 80% against SJS in both games. We lost both games.

I used to be huge on sim testing (especially since you actually could get statistically significant results), but I think those games broke me lol


RE: PSA - jdc4654 - 05-02-2023

Me over here thinking I'm advanced by running 50 tests in an hour

Great article, Slate. Mad props