This is something I created back in 4.5 and have been updating/maintaining ever since. I never really shared how it works with anyone, but I'm actually really quite proud of it, and I'm sure there are some people out there who would be interested in or have thought about something like this, so here it is. GitHub repo first, then implementation details below:
Github Link
JK backstory first:
I was originally working on a blm simulator to try and figure out optimal t3 usage. The original plan was to create a set of rules for the simulator to follow and then figure out which rules worked best. That changed when I realized how annoying that would be to do manually. Luckily, my friend was working on his thesis at the time, which gave me the idea to look into reinforcement learning, specifically Q-learning. Through some struggling, guesswork, and lots of optimism, the AI worked and did some really crazy things that can be summed up as a 2% damage increase.
Simulator implementation:
The job state (of blm, and of any job really) is mostly just a bunch of timers: GCD, CDs, buffs, ticks... actually, that's about it. So I decided to use a timeline implemented with a priority queue. Whenever anything happens, events are generated and placed on the timeline, and when the current timestep is done, we just jump to the next event and update all the timers. This is a lot faster than incrementing the simulator in 0.01s steps and checking if anything changed. For example, if you use Triplecast, two events are generated: the buff running out in 15 seconds and the CD ending in 60 seconds. When we get to the buff-running-out event, the buff will probably have already been used up, but that's fine - we just update the timers, see that nothing changed, and move on to the next event.
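Roughly, the timeline looks like this (a minimal Python sketch of the idea, not the actual code from the repo - the event names are made up):

```python
import heapq

class Timeline:
    # Event timeline backed by a priority queue (min-heap keyed on event time).
    def __init__(self):
        self._events = []        # heap of (time, event_name) tuples
        self.current_time = 0.0

    def schedule(self, delay, event_name):
        heapq.heappush(self._events, (self.current_time + delay, event_name))

    def advance(self):
        # Jump straight to the next event instead of stepping in 0.01s increments;
        # the caller then updates all timers/buffs to the new current_time.
        time, event = heapq.heappop(self._events)
        self.current_time = time
        return event

# e.g. using Triplecast schedules two events:
t = Timeline()
t.schedule(15, "triplecast_buff_expires")
t.schedule(60, "triplecast_cd_ready")
print(t.advance())  # -> "triplecast_buff_expires", with current_time now 15.0
```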
For using actions, a list of usable GCDs and oGCDs is generated whenever you jump to an event where the job state is not "action-locked", i.e. not in the middle of casting, not animation-locked by oGCDs, and there are actually abilities available. At this point, the simulator lets you choose an action or skip (which jumps to the next event). For these points where you can actually choose an action, the simulator saves the state and the corresponding action, which is what the AI trains on.
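The decision-point loop then looks something like this (again just a sketch - the simulator/policy method names here are hypothetical, not the repo's actual API):

```python
def run_episode(sim, choose_action, steps=2500):
    # Roll the simulator forward, letting a policy pick from the usable actions
    # at each decision point and recording (state, action) pairs to train on.
    transitions = []
    for _ in range(steps):
        if sim.is_action_locked():        # mid-cast, animation locked, or nothing usable
            sim.jump_to_next_event()      # advance the timeline and update timers
            continue
        state = sim.snapshot()            # job state: GCD, CD, buff, tick timers, ...
        usable = sim.usable_actions()     # usable GCDs + oGCDs, plus "skip"
        action = choose_action(state, usable)
        sim.apply(action)                 # "skip" just jumps to the next event
        transitions.append((state, action))
    return transitions
```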
AI implementation:
For the AI, I used a reinforcement learning algorithm called Q-learning, which essentially learns the value of actions given the current state - Q(s, a). The idea behind Q-learning is that the value of a state-action pair is based on 1) the reward for doing said action in said state and 2) the optimal value of the next state (i.e. assuming you then choose the best action). With this idea, we can iteratively update state-action values until things converge and stop changing.
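For reference, the textbook Q-learning update looks like this (α is a learning rate, γ a discount factor):
new Q(s, a) = old Q(s, a) + α * (reward + γ * max over a' of Q(s', a') - old Q(s, a))
The dps formula below plays essentially the same role as the "reward + value of the next state" target, just expressed as dps over a fixed 10-minute horizon instead of a discounted sum.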
In the actual application, I try to define the value of a state-action pair to be the dps over the next 10 minutes, and the reward is the damage done between this state and the next. I update the dps estimate with (essentially) the following formula:
dps = (damage reward + next state dps * (600 seconds - time delta in seconds)) / 600 seconds
For a (simplified) example, let's say I use a GCD that does 100 damage and the next available state is in 2.5s, and let's say the AI gives us 38 for the optimal dps of the next state. Since our GCD took 2.5s, we have 597.5s of damage left to estimate. Using the dps of the next state, we estimate we do 597.5 * 38 damage in that time, so over the next 10 minutes we do approx. 100 + 597.5 * 38 = 22805 damage. The est. dps over 10 minutes is then 22805 / 600 ≈ 38.0083.
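In code, the update target is just this (a tiny Python sketch of the formula above, reproducing the example numbers):

```python
HORIZON = 600.0  # the value is defined as dps over the next 10 minutes

def dps_target(damage_reward, next_state_dps, dt):
    # Reward from this step, plus the next state's estimated dps applied
    # to the remaining horizon, averaged back into a 10-minute dps.
    return (damage_reward + next_state_dps * (HORIZON - dt)) / HORIZON

print(dps_target(100, 38, 2.5))  # -> 38.00833..., the 38.0083 from the example
```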
In addition to Q-learning, I had to use a neural network to approximate the value (dps) of state-action pairs because the state space is simply too large. The network I used had 4 layers of sizes 57 x 128 x 128 x 20, where the input is the job state and the output is the est. dps of each action (20 actions in total). I also maintained a memory of 1 million state-action pairs to batch-train the neural network on. This memory is updated in episodes of 2500 simulation steps (usually about 30-40 minutes of simulated rotation time). Each epoch, the memory is updated with 10000 new states (4 episodes) and the neural network is batch-trained 50 times on 10000 random states from the memory. Usually, I train the AI for a few days, which is something like 400k epochs (I just leave my computer running ._.)
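For concreteness, here's roughly what that setup could look like in PyTorch (the post doesn't pin down the framework, loss, or optimizer, so those parts are assumptions; new_transitions would come from simulator episodes like the run_episode sketch above, extended to also carry the dps target):

```python
import random
from collections import deque

import torch
import torch.nn as nn

# 57 -> 128 -> 128 -> 20: input is the job state, output is one est. dps per action.
q_net = nn.Sequential(
    nn.Linear(57, 128), nn.ReLU(),
    nn.Linear(128, 128), nn.ReLU(),
    nn.Linear(128, 20),
)
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-4)  # assumed optimizer/lr
memory = deque(maxlen=1_000_000)  # replay memory of (state, action, target_dps)

def train_epoch(memory, new_transitions):
    # One epoch as described above: add ~10000 new entries (4 episodes) to the
    # memory, then do 50 batch updates on 10000 randomly sampled entries.
    memory.extend(new_transitions)
    for _ in range(50):
        batch = random.sample(list(memory), min(len(memory), 10_000))
        states = torch.tensor([s for s, a, t in batch], dtype=torch.float32)
        actions = torch.tensor([a for s, a, t in batch], dtype=torch.int64)
        targets = torch.tensor([t for s, a, t in batch], dtype=torch.float32)
        pred = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
        loss = nn.functional.mse_loss(pred, targets)  # assumed loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```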
tl;dr: the AI uses reinforcement learning and a neural network to estimate the 10-minute dps resulting from each action in the current state, and just chooses the best action every time.
There are definitely some (a lot of) details I'm leaving out since I'm not really sure how interested people are in hearing them, but I've covered pretty much everything I wanted to. If you have any questions, please feel free to leave them below (I hope I get at least one question cough) - I'd be happy to answer them.
submitted by /u/brianx2000