Reinforcement Learning With TicTacJoe: A Simple Brain Coded Explicitly in R
<h2 id="anchor-1">Reinforcement Learning: Introduction</h2> Reinforcement Learning is a machine learning training paradigm in which an agent’s actions in an environment (typically moves in a game the agent plays) are adjusted over time. Actions that lead to a good outcome (a reward for winning) are reinforced, while those that lead to a bad outcome (a punishment for losing) may be suppressed. A group of authors from DeepMind recently published a paper formulating the Reward-is-Enough hypothesis (<a href="" target="_blank" rel="noopener noreferrer">Silver et al., 2021</a>). The authors claim that the reward maximization mechanism may well be enough to explain the phenomenon of intelligence. If so, the Reinforcement Learning framework, an embodiment of reward maximization, may be broad enough to encompass artificial general intelligence. With this in mind, let’s explore a simple and explicit example of Reinforcement Learning. <ul><li><a href="#anchor-1" target="_blank" rel="noopener noreferrer">Reinforcement Learning</a></li><li><a href="#anchor-2" target="_blank" rel="noopener noreferrer">What (or who) is TicTacJoe?</a></li><li><a href="#anchor-3" target="_blank" rel="noopener noreferrer">Why is TicTacJoe interesting?</a></li><li><a href="#anchor-5" target="_blank" rel="noopener noreferrer">TicTacJoe's state of mind</a></li><li><a href="#anchor-6" target="_blank" rel="noopener noreferrer">Possibilities with reinforcement learning</a></li></ul> <h2 id="anchor-2">What (or who) is TicTacJoe?</h2> TicTacJoe is a Reinforcement Learning agent operating in the game of Tic-Tac-Toe (<a href="" target="_blank" rel="noopener noreferrer">you can play around with it here</a>). To play against TicTacJoe, click the “Play a game” button. Before TicTacJoe picks a move, the probabilities of his possible moves in the current round are displayed on the tiles. Mind you, only nonsymmetric choices are considered. 
As you can see, when TicTacJoe makes his first move, all three choices are equally likely. <img class="wp-image-7691 size-full" src="" alt="Likelihood of TicTacJoe's moves as a noob" width="512" height="404" /> When TicTacJoe is a Noob, he has an equal chance of making each possible move. The interesting thing about TicTacJoe is that he can learn by playing the game. Click “Let TicTacJoe train” to watch him gradually get better. He starts as a “Young Padawan” and eventually becomes a “Guru” - by then, he almost always picks one of the optimal moves. <img class="wp-image-7690 size-full" src="" alt="Likelihood of TicTacJoe's moves as a guru" width="512" height="404" /> When TicTacJoe is a Guru, he almost always picks an optimal opening move. <h3 id="anchor-4">What's TicTacJoe doing?</h3> What actually happens under the hood? During training, TicTacJoe plays multiple games against himself: 10,000 to become a Guru. After each game, the moves made by the winning agent are rewarded, while the moves of the losing agent are discouraged (this is the reward mechanism in action!). A simple temperature-like mechanism lets TicTacJoe explore all possible moves and prevents him from fixating on a given strategy too soon. You can find more implementation details below. <blockquote><strong>Play with <a href="" target="_blank" rel="noopener noreferrer">TicTacJoe here</a>!</strong></blockquote> A set of three graphs shown after launching the training illustrates TicTacJoe’s learning curve: how the likelihoods of his three possible first moves evolve over the 10,000 games he plays to reach the top skill level. 
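The post-game update described above can be sketched in a few lines of R. This is a minimal illustration, not TicTacJoe's actual code: the function name `update_weights` and the data layout (a list mapping board states to move-weight vectors) are hypothetical.

```r
# Hypothetical sketch of the post-game update (not TicTacJoe's actual code).
# `weights` maps a board-state key to a numeric vector of move weights;
# `moves` is a list of (state, move) pairs recorded during the game.
update_weights <- function(weights, moves, reward, learning_rate = 0.1) {
  for (m in moves) {
    w <- weights[[m$state]]
    # Reinforce (reward > 0) or discourage (reward < 0) the chosen move,
    # flooring at a small positive value so no move is ruled out forever
    w[m$move] <- max(w[m$move] + learning_rate * reward, 1e-6)
    weights[[m$state]] <- w / sum(w)  # renormalize to probabilities
  }
  weights
}

# Example: a fresh opening state with the three nonsymmetric choices
weights <- list(start = c(corner = 1 / 3, side = 1 / 3, center = 1 / 3))
winner_moves <- list(list(state = "start", move = 1))
weights <- update_weights(weights, winner_moves, reward = 1)
```

After one rewarded game, the corner move is slightly more likely than the other two, while the vector still sums to one.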
<img class="wp-image-7741 size-full" src="" alt="3 graphs showing TicTacJoe's movement progression" width="1920" height="499" /> Likelihoods of TicTacJoe's first move evolve as the training progresses <blockquote> <h2><strong style="font-size: 16px;">Interested in Convolutional Neural Networks? Read our <a href="" target="_blank" rel="noopener noreferrer">Introduction to Convolutional Neural Networks</a>.</strong></h2> </blockquote> <h2 id="anchor-3">Why is TicTacJoe interesting?</h2> TicTacJoe is a simple creature. He’s interesting because the inner workings of his brain are coded explicitly in R, with no extra packages used. This makes TicTacJoe easy to inspect. The <a href="" target="_blank" rel="noopener noreferrer">code is available here</a>. Read on to learn how it all works. <h2 id="anchor-5">TicTacJoe’s state of mind</h2> In this section, we dissect TicTacJoe's brain to see how it functions. We can show everything TicTacJoe knows about playing Tic-Tac-Toe in a graph. The graph represents the likelihood of TicTacJoe picking a given move when faced with a certain board configuration. A naive tree holding every possible sequence of moves would have 9! = 362,880 branches, with some pruning possible, since no further moves can be made after a game is won. To fit this information in memory, we reduce the graph using the board’s symmetries. That’s why only 3 nonsymmetric options are available for the first move: corner, side, and center. After the reduction, there are only 765 nodes in our graph! <img class="wp-image-7739 size-large" src="" alt="Representation of TicTacJoe's Mind" width="1024" height="386" /> Above is the graph of TicTacJoe’s mind. Each row contains the nonsymmetric board configurations of a given round, and edges link each configuration to those reachable in the next move. 
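The symmetry reduction can be illustrated with a short R sketch (again with hypothetical names, not the actual implementation): represent the board as a 3x3 matrix and pick, among its eight rotations and reflections, a canonical representative, so that symmetric boards share a single node in the graph.

```r
# Hypothetical illustration of symmetry reduction (not the actual implementation).
# A board is a 3x3 matrix of 0 (empty), 1 (X), 2 (O).
all_symmetries <- function(b) {
  rot <- function(m) t(m)[, nrow(m):1]  # rotate 90 degrees clockwise
  rotations <- list(b, rot(b), rot(rot(b)), rot(rot(rot(b))))
  # The 4 rotations plus their mirror images give all 8 symmetries
  c(rotations, lapply(rotations, function(m) m[, ncol(m):1]))
}

# Canonical form: the lexicographically smallest flattened symmetry
canonical <- function(b) {
  keys <- sapply(all_symmetries(b), function(m) paste(as.vector(m), collapse = ""))
  min(keys)
}

# Two symmetric first moves (X in opposite corners) map to the same node
b1 <- matrix(0, 3, 3); b1[1, 1] <- 1
b2 <- matrix(0, 3, 3); b2[3, 3] <- 1
```

With a reduction like this, every board reached during training is first mapped to its canonical form before its node in the graph is looked up, which is what shrinks the state space to 765 nodes.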
From the start at the top, there are only three possible moves (corner, side, and center); further on, the number of possibilities first increases and then decreases. At initialization, the likelihood is distributed equally among the edges leaving the same board state. As the training progresses, moves that lead to losses are discouraged and become less likely, while moves that benefit the agent are encouraged and become more likely. There’s another trick that stabilizes training: a temperature-like mechanism, in which the updated probabilities are passed through a softmax function governed by a temperature parameter. This parameter gradually decreases from a high value to a low one. The high value encourages TicTacJoe to explore new moves, while the low value forces him to exploit his experience.  <h2 id="anchor-6">Possibilities with reinforcement learning</h2> The approach presented above works in simple games like Tic-Tac-Toe, where (with some extra tricks, like symmetry reduction) all possible states of the board and the links between them can be stored in memory. However, it doesn’t scale to larger environments. There, the agent needs to perceive the state of the environment and make a decision based on that perception (possibly enriched with the history of previous perceptions). Curious to see reinforcement learning in action? Feel free to explore the <a href="" target="_blank" rel="noopener noreferrer">application presenting TicTacJoe</a> or dive into the <a href="" target="_blank" rel="noopener noreferrer">code repository</a>! 
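For the curious, the temperature mechanism described above can be sketched as a softmax over log-weights (a minimal illustration assuming positive move weights; the function name is hypothetical). A high temperature flattens the distribution toward uniform, a low temperature concentrates it on the strongest move, and at temperature 1 the weights are recovered unchanged.

```r
# Softmax with temperature: a minimal sketch of the exploration mechanism.
# `w` must be a vector of positive move weights.
softmax_temp <- function(w, temperature) {
  z <- log(w) / temperature  # at temperature = 1 this recovers w exactly
  e <- exp(z - max(z))       # subtract the max to avoid numeric overflow
  e / sum(e)
}

w <- c(corner = 0.6, side = 0.1, center = 0.3)
hot  <- softmax_temp(w, temperature = 10)   # near-uniform: explore
cold <- softmax_temp(w, temperature = 0.1)  # near-greedy: exploit
```

Annealing the temperature from high to low over the 10,000 training games moves the agent smoothly from exploration to exploitation.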
