Behavioral Credit Assignment

This program was developed as a response to a challenge presented by Dr. Michael Mauk. In discussions with RC, he stated that the “fixer” mechanism, although apparently viable, did not evidently solve the behavioral credit assignment problem. The program was developed to see if it could.

Program Description

This program is designed to simulate the neural decision making of an organism learning to negotiate a T maze with four choice points. The neural network below illustrates two neurons at each choice point; the firing of one or the other generates a turn to the left or to the right at that choice point. CP 0 corresponds to the initial choice point in the maze. The brain matrix shows that, as per the illustration below, each choice point neuron makes excitatory connections with both of the neurons making the subsequent left or right choice. Note that this implies that ONLY the previous choice point neurons provide stimulus input to the next layer of neurons, so this models ONLY the effect of prior choices on subsequent behavior. The model assumes no input from external cues that might be (in fact, always are) in the environment and that could guide or influence choices. This is consistent with efforts to conduct precisely controlled animal experiments designed to determine how animals succeed in learning their way through such mazes.

The other thing to note about the brain matrix is the ones in the last two columns of the first two rows. This DOES represent a stimulus that responds to some information available to the organism at the start of the maze. We believe it is absurd to imagine that after re-positioning the animal at the start of the maze repeatedly, it would somehow be unable to recognize that it was at the start of the maze. Without some such indicative information, there is no way an organism could ever learn to make the proper first choice and hence no way it could ever succeed at a rate greater than 50 percent of the time.

As in other programs, at first behaviors are generated randomly, generating temporary increases in synaptic efficacy which decay with time unless reinforced. Notice however, in this model an organism must emit four behaviors before ANY reinforcement is possible and if run randomly, there is only a 1 in 16 chance of obtaining reinforcement. These constraints dictate the parameters of the program: can a set be found that allows a short term increase in synaptic efficacy, generated by the behavior at the first choice point, to persevere long enough to be “fixed” by reinforcement obtained only after the fourth choice is made?
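The 1 in 16 figure follows from four independent two-way choices, each made at random. A minimal Ruby sketch of that arithmetic is given here; the method names are illustrative and are not taken from the program itself.

```ruby
# Every left/right sequence through a maze with the given number of choice points.
def all_paths(choice_points)
  %i[left right].repeated_permutation(choice_points).to_a
end

# Chance that unbiased random turns happen to produce the single rewarded path.
def chance_of_reward(choice_points)
  Rational(1, all_paths(choice_points).size)
end

puts all_paths(4).size      # number of distinct paths through the maze
puts chance_of_reward(4)    # probability of reaching reinforcement at random
```

Only one of the enumerated paths ends in reinforcement, which is why any short-term increase produced at the first choice point must survive three further choices before it can be "fixed".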

Unique Details of This Program

Unlike any other program described here, the synaptic efficacy values that are initially set at 60 in the brain matrix do not stay fixed: during the operation of the program they are set to random values between 40 and 95. This is clearly at odds with the usual rule that a synaptic value remains as set unless modified according to prescribed rules.

In other programs, when we desire to fire a particular neuron (e.g., to have the organism emit an operant), we do so by putting a value of 1 in the stimulus matrix at the column that maps the connection of a particular stimulus to that neuron. By making the synaptic value of that connection greater than threshold, we can assure the firing of the neuron. This works well in simple networks but fails in this simulation because if we fire a neuron with a “hardwired” connection (i.e., via a “1” in the first 8 positions of the stimulus matrix) it will always fire, preempting all other inputs. This obliterates the effect of the growing synaptic efficacy of the connections between prior and subsequent choice point neurons. Thus, for example, the firing of a CP 0 neuron could never come to fire a CP 1 neuron on its own, and the simulation could not possibly work.

It should be noted that the hardwired connections described above are the ones indicated by a value of 60 in the brain matrix. They represent unmodifiable synapses and hence the learn method does not address them at all. However, by randomly varying the value of these “hardwired” synapses we were able to model much more realistically the fact that in a real organism multiple factors contribute to the organism’s choices. To name a few: fatigue level, hunger, interest in the new vs. boredom with the familiar, and fluctuations in smells, sounds, and temperature.

The program code “brain[2*cp+1,2*cp+1]=rand(40..95); stimulus[2*cp+1,0]=1” generates the product brain[2*cp+1,2*cp+1] * stimulus[2*cp+1,0], which, if greater than threshold, will generate the firing of that choice point neuron. However, if that stimulus input is not itself sufficient, it can still generate a firing when combined with sufficient input from the firing of the previous choice point neuron. As the synaptic connections between choice points (values set as 1 in the original brain matrix) increase in value as a result of reinforcement learning, they will eventually produce the sequence of behaviors that leads to reinforcement.
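The firing rule described here can be sketched in Ruby as a simple sum of inputs compared against a threshold. This is only an illustration: the THRESHOLD value of 80, the method names, and the example efficacies of 1 and 30 are assumptions made for the sketch and are not taken from the program. Only the range 40..95 comes from the text.

```ruby
# Hypothetical firing threshold; the real program's value is not given in the text.
THRESHOLD = 80

# Total input to a choice point neuron: the randomly varying "hardwired" input
# (its stimulus value is 1) plus the learned synapse from the previous choice
# point neuron, which contributes only if that neuron fired.
def total_input(hardwired_efficacy, learned_efficacy, previous_fired)
  hardwired_efficacy * 1 + learned_efficacy * (previous_fired ? 1 : 0)
end

# The neuron fires when the summed input exceeds the threshold.
def fires?(hardwired_efficacy, learned_efficacy, previous_fired)
  total_input(hardwired_efficacy, learned_efficacy, previous_fired) > THRESHOLD
end

hardwired = rand(40..95)           # same range as in the text
puts fires?(hardwired, 1, true)    # outcome varies from run to run
puts fires?(60, 1, true)           # weak hardwired input, unlearned synapse
puts fires?(60, 30, true)          # same input after the learned synapse has grown
```

Under these assumed numbers, a hardwired input of 60 cannot fire the neuron by itself, but the same input succeeds once the synapse from the previous choice point has grown enough to make up the difference.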

Note that only those synapses with an initial value of 1 in columns 8 through 14 are modifiable.
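One layout consistent with the description above can be sketched in Ruby as follows. The layout is an assumption, not a copy of the program's matrix: 8 rows (one per neuron), columns 0 through 7 for the hardwired inputs (60 on the diagonal), columns 8 through 13 for the outputs of the six neurons at CP 0 through CP 2, and column 14 for the start-of-maze stimulus. Nested arrays and the names build_brain and modifiable_synapses are also illustrative choices.

```ruby
NEURONS = 8             # two neurons (left, right) at each of four choice points
COLUMNS = 15            # assumed: 8 hardwired inputs, 6 neuron outputs, 1 start cue
START_CUE_COLUMN = 14   # assumed position of the start-of-maze stimulus

def build_brain
  brain = Array.new(NEURONS) { Array.new(COLUMNS, 0) }
  NEURONS.times { |n| brain[n][n] = 60 }    # unmodifiable "hardwired" synapses
  (0...NEURONS - 2).each do |source|        # neurons at CP 0, 1 and 2
    next_cp = source / 2 + 1
    # Each neuron excites both neurons at the following choice point.
    [2 * next_cp, 2 * next_cp + 1].each { |target| brain[target][8 + source] = 1 }
  end
  brain[0][START_CUE_COLUMN] = 1            # start cue to both CP 0 neurons
  brain[1][START_CUE_COLUMN] = 1
  brain
end

# Positions [row, column] of synapses the learning rule is allowed to change.
def modifiable_synapses(brain)
  positions = []
  brain.each_with_index do |row, r|
    (8..14).each { |c| positions << [r, c] if row[c] == 1 }
  end
  positions
end

brain = build_brain
brain.each { |row| puts row.map { |v| v.to_s.rjust(2) }.join(' ') }
puts "modifiable synapses: #{modifiable_synapses(brain).size}"
```

Under this assumed layout the values of 60 never appear in the modifiable set, which matches the statement that the learn method does not address the hardwired synapses.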

The Neural Network

The Matrix Representation

The Code

https://github.com/Ondaweb/CREDIT

Credit Assignment Addendum

The significance of the credit assignment simulation findings depends on the validity of the procedures used to model the problem.  Therefore additional details are provided here to allow others to make their own judgements.

There are some variances from a strict interpretation of the model. No effort was made to implement a formal eligibility period. Since a prior choice had to be made before a subsequent one could be, the firing of an earlier choice point neuron was held to occur “just before” the firing of the later one. The decay between choices employed here is clearly arbitrary; however, as long as the time taken for a temporary increase to decay exceeds the intertrial interval, the same results would be obtained, albeit possibly requiring significantly more trials. All that is required is that some residual temporary increase in efficacy produced by the decision at the first choice point is not fully decayed when reinforcement is ultimately obtained.
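The requirement that some residual increase survives until reinforcement can be illustrated with a small Ruby sketch. Everything numeric here is hypothetical: the temporary boost of 20, the decay factor of 0.8 per choice, and the rule that "fixing" simply adds the surviving residual to the permanent value are assumptions made for illustration, not the program's actual parameters or learning rule.

```ruby
# Hypothetical parameters, chosen only for illustration.
TEMPORARY_BOOST = 20.0   # short-term increase when a synapse is used
DECAY_FACTOR = 0.8       # fraction of the temporary increase surviving each step

# Temporary increase remaining after a number of decay steps.
def residual(boost, steps)
  boost * DECAY_FACTOR**steps
end

# Assumed fixing rule: reinforcement adds whatever residual is left
# to the permanent efficacy of the synapse.
def fix(base, boost, steps)
  base + residual(boost, steps)
end

# Reinforcement arrives only after the fourth choice, so the increase made
# at CP 0 has decayed for three steps and the one made at CP 3 for none.
(0..3).each do |cp|
  steps_until_reward = 3 - cp
  left = residual(TEMPORARY_BOOST, steps_until_reward).round(2)
  puts "CP #{cp}: residual at reinforcement = #{left}"
end
```

With any decay factor strictly between 0 and 1 the residual from the first choice point is smaller than the others but never exactly zero, so a slower or faster decay changes how many trials are needed rather than whether credit reaches the first choice.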

The other unique feature of this simulation was in the modeling of the initial firing signal to the choice point neurons. Rather than a fixed, binary fire/don’t fire input sent from somewhere “upstream” (as in a reflex, for example), a time-varying probabilistic input was modeled. This is implemented by the values of 60 along the diagonal in the matrix of Figure 3 and the assignment of random values in the input stimulus matrix. This produces a random value that is sometimes sufficient by itself to cause the firing of a left or right turn choice but sometimes is not. However, if some increase (whether temporary or “fixed”) in the synaptic efficacy of the other inputs to a choice point has occurred, then the summed products will allow a neuron to fire anyway. Although this might appear to violate the basic precepts of the model, it does not, and it was done for two reasons. The first is that it became clear that a single input to the turn left and turn right neurons was absurd, especially at the first choice point. A moment’s reflection leads to the understanding that the animal, say a lab rat, would clearly visually recognize the entrance to the maze, and its motivation and decision making at that point would also be influenced by many additional factors such as how hungry it is, its overall energy level, its response to being placed in front of the maze for the nth time in a row, etc. In real life all these factors would be stimuli being sent to a choice point decision neuron and therefore a part of the decision process. This could be faithfully modeled in the simulation by modeling a multiplicity of neurons connected to the two first choice point options, but for the purposes of exploring the assignment of credit issue, doing so is unnecessary.
Instead the connection between external stimuli and the first choice points is represented by a single stimulus connected to Choice Point 0 and Choice Point 1 by the values of 1 in positions B(0,14) and B(1,14) in the brain matrix for this simulation. Due to multiple previous encounters, the animal may also have some residual predisposition to one choice or another. By placing a random value at the appropriate synaptic position, the varying nature of such inputs can be simulated without in any way biasing the results.