1 Introduction
Many modern artificially intelligent agents are trained with deep reinforcement learning algorithms
Silver et al. (2017); Mnih et al. (2016); Kulkarni et al. (2016). But neural networks have long been criticized for being uninterpretable black boxes that cannot be relied upon in safety-critical applications Zhang and Zhu (2018); Chakraborty et al. (2017). It is important to note, however, that human brains are uninterpretable as well. For example, we know what a face is, because our brains have evolved to detect facial features, and yet it is nearly impossible to communicate in words what a face is. This problem is especially acute for patients with severe prosopagnosia, who have to rely on other visual cues to identify their friends and family. In fact, it is also quite difficult to communicate precisely the meaning of words. Try talking to a philosopher or a translator about what otherwise ordinary words might mean, precisely, and one can be sure to spark a huge debate.
Nonetheless, it is possible to program a computer to detect faces by reducing high-dimensional images of faces into low-dimensional vector representations with semantic meaning
Schroff et al. (2015); Radford et al. (2015). It is also possible to perform sophisticated natural language processing tasks by representing the words of a high-dimensional vocabulary as low-dimensional vectors
Mikolov et al. (2013); Pennington et al. (2014). Remarkably, these embeddings are amenable to simple linear arithmetic. Take the difference between the latent codes for a face with a mustache and one without a mustache, and one gets something approximating a ‘mustache’ vector. Famously, Mikolov et al. (2013) showed ‘King’ − ‘Queen’ = ‘Man’ − ‘Woman’.

We propose that a similar strategy can be applied to even something as high-dimensional and complicated as a deep reinforcement learning agent. Our aim is to demonstrate that neural network agents can be compressed into low-dimensional vector representations with semantic meaning, which we term agent embeddings. In this paper, we propose to learn agent embeddings by collecting existing examples of neural network agents, vectorizing their weights, and then learning a generative model over the weight space in a supervised fashion.
1.1 Our Contribution
As a proof of concept, we report on a series of experiments involving agent embeddings for policy gradient networks that play CartPole, a game of pole balancing.
We present three interesting findings:

The embedding space learned by the generative model can be used to answer questions of convergent learning Li et al. (2015), i.e. how similar different neural networks that solve the same task are. To our knowledge, we are the first to investigate convergent learning in the context of reinforcement learning agents rather than image classifiers. We extend Li et al.’s work on convergent learning by proposing a new distance metric for measuring convergence between two neural networks. We observe, surprisingly, that good pole-balancing networks make different decisions despite learning similar representations, whereas bad pole-balancing networks make similar (bad) decisions while learning dissimilar representations.
It has been demonstrated that linear structure between semantic attributes exists in the latent space of a good generative model in the domain of natural language words Mikolov et al. (2013) and faces Radford et al. (2015), among other kinds of data. We show that a similar linear structure can be learned in an embedding space for reinforcement learning agents, and that it can be used to directly control the performance of the generated policy gradient network.

We demonstrate that the generative model can be used to recover missing weights in the policy gradient network via a simple rejection sampling method. More sophisticated methods of conditional generation are left to future work.
The rest of the paper is organized as follows: we survey the relevant literature (Related Work), introduce the pole-balancing task and describe how we learn agent embeddings for it (Learning Agent Embeddings for CartPole), present the above-mentioned findings (Experimental Results and Discussion), discuss the shortcomings of our approach (Limitations of Supervised Generation), speculate on potential applications (Potential Applications for AI), and finally summarize the paper (Conclusion).
2 Related Work
There are four areas of research related to our work: interpretability, generative modeling, meta-learning, and Bayesian neural networks.
2.1 Interpretability
There has been a lot of recent interest in making reinforcement learning agents and policies interpretable. This is especially important in high-stakes domains like health care and education. Verma et al. (2018) proposed to learn policies in a human-readable programming language, while Dann et al. (2018) proposed to learn certificates that provide guarantees on policy outcomes. Zha et al. (2018) demonstrated the utility of learning embeddings for action traces in path planning. Ashlock and Lee’s work is very similar to ours: they proposed a tool to compare phenotypic differences between solutions found by evolutionary algorithms as a way to explore the geometry of the problem space.
One line of work that has proven useful in increasing our understanding of deep neural network models is that of convergent learning Li et al. (2015), which measures correlations between the weights of different neural networks with the same architecture to determine the similarity of representations learned by these different networks. Convergent learning investigations have hitherto, to our knowledge, only been done on image classifiers, but we extend them to reinforcement learning agents in this paper.
2.2 Generative Modeling
Generative modeling is the technique of learning the underlying data distribution of a training set, with the objective of generating new data points similar to those from the training set. Deep neural networks have been used to build generative models for images Radford et al. (2015), audio Van Den Oord et al. (2016), video Vondrick et al. (2016), natural language sentences Bengio et al. (2003), DNA sequences Xiao et al. (2017), and even protein structures Anand and Huang (2018). Complex semantic attributes can often be reduced to simple linear vectors and linear arithmetic in the latent spaces of these generative models.
The ultimate (meta) challenge for neural-network-based generative models is not to generate images or audio, but other neural networks. We use existing networks as meta-training points and use them to train a neural network generator that can produce new pole-balancing networks, which then need no further training with data from the CartPole simulator. A key advantage of using the same learning framework for both the meta-learner and the learner is that this approach could potentially be applied recursively (cue the Singularity).
2.3 Meta-Learning
The salient aspect of meta-learning that our work is connected to is the use of neural networks to generate other neural networks. This has been done before in the context of hyperparameter optimization, where one neural network is used to tune the hyperparameters of another neural network.
Zoph and Le used a neural network as a reinforcement learning agent to select architectural choices (like the width of the convolution kernel or the operations in a recurrent cell) in the design of another neural network. This is known as neural architecture search, and several efficiency improvements to the original idea have since been proposed Pham et al. (2018); Liu et al. (2018). Smithson et al. modeled hyperparameter optimization in a neural network as a response surface that can be approximated by another neural network. Andrychowicz et al.; Ravi and Larochelle used an external LSTM to meta-learn the optimization function used to update a child network. Ha et al. (2016) proposed the concept of a HyperNet, a neural network that generates the weights of another neural network with a differentiable function. This allows changes in the weights of the generated network to be backpropagated to the HyperNet itself.
Brock et al. (2017) developed a one-shot algorithm for neural architecture search using a DenseNet Huang et al. (2017) as a HyperNet. Chang and Lipson (2018) used a neural network to generate its own weights as a way to implement artificial self-replication.
2.4 Bayesian Neural Networks
Bayesian neural networks Bishop (1997) maintain a probabilistic model over the weights of a neural network. In this framework, traditional optimization is viewed as finding the maximum likelihood estimate of the probabilistic model. Posterior inference in this case is typically intractable, but variational approximations can be used Kingma and Welling (2013); Krueger et al. (2017); Louizos and Welling (2017). Our work involves learning a generative model over the weights of a neural network using existing examples of networks, which is philosophically akin to learning an ‘empirical Bayesian’ prior over the weights in a Bayesian neural network.
3 Learning Agent Embeddings for CartPole
3.1 Supervised Generation
We propose to learn agent embeddings for neural networks using a two-step process we call Supervised Generation. First, we train a collection of neural networks of a fixed architecture to solve a particular task. Next, the weights are saved and used as training input to a generative model. This is a supervised method because we are learning the mapping from a latent distribution to the space of neural network weights by feeding input-output pairs to the model. (There are some obvious downsides to Supervised Generation as a method of learning agent embeddings. See the Limitations of Supervised Generation section for a detailed discussion.)
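The two-step process can be sketched as follows. The randomly generated "trained" weights and the diagonal-Gaussian model are stand-ins for the trained CartPoleNets and the VAE used in the paper; the layer shapes are hypothetical.

```python
import numpy as np

def flatten_weights(layers):
    """Step 1 helper: concatenate a network's weight arrays into one flat vector."""
    return np.concatenate([w.ravel() for w in layers])

rng = np.random.default_rng(0)

# Step 1 (stand-in): pretend these are the saved weights of many trained agents.
weight_vectors = np.stack([
    flatten_weights([rng.normal(size=(4, 8)), rng.normal(size=(8, 2))])
    for _ in range(100)
])  # shape (100, 48)

# Step 2 (stand-in): fit a simple diagonal-Gaussian "generative model" over the
# weight space; the paper trains a VAE, but the two-step shape is the same.
mu, sigma = weight_vectors.mean(axis=0), weight_vectors.std(axis=0)

def sample_network_weights():
    """Draw a new flat weight vector from the fitted model."""
    return mu + sigma * rng.normal(size=mu.shape)

new_weights = sample_network_weights()
print(new_weights.shape)  # (48,)
```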
In this case, we trained a variational autoencoder (CartPoleGen) on the parameter space of a small network (CartPoleNet) used to play CartPole.
3.2 CartPole
CartPole is a pole-balancing task introduced by Barto et al. (1983), with a modern implementation in the OpenAI Gym Brockman et al. (2016). It is also known as the inverted pendulum task and is a classic control problem. The agent chooses to move left or right at every time step, with the objective of preventing the pole from falling over for as long as possible. We chose this task because it is easy, substantially easier than MNIST by one measure Li et al. (2018), and hence can be solved with small neural networks.
3.3 CartPoleNet
We devised a simple policy gradient neural network we call CartPoleNet, with exactly one hidden layer (see Figure 2), using the exponential linear unit Clevert et al. (2015) as the activation function. We collected 74,000 such networks by training them in the CartPole simulator with varying amounts of time, hyperparameters and random seeds for over a week on a cloud computing platform. The weight vectors belonging to these networks were then used as the training data for the generative model.

A policy gradient neural network approximates the optimal action-value function

Q^*(s, a) = \max_\pi \mathbb{E}\left[ \sum_{t \ge 0} \gamma^t r_t \,\middle|\, s_0 = s, a_0 = a, \pi \right] \quad (1)

which is the maximum expected sum of rewards, discounted by \gamma, achievable by a policy \pi that takes action a after observing state s. CartPole assigns a fixed positive reward for every step taken, and each episode terminates whenever the pole angle exceeds a threshold, the cart position exceeds the edge of the display, or the pole has been successfully balanced for more than a fixed number of time steps.
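The discounted sum of rewards inside the expectation of Equation (1) can be computed recursively. The +1-per-step reward used here follows the standard Gym CartPole convention.

```python
def discounted_return(rewards, gamma=0.99):
    """Sum of rewards discounted by gamma, computed back-to-front:
    G_t = r_t + gamma * G_{t+1}."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# CartPole-style rewards: +1 for each surviving step (standard Gym convention).
print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # 1 + 0.5 + 0.25 = 1.75
```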
At each epoch, we sample state-action pairs with an epsilon-decreasing policy and store them with their rewards in an experience replay buffer used to train the neural network. Note that the neural network takes only the state s as input; the Q-value of each action a is represented by the corresponding activation on the last layer. Parametrizing the Q-function with a state-action pair as input is possible but more computationally expensive, because it requires |A| forward passes, where A is the action space Mnih et al. (2015).
3.4 CartPoleGen
CartPoleGen is a variational autoencoder with a diagonal Gaussian latent space. It contains skip connections (with concatenation rather than addition) and uses the exponential linear unit as the activation function, as in CartPoleNet (see Figure 3).
A variational autoencoder Kingma and Welling (2013) is a latent variable model with latent variable z and data x. We assume the prior over the latent space to be the spherical Gaussian p(z) = \mathcal{N}(0, I) and the conditional likelihood p_\theta(x \mid z) to be Gaussian, which we compute with a neural network decoder parametrized by \theta. The true posterior is intractable in this case, but we assume that it can be approximated by a Gaussian q_\phi(z \mid x) with a diagonal covariance structure that we compute with a neural network encoder parametrized by \phi.
Sampling from the posterior involves reparametrizing z \sim q_\phi(z \mid x) to z = \mu + \sigma \odot \epsilon, where \epsilon \sim \mathcal{N}(0, I), to allow the gradients to backpropagate through to \mu and \sigma.
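The reparametrization can be sketched numerically (without the autodiff machinery that makes it useful for training):

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterize(mu, log_var):
    """Sample z = mu + sigma * eps with eps ~ N(0, I). In a real VAE this form
    lets gradients flow through mu and sigma; here we only illustrate the sample."""
    eps = rng.normal(size=mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

z = reparameterize(np.zeros(8), np.zeros(8))  # log_var = 0 means sigma = 1
print(z.shape)  # (8,)
```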
We can train the variational autoencoder by maximizing the variational lower bound on the marginal log-likelihood of data point x^{(i)}:

\mathcal{L}(\theta, \phi; x^{(i)}) = -D_{KL}\big(q_\phi(z \mid x^{(i)}) \,\|\, p(z)\big) + \mathbb{E}_{q_\phi(z \mid x^{(i)})}\big[\log p_\theta(x^{(i)} \mid z)\big] \quad (2)
The Monte Carlo estimator (with latent dimension J and noise minibatch of size L) for equation (2), also known as the SGVB estimator, becomes

\tilde{\mathcal{L}}(\theta, \phi; x^{(i)}) = \frac{1}{2} \sum_{j=1}^{J} \Big(1 + \log\big((\sigma_j^{(i)})^2\big) - (\mu_j^{(i)})^2 - (\sigma_j^{(i)})^2\Big) + \frac{1}{L} \sum_{l=1}^{L} \log p_\theta\big(x^{(i)} \mid z^{(i,l)}\big) \quad (3)

where z^{(i,l)} = \mu^{(i)} + \sigma^{(i)} \odot \epsilon^{(l)} and \epsilon^{(l)} \sim \mathcal{N}(0, I).
Notice that maximizing the above lower bound involves maximizing the model’s log-likelihood, which is equivalent to minimizing its negative log-likelihood. Minimizing the negative log-likelihood of a Gaussian model is equivalent to minimizing the mean squared error, which is simply the reconstruction cost in an autoencoder.
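The pieces above can be combined into a minimal one-sample SGVB estimate. This sketch assumes a unit-variance Gaussian likelihood, so the reconstruction term reduces to a negative squared error up to constants:

```python
import numpy as np

def elbo_estimate(x, x_recon, mu, log_var):
    """One-sample SGVB estimate: closed-form KL to N(0, I) plus a Gaussian
    log-likelihood term, which up to additive/multiplicative constants is a
    negative mean squared error."""
    kl = 0.5 * np.sum(1.0 + log_var - mu**2 - np.exp(log_var))
    recon = -np.sum((x - x_recon) ** 2)  # negative squared error
    return kl + recon

x = np.array([1.0, 2.0])
# Perfect reconstruction and a posterior matching the prior give 0.
print(elbo_estimate(x, x, np.zeros(2), np.zeros(2)))
```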
3.5 Sampling from CartPoleGen
We divided the networks into four groups depending on the network’s survival time, which we measure as the average number of steps before the episode terminates across a set of random testing episodes. The survival time is quite a robust measure of CartPoleNet’s performance; it varies only slightly across evaluations due to the stochasticity of the CartPole simulator.
We trained CartPoleGen in two settings. The first setting involves training on all networks, and then measuring the survival time of new samples drawn from the posterior distribution of the variational autoencoder. The second setting involves training a separate CartPoleGen conditioned on each group with a conditional VAE setup Sohn et al. (2015). The survival time in the second setting is also measured with new samples drawn from the posterior of the conditional generative model.
The training was conducted using ADAM Kingma and Ba (2014). The results are summarized in Table 1. For comparison, an agent that randomly selects actions survives only a small number of steps on average, and an agent that takes the same action at every time step survives even fewer. The CartPole simulation ends once an agent has survived a fixed maximum number of steps, so it is not possible to survive longer than that.
Figure 4 shows that CartPoleGen does not accurately capture the exact distribution of the training data, but it does offer an approximation to it. Training on better networks tends to lead to better generated networks, with the exception of one survival-time group. Curiously, CartPoleGen seems to display zero-avoiding rather than zero-forcing behavior, which is not typical of variational approximations using the reverse KL. It is interesting that in some cases, we are able to sample new networks that dramatically outperform the original networks in the training set. In the conditional groups, the generated samples typically display much higher variance than is found in the training set, but this does not hold true in the combined setting.
We hypothesize that the approximation gap is partially due to the limitations of the variational autoencoder and can be narrowed with a more expressive generative model. We experimented with various other neural architectures for the encoder and decoder, but did not manage to find significant improvements. In fact, the architecture of CartPoleGen presented here approximates a similar distribution when the encoder and decoder are trained with linear layers.
We also experimented with using GANs Goodfellow et al. (2014); Radford et al. (2015) as the generative model for CartPoleGen, but did not manage to successfully train them. In our experiments, the discriminator was not able to provide a good teaching signal to the generator because it managed to rapidly distinguish between the fake and real samples.
Group  Trainset Size  (Mean, Std) of Survival Time in Trainset  (Mean, Std) of Survival Time in Generated Samples 

steps  25608  21.8, 11.5  11.0, 9.7 
steps  9400  69.7, 14.2  77.3, 46.5 
steps  10103  132.6, 13.1  127.0, 55.3 
steps  28889  184.9, 16.3  116.4, 58.6 
Combined  74000  106.7, 73.3  136.7, 42.8 
4 Experimental Results and Discussion
In this section, we perform three experiments using the agent embeddings learned by CartPoleGen in the previous section. These experiments involve (1) deciding if different CartPoleNets of similar ability learn similar representations, (2) exploring the latent space learned by CartPoleGen, and (3) repairing missing weights in a CartPoleNet.
4.1 Convergent Learning
Li et al. (2015) posed the question of convergent learning: do different neural networks learn the same representations? In the case of convolutional neural networks used as image classifiers, they found that shallow representations resembling Gabor-like edge detectors are reliably learned, while more semantic representations sometimes differ.
Success is usually not an accident. Prima facie, for a given complex task, it seems like there can be a million ways to fail but only a handful of ways to succeed. We hypothesized this to be the case for CartPole, but surprisingly found that the reverse was true.
Li et al. (2015) measured activations on a reference set of images from the ImageNet Large Scale Visual Recognition Challenge 2012 dataset Russakovsky et al. (2015), and calculated the correlation of those activations between pairs of convolutional neural networks. For CartPoleNets, the inputs are environment states in CartPole, so we first had to collect a reference set of diverse states in the CartPole simulator before computing CartPoleNet activations on them. We follow the same methodology as Li et al., with the slight modification that we use the absolute value of the activations. This is because we use ELUs in CartPoleNet, which have meaningful negative activations that ReLU-based networks do not.
\mu_i^a = \mathbb{E}_s\big[\,\lvert a_i(s)\rvert\,\big] \quad (4)

\sigma_i^a = \sqrt{\mathbb{E}_s\big[(\lvert a_i(s)\rvert - \mu_i^a)^2\big]} \quad (5)

\rho_{i,j} = \frac{\mathbb{E}_s\big[(\lvert a_i(s)\rvert - \mu_i^a)(\lvert b_j(s)\rvert - \mu_j^b)\big]}{\sigma_i^a \, \sigma_j^b} \quad (6)

Here a_i(s) and b_j(s) denote the activations of unit i in the first network and unit j in the second network on a reference state s.
The correlation between activations of a pair of networks can then be used to pair units from the first network with units from the second. In a bipartite matching, we assign each pair by matching the units with the highest correlation, taking them out of consideration, and repeating the process until all the units have been paired. Hence, each unit belongs to exactly one pair. This can be done efficiently with the Hopcroft-Karp algorithm Hopcroft and Karp (1973). In a semi-matching, we sequentially assign each unit from the first network to the unit from the second network with the highest correlation. It is thus possible that some units will belong to multiple pairs, while others will not get paired at all.
Two networks are in some sense equivalent if we can arrive at one network by permuting the ordering of the units of the other. The convergence distance (CD) between two networks can hence be quantitatively measured as the distance between the bipartite matching and the semimatching (see Equation 7). There is exactly one bipartite matching of maximum cardinality, but multiple possible semimatchings depending on the order of assignment. We compute the convergence distance using the canonical semimatching, defined as the semimatching performed in descending order from the most highly correlated to the least highly correlated pair in the bipartite matching.
CD(N_1, N_2) = \sum_i \big\lvert \rho_{i,\,\mathrm{semi}(i)} - \rho_{i,\,\mathrm{bi}(i)} \big\rvert \quad (7)

where \mathrm{bi}(i) and \mathrm{semi}(i) are the units of the second network paired with unit i under the bipartite matching and the canonical semi-matching respectively.
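A minimal sketch of the two matchings and the convergence distance. The bipartite matching here is the greedy procedure described above rather than an exact maximum matching, and the distance follows our reading of Equation (7) as a sum of absolute correlation differences; both are assumptions for illustration.

```python
import numpy as np

def bipartite_match(corr):
    """Greedy one-to-one matching: repeatedly take the highest-correlation
    pair and remove both units from consideration."""
    corr = corr.copy()
    pairs = {}
    for _ in range(corr.shape[0]):
        i, j = np.unravel_index(np.argmax(corr), corr.shape)
        pairs[int(i)] = int(j)
        corr[i, :] = -np.inf
        corr[:, j] = -np.inf
    return pairs

def semi_match(corr):
    """Each unit of net 1 takes its most-correlated unit of net 2 (reuse allowed)."""
    return {i: int(np.argmax(corr[i])) for i in range(corr.shape[0])}

def convergence_distance(corr):
    """Assumed form of Eq. (7): L1 distance between the correlations achieved
    under the bipartite matching and the semi-matching."""
    bi, semi = bipartite_match(corr), semi_match(corr)
    return sum(abs(corr[i, semi[i]] - corr[i, bi[i]]) for i in bi)

# Toy 2x2 correlation matrix between units of two networks.
corr = np.array([[0.9, 0.8], [0.85, 0.1]])
print(convergence_distance(corr))
```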
We sampled ten networks with high survival time (from the conditional CartPoleGen trained on the longest-surviving group) and ten networks with low survival time (from the conditional CartPoleGen trained on a short-surviving group) to represent good and bad networks respectively. Randomly selecting actions results in an even shorter survival time, so the bad networks are nonetheless acting better than random. The average all-pairs convergence distance in the good group and in the bad group is then computed, with the results summarized in Table 2. We visualize the convergence distances in the hidden and output layers between selected pairs of CartPoleNets in Figures 5 and 6 respectively.
Group  Survival Time  Mean, Std CD (Hidden)  Mean, Std CD (Output) 

Good  191  2.75, 1.96  0.32, 0.49 
Bad  29  3.13, 1.7  0.09, 0.11 
Higher CDs correspond to divergence, while lower CDs correspond to convergence.
The data suggests that for the task of CartPole, there are more ways to be successful than to be bad. In other words, given a random state in the environment, the good networks can diverge in their decision to move left or right to balance the pole, whereas the bad networks uniformly make the wrong decision. Surprisingly also, despite the good networks displaying divergence in their actions, they pick up on more convergent (good) representations.
It is quite interesting that there are more ways to balance a pole successfully than poorly, but the skills needed for the different paths to success are similar. We hypothesize that this is because the order of actions might be less important than the overall composition of the two actions. Consider a sequence of four actions. {Left, Right, Left, Right} would be highly negatively correlated with {Right, Left, Right, Left} but on average, they might produce the same outcome of keeping the pole balanced. On the other hand, {Left, Left, Left, Left} is highly correlated with {Left, Left, Left, Left} and they both cause the pole to quickly lose its balance.
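The anti-correlation of the alternating action sequences can be checked directly (encoding Left = 0, Right = 1):

```python
import numpy as np

# Alternating sequences are perfectly anti-correlated, yet on average they
# may produce the same outcome of keeping the pole balanced.
a = np.array([0, 1, 0, 1], dtype=float)  # Left, Right, Left, Right
b = np.array([1, 0, 1, 0], dtype=float)  # Right, Left, Right, Left
print(np.corrcoef(a, b)[0, 1])  # -1.0
```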
4.2 Exploring the Latent Space
The latent space in CartPoleGen gives us semantic information about the kinds of networks that can be generated. We selected pairs of agent embeddings and sampled new embeddings along the line connecting each pair, where \alpha represents the coefficient of linear interpolation between the pair of embeddings. \alpha \in [0, 1] represents interpolation, while \alpha outside this range represents extrapolation. The results are summarized in Figure 7.
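The interpolation/extrapolation sweep over a pair of latent codes can be sketched as:

```python
import numpy as np

def interpolate(z1, z2, alpha):
    """Linear interpolation in the latent space: alpha in [0, 1] interpolates
    between z1 and z2, while alpha outside [0, 1] extrapolates past an endpoint."""
    return (1.0 - alpha) * z1 + alpha * z2

z1, z2 = np.zeros(8), np.ones(8)  # stand-ins for two agent embeddings
for alpha in (-0.5, 0.0, 0.5, 1.0, 1.5):
    # Each interpolated code would be decoded into network weights and evaluated.
    print(alpha, interpolate(z1, z2, alpha)[0])
```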
The four graphs correspond to pairs of agent embeddings with different hidden CDs. We observe that linearly interpolating within the latent space of CartPoleGen is not the same as simply interpolating within the weight space of CartPoleNet, given that CartPoleGen is nonlinear in nature. In many cases, moving from a worse agent embedding to a better one tracks a similar improvement in survival time, as is the case in the top left and bottom right graphs. Furthermore, extrapolation results in a performance boost, up to a point.
However, we also observed many cases where interpolation resulted in agent embeddings whose networks performed far worse or far better than the two embeddings used as endpoints for the interpolation. Interestingly, when the interpolated embeddings performed far better, it is often the case that the hidden CD of the networks used for the two endpoint embeddings is fairly large. In the case of the top right graph, the hidden CD is in fact a few standard deviations above the mean.
4.3 Repairing Missing Weights
The generative model can be used to repair CartPoleNets with missing weights. We propose a simple rejection sampling based method (see Algorithm 1) to continuously sample new CartPoleNets from the model until suitable candidates are found to fill out the missing weights. We experiment with two possible criteria that can be used to pick the candidate.
A candidate network c is generated by sampling a latent code from the prior and decoding it with CartPoleGen’s decoder g_\theta:

z \sim \mathcal{N}(0, I) \quad (8)

c = g_\theta(z) \quad (9)
The Missing Criterion (see Equation 10) picks out the candidate that is most similar to the damaged CartPoleNet when comparing only the weights that are not missing. Here w \in \mathbb{R}^d is the weight vector of the damaged CartPoleNet (with missing entries zeroed out) and M is the set of missing indices.

c^* = \operatorname*{arg\,min}_c \sum_{i \notin M} (c_i - w_i)^2 \quad (10)

The Whole Criterion (see Equation 11) picks out the candidate that is most similar to the damaged CartPoleNet over the whole weight vector, including the zeroed-out missing entries. This biases the selection towards finding candidates with tiny weights in the missing space.

c^* = \operatorname*{arg\,min}_c \sum_{i=1}^{d} (c_i - w_i)^2 \quad (11)
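A minimal sketch of the rejection-sampling repair under the Missing Criterion. The Gaussian `sample_fn` is a stand-in for draws from CartPoleGen's decoder, and the weight sizes are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

def repair(damaged, missing_mask, sample_fn, n_candidates=1000):
    """Draw candidates from the generative model and keep the one closest to
    the damaged net on the weights that are NOT missing (Missing Criterion);
    the missing entries are then copied in from that candidate."""
    best, best_dist = None, np.inf
    for _ in range(n_candidates):
        c = sample_fn()
        dist = np.sum((c[~missing_mask] - damaged[~missing_mask]) ** 2)
        if dist < best_dist:
            best, best_dist = c, dist
    repaired = damaged.copy()
    repaired[missing_mask] = best[missing_mask]
    return repaired

# Stand-in generative model: noisy copies of a "true" weight vector.
true_w = rng.normal(size=16)
sample_fn = lambda: true_w + 0.1 * rng.normal(size=16)

damaged = true_w.copy()
mask = np.zeros(16, dtype=bool)
mask[:4] = True          # degrade: zero out a fraction of the weights
damaged[mask] = 0.0

print(np.abs(repair(damaged, mask, sample_fn) - true_w).max())  # small error
```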
We can probe the limits of our generative model for the task of weight repair by determining how much degradation can be reversed with a fixed computational budget (i.e. the number of candidates sampled is fixed). To investigate this, we fix a given CartPoleNet, degrade it at a fixed level (i.e. zero out a fixed fraction of the weights at random), and repair it using the proposed rejection sampling algorithm. The results are summarized in Figure 8.
We observe that the two criteria perform similarly, with the Whole Criterion performing slightly better, and we managed to successfully recover the network at some levels of degradation. While we do not recover the network completely in many cases, it is encouraging to note that there is at least partial recovery, with a bounded difference in survival times. It is also interesting that it is possible to recover the network at complete degradation; this suggests perhaps that CartPoleGen has memorized this network.
The scheme described here can also be straightforwardly applied to the task of repairing (or verifying) corrupted weights instead of missing weights. We note that rejection sampling is an inefficient method of doing weight repair, and more sophisticated methods of conditional generation should be used if efficiency is of concern. One possibility is to use the proposed criteria as loss functions to finetune the parameters of CartPoleGen as a way of performing conditional generation.
5 Limitations of Supervised Generation
We note three main limitations of the Supervised Generation method in learning agent embeddings.
5.1 High Sample Complexity
One of the primary drawbacks of the Supervised Generation method is the two-step process of first collecting the data and then training a generative model on it. This requires training a very large number of networks to provide the generative model with data. Figure 9 shows progressively worse approximations as we decrease the number of sampled networks by an order of magnitude.
In principle, an agent embedding does not have to be learned in this manner. For example, it might be possible to do Online Generation, where a generative model learns to generate new networks on the fly with an online algorithm. Online Generation would probably be more sample-efficient.
5.2 Subpar Model Performance
5.3 Scaling Issues
We tried using a variational autoencoder to learn the high-dimensional weight vector of a small neural network that performs MNIST image classification. Reinforcement learning agents that process images with CNNs would most likely contain weights of at least this order of magnitude. We trained it on a dataset of networks that each achieved high test accuracy, but none of the sampled networks managed to reach comparable accuracy on a test set.
It might be difficult to scale the Supervised Generation method to large networks, even with significant advances in generative modeling techniques. This is because even state-of-the-art supervised generative models typically deal with data of much lower dimension. A notable exception is WaveNet Van Den Oord et al. (2016), but it deals with audio data, which is relatively smooth and can tolerate high amounts of error, while the weights of a neural network are very discontinuous and are not robust to small amounts of additive noise.
6 Potential Applications for AI
The ultimate challenge for neural-network-based generative systems is not generating images, sounds, or videos; it is the generation of other neural networks. Learning agent embeddings is therefore a very difficult goal to accomplish, but we outline several potential applications for AI in general.

AI systems powered by neural networks are often criticized for being uninterpretable. Agent embeddings provide us with a tool to gain insight into their internal workings and the space of possible solutions, as we have demonstrated with the task of pole balancing in this paper.

The generative model can be conditioned to prevent it from generating networks that have undesirable properties like biases or security vulnerabilities. This is helpful for improving the fairness and security of AI systems. We showed how CartPoleGen can be used to repair weights in a network for example, which increases the data integrity of the system.

It is helpful for an AI system to be able to generate worker AIs in a modular fashion. Each worker AI can be represented with its own agent embedding, and the generative model can be a factory that delivers a custom solution conditioned on the task given.

Reinforcement learning agents perform better when they have access to a model of their environment. We think they will also perform better in multiagent systems when they have access to compressed embeddings of other agents.
7 Conclusion
In this paper, we presented the concept of agent embeddings, a way to reduce a reinforcement learning agent into a small, meaningful vector representation. As a proof of concept, we trained an autoencoder neural network, CartPoleGen, on a large number of policy gradient neural networks collected to solve the pole-balancing task CartPole. We showcased three interesting experimental findings with CartPoleGen and described the challenges of the Supervised Generation method.
8 Acknowledgments
This research was supported in part by the US Defense Advanced Research Project Agency (DARPA) Lifelong Learning Machines Program, grant HR00111820020. We would like to thank Peter Duraliev for his helpful suggestions in editing the paper.
References
 Anand and Huang (2018) N. Anand and P. Huang. Generative modeling for protein structures, 2018. URL https://openreview.net/forum?id=HJFXnYJvG.
 Andrychowicz et al. (2016) M. Andrychowicz, M. Denil, S. Gomez, M. W. Hoffman, D. Pfau, T. Schaul, B. Shillingford, and N. De Freitas. Learning to learn by gradient descent by gradient descent. In Advances in Neural Information Processing Systems, pages 3981–3989, 2016.
 Ashlock and Lee D. Ashlock and C. Lee. Agent-case embeddings for the analysis of evolved systems.
 Barto et al. (1983) A. G. Barto, R. S. Sutton, and C. W. Anderson. Neuronlike adaptive elements that can solve difficult learning control problems. IEEE transactions on systems, man, and cybernetics, SMC13(5):834–846, 1983.

Bengio et al. (2003) Y. Bengio, R. Ducharme, P. Vincent, and C. Jauvin. A neural probabilistic language model. Journal of Machine Learning Research, 3(Feb):1137–1155, 2003.
 Bishop (1997) C. M. Bishop. Bayesian neural networks. Journal of the Brazilian Computer Society, 4(1), 1997.
 Brock et al. (2017) A. Brock, T. Lim, J. M. Ritchie, and N. Weston. Smash: oneshot model architecture search through hypernetworks. arXiv preprint arXiv:1708.05344, 2017.
 Brockman et al. (2016) G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba. Openai gym. arXiv preprint arXiv:1606.01540, 2016.

Chakraborty et al. (2017) S. Chakraborty, R. Tomsett, R. Raghavendra, D. Harborne, M. Alzantot, F. Cerutti, M. Srivastava, A. Preece, S. Julier, R. M. Rao, et al. Interpretability of deep learning models: a survey of results. In IEEE Smart World Congress 2017 Workshop: DAIS, 2017.
 Chang and Lipson (2018) O. Chang and H. Lipson. Neural network quine. The 2018 Conference on Artificial Life: A Hybrid of the European Conference on Artificial Life (ECAL) and the International Conference on the Synthesis and Simulation of Living Systems (ALIFE), pages 234–241, 2018. doi: 10.1162/isal_a_00049. URL https://www.mitpressjournals.org/doi/abs/10.1162/isal_a_00049.
 Clevert et al. (2015) D.-A. Clevert, T. Unterthiner, and S. Hochreiter. Fast and accurate deep network learning by exponential linear units (ELUs). arXiv preprint arXiv:1511.07289, 2015.
 Dann et al. (2018) C. Dann, L. Li, W. Wei, and E. Brunskill. Policy certificates: Towards accountable reinforcement learning. arXiv preprint arXiv:1811.03056, 2018.
 Goodfellow et al. (2014) I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
 Ha et al. (2016) D. Ha, A. Dai, and Q. V. Le. Hypernetworks. arXiv preprint arXiv:1609.09106, 2016.
 Hopcroft and Karp (1973) J. E. Hopcroft and R. M. Karp. An n^5/2 algorithm for maximum matchings in bipartite graphs. SIAM Journal on computing, 2(4):225–231, 1973.

 Huang et al. (2017) G. Huang, Z. Liu, K. Q. Weinberger, and L. van der Maaten. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, volume 1, page 3, 2017.
 Kingma and Ba (2014) D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 Kingma and Welling (2013) D. P. Kingma and M. Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
 Krueger et al. (2017) D. Krueger, C.-W. Huang, R. Islam, R. Turner, A. Lacoste, and A. Courville. Bayesian hypernetworks. arXiv preprint arXiv:1710.04759, 2017.
 Kulkarni et al. (2016) T. D. Kulkarni, K. Narasimhan, A. Saeedi, and J. Tenenbaum. Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation. In Advances in neural information processing systems, pages 3675–3683, 2016.
 Li et al. (2018) C. Li, H. Farkhoor, R. Liu, and J. Yosinski. Measuring the intrinsic dimension of objective landscapes. arXiv preprint arXiv:1804.08838, 2018.
 Li et al. (2015) Y. Li, J. Yosinski, J. Clune, H. Lipson, and J. Hopcroft. Convergent learning: Do different neural networks learn the same representations? In Feature Extraction: Modern Questions and Challenges, pages 196–212, 2015.
 Liu et al. (2018) H. Liu, K. Simonyan, and Y. Yang. DARTS: differentiable architecture search. arXiv preprint arXiv:1806.09055, 2018.
 Louizos and Welling (2017) C. Louizos and M. Welling. Multiplicative normalizing flows for variational bayesian neural networks. arXiv preprint arXiv:1703.01961, 2017.
 Mikolov et al. (2013) T. Mikolov, W.-t. Yih, and G. Zweig. Linguistic regularities in continuous space word representations. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 746–751, 2013.
 Mnih et al. (2015) V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.
 Mnih et al. (2016) V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. P. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu. Asynchronous methods for deep reinforcement learning. arXiv preprint arXiv:1602.01783, 2016.
 Pennington et al. (2014) J. Pennington, R. Socher, and C. Manning. GloVe: global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, 2014.
 Pham et al. (2018) H. Pham, M. Y. Guan, B. Zoph, Q. V. Le, and J. Dean. Efficient neural architecture search via parameter sharing. arXiv preprint arXiv:1802.03268, 2018.
 Radford et al. (2015) A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
 Ravi and Larochelle (2018) S. Ravi and H. Larochelle. Optimization as a model for few-shot learning. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=rJY0Kcll.
 Russakovsky et al. (2015) O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
 Schroff et al. (2015) F. Schroff, D. Kalenichenko, and J. Philbin. Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 815–823, 2015.
 Silver et al. (2017) D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, et al. Mastering the game of go without human knowledge. Nature, 550(7676):354, 2017.
 Smithson et al. (2016) S. C. Smithson, G. Yang, W. J. Gross, and B. H. Meyer. Neural networks designing neural networks: multi-objective hyperparameter optimization. In Computer-Aided Design (ICCAD), 2016 IEEE/ACM International Conference on, pages 1–8. IEEE, 2016.
 Sohn et al. (2015) K. Sohn, H. Lee, and X. Yan. Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems, pages 3483–3491, 2015.
 Van Den Oord et al. (2016) A. Van Den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu. Wavenet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016.
 Verma et al. (2018) A. Verma, V. Murali, R. Singh, P. Kohli, and S. Chaudhuri. Programmatically interpretable reinforcement learning. arXiv preprint arXiv:1804.02477, 2018.
 Vondrick et al. (2016) C. Vondrick, H. Pirsiavash, and A. Torralba. Generating videos with scene dynamics. In Advances In Neural Information Processing Systems, pages 613–621, 2016.
 Xiao et al. (2017) T. Xiao, J. Hong, and J. Ma. DNA-GAN: learning disentangled representations from multi-attribute images. arXiv preprint arXiv:1711.05415, 2017.
 Zha et al. (2018) Y. Zha, Y. Li, S. Gopalakrishnan, B. Li, and S. Kambhampati. Recognizing plans by learning embeddings from observed action distributions. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, pages 2153–2155. International Foundation for Autonomous Agents and Multiagent Systems, 2018.
 Zhang and Zhu (2018) Q. Zhang and S.-C. Zhu. Visual interpretability for deep learning: a survey. arXiv preprint arXiv:1802.00614, 2018.
 Zoph and Le (2016) B. Zoph and Q. V. Le. Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578, 2016.