# Neural Fictitious Self Play: From Game Theory to Deep Reinforcement Learning

NFSP is FSP with neural-network function approximation: a method that uses reinforcement learning to learn approximate Nash equilibria from self-play. It addresses three problems:

1. NFSP agents learn without prior knowledge
2. They do not rely on local search at run time
3. They converge to an approximate Nash equilibrium in self-play

## Reinforcement Learning

On-policy learning means the agent learns about the policy it is currently following. Off-policy learning means the agent learns from the experience of another agent or another policy, such as its own previous policies.

Q-learning is an off-policy method. It learns about the greedy policy, which in every state selects the action with the highest estimated value. An agent can store past experience as transition tuples and later apply off-policy reinforcement learning to the respective replayed transitions; this is called experience replay. Fitted Q Iteration (FQI) is a batch reinforcement learning method that combines Q-learning with experience replay. Neural Fitted Q Iteration (NFQ) and Deep Q Network (DQN) are extensions of FQI with batch and online updates respectively.
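To make these definitions concrete, here is a minimal sketch of off-policy $Q$-learning with experience replay; the environment interface and all hyperparameters are illustrative assumptions, not taken from the paper:

```python
import random
from collections import deque, defaultdict

# Minimal sketch of off-policy Q-learning with experience replay.
# The env interface (reset() -> state, step(a) -> (state, reward, done))
# and the hyperparameters are illustrative assumptions.

GAMMA, ALPHA, EPSILON = 0.99, 0.1, 0.1

def q_learning_with_replay(env, actions, episodes=500, batch_size=32):
    Q = defaultdict(float)            # Q[(state, action)] -> estimated value
    replay = deque(maxlen=10_000)     # circular buffer of past transitions

    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy behaviour policy; the learning target below uses
            # the greedy policy, which makes this off-policy
            if random.random() < EPSILON:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda a_: Q[(s, a_)])
            s2, r, done = env.step(a)
            replay.append((s, a, r, s2, done))
            s = s2

            # experience replay: re-apply the Q-learning update to stored tuples
            batch = random.sample(list(replay), min(batch_size, len(replay)))
            for (si, ai, ri, si2, d) in batch:
                target = ri if d else ri + GAMMA * max(Q[(si2, b)] for b in actions)
                Q[(si, ai)] += ALPHA * (target - Q[(si, ai)])
    return Q
```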

# Fictitious Self Play (FSP)

In FSP, a mixture of two strategies $\pi_1, \pi_2$ with weights $\lambda_1, \lambda_2$ is realization-equivalent to the behavioural strategy

$\sigma(s,a) \propto \lambda_1 x_{\pi_1}(s)\pi_1(s,a) + \lambda_2 x_{\pi_2}(s)\pi_2(s,a) \quad \forall s, a$, (1)

where $x_{\pi}(s)$ is the probability of reaching state $s$ when playing $\pi$.
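Equation (1) can be computed directly from tabular policies; a minimal numpy sketch, where the array shapes and names are my own illustrative choices:

```python
import numpy as np

# Mix two behavioural strategies pi1, pi2 into one realization-equivalent
# policy sigma, per equation (1). pi1, pi2: (num_states, num_actions) arrays;
# x1, x2: per-state realization probabilities under pi1 and pi2.

def mix_policies(pi1, pi2, x1, x2, lam1, lam2):
    w = lam1 * x1[:, None] * pi1 + lam2 * x2[:, None] * pi2
    # normalize over actions to turn the proportionality into a policy
    # (states unreachable under both policies would need special handling)
    return w / w.sum(axis=1, keepdims=True)
```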

# Neural Fictitious Self Play (NFSP)

## The NFSP Algorithm

The agent maintains the following components:

1. Γ {Game}
2. MRL, MSL {RL and SL memories}
3. FQ, FS {Action-value and policy networks}
4. β = ε-GREEDY(FQ) {Best-response policy}
5. π = FS {Average policy}
6. σ {Current policy}

function STEP():

1. s_t, r_t, c_t ← OBSERVE(Γ)
2. a_t ← THINK(s_t, r_t, c_t)
3. ACT(Γ, a_t)

end function

function THINK(s_t, r_t, c_t):

1. if c_t = 0 {episode terminated} then
   σ ← SAMPLEPOLICY(β, π)
   end if
2. if s_{t-1} ≠ nil then
   τ_t ← (s_{t-1}, a_{t-1}, r_t, s_t, c_t)
   UPDATERLMEMORY(MRL, τ_t)
   end if
3. a_t ← SAMPLEACTION(σ)
4. if σ = β then
   UPDATESLMEMORY(MSL, (s_t, a_t)) {store own best-response behaviour}
   end if
5. s_{t-1} ← s_t
6. a_{t-1} ← a_t
7. β ← REINFORCEMENTLEARNING(MRL)
8. π ← SUPERVISEDLEARNING(MSL)
9. return a_t

end function

function REINFORCEMENTLEARNING(MRL):

1. FQ ← DQN(MRL)
2. return ε-GREEDY(FQ)

end function

function SUPERVISEDLEARNING(MSL):

1. FS ← apply stochastic gradient descent to the loss E_{(s,a)∼MSL}[−log π(s,a)]
2. return FS

end function
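The same loop translates almost line for line into Python. A minimal sketch of the THINK step, where network training is stubbed out and all names, including the anticipatory parameter eta that weights the two policies, are illustrative assumptions:

```python
import random
from collections import deque

# Sketch of one NFSP agent's THINK step, mirroring the pseudocode above.
# Network training is abstracted behind rl_update()/sl_update(), and both
# policies are random placeholders.

class NFSPAgent:
    def __init__(self, actions, eta=0.1):
        self.actions = actions
        self.m_rl = deque(maxlen=200_000)  # RL memory MRL (circular buffer)
        self.m_sl = []                     # SL memory MSL (reservoir in NFSP)
        self.eta = eta                     # prob. of playing the best response
        self.prev = None                   # (s_{t-1}, a_{t-1})
        self.use_best_response = True      # current episode policy sigma

    def think(self, s_t, r_t, terminal):
        if terminal:  # sample a new episode policy: beta w.p. eta, else pi
            self.use_best_response = random.random() < self.eta
        if self.prev is not None:          # store transition tau_t in MRL
            s_prev, a_prev = self.prev
            self.m_rl.append((s_prev, a_prev, r_t, s_t, terminal))
        a_t = (self.best_response(s_t) if self.use_best_response
               else self.average_policy(s_t))
        if self.use_best_response:         # store behaviour tuple in MSL
            self.m_sl.append((s_t, a_t))
        self.prev = (s_t, a_t)
        self.rl_update()                   # fit Q on MRL (e.g. one DQN step)
        self.sl_update()                   # fit average policy on MSL
        return a_t

    # Placeholder policies and updates; a real agent uses the FQ/FS networks.
    def best_response(self, s):  return random.choice(self.actions)
    def average_policy(self, s): return random.choice(self.actions)
    def rl_update(self):  pass
    def sl_update(self):  pass
```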

# Counting Edge Colorings Is Hard

*Another dichotomy theorem*

Jin-Yi Cai is one of the world’s experts on hardness of counting problems, especially those related to methods based on complex (pun intended) gadgets. He …

# Deep Reinforcement Learning with Double Q-learning

In $Q$-learning, after taking action $A_t$ in state $S_t$ and observing reward $R_{t+1}$ and next state $S_{t+1}$, the parameters are updated by

$\theta_{t+1} = \theta_t + \alpha (Y_t^Q - Q(S_t,A_t;\theta_t)) \nabla_{\theta_t}Q(S_t,A_t;\theta_t)$, (1)

where $\alpha$ is the step size and the target $Y_t^Q$ is

$Y_t^Q \equiv R_{t+1} + \gamma \max_{a}Q(S_{t+1},a;\theta_t)$. (2)

### Deep $Q$ Network

DQN computes the target with the parameters $\theta_t^{-}$ of a separate target network, which are periodically copied from the online network:

$Y_t^{DQN}\equiv R_{t+1} + \gamma \max_{a}Q(S_{t+1},a;\theta_t^{-})$. (3)

### Double $Q$-learning

The max operator in (2) uses the same values both to select and to evaluate an action, which makes overoptimistic value estimates more likely. Making the selection explicit, target (2) can be rewritten as

$Y_t^Q = R_{t+1} + \gamma\, Q(S_{t+1}, \operatorname{argmax}_{a} Q(S_{t+1},a;\theta_t);\theta_t)$.

Double $Q$-learning decouples selection from evaluation by using a second set of weights $\theta_t'$; its target can be written as

$Y_t^{DoubleQ} \equiv R_{t+1} + \gamma\, Q(S_{t+1}, \operatorname{argmax}_{a} Q(S_{t+1},a;\theta_t);\theta_t')$. (4)
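In code, the three targets differ only in which parameters select the action and which evaluate it. A small sketch, where the accessor q(s, theta) and the parameter names are assumptions for illustration:

```python
import numpy as np

# Targets (2)-(4) for a single transition, assuming q(s, theta) returns the
# vector of action values Q(s, .; theta). theta_minus is the target network's
# parameters; theta_prime is the second set of weights in Double Q-learning.

def q_target(r, s_next, q, theta, gamma=0.99):
    return r + gamma * np.max(q(s_next, theta))               # eq. (2)

def dqn_target(r, s_next, q, theta_minus, gamma=0.99):
    return r + gamma * np.max(q(s_next, theta_minus))         # eq. (3)

def double_q_target(r, s_next, q, theta, theta_prime, gamma=0.99):
    a_star = np.argmax(q(s_next, theta))                      # select with one network
    return r + gamma * q(s_next, theta_prime)[a_star]         # evaluate with the other, eq. (4)
```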

# Why Deep Learning Works II: the Renormalization Group

Deep Learning is amazing. But why is Deep Learning so successful? Is Deep Learning just old-school Neural Networks on modern hardware? Is it just that we have so much data now that the methods work better? Is Deep Learning just really good at finding features? Researchers are working hard to sort this out.

Recently it has been shown that [1]

> Unsupervised Deep Learning implements the Kadanoff Real Space Variational Renormalization Group (1975)

This means the success of Deep Learning is intimately related to some very deep and subtle ideas from Theoretical Physics.  In this post we examine this.

#### Unsupervised Deep Learning: AutoEncoder Flow Map

An AutoEncoder is an Unsupervised Deep Learning algorithm that learns how to represent a complex image or other data structure $X$. There are several kinds of AutoEncoders; we care about so-called Neural Encoders, those using Deep Learning techniques to reconstruct the data:
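As a toy illustration (mine, not from the original post), a one-hidden-layer autoencoder that learns to reconstruct its input $X$:

```python
import numpy as np

# Toy autoencoder: encode x to a low-dimensional code h, decode h back to
# x_hat, and train both maps to minimize the reconstruction error.

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 16))                       # toy 16-dimensional data

d_in, d_hidden, lr = 16, 4, 0.01
W1 = rng.normal(scale=0.1, size=(d_in, d_hidden))    # encoder weights
W2 = rng.normal(scale=0.1, size=(d_hidden, d_in))    # decoder weights

for _ in range(1000):
    H = np.tanh(X @ W1)                              # encode
    X_hat = H @ W2                                   # decode
    err = X_hat - X                                  # reconstruction error
    # backpropagate the mean squared reconstruction loss
    gW2 = H.T @ err / len(X)
    gH = err @ W2.T * (1 - H**2)
    gW1 = X.T @ gH / len(X)
    W1 -= lr * gW1
    W2 -= lr * gW2
```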

The simplest Neural Encoder…


# Streaming Median

Problem: Compute a reasonable approximation to a “streaming median” of a potentially infinite sequence of integers.

Solution: (in Python)
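A minimal sketch of one common constant-space approach (my illustration; the post's actual solution may differ): nudge a running estimate toward each incoming value by a fixed step.

```python
import random

# Approximate streaming median in O(1) space: move the estimate up or down
# by a fixed step as each value arrives.

def streaming_median(stream, step=1.0):
    estimate = None
    for x in stream:
        if estimate is None:
            estimate = float(x)    # initialize with the first value
        elif x > estimate:
            estimate += step       # estimate too low: move it up
        elif x < estimate:
            estimate -= step       # estimate too high: move it down
        yield estimate

# The running estimate settles near the true median of the distribution:
data = (random.randint(0, 100) for _ in range(100_000))
for m in streaming_median(data):
    pass
print(m)  # roughly 50 for uniform [0, 100] data
```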

Discussion: Before we discuss the details of the Python implementation above, we should note a few things.

First, because the input sequence is potentially infinite, we can’t store any amount of information that grows with the length of the sequence. Even though storing something like $O(\log n)$ integers would be reasonable for the real world (note that the log of a petabyte is about 60), we should not let that stop us from shooting for the ideal $O(1)$ space bound, and exploring what sorts of solutions arise under that constraint. For the record, I don’t know of any algorithms to compute the true streaming median which require $O(\log n)$ space, and I would be very interested to see one.

Second, we should note the motivation for…


# Mapping WordPress Posts to Elasticsearch

I thought I’d share the Elasticsearch type mapping I am using for WordPress posts. We’ve refined it over a number of iterations and it combines dynamic templates and multi_field mappings along with a number of more standard mappings. So this is probably a good general example of how to index real data from a traditional SQL database into Elasticsearch.

If you aren’t familiar with the WordPress database schema, it looks like this:

These Elasticsearch mappings focus on the wp_posts, wp_term_relationships, wp_term_taxonomy, and wp_terms tables.

To simplify things I’ll just index using an English analyzer and leave discussing multi-lingual analyzers to a different post.

A few notes on the analyzers:

- The minimal_english stemmer only removes plurals, rather than potentially butchering the difference between words like “computer”, “computes”, and “computing”.
- The lowercase keyword analyzer makes case-insensitive exact searches possible.

Let’s take a look at the post mapping:
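As an illustrative reconstruction (not the author's exact mapping), here is a trimmed post mapping combining the two analyzers above with a multi_field title, written as a Python dict; all field choices are assumptions based on the wp_posts columns:

```python
# Hypothetical, trimmed version of a WordPress post mapping for the
# (old, multi_field-era) Elasticsearch API. Not the author's actual mapping.

post_index = {
    "settings": {
        "analysis": {
            "filter": {
                "minimal_english": {"type": "stemmer", "language": "minimal_english"}
            },
            "analyzer": {
                "english_minimal": {       # standard tokens, plural-only stemming
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "minimal_english"],
                },
                "lowercase_keyword": {     # whole-value terms, case-insensitive
                    "type": "custom",
                    "tokenizer": "keyword",
                    "filter": ["lowercase"],
                },
            },
        }
    },
    "mappings": {
        "post": {
            "properties": {
                "post_title": {
                    "type": "multi_field",
                    "fields": {
                        "post_title": {"type": "string", "analyzer": "english_minimal"},
                        "raw": {"type": "string", "analyzer": "lowercase_keyword"},
                    },
                },
                "post_content": {"type": "string", "analyzer": "english_minimal"},
                "post_date": {"type": "date"},
                "terms": {"type": "string", "analyzer": "lowercase_keyword"},
            }
        }
    },
}
```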

Most of the…


# The Strange Ruby Splat

As of ruby 1.9, you can do some pretty odd things with array destructuring and splatting. Putting the star before an object invokes the splat operator, which has a variety of effects. First we’ll start with some very useful examples, then we will poke around the dark corners of ruby’s arrays and the splat operator.

## Method Definitions

You can use a splat in a method definition to gather up any remaining arguments:
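For example (a sketch reconstructed from the description below, not necessarily the post's exact snippet):

```ruby
# *people gathers any remaining arguments into an array.
def say(what, *people)
  people.each { |person| puts "#{person}: #{what}" }
end

say("Hello!", "Alice", "Bob", "Carl")
# Alice: Hello!
# Bob: Hello!
# Carl: Hello!
```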

In the example above, `what` will get the first argument, then `*people` will capture however many other arguments you pass into `say`. A real world example of this can be found in the definition of `Delegator#method_missing`. A common ruby idiom is to pass a hash in as the last argument to a method. Rails defines an array helper `Array#extract_options!` to make this idiom easier to…


# Reflections

Let’s create a novel world!