博客声明:本文仅为个人论文阅读笔记,大部分原文对照的中文为翻译而来,只对其中错误明显的部分作了修改。其他一些个人理解不到位或有误的地方也敬请见谅。

标题原文:IntelliLight: A Reinforcement Learning Approach for Intelligent Traffic Light Control
论文来源:Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining(2018 SIGKDD,数据挖掘领域顶级会议,A类)
论文DOI:https://doi.org/10.1145/3219819.3220096
论文代码:https://github.com/wingsweihua/IntelliLight
关键词:Traffic light control、reinforcement learning

0 摘要

概括:

使用深度强化学习进行交通信号灯控制是新兴趋势,但现有的研究没有在真实数据上进行测试,也没有解释其运行策略。

本文贡献:

提出的方法在交通信号灯控制上更为有效
使用的数据集是基于监控摄像头获取的大规模真实交通数据
展示了从真实数据中学习到的案例策略

“The intelligent traffic light control is critical for an efficient transportation system. While existing traffic lights are mostly operated by hand-crafted rules, an intelligent traffic light control system should be dynamically adjusted to real-time traffic. There is an emerging trend of using deep reinforcement learning technique for traffic light control and recent studies have shown promising results. However, existing studies have not yet tested the methods on the real-world traffic data and they only focus on studying the rewards without interpreting the policies. In this paper, we propose a more effective deep reinforcement learning model for traffic light control. We test our method on a large-scale real traffic dataset obtained from surveillance cameras. We also show some interesting case studies of policies learned from the real data.”


智能交通灯控制对于高效的交通系统至关重要。目前的交通信号灯大多由手工制定的规则来控制,而智能交通信号灯控制系统应能根据实时交通进行动态调整。使用深度强化学习技术进行交通灯控制是一个新兴趋势,最近的研究也展现出一些有前景的结果。然而,现有的研究尚未在真实世界的交通数据上测试其方法,并且只关注奖励而没有解释策略。在本文中,我们提出了一个更有效的用于交通灯控制的深度强化学习模型。我们在一个基于监控摄像头获取的大规模真实交通数据集上测试了我们的方法。我们还展示了一些从真实数据中学习到的有趣的策略案例研究。

1 介绍

概括:

可以通过智能控制信号灯来缓解交通拥堵问题
交通灯控制分为传统的预定义规则控制和基于强化学习的按实时交通动态调整两类,各有优缺点和难点
本文使用更先进的深度强化学习来解决信号灯控制问题,并做出了3个重要贡献

Traffic congestion has become increasingly costly. For example, traffic congestion costs Americans $124 billion a year, according to a report by Forbes in 2014 [12]. In European Union, the traffic congestion cost is estimated to be 1% of its GDP [7]. Improving traffic conditions could increase city efficiency, improve economy, and ease people’s daily life.

交通拥堵的代价越来越高。例如,根据福布斯2014年的一份报告,交通拥堵每年给美国人造成1240亿美元的损失[12]。在欧盟,交通拥堵成本估计为其GDP的1%[7]。改善交通状况可以提高城市运行效率、促进经济发展、便利人们的日常生活。


One way to reduce the traffic congestion is by intelligently controlling traffic lights. Nowadays, most traffic lights are still controlled with pre-defined fixed-time plan [18, 23] and are not designed by observing real traffic. Recent studies propose hand-crafted rules according to real traffic data [5, 20]. However, these rules are still pre-defined and cannot be dynamically adjusted w.r.t. real-time traffic.

减少交通拥堵的一种方法是智能地控制交通信号灯。现在,大多数交通灯仍然采用预定义的固定时间方案来控制[18,23],并非通过观察真实交通来设计。最近的研究提出了根据真实交通数据手工制定的规则[5,20]。然而,这些规则仍然是预定义的,无法根据实时交通进行动态调整。


To dynamically adjust traffic lights according to real-time traffic, people have been using reinforcement learning technique [13, 22, 24]. Traditional reinforcement learning is difficult to apply due to two key challenges: (1) how to represent environment; and (2) how to model the correlation between environment and decision. To address these two challenges, recent studies [15, 22] have applied deep reinforcement learning techniques, such as Deep Q-learning (DQN), for traffic light control problem. Figure 1 illustrates the basic idea of deep reinforcement learning framework. Environment is composed of traffic light phase and traffic condition. State is a feature representation of the environment. Agent takes state as input and learns a model to predict whether to “keep the current phase of traffic lights” or “change the current phase”. The decision is sent to the environment and the reward (e.g., how many vehicles pass the intersection) is sent back to the agent. The agent consequently updates the model and further makes the new decision for the next timestamp based on the new state and the updated model. In such a framework, traffic condition can be described as an image and such an image is directly taken as an input for a CNN-based model to enrich the hand-crafted features of the environment.

为了根据实时交通动态调整交通灯,人们一直在使用强化学习技术[13,22,24]。传统的强化学习由于两个关键挑战而难以应用:(1)如何表示环境;(2)如何建模环境与决策之间的相关性。为了解决这两个挑战,最近的研究[15,22]已经将深度强化学习技术(如深度Q学习,DQN)应用于交通灯控制问题。图1展示了深度强化学习框架的基本思想。环境由交通灯相位和交通状况两方面组成,状态是环境的特征表示。智能体以状态作为输入,学习一个模型来预测应当"保持交通灯的当前相位"还是"改变当前相位"。该决定被发送给环境,而奖励(例如有多少车辆通过交叉路口)被反馈给智能体。智能体据此更新模型,并基于新状态和更新后的模型对下一个时间戳做出新的决定。在这样的框架中,交通状况可以被描述为一幅图像,并且该图像被直接作为基于CNN的模型的输入,以丰富环境的手工特征。

图1:交通信号灯控制的深度强化学习框架
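针对上文"交通状况可以被描述为图像并直接作为基于CNN的模型输入"这一点,下面给出一个示意性的小例子(并非论文的官方实现):把路口附近每条车道按固定长度的格子离散化,有车的格子置1,得到矩阵M;其中车道数、车道长度、格子大小等参数均为假设值。

```python
import numpy as np

def build_position_matrix(vehicle_positions, num_lanes=12, lane_length=150.0, grid_size=5.0):
    """把路口附近车辆的位置编码成 0/1 矩阵 M(示意,各参数均为假设值)。

    vehicle_positions: 列表,元素为 (车道编号, 距路口的距离/米)
    """
    num_cells = int(lane_length // grid_size)      # 每条车道划分出的格子数
    M = np.zeros((num_lanes, num_cells), dtype=np.float32)
    for lane_idx, dist in vehicle_positions:
        cell = int(dist // grid_size)
        if 0 <= lane_idx < num_lanes and 0 <= cell < num_cells:
            M[lane_idx, cell] = 1.0                # 该格子内有车
    return M

# 用法示例:两辆车分别位于 0 号车道 3 米处和 4 号车道 52 米处
M = build_position_matrix([(0, 3.0), (4, 52.0)])
print(M.shape)  # (12, 30)
```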

Recent deep reinforcement learning approaches made promising progress for the traffic light control problem. Our approach extends this line of work by making several important new contributions:

最近的深度强化学习方法在交通灯控制问题上取得了可喜的进展。我们的方法通过做出以下几个重要的新贡献来扩展这一研究方向:

1.1 使用真实数据进行实验

概括:

以往的研究都是通过交通模拟来测试方法,我们通过交叉路口的监控摄像头获取了中国济南31天时间段的大规模真实交通数据,并在上面进行了实验。

Nowadays, increasing amount of traffic data is being collected from various sources. In China, many big cities have installed AI-equipped traffic surveillance cameras to monitor traffic conditions in real time. Such realtime traffic data enables us to implement reinforcement learning in real world. However, to the best of our knowledge, none of existing studies have used the real traffic data to test their methods. Instead, they use traffic simulations and such simulations do not reflect the real-world traffic. For example, the simulation models in current studies often assume that vehicles arrive at a constant rate but real traffic are highly dynamic over time. In our paper, we test the methods on a large-scale real traffic data obtained from 1,704 surveillance cameras in Jinan, China for a period of 31 days (see experiment section for details). In this dataset, there are more than 405 million vehicle records and more than 11 million unique vehicle plates. We conduct comprehensive experiments on such large real dataset.

如今,越来越多的交通数据正从各种来源被收集。在中国,许多大城市都安装了配备人工智能的交通监控摄像头来实时监控交通状况。这样的实时交通数据使我们能够在现实世界中应用强化学习。然而,据我们所知,现有的研究都没有使用真实的交通数据来测试他们的方法。相反,他们使用交通模拟,而这种模拟并不能反映真实世界的交通。例如,当前研究中的仿真模型通常假设车辆以恒定速率到达,但真实交通随时间是高度动态变化的。在本文中,我们在一个大规模真实交通数据集上测试了这些方法,该数据集由中国济南的1,704个监控摄像头在31天内采集而成(详见实验部分)。在这个数据集中,有超过4.05亿条车辆记录和超过1100万个不同的车牌。我们在如此大规模的真实数据集上进行了全面的实验。

1.2 策略的解读

概括:

只对控制模型的总体奖励进行衡量,而不考虑其运行策略是不合适的,因为奖励在某些场景下可能存在误导。本文中我们对模型在不同场景下学习到的策略进行了一定的解读。

A frequently used measure to quantify the performance of traffic light control is by examining the overall reward, which can be defined by several factors such as waiting time of vehicles and number of vehicles passing the intersections. However, existing studies rarely make observations of the policy learned from the model. The reward could be misleading in some cases. There could be different policies with the same reward but one is more suitable than the other. Take Figure 2 as an example. Assume there is only traffic on South-North direction and the traffic comes for the first 80 seconds in every 120 seconds. Policy #1 is 80 seconds for green light on South-North direction and followed by red light for 40 seconds, and then repeat. And policy #2 is different from policy #1 in the way that, instead of 40-second red light on South-North direction, the light changes every 10 seconds. Both policies will result in the same reward because no vehicle will be waiting under either policy. However, policy #1 is preferred over policy #2 in real scenario. In this paper, we claim that it is important to study the policies rather than simply showing the overall reward. In our experiments, we show several interesting policies learned from the real traffic under different scenarios (e.g., peak hours vs. non-peak hours, weekday vs. weekend).

量化交通灯控制性能的一个常用方法是检查总体奖励,奖励可以由若干因素定义,例如车辆的等待时间和通过交叉口的车辆数量。然而,现有的研究很少对从模型中学习到的策略进行观察,而奖励在某些情况下可能会产生误导:可能存在奖励相同的不同策略,但其中一个比另一个更合适。以图2为例。假设只有南北方向有车流,并且车流在每120秒的前80秒内到达。策略#1是南北方向绿灯80秒,然后红灯40秒,如此重复。策略#2与策略#1的不同之处在于,在南北方向不是持续40秒红灯,而是每10秒切换一次。这两种策略将获得相同的奖励,因为在任何一种策略下都不会有车辆等待。然而,在真实场景中,策略#1优于策略#2。在本文中,我们认为重要的是研究策略本身,而不是简单地展示总体奖励。在我们的实验中,我们展示了在不同场景下(例如高峰时段与非高峰时段、工作日与周末)从真实交通中学习到的几种有趣的策略。

图2:奖励并不是对交通灯控制性能的全面评价。图中两种策略将带来相同的奖励,但在现实世界中,策略#1比策略#2更合适。
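为了更直观地说明"奖励相同但策略有优劣之分",下面用一个极简的数值示意复现本段的设定:只有南北方向有车,且车辆只在每120秒的前80秒内到达;这里假设奖励只由等待车辆数决定、每2秒到达一辆车,这些都是示意性假设,并非论文的实验设置。

```python
def sn_green(t, policy):
    """返回第 t 秒(0~119)南北方向是否为绿灯。"""
    if policy == 1:                                # 策略#1:前80秒绿灯,后40秒红灯
        return t < 80
    return t < 80 or (t - 80) // 10 % 2 == 1       # 策略#2:后40秒每10秒切换一次

def evaluate(policy):
    waits, switches, prev = 0, 0, None
    for t in range(120):
        green = sn_green(t, policy)
        if prev is not None and green != prev:     # 统计一个周期内的信号切换次数
            switches += 1
        prev = green
        arrives = (t < 80 and t % 2 == 0)          # 假设前80秒内每2秒到达一辆车
        if arrives and not green:
            waits += 1                             # 红灯期间到达的车需要等待
    return waits, switches

print(evaluate(1))  # (0, 1):无车等待,只切换1次
print(evaluate(2))  # (0, 4):同样无车等待,但切换了4次
```

两种策略下都没有车辆等待,奖励因此相同,但策略#2的切换次数明显更多,这正是"只看奖励可能产生误导"的直观体现。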

1.3 相位门控模型学习

概括:

以往的研究工作只是将相位同其他特征一起作为模型的输入,本文希望让相位这一特征在某些环境中独立出来,发挥更加重要的作用。为此提出了一种新的结构组件"相位门",并结合记忆宫殿,帮助模型获得更好的性能表现。(具体见4.3和4.4节)

As described earlier in deep reinforcement learning framework, the agent will take the state, which is the representation of environment, as model input. The environment usually includes the current traffic light phase and traffic conditions. For example, the environments of two cases in Figure 3 are the same, except the traffic light phases. Previous studies all take phase as one feature [17, 22], together with many other features (e.g., number of vehicles at different lanes, positions of vehicles). And it is likely that this one feature does not play a role that is significant enough to affect the model output. Therefore, the model will make the same decision (i.e., either keep or change the current phase) for these two different cases. However, such a decision, no matter which one, is not ideal for one of the cases. Because in Figure 3, case A hopes to keep the phase and case B hopes to change the phase. In this paper, we propose a new phase-sensitive (i.e., phase gate combined with memory palace) reinforcement learning agent, which is a critical component that leads to superior performance.

正如前面深度强化学习框架中所描述的,智能体将状态(即环境的表示)作为模型输入。环境通常包括当前交通灯相位和交通状况。例如,图3中两种情况的环境除了交通灯相位以外完全相同。先前的研究都将相位与许多其他特征(例如不同车道上的车辆数量、车辆位置)一起作为输入特征[17,22],而这一个特征很可能不足以显著影响模型输出。因此,模型会在这两种不同的情况下做出相同的决定(即保持或改变当前相位)。然而,无论是哪一种决定,对其中一种情况来说都不理想:在图3中,情况A希望保持相位,而情况B希望改变相位。在本文中,我们提出了一种新的相位敏感(即相位门结合记忆宫殿)的强化学习智能体,它是获得更好性能的关键组件。

图3:案例A和案例B除了交通灯相位外,拥有相同的环境

The rest of this paper is organized as follows. Section 2 discusses the literature. Section 3 formally defines the problem. The method is shown in Section 4 and the experimental results are shown in Section 5. Finally, we conclude the paper in Section 6.

本文的其余部分组织如下。第2节讨论文献。第3节正式定义了问题。该方法示于第4节,实验结果示于第5节。最后,我们在第6节总结了本文。

2 相关工作

In this section, we firstly introduce conventional methods for traffic light control, then introduce methods using reinforcement learning.

在本节中,我们首先介绍传统的交通灯控制方法,然后介绍使用强化学习的方法。

2.1 传统交通信号灯控制

概括:

介绍了早期的交通灯控制方法:预定时信号控制与车辆驱动控制,以及它们的缺点。

Early traffic light control methods can be roughly classified into two groups. The first is pre-timed signal control [6, 18, 23], where a fixed time is determined for all green phases according to historical traffic demand, without considering possible fluctuations in traffic demand. The second is vehicle-actuated control methods [5, 20] where the real-time traffic information is used. Vehicle-actuated methods are suitable for the situations with relatively high traffic randomness. However, this method largely depends on the handcraft rules for current traffic condition, without taking into account future situation. Therefore, it cannot reach the global optimal.

早期的交通灯控制方法大致可分为两类。第一种是预定时信号控制[6,18,23],其中根据历史交通需求确定所有绿色相位的固定时间,而不考虑交通需求的可能波动。第二种是车辆驱动控制方法[5,20],其中使用实时交通信息。车辆驱动方法适用于交通随机性较高的情况。然而,这种方法在很大程度上依赖于当前交通状况的手工规则,而没有考虑到未来的情况。因此,它不能达到全局最优。

2.2 基于强化学习的交通信号灯控制

Recently, due to the incapability of dealing with dynamic multidirection traffic in previous methods, more works try to use reinforcement learning algorithms to solve the traffic light control problem [13, 17, 24]. Typically, these algorithms take the traffic on the road as state, and the operation on light as action. These methods usually show better performance compared with fixed-time and traffic-responsive control methods.

近年来,由于以前的方法无法处理动态的多方向交通,更多的工作尝试使用强化学习算法来解决交通灯控制问题[13,17,24]。通常,这些算法将道路上的交通作为状态,将对信号灯的操作作为动作。与固定时间和交通响应式控制方法相比,这些方法通常表现出更好的性能。

Methods in [1, 2, 4, 8, 24] designed the state as discrete values like the location of vehicles or number of waited cars. However, the discrete state-action pair value matrix requires huge storage space, which keeps these methods from being used in large state space problems.

[1,2,4,8,24]中的方法将状态设计为离散值,如车辆的位置或正在等待汽车的数量。然而,离散状态-动作对值矩阵需要巨大的存储空间,这使得这些方法不能用于大的状态空间问题。

To solve the unmanageably large state space of previous methods, recent studies [15, 22] propose to apply Deep Q-learning methods using continuous state representations. These studies learn a Q-function (e.g. a deep neural network) to map state and action to reward. These works vary in the state representation, including hand-crafted features (e.g., queue length [15, 17], average delay [10, 22]) and image features [9, 16, 22]. They are also different in reward design, including average delay [3, 22], the average travel time [16, 22], and queue length [15].

为了解决以前方法中难以处理的巨大状态空间,最近的研究[15,22]提出使用连续状态表示来应用深度Q学习方法。这些研究学习一个Q函数(例如一个深度神经网络),将状态和动作映射到奖励。这些工作在状态表示上各不相同,包括手工特征(例如队列长度[15,17]、平均延迟[10,22])和图像特征[9,16,22];它们在奖励设计上也不同,包括平均延迟[3,22]、平均行程时间[16,22]和队列长度[15]。

However, all these methods assume relatively static traffic environments, and hence far from the real case. Further, they only focus on rewards and overlook the adaptability of the algorithms to the real traffic. Therefore, they cannot interpret why the learned light signal changes corresponding to the traffic. In this paper, we try to test the algorithms in a more realistic traffic setting, and add more interpretation other than reward.

然而,所有这些方法都假设相对静态的交通环境,因此与真实情况相去甚远。此外,它们只关注奖励,忽略了算法对真实交通的适应性,因此无法解释学习到的信号灯为什么会随交通情况而变化。在本文中,我们尝试在更接近真实的交通设置中测试算法,并在奖励之外加入更多的解释。

3 问题定义

In our problem, we have the environment E as an intersection of two roads (and the traffic on this intersection). There is an intelligent traffic light agent G. To make the notation simpler, we use “N”, “S”, “W”, “E” to represent north, south, west, and east respectively, and use “Red” and “Green” to represent red light and green light correspondingly. A setting of the traffic light is defined as a phase (e.g., green light on the west-east direction which can be simplified as Green-WE). When a light changes from green to red, there is a 3 second yellow light, while the other directions still keep red. So one green light and the subsequent yellow light can be represented together by “Green”. To simplify the problem, we assume there are only two phases of the traffic light, i.e., 1) Green-WE, and 2) Red-WE. Due to the limitation of real-world setting, the traffic light can only change in a specific order (i.e., 1 -> 2 -> 1 -> 2 -> ...). Given the state s (describing the positions and speed of the traffic near this intersection), the goal of the agent G is to give the optimal action a (i.e., whether to change the light to the next phase), so that the reward r (i.e., the smoothness of traffic) can be maximized.

在我们的问题中,环境E是两条道路的交叉口(以及该交叉口上的交通)。有一个智能交通灯代理G。为了简化符号,我们用"N"、"S"、"W"、"E"分别表示北、南、西、东,用"Red"和"Green"分别表示红灯和绿灯。交通灯的一种设置被定义为一个相位(例如,东西方向绿灯,可简写为Green-WE)。当一个灯从绿灯变成红灯时,会有3秒的黄灯,而其他方向仍然保持红灯。因此,一个绿灯和随后的黄灯可以一起用"Green"表示。为了简化问题,我们假设交通灯只有两个相位,即:1)Green-WE,2)Red-WE。由于现实世界设置的限制,交通灯只能按特定顺序变化(即1 -> 2 -> 1 -> 2 -> ...)。给定状态s(描述该交叉口附近交通的位置和速度),代理G的目标是给出最优动作a(即是否将信号灯切换到下一个相位),使得奖励r(即交通的通畅程度)最大化。

表1:记号

4 方法

Traffic light control has attracted a lot of attention in recent years due to its essential role in adjusting traffic. Current methods generally have two categories, conventional methods, and deep reinforcement learning based methods. Conventional methods usually rely on previous knowledge to set fixed time for each light phase or set changing rules. These rules are prone to dynamically changing traffic. Reinforcement learning methods usually take the traffic condition (e.g., queue length of waiting cars and updated waiting time) as state, and try to make actions that can improve the traffic condition based on the current state.

交通灯控制由于其在调节交通中的重要作用,近年来受到了广泛关注。目前的方法通常分为两类:传统方法和基于深度强化学习的方法。传统方法通常依赖先验知识来为每个信号相位设置固定时间或设定切换规则,这些规则难以适应动态变化的交通。强化学习方法通常将交通状况(例如等待车辆的队列长度和最新等待时间)作为状态,并尝试基于当前状态做出能够改善交通状况的动作。

However, the current methods do not consider the complex situations in real case, and hence may lead to stuck in one single kind of action. This will lead to inferior traffic adjusting performance under complex traffic situation.

然而,现有的方法没有考虑真实情况下的复杂状况,因此可能陷入单一类型的动作,这将导致在复杂交通状况下的交通调节性能较差。

In this section, we propose a deep reinforcement traffic light agent to solve this problem. We will first introduce the model framework in Section 4.1. Then, we show the design of agent in Section 4.2. We further describe the network structure in Section 4.3. In addition, we describe the memory palace in Section 4.4. Note that, although our model is designed for a four way intersection with two phases, it is not difficult to extend it to other types of intersections or to multiple phases scenarios.

在本节中,我们提出一个深度强化学习交通灯智能体来解决这个问题。我们将首先在4.1节中介绍模型框架;然后在4.2节中展示智能体的设计;在4.3节中进一步描述网络结构;此外,在4.4节中描述记忆宫殿。请注意,虽然我们的模型是为具有两个相位的四路交叉口设计的,但将其扩展到其他类型的交叉口或多相位场景并不困难。

4.1 框架

Our model is composed of offline part and online part (as shown in Figure 4). We extract five kinds of features describing the traffic conditions as state (detailed in Section 4.2), and use reward to describe how much the action has improved the traffic (detailed in Section 4.2). In offline stage, we set a fixed timetable for the lights, and let traffic go through the system to collect data samples. After training with the samples logged in this stage, the model will be put into the online part. In online stage, at every time interval ∆t, the traffic light agent will observe the state s from the environment and take action a (i.e., whether to change light signal to the next phase) according to ϵ-greedy strategy combining exploration (i.e., random action with probability ϵ) and exploitation (i.e., taking the action with maximum estimated reward). After that, the agent G will observe the environment and get the reward r from it. Then, the tuple (s, a, r) will be stored into memory. After several timestamps (e.g., t2 in Figure 4), agent G will update the network according to the logs in the memory.

我们的模型由离线部分和在线部分组成(如图4所示)。我们提取五种描述交通状况的特征作为状态(详见4.2节),并使用奖励来描述动作对交通的改善程度(详见4.2节)。在离线阶段,我们为信号灯设置固定的时间表,让交通流通过系统以收集数据样本;用这一阶段记录的样本训练之后,模型将被投入在线部分。在在线阶段,在每个时间间隔Δt,交通灯智能体将观察来自环境的状态s,并根据ϵ-贪心策略(结合探索,即以概率ϵ随机选择动作;与利用,即选择估计奖励最大的动作)采取动作a(即是否将信号灯切换到下一个相位)。之后,智能体G将观察环境并从中获得奖励r,然后将元组(s, a, r)存入记忆。在若干时间戳之后(例如图4中的t2),智能体G将根据记忆中的日志更新网络。

图4:模型框架
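下面是对4.1节在线阶段流程的一个示意性草稿(非论文官方实现):按ϵ-贪心选择动作,把(s, a, r)存入记忆,并每隔一段时间用记忆中的样本更新网络。其中env、q_net等对象及其接口均为假设的占位。

```python
import random
from collections import deque

def run_online(env, q_net, epsilon=0.05, delta_t=5, update_every=300, steps=3600):
    """在线阶段示意:env 与 q_net 的接口均为假设的占位对象。"""
    memory = deque(maxlen=20000)                     # 经验记忆
    for step in range(steps):
        s = env.observe()                            # 观察当前状态(假设接口)
        if random.random() < epsilon:                # 探索:以概率 ϵ 随机选动作
            a = random.choice([0, 1])
        else:                                        # 利用:选估计奖励最大的动作
            a = max([0, 1], key=lambda act: q_net.predict(s, act))
        r = env.step(a, duration=delta_t)            # 执行动作,经过 Δt 后得到奖励(假设接口)
        memory.append((s, a, r))                     # 记录元组 (s, a, r)
        if (step + 1) % update_every == 0:           # 每隔若干时间步,用记忆中的日志更新网络
            batch = random.sample(list(memory), min(len(memory), 256))
            q_net.update(batch)                      # 假设接口
```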

4.2 智能体设计

First, we introduce the state, action and reward representation.

首先,我们介绍状态、动作和奖励的表示。

State. Our state is defined for one intersection. For each lane i at this intersection, the state component includes queue length Li, number of vehicles Vi, updated waiting time of vehicles Wi. In addition, the state includes an image representation of vehicles’ position M, current phase Pc and next phase Pn.

状态。我们的状态是针对一个交叉口定义的。对于该交叉口的每条车道i,状态分量包括队列长度Li、车辆数量Vi、车辆的最新等待时间Wi。此外,状态还包括车辆位置的图像表示M、当前相位Pc和下一相位Pn。
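按照本段对状态的定义,可以把一个路口的状态组织成如下的数据结构草稿(仅为示意,字段的具体维度为假设):

```python
from dataclasses import dataclass
from typing import List
import numpy as np

@dataclass
class IntersectionState:
    queue_length: List[float]    # 每条车道的排队长度 L_i
    num_vehicles: List[float]    # 每条车道的车辆数 V_i
    waiting_time: List[float]    # 每条车道的最新等待时间 W_i
    position_image: np.ndarray   # 车辆位置的图像表示 M
    current_phase: int           # 当前相位 P_c(0 或 1)
    next_phase: int              # 下一相位 P_n
```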

Action. Action is defined as a = 1: change the light to next phase Pn, and a = 0: keep the current phase Pc.

动作。动作定义为:当a = 1时,将信号灯切换到下一个相位Pn;当a = 0时,保持当前相位Pc。

Reward. As is shown in Equation 3, reward is defined as a weighted sum of the following factors:

奖励。如公式3所示,奖励被定义为以下因素的加权和:

(1) Li: Sum of queue length L over all approaching lanes, where L is calculated as the total number of waiting vehicles on the given lane. A vehicle with a speed of less than 0.1 m/s is considered as waiting. 所有进场车道的排队长度L之和,其中L为给定车道上等待车辆的总数,车速小于0.1 m/s的车辆视为等待。

(2) Di: Sum of delay D over all approaching lanes, where the delay Di for lane i is defined in Equation 1, where the lane speed is the average speed of vehicles on lane i, and the speed limit is the maximum speed allowed on lane i: Di = 1 − lane speed / speed limit (1). 所有进场车道的延迟D之和,其中车道i的延迟Di由公式1定义,lane speed为车道i上车辆的平均速度,speed limit为车道i上允许的最大速度。

(3) Wi: Sum of updated waiting time W over all approaching lanes. This equals to the sum of W over all vehicles on approaching lanes. The updated waiting time W for vehicle j at time t is defined in Equation 2. Note that the updated waiting time of a vehicle is reset to 0 every time it moves. For example, if a vehicle's speed is 0.01 m/s from 0s to 15s, 5 m/s from 15s to 30s, and 0.01 m/s from 30s to 60s, Wj is 15 seconds, 0 seconds and 30 seconds when t = 15s, 30s and 60s respectively. 所有进场车道的最新等待时间W之和,等于进场车道上所有车辆的W之和。车辆j在时刻t的最新等待时间W由公式2定义。注意,车辆每次移动时,其最新等待时间都会被重置为0。例如,如果车辆的速度在0s到15s为0.01 m/s,15s到30s为5 m/s,30s到60s为0.01 m/s,则在t = 15s、30s和60s时,Wj分别为15秒、0秒和30秒。

(4) Indicator of light switches C, where C = 0 for keeping the current phase, and C = 1 for changing the current phase. 信号灯切换指示器C,C = 0表示保持当前相位,C = 1表示改变当前相位。

(5) Total number of vehicles N that passed the intersection during time interval ∆t after the last action a. 在上一个动作a之后的时间间隔∆t内通过交叉口的车辆总数N。

(6) Total travel time of vehicles T that passed the intersection during time interval ∆t after the last action a, defined as the total time (in minutes) that vehicles spent on approaching lanes. 在上一个动作a之后的时间间隔∆t内通过交叉口的车辆的总行程时间T,定义为这些车辆在进场车道上花费的总时间(以分钟计)。
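公式3(奖励的加权和)在本笔记中未完整列出。下面按上述六项因素给出一个示意性的计算草稿,其中权重w只是假设的占位值(论文在其表3中给出了具体的奖励系数):

```python
def reward(L, D, W, C, N, T, w=(-0.25, -0.25, -0.25, -5.0, 1.0, 1.0)):
    """按公式3的形式把六项因素加权求和(示意,权重 w 为假设的占位值)。

    L, D, W: 各进场车道的排队长度、延迟、最新等待时间的列表
    C: 是否切换相位(0/1);N: Δt 内通过路口的车辆数;T: 通过车辆的总行程时间(分钟)
    """
    terms = (sum(L), sum(D), sum(W), C, N, T)
    return sum(wi * x for wi, x in zip(w, terms))

# 用法示例(数值为虚构):
print(reward(L=[3, 0, 5], D=[0.6, 0.1, 0.8], W=[40, 0, 65], C=1, N=4, T=1.5))
```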

Hence, given the current state s of the traffic condition, the mission of the agent G is to find the action a (change or keep current phase) that may lead to the maximum reward r in the long run, following the Bellman Equation (Equation 4) [21]. In this situation, the action value function q for time t is the summation of the reward of the next timestamp t + 1 and the maximum potential future reward. Through this conjecture of future, the agent can select action that is more suitable for long-run reward.

因此,给定交通状况的当前状态s,智能体G的任务是依照Bellman方程(公式4)[21],找到能在长期带来最大奖励r的动作a(改变或保持当前相位)。在这种情况下,时刻t的动作价值函数q是下一时间戳t+1的奖励与最大潜在未来奖励之和。通过这种对未来的推测,智能体可以选择更有利于长期奖励的动作。
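文中引用的Bellman方程(公式4)未在本笔记中给出。按本段的描述("时刻t的动作价值函数q是下一时间戳t+1的奖励与最大潜在未来奖励之和"),可以写成如下的标准Q-learning形式(γ为折扣因子,具体记号以原论文为准):

```latex
% 标准 Q-learning 形式的 Bellman 方程(示意)
q(s_t, a) = r_{t+1} + \gamma \max_{a'} q(s_{t+1}, a')
```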

4.3 网络结构

In order to estimate the reward based on the state, and action, the agent needs to learn a Deep Q-Network Q(s, a).

为了基于状态和动作估计奖励,智能体需要学习一个深度Q网络Q(s, a)。

In the real-world scenario, traffic is very complex and contain many different cases need to be considered separately. We will illustrate this in Example 4.1.

在现实场景中,交通非常复杂,包含许多需要分别考虑的不同情况。我们将在例4.1中说明这一点。

Example 4.1. We still assume a simple intersection with two phase light transition here: 1) Green-WE(东西方向), and 2) Red-WE. The decision process of whether to change the traffic light consists of two steps. The first step is the mapping from traffic condition (e.g., how many cars are waiting, how long has each car been waiting) to a partial reward. An example of this mapping could be r = −0.5×L−0.7×W. This is shared by different phases, no matter which lane the green light is on. Then, to determine the action, the agent should watch on the traffic in different lanes during different phases. For instance, as is shown in Figure 3 (a), when the red light is on the NS direction, more waiting traffic (i.e., lower reward in the first step) on the NS direction will make the light tend to change (because by changing the light on this lane from red to green, more cars on this lane can pass through this intersection), while more waiting traffic (i.e., lower reward in the first step) on the WE direction will make the light tend to keep. When the red light is on the WE direction, the case is right the opposite. Therefore, the light phase should have an explicit selection on features.

例4.1。我们仍然假设一个具有两相位切换的简单交叉口:1)Green-WE(东西方向绿灯),2)Red-WE。是否切换交通灯的决策过程包括两个步骤。第一步是从交通状况(例如,有多少辆车在等待,每辆车等待了多长时间)到部分奖励的映射,这一映射的一个例子是 r = −0.5×L − 0.7×W;该映射由不同相位共享,无论绿灯在哪条车道上。然后,为了确定动作,智能体应当在不同相位下关注不同车道上的交通。例如,如图3 (a)所示,当NS方向为红灯时,NS方向上等待的车流越多(即第一步得到的奖励越低),信号灯就越倾向于切换(因为把这条车道上的灯由红变绿后,这条车道上更多的车可以通过该路口);而WE方向上等待的车流越多(即第一步得到的奖励越低),信号灯就越倾向于保持。当红灯在WE方向时,情况正好相反。因此,信号灯相位应当对特征进行显式的选择。

图3:案例A和案例B除了交通灯相位外,拥有相同的环境 

In previous studies, due to the simplified design of the model for approximating Q-function under complex traffic condition, agents are having difficulties in distinguishing the decision process for different phases. Therefore, we hereby propose a network structure that can explicitly consider the different cases explicitly. We call this special sub-structure “Phase Gate”.

在以往的研究中,由于在复杂交通条件下用于近似Q函数的模型设计过于简化,智能体难以区分不同相位下的决策过程。因此,我们在此提出一种能够显式考虑不同情况的网络结构。我们称这种特殊的子结构为"相位门"(Phase Gate)。

Our whole network structure can be shown as in Figure 5. The image features are extracted from the observations of the traffic condition and fed into two convolutional layers. The output of these layers are concatenated with the four explicitly mined features, queue length L, updated waiting time W, phase P and number of total vehicles V. The concatenated features are then fed into fully-connected layers to learn the mapping from traffic conditions to potential rewards. Then, for each phase, we design a separate learning process of mapping from rewards to the value of making decisions Q(s, a). These separate processes are selected through a gate controlled by the phase. As shown in Figure 5, when phase P = 0, the left branch will be activated, while when phase P = 1, the right branch will be activated. This will distinguish the decision process for different phases, prevent the decision from favoring certain action, and enhance the fitting ability of the network.

我们的整体网络结构如图5所示。从交通状况的观察中提取图像特征,并输入两个卷积层。这些层的输出与四个显式提取的特征拼接在一起:队列长度L、最新等待时间W、相位P和车辆总数V。拼接后的特征随后被输入全连接层,以学习从交通状况到潜在奖励的映射。然后,对于每个相位,我们设计了一个独立的、从奖励映射到决策价值Q(s, a)的学习过程,这些独立的过程通过一个由相位控制的门来选择。如图5所示,当相位P = 0时,左侧分支被激活;当相位P = 1时,右侧分支被激活。这样可以区分不同相位的决策过程,防止决策偏向某一动作,并增强网络的拟合能力。
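下面是图5网络结构(两层卷积提取图像特征、与显式特征拼接、再由相位门选择左右分支输出Q(s, a))的一个PyTorch示意草稿,并非论文的官方实现,各层的通道数、卷积核大小和隐层维度均为假设值:

```python
import torch
import torch.nn as nn

class PhaseGateQNet(nn.Module):
    """示意:图像经两层卷积,与显式特征拼接,再按相位 P 选择左右分支输出 Q(s, a)。"""

    def __init__(self, extra_dim=25, n_actions=2):
        super().__init__()
        self.conv = nn.Sequential(                   # 两个卷积层提取图像特征
            nn.Conv2d(1, 16, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=2, stride=1), nn.ReLU(),
            nn.Flatten(),
        )
        self.shared = nn.LazyLinear(64)              # 拼接后共享的全连接层
        # 每个相位一个独立分支,由相位门选择
        self.branches = nn.ModuleList([nn.Linear(64, n_actions) for _ in range(2)])

    def forward(self, image, extra, phase):
        """image: (B,1,H,W);extra: (B,extra_dim) 显式特征;phase: (B,) 取值0/1"""
        x = torch.cat([self.conv(image), extra], dim=1)
        h = torch.relu(self.shared(x))
        q0, q1 = self.branches[0](h), self.branches[1](h)
        gate = phase.float().unsqueeze(1)            # 相位门:P=0 走左分支,P=1 走右分支
        return (1 - gate) * q0 + gate * q1
```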

4.4 记忆宫殿和模型更新

Periodically, the agent will take samples from the memory and use them to update the network. This memory is maintained by adding the new data samples in and removing the old samples occasionally. This technique is noted as experience replay [19] and has been widely used in reinforcement learning models.

智能体会定期从记忆中抽取样本,并用它们来更新网络。这一记忆通过不断加入新的数据样本并不时删除旧样本来维护。这种技术被称为经验回放[19],已被广泛应用于强化学习模型中。

However, in the real traffic setting, traffic on different lanes can be really imbalanced. As previous methods [9, 10, 15, 22] store all the state-action-reward training samples in one memory, this memory will be dominated by the phases and actions that appear most frequently in imbalanced settings. Then, the agent will be learned to estimate the reward for these frequent phase-action combinations well, but ignore other less frequent phase-action combinations. This will cause the learned agent to make bad decisions on the infrequent phase-action combinations. Therefore, when traffic on different lanes are dramatically different, these imbalanced samples will lead to inferior performance on less frequent situation.

然而,在真实交通环境中,不同车道上的交通可能非常不平衡。由于以往的方法[9,10,15,22]将所有的状态-动作-奖励训练样本存储在同一个记忆中,该记忆会被不平衡环境中出现最频繁的相位和动作所主导。于是,智能体会学会很好地估计这些高频相位-动作组合的奖励,但忽略其他较低频的相位-动作组合,从而在低频的相位-动作组合上做出糟糕的决策。因此,当不同车道的交通差异很大时,这些不平衡的样本会导致在较少出现的情形下性能变差。

Inspired by Memory Palace theory [11, 14] in cognitive psychology, we can solve this imbalance by using different memory palaces for different phase-action combinations. As shown in Figure 6, training samples for different phase-action combinations are stored into different memory palaces. Then same number of samples will be selected from different palaces. These balanced samples will prevent different phase-action combinations from interfering each other’s training process, and hence, improve the fitting capability of the network to predict the reward accurately.

受认知心理学中记忆宫殿理论[11,14]的启发,我们可以通过为不同的相位-动作组合使用不同的记忆宫殿来解决这种不平衡。如图6所示,不同相位-动作组合的训练样本被存储在不同的记忆宫殿中,然后从各个宫殿中选取相同数量的样本。这些均衡的样本可以防止不同的相位-动作组合干扰彼此的训练过程,从而提高网络的拟合能力,准确地预测奖励。
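下面是记忆宫殿(按相位-动作组合分别存储、均衡采样)的一个示意性草稿(非论文官方实现),容量与每个宫殿的采样数均为假设值:

```python
import random
from collections import deque

class MemoryPalace:
    """为每个 (相位, 动作) 组合维护独立记忆,并从各记忆中等量采样(示意)。"""

    def __init__(self, phases=(0, 1), actions=(0, 1), capacity=5000):
        self.palaces = {(p, a): deque(maxlen=capacity) for p in phases for a in actions}

    def add(self, phase, action, sample):
        self.palaces[(phase, action)].append(sample)   # 按相位-动作组合分开存储

    def sample(self, per_palace=64):
        batch = []
        for palace in self.palaces.values():           # 从每个记忆宫殿取相同数量的样本
            if palace:
                k = min(per_palace, len(palace))
                batch.extend(random.sample(list(palace), k))
        random.shuffle(batch)
        return batch
```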

5 实验

In this section, we conduct experiments using both synthetic and real-world traffic data. We show a comprehensive quantitative evaluation by comparing with other methods and also show some interesting case studies.

在本节中,我们将使用合成和真实交通数据进行实验。通过与其他方法的比较,我们给出了一个全面的定量评价,并给出了一些有趣的案例研究。

5.1 实验设置

The experiments are conducted on a simulation platform SUMO (Simulation of Urban MObility). SUMO provides flexible APIs for road network design, traffic volume simulation and traffic light control. Specifically, SUMO can control the traffic moving according to the given policy of traffic light (obtained by the traffic light agent).

实验是在仿真平台SUMO (Simulation of Urban MObility)上进行的。SUMO为道路网络设计、交通流量模拟和交通灯控制提供了灵活的API。具体来说,SUMO可以根据给定的交通灯策略(由交通灯智能体给出)控制交通的移动。
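作为补充,下面给出一个用SUMO的TraCI接口读取排队车辆数并周期性切换相位的极简示意。其中配置文件名cross.sumocfg、信号灯ID "0"、相位总数等均为假设,切换逻辑也只是演示,并非论文的控制方法:

```python
import traci

def run_demo(cfg="cross.sumocfg", tls_id="0", steps=3600):
    """用 TraCI 驱动仿真:每步读取排队车辆数,并按固定间隔切换相位(示意)。"""
    traci.start(["sumo", "-c", cfg])                   # 启动 SUMO 仿真(配置文件名为假设)
    try:
        for t in range(steps):
            traci.simulationStep()                      # 推进一个仿真步
            lanes = set(traci.trafficlight.getControlledLanes(tls_id))
            halting = sum(traci.lane.getLastStepHaltingNumber(l) for l in lanes)
            if t % 60 == 0:
                print(f"t={t}s, 等待车辆数={halting}")
            if t > 0 and t % 40 == 0:                   # 这里简单地每40秒切到下一相位
                cur = traci.trafficlight.getPhase(tls_id)
                traci.trafficlight.setPhase(tls_id, (cur + 1) % 4)   # 假设共有4个相位
    finally:
        traci.close()
```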

The environment for the experiments on synthetic data is a fourway intersection as Figure 2. The intersection is connected with four road segments of 150-meters long, where each road have three incoming and three outgoing lanes. The traffic light in this part of experiment contains two phases: (1) Green-WE (green light on WE with red light on SN), (2) Red-WE (red light on WE with green light on SN). Note that when a green light is on one direction, there is a red light on the other direction. Also, a green light is followed by a 3-second yellow light before it turns to red light. Although this is a simplification of the real world scenario, the research of more types of intersections (e.g., three-way intersection), and more complex light phasing (e.g., with left-turn phasing) can be further conducted in similar way.

合成数据实验的环境是如图3所示的四路交叉口。交叉口连接着四条150米长的路段,每条道路有三条驶入车道和三条驶出车道。这部分实验中的交通灯包含两个相位:(1)Green-WE(WE方向绿灯,SN方向红灯),(2)Red-WE(WE方向红灯,SN方向绿灯)。请注意,当一个方向为绿灯时,另一个方向为红灯;并且绿灯变为红灯之前会有3秒的黄灯。虽然这是对真实世界场景的简化,但对更多类型的交叉口(如三岔路口)和更复杂的信号相位(如带左转相位)的研究可以用类似的方式进一步开展。

图3:案例A和案例B除了交通灯相位外,拥有相同的环境

5.2 参数设置

The parameter setting and reward coefficients for our methods are shown in Table 2 and Table 3 respectively. We found out that the action time interval ∆t has minimal influence on performance of our model as long as ∆t is between 5 seconds and 25 seconds.

我们方法的参数设置和奖励系数分别如表2和表3所示。我们发现,只要∆t在5秒到25秒之间,动作时间间隔∆t对我们模型的性能影响很小。

5.3 评估指标

We evaluate the performance of different methods using the following measures:

我们使用以下指标来评估不同方法的性能:

Reward: average reward over time. Defined in Equation 3, the reward is a combination of several terms (positive and negative terms), therefore, the range of reward is from −∞ to ∞. Under specific configuration, there will be an upper bound for the reward when all cars move freely without any stop or delay. 奖励:随时间的平均奖励。如公式3所定义,奖励是若干项(正项和负项)的组合,因此奖励的取值范围为−∞到∞。在特定配置下,当所有车辆自由行驶、没有任何停车或延迟时,奖励会有一个上限。

Queue length: average queue length over time, where the queue length at time t is the sum of L (defined in Section 4.2) over all approaching lanes. A smaller queue length means there are fewer waiting vehicles on all lanes. 队列长度:随时间的平均队列长度,其中t时刻的队列长度是所有进场车道的L(定义见4.2节)之和。队列长度越小,意味着所有车道上等待的车辆越少。

Delay: average delay over time, where the delay at time t is the sum of D (defined in Equation 1) of all approaching lanes. A lower delay means a higher speed of all lanes. 延迟:随时间的平均延迟,其中t时刻的延迟是所有进场车道的D(定义见公式1)之和。延迟越低,意味着各车道的行驶速度越高。

Duration: average travel time vehicles spent on approaching lanes (in seconds). It is one of the most important measures that people care when they drive on the road. A smaller duration means vehicles spend less time passing through the intersection. 持续时间:车辆在进场车道上的平均行驶时间(秒)。这是人们在道路上驾驶时最关心的指标之一。持续时间越短,意味着车辆通过交叉口所花的时间越少。

In summary, a higher reward indicates a better performance of the method, and a smaller queue length, delay and duration indicates the traffic is less jammed.

综上所述,奖励越高表明该方法的性能越好;队列长度、延迟和持续时间越小,表明交通越不拥堵。

5.4 比较方法

To evaluate the effectiveness of our model, we compare our model with the following baseline methods, and tune the parameter for all methods. We then report their best performance.

为了评估我们的模型的有效性,我们将我们的模型与下列基线方法进行比较,并为所有方法调整参数。然后我们报告他们的最佳表现。

Fixed-time Control (FT). Fixed-time control method use a pre-determined cycle and phase time plan [18] and is widely used when the traffic flow is steady.固定时间控制。固定时间控制方法采用预先确定的周期和相位时间计划[18],广泛应用于交通流稳定的情况。

Self-Organizing Traffic Light Control (SOTL) [5]. This method controls the traffic light according to the current traffic state, including the eclipsed time and the number of vehicles waiting at the red light. Specifically, the traffic light will change when the number of waiting cars is above a hand-tuned threshold. 自组织交通灯控制(SOTL)。该方法根据当前的交通状态(包括当前相位已经过的时间和等待红灯的车辆数量)来控制交通灯。具体来说,当等待车辆的数量超过手工调节的阈值时,交通灯就会切换。

Deep Reinforcement Learning for Traffic Light Control (DRL). Proposed in [22], this method applies DQN framework to select optimal light configurations for traffic intersections. Specifically, it solely relies on the original traffic information as an image. 基于深度强化学习的交通灯控制(DRL)。该方法在[22]中提出,应用DQN框架为交叉口选择最优的信号灯配置。具体来说,它仅依赖以图像形式表示的原始交通信息。

In addition to the baseline methods, we also consider several variations of our model.

除了基线方法之外,我们还考虑了本文提出的模型的几个变体。

IntelliLight (Base). Using the same network structure and reward function defined as in Section 4.2 and 4.3. This method is without Memory Palace and Phase Gate.使用4.2和4.3节中定义的相同的网络结构和奖励函数。这种方法没有记忆宫殿和相位门。

IntelliLight (Base+MP). By adding Memory Palace in psychology to IntelliLight-Base, we store the samples from different phase and time in separate memories. 通过在IntelliLight (Base)中加入心理学中的记忆宫殿,我们将不同相位和不同时间的样本存储在不同的记忆中。

IntelliLight (Base+MP+PG). This is the model adding two techniques (Memory Palace and Phase Gate).这是添加了两种技术(记忆宫殿和相位门)的模型。

5.5 数据集

5.5.1 Synthetic data.

In the first part of our experiment, synthetic data is used with four traffic flow settings: simple changing traffic (configuration 1), equally steady traffic (configuration 2), unequally steady traffic (configuration 3) and complex traffic (configuration 4) which is a combination of previous three configurations. As is shown in Table 4, the arriving of vehicles are generated by Poisson distribution with certain arrival rates.

5.5.1 合成数据。

在实验的第一部分,我们使用了四种交通流设置的合成数据:简单变化的交通(配置1)、双向均等的稳定交通(配置2)、双向不均等的稳定交通(配置3),以及由前三种配置组合而成的复杂交通(配置4)。如表4所示,车辆的到达由具有一定到达率的泊松分布生成。
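泊松到达可以通过指数分布的到达间隔来生成。下面是一个示意性草稿,其中到达率数值为随意假设,与表4中的具体配置无关:

```python
import numpy as np

def poisson_arrivals(rate_per_sec, duration_sec, seed=0):
    """按泊松过程生成车辆到达时刻:到达间隔服从参数为 rate 的指数分布(示意)。"""
    rng = np.random.default_rng(seed)
    times, t = [], 0.0
    while True:
        t += rng.exponential(1.0 / rate_per_sec)   # 指数分布的到达间隔
        if t > duration_sec:
            break
        times.append(t)
    return times

# 用法示例:假设某方向到达率为每秒0.2辆,生成一小时内的到达时刻
arrivals = poisson_arrivals(rate_per_sec=0.2, duration_sec=3600)
print(len(arrivals))   # 期望约为 720 辆
```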

5.5.2 Real-world data.

The real-world dataset is collected by 1,704 surveillance cameras in Jinan, China over the time period from 08/01/2016 to 08/31/2016. The locations of these cameras are shown in Figure 7. Gathered every second by the cameras facing towards vehicles near intersections, each record in the dataset consists of time, camera ID and the information about vehicles. By analyzing these records with camera locations, the trajectories of vehicles are recorded when they pass through road intersections. The dataset covers 935 locations, where 43 of them are four-way intersections. We use the number of vehicles passing through 24 intersections as traffic volume for experiments since only these intersections have consecutive data. Then we feed this real-world traffic setting into SUMO as online experiments. It can be seen from Table 5 that traffic flow on different roads are dynamically changing in the real world.

5.5.2 真实数据。

真实世界数据集由中国济南的1,704个监控摄像头收集,时间跨度为2016年8月1日至2016年8月31日。这些摄像头的位置如图7所示。数据由路口附近面向车辆的摄像头每秒采集一次,数据集中的每条记录包括时间、摄像头ID和车辆信息。通过将这些记录与摄像头位置相结合进行分析,可以得到车辆通过路口时的轨迹。该数据集覆盖935个地点,其中43个是四岔路口。由于只有24个交叉口拥有连续的数据,我们使用通过这24个交叉口的车辆数作为实验的交通量,然后把这一真实交通设置输入SUMO进行在线实验。从表5可以看出,真实世界中不同道路上的交通流量是动态变化的。

5.6 基于合成数据的性能表现

5.6.1 Comparison with state-of-the-art methods

We first compare our method with three other baselines under different synthetic traffic settings. From Table 6, 7, 8 and 9 we can see that our method performs better than all other baseline methods in configurations 1, 2, 3 and 4. Although some baselines perform well on certain setting, they perform badly in other configurations (e.g., SOTL achieves good rewards under configuration 1, almost the same as our method in 3 digit floats. This is because our method has learned to keep the light until 36000 s and switch the light after that, and SOTL is also designed to behave similarly. Hence, these two methods perform very similar). On the contrary, our method IntelliLight shows better performance under different configurations.

5.6.1 与最先进方法的比较

我们首先在不同的合成交通设置下将我们的方法与其他三个基线进行比较。从表6、7、8和9中可以看到,在配置1、2、3和4中,我们的方法均优于所有其他基线方法。虽然某些基线在特定设置下表现良好,但它们在其他配置下表现很差(例如,SOTL在配置1下获得了很好的奖励,保留三位小数时几乎与我们的方法相同。这是因为我们的方法学会了把信号灯保持到36000秒、之后再切换,而SOTL的设计也会产生类似的行为,因此这两种方法表现非常相似)。相比之下,我们的方法IntelliLight在不同配置下都表现出更好的性能。

5.6.2 Comparison with variants of our proposed method.

Table 6, 7, 8 and 9 show the performance of variants of our proposed method. First, we can see that adding Memory Palace helps achieve higher reward under configuration 3 and 4, although it does not boost the reward under configuration 1 and 2. This is because for the simple case (configuration 1 and 2), the phase is relatively steady for a long time (because the traffic only comes from one direction or keeps not changing in a long time). Therefore, the memory palace does not help in building a better model for predicting the reward. Further adding Phase Gate also reduces the queue length in most cases and achieves highest reward, demonstrating the effectiveness of these two techniques.

5.6.2 与本方法各种变体的比较

表6、7、8和9展示了我们所提方法各变体的性能。首先可以看到,加入记忆宫殿有助于在配置3和4下获得更高的奖励,尽管它并没有提升配置1和2下的奖励。这是因为在简单情形(配置1和2)下,相位在很长一段时间内相对稳定(因为交通只来自一个方向,或长时间保持不变),因此记忆宫殿无助于建立一个更好的奖励预测模型。进一步加入相位门在大多数情况下还能减少队列长度,并获得最高的奖励,这证明了这两种技术的有效性。

5.6.3 Interpretation of learned signal.

To understand what our method have learned w.r.t. dynamic traffic conditions, we show the percentage of duration for phase Green-WE (i.e., green light on WE direction with red light on SN direction), along with the ratio of traffic flow on WE over total traffic flow from all directions. With the changing of traffic, an ideal traffic light control method would be able to adjust its phase duration to traffic flows and get high reward. For example, as traffic changes from direction WE to SN, the traffic light agent is expected to adjust its phase duration from giving WE green light to giving SN green light. As we can see from Figure 8, IntelliLight can adjust its phase duration as the traffic changes.

5.6.3 所学信号的解读

为了理解我们的方法针对动态变化的交通状况学到了什么,我们展示了Green-WE相位(即WE方向绿灯、SN方向红灯)持续时间所占的百分比,以及WE方向交通流量占所有方向总交通流量的比例。随着交通的变化,理想的交通灯控制方法应能根据交通流量调整其相位时长,从而获得较高的奖励。例如,当交通从WE方向转移到SN方向时,我们期望交通灯智能体将其相位时长从给WE绿灯调整为给SN绿灯。从图8可以看出,IntelliLight能够随着交通的变化调整其相位时长。

5.7 基于真实世界数据的性能表现

5.7.1 Comparison of different methods.

In this section, we compare our method with baseline methods on real-world data. The overall results are shown in Table 10. Our method IntelliLight achieves the best reward, queue length, delay and duration over all the compared methods, with a relative improvement of 32%, 38%, 19% and 22% correspondingly over the best baseline method. In addition, our method has a relatively steady performance over multiple intersections (small standard deviation).

5.7.1不同方法的比较。

在本节中,我们在真实世界数据上将我们的方法与基线方法进行比较。总体结果如表10所示。在所有被比较的方法中,我们的方法IntelliLight在奖励、队列长度、延迟和持续时间上都取得了最佳结果,相对于最佳基线方法分别提升了32%、38%、19%和22%。此外,我们的方法在多个交叉口上表现相对稳定(标准差较小)。

表10:不同方法在真实世界数据上的性能表现。在±后的数字表示标准偏差。奖励:越高越好;其他指标:越低越好。

5.7.2 Observations with respect to real traffic.

In this section, we make observations on the policies we learned from the real data. We analyze the learned traffic light policy for the intersection of Jingliu Road (WE direction) and Erhuanxi Auxiliary Road (SN direction) under different scenarios: peak hours vs. non-peak hours, weekdays vs. weekends, and major arterial vs. minor arterial.

5.7.2 对真实流量的观测。

在本节中,我们对从真实数据中学到的策略进行观察。我们分析了Jingliu路(东西方向)与二环西辅路(南北方向)交叉口在不同场景下学到的交通灯策略:高峰时段与非高峰时段、工作日与周末、主干道与次干道。

Peak hour vs. Non-peak hour. Figure 9 (a) shows the average traffic flow from both directions (WE and SN) on a Monday. On this day, there is more traffic on WE direction than SN for most of the time, during which an ideal traffic light control method is expected to give longer time for WE direction. It can be seen from Figure 9 (c) that, the ratio of the time duration for phase Green-WE (i.e., green light on WE, while red light on SN) is usually larger than 0.5, which means for most of the time, our method gives longer time for WE. And during peak hours (around 7:00, 9:30 and 18:00), the policies learned from our method also give longer time for green light on WE than non-peak hours. In early morning, the vehicle arrival rates on SN are larger than the rates on WE, and our method automatically gives longer time to SN. This shows our method can intelligently adjust to different traffic conditions. 高峰时段与非高峰时段。图9 (a)显示了周一两个方向(WE和SN)的平均交通流量。这一天的大部分时间里,WE方向的交通量都多于SN方向,在这段时间内,理想的交通灯控制方法应给WE方向更长的时间。从图9 (c)可以看出,Green-WE相位(即WE方向绿灯、SN方向红灯)持续时间所占的比例通常大于0.5,这意味着在大多数时间里,我们的方法给WE方向更长的时间。而在高峰时段(7:00、9:30和18:00左右),我们的方法学到的策略给WE方向绿灯的时间也比非高峰时段更长。在清晨,SN方向的车辆到达率大于WE方向,我们的方法会自动给SN方向更长的时间。这表明我们的方法能够智能地适应不同的交通状况。

Weekday vs. Weekend. Unlike weekdays, weekend shows different patterns about traffic condition and traffic light control policies. Our policy gives less green light on WE (more green light on SN) during weekend daytime than it gives on weekday. This is because there is more traffic on SN than on WE during weekend daytime in Figure 9 (b), while during weekday traffic on SN is less than on WE. Besides, by comparing Figure 9 (a) with Figure 9 (b), we can see that the traffic of WE and SN during late night time on Monday is similar, making the ratio of duration Green-WE close to 0.5. 工作日与周末。与工作日不同,周末的交通状况和交通灯控制策略呈现出不同的模式。在周末白天,学到的策略给WE方向的绿灯时间比工作日更少(给SN方向的绿灯更多)。这是因为如图9 (b)所示,周末白天SN方向的交通量大于WE方向,而工作日SN方向的交通量小于WE方向。此外,通过对比图9 (a)和图9 (b)可以看出,周一深夜WE和SN方向的交通量相近,使得Green-WE相位时长的占比接近0.5。

Major arterial vs. Minor arterial. Major arterials are roads that have higher traffic volume within a period, and are expected to have a longer green light time. Without prior knowledge about major arterial, learned traffic light control policy using our method prefer giving the major arterial green light (including keeping the green light already on major arterial, and tend to switching red light to green light for major arterial). Specifically, we look into three periods of time (3:00, 12:00 and 23:30) of August 1st. From Figure 9 (a), we can tell that the road on WE direction is the main road, since traffic on WE is usually heavier than traffic on SN. As is shown in Figure 10, the dotted lines indicates the number of arriving cars for every second on two different directions. Along with the arrival rate, we also plot the change of phases (dashed area). It can be seen from Figure 10 (a) that: 1) the overall time period of phase Red-WE is longer than Green-WE, which is compatible with traffic volume at this time. 2) although the traffic volume of SN is larger than WE, the traffic light change from Green-WE to Red-WE is usually not triggered by waiting cars on SN direction. On the contrary, in Figure 10 (b) and Figure 10 (c), the change from Green-WE to Red-WE is usually triggered by waiting cars on SN direction. This is mainly because the road on WE is the main road during these time periods, and the traffic light tends to favor phase Green-WE. 主干道与次干道。主干道是指在一段时间内交通量较大的道路,预期应获得更长的绿灯时间。在没有关于主干道的先验知识的情况下,用我们的方法学到的交通灯控制策略更倾向于给主干道绿灯(包括保持主干道上已有的绿灯,以及倾向于将主干道的红灯切换为绿灯)。具体来说,我们考察了8月1日的三个时间段(3:00、12:00和23:30)。从图9 (a)可以看出,WE方向的道路是主干道,因为WE方向的交通量通常大于SN方向。如图10所示,点线表示两个不同方向每秒到达的车辆数;在到达率之外,我们还绘制了相位的变化(虚线区域)。从图10 (a)可以看出:1)Red-WE相位的总时长比Green-WE更长,这与此时的交通量相符;2)虽然SN方向的交通量大于WE方向,但从Green-WE到Red-WE的切换通常不是由SN方向的等待车辆触发的。相反,在图10 (b)和图10 (c)中,从Green-WE到Red-WE的切换通常由SN方向的等待车辆触发。这主要是因为在这些时间段里WE方向的道路是主干道,交通灯倾向于Green-WE相位。

6 结论

In this paper, we address the traffic light control problem using a well-designed reinforcement learning approach. We conduct extensive experiments using both synthetic and real world experiments and demonstrate the superior performance of our proposed method over state-of-the-art methods. In addition, we show in-depth case studies and observations to understand how the agent adjust to the changing traffic, as a complement to quantitative measure on rewards. These in-depth case studies can help generate traffic rules for real world application.

在本文中,我们使用一种精心设计的强化学习方法来解决交通信号灯控制问题。我们使用合成数据和真实世界数据进行了大量实验,并证明了我们提出的方法优于当前最先进的方法。此外,作为对奖励这一定量指标的补充,我们还展示了深入的案例研究和观察,以理解智能体如何适应不断变化的交通。这些深入的案例研究有助于为现实世界的应用生成交通规则。

We also acknowledge the limitations of our current approach and would like to point out several important future directions to make the method more applicable to real world. First, we can extend the two-phase traffic light to multi-phase traffic light, which will involve more complicated but more realistic state transition. Second, our paper addresses a simplified one intersection case, whereas the real world road network is much more complicated than this. Although some studies have tried to solve the multi-intersection problem by using multiple reinforcement learning agents, they do not explicitly consider the interactions between different intersections (i.e., how can the phase of one intersection affect the state of nearby intersections) and they are still limited to small number of intersections. Lastly, our approach is still tested on a simulation framework and thus the feedback is simulated. Ultimately, a field study should be conducted to learn the real-world feedback and to validate the proposed reinforcement learning approach.

我们也承认当前方法的局限性,并指出几个使该方法更适用于现实世界的重要未来方向。首先,我们可以将两相位交通灯扩展到多相位交通灯,这将涉及更复杂但更接近现实的状态转换。其次,本文处理的是简化的单交叉口情形,而现实世界的道路网络要比这复杂得多。虽然一些研究尝试使用多个强化学习智能体来解决多交叉口问题,但它们没有显式考虑不同交叉口之间的相互作用(即一个交叉口的相位如何影响附近交叉口的状态),并且仍然局限于少量交叉口。最后,我们的方法目前只在仿真框架上进行了测试,因此反馈也是仿真得到的;最终应当开展实地研究,以获取真实世界的反馈并验证所提出的强化学习方法。
