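Compared with reinforcement learning (RL) methods, imitation learning methods do not require experts to hand-design a suitable reward function. They imitate samples demonstrated by experts and thereby learn policies comparable to those of humans. Imitation learning based on generative adversarial networks (GANs-IL) is an important class of imitation learning methods. It extends imitation learning based on inverse reinforcement learning (IRL-IL) to more complex, large-scale problems, which enables imitation learning to tackle real-world applications. Moreover, as generative adversarial networks (GANs), RL, and related techniques continue to develop, the problems encountered in GANs-IL, such as mode collapse and the low utilization efficiency of generated samples, will be addressed more effectively. GANs-IL will then be able to solve practical problems stably and effectively.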
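This paper first introduced the core idea of generative adversarial imitation learning (GAIL), the earliest and most representative GANs-IL method, and then analyzed the problems it faces, namely mode collapse and the low utilization efficiency of generated samples. Starting from these two problems, it surveyed recent improvements: for mode collapse, methods that incorporate GANs techniques; for the low utilization efficiency of generated samples, methods that incorporate RL and related techniques. Finally, it surveyed extensions of GANs-IL to different observation settings and to multi-agent scenarios.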
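For reference, the core idea of GAIL mentioned above is usually written as a saddle-point problem, following Ho and Ermon [20]. The notation below is the standard one and is assumed here rather than taken from this section: $\pi$ is the learner's policy (the generator), $\pi_E$ the expert policy, $D$ the discriminator over state-action pairs, $H(\pi)$ the causal entropy of the policy, and $\lambda \ge 0$ its weight.

```latex
\min_{\pi} \max_{D} \;
  \mathbb{E}_{\pi}\bigl[\log D(s,a)\bigr]
  + \mathbb{E}_{\pi_E}\bigl[\log\bigl(1 - D(s,a)\bigr)\bigr]
  - \lambda H(\pi)
```

Read this way, both problems analyzed in the survey are visible in the objective: the policy can satisfy the discriminator while covering only part of the expert's behavior (mode collapse), and each expectation under $\pi$ has to be estimated from freshly generated trajectories (low utilization efficiency of generated samples).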
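In summary, we see the following prospects for GANs-IL: (1) as GANs techniques develop, mode collapse in GANs-IL will be further alleviated, and the adversarial training game between the two networks will become more stable and easier to train [100]; (2) as RL techniques develop, GANs-IL will find deeper application in multi-agent settings, partially observable Markov decision processes [101], and related areas; (3) as deep learning develops, GANs-IL will gain stronger representational power and become applicable to practical problems that require perceiving complex states. Beyond this, adversarial learning will be used more widely in decision-making problems, for example by incorporating adversarial imitation learning into RL [102], or by using adversarial learning to make RL more generalizable and robust [103].
Chinese Journal of Computers, 2020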
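Acknowledgments  We thank Professor Yu Yang of Nanjing University for taking part in the discussions of this paper and for offering many suggestions for revision.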
References
[1] Silver D, Huang A, Maddison C J, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 2016, 529(7587): 484-489
[2] Abbeel P, Ng A Y. Apprenticeship learning via inverse reinforcement learning//Proceedings of the 21st International Conference on Machine Learning (ICML). Banff, Canada, 2004: 1-8
[3] Sutton R S, Barto A G. Reinforcement Learning: An Introduction. Cambridge, USA: MIT Press, 1998
[4] Liu Quan, Zhai Jian-Wei, Zhang Zong-Zhang, et al. A survey on deep reinforcement learning. Chinese Journal of Computers, 2018, 41(1): 1-27 (in Chinese)
(刘全, 翟建伟, 章宗长等. 深度强化学习综述. 计算机学报, 2018, 41(1): 1-27)
[5] Gao Yang, Chen Shi-Fu, Lu Xin. Research on reinforcement learning technology: A review. Acta Automatica Sinica, 2004, 30(1): 86-100 (in Chinese)
(高阳, 陈世福, 陆鑫. 强化学习研究综述. 自动化学报, 2004, 30(1): 86-100)
[6] Zhao Dong-Bin, Shao Kun, Zhu Yuan-Heng, et al. Review of deep reinforcement learning and discussions on the development of computer Go. Control Theory and Applications, 2016, 33(6): 701-717 (in Chinese)
(赵冬斌, 邵坤, 朱圆恒等. 深度强化学习综述: 兼论计算机围棋的发展. 控制理论与应用, 2016, 33(6): 701-717)
[7] Silver D, Schrittwieser J, Simonyan K, et al. Mastering the game of Go without human knowledge. Nature, 2017, 550(7676): 354-359
[8] Mnih V, Kavukcuoglu K, Silver D, et al. Playing Atari with deep reinforcement learning//Proceedings of the Workshops at the 27th Neural Information Processing Systems (NIPS). Lake Tahoe, USA, 2013: 201-220
[9] Byrne R W, Russon A E. Learning by imitation: A hierarchical approach. Behavioral and Brain Sciences, 1998, 21(5): 667-721
[10] Schaal S. Is imitation learning the route to humanoid robots? Trends in Cognitive Sciences, 1999, 3(6): 233-242
[11] Pomerleau D. Efficient training of artificial neural networks for autonomous navigation. Neural Computation, 1991, 3(1): 88-97
[12] Bojarski M, Del Testa D, Dworakowski D, et al. End to end learning for self-driving cars. arXiv preprint arXiv:1604.07316, 2016
[13] Ross S, Bagnell D. Efficient reductions for imitation learning//Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (AISTATS). Sardinia, Italy, 2010: 661-668
[14] Ross S, Gordon G J, Bagnell D. A reduction of imitation learning and structured prediction to no-regret online learning//Proceedings of the 14th International Conference on Artificial Intelligence and Statistics (AISTATS). Fort Lauderdale, USA, 2011: 627-635
[15] Li Yao-Yu, Zhu Yi-Fan, Yang Feng, et al. Inverse reinforcement learning based optimal schedule generation approach for carrier aircraft on flight deck. Journal of National University of Defense Technology, 2013, 35(4): 171-175 (in Chinese)
(李耀宇, 朱一凡, 杨峰等. 基于逆向强化学习的舰载机甲板调度优化方案生成方法. 国防科技大学学报, 2013, 35(4): 171-175)
[16] Jin Zhuo-Jun, Qian Hui, Chen Shen-Yi, Zhu Miao-Liang. Survey of apprenticeship learning based on reward function learning. CAAI Transactions on Intelligent Systems, 2009, 4(3): 208-212 (in Chinese)
(金卓军, 钱徽, 陈沈秩, 朱淼良. 回报函数学习的学徒学习综述. 智能系统学报, 2009, 4(3): 208-212)
[17] Ng A Y, Russell S J. Algorithms for inverse reinforcement learning//Proceedings of the 17th International Conference on Machine Learning (ICML). Stanford, USA, 2000: 663-670
[18] Levine S, Popovic Z, Koltun V. Nonlinear inverse reinforcement learning with Gaussian processes//Proceedings of the 25th Neural Information Processing Systems (NIPS). Granada, Spain, 2011: 19-27
[19] Ho J, Gupta J K, Ermon S. Model-free imitation learning with policy optimization//Proceedings of the 33rd International Conference on Machine Learning (ICML). New York, USA, 2016: 2760-2769
[20] Ho J, Ermon S. Generative adversarial imitation learning//Proceedings of the 30th Neural Information Processing Systems (NIPS). Barcelona, Spain, 2016: 4565-4573
[21] Goodfellow I J, Pouget-Abadie J, Mirza M, et al. Generative adversarial nets//Proceedings of the 28th Neural Information Processing Systems (NIPS). Montreal, Canada, 2014: 2672-2680
[22] Wang Kun-Feng, Gou Chao, Duan Yan-Jie, et al. Generative adversarial networks: The state of the art and beyond. Acta Automatica Sinica, 2017, 43(3): 321-332 (in Chinese)
(王坤峰, 苟超, 段艳杰等. 生成式对抗网络GAN的研究进展与展望. 自动化学报, 2017, 43(3): 321-332)
No. 2  LIN Jia-Hao et al.: A Survey of Imitation Learning Based on Generative Adversarial Nets
[23] Hinton G E, Osindero S, Teh Y W. A fast learning
[51] Schulman J, Moritz P, Levine S, et al. High-dimensional
[77] Rezende D J, Mohamed S, Wierstra D. Stochastic backpropagation and approximate inference in deep generative models//Proceedings of the 31st International Conference on Machine Learning (ICML). Beijing, China, 2014: 1278-1286
[78] Jang E, Gu S, Poole B. Categorical reparameterization with Gumbel-Softmax. arXiv preprint arXiv:1611.01144, 2016
[79] Cho K, van Merrienboer B, Gulcehre C, et al. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014
[80] Peters J, Bagnell J A. Policy gradient methods. Encyclopedia of Machine Learning. Boston, USA: Springer, 2010: 774-776
[81] Silver D, Lever G, Heess N, et al. Deterministic policy gradient algorithms//Proceedings of the 31st International Conference on Machine Learning (ICML). Beijing, China, 2014: 387-395
[82] Sutton R S, McAllester D, Singh S, et al. Policy gradient methods for reinforcement learning with function approximation//Proceedings of the 13th Neural Information Processing Systems (NIPS). Denver, USA, 1999: 1057-1063
[83] Lillicrap T P, Hunt J J, Pritzel A, et al. Continuous control with deep reinforcement learning//Proceedings of the 4th International Conference on Learning Representations (ICLR). San Juan, Puerto Rico, 2016
[84] Pfau D, Vinyals O. Connecting generative adversarial networks and actor-critic methods. arXiv preprint arXiv:1610.01945, 2016
[85] Sutton R S. Learning to predict by the methods of temporal differences. Machine Learning, 1988, 3(1): 9-44
[86] Tzeng E, Hoffman J, Zhang N, et al. Deep domain confusion: Maximizing for domain invariance. arXiv preprint arXiv:1412.3474, 2014
[87] Sundermeyer M, Schluter R, Ney H. LSTM neural networks for language modeling//Proceedings of the 13th Annual Conference of the International Speech Communication Association (ISCA). Portland, USA, 2012: 194-197
[88] Oh J, Chockalingam V, Singh S, Lee H. Control of memory, active perception, and action in Minecraft//Proceedings of the 33rd International Conference on Machine Learning (ICML). New York, USA, 2016: 2790-2799
[89] Wierstra D, Forster A, Peters J, Schmidhuber J. Recurrent policy gradients. Logic Journal of the IGPL, 2010, 18(5): 620-634
[90] Chung J, Gulcehre C, Cho K H, Bengio Y. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555, 2014
[91] Aytar Y, Pfaff T, Budden D, Le Paine T. Playing hard exploration games by watching YouTube. arXiv preprint arXiv:1805.11592, 2018
[92] Torabi F, Warnell G, Stone P. Behavioral cloning from observation//Proceedings of the 27th International Joint Conference on Artificial Intelligence (IJCAI). Stockholm, Sweden, 2018: 4950-4957
[93] Edwards A D, Sahni H, Schroecker Y, Isbell C L. Imitating latent policies from observation. arXiv preprint arXiv:1805.07914, 2018
[94] Gibbons R. A Primer in Game Theory. Upper Saddle River, USA: Prentice Hall, 1992
[95] Hu J, Wellman M P. Multiagent reinforcement learning: Theoretical framework and an algorithm//Proceedings of the 15th International Conference on Machine Learning (ICML). Madison, USA, 1998: 242-250
[96] Song J, Ren H, Sadigh D, Ermon S. Multi-agent generative adversarial imitation learning//Proceedings of the 31st Neural Information Processing Systems (NIPS). Long Beach, USA, 2017: 7471-7482
[97] Wu Y, Mansimov E, Liao S, et al. Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation//Proceedings of the 31st Neural Information Processing Systems (NIPS). Long Beach, USA, 2017: 5279-5288
[98] Lowe R, Wu Y, Tamar A, et al. Multi-agent actor-critic for mixed cooperative-competitive environments//Proceedings of the 31st Neural Information Processing Systems (NIPS). Long Beach, USA, 2017: 6379-6390
[99] Gupta J K, Egorov M, Kochenderfer M J. Cooperative multi-agent control using deep reinforcement learning//Proceedings of the 16th International Conference on Autonomous Agents and Multiagent Systems (AAMAS). Sao Paulo, Brazil, 2017: 66-83
[100] Peng X B, Kanazawa A, Toyer S, et al. Variational discriminator bottleneck: Improving imitation learning, inverse RL, and GANs by constraining information flow. arXiv preprint arXiv:1810.00821, 2018
[101] Choi J, Kim K E. Inverse reinforcement learning in partially observable environments//Proceedings of the 21st International Joint Conference on Artificial Intelligence (IJCAI). Pasadena, USA, 2009: 1028-1033
[102] Peng X B, Abbeel P, Levine S, Van De Panne M. DeepMimic: Example-guided deep reinforcement learning of physics-based character skills. ACM Transactions on Graphics, 2018, 37(4): 143:1-143:14
[103] Pinto L, Davidson J, Sukthankar R, Gupta A. Robust adversarial reinforcement learning//Proceedings of the 34th International Conference on Machine Learning (ICML). Sydney, Australia, 2017: 2817-2826
LIN Jia-Hao, M.S. candidate. His main research interests include imitation learning and reinforcement learning.
ZHANG Zong-Zhang, Ph.D., associate professor. His research interests include reinforcement learning, intelligent planning, and multi-agent systems.
JIANG Chong, M.S. candidate. His research interests include imitation learning and reinforcement learning.
HAO Jian-Ye, Ph.D., associate professor. His research interests include deep reinforcement learning and multi-agent systems.
Background
Imitation learning based on generative adversarial nets (GANs-IL) combines the adversarial training mechanism of generative adversarial networks with the idea of iterative improvement found in imitation learning methods based on inverse reinforcement learning. It has achieved remarkable success in a variety of domains, such as autonomous driving, simulation, and robotic control. Our paper introduces the main idea of generative adversarial imitation learning (GAIL), summarizes two main problems in GAIL, outlines many solutions to these two problems, discusses some practical GANs-IL applications, and highlights some future trends in the field, with the hope of providing a valuable reference for its future development.
This paper is partially supported by the National Natural Science Foundation of China (61876119, 61502323) and the Natural Science Foundation of Jiangsu (BK20181432). These projects aim to enrich learning and planning theory and to develop efficient learning and planning algorithms that expand the power and applicability of learning and planning agents in partially observable stochastic environments.