CycleCoopNet: Image-to-Image Translation with Cooperative Learning Networks - NCCU Academic Hub (政大學術集成)


National Chengchi University, Department of Management Information Systems
Master's Thesis
Advisor: Dr. Fang Yu (郁方)
CycleCoopNet: Image-to-Image Translation with Cooperative Learning Networks
Graduate student: Chien-Hao Weng (翁健豪)
December 2019 (Republic of China year 108)
DOI:10.6814/NCCU201901290

CycleCoopNet: Image-to-Image Translation with Cooperative Learning Networks
Chien-Hao Weng
December 18, 2019

Abstract

This thesis proposes CycleCoopNet, a new image-to-image translation method. Image-to-image translation changes a picture from one style to another, which lets us create novel pictures that do not yet exist. CycleCoopNet adopts the CoopNet framework, which has two main models: a generator and a descriptor. The generator produces pictures, and the descriptor revises them with MCMC (Markov chain Monte Carlo) sampling, so the generator is trained by supervised learning guided by the descriptor. The descriptor, in turn, learns from real data by a modified contrastive divergence, which adjusts it to output the same vector for the revised data and the real data.

Several previous works address image-to-image translation. CycleGAN is one of the best known and is the closest to our work; it applies the GAN (generative adversarial network) concept and performs well on this task. However, CycleGAN generates pictures by unsupervised learning: during training, the generator has no reference answer for the pictures it should produce. CycleGAN relies only on the discriminator to decide whether a result is correct or incorrect. Because every result only needs to pass the discriminator's test, the generator only has to find a way to pass that test and does NOT try to find the correct answer or more possible answers. We call this problem mode collapse. It reduces the variability of the results: the generator keeps producing the same picture to fool the discriminator and obtain a better score.

In our experiments, we use the edges2handbags dataset to observe how pictures change from sketches to bags. We find that our model generates more diverse results, and that these results can be stably recovered to the original picture by the opposite generator. In another experiment, we use the vangogh2photo dataset to observe how pictures change from photos to Van Gogh-style pictures, and we show that our model produces greater variety.

Our goal is to improve this network by replacing the discriminator with a descriptor. The descriptor is adapted from CoopNet (Cooperative Neural Network); the idea is to change the output dimension of the discriminator (descriptor) convolutional network. With the descriptor, the generator has a labeled answer against which to adjust its parameters, which turns the problem into supervised learning. The descriptor also helps prevent mode collapse, that is, it keeps the generator from always producing similar patterns.

Keywords: Cooperative learning networks, Image-to-Image Translation, deep learning, neural network.

Contents

1 Introduction
  1.1 Motivations and contributions
  1.2 Change and Difference of models
  1.3 Cooperative Learning Networks
2 Related Work
  2.1 GAN and pix2pix
  2.2 cycleGAN
  2.3 Cooperative Learning Model
3 Model structure
  3.1 Overview of the model
  3.2 Generator part of CycleCoopNet
    3.2.1 Batch-Normalization
    3.2.2 Activation function
    3.2.3 Dropout
    3.2.4 Skip connection and layers concatenation
    3.2.5 Details of generator
  3.3 Descriptor part of CycleCoopNet
    3.3.1 Batch-Normalization
    3.3.2 Activation function
    3.3.3 Fully-connected layers
    3.3.4 Details of descriptor
4 Algorithm
  4.1 Update descriptor networks
  4.2 Update generator networks
  4.3 CycleCoopNet algorithm
    4.3.1 Step 0: random choose two different domain picture
    4.3.2 Step G1: use Generator to generate B, A domain picture from A, B
    4.3.3 Step D1: use Langevin revision to revise the picture generated from step 1, 4
    4.3.4 Step R1: use another Generator to revert the B, A domain picture we generated in step 1, 4 to A, B domain
    4.3.5 Step G2: update Generator from generator loss
    4.3.6 Step D2: update descriptor from descriptor loss
    4.3.7 Step R2: update Generator from cycle-consistency loss
  4.4 Theoretical understanding
  4.5 Calculate similarity
5 Experiments
  5.1 Experiment 1: Generating bag texture patterns
  5.2 Checking the correctness of our generator model
  5.3 Experiment 2: Generating VanGogh-style pictures
  5.4 Experiment 3: Comparison with different descriptor output dimension
6 Conclusion
7 Project page
8 Sample results
9 References

List of Figures

1 This figure shows how we get the idea of CycleCoopNet, and the change of models.
2 The overview flow of our model
3 Generator layers
4 Layers concatenation concept
5 Details of Generator layers
6 Descriptor layers
7 Details of Descriptor layers
8 Descriptor loss
9 Generator loss
10 The overview flow of our model
11 The update steps of our model
12 The concept of the generator curve
13 The process and how we named the results in experiment 1
14 The scatter plot of three different model doing the experiment for generate bags from sketches
15 The scatter plot of two different model doing the experiment for recover the generated bags to origin sketches
16 The scatter plot of three different model doing the experiment for generate sketches from real bags
17 The scatter plot of two different model doing the experiment for recover the generated sketches to origin bags
18 Statistics plot of generating bags experiment
19 Statistics plot of generating sketches experiment
20 The scatter plot of two model doing the experiment for generate real style photos from Vangogh pictures
21 The scatter plot of two model doing the experiment for generate vangogh style pictures from real photos
22 Statistics plot of generating real photos experiment
23 Statistics plot of generating Vangogh pictures experiment
24 experiment 1: sample results of CycleCoopNet (from sketches to real bags)
25 experiment 1: sample results of CycleGAN (from sketches to real bags)
26 experiment 1.1: sample results of pix2pix (from sketches to real bags)
27 experiment 2: sample results of CycleCoopNet (from real photo to Vangogh-style picture)
28 experiment 2: sample results of CycleCoopNet (from Vangogh-style picture to real photo)

List of Tables

1 Confusion matrix of GAN networks
2 The loss comparison table of the CycleGAN model and CycleCoopNet model
3 Result of edges2handbags experiment
4 Experiment 1.5 results
5 Result of vangogh2photo experiment
6 Experiment 3.1 results: edges to handbags
7 Experiment 3.2 results: handbags to edges

1. Introduction

Our model is adapted from the work on Cooperative Training networks (CoopNet) [1]. CoopNet generates pictures from latent vectors; we change the input of CoopNet from latent vectors to pictures, so our model generates new pictures from input pictures. Our model combines two previous works from different domains. One is based on the cooperative training concept, which uses the cooperation of an Energy-Based Model and a Latent Variable Model via MCMC Teaching [1]. The other is Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks [2].

In our work, we use the cooperative concept to train two models at the same time, and we use a second, identical model for the translation from one style to the other; the cycle lets these two networks perform image-to-image translation. We find that merging these two works gives better efficiency and performance in image-to-image translation.

1.1. Motivations and contributions

In machine learning research, there are few studies of the image-to-image translation problem. GAN and related works are the usual training strategies for this problem, and they have demonstrated good results and stable generated pictures. However, GAN suffers from a problem called mode collapse [3]: the generator collapses and produces only a limited variety of samples. That is, the learning ability of the GAN generator is restricted by the finite observed data. GAN has a second problem. It generates pictures by unsupervised learning, so the generator never knows the correct answer; it only knows the score of its generated results. The generator may therefore learn how to score highly in the discriminator's test without learning how to generate correct pictures.

Our model adopts supervised learning instead: in every training step, we give the generator a labeled output. This labeled output is produced by our descriptor using a modified contrastive divergence algorithm. The descriptor takes the output of the generator and revises it toward the correct answer that the current descriptor has learned. The generator then learns from the revised results through the MCMC (Markov chain Monte Carlo) teaching algorithm. This is the main difference between our work and previous work.

The next problem is how to define "correct results". Although the generated results may be diverse, we do not accept every generated result, such as irrational pictures, especially early in training; the diverse results must be checked against what we expect. We therefore add a recovery generator to our model. Every generated image must be transformable back into the original picture by this recovery generator. In other words, the image produced by each of our generators must be restorable, as closely as possible, to the original picture. These generated pictures are the "correct results" we expect in training.

Our main contribution is therefore a model that uses a supervised learning strategy for image-to-image translation. The descriptor learns from the finite observed data and revises the generator's results, so it can supply virtually unlimited labeled data for training the generator. Our experiments show that the supervised strategy is more stable than the unsupervised one. The model can also generate diverse results because of this unlimited labeled data: it can produce many designs from a single draft, which is closer to real-life experience. Our work thus provides a new way of generating diverse pictures with more stable results than previous work.

1.2. Change and Difference of models
Generative Adversarial Networks (GAN). GAN [4] uses two neural network models. One is the generator, which generates a picture from random latent factors. The other is the discriminator, which tells whether a picture is real or fake. A fake picture is one produced by the generator rather than taken from the real world. The generated pictures are compared with real pictures. The goal of the generator is to fool the discriminator, that is, to generate pictures that look real. The goal of the discriminator, by contrast, is to tell whether a picture is real or fake, so it tries not to be fooled by the generator. Both models improve during training: once the generator succeeds in fooling the discriminator, the discriminator improves its ability to avoid being fooled; once the generator can no longer fool the discriminator, the generator improves its ability to fool it. After many epochs of training, in which the generator repeatedly fools the discriminator and the discriminator repeatedly detects generated pictures, we obtain a good generator, with the bonus of a good discriminator that can tell whether pictures are real or fake.

Figure 1: This figure shows how we get the idea of CycleCoopNet, and the change of models.

Cooperative Learning Networks (CoopNet). CoopNet [1] also trains a generator with two neural network models. One is the generator, which is the same as the generator of GAN. The other is a descriptor, which helps us compare the latent factors of pictures. From the work on GAN [4], we know that a picture can be generated from latent factors and that different latent factors map to different pictures. The descriptor transforms the generated pictures and the real pictures into two sets of latent factors, and we compare them to calculate the loss and update the descriptor. This is how the descriptor learns during training. The generator learns from what the descriptor has learned: we use Langevin revision dynamics [5] to revise the pictures produced by the generator, and we treat the revised picture as the true answer. We compare the revised picture with the generated picture to update the generator.

The likelihoods of both models involve intractable integrals, and the gradients of both log-likelihoods involve intractable expectations that can be approximated by Markov chain Monte Carlo (MCMC) [1]. The learning of the generator is based on how the MCMC changes the pictures produced by the generator. As a metaphor, the descriptor (teacher) distills its knowledge into the generator (student) via MCMC; we call this MCMC teaching.

Through MCMC sampling, our descriptor has the benefits of FRAME (Filters, Random field, And Maximum Entropy) models [6]. A FRAME model is a random field model that defines a probability distribution on the image space. Images can be generated from this probability distribution, which the model learns from the observed data. The distribution is a maximum entropy distribution that reproduces the statistical properties of filter responses in the observed images. Because of the maximum entropy property, it is the most random distribution consistent with the observed statistics of filter responses. Images sampled from this distribution can be considered typical images that share the statistical properties of the observed images.

The descriptor is used to help the generator learn. The descriptor learns from the finite amount of observed data, and the generator learns from virtually infinite amounts of revised data. The generator accumulates the MCMC transitions of the descriptor via MCMC teaching and reproduces them by direct ancestral sampling. In other words, the descriptor distills its MCMC algorithm into the generator.
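
As a rough illustration of MCMC teaching, the following Python sketch revises a generator output with Langevin dynamics on a toy one-dimensional energy and then regresses the generator onto the revised value. The quadratic energy, the linear generator, and all names and constants (energy_grad, langevin_revise, teach_generator, mode, step_size, lr) are illustrative assumptions, not the networks or settings used in this thesis.

```python
import random


def energy_grad(y, mode=2.0, scale=1.0):
    """Gradient of a toy quadratic energy E(y) = (y - mode)^2 / (2 * scale^2).

    The energy stands in for a learned descriptor (an assumption for this sketch).
    """
    return (y - mode) / (scale ** 2)


def langevin_revise(y, steps=30, step_size=0.3, noise=True, rng=None):
    """Revise y with Langevin dynamics: a gradient step on the energy plus Gaussian noise."""
    rng = rng or random.Random(0)
    for _ in range(steps):
        eps = rng.gauss(0.0, 1.0) if noise else 0.0
        y = y - 0.5 * step_size ** 2 * energy_grad(y) + step_size * eps
    return y


def teach_generator(weight, bias, x, lr=0.05, rounds=200):
    """Fit a linear 'generator' y = weight * x + bias to its own revised outputs.

    Each round treats the revised output as the labeled answer and takes one
    gradient step on the squared error. Noise is switched off here only to keep
    the sketch deterministic.
    """
    for _ in range(rounds):
        y_hat = weight * x + bias                     # initial generated value
        y_rev = langevin_revise(y_hat, noise=False)   # revised value (the label)
        err = y_hat - y_rev
        weight -= lr * err * x
        bias -= lr * err
    return weight, bias
```

In this toy setting the revised value always lies closer to the low-energy point than the initial one, so repeated teaching pulls the generator output toward the region the descriptor prefers.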
MCMC teaching also turns the otherwise unsupervised learning of the generator [7] into supervised learning.

Cycle-Consistent Adversarial Networks (CycleGAN). CycleGAN [2] uses two GANs and lets the output of one GAN be the input of the other. One GAN transforms a picture from domain A to domain B, and the other GAN recovers the picture from domain B back to domain A.
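
The round trip from domain A to domain B and back can be sketched in a few lines of Python. The "pictures" are short lists of numbers, the two translators are hand-written invertible functions, and the L1 distance is assumed as the cycle measure; the names cycle_loss, to_b, to_a, and bad_to_a are hypothetical and do not come from this thesis.

```python
def l1_distance(a, b):
    """Mean absolute difference between two equally sized pictures."""
    return sum(abs(p - q) for p, q in zip(a, b)) / len(a)


def cycle_loss(picture, forward, backward):
    """Distance between a picture and its reconstruction after a full A -> B -> A cycle."""
    return l1_distance(picture, backward(forward(picture)))


def to_b(picture):
    """Toy translator from domain A to domain B."""
    return [2.0 * p + 1.0 for p in picture]


def to_a(picture):
    """Toy translator from domain B back to domain A (exact inverse of to_b)."""
    return [(p - 1.0) / 2.0 for p in picture]


def bad_to_a(picture):
    """A reverse translator that does not undo to_b, so the cycle is inconsistent."""
    return [p / 2.0 for p in picture]


sample = [0.0, 0.25, 0.5, 1.0]
consistent = cycle_loss(sample, to_b, to_a)        # the cycle recovers the input
inconsistent = cycle_loss(sample, to_b, bad_to_a)  # the cycle drifts from the input
```

A small cycle loss means the second translator restores what the first one changed, which is the property that pairs the two networks.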

We call this pair of GANs CycleGAN. If domain A is one picture style and domain B is another, the pair performs image-to-image translation.

CycleCoopNet. We call our model CycleCoopNet. Knowing the power of CoopNet, we use it to build image-to-image translation models. We use two CoopNets and let the output of one be the input of the other. The idea comes from CycleGAN: one generator translates pictures from one style to another, and a second generator transforms the style back, with the aim that the result of the full cycle is consistent with the original picture.

1.3. Cooperative Learning Networks

What is special about our model is that it uses two different networks for cooperative training. The paper on Cooperative Learning networks [1] introduces the main concept with two different networks.

Cooperative training also uses two networks, which we call "CoopNet" from here on. One is the generator and the other is the descriptor. The main function of the generator is to generate a fake picture; the generator is better when this generated picture is as similar as possible to the real picture. The main function of the descriptor is to describe the generated picture and the real picture each by a simple vector. By comparing the similarity of these two vectors, we can evaluate the performance of the generator.

A simple metaphor explains the roles of the two networks. The generator plays the role of the student, whose main task is to learn as well as possible; here, that means generating a picture as similar as possible to a real picture. The descriptor plays the role of the teacher.
The descriptor's main task is to teach the student how to generate a realistic image. Under this concept, the teacher first acquires the knowledge and then teaches it to the student. The knowledge corresponds to the real picture: the descriptor first learns the features of the real picture and then teaches those features to the generator. Just as students learn from the teacher's teaching, the generator learns the features from the descriptor's teaching.

The main goal of the original cooperative training networks is a vector-to-image task. We also use cooperative networks, but we apply them to the image-to-image translation problem.

2. Related Work

Our work merges the concept of Cooperative Learning Networks [1] with cycleGAN's image-to-image translation work [8]. We introduce the related work in the parts below.

2.1. GAN and pix2pix

In our work, we choose the cooperative learning network (CoopNet) instead of the generative adversarial network (GAN). Consider GAN's combination of generator and discriminator. The generator and the discriminator are updated by letting the discriminator judge whether a picture produced by the generator is real or fake, so the discriminator gives only a simple result. A confusion matrix easily explains how GAN is updated; Table 1 shows how the discriminator network is updated. Although a generator loss function computes the L1 distance to update the generator variables, the discriminator loss is too simple, because it is decided only by a right-or-wrong judgment. We want to avoid the network that supports the generator, that is, the GAN discriminator, being too simple, so we use the descriptor of CoopNet to support our generator. This is the main reason why we use Cooperative Learning Networks instead of GAN. In place of the discriminator, CoopNet uses the descriptor to map the generated picture and the real picture to another domain, namely a vector representation of the picture.
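
To make the contrast concrete, the toy Python sketch below compares the feedback from a single real/fake logit with the feedback from comparing two descriptor vectors. The binary cross-entropy form, the sum of absolute differences, and the names discriminator_loss and descriptor_gap are assumptions chosen for illustration; they are not the loss functions defined later in this thesis.

```python
import math


def discriminator_loss(logit, is_real):
    """Binary cross-entropy computed from one real/fake logit (a single scalar judgment)."""
    p_real = 1.0 / (1.0 + math.exp(-logit))
    return -math.log(p_real) if is_real else -math.log(1.0 - p_real)


def descriptor_gap(vec_generated, vec_real):
    """Sum of element-wise absolute differences between two descriptor output vectors."""
    return sum(abs(g - r) for g, r in zip(vec_generated, vec_real))


# One scalar summarizes the whole picture for the discriminator ...
scalar_feedback = discriminator_loss(0.0, is_real=True)

# ... while the descriptor comparison accumulates a difference per vector element.
vector_feedback = descriptor_gap([1.0, 2.0, 3.0], [1.0, 2.5, 2.0])
```

The point of the sketch is only that the vector comparison aggregates many per-element differences, whereas the logit reduces each picture to one right-or-wrong style score.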

Judge result (by discriminator)    | Fake picture (generator's works) | Real picture (ground truth)
Real picture (Discriminator False) | Update discriminator more        | Learning target
Fake picture (Discriminator True)  | Update generator more            | Update discriminator more

Table 1: Confusion matrix of GAN networks

Our descriptor convolves the pictures, and we want the resulting vectors to be as similar as possible. We compute all the differences and sum them over all values. This addresses the problem we met with GAN's discriminator: because the discriminator is updated from logit values, we consider that the update is restricted to a small scope. Our work uses the descriptor to support the generator's learning. The descriptor loss is not like logit values; it calculates the difference over all of the pixels. We consider this better than GAN's supporting network, the discriminator.

The main task of this thesis is to use the cooperative method instead of the adversarial method for image-to-image translation. Unlike CycleGAN, which uses adversarial networks, we use cooperative networks. And unlike [1], which uses cooperative networks for vector-to-image translation, we use them for image-to-image translation.

Merging two neural networks is what is special about this model and is also what this work evaluates. We call the method used in [8] "pix2pix" from here on. We use a different method to do the same work, and our method is better in efficiency and performance. The pix2pix architecture uses a generative adversarial network, usually called "GAN". GAN also uses two neural networks: one is a generator and the other is a discriminator.
A metaphor describes the concept: the generator plays the role of a counterfeiter, and the discriminator plays the role of a detective. The counterfeiter's mission is to make counterfeit pictures that fool the detective, so that the detective cannot tell the difference between the real picture and the counterfeit picture. The detective, on the other hand, needs to tell which picture is real and which is fake. The counterfeiter and the detective are thus competitors. The two networks compete with each other and strengthen themselves whenever the generator cannot fool the discriminator or the discriminator cannot tell whether a picture is real or fake. This is the main idea of the generative adversarial network (GAN).

2.2. cycleGAN

Generative Adversarial Network (GAN) [4] is now widely used in deep learning. Through the competition between discriminator and generator, GAN can generate pictures that people find hard to distinguish from real-world pictures. Our work addresses image-to-image translation based on GAN's concept; pix2pix [8] is a well-known example of such work.

Many previous works have tried to solve image-to-image translation problems. Some colorize images from grayscale, such as [9, 10, 11]. The image-to-image translation concept can be found in Image Analogies [12], which used a non-parametric texture model [13] on one input-output training image pair. More recent work uses CNNs and a dataset of input-output examples to learn a parametric translation function [14].

We take the idea of a picture-to-picture framework from [8], which used the conditional GAN [15] to learn how to change input images into output images. This method has been used in many fields, for example generating photographs from sketches [16]. These learning methods all use paired training examples. We change the network to CoopNet [1]: the cooperation of generator and descriptor helps us improve performance over the adversarial pairing of generator and discriminator.
In our generator and descriptor architecture, we modify the layer architectures from DCGAN [17]. We add batch normalization [18] to accelerate training; because we have many layers with many variables, batch normalization [18] also helps keep our weights from diverging. In our generator, we use the "U-Net" [19] structure to avoid losing important features during the convolutions of the encoding steps. Finally, we build a convolution-BatchNorm-ReLU module from these concepts.

2.3. Cooperative Learning Model

Our work modifies the method of the Cooperative Learning Model (CoopNet) [1], which builds on contrastive divergence [20] to train a descriptor model. Example models include the deep Boltzmann machine [21] and the deep belief network [22]. The contrastive divergence method initializes sampling from the training data. In CoopNet, sampling is initialized from the results of the generator, with the aim that these results be as close as possible to what the descriptor expects, so that the generator can generate more realistic pictures.

The CoopNet concept is similar to the work in [23], where the generator network is updated by minimizing the Kullback-Leibler divergence between generator and descriptor. In CoopNet, the descriptor supports the generator in learning from the target picture.

The learning concept is related to knowledge distillation [24]: this is how the descriptor distills its knowledge into the generator by the MCMC teaching algorithm. To update the networks, we build on algorithm D [25] for the descriptor and algorithm G [26] for the generator.

3. Model structure

In our work, we combine the concept of pix2pix networks with the cooperative learning model. Through cooperation, one neural network teaches another to do the target task, and the other learns from that teaching. By combining pix2pix's generator and discriminator with CoopNet's generator and descriptor in a cooperative learning network, we obtain better performance on the image-to-image translation task in our results.

(19) We will introduce our method in the following three parts: • overview of the model • network architecture • main steps of the algorithms. 3.1. Overview of the model. We separated two parts to introduce about our works, generator part, and descriptor part,. 政治大. we will call our model ”CycleCoopNet” in the following contents. We simply introduce. 立. our overview steps and outputs here. First, we start with the input of the label picture.. ‧ 國. 學. Second, we use a generator to generate an ”initial generated picture”. Then we generate ”revised generated picture” by using Langevin revision dynamics. Finally, we use our. ‧. descriptor to generate ”described revised picture”.. Figure 2 show the step flow of all the output picture we generated.. n. al. Ch. er. io. Generator part of CycleCoopNet. sit. y. Nat. 3.2. i n U. v. First, we introduce our generator part. We will start from input our label picture, we. engchi. will call this ”input label”, then we use our generator to generate an initial example of B from this real A image. We called this ”initial generated picture”. Figure 3 show how we design our generator, this generator layers architecture is Reference from pix2pix [8]. We have 8 layers for encoder parts and 8 layers for decoder parts. The layer using the convolution-BatchNorm-ReLU concept, means in every layer, we will do convolution first, then do Batch-Normalization for every variable. Finally, we will do ReLU to or Leaky ReLU to adjust parameters.. 3.2.1. Batch-Normalization. In our model, we use Batch-Normalization method [18] to control our weight variables value will not be too divergent. Since our model has lots of variables in every layer. By 10. DOI:10.6814/NCCU201901290.

using batch normalization, we prevent a few large weight values from affecting the results more than the others. Batch normalization also allows higher learning rates, so training takes less time. Additionally, we can be less careful about the initial values of our model. In some cases, batch normalization reduces the need for dropout.

Figure 2: The overview flow of our model.

3.2.2 Activation function

We separate our generator into two parts, the encoder and the decoder. Our target is to change the label picture into the real scene picture. In the encoder, we compress the information into a vector and hope to retain the features of the picture in that vector. The decoder then uses this vector to reconstruct a real picture. We therefore use ReLU and Leaky ReLU as the activation functions in our generator model.

An activation function makes our computation nonlinear. Convolution alone keeps the model linear, which means the machine cannot learn

Figure 3: Generator layers.

anything beyond a simple one-to-one matching function. We use leaky ReLU in the generator-encoder layers. Leaky ReLU has all the advantages of ReLU, and it retains the values below zero by multiplying them by a parameter. In the encoder, we want to keep as many features of the original picture as possible, so we use leaky ReLU and let the values below zero pass on to the next steps.

Unlike leaky ReLU, ReLU sets all values below zero to zero. We use ReLU in the generator-decoder layers, hoping that it reduces the redundant values of the model so that the regenerated picture is clearer.

In the last layer of the generator, we use the hyperbolic tangent as our activation function. This keeps all variables and maps them into the range between -1 and 1. Finally, we get our output picture. We use the above concepts to build our generator structure.

3.2.3 Dropout

Our generator layers contain many variables. In the decoder, we add dropout to every layer with a fifty-percent dropout rate. We only use dropout in the decoder because we want to keep as many distinctive features as possible in the encoder. In the decoder, first, we only need to regenerate pictures from many weight variables, so randomly dropping some weight values reduces the amount of computation and makes it faster. Second, we concatenate layers in a later step, so even if we drop important feature values, we can find them again in the layer concatenation. In other words, this also gives our important feature values a higher weight.

3.2.4 Skip connection and layers concatenation

In our generator model, we add concatenation in the decoder part of the generator. The main task of the decoder is to reconstruct the picture, and we wish it

can give us a picture like the real picture. Skip connections avoid the information bottleneck by concatenating with the results of previous layers. We can also recover features that we might have lost in the previous layers. Our concept follows the "U-Net" in [19]. With skip connections, our generator layers are more tightly coupled. Figure 4 shows the concept of layers concatenation.

Figure 4: Layers concatenation concept.

3.2.5 Details of generator

Figure 5 shows the detailed design of our generator, which combines all of the concepts mentioned above.

In the encoder, we add Batch-Normalization after the convolution layers, then apply Leaky ReLU as the activation function for the final result of each layer.

In the decoder, we add Batch-Normalization after the transposed-convolution layers. Next, we drop some results according to the dropout rate. Then we concatenate with the encoder layers' results to avoid an information bottleneck or missing features. Finally, we apply ReLU as the activation

function for the final result of each layer.

Figure 5: Details of Generator layers.

3.3 Descriptor part of CycleCoopNet

After we get the output picture from the generator, we use this picture to run ℓ steps of Langevin revision. After these steps, we get the revised version of B, which we call the "revised generated picture".

We revise our "initial generated picture" with the Langevin revision dynamics algorithm. Equation (1) shows how we do the Langevin revision dynamics.

$$Y_{\tau+1} = Y_\tau - \frac{\delta^2}{2}\left[\frac{Y_\tau}{s^2} - \frac{\partial}{\partial Y} f(Y_\tau;\theta)\right] + \delta U_\tau \tag{1}$$

• $\tau$: the time step of the Langevin dynamics
• $\delta$: the step size
• $U_\tau$: the Gaussian white noise term

We can run ℓ steps of Langevin revision dynamics according to equation (1) to obtain the "revised generated picture" ($Y_{\tau+1}$) from the "initial generated picture" ($Y_\tau$). A more detailed explanation of this algorithm can be found in [1].

Our networks also need to learn in order to perform better. We use the descriptor to generate the "described revised picture" from the "revised generated picture" and the "described real picture" from the "real picture". We compare the "described revised picture" with the "described real picture" to update our descriptor networks. Our descriptor maps a picture to a vector. Figure 6 shows the design of our descriptor.

Figure 6: Descriptor layers.
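As a minimal sketch of the update in equation (1), the following NumPy code runs a few Langevin revision steps on a toy problem. Everything here is illustrative and not the thesis implementation: the function name `langevin_revise`, the toy descriptor gradient `toy_grad_f`, and the reading of s as the scale of the reference distribution in [1] are assumptions. The default step count and step size are the values reported for experiment 1.

```python
import numpy as np

def langevin_revise(y0, grad_f, n_steps=30, step_size=0.002, s=1.0, rng=None):
    """Run n_steps of the Langevin revision in equation (1).

    grad_f(y) must return the gradient of the descriptor score f(y; theta)
    with respect to y. It is a hypothetical callable supplied by the caller.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    y = np.array(y0, dtype=float)
    for _ in range(n_steps):
        noise = rng.standard_normal(y.shape)  # Gaussian white noise U_tau
        # Y <- Y - (delta^2 / 2) * [Y / s^2 - df/dY] + delta * U
        y = y - 0.5 * step_size**2 * (y / s**2 - grad_f(y)) + step_size * noise
    return y

# Toy descriptor (assumption): f(y) = -0.5 * ||y - target||^2, so df/dy = target - y.
target = np.ones(4)

def toy_grad_f(y):
    return target - y

revised = langevin_revise(np.zeros(4), toy_grad_f)
```

In a real model `grad_f` would come from automatic differentiation of the descriptor network, and `y0` would be the "initial generated picture".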

3.3.1 Batch-Normalization

In our descriptor model, we also use the Batch-Normalization method [18] to keep the values of our weight variables from diverging. Each layer has many variables, and batch normalization keeps a layer from being dominated by a few large weights, which avoids an unbalanced weight distribution. We can use a high learning rate with less risk of exploding gradients. Besides, we do not need to choose the initial values of the model carefully. Our model can easily keep all the weight values in an appropriate range.

3.3.2 Activation function

In the descriptor model, we use leaky ReLU as our activation function. Since we need to compare as many values of the original picture as possible, we use leaky ReLU to retain all values after the computation. Large values also remain large after the leaky ReLU computation. This activation function gives us useful and precise results.

3.3.3 Fully-connected layers

In the last layer of the descriptor, we use fully-connected layers to combine all the results into a one-dimensional vector. We flatten the convolution result of the previous layer and fully connect it to the final result. This output represents our original picture. It is fast to compute and introduces little error when we compare two pictures.

3.3.4 Details of descriptor

Figure 7 shows the detailed design of our descriptor. Following the above concepts, we use batch normalization in every convolution layer and apply Leaky ReLU after normalization. In the final layer, we use the fully-connected layer to flatten our picture into a vector. We use this vector to calculate our descriptor loss.

Figure 7: Details of Descriptor layers.

4 Algorithm

4.1 Update descriptor networks

We define θ to represent all of the descriptor parameters. In the part "Descriptor part of CycleCoopNet", we use θ to run ℓ steps of Langevin revision, which updates the "initial generated picture" to the "revised generated picture". We update θ by comparing the "described revised picture" with the "described real picture". Figure 8 shows how we calculate the loss and then update the descriptor networks.

Figure 8: Descriptor loss.

We use the Monte Carlo approximation in equation (2) to update our descriptor parameters θ. This algorithm follows [25]. The main goal is to minimize the loss between the "revised generated picture" and the "real picture". See Figure 8 for a diagram of the step flow.

$$\text{DescriptorLoss}(\theta) \approx \frac{1}{n}\sum_{i=1}^{n}\frac{\partial}{\partial\theta} f(Y_i;\theta) - \frac{1}{\bar{n}}\sum_{i=1}^{\bar{n}}\frac{\partial}{\partial\theta} f(Y'_i;\theta) \tag{2}$$

• $\{Y_i, i = 1, ..., n\}$: real picture
• $\{Y'_i, i = 1, ..., \bar{n}\}$: revised generated picture
• $f(Y;\theta)$: the output of the descriptor: "described real picture"
• $f(Y';\theta)$: the output of the descriptor: "described revised picture"
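To make equation (2) concrete, here is a small sketch with a toy linear descriptor $f(Y;\theta) = \theta \cdot Y$, for which $\partial f / \partial\theta = Y$. The linear descriptor, the function names, and the choice to move θ along this quantity (as in the descriptor algorithm of [25]) are assumptions made only for illustration. The real descriptor is the convolutional network of Figure 7.

```python
import numpy as np

def descriptor_gradient(real, revised):
    """Monte Carlo estimate of equation (2) for the toy f(Y; theta) = theta . Y.

    For this toy descriptor d f / d theta = Y, so the estimate reduces to the
    mean of the real pictures minus the mean of the revised pictures.
    """
    real = np.asarray(real, dtype=float)
    revised = np.asarray(revised, dtype=float)
    return real.mean(axis=0) - revised.mean(axis=0)

def update_theta(theta, real, revised, lr=0.01):
    """One update of theta along the estimate (lr is the descriptor learning rate)."""
    return np.asarray(theta, dtype=float) + lr * descriptor_gradient(real, revised)

# Two flattened "real" pictures and two "revised generated" pictures (toy data).
real_batch = np.array([[1.0, 2.0], [3.0, 4.0]])
revised_batch = np.array([[0.0, 0.0], [2.0, 2.0]])
theta_new = update_theta(np.zeros(2), real_batch, revised_batch)
```

When the revised pictures match the real pictures on average, the estimate is zero and θ stops changing, which is the behaviour the text describes for a descriptor that has learned the real data.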

The result of equation (2) is the difference between the "described revised picture" and the "described real picture". It can also be read as the loss between the "revised generated picture" and the "real picture". The parameter θ keeps changing during the learning process, and we minimize the loss between the "revised generated picture" and the "real picture" to match our target.

4.2 Update generator networks

First, we define α to represent all of the generator parameters. In the part "Generator part of CycleCoopNet", we use α to generate the initial generated picture. We need to update α to get a better result. Here we compare the "initial generated picture" with the "revised generated picture", calculate the generator loss from these two outputs, and then update our generator parameters α. Figure 9 shows how we calculate the loss and then update the generator networks.

Figure 9: Generator loss.

$$\text{GeneratorLoss}(\alpha) = \frac{1}{n}\sum_{i=1}^{n}\frac{1}{\sigma^2}\,\bigl(Y_i - g(X_i;\alpha)\bigr)\,\frac{\partial}{\partial\alpha} g(X_i;\alpha) \tag{3}$$

• $\{X_i, i = 1, ..., n\}$: input label
• $\{Y_i, i = 1, ..., n\}$: revised generated picture

• $g(X_i;\alpha)$: the output of the generator: "initial generated picture"

The result of equation (3) is the difference between the "initial generated picture" and the "revised generated picture". It also indicates how well our generator has learned so far. The generator loss helps us update α (the generator's variables). The parameter α keeps changing during the learning process, and we minimize the loss between the "initial generated picture" and the "revised generated picture" to match our target. This gives the target generator that we want.

Our training target is to minimize the generator (α) loss between the "initial generated picture" and the "revised generated picture", and to minimize the descriptor (θ) loss between the "revised generated picture" and the "real picture". The next section introduces the algorithm steps of our model.

4.3 CycleCoopNet algorithm

Algorithm 1 gives the detailed steps of our work.

Figure 10: The overview flow of our model.

Figure 10 shows Algorithm steps 1-6, in which we generate 6 pictures.
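The generator update of equation (3) in Section 4.2 can be sketched with a toy linear generator $g(X;\alpha) = X\alpha$, where α is a weight matrix. The linear generator and the names `generator_gradient` and `update_alpha` are hypothetical and only illustrate the shape of the computation. The real generator is the U-Net of Figure 5.

```python
import numpy as np

def generator_gradient(x, y_revised, alpha, sigma=1.0):
    """Equation (3) for the toy generator g(X; alpha) = X @ alpha.

    x:         (n, d_in)  input labels
    y_revised: (n, d_out) revised generated pictures from the Langevin step
    alpha:     (d_in, d_out) generator weights
    """
    residual = y_revised - x @ alpha          # Y_i - g(X_i; alpha)
    return x.T @ residual / (len(x) * sigma**2)

def update_alpha(alpha, x, y_revised, lr=0.0001, sigma=1.0):
    """Move alpha so that g(X; alpha) gets closer to the revised pictures."""
    return alpha + lr * generator_gradient(x, y_revised, alpha, sigma)

# Toy data: two inputs and the revised pictures the generator should reproduce.
x_batch = np.eye(2)
y_batch = np.array([[2.0, 0.0], [0.0, 4.0]])
alpha0 = np.zeros((2, 2))
grad = generator_gradient(x_batch, y_batch, alpha0)
alpha1 = update_alpha(alpha0, x_batch, y_batch, lr=0.5)
```

Because the revised pictures act as regression targets for the same inputs, this update has the form of an ordinary supervised least-squares step, which is the sense in which the text calls the generator's learning supervised.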

Algorithm 1: Convert-Style Cooperative Learning Algorithm

Input:
1. two training datasets with several examples
2. number of Langevin steps ℓ
3. number of learning iterations T

Output:
1. A2B and B2A generators' parameters α
2. A2B and B2A descriptors' parameters θ
3. synthetic example Y'

1: Let t ← 0, initialize θ and α.
2: repeat
3: Step 0: randomly choose two pictures from different domains
4: Step G1_A2B: use the A2B generator to generate a B domain picture from A
5: Step D1_A2B: use A2B Langevin revision to revise the picture generated in Step G1_A2B by equation (1)
6: Step R1_A2B: use the B2A generator to revert the B domain picture generated in Step G1_A2B to the A domain
7: Step G2_A2B: update the A2B generator from the generator loss_A2B by equation (3)
8: Step D2_A2B: update the A2B descriptor from the descriptor loss_A2B by equation (2)
9: Step R2_A2B: update the A2B generator from the cycle-consistency loss_A2B by L1 distance
10: Step G1_B2A: use the B2A generator to generate an A domain picture from B
11: Step D1_B2A: use B2A Langevin revision to revise the picture generated in Step G1_B2A by equation (1)
12: Step R1_B2A: use the A2B generator to revert the A domain picture generated in Step G1_B2A to the B domain
13: Step G2_B2A: update the B2A generator from the generator loss_B2A by equation (3)
14: Step D2_B2A: update the B2A descriptor from the descriptor loss_B2A by equation (2)
15: Step R2_B2A: update the B2A generator from the cycle-consistency loss_B2A by L1 distance
16: Let t ← t + 1
17: until t = T
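The cycle-consistency term used in steps R1 and R2 of Algorithm 1 can be illustrated as follows. This is a sketch under assumptions: the L1 distance is averaged over pixels (the text only says "L1 distance"), and `toy_a2b` and `toy_b2a` are hypothetical stand-ins for the two generators, chosen so that they invert each other exactly.

```python
import numpy as np

def cycle_consistency_loss(original, recovered):
    """Mean absolute (L1) difference between a picture and its round-trip reconstruction."""
    original = np.asarray(original, dtype=float)
    recovered = np.asarray(recovered, dtype=float)
    return float(np.mean(np.abs(original - recovered)))

# Hypothetical stand-ins for the A2B and B2A generators (not the real networks).
def toy_a2b(x):
    return 2.0 * x

def toy_b2a(y):
    return 0.5 * y

picture_a = np.array([[0.0, 0.5], [1.0, -1.0]])
generated_b = toy_a2b(picture_a)       # Step G1: translate A -> B
recovered_a = toy_b2a(generated_b)     # Step R1: revert B -> A
loss = cycle_consistency_loss(picture_a, recovered_a)  # Step R2 minimizes this value
```

With real generators the round trip is not exact, so the loss is positive and its gradient is used to update the generator in step R2.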

4.3.1 Step 0: randomly choose two pictures from different domains

In step 0, we randomly choose a label picture as the input of our generator. Later, our generator generates a target picture from the original label picture.

4.3.2 Step G1: use Generator to generate B, A domain picture from A, B

We already have the original label picture from step 0, and we feed it into our generator as input. The generator neural network changes this original label picture into the target picture. The details of the generator are shown in Figure 5, including the convolutions and transposed convolutions used in these steps. The output of this step is the "initial generated picture".

4.3.3 Step D1: use Langevin revision to revise the picture generated from step 1, 4

In this step, we revise our "initial generated picture" with the Langevin revision dynamics algorithm, whose details are given in equation (1). We get the revised version of the picture, the "revised generated picture", as the output of this step.

4.3.4 Step R1: use another Generator to revert the B, A domain picture we generated in step 1, 4 to A, B domain

In this step, we take the picture generated in step G1 and feed it into the generator of the opposite direction, which reverts it to its original domain.

Figure 11 shows Algorithm steps 7-12, in which we update our generator and descriptor.

Figure 11: The update steps of our model.

4.3.5 Step G2: update Generator from generator loss

In steps 7 and 8, we use another Monte Carlo approximation to update our generator parameters α, following equation (3). The generator loss calculated by this equation helps us update our generator variables. The main goal is to minimize the loss between the "initial generated picture" and the "revised generated picture".

After repeated learning, we obtain good descriptor and generator results. These outcomes are the main goals of our work, and they let us translate pictures with paired data.

4.3.6 Step D2: update descriptor from descriptor loss

After finishing all the previous steps, we need to update the parameter θ, which controls how the descriptor is updated, so that our model learns the features of a picture and improves in the next training steps. We compare the "revised generated picture" with the "real picture".

First, we use the descriptor to generate the "described revised picture" from the "revised generated picture" and the "described real picture" from the "real picture". Figure 7

shows the details of the descriptor structure and what each layer does in this step. We get two outputs from our descriptor, the "described revised picture" and the "described real picture".

Then we use the Monte Carlo approximation algorithm to calculate the loss between the "described revised picture" and the "described real picture", following equation (2). We use the descriptor loss given by this equation to update our descriptor parameters.

4.3.7 Step R2: update Generator from cycle-consistency loss

In this step, we update the generator from the cycle-consistency loss, which Algorithm 1 measures by the L1 distance, using the picture reverted in step R1.

After repeated learning, we obtain good descriptor and generator results. These outcomes are the main goals of our work, and they let us perform Image-to-Image translation.

4.4 Theoretical understanding

In CycleCoopNet, steps D1 and D2 perform modified contrastive divergence, steps G1 and G2 perform MCMC teaching, and steps R1 and R2 enforce cycle consistency.

Modified contrastive divergence. Modified contrastive divergence is the main learning target of the descriptor. In equation (2), we compare the real picture and the revised generated picture through the output of the descriptor. Let $M_\theta$ be the Markov transition kernel of ℓ steps of Langevin dynamics that samples $p_\theta$. For any distribution $p$ and any Markov transition kernel $M$, let $Mp$ be the marginal distribution obtained by running the Markov transition $M$ from $p$. Similar to traditional contrastive divergence, at iteration $t$ the learning gradient of the descriptor is the gradient of $KL(P_{data} \| p_\theta) - KL(M_{\theta^{(t)}} q_{\alpha^{(t)}} \| p_\theta)$ with respect to θ. In traditional contrastive divergence, $P_{data}$ replaces $q_{\alpha^{(t)}}$ in the second KL-divergence.

Figure 12: The concept of the generator curve.

Figure 12 shows the curve of the generator model. We can think of it as searching for the best parameters of the model. We try to revise the current parameters from $q_{\alpha^{(t)}}$ to $q_{\alpha^{(t+1)}}$, where $q_{\alpha^{(t+1)}}$ is our ideal. From $q_{\alpha^{(t)}}$ to $p^{(t+1)}$ we perform the Markov transition, in which the descriptor revises the results of our generator. Then we use the revised results to teach the generator to find the better parameters $q_{\alpha^{(t+1)}}$. After training, $q_{\alpha^{(t)}}$ is the same as $q_{\alpha^{(t+1)}}$, that is, we have found the ideal generator parameters.

Descriptor learns from real data. In every iteration, the descriptor takes the result of the generator and revises it toward what the current descriptor thinks the real picture looks like. Let the current descriptor be $p_\theta$. After learning from the real picture, the descriptor changes to $p_{\theta^{(t)}}$. We take this to mean that the descriptor has learned the correct parameters from real pictures. If the generator has also learned from the current descriptor, that is, the current generator $q_{\alpha^{(t)}}$ is equal to $p_{\theta^{(t)}}$, then the generator knows how to generate a picture like the real picture.

MCMC teaching. MCMC teaching is how the generator learns from the descriptor. We denote the parameters of the generator by α. In Figure 12, the gradient of the generator is the gradient of $KL(M_\theta q_{\alpha^{(t)}} \| q_\alpha)$ with respect to α. The figure also shows $p^{(t+1)} = M_{\theta^{(t)}} q_{\alpha^{(t)}}$, and $p^{(t+1)}$ replaces $P_{data}$ as the learning target for training our generator model. This makes it easier to minimize $KL(M_\theta q_{\alpha^{(t)}} \| q_\alpha)$ than to minimize $KL(P_{data} \| q_\alpha)$,

because the revised picture comes from the same latent variables as the generator's picture, that is, we already know the relation between the revised picture and its latent variables in the generator. This also makes our learning supervised.

Suppose the learning converges and the final parameters of the descriptor and generator are $(\hat\theta, \hat\alpha)$. Then our learning finds the minimum of

$$\hat\theta = \arg\min_\theta \left[KL(P_{data} \| p_\theta) - KL(M_{\hat\theta}\, q_{\hat\alpha} \| p_\theta)\right] \tag{4}$$

$$\hat\alpha = \arg\min_\alpha KL(M_{\hat\theta}\, q_{\hat\alpha} \| q_\alpha) \tag{5}$$

In equation (5), we look for the $q_{\hat\alpha}$ that minimizes the KL-divergence. In the idealized situation, $\min_\alpha KL(M_{\hat\theta}\, q_{\hat\alpha} \| q_\alpha) = 0$, that is, $M_{\hat\theta}\, q_{\hat\alpha} = q_{\hat\alpha}$. In Figure 12, this means that we start from $q_{\hat\alpha}$, perform the Markov transition, and come back to the same point. We can say that $q_{\hat\alpha}$ is the stationary distribution of $M_{\hat\theta}$. This also tells us that the generator has learned well from the descriptor, that is, $q_{\hat\alpha} = p_{\hat\theta}$. The second KL-divergence in equation (4) then vanishes, that is, $KL(M_{\hat\theta}\, q_{\hat\alpha} \| p_{\hat\theta}) = 0$, and $\hat\theta$ becomes the maximum likelihood estimate, which minimizes the remaining KL-divergence $KL(P_{data} \| p_\theta)$.

We can explain the dynamics of MCMC teaching in the idealized scenario in another way. Suppose the descriptor $p_\theta$ is fixed, that is, the descriptor already knows the features of the real data. The descriptor then teaches the generator $q_\alpha$ by MCMC teaching. This gives $\alpha^{(t+1)} = \arg\min_\alpha KL(M_\theta\, q_{\alpha^{(t)}} \| q_\alpha)$, and then $q_{\alpha^{(t+1)}} = M_\theta\, q_{\alpha^{(t)}}$. We can also write this as $q_{\alpha^{(t)}} = M_\theta^t\, q_{\alpha^{(0)}}$, which means that for any initial generator, after $t$ MCMC transitions we approach our idealized generator parameters. Remember that the generator learns from the descriptor features, so $q_{\alpha^{(t)}}$ should equal $p_\theta$ in the idealized scenario. During learning, $q_{\alpha^{(t)}}$ accumulates the MCMC transitions and converges to the stationary distribution $p_\theta$. We can write this concept

like equation 6.

$$q_{\alpha^{(t)}} = M_\theta^t\, q_{\alpha^{(0)}} \to p_\theta \tag{6}$$

When we adjust the parameters of the generator, we follow equation 7 to make sure we are moving in the right direction.

$$KL(p^{(t+1)} \| p_{\theta^{(t)}}) \le KL(q_{\alpha^{(t)}} \| p_{\theta^{(t)}}) \tag{7}$$

4.5 Calculate similarity

We show that our work generates pictures of higher similarity by calculating histogram similarity [27, 28, 29], the average hash algorithm (aHash) [30], the perceptual hash algorithm (pHash) [30, 31, 32] and the difference hash algorithm (dHash) [30].

Histogram similarity. An image histogram is a type of histogram that acts as a graphical representation of the tonal distribution in a digital image. It plots the number of pixels for each tonal value. By looking at the histogram of a specific image, a viewer can judge the entire tonal distribution at a glance.

To calculate the histogram of an image, we use OpenCV, an open-source computer vision and machine learning software library. First, we separate the source image into its three R, G and B planes with the OpenCV split function. Then we calculate the histograms with the OpenCV calcHist function. Finally, we normalize each histogram so that its values fall in the range given by the parameters we chose.

We compare the histograms of images to analyze the difference between images. We train generators separately with our model and the CycleGAN model, and we collect 300 generated pictures per epoch from each model. We compare the last picture with all the other pictures and calculate the average and the variance of the differences. We take a higher value to mean that the model generates more diverse pictures.

Average hash algorithm (aHash). Average hash is the simplest algorithm that uses

only a few transformations: scale the image, convert it to greyscale, calculate the mean, and binarize the greyscale values based on the mean. The resulting bits are the hash of the picture.

In our case, we want to generate a 64-bit hash of each picture. The average hash algorithm first scales the input image down to 8×8 pixels and then converts it to grayscale. Next, we calculate the average of all gray values of the image and examine the pixels one by one from left to right. If the gray value is larger than the average, a 1 is added to the hash, otherwise a 0. We call the resulting 64-bit hash "the footprint of the image".

We use this hash value to measure how different two pictures are. We compare the last picture with all the other pictures and calculate the average and the variance of the differences. We take a higher value to mean that the model generates more diverse pictures.

Perceptual hash algorithm (pHash). Perceptual hash uses a similar approach, but instead of averaging it relies on the discrete cosine transform (a popular transform in signal processing). It follows the same steps as aHash, but first applies a discrete cosine transform (equation 8) and works in the frequency domain.

The perceptual hash algorithm first scales the input image down to (8·4)×(8·4), that is, a 32×32 image, and then converts it to grayscale. To this image we apply a discrete cosine transform (equation 8), first per row and afterward per column. The low frequencies are now located in the upper left corner, which is why we crop the result to the upper left 8×8 values. Next, we calculate the median of the values in this cropped block and generate the hash value from it in the same way as in the average hash algorithm, with the median in place of the mean.

$$X_k = \sum_{n=0}^{N-1} x_n \cos\left(\frac{\pi\, k\, (2n+1)}{2N}\right), \quad k = 0, \dots, N-1 \tag{8}$$
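Equation (8) and the cropping step can be sketched in NumPy as follows. This is an illustrative reading of the description above, not the code used in the experiments: it assumes the input is already a grayscale array (for example 32×32), uses the unnormalized cosine transform of equation (8), and the names `dct_matrix`, `perceptual_hash` and `hamming_distance` are hypothetical.

```python
import numpy as np

def dct_matrix(n):
    """Matrix C with C[k, m] = cos(pi * k * (2m + 1) / (2n)), as in equation (8)."""
    k = np.arange(n).reshape(-1, 1)
    m = np.arange(n).reshape(1, -1)
    return np.cos(np.pi * k * (2 * m + 1) / (2 * n))

def perceptual_hash(gray, hash_size=8):
    """64-bit pHash sketch: 2-D cosine transform, keep the low-frequency block,
    then threshold each value against the median of that block."""
    img = np.asarray(gray, dtype=float)
    rows, cols = img.shape
    # Transform every row, then every column.
    freq = dct_matrix(rows) @ img @ dct_matrix(cols).T
    low = freq[:hash_size, :hash_size]          # upper-left (low-frequency) corner
    return (low > np.median(low)).flatten()

def hamming_distance(hash_a, hash_b):
    """Number of differing bits between two hashes."""
    return int(np.count_nonzero(hash_a != hash_b))

# Toy usage with random "grayscale images" (assumption: values in [0, 1)).
rng = np.random.default_rng(0)
image_a = rng.random((32, 32))
image_b = rng.random((32, 32))
hash_a = perceptual_hash(image_a)
hash_b = perceptual_hash(image_b)
distance = hamming_distance(hash_a, hash_b)
```

The Hamming distance between two hashes is one common way to turn the hashes into a difference score. The text does not state which distance it uses, so this choice is an assumption.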
We call the resulting 64-bit hash "the footprint of the image". We use this hash value to measure how different two pictures are. We compare the last picture with all the other pictures and calculate the average and the variance of the differences. Consider the case where the model gets a higher value.

This means the model can generate more diverse pictures.

The average hash is simple, but it is strongly affected by the mean. For example, when we apply gamma correction or histogram equalization to an image, the change in the mean alters the final hash value. The pHash algorithm is less affected by gamma correction or histogram equalization than aHash, because it uses a discrete cosine transform (DCT) [33] to obtain the low-frequency components of the picture.

Difference hash algorithm (dHash). Difference hash uses the same approach as aHash, but instead of using information about average values, it uses gradients (the difference between adjacent pixels).

Similar to the average hash algorithm, the difference hash algorithm first generates a grayscale image from the input image, which in our case is then scaled down to 9×8 pixels. In each row, the first 8 pixels are examined one after another from left to right and compared to their neighbor on the right, which, analogous to the average hash algorithm, results in a 64-bit hash.

5 Experiments

In this section, we describe the experimental process and the results. The following is the implementation of the cooperative learning model. We use different datasets to show the performance of our model. We use TensorFlow for our work. Our experiment environment is Windows 10 with an NVIDIA GTX 1080 GPU and an Intel i5-7400 CPU. We build an Anaconda environment and install all Python packages in it, including TensorFlow-GPU 1.13.1, numpy 1.15.4, scipy 1.1.0, pillow 5.3.0 and pandas 0.23.4. All experiment code and details are available in our GitHub repository [34].

5.1 Experiment 1: Generating bag texture patterns

In experiment 1, we want our generator model to learn the bag texture patterns.

We use edges2handbags [8] as our training dataset. We want our generator model to learn how to generate the bag texture patterns. The goal of this experiment is to change the picture of a bag sketch so that it looks like a real bag.

We train for 300 epochs with 400 iterations per epoch. The input is one sketch picture and one real bag picture. In CycleCoopNet and CycleGAN, we do not need to pair the sketch with the real bag, but pix2pix needs paired data, so we give that model each bag together with its sketch.

We set the generator learning rate to 0.0001, the descriptor learning rate to 0.01, and the cycle-consistency learning rate to 0.0001. We set the number of Langevin revision steps to 30 and the Langevin step size to 0.002 in this experiment. We set the descriptor output dimension to 100.

Comparison of loss curve. First, we compare the convergence speed of CycleGAN and our work. Table 2 shows the loss comparison between CycleGAN and our work. We use the cycle-consistency loss as our benchmark, since it reflects the learning progress of the two generators. Our model converges in about 5 epochs, faster than CycleGAN, which needs about 10 epochs. This shows that our model converges faster by using a supervised learning strategy.

There are four output pictures in experiment 1, and we use Figure 13 to explain them.

Figure 13 shows the process and how we name the results of experiment 1. "sk" means sketch pictures and "R" means real pictures. The single prime notation ' marks a picture generated by the generator. The double prime notation '' marks a picture recovered from the generated picture.

We first show our results R', the pictures we generated, and then our results sk', the pictures we recovered from the generated pictures. Then we look at the other side of CycleCoopNet: we show our results sk'', the pictures we generated, and then our results R'', the pictures we recovered from the generated pictures.

Table 2: The loss comparison table of the CycleGAN model and CycleCoopNet model. (Rows: generator loss from sketch to bag, generator loss from bag to sketch. Columns: CycleGAN, CycleCoopNet.)

Comparison of R'. Here we compare different models for generating bag pictures from bag sketches. We compare CycleCoopNet, CycleGAN, and pix2pix. We use 10 datasets for every test, choose 100 output results, and compare the difference between the real bag and the bag we generated. Figure 24 shows some results generated by our CycleCoopNet, Figure 25 shows some results generated by CycleGAN, and Figure 26 shows some results generated by pix2pix. Sample results can be found in the last section.

We compare these results with the original picture using two benchmarks, histogram and aHash, which we introduced in the previous section. Figure 14 shows the scatter plot of the results of the three models. We can see that the points of pix2pix gather in

Figure 13: The process and how we named the results in experiment 1.

Figure 14: The scatter plot of three different models generating bags from sketches.

the upper right corner. This means the generated results look like the original pictures. The reason is that pix2pix uses paired data in training: whenever it updates its parameters, it uses the correct real picture to adjust the generator. After many epochs, pix2pix should generate pictures that are very close to the original pictures. On the other hand, CycleGAN and CycleCoopNet use unpaired data in training. This means they never see the real picture that matches a given input during the whole training process.

Next, we compare CycleGAN and CycleCoopNet. Our model gets a similar score in the histogram benchmark, but in the aHash benchmark our model gets a higher score than CycleGAN. This shows that our model generates pictures more correctly.

Comparison of sk'. Here we compare the results that CycleCoopNet and CycleGAN recover from the generated bags. We do not compare with pix2pix, because pix2pix is trained on paired data and does not recover the generated pictures to the original picture. We want to make sure that our results can be returned to the original picture. This also prevents our generator model from generating irrational pictures that cannot be returned to the original pictures. We use 10 datasets for every test, choose 100 output results, and compare the difference between the real bag and the bag we generated.

Figure 15 shows the result. The recoverability of our model is a little worse than that of CycleGAN, but our recovered results still score higher than 0.6 in the benchmark. We think this result is acceptable, because our model generates more diverse pictures, which makes the recovery more difficult.

Comparison of sk''. Here we compare different models for generating sketches from real bags. We compare CycleCoopNet, CycleGAN, and pix2pix. We use 10 datasets for every test, choose 100 output results, and compare the difference between the real bag and the bag we generated.

Figure 16 shows the result. Our model and pix2pix get similar scores in the two benchmarks. Pix2pix is trained on paired data, so it should be more stable because

Figure 15: Scatter plot of the two models recovering the generated bags to the original sketches.

Figure 16: Scatter plot of the three models generating sketches from real bags.

it learns directly from the observed data. In our model, only the descriptor learns from the observed data, and the generator learns from the descriptor. Nevertheless, we obtain results similar to pix2pix, which shows that MCMC teaching makes our model more stable.

Comparison of R” Here we compare the results recovered from the generated sketches by CycleCoopNet and CycleGAN. We do not include pix2pix because it is trained on paired data and does not recover generated pictures to the original picture. We want to make sure that our results can be returned to the original picture. This also prevents our generator from producing irrational pictures that cannot be returned to the original pictures. We use 10 datasets for each test, choose 100 output results, and compare the difference between the real bag and the bag we generated. Figure 17 shows the result: our model gets a higher score, and its scores are concentrated in a small range. This also shows that our model is more stable.

Figure 17: Scatter plot of the two models recovering the generated sketches to the original bags.
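The comparisons above rely on a histogram benchmark and an aHash benchmark, but this section does not state how either score is computed. The following is a minimal sketch of one common way to obtain similarity scores in [0, 1], not necessarily the thesis's implementation. The function names, the 8x8 hash size, the 256 bins, the use of histogram intersection, and the 8-bit pixel range are all assumptions made for illustration.

```python
import numpy as np

def to_gray(img):
    """Convert an HxWx3 array to grayscale; 2-D arrays pass through unchanged."""
    img = np.asarray(img, dtype=np.float64)
    return img.mean(axis=2) if img.ndim == 3 else img

def ahash_bits(img, size=8):
    """Average hash: shrink to size x size by block averaging, then threshold at the mean."""
    gray = to_gray(img)
    h, w = gray.shape
    # Crop so both sides divide evenly into blocks (assumes h >= size and w >= size).
    gray = gray[: h - h % size, : w - w % size]
    blocks = gray.reshape(size, gray.shape[0] // size, size, gray.shape[1] // size)
    small = blocks.mean(axis=(1, 3))
    return (small > small.mean()).ravel()

def ahash_similarity(a, b, size=8):
    """Fraction of matching hash bits: 1.0 means identical hashes."""
    return float(np.mean(ahash_bits(a, size) == ahash_bits(b, size)))

def histogram_similarity(a, b, bins=256):
    """Histogram intersection of normalized intensity histograms (assumes values in 0..255)."""
    ha, _ = np.histogram(to_gray(a), bins=bins, range=(0, 256))
    hb, _ = np.histogram(to_gray(b), bins=bins, range=(0, 256))
    ha = ha / ha.sum()
    hb = hb / hb.sum()
    return float(np.minimum(ha, hb).sum())

# Toy check with a horizontal gradient and its inverse (synthetic data, not thesis data).
gradient = np.tile(np.arange(64) * 4.0, (64, 1))
inverse = 255.0 - gradient
print(ahash_similarity(gradient, gradient), ahash_similarity(gradient, inverse))
```

Under these assumptions, each generated or recovered picture would be scored against its reference picture with both functions, and the pair of scores would give one point in a scatter plot such as Figures 14 to 17.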

Figure 18: Statistics plot of the generating bags experiment.

Figure 19: Statistics plot of the generating sketches experiment.

model                                  cyclecoopnet   cyclegan
generated bags x recovered sketches    1100/1100      1092/1100

model                                  cyclecoopnet   cyclegan
generated sketches x recovered bags    1100/1100      871/1100

Table 3: Result of edges2handbags experiment

We compare these three models' accuracy by training other CNN models to tell whether a picture belongs to the bags class or the edges class. We use 11 test cases of 100 pictures each, taken after the model has been trained. None of the test cases overlap with our training data.

Table 3 shows the average score of the 1100 results graded by the trained CNN models. Our model gets a score similar to pix2pix and better than CycleGAN. However, pix2pix uses paired data to train its models, whereas our model reaches a similar score using unpaired data. Figure 18 and Figure 19 summarize the results from all cases. Figure 18 shows the result of generating bags: CycleGAN and our model get similar scores. Figure 19 shows the result of generating sketches: CycleGAN fails in some cases. This shows that our model is more stable.

5.2 Checking the correctness of our generator model

In this experiment, we check the correctness of our generator. We take some photos as input and feed them into the model already trained in experiment 1. We want to make sure that our generator can produce a sketch even for a photo it has not seen. The results are shown in Table 4. They demonstrate that even when the data differ from the training set and the generator has never seen them before, it can still produce correct sketches.

Table 4: Experiment 1.5 results (columns: input picture, output picture).

Figure 20: Scatter plot of the two models generating real-style photos from Vangogh pictures.

5.3 Experiment 2: Generating VanGogh-style pictures

We use a Vangogh picture or a real photo as input. We use the vangogh2photo dataset for this experiment. We train for 300 epochs with 400 batches per epoch. We set the generator learning rate to 0.0001, the descriptor learning rate to 0.01, and the cycle-consistency learning rate to 0.0001. We use 30 Langevin revision steps with a Langevin step size of 0.002. The descriptor output dimension is set to 100 for this experiment.

Generating real photo pictures Figure 20 compares CycleGAN and our model on generating real-photo pictures from Vangogh pictures. We compare the original pictures with the pictures after translation. Our model gets a higher score in the aHash benchmark. Figure 28 shows some sample results of our work.

Generating Vangogh style pictures Figure 21 compares CycleGAN and our model on generating Vangogh-style pictures from real photos. We compare the original pictures with the pictures after translation. Our model gets a score similar to CycleGAN. Figure 27

Figure 21: Scatter plot of the two models generating Vangogh-style pictures from real photos.

shows some sample results of our work.

In this experiment, we have demonstrated that CycleCoopNet produces more diverse generator results. That is, the same picture can be converted into a variety of different VanGogh-style pictures.

Figure 22: Statistics plot of the generating real photos experiment.

model                                  cyclecoopnet   cyclegan
generated photos x recovered vangogh   986/1000       986/1000
generated vangogh x recovered photos   641/1000       595/1000

Table 5: Result of vangogh2photo experiment

We compare the accuracy of the models by training other CNN models to tell whether a picture belongs to the Vangogh class or the photo class. We use 10 test cases of 100 pictures each, taken after the model has been trained. None of the test cases overlap with our training data.
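To make the counts in Table 3 and Table 5 easier to compare, they can be converted to accuracies. The sketch below assumes that each entry such as 871/1100 means "pictures assigned to the intended class by the CNN grader / pictures graded"; the dictionary layout and the function names are illustrative only.

```python
# Counts copied from Table 3 and Table 5 as (intended-class count, total graded).
# Reading them as classification accuracy is an assumption of this sketch.
RESULTS = {
    "edges2handbags: generated bags x recovered sketches": {
        "cyclecoopnet": (1100, 1100), "cyclegan": (1092, 1100)},
    "edges2handbags: generated sketches x recovered bags": {
        "cyclecoopnet": (1100, 1100), "cyclegan": (871, 1100)},
    "vangogh2photo: generated photos x recovered vangogh": {
        "cyclecoopnet": (986, 1000), "cyclegan": (986, 1000)},
    "vangogh2photo: generated vangogh x recovered photos": {
        "cyclecoopnet": (641, 1000), "cyclegan": (595, 1000)},
}

def accuracy(correct, total):
    """Return the fraction of graded pictures assigned to the intended class."""
    if total <= 0:
        raise ValueError("total must be positive")
    return correct / total

def summarize(results):
    """Flatten the nested counts into (task, model, accuracy) rows."""
    rows = []
    for task, by_model in results.items():
        for model, (correct, total) in by_model.items():
            rows.append((task, model, accuracy(correct, total)))
    return rows

for task, model, acc in summarize(RESULTS):
    print(f"{task:55s} {model:13s} {acc:6.1%}")
```

Read this way, the largest gap in the two tables is on edges2handbags sketch generation, where CycleGAN reaches about 79.2% against 100% for CycleCoopNet, while on vangogh2photo both models stay below 65% when generating Vangogh pictures.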

Figure 23: Statistics plot of the generating Vangogh pictures experiment.

Table 5 shows the average score of the 1000 results graded by our CNN models. Figure 22 and Figure 23 summarize the results from all cases. Figure 22 shows the result of generating photos: our model gets a score similar to CycleGAN. Figure 23 shows the results of generating Vangogh pictures. Although we get a higher score than CycleGAN, the plot shows that our model fails in some cases. We believe the reason is that we could collect only 400 Vangogh pictures for our dataset, which is not enough for the CNN model to learn the features of Vangogh pictures precisely. As a result, neither our model nor CycleGAN gets a high score from the CNN grading.

5.4 Experiment 3: Comparison with different descriptor output dimension

In this experiment, we test how different descriptor output dimensions change our results. We set the descriptor output dimension to 1, 100, and 200 in this