Several important findings were obtained in this study. First, information relating to
the mode of the music is a strong factor affecting our perceived emotion when listening
to music. Experiment 1 first established this point by using music in major or minor
mode and showed robust relationships between positive emotions and major mode, and
between negative emotions and minor mode. Although this mode-emotion relationship
has already been shown for Western participants (Gabrielsson & Juslin, 2002;
Gabrielsson & Lindstrom, 2001) and seems to be a commonly held view in Western
society, no empirical evidence has so far been provided for Taiwanese participants. Our finding of
the mode-emotion relationship in Taiwan not only provides evidence consistent with the
literature in Western societies, but also supports the view that even with the influence of
different cultural backgrounds, the mode of music conveys emotional valence, perhaps
across different cultures (Balkwill & Thompson, 1999; Hoshino, 1996).
Furthermore, when we asked participants to directly evaluate the emotional
valence of each unimodal stimulus in Experiment 3A, some silent-video stimuli were
rated as neutral; even so, the mode information of the music could still help
the performer to express the appropriate emotional intention, which could in turn be detected
by the participant. The mode of the music was implicitly detected by listeners, but it still
has power to influence perceived emotion and form a medium supplying the performer
with the ability to communicate the emotional intention of the music to the audience
through visual performance, even without acoustic emotional cues.
The second important finding is that congruency in mode between the video image
and music mediates audio-visual integration in music. Based on the results of
Experiment 1, Experiment 2 investigated whether an incongruent pairing of music and
video (vs. a congruent pair) could modify the magnitude of the emotion perceived (i.e., the mode
congruence effect) in a musical performance. The mode congruence effect
demonstrated in Experiment 2 indicates that emotional information processed from one
modality can be affected by information processed from the other modality, and that
information from both modalities integrates to modify our emotional responses (de
Gelder & Bertelson, 2003; Shams & Seitz, 2008).
In light of other studies, the mode congruence effect suggests that
the musical component, mode, is not only important in musical perception but is also a
medium that conveys emotional connotation visually. A musical piece which lacks mode
information is thus likely to lose that medium with which to convey emotional intention.
Accordingly, the result of audio dominance in perceived tension in Vines et al. (2006)
might have resulted from the fact that the musical stimuli they used did not have clear
mode information.
The third important finding is that the combination of music and video yields
stronger perceived emotion than music alone. The results of Experiments 3B
and 3C indicate that emotionally congruent visual information could enhance the
perceived positive emotion of music, whereas emotionally incongruent visual
information could attenuate both positive and negative emotions. Judging from the
results of the emotional judgments of both music and video, the possibility of visual
dominance of perceived emotion can be excluded. Had the emotional enhancement
effect resulted from the visual aspect being dominant, how could the attenuation of
emotionally incongruent stimuli in visual judgment of music be explained? The
emotional modulation effect from audio-visual emotional integration has also been
found in other cross-modal studies on music (Baumgartner, Lutz, Schmidt, & Jancke,
2006; Shevy, 2007; Spreckelmeyer et al., 2006; Thompson et al., 2008; Vines et al.,
2005). However, most of these studies examined the effect of combining music with a
picture (or movie clip) rather than with videos of musical
performance (Baumgartner et al., 2006; Shevy, 2007; Spreckelmeyer et al., 2006).
Although Vines et al. (2005) found that, compared to listening to music alone, images of
exaggerated visual performance could enhance the perceived emotion, the effect might
have resulted from the artificial manipulation of visual cues which were unrelated to the
musical performance itself. In our study, we asked the performer to sing appropriately
with regard to the music heard, without any exaggerated or unrelated expression. Our
finding that an emotionally congruent video image could still enhance the perceived
emotional magnitude points to a strong connection between mode and emotion.
Thompson et al. (2008) investigated whether the emotional congruence between a
vocalist’s dynamic facial expression and vocal sound in singing a major third or minor
third interval could affect the emotional judgment of music. Their results indicated that
the congruent pairs had the most extreme scores, with the incongruent pairs in between. The
results seemed to suggest that emotional judgment of music could be modified by visual
information. Despite the fact that they found an emotional congruence effect, the
question remains open whether audio-visual integration enhances emotional strength
(i.e., cross-modal enhancement) or attenuates it (i.e., audio-dominance), since they did
not compare participants’ judgments of the congruent conditions with a music-alone
condition, as we have done here. Also, the audio-visual stimuli in their experiments are
more similar to binding pictures and simple vocal sounds than to real musical
performance (their audio signals contained just two notes). This makes their results
difficult to generalize to real-life situations such as a musical concert. We have made an
effort in this direction by using stimuli more similar to a musical performance in this
study, and the emotional congruence effects found in our Experiments 2 and 3 indicate
a robust mode-emotion relationship even across different modalities.
Music has been said to be analogous to motion, because of the dynamic sound flow
in music that is associated with the motion generated in music production (Eitan &
Granot, 2006). Some researchers have proposed that musical experience is derived from
cross-modal processing between audition and visual motion. Through the
correspondence of the musical dynamic change and the intention of expressing motion
behind the auditory code, we can understand what connotations are conveyed by music
(Livingstone & Thompson, 2009; Molnar-Szakacs & Overy, 2006; Overy &
Molnar-Szakacs, 2009). Molnar-Szakacs and Overy (2006) reviewed many studies
about the relation between music and motion, and they came to the conclusion that the
acoustic components of music, such as amplitude variation, rhythm, and contour of
melody, were systematically synchronized with performers’ motions. Musical
experience might thus be generated from the co-representation of motor programming
between audio signal and motion production. Facial expression and body movement
express visual emotional cues that are decoded in order to understand the emotional
intention behind the visual image (Thompson et al., 2008). Brain imaging studies show
that the functions of detecting emotion in music and understanding what others are
thinking are highly associated with the mirror neuron system, which is considered a part
of the motor system. Like music, facial expressions and body movements also
activate the mirror neuron system (Livingstone & Thompson, 2009; Livingstone et al.,
2009). Through sensory-motor transformation by the mirror neuron system, we
understand the intention behind the sensory inputs, including emotional connotation.
Accordingly, we suggest that the audio-visual integration we found in this study
might result from cross-modal information reorganized in the mirror neuron system.
Overy and Molnar-Szakacs (2009) propose the Shared Affective Motion Experience
(SAME) model to explain the mechanism of perceived emotion in music. According to
this model, musical signals enter the fronto-parietal mirror neuron system from the
temporal and occipital cortex to be decoded and to generate motor programming. The
information flows to and is modified by the anterior insula and is then transported to the
limbic system, where emotional information is processed. Finally, the musical information
processed by this neural network forms the musical experience and emotional perception.
Recent neuroimaging studies support this hypothesis by showing that musical
processing activates the mirror neuron system and limbic regions (Blood & Zatorre,
2001; Green, Baerentsen, Stodkilde-Jorgensen, Wallentin, Roepstorff, & Vuust, 2008;
Hasegawa et al., 2004; Molnar-Szakacs & Overy, 2006; Peretz & Zatorre, 2005).
If visual images are considered part of the musical input, as this study suggests, the
SAME model can explain the cross-modal effect in perceiving emotion in music
performance. However, most brain imaging studies only focus on “music” perception
and neglect the close relationship between the music itself and the image of the
performance. Future studies can use fMRI and ERP techniques to investigate whether the
emotional enhancement effect of bimodal stimuli, compared to unimodal ones, reflects a
difference in brain activity in the neural network as predicted by the SAME model.
Brain regions such as the frontal-parietal lobe, anterior insula, and limbic system may
play an important role in audio-visual integration of emotional perception, an important
part of musical experience.