1. Introduction
1.1 Motivation
Retrieving or composing appropriate music for a performer’s motion sequence is essential in game and animation production. Today, ordinary users can capture dance motion from their own body movements with motion-sensing input devices such as Kinect for Xbox 360, and can add background music found by searching the Internet. However, finding the best combination among a large quantity of motion and music data can require considerable time and labor. Our goal is to help composers find appropriate music for a given dance motion input.
Dance is a performing art that generally refers to movement of the body, usually rhythmic, patterned, and accompanied by music. A choreographer designs a composition that usually focuses on rhythmic articulation, theme and variation, etc. Most dances have accompanying music, and the poses are designed to follow the rhythm of that music.
Music, on the other hand, is another performing art and spans many genres, such as rock, pop, and hip-hop. Most popular music is sectional, and the most common sections are the verse, the chorus or refrain, and the bridge. Common forms include thirty-two-bar form, verse-chorus form, and the twelve-bar blues. Popular songs are rarely through-composed (through-composed music is relatively continuous, non-sectional, and/or non-repetitive).
In this thesis, we exploit the sectional structure of both dance motion and music, and propose a method that measures the consistency between dance motion and music by matching their rhythmic structures.
1.2 Frameworks
In this thesis, we propose an automatic method for finding the consistency between dance motion and music. We preprocess all music in the database and use a dance motion as the input query to retrieve the best-fit music. We use properties shared by both, such as rhythm, to associate the dance motion with music. To extract the rhythmic property, we use two statistical models proposed by M. Levy and M. Sandler [2008] to segment and cluster both the music in the database and the input dance motion.
Before training the statistical models, we need to choose appropriate observation features (Sections 3.1 and 3.2). The constant-Q spectral transform (CQT) [Brown 1991] is one of the most popular transforms for musical signals. It has advantages over the conventional discrete Fourier transform for analyzing musical sounds because its geometrically spaced frequency bins can cover the range of human hearing, about ten octaves from 20 Hz to around 20 kHz. In this thesis, we choose the CQT as our musical observation feature because it suits musical data well. Other features of comparable popularity, such as chroma or Mel-frequency cepstral coefficients (MFCCs), could be substituted. For dance motion, on the other hand, the commonly used observation features are spatial: a hierarchical skeletal structure and the trajectories of the degrees of freedom (DOFs) of its joints. To analyze dance motion, we calculate the position of each joint in three-dimensional coordinate space, where two different poses are simple to distinguish.
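The ten-octave range quoted for the CQT can be checked with a short calculation. The sketch below assumes a resolution of 12 bins per octave (one bin per semitone), which is an illustrative choice and not a setting taken from this thesis; the function names are likewise hypothetical.

```python
import math

def cqt_center_frequencies(f_min=20.0, bins_per_octave=12, n_octaves=10):
    """Geometrically spaced centre frequencies f_k = f_min * 2**(k / bins_per_octave).

    bins_per_octave=12 is an assumed resolution, not a value from the thesis.
    """
    n_bins = bins_per_octave * n_octaves
    return [f_min * 2.0 ** (k / bins_per_octave) for k in range(n_bins)]

def quality_factor(bins_per_octave=12):
    """Ratio of centre frequency to bandwidth; it is the same for every bin."""
    return 1.0 / (2.0 ** (1.0 / bins_per_octave) - 1.0)

freqs = cqt_center_frequencies()
upper_edge = freqs[0] * 2.0 ** 10  # ten octaves above 20 Hz

print(len(freqs))        # number of bins covering ten octaves
print(upper_edge)        # upper edge of the ten-octave range in Hz
print(quality_factor())  # constant Q shared by all bins
```

Because every octave doubles the frequency, ten octaves above 20 Hz end at 20 × 2^10 = 20480 Hz, which is the "around 20 kHz" figure in the text.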
Once extracted, the observation features are represented by an m x n matrix (Figure 1.1), where m is the feature dimension and n is the length of the training input data. To reduce this high-dimensional matrix to a one-dimensional sequence of labels, we train an 80-state hidden Markov model (HMM) on the observation features (Section 3.3) and assign each feature vector a state number (Figure 1.1). Our purpose is to obtain a label sequence that substitutes for the complicated observation features.
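The labeling step can be illustrated with a small Viterbi decoder. This is only a sketch under strong assumptions: it uses 3 states instead of 80, hand-set Gaussian means with a shared variance instead of trained parameters, and hypothetical function names. It shows how each column of an m x n feature matrix (here m = 2, n = 6) is mapped to one state label.

```python
import math

def gaussian_logpdf(x, mean, var):
    """Log density of an isotropic Gaussian evaluated at feature vector x."""
    sq_dist = sum((xi - mi) ** 2 for xi, mi in zip(x, mean))
    return -0.5 * (len(x) * math.log(2.0 * math.pi * var) + sq_dist / var)

def viterbi(frames, means, var, log_trans, log_init):
    """Return the most likely state index for each frame."""
    n_states = len(means)
    # delta[s]: best log-probability of any state path that ends in state s
    delta = [log_init[s] + gaussian_logpdf(frames[0], means[s], var)
             for s in range(n_states)]
    back_pointers = []
    for x in frames[1:]:
        prev, delta, pointers = delta, [], []
        for s in range(n_states):
            best = max(range(n_states), key=lambda r: prev[r] + log_trans[r][s])
            pointers.append(best)
            delta.append(prev[best] + log_trans[best][s]
                         + gaussian_logpdf(x, means[s], var))
        back_pointers.append(pointers)
    # Trace the best path backwards from the most likely final state.
    state = max(range(n_states), key=lambda s: delta[s])
    path = [state]
    for pointers in reversed(back_pointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path

# Toy model (assumed values): three well-separated states that tend to persist.
means = [(0.0, 0.0), (5.0, 5.0), (10.0, 0.0)]
log_trans = [[math.log(0.8 if r == s else 0.1) for s in range(3)] for r in range(3)]
log_init = [math.log(1.0 / 3.0)] * 3

# Six frames, i.e. the n = 6 columns of a 2 x 6 feature matrix.
frames = [(0.1, -0.2), (0.3, 0.1), (4.8, 5.2), (5.1, 4.9), (9.7, 0.2), (10.2, -0.1)]
labels = viterbi(frames, means, 1.0, log_trans, log_init)
print(labels)  # one state label per frame
```

In practice the model parameters would be estimated from the training data before decoding; the hand-set values here only keep the example short.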
After the dimensionality reduction has produced the label sequence, we employ the Expectation-Maximization (EM) algorithm [LS08] with a constrained clustering algorithm to cluster the label sequence (Section 3.4). The constraint enforces temporal continuity on the cluster assignments. Each cluster is labeled with a letter of the alphabet, so every piece of music and the input dance motion can be represented by a string of letters. Our ranking algorithm compares the consistency between the cluster string of the dance motion and the cluster string of each piece of music. The cluster strings of all music in the database are computed before ranking.
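The letter-string representation can be made concrete as follows. The sketch assumes the clustering step has already produced one integer cluster index per frame; the indices and helper names are hypothetical, and the clustering itself is not shown.

```python
from itertools import groupby
from string import ascii_uppercase

def to_cluster_string(cluster_ids):
    """Map integer cluster indices (0, 1, 2, ...) to letters (A, B, C, ...).

    Supports at most 26 clusters, which is an assumption of this sketch.
    """
    return "".join(ascii_uppercase[c] for c in cluster_ids)

def to_sections(cluster_string):
    """Collapse runs of identical letters into (letter, length) sections."""
    return [(letter, len(list(run))) for letter, run in groupby(cluster_string)]

# Hypothetical per-frame cluster assignments after constrained clustering.
cluster_ids = [0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 2, 2]
cluster_string = to_cluster_string(cluster_ids)
sections = to_sections(cluster_string)

print(cluster_string)
print(sections)
```

Because temporal continuity keeps neighbouring frames in the same cluster, the string falls into a few long runs, and each run can be read as one section of the piece or of the dance.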
Given the sequences produced by HMM training and EM with constrained clustering, we define an objective function (our ranking algorithm) that finds the best-fit pair between the input dance motion and the music based on their rhythmic structure (Section 4). The sequences have rhythmic structure because the constrained clustering algorithm enforces temporal continuity between adjacent labels. The constrained clustering results thus determine the rhythmic properties, and the rhythmic structure of the input dance motion determines the best-fit music in the database. Figure 1.2 shows the flow chart of our algorithm.
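To give a feel for what ranking by structural consistency means, the sketch below scores two cluster strings with a Rand-index-style pairwise agreement. This is a stand-in chosen for illustration and is not the objective function defined in Section 4. It assumes both strings have been resampled to the same length, and all names and example strings are hypothetical.

```python
from itertools import combinations

def structure_agreement(a, b):
    """Fraction of frame pairs on which two cluster strings agree.

    A pair (i, j) agrees when both strings place frames i and j in the same
    cluster, or both place them in different clusters. The score does not
    depend on which letters are used, only on the section structure.
    """
    if len(a) != len(b):
        raise ValueError("cluster strings must have equal length")
    pairs = list(combinations(range(len(a)), 2))
    if not pairs:
        return 1.0
    agree = sum((a[i] == a[j]) == (b[i] == b[j]) for i, j in pairs)
    return agree / len(pairs)

def rank_music(motion_string, database):
    """Return music names sorted from most to least consistent with the motion."""
    scored = [(structure_agreement(motion_string, s), name)
              for name, s in database.items()]
    return [name for score, name in sorted(scored, key=lambda t: -t[0])]

# Hypothetical precomputed cluster strings for three pieces of music.
database = {
    "song_1": "XXXYYYXXX",
    "song_2": "XYXYXYXYX",
    "song_3": "XXXXYYYYY",
}
ranking = rank_music("AAABBBAAA", database)
print(ranking)
```

A pairwise measure is used here because the letters assigned to dance clusters and music clusters are arbitrary, so only the pattern of repetition can be compared directly.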
Figure 1.1. The left matrix is the raw data before HMM training, where m is the size of the feature data and n is the length of the training data. The right vector is the resulting one-dimensional label sequence, obtained by decoding the state labels of the trained HMM with the Viterbi algorithm.
Figure 1.2. The flow chart of our algorithm.