ACM - Computers in Entertainment

A Perceptual and Affective Evaluation of an Affectively Driven Engine for Video Game Soundtracking

By Duncan Williams, Jamie Mears, Alexis Kirke, Eduardo Miranda, Ian Daly, Asad Malik, James Weaver, Faustina Hwang, Slawomir Nasuto


We report on a player evaluation of a pilot system for dynamic video game soundtrack generation. The system being evaluated generates music using an AI-based algorithmic composition technique to create a score in real time, in response to a continuously varying emotional trajectory dictated by gameplay cues. After a section of gameplay, players rated the system on a Likert scale according to emotional congruence with the narrative, and also according to their perceived immersion in the gameplay. The generative system showed a statistically significant and consistent improvement in ratings for emotional congruence, yet a decrease in perceived immersion, which might be attributed to the marked difference in instrumentation between the generated music, voiced by a solo piano timbre, and the original, fully orchestrated soundtrack. Finally, players rated selected stimuli from the generated soundtrack dataset on a two-dimensional model reflecting perceived valence and arousal. These ratings were compared to the intended emotional descriptors in the meta-data accompanying specific gameplay events. Participant responses suggested strong agreement with the affective correlates, but also a significant amount of inter-participant variability. Individual calibration, or further adjustment, of the musical feature set is therefore suggested as a useful avenue for further work.


Categories and Subject Descriptors: H.5.5 [Information Interfaces and Presentation]: Sound and Music Computing—Methodologies and techniques; H.1.2 [Models and Principles]: User/Machine Systems—Human Information Processing; I.5.1 [Pattern Recognition]: Models—Statistical

General Terms: Design, Algorithms, Testing

Additional Key Words and Phrases: Algorithmic composition, affect, music perception, immersion, emotional congruence



High-quality soundtracking has the potential to enhance player immersion in video games [Grimshaw et al. 2008; Lipscomb and Zehnder 2004]. Combining emotionally congruent soundtracking with game narrative has the potential to create significantly stronger affective responses than either stimulus alone—the power of multimodal stimuli on affective response has been shown both anecdotally and scientifically [Camurri et al. 2005]. Game audio presents at least two additional challenges over other sound-for-picture work: firstly, the need to be dynamic (responding to gameplay states), and secondly, the need to remain emotionally congruent whilst adapting to non-linear narrative changes [Collins 2007]. Thus, creating congruent soundtracking for video games is a non-trivial task, as their interactive nature necessitates dynamic and potentially non-linear soundtracking. This requirement is essentially due to the unpredictable element of player control over the narrative, without which the game would cease to be interactive. This problem has been approached with various solutions. A commonly used solution is to loop a pre-composed passage of music until a narrative break, such as the end of a level, death of a player, victory in a battle, and so on. This type of system is illustrated in Figure 1. However, this approach can become repetitive, and potentially irritating to the player if the transition points are not carefully managed, as musical repetition has been shown to have its own impact on the emotional state of the listener [Livingstone et al. 2012].




Figure 1. A game soundtrack adapting to play dynamically—the middle section is looped until a narrative breakpoint is reached. All passages, and transitions between passages (cadences, etc.), are precomposed. This approach has been used extensively in a wide range of games (see, for example, the Final Fantasy series).



Figure 2. Divergent branching system used to sequence a musical score without looping, determined by arousal cues as meta-tags corresponding to gameplay narrative (arousal increase or decrease at beginning, middle, and end). Even for a very simple narrative with just three stages, 7 discrete pieces of music are required, with a resulting impact on composition time and storage space.
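The storage cost of a branching system grows geometrically with narrative depth. As a minimal sketch (not taken from the original work), the number of pre-composed pieces required for a score that branches on a binary arousal cue at each stage can be counted as:

```python
# Sketch (illustrative, not from the paper): each narrative stage branches on
# an arousal cue (increase/decrease), so stage k requires 2**k distinct
# pre-composed pieces.

def pieces_required(stages: int) -> int:
    """Total pre-composed pieces for a binary branching score."""
    return sum(2 ** k for k in range(stages))  # 1 + 2 + 4 + ...

print(pieces_required(3))  # -> 7, as in Figure 2
```

For the three-stage narrative of Figure 2 this yields 1 + 2 + 4 = 7 discrete pieces; a ten-stage narrative would already require 1,023, which illustrates why pre-composition does not scale.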


1.1  Defining Player Immersion

Player immersion is well-understood by the gaming community and as such is a desirable, measurable attribute [Qin et al. 2009]. Some researchers consider that the enduring popularity of videogames is due to the total immersion in a player-mediated world [Weibel and Wissmath 2011]. Defining immersion in this context is, however, not trivial. A player-derived definition has been suggested as a sense of the loss of time perception [Sanders and Cairns 2010], but this was refuted in later work [Nordin et al. 2013] as a solely anecdotal definition requiring further experimental analysis to establish any correlation between time perception and immersion. Some studies had previously attempted to address this in an ecological context [Tobin et al. 2010], i.e., while actually playing, with the suggestion that players fundamentally underestimate the length of time spent if they evaluate the duration prospectively rather than retrospectively, and that the number of hours involved in regular play was a reliable predictor of perceived gameplay duration estimates from the participating players. Time perception is therefore difficult to correlate linearly with perceived immersion. Nordin further suggested that attention might be a better direct correlate of immersion, with time perception being implicated when immersion is experienced as a by-product of attention rather than being a direct correlate itself [Nordin et al. 2013]. The cognitive processes involved in the player considering the experience immersive are likely to be narrative dependent; in other words, the music needs to be emotionally congruent with gameplay events [Jørgensen 2008]. This has implications for soundtrack timing, and for emotional matching of soundtrack elements with gameplay narrative [Williams, Kirke, E. R. Miranda, et al. 2015].
If the music is appropriate to the context of the gameplay, it is likely that there will be a direct relationship between greater emotional impact and an increase in perceived immersion [Grimshaw et al. 2008; Lipscomb and Zehnder 2004].

Emotional congruence between soundtracking and gameplay is also a measurable attribute, provided the players evaluating this attribute have a shared understanding of exactly what it is they are being asked to evaluate [Bensa et al. 2005]. LucasArts implemented a dynamic system, iMuse (see [Strank 2013] for a full treatment), to accompany their adventure game series in the late 1980s (which included the Indiana Jones series and, perhaps most famously, the Monkey Island series of games) [Warren 2003]. This system implemented two now-commonplace solutions, horizontal re-sequencing and vertical re-orchestration, both of which were readily implementable due to the use of MIDI orchestration. However, the move towards recorded audio made many of these transformations more complex, beyond the compositional aspect alone.

Music has been shown to be able to influence the perceived emotion of images, for example in the recognition of happy or sad facial expressions [Aubé et al. 2014; Schubert 2004]. Beyond simply reflecting the existing narrative, music can be used to project particular emotional qualities onto the imagery, helping to sustain engagement or to incentivise particular gameplay objectives, potentially leading to engrossment.

The ability of the composer to influence player engagement (and encourage players to continue playing) beyond that of the game designer, by reinforcing narrative through emotionally congruent soundtracking, has become an exciting and fertile area for work. Other theoretical approaches to engagement through congruent soundtracking have also been suggested. Listener expectation is a powerful cue for emotions [Huron 2006], and has been related to engagement with videogames by means of schemata confirmation and violation [Douglas and Hargadon 2000]. These mechanisms demonstrate many similarities with the world of sound for film, which may offer some explanation as to how they would be enjoyable to a first-time gamer. Examples include fast tempos accompanying action sequences, or passages where the player is otherwise under pressure. If these preconceptions are violated, some studies suggest that experienced gamers may find the sensation enjoyable on a neurological level [Janata 1995; Sridharan et al. 2007], as correlates of predictability might be used to enhance attention and help move a player to a deeper level of immersion. Engrossment follows engagement, and might be considered the point where the player becomes emotionally invested in the game, beyond simply enjoying it. The player may find interacting with the game no longer requires any cognitive effort (in the same way that inexperienced automobile drivers have to pay attention to the process of driving, but this cognitive load decreases with time spent behind the wheel) [Arsenault 2005]. Soundtracking at this stage becomes synonymous with the emotional content of the narrative, such that one might influence the other symbiotically [Mar et al. 2011].
Total immersion, or flow, "the holistic sensation that people feel when they act with total involvement" [Csikszentmihalyi 1975], is a state where the player not only acts automatically in controlling the gameplay, but would not necessarily consider themselves to be playing the game any more, but rather actually being in the game. In this state there is a merging of action and self-awareness, with the player's full attention immersed in the game.

Considering the affective potential of music in video games is a useful way of understanding, and potentially enhancing, the player experience of emotion from the gameplay narrative.


1.2  Defining Emotional Congruence

Three types of emotional responses are commonly found in the music psychology literature: emotion, affect, and mood, though the differences between these are less often explained [Russell and Barrett 1999]. The general trend is for emotions to be short episodes, with moods longer-lived. The literature often makes a strong distinction between perceived and induced emotions (see, for example, [Västfjäll 2001; Vuoskoski and Eerola 2011; Gabrielsson 2001]); though the precise terminology used to differentiate the two also varies, the over-arching difference is whether the emotion is communicated to, or experienced by, the listener. The reader can find exhaustive reviews of the link between music and emotion in [Scherer 2004], which is explored further in the recent special issue of Musicae Scientiae [Lamont and Eerola 2011].

There are a number of emotion models that can be used when approaching emotional responses to musical stimuli, including categorical models, which describe affective responses with discrete labels, and dimensional models, which approach affect as co-ordinates, often in a two-dimensional space. Russell's circumplex model [Russell 1980], for example, provides a way of parameterising affective responses to musical stimuli in two dimensions: valence (a scale of positivity) and arousal (a scale of energy or activation strength), although three-dimensional spaces are also common [Eerola and Vuoskoski 2010]. Emotional descriptors from Hevner's adjective cycle can be mapped quite closely onto the two-dimensional model [Hevner 1936], in order to create a semantic space. This creates a dimensional-categorical approach [Schubert 1999, p.22], whereby intensely negative states, such as anger or fear, would occur at the opposite end of a two-dimensional space from low-intensity positive states such as calmness or relaxation. Thus, categorical and dimensional approaches are not necessarily irreconcilable, as emotional labels from categorical models can be mapped onto dimensional spaces without any particular difficulty. Both types of model have been used to carry out affective evaluations of music in a large number of studies [Juslin and Sloboda 2010]. Recently, music-specific approaches have been developed, notably the Geneva Emotion Music Scale (GEMS) [Zentner et al. 2008], which describes nine dimensions covering a complete semantic space of musically evoked emotions.
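As an illustration of the dimensional-categorical approach, a minimal sketch might place discrete adjectives at co-ordinates in the valence/arousal plane and label an arbitrary point by its nearest descriptor. The co-ordinates below are assumptions chosen for illustration only, not values from Hevner or Russell:

```python
# Illustrative dimensional-categorical mapping: discrete adjectives placed at
# assumed (valence, arousal) coordinates in [-1, 1] x [-1, 1].

AFFECT_COORDS = {
    "excited": ( 0.8,  0.8),
    "happy":   ( 0.8,  0.3),
    "calm":    ( 0.6, -0.6),
    "angry":   (-0.8,  0.8),
    "sad":     (-0.7, -0.4),
    "tired":   (-0.3, -0.8),
}

def nearest_descriptor(valence: float, arousal: float) -> str:
    """Map a point in the 2-D space to its closest categorical label."""
    return min(AFFECT_COORDS,
               key=lambda d: (AFFECT_COORDS[d][0] - valence) ** 2
                           + (AFFECT_COORDS[d][1] - arousal) ** 2)

print(nearest_descriptor(0.7, 0.9))  # -> excited
```

This is the sense in which the two model families are reconcilable: the categorical label is simply the region of the dimensional space the rating falls into.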

The relative importance to the gamer of immersion and emotional congruence is not necessarily evenly weighted. Immersion is essential, or the player will likely cease playing the game; keeping the player hooked is an important goal of game design. Emotional congruence, on the other hand, may enhance player immersion, but is likely to hold a lower place in the player’s perceptual hierarchy. One notable exception might be in situations where the player deliberately controls the music (as can be seen in games like Guitar Hero, for example). The process then becomes a complex feedback loop wherein the player not only influences the selection of music according to their mood but the selection of music also has a subsequent impact on the player’s mood.


1.3  Defining Affectively-Driven Algorithmic Composition

Affectively-driven algorithmic composition (AAC) is an emerging field combining computer music research and perceptual/psychological approaches to music cognition [Mattek 2011; Williams et al. 2013; Williams et al. 2014]. AAC systems attempt to communicate specific emotions to the listener.

A disadvantage of looping-based approaches to gameplay soundtracking is the high amount of repetition involved. This can become distracting or, worse, irritating at transition points, which can have a knock-on negative effect on player immersion. The resolution of looping systems can be improved by adding divergent score ‘branches’ at narrative breakpoints within the soundtrack, which results in more complex, less repetitive musical sequences. However, the need to create the contributory score fragments in such a manner that they can be interchanged whilst maintaining the intended aesthetic congruency with the narrative poses a significant challenge to the video-game composer. In simple terms, the over-arching challenge is that video game music can become repetitive and thereby undermine player immersion, but composing large quantities of music with particular moods and emotions is not practical for most games, both in terms of storage on the media (whether disc, cartridge, or bandwidth in, for example, online streaming games) and in terms of human cost (i.e., the composer's time when constructing large numbers of interchangeable musical sequences). Thus, the adaptability of a branching system to emotional responses for these purposes is somewhat compromised.

This paper addresses such challenges by considering an AAC system for creation of music on the fly. AAC creation of music in this manner has the potential to overcome some of these restrictions, yet many systems for algorithmic composition fail to directly target emotional responses [Williams et al. 2013], and thus the necessary narrative congruency demanded by game soundtracking might still be compromised by such systems. In this prototype we evaluate a system incorporating a range of musical features with known affective correlates: tempo, mode, pitch range, timbre, and amplitude envelope. For computational efficiency, the value range of each of these musical features was discretized to three possible levels, resulting in a generative ruleset defined over a discretized 3×3 partition of a two-dimensional affective space based on the circumplex model [Russell 1980]. The specification of the system under evaluation in this paper, which attempts to tackle these challenges, has been previously presented as a proof-of-concept [Williams et al. 2015], but has not yet been the subject of a perceptual or affective evaluation.


The AAC pilot described here uses a transformative algorithm based on a second-order Markov model [Ames 1989] with a musical feature matrix that allows for discrete control over five musical parameters, in order to imply various affective descriptors in a categorical/dimensional model loosely arranged over two dimensions, after the circumplex model of affect [Russell 1980].


2.1  Musical Structure Representation, Analysis, and Generation

Markov generation of musical structures has been frequently described in the literature (see, for example, [Ames 1989; Visell 2004; Papadopoulos and Wiggins 1999]). The second-order Markov model used here is defined over the finite space of the five musical features mentioned above. The model consists of a transition probability matrix, with each row corresponding to a conditional probability vector defined over the array of possible next states (musical feature 5-tuples), given the last two states. The model, i.e., the entries of the state transition probability matrix, is learned from the musical training material. The Markov model is generative: once learned, it can be used to create new state (musical feature 5-tuple) sequences according to the likelihood of a particular state occurring after the current and preceding states. The generated musical state sequences are subsequently further transformed according to the distance between the current features and the features which correlate to a given affective target (the transformations indicating the affective correlates are shown in Table 1). The transformed data are then synthesized using a piano timbre. The system can create a surprisingly large variety of material from a very limited amount of seed data. The input material used to train the system in this pilot study was twelve monophonic bars of a Mozart piano concerto in the key of C major. Material can be generated so quickly that it could in future be used to create score and performance data in pseudo real-time (in normal operation, listeners would not be aware of any noticeable latency between generation and triggering of synthesized or sampled audio), but in this case, a range of pre-rendered sequences was produced as amplitude-normalized PCM wave files. For each of the nine affective correlates (see section 2.2, Figure 4), 11 sequences were generated, each one minute in duration.
During gameplay, these files are cross-faded to create continuous soundtracking either within a single affective co-ordinate or across a variety depending on the gameplay cues. An overview of the generation and transformation process is shown in Figure 3.
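The learning and generation steps described above can be sketched as follows. This is a simplified illustration, not the authors' implementation: a single string stands in for each musical-feature 5-tuple, and the subsequent transformation stage toward the affective target is omitted.

```python
# Minimal second-order Markov learning and generation sketch. Each "state"
# stands in for a 5-tuple of musical features; strings are used for brevity.
import random
from collections import defaultdict

def learn(sequence):
    """Count transitions conditioned on the previous two states."""
    table = defaultdict(list)
    for a, b, c in zip(sequence, sequence[1:], sequence[2:]):
        table[(a, b)].append(c)          # duplicate entries encode probability
    return table

def generate(table, seed, length, rng=random):
    """Generate a state sequence of the given length from a two-state seed."""
    out = list(seed)
    for _ in range(length - 2):
        choices = table.get((out[-2], out[-1]))
        if not choices:                  # dead end: fall back to the seed state
            choices = [seed[0]]
        out.append(rng.choice(choices))
    return out

training = ["C", "E", "G", "E", "C", "E", "G", "C"]
print(generate(learn(training), ("C", "E"), 8))
```

In the system under evaluation each generated state would then be adjusted toward the target affective co-ordinate before synthesis, as described above.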



Figure 3. Generative process. A Markov chain is used to generate new material, which is then transformed according to the distance between the current set of musical features and the target musical features that correlate to a given point in the emotion space. Note that this flow is applied sequentially, one generated Markov state (5-tuple of musical features) at a time: each generated state has an implied affective value, and the transformation adjusts its features to meet the target affective state.


Broader musical structure, including thematic variation and repetition, is not addressed in this pilot system beyond the matching of a single emotional trajectory according to the player character. Thus, this evaluation should not be generalized beyond MMORPGs, and in future would be best expanded to an evaluation across a wide range of games for each participant. Moreover, the possible positive influence of repetition on player immersion should not be discounted [Pichlmair and Kayali 2007; Lipscomb and Zehnder 2004]. Structural composition techniques remain challenging for all automatic composition systems [Edwards 2011], and as such present a fertile area for continued investigation beyond the scope of the work presented here.


2.2  Affective Model

The algorithmic composition system references affective correlates according to the narrative of the gameplay to derive an affective target for the generated music. The system uses a combined dimensional and categorical approach for this affective target. The two-dimensional circumplex model of affect (valence, a scale of positivity, on the horizontal axis, and arousal, a scale of energy or activation strength, on the vertical axis) is divided into 9 sectors that are indexed with meta-tags corresponding to Cartesian co-ordinate values, each a discrete affective descriptor, as shown in Figure 4. Thus, a range of basic affective descriptors is represented across the sectors of this model, with lower and higher arousal levels separated vertically across the affective space (though, as with any affective descriptors, some degree of perceptual overlap is present among them; see, for example, pleased and happy, or pleased and content. Such descriptors cannot be universally discrete). In this manner, a co-ordinate of { v3, a3 } would refer to excited.

Figure 4. Two-dimensional model divided into 9 sectors with discrete affective adjectives mapped to the circumplex model (angry, sad, tired, pleased, content, frustrated, excited, happy and calm).


Table I. Showing affective correlates, and the corresponding musical parameter mappings used by the generative model.



A quest (section of gameplay) from World of Warcraft (a massively multiplayer online role-playing game, MMORPG), was marked up with various affective targets as meta-tags (for example, fighting scenes were tagged with {v1, a3}, or angry). Two screen-shots illustrating the application of this meta-tagging are shown in Figures 5 and 6, where a battle scene is accompanied by angry music until the player is victorious, at which point, content is cued.

Stimuli corresponding to the affective meta-tag were selected randomly from the pre-generated stimulus pool (for each affective co-ordinate the pool contained 11 audio files, each one minute in duration) during gameplay via a dedicated audio playback engine built using Max/MSP. The engine selects one of the 11 stimuli randomly, cross-fading with the next stimulus choice until the entire pool has been used (or until a new meta-tag trigger is received), using a unique random number function. Timbral changes in the generated soundtrack were created by means of piano dynamics (loud performances resulting in harder and brighter timbres, with more pronounced upper harmonics and a correspondingly higher spectral centroid). Participants were then asked to complete the quest three times: once with the original musical soundtrack, once with a soundtrack provided by the algorithmic composition system, and once with no musical accompaniment at all (sound effects were still used, for example, action sounds). Each playback was recorded so that the stimulus selections could be repeated for subsequent affective evaluation.
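The selection logic of the playback engine might be sketched as follows (the original engine was built in Max/MSP; this Python rendering, including the class and file names, is an assumption for illustration). Stimuli for the active affective sector are drawn without repetition until the pool is exhausted or a new meta-tag arrives, mirroring the "unique random number" behaviour described above.

```python
# Illustrative draw-without-replacement stimulus selection for one affective
# sector's pool of pre-rendered audio files.
import random

class StimulusPool:
    def __init__(self, files, rng=None):
        self.files = list(files)
        self.rng = rng or random.Random()
        self.queue = []

    def next_stimulus(self):
        """Return the next stimulus; no repeats until the pool is exhausted."""
        if not self.queue:                              # refill once exhausted
            self.queue = self.rng.sample(self.files, len(self.files))
        return self.queue.pop()

pool = StimulusPool([f"v1a3_{i:02d}.wav" for i in range(11)])
first_cycle = {pool.next_stimulus() for _ in range(11)}
print(len(first_cycle))  # -> 11: no repeats within a cycle
```

A new meta-tag trigger would simply swap in the pool for the new sector, with the cross-fade handled by the audio engine.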




Figure 5. Screen capture of a sequence of gameplay from a generative-soundtrack playthrough, marked up with {v1, a3 } or angry, which accompanies a battle scene (the process of entering battle cues the change in affective target).



Figure 6. Screen capture of a sequence of gameplay from a generative-soundtrack playthrough, marked up with {v2, a2 }, or content, triggered after the player had successfully completed the battle sequence above.


Participants were asked to rate emotional congruence and immersion after each playthrough, using an 11-point Likert scale presented via an interactive web-browser form, as shown in Figure 7. Short definitions for both terms were included in the pre-experiment instructions for participants.



Figure 7. Listener interface for evaluating emotional congruence of music and immersion for each playthrough.


Having evaluated emotional congruence and immersion for each of the musical playthroughs, participants were also asked to rate the perceived emotion of each stimulus that they had been exposed to in the generative soundtrack playthrough, using a two-dimensional space labeled with the self-assessment manikin [Bradley and Lang 1994] showing valence on the horizontal scale and arousal on the vertical scale, allowing both valence and arousal to be estimated in a single rating. Valence was defined to the participants as a measure of positivity, and arousal as a measure of activation strength in a pre-experiment familiarization and training stage. This participant interface was also implemented in Max/MSP.

In total, 11 participants took part in the experiment: 6 males and 5 females. 9 participants were aged 18-21, while the remaining 2 were aged 22-25.

In a pre-evaluation questionnaire, the majority of the participants reported that they enjoyed listening to music, while only 45% of them had experience of composing or performing music. 72% of participants reported that they enjoyed playing or watching video games, although 45% of participants answered that they spent only 0.5 hours a week watching or playing video games.

Participants undertook the experiment using circumaural headphones in a quiet room with a dry acoustic. Gameplay was presented on a 15” laptop computer. The game was not connected to the internet (i.e., there were no other player-characters, all characters apart from the avatar were computer controlled). The exact duration of playthroughs varied on a player-by-player basis, from 4 minutes to a ‘cut off’ of 10 minutes. A variety of affective states may be entered in each playthrough, depending on the actions of the player, for example exploration, fighting, fighting and being victorious, fighting and failure, evading danger, and interacting positively with other (non-player) character avatars.



The Likert-scale responses for emotional congruence and immersion are shown in Table 2. The mean emotional congruence rating improved by 1, with a standard deviation of 0.7 and a p-value of less than 0.01. This strongly suggests that listeners found the generated soundtrack more emotionally congruent with the gameplay than the original soundtrack, and that this improvement was both consistent across participants and statistically significant. This is a promising result for the AAC system. However, mean immersion decreased by 1.750 in ratings of the generated soundtrack playthrough compared to the original soundtrack playthrough, with a p-value below the 0.05 significance threshold, at 0.03. Despite the high accompanying standard deviation of 2.1 for generated soundtrack immersion, this suggests that player immersion was consistently reduced in the generated soundtrack playthrough. Overall, these results suggest that an increase in emotional congruence of ~10% can be achieved by the current AAC prototype system, at the expense of a ~30% reduction in immersion.
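The paper does not state which statistical test produced its p-values; as one defensible choice for paired ordinal (Likert) data, an exact two-sided sign test can be computed with the standard library alone. The ratings below are invented example data, not the study's results.

```python
# Exact two-sided sign test for paired ordinal ratings. The ratings are
# invented for illustration; they are NOT the study's data.
from math import comb

def sign_test(before, after):
    """Exact two-sided sign test on paired ratings; ties are dropped."""
    diffs = [b - a for a, b in zip(before, after) if a != b]
    n, k = len(diffs), sum(d > 0 for d in diffs)
    # two-sided binomial tail probability under H0: P(positive diff) = 0.5
    tail = sum(comb(n, i) for i in range(min(k, n - k) + 1)) / 2 ** n
    return min(1.0, 2 * tail)

original  = [5, 6, 5, 4, 6, 5, 5, 6, 4, 5, 6]   # invented example ratings
generated = [6, 7, 6, 5, 7, 6, 6, 7, 5, 6, 7]   # each rating one point higher
print(sign_test(original, generated))  # -> 0.0009765625
```

With all 11 invented pairs moving in the same direction the test reaches p < 0.01; mixed directions would raise the p-value accordingly.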


Table 2. Likert-scale responses showing participant reactions to playthroughs with the original soundtrack and the generated soundtrack.


Figure 8. From left to right, mean participant ratings for original soundtrack emotional congruence, original soundtrack gameplay immersion, generated soundtrack emotional congruence, and generated soundtrack gameplay immersion. Error bars indicate 95% confidence interval for each mean.


The mean ratings for emotional congruence and immersion in playthroughs with the original soundtrack and the generated soundtrack, as shown in Figure 8, suggest that although the means are close, the difference between original and generated ratings in both emotional congruence and immersion is significant (a horizontal line can be drawn from the top of the error bar for original soundtrack emotional congruence below the bottom of the error bar for generated soundtrack emotional congruence; similarly, one can be drawn from the bottom of the confidence interval for original soundtrack immersion to well above the top of the confidence interval for generated soundtrack immersion). This suggests that the improvement in emotional congruence from the original soundtrack to the generated soundtrack is consistent. Immersion is consistently lower in the generated soundtrack results, though this measure exhibits the largest variance and consequently the highest standard deviation. However, the sample size is small and these results should therefore be interpreted with some caution. The smallest variation occurs in the ratings for generated soundtrack emotional congruence, which suggests that even though the number of participants was small, removing participants would not change the outcome drastically (and thus, including additional participants from the same small demographic would likely not significantly influence the outcome of this pilot study).
If we compute power after Rosner [Rosner 2010], allowing a 5% Type I error rate and using the standard deviations and means reported in Table 2, we find that for the original soundtrack emotional congruence we can be confident of 70% power with 10 participants at the reported standard deviation of 1.188, and for the generated soundtrack emotional congruence of 70% power with 10 participants at the reported standard deviation of 0.744 (both would require 13 participants to achieve 80% power at the reported standard deviations). Using the same measures for immersion, we find 80% power with 11 participants at the reported standard deviation of 0.916 for the original soundtrack, and 80% power with 11 participants at the reported standard deviation of 2.100 for the generated soundtrack. This suggests reasonable statistical power despite the small sample size, especially considering the limited variability between participants (who fell closely within the target demographic of this type of video game and reported similar amounts of weekly gameplay). The infancy of the field at the time of these experiments means there are few precedents for appropriate evaluation paradigms, and testing for normality would require hundreds of trials, which might itself affect the participants (repeatedly undertaking the same section of gameplay might, for example, have an impact on immersion). However, the system under evaluation here has previously been the subject of a number of other trials, including discrete musical feature evaluation [Daly et al. 2014; Williams, Kirke, E. Miranda, et al. 2015] and broader two-dimensional affective evaluation by self-report and biophysiological measurement [Daly et al. 2015; Williams, Kirke, J. Eaton, et al. 2015].
Thus, whilst we still consider the small sample size a caveat, we consider both the low p-values and the comparatively high power encouraging for further work with a more fully realized system and larger numbers of participants in the future.
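Power calculations of the kind cited above typically follow the one-sample normal-approximation formula, power = Φ(√n·|δ|/σ − z₁₋α/₂). A sketch, using a placeholder effect size δ rather than values from the study:

```python
# Normal-approximation power for a one-sample comparison (after Rosner-style
# formulas). The effect size delta below is a placeholder for illustration,
# not a value taken from the paper.
from math import erf, sqrt

def phi(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def power(n, delta, sigma, z_crit=1.96):
    """Power at two-sided alpha = 0.05 for n participants."""
    return phi(sqrt(n) * abs(delta) / sigma - z_crit)

print(round(power(n=10, delta=1.0, sigma=1.188), 3))
```

The formula makes the trade-offs explicit: power rises with the sample size and the effect size, and falls as the standard deviation grows, which is why the high-variance immersion ratings are the weakest link in the analysis.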

The spread of ratings for emotional congruence between the original soundtrack and the generated soundtrack playthroughs is illustrated in Figures 9 and 10. Participants were not asked to rate the emotional congruence of the soundtrack in the silent playthrough. However, 10 out of 11 participants rated the silent playthrough as less immersive than the two playthroughs that included a musical soundtrack.


Figure 9. The spread of ratings for emotional congruence responses to original soundtrack playthroughs.


Figure 10. The spread of ratings for emotional congruence responses to generated soundtrack playthroughs.


3.1  Listener Agreement with Intended Emotional Meta-tag

Mean listener agreement between perceived emotion and the intended emotional meta-tag was ~85%, with a standard deviation of 4.9 and a p-value of 0.08. Full results are shown in Table 3.


Table 3. Listener agreement with emotional correlate meta-tag for stimuli from each sector (shown to a maximum of 3 decimal places). Note that the full stimulus set was not evaluated by each listener; only stimuli present during the playthrough were repeated for evaluation in the two-dimensional space.


Mean listener agreement, as shown in Table 3, was markedly high for stimuli across all sectors, from 76.5% for { v1, a2 } to 90.1% for { v1, a1 }. Standard deviation and standard error of the mean typically decreased with higher mean agreement (with the exception of { v3, a1 }). Initially, this suggests that the musical feature mapping used in the generative system, which derives affective correlates from the literature and uses them as features to inform the generation of affectively driven music, was operating as intended, and that the AAC system was thereby able to generate music correctly according to a specific affective target. However, whilst the overall standard deviation is relatively low, the p-value does not reach a significance threshold of <0.05, which suggests that the range of individual emotional responses across participants remains unpredictable. This makes a strong argument that adapting this type of system to individual responses would be a useful avenue for further work: for example, by calibrating the musical feature set of the generative algorithm to each individual player before commencing, or by using biofeedback data (e.g., from a brain cap, heart-rate monitor, or other biosensor) to calibrate the musical feature set on a case-by-case basis in an attempt to reduce inter-participant variability. The emerging field of brain-computer music interfacing (BCMI) is steadily making inroads into this type of adaptive control for music (systems which combine other biosensors with brain control are sometimes referred to as hybrid BCMI) [Miranda et al. 2011].
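As an illustration of how such an agreement score can be computed, the sketch below maps each two-dimensional self-report to a discrete sector and counts matches against the intended meta-tag. The 3×3 equal-cell grid, the [−1, 1] rating range, and the ratings themselves are assumptions made purely for illustration; they are not the study's actual sector layout or data.

```python
def sector(valence, arousal, bins=3):
    """Map a continuous (valence, arousal) rating in [-1, 1]^2 to a discrete
    sector label such as 'v1a1'. The 3x3 equal-cell grid is an assumption
    made for illustration, not necessarily the layout used in the study."""
    def idx(x):
        x = max(-1.0, min(1.0, x))  # clamp to the assumed rating range
        return min(int((x + 1) / 2 * bins), bins - 1) + 1
    return f"v{idx(valence)}a{idx(arousal)}"

def agreement(ratings, target_tag):
    """Fraction of two-dimensional self-reports landing in the target sector."""
    return sum(sector(v, a) == target_tag for v, a in ratings) / len(ratings)

# Hypothetical ratings for a stimulus meta-tagged 'v1a1' (low valence, low arousal).
ratings = [(-0.8, -0.7), (-0.6, -0.9), (-0.7, -0.2), (0.3, -0.8)]
print(agreement(ratings, "v1a1"))  # 2 of the 4 hypothetical ratings match
```

Per-sector means, standard deviations, and standard errors as reported in Table 3 would then follow by aggregating such agreement scores across stimuli and listeners.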


The generated soundtrack was performed via a single synthesized piano timbre, in sharp contrast to the original soundtrack, a fully orchestrated symphonic score. Nevertheless, participants consistently found that the generated music matched the emotional trajectory of the gameplay more congruently than the original musical soundtrack. The majority of participants also reported that the playthroughs accompanied by a musical soundtrack were more immersive than the silent playthrough (which featured sound effects but no musical accompaniment). However, there was also a marked decrease in reported immersion during the generated soundtrack playthroughs, perhaps because of the lack of familiarity and thematic repetition in the generated soundtrack. This could usefully be addressed in further work by evaluating repetition of themes, and by using generated music across multiple different games with the same player. Another possible explanation for the increase in emotional congruence alongside the decrease in immersion is the orchestration of the AAC system (solo piano). The original soundtrack might lack emotional variation, but it offers a fullness and depth of instrumentation that is not easily matched by a single instrumental voice. This challenge could be evaluated in future work by creating a piano reduction of the original gameplay score and subjecting it to the same evaluation, or by developing the generative system further so that it can create more fully realized pieces of music using multiple instrumental timbres (which in itself presents a significant challenge for algorithmic composition).

Call of Duty: Modern Warfare features 17 unique pieces of music, around 52:31 in total duration, occupying 531 MB of storage. Players might spend upwards of 100 hours playing the game, and are therefore likely to have heard each piece of music many times over. If a system like the AAC pilot evaluated here could be expanded to sustain or improve player immersion as well as emotional congruence, the benefits to the video game world would not be limited to a reduced workload for composers or less repetitive soundtracking for players: the amount of data required for soundtrack storage might also be significantly reduced (a stereo CD-quality PCM wave file takes approximately 10 MB of storage per minute of recorded audio). Whilst storage space is not as scarce a commodity as it was in the days of game delivery on cartridge or floppy disk, gaming is increasingly moving onto mobile platforms (phones, tablets) with limited storage, and online streaming is also a popular delivery platform for gaming. A reduction in data through AAC-based soundtracking could therefore represent a significant and valuable contribution to video game delivery in the future.
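The quoted figures are easy to sanity-check. The arithmetic below assumes 44.1 kHz, 16-bit stereo PCM (the CD-quality format mentioned above) and binary megabytes (MiB = 2²⁰ bytes):

```python
# CD-quality stereo PCM: 44,100 samples/s, 2 bytes per sample, 2 channels.
bytes_per_second = 44_100 * 2 * 2                # 176,400 B/s
mib_per_minute = bytes_per_second * 60 / 2**20   # roughly 10 MiB per minute

total_minutes = 52 + 31 / 60                     # 52:31 of unique music
total_mib = total_minutes * mib_per_minute
print(f"{mib_per_minute:.2f} MiB/min, total = {total_mib:.0f} MiB")
```

At roughly 10 MiB per minute, 52:31 of music comes to around 530 MiB, consistent with the 531 MB quoted above.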

Some participants anecdotally reported that the generated music seemed more responsive to the gameplay narrative, and therefore added an extra dimension to the gameplay (in other words, they felt “in control” of the music generation system). This suggests that this type of system might have possible applications as a composition engine for pedagogic, or even therapeutic applications beyond the world of entertainment.

Beyond our existing caveats regarding the limited timbral variation this pilot can generate, and the relatively small number of participants involved in this specific evaluation, we also acknowledge that a future test paradigm might usefully explore deliberately incongruous music generation (for example, playing happy music over scenes of low valence). This type of testing would have implications for the number of trial iterations required and the total trial time required of participants (at present, the maximum test time was approximately 30 minutes, beyond which listener fatigue might become a factor). Our hypothesis is that emotional congruence, and likely immersion, would decrease in such a case, but this has not yet been tested and remains an avenue we intend to explore in further work. Significant useful further work also remains in training the generator with a larger range of input material, and in testing a fully real-time system so that pre-generated sequences need not be cross-faded as in the current paradigm (more complex timbres might create difficulties with simple cross-fades, by which the current piano timbre does not seem to be particularly disrupted).



Using an AAC system to create music according to emotional meta-tagging as part of a video game narrative has clear practical benefits in terms of composer time and hardware/software requirements (file storage and data transmission rates), which could free up processing and storage capacity for other game elements (e.g., visual processing). This type of system might therefore also benefit mobile gaming platforms (smartphones, etc.), where space is at more of a premium than in desktop or home gaming environments.

Within the constraints of the test paradigm, this pilot study suggests that emotional congruence improved when participants played with a soundtrack generated by the affectively driven algorithmic composition system. However, player immersion was consistently and significantly reduced at the same time. The instrumentation and orchestration of the generated music offer a possible explanation, but further work would be required to establish the reason for this reported reduction in player immersion before tackling the problem of developing an AAC system with fuller orchestration, which is itself non-trivial. These algorithmic composition techniques are still in their infancy, and the likelihood of replacing a human composer in the successful creation of complex, affectively charged musical arrangements is minimal. In fact, since the system presented here (and others like it) requires training with musical input, this evaluation suggests that in the future composers working explicitly on video game soundtracking might use this type of system to generate large pools of material from specific themes, thereby freeing up time to spend on the creative part of the composition process.

Participant agreement with the affective meta-tagging used to select musical features as part of the generative system was good, though significant inter-participant variability suggests either that the musical feature set needs further calibration (which would require specific affective experiments), or that a generalized set of affective correlates as musical features is not yet possible. Another solution might be to calibrate this type of generative music system to the individual, using the mapping of musical features documented here to target specific emotional responses in the generated soundtrack. In the future, this could be driven by biosensors such as the electroencephalogram (as in the emerging field of brain-computer music interfacing), or by more traditional biosensors such as heart-rate monitors or galvanic skin response.


The authors gratefully acknowledge the support of EPSRC grants EP/J003077/1 and EP/J002135/1.



C. Ames. 1989. The Markov process as a compositional model: a survey and tutorial. Leonardo (1989), 175–187.

D. Arsenault. 2005. Dark waters: Spotlight on immersion. In Game-On North America 2005 Conference Proceedings. 50–52.

W. Aubé, A. Angulo-Perkins, I. Peretz, L. Concha, and J.L. Armony. 2014. Fear across the senses: brain responses to music, vocalizations and facial expressions. Soc. Cogn. Affect. Neurosci. (2014), nsu067.

J. Bensa, D. Dubois, R. Kronland-Martinet, and S. Ystad. 2005. Perceptive and cognitive evaluation of a piano synthesis model. In Computer music modeling and retrieval. Springer, 232–245.

M.M. Bradley and P.J. Lang. 1994. Measuring emotion: the self-assessment manikin and the semantic differential. J. Behav. Ther. Exp. Psychiatry 25, 1 (1994), 49–59.

A. Camurri, G. Volpe, G. De Poli, and M. Leman. 2005. Communicating expressiveness and affect in multimodal interactive systems. Multimed. IEEE 12, 1 (2005), 43–53.

K. Collins. 2007. An introduction to the participatory and non-linear aspects of video games audio. In Essays on Sound and Vision, Stan Hawkins and John Richardson (Eds.). Helsinki University Press, Helsinki, 263–298.

M. Csikszentmihalyi. 1975. Play and intrinsic rewards. J. Humanist. Psychol. (1975).

I. Daly et al. 2014. Brain-computer music interfacing for continuous control of musical tempo. (2014).

I. Daly et al. 2015. Towards human-computer music interaction: Evaluation of an affectively-driven music generator via galvanic skin response measures. In Computer Science and Electronic Engineering Conference (CEEC), 2015 7th. IEEE, 87–92.

Y. Douglas and A. Hargadon. 2000. The pleasure principle: immersion, engagement, flow. In Proceedings of the eleventh ACM on Hypertext and hypermedia. ACM, 153–160.

M. Edwards. 2011. Algorithmic composition: computational thinking in music. Commun. ACM 54, 7 (2011), 58–67.

T. Eerola and J.K. Vuoskoski. 2010. A comparison of the discrete and dimensional models of emotion in music. Psychol. Music 39, 1 (August 2010), 18–49.

A. Gabrielsson. 2001. Emotion perceived and emotion felt: Same or different? Music. Sci. Spec Issue, 2001-2002 (2001), 123–147.

M. Grimshaw, C.A. Lindley, and L. Nacke. 2008. Sound and immersion in the first-person shooter: mixed measurement of the player’s sonic experience. In Proceedings of Audio Mostly Conference. 1–7.

K. Hevner. 1936. Experimental studies of the elements of expression in music. Am. J. Psychol. 48, 2 (1936), 246–268.

D.B. Huron. 2006. Sweet anticipation: Music and the psychology of expectation, MIT Press.

P. Janata. 1995. ERP measures assay the degree of expectancy violation of harmonic contexts in music. J. Cogn. Neurosci. 7, 2 (1995), 153–164.

K. Jørgensen. 2008. Left in the dark: playing computer games with the sound turned off, Ashgate.

P.N. Juslin and J.A. Sloboda. 2010. Handbook of music and emotion: theory, research, applications, Oxford University Press, Oxford.

A. Lamont and T. Eerola. 2011. Music and emotion: Themes and development. Music. Sci. 15, 2 (July 2011), 139–145.

S.D. Lipscomb and S.M. Zehnder. 2004. Immersion in the virtual environment: The effect of a musical score on the video gaming experience. J. Physiol. Anthropol. Appl. Human Sci. 23, 6 (2004), 337–343.

S.R. Livingstone, C. Palmer, and E. Schubert. 2012. Emotional response to musical repetition. Emotion 12, 3 (2012), 552–567.

R.A. Mar, K. Oatley, M. Djikic, and J. Mullin. 2011. Emotion and narrative fiction: Interactive influences before, during, and after reading. Cogn. Emot. 25, 5 (2011), 818–833.

A. Mattek. 2011. Emotional Communication in Computer Generated Music: Experimenting with Affective Algorithms. In Proceedings of the 26th Annual Conference of the Society for Electro-Acoustic Music in the United States. Miami, Florida: University of Miami Frost School of Music.

E.R. Miranda, W.L. Magee, J.J. Wilson, J. Eaton, and R. Palaniappan. 2011. Brain-computer music interfacing (BCMI) from basic research to the real world of special needs. Music Med. 3, 3 (2011), 134–140.

A. Imran Nordin, J. Ali, A. Animashaun, J. Asch, J. Adams, and P. Cairns. 2013. Attention, time perception and immersion in games. In CHI’13 Extended Abstracts on Human Factors in Computing Systems. ACM, 1089–1094.

G. Papadopoulos and G. Wiggins. 1999. AI methods for algorithmic composition: A survey, a critical view and future prospects. In AISB Symposium on Musical Creativity. 110–117.

M. Pichlmair and F. Kayali. 2007. Levels of sound: On the principles of interactivity in music video games. In Proceedings of the Digital Games Research Association 2007 Conference: Situated Play. Citeseer.

H. Qin, P.-L P. Rau, and G. Salvendy. 2009. Measuring player immersion in the computer game narrative. Intl J. Human–Computer Interact. 25, 2 (2009), 107–133.

B. Rosner. 2010. Fundamentals of biostatistics, Cengage Learning.

J.A. Russell. 1980. A circumplex model of affect. J. Pers. Soc. Psychol. 39, 6 (1980), 1161.

J.A. Russell and L.F. Barrett. 1999. Core affect, prototypical emotional episodes, and other things called emotion: Dissecting the elephant. J. Pers. Soc. Psychol. 76, 5 (1999), 805.

T. Sanders and P. Cairns. 2010. Time perception, immersion and music in videogames. In Proceedings of the 24th BCS interaction specialist group conference. British Computer Society, 160–167.

K.R. Scherer. 2004. Which emotions can be induced by music? What are the underlying mechanisms? And how can we measure them? J. New Music Res. 33, 3 (September 2004), 239–251.

E. Schubert. 2004. Emotionface: prototype facial expression display of emotion in music. In Proc. Int. Conf. On Auditory Displays (ICAD).

E. Schubert. 1999. Measuring emotion continuously: Validity and reliability of the two-dimensional emotion-space. Aust. J. Psychol. 51, 3 (December 1999), 154–165.

D. Sridharan, D.J. Levitin, C.H. Chafe, J. Berger, and V. Menon. 2007. Neural dynamics of event segmentation in music: converging evidence for dissociable ventral and dorsal networks. Neuron 55, 3 (2007), 521–532.

W. Strank. 2013. The Legacy of iMuse: Interactive Video Game Music in the 1990s. Music Game (2013), 81–91.

S. Tobin, N. Bisson, and S. Grondin. 2010. An ecological approach to prospective and retrospective timing of long durations: A study involving gamers. PloS One 5, 2 (2010), e9271.

D. Västfjäll. 2001. Emotion induction through music: A review of the musical mood induction procedure. Music. Sci. Spec Issue, 2001-2002 (2001), 173–211.

Y. Visell. 2004. Spontaneous organisation, pattern models, and music. Organised Sound 9, 2 (August 2004).

J.K. Vuoskoski and T. Eerola. 2011. Measuring music-induced emotion: A comparison of emotion models, personality biases, and intensity of experiences. Music. Sci. 15, 2 (July 2011), 159–173.

C. Warren. 2003. LucasArts and the Design of Successful Adventure Games: The True Secret of Monkey Island. (2003).

D. Weibel and B. Wissmath. 2011. Immersion in computer games: The role of spatial presence and flow. Int. J. Comput. Games Technol. 2011 (2011), 6.

D. Williams, A. Kirke, E.R. Miranda, E. Roesch, I. Daly, and S. Nasuto. 2014. Investigating affect in algorithmic composition systems. Psychol. Music (August 2014).

D. Williams, A. Kirke, E.R. Miranda, et al. 2015. Dynamic game soundtrack generation in response to a continuously varying emotional trajectory. In Proceedings of the 56th Audio Engineering Society Conference. Queen Mary, University of London: Audio Engineering Society.

D. Williams, A. Kirke, J. Eaton, et al. 2015. Dynamic Game Soundtrack Generation in Response to a Continuously Varying Emotional Trajectory. In Audio Engineering Society Conference: 56th International Conference: Audio for Games. Audio Engineering Society.

D. Williams, A. Kirke, E. Miranda, et al. 2015. Investigating perceived emotional correlates of rhythmic density in algorithmic music composition. ACM Trans. Appl. Percept. 12, 3 (June 2015), 8:1–8:21.

D. Williams, A. Kirke, E.R. Miranda, E.B. Roesch, and S.J. Nasuto. 2013. Towards Affective Algorithmic Composition. In Proceedings of the 3rd International Conference on Music & Emotion (ICME3), Jyväskylä, Finland, 11th-15th June 2013. Geoff Luck & Olivier Brabant (Eds.). ISBN 978-951-39-5250-1. University of Jyväskylä, Department of Music.

M. Zentner, D. Grandjean, and K.R. Scherer. 2008. Emotions evoked by the sound of music: Characterization, classification, and measurement. Emotion 8, 4 (2008), 494–521.


Duncan Williams, Jamie Mears, Alexis Kirke, and Eduardo Miranda, Plymouth University

Ian Daly, Asad Malik, James Weaver, Faustina Hwang, and Slawomir Nasuto, University of Reading


Permission to make digital or hardcopies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credits permitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this work in other works requires prior specific permission and/or a fee. Permissions may be requested from Publications Dept., ACM, Inc., 2 Penn Plaza, Suite 701, New York, NY 10121-0701 USA, fax +1 (212) 869-0481, or [email protected]




Copyright © 2019. All Rights Reserved