A great deal of my research relates to the fact that human face-to-face interaction is inherently multimodal; when we engage in social interaction we make use of talk, gesture, gaze, body posture and physical artefacts, and through the interplay of such different resources we display our understanding of the activity at hand. And our display of understanding serves as a framework on the basis of which our co-participants can produce meaningful and contextually shaped next-actions.
Since the 1960s, Conversation Analysis has described a range of social practices (such as taking turns-at-talk in an orderly fashion, repairing trouble of various sorts, accepting/rejecting invitations, initiating/closing phone calls etc.) that members of a given speech community rely on in the online construction of meaning in social interaction. This has primarily been investigated in relation to talk. Although social interaction is unquestionably based on visual aspects (gesture, gaze etc.) as well, it is still largely undocumented exactly how different semiotic fields play together. This inevitably results in a range of methodological challenges. For instance, whereas talk is largely produced in a linear fashion, i.e. a speaker can only produce one (emergent) morphosyntactic/lexical element at a time, visual modalities are under no such constraints. While talking, a speaker can produce a range of different gestures, incorporate various tools and torque the torso away from the recipient while maintaining gaze towards them. How do we describe the production of social practices through sequentially and simultaneously produced resources?