Visualizing chunking and collocational networks: a graphical representation of word networks

22 Jul 2013 13:00
Lancaster, UK

The notions of chunking and collocational networks are central to linguistics (e.g. Bybee 2010; McEnery 2006, 2012). Chunking has been described as follows: “When two or more words are often used together, they […] develop a sequential relation, [… known] as ‘chunking’ […]. The strength of the sequential relations is determined by the frequency with which the two words appear together. […] The frequency with which sequences of units are used has an impact on their phonetic, morphosyntactic and semantic properties” (Bybee 2010, 25). The notion of collocation has been defined in many ways (McEnery and Hardie 2012, 122-123); in general, “[…] the term collocation denotes the idea that important aspects of the meaning of a word (or another linguistic unit) are not contained within the word itself, considered in isolation, but rather subsist in the characteristic associations that the word participates in, alongside other words or structures with which it frequently co-occurs” (ibid.).

When dealing with corpus data, analysing these features can prove difficult, as the results depict a dense network of relations between words. Although visual representations of such networks have been attempted (see e.g. McEnery 2006), these methods faced limitations imposed by the medium (printed paper) and by the quantity of data; furthermore, they tend to be more descriptive than analytical. What I propose is a method for visualizing these networks that not only allows for a descriptive graphical representation, but can also be used to conduct a more detailed analysis of the data. To do so, I use the open source software Gephi, a tool for the visualization and exploration of networks of any kind. I demonstrate how the results drawn from corpus analysis can be “converted” and used as input for the software in order to obtain 2D or 3D graphs of the relations between a (virtually unlimited) number of words.
Furthermore, the resulting graph is interactive: specific words and their relations can be quickly highlighted and isolated, and the data can be filtered on the basis of specific statistical parameters. Any corpus data can be converted into a compatible Gephi format. As a case study showing the basic steps of the method, I present the results of the analysis I carried out for my PhD thesis on Italian taboo language and taboo language constructions. The data was first retrieved and analysed through the web interface SketchEngine, and the results (i.e. the saliency scores provided by SketchEngine, based on logDice: see Kilgarriff 2012) were then converted into a compatible Gephi format through an ad-hoc script and loaded into the software. This allowed me to display networks with hundreds of “nodes” (each word counts as a node in the network) and then to visually analyse the presence of chunking and/or (bidirectional) collocations by means of the different tools and filters available in the software (fig. 1-2).
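As an illustrative sketch of what such a conversion script might look like (the words, scores, and file name below are hypothetical examples, not the actual thesis data), collocation pairs and their saliency scores can be written out as a simple CSV edge list, one of the formats Gephi can import, with the score serving as the edge weight:

```python
import csv

# Hypothetical collocation results: (node word, collocate, saliency score).
# In practice these would come from the corpus tool's exported results.
collocations = [
    ("cazzo", "fare", 9.2),
    ("cazzo", "dire", 8.7),
    ("fare", "niente", 7.1),
]

# Gephi reads CSV edge lists with Source, Target and Weight columns;
# each distinct word becomes a node, each pair an edge.
with open("edges.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["Source", "Target", "Weight"])
    for word, collocate, score in collocations:
        writer.writerow([word, collocate, score])
```

The resulting edges.csv can then be loaded through Gephi's import dialog; node attributes (e.g. frequency) could be supplied in a parallel nodes file in the same fashion.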

Figure 1: Visual rendition of collocational networks

Figure 2: Details of the collocations of “cazzo” (lit. dick) in pre-verbal position.

Among the different tools, it is possible to set a threshold so that only words with a minimum or maximum value (in my case, the saliency score) are shown; alternatively, a word can be isolated in order to show only the relations to or from that specific word. Furthermore, the nodes in the network are distributed on the basis of chosen parameters, so that the grouping of the nodes can reveal specific features of the raw data. This set of features has proven useful for determining the role of a word in the network and its relations with the rest of the words.

For my research I analysed the “status” of a word used in the corpus through its visual representation. I propose in my thesis that a distinction between euphemisms and dysphemisms (i.e. the contrary of euphemisms) can be drawn on the basis of the relation of the analysed word with the words alongside which it is used. The method I propose has allowed me not only to identify sequential relations between words (see the definition of chunking provided at the beginning) and collocational networks, but also to conduct a series of more detailed statistical analyses on words which do not “stand out” among the data yet show behaviour relevant to my research, such as groups of words so strictly related to each other that they visually appear as a single node. By looking at the raw data for the words in these groups, it was possible to establish that they appear together because they share the same meaning when used with the word they all relate to. Features such as this would require a longer and more elaborate process to spot if the data were displayed through more “traditional” methods of representation (e.g. keyword lists).
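The threshold filter described above can also be mimicked on the raw data before import, as in this minimal sketch (the pairs, scores, and cut-off value are illustrative assumptions, not figures from the study):

```python
# Illustrative collocation pairs: (word, collocate, saliency score).
edges = [
    ("cazzo", "fare", 9.2),
    ("cazzo", "dire", 8.7),
    ("fare", "niente", 7.1),
]

# Keep only pairs whose saliency score meets a minimum value,
# mirroring Gephi's edge-weight filter.
threshold = 8.0
filtered = [(w1, w2, s) for (w1, w2, s) in edges if s >= threshold]
```

Pre-filtering in this way keeps very large networks manageable, while Gephi's interactive filters remain available for exploring the retained edges.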
As the method I have outlined brings together large sets of data and triangulates the details of each word with those of all the other words, it is possible to analyse large quantities of data just by looking at the visual rendition.

Bybee, J. 2010. Language, Usage and Cognition. CUP.
Kilgarriff, A. 2012. Statistics used in the Sketch Engine, available online at
McEnery, T. 2006. Swearing in English. Bad language, purity and power from 1586 to the present. Routledge.
McEnery, T., Hardie, A. 2012. Corpus Linguistics: Method, Theory and Practice. CUP.

Matteo Di Cristofaro
Researcher, Lecturer

My research interests include language analysis, cognitive sciences, and Artificial Intelligence.