Compound tokens

require(quanteda)
require(quanteda.textstats)
options(width = 110)
toks <- tokens(data_char_ukimmig2010)

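Multi-word expressions such as "asylum seekers" and "British citizens" are often important in social science research. You can locate them in tokens by passing phrase() patterns to kwic().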

kw_multiword <- kwic(toks, pattern = phrase(c("asylum seeker*", "british citizen*")))
head(kw_multiword, 10)
## Keyword-in-context with 10 matches.                                                                                                             
##       [BNP, 1724:1725]        the honour and benefit of | British citizenship | has gone to people who       
##       [BNP, 1958:1959] all illegal immigrants and bogus |   asylum seekers    | , including their dependents.
##       [BNP, 2159:2160]            region concerned. An' |    asylum seeker    | ' who has crossed dozens     
##       [BNP, 2192:2193]          country. Because every' |    asylum seeker    | ' in Britain has crossed     
##       [BNP, 2218:2219]     there are currently no legal |   asylum seekers    | in Britain today. It         
##       [BNP, 2265:2266]  of illegal immigrants and bogus |   asylum seekers    | , that there are no          
##       [BNP, 2296:2297]  benefits system for these bogus |   asylum seekers    | is removed, the flood        
##  [Conservative, 68:69]          could be carried out by |  British citizens   | , given the right training   
##        [Greens, 77:78]      immigration: over 5 million |  British Citizens   | benefit from other countries'
##      [Labour, 337:338]        economy and the values of | British citizenship | , and step up our

To preserve these expressions in a bag-of-words analysis, you have to compound them using tokens_compound(), which joins the matched tokens into a single token (with "_" by default).

toks_comp <- tokens_compound(toks, pattern = phrase(c("asylum seeker*", "british citizen*")))
kw_comp <- kwic(toks_comp, pattern = c("asylum_seeker*", "british_citizen*"))
head(kw_comp, 10)
## Keyword-in-context with 10 matches.                                                                                                          
##         [BNP, 1724]        the honour and benefit of | British_citizenship | has gone to people who       
##         [BNP, 1957] all illegal immigrants and bogus |   asylum_seekers    | , including their dependents.
##         [BNP, 2157]            region concerned. An' |    asylum_seeker    | ' who has crossed dozens     
##         [BNP, 2189]          country. Because every' |    asylum_seeker    | ' in Britain has crossed     
##         [BNP, 2214]     there are currently no legal |   asylum_seekers    | in Britain today. It         
##         [BNP, 2260]  of illegal immigrants and bogus |   asylum_seekers    | , that there are no          
##         [BNP, 2290]  benefits system for these bogus |   asylum_seekers    | is removed, the flood        
##  [Conservative, 68]          could be carried out by |  British_citizens   | , given the right training   
##        [Greens, 77]      immigration: over 5 million |  British_Citizens   | benefit from other countries'
##       [Labour, 337]        economy and the values of | British_citizenship | , and step up our
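
To see why compounding matters for a bag-of-words analysis, here is a minimal sketch on a toy sentence. The text and the object names (txt, toks_toy, toks_toy_comp, dfmat_toy) are invented for illustration and are not part of the corpus used above. After compounding, each expression is counted as one feature in the document-feature matrix instead of being split into its component words.

```r
library(quanteda)

# Toy text invented for this illustration (an assumption, not corpus data)
txt <- c(d1 = "Asylum seekers and British citizens met asylum seekers.")
toks_toy <- tokens(txt, remove_punct = TRUE)

# Join the matched two-word sequences into single tokens;
# pattern matching is case-insensitive by default
toks_toy_comp <- tokens_compound(toks_toy,
                                 pattern = phrase(c("asylum seeker*", "british citizen*")))

# dfm() lowercases by default, so both "Asylum_seekers" and "asylum_seekers"
# fall under the same feature
dfmat_toy <- dfm(toks_toy_comp)
featnames(dfmat_toy)
```

Without the tokens_compound() step, "asylum" and "seekers" would appear as separate features and the expression itself would be lost.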

You can discover multi-word expressions in your tokens using textstat_collocations(). See Compounding multi-word expressions to learn how to do it.
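
As a rough sketch of that workflow, the code below scores two-word sequences in the same corpus and compounds the stronger candidates. The min_count value and the z > 3 cutoff are arbitrary assumptions chosen for illustration, and the object names (toks_nopunct, tstat_col, toks_col) are hypothetical; tune the thresholds to your own data.

```r
library(quanteda)
library(quanteda.textstats)

# Tokenize the corpus; removing punctuation here is an illustrative choice
toks_nopunct <- tokens(data_char_ukimmig2010, remove_punct = TRUE)

# Score two-word sequences; min_count = 5 is an assumed threshold
tstat_col <- textstat_collocations(toks_nopunct, size = 2, min_count = 5)
head(tstat_col)

# Compound only the sequences with a high z score (z > 3 is an assumed cutoff)
toks_col <- tokens_compound(toks_nopunct, pattern = tstat_col[tstat_col$z > 3, ])
```

The result of textstat_collocations() is a data frame, so it can be filtered with ordinary subsetting before being passed as the pattern.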