R commands

You do not need advanced knowledge of the R programming language to perform text analysis with quanteda, because the package provides a wide range of functions. However, you still need to understand a number of basic R commands.

Basic R objects and commands

This section covers three basic types of R objects: vectors, data frames and matrices. Since many quanteda objects behave similarly to these objects, it is essential to understand how to interact with them.

Vectors

R is a language for statistical analysis, and its most basic objects are vectors. A vector contains a set of values of the same type. In the examples below, num_vec is a numeric vector, while char_vec is a character vector. We use c() to combine elements into a vector and <- to assign the vector to a variable.

num_vec <- c(1, 5, 6, 3)
print(num_vec)
## [1] 1 5 6 3
char_vec <- c('apple', 'banana', 'mandarin', 'melon')
print(char_vec)
## [1] "apple"    "banana"   "mandarin" "melon"

Once a vector is created, you can extract its elements by passing the index numbers of the desired elements to the [] operator. Index numbers start at 1.

print(num_vec[1])
## [1] 1
print(num_vec[1:2])
## [1] 1 5
print(char_vec[c(1, 3)])
## [1] "apple"    "mandarin"
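
Two further indexing patterns are often useful. The lines below are a small sketch using base R only; negative indices and length() do not appear in the examples above and are shown here as an addition.

```r
num_vec <- c(1, 5, 6, 3)

# A negative index drops the element at that position
print(num_vec[-1])
## [1] 5 6 3

# length() returns the number of elements, so this selects the last one
print(num_vec[length(num_vec)])
## [1] 3
```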

You can apply arithmetical operations such as addition, subtraction, multiplication or division to numeric vectors. If only a single value is given for multiplication, for example, each element of the vector is multiplied by that value.

num_vec2 <- num_vec * 2
print(num_vec2)
## [1]  2 10 12  6
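
Arithmetic also works between two vectors of the same length, in which case elements at the same positions are combined. A minimal sketch, reusing the vectors defined above:

```r
num_vec <- c(1, 5, 6, 3)
num_vec2 <- num_vec * 2

# Element-wise addition: 1 + 2, 5 + 10, 6 + 12, 3 + 6
print(num_vec + num_vec2)
## [1]  3 15 18  9
```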

You can also compare the elements of a vector with a value using relational operators such as ==, >=, >, <= and <. The result of these operations is a logical vector whose elements are either TRUE or FALSE.

logi_gt5_vec <- num_vec >= 5
print(logi_gt5_vec)
## [1] FALSE  TRUE  TRUE FALSE
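
A logical vector can itself be placed inside [] to keep only the elements for which it is TRUE. The sketch below shows this, along with sum(), which counts the TRUE values because TRUE is treated as 1 and FALSE as 0.

```r
num_vec <- c(1, 5, 6, 3)
logi_gt5_vec <- num_vec >= 5

# Keep only the elements where the condition holds
print(num_vec[logi_gt5_vec])
## [1] 5 6

# Count how many elements satisfy the condition
print(sum(logi_gt5_vec))
## [1] 2
```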

You cannot apply arithmetical operations to character vectors, but you can apply the equality operator.

logi_apple_vec <- char_vec == 'apple'
print(logi_apple_vec)
## [1]  TRUE FALSE FALSE FALSE
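
To test against several values at once, base R offers the %in% operator. It is not used elsewhere in this section, so treat the following as a supplementary sketch.

```r
char_vec <- c('apple', 'banana', 'mandarin', 'melon')

# TRUE where the element matches any of the listed values
logi_am_vec <- char_vec %in% c('apple', 'melon')
print(logi_am_vec)
## [1]  TRUE FALSE FALSE  TRUE
```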

You can also concatenate elements of character vectors using paste(). Since the two vectors in the example have the same length, elements at the same positions of the vectors are concatenated.

char_vec2 <- paste(c('red', 'yellow', 'orange', 'green'), char_vec)
print(char_vec2)
## [1] "red apple"       "yellow banana"   "orange mandarin" "green melon"
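
When one of the arguments is a single value, paste() reuses it for every element, much like multiplication by a single number. The sketch below also shows paste0(), a base R variant that joins strings without a separating space.

```r
char_vec <- c('apple', 'banana', 'mandarin', 'melon')

# The single value 'fresh' is combined with every element
print(paste('fresh', char_vec))
## [1] "fresh apple"    "fresh banana"   "fresh mandarin" "fresh melon"

# paste0() concatenates without inserting a space
print(paste0(char_vec, 's'))
## [1] "apples"    "bananas"   "mandarins" "melons"
```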

Finally, you can assign names to the elements of a numeric vector using names().

names(num_vec) <- char_vec
print(num_vec)
##    apple   banana mandarin    melon 
##        1        5        6        3
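
Once names are assigned, they can be used in place of index numbers, and names() without an assignment returns them. A brief sketch:

```r
num_vec <- c(1, 5, 6, 3)
names(num_vec) <- c('apple', 'banana', 'mandarin', 'melon')

# Extract an element by its name instead of its position
print(num_vec['banana'])
## banana 
##      5

# Retrieve the names as a character vector
print(names(num_vec))
## [1] "apple"    "banana"   "mandarin" "melon"
```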

Data frames

A data frame combines multiple vectors to construct a dataset. You can combine vectors into a data frame only if they have the same length, although they can be of different types. nrow() and ncol() return the number of rows (observations) and columns (variables) in a data frame.

fruit_df <- data.frame(name = char_vec, count = num_vec)
print(fruit_df)
##              name count
## apple       apple     1
## banana     banana     5
## mandarin mandarin     6
## melon       melon     3
print(nrow(fruit_df))
## [1] 4
print(ncol(fruit_df))
## [1] 2
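
Each column of a data frame is a vector, and the $ operator extracts it by name. The $ operator is not used in the examples above; the sketch below rebuilds the data frame from unnamed vectors so that it runs on its own.

```r
fruit_df <- data.frame(name = c('apple', 'banana', 'mandarin', 'melon'),
                       count = c(1, 5, 6, 3))

# Extract the count column as a numeric vector
print(fruit_df$count)
## [1] 1 5 6 3

# The extracted column behaves like any other vector
print(sum(fruit_df$count))
## [1] 15
```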

You can use subset() to select records in the data frame.

fruit_df2 <- subset(fruit_df, count >= 5)
print(fruit_df2)
##              name count
## banana     banana     5
## mandarin mandarin     6
print(nrow(fruit_df2))
## [1] 2
print(ncol(fruit_df2))
## [1] 2
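
The same selection can be written with the [] operator and a logical vector, which mirrors how vectors are indexed. This is an alternative sketch rather than a replacement for subset(); fruit_df3 is a name chosen here for illustration.

```r
fruit_df <- data.frame(name = c('apple', 'banana', 'mandarin', 'melon'),
                       count = c(1, 5, 6, 3))

# Rows go before the comma; leaving the column position empty keeps all columns
fruit_df3 <- fruit_df[fruit_df$count >= 5, ]
print(nrow(fruit_df3))
## [1] 2
print(fruit_df3$count)
## [1] 5 6
```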

We use print() to show the values and structures of objects in the examples, but you do not need to call print() in the console, because R prints the value automatically when you enter an expression without assigning its result to a variable.

Matrices

Similar to a data frame, a matrix contains two-dimensional data arranged in rows and columns. In contrast to a data frame, its values must all be of the same type.

mat <- matrix(c(1, 3, 6, 8, 3, 5, 2, 7), nrow = 2)
print(mat)
##      [,1] [,2] [,3] [,4]
## [1,]    1    6    3    2
## [2,]    3    8    5    7
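
As the output shows, matrix() fills the values column by column. If the values are listed row by row instead, the byrow argument changes the fill order. A short sketch, with mat_byrow as an illustrative name:

```r
# Same values as above, but filled along rows instead of columns
mat_byrow <- matrix(c(1, 3, 6, 8, 3, 5, 2, 7), nrow = 2, byrow = TRUE)
print(mat_byrow)
##      [,1] [,2] [,3] [,4]
## [1,]    1    3    6    8
## [2,]    3    5    2    7
```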

You can use colnames() and rownames() to set or retrieve the names of the columns and rows of a matrix.

colnames(mat) <- char_vec
print(mat)
##      apple banana mandarin melon
## [1,]     1      6        3     2
## [2,]     3      8        5     7
rownames(mat) <- c('bag1', 'bag2')
print(mat)
##      apple banana mandarin melon
## bag1     1      6        3     2
## bag2     3      8        5     7

You can obtain the size of a matrix with dim(), which returns a two-element vector containing the number of rows and the number of columns.

print(dim(mat))
## [1] 2 4
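
nrow() and ncol(), introduced for data frames, work on matrices as well and return the two values of dim() separately. A minimal sketch:

```r
mat <- matrix(c(1, 3, 6, 8, 3, 5, 2, 7), nrow = 2)

print(nrow(mat))
## [1] 2
print(ncol(mat))
## [1] 4

# length() counts all cells, i.e. rows times columns
print(length(mat))
## [1] 8
```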

If a matrix has column and row names, you can extract rows or columns by their names.

print(mat['bag1', ])
##    apple   banana mandarin    melon 
##        1        6        3        2
print(mat[, 'banana'])
## bag1 bag2 
##    6    8
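
Giving both a row and a column selects individual cells, and index numbers work as well as names. The following sketch rebuilds the matrix so that it runs on its own.

```r
mat <- matrix(c(1, 3, 6, 8, 3, 5, 2, 7), nrow = 2)
colnames(mat) <- c('apple', 'banana', 'mandarin', 'melon')
rownames(mat) <- c('bag1', 'bag2')

# A single cell, selected by position and then by name
print(mat[2, 3])
## [1] 5
print(mat['bag2', 'mandarin'])
## [1] 5

# Several columns at once return a smaller matrix
print(mat[, c('apple', 'melon')])
##      apple melon
## bag1     1     2
## bag2     3     7
```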

Finally, you can obtain the marginals (row and column totals) of a matrix with rowSums() and colSums().

print(rowSums(mat))
## bag1 bag2 
##   12   23
print(colSums(mat))
##    apple   banana mandarin    melon 
##        4       14        8        9
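
As a quick sanity check, the row totals and the column totals must both add up to the sum of all cells. The sketch below verifies this for the example matrix using sum().

```r
mat <- matrix(c(1, 3, 6, 8, 3, 5, 2, 7), nrow = 2)

# Grand total of all cells
print(sum(mat))
## [1] 35

# Both sets of marginals add up to the grand total
print(sum(rowSums(mat)) == sum(colSums(mat)))
## [1] TRUE
```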

If you want to know more about an R command, prepend ? to the command and execute it. For example, ?subset() opens the help page that explains how to use the subset function with different types of objects.