Base R provides the basic comparison operators (e.g., >, <, ==) for data manipulation tasks such as filtering or selecting data points, columns, or rows based on a value. Often, however, you need to filter a data set based on a partial character string, which is beyond the scope of the comparison operators.

For this, base R offers functions like “grep” and “grepl” that match character patterns in a specified vector. While both of these functions find patterns, they return different output types. Specifically, grep returns the integer indices of the elements that match the pattern, and grepl returns a logical vector, the same length as the input, in which “TRUE” represents a pattern match.
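
The relationship between the two return types can be sketched as follows. The vector fruits is a hypothetical example, not part of the data used later in this post.

```r
# Hypothetical sample vector used only for illustration
fruits <- c("apple", "banana", "grape", "cherry")

idx <- grep("ap", fruits)    # integer positions of the matching elements
hit <- grepl("ap", fruits)   # one TRUE/FALSE per element of the input

print(idx)
print(hit)

# The positions from grep are the TRUE positions of grepl
print(identical(idx, which(hit)))

# A logical vector is convenient for counting matches
print(sum(hit))
```

Because grepl always returns one value per input element, it tends to be the safer choice for logical subsetting, while grep is handy when the positions themselves are needed.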

Both functions describe patterns with regular expressions. For example, to look for strings that start with ‘abc’ we can specify ‘^abc’, and to find strings that end with ‘abc’ we would specify ‘abc$’. We can also specify a group of values, such as ‘ab[0-9]’, which matches the string ‘ab’ followed by any digit, or ‘[A-E]9’, which looks for an uppercase A, B, C, D, or E followed by 9. And we can specify that a sequence repeats, such as ‘(xy)+’, which looks for one or more consecutive ‘xy’ sequences. Note that ‘[xy]*’ means something different: zero or more characters that are each x or y, which matches any string, including an empty one.
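
As a minimal sketch of anchors, character classes, and repetition, the snippet below runs several patterns against a small hypothetical vector (samples is made up for this illustration). It also contrasts ‘(xy)+’ with ‘[xy]*’, which are easy to confuse.

```r
# Hypothetical sample vector used only for illustration
samples <- c("abc1", "xabc", "ab7", "C9", "xyxy", "hello")

starts_abc   <- grepl("^abc", samples)     # begins with "abc"
ends_abc     <- grepl("abc$", samples)     # ends with "abc"
ab_digit     <- grepl("ab[0-9]", samples)  # "ab" followed by any digit
upper_nine   <- grepl("[A-E]9", samples)   # uppercase A to E followed by 9
repeated_xy  <- grepl("(xy)+", samples)    # one or more "xy" sequences
any_xy_chars <- grepl("[xy]*", samples)    # zero or more x/y characters

print(samples[starts_abc])
print(samples[ends_abc])
print(samples[ab_digit])
print(samples[upper_nine])
print(samples[repeated_xy])

# "[xy]*" can match the empty string, so every element matches
print(all(any_xy_chars))
```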

grep and grepl usage (commonly used arguments):

grep(pattern, x, ignore.case = FALSE, value = FALSE, fixed=FALSE)

grepl(pattern, x, ignore.case = FALSE, fixed=FALSE)

where,

pattern: a character string containing a regular expression (or, with fixed = TRUE, a literal string) to be matched in the given character vector.

x: a character vector where matches are sought, or an object which can be coerced by as.character to a character vector. Long vectors are supported.

ignore.case: if FALSE, the pattern matching is case sensitive; if TRUE, case is ignored during matching.

value: (grep only) if FALSE, grep returns a vector containing the integer indices of the matching elements; if TRUE, it returns a vector containing the matching elements themselves.

fixed: logical. If TRUE, pattern is a string to be matched as is. Overrides all conflicting arguments.
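
The effect of these arguments can be sketched with a short example. The file names below are hypothetical and chosen only to show how fixed, ignore.case, and value change the result.

```r
# Hypothetical file names used only for illustration
files <- c("report.csv", "report_csv", "Notes.CSV")

# As a regular expression, "." matches any single character,
# so "_csv" matches as well as ".csv"
regex_hits <- grep(".csv", files)

# With fixed = TRUE the pattern is taken literally, so only a real
# ".csv" matches
literal_hits <- grep(".csv", files, fixed = TRUE)

# ignore.case = TRUE also finds "CSV"; value = TRUE returns the
# matching strings instead of their positions
any_case <- grep("csv", files, ignore.case = TRUE, value = TRUE)

print(regex_hits)
print(literal_hits)
print(any_case)
```

Setting fixed = TRUE is a reasonable default whenever the search text contains characters such as . or $ that should not be interpreted as regular expression syntax.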


Let’s see some examples:

strings <- c('abcd', 'dabc', 'abcabc', 'dABc','ab9', 'ab', 'abc8', 'ab1')

#Find data that starts with 'abc':
pattern <- '^abc'
print (grep(pattern, strings))
## [1] 1 3 7
print (grepl(pattern, strings))
## [1]  TRUE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE
#Find data that ends with 'abc', ignoring case:
pattern <- 'abc$'
print (grep(pattern, strings, ignore.case = TRUE))
## [1] 2 3 4
print (grepl(pattern, strings, ignore.case = TRUE))
## [1] FALSE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE
#Find data that contains 'ab' followed by any digit and return the strings:
pattern <- 'ab[0-9]'
print (grep(pattern, strings, value=TRUE))
## [1] "ab9" "ab1"
print (grepl(pattern, strings))
## [1] FALSE FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE
#Now let's consider a data set:
head(CO2)
##   Plant   Type  Treatment conc uptake
## 1   Qn1 Quebec nonchilled   95   16.0
## 2   Qn1 Quebec nonchilled  175   30.4
## 3   Qn1 Quebec nonchilled  250   34.8
## 4   Qn1 Quebec nonchilled  350   37.2
## 5   Qn1 Quebec nonchilled  500   35.3
## 6   Qn1 Quebec nonchilled  675   39.2
grep("non", CO2$Treatment)
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 43 44 45 46
## [26] 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63
grepl("non", CO2$Treatment)
##  [1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [13]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE
## [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [37] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [49]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [61]  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [73] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#Filtering with grep: 
#filter data set based on values in a column 
filter_for_value<-CO2[grep("non", CO2$Treatment), ]
head(filter_for_value)
##   Plant   Type  Treatment conc uptake
## 1   Qn1 Quebec nonchilled   95   16.0
## 2   Qn1 Quebec nonchilled  175   30.4
## 3   Qn1 Quebec nonchilled  250   34.8
## 4   Qn1 Quebec nonchilled  350   37.2
## 5   Qn1 Quebec nonchilled  500   35.3
## 6   Qn1 Quebec nonchilled  675   39.2
#filter data set based on values that do not match the specified pattern
filter_for_not_a_value<-CO2[-(grep("non", CO2$Treatment)),]
head(filter_for_not_a_value)
##    Plant   Type Treatment conc uptake
## 22   Qc1 Quebec   chilled   95   14.2
## 23   Qc1 Quebec   chilled  175   24.1
## 24   Qc1 Quebec   chilled  250   30.3
## 25   Qc1 Quebec   chilled  350   34.6
## 26   Qc1 Quebec   chilled  500   32.5
## 27   Qc1 Quebec   chilled  675   35.4
#Selecting columns with grep:
select_columns<-CO2[, grep("T", colnames(CO2))]
head(select_columns)
##     Type  Treatment
## 1 Quebec nonchilled
## 2 Quebec nonchilled
## 3 Quebec nonchilled
## 4 Quebec nonchilled
## 5 Quebec nonchilled
## 6 Quebec nonchilled
dont_select_columns<-CO2[, -(grep("T", colnames(CO2)))]
head(dont_select_columns)
##   Plant conc uptake
## 1   Qn1   95   16.0
## 2   Qn1  175   30.4
## 3   Qn1  250   34.8
## 4   Qn1  350   37.2
## 5   Qn1  500   35.3
## 6   Qn1  675   39.2
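
One caveat with negative indexing such as CO2[-(grep(...)), ] is worth sketching. When the pattern matches nothing, grep returns an empty integer vector, and negating it selects zero rows rather than all of them. The pattern "xyz" below is a hypothetical one chosen because it does not occur in CO2$Treatment. Two alternatives that avoid the problem are negating grepl and using grep's invert argument.

```r
# "xyz" does not occur in Treatment, so grep finds no positions
no_match <- grep("xyz", CO2$Treatment)
print(length(no_match))

# Negating an empty index drops every row instead of keeping them all
dropped_all <- CO2[-no_match, ]
print(nrow(dropped_all))

# Alternative 1: negate the logical vector from grepl
kept_grepl <- CO2[!grepl("xyz", CO2$Treatment), ]
print(nrow(kept_grepl))

# Alternative 2: let grep return the non-matching positions directly
kept_invert <- CO2[grep("xyz", CO2$Treatment, invert = TRUE), ]
print(nrow(kept_invert))

# When there are matches, both alternatives agree with each other
same_rows <- identical(grep("non", CO2$Treatment, invert = TRUE),
                       which(!grepl("non", CO2$Treatment)))
print(same_rows)
```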

Another useful feature of grep and grepl is that they work well with other R packages. For example, they can be used with dplyr:

library(dplyr)
CO2_dplyr<-as_tibble(CO2) #converting CO2 into a tibble

#dplyr filtering with grepl
filter_dplyr_for_value_non<-CO2_dplyr %>% filter(grepl("non", Treatment))
filter_dplyr_for_value_non
## # A tibble: 42 x 5
##    Plant Type   Treatment   conc uptake
##    <ord> <fct>  <fct>      <dbl>  <dbl>
##  1 Qn1   Quebec nonchilled    95   16  
##  2 Qn1   Quebec nonchilled   175   30.4
##  3 Qn1   Quebec nonchilled   250   34.8
##  4 Qn1   Quebec nonchilled   350   37.2
##  5 Qn1   Quebec nonchilled   500   35.3
##  6 Qn1   Quebec nonchilled   675   39.2
##  7 Qn1   Quebec nonchilled  1000   39.7
##  8 Qn2   Quebec nonchilled    95   13.6
##  9 Qn2   Quebec nonchilled   175   27.3
## 10 Qn2   Quebec nonchilled   250   37.1
## # ... with 32 more rows
filter_dplyr_for_not_a_value<-CO2_dplyr %>% filter(!(grepl("non", Treatment)))
filter_dplyr_for_not_a_value
## # A tibble: 42 x 5
##    Plant Type   Treatment  conc uptake
##    <ord> <fct>  <fct>     <dbl>  <dbl>
##  1 Qc1   Quebec chilled      95   14.2
##  2 Qc1   Quebec chilled     175   24.1
##  3 Qc1   Quebec chilled     250   30.3
##  4 Qc1   Quebec chilled     350   34.6
##  5 Qc1   Quebec chilled     500   32.5
##  6 Qc1   Quebec chilled     675   35.4
##  7 Qc1   Quebec chilled    1000   38.7
##  8 Qc2   Quebec chilled      95    9.3
##  9 Qc2   Quebec chilled     175   27.3
## 10 Qc2   Quebec chilled     250   35  
## # ... with 32 more rows
#dplyr selecting with grep
select_dplyr_columns<-CO2_dplyr %>% select(grep("T", colnames(CO2_dplyr)))
select_dplyr_columns
## # A tibble: 84 x 2
##    Type   Treatment 
##    <fct>  <fct>     
##  1 Quebec nonchilled
##  2 Quebec nonchilled
##  3 Quebec nonchilled
##  4 Quebec nonchilled
##  5 Quebec nonchilled
##  6 Quebec nonchilled
##  7 Quebec nonchilled
##  8 Quebec nonchilled
##  9 Quebec nonchilled
## 10 Quebec nonchilled
## # ... with 74 more rows
dont_select_dplyr_column<-CO2_dplyr %>% select(-grep("T", colnames(CO2_dplyr)))
dont_select_dplyr_column
## # A tibble: 84 x 3
##    Plant  conc uptake
##    <ord> <dbl>  <dbl>
##  1 Qn1      95   16  
##  2 Qn1     175   30.4
##  3 Qn1     250   34.8
##  4 Qn1     350   37.2
##  5 Qn1     500   35.3
##  6 Qn1     675   39.2
##  7 Qn1    1000   39.7
##  8 Qn2      95   13.6
##  9 Qn2     175   27.3
## 10 Qn2     250   37.1
## # ... with 74 more rows
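
As an alternative sketch for column selection, dplyr also re-exports the matches() selection helper, which takes a regular expression directly. This assumes a reasonably recent dplyr installation. The names CO2_tbl, with_T, without_T, and nonchilled are introduced here only for illustration. Note that matches() ignores case by default, so ignore.case = FALSE is needed to reproduce the case-sensitive grep("T", ...) calls above.

```r
library(dplyr)

CO2_tbl <- as_tibble(CO2)

# Select columns whose names contain an uppercase "T".
# matches() is case insensitive unless ignore.case = FALSE is given.
with_T <- CO2_tbl %>% select(matches("T", ignore.case = FALSE))

# Drop those same columns
without_T <- CO2_tbl %>% select(-matches("T", ignore.case = FALSE))

# Row filter with a literal (non-regex) match on Treatment
nonchilled <- CO2_tbl %>% filter(grepl("non", Treatment, fixed = TRUE))

print(names(with_T))
print(names(without_T))
print(nrow(nonchilled))
```

Without ignore.case = FALSE, matches("T") would also pick up Plant and uptake, because their names contain a lowercase t.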