Base R provides basic comparison operators (e.g., >, <, ==) for data manipulation tasks such as filtering or selecting data points, columns, or rows based on a value. Often, however, you need to filter a data set based on a partial character string, which is beyond the scope of comparison operators.
For this, base R provides functions such as grep and grepl, which match character patterns in a specified vector. Both functions find patterns, but they return different types of output. Specifically, grep returns the integer indices of the elements that match the pattern, and grepl returns a logical vector in which TRUE marks a match.
Both functions take the pattern as a regular expression. For example, to match strings that start with 'abc', we can specify '^abc', and to match strings that end with 'abc', we specify 'abc$'. We can also specify a set of values, such as 'ab[0-9]', which matches 'ab' followed by any digit, or '[A-E]9', which matches an uppercase A, B, C, D, or E followed by 9. And we can specify that something repeats: '(xy)+' matches one or more consecutive 'xy' sequences, whereas '[xy]*' matches zero or more characters that are each either x or y.
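The difference between a repeated group and a repeated character class is easy to get wrong, so here is a minimal sketch. The vector x is made up for illustration and is not one of the examples used later.

```r
# Hypothetical test strings, chosen only to illustrate the patterns
x <- c("xyxy", "abc", "yx", "B9", "F9")

# '[xy]*' allows zero repetitions, so every string matches
any_xy <- grepl("[xy]*", x)

# '(xy)+' requires at least one literal "xy" sequence
seq_xy <- grepl("(xy)+", x)

# '^[xy]+$' requires the whole string to consist of x and y characters
only_xy <- grepl("^[xy]+$", x)

# '[A-E]9' requires an uppercase A to E immediately followed by 9
range_9 <- grepl("[A-E]9", x)

print(any_xy)   # all TRUE
print(seq_xy)   # TRUE only for "xyxy"
print(only_xy)  # TRUE for "xyxy" and "yx"
print(range_9)  # TRUE only for "B9"
```

Because a starred pattern can match the empty string, it is usually combined with anchors or other required characters when used for filtering.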
grep(pattern, x, ignore.case = FALSE, value = FALSE, fixed = FALSE)
grepl(pattern, x, ignore.case = FALSE, fixed = FALSE)
where:
pattern: a character string containing a regular expression to be matched in the given character vector.
x: a character vector where matches are sought, or an object that can be coerced by as.character to a character vector. Long vectors are supported.
ignore.case: if FALSE, the pattern matching is case sensitive; if TRUE, case is ignored during matching.
value: (grep only) if FALSE, grep returns a vector containing the integer indices of the matches; if TRUE, it returns a vector containing the matching elements themselves.
fixed: logical. If TRUE, pattern is a string to be matched as is rather than as a regular expression. Overrides all conflicting arguments.
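The effect of the fixed argument is easiest to see with a pattern that contains a regular expression metacharacter. The following is a small sketch; the vector labels is a made-up example, not data from this post.

```r
# Hypothetical strings; only the first contains a literal dot
labels <- c("a.b", "axb", "a+b")

# As a regular expression, '.' matches any single character
regex_hits <- grep("a.b", labels, value = TRUE)

# With fixed = TRUE the pattern is matched literally
fixed_hits <- grep("a.b", labels, value = TRUE, fixed = TRUE)

# Escaping the dot gives the same result without fixed = TRUE
escaped_hits <- grep("a\\.b", labels, value = TRUE)

print(regex_hits)    # "a.b" "axb" "a+b"
print(fixed_hits)    # "a.b"
print(escaped_hits)  # "a.b"
```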
strings <- c('abcd', 'dabc', 'abcabc', 'dABc','ab9', 'ab', 'abc8', 'ab1')
#Find data that starts with 'abc':
pattern <- '^abc'
print (grep(pattern, strings))
## [1] 1 3 7
print (grepl(pattern, strings))
## [1] TRUE FALSE TRUE FALSE FALSE FALSE TRUE FALSE
#Find data that ends with 'abc' and is not case sensitive:
pattern <- 'abc$'
print (grep(pattern, strings, ignore.case = TRUE))
## [1] 2 3 4
print (grepl(pattern, strings, ignore.case = TRUE))
## [1] FALSE TRUE TRUE TRUE FALSE FALSE FALSE FALSE
#Find data that contains 'ab' followed by any digit and return the strings:
pattern <- 'ab[0-9]'
print (grep(pattern, strings, value=TRUE))
## [1] "ab9" "ab1"
print (grepl(pattern, strings))
## [1] FALSE FALSE FALSE FALSE TRUE FALSE FALSE TRUE
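The two return types carry the same information, and it can help to see how one converts to the other. This is a minimal sketch that reuses the strings vector from above; the variable names idx and hits are only for illustration.

```r
strings <- c('abcd', 'dabc', 'abcabc', 'dABc', 'ab9', 'ab', 'abc8', 'ab1')

idx  <- grep('^abc', strings)   # integer positions of the matches
hits <- grepl('^abc', strings)  # one TRUE/FALSE per element

# which() turns the logical vector into the same positions grep returns
same_positions <- identical(which(hits), idx)

# A logical vector can be summed to count the matches
n_hits <- sum(hits)

# Either form can be used to subset the original vector
matched <- strings[hits]

print(same_positions)  # TRUE
print(n_hits)          # 3
print(matched)         # "abcd" "abcabc" "abc8"
```

The logical form is convenient when the result needs to be combined with other conditions using & or |, while the index form is convenient for positional subsetting.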
#Now let's consider a data set, the built-in CO2 data:
head(CO2)
##   Plant   Type  Treatment conc uptake
## 1   Qn1 Quebec nonchilled   95   16.0
## 2   Qn1 Quebec nonchilled  175   30.4
## 3   Qn1 Quebec nonchilled  250   34.8
## 4   Qn1 Quebec nonchilled  350   37.2
## 5   Qn1 Quebec nonchilled  500   35.3
## 6   Qn1 Quebec nonchilled  675   39.2
grep("non", CO2$Treatment)
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 43 44 45 46
## [26] 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63
grepl("non", CO2$Treatment)
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [13] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE
## [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [37] FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE
## [49] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [61] TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [73] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#Filtering with grep:
#filter data set based on values in a column
filter_for_value<-CO2[grep("non", CO2$Treatment), ]
head(filter_for_value)
##   Plant   Type  Treatment conc uptake
## 1   Qn1 Quebec nonchilled   95   16.0
## 2   Qn1 Quebec nonchilled  175   30.4
## 3   Qn1 Quebec nonchilled  250   34.8
## 4   Qn1 Quebec nonchilled  350   37.2
## 5   Qn1 Quebec nonchilled  500   35.3
## 6   Qn1 Quebec nonchilled  675   39.2
#filter data set based on values that do not match the specified pattern
filter_for_not_a_value<-CO2[-(grep("non", CO2$Treatment)),]
head(filter_for_not_a_value)
##    Plant   Type Treatment conc uptake
## 22   Qc1 Quebec   chilled   95   14.2
## 23   Qc1 Quebec   chilled  175   24.1
## 24   Qc1 Quebec   chilled  250   30.3
## 25   Qc1 Quebec   chilled  350   34.6
## 26   Qc1 Quebec   chilled  500   32.5
## 27   Qc1 Quebec   chilled  675   35.4
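One caveat with negative indexing is worth a quick sketch: if the pattern matches nothing, grep returns an empty vector, and negating an empty index drops every row instead of keeping them all. The pattern "frozen" below is a made-up value that does not occur in CO2$Treatment; negating grepl, or using grep's invert argument, avoids the problem.

```r
# "frozen" is a hypothetical pattern with no matches in CO2$Treatment
no_match <- grep("frozen", CO2$Treatment)

# -integer(0) is still integer(0), so no rows are selected
dropped_all <- CO2[-no_match, ]

# Negating the logical vector keeps every non-matching row
kept_grepl <- CO2[!grepl("frozen", CO2$Treatment), ]

# invert = TRUE makes grep return the non-matching positions directly
kept_invert <- CO2[grep("frozen", CO2$Treatment, invert = TRUE), ]

print(nrow(dropped_all))  # 0
print(nrow(kept_grepl))   # 84
print(nrow(kept_invert))  # 84
```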
#Selecting columns with grep:
select_columns<-CO2[, grep("T", colnames(CO2))]
head(select_columns)
##     Type  Treatment
## 1 Quebec nonchilled
## 2 Quebec nonchilled
## 3 Quebec nonchilled
## 4 Quebec nonchilled
## 5 Quebec nonchilled
## 6 Quebec nonchilled
dont_select_columns<-CO2[, -(grep("T", colnames(CO2)))]
head(dont_select_columns)
##   Plant conc uptake
## 1   Qn1   95   16.0
## 2   Qn1  175   30.4
## 3   Qn1  250   34.8
## 4   Qn1  350   37.2
## 5   Qn1  500   35.3
## 6   Qn1  675   39.2
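When a column pattern matches exactly one column, base R simplifies the result to a plain vector unless told otherwise. The sketch below assumes we want to keep a data frame in that case; the pattern "^c" is chosen only because it matches a single column name (conc) in CO2.

```r
# "^c" matches only the column named "conc"
cols <- grep("^c", colnames(CO2))

# With a single matching column, the result is a numeric vector
as_vector <- CO2[, cols]

# drop = FALSE keeps the data frame structure
one_col <- CO2[, cols, drop = FALSE]

print(class(as_vector))        # "numeric"
print(is.data.frame(one_col))  # TRUE
print(colnames(one_col))       # "conc"
```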
Another useful feature of grep and grepl is that they work well inside functions from other R packages. For example, they can be used with dplyr:
library(dplyr)
CO2_dplyr<-as_tibble(CO2) #converting CO2 into a tibble
#dplyr filtering with grepl
filter_dplyr_for_value_non<-CO2_dplyr %>% filter(grepl("non", Treatment))
filter_dplyr_for_value_non
## # A tibble: 42 x 5
##    Plant Type   Treatment   conc uptake
##    <ord> <fct>  <fct>      <dbl>  <dbl>
##  1 Qn1   Quebec nonchilled    95   16
##  2 Qn1   Quebec nonchilled   175   30.4
##  3 Qn1   Quebec nonchilled   250   34.8
##  4 Qn1   Quebec nonchilled   350   37.2
##  5 Qn1   Quebec nonchilled   500   35.3
##  6 Qn1   Quebec nonchilled   675   39.2
##  7 Qn1   Quebec nonchilled  1000   39.7
##  8 Qn2   Quebec nonchilled    95   13.6
##  9 Qn2   Quebec nonchilled   175   27.3
## 10 Qn2   Quebec nonchilled   250   37.1
## # ... with 32 more rows
filter_dplyr_for_not_a_value<-CO2_dplyr %>% filter(!(grepl("non", Treatment)))
filter_dplyr_for_not_a_value
## # A tibble: 42 x 5
##    Plant Type   Treatment  conc uptake
##    <ord> <fct>  <fct>     <dbl>  <dbl>
##  1 Qc1   Quebec chilled      95   14.2
##  2 Qc1   Quebec chilled     175   24.1
##  3 Qc1   Quebec chilled     250   30.3
##  4 Qc1   Quebec chilled     350   34.6
##  5 Qc1   Quebec chilled     500   32.5
##  6 Qc1   Quebec chilled     675   35.4
##  7 Qc1   Quebec chilled    1000   38.7
##  8 Qc2   Quebec chilled      95    9.3
##  9 Qc2   Quebec chilled     175   27.3
## 10 Qc2   Quebec chilled     250   35
## # ... with 32 more rows
#dplyr selecting with grep
select_dplyr_columns<-CO2_dplyr %>% select(grep("T", colnames(CO2_dplyr)))
select_dplyr_columns
## # A tibble: 84 x 2
##    Type   Treatment
##    <fct>  <fct>
##  1 Quebec nonchilled
##  2 Quebec nonchilled
##  3 Quebec nonchilled
##  4 Quebec nonchilled
##  5 Quebec nonchilled
##  6 Quebec nonchilled
##  7 Quebec nonchilled
##  8 Quebec nonchilled
##  9 Quebec nonchilled
## 10 Quebec nonchilled
## # ... with 74 more rows
dont_select_dplyr_column<-CO2_dplyr %>% select(-grep("T", colnames(CO2_dplyr)))
dont_select_dplyr_column
## # A tibble: 84 x 3
##    Plant  conc uptake
##    <ord> <dbl>  <dbl>
##  1 Qn1      95   16
##  2 Qn1     175   30.4
##  3 Qn1     250   34.8
##  4 Qn1     350   37.2
##  5 Qn1     500   35.3
##  6 Qn1     675   39.2
##  7 Qn1    1000   39.7
##  8 Qn2      95   13.6
##  9 Qn2     175   27.3
## 10 Qn2     250   37.1
## # ... with 74 more rows