Splitting a string into multiple columns (with specific order) - r

I receive data that isn't in a very nice format (and I can't change it upstream). There is one column that needs to be reordered and split into 10+ other columns based on certain keywords.
Here's an example of the data I receive: each person has chosen 3 different foods. Their choice for each food category (food1, food2, food3) comes right after the category name:
list1 <- c(' food1 pasta food2 apple food3 carrot ')
list2 <- c(' food2 banana food3 cucumber food1 brown rice ')
list3 <- c(' food3 bell pepper food2 plum food1 bread ')
foodListDF <- as.data.frame(matrix(c(1,2,3, list1, list2, list3), nrow = 3), stringsAsFactors = FALSE)
colnames(foodListDF) <- c('Person', 'Choices')
foodListDF
Person Choices
1 1 food1 pasta food2 apple food3 carrot
2 2 food2 banana food3 cucumber food1 brown rice
3 3 food3 bell pepper food2 plum food1 bread
The above is the format I receive my data in. My end goal is to split the Choices column into 3 separate columns labeled food1, food2, and food3 which requires things to be ordered properly:
Person food1 food2 food3
1 1 pasta apple carrot
2 2 brown rice banana cucumber
3 3 bread plum bell pepper
I know that I can split the choices doing something like this:
library(stringr)
as.data.frame(str_split_fixed(foodListDF$Choices, c(' food1 | food2 | food3 '), 4))[,2:4]
V2 V3 V4
1 pasta apple carrot
2 banana cucumber brown rice
3 bell pepper plum bread
But this obviously doesn't split them into their proper groups/order which is very necessary.
I'm really just struggling to think how to extract the correct food from the proper group for each person. Any ideas?

You could extract the food number and food items separately (t1 and t2), join them together, unnest the data and get it into wide format.
library(dplyr)
library(tidyr)
foodListDF %>%
mutate(food = stringr::str_extract_all(Choices, 'food\\d+')) %>%
select(-Choices) -> t1
foodListDF %>%
separate_rows(Choices, sep = 'food\\d+') %>%
filter(Choices != ' ') %>%
mutate(Choices = trimws(Choices)) %>%
group_by(Person) %>%
summarise(col = list(Choices)) -> t2
inner_join(t1, t2, by = 'Person') %>%
unnest(c(food, col)) %>%
pivot_wider(names_from = food, values_from = col)
# Person food1 food2 food3
# <chr> <chr> <chr> <chr>
#1 1 pasta apple carrot
#2 2 brown rice banana cucumber
#3 3 bread plum bell pepper

Base R
Here are two base R approaches, both involving regmatches and gregexpr.
The first makes use of unstack. It results in a data.frame.
splitfun1 <- function(string) {
mat <- gregexpr("food\\d+ ", string)
unstack(
list(l1 = unlist(lapply(regmatches(string, mat), trimws), use.names = FALSE),
l2 = unlist(lapply(regmatches(string, mat, invert = TRUE),
function(x) trimws(x[-1])), use.names = FALSE)),
l2 ~ l1)
}
splitfun1(foodListDF$Choices)
# food1 food2 food3
# 1 pasta apple carrot
# 2 brown rice banana cucumber
# 3 bread plum bell pepper
The second makes use of matrix indexing to fill in an empty matrix. It's probably a bit more efficient than the first alternative. It results in a matrix.
splitfun2 <- function(string) {
mat <- gregexpr("food\\d+ ", string)
l1 <- lapply(regmatches(string, mat), trimws)
l2 <- lapply(regmatches(string, mat, invert = TRUE),
function(x) trimws(x[-1]))
ul <- unlist(l1, use.names = FALSE)
cn <- sort(unique(ul))
out <- matrix(NA_character_, nrow = length(string), ncol = length(cn),
dimnames = list(seq_along(string), cn))
out[cbind(rep(seq_along(string), lengths(l1)), ul)] <- unlist(l2, use.names = FALSE)
out
}
splitfun2(foodListDF$Choices)
# food1 food2 food3
# 1 "pasta" "apple" "carrot"
# 2 "brown rice" "banana" "cucumber"
# 3 "bread" "plum" "bell pepper"
Of course, with either of these, you would then need to cbind the result with the relevant columns from the source data.frame.
cbind(foodListDF[1], splitfun2(foodListDF$Choices))
splitstackshape + data.table
Another option is to use cSplit from my "splitstackshape" package along with some pretty straightforward gsub work, followed by dcast to go into a wide form.
library(splitstackshape)
# library(data.table) # if required
# Basic helper function
fun <- function(string) {
list(gsub("(food\\d+) (.*)", "\\1", string),
gsub("(food\\d+) (.*)", "\\2", string))
}
cSplit(as.data.table(foodListDF)[, Choices := gsub(" food", ",food", trimws(Choices))],
"Choices", ",", "long")[, fun(Choices), Person][, dcast(.SD, Person ~ V1, value.var = "V2")]
# Person food1 food2 food3
# 1: 1 pasta apple carrot
# 2: 2 brown rice banana cucumber
# 3: 3 bread plum bell pepper
dplyr + tidyr
Adapting the above to "dplyr" + "tidyr", you can try:
library(dplyr)
library(tidyr)
foodListDF %>%
mutate(Choices = gsub(" food", ",food", trimws(Choices))) %>%
separate_rows(Choices, sep = ",") %>%
separate(Choices, c("var", "val"), extra = "merge") %>%
pivot_wider(names_from = var, values_from = val)
# # A tibble: 3 x 4
# Person food1 food2 food3
# <chr> <chr> <chr> <chr>
# 1 1 pasta apple carrot
# 2 2 brown rice banana cucumber
# 3 3 bread plum bell pepper

In one convoluted Base R expression:
data.frame(cbind(Person = foodListDF$Person,
do.call("rbind", Map(function(x){y <- setNames(x[[2]], x[[1]]); y[order(x[[1]])]},
lapply(strsplit(foodListDF$Choices, "\\s+"), function(x) {
res <- data.frame(t(grep("food\\d+", x, value = TRUE)), stringsAsFactors = FALSE)
res2 <- unlist(strsplit(gsub("^&&\\s*", "",
paste0(Filter(function(y){y != ""}, Vectorize(gsub)("food\\d+", "&&", x)),
collapse = " ")), "\\s*&&\\s*"))
list(res, res2)
}
)
)
)
), stringsAsFactors = FALSE)

You can use food as a delimiter in strsplit, sort the result, remove the first character with substring and return the result to your dataset.
foodListDF[paste0("food",1:3)] <- t(sapply(strsplit(foodListDF$Choices, "food"),
function (x) trimws(substring(sort(x[-1]), 2))))
foodListDF[-2]
# Person food1 food2 food3
#1 1 pasta apple carrot
#2 2 brown rice banana cucumber
#3 3 bread plum bell pepper
Or, in case not all levels are present in every row:
j <- sort(unique(unlist(regmatches(foodListDF$Choices, gregexpr("food\\d+",
foodListDF$Choices)))))
k <- sub("food", "", j)
foodListDF[j] <- t(sapply(strsplit(foodListDF$Choices, "food"), function(x)
trimws(sub("^\\d+", "", x[charmatch(k, x)]))))
foodListDF[-2]
# Person food1 food2 food3
#1 1 pasta apple carrot
#2 2 brown rice banana cucumber
#3 3 bread plum bell pepper
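To make the "not all levels present" case concrete, here is a minimal base R sketch using regmatches and gregexpr. The second input string is made up for illustration (food2 is deliberately missing), and the names keys, vals and cols are only placeholders.

```r
# Illustrative input: the second entry has no food2 (an assumption, not the original data)
choices <- c(" food1 pasta food2 apple food3 carrot ",
             " food3 cucumber food1 brown rice ")

m    <- gregexpr("food\\d+", choices)
keys <- regmatches(choices, m)                      # category labels per row
vals <- lapply(regmatches(choices, m, invert = TRUE),
               function(v) trimws(v[-1]))           # text following each label

# All labels seen anywhere (sorted as strings, so food10 would sort before food2)
cols <- sort(unique(unlist(keys)))

# Look up each row's values by label; labels absent from a row become NA
out <- t(vapply(seq_along(choices), function(i) {
  unname(setNames(vals[[i]], keys[[i]])[cols])
}, character(length(cols))))
colnames(out) <- cols
out
```

Rows that lack a category simply get NA in that column, so the result can still be cbind-ed to the Person column.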

Related

Joining and replacing columns multiple times

I have a dataframe with a lot of columns with abbreviations. I'm trying to replace the columns with their full name.
A minimal reproducible example:
category <- data.frame(short = c("TOM", "BAN", "APP", "PEA"),
name = c("tomato", "banana", "apple", "pear"))
df <- data.frame(col1 = c("TOM", "TOM", "TOM", "APP", "TOM"),
col2 = c("APP", "TOM", "TOM", "PEA", "PEA"),
col3 = c("TOM", "PEA", "PEA", "TOM", "BAN"))
col1 col2 col3
1 TOM APP TOM
2 TOM TOM PEA
3 TOM TOM PEA
4 APP PEA TOM
5 TOM PEA BAN
Now, I would like my dataframe to just contain the full names of the products. I can get it to work with left_joins, selecting and renaming, but this code is getting out of hand pretty rapidly with a lot of columns.
df2 <- df %>%
left_join(category, by = c("col1" = "short")) %>%
select(-col1) %>%
rename(col1 = name) %>%
left_join(category, by = c("col2" = "short")) %>%
select(-col2) %>%
rename(col2 = name) %>%
left_join(category, by = c("col3" = "short")) %>%
select(-col3) %>%
rename(col3 = name)
col1 col2 col3
1 tomato apple tomato
2 tomato tomato pear
3 tomato tomato pear
4 apple pear tomato
5 tomato pear banana
I think (hope?) there's a better solution for it, but I'm unable to find it.
An option is to create a named vector
library(dplyr)
library(tibble)
v1 <- deframe(category)
and then use that to match and replace the values
df1 <- df %>%
mutate(across(everything(), ~ v1[.]))
-output
df1
# col1 col2 col3
#1 tomato apple tomato
#2 tomato tomato pear
#3 tomato tomato pear
#4 apple pear tomato
#5 tomato pear banana
It can also be done with recode in a similar way
df %>%
mutate(across(everything(), ~ recode(., !!! v1)))
Or using base R, create the named vector with setNames, loop over the columns with lapply and replace those values and assign it back
v1 <- with(category, setNames(name, short))
df1 <- df
df1[] <- lapply(df, function(x) v1[x])
Or convert to matrix (a matrix is a vector with dim attributes)
df1[] <- v1[as.matrix(df)]
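One thing the named-vector lookup does not cover is an abbreviation that is missing from category: those cells come back as NA. A minimal sketch, assuming you would rather keep the original code in that case ("KIW" is a made-up abbreviation that is not in the lookup table, and df_full is just an illustrative name):

```r
category <- data.frame(short = c("TOM", "BAN", "APP", "PEA"),
                       name  = c("tomato", "banana", "apple", "pear"),
                       stringsAsFactors = FALSE)
# "KIW" is deliberately absent from category
df <- data.frame(col1 = c("TOM", "APP", "KIW"),
                 col2 = c("PEA", "BAN", "TOM"),
                 stringsAsFactors = FALSE)

lookup <- setNames(category$name, category$short)

df_full <- df
df_full[] <- lapply(df, function(x) {
  full <- unname(lookup[x])      # NA where the abbreviation is unknown
  ifelse(is.na(full), x, full)   # fall back to the original value
})
df_full
```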
Another option is using factor
df[] <- factor(
u <- unlist(df),
labels = with(category, name[match(sort(unique(u)), short)])
)
or a shorter one via setNames
df[]<-with(category,setNames(name,short))[unlist(df)]
which gives
> df
col1 col2 col3
1 tomato apple tomato
2 tomato tomato pear
3 tomato tomato pear
4 apple pear tomato
5 tomato pear banana
You can get the data in long format such that all the values are in one column which is easy to join with category dataframe and then get data back in wide format.
library(dplyr)
library(tidyr)
df %>%
mutate(row = row_number()) %>%
pivot_longer(cols = -row, names_to = 'col', values_to = 'short') %>%
left_join(category, 'short') %>%
select(-short) %>%
pivot_wider(names_from = col, values_from = name) %>%
select(-row)
# col1 col2 col3
# <chr> <chr> <chr>
#1 tomato apple tomato
#2 tomato tomato pear
#3 tomato tomato pear
#4 apple pear tomato
#5 tomato pear banana

Create variables based on regular expressions with a loop in r

I need help to create variables based on regular expressions.
This is my dataframe:
df <- data.frame(a=c("blue", "red", "yellow", "yellow", "yellow", "yellow", "red"), b=c("apple", "orange", "peach", "lemon", "pineapple", "tomato", NA))
Basically, what I want to do is this, but in one step:
regx_1 <- as.numeric(grep("^[a-z]{5}$", df$b))
regx_2 <- as.numeric(grep("^[a-z]{6,}$", df$b))
df$fruit_1 <- NA
df$fruit_1[regx_1 + 1] <- as.character(df$b[regx_1])
df$fruit_2 <- NA
df$fruit_2[regx_2 + 1] <- as.character(df$b[regx_2])
Here is my try:
regex1 <- "^[a-z]{5}$"
regex2 <- "^[a-z]{6,}$"
regex <- c(regex1, regex1)
make_non_matches_NA <- function(vec, pattern){
df[[newvariable]] <- NA
df[[newvariable]][as.numeric(grep(pattern, vec)) + 1] <- as.character(vec[as.numeric(grep(pattern, vec))])
return(newvariable)
}
df[c("fruit1", "fruit2")] <- lapply(regex, make_non_matches_NA, vec = df$b)
EDIT: Why is my approach wrong? (Please note that the actual problem is bigger, so I have to stick to an approach, where a repetition of a pattern should be avoided)
Any help is much appreciated!
Having numbered items in your workspace is a good sign that they really belong
in a list, so that they are formally linked and we can work with them much more easily. So let's do that first.
regex <- c("^[a-z]{5}$", "^[a-z]{6,}$")
Our core task is to copy a source vector but replace the elements that don't match with NA. We'll write a function for that and give it an explicit name so that we (and our colleagues, or the next reader on SO ;) ) understand intuitively what it does:
make_non_matches_NA <- function(vec, pattern){
# logical indices of matches
matches_lgl <- grepl(pattern, vec)
# the elements which don't match should be NA
vec[!matches_lgl] <- NA
# resulting vector should be returned
vec
}
Let's test this with first pattern
make_non_matches_NA(df$b, regex[[1]])
#> [1] apple <NA> peach lemon <NA> <NA>
#> Levels: apple lemon orange peach pineapple tomato
So far so good! Now let's test it with all the patterns. In R we generally avoid for loops when we can, because we have clearer tools like lapply(). Here I want to apply this function to every regular expression:
lapply(regex, make_non_matches_NA, vec = df$b)
#> [[1]]
#> [1] apple <NA> peach lemon <NA> <NA>
#> Levels: apple lemon orange peach pineapple tomato
#>
#> [[2]]
#> [1] <NA> orange <NA> <NA> pineapple tomato
#> Levels: apple lemon orange peach pineapple tomato
Great, it works!
But I want this in my data.frame, not as a separate list, so I will assign this result to the relevant names in my df directly
df[c("fruit1", "fruit2")] <- lapply(regex, make_non_matches_NA, vec = df$b)
# then print my updated df
df
#> a b fruit1 fruit2
#> 1 1 apple apple <NA>
#> 2 2 orange <NA> orange
#> 3 3 peach peach <NA>
#> 4 4 lemon lemon <NA>
#> 5 5 pineapple <NA> pineapple
#> 6 6 tomato <NA> tomato
tada!
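As for why the attempt in the question fails: newvariable is never defined inside the function, any change to df made inside a function only affects a local copy, and regex is built as c(regex1, regex1), so the second pattern is never used. Note also that the question's step-by-step code writes each match into the following row (the + 1), which the function above does not reproduce. A minimal sketch that keeps that offset, assuming it is intended (shift_matches and its offset argument are hypothetical names):

```r
# Data from the question (7 rows, last fruit is NA)
df <- data.frame(
  a = c("blue", "red", "yellow", "yellow", "yellow", "yellow", "red"),
  b = c("apple", "orange", "peach", "lemon", "pineapple", "tomato", NA),
  stringsAsFactors = FALSE
)
regex <- c("^[a-z]{5}$", "^[a-z]{6,}$")

# Hypothetical helper: put each matching element `offset` rows further down
shift_matches <- function(vec, pattern, offset = 1) {
  out <- rep(NA_character_, length(vec))
  idx <- grep(pattern, vec)                   # positions of matches
  idx <- idx[idx + offset <= length(vec)]     # drop matches that would fall off the end
  out[idx + offset] <- as.character(vec[idx])
  out
}

df[c("fruit_1", "fruit_2")] <- lapply(regex, shift_matches, vec = df$b)
df
```

The function returns a new vector instead of modifying df, which is what lets lapply() assign both columns in one step.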
I don't know if this qualifies as "one step", but you could try mutate from the dplyr package:
df <- data.frame(a=c(1:6), b=c("apple", "orange", "peach", "lemon", "pineapple", "tomato"),
stringsAsFactors = FALSE)
Note that I set stringsAsFactors = FALSE inside data.frame().
dplyr::mutate(df, fruit_1 = dplyr::if_else(grepl("^[a-z]{5}$", b), b, NA_character_),
fruit_2 = dplyr::if_else(grepl("^[a-z]{6}$", b), b, NA_character_))
a b fruit_1 fruit_2
1 1 apple apple <NA>
2 2 orange <NA> orange
3 3 peach peach <NA>
4 4 lemon lemon <NA>
5 5 pineapple <NA> <NA>
6 6 tomato <NA> tomato

Inverse matching with stringr

I want to remove all characters that don't match a string pattern using the stringr package. So far I've been able to remove those before the pattern using "\\w+(?= (grape|satsuma))" as the pattern, but removing those after the pattern is still impossible.
> str_remove_all("apples grape banana melon olive persimon grape apples satsuma papaya",
+ "\\w+(?= (grape|satsuma))")
[1] " grape banana melon olive grape satsuma papaya"
The desired result is:
"grape grape satsuma"
(NOTE: I am aware the easiest approach in this case is to extract only "grape" and "satsuma" but for analysis purposes I prefer this way)
Edited providing the entire problem
The entire problem is as follows: given a data frame d that contains a column with a string, the function should return the same column with only the matches:
> d
# A tibble: 2 x 2
string_column c2
<chr> <dbl>
1 apples grape banana satsuma 3
2 grape banana satsuma melon 4
Using the answer provided by @d.r works:
> d %>%
+ mutate_at(vars(string_column), ~ gsub("(grape|satsuma| )(*SKIP)(*FAIL)|.", "", ., perl = TRUE))
# A tibble: 2 x 2
string_column c2
<chr> <dbl>
1 " grape satsuma" 3
2 "grape satsuma " 4
All answers provided so far using the stringr package fail to return the string_column.
This is the dput for d:
d <- structure(list(string_column = c("apples grape banana satsuma",
"grape banana satsuma melon"), c2 = c(3, 4)), row.names = c(NA,
-2L), class = c("tbl_df", "tbl", "data.frame"))
You may want to look at negative lookaheads and some related regex techniques in the linked thread.
However, since we are extracting words I'd rather use str_extract_all and I'd do it like this:
str_extract_all("apples grape banana melon olive persimon grape apples satsuma papaya",
"grape|satsuma")
"grape" "grape" "satsuma"
I also really like this line that @steveLangsford left in a comment:
paste0(unlist(str_extract_all("apples grape banana melon olive persimon grape apples satsuma papaya", "grape|satsuma")), collapse=" ")
"grape grape satsuma"
Taking it a little bit further based on our discussion/comments:
string_column <- c("apples grape banana satsuma", "grape banana satsuma melon")
c2 <- c(3, 4)
d <- tibble(string_column,c2)
myfun <- function(x) {paste0(unlist(str_extract_all(x, "grape|satsuma")), collapse=" ") }
sapply(d$string_column, myfun)
"grape satsuma" "grape satsuma"
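If the goal is to write the result back into string_column, the same extract-then-paste idea can be sketched in base R with regmatches and gregexpr, so it does not depend on stringr. This is only a sketch: keep_matches is a hypothetical helper name, and d is rebuilt here as a plain data.frame rather than a tibble.

```r
d <- data.frame(
  string_column = c("apples grape banana satsuma", "grape banana satsuma melon"),
  c2 = c(3, 4),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep only the pieces that match `pattern`,
# joined by single spaces; rows with no match become ""
keep_matches <- function(x, pattern) {
  hits <- regmatches(x, gregexpr(pattern, x))
  vapply(hits, paste, character(1), collapse = " ", USE.NAMES = FALSE)
}

d$string_column <- keep_matches(d$string_column, "grape|satsuma")
d
```

Because the helper is vectorised over the column, it can also be used inside mutate() without sapply().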

Put the combinations matrix of many rows in a column of a dataframe, then split it

I have a dataframe that looks like this (I simplify):
df <- data.frame(rbind(c(1, "dog", "cat", "rabbit"), c(2, "apple", "peach", "cucumber")))
colnames(df) <- c("ID", "V1", "V2", "V3")
## ID V1 V2 V3
## 1 1 dog cat rabbit
## 2 2 apple peach cucumber
I would like to create a column containing all possible combinations of variables V1:V3 two by two (order doesn't matter), but keeping a link with the original ID. So something like this.
## ID bigrams
## 1 1 dog cat
## 2 1 cat rabbit
## 3 1 dog rabbit
## 4 2 apple peach
## 5 2 apple cucumber
## 6 2 peach cucumber
My idea: use combn(), mutate() and separate_rows().
library(tidyr)
library(dplyr)
df %>%
mutate(bigrams=paste(unlist(t(combn(df[,2:4],2))), collapse="-")) %>%
separate_rows(bigrams, sep="-") %>%
select(ID,bigrams)
The result is not what I expected... I guess that concatenating a matrix (the result of combn()) is not as easy as that.
I have two questions about this: 1) How do I debug this code? 2) Is this a good way to do this kind of thing? I'm new to R but I have an OpenRefine background, so concatenating and splitting multivalued cells makes a lot of sense to me. But is this also the right method in R?
Thanks in advance for any help.
We can do this with data.table. Convert the 'data.frame' to 'data.table' (setDT(df)), melt it to 'long' format, grouped by 'ID', get the combn of 'value' and paste it together
library(data.table)
dM <- melt(setDT(df), id.var = "ID")[, combn(value, 2, FUN = paste, collapse=' '), ID]
setnames(dM, 2, 'bigrams')[]
# ID bigrams
#1: 1 dog cat
#2: 1 dog rabbit
#3: 1 cat rabbit
#4: 2 apple peach
#5: 2 apple cucumber
#6: 2 peach cucumber
I recommend @akrun's "melt first" approach, but just for fun, here are more ways to do it:
library(tidyverse)
df %>%
mutate_all(as.character) %>%
transmute(ID = ID, bigrams = pmap(
list(V1, V2, V3),
function(a, b, c) combn(c(a, b, c), 2, paste, collapse = " ")
))
# ID bigrams
# 1 1 dog cat, dog rabbit, cat rabbit
# 2 2 apple peach, apple cucumber, peach cucumber
(mutate_all(as.character) just because you gave us factors, and factor to character conversion can be surprising).
df %>%
mutate_all(as.character) %>%
nest(-ID) %>%
mutate(bigrams = map(data, combn, 2, paste, collapse = " ")) %>%
unnest(data) %>%
as.data.frame()
# ID bigrams V1 V2 V3
# 1 1 dog cat, dog rabbit, cat rabbit dog cat rabbit
# 2 2 apple peach, apple cucumber, peach cucumber apple peach cucumber
(as.data.frame() just for a prettier printing)
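On the debugging question: as far as I can tell, combn(df[,2:4], 2) combines the columns of the data frame (a data frame is a list of columns) rather than the values within each row, and paste(..., collapse = "-") then produces a single string that is repeated for every row. For comparison, here is a base R sketch that builds the long result one row at a time; pairs and out are illustrative names.

```r
df <- data.frame(ID = c(1, 2),
                 V1 = c("dog", "apple"),
                 V2 = c("cat", "peach"),
                 V3 = c("rabbit", "cucumber"),
                 stringsAsFactors = FALSE)

# For each row, take its three values and paste every pair together
pairs <- lapply(seq_len(nrow(df)), function(i) {
  vals <- unlist(df[i, c("V1", "V2", "V3")], use.names = FALSE)
  data.frame(ID = df$ID[i],
             bigrams = combn(vals, 2, FUN = paste, collapse = " "),
             stringsAsFactors = FALSE)
})

out <- do.call(rbind, pairs)   # stack the per-row results
out
```

Running the combn() call on a single row first is also a convenient way to check the intermediate result before wrapping it in a pipeline.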

Apply grepl across columns by saving columns in lists?

I have a dataframe in the format mentioned below:
String Keyword
1 Apples bananas mangoes mangoes
2 Apples bananas mangoes bananas
3 Apples bananas mangoes peach
.....
It's a data frame with 50000+ rows. I'm currently using the ifelse statement manually, in batches.
data$Result<- ifelse(grepl("apples",data$String,ignore.case = TRUE)==TRUE,"apples",
ifelse(grepl("bananas",data$String,ignore.case = TRUE)==TRUE,"bananas",
ifelse(grepl("mangoes",data$String,ignore.case = TRUE)==TRUE,"mangoes","unavailable")))
String Keyword Result
Apples bananas mangoes mangoes mangoes
Apples bananas mangoes bananas bananas
Apples bananas mangoes peach unavailable
Is there a way, where I could store String and Keyword in a list and then apply grepl on the entire list?
Here's a simple and efficient solution with a combination of data.table and the stringi package:
library(data.table)
library(stringi)
setDT(df)[stri_detect_fixed(String, Keyword, case_insensitive = TRUE), result := Keyword]
# String Keyword result
# 1: Apples bananas mangoes mangoes mangoes
# 2: Apples bananas mangoes bananas bananas
# 3: Apples bananas mangoes peach NA
Alternatively, a data.table-only version:
library(data.table)
setDT(df)[, result := Keyword[grep(Keyword, String, ignore.case = TRUE)], by = .(Keyword, String)]
Benchmark
Here's a benchmark on a 5e5-row data set against the mapply answer. (The for loop answer hasn't finished running yet):
set.seed(123)
df1 <- data.frame(String = rep('Apples bananas mangoes', 5e5),
Keyword = sample(c("mangoes", "bananas", "peach"), 5e5, replace = TRUE))
system.time(df1$result2 <- ifelse(mapply(grepl,df1$Keyword, df1$String, ignore.case = TRUE), as.character(df1$Keyword), "Unavailable"))
# user system elapsed
# 40.78 0.02 41.12
system.time(setDT(df1)[stri_detect_fixed(String, Keyword, case_insensitive = TRUE), result3 := Keyword])
# user system elapsed
# 0.52 0.01 0.53
I'm assuming this is what you want:
df <- data.frame(string=rep("Apples bananas mangoes",3), keyword=c("mangoes", "bananas", "peach"))
df$result <- ifelse(mapply(grepl,df$keyword, df$string), as.character(df$keyword), "Unavailable")
string keyword result
1 Apples bananas mangoes mangoes mangoes
2 Apples bananas mangoes bananas bananas
3 Apples bananas mangoes peach Unavailable
Update
Based on the comment, it sounds like you have a list of words that you want to check against the keyword. If that is the case, something like this might work:
#Set up toy dataset
set.seed(123)
df <- data.frame(Keyword = sample(c("mangoes", "bananas", "apples","lemons" , "peach"), 10, replace = TRUE))
df
#Choose your searchwords globally
searchwords <- c("apples", "bananas", "mangoes")
library(data.table)
library(stringi)
setDT(df)
for (x in searchwords) df[Keyword == x, result := Keyword]
df[is.na(result), result := "Unavailable"]
df
Keyword result
1: bananas bananas
2: lemons Unavailable
3: apples apples
4: peach Unavailable
5: peach Unavailable
6: mangoes mangoes
7: apples apples
8: peach Unavailable
9: apples apples
10: apples apples
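If the aim is instead to reproduce the nested ifelse() from the question (report the first search word found anywhere in String), that can be sketched in base R as a loop over the search words. first_match is a hypothetical helper, the sample strings are made up, and each word is treated as a regular expression here.

```r
# Hypothetical helper: first word (in the order given) found in each string
first_match <- function(strings, words, default = "unavailable") {
  out <- rep(default, length(strings))
  # Go through the words in reverse so that earlier words overwrite later ones
  for (w in rev(words)) {
    hit <- grepl(w, strings, ignore.case = TRUE)
    out[hit] <- w
  }
  out
}

strings <- c("Apples bananas mangoes", "Bananas and mangoes", "Peach")
searchwords <- c("apples", "bananas", "mangoes")
first_match(strings, searchwords)
```

This does one vectorised grepl() pass per search word rather than one call per row, so the cost grows with the number of words, not with the number of nested ifelse() levels you have to write by hand.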
Here is a version using 'dplyr' and 'stringr':
library(dplyr)
library(stringr)
df <- mutate(df, result = ifelse(str_detect(string, keyword)==TRUE,
keyword, "Unavailable"))
Here is the line I used to create the play data:
df <- data.frame(string = rep("Apples bananas mangoes", 3), keyword = c("mangoes", "bananas", "peaches"), stringsAsFactors=FALSE)
And here is the output I get:
string keyword result
1 Apples bananas mangoes mangoes mangoes
2 Apples bananas mangoes bananas bananas
3 Apples bananas mangoes peaches Unavailable
