Collapsing rows using two vectors as indicators - r

This is my first time posting; please let me know if I'm making any beginner mistakes. In my specific case I have a vector of strings, and I want to collapse some adjacent rows. I have one vector indicating the starting position of each group and one indicating the position of its last element. How can I do this?
Here is some sample code and my approach that does not work:
text <- c("cat", "dog", "house", "mouse", "street")
x <- c(1,3)
y <- c(2,5)
result <- as.data.frame(paste(text[x:y],sep = " ",collapse = ""))
In case it's not clear, the result I want is a data frame consisting of two strings: "cat dog" and "house mouse street".

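Not sure this is the best option, but it does the job (SIMPLIFY = FALSE keeps the ranges as a list even when they all have the same length):
sapply(mapply(seq, x, y, SIMPLIFY = FALSE), function(i) paste(text[i], collapse = ' '))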
#[1] "cat dog" "house mouse street"
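One caveat worth noting: mapply() simplifies its result to a matrix when every range has the same length, and sapply() then iterates over single positions rather than whole ranges. A minimal sketch of the issue, using hypothetical vectors x2 and y2 that are not part of the question, with Map() as one way to always get a list:

```r
text <- c("cat", "dog", "house", "mouse", "street")
x2 <- c(1, 3)  # hypothetical start positions (not from the question)
y2 <- c(2, 4)  # hypothetical end positions: both ranges have length 2

# With equal-length ranges, mapply() returns a matrix instead of a list
simplified <- mapply(seq, x2, y2)
is.matrix(simplified)

# Map() never simplifies, so each list element is one complete range
ranges <- Map(seq, x2, y2)
collapsed <- vapply(ranges, function(i) paste(text[i], collapse = " "), character(1))
collapsed
```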

Either use base R with
mapply(function(.x,.y) paste(text[.x:.y],collapse = " "), x, y)
or use the purrr package as
map2_chr(x,y, ~ paste(text[.x:.y],collapse = " "))
Both yield
# [1] "cat dog" "house mouse street"
The output as a data frame depends on the structure you want: rows or columns
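As a rough sketch of the two layouts, assuming the column name combined and the variable names below (they are illustrative, not from the question):

```r
text <- c("cat", "dog", "house", "mouse", "street")
x <- c(1, 3)
y <- c(2, 5)

# Collapse each start:end range into a single string
collapsed <- mapply(function(from, to) paste(text[from:to], collapse = " "), x, y)

# One row per collapsed string
as_rows <- data.frame(combined = collapsed)

# One column per collapsed string (a single row with columns V1, V2)
as_columns <- as.data.frame(t(collapsed))
```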

I think you want
result <- data.frame(combined = c(paste(text[x[1]:y[1]], collapse = " "),
paste(text[x[2]:y[2]], collapse = " ")))
Which gives you
result
#>             combined
#> 1            cat dog
#> 2 house mouse street

Another base R solution, using parse + eval
result <- data.frame(new = sapply(paste0(x,":",y),function(v) paste0(text[eval(parse(text = v))],collapse = " ")),
row.names = NULL)
such that
> result
                 new
1            cat dog
2 house mouse street

Related

Match two character strings by location in R

string <- paste(append(rep(" ", 7), append("A", append(rep(" ", 8), append("B", append(rep(" ", 17), "C"))))), collapse = "")
text <- paste(append(rep(" ", 7), append("I love", append(rep(" ", 3), append("chocolate", append(rep(" ", 9), "pudding"))))), collapse = "")
string
[1] "       A        B                 C"
text
[1] "       I love   chocolate         pudding"
I am trying to match the letters in "string" with the text in "text" by position, so that the letter A corresponds to the text "I love", B to "chocolate" and C to "pudding". Ideally, I would like to put A, B, C in column 1, in three different rows of a dataframe (or tibble), and the text in column 2 of the corresponding rows. Any suggestions?
It is hard to know whether the strings you are trying to manipulate and then collate into columns of a data.frame follow a pattern. But for the example you posted, I suggest creating a list with the strings (strings):
strings <- list(string, text)
Then use lapply() which will in turn create a list for each element in strings.
res <-lapply(strings, function(x){
grep(x=trimws(unlist(strsplit(x, "\\s\\s"))), pattern="[[:alpha:]]", value=TRUE)
})
In the code above, strsplit() splits the string wherever two consecutive whitespace characters are found (\\s\\s). The result is a list with the pieces as inner elements, so you need unlist() before passing it to grep(). grep() then keeps only the pieces that contain an alphabetic character ([[:alpha:]]), which is what you want.
You can then use do.call(cbind, res) to bind the elements of the list returned by lapply() into columns. The lengths must match for this to work.
do.call(cbind, res)
Result:
> do.call(cbind, res)
[,1] [,2]
[1,] "A" "I love"
[2,] "B" "chocolate"
[3,] "C" "pudding"
You can wrap it in as.data.frame(), for instance, to get the desired result:
> as.data.frame(do.call(cbind, res), stringsAsFactors = FALSE)
V1 V2
1 A I love
2 B chocolate
3 C pudding
You can use read.fwf, getting the column start positions from gregexpr and the final width from nchar.
read.fwf(file=textConnection(text),
widths=c(diff(c(1, gregexpr("\\w", string)[[1]])), nchar(text)))[-1]
# V2 V3 V4
#1 I love chocolate pudding
In case the white spaces should be removed use also trimws:
trimws(read.fwf(file=textConnection(text),
widths=c(diff(c(1, gregexpr("\\w", string)[[1]])), nchar(text)))[-1])
#[1] "I love" "chocolate" "pudding"
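The same position-based idea can be sketched without read.fwf, using substring() instead (a swapped-in technique, not the code above). It assumes every label in string is a single character and that each piece of text starts in the same column as its label; the names starts, ends and matched are illustrative:

```r
string <- paste0(strrep(" ", 7), "A", strrep(" ", 8), "B", strrep(" ", 17), "C")
text <- paste0(strrep(" ", 7), "I love", strrep(" ", 3), "chocolate",
               strrep(" ", 9), "pudding")

# Column at which each label sits
starts <- as.integer(gregexpr("\\w", string)[[1]])
# Each field runs up to the column before the next label; the last one runs to the end
ends <- c(starts[-1] - 1L, nchar(text))

matched <- data.frame(
  label = substring(string, starts, starts),
  text = trimws(substring(text, starts, ends)),
  stringsAsFactors = FALSE
)
matched
```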
Based on your data, I came up with this workaround by using the package stringr. This only works with that kind of pattern, so in case you have erratic ones you need to adjust it.
The output is a data.frame with two columns given by your two input data and rows according to the matches.
library(stringr)
string <- paste(append(rep(" ", 7), append("A", append(rep(" ", 8), append("B", append(rep(" ", 17), "C"))))), collapse = "")
text <- paste(append(rep(" ", 7), append("I love", append(rep(" ", 3), append("chocolate", append(rep(" ", 9), "pudding"))))), collapse = "")
string_nospace <- str_replace_all( string, "\\s{1,20}", " " )
string_nospace <- str_trim( string_nospace )
string_nospace <- data.frame( string = t(str_split(string_nospace, "\\s", simplify = TRUE)))
text_nospace <- str_replace_all( text, "\\s{2,20}", "_" )
text_nospace <- str_sub(text_nospace, start = 2)
text_nospace <- data.frame(text = t(str_split(text_nospace, "_", simplify = TRUE)))
df = data.frame(string = string_nospace,
text = text_nospace )
df
#> string text
#> 1 A I love
#> 2 B chocolate
#> 3 C pudding
Created on 2020-06-08 by the reprex package (v0.3.0)

Detect part of a string in R (not exact match)

Consider the following dataset :
a <- c("my house", "green", "the cat is", "a girl")
b <- c("my beautiful house is cool", "the apple is green", "I m looking at the cat that is sleeping", "a boy")
c <- c("T", "T", "T", "F")
df <- data.frame(string1=a, string2=b, returns=c)
I'm trying to detect string1 in string2, BUT my goal is not only to detect exact matches. I'm looking for a way to detect the presence of the words of string1 in string2, whatever order the words appear in. As an example, the string "my beautiful house is cool" should return TRUE when searching for "my house".
I have tried to illustrate the expected behaviour of the script in the "returns" column of the example dataset above.
I have tried the grepl() and str_detect() functions, but they only work with exact matches. Can you please help? Thanks in advance.
The trick here is to not use str_detect as is but to first split the search_words into individual words. This is done in strsplit() below. We then pass this into str_detect to check if all words are matched.
library(stringr)
search_words <- c("my house", "green", "the cat is", "a girl")
words <- c("my beautiful house is cool", "the apple is green", "I m looking at the cat that is sleeping", "a boy")
patterns <- strsplit(search_words," ")
mapply(function(sentence, pats) all(str_detect(sentence, pats)), words, patterns)
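Note that str_detect() matches substrings, so a search word such as "is" would also be found inside "this". If whole-word matching is what is wanted, a base R sketch could compare the split words directly; the helper name contains_all_words is hypothetical, and it assumes words are separated by single spaces:

```r
search_words <- c("my house", "green", "the cat is", "a girl")
words <- c("my beautiful house is cool", "the apple is green",
           "I m looking at the cat that is sleeping", "a boy")

# TRUE when every word of `needle` appears as a whole word in `haystack`
contains_all_words <- function(needle, haystack) {
  all(strsplit(needle, " ", fixed = TRUE)[[1]] %in%
        strsplit(haystack, " ", fixed = TRUE)[[1]])
}

result <- unname(mapply(contains_all_words, search_words, words))
result
```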
One base R option without the involvement of split could be:
n_words <- lengths(regmatches(df[, 1], gregexpr(" ", df[, 1], fixed = TRUE))) + 1
n_matches <- mapply(FUN = function(x, y) lengths(regmatches(x, gregexpr(y, x))),
df[, 2],
gsub(" ", "|", df[, 1], fixed = TRUE),
USE.NAMES = FALSE)
n_matches == n_words
[1] TRUE TRUE TRUE FALSE
It, however, makes the assumption that there is at least one word per row in string1

Replace second occurrence of a string in one column based on value in other column in R

Here is a sample dataframe:
a <- c("cat", "dog", "mouse")
b <- c("my cat is a tabby cat and is a friendly cat", "walk the dog", "the mouse is scared of the other mouse")
df <- data.frame(a,b)
I'd like to be able to remove the second occurrence of the value in col a in col b.
Here is my desired output:
a b
cat my cat is a tabby and is a friendly cat
dog walk the dog
mouse the mouse is scared of the other
I've tried different combinations of gsub and some stringr functions, but I haven't even gotten close to being able to remove the second (and only the second) occurrence of the string in col a in col b. I think I'm asking something similar to this one, but I'm not familiar with Perl and couldn't translate it to R.
Thanks!
It takes a little work to build the right Regex.
P1 = paste(a, collapse="|")
PAT = paste0("((", P1, ").*?)(\\2)")
sub(PAT, "\\1", b, perl=TRUE)
[1] "my cat is a tabby  and is a friendly cat"
[2] "walk the dog"
[3] "the mouse is scared of the other "
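To make the pattern concrete: the inner group captures whichever word from a occurs first, .*? lazily consumes the text up to its next occurrence, and \\2 is a backreference that matches that same word again. Only the outer group \\1 is kept, so the second occurrence is dropped. Because the removal can leave stray whitespace behind, this sketch also squeezes and trims it (the cleaned step is an addition, not part of the answer above):

```r
a <- c("cat", "dog", "mouse")
b <- c("my cat is a tabby cat and is a friendly cat", "walk the dog",
       "the mouse is scared of the other mouse")

PAT <- paste0("((", paste(a, collapse = "|"), ").*?)(\\2)")
PAT  # the assembled pattern

# Drop the second occurrence, keeping everything before it
res <- sub(PAT, "\\1", b, perl = TRUE)

# Squeeze repeated spaces and trim the ends
cleaned <- trimws(gsub("\\s+", " ", res))
cleaned
```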
I've actually found another solution that, though longer, may be clearer for other regex beginners:
library(stringr)
# Replace first instance of col a in col b with "INTERIM"
df$b <- str_replace(b, a, "INTERIM")
# Now that the original first instance of col a is re-labeled to "INTERIM", I can again replace the first instance of col a in col b, this time with an empty string
df$b <- str_replace(df$b, a, "")
# And I can re-replace the re-labeled "INTERIM" to the original string in col a
df$b <- str_replace(df$b, "INTERIM", a)
# Trim leading/trailing whitespace and squeeze repeated spaces into one
df$b <- gsub("\\s+", " ", str_trim(df$b))
df
a b
cat my cat is a tabby and is a friendly cat
dog walk the dog
mouse the mouse is scared of the other
You could do this...
library(stringr)
df$b <- str_replace(df$b,
paste0("(.*?",df$a,".*?) ",df$a),
"\\1")
df
a b
1 cat my cat is a tabby and is a friendly cat
2 dog walk the dog
3 mouse the mouse is scared of the other
The regex finds the first string of characters with df$a somewhere in it, followed by a space and another df$a. The capture group is the text up to the space before the second occurrence (indicated by the (...)), and the whole text (including the second occurrence) is replaced by the capture group \\1 (which has the effect of deleting the second df$a and its preceding space). Anything after the second df$a is not affected.
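For readers without stringr, roughly the same row-wise replacement can be sketched in base R. sub() takes a single pattern, so mapply() is used here to pair each pattern with its own row (the names patterns and result are illustrative):

```r
a <- c("cat", "dog", "mouse")
b <- c("my cat is a tabby cat and is a friendly cat", "walk the dog",
       "the mouse is scared of the other mouse")

# One pattern per row: text containing the first occurrence, then " <word>"
patterns <- paste0("(.*?", a, ".*?) ", a)

# Apply each pattern to the matching element of b only
result <- unname(mapply(function(p, s) sub(p, "\\1", s, perl = TRUE), patterns, b))
result
```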
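Base R, split-apply-combine solution. Note that unique() removes every repeated word in b, not only the second occurrence of the value in a, so this matches the desired output only when no other word repeats: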
# Split-apply-combine:
data.frame(do.call("rbind", lapply(split(df, df$a), function(x){
b <- paste(unique(unlist(strsplit(x$b, "\\s+"))), collapse = " ")
return(data.frame(a = x$a, b = b))
}
)
),
stringsAsFactors = FALSE, row.names = NULL
)
Data:
df <- data.frame(a = c("cat", "dog", "mouse"),
b = c("my cat is a tabby cat and is a friendly cat", "walk the dog", "the mouse is scared of the other mouse"),
stringsAsFactors = FALSE)

Remove specific words conditionally in R

I am trying to remove a list of words in sentences according to specific conditions.
Let's say we have this dataframe :
responses <- c("The Himalaya", "The Americans", "A bird", "The Pacific ocean")
questions <- c("The highest mountain in the world","A cold war serie from 2013","A kiwi which is not a fruit", "Widest liquid area on earth")
df <- cbind(questions,responses)
> df
questions responses
[1,] "The highest mountain in the world" "The Himalaya"
[2,] "A cold war serie from 2013" "The Americans"
[3,] "A kiwi which is not a fruit" "A bird"
[4,] "Widest liquid area on earth" "The Pacific ocean"
And the following list of specific words:
articles <- c("The","A")
geowords <- c("mountain","liquid area")
I would like to do 2 things:
Remove the articles in first position in the responses column when adjacent to a word starting by a lower case letter
Remove the articles in first position in the responses column when (adjacent to a word starting by an upper case letter) AND IF (a geoword is in the corresponding question)
The expected result should be:
questions responses
[1,] "The highest mountain in the world" "Himalaya"
[2,] "A cold war serie from 2013" "The Americans"
[3,] "A kiwi which is not a fruit" "bird"
[4,] "Widest liquid area on earth" "Pacific ocean"
I tried gsub without success, as I'm not at all familiar with regex...
I have searched Stack Overflow without finding a really similar problem. If an R and regex all-star could help me, I would be very thankful!
The conditions you describe are written below as two logical columns; ifelse() then checks them and gsub() removes the article:
library(stringr)
responses <- c("The Himalaya", "The Americans", "A bird", "The Pacific ocean")
questions <- c("The highest mountain in the world","A cold war serie from 2013","A kiwi which is not a fruit", "Widest liquid area on earth")
df <- data.frame(cbind(questions,responses), stringsAsFactors = F)
df
articles <- c("The ","A ")
geowords <- c("mountain","liquid area")
df$f_caps <- unlist(lapply(df$responses, function(x) {grepl('[A-Z]',str_split(str_split(x,' ', simplify = T)[2],'',simplify = T)[1])}))
df$geoword_flag <- grepl(paste(geowords,collapse='|'),df[,1])
df$new_responses <- ifelse((df$f_caps & df$geoword_flag) | !df$f_caps,
{gsub(paste(articles,collapse='|'),'', df$responses ) },
df$responses)
df$new_responses
> df$new_responses
[1] "Himalaya" "The Americans" "bird" "Pacific ocean"
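The two rules from the question can also be written as anchored patterns in base R. This is only a sketch under the assumption that the article is always the first word of the response and is followed by a single space; the variable names are illustrative:

```r
responses <- c("The Himalaya", "The Americans", "A bird", "The Pacific ocean")
questions <- c("The highest mountain in the world", "A cold war serie from 2013",
               "A kiwi which is not a fruit", "Widest liquid area on earth")
articles <- c("The", "A")
geowords <- c("mountain", "liquid area")

art_or <- paste(articles, collapse = "|")
geo_or <- paste(geowords, collapse = "|")

# Rule 1: leading article followed by a lower-case letter
before_lower <- grepl(paste0("^(", art_or, ") [a-z]"), responses)
# Rule 2: leading article followed by an upper-case letter AND a geoword in the question
before_upper <- grepl(paste0("^(", art_or, ") [A-Z]"), responses)
has_geoword <- grepl(geo_or, questions)

drop_article <- before_lower | (before_upper & has_geoword)
new_responses <- ifelse(drop_article,
                        sub(paste0("^(", art_or, ") "), "", responses),
                        responses)
new_responses
```

Anchoring with ^ means an article is only removed from the first position, never from the middle of a response.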
I taught myself some R today. I used a function to get the same result.
#!/usr/bin/env Rscript
# References
# https://stackoverflow.com/questions/1699046/for-each-row-in-an-r-dataframe
responses <- c("The Himalaya", "The Americans", "A bird", "The Pacific ocean")
questions <- c("The highest mountain in the world","A cold war serie from 2013","A kiwi which is not a fruit", "Widest liquid area on earth")
df <- cbind(questions,responses)
articles <- c("The","A")
geowords <- c("mountain","liquid area")
common_pattern <- paste( "(?:", paste(articles, "", collapse = "|"), ")", sep = "")
pattern1 <- paste(common_pattern, "([a-z])", sep = "")
pattern2 <- paste(common_pattern, "([A-Z])", sep = "")
geo_pattern <- paste(geowords, collapse = "|")
f <- function (x){
q <- x[1]
r <- x[2]
a1 <- gsub (pattern1, "\\1", r)
if ( grepl(geo_pattern, q)){
a1 <- gsub (pattern2, "\\1", a1)
}
x[1] <- q
x[2] <- a1
}
apply (df, 1, f)
Running it:
Rscript stacko.R
[1] "Himalaya" "The Americans" "bird" "Pacific ocean"
You may choose to use simple regex with grepl and gsub, as below:
df <- data.frame(cbind(questions,responses), stringsAsFactors = F) # Convert to a data frame, since cbind gives a matrix; stringsAsFactors = F keeps the columns as character instead of factors
regx <- paste0(geowords, collapse="|") # The "or" condition between the geowords
articlegrep <- paste0(articles, collapse="|") # The "or" condition between the articles
df$responses <- ifelse(grepl(regx, df$questions)|grepl(paste0("(",articlegrep,")","\\s[a-z]"), df$responses),
gsub("\\w+ (.*)","\\1",df$responses),df$responses) #The if condition for which replacement has to happen
> print(df)
questions responses
#1 The highest mountain in the world Himalaya
#2 A cold war serie from 2013 The Americans
#3 A kiwi which is not a fruit bird
#4 Widest liquid area on earth Pacific ocean
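For fun, here's a tidyverse solution:
library(dplyr)
library(stringr)
# Collapse the geowords into one alternation so every question is tested against all of them
geo_or <- paste(geowords, collapse = "|")
df2 <-
  df %>%
  as_tibble() %>%
  mutate(responses =
    # If the question contains a geoword ...
    if_else(str_detect(questions, geo_or),
      # ... drop the first word when it precedes an upper-case letter
      str_replace(string = responses,
                  pattern = regex("\\w+\\b\\s(?=[A-Z])"),
                  replacement = ""),
      # otherwise drop the first word when it precedes a lower-case letter
      str_replace(string = responses,
                  pattern = regex("\\w+\\b\\s(?=[a-z])"),
                  replacement = ""))
  )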
Edit: without the "first word" regex, with inspiration from @Calvin Taylor
# Define articles
articles <- c("The", "A")
# Make it a regex alternation
art_or <- paste0(articles, collapse = "|")
# Before a lowercase / uppercase
art_upper <- paste0("(?:", art_or, ")", "\\s", "(?=[A-Z])")
art_lower <- paste0("(?:", art_or, ")", "\\s", "(?=[a-z])")
# Work on df
df4 <-
  df %>%
  as_tibble() %>%
  mutate(responses =
    # Test each question against all geowords at once
    if_else(str_detect(questions, paste(geowords, collapse = "|")),
      str_replace_all(string = responses,
                      pattern = regex(art_upper),
                      replacement = ""),
      str_replace_all(string = responses,
                      pattern = regex(art_lower),
                      replacement = "")
    )
  )

Handling string search and substitution in R

I am a beginner in R (I used Matlab before). I have been searching for a solution to my problem but cannot seem to find one.
I have a very large vector with text entries. Something like
CAT06
6CAT
CAT 6
DOG3
3DOG
I would like to be able to find a function such that: If an entry is found and it contains "CAT" & "6" (no matter position), substitute cat6. If an entry is found and it contains "DOG" & "3" (no matter position) substitute dog3. So the outcome should be:
cat6 cat6 cat6 dog3 dog3
Can anybody help on this? Thank you very much, find myself a bit lost!
First remove blank spaces i.e. elements like "CAT 6" to "CAT6":
sp = gsub(" ", "", c("CAT06", "6CAT", "CAT 6", "DOG3", "3DOG"))
Then use some regex magic to find any combination of "CAT", "0", "6" and replace these matches with "cat6" as follows:
sp = gsub("^(?:CAT|0|6)*$", "cat6", sp)
Same here with DOG case:
sp = gsub("^(?:DOG|0|3)*$", "dog3", sp)
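The patterns above require the whole entry to consist of those tokens. If the condition is literally "contains CAT and contains 6, in any position", one possible sketch combines two grepl() calls per rule (the variable out is illustrative):

```r
sp <- c("CAT06", "6CAT", "CAT 6", "DOG3", "3DOG")

out <- sp
# Entries containing both "CAT" and "6", wherever they appear
out[grepl("CAT", sp, fixed = TRUE) & grepl("6", sp, fixed = TRUE)] <- "cat6"
# Entries containing both "DOG" and "3", wherever they appear
out[grepl("DOG", sp, fixed = TRUE) & grepl("3", sp, fixed = TRUE)] <- "dog3"
out
```

Entries that satisfy neither rule are left unchanged.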
The input shown in the question is ambiguous as per my comment under the question. We show how to calculate it depending on which of three assumptions was intended.
1) vector input with embedded spaces Remove the digits and spaces ("[0-9 ]") in the first gsub and remove the non-digits ("\\D") in the second gsub converting to numeric to avoid leading zeros and then paste together:
x1 <- c("CAT06", "6CAT", "CAT 6", "DOG3", "3DOG") # test input
paste0(gsub("[0-9 ]", "", x1), as.numeric(gsub("\\D", "", x1)))
## [1] "CAT6" "CAT6" "CAT6" "DOG3" "DOG3"
2) single string Form chars by removing all digits and scanning the result in. Then form nums by removing everything except digits and spaces and scanning the result. Finally paste these together.
x2 <- "CAT06 6CAT CAT 6 DOG3 3DOG" # test input
chars <- scan(textConnection(gsub("\\d", "", x2)), what = "", quiet = TRUE)
nums <- scan(textConnection(gsub("[^ 0-9]", "", x2)), , quiet = TRUE)
y <- paste0(chars, nums)
y
## [1] "CAT6" "CAT6" "CAT6" "DOG3" "DOG3"
or if a single output string is wanted add this:
paste(y, collapse = " ")
3) vector input without embedded spaces Reduce this to case (2) and then apply (2).
x3 <- c("CAT06", "6CAT", "CAT", "6", "DOG3", "3DOG") # test input
xx <- paste(x3, collapse = " ")
chars <- scan(textConnection(gsub("\\d", "", xx)), what = "", quiet = TRUE)
nums <- scan(textConnection(gsub("[^ 0-9]", "", xx)), , quiet = TRUE)
y <- paste0(chars, nums)
y
## [1] "CAT6" "CAT6" "CAT6" "DOG3" "DOG3"
Note that this actually works for all three inputs. That is, if we replace x3 with x1 or x2 it still works and, as with (2), if a single output string is wanted, add paste(y, collapse = " ").
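The outputs above are upper case, whereas the question asked for cat6 and dog3. Assuming the lower-case form matters, wrapping the result in tolower() is one way to get it, shown here for case (1):

```r
x1 <- c("CAT06", "6CAT", "CAT 6", "DOG3", "3DOG")  # test input from case (1)

# Letters with digits and spaces removed, plus the number without leading zeros
upper <- paste0(gsub("[0-9 ]", "", x1), as.numeric(gsub("\\D", "", x1)))

# Convert to the lower-case labels the question asked for
lower <- tolower(upper)
lower
```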

Resources