Concatenating R strings with a separator - except the first time - r

I have a process that does a few checks to a data frame, and at each check if the check passes I want to add a text to a column with a separator. So suppose after the first test rows 2 and 3 pass, so the msg column has "first" in it. Then the second test updated the ok column and is true for rows 1 and 2, giving the following:
> d = data.frame(ok=c(TRUE,TRUE,FALSE,FALSE), msg=c("", "first","first",""))
> d
ok msg
1 TRUE
2 TRUE first
3 FALSE first
4 FALSE
so the next step would be to add "second" to the msg column in rows 1 and 2 only, resulting in:
ok msg
1 TRUE second
2 TRUE first;second
3 FALSE first
4 FALSE
I can't work out how to do it. This first effort leaves a leading separator in the initial case:
> paste(d$msg[d$ok],"second", sep=";")
[1] ";second" "first;second"
This returns a length-3 vector which is clearly wrong:
> paste(c(d$msg[d$ok],"second"), sep=";")
[1] "" "first" "second"
and anything with collapse returns a length-1 vector which is also wrong.
Sledgehammer solution is to use the first effort above and then strip any leading separators at the end, but that's ugly. I'm hoping for something neater.
Solutions should only use base R functions, and the initial "empty string" doesn't have to be "" - but I've played with NA and got nowhere. A solution neater (in my opinion) than my sledgehammer will get accepted.

Using your dataframe d, we can use a base ifelse function to get around your problems with separators:
d$msg <- as.character(d$msg)
d$msg[d$ok] <- ifelse(d$msg[d$ok] == "", "second", paste(d$msg[d$ok], "second", sep=";"))
Output:
ok msg
1 TRUE second
2 TRUE first;second
3 FALSE first
4 FALSE

You can write a function which tests the cases of an empty string:
strAppend <- function(a, b, sep=";") {
paste0(a, c("", sep)[1+nchar(a)>0], b)
# paste0(a, c(sep,"")[1+(a=="")], b) #Alternative
}
strAppend(d$msg[d$ok], "second")
#[1] "second" "first;second"

Related

Ignore or display NA in a row if the search word is not available in a list- R

How to print or display Not Available if any of my search list in (Table_search) is not available in the list I input. In the input I have three lines and I have 3 keywords to search through these lines and tell me if the keyword is present in those lines or not. If present print that line else print Not available like I showed in the desired output.
My code just prints all the available lines but that doesn't help as I need to know where is the word is missing as well.
Table_search <- list("Table 14", "Source Data:","VERSION")
Table_match_list <- sapply(Table_search, grep, x = tablelist, value = TRUE)
Input:
Table 14.1.1.1 (Page 1 of 2)
Source Data: Listing 16.2.1.1.1
Summary of Subject Status by Respiratory/Non-Ambulatory at Event Entry
Desired Output:
Table 14.1.1.1 (Page 1 of 2)
Source Data: Listing 16.2.1.1.1
NA
#r2evans
sapply(unlist(Table_search), grepl, x = dat)
I get a good output with this code actually, but instead of true or false I would like to print the actual data.
I think a single regex will do it:
replace(dat, !grepl(paste(unlist(Table_search), collapse="|"), dat), NA)
# [1] "Table 14.1.1.1 (Page 1 of 2)" "Source Data: Listing 16.2.1.1.1"
# [3] NA
One problem with using sapply(., grep) is that grep returns integer indices, and if no match is made then it returns a length-0 vector. For sapply (a class-unsafe function), this means that you may or may not get a integer vector in return. Each return may be length 0 (nothing found) or length 1 (something found), and when sapply finds that each return value is not exactly the same length, it returns a list instead (ergo my "class-unsafe" verbiage above).
This doesn't change when you use value=TRUE: change my reasoning above about "0 or 1 logical" into "0 or 1 character", and it's the same exact problem.
Because of this, I suggest grepl: it should always return logical indicating found or not found.
Further, since you don't appear to need to differentiate which of the patterns is found, just "at least one of them", then we can use a single regex, joined with the regex-OR operator |. This works with an arbitrary length of your Table_search list.
If you somehow needed to know which of the patterns was found, then you might want something like:
sapply(unlist(Table_search), grepl, x = dat)
# Table 14 Source Data: VERSION
# [1,] TRUE FALSE FALSE
# [2,] FALSE TRUE FALSE
# [3,] FALSE FALSE FALSE
and then figure out what to do with the different columns (each row indicates a string within the dat vector).
One way (that is doing the same as my first code suggestion, albeit less efficiently) is
rowSums(sapply(unlist(Table_search), grepl, x = dat)) > 0
# [1] TRUE TRUE FALSE
where the logical return value indicates if something was found. If, for instance, you want to know if two or more of the patterns were found, one might use rowSums(.) >= 2).
Data
Table_search <- list("Table 14", "Source Data:","VERSION")
dat <- c("Table 14.1.1.1 (Page 1 of 2)", "Source Data: Listing 16.2.1.1.1", "Summary of Subject Status by Respiratory/Non-Ambulatory at Event Entry")

Find best match for multiple substrings across multiple candidates

I have the following sample data:
targets <- c("der", "das")
candidates <- c("sdassder", "sderf", "fongs")
Desired Output:
I would like to find sdassder as the Output since it includes the most Matches for targets (as substrings).
What i tried:
x <- sapply(targets, function(target) sapply(candidates, grep, pattern = target)) > 0
which.max(rowSums(x))
Goal:
As you can see, i found some dirty Code that technically yields the result, but i dont feel its a best practise.I hope this Question fits here otherwise i move to Code review.
I tried mapply, do.call, outer, but didnt manage to find a better Code.
Edit:
Adding another Option myself, after seeing the current answers.
Using pipes:
sapply(targets, grepl, candidates) %>% rowSums %>% which.max %>% candidates[.]
You can simplify it a little, I think.
matches <- sapply(targets, grepl, candidates)
matches
# der das
# [1,] TRUE TRUE
# [2,] TRUE FALSE
# [3,] FALSE FALSE
And find the number of matches using rowSums:
rowSums(matches)
# [1] 2 1 0
candidates[ which.max(rowSums(matches)) ]
# [1] "sdassder"
(Note that this last part does not really inform about ties.)
If you want to see the individual matches per-candidate, you can always apply the names manually, though this is only an aesthetic thing, adding very little to the work itself.
rownames(matches) <- candidates
matches
# der das
# sdassder TRUE TRUE
# sderf TRUE FALSE
# fongs FALSE FALSE
rowSums(matches)
# sdassder sderf fongs
# 2 1 0
which.max(rowSums(matches))
# sdassder
# 1 <------ this "1" indicates the index within the rowSums vector
names(which.max(rowSums(matches)))
# [1] "sdassder"
One stringr option could be:
candidates[which.max(rowSums(outer(candidates, targets, str_detect)))]
[1] "sdassder"
We could paste the targets together and create a pattern to match.
library(stringr)
str_c(targets, collapse = "|")
#[1] "der|das"
Use it in str_count to count the number of times pattern was matched.
str_count(candidates, str_c(targets, collapse = "|"))
#[1] 2 1 0
Get the index of maximum value and subset it from original candidates
candidates[which.max(str_count(candidates, str_c(targets, collapse = "|")))]
#[1] "sdassder"

How can I check if multiple strings exist in another string?

I have this string:
myStr <- "I am very beautiful btw"
str <- c("very","beauti","bt")
Now I want to check whether myStr includes all strings in str, how can I do this in R? For example above it should be TRUE.
Many Thanks
Yes, you can use grepl (not grep, actually), but you must run it once for each substring:
> sapply(str, grepl, myStr)
very beauti bt
TRUE TRUE TRUE
To get only one result if all of them are true, use all:
> all(sapply(str, grepl, myStr))
[1] TRUE
Edit:
In case you have more than one string to check, say:
myStrings <- c("I am very beautiful btw", "I am not beautiful btw")
You then run the sapply code, which will return a matrix with one row for each string in myStrings. Apply all on each row:
> apply(sapply(str, grepl, myStrings), 1, all)
[1] TRUE FALSE
Using stringr you could do:
str_detect(myStr, str)
Which returns a result for each substring:
#[1] TRUE TRUE TRUE
Or as per #thelatemail suggestion, if you want to know if all of them are true:
all(str_detect(myStr,str))
Which gives:
#[1] TRUE
You could also find the location (start, end) of every character in myStr that matches str
str_locate(myStr, str)
Which gives:
# start end
#[1,] 6 9
#[2,] 11 16
#[3,] 21 22

Grep in R using OR and NOT

I have the following vector in R and I would like to find all the strings containing A's and B's but not the number 2.
vec1<-c("A_cont_1", "A_cont_12", "B_treat_8", "AB_cont_22", "cont_21_Aa")
The following does not work:
grep("A|B|!2", vec1)
It gives me back all the strings:
[1] 1 2 3 4 5
The same is true for this example:
grep("A|B|-2", vec1)
What would be the correct syntax?
You can do this with a fairly simple regular expression:
grep("^[^2]*[AB][^2]*$", vec1)
In words, it means:
^ match the start of the string
[^2]* match anything except "2", zero or more times
[AB] match "A" or "B"
[^2]* match anything except "2", zero or more times
$ match the end of the string
I would use two grep calls:
intersect(grep("A|B",vec1),grep("2",vec1,invert=TRUE))
#[1] 1 3
OP, your attempt is pretty close, try this:
grep('^(A|B|[^2])*$', vec1)
grep generally does not work very well for doing a positive and a negative search in one invocation. You might be able to make it work with a complex regular expression, but you might be better off just doing:
grep '[AB]' somefile.txt | grep -v '2'
The R equivalent of that would be:
grep("2", grep("A|B", vec1, value = T), invert = T)
I extended the answer provided by #eddi. I have tested it in R and it works for me. I changed the last variable in your example since they all contained A|B.
# Create the vector from the OP with one change
vec1<-c("A_cont_1", "A_cont_12", "B_treat_8", "AB_cont_22", "cont_21_dd")
I then ran the following code. It will tell you which results you should expect from each section of grep.
First, tell me which columns contain A or B
> grepl("A|B", vec1)
[1] TRUE TRUE TRUE TRUE FALSE
Now tell me which columns contain a "2"
> grepl("2", vec1)
[1] FALSE TRUE FALSE TRUE TRUE
The index we want is 2,4
> grep("2", grep("A|B", vec1, value = T))
[1] 2 4
Done!

R: manipulating data.frames containing strings and booleans

I have a data.frame in R; it's called p. Each element in the data.frame is either True or False. My variable p has, say, m rows and n columns. For every row there is strictly only one TRUE element.
It also has column names, which are strings. What I would like to do is the following:
For every row in p I see a TRUE I would like to replace with the name of the corresponding column
I would then like to collapse the data.frame, which now contains FALSEs and column names, to a single vector, which will have m elements.
I would like to do this in an R-thonic manner, so as to continue my enlightenment in R and contribute to a world without for-loops.
I can do step 1 using the following for loop:
for (i in seq(length(colnames(p)))) {
p[p[,i]==TRUE,i]=colnames(p)[i]
}
but theres's no beauty here and I have totally subscribed to this for-loops-in-R-are-probably-wrong mentality. Maybe wrong is too strong but they're certainly not great.
I don't really know how to do step 2. I kind of hoped that the sum of a string and FALSE would return the string but it doesn't. I kind of hoped I could use an OR operator of some kind but can't quite figure that out (Python responds to False or 'bob' with 'bob'). Hence, yet again, I appeal to you beautiful Rstats people for help!
Here's some sample data:
df <- data.frame(a=c(FALSE, TRUE, FALSE), b=c(TRUE, FALSE, FALSE), c=c(FALSE, FALSE, TRUE))
You can use apply to do something like this:
names(df)[apply(df, 1, which)]
Or without apply by using which directly:
idx <- which(as.matrix(df), arr.ind=T)
names(df)[idx[order(idx[,1]),"col"]]
Use apply to sweep your index through, and use that index to access the column names:
> df <- data.frame(a=c(TRUE,FALSE,FALSE),b=c(FALSE,FALSE,TRUE),
+ c=c(FALSE,TRUE,FALSE))
> df
a b c
1 TRUE FALSE FALSE
2 FALSE FALSE TRUE
3 FALSE TRUE FALSE
> colnames(df)[apply(df, 1, which)]
[1] "a" "c" "b"
>

Resources