R: manipulating data.frames containing strings and booleans

I have a data.frame in R called p. Every element of the data.frame is either TRUE or FALSE. It has, say, m rows and n columns, and each row contains exactly one TRUE element.
It also has column names, which are strings. What I would like to do is the following:
1. Wherever p contains a TRUE, replace it with the name of the corresponding column.
2. Collapse the data.frame, which now contains FALSEs and column names, into a single vector with m elements.
I would like to do this in an R-thonic manner, so as to continue my enlightenment in R and contribute to a world without for-loops.
I can do step 1 using the following for loop:
for (i in seq(length(colnames(p)))) {
p[p[,i]==TRUE,i]=colnames(p)[i]
}
but there's no beauty here, and I have totally subscribed to this for-loops-in-R-are-probably-wrong mentality. Maybe wrong is too strong, but they're certainly not great.
I don't really know how to do step 2. I kind of hoped that the sum of a string and FALSE would return the string but it doesn't. I kind of hoped I could use an OR operator of some kind but can't quite figure that out (Python responds to False or 'bob' with 'bob'). Hence, yet again, I appeal to you beautiful Rstats people for help!

Here's some sample data:
df <- data.frame(a=c(FALSE, TRUE, FALSE), b=c(TRUE, FALSE, FALSE), c=c(FALSE, FALSE, TRUE))
You can use apply to do something like this:
names(df)[apply(df, 1, which)]
Or without apply by using which directly:
idx <- which(as.matrix(df), arr.ind=TRUE)
names(df)[idx[order(idx[,1]),"col"]]
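Both snippets above rely on each row having exactly one TRUE. A minimal sketch of the same idea that checks this assumption first; the helper name true_col and the use of max.col are illustrative choices, not part of the original answer:

```r
# Sample data from the answer above
df <- data.frame(a = c(FALSE, TRUE, FALSE),
                 b = c(TRUE, FALSE, FALSE),
                 c = c(FALSE, FALSE, TRUE))

# Hypothetical helper: name of the single TRUE column in each row
true_col <- function(d) {
  m <- as.matrix(d)
  # Fail early if the "exactly one TRUE per row" assumption does not hold
  stopifnot(is.logical(m), all(rowSums(m) == 1))
  # m * 1 converts the logical matrix to numeric; max.col then returns,
  # for each row, the index of the column holding the maximum (the TRUE)
  colnames(m)[max.col(m * 1, ties.method = "first")]
}

res <- true_col(df)
res
# [1] "b" "a" "c"
```

The result is already in row order, so no reordering step is needed, unlike the which(..., arr.ind=TRUE) variant, which returns indices column by column.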

Use apply to sweep your index through, and use that index to access the column names:
> df <- data.frame(a=c(TRUE,FALSE,FALSE),b=c(FALSE,FALSE,TRUE),
+ c=c(FALSE,TRUE,FALSE))
> df
      a     b     c
1  TRUE FALSE FALSE
2 FALSE FALSE  TRUE
3 FALSE  TRUE FALSE
> colnames(df)[apply(df, 1, which)]
[1] "a" "c" "b"

Related

Ignore or display NA in a row if the search word is not available in a list- R

How can I print or display "Not Available" (NA) when any of the keywords in my search list (Table_search) is not found in the input? The input has three lines, and I have three keywords to search for in those lines. If a keyword is present, print the matching line; otherwise print NA, as shown in the desired output.
My code just prints all the matching lines, but that doesn't help, as I also need to know where a keyword is missing.
Table_search <- list("Table 14", "Source Data:","VERSION")
Table_match_list <- sapply(Table_search, grep, x = tablelist, value = TRUE)
Input:
Table 14.1.1.1 (Page 1 of 2)
Source Data: Listing 16.2.1.1.1
Summary of Subject Status by Respiratory/Non-Ambulatory at Event Entry
Desired Output:
Table 14.1.1.1 (Page 1 of 2)
Source Data: Listing 16.2.1.1.1
NA
@r2evans
sapply(unlist(Table_search), grepl, x = dat)
I get a good output with this code actually, but instead of true or false I would like to print the actual data.
I think a single regex will do it:
replace(dat, !grepl(paste(unlist(Table_search), collapse="|"), dat), NA)
# [1] "Table 14.1.1.1 (Page 1 of 2)" "Source Data: Listing 16.2.1.1.1"
# [3] NA
One problem with using sapply(., grep) is that grep returns integer indices, and if no match is made then it returns a length-0 vector. For sapply (a class-unsafe function), this means that you may or may not get an integer vector in return. Each return may be length 0 (nothing found) or length 1 (something found), and when sapply finds that the return values are not all the same length, it returns a list instead (ergo my "class-unsafe" verbiage above).
This doesn't change when you use value=TRUE: change my reasoning above from "0 or 1 integer" to "0 or 1 character", and it's exactly the same problem.
Because of this, I suggest grepl: it should always return logical indicating found or not found.
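A small sketch of the difference described above, using the data given at the end of this answer. The vapply call is my own addition to show a type-stable alternative; it is not part of the original suggestion:

```r
Table_search <- list("Table 14", "Source Data:", "VERSION")
dat <- c("Table 14.1.1.1 (Page 1 of 2)",
         "Source Data: Listing 16.2.1.1.1",
         "Summary of Subject Status by Respiratory/Non-Ambulatory at Event Entry")

# grep(value = TRUE) returns 1, 1 and 0 matches for the three patterns,
# so sapply cannot simplify the result and hands back a list
hits <- sapply(unlist(Table_search), grep, x = dat, value = TRUE)
is.list(hits)
# [1] TRUE

# grepl plus any() always yields exactly one logical per pattern;
# vapply enforces that shape instead of guessing it
found <- vapply(unlist(Table_search),
                function(p) any(grepl(p, dat)),
                logical(1))
unname(found)
# [1]  TRUE  TRUE FALSE
```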
Further, since you don't appear to need to differentiate which of the patterns is found, just "at least one of them", then we can use a single regex, joined with the regex-OR operator |. This works with an arbitrary length of your Table_search list.
If you somehow needed to know which of the patterns was found, then you might want something like:
sapply(unlist(Table_search), grepl, x = dat)
#      Table 14 Source Data: VERSION
# [1,]     TRUE        FALSE   FALSE
# [2,]    FALSE         TRUE   FALSE
# [3,]    FALSE        FALSE   FALSE
and then figure out what to do with the different columns (each row indicates a string within the dat vector).
One way (that is doing the same as my first code suggestion, albeit less efficiently) is
rowSums(sapply(unlist(Table_search), grepl, x = dat)) > 0
# [1] TRUE TRUE FALSE
where the logical return value indicates whether something was found. If, for instance, you want to know whether two or more of the patterns were found, you might use rowSums(.) >= 2.
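The desired output in the question can also be read as one entry per keyword, with NA where a keyword matches nothing. A minimal sketch of that reading, assuming that only the first matching line per keyword is wanted and that the keywords are literal strings (hence fixed = TRUE); both are my assumptions, not something the question states:

```r
Table_search <- list("Table 14", "Source Data:", "VERSION")
dat <- c("Table 14.1.1.1 (Page 1 of 2)",
         "Source Data: Listing 16.2.1.1.1",
         "Summary of Subject Status by Respiratory/Non-Ambulatory at Event Entry")

# For each keyword, return the first matching line, or NA if there is none
first_match <- vapply(unlist(Table_search), function(p) {
  m <- grep(p, dat, value = TRUE, fixed = TRUE)
  if (length(m) > 0) m[1] else NA_character_
}, character(1))

unname(first_match)
# [1] "Table 14.1.1.1 (Page 1 of 2)"    "Source Data: Listing 16.2.1.1.1"
# [3] NA
```

The names of first_match are the keywords themselves, so it is easy to see which keyword is missing.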
Data
Table_search <- list("Table 14", "Source Data:","VERSION")
dat <- c("Table 14.1.1.1 (Page 1 of 2)", "Source Data: Listing 16.2.1.1.1", "Summary of Subject Status by Respiratory/Non-Ambulatory at Event Entry")

Compare 2 strings in R

I have data as below:
vec <- c("ABC|ADC|1","ABC|ADG|2")
I need to check if below substring is present or not
"ADC|DFG", it should return false for this as I need to match exact pattern.
"ABC|ADC|1|5" should return True as this is a child element for the first element in vector.
I tried using grepl but it returns true if I just pass ADC as well, any help is appreciated.
grepl returns TRUE because the pipe character | is special in a regex: a|b means "match a or b". All you need to do is escape it.
frtest <- c("ABC|ADC", "ABC|ADC|1|2", "ABC|ADG", "ABC|ADG|2|5")
# make the last number and its pipe optional
test <- gsub('(\\|\\d)$', '(\\1)?', frtest)
# escape all pipes: the replacement '\\\\|' yields one literal backslash followed by |
test <- gsub('\\|', '\\\\|', test)
# test whether each pattern matches any element of vec
res <- sapply(test, function(x) any(grepl(x, vec)))
# reassign the names so they're readable
names(res) <- frtest
res
#>     ABC|ADC ABC|ADC|1|2     ABC|ADG ABC|ADG|2|5
#>        TRUE        TRUE        TRUE        TRUE
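If the search strings contain no regex syntax that you actually want interpreted, an alternative to escaping by hand is the fixed = TRUE argument of grepl, which treats the pattern as a literal string. A short sketch with the vec from the question:

```r
vec <- c("ABC|ADC|1", "ABC|ADG|2")

# As a regex, "ADC|DFG" means "ADC or DFG", so the first element matches
as_regex <- grepl("ADC|DFG", vec)
as_regex
# [1]  TRUE FALSE

# With fixed = TRUE the pipe is an ordinary character, so nothing matches
as_literal <- grepl("ADC|DFG", vec, fixed = TRUE)
as_literal
# [1] FALSE FALSE

# A literal substring that really is present still matches
grepl("ABC|ADC", vec, fixed = TRUE)
# [1]  TRUE FALSE
```

Note that this only covers the "is this exact substring present" half of the question; the parent/child check still needs one of the approaches shown in the answers.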
For two vectors vec and test, this returns a vector which is TRUE if either the corresponding element of test is the start of one of the elements of vec, or one of the elements of vec is the start of the corresponding element of test.
vec <- c("ABC|ADC|1","ABC|ADG|2")
test <- c("ADC|DFG", "ABC|ADC|1|5", "ADC|1", "ABC|ADC")
colSums(sapply(test, startsWith, vec) | t(sapply(vec, startsWith, test))) > 0
#     ADC|DFG ABC|ADC|1|5       ADC|1     ABC|ADC
#       FALSE        TRUE       FALSE        TRUE

How to find a subset of names in another column?

I have a list of file names that look like this:
files$name <-c("RePEc.aad.ejbejj.v.1.y.2010.i.0.p.84.pdf", "RePEc.aad.ejbejj.v.12.y.2017.i.2.p.1117.pdf", "RePEc.aad.ejbejj.v.2.y.2011.i.0.p.17.20.pdf", "RePEc.aad.ejbejj.v.2.y.2011.i.0.p.60.62.pdf")
I have a much longer list of IDs, which is a column of a larger dataframe. Some of them correspond to the file names in files$name, but with different punctuation. The column looks like this:
df$repec_id <- c("RePEc:aad.ejbejj:v:1:y:2010:i:0:p:84", "RePEc:aad:ejbejj:v:12:y.2017:i:2:p:1117", "RePEc:aad:ejbejj:v:2:y:2011:i:0:p:17-20", "RePEc:aad:ejbejj:v:2:y:2011:i:0:p:60-62", "RePEc:aad.ejbejj:v:1:y:2010:i:0:p:99","RePEc:aad.ejbejj:v:1:y:2010:i:0:p:103")
I want to subset the list in df$repec_id so that I have only the strings that correspond to file names in files$name but they have different punctuation. In other words, I want an output that looks like this:
ID_subset <- c("RePEc:aad.ejbejj:v:1:y:2010:i:0:p:84", "RePEc:aad:ejbejj:v:12:y.2017:i:2:p:1117", "RePEc:aad:ejbejj:v:2:y:2011:i:0:p:17-20", "RePEc:aad:ejbejj:v:2:y:2011:i:0:p:60-62")
Initially, I thought that removing all the special characters from both lists and then comparing them would work. So I did this:
files$name <- str_replace_all(files$name, "\\.pdf", "")
files$name <- str_replace_all(files$name, "[[:punct:]]", "")
df$repec_id <- str_replace_all(df$repec_id, "[[:punct:]]", "")
subset <- df[trimws(df$repec_id) %in% trimws(files$name), ]
However, I need a way of preserving the original structure of the IDs in df$repec_id, because I need to provide a list of IDs from df$repec_id that are/are not in the subset. Does anyone have any suggestions? Thanks in advance for your help!
We can use
!gsub('[^[:alnum:]]+', '', df$repec_id) %in% gsub('\\.pdf$|[^[:alnum:]]', '',files$name)
#[1] FALSE FALSE FALSE FALSE TRUE TRUE
You can remove all punctuation from repec_id and name, then use %in% to find the strings that match.
gsub('[[:punct:]]', '', df$repec_id) %in%
gsub('\\.pdf$|[[:punct:]]', '',files$name)
#[1] TRUE TRUE TRUE TRUE FALSE FALSE
If you add a negation sign (!) in front, you get the strings that do not match.
!gsub('[[:punct:]]', '', df$repec_id) %in%
gsub('\\.pdf$|[[:punct:]]', '',files$name)
#[1] FALSE FALSE FALSE FALSE TRUE TRUE
The result has the same length as df$repec_id, so you can use it to subset rows of df.
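To keep the original IDs intact, the normalization can be applied only for the comparison, never assigned back. A minimal sketch using plain vectors in place of the data frame columns; the helper name norm is hypothetical:

```r
files_name <- c("RePEc.aad.ejbejj.v.1.y.2010.i.0.p.84.pdf",
                "RePEc.aad.ejbejj.v.12.y.2017.i.2.p.1117.pdf",
                "RePEc.aad.ejbejj.v.2.y.2011.i.0.p.17.20.pdf",
                "RePEc.aad.ejbejj.v.2.y.2011.i.0.p.60.62.pdf")
repec_id <- c("RePEc:aad.ejbejj:v:1:y:2010:i:0:p:84",
              "RePEc:aad:ejbejj:v:12:y.2017:i:2:p:1117",
              "RePEc:aad:ejbejj:v:2:y:2011:i:0:p:17-20",
              "RePEc:aad:ejbejj:v:2:y:2011:i:0:p:60-62",
              "RePEc:aad.ejbejj:v:1:y:2010:i:0:p:99",
              "RePEc:aad.ejbejj:v:1:y:2010:i:0:p:103")

# Hypothetical helper: drop a trailing ".pdf", then every non-alphanumeric character
norm <- function(x) gsub("[^[:alnum:]]", "", sub("\\.pdf$", "", x))

# Compare the normalized keys, but index the untouched original IDs
in_files  <- norm(repec_id) %in% norm(files_name)
matched   <- repec_id[in_files]   # IDs that have a file
unmatched <- repec_id[!in_files]  # IDs that do not

unmatched
# [1] "RePEc:aad.ejbejj:v:1:y:2010:i:0:p:99"  "RePEc:aad.ejbejj:v:1:y:2010:i:0:p:103"
```

With a data frame, the same logical vector can be used as a row index, for example df[in_files, ].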

How to grepl with two pattern objects in R

I have a vector called
vec <- c("16S_s95_S112_R2_101.fastq.gz",
"16S_s95_S112_R1_001.fastq.gz",
"16S_s94_S103_R2_021.fastq.gz",
"16S_s94_S103_R1_001.fastq.gz")
I want to grepl items with sample <- "_s95_" and R1 <- "R1".
I want to use sample and R1 objects while doing grepl and find something matching _s95_ and R1 strings both.
Result I want is 16S_s95_S112_R1_001.fastq.gz.
I tried grepl(pattern = sample&R1, x= vec) which did not work for me.
I can do this with multiple grepl's, but I am trying to find something neat to do this.
For your specific use case where you know the order of the patterns, it's almost certainly going to be faster to follow Jilber Urbina's suggestion to programmatically compose a single regex.
For a more general solution that works regardless of order and on any number of patterns, we can use sapply to loop across each pattern, and then use rowSums to count the number of pattern matches and find the rows where all of them match:
patterns = c("_s95_", 'R1')
sapply(patterns, function(x) grepl(x, vec))
     _s95_    R1
[1,]  TRUE FALSE
[2,]  TRUE  TRUE
[3,] FALSE FALSE
[4,] FALSE  TRUE
vec[which(rowSums(sapply(patterns, function(x) grepl(x, vec))) == length(patterns))]
[1] "16S_s95_S112_R1_001.fastq.gz"
You need to work a bit more on your pattern in order to get the match. Try:
> grep(paste0(".*", sample, ".*", R1), vec, value=TRUE)
[1] "16S_s95_S112_R1_001.fastq.gz"
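The pasted pattern above requires the sample string to appear before R1. If the order is not known, one possible variant is to build a lookahead for each pattern and match with perl = TRUE. This is a sketch under the assumption that the pattern strings contain no characters that need regex escaping:

```r
vec <- c("16S_s95_S112_R2_101.fastq.gz",
         "16S_s95_S112_R1_001.fastq.gz",
         "16S_s94_S103_R2_021.fastq.gz",
         "16S_s94_S103_R1_001.fastq.gz")
sample <- "_s95_"
R1 <- "R1"

# One lookahead per pattern: "(?=.*_s95_)(?=.*R1)"
# Each lookahead must succeed from the start of the string, in any order
pattern <- paste0("(?=.*", c(sample, R1), ")", collapse = "")

res <- vec[grepl(pattern, vec, perl = TRUE)]
res
# [1] "16S_s95_S112_R1_001.fastq.gz"
```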

R: Checking if multiple elements of a vector appear in vector of strings

I'm trying to create a function that checks if all elements of a vector appear in a vector of strings. The test code is presented below:
test_values = c("Alice", "Bob")
test_list = c("Alice,Chris,Mark", "Alice,Bob,Chris", "Alice,Mark,Zach", "Alice,Bob,Mark", "Mark,Bob,Zach", "Alice,Chris,Bob", "Mark,Chris,Zach")
I would like the output for this to be FALSE TRUE FALSE TRUE FALSE TRUE FALSE.
I first thought I'd be able to switch the | to & in the command grepl(paste(test_values, collapse='|'), test_list) to get when Alice and Bob are both in the string instead of when either of them appear, but I was unable to get the correct answer.
I also would rather not use the command: grepl(test_values[1], test_list) & grepl(test_values[2], test_list) because the test_values vector will change dynamically (varying from length 0 to 3), so I'm looking for something to take that into account.
We can use Reduce with grepl
Reduce(`&`, lapply(test_values, grepl, test_list))
#[1] FALSE TRUE FALSE TRUE FALSE TRUE FALSE
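Since the question mentions that test_values can have length 0, note that Reduce over an empty list returns NULL. A minimal sketch that supplies an initial value so the result always has one element per string; the wrapper name all_present and the choice of fixed = TRUE (literal matching) are my assumptions:

```r
test_list <- c("Alice,Chris,Mark", "Alice,Bob,Chris", "Alice,Mark,Zach",
               "Alice,Bob,Mark", "Mark,Bob,Zach", "Alice,Chris,Bob",
               "Mark,Chris,Zach")

# Hypothetical wrapper: TRUE where every value occurs in the string
all_present <- function(values, strings) {
  # Starting from all-TRUE means an empty `values` vector matches everything
  Reduce(`&`,
         lapply(values, grepl, strings, fixed = TRUE),
         rep(TRUE, length(strings)))
}

all_present(c("Alice", "Bob"), test_list)
# [1] FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE

all_present(character(0), test_list)
# [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE
```

This is still substring matching, so "Bob" would also match a name such as "Bobby"; whether that matters depends on the data.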
