Searching strings to ignore multiple matches - r

I have a dataframe with columns names that look like this:
d <- c("Q.40a-some Text", "Q.40b-some Text", "Q.44a-some Text", "Q.44b-some Text", "Q.44c-some Text", "Q.44d-some Text", "Q.4a-some Text", "Q.4b-some Text")
I would like to identify the columns which begin with Q.4 while ignoring Q.40 and Q.44.
Identifying Q.44 or Q.40, for example, is easy: I use "^Q.44" or "^Q.40" as input to my function. But the same approach does not work for Q.4, simply because all the names begin with Q.4. So, can someone help me with this?
UPDATE
I want to pass the result to my function, which takes inputs as follows:
multichoice <- function(data, question.prefix) {
  index <- grep(question.prefix, names(data))  # identifies the columns belonging to the question
  cases <- length(index)                       # the number of possible options / columns

  # Identify the range of possible answers for each question:
  # Step 1. search for the min in each column and take the min across columns
  # Step 2. search for the max in each column and take the max across columns
  mn <- min(data[, index[1:cases]], na.rm = TRUE)
  mx <- max(data[, index[1:cases]], na.rm = TRUE)

  d <- colSums(data[, index] != 0, na.rm = TRUE)  # number of non-zero elements in each column
  vec <- matrix(NA, nrow = length(mn:mx), ncol = cases)

  for (j in 1:cases) {
    for (i in mn:mx) {
      # relative frequency of answer i for option j
      vec[i, j] <- sum(data[, index[j]] == i, na.rm = TRUE) / d[j]
    }
  }

  vec1 <- as.data.frame(vec)
  names(vec1) <- names(data[index])
  vec1 <- t(vec1)
  return(vec1)
}
And this is how I use my function:
q4 <-multichoice(df2,"^Q.4")
Where "^Q.4" is intended to identify the Q.4 columns, and df2 is my dataframe.

Here is a method using grep:
To return the indices:
grep("^Q\\.4[^0-9]", d)
Or the matching column names themselves:
grep("^Q\\.4[^0-9]", d, value=T)
This works because [^0-9] matches any character that is not a digit: the pattern matches Q.4 literally (the dot is escaped), then requires the next character to be a non-digit, which rules out Q.40 and Q.44.
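For the d vector defined above, the value form should return the two Q.4 columns only:
#[1] "Q.4a-some Text" "Q.4b-some Text"
and the same pattern can be passed straight to the function from the question:
q4 <- multichoice(df2, "^Q\\.4[^0-9]")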
I believe what you want in the mn statement in your function is
mn <- min(sapply(data[,index], min, na.rm=T), na.rm=T)
sapply moves through the columns selected by the index from grep and finds the minimum of each one with min. The outer min is then applied across those column minimums.
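Presumably the mx line gets the analogous change (my assumption, mirroring the fix above):
mx <- max(sapply(data[,index], max, na.rm=T), na.rm=T)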

We can use stringr,
library(stringr)
str_extract(d, 'Q.[0-9]+') == 'Q.4'
#[1] FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE
#or
d[str_extract(d, 'Q.[0-9]+') == 'Q.4']
#[1] "Q.4a-some Text" "Q.4b-some Text"
If the format is always the same (i.e. Q.[0-9]...), then we can use gsub:
gsub('\\D', '', d) == 4
#[1] FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE
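If you need this as a reusable selector, one option is a small helper (a sketch of my own, not from the answers above; the negative lookahead requires perl = TRUE):
# Hypothetical helper: return the names that belong to a given question number
select_question <- function(nms, qnum) {
  grep(paste0("^Q\\.", qnum, "(?![0-9])"), nms, perl = TRUE, value = TRUE)
}
select_question(d, 4)
#[1] "Q.4a-some Text" "Q.4b-some Text"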

Related

ifelse function on a vector

I am using the ifelse function in order to obtain either a vector of NAs, if all the values of the vector are NA, or a vector of all the values not equal to "NA_NA". In my example, I would like to obtain this result:
[1] "14_mter" "78_ONHY"
but I am obtaining this
[1] "14_mter"
my example:
vect=c("NA_NA", "14_mter", "78_ONHY")
out = ifelse(all(is.na(vect)), vect, vect[which(vect != "NA_NA")])
What is wrong with this function?
ifelse is vectorized and its result is as long as the test argument. all(is.na(vect)) is always length one, hence the result. A regular if/else clause is fine here.
vect <- c("NA_NA", "14_mter", "78_ONHY")
if (all(is.na(vect))) {
  out <- vect
} else {
  out <- vect[vect != "NA_NA"]
}
out
#> [1] "14_mter" "78_ONHY"
Additional note: there is no need for which() here.
The ifelse help file, referring to its three arguments test, yes and no, says:
ifelse returns a value with the same shape as test which is filled
with elements selected from either yes or no depending on whether the
element of test is TRUE or FALSE.
So if the test has a length of 1, which is the case for the code in the question, then the result will also have length 1. Instead, try one of these.
1) Use if instead of ifelse. if returns the value of the chosen leg so just assign that to out.
out <- if (all(is.na(vect))) vect else vect[which(vect != "NA_NA")]
2) The collapse package has an allNA function so a variation on (1) is:
library(collapse)
out <- if (allNA(vect)) vect else vect[which(vect != "NA_NA")]
3) Although not recommended, if you really wanted to use ifelse it could be done by wrapping each leg in list(...) so that the condition and the two legs all have the same length, i.e. 1.
out <- ifelse(all(is.na(vect)), list(vect), list(vect[which(vect != "NA_NA")])) |>
unlist()
If the NA value is always the string NA_NA, this works:
grep("NA_NA", vect, value = TRUE, invert = TRUE)
[1] "14_mter" "78_ONHY"
While the pattern matches the NA_NA value, the invert = TRUE argument negates the match(es) and produces the unmatched values.
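An equivalent with grepl, using a logical index instead of invert = TRUE:
vect[!grepl("NA_NA", vect)]
[1] "14_mter" "78_ONHY"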
Data:
vect=c("NA_NA", "14_mter", "78_ONHY")

How to define a certain number of nested for-loops (based on input length in R Shiny app)?

Here is the context: I am working on an R Shiny web application. The user uploads a dataframe and then selects a certain number n of columns with a selectInput. The number of columns selected can vary from one to six.
Based on this number of columns, I would like to generate the appropriate number of nested for-loops automatically. At the moment, I use if() conditions that test each possible number of selected columns.
I want to loop over each unique value of each selected column. That makes my code very long:
my_columns = input$colnames  # The user selects column names
if (length(my_columns) == 1) {
  for (var1 in unique(mydataframe[, my_columns[1]])) {
    ...
  }
}
if (length(my_columns) == 2) {
  for (var1 in unique(mydataframe[, my_columns[1]])) {
    for (var2 in unique(mydataframe[, my_columns[2]])) {
      ...
    }
  }
}
if (length(my_columns) == 3) {
  for (var1 in unique(mydataframe[, my_columns[1]])) {
    for (var2 in unique(mydataframe[, my_columns[2]])) {
      for (var3 in unique(mydataframe[, my_columns[3]])) {
        ...
      }
    }
  }
}
and so on ...
Is there a solution to avoid this?
Thank you
Correct me if I am mistaken, but you seem to compute something that needs to cover all possible value combinations of the selected columns.
R does not need nested for-loops for this case:
my_columns <- data.frame(
  "A" = c(1, 2, 3),
  "B" = c(11, 12, 13),
  "C" = c(21, 22, 23))
# find all unique values per column
list_uniques <- lapply(seq_along(my_columns),
                       function(x) unique(my_columns[[x]]))
# find out all possible combinations of the given values
# the output is a dataframe
all_combinations <- expand.grid(list_uniques)
# Now you can iterate over the frame and do something with them
# example rowsums
rowSums(all_combinations) # vectorized functions like this are faster
# example custom function
apply(all_combinations,
      MARGIN = 1,  # iterate rowwise
      # you can now use your own function;
      # the input i is a row as a named vector
      FUN = function(i) paste(i, collapse = " and "))
# This function will output:
# "1 and 11 and 21" "2 and 11 and 21" ....

How can I extract matched part of multiple strings?

I have multiple strings, and I want to extract the part that matches.
In practice my strings are directories, and I need to choose where to write a file: the location that is shared by all the strings. For example, if you have a vector with three strings:
data.dir <- c("C:\\data\\files\\subset1\\", "C:\\data\\files\\subset3\\", "C:\\data\\files\\subset3\\")
...the part that matches in all strings is "C:\data\files\". How can I extract this?
strsplit the paths and intersect the overlapping parts recursively using Reduce. You can then piece the result back together with paste.
paste(Reduce(intersect, strsplit(data.dir, "\\\\")), collapse="\\")
#[1] "C:\\data\\files"
As #g-grothendieck notes, this will fail in certain circumstances, because intersect keeps components that appear in both paths even after they have diverged, for example:
data.dir <- c("C:\\a\\b\\c\\", "C:\\a\\X\\c\\")
An ugly hack might be something like:
tail(
  Reduce(
    intersect,
    lapply(strsplit(data.dir, "\\\\"),
           function(x) sapply(1:length(x), function(y) paste(x[1:y], collapse = "\\")))
  ),
  1)
...which will deal with either case.
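For example, with the problematic input above this should return the true common prefix rather than a path stitched together from non-contiguous parts:
data.dir <- c("C:\\a\\b\\c\\", "C:\\a\\X\\c\\")
# plain Reduce/intersect would give "C:\\a\\c" here; the cumulative-prefix version above returns:
#[1] "C:\\a"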
Alternatively, use dirname if you only ever have one extra directory level:
unique(dirname(data.dir))
#[1] "C:/data/files"
g contains the character positions of the successive backslashes in data.dir[1]. From this, create a logical vector ok whose ith element is TRUE if the first g[i] characters of all elements in data.dir are the same, i.e. all elements of substr(data.dir, 1, g[i]) are identical. If ok[1] is TRUE, then there is a non-zero-length common prefix given by the first g[k] characters of data.dir[1], where k (which equals rle(ok)$lengths[1]) is the number of leading TRUE values in ok; otherwise there is no common prefix, so return "".
g <- gregexpr("\\", data.dir[1], fixed = TRUE)[[1]]
ok <- sapply(g, function(i) all(substr(data.dir[1], 1, i) == substr(data.dir, 1, i)))
if (ok[1]) substr(data.dir[1], 1, g[rle(ok)$lengths[1]]) else ""
For data.dir defined in the question the last line gives:
[1] "C:\\data\\files\\"

R - How to determine if every value in column of dataframe is zero?

I have a dataframe and want to determine for a given column if every value in the column is equal to zero.
This is the code I have:
z <- read.zoo(sub, sep = ",", header = TRUE, index = 1:2, tz = "", format = "%Y-%m-%d %H:%M:%S")
if(all.equal(z$C_duration, 0))
C_dur_acf = NA
But I am getting an error:
Error in if (all.equal(z$C_duration, 0)) { :
argument is not interpretable as logical
The code should return a boolean value (TRUE/FALSE) indicating whether the entire column is all zeros.
Use the all builtin: all(z$C_duration == 0)
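A quick illustration on a toy vector (my own example); note that missing values propagate unless na.rm = TRUE is used:
x <- c(0, 0, 0, NA)
all(x == 0)               # NA because of the missing value
all(x == 0, na.rm = TRUE) # TRUE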
Here is an example using the iris dataset built into R and the apply family in combination with all, which lets you test whether all elements of the object you pass to it satisfy one or more logical conditions.
Note that in this case the "object" is a column of the data frame; the code with lapply does the same for every column.
lapply(iris[-5], function(x) all(x == 0))
$Sepal.Length
[1] FALSE
$Sepal.Width
[1] FALSE
$Petal.Length
[1] FALSE
$Petal.Width
[1] FALSE
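sapply gives the same information as a named logical vector instead of a list (a minor variant, not in the original answer):
sapply(iris[-5], function(x) all(x == 0))
# Sepal.Length  Sepal.Width Petal.Length  Petal.Width
#        FALSE        FALSE        FALSE        FALSE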
To use all.equal (wrapped in isTRUE so that a non-TRUE result does not break the if):
if (isTRUE(all.equal(z$C_duration, rep(0, length(z$C_duration))))) {
  C_dur_acf = NA
}
In essence all.equal does a pair-wise test against a vector of zeros of the same length. The original if statement was failing because all.equal(z$C_duration, 0) returns the string "Numeric: lengths (##, 1) differ", which is not interpretable as a logical.
HTH!

How to delete row n and n+1 in a dataframe?

I have several dataframes in which I want to delete each row that matches a certain string. I used the following code to do it:
df[!(regexpr("abc", df$V4) ==1),]
How can I also delete the row that follows, e.g. if I delete row n as specified by the code above, how can I additionally delete row n+1?
My first try was to simply find out the indices of the desired rows, but that won't work, as I need to delete rows in different dataframes which are of different lengths. So the indices vary.
Thanks!
I would suggest pulling out the logical vector and manipulating it directly. Suppose we have the vector:
x = c(5,0,1, 4, 3)
and we want to do:
x[x > 3]
First, note that:
R> (s_n = x>3)
[1] TRUE FALSE FALSE TRUE FALSE
So
R> (s_n1 = as.logical(s_n + c(F, s_n[1:(length(s_n)-1)])))
[1] TRUE TRUE FALSE TRUE TRUE
Hence,
x[s_n1]
gives you what you want.
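With the toy data above, this should keep every element greater than 3 together with the element that follows it:
R> x[s_n1]
[1] 5 0 4 3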
In your particular example, something like:
s_n = regexpr("abc", df$V4) == 1
s_n1 = as.logical(s_n + c(F, s_n[1:(length(s_n)-1)]))
df[!s_n1, ]
should work: s_n marks the matching rows, s_n1 additionally marks the row after each match, and the negation drops both.
Use which() on your logical expression and then you can just add 1 to the result.
sel <- which(grepl("abc", df$V4))
sel <- c(sel, sel+1)
df[-sel,]
df[!(regexpr("abc", df$V4) == 1 | c(FALSE, head(regexpr("abc", df$V4) == 1, -1))), ]
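A quick check on hypothetical toy data (my own example): rows 2, 3 and 5 match or follow a match and are dropped.
df <- data.frame(V4 = c("xyz", "abc1", "xyz", "xyz", "abc2"))
df[!(regexpr("abc", df$V4) == 1 | c(FALSE, head(regexpr("abc", df$V4) == 1, -1))), ]
#    V4
# 1 xyz
# 4 xyz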
