Data table subsetting in r by concatenating string variables - r

I have a data table that I am trying to subset by building a vector of column names from string vectors in the j argument of the data table, but I'm running into difficulty.
I have a character vector called foos (for this example, foos <- c('FOO0','FOO1','FOO2')) and a second vector that I created with c(). I tried to subset my data table with dt[,paste0(foos, c('VAR0','VAR1','VAR2'))], but that didn't work as expected. Printing what paste0(foos, c('VAR0','VAR1','VAR2')) returns gives
[1] "FOO0VAR0" "FOO1VAR1" "FOO2VAR2"
So it seems paste0 concatenates the two vectors element by element instead of appending one vector to the other (which surprised me a bit; I'd have expected to need lapply to paste over the elements of a vector). Swapping the order of c() and paste0 didn't help. I also tried
dt[,c(foos,c('VAR0','VAR1','VAR2'))] but that doesn't work either.
Is there a way to subset a data table in R by a concatenation of two string vectors built in the j argument?
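A minimal base R sketch of the difference between the three ways of combining the vectors. The names vars, cols and the toy data frame df are assumptions made up for illustration. The data.table lines are left as comments because they need the data.table package; as far as I know, a character vector held in a variable is evaluated as an ordinary expression in j unless with = FALSE or the .. prefix is used.

```r
foos <- c("FOO0", "FOO1", "FOO2")
vars <- c("VAR0", "VAR1", "VAR2")   # hypothetical name for the second vector

# paste0() is vectorised: it pairs elements position by position.
pasted <- paste0(foos, vars)

# c() appends one vector to the other, giving six separate names.
cols <- c(foos, vars)

# outer() builds every pairwise combination, if that is what is wanted.
all_pairs <- as.vector(outer(foos, vars, paste0))

# Hypothetical data with one column per name in `cols`.
df <- as.data.frame(matrix(1:12, nrow = 2, dimnames = list(NULL, cols)))
subset_df <- df[, cols[c(1, 4)], drop = FALSE]

# With a data.table `dt` (not run here), select by a character vector with:
#   dt[, cols, with = FALSE]
#   dt[, ..cols]
```

Here pasted has three elements, cols has six and all_pairs has nine, so the choice depends on which set of column names actually exists in the table.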

Related

How can I create a vector of subsets in Pari/GP?

I want to produce a vector containing all k-element subsets of a second vector. I know that I can do this by applying vecextract with each k-element subset of the natural numbers 1...n to my original vector.
How can I create that vector of subsets of natural numbers, though? I can see that the command forsubset does nearly what I want, but it's an imperative command, not one which creates a vector. So I could use the following function to print a list of vectors:
f(n,k)=forsubset([n,k],s,print(s)) but I can only capture them by adding each subset to a List() and then converting the List() to a Vec(). This seems clumsy. Is there a better way to do this, perhaps a totally different one?

improving specific code efficiency - *base R* alternative to for() loop solution

Looking for a vectorized base R solution for my own edification. I'm assigning a value to a column in a data frame based on a value in another column in the data frame.
My solution creates a named vector of possible codes, looks up the code in the original column, subsets the named list by the value found, and assigns the resulting name to the new column. I'm sure there's a way to do exactly this using the named vector I created that doesn't need a for loop; is it some version of apply?
dplyr is great and useful and I'm not looking for a solution that uses it.
# reference vector for assigning more readable text to this table
tempAssessmentCodes <- setNames(c(600,301,302,601,303,304,602,305,306,603,307,308,604,309,310,605,311,312,606,699),
c("base","3m","6m","6m","9m","12m","12m","15m","18m","18m","21m","24m","24m","27m","30m","30m",
"33m","36m","36m","disch"))
for(i in 1:nrow(rawDisp)){
rawDisp$assessText[i] <- names(tempAssessmentCodes)[tempAssessmentCodes==rawDisp$assessment[i]]
}
The standard way is to use match():
rawDisp$assessText <- names(tempAssessmentCodes)[match(rawDisp$assessment, tempAssessmentCodes)]
For each element of x, match(x, y) returns the index of its first match in y (or NA if there is none). We then use those indices to pick the corresponding names of y.
Personally, I do it the opposite way: build a vector whose names are the old codes and whose values are the new labels:
codes <- setNames(names(tempAssessmentCodes), tempAssessmentCodes)
Then simply select elements from the new codes using the names (old codes):
rawDisp$assessText <- codes[as.character(rawDisp$assessment)]
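To see what match() returns, here is a small sketch on made-up data. The shortened lookup vector codes_to_labels and the five assessment values are assumptions for illustration, not the asker's real table; 999 is included to show what happens to a code with no entry.

```r
# Hypothetical lookup vector: values are codes, names are labels.
codes_to_labels <- setNames(c(600, 301, 302, 699),
                            c("base", "3m", "6m", "disch"))

# Hypothetical data; 999 has no entry in the lookup.
rawDisp <- data.frame(assessment = c(302, 600, 699, 301, 999))

# match(x, table): for each element of x, the position of its first
# match in table, or NA when nothing matches.
idx <- match(rawDisp$assessment, codes_to_labels)

# Index the labels by those positions; unmatched codes become NA.
rawDisp$assessText <- names(codes_to_labels)[idx]
```

The whole column is filled in one vectorised step, and codes missing from the lookup come out as NA rather than raising an error as the loop version would.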

What does tibble in R exactly do?

I came across this question: Can you use multiple conditions in match() function - R, and I was wondering, what exactly is the tribble function used for? (It's part of the answer provided.)
According to RDocumentation (https://www.rdocumentation.org/packages/tibble/versions/3.0.4/topics/tribble), it is used to construct data frames, but what is the difference between it and e.g. data.frame()?
Tibble Vs data.frame
The below information will give you a better understanding of how tibble differs from data.frame
| Feature | Tibble | data.frame |
| --- | --- | --- |
| Row names | Doesn't add row names to the data frame | Can add row names |
| Partial matching of variable names | Doesn't allow partial matching of variable names | Permitted |
| Subsetting | Subsetting/extracting part of a tibble with [ always gives a tibble | Subsetting a data.frame can return a vector |
| Non-valid R variable names | Allows non-valid R variable names as given | Non-valid names are altered by default (check.names = TRUE); if kept, they have to be surrounded by backticks |
| Variable lengths | Variables used to create a tibble must have the same length (or length 1) | Can have different lengths, as long as the shorter ones divide evenly into the longest |
| Recycling values of a vector | Only length-1 values are recycled when creating a tibble | data.frame recycles values of a shorter vector when creating a data frame |
| Printing capabilities | Has better printing capabilities | No special printing capabilities |
| Printing rows | Can specify the number of rows to print for a tibble | Prints all rows (up to the max.print option) |
| Printing columns | Only prints the number of columns that fit horizontally in the console | Prints all columns |
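A few of these differences can be checked directly. The sketch below shows the data.frame side in base R; the tibble lines only run if the tibble package happens to be installed. The column names value and label are made up for illustration.

```r
# "a","b" is recycled to length 4 because 2 divides 4 evenly.
df <- data.frame(value = 1:4, label = c("a", "b"))

partial <- df$val                        # partial matching finds `value`
one_col <- df[, "value"]                 # single column drops to a vector
kept    <- df[, "value", drop = FALSE]   # drop = FALSE keeps a data.frame

# Tibble counterparts, run only if the tibble package is available.
if (requireNamespace("tibble", quietly = TRUE)) {
  # Only length-1 values are recycled when building a tibble.
  tb <- tibble::tibble(value = 1:4, label = "a")
  print(tb[, "value"])                   # still a tibble, not a vector

  # tribble() builds the same kind of object, written row by row.
  print(tibble::tribble(
    ~value, ~label,
    1,      "a",
    2,      "b"
  ))
}
```

So tribble() is mainly a convenience for typing small tables row by row; the result is an ordinary tibble with the behaviour listed above.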

R code to extract two columns (one of which has multiple text strings which need to be parsed) into a named list of character vectors

I currently have a dataframe in which I need to convert two of the columns to a specified format. Example of the data in each column:
Column 1: Some_text_String
Column 2:
GO:0048046^cellular_component^apoplast`GO:0005618^cellular_component^cell wall`GO:0005576^cellular_component^extracellular region`GO:0099503^cellular_component^secretory vesicle`GO:0004252^molecular_function^serine-type endopeptidase activity`GO:0080001^biological_process^mucilage extrusion from seed coat`GO:0048359^biological_process^mucilage metabolic process involved in seed coat development`GO:0010214^biological_process^seed coat development
So I have two problems. I need to parse the second column so that only the GO:XXXXXXXX text is included. A partial solution that gets the first term is stringr::str_extract(mydataframe[1,2], ".{0,8}GO.{0,8}") but this only captures the first term.
Secondly the final output needs to be a named list of character vectors, with the list names being the first column and each element of the list being a character vector. This is direct from the vignette of the R package I'm trying to use (topGO).
The object returned by readMappings is a named list of character
vectors. The list names give the genes identifiers. Each element of
the list is a character vector and contains the GO identifiers
annotated to the specific gene
I know this is simple but I'm just getting stuck trying to use apply or some other solution and my brain is on strike.
Reprex:
myvector1 <- c("Some_text_String")
myvector2 <- c("GO:0048046^cellular_component^apoplast`GO:0005618^cellular_component^cell wall`")
mydataframe <- data.frame(myvector1,myvector2)
# parse myvector2 to remove everything except GO terms.
# This code only gets the first term, but I need all of them as a vector
stringr::str_extract(mydataframe [1,2], ".{0,8}GO.{0,8}")
# At this point the desired result is named list of character vectors, with the list names being the first column and each element of the list being a character vector.
You can use str_extract_all to extract all the values that satisfy the pattern and use setNames to get a named list.
library(stringr)
setNames(str_extract_all(mydataframe [1,2], "GO.{0,8}"), mydataframe$myvector1)
#$Some_text_String
#[1] "GO:0048046" "GO:0005618"
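For a data frame with several rows, the same idea can be sketched in base R with gregexpr() and regmatches(). The gene names gene_A and gene_B and the result name gene2GO are made up for illustration, and the pattern assumes every identifier is "GO:" followed by exactly seven digits, as in the sample data.

```r
mydataframe <- data.frame(
  myvector1 = c("gene_A", "gene_B"),  # hypothetical gene identifiers
  myvector2 = c(
    "GO:0048046^cellular_component^apoplast`GO:0005618^cellular_component^cell wall",
    "GO:0004252^molecular_function^serine-type endopeptidase activity"
  ),
  stringsAsFactors = FALSE
)

# Assumption: every identifier is "GO:" followed by exactly seven digits.
# gregexpr() finds all matches per string; regmatches() extracts them,
# returning one character vector per row.
go_terms <- regmatches(mydataframe$myvector2,
                       gregexpr("GO:[0-9]{7}", mydataframe$myvector2))

# Name each list element after the gene in the first column.
gene2GO <- setNames(go_terms, mydataframe$myvector1)
```

The result is a named list of character vectors, one element per row, which is the shape the vignette describes for readMappings.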

How to code this if else clause in R?

I have a function that outputs a list containing strings. I want to check whether the strings in this list are all 0's, or whether there is at least one string (possibly more) that isn't all 0's.
I have a large dataset, and I am going to execute my function on each of its rows.
Basically,
for each row of the dataset
mylst <- func(row[i])
if (mylst contains strings containing all 0's)
process the next row of the dataset
else
execute some other code
Now, I can code the if-else clause but I am not able to code the part where I have to check the list for all 0's. How can I do this in R?
Thanks!
You can use this for loop:
for (i in seq_len(nrow(dat))) {
  # unlist() turns the one-row data frame into a plain vector for grepl()
  if (!any(grepl("^0+$", unlist(dat[i, ])))) {
    # execute some other code
  }
}
where dat is the name of your data frame.
Here, the regex "^0+$" matches a string that consists of 0s only.
I'd like to suggest a solution that avoids an explicit for loop.
For a given data set df, one can find a logical vector that indicates the rows with all zeroes:
all.zeros <- apply(df,1,function(s) all(grepl('^0+$',s))) # grepl() was taken from Sven's solution
With this logical vector, it is easy to subset df to remove all-zero rows:
df[!all.zeros,]
and use it for any subsequent transformations.
'Toy' dataset
df <- data.frame(V1=c('00','01','00'),V2=c('000','010','020'))
UPDATE
If you'd like to apply the function to each row first and then analyze the resulting strings, you should slightly modify the all.zeros expression:
all.zeros <- apply(df,1,function(s) all(grepl('^0+$',func(s))))
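Putting the pieces together in the loop form from the question, here is a minimal sketch. func below is only a hypothetical stand-in for the asker's function, and is_all_zero and processed are names made up for illustration; it assumes "all 0's" means every string in the list consists only of zeros.

```r
# Hypothetical stand-in for the asker's function: returns a list of strings.
func <- function(row) as.list(row)

df <- data.frame(V1 = c("00", "01", "00"), V2 = c("000", "010", "020"),
                 stringsAsFactors = FALSE)

# TRUE only when every string consists solely of the character 0.
is_all_zero <- function(strings) all(grepl("^0+$", unlist(strings)))

processed <- character(0)
for (i in seq_len(nrow(df))) {
  mylst <- func(df[i, ])
  if (is_all_zero(mylst)) next   # skip rows where every string is zeros
  # Placeholder for "some other code": record the row's strings.
  processed <- c(processed, paste(unlist(mylst), collapse = "-"))
}
```

On the toy dataset only the first row is skipped; replacing all() with any() inside is_all_zero would instead skip every row that has at least one all-zero string.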
