Split dataset into multiple by column names - r

I am trying to split a dataset into multiple ones based on the column names:
for(i in 1:nrow(column_vals)){
dataset_filtered <- dataset_metadata %>%
filter(characteristics..strain == column_vals[i,1],
characteristics..age == column_vals[i,2])
samples <- dataset_filtered[,1]
samples <- substr(samples, 1, 22)
exprs_filtered <- as.data.frame(exprs) %>% filter(colnames(exprs) %in%
samples)
saveRDS(exprs_filtered, paste0(path, i, sep=""))
}
samples is a character array that contains different column names that need to be selected at each iteration. With above code I am getting an error:
exprs has dimensions 21266x24185. I tried using grepl function:
is.in <- grepl(paste(colnames(exprs), collapse="|"), samples)
exprs_filtered <- exprs[, is.in]
But it is giving me another error:
What am I doing wrong here? How to solve the problem? Any suggestions would be greatly appreciated.
Update
I tried transposing the exprs dataset: as.data.frame(t(exprs)) %>% ... and the error was gone, but the filtering was still not working: I am getting zero filtered results for each iteration. The exprs dataset looks the following way:
One of the samples character array:

If your data is 21266x24185, the error suggests that you might need to transpose exprs or samples using t() to get the same orientation.
edit:
R has appended an X to your exprs headers, so that they no longer match those in sample. When reading the exprs file (e.g. read.csv()) add the argument check.names = F, which will prevent this - though use with caution as syntactically invalid headers might affect other functions. See ?make.name for more info
If this still doesn't fix the problem, confirm that some of the headers in expr do indeed match samples so that we expect an output.
If you provide examples that contain matching data in a format that we can copy into R (text, not images), we may be able to help further if this doesn't solve the problem.

Related

Problems with renaming columns via variables in R

I'm having issues with a specific problem I have a dataset of a ton of matrices that all have V1 as their column names, essentially NULL. I'm trying to write a loop to replace all of these with column names from a list but I'm running into some issues.
To break this down to the most simple form, this code isn't functioning as I'd expect it to.
nameofmatrix <- paste('column_', i, sep = "")
colnames(eval(as.name(nameofmatrix))) <- c("test")
I would expect this to take the value of column_1 for example, and replace (in the 2nd line) with "test" as the column name.
I tried to break this down smaller, for example, if I run print(eval(as.name(nameofmatrix)) I get the object's column/rows printed as expected and if I run print(colnames(eval(as.name(nameofmatrix))) I'm getting NULL as expected for the column header (since it was set as V1).
I've even tried to manually type in the column name, such as colnames(column_1) <- c("test) and this successfully works to rename the column. But once this variable is put in the text's place as shown above, it does not work the same. I'm having difficulties finding a solution on how to rename several matrix columns after they have been created with this method. Does anyone have any advice or suggestions?
Note, the error I'm receiving on trying to run this is
Error in eval([as.name](nameofmatrix)) <- \`vtmp\` : could not find function "eval<-"
We could return the values of the objects in a list with get (if there are multiple objects use mget, then rename the objects in the list and update those objects in the global env with list2env
list2env(lapply(mget(nameofmatrix), function(x) {colnames(x) <- newnames
x}), .GlobalEnv)
It can also be done with assign
data(mtcars)
nameofobject <- 'mtcars'
assign(nameofobject, `colnames<-`(get(nameofobject),
c('mpg1', names(mtcars)[-1])))
Now, check the names of 'mtcars'
names(mtcars)[1]
#[1] "mpg1"

How to transfer multiple columns into numeric & find correlation coefficients

I have a dataset "res.sav" that I read in via haven. It contains 20 columns, called "Genes1_Acc4", "Genes2_Acc4" etc. I am trying to find a correlation coefficient between those and another column called "Condition". I want to separately list all coefficients.
I created two functions, cor.condition.cols and cor.func to do that. The first iterates through the filenames and works just fine. The second was supposed to give me my correlations which didn't work at all. I also created a new "cor.condition.Genes" which I would like to fill with the correlations, ideally as a matrix or dataframe.
I have tried to iterate through the columns with two functions. However, when I try to pass it, I get the error: "NAs introduced by conversion". This wouldn't be the end of the world (I tried also suppressWarning()). But the bigger problem I have that it seems like my function does not convert said columns into the numeric type I need for my cor() function. I receive the "y must be numeric" error when trying to run the cor() function. I tried to put several arguments within and without '' or "" without success.
When I ran str(cor.condition.cols) I only receive character strings, which makes me think that my function somehow messes up with the as.numeric function. Any suggestions of how else I could iter through these columns and transfer them?
Thanks guys :)
cor.condition.cols <- lapply(1:20, function(x){paste0("res$Genes", x, "_Acc4")})
#save acc_4 columns as numeric columns and calculate correlations
res <- (as.numeric("cor.condition.cols"))
cor.func <- function(x){
cor(res$Condition, x, use="complete.obs", method="pearson")
}
cor.condition.Genes <- cor.func(cor.condition.cols)
You can do:
cor.condition.cols <- paste0("Genes", 1:20, "_Acc4")
res2 <- as.numeric(as.matrix(res[cor.condition.cols]))
cor.condition.Genes <- cor(res2, res$Condition, use="complete.obs", method="pearson")
eventually the short variant:
cor.condition.cols <- paste0("Genes", 1:20, "_Acc4")
cor.condition.Genes <- cor(res[cor.condition.cols], res$Condition, use="complete.obs")
Here is an example with other data:
cor(iris[-(4:5)], iris[[4]])

Renaming columns using for loops gives error, R

I have obtained a data set with slightly inconsistent and messy variable names. I would like to rename them in a an efficient and automated way.
I have a set of data frames and I need to rename some columns in several of them. The order of the columns, and length of the data frames differ, so I would like to use any function such as grep() or a subset term (df$x[== "term"]).
I have found an older question regarding this problem (Rename columns in multiple dataframes, R), but I haven't been able to get any of the suggested solutions to work since I get an error message. I do not have reputation enough to comment and ask further questions on those replies. However, my problem seems to be a bit different as I get an error message from my for loop that is not mentioned in the earlier question:
Error in `colnames<-`(`*tmp*`, value = character(0)) :
attempt to set 'colnames' on an object with less than two dimensions
Setup: multiple data frames, let's call them myDF1, myDF2 ...
In those data frames there are columns with names (bad_name1, bad_name2) that should be changed to something else (good_name1, good_name2).
Replicable dataset:
myDF1 <- data.frame(bad_name1="A", bad_name2="B")
myDF2 <- data.frame(bad_name1="C", bad_name2="D")
for (x in c(myDF1,myDF2)) {
colnames(x) <- gsub(x = colnames(x), pattern = "bad_name0", replacement = "good_name1")
}
There are several ways of doing this. One that appealed to me is the subset method:
colnames(myDF1)[names(myDF1) == "bad_name1"] <- "good_name1")
This works fine as a single line, but not as a for loop.
for (x in c(myDF1,myDF2)) {
colnames(x)[colnames(x) == "bad_name1"] <- "good_name1"
}
Which renders the error message.
Error in `colnames<-`(`*tmp*`, value = character(0)) :
attempt to set 'colnames' on an object with less than two dimensions
The same error message applies with a 'gsub'-based method:
for (x in c(myDF1,myDF2)) {
colnames(x) <- gsub(x = colnames(x), pattern = "bad_name1", replacement = "good_name1")
}
I realise that I miss out on something fundamental here. I suppose that the for loop is not receiving the results of the 'colnames(x)' in an appropriate format. But I cannot understand how I'm supposed to make it work. The methods suggested in Rename columns in multiple dataframes, R does not really cover this error message.
Additional clarification, as asked for by vaettchen in a comment:
There is 3 column names that have to be changed (in all data frames). The reason is that they have names like varX.1, varX.2, varX.3 while I would prefer varXcount, varXmean, varXmax. So I have realised that there are names that I am not happy with, and decided on new ones based on my own taste.
You just need a few minor changes. Look at c(myDF1, myDF2) to see why that is not working - it splits the data frames into a list of 4 factors. Combine the data frames into a list and process the list:
all <- list(myDF1=myDF1, myDF2=myDF2)
for (x in seq_along(all)) {
colnames(all[[x]]) <- gsub(x = colnames(all[[x]]), pattern = "bad_name1",
replacement = "good_name1")
}
list2env(all, envir=.GlobalEnv)

The way R handles subseting

I'm having some trouble understanding how R handles subsetting internally and this is causing me some issues while trying to build some functions. Take the following code:
f <- function(directory, variable, number_seq) {
##Create a empty data frame
new_frame <- data.frame()
## Add every data frame in the directory whose name is in the number_seq to new_frame
## the file variable specify the path to the file
for (i in number_seq){
file <- paste("~/", directory, "/",sprintf("%03d", i), ".csv", sep = "")
x <- read.csv(file)
new_frame <- rbind.data.frame(new_frame, x)
}
## calculate and return the mean
mean(new_frame[, variable], na.rm = TRUE)*
}
*While calculating the mean I tried to subset first using the $ sign new_frame$variable and the subset function subset( new_frame, select = variable but it would only return a None value. It only worked when I used new_frame[, variable].
Can anyone explain why the other subseting didn't work? It took me a really long time to figure it out and even though I managed to make it work I still don't know why it didn't work in the other ways and I really wanna look inside the black box so I won't have the same issues in the future.
Thanks for the help.
This behavior has to do with the fact that you are subsetting inside a function.
Both new_frame$variable and subset(new_frame, select = variable) look for a column in the dataframe withe name variable.
On the other hand, using new_frame[, variable] uses the variablename in f(directory, variable, number_seq) to select the column.
The dollar sign ($) can only be used with literal column names. That avoids confusion with
dd<-data.frame(
id=1:4,
var=rnorm(4),
value=runif(4)
)
var <- "value"
dd$var
In this case if $ took variables or column names, which do you expect? The dd$var column or the dd$value column (because var == "value"). That's why the dd[, var] way is different because it only takes character vectors, not expressions referring to column names. You will get dd$value with dd[, var]
I'm not quite sure why you got None with subset() I was unable to replicate that problem.

Is there a more efficient/clean approach to an eval(parse(paste0( set up?

Sometimes I have code which references a specific dataset based on some variable ID. I have then been creating lines of code using paste0, and then eval(parse(...)) that line to execute the code. This seems to be getting sloppy as the length of the code increases. Are there any cleaner ways to have dynamic data reference?
Example:
dataset <- "dataRef"
execute <- paste0("data.frame(", dataset, "$column1, ", dataset, "$column2)")
eval(parse(execute))
But now imagine a scenario where dataRef would be called for 1000 lines of code, and sometimes needs to be changed to dataRef2 or dataRefX.
Combining the comments of Jack Maney and G.Grothendieck:
It is better to store your data frames that you want to access by a variable in a list. The list can be created from a vector of names using get:
mynames <- c('dataRef','dataRef2','dataRefX')
# or mynames <- paste0( 'dataRef', 1:10 )
mydfs <- lapply( mynames, get )
Then your example becomes:
dataset <- 'dataRef'
mydfs[[dataset]][,c('column1','column2')]
Or you can process them all at once using lapply, sapply, or a loop:
mydfs2 <- lapply( mydfs, function(x) x[,c('column1','column2')] )
#G.Grothendieck has shown you how to use get and [ to elevate a character value and return the value of a named object and then reference named elements within that object. I don't know what your code was intended to accomplish since the result of executing htat code would be to deliver values to the console, but they would not have been assigned to a name and would have been garbage collected. If you wanted to use three character values: objname, colname1 and colname2 and those columns equal to an object named after a fourth character value.
newname <- "newdf"
assign( newname, get(dataset)[ c(colname1, colname2) ]
The lesson to learn is assign and get are capable of taking character character values and and accessing or creating named objects which can be either data objects or functions. Carl_Witthoft mentions do.call which can construct function calls from character values.
do.call("data.frame", setNames(list( dfrm$x, dfrm$y), c('x2','y2') )
do.call("mean", dfrm[1])
# second argument must be a list of arguments to `mean`

Resources