Renaming columns using for loops gives error, R

I have obtained a data set with slightly inconsistent and messy variable names. I would like to rename them in an efficient and automated way.
I have a set of data frames and I need to rename some columns in several of them. The order of the columns and the length of the data frames differ, so I would like to match columns by name with a function such as grep() or a subset term (df$x[== "term"]).
I have found an older question regarding this problem (Rename columns in multiple dataframes, R), but I haven't been able to get any of the suggested solutions to work since I get an error message. I do not have reputation enough to comment and ask further questions on those replies. However, my problem seems to be a bit different as I get an error message from my for loop that is not mentioned in the earlier question:
Error in `colnames<-`(`*tmp*`, value = character(0)) :
attempt to set 'colnames' on an object with less than two dimensions
Setup: multiple data frames, let's call them myDF1, myDF2 ...
In those data frames there are columns with names (bad_name1, bad_name2) that should be changed to something else (good_name1, good_name2).
Replicable dataset:
myDF1 <- data.frame(bad_name1="A", bad_name2="B")
myDF2 <- data.frame(bad_name1="C", bad_name2="D")
for (x in c(myDF1,myDF2)) {
colnames(x) <- gsub(x = colnames(x), pattern = "bad_name1", replacement = "good_name1")
}
There are several ways of doing this. One that appealed to me is the subset method:
colnames(myDF1)[names(myDF1) == "bad_name1"] <- "good_name1"
This works fine as a single line, but not as a for loop.
for (x in c(myDF1,myDF2)) {
colnames(x)[colnames(x) == "bad_name1"] <- "good_name1"
}
Which renders the error message.
Error in `colnames<-`(`*tmp*`, value = character(0)) :
attempt to set 'colnames' on an object with less than two dimensions
The same error message applies with a 'gsub'-based method:
for (x in c(myDF1,myDF2)) {
colnames(x) <- gsub(x = colnames(x), pattern = "bad_name1", replacement = "good_name1")
}
I realise that I am missing something fundamental here. I suppose that the for loop is not receiving the results of 'colnames(x)' in an appropriate format, but I cannot understand how I'm supposed to make it work. The methods suggested in Rename columns in multiple dataframes, R do not really cover this error message.
Additional clarification, as asked for by vaettchen in a comment:
There are 3 column names that have to be changed (in all data frames). The reason is that they have names like varX.1, varX.2, varX.3 while I would prefer varXcount, varXmean, varXmax. So I have realised that there are names that I am not happy with, and decided on new ones based on my own taste.

You just need a few minor changes. Look at c(myDF1, myDF2) to see why that is not working - it splits the data frames into a list of 4 factors. Combine the data frames into a list and process the list:
all <- list(myDF1=myDF1, myDF2=myDF2)
for (x in seq_along(all)) {
colnames(all[[x]]) <- gsub(x = colnames(all[[x]]), pattern = "bad_name1",
replacement = "good_name1")
}
list2env(all, envir=.GlobalEnv)
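As a rough sketch of the same idea without an explicit for loop: the snippet below first shows what c() does to two data frames, then renames through a lookup vector with lapply. The lookup vector and the name dfs are made up for illustration; they are not from the question.

```r
myDF1 <- data.frame(bad_name1 = "A", bad_name2 = "B")
myDF2 <- data.frame(bad_name2 = "D", bad_name1 = "C")  # different column order

# c() on data frames concatenates their columns into one plain list,
# so a loop over it sees single columns, which have no colnames
parts <- c(myDF1, myDF2)
length(parts)         # 4 elements, one per column
is.data.frame(parts)  # FALSE

# Hypothetical lookup: names are the old column names, values the new ones
lookup <- c(bad_name1 = "good_name1", bad_name2 = "good_name2")

dfs <- list(myDF1 = myDF1, myDF2 = myDF2)
dfs <- lapply(dfs, function(df) {
  hit <- names(df) %in% names(lookup)              # columns that need renaming
  names(df)[hit] <- unname(lookup[names(df)[hit]]) # match by name, not position
  df                                               # return the modified copy
})

names(dfs$myDF1)
names(dfs$myDF2)
```

Because the match is by name, the column order in each data frame does not matter. If the renamed data frames are needed as separate objects again, list2env(dfs, envir = .GlobalEnv) should work the same way as above.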

Related

Using paste0 to fill in an argument

Let me start by saying I'm sure this has been answered before but I am unsure of what terms to search.
I have a few data frames that are named like df_A , df_B , and df_C and wish to send them all to ggplot. I tried to loop through them all but have been unsuccessful. Here is what I have now:
for (Param in c("A","B","C")){
chosen_df <- paste0("df_",Param)
ggplot(data=chosen_df...)
}
I receive back an error saying "data must be a data frame". This error makes sense to me since chosen_df is a character vector rather than the actual data frame.
I have tried using noquote but to no avail.
My questions are:
1) What kind of search terms can I look up to solve this problem?
2) How close am I to solving this problem?
We can use get to return the value of an object whose name is given as a string
for (Param in c("A","B","C")){
chosen_df <- get(paste0("df_",Param))
ggplot(data=chosen_df, ...)
}
Or with mget, return the values in a list
lst1 <- mget(ls(pattern = '^df_[A-Z]$'))
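A small sketch of how the mget result might be used, assuming three toy data frames (the contents of df_A, df_B and df_C here are invented, and nrow stands in for the plotting call so the example needs no extra packages):

```r
# Toy stand-ins for the real data frames
df_A <- data.frame(x = 1:3)
df_B <- data.frame(x = 4:6)
df_C <- data.frame(x = 7:9)

# Collect every object named df_<capital letter> into a named list
env <- environment()
lst1 <- mget(ls(envir = env, pattern = "^df_[A-Z]$"), envir = env)

# Apply one function to each data frame; the names are carried along
n_rows <- vapply(lst1, nrow, integer(1))
n_rows

# Looping over the names gives access to both the label and the data
for (nm in names(lst1)) {
  chosen_df <- lst1[[nm]]
  cat(nm, "has", nrow(chosen_df), "rows\n")
}
```

Inside a for loop a ggplot object is not printed automatically, so a plotting call in place of cat() would probably need to be wrapped in print().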

Split dataset into multiple by column names

I am trying to split a dataset into multiple ones based on the column names:
for(i in 1:nrow(column_vals)){
dataset_filtered <- dataset_metadata %>%
filter(characteristics..strain == column_vals[i,1],
characteristics..age == column_vals[i,2])
samples <- dataset_filtered[,1]
samples <- substr(samples, 1, 22)
exprs_filtered <- as.data.frame(exprs) %>% filter(colnames(exprs) %in%
samples)
saveRDS(exprs_filtered, paste0(path, i))
}
samples is a character array that contains different column names that need to be selected at each iteration. With above code I am getting an error:
exprs has dimensions 21266x24185. I tried using grepl function:
is.in <- grepl(paste(colnames(exprs), collapse="|"), samples)
exprs_filtered <- exprs[, is.in]
But it is giving me another error:
What am I doing wrong here? How to solve the problem? Any suggestions would be greatly appreciated.
Update
I tried transposing the exprs dataset: as.data.frame(t(exprs)) %>% ... and the error was gone, but the filtering was still not working: I am getting zero filtered results for each iteration. The exprs dataset looks the following way:
One of the samples character array:
If your data is 21266x24185, the error suggests that you might need to transpose exprs or samples using t() to get the same orientation.
edit:
R has prepended an X to your exprs headers, so that they no longer match those in samples. When reading the exprs file (e.g. read.csv()) add the argument check.names = FALSE, which will prevent this - though use with caution as syntactically invalid headers might affect other functions. See ?make.names for more info
If this still doesn't fix the problem, confirm that some of the headers in exprs do indeed match samples so that we expect an output.
If you provide examples that contain matching data in a format that we can copy into R (text, not images), we may be able to help further if this doesn't solve the problem.
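To illustrate the header issue, here is a minimal sketch on invented data (the CSV text and sample names are made up; the real exprs file will look different). It shows the X being added on read, and one way to pick columns by name once the headers are intact:

```r
# Made-up CSV whose headers start with a digit, like many sample IDs
csv_text <- "1_s1,2_s2,3_s3\n1,2,3\n4,5,6"

mangled <- read.csv(text = csv_text)                      # default check.names = TRUE
intact  <- read.csv(text = csv_text, check.names = FALSE) # headers kept as-is

colnames(mangled)  # "X1_s1" "X2_s2" "X3_s3"
colnames(intact)   # "1_s1"  "2_s2"  "3_s3"

# Hypothetical set of sample names to keep in one iteration
samples <- c("1_s1", "3_s3")

# Select columns by name; drop = FALSE keeps a data frame even for one column
keep <- intersect(colnames(intact), samples)
exprs_filtered <- intact[, keep, drop = FALSE]
exprs_filtered
```

Note that dplyr's filter() works on rows, so column selection is done here with plain indexing instead.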

Dynamically call dataframe column & conditional replacement in R

First question post. Please excuse any formatting issues that may be present.
What I'm trying to do is conditionally replace a factor level in a dataframe column. The reason is a Unicode difference between a right single quotation mark (U+2019) and an apostrophe (U+0027).
All of the columns that need this replacement begin with "INN8", so I'm using
grep("INN8", colnames(demoDf)) -> apostropheFixIndices
for(i in apostropheFixIndices) {
levels(demoDfFinal[i]) <- c(levels(demoDf[i]), "I definitely wouldn't")
(insert code here)
}
to get the indices in order to perform the conditional replacement.
I've taken a look at a myriad of questions that involve naming variables on the fly: naming variables on the fly
as well as how to assign values to dynamic variables
and have explored the R-FAQ on turning a string into a variable and looked into Ari Friedman's suggestion that named elements in a list are preferred. However I'm unsure as to the execution as well as the significance of the best practice suggestion.
I know I need to do something along the lines of
demoDf$INN8xx[demoDf$INN8xx=="I definitely wouldn’t"] <- "I definitely wouldn't"
but the iterations I've tried so far haven't worked.
Thank you for your time!
If I understand you correctly, then you don't want to rename the columns. Then this might work:
demoDf <- data.frame(A=rep("I definitely wouldn’t",10) , B=rep("I definitely wouldn’t",10))
newDf <- apply(demoDf, 2, function(col) {
gsub(pattern="’", replacement = "'", x = col)
})
It just checks all columns for the wrong symbol.
Or if you have a vector containing the column indices you want to check then you could go with
# Let's say you identified columns 2, 5 and 8
cols <- c(2,5,8)
sapply(cols, function(col) {
demoDf[,col] <<- gsub(pattern="’", replacement = "'", x = demoDf[,col])
})
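One thing to be aware of: apply() returns a character matrix rather than a data frame, and the sapply version relies on <<-. A possible alternative, sketched on invented data (the column names INN8a, INN8b and other are made up), edits the factor levels of only the matching columns and keeps the data frame intact:

```r
# Invented example data; "\u2019" is the right single quotation mark
demoDf <- data.frame(
  INN8a = c("I definitely wouldn\u2019t", "Maybe"),
  INN8b = c("Maybe", "I definitely wouldn\u2019t"),
  other = c("x\u2019y", "z"),
  stringsAsFactors = TRUE
)

# Indices of the columns whose names start with INN8
fixCols <- grep("^INN8", colnames(demoDf))

# Replace the character in the factor levels; the columns stay factors
demoDf[fixCols] <- lapply(demoDf[fixCols], function(col) {
  levels(col) <- gsub("\u2019", "'", levels(col), fixed = TRUE)
  col
})

levels(demoDf$INN8a)
```

Columns outside fixCols are left untouched, so the same approach should work with any vector of column indices.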

R warning message - invalid factor level, NA generated

I have the following block of code. I am a complete beginner in R (a few days old) so I am not sure how much of the code I will need to share to solve my problem. So here is everything I have written.
mdata <- read.csv("outcome-of-care-measures.csv",colClasses = "character")
allstate <- unique(mdata$State)
allstate <- allstate[order(allstate)]
spldata <- split(mdata,mdata$State)
if (num=="best") num <- 1
ranklist <- data.frame("hospital" = character(),"state" = character())
for (i in seq_len(length(allstate))) {
if (outcome=="heart attack"){
pdata <- spldata[[i]]
pdata[,11] <- as.numeric(pdata[,11])
bestof <- pdata[!is.na(as.numeric(pdata[,11])),][]
inorder <- order(bestof[,11],bestof[,2])
if (num=="worst") num <- nrow(bestof)
hospital <- bestof[inorder[num],2]
state <- allstate[i]
ranklist <- rbind(ranklist,c(hospital,state))
}
}
allstate is a character vector of states.
outcome can have values similar to "heart attack"
num will be numeric or "best" or "worst"
I want to create a data frame ranklist which will have hospital names and the state names which follow a certain criterion.
However I keep getting the error
invalid factor level, NA generated
I know it has something to do with rbind but I cannot figure out what it is. I have tried googling this, and also tried troubleshooting using other similar queries on this site. I have checked that none of the vectors I am trying to bind are factors. I also tried forcing the coercion by setting hospital and state with as.character() during assignment, but that didn't work.
I would be grateful for any help.
Thanks in advance!
Since this is apparently from a Coursera assignment I am not going to give you a solution but I am going to hint at it: Have a look at the help pages for read.csv and data.frame. Both have the argument stringsAsFactors. What is the default, true or false? Do you want to keep the default setting? Is colClasses = "character" in line 1 necessary? Use the str function to check what the classes of the columns in mdata and ranklist are. read.csv additionally has an na.strings argument. If you use it correctly, also the NAs introduced by coercion warning will disappear and line 16 won't be necessary.
Finally, don't grow a matrix or data frame inside a loop if you know the final size beforehand. Initialize it with the correct dimensions (here 52 x 2) and assign e.g. the i-th hospital to the i-th row and first column of the data frame. That way rbind is not necessary.
By the way, you did not get an error but a warning. R didn't interrupt the loop; it just let you know that some values have been coerced to NA. You can also simplify the seq_len statement by using seq_along instead.
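Without giving away the assignment, the preallocation idea can be sketched on toy data (the state codes and hospital names below are invented and have nothing to do with the real file):

```r
# Invented inputs, one "best" hospital per state
states <- c("AK", "AL", "AZ")
best   <- c("Hospital A", "Hospital B", "Hospital C")

# Allocate the full data frame up front, with character (not factor) columns
ranklist <- data.frame(
  hospital = character(length(states)),
  state    = character(length(states)),
  stringsAsFactors = FALSE
)

# Fill row i in place instead of growing the data frame with rbind
for (i in seq_along(states)) {
  ranklist[i, "hospital"] <- best[i]
  ranklist[i, "state"]    <- states[i]
}

str(ranklist)
```

Since both columns are plain character vectors, assigning new strings to them cannot trigger the invalid factor level warning.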

Recoding over multiple data frames in R

(edited to reflect help...I'm not doing great with formatting, but appreciate the feedback)
I'm a bit stuck on what I suspect is an easy enough problem. I have multiple different data sets that I have loaded into R, all of which have different numbers of observations, but all of which have three variables named "A1," "A2," and "A3". I want to create a new variable in each of the data frames that contains the value held in "A1" if A3 contains a value greater than zero, and the value held in "A2" if A3 contains a value less than zero. Seems simple enough, right?
My attempt at this code uses this faux-data:
set.seed(1)
A1=seq(1,100,length=100)
A2=seq(-100,-1,length=100)
A3=runif(100,-1,1)
df1=cbind(A1,A2,A3)
A3=runif(100,-1,1)
df2=cbind(A1,A2,A3)
I'm about a thousand percent sure that R has some functionality for creating the same named variable in multiple data frames, but I have tried doing this with lapply:
mylist=list(df1,df2)
lapply(mylist,function(x){
x$newVar=x$A1
x$newVar[x$A3>0]=x$A2[x$A3>0]
return(x)
})
But the newVar is not available for me once I leave the lapply loop. For example, if I ask for the mean of the new variable:
mean(df1$newVar)
[1] NA
Warning message:
In mean.default(df1$newVar) :
argument is not numeric or logical: returning NA
Any help would be appreciated.
Thank you.
Well first of all, df1 and df2 are not data.frames but matrices (the dollar syntax doesn't work on matrices).
In fact, if you do:
set.seed(1)
A1=seq(1,100,length=100)
A2=seq(-100,-1,length=100)
A3=runif(100,-1,1)
df1=as.data.frame(cbind(A1,A2,A3))
A3=runif(100,-1,1)
df2=as.data.frame(cbind(A1,A2,A3))
mylist=list(df1,df2)
lapply(mylist,function(x){
x$newVar=x$A1
x$newVar[x$A3>0]=x$A2
})
the code almost works but gives some warnings. In fact, there's still an error in the last line of the function called by lapply. If you change it like this, it works as expected:
lapply(mylist,function(x){
x$newVar=x$A1
x$newVar[x$A3>0]=x$A2[x$A3>0] # you need to subset x$A2 otherwise it's too long
return(x) # better to state explicitly what's the return value
})
EDIT (as per comment):
As is almost always the case in R, functions do not modify existing objects but return brand new objects.
So, in this case df1 and df2 are still the same but lapply returns a list with the expected 2 new data.frames i.e. :
resultList <- lapply(mylist,function(x){
x$newVar=x$A1
x$newVar[x$A3>0]=x$A2[x$A3>0]
return(x)
})
newDf1 <- resultList[[1]]
newDf2 <- resultList[[2]]
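A slightly more compact variant, as a sketch: build real data frames with data.frame(), keep them in a named list, and use ifelse(). Note that this follows the rule as described in the question's text (A1 where A3 is greater than zero, A2 otherwise), which is the reverse of what the subsetting code above assigns; swap the last two arguments if the other behaviour is wanted.

```r
set.seed(1)
A1 <- seq(1, 100, length.out = 100)
A2 <- seq(-100, -1, length.out = 100)

# data.frame() instead of cbind(), so that $ works
df1 <- data.frame(A1, A2, A3 = runif(100, -1, 1))
df2 <- data.frame(A1, A2, A3 = runif(100, -1, 1))

# A named list keeps track of which result belongs to which data frame
mylist <- list(df1 = df1, df2 = df2)

# Overwrite the list with the modified copies returned by lapply
mylist <- lapply(mylist, function(x) {
  x$newVar <- ifelse(x$A3 > 0, x$A1, x$A2)
  x
})

mean(mylist$df1$newVar)
```

The original df1 and df2 are still unchanged; the new column lives in mylist$df1 and mylist$df2.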
