Way to do this using apply? - r
I want to take an average for each row across different data frames. Does anyone know of a more clever way to do this using apply statements? Sorry for the wall of code.
You would need a vector of 1000:1006 for each hiXXXX file and then a vector 2:13 for the columns. I have used mapply for something weird like this before, so maybe that could do it somehow?
for (i in 1:nrow(subavg)) {
subavg[i,c(2)] <- mean(c(hi1000[i,c(2)],hi1001[i,c(2)],hi1002[i,c(2)],hi1003[i,c(2)],hi1004[i,c(2)],hi1005[i,c(2)],hi1006[i,c(2)]))
subavg[i,c(3)] <- mean(c(hi1000[i,c(3)],hi1001[i,c(3)],hi1002[i,c(3)],hi1003[i,c(3)],hi1004[i,c(3)],hi1005[i,c(3)],hi1006[i,c(3)]))
subavg[i,c(4)] <- mean(c(hi1000[i,c(4)],hi1001[i,c(4)],hi1002[i,c(4)],hi1003[i,c(4)],hi1004[i,c(4)],hi1005[i,c(4)],hi1006[i,c(4)]))
subavg[i,c(5)] <- mean(c(hi1000[i,c(5)],hi1001[i,c(5)],hi1002[i,c(5)],hi1003[i,c(5)],hi1004[i,c(5)],hi1005[i,c(5)],hi1006[i,c(5)]))
subavg[i,c(6)] <- mean(c(hi1000[i,c(6)],hi1001[i,c(6)],hi1002[i,c(6)],hi1003[i,c(6)],hi1004[i,c(6)],hi1005[i,c(6)],hi1006[i,c(6)]))
subavg[i,c(7)] <- mean(c(hi1000[i,c(7)],hi1001[i,c(7)],hi1002[i,c(7)],hi1003[i,c(7)],hi1004[i,c(7)],hi1005[i,c(7)],hi1006[i,c(7)]))
subavg[i,c(8)] <- mean(c(hi1000[i,c(8)],hi1001[i,c(8)],hi1002[i,c(8)],hi1003[i,c(8)],hi1004[i,c(8)],hi1005[i,c(8)],hi1006[i,c(8)]))
subavg[i,c(9)] <- mean(c(hi1000[i,c(9)],hi1001[i,c(9)],hi1002[i,c(9)],hi1003[i,c(9)],hi1004[i,c(9)],hi1005[i,c(9)],hi1006[i,c(9)]))
subavg[i,c(10)] <- mean(c(hi1000[i,c(10)],hi1001[i,c(10)],hi1002[i,c(10)],hi1003[i,c(10)],hi1004[i,c(10)],hi1005[i,c(10)],hi1006[i,c(10)]))
subavg[i,c(11)] <- mean(c(hi1000[i,c(11)],hi1001[i,c(11)],hi1002[i,c(11)],hi1003[i,c(11)],hi1004[i,c(11)],hi1005[i,c(11)],hi1006[i,c(11)]))
subavg[i,c(12)] <- mean(c(hi1000[i,c(12)],hi1001[i,c(12)],hi1002[i,c(12)],hi1003[i,c(12)],hi1004[i,c(12)],hi1005[i,c(12)],hi1006[i,c(12)]))
subavg[i,c(13)] <- mean(c(hi1000[i,c(13)],hi1001[i,c(13)],hi1002[i,c(13)],hi1003[i,c(13)],hi1004[i,c(13)],hi1005[i,c(13)],hi1006[i,c(13)]))
}
As there are only 7 datasets, we can pass them as arguments to Map, cbind the matching columns, and take the rowMeans. This returns a list with one vector of row means per column.
Map(function(...) rowMeans(cbind(...)), hi1000, hi1001, hi1002, hi1003,
hi1004, hi1005, hi1006)
Or use + with Reduce after getting the datasets in a list and then divide by the total number of datasets, i.e. 7
Reduce(`+`, mget(paste0("hi", 1000:1006)))/7
The second solution is more compact, but if there are NAs in the datasets, it is better to use the first one because rowMeans has an na.rm argument. It is FALSE by default, but we can set it to TRUE.
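To make the difference between the two approaches concrete, here is a minimal sketch on made-up data. The three small data frames and the `value_cols` vector are assumptions standing in for the seven hiXXXX data frames and columns 2:13 from the question; selecting the value columns first keeps the id column out of the averaging.

```r
# Toy stand-ins (made up) for three of the hiXXXX data frames
hi1000 <- data.frame(id = 1:3, a = c(1, 2, 3), b = c(10, NA, 30))
hi1001 <- data.frame(id = 1:3, a = c(3, 4, 5), b = c(20, 40, 60))
hi1002 <- data.frame(id = 1:3, a = c(5, 6, 7), b = c(30, 50, 90))

value_cols <- 2:3  # would be 2:13 in the question

# Map() hands the matching columns of each data frame to the function;
# cbind() lines them up and rowMeans() averages across data frames
avg_list <- Map(function(...) rowMeans(cbind(...), na.rm = TRUE),
                hi1000[value_cols], hi1001[value_cols], hi1002[value_cols])
subavg <- data.frame(id = hi1000$id, avg_list)

# The Reduce() version has no na.rm, so any NA propagates into the result
by_sum <- Reduce(`+`, list(hi1000[value_cols], hi1001[value_cols],
                           hi1002[value_cols])) / 3
```

With na.rm = TRUE the second row of column b is averaged over the two available values, while the Reduce version leaves that cell as NA.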
Related
Recall different data names inside loop
Here is how I created a number of data sets named data_1, data_2, data_3, ... and so on from an initial data set with 500 rows and 17 columns:
for (i in 1:length(unique(data$cluster))) {
  assign(paste("data", i, sep = "_"), subset(data[data$cluster == i, ]))
}
Up to this point everything is fine. Now I am trying to use these inside another loop, one by one, like this:
for (i in 1:5) {
  data <- paste(data, i, sep = "_")
}
However, this does not give me the data in the required format. Any help will be really appreciated. Thank you in advance.
Let me give you a tip here: don't just assign everything in the global environment, but use lists for this. That way you avoid all the things that can go wrong when meddling with the global environment. The code in your question will overwrite the original dataset data, so you'll be in trouble if you want to rerun that code when something goes wrong: you'll have to reconstruct the original data frame.
Second: if you need to split a data frame based on a factor and carry out some code on each part, you should take a look at split, by and tapply, or at the plyr and dplyr packages.
Using Base R
With base R, it depends on what you want to do. In the most general case you can use a combination of split() and lapply, or even a for loop:
mylist <- split(data, f = data$cluster)
for (mydata in mylist) {
  head(mydata)
  ...
}
Or
mylist <- split(data, f = data$cluster)
result <- lapply(mylist, function(mydata) {
  doSomething(mydata)
})
Which one you use depends largely on what the result should be. If you need some kind of summary for every subset, lapply will give you a list with the results per subset. If you need this for a simulation or for plotting, you are better off with the for loop. If you want to add some variables based on other variables, then the plyr or dplyr packages come in handy.
Using plyr and dplyr
These packages are especially handy if the result of your code is going to be an array or data frame of some kind. This would be similar to using split and lapply, but in a way Hadley approves of :-) For example:
library(plyr)
result <- ddply(data, .(cluster), function(mydata) {
  doSomething(mydata)
})
Use dlply if the result should be a list.
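As a rough illustration of the list-based approach, here is a small sketch on made-up data. The data frame `dat` and its `value` column are assumptions standing in for the question's data; the per-cluster mean is just a placeholder for whatever doSomething would compute.

```r
# Toy data (made up) with a cluster column, standing in for the question's data
dat <- data.frame(
  cluster = c(1, 1, 2, 2, 3),
  value   = c(10, 20, 30, 50, 70)
)

# One list element per cluster instead of data_1, data_2, ... in the
# global environment; the element names are the cluster values
pieces <- split(dat, f = dat$cluster)

# Per-cluster summary; vapply() states that each result is a single number
cluster_means <- vapply(pieces, function(d) mean(d$value), numeric(1))

# Individual pieces are still available by name
second <- pieces[["2"]]
```

Because the pieces live in one list, the original data frame is never overwritten and the whole step can be rerun safely.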
R: How to do this without a for-loop?
The following code in R uses a for-loop. What is a way I could solve the same problem without a for-loop (maybe by vectorizing it)?
I am looking at an unfamiliar dataset with many columns (243), and am trying to figure out which columns hold unstructured text. As a first check, I was going to flag columns that are 1) of class 'character' and 2) have at least ten unique values.
openEnded <- rep(x = NA, times = ncol(scaryData))
for (i in 1:ncol(scaryData)) {
  openEnded[i] <- is.character(scaryData[[i]]) & length(unique(scaryData[[i]])) >= 10
}
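This would probably do the job:
openEnded <- sapply(scaryData, function(x) is.character(x) & length(unique(x)) >= 10)
Instead of the loop, you simply iterate over the columns (that's the sapply(scaryData, ...) part) with an anonymous function that combines your two conditions (function(x) cond1 & cond2). I guess your data is a data.frame, which is a list of columns, so sapply sees each column with its own class. apply(scaryData, 2, function(x) ...) looks equivalent, but apply converts the data frame to a matrix first, so a single character column turns every column into character and is.character is TRUE everywhere. A nice post about the *apply family can be found there.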
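The column-wise check can be tried on a small made-up data frame. The sketch below is an illustration under assumptions: the toy `scaryData` and the helper name `is_open_ended` are hypothetical, and vapply is swapped in for sapply only because it fixes the result type.

```r
# Toy data (made up): one free-text column, one short code column, one numeric column
scaryData <- data.frame(
  comment = paste("free text answer", 1:12),
  code    = rep(c("a", "b", "c"), 4),
  score   = as.numeric(1:12),
  stringsAsFactors = FALSE
)

# Hypothetical helper combining the two conditions from the question
is_open_ended <- function(x) is.character(x) && length(unique(x)) >= 10

# Iterating over the data frame as a list keeps each column's class
openEnded <- vapply(scaryData, is_open_ended, logical(1))

# apply() works on a character matrix built from the data frame,
# so the numeric column gets flagged as well
viaApply <- apply(scaryData, 2, is_open_ended)
```

Here only `comment` is flagged by the list-based version, whereas the apply version also flags `score`.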
Applying multiple functions via sapply
I'm trying to replicate the solution on applying multiple functions in sapply posted on R-Bloggers, but I can't get it to work in the desired manner. I'm working with a simple data set, similar to the one generated below:
require(datasets)
crs_mat <- cor(mtcars)
# Triangle function
get_upper_tri <- function(cormat){
  cormat[lower.tri(cormat)] <- NA
  return(cormat)
}
require(reshape2)
crs_mat <- melt(get_upper_tri(crs_mat))
I would like to replace some text values across columns Var1 and Var2. The erroneous syntax below illustrates what I am trying to achieve:
crs_mat[,1:2] <- sapply(crs_mat[,1:2], function(x) {
  # Replace first phrase
  gsub("mpg","MPG",x),
  # Replace second phrase
  gsub("gear", "GeArr",x)
  # Ideally, perform other changes
})
Naturally, the code is not syntactically correct and fails. To summarise, I would like to do the following:
- Go through all the values in the first two columns (Var1 and Var2) and perform simple replacements via gsub.
- Ideally, I would like to avoid defining a separate function, as discussed in the linked post, and keep everything within the sapply syntax.
- I don't want a nested loop.
I had a look at the broadly similar subject discussed here and here but, if possible, I would like to avoid making use of plyr. I'm also interested in replacing the column values, not in creating new columns, and I would like to avoid specifying any column names. While working with my existing data frame it is more convenient for me to use column numbers.
Edit
Following very useful comments, what I'm trying to achieve can be summarised in the solution below:
library(stringr)  # provides str_wrap()
fun.clean.columns <- function(x, str_width = 15) {
  # Make character
  x <- as.character(x)
  # Replace various phrases
  x <- gsub("perc85","something else", x)
  x <- gsub("again", x)  # incomplete: gsub() also needs a replacement string
  x <- gsub("more","even more", x)
  x <- gsub("abc","ohmg", x)
  # Clean spaces
  x <- trimws(x)
  # Wrap strings
  x <- str_wrap(x, width = str_width)
  # Return object
  return(x)
}
mean_data[,1:2] <- sapply(mean_data[,1:2], fun.clean.columns)
I don't need this function in my global.env, so I can run rm after this, but an even nicer solution would involve squeezing this into the apply syntax.
We can use mgsub from library(qdap) to replace multiple patterns. Here, I am looping over the first and second columns using lapply and assigning the results back to crs_mat[,1:2]. Note that I am using lapply instead of sapply, as lapply keeps the structure intact.
library(qdap)
crs_mat[,1:2] <- lapply(crs_mat[,1:2], mgsub, pattern=c('mpg', 'gear'), replacement=c('MPG', 'GeArr'))
Here is a start of a solution for you; I think you're capable of extending it yourself. There are probably more elegant approaches available, but I don't see them atm.
crs_mat[,1:2] <- sapply(crs_mat[,1:2], function(x) {
  # Replace first phrase
  step1 <- gsub("mpg","MPG",x)
  # Replace second phrase. Note that this operates on the output of the first step.
  step2 <- gsub("gear", "GeArr",step1)
  # Ideally, perform other changes
  return(step2)
  # or one nested line, not practical if more needs to be done
  # return(gsub("gear", "GeArr",gsub("mpg","MPG",x)))
})
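For a longer list of replacements, the step-by-step chaining can also be written in base R without an extra package. The sketch below is one possible way, not the approach from either answer: it swaps in Reduce to feed the output of each gsub into the next. The toy `crs_mat` and the `swaps` vector are assumptions made up for illustration.

```r
# Toy data (made up) standing in for the melted correlation table
crs_mat <- data.frame(
  Var1  = c("mpg", "gear", "cyl"),
  Var2  = c("gear", "mpg", "disp"),
  value = c(1, 0.5, -0.2),
  stringsAsFactors = FALSE
)

# Pattern/replacement pairs kept side by side: names are patterns, values are replacements
swaps <- c(mpg = "MPG", gear = "GeArr")

crs_mat[, 1:2] <- lapply(crs_mat[, 1:2], function(x) {
  # Start from the column as character and apply each gsub() in turn,
  # passing the result of one replacement on to the next
  Reduce(function(txt, pat) gsub(pat, swaps[[pat]], txt, fixed = TRUE),
         names(swaps), as.character(x))
})
```

Everything stays inside the anonymous function, and adding another replacement only means adding one entry to `swaps`.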
Double "for loops" in a dataframe in R
I need to do a quality control in a dataset with more than 3000 variables (columns). However, I only want to apply some conditions to a couple of them. A first step would be to replace outliers by NA. I want to replace the observations that are more than 3 standard deviations above or below the mean by NA. I got it working column by column:
height = ifelse(abs(height-mean(height,na.rm=TRUE)) < 3*sd(height,na.rm=TRUE),height,NA)
And I also want to create other variables based on different columns. For example:
data$CGmark = ifelse(!is.na(data$mark) & !is.na(data$height) , paste(data$age, data$mark,sep=""),NA)
An example of my dataset would be:
name = factor(c("A","B","C","D","E","F","G","H","H"))
height = c(120,NA,150,170,NA,146,132,210,NA)
age = c(10,20,0,30,40,50,60,NA,130)
mark = c(100,0.5,100,50,90,100,NA,50,210)
data = data.frame(name=name,mark=mark,age=age,height=height)
data
I have tried this (for one condition):
d1=names(data)
list = c("age","height","mark")
ntraits=length(list)
nrows=dim(data)[1]
for(i in 1:ntraits){
  a=list[i]
  b=which(d1==a)
  d2=data[,b]
  for (j in 1:nrows){
    d2[j] = ifelse(abs(d2[j]-mean(d2,na.rm=TRUE)) < 3*sd(d2,na.rm=TRUE),d2[j],NA)
  }
}
Someone told me that I am not storing d2. How can I create for loops to apply the conditions I want? I know that there are similar questions, but I didn't get it yet. Thanks in advance.
You pretty much wrote the answer in your first line. You're overthinking this one.
First, it's good practice to encapsulate this kind of operation in a function. Yes, function dispatch is a tiny bit slower than otherwise, but the code is often easier to read and debug. Same goes for assigning "helper" variables like mean_x: the cost of assigning the variable is very, very small and absolutely not worth worrying about.
NA_outside_3s <- function(x) {
  mean_x <- mean(x, na.rm=TRUE)
  sd_x <- sd(x, na.rm=TRUE)
  # TRUE where the value is 3 or more standard deviations away from the mean
  x_outside_3s <- abs(x - mean_x) >= 3 * sd_x
  x[x_outside_3s] <- NA # no need for ifelse here
  x
}
Of course, you can choose any function name you want. More descriptive is better.
Then, if you want to apply the function to every column, just loop over the columns. The function NA_outside_3s is already vectorized, i.e. it takes a numeric vector as an argument and returns a vector of the same length.
cols_to_loop_over <- 1:ncol(my_data) # or, some subset of columns.
for (j in cols_to_loop_over) {
  my_data[, j] <- NA_outside_3s(my_data[, j])
}
I'm not sure why you wrote your code the way you did (and it took me a minute to even understand what you were trying to do), but looping over columns is usually straightforward. In my comment I said not to worry about efficiency, but once you understand how the loop works, you should rewrite it using lapply:
my_data[cols_to_loop_over] <- lapply(my_data[cols_to_loop_over], NA_outside_3s)
Once you know how the apply family of functions works, they are very easy to read if written properly. And yes, they are somewhat faster than looping, but not as much as they used to be. It's more a matter of style and readability.
Also: do NOT name a variable list! This masks the function list, which is an R built-in function and a fairly important one at that. You also shouldn't generally name variables data, because there is also a data function for loading built-in data sets.
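Here is a small self-contained sketch of the outlier step applied to selected columns. The helper name `na_beyond_sd`, its `k` parameter, and the toy data are assumptions made up for illustration. The question's sample data is not used because, with so few rows, no single value can lie 3 standard deviations from the mean.

```r
# Hypothetical helper: set values more than k standard deviations from the mean to NA
na_beyond_sd <- function(x, k = 3) {
  centre <- mean(x, na.rm = TRUE)
  spread <- sd(x, na.rm = TRUE)
  # which() drops the NA comparisons, so existing NAs are simply left alone
  x[which(abs(x - centre) > k * spread)] <- NA
  x
}

# Toy data (made up): mostly ordinary values, one extreme value per column
toy <- data.frame(
  id     = seq_len(22),
  height = c(rep(150, 20), 1000, NA),
  age    = c(rep(30, 20), 31, 500)
)

# Only the listed columns are cleaned; id is left untouched
cols <- c("height", "age")
toy[cols] <- lapply(toy[cols], na_beyond_sd)
```

Assigning the lapply result back into `toy[cols]` is what stores the cleaned columns, which is the step the nested loop in the question was missing.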
Quantiles of a data.frame
There is a data.frame() whose columns I'd like to calculate quantiles for:
tert <- c(0:3)/3
data <- dbGetQuery(dbCon, "SELECT * FROM tablename")
quans <- mapply(quantile, data, probs=tert, name=FALSE)
But the result only contains the last element of the list that quantile returns and not the whole result. I also get the warning "longer argument not a multiple of length of shorter". How can I modify my code to make it work?
PS: The function alone works like a charm, so I could use a for loop:
quans <- quantile(a$fileName, probs=tert, name=FALSE)
PPS: What also works is not specifying probs:
quans <- mapply(quantile, data, name=FALSE)
The problem is that mapply is trying to apply the given function to each of the elements of all of the specified arguments in sequence, so probs=tert is split up and each column is paired with a single probability. Since you only want to loop over one argument, you should use lapply, not mapply:
lapply(data, quantile, probs=tert, names=FALSE)
Alternatively, you can still use mapply, but specify the arguments that are not to be looped over in the MoreArgs argument:
mapply(quantile, data, MoreArgs=list(probs=tert, names=FALSE))
I finally found a workaround which I don't like but which kind of works. Perhaps someone can tell me the right way to do it:
q <- function(x) { quantile(x, probs=c(0:3)/3, names=FALSE) }
mapply(q, data)
This works; no idea where the difference is.
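The difference between the variants comes down to which arguments mapply iterates over. A minimal sketch on made-up data follows; the data frame `dat` is an assumption standing in for the query result.

```r
tert <- c(0:3) / 3

# Toy data (made up) standing in for the query result
dat <- data.frame(a = c(1, 2, 3, 4), b = c(10, 20, 30, 40))

# probs passed directly: mapply() walks it in step with the columns,
# recycling the columns: (a, tert[1]), (b, tert[2]), (a, tert[3]), (b, tert[4])
paired <- mapply(quantile, dat, probs = tert, names = FALSE)

# probs passed through MoreArgs: every column gets the full vector
full <- mapply(quantile, dat, MoreArgs = list(probs = tert, names = FALSE))

# Same result with sapply(), which only ever iterates over its first argument
full2 <- sapply(dat, quantile, probs = tert, names = FALSE)
```

In `paired` each call returns a single quantile, while `full` and `full2` are matrices with one column per data frame column and one row per probability. Wrapping quantile in a function with probs fixed inside, as in the workaround, has the same effect as MoreArgs because mapply then has only the data to iterate over.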