I am having difficulty subsetting my data by factors in a for loop. Here is an illustrative example:
x<-rnorm(n=40, m=0, sd=1)
y<-rep(1:5, 8)
df<-as.data.frame(cbind(x,y))
df_split<-split(df, df$y)
mean_vect<-rep(-99, 5)
for (i in c(1:5)) {
current_df<-df_split$i
mean_vect[i]<-mean(current_df)
}
This approach is not working because, I think, R is looking for a split called "i" when I really want it to pull out the ith split! I have also tried the subset function with little joy. I always run into these problems when I am trying to split on a non-numeric factor, so any help would be appreciated.
FYI, the functionality to accomplish this is typically done using tapply
tapply( df$x, df$y, mean )
The first argument specifies the value you want to "mean-group". The second is just the INDEX, i.e. the variable that splits your groups and the last is obviously the function you want to run on these groups, in this case mean.
To get split number i run
df_split[[i]]
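BTW, as your final aim is mean_vect, you would be better off using the following (note that mean() must be applied to the x column, not to the whole data frame):
mean_vect <- sapply(df_split, function(d) mean(d$x))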
or:
mean_vect <- tapply(df$x, df$y, mean)
mean_vect
1 2 3 4 5
0.2566810 -0.1528079 -0.2097333 -0.1540343 0.3609312
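A minimal sketch of the difference between `$` and `[[`, assuming a made-up data frame with a character grouping column `g` (not part of the original question). `df_split$i` looks for an element literally named "i", whereas `df_split[[i]]` evaluates the variable first, so it accepts either a position or a group name:

```r
set.seed(1)
# Hypothetical data: one numeric column and one non-numeric grouping column
df <- data.frame(x = rnorm(12), g = rep(c("low", "mid", "high"), 4))
df_split <- split(df, df$g)

is.null(df_split$i)   # TRUE: there is no split literally named "i"

# Loop over the group names; [[ ]] evaluates nm before indexing
mean_vect <- numeric(0)
for (nm in names(df_split)) {
  mean_vect[nm] <- mean(df_split[[nm]]$x)
}

# The loop-free version should give the same named vector
all.equal(mean_vect, sapply(df_split, function(d) mean(d$x)))
```

Looping over `names(df_split)` rather than `1:5` is one way to avoid the problem with non-numeric factors mentioned in the question.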
I have a data set with Air Quality Data. The Data Frame is a matrix of 153 rows and 5 columns.
I want to find the mean of the first column in this Data Frame.
There are missing values in the column, so I want to exclude those while finding the mean.
Finally, I want to do this using control structures (for loops and if-else statements).
I have tried writing code as seen below. I have created 'y' instead of the actual Air Quality data set to have a reproducible example.
y <- c(1,2,3,NA,5,6,NA,NA,9,10,11,NA,13,NA,15)
x <- matrix(y,nrow=15)
for(i in 1:15){
if(is.na(data.frame[i,1]) == FALSE){
New.Vec <- c(x[i,1])
}
}
print(mean(New.Vec))
I expected the output to be the mean. Though the error I received is this:
Error: object 'New.Vec' not found
One line of code; no need for a for loop.
mean(data.frame$name_of_the_first_column, na.rm = TRUE)
Setting na.rm = TRUE makes the mean function ignore NAs.
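As a small illustration, here is the effect of `na.rm` on the reproducible vector `y` from the question, next to the equivalent manual filter:

```r
# The reproducible vector from the question
y <- c(1, 2, 3, NA, 5, 6, NA, NA, 9, 10, 11, NA, 13, NA, 15)

mean(y)                 # NA: a single missing value propagates
mean(y, na.rm = TRUE)   # 7.5: NAs are dropped before averaging
mean(y[!is.na(y)])      # 7.5: the same filter done by hand
```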
Here, we can make use of na.aggregate from zoo
library(zoo)
df1[] <- na.aggregate(df1)
This assumes that 'df1' is a data.frame with all numeric columns and that you want to fill the NA elements with the mean of the corresponding column. By default, na.aggregate uses FUN = mean.
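For readers without zoo installed, the same idea can be sketched in base R. This is only an illustration of column-mean imputation, not how `na.aggregate` is implemented, and `df1` here is a made-up data frame:

```r
# Hypothetical all-numeric data frame with missing values
df1 <- data.frame(a = c(1, NA, 3), b = c(NA, 4, 8))

# Replace each NA with the mean of the non-missing values in its column
df1[] <- lapply(df1, function(col) {
  col[is.na(col)] <- mean(col, na.rm = TRUE)
  col
})
df1
```

Assigning into `df1[]` keeps the object a data frame instead of replacing it with a plain list.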
I can't see your data, but probably something like this? The vector needs to be initialized before the loop. It is better to avoid loops in R when you can.
myDataFrame <- read.csv("hw1_data.csv")
New.Vec <- c()
for(i in 1:153){
if(!is.na(myDataFrame[i,1])){
New.Vec <- c(New.Vec, myDataFrame[i,1])
}
}
print(mean(New.Vec))
df is a frequency table, where the values in a were reported as many times as recorded in columns x, y and z. I'm trying to convert the frequency table to the original data, so I use the rep() function.
How do I loop the rep() function to give me the original data for x, y, z without having to repeat the function several times like I did below?
Also, can I input the result into a data frame, bearing in mind that the output will have different column lengths:
a <- (1:10)
x <- (6:15)
y <- (11:20)
z <- (16:25)
df <- data.frame(a,x,y,z)
df
rep(df[,1], df[,2])
rep(df[,1], df[,3])
rep(df[,1], df[,4])
If you don't want to repeat the rep() call for each column, you can always try using an apply function. Note that you cannot store the result in a data.frame because the objects are of different lengths, but you could store it in a list and access the elements in a similar way to a data.frame. Something like this works:
df2<-sapply(df[,2:4],function(x) rep(df[,1],x))
What this sapply function is saying is for each column in df[,2:4], apply the rep(df[,1],x) function to it where x is one of your columns ( df[,2], df[,3], or df[,4]).
The below code just makes sure the apply function is giving the same result as your original way.
identical(df2$x,rep(df[,1], df[,2]))
[1] TRUE
identical(df2$y,rep(df[,1], df[,3]))
[1] TRUE
identical(df2$z,rep(df[,1], df[,4]))
[1] TRUE
EDIT:
If you want it as a data.frame object you can do this:
res<-as.data.frame(sapply(df2, '[', seq(max(sapply(df2, length)))))
Note this introduces NAs into your data.frame so be careful!
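A variation on the same approach, sketched with `lapply` instead of `sapply`. The reason is that `sapply` simplifies to a matrix whenever all results happen to have the same length, while `lapply` always returns a list. The names `expanded`, `n_max` and `padded` are illustrative only:

```r
a <- 1:10
df <- data.frame(a, x = 6:15, y = 11:20, z = 16:25)

# One rep() call per count column; lapply always returns a list
expanded <- lapply(df[-1], function(counts) rep(df$a, times = counts))
lengths(expanded)   # each element has length sum(counts)

# Pad the shorter vectors with NA so that they fit into a data frame
n_max <- max(lengths(expanded))
padded <- as.data.frame(lapply(expanded, function(v) v[seq_len(n_max)]))
dim(padded)
```

Indexing past the end of a vector returns NA, which is what produces the padding.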
I have a general question about loops in R:
Could someone explain the error in this?
for (i in seq(length=2:ncol(df))) { z <- cor.test(df$SEASON, df[,i], method="spearman");z}
I would simply like to use the cor.test(x,y) function between the column called SEASON and all the other columns of my data frame "df".
Moreover, after this calculation I want R to print the results "z".
First, you don't really need a for() loop. You can use the apply() function to get the correlation of SEASON with all of the other columns of the data frame df.
# some fake data
n <- 20
df <- data.frame(SEASON=runif(n), A=runif(n), B=runif(n), C=runif(n))
# print the correlation
apply(df[, -1], 2, cor.test, df$SEASON, method="spearman")
Second, you are not using the seq() function properly. The length.out argument of seq() is the "desired length of the sequence". You keep supplying the length.out argument with a vector, instead of the scalar (a vector of length one) it is expecting. That is why you get a warning message when you submit something like, seq(length.out=2:ncol(df)). The function just uses the first element, so the result is the same (without a warning message) as for seq(length.out=2). If you wanted to use seq() to give you the desired result, you would use seq(from=2, to=ncol(df)). This is fine, but I think it simpler and cleaner to simply use 2:ncol(df) as previous posters suggest.
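To make the comparison concrete, here is a short sketch with a made-up four-column data frame. It also shows `seq_len()`, which is not mentioned above but behaves more safely in the edge case where there is only one column:

```r
df <- data.frame(SEASON = 1:3, A = 4:6, B = 7:9, C = 10:12)  # made-up data

seq(from = 2, to = ncol(df))   # 2 3 4
2:ncol(df)                     # 2 3 4
seq_len(ncol(df))[-1]          # 2 3 4

# Edge case: with a single column, 2:ncol() counts down instead of being empty
one_col <- df["SEASON"]
2:ncol(one_col)                # 2 1
seq_len(ncol(one_col))[-1]     # integer(0)
```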
If you really wanted to use a for loop, this should do the trick. Results are not auto-printed inside a loop, so you need to call print() explicitly:
for(i in 2:ncol(df)) print(cor.test(df$SEASON, df[, i], method="spearman"))
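If the results are needed later rather than just printed, one option is to keep the test objects in a list and extract the pieces of interest. This is a sketch on fake data like the one above; `tests` and `summary_tbl` are illustrative names:

```r
set.seed(1)
n <- 20
df <- data.frame(SEASON = runif(n), A = runif(n), B = runif(n), C = runif(n))

# Run one test per non-SEASON column and keep the full result objects
tests <- lapply(df[-1], function(col) {
  cor.test(df$SEASON, col, method = "spearman")
})

# Collect the correlation estimate and p-value of each test in a small table
summary_tbl <- data.frame(
  rho     = sapply(tests, function(res) unname(res$estimate)),
  p_value = sapply(tests, function(res) res$p.value)
)
summary_tbl
```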
I want to apply some operations to the values in a number of columns, and then sum the results of each row across columns. I can do this using:
x <- data.frame(sample=1:3, a=4:6, b=7:9)
x$a2 <- x$a^2
x$b2 <- x$b^2
x$result <- x$a2 + x$b2
but this will become arduous with many columns, and I'm wondering if anyone can suggest a simpler way. Note that the dataframe contains other columns that I do not want to include in the calculation (in this example, column sample is not to be included).
Many thanks!
I would simply subset the columns of interest and apply everything directly on the matrix using the rowSums function.
x <- data.frame(sample=1:3, a=4:6, b=7:9)
# put column indices and apply your function
x$result <- rowSums(x[,c(2,3)]^2)
This of course assumes your function is vectorized. If not, you would need to use some apply variation (which you are seeing many of). That said, you can still use rowSums if you find it useful like so. Note, I use sapply which also returns a matrix.
# random custom function
myfun <- function(x){
return(x^2 + 3)
}
rowSums(sapply(x[,c(2,3)], myfun))
I would suggest converting the data set into the 'long' format, grouping it by sample, and then calculating the result. Here is the solution using data.table:
library(data.table)
melt(setDT(x),id.vars = 'sample')[,sum(value^2),by=sample]
# sample V1
#1: 1 65
#2: 2 89
#3: 3 117
You can easily replace value^2 by any function you want.
You can use the apply function and select the columns you need with c(i1, i2, ...).
apply(( x[ , c(2, 3) ])^2, 1 ,sum )
If you want to apply a function named somefunction to some of the columns, whose indices or colnames are in the vector col_indices, and then sum the results, you can do :
# if somefunction can be vectorized :
x$results<-apply(x[,col_indices],1,function(x) sum(somefunction(x)))
# if not :
x$results<-apply(x[,col_indices],1,function(x) sum(sapply(x,somefunction)))
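A concrete instance of the pattern, with `col_indices` and `somefunction` filled in by placeholder choices (both are assumptions made for illustration):

```r
x <- data.frame(sample = 1:3, a = 4:6, b = 7:9)

col_indices  <- c("a", "b")            # columns to include (illustrative)
somefunction <- function(v) v^2 + 3    # placeholder; any numeric function

# apply() passes each row as a vector; transform it, then sum
x$results <- apply(x[, col_indices], 1, function(row) sum(somefunction(row)))
x$results   # 71 95 123
```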
I want to come at this one from a "no extensions" R POV.
It's important to remember what kind of data structure you are working with. Data frames are actually lists of vectors: each column is itself a vector. So you can use the handy-dandy lapply function to apply a function to the desired columns in the list/data frame.
I'm going to define a function as the square as you have above, but of course this can be any function of any complexity, so long as it takes a vector as an input and returns a vector of the same length. If it doesn't, it won't fit into the original data.frame!
The steps below are extra pedantic to show each little bit, but obviously it can be compressed into one or two steps. Note that I only retain the sum of the squares of each column, given that you might want to save space in memory if you are working with lots and lots of data.
1. Create the data; define the function.
2. Grab the columns you want as a separate (temporary) data.frame.
3. Apply the function to the data.frame/list you just created.
4. lapply returns a list, so if you intend to retain it separately, make it a temporary data.frame. This is not necessary.
5. Calculate the sums of the rows of the temporary data.frame and append them as a new column in x.
6. Remove the temporary data.frame.
Code:
x <- data.frame(sample=1:3, a=4:6, b=7:9); square <- function(x) x^2 #step 1
x[2:3] #Step 2
temp <- data.frame(lapply(x[2:3], square)) #step 3 and step 4
x$squareRowSums <- rowSums(temp) #step 5
rm(temp) #step 6
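The steps above can be compressed. One possible sketch uses `Reduce` to add the transformed columns element-wise, which avoids the temporary data.frame altogether:

```r
x <- data.frame(sample = 1:3, a = 4:6, b = 7:9)
square <- function(v) v^2

# Square the chosen columns and add them element-wise, with no temporary object
x$squareRowSums <- Reduce(`+`, lapply(x[c("a", "b")], square))
x$squareRowSums   # 65 89 117
```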
Here is another apply solution:
cols <- c("a", "b")
x <- data.frame(sample=1:3, a=4:6, b=7:9)
x$result <- apply(x[, cols], 1, function(x) sum(x^2))
So, I'm trying to generate random numbers from multivariate normal distributions with different means. I'm also trying to use the apply functions and not for loops, which is where the problem occurs. Here is my code:
library(MASS)
set.seed(123)
# X and Y means
Means<-cbind(c(.2,.2,.8),c(.2,.6,.8))
Means
Sigma<-matrix(c(.01,0,0,.01),nrow=2)
Sigma
data<-apply(X=Means,MARGIN=1,FUN=mvrnorm,n=10,Sigma=Sigma)
data
Instead of getting two vectors with X and Y points for the three means, I get three vectors with X and Y points stacked. What is the best way to get the two vectors? I know I could unstack them manually, but I feel R should have some slick way of getting this done.
I'm not sure if it's what I would call 'slick', but if you really want to use apply (instead of lapply as previously mentioned), you can force apply to return your results as a list of matrices. Then it's just a matter of sticking the results together. I expect that this would be less error-prone than trying to rebuild a two-column matrix.
data <- apply(Means, 1, function(x) {
list(mvrnorm(n=10, mu=x, Sigma=Sigma))
})
data <- do.call('rbind', unlist(data, recursive=FALSE))
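For comparison, a sketch of the same stacking done with `lapply`, which also records which mean each draw came from. It assumes the MASS package is available; `draws`, `stacked` and the `group` column are illustrative names:

```r
library(MASS)   # provides mvrnorm
set.seed(123)

Means <- cbind(c(.2, .2, .8), c(.2, .6, .8))   # one (X, Y) mean per row
Sigma <- matrix(c(.01, 0, 0, .01), nrow = 2)

# One 10 x 2 matrix of draws per row of Means
draws <- lapply(seq_len(nrow(Means)), function(i) {
  mvrnorm(n = 10, mu = Means[i, ], Sigma = Sigma)
})

# Stack into a single data frame and label the source mean of each row
stacked <- data.frame(do.call(rbind, draws),
                      group = rep(seq_len(nrow(Means)), each = 10))
names(stacked)[1:2] <- c("X", "Y")
dim(stacked)
```

Here `stacked$X` and `stacked$Y` are the two vectors asked for, and `group` makes it possible to split them again by mean.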
Try:
set.seed(42)
res1 <- lapply(seq_len(nrow(Means)), function(i) mvrnorm(Means[i,], n=10, Sigma))
Checking with the results of apply
set.seed(42)
res2 <- apply(X=Means,MARGIN=1,FUN=mvrnorm,n=10,Sigma=Sigma)
dim(res2) <- c(10,2, 3)
res3 <-lapply(1:dim(res2)[3], function(i) res2[,,i])
all.equal(res3, res1, check.attributes=FALSE)
#[1] TRUE