I have a data set with 130 rows and two columns.
I want to calculate the mean, minimum and maximum of every 5 rows of the second column using R. Using colMeans and the following command
rep(colMeans(matrix(data$Pb, nrow=5), na.rm=TRUE), each=5)
I was able to compute the mean for every 5 rows. However, I am not able to compute the max and min, since there is no built-in function for them. I tried the approach suggested here, with 5 rows instead of 2, but I get an error that dim(X) must have a positive length. Can someone please help me understand what I should do to fix this and compute the above quantities? My end goal is to plot the min, mean and max for every 5 rows.
Thanks in advance.
If we are looking for a function to find the max and min of each column of a matrix, colMaxs and colMins from matrixStats can be used.
library(matrixStats)
colMaxs(mat)
#[1] 7 8 20
colMins(mat)
#[1] 3 1 7
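If matrixStats is not available, the same column-wise results can be cross-checked in base R with apply. This is only a sketch using the mat from the data section of this answer; apply is typically slower than colMaxs/colMins on large matrices.

```r
# Same matrix as in the data section of this answer
mat <- matrix(c(3, 1, 20, 5, 4, 12, 6, 2, 9, 7, 8, 7), byrow = TRUE, ncol = 3)

# MARGIN = 2 applies the function to each column
col_max <- apply(mat, 2, max)
col_min <- apply(mat, 2, min)

col_max
# [1]  7  8 20
col_min
# [1] 3 1 7
```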
If instead this is needed for every 5 rows of the dataset columns, use gl to create a grouping index for each block of 5 rows, and then apply colMaxs, colMins or colMeans to each block with the help of by
by(data, list(gr=as.numeric(gl(nrow(data), 5, nrow(data)))),
FUN = function(x) colMaxs(as.matrix(x)))
The same way, we can find the colMins or colMeans
by(data, list(gr=as.numeric(gl(nrow(data), 5, nrow(data)))),
FUN = function(x) colMins(as.matrix(x)))
by(data, list(gr=as.numeric(gl(nrow(data), 5, nrow(data)))),
FUN = function(x) colMeans(as.matrix(x)))
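As a rough base R alternative for a single column, the grouping index can also be built with integer division and passed to tapply. The sketch below assumes the example data from the data section and collects the min, mean and max of Pb into one data frame; the names grp and stats are only illustrative.

```r
set.seed(24)
data <- data.frame(Pb = sample(1:9, 42, replace = TRUE), Ps = rnorm(42))

# Rows 1-5 -> group 1, rows 6-10 -> group 2, ...; the last group may be shorter
grp <- (seq_len(nrow(data)) - 1) %/% 5 + 1

# One row per block of 5 rows, with min, mean and max of Pb
stats <- data.frame(
  gr   = sort(unique(grp)),
  min  = as.vector(tapply(data$Pb, grp, min)),
  mean = as.vector(tapply(data$Pb, grp, mean)),
  max  = as.vector(tapply(data$Pb, grp, max))
)
stats
```

With 42 rows this gives 9 groups, the last one holding only 2 rows.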
The above can be done in a compact way with dplyr
library(dplyr)
data %>%
group_by(gr = as.numeric(gl(nrow(.), 5, nrow(.)))) %>%
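# across() replaces the deprecated summarise_each(funs(...)) in dplyr >= 1.0.0
summarise(across(everything(), list(min = min, max = max, mean = mean)))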
For the plotting, we can perhaps extend this with ggplot
library(ggplot2)
library(tidyr)
data %>%
group_by(gr = as.numeric(gl(nrow(.), 5, nrow(.)))) %>%
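# across() replaces the deprecated summarise_each(funs(...)) in dplyr >= 1.0.0
summarise(across(everything(), list(min = min, max = max, mean = mean))) %>%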
gather(Var, Val, -gr) %>%
separate(Var, into = c("Var1", "Var2")) %>%
ggplot(., aes(x=factor(gr), y=Val, fill=Var2)) +
geom_bar(stat="identity")+
facet_wrap(~Var1)
data
mat <- matrix(c(3,1,20,5,4,12,6,2,9,7,8,7), byrow=T, ncol=3)
set.seed(24)
data <- data.frame(Pb = sample(1:9, 42, replace=TRUE), Ps = rnorm(42))
A nice function for this would be the base by function combined with apply. Below is an example where you first make an index of the groups for your function:
m <- matrix(runif(130*2),130,2)
group <- rep(seq(nrow(m)), each=5, length.out=nrow(m))
res <- by(m, INDICES = group, FUN = function(x){apply(x, MARGIN=2, FUN=max)})
class(res) # "by" class
do.call(rbind, res) # matrix
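On the error in the question: apply needs an object with dimensions, so calling it on a plain vector such as data$Pb stops with "dim(X) must have a positive length". A minimal sketch of the likely fix, assuming the column length is a multiple of 5 (as it is with 130 rows), is to reshape the column into a 5-row matrix first, just as in the colMeans call. The vector x below is made-up data.

```r
# Hypothetical column of 10 values, i.e. two blocks of 5 rows
x <- c(4, 8, 1, 9, 3, 7, 2, 6, 5, 10)

# Each column of m5 holds one block of 5 consecutive values
m5 <- matrix(x, nrow = 5)
block_max <- apply(m5, 2, max)
block_min <- apply(m5, 2, min)

# Calling apply on the bare vector reproduces the error from the question
err <- tryCatch(apply(x, 2, max), error = function(e) conditionMessage(e))

block_max
# [1]  9 10
block_min
# [1] 1 2
```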
I have a data frame of 200*1000 rows and 6 columns. I want to calculate the correlation between 2 columns, cor(df$y1, df$y2), for every 200 rows, so that I get 1000 different correlation values as a result.
When I wanted to calculate the sums of every 200 rows I could simply use
rowsum(df,rep(1:1000,each=200))
but there is no command in R such as rowcor that I could use equivalently for correlations.
We may use a group by approach
by(df[c('y1', 'y2')], as.integer(gl(nrow(df), 200, nrow(df))),
FUN = function(x) cor(x$y1, x$y2))
Or using tidyverse
library(dplyr)
out <- df %>%
group_by(grp = as.integer(gl(n(), 200, n()))) %>%
summarise(Cor = cor(y1, y2))
> dim(out)
[1] 1000 2
data
set.seed(24)
df <- as.data.frame(matrix(rnorm(200 *1000 * 6), ncol = 6))
names(df)[1:2] <- c('y1', 'y2')
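The same block-wise correlation can be sketched in base R with split and vapply. The example below uses a small made-up data frame with blocks of 5 rows rather than 200, and assumes the number of rows is a multiple of the block size.

```r
set.seed(24)
df <- data.frame(y1 = rnorm(20), y2 = rnorm(20))

# Block index: 1,1,1,1,1,2,2,2,2,2,... (block size 5 is only for illustration)
grp <- rep(seq_len(nrow(df) / 5), each = 5)

# One correlation per block; vapply guarantees a numeric vector back
cors <- vapply(split(df, grp), function(d) cor(d$y1, d$y2), numeric(1))
length(cors)
# [1] 4
```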
I want to write a for loop in R that replaces the NA values in one column of my data frame with the mean of the values in the same column when 2 conditions are true.
When conditions are met, I want to assign the mean to NAs using observations from the same year and from the same group.
I wrote the following code, but I am struggling to write the conditions.
missing <- which(is.na(df$price))
for (i in 1:36){
x <- df[missing,]$group
y <- df[missing,]$year
selection <- df[conditions??,]$price
df[missing,]$price <- mean(selection, na.rm = TRUE)
}
You don't need a for loop; you can replace all the NAs directly, using mean(, na.rm=T) to calculate the mean of the column without the NAs. This is for the general case (ignoring group and year):
df[is.na(df$price),]$price <- mean(df$price, na.rm = TRUE)
Using tidyverse you can achieve what you want:
library(tidyverse)
df %>% group_by(group, year) %>% mutate(price=ifelse(is.na(price), mean(price, na.rm=T), price))
Using data.table
library(data.table)
dt <- data.table(df)
dt[,price:=fifelse(is.na(price), mean(price, na.rm=T), price), by=.(group,year)][]
A base R solution using by, which splits a data frame by the groups in the list in the second argument, and applies a function defined in the third:
result <- by(df,
list(df[["group"]], df[["year"]]),
function(x) {
x[is.na(x$price), "price"] <- mean(x[["price"]], na.rm = TRUE)
x
},
simplify = TRUE)
do.call(rbind, result)
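Another base R possibility, which keeps the rows in their original order, is ave with both grouping columns. This is only a sketch on made-up data; fill_mean is a hypothetical helper name, and a group whose prices are all NA would end up with NaN.

```r
# Made-up example data with one NA per group/year combination
df <- data.frame(
  group = c("a", "a", "a", "b", "b", "b"),
  year  = c(2000, 2000, 2000, 2001, 2001, 2001),
  price = c(10, NA, 20, 5, 7, NA)
)

# Replace the NAs of one group with that group's mean (NaN if all are NA)
fill_mean <- function(p) {
  p[is.na(p)] <- mean(p, na.rm = TRUE)
  p
}

# ave applies fill_mean within each group/year combination, keeping row order
df$price <- ave(df$price, df$group, df$year, FUN = fill_mean)
df$price
# [1] 10 15 20  5  7  6
```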
Let's say I make a dummy dataframe with 6 columns with 10 observations:
X <- data.frame(a=1:10, b=11:20, c=21:30, d=31:40, e=41:50, f=51:60)
I need to create a loop that evaluates 3 columns at a time, adding the summed second and third columns and dividing this by the sum of the first column:
(sum(b)+sum(c))/sum(a) ... (sum(e)+sum(f))/sum(d) ...
I then need to construct a final dataframe from these values. For example using the dummy dataframe above, it would look like:
value
1. 7.454545
2. 2.84507
I imagine I need to use the next function to iterate within the loop, but I'm fairly lost! Thank you for any help.
You can split your data frame into groups of 3 by creating a vector with rep where each element repeats 3 times. Then with this list of sub data frames, (s)apply the function of summing the second and third columns, adding them, and dividing by the sum of the first column.
out_vec <-
sapply(
split.default(X, rep(1:ncol(X), each = 3, length.out = ncol(X)))
, function(x) (sum(x[2]) + sum(x[3]))/sum(x[1]))
data.frame(value = out_vec)
# value
# 1 7.454545
# 2 2.845070
You could also sum all the columns up front before the sapply with colSums, which will be more efficient.
out_vec <-
sapply(
split(colSums(X), rep(1:ncol(X), each = 3, length.out = ncol(X)))
, function(x) (x[2] + x[3])/x[1])
data.frame(value = out_vec, row.names = NULL)
# value
# 1 7.454545
# 2 2.845070
You could use tapply:
tapply(colSums(X), gl(ncol(X)/3, 3), function(x)sum(x[-1])/x[1])
1 2
7.454545 2.845070
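When the number of columns is an exact multiple of 3, the column sums can also be reshaped into a 3-row matrix so that each matrix column holds one block. This is a small sketch of that idea, not a replacement for the answers above; the names sums and value are illustrative.

```r
X <- data.frame(a = 1:10, b = 11:20, c = 21:30, d = 31:40, e = 41:50, f = 51:60)

# Column 1 of sums = sums of a, b, c; column 2 = sums of d, e, f
sums <- matrix(colSums(X), nrow = 3)

# (second + third) / first, computed for all blocks at once
value <- (sums[2, ] + sums[3, ]) / sums[1, ]
data.frame(value = value)
#      value
# 1 7.454545
# 2 2.845070
```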
Here is an option with tidyverse
library(dplyr) # 1.0.0
library(tidyr)
X %>%
summarise(across(everything(), sum)) %>%
pivot_longer(everything()) %>%
group_by(grp = as.integer(gl(n(), 3, n()))) %>%
summarise(value = sum(lead(value)/first(value), na.rm = TRUE)) %>%
select(value)
# A tibble: 2 x 1
# value
# <dbl>
#1 7.45
#2 2.85
I am running a very big simulation with 10,000 loops, and to achieve better performance I have changed the code to generate matrix outputs rather than data frames. It runs faster. However, for each loop I need to summarize the outputs. For example, for the matrix below,
mtx <- matrix(data = c(rep(c(1, 2), each = 6),
rep(c(3, 5, 7), each = 4),
rnorm(0, 1, n = 12)),
ncol = 3)
colnames(mtx) <- c("A", "B", "Value")
I want to summarize the number of observations in each A and B group, and calculate the mean values of Value, like the way that you could do with group_by() and summarize() in dplyr, if it's a data frame:
mtx %>% group_by(A, B) %>% summarize(N = n(), MEAN = mean(Value))
Are there any functions/packages to do this directly on a matrix, without transforming it into a data frame? Because the simulation is too big, collecting all the raw outputs and summarizing after the for loop is not an option.
You can use aggregate directly on a matrix.
aggregate(Value ~ A + B, data=mtx, mean)
A B Value
1 1 3 -0.2282783
2 1 5 0.5021404
3 2 5 -0.1693665
4 2 7 0.5118390
An option with tapply
tapply(mtx[, 'Value'], list(mtx[, 'A'], mtx[, 'B']), FUN = mean)
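Since the question asks for both the group counts and the means, here is a rough sketch that stays entirely in matrix form using base rowsum. It assumes the mtx layout from the question; the combined key built with paste and the column names N, SUM and MEAN are choices made for this example.

```r
set.seed(1)
mtx <- matrix(data = c(rep(c(1, 2), each = 6),
                       rep(c(3, 5, 7), each = 4),
                       rnorm(12)),
              ncol = 3)
colnames(mtx) <- c("A", "B", "Value")

# One key per A/B combination, e.g. "1_3"
key <- paste(mtx[, "A"], mtx[, "B"], sep = "_")

# Summing a column of ones gives the group size; summing Value gives the total
sums <- rowsum(cbind(N = 1, SUM = mtx[, "Value"]), group = key)

# Result is a plain numeric matrix with one row per group
res <- cbind(N = sums[, "N"], MEAN = sums[, "SUM"] / sums[, "N"])
res
```

The row names of res carry the group keys, so no data frame is created at any point.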
Consider the following -
set.seed(1)
x <- runif(100)
y <- sample(c('M', 'F', 'D'), 100, TRUE)
aveResult <- ave(x = x, y, FUN = sum)
tapplyResult <- tapply(x, y, sum)
aveResult <- setNames(aveResult, y)
tapplyResult
aveResult[!duplicated(names(aveResult))]
The results of both the functions are identical except for the length of their outputs. Furthermore, this also creates confusion (exacerbated due to recycling) as in this case.
Is there an example where one of the functions can do something the other can't?
ave is a very useful base R function which is fast and efficient for creating new columns by applying a function by group (below is a simple example that creates a mean-by-group column using ave, dplyr and data.table methods).
set.seed(24)
df1 <- data.frame(grp = sample(LETTERS, 1e6, replace = TRUE), val = rnorm(1e6))
system.time(with(df1, ave(val, grp)))
# user system elapsed
# 0.070 0.004 0.073
library(dplyr)
system.time(df1 %>%
group_by(grp) %>%
mutate(new = mean(val)))
# user system elapsed
# 0.159 0.000 0.160
library(data.table)
system.time(setDT(df1)[, new := mean(val), by = grp])
# user system elapsed
# 0.056 0.000 0.057
tapply, in contrast, gives a summarised output. One of the main advantages of ave is that we don't have to worry about the order of the output, as it always returns values in the same order as the input rows. This order can change even with some tidyverse functions. Whether the sorted unique values from ave always equal the tapply result depends on the function. For some functions, tapply returns a summarised list output
tapply(1:10, rep(LETTERS[1:3], c(3, 3, 4)), FUN = range)
whereas ave does not work here, because the length-2 result of range does not match the length of each group
ave(1:10, rep(LETTERS[1:3], c(3, 3, 4)), FUN = range)
and gives a warning
Just to add one more option in this particular case: There is also by(x, y, FUN = sum).
As a supplement to @akrun's excellent post, here is a short breakdown of the output differences between ave, tapply and by, given the OP's example data:
ave(x, y, FUN = sum) replaces x entries with group-summed values, where every group consists of those x values with the same y component. The return object is a vector of length length(x).
tapply(x, y, sum) sums x values for every group; the return object is a one-dimensional array that has as many entries as y has unique groups.
by(x, y, sum) also sums x values for every group; the return object is of class "by" and has as many entries as y has unique groups.
Perhaps another way to think about the difference between ave vs. tapply/by is in the context of dplyr's syntax:
ave corresponds to a group_by+mutate statement:
data.frame(x, y) %>% group_by(y) %>% mutate(x = sum(x)) %>% pull(x)
tapply/by corresponds to a group_by+summarise statement:
data.frame(x, y) %>% group_by(y) %>% summarise(x = sum(x)) %>% pull(x)
As quite rightly emphasised by @Onyambu, by and tapply are quite different; tapply works on vectors, while by can take any object (typically a data.frame, matrix, etc.).
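To make the difference in output shape concrete, here is a short sketch on the example data from the question. The variable names ave_res, tap_res and by_res are only for illustration.

```r
set.seed(1)
x <- runif(100)
y <- sample(c("M", "F", "D"), 100, TRUE)

ave_res <- ave(x, y, FUN = sum)   # one value per element of x
tap_res <- tapply(x, y, sum)      # one value per group, named by group
by_res  <- by(x, y, sum)          # same sums, returned as a "by" object

length(ave_res)   # 100, same length as x
length(tap_res)   # 3, one per group
class(by_res)     # "by"

# The group sums agree: pick the first ave value of each group and compare
first_of_group <- ave_res[match(names(tap_res), y)]
all.equal(first_of_group, as.vector(tap_res))
```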