Computing Colwise Means on a Given Interval - r

I have a data frame in R that can be approximated as:
df <- data.frame(x = rep(1:5, each = 4), y = rep(2:6, each = 4), z = rep(3:7, each = 4))
> df
x y z
1 1 2 3
2 1 2 3
3 1 2 3
4 1 2 3
5 2 3 4
6 2 3 4
7 2 3 4
8 2 3 4
9 3 4 5
10 3 4 5
11 3 4 5
12 3 4 5
13 4 5 6
14 4 5 6
15 4 5 6
16 4 5 6
17 5 6 7
18 5 6 7
19 5 6 7
20 5 6 7
I'd like to compute colwise means at intervals of 5, and then collapse these means into a new data frame. For example, I'd like to compute the colwise means of df[1:5,], df[6:10,], df[11:15,], and df[16:20,], and return a df that looks as follows:
[,1] [,2] [,3]
[1,] 1.2 2.2 3.2
[2,] 2.4 3.4 4.4
[3,] 3.6 4.6 5.6
[4,] 4.8 5.8 6.8
I'm currently using a for-loop as such (where temp.coeff would correspond to the "5" specified above):
my.means <- NULL
for (j in 1:baseFreq) {
temp.mean <- colMeans(temp.df[(temp.coeff*(j-1)+1):(temp.coeff*j),])
my.means <- rbind(my.means, temp.mean)
}
my.means <- t(my.means)
collapsed.df <- t(data.frame(colMeans(my.means)))
}
...but I feel like there's an apply statement that could do the job a lot more efficiently. In addition, while the above data frame only has 20 rows, the ones I'll be working on will have several thousand. Thoughts?
Many thanks in advance SO.

aggregate can do this if you aggregate against an appropriate running index. You do end up with another column in the result (which can be removed).
aggregate(. ~ rep(seq(nrow(df)/5), each=5), data=df, FUN=mean)
## rep(seq(nrow(df)/5), each = 5) x y z
## 1 1 1.2 2.2 3.2
## 2 2 2.4 3.4 4.4
## 3 3 3.6 4.6 5.6
## 4 4 4.8 5.8 6.8
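The running index can also be built with integer division, which still works when the number of rows is not a multiple of the interval. This is only a sketch: `block_id` is a hypothetical helper name, not part of base R, and the extra grouping column is dropped at the end.

```r
# Example data from the question
df <- data.frame(x = rep(1:5, each = 4),
                 y = rep(2:6, each = 4),
                 z = rep(3:7, each = 4))

# Hypothetical helper: 1,1,1,1,1,2,2,... so that a final block shorter
# than `size` still gets its own group
block_id <- function(n, size) (seq_len(n) - 1) %/% size + 1

grp <- block_id(nrow(df), 5)

# Column means per block; the grouping column is named "block"
res <- aggregate(df, by = list(block = grp), FUN = mean)

# Drop the helper column so only x, y, z remain
res$block <- NULL
res
```

With 7 rows and an interval of 5, `block_id(7, 5)` puts the first five rows in block 1 and the remaining two in block 2, so the last mean is taken over fewer rows.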

I really think data.table works great for situations like this. It is fast and easy.
require("data.table")
dt <- data.table(df)
dt[,row.num:=.I]
dt[,lapply(.SD,mean),by=list(interval=cut(row.num,seq(0,nrow(dt),by=5)))]
# interval x y z
# 1: (0,5] 1.2 2.2 3.2
# 2: (5,10] 2.4 3.4 4.4
# 3: (10,15] 3.6 4.6 5.6
# 4: (15,20] 4.8 5.8 6.8

This is a possible solution with a combination of apply and sapply:
apply(df, 2, function(x) sapply(seq(1,nrow(df),5), function(y) mean(x[y:(y+4)])))
# x y z
#[1,] 1.2 2.2 3.2
#[2,] 2.4 3.4 4.4
#[3,] 3.6 4.6 5.6
#[4,] 4.8 5.8 6.8
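Edit after comment by @jbaums: depending on the desired behavior, you might want to add na.rm = TRUE to the mean calculation. For example, if the number of rows is not a multiple of 5, the last window x[y:(y+4)] runs past the end of the column and picks up NA values: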
apply(df, 2, function(x) sapply(seq(1,nrow(df),5), function(y) mean(x[y:(y+4)], na.rm = TRUE)))
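Since the question mentions several thousand rows, a vectorised alternative may be worth trying. The sketch below is not from the answers above: it sums each block with base R's rowsum() and divides by the block sizes, assuming all columns are numeric.

```r
# Example data from the question
df <- data.frame(x = rep(1:5, each = 4),
                 y = rep(2:6, each = 4),
                 z = rep(3:7, each = 4))

size <- 5

# Block index per row: 1 for rows 1-5, 2 for rows 6-10, and so on
grp <- (seq_len(nrow(df)) - 1) %/% size + 1

# Per-block column sums divided by the number of rows in each block;
# the count vector is recycled down each column of the matrix
block_means <- rowsum(as.matrix(df), grp) / tabulate(grp)
block_means
```

The result is a matrix with one row per block. Because the divisor comes from the actual block sizes, a shorter final block is averaged over its own rows rather than padded with NA.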

Related

Create a new row from the average of specific rows from all columns [duplicate]

Let's say I have a dataset.
w=c(5,6,7,8)
x=c(1,2,3,4)
y=c(1,2,3,5)
length(y)=4
z=data.frame(w,x,y)
This will return
w x y
1 5 1 1
2 6 2 2
3 7 3 3
4 8 4 5
I would like to have a 5th row that averages only row 2 and 3.
w x y
1 5 1 1
2 6 2 2
3 7 3 3
4 8 4 5
5 6.5 2.5 2.5
How would I approach this?
There are a lot of examples with rowMeans, but I'm looking to average all columns, and from only specific rows.
You can use colMeans on the selected rows and bind the result as a new row:
rows <- c(2, 3)
rbind(z, colMeans(z[rows,]))
# w x y
#1 5.0 1.0 1.0
#2 6.0 2.0 2.0
#3 7.0 3.0 3.0
#4 8.0 4.0 5.0
#5 6.5 2.5 2.5
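If the same step is needed for different row sets, it can be wrapped in a small function. This is a sketch only; `add_mean_row` is a hypothetical name, and it assumes every column is numeric.

```r
# Example data from the question
z <- data.frame(w = c(5, 6, 7, 8),
                x = c(1, 2, 3, 4),
                y = c(1, 2, 3, 5))

# Hypothetical helper: append one row holding the column means of `rows`
add_mean_row <- function(data, rows) {
  rbind(data, colMeans(data[rows, , drop = FALSE]))
}

out <- add_mean_row(z, c(2, 3))
out
```

The `drop = FALSE` keeps the subset a data frame even when `data` has a single column, so colMeans still receives a two-dimensional object.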
Does this work:
library(dplyr)
z %>% bind_rows(sapply(z[2:3,], mean))
w x y
1 5.0 1.0 1.0
2 6.0 2.0 2.0
3 7.0 3.0 3.0
4 8.0 4.0 5.0
5 6.5 2.5 2.5

r rbind dataframes in each list using lapply function

I want to add some data points. odtl is the original data and adtl is the data points to add. adtl is set to NA but will be interpolated by zoo::na.spline after rbind.
During this process, the two lists (odtl and adtl) contain three data frames each. I want to combine the data frames in the order in which they are loaded into each list.
I managed to do this with a for loop, as shown below, but my lapply attempt doesn't work. Could you rewrite this loop using lapply or another function from the apply family?
Thanks.
> odtl # original dataset
[[1]]
x index
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5
[[2]]
x index
1 1 1
2 2 2
3 3 3
4 4 4
[[3]]
x index
1 1 1
2 2 2
3 3 3
> adtl # dataset for add
[[1]]
x index
1 NA 1.5
[[2]]
x index
1 NA 1.5
2 NA 2.5
3 NA 3.5
[[3]]
x index
1 NA 1.5
2 NA 2.5
> wdtl <- list() # This is the goal.
> for(i in 1:length(odtl)){
+ wdtl[[i]] <- rbind(odtl[[i]], adtl[[i]])
+ }
> wdtl # This is the goal but I want complete it by lapply or something
[[1]]
x index
1 1 1.0
2 2 2.0
3 3 3.0
4 4 4.0
5 5 5.0
6 NA 1.5
[[2]]
x index
1 1 1.0
2 2 2.0
3 3 3.0
4 4 4.0
5 NA 1.5
6 NA 2.5
7 NA 3.5
[[3]]
x index
1 1 1.0
2 2 2.0
3 3 3.0
4 NA 1.5
5 NA 2.5
You may use Map(), which applies a function element-wise to the corresponding elements of its arguments: the first elements of each, then the second elements, and so on.
Map(rbind, odtl, adtl)
# [[1]]
# x index
# 1 1 1.0
# 2 2 2.0
# 3 3 3.0
# 4 4 4.0
# 5 5 5.0
# 6 NA 1.5
# 7 NA 2.5
# 8 NA 3.5
# 9 NA 4.5
# 10 NA 5.5
#
# [[2]]
# x index
# 1 1 1.0
# 2 2 2.0
# 3 3 3.0
# 4 4 4.0
# 5 NA 1.5
# 6 NA 2.5
# 7 NA 3.5
# 8 NA 4.5
#
# [[3]]
# x index
# 1 1 1.0
# 2 2 2.0
# 3 3 3.0
# 4 NA 1.5
# 5 NA 2.5
# 6 NA 3.5
Data
odtl <- list(data.frame(x=1:5, index=1:5),
data.frame(x=1:4, index=1:4),
data.frame(x=1:3, index=1:3))
adtl <- list(data.frame(x=NA, index=seq(1.5, 5.5, 1)),
data.frame(x=NA, index=seq(1.5, 4.5, 1)),
data.frame(x=NA, index=seq(1.5, 3.5, 1)))
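I think the solution in the comments by @thelatemail is the most elegant one. If you want to use lapply, then the following should do what you want. Note that lapply is needed here rather than sapply, because sapply would try to simplify the list of data frames into a matrix:
wdtl <- lapply(seq_along(odtl), function(k) rbind(odtl[[k]], adtl[[k]]))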
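As a quick check, the lapply and Map versions can be compared on the sample data. The sketch below also sorts each combined data frame by index, which is an assumption about what the later interpolation step needs, not something stated in the question.

```r
# Sample data as given in the Map() answer
odtl <- list(data.frame(x = 1:5, index = 1:5),
             data.frame(x = 1:4, index = 1:4),
             data.frame(x = 1:3, index = 1:3))
adtl <- list(data.frame(x = NA, index = seq(1.5, 5.5, 1)),
             data.frame(x = NA, index = seq(1.5, 4.5, 1)),
             data.frame(x = NA, index = seq(1.5, 3.5, 1)))

# Index-based lapply: pair the k-th data frame of each list
by_index <- lapply(seq_along(odtl), function(k) rbind(odtl[[k]], adtl[[k]]))

# Map walks both lists in parallel and always returns a list
by_map <- Map(rbind, odtl, adtl)

# Both approaches should produce the same list of data frames
all.equal(by_index, by_map)

# Assumed follow-up step: order rows by index so each NA row sits
# between its neighbouring observations
sorted <- lapply(by_map, function(d) d[order(d$index), ])
```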
Specifically from the lapply, apply etc. family of functions, you could use mapply
> odtl <- list(data.frame(x=1:5, index=1:5),
data.frame(x=1:4, index=1:4),
data.frame(x=1:3, index=1:3))
> adtl <- list(data.frame(x=NA, index=seq(1.5, 5.5, 1)),
data.frame(x=NA, index=seq(1.5, 4.5, 1)),
data.frame(x=NA, index=seq(1.5, 3.5, 1)))
> mapply(rbind, odtl, adtl, SIMPLIFY = FALSE)
# [[1]]
# x index
# 1 1 1.0
# 2 2 2.0
# 3 3 3.0
# 4 4 4.0
# 5 5 5.0
# 6 NA 1.5
# 7 NA 2.5
# 8 NA 3.5
# 9 NA 4.5
# 10 NA 5.5
#
# [[2]]
# x index
# 1 1 1.0
# 2 2 2.0
# 3 3 3.0
# 4 4 4.0
# 5 NA 1.5
# 6 NA 2.5
# 7 NA 3.5
# 8 NA 4.5
#
# [[3]]
# x index
# 1 1 1.0
# 2 2 2.0
# 3 3 3.0
# 4 NA 1.5
# 5 NA 2.5
# 6 NA 3.5
Note that Map is a wrapper around mapply(FUN = f, ..., SIMPLIFY = FALSE).

In data.table in R, how can I create a new column in a data table with values matching and corresponding to a column element and another table? [duplicate]

I have two tables currently:
A B
3.3 10
2.5 11
6.7 11
6.0 12
5.4 12
3.5 12
6.5 13
8.0 13
and
B Val
10 0
11 1
12 2
13 3
What I would like to do is create a new column C in the first table that contains the value Val from the second table wherever B in the first table matches B in the second. I would like to obtain:
A B C
3.3 10 0
2.5 11 1
6.7 11 1
6.0 12 2
5.4 12 2
3.5 12 2
6.5 13 3
8.0 13 3
The example code is:
DT.1 <- data.table(A=c(3.3,2.5,6.7,6.0,5.4,3.5,6.5,8.0), B=c(10,11,11,12,12,12,13,13))
DT.2 <- data.table(B=c(10,11,12,13),Val=c(0,1,2,3))
Thanks for any hints or inputs.
The joining part is almost certainly a duplicate. I included this answer because some renaming and reordering is also being done.
library(data.table)
dt1 <- fread("A B
3.3 10
2.5 11
6.7 11
6.0 12
5.4 12
3.5 12
6.5 13
8.0 13", header = TRUE)
dt2 <- fread("B Val
10 0
11 1
12 2
13 3", header = TRUE)
result <- dt2[dt1, on = .(B)]
setcolorder(result, c("A", "B", "Val") )
setnames(result, old = "Val", new = "C")
# A B C
# 1: 3.3 10 0
# 2: 2.5 11 1
# 3: 6.7 11 1
# 4: 6.0 12 2
# 5: 5.4 12 2
# 6: 3.5 12 2
# 7: 6.5 13 3
# 8: 8.0 13 3
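For comparison, the same lookup can be sketched in base R with match(), which avoids the join and the column reordering. This is not the data.table approach shown above; the names `first` and `lookup` are made up for the example, and it assumes each B appears only once in the second table.

```r
# The two tables from the question, as plain data frames
first  <- data.frame(A = c(3.3, 2.5, 6.7, 6.0, 5.4, 3.5, 6.5, 8.0),
                     B = c(10, 11, 11, 12, 12, 12, 13, 13))
lookup <- data.frame(B = c(10, 11, 12, 13),
                     Val = c(0, 1, 2, 3))

# match() gives, for each B in `first`, the row position of that B in `lookup`;
# indexing Val with those positions yields the new column
first$C <- lookup$Val[match(first$B, lookup$B)]
first
```

Rows whose B has no counterpart in the lookup table would get NA in C, which mirrors what a left join does.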

R - Count duplicated rows keeping index of their first occurrences

I have been looking for an efficient way of counting and removing duplicate rows in a data frame while keeping the index of their first occurrences.
For example, if I have a data frame:
df<-data.frame(x=c(9.3,5.1,0.6,0.6,8.5,1.3,1.3,10.8),y=c(2.4,7.1,4.2,4.2,3.2,8.1,8.1,5.9))
ddply(df,names(df),nrow)
gives me
x y V1
1 0.6 4.2 2
2 1.3 8.1 2
3 5.1 7.1 1
4 8.5 3.2 1
5 9.3 2.4 1
6 10.8 5.9 1
But I want to keep the original indices (along with the row names) of the first occurrences of the duplicated rows, like this:
x y V1
1 9.3 2.4 1
2 5.1 7.1 1
3 0.6 4.2 2
5 8.5 3.2 1
6 1.3 8.1 2
8 10.8 5.9 1
"duplicated" returns the original row names (here {1 2 3 5 6 8}) but doesn't count the number of occurrences. I tried writing functions of my own, but none of them are efficient enough to handle big data. My data frame can have up to a couple of million rows (though there are usually only 5 to 10 columns).
If you want to keep the index:
library(data.table)
setDT(df)[,.(.I, .N), by = names(df)][!duplicated(df)]
# x y I N
#1: 9.3 2.4 1 1
#2: 5.1 7.1 2 1
#3: 0.6 4.2 3 2
#4: 8.5 3.2 5 1
#5: 1.3 8.1 6 2
#6: 10.8 5.9 8 1
Or using data.table's unique method:
unique(setDT(df)[,.(.I, .N), by = names(df)], by = names(df))
We can try with data.table. We convert the 'data.frame' to a 'data.table' (setDT(df)) and, grouped by the 'x' and 'y' columns, get the number of rows in each group (.N).
library(data.table)
setDT(df)[, list(V1=.N), by = .(x,y)]
# x y V1
#1: 9.3 2.4 1
#2: 5.1 7.1 1
#3: 0.6 4.2 2
#4: 8.5 3.2 1
#5: 1.3 8.1 2
#6: 10.8 5.9 1
If we need the row ids,
setDT(df)[, list(V1= .N, rn=.I[1L]), by = .(x,y)]
# x y V1 rn
#1: 9.3 2.4 1 1
#2: 5.1 7.1 1 2
#3: 0.6 4.2 2 3
#4: 8.5 3.2 1 5
#5: 1.3 8.1 2 6
#6: 10.8 5.9 1 8
Or
setDT(df, keep.rownames=TRUE)[, list(V1=.N, rn[1L]), .(x,y)]
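If data.table is not available, a base R sketch along the same lines is possible. It builds one string key per row, which is an assumption that suits this sample data; for numeric columns with many significant digits, the pasted keys may not distinguish values reliably.

```r
# Example data from the question
df <- data.frame(x = c(9.3, 5.1, 0.6, 0.6, 8.5, 1.3, 1.3, 10.8),
                 y = c(2.4, 7.1, 4.2, 4.2, 3.2, 8.1, 8.1, 5.9))

# One string key per row, built from all columns
key <- do.call(paste, c(df, sep = "\r"))

# Group size for every row, in the original row order
n <- ave(seq_along(key), key, FUN = length)

# Keep only first occurrences; subsetting preserves the original row names
counted <- cbind(df, V1 = n)[!duplicated(key), ]
counted
```

The row names of `counted` are the positions of the first occurrences, so no separate index column is needed.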

How to merge tables in R?

I think this will have a simple answer, but I can't work it out! Here is an example using the iris dataset:
a <- table(iris[,2])
b <- table(iris[,3])
How do I add these two tables together? For example, the variable 3 would have a value of 27 (26+1) and variable 3.3 a value of 8 (6+2) in the new output table.
Any help much appreciated.
This will work if you want to use the variables which are present in both a and b:
n <- intersect(names(a), names(b))
a[n] + b[n]
# 3 3.3 3.5 3.6 3.7 3.8 3.9 4 4.1 4.2 4.4
# 27 8 8 5 4 7 5 6 4 5 5
If you want to use all variables:
n <- intersect(names(a), names(b))
res <- c(a[!(names(a) %in% n)], b[!(names(b) %in% n)], a[n] + b[n])
res[order(names(res))] # sort the results
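The "all variables" case can also be sketched by concatenating the two tables into one named vector and summing entries that share a name. This is an alternative to the code above rather than a restatement of it; note that tapply orders the result by name as text, not numerically.

```r
# The two tables from the question
a <- table(iris[, 2])
b <- table(iris[, 3])

# c() on one-dimensional tables keeps the names, giving a named vector
both <- c(a, b)

# Sum all counts that share the same name; names present in only one
# table keep their original count
res <- tapply(both, names(both), sum)

# Spot-check the two values mentioned in the question
res[c("3", "3.3")]
```

Since each table counts all 150 rows of iris, the combined counts should add up to 300, which is a simple sanity check on the result.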
temp<-merge(a,b,by='Var1')
temp$sum<-temp$Freq.x + temp$Freq.y
Var1 Freq.x Freq.y sum
1 3 26 1 27
2 3.3 6 2 8
3 3.5 6 2 8
4 3.6 4 1 5
5 3.7 3 1 4
6 3.8 6 1 7
7 3.9 2 3 5
8 4 1 5 6
9 4.1 1 3 4
10 4.2 1 4 5
11 4.4 1 4 5
Here is another one:
transform(merge(a,b, by="Var1"), sum=Freq.x + Freq.y)
Var1 Freq.x Freq.y sum
1 3 26 1 27
2 3.3 6 2 8
3 3.5 6 2 8
4 3.6 4 1 5
5 3.7 3 1 4
6 3.8 6 1 7
7 3.9 2 3 5
8 4 1 5 6
9 4.1 1 3 4
10 4.2 1 4 5
11 4.4 1 4 5
Here's a slightly tortured one-liner version of the merge() solution:
do.call(function(Var1, Freq.x, Freq.y) data.frame(Var1=Var1, Freq=rowSums(cbind(Freq.x, Freq.y))), merge(a, b, by="Var1"))
Here's the one if you want to use all variables:
do.call(function(Var1, Freq.x, Freq.y) data.frame(Var1=Var1, Freq=rowSums(cbind(Freq.x, Freq.y), na.rm=TRUE)), merge(a, b, by="Var1", all=TRUE))
Unlike the transform() one-liner, it doesn't accumulate .x and .y so it can be used iteratively.
The merge function of the data.table package may be what you want: https://rpubs.com/ronasta/join_data_tables
