i'm working on how to find sum of products of two dataframes.
data<-w1 w2 w3 w4
4 6 8 5
where w1 w2 w3 w4 are column names
and I have one more dataframe
data2<-p1 p2 p3 p4
3 4 5 6
5 6 8 4
4 6 6 8
3 5 8 9
my result should be like this:
result <- w1*P1+w2*p2+w3*p3*w4*p4
result1 <- 4*3+6*4+8*5+5*6 # result on row 1
result2 <- 4*5+6*6+8*8+5*4 # result on row 2
and so on for each row in data2
how to do this in general
Thanks
Fastest way is to come back to R linear algebra (even more is you have big data.frame's):
> as.matrix(data2) %*% unlist(data)
# [,1]
#[1,] 106
#[2,] 140
#[3,] 140
#[4,] 151
Or sweep:
> rowSums(sweep(as.matrix(data2), 2, unlist(data), `*`))
#[1] 106 140 140 151
Data
data=data.frame(a=4,b=6,c=8,d=5)
data2=data.frame(a=c(3,5,4,3),b=c(4,6,6,5),c=c(5,8,6,8),d=c(6,4,8,9))
You could use mapply:
df1 <- data.frame(w1 = 4, w2 = 6, w3 = 8, w4 = 5)
df2 <- data.frame(p1 = c(3, 5, 4, 3), p2 = c(4, 6, 6, 5),
p3 = c(5, 8, 6, 8), p4 = c(6, 4, 8, 9))
This multiplies each element of df2 with each element of df1 (by element I mean column - the data frame is treated as a list in this context):
> (tmp <- mapply(`*`, df2, df1))
p1 p2 p3 p4
[1,] 12 24 40 30
[2,] 20 36 64 20
[3,] 16 36 48 40
[4,] 12 30 64 45
>sum(tmp)
[1] 537
Edit If you want to get the sum of each row from the above matrix you can use either apply(tmp, 1, sum) or rowSums:
> rowSums(tmp)
[1] 106 140 140 151
Related
I have a dataframe like this:
V1 = paste0("AB", seq(1:48))
V2 = seq(1:48)
test = data.frame(name = V1, value = V2)
I want to calculate the means of the value-column and specific rows.
The pattern of the rows is pretty complicated:
Rows of MeanA1: 1, 5, 9
Rows of MeanA2: 2, 6, 10
Rows of MeanA3: 3, 7, 11
Rows of MeanA4: 4, 8, 12
Rows of MeanB1: 13, 17, 21
Rows of MeanB2: 14, 18, 22
Rows of MeanB3: 15, 19, 23
Rows of MeanB4: 16, 20, 24
Rows of MeanC1: 25, 29, 33
Rows of MeanC2: 26, 30, 34
Rows of MeanC3: 27, 31, 35
Rows of MeanC4: 28, 32, 36
Rows of MeanD1: 37, 41, 45
Rows of MeanD2: 38, 42, 46
Rows of MeanD3: 39, 43, 47
Rows of MeanD4: 40, 44, 48
As you see its starting at 4 different points (1, 13, 25, 37) then always +4 and for the following 4 means its just stepping 1 more row down.
I would like to have an output of all these means in one list.
Any ideas? NOTE: In this example the mean is of course always the middle number, but my real df is different.
Not quite sure about the output format you require, but the following codes can calculate what you want anyhow.
calc_mean1 <- function(x) mean(test$value[seq(x, by = 4, length.out = 3)])
calc_mean2 <- function(x){sapply(x:(x+3), calc_mean1)}
output <- lapply(seq(1, 37, 12), calc_mean2)
names(output) <- paste0('Mean', LETTERS[seq_along(output)]) # remove this line if more than 26 groups.
output
## $MeanA
## [1] 5 6 7 8
## $MeanB
## [1] 17 18 19 20
## $MeanC
## [1] 29 30 31 32
## $MeanD
## [1] 41 42 43 44
An idea via base R is to create a grouping variable for every 4 rows, split the data every 12 rows (nrow(test) / 4) and aggregate to find the mean, i.e.
test$new = rep(1:4, nrow(test)%/%4)
lapply(split(test, rep(1:4, each = nrow(test) %/% 4)), function(i)
aggregate(value ~ new, i, mean))
# $`1`
# new value
# 1 1 5
# 2 2 6
# 3 3 7
# 4 4 8
# $`2`
# new value
# 1 1 17
# 2 2 18
# 3 3 19
# 4 4 20
# $`3`
# new value
# 1 1 29
# 2 2 30
# 3 3 31
# 4 4 32
# $`4`
# new value
# 1 1 41
# 2 2 42
# 3 3 43
# 4 4 44
And yet another way.
fun <- function(DF, col, step = 4){
run <- nrow(DF)/step^2
res <- lapply(seq_len(step), function(inc){
inx <- seq_len(run*step) + (inc - 1)*run*step
dftmp <- DF[inx, ]
tapply(dftmp[[col]], rep(seq_len(step), run), mean, na.rm = TRUE)
})
names(res) <- sprintf("Mean%s", LETTERS[seq_len(step)])
res
}
fun(test, 2, 4)
#$MeanA
#1 2 3 4
#5 6 7 8
#
#$MeanB
# 1 2 3 4
#17 18 19 20
#
#$MeanC
# 1 2 3 4
#29 30 31 32
#
#$MeanD
# 1 2 3 4
#41 42 43 44
Since you said you wanted a long list of the means, I assumed it could also be a vector where you just have all these values. You would get that like this:
V1 = paste0("AB", seq(1:48))
V2 = seq(1:48)
test = data.frame(name = V1, value = V2)
meanVector <- NULL
for (i in 1:(nrow(test)-8)) {
x <- c(test$value[i], test$value[i+4], test$value[i+8])
m <- mean(x)
meanVector <- c(meanVector, m)
}
Consider the following dataset:
d1 <- c(2, 3, 8)
d2 <- data.frame(d1)
d1 <- c(1, 7, 9, 10)
d3 <- data.frame(d1)
Now I want to randomly draw 3 observations (without replacement) from d3 3 times, and each time I want to merge it with d2. So I should have three merged data frames with 6 observations.
I have tried with:
for (r in 1:3)
{
sam <- sample(1:4, 3, replace = FALSE)
merge <- rbind(d2, d3[sam])
}
But this does not work. Can anyone help me?
You can try
library(tidyverse)
d1 <- data.frame(d1=c(2, 3, 8))
d2 <- data.frame(d1=c(1, 7, 9, 10))
1:3 %>%
map(~sample_n(d2, 3) %>%
bind_cols(d1))
[[1]]
d1 d11
1 1 2
2 7 3
3 9 8
[[2]]
d1 d11
1 10 2
2 1 3
3 7 8
[[3]]
d1 d11
1 9 2
2 7 3
3 10 8
I have vectors of different lengths. For instance:
df1
[1] 1 95 5 2 135 4 3 135 4 4 135 4 5 135 4 6 135 4
df2
[1] 1 70 3 2 110 4 3 112 4
I'm trying to write a script in R in order to have any vector enter the function or for loop and it returns a dataframe of three columns. So a separate dataframe for each input vector. Each vector is a multiple of three (hence, the three columns). I'm fairly new to R in terms of writing functions and can't seem to figure this out. Here was my attempt:
newdf = c()
ld <- length(df1)
ld_mult <- length(df1)/3
ld_seq <- seq(from=1,to=ld,by=3)
ld_seq2 < ld_seq +2
for (i in 1:ld_mult) {
newdf[i,] <- df1[ld_seq[i]:ld_seq2[i]]
}
the output I want for df1 would be:
1 95 5
2 135 4
3 135 4
4 135 4
5 135 4
6 135 4
Here's an example of how you could use matrix for that purpose:
x <- c(1, 95, 5,2, 135, 4, 3, 135, 4)
as.data.frame(matrix(x, ncol = 3, byrow = TRUE))
# V1 V2 V3
#1 1 95 5
#2 2 135 4
#3 3 135 4
And
y <- c(1, 70, 3, 2, 110, 4, 3, 112, 4)
as.data.frame(matrix(y, ncol = 3, byrow = TRUE))
# V1 V2 V3
#1 1 70 3
#2 2 110 4
#3 3 112 4
Or if you want to make it a custom function:
newdf <- function(vec) {
as.data.frame(matrix(vec, ncol = 3, byrow = TRUE))
}
newdf(y)
#V1 V2 V3
#1 1 70 3
#2 2 110 4
#3 3 112 4
You could also let the user specify the number of columns he wants to create with the function if you add another argument to newdf:
newdf <- function(vec, cols = 3) {
as.data.frame(matrix(vec, ncol = cols, byrow = T))
}
Now, the default number of columns is 3, if the user doesnt specify a number. If he wants to, he could use it like this:
newdf(z, 5) # to create 5 columns
Another nice little addon for the function would be a check if the input vector length is a multiple of the number of columns specified in the function call:
newdf <- function(vec, cols = 3) {
if(length(vec) %% cols != 0) {
stop("Number of columns is not a multiple of input vector length. Please double check.")
}
as.data.frame(matrix(vec, ncol = cols, byrow = T))
}
newdf(x, 4)
#Error in newdf(x, 4) :
# Number of columns is not a multiple of input vector length. Please double check.
If you had multiple vectors sitting in a list, here's how you could convert each of them to be a data.frame:
> l <- list(x,y)
> l
#[[1]]
#[1] 1 95 5 2 135 4 3 135 4
#
#[[2]]
#[1] 1 70 3 2 110 4 3 112 4
> lapply(l, newdf)
#[[1]]
# V1 V2 V3
#1 1 70 3
#2 2 110 4
#3 3 112 4
#
#[[2]]
# V1 V2 V3
#1 1 70 3
#2 2 110 4
#3 3 112 4
I have a list of prime numbers with I multiply using outer() and upper.tri() to get a unique set of numbers.
primes <- c(2, 3, 5, 7, 11, 13, 17, 19, 23, 29)
m <- outer(primes, primes, "*")
unq <- m[which(upper.tri(m))]
> unq
6 10 15 14 21 35 22 33 55 77 26 39 65 91 143 34 51 85 119 187 221 38 57 95 133 209 247 323 46 69 115 161 253 299 391 437 58 87 145 203 319 377 493 551 667
Each of the original prime numbers represents a set of two numbers:
a2 <- c(1,1)
a3 <- c(1,2)
a5 <- c(2,2)
a7 <- c(1,3)
a11 <- c(1,4)
a13 <- c(2,3)
a17 <- c(2,4)
a19 <- c(3,3)
a23 <- c(3,4)
a29 <- c(4,4)
The combination of the two sets of two numbers produces 4 numbers
expand.grid(a2,a3)
1 1
1 1
1 2
1 2
So what I would like to do is have a kind of a list of lists, with each prime number having all 4 possible combinations.
I tried something like this, but I am missing some fundamentals here:
outer(a ,a , "expand.grid")
So the result would look something like this for the first prime:
6 c(11, 11, 12, 12)
I'm not sure I understand correctly, but I hope this helps:
#function to `outer`
fun <- function(x, y)
{
a1 <- get(paste0("a", x))
a2 <- get(paste0("a", y))
res <- apply(expand.grid(a1, a2), 1, paste, collapse = "")
res2 <- paste(res, collapse = ";")
return(res2)
}
#`outer` a vectorized `fun`
m2 <- outer(primes, primes, Vectorize(fun))
#select `upper.tri`
unq2 <- m2[upper.tri(m2)]
#combine to a list
myls <- lapply(as.list(unq2), function(x) as.numeric(unlist(strsplit(x, ";"))))
names(myls) <- unq
myls
#$`6`
#[1] 11 11 12 12
#$`10`
#[1] 12 12 12 12
#$`15`
#[1] 12 22 12 22
#$`14`
#[1] 11 11 13 13
#...
Every time I do an aggregate on a data.frame I default to using the "by = list(...)" parameter. But I do see solutions on stackoverflow and elsewhere where tilde (~) is used in the "formula" parameter. I kinda see the "by" parameter as the "pivot" around these variables.
In some cases, the output is exactly the same. For example:
aggregate(cbind(df$A, df$B, df$C), FUN = sum, by = list("x" = df$D, "y" = df$E))
AND
aggregate(cbind(df$A, df$B, df$C) ~ df$E, FUN = sum)
What is the difference between the two and when do you use which?
I would not entirely disagree that it doesn't really matter which approach you use, however, it is important to note that they do behave differently.
I'll illustrate with a small example.
Here's some sample data:
set.seed(1)
mydf <- data.frame(A = c(1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 4, 4),
B = LETTERS[c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2)],
matrix(sample(100, 36, replace = TRUE), nrow = 12))
mydf[3:5] <- lapply(mydf[3:5], function(x) {
x[sample(nrow(mydf), 1)] <- NA
x
})
mydf
# A B X1 X2 X3
# 1 1 A 27 69 27
# 2 1 A 38 NA 39
# 3 1 A 58 77 2
# 4 2 A 91 50 39
# 5 2 A 21 72 87
# 6 3 B 90 100 35
# 7 3 B 95 39 49
# 8 3 B 67 78 60
# 9 3 B 63 94 NA
# 10 4 B NA 22 19
# 11 4 B 21 66 83
# 12 4 B 18 13 67
First, the formula interface. The following three commands will all yield the same output.
aggregate(cbind(X1, X2, X3) ~ A + B, mydf, sum)
aggregate(cbind(X1, X2, X3) ~ ., mydf, sum)
aggregate(. ~ A + B, mydf, sum)
# A B X1 X2 X3
# 1 1 A 85 146 29
# 2 2 A 112 122 126
# 3 3 B 252 217 144
# 4 4 B 39 79 150
Here's a related command for the "by" interface. Pretty cumbersome to type (but that can be addressed by using with, if required).
aggregate(cbind(mydf$X1, mydf$X2, mydf$X3),
by = list(mydf$A, mydf$B), sum)
Group.1 Group.2 V1 V2 V3
1 1 A 123 NA 68
2 2 A 112 122 126
3 3 B 315 311 NA
4 4 B NA 101 169
Now, stop and make note of any differences.
The two that pop into my mind are:
The formula method does a nicer job of preserving names but it doesn't let you control the names directly in your command, which you can do in the data.frame method:
aggregate(cbind(NewX1 = mydf$X1, NewX2 = mydf$X2, NewX3 = mydf$X3),
by = list(NewA = mydf$A, NewB = mydf$B), sum)
The formula method and the data.frame method treat NA values differently. To get the same result with the formula method as you do with the data.frame method, you need to use na.action = na.pass.
aggregate(. ~ A + B, mydf, sum, na.action=na.pass)
Again, it is not entirely wrong to say "I don't think it really matters", and I'm not going to state my preference here since that's not really what Stack Overflow is about, but it is important to always read the function documentation carefully before making such decisions.
From the help page,
aggregate.formula is a standard formula interface to aggregate.data.frame
So I don't think it really matters. Use whichever approach you're comfortable with, or which fits existing variables and formulas in your workspace.