Multiplication with dataframes in R

I have two dataframes. One is called data, like
data <- data.frame(ID = c(1, 1, 2, 2 ),
Number = c(1,2, 1, 2),
Answer = c(1, 2, 3, 2 )
)
The other is called weights, like
weights <- data.frame ( Number=c(1,2),
weight1=c(0.5,1),
weight2=c(1, 1)
)
What I want is to multiply data$Answer by weights$weight1 and weights$weight2, matching rows on Number (which is present in both dataframes). The final result should look like
ID Number Answer Answer*Weights1 Answer*Weights2
1 1 1 1 1*0.5 1*1
2 1 2 2 2*1 2*1
3 2 1 3 3*0.5 3*1
4 2 2 2 2*1 2*1
How can I achieve this? Your input will be greatly appreciated. Thanks.

data <- merge(data, weights, by = "Number")
data <- transform(data,
A1 = Answer * weight1,
A2 = Answer * weight2)
# Number ID Answer weight1 weight2 A1 A2
#1 1 1 1 0.5 1 0.5 1
#2 1 2 3 0.5 1 1.5 3
#3 2 1 2 1.0 1 2.0 2
#4 2 2 2 1.0 1 2.0 2
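If weights gains more columns, the same idea can be written without naming each one. A minimal base R sketch, assuming every column of weights other than Number is a weight; the names idx, wcols, products, and result are illustrative and not part of the answer above:

```r
data <- data.frame(ID = c(1, 1, 2, 2),
                   Number = c(1, 2, 1, 2),
                   Answer = c(1, 2, 3, 2))
weights <- data.frame(Number = c(1, 2),
                      weight1 = c(0.5, 1),
                      weight2 = c(1, 1))

# For each row of data, find the row of weights with the same Number
idx <- match(data$Number, weights$Number)

# Assumption: every column except the key holds a weight
wcols <- setdiff(names(weights), "Number")

# Multiply Answer into each matched weight column
products <- data$Answer * weights[idx, wcols, drop = FALSE]
names(products) <- paste0("Answer_", wcols)
rownames(products) <- NULL  # drop the row names inherited from weights

result <- cbind(data, products)
result
```

Unlike merge(), which sorts the result by the key, this keeps the rows of data in their original order.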

You could also do
library(dplyr)
left_join(data, weights, by="Number") %>%
select(ID:Answer, Answer_weight1=weight1, Answer_weight2=weight2) %>%
mutate(across(contains("weight"), ~ Answer * .x)) # mutate_each() and funs() are deprecated in current dplyr
# ID Number Answer Answer_weight1 Answer_weight2
# 1 1 1 1 0.5 1
# 2 1 2 2 2.0 2
# 3 2 1 3 1.5 3
# 4 2 2 2 2.0 2

Here's how you could do this using data.table:
require(data.table) ## 1.9.2
setDT(data) ## convert data.frame to data.table by reference
setDT(weights)
setkey(data, Number) ## set the key columns to join by
data[weights, c("Answer1", "Answer2") :=
list(Answer * weight1, Answer * weight2)]
We perform a join and create the required columns directly, without materializing the intermediate columns (weight1, weight2), so this is quite memory efficient. It modifies data in place.

Just in case you want those entries in the Answers*Weights1 and Answers*Weights2 columns to be strings and not actually multiplication, like you have in your original post:
idx <- match(data$Number, weights$Number) # row of weights matching each row of data
data <- cbind(data,
paste(data[, 3], weights[idx, 2], sep = "*"),
paste(data[, 3], weights[idx, 3], sep = "*"))
names(data)[4:5] <- c("Answer*Weights1", "Answer*Weights2")
# ID Number Answer Answer*Weights1 Answer*Weights2
# 1 1 1 1 1*0.5 1*1
# 2 1 2 2 2*1 2*1
# 3 2 1 3 3*0.5 3*1
# 4 2 2 2 2*1 2*1
Or if you want numbers instead of strings
idx <- match(data$Number, weights$Number) # row of weights matching each row of data
data[, 4] <- data[, 3] * weights[idx, 2]
data[, 5] <- data[, 3] * weights[idx, 3]
names(data)[4:5] <- c("Answer*Weights1", "Answer*Weights2")
# ID Number Answer Answer*Weights1 Answer*Weights2
# 1 1 1 1 0.5 1
# 2 1 2 2 2.0 2
# 3 2 1 3 1.5 3
# 4 2 2 2 2.0 2
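Because weights has fewer rows than data, it matters how the two are lined up. A small sketch, using a hypothetical reordered copy of the data (shuffled is not from the question), that compares plain recycling of the weight column with an explicit lookup on Number:

```r
weights <- data.frame(Number = c(1, 2),
                      weight1 = c(0.5, 1),
                      weight2 = c(1, 1))

# Hypothetical data where Number does not alternate 1, 2, 1, 2
shuffled <- data.frame(ID = c(1, 2, 1, 2),
                       Number = c(1, 1, 2, 2),
                       Answer = c(1, 3, 2, 2))

# Recycling: weight1 is reused as 0.5, 1, 0.5, 1 regardless of Number
recycled <- shuffled$Answer * weights$weight1

# Lookup: pick the weight whose Number matches each row
aligned <- shuffled$Answer *
  weights$weight1[match(shuffled$Number, weights$Number)]

recycled
aligned
```

The two vectors differ here, so recycling is only safe when the rows of data happen to cycle through Number in the same order as weights.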

Related

Extract data based on another list

I am trying to extract rows of a dataset based on a list of time points nested within individuals. I have repeated time points (therefore exactly the same variable values) but I still want to keep the duplicated rows. How to achieve that in base R?
Here is the original dataset:
xx <- data.frame(id=rep(1:3, each=3), time=1:3, y=rep(1:3, each=3))
Here is the list of matrices where the third one is a vector
lst <- list(`1` = c(1, 1, 2), `2` = c(1, 3, 3), `3` = c(2, 2, 3))
Desirable outcome:
id time y
1 1 1
1 1 1 #this is the duplicated row
1 2 1
2 1 2
2 3 2
2 3 2 #this is the duplicated row
3 2 3
3 2 3 #this is the duplicated row
3 3 3
The code do.call(rbind, Map(function(p, q) subset(xx, id == q & time %in% p), lst, names(lst))) did not work for me because subset removes duplicated rows
The issue is that %in% doesn't iterate over the non-unique values repeatedly. To do so, we need to also iterate (lapply) over p internally. I'll wrap your inner subset in another do.call(rbind, lapply(p, ...)) to get what you expect:
do.call(rbind, Map(function(p, q) {
do.call(rbind, lapply(p, function(p0) subset(xx, id == q & time %in% p0)))
}, lst, names(lst)))
# id time y
# 1.1 1 1 1
# 1.2 1 1 1
# 1.21 1 2 1
# 2.4 2 1 2
# 2.6 2 3 2
# 2.61 2 3 2
# 3.8 3 2 3
# 3.81 3 2 3
# 3.9 3 3 3
(Row names are a distraction here ...)
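The difference between %in% and positional indexing can be sketched as follows. This is an illustrative variant rather than the code from the answer; it assumes each (id, time) pair occurs at most once in xx, and the names pieces, sub, and res are made up for the example:

```r
xx <- data.frame(id = rep(1:3, each = 3), time = 1:3, y = rep(1:3, each = 3))
lst <- list(`1` = c(1, 1, 2), `2` = c(1, 3, 3), `3` = c(2, 2, 3))

# %in% yields one TRUE/FALSE per row of xx, so a repeated time point
# can never select the same row twice
1:3 %in% c(1, 1, 2)

# match() yields one row position per requested time point,
# so repeated requests repeat the row
pieces <- Map(function(p, q) {
  sub <- xx[xx$id == as.integer(q), ]
  sub[match(p, sub$time), ]  # a time absent from sub would give an NA row
}, lst, names(lst))

res <- do.call(rbind, pieces)
rownames(res) <- NULL
res
```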
An alternative would be to convert your lst into a frame of id and time, and then left-join on it:
frm <- do.call(rbind, Map(function(x, nm) data.frame(id = nm, time = x), lst, names(lst)))
frm
# id time
# 1.1 1 1
# 1.2 1 1
# 1.3 1 2
# 2.1 2 1
# 2.2 2 3
# 2.3 2 3
# 3.1 3 2
# 3.2 3 2
# 3.3 3 3
merge(frm, xx, by = c("id", "time"), all.x = TRUE)
# id time y
# 1 1 1 1
# 2 1 1 1
# 3 1 2 1
# 4 2 1 2
# 5 2 3 2
# 6 2 3 2
# 7 3 2 3
# 8 3 2 3
# 9 3 3 3
Two good resources for learning about merges/joins:
How to join (merge) data frames (inner, outer, left, right)
What's the difference between INNER JOIN, LEFT JOIN, RIGHT JOIN and FULL JOIN?

Aggregate data frame/table by all rows, add counts, and do it fast [duplicate]

This question already has answers here:
Count number of rows within each group
(17 answers)
Closed 4 years ago.
I have a data frame like the following example
a = c(1, 1, 1, 2, 2, 3, 4, 4)
b = c(3.5, 3.5, 2.5, 2, 2, 1, 2.2, 7)
df <-data.frame(a,b)
I can remove duplicated rows from an R data frame with the following code, but how can I find how many times each duplicated row is repeated? I need the result as a vector.
unique(df)
or
df[!duplicated(df), ]
Here is a solution using ddply() from the plyr package:
library(plyr)
ddply(df,.(a,b),nrow)
a b V1
1 1 2.5 1
2 1 3.5 2
3 2 2.0 2
4 3 1.0 1
5 4 2.2 1
6 4 7.0 1
You could always kill two birds with the one stone:
aggregate(list(numdup=rep(1,nrow(df))), df, length)
# or even:
aggregate(numdup ~., data=transform(df,numdup=1), length)
# or even:
aggregate(cbind(df[0],numdup=1), df, length)
a b numdup
1 3 1.0 1
2 2 2.0 2
3 4 2.2 1
4 1 2.5 1
5 1 3.5 2
6 4 7.0 1
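The aggregate() calls return a data frame ordered by the grouping columns. If the counts are wanted as a plain vector that lines up with the rows of unique(df), one possible base R sketch is the following (key and counts are illustrative names; the "\r" separator is an assumption that no value contains that character):

```r
a <- c(1, 1, 1, 2, 2, 3, 4, 4)
b <- c(3.5, 3.5, 2.5, 2, 2, 1, 2.2, 7)
df <- data.frame(a, b)

# One string key per row
key <- do.call(paste, c(df, sep = "\r"))

# Tabulate the keys in order of first appearance,
# which is the row order of unique(df)
counts <- as.vector(table(factor(key, levels = unique(key))))

cbind(unique(df), count = counts)
```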
Here are two approaches.
# an example data set that is not sorted
DF <-data.frame(replicate(sequence(1:3),n=2))
# example using similar idea to duplicated.data.frame
count.duplicates <- function(DF){
x <- do.call('paste', c(DF, sep = '\r'))
ox <- order(x)
rl <- rle(x[ox])
cbind(DF[ox[cumsum(rl$lengths)],,drop=FALSE],count = rl$lengths)
}
count.duplicates(DF)
# X1 X2 count
# 4 1 1 3
# 5 2 2 2
# 6 3 3 1
# a far simpler `data.table` approach
library(data.table)
count.dups <- function(DF){
DT <- data.table(DF)
DT[,.N, by = names(DT)]
}
count.dups(DF)
# X1 X2 N
# 1: 1 1 3
# 2: 2 2 2
# 3: 3 3 1
Using dplyr:
library(dplyr)
summarise(group_by(df, a, b), n = n())
or, to get the counts as a plain vector:
group_size(group_by(df,a,b))
#[1] 1 2 2 1 1 1

Add a column for counting unique tuples in the data frame [duplicate]

This question already has answers here:
How to get frequencies then add it as a variable in an array?
(3 answers)
Closed 8 years ago.
Suppose I have the following data frame:
userID <- c(1, 1, 3, 5, 3, 5)
A <- c(2, 3, 2, 1, 2, 1)
B <- c(2, 3, 1, 0, 1, 0)
df <- data.frame(userID, A, B)
df
# userID A B
# 1 1 2 2
# 2 1 3 3
# 3 3 2 1
# 4 5 1 0
# 5 3 2 1
# 6 5 1 0
I would like to create a data frame with the same columns but with an added final column that counts up the number of unique tuples / combinations of the other columns. The output should look like the following:
userID A B count
1 2 2 1
1 3 3 1
3 2 1 2
5 1 0 2
That is, the tuple / combination (1, 2, 2) occurs once, so it has count=1, while the tuple (3, 2, 1) occurs twice, so it has count=2. I would prefer not to use any external packages.
1) aggregate
ag <- aggregate(count ~ ., cbind(count = 1, df), length)
ag[do.call("order", ag), ] # sort the rows
giving:
userID A B count
3 1 2 2 1
4 1 3 3 1
2 3 2 1 2
1 5 1 0 2
The last line of code which sorts the rows could be omitted if the order of the rows is unimportant.
The remaining solutions use the indicated packages:
2) sqldf
library(sqldf)
Names <- toString(names(df))
fn$sqldf("select *, count(*) count from df group by $Names order by $Names")
giving:
userID A B count
1 1 2 2 1
2 1 3 3 1
3 3 2 1 2
4 5 1 0 2
The order by clause could be omitted if the order is unimportant.
3) dplyr
library(dplyr)
df %>% count(across(everything()), name = "count") # regroup() is no longer available in current dplyr
giving:
userID A B count
1 1 2 2 1
2 1 3 3 1
3 3 2 1 2
4 5 1 0 2
4) data.table
library(data.table)
data.table(df)[, list(count = .N), by = names(df)]
giving:
userID A B count
1: 1 2 2 1
2: 1 3 3 1
3: 3 2 1 2
4: 5 1 0 2
Here's a fairly straightforward way (ave to the rescue!):
unique(cbind(df,
count = ave(rep(1, nrow(df)),
do.call(paste, df),
FUN = length)))
# userID A B count
# 1 1 2 2 1
# 2 1 3 3 1
# 3 3 2 1 2
# 4 5 1 0 2
Here's a variation of the above:
unique(within(df, {
counter <- rep(1, nrow(df))
count <- ave(counter, df, FUN = length)
rm(counter)
}))
# userID A B count
# 1 1 2 2 1
# 2 1 3 3 1
# 3 3 2 1 2
# 4 5 1 0 2
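What makes ave() convenient here is that it returns one value per input row rather than one per group, so the counts can be attached to df before duplicates are dropped. A short sketch of the intermediate step (per_row and res are illustrative names):

```r
userID <- c(1, 1, 3, 5, 3, 5)
A <- c(2, 3, 2, 1, 2, 1)
B <- c(2, 3, 1, 0, 1, 0)
df <- data.frame(userID, A, B)

# Group size repeated for every row that belongs to the group
per_row <- ave(seq_len(nrow(df)), df$userID, df$A, df$B, FUN = length)
per_row

# Attach the counts, then keep one row per tuple
res <- unique(cbind(df, count = per_row))
res
```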
userID <- c(1, 1, 3, 5, 3, 5)
A <- c(2, 3, 2, 1, 2, 1)
B <- c(2, 3, 1, 0, 1, 0)
df <- data.frame(userID, A, B)
Make a quick factor of the tuples:
df$AB <- as.factor(paste(df$userID, df$A, df$B, sep = "_")) # a separator keeps tuples such as (1, 12) and (11, 2) distinct
No external packages are needed: summary() on the factor gives the counts, which are stored as a data frame and then merged back onto the original data:
df2 <- as.data.frame(summary(df$AB))
df2 <- data.frame(x=row.names(df2), y=df2[1])
names(df2) <- c("AB", "count")
df <- merge(df, df2, by="AB", all.x=TRUE)
df$AB <- NULL
Almost final output, just has dupes:
df
userID A B count
1 1 2 2 1
2 1 3 3 1
3 3 2 1 2
4 3 2 1 2
5 5 1 0 2
6 5 1 0 2
Lastly, clean up dupes:
df <- df[!duplicated(df), ]
Here you go:
df
userID A B count
1 1 2 2 1
2 1 3 3 1
3 3 2 1 2
5 5 1 0 2
It has been a while since I did this without SQL or plyr. If you can use dplyr or another package later on, do so. Bioconductor has a lot of great sequencing packages if things get more complex.
Hope this helps.
This should do the trick, even if it is a little bit ugly:
vec <- table(apply(df, 1, paste, collapse = "_")) # separator so multi-digit values split back correctly
df2 <- data.frame(do.call(rbind, strsplit(names(vec), "_")))
names(df2) <- names(df)
df2$count <- as.vector(vec)
# userID A B count
#1 1 2 2 1
#2 1 3 3 1
#3 3 2 1 2
#4 5 1 0 2
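When a key is built by pasting column values together, the choice of separator matters. A small sketch with made-up values (d is hypothetical and not the data from the question) showing how an empty separator can merge two distinct tuples:

```r
# Two different tuples: (1, 12, 3) and (11, 2, 3)
d <- data.frame(userID = c(1, 11), A = c(12, 2), B = c(3, 3))

# Without a separator both rows collapse to the same key
no_sep <- do.call(paste, c(d, sep = ""))
no_sep

# With a separator the keys stay distinct
with_sep <- do.call(paste, c(d, sep = "_"))
with_sep
```

The separator only needs to be a string that cannot occur inside the values themselves.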
