I have the following example data.
data_1 <- data.frame(ID = c('a','b','c','d','e'),
                     value = c(2,4,9,5,3))
data_2 <- data.frame(ID = c('a','c','d','b','e','a','e','d','c'),
                     var = c(2,6,2,4,6,8,6,4,5))
I want to calculate a new column in data_2 such that, for rows with the same ID in the two datasets, value and var are multiplied.
Something like: where data_1$ID == data_2$ID, compute data_1$value * data_2$var. So newVar would be (4,54,10,16,18,16,18,20,45).
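Join the two data frames by ID and multiply value and var. Note that merge sorts the result by ID by default, so the rows will not be in data_2's original order.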
transform(merge(data_1, data_2, by = 'ID'), result = value * var)
You can also use match, which keeps the rows in data_2's order:
transform(data_2, result = var * data_1$value[match(ID, data_1$ID)])
# ID var result
#1 a 2 4
#2 c 6 54
#3 d 2 10
#4 b 4 16
#5 e 6 18
#6 a 8 16
#7 e 6 18
#8 d 4 20
#9 c 5 45
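To make the match step more concrete, here is a minimal sketch that splits it into its parts, using the same example data. The intermediate names idx and looked_up are introduced here only for illustration.

```r
data_1 <- data.frame(ID = c("a", "b", "c", "d", "e"),
                     value = c(2, 4, 9, 5, 3))
data_2 <- data.frame(ID = c("a", "c", "d", "b", "e", "a", "e", "d", "c"),
                     var = c(2, 6, 2, 4, 6, 8, 6, 4, 5))

# For each ID in data_2, find the position of that ID in data_1
idx <- match(data_2$ID, data_1$ID)

# Use those positions to pull the matching value for every data_2 row
looked_up <- data_1$value[idx]

# Multiply row by row; the order of data_2 is left untouched
data_2$result <- data_2$var * looked_up
data_2
```

An ID in data_2 that is missing from data_1 would give NA in idx, and therefore NA in result.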
Using dplyr:
library(dplyr)
inner_join(data_1, data_2, by = 'ID') %>% mutate(result = value * var)
Using data.table (an update join that adds result to data_2 by reference):
library(data.table)
setDT(data_2)[setDT(data_1), result := var * i.value, on = .(ID)]
I have these two dataframes (imagine them very big) :
df = data.frame(subjects = 1:10,
var1 = c('a',NA,'b',NA,'c',NA,'d','e','f','g'))
g = data.frame(subjects = c(1,3,5,7,8,9,10),
score = c(1,2,1,3,2,4,1) )
I want to add the variable score from the g data frame to the df data frame, with the condition that if var1 is NA, the score in df is also NA. How can this be done with a simple function? Thanks.
Second scenario :
df = data.frame(subjects = 1:10,
var1 = c('a','e','b','c','c','b','d','e','f','g'))
g = data.frame(subjects = c(1,3,5,7,8,9,10),
score = c(1,2,1,3,2,4,1) )
Now I want the score to be NA for each subject whose score was not calculated, so that the result looks like this:
df = data.frame(subjects = 1:10,
var1 = c('a','e','b','c','c','b','d','e','f','g'),
score = c(1,NA,2,NA,1,NA,3,2,4,1))
We can do a left join by 'subjects', which returns 'score' with NA wherever a subject in 'df' has no corresponding row in 'g'. If 'score' must also be NA when 'var1' is NA, add a replace step that checks 'var1' for NA.
library(dplyr)
df <- left_join(df, g, by= "subjects") %>%
mutate(score = replace(score, is.na(var1), NA))
Output (second scenario):
df
subjects var1 score
1 1 a 1
2 2 e NA
3 3 b 2
4 4 c NA
5 5 c 1
6 6 b NA
7 7 d 3
8 8 e 2
9 9 f 4
10 10 g 1
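For comparison, the same result can be sketched in base R without a join, assuming each subject appears at most once in g. This uses the data from the first scenario, where var1 contains NA values.

```r
df <- data.frame(subjects = 1:10,
                 var1 = c("a", NA, "b", NA, "c", NA, "d", "e", "f", "g"))
g <- data.frame(subjects = c(1, 3, 5, 7, 8, 9, 10),
                score = c(1, 2, 1, 3, 2, 4, 1))

# Look up each subject of df in g; subjects absent from g get NA
df$score <- g$score[match(df$subjects, g$subjects)]

# Force score to NA wherever var1 is NA
df$score[is.na(df$var1)] <- NA
df
```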
I am having an issue multiplying 3 columns by 3 different constants (i.e., 2, 3, and 4, respectively) and then summing each row after applying the conversion.
I am using dplyr:
variable <- df %>% transmute(df, sum(col1, col2*2, col3*3, col4*4))
We could do
library(dplyr)
df %>%
  mutate(a = a * 2,
         b = b * 3,
         c = c * 4,
         total = a + b + c)
# a b c total
#1 2 18 44 64
#2 4 21 48 73
#3 6 24 52 82
#4 8 27 56 91
#5 10 30 60 100
Using rowSums
df %>%
  mutate(a = a * 2,
         b = b * 3,
         c = c * 4) %>%
  mutate(total = rowSums(.))
Note that when using rowSums(.), it must go in a separate mutate call rather than the same one; otherwise the dot still refers to the original df and the sum is taken over the unmodified columns.
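The difference can be checked with a small sketch, assuming dplyr is installed and the magrittr pipe is used. The names same_call and separate are made up for this illustration.

```r
library(dplyr)

df <- data.frame(a = 1:5, b = 6:10, c = 11:15)

# rowSums(.) in the same mutate: the dot is the original df
same_call <- df %>%
  mutate(a = a * 2, b = b * 3, c = c * 4, total = rowSums(.))

# rowSums(.) in a second mutate: the dot is the already scaled data
separate <- df %>%
  mutate(a = a * 2, b = b * 3, c = c * 4) %>%
  mutate(total = rowSums(.))

same_call$total  # sums of the unscaled columns
separate$total   # sums of the scaled columns
```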
Or in base R
df1 <- transform(df, a = a*2, b = b * 3, c = c *4)
df1$total <- rowSums(df1)
data
df <- data.frame(a = 1:5, b = 6:10, c = 11:15)
In base R, we can do this more compactly with %*%
df$total <- c(as.matrix(df) %*% 2:4)
df
# a b c total
#1 1 6 11 64
#2 2 7 12 73
#3 3 8 13 82
#4 4 9 14 91
#5 5 10 15 100
Or with crossprod
df$total <- c(crossprod(t(df), 2:4))
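Another base R sketch uses sweep() to scale each column by its own constant before summing. The named vector weights is an assumption introduced here so that each constant is tied to its column by name rather than by position.

```r
df <- data.frame(a = 1:5, b = 6:10, c = 11:15)

# One multiplier per column, matched by name
weights <- c(a = 2, b = 3, c = 4)

# Multiply column j of the matrix by weights[j] (MARGIN = 2 means columns)
weighted <- sweep(as.matrix(df[names(weights)]), 2, weights, `*`)

# Sum the scaled values across each row
df$total <- rowSums(weighted)
df
```

Selecting the columns with df[names(weights)] also means that extra columns in df, such as a total added earlier, are left out of the sum.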
--
Or with tidyverse
library(tidyverse)
map2(df, 2:4, ~ .x * .y) %>%
reduce(`+`) %>%
bind_cols(df, total = .)
data
df <- data.frame(a = 1:5, b = 6:10, c = 11:15)
variable <- df %>%
rowwise() %>%
mutate(new_var = sum(col1, col2*2, col3*3, col4*4))
Try that instead:
rowwise() makes the calculation run on each row separately.
mutate() adds the new calculated column.
I want to combine the rows for several values of one column into a single new row.
Example df like this:
df <- data.frame(area = c("a","b","c","a"),
d = c(1,3,6,3),
f = c(3,2,8,2),
e = c(4,7,1,8),
g = c(6,9,2,9))
Here a, b and c are the values of the area column. I want to sum the rows for a and c into one row to get:
area d f e g
a+c+a 10 13 13 17
b 3 2 7 9
I have tried this:
df <- aggregate(df, list(area=replace(area == c("a","c"), "a+c+a")), sum)
But it doesn't work.
Thank you.
Another solution using dplyr
library(dplyr)
aggr <- df[df$area %in% c("a", "c"),-1] %>%
summarize_all(sum)
rbind(df[!(df$area %in% c("a", "c")),],
bind_cols(area = "a+c+a", aggr))
# area d f e g
# 2 b 3 2 7 9
# 1 a+c+a 10 13 13 17
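A base R sketch that stays close to the original aggregate attempt: relabel the areas first, then aggregate on the new label. The helper vector grp is introduced here for illustration.

```r
df <- data.frame(area = c("a", "b", "c", "a"),
                 d = c(1, 3, 6, 3),
                 f = c(3, 2, 8, 2),
                 e = c(4, 7, 1, 8),
                 g = c(6, 9, 2, 9))

# Rows with area a or c get the combined label; other rows keep their own
grp <- ifelse(df$area %in% c("a", "c"), "a+c+a", as.character(df$area))

# Sum every numeric column within each label (area itself is dropped via df[-1])
res <- aggregate(df[-1], by = list(area = grp), FUN = sum)
res
```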
I am trying to select the maximum value in a dataframe's third column based on the combinations of the values in the first two columns.
My problem is similar to this one but I can't find a way to implement what I need.
EDIT: Sample data changed to make the column names more obvious.
Here is some sample data:
library(tidyr)
set.seed(1234)
df <- data.frame(group1 = letters[1:4], group2 = letters[1:4])
df <- df %>% expand(group1, group2)
df <- subset(df, subset = group1!=group2)
df$score <- runif(n = 12,min = 0,max = 1)
df
# A tibble: 12 × 3
group1 group2 score
<fctr> <fctr> <dbl>
1 a b 0.113703411
2 a c 0.622299405
3 a d 0.609274733
4 b a 0.623379442
5 b c 0.860915384
6 b d 0.640310605
7 c a 0.009495756
8 c b 0.232550506
9 c d 0.666083758
10 d a 0.514251141
11 d b 0.693591292
12 d c 0.544974836
In this example rows 1 and 4 are 'duplicates'. I would like to select row 4 as the value in the score column is larger than in row 1. Ultimately I would like a dataframe to be returned with the group1 and group2 columns and the maximum value in the score column. So in this example, I expect there to be 6 rows returned.
How can I do this in R?
I'd prefer dealing with this problem in two steps:
library(dplyr)
# Create function for computing group IDs from data frame of groups (per column)
get_group_id <- function(groups) {
  apply(groups, 1, function(row) {
    paste0(sort(row), collapse = "_")
  })
}
group_id <- get_group_id(select(df, -score))
# Perform the computation
df %>%
  mutate(groupId = group_id) %>%
  group_by(groupId) %>%
  slice(which.max(score)) %>%
  ungroup() %>%
  select(-groupId)
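The same idea, an order-independent key followed by a per-key maximum, can be sketched in base R. The four-row sample below is copied from the printed data in the question; the names key and best are made up for this example, and the group columns are assumed to be character rather than factor.

```r
df <- data.frame(group1 = c("a", "a", "b", "c"),
                 group2 = c("b", "c", "a", "a"),
                 score = c(0.113703411, 0.622299405, 0.623379442, 0.009495756),
                 stringsAsFactors = FALSE)

# pmin/pmax put the two labels in alphabetical order, so a-b and b-a share a key
key <- paste(pmin(df$group1, df$group2), pmax(df$group1, df$group2), sep = "_")

# Keep the rows whose score equals the maximum within their key
best <- df[df$score == ave(df$score, key, FUN = max), ]
best
```

If two rows in a pair tie for the maximum, this keeps both, whereas which.max keeps only the first.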
I have a data frame that looks like this:
id = c("A","B","C","A","C","C")
val = c(5,4,6,7,10,99)
df = data.frame(id, val)
df
id val
A 5
B 4
C 6
A 7
C 10
C 99
Now I would like to sort by the id column (A, B, C...), keeping the corresponding val, and then add a new column newid that starts with the letter E followed by a three-digit running count of the rows within each id. Code that builds the desired result by hand:
id2 = c("A","A","B","C","C","C")
val2 = c(5,7,4,6,10,99)
newid = c("E001","E002","E001","E001","E002","E003")
df2 = data.frame(id2, val2, newid)
df2
and the final result is this:
id2 val2 newid
A 5 E001
A 7 E002
B 4 E001
C 6 E001
C 10 E002
C 99 E003
Is there an efficient way to do this?
library(data.table)
dt = data.table(df)
dt[, newid := paste0('E', gsub(' ', '0', format(1:.N, width = 3))), keyby = id]
dt
# id val newid
#1: A 5 E001
#2: A 7 E002
#3: B 4 E001
#4: C 6 E001
#5: C 10 E002
#6: C 99 E003
keyby does the sorting here, so there is no need to sort explicitly.
Here is one way to do that, using order() to arrange the data and sprintf(), sapply() and table() to define newid. The format "%03d" pads the counter to three digits, as in the desired output.
df2 <- df[order(df$id, df$val), ]
df2$newid <- paste0("E", sprintf("%03d", unlist(sapply(table(df$id), function(x) 1:x))))
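As a variation, the per-id counter can also be sketched with ave() and seq_along(), which avoids building the counts with table(). The name counter is introduced here for illustration.

```r
df <- data.frame(id = c("A", "B", "C", "A", "C", "C"),
                 val = c(5, 4, 6, 7, 10, 99))

# Sort by id, then by val within id
df2 <- df[order(df$id, df$val), ]

# Running row number within each id: 1, 2, ... restarting for every new id
counter <- ave(seq_along(df2$id), df2$id, FUN = seq_along)

# "E" followed by the counter, zero-padded to three digits
df2$newid <- sprintf("E%03d", counter)
df2
```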