I want to sum the "value" column by group1 and by group2.
group2 can range from 1 to 5.
If there is no entry for group2, the sum should be 0.
Data:
group1 group2 value
a 1 100
a 2 200
a 3 300
b 1 10
b 2 20
I am using
aggregate(data$value, by=(list(data$group1, data$group2)), FUN = sum)
which gives
group1 group2 value
a 1 100
a 2 200
a 3 300
b 1 10
b 2 20
However, the result should look like
group1 group2 value
a 1 100
a 2 200
a 3 300
a 4 0
a 5 0
b 1 10
b 2 20
b 3 0
b 4 0
b 5 0
How can I address this using the aggregate function in R?
Thank you!
We can use complete from tidyr to complete missing combinations.
library(dplyr)
library(tidyr)
df %>%
group_by(group1, group2) %>%
summarise(value = sum(value)) %>%
complete(group2 = 1:5, fill = list(value = 0))
# group1 group2 value
# <fct> <int> <dbl>
# 1 a 1 100
# 2 a 2 200
# 3 a 3 300
# 4 a 4 0
# 5 a 5 0
# 6 b 1 10
# 7 b 2 20
# 8 b 3 0
# 9 b 4 0
#10 b 5 0
data
df <- structure(list(group1 = structure(c(1L, 1L, 1L, 2L, 2L), .Label = c("a",
"b"), class = "factor"), group2 = c(1L, 2L, 3L, 1L, 2L), value = c(100L,
200L, 300L, 10L, 20L)), class = "data.frame", row.names = c(NA, -5L))
You need, of course, to tell R that "group2 can range from 1 to 5". One way is to merge the data with an expand.grid of all group1/group2 combinations (with value set to 0) and then run aggregate inside with.
with(merge(expand.grid(group1=c("a", "b"), group2=1:5, value=0), data, all=TRUE),
aggregate(value, by=(list(group1, group2)), FUN=sum))
# Group.1 Group.2 x
# 1 a 1 100
# 2 b 1 10
# 3 a 2 200
# 4 b 2 20
# 5 a 3 300
# 6 b 3 0
# 7 a 4 0
# 8 b 4 0
# 9 a 5 0
# 10 b 5 0
Data:
data <- structure(list(group1 = c("a", "a", "a", "b", "b"), group2 = c(1L,
2L, 3L, 1L, 2L), value = c(100L, 200L, 300L, 10L, 20L)), row.names = c(NA,
-5L), class = "data.frame")
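As a further base R sketch of the same idea (telling R the full range of group2), one could declare the range as factor levels and let xtabs fill the empty cells with 0. This swaps aggregate for xtabs, so it is an alternative rather than the method used above; the names res and the column label "value" are chosen here for illustration.

```r
# Same toy data as in the question.
data <- data.frame(
  group1 = c("a", "a", "a", "b", "b"),
  group2 = c(1L, 2L, 3L, 1L, 2L),
  value  = c(100L, 200L, 300L, 10L, 20L)
)

# Declare the full range of group2 as factor levels, so absent
# combinations still show up as cells in the cross-tabulation.
data$group2 <- factor(data$group2, levels = 1:5)

# xtabs() sums `value` within each group1 x group2 cell; empty cells get 0.
res <- as.data.frame(xtabs(value ~ group1 + group2, data = data),
                     responseName = "value")

# Sort to match the layout requested in the question.
res <- res[order(res$group1, res$group2), ]
res
```

Note that group1 and group2 come back as factors in res; convert them with as.character or as.integer(as.character(...)) if the original types matter.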
I have dataframe something like:
myData <- User X Y Similar
A 1 4 100
A 1 2 100
A 1 1 100
A 3 2 80
A 2 1 20
A 2 4 100
B 3 1 50
B 4 2 90
B 1 3 100
To something like this:
myData <- User X Y Similar
A 1 4 0
A 1 2 0
A 1 1 0
A 3 2 80
A 2 1 20
A 2 4 100
B 3 1 50
B 4 2 90
B 1 3 0
Question
I want to set the value in the Similar column to 0 under a condition: X equals 1 and Similar equals 100. How can I do that in R?
Thanks
We create a logical vector based on 'X' and 'Similar', then use it as an index to assign 0 to the matching values of 'Similar'
i1 <- with(myData, X ==1 & Similar == 100)
myData$Similar[i1] <- 0
-output
myData
# User X Y Similar
#1 A 1 4 0
#2 A 1 2 0
#3 A 1 1 0
#4 A 3 2 80
#5 A 2 1 20
#6 A 2 4 100
#7 B 3 1 50
#8 B 4 2 90
#9 B 1 3 0
data
myData <- structure(list(User = c("A", "A", "A", "A", "A", "A", "B", "B",
"B"), X = c(1L, 1L, 1L, 3L, 2L, 2L, 3L, 4L, 1L), Y = c(4L, 2L,
1L, 2L, 1L, 4L, 1L, 2L, 3L), Similar = c(100L, 100L, 100L, 80L,
20L, 100L, 50L, 90L, 100L)), class = "data.frame", row.names = c(NA,
-9L))
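The same conditional replacement can also be sketched in one step with replace(), which returns a modified copy of the vector instead of assigning by index. This is only a variation on the answer above, using the data from the question.

```r
# Data from the question.
myData <- data.frame(
  User    = c("A", "A", "A", "A", "A", "A", "B", "B", "B"),
  X       = c(1L, 1L, 1L, 3L, 2L, 2L, 3L, 4L, 1L),
  Y       = c(4L, 2L, 1L, 2L, 1L, 4L, 1L, 2L, 3L),
  Similar = c(100L, 100L, 100L, 80L, 20L, 100L, 50L, 90L, 100L)
)

# replace(x, condition, value) sets x to `value` wherever `condition` is TRUE.
myData$Similar <- with(myData, replace(Similar, X == 1 & Similar == 100, 0L))
myData
```

If X or Similar may contain NA, wrapping the condition in which() keeps only the rows where the condition is definitely TRUE.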
I have a dataset which is similar to the following:
Age Food_1_1 Food_1_2 Food_1_3 Amount_1_1 Amount_1_2 Amount_1_3
6-9 a b a 2 3 4
6-9 b b c 1 2 3
6-9 c a 4 1
9-10 c c b 1 3 1
9-10 c a b 1 2 1
Using R, I want to get the following data set, which contains a new set of columns a, b and c holding the sums of the corresponding values:
Age Food_1_1 Food_1_2 Food_1_3 Amount_1_1 Amount_1_2 Amount_1_3 a b c
6-9 a b a 2 3 4 6 3 0
6-9 b b c 1 2 3 0 3 3
6-9 c a 4 1 1 0 4
9-10 c c b 1 3 1 0 1 4
9-10 c a b 1 2 1 2 1 1
Note: My data also contains missing values. The variables Monday:Wednesday are factors and the variables Value1:Value3 are numeric. For clarity: the 1st row of column "a" contains the sum of all values from Value1 to Value3 that belong to a (for example, 2 + 4 = 6).
One way using base R:
data$id <- 1:nrow(data) # Create a unique id
vlist <- list(grep("day$", names(data)), grep("^Value", names(data)))
d1 <- reshape(data, direction="long", varying=vlist, v.names=c("Day","Value"))
d2 <- aggregate(Value~id+Day, FUN=sum, na.rm=TRUE, data=d1)
d3 <- reshape(d2, direction="wide", v.names="Value", timevar="Day")
d3[is.na(d3)] <- 0
merge(data, d3, by="id", all.x=TRUE)
# id Age Monday Tuesday Wednesday Value1 Value2 Value3 Value.a Value.b Value.c
#1 1 6-9 a b a 2 3 4 6 3 0
#2 2 6-9 b b c 1 2 3 0 3 3
#3 3 6-9 <NA> c a NA 4 1 1 0 4
#4 4 9-10 c c b 1 3 1 0 1 4
#5 5 9-10 c a b 1 2 1 2 1 1
Data:
data <- structure(list(Age = structure(c(1L, 1L, 1L, 2L, 2L), .Label = c("6-9",
"9-10"), class = "factor"), Monday = structure(c(1L, 2L, NA,
3L, 3L), .Label = c("a", "b", "c"), class = "factor"), Tuesday = structure(c(2L,
2L, 3L, 3L, 1L), .Label = c("a", "b", "c"), class = "factor"),
Wednesday = structure(c(1L, 3L, 1L, 2L, 2L), .Label = c("a",
"b", "c"), class = "factor"), Value1 = c(2L, 1L, NA, 1L,
1L), Value2 = c(3L, 2L, 4L, 3L, 2L), Value3 = c(4L, 3L, 1L,
1L, 1L)), class = "data.frame", row.names = c(NA, -5L))
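For comparison, the per-letter sums can also be sketched without reshaping, by treating the letter columns and the value columns as two aligned matrices. This is not the reshape approach shown above; food_df, foods, values, letters_used and totals are hypothetical names, and the sketch assumes the k-th letter column always pairs with the k-th value column.

```r
# Same data as above, with plain character columns for simplicity.
food_df <- data.frame(
  Age       = c("6-9", "6-9", "6-9", "9-10", "9-10"),
  Monday    = c("a", "b", NA, "c", "c"),
  Tuesday   = c("b", "b", "c", "c", "a"),
  Wednesday = c("a", "c", "a", "b", "b"),
  Value1    = c(2, 1, NA, 1, 1),
  Value2    = c(3, 2, 4, 3, 2),
  Value3    = c(4, 3, 1, 1, 1)
)

foods  <- as.matrix(food_df[c("Monday", "Tuesday", "Wednesday")])
values <- as.matrix(food_df[c("Value1", "Value2", "Value3")])
letters_used <- c("a", "b", "c")

# For each letter: keep the values whose paired cell holds that letter,
# then sum across the row. Missing letters or values are dropped by na.rm.
totals <- sapply(letters_used, function(l) {
  rowSums((foods == l) * values, na.rm = TRUE)
})

cbind(food_df, totals)
```

The trade-off is that the set of letters has to be listed up front, whereas the reshape-based answer discovers it from the data.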
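You can use the code below:
# Assumes `data` is the original 7-column data frame (no extra id column)
library(dplyr); library(tidyr); library(reshape2)
data[] <- lapply(data, as.character)
data$rownumber <- rownames(data)
# Stack letters and values separately; row2 numbers the stacked rows so the two can be paired
x <- gather(data[, c(1:4, 8)], Day, Letter, Monday:Wednesday) %>% mutate(row2 = row_number())
y <- gather(data[, c(1, 5:7, 8)], Day, Value, Value1:Value3) %>% mutate(row2 = row_number())
# Drop missing letters, then sum the values per original row and letter
z <- left_join(x, y, by = c("Age", "rownumber", "row2")) %>% filter(!is.na(Letter)) %>% group_by(Age, rownumber, Letter) %>% dplyr::summarise(suma = sum(as.numeric(Value), na.rm = TRUE)) %>% as.data.frame()
# One column per letter, then attach the original columns
z <- dcast(z, rownumber ~ Letter, value.var = "suma") %>% left_join(data, by = "rownumber")
z[is.na(z)] <- 0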
Output:
rownumber a b c Age Monday Tuesday Wednesday Value1 Value2 Value3
1 1 6 3 0 6-9 a b a 2 3 4
2 2 0 3 3 6-9 b b c 1 2 3
3 3 1 0 4 6-9 c a 0 4 1
4 4 0 1 4 9-10 c c b 1 3 1
5 5 2 1 1 9-10 c a b 1 2 1
I have a dataset like this below
W X Y Z
A 2 3 4
A 2 3 6
B 1 2 3
C 3 2 1
B 1 3 4
B 1 2 2
I want to combine/collapse the values in column Z only when the values in columns W, X and Y are the same.
The final dataset will be like this.
W X Y Z
A 2 3 4,6
B 1 2 3,2
C 3 2 1
B 1 3 4
I'm not sure how to do this; any suggestions are much appreciated.
We can group by 'W', 'X', 'Y' and paste the values of 'Z' (toString is paste(..., collapse=", "))
library(dplyr)
df1 %>%
group_by(W, X, Y) %>%
summarise(Z = toString(unique(Z)))
# A tibble: 4 x 4
# Groups: W, X [3]
# W X Y Z
# <chr> <int> <int> <chr>
#1 A 2 3 4, 6
#2 B 1 2 3, 2
#3 B 1 3 4
#4 C 3 2 1
Or with aggregate from base R
aggregate(Z ~ ., unique(df1), toString)
# W X Y Z
#1 B 1 2 3, 2
#2 C 3 2 1
#3 B 1 3 4
#4 A 2 3 4, 6
data
df1 <- structure(list(W = c("A", "A", "B", "C", "B", "B"), X = c(2L,
2L, 1L, 3L, 1L, 1L), Y = c(3L, 3L, 2L, 2L, 3L, 2L), Z = c(4L,
6L, 3L, 1L, 4L, 2L)), class = "data.frame", row.names = c(NA,
-6L))
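toString always joins with ", ". If the output should match the question's format exactly (for example "4,6" without a space), a minimal sketch is to pass a custom paste call to aggregate instead. The name res is introduced here for illustration; unlike the answer above, this version does not de-duplicate rows first.

```r
# Data from the question.
df1 <- data.frame(
  W = c("A", "A", "B", "C", "B", "B"),
  X = c(2L, 2L, 1L, 3L, 1L, 1L),
  Y = c(3L, 3L, 2L, 2L, 3L, 2L),
  Z = c(4L, 6L, 3L, 1L, 4L, 2L)
)

# Collapse Z within each W/X/Y combination, joining with a bare comma.
res <- aggregate(Z ~ W + X + Y, data = df1,
                 FUN = function(z) paste(z, collapse = ","))
res
```

Values are pasted in their original row order within each group, so the B/1/2 group yields "3,2".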
I have the following dataset
clust T2 n
1 a 1
1 b 3
1 c 3
2 d 5
3 a 4
3 b 3
4 b 5
4 c 8
4 t 6
4 e 7
etc..
using the following function:
library(dplyr)
table <- data %>% group_by(clust) %>% summarise(max = max(n), name1 = T2[which.max(n)])
I get this output
clust max name1
1 3 b
2 5 d
3 4 a
4 8 c
etc
However, there are cases where two or more T2 values correspond to max(n). How can I record those values too?
i.e.
clust max name1
1 3 b,c
2 5 d
3 4 a
4 8 c
etc
or
clust max name1
1 3 b
1 3 c
2 5 d
3 4 a
4 8 c
etc
We can compare with == instead of using which.max (which returns only the index of the first maximum) and paste the matching values together with toString
library(dplyr)
library(tidyr)
data %>%
group_by(clust) %>%
summarise(max = max(n), name1 = toString(T2[n == max(n)]))
# A tibble: 4 x 3
# clust max name1
# <int> <int> <chr>
#1 1 3 b, c
#2 2 5 d
#3 3 4 a
#4 4 8 c
and this can be expanded with separate_rows in the next step
data %>%
group_by(clust) %>%
summarise(max = max(n), name1 = toString(T2[n == max(n)])) %>%
separate_rows(name1, sep=",\\s+")
# A tibble: 5 x 3
# clust max name1
# <int> <int> <chr>
#1 1 3 b
#2 1 3 c
#3 2 5 d
#4 3 4 a
#5 4 8 c
Or have a list column and then unnest
data %>%
group_by(clust) %>%
summarise(max = max(n), name1 = list(T2[n == max(n)])) %>%
unnest(c(name1))
# A tibble: 5 x 3
# clust max name1
# <int> <int> <chr>
#1 1 3 b
#2 1 3 c
#3 2 5 d
#4 3 4 a
#5 4 8 c
data
data <- structure(list(clust = c(1L, 1L, 1L, 2L, 3L, 3L, 4L, 4L, 4L,
4L), T2 = c("a", "b", "c", "d", "a", "b", "b", "c", "t", "e"),
n = c(1L, 3L, 3L, 5L, 4L, 3L, 5L, 8L, 6L, 7L)),
class = "data.frame", row.names = c(NA,
-10L))
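The same == comparison against the group maximum can be sketched in base R with ave, which returns the per-group maximum repeated for every row. This gives the long format (one row per tied value) and is offered as an alternative to the dplyr pipelines above; ties is a name introduced here.

```r
# Data from the question.
data <- data.frame(
  clust = c(1L, 1L, 1L, 2L, 3L, 3L, 4L, 4L, 4L, 4L),
  T2    = c("a", "b", "c", "d", "a", "b", "b", "c", "t", "e"),
  n     = c(1L, 3L, 3L, 5L, 4L, 3L, 5L, 8L, 6L, 7L)
)

# ave() broadcasts max(n) within each clust back to every row of that
# clust; keeping rows where n equals it retains all ties.
ties <- data[data$n == ave(data$n, data$clust, FUN = max), ]
ties
```

For the collapsed format, the kept rows could then be passed to aggregate(T2 ~ clust + n, ties, toString).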
Suppose we have the following data with column names "id", "time" and "x":
df<-
structure(
list(
id = c(1L, 1L, 1L, 2L, 2L, 3L, 3L),
time = c(20L, 6L, 7L, 11L, 13L, 2L, 6L),
x = c(1L, 1L, 0L, 1L, 1L, 1L, 0L)
),
.Names = c("id", "time", "x"),
class = "data.frame",
row.names = c(NA,-7L)
)
Each id has multiple observations of time and x. I want to extract the last observation for each id and build a new data frame that repeats that observation as many times as the id occurs in the original data. I am able to extract the last observation for each id using the following code
library(dplyr)
df<-df%>%
group_by(id) %>%
filter( ((x)==0 & row_number()==n())| ((x)==1 & row_number()==n()))
What is left unresolved is the repetition aspect. The expected output would look like
df <-
structure(
list(
id = c(1L, 1L, 1L, 2L, 2L, 3L, 3L),
time = c(7L, 7L, 7L, 13L, 13L, 6L, 6L),
x = c(0L, 0L, 0L, 1L, 1L, 0L, 0L)
),
.Names = c("id", "time", "x"),
class = "data.frame",
row.names = c(NA,-7L)
)
Thanks for your help in advance.
We can use ave to find the max row number for each ID and subset it from the data frame.
df[ave(1:nrow(df), df$id, FUN = max), ]
# id time x
#3 1 7 0
#3.1 1 7 0
#3.2 1 7 0
#5 2 13 1
#5.1 2 13 1
#7 3 6 0
#7.1 3 6 0
You can do this by using last() to grab the last row within each id.
df %>%
group_by(id) %>%
mutate(time = last(time),
x = last(x))
Because last(x) returns a single value, it gets expanded out to fill all the rows in the mutate() call.
This can also be applied to an arbitrary number of variables using mutate_at:
df %>%
group_by(id) %>%
mutate_at(vars(-id), ~ last(.))
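The recycling described above (a single per-group value being expanded to every row of the group) can also be sketched in base R with ave. The helper last_value and the names out and cols are hypothetical; the data is the example from the question.

```r
# Data from the question.
df <- data.frame(
  id   = c(1L, 1L, 1L, 2L, 2L, 3L, 3L),
  time = c(20L, 6L, 7L, 11L, 13L, 2L, 6L),
  x    = c(1L, 1L, 0L, 1L, 1L, 1L, 0L)
)

# Hypothetical helper: the last element of a vector.
last_value <- function(v) v[length(v)]

# Apply it to every column except id; ave() recycles the single value
# returned for each id across all rows of that id.
out  <- df
cols <- setdiff(names(df), "id")
out[cols] <- lapply(df[cols], function(col) ave(col, df$id, FUN = last_value))
out
```

Because the columns are found with setdiff, this handles any number of variables without naming them, much like the mutate_at call.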
slice will be your friend in the tidyverse I reckon:
df %>%
group_by(id) %>%
slice(rep(n(),n()))
## A tibble: 7 x 3
## Groups: id [3]
# id time x
# <int> <int> <int>
#1 1 7 0
#2 1 7 0
#3 1 7 0
#4 2 13 1
#5 2 13 1
#6 3 6 0
#7 3 6 0
In data.table, you could also use the mult= argument of a join:
library(data.table)
setDT(df)
df[df[,.(id)], on="id", mult="last"]
# id time x
#1: 1 7 0
#2: 1 7 0
#3: 1 7 0
#4: 2 13 1
#5: 2 13 1
#6: 3 6 0
#7: 3 6 0
And in base R, a merge will get you there too:
merge(df["id"], df[!duplicated(df$id, fromLast=TRUE),])
# id time x
#1 1 7 0
#2 1 7 0
#3 1 7 0
#4 2 13 1
#5 2 13 1
#6 3 6 0
#7 3 6 0
Using data.table you can try
library(data.table)
setDT(df)[,.(time=rep(time[.N],.N), x=rep(x[.N],.N)), by=id]
# id time x
#1: 1 7 0
#2: 1 7 0
#3: 1 7 0
#4: 2 13 1
#5: 2 13 1
#6: 3 6 0
#7: 3 6 0
Following #thelatemai, to avoid naming the columns you can also try
df[, .SD[rep(.N,.N)], by=id]
# id time x
#1: 1 7 0
#2: 1 7 0
#3: 1 7 0
#4: 2 13 1
#5: 2 13 1
#6: 3 6 0
#7: 3 6 0