I have a dataset like the one below:
W X Y Z
A 2 3 4
A 2 3 6
B 1 2 3
C 3 2 1
B 1 3 4
B 1 2 2
I want to combine/collapse the values in column Z only if the values in columns W, X and Y are the same.
The final dataset will be like this.
W X Y Z
A 2 3 4,6
B 1 2 3,2
C 3 2 1
B 1 3 4
I'm not sure how to do this; any suggestions are much appreciated.
We can group by 'W', 'X', 'Y' and paste the values of 'Z' together (toString is a wrapper for paste(..., collapse=", ")):
library(dplyr)
df1 %>%
group_by(W, X, Y) %>%
summarise(Z = toString(unique(Z)))
# A tibble: 4 x 4
# Groups: W, X [3]
# W X Y Z
# <chr> <int> <int> <chr>
#1 A 2 3 4, 6
#2 B 1 2 3, 2
#3 B 1 3 4
#4 C 3 2 1
Or with aggregate from base R
aggregate(Z ~ ., unique(df1), toString)
# W X Y Z
#1 B 1 2 3, 2
#2 C 3 2 1
#3 B 1 3 4
#4 A 2 3 4, 6
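The remark that toString behaves like paste(..., collapse=", ") can be checked on a small vector. This is only a sketch; the vector z below is a made-up stand-in for the Z values of one group.

```r
# Hypothetical Z values for a single W/X/Y group
z <- c(4L, 6L)

collapsed_a <- toString(z)               # collapses with ", "
collapsed_b <- paste(z, collapse = ", ") # the equivalent explicit call

identical(collapsed_a, collapsed_b)
```

If a different separator is wanted (for example "," without the space, as in the desired output), paste(z, collapse = ",") can be used in place of toString.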
data
df1 <- structure(list(W = c("A", "A", "B", "C", "B", "B"), X = c(2L,
2L, 1L, 3L, 1L, 1L), Y = c(3L, 3L, 2L, 2L, 3L, 2L), Z = c(4L,
6L, 3L, 1L, 4L, 2L)), class = "data.frame", row.names = c(NA,
-6L))
I have this list of dataframes created as follows :
df = data.frame(x = c(1,0,0,0,1,1,1,NA), y = c(2,2,2,2,3,3,2,NA),
z = c(1:7,NA), m = c(1,2,3,1,2,3,1,NA) )
df$x = factor(df$x)
df$y = factor(df$y)
df$m = factor(df$m)
l1 = list(df$x,df$y,df$m)
l2 = lapply(l1,table)
l3 = lapply(l2,as.data.frame)
l3
The output is as follows :
[[1]]
Var1 Freq
1 0 3
2 1 4
[[2]]
Var1 Freq
1 2 5
2 3 2
[[3]]
Var1 Freq
1 1 3
2 2 2
3 3 2
I would like the variable names from the dataframe to be assigned automatically to the l3 list elements. For example: Var1 in list element 1 becomes x, Var1 in list element 2 becomes y, and Var1 in list element 3 becomes m. Thanks.
Using Map.
l3 <- lapply(df, table) |> lapply(as.data.frame)
(l3 <- Map(\(x, y) {names(x)[1] <- y; x}, l3, names(l3)))
# $x
# x Freq
# 1 0 3
# 2 1 4
#
# $y
# y Freq
# 1 2 5
# 2 3 2
#
# $z
# z Freq
# 1 1 1
# 2 2 1
# 3 3 1
# 4 4 1
# 5 5 1
# 6 6 1
# 7 7 1
#
# $m
# m Freq
# 1 1 3
# 2 2 2
# 3 3 2
Data:
df <- structure(list(x = structure(c(2L, 1L, 1L, 1L, 2L, 2L, 2L, NA
), levels = c("0", "1"), class = "factor"), y = structure(c(1L,
1L, 1L, 1L, 2L, 2L, 1L, NA), levels = c("2", "3"), class = "factor"),
z = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, NA), m = structure(c(1L,
2L, 3L, 1L, 2L, 3L, 1L, NA), levels = c("1", "2", "3"), class = "factor")), row.names = c(NA,
-8L), class = "data.frame")
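To see what Map is doing here in isolation, the following sketch applies the same renaming idea to two small hand-written frequency tables. The list tabs and its contents are invented for illustration and are not the l3 from the question.

```r
# Two toy frequency tables in a named list (illustrative data only)
tabs <- list(
  x = data.frame(Var1 = c("0", "1"), Freq = c(3L, 4L)),
  y = data.frame(Var1 = c("2", "3"), Freq = c(5L, 2L))
)

# Map walks over the tables and their names in parallel,
# replacing the first column name of each table with the list name
renamed <- Map(function(tab, nm) {
  names(tab)[1] <- nm
  tab
}, tabs, names(tabs))

names(renamed$x)
names(renamed$y)
```

Because the list is built with lapply(df, table), the element names come from the data frame's columns, which is what makes the renaming automatic.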
One possible solution:
Map(\(x,y) setNames(x, c(y,"Freq")), l3, c("x", "y", "m"))
[[1]]
x Freq
1 0 3
2 1 4
[[2]]
y Freq
1 2 5
2 3 2
[[3]]
m Freq
1 1 3
2 2 2
3 3 2
I have a dataframe such as:
COL1 VALUE1 VALUE2
1 A,A 1 5
2 A,A,B 1 3
3 C 1 1
4 D 1 2
5 D 1 2
6 A,A 1 10
7 A,B,A 1 2
and I have managed to remove the duplicates within COL1 and count how often each distinct value occurs by using:
as.data.frame(table(tab$COL1)) %>%
group_by(Var1 = sapply(strsplit(as.character(Var1), ","), function(x) toString(unique(x)))) %>%
summarise(Freq = sum(Freq))
And then I get:
# A tibble: 4 × 2
Var1 Freq
<chr> <int>
1 A 2
2 A, B 2
3 C 1
4 D 2
But I wondered if someone had an idea of how to add a new column called Mean, holding the mean of the VALUE2 values for each COL1 group, to get:
Var1 Freq Mean
1 A 2 7.5 < because (5+10)/2 =7.5
2 A, B 2 2.5 < because (3+2)/2 =2.5
3 C 1 1 < because 1/1 = 1
4 D 2 2 < because (2+2)/2 = 2
Here is the dataframe, if it helps:
structure(list(COL1 = structure(c(1L, 2L, 4L, 5L, 5L, 1L, 3L), .Label = c("A,A",
"A,A,B", "A,B,A", "C", "D"), class = "factor"), VALUE1 = c(1L,
1L, 1L, 1L, 1L, 1L, 1L), VALUE2 = c(5L, 3L, 1L, 2L, 2L, 10L,
2L)), class = "data.frame", row.names = c(NA, -7L))
You can calculate the frequency table directly in the dplyr chain, and then just add a Mean = mean(VALUE2) in the summarise() call.
I.e.
tab %>%
group_by(Var1 = sapply(strsplit(as.character(COL1), ","), function(x) toString(unique(x)))) %>%
summarise(Freq = sum(VALUE1), Mean = mean(VALUE2))
# # A tibble: 4 x 3
# Var1 Freq Mean
# <chr> <int> <dbl>
# 1 A 2 7.5
# 2 A, B 2 2.5
# 3 C 1 1
# 4 D 2 2
Is this what you want:
library(dplyr)
tab %>%
mutate(COL1 = sapply(strsplit(as.character(COL1), ","), function(x) toString(unique(x)))) %>%
group_by(COL1) %>%
summarise(Freq = sum(VALUE1),
Mean = mean(VALUE2))
# A tibble: 4 x 3
COL1 Freq Mean
* <chr> <int> <dbl>
1 A 2 7.5
2 A, B 2 2.5
3 C 1 1
4 D 2 2
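Both answers rely on the same grouping key: split COL1 on commas, drop repeated letters, and glue the rest back together. A base R sketch of just that step, with the per-group mean computed by tapply, might look like this; the vectors col1 and value2 are copied from the question's data and the variable names are my own.

```r
# COL1 and VALUE2 from the question, as plain vectors
col1   <- c("A,A", "A,A,B", "C", "D", "D", "A,A", "A,B,A")
value2 <- c(5, 3, 1, 2, 2, 10, 2)

# Normalise each entry: "A,B,A" and "A,A,B" both become "A, B"
key <- sapply(strsplit(col1, ","), function(parts) toString(unique(parts)))

# Mean of VALUE2 within each normalised key
means <- tapply(value2, key, mean)
means
```

Note that unique keeps the order of first appearance, so entries such as "B,A" and "A,B" would end up in different groups unless the letters are sorted first.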
I have a dataset which is similar to the following:
Age Food_1_1 Food_1_2 Food_1_3 Amount_1_1 Amount_1_2 Amount_1_3
6-9 a b a 2 3 4
6-9 b b c 1 2 3
6-9 c a 4 1
9-10 c c b 1 3 1
9-10 c a b 1 2 1
Using R, I want to get the following data set which contains new set of columns a, b and c by adding the corresponding values:
Age Food_1_1 Food_1_2 Food_1_3 Amount_1_1 Amount_1_2 Amount_1_3 a b c
6-9 a b a 2 3 4 6 3 0
6-9 b b c 1 2 3 0 3 3
6-9 c a 4 1 1 0 4
9-10 c c b 1 3 1 0 1 4
9-10 c a b 1 2 1 2 1 1
Note: My data also contains missing values. The variables Monday:Wednesday (shown above as Food_1_1:Food_1_3) are factors and the variables Value1:Value3 (shown above as Amount_1_1:Amount_1_3) are numeric. For more clarity: the 1st row of column "a" contains the sum of all values from Value1 to Value3 that belong to a (for example, 2+4 = 6).
One way using base R:
data$id <- 1:nrow(data) # Create a unique id
vlist <- list(grep("day$", names(data)), grep("^Value", names(data)))
d1 <- reshape(data, direction="long", varying=vlist, v.names=c("Day","Value"))
d2 <- aggregate(Value~id+Day, FUN=sum, na.rm=TRUE, data=d1)
d3 <- reshape(d2, direction="wide", v.names="Value", timevar="Day")
d3[is.na(d3)] <- 0
merge(data, d3, by="id", all.x=TRUE)
# id Age Monday Tuesday Wednesday Value1 Value2 Value3 Value.a Value.b Value.c
#1 1 6-9 a b a 2 3 4 6 3 0
#2 2 6-9 b b c 1 2 3 0 3 3
#3 3 6-9 <NA> c a NA 4 1 1 0 4
#4 4 9-10 c c b 1 3 1 0 1 4
#5 5 9-10 c a b 1 2 1 2 1 1
Data:
data <- structure(list(Age = structure(c(1L, 1L, 1L, 2L, 2L), .Label = c("6-9",
"9-10"), class = "factor"), Monday = structure(c(1L, 2L, NA,
3L, 3L), .Label = c("a", "b", "c"), class = "factor"), Tuesday = structure(c(2L,
2L, 3L, 3L, 1L), .Label = c("a", "b", "c"), class = "factor"),
Wednesday = structure(c(1L, 3L, 1L, 2L, 2L), .Label = c("a",
"b", "c"), class = "factor"), Value1 = c(2L, 1L, NA, 1L,
1L), Value2 = c(3L, 2L, 4L, 3L, 2L), Value3 = c(4L, 3L, 1L,
1L, 1L)), class = "data.frame", row.names = c(NA, -5L))
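As a cross-check on the reshape approach, the same totals can also be computed without reshaping by comparing the letter columns against each letter and summing the matching values row by row. This is a sketch under the assumption that the letter and value columns line up position by position (Monday with Value1, and so on); the data frame is re-entered by hand from the data above, with character columns instead of factors.

```r
# Same values as the data above, typed in directly
data <- data.frame(
  Age       = c("6-9", "6-9", "6-9", "9-10", "9-10"),
  Monday    = c("a", "b", NA, "c", "c"),
  Tuesday   = c("b", "b", "c", "c", "a"),
  Wednesday = c("a", "c", "a", "b", "b"),
  Value1    = c(2, 1, NA, 1, 1),
  Value2    = c(3, 2, 4, 3, 2),
  Value3    = c(4, 3, 1, 1, 1)
)

letters_mat <- as.matrix(data[c("Monday", "Tuesday", "Wednesday")])
values_mat  <- as.matrix(data[c("Value1", "Value2", "Value3")])

# For each letter, keep only the values whose day column holds that letter
# (TRUE/FALSE act as 1/0) and sum across the row, ignoring missing entries
totals <- sapply(c("a", "b", "c"), function(l) {
  rowSums(values_mat * (letters_mat == l), na.rm = TRUE)
})

cbind(data, totals)
```

The result has one column per letter, named a, b and c, matching the layout asked for in the question.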
You can use the code below:
library(dplyr)
library(tidyr)
library(reshape2)
data[] <- lapply(data, as.character)
data$rownumber <- rownames(data)
# row2 indexes the stacked rows so that each letter is paired with its value
x <- gather(data[, c(1:4, 8)], Day, Letter, Monday:Wednesday) %>% mutate(row2 = row_number())
y <- gather(data[, c(1, 5:7, 8)], Day, Value, Value1:Value3) %>% mutate(row2 = row_number())
z <- left_join(x, y, by = c("Age", "rownumber", "row2")) %>% group_by(Age, rownumber, Letter) %>% dplyr::summarise(suma = sum(as.numeric(Value), na.rm = TRUE)) %>% mutate(suma = replace_na(suma, 0))
z <- dcast(z, rownumber ~ Letter, value.var = "suma") %>% left_join(data, by = "rownumber")
z$Var.2 <- NULL # drop the extra column produced for rows with a blank Letter
z[is.na(z)] <- 0
Output:
rownumber a b c Age Monday Tuesday Wednesday Value1 Value2 Value3
1 1 6 3 0 6-9 a b a 2 3 4
2 2 0 3 3 6-9 b b c 1 2 3
3 3 1 0 4 6-9 c a 0 4 1
4 4 0 1 4 9-10 c c b 1 3 1
5 5 2 1 1 9-10 c a b 1 2 1
I have a dataframe as follows
group x y
a 1 2
a 3 1
b 1 3
c 1 1
c 2 3
I want to be able to generate all combinations of the x and y columns within a group, like so
group xy
a 1-2
a 1-1
a 3-2
a 3-1
b 1-3
c 1-1
c 1-3
c 2-1
c 2-3
I've tried using the following code, but it seems as though the group_by function is not working as expected
library(dplyr)
library(tidyr)
combn <- df %>%
group_by(group) %>%
expand(x, y)
My current results are instead giving me every combination of all three columns
head(combn)
group x y
a 1 1
a 1 2
a 1 3
a 2 1
a 2 2
a 2 3
Dput:
structure(list(group = structure(c(1L, 1L, 2L, 3L, 3L), .Label = c("a",
"b", "c"), class = "factor"), x = structure(c(1L, 3L, 1L, 1L,
2L), .Label = c("1", "2", "3"), class = "factor"), y = structure(c(2L,
1L, 3L, 1L, 3L), .Label = c("1", "2", "3"), class = "factor")), class = "data.frame", row.names = c(NA,
-5L))
You could use crossing from tidyr to create combinations within a group and then unnest to create them as separate rows.
library(dplyr)
df1 <- df %>%
group_by(group) %>%
summarise(xy = list(tidyr::crossing(x, y))) %>%
tidyr::unnest(xy)
df1
# group x y
# <fct> <int> <int>
#1 a 1 2
#2 a 3 2
#3 a 1 1
#4 a 3 1
#5 b 1 3
#6 c 1 1
#7 c 2 1
#8 c 1 3
#9 c 2 3
If you want to combine the two columns, you could use unite:
tidyr::unite(df1, xy, x, y, sep = "-")
# group xy
# <fct> <chr>
#1 a 1-2
#2 a 3-2
#3 a 1-1
#4 a 3-1
#5 b 1-3
#6 c 1-1
#7 c 2-1
#8 c 1-3
#9 c 2-3
I have the following dataset
clust T2 n
1 a 1
1 b 3
1 c 3
2 d 5
3 a 4
3 b 3
4 b 5
4 c 8
4 t 6
4 e 7
etc..
using the following function:
library(dplyr)
table <- data %>% group_by(clust) %>% summarise(max = max(n), name1 = T2[which.max(n)])
I get this output
clust max name1
1 3 b
2 5 d
3 4 a
4 8 c
etc
However, there are cases where two or more T2 values correspond to max(n). How can I record those values too?
i.e.
clust max name1
1 3 b,c
2 5 d
3 4 a
4 8 c
etc
or
clust max name1
1 3 b
1 3 c
2 5 d
3 4 a
4 8 c
etc
We can compare with == instead of using which.max (which returns only the index of the first maximum) and paste the matching values together with toString
library(dplyr)
library(tidyr)
data %>%
group_by(clust) %>%
summarise(max = max(n), name1 = toString(T2[n == max(n)]))
# A tibble: 4 x 3
# clust max name1
# <int> <int> <chr>
#1 1 3 b, c
#2 2 5 d
#3 3 4 a
#4 4 8 c
and this can be expanded with separate_rows in the next step
data %>%
group_by(clust) %>%
summarise(max = max(n), name1 = toString(T2[n == max(n)])) %>%
separate_rows(name1, sep=",\\s+")
# A tibble: 5 x 3
# clust max name1
# <int> <int> <chr>
#1 1 3 b
#2 1 3 c
#3 2 5 d
#4 3 4 a
#5 4 8 c
Or have a list column and then unnest
data %>%
group_by(clust) %>%
summarise(max = max(n), name1 = list(T2[n == max(n)])) %>%
unnest(c(name1))
# A tibble: 5 x 3
# clust max name1
# <int> <int> <chr>
#1 1 3 b
#2 1 3 c
#3 2 5 d
#4 3 4 a
#5 4 8 c
data
data <- structure(list(clust = c(1L, 1L, 1L, 2L, 3L, 3L, 4L, 4L, 4L,
4L), T2 = c("a", "b", "c", "d", "a", "b", "b", "c", "t", "e"),
n = c(1L, 3L, 3L, 5L, 4L, 3L, 5L, 8L, 6L, 7L)),
class = "data.frame", row.names = c(NA,
-10L))
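The difference between which.max and the == comparison is easiest to see on a single cluster. The sketch below uses the values of cluster 1 from the data above; the variable names first_only and all_ties are illustrative.

```r
# Cluster 1 from the data: b and c are tied at the maximum
n  <- c(1L, 3L, 3L)
T2 <- c("a", "b", "c")

# which.max stops at the first position of the maximum
first_only <- T2[which.max(n)]

# the comparison keeps every position that equals the maximum
all_ties <- T2[n == max(n)]

first_only
toString(all_ties)
```

Here first_only holds only "b", while all_ties holds both "b" and "c", which is why the grouped summaries above report "b, c" for cluster 1.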