I have a data frame such as this:
df <- data.frame(
ID = c('123','124','125','126'),
Group = c('A', 'A', 'B', 'B'),
V1 = c(1,2,1,0),
V2 = c(0,0,1,0),
V3 = c(1,1,0,3))
which returns:
ID Group V1 V2 V3
1 123 A 1 0 1
2 124 A 2 0 1
3 125 B 1 1 0
4 126 B 0 0 3
and I would like to return a table that indicates if a variable is represented in the group or not:
Group V1 V2 V3
A 1 0 1
B 1 1 1
The goal is to count the number of distinct variables present in each group.
Using dplyr:
library(dplyr)
df %>%
  group_by(Group) %>%
  # across() replaces the deprecated summarise_at()/funs() pair
  summarise(across(V1:V3, ~ as.integer(any(.x > 0))))
gives:
# A tibble: 2 × 4
Group V1 V2 V3
<fctr> <int> <int> <int>
1 A 1 0 1
2 B 1 1 1
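This can also be done in data.table:
library(data.table)
setDT(df)
# TRUE/FALSE per group; unnamed j expressions are auto-named V1, V2, V3
res <- df[, .(sum(V1) > 0, sum(V2) > 0, sum(V3) > 0), by = Group]
res
Group V1 V2 V3
1: A TRUE FALSE TRUE
2: B TRUE TRUE TRUE
# convert the logical columns to integer
res[, lapply(.SD, as.integer), by = Group, .SDcols = 2:4]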
Group V1 V2 V3
1: A 1 0 1
2: B 1 1 1
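For readers without dplyr or data.table, the same presence table can probably be built in base R. The following is only a sketch: it assumes the `df` from the question, and the `n_present` column is a name introduced here to hold the count of distinct variables per group that the question asks for.

```r
df <- data.frame(
  ID = c('123', '124', '125', '126'),
  Group = c('A', 'A', 'B', 'B'),
  V1 = c(1, 2, 1, 0),
  V2 = c(0, 0, 1, 0),
  V3 = c(1, 1, 0, 3))

# 1 if the variable has any positive value within the group, else 0
res <- aggregate(cbind(V1, V2, V3) ~ Group, data = df,
                 FUN = function(x) as.integer(any(x > 0)))

# n_present (hypothetical column name): number of variables present per group
res$n_present <- rowSums(res[c("V1", "V2", "V3")])
res
```

With the example data, group A should have two variables present and group B three.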
I have two data frames, dat1 and dat2. I would like to compare the values of the columns they have in common and return the highest value.
How can I compare the values of two tables and return the highest value in R?
library(reshape2)
dat1 <- data.frame(sn=c("v1","v2"), A=c(1,0), B=c(0,1), C=c(0,0), D=c(1,0))
dat2 <- data.frame(sn=c("v1","v2", "v3"), A=c(1,0,1), C=c(1,0,1), B=c(0,0,1))
dat1:
sn A B C D
v1 1 0 0 1
v2 0 1 0 0
dat2:
sn A C B
v1 1 1 0
v2 0 0 0
v3 1 1 1
dt1 <- melt(dat1,"sn")
dt2 <- melt(dat2,"sn")
dt3 <- merge(dt1,dt2,by=c("sn","variable"))
dt3$value <- max(dt3$value.x, dt3$value.y)
#I got the following which is not correct.
dt3:
sn variable value.x value.y value
v1 A 1 1 1
v1 B 0 0 1
v1 C 0 1 1
v2 A 0 0 1
v2 B 1 0 1
v2 C 0 0 1
#I would like dt3 to return the following
dt3:
sn variable value.x value.y value
v1 A 1 1 1
v1 B 0 0 0
v1 C 0 1 1
v2 A 0 0 0
v2 B 1 0 1
v2 C 0 0 0
Using data.table
library(data.table)
setDT(dat1)
setDT(dat2)
dt1 = melt(dat1, id.vars = "sn")
dt2 = melt(dat2, id.vars = "sn")
# Inner join
dt3 = merge(dt1, dt2, by = c("sn", "variable"))
dt3[, value := pmax(value.x, value.y)]
dt3
# Key: <sn, variable>
# sn variable value.x value.y value
# <char> <fctr> <num> <num> <num>
# 1: v1 A 1 1 1
# 2: v1 B 0 0 0
# 3: v1 C 0 1 1
# 4: v2 A 0 0 0
# 5: v2 B 1 0 1
# 6: v2 C 0 0 0
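The difference between the question's attempt and this answer comes down to `max()` versus `pmax()`. A minimal base R sketch, using the `value.x` and `value.y` columns from the merged table above typed in by hand:

```r
value.x <- c(1, 0, 0, 0, 1, 0)
value.y <- c(1, 0, 1, 0, 0, 0)

# max() reduces all inputs to a single number; assigning it to a column
# recycles that one number into every row, which is what went wrong
overall_max <- max(value.x, value.y)

# pmax() compares the vectors position by position (a "parallel" maximum)
row_max <- pmax(value.x, value.y)

overall_max
row_max
```

Here `overall_max` is a single 1, while `row_max` reproduces the desired `value` column.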
Here's a tidyverse solution. Note that the maximum for sn v3 and variable D is -Inf because all of its values are NA.
library(dplyr)
library(tidyr)
bind_rows(dat1, dat2) %>%
pivot_longer(
-sn,
names_to = "variable"
) %>%
group_by(sn, variable) %>%
summarise(max_value = max(value, na.rm = TRUE))
# A tibble: 12 × 3
# Groups: sn [3]
sn variable max_value
<chr> <chr> <dbl>
1 v1 A 1
2 v1 B 0
3 v1 C 1
4 v1 D 1
5 v2 A 0
6 v2 B 1
7 v2 C 0
8 v2 D 0
9 v3 A 1
10 v3 B 1
11 v3 C 1
12 v3 D -Inf
Warning message:
In max(value, na.rm = TRUE) :
no non-missing arguments to max; returning -Inf
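If the -Inf and the warning are unwanted, one possible workaround is a small wrapper that returns NA when a group has no non-missing values. This is a sketch only; `safe_max` is a hypothetical helper name, not part of dplyr or base R.

```r
# safe_max (hypothetical helper): like max(x, na.rm = TRUE), but returns NA
# instead of -Inf (with a warning) when every value is missing
safe_max <- function(x) {
  if (all(is.na(x))) NA_real_ else max(x, na.rm = TRUE)
}

safe_max(c(NA_real_, NA_real_))  # all missing
safe_max(c(1, NA))               # one value present
safe_max(c(0, 1))
```

It could then be used in place of `max(value, na.rm = TRUE)` inside `summarise()`, so that the v3/D cell becomes NA.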
I'm trying to create a new character variable (V4 in my example) based on the values of other variables. I need to use the column names to fill this new variable.
I have this:
V1 V2 V3
1 1 0 1
2 0 1 1
3 0 0 0
4 1 1 1
I would like the new variable to contain the names of all columns whose value is equal to 1,
like this:
V1 V2 V3 V4
1 1 0 1 "V1,V3"
2 0 1 1 "V2,V3"
3 0 0 0 " "
4 1 1 1 "V1,V2,V3"
Example data:
df <- data.frame(
V1 =c(1,0,0,1),
V2 = c(0,1,0,1),
V3 = c(1,1,0,1)
)
You can use the following code:
library(dplyr)
df %>%
rowwise() %>%
mutate(V4 = paste0(names(.)[c_across(everything()) == 1], collapse = ','))
Output:
# A tibble: 4 × 4
# Rowwise:
V1 V2 V3 V4
<dbl> <dbl> <dbl> <chr>
1 1 0 1 "V1,V3"
2 0 1 1 "V2,V3"
3 0 0 0 ""
4 1 1 1 "V1,V2,V3"
Data
df <- data.frame(
V1 = c(1,0,0,1),
V2 = c(0,1,0,1),
V3 = c(1,1,0,1)
)
Using base R with apply (paste with collapse = "," gives "V1,V3" without the space that toString would add):
df$V4 <- apply(df, 1, \(x) paste(names(x)[x == 1], collapse = ","))
I have a data frame in which some rows differ only in the order of V1 and V2 ("A"-"B" versus "B"-"A"), and such rows have the same Value:
library(tibble)
df <- tibble(
V1 = c("A", "C", "B","D"),
V2 = c("B", "D", "A","C"),
Value = c(1,2,1,2)
)
V1 V2 Value
<chr> <chr> <dbl>
1 A B 1
2 C D 2
3 B A 1
4 D C 2
I want to remove one row from each duplicated pair (row 0 or row 2 here), so that the result looks like this:
V1 V2 Value
0 A B 1
1 C D 2
How can I remove those repetitive rows?
df[!duplicated(t(apply(df,1,sort))),]
V1 V2 Value
0 A B 1
1 C D 2
or even:
df[!duplicated(cbind(pmax(df$V1, df$V2), pmin(df$V1, df$V2))),]
V1 V2 Value
0 A B 1
1 C D 2
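The pmin/pmax version works by building an order-independent key for each row. A small sketch of that idea, assuming a plain data.frame with the question's values; `lo`, `hi` and `key` are names introduced here for illustration:

```r
df <- data.frame(
  V1 = c("A", "C", "B", "D"),
  V2 = c("B", "D", "A", "C"),
  Value = c(1, 2, 1, 2))

# pmin()/pmax() also work element-wise on character vectors, so each row's
# pair is put into a fixed (alphabetical) order regardless of column order
lo <- pmin(df$V1, df$V2)
hi <- pmax(df$V1, df$V2)
key <- paste(lo, hi, sep = "-")
key

# keep only the first row for each key
dedup <- df[!duplicated(key), ]
dedup
```

Note that this key ignores Value, whereas the sort-based version above includes it in the comparison; which behaviour is wanted depends on the data.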
An option with tidyverse
library(dplyr)
library(stringr)
library(purrr)
df %>%
filter(!duplicated(pmap_chr(across(V1:V2), ~ str_c(sort(c(...)),
collapse = ""))))
# A tibble: 2 × 3
V1 V2 Value
<chr> <chr> <dbl>
1 A B 1
2 C D 2
This may already have an answer here, but I was unable to find one.
Let us assume some simple data like this:
x <- data.frame(id = rep(1:3, each = 2),
v1 = c('A', 'B', 'A', 'B', 'A', 'C'))
> x
id v1
1 1 A
2 1 B
3 2 A
4 2 B
5 3 A
6 3 C
Now I want an output showing how the values of the v1 column co-occur with each other across the id groups, something like this:
v1 A B C
1 A 0 2 1
2 B 2 0 0
3 C 1 0 0
So I proceeded like this:
library(tidyverse)
# merge x with itself on id, so every pairing within an id is listed
x <- merge(x, x, by = "id", all = TRUE)
# drop rows where a value is paired with itself
x <- x[x$v1.x != x$v1.y, ]
# final code
x %>% select(-id) %>%
group_by(v1.x, v1.y) %>%
summarise(val = n()) %>%
pivot_wider(names_from = v1.y, values_from = val, values_fill = 0L, names_sort = TRUE)
# A tibble: 3 x 4
# Groups: v1.x [3]
v1.x A B C
<chr> <int> <int> <int>
1 A 0 2 1
2 B 2 0 0
3 C 1 0 0
My question: is there a better or more direct method to obtain this cross-table?
How about creating a contingency table with xtabs (which can work with large data sets as well). Then, you can use crossprod on the table and set the diagonal to zero for the final result.
ct <- xtabs(~ id + v1, data = x)
cp <- crossprod(ct, ct)
diag(cp) <- 0
cp
Instead of xtabs, you can also create the cross-table with plain table. As noted by @A5C1D2H2I1M1N2O1R2T1, this simplifies to a nice one-liner equivalent:
"diag<-"(crossprod(table(x)), 0)
Output
v1
v1 A B C
A 0 2 1
B 2 0 0
C 1 0 0
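To see why crossprod gives the co-occurrence counts, it may help to look at the intermediate table. The sketch below assumes the original `x` from the question (before it was overwritten by the merge) and that each (id, v1) pair appears at most once.

```r
x <- data.frame(id = rep(1:3, each = 2),
                v1 = c('A', 'B', 'A', 'B', 'A', 'C'))

# rows = id, columns = v1 level; a 1 marks that the level occurs in that id
ct <- table(x$id, x$v1)

# crossprod(ct) is t(ct) %*% ct: entry [i, j] sums, over all ids, the product
# of the indicators for levels i and j, i.e. the number of ids containing both
cp <- crossprod(ct)

# the diagonal holds how many ids contain each level; keep a copy, then zero it
ids_per_level <- diag(cp)
diag(cp) <- 0
cp
```

If a level could repeat within an id, the entries would be products of counts rather than simple co-occurrence counts.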
I have two data frames, df1 and df2, and I want to create a new data frame that is the union of the two. If a particular column has the value 1 in either data frame, the new data frame should have the value 1 for that column.
df1 = data.frame( V1 = letters[1:5], V2 = c("0","1","1","0","1"), V3 = c("0","0","0","0","1"), V4 =c("1","1","1","1","1"), V5 = c("0","0","0","0","1"),V6 =c("1","1","1","0","0"))
df2 = data.frame( V1 = letters[1:5], V2 = c("1","1","1","0","0"), V3 = c("1","0","0","0","1"), V4 =c("0","0","1","0","1"), V5 = c("1","0","0","0","1"))
result = data.frame( V1 = letters[1:5], V2 = c("1","1","1","0","1"), V3 = c("1","0","0","0","1"), V4 =c("1","1","1","1","1"), V5 = c("1","0","0","0","1"),V6 =c("1","1","1","0","0"))
Here is my first attempt; although I'm sure this can be improved:
library(tidyverse)
set.seed(345)
df1 <- tibble(
V1 = letters[1:5],
V2 = sample(c(0,1), 5, replace = TRUE),
V3 = sample(c(0,1), 5, replace = TRUE)
)
df2 <- tibble(
V1 = letters[1:5],
V2 = sample(c(0,1), 5, replace = TRUE),
V3 = sample(c(0,1), 5, replace = TRUE)
)
df1
# A tibble: 5 x 3
V1 V2 V3
<chr> <dbl> <dbl>
1 a 0 1
2 b 0 0
3 c 0 1
4 d 1 0
5 e 0 0
df2
# A tibble: 5 x 3
V1 V2 V3
<chr> <dbl> <dbl>
1 a 0 0
2 b 1 1
3 c 0 0
4 d 1 1
5 e 1 1
The draft solution:
result <- df1 %>%
left_join(df2, by = "V1") %>%
rowwise() %>%
mutate(
V2 = max(V2.x, V2.y),
V3 = max(V3.x, V3.y)
) %>%
select(V1, V2, V3)
result
# A tibble: 5 x 3
V1 V2 V3
<chr> <dbl> <dbl>
1 a 0 1
2 b 1 1
3 c 0 1
4 d 1 1
5 e 1 1
Obviously, if you have a large number of variables, this becomes a less ideal answer.
UPDATE:
Here is how to make the solution even more general for an arbitrary number of columns:
df1 %>%
select(V1) %>%
bind_cols(
map2_df(
.x = df1[-1],
.y = df2[-1],
.f = ~ map2_dbl(.x, .y, max)
)
)
# A tibble: 5 x 3
V1 V2 V3
<chr> <dbl> <dbl>
1 a 0 1
2 b 1 1
3 c 0 1
4 d 1 1
5 e 1 1
This is how it works:
We can supply map2_dbl with one pair of vectors and find the max of the two vectors at each position like so:
map2_dbl(
.x = c(0, 0, 0, 1, 0),
.y = c(0, 1, 0, 1, 1),
.f = max
)
[1] 0 1 0 1 1
That will become the inner-most portion of the solution. Now, we just need to figure out how to pass in all pairs of variables from both data frames iteratively to the map2_dbl above. This silly example shows how it works:
map2(
.x = df1[-1],
.y = df2[-1],
.f = function(x, y) {
cat("x = ", x, "y = ", y, "\n")
}
)
x = 0 0 0 1 0 y = 0 1 0 1 1
x = 1 0 1 0 0 y = 0 1 0 1 1
$V2
NULL
$V3
NULL
Notice that in the first pass x = df1$V2 and y = df2$V2. In the second iteration x = df1$V3 and y = df2$V3. That is exactly what we want.
We could use three steps to get our final solution:
x1 <- df1 %>%
select(V1)
x2 <- map2_df(
.x = df1[-1],
.y = df2[-1],
.f = function(x, y) {
map2_dbl(x, y, max)
}
)
bind_cols(x1, x2)
# A tibble: 5 x 3
V1 V2 V3
<chr> <dbl> <dbl>
1 a 0 1
2 b 1 1
3 c 0 1
4 d 1 1
5 e 1 1
OR, we can combine these steps into one pipeline:
df1 %>%
select(V1) %>%
bind_cols(
map2_df(
.x = df1[-1],
.y = df2[-1],
.f = ~ map2_dbl(.x, .y, max)
)
)
# A tibble: 5 x 3
V1 V2 V3
<chr> <dbl> <dbl>
1 a 0 1
2 b 1 1
3 c 0 1
4 d 1 1
5 e 1 1
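As a possible simplification, the element-wise maximum that `map2_dbl(.x, .y, max)` computes is what base R's `pmax()` does directly. The sketch below is one way to use it, not the answer's method: it assumes both data frames have the same rows in the same order and the same value columns, and it hard-codes the sampled values shown above. `value_cols` is a name introduced here.

```r
df1 <- data.frame(V1 = letters[1:5],
                  V2 = c(0, 0, 0, 1, 0),
                  V3 = c(1, 0, 1, 0, 0))
df2 <- data.frame(V1 = letters[1:5],
                  V2 = c(0, 1, 0, 1, 1),
                  V3 = c(0, 1, 0, 1, 1))

# every column except the key column V1
value_cols <- setdiff(names(df1), "V1")

# Map() pairs up the matching columns; pmax() takes the row-wise maximum
result <- df1
result[value_cols] <- Map(pmax, df1[value_cols], df2[value_cols])
result
```

If the rows might be in a different order, or one data frame has extra columns (as in the question's df1, which has V6), the data would need to be aligned first, for example with a join on V1.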