How to collapse a table while ignoring a specific value in R? - r

I have a data frame like the following:
> example
name X1.8 X1.8.1 X1.8.2
1 a 1 1 7
2 b 33 0 2
3 c 3 10 -1
4 a -1 -1 4
5 d 5 8 5
6 e 7 6 12
7 a -1 7 7
8 c 5 20 9
and I want to collapse (sum) the rows that share the same name (column 1), ignoring the value -1 while summing. For example, the data above would become:
> example # the goal
name X1.8 X1.8.1 X1.8.2
1 a 1 8 18
2 b 33 0 2
3 c 8 30 9
4 d 5 8 5
5 e 7 6 12
> dput(example)
structure(list(name = structure(c(1L, 2L, 3L, 1L, 4L, 5L, 1L,
3L), .Label = c("a", "b", "c", "d", "e", "f"), class = "factor"),
X1.8 = c(1, 33, 3, -1, 5, 7, -1, 5), X1.8.1 = c(1, 0, 10,
-1, 8, 6, 7, 20), X1.8.2 = c(7, 2, -1, 4, 5, 12, 7, 9)), row.names = c(NA,
8L), class = "data.frame")
Edit to the question:
Will this still work if some rows contain only -1? For example,
> example
name X1.8 X1.8.1 X1.8.2
1 a 1 1 7
2 b 33 0 2
3 c 3 10 -1
4 a -1 -1 4
5 d 5 8 5
6 e 7 6 12
7 a -1 7 7
8 c 5 20 9
9 f -1 -1 -1

You can drop the -1 values and sum the rest of the values in each group.
Using base R:
aggregate(.~name, example, function(x) sum(x[x!=-1]))
# name X1.8 X1.8.1 X1.8.2
#1 a 1 8 18
#2 b 33 0 2
#3 c 8 30 9
#4 d 5 8 5
#5 e 7 6 12
In dplyr:
library(dplyr)
example %>%
group_by(name) %>%
summarise(across(everything(), ~sum(.[. != -1])))
and in data.table:
library(data.table)
setDT(example)[, lapply(.SD, function(x) sum(x[x!=-1])), name]
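Regarding the edit: a group whose values are all -1 leaves nothing to sum, and sum() of an empty vector is 0. The following is a minimal base R sketch, assuming the nine-row data from the edit; the helper name sum_ignoring is hypothetical and not part of any package. It shows what the approach above returns for group f:

```r
# Hypothetical helper: sum a vector while skipping a sentinel value
sum_ignoring <- function(x, sentinel = -1) sum(x[x != sentinel])

# Nine-row data from the edit (name kept as a plain character column)
example <- data.frame(
  name   = c("a", "b", "c", "a", "d", "e", "a", "c", "f"),
  X1.8   = c(1, 33, 3, -1, 5, 7, -1, 5, -1),
  X1.8.1 = c(1, 0, 10, -1, 8, 6, 7, 20, -1),
  X1.8.2 = c(7, 2, -1, 4, 5, 12, 7, 9, -1)
)

# Apply the helper to every non-grouping column, one group at a time
res <- aggregate(. ~ name, data = example, FUN = sum_ignoring)
res
```

Group f comes out as 0 in every column. If it should stay -1 instead, the function needs an explicit check for the all -1 case.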

As you are calculating sums, you can set the -1 values you want to ignore to 0 and use rowsum to get the sum per group (x is a copy of example here).
x <- example
x[x==-1] <- 0
rowsum(x[-1], x[,1])
# X1.8 X1.8.1 X1.8.2
#a 1 8 18
#b 33 0 2
#c 8 30 9
#d 5 8 5
#e 7 6 12
Another option is to set -1 to NA:
x <- example
x[x==-1] <- NA
rowsum(x[-1], x[,1], na.rm = TRUE)
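To check that the two rowsum variants agree, here is a small self-contained sketch. It assumes the eight-row data from the question and recodes only the numeric columns; the names vals, zeroed, masked, by_zero and by_na are illustrative:

```r
example <- data.frame(
  name   = c("a", "b", "c", "a", "d", "e", "a", "c"),
  X1.8   = c(1, 33, 3, -1, 5, 7, -1, 5),
  X1.8.1 = c(1, 0, 10, -1, 8, 6, 7, 20),
  X1.8.2 = c(7, 2, -1, 4, 5, 12, 7, 9)
)
vals <- example[-1]                  # numeric columns only

zeroed <- vals
zeroed[zeroed == -1] <- 0            # -1 then contributes nothing to a sum
by_zero <- rowsum(zeroed, example$name)

masked <- vals
masked[masked == -1] <- NA           # mark -1 as missing instead
by_na <- rowsum(masked, example$name, na.rm = TRUE)

# Both routes should give the same per-group sums
all(as.matrix(by_zero) == as.matrix(by_na))
```

Both routes lose information: after recoding, a -1 can no longer be told apart from a genuine 0 (first variant) or a genuine NA (second variant).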

Related

Subset groups in a data.table using conditions on two columns

I have a data.table with a high number of groups. I would like to subset whole groups (not just rows) based on the conditions on multiple columns. Consider the following data.table:
DT <- structure(list(id = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L),
group = c("A", "A", "A", "A", "B", "B", "B", "B", "C", "C", "C", "C"),
y = c(14, 19, 16, 10, 6, 8, 14, 19, 10, 9, 6, 8),
x = c(3, 3, 2, 3, 3, 3, 3, 2, 2, 3, 3, 3)),
row.names = c(NA, -12L),
class = c("data.table", "data.frame"))
>DT
id group y x
1: 1 A 14 3
2: 2 A 19 3
3: 3 A 16 2
4: 4 A 10 3
5: 5 B 6 3
6: 6 B 8 3
7: 7 B 14 3
8: 8 B 19 2
9: 9 C 10 2
10: 10 C 9 3
11: 11 C 6 3
12: 12 C 8 3
I would like to keep every group that has at least one row with y = 6 and x = 3, so that only groups B and C remain (preferably using the data.table package in R):
id group y x
1: 5 B 6 3
2: 6 B 8 3
3: 7 B 14 3
4: 8 B 19 2
5: 9 C 10 2
6: 10 C 9 3
7: 11 C 6 3
8: 12 C 8 3
All my attempts gave me only those rows containing y=6 and x=3, which I do not want:
id group y x
1: 5 B 6 3
2: 11 C 6 3
With data.table:
DT[,.SD[any(x == 3 & y == 6)], by=group]
group id y x
<char> <int> <num> <num>
1: B 5 6 3
2: B 6 8 3
3: B 7 14 3
4: B 8 19 2
5: C 9 10 2
6: C 10 9 3
7: C 11 6 3
8: C 12 8 3
Another possibly faster option:
DT[, if (any(x == 3 & y == 6)) .SD, by=group]
Try the dplyr package:
library(dplyr)
# collect the groups that contain a row with y == 6 and x == 3
groups = DT %>% filter(y == 6, x == 3) %>% select(group) %>% unique() %>% unlist() %>% as.vector()
# keep every row that belongs to one of those groups
DT %>% filter(group %in% groups)
A data.table option
> DT[group %in% DT[.(3, 6), group, on = .(x, y)]]
id group y x
1: 5 B 6 3
2: 6 B 8 3
3: 7 B 14 3
4: 8 B 19 2
5: 9 C 10 2
6: 10 C 9 3
7: 11 C 6 3
8: 12 C 8 3
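The same two-step idea (find the groups that contain a matching row, then keep all rows of those groups) can also be sketched without any package. This assumes a plain data.frame copy of the data; the names hit, wanted and res are illustrative only:

```r
DT <- data.frame(
  id    = 1:12,
  group = rep(c("A", "B", "C"), each = 4),
  y     = c(14, 19, 16, 10, 6, 8, 14, 19, 10, 9, 6, 8),
  x     = c(3, 3, 2, 3, 3, 3, 3, 2, 2, 3, 3, 3)
)

hit    <- DT$x == 3 & DT$y == 6        # rows meeting both conditions
wanted <- unique(DT$group[hit])        # groups that own such a row
res    <- DT[DT$group %in% wanted, ]   # all rows of those groups
res
```

Testing both conditions in a single logical vector is what ensures they hold in the same row, rather than anywhere within the group.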

dplyr: add numbers based on matching rows

Let's say I have
> fig
hands imp_spe n
1 A 0 39
2 A 1 32
3 B 0 3
4 B 1 2
5 C 0 115
6 C 1 24
7 D 0 11
8 D 1 3
I want to add a new column fig$new that holds the sum of fig$n over all rows sharing the same value of fig$hands.
The rest of the data frame should stay as it is.
Expected output
> fig
hands imp_spe n new
1 A 0 39 71
2 A 1 32 71
3 B 0 3 5
4 B 1 2 5
5 C 0 115 139
6 C 1 24 139
7 D 0 11 14
8 D 1 3 14
I am looking for a solution in dplyr
fig <- structure(list(hands = c("A", "A", "B", "B", "C", "C", "D", "D"
), imp_spe = c(0, 1, 0, 1, 0, 1, 0, 1), n = c(39L, 32L, 3L, 2L,
115L, 24L, 11L, 3L)), row.names = c(NA, -8L), class = "data.frame")
Here you go:
library(dplyr)
fig %>%
group_by(hands) %>%
mutate(new = sum(n)) %>%
ungroup()
dplyr solution:
dplyr::add_count(fig, hands, wt = n, name = 'new')
# hands imp_spe n new
# 1 A 0 39 71
# 2 A 1 32 71
# 3 B 0 3 5
# 4 B 1 2 5
# 5 C 0 115 139
# 6 C 1 24 139
# 7 D 0 11 14
# 8 D 1 3 14
base solution:
transform(
fig,
new = ave(x = n, hands, FUN = sum)
)
# hands imp_spe n new
# 1 A 0 39 71
# 2 A 1 32 71
# 3 B 0 3 5
# 4 B 1 2 5
# 5 C 0 115 139
# 6 C 1 24 139
# 7 D 0 11 14
# 8 D 1 3 14
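For readers who want to see the grouping step spelled out, here is a hedged base R sketch that first computes one total per value of hands and then looks that total up for every row. The variable name totals is illustrative:

```r
fig <- data.frame(
  hands   = c("A", "A", "B", "B", "C", "C", "D", "D"),
  imp_spe = c(0, 1, 0, 1, 0, 1, 0, 1),
  n       = c(39L, 32L, 3L, 2L, 115L, 24L, 11L, 3L)
)

totals  <- tapply(fig$n, fig$hands, sum)   # one total per value of hands
fig$new <- as.vector(totals[fig$hands])    # look up each row's group total by name
fig
```

Row order and the existing columns are untouched; only the new column is appended.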

How to sum a df while ignoring value in R?

This is an extension of my previous question. I reviewed the linked duplicate, but I am still having trouble.
I have a data frame like the following:
> example
name X1.8 X1.8.1 X1.8.2
1 a -1 1 7
2 b 33 0 2
3 c 3 10 -1
4 a -1 -1 4
5 d 5 8 5
6 e 7 6 12
7 a -1 7 7
8 c 5 20 9
9 f -1 -1 -1
and I want to collapse (sum) the rows that share the same name (column 1) while ignoring the value -1 when summing; here -1 plays a role similar to NA. For example, the data above would become:
> example # the goal
name X1.8 X1.8.1 X1.8.2
1 a -1 8 18 # the first col stays as -1 b/c all are -1
2 b 33 0 2
3 c 8 30 9
4 d 5 8 5
5 e 7 6 12
6 f -1 -1 -1
> dput(example)
structure(list(name = structure(c(1L, 2L, 3L, 1L, 4L, 5L, 1L,
3L, 6L), .Label = c("a", "b", "c", "d", "e", "f"), class = "factor"),
X1.8 = c(-1, 33, 3, -1, 5, 7, -1, 5, -1), X1.8.1 = c(1, 0,
10, -1, 8, 6, 7, 20, -1), X1.8.2 = c(7, 2, -1, 4, 5, 12,
7, 9, -1)), row.names = c(NA, 9L), class = "data.frame")
We can use if/else after the group_by: after grouping by 'name', summarise across all the other columns (dplyr 1.0.0). If all values in a group are -1, return -1; otherwise return the sum of the values excluding -1.
library(dplyr) # 1.0.0
example %>%
group_by(name) %>%
summarise(across(everything(), ~ if(all(.==-1)) -1 else
sum(.[. != -1], na.rm = TRUE)))
# A tibble: 6 x 4
# name X1.8 X1.8.1 X1.8.2
# <fct> <dbl> <dbl> <dbl>
#1 a -1 8 18
#2 b 33 0 2
#3 c 8 30 9
#4 d 5 8 5
#5 e 7 6 12
#6 f -1 -1 -1
Another option is to use na_if to replace -1 with NA and then use na.rm = TRUE in sum. We avoided that route in case a group contains actual NAs; keeping -1 as it is makes those entries identifiable.
Or with summarise_at:
example %>%
group_by(name) %>%
summarise_at(vars(-group_cols()), ~ if(all(.==-1)) -1 else
sum(.[. != -1], na.rm = TRUE))
# A tibble: 6 x 4
# name X1.8 X1.8.1 X1.8.2
# <fct> <dbl> <dbl> <dbl>
#1 a -1 8 18
#2 b 33 0 2
#3 c 8 30 9
#4 d 5 8 5
#5 e 7 6 12
#6 f -1 -1 -1
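This solution could help you. Note that a group containing only -1 comes out as 0 rather than -1:
library(dplyr)
# Recode -1 as NA
example[example==-1]<-NA
# Sum within each name, skipping NA
example %>% group_by(name) %>% summarise_all(sum,na.rm=TRUE)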
# A tibble: 6 x 4
name X1.8 X1.8.1 X1.8.2
<fct> <dbl> <dbl> <dbl>
1 a 0 8 18
2 b 33 0 2
3 c 8 30 9
4 d 5 8 5
5 e 7 6 12
6 f 0 0 0
Base R:
aggregate(x = example[,2:4],
by = list(name = example$name),
FUN = function(x)ifelse(all(x==-1), -1, sum(x[x!=-1])))
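Because the test all(x == -1) yields a single TRUE or FALSE per group, a plain if/else inside a small named function does the same job and may be easier to reuse. This is a sketch assuming the data contains no real NA values; the helper name sum_or_sentinel is hypothetical:

```r
# Hypothetical helper: sum without the sentinel, but return the sentinel
# itself when nothing else is left in the group
sum_or_sentinel <- function(x, sentinel = -1) {
  kept <- x[x != sentinel]
  if (length(kept) == 0) sentinel else sum(kept)
}

example <- data.frame(
  name   = c("a", "b", "c", "a", "d", "e", "a", "c", "f"),
  X1.8   = c(-1, 33, 3, -1, 5, 7, -1, 5, -1),
  X1.8.1 = c(1, 0, 10, -1, 8, 6, 7, 20, -1),
  X1.8.2 = c(7, 2, -1, 4, 5, 12, 7, 9, -1)
)

res <- aggregate(example[-1], by = list(name = example$name),
                 FUN = sum_or_sentinel)
res
```

Group a keeps -1 in X1.8 because all three of its values there are -1, and group f keeps -1 in every column.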

Is there any R code to repeat a same value for multiple rows?

My df looks like this now.
A B C
1 3 .
1 6 .
1 9 .
2 1 .
2 2 .
2 5 .
3 9 .
3 3 .
3 2 .
Below is the ideal data frame I am trying to create.
Variable A identifies individuals (user ID). Each individual has three rows.
Each individual has different values for variable B,
whereas they need to have the same value of variable C.
How can I give each individual a single value of variable C that is repeated on all three of their rows?
A B C
1 3 1
1 6 1
1 9 1
2 1 3
2 2 3
2 5 3
3 9 8
3 3 8
3 2 8
We can just use rep in base R, as the number of repeats is already known to be 3
df$C <- rep(c(1, 3, 8), each = 3)
df
# A B C
#1 1 3 1
#2 1 6 1
#3 1 9 1
#4 2 1 3
#5 2 2 3
#6 2 5 3
#7 3 9 8
#8 3 3 8
#9 3 2 8
Another option is to use 'A' as an integer index, which also works when the groups have unequal lengths
df$C <- c(1, 3, 8)[df$A]
If the values in 'A' are not in sequence or are not numeric, use a named vector for the lookup
df$C <- setNames(c(1, 3, 8), unique(df$A))[as.character(df$A)]
data
df <- data.frame(A = rep(1:3, each = 3), B = c(3, 6, 9, 1, 2, 5, 9, 3, 2))
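To illustrate the named-vector lookup on IDs that are neither numeric nor sorted, here is a small sketch with made-up user IDs and unequal group sizes. The IDs and the values of B are invented for illustration only:

```r
# Made-up, non-numeric IDs with 3, 2 and 4 rows respectively
df <- data.frame(
  A = rep(c("u10", "u25", "u07"), times = c(3, 2, 4)),
  B = c(3, 6, 9, 1, 2, 9, 3, 2, 5)
)

# One value of C per ID, stored under the ID as the element name
lookup <- c(u10 = 1, u25 = 3, u07 = 8)

# Index the lookup by name; unname() drops the names carried over from lookup
df$C <- unname(lookup[df$A])
df
```

Indexing by name does not depend on row order or on how many rows each ID has.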
You could use an assignment matrix and match it with your A column.
am <- matrix(c(1, 1,
2, 3,
3, 8), byrow=TRUE, ncol=2)
dat$C <- am[match(dat$A, am[,1]), 2]
dat
# A B C
# 1 1 3 1
# 2 1 6 1
# 3 1 9 1
# 4 2 1 3
# 5 2 2 3
# 6 2 5 3
# 7 3 9 8
# 8 3 3 8
# 9 3 2 8
Data:
dat <- structure(list(A = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L), B = c(3L,
6L, 9L, 1L, 2L, 5L, 9L, 3L, 2L)), row.names = c(NA, -9L), class = "data.frame")
The solution by @akrun is the most efficient so far. Here is another base R solution, which also handles groups in df$A of unequal size:
v <- c(1,3,8)
df <- do.call(rbind,lapply(seq_along(v), function(k) cbind(split(df,df$A)[[k]],C=v[k])))
such that
> df
A B C
1 1 3 1
2 1 6 1
3 1 9 1
4 2 1 3
5 2 2 3
6 2 5 3
7 3 9 8
8 3 3 8
9 3 2 8
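If the value of C per individual already lives in its own small table, a join is another way to express the same thing. The sketch below uses base R merge() with a hypothetical key table and groups of unequal size:

```r
# Groups of unequal size (3, 2 and 1 rows)
df <- data.frame(A = c(1, 1, 1, 2, 2, 3),
                 B = c(3, 6, 9, 1, 2, 9))

# Hypothetical key table: one row per individual
key <- data.frame(A = c(1, 2, 3),
                  C = c(1, 3, 8))

# Every row of df picks up the C value of its individual
res <- merge(df, key, by = "A")
res
```

merge() sorts the result by the key column by default, so this fits best when the original row order does not matter.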

Conditional statement within group

I have a data frame in which I want to create a new column whose values depend on a condition within each group. For the data frame below, I want a new column n_actions that gives
Cond 1. the number 2 for the whole group (column group) if a 6 appears in column step
Cond 2. the number 3 for the whole group if a 9 appears in column step
Cond 3. the number 1 for the whole group if neither a 6 nor a 9 appears in column step
#dataframe start
dataframe <- data.frame(group = c("A", "A", "A", "B", "B", "B", "B", "B", "B", "C", "C", "C", "D", "D", "D", "D", "D", "D", "D", "D", "D"),
step = c(1, 2, 3, 1, 2, 3, 4, 5, 6, 1, 2, 3, 1, 2, 3, 4, 5, 6, 7, 8, 9))
# dataframe desired
dataframe$n_actions <- c(rep(1, 3), rep(2, 6), rep(1, 3), rep(3, 9))
Try out:
library(dplyr)
dataframe %>%
group_by(group) %>%
mutate(n_actions = ifelse(9 %in% step, 3,
ifelse(6 %in% step, 2, 1)))
# A tibble: 21 x 3
# Groups: group [4]
group step n_actions
<fctr> <dbl> <dbl>
1 A 1 1
2 A 2 1
3 A 3 1
4 B 1 2
5 B 2 2
6 B 3 2
7 B 4 2
8 B 5 2
9 B 6 2
10 C 1 1
# ... with 11 more rows
Another way with dplyr's case_when:
library(dplyr)
dataframe %>%
group_by(group) %>%
mutate(
n_actions = case_when(
9 %in% step ~ 3,
6 %in% step ~ 2,
TRUE ~ 1
)
)
Output:
# A tibble: 21 x 3
# Groups: group [4]
group step n_actions
<fct> <dbl> <dbl>
1 A 1 1
2 A 2 1
3 A 3 1
4 B 1 2
5 B 2 2
6 B 3 2
7 B 4 2
8 B 5 2
9 B 6 2
10 C 1 1
11 C 2 1
12 C 3 1
13 D 1 3
14 D 2 3
15 D 3 3
16 D 4 3
17 D 5 3
18 D 6 3
19 D 7 3
20 D 8 3
21 D 9 3
For this particular data you could also integer-divide (%/%) the maximum step per group by 3; this relies on the group maxima being 3, 6 and 9.
dataframe <- transform(dataframe,
n_actions2 = ave(step, group, FUN = function(x) max(x) %/% 3))
dataframe
# group step n_actions n_actions2
#1 A 1 1 1
#2 A 2 1 1
#3 A 3 1 1
#4 B 1 2 2
#5 B 2 2 2
#6 B 3 2 2
#7 B 4 2 2
#8 B 5 2 2
#9 B 6 2 2
#10 C 1 1 1
#11 C 2 1 1
#12 C 3 1 1
#13 D 1 3 3
#14 D 2 3 3
#15 D 3 3 3
#16 D 4 3 3
#17 D 5 3 3
#18 D 6 3 3
#19 D 7 3 3
#20 D 8 3 3
#21 D 9 3 3
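The integer-division shortcut depends on the step values in this data. A base R sketch that states the three conditions explicitly, again using ave, could look like the following; the function name classify is hypothetical:

```r
dataframe <- data.frame(
  group = rep(c("A", "B", "C", "D"), times = c(3, 6, 3, 9)),
  step  = as.numeric(c(1:3, 1:6, 1:3, 1:9))
)

# One value per group: 3 if a 9 occurs, else 2 if a 6 occurs, else 1
classify <- function(s) if (9 %in% s) 3 else if (6 %in% s) 2 else 1

# ave() applies classify per group and repeats the result on each row
dataframe$n_actions <- ave(dataframe$step, dataframe$group, FUN = classify)
dataframe
```

As in the dplyr answers, the check for 9 comes first, so a group containing both a 6 and a 9 gets 3.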
