Add missing subtotals to each group using dplyr

I need to add a new row to each id group where key = "n" and the value is the total minus (a + b).
x <- data_frame(
  id = c(1, 1, 1, 2, 2, 2, 2),
  key = c("a", "b", "total", "a", "x", "b", "total"),
  value = c(1, 2, 10, 4, 1, 3, 12)
)
# A tibble: 7 × 3
id key value
<dbl> <chr> <dbl>
1 1 a 1
2 1 b 2
3 1 total 10
4 2 a 4
5 2 x 1
6 2 b 3
7 2 total 12
In this example, the new rows should be
1 n 7
2 n 5
I tried getting the a + b subtotal and joining it to the total to get the difference, but after chaining nine dplyr verbs I seem to be going in the wrong direction. Thanks.

This isn't a join, it's just binding new rows on:
x %>%
  group_by(id) %>%
  summarize(
    value = sum(value[key == 'total']) - sum(value[key %in% c('a', 'b')]),
    key = 'n'
  ) %>%
  bind_rows(x) %>%
  select(id, key, value) %>%  # back to the original column order
  arrange(id, key)            # and a sensible row order
# # A tibble: 9 × 3
# id key value
# <dbl> <chr> <dbl>
# 1 1 a 1
# 2 1 b 2
# 3 1 n 7
# 4 1 total 10
# 5 2 a 4
# 6 2 b 3
# 7 2 n 5
# 8 2 total 12
# 9 2 x 1
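If you prefer to compute the subtotal from a wide layout, here is a sketch of the same idea using tidyr's pivot_wider (assumes tidyr >= 1.0; coalesce() guards against an id that lacks an "a" or "b" row):
library(dplyr)
library(tidyr)
x %>%
  pivot_wider(names_from = key, values_from = value) %>%  # one row per id
  transmute(id, key = "n",
            value = total - coalesce(a, 0) - coalesce(b, 0)) %>%
  bind_rows(x) %>%
  arrange(id, key)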

Here's a way using data.table, binding rows as in Gregor's answer:
library(data.table)
setDT(x)
dcast(x, id ~ key)[, .(id, key = "n", value = total - a - b)][, rbind(.SD, x)][order(id)]
id key value
1: 1 n 7
2: 1 a 1
3: 1 b 2
4: 1 total 10
5: 2 n 5
6: 2 a 4
7: 2 x 1
8: 2 b 3
9: 2 total 12
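The reshape isn't strictly necessary on the data.table side either; a minimal sketch that computes the subtotal directly by group (assumes every id has exactly one "total" row, and that x is already a data.table after setDT above):
subtotals <- x[, .(key = "n",
                   value = value[key == "total"] - sum(value[key %in% c("a", "b")])),
               by = id]
rbind(x, subtotals)[order(id, key)]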

Spread one column into multiple columns

I have one column, "m", that contains multiple values associated with one subject (ID). I need to spread the values in this column into 5 different columns to obtain the second table provided below. I also need to assign names to those columns.
df <- read.table(header = TRUE, text = "
Scale ID m
1 1 1 0.4089795
2 1 1 0.001041055
3 1 1 0.1843616
4 1 1 0.03398921
5 1 1 FALSE
6 3 1 0.1179424
7 3 1 0.3569155
8 3 1 0.2006204
9 3 1 0.04024855
10 3 1 FALSE
")
Here's what the output should look like
ID Scale x y z a b
1 1 1 0.4089795 0.001041055 0.1843616 0.03398921 FALSE
2 1 3 0.1179424 0.356915500 0.2006204 0.04024855 FALSE
Thanks for any help!
library(tidyverse)
df %>%
  group_by(Scale, ID) %>%                         # for each combination of Scale and ID
  mutate(names = c("x", "y", "z", "a", "b")) %>%  # add the future column names
  ungroup() %>%                                   # forget the grouping
  spread(names, m) %>%                            # reshape: names become columns, m the values
  select(Scale, ID, x, y, z, a, b)                # order columns
# # A tibble: 2 x 7
# Scale ID x y z a b
# <int> <int> <fct> <fct> <fct> <fct> <fct>
# 1 1 1 0.4089795 0.001041055 0.1843616 0.03398921 FALSE
# 2 3 1 0.1179424 0.3569155 0.2006204 0.04024855 FALSE
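spread() is superseded in current tidyr; a sketch of the same reshape with pivot_wider (tidyr >= 1.0, already loaded via tidyverse above), which replaces spread's key/value pair with names_from/values_from:
df %>%
  group_by(Scale, ID) %>%
  mutate(names = c("x", "y", "z", "a", "b")) %>%
  ungroup() %>%
  pivot_wider(names_from = names, values_from = m) %>%
  select(Scale, ID, x, y, z, a, b)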

Delete a row where the order is wrong within a group

I have data with about 1000 groups; each group is ordered from 1-100 (the values can be any numbers within 100).
As I was looking through the data, I found that some groups had bad orders, i.e., the values would climb toward 100 and then suddenly a 24 would show up.
How can I delete all of these error rows?
As you can see from the example (before -> after), I would like to find all rows that don't follow the increasing order within the group and just delete them.
Any help would be great!
lag gives the previous value within the group, so diff = order - lag(order) is the difference between the current and the previous value; the filter then keeps only non-negative differences, i.e. rows whose value is at least the previous one. The order == min(order) condition keeps the first row of each group, whose diff is NA because lag has no previous value there. I keep the helper column diff so you can check the result, but you can drop it with %>% select(-diff).
library(dplyr)
df1 %>%
  group_by(gruop) %>%
  mutate(diff = order - lag(order)) %>%
  filter(diff >= 0 | order == min(order))
# A tibble: 8 x 3
# Groups: gruop [2]
gruop order diff
<int> <int> <int>
1 1 1 NA
2 1 3 2
3 1 5 2
4 1 10 5
5 2 1 NA
6 2 4 3
7 2 4 0
8 2 8 4
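Note that the lag() comparison only looks at the immediately preceding row, so after a dip it can keep values that are still below an earlier maximum (e.g. 1, 5, 2, 3 keeps the 3). If "bad order" means anything below the running maximum, here is a sketch of a stricter alternative, using the same df1 defined under Data below:
df1 %>%
  group_by(gruop) %>%
  filter(order >= cummax(order)) %>%  # keep rows that set or tie the running maximum
  ungroup()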
Data
df1 <- read.table(text="
gruop order
1 1
1 3
1 5
1 10
1 2
2 1
2 4
2 4
2 8
2 3
",header=T, stringsAsFactors = F)
Assuming the order column increments by 1 every time, we can use ave to remove, by group, the rows whose difference from the previous row is not 1.
df[ave(df$order, df$group, FUN = function(x) c(1, diff(x))) == 1, ]
# group order
#1 1 1
#2 1 2
#3 1 3
#4 1 4
#6 2 1
#7 2 2
#8 2 3
#9 2 4
EDIT
For the updated example, we can just change the comparison
df[ave(df$order, df$group, FUN = function(x) c(1, diff(x))) >= 0, ]
Playing with data.table:
library(data.table)
setDT(df1)[, diffo := c(1, diff(order)), group][diffo == 1, .(group, order)]
group order
1: 1 1
2: 1 2
3: 1 3
4: 1 4
5: 2 1
6: 2 2
7: 2 3
8: 2 4
Where df1 is:
df1 <- data.frame(
group = rep(1:2, each = 5),
order = c(1:4, 2, 1:4, 3)
)
EDIT
If you only need increasing order, and not steps of one then you can do:
df3 <- transform(df1, order = c(1,3,5,10,2,1,4,7,9,3))
setDT(df3)[, diffo := c(1, diff(order)), group][diffo >= 1, .(group, order)]
group order
1: 1 1
2: 1 3
3: 1 5
4: 1 10
5: 2 1
6: 2 4
7: 2 7
8: 2 9

How to merge subsequent values of a specific column in a table in R

I want to merge repeated rows, but only those with specific content.
Let's say I have the following data frame
df:
user action
1 A
1 A
1 B
1 B
2 A
2 C
2 C
2 A
2 A
I want to merge only consecutive "A" actions,
so the result would be:
user action
1 A
1 B
1 B
2 A
2 C
2 C
2 A
How can I do it in R?
Thanks!
As long as there are no other conditions to match, this will work:
library(magrittr)
library(dplyr)
Start by creating a dummy column that tells us whether each action immediately repeats the prior one:
> df %>% group_by(user) %>%
    mutate(condition = paste0(action, lag(action) == action))
# A tibble: 9 x 3
# Groups: user [2]
user action condition
<fct> <fct> <chr>
1 1 A ANA
2 1 A ATRUE
3 1 B BFALSE
4 1 B BTRUE
5 2 A ANA
6 2 C CFALSE
7 2 C CTRUE
8 2 A AFALSE
9 2 A ATRUE
Then you can filter out the rows within each user where A follows another A:
> df %>% group_by(user) %>%
    mutate(condition = paste0(action, lag(action) == action)) %>%
    filter(condition != "ATRUE")
# A tibble: 7 x 3
# Groups: user [2]
user action condition
<fct> <fct> <chr>
1 1 A ANA
2 1 B BFALSE
3 1 B BTRUE
4 2 A ANA
5 2 C CFALSE
6 2 C CTRUE
7 2 A AFALSE
You don't even have to keep the dummy column, because you can filter out the rows matching "ATRUE" and then select the two variables you care about:
> df %>% group_by(user) %>%
    mutate(condition = paste0(action, lag(action) == action)) %>%
    filter(condition != "ATRUE") %>% select(user, action)
# A tibble: 7 x 2
# Groups: user [2]
user action
<fct> <fct>
1 1 A
2 1 B
3 1 B
4 2 A
5 2 C
6 2 C
7 2 A
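A more compact sketch of the same filter, skipping the helper column entirely (the %in% comparison treats the leading NA from lag() as FALSE, so the first row of each group survives):
df %>%
  group_by(user) %>%
  filter(!(action == "A" & lag(action) %in% "A")) %>%
  ungroup()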

Eliminate factors contributing less

There are hundreds of levels in a column, and not all of them really add value: roughly 60% of the levels occur only rarely in the dataframe and are not expected to influence the outcome. The objective is to eliminate the levels that fall outside the top 80% of rows.
Could someone help? Thanks in advance
Here is a simple process that spots the values falling outside the top 80% of the dataset (rows) and groups them together under a new value. This process uses a character column rather than a factor column.
library(dplyr)
# example dataset
dt = data.frame(type = c("A", "A", "A", "B", "B", "B", "c", "D"),
                value = 1:8, stringsAsFactors = F)
dt
# type value
# 1 A 1
# 2 A 2
# 3 A 3
# 4 B 4
# 5 B 5
# 6 B 6
# 7 c 7
# 8 D 8
# count number of rows for each type
dt %>% count(type)
# # A tibble: 4 x 2
# type n
# <chr> <int>
# 1 A 3
# 2 B 3
# 3 c 1
# 4 D 1
# add cumulative percentages
dt %>%
  count(type) %>%
  arrange(desc(n)) %>%  # most frequent types first, so the cumulative sum is meaningful
  mutate(Prc = n / sum(n),
         CumPrc = cumsum(Prc))
# # A tibble: 4 x 4
# type n Prc CumPrc
# <chr> <int> <dbl> <dbl>
# 1 A 3 0.375 0.375
# 2 B 3 0.375 0.750
# 3 c 1 0.125 0.875
# 4 D 1 0.125 1.000
# pick the types you want to group together
dt %>%
  count(type) %>%
  arrange(desc(n)) %>%
  mutate(Prc = n / sum(n),
         CumPrc = cumsum(Prc)) %>%
  filter(CumPrc > 0.80) %>%
  pull(type) -> types_to_group
# group them
dt %>% mutate(type_upd = ifelse(type %in% types_to_group, "Rest", type))
# type value type_upd
# 1 A 1 A
# 2 A 2 A
# 3 A 3 A
# 4 B 4 B
# 5 B 5 B
# 6 B 6 B
# 7 c 7 Rest
# 8 D 8 Rest
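If the column really is a factor, the forcats package offers a one-liner for this pattern; a sketch using fct_lump_prop (assumes forcats >= 0.5.0; note it lumps levels below a per-level frequency threshold rather than by cumulative percentage):
library(forcats)
# "c" and "D" (12.5% of rows each) fall below the 20% threshold and become "Rest"
dt$type_upd <- fct_lump_prop(factor(dt$type), prop = 0.2, other_level = "Rest")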

Retain rows up to the first occurrence of a value in a column, by group; groups without the value allowed

I have a data frame like this one:
> df
id type
1 1 a
2 1 a
3 1 b
4 1 a
5 1 b
6 2 a
7 2 a
8 2 b
9 3 a
10 3 a
I want to keep all rows for each group (id) up to the first occurrence of value 'b' in the type column. For groups without type 'b', I want to keep all their rows.
The resulting data frame should look like this:
> dfnew
id type
1 1 a
2 1 a
3 1 b
4 2 a
5 2 a
6 2 b
7 3 a
8 3 a
I tried the following code, but it retains additional rows that have the value 'a' beyond the first occurrence of 'b' and only excludes additional occurrences of 'b', which is not what I want. Look at row 4 in the following; I want to get rid of it.
> df %>% group_by(id) %>% filter(cumsum(type == 'b') <= 1)
Source: local data frame [9 x 2]
Groups: id
id type
1 1 a
2 1 a
3 1 b
4 1 a
5 2 a
6 2 a
7 2 b
8 3 a
9 3 a
You could combine match or which with slice, or (as mentioned by @Richard) use which.max:
library(dplyr)
df %>%
group_by(id) %>%
slice(if(any(type == "b")) 1:which.max(type == "b") else row_number())
# Source: local data table [8 x 2]
# Groups: id
#
# id type
# 1 1 a
# 2 1 a
# 3 1 b
# 4 2 a
# 5 2 a
# 6 2 b
# 7 3 a
# 8 3 a
Or you could try it with data.table
library(data.table)
setDT(df)[, if(any(type == "b")) .SD[1:which.max(type == "b")] else .SD, by = id]
# id type
# 1: 1 a
# 2: 1 a
# 3: 1 b
# 4: 2 a
# 5: 2 a
# 6: 2 b
# 7: 3 a
# 8: 3 a
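With current dplyr, the if/else inside slice() can be avoided; here is a sketch that filters on the cumulative count of lagged 'b's (the lag makes the first 'b' itself survive the filter, and groups without any 'b' keep all their rows):
df %>%
  group_by(id) %>%
  filter(cumsum(lag(type == "b", default = FALSE)) < 1) %>%
  ungroup()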
