Divide (and name) one group of columns by another group in dplyr - r

After a (very scary) dplyr pipeline I've ended up with a dataset like this:
year A B C [....] Z count.A count.B count.C [....] count.Z
1999 10 20 10 ... 6 3 5 67 ... 6
2000 3 5 5 ... 7 5 2 5 ... 5
Some example data to reproduce:
df <- data.frame(year = c(1999, 2000),
A = c(10, 20),
B = c(3, 6),
C = c(1, 2),
count.A = c(1, 2),
count.B = c(8, 9),
count.C = c(5, 7))
What I really need is to combine each column with its "count" counterpart i.e.
weight.A = A / count.A,
weight.B = B / count.B
I have to do that programmatically, as I have hundreds of columns. Is there a way to do that in a dplyr pipeline?

Don't store variables in column names. If you reshape your data to make it tidy, the calculation is really simple:
library(tidyverse)
df %>% gather(var, val, -year) %>% # reshape to long
separate(var, c('var', 'letter'), fill = 'left') %>% # extract var from former col names
mutate(var = coalesce(var, 'value')) %>% # add name for unnamed var
spread(var, val) %>% # reshape back to wide
mutate(weight = value / count) # now this is very simple
#> year letter count value weight
#> 1 1999 A 1 10 10.0000000
#> 2 1999 B 8 3 0.3750000
#> 3 1999 C 5 1 0.2000000
#> 4 2000 A 2 20 10.0000000
#> 5 2000 B 9 6 0.6666667
#> 6 2000 C 7 2 0.2857143

If your columns are consistently named (and easy enough to retrieve) you could easily do this using an lapply:
cols <- c("A","B","C")
df[,paste0("weighted.",cols)] <- lapply(cols, function(x) df[,x] / df[, paste0("count.",x)])
# year A B C count.A count.B count.C weighted.A weighted.B weighted.C
#1 1999 10 3 1 1 8 5 10 0.3750000 0.2000000
#2 2000 20 6 2 2 9 7 10 0.6666667 0.2857143
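With hundreds of columns, the vector of base names can be derived from the count.* columns instead of being typed out. The following is a minimal base R sketch, assuming every count.X column has a matching X column. The helper names count_cols and cols are introduced here for illustration, and the weight. prefix follows the question.

```r
df <- data.frame(year = c(1999, 2000),
                 A = c(10, 20), B = c(3, 6), C = c(1, 2),
                 count.A = c(1, 2), count.B = c(8, 9), count.C = c(5, 7))

# Find the count.* columns and strip the prefix to get the base names
count_cols <- grep("^count\\.", names(df), value = TRUE)
cols <- sub("^count\\.", "", count_cols)

# Assumption: each count.X has a counterpart X; stop early otherwise
stopifnot(all(cols %in% names(df)))

# Divide each base column by its count counterpart, pairwise
df[paste0("weight.", cols)] <- unname(Map(`/`, df[cols], df[count_cols]))
df
```

Because the pairing is built from the names rather than from column positions, this should not depend on the column order.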

Assuming that the columns are in order, we can use data.table. Specify the columns of interest in .SDcols, divide the first half of .SD (the Subset of Data) by the other half, and assign (:=) the result to new columns:
library(data.table)
setDT(df)[, paste0("weighted.", names(df)[2:4]) := .SD[, 1:3] / .SD[, 4:6], .SDcols = A:count.C]
df
# year A B C count.A count.B count.C weighted.A weighted.B weighted.C
#1: 1999 10 3 1 1 8 5 10 0.3750000 0.2000000
#2: 2000 20 6 2 2 9 7 10 0.6666667 0.2857143

Assuming you can programmatically create a vector of all column names, here is how I'd do it for your example above:
for (c.name in c("A", "B", "C")) {
c.weight <- sprintf("weight.%s", c.name)
c.count <- sprintf("count.%s", c.name)
df[,c.weight] <- df[,c.name] / df[,c.count]
}

Related

How can I remove rows with the same value in 2 or more rows in R

I have a dataframe in the following format with ID's and A/B's. The dataframe is very long, over 3000 ID's.
id type
1 A
2 B
3 A
4 A
5 B
6 A
7 B
8 A
9 B
10 A
11 A
12 A
13 B
... ...
I need to remove all rows (A and B) wherever two or more A's follow one another. So I don't want to simply remove the duplicates: if there is a run of 2 or more A's, I want to remove all of those A's and the B's up to the next A.
id type
1 A
2 B
6 A
7 B
8 A
9 B
... ...
Do I need a loop for this problem? Any help is appreciated, thank you!
This might be what you want:
First, define a function that notes the indices of what you want to remove:
row_sequence <- function(value) {
inds <- which(value == lead(value))
sort(unique(c(inds, inds + 1, inds +2)))
}
Apply the function to your dataframe by first extracting the rows that you want to remove into df1 and second anti_joining df1 with df to obtain the final dataframe:
library(dplyr)
df1 <- df %>% slice(row_sequence(type))
df2 <- df %>%
anti_join(., df1)
Result:
df2
id type
1 1 A
2 2 B
3 6 A
4 7 B
5 8 A
6 9 B
Data:
df <- data.frame(
id = 1:13,
type = c("A","B","A","A","B","A","B","A","B","A","A","A","B")
)
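The index-based function above removes exactly one row after each pair of repeated values. If several B's can follow a run of A's, a run-length formulation may be easier to reason about. This is a base R sketch under that assumption; run_id, bad_run, after_bad and result are names introduced here.

```r
df <- data.frame(
  id = 1:13,
  type = c("A", "B", "A", "A", "B", "A", "B", "A", "B", "A", "A", "A", "B")
)

# Run-length encode the type column (as.character guards against factors)
r <- rle(as.character(df$type))
run_id <- rep(seq_along(r$lengths), r$lengths)  # run number for each row

# Runs of two or more A's are removed ...
bad_run <- r$values == "A" & r$lengths > 1
# ... together with the B run that directly follows such a run
after_bad <- c(FALSE, head(bad_run, -1)) & r$values == "B"

result <- df[!(bad_run | after_bad)[run_id], ]
result
```

No explicit loop is needed, since rle() and logical indexing work on whole vectors.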
I assumed there is only one B after each series of duplicated A values; if that is not the case, let me know and I will modify my code:
library(dplyr)
library(tidyr)
library(data.table)
df %>%
mutate(rles = data.table::rleid(type)) %>%
group_by(rles) %>%
mutate(rles = ifelse(length(rles) > 1, NA, rles)) %>%
ungroup() %>%
mutate(rles = ifelse(!is.na(rles) & is.na(lag(rles)) & type == "B", NA, rles)) %>%
drop_na() %>%
select(-rles)
# A tibble: 6 x 2
id type
<int> <chr>
1 1 A
2 2 B
3 6 A
4 7 B
5 8 A
6 9 B
Data
df <- read.table(header = TRUE, text = "
id type
1 A
2 B
3 A
4 A
5 B
6 A
7 B
8 A
9 B
10 A
11 A
12 A
13 B")

Create frequency data frame and transfer columns from old data frame

I am using the map function to create frequency tables from a list of data frames. I would like to import the name column from the original data frame. For example, when I enter df_freq$C I want to see three columns, value, n, and name. For the name column I want all values equal to "C".
# load packages and define variables
rm(list = ls())
library(purrr)
library(dplyr)
## load data
df_raw <- data.frame(name = c("C", "A", "B", "A", "B", "C"),
start = c(2, 1, 3, 4, 5, 2),
end = c(7, 6, 7, 8, 10, 9))
df <- df_raw %>%
split(.$name) %>% # split data by name
imap(function(x, x_name) {
data.frame(value = Map(seq.int, x$start, x$end) %>% unlist,
name = x_name) })
## create frequency plot with name column
df_freq <- df %>%
map(., ~count(.x, value))
It can be done more directly in the tidyverse. Make the data frame rowwise, then use transmute to return the name and a list holding the sequence from 'start' to 'end' for each row, unnest the list column, and do the count:
library(dplyr)
library(tidyr)
df_raw %>%
rowwise %>%
transmute(name, value = list(start:end)) %>%
unnest(c(value)) %>%
count(name, value)
-output
# A tibble: 24 x 3
# name value n
# <chr> <int> <int>
# 1 A 1 1
# 2 A 2 1
# 3 A 3 1
# 4 A 4 2
# 5 A 5 2
# 6 A 6 2
# 7 A 7 1
# 8 A 8 1
# 9 B 3 1
#10 B 4 1
# … with 14 more rows
Or instead of rowwise, may use map2
library(purrr)
df_raw %>%
transmute(name, value = map2(start, end, `:`)) %>%
unnest(c(value)) %>%
count(name, value)
In the OP's code, the count needs the name column as well
df %>%
map(., ~count(.x, name, value))
Here is a data.table option, starting from the original df_raw:
library(data.table)
setDT(df_raw)[, .(value = unlist(Map(seq, start, end)), n = 1), .(name)][, .(n = sum(n)), by = .(name, value)]
which gives
name value n
1: C 2 2
2: C 3 2
3: C 4 2
4: C 5 2
5: C 6 2
6: C 7 2
7: C 8 1
8: C 9 1
9: A 1 1
10: A 2 1
11: A 3 1
12: A 4 2
13: A 5 2
14: A 6 2
15: A 7 1
16: A 8 1
17: B 3 1
18: B 4 1
19: B 5 2
20: B 6 2
21: B 7 2
22: B 8 1
23: B 9 1
24: B 10 1
name value n
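For comparison, the same frequency table can be sketched without any packages. This assumes the df_raw from the question; expanded and freq are names introduced here, and the row order of the result may differ from the outputs above.

```r
df_raw <- data.frame(name = c("C", "A", "B", "A", "B", "C"),
                     start = c(2, 1, 3, 4, 5, 2),
                     end = c(7, 6, 7, 8, 10, 9))

# Expand every row into one row per integer between start and end
expanded <- do.call(rbind, unname(Map(
  function(nm, s, e) data.frame(name = nm, value = s:e),
  as.character(df_raw$name), df_raw$start, df_raw$end
)))

# Count how often each (name, value) pair occurs
freq <- aggregate(n ~ name + value, data = transform(expanded, n = 1L), FUN = sum)
freq
```

Splitting freq by name with split(freq, freq$name) would give one data frame per name, each carrying its own name column.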

How to lag a specific column of a data frame in R

Input
(Say d is the data frame below.)
a b c
1 5 7
2 6 8
3 7 9
I want to shift the contents of column b one position down and put an arbitrary number in the first position in b. How do I do this? I would appreciate any help in this regard. Thank you.
I tried c(6,tail(d["b"],-1)) but it does not produce (6,5,6).
Output
a b c
1 6 7
2 5 8
3 6 9
Use head instead
df$b <- c(6, head(df$b, -1))
# a b c
#1 1 6 7
#2 2 5 8
#3 3 6 9
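The head idea generalizes to shifts of more than one position. A small sketch follows; shift_down is a hypothetical helper written for this example, not a function from any package.

```r
# Shift x down by n positions, filling the first n slots with `fill`
shift_down <- function(x, n = 1, fill = NA) {
  stopifnot(n >= 0, n <= length(x))
  c(rep(fill, n), head(x, length(x) - n))
}

d <- data.frame(a = c(1, 2, 3), b = c(5, 6, 7), c = c(7, 8, 9))
d$b <- shift_down(d$b, n = 1, fill = 6)
d
```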
You could also use lag in dplyr
library(dplyr)
df %>% mutate(b = lag(b, default = 6))
Or shift in data.table
library(data.table)
setDT(df)[, b:= shift(b, fill = 6)]
A dplyr solution uses lag with an explicit default argument, if you prefer:
library(dplyr)
d <- tibble(a = 1:3, b = 5:7, c = 7:9)
d %>% mutate(b = lag(b, default = 6))
#> # A tibble: 3 x 3
#> a b c
#> <int> <dbl> <int>
#> 1 1 6 7
#> 2 2 5 8
#> 3 3 6 9
Created on 2019-12-05 by the reprex package (v0.3.0)
Here is a solution similar to the head approach by @Ronak Shah
df <- within(df, b <- c(runif(1), head(b, -1)))
where a uniformly distributed random number is placed in the first position of column b:
> df
a b c
1 1 0.6644704 7
2 2 5.0000000 8
3 3 6.0000000 9
The solution below works for any lag offset by changing n (use dplyr::lead for a lead):
library(dplyr)
d <- data.frame(a=c(1,2,3),b=c(5,6,7),c=c(7,8,9))
d1 <- d %>%
mutate(b1 = dplyr::lag(b, n = 1, default = NA)) # shift b down by n rows; the first n values become `default`

use value from a col to choose value from another col, put into new df in R

I have a df like this
name <- c("Fred","Mark","Jen","Simon","Ed")
a_or_b <- c("a","a","b","a","b")
abc_ah_one <- c(3,5,2,4,7)
abc_bh_one <- c(5,4,1,9,8)
abc_ah_two <- c(2,1,3,7,6)
abc_bh_two <- c(3,6,8,8,5)
abc_ah_three <- c(5,4,7,6,2)
abc_bh_three <- c(9,7,2,1,4)
def_ah_one <- c(1,3,9,2,7)
def_bh_one <- c(2,8,4,6,1)
def_ah_two <- c(4,7,3,2,5)
def_bh_two <- c(5,2,9,8,3)
def_ah_three <- c(8,5,3,5,2)
def_bh_three <- c(2,7,4,3,0)
df <- data.frame(name,a_or_b,abc_ah_one,abc_bh_one,abc_ah_two,abc_bh_two,
abc_ah_three,abc_bh_three,def_ah_one,def_bh_one,
def_ah_two,def_bh_two,def_ah_three,def_bh_three)
I want to use the value in column "a_or_b" to choose the values in each of the corresponding "ah/bh" columns for each "abc" (one, two, and three), and put it into a new data frame. For example, Fred would have the values 3, 2 and 5 in his row in the new df. Those values represent the values of each of his "ah" categories for the abc columns. Jen, who has "b" in her a_or_b column, would have all of her "bh" values from her abc columns for her row in the new df. Here is what my desired output would look like:
combo_one <- c(3,5,1,4,8)
combo_two <- c(2,1,8,7,5)
combo_three <- c(5,4,2,6,4)
df2 <- data.frame(name,a_or_b,combo_one,combo_two,combo_three)
I've attempted this using sapply. The following gives me a matrix of the correct column indexes of df[grep("abc",colnames(df),fixed=TRUE)] for each row:
sapply(paste0(df$a_or_b,"h"),grep,colnames(df[grep("abc",colnames(df),fixed=TRUE)]))
First we gather your data into a tidy long format, then break out the columns into something useful. After that the filtering is simple, and if necessary we can convert back to a difficult wide format:
library(dplyr)
library(tidyr)
gather(df, key = "var", value = "val", -name, -a_or_b) %>%
separate(var, into = c("combo", "h", "ind"), sep = "_") %>%
mutate(h = substr(h, 1, 1)) %>%
filter(a_or_b == h, combo == "abc") %>%
arrange(name) -> result_long
result_long
# name a_or_b combo h ind val
# 1 Ed b abc b one 8
# 2 Ed b abc b two 5
# 3 Ed b abc b three 4
# 4 Fred a abc a one 3
# 5 Fred a abc a two 2
# 6 Fred a abc a three 5
# 7 Jen b abc b one 1
# 8 Jen b abc b two 8
# 9 Jen b abc b three 2
# 10 Mark a abc a one 5
# 11 Mark a abc a two 1
# 12 Mark a abc a three 4
# 13 Simon a abc a one 4
# 14 Simon a abc a two 7
# 15 Simon a abc a three 6
spread(result_long, key = ind, value = val) %>%
select(name, a_or_b, one, two, three)
# name a_or_b one two three
# 1 Ed b 8 5 4
# 2 Fred a 3 2 5
# 3 Jen b 1 8 2
# 4 Mark a 5 1 4
# 5 Simon a 4 7 6
A base R approach would be to use lapply: we loop through each row of the data frame, build the pattern for the matching columns with paste0 based on the a_or_b column, and then rbind the values for all rows together.
new_df <- do.call("rbind", lapply(seq(nrow(df)), function(x)
setNames(df[x, grepl(paste0("abc_",df[x,"a_or_b"], "h"), colnames(df))],
c("combo_one", "combo_two", "combo_three"))))
new_df
# combo_one combo_two combo_three
#1 3 2 5
#2 5 1 4
#3 1 8 2
#4 4 7 6
#5 8 5 4
We can cbind the required columns then :
cbind(df[c(1, 2)], new_df)
# name a_or_b combo_one combo_two combo_three
#1 Fred a 3 2 5
#2 Mark a 5 1 4
#3 Jen b 1 8 2
#4 Simon a 4 7 6
#5 Ed b 8 5 4
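The row-by-row loop can also be avoided by choosing between the "ah" and "bh" column for each suffix with a vectorized ifelse. This is a sketch assuming a_or_b only ever contains "a" or "b"; suffixes and picked are names introduced here.

```r
df <- data.frame(
  name = c("Fred", "Mark", "Jen", "Simon", "Ed"),
  a_or_b = c("a", "a", "b", "a", "b"),
  abc_ah_one = c(3, 5, 2, 4, 7),   abc_bh_one = c(5, 4, 1, 9, 8),
  abc_ah_two = c(2, 1, 3, 7, 6),   abc_bh_two = c(3, 6, 8, 8, 5),
  abc_ah_three = c(5, 4, 7, 6, 2), abc_bh_three = c(9, 7, 2, 1, 4)
)

suffixes <- c("one", "two", "three")

# For each suffix, take the "ah" value where a_or_b is "a", else the "bh" value
picked <- sapply(suffixes, function(s) {
  ifelse(df$a_or_b == "a",
         df[[paste0("abc_ah_", s)]],
         df[[paste0("abc_bh_", s)]])
})
colnames(picked) <- paste0("combo_", suffixes)

df2 <- cbind(df[c("name", "a_or_b")], picked)
df2
```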
It's possible to do this with rename_if and mutate, picking the "ah" or "bh" column row by row with if_else:
library(tidyverse)
df %>%
select(name, a_or_b, starts_with("abc")) %>%
rename_if(is.numeric, ~ sub("abc_", "", .x)) %>% # abc_ah_one -> ah_one, ...
mutate(combo_one = if_else(a_or_b == "a", ah_one, bh_one),
combo_two = if_else(a_or_b == "a", ah_two, bh_two),
combo_three = if_else(a_or_b == "a", ah_three, bh_three)) %>%
select(name, a_or_b, starts_with("combo"))
Output:
name a_or_b combo_one combo_two combo_three
1 Fred a 3 2 5
2 Mark a 5 1 4
3 Jen b 1 8 2
4 Simon a 4 7 6
5 Ed b 8 5 4

How to summarize value not matching the group using dplyr

I want to sum the values of rows which belong to a group other than the row's own group. For example, using this sample data:
> df <- data.frame(id=1:5, group=c("A", "A", "B", "B", "A"), val=seq(9, 1, -2))
> df
id group val
1 1 A 9
2 2 A 7
3 3 B 5
4 4 B 3
5 5 A 1
Summarizing with dplyr by group
> df %>% group_by(group) %>% summarize(sumval = sum(val))
Source: local data frame [2 x 2]
group sumval
(fctr) (dbl)
1 A 17
2 B 8
What I want is the value for rows belonging to group A to use sumval of not group A. i.e. the final result is
id group val notval
1 1 A 9 8
2 2 A 7 8
3 3 B 5 17
4 4 B 3 17
5 5 A 1 8
Is there a way to do this in dplyr? Preferably in a single chain?
We can do this with base R
s1 <- sapply(unique(df$group), function(x) sum(df$val[df$group !=x]))
s1[with(df, match(group, unique(group)))]
#[1] 8 8 17 17 8
Or using data.table
library(data.table)
setDT(df)[,notval := sum(df$val[df$group!=group]) ,group]
@akrun's answers are best. But if you want to do it in dplyr, this is a roundabout way.
df <- data.frame(id=1:5, group=c("A", "A", "B", "B", "A"), val=seq(9, 1, -2))
df %>% mutate(TotalSum = sum(val)) %>% group_by(group) %>%
mutate(valsumval = TotalSum - sum(val))
Source: local data frame [5 x 5]
Groups: group [2]
id group val TotalSum valsumval
(int) (fctr) (dbl) (dbl) (dbl)
1 1 A 9 25 8
2 2 A 7 25 8
3 3 B 5 25 17
4 4 B 3 25 17
5 5 A 1 25 8
This also works even if there are more than two groups.
This alone also works:
df %>% group_by(group) %>% mutate(notval = sum(df$val)- sum(val))
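The "total minus own group" idea can be written in base R as well, using ave() to broadcast each group's sum back to its rows. A minimal sketch on the question's data:

```r
df <- data.frame(id = 1:5,
                 group = c("A", "A", "B", "B", "A"),
                 val = seq(9, 1, -2))

# ave() returns, for every row, the sum of val within that row's group;
# subtracting it from the grand total leaves the sum over all other groups
df$notval <- sum(df$val) - ave(df$val, df$group, FUN = sum)
df
```

As with the dplyr versions, this should hold for any number of groups, since it never refers to a specific group label.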
