Given a dataframe df like the one below:
text <- "
parameter,car,qtr,val
a,a3,FY18Q1,23
b,a3,FY18Q1,10000
a,a3,FY18Q2,14
b,a3,FY18Q2,12000
a,cla,FY18Q1,15
b,cla,FY18Q1,12000
c,cla,FY18Q1,5.5
a,cla,FY18Q2,26
b,cla,FY18Q2,10000
c,cla,FY18Q2,6.2
"
df <- read.table(textConnection(text), sep = ",", header = TRUE)
I want to add a row with parameter b_diff for each car, qtr combination, with val equal to the difference in parameter b between two consecutive qtr values. The ascending qtr order is FY18Q1, FY18Q2. For the first qtr, FY18Q1, the val for b_diff should be NA, as there is no previous qtr.
The expected output is as below.
parameter car qtr val
a a3 FY18Q1 23
b a3 FY18Q1 10000
b_diff a3 FY18Q1 NA
a a3 FY18Q2 14
b a3 FY18Q2 12000
b_diff a3 FY18Q2 2000
a cla FY18Q1 15
b cla FY18Q1 12000
c cla FY18Q1 5.5
b_diff cla FY18Q1 NA
a cla FY18Q2 26
b cla FY18Q2 10000
c cla FY18Q2 6.2
b_diff cla FY18Q2 -2000
How do I go about doing this with dplyr?
A solution using dplyr and purrr. We can create a group ID using group_indices, split the data frame on that ID, summarize the data, and then combine the pieces. df5 is the final output.
library(dplyr)
library(purrr)
df2 <- df %>% mutate(GroupID = group_indices(., car, qtr))
df3 <- df2 %>%
  filter(parameter %in% "b") %>%
  group_by(car) %>%
  mutate(val = val - lag(val), parameter = "b_diff") %>%
  ungroup() %>%
  split(f = .$GroupID)
df4 <- df2 %>% split(f = .$GroupID)
df5 <- map2_dfr(df4, df3, bind_rows) %>% select(-GroupID)
df5
# parameter car qtr val
# 1 a a3 FY18Q1 23.0
# 2 b a3 FY18Q1 10000.0
# 3 b_diff a3 FY18Q1 NA
# 4 a a3 FY18Q2 14.0
# 5 b a3 FY18Q2 12000.0
# 6 b_diff a3 FY18Q2 2000.0
# 7 a cla FY18Q1 15.0
# 8 b cla FY18Q1 12000.0
# 9 c cla FY18Q1 5.5
# 10 b_diff cla FY18Q1 NA
# 11 a cla FY18Q2 26.0
# 12 b cla FY18Q2 10000.0
# 13 c cla FY18Q2 6.2
# 14 b_diff cla FY18Q2 -2000.0
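Note: newer dplyr (>= 1.0.0) deprecates passing grouping variables to group_indices() this way. A minimal sketch of an equivalent way to build GroupID with cur_group_id(), assuming a recent dplyr version:
# Same group ID per (car, qtr), written for dplyr >= 1.0.0
df2 <- df %>%
  group_by(car, qtr) %>%
  mutate(GroupID = cur_group_id()) %>%
  ungroup()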
DATA
Note that it is better to read the data with stringsAsFactors = FALSE.
text <- "
parameter,car,qtr,val
a,a3,FY18Q1,23
b,a3,FY18Q1,10000
a,a3,FY18Q2,14
b,a3,FY18Q2,12000
a,cla,FY18Q1,15
b,cla,FY18Q1,12000
c,cla,FY18Q1,5.5
a,cla,FY18Q2,26
b,cla,FY18Q2,10000
c,cla,FY18Q2,6.2
"
df <- read.table(textConnection(text), sep = ",", header = TRUE, stringsAsFactors = FALSE)
Here is one algorithm:
Reshape the data to "wide" format, so that qtr and car form a unique row index, with the parameter column "spread" into columns
Within each car value, take the 1-period diff of the new parameter_b column
Reshape the data back to "long" format
Equivalent code, using reshape2 and dplyr:
library(reshape2)
library(dplyr)

# optional: you could just use `c(NA, diff(x))` below, but this is more general
padded_diff <- function(x, lag = 1L) {
  c(rep.int(NA, lag), diff(x, lag = lag))
}

df %>%
  dcast(car + qtr ~ parameter, value.var = "val") %>%
  group_by(car) %>%  # take the diff within each car, as the algorithm describes
  mutate(b_diff = padded_diff(b)) %>%
  ungroup() %>%
  melt(id.vars = c("car", "qtr"), variable.name = "parameter") %>%
  arrange(car, qtr, parameter)
Here is another algorithm:
Group the data frame by car
Within each group, temporarily filter so that only rows with parameter == "b" are present
Take the 1-period diff of the val column
Remove the filter and ungroup
Equivalent code, using only dplyr, using a temporary table to simulate a "removable" filter:
make_b_diff_within_group <- function(df) {
  tmp <- df %>%
    filter(parameter == "b") %>%
    transmute(
      qtr = qtr,
      val = padded_diff(val),
      parameter = "b_diff")
  bind_rows(df, tmp)
}

df %>%
  group_by(car) %>%
  do(make_b_diff_within_group(.)) %>%
  ungroup() %>%
  arrange(car, qtr, parameter)
This second algorithm could be implemented using several other "split-apply-combine" paradigms, including the tapply or by functions in base R, the ddply function in the plyr package (an ancestor of dplyr by the same author), and the split method from dplyr, as shown in this answer.
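For instance, a minimal base R sketch of this second algorithm with by (my sketch, not from the original answer; it assumes padded_diff from above, stringsAsFactors = FALSE as recommended in the DATA note, and rows already sorted by qtr within each car):
# Split by car, append the b_diff rows, then recombine
pieces <- by(df, df$car, function(d) {
  b_rows <- d[d$parameter == "b", ]
  diff_rows <- data.frame(
    parameter = "b_diff",
    car = b_rows$car,
    qtr = b_rows$qtr,
    val = padded_diff(b_rows$val))
  rbind(d, diff_rows)
})
out <- do.call(rbind, pieces)
out[order(out$car, out$qtr, out$parameter), ]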
I'm struggling to find an easy and fast solution to create a new data frame by multiplying all pairs of columns within each "group" of columns.
Data for example
a1 <- rnorm(n = 10)
b1 <- rnorm(n = 10)
c1 <- rnorm(n = 10)
a2 <- rnorm(n = 10)
b2 <- rnorm(n = 10)
c2 <- rnorm(n = 10)
For example, this is my initial data table:
library(lubridate) # for today()

Original <- data.frame(
  date = seq(today() - 9, today(), by = 1),
  a1 = a1,
  b1 = b1,
  c1 = c1,
  a2 = a2,
  b2 = b2,
  c2 = c2)
and this data table is what I would like to achieve (i.e., columns with all the possible pairwise products between the columns that end with a 1, and likewise between the columns that end with a 2):
Objective <- data.frame(
  date = seq(today() - 9, today(), by = 1),
  b1a1 = b1 * a1,
  c1a1 = c1 * a1,
  c1b1 = c1 * b1,
  b2c2 = b2 * c2,
  b2a2 = b2 * a2,
  c2a2 = c2 * a2)
I tried with loops, but mine was not a very elegant or efficient solution. A solution using the tidyverse would be very welcome.
Here is a base R option:
cbind(Original[1], do.call(cbind,
  unname(lapply(split.default(Original[-1],
                              gsub('\\D', '', names(Original[-1]))), function(x) {
    do.call(cbind, combn(names(x), 2, function(y) {
      setNames(data.frame(do.call(`*`, Original[y])),
               paste0(y, collapse = ''))
    }, simplify = FALSE))
  }))))
# date a1b1 a1c1 b1c1 a2b2 a2c2 b2c2
#1 2021-05-28 -0.06708 1.393018 -0.1213 0.1795 -1.0878 -0.0947
#2 2021-05-29 0.33234 0.045563 0.0201 0.0607 0.0247 0.9219
#3 2021-05-30 0.05043 0.160582 0.0341 0.1748 -0.3893 -0.1184
#4 2021-05-31 0.93642 0.980333 0.8156 0.0746 -1.1128 -0.1571
#5 2021-06-01 -1.21365 -0.256619 0.3268 -1.0106 -0.3542 2.1991
#6 2021-06-02 -0.09550 1.311417 -0.0754 -0.8243 -0.5532 1.1986
#7 2021-06-03 0.32514 0.373324 2.3262 -1.1904 -3.0764 0.7171
#8 2021-06-04 -0.41219 1.034527 -0.8338 -1.8588 -1.0202 2.6916
#9 2021-06-05 0.12488 -0.155639 -0.2294 0.2380 0.4288 0.3711
#10 2021-06-06 -0.00665 0.000139 -0.0105 -2.0117 -0.6363 1.0802
Explanation of the answer:
split.default is used to split the data into groups.
split.default(Original[-1], gsub('\\D', '', names(Original[-1])))
#$`1`
# a1 b1 c1
#1 -0.87773 0.0764 -1.5871
#2 0.86812 0.3828 0.0525
#3 0.48761 0.1034 0.3293
#4 -1.06095 -0.8826 -0.9240
#5 0.97625 -1.2432 -0.2629
#6 -1.28910 0.0741 -1.0173
#7 -0.22843 -1.4234 -1.6343
#8 -0.71512 0.5764 -1.4467
#9 0.29108 0.4290 -0.5347
#10 -0.00937 0.7098 -0.0149
#$`2`
# a2 b2 c2
#1 -1.4360 -0.125 0.758
#2 -0.0403 -1.507 -0.612
#3 -0.7580 -0.231 0.514
#4 0.7270 0.103 -1.531
#5 -0.4035 2.505 0.878
#6 0.6168 -1.336 -0.897
#7 2.2599 -0.527 -1.361
#8 -0.8394 2.215 1.215
#9 -0.5244 -0.454 -0.818
#10 1.0886 -1.848 -0.585
where gsub removes all non-numeric characters from the column names; the result defines the groups.
gsub('\\D', '', names(Original[-1]))
#[1] "1" "1" "1" "2" "2" "2"
For every group, lapply creates every combination of the column names (combn(names(x), 2, ...)), taking 2 columns at a time.
Each combination is multiplied out (do.call(`*`, Original[y])) to create a one-column dataframe, whose column is named after the combination using setNames (paste0(y, collapse = '')); see the small demo after this list.
All the combinations from the previous step are combined into one dataframe (do.call(cbind, combn...)).
All such groups are again combined into one dataframe (do.call(cbind, lapply...)).
The first column, with the dates, is kept as-is in the final output (cbind(Original[1], ...)).
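To see the combn/setNames pattern from these steps in isolation, here is a minimal sketch on a toy data frame (x below is hypothetical, not the question's random data):
# Toy group of columns
x <- data.frame(a1 = 1:3, b1 = 4:6, c1 = 7:9)

# One one-column data frame per pair of columns, named after the pair
pairs <- combn(names(x), 2, function(y) {
  setNames(data.frame(x[[y[1]]] * x[[y[2]]]), paste0(y, collapse = ''))
}, simplify = FALSE)

do.call(cbind, pairs)
#   a1b1 a1c1 b1c1
# 1    4    7   28
# 2   10   16   40
# 3   18   27   54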
Very good question. Here is a tidyverse approach; it also copes with groups that have unequal numbers of columns. Explanation:
The data is divided into a list, with each sub-group as a separate item. For this division:
firstly, the data is pivoted long using pivot_longer;
then a dummy group (sub-group identification) column is created using gsub. You may use str_replace too;
the list itself is created using dplyr::group_split.
The data in every item is then reshaped back to its original form using tidyr::pivot_wider inside purrr::map.
Thereafter, for each individual item of the list:
the column pairs are first multiplied together using combn and Reduce (you may also use purrr::reduce here);
secondly, the names of the new columns are generated using the same combn and Reduce;
these names turn the resulting matrix into a named dataframe.
Lastly, purrr::reduce in conjunction with dplyr::left_join converts the list back into the intended shape.
set.seed(123)
a1 <- rnorm(n = 10)
b1 <- rnorm(n = 10)
c1 <- rnorm(n = 10)
a2 <- rnorm(n = 10)
b2 <- rnorm(n = 10)
c2 <- rnorm(n = 10)
Original <- data.frame(
  date = seq(Sys.Date() - 9, Sys.Date(), by = 1),
  a1 = a1,
  b1 = b1,
  c1 = c1,
  a2 = a2,
  b2 = b2,
  c2 = c2)
library(tidyverse)
Original %>%
  pivot_longer(!date) %>%
  mutate(grp = gsub('^\\D*(\\d)+$', '\\1', name)) %>%
  group_split(grp, .keep = F) %>%
  map(~ .x %>% pivot_wider(names_from = name, values_from = value)) %>%
  map(~ combn(.x[-1], 2, FUN = Reduce, f = `*`) %>%
        as.data.frame() %>%
        setNames(combn(names(.x[-1]), 2, FUN = Reduce, f = paste0)) %>%
        cbind(.x[1], .)) %>%
  reduce(~ left_join(.x, .y, by = 'date'))
date a1b1 a1c1 b1c1 a2b2 a2c2 b2c2
1 2021-05-28 -0.68606804 0.59848918 -1.30710356 -0.29626767 0.108031283 -0.175982140
2 2021-05-29 -0.08282104 0.05017292 -0.07843039 0.06135046 0.008423333 0.005935364
3 2021-05-30 0.62468579 -1.59924166 -0.41119329 -1.13268875 -0.038374446 0.054248120
4 2021-05-31 0.00780406 -0.05139295 -0.08067566 1.90463287 1.201815497 2.968438088
5 2021-06-01 -0.07186344 -0.08080991 0.34742254 0.99243873 -0.185489171 -0.272722771
6 2021-06-02 3.06467216 -2.89278864 -3.01397443 -0.77341778 1.044302702 -1.703161152
7 2021-06-03 0.22946735 0.38614963 0.41709268 -0.22316502 -0.857881519 0.623969018
8 2021-06-04 2.48789113 -0.19402639 -0.30162620 0.02889143 -0.036194437 -0.272813136
9 2021-06-05 -0.48172830 0.78173260 -0.79823906 -0.23864021 -0.037894774 0.096601990
10 2021-06-06 0.21070515 -0.55877763 -0.59279292 0.03171951 -0.082159505 -0.018002847
Check it against this extended dataset:
set.seed(123)
a1 <- rnorm(n = 10)
b1 <- rnorm(n = 10)
c1 <- rnorm(n = 10)
a2 <- rnorm(n = 10)
b2 <- rnorm(n = 10)
c2 <- rnorm(n = 10)
d2 <- rnorm(n = 10)
Original <- data.frame(
  date = seq(Sys.Date() - 9, Sys.Date(), by = 1),
  a1 = a1,
  b1 = b1,
  c1 = c1,
  a2 = a2,
  b2 = b2,
  c2 = c2,
  d2 = d2)
library(tidyverse)
Original %>%
  pivot_longer(!date) %>%
  mutate(grp = gsub('^\\D*(\\d)+$', '\\1', name)) %>%
  group_split(grp, .keep = F) %>%
  map(~ .x %>% pivot_wider(names_from = name, values_from = value)) %>%
  map(~ combn(.x[-1], 2, FUN = Reduce, f = `*`) %>%
        as.data.frame() %>%
        setNames(combn(names(.x[-1]), 2, FUN = Reduce, f = paste0)) %>%
        cbind(.x[1], .)) %>%
  reduce(~ left_join(.x, .y, by = 'date'))
date a1b1 a1c1 b1c1 a2b2 a2c2 a2d2 b2c2 b2d2 c2d2
1 2021-05-28 -0.68606804 0.59848918 -1.30710356 -0.29626767 0.108031283 0.161902656 -0.175982140 -0.26373820 0.09616971
2 2021-05-29 -0.08282104 0.05017292 -0.07843039 0.06135046 0.008423333 0.148221326 0.005935364 0.10444173 0.01433970
3 2021-05-30 0.62468579 -1.59924166 -0.41119329 -1.13268875 -0.038374446 -0.298262480 0.054248120 0.42163941 0.01428475
4 2021-05-31 0.00780406 -0.05139295 -0.08067566 1.90463287 1.201815497 -0.894445153 2.968438088 -2.20924515 -1.39402460
5 2021-06-01 -0.07186344 -0.08080991 0.34742254 0.99243873 -0.185489171 -0.880563395 -0.272722771 -1.29468307 0.24197936
6 2021-06-02 3.06467216 -2.89278864 -3.01397443 -0.77341778 1.044302702 0.209022041 -1.703161152 -0.34089562 0.46029226
7 2021-06-03 0.22946735 0.38614963 0.41709268 -0.22316502 -0.857881519 0.248271309 0.623969018 -0.18057692 -0.69416615
8 2021-06-04 2.48789113 -0.19402639 -0.30162620 0.02889143 -0.036194437 -0.003281582 -0.272813136 -0.02473471 0.03098700
9 2021-06-05 -0.48172830 0.78173260 -0.79823906 -0.23864021 -0.037894774 -0.282179411 0.096601990 0.71933645 0.11422674
10 2021-06-06 0.21070515 -0.55877763 -0.59279292 0.03171951 -0.082159505 -0.779997773 -0.018002847 -0.17091365 0.44269850
Created on 2021-06-06 by the reprex package (v2.0.0)
You can also use the following solution. It is not as concise as the other answers, but this different approach may have some points worth considering. In the first chunk of code I tried to emulate the combn function with tidyverse equivalents, so the first chunk, which produces the df2 dataset, creates all the combinations whose products you would like to calculate, and the second chunk just evaluates them in the context of the Original dataset. Anyway, thank you for this fantastic question that pushed me to my limits.
library(dplyr)
library(tidyr)
library(purrr)
library(stringr)
library(rlang)
cols <- c("(\\w1)", "(\\w2)")
cols %>%
  map_dfc(~ names(Original)[str_detect(names(Original), .x)] %>%
            as_tibble() %>%
            mutate(value2 = rev(value)) %>%
            expand(value, value2) %>%
            filter(value != value2) %>%
            rowwise() %>%
            mutate(comb = paste0(sort(c(value, value2)), collapse = "*")) %>%
            select(comb) %>%
            distinct(comb)) %>%
  rename_with(~ str_remove(., "\\.\\.\\."), everything()) %>%
  pivot_longer(everything(), names_to = c(".value", "id"),
               names_pattern = "(\\w+)(\\d)") -> df2
df2 %>%
  select(comb) %>%
  rowwise() %>%
  mutate(data = map(comb, ~ eval_tidy(parse_expr(.x), data = Original))) %>%
  unnest(cols = c(data)) %>%
  group_by(comb) %>%
  mutate(id = row_number()) %>%
  pivot_wider(names_from = comb, values_from = data) %>%
  relocate(ends_with("1")) %>%
  bind_cols(Original$date) %>%
  rename_with(~ str_remove(., "\\*"), everything()) %>%
  rename(Date = ...8) %>%
  relocate(Date) %>%
  select(-id)
# A tibble: 10 x 7
Date a1b1 a1c1 b1c1 a2b2 a2c2 b2c2
<date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2021-05-28 -0.129 0.0912 -0.0838 -1.55 -1.52 2.11
2 2021-05-29 -0.477 -1.58 0.352 -3.55 -0.144 0.101
3 2021-05-30 0.195 0.708 0.105 0.910 -0.356 -0.177
4 2021-05-31 -0.194 0.0219 -0.0111 -1.35 0.261 -0.200
5 2021-06-01 0.0140 0.107 0.000601 -0.0279 -0.126 0.104
6 2021-06-02 0.242 0.141 0.174 -0.0174 0.695 -0.0570
7 2021-06-03 -0.439 -0.360 0.589 0.804 -2.76 -1.79
8 2021-06-04 -1.02 -0.0349 0.0137 2.07 0.357 0.495
9 2021-06-05 -0.00670 0.550 -0.00161 -0.000907 0.00503 -0.925
10 2021-06-06 -0.287 -0.505 0.718 -0.0290 -0.00351 0.0256
I am trying to extract some data from a number of Excel spreadsheets that do not have a tidy format. I think I need to run lapply within lapply, but can't seem to make it work. Here is an example:
Here are two dataframes with formats equivalent to what I find in the Excel sheets:
library('dplyr')
library('tidyr')
library('readxl')
df1 <- data.frame(instance = c('...', 'A', 'B'),
                  `1990.1` = c('est', 1, 2),
                  `1990.2` = c('val', 2, 3),
                  `1991.1` = c('est', 3, 4),
                  `1991.2` = c('val', 4, 5))
df2 <- data.frame(instance = c('...', 'A', 'B'),
                  `1990.1` = c('est', 5, 6),
                  `1990.2` = c('val', 6, 7),
                  `1991.1` = c('est', 7, 8),
                  `1991.2` = c('val', 8, 9))
> df1
instance X1990.1 X1990.2 X1991.1 X1991.2
1 ... est val est val
2 A 1 2 3 4
3 B 2 3 4 5
I create a function to clean the data, based off this pipeline:
df1 %>%
  select(1, which(.[1,] == 'est')) %>%
  .[-1,] %>%
  gather(key = year, value = score, -instance) %>%
  mutate(var = 'est')
Gives:
instance year score var
1 A X1990.1 1 est
2 B X1990.1 2 est
3 A X1991.1 3 est
4 B X1991.1 4 est
So I wrap it in a function:
data_clean <- function(x) {
  df1 %>%
    select(1, which(.[1,] == x)) %>%
    .[-1,] %>%
    gather(key = year, value = score, -instance) %>%
    mutate(var = x)
}
I can now generate a clean version of each df as follows:
do.call(rbind, lapply(c('est', 'val'), data_clean)) %>%
  mutate(origin = 'df1')
instance year score var origin
1 A X1990.1 1 est df1
2 B X1990.1 2 est df1
3 A X1991.1 3 est df1
4 B X1991.1 4 est df1
5 A X1990.2 2 val df1
6 B X1990.2 3 val df1
7 A X1991.2 4 val df1
8 B X1991.2 5 val df1
What I now need to do is apply this to the list of dataframes:
list_data <- list(df1, df2)
In my case i would generate this from a function:
data_pull <- function(x) {
  read_excel('path/to/file', sheet = x)
}
list_data <- lapply(2:20, data_pull)
But I can't think how to do this. I need to apply data_clean to each element of the list generated by data_pull. I obviously need to remove the hard-coded reference to df1 in the data_clean function, but then what object am I passing to data_clean?
What I eventually want is a single data frame with all the data in one place, in a tidy format.
Sorry if I am missing something simple here. I feel there is a lot of data structured like this, and the solution for cleaning it should be fairly simple; I just can't seem to think of it.
An option is to keep the data in a list and loop over it with map. We can rename the columns by pasting the first row onto all columns except 'instance', slice out the first row, use pivot_longer to reshape from 'wide' to 'long', separate the 'name' column into two, and convert the types if needed.
library(dplyr)
library(tidyr)
library(purrr)
library(readr)
library(stringr)
f1 <- function(dat) {
  names(dat)[-1] <- str_c(names(dat)[-1], unlist(dat[1, -1]), sep = "_")
  dat %>%
    slice(-1) %>%
    pivot_longer(cols = -instance, values_to = "seq") %>%
    mutate_all(as.character) %>%
    separate(name, into = c('year', 'var'), sep = "_", convert = TRUE) %>%
    type_convert()
}
map_dfr(set_names(list_data, c('df1', 'df2')), f1, .id = 'origin')
# A tibble: 16 x 5
# origin instance year var seq
# <chr> <chr> <chr> <chr> <dbl>
# 1 df1 A X1990.1 est 1
# 2 df1 A X1990.2 val 2
# 3 df1 A X1991.1 est 3
# 4 df1 A X1991.2 val 4
# 5 df1 B X1990.1 est 2
# 6 df1 B X1990.2 val 3
# 7 df1 B X1991.1 est 4
# 8 df1 B X1991.2 val 5
# 9 df2 A X1990.1 est 5
#10 df2 A X1990.2 val 6
#11 df2 A X1991.1 est 7
#12 df2 A X1991.2 val 8
#13 df2 B X1990.1 est 6
#14 df2 B X1990.2 val 7
#15 df2 B X1991.1 est 8
#16 df2 B X1991.2 val 9
If we are using the function data_pull
map_dfr(2:20, ~ data_pull(.x) %>% f1(), .id = 'origin')
I have a data.frame, which has NA's in several columns:
df <- data.frame(a0 = 1:3, a1 = c("A", "B", NA), a2 = c("a", NA, NA),
                 a3 = rep(NA, 3), stringsAsFactors = FALSE)
I would like to add a new column, all.na, indicating for each row whether columns c("a1","a2","a3") are all NA.
It can be done using sapply:
df$all.na <- sapply(1:nrow(df), function(x) all(is.na(df[x,c("a1","a2","a3")])))
But I'm looking for something faster.
I thought using dplyr::mutate might be a good solution but:
> df %>% dplyr::mutate(all(is.na(c(a1,a2,a3))))
a0 a1 a2 a3 all(is.na(c(a1, a2, a3)))
1 1 A a NA FALSE
2 2 B <NA> NA FALSE
3 3 <NA> <NA> NA FALSE
Doesn't give me the desired outcome.
Any idea how to get dplyr::mutate to give
df$all.na <- c(FALSE, FALSE, TRUE)
on this data?
We could use rowwise with do
library(dplyr)
cols <- c("a1","a2","a3")
df %>%
  rowwise() %>%
  do((.) %>%
       as.data.frame() %>%
       mutate(all.na = all(is.na(.[cols]))))
# a0 a1 a2 a3 all.na
# <int> <chr> <chr> <lgl> <lgl>
#1 1 A a NA FALSE
#2 2 B NA NA FALSE
#3 3 NA NA NA TRUE
Or a more general approach using tidyverse gather and spread
library(tidyverse)
df %>%
  gather(key, value, -a0) %>%
  group_by(a0) %>%
  mutate(all.na = all(is.na(value))) %>%
  spread(key, value)
However, in base R there is a better approach using is.na and rowSums
df$all.na <- rowSums(is.na(df[cols])) == length(cols)
df
# a0 a1 a2 a3 all.na
#1 1 A a NA FALSE
#2 2 B <NA> NA FALSE
#3 3 <NA> <NA> NA TRUE
This can also be achieved using apply row-wise (MARGIN = 1) but this will not help with any speed improvements.
df$all.na <- apply(df[cols], 1, function(x) all(is.na(x)))
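To check the relative speeds on your own data, a quick sketch with the microbenchmark package (assuming it is installed; timings will vary by machine and data size):
library(microbenchmark)

# Compare the row-wise approaches with the vectorized rowSums one
microbenchmark(
  sapply  = sapply(1:nrow(df), function(x) all(is.na(df[x, cols]))),
  apply   = apply(df[cols], 1, function(x) all(is.na(x))),
  rowSums = rowSums(is.na(df[cols])) == length(cols)
)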
Here is one option with tidyverse making use of pmap
library(tidyverse)
df %>%
  mutate(all.na = pmap_lgl(.[cols], ~ all(is.na(c(...)))))
# a0 a1 a2 a3 all.na
#1 1 A a NA FALSE
#2 2 B <NA> NA FALSE
#3 3 <NA> <NA> NA TRUE
Or another option is to convert to logical vector with map and reduce it back to a single logical vector
df %>%
  mutate(all.na = map(.[cols], is.na) %>%
           reduce(`&`))
With base R, this can be achieved using Reduce and lapply
df$all.na <- Reduce(`&`, lapply(df[cols], is.na))
data
cols <- c("a1","a2","a3")
Consider the following two data.frames:
a1 <- data.frame(A = c(1:5, 2, 4, 2), B = letters[c(1:5, 2, 4, 2)])
a2 <- data.frame(A = c(1:3,2), B = letters[c(1:3,2)])
I would like to remove the exact rows of a1 that are in a2 so that the result should be:
A B
4 d
5 e
4 d
2 b
Note that one row with 2 b in a1 is retained in the final result. Currently, I use a looping statement, which becomes extremely slow as I have many variables and thousands of rows in my data.frames. Is there any built-in function to get this result?
The idea is to add a counter for duplicates to each table, so you get a unique match for each occurrence of a row. data.table is nice here because it makes it easy to count the duplicates (with .N), and it also provides the necessary function (fsetdiff) for set operations.
library(data.table)
a1 <- data.table(A = c(1:5, 2, 4, 2), B = letters[c(1:5, 2, 4, 2)])
a2 <- data.table(A = c(1:3,2), B = letters[c(1:3,2)])
# add counter for duplicates
a1[, i := 1:.N, .(A,B)]
a2[, i := 1:.N, .(A,B)]
# setdiff gets the exception
# "all = T" allows duplicate rows to be returned
fsetdiff(a1, a2, all = T)
# A B i
# 1: 4 d 1
# 2: 5 e 1
# 3: 4 d 2
# 4: 2 b 3
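If you do not want the helper counter in the result, it can be dropped afterwards; a small sketch using data.table's column-dropping syntax:
# Same set difference, with the counter column i removed from the output
fsetdiff(a1, a2, all = TRUE)[, !"i"]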
You could use dplyr to do this. I set stringsAsFactors = FALSE to get rid of warnings about factor mismatches.
library(dplyr)
a1 <- data.frame(A = c(1:5, 2, 4, 2), B = letters[c(1:5, 2, 4, 2)], stringsAsFactors = FALSE)
a2 <- data.frame(A = c(1:3,2), B = letters[c(1:3,2)], stringsAsFactors = FALSE)
## Make temp variables to join on then delete later.
# Create a row number
a1_tmp <- a1 %>%
  group_by(A, B) %>%
  mutate(tmp_id = row_number()) %>%
  ungroup()

# Create a count
a2_tmp <- a2 %>%
  group_by(A, B) %>%
  summarise(count = n()) %>%
  ungroup()

## Keep all that have no entry in a2 or the id > the count (i.e. used-up a2 entries).
left_join(a1_tmp, a2_tmp, by = c('A', 'B')) %>%
  ungroup() %>%
  filter(is.na(count) | tmp_id > count) %>%
  select(-tmp_id, -count)
## # A tibble: 4 x 2
## A B
## <dbl> <chr>
## 1 4 d
## 2 5 e
## 3 4 d
## 4 2 b
EDIT
Here is a similar solution that is a little shorter. It does the following: (1) adds a row-number column to join both data.frame objects on, and (2) adds a temporary column to a2 (the 2nd data.frame) that will show up as NA in the join to a1 (i.e., it indicates rows unique to a1).
library(dplyr)
left_join(a1 %>% group_by(A, B) %>% mutate(rn = row_number()) %>% ungroup(),
          a2 %>% group_by(A, B) %>% mutate(rn = row_number(), tmpcol = 0) %>% ungroup(),
          by = c('A', 'B', 'rn')) %>%
  filter(is.na(tmpcol)) %>%
  select(-tmpcol, -rn)
## # A tibble: 4 x 2
## A B
## <dbl> <chr>
## 1 4 d
## 2 5 e
## 3 4 d
## 4 2 b
I think this solution is a little simpler (perhaps very little) than the first.
I guess this is similar to DWal's solution but in base R
a1_temp = Reduce(paste, a1)
a1_temp = paste(a1_temp, ave(seq_along(a1_temp), a1_temp, FUN = seq_along))
a2_temp = Reduce(paste, a2)
a2_temp = paste(a2_temp, ave(seq_along(a2_temp), a2_temp, FUN = seq_along))
a1[!a1_temp %in% a2_temp,]
# A B
#4 4 d
#5 5 e
#7 4 d
#8 2 b
Here's another solution with dplyr:
library(dplyr)
a1 %>%
  arrange(A) %>%
  group_by(A) %>%
  filter(!(paste0(1:n(), A, B) %in% with(arrange(a2, A), paste0(1:n(), A, B))))
Result:
# A tibble: 4 x 2
# Groups: A [3]
A B
<dbl> <fctr>
1 2 b
2 4 d
3 4 d
4 5 e
This way of filtering avoids creating extra unwanted columns that you have to later remove in the final output. This method also sorts the output. Not sure if it's what you want.
Given a situation such as the following
library(dplyr)
myData <- tbl_df(data.frame(
  var1 = rnorm(100),
  var2 = letters[1:3] %>% sample(100, replace = TRUE) %>% factor(),
  var3 = LETTERS[1:3] %>% sample(100, replace = TRUE) %>% factor(),
  var4 = month.abb[1:3] %>% sample(100, replace = TRUE) %>% factor()))
I would like to group `myData` to eventually find summary data, grouping by all possible combinations of var2, var3, and var4.
I can create a list with all possible combinations of variables as character values with
groupNames <- names(myData)[2:4]
myGroups <- Map(combn,
                list(groupNames),
                seq_along(groupNames),
                simplify = FALSE) %>%
  unlist(recursive = FALSE)
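For reference, myGroups is then a list of the seven variable combinations; its structure looks like this:
# str(myGroups) shows seven character vectors:
# list("var2", "var3", "var4",
#      c("var2", "var3"), c("var2", "var4"), c("var3", "var4"),
#      c("var2", "var3", "var4"))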
My plan was to make separate data sets for each variable combination with a for() loop, something like
### This Does Not Work
for (i in 1:length(myGroups)) {
  assign(myGroups[i] %>%
           unlist() %>%
           paste0(collapse = "") %>%
           paste0("Data"),
         myData %>%
           group_by_(lapply(myGroups[[i]], as.symbol)) %>%
           summarise(n = length(var1),
                     avgVar2 = var2 %>% mean()))
}
Admittedly I am not very good with lists, and looking up this issue was a bit challenging since dplyr updates have altered how grouping works a bit.
If there is a better way to do this than separate data sets I would love to know.
I've gotten a loop similar to above working when I am only grouping by a single variable.
Any and all help is greatly appreciated! Thank you!
This seems convoluted, and there's probably a way to simplify or fancy it up with a do, but it works. Using your myData and myGroups:
results = lapply(myGroups, FUN = function(x) {
  do.call(what = group_by_, args = c(list(myData), x)) %>%
    summarise(n = length(var1),
              avgVar1 = mean(var1))
})
> results[[1]]
Source: local data frame [3 x 3]
var2 n avgVar1
1 a 31 0.38929738
2 b 31 -0.07451717
3 c 38 -0.22522129
> results[[4]]
Source: local data frame [9 x 4]
Groups: var2
var2 var3 n avgVar1
1 a A 11 -0.1159160
2 a B 11 0.5663312
3 a C 9 0.7904056
4 b A 7 0.0856384
5 b B 13 0.1309756
6 b C 11 -0.4192895
7 c A 15 -0.2783099
8 c B 10 -0.1110877
9 c C 13 -0.2517602
> results[[7]]
# I won't paste them here, but it has all 27 rows, grouped by var2, var3 and var4.
I changed your summarise call to average var1 since var2 isn't numeric.
I have created a function based on the answer of @Gregor and the comments that followed:
library(magrittr)
library(dplyr) # needed for tbl_df()

myData <- tbl_df(data.frame(
  var1 = rnorm(100),
  var2 = letters[1:3] %>% sample(100, replace = TRUE) %>% factor(),
  var3 = LETTERS[1:3] %>% sample(100, replace = TRUE) %>% factor(),
  var4 = month.abb[1:3] %>% sample(100, replace = TRUE) %>% factor()))
Function combSummarise
combSummarise <- function(data, variables = ..., summarise = ...) {
  # Get all different combinations of selected variables (credit to @Michael)
  myGroups <- lapply(seq_along(variables), function(x) {
    combn(c(variables), x, simplify = FALSE)
  }) %>%
    unlist(recursive = FALSE)

  # Group by selected variables (credit to @konvas)
  df <- eval(parse(text = paste("lapply(myGroups, function(x){
    dplyr::group_by_(data, .dots=x) %>%
      dplyr::summarize_( \"", paste(summarise, collapse = "\",\""), "\")})"))) %>%
    do.call(plyr::rbind.fill, .)

  groupNames <- c(myGroups[[length(myGroups)]])
  newNames <- names(df)[!(names(df) %in% groupNames)]
  df <- cbind(df[, groupNames], df[, newNames])
  names(df) <- c(groupNames, newNames)
  df
}
Call of combSummarise
combSummarise(myData, var = c("var2", "var3", "var4"),
              summarise = c("length(var1)", "mean(var1)", "max(var1)"))
or
combSummarise(myData, var = c("var2", "var4"),
              summarise = c("length(var1)", "mean(var1)", "max(var1)"))
or
combSummarise(myData, var = c("var2", "var4"),
              summarise = c("length(var1)"))
etc
Inspired by the answers by Gregor and dimitris_ps, I wrote a dplyr-style function that runs summarise for all combinations of grouping variables.
summarise_combo <- function(data, ...) {
  groupVars <- group_vars(data) %>% map(as.name)

  groupCombos <- map(0:length(groupVars), ~ combn(groupVars, ., simplify = FALSE)) %>%
    unlist(recursive = FALSE)

  results <- groupCombos %>%
    map(function(x) { data %>% group_by(!!!x) %>% summarise(...) }) %>%
    bind_rows()

  results %>% select(!!!groupVars, everything())
}
Example
library(tidyverse)
mtcars %>% group_by(cyl, vs) %>% summarise_combo(cyl_n = n(), mean(mpg))
Using unite to create a new column is the simplest way
library(tidyverse)
df = tibble(
  a = c(1, 1, 2, 2, 1, 1, 2, 2),
  b = c(3, 4, 3, 4, 3, 4, 3, 4),
  val = c(1, 2, 3, 4, 5, 6, 7, 8)
)
print(df)#output1
df_2 = unite(df, 'combined_header', a, b, sep='_', remove=FALSE) # remove = FALSE keeps the original columns
print(df_2)#output2
df_2 %>%
  group_by(combined_header) %>%
  summarize(avg_val = mean(val)) %>%
  print() #output3
# avg for 1_3 = mean(c(1, 5)) = 3; avg for 1_4 = mean(c(2, 6)) = 4
Output:
output1
a b val
<dbl> <dbl> <dbl>
1 1 3 1
2 1 4 2
3 2 3 3
4 2 4 4
5 1 3 5
6 1 4 6
7 2 3 7
8 2 4 8
output2
combined_header a b val
<chr> <dbl> <dbl> <dbl>
1 1_3 1 3 1
2 1_4 1 4 2
3 2_3 2 3 3
4 2_4 2 4 4
5 1_3 1 3 5
6 1_4 1 4 6
7 2_3 2 3 7
8 2_4 2 4 8
output3
combined_header avg_val
<chr> <dbl>
1 1_3 3
2 1_4 4
3 2_3 5
4 2_4 6
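If you later want the original a and b columns back alongside the summary, a small sketch with tidyr::separate (assuming the underscore separator used above):
df_2 %>%
  group_by(combined_header) %>%
  summarize(avg_val = mean(val)) %>%
  separate(combined_header, into = c('a', 'b'), sep = '_')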