I'm working on creating a table for publication and am having trouble creating the structure I need.
The "data":
a <- data.frame(Year = c(2018, 2019, 2020), a = 1:3,
b = c("a", "b", "c"),
c = c("d", "e", "f"),
fac = rep("this", 3))
The product would look like this ideally.
fac 2018_a 2018_b 2018_c 2019_a 2019_b 2019_c 2020_a 2020_b 2020_c
this 1 a d 2 b e 3 c f
I know that this should be possible with the pivot functions, but I'm not sure if I need to pivot longer before I go wider, and in all the experiments I've done I cannot get the names or data order correct. I'd very much appreciate any help!
We can also use the following solution:
library(dplyr) # for select()
library(tidyr)
a %>%
pivot_wider(names_from = Year, values_from = c(a, b, c),
names_glue = "{Year}_{.value}") %>%
select(fac, sort(names(.)[-1]))
# A tibble: 1 x 10
fac `2018_a` `2018_b` `2018_c` `2019_a` `2019_b` `2019_c` `2020_a` `2020_b` `2020_c`
<chr> <int> <chr> <chr> <int> <chr> <chr> <int> <chr> <chr>
1 this 1 a d 2 b e 3 c f
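If a recent tidyr is available (1.2.0 or later, which is an assumption about your setup), the names_vary argument of pivot_wider() may make the manual column sort unnecessary. A minimal sketch with the question's data; the object name wide is only illustrative:

```r
library(tidyr)

a <- data.frame(Year = c(2018, 2019, 2020), a = 1:3,
                b = c("a", "b", "c"),
                c = c("d", "e", "f"),
                fac = rep("this", 3))

# names_vary = "slowest" keeps all value columns of one Year next to each other,
# so the columns come out as 2018_a, 2018_b, 2018_c, 2019_a, ...
wide <- pivot_wider(a,
                    names_from = Year,
                    values_from = c(a, b, c),
                    names_glue = "{Year}_{.value}",
                    names_vary = "slowest")
names(wide)
```

With the default names_vary = "fastest" the columns would instead be grouped by value column (2018_a, 2019_a, 2020_a, ...), which is why the answer above sorts them by hand.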
You could use recast from reshape2 package:
reshape2::recast(a, fac~Year+variable, id.var = c('Year', 'fac'))
fac 2018_a 2018_b 2018_c 2019_a 2019_b 2019_c 2020_a 2020_b 2020_c
1 this 1 a d 2 b e 3 c f
Hi, I have two data frames and, wherever the id matches, I want to replace the values in table a with those from table b.
sample dataset is here :
a = tibble(id = c(1, 2,3),
type = c("a", "x", "y"))
b= tibble(id = c(1,3),
type =c("d", "n"))
I'm expecting an output like the following:
c= tibble(id = c(1,2,3),
type = c("d", "x", "n"))
In dplyr v1.0.0, the rows_update() function was introduced for this purpose:
library(dplyr)
rows_update(a, b)
# Matching, by = "id"
# # A tibble: 3 x 2
# id type
# <dbl> <chr>
# 1 1 d
# 2 2 x
# 3 3 n
Here is an option using dplyr::left_join and dplyr::coalesce
library(dplyr)
a %>%
rename(old = type) %>%
left_join(b, by = "id") %>%
mutate(type = coalesce(type, old)) %>%
select(-old)
## A tibble: 3 × 2
# id type
#   <dbl> <chr>
#1 1 d
#2 2 x
#3 3 n
The idea is to join a with b on column id; then replace missing values in type from b with values from a (column old is the old type column from a, avoiding duplicate column names).
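For comparison, the same update can be sketched in base R with match(), without any join. This is only an illustration using plain data frames; the names idx and res are not from the original posts:

```r
a <- data.frame(id = c(1, 2, 3), type = c("a", "x", "y"),
                stringsAsFactors = FALSE)
b <- data.frame(id = c(1, 3), type = c("d", "n"),
                stringsAsFactors = FALSE)

# Position in `a` of each id in `b` (NA when an id of `b` is absent from `a`)
idx <- match(b$id, a$id)

# Overwrite only the matched rows; ids of `b` without a match are skipped
res <- a
res$type[idx[!is.na(idx)]] <- b$type[!is.na(idx)]
res
```

Unlike rows_update(), which by default complains about keys in b that are missing from a, this sketch silently ignores them.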
I have in R the following data frame:
ID = c(rep(1,5),rep(2,3),rep(3,2),rep(4,6));ID
VAR = c("A","A","A","A","B","C","C","D",
"E","E","F","A","B","F","C","F");VAR
CATEGORY = c("ANE","ANE","ANA","ANB","ANE","BOO","BOA","BOO",
"CAT","CAT","DOG","ANE","ANE","DOG","FUT","DOG");CATEGORY
DATA = data.frame(ID,VAR,CATEGORY);DATA
That looks like the table below:
ID  VAR  CATEGORY
1   A    ANE
1   A    ANE
1   A    ANA
1   A    ANB
1   B    ANE
2   C    BOO
2   C    BOA
2   D    BOO
3   E    CAT
3   E    CAT
4   F    DOG
4   A    ANE
4   B    ANE
4   F    DOG
4   C    FUT
4   F    DOG
Given the data frame above, the ideal output would look like this:
ID  TEXTS  category
1   A      ANE
2   C      BOO
3   E      CAT
4   F      DOG
More specifically: for each ID (say 1), I want to find the most common value in the column VAR (here A), and then find the most common value in the column CATEGORY among the rows with that VAR value (here ANE), and so forth.
How can I do it in R?
This is just a small example. My real data frame contains 850,000 rows and 14,000 unique IDs.
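Another dplyr strategy using count and slice. Note that it picks the most frequent (VAR, CATEGORY) pair within each ID, which agrees with the two-step rule on this sample (up to ties, as for ID 2) but is not guaranteed to in general: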
library(dplyr)
DATA %>%
group_by(ID) %>%
count(VAR, CATEGORY) %>%
slice(which.max(n)) %>%
select(-n)
ID VAR CATEGORY
<dbl> <chr> <chr>
1 1 A ANE
2 2 C BOA
3 3 E CAT
4 4 F DOG
dplyr
library(dplyr)
DATA %>%
group_by(ID) %>%
filter(VAR == names(sort(table(VAR), decreasing=TRUE))[1]) %>%
group_by(ID, VAR) %>%
summarize(CATEGORY = names(sort(table(CATEGORY), decreasing=TRUE))[1]) %>%
ungroup()
# # A tibble: 4 x 3
# ID VAR CATEGORY
# <dbl> <chr> <chr>
# 1 1 A ANE
# 2 2 C BOA
# 3 3 E CAT
# 4 4 F DOG
Data
DATA <- structure(list(ID = c(1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 4, 4, 4), VAR = c("A", "A", "A", "A", "B", "C", "C", "D", "E", "E", "F", "A", "B", "F", "C", "F"), CATEGORY = c("ANE", "ANE", "ANA", "ANB", "ANE", "BOO", "BOA", "BOO", "CAT", "CAT", "DOG", "ANE", "ANE", "DOG", "FUT", "DOG")), class = "data.frame", row.names = c(NA, -16L))
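We could modify the Mode function to return a row index and use that in slice after grouping by 'ID'. The index has to be a position in x (not in unique(x)), so the mode is mapped back to its first occurrence:
Modeind <- function(x) {
  ux <- unique(x)
  # most frequent value of x
  mode_val <- ux[which.max(tabulate(match(x, ux)))]
  # position of its first occurrence in x
  match(mode_val, x)
}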
library(dplyr)
DATA %>%
group_by(ID) %>%
slice(Modeind(VAR)) %>%
ungroup
-output
# A tibble: 4 x 3
ID VAR CATEGORY
<dbl> <chr> <chr>
1 1 A ANE
2 2 C BOO
3 3 E CAT
4 4 F DOG
A base R option with nested subset + ave
subset(
subset(
DATA,
!!ave(ave(ID, ID, VAR, FUN = length), ID, FUN = function(x) x == max(x))
),
!!ave(ave(ID, ID, VAR, CATEGORY, FUN = length), ID, VAR, FUN = function(x) seq_along(x) == which.max(x))
)
gives
ID VAR CATEGORY
1 1 A ANE
6 2 C BOO
9 3 E CAT
11 4 F DOG
Explanation
The inner subset + ave keeps only the rows with the most common VAR value (grouped by ID).
Working on the trimmed data frame from the previous step, the outer subset + ave keeps the first row with the most common CATEGORY value (grouped by ID + VAR).
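The two-step logic described above can also be sketched with dplyr's count() and slice_max(). This is a minimal illustration, not a tuned solution for 850,000 rows; the intermediate names top_var and result are made up for the example, and ties (such as BOO vs. BOA for ID 2) are broken arbitrarily by with_ties = FALSE:

```r
library(dplyr)

DATA <- data.frame(
  ID = c(rep(1, 5), rep(2, 3), rep(3, 2), rep(4, 6)),
  VAR = c("A", "A", "A", "A", "B", "C", "C", "D",
          "E", "E", "F", "A", "B", "F", "C", "F"),
  CATEGORY = c("ANE", "ANE", "ANA", "ANB", "ANE", "BOO", "BOA", "BOO",
               "CAT", "CAT", "DOG", "ANE", "ANE", "DOG", "FUT", "DOG"),
  stringsAsFactors = FALSE
)

# Step 1: the most frequent VAR within each ID
top_var <- DATA %>%
  count(ID, VAR) %>%
  group_by(ID) %>%
  slice_max(n, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  select(ID, VAR)

# Step 2: among the rows with that VAR, the most frequent CATEGORY
result <- DATA %>%
  semi_join(top_var, by = c("ID", "VAR")) %>%
  count(ID, VAR, CATEGORY) %>%
  group_by(ID, VAR) %>%
  slice_max(n, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  select(ID, VAR, CATEGORY)

result
```

Because both steps work on counts rather than on the raw rows, the intermediate tables stay small even when the input is large.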
I would like to reorder some columns to come after a particular other column using dplyr::relocate. Here is a MWE:
a <- letters[1:3]
b <- letters[4:6]
c <- letters[7:9]
d <- letters[10:12]
mytib <- tibble::tibble(a,b,c,d)
# A tibble: 3 x 4
# a b c d
# <chr> <chr> <chr> <chr>
# 1 a d g j
# 2 b e h k
# 3 c f i l
mytib %>%
relocate(c, .after = a)
This example works but is there a way that I could, with one relocate command, move c after a and, say, d after b?
I tried the following without success:
mytib %>%
relocate(c(c, d), .after(c(a, b)))
Edit 1: I explicitly ask about relocate because functions like select do not work for large datasets where all I know is after which column (name) I want to insert a column.
Edit 2: This is my expected output:
# A tibble: 3 x 4
# a c b d
# <chr> <chr> <chr> <chr>
# 1 a g d j
# 2 b h e k
# 3 c i f l
As dplyr::relocate itself apparently doesn't allow relocating in pairs, you can "hack" this behavior by preparing a list of column pairs like the ones you describe ("c after a" & "d after b") and reduce over that list, passing your df in as an .init value and in each reduce-step relocating one pair.
Like this:
library(dplyr)
library(purrr)
df_relocated <- reduce(
.x = list(c('c','a'), c('d','b')),
.f = ~ relocate(.x, .y[1], .after = .y[2]),
.init = mytib
)
This produces a tibble just as you expect it:
> df_relocated
# A tibble: 3 x 4
a c b d
<chr> <chr> <chr> <chr>
1 a g d j
2 b h e k
3 c i f l
In case you want to work with two vectors, where element 1 of the second vector should be relocated after element 1 of the first vector and so forth, this would be a solution:
reduce2(
.x = c("a", "b"),
.y = c("c", "d"),
.f = ~ relocate(..1, ..3, .after = ..2),
.init = mytib
)
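If this pattern is needed in several places, the reduce2() call could be wrapped in a small helper. The function name relocate_pairs and its argument names are hypothetical, and all_of() is used so that the column names are always treated as strings:

```r
library(dplyr)
library(purrr)

mytib <- tibble(a = letters[1:3], b = letters[4:6],
                c = letters[7:9], d = letters[10:12])

# Hypothetical helper: move cols[i] directly after anchors[i], one pair at a time
relocate_pairs <- function(data, cols, anchors) {
  stopifnot(length(cols) == length(anchors))
  reduce2(
    cols, anchors,
    function(acc, col, anchor) {
      relocate(acc, all_of(col), .after = all_of(anchor))
    },
    .init = data
  )
}

out <- relocate_pairs(mytib, cols = c("c", "d"), anchors = c("a", "b"))
names(out)
```

Each pair is applied to the result of the previous one, so later pairs see the already reordered columns.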
I've built a web scraping function that takes a variety of arguments. Let's use the sample arguments for demonstration purposes.
Arguments: year, type, gender and col_types.
My function takes the referenced arguments and scrapes the data to return a df.
I am looking to join each "alternate" col_types result to its "standard" counterpart, matching on year, type, gender and name.
Then I want to bind all of the rows to one df.
Sample Data:
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
# Sample DF
a <- tibble(year = 2019, type = "full_year", col_types = "standard", gender = "M", name = c("a","b","c"), variable_1 = 1:3)
b <- tibble(year = 2019, type = "full_year", col_types = "alternate", gender = "M", name = c("a","b","c"), variable_2 = 1:3, variable_3 = 8:10)
c <- tibble(year = 2019, type = "full_year", col_types = "standard", gender = "F", name = c("ab","ba","ca"), variable_1 = 4:6)
d <- tibble(year = 2019, type = "full_year", col_types = "alternate", gender = "F", name = c("ab","ba","ca"), variable_2 = 1:3, variable_3 = 16:18)
e <- tibble(year = 2019, type = "last_month", col_types = "standard", gender = "M", name = c("a","b","c"), variable_1 = 1:3)
f <- tibble(year = 2019, type = "last_month", col_types = "alternate", gender = "M", name = c("a","b","c"), variable_2 = 1:3, variable_3 = 8:10)
g <- tibble(year = 2019, type = "last_month", col_types = "standard", gender = "F", name = c("ab","ba","ca"), variable_1 = 4:6)
h <- tibble(year = 2019, type = "last_month", col_types = "alternate", gender = "F", name = c("ab","ba","ca"), variable_2 = 1:3, variable_3 = 16:18)
# I know this is not going to work as it presents me with NA where I want there to be joins
df <- bind_rows(a, b, c, d, e, f, g, h)
# Adding desired output
m_fy_join <-
a %>%
left_join(b %>% select(-matches("col_types")))
f_fy_join <-
c %>%
left_join(d %>% select(-matches("col_types")))
m_lm_join <-
e %>%
left_join(f %>% select(-matches("col_types")))
f_lm_join <-
g %>%
left_join(h %>% select(-matches("col_types")))
# Desired Output
desired_output <- bind_rows(m_fy_join, f_fy_join, m_lm_join, f_lm_join)
What purrr function can I use to do a left_join, and then bind rows?
I don't think you necessarily need to do a join. You can bind all the tibbles together and use coalesce to get rid of the NAs (which arise due to the fact that the "standard"s don't have variable 2/3 and the "alternate"s don't have variable 1).
I think this may be the easiest given the way your data is currently arranged. But you might consider re-engineering the process (if possible) so that all the "alternate" tibbles are added to one list when created, and all the "standard" tibbles are added to another. Then you could just bind_rows each list and join the two results together, rather than devising a way to manage a bunch of tibbles which are all mixed together.
library(tidyverse)
bind_rows(a, b, c, d, e, f, g, h) %>%
group_by(year, type, gender, name) %>%
summarise_at(vars(contains('variable')), reduce, coalesce)
# # A tibble: 12 x 7
# # Groups: year, type, gender [4]
# year type gender name variable_1 variable_2 variable_3
# <dbl> <chr> <chr> <chr> <int> <int> <int>
# 1 2019 full_year F ab 4 1 16
# 2 2019 full_year F ba 5 2 17
# 3 2019 full_year F ca 6 3 18
# 4 2019 full_year M a 1 1 8
# 5 2019 full_year M b 2 2 9
# 6 2019 full_year M c 3 3 10
# 7 2019 last_month F ab 4 1 16
# 8 2019 last_month F ba 5 2 17
# 9 2019 last_month F ca 6 3 18
# 10 2019 last_month M a 1 1 8
# 11 2019 last_month M b 2 2 9
# 12 2019 last_month M c 3 3 10
Edit: Thanks for showing desired output. I've checked and this output is equivalent, except for ordering and the fact that it doesn't have a col_types column.
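Since summarise_at() is superseded in current dplyr, the same bind-then-collapse idea can be sketched with across(). This is an illustrative variant on a reduced sample (one standard and one alternate tibble, named std and alt here for the example), keeping the first non-missing value per key:

```r
library(dplyr)

std <- tibble(year = 2019, type = "full_year", col_types = "standard",
              gender = "M", name = c("a", "b", "c"), variable_1 = 1:3)
alt <- tibble(year = 2019, type = "full_year", col_types = "alternate",
              gender = "M", name = c("a", "b", "c"),
              variable_2 = 1:3, variable_3 = 8:10)

# Stack the rows, then keep the first non-missing value of each variable per key
combined <- bind_rows(std, alt) %>%
  group_by(year, type, gender, name) %>%
  summarise(across(starts_with("variable"), ~ .x[!is.na(.x)][1]),
            .groups = "drop")

combined
```

As in the answer above, col_types is dropped because it is neither a grouping column nor one of the summarised variables.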
library(dplyr)
library(purrr)
my_join_function <- function(df1, df2) {
x <- get(df1)
y <- get(df2)
left_join(x, select(y, -matches("col_types")))
}
desired_output2 <- map2_df(
.x = c("a", "c", "e", "g"),
.y = c("b", "d", "f", "h"),
.f = my_join_function
)
testthat::expect_error(testthat::expect_identical(desired_output, desired_output2))
Error: testthat::expect_identical(desired_output, desired_output2) did not throw an error.
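A variant of the same idea passes the tibbles themselves in two lists instead of looking names up with get(), and states the join keys explicitly. This is a sketch on a reduced sample; the object names (std_m, alt_m, std_f, alt_f, keys, joined) are made up for the example:

```r
library(dplyr)
library(purrr)

std_m <- tibble(year = 2019, type = "full_year", col_types = "standard",
                gender = "M", name = c("a", "b", "c"), variable_1 = 1:3)
alt_m <- tibble(year = 2019, type = "full_year", col_types = "alternate",
                gender = "M", name = c("a", "b", "c"),
                variable_2 = 1:3, variable_3 = 8:10)
std_f <- tibble(year = 2019, type = "full_year", col_types = "standard",
                gender = "F", name = c("ab", "ba", "ca"), variable_1 = 4:6)
alt_f <- tibble(year = 2019, type = "full_year", col_types = "alternate",
                gender = "F", name = c("ab", "ba", "ca"),
                variable_2 = 1:3, variable_3 = 16:18)

keys <- c("year", "type", "gender", "name")

# Join each standard tibble to its alternate partner, then stack the results
joined <- map2(
  list(std_m, std_f),
  list(alt_m, alt_f),
  ~ left_join(.x, select(.y, -col_types), by = keys)
) %>%
  bind_rows()

joined
```

Keeping the standard and alternate tibbles in two parallel lists from the start avoids depending on objects in the global environment.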
I would like to concatenate an arbitrary number of columns in a dataframe based on a variable cols_to_concat
df <- dplyr::data_frame(a = letters[1:3], b = letters[4:6], c = letters[7:9])
cols_to_concat = c("a", "b", "c")
To achieve the desired result with this specific value of cols_to_concat I could do this:
df %>%
dplyr::mutate(concat = paste0(a, b, c))
But I need to generalise this, using syntax a bit like this
# (DOES NOT WORK)
df %>%
dplyr::mutate(concat = paste0(cols))
I'd like to use the new NSE approach of dplyr 0.7.0, if this is appropriate, but can't figure out the correct syntax.
You can perform this operation using only the tidyverse if you'd like to stick to those packages and principles. You can do it by using either mutate() or unite_(), which comes from the tidyr package.
Using mutate()
library(dplyr)
df <- tibble(a = letters[1:3], b = letters[4:6], c = letters[7:9])
cols_to_concat <- c("a", "b", "c")
df %>% mutate(new_col = do.call(paste0, .[cols_to_concat]))
# A tibble: 3 × 4
a b c new_col
<chr> <chr> <chr> <chr>
1 a d g adg
2 b e h beh
3 c f i cfi
Using unite_()
library(tidyr)
df %>% unite_(col='new_col', cols_to_concat, sep="", remove=FALSE)
# A tibble: 3 × 4
new_col a b c
* <chr> <chr> <chr> <chr>
1 adg a d g
2 beh b e h
3 cfi c f i
EDITED July 2020
As of dplyr 1.0.0, across() and c_across() supersede scoped variants like mutate_if(), mutate_at() and mutate_all(), and the underscore verbs (e.g. unite_()) are deprecated in favour of tidy evaluation. Below is an example using that convention. Not the most concise, but still an option that promises to be more extensible.
Using c_across()
library(dplyr)
df <- tibble(a = letters[1:3], b = letters[4:6], c = letters[7:9])
cols_to_concat <- c("a", "b", "c")
df %>%
rowwise() %>%
mutate(new_col = paste0(c_across(all_of(cols_to_concat)), collapse=""))
#> # A tibble: 3 x 4
#> # Rowwise:
#> a b c new_col
#> <chr> <chr> <chr> <chr>
#> 1 a d g adg
#> 2 b e h beh
#> 3 c f i cfi
Created on 2020-07-08 by the reprex package (v0.3.0)
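With current tidyr, the non-underscore unite() accepts a character vector of column names through all_of(), which avoids both the deprecated unite_() and the row-wise step. A minimal sketch (the object name res is illustrative):

```r
library(dplyr)
library(tidyr)

df <- tibble(a = letters[1:3], b = letters[4:6], c = letters[7:9])
cols_to_concat <- c("a", "b", "c")

# all_of() selects exactly the columns named in the character vector
res <- unite(df, "new_col", all_of(cols_to_concat), sep = "", remove = FALSE)
res
```

As with unite_(), the new column is placed before the first of the united columns, and remove = FALSE keeps the originals.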
You can try syms from rlang:
library(dplyr)
packageVersion('dplyr')
#[1] ‘0.7.0’
df <- dplyr::data_frame(a = letters[1:3], b = letters[4:6], c = letters[7:9])
cols_to_concat = c("a", "b", "c")
library(rlang)
cols_quo <- syms(cols_to_concat)
df %>% mutate(concat = paste0(!!!cols_quo))
# or
df %>% mutate(concat = paste0(!!!syms(cols_to_concat)))
# # A tibble: 3 x 4
# a b c concat
# <chr> <chr> <chr> <chr>
# 1 a d g adg
# 2 b e h beh
# 3 c f i cfi
You can do the following:
library(dplyr)
df <- tibble(a = letters[1:3], b = letters[4:6], c = letters[7:9])
cols_to_concat <- lapply(list("a", "b", "c"), as.name)
# quo() builds a single quosure; !!! splices the symbols into paste0()
q <- quo(paste0(!!!cols_to_concat))
df %>%
  dplyr::mutate(concat = !!q)