I have 30+ Excel workbooks that all have the same structure: Sheet 1 (with the data) and a second sheet called "Metadata". I've already consolidated the data from all the first sheets, and now I'm trying to consolidate the information from the Metadata sheets, which has proven surprisingly challenging.

The Metadata sheet is identical in all workbooks. It contains two columns and about 20 rows. The first column, labeled "Dataset View Id", is the same everywhere and describes the information provided in column 2 (e.g. name, source, time); column 2 has the corresponding values. I want to perform essentially a VLOOKUP and create a summary table where column 1 is the shared set of labels (name, source, time, ...) and each subsequent column lists the column-2 values from one Metadata worksheet.

I tried various combinations of merge and reduce in both base R and the tidyverse, but it either takes forever to calculate and eventually errors out saying R has reached its capacity, or R crashes after a few minutes. Specifically, I tried the following:
dataframe_list <- list(a, b, c, etc.)
Reduce(function(x, y) merge(x, y, all=TRUE), dataframe_list) #base R
or
dataframe_list %>% reduce(full_join, by='Dataset View Id') #tidyverse
Any advice much appreciated.
One approach is to bind all the rows together, adding an identifier for each Excel workbook, and then reshape to wide format.

Suppose you have data like this:
dataframe_list <- list()
for (i in 1:5) {
  dataframe_list[[i]] <- data.frame(`Dataset View Id` = c("value1", "value2"),
                                    value = runif(2), check.names = FALSE)
}
dataframe_list
#> [[1]]
#> Dataset View Id value
#> 1 value1 0.8541774
#> 2 value2 0.9110413
#>
#> [[2]]
#> Dataset View Id value
#> 1 value1 0.6893812
#> 2 value2 0.2531087
#>
#> [[3]]
#> Dataset View Id value
#> 1 value1 0.07905447
#> 2 value2 0.68175406
#>
#> [[4]]
#> Dataset View Id value
#> 1 value1 0.27886536
#> 2 value2 0.02120348
#>
#> [[5]]
#> Dataset View Id value
#> 1 value1 0.3485775
#> 2 value2 0.5457035
You can add an id column to every data frame; in the example below it is called Excel_id:
library(tidyverse)
dataframe_list <- purrr::imap(dataframe_list, ~ mutate(.x, Excel_id=.y))
Then bind them all together with bind_rows() and finally pivot_wider() to get the desired result:
result <- bind_rows(dataframe_list) %>% pivot_wider(id_cols = `Dataset View Id`, names_from = Excel_id)
result
#> # A tibble: 2 × 6
#> `Dataset View Id` `1` `2` `3` `4` `5`
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 value1 0.175 0.364 0.936 0.149 0.733
#> 2 value2 0.258 0.406 0.747 0.510 0.587
Created on 2022-10-18 by the reprex package (v2.0.1)
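In the real case, dataframe_list can be built straight from the files. A minimal sketch, assuming the workbooks sit in one folder and readxl is installed (the path and pattern are placeholders -- adjust them to your files):

library(readxl)
library(purrr)

# hypothetical folder of workbooks
files <- list.files("path/to/workbooks", pattern = "\\.xlsx?$", full.names = TRUE)

# read the "Metadata" sheet of every workbook into a named list
dataframe_list <- map(files, ~ read_excel(.x, sheet = "Metadata"))
names(dataframe_list) <- basename(files)

Since the list is named, the imap() step above will use the file names as .y, so the columns of the final summary table are labelled by workbook rather than by index.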
I have the following data
county <- c("a", "a", "a", "b", "b", "c")
id <- c(1, 2, 3, 4, 5, 6)
data <- data.frame(county, id)
I need to convert from long to wide and get the following output:
county <- c("a", "b", "c")
id__0 <- c(1, 4, 6)
id__1 <- c(2, 5, NA)
id__2 <- c(3, NA, NA)
data2 <- data.frame(county, id__0, id__1, id__2)
My main problem is not in converting from long to wide, but how to make the columns start with id__0.
You could add an intermediate variable by grouping according to county and using mutate to build a sequence from 0 upwards for each county, then pivot_wider on that:
library(tidyr)
library(dplyr)
data %>%
  group_by(county) %>%
  mutate(id_count = seq(n()) - 1) %>%
  pivot_wider(id_cols = county, names_from = id_count,
              values_from = id, names_prefix = "id__")
#> # A tibble: 3 x 4
#> # Groups:   county [3]
#>   county id__0 id__1 id__2
#>   <chr>  <dbl> <dbl> <dbl>
#> 1 a          1     2     3
#> 2 b          4     5    NA
#> 3 c          6    NA    NA
Created on 2022-02-10 by the reprex package (v2.0.1)
I have a tibble with a number of variables collected over time. A very simplified version of the tibble looks like this.
df = tribble(
~id, ~varA.t1, ~varA.t2, ~varB.t1, ~varB.t2,
'row_1', 5, 10, 2, 4,
'row_2', 20, 50, 4, 6
)
I want to systematically create a new set of variables varC so that varC.t# = varA.t# / varB.t# where # is 1, 2, 3, etc. (similarly to the way column names are setup in the tibble above).
How do I use something along the lines of mutate or across to do this?
You can do something like this with mutate(across(...)); however, there ought to be a shortcut for renaming the columns.
df %>%
  mutate(across(.cols = c(varA.t1, varA.t2),
                .fns = ~ .x / get(str_replace(cur_column(), "varA", "varB")),
                .names = "V_{.col}")) %>%
  rename_with(~ str_replace(., "V_varA", "varC"), starts_with("V_"))
# A tibble: 2 x 7
id varA.t1 varA.t2 varB.t1 varB.t2 varC.t1 varC.t2
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 row_1 5 10 2 4 2.5 2.5
2 row_2 20 50 4 6 5 8.33
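One possible shortcut, assuming dplyr 1.0+ where .names is a glue specification that can evaluate expressions, is to build the final names directly and skip the rename_with() step (a sketch, not verified against every dplyr version):

library(dplyr)
library(stringr)

df %>%
  mutate(across(.cols = starts_with("varA"),
                .fns = ~ .x / get(str_replace(cur_column(), "varA", "varB")),
                # build varC.t1, varC.t2, ... directly in .names
                .names = '{str_replace(.col, "varA", "varC")}'))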
If there is a long time series, you can also create a vector for .cols beforehand.
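A sketch along those lines (a_cols is a hypothetical helper; it assumes all numerator columns share the varA prefix):

# build the selection vector up front
a_cols <- grep("^varA", names(df), value = TRUE)

df %>%
  mutate(across(.cols = all_of(a_cols),
                .fns = ~ .x / get(str_replace(cur_column(), "varA", "varB")),
                .names = "V_{.col}")) %>%
  rename_with(~ str_replace(., "V_varA", "varC"), starts_with("V_"))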
I have a package on GitHub called {dplyover} which aims to solve this kind of problem in a way similar to dplyr::across.

The function is called across2. It lets you define two sets of columns to which you can apply one or several functions. The .names argument supports two glue specifications, {pre} and {suf}, which extract the shared prefix and suffix of the variable names. This makes it easy to give nice names to the output variables.

The function has one caveat: it is not performant when applied to highly grouped data (there is a vignette with benchmarks).
library(dplyr)
library(dplyover) # https://github.com/TimTeaFan/dplyover
df = tribble(
~id, ~varA.t1, ~varA.t2, ~varB.t1, ~varB.t2,
'row_1', 5, 10, 2, 4,
'row_2', 20, 50, 4, 6
)
df %>%
mutate(across2(starts_with("varA"),
starts_with("varB"),
~ .x / .y,
.names = "{pre}C.{suf}"))
#> # A tibble: 2 x 7
#> id varA.t1 varA.t2 varB.t1 varB.t2 varC.t1 varC.t2
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 row_1 5 10 2 4 2.5 2.5
#> 2 row_2 20 50 4 6 5 8.33
Created on 2021-04-10 by the reprex package (v0.3.0)
For such cases I find using base R easy and efficient.
# grab the varA / varB columns; sort() keeps the two sets aligned by time suffix
varAcols <- sort(grep('varA', names(df), value = TRUE))
varBcols <- sort(grep('varB', names(df), value = TRUE))
# element-wise division, assigned to new varC columns
df[sub('A', 'C', varAcols)] <- df[varAcols]/df[varBcols]
# id varA.t1 varA.t2 varB.t1 varB.t2 varC.t1 varC.t2
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 row_1 5 10 2 4 2.5 2.5
#2 row_2 20 50 4 6 5 8.33
Another, more customizable way to do this:
Initial setup
library(dplyr)
library(purrr)
library(stringr)
df = tribble(
~id, ~varA.t1, ~varA.t2, ~varB.t1, ~varB.t2,
'row_1', 5, 10, 2, 4,
'row_2', 20, 50, 4, 6
)
# A function that takes a formula, parses it, and fixes the new column's name
operation_function <- function(df, formula) {
  # Extract the new column name from the left-hand side of the formula
  new_column_name <- str_extract(formula, "^.+=")
  new_column_name <- trimws(gsub("=", "", new_column_name))
  # Process the df
  df %>%
    # parse the formula - this results in the new column being named after the value formula
    mutate(!!rlang::parse_expr(formula)) %>%
    # rename the newly created column to the correct name
    rename(!!new_column_name := last_col())
}
Note: I think there should be a more efficient way to implement a formula that ends up with a proper name, though I couldn't figure it out right now. Ideas from others are welcome.

Prepare the formulas to be processed with the data. In this case it is simple; for more complicated formulas you may want to do it a little differently.
# Prepare the formula
base_formula <- c("varC.t# = varA.t# / varB.t#")
replacement_list <- c(1, 2)
list_formula <- map(replacement_list, .f = gsub,
pattern = "#", x = base_formula)
list_formula
#> [[1]]
#> [1] "varC.t1 = varA.t1 / varB.t1"
#>
#> [[2]]
#> [1] "varC.t2 = varA.t2 / varB.t2"
Finally, process the data with the list of formulas:
# process with the function and then reduce them with left_join
reduce(map(.x = list_formula, .f = operation_function, df = df),
left_join)
#> Joining, by = c("id", "varA.t1", "varA.t2", "varB.t1", "varB.t2")
#> # A tibble: 2 x 7
#> id varA.t1 varA.t2 varB.t1 varB.t2 varC.t1 varC.t2
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 row_1 5 10 2 4 2.5 2.5
#> 2 row_2 20 50 4 6 5 8.33
Created on 2021-04-10 by the reprex package (v1.0.0)
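To silence the Joining, by = ... message, the join columns (the ones listed in the message above) can be passed explicitly through reduce(), which forwards extra arguments to left_join():

reduce(map(.x = list_formula, .f = operation_function, df = df),
       left_join,
       by = c("id", "varA.t1", "varA.t2", "varB.t1", "varB.t2"))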
packageVersion("dplyr")
#[1] ‘0.8.99.9002’
Please note that this question uses dplyr's new across() function. To install the latest dev version of dplyr, run remotes::install_github("tidyverse/dplyr"); to restore the released version, run install.packages("dplyr"). If you are reading this at some point in the future and are already on dplyr 1.x+, you won't need to worry about this note.
library(tidyverse)
df <- tibble(Date = c(rep(as.Date("2020-01-01"), 3),
rep(as.Date("2020-02-01"), 2)),
Type = c("A", "A", "B", "C", "C"),
col1 = 1:5,
col2 = c(0, 8, 0, 3, 0),
col3 = c(25:29),
colX = rep(99, 5))
#> # A tibble: 5 x 6
#> Date Type col1 col2 col3 colX
#> <date> <chr> <int> <dbl> <int> <dbl>
#> 1 2020-01-01 A 1 0 25 99
#> 2 2020-01-01 A 2 8 26 99
#> 3 2020-01-01 B 3 0 27 99
#> 4 2020-02-01 C 4 3 28 99
#> 5 2020-02-01 C 5 0 29 99
I'd like to sum columns 1 through X above row-wise, grouped by "Date" and "Type". I will always start at the third column (ie col1), but will never know the numerical value of X in colX. That's OK because I can use the length of the data frame to determine how far I need to go 'out' to capture all columns until the end of the data frame. Here's my approach:
df %>%
group_by(Date, Type) %>%
summarize(across(3:length(.)), sum())
#> Error: Problem with `summarise()` input `..1`.
#> x Can't subset columns that don't exist.
#> x Locations 5 and 6 don't exist.
#> i There are only 4 columns.
#> i Input `..1` is `across(3:length(.))`.
#> i The error occured in group 1: Date = 2020-01-01, Type = "A".
#> Run `rlang::last_error()` to see where the error occurred.
But it seems my usage of the base R length(.) function is improper. Am I using dplyr's new across() function in the right manner? How can I get the length of the data frame in the portion of the pipe where I need it? I'll never know how many columns there are to the end, nor are the actual names nearly as clean as my example data frame.
packageVersion("dplyr")
#[1] ‘0.8.99.9002’
First, you just have a small problem with your syntax: the column selection and the function both go inside the across() call.
df %>% summarize(across(3:length(.),sum))
## A tibble: 1 x 4
# col1 col2 col3 colX
# <int> <dbl> <int> <dbl>
#1 15 11 135 495
The following code does not work because you cannot select columns that are currently being grouped on; inside across(), only the four non-grouping columns are available, so positions 5 and 6 don't exist.
df %>%
group_by(Date, Type) %>%
summarize(across(3:length(.), sum))
#Error: Problem with `summarise()` input `..1`.
#x Can't subset columns that don't exist.
#x Locations 5 and 6 don't exist.
#ℹ There are only 4 columns.
This is obvious when you try the following:
df %>%
group_by(Date, Type) %>%
summarize(across(everything(), sum))
## A tibble: 3 x 6
## Groups: Date [2]
# Date Type col1 col2 col3 colX
# <date> <chr> <int> <dbl> <int> <dbl>
#1 2020-01-01 A 3 8 51 198
#2 2020-01-01 B 3 0 27 99
#3 2020-02-01 C 9 3 57 198
Other options include the starts_with tidy-select verb.
df %>%
group_by(Date, Type) %>%
summarize(across(starts_with("col"), sum))
## A tibble: 3 x 6
## Groups: Date [2]
# Date Type col1 col2 col3 colX
# <date> <chr> <int> <dbl> <int> <dbl>
#1 2020-01-01 A 3 8 51 198
#2 2020-01-01 B 3 0 27 99
#3 2020-02-01 C 9 3 57 198
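If counting columns is the only obstacle, another option in dplyr 1.0+ is the where() selector, which sidesteps positions and names entirely (this assumes every non-grouping column is numeric):

df %>%
  group_by(Date, Type) %>%
  summarize(across(where(is.numeric), sum))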
The row-wise and column-wise vignettes are pretty good. The row-wise one actually discusses how group_by columns are subset.
I have a rather large tibble (called df.tbl with ~ 26k rows and 22 columns) and I want to find the "twins" of each object, i.e. each row that has the same values in column 2:7 (date:Pos).
If I use:
inner_join(df.tbl, df.tbl[i, ], by = c("date", "forge", "serNum", "PinMain", "PinMainNumber", "Pos"))
with i being the row I want to check for "twins", everything is working as expected, spitting out a 2 x 22 tibble, and I can expand this using:
x <- NULL
for (i in 1:nrow(df.tbl)) {
  x[[i]] <- as_vector(inner_join(df.tbl,
                                 df.tbl[i, ],
                                 by = c("date",
                                        "forge",
                                        "serNum",
                                        "PinMain",
                                        "PinMainNumber",
                                        "Pos")) %>%
                        select(rowNum.x))
}
to create a list containing the row numbers for each twin for each object (row).
However I try, I cannot use map to produce a similar result:
twins <- map(df.tbl, ~ inner_join(df.tbl,
.,
by = c("date",
"forge",
"serNum",
"PinMain",
"PinMainNumber",
"Pos")) %>%
select(rowNum.x) )
All I get is the following error:
Error in UseMethod("tbl_vars") : no applicable method for 'tbl_vars' applied to an object of class "c('double', 'numeric')"
How would I go about to convert the for loop into an equivalent using map?
My original data look like this:
>head(df.tbl, 3)
# A tibble: 3 x 22
rowNum date forge serNum PinMain PinMainNumber Pos FrontBack flow Sharped SV OP max min mean
<dbl> <date> <chr> <fct> <fct> <fct> <fct> <fct> <chr> <fct> <fct> <chr> <dbl> <dbl> <dbl>
1 1 2017-10-18 NA 179 Pin 1 W F NA 3 36237 235 77.7 55.3 64.7
2 2 2017-10-18 NA 179 Pin 2 W F NA 3 36237 235 77.5 52.1 67.4
3 3 2017-10-18 NA 179 Pin 3 W F NA 3 36237 235 79.5 58.6 69.0
# ... with 7 more variables: median <dbl>, sd <dbl>, Round2 <dbl>, Round4 <dbl>, OrigData <list>, dataSize <int>,
# fileName <chr>
and I would like a list of the same length as nrow(df.tbl), looking like this:
> twins
[[1]]
[1] 1 7
[[2]]
[1] 2 8
[[3]]
[1] 3 9
Almost all objects have one twin/duplicate (as above), but a few have two or even three duplicates (as defined above, i.e. columns 2:7 are the same).
A bit late to the party, but you can do it much more neatly with nest().
tbl.df1 <- tbl.df %>% group_by(date, forge, serNum, PinMain, PinMainNumber, Pos) %>% nest(rowNum)
The twins will be in the list of tibbles created by nest.
df.tbl1$data
# [[1]]
# A tibble: 2 x 1
# rowNum
# <dbl>
# 1 1
# 2 7
#[[2]]
# A tibble: 2 x 1
# rowNum
# <dbl>
# 1 2
# 2 8
# etc
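To get plain row-number vectors like the desired output in the question, you can then extract rowNum from each nested tibble:

twins <- purrr::map(df.tbl1$data, "rowNum")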
Do you really need to solve it with map?

I would solve it by combining duplicated() and semi_join() from dplyr, like this:
defining_columns <- c("date", "forge", "serNum", "PinMain", "PinMainNumber", "Pos")
dplyr::semi_join(
df.tbl,
df.tbl[duplicated(df.tbl[defining_columns]),],
by = defining_columns
) %>%
group_by_at(defining_columns) %>%
arrange(.by_group = TRUE) %>%
summarise(twins = paste0(rowNum,collapse = ",")) %>%
pull(twins) %>%
strsplit(",")
duplicated() flags the second and later occurrences of each key combination, and semi_join() keeps every row of df.tbl whose key appears among those duplicates, which also brings each first occurrence back in.
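Note that strsplit() leaves you with character vectors; assuming the pipeline's result is assigned to twins, a final conversion matches the numeric output shown in the question:

twins <- lapply(twins, as.numeric)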
Hope this helps!!
Edit: I'm editing this post a little to provide more context, in case the whole approach was wrong from the start. See "Context" below, where I try to explain the problem more abstractly.
I have seen the thread where the matching of NAs in tibbles is discussed, and the options are to match them to other NAs, or not to match them to anything: dplyr left_join matching NA
However, I am really looking for the opposite behaviour. Is there a way of having NAs (or whichever missing value for that case) matched to any other value during a join operation? An example below:
library(tidyverse)
# Removed output for brevity
tbl1 <- tibble(subj = 1, run = 1, session=1)
tbl2 <- tibble(subj = c(1, NA, 2), run = c(NA, 1, 2), session=c(NA, NA, 1), outcomedata = c(NA, NA, NA) )
tbl2$outcomedata[2][[1]] <- list(temperature=30)
tbl2$outcomedata[1][[1]] <- list(height=155, weight=80)
tbl2$outcomedata[3][[1]] <- list(temperature=20)
tbl1
#> # A tibble: 1 x 3
#> subj run session
#> <dbl> <dbl> <dbl>
#> 1 1.00 1.00 1.00
tbl2
#> # A tibble: 3 x 4
#> subj run session outcomedata
#> <dbl> <dbl> <dbl> <list>
#> 1 1.00 NA NA <list [2]>
#> 2 NA 1.00 NA <list [1]>
#> 3 2.00 2.00 1.00 <list [1]>
left_join(tbl1, tbl2)
#> Joining, by = c("subj", "run", "session")
#> # A tibble: 1 x 4
#> subj run session outcomedata
#> <dbl> <dbl> <dbl> <list>
#> 1 1.00 1.00 1.00 <NULL>
My desired end result is that I can match the first and the second row of tbl2 to the single row of tbl1, since these rows match on all non-NA attributes. The third row should not match to anything, since it differs on non-NA values. Thus, I am trying to get the final output to be as follows:
#> # A tibble: 2 x 4
#> subj run session outcomedata
#> <dbl> <dbl> <dbl> <list>
#> 1 1.00 1.00 1.00 <list [2]>
#> 2 1.00 1.00 1.00 <list [1]>
Context
Let me provide context in case I am way out here and barking up the wrong tree with the joins and there's an easier alternative. I have a bunch of nested json files (which I instantiate in R as lists), which contain various information that I want to attribute to specific instances in the data. One json might contain information which pertains to all instances in the data for subject 1 (i.e. the first row of tbl2), while another pertains to all instances in the data for run 1 (i.e. the second row of tbl2).
I would like to be able to merge all relevant information for each constellation of parameters in the data (one of which is in tbl1, but the plan is to have them all) in separate lists. My plan has been to try to get everything to match to everything related, and then to use a group_by operation over all parameters (i.e. group_by(subj, run, session)) and merge the lists (my plan was to use rlist::list.merge).
Any help would be massively appreciated!
Here's a tidyverse solution:
tbl2 %>%
split(seq(nrow(.))) %>% # split into one row data frames
map_dfr(~modify_if(.,is.na,~NULL) %>% # remove na columns
inner_join(tbl1,.)) # inner join to table1
# # A tibble: 2 x 4
# subj run session outcomedata
# <dbl> <dbl> <dbl> <list>
# 1 1 1 1 <list [2]>
# 2 1 1 1 <list [1]>
I use inner_join(tbl1, .) instead of piping into inner_join(tbl1) to preserve the column order.
And a base R translation:
df_list <- split(tbl2,seq(nrow(tbl2)))
df_list <- lapply(df_list,function(dfi){
merge(tbl1, dfi[!sapply(dfi,is.na)])
})
do.call(rbind,df_list)
# subj run session outcomedata
# 1 1 1 1 155, 80
# 2 1 1 1 30
Bonus

Here are two 100% tidyverse approaches using group_by instead of split: one with do, one with nest and map. Note that do is being soft-deprecated, but here it gives the more compact and readable syntax:
tbl2 %>%
group_by(n=seq(n())) %>%
do(modify_if(.,is.na,~NULL) %>% # remove na columns
inner_join(tbl1,.)) %>%
ungroup %>%
select(-n)
tbl2 %>%
rowid_to_column("n") %>%
group_by(n) %>%
nest(.key="dfi") %>%
mutate_at("dfi",~map(.,
~ modify_if(.,is.na,~NULL) %>% # remove na columns
inner_join(tbl1,.))) %>%
unnest %>%
select(-n)