Edit: I'm updating this post a little to provide more context, in case the whole approach was wrong from the start. See "Context" below, where I try to explain the problem more abstractly.
I have seen the thread where the matching of NAs in tibbles is discussed, and the options are to match them to other NAs or not to match them to anything: dplyr left_join matching NA
However, I am looking for the opposite behaviour. Is there a way to have NAs (or whatever missing value is used) match any other value during a join operation? An example below:
library(tidyverse)
# Removed output for brevity
tbl1 <- tibble(subj = 1, run = 1, session=1)
tbl2 <- tibble(subj = c(1, NA, 2), run = c(NA, 1, 2), session=c(NA, NA, 1), outcomedata = c(NA, NA, NA) )
tbl2$outcomedata[2][[1]] <- list(temperature=30)
tbl2$outcomedata[1][[1]] <- list(height=155, weight=80)
tbl2$outcomedata[3][[1]] <- list(temperature=20)
tbl1
#> # A tibble: 1 x 3
#> subj run session
#> <dbl> <dbl> <dbl>
#> 1 1.00 1.00 1.00
tbl2
#> # A tibble: 3 x 4
#> subj run session outcomedata
#> <dbl> <dbl> <dbl> <list>
#> 1 1.00 NA NA <list [2]>
#> 2 NA 1.00 NA <list [1]>
#> 3 2.00 2.00 1.00 <list [1]>
left_join(tbl1, tbl2)
#> Joining, by = c("subj", "run", "session")
#> # A tibble: 1 x 4
#> subj run session outcomedata
#> <dbl> <dbl> <dbl> <list>
#> 1 1.00 1.00 1.00 <NULL>
My desired end result is that I can match the first and the second row of tbl2 to the single row of tbl1, since these rows match on all non-NA attributes. The third row should not match to anything, since it differs on non-NA values. Thus, I am trying to get the final output to be as follows:
#> # A tibble: 2 x 4
#> subj run session outcomedata
#> <dbl> <dbl> <dbl> <list>
#> 1 1.00 1.00 1.00 <list [2]>
#> 2 1.00 1.00 1.00 <list [1]>
Context
Let me provide context in case I am off base here and barking up the wrong tree with the joins, and there's an easier alternative. I have a bunch of nested JSON files (which I instantiate in R as lists), which contain various information that I want to attribute to specific instances in the data. One JSON might contain information which pertains to all instances in the data for subject 1 (i.e. the first row of tbl2), while another pertains to all instances in the data for run 1 (i.e. the second row of tbl2).
I would like to be able to merge all relevant information for each constellation of parameters in the data (one of which is in tbl1, but the plan is to have them all) into separate lists. My plan has been to get everything to match to everything related, and then to use a group_by operation over all parameters (i.e. group_by(subj, run, session)) and merge the lists with rlist::list.merge, as sketched below.
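For concreteness, a minimal sketch of that merging step, where joined stands for the hypothetical result of the join (one row per matched JSON):
library(dplyr)
library(purrr)

joined %>%  # hypothetical: the joined tibble described above
  group_by(subj, run, session) %>%
  summarise(outcomedata = list(reduce(outcomedata, rlist::list.merge)))  # one merged list per combination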
Any help would be massively appreciated!
Here's a tidyverse solution:
tbl2 %>%
  split(seq(nrow(.))) %>%                    # split into one-row data frames
  map_dfr(~ modify_if(., is.na, ~ NULL) %>%  # remove NA columns
            inner_join(tbl1, .))             # inner join each piece to tbl1
# # A tibble: 2 x 4
# subj run session outcomedata
# <dbl> <dbl> <dbl> <list>
# 1 1 1 1 <list [2]>
# 2 1 1 1 <list [1]>
I use inner_join(tbl1, .) instead of piping straight into inner_join(tbl1) so that tbl1's columns come first, preserving the column order.
And a base R translation:
df_list <- split(tbl2, seq(nrow(tbl2)))
df_list <- lapply(df_list, function(dfi) {
  merge(tbl1, dfi[!sapply(dfi, is.na)])  # drop NA columns, then merge
})
do.call(rbind, df_list)
# subj run session outcomedata
# 1 1 1 1 155, 80
# 2 1 1 1 30
Bonus
Two 100% tidyverse approaches using group_by instead of split: one with do, one with nest and map. FYI, do is being soft-deprecated, but here it offers more compact and readable syntax:
tbl2 %>%
  group_by(n = seq(n())) %>%
  do(modify_if(., is.na, ~ NULL) %>%  # remove NA columns
       inner_join(tbl1, .)) %>%
  ungroup %>%
  select(-n)
tbl2 %>%
  rowid_to_column("n") %>%
  group_by(n) %>%
  nest(.key = "dfi") %>%
  mutate_at("dfi", ~ map(.,
    ~ modify_if(., is.na, ~ NULL) %>%  # remove NA columns
      inner_join(tbl1, .))) %>%
  unnest %>%
  select(-n)
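Since do is being phased out, here is a sketch of the same idea with its replacement group_modify(), assuming dplyr >= 1.0.0 (where group_modify() and the where() selection helper are available):
library(dplyr)

tbl2 %>%
  group_by(n = row_number()) %>%
  group_modify(~ inner_join(tbl1, select(.x, where(~ all(!is.na(.)))))) %>%  # keep non-NA columns, then join
  ungroup() %>%
  select(-n)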
Related
I have 30+ Excel workbooks that all have the same structure: Sheet 1 (with the data) and a second sheet called "Metadata". I've already consolidated all of my data from all the first sheets, and now I'm trying to consolidate the information from Metadata, which has been proving very challenging for some reason.
The Metadata sheet is identical in all workbooks. It contains two columns and about 20 rows. The first column, labeled "Dataset View Id", is identical across workbooks, as it describes the information that's then provided in column 2 (e.g. in column 1 I have name, source, time, etc.). Column 2 has the corresponding values.
I want to perform essentially a vlookup and create a summary table where each column lists the information from one Metadata worksheet (i.e. the values from column 2), and column 1 is the shared key (i.e. name, source, time). I tried various alternatives of merge and reduce, in base R and the tidyverse, but it either takes forever to compute and eventually returns an error that R has reached its capacity, or R crashes after a few minutes. Specifically, I tried the following:
dataframe_list <- list(a, b, c, etc.)
Reduce(function(x, y) merge(x, y, all=TRUE), dataframe_list) #base R
or
dataframe_list %>% reduce(full_join, by='Dataset View Id') #tidyverse
Any advice much appreciated.
One approach is to bind all the rows with an identifier for each Excel sheet, then reshape the data to wide format.
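First, the Metadata sheets need to be read into a list; a minimal sketch assuming readxl, with a hypothetical folder name:
library(readxl)
library(purrr)

# hypothetical folder holding the 30+ workbooks
paths <- list.files("workbooks", pattern = "\\.xlsx?$", full.names = TRUE)
dataframe_list <- map(paths, ~ read_excel(.x, sheet = "Metadata"))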
For illustration, suppose you have data like this:
dataframe_list <- list()
for (i in 1:5) {
  dataframe_list[[i]] <- data.frame(`Dataset View Id` = c("value1", "value2"),
                                    value = runif(2), check.names = FALSE)
}
dataframe_list
#> [[1]]
#> Dataset View Id value
#> 1 value1 0.8541774
#> 2 value2 0.9110413
#>
#> [[2]]
#> Dataset View Id value
#> 1 value1 0.6893812
#> 2 value2 0.2531087
#>
#> [[3]]
#> Dataset View Id value
#> 1 value1 0.07905447
#> 2 value2 0.68175406
#>
#> [[4]]
#> Dataset View Id value
#> 1 value1 0.27886536
#> 2 value2 0.02120348
#>
#> [[5]]
#> Dataset View Id value
#> 1 value1 0.3485775
#> 2 value2 0.5457035
You can add an id column to every data frame; in this example it is called Excel_id:
library(tidyverse)
dataframe_list <- purrr::imap(dataframe_list, ~ mutate(.x, Excel_id=.y))
Then bind them all together with bind_rows and finally use pivot_wider to get the desired result:
result <- bind_rows(dataframe_list) %>% pivot_wider(id_cols = `Dataset View Id`, names_from = Excel_id)
result
#> # A tibble: 2 × 6
#>   `Dataset View Id`   `1`   `2`    `3`    `4`   `5`
#>   <chr>             <dbl> <dbl>  <dbl>  <dbl> <dbl>
#> 1 value1            0.854 0.689 0.0791 0.279  0.349
#> 2 value2            0.911 0.253 0.682  0.0212 0.546
Created on 2022-10-18 by the reprex package (v2.0.1)
I want to convert two vectors into a wide-format data frame. The first vector represents the column names and the second vector the values.
Here is my reproducible example:
vector1<-c("Reply","Reshare","Like","Share","Search")
vector2<-c(2,1,0,4,3)
Now I want to convert these two vectors into a wide-format data frame:
# A tibble: 1 x 5
Reply Reshare Like Share Search
<dbl> <dbl> <dbl> <dbl> <dbl>
1 2 1 0 4 3
I have found some examples for the long format, but no simple solution for the wide format. Can anyone help me?
You can make a named list (e.g. using setNames), followed by as.data.frame:
df <- as.data.frame(setNames(as.list(vector2), vector1))
Note that it needs to be a list: when converting a named vector into a data.frame, R puts values into separate rows instead of columns.
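A quick demonstration of the difference, reusing the vectors from the question:
vector1 <- c("Reply", "Reshare", "Like", "Share", "Search")
vector2 <- c(2, 1, 0, 4, 3)

# named vector: values land in separate rows (a 5 x 1 data frame)
as.data.frame(setNames(vector2, vector1))

# named list: values land in separate columns (a 1 x 5 data frame)
as.data.frame(setNames(as.list(vector2), vector1))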
vector1<-c("Reply","Reshare","Like","Share","Search")
vector2<-c(2,1,0,4,3)
df <- data.frame(vector1, vector2)
df |> tidyr::pivot_wider(names_from = vector1, values_from = vector2)
#> # A tibble: 1 × 5
#> Reply Reshare Like Share Search
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 2 1 0 4 3
Created on 2022-02-08 by the reprex package (v2.0.1)
Yet another solution, based on dplyr::bind_rows:
library(dplyr)
vector1<-c("Reply","Reshare","Like","Share","Search")
vector2<-c(2,1,0,4,3)
names(vector2) <- vector1
bind_rows(vector2)
#> # A tibble: 1 × 5
#> Reply Reshare Like Share Search
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 2 1 0 4 3
We can use map_dfc and set_names:
library(purrr)
set_names(map_dfc(vector2, ~.x), vector1)
# A tibble: 1 × 5
Reply Reshare Like Share Search
<dbl> <dbl> <dbl> <dbl> <dbl>
1 2 1 0 4 3
Another possible solution (note that rbind() coerces everything to character here, so the values may need type.convert() afterwards):
library(dplyr)
data.frame(rbind(vector1, vector2)) %>%
`colnames<-`(.[1, ]) %>%
.[-1, ] %>%
`rownames<-`(NULL)
Reply Reshare Like Share Search
1 2 1 0 4 3
Is there a way to create a mock dataset of an actual survey?
I will be analyzing and reporting results of a survey that is currently being conducted.
I don't want to wait until the survey is closed to start analyzing the data. While the data is being collected, I would like to start working on the script for analysis and visualizations, and for that I need data that are similar to the data being collected.
Is it possible to create a mock dataset with the actual structure and variables of the survey? The survey has different types of questions, including arrays, single choice, multiple choice, and open-ended questions. I know I could wait until I have 10 or 20 responses and start working with those. Could I create a random dataset based on those responses?
I'd be grateful if someone could give some ideas about this. Cheers.
It depends on the format (e.g. column names etc.) of your expected data. Here, you can simulate binary vectors:
library(tidyverse)
set.seed(1337)
select_m_out_of_n <- function(n, m) {
  if (m > n) stop("Cannot choose more items than the number of options")
  res <- rep(FALSE, n)
  selected <- sample(n, m)
  res[selected] <- TRUE
  res
}
simulated_survey <- tibble(person_id = seq(3)) %>%
  mutate(
    # single choice: one out of 5 options
    q1 = person_id %>% map(~ select_m_out_of_n(5, 1)),
    # binary choice (this overwrites the q1 above, matching the output below)
    q1 = person_id %>% map(~ select_m_out_of_n(1, 1)),
    # multiple choice: two out of 5 and two out of 4 options
    q2 = person_id %>% map(~ select_m_out_of_n(5, 2)),
    q3 = person_id %>% map(~ select_m_out_of_n(4, 2))
  )
simulated_survey
#> # A tibble: 3 x 4
#> person_id q1 q2 q3
#> <int> <list> <list> <list>
#> 1 1 <lgl [1]> <lgl [5]> <lgl [4]>
#> 2 2 <lgl [1]> <lgl [5]> <lgl [4]>
#> 3 3 <lgl [1]> <lgl [5]> <lgl [4]>
# answers of first person
simulated_survey %>%
slice(1) %>%
as.list()
#> $person_id
#> [1] 1
#>
#> $q1
#> $q1[[1]]
#> [1] TRUE
#>
#>
#> $q2
#> $q2[[1]]
#> [1] FALSE FALSE TRUE TRUE FALSE
#>
#>
#> $q3
#> $q3[[1]]
#> [1] FALSE TRUE TRUE FALSE
Created on 2021-11-11 by the reprex package (v2.0.1)
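The same idea extends to the other question types mentioned in the question; a minimal sketch (the question names and category labels are invented for illustration):
library(tibble)
set.seed(1337)

n <- 10
mock_survey <- tibble(
  person_id = seq_len(n),
  # single choice: draw one option per respondent
  q_channel = sample(c("email", "phone", "web"), n, replace = TRUE),
  # array / Likert-style item: integer scale from 1 to 5
  q_satisfaction = sample(1:5, n, replace = TRUE),
  # open-ended: placeholder strings until real responses arrive
  q_comment = replicate(n, paste(sample(letters, 8), collapse = ""))
)
mock_survey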
I have two tibbles, ranges and sites. The first contains a set of coordinates (region, start, end, plus other character variables) and the other contains sites (region, site). I need to get all sites in the second tibble that fall within a given range (row) in the first tibble. Complicating matters, the ranges in the first tibble overlap.
# Range tibble
region start end var_1 ... var_n
1 A 1 5
2 A 3 10
3 B 20 100
# Site tibble
region site
1 A 4
2 A 8
3 B 25
The ~200,000 ranges can each be 100,000s of positions long, over about a billion sites, so I don't love my idea of making a list of all values in each range, unnesting, semi_join-ing, grouping, and summarise(a_list = list(site))-ing.
I was hoping for something along the lines of:
range_tibble %>%
  rowwise %>%
  mutate(site_list = site_tibble %>%
           filter(region.site == region.range, site > start, site < end) %>%
           .$site %>% as.list)
to produce a tibble like:
# Final tibble
region start end site_list var_1 ... var_n
<chr> <dbl> <dbl> <list> <chr> <chr>
1 A 1 5 <dbl [1]>
2 A 3 10 <dbl [2]>
3 B 20 100 <dbl [1]>
I've seen answers using "gets" of an external variable (i.e. filter(b == get("b"))), but how would I get the variable from the current line in the range tibble? Any clever pipes or syntax I'm not thinking of? A totally different approach is great, too, as long as it plays well with big data and can be turned back into a tibble.
Use left_join() to merge the two data frames, then summarise() to collect the sites contained in each range into a list:
library(dplyr)
range %>%
  left_join(site) %>%
  filter(site >= start & site <= end) %>%
  group_by(region, start, end) %>%
  summarise(site = list(site))
# region start end site
# <fct> <dbl> <dbl> <list>
# 1 A 1 5 <dbl [1]>
# 2 A 3 10 <dbl [2]>
# 3 B 20 100 <dbl [1]>
Data
range <- data.frame(region = c("A", "A", "B"), start = c(1, 3, 20), end = c(5, 10, 100))
site <- data.frame(region = c("A", "A", "B"), site = c(4, 8, 25))
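Given the data sizes mentioned in the question, it may help to avoid joining on region alone, which materializes every within-region pairing before filtering. A sketch using a non-equi join instead, assuming dplyr >= 1.1.0, where join_by() supports between():
library(dplyr)

range %>%
  left_join(site, by = join_by(region, between(y$site, x$start, x$end))) %>%  # bounds are inclusive by default
  group_by(region, start, end) %>%
  summarise(site = list(site), .groups = "drop")
# ranges with no matching sites would get list(NA); use inner_join() to drop them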
I'd like to know how I can use dplyr's mutate function when I don't know the column names. Here is my example code:
library(dplyr)
w<-c(2,3,4)
x<-c(1,2,7)
y<-c(1,5,4)
z<-c(3,2,6)
df <- data.frame(w,x,y,z)
df %>% rowwise() %>% mutate(minimum = min(x,y,z))
Source: local data frame [3 x 5]
Groups: <by row>
# A tibble: 3 x 5
w x y z minimum
<dbl> <dbl> <dbl> <dbl> <dbl>
1 2 1 1 3 1
2 3 2 5 2 2
3 4 7 4 6 4
This code finds the minimum value row-wise. Yes, df %>% rowwise() %>% mutate(minimum = min(x, y, z)) works because I typed the column names x, y, z. But let's assume I have a really big data frame with several hundred columns whose names I don't all know. Or I have multiple data frames that all have different column names, and I just want to find the minimum value from the 10th to the 20th column in each row of each data frame.
In the example data frame above, let's assume I don't know the column names but just want the minimum value from the 2nd to the 4th column in each row. Of course, this doesn't work as intended, because df[,2] refers to the entire column, so min() returns the global minimum for every row:
df %>% rowwise() %>% mutate(minimum=min(df[,2],df[,3], df[,4]))
Source: local data frame [3 x 5]
Groups: <by row>
# A tibble: 3 x 5
w x y z minimum
<dbl> <dbl> <dbl> <dbl> <dbl>
1 2 1 1 3 1
2 3 2 5 2 1
3 4 7 4 6 1
The two attempts below also don't work:
df %>% rowwise() %>% mutate(average=min(colnames(df)[2], colnames(df)[3], colnames(df)[4]))
df %>% rowwise() %>% mutate(average=min(noquote(colnames(df)[2]), noquote(colnames(df)[3]), noquote(colnames(df)[4])))
I know that I can get the minimum value by using apply or other methods when I don't know the column names. But I'd like to know whether dplyr's mutate function can do that without knowing the column names.
Thank you,
With apply:
library(dplyr)
library(purrr)
df %>%
  mutate(minimum = apply(df[, 2:4], 1, min))
or with pmap_dbl (pmap by itself would create a list column; pmap_dbl returns a numeric vector, matching the output below):
df %>%
  mutate(minimum = pmap_dbl(.[2:4], min))
Also with by_row from purrrlyr:
df %>%
  purrrlyr::by_row(~ min(.[2:4]), .collate = "rows", .to = "minimum")
Output:
# tibble [3 x 5]
w x y z minimum
<dbl> <dbl> <dbl> <dbl> <dbl>
1 2 1 1 3 1
2 3 2 5 2 2
3 4 7 4 6 4
A vectorized option would be pmin. Convert the column names to symbols with syms and evaluate (!!!) to splice in the values of the columns to which pmin is applied:
library(dplyr)
df %>%
  mutate(minimum = pmin(!!! rlang::syms(names(.)[2:4])))
# w x y z minimum
#1 2 1 1 3 1
#2 3 2 5 2 2
#3 4 7 4 6 4
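For completeness: in newer dplyr (>= 1.0.0), c_across() addresses the position-based case directly inside rowwise(); a sketch:
library(dplyr)

df %>%
  rowwise() %>%
  mutate(minimum = min(c_across(2:4))) %>%  # columns selected by position
  ungroup()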
Here is a tidyeval approach along the lines of the suggestion from aosmith. If you don't know the column names, you can make a function that accepts the desired positions as inputs and finds the column names itself. Here, rlang::syms() takes the column names as strings and turns them into symbols, and !!! unquotes and splices the symbols into the function.
library(dplyr)
w<-c(2,3,4)
x<-c(1,2,7)
y<-c(1,5,4)
z<-c(3,2,6)
df <- data.frame(w,x,y,z)
rowwise_min <- function(df, min_cols){
  cols <- df[, min_cols] %>% colnames %>% rlang::syms()
  df %>%
    rowwise %>%
    mutate(minimum = min(!!!cols))
}
rowwise_min(df, 2:4)
#> Source: local data frame [3 x 5]
#> Groups: <by row>
#>
#> # A tibble: 3 x 5
#> w x y z minimum
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 2 1 1 3 1
#> 2 3 2 5 2 2
#> 3 4 7 4 6 4
rowwise_min(df, c(1, 3))
#> Source: local data frame [3 x 5]
#> Groups: <by row>
#>
#> # A tibble: 3 x 5
#> w x y z minimum
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 2 1 1 3 1
#> 2 3 2 5 2 3
#> 3 4 7 4 6 4
Created on 2018-09-04 by the reprex package (v0.2.0).