using merge to create blank rows

I am running multiple simulations with the same input parameters. Some simulations complete earlier than others and I need to extend the results of the shorter simulations so that I can analyse the data with all runs included. This means filling up 'short' runs with repeats of the final values until they are the same length as the 'long' runs with the same input parameters.
I would like a dplyr solution because the real datasets are massive and dplyr has fast joins.
Here is my attempt.
library(dplyr)
sims <- data.frame("run" = c(1, 1, 1, 2, 2, 3, 3),
"type" = c("A", "A", "A", "A", "A", "B", "B"),
"step" = c(0, 1, 2, 0, 1, 0, 1),
"value" = seq(1:7))
allSteps <- data.frame("type" = c("A", "A", "A", "B", "B"),
"step" = c(0, 1, 2, 0, 1))
merged <- full_join(sims, allSteps,
by = c("type", "step"))
This gets the output:
run type step value
1 A 0 1
1 A 1 2
1 A 2 3
2 A 0 4
2 A 1 5
3 B 0 6
3 B 1 7
But I actually want the following because run 2 is of type A and should therefore be expanded to the same length as run 1 (also type A):
run type step value
1 A 0 1
1 A 1 2
1 A 2 3
2 A 0 4
2 A 1 5
2 A 2 NA # extra line here
3 B 0 6
3 B 1 7
I will then use fill to get to my desired result of:
run type step value
1 A 0 1
1 A 1 2
1 A 2 3
2 A 0 4
2 A 1 5
2 A 2 5 # filled replacement of NA
3 B 0 6
3 B 1 7
I am sure this is a duplicate of some question but the various search terms I used didn't manage to surface it.

We don't really need the data.frame allSteps if at least one of the runs contains the full sequence for each type. Instead we can use tidyr::expand() in combination with a self-join:
library(tidyr)
sims %>% group_by(type) %>%
expand(run, step) %>%
full_join(sims, by = c("type", "step", "run")) %>%
select(2,1,3,4)
# run type step value
# <dbl> <fctr> <dbl> <int>
#1 1 A 0 1
#2 1 A 1 2
#3 1 A 2 3
#4 2 A 0 4
#5 2 A 1 5
#6 2 A 2 NA
#7 3 B 0 6
#8 3 B 1 7
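The expansion above stops at the NA row. A minimal sketch of carrying it through to the filled result from the question, assuming the `sims` data from the question and that filling should never cross from one run into another (the name `filled` is only illustrative):

```r
library(dplyr)
library(tidyr)

# Data from the question, rebuilt so the sketch is self-contained
sims <- data.frame(run   = c(1, 1, 1, 2, 2, 3, 3),
                   type  = c("A", "A", "A", "A", "A", "B", "B"),
                   step  = c(0, 1, 2, 0, 1, 0, 1),
                   value = 1:7,
                   stringsAsFactors = FALSE)

filled <- sims %>%
  group_by(type) %>%
  expand(run, step) %>%                       # every run x step pair within a type
  left_join(sims, by = c("type", "run", "step")) %>%
  group_by(type, run) %>%
  arrange(step, .by_group = TRUE) %>%         # make sure steps are in order before filling
  fill(value, .direction = "down") %>%        # repeat the last observed value per run
  ungroup() %>%
  select(run, type, step, value)
```

Grouping by run before `fill()` is a design choice here: it keeps a value from one run from being copied into the start of the next one if a run happens to be missing its first step.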

Using tidyr::complete to get the missing combinations, then fill within each run to replace NAs with the last non-NA value:
library(tidyr)
sims %>%
group_by(type) %>%
complete(run, step) %>%
select(run, type, step, value) %>%
group_by(run) %>%
fill(value) %>%
ungroup()
# # A tibble: 8 x 4
# run type step value
# <dbl> <fct> <dbl> <int>
# 1 1.00 A 0 1
# 2 1.00 A 1.00 2
# 3 1.00 A 2.00 3
# 4 2.00 A 0 4
# 5 2.00 A 1.00 5
# 6 2.00 A 2.00 5
# 7 3.00 B 0 6
# 8 3.00 B 1.00 7

We can split the data frame by run and do a right_join on allSteps for each of them to have all the combinations you desire. Then we join back and fill.
It's a bit more general than current solutions in that you could have steps in allSteps that may not be in sims or in the sims subset you're working on.
library(tidyverse)
sims %>%
split(.$run) %>%
map_dfr(right_join,allSteps,.id = "id") %>%
group_by(type,id) %>%
fill(run,value,.direction="down") %>%
ungroup %>%
filter(!is.na(run)) %>%
select(-id)
# # A tibble: 8 x 4
# run type step value
# <dbl> <fctr> <dbl> <int>
# 1 1 A 0 1
# 2 1 A 1 2
# 3 1 A 2 3
# 4 2 A 0 4
# 5 2 A 1 5
# 6 2 A 2 5
# 7 3 B 0 6
# 8 3 B 1 7
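To illustrate the point about steps that exist only in allSteps, here is a hedged sketch of the same split-and-join idea. The extra step 3 for type A is a hypothetical addition, not part of the original data, and `lapply()` plus `bind_rows()` stand in for `map_dfr()`:

```r
library(dplyr)
library(tidyr)

sims <- data.frame(run   = c(1, 1, 1, 2, 2, 3, 3),
                   type  = c("A", "A", "A", "A", "A", "B", "B"),
                   step  = c(0, 1, 2, 0, 1, 0, 1),
                   value = 1:7,
                   stringsAsFactors = FALSE)

# Hypothetical reference table: type A is expected to reach step 3,
# which no run in sims actually contains.
allSteps <- data.frame(type = c("A", "A", "A", "A", "B", "B"),
                       step = c(0, 1, 2, 3, 0, 1),
                       stringsAsFactors = FALSE)

padded <- sims %>%
  split(.$run) %>%                                          # one data frame per run
  lapply(right_join, y = allSteps, by = c("type", "step")) %>%
  bind_rows(.id = "id") %>%                                 # id records the originating run
  group_by(type, id) %>%
  arrange(step, .by_group = TRUE) %>%
  fill(run, value, .direction = "down") %>%
  ungroup() %>%
  filter(!is.na(run)) %>%                                   # drop types that do not belong to the run
  arrange(run, step) %>%
  select(-id)
```

Under these assumptions both type A runs are padded out to step 3, while run 3 (type B) keeps its two rows.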

Related

How to create new column of repeating sequence based on other column

I have the following data frame:
Participant_ID Order
1 A
1 A
2 B
2 B
3 A
3 A
4 B
4 B
5 B
5 B
6 A
6 A
Every two rows refer to the same participant. I want to create a new column based on the value in the 'Order' column. If 'Order' == A, the new column should contain [1, 2] for those two rows; if 'Order' == B, it should contain [2, 1].
The preferred output would be the following:
Participant_ID Order Period
1 A 1
1 A 2
2 B 2
2 B 1
3 A 1
3 A 2
4 B 2
4 B 1
5 B 2
5 B 1
6 A 1
6 A 2
Any help would be appreciated

Here are a couple of possibilities. Both assume that the Order value is the same for every row of a given Participant_ID. If this isn't the case, you will need to include additional logic.
You can use if_else:
library(tidyverse)
df %>%
group_by(Participant_ID) %>%
mutate(Period = if_else(Order == "A", 1:2, 2:1))
Or to explicitly check for multiple different values (e.g., "A", "B", etc.), have more flexibility, and include NA for other cases, you can use case_when:
df %>%
group_by(Participant_ID) %>%
mutate(Period = case_when(
Order == "A" ~ 1:2,
Order == "B" ~ 2:1,
TRUE ~ NA_integer_
))
Output
Participant_ID Order Period
<int> <chr> <int>
1 1 A 1
2 1 A 2
3 2 B 2
4 2 B 1
5 3 A 1
6 3 A 2
7 4 B 2
8 4 B 1
9 5 B 2
10 5 B 1
11 6 A 1
12 6 A 2
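Both versions above rely on every participant having exactly two rows. A minimal sketch that drops that assumption, counting up for Order "A" and down for anything else (the data frame is rebuilt here from the question so the example is self-contained):

```r
library(dplyr)

df <- data.frame(Participant_ID = rep(1:6, each = 2),
                 Order = rep(c("A", "B", "A", "B", "B", "A"), each = 2),
                 stringsAsFactors = FALSE)

result <- df %>%
  group_by(Participant_ID) %>%
  # first(Order) is a single value per group, so a plain if/else is enough;
  # row_number() adapts to however many rows the participant has
  mutate(Period = if (first(Order) == "A") row_number() else rev(row_number())) %>%
  ungroup()
```

Like the answers above, this still assumes that Order is constant within a participant.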

The Most Efficient Way of Forming Groups using R

I have a tibble dt given as follows:
library(tidyverse)
dt <- tibble(x=as.integer(c(0,0,1,0,0,0,1,1,0,1))) %>%
mutate(grp = as.factor(c(rep("A",3), rep("B",4), rep("C",1), rep("D",2))))
dt
As one can observe the rule for grouping is:
starts 0 and ends with 1 (e.g., groups A, B, D) or
it solely contains 1 (e.g., group C)
Problem: Given a tibble with column integer vector x of zeros and 1 that starts with 0 and ends in 1, what is the most efficient way to obtain a grouping using R? (You can use any grouping symbols/factors.)

We can take the cumulative sum of 'x' (assuming it is binary), lag it by one position, add 1, and use the result as an index into LETTERS. Note that LETTERS is used only to match the expected output; it covers at most 26 groups.
library(dplyr)
dt %>%
mutate(grp2 = LETTERS[lag(cumsum(x), default = 0)+ 1])
Output
# A tibble: 10 x 3
x grp grp2
<int> <fct> <chr>
1 0 A A
2 0 A A
3 1 A A
4 0 B B
5 0 B B
6 0 B B
7 1 B B
8 1 C C
9 0 D D
10 1 D D
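If more than 26 groups are possible, the same lagged cumulative sum can be kept as a plain integer id instead of being mapped to LETTERS. A small sketch, with `grp_id` as an illustrative column name:

```r
library(dplyr)

dt <- tibble(x = as.integer(c(0, 0, 1, 0, 0, 0, 1, 1, 0, 1)))

dt2 <- dt %>%
  # a new group starts on the row after each 1, so shift the running
  # count of 1s down by one row and start numbering at 1
  mutate(grp_id = dplyr::lag(cumsum(x), default = 0L) + 1L)
```

An integer id works directly with `group_by()`, and it can be wrapped in `factor()` if a factor is preferred.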

The strategy proposed by Akrun is fantastic, but here is how the same grouping can also be built with accumulate:
library(tidyverse)
dt <- tibble(x=as.integer(c(0,0,1,0,0,0,1,1,0,1))) %>%
mutate(grp = as.factor(c(rep("A",3), rep("B",4), rep("C",1), rep("D",2))))
dt %>%
mutate(GRP = accumulate(lag(x, default = 0),.init =1, ~ if(.y != 1) .x else .x+1)[-1])
#> # A tibble: 10 x 3
#> x grp GRP
#> <int> <fct> <dbl>
#> 1 0 A 1
#> 2 0 A 1
#> 3 1 A 1
#> 4 0 B 2
#> 5 0 B 2
#> 6 0 B 2
#> 7 1 B 2
#> 8 1 C 3
#> 9 0 D 4
#> 10 1 D 4
Created on 2021-06-13 by the reprex package (v2.0.0)

In R, take sum of multiple variables if combination of values in two other columns are unique

I am trying to expand on the answer to an already-solved question, "Take Sum of a Variable if Combination of Values in Two Other Columns are Unique".
Because I am new to Stack Overflow, I can't comment directly on that post, so here is my problem:
I have a dataset like the following but with about 100 columns of binary data as shown in "ani1" and "bni2" columns.
Locations <- c("A","A","A","A","B","B","C","C","D", "D","D")
seasons <- c("2", "2", "3", "4","2","3","1","2","2","4","4")
ani1 <- c(1,1,1,1,0,1,1,1,0,1,0)
bni2 <- c(0,0,1,1,1,1,0,1,0,1,1)
df <- data.frame(Locations, seasons, ani1, bni2)
Locations seasons ani1 bni2
1 A 2 1 0
2 A 2 1 0
3 A 3 1 1
4 A 4 1 1
5 B 2 0 1
6 B 3 1 1
7 C 1 1 0
8 C 2 1 1
9 D 2 0 0
10 D 4 1 1
11 D 4 0 1
I am attempting to sum all the columns by location and season, so that I get one total for column 3 and for every column after it, for each unique combination of location and season.
The problem is not all the columns have a 1 value for every combination of location and season and they all have different names.
I would like something like this:
Locations seasons ani1 bni2
1 A 2 2 0
2 A 3 1 1
3 A 4 1 1
4 B 2 0 1
5 B 3 1 1
6 C 1 1 0
7 C 2 1 1
8 D 2 0 0
9 D 4 1 2
Here is my attempt using a for loop:
df2 <- 0
for(i in 3:length(df)){
testdf <- data.frame(t(apply(df[1:2], 1, sort)), df[i])
df2 <- aggregate(i~., testdf, FUN=sum)
}
I get the following error:
Error in model.frame.default(formula = i ~ ., data = testdf) :
variable lengths differ (found for 'X1')
Thank you!

You can use dplyr::summarise and across after group_by.
library(dplyr)
df %>%
group_by(Locations, seasons) %>%
summarise(across(everything(), ~sum(.x, na.rm = TRUE))) %>%
ungroup()
Another option is to reshape the data to long format using functions from the tidyr package. This avoids the issue of having to select columns 3 onwards.
library(dplyr)
library(tidyr)
df %>%
pivot_longer(cols = -c(Locations, seasons)) %>%
group_by(Locations, seasons, name) %>%
summarise(Sum = sum(value, na.rm = TRUE)) %>%
ungroup() %>%
pivot_wider(names_from = "name", values_from = "Sum")
Result:
# A tibble: 9 x 4
Locations seasons ani1 bni2
<chr> <int> <int> <int>
1 A 2 2 0
2 A 3 1 1
3 A 4 1 1
4 B 2 0 1
5 B 3 1 1
6 C 1 1 0
7 C 2 1 1
8 D 2 0 0
9 D 4 1 2
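For comparison, a base R sketch of the same summary. The loop in the question most likely fails because `i ~ .` refers to the loop counter `i` rather than to a column; with `.` on the left-hand side, `aggregate()` sums every column that is not a grouping variable in a single call:

```r
# Data from the question
Locations <- c("A", "A", "A", "A", "B", "B", "C", "C", "D", "D", "D")
seasons   <- c("2", "2", "3", "4", "2", "3", "1", "2", "2", "4", "4")
ani1 <- c(1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0)
bni2 <- c(0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1)
df <- data.frame(Locations, seasons, ani1, bni2)

# "." on the left means "all columns not named on the right"
totals <- aggregate(. ~ Locations + seasons, data = df, FUN = sum)

# aggregate() does not sort by location first, so reorder for readability
totals <- totals[order(totals$Locations, totals$seasons), ]
```

Note that the formula interface drops rows containing NA by default, which differs from the `na.rm = TRUE` behaviour of the dplyr versions.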

How can I find the column index of the first non-zero value in a row with R dplyr?

I'm working in R. I have a dataset of COVID case totals that looks like this:
Facility Day_1 Day_2 Day_3
A 0 0 1
B 1 2 5
C 0 2 6
D 0 0 0
I would like to use mutate() to create a new column, first_case, that has the column index of the first non-zero element in each row -- or "NA" if there is no non-zero element. I thought about using where(), but couldn't quite figure out how to get a column index instead of a row index.
Any help is much appreciated!

We can use max.col to get the first instance where the value is non-zero in each row.
library(dplyr)
df %>%
mutate(first_case = {
tmp <- select(., starts_with('Day'))
ifelse(rowSums(tmp) == 0, NA, max.col(tmp != 0, ties.method = 'first'))
})
# Facility Day_1 Day_2 Day_3 first_case
#1 A 0 0 1 3
#2 B 1 2 5 1
#3 C 0 2 6 2
#4 D 0 0 0 NA
first_case holds the position among the 'Day' columns only. If you need the column number in the full data frame, add 1 to the output above.
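A base R sketch of the same idea, assuming the table from the question is stored in `df`. It uses `which()` on each row and looks up the real column positions, rather than relying on the Day columns starting at column 2 (`day_cols` and `first_case_col` are illustrative names):

```r
df <- data.frame(Facility = c("A", "B", "C", "D"),
                 Day_1 = c(0, 1, 0, 0),
                 Day_2 = c(0, 2, 2, 0),
                 Day_3 = c(1, 5, 6, 0))

# Positions of the Day columns within df
day_cols <- grep("^Day_", names(df))

# which(r)[1] is the first TRUE in the row, or NA when the row has none
df$first_case <- unname(apply(df[day_cols] != 0, 1, function(r) which(r)[1]))

# Translate the position among the Day columns into a column index of df
df$first_case_col <- day_cols[df$first_case]
```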

This is probably unnecessarily complex, because the data is not in a long ('tidy') format that dplyr etc expect.
datlong <- dat %>%
pivot_longer(cols=starts_with("Day"), names_to = c("day"), names_pattern="_(\\d+)")
## A tibble: 12 x 3
# Facility day value
# <chr> <chr> <int>
# 1 A 1 0
# 2 A 2 0
# 3 A 3 1
# 4 B 1 1
# 5 B 2 2
# 6 B 3 5
# 7 C 1 0
# 8 C 2 2
# 9 C 3 6
#10 D 1 0
#11 D 2 0
#12 D 3 0
It's then simple to get the first/second/third/[n]th day above whatever value, as well as to calculate minimums, maximums, means, weekly averages, rolling averages, whatever, because you are now dealing with a plain old vector of values rather than a list of values across multiple columns.
datlong %>%
group_by(Facility) %>%
filter(value > 0, .preserve=TRUE) %>%
summarise(first_day = first(day))
#`summarise()` ungrouping output (override with `.groups` argument)
## A tibble: 4 x 2
# Facility first_day
# <chr> <chr>
#1 A 3
#2 B 1
#3 C 2
#4 D <NA>
Alternative using indexes and stuff, which is less dplyr-like:
datlong %>%
group_by(Facility) %>%
summarise(first_day = day[value > 0][1])
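The snippets above use `dat` without defining it. A self-contained sketch of the long-format approach, assuming `dat` holds the table from the question, and converting `day` to an integer so that the result is a number rather than a string:

```r
library(dplyr)
library(tidyr)

dat <- data.frame(Facility = c("A", "B", "C", "D"),
                  Day_1 = c(0, 1, 0, 0),
                  Day_2 = c(0, 2, 2, 0),
                  Day_3 = c(1, 5, 6, 0))

first_days <- dat %>%
  pivot_longer(cols = starts_with("Day"),
               names_to = "day",
               names_pattern = "_(\\d+)") %>%
  mutate(day = as.integer(day)) %>%              # "1", "2", ... become 1, 2, ...
  group_by(Facility) %>%
  # indexing an empty vector with [1] yields NA, which covers facility D
  summarise(first_day = day[value > 0][1])
```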

How can I filter by subjects who have all levels of a factor?

I am trying to filter a data set to only include subjects who have data in all conditions (levels of a factor).
I have tried to filter by calculating the number of levels for each subject, but that does not work.
library(tidyverse)
Data <- data.frame(
Subject = factor(c(rep(1, 3),
rep(2, 3),
rep(3, 1))),
Condition = factor(c("A", "B", "C",
"A", "B", "C",
"A")),
Val = c(1, 0, 1,
0, 0, 1,
1)
)
Data %>%
semi_join(
.,
Data %>%
group_by(Subject) %>%
summarize(Num_Cond = length(levels(Condition))) %>%
filter(Num_Cond == 3),
by = "Subject"
)
This attempt yields:
Subject Condition Val
1 1 A 1
2 1 B 0
3 1 C 1
4 2 A 0
5 2 B 0
6 2 C 1
7 3 A 1
Desired output:
Subject Condition Val
1 1 A 1
2 1 B 0
3 1 C 1
4 2 A 0
5 2 B 0
6 2 C 1
I want to filter subject 3 out because they only have data for one condition.
Is there a dplyr/tidyverse approach for this problem?

We can create a condition with all and levels
library(dplyr)
Data %>%
group_by(Subject) %>%
filter(all(levels(Condition) %in% Condition))
# A tibble: 6 x 3
# Groups: Subject [2]
# Subject Condition Val
# <fct> <fct> <dbl>
#1 1 A 1
#2 1 B 0
#3 1 C 1
#4 2 A 0
#5 2 B 0
#6 2 C 1
Or with n_distinct and nlevels
Data %>%
group_by(Subject) %>%
filter(nlevels(Condition) == n_distinct(Condition))
# A tibble: 6 x 3
# Groups: Subject [2]
# Subject Condition Val
# <fct> <fct> <dbl>
#1 1 A 1
#2 1 B 0
#3 1 C 1
#4 2 A 0
#5 2 B 0
#6 2 C 1

Here is a solution that tests whether the number of rows in each group is equal to the number of levels of Condition.
Data %>%
group_by(Subject) %>%
filter(n() == nlevels(Condition))
## A tibble: 6 x 3
## Groups: Subject [2]
# Subject Condition Val
# <fct> <fct> <dbl>
#1 1 A 1
#2 1 B 0
#3 1 C 1
#4 2 A 0
#5 2 B 0
#6 2 C 1
Edit
Following the comment by user @akrun, I tested with a data set in which every row is duplicated, and the code above does fail.
bind_rows(Data, Data) %>%
group_by(Subject) %>%
#distinct() %>%
filter(n() == nlevels(Condition))
## A tibble: 0 x 3
## Groups: Subject [0]
## ... with 3 variables: Subject <fct>, Condition <fct>, Val <dbl>
Uncommenting the distinct() line solves the problem.

I found a relatively simple solution by sub-setting on Subject:
Data %>%
semi_join(
.,
Data %>%
group_by(Subject) %>%
droplevels() %>%
summarize(Num_Cond = length(levels(Condition)[Subject])) %>%
filter(Num_Cond == 3),
by = "Subject"
)
This gives the desired output:
Subject Condition Val
1 1 A 1
2 1 B 0
3 1 C 1
4 2 A 0
5 2 B 0
6 2 C 1
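One caveat on the code above: indexing `levels(Condition)` by the factor `Subject` returns one element per row, so `Num_Cond` appears to be a row count per subject rather than a count of conditions. A hedged sketch of the same semi_join structure that counts distinct conditions explicitly (`complete_subjects` and `result` are illustrative names):

```r
library(dplyr)

Data <- data.frame(
  Subject   = factor(c(rep(1, 3), rep(2, 3), rep(3, 1))),
  Condition = factor(c("A", "B", "C", "A", "B", "C", "A")),
  Val       = c(1, 0, 1, 0, 0, 1, 1)
)

# Subjects that have at least one row for every level of Condition
complete_subjects <- Data %>%
  group_by(Subject) %>%
  summarize(Num_Cond = n_distinct(Condition)) %>%
  filter(Num_Cond == nlevels(Data$Condition))

result <- semi_join(Data, complete_subjects, by = "Subject")
```

Because it counts distinct values, this version should also behave sensibly when a subject has repeated rows for the same condition.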
