Writing a function with variable times in the repeat function in R

I'm hoping someone can help me write a more elegant function to do the following:
Let's say I have a data frame looking approximately like the following:
library(tidyverse)
d =
tibble(
ID = as.factor(c("1", "2")),
dialect_TCU = as.numeric(c(8, 12)),
standard_TCU = as.numeric(c(12, 9)),
mixture_TCU = as.numeric(c(14, 5))
)
I cannot, for the life of me, figure out how to write a function that does the following:
Repeats each header the number of times listed for each participant, and
repeats the participant ID the number of times the headers are repeated.
The ending data frame should look like this:
d2 =
tibble(
ID = c(rep("1", 34),
rep("2", 26)),
successfulRow = c(rep("dialect_TCU", 8),
rep("standard_TCU", 12),
rep("mixture_TCU", 14),
rep("dialect_TCU", 12),
rep("standard_TCU", 9),
rep("mixture_TCU", 5))
)
If anyone could help me out in writing a function that does this (it's probably really easy and I'm just overthinking the whole thing...), that would be extremely helpful!
Thanks!

We can reshape to 'long' format with pivot_longer and then use uncount to replicate the rows:
library(dplyr)
library(tidyr)
d %>%
pivot_longer(cols = -ID, names_to = "successfulRow") %>%
uncount(value)
-output
# A tibble: 60 × 2
ID successfulRow
<fct> <chr>
1 1 dialect_TCU
2 1 dialect_TCU
3 1 dialect_TCU
4 1 dialect_TCU
5 1 dialect_TCU
6 1 dialect_TCU
7 1 dialect_TCU
8 1 dialect_TCU
9 1 standard_TCU
10 1 standard_TCU
# … with 50 more rows
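Since the question asks for a function, the pipeline above could be wrapped as follows. This is only a sketch: the helper name `expand_counts` and its arguments `id_col` and `names_to` are hypothetical, and it assumes every non-ID column holds whole-number counts.

```r
library(dplyr)
library(tidyr)

# Hypothetical helper: the name `expand_counts` and its arguments are
# illustrative and do not come from the answer above.
expand_counts <- function(data, id_col = "ID", names_to = "successfulRow") {
  data %>%
    # One row per (ID, header) pair, with the count stored in `n`
    pivot_longer(cols = -all_of(id_col), names_to = names_to, values_to = "n") %>%
    # Repeat each row `n` times; uncount() drops the `n` column afterwards
    uncount(n)
}

d <- tibble(
  ID = factor(c("1", "2")),
  dialect_TCU = c(8, 12),
  standard_TCU = c(12, 9),
  mixture_TCU = c(14, 5)
)

d2 <- expand_counts(d)
nrow(d2)
```

Passing the ID column as a string keeps the helper usable on data frames whose identifier has a different name.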

Related

Add a column with single value per group

I have a grouped tibble with several columns. I want to add a new column that has the same value for every row within a group but a different value for each group, essentially naming the groups. The per-group values are supplied from a vector.
Ideally, I want to do this in a generic way, so that it works in a function regardless of the number of groups in the input.
Any help would be much appreciated. Here is a very basic, reduced example of the tibble and vector (the original tibble has character, int, and dbl columns):
df <- tibble(a = c(1,2,3,1,3,2)) %>% group_by(a)
names <- c("owl", "newt", "zag")
desired_output <- tibble(a = c(1, 2, 3, 1, 3, 2),
name = c("owl", "newt", "zag", "owl", "zag", "newt"))
As the output, I would like the same tibble with one more column, where group 1 = owl, 2 = newt, and 3 = zag.
Just take a as indices:
library(dplyr)
df %>%
mutate(name = names[a])
# # A tibble: 6 × 2
# a name
# <dbl> <chr>
# 1 1 owl
# 2 2 newt
# 3 3 zag
# 4 1 owl
# 5 3 zag
# 6 2 newt
You can also use recode() if a cannot be used as indices.
df %>%
mutate(name = recode(a, !!!setNames(names, 1:3)))
Data
df <- tibble(a = c(1,2,3,1,3,2))
names <- c("owl", "newt", "zag")
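If the grouping values are not usable as indices and should not be hard-coded, one possible generic variant is to index the vector by group position with `cur_group_id()`. This is a sketch under two assumptions: the tibble is already grouped, and the vector (called `group_names` here, a name chosen for illustration) has one entry per group, in the sorted order of the group keys.

```r
library(dplyr)

# Grouping values that cannot serve as indices directly
df <- tibble(a = c(10, 20, 30, 10, 30, 20)) %>% group_by(a)

# Assumed: one name per group, in sorted group-key order
group_names <- c("owl", "newt", "zag")

out <- df %>%
  # cur_group_id() returns the integer position of the current group
  mutate(name = group_names[cur_group_id()]) %>%
  ungroup()

out
```

Because the lookup depends only on group position, the same code should work for any number of groups as long as `group_names` is long enough.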
Something like this?
library(dplyr)
names = c("owl", "newt", "zag")
df %>%
group_by(a) %>%
mutate(new_col = case_when(a == 1 ~ names[1],
a == 2 ~ names[2],
a == 3 ~ names[3]))
a new_col
<dbl> <chr>
1 1 owl
2 2 newt
3 3 zag
4 1 owl
5 2 newt
6 3 zag
7 2 newt
8 3 zag
9 1 owl
10 2 newt
11 1 owl
12 3 zag
13 2 newt
14 3 zag
data:
df <- structure(list(a = c(1, 2, 3, 1, 2, 3, 2, 3, 1, 2, 1, 3, 2, 3
)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,
-14L))
You could use factor() with mutate():
names = c("owl", "newt", "zag")
dat = data.frame(a = c(1,2,3,1,2,3,2,3,1,2,1,3,2,3))
dat %>% mutate(label = factor(a, levels = c(1,2,3), labels = names))
Just make sure the order in levels corresponds to the order in labels (i.e., 1 = "owl").

Updating Values in New Data with Old Data

I have the following data frame:
library(dplyr)
old_data = data.frame(id = c(1,2,3), var1 = c(11,12,13))
> old_data
id var1
1 1 11
2 2 12
3 3 13
I want to replace values in "old_data" with the data in "new_data" wherever the id variable matches (here, the 2nd row of "old_data"):
new_data = data.frame(id = c(4,2,5), var1 = c(11,15,13))
> new_data
id var1
1 4 11
2 2 15
3 5 13
Using the answer found here (Update rows of data frame in R), I tried to do this with the "dplyr" library:
update = old_data %>%
rows_update(new_data, by = "id")
But this gave me the following error:
Error: Attempting to update missing rows.
Run `rlang::last_error()` to see where the error occurred.
This is what I am trying to get:
id var1
1 1 11
2 2 15
3 3 13
Can someone please tell me what I am doing wrong?
Thanks!
A little bit messy but this works (on this sample data at least)
old_data %>%
left_join(new_data,by="id") %>%
mutate(var1 = if_else(!is.na(var1.y),var1.y,var1.x)) %>%
select(id,var1)
# id var1
#1 1 11
#2 2 15
#3 3 13
A base R approach using match -
inds <- match(old_data$id, new_data$id)
old_data$var1[!is.na(inds)] <- na.omit(new_data$var1[inds])
old_data
# id var1
#1 1 11
#2 2 15
#3 3 13
A data.table approach (with turning the data table back into a dataframe):
library(data.table)
as.data.frame(setDT(old_data)[new_data, var1 := .(i.var1), on = "id"])
Output
id var1
1 1 11
2 2 15
3 3 13
An alternative tidyverse option using rows_update. You can filter new_data to only have ids that appear in old_data. Then, you can update those values, like you had previously tried. Essentially, new_data must only have id values that appear in old_data.
library(tidyverse)
old_data %>%
rows_update(., new_data %>% filter(id %in% old_data$id), by = "id")
Data
old_data <-
structure(list(id = c(1, 2, 3), var1 = c(11, 12, 13)),
class = "data.frame",
row.names = c(NA,-3L))
new_data <-
structure(list(id = c(4, 2, 5), var1 = c(11, 15, 13)),
class = "data.frame",
row.names = c(NA,-3L))
We can use dplyr::rows_update if we first use a semi_join on new_data to filter only those ids that are included in old_data.
library(dplyr)
old_data %>%
rows_update(new_data %>%
semi_join(old_data, by = "id"),
by = "id")
#> id var1
#> 1 1 11
#> 2 2 15
#> 3 3 13
Created on 2021-12-29 by the reprex package (v0.3.0)
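In newer versions of dplyr, the pre-filtering step may not be needed. The sketch below assumes dplyr 1.1.0 or later, where `rows_update()` accepts an `unmatched` argument; with `unmatched = "ignore"`, rows of `new_data` whose id is absent from `old_data` are skipped instead of raising an error.

```r
library(dplyr)

old_data <- data.frame(id = c(1, 2, 3), var1 = c(11, 12, 13))
new_data <- data.frame(id = c(4, 2, 5), var1 = c(11, 15, 13))

# Assumes dplyr >= 1.1.0: ids 4 and 5 have no match in old_data and are ignored
updated <- old_data %>%
  rows_update(new_data, by = "id", unmatched = "ignore")

updated
```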

Create new rows and put a flag to differentiate between existing row

I have a dataset like this:
df_have <- data.frame(id = rep("a",3), time = c(1,3,5), flag = c(0,1,1))
The data has one row per time per id but I need to have the second row duplicated and put into the data.frame like this:
df_want <- data.frame(id = rep("a",4), time = c(1,3,3,5), flag = c(0,0,1,1))
The flag variables should become 0 with the new row added and all other information the same. Any help would be appreciated.
Edit:
The comments below are helpful, but I also need to do this in groups by id, and some ids have more rows than others. After rereading this and seeing the comments, I realize the logic wasn't clear. My original data does not have a count variable (what I call flag), but the final output needs one. Every row except the first and last timepoint (within each id) should be duplicated, and a counter should mark each duplication: it increments when a new row is created and holds that value until the next new row is created.
df_have2 <- data.frame(id = c(rep("a",3),rep("b",4)) ,
time = c(1,3,5,1,3,5,7))
df_want2 <- data.frame(id = c(rep("a",4),rep("b",6)),
time = c(1,3,3,5,1,3,3,5,5,7),
flag = c(1,1,2,2,1,1,2,2,3,3))
We could expand the data with slice and then create the 'flag' by matching the 'time' with unique values of 'time' and take the lag of it
library(dplyr)
df_have2 %>%
group_by(id) %>%
slice(rep(row_number(), c(1, rep(2, n() - 2), 1))) %>%
mutate(flag = lag(match(time, unique(time)), default = 1)) %>%
ungroup
# A tibble: 10 x 3
# id time flag
# <chr> <dbl> <dbl>
# 1 a 1 1
# 2 a 3 1
# 3 a 3 2
# 4 a 5 2
# 5 b 1 1
# 6 b 3 1
# 7 b 3 2
# 8 b 5 2
# 9 b 5 3
#10 b 7 3
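To see what the `slice()` and `lag()` steps do, the same computation can be traced in base R for a single group. This is an illustrative walk-through for group "b" (four rows), assuming each group has at least two rows; the variable names are chosen here for illustration.

```r
n <- 4                                # number of rows in group "b"
time <- c(1, 3, 5, 7)

# First and last rows are kept once, every interior row twice
times <- c(1, rep(2, n - 2), 1)       # 1 2 2 1
idx <- rep(seq_len(n), times)         # 1 2 2 3 3 4

expanded_time <- time[idx]            # 1 3 3 5 5 7

# Position of each time among the unique times, shifted down by one row
# (the base R equivalent of dplyr::lag(x, default = 1))
pos <- match(expanded_time, unique(expanded_time))  # 1 2 2 3 3 4
flag <- c(1, head(pos, -1))           # 1 1 2 2 3 3

data.frame(time = expanded_time, flag = flag)
```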

Combining multiple across() in a single mutate() sentence, while controlling the variables names in R

I have the following dataframe:
df = data.frame(a = 10, b = 20, a_sd = 2, b_sd = 3)
a b a_sd b_sd
1 10 20 2 3
I want to compute a/a_sd, b/b_sd, and to add the results to the dataframe, and name them ratio_a, ratio_b. In my dataframe I have a lot of variables so I need a 'wide' solution. I tried:
df %>%
mutate( across(one_of( c('a','b')))/across(ends_with('_sd')))
This gave:
a b a_sd b_sd
1 5 6.666667 2 3
So this worked but the new values took the place of the old ones. How can I add the results to the data frame and to control the new names?
You can use the .names argument inside across
df %>%
mutate(across(one_of(c('a','b')), .names = 'ratio_{col}')/across(ends_with('_sd')))
# a b a_sd b_sd ratio_a ratio_b
# 1 10 20 2 3 5 6.666667
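Dividing one `across()` by another pairs columns by position, so it relies on the `_sd` columns appearing in the same order as their counterparts. As an alternative sketch in base R (not the approach from the answer above), the columns can be paired explicitly by name; the vector `vars` is an assumption listing the base variable names.

```r
df <- data.frame(a = 10, b = 20, a_sd = 2, b_sd = 3)

# Assumed: every variable in `vars` has a matching "<name>_sd" column
vars <- c("a", "b")

# Divide each variable by its own _sd column and store it as ratio_<name>
df[paste0("ratio_", vars)] <- df[vars] / df[paste0(vars, "_sd")]

df
```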

Extracting corresponding dataframe values from multiple records using a function

I have a dataframe (df1) containing many records. Each record has up to three trials, and each trial can be repeated up to five times. Below is an example of some data I have:
Record Trial Start End Speed Number
1 2 1 4 12 9
1 2 4 6 11 10
1 3 1 3 10 17
2 1 1 5 14 5
I have the following code that calculates the longest 'Distance' and 'Maximum Number' for each Record and Trial:
getInfo <- function(race_df) {
  # Longest distance (End - Start) per record and trial
  race_distance <- as.data.frame(race_df %>% group_by(Record, Trial) %>% summarise(Max.Distance = max(End - Start)))
  # Largest Number per record and trial
  race_max_number <- as.data.frame(race_df %>% group_by(Record, Trial) %>% summarise(Max.Number = max(Number)))
  rd_rmn_merge <- merge(x = race_distance, y = race_max_number)
  total_summary <- rd_rmn_merge[order(rd_rmn_merge$Record, rd_rmn_merge$Trial), ]
  return(list(race_distance, race_max_number, total_summary))
}
list_summary <- getInfo(race_df)
total_summary <- list_summary[[3]]
list_summary gives me an output like this:
[[1]]
Record Trial Max.Distance
1 2 3
1 3 2
2 1 4
[[2]]
Record Trial Max.Number
1 2 10
1 3 17
2 1 5
[[3]]
Record Trial Max.Distance Max.Number
1 2 3 10
1 3 2 17
2 1 4 5
I am now trying to find the longest distance along with its corresponding 'Number', regardless of whether that Number is the maximum. Record 1, Trial 2 would then look like this instead:
Record Trial Max.Distance Corresponding Number
1 2 3 9
Eventually I would like to be able to create a function that is able to take arguments 'Record' and 'Trial' through the 'race_df' dataframe to make searching for a specific record and trial's longest distance easier.
Any help on this would be much appreciated.
The data (in case anyone else wants to offer their solution):
df <- data.frame( Record = c(1,1,1,2),
Trial = c(2,2,3,1),
Start = c(1,4,1,1),
End = c(4,6,3,5),
Speed = c(12,11,10,14),
Number = c(9,10,17,5))
Here's a tidyverse solution:
library(tidyverse)
df %>%
mutate( Max.Distance = End - Start) %>%
select(-Start,-End,-Speed) %>%
group_by(Record) %>%
nest() %>%
mutate( data = map( data, ~ filter(.x, Max.Distance == max(Max.Distance)) )) %>%
unnest(cols = data)
The output:
Record Trial Number Max.Distance
<dbl> <dbl> <dbl> <dbl>
1 1 2 9 3
2 2 1 5 4
Note: if you want to keep all of your columns in the final data frame, just remove the select() call.
I hope I get right what your function is supposed to do. In the end it should take a record and a trial and put out the row(s) where we have the maximum distance, right?
So, it boils down to two filters:
1. Filter rows for the record and trial.
2. Within that subset, filter the row that has the maximum distance.
Between those two filters, we have to calculate the distance although I suggest you move that outside the function because it is basically a one time operation.
race_df <- data.frame(Record = c(1, 1, 1, 2), Trial = c(2, 2, 3, 1),
Start = c(1, 4, 1, 1), End = c(4, 6, 3, 5), Speed = c(12, 11, 10, 14),
Number = c(9, 10, 17, 5))
get_longest <- function(df, record, trial){
df %>%
filter(Record == record & Trial == trial) %>%
mutate(Distance = End - Start) %>%
filter(Distance == max(Distance)) %>%
select(Number, Distance)
}
get_longest(race_df, 1, 2)
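If the longest distance and its corresponding Number are wanted for every Record and Trial at once, rather than one lookup at a time, a grouped `slice_max()` is one possible sketch. It assumes dplyr 1.0.0 or later; `with_ties = FALSE` keeps a single row per group when several rows share the maximum distance, which is an assumption about the desired behaviour.

```r
library(dplyr)

race_df <- data.frame(Record = c(1, 1, 1, 2), Trial = c(2, 2, 3, 1),
                      Start = c(1, 4, 1, 1), End = c(4, 6, 3, 5),
                      Speed = c(12, 11, 10, 14), Number = c(9, 10, 17, 5))

longest <- race_df %>%
  mutate(Distance = End - Start) %>%
  group_by(Record, Trial) %>%
  # Keep the row with the largest Distance in each Record/Trial group
  slice_max(Distance, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  select(Record, Trial, Distance, Number)

longest
```

For Record 1, Trial 2 this keeps the row with Distance 3 and Number 9, matching the desired output in the question.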
