How to proportionally split data using initial_split in R

I would like to proportionally split the data I have. For example, I have 100 rows and I want to randomly sample 1 row every two rows. Using tidymodels rsample I assumed I would do the below.
dat <- as_tibble(seq(1:100))
split <- initial_split(dat, prop = 0.5, breaks = 50)
testing <- testing(split)
When checking the data, the split hasn't done what I thought it would. It seems close but not exact. I thought the breaks argument generates bins which are sampled from, so breaks = 50 would split the 100 rows into 50 bins with two rows per bin. I have also tried strata = value to stratify across the rows, but I cannot get this to work either.
I am using this as an example, but I am also curious how this would work when sampling 1 row out of every four, and so on.
Have I misunderstood the breaks argument?

There is an argument that protects users from trying to create stratified splits that are too small that you are running up against; it's called pool:
library(tidymodels)
dat <- tibble(value = 1:100, strat = as.factor(rep(1:50, each = 2)))
dat
#> # A tibble: 100 × 2
#> value strat
#> <int> <fct>
#> 1 1 1
#> 2 2 1
#> 3 3 2
#> 4 4 2
#> 5 5 3
#> 6 6 3
#> 7 7 4
#> 8 8 4
#> 9 9 5
#> 10 10 5
#> # … with 90 more rows
split <- initial_split(dat, prop = 0.5, strata = strat, pool = 0.0)
split
#> Warning: Stratifying groups that make up 0% of the data may be statistically risky.
#> • Consider increasing `pool` to at least 0.1
#> <Analysis/Assess/Total>
#> <50/50/100>
training(split) %>% arrange(strat)
#> # A tibble: 50 × 2
#> value strat
#> <int> <fct>
#> 1 1 1
#> 2 4 2
#> 3 5 3
#> 4 8 4
#> 5 10 5
#> 6 12 6
#> 7 13 7
#> 8 16 8
#> 9 17 9
#> 10 20 10
#> # … with 40 more rows
testing(split) %>% arrange(strat)
#> # A tibble: 50 × 2
#> value strat
#> <int> <fct>
#> 1 2 1
#> 2 3 2
#> 3 6 3
#> 4 7 4
#> 5 9 5
#> 6 11 6
#> 7 14 7
#> 8 15 8
#> 9 18 9
#> 10 19 10
#> # … with 40 more rows
We really don't recommend turning pool down to zero like this, but you can do it here to see how the strata and prop arguments work.
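As far as I can tell from the rsample documentation, breaks only controls how a numeric strata column is cut into bins, so it has no effect when strata is not supplied. If all you need is one random row from every block of k consecutive rows (for example 1 row in every four), a minimal sketch outside rsample is to build the bins yourself and draw one row per bin with dplyr. The names k, bin and sampled are made up for this example:

```r
library(dplyr, warn.conflicts = FALSE)

set.seed(123)  # only so the example is reproducible
dat <- tibble(value = 1:100)
k <- 4  # block size: one row is drawn from every k consecutive rows

sampled <- dat %>%
  mutate(bin = (row_number() - 1) %/% k) %>%  # rows 1-4 -> bin 0, rows 5-8 -> bin 1, ...
  group_by(bin) %>%
  slice_sample(n = 1) %>%                     # one random row per bin
  ungroup()

nrow(sampled)
```

With 100 rows and k = 4 this keeps 25 rows, one per bin. Setting k <- 2 gives the one-in-two case from the question.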


In R, how to copy an object

In R, how can I refer to an object by a name built in code? For example, there is an object 'df1' (a data frame) in memory:
library(rlang)  # provides sym()
df1 <- data.frame(dt=c(1:100))
df1_copy <- sym(paste0("df","1"))
df1_copy is not the same as df1: df1_copy is a "symbol" and its value is "mt1". How can I fix this? Thanks!
If you want to programmatically make copies of objects in your environment, you can go along the lines of the second answer to this post.
df1 <- data.frame(v1 = 1:10)
df2 <- data.frame(V1 = 11:20)
original.objects <- ls(pattern="df[0-9]+")
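for(i in seq_along(original.objects)){
  # look up each object by name and assign its value to "copy_<name>"
  assign(paste0("copy_", original.objects[i]), eval(as.name(original.objects[i])))
}
ls()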
#> [1] "copy_df1" "copy_df2" "df1" "df2"
#> [5] "i" "original.objects"
print(list(df1, df2, copy_df1, copy_df2))
#> [[1]]
#> v1
#> 1 1
#> 2 2
#> 3 3
#> 4 4
#> 5 5
#> 6 6
#> 7 7
#> 8 8
#> 9 9
#> 10 10
#> [[2]]
#> V1
#> 1 11
#> 2 12
#> 3 13
#> 4 14
#> 5 15
#> 6 16
#> 7 17
#> 8 18
#> 9 19
#> 10 20
#> [[3]]
#> v1
#> 1 1
#> 2 2
#> 3 3
#> 4 4
#> 5 5
#> 6 6
#> 7 7
#> 8 8
#> 9 9
#> 10 10
#> [[4]]
#> V1
#> 1 11
#> 2 12
#> 3 13
#> 4 14
#> 5 15
#> 6 16
#> 7 17
#> 8 18
#> 9 19
#> 10 20
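For the single-object case in the question, a shorter route is base R's get(), which returns the object whose name is given as a string, so the result is the data frame itself and not a symbol. A small sketch (df2 and copies are names invented for the example):

```r
df1 <- data.frame(dt = 1:100)

# get() looks up the object whose name is given as a string
df1_copy <- get(paste0("df", "1"))
identical(df1_copy, df1)

# mget() does the same for several names at once and returns a named list
df2 <- data.frame(dt = 101:200)
copies <- mget(c("df1", "df2"))
names(copies)
```

Keeping the copies in a named list, as mget() does, is often easier to work with than creating many separate copy_* variables with assign().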

How to find the average difference between time points in longitudinal data

I have longitudinal data on the body weights of over 100K participants. The time points of the weight measurements differ between participants. What I want to know is the average time difference between the 1st and 2nd measurement, between the 2nd and 3rd measurement, and so on. I also want to know how many people (or what percentage of people) have 3 body weight measurements, and likewise for 4, 5, 6, 7, 8, etc. How can I find these answers in R?
Perhaps something like this:
library(dplyr, warn.conflicts = FALSE)
# generate some sample data
dates <- seq(as.Date("2000-01-01"), by = "day", length.out = 500)
sample_data <- tibble(
  participant_id = sample(1:1000, size = 5000, replace = TRUE),
  meas_date = sample(dates, size = 5000, replace = TRUE)) %>%
  arrange(participant_id, meas_date)
sample_data
#> # A tibble: 5,000 × 2
#> participant_id meas_date
#> <int> <date>
#> 1 1 2000-01-18
#> 2 1 2000-02-28
#> 3 1 2000-05-15
#> 4 1 2001-02-01
#> 5 2 2000-05-11
#> 6 3 2000-01-22
#> 7 3 2000-03-27
#> 8 3 2000-04-17
#> 9 3 2000-09-23
#> 10 3 2000-12-13
#> # … with 4,990 more rows
# periods between each measurement for each participant
meas_periods <- sample_data %>%
  group_by(participant_id) %>%
  mutate(meas_n = row_number(),
         date_diff = meas_date - lag(meas_date)) %>%
  ungroup()
meas_periods
#> # A tibble: 5,000 × 4
#> participant_id meas_date meas_n date_diff
#> <int> <date> <int> <drtn>
#> 1 1 2000-01-18 1 NA days
#> 2 1 2000-02-28 2 41 days
#> 3 1 2000-05-15 3 77 days
#> 4 1 2001-02-01 4 262 days
#> 5 2 2000-05-11 1 NA days
#> 6 3 2000-01-22 1 NA days
#> 7 3 2000-03-27 2 65 days
#> 8 3 2000-04-17 3 21 days
#> 9 3 2000-09-23 4 159 days
#> 10 3 2000-12-13 5 81 days
#> # … with 4,990 more rows
# average period between meas_n-1 and meas_n
meas_periods %>%
group_by(meas_n) %>%
summarise(mean_duration = mean(date_diff))
#> # A tibble: 13 × 2
#> meas_n mean_duration
#> <int> <drtn>
#> 1 1 NA days
#> 2 2 88.54102 days
#> 3 3 86.16762 days
#> 4 4 76.21154 days
#> 5 5 69.11392 days
#> 6 6 67.16798 days
#> 7 7 50.67089 days
#> 8 8 50.91111 days
#> 9 9 49.89873 days
#> 10 10 48.70588 days
#> 11 11 51.00000 days
#> 12 12 26.25000 days
#> 13 13 66.00000 days
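If plain numbers are easier to work with than durations, the same idea can be sketched on a tiny hand-made data set where the gaps are easy to check by eye. The data and the names toy, gap_days and gaps are made up for this example. The first measurement of each participant has no predecessor, so it is filtered out before averaging:

```r
library(dplyr, warn.conflicts = FALSE)

toy <- tibble(
  participant_id = c(1, 1, 1, 2, 2),
  meas_date = as.Date(c("2000-01-01", "2000-01-11", "2000-01-31",
                        "2000-02-01", "2000-02-21"))
)

gaps <- toy %>%
  group_by(participant_id) %>%
  arrange(meas_date, .by_group = TRUE) %>%
  mutate(meas_n = row_number(),
         # difference to the previous measurement, as a plain number of days
         gap_days = as.numeric(meas_date - lag(meas_date), units = "days")) %>%
  ungroup() %>%
  filter(meas_n > 1) %>%  # the first measurement has no predecessor
  group_by(meas_n) %>%
  summarise(mean_gap_days = mean(gap_days), n_gaps = n())
gaps
```

Here the 1st-to-2nd gaps are 10 and 20 days (mean 15) and the only 2nd-to-3rd gap is 20 days. The n_gaps column shows how many participants each mean is based on, which matters for the sparsely populated later measurements.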
# number and percentage of participants gone through meas_n measurements
meas_periods %>%
count(meas_n, name = "participant_n") %>%
mutate(percent = participant_n/max(participant_n))
#> # A tibble: 13 × 3
#> meas_n participant_n percent
#> <int> <int> <dbl>
#> 1 1 996 1
#> 2 2 963 0.967
#> 3 3 877 0.881
#> 4 4 728 0.731
#> 5 5 553 0.555
#> 6 6 381 0.383
#> 7 7 237 0.238
#> 8 8 135 0.136
#> 9 9 79 0.0793
#> 10 10 34 0.0341
#> 11 11 12 0.0120
#> 12 12 4 0.00402
#> 13 13 1 0.00100
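Note that the table above counts participants with at least meas_n measurements. If the question is read as "exactly 3, exactly 4, ..." measurements, one possible sketch is to count the rows per participant first and then tabulate those counts. The toy data and the names n_meas and exact_counts are assumptions for the example:

```r
library(dplyr, warn.conflicts = FALSE)

# participant 1 has 3 measurements, participants 2 and 3 have 2, participant 4 has 1
toy <- tibble(participant_id = c(1, 1, 1, 2, 2, 3, 3, 4))

exact_counts <- toy %>%
  count(participant_id, name = "n_meas") %>%      # measurements per participant
  count(n_meas, name = "participant_n") %>%       # participants per measurement count
  mutate(percent = 100 * participant_n / sum(participant_n))
exact_counts
```

In this toy data one participant (25%) has exactly 1 measurement, two (50%) have exactly 2, and one (25%) has exactly 3.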

Is there a function in R like the IFERROR formula in Excel?

Is there a function in R like the IFERROR formula in Excel?
I want to calculate a moving average over the 4 nearest values, but if a group has fewer than 4 values, use the ordinary average instead.
See the code below for details; IF_ERROR is just the function I wish existed, so it does not run.
library(dplyr)
test_data <- data.frame(category = c('a','a','a','b','b','b','b','b','b'),
                        amount = c(1, 2, 3, 4, 5, 6, 7, 8, 9))
test_data %>% group_by(category) %>% mutate(avg_amount = IF_ERROR(TTR::runMedian(amount, 4),
                                                                   median(amount)))
In general, input should only generate errors in exceptional circumstances. It can be computationally expensive to catch and handle errors where a simple if statement will suffice. The key here is realising that runMedian throws an error if the group size is less than 4. Remember we can check the group size inside mutate by using n(), so all you need do is:
test_data %>%
group_by(category) %>%
mutate(avg_amount = if(n() > 3) TTR::runMedian(amount, 4) else median(amount))
#> # A tibble: 9 x 3
#> # Groups: category [2]
#> category amount avg_amount
#> <chr> <dbl> <dbl>
#> 1 a 1 2
#> 2 a 2 2
#> 3 a 3 2
#> 4 b 4 NA
#> 5 b 5 NA
#> 6 b 6 NA
#> 7 b 7 5.5
#> 8 b 8 6.5
#> 9 b 9 7.5
Additionally, if you want to replace the NA values from the beginning of the running median, you could use ifelse:
test_data %>%
group_by(category) %>%
mutate(avg_amount = if(n() > 3) TTR::runMedian(amount, 4) else median(amount),
         avg_amount = ifelse(is.na(avg_amount), median(amount), avg_amount))
#> # A tibble: 9 x 3
#> # Groups: category [2]
#> category amount avg_amount
#> <chr> <dbl> <dbl>
#> 1 a 1 2
#> 2 a 2 2
#> 3 a 3 2
#> 4 b 4 6.5
#> 5 b 5 6.5
#> 6 b 6 6.5
#> 7 b 7 5.5
#> 8 b 8 6.5
#> 9 b 9 7.5
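For cases where an error really cannot be predicted in advance, the closest base R counterpart to IFERROR is tryCatch(). A minimal sketch follows. Note that if_error() and window_median() are hypothetical helpers written for this example, not functions from base R, dplyr or TTR:

```r
# if_error() is a hypothetical helper, not part of base R or any package
if_error <- function(expr, fallback) {
  # expr is evaluated lazily inside tryCatch(), so an error it raises is caught here
  tryCatch(expr, error = function(e) fallback)
}

# window_median() is a made-up stand-in for a function that fails on short input
window_median <- function(x, n) {
  if (length(x) < n) stop("fewer than n values")
  median(tail(x, n))
}

short <- c(1, 2, 3)
long <- c(4, 5, 6, 7, 8, 9)

res_short <- if_error(window_median(short, 4), median(short))  # error, so fallback is used
res_long <- if_error(window_median(long, 4), median(long))     # no error, so result is kept
```

This only intercepts errors. Warnings pass through unchanged unless a warning handler is added to the tryCatch() call as well.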

R+dplyr: conditionally swap the elements of two columns

Consider the dataframe df at the end of the post.
I simply would like to swap the elements of columns x and y whenever x>y.
There may be other columns in the dataframe which I do not want to touch.
In a sense, I would like to sort row wise the columns x and y.
library(dplyr)
df <- tibble(x = 1:10, y = 10:1, extra = LETTERS[1:10]) %>%
  rowwise()
df
#> # A tibble: 10 × 3
#> # Rowwise:
#> x y extra
#> <int> <int> <chr>
#> 1 1 10 A
#> 2 2 9 B
#> 3 3 8 C
#> 4 4 7 D
#> 5 5 6 E
#> 6 6 5 F
#> 7 7 4 G
#> 8 8 3 H
#> 9 9 2 I
#> 10 10 1 J
base solution:
use which(df$x > df$y) to determine row numbers you want to change, then use rev to swap values for these:
df[which(df$x > df$y), c("x", "y")] <- rev(df[which(df$x > df$y), c("x", "y")])
# x y extra
# <int> <int> <chr>
# 1 1 10 A
# 2 2 9 B
# 3 3 8 C
# 4 4 7 D
# 5 5 6 E
# 6 5 6 F
# 7 4 7 G
# 8 3 8 H
# 9 2 9 I
# 10 1 10 J
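Since the question asks for dplyr, one possible sketch for the two-column case uses the vectorised pmin() and pmax(), which take the element-wise minimum and maximum. The temporary column lo and the result name swapped are made up for the example:

```r
library(dplyr, warn.conflicts = FALSE)

df <- tibble(x = 1:10, y = 10:1, extra = LETTERS[1:10])

swapped <- df %>%
  mutate(lo = pmin(x, y),  # smaller of the two values in each row
         y = pmax(x, y),   # larger of the two values (x is still the original here)
         x = lo) %>%
  select(-lo)              # other columns such as extra are left untouched
swapped
```

The temporary column is needed because mutate() evaluates its arguments in order: overwriting x first would change the input to the pmax() call.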
Thanks everyone!
I wrote a small function which does what I need and generalizes to the case of multiple variables.
See the reprex
library(dplyr)
set_colnames <- `colnames<-`
df <- tibble(x = 1:10, y = 10:1, z = rnorm(10), extra = LETTERS[1:10]) %>%
  rowwise()
df
#> # A tibble: 10 × 4
#> # Rowwise:
#> x y z extra
#> <int> <int> <dbl> <chr>
#> 1 1 10 -1.21 A
#> 2 2 9 0.277 B
#> 3 3 8 1.08 C
#> 4 4 7 -2.35 D
#> 5 5 6 0.429 E
#> 6 6 5 0.506 F
#> 7 7 4 -0.575 G
#> 8 8 3 -0.547 H
#> 9 9 2 -0.564 I
#> 10 10 1 -0.890 J
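sort_rows <- function(df, col_names, dec = FALSE){
  # columns whose values are sorted within each row
  temp <- df %>%
    select(all_of(col_names))
  # all remaining columns are carried along unchanged
  extra_names <- setdiff(colnames(df), col_names)
  temp2 <- df %>%
    select(all_of(extra_names))
  res <- t(apply(temp, 1, sort, decreasing = dec)) %>%
    as_tibble() %>%
    set_colnames(col_names) %>%
    bind_cols(temp2)
  return(res)
}
col_names <- c("x", "y", "z")
df_s <- df %>%
  sort_rows(col_names, dec = FALSE)
df_s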
#> # A tibble: 10 × 4
#> x y z extra
#> <dbl> <dbl> <dbl> <chr>
#> 1 -1.21 1 10 A
#> 2 0.277 2 9 B
#> 3 1.08 3 8 C
#> 4 -2.35 4 7 D
#> 5 0.429 5 6 E
#> 6 0.506 5 6 F
#> 7 -0.575 4 7 G
#> 8 -0.547 3 8 H
#> 9 -0.564 2 9 I
#> 10 -0.890 1 10 J
This looks like sorting to me:
library(dplyr)
library(tidyr)
df <- tibble(x=1:10, y=10:1, extra=LETTERS[1:10])
df
#> # A tibble: 10 x 3
#> x y extra
#> <int> <int> <chr>
#> 1 1 10 A
#> 2 2 9 B
#> 3 3 8 C
#> 4 4 7 D
#> 5 5 6 E
#> 6 6 5 F
#> 7 7 4 G
#> 8 8 3 H
#> 9 9 2 I
#> 10 10 1 J
extra_cols <- df %>% colnames() %>% setdiff(c("x", "y"))
extra_cols
#> [1] "extra"
df %>%
  mutate(row = row_number()) %>%
  pivot_longer(-c(row, all_of(extra_cols))) %>%
  group_by_at(c("row", extra_cols)) %>%
  mutate(
    value = value %>% sort(),
    name = c("x", "y")
  ) %>%
  pivot_wider() %>%
  ungroup() %>%
  select(-row)
#> # A tibble: 10 x 3
#> extra x y
#> <chr> <int> <int>
#> 1 A 1 10
#> 2 B 2 9
#> 3 C 3 8
#> 4 D 4 7
#> 5 E 5 6
#> 6 F 5 6
#> 7 G 4 7
#> 8 H 3 8
#> 9 I 2 9
#> 10 J 1 10
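Try using apply along the rows (MARGIN = 1) to sort each row, transpose the result with t, and convert it to a tibble with as_tibble. This assumes df holds only the columns x and y; with a character column such as extra present, apply would coerce every row to character before sorting.
Then finally change the column names: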
> df <- as_tibble(t(apply(df, 1, sort)))
> names(df) <- c('x', 'y')
> df
# A tibble: 10 x 2
x y
<int> <int>
1 1 10
2 2 9
3 3 8
4 4 7
5 5 6
6 5 6
7 4 7
8 3 8
9 2 9
10 1 10

Create a "mean rank" column for a rank-frequency data.frame in R

We use tidytext to generate a rank column for a data.frame.
What we want is another "mean rank" column in which tied frequencies share the average of their ranks.
Is there an easy way to generate this column?
dt <-data.frame(frequency=c(64,58,54,32,29,29,25,17,17,15,12,12,10))
dt %>% arrange(desc(frequency))%>%
mutate(rank = row_number())
Sure, just group by frequency:
library(dplyr)
dt <-data.frame(frequency=c(64,58,54,32,29,29,25,17,17,15,12,12,10))
dt %>% arrange(desc(frequency))%>%
  mutate(rank = row_number()) %>%
  group_by(frequency) %>%
  mutate(mean_rank = mean(rank)) %>%
  ungroup()
#> # A tibble: 13 × 3
#> frequency rank mean_rank
#> <dbl> <int> <dbl>
#> 1 64 1 1
#> 2 58 2 2
#> 3 54 3 3
#> 4 32 4 4
#> 5 29 5 5.5
#> 6 29 6 5.5
#> 7 25 7 7
#> 8 17 8 8.5
#> 9 17 9 8.5
#> 10 15 10 10
#> 11 12 11 11.5
#> 12 12 12 11.5
#> 13 10 13 13
Use the built-in functions for this.
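dt <- within(dt, {
  rank <- rank(-frequency, ties.method = "first")
  mean_rank <- rank(-frequency, ties.method = "average")
})
dt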
# frequency mean_rank rank
# 1 64 1.0 1
# 2 58 2.0 2
# 3 54 3.0 3
# 4 32 4.0 4
# 5 29 5.5 5
# 6 29 5.5 6
# 7 25 7.0 7
# 8 17 8.5 8
# 9 17 8.5 9
# 10 15 10.0 10
# 11 12 11.5 11
# 12 12 11.5 12
# 13 10 13.0 13
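As a quick cross-check, base R's rank() averages the ranks of tied values by default (ties.method = "average"), which is the same quantity the grouped mean computes. A small sketch comparing the two; the names ord_rank and grouped_mean are made up for the example:

```r
dt <- data.frame(frequency = c(64, 58, 54, 32, 29, 29, 25, 17, 17, 15, 12, 12, 10))

# negate so that the largest frequency gets rank 1; ties get the average rank
dt$mean_rank <- rank(-dt$frequency)

# the "group by frequency and average the ranks" route, in base R via ave()
ord_rank <- rank(-dt$frequency, ties.method = "first")
grouped_mean <- ave(ord_rank, dt$frequency)

all.equal(dt$mean_rank, grouped_mean)
```

Because rank() does not depend on the row order, this works even when the data frame has not been sorted by frequency first.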
