How to proportionally split data using initial_split in R

I would like to proportionally split the data I have. For example, I have 100 rows and I want to randomly sample 1 row every two rows. Using tidymodels rsample I assumed I would do the below.
dat <- as_tibble(seq(1:100))
split <- initial_split(dat, prop = 0.5, breaks = 50)
testing <- testing(split)
When I check the data, the split hasn't done what I thought it would. It seems close, but not exact. I thought the breaks argument generates bins which are then sampled from, so breaks = 50 would split the 100 rows into 50 bins of two rows each. I have also tried strata = value to stratify across the rows, but I cannot get this to work either.
I am using this as an example, but I am also curious how this would work when sampling 1 row in every four, and so on.
Have I misunderstood the breaks argument?

You are running up against an argument that protects users from creating stratified splits whose strata are too small; it's called pool:
library(rsample)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
dat <- tibble(value = seq(1:100), strat = as.factor(rep(1:50, each = 2)))
dat
#> # A tibble: 100 × 2
#> value strat
#> <int> <fct>
#> 1 1 1
#> 2 2 1
#> 3 3 2
#> 4 4 2
#> 5 5 3
#> 6 6 3
#> 7 7 4
#> 8 8 4
#> 9 9 5
#> 10 10 5
#> # … with 90 more rows
split <- initial_split(dat, prop = 0.5, strata = strat, pool = 0.0)
#> Warning: Stratifying groups that make up 0% of the data may be statistically risky.
#> • Consider increasing `pool` to at least 0.1
split
#> <Analysis/Assess/Total>
#> <50/50/100>
training(split) %>% arrange(strat)
#> # A tibble: 50 × 2
#> value strat
#> <int> <fct>
#> 1 1 1
#> 2 4 2
#> 3 5 3
#> 4 8 4
#> 5 10 5
#> 6 12 6
#> 7 13 7
#> 8 16 8
#> 9 17 9
#> 10 20 10
#> # … with 40 more rows
testing(split) %>% arrange(strat)
#> # A tibble: 50 × 2
#> value strat
#> <int> <fct>
#> 1 2 1
#> 2 3 2
#> 3 6 3
#> 4 7 4
#> 5 9 5
#> 6 11 6
#> 7 14 7
#> 8 15 8
#> 9 18 9
#> 10 19 10
#> # … with 40 more rows
Created on 2022-02-22 by the reprex package (v2.0.1)
We really don't recommend turning pool down to zero like this, but you can do it here to see how the strata and prop arguments work.
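For the "1 row in every four" case, one option that sidesteps rsample entirely is to build the bins by hand and draw one row per bin with dplyr. This is only a sketch under that assumption, not a description of how initial_split works internally; the names k, bin, sampled and held_out are made up for illustration.

```r
library(dplyr)

set.seed(123)  # only to make the random draw reproducible

k <- 4  # hypothetical bin width: one sampled row per k consecutive rows
dat <- tibble(value = 1:100) %>%
  mutate(bin = (row_number() - 1) %/% k)  # rows 1-4 -> bin 0, rows 5-8 -> bin 1, ...

# draw exactly one row at random from every bin
sampled <- dat %>%
  group_by(bin) %>%
  slice_sample(n = 1) %>%
  ungroup()

# everything that was not drawn
held_out <- anti_join(dat, sampled, by = "value")

nrow(sampled)   # 25
nrow(held_out)  # 75
```

If the number of rows is not a multiple of k, the last bin is simply smaller and still contributes one row.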

Related

In R, how to copy an object

In R, how do I refer to an object by its name in code? For example, there is an object 'df1' (a data frame) in memory:
library(tidyverse)
df1 <- data.frame(dt=c(1:100))
df1_copy <- sym(paste0("df","1"))
df1_copy is not the same as df1: df1_copy is a "symbol" and its value is "mt1". How can I fix it? Thanks!
If you want to programmatically make copies of objects in your environment, you can go along the lines of the second answer to this post.
df1 <- data.frame(v1 = 1:10)
df2 <- data.frame(V1 = 11:20)
original.objects <- ls(pattern="df[0-9]+")
for(i in 1:length(original.objects)){
assign(paste0("copy_", original.objects[i]), eval(as.name(original.objects[i])))
}
ls()
#> [1] "copy_df1" "copy_df2" "df1" "df2"
#> [5] "i" "original.objects"
print(list(df1, df2, copy_df1, copy_df2))
#> [[1]]
#> v1
#> 1 1
#> 2 2
#> 3 3
#> 4 4
#> 5 5
#> 6 6
#> 7 7
#> 8 8
#> 9 9
#> 10 10
#>
#> [[2]]
#> V1
#> 1 11
#> 2 12
#> 3 13
#> 4 14
#> 5 15
#> 6 16
#> 7 17
#> 8 18
#> 9 19
#> 10 20
#>
#> [[3]]
#> v1
#> 1 1
#> 2 2
#> 3 3
#> 4 4
#> 5 5
#> 6 6
#> 7 7
#> 8 8
#> 9 9
#> 10 10
#>
#> [[4]]
#> V1
#> 1 11
#> 2 12
#> 3 13
#> 4 14
#> 5 15
#> 6 16
#> 7 17
#> 8 18
#> 9 19
#> 10 20
Created on 2023-01-12 with reprex v2.0.2
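For the single-object case in the question, a shorter route is base R's get(), which looks an object up by a name held in a character string. A minimal sketch, assuming df1 lives in the current environment (df2 and copies are added here only for illustration):

```r
df1 <- data.frame(dt = 1:100)
df2 <- data.frame(dt = 101:200)

# get() returns the object whose name is given as a string
df1_copy <- get(paste0("df", "1"))
identical(df1_copy, df1)  # TRUE

# mget() does the same for several names at once and returns a named list
copies <- mget(paste0("df", 1:2))
names(copies)  # "df1" "df2"
```

sym() only builds the name; nothing is looked up until that name is evaluated, which is why df1_copy in the question ends up as a symbol rather than a data frame.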

How to find average time points difference in longitudinal data

I have longitudinal data on the body weights of over 100K participants. The time points of the weight measurements differ between participants. I want to know the average time difference between the 1st and 2nd measurement, between the 2nd and 3rd measurement, and so on. I also want to know how many participants (or what percentage) have 3 body weight measurements, and likewise for 4, 5, 6, 7, 8, etc. How can I find these answers in R?
Perhaps something like this:
library(dplyr, warn.conflicts = FALSE)
set.seed(1)
# generate some sample data
dates <- seq(as.Date("2000-01-01"), by = "day", length.out = 500)
sample_data <- tibble(
participant_id = sample(1:1000, size = 5000, replace = TRUE),
meas_date = sample(dates, size = 5000, replace = TRUE)) %>%
arrange(participant_id, meas_date)
sample_data
#> # A tibble: 5,000 × 2
#> participant_id meas_date
#> <int> <date>
#> 1 1 2000-01-18
#> 2 1 2000-02-28
#> 3 1 2000-05-15
#> 4 1 2001-02-01
#> 5 2 2000-05-11
#> 6 3 2000-01-22
#> 7 3 2000-03-27
#> 8 3 2000-04-17
#> 9 3 2000-09-23
#> 10 3 2000-12-13
#> # … with 4,990 more rows
# periods between each measurement for each participant
meas_periods <- sample_data %>%
group_by(participant_id) %>%
mutate(meas_n = row_number(),
date_diff = meas_date - lag(meas_date)) %>%
ungroup()
meas_periods
#> # A tibble: 5,000 × 4
#> participant_id meas_date meas_n date_diff
#> <int> <date> <int> <drtn>
#> 1 1 2000-01-18 1 NA days
#> 2 1 2000-02-28 2 41 days
#> 3 1 2000-05-15 3 77 days
#> 4 1 2001-02-01 4 262 days
#> 5 2 2000-05-11 1 NA days
#> 6 3 2000-01-22 1 NA days
#> 7 3 2000-03-27 2 65 days
#> 8 3 2000-04-17 3 21 days
#> 9 3 2000-09-23 4 159 days
#> 10 3 2000-12-13 5 81 days
#> # … with 4,990 more rows
# average period between meas_n-1 and meas_n
meas_periods %>%
group_by(meas_n) %>%
summarise(mean_duration = mean(date_diff))
#> # A tibble: 13 × 2
#> meas_n mean_duration
#> <int> <drtn>
#> 1 1 NA days
#> 2 2 88.54102 days
#> 3 3 86.16762 days
#> 4 4 76.21154 days
#> 5 5 69.11392 days
#> 6 6 67.16798 days
#> 7 7 50.67089 days
#> 8 8 50.91111 days
#> 9 9 49.89873 days
#> 10 10 48.70588 days
#> 11 11 51.00000 days
#> 12 12 26.25000 days
#> 13 13 66.00000 days
# number and proportion of participants who reached at least meas_n measurements
meas_periods %>%
count(meas_n, name = "participant_n") %>%
mutate(percent = participant_n/max(participant_n))
#> # A tibble: 13 × 3
#> meas_n participant_n percent
#> <int> <int> <dbl>
#> 1 1 996 1
#> 2 2 963 0.967
#> 3 3 877 0.881
#> 4 4 728 0.731
#> 5 5 553 0.555
#> 6 6 381 0.383
#> 7 7 237 0.238
#> 8 8 135 0.136
#> 9 9 79 0.0793
#> 10 10 34 0.0341
#> 11 11 12 0.0120
#> 12 12 4 0.00402
#> 13 13 1 0.00100
Created on 2022-11-02 with reprex v2.0.2
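The last table above counts participants who reached at least meas_n measurements. If what is wanted is the number of participants with exactly 3, 4, 5, ... measurements, one possible variation is to count rows per participant first and then count participants per total. The sketch below uses a tiny made-up data set (toy) so the result is easy to check by hand; n_meas and exact_counts are illustrative names.

```r
library(dplyr)

# made-up data: participant 1 has 3 measurements, 2 has 1, 3 has 2
toy <- tibble(
  participant_id = c(1, 1, 1, 2, 3, 3),
  meas_date = as.Date(c("2000-01-18", "2000-02-28", "2000-05-15",
                        "2000-05-11", "2000-01-22", "2000-03-27"))
)

exact_counts <- toy %>%
  count(participant_id, name = "n_meas") %>%   # measurements per participant
  count(n_meas, name = "participant_n") %>%    # participants per measurement total
  mutate(percent = 100 * participant_n / sum(participant_n))

exact_counts
```

Here each of the three totals (1, 2 and 3 measurements) is held by one participant, so every row gets one third of the participants.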

In R, is there any available function like the IFERROR formula in Excel?

In R, is there any available function like the IFERROR formula in Excel?
I want to calculate a moving average using the 4 nearest figures, but if a group has fewer than 4 figures, I want to use the normal average instead.
See the code below for details; IF_ERROR is just the function I wish existed, and it does not work.
library(tidyverse)
library(TTR)
test_data <- data.frame(category=c('a','a','a','b','b','b','b','b','b'),
amount=c(1,2,3,4,5,6,7,8,9))
test_data %>% group_by(category) %>% mutate(avg_amount=IF_ERROR(TTR::runMedian(amount,4),
median(amount),
TTR::runMedian(amount,4)))
In general, input should only generate errors in exceptional circumstances. It can be computationally expensive to catch and handle errors where a simple if statement will suffice. The key here is realising that runMedian throws an error if the group size is less than 4. Remember we can check the group size inside mutate by using n(), so all you need do is:
test_data %>%
group_by(category) %>%
mutate(avg_amount = if(n() > 3) TTR::runMedian(amount, 4) else median(amount))
#> # A tibble: 9 x 3
#> # Groups: category [2]
#> category amount avg_amount
#> <chr> <dbl> <dbl>
#> 1 a 1 2
#> 2 a 2 2
#> 3 a 3 2
#> 4 b 4 NA
#> 5 b 5 NA
#> 6 b 6 NA
#> 7 b 7 5.5
#> 8 b 8 6.5
#> 9 b 9 7.5
Additionally, if you want to replace the NA values from the beginning of the running median, you could use ifelse:
test_data %>%
group_by(category) %>%
mutate(avg_amount = if(n() > 3) TTR::runMedian(amount, 4) else median(amount),
avg_amount = ifelse(is.na(avg_amount), median(amount), avg_amount))
#> # A tibble: 9 x 3
#> # Groups: category [2]
#> category amount avg_amount
#> <chr> <dbl> <dbl>
#> 1 a 1 2
#> 2 a 2 2
#> 3 a 3 2
#> 4 b 4 6.5
#> 5 b 5 6.5
#> 6 b 6 6.5
#> 7 b 7 5.5
#> 8 b 8 6.5
#> 9 b 9 7.5
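If a literal IFERROR-style helper is still wanted, base R's tryCatch() can be wrapped in a small function. The helper below, if_error, is a hypothetical name and not part of base R or any package; it is a sketch of the idea rather than a recommended replacement for the if (n() > 3) check above.

```r
# hypothetical helper, not part of base R or any package
if_error <- function(expr, fallback) {
  # expr is a lazily evaluated argument, so it is first evaluated inside
  # tryCatch() and any error it raises is caught there
  tryCatch(expr, error = function(e) fallback)
}

if_error(stop("boom"), NA)  # NA: the error is replaced by the fallback
if_error(sqrt(16), NA)      # 4: no error, so the value is returned as is
if_error(log("a"), 0)       # 0: log() of a character string is an error
```

This only intercepts errors; warnings (for example from sqrt(-1)) pass through unchanged.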

R+dplyr: conditionally swap the elements of two columns

Consider the dataframe df at the end of the post.
I simply would like to swap the elements of columns x and y whenever x>y.
There may be other columns in the dataframe which I do not want to touch.
In a sense, I would like to sort row wise the columns x and y.
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
df<-tibble(x=1:10, y=10:1, extra=LETTERS[1:10])
df
#> # A tibble: 10 × 3
#> # Rowwise:
#> x y extra
#> <int> <int> <chr>
#> 1 1 10 A
#> 2 2 9 B
#> 3 3 8 C
#> 4 4 7 D
#> 5 5 6 E
#> 6 6 5 F
#> 7 7 4 G
#> 8 8 3 H
#> 9 9 2 I
#> 10 10 1 J
Created on 2021-10-06 by the reprex package (v2.0.1)
Base solution:
Use which(df$x > df$y) to find the rows you want to change, then use rev to swap the values in those rows:
df[which(df$x > df$y), c("x", "y")] <- rev(df[which(df$x > df$y), c("x", "y")])
df
# x y extra
# <int> <int> <chr>
# 1 1 10 A
# 2 2 9 B
# 3 3 8 C
# 4 4 7 D
# 5 5 6 E
# 6 5 6 F
# 7 4 7 G
# 8 3 8 H
# 9 2 9 I
# 10 1 10 J
Thanks everyone!
I wrote a small function which does what I need and generalizes to the case of multiple variables.
See the reprex
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
set.seed(1234)
set_colnames <- `colnames<-`
df<-tibble(x=1:10, y=10:1, z=rnorm(10), extra=LETTERS[1:10]) %>%
rowwise()
df
#> # A tibble: 10 × 4
#> # Rowwise:
#> x y z extra
#> <int> <int> <dbl> <chr>
#> 1 1 10 -1.21 A
#> 2 2 9 0.277 B
#> 3 3 8 1.08 C
#> 4 4 7 -2.35 D
#> 5 5 6 0.429 E
#> 6 6 5 0.506 F
#> 7 7 4 -0.575 G
#> 8 8 3 -0.547 H
#> 9 9 2 -0.564 I
#> 10 10 1 -0.890 J
sort_rows <- function(df, col_names, dec=FALSE){
temp <- df %>%
select(all_of(col_names))
extra_names <- setdiff(colnames(df), col_names)
temp2 <- df %>%
select(all_of(extra_names))
res <- t(apply(temp, 1, sort, decreasing=dec)) %>%
as_tibble %>%
set_colnames(col_names) %>%
bind_cols(temp2)
return(res)
}
col_names <- c("x", "y", "z")
df_s <- df %>%
sort_rows(col_names, dec=FALSE)
#> Warning: The `x` argument of `as_tibble.matrix()` must have unique column names if `.name_repair` is omitted as of tibble 2.0.0.
#> Using compatibility `.name_repair`.
df_s
#> # A tibble: 10 × 4
#> x y z extra
#> <dbl> <dbl> <dbl> <chr>
#> 1 -1.21 1 10 A
#> 2 0.277 2 9 B
#> 3 1.08 3 8 C
#> 4 -2.35 4 7 D
#> 5 0.429 5 6 E
#> 6 0.506 5 6 F
#> 7 -0.575 4 7 G
#> 8 -0.547 3 8 H
#> 9 -0.564 2 9 I
#> 10 -0.890 1 10 J
Created on 2021-10-06 by the reprex package (v2.0.1)
This looks like sorting to me:
library(tidyverse)
df <- tibble(x=1:10, y=10:1, extra=LETTERS[1:10])
df
#> # A tibble: 10 x 3
#> x y extra
#> <int> <int> <chr>
#> 1 1 10 A
#> 2 2 9 B
#> 3 3 8 C
#> 4 4 7 D
#> 5 5 6 E
#> 6 6 5 F
#> 7 7 4 G
#> 8 8 3 H
#> 9 9 2 I
#> 10 10 1 J
extra_cols <- df %>% colnames() %>% setdiff(c("x", "y"))
extra_cols
#> [1] "extra"
df %>%
mutate(row = row_number()) %>%
pivot_longer(-c(row, extra_cols)) %>%
group_by_at(c("row", extra_cols)) %>%
transmute(
value = value %>% sort(),
name = c("x", "y"),
) %>%
pivot_wider() %>%
ungroup() %>%
select(-row)
#> Note: Using an external vector in selections is ambiguous.
#> ℹ Use `all_of(extra_cols)` instead of `extra_cols` to silence this message.
#> ℹ See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
#> This message is displayed once per session.
#> # A tibble: 10 x 3
#> extra x y
#> <chr> <int> <int>
#> 1 A 1 10
#> 2 B 2 9
#> 3 C 3 8
#> 4 D 4 7
#> 5 E 5 6
#> 6 F 5 6
#> 7 G 4 7
#> 8 H 3 8
#> 9 I 2 9
#> 10 J 1 10
Created on 2021-10-06 by the reprex package (v2.0.1)
Try using apply on axis 1 and transpose it with t, then use as_tibble to convert it to a tibble.
Then finally change the column names:
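> # keep only x and y, otherwise the character column coerces every value to character
> df <- as_tibble(t(apply(df[c("x", "y")], 1, sort)))
> names(df) <- c('x', 'y')
> df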
# A tibble: 10 x 2
x y
<int> <int>
1 1 10
2 2 9
3 3 8
4 4 7
5 5 6
6 5 6
7 4 7
8 3 8
9 2 9
10 1 10
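For just two columns, the swap can also be written without any row-wise sorting: the smaller value of each pair is pmin(x, y) and the larger is pmax(x, y). A minimal dplyr sketch, where lo and hi are temporary helper columns introduced only for this example:

```r
library(dplyr)

df <- tibble(x = 1:10, y = 10:1, extra = LETTERS[1:10])

swapped <- df %>%
  mutate(lo = pmin(x, y),   # element-wise minimum of the pair
         hi = pmax(x, y),   # element-wise maximum of the pair
         x = lo,
         y = hi) %>%
  select(-lo, -hi)          # drop the helpers; other columns are untouched

swapped
```

Computing lo and hi before overwriting x and y matters, because mutate() evaluates its arguments in order and later expressions see the already modified columns.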

Create a "mean rank" for a rank-frequency data.frame in R

We use tidytext to generate a rank column for a data.frame.
What we want is another "mean rank" column for the data.frame, in which tied frequencies are replaced with the average of their ranks.
Is there an easy way to generate this column?
library(tidyverse)
library(tidytext)
dt <-data.frame(frequency=c(64,58,54,32,29,29,25,17,17,15,12,12,10))
dt %>% arrange(desc(frequency))%>%
mutate(rank = row_number())
Sure, just group by frequency:
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
dt <-data.frame(frequency=c(64,58,54,32,29,29,25,17,17,15,12,12,10))
dt %>% arrange(desc(frequency))%>%
mutate(rank = row_number()) %>%
group_by(frequency) %>%
mutate(mean_rank = mean(rank)) %>%
ungroup()
#> # A tibble: 13 × 3
#> frequency rank mean_rank
#> <dbl> <int> <dbl>
#> 1 64 1 1
#> 2 58 2 2
#> 3 54 3 3
#> 4 32 4 4
#> 5 29 5 5.5
#> 6 29 6 5.5
#> 7 25 7 7
#> 8 17 8 8.5
#> 9 17 9 8.5
#> 10 15 10 10
#> 11 12 11 11.5
#> 12 12 12 11.5
#> 13 10 13 13
Use the built-in rank() function for this; its default ties.method = "average" gives the mean rank directly.
dt <- within(dt, {
rank=rank(-frequency, ties.method="first")
mean_rank=rank(-frequency)
})
dt
# frequency mean_rank rank
# 1 64 1.0 1
# 2 58 2.0 2
# 3 54 3.0 3
# 4 32 4.0 4
# 5 29 5.5 5
# 6 29 5.5 6
# 7 25 7.0 7
# 8 17 8.5 8
# 9 17 8.5 9
# 10 15 10.0 10
# 11 12 11.5 11
# 12 12 11.5 12
# 13 10 13.0 13
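The difference between rank() and order() only shows up when the data are not already sorted, which is easy to miss with the example above. A short sketch on a made-up, unsorted vector (freq is an illustrative name):

```r
# made-up frequencies, deliberately not sorted
freq <- c(29, 64, 17, 29, 58, 17)

# default ties.method = "average": tied values share the mean of their ranks
rank(-freq)
# 3.5 1.0 5.5 3.5 2.0 5.5

# ties.method = "first": ties are broken by position, giving ranks 1..n
rank(-freq, ties.method = "first")
# 3 1 5 4 2 6

# order() is something else: the indices that would sort the vector
order(-freq)
# 2 5 1 4 3 6
```

On data already sorted in decreasing order, order() and rank(ties.method = "first") happen to coincide, which is why either produces the same rank column for dt.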
