I'm trying to randomize the order in which 40 participants receive 6 drinks (one drink per day). I want to ensure that every participant gets each drink exactly once, and that each drink occurs roughly the same number of times across participants on each day.
I create the data, with participants in columns and days in rows.
library(ggplot2)
set.seed(123)
random_order <- as.data.frame(replicate(40, sample(1:6, 6, replace = FALSE)))
random_order$trial <- c(1:6)
random_order
Then I check the number of occurrences of each drink within each row / trial, which shows that the frequency of different drinks within trials is not uniform:
tidyr::pivot_longer(random_order, cols = c(1:40),
names_to = "participant", values_to = "drink_order") |>
dplyr::group_by(trial, drink_order) |>
dplyr::summarise(count = dplyr::n())
# # A tibble: 36 × 3
# # Groups: trial [6]
# trial drink_order count
# <int> <int> <int>
# 1 1 1 9
# 2 1 2 8
# 3 1 3 8
# 4 1 4 4
# 5 1 5 5
# 6 1 6 6
# 7 2 1 7
# 8 2 2 4
# 9 2 3 10
# 10 2 4 7
# # … with 26 more rows
and look at it with a density plot:
tidyr::pivot_longer(random_order, cols = c(1:40),
names_to = "participant", values_to = "drink_order") |>
dplyr::group_by(trial, drink_order) |>
dplyr::summarise(count = dplyr::n()) |>
ggplot(aes(count)) +
geom_density()
Basically, I want a very narrow distribution of counts. How can I generate the data so that the count column above has a small range?
Thanks!
You’re looking for a variation on a Latin square: an n × n arrangement of n symbols in which each symbol occurs exactly once in every row and exactly once in every column. You can generate random Latin squares using agricolae::design.lsd(). In your case, instead of once per row, you want each drink to occur roughly the same number of times per row, which you can do by binding together multiple Latin squares.
library(agricolae)
set.seed(123)
# to get 40 columns, first get 7 Latin squares
# (7 squares x 6 columns per square = 42 columns)
orders <- replicate(
7,
design.lsd(1:6)$sketch,
simplify = FALSE
)
# then column-bind and subset to 40 columns
random_order <- data.frame(do.call(cbind, orders))[, 1:40]
random_order$trial <- c(1:6)
Using the code from your question, we can see that all trials include 6 or 7 of each drink:
# A tibble: 36 × 3
# Groups: trial [6]
trial drink_order count
<int> <chr> <int>
1 1 1 7
2 1 2 7
3 1 3 7
4 1 4 6
5 1 5 6
6 1 6 7
7 2 1 7
8 2 2 6
9 2 3 6
10 2 4 7
# … with 26 more rows
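If you would rather not depend on agricolae, the same idea can be sketched in base R. This is only a rough sketch: `random_latin_square()` is a hypothetical helper that shuffles the rows, columns and symbols of a cyclic Latin square, which keeps the Latin property but does not sample uniformly from all possible Latin squares.

```r
set.seed(123)
k <- 6        # number of drinks (and days)
n_part <- 40  # number of participants

# Hypothetical helper: shuffle rows, columns and symbols of a cyclic Latin square
random_latin_square <- function(k) {
  base <- outer(0:(k - 1), 0:(k - 1), function(i, j) (i + j) %% k + 1)
  sq <- base[sample(k), sample(k)]   # permute rows and columns
  matrix(sample(k)[sq], nrow = k)    # relabel the symbols
}

# Bind enough squares to cover all participants, then keep the first n_part columns
n_sq <- ceiling(n_part / k)
squares <- replicate(n_sq, random_latin_square(k), simplify = FALSE)
orders <- do.call(cbind, squares)[, seq_len(n_part)]

# Rows are days, columns are participants; count each drink within each day
counts <- apply(orders, 1, function(day) tabulate(day, nbins = k))
range(counts)
```

Every column is still a permutation of the 6 drinks, and because only the last square is truncated, the per-day count of any drink can differ by at most one.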
I have a longitudinal dataset, where the same subjects are measured on different occasions in time.
For instance:
dd=data.frame(subject_id=c(1,1,1,2,2,2,3,3,4,5,6,7,8,8,9,9),income=c(rnorm(16,50000,250)))
I need something that tells me how many subjects have been observed only once, twice, three times, and so on. In the example above, the number of subjects measured on only one occasion is 4, the number of subjects measured twice is 3, etc.
Here is my attempt at counting, for instance, how many subjects have been measured exactly twice:
library(dplyr)
s.two=dd %>% group_by(subject_id) %>% filter(n() == 2) %>% ungroup()
length(s.two$subject_id)/2
But since I have very heterogeneous clusters (ranging from 1 to 24 observations per subject), this would mean writing plenty of lines. Is there something more efficient I can do?
The objective is to have a summary of the cluster sizes (a cluster being a subject_id). For instance, let's say I have 1,000 clusters. I want to know how many of them are made up of subjects observed just once, twice, and so on. For example: 50 out of 1,000 clusters are subjects observed on just one occasion; 300 out of 1,000 clusters are subjects observed on just two occasions, etc.
With this info, I will construct a table to add to my report.
You should use summarize. After this you can still filter with filter(n == 2).
library(dplyr)
dd <- data.frame(
subject_id = c(1, 1, 1, 2, 2, 2, 3, 3, 4, 5, 6, 7, 8, 8, 9, 9),
income = c(rnorm(16, 50000, 250))
)
dd |>
group_by(subject_id) |>
summarise(n = n())
#> # A tibble: 9 × 2
#> subject_id n
#> <dbl> <int>
#> 1 1 3
#> 2 2 3
#> 3 3 2
#> 4 4 1
#> 5 5 1
#> 6 6 1
#> 7 7 1
#> 8 8 2
#> 9 9 2
If you use mutate instead of summarise and then filter, you keep the original rows:
dd |>
group_by(subject_id) |>
mutate(n = n()) |>
filter(n == 2)
subject_id income n
<dbl> <dbl> <int>
1 3 49675. 2
2 3 50306. 2
3 8 49879. 2
4 8 50202. 2
5 9 49783. 2
6 9 49834. 2
NEW EDIT
Maybe you mean this:
dd |>
group_by(subject_id) |>
summarise(n = n()) |>
mutate(info = glue::glue(
'There are {n} times {subject_id} out of {n_distinct(subject_id)} groups')) |>
select(info)
# A tibble: 9 × 1
info
<glue>
1 There are 3 times 1 out of 9 groups
2 There are 3 times 2 out of 9 groups
3 There are 2 times 3 out of 9 groups
4 There are 1 times 4 out of 9 groups
5 There are 1 times 5 out of 9 groups
6 There are 1 times 6 out of 9 groups
7 There are 1 times 7 out of 9 groups
8 There are 2 times 8 out of 9 groups
9 There are 2 times 9 out of 9 groups
Next, which would be @Ritchie Sacramento's solution:
dd |>
group_by(subject_id) |>
summarise(no_of_occurences = n()) |>
count(no_of_occurences)
# A tibble: 3 × 2
no_of_occurences n
<int> <int>
1 1 4
2 2 3
3 3 2
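For what it's worth, the same summary can be sketched in base R by tabulating twice: once to get the number of observations per subject, and once more to count how many subjects share each cluster size. The names `per_subject` and `cluster_sizes` are just illustrative.

```r
subject_id <- c(1, 1, 1, 2, 2, 2, 3, 3, 4, 5, 6, 7, 8, 8, 9, 9)

per_subject <- table(subject_id)     # observations per subject
cluster_sizes <- table(per_subject)  # number of subjects for each cluster size
cluster_sizes

# As a data frame for a report, plus the share of clusters of each size
as.data.frame(cluster_sizes, responseName = "n_subjects")
prop.table(cluster_sizes)
```

Here 4 subjects are observed once, 3 twice and 2 three times, matching the dplyr result above.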
I have the following code:
Ni <- 133 # number of individuals
MXmeas <- 10 # number of measurements
# simulate number of observations for each individual
Nmeas <- round(runif(Ni, 1, MXmeas))
# simulate observation moments (under the assumption that everybody has at least one observation)
obs <- unlist(sapply(Nmeas, function(x) c(1, sort(sample(2:MXmeas, x-1, replace = FALSE)))))
# set up dataframe (id, observations)
dat <- data.frame(ID = rep(1:Ni, times = Nmeas), observations = obs)
This results in the following output:
ID observations
1 1
1 3
1 4
1 5
1 6
1 8
However, I also want a variable 'times' that numbers the measurements within each individual (1 for the first, 2 for the second, and so on). Since every ID has a different number of rows, I am not sure how to implement this. Does anybody know how to do that? I want it to look like this:
ID observations times
1 1 1
1 3 2
1 4 3
1 5 4
1 6 5
1 8 6
Using dplyr you could group by ID and use the row number for times:
library(dplyr)
dat |>
group_by(ID) |>
mutate(times = row_number()) |>
ungroup()
With base R, we could build the sequence from the run lengths of the ID variable:
dat$times <- sequence(rle(dat$ID)$lengths)
Output:
# A tibble: 734 × 3
ID observations times
<int> <dbl> <int>
1 1 1 1
2 1 3 2
3 1 9 3
4 2 1 1
5 2 5 2
6 2 6 3
7 2 8 4
8 3 1 1
9 3 2 2
10 3 5 3
# … with 724 more rows
Data (using a seed):
set.seed(1)
Ni <- 133 # number of individuals
MXmeas <- 10 # number of measurements
# simulate number of observations for each individual
Nmeas <- round(runif(Ni, 1, MXmeas))
# simulate observation moments (under the assumption that everybody has at least one observation)
obs <- unlist(sapply(Nmeas, function(x) c(1, sort(sample(2:MXmeas, x-1, replace = FALSE)))))
# set up dataframe (id, observations)
dat <- data.frame(ID = rep(1:Ni, times = Nmeas), observations = obs)
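One caveat on the rle() approach: it counts within runs, so it assumes all rows of an ID are contiguous, as they are in the simulated data. If that is not guaranteed, a sketch with ave() groups by the ID values themselves. The small data frame below is made up purely for illustration.

```r
# Made-up data where the rows of ID 1 and ID 2 are interleaved
dat <- data.frame(
  ID = c(1, 1, 2, 1, 2, 3),
  observations = c(1, 3, 1, 4, 5, 1)
)

# Number the rows within each ID, regardless of row order
dat$times <- ave(seq_along(dat$ID), dat$ID, FUN = seq_along)

# For comparison: the run-length version restarts at every new run
dat$times_rle <- sequence(rle(dat$ID)$lengths)
dat
```

On sorted data both columns agree; on interleaved data only the ave() version keeps counting across the gaps.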
I have two data frames, df1 and df2, whose column names are dates. When I join them I get columns like
date1.x, date1.y, date2.x, date2.y, date3.x, date3.y, date4.x, date4.y, ...
I want to create new columns whose values are the product of date1.x and date1.y, and likewise for every other date pair.
df <- data.frame(id=11:13, date1.x=1:3, date2.x=4:6, date1.y=7:9, date2.y=10:12)
df
# id date1.x date2.x date1.y date2.y
# 1 11 1 4 7 10
# 2 12 2 5 8 11
# 3 13 3 6 9 12
grep("^date.*\\.x$", colnames(df), value = TRUE)
# [1] "date1.x" "date2.x"
datenms <- grep("^date.*\\.x$", colnames(df), value = TRUE)
### make sure all of our 'date#.x' columns have matching 'date#.y' columns
datenms <- datenms[ gsub("x$", "y", datenms) %in% colnames(df) ]
datenms
# [1] "date1.x" "date2.x"
subset(df, select = datenms)
# date1.x date2.x
# 1 1 4
# 2 2 5
# 3 3 6
subset(df, select = gsub("x$", "y", datenms))
# date1.y date2.y
# 1 7 10
# 2 8 11
# 3 9 12
subset(df, select = datenms) * subset(df, select = gsub("x$", "y", datenms))
# date1.x date2.x
# 1 7 40
# 2 16 55
# 3 27 72
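To actually store the products as new columns, the same name matching can be reused on the left-hand side of an assignment. This is a sketch only, and the `.prod` suffix is an arbitrary naming choice of mine, not something from the question.

```r
df <- data.frame(id = 11:13, date1.x = 1:3, date2.x = 4:6,
                 date1.y = 7:9, date2.y = 10:12)

# Find the .x columns and derive the names of their .y partners
x_cols <- grep("^date.*\\.x$", colnames(df), value = TRUE)
y_cols <- sub("\\.x$", ".y", x_cols)

# Keep only pairs where both columns exist
keep <- y_cols %in% colnames(df)
x_cols <- x_cols[keep]
y_cols <- y_cols[keep]

# Hypothetical naming scheme for the new columns: date1.prod, date2.prod, ...
new_cols <- sub("\\.x$", ".prod", x_cols)
df[new_cols] <- df[x_cols] * df[y_cols]
df
```

The multiplication is element-wise between two data frames of equal shape, so the order of `x_cols` and `y_cols` must line up, which the `sub()` call guarantees here.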
There are a number of ways to do this, but I suggest it is good practice to get used to transforming your data into a format that is easy to work with. The first answer showed you one way to do what you want without transforming your data. My answer shows how to transform the data so that calculations (this one and others) are easy, and then how to perform the calculation once the data is tidy.
Making your data tidy helps to perform easier aggregations, to graph results, to perform feature engineering for models, etc.
library(dplyr)
library(tidyr)
df <- data.frame(id=11:13, date1.x=1:3, date2.x=4:6, date1.y=7:9, date2.y=10:12)
df
# id date1.x date2.x date1.y date2.y
# 1 11 1 4 7 10
# 2 12 2 5 8 11
# 3 13 3 6 9 12
# Convert the data to a tidy format that is easier for computers to calculate
tidy_df <- df %>%
pivot_longer(
cols = starts_with("date"), # We are tidying any column starting with date
names_to = c("date_num","date_source"), # creating two columns for names
values_to = c("date_value"), # creating one column for values
names_prefix = "date", # removing the "date" prefix
names_sep = "\\." # splitting the names on the period `.`
)
tidy_df
# id date_num date_source date_value
# <int> <chr> <chr> <int>
# 1 11 1 x 1
# 2 11 2 x 4
# 3 11 1 y 7
# 4 11 2 y 10
# 5 12 1 x 2
# 6 12 2 x 5
# 7 12 1 y 8
# 8 12 2 y 11
# 9 13 1 x 3
# 10 13 2 x 6
# 11 13 1 y 9
# 12 13 2 y 12
# Now that the data is tidy we can do easier dataframe grouping and aggregation
tidy_df %>%
group_by(id,date_num) %>%
summarise(date_value_mult = prod(date_value)) %>%
ungroup()
# id date_num date_value_mult
# <int> <chr> <dbl>
# 1 11 1 7
# 2 11 2 40
# 3 12 1 16
# 4 12 2 55
# 5 13 1 27
# 6 13 2 72
# If/When you eventually want the data in a more human readable format you can
# pivot the data back into a human readable format. This is likely after all
# computer calculations are done and you want to present the data. For storing
# the data (such as in a database) you would not need/want this step.
tidy_df %>%
group_by(id,date_num) %>%
summarise(date_value_mult = prod(date_value)) %>%
ungroup() %>%
pivot_wider(
names_from = date_num,
values_from = date_value_mult,
names_prefix = "date"
)
# id date1 date2
# <int> <dbl> <dbl>
# 1 11 7 40
# 2 12 16 55
# 3 13 27 72
I am trying to expand an existing dataset, which currently looks like this:
df <- tibble(
site = letters[1:3],
years = rep(4, 3),
tr = c(3, 6, 4)
)
tr is the total number of replicates for each site/year combination. I simply want to add in the replicates and later the response variable for each replicate. This was easy for a single site/year combination using the following function:
f <- function(site=NULL, years=NULL, t=NULL){
df <- tibble(
site = rep(site, each = t, times= years),
tr = rep(1:t, times = years),
year = rep(1:years, each = t)
)
df
}
# For one site:
f(site='a', years=4, t=3)
# Producing this:
# # A tibble: 12 x 3
# site tr year
# <chr> <int> <int>
# 1 a 1 1
# 2 a 2 1
# 3 a 3 1
# 4 a 1 2
# 5 a 2 2
# 6 a 3 2
# 7 a 1 3
# 8 a 2 3
# 9 a 3 3
# 10 a 1 4
# 11 a 2 4
# 12 a 3 4
How can the function be applied to each row of the input dataframe to produce the final dataframe? One of the apply functions in base R or pmap_df() from the purrr package would seem ideal, but being unfamiliar with how these functions work, all my efforts have only produced errors.
If we want to apply the same function to each row, use pmap:
library(purrr)
pmap_dfr(df, ~ f(..1, ..2, ..3))
# A tibble: 52 x 3
# site tr year
# * <chr> <int> <int>
# 1 a 1 1
# 2 a 2 1
# 3 a 3 1
# 4 a 1 2
# 5 a 2 2
# 6 a 3 2
# 7 a 1 3
# 8 a 2 3
# 9 a 3 3
#10 a 1 4
# … with 42 more rows
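Another option was condense() from the development version of dplyr at the time. As far as I know it did not make it into a released version; rowwise() followed by summarise() or reframe() is the closest current equivalent: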
library(tidyr)
df %>%
group_by(rn = row_number()) %>%
condense(out = f(site, years, tr)) %>%
unnest(c(out))
Or in base R, we can also use do.call with Map:
do.call(rbind, do.call(Map, c(f, unname(as.data.frame(df)))))
Well, in base R you could do:
do.call(rbind,do.call(Vectorize(f,SIMPLIFY = FALSE),unname(df)))
# A tibble: 52 x 3
site tr year
* <chr> <int> <int>
1 a 1 1
2 a 2 1
3 a 3 1
4 a 1 2
5 a 2 2
6 a 3 2
7 a 1 3
8 a 2 3
9 a 3 3
10 a 1 4
# ... with 42 more rows
do.call(rbind, lapply(split(df, df$site), function(x){
with(x, data.frame(site,
years = rep(sequence(years), each = tr),
tr = rep(sequence(tr), years)))
}))
We can use Map to apply f to every value of site, years and tr.
do.call(rbind, Map(f, df$site, df$years, df$tr))
# A tibble: 52 x 3
# site tr year
# * <chr> <int> <int>
# 1 a 1 1
# 2 a 2 1
# 3 a 3 1
# 4 a 1 2
# 5 a 2 2
# 6 a 3 2
# 7 a 1 3
# 8 a 2 3
# 9 a 3 3
#10 a 1 4
# … with 42 more rows
Akrun's answer worked well for me, so I modified it to make the function to be applied to each row of the dataframe a little more explicit:
df1 <- pmap_df(df, function(site, years, tr){
site = rep(site, each = tr, times=years)
year = rep(1:years, each = tr)
tr = rep(1:tr, times=years)
return(tibble(site, year, tr))
})
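For completeness, here is a base R sketch of the same row-wise expansion that builds each block with expand.grid() instead of nested rep() calls. `expand_site()` is a hypothetical helper name, and the input is rebuilt as a plain data frame so the example runs without the tidyverse.

```r
df <- data.frame(
  site = letters[1:3],
  years = rep(4, 3),
  tr = c(3, 6, 4),
  stringsAsFactors = FALSE
)

# Hypothetical helper: all replicate/year combinations for one site
expand_site <- function(site, years, tr) {
  grid <- expand.grid(tr = seq_len(tr), year = seq_len(years))
  data.frame(site = site, grid)
}

# Apply the helper to each row, then stack the pieces
out <- do.call(rbind, Map(expand_site, df$site, df$years, df$tr))
rownames(out) <- NULL
head(out)
nrow(out)
```

expand.grid() varies its first argument fastest, so replicates cycle within each year, which is the same row order the original function f() produces.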
I have a dataframe,
df<-data.frame(id=c(1,2,3,4,5,6,7,8,9,10,11),score=c(1,3,5,7,3,4,7,1,2,6,3),cluster=c(1,1,2,2,2,2,3,3,3,3,3))
I also have a set of cluster IDs and the number of samples I'd like from each cluster,
sample_sizes<-data.frame(cluster=c(1,2,3),samples=c(1,3,2))
I would like to have a samples dataframe consisting of samples selected according to the number of samples specified in the sample_sizes dataframe.
For instance, the following table would be a potential result:
id score cluster
2 3 1
3 4 2
5 3 2
6 4 2
9 2 3
11 3 3
I tried the following with dplyr:
df2<-merge(df,sample_sizes)
df3<-df2 %>%
group_by(cluster) %>%
sample_n(samples)
but receive an error.
Is there a best method for doing this? A solution that could scale with larger numbers of clusters and samples would be ideal.
Thank you in advance!
We may use map2_df along with split:
map2_df(split(df, df$cluster), sample_sizes$samples, sample_n)
# id score cluster
# 1 1 1 1
# 2 4 7 2
# 3 5 3 2
# 4 3 5 2
# 5 7 7 3
# 6 9 2 3
split(df, df$cluster) gives a list of data frames, one for each cluster, then map2_df applies sample_n to each cluster, just like you intended, and binds the resulting data frames into one.
Here is a way using tidyr::nest() and purrr::map2():
library(tidyverse)
df %>% group_by(cluster) %>% nest() %>%
left_join(sample_sizes) %>% mutate(samp=map2(data,samples,sample_n)) %>%
select(cluster,samples,samp) %>% unnest(samp)
Joining, by = "cluster"
# A tibble: 6 x 4
cluster samples id score
<dbl> <dbl> <dbl> <dbl>
1 1 1 1 1
2 2 3 5 3
3 2 3 6 4
4 2 3 4 7
5 3 2 8 1
6 3 2 10 6
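Note that pairing split(df, df$cluster) with sample_sizes$samples by position only works if sample_sizes is sorted the same way as the cluster levels. A base R sketch that looks the size up by cluster ID instead might look like this; `n_by_cluster` is just an illustrative name, and the seed is arbitrary.

```r
set.seed(42)
df <- data.frame(
  id = 1:11,
  score = c(1, 3, 5, 7, 3, 4, 7, 1, 2, 6, 3),
  cluster = c(1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3)
)
sample_sizes <- data.frame(cluster = c(1, 2, 3), samples = c(1, 3, 2))

# Named lookup: cluster ID -> number of rows to draw
n_by_cluster <- setNames(sample_sizes$samples, sample_sizes$cluster)

# Draw rows without replacement within each cluster
pieces <- lapply(split(df, df$cluster), function(d) {
  n <- n_by_cluster[[as.character(d$cluster[1])]]
  d[sample(nrow(d), n), , drop = FALSE]
})

samples <- do.call(rbind, pieces)
rownames(samples) <- NULL
table(samples$cluster)
```

This assumes every cluster in df has an entry in sample_sizes and that no requested size exceeds the cluster's row count; otherwise the lookup or sample() will raise an error.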