I have a question about how to take a random sample while keeping together all the items that belong to the same group. What I'm really trying to do is sample groups, where each sampled group has to include every one of its items.
Here is a method of sampling from mtcars. Using this, I get two random rows:
(sampled_df <- mtcars[sample(nrow(mtcars), 2), ])
I can take mtcars and then number it as though there are groups. mtcars has 32 observations. Here I'm saying that there are eight groups with four items each.
library(dplyr)
mtcars %>%
mutate(number = rep(1:8,each=4)) %>%
group_by(number) %>%
sample_n(2)
The last two lines of code aren't doing what I hoped they would. I'm trying to get eight lines as output: all four of the observations from two of the groups.
I'm really working with invoice data and I want to be able to make the data frame smaller while making sure that I'm keeping the basket sizes the same.
What you might want is:
mtcars %>%
mutate(number = rep(1:8,each=4)) %>%
filter(number %in% sample(1:8, 2))
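When the grouping column already exists, as with invoice data, the same idea can be sketched by sampling from the distinct group keys rather than from 1:8. The `invoices` data frame and its `invoice_id` and `item` columns below are made up for illustration; only the filter-on-sampled-keys pattern comes from the answer above.

```r
library(dplyr)

# Hypothetical invoice data: three invoices with different basket sizes
invoices <- data.frame(
  invoice_id = c("A", "A", "A", "B", "B", "C", "C", "C", "C"),
  item       = c("pen", "ink", "pad", "cup", "lid", "bag", "box", "tag", "tie")
)

set.seed(42)  # only to make the draw reproducible

# Sample two invoice IDs, then keep every row that belongs to them
keep_ids <- sample(unique(invoices$invoice_id), 2)
sampled  <- invoices %>% filter(invoice_id %in% keep_ids)
```

Because whole invoices are kept or dropped, each invoice that survives has the same basket size as in the full data.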
I have a dataset with 51 columns and I want to add summary rows for most of these variables. Currently columns 5:48 are various metrics, with each row being one area in one quarter. I am summing each metric over all quarters for each area and ultimately creating a rate. The code below works fine for one individual metric, but I need to run it for 44 different columns.
example <- test %>%
group_by(Area) %>%
summarise(`Metric 1`= (sum(`Metric 1`))/(mean(Population))*10000) %>%
bind_rows(test) %>%
arrange(Area,-Quarter)%>% # sort so that total rows come after the quarters
mutate(Timeframe=if_else(is.na(Quarter),'12 month rolling', 'Quarterly'))
I have tried creating a for loop and using the column index values; however, that hasn't worked and just returns various errors. I've also been unable to get the above script working with index values. The code below gives an error ('Error: unexpected '=' in: " group_by_at(Local_Authority) %>% summarise(u17_12mo[5]="'):
example <- test %>%
group_by_at(Area) %>%
summarise(test[5]= (sum(test[5]))/(mean(test[4]))*10000) %>%
bind_rows(test) %>%
arrange(Area,-Quarter)%>% # sort so that total rows come after the quarters
mutate(Timeframe=if_else(is.na(Quarter),'12 month rolling', 'Quarterly'))
Any help on setting up a for loop for this, or another way entirely, would be great.
Without data, it's tough to help, but maybe this would work for you:
library(tidyverse)
example <- test %>%
group_by(Area) %>%
summarise(across(5:48, ~(sum(.))/(mean(Population))*10000))
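As a rough, self-contained illustration of the across() approach, here is a sketch on a tiny made-up `test` data frame. It selects the metric columns by name with starts_with() instead of by position, which assumes the metric columns share a common prefix; the toy values and the two-metric layout are assumptions, not the asker's data.

```r
library(dplyr)

# Hypothetical stand-in for `test`: two areas, two quarters, two metrics
test <- data.frame(
  Area       = c("A", "A", "B", "B"),
  Quarter    = c(1, 2, 1, 2),
  Population = c(1000, 1000, 2000, 2000),
  `Metric 1` = c(1, 2, 3, 4),
  `Metric 2` = c(5, 6, 7, 8),
  check.names = FALSE
)

# One rate per area for every metric column
totals <- test %>%
  group_by(Area) %>%
  summarise(across(starts_with("Metric"), ~ sum(.x) / mean(Population) * 10000))

# Same post-processing as in the question: total rows follow the quarters
example <- totals %>%
  bind_rows(test) %>%
  arrange(Area, -Quarter) %>%
  mutate(Timeframe = if_else(is.na(Quarter), "12 month rolling", "Quarterly"))
```

Selecting by name avoids having to count column positions, and the summary rows can be bound back onto the quarterly rows exactly as in the single-metric version.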
I am having trouble doing cross-validation for a hierarchical dataset. There is a level 2 factor ("ID") that needs to be equally represented in each subset. For this dataset, there are 157 rows and 28 IDs. I want to divide my data up into five subsets, each containing 31 rows, where each of the 28 IDs is represented (a stand can be repeated within a subset).
I have gotten as far as:
library(dplyr)
df %>%
group_by(ID) %>%
and have no clue where to take it from there. Any help is appreciated!
Here's what I'd do: assign one row from each ID randomly to each of the 5 subsets, and then distribute the leftovers fully randomly. Without sample data this is untested, but it should at least get you on the right track.
df %>%
group_by(ID) %>%
mutate(
random_rank = rank(runif(n())),
strata = ifelse(random_rank <= 5, random_rank, sample(1:5, size = n(), replace = TRUE))
) %>%
select(-random_rank) %>%
ungroup()
This should create a strata column as described above. If you want to split the data into a list of data frames, one per stratum, append ... %>% group_by(strata) %>% group_split().
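Since the answer is untested, one way to check it is to run it on simulated data and count how many strata each ID lands in. The toy `df` below (28 IDs with 5 to 7 rows each) and the `folds` and `coverage` names are assumptions for illustration. Note that an ID with fewer than five rows cannot appear in all five strata.

```r
library(dplyr)

set.seed(1)  # reproducible toy example

# Hypothetical data: 28 IDs with 5 to 7 rows each
df <- data.frame(ID = rep(1:28, times = sample(5:7, 28, replace = TRUE)))

folds <- df %>%
  group_by(ID) %>%
  mutate(
    random_rank = rank(runif(n())),
    # the first five rows (in random order) get strata 1 to 5, the rest are random
    strata = ifelse(random_rank <= 5, random_rank,
                    sample(1:5, size = n(), replace = TRUE))
  ) %>%
  select(-random_rank) %>%
  ungroup()

# Number of distinct strata each ID appears in (5 means fully covered)
coverage <- folds %>%
  group_by(ID) %>%
  summarise(n_strata = n_distinct(strata))
```

This guarantees coverage per ID but not equal subset sizes, so the subsets will only be roughly the same size.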
I have a large table, which starts like this
Essentially, it's a table with multiple samples ("samp_id") showing the number ("least") of "taxon" present in each.
I want to transpose/pivot the table to look like this;
i.e. with "taxon" as the top row, with each of the 90 samples in "data" following as a row based on the "least" column, re-named with its "samp_id". So you see what each sample is, as well the value in "least" for each sample in the different "taxon" (which may not be identical across the 90 samples).
Previously, I separated the data into multiple tibbles based on "samp_id", selected "taxon" and "least", renamed "least" with the "samp_id", combined the individual tibbles on "taxon" with full_join using something like the code below, and then transposed the combined table:
ACLOD_11 = data %>%
filter(samp_id == "ACLOD_11") %>%
select(taxon, least) %>%
rename("ACLOD_11" = least)
ACLOD_12 = data ... #as above, but different samp_id
data_final = list(ACLOD_11, ACLOD_12, ...) %>%
reduce(full_join, by = "taxon")
I have more data tables to follow after this one with 90 samples, so I want to be able to do this without having to separate the data into hundreds of tibbles and manually enter each "samp_id" before joining.
I have currently split the data into 90 separate tibbles based on "samp_id" (there are 90 samples in "data")
data_split = data %>%
group_split(samp_id)
but am unsure if this is the best way to do this, or what I should do next.
We can use
library(dplyr)
library(purrr)
data %>%
split(.$samp_id) %>%
imap(~ .x %>%
select(taxon, least) %>%
rename(!!.y := least)) %>%
reduce(full_join, by = 'taxon')
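If the goal is one row per sample with one column per taxon, tidyr::pivot_wider may get there in a single step without splitting and re-joining. This is a sketch on a small made-up `data` tibble and assumes each (samp_id, taxon) pair occurs at most once; the taxon names and values are invented.

```r
library(tidyr)
library(tibble)

# Hypothetical long table: "least" counts of each taxon per sample
data <- tibble(
  samp_id = c("ACLOD_11", "ACLOD_11", "ACLOD_12"),
  taxon   = c("taxon_a", "taxon_b", "taxon_a"),
  least   = c(3, 1, 2)
)

# One row per sample, one column per taxon; missing combinations become NA
wide <- pivot_wider(data, names_from = taxon, values_from = least)
```

Swapping the arguments to names_from = samp_id gives the untransposed layout (one row per taxon, one column per sample) that the full_join approach produces.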
I have a dataset that looks like this:
group=rep(1:4,each=100)
values=round(runif(400,25,350),0)
data<-data.frame(values,group)
Each group comprises 100 observations (values).
For each group, I would like to take 20 random samples without replacement, with the sample size varying from 10 to 95 in steps of 5.
Thus for each group I want 20 samples with size=10, 20 samples with size=15....20 samples with size=95.
Any idea on how to do it using some tidyverse solution?
At the moment I did this:
data %>%
group_by(group) %>%
nest() %>%
mutate(v=map(data,~rep_sample_n(.,size=10,replace=FALSE,reps=20))) %>%
unnest(v)
It seems to correctly draw 20 replicate samples of size=10, but I still need to vary the size...
Thanks.
You could create a sequence of sample sizes, wrap your group_by/nest/etc. code in a for loop, then add each new sample to a list.
Notice how the size argument in ~rep_sample_n is now sizes[i] rather than a fixed number.
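library(dplyr)
library(tidyr)
library(purrr)
library(infer) # provides rep_sample_n()
sizes <- seq(10,95,by=5)
sample_list <- list()
for (i in seq_along(sizes)){
new_data <- data %>%
group_by(group) %>%
nest() %>%
mutate(v=map(data,~rep_sample_n(.,size=sizes[i],replace=FALSE,reps=20))) %>%
unnest(v)
# [[i]] stores the whole data frame as one list element
sample_list[[i]] <- new_data
}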
I am suggesting a for loop instead of lapply(), as it makes more sense to me and this application doesn't take much time anyway.
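For comparison, here is a sketch of the same sampling scheme without a for loop and without rep_sample_n, using dplyr::slice_sample inside nested purrr::map calls. The `size` and `replicate` columns and the `samples` name are my own labels, not part of the question.

```r
library(dplyr)
library(purrr)

set.seed(1)  # reproducible toy run
data <- data.frame(
  values = round(runif(400, 25, 350), 0),
  group  = rep(1:4, each = 100)
)

sizes <- seq(10, 95, by = 5)

# For every size and every replicate, draw `s` rows per group without replacement
samples <- bind_rows(map(sizes, function(s) {
  bind_rows(map(1:20, function(r) {
    data %>%
      group_by(group) %>%
      slice_sample(n = s) %>%
      ungroup() %>%
      mutate(size = s, replicate = r)
  }))
}))
```

The result is one long data frame, so a later group_by(group, size, replicate) can summarise each individual sample.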
I have a calculation that I have to perform for 23 people (each person has a different number of rows, so it is difficult to do in Excel). What I'd like to do is take the total time each person took to complete a test and divide it into 5 time categories (20% each) so that I can look at their reaction times in more detail.
I will just do this by hand but it will take quite a while because they have 8 sets of data each. I'm hoping someone can show me the best way to use a loop or automate this process even just a little. I have tried to understand the examples but I'm afraid I don't have the skill. So by hand I would do it like I have below where I just filter by each subject.
I started by selecting the relevant columns, then filtered by subject so that I could calculate the time they started and the time they finished and used that to create a variable (testDuration) that could be used to create the 20% proportions of RTs that I'm after. I have no idea how to get the individual subjects' test start, end, duration and timeBin sizes to appear in one column. Any help very gratefully received.
Subj1 <- rtTrialsYA_s1 %>%
select(Subject, RetRating.OnsetTime, RetRating.RT, RetRating.RTTime) %>%
filter(Subject==1) %>%
summarise(
testStart =
min(RetRating.OnsetTime),
testEnd = max(RetRating.RTTime)
) %>%
mutate(
testDuration = testEnd - testStart,
timeBin =
testDuration/5
)
Subj2 <- rtTrialsYA_s1 %>%
select(Subject, RetRating.OnsetTime, RetRating.RT, RetRating.RTTime) %>%
filter(Subject==2) %>%
summarise(
testStart =
min(RetRating.OnsetTime),
testEnd = max(RetRating.RTTime)
) %>%
mutate(
testDuration = testEnd - testStart,
timeBin =
testDuration/5
)
I'm not positive that I understand your code, but this function can be called for any Subject value and then return the output:
myfunction <- function(subjectNumber){
Subj <- rtTrialsYA_s1 %>%
select(Subject, RetRating.OnsetTime, RetRating.RT, RetRating.RTTime) %>%
filter(Subject==subjectNumber) %>%
summarise(testStart = min(RetRating.OnsetTime), testEnd = max(RetRating.RTTime)) %>%
mutate(testDuration = testEnd -testStart) %>%
mutate(timeBin = testDuration/5)
return(Subj)
}
Subj1 <- myfunction(1)
Subj2 <- myfunction(2)
To loop through this, I'll need to know what your data and the desired output look like.
I think you're missing one piece and that is simply dplyr::group_by.
You can use it as follows to break your dataset into groups, each containing the observations belonging to only one subject, and then summarise on those groups with whatever it is you want to analyze.
library(dplyr)
df <- rtTrialsYA_s1 %>%
group_by(Subject) %>%
summarise(
testStart = min(RetRating.OnsetTime),
testEnd = max(RetRating.RTTime),
testDuration = testEnd - testStart,
timeBin = testDuration/5,
.groups = "drop"
)
There is no need to do separate mutate calls in your code, btw. Also, you can continue to do column calculations right within summarise, as long as the result vectors have the same length as your aggregated columns.
And since summarise retains only the grouping columns and whatever you are defining, there is no real need to do a select statement before, either.
// update
You say you need all your calculated columns to appear within one single column. For that you can use tidyr::pivot_longer. Using the df we calculated above:
library(tidyr)
df_long <- df %>%
pivot_longer(-Subject)
The above takes all columns except Subject and pivots them into two columns: one containing the former column name (called name by default) and one containing the former value (called value by default).
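Going back to the original goal of splitting each subject's test into five equal time bins, one possible next step is to compute the bin boundaries with mutate instead of summarise, so that every trial keeps its row and gets a bin number. This is only a sketch: the toy `rt` data frame, the `binIndex` column, and the choice to bin on RetRating.RTTime are assumptions about what the asker needs.

```r
library(dplyr)

# Hypothetical trial-level data for two subjects
rt <- data.frame(
  Subject             = c(1, 1, 1, 2, 2, 2),
  RetRating.OnsetTime = c(0, 40, 90, 100, 150, 190),
  RetRating.RT        = c(10, 10, 10, 10, 10, 10),
  RetRating.RTTime    = c(10, 50, 100, 110, 160, 200)
)

binned <- rt %>%
  group_by(Subject) %>%
  mutate(
    testStart = min(RetRating.OnsetTime),
    testEnd   = max(RetRating.RTTime),
    timeBin   = (testEnd - testStart) / 5,
    # bin 1 covers the first 20% of the test, bin 5 the last 20%;
    # pmin() keeps the very last time point in bin 5
    binIndex  = pmin(floor((RetRating.RTTime - testStart) / timeBin) + 1, 5)
  ) %>%
  ungroup()
```

With a bin number on every trial, group_by(Subject, binIndex) followed by summarise gives reaction-time summaries per 20% segment.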