I have a data set from which I want to take a random sample of up to 30 rows per group. However, I also want to make sure that at least one row from each level of another grouping is included. Additionally, some groups have fewer than 30 rows, in which case all of the rows for that group should be included. I can't include the exact data set I'm working with because it's proprietary; however, an example for a data frame df would be:
ID  Age  State  Gender  Salary
1   25   CO     M        50000
2   34   CO     M        72000
3   28   CO     M        52000
4   25   CO     F        44000
5   25   CA     F        55000
6   34   CA     F       100000
7   39   CA     M        88000
8   34   CA     M        59000
... up to 15000 rows
So, I want a random sample of the data set so that no more than 30 rows are given from each state. Then, for each state, I want at least 1 row for each age and gender that exists in the data set. If there are less than 30 age/gender combinations for a given state, but there are more than 30 rows for that state, then the sample should include multiple rows for a given age/gender so that 30 rows are given for that state. If there are less than 30 rows for that state, then I want all the rows in the data set for that state. If there are more than 30 age/gender combinations for a given state, then the sample should have 1 of each up to 30.
Is there a way for me to do this in R?
Here is some code that gets you halfway there. First, I simulated data that resembles yours.
df <- data.frame(
  ID = 1:1500,
  Age = sample(18:99, 1500, replace = TRUE),
  State = sample(state.abb, 1500, replace = TRUE),
  Gender = sample(c("M", "F"), 1500, replace = TRUE),
  Salary = sample(44:100 * 1000, 1500, replace = TRUE)
)
Then, with group_by(), you can create the state grouping and determine the number of rows per state with mutate() and n(). That information can then be used to draw samples with sample_n() that adjust to the group size.
library(dplyr)

df %>%
  group_by(State) %>%
  mutate(n_state = n()) %>%
  sample_n(ifelse(n_state[1] >= 30, 30, n_state[1]))  # 30 rows, or all rows for smaller states
This could be extended to calculate the further group sizes you mention and use that information to ensure you hit the quotas you are looking for. Unfortunately, I do not fully understand from your question exactly what your quotas are.
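If the quota is "at least one row per Age/Gender combination within each State, topped up to 30 rows per State where possible", one way to extend the above is sketched below. This is only a sketch, untested against your real data; it assumes, as in your example, that every row has a unique ID, and it uses slice_sample() from dplyr >= 1.0.

set.seed(42)

# 1. One row per State/Age/Gender combination, capped at 30 combinations per State.
core <- df %>%
  group_by(State, Age, Gender) %>%
  slice_sample(n = 1) %>%
  group_by(State) %>%
  slice_sample(n = 30) %>%   # states with more than 30 combinations keep 30 of them
  ungroup()

# 2. Top up each State with additional random rows until it reaches 30 rows
#    (or until the State runs out of rows).
needs <- count(core, State, name = "already") %>% mutate(need = 30 - already)

topup <- df %>%
  anti_join(core, by = "ID") %>%     # rows not already selected
  inner_join(needs, by = "State") %>%
  group_by(State) %>%
  mutate(.rand = sample(n())) %>%    # random order within each State
  filter(.rand <= need) %>%          # keep only as many rows as still needed
  ungroup() %>%
  select(-already, -need, -.rand)

sampled <- bind_rows(core, topup)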
Related
I am really new to R and this is probably a really basic question. I have a data set with 9 columns containing income information. How do I find the overall percentage of individuals who earn more than 50k?
It is easier to help if you provide some sample data in a reproducible format.
Let's say this is your data
df <- data.frame(id = 1:5, income = c(55000, 12000, 650000, 100000, 32000))
df
# id income
#1 1 55000
#2 2 12000
#3 3 650000
#4 4 100000
#5 5 32000
To get the percentage of people whose income is greater than 50K, you can count the number of people whose income is greater than 50K and divide it by the total number of people, which can be done as -
perc <- sum(df$income > 50000)/nrow(df) * 100
perc
#[1] 60
Another, shorter way to do the same thing in R is to treat logical values as numbers (TRUE -> 1 and FALSE -> 0), so to get the percentage we can do -
perc <- mean(df$income > 50000) * 100
perc
#[1] 60
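One practical note (an addition, in case your real data contains missing values): both versions return NA if income has NAs, so you may want to ignore them explicitly -

perc <- mean(df$income > 50000, na.rm = TRUE) * 100  # drop NA incomes before averaging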
I have this data table as a model:
ID PRODUCT_TYPE OFFER INVENTORY
1 BED Y Y
2 TABLE N Y
3 MOUSE Y N
4 CELLPHONE Y Y
5 CAR Y Y
6 BED N N
7 TABLE N Y
8 MOUSE Y N
9 CELLPHONE Y Y
10 CAR Y Y
.....
I have to extract a sample of 50% of the total population, and the sample must include every value of each variable at least once (product_type == bed, cellphone, car, table, mouse; offer = Y, N; etc.).
I used this to extract the sample:
subset1<- data2 %>% sample_frac(.5)
but I don't know how to integrate these conditions. Can anyone offer advice?
As far as I can tell, the question the original post asks is: how does one generate a stratified random sample based on combinations of a set of grouping variables? A stratified random sample is an appropriate approach in this situation because it ensures that each combination of grouping variables is proportionally represented in the sampled data frame.
A tidyverse solution
Since the question does not include a minimal reproducible example, we'll generate some data and illustrate how to split or group it and then randomly sample each of the subgroups.
To begin, we reset the seed for the random number generator and build a data frame containing 10,000 rows of products, where 50% of the products are on offer, and 70% are in inventory.
set.seed(1053807)
df <- data.frame(
  productType = rep(c("Bed", "Mouse", "Table", "Cellphone", "Laptop", "Car", "Chair",
                      "Blanket", "Sofa", "Bicycle"), 1000),
  offer = ifelse(runif(10000) > .5, "Y", "N"),
  inventory = ifelse(runif(10000) > .3, "Y", "N"),
  price = rnorm(10000, 200, 10)
)
Given the three grouping variables in the original post, the df object contains 40 unique combinations of productType, offer, and inventory.
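If you want to confirm that count yourself, a quick check (not part of the original post) is:

nrow(unique(df[, c("productType", "offer", "inventory")]))  # should return 40 given the parameters above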
The original code attempts to use the dplyr package to sample the data, and it was very close to a workable solution. To stratify the sample, we use group_by() to group the data by the stratification variables, and then use sample_frac() on the grouped data to generate the stratified sample.
library(dplyr)
df %>%
  group_by(productType, offer, inventory) %>%
  sample_frac(0.5) -> sampledData
Verifying results
A 50% sample from a 10,000 row data frame should have about 5,000 observations.
> nrow(sampledData)
[1] 5001
So far, so good.
We can then verify the results by counting numbers of rows in each stratum of the sample, and comparing them to the original counts for each subgroup in the input data frame.
# check results
originalCounts <- df %>%
  group_by(productType, offer, inventory) %>%
  summarise(OriginalCount = n())

sampledData %>%
  group_by(productType, offer, inventory) %>%
  summarise(SampledCount = n()) %>%
  full_join(originalCounts, .) %>%
  mutate(SampledPct = round(SampledCount / OriginalCount * 100, 2))
...and the output:
# A tibble: 40 x 6
# Groups: productType, offer [20]
productType offer inventory OriginalCount SampledCount SampledPct
<chr> <chr> <chr> <int> <int> <dbl>
1 Bed N N 161 80 49.7
2 Bed N Y 371 186 50.1
3 Bed Y N 132 66 50
4 Bed Y Y 336 168 50
5 Bicycle N N 154 77 50
6 Bicycle N Y 349 174 49.9
7 Bicycle Y N 147 74 50.3
8 Bicycle Y Y 350 175 50
9 Blanket N N 134 67 50
10 Blanket N Y 349 174 49.9
# … with 30 more rows
By inspecting the data, we see that strata with even numbers of observations result in an exact 50% sample, whereas strata with odd numbers of observations come in slightly above or below 50%.
A Base R solution
We can also solve the problem with Base R. This approach uses the three variables from the original post (product type, offer, and inventory) to split the data into subgroups based on the combinations of values of these variables, takes a random sample from each subgroup, and combines the results into a single data frame.
First, we set the seed for the random number generator and build a data frame containing 10,000 rows of products, where 50% of the products are on offer, and 70% are in inventory.
set.seed(1053807)
df <- data.frame(
  productType = rep(c("Bed", "Mouse", "Table", "Cellphone", "Laptop", "Car", "Chair",
                      "Blanket", "Sofa", "Bicycle"), 1000),
  offer = ifelse(runif(10000) > .5, "Y", "N"),
  inventory = ifelse(runif(10000) > .3, "Y", "N"),
  price = rnorm(10000, 200, 10)
)
Since we want to separately sample each combination of product, offer, and inventory, we create a combined split variable, and then use it to split the data.
splitvar <- paste(df$productType,df$offer,df$inventory,sep="-")
dfList <- split(df,splitvar)
Given the input data frame parameters of 10 products, 2 levels of offer (Y / N), and 2 levels of inventory (Y / N), this creates a dfList object that is a list of 40 data frames, each with varying numbers of observations.
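As a quick sanity check (not in the original answer), you can inspect the resulting list:

length(dfList)              # should be 40
head(sapply(dfList, nrow))  # row counts vary by stratum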
We then use lapply() to randomly select about 50% of each data frame, using the number of rows for each data frame to drive the sample() function.
sampledDataList <- lapply(dfList, function(x) {
  x[sample(nrow(x), size = round(.5 * nrow(x))), ]
})
At this point the sampledDataList object is a list of 40 data frames, each of which has approximately 50% of the rows of the corresponding data frame in the original list.
To create the final data frame, we use do.call() as follows.
sampledData <- do.call(rbind,sampledDataList)
When we check the number of observations in the resulting data frame, we see that it is approximately 50% of the original data size (10,000).
> # this should be approximately 5,000 rows
> nrow(sampledData)
[1] 5001
We can further verify that each data frame is approximately a 50% sample with the following code.
# verify sample percentage by stratum
stratum <- names(sampledDataList)
OriginalCount <- sapply(dfList,nrow)
SampledCount <- sapply(sampledDataList,nrow)
SamplePct <- round(SampledCount / OriginalCount * 100,2)
head(data.frame(stratum,OriginalCount,SampledCount,SamplePct,row.names = NULL),10)
...and the output:
> head(data.frame(stratum,OriginalCount,SampledCount,SamplePct,row.names = NULL),10)
stratum OriginalCount SampledCount SamplePct
1 Bed-N-N 161 80 49.69
2 Bed-N-Y 371 186 50.13
3 Bed-Y-N 132 66 50.00
4 Bed-Y-Y 336 168 50.00
5 Bicycle-N-N 154 77 50.00
6 Bicycle-N-Y 349 174 49.86
7 Bicycle-Y-N 147 74 50.34
8 Bicycle-Y-Y 350 175 50.00
9 Blanket-N-N 134 67 50.00
10 Blanket-N-Y 349 174 49.86
As was the case with the dplyr solution, we see that strata with odd numbers of rows either sample one more or one less than an exact 50% of the original data.
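One caveat worth adding (this check is not in the original answer): because round(.5 * nrow(x)) is 0 for a stratum that contains a single row, it is worth confirming that every stratum is still represented in the sample, since that was part of the original requirement. Using the SampledCount vector created above:

all(SampledCount >= 1)  # TRUE means every combination appears at least once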
I have a specific problem: I want to sample exact sizes from within groups. What method should I use to construct exact-size subsets based on group counts?
My use case is that I am going through a two-stage sample design. First, for each group in my population, I want to ensure that 60% of subjects will not be selected. So I am trying to construct a sampling data frame that excludes 60% of available subjects for each group. Further, this is a function where the user specifies the minimum proportion of subjects that must not be used, hence the 1- construction where the user has indicated that at least 60% of subjects in each group cannot be selected for sampling.
After this code, I will be sampling completely at random, to get my final sample.
Code example:
testing <- data.frame(ID = c(seq_len(50)), Age = c(rep(18, 10), rep(19, 9), rep(20,15), rep(21,16)))
testing <- testing %>%
slice_sample(ID, prop=1-.6)
As you can see, the numbers by group are not what I want. I should only have 4 subjects who are 18 years of age, 3 subjects who are 19 years, 6 subjects who are 20 years of age, and 6 subjects who are 21 years of age. With no set seed, the numbers I ended up with were 6 18-year-olds, 1 19-year-old, 6 20-year-olds, and 7 21-year-olds.
However, the overall sample size of 20 is correct.
How do I brute force the sample size within the groups to be what I need?
There are other variables in the data frame so I need to sample randomly from each age group.
EDIT: I messed up trying to give an example. In my real data I am grouping by age inside the dplyr chain of commands, but neither group_by(Age) ahead of slice_sample() nor doing the grouping inside slice_sample() works. In my real data, I get neither the correct set of samples by age nor the correct overall sample size.
I was using a semi_join to limit the ages to those that had a total remaining after doing the proportion test. For those ages for which no sample could be taken, the semi_join was being used to remove those ages from the population ahead of doing the proportional sampling. I don't know if the semi_join has caused the problem.
That said, the answer provided and accepted shifts me away from relying on the semi_join, and I think it is a large overall improvement to my real code.
You haven't defined your grouping variable.
Try the following:
set.seed(1)
x <- testing %>% group_by(Age) %>% slice_sample(prop = .4)
x %>% count()
# # A tibble: 4 x 2
# # Groups: Age [4]
# Age n
# <dbl> <int>
# 1 18 4
# 2 19 3
# 3 20 6
# 4 21 6
Alternatively, try stratified from my "splitstackshape" package:
library(splitstackshape)
set.seed(1)
y <- stratified(testing, "Age", .4)
y[, .N, Age]
# Age N
# 1: 18 4
# 2: 19 4
# 3: 20 6
# 4: 21 6
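Note that the two approaches handle the fractional group size for the nine 19-year-olds differently (9 * 0.4 = 3.6): in the output above, slice_sample() returns 3 rows for that group while stratified() returns 4, which is why the counts differ for that age.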
I have a data set which is made up of observations of the weights of fish, the Julian dates they were captured on, and their names. I am seeking to assess the average growth rate of these fish according to the day of the year (Julian date). I believe the best method to do this is to compose a data.frame with two fields: "Julian Date" and "Growth". The idea is this: for a fish observed on January 1 (1) at weight 100 and observed again on April 10 (101) at weight 200, the growth rate would be 100 g / 100 days, or 1 g/day. I would represent this in a data.frame as 100 rows in which the "Julian Date" column is composed of the Julian date sequence (1:100) and the "Growth" column is composed of the average growth rate (1 g/day) over all days.
I have attempted to compose a for loop which passes through each fish, calculates the average growth rate, then creates a list in which each index contains the sequence of Julian dates and the growth rate (repeated a number of times equal to the length of the Julian date sequence). I would then use do.call() to compose my data.frame from that list.
growth_list <- list() # initialize empty list
p <- 1 # initialize increment count
# Looks at every other fish ID beginning at 1 (all even-number observations are the same fish at a later observation)
for (i in seq(1, length(df$FISH_ID), by = 2)){
rate <- (df$growth[i+1]-df$growth[i])/(as.double(df$date[i+1])-as.double(df$date[i]))
growth_list[[p]] <- list(c(seq(as.numeric(df$date[i]),as.numeric(df$date[i+1]))), rep(rate, length(seq(from = as.numeric(df$date[i]), to = as.numeric(df$date[i+1])))))
p <- p+1 # increase to change index of list item in next iteration
}
# Converts list of vectors (the rows which fulfill above criteria) into a data.frame
growth_df <- do.call(rbind, growth_list)
My expected results can be illustrated here: https://imgur.com/YXKLkpK
My actual results are illustrated here: https://imgur.com/Zg4vuVd
As you can see, the actual results appear to be a data.frame with two columns specifying the type of the object, as well as the length of the original list item. That is, row 1 of this dataset contained 169 days between observations, and therefore contained 169 julian dates and 169 repetitions of the growth rate.
Instead of list(), use data.frame() with named columns to build a list of data frames that can be row-bound at the end:
growth_list <- vector(mode = "list", length = length(df$FISH_ID) / 2)
p <- 1   # index into growth_list

for (i in seq(1, length(df$FISH_ID), by = 2)) {
  rate <- with(df, (growth[i+1] - growth[i]) / (as.double(date[i+1]) - as.double(date[i])))
  date_seq <- seq(as.numeric(df$date[i]), as.numeric(df$date[i+1]))
  growth_list[[p]] <- data.frame(Julian_Date = date_seq,
                                 Growth_Rate = rep(rate, length(date_seq)))
  p <- p + 1
}
growth_df <- do.call(rbind, growth_list)
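Since the original post doesn't include the underlying data, here is a small made-up df (two observations per fish; the column names FISH_ID, growth, and date are taken from the code above) that the corrected loop can be tested against:

# hypothetical test data: each fish observed twice, dates given as Julian days
df <- data.frame(FISH_ID = c(1, 1, 2, 2),
                 growth  = c(100, 200, 150, 180),
                 date    = c(1, 101, 10, 60))

Running the loop on this produces a growth_df with one row per Julian day between the two observations of each fish, with that fish's average daily growth rate repeated on every row.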
Welcome to Stack Overflow.
A couple of things about your code:
I recommend using the apply family of functions instead of the for loop. You can set parameters in apply to perform row-wise operations, it can make your code run faster, and the apply family creates the list for you, which reduces the code you write to make the list and populate it. (A sketch of this appears after the loop template below.)
It is common to supply a snippet of your initial data along with the question. Sometimes the way we describe our data is not representative of our actual data, and a small example helps avoid communication errors. If you can, please provide a dummy dataset for us to use.
Have you tried using as.data.frame(growth_list), or data.frame(growth_list)?
Another option is to use an if else statement within your for loop that performs the rbind function. This would look something like this:
# make a row-wise for loop
for (x in 1:nrow(i)) {
  # insert your desired calculations here. You can turn the current row into its own
  # data frame, which may make it easier to perform your calculations:
  dataCurrent <- data.frame(i[x, ])

  # finish with something like this to turn your calculations for each row into an
  # output data frame of your choice
  outFish <- cbind(date, length, rate)

  # build your final data frame as follows
  if (exists("finalFishOut") == FALSE) {
    finalFishOut <- outFish
  } else {
    finalFishOut <- rbind(finalFishOut, outFish)
  }
}
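For the first point above, an lapply()-based version of the same template might look like the following. This is only a sketch (the object i and the per-row calculations are placeholders, since the actual data and computations aren't shown):

# hypothetical sketch: i is your input data frame, one observation per row
rowResults <- lapply(seq_len(nrow(i)), function(x) {
  dataCurrent <- data.frame(i[x, ])   # current row as its own data frame
  # insert your desired calculations here and return them as a one-row data frame,
  # e.g. data.frame(date = ..., length = ..., rate = ...)
  dataCurrent
})
# combine all per-row results at once, instead of calling rbind inside the loop
finalFishOut <- do.call(rbind, rowResults)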
Please update with a snippet of data and I'll update this answer with your exact solution.
Here is a solution using dplyr and plyr with some toy data. There are 20 fish, each with a random start and end time, plus random weights at each time. The approach: find each fish's growth rate over time, create a new data frame for each fish with one row per elapsed day and the average daily growth rate, and combine them into a single data frame containing all fish.
library(plyr)    # loaded before dplyr so that plyr does not mask the dplyr verbs used below
library(dplyr)

df <- data.frame(fish = rep(seq(1:20), 2),
                 weight = sample(c(50:100), 40, TRUE),
                 time = sample(c(1:100), 40, TRUE))

df1 <- df %>%
  group_by(fish) %>%
  arrange(time) %>%
  mutate(diff.weight = weight - lag(weight),
         diff.time = time - lag(time)) %>%
  mutate(rate = diff.weight / diff.time) %>%
  filter(!is.na(rate)) %>%
  ddply(., .(fish), function(x) {
    data.frame(time = seq(1:x$diff.time), rate = x$rate)
  })
head(df1)
fish time rate
1 1 1 -0.7105263
2 1 2 -0.7105263
3 1 3 -0.7105263
4 1 4 -0.7105263
5 1 5 -0.7105263
6 1 6 -0.7105263
tail(df1)
fish time rate
696 20 47 -0.2307692
697 20 48 -0.2307692
698 20 49 -0.2307692
699 20 50 -0.2307692
700 20 51 -0.2307692
701 20 52 -0.2307692
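If you would rather avoid mixing plyr and dplyr, a roughly equivalent plyr-free sketch (assuming tidyr is available and, as in the toy data above, exactly two observations per fish) would be:

library(dplyr)
library(tidyr)

df1_alt <- df %>%
  group_by(fish) %>%
  arrange(time, .by_group = TRUE) %>%
  summarise(diff.time = diff(time),
            rate = diff(weight) / diff(time), .groups = "drop") %>%
  filter(diff.time > 0) %>%             # drop fish observed twice on the same day
  uncount(diff.time, .id = "time") %>%  # expand to one row per elapsed day
  select(fish, time, rate)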
The data I have represents sales and their distance (Dist) to a given store (One and Two in this example). What I would like to do is define each store's catchment area based on sales density. A catchment area is defined as the radius that contains 50% of sales. Starting with the orders that have the smallest distance (Dist) to a store, I would like to calculate the radius that contains 50% of that store's sales.
I have the following df, which I calculated in a previous model.
df <- data.frame(ID = c(1, 2, 3, 4, 5, 6, 7, 8),
                 Store = c('One', 'One', 'One', 'One', 'Two', 'Two', 'Two', 'Two'),
                 Dist = c(1, 5, 7, 23, 1, 9, 9, 23),
                 Sales = c(10, 8, 4, 1, 11, 9, 4, 2))
Now I want to find the minimum distance (Dist) that gives the closest figure to 50% of Sales, so my output would look as follows:
Output <- data.frame(Store = c('One', 'Two'),
                     Dist = c(5, 9),
                     Sales = c(18, 20))
I have a lot of observations in my actual df and it's unlikely that I will hit exactly 50%, so I need to round to the nearest observation.
Any suggestions how to do this?
NOTE: I apologise in advance for the poor title; I tried to think of a better way to formulate the problem. Suggestions are welcome...
Here is one approach with data.table:
library(data.table)
setDT(df)
df[order(Store, Dist),
.(Dist, Sales = cumsum(Sales), Pct = cumsum(Sales) / sum(Sales)),
by = "Store"][Pct >= 0.5, .SD[1,], by = "Store"]
# Store Dist Sales Pct
# 1: One 5 18 0.7826087
# 2: Two 9 20 0.7692308
setDT(df) converts df into a data.table
The .(...) expression selects Dist, and calculates the cumulative sales and respective cumulative percentage of sales, by Store
Pct >= 0.5 subsets this to only cases where cumulative sales exceeds the threshold, and .SD[1,] takes only the top row (i.e., the smallest value of Dist), by Store
I think it would be easier if you rearrange your data into a certain format. My logic is: first take the cumulative sum of sales by group, then merge the group totals back into the data, and finally calculate the percentage. With the data in that shape, you can subset however you want to get the first observation from each group.
df$cums=unlist(lapply(split(df$Sales, df$Store), cumsum), use.names = F)
zz=aggregate(df$Sales, by = list(df$Store), sum)
names(zz)=c('Store', 'TotSale')
df = merge(df, zz)
df$perc=df$cums/df$TotSale
Subsetting the data:
merge(aggregate(perc ~ Store,data=subset(df,perc>=0.5), min),df)
Store perc ID Dist Sales cums TotSale
1 One 0.7826087 2 5 8 18 23
2 Two 0.7692308 6 9 9 20 26