R-Studio Filtering Data - r

I have this data table as model:
ID PRODUCT_TYPE OFFER INVENTORY
1 BED Y Y
2 TABLE N Y
3 MOUSE Y N
4 CELLPHONE Y Y
5 CAR Y Y
6 BED N N
7 TABLE N Y
8 MOUSE Y N
9 CELLPHONE Y Y
10 CAR Y Y
.....
I have to extract a sample of 50% of the total population, and the sample must contain each value of every variable at least once (product_type == bed, cellphone, car, table, mouse; offer == Y, N; and so on).
I used this to extract the sample:
subset1 <- data2 %>% sample_frac(.5)
but I don't know how to integrate these conditions. Can anyone offer some advice?

It's not entirely clear from the original post, but the question appears to be: how does one generate a stratified random sample based on combinations of a set of grouping variables? A stratified random sample is an appropriate approach in this situation because it ensures that each combination of grouping variables is proportionally represented in the sampled data frame.
A tidyverse solution
Since the question does not include a minimal reproducible example, we'll generate some data and illustrate how to split or group it and then randomly sample each of the subgroups.
To begin, we set the seed for the random number generator and build a data frame containing 10,000 rows of products, where about 50% of the products are on offer and about 70% are in inventory.
set.seed(1053807)
df <- data.frame(
  productType = rep(c("Bed", "Mouse", "Table", "Cellphone", "Laptop", "Car", "Chair", "Blanket",
                      "Sofa", "Bicycle"), 1000),
  offer = ifelse(runif(10000) > .5, "Y", "N"),
  inventory = ifelse(runif(10000) > .3, "Y", "N"),
  price = rnorm(10000, 200, 10)
)
Given the three grouping variables in the original post, the df object contains 40 unique combinations of productType, offer, and inventory.
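We can confirm this count directly; here is a quick Base R check (given 10 products and two levels each of offer and inventory, the expected result is 40):
nrow(unique(df[, c("productType", "offer", "inventory")]))
# [1] 40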
The original code attempts to use the dplyr package to sample the data, and it was very close to a workable solution. To stratify the sample, we group the data by the stratification variables with group_by() and then call sample_frac() on the grouped data to generate the stratified sample.
library(dplyr)
df %>%
  group_by(productType, offer, inventory) %>%
  sample_frac(0.5) -> sampledData
Verifying results
A 50% sample from a 10,000 row data frame should have about 5,000 observations.
> nrow(sampledData)
[1] 5001
So far, so good.
We can then verify the results by counting the number of rows in each stratum of the sample and comparing them to the original counts for each subgroup in the input data frame.
# check results
originalCounts <- df %>%
  group_by(productType, offer, inventory) %>%
  summarise(OriginalCount = n())

sampledData %>%
  group_by(productType, offer, inventory) %>%
  summarise(SampledCount = n()) %>%
  full_join(originalCounts, .) %>%
  mutate(SampledPct = round(SampledCount / OriginalCount * 100, 2))
...and the output:
# A tibble: 40 x 6
# Groups: productType, offer [20]
productType offer inventory OriginalCount SampledCount SampledPct
<chr> <chr> <chr> <int> <int> <dbl>
1 Bed N N 161 80 49.7
2 Bed N Y 371 186 50.1
3 Bed Y N 132 66 50
4 Bed Y Y 336 168 50
5 Bicycle N N 154 77 50
6 Bicycle N Y 349 174 49.9
7 Bicycle Y N 147 74 50.3
8 Bicycle Y Y 350 175 50
9 Blanket N N 134 67 50
10 Blanket N Y 349 174 49.9
# … with 30 more rows
By inspecting the output, we see that strata with even numbers of observations yield an exact 50% sample, whereas strata with odd numbers of observations land slightly above or below 50%.
A Base R solution
We can also solve the problem in Base R. This approach uses the three variables from the original post (product type, offer, and inventory) to split the data into subgroups based on the combinations of values of these variables, takes a random sample from each subgroup, and combines the results into a single data frame.
First, we set the seed for the random number generator and build a data frame containing 10,000 rows of products, where 50% of the products are on offer, and 70% are in inventory.
set.seed(1053807)
df <- data.frame(
  productType = rep(c("Bed", "Mouse", "Table", "Cellphone", "Laptop", "Car", "Chair", "Blanket",
                      "Sofa", "Bicycle"), 1000),
  offer = ifelse(runif(10000) > .5, "Y", "N"),
  inventory = ifelse(runif(10000) > .3, "Y", "N"),
  price = rnorm(10000, 200, 10)
)
Since we want to separately sample each combination of product, offer, and inventory, we create a combined split variable, and then use it to split the data.
splitvar <- paste(df$productType,df$offer,df$inventory,sep="-")
dfList <- split(df,splitvar)
Given the input data frame parameters of 10 products, 2 levels of offer (Y / N), and 2 levels of inventory (Y / N), this creates a dfList object that is a list of 40 data frames, each with varying numbers of observations.
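A quick check confirms the structure of the list before we sample from it:
length(dfList)              # number of strata: 40
head(sapply(dfList, nrow))  # the number of rows per stratum varies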
We then use lapply() to randomly select about 50% of each data frame, using the number of rows for each data frame to drive the sample() function.
sampledDataList <- lapply(dfList, function(x) {
  x[sample(nrow(x), size = round(.5 * nrow(x))), ]
})
At this point the sampledDataList object is a list of 40 data frames, each of which has approximately 50% of the rows of the corresponding data frame in the original list.
To create the final data frame, we use do.call() as follows.
sampledData <- do.call(rbind,sampledDataList)
When we check the number of observations in the resulting data frame, we see that it is approximately 50% of the original data size (10,000).
> # this should be approximately 5,000 rows
> nrow(sampledData)
[1] 5001
We can further verify that each data frame is approximately a 50% sample with the following code.
# verify sample percentage by stratum
stratum <- names(sampledDataList)
OriginalCount <- sapply(dfList,nrow)
SampledCount <- sapply(sampledDataList,nrow)
SamplePct <- round(SampledCount / OriginalCount * 100,2)
head(data.frame(stratum,OriginalCount,SampledCount,SamplePct,row.names = NULL),10)
...and the output:
stratum OriginalCount SampledCount SamplePct
1 Bed-N-N 161 80 49.69
2 Bed-N-Y 371 186 50.13
3 Bed-Y-N 132 66 50.00
4 Bed-Y-Y 336 168 50.00
5 Bicycle-N-N 154 77 50.00
6 Bicycle-N-Y 349 174 49.86
7 Bicycle-Y-N 147 74 50.34
8 Bicycle-Y-Y 350 175 50.00
9 Blanket-N-N 134 67 50.00
10 Blanket-N-Y 349 174 49.86
As was the case with the dplyr solution, we see that strata with odd numbers of rows either sample one more or one less than an exact 50% of the original data.

Related

R- Sample random row per group until reaching max number of rows

I have a data set from which I want to take a random sample by group up to 30 rows. However, I also want to make sure that at least 1 row for another grouping is included. Additionally, some groups have less than 30 rows, in which case all of the rows for that group should be included. I can't include the exact data set I'm working with because it's proprietary; however, an example for a data frame df would be:
ID Age State Gender Salary
1 25 CO M 50000
2 34 CO M 72000
3 28 CO M 52000
4 25 CO F 44000
5 25 CA F 55000
6 34 CA F 100000
7 39 CA M 88000
8 34 CA M 59000
... up to 15000 rows
So, I want a random sample of the data set so that no more than 30 rows are given from each state. Then, for each state, I want at least 1 row for each age and gender that exists in the data set. If there are less than 30 age/gender combinations for a given state, but there are more than 30 rows for that state, then the sample should include multiple rows for a given age/gender so that 30 rows are given for that state. If there are less than 30 rows for that state, then I want all the rows in the data set for that state. If there are more than 30 age/gender combinations for a given state, then the sample should have 1 of each up to 30.
Is there a way for me to do this in R?
Here is some code that takes you half of the way. First, I simulated data that resembles yours.
df <- data.frame(
  ID = 1:1500,
  Age = sample(18:99, 1500, replace = TRUE),
  State = sample(state.abb, 1500, replace = TRUE),
  Gender = sample(c("M", "F"), 1500, replace = TRUE),
  Salary = sample(44:100 * 1000, 1500, replace = TRUE)
)
Then, with group_by(), you can create the state grouping and determine the number of rows per state with mutate() and n(). That information can then be used to draw samples with sample_n() that adjust to the group size.
library(dplyr)
df %>%
  group_by(State) %>%
  mutate(n_state = n()) %>%
  sample_n(size = min(n_state, 30))  # 30 rows per state, or all rows if a state has fewer
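As an aside, sample_n() has since been superseded by slice_sample(), which silently truncates to the group size, so df %>% group_by(State) %>% slice_sample(n = 30) achieves the same per-state cap without the helper column.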
This could be extended to calculate the further group sizes you mention and to use that information to ensure you hit the quotas you are looking for. Unfortunately, I do not fully understand from your question exactly what your quotas are.
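That said, here is a rough sketch of one possible reading of the quotas (the names must_have, filler, and sampled are mine): first guarantee one row per existing State/Age/Gender combination, then top each state up from the rows not yet selected, and finally cap each state at 30 rows.
library(dplyr)

set.seed(1)
# pass 1: one random row per existing State/Age/Gender combination
must_have <- df %>%
  group_by(State, Age, Gender) %>%
  slice_sample(n = 1) %>%
  ungroup()

# pass 2: top each state up from the rows not already selected;
# slice_sample() silently truncates when fewer rows remain
filler <- df %>%
  anti_join(must_have, by = "ID") %>%
  group_by(State) %>%
  slice_sample(n = 30) %>%
  ungroup()

# combine with the must-have rows first, then cap each state at 30;
# slice_head() keeps rows in their current order, so must-haves win
sampled <- bind_rows(must_have, filler) %>%
  group_by(State) %>%
  slice_head(n = 30) %>%
  ungroup()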

How do I sample specific sizes within groups?

I have a specific use problem. I want to sample exact sizes from within groups. What method should I use to construct exact subsets based on group counts?
My use case is that I am going through a two-stage sample design. First, for each group in my population, I want to ensure that 60% of subjects will not be selected. So I am trying to construct a sampling data frame that excludes 60% of available subjects for each group. Further, this is a function where the user specifies the minimum proportion of subjects that must not be used, hence the 1- construction where the user has indicated that at least 60% of subjects in each group cannot be selected for sampling.
After this code, I will be sampling completely at random, to get my final sample.
Code example:
testing <- data.frame(ID = c(seq_len(50)),
                      Age = c(rep(18, 10), rep(19, 9), rep(20, 15), rep(21, 16)))
testing <- testing %>%
  slice_sample(ID, prop = 1 - .6)
As you can see, the numbers by group are not what I want. I should only have 4 subjects who are 18 years of age, 3 subjects who are 19 years, 6 subjects who are 20 years of age, and 6 subjects who are 21 years of age. With no set seed, the numbers I ended up with were 6 18-year-olds, 1 19-year-old, 6 20-year-olds, and 7 21-year-olds.
However, the overall sample size of 20 is correct.
How do I brute force the sample size within the groups to be what I need?
There are other variables in the data frame so I need to sample randomly from each age group.
EDIT: I messed up trying to give an example. In my real data I am grouping by age inside the dplyr chain of commands, but neither group_by(Age) ahead of slice_sample() nor doing the grouping inside slice_sample() works. In my real data, I get neither the correct set of samples by age nor the correct overall sample size.
I was using a semi_join to limit the ages to those that had a total remaining after doing the proportion test. For the ages for which no sample could be taken, the semi_join removed those ages from the population ahead of the proportional sampling. I don't know if the semi_join caused the problem.
That said, the answer provided and accepted moves me away from relying on the semi_join, and I think it is a large overall improvement to my real code.
You haven't defined your grouping variable.
Try the following:
set.seed(1)
x <- testing %>% group_by(Age) %>% slice_sample(prop = .4)
x %>% count()
# # A tibble: 4 x 2
# # Groups: Age [4]
# Age n
# <dbl> <int>
# 1 18 4
# 2 19 3
# 3 20 6
# 4 21 6
Alternatively, try stratified from my "splitstackshape" package:
library(splitstackshape)
set.seed(1)
y <- stratified(testing, "Age", .4)
y[, .N, Age]
# Age N
# 1: 18 4
# 2: 19 4
# 3: 20 6
# 4: 21 6

plotting a scatter plot with wide range data R

I uploaded a csv file to R studio and am trying to plot two columns. The first one shows the number of likes, and the second shows the number of shares. I want to show the relationship between the number of shares when people actually like a post.
The problem is that my likes count ranges from 1 to 1 million, and the shares count ranges from 5 to 37,000.
sample of my dataset (both columns are of class factor)
topMedia$likes_count
[1] 61 120 271 140 59 498 241 117 124 124 225 117 186 101
[15] 118 134 152 136 153 124 100 77 98 77 88 48 58 66
topMedia$shares_count
[1] 12 171 NULL 23 34 108 430 NULL NULL NULL 283 NULL NULL 57
[15] NULL NULL NULL 68 105 NULL NULL 7 10 45 103 22 75 16
When I use this code to plot a scatter plot, it looks messy.
plot(as.numeric(topMedia$shares_count),as.numeric(topMedia$likes_count))
I tried using other libraries
library(hexbin)
cols = colorRampPalette(c("#fee6ce", "#fd8d3c", "#e6550d", "#a63603"))
plot(hexbin(as.numeric(topMedia$shares_count), as.numeric(topMedia$likes_count), xbins = 40),
     colorcut = seq(0, 1, length = 20), colramp = function(n) cols(20), legend = FALSE,
     xlab = "share count", ylab = "like count")
but I get a similar result even with colours
what would be a better way to show the relationship between those values?
Thanks.
In this case, the even-ish distribution (for what should be a clear positive correlation between "likes" and "shares") is a clue that the numeric data may have been inadvertently loaded as a factor. Another clue is that the x and y values vary only over the number of unique values, not over the range of the underlying numeric data. We need to convert the levels of the factor (not the internal codes of the factor) to recover the intended numbers, which we can do with as.numeric(as.character(x)).
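A minimal illustration of the trap:
f <- factor(c("5", "100", "20"))
as.numeric(f)                # 3 1 2    -- the internal level codes (alphabetical order)
as.numeric(as.character(f))  # 5 100 20 -- the intended numbers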
To give an example, suppose we had some linearly correlated data like this:
library(ggplot2); library(dplyr)
set.seed(42)
fake_data <- data.frame(x = runif(10000, 0, 1000000))
fake_data$y <- pmax(0, fake_data$x*rnorm(10000, 1, 2) + runif(10000, 0, 1000000))
ggplot(fake_data, aes(x,y)) + geom_point()
If that numeric data were loaded as factors (easy to do with read.csv() if the argument stringsAsFactors = FALSE isn't included), it might look more like this, not too dissimilar from the data in this question. The data here is read as character data and then made into a factor whose levels are ordered alphabetically, with "10000" before "2" because "1" comes before "2".
fake_data_factor <- fake_data %>%
  mutate(x = as.factor(as.character(x)),
         y = as.factor(as.character(y)))
The x and y variables now carry integer codes that reflect alphabetical order, unrelated to the magnitudes of their underlying levels. R uses those codes to sort and to plot, so the x values with the lowest codes in the new data have levels near 100,000 instead of near 0. In the table below, 100,124 in row 1 comes alphabetically earlier than 10,058 in row 8!
fake_data_factor %>%
  arrange(x) %>%
  head(8)
# x y
#1 100124.688120559 0
#2 100229.354342446 289241.187250382
#3 100299.560697749 232233.101769741
#4 100354.233058169 814492.563551191
#5 100364.253856242 1183870.56252858
#6 100370.0227011 1224652.83777805
#7 100461.616180837 1507465.73704898
#8 10058.1261795014 604477.823016668
ggplot(fake_data_factor, aes(as.numeric(x), as.numeric(y))) +
  geom_point()
We can get back to the intended numbers by converting the factors to character (which extracts each one's level) and then converting those to numeric.
fake_data_factor %>%
  ggplot(aes(as.numeric(as.character(x)), as.numeric(as.character(y)))) +
  geom_point()
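As an aside, since R 4.0.0 the default for stringsAsFactors is FALSE in data.frame() and read.csv(), so this trap mainly affects older R versions or code that sets stringsAsFactors = TRUE explicitly.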

R sample into two lists

I'm new to R and I want to sample from a list of 97 values. The list is composed of 3 different values (1,2 and 3), each representing a certain condition. The 97 values represent 97 individuals.
Lets assume the list is called original_pop. I want to randomly choose 50 individuals and store them as males and take the remaining 47 individuals and store them as females. A simple and similar scenario:
original_pop = [1 2 3 3 1 2 2 1 3 1 ...]
male_pop = [50 random values from original_pop]
female_pop = [the 47 values that are not in male_pop]
I created original_pop with sample() so that the values are random, but I don't know how to do the rest. Right now I've stored the first 50 values of original_pop as males and the last 47 as females, which might work because original_pop was randomly generated, but I think it would be more appropriate to choose the values from original_pop at random rather than in order.
Appreciate your responses!
In the absence of your original_pop data, we simulate it below. sample() picks 50 random positions for the males, and negative indexing returns the remaining 47 individuals as females.
n <- 97
original_pop <- sample(1:3, size = n, replace = TRUE)  # simulated stand-in data
maleIndexes <- sample(n, 50)               # 50 random positions, without replacement
males <- original_pop[maleIndexes]
females <- original_pop[-maleIndexes]      # the 47 positions not chosen as male
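A quick sanity check on the split:
length(males)    # 50
length(females)  # 47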

How to group by multiple columns in dataframe using R and do aggregate function

I have a dataframe with columns as defined below. I have provided one set of example, similar to this I have many countries with loan amount and gender variables
country loan_amount gender
1 Austia 175 F
2 Austia 100 F
3 Austia 825 M
4 Austia 175 F
5 Austia 1025 M
6 Austia 225 F
Here I need to group by country and then, for each country, calculate the loan percentage by gender in new columns, so that the new columns hold the male percentage of the total loan amount for that country and the female percentage of the total loan amount for that country. I think I need two group_by operations: first to group the countries, and then to group the genders within each country to calculate the loan percentages.
Total loan amount = 2525
female_percent = (175 + 100 + 175 + 225) / 2525 * 100 = 26.73
male_percent = (825 + 1025) / 2525 * 100 = 73.26
The output should be as below:
country female_percent male_percent
1 Austia 26.73 73.26
I am trying to do this in R. I tried the below function, but my R session is not producing any result and it is terminating.
df %>%
group_by(country, gender) %>%
summarise_each(funs(sum))
Could someone help me achieve this output? I think this can be done with dplyr, but I am stuck in between.
We can try the weighted table from the questionr package:
library(questionr)
with(df, wtd.table(country, gender, weights = round(100 * loan_amount/sum(loan_amount), 2)))
F M
Austia 26.73 73.26
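For a dplyr-based alternative, here is a sketch (assuming the data frame is named df as above; note that it reports male_percent as 73.27 because it rounds once at the end rather than per row):
library(dplyr)
library(tidyr)

df %>%
  group_by(country, gender) %>%
  summarise(total = sum(loan_amount), .groups = "drop_last") %>%  # still grouped by country
  mutate(percent = round(100 * total / sum(total), 2)) %>%        # share of the country total
  select(-total) %>%
  pivot_wider(names_from = gender, values_from = percent) %>%
  rename(female_percent = `F`, male_percent = `M`)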
