R sample into two lists

I'm new to R and I want to sample from a list of 97 values. The list is composed of 3 different values (1, 2, and 3), each representing a certain condition. The 97 values represent 97 individuals.
Let's assume the list is called original_pop. I want to randomly choose 50 individuals and store them as males, and take the remaining 47 individuals and store them as females. A simple and similar scenario:
original_pop = [1 2 3 3 1 2 2 1 3 1 ...]
male_pop = [50 random values from original_pop]
female_pop = [the 47 values that are not in male_pop]
I created original_pop with sample so that the values are random, but I don't know how to do the rest. Right now I've stored the first 50 values of original_pop as males and the last 47 as females, and it might work because original_pop was randomly generated, but I think it would be more appropriate to choose the values from original_pop randomly rather than in order.
Appreciate your responses!

In the absence of your original_pop data, we simulate it below.
n <- 97
original_pop <- sample(1:3, size=n, replace=TRUE)
maleIndexes <- sample(n, 50)
males <- original_pop[maleIndexes]
females <- original_pop[-maleIndexes]
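As a quick sanity check, we can confirm that the two groups partition the population: together they contain exactly the same 97 values as original_pop. This sketch repeats the simulation above; the seed is arbitrary and only added for reproducibility.

```r
# simulate the population as above (seed is illustrative)
set.seed(1)
n <- 97
original_pop <- sample(1:3, size = n, replace = TRUE)

# split into 50 random males and the remaining 47 females
maleIndexes <- sample(n, 50)
males   <- original_pop[maleIndexes]
females <- original_pop[-maleIndexes]

length(males)    # 50
length(females)  # 47
# together they are exactly the original population, reordered
all(sort(c(males, females)) == sort(original_pop))  # TRUE
```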


R-Studio Filtering Data

I have this data table as model:
ID PRODUCT_TYPE OFFER INVENTORY
1 BED Y Y
2 TABLE N Y
3 MOUSE Y N
4 CELLPHONE Y Y
5 CAR Y Y
6 BED N N
7 TABLE N Y
8 MOUSE Y N
9 CELLPHONE Y Y
10 CAR Y Y
.....
I have to extract a sample of 50% of the total population, and the sample must contain each value of the variables at least once (product_type == bed, cellphone, car, table, mouse; offer == Y, N; etc.).
I used this to extract the sample:
subset1<- data2 %>% sample_frac(.5)
but I don't know how to integrate these conditions. Can anyone offer some advice?
As best I can interpret it, the question the original post asks is: how does one generate a stratified random sample based on combinations of a set of grouping variables? A stratified random sample is an appropriate approach in this situation because it ensures that each combination of grouping variables is proportionally represented in the sampled data frame.
A tidyverse solution
Since the question does not include a minimal reproducible example, we'll generate some data and illustrate how to split or group it and then randomly sample each of the subgroups.
To begin, we set the seed for the random number generator and build a data frame containing 10,000 rows of products, where 50% of the products are on offer, and 70% are in inventory.
set.seed(1053807)
df <- data.frame(
  productType = rep(c("Bed","Mouse","Table","Cellphone","Laptop","Car","Chair","Blanket",
                      "Sofa","Bicycle"), 1000),
  offer = ifelse(runif(10000) > .5, "Y", "N"),
  inventory = ifelse(runif(10000) > .3, "Y", "N"),
  price = rnorm(10000, 200, 10)
)
Given the three grouping variables in the original post, the df object contains 40 unique combinations of productType, offer, and inventory.
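We can confirm the number of strata directly. This sketch rebuilds df with the same construction and seed as above, then counts the distinct combinations of the three grouping variables:

```r
set.seed(1053807)
df <- data.frame(
  productType = rep(c("Bed","Mouse","Table","Cellphone","Laptop","Car","Chair","Blanket",
                      "Sofa","Bicycle"), 1000),
  offer = ifelse(runif(10000) > .5, "Y", "N"),
  inventory = ifelse(runif(10000) > .3, "Y", "N"),
  price = rnorm(10000, 200, 10)
)

# 10 products x 2 offer levels x 2 inventory levels = 40 combinations
nrow(unique(df[, c("productType", "offer", "inventory")]))  # 40
```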
The original code attempts to use the dplyr package to sample the data, and it was very close to a workable solution. To stratify the sample, we use group_by() to group the data by the stratification variables, and then call sample_frac() on the grouped data to generate the stratified sample.
library(dplyr)
df %>%
  group_by(productType, offer, inventory) %>%
  sample_frac(0.5) -> sampledData
Verifying results
A 50% sample from a 10,000 row data frame should have about 5,000 observations.
> nrow(sampledData)
[1] 5001
So far, so good.
We can then verify the results by counting numbers of rows in each stratum of the sample, and comparing them to the original counts for each subgroup in the input data frame.
# check results
originalCounts <- df %>%
  group_by(productType, offer, inventory) %>%
  summarise(OriginalCount = n())
sampledData %>%
  group_by(productType, offer, inventory) %>%
  summarise(SampledCount = n()) %>%
  full_join(originalCounts, .) %>%
  mutate(SampledPct = round(SampledCount / OriginalCount * 100, 2))
...and the output:
# A tibble: 40 x 6
# Groups: productType, offer [20]
productType offer inventory OriginalCount SampledCount SampledPct
<chr> <chr> <chr> <int> <int> <dbl>
1 Bed N N 161 80 49.7
2 Bed N Y 371 186 50.1
3 Bed Y N 132 66 50
4 Bed Y Y 336 168 50
5 Bicycle N N 154 77 50
6 Bicycle N Y 349 174 49.9
7 Bicycle Y N 147 74 50.3
8 Bicycle Y Y 350 175 50
9 Blanket N N 134 67 50
10 Blanket N Y 349 174 49.9
# … with 30 more rows
By inspecting the output, we see that strata with even numbers of observations result in an exact 50% sample, whereas strata with odd numbers of observations come out slightly above or below 50%.
A Base R solution
We can also solve the problem with Base R. This approach uses the three variables in the original post (product type, offer, and inventory) to split the data into subgroups based on the combinations of values for these variables, takes a random sample from each subset, and combines the results into a single data frame.
First, we set the seed for the random number generator and build a data frame containing 10,000 rows of products, where 50% of the products are on offer, and 70% are in inventory.
set.seed(1053807)
df <- data.frame(
  productType = rep(c("Bed","Mouse","Table","Cellphone","Laptop","Car","Chair","Blanket",
                      "Sofa","Bicycle"), 1000),
  offer = ifelse(runif(10000) > .5, "Y", "N"),
  inventory = ifelse(runif(10000) > .3, "Y", "N"),
  price = rnorm(10000, 200, 10)
)
Since we want to separately sample each combination of product, offer, and inventory, we create a combined split variable, and then use it to split the data.
splitvar <- paste(df$productType,df$offer,df$inventory,sep="-")
dfList <- split(df,splitvar)
Given the input data frame parameters of 10 products, 2 levels of offer (Y / N), and 2 levels of inventory (Y / N), this creates a dfList object that is a list of 40 data frames, each with varying numbers of observations.
We then use lapply() to randomly select about 50% of each data frame, using the number of rows for each data frame to drive the sample() function.
sampledDataList <- lapply(dfList, function(x){
  x[sample(nrow(x), size = round(.5 * nrow(x))), ]
})
At this point the sampledDataList object is a list of 40 data frames, each of which has approximately 50% of the rows of the corresponding original data frame.
To create the final data frame, we use do.call() as follows.
sampledData <- do.call(rbind,sampledDataList)
When we check the number of observations in the resulting data frame, we see that it is approximately 50% of the original data size (10,000).
> # this should be approximately 5,000 rows
> nrow(sampledData)
[1] 5001
We can further verify that each data frame is approximately a 50% sample with the following code.
# verify sample percentage by stratum
stratum <- names(sampledDataList)
OriginalCount <- sapply(dfList,nrow)
SampledCount <- sapply(sampledDataList,nrow)
SamplePct <- round(SampledCount / OriginalCount * 100,2)
head(data.frame(stratum,OriginalCount,SampledCount,SamplePct,row.names = NULL),10)
...and the output:
> head(data.frame(stratum,OriginalCount,SampledCount,SamplePct,row.names = NULL),10)
stratum OriginalCount SampledCount SamplePct
1 Bed-N-N 161 80 49.69
2 Bed-N-Y 371 186 50.13
3 Bed-Y-N 132 66 50.00
4 Bed-Y-Y 336 168 50.00
5 Bicycle-N-N 154 77 50.00
6 Bicycle-N-Y 349 174 49.86
7 Bicycle-Y-N 147 74 50.34
8 Bicycle-Y-Y 350 175 50.00
9 Blanket-N-N 134 67 50.00
10 Blanket-N-Y 349 174 49.86
As was the case with the dplyr solution, we see that strata with odd numbers of rows either sample one more or one less than an exact 50% of the original data.

How to perform a t-test on variables within the same category on r?

I want to perform a t-test on mean age between men and women at time of arrest. However, my data is arranged like so:
Sex: Age:
M 21
F 31
F 42
M 43
Is there a way to separate the sex category into two separate categories (male and female) in order to perform my t-test? Or to perform a t-test within one category? Similar questions have been asked but none that seem to work on my data set. Thanks for any guidance you could offer!
First off, great first question and glad to see high school kids learning statistical programming!
Second: You are well on your way to the answer yourself, this should help you get there.
I am making some assumptions:
1. prof is the name of your data frame
2. you are looking to compare the ages of the genders from prof in your t-test
You are working in the right direction with your logic. I added a few more made-up observations to my prof data frame, but here is how it should work:
# this is a comment in the code, not code; it explains the reasoning and always starts with a hash (#)
women<-prof[which(prof$Sex=="F"),] #notice the comma after parenthesis
men<-prof[which(prof$Sex=="M"),] #notice the comma after parenthesis here too
The left of the comma selects the rows where the data == "something". The right of the comma specifies which columns; leaving it empty tells R to include all columns.
head(men);head(women) # shows you first 6 rows of each new frame
# you can see below that the data is still in a data frame
Sex Age
1 M 21
4 M 43
5 M 12
6 M 36
7 M 21
10 M 23
Sex Age
2 F 31
3 F 42
8 F 52
9 F 21
11 F 36
So to t-test for age, you must refer to the data frame by name AND the column with Age, for example: men$Age
t.test(women$Age, men$Age) #this is the test
# results below
Welch Two Sample t-test
data: women$Age and men$Age
t = 0.59863, df = 10.172, p-value = 0.5625
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-11.93964 20.73964
sample estimates:
mean of x mean of y
36.4 32.0
There is almost always more than one way in R. Sometimes the initial sorting is more complicated, but working with the data down the road is easier. So, if you would rather not address age through a data frame, you can ask for just the column in your initial subset:
women <- prof[which(prof$Sex=="F"), "Age"] # set women equal to just the ages where Sex is 'F'
men <- prof[which(prof$Sex=="M"), "Age"] # set men equal to just the ages where Sex is 'M'
And review your data again, this time just a vector of ages for each variable:
head(women); head(men)
[1] 31 42 52 21 36
[1] 21 43 12 36 21 23
Then your t-test is a simple comparison:
t.test(women,men)
# notice same results
Welch Two Sample t-test
data: women and men
t = 0.59863, df = 10.172, p-value = 0.5625
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-11.93964 20.73964
sample estimates:
mean of x mean of y
36.4 32.0
It appears that your problem lies in three spots in your code:
1. using gender=="F" when the column is named Sex
2. not using the comma in your [ , ] to specify rows, then columns
3. not addressing the column $Age in your t.test, if it is indeed still two columns
The code above should get you where you need to be.
A t-test comparing the ages of men to the ages of women can be done like:
df = data.frame(
gender = c("M", "F", "F", "M"),
age = c(21, 31, 42, 43)
)
t.test(age ~ gender, data = df)
This is the test that seems most relevant based on your question.
I'm not sure what you mean when you say "perform a t-test within one category": you can compare a set of values from one group to some known reference value like 0, but I'm not sure what that could tell you (other than that the men in your sample are not 0 years old).
You could try this code:
t.test(Age ~ Sex, paired = FALSE, data = datasetName)
It should give you the same result without the hassle of creating more subsets.

Extract multiple data.frames from one with selection criteria

Let this be my data set:
df <- data.frame(x1 = runif(1000), x2 = runif(1000), x3 = runif(1000),
split = sample( c('SPLITMEHERE', 'OBS'), 1000, replace=TRUE, prob=c(0.04, 0.96) ))
So, I have some variables (in my case, 15), and criteria by which I want to split the data.frame into multiple data.frames.
My criterion is the following: every other time 'SPLITMEHERE' appears, I want to take all the values (all the 'OBS' rows) below it and make a data.frame from just those observations. So, if there are 20 'SPLITMEHERE's in the starting data.frame, I want to end up with 10 data.frames.
I know it sounds confusing and like it doesn't make much sense, but this is the result of extracting the raw numbers from an awfully dirty .txt file to obtain meaningful data. Basically, every 'SPLITMEHERE' denotes a new table in this .txt file, but each county is divided into two tables, so I want one table (data.frame) per county.
In the hope of making it clearer, here is an example of exactly what I need. Let's say the first 20 observations are:
x1 x2 x3 split
1 0.307379064 0.400526799 0.2898194543 SPLITMEHERE
2 0.465236674 0.915204924 0.5168274657 OBS
3 0.063814420 0.110380201 0.9564822116 OBS
4 0.401881416 0.581895095 0.9443995396 OBS
5 0.495227871 0.054014926 0.9059893533 SPLITMEHERE
6 0.091463620 0.945452614 0.9677482590 OBS
7 0.876123151 0.702328031 0.9739113525 OBS
8 0.413120761 0.441159673 0.4725571219 OBS
9 0.117764512 0.390644966 0.3511555807 OBS
10 0.576699384 0.416279417 0.8961428872 OBS
11 0.854786077 0.164332814 0.1609375612 OBS
12 0.336853841 0.794020157 0.0647337821 SPLITMEHERE
13 0.122690541 0.700047133 0.9701538396 OBS
14 0.733926139 0.785366852 0.8938749305 OBS
15 0.520766503 0.616765349 0.5136788010 OBS
16 0.628549288 0.027319848 0.4509875809 OBS
17 0.944188977 0.913900539 0.3767973795 OBS
18 0.723421337 0.446724318 0.0925365961 OBS
19 0.758001243 0.530991725 0.3916394396 SPLITMEHERE
20 0.888036748 0.862066601 0.6501050976 OBS
What I would like to get is this:
data.frame1:
1 0.465236674 0.915204924 0.5168274657 OBS
2 0.063814420 0.110380201 0.9564822116 OBS
3 0.401881416 0.581895095 0.9443995396 OBS
4 0.091463620 0.945452614 0.9677482590 OBS
5 0.876123151 0.702328031 0.9739113525 OBS
6 0.413120761 0.441159673 0.4725571219 OBS
7 0.117764512 0.390644966 0.3511555807 OBS
8 0.576699384 0.416279417 0.8961428872 OBS
9 0.854786077 0.164332814 0.1609375612 OBS
And
data.frame2:
1 0.122690541 0.700047133 0.9701538396 OBS
2 0.733926139 0.785366852 0.8938749305 OBS
3 0.520766503 0.616765349 0.5136788010 OBS
4 0.628549288 0.027319848 0.4509875809 OBS
5 0.944188977 0.913900539 0.3767973795 OBS
6 0.723421337 0.446724318 0.0925365961 OBS
7 0.888036748 0.862066601 0.6501050976 OBS
Therefore, the split column only shows me where to split; the data in the rows where 'SPLITMEHERE' appears is meaningless. But this is no bother, as I can delete these rows later; the point is separating multiple data.frames based on this criterion.
Obviously, the split() function alone, or filter() from dplyr, wouldn't suffice here. The real problem is that the lines which are supposed to separate the data.frames (i.e. every other 'SPLITMEHERE') do not appear at regular intervals, as in my example above. Sometimes there is a gap of 3 lines; other times it could be 10 or 15.
Is there any way to extract this efficiently in R?
The hardest part of the problem is creating the groups. Once we have the proper groupings, it's easy enough to use a split to get your result.
With that said, you can use a cumsum for the groups. Here I divide the cumsum by 2 and take the ceiling so that each pair of SPLITMEHEREs collapses into one group. I also use an ifelse to exclude the rows containing SPLITMEHERE:
df$group <- ifelse(df$split != "SPLITMEHERE", ceiling(cumsum(df$split=="SPLITMEHERE")/2), 0)
res <- split(df, df$group)
The result is a list with a data frame for each group. The group labeled 0 contains the marker rows, which you want to throw out.
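To see how the grouping expression behaves, here is a minimal sketch on a short marker vector (a stand-in for the real split column):

```r
split <- c("SPLITMEHERE","OBS","OBS","SPLITMEHERE","OBS","OBS","OBS",
           "SPLITMEHERE","OBS")

# cumsum counts how many markers have been seen so far;
# dividing by 2 and taking the ceiling merges each pair of markers
# into one group; marker rows themselves get label 0
group <- ifelse(split != "SPLITMEHERE",
                ceiling(cumsum(split == "SPLITMEHERE") / 2), 0)
group
# 0 1 1 0 1 1 1 0 2
```

Rows between the first and second markers, and between the second and third, all land in group 1; rows after the third marker start group 2, matching the "every other SPLITMEHERE" requirement.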

How to have a maximum number of replacement when sampling in R?

I'm trying to sample a list of numbers with replacement. I would like to have a hard cap on the number of times a number is chosen. For instance:
x=sample(1:20, 10, replace = TRUE)
[1] 17 5 11 13 5 11 14 11 10 11
In this case the number 11 has a frequency of 4.
Is there a way I can force that frequency to be 2 or less?
It sounds like you are actually looking for a random assignment of people to houses. This could be done by putting two IDs for each house into a vector, generating a random permutation of the vector, and then assigning each entry to a person.
houseIds <- c(1,2,3,4,5)
houseSamples <- sample(rep(houseIds, 2), 8) #where 8 is the number of people
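If you do want to sample from 1:20 with each number appearing at most twice, as in the original question, the same trick works: sample without replacement from a pool that contains each number exactly twice. A minimal sketch:

```r
# the pool holds each of 1:20 exactly twice, so no value can be
# drawn more than twice when sampling without replacement
x <- sample(rep(1:20, each = 2), 10)
max(table(x))  # never exceeds 2
```

This guarantees the cap by construction, rather than rejecting and resampling draws that break it.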

Generate a set of random unique integers from an interval

I am trying to build some machine learning models, so I need training data and validation data.
Suppose I have N examples and I want to select x random examples from a data frame. For example, if I have 100 examples and need 10 of them, is there a way to (efficiently) generate 10 random, unique integers that I can use to extract the training data out of my sample data?
I tried using a while loop that slowly replaces the repeated numbers, but the running time is not ideal, so I am looking for a more efficient way to do it.
Can anyone help, please?
sample (or sample.int) does this:
sample.int(100, 10)
# [1] 58 83 54 68 53 4 71 11 75 90
will generate ten random integers from the range 1-100, without replacement, so the values are guaranteed unique (which is what you want here). If you did want to allow repeats, you would set replace = TRUE, which samples with replacement:
sample.int(20, 10, replace = TRUE)
# [1] 10 2 11 13 9 9 3 13 3 17
More generally, sample samples n observations from a vector of arbitrary values.
If I understand correctly, you are trying to create a hold-out sample. This is usually done using probabilities. So if you have n.rows samples and want a fraction of training.fraction to be used for training, you may do something like this:
select.training <- runif(n=n.rows) < training.fraction
data.training <- my.data[select.training, ]
data.testing <- my.data[!select.training, ]
If you want to specify EXACT number of training cases, you may do something like:
indices.training <- sample(x=seq(n.rows), size=training.size, replace=FALSE) #replace=FALSE makes sure the indices are unique
data.training <- my.data[indices.training, ]
data.testing <- my.data[-indices.training, ] #note that index negation means "take everything except for those"
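For completeness, here is a self-contained sketch of the exact-count split above, using a made-up toy data frame and an 80/20 split (the object names follow the snippet; the data is illustrative):

```r
set.seed(42)  # seed is arbitrary, for reproducibility
my.data <- data.frame(x = rnorm(100), y = rnorm(100))
n.rows <- nrow(my.data)
training.size <- round(0.8 * n.rows)

# unique row indices for the training set
indices.training <- sample(x = seq(n.rows), size = training.size, replace = FALSE)
data.training <- my.data[indices.training, ]
data.testing  <- my.data[-indices.training, ]

nrow(data.training)  # 80
nrow(data.testing)   # 20
```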
From the raster package:
raster::sampleInt(242, 10, replace = FALSE)
## 95 230 148 183 38 98 137 110 188 39
This is useful because base sample.int may fail if the limits are too large:
sample.int(1e+12, 10)
