I have a data set of 12.5 million records and I need to randomly select about 2.5 million. However, these individuals are in 55284 groups and I want to keep groups intact.
So basically I want to remove groups until I've got 2.5 million records left OR select groups until I have about 2.5 million individuals.
If this is my data:
data <- data.frame(
id = c(1, 2, 3, 4, 5),
group = c(1, 1, 2, 2, 3)
)
I wouldn't want to remove id 1 and keep id 2; I'd like to either keep both or discard both, because they are in the same group (group 1).
So ideally, this function randomly selects a group, counts its individuals and puts them in a data set, then does the same thing again, and keeps counting the individuals until it has about 2.5 million (it is fine to stop adding groups as soon as n exceeds 2.5 million).
I haven't been able to find a function and I am not yet skilled enough to put something together myself, unfortunately.
Hope someone can help me out!
Thanks
Too long for a comment, hence answering. Do you need something like this?
#Order data by group so rows with same groups are together
data1 <- data[order(data$group), ]
#Get all the groups in the first 2.5M entries
selected_group <- unique(data1$group[1:2500000])
#Subset those groups so you have all groups intact
final_data <- data1[data1$group %in% selected_group, ]
For a random approach, we can use a while loop:
#Get all the groups in the data
all_groups <- unique(data$group)
#Variable to hold row indices
rows_to_sample <- integer()
#While the number of rows to subset is less than 2.5M
while (length(rows_to_sample) < 2500000) {
#Select one random group; indexing by position avoids sample()'s
#surprising behaviour when only one group remains
select_group <- all_groups[sample.int(length(all_groups), 1)]
#Get row indices of that group
rows_to_sample <- c(rows_to_sample, which(data$group == select_group))
#Remove that group from all_groups
all_groups <- setdiff(all_groups, select_group)
}
data[rows_to_sample, ]
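If the loop feels slow with 55,284 groups, the same idea works without a loop — a sketch, assuming the same data frame data with a group column and that the 2.5M threshold is reachable:
#Tabulate group sizes, shuffle the groups, and keep groups until the
#cumulative row count first reaches 2.5M, then subset in a single pass
sizes <- table(data$group)
shuffled <- sample(names(sizes))
cum_rows <- cumsum(sizes[shuffled])
keep <- shuffled[seq_len(which(cum_rows >= 2500000)[1])]
sampled <- data[data$group %in% keep, ]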
Here is a possibility. I demonstrate it using toy data and a threshold of 33 (instead of 2.5 million). First I create the toy group vector:
threshold <- 33
set.seed(111)
mygroups <- rep(1:10, rpois(10, 10))
In this toy example group 1 has 10 individuals, group 2 has 8 individuals and so on.
Now I put the groups in random order and use cumsum to determine when the threshold is exceeded:
x <- cumsum(table(mygroups)[sample(1:10)])
randomgroups <- as.integer(names(x[x <= threshold]))
randomgroups
[1] 1 7 5
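The last step, subsetting the full data to those groups, would then be (assuming the real data frame is called data with a group column):
#Keep every row whose group was randomly selected
final_data <- data[data$group %in% randomgroups, ]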
I have the following database (as an example)
participants <- c(1:60)
codes <- c(1:60)
database <- cbind(participants, codes)
The second variable, codes, contains the email linked to each participant ID, although for example purposes I just filled it with numbers.
I have 60 participants, each with a participant ID and an email tied to that ID (a number from 1 to 60). So in the example, row 1 is 1, 1 and so on.
I need to divide the list into 3 groups of equal size, e.g. 20 participants per group.
The way I am doing it now is
#generating list of participants
participants <- c(1:60)
codes <- c(1:60)
database <- cbind(participants, codes)
#randomizing the order of the list
randomized_list <- sample(participants)
#Extracting the three groups
group1 <- randomized_list[c(1:20)]
group2 <- randomized_list[c(21:40)]
group3 <- randomized_list[c(41:60)]
This leaves me to do the work of getting the email addresses and dividing the lists more or less by hand (comparing groups 1, 2 and 3 with database and making the link).
Is there a more elegant and compact solution for achieving the results I seek?
First assign rows randomly to groups and then use that to access the groups.
# generate random group labels
r_labels <- sample(1:3, nrow(database), replace = TRUE)
# group 1
database[r_labels == 1, ]
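Note that sample() with replace = TRUE only gives approximately equal group sizes. If you need exactly 20 per group, one sketch (assuming database as built above):
# exactly 20 labels of each group, shuffled into random order
r_labels <- sample(rep(1:3, each = 20))
# split the whole database (IDs and codes together) into a named list of 3 groups
groups <- split(as.data.frame(database), r_labels)
groups[["1"]] # participants and their email codes for group 1
This keeps the emails attached to the IDs, so no manual matching is needed.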
I have 400 rows that have a bunch of columns, with the last five being: a,b,c,d,e
For each row, I want to randomly select three of the above 5 variables and take their rowMeans() to make column trio_average, with the remaining two making pair_average.
For example, one row might be the mean of b,d,e for column "trio_average" and the mean of a,c for "pair_average", and the next might be the mean of a,c,e and b,d.
I did this in a pretty roundabout way... I used the "randomizr" package to generate a variable called "trio_set" with 400 random (conditionally random, to keep the counts equal) trios of the 5 variables. There are 10 possible combinations of the 5 variables, so I have 40 each of, for example, "a_c_e", "b_c_d", etc.
Then, I used a series of 10 ifelse statements:
data <- transform(data, trio_average = ifelse(trio_set == "a_b_c", rowMeans(data[c("a","b","c")]),
ifelse(trio_set == "a_b_d", rowMeans(data[c("a","b","d")]), ....
I would then do this another 10 times for the pairs.
This does get the job done but in reality, my column names are much longer than e.g. "a" so my code in the end is pretty bad looking and inefficient. Is there a better way to do this?
Using base R, we can use a row-wise apply:
cols <- c('a', 'b', 'c', 'd', 'e')
df$trio_average <- apply(df[cols], 1, function(x) mean(sample(x, 3), na.rm = TRUE))
Select the specific columns you are interested in and for each row randomly select 3 values and take their mean.
To also get the mean of the values that were not selected, we can store the indices of the sampled values and use them to compute both means for each row.
df[c('chosen', 'remaining')] <- t(apply(df[cols], 1, function(x) {
inds <- sample(seq_along(x), 3)
c(mean(x[inds]), mean(x[-inds]))
}))
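A minimal reproducible check of both steps (toy data; a-e stand in for your longer column names):
set.seed(1) # make the random draw repeatable
df <- as.data.frame(matrix(rnorm(400 * 5), ncol = 5,
                           dimnames = list(NULL, c('a','b','c','d','e'))))
cols <- c('a', 'b', 'c', 'd', 'e')
df[c('trio_average', 'pair_average')] <- t(apply(df[cols], 1, function(x) {
  inds <- sample(seq_along(x), 3)  # pick 3 of the 5 columns
  c(mean(x[inds]), mean(x[-inds])) # trio mean, then pair mean
}))
head(df)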
I have numbers from 1 to 6000 and I want them separated in the manner listed below.
1-10 as "Range1"
10-20 as "Range2"
20-30 as "Range3"
.
.
.
5900-6000 as "Range600".
I want to bin the numbers into equal intervals of width 10, and at the end I want to find which range occurs most frequently.
How can we solve this in R?
You can use the cut function; then table can count each category and sort can order them by prevalence.
x <- 1:6000
x2 <- cut(x, breaks = seq(0, 6000, by = 10), labels = paste0('Range', 1:600))
sort(table(x2), decreasing = TRUE)
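To pull out just the most frequent range(s) directly, a small follow-up:
tab <- table(x2)
names(tab)[tab == max(tab)] # the modal range(s); ties are all returned
(With this toy x every range ties at 10; on real data this returns the winner.)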
There is a maths trick for your question. If you want categories of length 10, round(x/10) will create a category in which 0-5 become 0, 6 to 14 become 1, 15 to 24 become 2, etc. If you want categories 1-10, 11-20, etc., you can use round((x+4.1)/10).
(R follows the IEC 60559 standard and rounds halves to even, which is why round(0.5) is 0 but round(1.5) is 2; the 4.1 offset keeps x/10 away from exact halves.)
Not the most elegant code but maybe the easiest to understand, here is an example:
# Create randomly 50 numbers between 1 and 60
x <- sample(1:60, 50)
# Gather into a data.frame and add a column count containing the value one for each row
df <- data.frame(x, count=1)
df
# create a new column with the category
df$cat <- round((df$x+4.1)/10)
# If you want it as text:
df$cat2 <- paste("Range",round((df$x+4.1)/10), sep="")
str(df)
# Calculate the number of values in each category
freq <- aggregate(count~cat2, data=df, FUN=sum)
# Get the maximum number of values in the most frequent category(ies)
max(freq$count)
# Get the category(ies) name(s)
freq[freq$count == max(freq$count), "cat2"]
I have a large dataframe with multiple columns representing different variables that were measured for different individuals. The names of the columns always start with a number (e.g. 1:18). I would like to subset the df and create separate dfs for each individual. Here is an example:
x <- as.data.frame(matrix(nrow=10,ncol=18))
colnames(x) <- paste(1:18, 'col', sep="")
The column names of my real df are a combination of the individual ID, the variable name, and the number of the measure (I took 3 measures of each variable). So for instance, for the measure b (body) of individual 1, the df would have 3 columns named: 1b1, 1b2, 1b3. In all I have 10 different regions (body, head, tail, tail base, dorsum, flank, venter, throat, forearm, leg). So for each individual I have 30 columns (10 regions x 3 measures per region). So I have multiple variables starting with the different numbers and I would like to subset them based on their unique numbers. I tried using grep:
partialName <- 1
df2<- x[,grep(partialName, colnames(x))]
colnames(x)
[1] "1col"  "2col"  "3col"  "4col"  "5col"  "6col"  "7col"  "8col"  "9col"  "10col"
[11] "11col" "12col" "13col" "14col" "15col" "16col" "17col" "18col"
My problem here, as you can see, is that it doesn't separate the individuals, because both 1 and 10 end up in the subset. In other words, this selects everyone whose name starts with 1.
Ultimately what I would like to do is to loop over all my individuals (1:18), creating new dfs for each individual.
I think keeping the data in one data.frame is the best option here. Either that, or put it into a list of data.frames. This makes extracting summary statistics per individual much easier.
First create some example data:
df = as.data.frame(matrix(runif(50 * 100), 100, 50), stringsAsFactors = FALSE)
names_variables = c('spam', 'ham', 'shrub')
individuals = 1:100
column_names = paste(sample(individuals, 50),
sample(names_variables, 50, TRUE),
sep = '')
colnames(df) = column_names
What I would do first is use melt to cast the data from wide format to long format. This essentially stacks all the columns in one big vector, and adds an extra column telling which column it came from:
library(reshape2)
df_melt = melt(df)
head(df_melt)
variable value
1 85ham 0.83619111
2 85ham 0.08503596
3 85ham 0.54599402
4 85ham 0.42579376
5 85ham 0.68702319
6 85ham 0.88642715
Then we need to separate the ID number from the variable. The assumption here is that the numeric part of the variable is the individual ID, and the text is the variable name:
library(dplyr)
df_melt = mutate(df_melt, individual_ID = gsub('[A-Za-z]', '', variable),
var_name = gsub('[0-9]', '', variable))
essentially removing the part of the string that is not needed. Now we can do nice things like:
mean_per_indivdual_per_var = summarise(group_by(df_melt, individual_ID, var_name),
mean(value))
head(mean_per_indivdual_per_var)
individual_ID var_name mean(value)
1 63 spam 0.4840511
2 46 ham 0.4979884
3 20 shrub 0.5094550
4 90 ham 0.5550148
5 30 shrub 0.4233039
6 21 ham 0.4764298
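If you do want one data.frame per individual in the end, you can split the long data into a named list rather than creating many loose objects:
# one data.frame per individual; access e.g. per_individual[["85"]]
per_individual <- split(df_melt, df_melt$individual_ID)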
It seems that the colnames in your example are effectively just column positions, so to get only column 1 you can do this:
df2 <- x[, 1] # where 1 can be changed to the number of the column you want
There is no need to subset by a partial name.
Although it is not recommended, you could create a loop to do so:
for (i in 1:ncol(x)){
assign(paste0("df", i), x[, i]) #paste0 builds a different name for each column
}
Although the @paulhiemstra solution avoids the loop.
So with the new information you can do what you wanted with grep, by anchoring the pattern to the start of the name and requiring a non-digit after the ID, so that individual 1 does not also match 10-18:
df2 <- x[, grep("^1[^0-9]", colnames(x))]
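Looping that over all individuals then gives a list of per-individual data frames (a sketch, assuming IDs 1:18 and names like 1b1, 1b2, ...):
# one data.frame per individual, kept in a list instead of 18 loose objects
per_ind <- lapply(1:18, function(i)
  x[, grep(paste0("^", i, "[^0-9]"), colnames(x)), drop = FALSE])
names(per_ind) <- 1:18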
What is the most efficient way to sample a data frame under a certain constraint?
For example, say I have a directory of Names and Salaries, how do I select 3 such that their sum does not exceed some value. I'm just using a while loop but that seems pretty inefficient.
You could face a combinatorial explosion. This simulates selecting groups of 3 EEs from a set of 20 with salaries at a mean of 60 and sd of 20. Enumerating all 1140 combinations shows that only 263 have a salary sum below 150.
> set.seed(123)
> salry <- data.frame(EEnams = sapply(1:20 ,
function(x){paste(sample(letters[1:20], 6) ,
collapse="")}), sals = rnorm(20, 60, 20))
> head(salry)
EEnams sals
1 fohpqa 67.59279
2 kqjhpg 49.95353
3 nkbpda 53.33585
4 gsqlko 39.62849
5 ntjkec 38.56418
6 trmnah 66.07057
> sum( apply( combn(1:NROW(salry), 3) , 2, function(x) sum(salry[x, "sals"]) < 150))
[1] 263
If you had 1000 EE's then you would have:
> choose(1000, 3) # Combination possibilities
# [1] 166,167,000 Commas added to output
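For the small case, once the qualifying combinations are enumerated, you can draw from them directly — a sketch reusing the salry data above:
> cmb <- combn(1:NROW(salry), 3)
> ok <- which(apply(cmb, 2, function(x) sum(salry[x, "sals"])) < 150)
> salry[cmb[, sample(ok, 1)], ] # one random trio meeting the constraint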
One approach would be to start with the full data frame and sample one case. Create a data frame which consists of all the cases which have a salary less than your constraint minus the selected salary. Select a second case from this and repeat the process of creating a remaining set of cases to choose from. Stop if you get to the number you need (3), or if at any point there are no cases in the data frame to choose from (reject what you have so far and restart the sampling procedure).
Note that different approaches will create different probability distributions for a case being included; generally it won't be uniform.
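A sketch of that sequential procedure (a hypothetical helper, assuming a data frame salry with a sals column as in the other answer, groups of 3, and a cap of 150):
pick_constrained <- function(df, n = 3, cap = 150, max_tries = 1000) {
  for (attempt in seq_len(max_tries)) {
    chosen <- integer(0)
    remaining <- seq_len(nrow(df))
    total <- 0
    for (k in seq_len(n)) {
      # keep only the cases that still fit under the cap
      remaining <- remaining[df$sals[remaining] <= cap - total]
      if (length(remaining) == 0) break # dead end: reject and restart
      pick <- remaining[sample.int(length(remaining), 1)]
      chosen <- c(chosen, pick)
      total <- total + df$sals[pick]
      remaining <- setdiff(remaining, pick)
    }
    if (length(chosen) == n) return(df[chosen, ])
  }
  stop("no feasible sample found")
}
pick_constrained(salry) # three EEs whose salaries sum to under 150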
How big is your dataset? If it is small (and small really depends on your hardware), you could just list all groups of three, calculate the sums, and sample from the groups that meet the constraint.
## create data frame
N <- 100
salary <- rnorm(N)
## list all possible groups of 3 from this
x <- combn(salary, 3)
## the sum of each group of 3
sx <- colSums(x)
## indices of the groups whose sum satisfies the constraint (sum < 1 here)
ok <- which(sx < 1)
## sample 10 qualifying groups (columns of x) with replacement
x[, sample(ok, 10, replace = TRUE)]