Example Data
df <- data.frame(id=rep(LETTERS, each=10)[1:50], fruit=sample(c("apple", "orange", "banana"), 50, TRUE))
Problem
Pick a random start point within each id and, from that point, select that row and the subsequent, sequential rows totaling 1% of the rows within that ID. Then do it again for 2% of each ID's rows, then 3%, and so on up to 99% of the rows per ID. Also, do not select a random start point that is closer to the end of the ID's rows than the percentage to be sampled (i.e., don't try to sample 20% of the sequential rows from a point that's 10% from the end of an ID's rows).
Desired Result
What dfcombine looks like from the first code chunk below, except that instead of randomly selected fruit rows within an id, the fruit rows will have only a random start point, with the subsequent rows needed for the sample following the start-point row sequentially.
What I've Tried
I can pull off part of this problem with the following code -- but it selects all rows at random, and I need the sample chunks to be sequential, following the random start point (FYI: if you run this, you'll see your chunks start at 6% because this is a small dataset -- no rows are returned for samples below 6% per ID):
library(tidyverse)
set.seed(123) # pick same sample each time
dflist<-list() # make an empty list
for (i in 1:100) # "do i a hundred times"
{
i.2 <- i/100 # convert i to a fraction
dflooped <- df %>% # new df
group_by(id) %>% # group by id
sample_frac(i.2, replace = TRUE) # take a random sample of fraction i.2 within each id
dflooped
dflist[[i]]<-dflooped
}
dflist # check
library(data.table)
dfcombine <- rbindlist(dflist, idcol = "id%") # put the list elements in a df
I can also pick the sequentially larger chunks I'm looking for with this -- but it doesn't allow me the random start (it always goes from the beginning of the df):
lapply(seq(.01,.1,.01), function(i) df[1:(nrow(df)*i),])
and using dplyr group_by spits an error I don't understand:
df2 <- df %>%
group_by(id) %>%
lapply(seq(.01,1,.01), function(i) df[1:(nrow(df)*i),])
Error in match.fun(FUN) :
'seq(0.01, 1, 0.01)' is not a function, character or symbol
So I may have some of the pieces, but am having trouble putting them together -- the solution may or may not include what I've done above. Thanks.
Sequential sampling within ID
Create fake data
df <- data.frame(id=rep(LETTERS, each=10)[1:50], fruit=sample(c("apple", "orange", "banana"), 50, TRUE), stringsAsFactors = F)
Adding a more unique data element to the test data, for testing the sampling:
df$random_numb <- round(runif(nrow(df), 1, 100), 2)
Here we'll define a function to do what you want:
I question the statistical impact of only starting your random sample from a spot where you won't "run out" of observations within this ID category.
Would it not be better to loop back to the top of the records within each ID category if you were to run out? That would ensure a uniform chance of beginning your sample within any portion of a specific ID field, as opposed to limiting yourself to the first 80% of the data when you want a 20% sample. Just a thought! I built this as you asked, though.
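For illustration, here is a minimal sketch of that wrap-around idea (a hypothetical variant, not used in the solution below):
# hypothetical wrap-around variant: any row can be the start point,
# and the run of indices wraps back to row 1 if it passes the last row
wrap_seq_sample <- function(p_df, p_sampleperc) {
  n <- nrow(p_df)
  k <- max(1, floor(p_sampleperc * n))   # at least one row
  start <- sample.int(n, 1)              # uniform over all rows
  idx <- ((start - 1 + 0:(k - 1)) %% n) + 1
  p_df[idx, ]
}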
random_start_seq_sample <- function(p_df, p_idname, p_idvalue, p_sampleperc) {
# subset the data frame for the ID we're currently interested in
p_df <- p_df[ p_df[, p_idname] == p_idvalue, ]
# calculate number of rows we need in order to sample _% of the data within this ID
nrows_to_sample <- floor(p_sampleperc * nrow(p_df))
# calculate a single random number to serve as our start point somewhere between:
# 1 and the (number of rows - (number of rows to sample + 1)) -- the plus 1
# is to add a cushion and avoid issues
start_samp_indx <- as.integer(runif(1, 1, (nrow(p_df) - (nrows_to_sample + 1) )))
# build the run of row indices: the start index plus the next (nrows_to_sample - 1) rows
all_samp_indx <- start_samp_indx:(start_samp_indx + (nrows_to_sample - 1))
return(p_df[all_samp_indx,])
}
Test function for a single function call
Test out the function with just a single sample for a certain percent (10% here). This is also a good way to rerun the same function call several times to confirm that the starting location is randomized.
# single test: give me 10% of the rows with 'A' in the 'id' field:
random_start_seq_sample(df, 'id', 'A', 0.1)
Now place the function in a for loop
Set aside a unique list of all potential values within the id field. Also set aside a vector of sample sizes in percent format (between 0 and 1).
# capture all possible values in id field
possible_ids <- unique(df$id)
# these values need to be between 0 and 1 (10% == 0.1)
sampleperc_sequence <- (1:length(possible_ids) / 10)
# initialize list:
combined_list <- list()
for(i in 1:length(possible_ids)) {
print(paste0("Now sampling ", sampleperc_sequence[i], " from ", possible_ids[i]))
combined_list[[i]] <- random_start_seq_sample(df, 'id', possible_ids[i], sampleperc_sequence[i])
}
Process the results
# process results of for loop
combined_list
# number of rows in each df in our list
sapply(combined_list, nrow)
This is the resulting dataset stacking the samples, one per ID:
# cross reference the numeric field with the original data frame to make sure we had random starting points
dfcombined <- do.call(rbind, combined_list)
EDIT:
I'll leave what I initially wrote up there, but in retrospect, I think this is actually a bit closer to what you are asking for.
This solution uses the same type of function, but I used nested for loops to achieve what you were asking for.
For each ID, it will:
subset dataframe for this ID value
find random starting point
sample n% of the data (starting with 1%)
repeat, increasing n by 1% each time (up to 99%)
Code:
df <- data.frame(id=rep(LETTERS, each=10)[1:50], fruit=sample(c("apple", "orange", "banana"), 50, TRUE), stringsAsFactors = F)
# adding a more unique data element to test data for testing sampling
df$random_numb <- round(runif(nrow(df), 1, 100), 2)
# function to do what you want:
random_start_seq_sample <- function(p_df, p_idname, p_idvalue, p_sampleperc) {
# subset the data frame for the ID we're currently interested in
p_df <- p_df[ p_df[, p_idname] == p_idvalue, ]
# calculate number of rows we need in order to sample _% of the data within this ID
nrows_to_sample <- floor(p_sampleperc * nrow(p_df))
# don't let us use zero as an index
if(nrows_to_sample < 1) {
nrows_to_sample <- 1
}
# calculate a single random number to serve as our start point somewhere between
# 1 and (number of rows - number of rows to sample)
start_samp_indx <- as.integer(runif(1, 1, (nrow(p_df) - nrows_to_sample)))
# build the run of row indices: the start index plus the next (nrows_to_sample - 1) rows
all_samp_indx <- start_samp_indx:(start_samp_indx + (nrows_to_sample - 1))
return(p_df[all_samp_indx,])
}
# single test: give me 10% of the rows with 'A' in the 'id' field:
random_start_seq_sample(df, 'id', 'A', 0.1)
# now put this bad boy in a for loop -- pairing every ID with every sample percentage
possible_ids <- unique(df$id)
# these values need to be between 0 and 1 (10% == 0.1)
sampleperc_sequence <- (1:99 / 100)
# an expand.grid of every ID/percentage pair (for cross-referencing; not used below)
ids_sample <- expand.grid(possible_ids, sampleperc_sequence)
# initialize list:
combined_list <- list()
counter <- 1
for(i in 1:length(possible_ids)) {
for(j in 1:length(sampleperc_sequence)) {
print(paste0("Now sampling ", (sampleperc_sequence[j] * 100), "% from ", possible_ids[i]))
combined_list[[counter]] <- random_start_seq_sample(df, 'id', possible_ids[i], sampleperc_sequence[j])
# manually keep track of counter
counter <- counter + 1
}
}
# spot-check a single call: a 91% sample from the first ID
random_start_seq_sample(df, 'id', possible_ids[1], sampleperc_sequence[91])
# process results of for loop
combined_list
# check size of the first list element
combined_list[[1]] # A, 1% sample is bumped up to 1 record
# check the thirtieth element
combined_list[[30]] # A, 30% sample is 3 records
# check size of the sixtieth list element
combined_list[[60]] # A, 60% sample is 6 records
sapply(combined_list, nrow) # number of rows in each df in our list
# cross reference the numeric field with the original data frame to make sure we had random starting points
dfcombined <- do.call(rbind, combined_list)
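As a side note, since the question started in the tidyverse: the same constrained-start idea can be sketched with dplyr's group_modify (seq_sample_frac is my name for the helper; a sketch, not tested against your full data):
library(dplyr)
# helper: one sequential sample of fraction `frac` from one group's rows
seq_sample_frac <- function(d, frac) {
  k <- max(1, floor(frac * nrow(d)))
  start <- sample(seq_len(nrow(d) - k + 1), 1) # never start too close to the end
  d[start:(start + k - 1), ]
}
dfcombined2 <- bind_rows(
  lapply(1:99 / 100, function(frac) {
    df %>%
      group_by(id) %>%
      group_modify(~ seq_sample_frac(.x, frac)) %>%
      ungroup() %>%
      mutate(pct = frac)
  })
)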
Related
I have some large data frames that are big enough to push the limits of R on my machine; e.g., the one on which I'm currently working is 2 columns by 70 million rows. The contents aren't important, but just in case, column 1 is a string and column 2 is an integer.
What I would like to do is split that data frame into n parts (say, 20, but preferably something that could change on a case-by-case basis) so that I can work on each of the smaller data frames one at a time. That means that (a) the result has to produce things that are named (e.g., "newdf_1", "newdf_2", ... "newdf_20" or something), and (b) each line in the original data frame needs to be in one (and only one) of the new "sub" data frames. The order does not matter, but doing it sequentially by rows makes sense to me.
Once I do the work, I will start to recombine them (using rbind()) one pair at a time.
I've looked at split(), but from what I can tell, it is designed to work with factors (which I don't have).
Any ideas?
You can create a new column and split the data frame based on that column. The column does not need to be a factor, but it needs to be a data type that can be coerced to a factor by the split function.
# Number of groups
N <- 20
dat$group <- 1:nrow(dat) %% N
# Add 1 to group
dat$group <- dat$group + 1
# Split dat by group (the formula interface of split() needs R >= 4.1)
dat_list <- split(dat, f = ~group)
# Set the name of the list
names(dat_list) <- paste0("newdf_", 1:N)
Data
set.seed(123)
# Create example data frame
dat <- data.frame(
A = sample(letters, size = 70000000, replace = TRUE),
B = rpois(70000000, lambda = 1)
)
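One caveat: the modulo assignment interleaves rows across groups. If you would rather have contiguous blocks of rows, as the question hints, here is a small sketch using the same dat and N:
# contiguous blocks: rows 1..k in group 1, the next k in group 2, and so on
block_size <- ceiling(nrow(dat) / N)
dat$group <- ceiling(seq_len(nrow(dat)) / block_size)
dat_list <- split(dat, dat$group)
names(dat_list) <- paste0("newdf_", seq_along(dat_list))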
Here's a tidyverse-based solution. Try using read_csv_chunked() from readr.
library(tidyverse)
# practice data
tibble(string = sample(letters, 1e6, replace = TRUE),
       value = rnorm(1e6)) %>%
  write_csv("test.csv")
# here's the solution
partial_data <- read_csv_chunked("test.csv",
DataFrameCallback$new(function(x, pos) filter(x, string == "a")),
chunk_size = 1000)
You can wrap the call to read_csv_chunked in a function where you change the string that you subset on.
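A sketch of that wrapper might look like this (read_subset is my name for it; it assumes the test.csv written above):
# wrap the chunked read so the filter string is a parameter
read_subset <- function(file, letter) {
  read_csv_chunked(file,
                   DataFrameCallback$new(function(x, pos) filter(x, string == letter)),
                   chunk_size = 10000)
}
a_rows <- read_subset("test.csv", "a")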
This is more or less a repeat of this question:
How to read only lines that fulfil a condition from a csv into R?
I am new to R and trying to accomplish a fairly simple task. I have a dataset composed of 20 obs of 19 variables, and I want to generate three non-overlapping groups of 5 obs. I am using the slice_sample function from the dplyr package, but how do I reiterate, excluding the obs already picked in the first round?
library( "dplyr")
set.seed(123)
NF_1 <- slice_sample(NF, n = 5)
You can use the sample function from base R.
All you have to do is sample the rows with replace = FALSE, which means you won't have any overlap. You can also define the number of samples.
n_groups <- 3
observations_per_group <- 5
size <- n_groups * observations_per_group
selected_samples <- sample(seq_len(nrow(NF)), size = size, replace = FALSE)
# Now index those selected rows
NF_1 <- NF[selected_samples, ]
Now, according to your comment, if you want to generate N dataframes, each with a number of samples and also label them accordingly, you can use lapply (which is a function that "applies" a function to a set of values). The "l" in "lapply" means that it returns a list. There are other types of apply functions. You can read more about that (and I highly recommend that you do!) here.
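For instance, a one-line illustration of lapply (separate from the solution below):
# apply a function to each element of 1:3 and collect the results in a list
lapply(1:3, function(x) x^2) # returns list(1, 4, 9)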
This code should solve your problem, or at least give you a good idea or where to go.
n_groups <- 3
observations_per_group <- 5
size <- observations_per_group * n_groups
# First we'll get the row samples.
selected_samples <- sample(
seq_len(nrow(NF)),
size = size,
replace = FALSE
)
# Now we split them between the number of groups
split_samples <- split(
selected_samples,
rep(1:n_groups, observations_per_group)
)
# For each group (1 to n_groups) we'll define a dataframe with samples
# and store them sequentially in a list.
my_dataframes <- lapply(1:n_groups, function(x) {
# our subset df will be the original df with the list of samples
# for the group at position "x" (1, 2, 3.., n_groups)
subset_df <- NF[split_samples[[x]], ]
return(subset_df)
})
# now, if you need to access the results, you can simply do:
first_df <- my_dataframes[[1]] # use double brackets to access list elements
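If you want to confirm that the groups really don't overlap, here is a quick check on the objects above:
# TRUE if no row index appears in more than one group
anyDuplicated(unlist(split_samples)) == 0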
I have an R data.frame with fast Fourier transform results. The columns are Power and Frequency. For the purpose of my study, I have selected a portion of the full FFT results (frequencies between 1 and 48).
I have to identify maximum 5 peaks (highest powers) inside the frequency range. Also, there is a condition that we won't consider any peaks that are less than 45% of the highest peak. I will share the code below:
library(dplyr)
df_FFT_2 <- select(filter(df_FFT, frequency < 49 & frequency > 0), c(frequency, power))
with(df_FFT_2,plot(frequency,power, type = "l"))
max_power <- max(df_FFT_2$power)
max_freq <- df_FFT_2$frequency[which(df_FFT_2$power == max_power)]
print(c(max_power, max_freq))
# Calculate a threshold and then create a filtered data.frame
threshold <- 0.45 * max_power
df_filt <- df_FFT_2[which(df_FFT_2$power > threshold), ]
df_filt <- df_filt[order(-df_filt$power), ]
if(nrow(df_filt) > 5){
df_filt <- df_filt[1:5, ]
}
freq_FFT <- df_filt$frequency
n = length(freq_FFT)
This code works fine. With different input datasets, I get 1 (min), 2, 3, 4, or 5 (max) peaks.
However, this is a small part of the full prototype I am working with. I have found that when I get only one peak, the final results are not very good. I would like to select at least 2 peaks, even if the second peak does not satisfy the threshold condition.
I would like to put a condition
if (length(freq_FFT) < 2){
disregard the threshold condition and find the two highest peaks.
}
I haven't been able to modify the code accordingly.
Consider sorting the main data beforehand, then add a conditional block that stacks (using rbind) the filtered data with the top two rows of the main dataset. And because the single filtered row may itself be the leading power, run unique to de-dupe the data. Comments in the code below mark the adjustments:
...
threshold <- 0.45 * max_power
df_FFT_2 <- df_FFT_2[order(-df_FFT_2$power), ] # ORDER BY DESCENDING POWER
df_filt <- df_FFT_2[which(df_FFT_2$power > threshold), ]
df_filt <- df_filt[order(-df_filt$power), ]
if(nrow(df_filt) > 5) {
df_filt <- head(df_filt, 5) # CHANGED TO head()
}
# NEW IF LOGIC BLOCK
if(nrow(df_filt) < 2) {
df_filt <- rbind(df_filt,
head(df_FFT_2, 2)) # TOP TWO ROWS
df_filt <- unique(df_filt)
}
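To see the new block in action, here is a quick sanity check with made-up values (hypothetical data, just to exercise the logic):
# fake spectrum with one dominant peak, so the threshold keeps only one row
df_FFT_2 <- data.frame(frequency = 1:48, power = c(100, rep(10, 47)))
max_power <- max(df_FFT_2$power)
threshold <- 0.45 * max_power
df_FFT_2 <- df_FFT_2[order(-df_FFT_2$power), ]
df_filt <- df_FFT_2[which(df_FFT_2$power > threshold), ]
if(nrow(df_filt) < 2) {
df_filt <- unique(rbind(df_filt, head(df_FFT_2, 2)))
}
nrow(df_filt) # 2 -- the runner-up peak is kept despite missing the threshold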
This data is extracted from a hospital dataset; each row is an entry for a patient at a different time, and there should be multiple observations per patient, since they come to the hospital several times. As extracted here, there are 8 observations for the patient with id 12c55eb78ef60, and 11 observations for the patient with id 12e8597e4ec2a.
I have done the cleaning process to extract those whose first entry has ckd0 == 1, which means they are at stage 0.
What I want to do is to check whether the stage of disease for the patient changes over his or her several visits to the hospital.
Here, if in the last observation of a patient the "ckd0" attribute is 0, it means he progressed to a higher stage, and should be labelled as 1.
If in the last observation of a patient the "ckd0" attribute is 1, it means he is still at stage 0, and should be labelled as 0.
I want to add such a label (a new column in the original dataframe) to each observation of the patient. I guess it should be accomplished by a loop, but my code doesn't work. :(
x=unique(personal_id)
label = vector(mode = "numeric",length = 0)
for (i in 1:length(x)){
temp=subset(mydata_1[which(mydata_1$personal_id==x[i]),])
k=numeric(length(temp))
b=order(temp$report_time)[length(temp$report_time)]
if (temp$ckd0[b] == 0){
k[]=1
}
label=c(label,k)
}
nonckd <- cbind(mydata_1,label)
I tried to modify your code as little as possible. The original problem was a little different from what I encountered. When I ran your code, it said that the number of rows didn't match up. This was caused by creating the vector k as such: k <- numeric(length(temp)). In R, the length of a data.frame is how many columns it has, not how many rows. Replacing this with nrow(temp) produces the desired output. Aside from this, I added some comments for clarification and cleaned the code up a little. Please let me know of any issues with the code below.
# sample data.frame as described
mydata_1 <- data.frame(
personal_id = rep(c(1:10), 2),
report_time = 1:20,
ckd0 = sample(1:100 %% 2, 20)
)
# variables
x <- unique(mydata_1$personal_id)
label <- NULL
for (i in 1:length(x)){
# extract a subset of the data.frame
temp <- mydata_1[which(mydata_1$personal_id == x[i]), ]
# create a vector initialized with 0s
# and get last report index
k <- numeric(nrow(temp))
b <- order(temp$report_time)[length(temp$report_time)]
# change 0s to 1s
if (temp$ckd0[b] == 0) k[] <- 1
label <- c(label, k)
}
nonckd <- cbind(mydata_1, label)
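For what it's worth, the same labeling can be done without an explicit loop, e.g. with base R's ave() (a sketch using the same mydata_1):
# for each patient, find ckd0 in the last report (by report_time)
# and broadcast that value to all of the patient's rows
last_ckd0 <- ave(seq_len(nrow(mydata_1)), mydata_1$personal_id,
                 FUN = function(i) rep(mydata_1$ckd0[i[which.max(mydata_1$report_time[i])]], length(i)))
mydata_1$label <- as.numeric(last_ckd0 == 0) # 1 if the last ckd0 was 0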
Using R, I want to randomly split a data frame into three smaller data frames. The first one has 80% of the total observations. The second and the third have, respectively, 15% and 5% of the total observations. The three data frames cannot have any overlaps. Do you have any suggestions?
Here is a quick function to split into an arbitrary number of groups, depending on how many values you specify in the 'props' parameter. It should be fairly self-explanatory.
#' Splits data.frame into arbitrary number of groups
#'
#' @param dat The data.frame to split into groups
#' @param props Numeric vector. What proportion of the data should
#' go in each group?
#' @param which.adjust Numeric. Which group size should we 'fudge' to
#' make sure that we sample enough (or not too much)
split_data <- function(dat, props = c(.8, .15, .05), which.adjust = 1){
# Make sure proportions are positive
# and the adjustment group isn't larger than the number
# of groups specified
stopifnot(all(props >= 0), which.adjust <= length(props))
# could check to see if the sum is 1
# but this is easier
props <- props/sum(props)
n <- nrow(dat)
# How large should each group be?
ns <- round(n * props)
# The previous step might give something that
# gives sum(ns) > n so let's force the group
# specified in which.adjust to be a value that
# makes it so that sum(ns) = n
ns[which.adjust] <- n - sum(ns[-which.adjust])
ids <- rep(1:length(props), ns)
# Shuffle ids so that the groups are randomized
which.group <- sample(ids)
split(dat, which.group)
}
split_data(mtcars)
split_data(mtcars, c(.7, .3))
By manipulating the parts vector, you should be able to generate as many unique sets as you want -
totrows <- nrow(dat)
rownos <- seq(totrows)
parts <- c(0.8,0.15,0.05)
sets <- vector(mode = "list", length = length(parts))
for (i in seq_along(parts))
{
# calculating random % row numbers, % specified by parts[i]
# (round() keeps the sample size an integer when parts[i]*totrows isn't whole)
sets[[i]] <- sample(x = rownos, size = round(parts[i] * totrows))
# removing used row nos
rownos <- setdiff(rownos, sets[[i]])
}
If you want overlapping sets, you can remove the setdiff statement in the loop.
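And to turn those index sets into the actual data frames (using dat and sets from above):
# materialize each set of row numbers as its own data frame
set_dfs <- lapply(sets, function(idx) dat[idx, ])
sapply(set_dfs, nrow) # roughly 80% / 15% / 5% of nrow(dat)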