I have an R data.frame with fast Fourier transform results. The columns are power and frequency. For the purposes of my study, I have selected a portion of the full FFT results (frequencies between 1 and 48).
I need to identify at most 5 peaks (the highest powers) inside that frequency range, with the additional condition that any peak below 45% of the highest peak is not considered. I will share the code below:
library(dplyr)  # for select() and filter()

df_FFT_2 <- select(filter(df_FFT, frequency < 49 & frequency > 0), c(frequency, power))
with(df_FFT_2, plot(frequency, power, type = "l"))

max_power <- max(df_FFT_2$power)
max_freq <- df_FFT_2$frequency[which(df_FFT_2$power == max_power)]
print(c(max_power, max_freq))

# Calculate a threshold and then create a filtered data.frame
threshold <- 0.45 * max_power
df_filt <- df_FFT_2[which(df_FFT_2$power > threshold), ]
df_filt <- df_filt[order(-df_filt$power), ]
if (nrow(df_filt) > 5) {
  df_filt <- df_filt[1:5, ]
}
freq_FFT <- df_filt$frequency
n <- length(freq_FFT)
This code works fine. Depending on the input dataset, I get anywhere from 1 (minimum) to 5 (maximum) peaks.
However, this is a small part of the larger prototype I am working on. I have found that when I get only one peak, the final results are not very good. I would like to select at least 2 peaks, even if the second peak does not satisfy the threshold condition.
I would like to add a condition along the lines of:
if (length(freq_FFT) < 2){
disregard the threshold condition and find the two highest peaks.
}
I haven't been able to modify the code accordingly.
Consider sorting the main data beforehand, then add a conditional block that stacks (using rbind) the filtered data with the top two rows of the main dataset. Because the single surviving row may itself be the highest peak, run unique() afterwards to de-duplicate. The code comments below flag the adjustments:
...
threshold <- 0.45 * max_power
df_FFT_2 <- df_FFT_2[order(-df_FFT_2$power), ] # ORDER BY DESCENDING POWER
df_filt <- df_FFT_2[which(df_FFT_2$power > threshold), ]
df_filt <- df_filt[order(-df_filt$power), ]
if (nrow(df_filt) > 5) {
  df_filt <- head(df_filt, 5)  # CHANGED TO head()
}

# NEW IF LOGIC BLOCK: if the threshold left fewer than two peaks,
# fall back to the two highest powers overall
if (nrow(df_filt) < 2) {
  df_filt <- rbind(df_filt,
                   head(df_FFT_2, 2))  # TOP TWO ROWS OF THE SORTED DATA
  df_filt <- unique(df_filt)           # de-dupe in case the top row was already kept
}
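A quick way to sanity-check the fallback path is a tiny invented data frame (the frequencies and powers below are made up; only the column names match the question):
df_FFT_2 <- data.frame(frequency = c(5, 12, 23, 31, 44),
                       power     = c(2, 40, 3, 9, 1))
max_power <- max(df_FFT_2$power)
threshold <- 0.45 * max_power                    # 18, so only the peak with power 40 passes
df_FFT_2  <- df_FFT_2[order(-df_FFT_2$power), ]  # order by descending power
df_filt <- df_FFT_2[df_FFT_2$power > threshold, ]
if (nrow(df_filt) > 5) df_filt <- head(df_filt, 5)
if (nrow(df_filt) < 2) {
  df_filt <- unique(rbind(df_filt, head(df_FFT_2, 2)))
}
df_filt$frequency  # 12 and 31: the threshold peak plus the next-highest one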
Related
I am trying to write a function that transforms a data frame by high-pass filtering each entry of a row against a quantile of that row's values. The function is written for single-cell RNA-sequencing data (genes as rows, cells as columns), but in principle it works on anything. It transposes the result at the end because that makes some downstream code cleaner.
topquantile.binarize <- function(scRNAseq_data, percentile){
  # takes in data that is gene (rows) by cell (columns)
  # returns a data frame of cell by gene
  # calculates a quantile for each gene;
  # if a gene's value in a cell is above that quantile
  # (e.g. the 90th percentile), the gene is accepted (1), otherwise 0
  for (i in c(1:dim(scRNAseq_data)[1])){
    filter_value <- quantile(scRNAseq_data[i, ], percentile)
    filter_value <- as.numeric(filter_value)
    high_pass <- function(x) {
      if (x > filter_value) {
        x <- 1
      } else {
        x <- 0
      }
      return(x)
    }
    scRNAseq_data[i, ] <- apply(scRNAseq_data[i, ], 2, high_pass)
  }
  return(t(scRNAseq_data))
}
EXAMPLE DATA
library(tictoc)
tic()
set.seed(42)
scRNAseq_data <- data.frame(matrix(rnorm(1000*100, mean=0, sd=1), 1000, 100))
res <- topquantile.binarize(scRNAseq_data, 0.9)
toc()
You will notice that even at 100 columns by 1000 rows it runs pretty slowly; using tictoc you'll see it takes around 4 seconds, possibly a little more.
I realize that technically the function does more than just look for values in the top quantile but whatever.
Use matrixStats::rowQuantiles and exploit the vectorization of the R language. Runs in the blink of an eye.
res1 <- t(+(scRNAseq_data > matrixStats::rowQuantiles(as.matrix(scRNAseq_data), probs=.9)))
stopifnot(all.equal(res, res1))
MatrixGenerics::rowQuantiles from Bioconductor might also work.
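If installing matrixStats is not an option, the same idea can be sketched in base R (this reuses scRNAseq_data and res1 from above and relies on R recycling the per-row cutoffs down the columns); it is slower than rowQuantiles on big matrices but still much faster than the original row-by-row loop:
m <- as.matrix(scRNAseq_data)
row_q <- apply(m, 1, quantile, probs = 0.9)  # one 90th-percentile cutoff per gene (row)
res2 <- t(+(m > row_q))                      # row_q recycles down the columns, i.e. per row
stopifnot(all.equal(res1, res2))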
I have two data frames, one with 24 rows x 2 columns and another with 258 rows x 2 columns. The columns are similar; I am interested in one column and want to find the values in the two data frames that are approximately close to each other.
I am trying to simulate a spectrum and compare it with an experiment.
df_exp <- data.frame("Name"=c(exp,Int), "exp" = c(x1, x2, x3, ...,x258),"int"= c(y1,y2,y3,...,y258))
df_sim <- data.frame("Name"=c(sim,Int), "sim" = c(x1, x2, x3, ...,x24),"int" = c(y1,y2,y3,...,y24))
Initial values (exp column from df_exp and sim column from df_sim):
exp sim
206.0396 182.0812
207.1782 229.1183
229.0776 246.1448
232.1367 302.1135
241.1050 319.1401
246.1691 357.1769
250.0235 374.2034
... ...
I tried this r code
match(df_exp$exp[1:258], df_sim$sim[1:24], nomatch = 0)
This code gives me all zero values because there is no exact match; the numbers always differ in the decimal places. I tried rounding the numbers to zero decimal places and finding values that are close, but that is not my intent. I want to find df_exp (229.0776, 246.1691, ...) and df_sim (229.1183, 246.1448, ...) and make a new data frame with all those approximately close values. Can you please suggest something?
You can define a similarity cutoff and loop over both vectors:
### define your cutoff for similarity
cutoff <- 0.01
### initialize vectors to store the similar values
similar_sim <- vector(); similar_exp <- vector();
### open loop over both DF values
for (sim_value in df_sim$sim) {
  for (exp_value in df_exp$exp) {
    ### if similar (difference below cutoff) append values to vectors
    if (abs(sim_value - exp_value) < cutoff) {
      similar_sim <- append(similar_sim, sim_value)
      similar_exp <- append(similar_exp, exp_value)
    }
  }
}
### recreate a DF with the similar values
similar_df <- as.data.frame(cbind(similar_sim, similar_exp))
That is, if you want to save every value of one vector that is similar to a value of the other, as it sounds like you do. Otherwise you can skip the inner loop and use range selection against the whole vector for each sim_value, e.g.:
df_exp$exp[ df_exp$exp < sim_value + cutoff & df_exp$exp > sim_value - cutoff ]
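For vectors of this size (24 x 258 pairs) you can also vectorise the whole comparison with outer(), which computes every pairwise difference at once; a sketch reusing df_exp, df_sim and cutoff from above:
d <- abs(outer(df_sim$sim, df_exp$exp, "-"))  # 24 x 258 matrix of |sim - exp|
hits <- which(d < cutoff, arr.ind = TRUE)     # row = index into df_sim, col = index into df_exp
similar_df <- data.frame(similar_sim = df_sim$sim[hits[, "row"]],
                         similar_exp = df_exp$exp[hits[, "col"]])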
I currently have this for-loop, which I want to vectorize. It calculates the proportion of 6s in different subvectors of a, starting with a[1:100], then a[1:200], and so on in steps of 100.
rolls.max <- 100000
a <- sample(1:6, size=rolls.max, replace=TRUE)
sixes.ratio <- c()
for (i in 1:(rolls.max/100)) {
  sixes.count <- table(a[1:(i*100)])[6]
  ratio <- sixes.count/(i*100)
  sixes.ratio <- c(sixes.ratio, ratio)
}
I think the most difficult part is to get the count of 6's from a for each subvector. I tried this:
rolls.max <- 100000
a <- matrix(sample(1:6, size=rolls.max, replace=TRUE))
subset.creator <- function(x, c) if (c!=0 && c%%100==0) { as.vector(table(x[1:(rolls[c/100])]))[6] }
sixes.count <- mapply(subset.creator, a, col(a))
# Converting the other lines won't be difficult I think
What I wanted to achieve with this is to create a subvector of a on every 100th call of the function subset.creator, then build a table, take the element for 6 to get the count of 6s, and extract only the count using as.vector().
But this just gives me rubbish instead of a vector with counts of 6s.
If you want to create a "rolling tally" at every hundredth chunk of your simulated rolls, one way to solve the problem is to create a vector of "stops" that represents your cutoff points, then use sapply to perform the calculation (in this case, counting up the 6s) at each stop:
rolls.max <- 100000
a <- sample(1:6, size=rolls.max, replace=TRUE)
# a vector of "stops" at every hundredth entry of 'a'
stops <- seq(0, rolls.max, 100)[-1]
# counts of 6s from the first entry in 'a' to the values in 'stops'
count.6 <- sapply(stops, function(x) sum(a[1:x] == 6))
# ...or just as easily, the rolling proportion of 6s
prop.6 <- sapply(stops, function(x) mean(a[1:x] == 6))
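Since the stops are nested prefixes of the same vector, you can also avoid the repeated scans inside sapply() with a single cumulative sum (a sketch reusing a and stops from above):
# one pass over 'a': cumulative count of 6s, read off at every hundredth position
count.6.cum <- cumsum(a == 6)[stops]
prop.6.cum  <- count.6.cum / stops
stopifnot(all.equal(count.6.cum, count.6), all.equal(prop.6.cum, prop.6))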
Example Data
df <- data.frame(id=rep(LETTERS, each=10)[1:50], fruit=sample(c("apple", "orange", "banana"), 50, TRUE))
Problem
Pick a random start point within each id and, from that point, select that row and subsequent, sequential rows totaling 1% of the rows within that ID. Then do it again for 2% of each ID's rows, then 3%, and so on up to 99% of the rows per ID. Also, do not select a random starting point that is closer to the end of the ID's rows than the percentage desired to be sampled (i.e., don't try to sample 20% of sequential rows from a point that's 10% from the end of an ID's rows).
Desired Result
The same shape as dfcombine from the first code chunk below, except that instead of randomly selected fruit rows within each id, only the start point is random, with the subsequent rows needed for the sample following that start-point row sequentially.
What I've Tried
I can pull part of this problem off with the following code -- but it selects all rows at random, and I need the sampled chunks to be sequential after the random start point (FYI: if you run this, you'll see your chunks start at 6% because this is a small dataset -- samples below 6% of an ID's rows return no rows):
library(tidyverse)
set.seed(123)       # pick same sample each time
dflist <- list()    # make an empty list
for (i in 1:100) {  # "do i a hundred times"
  i.2 <- i/100      # i.2 is i/100
  dflooped <- df %>%                # new df
    group_by(id) %>%                # group by id
    sample_frac(i.2, replace=TRUE)  # for each i.2, take a random sample
  dflooped
  dflist[[i]] <- dflooped
}
dflist # check

library(data.table)
dfcombine <- rbindlist(dflist, idcol = "id%") # put the list elements in a df
I can also pick the sequentially larger chunks I'm looking for with this -- but it doesn't allow me the random start (it always goes from the beginning of the df):
lapply(seq(.01,.1,.01), function(i) df[1:(nrow(df)*i),])
and using dplyr group_by spits an error I don't understand:
df2 <- df %>%
  group_by(id) %>%
  lapply(seq(.01,1,.01), function(i) df[1:(nrow(df)*i),])
Error in match.fun(FUN) :
'seq(0.01, 1, 0.01)' is not a function, character or symbol
So I may have some of the pieces, but am having trouble putting them together -- the solution may or may not include what I've done above. Thanks.
Sequential sampling within ID
Create fake data
df <- data.frame(id=rep(LETTERS, each=10)[1:50], fruit=sample(c("apple", "orange", "banana"), 50, TRUE), stringsAsFactors = F)
Add a more unique data element to the test data, to verify the sampling:
df$random_numb <- round(runif(nrow(df), 1, 100), 2)
Here we'll define a function to do what you want:
I question the statistical impact of only starting your random sample from a spot where you won't "run out" of observations within the ID category. Would it not be better to loop back to the top of the records within each ID if you run out? That would give a uniform chance of beginning your sample anywhere within a specific ID, rather than limiting yourself to the first 80% of the data when you want a 20% sample. Just a thought! I built it as you asked, though!
random_start_seq_sample <- function(p_df, p_idname, p_idvalue, p_sampleperc) {
  # subset the data frame for the ID we're currently interested in
  p_df <- p_df[p_df[, p_idname] == p_idvalue, ]
  # calculate the number of rows needed to sample _% of the data within this ID
  nrows_to_sample <- floor(p_sampleperc * nrow(p_df))
  # pick a single random start point somewhere between 1 and
  # (number of rows - (number of rows to sample + 1)) -- the plus 1
  # adds a cushion and avoids indexing past the end
  start_samp_indx <- as.integer(runif(1, 1, (nrow(p_df) - (nrows_to_sample + 1))))
  # take nrows_to_sample sequential rows starting at the random start point and return them
  all_samp_indx <- start_samp_indx:(start_samp_indx + (nrows_to_sample - 1))
  return(p_df[all_samp_indx, ])
}
Test function for a single function call
Test out the function with a single sample at a given percentage (10% here). This is also a good way to rerun the same call several times and confirm that the starting location really is randomized.
# single test: give me 10% of the rows with 'A' in the 'id' field:
random_start_seq_sample(df, 'id', 'A', 0.1)
Now place function in for loop
Set aside a unique list of all potential values within the id field. Also set aside a vector of sample sizes in percent format (between 0 and 1).
# capture all possible values in id field
possible_ids <- unique(df$id)
# these values need to be between 0 and 1 (10% == 0.1)
sampleperc_sequence <- (1:length(possible_ids) / 10)
# initialize list:
combined_list <- list()
for (i in 1:length(possible_ids)) {
  print(paste0("Now sampling ", sampleperc_sequence[i], " from ", possible_ids[i]))
  combined_list[[i]] <- random_start_seq_sample(df, 'id', possible_ids[i], sampleperc_sequence[i])
}
Process the results
# process results of for loop
combined_list
# number of rows in each df in our list
sapply(combined_list, nrow)
This is the resulting dataset of all combinations of samples
# cross reference the numeric field with the original data frame to make sure we had random starting points
dfcombined <- do.call(rbind, combined_list)
EDIT:
I'll leave what I initially wrote up there, but in retrospect, I think this is actually a bit closer to what you are asking for.
This solution uses the same type of function, but I used nested for loops to achieve what you were asking for.
For each ID, it will:
subset dataframe for this ID value
find random starting point
sample n% of the data (starting with 1%)
repeat with +1% to n (up to 99%)
Code:
df <- data.frame(id=rep(LETTERS, each=10)[1:50], fruit=sample(c("apple", "orange", "banana"), 50, TRUE), stringsAsFactors = F)
# adding a more unique data element to test data for testing sampling
df$random_numb <- round(runif(nrow(df), 1, 100), 2)
# function to do what you want:
random_start_seq_sample <- function(p_df, p_idname, p_idvalue, p_sampleperc) {
  # subset the data frame for the ID we're currently interested in
  p_df <- p_df[p_df[, p_idname] == p_idvalue, ]
  # calculate the number of rows needed to sample _% of the data within this ID
  nrows_to_sample <- floor(p_sampleperc * nrow(p_df))
  # don't let us use zero as an index
  if (nrows_to_sample < 1) {
    nrows_to_sample <- 1
  }
  # pick a single random start point somewhere between 1 and
  # (number of rows - number of rows to sample), so the sequential
  # block never runs past the end of this ID's rows
  start_samp_indx <- as.integer(runif(1, 1, (nrow(p_df) - nrows_to_sample)))
  # take nrows_to_sample sequential rows starting at the random start point and return them
  all_samp_indx <- start_samp_indx:(start_samp_indx + (nrows_to_sample - 1))
  return(p_df[all_samp_indx, ])
}
# single test: give me 10% of the rows with 'A' in the 'id' field:
random_start_seq_sample(df, 'id', 'A', 0.1)
# now put this bad boy in a for loop -- put these in order of what IDs match what sequence
possible_ids <- unique(df$id)
# these values need to be between 0 and 1 (10% == 0.1)
sampleperc_sequence <- (1:99 / 100)
# build the full grid of (id, percentage) combinations
# (not strictly needed below -- the nested loops cover the same combinations)
ids_sample <- expand.grid(possible_ids, sampleperc_sequence)
# initialize list:
combined_list <- list()
counter <- 1
for (i in 1:length(possible_ids)) {
  for (j in 1:length(sampleperc_sequence)) {
    print(paste0("Now sampling ", (sampleperc_sequence[j] * 100), "% from ", possible_ids[i]))
    combined_list[[counter]] <- random_start_seq_sample(df, 'id', possible_ids[i], sampleperc_sequence[j])
    # manually keep track of the counter
    counter <- counter + 1
  }
}
# spot-check a single call, e.g. a 91% sample for the first ID
random_start_seq_sample(df, 'id', possible_ids[1], sampleperc_sequence[91])
# process results of for loop
combined_list
# check size of first list element
combined_list[[1]]   # A, 1% sample is bumped up to 1 record
# check the thirtieth element
combined_list[[30]]  # A, 30% sample is 3 records
# check size of the sixtieth list element
combined_list[[60]]  # A, 60% sample is 6 records
sapply(combined_list, nrow) # number of rows in each df in our list
# cross reference the numeric field with the original data frame to make sure we had random starting points
dfcombined <- do.call(rbind, combined_list)
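If you also want each block of dfcombined labelled with the sample percentage it came from (similar to the idcol used with rbindlist in the question), one way is to attach a column before binding; a sketch that assumes the nested-loop order above (IDs in the outer loop, percentages in the inner loop):
# one percentage label per list element, in the same order the loops filled the list
pct_labels <- rep(sampleperc_sequence, times = length(possible_ids))
dfcombined <- do.call(rbind, Map(function(d, p) transform(d, sample_pct = p),
                                 combined_list, pct_labels))
head(dfcombined)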
I am generating a data vector to sample from, using sample() without replacement.
If the dataset I am generating from is large enough, the vector exceeds R's memory limits.
How can I represent these data in such a way that I can still sample without replacement but can also handle huge datasets?
Generating the vector of counts:
counts <- vector()
for (i in 1:1024) {
  counts <- c(counts, rep(i, times=data[i,]$readCount))
}
Sampling:
trial_fn <- function(counts) {
  replicate(num_trials, sample(counts, size=trial_size, replace=F), simplify=F)
}
trials <- trial_fn(counts)
Error: cannot allocate vector of size 32.0 Mb
Is there a more sparse or compressed way I can represent this and still be able to sample without replacement?
If I understand correctly, your data has 1024 rows with different readCount values.
The vector you build repeats the first value once, the second value twice, and so on (with the example data below, where readCount = 1:1024).
Then you want to sample from this vector without replacement. So basically, you're drawing the first value with probability 1 / sum(1:1024), the second value with probability 2 / sum(1:1024), and so on, and each time you extract a value, one copy of it is removed from the set.
Of course the fastest and easiest approach is yours, but you can also do it with much less memory at a significant cost in speed. This can be done by passing extraction probabilities to the sample function, extracting one value at a time and manually "removing" the extracted value.
Here's an example:
# an example of your data
data <- data.frame(readCount=1:1024)
# custom function to sample
mySample <- function(values, size, nElementsPerValue){
  nElementsPerValue <- as.integer(nElementsPerValue)
  if (sum(nElementsPerValue) < size)
    stop("Total number of elements per value is lower than the sample size")
  if (length(values) != length(nElementsPerValue))
    stop("nElementsPerValue must have the same length as values")
  if (any(nElementsPerValue < 0))
    stop("nElementsPerValue cannot contain negative numbers")
  # remove values having zero elements inside
  # (subset both vectors with the same index, computed once)
  keep <- which(nElementsPerValue > 0)
  values <- values[keep]
  nElementsPerValue <- nElementsPerValue[keep]
  # pre-allocate the result vector
  res <- rep.int(0.0, size)
  for (i in 1:size) {
    idx <- sample(1:length(values), size=1, replace=F, prob=nElementsPerValue)
    res[i] <- values[idx]
    # remove the sampled element from nElementsPerValue
    nElementsPerValue[idx] <- nElementsPerValue[idx] - 1
    # if no elements remain, remove the value as well
    if (nElementsPerValue[idx] == 0) {
      values <- values[-idx]
      nElementsPerValue <- nElementsPerValue[-idx]
    }
  }
  return(res)
}
# just for reproducibility
set.seed(123)
# sample 100k values from readCount
system.time(
  a <- mySample(data$readCount, 100000, 1:1024),
  gcFirst=T)
# on my machine it gives:
#    user  system elapsed
#   10.63    0.00   10.67
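If you need more speed while still avoiding the huge expanded vector, another option (a sketch, reusing the data data.frame from the example above) is to sample element positions without replacement with sample.int() and then map each position back to its row with findInterval() over the cumulative readCount:
# cumulative block sizes: positions 1..cum[1] belong to row 1, cum[1]+1..cum[2] to row 2, etc.
cum_counts <- cumsum(data$readCount)
total <- cum_counts[length(cum_counts)]  # total number of elements (sum of all readCount)
# draw 100k distinct positions, then convert each position to the row it falls in
positions <- sample.int(total, size = 100000)
b <- findInterval(positions - 1, cum_counts) + 1
head(table(b))  # sampled row indices, weighted by readCount, without replacement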