I am working with the R programming language.
I am trying to find the position at which a certain pattern (e.g. ABCD) first appears in a random string (e.g. in ACABCDCDBCABCDBC the first match ends at position 6, so the answer is 6). I wrote the following loop to do this:
library(stringr)
letters <- c("A", "B", "C", "D")
results <- list()
for (i in 1:100)
{
iteration_i = i
letters_i = paste(sample(letters, 100, replace=TRUE, prob=c(0.25, 0.25, 0.25, 0.25)),collapse="")
position_i = str_locate(letters_i, "ADBC")
results_tmp = data.frame(iteration_i , letters_i, position_i)
results[[i]] <- results_tmp
}
results_df <- do.call(rbind.data.frame, results)
The result currently looks something like this (note: I don't think this is correct - in row 5, I see ABCD at the beginning of the row, but it's being recorded as NA for some reason):
iteration_i letters_i start end
1 1 BACDCCCDCCCDCDDBBCBBAACACBBBBAAABDDDACAABDDABBABADCDDCDACCBBBCABCDABCDCCCDADDDBADBDCADAABDBDCDCAACCB NA NA
2 2 CACACCCCDCCBADACBBAADBCABBAAAAADBDDBCADCAAADADAAABDCABBAABABBCBDADCDDDDCDBADDBDCBCDDDBDCDDAACBBBBACA 20 23
3 3 CDCBDAABDDDDADBAAABBADAADBDDDBDADDCABADDDCDABBBCBCBBACBBDADABBCDCCACDBCDCDDBDBADBCDCADDADDDBDBAAABBD 79 82
4 4 ADBCDBADADBAAACAADACACACACBDDCACBDACCBDAAABDBAAAABBCCDBADADDADCBCABCBAABDCBCDCDACDCCDBADCBDDAADBCDAC 1 4
5 5 D**ABCD**DDCCBCDABADBBBBCDBCADCBBBDCAAACACCCBCBCADBDDABBACACBDABAAACCAAAAACCCCBCBCCABABDDADBABDDDCCDDCCC NA NA
6 6 DDDDDBDDDDBDDDABDDADAADCABCDAABBCCCDAABDDAACBDABBBBBABBCBDADBDCCAAADACCBCDDBDCAADCBBBCACDBBADDDDCABC NA NA
Currently, I am only generating 100 letters and hoping that this is enough to observe the desired pattern (sometimes this doesn't happen, notice the NA's) - is there a way to add a WHILE LOOP to what I have written to keep generating letters until the desired pattern first appears?
Can someone please show me how to do this?
Thanks!
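The loop below is a repeat loop rather than a while loop; it only breaks once the pattern is found. I have set the length of the results list to 2, since there is no point in making it bigger just to test the code. Note that the code in the question searches for "ADBC", not "ABCD", which is why the ABCD highlighted in row 5 is reported as NA.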
library(stringr)
Letters <- c("A", "B", "C", "D")
Pattern <- "ADBC"
n <- 2L
set.seed(2022)
results <- vector("list", length = n)
for (i in seq.int(n)) {
repeat {
l <- sample(Letters, 100, replace = TRUE, prob=c(0.25, 0.25, 0.25, 0.25))
letters_i <- paste(l, collapse = "")
position_i <- str_locate(letters_i, pattern = Pattern)
if(any(!is.na(position_i))) break
}
results_tmp <- data.frame(iteration = i, letters = letters_i, position_i)
results[[i]] <- results_tmp
}
results_df <- do.call(rbind.data.frame, results)
results_df
#> iteration letters start end
#> 1 1 ADBDBDBBCABBBDDBADDAADCBBADACACDCCBBADAADCDDABADCABCDCDDCCCBDDAABACCBDAAAADBDDCCCCADBCBBDABBDCCCBADD 83 86
#> 2 2 DDBDBDBCDDBDBBBDBABBCCBBCCBDBDABBAAABACABADCCBBABADBCCCDABABBDBADCADCABDDDAAACCBDCAACACACBBDDDACCDDC 50 53
Created on 2022-06-11 by the reprex package (v2.0.1)
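The answer above redraws a full 100-letter string until it happens to contain the pattern. If the goal is instead to keep appending letters until the pattern first appears, a minimal sketch could look like the following. The helper name `first_hit` and its arguments are assumptions made for illustration and are not part of the original answer.

```r
# Sketch: draw one letter at a time and stop as soon as the last
# nchar(pattern) letters spell the pattern.
# `first_hit` is a hypothetical helper, not part of the answer above.
first_hit <- function(pattern = "ADBC", alphabet = c("A", "B", "C", "D")) {
  k <- nchar(pattern)
  drawn <- character(0)
  repeat {
    drawn <- c(drawn, sample(alphabet, 1))  # append one uniformly drawn letter
    n <- length(drawn)
    if (n >= k && paste(drawn[(n - k + 1):n], collapse = "") == pattern) break
  }
  data.frame(letters = paste(drawn, collapse = ""),
             start = n - k + 1, end = n,
             stringsAsFactors = FALSE)
}

set.seed(2022)
res <- first_hit()
res$end  # number of letters drawn until the pattern first completed
```

Because the loop stops at the first completed match, `end` always equals the length of the generated string, so no string ever comes back with NA positions.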
I have the following function:
library(dplyr)
var_1 <- rnorm(100, 10, 10)
var_2 <- rnorm(100, 1, 10)
var_3 <- rnorm(100, 5, 10)
response <- rnorm(100, 1, 1)
my_data <- data.frame(var_1, var_2, var_3, response)
my_data$id <- 1:100
simulate <- function() {
results <- list()
results2 <- list()
for (i in 1:100) {
iteration_i <- i
sample_i <- my_data[sample(nrow(my_data), 10), ]
results_tmp <- data.frame(iteration_i, sample_i)
results[[i]] <- results_tmp
}
results_df <- do.call(rbind.data.frame, results)
test_1 <- data.frame(results_df %>%
group_by(id) %>%
filter(iteration_i == min(iteration_i)) %>%
distinct)
summary_file <- data.frame(test_1 %>%
group_by(iteration_i) %>%
summarise(Count=n()))
cumulative <- cumsum(summary_file$Count)
summary_file$Cumulative <- cumulative
summary_file$unobserved <- 100 - cumulative
return(summary_file)
}
When I call this function, I get the following output:
> head(simulate())
iteration_i Count Cumulative unobserved
1 1 10 10 90
2 2 7 17 83
3 3 10 27 73
4 4 5 32 68
5 5 7 39 61
6 6 8 47 53
I want to try to run this function 10 times and append all the results into a single file. I tried to do this using the "replicate()" function - but this is not working:
# Method 1 : Did not work
n_replicates = 10
iterations_required <- replicate(n_replicates, {
simulate()
})
# Method 2: Did not work
lapply(seq_len(10), simulate(1))
# Method 3: Did Not Work
library(purrr)
rerun(10, simulate(1))
# Method 4: Did Not Work
lapply(seq_len(10), simulate)
Ideally, I would like to get something like this:
# works fine!
results <- list()
for (i in 1:10) {
game_i <- i
s_i <- simulate()
results_tmp <- data.frame(game_i, s_i)
results[[i]] <- results_tmp
}
final_file <- do.call(rbind.data.frame, results)
My Question: Is there a reason that "Method 1, Method 2, Method 3, Method 4" were not working - could someone please show me how to fix this?
# Method 1 :
n_replicates = 10
iterations_required <- replicate(n_replicates, {
simulate()
}, simplify=FALSE)
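# Method 2: wrap the call in an anonymous function; simulate() itself takes no arguments
iterations_required <- lapply(seq_len(10), function(x) simulate())
# Method 4: only works once simulate() accepts an argument, e.g. simulate <- function(i) {...}
iterations_required <- lapply(seq_len(10), simulate)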
# to merge into one data.frame
as.data.frame(data.table::rbindlist(iterations_required, idcol=TRUE))
Alternatively, you could modify your function to simulate(i), where i becomes the first column of the output (the iteration index), and then use do.call(rbind.data.frame, lapply(seq_len(n_replicates), simulate)).
By default, replicate() tries to simplify the result into a matrix or array, so the trick is simply not to simplify.
n_replicates<- 10
iterations_required <- replicate(n_replicates, simulate(), simplify=FALSE)
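To see what the default simplification does, here is a small sketch with a toy stand-in for simulate(). The function `toy_sim` and the column names are hypothetical and chosen only for illustration.

```r
# Toy stand-in for simulate(): returns a small data frame (hypothetical).
toy_sim <- function() data.frame(step = 1:3, value = c(10, 20, 30))

simplified <- replicate(2, toy_sim())                    # default: simplify = TRUE
kept       <- replicate(2, toy_sim(), simplify = FALSE)  # plain list of data frames

is.matrix(simplified)     # TRUE: a matrix of list cells, no longer data frames
is.data.frame(kept[[1]])  # TRUE

# Bind the runs and record which run each row came from.
final <- do.call(rbind, Map(function(run, df) cbind(game_i = run, df),
                            seq_along(kept), kept))
```

With simplify = FALSE each run stays a data frame, so the runs can be row-bound in the same way as in the for loop from the question.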
I have 100,000 individuals
Using a combination of upper case letters, lower case letters and numbers, I want to create
a five-character ID for each individual. I should not have any duplicates.
How can I do this? I have tried the code below but I have 4 duplicates.
What is the number of possible unique combinations to create a 5 character ID with "letters", "LETTERS" and "0:9"?
set.seed(0)
mydata<-data.frame(
ID=rep(NA,10^5),
Poids=rnorm(n=10^5,mean = 65,sd=5)
)
for (i in 1:nrow(mydata)){
mydata$ID[i]<-c(
paste(sample(c(0:9,LETTERS,letters),replace = F,size = 1),
sample(c(0:9,LETTERS,letters),replace = F,size = 1),
sample(c(0:9,LETTERS,letters),replace = F,size = 1),
sample(c(0:9,LETTERS,letters),replace = F,size = 1),
sample(c(0:9,LETTERS,letters),replace = F,size = 1),sep = "")
)
}
table(duplicated(mydata$ID))
FALSE TRUE
99996 4
(length(letters) + length(LETTERS) + length(0:9))^5 is 916,132,832, so there is plenty of space to avoid clashes.
In fact, we can use this number to help generate our sample. We draw 100,000 integers out of 916,132,832 without replacement and interpret each number as its unique string of characters using a bit of modular math and indexing. This can all be done in a single pass:
space <- c(LETTERS, letters, 0:9)
set.seed(0)
samps <- sample(length(space)^5, 10^5)
m <- matrix("", nrow = 10^5, ncol = 5)
for(i in seq(ncol(m))) {
m[,i] <- space[(samps %% length(space)) + 1]
samps <- samps %/% length(space)
}
ID <- apply(m, 1, paste, collapse = "")
We can see this fulfils our requirements:
head(ID)
#> [1] "vpdnq" "rK0ej" "ofE9t" "PqLIr" "6G6tu" "Vhc7R"
length(ID)
#> [1] 100000
length(unique(ID))
#> [1] 100000
The whole thing takes less than a second on my modest machine:
user system elapsed
0.72 0.00 0.74
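The modular arithmetic in the loop above amounts to writing each integer in base 62, one character per digit. A minimal sketch for a single integer, assuming the same `space` vector, may make this easier to follow. The helper `decode_id` is a hypothetical name introduced here.

```r
space <- c(LETTERS, letters, 0:9)
base  <- length(space)   # 62 symbols

# Hypothetical helper: decode one non-negative integer into a 5-character ID.
decode_id <- function(x, width = 5) {
  chars <- character(width)
  for (i in seq_len(width)) {
    chars[i] <- space[(x %% base) + 1]  # lowest remaining base-62 digit
    x <- x %/% base                     # shift to the next digit
  }
  paste(chars, collapse = "")
}

base^5         # 916132832 possible IDs
decode_id(0)   # "AAAAA"
decode_id(1)   # "BAAAA"
decode_id(62)  # "ABAAA"
```

Since distinct integers below 62^5 have distinct base-62 digit sequences, sampling the integers without replacement is enough to guarantee distinct IDs.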
Update
It occurs to me that it is possible to give 100,000 people a unique ID using only 16 characters, i.e. 0-9 and a-f, with code that is much quicker and simpler than above:
set.seed(0)
ID <- as.hexmode(sample(16^5, 10^5))
head(ID)
#> [1] "d43f9" "392a7" "033a2" "cf1d7" "aa10e" "134bb"
length(unique(ID))
#> [1] 100000
Which takes less than 10 milliseconds.
Created on 2022-05-15 by the reprex package (v2.0.1)
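One caveat with the hexadecimal variant: sample(16^5, 10^5) draws from 1 to 16^5, and 16^5 itself is "100000" in hex, which has six characters. A possible variation, sketched here under the assumption that fixed-width character IDs are wanted, shifts the range to 0..16^5 - 1 and zero-pads with sprintf():

```r
set.seed(0)
# Shift to 0 .. 16^5 - 1 so every value fits in exactly five hex digits,
# then format as zero-padded lowercase hex strings.
ID <- sprintf("%05x", sample(16^5, 10^5) - 1L)

all(nchar(ID) == 5)  # every ID has exactly five characters
anyDuplicated(ID)    # 0 means there are no duplicates
```

This returns a plain character vector rather than a hexmode object, which can be more convenient when storing the IDs in a data frame column.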
You can try the code below (given N <- 1e5 and k <- 5):
n <- ceiling(N^(1 / k))
S <- sample(c(LETTERS, letters, 0:9), n)
ID <- head(do.call(paste0, expand.grid(rep(list(S), k))),N)
where
n gives a subset of the whole space that supports all unique combinations up to given number N, e.g., N <- 100000
S denotes a sub-space from which we draw the alphabets or digits
expand.grid gives all combinations
If you don't need randomness, the highly performant arrangements package can help by iterating over the permutations in order, not generating any more than are needed:
library(arrangements)
x = c(letters, LETTERS, 0:9)
ix = ipermutations(x = x, k = 5)
ind = ix$getnext(d = nrow(mydata))
mydata$ID = apply(ind, MAR = 1, FUN = \(i) paste(x[i], collapse = ""))
rbind(head(mydata), tail(mydata))
# ID Poids
# 1 abcde 64.46278
# 2 abcdf 62.00053
# 3 abcdg 75.71787
# 4 abcdh 67.73765
# 5 abcdi 66.45402
# 6 abcdj 66.85561
# 99995 abFpe 56.20545
# 99996 abFpf 64.14443
# 99997 abFpg 70.70191
# 99998 abFph 66.83226
# 99999 abFpi 65.22835
# 100000 abFpj 56.28880
This is quite fast:
user system elapsed
0.194 0.001 0.203
I'm writing a function to analyse .csv files in a directory on my hard drive, using a series of for and while loops (I know for loops are unpopular in R, but they're good for what I need).
The function creates a number of data-frames, and performs actions on each one in turn before overwriting them and moving on to the next file in the directory to repeat the action.
The part of the code that does not work so far is the creation of a matrix from vectors taken from the data files being analysed. A simplified version of the code is shown below:
data1 <- seq(1, 10, 1)
data2 <- seq(1, 7, 1)
data3 <- seq(1, 5, 1)
n <- max(length(data1), length(data2), length(data3))
k <- c(1, 2, 3)
for(a in k){
if(a == 1){
length(get(paste("data", a, sep = ""))) <- n
data_matrix <- get(paste("data", a, sep = ""))
}else{
while(exists(paste("data", a, sep = ""))){
length(get(paste("data", a, sep = ""))) <- n
data_matrix <- cbind(data_matrix, get(paste("data", a, sep = "")))
}
}
}
The nature of my data is that the length of the columns in my datasets vary with each data collection, so I've adapted a technique found in this post that deals with using cbind to bind objects of a different length without replication of the data within the smaller objects.
The issue I have when trying to implement this code is I get the error message:
Error in length(get(paste("data", a, sep = ""))) <- n :
target of assignment expands to non-language object
I'm guessing the issue is that the function get() cannot be used to select items in the Global Environment and to modify them in this way.
You could use:
get("x")[1:n]
to get a vector called "x" padded with NA to length n.
That is:
> x=1:3
> n=10
> get("x")[1:n]
[1] 1 2 3 NA NA NA NA NA NA NA
Having said that, this is a neater way to get the matrix you want (hopefully you can adapt to your scenario):
> datalist <- list(data1, data2, data3)
> maxlength <- max(lengths(datalist))
> sapply(datalist, function(x) x[1:maxlength] )
[,1] [,2] [,3]
[1,] 1 1 1
[2,] 2 2 2
[3,] 3 3 3
[4,] 4 4 4
[5,] 5 5 5
[6,] 6 6 NA
[7,] 7 7 NA
[8,] 8 NA NA
[9,] 9 NA NA
[10,] 10 NA NA
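As for why the original assignment fails: as far as I understand it, the left-hand side of a replacement call such as length(x) <- n must ultimately resolve to a variable name, and get(paste(...)) does not. A hedged sketch that keeps the name-based lookup but avoids the assignment is to fetch the vectors with mget() and call the replacement function `length<-` directly; the variable names below follow the question.

```r
data1 <- seq(1, 10, 1)
data2 <- seq(1, 7, 1)
data3 <- seq(1, 5, 1)

# Look the vectors up by name and collect them in a named list.
datalist <- mget(paste0("data", 1:3))
n <- max(lengths(datalist))

# `length<-`(x, n) returns a copy of x padded with NA up to length n.
padded <- lapply(datalist, `length<-`, n)
data_matrix <- do.call(cbind, padded)
dim(data_matrix)  # 10 rows, 3 columns
```

The columns keep the names data1, data2 and data3, which can be handy when the number of datasets changes from day to day.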
For those who want to see how the solution proposed by @GeorgeSavva looks using the loop method that I am employing (my loop contained additional errors):
data1 <- seq(1, 10, 1)
data2 <- seq(1, 7, 1)
data3 <- seq(1, 5, 1)
n <- max(length(data1), length(data2), length(data3))
k <- c(1, 2, 3)
for(a in k){
if(a == 1){
data_matrix <- get(paste("data", a, sep = ""))[1:n]
}else{
data_matrix <- cbind(data_matrix, get(paste("data", a, sep = ""))[1:n])
}
}
The while loop was unnecessary. I have written my code this way to make it as versatile as possible, since I obtain a varying number of datasets on a daily basis, each of varying size.
I can use common operations on each dataset, so I can write a function that will tidy the data, construct charts and compare the datasets automatically without having to write new commands for each analysis.
I want to find how many combinations of nucleotides occur in a genome sequence: for pairs, AA, AT, AG, AC, ... (16 combinations), and for triplets, ATG, ACG, ... (64 combinations). I know how to do this with a package, and I show that below, but I want to write my own code to do it.
The seqinr package does this job perfectly. This is the code I used:
install.packages('seqinr')
library(seqinr)
m = read.fasta(file='sequence.fasta')
mseq = m[[1]]
count(mseq,2) # gives how many binary combinations are found in the seq
count(mseq,3) # gives how many 3-elemented combinations are found in the seq
This is a slow way to do it. I am certain it is faster in the bioconductor package.
# some practice data
mseq = paste(sample(c("A", "C", "G", "T"), 1000, rep=T), collapse="")
# define a function called count
count = function(mseq, n){
# split the sequence into every possible sub sequence of length n
x = sapply(1:(nchar(mseq) - n + 1), function(i) substr(mseq, i, i+n-1))
# how many unique sub sequences of length n are there?
length(table(x))
}
Actually just checked and this is pretty much how they did it:
function (seq, wordsize, start = 0, by = 1, freq = FALSE, alphabet = s2c("acgt"),
frame = start)
{
if (!missing(frame))
start = frame
istarts <- seq(from = 1 + start, to = length(seq), by = by)
oligos <- seq[istarts]
oligos.levels <- levels(as.factor(words(wordsize, alphabet = alphabet)))
if (wordsize >= 2) {
for (i in 2:wordsize) {
oligos <- paste(oligos, seq[istarts + i - 1], sep = "")
}
}
counts <- table(factor(oligos, levels = oligos.levels))
if (freq == TRUE)
counts <- counts/sum(counts)
return(counts)
}
If you want to find the code for a function use getAnywhere()
getAnywhere(count)
The simple thing to do is just something like this:
# Generate a test sequence
set.seed(1234)
testSeq <- paste(sample(LETTERS[1:3], 100, replace = T), collapse = "")
# Split string into chunks of size 2 and then count occurrences
testBigram <- substring(testSeq, seq(1, nchar(testSeq), 2), seq(2, nchar(testSeq), 2))
table(testBigram)
AA AB AC BA BB BC CA CB CC
10 10 14 3 3 2 2 5 1
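Note that the chunking above uses non-overlapping windows (positions 1-2, 3-4, ...), whereas the seqinr code shown earlier slides the window one position at a time by default. A minimal sketch of the overlapping variant, using a hypothetical helper `kmer_counts`, could be:

```r
# Hypothetical helper: count every overlapping substring of length k.
# Assumes nchar(s) >= k.
kmer_counts <- function(s, k) {
  starts <- seq_len(nchar(s) - k + 1)          # every possible start position
  table(substring(s, starts, starts + k - 1))
}

kc <- kmer_counts("ATGATG", 2)
kc          # AT: 2, GA: 1, TG: 2
length(kc)  # 3 distinct 2-letter combinations
```

Which of the two windowing schemes is appropriate depends on whether the sequence should be read in fixed frames or scanned position by position.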
Here is a way using a "function factory" (https://adv-r.hadley.nz/function-factories.html).
The 2-element and 3-element combinations are n-grams of size 2 and 3. So we make this n-gram function factory.
# Generate a function to create a function
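ngram <- function(size) {
  function(myvector) {
    # start positions of complete, non-overlapping chunks
    # (a trailing partial chunk is dropped)
    starts <- seq(1, nchar(myvector) - size + 1, by = size)
    substring(myvector, starts, starts + size - 1)
  }
}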
# Assign the functions names (optional)
bigram <- ngram(2)
trigram <- ngram(3)
# 2 element combinations
table(bigram(testSeq))
AA AB AC BA BB BC CA CB CC
10 10 14 3 3 2 2 5 1
# count of 2 element combinations
length(unique(bigram(testSeq)))
[1] 9
# counting function
count <- function(mseq, n) length(unique(ngram(n)(mseq)))
count(testSeq, 2)
[1] 9
# and if we wanted to do with with 3 element combinations
table(trigram(testSeq))
First, define some functions to bind a list row-wise and column-wise:
# a function to append vectors row wise
rbindlist <- function(list) {
n <- length(list)
res <- NULL
for (i in seq(n)) res <- rbind(res, list[[i]])
return(res)
}
cbindlist <- function(list) {
n <- length(list)
res <- NULL
for (i in seq(n)) res <- cbind(res, list[[i]])
return(res)
}
# generate sample data
sample.dat <- list()
set.seed(123)
for(i in 1:365){
vec1 <- sample(c(0,1), replace=TRUE, size=5)
sample.dat[[i]] <- vec1
}
dat <- rbindlist(sample.dat)
dat has five columns. Each column is a location and has 365 days of the year (365 rows) with values 1 or 0.
I have another dataframe (see below) which has certain days of the year for each column (location) in dat.
# generate second sample data
set.seed(123)
sample.dat1 <- list()
for(i in 1:5){
vec1 <- sort(sample(c(258:365), replace=TRUE, size=4), decreasing = F)
sample.dat1[[i]] <- vec1
}
dat1 <- cbindlist(sample.dat1)
I need to use dat1 to subset days in dat to do a calculation. An example below:
1) For location 1 (first column in both dat1 and dat):
In column 1 of dat, select the days from 289 till 302 (using dat1), find the longest consecutive occurrence of 1.
Repeat it and this time select the days from 303 (302 + 1) till 343 from dat, find the longest consecutive occurrence of 1.
Repeat it for 343 till 353: select the days from 344 (343 + 1) till 353, find the longest consecutive occurrence of 1.
2) Do this for all the columns
If I want to do sum of 1s, I can do this:
dat <- as.tibble(dat)
dat1 <- as.tibble(dat1)
pmap(list(dat,dat1), ~ {
range1 <- ..2[1]
range2 <- ..2[2]
range3 <- ..2[3]
range4 <- ..2[4]
sum.range1 <- sum(..1[range1:range2]) # this will generate sum between range 1 and range 2
sum.range2 <- sum(..1[range2:range3]) # this will generate sum between range 2 and range 3
sum.range3 <- sum(..1[range3:range4]) # this will generate sum between range 3 and range 4
c(sum.range1=sum.range1,sum.range2=sum.range2,sum.range3=sum.range3)
})
For the longest consecutive occurrence of 1 within each range, I thought of using the rle function. Example below:
pmap(list(dat,dat1), ~ {
range1 <- ..2[1]
range2 <- ..2[2]
range3 <- ..2[3]
range4 <- ..2[4]
spell.range1 <- rle(..1[range1:range2]) # run-length encode the range (runs of 0s and 1s)
spell.1.range1 <- tapply(spell.range1$lengths, spell.range1$values, max)[2] # this should select the maximum consecutive run of 1
spell.range2 <- rle(..1[range2:range3]) # run-length encode the range (runs of 0s and 1s)
spell.1.range2 <- tapply(spell.range2$lengths, spell.range2$values, max)[2] # this should select the maximum consecutive run of 1
spell.range3 <- rle(..1[range3:range4]) # run-length encode the range (runs of 0s and 1s)
spell.1.range3 <- tapply(spell.range3$lengths, spell.range3$values, max)[2] # this should select the maximum consecutive run of 1
c(spell.1.range1 = spell.1.range1, spell.1.range2 = spell.1.range2, spell.1.range3 = spell.1.range3)
})
I get an error, which I think is because I am not using the rle function properly here. I would really like to keep the code as above, since
my other code follows the same pattern and the format of the output suits my needs, so I would appreciate it if someone could suggest how to fix it.
OP's code does work for me. So, without a specific error message it is impossible to understand why the code is not working for the OP.
However, the sample datasets created by the OP are matrices (before they were coerced to tibble) and I felt challenged to find a way to solve the task in base R without using purrr:
To find the length of the longest run of consecutive occurrences of a particular value val in a vector x, we can use the following function:
max_rle <- function(x, val) {
y <- rle(x)
len <- y$lengths[y$values == val]
if (length(len) > 0) max(len) else NA
}
Examples:
max_rle(c(0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1), 1)
[1] 4
max_rle(c(0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1), 0)
[1] 2
# find consecutive occurrences in column batches
lapply(seq_len(ncol(dat1)), function(col_num) {
start <- head(dat1[, col_num], -1L)
end <- tail(dat1[, col_num], -1L) - 1
sapply(seq_along(start), function(range_num) {
max_rle(dat[start[range_num]:end[range_num], col_num], 1)
})
})
[[1]]
[1] 8 4 5
[[2]]
[1] 4 5 2
[[3]]
[1] NA 3 4
[[4]]
[1] 5 5 4
[[5]]
[1] 3 2 3
The outer lapply() loops over the columns of dat and dat1, respectively. The inner sapply() loops over the row ranges stored in dat1 and subsets dat accordingly.
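One possible weak spot in the tapply(...)[2] approach from the question, offered as a guess since no error message was given, is that positional indexing assumes both 0 and 1 occur in the range. When a range contains only one of the two values, tapply() returns a single group and [2] yields NA. A small sketch with made-up vectors shows the difference between indexing by position and by name:

```r
all_zero <- c(0, 0, 0)   # made-up range containing no 1s
all_one  <- c(1, 1, 1)   # made-up range containing only 1s
r0 <- rle(all_zero)
r1 <- rle(all_one)

# Indexing by position: the second group does not exist in either case.
by_pos0 <- tapply(r0$lengths, r0$values, max)[2]     # NA
by_pos1 <- tapply(r1$lengths, r1$values, max)[2]     # NA, although the run of 1s has length 3

# Indexing by group name picks the run of 1s whenever it exists.
by_name1 <- tapply(r1$lengths, r1$values, max)["1"]  # 3
by_name0 <- tapply(r0$lengths, r0$values, max)["1"]  # NA: there is no run of 1s
```

The max_rle() function above sidesteps this by filtering on the run values directly instead of relying on the order of the groups.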