DNA conditional frequency in R

I'm trying to find out whether there is any conditional dependence within two different DNA sequences in R.
This is my code; however, I'm getting an error:
Error in `[.data.frame`(data, i) : undefined columns selected
I'm not sure where the issue is. If I put parentheses around data[i-1]==bases[b2], I just get multiple "unexpected '}'" errors, and that is the only other thing I can think of trying.
for (b1 in 1:length(bases))
{
  for (b2 in 1:length(bases))
  {
    count = 1
    for (i in 2:length(mydata1))
    {
      if ((mydata1[i]==bases[b1]) & mydata1[i-1]==bases[b2])
      {
        count = count+1
      }
    }
    b3 = c(bases[b1], bases[b2], count)
    print(b3)
  }
}
I'm expecting essentially a list of DNA base pairs with counts. For example, if the DNA sequence IS conditional upon the previous base, I would expect something like:
[1] "A" "C" "002"
[1] "A" "C" "005"
[1] "A" "C" "009"
and so on, which would indicate whether a certain base has any sort of effect on the identity of the following base, by clearly showing a tendency for A to precede C.
OK, so essentially mydata1 (there is also mydata2) is a DNA sequence, that is to say a list of "A", "G", "C" and "T". Each sequence is 10,000 bases long.
As shown here:
V1
1 T
2 C
3 G
4 G
5 T
6 G
7 G
8 G
9 C
10 A
I'm tasked with determining whether the sequence has bases that depend on one another, i.e. whether [1] T affects the presence of [2] C, and so on. One of the sequences is dependent, the other is not.

If I understand correctly, you want to count the occurrences of each pair of nucleotides at positions i and i+1 in a DNA sequence. You can achieve this with the R function table; an example is provided below.
# input sequence
seq <- "ACGTACTGCACAAACTAC"
# length of input sequence
length_seq <- nchar(seq, type="chars")
# first substring: from 1 to second-last
seq1 <- substr(seq, 1, (length_seq - 1))
# second substring: from 2 to last
seq2 <- substring(seq, 2, length_seq)
# split strings
seq1_split <- strsplit(seq1, "")[[1]]
seq2_split <- strsplit(seq2, "")[[1]]
# initialize vectors
first_nt <- vector(mode="character", length = (length_seq - 1))
second_nt <- vector(mode="character", length = (length_seq -1))
# fill vectors
count = 0
for (b in seq1_split)
{
count = count + 1
first_nt[count] <- b
}
count = 0
for (b in seq2_split)
{
count = count + 1
second_nt[count] <- b
}
# create matrix with character i and i+1 in each row
mat <- matrix(c(first_nt, second_nt), nrow=(length_seq - 1))
# collapse matrix
to_table <- apply(mat, 1, paste, collapse="")
# table
my_table <- table(to_table)
print(my_table)
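As for the error in the question: on a data frame, mydata1[i] selects column i, and length(mydata1) is the number of columns (here 1), so the loop asks for columns that do not exist. A minimal sketch of the same pair-counting idea applied directly to the column vector follows. The simulated mydata1, the names pair_counts and cond_prob, and the use of chisq.test as a dependence check are assumptions for illustration, not part of the original answer.

```r
# Simulated stand-in for the asker's data: a one-column data frame (assumption)
set.seed(42)
bases <- c("A", "C", "G", "T")
mydata1 <- data.frame(V1 = sample(bases, 10000, replace = TRUE),
                      stringsAsFactors = FALSE)

# Index the column, not the data frame: mydata1$V1 is a vector of 10,000 bases
x <- factor(mydata1$V1, levels = bases)
n <- length(x)

# Rows: base at position i-1; columns: base at position i
pair_counts <- table(previous = x[-n], current = x[-1])

# Row-wise proportions estimate P(current base | previous base)
cond_prob <- prop.table(pair_counts, margin = 1)
print(round(cond_prob, 3))

# One possible check: chi-squared test of independence on the pair counts
print(chisq.test(pair_counts))
```

If the rows of cond_prob look alike, the previous base tells you little about the next one; rows that differ strongly suggest dependence.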

Related

Generation of Unique ID

Can someone help me generate a unique 6-character URN in R? I don't know how to do this.
Below are the rules for the URN:
1. It needs to be alphanumeric, start with a letter and possibly end with a letter (e.g. AA34YB).
2. Use only upper-case letters.
3. Do not use the letters O or I (I is the letter after H and before J).
4. Use only the digits 1-9. Exclude 0.
5. The first two characters should be letters, followed by a 2-digit number, ending with two letters, e.g. "AA22DD", "EE34TY", "ER67YU".
6. All records must contain a number as shown in rule 5.
7. It must be exactly 6 characters long.
I would love to generate up to 4 million unique records. Any R code suggestion is highly welcome. I am not an expert in R; I am actually new to it.
Thanks very much.
Here is a function that will generate ordered unique IDs:
generateIDs <- function(n, existing=NULL){
  # Initialise a counter to produce IDs
  counter <- 0
  # Create arrays of letters and digits
  letters <- LETTERS[LETTERS %in% c("O", "I") == FALSE]
  digits <- 1:9
  # Initialise an array to store the IDs created
  ids <- c()
  # iterate through the letters
  for(first in letters){
    # iterate through the letters
    for(second in letters){
      # iterate through the digits
      for(third in digits){
        # iterate through the digits
        for(fourth in digits){
          # iterate through the letters
          for(fifth in letters){
            # iterate through the letters
            for(sixth in letters){
              # Create the unique code
              code <- paste0(first, second, third, fourth, fifth, sixth)
              # Check if already exists
              if(code %in% existing == FALSE){
                # Iterate the counter
                counter <- counter + 1
                # Store the ID
                ids[counter] <- code
                existing[length(existing) + 1] <- code
                # Check if created enough IDs
                if(counter == n){
                  return(ids)
                }
                # Note progress
                if(counter %% 10000 == 0){
                  cat("\rCreated", counter, "ids!")
                }
              }
            }
          }
        }
      }
    }
  }
}
That is a horrific number of nested for loops but it avoids the inefficient random generation of IDs. You can test it using the following code:
generateIDs(10)
"AA11AA" "AA11AB" "AA11AC" "AA11AD" "AA11AE" "AA11AF" "AA11AG" "AA11AH" "AA11AJ" "AA11AK"
Note that ideally you should run this function once. Theoretically, this function could create up to 26873856 unique IDs but it doesn't scale well!
See @GKi's answer for a much better solution! :-)
You can use expand.grid to generate unique IDs.
n <- 10
t1 <- LETTERS[!LETTERS %in% c("O", "I")]
t2 <- 1:9
#t1 <- rawToChar(as.raw(c(65:72,74:78,80:90)), multiple = TRUE) # Alternative
#t2 <- rawToChar(as.raw(49:57), multiple = TRUE)
apply(expand.grid(t1, t1, t2, t2, t1, t1)[seq(n),], 1, paste, collapse = "")
# 1 2 3 4 5 6 7 8
#"AA11AA" "BA11AA" "CA11AA" "DA11AA" "EA11AA" "FA11AA" "GA11AA" "HA11AA"
# 9 10
#"JA11AA" "KA11AA"
set.seed(1) #Sample randomly
apply(expand.grid(t1, t1, t2, t2, t1, t1)[sample(length(t1)^4*length(t2)^2, n),]
, 1, paste, collapse = "")
#10938497 17633234 12201267 18120554 21612295 21509711 13901861 6841049
#"SL15UK" "BG59TR" "CU65XL" "BH54ES" "GJ13HV" "YF31FV" "EE79KN" "SV66CG"
#23945701 10770210
#"NK23KX" "TG68QK"
If it needs too much memory, see @Joseph-Crispell's answer.
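If building the full expand.grid table is too heavy, one possible alternative is to decode integer indices into IDs directly, treating each ID as a mixed-radix number (24 letters, 24 letters, 9 digits, 9 digits, 24 letters, 24 letters). The sketch below is not from either answer; make_ids is a hypothetical helper name, and it assumes the same letter and digit rules as above.

```r
# Hypothetical helper: turn indices 1..(24^4 * 9^2) into IDs without
# materialising the full table of combinations.
make_ids <- function(idx) {
  lt <- LETTERS[!LETTERS %in% c("O", "I")]   # 24 allowed letters
  dg <- 1:9                                  # allowed digits
  radix <- c(24, 24, 9, 9, 24, 24)           # pool size per position
  k <- idx - 1                               # zero-based index
  out <- vector("list", 6)
  # Peel off one position at a time; the last position varies fastest
  for (p in 6:1) {
    r <- k %% radix[p]
    k <- k %/% radix[p]
    pool <- if (p %in% c(3, 4)) dg else lt
    out[[p]] <- pool[r + 1]
  }
  do.call(paste0, out)
}

total <- 24^4 * 9^2                 # 26873856 possible IDs
make_ids(1:3)                       # first IDs in order

# Random, unique IDs: sample distinct indices, then decode them
set.seed(1)
ids <- make_ids(sample.int(total, 10))
```

Because distinct indices always decode to distinct IDs, sampling indices without replacement (for example sample.int(total, 4e6)) should give unique IDs without any duplicate check.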

How to assign the column name to the variable dynamically

I am currently developing an application and I need to loop through the columns of the data frame. For instance, if the data frame has the columns
char_set <- data.frame(character(),character(),character(),character(),stringsAsFactors = FALSE)
names(char_set) <- c("a","b","c","d")
If the input is given as "a", then the column name "b" should be assigned to the variable, say promote.
It throws the error: Error in `[.data.frame`(char_set, i + 1) : undefined columns selected. Is there any solution?
char_name <- "a"
char_set <- data.frame(character(),character(),character(),character(),stringsAsFactors = FALSE)
names(char_set) <- c("a","b","c","d")
for (i in 1:ncol(char_set)) {
promote <- ifelse(names(char_set) == char_name,char_set[i+1], "-")
print(promote)
}
Thanks in advance!!!
This is actually quite interesting. I would suggest doing something along these lines:
char_name <- "a"
char_set <- data.frame(
a = 1:2,
b = 3:4,
c = 5:6,
d = 8:9,
stringsAsFactors = FALSE
)
res_dta <- data.frame(matrix(nrow = 2, ncol = 3))
for (i in wrapr::seqi(1, NCOL(char_set) - 1)) {
  print(i)
  if (names(char_set)[i] == char_name) {
    res_dta[i] <- char_set[i + 1]
  } else {
    res_dta[i] <- char_set[i]
  }
}
Results
char_set
a b c d
1 1 3 5 8
2 2 4 6 9
res_dta
X1 X2 X3
1 3 3 5
2 4 4 6
There are a few generic points:
When you are looping through columns, be mindful not to fall outside the data frame's dimensions; running i + 1 on i = 4 asks for column 5, which returns an error for a data frame with four columns. You may then decide to stop one column early or to break at a specific value of i.
I'm not sure I got your request right: for column name a you want to take the values of column b, and column b then stays as it was?
Broadly speaking, I'm of the view that the comparison names(char_set)[i] == char_name requires more thought, but this answer gives you a start. Updating your post with the desired results would help in designing a solution.
The problem in your code is that you loop from 1 to the number of columns of the char_set data frame and then index char_set[i+1].
Thus, when the index i takes its maximum value, char_set[i+1] raises an error because there is no column with that index.
You can try this solution:
char_name <- "a"
# use <= so that the last column is still returned when it is the next one
promote <- ifelse((which(names(char_set) == char_name) + 1) <= ncol(char_set), names(char_set)[which(names(char_set) == char_name) + 1], "-")
promote
> [1] "b"
char_name <- "d"
promote <- ifelse((which(names(char_set) == char_name) + 1) <= ncol(char_set), names(char_set)[which(names(char_set) == char_name) + 1], "-")
promote
> [1] "-"
When the variable char_name takes the value a, the variable promote takes the name of the column that follows a in char_set.
I suggest you think about the case in which char_name takes the value d and there is nothing in char_set after d; here promote falls back to "-".
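The same lookup can be wrapped in a small helper that also handles a name that is not present at all. This is only a sketch; next_col_name and its default argument are hypothetical names, not part of either answer.

```r
# Empty data frame with the column names from the question
char_set <- data.frame(a = character(), b = character(),
                       c = character(), d = character(),
                       stringsAsFactors = FALSE)

# Hypothetical helper: name of the column after `name`, or `default`
# when `name` is the last column or does not exist
next_col_name <- function(df, name, default = "-") {
  idx <- match(name, names(df))            # NA if the name is absent
  if (is.na(idx) || idx >= ncol(df)) {
    return(default)
  }
  names(df)[idx + 1]
}

promote <- next_col_name(char_set, "a")
print(promote)   # "b"
```

Using match() instead of which() keeps the result a single value even when the name is missing, which avoids a zero-length test inside ifelse.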

Shuffling string (non-randomly) for maximal difference

After trying for an embarrassingly long time and extensive searches online, I come to you with a problem.
I am looking for a method to (non-randomly) shuffle a string to get a string which has the maximal ‘distance’ from the original one, while still containing the same set of characters.
My particular case is for short nucleotide sequences (4-8 nt long), as represented by these example sequences:
seq_1<-"ACTG"
seq_2<-"ATGTT"
seq_3<-"ACGTGCT"
For each sequence, I would like to get a scramble sequence which contains the same nucleobase count, but in a different order.
A favourable scramble sequence for seq_3 could be something like;
seq_3.scramble<-"CATGTGC"
Here none of the sequence positions 1-7 has the same nucleobase as in the original, but the overall nucleobase count is the same (A = 1, C = 2, G = 2, T = 2). Naturally it would not always be possible to get a completely different string, but I would just flag these cases in the output.
I am not particularly interested in randomising the sequence and would prefer a method which makes these scramble sequences in a consistent manner.
Do you have any ideas?
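This is in Python, since I don't know R, but the basic solution is as follows:
import itertools

def calcDistance(originalString, newString):
    # Hamming distance: number of positions where the two sequences differ
    d = 0
    i = 0
    while i < len(originalString):
        if originalString[i] != newString[i]:
            d = d + 1
        i = i + 1
    return d

s = "ACTG"
d_max = 0
s_final = ""
# Try every permutation and keep the first one with the largest distance
for combo in itertools.permutations(s):
    if calcDistance(s, combo) > d_max:
        d_max = calcDistance(s, combo)
        s_final = "".join(combo)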
Give this a try. Rather than returning a single string that fits your criteria, I return a data frame of all distinct permutations, sorted by their string-distance score. The score is calculated using stringdist(..., ..., method="hamming"), which gives the number of substitutions required to convert string A into string B.
seq_3<-"ACGTGCT"
myfun <- function(S) {
  require(combinat)
  require(dplyr)
  require(stringdist)
  # split into characters and build every permutation
  vec <- unlist(strsplit(S, ""))
  P <- sapply(permn(vec), function(i) paste(i, collapse=""))
  # Hamming distance of each permutation from the original
  Dist <- c(stringdist(S, P, method="hamming"))
  df <- data.frame(seq = P, HD = Dist) %>%
    distinct(seq, HD) %>%
    arrange(desc(HD))
  return(df)
}
library(combinat)
library(dplyr)
library(stringdist)
head(myfun(seq_3), 10)
# seq HD
# 1 TACGTGC 7
# 2 TACGCTG 7
# 3 CACGTTG 7
# 4 GACGTTC 7
# 5 CGACTTG 7
# 6 CGTACTG 7
# 7 TGCACTG 7
# 8 GTCACTG 7
# 9 GACCTTG 7
# 10 GATCCTG 7
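Enumerating every permutation grows factorially with sequence length. A deterministic alternative, sketched below, is the rotation trick: sort the characters, rotate the sorted run by the count of the most frequent character, and write the rotated characters back to the original positions. This is not taken from either answer; scramble is a hypothetical name, and the claim that the result leaves max(0, 2m - n) positions unchanged (m being the highest base count, n the length) is my own reasoning rather than something stated in the source.

```r
# Hypothetical helper: deterministic scramble with the same base counts
scramble <- function(s) {
  chars <- strsplit(s, "")[[1]]
  n <- length(chars)
  ord <- order(chars)               # positions grouped by base
  sorted <- chars[ord]
  m <- max(table(chars))            # count of the most frequent base
  # Rotate the sorted bases by m so each run is pushed past itself
  shifted <- sorted[((seq_len(n) + m - 1) %% n) + 1]
  out <- character(n)
  out[ord] <- shifted               # write back to the original positions
  list(seq = paste(out, collapse = ""),
       distance = sum(out != chars),      # Hamming distance from the input
       complete = all(out != chars))      # FALSE flags unavoidable matches
}

res <- scramble("ACGTGCT")
print(res$seq)
print(res$distance)
```

The complete flag covers the case mentioned in the question where a fully different string is impossible, for example when one base makes up more than half of the sequence.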

Selecting random row from a data.frame and assigning it to one of the two other data.frames based on three conditions in R

I have a data.frame (a) as mentioned below:
V1 V2
1 a b
2 a e
3 a f
4 b c
5 b e
6 b f
7 c d
8 c g
9 c h
10 d g
11 d h
12 e f
13 f g
14 g h
Let's assume each row represents an edge of a graph and the values in the row are its vertices.
What I want is to pick a random row (an edge) from data.frame (a) and assign it to data.frame (b) or data.frame (c) based on the three conditions below. To clarify, data.frames (b) and (c) are empty at the beginning. The conditions are:
1. When a row (edge) is randomly picked from data.frame (a) and neither vertex has been assigned, assign the edge to the data.frame with the fewest rows.
To clarify this condition:
Let's say I pick random row (edge) #2 from data.frame (a), which has the two vertices "a" and "e". I should check whether data.frame (b) or data.frame (c) has "a" or "e" in any of its rows. If either does, this rule does not apply and the next rule should be checked. If neither data.frame has "a" or "e" in any row, nrow() should be compared for the two data.frames and the row assigned to the one with fewer rows. If both have the same nrow(), the row can be assigned to either of them.
2. When a row (edge) is randomly picked from data.frame (a) and one of the vertices of that row is present in data.frame (b) or (c), assign the row (edge) to that data.frame.
Say random row #3 is picked, which has "a" and "f". Then data.frames (b) and (c) should be checked to see whether any of their rows contain "a" or "f". Suppose data.frame (b) contains neither "a" nor "f" but data.frame (c) contains "f". Then the row should be assigned to data.frame (c).
It is also possible that data.frame (b) contains "a" and data.frame (c) contains "f". In that case, all the instances of "a" in data.frame (b) and of "f" in data.frame (c) should be counted. If "a" appears 3 times and "f" appears 4 times, the row should be assigned to (b); that is, the row goes to the data.frame in which its vertex has fewer instances.
3. When a row (edge) is randomly picked from data.frame (a) and both vertices of that row are present in one data.frame, assign the row to that data.frame.
To summarize, a random row should be picked from data.frame (a), checked against the conditions above, and assigned to data.frame (b) or (c) accordingly. All the rows of data.frame (a) have to be processed this way.
This should get you started. You can't continually randomly select rows, as you discovered, as that leads to duplicates. Instead, randomly assign the rows to a vector which gives the order they should be processed in. If you don't think this is the right approach, you could also randomly select a row, then remove it from a and later randomly select from what remains. If you still need a, remove the row from a copy of a.
set.seed(1)
dfa <- data.frame(V1 = sample(letters[1:9], replace = TRUE), V2 = sample(letters[1:9], replace = TRUE))
todo <- sample(1:nrow(dfa), nrow(dfa), replace = FALSE)
dfb <- dfa[todo[1],]
dfc <- dfa[todo[2],]
Now continue through 'todo' in order, applying your conditions
and using rbind to add rows to the dfb and dfc:
for (i in 3:length(todo)) {
# apply your logic
# if a row belongs in dfb, do
dfb <- rbind(dfb, dfa[todo[i],])
# etc
}
aCopy<-read.table("isnodes.txt")
p1<-aCopy[-c(1:nrow(aCopy)),]
p2<-aCopy[-c(1:nrow(aCopy)),]
currentRowHistory<-aCopy[-c(1:nrow(aCopy)),]
for(i in 1:nrow(aCopy)) {
currentRow <- aCopy[sample(nrow(aCopy), 1), ]
currentRowHistory <- rbind(currentRow,currentRowHistory)
currentRowV1 <- as.character(currentRow$V1[1])
currentRowV2 <- as.character(currentRow$V2[1])
aCopy <- aCopy[!(aCopy$V1 == currentRowV1 & aCopy$V2 == currentRowV2),]
if(length(which(currentRowV1 == p1$V1)) | length(which(currentRowV1 == p1$V2))){
if(length(which(currentRowV2 == p1$V1)) | length(which(currentRowV2 == p1$V2))){
p1<-rbind(currentRow,p1)
result <- "case 1 assign it to p1"
}
else if(length(which(currentRowV2 == p2$V1)) | length(which(currentRowV2 == p2$V2))){
V1occurances <- length(which(p1$V1 == currentRowV1))+length(which(p1$V2==currentRowV1))
V2occurances <- length(which(p2$V1 == currentRowV2))+length(which(p2$V2==currentRowV2))
if (V1occurances < V2occurances) p1 <- rbind(currentRow, p1) else p2 <- rbind(currentRow, p2)
result <- "case 2"
}
else {
p1<-rbind(currentRow,p1)
result <- "case 3 assign it to p1"
}
} else if(length(which(currentRowV1 == p2$V1)) | length(which(currentRowV1 == p2$V2))){
if(length(which(currentRowV2 == p2$V1)) | length(which(currentRowV2 == p2$V2))){
p2<-rbind(currentRow,p2)
result <- "case 1 assign it to p2"
}
else if(length(which(currentRowV2 == p1$V1)) | length(which(currentRowV2 == p1$V2))){
V1occurancesInP2 <- length(which(p2$V1 == currentRowV1))+length(which(p2$V2==currentRowV1))
V2occurancesInP1 <- length(which(p1$V1 == currentRowV2))+length(which(p1$V2==currentRowV2))
if (V1occurancesInP2 < V2occurancesInP1) p2 <- rbind(currentRow, p2) else p1 <- rbind(currentRow, p1)
result <- "case 2"
}
else {
p2<-rbind(currentRow,p2)
result <- "case 3 assign it to p2"
}
} else if(length(which(currentRowV2 == p1$V1)) | length(which(currentRowV2 == p1$V2))){
p1<-rbind(currentRow,p1)
result <- "Assign it to p1 case 3"
} else if(length(which(currentRowV2 == p2$V1)) | length(which(currentRowV2 == p2$V2))){
p2<-rbind(currentRow,p2)
result <- "Assign it to p2 case 3"
} else {
if (nrow(p1) < nrow(p2)) p1 <- rbind(currentRow, p1) else p2 <- rbind(currentRow, p2)
}
}
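A more compact way to express the three conditions is to count, for each partition, how many of the edge's vertices it already holds and how often they occur. The sketch below is one reading of the rules, not the code from either answer: assign_edges is a hypothetical name, ties go to the first partition, and the handling of cases the question leaves open (for example both vertices present in both partitions) is an assumption.

```r
# Edge list from the question
a <- data.frame(
  V1 = c("a","a","a","b","b","b","c","c","c","d","d","e","f","g"),
  V2 = c("b","e","f","c","e","f","d","g","h","g","h","f","g","h"),
  stringsAsFactors = FALSE
)

assign_edges <- function(edges) {
  parts <- list(b = edges[0, ], c = edges[0, ])   # both start empty
  for (i in sample(nrow(edges))) {                # each row once, random order
    v <- c(edges$V1[i], edges$V2[i])
    # how many of the edge's two vertices already appear in each partition
    present <- sapply(parts, function(p) sum(v %in% c(p$V1, p$V2)))
    # how many times those vertices occur in each partition
    occur <- sapply(parts, function(p) sum(c(p$V1, p$V2) %in% v))
    sizes <- sapply(parts, nrow)
    target <- if (all(present == 0)) {
      names(which.min(sizes))       # rule 1: smaller partition
    } else if (present[["b"]] != present[["c"]]) {
      names(which.max(present))     # rules 2 and 3: partition holding more vertices
    } else {
      names(which.min(occur))       # split evenly: fewer occurrences wins
    }
    parts[[target]] <- rbind(parts[[target]], edges[i, ])
  }
  parts
}

set.seed(1)
res <- assign_edges(a)
print(sapply(res, nrow))
```

Iterating over sample(nrow(edges)) visits every row exactly once in random order, so no row has to be removed from a copy of the input.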

selecting n consequent grouped variables and apply the function in r

Here is example data:
myd <- data.frame (matrix (sample (c("AB", "BB", "AA"), 100*100,
replace = T), ncol = 100))
variablenames= paste (rep (paste ("MR.", 1:10,sep = ""),
each = 10), 1:100, sep = ".")
names(myd) <- variablenames
Each variable has a group, here we have ten groups. Thus the group index for the each variable in this data frame is as follows:
group <- rep(1:10, each = 10)
Thus Variable names and group
data.frame (group, variablenames)
group variablenames
1 1 MR.1.1
2 1 MR.1.2
3 1 MR.1.3
4 1 MR.1.4
5 1 MR.1.5
6 1 MR.1.6
7 1 MR.1.7
8 1 MR.1.8
9 1 MR.1.9
10 1 MR.1.10
11 2 MR.2.11
<<<<<<<<<<<<<<<<<<<<<<<<
100 10 MR.10.100
Grouping means that the following steps should be applied to each group of variables separately.
My real function is longer; the following is a short example:
Function considering two variables at a time:
myfun <- function (x1, x2) {
out <- NULL
out <- paste(x1, x2, sep=":")
# for other steps to be performed here
return (out)
}
# group 1
myfun (myd[,1], myd[,2]); myfun (myd[,3], myd[,4]); myfun (myd[,5], myd[,6]);
myfun (myd[,7], myd[,8]); myfun (myd[,9], myd[,10]);
# group 2
myfun (myd[,11], myd[,12]); myfun (myd[,13], myd[,14]); .......so on to group 10 ;
In this way I need to walk through variables 1:10 (i.e. the first group) performing the above action, then 11:20 (the second group). The grouping causes no problem in this case, because the number of variables in each group (10) is divisible by the number of variables taken at a time (2).
However, in the following example, where 3 variables are taken at a time, the number of variables in each group (10) is not divisible by 3, so one variable is left over at the end.
Function considering three variables at a time:
myfun <- function (x1, x2, x3) {
out <- NULL
out <- paste(x1, x2, x3, sep=":")
# for other steps to be performed here
return (out)
}
# for group 1
myfun (myd[,1], myd[,2], myd[,3])
myfun (myd[,4], myd[,5], myd[,6])
myfun (myd[,7], myd[,8], myd[,9])
# As there is one variable left over before proceeding to the second group,
# the final call in the group will take 1 extra variable
myfun (myd[,7], myd[,8], myd[,9],myd[,10] )
# for group 2
myfun (myd[,11], myd[,12], myd[,13])
# and to the end all groups and to end of the file.
I want to loop this process over a user-defined number n of variables considered at a time, where n may be anything from 1 to the maximum number of variables in each group.
Edit: An illustration of the process (only groups 1 and 2 demonstrated) was attached here. [Image not reproduced.]
Create a function that will split your data up into appropriate lists, and apply whatever functions you want to your list.
This function will create your second grouping variable. (The first grouping variable (group) is provided in your question; if you change that value, you should also change DIM in the function below.)
myfun = function(LENGTH, DIM = 10) {
PATTERN = rep(1:(DIM %/% LENGTH), each=LENGTH)
c(PATTERN, rep(max(PATTERN), DIM %% LENGTH))
}
Here are the groups on which we will split myd. In this example, we are splitting myd first into 10-column groups, and each group into 3-column groups, except for the last group, which will have 4 columns (3+3+4 = 10).
NOTE: To change the number of columns you're grouping by, for example, grouping by two variables at a time, change group2 = rep(myfun(3), length.out=100) to group2 = rep(myfun(2), length.out=100).
group <- rep(1:10, each = 10)
# CHANGE THE FOLLOWING LINE ACCORDING
# TO THE NUMBER OF GROUPS THAT YOU WANT
group2 = rep(myfun(3), length.out=100)
This is the splitting process. We first split up just by names, and match those names with myd to create a list of data.frames.
# Extract group names for matching purposes
temp = split(names(myd), list(group, group2))
# Match the names to myd
temp = lapply(1:length(temp),
function(x) myd[, which(names(myd) %in% temp[[x]])])
# Extract the names from the list for future reference
NAMES = lapply(temp, function(x) paste(names(x), collapse="_"))
Now that we have a list, we can do lots of fun things. You wanted to paste your columns together separated by a colon. Here's how you'd do that.
# Do what you want with the list
# For example, to paste the columns together:
FINAL = lapply(temp, function(x) apply(x, 1, paste, collapse=":"))
names(FINAL) = NAMES
Here's a sample of the output:
lapply(FINAL, function(x) head(x, 5))
# $MR.1.1_MR.1.2_MR.1.3
# [1] "AA:AB:AB" "AB:BB:AA" "BB:AB:AA" "BB:AA:AB" "AA:AA:AA"
#
# $MR.2.11_MR.2.12_MR.2.13
# [1] "BB:AA:AB" "BB:AB:BB" "BB:AA:AA" "AB:BB:AA" "BB:BB:AA"
#
# $MR.3.21_MR.3.22_MR.3.23
# [1] "AA:AB:BB" "BB:AA:AA" "AA:AB:BB" "AB:AA:AA" "AB:BB:BB"
#
# <<<<<<<------SNIP------>>>>>>>>
#
# $MR.1.4_MR.1.5_MR.1.6
# [1] "AB:BB:AA" "BB:BB:BB" "AA:AA:AA" "BB:BB:AB" "AB:AA:AA"
#
# $MR.2.14_MR.2.15_MR.2.16
# [1] "AA:BB:AB" "BB:BB:BB" "BB:BB:AB" "AA:BB:AB" "BB:BB:BB"
#
# $MR.3.24_MR.3.25_MR.3.26
# [1] "AA:AB:BB" "BB:AA:BB" "BB:AB:BB" "AA:AB:AA" "AB:AA:AA"
#
# <<<<<<<------SNIP------>>>>>>>>
#
# $MR.1.7_MR.1.8_MR.1.9_MR.1.10
# [1] "AB:AB:AA:AB" "AB:AA:BB:AA" "BB:BB:AA:AA" "AB:BB:AB:AA" "AB:BB:AB:BB"
#
# $MR.2.17_MR.2.18_MR.2.19_MR.2.20
# [1] "AB:AB:BB:BB" "AB:AB:BB:BB" "AB:AA:BB:BB" "AA:AA:AB:AA" "AB:AB:AB:AB"
#
# $MR.3.27_MR.3.28_MR.3.29_MR.3.30
# [1] "BB:BB:AB:BB" "BB:BB:AA:AA" "AA:BB:AB:AA" "AA:BB:AB:AA" "AA:AB:AA:BB"
#
# $MR.4.37_MR.4.38_MR.4.39_MR.4.40
# [1] "BB:BB:AB:AA" "AA:BB:AA:BB" "AA:AA:AA:AB" "AB:AA:BB:AB" "BB:BB:BB:BB"
#
# $MR.5.47_MR.5.48_MR.5.49_MR.5.50
# [1] "AB:AA:AA:AB" "AB:AA:BB:AA" "AB:BB:AA:AA" "AB:BB:BB:BB" "BB:AA:AB:AA"
#
# $MR.6.57_MR.6.58_MR.6.59_MR.6.60
# [1] "BB:BB:AB:AA" "BB:AB:BB:AA" "AA:AB:AB:BB" "BB:AB:AA:AB" "AB:AA:AB:BB"
#
# $MR.7.67_MR.7.68_MR.7.69_MR.7.70
# [1] "BB:AB:BB:AA" "BB:AB:BB:AA" "BB:AB:BB:AB" "AB:AA:AA:AA" "AA:AA:AA:AB"
#
# $MR.8.77_MR.8.78_MR.8.79_MR.8.80
# [1] "AA:AB:AA:AB" "AB:AA:AB:BB" "BB:BB:AA:AB" "AB:BB:BB:BB" "AB:AA:BB:AB"
#
# $MR.9.87_MR.9.88_MR.9.89_MR.9.90
# [1] "AA:BB:AB:AA" "AA:AB:BB:BB" "AA:BB:AA:BB" "AB:AB:AA:BB" "AB:AA:AB:BB"
#
# $MR.10.97_MR.10.98_MR.10.99_MR.10.100
# [1] "AB:AA:BB:AB" "AB:AA:AB:BB" "BB:AB:AA:AA" "BB:BB:AA:AA" "AB:AB:BB:AB"
I suggest recoding myfun to take a matrix and using pasteCols from the plotrix package.
library(plotrix)
myfun = function(x){
out = pasteCols(t(x), sep = ":")
# some code
return(out)
}
Then it's very easy: for each group, compute the index of the first and of the last column you want to use when you call myfun, using modulus and integer division:
rubiques_solution = function(group, myd, num_to_group){
  # loop over groups
  for(g in unique(group)){
    var_index = which(group == g)
    num_var = length(var_index)
    # test to make sure num_to_group is not larger than the number of variables
    if(num_var < num_to_group){
      stop("num_to_group > number of variables in at least one group")
    }
    # number of calls to myfun
    num_calls = num_var %/% num_to_group
    # the idea here is that we create the first and last column
    # in which we are interested for each call
    first = seq(from = var_index[1], by = num_to_group, length.out = num_calls)
    last = first + num_to_group - 1
    # the last call may contain more variables; we adjust here:
    last[length(last)] = last[length(last)] + (num_var %% num_to_group)
    # iterate over every call (seq_len, not just the single value num_calls)
    for(i in seq_len(num_calls)){
      # maybe do something with the return value of myfun ?
      myfun(myd[,first[i]:last[i]])
    }
  }
}
group = rep(1:10, each = 10) # same as yours
myd = data.frame (matrix (sample (c("AB", "BB", "AA"), 100*100, replace = T), ncol = 100)) # same as yours
num_to_group = 2 # this is your first example
rubiques_solution(group, myd, num_to_group)
Hope I understood the problem right.
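For comparison, the chunking itself can be done in base R without extra packages. The sketch below splits the column indices of each group into runs of n and folds any leftover columns into the group's last chunk, as the question describes. chunk_columns is a hypothetical helper name, and the simulated myd simply mirrors the example data.

```r
# Hypothetical helper: list of column-index chunks, n columns at a time
# within each group; leftovers are merged into the group's last chunk
chunk_columns <- function(group, n) {
  per_group <- lapply(split(seq_along(group), group), function(idx) {
    k <- max(1, length(idx) %/% n)                 # number of chunks in this group
    id <- pmin((seq_along(idx) - 1) %/% n + 1, k)  # chunk id, capped at k
    unname(split(idx, id))
  })
  unname(unlist(per_group, recursive = FALSE))
}

set.seed(1)
myd <- data.frame(matrix(sample(c("AB", "BB", "AA"), 100 * 100, replace = TRUE),
                         ncol = 100))
group <- rep(1:10, each = 10)

chunks <- chunk_columns(group, 3)   # 1:3, 4:6, 7:10, 11:13, ...

# Paste the columns of every chunk together row by row
res <- lapply(chunks, function(cols) {
  do.call(paste, c(unname(as.list(myd[cols])), sep = ":"))
})
```

Changing the second argument of chunk_columns to 2 gives the pairwise case from the first example, with no leftovers to merge.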

Resources