Function to sample a variable number of substrings given string length in R

I'm trying to write an R function that samples a variable number of 5-element substrings, based on the length of the original string in each row of a data frame. I first calculated how many draws I'd like for each row, and I would like the function to use the "num_draws" column so that the number of samples taken for each row comes from that row's value. My thought was to write the function for a single generic row and then use an apply statement outside the function to act on each row, but I can't figure out how to make the function refer to column 3 for the current row (without getting either the value of just the first row or the values of all rows).
example data frame:
BP TF num_draws
1 CGGCGCATGTTCGGTAATGA TFTTTFTTTFFTTFTTTTTF 6
2 ATAAGATGCCCAGAGCCTTTTCATGTACTA TFTFTFTFFFFFFTTFTTTTFTTTTFFTTT 9
3 TCTTAGGAAGGATTC FTTTTTTTTTFFFFF 4
desired output:
[1]GGCGC FTTTF
AATGA TTTTF
TGTTC TTFFT
TAATG TTTTT
AATGA TTTTF
CGGCG TFTTT
[2]AGATG FTFTF
ATAAG TFTFT
ATGCC FTFFF
GCCCA FFFFF
ATAAG TFTFT
GTACT TFFTT
GCCCA FFFFF
TGCCC TFFFF
AGATG FTFTF
[3]TTAGG TTTTT
CTTAG TTTTT
GGAAG TTTTT
GGATT TTFFF
example code:
#make example data frame
BaseP1 <- paste(sample(size = 20, x = c("A","C","T","G"), replace = TRUE), collapse = "")
BaseP2 <- paste(sample(size = 30, x = c("A","C","T","G"), replace = TRUE), collapse = "")
BaseP3 <- paste(sample(size = 15, x = c("A","C","T","G"), replace = TRUE), collapse = "")
TrueFalse1 <- paste(sample(size = 20, x = c("T","F"), replace = TRUE), collapse = "")
TrueFalse2 <- paste(sample(size = 30, x = c("T","F"), replace = TRUE), collapse = "")
TrueFalse3 <- paste(sample(size = 15, x = c("T","F"), replace = TRUE), collapse = "")
my_df <- data.frame(c(BaseP1,BaseP2,BaseP3), c(TrueFalse1, TrueFalse2, TrueFalse3))
#calculate number of draws by length
frag_length<- 5
my_df<- cbind(my_df, (round((nchar(my_df[,1]) / frag_length) * 1.5, digits = 0)))
colnames(my_df) <- c("BP", "TF", "num_draws")
#function to sample x number of draws per row (this does not work)
Fragment = function(string) {
nStart = sample(1:(nchar(string) -5), 1)
samp<- substr(string, nStart, nStart + 4)
replicate(n= string[,3], expr = samp)
}
apply(my_df[,1:2], c(1,2), Fragment)

One option would be to give the function another argument, n, and to create nStart inside the replicate call so that each draw gets its own start position:
Fragment = function(string, n) {
replicate(n= n, {nStart <- sample(1:(nchar(string) -5), 1)
samp <- substr(string, nStart, nStart + 4)
})
}
apply(my_df, 1, function(x) data.frame(lapply(x[1:2], Fragment, n = x[3])))
#$`1`
# BP TF
#1 GGCGC FFTTF
#2 GGTAA TFFTT
#3 GCGCA TTFTT
#4 CGCAT TFFTT
#5 GGCGC FTTTF
#6 TGTTC FTTFT
#$`2`
# BP TF
#1 GTACT TTTTF
#2 ATAAG FTTFT
#3 GTACT TFTFF
#4 TAAGA TTTTF
#5 CCTTT FFTTF
#6 TCATG TTTTF
#7 CCAGA TFTFT
#8 TTCAT TFTFT
#9 CCCAG FTFTF
#$`3`
# BP TF
#1 AAGGA TTTFF
#2 AGGAT TTTTT
#3 CTTAG TFFFF
#4 TAGGA TTTFF
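Note that the output above draws the BP and TF fragments at independent positions, whereas the desired output in the question takes both fragments from the same position. A minimal sketch of that variant follows; the helper name fragment_pair and its frag_length argument are hypothetical, and it assumes both strings in a row have equal length, as in the example data.

```r
# Hypothetical helper: draw n fragments using the same start positions
# for both strings of a row.
fragment_pair <- function(bp, tf, n, frag_length = 5) {
  # valid start positions run from 1 to nchar(bp) - frag_length + 1
  starts <- sample(nchar(bp) - frag_length + 1, n, replace = TRUE)
  data.frame(BP = substring(bp, starts, starts + frag_length - 1),
             TF = substring(tf, starts, starts + frag_length - 1),
             stringsAsFactors = FALSE)
}

# Example data copied from the question
my_df <- data.frame(
  BP = c("CGGCGCATGTTCGGTAATGA", "ATAAGATGCCCAGAGCCTTTTCATGTACTA", "TCTTAGGAAGGATTC"),
  TF = c("TFTTTFTTTFFTTFTTTTTF", "TFTFTFTFFFFFFTTFTTTTFTTTTFFTTT", "FTTTTTTTTTFFFFF"),
  num_draws = c(6, 9, 4),
  stringsAsFactors = FALSE
)

# Map() walks the three columns in parallel, one row at a time
fragments <- unname(Map(fragment_pair, my_df$BP, my_df$TF, my_df$num_draws))
```

Because Map() passes each row's num_draws value directly, the function never has to look up column 3 itself.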

Related

Creating a t-test loop over a dataframe using an index

So, let's say I have a 1000-row, 6-column dataframe, the columns are a1, a2, b1, b2, c1, c2. I want to run some t-tests using a's, b's, and c's and get an output df with 3 columns for the t-values of a-b-c and another three for the significance information for those values, making it a total of 6 columns. The problem I have is with rows, I want to loop over chunks of 20, rendering the output a (1000/20=)50-row, 6-column df.
I have already tried creating an index column for my initial df which repeats a 1 for the first 20 rows, a 2 for the next 20 rows and so on.
convert_n <- function(df) {
df <- df %T>% {.$n_for_t_tests = rep(c(1:(nrow(df)/20)), each = 20)}
}
df <- convert_n(df)
However, I can't seem to find a way to properly utilize the items in this column as indices for a "for" or any kind of loop.
Below you can see the relevant code that creates a 1-row, 6-column df; I need to modify the [0:20] parts and create a loop that does this for each group of 20 rows and binds the results.
t_test_a <- t.test(df$a1[0:20], df$a2[0:20], paired = T, conf.level = 0.95)
t_test_b <- t.test(df$b1[0:20], df$b2[0:20], paired = T, conf.level = 0.95)
t_test_c <- t.test(df$c1[0:20], df$c2[0:20], paired = T, conf.level = 0.95)
t_tests_df <- data.frame(t_a = t_test_a$statistic[["t"]],
                         t_b = t_test_b$statistic[["t"]],
                         t_c = t_test_c$statistic[["t"]])
t_tests_df <- t_tests_df %T>%
  {.$dif_significance_a = ifelse(.$t_a > 2, "YES", "NO")} %T>%
  {.$dif_significance_b = ifelse(.$t_b > 2, "YES", "NO")} %T>%
  {.$dif_significance_c = ifelse(.$t_c > 2, "YES", "NO")} %>%
  dplyr::select(t_a, dif_significance_a,
                t_b, dif_significance_b,
                t_c, dif_significance_c)
Thank you in advance for your help.
You can use split() and sapply():
set.seed(42)
df <- data.frame(a1 = sample(1000, 1000), a2 = sample(1000, 1000),
b1 = sample(1000, 1000), b2 = sample(1000, 1000),
c1 = sample(1000, 1000), c2 = sample(1000, 1000))
group <- gl(50, 20)
D <- split(df, group)
myt <- function(Di)
with(Di, c(at=t.test(a1, a2)$statistic, ap=t.test(a1, a2)$p.value,
bt=t.test(b1, b2)$statistic, bp=t.test(b1, b2)$p.value,
ct=t.test(c1, c2)$statistic, cp=t.test(c1, c2)$p.value))
sapply(D, FUN=myt) ### or
t(sapply(D, FUN=myt))
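The question asked for paired tests, while the calls above are unpaired and run each test twice. A possible variant of the same split() and sapply() idea is sketched below; the helper names pair_test and chunk_tests are hypothetical, and flagging significance at p < 0.05 is an assumption chosen to match the question's conf.level = 0.95.

```r
set.seed(42)
df <- data.frame(a1 = sample(1000, 1000), a2 = sample(1000, 1000),
                 b1 = sample(1000, 1000), b2 = sample(1000, 1000),
                 c1 = sample(1000, 1000), c2 = sample(1000, 1000))
group <- gl(50, 20)  # 50 groups of 20 consecutive rows

# Hypothetical helper: one paired t-test, returning t and p together
pair_test <- function(x, y) {
  tt <- t.test(x, y, paired = TRUE, conf.level = 0.95)
  c(t = unname(tt$statistic), p = tt$p.value)
}

# Hypothetical helper: all three tests for one 20-row chunk
chunk_tests <- function(d) {
  c(a = pair_test(d$a1, d$a2),
    b = pair_test(d$b1, d$b2),
    c = pair_test(d$c1, d$c2))
}

# One row per chunk; columns a.t, a.p, b.t, b.p, c.t, c.p
res <- as.data.frame(t(sapply(split(df, group), chunk_tests)))

# Assumed significance rule: p below 0.05
sig <- as.data.frame(lapply(res[c("a.p", "b.p", "c.p")],
                            function(p) ifelse(p < 0.05, "YES", "NO")))
```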
This is not the prettiest solution, but I did a for loop like this:
df <- data.frame(a1 = sample(1000, 1000),
a2 = sample(1000, 1000),
b1 = sample(1000, 1000),
b2 = sample(1000, 1000),
c1 = sample(1000, 1000),
c2 = sample(1000, 1000))
df_ttest <- data.frame(p_a = c(1:50),
t_a = c(1:50),
p_b = c(1:50),
t_b = c(1:50),
p_c = c(1:50),
t_c = c(1:50))
index <- 0:50 * 20  # chunk boundaries: 0, 20, 40, ..., 1000
for (i in 1:50) {
  # rows of chunk i: 1-20, 21-40, ..., 981-1000
  rows <- (index[i] + 1):index[i + 1]
  # paired tests of a1 vs a2, b1 vs b2 and c1 vs c2, as in the question
  test_a <- t.test(df$a1[rows], df$a2[rows], paired = TRUE)
  test_b <- t.test(df$b1[rows], df$b2[rows], paired = TRUE)
  test_c <- t.test(df$c1[rows], df$c2[rows], paired = TRUE)
  df_ttest$p_a[i] <- test_a$p.value
  df_ttest$p_b[i] <- test_b$p.value
  df_ttest$p_c[i] <- test_c$p.value
  df_ttest$t_a[i] <- test_a$statistic
  df_ttest$t_b[i] <- test_b$statistic
  df_ttest$t_c[i] <- test_c$statistic
}
This gives a 50x6 dataframe with separate columns of p and t values for every 20-row chunk of a, b and c.
You could even go further and make a nested for loop to cycle through each row in df_ttest to make this a bit prettier.

apply statement to sample columns, across rows of different lengths

I'm trying to write a simple R function to sample 5-element substrings across two columns of a single data frame. The strings have equal length within each row, but the lengths differ from row to row. The function works when I specify a row and column to act on, but I can't get the apply statement to work on each row and each column. As written, it only pulls random samples based on the length of the first instance, so if the first instance is shorter than any of the other strings, the output for the other rows is sometimes shorter than 5 elements.
example df:
BP TF
1 CGTCTCTATTCTAGGCAAGA TTTFFFFTFFFTFFFTFTTT
2 AAGTCACTCGAATTCGGATGCCCCCTAGGC TTFFFFFTFFFFTTFTFFTTTFTTTTFTFF
3 TGCTCATGACGGGAC FFFTFTFFFFTFTFT
'intended output:'
1 CTATT FFTFF
2 CCTAG TTTFT
3 TCATG TFTFF
'reproducible example code:'
#make fake data frame
BaseP1 <- paste(sample(size = 20, x = c("A","C","T","G"), replace = TRUE), collapse = "")
BaseP2 <- paste(sample(size = 30, x = c("A","C","T","G"), replace = TRUE), collapse = "")
BaseP3 <- paste(sample(size = 15, x = c("A","C","T","G"), replace = TRUE), collapse = "")
TrueFalse1 <- paste(sample(size = 20, x = c("T","F"), replace = TRUE), collapse = "")
TrueFalse2 <- paste(sample(size = 30, x = c("T","F"), replace = TRUE), collapse = "")
TrueFalse3 <- paste(sample(size = 15, x = c("T","F"), replace = TRUE), collapse = "")
my_df <- data.frame(c(BaseP1,BaseP2,BaseP3), c(TrueFalse1, TrueFalse2, TrueFalse3))
Fragment = function(string) {
nStart = sample(1:nchar(string) -5, 1)
substr(string, nStart, nStart + 4)
}
Fragment(string = my_df[1,1])#works for the first row, first col.
but this does not work:
apply(my_df, c(1,2), function(x) Fragment(string = my_df[1:nrow(my_df),1:ncol(my_df)]))
There was an error in your function:
Fragment = function(string) {
nStart = sample(1:(nchar(string) -5), 1)
substr(string, nStart, nStart + 4)
}
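The parentheses around nchar(string) - 5 were missing. Because : binds more tightly than -, 1:nchar(string) - 5 is evaluated as (1:nchar(string)) - 5, so the possible start positions ran from -4 upward, which made the subsetting go wrong.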
You can then simply use apply(my_df, c(1,2), Fragment) as suggested in the comments.
To show that this works now:
for(i in 1:10000){
stopifnot(all(5 == sapply(apply(my_df, c(1,2), Fragment), nchar)))
}
This shows that in 10000 tries, it always produced 5 characters as output.
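The precedence issue behind this fix can be seen in isolation. The short sketch below uses a made-up length n and a made-up 10-letter string purely for illustration.

```r
n <- 20

# ':' is evaluated before '-', so this is (1:n) - 5
without_parens <- range(1:n - 5)    # -4 15

# with parentheses the sequence stops at n - 5 and starts at 1
with_parens <- range(1:(n - 5))     # 1 15

# a start position below 1 is treated as 1, so the fragment comes out short
short_fragment <- substr("ABCDEFGHIJ", -2, -2 + 4)
```

Any sampled start position below 1 yields a fragment of fewer than 5 characters, which matches the symptom described in the question.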

Faster alternative to nested loops

I have written the function below, which contains a nested loop. In short, it calculates differences in emissions between i (28) pairs of alternative technologies for j (48) countries. For a single combination and a single country, it takes 0.32 sec, which should give a total time of 0.32*28*48 = around 7 min. The function actually takes about 50 min, which makes me think there may be some unnecessary computing going on. Is a nested loop the most efficient approach here?
Any help is greatly appreciated!
alt.comb.p <- function(Fmat){
y.empty = matrix(data = 0,ncol = 2,nrow = nrow(FD)-1)
row.names(y.empty) <- paste(FD$V1[2:nrow(FD)],FD$V2[2:nrow(FD)],sep = " ")
country.list = unique(FD$V1)
for (j in 1:length(country.list)){ # for every country
for (i in 1:ncol(alt.comb)){ # for every possible combination
# the final demand of the first item of the combination is calculated
first = alt.comb[,i][1]
first.name = row.names(Eprice.Exio)[first]
loc1 = grep(pattern = first.name,x = row.names(y.empty))
country.first = substr(x = row.names(y.empty)[loc1[j]],start = 0,stop = 2)
y.empty[,1][loc1[j]] <- Eprice.Exio[first.name,country.first]
# the final demand of the second item of the combination is calculated
second = alt.comb[,i][2]
second.name = row.names(Eprice.Exio)[second]
loc2 = grep(pattern = second.name,x = row.names(y.empty))
country.second = substr(x = row.names(y.empty)[loc2[j]],start = 0,stop = 2)
y.empty[,2][loc2[j]] <- Eprice.Exio[second.name,country.second]
# calculates the difference between the total pressures from item 1 and item 2
r.1 = sum(Fmat%*%as.vector(y.empty[,1]))
r.2 = sum(Fmat%*%as.vector(y.empty[,2]))
r.dif = r.1-r.2 # negative means alternative 1 is better
alt.comb[2+j,i] <- r.dif
row.names(alt.comb)[2+j] <- country.first
y.empty = matrix(data = 0,ncol = 2,nrow = nrow(FD)-1)
row.names(y.empty) <- paste(FD$V1[2:nrow(FD)],FD$V2[2:nrow(FD)],sep = " ")
}
}
return(alt.comb)
}
Edit:
A simplified example would be:
Fmat = matrix(data = runif(1:9600), ncol=9600, nrow=9600)
alt.comb.p <- function(Fmat){
y.empty = matrix(data = 0,ncol = 2,nrow = 9600)
country.list = runif(n = 10)
alt.comb = matrix(data=0,ncol=5,nrow=10)
for (j in 1:10){ # for every country
for (i in 1:5){ # for every possible combination
y.empty[50,1] <- runif(1)
y.empty[60,2] <- runif(1)
# calculates the difference between the total pressures from item 1 and item 2
r.1 = sum(Fmat%*%as.vector(y.empty[,1]))
r.2 = sum(Fmat%*%as.vector(y.empty[,2]))
r.dif = r.1-r.2 # negative means alternative 1 is better
alt.comb[j,i] <- r.dif
y.empty = matrix(data = 0,ncol = 2,nrow = 9600)
}
}
return(alt.comb)
}
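One observation that may help, offered as a sketch rather than a tested replacement for the function above: sum(Fmat %*% y) equals sum(colSums(Fmat) * y) for any vector y, so the column sums can be computed once outside both loops instead of repeating a full matrix-vector product on every iteration. The small matrix size and the name col_totals below are assumptions for illustration.

```r
set.seed(1)
n <- 200                                  # small stand-in for the 9600 in the question
Fmat <- matrix(runif(n * n), nrow = n, ncol = n)

# computed once, before any looping
col_totals <- colSums(Fmat)

# a demand vector with a single nonzero entry, as in the simplified example
y <- numeric(n)
y[50] <- runif(1)

slow <- sum(Fmat %*% y)       # full matrix-vector product, then sum
fast <- sum(col_totals * y)   # same value from the precomputed column sums
```

With a single nonzero entry in y, the second form reduces further to col_totals[50] * y[50].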

Conditional change to data frame column(s) based on values in other columns

Within the simulated data set
n = 50
set.seed(378)
df <- data.frame(
age = sample(c(20:90), n, rep = T),
sex = sample(c("m", "f"), n, rep = T, prob = c(0.55, 0.45)),
smoker = sample(c("never", "former", "active"), n, rep = T, prob = c(0.4, 0.45, 0.15)),
py = abs(rnorm(n, 25, 10)),
yrsquit = abs (rnorm (n, 10,2)),
outcome = as.factor(sample(c(0, 1), n, rep = T, prob = c(0.8, 0.2)))
)
I need to introduce some imbalance between the outcome groups (1=disease, 0=no disease). For example, subjects with the disease are older and more likely to be male. I tried
df1 <- within(df, sapply(length(outcome), function(x) {
if (outcome[x] == 1) {
age[x] <- age[x] + 15
sex[x] <- sample(c("m","f"), prob=c(0.8,0.2))
}
}))
but there is no difference as shown by
tapply(df$sex, df$outcome, length)
tapply(df1$sex, df$outcome, length)
tapply(df$age, df$outcome, mean)
tapply(df1$age, df$outcome, mean)
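Using sapply inside within does not work as you expect. within() evaluates its expression in a temporary environment built from the data frame and keeps the variables that were assigned there. In your code the assignments to age[x] and sex[x] happen inside the anonymous function, so they only change local copies that are discarded when the function returns. Also, sapply(length(outcome), ...) calls the function once with x = 50 rather than once per row. Hence, within does not modify the data frame.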
Here is an easier way to modify the data frame without a loop or sapply:
idx <- df$outcome == "1"
df1 <- within(df, {age[idx] <- age[idx] + 15;
sex[idx] <- sample(c("m", "f"), sum(idx),
replace = TRUE, prob = c(0.8, 0.2))})
Now, the data frames are different:
> tapply(df$age, df$outcome, mean)
0 1
60.46341 57.55556
> tapply(df1$age, df$outcome, mean)
0 1
60.46341 72.55556
> tapply(df$sex, df$outcome, summary)
$`0`
f m
24 17
$`1`
f m
2 7
> tapply(df1$sex, df$outcome, summary)
$`0`
f m
24 17
$`1`
f m
1 8
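The reason the original sapply attempt changed nothing can be reproduced without a data frame. In the sketch below (the vectors x, y and keep are made up for illustration), an assignment inside a function only modifies a local copy, while a logical index modifies the vector directly.

```r
x <- c(1, 2, 3)

# The assignment happens in the function's own scope, so the outer x is untouched
invisible(sapply(seq_along(x), function(i) x[i] <- x[i] + 10))
unchanged <- x          # still 1 2 3

# Vectorised update with a logical index, as in the answer above
keep <- c(TRUE, FALSE, TRUE)
y <- x
y[keep] <- y[keep] + 10 # 11 2 13
```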

How do I concatenate one word with all elements of a list in R barplot title?

This is the code I'm currently running:
n <- 7
N <- 52
r <- 13
reps <- 1000000
deck <- rep(c('h','d','c','s'), each = r)
diamonds <- rep(NA, length.out = reps)
pos <- sample(x = 1:52, size = 7, replace = FALSE)
for(i in 1:reps) {
hand <- sample(x = deck, replace = FALSE)[pos]
diamonds[i] <- sum(ifelse(hand == 'd', 1, 0))
}
barplot(table(diamonds), col = 'red', xlab = '# of diamonds',
ylab = paste('frequency out of',reps,'trials'),
main = paste('Positions:',pos[1],pos[2],pos[3],pos[4],
pos[5],pos[6],pos[7]))
What I'd really like is to be able to give a title to the barplot with something like the following
barplot(..., main = paste('Positions:',pos))
and have the title say "Positions: p1 p2 p3 p4 p5 p6 p7", where p1,p2,...,p7 are the elements of pos.
For anyone that's interested, this code randomly chooses 7 positions from 52 and then counts the number of diamonds ('d') within those 7 positions after each shuffle of the deck for 1000000 shuffles. Then the empirical distribution of the number of diamonds within those 7 cards is plotted.
Use collapse in paste to collapse the multiple elements of a vector containing the base text and pos:
paste(c('Positions:', pos), collapse=" ")
Otherwise, when you paste "Positions:" to pos you get the former recycled to the length of pos.
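The difference between the two calls is easy to see with a short vector. The three positions below are made up for illustration.

```r
pos <- c(3, 17, 25)

# "Positions:" is recycled, giving one string per element of pos
recycled <- paste('Positions:', pos)
# "Positions: 3"  "Positions: 17" "Positions: 25"

# combining first and then collapsing gives a single title string
title <- paste(c('Positions:', pos), collapse = " ")
# "Positions: 3 17 25"
```

The single string in title can be passed directly as the main argument of barplot.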
