Increment every nth range

I am writing R code where there's a vector x containing the values 1 to 100, and I want to create another vector y that subsets a range of values at every nth interval. I'm sure I can use rep() and seq(), but I can't figure out the code to get what I need. Here's what the output should look like:
x <- 1:100
y <- 1 2 3 11 12 13 21 22 23 31 32 33 41 42 43 51 52 53 61 62 63 71 72 73 81 82 83 91 92 93
So if I had a vector x <- 1001:1100, x[y] should return:
1001 1002 1003 1011 1012 1013 1021 1022 1023 1031 1032 1033 1041 1042 1043...etc
Any ideas?

You could use grepl for that:
x <- 1001:1100
y <- grepl("[1-3]$", x)
x[y]
# [1] 1001 1002 1003 1011 1012 1013 1021 1022 1023 1031 1032 1033 1041 1042 1043 1051 1052
#[18] 1053 1061 1062 1063 1071 1072 1073 1081 1082 1083 1091 1092 1093
It simply checks for each element of x whether the last digit is in the range of 1, 2 or 3 and if so, it returns TRUE, otherwise FALSE. This logical index is then used to subset x.
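As a side note (my own sketch, not part of the original answer): the same last-digit test can be done numerically with the modulo operator, which avoids the implicit character conversion that grepl performs on a numeric vector.

```r
# Numeric alternative to the regex: take the value modulo 10 to get the
# last digit, and keep elements whose last digit is 1, 2 or 3.
x <- 1001:1100
y <- (x %% 10) %in% 1:3   # logical index, TRUE where last digit is 1, 2 or 3
x[y]
# 1001 1002 1003 1011 1012 1013 ... 1091 1092 1093
```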
In case your objective is not to subset elements ending in 1, 2 or 3 but instead to always subset 3 elements, then leave out 7, then subset 3 again, and so on, you could do:
x <- 1001:1100
y <- rep(c(TRUE, FALSE), c(3, 7))
x[y]
# [1] 1001 1002 1003 1011 1012 1013 1021 1022 1023 1031 1032 1033 1041 1042 1043 1051 1052
#[18] 1053 1061 1062 1063 1071 1072 1073 1081 1082 1083 1091 1092 1093
In this case, the vector y, which is again logical, is recycled. Note that length(x) should be divisible by length(y) for this to work properly.
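To generalize the recycling trick (a sketch of mine, with a hypothetical helper name, not code from the answer): keep the first k elements of every block of n.

```r
# Hypothetical helper: keep the first k elements of every block of n.
# Assumes length(x) is a multiple of n, per the divisibility note above.
take_k_of_n <- function(x, k, n) x[rep(c(TRUE, FALSE), c(k, n - k))]

take_k_of_n(1:100, 3, 10)      # 1 2 3 11 12 13 ... 91 92 93
take_k_of_n(1001:1100, 3, 10)  # 1001 1002 1003 1011 1012 1013 ...
```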

For fun, with outer:
x <- 1001:1100
y <- as.vector(outer(1:3, seq(0, length(x)-10, 10), "+"))
x[y]
# [1] 1001 1002 1003 1011 1012 1013 1021 1022 1023 1031 1032 1033 1041 1042 1043
# [16] 1051 1052 1053 1061 1062 1063 1071 1072 1073 1081 1082 1083 1091 1092 1093

Perhaps this may help you:
x <- 1:100
y <- integer(0)
for (i in seq(1, length(x), 10)) {
  y <- append(y, c(x[i], x[i + 1], x[i + 2]))
}

Hm. This started out as fun, but now I happen to like it since it is constructed in basically the same way the author of the question put it:
> do.call("c",lapply(0:5,function(X) 1:3+10*X))
[1] 1 2 3 11 12 13 21 22 23 31 32 33 41 42 43 51 52 53
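Since the question explicitly mentions rep() and seq(), the same index vector can also be built directly from those two functions (my sketch, along the same lines as the answers above):

```r
# Block offsets 0, 10, ..., 90, each repeated three times,
# plus the within-block positions 1:3.
y <- rep(1:3, times = 10) + rep(seq(0, 90, by = 10), each = 3)
y
# 1 2 3 11 12 13 21 22 23 ... 91 92 93
```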

Related

Gompertz-Makeham parameter estimation

I would like to estimate the parameters of the Gompertz-Makeham distribution, but I haven't got a result.
I would like a method in R, like this Weibull parameter estimation code:
weibull_loglik <- function(parm) {
  gamma <- parm[1]
  lambda <- parm[2]
  loglik <- sum(dweibull(vec, shape = gamma, scale = lambda, log = TRUE))
  return(-loglik)
}
weibull <- nlm(weibull_loglik, c(1, 1), hessian = TRUE, iterlim = 100)
weibull$estimate
c <- weibull$estimate[1]; b <- weibull$estimate[2]
My data:
[1] 872 52 31 26 22 17 11 17 17 8 20 12 25 14 17
[16] 20 17 23 32 37 28 24 43 40 34 29 26 32 34 51
[31] 50 67 84 70 71 137 123 137 172 189 212 251 248 272 314
[46] 374 345 411 494 461 505 506 565 590 535 639 710 733 795 786
[61] 894 963 1019 1149 1185 1356 1354 1460 1622 1783 1843 2049 2262 2316 2591
[76] 2730 2972 3187 3432 3438 3959 3140 3612 3820 3478 4054 3587 3433 3150 2881
[91] 2639 2250 1850 1546 1236 966 729 532 375 256 168 107 65 39 22
[106] 12 6 3 2 1 1
summary(vec)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.0 32.0 314.0 900.9 1355.0 4054.0
It would be nice to have a reproducible example, but something like:
library(bbmle)
library(eha)
set.seed(101)
vec <- rmakeham(1000, shape = c(2,3), scale = 2)
dmwrap <- function(x, shape1, shape2, scale, log) {
  res <- try(dmakeham(x, c(shape1, shape2), scale, log = log), silent = TRUE)
  if (inherits(res, "try-error")) return(NA)
  res
}
m1 <- mle2(y ~ dmwrap(shape1, shape2, scale),
           start = list(shape1 = 1, shape2 = 1, scale = 1),
           data = data.frame(y = vec),
           method = "Nelder-Mead"
)
- Define a wrapper that (1) takes the shape parameters as separate values, and (2) returns NA rather than throwing an error when e.g. the parameters are negative.
- Use Nelder-Mead rather than the default BFGS for robustness.
- The fitdistrplus package might help too.
- If you're going to do a lot of this, it may help to fit the parameters on the log scale (i.e. use parameters logshape1 etc., and use exp(logshape1) etc. in the fitting formula).
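To illustrate the log-scale idea with a self-contained sketch (the hand-rolled density below is mine, built from the standard Gompertz-Makeham hazard h(x) = a + b*exp(c*x); it is not bbmle or eha code):

```r
# Gompertz-Makeham density from the hazard h(x) = a + b*exp(c*x) and the
# cumulative hazard H(x) = a*x + (b/c)*(exp(c*x) - 1); f(x) = h(x)*exp(-H(x)).
dgm <- function(x, a, b, c, log = FALSE) {
  out <- log(a + b * exp(c * x)) - (a * x + (b / c) * (exp(c * x) - 1))
  if (log) out else exp(out)
}

# Negative log-likelihood with the parameters on the log scale, so the
# optimizer can never propose negative a, b or c:
gm_nll <- function(logp, x) {
  -sum(dgm(x, exp(logp[1]), exp(logp[2]), exp(logp[3]), log = TRUE))
}
# optim(c(0, 0, 0), gm_nll, x = vec)   # then exp() the estimates
```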
I had to work a little harder to fit your data; I scaled the variable by 1000 (and found that I could only compute the log-likelihood; the likelihood gave an error that I didn't bother trying to track down). Unfortunately, it doesn't look like a great fit (too many small values).
x <- scan(text = "872 52 31 26 22 17 11 17 17 8 20 12 25 14 17
20 17 23 32 37 28 24 43 40 34 29 26 32 34 51
50 67 84 70 71 137 123 137 172 189 212 251 248 272 314
374 345 411 494 461 505 506 565 590 535 639 710 733 795 786
894 963 1019 1149 1185 1356 1354 1460 1622 1783 1843 2049 2262 2316 2591
2730 2972 3187 3432 3438 3959 3140 3612 3820 3478 4054 3587 3433 3150 2881
2639 2250 1850 1546 1236 966 729 532 375 256 168 107 65 39 22
12 6 3 2 1 1")
m1 <- mle2(y ~ dmwrap(shape1, shape2, scale),
           start = list(shape1 = 1, shape2 = 1, scale = 10000),
           data = data.frame(y = x/1000),
           method = "Nelder-Mead"
)
cc <- as.list(coef(m1))
png("gm.png")
hist(x, breaks = 25, freq = FALSE)
with(cc,
     curve(exp(dmwrap(x/1000, shape1, shape2, scale, log = TRUE))/1000, add = TRUE)
)
dev.off()

Sum columns of data frame when condition is met

I would like to make a new column that is the sum of only the positive values in the existing columns. So in column TotalImportSE3 I would like the sum of the positive values of columns SE3-NO1, SE3-DK1, SE3-FI and SE3-SE4; if none of them is positive, the sum should be 0.
   SE3 - NO1  SE3 - DK1  SE3 - FI  SE3 - SE2  SE3 - SE4  TotalImportSE3
47      1114        666      -225       2716      -3784              NA
48      1009        671      -151       1491      -2943              NA
54      1057        711     -1062       1658      -2201              NA
55      1077        711     -1213       3492      -3015              NA
94       772        414      -501       2904      -2262              NA
95      -786       -314      -407      -2368      -2005              NA
Output should look like:
   SE3 - NO1  SE3 - DK1  SE3 - FI  SE3 - SE2  SE3 - SE4  TotalImportSE3
47      1114        666      -225       2716      -3784            4496
48      1009        671      -151       1491      -2943            3171
54      1057        711     -1062       1658      -2201            3426
55      1077        711     -1213       3492      -3015            5280
94       772        414      -501       2904      -2262            4090
95      -786       -314      -407      -2368      -2005               0
So far my effort:
df1 <- df$`SE3 - NO1`[which(df$`SE3 - NO1` > 0)]
df2 <- df$`SE3 - DK1`[which(df$`SE3 - DK1` > 0)]
But this creates two vectors of different lengths, so it messes up completely.
You can try this:
df$TotalImportSE3 <- rowSums(df * (df > 0))
where df is the name of your data frame.
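A tiny reproducible sketch of why this works (toy column names, not the asker's): df > 0 is a logical matrix that, multiplied element-wise, zeroes out the negative entries, so rowSums then adds only the positive values. pmax(df, 0) would achieve the same.

```r
# Toy data frame: two rows, three numeric columns (names invented).
df <- data.frame(a = c(1114, -786), b = c(666, -314), c = c(-225, -407))

# (df > 0) is TRUE/FALSE, i.e. 1/0 under multiplication, so negatives vanish.
df$total <- rowSums(df[, 1:3] * (df[, 1:3] > 0))
df$total
# [1] 1780    0
```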

Find the highest value in an interval defined by another file

I have these two datasets:
Before, which contains 5 columns (chromosome, start, end, line number, score):
chrI 861 870 87 5
chrI 871 880 88 11
chrI 881 890 89 11
chrI 891 900 90 19
chrI 901 910 91 19
chrI 911 920 92 20
chrI 921 930 93 20
chrI 931 940 94 20
chrI 941 950 95 19
chrI 951 960 96 19
chrI 961 970 97 19
chrI 971 980 98 19
chrI 981 990 99 25
chrI 991 1000 100 20
chrI 1001 1010 101 20
chrI 1011 1020 102 20
chrI 1021 1030 103 20
chrI 1031 1040 104 15
chrI 1041 1050 105 14
chrI 1051 1060 106 14
chrI 1061 1070 107 13
chrI 1071 1080 108 13
chrI 1081 1090 109 13
chrI 1091 1100 110 7
chrI 1101 1110 111 7
Peaks, which contains 4 columns (chromosome, start, end, value):
"chrI" 880 1091 383
"chrI" 1350 1601 302
"chrI" 1680 1921 241
"chrI" 2220 2561 322
"chrI" 2750 2761 18
"chrI" 3100 3481 420
"chrI" 3660 4211 793
"chrI" 4480 4491 20
"chrI" 4710 4871 195
"chrI" 5010 5261 238
For each line of Peaks I would like to extract the corresponding lines in Before (e.g. all the lines between 880 and 1091 for the first line), find the highest score value and write it to a new file.
Output
chrI 981 990 99 25
To this end, I've written this function:
summit <- function(x, y, output) {
  y <- Before
  chrom <- x[1]
  start <- x[2]
  end <- x[3]
  startLine <- y[which((y$V1 == chrom) & (y$V2 == start)), ]
  endLine <- y[which((y$V1 == chrom) & (y$V3 == end)), ]
  Subset <- y[which((y$V2 >= startLine$V2) & (y$V3 <= endLine$V2))]
  maximum <- Subset[which(Subset$V4 == max(Subset$V4))]
  output <- print(maximum)
}
apply(Peaks, 1, summit, output = 'peaks_list.bed')
I don't have an error message, but it ran during the entire night without giving me results, so I guess something is wrong with my code but I don't know what.
I also tried this:
Peaks_Range <- GRanges(seqnames = Peaks$V1, ranges = IRanges(start = Peaks$V2, end = Peaks$V3))
Before_Range <- GRanges(seqnames = Before$V1, ranges = IRanges(start = Before$V2, end = Before$V3), score = Before$V5)
Merged <- mergeByOverlaps(Peaks_Range, Before_Range)
Merged <- as.data.frame(Merged)
for (i in 1:nrow(Peaks)) {
  start <- Peaks[i, 2]
  end <- Peaks[i, 3]
  Subset <- subset(Merged, Merged$Peaks_Range.start == start)
  maximum <- as.numeric(max(Subset$score))
  summit <- Subset[which(Subset$score == maximum), ]
  write.table(summit, 'peaks_list.bed', sep = "\t", append = TRUE, col.names = FALSE, row.names = FALSE, quote = FALSE)
}
It works (I think), but it is very, very slow, so I'm searching for an alternative way to do it.
Does anyone have any idea?
You can use cut to help you get the index.
setwd("/home/wang/Downloads")
before <- read.table("before.txt", header = FALSE, stringsAsFactors = FALSE)
colnames(before) <- c("chromosome", "start", "end", "line number", "score")
peaks <- read.table("peaks.txt", header = FALSE, stringsAsFactors = FALSE, quote = "\"")
colnames(peaks) <- c("chromosome", "start", "end", "value")
summit <- function(peaks_vec) {
  chromosome = peaks_vec[1]
  start = as.numeric(peaks_vec[2])
  end = as.numeric(peaks_vec[3])
  filter_before = subset(before, chromosome == chromosome)
  up_index = cut(end, filter_before[, 2], labels = FALSE) + 1
  down_index = cut(start, filter_before[, 2], labels = FALSE) + 1
  if (!is.na(down_index) & !is.na(up_index)) {
    new_filter_before = filter_before[down_index:up_index, ]
    max_index = which.max(new_filter_before[, 5])
    return(unlist(new_filter_before[max_index, ]))
  } else {
    return(rep(NA, 5)) # or whatever placeholder you want
  }
}
result <- t(apply(as.matrix(peaks), 1, summit))
remove_na_result <- as.data.frame(na.omit(result))
colnames(remove_na_result) <- colnames(before)
And the final output is :
chromosome start end line number score
1 chrI 981 990 99 25
I hope my answer is helpful.
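For completeness, here is a plain base-R sketch of the same task (my own, with the answer's column names assumed): for each peak, keep the rows of before whose interval lies inside the peak, then take the row with the maximum score.

```r
# Toy data mirroring the question (only the first peak is shown).
before <- data.frame(chromosome = "chrI",
                     start = seq(861, 1101, 10),
                     end   = seq(870, 1110, 10),
                     line  = 87:111,
                     score = c(5, 11, 11, 19, 19, 20, 20, 20, 19, 19, 19,
                               19, 25, 20, 20, 20, 20, 15, 14, 14, 13, 13,
                               13, 7, 7))
peaks <- data.frame(chromosome = "chrI", start = 880, end = 1091, value = 383)

# One summit row per peak: subset rows inside the peak, keep the max score.
summit_rows <- do.call(rbind, lapply(seq_len(nrow(peaks)), function(i) {
  sub <- before[before$chromosome == peaks$chromosome[i] &
                before$start >= peaks$start[i] &
                before$end   <= peaks$end[i], ]
  if (nrow(sub) == 0) NULL else sub[which.max(sub$score), ]
}))
summit_rows
# chrI 981 990 99 25  (the summit of the first peak)
```

This is still a loop over peaks, but the inner subsetting is vectorized, so it avoids the row-by-row write.table of the original attempt.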

Divide paired matching columns

I have a data.frame df with matching columns that are also paired. The matching pairs are defined in the list patient (shown below). I would like to divide the matching columns by each other. Any suggestions on how to do this?
I tried this, but it does not take the pairing from patient into account.
m1 <- m1[, sort(colnames(df))]
m1_g <- m1[, grep("^n", colnames(df))]
m1_r <- m1[, grep("^t", colnames(df))]
m1_new <- m1_g / m1_r
m1_new
head(df)
na-008 ta-008 nc012 tb012 na020 na-018 ta-018 na020 tc020 tc093 nc093
hsa-let-7b-5p_TGAGGTAGTAGGTTGTGT 56 311 137 242 23 96 113 106 41 114
hsa-let-7b-5p_TGAGGTAGTAGGTTGTGTGG 208 656 350 713 49 476 183 246 157 306
hsa-let-7b-5p_TGAGGTAGTAGGTTGTGTGGT 631 1978 1531 2470 216 1906 732 850 665 909
hsa-let-7b-5p_TGAGGTAGTAGGTTGTGTGGTT 2760 8159 6067 9367 622 4228 2931 3031 2895 2974
hsa-let-7b-5p_TGAGGTAGTAGGTTGTGTGGTTT 1698 4105 3737 3729 219 1510 1697 1643 1527 1536
> head(patient)
$`008`
[1] "na-008" "ta-008"
$`012`
[1] "nc012" "tb012"
$`018`
[1] "na-018" "ta-018"
$`020`
[1] "na020" "tc020"
$`045`
[1] "nb045" "tc045"
$`080`
[1] "nb-080" "ta-080"
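The question has no posted answer here, but one hedged sketch, assuming patient is the named list shown above and each element holds one n-column and one t-column name in that order, would be to iterate over the pairs rather than over a sorted column order:

```r
# Toy data for two patients (column names invented to match the pattern).
df <- data.frame(`na-008` = c(56, 208), `ta-008` = c(311, 656),
                 nc012 = c(137, 350), tb012 = c(242, 713),
                 check.names = FALSE)
patient <- list(`008` = c("na-008", "ta-008"),
                `012` = c("nc012", "tb012"))

# Divide the n-column by the t-column for every pair; sapply collects
# the results into a matrix with one column per patient.
ratios <- sapply(patient, function(p) df[[p[1]]] / df[[p[2]]])
colnames(ratios)  # "008" "012"
```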

Sample Function R does not produce uniformly distributed sample

I am creating a survey. There are 31 possible questions; I would like each respondent to answer a subset of 3, administered in a random order. Participants should not answer the same question twice.
I have created a table matrix with a participant index, and a column for the question indices for the 1st, 2nd and 3rd questions.
Using the code below, index 31 is under-represented in my sample.
I think I am using the sample function incorrectly. I was hoping someone could please help me?
SgPassCode <- data.frame(PassCode = rep(0, 10000), QIndex1 = rep(0, 10000),
                         QIndex2 = rep(0, 10000), QIndex3 = rep(0, 10000))
set.seed(123)
for (n in 1:10000) {
  temp <- sample(31, 3, FALSE)
  SgPassCode[n, 1] <- n
  SgPassCode[n, -1] <- temp
}
d <- c(SgPassCode[, 2], SgPassCode[, 3], SgPassCode[, 4])
hist(d)
The issue is with hist and the way it picks its bins, not with sample. Proof is the output of table:
table(d)
# 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
# 1003 967 938 958 989 969 988 956 983 990 921 1001 982 1016 1013 959
# 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
# 907 918 918 991 931 945 998 1017 1029 980 959 886 947 987 954
If you want hist to "work", hist(d, breaks = 0:31) (and certainly a lot of other things) will work.
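A quick way to convince yourself (my sketch of the same diagnosis): draw the indices without the bookkeeping data frame, tabulate, and compare hist's default bins with explicit integer breaks.

```r
set.seed(123)
# 10000 respondents x 3 questions each, sampled without replacement per row:
d <- as.vector(replicate(10000, sample(31, 3)))
tab <- table(d)
length(tab)   # 31: every question index appears
range(tab)    # all counts close to 30000/31, i.e. roughly 968
# hist(d) lumps 30 and 31 into one default bin; explicit breaks avoid that:
# hist(d, breaks = 0:31)
```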
