Error in the output file of a for loops in r - r

I'm trying to perform a resample of a list using the for loops in R for generating a data frame that records the output of each trial.
I get the for loops to work without error, but I am sure I am making a mistake somewhere as I should not be getting the result for the jth entry that I get as possible outcomes.
Here's how I am generating my list:
set1=rep(0,237) # repeat 0's 237 times
set2=rep(1,33) # repeats 1s 33 times
aa=c(set1,set2) # put the two lists together
table(aa) # just a test count to make sure I have it set up right
Now I want to take a random sample set of size j out of aa and record how many 0's and 1's I get each time I perform this task (let's say n number of trials).
Here's how I have set it up:
n=1000
j=27
output=matrix(0,nrow=2,ncol=n)
for (i in 1:n){
trial<-sample(aa,j,replace=F)
counts=table(trial)
output[,i]=counts
}
Checking the output,
table(output[1,])
# 17 18 19 20 21 22 23 24 25 26 27
1 1 9 17 46 135 214 237 205 111 24
table(output[2,])
# 1 2 3 4 5 6 7 8 9 10 27
111 205 237 214 135 46 17 9 1 1 24
I do not think I am getting the right answer from the distribution for the jth value (in this case 27) for either of the expected number of 0's or 1's (should be close to 0 as oppose to the high number it returns).
Any suggestions as to where I am going wrong would be greatly appreciated.

If you have only 0s in trial length(counts)==1 and the value gets recycled when you assign to output. Try this:
for (i in 1:n){
trial<-sample(aa,j,replace=F)
trial <- factor(trial, levels=0:1)
counts=table(trial)
output[,i]=counts
}
Of course, you could more efficiently use rhyper:
table(rhyper(1000, table(aa)[1], table(aa)[2], 27))

Related

Hierarchical clustering with specific number of data in each cluster

I'm clustering a set of words using "Hierarchical Clustering". I want each cluster to contain a certain number of words, for example 2 words, or 3 words.
I'm trying to modify existing code for this clustering.
I just put the value of max(d) to Inf as well
Lm[min(d),] <- sl
Lm[,min(d)] <- sl
if (length(cluster)>2){#if it's already clustered with more than 2 points
#then dont't cluster them again by setting values to Inf
Lm[min(d), min(d)] <- Inf
Lm[max(d), max(d)] <- Inf
Lm[max(d),] <- Inf
Lm[,max(d)] <- Inf
Lm[min(d),] <- Inf
Lm[,min(d)] <- Inf
}
However, it doesn't give me the expected results, I was wondering if it's correct approach? How can I do this type of clustering with constraint in r ?
example of results that I got
row V1 V2
166 -194 -38
167 166 -1
……..
240 239 239
241 240 240
242 241 241
243 242 242
244 243 243
This will be tough to optimize, or it can produce arbitrarily bad results. Because your size constraint goes against the principles of clustering.
Consider the one-dimensional data set -100, -1, 1, 100. Assuming you want to limit the cluster size to 2 elements. Hierarchical clustering will first merge -1 and +1 because they are closest. Now they have reached maximum size, so the only option is now to cluster -100 and +100, the worst possible result - this cluster is as big as the entire data set.
Just to give you an example of what I meant with partitional clustering:
library(cluster)
data("ruspini")
desired_cluster_size <- 3L
corresponding_num_clusters <- round(nrow(ruspini) / desired_cluster_size)
km <- kmeans(ruspini, corresponding_num_clusters)
table(km$cluster)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
3 3 2 4 2 2 2 1 3 3 2 3 2 3 3 2 6 3 2 1 3 6 2 8 4
This definitely can't guarantee how many observations you'll have in each group,
and it's not deterministic,
but it at least gives you an approximation.
In the tabulated results you can see that many clusters (1 through 25) ended up with 2 or 3 elements.

Reassigning one column according to another using data.table

I am interested in replacing the value of -11 in one column "contra_end" to the corresponding values contained in "current_age", another column. -11 is a variable indicating current activity, and I want to replace that value with the actual age of each individual stored in "current_age". Age has ~500,000 values and only ~4,000 values from the first column have the value -11. When I run the following code to assign my age column values to the -11 values in "contra_end" I get the following error. Can I make this work without creating a new age variable?
biobank[contra_end == -11, contra_end := biobank[,"current_age", with=FALSE]]
Error in `[.data.table`(biobank, contra_end == -11, `:=`(contra_end, biobank[, :
Supplied 500000 items to be assigned to 4919 items of column 'contra_end'. The RHS length must either be 1 (single values are ok) or match the LHS length exactly. If you wish to 'recycle' the RHS please use rep() explicitly to make this intent clear to readers of your code.
I used a short dataset which I made using this code
biobank <- data.frame(contra_end = c(0,13,15,109,-11,23,45),
current_age = c(34,35,36,46,43,56,23))
which gives
contra_end current_age
1 0 34
2 13 35
3 15 36
4 109 46
5 -11 43
6 23 56
7 45 23
Using the tidyverse::mutate
biobank_2 <- biobank %>%
mutate(contra_end = ifelse(contra_end == -11, current_age, contra_end))
Or using base
biobank$contra_end[biobank$contra_end==-11] <- biobank$current_age[biobank$contra_end==-11]
Both options give:
contra_end current_age
1 0 34
2 13 35
3 15 36
4 109 46
5 43 43
6 23 56
7 45 23
EDIT: I didn't even notice that you were looking for a solution in data.table until after I posted. It doesn't sound like you have too many records for either of the solutions I posted to not be efficient enough, though.

Speedup search of Elements

I got two data.frames m (23 columns 135.973 rows) with the two important columns
head(m[,2])
# [1] "chr1" "chr1" "chr1" "chr1" "chr1" "chr1"
head(m[,7])
# [1] 3661216 3661217 3661223 3661224 3661564 3661567
and search (4 columns 1.019.423 rows) with three important columns
head(search[,1])
# [1] "chr1" "chr1" "chr1" "chr1" "chr1" "chr1"
head(search[,3])
# [1] 3000009 3003160 3003187 3007262 3028947 3050944
head(search[,4])
# [1] 3000031 3003182 3003209 3007287 3028970 3050995
For each row in m I like to get the information if the m[XX,7] position is between any position of search[,3] and search[,4]. So search[,3] can be considered as "start" and search[,4] as "end". In addition search[,1] and m[,2] have to be identical.
Example:
m at row 215
"chr1" 10.984.038
hits in search at line 2898
"chr1" 10.984.024 10.984.046
In general I'm not interested which line or how many lines of search could be found. I just want the information for any line of m is there a matching line in search yes or no.
I'm ending up in this function:
f_4<-function(x,y,z){
for (out in 1:length(x[,1])) {
z[out]<-length(which((y[,1]==x[out,2]) &(x[out,7]>=y[,3]) &(x[out,7]<=y[,4])))
}
return(z)
}
found4<-vector(length=length(m[,1]), mode="numeric")
found4<-f_4(m,search,found4)
It took 3 hours to run this code.
I have already tried some speedup approaches, however I didn't manage to get any of this running proper or faster.
I even tried some lappy/apply approaches -which worked but aren't faster-. However they failed when trying to speed up using parLapply/parRapply.
Anybody got a quite faster approach and may can give some advise?
EDIT 2015/09/18
Found another way to speed up, using foreach %dopar%.
f5<-function(x,y,z){
foreach(out=1:length(x[,1]), .combine="c") %dopar% {
takt<-1000
z=length(which((y[,1]==x[out,2]) &(x[out,7]>=y[,3]) &(x[out,7]<=y[,4]) ))
}
return(z)
}
found5<-vector(length=length(m[,1]), mode="numeric")
found5<-f5(m,search,found5)
Only need 45min. However I'm always getting 0 only. Thing I need to read some more of the foreach %dopar% tutorials.
You can try merging with subsequent logical subsetting. First let's create some mock data:
set.seed(123) # used for reproducibility
m <-as.data.frame(matrix(sample(50,7000, replace=T), ncol=7, nrow=1000))
search <- as.data.frame(matrix(sample(50,1200, replace=T), ncol=4, nrow=300))
Since we want to compare different rows of the two sets, we can use the criterion that m[,2] should be equal to search[,1]. For convenience we can name these columns "ID" in both sets:
m <- cbind(m,seq_along(1:nrow(m)))
search <- cbind(search,seq_along(1:nrow(search)))
colnames(m) <- c("a","ID","c","d","e","f","val","rownum.m")
colnames(search) <- c("ID","nothing","start","end", "rownum.s")
We have added a column to m named 'rownum.m' and a similar column to search which in the end will help identifying the resulting entries in the initial dataset.
Now we can merge the data sets, such that the ID is the same:
m2 <- merge(m,search)
In a final step, we can perform a logical subset of the merged data set and assign the output to a new data frame m3:
m3 <- m2[(m2[,"val"] >= m2[,"start"]) & (m2[,"val"] <= m2[,"end"]),]
#> head(m3)
# ID a c d e f val rownum.m nothing start end rownum.s
#5 1 14 36 36 31 30 25 846 10 20 36 291
#13 1 34 49 24 8 44 21 526 10 20 36 291
#17 1 19 32 29 44 24 35 522 6 33 48 265
#20 1 19 32 29 44 24 35 522 32 31 50 51
#21 1 19 32 29 44 24 35 522 10 20 36 291
#29 1 6 50 10 13 43 22 15 10 20 36 291
If we are only interested in a TRUE/FALSE statement whether a specific row of m matches the criterions, we can define a vector match_s:
match_s <- m$rownum.m %in% m3$rownum.m
which can be stored as an additional column in the original data set m:
m <- cbind(m,match_s)
Finally, we can remove the auxiliary column 'rownum.m' from the data set m which is no longer needed, with m <- m[,-8].
The result is:
> head(m)
# a ID c d e f val match_s
#1 15 14 8 11 16 13 23 FALSE
#2 40 30 8 48 42 50 20 FALSE
#3 21 9 8 19 30 36 19 TRUE
#4 45 43 26 32 41 33 27 FALSE
#5 48 43 25 10 15 13 4 FALSE
#6 3 24 31 33 8 5 36 FALSE
If you're trying to find SNPs (say) inside a set of genomic regions, don't use R. Use BEDOPS.
Convert your SNP or single-base positions to a three-column BED file. In R, make a three-column data table with m[,2], m[,7] and m[,7] + 1, which represent the chromosome, start and stop position of the SNP. Use write.table() to write out this data table to a tab-delimited text file.
Do the same with your genomic regions: Write search[,1], search[,3], and search[,4] to a three-column data table representing the chromosome, start and stop position of the region. Use write.table() to write this out to a tab-delimited text file.
Use sort-bed to sort both BED files. This step might be optional, but it doesn't take long to do and it guarantees that the files are prepped for use with BEDOPS tools.
Finally, use bedmap on the two BED files to map SNPs to genomic regions. Mapping associates SNPs with regions. The bedmap tool can report which SNPs map to a region, or report the number of SNPs, or one or more of many other operations. The documentation for bedmap goes into more detail on the list of operations, but the provided example should get you started quickly.
If your data are in BED format, or can be quickly coerced into BED format, don't use R for genomic operations, as it is slow and memory-intensive. The BEDOPS toolkit introduced the use of sorting to make genomic operations fast, with low memory overhead.

Avoid using a loop to get sum of rows in R, where I want to start and stop the sum on different columns for each row

I am relatively new to R from Stata. I have a data frame that has 100+ columns and thousands of rows. Each row has a start value, stop value, and 100+ columns of numerical values. The goal is to get the sum of each row from the column that corresponds to the start value to the column that corresponds to the stop value. This is direct enough to do in a loop, that looks like this (data.frame is df, start is the start column, stop is the stop column):
for(i in 1:nrow(df)) {
df$out[i] <- rowSums(df[i,df$start[i]:df$stop[i]])
}
This works great, but it is taking 15 minutes or so. Does anyone have any suggestions on a faster way to do this?
You can do this using some algebra (if you have a sufficient amount of memory):
DF <- data.frame(start=3:7, end=4:8)
DF <- cbind(DF, matrix(1:50, nrow=5, ncol=10))
# start end 1 2 3 4 5 6 7 8 9 10
#1 3 4 1 6 11 16 21 26 31 36 41 46
#2 4 5 2 7 12 17 22 27 32 37 42 47
#3 5 6 3 8 13 18 23 28 33 38 43 48
#4 6 7 4 9 14 19 24 29 34 39 44 49
#5 7 8 5 10 15 20 25 30 35 40 45 50
take <- outer(seq_len(ncol(DF)-2)+2, DF$start-1, ">") &
outer(seq_len(ncol(DF)-2)+2, DF$end+1, "<")
diag(as.matrix(DF[,-(1:2)]) %*% take)
#[1] 7 19 31 43 55
If you are dealing with values of all the same types, you typically want to do things in matrices. Here is a solution in matrix form:
rows <- 10^3
cols <- 10^2
start <- sample(1:cols, rows, replace=T)
end <- pmin(cols, start + sample(1:(cols/2), rows, replace=T))
# first 2 cols of matrix are start and end, the rest are
# random data
mx <- matrix(c(start, end, runif(rows * cols)), nrow=rows)
# use `apply` to apply a function to each row, here the
# function sums each row excluding the first two values
# from the value in the start column to the value in the
# end column
apply(mx, 1, function(x) sum(x[-(1:2)][x[[1]]:x[[2]]]))
# df version
df <- as.data.frame(mx)
df$out <- apply(df, 1, function(x) sum(x[-(1:2)][x[[1]]:x[[2]]]))
You can convert your data.frame to a matrix with as.matrix. You can also run the apply directly on your data.frame as shown, which should still be reasonably fast. The real problem with your code is that your are modifying a data frame nrow times, and modifying data frames is very slow. By using apply you get around that by generating your answer (the $out column), which you can then cbind back to your data frame (and that means you modify your data frame just once).

How to reorder a column in a data frame to be the last column

I have a data frame where columns are constantly being added to it. I also have a total column that I would like to stay at the end. I think I must have skipped over some really basic command somewhere but cannot seem to find the answer anywhere. Anyway, here is some sample data:
x=1:10
y=21:30
z=data.frame(x,y)
z$total=z$x+z$y
z$w=11:20
z$total=z$x+z$y+z$w
When I type z I get this:
x y total w
1 1 21 33 11
2 2 22 36 12
3 3 23 39 13
4 4 24 42 14
5 5 25 45 15
6 6 26 48 16
7 7 27 51 17
8 8 28 54 18
9 9 29 57 19
10 10 30 60 20
Note how the total column comes before the w, and obviously any subsequent columns. Is there a way I can force it to be the last column? I am guessing that I would have to use ncol(z) somehow. Or maybe not.
You can reorder your columns as follows:
z <- z[,c('x','y','w','total')]
To do this programmatically, after you're done adding your columns, you can retrieve their names like so:
nms <- colnames(z)
Then you can grab the ones that aren't 'total' like so:
nms[nms!='total']
Combined with the above:
z <- z[, c(nms[nms!='total'],'total')]
You have a logic issue here. Whenever you add to a data.frame, it grows to the right.
Easiest fix: keep total a vector until you are done, and only then append it. It will then be the rightmost column.
(For critical applications, you would of course determine your width k beforehand, allocate k+1 columns and just index the last one for totals.)

Resources