global condition on vector - r

Hi
I would like to perform a comparison operation on my vector: it contains numerical values that I want to transform with 2^. However, if any value is greater than 65000 after it has been transformed, I would like no transformation to be applied to the entire vector.
Currently I'm trying this:
final <- ifelse(2^vec > 65000, vec, 2^vec)
It works, but for each value separately. So if a value is greater than 65000 after transformation this code returns me the initial value, but if it doesn't exceed 65000 it returns me the transformed value, and I end up with a mixed vector of transformed and non-transformed values.
Here is an example:
> vec
[1]  32.82 576.47  36.45  78.93   8.77  63.28 176.86   1.88 291.97  35.59
And the result after my code
> final
[1]  32.820000 576.470000  36.450000  78.930000 436.549065  63.280000 176.860000   3.680751 291.970000  35.590000
Here you can see that some values have been transformed and some not. In this kind of situation I would finally like final = vec. I tried a "break" instead of vec for the "yes" condition in the ifelse, but it doesn't work. Probably something like that could work, but I don't know what.
If someone has an idea ^^
Thanks

How's this?
log_if_bigger = function(vec, thresh){
  if(any(vec > thresh)){
    return(log2(vec))
  } else {
    return(vec)
  }
}
Usage:
# if any values bigger than 0, then log - here there are:
> log_if_bigger(c(1,2,3,4),0)
[1] 0.000000 1.000000 1.584963 2.000000
# if any values bigger than 9, then log - here there aren't:
> log_if_bigger(c(1,2,3,4),9)
[1] 1 2 3 4
Then you just want something like:
final = log_if_bigger(vec, 65000)
or possibly:
final = log_if_bigger(vec, log2(65000))
based on your condition, where you test 2^vec > 65000.
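Adapting that idea back to the original 2^ transform, a minimal sketch could look like this (the helper name pow2_unless_overflow is made up here):
pow2_unless_overflow <- function(vec, thresh = 65000){
  transformed <- 2^vec
  # all-or-nothing: if any transformed value would exceed the
  # threshold, return the vector completely untransformed
  if(any(transformed > thresh)) vec else transformed
}
final <- pow2_unless_overflow(vec)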

Related

Cannot fill my R for loop

So I am working in R with a matrix like the following:
diff_0
            SubPop0-1    SubPop1-1   SubPop2-1     SubPop3-1 SubPop4-1
SubPop0-1          NA           NA          NA            NA        NA
SubPop1-1 0.003403100           NA          NA            NA        NA
SubPop2-1 0.005481177 -0.002070277          NA            NA        NA
SubPop3-1 0.002216444  0.005946314 0.001770977            NA        NA
SubPop4-1 0.010344845  0.007151529 0.004237316 -0.0021275130        NA
... but bigger ;-).
This is a matrix of pairwise genetic differentiation between each SubPop from 0 to 4. I would like to obtain a mean differentiation value for each SubPop.
For instance, for SubPop-0, the mean would just correspond to the mean of the 4 values in column 1. However, for SubPop-2, this would be the mean of the 2 values in row 3 and the 2 values in column 3, since this is a half-matrix.
I wanted to write a for loop to compute each mean value for each SubPop, taking this into account. I tried the following:
Mean <- for (r in 1:nrow(diff_0)) {
  mean(apply(cbind(diff_0[r,], diff_0[,r]), 1, sum, na.rm=T))
}
First this isolates the row and column of index [r], whose values refer to the same SubPop r. 'sum' gathers these values and eliminates the NAs. Finally I get the mean value for my SubPop r. I was hoping my for loop would give me one value for each index r, i.e. for each SubPop.
However, even though mean(apply(cbind(diff_0[r,], diff_0[,r]), 1, sum, na.rm=T)), run alone with a fixed r between 1 and 5, does give me what I want, the for loop itself only returns an empty vector.
Something like for (r in 1:nrow(diff_0)) { print(diff_0[r,1]) } also works, so I do not understand what is going on.
This is a trivial question but I could not find an answer on the internet! Although I know I am probably missing the obvious :-)...
Thank you very much,
Cheers!
Okay, based on what you want to do (and if I understand everything correctly) there are several ways of doing this.
The one that comes to my mind now is just turning your lower triangular matrix into a full matrix (i.e. filling the upper triangle with the transpose of the lower triangle) and then taking row- or column-wise means.
My R is running right now on something else, so I can't check my code, but this should work:
diff = diff_0
diff[upper.tri(diff)] = t(diff)[upper.tri(diff)] # mirror the lower triangle into the upper
As I said, my R is running right now so I can't check the correctness of the last line - I might be confused with some transposes there, so I'd appreciate any feedback on whether it actually worked or not.
You can then either set the diagonal values to 0 or alternatively add na.rm = TRUE to the mean statement
mean_diffs = apply(diff, 2, FUN = function(x) mean(x, na.rm = TRUE))
that should work
Also: Yes, your code does not work, because the assignment is not in the for loop. This should work:
means = rep(NA, nrow(diff_0))
for (r in 1:length(means)){
  means[r] = mean(apply(cbind(diff_0[r,], diff_0[,r]), 1, sum, na.rm=T))
}
But in general, for loops are not what you want to use in R.
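For completeness, a vectorized sketch that avoids the loop entirely (assuming diff_0 is a numeric matrix with values in the lower triangle and NA everywhere else); it should reproduce the na.rm means above:
# each SubPop r contributes the non-NA values in row r and column r
vals   <- rowSums(diff_0, na.rm = TRUE) + colSums(diff_0, na.rm = TRUE)
counts <- rowSums(!is.na(diff_0)) + colSums(!is.na(diff_0))
means  <- vals / counts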
This may be a solution...
for(i in 1:nrow(diff_0)) {
  k <- mean(cbind(as.numeric(diff_0[,i]), as.numeric(diff_0[i,])), na.rm=T)
  if(i==1) {
    data_mean <- k
  } else {
    data_mean <- rbind(data_mean, k)
  }
}
colnames(data_mean) <- "mean"
rownames(data_mean) <- c("SubPop0","SubPop1","SubPop2","SubPop3","SubPop4")
data_mean
               mean
SubPop0 0.005361391
SubPop1 0.003607666
SubPop2 0.002354798
SubPop3 0.001951555
SubPop4 0.004901544

using seq_along() to handle the empty case

I read that using seq_along() allows one to handle the empty case much better, but the concept is not entirely clear in my mind.
For example, I have this data frame:
df
a b c d
1 1.2767671 0.133558438 1.5582137 0.6049921
2 -1.2133819 -0.595845408 -0.9492494 -0.9633872
3 0.4512179 0.425949910 0.1529301 -0.3012190
4 1.4945791 0.211932487 -1.2051334 0.1218442
5 2.0102918 0.135363711 0.2808456 1.1293810
6 1.0827021 0.290615747 2.5339719 -0.3265962
7 -0.1107592 -2.762735937 -0.2428827 -0.3340126
8 0.3439831 0.323193841 0.9623515 -0.1099747
9 0.3794022 -1.306189542 0.6185657 0.5889456
10 1.2966537 -0.004927108 -1.3796625 -1.1577800
Considering these three different code snippets:
# Case 1
for (i in 1:ncol(df)) {
  print(median(df[[i]]))
}
# Case 2
for (i in seq_along(df)) {
  print(median(df[[i]]))
}
# Case 3
for (i in df) print(median(i))
What is the difference between these different procedures when a full data.frame exists or in the presence of an empty data.frame?
Under the condition that df <- data.frame(), we have:
Case 1 falling victim to...
Error in .subset2(x, i, exact = exact) : subscript out of bounds
while Cases 2 and 3 trigger no error.
In essence, the error in Case 1 is due to ncol(df) being 0. This leads the sequence 1:ncol(df) to evaluate to 1:0, which creates the vector c(1, 0). The for loop then tries to access column 1, which does not exist, so the subscript is out of bounds.
Meanwhile, in Cases 2 and 3 the loop body is never executed, since there are no elements to iterate over: both seq_along(df) and df itself have length 0.
As this question specifically relates to what the heck is happening with seq_along(), let's take a traditional seq_along() example by constructing a full vector a and looking at the results:
set.seed(111)
a <- runif(5)
seq_along(a)
#[1] 1 2 3 4 5
In essence, for each element of the vector a, seq_along creates a corresponding index that can be used to access it.
If we apply seq_along now to the empty df in the above case, we get:
seq_along(df)
# integer(0)
Thus, what was created was a zero-length vector. It's mighty hard to move along a zero-length vector.
Ergo, Case 1 poorly protects against the empty case.
Now, under the traditional assumption that there is some data within the data.frame (a very bad assumption for any developer to make)...
set.seed(1234)
df <- data.frame(matrix(rnorm(40), 4))
All three cases would be operating as expected. That is, you would receive a median per column of the data.frame.
[1] -0.5555419
[1] -0.4941011
[1] -0.4656169
[1] -0.605349
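A closely related safe idiom is seq_len(), in case you prefer indexing by an explicit count; like seq_along(), it returns integer(0) in the empty case (a small sketch over the same df):
# seq_len(0) is integer(0), so the loop body is simply skipped
# for an empty data.frame instead of erroring like 1:ncol(df)
for (i in seq_len(ncol(df))) {
  print(median(df[[i]]))
}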

correlation loop keeps getting NA

Despite using two complete columns where every element is numeric and no numbers are missing for rows 2 thru 570, I find it impossible to get a result other than NA when using a loop to find a rolling 24-week correlation between the two columns.
rolling.correlation <- NULL
temp <- NULL
for(i in 2:547){
  temp <- cor(training.set$return.SPY[i:i+23], training.set$return.TLT[i:i+23])
  rolling.correlation <- c(rolling.correlation, temp)
} #end "for" loop
rolling.correlation
The cor() command works fine for [2:25], [i:25], or [2:i], but R doesn't understand when I say [i:i+23].
I want R to calculate a correlation for rows 2 thru 25, then 3 thru 26, ..., 547 thru 570. The result should be a vector of length 546 which has numeric values for each correlation. Instead I'm getting a vector of 546 NAs. How can I fix this? Thanks for your help.
Look what happens when you do
5:5+2
# [1] 7
Note that : has a higher operator precedence than +, which means 5:5+2 is the same as (5:5)+2, when you really want 5:(5+2). Use
i:(i+23)
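With the parentheses in place, a corrected version of the loop could look like this (a sketch that also preallocates instead of growing the vector):
rolling.correlation <- numeric(546)
for(i in 2:547){
  # parentheses force the intended window i..(i+23)
  rolling.correlation[i - 1] <- cor(training.set$return.SPY[i:(i + 23)],
                                    training.set$return.TLT[i:(i + 23)])
}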

Suppress large output to R console

How can I make R check whether an object is too large to print in the console? "Too large" here means larger than a user-defined value.
Example: You have a list f_data with two elements f_data$data (a 100MB data.frame) and f_data$info (for instance, a vector). Assume you want to inspect the first few entries of the f_data$data data.frame but you make a mistake and type head(f_data) instead of head(f_data$data). R will try to print the whole content of f_data to the console (which would take forever).
Is there somewhere an option that I can set in order to suppress the output of objects that are larger than let's say 1MB?
Edit: Thank you guys for your help. After implementing the max.print option I realized that this does indeed give the desired output. BUT the problem that the output takes very long to show up still persists. I will give a proper example below.
df_nrow=100000
df_ncol=100
#create list with first element being a large data.frame
#second element is a short vector
test_list=list(df=data.frame(matrix(rnorm(df_nrow*df_ncol),nrow=df_nrow,ncol=df_ncol)),
               vec=1:110)
#only print the first 100 elements of an object
options(max.print=100)
#head correctly displays the first row of the data.frame
#BUT for some reason the output takes really long to show up in the console (~30sec)
head(test_list)
#let's try to see how long exactly
system.time(head(test_list))
#    user  system elapsed
#       0       0       0
#well, obviously system.time is not the proper tool to measure this
#the same problem if I just print the object to the console without using head
test_list$df
I assume that R performs some sort of analysis on the object being printed and this is what takes so long.
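One thing worth noting: system.time(head(test_list)) only times the subsetting, because auto-printing of the result happens afterwards at the top level. To time what you actually experience, force the printing to happen inside the timed expression (a small sketch):
# force the formatting/printing work to happen inside system.time
system.time(print(head(test_list)))
# or format the object without it reaching the console
system.time(capture.output(test_list$df))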
Edit 2:
As per my comment below, I checked whether the problem persists if I use a matrix instead of a data.frame.
#create list with first element being a large MATRIX
test_list=list(mat=matrix(rnorm(df_nrow*df_ncol),nrow=df_nrow,ncol=df_ncol),vec=1:110)
#no problem
head(test_list)
#no problem
test_list$mat
Could it be that the output to the console is not really efficiently implemented for data.frame objects?
I think there is no such option, but you can check the size of an object with object.size and print it only if it is lower than a threshold (measured in bytes), for example:
print.small.objects <- function(x, threshold = 1e06, ...)
{
  if (object.size(x) < threshold) {
    print(x, ...)
  } else {
    cat("too big object\n")
    print(object.size(x))
  }
}
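A quick usage sketch with the test_list from the question:
print.small.objects(test_list$vec) # small: printed as usual
print.small.objects(test_list$df)  # large: prints "too big object" and the object's size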
Here's an example that you could adjust up to 100MB. It basically only prints the first 6 rows and 5 columns if the object's size is above 8e5 bytes. You could also turn this into a function and place it in your .Rprofile.
> lst <- list(data.frame(replicate(100, rnorm(1000))), 1:10)
> sapply(lst, object.size)
# [1] 810968 88
> lapply(lst, function(x){
+   if(object.size(x) > 8e5) head(x)[1:5] else x
+ })
#[[1]]
# X1 X2 X3 X4 X5
#1 0.3398235 -1.7290077 -0.35367971 0.09874918 -0.8562069
#2 0.2318548 -0.3415523 -0.38346083 -0.08333569 -1.1091982
#3 0.0714407 -1.4561768 0.50131914 -0.54899188 0.1652095
#4 -0.5170228 1.7343073 -0.05602883 0.87855313 0.4025590
#5 0.6962212 -0.3179930 0.28016057 1.05414456 -0.5172885
#6 0.9471200 1.4424843 -1.46323827 -0.78004192 -1.3611820
#
#[[2]]
# [1] 1 2 3 4 5 6 7 8 9 10

After merging missing values have a spurious value

After merging, the resulting dataset doesn't have the same number of non-missing values (the variable I merge on has no duplicates in either dataset); instead it has the same number of missings, meaning that I get 72229 spurious values, all exactly equal to one value in the second dataset.
There is just one row in the second dataset with that spurious value, and it looks perfectly normal. If I set its value to missing, I get the desired result (except for the one missing value). If I set its value to 1000, I get 72229 times 1000 in the result. So I thought it's something about that row, but if I try to make a reproducible example using that row and some others, the error doesn't occur.
I was unsuccessful at making a reproducible example in a small enough subset of the data that I could comfortably share it, so I'm mostly soliciting sage advice. Advice on how to reproduce the problem would also be nice.
> table(is.na(cogdj$int.youth))
FALSE TRUE
1731 178
> sum(duplicated(soep$PERSNR))
[1] 0
> sum(duplicated(cogdj$PERSNR))
[1] 0
> soep = merge(soep,cogdj[,c('PERSNR','ana','ded','mat','ari','int.youth')],by="PERSNR",all.x=T,incomparables=NA)
> table(is.na(soep$int.youth))
FALSE TRUE
73959 178
> nrow(soep[which(round(soep$int.youth,20)==1.6737269266506955567),])
[1] 72229
> nrow(cogdj[which(round(cogdj$int.youth,20)==1.6737269266506955567),])
[1] 1
> cogdj[which(round(cogdj$int.youth,20)==1.6737269266506955567),c('PERSNR','int.youth')]
PERSNR int.youth
1 609104 1.673727
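Without a reproducible example it is hard to say more, but two classic things to check (a diagnostic sketch, not a diagnosis) are NAs in the merge key, which interact with incomparables = NA, and a class mismatch between the two key columns:
# NAs in the key columns?
sum(is.na(soep$PERSNR))
sum(is.na(cogdj$PERSNR))
# factor vs. numeric keys are a classic source of spurious matches
class(soep$PERSNR)
class(cogdj$PERSNR)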
