Here is the table that I am trying to manipulate:
colnames sampA sampB
#1 conA conB
#2 1.1 4.4
#3 2.2 5.5
#4 3.3 6.6
I want to calculate log2(x(1-x)) for each number in $sampB. Here is my code so far:
DF[-1,3] <- apply(DF[-1,]$sampB,1,function(x) log2(x(1-x)))
then I got the error message:
dim(X) must have a positive length
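You shouldn't need apply(), as log2() is vectorized. The error occurs because apply() needs an object with dimensions (a matrix or data frame), while DF[-1,]$sampB is a plain vector, so dim() returns NULL. Try this: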
x <- as.numeric(as.character(DF$sampB[-1]))
log2(x * (1 - x))
I took off the first element because I'm not really sure what that conB part is about (and now you have confirmed it in the comments). I also suspect that the column might be a factor (because of conB), so I wrapped the column in as.numeric(as.character(...)). That may not be necessary, but better safe than sorry.
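To see why the as.character() step matters, here is a small sketch. It assumes sampB was read in as a factor whose first entry is the label conB (an assumption; the question does not show str(DF)), and the object names are made up for illustration.

```r
# Hypothetical reconstruction of the column: a factor with the label "conB" on top
sampB <- factor(c("conB", "4.4", "5.5", "6.6"))

# as.numeric() on a factor returns the internal level codes, not the printed values
codes <- as.numeric(sampB[-1])

# Going through character first recovers the actual numbers
x <- as.numeric(as.character(sampB[-1]))
x
```

Note that for the values shown in the question, x * (1 - x) is negative, so log2() returns NaN with a warning; the expression only yields real numbers for x strictly between 0 and 1.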
I read that using seq_along() handles the empty case much better, but the concept is not entirely clear to me.
For example, I have this data frame:
df
a b c d
1 1.2767671 0.133558438 1.5582137 0.6049921
2 -1.2133819 -0.595845408 -0.9492494 -0.9633872
3 0.4512179 0.425949910 0.1529301 -0.3012190
4 1.4945791 0.211932487 -1.2051334 0.1218442
5 2.0102918 0.135363711 0.2808456 1.1293810
6 1.0827021 0.290615747 2.5339719 -0.3265962
7 -0.1107592 -2.762735937 -0.2428827 -0.3340126
8 0.3439831 0.323193841 0.9623515 -0.1099747
9 0.3794022 -1.306189542 0.6185657 0.5889456
10 1.2966537 -0.004927108 -1.3796625 -1.1577800
Considering these three different code snippets:
# Case 1
for (i in 1:ncol(df)) {
print(median(df[[i]]))
}
# Case 2
for (i in seq_along(df)) {
print(median(df[[i]]))
}
# Case 3
for(i in df) print(median(i))
What is the difference between these different procedures when a full data.frame exists or in the presence of an empty data.frame?
Under the condition that df <- data.frame(), we have:
Case 1 falling victim to...
Error in .subset2(x, i, exact = exact) : subscript out of bounds
while Case 2 and 3 are not triggered.
In essence, the error in Case 1 is due to ncol(df) being 0. The sequence 1:ncol(df) then becomes 1:0, which creates the vector c(1, 0). The for loop takes the first element of that vector, 1, and tries to access column 1, which does not exist. Hence, the subscript is out of bounds.
Meanwhile, in Cases 2 and 3 the body of the for loop is never executed, because the collections being iterated over are empty; that is, they have a length of 0.
As this question specifically relates to what the heck is happening to seq_along(), let's take a traditional seq_along example by constructing a full vector a and seeing the results:
set.seed(111)
a <- runif(5)
seq_along(a)
#[1] 1 2 3 4 5
In essence, for each element of the vector a, there is a corresponding index that was created by seq_along to be accessed.
If we apply seq_along now to the empty df in the above case, we get:
seq_along(df)
# integer(0)
Thus, what was created was a zero-length vector. It's mighty hard to move along a zero-length vector.
Ergo, Case 1 poorly protects against the empty case.
Now, under the traditional assumption that there is some data within the data.frame (which is a very bad assumption for any kind of developer to make)...
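As a compact illustration of the point above, the following sketch compares the index vectors side by side on an empty data frame. The names empty and n_iter are made up for this example; seq_len() is included as the count-based counterpart of seq_along().

```r
empty <- data.frame()

bad_idx  <- 1:ncol(empty)         # counts DOWN from 1 to 0: two iterations
good_idx <- seq_along(empty)      # integer(0): zero iterations
len_idx  <- seq_len(ncol(empty))  # integer(0) as well, built from a count

# Count how many times the loop body runs with the safe index
n_iter <- 0
for (i in seq_along(empty)) {
  n_iter <- n_iter + 1
}
n_iter
```

The loop body never runs with seq_along(), which is why Case 2 is silent where Case 1 fails.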
set.seed(1234)
df <- data.frame(matrix(rnorm(40), ncol = 4))
All three cases would be operating as expected. That is, you would receive a median per column of the data.frame.
[1] -0.5555419
[1] -0.4941011
[1] -0.4656169
[1] -0.605349
Is there any difference between what these two lines of code do:
mv_avg[i-2] <- (sum(file1$rtn[i-2]:file1$rtn[i+2])/5)
and
mv_avg[i-2] <- mean(file1$rtn[i-2]:file1$rtn[i+2])
I'm trying to calculate the moving average of first 5 elements in my dataset. I was running a for loop and the two lines are giving different outputs. Sorry for not providing the data and the rest of the code for you guys to execute and see (can't do that, some issues).
I just want to know if they both do the same thing or if there's a subtle difference between them both.
It's not an issue with mean or sum. The example below illustrates what's happening with your code:
x = seq(0.5,5,0.5)
i = 8
# Your code
x[i-2]:x[i+2]
[1] 3 4 5
# Index this way to get the five values for the moving average
x[(i-2):(i+2)]
[1] 3.0 3.5 4.0 4.5 5.0
x[i-2]=3 and x[i+2]=5, so x[i-2]:x[i+2] is equivalent to 3:5. You're seeing different results with mean and sum because your code is not returning 5 values. Therefore dividing the sum by 5 does not give you the average. In my example, sum(c(3,4,5))/5 != mean(c(3,4,5)).
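Putting the corrected indexing into a loop, a minimal sketch of the centred 5-point moving average might look like this. It uses the example vector x as a stand-in for file1$rtn, which was not provided, and the loop bounds are an assumption about what the original loop intended.

```r
x <- seq(0.5, 5, 0.5)
n <- length(x)

# One average per position that has two neighbours on each side
mv_avg <- numeric(n - 4)
for (i in 3:(n - 2)) {
  # Index the positions (i-2) to (i+2), not the values stored there
  mv_avg[i - 2] <- mean(x[(i - 2):(i + 2)])
}
mv_avg
```

With the window indexed this way, sum(...)/5 and mean(...) agree, because each window really does contain five values.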
@G.Grothendieck mentioned rollmean. Here's an example:
library(zoo)
rollmean(x, k=5, align="center")
[1] 1.5 2.0 2.5 3.0 3.5 4.0
I have a minor problem, and I'm unsure how to fix the error.
Basically, I have two columns and I want to use a double for loop to calculate the averages between each number in both columns, so that it results in a vector of averages. To clarify, the apply and mean functions aren't the best fit because I need only half of the total possible combinations to obtain averages. For example:
Col1<-c(1,2,3,4,5)
Col2<-c(1,2,3,4,5)
Q1<-data.frame(cbind(Col1, Col2))
Q1$mean<-0
for (i in 1:length(Q1$Col1)) {
for (j in i+1:length(Q1$Col2)) {
Q1$mean[i]<-(Q1$Col1[i]+Q1$Col2[j])/2
}
}
Basically, I want to average each number in Q1$Col1 with each number in Q1$Col2. The reason why I want to use a double for loop is to eliminate duplicates. This is the matrix version to provide visualization:
1.0 1.5 2.0 2.5 3.0
1.5 2.0 2.5 3.0 3.5
2.0 2.5 3.0 3.5 4.0
2.5 3.0 3.5 4.0 4.5
3.0 3.5 4.0 4.5 5.0
Here, each row represents a number from Q1$Col1 and each column represents a number from Q1$Col2. However, notice that there is redundancy on both sides of the matrix diagonal. So using the Double For Loop, I eliminate the redundancy to obtain the averages of the unique combination of cases. Using the matrix above, it should look like this:
1.0 1.5 2.0 2.5 3.0
    2.0 2.5 3.0 3.5
        3.0 3.5 4.0
            4.0 4.5
                5.0
What I think you're asking is this: given two vectors of numbers, how can I find the mean of the first items in each vector, the mean of the second items in each vector, and so on. If that's the case, then here is a way to do that.
First, you want to use cbind(), not rbind(), in order to get columns rather than rows.
Col1<-c(1,2,3,4,5)
Col2<-c(2,3,4,5,6)
Q1<-cbind(Col1, Col2)
Then you can use the function rowMeans() to figure out (you guessed it) the means of each row. (See also rowSums() and colMeans() and colSums().)
rowMeans(Q1)
#> [1] 1.5 2.5 3.5 4.5 5.5
The more general way to do this is the apply() function, which will let us apply a function to each column or row. Here we use the argument 1 to apply it to rows (because the first row takes the first item from Col1 and Col2, etc.).
apply(Q1, 1, mean)
The results are these:
#> [1] 1.5 2.5 3.5 4.5 5.5
If you really want them in your existing matrix, you could do something like this:
means <- rowMeans(Q1)
cbind(Q1, means)
You do not need loops to get the averages; you can use vectorised operations:
Col1 <- c(1,2,3,4,5)
Col2 <- c(2,3,4,5,6)
Mean <- (Col1+Col2)/2
Q1 <- rbind(Col1, Col2, Mean)
However, rbind() treats your vectors as rows; you could use cbind() for columns.
You could just use the outer function to first calculate the averages, then use lower.tri to fill the area underneath the diagonal of the matrix with NA values.
avg <- outer(Q1$Col1, Q1$Col2, "+")/2  # all pairwise averages
avg[lower.tri(avg)] <- NA              # blank out the redundant lower triangle
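If a plain vector of the unique averages is wanted, as the question describes, a sketch along the same lines could pull the upper triangle out directly. The names avg, unique_avgs and idx are made up here. The second half shows an operator-precedence detail that likely affects the loop in the question: the colon binds tighter than +.

```r
Col1 <- c(1, 2, 3, 4, 5)
Col2 <- c(1, 2, 3, 4, 5)

# All pairwise averages, then keep the diagonal and everything above it
avg <- outer(Col1, Col2, "+") / 2
unique_avgs <- avg[upper.tri(avg, diag = TRUE)]  # taken column by column
length(unique_avgs)

# Precedence: i + 1:5 means i + (1:5), not (i + 1):5
i <- 2
idx_as_written <- i + 1:5    # 3 4 5 6 7, runs past the end of a length-5 vector
idx_intended   <- (i + 1):5  # 3 4 5
```

Indexing past the end of a vector returns NA in R, so a loop written as j in i+1:length(...) ends up averaging with NA.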
sample
Symbols IDs Value1 Value2 Value3
1 NA NA 3.1 2.3 1.7
2 TP53 1234 5.8 6.9 10.1
3 Kras 5678 0.1 0.3 0.5
4 NA NA 10.3 2.1 7.9
5 Hras 9991 20.0 30.0 40.0
6 TP53 1234 -3.1 0.2 1.7
My table looks like this one.
I need to calculate values by row instead of by column.
So, I tried to use Symbols as the new row names. That way, I can get a whole row of values with sample["Hras",].
When I tried to do this, I encountered this problem:
rownames(sample)<-sample[,1]
Error in row.names<-.data.frame(*tmp*, value = value) :
duplicate 'row.names' are not allowed
In addition: Warning message:
non-unique values when setting 'row.names': ‘A1CF’, ‘A2M’, ‘A2ML1’, ‘AAGAB’, ‘AAK1’, ‘AAMDC’, ‘AARS2’, ‘AASDH’, ‘AASDHPPT’, ‘AASS’, ‘ABAT’, ‘ABCA1’, ‘ABCA13’, ‘ABCA2’, ‘ABCA4’, ‘ABCA5’, ‘ABCA8’, ‘ABCA9’, ‘ABCB1’, ‘ABCB11’, ‘ABCB4’, ‘ABCB5’, ‘ABCB6’, ‘ABCB8’, ‘ABCB9’, ‘ABCC1’, ‘ABCC10’, ‘ABCC11’, ‘ABCC12’, ‘ABCC13’, ‘ABCC3’, ‘ABCC4’, ‘ABCC5’, ‘ABCC6’, ‘ABCC8’, ‘ABCC9’, ‘ABCD3’, ‘ABCD4’, ‘ABCE1’, ‘ABCF2’, ‘ABCG1’, ‘ABHD1’, ‘ABHD10’, ‘ABHD11’, ‘ABHD12’, ‘ABHD13’, ‘ABHD17B’, ‘ABHD2’, ‘ABHD5’, ‘ABHD6’, ‘ABI1’, ‘ABI2’, ‘ABI3BP’, ‘ABL2’, ‘ABLIM1’, ‘ABLIM2’, ‘ABO’, ‘ABR’, ‘ABRA’, ‘ABTB1’, ‘ABTB2’, ‘ACAA1’, ‘ACAA2’, ‘ACACA’, ‘ACACB’, ‘ACAD10’, ‘ACADL’, ‘ACADSB’, ‘ACAN’, ‘ACAP1’, ‘ACAP2’, ‘ACAP3’, ‘ACAT1’, [... truncated]
Is this because of the "NA"? Other options?
Thanks
This is a microarray dataset. I have done the normalization and am going to extract the values of several genes to make plots and run cross-correlations and t-tests. In fact, not only NA but also several genes that I am going to use for plotting have multiple rows. So, I need to extract them into another table for later use.
Here, I am just showing a way to change the row.names, as you requested in the question. The ultimate goal is not clear. For the specified problem, you could try using make.names with the option unique=TRUE. This will make sure that duplicates are named differently. In the first column, there are NA values, which will be named NA., NA..1, etc. (if that is okay for you).
row.names(sample) <- make.names(sample[,1],TRUE)
Or, as commented by @Richard Scriven,
row.names(sample) <- paste(make.unique(sample[,1]))
Another option would be to convert data.frame to matrix (which will permit duplicate values). I would recommend this only if the columns are of the same class. For example, if you have character and numeric columns, this will convert all the columns to character class. In your dataset, it seems to me that except the first column, all others are numeric (with the possible exception of "IDs" column). But again the NA values would be a problem. If you want to subset the '1st' or '3rd' row based on the rownames, it will be difficult.
sample1 <- as.matrix(sample[,-1])
row.names(sample1) <- sample[,1]
sample1['Hras',]
# IDs Value1 Value2 Value3
# 9991 20 30 40
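Since the stated goal is to pull all rows for a given gene into another table, one alternative that sidesteps row names altogether is logical subsetting on the Symbols column. This is only a sketch, not part of the answer above: the data frame is retyped by hand from the question, and the name tp53 is made up.

```r
# The example table from the question, retyped by hand
sample <- data.frame(
  Symbols = c(NA, "TP53", "Kras", NA, "Hras", "TP53"),
  IDs     = c(NA, 1234, 5678, NA, 9991, 1234),
  Value1  = c(3.1, 5.8, 0.1, 10.3, 20.0, -3.1),
  Value2  = c(2.3, 6.9, 0.3, 2.1, 30.0, 0.2),
  Value3  = c(1.7, 10.1, 0.5, 7.9, 40.0, 1.7),
  stringsAsFactors = FALSE
)

# make.names(..., unique = TRUE) yields names that are safe to use as row names
new_names <- make.names(sample$Symbols, unique = TRUE)

# %in% treats NA as "no match", so rows with a missing symbol are not dragged along
tp53 <- sample[sample$Symbols %in% "TP53", ]
tp53
```

Using %in% rather than == matters here: a comparison with NA returns NA, which would add rows full of NA to the subset.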
I have the following data frame:
id<-c(1,2,3,4)
x<-c(0,2,1,0)
A<-c(1,3,4,3)
df<-data.frame(id,x,A)
Now I want to make a variable called value such that if x>0, value for each id is equal to A+1/x, and if x=0, value is equal to A.
To this end, I typed:
value <- df$A + as.numeric(df$x > 0)*(1/df$x)
What I expect is a vector as follows:
[1] 1 3.5 5.0 3
However, what I get by typing the above command is:
[1] NaN 3.5 5.0 NaN
I wonder if anyone can help me with this problem.
Thanks in advance!
I think it would be simpler to use function ifelse in this case:
ifelse(df$x>0, df$A + (1/df$x), df$A)
[1] 1.0 3.5 5.0 3.0
See ?ifelse for more details.
With the command you were trying to apply, although as.numeric(df$x>0) was indeed giving you a vector of 1 and 0 (so it was a good idea), it didn't change the fact that 1/df$x is Inf when df$x is equal to 0, and multiplying Inf by 0 gives NaN rather than 0.
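The intermediate values make this easy to check. The sketch below rebuilds the data frame from the question and looks at each piece in turn; the names inv and masked are made up for illustration.

```r
df <- data.frame(id = c(1, 2, 3, 4), x = c(0, 2, 1, 0), A = c(1, 3, 4, 3))

inv    <- 1 / df$x                      # Inf where x is 0
masked <- as.numeric(df$x > 0) * inv    # 0 * Inf is NaN, not 0

# ifelse() picks df$A wherever x is 0, so the Inf never reaches the result
value <- ifelse(df$x > 0, df$A + 1 / df$x, df$A)
value
```

ifelse() still evaluates both branches for the whole vector, but it only takes the elements selected by the condition, so the Inf values are simply discarded.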