I have asked this question a couple of times without any help. I have since improved the code, so I am hoping somebody has some ideas! I have a dataset full of 0s and 1s. I simply want to add the 10 columns together, resulting in 1 column with 3835 rows. This is my code thus far:
# select for valid IDs
data = history[history$studyid %in% valid$studyid,]
sibling = data[,c('b16aa','b16ba','b16ca','b16da','b16ea','b16fa','b16ga','b16ha','b16ia','b16ja')]
# replace all NA values by 0
sibling[is.na(sibling)] <- 0
# loop over all columns and count the number of 174
apply(sibling, 2, function(x) sum(x==174))
The problem is that this code sums within each column, giving one number per column; I want to sum across the columns for each row so that I end up with a single column. This is the answer I am getting now, which is wrong:
b16aa b16ba b16ca b16da b16ea b16fa b16ga b16ha b16ia b16ja
   68    36    22    18     9     5     6     5     4     1
In apply() you have the MARGIN set to 2, which is columns. Set the MARGIN argument to 1 so that your function, sum, is applied across rows. This was mentioned by @sgibb.
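That is, keeping the rest of your call the same, a minimal sketch with MARGIN = 1 (sum applied to each row of sibling):
apply(sibling, 1, function(x) sum(x == 174))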
If that doesn't work (I can't reproduce your example), you could first convert the elements of the matrix to logical values, X2 <- apply(sibling, c(1,2), function(x) x==174), and then use rowSums to add up the columns in each row: Xsum <- rowSums(X2, na.rm=TRUE). With this setup you do not need to change the NAs to 0s first, since the na.rm argument in rowSums() handles them.
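Put together, a small sketch of that second route, using the same objects as above:
X2 <- apply(sibling, c(1, 2), function(x) x == 174)  # TRUE/FALSE matrix marking the 174s
Xsum <- rowSums(X2, na.rm = TRUE)                    # one count per row
If sibling is entirely numeric, rowSums(sibling == 174, na.rm = TRUE) is a shorter equivalent.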
I am new to R and this may be a very basic question.
I am working on microarray data where there are thousands of columns in a dataframe. I am trying to remove all those rows that have a value less than 2 and greater than -2 in any of the columns. Therefore, I cannot specify the column name.
How can I remove all the rows that have any value less than 2 and greater than -2 (i.e., any value strictly between -2 and 2) in any column?
Any help would be greatly appreciated.
indices <- which(apply(DF, 1, function(row) any(abs(row) < 2)))
DF[-indices,]
First, you want to find the relevant rows to remove.
You can achieve that by going over each row (apply with 1 as the second argument) and checking whether it has any values between -2 and 2, not including them; in other words, whether any absolute value is less than 2. (You can ask if any of the absolute values in the row are less than 2, or equivalently whether the minimum absolute value is; a sketch of that variant follows below.)
This will give you a boolean vector. Applying the which function to it will produce a vector of indices where the value was TRUE.
Now you just need to remove those rows from the data.frame (I called it DF).
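As mentioned above, the minimum-absolute-value variant would look like this (a sketch against the same DF):
indices <- which(apply(DF, 1, function(row) min(abs(row)) < 2))
DF[-indices,]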
We can use rowSums to build a logical index that flags the rows to keep (those with no value strictly between -2 and 2):
DF[!rowSums((DF < 2) & (DF > -2)),]
# V1 V2 V3 V4 V5
#6 5 2 -5 -2 -2
#14 2 -4 2 5 -4
data
set.seed(24)
DF <- as.data.frame(matrix(sample(-5:5, 5*20, replace = TRUE), 20, 5))
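For clarity, here are the intermediate pieces of that one-liner with the same example data:
inside <- (DF < 2) & (DF > -2)  # TRUE wherever a value lies strictly between -2 and 2
rowSums(inside)                 # per-row count of such values; 0 means the row is kept
DF[rowSums(inside) == 0, ]      # the same rows as DF[!rowSums(inside), ]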
I have a matrix B that is 10 rows x 2 columns:
B = matrix(c(1:20), nrow=10, ncol=2)
Some of the rows are technical duplicates, and they correspond to the same
number in a list of length 20 (list1).
list1 = c(1,1,1,1,2,2,3,3,4,4,5,5,6,6,7,7,8,8,8,8)
list1 = as.list(list1)
I would like to use this list (list1) to take the mean of any duplicate values for all columns in B such that I end up with a matrix or data.frame with 8 rows and 2 columns (all the duplicates are averaged).
Here is my code:
aggregate.data.frame(B, by=list1, FUN=mean)
And it generates this error:
Error in aggregate.data.frame(B, by = list1, FUN = mean) :
arguments must have same length
What am I doing wrong?
Thank you!
Your data have 2 variables (2 columns), each with 10 observations (10 rows). The function aggregate.data.frame expects the elements in the list to have the same length as the number of observations in your variables. You are getting an error because the vector in your list has 20 values, while you only have 10 observations per variable. So, for example, you can do the following, because now you have 1 variable with 20 observations and list1 holds a vector with 20 elements:
B <- 1:20
list1 <- list(B=c(1,1,1,1,2,2,3,3,4,4,5,5,6,6,7,7,8,8,8,8))
aggregate.data.frame(B, by=list1, FUN=mean)
The code will also work if you give it a matrix with 2 columns and 20 rows.
aggregate.data.frame(cbind(B,B), by=list1, FUN=mean)
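For a quick sanity check of the group means, tapply works directly on the vector and the grouping:
tapply(B, list1$B, mean)
#    1    2    3    4    5    6    7    8
#  2.5  5.5  7.5  9.5 11.5 13.5 15.5 18.5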
I think this answer addresses why you are getting an error. However, I am not sure that it addresses what you are actually trying to do. How do you expect to end up with 8 rows and 2 columns? What exactly would the cells in that matrix represent?
I'm looking for a way to exclude a number of answers from a length function.
This is a follow-on question from Getting R Frequency counts for all possible answers. In SQL the syntax could be:
select * from someTable
where variableName not in ( 0, null )
Given
Id <- c(1,2,3,4,5)
ClassA <- c(1,NA,3,1,1)
ClassB <- c(2,1,1,3,3)
R <- c(5,5,7,NA,9)
S <- c(3,7,NA,9,5)
df <- data.frame(Id,ClassA,ClassB,R,S)
ZeroTenNAScale <- c(0:10,NA);
R.freq = setNames(nm=c('R','freq'),data.frame(table(factor(df$R,levels=ZeroTenNAScale,exclude=NULL))));
S.freq = setNames(nm=c('S','freq'),data.frame(table(factor(df$S,levels=ZeroTenNAScale,exclude=NULL))));
length(S.freq$freq[S.freq$freq!=0])
# 5
How would I change
length(S.freq$freq[S.freq$freq!=0])
to get an answer of 4 by excluding 0 and NA?
We can use colSums:
colSums(!is.na(S.freq)[S.freq$freq!=0,])[[1]]
#[1] 4
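The same computation, unpacked into steps for readability:
keep <- S.freq$freq != 0         # drop the levels that never occur
no.na <- !is.na(S.freq[keep, ])  # logical matrix: TRUE where an entry is not NA
colSums(no.na)[[1]]              # non-NA entries in the S column, i.e. 4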
You can use sum to calculate the sum of the frequencies. If the NAs were in the column you are summing you could use the na.rm argument, but because the NA sits in a different column you first need to remove the row containing it.
Our solution is as follows: we remove the rows containing NA by subsetting S.freq[!is.na(S.freq$S),], and we also select the second column, freq:
sum(S.freq[!is.na(S.freq$S), "freq"])
# 4
You can try na.omit (to remove NAs) and subset (to get rid of all lines where freq equals 0):
subset(na.omit(S.freq), freq != 0)
S freq
4 3 1
6 5 1
8 7 1
10 9 1
From here, that's straightforward:
length(subset(na.omit(S.freq), freq != 0)$freq)
[1] 4
Does it solve your problem?
Just add !is.na(S.freq$S) as a second filter:
length(S.freq$freq[S.freq$freq!=0 & !is.na(S.freq$S)])
If you want to extend it with other conditions, you could make an index vector first for readability:
idx <- S.freq$freq!=0 & !is.na(S.freq$S)
length(S.freq$freq[idx])
You're looking for values with frequency > 0, which means you're looking for unique values. You get this information directly from vector S:
length(unique(df$S))
and, leaving NA aside, you get the answer 4 with:
length(unique(df$S[!is.na(df$S)]))
Regarding your question on how to exclude a number of items based on their value:
In R this is easily done with logical vectors, as you already used in your code:
length(S.freq$freq[S.freq$freq!=0])
You can combine different conditions into one logical vector and use it for subsetting, e.g.
length(S.freq$freq[S.freq$freq!=0 & !is.na(S.freq$freq)])
I have a R dataframe like this one:
a<-c(1,2,3,4,5)
b<-c(6,7,8,9,10)
df<-data.frame(a,b)
colnames(df)<-c("a","b")
df
a b
1 1 6
2 2 7
3 3 8
4 4 9
5 5 10
I would like to get the 1st, 2nd, 3rd AND 5th row of the column a, so 1 2 3 5, by selecting rows by their number.
I have tried df$a[1:3,5] but I get Error in df$a[1:3, 5] : incorrect number of dimensions.
What DOES work is c(df$a[1:3],df$a[5]) but I was wondering if there was an easier way to achieve this with R?
Your data frame has two dimensions (rows and columns). When you use the square brackets to extract values, R expects everything before the comma to indicate the rows desired, and everything after the comma to indicate the columns desired (see ?"["). Hence, df[1:3,5] means rows 1 through 3 from column 5. To turn your desired rows into a single vector, you need to concatenate them (i.e., c(1:3,5)). That all goes before the comma; the column indicator, 1 or "a", goes after the comma. Thus, df[c(1:3,5), 1] is what you need.
For an alternative answer (that might be more appropriate for a data frame with many more columns), df[c(1:3, 5), "a"], as suggested by @Mamoun Benghezal, would also get it done!
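For completeness, with the example data above both forms return the requested values:
df[c(1:3, 5), 1]
# [1] 1 2 3 5
df[c(1:3, 5), "a"]
# [1] 1 2 3 5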
I have a data frame that contains multiple rows and multiple columns.
I have a character vector that contains the names of some of the columns in the data frame. The number of columns can vary.
For each line, for each of these columns, I have to identify whether at least one of them is not NA (basically any(!is.na(df[namescolumns])) for each line), and then subset the rows for which that is TRUE.
Actually, any(!is.na(df[1,][namescolumns])) works well, but it's only for the first line.
I could easily do a for loop (that is my first reflex as a programmer, and it works for the first line), but I'm sure that is not the R way and that there must be a way to do this with one of the apply functions (lapply, mapply, sapply, tapply or another), though I can't figure out which one and how.
Thank you.
Try using apply over the first dimension (rows):
apply(df, 1, function(x) any(!is.na(x[namescolumns])))
If the function returned more than one value per row, the result would come back transposed and you might want to wrap the whole statement in t(); here it returns a single TRUE/FALSE per row, so you simply get a logical vector.
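To then do the subset the question asks for, the result can be used directly as a row index (a sketch, assuming namescolumns holds the column names of interest):
keep <- apply(df, 1, function(x) any(!is.na(x[namescolumns])))
df[keep, ]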
You can use a combination of lapply and Reduce
no.na.in.cols <- Reduce(`&`, lapply(colnames, function(name) !is.na(df[name])))
to get a logical vector that is TRUE for the rows where none of the columns in colnames contain an NA, which can in turn be used to subset the data:
df[no.na.in.cols,]
For example, given:
df <- data.frame(a = c(1,2,3,4,NA,6,7),
                 b = c(2,4,6,8,10,12,14),
                 c = c("one","two","three","four","five","six","seven"),
                 d = c("a",NA,"c","d","e","f","g"))
colnames <- c("a","d")
You can get:
> df[Reduce(`&`, lapply(colnames, function (name) !is.na(df[name]))),]
a b c d
1 1 2 one a
3 3 6 three c
4 4 8 four d
6 6 12 six f
7 7 14 seven g
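If the literal requirement from the question is wanted instead (keep a row when at least one of the named columns is non-NA, rather than all of them), a small rowSums-based sketch using the same objects:
keep <- rowSums(!is.na(df[colnames])) > 0  # TRUE if any named column is non-NA in that row
df[keep, ]
With this example data that keeps all seven rows, since no row is NA in both a and d.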