Handling matrices with rows of unequal length in R

There are two matrices that I want to divide: numer1 and denom1. The problem is that they have unequal numbers of rows. The script is run every week, so the dimensions also change weekly.
This week:
dim(numer1) = 998 rows, 99 columns
dim(denom1) = 997 rows, 99 columns.
Last week:
dim(numer1) = 999 rows, 99 columns
dim(denom1) = 998 rows, 99 columns.
Is there a way to compare these matrices and remove the last row in the larger matrix (in this example, numer1)?
Here's what I have tried:
fun1 <- as.data.frame(abs(numer1[-last(numer1),]/denom1))
Thank you!

How about this:
rows <- seq_len(min(nrow(numer1), nrow(denom1)))
frac1 <- numer1[rows, ] / denom1[rows, ]
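For example, with hypothetical stand-in matrices of this week's dimensions (the values are made up for illustration), trimming both to the rows they share before dividing:

```r
# Hypothetical stand-ins for numer1 and denom1 with mismatched row counts
set.seed(1)
numer1 <- matrix(rnorm(998 * 99), nrow = 998, ncol = 99)
denom1 <- matrix(rnorm(997 * 99), nrow = 997, ncol = 99)

# Keep only the rows both matrices share, then divide element-wise
rows  <- seq_len(min(nrow(numer1), nrow(denom1)))
frac1 <- numer1[rows, ] / denom1[rows, ]

dim(frac1)  # 997 99
```

Because min() is taken over the two row counts, the same two lines work whichever matrix happens to be larger in a given week.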

Related

Why does the matrix function show 100 columns?

In this example code I define X1 = matrix(rnorm(length(y)*100), nrow = length(y)). I get 97 rows, which is correct, but 100 columns.
When I multiply by 10 instead of 100, as in X1 = matrix(rnorm(length(y)*10), nrow = length(y)), the number of columns is then 10.
Why is that? I never assigned any value for the columns.
library(glmnet)
library(ncvreg)
data("prostate");
X=prostate[,1:8];
y=prostate$lpsa; #97 values
X1=matrix(rnorm(length(y)*100), nrow = length(y)); #97x100
nrow(X1); ncol(X1);
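The behaviour follows from how matrix() infers the missing dimension: when only nrow is given, ncol is taken as length(data) / nrow. A small sketch (using n = 97 to mirror the data above, but with no dependency on glmnet or ncvreg) illustrates it:

```r
n <- 97

# 97 * 100 = 9700 values; with nrow = 97, ncol is inferred as 9700 / 97 = 100
X1 <- matrix(rnorm(n * 100), nrow = n)
dim(X1)  # 97 100

# With 97 * 10 = 970 values, the inferred column count drops to 10
X2 <- matrix(rnorm(n * 10), nrow = n)
dim(X2)  # 97 10
```

So the column count is not a default of 100; it is whatever multiplier you used when generating the data, because that fixes the total number of values.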

Calculate Fisher's exact test p-value in dataframe rows

I have 1700 samples in a data frame, where every row represents the number of colored items that each assistant has counted in a random number of specimens from different boxes. There are two available colors and two individuals counting the items, so each row could easily form a 2x2 contingency table.
df
Box-ID 1_Red 1_Blue 2_Red 2_Blue
     1  1075    918    29     26
     2   903   1076   135    144
I would like to know how I can treat every row as a contingency table (either a vector or a matrix) in order to perform a test of independence (such as Fisher's exact test or Barnard's test) and generate a sixth column with p-values.
This is what I've tried so far, but I am not sure if it's correct
df$p-value = chisq.test(t(matrix(c(df[,1:4]), nrow=2)))$p.value
I think you could do something like this
df$p_value <- apply(df,1,function(x) fisher.test(matrix(x[-1],nrow=2))$p.value)
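Applied to a small data frame rebuilt from the two rows shown above (the column names are adapted, since R names cannot start with a digit without backticks), the order 1_Red, 1_Blue, 2_Red, 2_Blue fills the 2x2 table column-wise, giving one column per counter:

```r
# The two example rows from the question, with R-friendly column names
df <- data.frame(BoxID  = c(1, 2),
                 Red_1  = c(1075, 903),
                 Blue_1 = c(918, 1076),
                 Red_2  = c(29, 135),
                 Blue_2 = c(26, 144))

# Each row (minus the ID) becomes a 2x2 table of counts:
# rows = Red/Blue, columns = counter 1 / counter 2
df$p_value <- apply(df, 1, function(x)
  fisher.test(matrix(x[-1], nrow = 2))$p.value)

df$p_value  # one p-value per box
</imports>
```

Note that apply() coerces the data frame to a matrix; that is safe here because all columns are numeric, but a character column would silently break the arithmetic.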

Using aggregate to get the mean of duplicate rows in a data.frame in R

I have a matrix B that is 10 rows x 2 columns:
B = matrix(c(1:20), nrow=10, ncol=2)
Some of the rows are technical duplicates, and they correspond to the same
number in a list of length 20 (list1).
list1 = c(1,1,1,1,2,2,3,3,4,4,5,5,6,6,7,7,8,8,8,8)
list1 = as.list(list1)
I would like to use this list (list1) to take the mean of any duplicate values for all columns in B such that I end up with a matrix or data.frame with 8 rows and 2 columns (all the duplicates are averaged).
Here is my code:
aggregate.data.frame(B, by=list1, FUN=mean)
And it generates this error:
Error in aggregate.data.frame(B, by = list1, FUN = mean) :
arguments must have same length
What am I doing wrong?
Thank you!
Your data have 2 variables (2 columns), each with 10 observations (10 rows). The function aggregate.data.frame expects each element of the by list to have the same length as the number of observations in your variables. You are getting an error because the vector in your list has 20 values, while you only have 10 observations per variable. So, for example, you can do the following, because now you have 1 variable with 20 observations, and list1 holds a vector with 20 elements.
B <- 1:20
list1 <- list(B=c(1,1,1,1,2,2,3,3,4,4,5,5,6,6,7,7,8,8,8,8))
aggregate.data.frame(B, by=list1, FUN=mean)
The code will also work if you give it a matrix with 2 columns and 20 rows.
aggregate.data.frame(cbind(B,B), by=list1, FUN=mean)
I think this answer addresses why you are getting an error. However, I am not sure that it addresses what you are actually trying to do. How do you expect to end up with 8 rows and 2 columns? What exactly would the cells in that matrix represent?
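If the goal really is to average duplicate rows of the 10-row matrix down to 8 rows, the grouping vector needs one label per row, not per cell. The labels below are a guess at the intended grouping (two duplicated pairs, leaving 8 groups), not the questioner's list1:

```r
B <- matrix(1:20, nrow = 10, ncol = 2)

# Hypothetical per-row group labels: rows 5/6 and 9/10 are treated as
# technical duplicates, leaving 8 distinct groups
groups <- c(1, 2, 3, 4, 5, 5, 6, 7, 8, 8)

result <- aggregate(as.data.frame(B), by = list(group = groups), FUN = mean)
dim(result)  # 8 rows; 3 columns (the group label plus the 2 averaged columns)
```

Each duplicated pair is replaced by its column-wise mean; for instance, group 5 averages rows 5 and 6 of B.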

Using rnorm() to generate data sets

I need to generate a data set which contains 20 observations in each of 3 classes (60 in total) with 50 variables. I have tried to achieve this with the code below, however it produces a warning and I end up creating 2 observations of 50 variables.
data = matrix(rnorm(20*3), ncol = 50)
Warning message:
In matrix(rnorm(20 * 3), ncol = 50) :
data length [60] is not a sub-multiple or multiple of the number of columns [50]
I would like to know where I am going wrong, or even if this is the best way to generate a data set, and some explanations of possible solutions so I can better understand how to do this in the future.
The below can probably be done in fewer than my 3 lines of code, but I want to keep it simple, and I also want to use the matrix function, with which you seem to be familiar:
#for the response variable y (60 values - 3 classes 1,2,3 - 20 observations per class)
y <- rep(c(1,2,3), 20) # could use sample() instead if you want this to be random, as in docendo's answer
#for the matrix of variables x
#you need a matrix of 50 variables i.e. 50 columns and 60 rows i.e. 60x50 dimensions (=3000 table cells)
x <- matrix( rnorm(3000), ncol=50 )
#bind the 2 - y will be the first column
mymatrix <- cbind(y,x)
> dim(x) #60 rows , 50 columns
[1] 60 50
> dim(mymatrix) #60 rows, 51 columns after the addition of the y variable
[1] 60 51
Update
I just wanted to be a bit more specific about the error that you get when you try matrix in your question.
First of all rnorm(20*3) is identical to rnorm(60) and it will produce a vector of 60 values from the standard normal distribution.
When you use matrix it fills it up with values column-wise unless otherwise specified with the byrow argument. As it is mentioned in the documentation:
If one of nrow or ncol is not given, an attempt is made to infer it from the length of data and the other parameter. If neither is given, a one-column matrix is returned.
And the logical way to infer it is by the equation n * m = number_of_elements_in_matrix, where n and m are the numbers of rows and columns of the matrix respectively. In your case number_of_elements_in_matrix was 60 and the column count was 50, so the number of rows would have to be 60/50 = 1.2. A fractional number of rows makes no sense, so R rounds up to 2 rows, recycles your 60 values to fill the 100 cells, and issues the warning you saw. Since you chose 50 columns, only a multiple (or sub-multiple) of 50 will be accepted silently as the number of elements. Hope that's clear!
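The inference rule is easy to check on tiny made-up inputs: 6 values with ncol = 3 infer 2 rows cleanly, while 6 values with ncol = 4 force R to round up and recycle:

```r
# 6 / 3 = 2: the row count is inferred with no complaint
m1 <- matrix(1:6, ncol = 3)
dim(m1)  # 2 3

# 6 / 4 = 1.5: R rounds up to 2 rows and recycles 1:6 to fill 8 cells,
# emitting the "not a sub-multiple or multiple" warning
m2 <- suppressWarnings(matrix(1:6, ncol = 4))
dim(m2)   # 2 4
m2[2, 4]  # the recycled value: 2
```

The cells are filled column-wise, so the 8 positions of m2 receive 1, 2, 3, 4, 5, 6, 1, 2 in order.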

Data Manipulation, Looping to add columns

I have asked this question a couple of times without any help. I have since improved the code, so I am hoping somebody has some ideas! I have a dataset full of 0's and 1's. I simply want to add the 10 columns together, resulting in 1 column with 3835 rows. This is my code thus far:
# select for valid IDs
data = history[history$studyid %in% valid$studyid,]
sibling = data[,c('b16aa','b16ba','b16ca','b16da','b16ea','b16fa','b16ga','b16ha','b16ia','b16ja')]
# replace all NA values by 0
sibling[is.na(sibling)] <- 0
# loop over all columns and count the number of 174
apply(sibling, 2, function(x) sum(x==174))
The problem is that this code adds together all the rows within each column; I want to add together the columns so that I end up with 1 column. This is the answer I am now getting, which is wrong:
b16aa b16ba b16ca b16da b16ea b16fa b16ga b16ha b16ia b16ja
68 36 22 18 9 5 6 5 4 1
In apply() you have the MARGIN set to 2, which applies the function over columns. Set the MARGIN argument to 1, so that your function is applied across rows instead. This was mentioned by @sgibb.
If that doesn't work (I can't reproduce your example), you could first convert the elements of the matrix to logicals, X2 <- apply(sibling, c(1,2), function(x) x==174), and then use rowSums to add up the columns in each row: Xsum <- rowSums(X2, na.rm=TRUE). With this setup you do not need to first change the NA's to 0's, as you can just handle the NA's with the na.rm argument in rowSums()
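A toy version of the sibling matrix (made-up values, not the questioner's data) shows the comparison-then-rowSums approach; since sibling == 174 is already element-wise, even the apply step can be skipped:

```r
# Small stand-in for the sibling data: some 174s, other codes, and NAs
sibling <- matrix(c(174, 174,  NA,
                     10, 174, 174,
                     NA,   5,  20), nrow = 3, byrow = TRUE)

# Element-wise comparison gives a logical matrix; rowSums counts the
# TRUEs per row, and na.rm = TRUE handles the NAs directly
Xsum <- rowSums(sibling == 174, na.rm = TRUE)
Xsum  # 2 2 0
```

The result is one value per row (one column of counts), which is what the question asked for, rather than one value per column.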
