Why does the matrix function report 100 columns? - r

In this example code I define X1=matrix(rnorm(length(y)*100), nrow = length(y)); I get 97 rows, which is correct, but 100 columns.
When I multiply by 10 instead of 100, as in X1=matrix(rnorm(length(y)*10), nrow = length(y)), the number of columns is 10.
I don't understand why, since I didn't assign any value for the number of columns.
library(glmnet)
library(ncvreg)
data("prostate");
X=prostate[,1:8];
y=prostate$lpsa; #97 values
X1=matrix(rnorm(length(y)*100), nrow = length(y)); #97x100
nrow(X1); ncol(X1);
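A minimal sketch of what seems to be happening, assuming only base R: when nrow is given and ncol is not, matrix() infers the number of columns as length(data) / nrow. The vector y below is a hypothetical stand-in for the 97 lpsa values, so the sketch runs without ncvreg.

```r
# Hypothetical stand-in for prostate$lpsa (97 values)
y <- rnorm(97)

# 97 * 100 = 9700 values; with nrow = 97, ncol is inferred as 9700 / 97 = 100
X1 <- matrix(rnorm(length(y) * 100), nrow = length(y))
dim(X1)

# 97 * 10 = 970 values; the same rule gives ncol = 970 / 97 = 10
X2 <- matrix(rnorm(length(y) * 10), nrow = length(y))
dim(X2)
```

So the multiplier inside rnorm() is what sets the column count: it decides how many values there are to spread over the 97 rows.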

Related

Summing a specific vector index

I'm having trouble figuring out how vectors are formatted. I need to find the average height of participants in the cystfibr data set of the ISwR package. When printing the entire height data set it appears to be a 21x2 matrix with height values and a 1 or 2 to indicate sex. However, ncol returns NA, suggesting it is a vector. Trying to get specific indexes of the matrix (heightdata[1,]) also returns an "incorrect number of dimensions" error.
I'm looking to sum up only the height values in the vector, but when I run the code I get the sum of the male and female integers (25).
install.packages("ISwR")
library(ISwR)
attach(cystfibr)
heightdata = table(height)
print(heightdata)
print(sum(heightdata))
This is what the output looks like.
You can convert cystfibr to a data frame to find the sum of every column in the data.
install.packages("ISwR")
library(ISwR)
data <- data.frame(cystfibr) # convert to data frame format
As there is no unique identifier in the data, the sum is taken across all observations.
apply(data[, "height", drop = FALSE], 2, sum) # sum of the height column
height
3820
unlist(lapply(data , sum))
age sex height weight bmp fev1 rv frc tlc pemax
362.0 11.0 3820.0 960.1 1957.0 868.0 6380.0 3885.0 2850.0 2728.0
sapply(data, sum)
age sex height weight bmp fev1 rv frc tlc pemax
362.0 11.0 3820.0 960.1 1957.0 868.0 6380.0 3885.0 2850.0 2728.0
table gives you the count of each distinct value in the vector.
The height values themselves are stored in the names of heightdata, in character format. To sum them, convert the names to numeric first.
sum(as.numeric(names(heightdata)))
#[1] 3177
This is the same as summing the unique values of height.
sum(unique(cystfibr$height))
#[1] 3177
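To make the difference between these sums concrete, here is a small sketch on a made-up height vector (hypothetical values, not the cystfibr data). It suggests that summing the names of a table counts each distinct height once, which is not the same as summing every observation, and that mean() on the raw vector is probably what the question is after.

```r
# Hypothetical heights in cm; 150 appears twice on purpose
height <- c(150, 150, 160, 170)

counts <- table(height)          # counts per distinct value: 150 -> 2, 160 -> 1, 170 -> 1

sum(counts)                      # 4: the number of observations, not a sum of heights
sum(as.numeric(names(counts)))   # 480: each distinct height counted once
sum(height)                      # 630: every observation counted
mean(height)                     # 157.5: the average height
```

With the real data, the equivalent call would be mean(cystfibr$height), with no need for table() or attach().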

How to sample with various sample size in R?

I am trying to draw random samples of different sizes from a data frame.
For example, the first sample should have only 8 observations,
the 2nd sample can have 10 observations,
and the 3rd can have 12 observations.
df[sample(nrow(df),10 ), ]
This gives me a fixed 10 observations each time I take a sample.
In the ideal case, I have 100 observations that should be placed in 3 groups without replacement, and each group can have any number of observations. For example, group 1 has 45 observations, group 2 has 20 observations, and group 3 has 35 observations.
Any help will be appreciated.
You could try using replicate:
times_to_sample = 5L
NN = nrow(df)
replicate(times_to_sample, df[sample(NN, sample(5:10, 1L)), ], simplify = FALSE)
This will return a list of length times_to_sample, the ith element of which will give you a data.frame with the result for the ith replication.
simplify=FALSE prevents simplify2array from mangling the results into a not-particularly-useful matrix.
You should also consider adding some robustness checks -- for example, you said you want between 5 and 10 rows, but in generalizing this to be from a to b rows, you'll want to ensure a >= 1, b <= nrow(df).
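One way such checks might look, as a sketch only: sample_rows is a hypothetical helper name, and df here is made-up data. The size is drawn with sample.int() rather than sample(a:b, 1), because sample(x, 1) treats a single number x as 1:x, which would give surprising sizes when a equals b.

```r
# Hypothetical helper: draw `times` samples, each with between a and b rows
sample_rows <- function(df, a, b, times) {
  # Robustness checks on the requested size range
  stopifnot(a >= 1, a <= b, b <= nrow(df))
  replicate(
    times,
    {
      n <- a - 1L + sample.int(b - a + 1L, 1L)   # one size in a..b
      df[sample(nrow(df), n), , drop = FALSE]    # n rows without replacement
    },
    simplify = FALSE
  )
}

df <- data.frame(id = 1:100, x = rnorm(100))     # made-up data
res <- sample_rows(df, a = 5, b = 10, times = 3)
sapply(res, nrow)                                # three sizes, each between 5 and 10
```

A call such as sample_rows(df, 0, 10, 1) stops with an error instead of silently returning an empty sample.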
If times_to_sample is going to be large, it'll be more efficient to get all of the samples from 5:10 up front instead:
idx = sample(5:10, times_to_sample, replace = TRUE)
lapply(idx, function(i) df[sample(NN, i), ])
This is a little less readable, but more efficient than repeatedly calling sample(5:10, 1), which draws only one size at a time and does not leverage vectorization.
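For the "ideal case" in the question (100 observations split into 3 non-overlapping groups of 45, 20 and 35), a possible sketch is to shuffle the row indices once and then cut the shuffled rows into groups. The data frame df and the sizes vector are assumptions made up for illustration.

```r
df <- data.frame(id = 1:100, x = rnorm(100))    # made-up data with 100 rows
sizes <- c(45, 20, 35)                          # group sizes from the question
stopifnot(sum(sizes) == nrow(df))               # the groups must use every row once

shuffled <- sample(nrow(df))                    # random permutation of the row indices
group <- rep(seq_along(sizes), times = sizes)   # 45 ones, 20 twos, 35 threes
groups <- split(df[shuffled, ], group)          # list of three data frames

sapply(groups, nrow)                            # 45, 20 and 35 rows
```

Because every row index appears exactly once in the permutation, no observation can land in two groups.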

Using aggregate to get the mean of duplicate rows in a data.frame in r

I have a matrix B that is 10 rows x 2 columns:
B = matrix(c(1:20), nrow=10, ncol=2)
Some of the rows are technical duplicates, and they correspond to the same
number in a list of length 20 (list1).
list1 = c(1,1,1,1,2,2,3,3,4,4,5,5,6,6,7,7,8,8,8,8)
list1 = as.list(list1)
I would like to use this list (list1) to take the mean of any duplicate values for all columns in B such that I end up with a matrix or data.frame with 8 rows and 2 columns (all the duplicates are averaged).
Here is my code:
aggregate.data.frame(B, by=list1, FUN=mean)
And it generates this error:
Error in aggregate.data.frame(B, by = list1, FUN = mean) :
arguments must have same length
What am I doing wrong?
Thank you!
Your data have 2 variables (2 columns), each with 10 observations (10 rows). The function aggregate.data.frame expects each element of the by list to have the same length as the number of observations in your variables. You are getting an error because the vector in your list has 20 values, while you only have 10 observations per variable. For example, the following works because there is now 1 variable with 20 observations, and list1 holds a vector with 20 elements.
B <- 1:20
list1 <- list(B=c(1,1,1,1,2,2,3,3,4,4,5,5,6,6,7,7,8,8,8,8))
aggregate.data.frame(B, by=list1, FUN=mean)
The code will also work if you give it a matrix with 2 columns and 20 rows.
aggregate.data.frame(cbind(B,B), by=list1, FUN=mean)
I think this answer addresses why you are getting an error. However, I am not sure that it addresses what you are actually trying to do. How do you expect to end up with 8 rows and 2 columns? What exactly would the cells in that matrix represent?
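If the intent was one label per row, a sketch of a setup that does give 8 averaged rows could look like the following. It assumes B has 20 rows (not 10) so that it lines up with the 20 labels; the values in B are made up.

```r
# Assumed shape: 20 rows (one per label) and 2 columns
B <- matrix(1:40, nrow = 20, ncol = 2)
labels <- c(1,1,1,1,2,2,3,3,4,4,5,5,6,6,7,7,8,8,8,8)

# by must be a list whose single element has one entry per row of B
res <- aggregate(as.data.frame(B), by = list(group = labels), FUN = mean)
dim(res)    # 8 rows; 3 columns (the group label plus the two averaged columns)
```

The first column of res holds the group label; dropping it with res[, -1] leaves the 8 x 2 block of means.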

Using rnorm() to generate data sets

I need to generate a data set which contains 20 observations in 3 classes (20 observations to each of the classes - 60 in total) with 50 variables. I have tried to achieve this by using the code below, however it throws an error and I end up creating 2 observations of 50 variables.
data = matrix(rnorm(20*3), ncol = 50)
Warning message:
In matrix(rnorm(20 * 3), ncol = 50) :
data length [60] is not a sub-multiple or multiple of the number of columns [50]
I would like to know where I am going wrong, or even if this is the best way to generate a data set, and some explanations of possible solutions so I can better understand how to do this in the future.
The below can probably be done in fewer than my 3 lines of code, but I want to keep it simple and I also want to use the matrix function, with which you seem to be familiar:
#for the response variable y (60 values - 3 classes 1,2,3 - 20 observations per class)
y <- rep(c(1,2,3),20 ) #could use sample instead if you want this to be random as in docendo's answer
#for the matrix of variables x
#you need a matrix of 50 variables i.e. 50 columns and 60 rows i.e. 60x50 dimensions (=3000 table cells)
x <- matrix( rnorm(3000), ncol=50 )
#bind the 2 - y will be the first column
mymatrix <- cbind(y,x)
> dim(x) #60 rows , 50 columns
[1] 60 50
> dim(mymatrix) #60 rows, 51 columns after the addition of the y variable
[1] 60 51
Update
I just wanted to be a bit more specific about the error that you get when you try matrix in your question.
First of all rnorm(20*3) is identical to rnorm(60) and it will produce a vector of 60 values from the standard normal distribution.
When you use matrix it fills it up with values column-wise unless otherwise specified with the byrow argument. As it is mentioned in the documentation:
If one of nrow or ncol is not given, an attempt is made to infer it from the length of data and the other parameter. If neither is given, a one-column matrix is returned.
And the logical way to infer it is by the equation n * m = number_of_elements_in_matrix, where n and m are the number of rows and columns of the matrix respectively. In your case number_of_elements_in_matrix was 60 and the number of columns was 50. Therefore, the number of rows would have to be 60/50 = 1.2. A fractional number of rows doesn't make sense, so R issues the warning, rounds up to 2 rows, and recycles the data to fill the 2 x 50 matrix, which is why you end up with 2 observations of 50 variables. Since you chose 50 columns, only data lengths that are multiples of 50 fill the matrix exactly. Hope that's clear!
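The arithmetic can be checked directly. In this sketch the warning is suppressed only to keep the output quiet, and the each = 20 labelling is one possible layout (the first 20 rows in class 1, and so on), not necessarily the one you need.

```r
# 60 values cannot fill a whole number of 50-column rows:
# R warns, rounds up to 2 rows, and recycles values to fill the 100 cells
bad <- suppressWarnings(matrix(rnorm(20 * 3), ncol = 50))
dim(bad)

# 60 observations x 50 variables needs 60 * 50 = 3000 values
x <- matrix(rnorm(60 * 50), nrow = 60, ncol = 50)
dim(x)

# One possible class labelling: 20 consecutive rows per class
y <- rep(1:3, each = 20)
table(y)
```

Giving both nrow and ncol, as in the second call, makes the intended shape explicit and lets R warn if the data length does not match it.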

handling matrices with rows of unequal length in R

There are two matrices that I want to divide: numer1 and denom1. The problem is that they have different numbers of rows. The script is run every week, so the dimensions change weekly, too.
This week:
dim(numer1) = 998 rows, 99 columns
dim(denom1) = 997 rows, 99 columns.
Last week:
dim(numer1) = 999 rows, 99 columns
dim(denom1) = 998 rows, 99 columns.
Is there a way to compare these matrices and remove the last row in the larger matrix (in this example, numer1)?
Here's what I have tried:
fun1 <- as.data.frame(abs(numer1[-last(numer1),]/denom1))
Thank you!
How about this:
rows <- seq_len(min(nrow(numer1), nrow(denom1)))
frac1 <- numer1[rows,] / denom1[rows,]
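A self-contained sketch of the same idea on small made-up matrices. It assumes that the extra rows sit at the end of the larger matrix and that the remaining rows already line up one to one; if the rows need matching by name or key, this would not be enough.

```r
# Made-up matrices: numer1 has one more row than denom1
numer1 <- matrix(1:12, nrow = 4, ncol = 3)
denom1 <- matrix(1:9,  nrow = 3, ncol = 3)

# Keep only the rows both matrices have; works whichever one is larger
n <- min(nrow(numer1), nrow(denom1))
frac1 <- numer1[seq_len(n), , drop = FALSE] / denom1[seq_len(n), , drop = FALSE]
dim(frac1)    # 3 rows, 3 columns
```

drop = FALSE keeps the result a matrix even if only one row is left after trimming.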
