I have an ordered data frame and want to know the number of the last row.
data_ranking <- reduced_data[order(reduced_data$outcome,reduced_data$hospital,na.last=NA),]
nobs <- nrow(data_ranking)
gives me a different result from
data_ranking <- reduced_data[order(reduced_data$outcome,reduced_data$hospital,na.last=NA),]
nobs <- length(data_ranking)
I would like to understand why that is. It seems that nrow gives me the answer I'm looking for, but I don't understand why.
Data frames are essentially lists in which every element has the same length.
Each element of the list is a column, hence length gives you the length of the list, usually the number of columns.
nrow will give you the number of rows, ncol (or length) the number of columns.
The obvious equivalence of columns and list length gets messy once we have nonstandard structures within the data.frame (e.g. matrices):
x <- data.frame(y=1:5, z = matrix(1:10,ncol=2))
ncol(x)
# 3
length(x)
# 3
x1 <- data.frame(y=1:5, z = I(matrix(1:10,ncol=2)))
ncol(x1)
# 2
length(x1)
# 2
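To tie this back to the question, here is a minimal sketch on a toy data frame. The values in reduced_data are made up for illustration; only the column names come from the question.

```r
# Hypothetical stand-in for the asker's data: 4 rows, 2 columns, one NA outcome
reduced_data <- data.frame(
  outcome  = c(3.2, NA, 1.5, 2.7),
  hospital = c("B", "A", "C", "D")
)

# na.last = NA makes order() drop rows whose sort keys contain NA
data_ranking <- reduced_data[order(reduced_data$outcome,
                                   reduced_data$hospital,
                                   na.last = NA), ]

nrow(data_ranking)    # number of rows left after dropping the NA row
length(data_ranking)  # number of list elements, i.e. columns
```

So nrow is the right tool for "the number of the last row", while length only counts columns.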
I am trying to force some list objects (e.g. 4 tables of frequency counts) into a matrix using rbind. However, they have uneven columns (i.e. some range from 2 to 5, while others range from 1 to 5). What I want is that, if a table does not begin with a column for 1, the resulting rbind matrix shows NA in that row for the missing column. I tried the approach below, but the values are recycled along the row instead of NA being shown where a value does not exist.
I considered rbind.fill, but it requires the tables to be data frames. I could write some loops, but in the spirit of R, I wonder if there is another approach I could use?
# Example
a <- sample(0:5,100, replace=TRUE)
b <- sample(2:5,100, replace=TRUE)
c <- sample(1:4,100, replace=TRUE)
d <- sample(1:3,100, replace=TRUE)
list <- list(a,b,c,d)
table(list[4])
count(list[1])
matrix <- matrix(ncol=5)
lapply(list,(table))
do.call("rbind",(lapply(list,table)))
When I have a similar problem, I append all the values I want counted to the vector and then subtract one from each count:
table(c(1:5, a)) - 1
This could be made into a function
table2 <- function(x, values, ...){
table(c(x, values), ...) - 1
}
Of course, this will give zeros rather than NA.
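A minimal sketch of how table2 could be applied to the four samples from the question. The list name samples and the seed are illustrative assumptions, not part of the original code.

```r
# table2 as defined above, restated so the sketch runs on its own
table2 <- function(x, values, ...) {
  table(c(x, values), ...) - 1
}

set.seed(42)  # arbitrary seed, only for reproducibility
samples <- list(
  a = sample(0:5, 100, replace = TRUE),
  b = sample(2:5, 100, replace = TRUE),
  c = sample(1:4, 100, replace = TRUE),
  d = sample(1:3, 100, replace = TRUE)
)

# Every table now has the same six columns (0 to 5), so rbind lines them up
counts <- do.call(rbind, lapply(samples, table2, values = 0:5))

# Optional: turn zero counts into NA, as the question asked
counts[counts == 0] <- NA
counts
```

Note that the last step also turns a genuine zero (a value that was possible but never drawn) into NA, so it may be safer to keep the zeros if that distinction matters.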
I have a data frame which has 7 variables with 500 observations each.
I generated all the subsets of this dataset's variables with a powerset function. Now I have a list of 128 subsets, each of a different size; in other words, 128 different datasets.
How can I split these 128 subsets out of the list?
Here is what I have so far;
#data generation part
x1=rnorm(n=500, m=2, sd=1);
x2=rbinom(n=500, 1 , 0.6);
y=rbinom(n=500, 1 , 0.7);
r1=rbinom(n=500, 1 , 0.65);
x1x1=x1*x1;
x1x2=x1*x2;
x1y=x1*y;
x2y=x2*y;
s=rbind(x1,x2,y,x1x1,x1x2,x1y,x2y);
sdata<-data.frame(t(s));
#getting subsets of 7 variables as a list
len = length(sdata)
l = vector(mode="list",length=2^len) ; l[[1]]=numeric()
counter = 1L
for(x in 1L:length(sdata)){
for(subset in 1L:counter){
counter=counter+1L
id=rep(l[[counter]], nrow(l[[counter]]))
l[[counter]] = data.frame(l[[subset]],sdata[x])
}
}
So "l" is a list that contains 128 elements. Each element is a matrix of a different size. I want them all split apart. I tried to add an "id" vector to every element, but I couldn't make it work. If I could add an id vector to every element, I could split them by id.
The expected outcome is 128 different data frames (subsets) of different sizes. I want them to be separate.
Do you have any suggestions or a different idea for splitting this list?
I am going to ignore the trivial (null) subset, and consider that you want the other 127 combinations.
# This is your data
sdata <- matrix(1:(500*7), 500, 7)
# We generate all the possible combinations (127 cases)
sComb <- unlist(lapply(1:7, function(n) combn(1:7, n, simplify = FALSE)), recursive = FALSE)
# And then we create all the possible datasets
l <- lapply(sComb, function(i) sdata[,i])
Hope this helps
Edit:
If you want to preserve the column names (and keep every element of the list as a matrix), change the last line to this one:
l <- lapply(sComb, function(i){ x <- matrix(sdata[,i], nrow(sdata)); colnames(x) <- colnames(sdata)[i]; x })
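A possible variant of the same idea, sketched under the assumption that sdata has column names (here taken from the question's variables). It uses drop = FALSE instead of rebuilding each matrix, and labels every list element with the columns it contains; the "+" separator is an arbitrary choice.

```r
# Stand-in data: 500 rows, 7 named columns
sdata <- matrix(1:(500 * 7), 500, 7,
                dimnames = list(NULL, c("x1", "x2", "y", "x1x1", "x1x2", "x1y", "x2y")))

# All 127 non-empty column combinations
sComb <- unlist(lapply(1:7, function(n) combn(1:7, n, simplify = FALSE)),
                recursive = FALSE)

# drop = FALSE keeps single-column subsets as matrices and preserves column names
l <- lapply(sComb, function(i) sdata[, i, drop = FALSE])

# Label each subset by the columns it contains, e.g. "x1+y"
names(l) <- vapply(sComb,
                   function(i) paste(colnames(sdata)[i], collapse = "+"),
                   character(1))

length(l)
dim(l[["x1+y"]])
```

With named elements, each subset can be pulled out individually (l[["x1+y"]]) without creating 127 separate objects in the workspace.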
I am working on my first real project within R and ran into a problem. I am trying to compare 2 columns within 2 different data.frames. I tried running the code,
matrix1 = matrix
for (i in 1:2000){
if(data.QW[i,1] == data.RS[i,1]){
matrix1[i,1]== "True"
}
else{
matrix1[i,1]== "False"
}
}
I got this error:
Error in Ops.factor(data.QW[i,1], data.RS[i,1]) :
level sets of factors are different
I think this may be because QW and RS have different row lengths. But I am trying to see where these errors might be within the different data.frames and fix them according to the source document.
I am also unsure if matrix will work for this or if I need to make it into a vector and rbind it into the matrix every time.
Any good readings on this would also be appreciated.
As mentioned in the comments, providing a reproducible example with the contents of the dataframe will be helpful.
Going by how the question topic sounds, it appears that you want to compare column 1 of data frame A against column 1 of data frame B and store the result in a logical vector. If that summary is accurate, please take a look here.
Too long for a comment.
Some observations:
1. Your columns, data.QW[,1] and data.RS[,1], are almost certainly factors.
2. The factors almost certainly have different sets of levels (it's possible that one of the factors has a subset of the levels in the other factor). When this happens, comparisons using == will not work.
3. If you read your data into these data.frames using something like read.csv(...), any columns containing character data were converted to factors by default (this is the default in R versions before 4.0.0). You can change that behavior by setting stringsAsFactors=FALSE in the call to read.csv(...). This is a very common problem.
4. Once you've sorted out the factors/levels problem, you can avoid the loop by using, simply: data.QW[1:2000,1]==data.RS[1:2000,1]. This will create a vector of length 2000 containing all the comparisons. No loop needed. Of course, this assumes that both data.frames have at least 2000 rows.
Here's an example of item 2:
x <- as.factor(rep(LETTERS[1:5],3)) # has levels: A, B, C, D, E
y <- as.factor(rep(LETTERS[1:3],5)) # has levels: A, B, C
y==x
# Error in Ops.factor(y, x) : level sets of factors are different
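One way around the error, sketched on the same x and y: convert both factors to character before comparing. The variable name same is only illustrative.

```r
x <- as.factor(rep(LETTERS[1:5], 3))  # levels: A, B, C, D, E
y <- as.factor(rep(LETTERS[1:3], 5))  # levels: A, B, C

# Comparing the labels as character vectors sidesteps the level-set check
same <- as.character(x) == as.character(y)

which(!same)  # positions where the two vectors disagree
```

The same pattern should carry over to the question's columns, e.g. as.character(data.QW[1:2000,1]) == as.character(data.RS[1:2000,1]), assuming both data frames have at least 2000 rows.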
The function compare below compares data.frames or matrices a and b to find row matches of a in b. It returns the first row position in b which matches (after some internal sorting required to speed things up). Rows in a which have no match in b will have a return value of 0. It should handle numeric, character and factor column types and mixtures thereof (the latter for data.frames only). Check the example below the function definition.
compare<-function(a,b){
#################################################
if(dim(a)[2]!=dim(b)[2]){
stop("\n Matrices a and b have different number of columns!")
}
if(!all(sapply(a, class)==sapply(b, class))){
stop("\n Matrices a and b have incomparable column data types!")
}
#################################################
if(is.data.frame(a)){
i <- sapply(a, is.factor)
a[i] <- lapply(a[i], as.character)
}
if(is.data.frame(b)){
i <- sapply(b, is.factor)
b[i] <- lapply(b[i], as.character)
}
len1<-dim(a)[1]
len2<-dim(b)[1]
ord1<-do.call(order,as.data.frame(a))
a<-a[ord1,]
ord2<-do.call(order,as.data.frame(b))
b<-b[ord2,]
#################################################
found<-rep(0,len1)
dims<-dim(a)[2]
do_dims<-c(1:dim(a)[2])
at<-1
for(i in 1:len1){
for(m in do_dims){
while(b[at,m]<a[i,m]){
at<-(at+1)
if(at>len2){break}
}
if(at>len2){break}
if(b[at,m]>a[i,m]){break}
if(m==dims){found[i]<-at}
}
if(at>len2){break}
}
#################################################
found<-found[order(ord1)]
# ord2[0] would silently drop unmatched rows, so map only the actual hits
hit<-found>0
found[hit]<-ord2[found[hit]]
return(found)
}
# example data sets:
ncols<-10
nrows<-1E4
a <- matrix(sample(LETTERS,size = (ncols*nrows), replace = T), ncol = ncols, nrow = nrows)
b <- matrix(sample(LETTERS,size = (ncols*nrows), replace = T), ncol = ncols, nrow = nrows)
b <- rbind(a,b) # example of b containing a
b <- b[sample(dim(b)[1],dim(b)[1],replace = F),]
found<-compare(a,b)
a<-as.data.frame(a) # = conversion to factors
b<-as.data.frame(b) # = conversion to factors
found<-compare(a,b)
I would like to create ncol(y) number of matrices taking each column from y matrix and replicating it rep number of times. I am not doing the for loop right though. To reiterate, below I would like to get three separate matrices, the first one would have values of 1 to 100 repeated 200 times (they come from the first columns of y), second would have vector 101-200 repeated 200 times (2nd column of y) as well and the third one would have values 201-300 repeated 200 times (3rd column of y). Preferably the output name would be matrix1, matrix2 or a list.
y <- matrix(1:300,100,3)
rep = 200
for (i in 1:ncol(y)) {
newmatrix <- replicate(rep,y[,i])
valuematrix[[i]] <- newmatrix
}
You're missing the initialization of valuematrix. You can do this through
valuematrix <- list()
just before the for loop.
You might also consider using lapply to solve this problem. It automatically stores the matrices in a list.
y <- matrix(1:300, 100, 3)
rep = 200
matList <- lapply(1:ncol(y), function(i) replicate(rep, y[,i]))
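As a quick sanity check, the list elements can be named and inspected. This is a small sketch: n_rep is a hypothetical replacement name for rep, chosen so that the base function rep is not masked.

```r
y <- matrix(1:300, 100, 3)
n_rep <- 200  # hypothetical name; plays the role of `rep` above

matList <- lapply(seq_len(ncol(y)), function(i) replicate(n_rep, y[, i]))
names(matList) <- paste0("matrix", seq_along(matList))

dim(matList$matrix2)    # rows x columns of the second matrix
range(matList$matrix2)  # smallest and largest value it contains
```

Each element is a 100 x 200 matrix whose columns are all copies of the corresponding column of y.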
I have two dataframes as follows:
set.seed(1)
X <- data.frame(matrix(rnorm(2000), nrow=10))
where the rows represent the genes and the columns are the genotypes.
For each round of bootstrapping (n=1000), genotypes should be selected at random without replacement from this dataset (X) to form two groups (X' should have 5 genotypes and Y' should have 5 genotypes). In the end I will have a thousand such pairs of datasets X' and Y', each containing 5 random genotypes from the full expression dataset.
I tried using replicate and apply, but it did not work.
B <- 1000
replicate(B, apply(X, 2, sample, replace = FALSE))
I think it might make more sense for you to first select the column numbers, 10 from 200 without replacement (five for each X' and Y'):
colnums_boot <- replicate(1000,sample.int(200,10))
From there, as you evaluate each iteration, i from 1 to 1000, you can grab
Xprime <- X[,colnums_boot[1:5,i]]
Yprime <- X[,colnums_boot[6:10,i]]
This saves you from making a 3-dimensional array (the generalization of matrix in R).
Also, if speed is a concern, I think it would be much faster to leave X as a matrix instead of a data frame. Maybe someone else can comment on that.
EDIT: Here's a way to grab them all up-front (in a pair of three-dimensional arrays):
Z <- as.matrix(X)
Xprimes <- array(,dim=c(10,5,1000))
Xprimes[] <- Z[,colnums_boot[1:5,]]
Yprimes <- array(,dim=c(10,5,1000))
Yprimes[] <- Z[,colnums_boot[6:10,]]
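A short check, under the same setup, that the up-front array agrees with the per-iteration selection. The seed and the iteration index i are arbitrary choices for illustration.

```r
set.seed(1)
Z <- matrix(rnorm(2000), nrow = 10)                   # 10 genes x 200 genotypes
colnums_boot <- replicate(1000, sample.int(200, 10))  # 10 x 1000 column indices

Xprimes <- array(NA_real_, dim = c(10, 5, 1000))
Xprimes[] <- Z[, colnums_boot[1:5, ]]

# Slice i of the array should equal the selection made for iteration i
i <- 17  # arbitrary iteration picked for the check
identical(Xprimes[, , i], Z[, colnums_boot[1:5, i]])

# Within one iteration no genotype is drawn twice, so X' and Y' never overlap
anyDuplicated(colnums_boot[, i]) == 0
```

The filling works because both Z[, colnums_boot[1:5, ]] and the array are stored column by column, so each block of five columns lands in one slice.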