Forcing Rbind with uneven columns in R - r

I am trying to force some list objects (e.g. 4 tables of frequency count) into a matrix by doing rbind. However, they have uneven columns (i.e. some range from 2 to 5, while others range from 1:5). I want is to display such that if a table does not begin with a column of 1, then it displays NA in that row in the subsequent rbind matrix. I tried the approach below but the values repeat itself in the row rather than displaying NAs if is does not exist.
I considered rbind.fill but it requires for the table to be a data frame. I could create some loops but in the spirit of R, I wonder if there is another approach I could use?
# Example
a <- sample(0:5,100, replace=TRUE)
b <- sample(2:5,100, replace=TRUE)
c <- sample(1:4,100, replace=TRUE)
d <- sample(1:3,100, replace=TRUE)
list <- list(a,b,c,d)
table(list[4])
count(list[1])
matrix <- matrix(ncol=5)
lapply(list,(table))
do.call("rbind",(lapply(list,table)))

When I have a similar problem, I include all the values I want in the vector and then subtract one from the result
table(c(1:5, a)) - 1
This could be made into a function
table2 <- function(x, values, ...){
table(c(x, values), ...) - 1
}
Of course, this will give zeros rather than NA

Related

why nrow(dataframe) and length(dataframe) in r give different results?

I have a ordered data frame and want to know the number of the last row.
data_ranking <- reduced_data[order(reduced_data$outcome,reduced_data$hospital,na.last=NA),]
nobs <- nrow(data_ranking)
gives me different results of
data_ranking <- reduced_data[order(reduced_data$outcome,reduced_data$hospital,na.last=NA),]
nobs <- length(data_ranking)
I would like to understand why is that. It seems that nrowgives me the answer I'm looking for, but I don't understand why.
data frames are essentially lists where each element has the same length.
Each element of the list is a column, hence length gives you the length of the list, usually the number of columns.
nrow will give you the number of rows, ncol (or length) the number of columns.
The obvious equivalence of columns and list lengths gets messy once we have nonstandard structures within the data.frame (eg. matrices) and
x <- data.frame(y=1:5, z = matrix(1:10,ncol=2))
ncol(x)
# 3
length(x)
# 3
x1 <- data.frame(y=1:5, z = I(matrix(1:10,ncol=2)))
ncol(x1)
# 2
length(x)
# 2

How to compare two columns in different data.frames within R

I am working on my first real project within R and ran into a problem. I am trying to compare 2 columns within 2 different data.frames. I tried running the code,
matrix1 = matrix
for (i in 1:2000){
if(data.QW[i,1] == data.RS[i,1]){
matrix1[i,1]== "True"
}
else{
matrix1[i,1]== "False"
}
}
I got this error:
Error in Ops.factor(data.QW[i,1], data.RS[i,1]) :
level sets of factors are different
I think this may be because QW and RS have different row lengths. But I am trying to see where these errors might be within the different data.frames and fix them according to the source document.
I am also unsure if matrix will work for this or if I need to make it into a vector and rbind it into the matrix every time.
Any good readings on this would also be appreciated.
As mentioned in the comments, providing a reproducible example with the contents of the dataframe will be helpful.
Going by how the question topic sounds, it appears that you want to compare column 1 of data frame A against column 1 of data frame B and store the result in a logical vector. If that summary is accurate, please take a look here.
Too long for a comment.
Some observations:
Your columns, data.QW[,1] and data.RS[,1] are almost certainly factors.
The factors almost certainly have different set of levels (it's possible that one of the factors has a subset of the levels in the other factor). When this happens, comparisons using == will not work.
If you read your data into these data.frames using something like read.csv(...) any columns containing character data were converted to factors by default. You can change that behavior by setting stringsAsFactors=FALSE in the call to read.csv(...). This is a very common problem.
Once you've sorted out the factors/levels problem, you can avoid the loop by using, simply: data.QW[1:2000,1]==data.RW[1:2000,1]. This will create a vector of length 2000 containing all the comparisons. No loop needed. Of course this assumes that both data.frames have at least 2000 rows.
Here's an example of item 2:
x <- as.factor(rep(LETTERS[1:5],3)) # has levels: A, B, C, D, E
y <- as.factor(rep(LETTERS[1:3],5)) # has levels: A, B, C
y==x
# Error in Ops.factor(y, x) : level sets of factors are different
The below function compare compares data.frames or matrices a,b to find row matches of a in b. It returns the first row position in b which matches (after some internal sorting required to speed thinks up). Rows in a which have no match in b will have a return value of 0. Should handle numeric, character and factor column types and mixtures thereof (the latter for data.frames only). Check the example below the function definition.
compare<-function(a,b){
#################################################
if(dim(a)[2]!=dim(b)[2]){
stop("\n Matrices a and b have different number of columns!")
}
if(!all(sapply(a, class)==sapply(b, class))){
stop("\n Matrices a and b have incomparable column data types!")
}
#################################################
if(is.data.frame(a)){
i <- sapply(a, is.factor)
a[i] <- lapply(a[i], as.character)
}
if(is.data.frame(b)){
i <- sapply(b, is.factor)
b[i] <- lapply(b[i], as.character)
}
len1<-dim(a)[1]
len2<-dim(b)[1]
ord1<-do.call(order,as.data.frame(a))
a<-a[ord1,]
ord2<-do.call(order,as.data.frame(b))
b<-b[ord2,]
#################################################
found<-rep(0,len1)
dims<-dim(a)[2]
do_dims<-c(1:dim(a)[2])
at<-1
for(i in 1:len1){
for(m in do_dims){
while(b[at,m]<a[i,m]){
at<-(at+1)
if(at>len2){break}
}
if(at>len2){break}
if(b[at,m]>a[i,m]){break}
if(m==dims){found[i]<-at}
}
if(at>len2){break}
}
#################################################
found<-found[order(ord1)]
found<-ord2[found]
return(found)
}
# example data sets:
ncols<-10
nrows<-1E4
a <- matrix(sample(LETTERS,size = (ncols*nrows), replace = T), ncol = ncols, nrow = nrows)
b <- matrix(sample(LETTERS,size = (ncols*nrows), replace = T), ncol = ncols, nrow = nrows)
b <- rbind(a,b) # example of b containing a
b <- b[sample(dim(b)[1],dim(b)[1],replace = F),]
found<-compare(a,b)
a<-as.data.frame(a) # = conversion to factors
b<-as.data.frame(b) # = conversion to factors
found<-compare(a,b)

From randomly to not randomly selecting columns

I have this piece of script for R and I want to adjust it a little bit.
Here's the script I have, mydata is an imported .csv file of n columns:
library(orddom)
R=6
delta = numeric (R)
for (i in 1:R) {`
a <- data.matrix(sample(mydata, 2, replace=FALSE))
drops <- c(colnames(a))
b <- data.matrix(mydata[,!(names(mydata) %in% drops)])
a1 <- na.omit(t(matrix(a,1)))
b1 <- na.omit(t(matrix(b,1)))
colnames(a1) <- c("Group 1")
colnames(b1) <- c("Group 2")
delta [i] <- abs(as.numeric(orddom(a1, b1, alpha = 0.05, paired=FALSE)[13,1]))
The problem is that for vector a, the columns of mydata get resampled randomly, leading to several equal delta values, because every time the iterative process start again there is a possibility that the same set of columns get selected.
Now I want the columns to be not randomly resampled. So I want all the possible column combinations, column 1 and 2 and 3 is the same combination as column 2 and 1 and 3 and so on, avoiding combinations of one column with itself, without repetition.
Is there a way to exclude column combinations that have already been selected before?
Then I would like to calculate delta for every combination and store it in a vector.
orddom: Ordinal Dominance Statistics
You can try the following:
#get the combos outside the loop
combos<-combn(length(mydata),2)
R<-ncol(combos)
delta<-numeric(R)
#in the loop, replace the first line
a <- mydata[,combos[,i]]
#the rest should be ok
There are some improvements you could make in the code but they are not relevant in what you are asking.

How to apply operation and sum over columns in R?

I want to apply some operations to the values in a number of columns, and then sum the results of each row across columns. I can do this using:
x <- data.frame(sample=1:3, a=4:6, b=7:9)
x$a2 <- x$a^2
x$b2 <- x$b^2
x$result <- x$a2 + x$b2
but this will become arduous with many columns, and I'm wondering if anyone can suggest a simpler way. Note that the dataframe contains other columns that I do not want to include in the calculation (in this example, column sample is not to be included).
Many thanks!
I would simply subset the columns of interest and apply everything directly on the matrix using the rowSums function.
x <- data.frame(sample=1:3, a=4:6, b=7:9)
# put column indices and apply your function
x$result <- rowSums(x[,c(2,3)]^2)
This of course assumes your function is vectorized. If not, you would need to use some apply variation (which you are seeing many of). That said, you can still use rowSums if you find it useful like so. Note, I use sapply which also returns a matrix.
# random custom function
myfun <- function(x){
return(x^2 + 3)
}
rowSums(sapply(x[,c(2,3)], myfun))
I would suggest to convert the data set into the 'long' format, group it by sample, and then calculate the result. Here is the solution using data.table:
library(data.table)
melt(setDT(x),id.vars = 'sample')[,sum(value^2),by=sample]
# sample V1
#1: 1 65
#2: 2 89
#3: 3 117
You can easily replace value^2 by any function you want.
You can use apply function. And get those columns that you need with c(i1,i2,..,etc).
apply(( x[ , c(2, 3) ])^2, 1 ,sum )
If you want to apply a function named somefunction to some of the columns, whose indices or colnames are in the vector col_indices, and then sum the results, you can do :
# if somefunction can be vectorized :
x$results<-apply(x[,col_indices],1,function(x) sum(somefunction(x)))
# if not :
x$results<-apply(x[,col_indices],1,function(x) sum(sapply(x,somefunction)))
I want to come at this one from a "no extensions" R POV.
It's important to remember what kind of data structure you are working with. Data frames are actually lists of vectors--each column is itself a vector. So you can you the handy-dandy lapply function to apply a function to the desired column in the list/data frame.
I'm going to define a function as the square as you have above, but of course this can be any function of any complexity (so long as it takes a vector as an input and returns a vector of the same length. If it doesn't, it won't fit into the original data.frame!
The steps below are extra pedantic to show each little bit, but obviously it can be compressed into one or two steps. Note that I only retain the sum of the squares of each column, given that you might want to save space in memory if you are working with lots and lots of data.
create data; define the function
grab the columns you want as a separate (temporary) data.frame
apply the function to the data.frame/list you just created.
lapply returns a list, so if you intend to retain it seperately make it a temporary data.frame. This is not necessary.
calculate the sums of the rows of the temporary data.frame and append it as a new column in x.
remove the temp data.table.
Code:
x <- data.frame(sample=1:3, a=4:6, b=7:9); square <- function(x) x^2 #step 1
x[2:3] #Step 2
temp <- data.frame(lapply(x[2:3], square)) #step 3 and step 4
x$squareRowSums <- rowSums(temp) #step 5
rm(temp) #step 6
Here is an other apply solution
cols <- c("a", "b")
x <- data.frame(sample=1:3, a=4:6, b=7:9)
x$result <- apply(x[, cols], 1, function(x) sum(x^2))

Creating a matrix from operations done on another dataset

I have a large dataset, X with 58140 columns, filled with either 1 or 0
I would like to create a 58139 x 58139 matrix from the information of the 58139 columns in the dataset.
For each Aij in the matrix I would like to find the number of common rows which contain the value 1 for Column i+1 and Column J+1 from X.
I figured I can do this through sum(X[[2]]+X[[3]] == 2) for the A12 element of the matrix.
The only problem left is a way to code the matrix in.
You can use mapply. That returns a numeric vector. Then you can just wrap it in a call to matrix and ignore the first row and column.
# sample data
set.seed(123)
X <- data.frame(matrix(rbinom(200, 1, .5), nrow=10))
#
A <- matrix(mapply(function(i, j) sum(rowSums(X[, c(i,j)])==2),
i=rep(1:ncol(X), ncol(X)),
j=rep(1:ncol(X), each=ncol(X))),
ncol=ncol(X))[-1, -1]
A

Resources