as.data.frame and cbind results in factor columns - r

I have a big data.frame with a mix of integer, numeric and character columns. I need to order the data.frame by a numeric column.
When I combine the original columns into a data.frame all the columns change to factor, including the column I need for the sort. So the sort gives something like 1, 10, 100... instead of 1, 2, 3...
Here is an example of my problem.
a <- 1:10
b <- c(1,3,5,6,2,10,100,110,7,4)
c <- LETTERS[1:10]
d <- as.data.frame(cbind(a, b, c)) # I am using this construction
e <- d[with(d, order(b)), ]
How can I fix this?

Actually you need to do:
d <- data.frame(a, b, c, stringsAsFactors=FALSE)
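The last part, stringsAsFactors=FALSE, prevents column d$c from being converted to a factor. Include it, and your strings will stay as strings. The original construction fails because cbind() first builds a matrix, which can hold only one type, so the numbers are coerced to character before as.data.frame() ever sees them. Passing the vectors straight to data.frame() keeps each column's own type. (Since R 4.0.0, stringsAsFactors=FALSE is the default, but being explicit does no harm.)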
Don't forget stringsAsFactors=FALSE - it will save you untold misery, trust me!
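To make the coercion visible, here is a minimal sketch that reuses the vectors from the question. It checks the type of the intermediate matrix and then sorts the correctly typed data frame; the variable name m is introduced only for illustration.

```r
a <- 1:10
b <- c(1, 3, 5, 6, 2, 10, 100, 110, 7, 4)
c <- LETTERS[1:10]

# cbind() builds a matrix, and a matrix holds a single type,
# so the numbers are coerced to character here.
m <- cbind(a, b, c)
typeof(m)  # "character"

# data.frame() keeps each vector's own type.
d <- data.frame(a, b, c, stringsAsFactors = FALSE)
str(d)

# Ordering now compares numbers, not strings.
e <- d[order(d$b), ]
e$b
```

With numeric b, order() gives 1, 2, 3, ... rather than the string ordering 1, 10, 100, ...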

Related

How to assign the output of a sapply loop to the original columns in a data frame without losing other columns

I have a data frame whose columns hold string answers from different assessors, who used upper and lower case inconsistently in their answers. I want to convert everything to lower case. I have code that works as follows:
# Creating a reproducible data frame similar to what I am working with
dfrm <- data.frame(a = sample(names(islands))[1:20],
b = sample(unname(islands))[1:20],
c = sample(names(islands))[1:20],
d = sample(unname(islands))[1:20],
e = sample(names(islands))[1:20],
f = sample(unname(islands))[1:20],
g = sample(names(islands))[1:20],
h = sample(unname(islands))[1:20])
# This is how I did it originally by writing everything explicitly:
dfrm1 <- dfrm
dfrm1$a <- tolower(dfrm1$a)
dfrm1$c <- tolower(dfrm1$c)
dfrm1$e <- tolower(dfrm1$e)
dfrm1$g <- tolower(dfrm1$g)
head(dfrm1) #Works as intended
The problem is that as the number of assessors increases, I keep making copy-paste errors. I tried to simplify my code by writing a function around tolower and using sapply to loop over the columns, but the final data frame does not look like what I wanted:
# function and sapply:
dfrm2 <- dfrm
my_list <- c("a", "c", "e", "g")
my_low <- function(x){dfrm2[,x] <- tolower(dfrm2[,x])}
sapply(my_list, my_low) #Didn't work
# Alternative approach:
dfrm2 <- as.data.frame(sapply(my_list, my_low))
head(dfrm2) #Lost the numbers
What am I missing?
I know this must be a very basic concept that I'm not getting. There was this question and answer that I simply couldn't follow, and this one where my non-working solution simply seems to work. Any help appreciated, thanks!
Maybe you want to create a logical vector that selects the columns to change and run an apply function only over those columns.
# only choose non-numeric columns
changeCols <- !sapply(dfrm, is.numeric)
# change values of selected columns to lower case
dfrm[changeCols] <- lapply(dfrm[changeCols], tolower)
If you have other types of columns, say logical, you could also be more explicit about the types of columns that you want to change. For example, to select only factor and character columns, use:
changeCols <- sapply(dfrm, function(x) is.factor(x) | is.character(x))
For your first attempt, if you want the assignments to your data frame dfrm2 to stick, use the <<- assignment operator:
my_low <- function(x){ dfrm2[,x] <<- tolower(dfrm2[,x]) }
sapply(my_list, my_low)
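If you would rather not modify a global object from inside a function, a possible alternative is to assign the result of lapply back to the named columns. This is only a sketch: the small data frame below is hypothetical stand-in data, not the one from the question.

```r
# Hypothetical stand-in data: two text columns and one numeric column.
dfrm2 <- data.frame(
  a = c("Honshu", "BORNEO"),
  b = c(89, 280),
  c = c("Java", "CUBA"),
  stringsAsFactors = FALSE
)

# Names of the columns to convert.
my_list <- c("a", "c")

# lapply() returns one lower-cased vector per selected column;
# assigning into dfrm2[my_list] replaces only those columns.
dfrm2[my_list] <- lapply(dfrm2[my_list], tolower)

dfrm2
```

The numeric column b is left untouched because it is never selected.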

Count with conditional - dataframe

I would like to count how many times an observation appears, subject to the condition that one column is greater than another.
For example, how many times "A", "B" and "C" appear, counting only the rows where column B is greater than column C.
set.seed(20170524)
A <- rep(c("A","B","C"),5)
B <- round(runif(15,0,20),0)
C <- round(runif(15,1,5),0) + B
D <- as.data.frame(cbind(A,B,C))
D <- D[order(B),]
Thank you!
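# First, those numbers got converted to factors, which is problematic.
# Convert via as.character(): as.numeric() on a factor returns the level codes, not the values.
D$B<-as.numeric(as.character(D$B))
D$C<-as.numeric(as.character(D$C))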
#Then, get the counts for the A:
countA = sum(D$A=='A' & D$B < D$C) # use D$B > D$C for the condition as worded in the question
Similarly for 'B' and 'C'
If there are many more categories than just "A", "B" and "C", you might want to use data.table for its by= option, but someone will probably be along to say that's overkill.
You can use: table(D$A[which(D$B>D$C)])
Note that D <- as.data.frame(cbind(A,B,C)) goes through a character matrix, so B and C end up as factors (or as character strings in R 4.0.0 and later). Either convert B and C to numeric afterwards, or create the data.frame directly without passing through a matrix:
D <- data.frame(A,B,C)
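Once the columns are numeric, the per-category count can be done in one call. The following is a sketch on a small hypothetical data frame (the values are made up for illustration) that uses tapply to sum the logical condition within each level of A.

```r
# Hypothetical data with correctly typed columns.
D <- data.frame(
  A = rep(c("A", "B", "C"), 2),
  B = c(5, 1, 7, 2, 9, 3),
  C = c(3, 4, 6, 8, 2, 1),
  stringsAsFactors = FALSE
)

# TRUE counts as 1, so summing the condition within each group
# gives the number of rows where B is greater than C.
counts <- tapply(D$B > D$C, D$A, sum)
counts
```

Unlike table() on a filtered vector, this keeps categories whose count is zero.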

How to compare two columns in different data.frames within R

I am working on my first real project within R and ran into a problem. I am trying to compare 2 columns within 2 different data.frames. I tried running the code,
matrix1 = matrix
for (i in 1:2000){
if(data.QW[i,1] == data.RS[i,1]){
matrix1[i,1]== "True"
}
else{
matrix1[i,1]== "False"
}
}
I got this error:
Error in Ops.factor(data.QW[i,1], data.RS[i,1]) :
level sets of factors are different
I think this may be because QW and RS have different row lengths. But I am trying to see where these errors might be within the different data.frames and fix them according to the source document.
I am also unsure if matrix will work for this or if I need to make it into a vector and rbind it into the matrix every time.
Any good readings on this would also be appreciated.
As mentioned in the comments, providing a reproducible example with the contents of the dataframe will be helpful.
Going by how the question topic sounds, it appears that you want to compare column 1 of data frame A against column 1 of data frame B and store the result in a logical vector. If that summary is accurate, please take a look here.
Too long for a comment.
Some observations:
1. Your columns, data.QW[,1] and data.RS[,1], are almost certainly factors.
2. The factors almost certainly have different sets of levels (it's possible that one of the factors has a subset of the levels in the other factor). When this happens, comparisons using == will not work.
3. If you read your data into these data.frames using something like read.csv(...), any columns containing character data were converted to factors by default (this was the default before R 4.0.0). You can change that behavior by setting stringsAsFactors=FALSE in the call to read.csv(...). This is a very common problem.
4. Once you've sorted out the factors/levels problem, you can avoid the loop by using, simply: data.QW[1:2000,1]==data.RS[1:2000,1]. This will create a vector of length 2000 containing all the comparisons. No loop needed. Of course this assumes that both data.frames have at least 2000 rows.
Here's an example of item 2:
x <- as.factor(rep(LETTERS[1:5],3)) # has levels: A, B, C, D, E
y <- as.factor(rep(LETTERS[1:3],5)) # has levels: A, B, C
y==x
# Error in Ops.factor(y, x) : level sets of factors are different
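One possible way around the error, sketched here with the same x and y, is to compare the underlying strings rather than the factors. The name same is introduced only for this illustration.

```r
x <- as.factor(rep(LETTERS[1:5], 3))  # levels: A, B, C, D, E
y <- as.factor(rep(LETTERS[1:3], 5))  # levels: A, B, C

# as.character() drops the level sets, so == compares plain strings.
same <- as.character(x) == as.character(y)

sum(same)    # number of positions where the two vectors agree
which(same)  # the positions themselves
```

For data frame columns, the same idea would apply to as.character(data.QW[,1]) and as.character(data.RS[,1]), assuming both columns have the same length.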
The function compare below compares data.frames or matrices a and b to find row matches of a in b. It returns the first row position in b which matches (after some internal sorting required to speed things up). Rows in a which have no match in b will have a return value of 0. It should handle numeric, character and factor column types and mixtures thereof (the latter for data.frames only). Check the example below the function definition.
compare<-function(a,b){
  #################################################
  if(dim(a)[2]!=dim(b)[2]){
    stop("\n Matrices a and b have different number of columns!")
  }
  if(!all(sapply(a, class)==sapply(b, class))){
    stop("\n Matrices a and b have incomparable column data types!")
  }
  #################################################
  if(is.data.frame(a)){
    i <- sapply(a, is.factor)
    a[i] <- lapply(a[i], as.character)
  }
  if(is.data.frame(b)){
    i <- sapply(b, is.factor)
    b[i] <- lapply(b[i], as.character)
  }
  len1<-dim(a)[1]
  len2<-dim(b)[1]
  ord1<-do.call(order,as.data.frame(a))
  a<-a[ord1,]
  ord2<-do.call(order,as.data.frame(b))
  b<-b[ord2,]
  #################################################
  found<-rep(0,len1)
  dims<-dim(a)[2]
  do_dims<-c(1:dim(a)[2])
  at<-1
  for(i in 1:len1){
    for(m in do_dims){
      while(b[at,m]<a[i,m]){
        at<-(at+1)
        if(at>len2){break}
      }
      if(at>len2){break}
      if(b[at,m]>a[i,m]){break}
      if(m==dims){found[i]<-at}
    }
    if(at>len2){break}
  }
  #################################################
  found<-found[order(ord1)]
  found<-ord2[found]
  return(found)
}
# example data sets:
ncols<-10
nrows<-1E4
a <- matrix(sample(LETTERS,size = (ncols*nrows), replace = T), ncol = ncols, nrow = nrows)
b <- matrix(sample(LETTERS,size = (ncols*nrows), replace = T), ncol = ncols, nrow = nrows)
b <- rbind(a,b) # example of b containing a
b <- b[sample(dim(b)[1],dim(b)[1],replace = F),]
found<-compare(a,b)
a<-as.data.frame(a) # = conversion to factors
b<-as.data.frame(b) # = conversion to factors
found<-compare(a,b)

Subtracting two dataset

I have 2 datasets. One is the parent dataset (A) and other one is a subset (B) of it. I want to create a dataset from A which does not contain rows from B. It should be something like
C=A-B
Both the datasets A and B have same number of columns and column names.
If B is an actual subset of A, you can use setdiff on rownames:
sset <- subset(mtcars,cyl==4)
mtcars[setdiff(rownames(mtcars),rownames(sset)),]
If you do not want to convert the rows into strings for comparison, i.e. you want exact matches, you can try this:
a <- data.frame(t(matrix(1:12,3,4)))
b <- data.frame(t(matrix(7:21,3,5)))
a[!apply(a,1,FUN=function(y){any(apply(b,1,FUN=function(x){all(x==y)}))}),]
Something like the following might do the trick:
C <- A[!(apply(A, 1, toString) %in% apply(B, 1, toString)), ]
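As a variation on the string-key idea, the sketch below builds one key per row with paste instead of apply. The motivation is that apply converts the data frame to a matrix first, and that conversion may format numbers differently in A and B. The data, the helper name key, and the choice of separator are all assumptions made for illustration.

```r
# Hypothetical parent data set A and a subset B of its rows.
A <- data.frame(
  id    = 1:5,
  label = c("a", "b", "c", "d", "e"),
  stringsAsFactors = FALSE
)
B <- A[c(2, 4), ]

# Build one string key per row by pasting the columns together.
# The separator is assumed not to occur in the data itself.
key <- function(df) do.call(paste, c(df, sep = "\r"))

# Keep the rows of A whose key does not appear among the keys of B.
C <- A[!(key(A) %in% key(B)), ]
C
```

This relies on A and B having the same columns in the same order, as stated in the question.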

How can I combine datasets in R?

I think my question is very simple.
dat1<-seq(1:100)
dat2<-seq(1:100)
how can I combine dat1 and dat2 and make it look like
dat3<-seq(1:200)
Thanks so much!
How do you want to combine dat1 and dat2? By rows or columns? I'd take a look at the help pages for rbind() (row bind), cbind() (column bind), or c(), which combines arguments to form a vector.
Let me start by a comment.
In order to create a sequence of numbers, one can use the following syntax:
x <- seq(from=, to=, by=)
A shorthand for, e.g., x <- seq(from=1, to=10, by=1) is simply 1:10. So, your notation is a little bit weird...
On the other hand, you can combine two or more vectors using the c() function. Let us say, for example, that a <- c(1, 2) and b <- c(3, 4). Then c <- c(a, b) is the vector (1, 2, 3, 4).
There exist similar functions to combine data sets: rbind() and cbind().
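A short sketch of the three options, using the vectors from the question written in the 1:100 shorthand. The names by_rows and by_cols are introduced only for illustration.

```r
dat1 <- 1:100
dat2 <- 1:100

# c() concatenates the two vectors end to end.
dat3 <- c(dat1, dat2)
length(dat3)  # 200

# rbind() stacks them as the rows of a matrix.
by_rows <- rbind(dat1, dat2)
dim(by_rows)  # 2 100

# cbind() places them side by side as columns.
by_cols <- cbind(dat1, dat2)
dim(by_cols)  # 100 2
```

Note that c(dat1, dat2) has 200 elements but repeats 1 to 100 twice, so it is not the same vector as 1:200.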
