I would like to compare two columns in my dataset; however, they have different levels. I can't seem to find a way to get this to work. Any suggestions?
Example:
x = c('a','b','c')
y = c('a','b','g')
z = data.frame(x,y)
if(z$x == z$y){1} else{0}
returns: Error in Ops.factor(z$x, z$y) : level sets of factors are different
I have tried to give them the same levels, i.e.:
z$x <- factor(z$x, levels=c(levels(z$y),levels(z$x)))
z$y <- factor(z$y, levels=c(levels(z$y),levels(z$x)))
but it still returns the error.
I've also tried is.same().
You could convert them to characters for the comparison. However, if you want to compare all of the rows you'll probably want to use ifelse:
ifelse(as.character(z$x) == as.character(z$y), 1, 0)
We can also convert the logical result to 0/1 by using as.integer:
with(z, as.integer(levels(x)[x] == levels(y)[y]))
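As a minimal sketch of both ideas, assuming x and y really are stored as factors (forced here with stringsAsFactors = TRUE, since newer R versions keep strings as character by default): you can compare the labels as character strings, or rebuild both factors on one shared level set. The names same_chr, same_fct and shared_levels are only illustrative.

```r
# Build the example with factor columns that have different level sets.
z <- data.frame(x = c("a", "b", "c"), y = c("a", "b", "g"),
                stringsAsFactors = TRUE)

# Option 1: compare the underlying labels as character strings.
same_chr <- as.integer(as.character(z$x) == as.character(z$y))

# Option 2: rebuild both factors on one shared, duplicate-free level set,
# after which == no longer complains about differing level sets.
shared_levels <- union(levels(z$x), levels(z$y))
same_fct <- as.integer(factor(z$x, levels = shared_levels) ==
                         factor(z$y, levels = shared_levels))

print(same_chr)
print(same_fct)
```

Using union() rather than c() keeps the combined level set free of duplicates, which is one thing the attempt in the question does not do.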
I want to write a function that dynamically uses different correlation methods depending on the scale of measure of the feature (continuous, dichotomous, ordinal). The label is always continuous. My idea was to use the apply() function to iterate over every feature (i.e. column), check its scale of measure (numeric, factor with two levels, factor with more than two levels) and then use the appropriate correlation function. Unfortunately, my code seems to convert every feature into a character vector, and as a consequence the condition in the if statement is always false for every column. I don't know why my code is doing this. How can I prevent my code from converting my features to character vectors?
set.seed(42)
foo <- sample(c("x", "y"), 200, replace = T, prob = c(0.7, 0.3))
bar <- sample(c(1,2,3,4,5),200,replace = T,prob=c(0.5,0.05,0.1,0.1,0.25))
y <- sample(c(1,2,3,4,5),200,replace = T,prob=c(0.25,0.1,0.1,0.05,0.5))
data <- data.frame(foo,bar,y)
features <- data[, !names(data) %in% 'y']
dyn.corr <- function(x,y){
# print out structure of every column
print(str(x))
# if feature is numeric and has more than two outcomes use corr.test
if(is.numeric(x) & length(unique(x))>2){
result <- corr.test(x,y)[['r']]
} else {
result <- "else"
}
}
result <- apply(features,2,dyn.corr,y)
apply is built for matrices. When you apply to a data frame, the first thing that happens is coercing your data frame to a matrix. A matrix can only have one data type, so all columns of your data are converted to the most general type among them when this happens.
Use sapply or lapply to work with columns of a data frame.
This should work fine (I tried to test, but I don't know what package to load to get the corr.test function.)
result <- sapply(features, dyn.corr, y)
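To see the coercion for yourself, here is a small sketch on made-up data (the toy features data frame and the names classes_apply and classes_sapply are mine, not from the question). It only inspects the class each approach hands to the function:

```r
# Toy data frame with one factor column and one numeric column.
features <- data.frame(foo = c("x", "y", "x", "y"),
                       bar = c(1, 2, 3, 5),
                       stringsAsFactors = TRUE)

# apply() converts the data frame to a matrix first, so every column
# reaches the function as a character vector.
classes_apply <- apply(features, 2, class)

# sapply() visits the columns as they are stored, so the types survive.
classes_sapply <- sapply(features, class)

print(classes_apply)
print(classes_sapply)
```

With the types intact, checks such as is.numeric(x) inside the function behave as intended.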
I wanted to order a multi-column dataframe by some column and subset it, but the command I used did not work:
print(df[order(df$x) & df$x < 5,])
This does not order the results.
To debug this I generated a test dataframe with 1 column but this 'simplification' had unexpected effects
df <- data.frame(x = sample(1:50))
print(df[order(df$x) & df$x < 5,])
This does not order the results so I felt I had reproduced the problem but with simpler data.
Breaking down the process to first ordering and then subsetting led me to discover the ordering in this case does not generate a dataframe object
df <- data.frame(x = sample(1:50))
ndf <- df[order(df$x),]
print(class(ndf))
produces
[1] "integer"
Attempting to subset the resultant "integer" ndf object using dataframe syntax e.g.
print(ndf[ndf$x < 5, ])
obviously generates an error:
Error in ndf$x : $ operator is invalid for atomic vectors.
Simplifying even further, I found subsetting alone (not applying the order function ) does not generate a dataframe object
ndf <- df[df$x < 5,]
class(ndf)
[1] "integer"
It turns out for the multicolumn dataframe that separating the ordering and the subsetting does work as expected
df <- data.frame(x = sample(1:50), y = rnorm(50))
ndf <- df[order(df$x),]
print(ndf[ndf$x < 5, ])
and this solved my original problem, but led to two further questions:
Why is the object returned in the 1 column dataframe test case above not a dataframe? (I appreciate a 1 column dataframe just contains a single vector, but it's still wrapped in a dataframe?)
Is it possible to order and subset a multicolumn dataframe in 1 step?
A data.frame in R automatically simplifies to vectors when selecting just one column. This is a common and useful simplification and is better described in this question. Of course you can prevent that with drop=FALSE.
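A quick sketch of the difference, using a one-column data frame like the one in the question (the names as_vector and as_frame are just for illustration):

```r
set.seed(1)
df <- data.frame(x = sample(1:50))

# Default behaviour: a single selected column is simplified to a vector.
as_vector <- df[df$x < 5, ]

# drop = FALSE keeps the data.frame wrapper, so $x still works afterwards.
as_frame <- df[df$x < 5, , drop = FALSE]

print(class(as_vector))
print(class(as_frame))
```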
Subsetting and ordering are two different operations. You should do them in two logical steps (but possibly one line of code). This line doesn't make a lot of sense
df[order(df$x) & df$x < 5,]
Subsetting in R can either be done with a vector of row indices (which order() returns) or boolean values (which the < comparison returns). Mixing them (with just an &) doesn't make it clear how R should perform the subset. But you can break that out into two steps with subset()
subset(df[order(df$x),], x < 5)
This does the ordering first and then the subsetting. Note that the condition no longer directly references the values of df specifically; it will filter the data from the re-ordered data.frame.
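To make it concrete why the combined index from the question only filters and never sorts, here is a small sketch on made-up data. My reading is that order() returns row positions, all nonzero, so & treats every one of them as TRUE; the names idx, same_mask, small and result are illustrative:

```r
set.seed(1)
df <- data.frame(x = sample(1:50), y = rnorm(50))

# order() gives row positions (1..n); combined with &, each nonzero
# number counts as TRUE, so only the comparison has any effect.
idx <- order(df$x) & df$x < 5
same_mask <- identical(idx, df$x < 5)
print(same_mask)

# Base R alternative: filter first, then order the smaller result.
small <- df[df$x < 5, ]
result <- small[order(small$x), ]
print(result)
```

Filtering before ordering gives the same rows as ordering first, and it sorts fewer rows.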
Operations like this are one of the reasons many people prefer the dplyr library for data manipulation. For example, this can be done with
library(dplyr)
dd <- data.frame(x = sample(1:50))
dd %>% filter(x<5) %>% arrange(x)
I'm trying to extract prop.test p-values over a set of columns in a dataframe existing in the global environment (df) and save them as a dataframe. I have a criteria column and 19 variable columns (among others).
proportiontest <- function() {
prop_df <- data.frame()
for(i in 1:19) {
x <- paste("df$var_", i, sep="")
y <- (prop.test(table(df$criteria, x), correct=FALSE))$p.value
z <- cbind (x, y)
prop_df <- rbind(prop_df, z)
}
assign("prop_df",prop_df,envir = .GlobalEnv)
}
proportiontest()
When I run this I get the error:
Error in table(df$criteria, x) : all arguments must have the same length
When I manually insert the column name into the function (instead of x) everything runs fine. e.g.
y <- (prop.test(table(df$criteria, df$var_1), correct=FALSE))$p.value
I seem to have the problem of using the variable (x) value generated via the for loop as the argument.
What am I missing or doing wrong in this case? I have tried passing x into the table() function as.String(x) as.character(x) among countless others to no avail. I cannot seem to understand in which form the argument must be. I'm probably misunderstanding something very basic in R but it's driving me insane and I cannot seem to formulate the question in a manner where google/SO can help me.
Currently in your function x is just a string. If you want to use a column from your data frame df you can do this in your for loop:
x <- df[,i]
You'll then need to change z or you'll be cbinding a column to a single p value, maybe just change to this:
z <- cbind(i,y)
so that you know which df column belongs to each p value.
You should be careful as well: the function will look for df inside itself first and then move to the parent environment if it doesn't find it, so it is safer to pass df as an argument to avoid any mistakes.
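Putting those suggestions together, here is a rough sketch on made-up data (three binary var_ columns instead of your 19; the function name proportion_test and its arguments are hypothetical). It looks columns up by name with [[ ]] and takes the data frame as an argument rather than reading it from the global environment:

```r
set.seed(1)
# Made-up stand-in for the real data: a two-level criteria column
# and a few two-level variable columns.
df <- data.frame(criteria = sample(c("yes", "no"), 100, replace = TRUE),
                 var_1 = sample(c(0, 1), 100, replace = TRUE),
                 var_2 = sample(c(0, 1), 100, replace = TRUE),
                 var_3 = sample(c(0, 1), 100, replace = TRUE))

proportion_test <- function(data, vars) {
  # data[[v]] fetches the column named by the string v.
  p <- vapply(vars, function(v) {
    prop.test(table(data$criteria, data[[v]]), correct = FALSE)$p.value
  }, numeric(1))
  data.frame(variable = vars, p_value = unname(p))
}

prop_df <- proportion_test(df, paste0("var_", 1:3))
print(prop_df)
```

Returning the result and assigning it at the call site also removes the need for assign() into .GlobalEnv.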
Say I have loaded a csv file into R with two columns (column A and column B, say) with real-valued entries. Call the dataframe df. Is there a way of speeding up the following code:
dfm <- df[floor(A) = x & floor(B) = y,]
x <- 2
y <- 2
dfm
I am hoping there will be something akin to function e.g.
dfm <- function(x,y) {df[floor(A) = x & floor(B) = y,]}
so that I can type
Any help much appreciated.
The way that's written right now won't work for a few reasons:
You need to assign values to x and y before you assign dfm. In other words, the lines x <- 2 and y <- 2 must come before the dfm <- ... line.
R doesn't know what A and B are, even if you put them inside the brackets of the dataframe that contains them. You need to write df$A and df$B.
= is an assignment operator, but you're looking for the comparison operator ==. Right now your code is saying "Assign the value x to floor(A)" (which doesn't really make sense). You want to tell it "Only choose rows where floor(A) equals x", or floor(A)==x.
So what you want is:
dfm.create <- function(x,y) {df[floor(df$A)==x & floor(df$B)==y,]}
dfm <- dfm.create(2,2)
Note that if you want the dataframe to be called dfm, you don't want to name the function dfm, or you will have to erase the function to make the dataframe.
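For a self-contained check, here is the same idea on a tiny made-up data frame. This variant (the name filter_floor is mine) takes the data frame as an argument instead of relying on a global df, and uses drop = FALSE so the result stays a data frame:

```r
# Made-up data standing in for the csv contents.
df <- data.frame(A = c(2.3, 2.9, 3.1, 2.5),
                 B = c(2.1, 3.4, 2.2, 2.8))

filter_floor <- function(data, x, y) {
  # Keep rows whose A and B values round down to x and y respectively.
  data[floor(data$A) == x & floor(data$B) == y, , drop = FALSE]
}

dfm <- filter_floor(df, 2, 2)
print(dfm)
```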
I want to extract a set of rows of an existing dataset:
dataset.x <- dataset[(as.character(dataset$type))=="x",]
however when I run
summary(dataset.x$type)
It displays all types which were present in the original dataset. Basically I get a result that says
x 12354235 #the correct itemcount
y 0
z 0
a 0
...
Not only is the presence of 0 elements ugly, but it also messes up any plot of dataset.x due to the presence of hundreds of entries with the value 0.
Building on Chase's answer, subsetting and dropping unused levels in factors comes up a lot, so it pays to just create your own function by combining droplevels and subset:
subsetDrop <- function(...){droplevels(subset(...))}
I'm assuming this is a factor? If so, droplevels() can be used: http://stat.ethz.ch/R-manual/R-patched/library/base/html/droplevels.html
If you add a small reproducible example, it will help others get on the same page and give better advice if this isn't right.
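In the meantime, here is a small made-up example of what droplevels() does after subsetting (the toy dataset and the names counts_before and counts_after are assumptions for illustration):

```r
# Toy data with a three-level factor column.
dataset <- data.frame(type = factor(c("x", "y", "x", "z")),
                      value = 1:4)

# Subsetting keeps the full level set, so unused levels show up with count 0.
dataset.x <- dataset[dataset$type == "x", ]
counts_before <- table(dataset.x$type)

# droplevels() removes levels that no longer occur in the data.
dataset.x <- droplevels(dataset.x)
counts_after <- table(dataset.x$type)

print(counts_before)
print(counts_after)
```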
Others have explained what is happening and how to fix it, I just want to show why it is a desirable default.
Consider the following sample code:
mydata <- data.frame(
x = factor( rep( c(0:5,0:5), c(0,5,10,20,10,5,5,10,20,10,5,0))),
sex = rep( c('F','M'), each=50 ) )
mydata.males <- mydata[ mydata$sex=='M', ]
mydata.males.dropped <- droplevels(mydata.males)
mydata.females <- mydata[ mydata$sex=='F', ]
mydata.females.dropped <- droplevels(mydata.females)
par(mfcol=c(2,2))
barplot(table(mydata.males$x), main='Male', sub='Default')
barplot(table(mydata.females$x), main='Female', sub='Default')
barplot(table(mydata.males.dropped$x), main='Male', sub='Drop')
barplot(table(mydata.females.dropped$x), main='Female', sub='Drop')
Which produces this plot:
Now, which is the more meaningful comparison, the 2 plots on the left? or the 2 plots on the right?
Instead of dropping unused levels it may be better to rethink what you are doing. If the main goal is to get the count of the x's then you can use sum rather than subsetting and getting the summary. And how meaningful can a plot be on a variable that you have already forced to be a single value?
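If the count is all you need, a sketch like this (again on made-up data, with illustrative names n_x and all_counts) gets it without subsetting at all:

```r
# Toy data; in the question this would be the full dataset.
dataset <- data.frame(type = factor(c("x", "y", "x", "z", "x")))

# Count the rows of one type directly.
n_x <- sum(dataset$type == "x")

# Or tabulate every type at once on the full data.
all_counts <- table(dataset$type)

print(n_x)
print(all_counts)
```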
Try
dataset$type <- as.character(dataset$type)
followed by your original code. It's probably just that R is still treating that column as a factor and is keeping all of the information about that factor in the column.