I have a large data frame with over 40 variables of different classes. About half of the variables are characters; I would like to coerce those variables to factor while leaving the integers, logicals, etc. as is.
I have tried using an lapply call like the one below, but it coerces all variables instead of just the characters:
aframe2 <- as.data.frame(lapply(aframe1, factor))
I have also tried as.data.frame(aframe1, stringsAsFactors=TRUE) with no success. Is there something I am doing wrong or some other function I can use to do this?
This could be solved by using an if/else statement
aframe1[] <- lapply(aframe1, function(x) if(is.character(x)) factor(x) else x)
or create an index for the character columns and loop only on those columns
i1 <- sapply(aframe1, is.character)
aframe1[i1] <- lapply(aframe1[i1], factor)
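As a quick check of the index approach, here is a minimal sketch on a small made-up data frame (the column names id, name, group and flag are hypothetical, not from the question). Only the character columns should change class:

```r
# Hypothetical toy data; stringsAsFactors = FALSE keeps name and group as character
aframe1 <- data.frame(
  id    = 1:3,
  name  = c("a", "b", "c"),
  group = c("u", "v", "u"),
  flag  = c(TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

# Logical index of the character columns, then convert only those
i1 <- sapply(aframe1, is.character)
aframe1[i1] <- lapply(aframe1[i1], factor)

# Class of each column after the conversion
col_classes <- sapply(aframe1, class)
col_classes
```

The integer and logical columns keep their classes because they are never passed to factor().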
I want to write a function that dynamically uses different correlation methods depending on the scale of measure of the feature (continuous, dichotomous, ordinal). The label is always continuous. My idea was to use the apply() function to iterate over every feature (i.e. column), check its scale of measure (numeric, factor with two levels, factor with more than two levels) and then use the appropriate correlation function. Unfortunately my code seems to convert every feature into a character vector, and as a consequence the condition in the if statement is always false for every column. I don't know why my code is doing this. How can I prevent my code from converting my features to character vectors?
set.seed(42)
foo <- sample(c("x", "y"), 200, replace = T, prob = c(0.7, 0.3))
bar <- sample(c(1,2,3,4,5),200,replace = T,prob=c(0.5,0.05,0.1,0.1,0.25))
y <- sample(c(1,2,3,4,5),200,replace = T,prob=c(0.25,0.1,0.1,0.05,0.5))
data <- data.frame(foo,bar,y)
features <- data[, !names(data) %in% 'y']
dyn.corr <- function(x,y){
# print out structure of every column
print(str(x))
# if feature is numeric and has more than two outcomes use corr.test
if(is.numeric(x) & length(unique(x))>2){
result <- corr.test(x,y)[['r']]
} else {
result <- "else"
}
}
result <- apply(features,2,dyn.corr,y)
apply is built for matrices. When you apply to a data frame, the first thing that happens is coercing your data frame to a matrix. A matrix can only have one data type, so all columns of your data are converted to the most general type among them when this happens.
Use sapply or lapply to work with columns of a data frame.
This should work fine (I tried to test it, but I don't know which package to load to get the corr.test function):
result <- sapply(features, dyn.corr, y)
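The coercion is easy to see on a small example. The sketch below uses a made-up two-column data frame (not the data from the question) and simply asks each function for the class of every column it visits:

```r
# Hypothetical toy data with one character and one numeric column
features <- data.frame(foo = c("x", "y", "x"), bar = c(1, 2, 3),
                       stringsAsFactors = FALSE)

# apply() first converts the data frame to a matrix, so every column arrives as character
via_apply <- apply(features, 2, class)

# sapply() visits each column as stored, so the original classes survive
via_sapply <- sapply(features, class)

via_apply   # both entries are "character"
via_sapply  # foo is "character", bar is "numeric"
```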
I would like to compare two columns in my dataset; however, they have different levels. I can't seem to find a way to get this to work. Any suggestions?
Example:
x = c('a','b','c')
y = c('a','b','g')
z = data.frame(x,y)
if(z$x == z$y){1} else{0}
returns: Error in Ops.factor(z$x, z$y) : level sets of factors are different
I have tried to make them have the same levels, i.e.:
z$x <- factor(z$x, levels=c(levels(z$y),levels(z$x)))
z$y <- factor(z$y, levels=c(levels(z$y),levels(z$x)))
but it still returns the error.
I've also tried is.same().
You could convert them to characters for the comparison. However, if you want to compare all of the rows you'll probably want to use ifelse:
ifelse(as.character(z$x) == as.character(z$y), 1, 0)
We can convert the logical to binary by using as.integer
with(z, as.integer(levels(x)[x] == levels(y)[y]))
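A minimal sketch of both routes, assuming the columns are factors (stringsAsFactors = TRUE is set explicitly so the example behaves the same in current R versions). The second route builds the shared level set with union(), which drops the duplicates that c(levels(z$y), levels(z$x)) would contain; the names lv, same and same2 are only illustrative:

```r
# stringsAsFactors = TRUE reproduces the factor columns from the question
z <- data.frame(x = c("a", "b", "c"), y = c("a", "b", "g"),
                stringsAsFactors = TRUE)

# Comparing the labels rather than the factors avoids the level-set error
same <- as.integer(as.character(z$x) == as.character(z$y))
same   # 1 1 0

# Alternative: give both columns one shared level set, then compare directly
lv <- union(levels(z$x), levels(z$y))
same2 <- as.integer(factor(z$x, levels = lv) == factor(z$y, levels = lv))
```

Both vectors mark the rows where the two columns agree.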
df is a frequency table in which each value in column a was recorded as many times as given in columns x, y and z. I'm trying to convert the frequency table back to the original data, so I use the rep() function.
How do I loop the rep() function to give me the original data for x, y, z without having to repeat the function several times like I did below?
Also, can I input the result into a data frame, bearing in mind that the output will have different column lengths:
a <- (1:10)
x <- (6:15)
y <- (11:20)
z <- (16:25)
df <- data.frame(a,x,y,z)
df
rep(df[,1], df[,2])
rep(df[,1], df[,3])
rep(df[,1], df[,4])
If you don't want to repeat the rep() call, you can always try using one of the apply-family functions. Note that you cannot store the result in a data.frame because the objects are of different lengths, but you could store it in a list and access the elements in a similar way to a data.frame. Something like this works:
df2<-sapply(df[,2:4],function(x) rep(df[,1],x))
What this sapply call says is: for each column in df[,2:4], call rep(df[,1], x), where x is one of your columns (df[,2], df[,3], or df[,4]).
The code below just checks that the sapply call gives the same result as your original approach.
identical(df2$x,rep(df[,1], df[,2]))
[1] TRUE
identical(df2$y,rep(df[,1], df[,3]))
[1] TRUE
identical(df2$z,rep(df[,1], df[,4]))
[1] TRUE
EDIT:
If you want it as a data.frame object you can do this:
res<-as.data.frame(sapply(df2, '[', seq(max(sapply(df2, length)))))
Note this introduces NAs into your data.frame so be careful!
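The same idea can be written with lapply(), which always returns a list regardless of the lengths of the results (sapply() only does so here because the lengths differ). This is a sketch, not the only way to do it; the names expanded, n_max and padded are made up for the example:

```r
a <- 1:10
df <- data.frame(a, x = 6:15, y = 11:20, z = 16:25)

# One expanded vector per frequency column, kept in a named list
expanded <- lapply(df[, c("x", "y", "z")], function(n) rep(df$a, n))
lengths(expanded)   # x: 105, y: 155, z: 205

# Pad every vector with NA up to the longest length, then bind into a data frame
n_max <- max(lengths(expanded))
padded <- as.data.frame(lapply(expanded, function(v) v[seq_len(n_max)]))
dim(padded)         # 205 rows, 3 columns
```

Indexing a vector past its end returns NA, which is what pads the shorter columns.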
I have a dataframe with >100 columns, all of which are INTs.
I have subsetted some columns which I would like to factorise, allowing me to conduct an ANOVA, say
my_variables_list = headers[grep('independent', headers)]
Now I would like to loop over all these variables and factorise:
for (i in my_variables_list) {
df$i = as.factor(df$i)
}
However this doesn't work - no error message is returned, but also no changes are made to the df. Similarly, if I try to run a single line of this it also fails.
df$my_variables_list[10] <- as.factor(df$my_variables_list[10])
You should use the [] operators to subset your dataframe within the for loop:
for (i in my_variables_list) {
df[,i] = as.factor(df[,i])
}
An example on iris that avoids the loop. We first look for the pattern Petal or Sepal in the colnames of iris, then convert those columns to factor with lapply
my_variables_list = grep('Petal|Sepal', colnames(iris))
iris[, my_variables_list] <- lapply(iris[, my_variables_list], as.factor)
or on your data.frame:
df[,my_variables_list] <- lapply(df[, my_variables_list], as.factor)
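The reason the original loop had no effect is that df$i looks for a column literally named "i", whereas df[[i]] (or df[, i]) uses the value stored in i. A minimal sketch on a made-up data frame (the column names independent_1, independent_2 and outcome are hypothetical):

```r
# Hypothetical data frame with integer-coded columns
df <- data.frame(independent_1 = c(1L, 2L, 1L),
                 independent_2 = c(3L, 3L, 4L),
                 outcome       = c(10L, 20L, 30L))

# value = TRUE returns the matching column names instead of their positions
my_variables_list <- grep("independent", names(df), value = TRUE)

# df[[i]] selects the column whose name is held in i
for (i in my_variables_list) {
  df[[i]] <- as.factor(df[[i]])
}

col_classes <- sapply(df, class)
col_classes
```

Only the two matched columns become factors; outcome stays integer.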
I have a function in R to turn factors to numeric:
as.numeric.factor <- function(x) {as.numeric(levels(x))[x]}
and I have a dataframe that consists of both factors, numeric and other types of data.
I want to apply the functions above at once on the whole dataframe to turn all factors to numeric types columns.
Any idea ?
thanks
You could check whether the column is factor or not by is.factor and sapply. Use that as an index to filter out those columns and convert the columns to "numeric" by as.numeric.factor function in a lapply loop.
indx <- sapply(dat, is.factor)
dat[indx] <- lapply(dat[indx], as.numeric.factor)
To prevent the columns from being converted to "factor", you could specify the stringsAsFactors=FALSE argument or the colClasses argument in read.table/read.csv. I would imagine the columns contain at least one non-numeric entry, which causes them to be converted to factor automatically when the dataset is read.
You could also apply the function without subsetting (although applying it only to the subset would be faster). One option would be:
dat[] <- lapply(dat, function(x) if(is.factor(x)) as.numeric(levels(x))[x] else x)
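It may help to see why the conversion goes through levels(x) rather than calling as.numeric() on the factor directly. A small sketch with a made-up factor:

```r
# A factor whose labels are numbers, as happens when numeric text is read in as factor
f <- factor(c("10", "20", "10", "5"))

# as.numeric() on a factor returns the internal integer codes, not the labels
codes <- as.numeric(f)
codes    # 1 2 1 3

# Converting the levels first and indexing by the factor recovers the labels
values <- as.numeric(levels(f))[f]
values   # 10 20 10 5
```

The codes follow the sorted order of the levels ("10", "20", "5" as text), which is why they bear no relation to the numeric values.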