Change class of dataframe based on other dataframe - r

I have a dataframe in R that I import from Excel and a dataframe that I create with a script. These dataframes contain the same columns, but since one is imported from Excel, its column classes are not identical to those of the dataframe created with the script.
The dataframes contain 500+ columns, so converting them individually would take a lot of time. Is there any way to change the class of every column of the Excel-imported dataframe to the class of the corresponding column in the script-created dataframe?
Many thanks!

df1 <- data.frame(a=1,b="2")
df2 <- data.frame(a=1L,b=2,d=3)
nms <- intersect(names(df1), names(df2))
df2[nms] <- Map(function(ref, tgt) { class(tgt) <- class(ref); tgt; }, df1[nms], df2[nms])
str(df2)
# 'data.frame': 1 obs. of 3 variables:
# $ a: int 1
# $ b: chr "2"
# $ d: num 3
Granted, $a remains integer instead of being cast to numeric; if that's not a concern, then that may suffice. If not, then this more-verbose and more-flexible option might be preferred:
cls <- sapply(df1[nms], function(z) class(z)[1])
df2[nms] <- Map(function(tgt, cl) {
  if (cl == "numeric") {
    tgt <- as.numeric(tgt)
  } else if (cl == "integer") {
    tgt <- as.integer(tgt)
  } else if (cl == "character") {
    tgt <- as.character(tgt)
  }
  tgt
}, df2[nms], cls)
str(df2)
# 'data.frame': 1 obs. of 3 variables:
# $ a: num 1
# $ b: chr "2"
# $ d: num 3
The rationale behind sapply(.., class(z)[1]) is that some objects have a class vector of length greater than 1 (e.g., tbl_df, POSIXct), which would break the single-value comparisons in the if/else chain.
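If more classes need to be covered, the if/else chain can be swapped for a lookup table of coercion functions. The following is only a sketch: the `converters` list and the extra factor column `f` are assumptions made for illustration and are not part of the original answer. Columns whose reference class has no entry in the table are left untouched.

```r
# Reference data frame (df1) and the data frame to be converted (df2).
# The factor column `f` is a made-up addition for this illustration.
df1 <- data.frame(a = 1, b = "2", f = factor("x"), stringsAsFactors = FALSE)
df2 <- data.frame(a = 1L, b = 2, f = "x", d = 3, stringsAsFactors = FALSE)

# Hypothetical lookup table: class name -> coercion function.
converters <- list(
  numeric   = as.numeric,
  integer   = as.integer,
  character = as.character,
  logical   = as.logical,
  factor    = as.factor
)

nms <- intersect(names(df1), names(df2))
# Use only the first class of each reference column.
cls <- vapply(df1[nms], function(z) class(z)[1], character(1))

df2[nms] <- Map(function(tgt, cl) {
  conv <- converters[[cl]]
  # Leave the column as-is when no converter is registered for the class.
  if (is.null(conv)) tgt else conv(tgt)
}, df2[nms], cls)

str(df2)
```

Classes that need extra arguments when converting (dates, for example) would need their own wrapper function in the table.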

Related

UseMethod common tasks all S3 classes

I am using S3 methods in the following way.
First, I identify the tasks common to all the classes I support and put that code once, before the call to UseMethod. Then I write the class-specific part in each method.
The problem arises when the common code modifies an argument, because the modified value does not seem to reach the method. Other tasks, such as checking arguments or defining sub-functions, work well in this scheme.
In the next example, I modify an argument:
secure_filter <- function(table, col, value){
  if(!is.numeric(table[[col]])) table[[col]] <- as.numeric(table[[col]])
  message("converting to numeric")
  print(str(table))
  UseMethod("secure_filter", table)
}
secure_filter.data.table <- function(
  table, col, value
){
  return(table[ col == value,])
}
secure_filter.data.frame <- function(
  table, col, value
){
  return(table[table$col == !!value,])
}
The result is wrong (both calls return zero rows):
df <- data.frame(a=c("a", "b", "c"), column = c("1", "2", "3"))
dt <- as.data.table(df)
> secure_filter(dt, "column", 1)
converting to numeric
Classes ‘data.table’ and 'data.frame': 3 obs. of 2 variables:
$ a : chr "a" "b" "c"
$ column: num 1 2 3
- attr(*, ".internal.selfref")=<externalptr>
NULL
Empty data.table (0 rows and 2 cols): a,column
> secure_filter(df, "column", 1)
converting to numeric
'data.frame': 3 obs. of 2 variables:
$ a : chr "a" "b" "c"
$ column: num 1 2 3
NULL
[1] a column
<0 rows> (or 0-length row.names)
Am I using S3 correctly? How can I avoid repeating code across S3 methods?
Is there an example of this in a well-known R function?
I am using this approach to port tidyverse scripts to data.table.
Use NextMethod instead of UseMethod.
secure_filter <- function(table, col, value){
  if(!is.numeric(table[[col]])) table[[col]] <- as.numeric(table[[col]])
  message("converting to numeric")
  print(str(table))
  NextMethod("secure_filter")
  #UseMethod("secure_filter", table)
}
> secure_filter(dt, "column", 1)
converting to numeric
Classes ‘data.table’ and 'data.frame': 3 obs. of 2 variables:
$ a : chr "a" "b" "c"
$ column: num 1 2 3
- attr(*, ".internal.selfref")=<externalptr>
NULL
a column
1: a 1
> secure_filter(df, "column", 1)
converting to numeric
'data.frame': 3 obs. of 2 variables:
$ a : chr "a" "b" "c"
$ column: num 1 2 3
NULL
a column
1 a 1
But I don't know whether this answer is well behaved, because sloop neither shows which method was dispatched nor finds the generic.
> sloop::s3_dispatch(secure_filter(dt, "column", 1))
secure_filter.data.table
secure_filter.data.frame
secure_filter.default
> sloop::s3_get_method(secure_filter)
Error: Could not find generic
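Another way to share code between methods, without depending on what happens to an argument modified inside the generic, is to keep the generic as a bare UseMethod call and do the common preprocessing in an ordinary wrapper function. The sketch below assumes that UseMethod hands the method the arguments as they were originally passed to the generic, which would explain why the conversion above is lost. The name `secure_filter_impl` is hypothetical, and only a data.frame method is shown, so that the example runs in base R.

```r
# Public entry point: the shared preprocessing lives here, not in the generic.
secure_filter <- function(table, col, value) {
  if (!is.numeric(table[[col]])) {
    message("converting to numeric")
    table[[col]] <- as.numeric(table[[col]])
  }
  # The already-modified table is passed on as a normal argument.
  secure_filter_impl(table, col, value)
}

# Internal generic (hypothetical name): it does nothing but dispatch.
secure_filter_impl <- function(table, col, value) {
  UseMethod("secure_filter_impl")
}

# Class-specific part for plain data frames.
secure_filter_impl.data.frame <- function(table, col, value) {
  table[table[[col]] == value, , drop = FALSE]
}

df <- data.frame(a = c("a", "b", "c"), column = c("1", "2", "3"))
res <- secure_filter(df, "column", 1)
print(res)
```

A data.table method could be added in the same way as `secure_filter_impl.data.table`. Because the generic is now an ordinary one-line UseMethod function, inspection tools should have no trouble recognising it.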

How to create a dataframe with dynamic column length in R

I am creating dataframes in R like below.
len5<-data.frame("C1"=character(0),"C2"=character(0),"C3"=character(0),"C4"=character(0),"C5"=character(0), stringsAsFactors=FALSE)
len6<-data.frame("C1"=character(0),"C2"=character(0),"C3"=character(0),"C4"=character(0),"C5"=character(0),"C6"=character(0),stringsAsFactors=FALSE)
len7<-data.frame("C1"=character(0),"C2"=character(0),"C3"=character(0),"C4"=character(0),"C5"=character(0),"C6"=character(0),"C7"=character(0),stringsAsFactors=FALSE)
etc
However, I need to create them dynamically in a loop, for dataframes with 5 to 15 columns.
Is there any way of doing that? All the columns will be of type character.
Thanks
Tanmay
We can do this with lapply to create a list of data.frames. It is better to keep that in the list and not create multiple objects in the global environment.
i1 <- 5:15
lst <- lapply(i1, function(x) data.frame(setNames(replicate(x, character(0)),
                paste0("C", seq_len(x))), stringsAsFactors = FALSE))
names(lst) <- paste0("len", i1)
If the program needs the data frames as separate objects in the global environment:
list2env(lst, .GlobalEnv)
str(len5)
#'data.frame': 0 obs. of 5 variables:
# $ C1: chr
# $ C2: chr
# $ C3: chr
# $ C4: chr
# $ C5: chr
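The same idea can be wrapped in a small helper, which may be easier to read than the nested call. This is only a sketch: the function name `make_empty_df` is made up for illustration, and it builds the columns with rep(list(...)) instead of replicate().

```r
# Hypothetical helper: a zero-row data frame with n character columns C1..Cn.
make_empty_df <- function(n) {
  cols <- rep(list(character(0)), n)
  names(cols) <- paste0("C", seq_len(n))
  as.data.frame(cols, stringsAsFactors = FALSE)
}

sizes <- 5:15
lst <- lapply(sizes, make_empty_df)
names(lst) <- paste0("len", sizes)

# Individual data frames are then available by name from the list.
str(lst[["len7"]])
```

Keeping the data frames in the list means the loop index can be used to look them up (for example `lst[[paste0("len", 7)]]`) without creating eleven separate objects.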

Applying a vector of classes to a dataframe

I have a character vector of classes that I would like to apply to a dataframe, so as to convert the current class of each field in that dataframe to the corresponding entry in the vector. For example:
frame <- data.frame(A = c(2:5), B = c(3:6))
classes <- c("character", "factor")
With a for-loop, I know that this can be accomplished using lapply. For example:
for(i in 1:2) { frame[i] <- lapply(frame[i], paste("as", classes[i], sep = ".")) }
For my purposes, however, a for-loop cannot work. Is there another solution that I am missing?
Thank you in advance for your input!
Note: I have been informed that this might be a duplicate of this post. And, yes, my question is similar to it. But I have looked at the class() approach before. And it does not seem to effectively deal with converting fields to factors. The lapply approach, on the other hand, does it well. But, unfortunately, I cannot utilize a for-loop in this instance
If you're not averse to using lapply without a for loop, you can try something like the following.
frame[] <- lapply(seq_along(frame), function(x) {
  FUN <- paste("as", classes[x], sep = ".")
  match.fun(FUN)(frame[[x]])
})
str(frame)
# 'data.frame': 4 obs. of 2 variables:
# $ A: chr "2" "3" "4" "5"
# $ B: Factor w/ 4 levels "3","4","5","6": 1 2 3 4
However, a better option is to try to apply the correct classes when you're reading the data in to begin with.
x <- tempfile() # Just to pretend....
write.csv(frame2, x, row.names = FALSE) # ... that we are reading a csv
frame3 <- read.csv(x, colClasses = classes)
str(frame3)
# 'data.frame': 4 obs. of 2 variables:
# $ A: chr "2" "3" "4" "5"
# $ B: Factor w/ 4 levels "3","4","5","6": 1 2 3 4
Sample data:
frame <- frame2 <- data.frame(A = c(2:5), B = c(3:6))
classes <- c("character", "factor")
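The column-by-column conversion can also be written with Map, which walks over the columns and the class names in parallel instead of indexing both by position. This is a sketch of the same technique rather than a different method, and it assumes that an `as.<class>` function exists for every entry in `classes`.

```r
frame <- data.frame(A = c(2:5), B = c(3:6))
classes <- c("character", "factor")

# Pair each column with its target class and call the matching as.* function.
frame[] <- Map(function(column, cls) {
  convert <- match.fun(paste0("as.", cls))
  convert(column)
}, frame, classes)

str(frame)
```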

Names of variables inside the 'for loop' [duplicate]

I am trying to create a function that converts selected columns of a data frame to a categorical data type (factor) before running a regression analysis.
The question is how to select a particular column from a data frame using a string (character).
Example:
strColumnNames <- "Admit,Rank"
strDelimiter <- ","
strSplittedColumnNames <- strsplit(strColumnNames, strDelimiter)
for( strColName in strSplittedColumnNames[[1]] ){
  dfData$as.name(strColName) <- factor(dfData$get(strColName))
}
Tried:
dfData$as.name()
dfData$get(as.name())
dfData$get()
Error Msg:
Error: attempt to apply non-function
Any help would be greatly appreciated! Thank you!!!
You need to change
dfData$as.name(strColName) <- factor(dfData$get(strColName))
to
dfData[[strColName]] <- factor(dfData[[strColName]])
You may read ?"[[" for more.
In your case, column names are generated programmatically, so [[ ]] is the way to go: $ takes its argument literally rather than evaluating it. Maybe this example will be clear enough to illustrate the problem with $:
dat <- data.frame(x = 1:5, y = 2:6)
z <- "x"
dat$z
# [1] NULL
dat[[z]]
# [1] 1 2 3 4 5
Regarding the other answer
apply definitely does not work here, because the function you apply is as.factor or factor. apply always works on a matrix (if you feed it a data frame, it converts it into a matrix first) and returns a matrix, and a matrix cannot hold factors. Consider this example:
x <- data.frame(x1 = letters[1:4], x2 = LETTERS[1:4], x3 = 1:4, stringsAsFactors = FALSE)
x[, 1:2] <- apply(x[, 1:2], 2, as.factor)
str(x)
#'data.frame': 4 obs. of 3 variables:
# $ x1: chr "a" "b" "c" "d"
# $ x2: chr "A" "B" "C" "D"
# $ x3: int 1 2 3 4
Note, you still have character variable rather than factor. As I said, we have to use lapply:
x[1:2] <- lapply(x[1:2], as.factor)
str(x)
#'data.frame': 4 obs. of 3 variables:
# $ x1: Factor w/ 4 levels "a","b","c","d": 1 2 3 4
# $ x2: Factor w/ 4 levels "A","B","C","D": 1 2 3 4
# $ x3: int 1 2 3 4
Now we see the factor class in x1 and x2.
Using apply on a data frame is rarely a good idea. If you read the source code of apply:
dl <- length(dim(X))
if (is.object(X))
    X <- if (dl == 2L)
        as.matrix(X)
    else as.array(X)
You see that a data frame (which has two dimensions) is coerced to a matrix first. This is slow. If your data frame columns have different classes, the resulting matrix will have only one type, and the result of that coercion can be hard to predict.
Moreover, apply is written in R, not C, with an ordinary for loop:
for (i in 1L:d2) {
    tmp <- forceAndCall(1, FUN, newX[, i], ...)
    if (!is.null(tmp))
        ans[[i]] <- tmp
}
so it is no better than an explicit for loop you write yourself.
I would use a different method.
Create a vector of column names you want to change to factors:
factorCols <- c("Admit", "Rank")
Then extract these columns by index:
myCols <- which(names(dfData) %in% factorCols)
Finally, use lapply to change these columns to factors:
dfData[myCols] <- lapply(dfData[myCols], as.factor)
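Putting the pieces together, the function the question asks for might look like the sketch below. The function name `to_factors`, the sample data, and the extra `Gre` column are all made up for illustration; the only techniques used are the ones from the answers above (strsplit, character indexing, and lapply with as.factor).

```r
# Hypothetical helper: convert the columns named in a delimited string to factors.
to_factors <- function(df, col_string, delimiter = ",") {
  cols <- trimws(strsplit(col_string, delimiter, fixed = TRUE)[[1]])
  unknown <- setdiff(cols, names(df))
  if (length(unknown) > 0) {
    stop("columns not found: ", paste(unknown, collapse = ", "))
  }
  # Character indexing with [ ] keeps a list of columns, so lapply works
  # for one column as well as for several.
  df[cols] <- lapply(df[cols], as.factor)
  df
}

# Made-up sample data standing in for the asker's dfData.
dfData <- data.frame(Admit = c(0, 1, 1), Rank = c(1, 2, 2), Gre = c(380, 660, 800))
dfData <- to_factors(dfData, "Admit,Rank")
str(dfData)
```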

Error when adding columns with default values to dataframe with zero rows

Why does this work,
# add ONE column to dataframe with zero rows
x <- data.frame(a=character(0))
x["b"] <- character(0)
while this does not?
# add SEVERAL columns to dataframe with zero rows
x <- data.frame(a=character(0))
x[c("b", "c")] <- character(0)
error in value[[jvseq[[jjj]]]] : index out of limits [... freely translated]
Note that this is perfectly okay if we have non-zero rows.
x <- data.frame(a=1)
x["b"] <- NA
x <- data.frame(a=1)
x[c("b", "c")] <- NA
And what would be a simple alternative to add multiple columns to zero row dataframes?
From help("[.data.frame"):
Data frames can be indexed in several modes. When [ and [[ are used
with a single vector index (x[i] or x[[i]]), they index the data frame
as if it were a list.
From help("["):
Recursive (list-like) objects
Indexing by [ is similar to atomic vectors and selects a list of the specified element(s).
Thus, you need to pass a list (or data.frame):
x <- data.frame(a=character(0))
x[c("b", "c")] <- list(character(0), character(0))
str(x)
#'data.frame': 0 obs. of 3 variables:
# $ a: Factor w/ 0 levels:
# $ b: chr
# $ c: chr
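When the number of new columns is not fixed, the list can be built with rep() so that its length always matches the vector of names. A minimal sketch, where the vector `new_cols` is just an illustrative name:

```r
x <- data.frame(a = character(0))

# Names of the columns to add; one empty character vector is supplied per name.
new_cols <- c("b", "c", "d")
x[new_cols] <- rep(list(character(0)), length(new_cols))

str(x)
```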
