I'm teaching myself R right now. I'm trying to convert integer variables into categorical ones with the following:
train[, c("Store", "DayOfWeek")] <- apply(train[, c("Store", "DayOfWeek")], 2, as.factor)
but it's turning the variables into characters instead. I can't figure out why, except possibly some R coercion rule.
'data.frame': 1017209 obs. of 2 variables:
$ Store : chr "1" "2" "3" "4" ...
$ DayOfWeek : chr "5" "5" "5" "5" ...
When I do it to the variables individually (instead of using apply), it works. Thanks!
apply is the wrong tool here. The "apply" way to do this is lapply, because data frames are lists in which each column is an element of the list:
mtcars[,c('cyl','vs')] <- lapply(mtcars[,c('cyl','vs')],as.factor)
> str(mtcars)
'data.frame': 32 obs. of 11 variables:
$ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
$ cyl : Factor w/ 3 levels "4","6","8": 2 2 1 2 3 2 3 1 1 2 ...
$ disp: num 160 160 108 258 360 ...
$ hp : num 110 110 93 110 175 105 245 62 95 123 ...
$ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
$ wt : num 2.62 2.88 2.32 3.21 3.44 ...
$ qsec: num 16.5 17 18.6 19.4 17 ...
$ vs : Factor w/ 2 levels "0","1": 1 1 2 2 1 2 1 2 2 2 ...
$ am : num 1 1 1 0 0 0 0 0 0 0 ...
$ gear: num 4 4 4 3 3 3 3 4 4 4 ...
$ carb: num 4 4 1 1 2 1 4 2 2 4 ...
In general, be cautious about using apply on data frames. The very first line of the documentation for apply makes it clear that the first thing it does is coerce its argument to a matrix, and matrices can only hold data of one type. So your data frame is instantly converted to all numeric, all integer, or all character, depending on what's in it.
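You can see that coercion directly with as.matrix, which is essentially what apply does internally before it ever calls your function (a minimal sketch with a toy data frame):

```r
# A data frame with one numeric and one character column
df <- data.frame(x = 1:3, y = c("a", "b", "c"))

# apply() first coerces its argument to a matrix; a matrix holds only
# one type, so every column becomes character
m <- as.matrix(df)
class(m[, "x"])  # "character" - the numbers are now strings
```

This is exactly why `as.factor` applied through `apply` hands back character data: the factors were built on an already-character matrix and then simplified again on the way out.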
As mentioned above, lapply is the right tool. You can also use dplyr for this task and many similar column transformations; the old mutate_each verb has since been superseded by across:
library(dplyr)
train <- train %>% mutate(across(c(Store, DayOfWeek), as.factor))
Related
I have a dataset, and for some reason some of the text fields are coming through as length-one lists, rather than straight values. (We are still investigating why.) In the meantime, as a stopgap, I would like to convert those lists back into the standard fields they should be.
Here's an example of the kind of data structure I'm seeing:
library(dplyr)
mtcars %>%
bind_cols(n = I(list("x"))) %>%
str()
This leaves n as a list column rather than an ordinary atomic vector. Obviously, what I'm looking for is for that final column to be a regular column, not a bunch of one-element lists. Since more than one column in the dataset might be affected, it would be good if the approach were flexible enough to say, "for each column, if it is a list, make it a regular column again".
Is this possible? I've found a bunch of stuff online on how to append a list to a data frame as a column, or how to pull the contents of a column out into a list, but nothing that quite fits the scenario I'm describing here.
If, as in the example, all the list elements have length 1, you can use unlist on the list columns.
library(dplyr)
data <- data %>% mutate(across(where(is.list), unlist))
data
#'data.frame': 32 obs. of 12 variables:
# $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
# $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
# $ disp: num 160 160 108 258 360 ...
# $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
# $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
# $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
# $ qsec: num 16.5 17 18.6 19.4 17 ...
# $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
# $ am : num 1 1 1 0 0 0 0 0 0 0 ...
# $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
# $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
# $ n : chr "x" "x" "x" "x" ...
However, more often than not this happens because at least one entry has length greater than 1 (use which(lengths(data$n) > 1) to check), in which case it is safer to select the first value from each list:
data <- data %>% mutate(across(where(is.list), ~sapply(.x, `[[`, 1)))
data
Here data is the example object from the question:
data <- mtcars %>% bind_cols(n = I(list("x")))
I intend to use a function to save myself typing for repetitive procedures. Many things already work, but not everything does yet. Here is the code:
quicky <- function(df, factors){
output <- as.character(substitute(factors)[-1])
print(output)
df[,output]
for(i in names(df[,output])){
hist(df[,as.character(i)])
df[,as.character(i)] <- as.factor(df[,as.character(i)])#<- Why does this not work?
}
}
quicky(mtcars, c(cyl,hp,drat))
Request for help and explanation! Thanks in advance.
As we are looping over the column names stored in 'output', just loop over those directly instead of further subsetting the data and getting the names from it. Also, return the dataset at the end of the function:
quicky <- function(df, factors){
output <- as.character(substitute(factors)[-1])
print(output)
for(i in output){
df[[i]] <- as.factor(df[[i]])
}
df
}
out <- quicky(mtcars, c(cyl,hp,drat))
str(out)
#'data.frame': 32 obs. of 11 variables:
# $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
# $ cyl : Factor w/ 3 levels "4","6","8": 2 2 1 2 3 2 3 1 1 2 ... ###
# $ disp: num 160 160 108 258 360 ...
# $ hp : Factor w/ 22 levels "52","62","65",..: 11 11 6 11 15 9 20 2 7 13 ...###
# $ drat: Factor w/ 22 levels "2.76","2.93",..: 16 16 15 5 6 1 7 11 17 17 ...###
# $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
# $ qsec: num 16.5 17 18.6 19.4 17 ...
# $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
# $ am : num 1 1 1 0 0 0 0 0 0 0 ...
# $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
# $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
NOTE: Changed the [ to [[ so that it works with data.table and tbl_df
The reason quicky fails to return the results of the assignments to the columns of df is a peculiar feature of the R for loop: it returns NULL. And the last expression evaluated inside your quicky function was the for loop. So all you need to do is add df as the final value, outside the loop:
quicky <- function(df, factors){
output <- as.character(substitute(factors)[-1])
print(output)
df[,output]
for(i in names(df[,output])){
hist(df[,as.character(i)])
df[, i] <- as.factor(df[, i ])
}; df # add a call to evaluate `df`
}
str( quicky(mtcars, c(cyl,hp,drat)) )
#-------
[1] "cyl" "hp" "drat"
'data.frame': 32 obs. of 11 variables:
$ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
$ cyl : Factor w/ 3 levels "4","6","8": 2 2 1 2 3 2 3 1 1 2 ...
$ disp: num 160 160 108 258 360 ...
$ hp : Factor w/ 22 levels "52","62","65",..: 11 11 6 11 15 9 20 2 7 13 ...
$ drat: Factor w/ 22 levels "2.76","2.93",..: 16 16 15 5 6 1 7 11 17 17 ...
$ wt : num 2.62 2.88 2.32 3.21 3.44 ...
$ qsec: num 16.5 17 18.6 19.4 17 ...
$ vs : num 0 0 1 1 0 1 0 1 1 1 ...
$ am : num 1 1 1 0 0 0 0 0 0 0 ...
$ gear: num 4 4 4 3 3 3 3 4 4 4 ...
$ carb: num 4 4 1 1 2 1 4 2 2 4 ...
This behavior of for is in contrast to most other functions in R. With a for loop, the evaluations and assignments done within it take effect in the environment where the loop runs, but the loop expression itself returns NULL. Most other functions have no effect outside their own function environments, which requires the programmer to assign the returned value to a named object if any lasting effect is desired. (You should, of course, not expect the value of mtcars itself to be affected by this call.)
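A minimal illustration of both halves of this point: a for loop's value is always NULL, even though the assignments made inside it do take effect:

```r
# The loop's assignments happen, but the loop itself has no useful value
out <- for (i in 1:3) i
print(out)           # NULL

double_all <- function(v) {
  for (i in seq_along(v)) v[i] <- v[i] * 2
  # Without this explicit `v`, the function would return the
  # for loop's value, which is NULL
  v
}
double_all(1:3)      # c(2, 4, 6)
```

Dropping the final `v` reproduces the original quicky bug exactly: the conversions happen inside the function, but the caller gets NULL back.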
I received a script that generates a bunch of objects. I want to combine multiple dataframes using bind_rows. I am able to choose the correct objects using grep but I am not able to pass those object names as argument to bind_rows.
For example, I want to select the objects that start with df and pass those to bind_rows. In the example below I expect to have a dataframe named data which have the dataframe mtcars 3 times.
df1 <- mtcars
df2 <- mtcars
df3 <- mtcars
notdf4 <- mtcars
dfx <- ls()[grep("^df", ls())]
data <- bind_rows(eval(parse(text = dfx)))
The suggestion to use mget makes sense, although it returns a list, so you would need do.call to execute the rbind operation:
str( do.call( rbind, mget(ls( patt="^df.") ) ) )
'data.frame': 96 obs. of 11 variables:
$ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
$ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
$ disp: num 160 160 108 258 360 ...
$ hp : num 110 110 93 110 175 105 245 62 95 123 ...
$ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
$ wt : num 2.62 2.88 2.32 3.21 3.44 ...
$ qsec: num 16.5 17 18.6 19.4 17 ...
$ vs : num 0 0 1 1 0 1 0 1 1 1 ...
$ am : num 1 1 1 0 0 0 0 0 0 0 ...
$ gear: num 4 4 4 3 3 3 3 4 4 4 ...
$ carb: num 4 4 1 1 2 1 4 2 2 4 ...
I think using mget and do.call (rather than eval(parse(...))) has a lower chance of offending people like me who might be called R purists. I chose to use the "pattern" argument of ls as cleaner than first getting all the workspace names and then selecting from them with grep.
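Since the question asked about bind_rows specifically: dplyr's bind_rows accepts a list directly, so no do.call is needed on that route (a sketch, assuming dplyr is available):

```r
library(dplyr)

df1 <- mtcars
df2 <- mtcars
df3 <- mtcars
notdf4 <- mtcars

# mget() returns a named list of the matching objects;
# bind_rows() consumes that list as-is (the anchored pattern
# skips notdf4, just like grep("^df", ...) did)
data <- bind_rows(mget(ls(pattern = "^df[0-9]")))
nrow(data)  # 96: mtcars stacked three times
```

Note that bind_rows, unlike rbind, silently drops the mtcars row names; use the do.call(rbind, ...) version if you need them.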
It appears that one can add/delete a column to a data.table in-place, i.e., without copying all the other columns over to a new table.
Is it possible to do that with a vanilla data.frame?
PS. I know how to add/delete columns "functionally", i.e., creating a new frame without modifying the original one.
You can delete or modify an existing column of a data.frame by reference with data.table::set. I doubt you can add a column without making a copy, though. The reason you can add a column to a data.table without making a copy is that data.table over-allocates memory; see ?alloc.col for more.
R> library(data.table)
R> data(mtcars)
R> tracemem(mtcars)
[1] "<0x59fef68>"
R> set(mtcars, j="mpg", value=NULL) # remove a column
R> set(mtcars, j="cyl", value=rep(42, 32)) # modify a column
R> untracemem(mtcars)
R> str(mtcars)
'data.frame': 32 obs. of 10 variables:
$ cyl : num 42 42 42 42 42 42 42 42 42 42 ...
$ disp: num 160 160 108 258 360 ...
$ hp : num 110 110 93 110 175 105 245 62 95 123 ...
$ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
$ wt : num 2.62 2.88 2.32 3.21 3.44 ...
$ qsec: num 16.5 17 18.6 19.4 17 ...
$ vs : num 0 0 1 1 0 1 0 1 1 1 ...
$ am : num 1 1 1 0 0 0 0 0 0 0 ...
$ gear: num 4 4 4 3 3 3 3 4 4 4 ...
$ carb: num 4 4 1 1 2 1 4 2 2 4 ...
Compare that with normal data.frame operations
R> data(mtcars)
R> tracemem(mtcars)
[1] "<0x6b3ec30>"
R> mtcars[, "mpg"] <- NULL
tracemem[0x6b3ec30 -> 0x84de0c8]:
tracemem[0x84de0c8 -> 0x84de410]: [<-.data.frame [<-
tracemem[0x84de410 -> 0x84de6b0]: [<-.data.frame [<-
R> tracemem(mtcars)
[1] "<0x84dca30>"
R> mtcars[, "cyl"] <- rep(42, 32)
tracemem[0x84dca30 -> 0x84dcc28]:
tracemem[0x84dcc28 -> 0x84dd018]: [<-.data.frame [<-
tracemem[0x84dd018 -> 0x84dff70]: [<-.data.frame [<-
R> untracemem(mtcars)
R> data(mtcars)
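The over-allocation mentioned above can be inspected directly with truelength() (a sketch, assuming a reasonably current data.table):

```r
library(data.table)

dt <- as.data.table(mtcars)
length(dt)      # 11: columns currently in use
truelength(dt)  # larger: spare column pointer slots reserved up front,
                # so := can add a column later without copying the table
```

The gap between the two numbers is the headroom (controlled by options("datatable.alloccol")) that lets dt[, newcol := ...] work in place.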
I have a big ol' data frame with two ID columns for courses and users, and I needed to split it into one dataframe per course to do some further analysis/subsetting. After eliminating quite a few rows from each of the individual course dataframes, I'll need to stick them back together.
I split it up using, you guessed it, split, and that worked exactly as I needed it to. However, unsplitting was harder than I thought. The R documentation says that "unsplit reverses the effect of split," but my reading on the web so far is suggesting that that is not the case when the elements of the split-out list are themselves dataframes.
What can I do to rejoin my modified dfs?
This is a place for do.call. Simply calling df <- rbind(split.df) will result in a weird and useless list object, but do.call("rbind", split.df) should give you the result you're looking for.
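To see the difference between the two calls (a quick sketch using mtcars in place of your course data):

```r
spl <- split(mtcars, mtcars$cyl)  # list of three data frames

bad <- rbind(spl)                 # rbind() sees ONE argument - a list -
dim(bad)                          # and returns a 1 x 3 list matrix: useless

good <- do.call("rbind", spl)     # same as rbind(spl$`4`, spl$`6`, spl$`8`)
nrow(good)                        # 32: the pieces stacked back together
```

do.call unpacks the list so that each data frame becomes a separate argument to rbind, which is what triggers the data frame rbind method.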
unsplit() does work in the general situation that you describe, but not in the particular situation where rows have been removed from the split data frames.
Consider
> spl <- split(mtcars, mtcars$cyl)
> str(spl, max = 1)
List of 3
$ 4:'data.frame': 11 obs. of 11 variables:
$ 6:'data.frame': 7 obs. of 11 variables:
$ 8:'data.frame': 14 obs. of 11 variables:
> str(unsplit(spl, f = mtcars$cyl))
'data.frame': 32 obs. of 11 variables:
$ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
$ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
$ disp: num 160 160 108 258 360 ...
$ hp : num 110 110 93 110 175 105 245 62 95 123 ...
$ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
$ wt : num 2.62 2.88 2.32 3.21 3.44 ...
$ qsec: num 16.5 17 18.6 19.4 17 ...
$ vs : num 0 0 1 1 0 1 0 1 1 1 ...
$ am : num 1 1 1 0 0 0 0 0 0 0 ...
$ gear: num 4 4 4 3 3 3 3 4 4 4 ...
$ carb: num 4 4 1 1 2 1 4 2 2 4 ...
As we can see, unsplit() can undo a split. However, when the data frames in the split list are further altered to remove rows, there is a mismatch between their total number of rows and the length of the variable used to split the original data frame.
If you know, or can compute, the changes required to that splitting variable, then unsplit() can still be deployed, though more than likely this will not be trivial.
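For completeness, here is what recomputing the splitting variable could look like after rows have been dropped (a sketch; the lapply step stands in for whatever per-group filtering you actually did):

```r
spl <- split(mtcars, mtcars$cyl)

# Stand-in for your per-course filtering: drop some rows from each piece
spl <- lapply(spl, function(d) d[d$mpg > 15, ])

# Rebuild a splitting vector that matches the new row counts, then unsplit
f <- rep(names(spl), times = vapply(spl, nrow, integer(1)))
recombined <- unsplit(spl, f)
nrow(recombined)  # total rows remaining across the pieces
```

This only works because the rebuilt f is consistent with the shrunken pieces; with the original mtcars$cyl it would fail for the length-mismatch reason described above.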
The general solution is, as @Andrew Sannier mentions, the do.call(rbind, ...) idiom:
> spl <- split(mtcars, mtcars$cyl)
> str(do.call(rbind, spl))
'data.frame': 32 obs. of 11 variables:
$ mpg : num 22.8 24.4 22.8 32.4 30.4 33.9 21.5 27.3 26 30.4 ...
$ cyl : num 4 4 4 4 4 4 4 4 4 4 ...
$ disp: num 108 146.7 140.8 78.7 75.7 ...
$ hp : num 93 62 95 66 52 65 97 66 91 113 ...
$ drat: num 3.85 3.69 3.92 4.08 4.93 4.22 3.7 4.08 4.43 3.77 ...
$ wt : num 2.32 3.19 3.15 2.2 1.61 ...
$ qsec: num 18.6 20 22.9 19.5 18.5 ...
$ vs : num 1 1 1 1 1 1 1 1 0 1 ...
$ am : num 1 0 0 1 1 1 0 1 1 1 ...
$ gear: num 4 4 4 4 4 4 3 4 5 5 ...
$ carb: num 1 2 2 1 2 1 1 1 2 2 ...
Outside of base R, also consider:
data.table::rbindlist() with the side effect of the result being a data.table
dplyr::bind_rows() which despite its somewhat confusing name will bind rows across lists
The answer by Andrew Sannier works, but has the side effect that the row names get changed: rbind prepends the list names, so e.g. "Datsun 710" becomes "4.Datsun 710". One can use unname in between to avoid this problem.
Complete example:
mtcars_reorder = mtcars[order(mtcars$cyl), ] #reorder based on cyl first
l1 = split(mtcars_reorder, mtcars_reorder$cyl) #split by cyl
l1 = unname(l1) #remove list names
l2 = do.call(what = "rbind", l1) #unsplit
all(l2 == mtcars_reorder) #check if matches
#> TRUE