Function in R for several variables - r

I want to unclass several factor variables in R. I need this functionality for a lot of variables. At the moment I repeat the code for each variable which is not convenient:
unclass:
myd$ati_1 <-unclass(myd$ati_1)
myd$ati_2 <-unclass(myd$ati_2)
myd$ati_3 <-unclass(myd$ati_3)
myd$ati_4 <-unclass(myd$ati_4)
I've looked into the apply() function family but I do not even know if this is the correct approach. I also read about for loops but every example is only about simple integers, not when you need to loop over several variables.
Would be glad if someone could help me out.

You can use a loop:
block <- c("ati_1", "ati_2", "ati_3", "ati_4")
for (j in block) {myd[[j]] <- unclass(myd[[j]])}
# Double brackets let you select a column by a name stored in a character variable (here j)
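The same loop can also be written without an explicit for, by assigning the result of lapply() back into the selected columns. A minimal sketch, assuming a small made-up data frame (the column values and the `other` column are hypothetical, only the `ati_` names come from the question):

```r
# Hypothetical stand-in for the asker's data frame
myd <- data.frame(
  ati_1 = factor(c("low", "high", "low")),
  ati_2 = factor(c("yes", "no", "yes")),
  other = c(1, 2, 3)
)

block <- c("ati_1", "ati_2")

# lapply() returns one unclassed vector per selected column;
# assigning into myd[block] replaces just those columns
myd[block] <- lapply(myd[block], unclass)

str(myd)  # ati_1 and ati_2 are now integer codes, other is untouched
```

If the columns share a prefix, `block <- grep("^ati_", names(myd), value = TRUE)` would build the name vector without typing each name.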

Here are a few ways. We use CO2, which comes with R and has several factor columns, and unclass those columns.
If you need some other criterion, then:
in the base R solution, set ix to the names, the positions, or a logical vector defining the columns to be transformed;
in the collapse solution, replace is.factor with a vector of names or positions, or a logical vector denoting the columns to convert;
in the dplyr solution, replace where(...) with the same names, positions, or logical vector.
Code follows. None of these overwrites the input, so the original is still available unchanged if you want to rerun from scratch; in general, overwriting objects is error prone.
# Base R
ix <- sapply(CO2, is.factor)
replace(CO2, ix, lapply(CO2[ix], unclass))
# collapse
library(collapse)
ftransformv(CO2, is.factor, unclass)
# dplyr
library(dplyr)
CO2 %>%
mutate(across(where(is.factor), unclass))
Depending on what you want, this might be sufficient; omit the as.data.frame if a matrix result is OK. Note that data.matrix converts every column, replacing factors with their integer codes.
as.data.frame(data.matrix(CO2))

Related

How to convert all factor variables into numeric variables (in multiple data frames at once)?

I have n data frames, each corresponding to data from a city.
There are 3 variables per data frame and currently they are all factor variables.
I want to transform all of them into numeric variables.
I have started by creating a vector with the names of all the data frames in order to use in a for loop.
cities <- as.vector(objects())
for ( i in cities){
i <- as.data.frame(lapply(i, function(x) as.numeric(levels(x))[x]))
}
Although the code runs and I get no error, I don't see any changes to my data frames: all three variables remain factor variables.
The strangest thing is that when doing them one by one (as below) it works:
df <- as.data.frame(lapply(df, function(x) as.numeric(levels(x))[x]))
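What you're essentially trying to do is change the type of each column that is a factor to numeric. One approach using purrr would be:
library(purrr)
# cities holds object names, so fetch the data frames themselves with mget() first;
# going through as.character() keeps the values rather than the integer level codes
converted <- map(mget(cities), ~ modify_if(.x, is.factor, function(f) as.numeric(as.character(f))))
Note that modify() is like lapply(), but it doesn't change the underlying data structure of the objects you are modifying (in this case, data frames). modify_if() simply takes a predicate as an additional argument. The result is a list of converted data frames; the original objects are left unchanged.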
For anyone who's interested in my question, I worked out the answer:
for ( i in cities){
assign(i, as.data.frame(lapply(get(i), function(x) as.numeric(levels(x))[x])))
}
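The original loop had no effect because i held only the name of each data frame as a string, so the assignment overwrote the loop variable rather than the data frame. As an alternative to assign() and get(), the data frames can be collected into a named list and converted there. A minimal sketch, assuming two made-up data frames (`city_a`, `city_b`) and a hypothetical helper `to_numeric`:

```r
# Hypothetical example data: two "city" data frames with factor columns
city_a <- data.frame(x = factor(c("1.5", "2.5")), y = factor(c("10", "20")))
city_b <- data.frame(x = factor(c("3", "4")),     y = factor(c("30", "40")))
cities <- c("city_a", "city_b")

# Hypothetical helper: convert every factor column to the numbers its labels spell out.
# d[] <- ... keeps d a data frame instead of turning it into a plain list.
to_numeric <- function(d) {
  d[] <- lapply(d, function(col) as.numeric(levels(col))[col])
  d
}

# mget() looks the names up and returns a named list of data frames
city_list <- lapply(mget(cities, envir = environment()), to_numeric)

str(city_list$city_a)
```

Keeping the data frames in one list means later steps can also be applied with lapply() instead of more assign() calls.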

apply resulting in list instead of dataframe

I have a dataframe with cases that repeat on the rows. Some rows have more complete data than others. I would like to group cases and then assign the first non-missing value to all NA cells in that column for that group. This seems like a simple enough task but I'm stuck. I have working syntax but when I try to use apply to apply the code to all columns in the dataframe I get a list back instead of a dataframe. Using do.call(rbind) or rbindlist or unlist doesn't quite fix things either.
Here's the syntax.
df$groupid<-group_indices (df,id1,id2) #creates group id on the basis of a combination of two variables
df%<>%group_by(id1,id2) #actually groups the dataframe according to these variables
df<-summarise(df, xvar1=xvar1[which(!is.na(xvar1))[1]]) #this code works great to assign the first non missing value to all missing values but it only works on 1 column at a time (X1).
I have many columns, so I tried using apply to make this a manageable task.
df<-apply(df, MARGIN=2, FUN=function(x) {summarise(df, x=x[which(!is.na(x))[1]])
}
)
This gets me a list for each variable, I wanted a dataframe (which I would then de-duplicate). I tried rbindlist and do.call(rbind) and these result in a long dataframe with only 3 columns - the two group_by variables and 'x'.
I know the problem is simply how I'm using apply, probably the indexing with 'which', but I'm stumped.
What about using lapply with do.call and cbind, like the following:
df <- do.call(cbind, lapply(df, function(x) {summarise(df, x=x[which(!is.na(x))[1]])}))
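Another option, not from the answers above, is to let dplyr apply the same summary to every non-grouping column with across(), which returns a single data frame with one row per group. A minimal sketch, assuming dplyr 1.0 or later and a small made-up data frame (the values and the helper name `first_non_na` are hypothetical):

```r
library(dplyr)

# Hypothetical example data with repeated cases and gaps
df <- data.frame(
  id1   = c(1, 1, 2, 2),
  id2   = c("a", "a", "b", "b"),
  xvar1 = c(NA, 5, 7, NA),
  xvar2 = c("u", NA, NA, "v")
)

# Hypothetical helper: first non-missing value of a vector (NA if there is none)
first_non_na <- function(x) x[!is.na(x)][1]

result <- df %>%
  group_by(id1, id2) %>%
  summarise(across(everything(), first_non_na), .groups = "drop")

result
```

Because summarise() already collapses each group to one row, no separate de-duplication step should be needed in this sketch.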

Coercing multiple time-series columns to factors in large dataframe

I would like to know if there is an "easy/quick" way to convert character variables to factor.
I am aware, that one could make a vector with the column names and then use lapply. However, I am working with a large data frame with more than 200 variables, so it would be preferable not having to write the 200+ names in the vector.
I am also aware that I can coerce the entire data frame by using lapply, type.convert and sapply, but as I am working with time series data where some is categorical, and some is numerical, I am not interested in that either.
Is there any way to use the column number in this? I.e. [ ,2:200]? I tried the following, but without any luck:
df[ ,2:30] <- lapply(df[ ,2:30], type.convert)
sapply(df, factor)
With the solution above, I would still have to do multiple of them, but it would still be quicker than writing all the variable names.
I also have a feeling a loop might be usable here, but I would not be sure of how to write it out, or if it is even a way to do it.
df[ ,2:30] <- lapply(df[ ,2:30], as.factor)
Since you write that you need to convert (all?) character variables to factors, you could instead use mutate_if from dplyr:
library(dplyr)
mutate_if(df, is.character, as.factor)
With this you only operate on columns for which is.character returns TRUE, so you don't need to worry about the column positions or names.
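The same predicate-based selection can be done in base R, without listing names or positions. A minimal sketch, assuming a small made-up data frame (all column names and values here are hypothetical):

```r
# Hypothetical example data mixing numeric and character columns
df <- data.frame(
  id   = 1:3,
  grp  = c("a", "b", "a"),
  val  = c(1.5, 2.0, 3.1),
  site = c("x", "y", "x"),
  stringsAsFactors = FALSE
)

# Logical vector marking the character columns
is_chr <- vapply(df, is.character, logical(1))

# Convert only those columns; numeric columns are left alone
df[is_chr] <- lapply(df[is_chr], factor)

sapply(df, class)
```

In dplyr 1.0 and later, `mutate(df, across(where(is.character), as.factor))` expresses the same idea.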

Applying a function to a dataframe to trim empty columns within a list environment R

I am a naive user of R and am attempting to come to terms with the 'apply' series of functions which I now need to use due to the complexity of the data sets.
I have a large, ragged data frame that I wish to reshape before conducting a sequence of regression analyses. It is further complicated by having interlaced rows of descriptive data (characters).
My approach to date has been to use a factor to split the data frame into sets with equal row lengths (i.e. a list), then attempt to remove the trailing empty columns, make two new, matching lists, one of data and one of chars and then use reshape to produce a common column number, then recombine the sets in each list. e.g. a simplified example:
myDF <- as.data.frame(rbind(c("v1",as.character(1:10)),
c("v1",letters[1:10]),
c("v2",c(as.character(1:6),rep("",4))),
c("v2",c(letters[1:6], rep("",4)))))
myDF[,1] <- as.factor(myDF[,1])
myList <- split(myDF, myDF[,1])
myList[[1]]
I can remove the empty columns for an individual set and can split the data frame into two sets from the interlacing rows but have been stumped with the syntax in writing a function to apply the following function to the list - though 'lapply' with 'seq_along' should do it?
Thus for the individual set:
DF <- myList[[2]]
DF <- DF[,!sapply(DF, function(x) all(x==""))]
DF
(from an earlier answer to a similar, but simpler example on this site). I have a large data set and would like an elegant solution (I could use a loop but that would not use the capabilities of R effectively). Once I have done that I ought to be able to use the same rationale to reshape the frames and then recombine them.
regards
jac
Try
lapply(split(myDF, myDF$V1), function(x) x[!colSums(x=='')])
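The one-liner above drops any column that contains at least one empty string. If the stricter test from the question is wanted (drop a column only when every value in it is empty), the single-set code can be wrapped in a function and passed to lapply(). A minimal sketch using the question's example data; the helper name `drop_empty_cols` is hypothetical:

```r
# Example data from the question
myDF <- as.data.frame(rbind(c("v1", as.character(1:10)),
                            c("v1", letters[1:10]),
                            c("v2", c(as.character(1:6), rep("", 4))),
                            c("v2", c(letters[1:6], rep("", 4)))),
                      stringsAsFactors = FALSE)

# Hypothetical helper: keep only columns that are not entirely empty strings
drop_empty_cols <- function(d) {
  keep <- !vapply(d, function(col) all(col == ""), logical(1))
  d[, keep, drop = FALSE]
}

# Split by the first column, then trim each piece independently
trimmed <- lapply(split(myDF, myDF$V1), drop_empty_cols)

sapply(trimmed, ncol)
```

No seq_along() is needed here, because lapply() already hands each list element to the function in turn.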

Change part of data frame to character, then to numeric in R

I have a simple problem. I have a data frame with 121 columns. columns 9:121 need to be numeric, but when imported into R, they are a mixture of numeric and integers and factors. Columns 1:8 need to remain characters.
I’ve seen some people use loops, and others use apply(). What do you think is the most elegant way of doing this?
Thanks very much,
Paul M
Try the following... The apply function allows you to loop over either rows, cols, or both, of a dataframe and apply any function, so to make sure all your columns from 9:121 are numeric, you can do the following:
table[,9:121] <- apply(table[,9:121],2, function(x) as.numeric(as.character(x)))
table[,1:8] <- apply(table[,1:8], 2, as.character)
Where table is the dataframe you read into R.
Briefly I specify in the apply function the table I want to loop over - in this case the subset of your table we want to make changes to, then we specify the number 2 to indicate columns, and finally give the name of the as.numeric or as.character functions. The assignment operator then replaces the old values in your table with the new ones of correct format.
-EDIT: I changed the first line because converting a factor directly to a number gives the integer code of the factor level, not the number you expect. Factors first need to be converted to characters and then to numbers, which we can do just by wrapping as.character inside as.numeric.
When you read in the table, use stringsAsFactors=FALSE; then there will not be any factors.
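The factor pitfall mentioned in the edit can be seen on a tiny example, followed by a column-wise conversion with lapply() instead of apply(). A minimal sketch, assuming made-up data (the data frame `dat` and its columns are hypothetical):

```r
f <- factor(c("10", "20", "10", "5"))

codes  <- as.numeric(f)                # 1 2 1 3: positions in levels(f), not the values
values <- as.numeric(as.character(f))  # 10 20 10 5: the numbers the labels spell out

# Hypothetical data frame: first column stays character, the rest become numeric
dat <- data.frame(
  id = c("a", "b", "c", "d"),
  m1 = f,
  m2 = factor(c("0.5", "1.5", "0.5", "2")),
  stringsAsFactors = FALSE
)

num_cols <- 2:3
dat[num_cols] <- lapply(dat[num_cols], function(x) as.numeric(as.character(x)))

str(dat)
```

lapply() works on one column at a time, so it avoids the intermediate matrix that apply() builds from the data frame.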

Resources