Efficient Way to Convert to Numeric - r

I have converted a bunch of my columns from factor to numeric, but the code was very cumbersome. I had to individually convert each column, which ended up taking more time than it should. This is the code I used (only a short sample - I actually have many more columns):
city1$NY <-as.numeric(levels(city1$NY))[city1$NY]
city1$CHI<-as.numeric(levels(city1$CHI))[city1$CHI]
city1$LA <-as.numeric(levels(city1$LA))[city1$LA]
city1$ATL<-as.numeric(levels(city1$ATL))[city1$ATL]
city1$MIA<-as.numeric(levels(city1$MIA))[city1$MIA]
I was almost positive that instead of doing all of that, I could've just done:
city1[,CityNames]<-as.numeric(levels(city1[,CityNames]))[city1[,CityNames]]
Where CityNames is just all of the columns for the data that I would like to convert.. But that doesn't work, as I get:
Error in as.numeric(levels(city1[, CityNames]))[city1[, CityNames]] :
invalid subscript type 'list'
Can anyone tell what I am doing wrong? Or is there just simply no easier way to do this task other than my long, annoying first method?

I was almost positive that instead of doing all of that, I could've just done:
city1[,CityNames]<-as.numeric(levels(city1[,CityNames]))[city1[,CityNames]]
So, a small change is needed:
city1[,CityNames] <- lapply(city1[,CityNames], function(x) as.numeric(levels(x))[x] )
The original approach didn't work because
levels are vector-specific, so it's not clear what myvec = levels(city1[,CityNames]) is.
myvec[ city1[,CityNames] ] throws an error because city1[,CityNames] is a data.frame and cannot be used to subset in this way.

This is typically what I do when I want to convert many columns in a data.frame to a different data type:
convNames <- c("NY", "CHI", "LA", "ATL", "MIA")
for(name in convNames) { city1[, name] <- as.numeric(as.character((city1[, name])) }
It's a nice two lines and you just have to add the names of whatever columns you want to coerce to the convNames vector to add a new column to the coercing loop below.
EDIT: Do to a factor issue, do the lapply method above.

I'm not sure if it is faster, but may be since the lookups may be what is slowing you down. Try city1 <- as.numeric(as.character(city1)). The as.character() converts to the level values and then the as.numeric() interprets those strings as their a numeric equivalent. It may be significantly faster since it does not have to do any lookups into the levels vector for each value.

Related

R creating variable tables from list of variable names

I am currently trying to create a table from a list of variable names (something I feel should be relatively simple) and I can't for the life of me, figure out how to do it correctly.
I have a data table that I've named 'file' and there are a list of 3 variable names within this file. What I want to do is create a table of each variable and then rbind them together. For further context, these few lines of code will be worked into a much larger function. The list of variable names must be able to accommodate the number of variables the user defines.
I have tried the following:
file<-as.data.table(dt)
variable_list<-list("outcome", "type")
for (variable in variable_list){
var_table<-as.data.table(table(file$variable_list))
na_table<-as.data.table(table(is.na(file$variable)))
}
When I run the above code, R returns empty tables of var_table and na_table. What am I doing wrong?
An option is to loop over the 'variable_list, extract the column, apply tableandrbindwithindo.call`
do.call(rbind, lapply(variable_list, function(nm) table(file[[nm]])))
NOTE: assuming that the levels of the columns are the same
If the levels are not the same, make it same by converting the columns to factor with levels specified
lvls <- na.omit(sort(unique(unlist(file[, unlist(variable_list), with = FALSE]))))
do.call(rbind, lapply(variable_list, function(nm)
table(factor(file[[nm]], levels = lvls))))
Or if we have a data.table, use the data.table methods
rbindlist(lapply(variable_list, function(nm) file[, .N,by = c(nm)]), fill = TRUE)
The problem (at least one of the problems) might be that you are attempting to use the $ operator incorrectly. You cannot substitute text values into the second argument. You can use its syntactic equivalent [[ instead of $, however. So this would be a possible improvement. (I've not tested it since you provided no test material.)
file<-as.data.table(dt)
variable_list<-list("outcome", "type")
for (variable in variable_list){
var_table<-as.data.table(table(file[[variable]])) # clearly not variable_list
na_table<-as.data.table(table(is.na(file[[variable]] )))
}
I'm guessing you might have done something like, ...
var_table <- file[, table(variable ) ]
... since data.table syntax evaluates text values in the environment of the file (which in this case is confusing named "file". It's better not to use such names, since in this case there's also an R function by that name.

R - Unexpected behavior when typechecking columns of tidyverse data frame

I'm working with some data that has hundreds of covariates, so I decided to write some functions to make pre-processing much faster and cleaner (like scaling certain numeric variables). An important part of all of these functions is type-checking the columns before I apply a particular function to them.
Here is my function for scaling continuous columns:
# rm (vector): names of columns not to be scaled
scale.continuous <- function(df, rm=NULL) {
cols <- setdiff(colnames(df), rm)
for(col in cols) {
if(is.numeric(df[,col])){
df[,col] <- as.numeric(scale(df[,col]))
}
}
df
}
This works perfectly fine if I load the data frame using read.csv(), but the data I have is huge so the speed boost of using read_csv() from readr/tidyverse is significant. Unfortunately, if I load my data using read_csv() all of my functions break.
I narrowed down the issue to the type-checking, specifically when type-checking a column I am accessing by a string of its column name. Here's some code to demonstrate what I mean:
# When using read.csv()
> is.numeric(df$col)
[1] TRUE
> is.numeric(df[,"col"])
[1] TRUE
# When using read_csv()
> is.numeric(df$col)
[1] TRUE
> is.numeric(df[,"col"])
[1] FALSE
I realized the issue here was that indexing the dataframe with a string the way I do above returns a tibble instead of a regular list like other methods of indexing do. What I don't understand is why this behavior exists, why as.numeric() (or any type-check) does not work with a tibble and in general why there is this difference in the way the default and tidyverse dataframes are constructed. Also, it would be nice to know if there is a parameter I can change in read_csv() that will make the behavior of this type of indexing the same as with a default dataframe.
I should mention, I realize there are probably better ways of writing this code (for example, just using df$"col" to index fixes the issue), but I still don't understand what the root of the issue was with my first approach. I am now working with much larger data sets that require much more involved pre-processing than what I have been used to in the past so I want to have as complete an understanding of the data structures I am using as possible.
Tibbles have a slightly different default behaviour than regular data frames when using the [ extracting function which can be a bit of a gotcha. Specifically df[,"col"] on a tibble will return a one column tibble whereas on a regular data frame it will return a vector. So you need to use:
df[["col"]]
Or explicitly state that you want to coerce to the lowest dimension and do:
df[, "col", drop = TRUE]
From the documentation:
df[, j] returns a tibble; it does not automatically extract the column
inside. df[, j, drop = FALSE] is the default.

How to convert character/factor to integer?

I know that has been asked quite frequently. However, by applying the previous advice I'm still confused about two things.
How to convert from multinomial values to integers?
How to get the integer back to the factor/character after the analysis?
library(car)
data(Prestige)
View(Prestige)
# here I convert directly from character which seems quite useless
Prestige$TYPE<-as.numeric(levels(Prestige$type))
# here I generate factors
Prestige$type<-as.factor(Prestige$type)
# and try to convert afterwards. doesnt work either
Prestige$TYPE<-as.numeric(levels(Prestige$type))
Basically, I would like to extract the three levels in type without renaming it manually.
A vector with class factor has an attributes called levels. The levels function acts on that attributes and not on the vector itself.
library(car)
data(Prestige)
length(Prestige$type) # 102
levels(Prestige$type) # Notice that this has length 3.
If you want the numeric values for the vector, use
as.numeric(Prestige$type)
What was bc is not 1, what was prof is now 2, and what was wc is now 3.
if you need to reconstitute the factor, use
factor(Prestige$type, 1:3, c("bc", "prof", "wc"))
But as a general rule, it's better not to alter your factors unless you need to alter the categories. If you need the numerical codes under the data, make a new variable
Prestige$type_numeric <- as.numeric(Prestige$type)

R warning message - invalid factor level, NA generated

I have the following block of code. I am a complete beginner in R (a few days old) so I am not sure how much of the code will I need to share to counter my problem. So here is all of it I have written.
mdata <- read.csv("outcome-of-care-measures.csv",colClasses = "character")
allstate <- unique(mdata$State)
allstate <- allstate[order(allstate)]
spldata <- split(mdata,mdata$State)
if (num=="best") num <- 1
ranklist <- data.frame("hospital" = character(),"state" = character())
for (i in seq_len(length(allstate))) {
if (outcome=="heart attack"){
pdata <- spldata[[i]]
pdata[,11] <- as.numeric(pdata[,11])
bestof <- pdata[!is.na(as.numeric(pdata[,11])),][]
inorder <- order(bestof[,11],bestof[,2])
if (num=="worst") num <- nrow(bestof)
hospital <- bestof[inorder[num],2]
state <- allstate[i]
ranklist <- rbind(ranklist,c(hospital,state))
}
}
allstate is a character vector of states.
outcome can have values similar to "heart attack"
num will be numeric or "best" or "worst"
I want to create a data frame ranklist which will have hospital names and the state names which follow a certain criterion.
However I keep getting the error
invalid factor level, NA generated
I know it has something to do with rbind but I cannot figure out what is it. I have tried googling about this, and also tried troubleshooting using other similar queries on this site too. I have checked any of my vectors I am trying to bind are not factors. I also tried forcing the coercion by setting the hospital and state as.character() during assignment, but didn't work.
I would be grateful for any help.
Thanks in advance!
Since this is apparently from a Coursera assignment I am not going to give you a solution but I am going to hint at it: Have a look at the help pages for read.csv and data.frame. Both have the argument stringsAsFactors. What is the default, true or false? Do you want to keep the default setting? Is colClasses = "character" in line 1 necessary? Use the str function to check what the classes of the columns in mdata and ranklist are. read.csv additionally has an na.strings argument. If you use it correctly, also the NAs introduced by coercion warning will disappear and line 16 won't be necessary.
Finally, don't grow a matrix or data frame inside a loop if you know the final size beforehand. Initialize it with the correct dimensions (here 52 x 2) and assign e.g. the i-th hospital to the i-th row and first column of the data frame. That way rbind is not necessary.
By the way you did not get an error but a warning. R didn't interrupt the loop it just let you know that some values have been coerced to NA. You can also simplify the seq_len statement by using seq_along instead.

Bandwidth selection using NP package

New to R and having problem with a very simple task! I have read a few columns of .csv data into R, the contents of which contains of variables that are in the natural numbers plus zero, and have missing values. After trying to use the non-parametric package, I have two problems: first, if I use the simple command bw=npregbw(ydat=y, xdat=x, na.omit), where x and y are column vectors, I get the error that "number of regression data and response data do not match". Why do I get this, as I have the same number of elements in each vector?
Second, I would like to call the data ordered and tell npregbw this, using the command bw=npregbw(ydat=y, xdat=ordered(x)). When I do that, I get the error that x must be atomic for sort.list. But how is x not atomic, it is just a vector with natural numbers and NA's?
Any clarifications would be greatly appreciated!
1) You probably have a different number of NA's in y and x.
2) Can't be sure about this, since there is no example. If it is of following type:
x <- c(3,4,NA,2)
Then ordered(x) should work fine. Please provide an example of your case.
EDIT: You of course tried bw=npregbw(ydat=y, xdat=x)? ordered() makes your vector an ordered factor (see ?ordered), which is not an atomic vector (see 2.1.1 link and ?factor)
EDIT2: So the problem was the way of subsetting data. Note the difference in various ways of subsetting. data$x and data[,i] (where i = column number of column x) give you vectors, while data[c("x")] and data[i] give a data frame. Functions expect vectors, unless they call for data = (your data). In that case they work with column names

Resources