How to convert character/factor to integer? - r

I know this has been asked quite frequently. However, even after applying the previous advice I'm still confused about two things.
How to convert from multinomial values to integers?
How to get the integer back to the factor/character after the analysis?
library(car)
data(Prestige)
View(Prestige)
# here I convert directly from character which seems quite useless
Prestige$TYPE<-as.numeric(levels(Prestige$type))
# here I generate factors
Prestige$type<-as.factor(Prestige$type)
# and try to convert afterwards. doesnt work either
Prestige$TYPE<-as.numeric(levels(Prestige$type))
Basically, I would like to extract the three levels in type without renaming it manually.

A vector with class factor has an attribute called levels. The levels function acts on that attribute and not on the vector itself.
library(car)
data(Prestige)
length(Prestige$type) # 102
levels(Prestige$type) # Notice that this has length 3.
If you want the numeric values for the vector, use
as.numeric(Prestige$type)
What was bc is now 1, what was prof is now 2, and what was wc is now 3.
If you need to reconstitute the factor from those numeric codes, use
factor(as.numeric(Prestige$type), levels = 1:3, labels = c("bc", "prof", "wc"))
But as a general rule, it's better not to alter your factors unless you need to alter the categories. If you need the numerical codes under the data, make a new variable
Prestige$type_numeric <- as.numeric(Prestige$type)
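As a minimal sketch of the round trip (factor to integer codes and back), here is a small made-up vector instead of the Prestige data, so it runs without the car package. The names type, codes, lookup and restored are only illustrative.

```r
# A small stand-in for Prestige$type (hypothetical values, including an NA)
type <- factor(c("prof", "bc", "wc", "bc", NA))

# Integer codes follow the order of levels(type): bc = 1, prof = 2, wc = 3
codes <- as.integer(type)

# Keep the levels as a lookup table so nothing has to be renamed by hand
lookup <- levels(type)

# Map the codes back to labels and rebuild the factor
restored <- factor(lookup[codes], levels = lookup)

print(codes)
print(restored)
```

Keeping levels(type) around as the lookup table is what lets you get back to the labels after the analysis.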

Related

Efficient Way to Convert to Numeric

I have converted a bunch of my columns from factor to numeric, but the code was very cumbersome. I had to individually convert each column, which ended up taking more time than it should. This is the code I used (only a short sample - I actually have many more columns):
city1$NY <-as.numeric(levels(city1$NY))[city1$NY]
city1$CHI<-as.numeric(levels(city1$CHI))[city1$CHI]
city1$LA <-as.numeric(levels(city1$LA))[city1$LA]
city1$ATL<-as.numeric(levels(city1$ATL))[city1$ATL]
city1$MIA<-as.numeric(levels(city1$MIA))[city1$MIA]
I was almost positive that instead of doing all of that, I could've just done:
city1[,CityNames]<-as.numeric(levels(city1[,CityNames]))[city1[,CityNames]]
Where CityNames is just all of the columns for the data that I would like to convert.. But that doesn't work, as I get:
Error in as.numeric(levels(city1[, CityNames]))[city1[, CityNames]] :
invalid subscript type 'list'
Can anyone tell what I am doing wrong? Or is there just simply no easier way to do this task other than my long, annoying first method?
I was almost positive that instead of doing all of that, I could've just done:
city1[,CityNames]<-as.numeric(levels(city1[,CityNames]))[city1[,CityNames]]
So, a small change is needed:
city1[,CityNames] <- lapply(city1[,CityNames], function(x) as.numeric(levels(x))[x] )
The original approach didn't work because
levels are vector-specific, so it's not clear what myvec = levels(city1[,CityNames]) is.
myvec[ city1[,CityNames] ] throws an error because city1[,CityNames] is a data.frame and cannot be used to subset in this way.
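To see the lapply fix in action, here is a sketch on a toy data frame. The values in city1 below are made up for illustration. It also shows why plain as.numeric() on a factor is not what you want here.

```r
# Toy stand-in for city1: numbers that were read in as factors (made-up values)
city1 <- data.frame(
  NY  = factor(c("10.5", "3", "10.5")),
  CHI = factor(c("7", "42", "7"))
)
CityNames <- c("NY", "CHI")

# as.numeric() on a factor returns the internal codes, not the values
codes_only <- as.numeric(city1$NY)

# Convert every listed column: look up each element's level, then make it numeric
city1[CityNames] <- lapply(city1[CityNames], function(x) as.numeric(levels(x))[x])

str(city1)
```

lapply walks over the columns one at a time, so each call sees a single factor, which is what levels(x) and [x] need.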
This is typically what I do when I want to convert many columns in a data.frame to a different data type:
convNames <- c("NY", "CHI", "LA", "ATL", "MIA")
for(name in convNames) { city1[, name] <- as.numeric(as.character(city1[, name])) }
It's a compact two lines: to coerce another column, just add its name to the convNames vector.
EDIT: Due to a factor issue, use the lapply method above.
I'm not sure whether it is faster, but you could also try as.numeric(as.character(x)) on each column x, e.g. city1$NY <- as.numeric(as.character(city1$NY)). This has to be done per column; it does not work on the whole data frame at once. The as.character() converts each element to its level label and then as.numeric() interprets those strings as their numeric equivalent. Note that ?factor describes as.numeric(levels(f))[f] as slightly more efficient, so don't expect a big speedup.

R - convert from categorical to numeric for KNN

I'm trying to use the caret package in R to apply KNN to the "abalone" database from UCI Machine Learning (link to the data). But it doesn't allow KNN when there are categorical values.
How do I convert the categorical values (in this database: "M","F","I") to numeric values, such as 1,2,3, respectively?
The first answer seems like a really bad idea. Coding {"M","F","I"} to {1, 2, 3} implies that Infant = 3 * Male, Male = Female/2 and so on.
KNN via caret does allow categorical values as predictors if you use the formula methods. Otherwise you need to encode them as binary dummy variables.
Also, showing your code and having a reproducible example would help a lot.
Max
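A minimal sketch of the dummy-variable encoding mentioned above, using base R's model.matrix. The data frame abalone_like and its values are made up for illustration and are not the UCI data.

```r
# Made-up rows shaped like the abalone data (illustrative only)
abalone_like <- data.frame(
  Sex    = factor(c("M", "F", "I", "M")),
  Length = c(0.455, 0.530, 0.330, 0.440)
)

# "- 1" drops the intercept so every level gets its own 0/1 column
dummies <- model.matrix(~ Sex - 1, data = abalone_like)

# Combine the indicator columns with the numeric predictor
encoded <- cbind(as.data.frame(dummies), Length = abalone_like$Length)

print(encoded)
```

Each row has exactly one indicator set to 1, so no artificial ordering between "M", "F" and "I" is introduced.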
When data are read in via read.table with stringsAsFactors = TRUE (the default before R 4.0.0), the data in the first column are factors. Then
data$iGender = as.integer(data$Gender)
would work. If they are character, a detour via factor is easiest:
data$iGender= as.integer(as.factor(data$Gender))
One of the easiest ways to use the kNN algorithm on a dataset in which one of the features is categorical ("M", "F" and "I", as you mentioned) is as follows:
In the CSV or Excel file that holds your dataset, go to the relevant column and change M to 1, F to 2 and I to 3. You then have discrete values in your dataset and can easily use the kNN algorithm in R.
You can simply read the file with stringsAsFactors = TRUE
Example
data_raw<-read.csv('...../credit-default.csv', stringsAsFactors = TRUE)
With stringsAsFactors = TRUE the character columns are read as factors, which R stores internally as integer codes.
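A small sketch of what stringsAsFactors changes, reading from an inline string instead of a file. The column names and values in csv_text are made up.

```r
# Inline CSV so the example needs no file on disk (made-up values)
csv_text <- "Sex,Rings\nM,15\nF,7\nI,9"

as_chr <- read.csv(text = csv_text, stringsAsFactors = FALSE)
as_fct <- read.csv(text = csv_text, stringsAsFactors = TRUE)

class(as_chr$Sex)  # character column
class(as_fct$Sex)  # factor column

# The integer codes follow the sorted levels (F, I, M), not the order in the file
sex_codes <- as.integer(as_fct$Sex)
print(levels(as_fct$Sex))
print(sex_codes)
```

Note that the codes are assigned by sorted level, so "M" does not become 1 here. Check levels() before relying on a particular numbering.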
Try using knncat package in R, which converts categorical variables into numerical counterpart.
Here's the link for the package

how to transform columns of a data frame according to the values in a vector in R?

I am trying to normalize some columns on a data frame so they have the same mean. The solution I am now implementing, even though it works, feels like there is a simpler way of doing this.
# we make a copy of women
w = women
# print out the col Means
colMeans(women)
#  height   weight
# 65.0000 136.7333
# create a vector of factors to normalize with
factor = colMeans(women)/colMeans(women)[1]
# normalize the copy of women that we previously made
for(i in 1:length(factor)){w[,i] <- w[,i] / factor[i]}
#We achieved our goal to have same means in the columns
colMeans(w)
# height weight
#     65     65
I can come up with the same thing easily using apply, but is there something easier, like just doing women/factor and getting the correct answer?
By the way, what is women/factor actually doing? Because doing:
colMeans(women/factor)
#   height   weight
# 49.08646 98.40094
does not give the same result.
You can use mapply too (applied to the original women, not the already normalized copy w):
colMeans(mapply("/", women, factor))
Re your question about what women/factor does: women is a data.frame with two columns, while factor is a numeric vector of length two. When you do women/factor, R works through the entries of women column by column and divides successive entries alternately by factor[1] and factor[2], rather than dividing the whole first column by factor[1] and the whole second column by factor[2]. Because factor is shorter than women, R recycles it over and over again.
You can see, for example, that every second entry of women[, 1]/factor equals the corresponding entry of women[, 1] (because factor[1] equals 1).
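The recycling described above can be seen on a tiny made-up data frame. The names df and div are only for illustration.

```r
# Tiny data frame and a length-2 divisor (made-up values)
df  <- data.frame(a = c(2, 4, 6, 8), b = c(10, 20, 30, 40))
div <- c(1, 2)

# Element-wise division: div is recycled down each column as 1, 2, 1, 2, ...
recycled <- df / div

# What was actually wanted: column a divided by 1, column b divided by 2
per_column <- df / rep(div, each = nrow(df))

print(recycled)
print(per_column)
```

In recycled, only every second entry is halved. In per_column, the first column is untouched and the second is halved throughout.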
One way of doing this is using sweep. By default this function subtracts a summary statistic from each row or column (whichever margin you specify), but you can also specify a different function to apply. In this case, a division:
colMeans(sweep(women, 2, factor, '/'))
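Spelled out as a self-contained sketch on the built-in women data. The vector is called scale_by here (my naming, not the question's) to avoid masking the base function factor().

```r
# Per-column scaling factors relative to the first column's mean
scale_by <- colMeans(women) / colMeans(women)[1]

# MARGIN = 2 applies one element of scale_by to each column
normalized <- sweep(women, 2, scale_by, "/")

# Both columns should now share the mean of the first column
new_means <- colMeans(normalized)
print(new_means)
```

The second argument of sweep is the margin: 2 pairs each column with one element of scale_by, while 1 would pair each row with one element.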
Also:
rowMeans(t(women)/factor)
#height weight
#65 65
Regarding your question:
I can come up with the same thing easily using apply, but is there something easier, like just doing women/factor and getting the correct answer? By the way, what is women/factor actually doing?
women/factor ## is similar to
unlist(women)/rep(factor,nrow(women))
What you need is:
unlist(women)/rep(factor, each=nrow(women))
or
women/rep(factor, each=nrow(women))
In my solution, I didn't use rep because factor gets recycled as needed.
t(women) ##matrix
as.vector(t(women))/factor #will give same result as above
or just
t(women)/factor #preserve the dimensions for ?rowMeans
In short, column wise operations are happening here.

Custom function does not work in R 'ddply' function

I am trying to use a custom function inside 'ddply' in order to create a new variable (NormViability) in my data frame, based on values of a pre-existing variable (CelltiterGLO).
The function is meant to create a rescaled (%) value of 'CelltiterGLO' based on the mean 'CelltiterGLO' values at a specific sub-level of the variable 'Concentration_nM' (0.01).
So if the mean of 'CelltiterGLO' at 'Concentration_nM'==0.01 is set as 100, I want to rescale all other values of 'CelltiterGLO' over the levels of other variables ('CTSC', 'Time_h' and 'ExpType').
The normalization function is the following:
normalize.fun = function(CelltiterGLO) {
idx = Concentration_nM==0.01
jnk = mean(CelltiterGLO[idx], na.rm = T)
out = 100*(CelltiterGLO/jnk)
return(out)
}
and this is the code I try to apply to my dataframe:
library("plyr")
df.bis=ddply(df,
.(CTSC, Time_h, ExpType),
transform,
NormViability = normalize.fun(CelltiterGLO))
The code runs, but when I double check (with aggregate or tapply) whether the mean of 'NormViability' equals 100 at 'Concentration_nM'==0.01, I do not get 100 but different numbers. However, if I subset my df by the two levels of the variable 'ExpType', the code returns the correct numbers on each separate subset. I tried making 'ExpType' either character or factor but got similar results. 'ExpType' has two levels/values, "Combinations" and "DoseResponse". I can't figure out why the code is not working on the entire df. I wonder whether this is because the two levels of 'ExpType' do not contain the same levels of all the other variables, e.g. one of the levels of 'Time_h' is missing for the level "Combinations" of 'ExpType'.
Thanks very much for your help and I apologize in advance if the answer is already present in Stackoverflow and I was not able to find it.
Michele
I (the OP) found out that the function was missing one variable in the arguments, that was used in the statements. Simply adding the variable Concentration_nM to the custom function solved the problem.
THANKS
m.
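For reference, a sketch of what the fix described above might look like, with Concentration_nM passed in as an argument. The data frame is made up, and the grouping is done with base R's split/lapply so the example runs without plyr; the equivalent ddply call is shown in a comment.

```r
# Corrected function: the concentration vector is now an explicit argument
normalize.fun <- function(CelltiterGLO, Concentration_nM) {
  idx <- Concentration_nM == 0.01
  ref <- mean(CelltiterGLO[idx], na.rm = TRUE)
  100 * CelltiterGLO / ref
}

# Made-up data with two experiment types (illustrative values only)
df <- data.frame(
  ExpType          = rep(c("Combinations", "DoseResponse"), each = 4),
  Concentration_nM = rep(c(0.01, 0.01, 1, 10), times = 2),
  CelltiterGLO     = c(200, 220, 100, 50, 900, 1100, 500, 250)
)

# Base-R grouping; with plyr the call would be
# ddply(df, .(ExpType), transform,
#       NormViability = normalize.fun(CelltiterGLO, Concentration_nM))
pieces <- lapply(split(df, df$ExpType), function(d) {
  d$NormViability <- normalize.fun(d$CelltiterGLO, d$Concentration_nM)
  d
})
df.bis <- do.call(rbind, pieces)

# Check: the mean at 0.01 nM should be 100 within every group
at_ref    <- df.bis$Concentration_nM == 0.01
ref_means <- tapply(df.bis$NormViability[at_ref], df.bis$ExpType[at_ref], mean)
print(ref_means)
```

Passing both columns explicitly means the function only ever sees the values of the current group.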

Bandwidth selection using NP package

New to R and having a problem with a very simple task! I have read a few columns of .csv data into R; the variables take values in the natural numbers plus zero and have missing values. After trying to use the non-parametric package (np), I have two problems. First, if I use the simple command bw=npregbw(ydat=y, xdat=x, na.omit), where x and y are column vectors, I get the error "number of regression data and response data do not match". Why do I get this, given that I have the same number of elements in each vector?
Second, I would like to call the data ordered and tell npregbw this, using the command bw=npregbw(ydat=y, xdat=ordered(x)). When I do that, I get the error that x must be atomic for sort.list. But how is x not atomic, it is just a vector with natural numbers and NA's?
Any clarifications would be greatly appreciated!
1) You probably have a different number of NA's in y and x.
2) Can't be sure about this, since there is no example. If it is of following type:
x <- c(3,4,NA,2)
Then ordered(x) should work fine. Please provide an example of your case.
EDIT: You of course tried bw=npregbw(ydat=y, xdat=x)? ordered() makes your vector an ordered factor (see ?ordered), which is not an atomic vector (see 2.1.1 link and ?factor)
EDIT2: So the problem was the way of subsetting the data. Note the difference between the various ways of subsetting: data$x and data[,i] (where i is the column number of column x) give you vectors, while data[c("x")] and data[i] give a data frame. Functions expect vectors, unless they call for data = (your data), in which case they work with column names.
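A short sketch of the subsetting difference, plus one way to keep x and y the same length when they have NAs in different places. The data frame and its values are made up, and np itself is not called here.

```r
# Made-up data with NAs in different rows of x and y
data <- data.frame(x = c(3, 4, NA, 2), y = c(1, 0, 2, NA))

# $ and [, i] return plain vectors; single-bracket indexing returns a data frame
is.atomic(data$x)          # TRUE
is.atomic(data[, 1])       # TRUE
is.data.frame(data["x"])   # TRUE
is.data.frame(data[1])     # TRUE

# ordered() works on the vector form
x_ord <- ordered(data$x)

# Keep only rows where both x and y are present, so the lengths still match
keep <- complete.cases(data)
x <- data$x[keep]
y <- data$y[keep]
print(length(x) == length(y))
```

Dropping incomplete rows jointly, rather than removing NAs from x and y separately, is what keeps the two vectors aligned.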
