Suppose I have a column of colors in a data frame, say c("Red", "Blue", "Blue", "Orange").
I would like to convert it to c(1,2,2,3):
Red as 1
Blue as 2
Orange as 3
Is there a simpler way of doing this other than the obvious if/else or switch functions?
Set up a named vector describing how the colour strings map to integers:
colors=c(1,2,3)
names(colors)=c("Red", "Blue", "Orange")
Now use the named vector to generate a list of numbers associated with the colours in your data frame:
>colors[c("Red","Blue","Blue","Orange")]
Red Blue Blue Orange
1 2 2 3
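A minimal sketch of the same lookup applied to a whole vector. The names `lookup`, `x` and `codes` are illustrative and not part of the original answer. Wrapping the result in unname() drops the labels, and a string that has no entry in the lookup comes back as NA.

```r
# Named vector mapping each colour to its integer code
lookup <- c(Red = 1L, Blue = 2L, Orange = 3L)

# "Green" deliberately has no mapping
x <- c("Red", "Blue", "Blue", "Orange", "Green")

# Index by name, then drop the names to get a plain integer vector
codes <- unname(lookup[x])
codes
# [1]  1  2  2  3 NA
```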
UPDATE to address questions below. Here's an example of what I think you're trying to do.
dataframe=data.frame(Gender=c("F","F","M","F","F","M"))
strings=sort(unique(dataframe$Gender))
colors=1:length(strings)
names(colors)=strings
dataframe$Colours=colors[dataframe$Gender]
Have a look at the result:
> dataframe
Gender Colours
1 F 1
2 F 1
3 M 2
4 F 1
5 F 1
6 M 2
Note that this example assumes that you have no specific mapping between Gender and Colours in mind. If this is really the case, then it might be simpler to just follow the comment from @alexis_laz instead.
I may be missing something, but I believe this method works. After coercing the column of words (below, "names") to a factor, you revalue its levels to the numbers in "colors".
require(plyr)
colors <- c("1","2","3")
names <- c("Red", "Blue", "Orange")
df <- data.frame(names, colors)
df$names <- as.factor(df$names)
# revalue() expects a named character vector, so the new values are quoted
df$names <- revalue(x = df$names, c("Red" = "1", "Blue" = "2", "Orange" = "3"))
Using the car::recode() function:
library(car)
x <- c("Red", "Blue", "Blue", "Orange")
recode(x, "'Red'=1; 'Blue'=2; 'Orange'=3;")
# [1] 1 2 2 3
Here is a function based on previous code:
# Recode 'string' into 'integer'
recode_str_int <- function(df, feature) {
# 1. Unique values
# 1.1. 'string' values
list_str <- sort(unique(df[, feature]))
# 1.2. 'integer' values
list_int <- 1:length(list_str)
# 2. Create new feature
# 2.1. Names
names(list_int) = list_str
df$feature_new = list_int[df[, feature]]
# 3. Result
df$feature_new
} # recode_str_int
Call it like:
df$new_feature <- recode_str_int(df, "feature")
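A shorter variant of the same idea, as a sketch only: `recode_str_int2` is a hypothetical helper name, and it relies on match() against the sorted unique values instead of building a named vector. It assumes codes should follow sorted order, as in the function above.

```r
# Hypothetical helper: map each distinct value of a column to 1..k
# (sorted order) and return a plain integer vector
recode_str_int2 <- function(df, feature) {
  values <- as.character(df[[feature]])  # works for character or factor columns
  match(values, sort(unique(values)))
}

df <- data.frame(feature = c("Red", "Blue", "Blue", "Orange"),
                 stringsAsFactors = FALSE)
df$new_feature <- recode_str_int2(df, "feature")
df$new_feature
# [1] 3 1 1 2
```

Because the codes follow sorted order, "Blue" becomes 1, "Orange" 2 and "Red" 3.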
I have a toy example to explain what I am trying to work on :
aski = data.frame(x=c("a","b","c","a","d","d"),y=c("b","a","d","a","b","c"))
I managed to assign unique ids to column y, and the output now looks like:
aski2 = data.frame(x=c("a","b","c","a","d","d"),y=c("1","2","3","2","1","4"))
As you can see, "b" appears in both col x and col y, and it was assigned id=1 in col y; "a" was assigned id=2 in col y, and so on.
These values also appear in col x. For example, the first element of col x is "a", which was assigned id=2 in col y, so "a" should get id=2 in col x as well.
What I am trying to do next is look up each value of col x and, if it occurs in col y, assign it the same id.
The final data frame should look like:
aski3 = data.frame(x=c("2","1","4","2","3","3"),y=c("1","2","3","2","1","4"))
Without the need to create aski2 as an intermediate, a possible solution is to use match with lapply to get the numeric representations of the letters:
# create a vector of the unique values in the order
# in which you want them assigned to '1' till '4'
v <- unique(aski$y)
# convert both columns to integer values with 'match' and 'lapply'
aski[] <- lapply(aski, match, v)
which gives:
> aski
x y
1 2 1
2 1 2
3 4 3
4 2 2
5 3 1
6 3 4
If you want the numbers as characters, you can additionally do:
aski[] <- lapply(aski, as.character)
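To see why this works, here is a small sketch of match() on its own. The vector `v` mirrors unique(aski$y) from above, and `ids` is an illustrative name. match() returns, for each element of its first argument, the position of that element in the second, and NA when it is absent.

```r
v <- c("b", "a", "d", "c")          # lookup table: position = id

# "z" does not occur in v, so it maps to NA
ids <- match(c("a", "d", "z"), v)
ids
# [1]  2  3 NA
```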
First, convert both columns to character vectors.
Then, collect all unique values from the two columns to use as levels of a factor.
Convert both columns to factors, then numeric.
aski = data.frame(x=c("a","b","c","a","d","d"),y=c("b","a","d","a","b","c"))
aski$x <- as.character(aski$x)
aski$y <- as.character(aski$y)
lev <- unique(c(aski$y, aski$x))
aski$x <- factor(aski$x, levels=lev)
aski$y <- factor(aski$y, levels=lev)
aski$x <- as.numeric(aski$x)
aski$y <- as.numeric(aski$y)
aski
A solution using dplyr. First create a vector, vec, that maps each letter to its index, using unique(aski$y). After this step, you can use Jaap's lapply solution, or you can use mutate_all from dplyr as follows.
# Create the vector showing the relationship of index and letter
vec <- unique(aski$y)
# View vec
vec
[1] "b" "a" "d" "c"
library(dplyr)
# Modify all columns
# (funs() is deprecated since dplyr 0.8.0; a formula lambda does the same job)
aski2 <- aski %>% mutate_all(~ match(., vec))
# View the results
aski2
x y
1 2 1
2 1 2
3 4 3
4 2 2
5 3 1
6 3 4
Data
aski <- data.frame(x = c("a","b","c","a","d","d"),
y = c("b","a","d","a","b","c"),
stringsAsFactors = FALSE)
So I have a really large dataset that has some missing/bad data. I would like to recode the missing data using an ifelse statement. Instead of assigning just one value to all of the missing/bad entries, I want to assign values based on a fraction.
So, for instance, for df below:
Assign 50% of the rows where df$col2 == "b" to BLUE and the other 50% to RED
col1 col2
1 a
2 a
3 b
4 b
I know you can do:
ifelse(df$col2 == "b", "BLUE", df$col2)
but I want:
col1 col2
1 a
2 a
3 BLUE
4 RED
I'm looking to do the partitioning based on the condition.
You can do this by generating a vector of "Red" and "Blue" to select as the replacement when needed.
## Generate some random data with missing values
set.seed(2017)
a = sample(c("Red", "Blue"), 20, replace=TRUE)
a = ifelse(runif(20, 0, 1) < 0.12, NA, a)
## Now replace missing
a = ifelse(is.na(a),
sample(c("Red", "Blue"), length(a), replace=TRUE, prob=c(0.5,0.5)), a)
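The approach above draws each replacement independently, so the split is only 50/50 on average. If an exact split is wanted, one possible sketch is to build a balanced pool of labels and shuffle it. The names `idx` and `pool` are illustrative, and the data frame is the toy example from the question.

```r
df <- data.frame(col1 = 1:4,
                 col2 = c("a", "a", "b", "b"),
                 stringsAsFactors = FALSE)

set.seed(1)                          # only to make the shuffle reproducible
idx <- which(df$col2 == "b")         # rows that need a replacement

# Balanced pool of labels (one extra "BLUE" if the count is odd), then shuffle
pool <- rep_len(c("BLUE", "RED"), length(idx))
df$col2[idx] <- pool[sample.int(length(idx))]
df
```

Which of the two rows gets BLUE depends on the seed, but the counts are fixed by construction.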
In Python, scikit-learn has a great class called LabelEncoder that maps categorical levels (strings) to an integer representation.
Is there anything in R to do this? For example if there is a variable called color with values {'Blue','Red','Green'} the encoder would translate:
Blue => 1
Green => 2
Red => 3
and create an object with this mapping to then use for transforming new data in a similar fashion.
Add:
It doesn't seem like just factors will work because there is no persisting of the mapping. If the new data has an unseen level from the training data, the entire structure changes. Ideally I would like the new levels labeled missing or 'other' somehow.
sample_dat <- data.frame(a_str=c('Red','Blue','Blue','Red','Green'))
sample_dat$a_int<-as.integer(as.factor(sample_dat$a_str))
sample_dat$a_int
#[1] 3 1 1 3 2
sample_dat2 <- data.frame(a_str=c('Red','Blue','Blue','Red','Green','Azure'))
sample_dat2$a_int<-as.integer(as.factor(sample_dat2$a_str))
sample_dat2$a_int
# [1] 4 2 2 4 3 1
Create your vector of data:
colors <- c("red", "red", "blue", "green")
Create a factor:
factors <- factor(colors)
Convert the factor to numbers:
as.numeric(factors)
Output: (note that this is in alphabetical order)
# [1] 3 3 1 2
You can also set a custom numbering system: (note that the output now follows the "rainbow color order" that I defined)
rainbow <- c("red","orange","yellow","green","blue","purple")
ordered <- factor(colors, levels = rainbow)
as.numeric(ordered)
# [1] 1 1 5 4
See ?factor.
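On the concern about the mapping not persisting: a sketch, assuming the levels of the training factor are kept and reused when encoding new data. The names `train`, `new` and `new_int` are illustrative. Values not seen in training become NA instead of shifting the existing codes.

```r
train <- c("Red", "Blue", "Blue", "Red", "Green")
new   <- c("Red", "Blue", "Azure")

train_f <- factor(train)             # levels: Blue, Green, Red

# Reuse the training levels; "Azure" is not among them and becomes NA
new_int <- as.integer(factor(new, levels = levels(train_f)))
new_int
# [1]  3  1 NA
```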
Try the CatEncoders package. It replicates the Python sklearn.preprocessing functionality.
library(CatEncoders)
# variable to encode values
colors = c("red", "red", "blue", "green")
lab_enc = LabelEncoder.fit(colors)
# new values are transformed to NA
values = transform(lab_enc, c('red', 'red', 'yellow'))
values
# [1] 3 3 NA
# doing the inverse: given the encoded numbers return the labels
inverse.transform(lab_enc, values)
# [1] "red" "red" NA
I would add the functionality of reporting the non-matching labels with a warning.
PS: It also has the OneHotEncoder function.
If I understand correctly what you want:
# function which returns function which will encode vectors with values of 'vec'
label_encoder = function(vec){
levels = sort(unique(vec))
function(x){
match(x, levels)
}
}
colors = c("red", "red", "blue", "green")
color_encoder = label_encoder(colors) # create encoder
encoded_colors = color_encoder(colors) # encode colors
encoded_colors
new_colors = c("blue", "green", "green") # new vector
encoded_new_colors = color_encoder(new_colors)
encoded_new_colors
other_colors = c("blue", "green", "green", "yellow")
color_encoder(other_colors) # NA's are introduced
# save and restore to disk
saveRDS(color_encoder, "color_encoder.RDS")
c_encoder = readRDS("color_encoder.RDS")
c_encoder(colors) # same result
# dealing with multiple columns
# create data.frame
set.seed(123) # make result reproducible
color_dataframe = as.data.frame(
matrix(
sample(c("red", "blue", "green", "yellow"), 12, replace = TRUE),
ncol = 3)
)
color_dataframe
# encode each column
for (column in colnames(color_dataframe)){
color_dataframe[[column]] = color_encoder(color_dataframe[[column]])
}
color_dataframe
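If unseen values should fall into an "other" bucket rather than become NA, one possible variant uses the nomatch argument of match(). This is a sketch; `label_encoder_other` and `enc` are hypothetical names, and note that NA inputs also land in the extra code.

```r
# Hypothetical variant of label_encoder: unseen values get one extra code
label_encoder_other <- function(vec) {
  levels <- sort(unique(vec))
  other  <- length(levels) + 1L      # code reserved for unseen values
  function(x) match(x, levels, nomatch = other)
}

enc <- label_encoder_other(c("red", "red", "blue", "green"))
enc(c("blue", "green", "yellow"))    # "yellow" was not seen when fitting
# [1] 1 2 4
```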
I wrote the following, which I think works; its efficiency and how it will scale have not yet been tested.
str2Int.fit_transform<-function(df, plug_missing=TRUE){
list_of_levels=list() #empty list
#loop through the columns
for (i in 1: ncol(df))
{
#only
if (is.character(df[,i]) || is.factor(df[,i]) ){
#deal with missing
if(plug_missing){
#if factor
if (is.factor(df[,i])){
df[,i] = factor(df[,i], levels=c(levels(df[,i]), 'MISSING'))
df[,i][is.na(df[,i])] = 'MISSING'
}else{ #if character
df[,i][is.na(df[,i])] = 'MISSING'
}
}#end missing IF
levels<-unique(df[,i]) #distinct levels
list_of_levels[[colnames(df)[i]]] <- levels #set list with name of the columns to the levels
df[,i] <- as.numeric(factor(df[,i], levels = levels))
}#end if character/factor IF
}#end loop
return (list(list_of_levels,df)) #return the list of levels and the new DF
}#end of function
str2Int.transform<-function(df,list_of_levels,plug_missing=TRUE)
{
#loop through the columns
for (i in 1: ncol(df))
{
#only
if (is.character(df[,i]) || is.factor(df[,i]) ){
#deal with missing
if(plug_missing){
#if factor
if (is.factor(df[,i])){
df[,i] = factor(df[,i], levels=c(levels(df[,i]), 'MISSING'))
df[,i][is.na(df[,i])] = 'MISSING'
}else{ #if character
df[,i][is.na(df[,i])] = 'MISSING'
}
}#end missing IF
levels=list_of_levels[[colnames(df)[i]]]
if (! is.null(levels)){
df[,i] <- as.numeric(factor(df[,i], levels = levels))
}
}# character or factor
}#end of loop
return(df)
}#end of function
######################################################
# Test the functions
######################################################
###Test fit transform
# as strings
sample_dat <- data.frame(a_fact=c('Red','Blue','Blue',NA,'Green'), a_int=c(1,2,3,4,5), a_str=c('a','b','c','a','v'),stringsAsFactors=FALSE)
result<-str2Int.fit_transform(sample_dat)
result[[1]] #list of levels
result[[2]] #transformed df
#as factors
sample_dat <- data.frame(a_fact=c('Red','Blue','Blue',NA,'Green'), a_int=c(1,2,3,4,5), a_str=c('a','b','c','a','v'),stringsAsFactors=TRUE)
result<-str2Int.fit_transform(sample_dat)
result[[1]] #list of levels
result[[2]] #transformed df
###Test transform
str2Int.transform(sample_dat,result[[1]])
It's hard to believe no one has mentioned caret's dummyVars function.
This is a widely searched question, and people don't want to write their own methods or copy and paste other users' methods; they want a package, and caret is the closest thing to sklearn in R.
EDIT: I now realize that what the user actually wants is to turn strings into counting numbers, which is just as.numeric(as.factor(x)), but I'm going to leave this here because one-hot encoding is the more accurate method of encoding categorical data.
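For illustration, here is what one-hot encoding produces. This sketch uses base R's model.matrix() rather than caret's dummyVars, so it is a stand-in for the idea and not the function named above. The names `d` and `onehot` are illustrative.

```r
d <- data.frame(color = factor(c("red", "red", "blue", "green")))

# "- 1" drops the intercept so every level gets its own 0/1 column
onehot <- model.matrix(~ color - 1, data = d)
colnames(onehot)
# [1] "colorblue"  "colorgreen" "colorred"
```

Each row has exactly one 1, in the column for that row's level.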
# input P to the function below is a dataframe containing only categorical variables
numlevel <- function(P) {
n <- dim(P)[2]
for(i in 1: n) {
m <- length(unique(P[[i]]))
levels(P[[i]]) <- c(1:m)
}
return(P)
}
Q <- numlevel(P)
df<- mtcars
head(df)
df$cyl <- factor(df$cyl)
df$carb <- factor(df$carb)
vec <- sapply(df, is.factor)
catlevels <- sapply(df[vec], levels)
#store the levels for each category
#level appearing first is coded as 1, second as 2 so on
df <- sapply(df, as.numeric)
class(df) #matrix
df <- data.frame(df)
#converting back to dataframe
head(df)
# Data
Country <- c("France", "Spain", "Germany", "Spain", "Germany", "France")
Age <- c(34, 27, 30, 32, 42, 30)
Purchased <- c("No", "Yes", "No", "No", "Yes", "Yes")
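# stringsAsFactors = TRUE is needed in R >= 4.0 for the is.factor() check below
df <- data.frame(Country, Age, Purchased, stringsAsFactors = TRUE)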
df
# Output
Country Age Purchased
1 France 34 No
2 Spain 27 Yes
3 Germany 30 No
4 Spain 32 No
5 Germany 42 Yes
6 France 30 Yes
Using the CatEncoders package (Encoders for Categorical Variables):
library(CatEncoders)
# Saving names of categorical variables
factors <- names(which(sapply(df, is.factor)))
# Label Encoder
for (i in factors){
encode <- LabelEncoder.fit(df[, i])
df[, i] <- transform(encode, df[, i])
}
df
# Output
Country Age Purchased
1 1 34 1
2 3 27 2
3 2 30 1
4 3 32 1
5 2 42 2
6 1 30 2
Using base R (the factor function):
# Label Encoder
levels <- c("France", "Spain", "Germany", "No", "Yes")
labels <- c(1, 2, 3, 1, 2)
for (i in factors){
df[, i] <- factor(df[, i], levels = levels, labels = labels, ordered = TRUE)
}
df
# Output
Country Age Purchased
1 1 34 1
2 2 27 2
3 3 30 1
4 2 32 1
5 3 42 2
6 1 30 2
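A per-column version of the base R approach may be easier to maintain, because each column gets its own level order. This is a sketch only; `level_map` is a hypothetical name, and the data is re-created here so the block runs on its own.

```r
df <- data.frame(Country = c("France", "Spain", "Germany", "Spain", "Germany", "France"),
                 Age = c(34, 27, 30, 32, 42, 30),
                 Purchased = c("No", "Yes", "No", "No", "Yes", "Yes"),
                 stringsAsFactors = FALSE)

# One explicit level order per categorical column
level_map <- list(Country   = c("France", "Spain", "Germany"),
                  Purchased = c("No", "Yes"))

for (col in names(level_map)) {
  # The position of a value in its level vector becomes its integer code
  df[[col]] <- as.integer(factor(df[[col]], levels = level_map[[col]]))
}
df$Country
# [1] 1 2 3 2 3 1
```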
Here is an easy and neat solution:
From the superml package:
https://www.rdocumentation.org/packages/superml/versions/0.5.3
There is a LabelEncoder class:
https://www.rdocumentation.org/packages/superml/versions/0.5.3/topics/LabelEncoder
install.packages("superml")
library(superml)
lbl <- LabelEncoder$new()
lbl$fit(sample_dat$column)
sample_dat$column <- lbl$fit_transform(sample_dat$column)
decode_names <- lbl$inverse_transform(sample_dat$column)
I'm trying to convert a dataframe consisting of two columns into a named vector (nested list). The information in each row is essentially key:value pairs, so the lists in the final vector should each be named by the keys and contain their respective values.
Example input:
Var1 Var2
A 1
A 2
B 1
B 3
C 3
C 4
C 5
Example Output:
namedArray = list(A = c(1,2), B = c(1,3), C = c(3,4,5))
I managed to do this using dcast() in the reshape2 package, however this required additional post-processing to remove row names and NA's introduced by casting the data frame.
Is there a more efficient way to accomplish this?
If you have 2 columns: X and Y in dataframe df1, and you want Y's values to be the names of items with values from X:
myList <- as.list(df1$X)
names(myList) <- df1$Y
For the modified question, the answer is that there is already a function that does exactly that (and it might have been a better answer than what I gave):
> split(dat$Var2, dat$Var1)
$A
[1] 1 2
$B
[1] 1 3
$C
[1] 3 4 5
Thank you @42- and @MMerry for getting me to think about split(). I found a nice solution splitting one variable by the other and wrapping the output into a list.
y <- as.list(split(df$Var2, df$Var1))
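A self-contained sketch of the split() approach on the example input from the question. The names `dat` and `named` are illustrative.

```r
dat <- data.frame(Var1 = c("A", "A", "B", "B", "C", "C", "C"),
                  Var2 = c(1, 2, 1, 3, 3, 4, 5))

# split() groups the values of Var2 by the keys in Var1 and returns a named list
named <- split(dat$Var2, dat$Var1)
named$C
# [1] 3 4 5
```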
If you want key-value pairs in a list from a data frame, a technique could look like this:
x = data.frame(x=letters[1:5],y=1:5)
y = split(x,seq_len(nrow(x)))
names(y) = x$x
y$a
I am working on a rather large dataset in R, containing a continuous numeric variable. In another dataset, I have named intervals, described by min and max values, that I want to apply to the continuous variable in my large dataset.
Below is some example code:
df<-data.frame(x=c(1:6))
groups<-data.frame(cat=c("a","b","c","d"), min=c(1,2,4,6), max=c(2,4,5,8))
I want to make a new column, df$cat, that assigns each value of df$x to the group whose min-max boundaries (from the groups data frame) contain it.
Ideally, I would like to have groups$min <= df$x < groups$max.
> df
x cat
1 1 a
2 2 b
3 3 b
4 4 c
5 5 d
6 6 d
Is there any easy way of doing this?
Set up data:
df <- data.frame(x=c(1:6))
groups <- data.frame(cat=c("a","b","c","d"), min=c(1,2,4,6), max=c(2,4,5,8))
Use cut() with the labels argument specified:
brks <- c(groups$min,tail(groups$max,1))
df$cat <- cut(df$x,breaks=brks,labels=groups$cat,right=FALSE)
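An alternative sketch using findInterval(), which looks up each value against the sorted lower bounds. It assumes the intervals are contiguous and that every x is at least the smallest min, and it ignores groups$max, so values above the last max still fall into the last group.

```r
df <- data.frame(x = 1:6)
groups <- data.frame(cat = c("a", "b", "c", "d"),
                     min = c(1, 2, 4, 6),
                     max = c(2, 4, 5, 8),
                     stringsAsFactors = FALSE)

# For each x, findInterval() gives the index of the last min that is <= x
idx <- findInterval(df$x, groups$min)
df$cat <- groups$cat[idx]
df$cat
# [1] "a" "b" "b" "c" "c" "d"
```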
df<-data.frame(x=c(1:6))
groups<-data.frame(cat=c("a","b","c","d"), min=c(1,2,4,6), max=c(2,4,5,8))
for(i in 1:nrow(groups)){
  # values of x that fall inside the i-th group's [min, max] range
  numbers_in_range = df$x[df$x >= groups[i,]$min & df$x <= groups[i,]$max]
  df[,i+1] = df$x %in% numbers_in_range
}
# name the new logical columns once, after all of them exist
colnames(df)[2:ncol(df)] = as.character(groups$cat)
Something like this will tell you which numbers fall in which group's range. Is this what you were after?