Converting factors to numeric values in R

I have factors in R that are salary ranges of the form $100,001 - $150,000, over $150,000, $25,000, etc. and would like to convert these to numeric values (e.g. converting the factor $100,001 - $150,000 to the integer 125000).
Similarly I have educational categories such as High School Diploma, Current Undergraduate, PhD, etc. that I would like to assign numbers to (e.g., giving PhD a higher value than High School Diploma).
How do I do this, given the dataframe containing these values?

For converting the currency
# data
df <- data.frame(sal = c("$100,001 - $150,000" , "over $150,000" ,
"$25,000"), educ = c("High School Diploma", "Current Undergraduate",
"PhD"),stringsAsFactors=FALSE)
# Remove comma and dollar sign
temp <- gsub("[,$]","", df$sal)
# remove text
temp <- gsub("[[:alpha:]]","", temp)
# get average over range
df$ave.sal <- sapply(strsplit(temp , "-") , function(i) mean(as.numeric(i)))
For your education levels - if you want it numeric
df$educ.f <- as.numeric(factor(df$educ , levels=c("High School Diploma" ,
"Current Undergraduate", "PhD")))
df
#                   sal                  educ  ave.sal educ.f
# 1 $100,001 - $150,000   High School Diploma 125000.5      1
# 2       over $150,000 Current Undergraduate 150000.0      2
# 3             $25,000                   PhD  25000.0      3
EDIT
Having missing / NA values should not matter
# Data that includes missing values
df <- data.frame(sal = c("$100,001 - $150,000" , "over $150,000" ,
"$25,000" , NA), educ = c(NA, "High School Diploma",
"Current Undergraduate", "PhD"),stringsAsFactors=FALSE)
Rerun the above commands to get
df
#                   sal                  educ  ave.sal educ.f
# 1 $100,001 - $150,000                  <NA> 125000.5     NA
# 2       over $150,000   High School Diploma 150000.0      1
# 3             $25,000 Current Undergraduate  25000.0      2
# 4                <NA>                   PhD       NA      3
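If the salary clean-up has to be applied more than once, the three steps above could be wrapped in a small function. This is only a sketch under the same assumptions as the answer (ranges separated by "-", everything else treated as a single number); the name salary_midpoint is made up for illustration.

```r
# Hypothetical helper: midpoint of a salary range given as text
salary_midpoint <- function(x) {
  cleaned <- gsub("[,$]", "", x)               # drop commas and dollar signs
  cleaned <- gsub("[[:alpha:]]", "", cleaned)  # drop words such as "over"
  parts <- strsplit(cleaned, "-", fixed = TRUE)
  # average the one or two numbers found in each entry; NA stays NA
  vapply(parts, function(p) mean(as.numeric(p)), numeric(1))
}

mid <- salary_midpoint(c("$100,001 - $150,000", "over $150,000", "$25,000", NA))
mid
# [1] 125000.5 150000.0  25000.0       NA
```

Using vapply() instead of sapply() guarantees a numeric vector even when the input is empty.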

You could use the recode function in the car package.
For example:
library(car)
df$salary <- recode(df$salary,
"'$100,001 - $150,000'=125000;'$150,000'=150000")
For more information on how to use this function see the help file.

I'd just make a vector of values that map to the levels of your factor and map them in. The code below is a much less elegant solution than I'd have liked because I can't figure out how to do the indexing with a vector, but nonetheless this will do the job if your data's not overwhelmingly large. Say we want to map the factor elements of fact to the numbers in vals:
fact<-as.factor(c("a","b","c"))
vals<-c(1,2,3)
#for example:
vals[levels(fact)=="b"]
# gives: [1] 2
#now make an example data frame:
sample(1:3,10,replace=T)
data<-data.frame(fact[sample(1:3,10,replace=T)])
names(data)<-c("myvar")
#our vlookup function:
vlookup<-function(fact,vals,x) {
#probably should do an error checking to make sure fact
# and vals are the same length
out<-rep(vals[1],length(x))
for (i in 1:length(x)) {
out[i]<-vals[levels(fact)==x[i]]
}
return(out)
}
#test it:
data$myvarNumeric<-vlookup(fact,vals,data$myvar)
This should work for what you're describing.
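For what it's worth, the per-element loop can probably be replaced by a single vectorised lookup with match(). A minimal sketch, assuming the same fact and vals as above; x and lookup are names made up for this example.

```r
fact <- factor(c("a", "b", "c"))
vals <- c(1, 2, 3)   # vals[i] is the number for levels(fact)[i]

# hypothetical input column holding the factor values to translate
x <- factor(c("c", "a", "b", "c"), levels = levels(fact))

# match() gives, for each element of x, its position among the levels,
# and that position is then used to index vals in one step
lookup <- vals[match(as.character(x), levels(fact))]
lookup
# [1] 3 1 2 3
```

Values of x that are not among the levels come back as NA rather than raising an error.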

Related

Splitting complex string between symbols R

I have a dataset full of IDs and qualification strings. My issue is twofold:
How to deal with splits between different symbols and,
how to iterate output down a dataframe whilst retaining an ID.
ID <- c(1,2,3)
Qualstring <- c("LE:Science = 45 Distinctions",
"A:Chemistry = A A:Biology = A A:Mathematics = A",
"A:Biology = A A:Chemistry = A A:Mathematics = A B:Baccalaureate Advanced Diploma = Pass"
)
s <- data.frame(ID, Qualstring)
The desired output would be:
  ID Qualification Subject                         Grade
1  1 LE:           Science                         45 Distinctions
2  2 A:            Chemistry                       A
3  2 A:            Biology                         A
4  2 A:            Mathematics                     A
5  3 A:            Biology                         A
6  3 A:            Chemistry                       A
7  3 A:            Mathematics                     A
8  3 B:            Baccalaureate Advanced Diploma  Pass
The commonality of the splits is the ":" and "=", and the codes/words around those.
From my perspective the problem looks complex, and I wonder whether continuing to fudge it in Excel is ultimately the way to go for data structured like this. I would love to hear any recommendations or direction.
A solution using data.table and stringr. The use of data.table is just for my personal convenience; you could use a data.frame with do.call(rbind, .) instead of rbindlist().
library(stringr)
qual <- str_extract_all(s$Qualstring,"[A-Z]+(?=\\:)")
subject <- str_extract_all(s$Qualstring,"(?<=\\:)[\\w ]+")
grade <- str_extract_all(s$Qualstring,"(?<=\\= )[A-z0-9]+")
library(data.table)
df <- lapply(seq(s$ID),function(i){
N = length(qual[[i]])
data.table(ID = rep(s[i,"ID"],N),
Qualification = qual[[i]],
Subject = subject[[i]],
Grade = grade[[i]]
)
}) %>% rbindlist()
   ID Qualification                        Subject Grade
1:  1            LE                        Science    45
2:  2             A                      Chemistry     A
3:  2             A                        Biology     A
4:  2             A                    Mathematics     A
5:  3             A                        Biology     A
6:  3             A                      Chemistry     A
7:  3             A                    Mathematics     A
8:  3             B Baccalaureate Advanced Diploma  Pass
In short, I use a positive lookbehind (?<=) and a positive lookahead (?=). [A-Z]+ matches a run of uppercase letters, [\\w ]+ a run of word characters and spaces, and [A-z0-9]+ letters (upper and lower case) and digits; note that A-z also spans a few punctuation characters between Z and a, so [A-Za-z0-9]+ is the stricter choice. str_extract_all returns a list holding all the matches for each element of the character vector tested.
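To see the lookarounds in isolation, here is a small sketch on a single string. It uses base R (regmatches() and gregexpr() with perl = TRUE) rather than stringr, which is a substitution made for this example, and it uses the stricter [A-Za-z0-9]+ class for the grade.

```r
x <- "A:Chemistry = A A:Biology = A"

# uppercase letters that are immediately followed by ":"
qual <- regmatches(x, gregexpr("[A-Z]+(?=:)", x, perl = TRUE))[[1]]

# word characters and spaces immediately preceded by ":";
# trimws() drops the trailing space left before "="
subject <- trimws(regmatches(x, gregexpr("(?<=:)[\\w ]+", x, perl = TRUE))[[1]])

# letters and digits immediately preceded by "= "
grade <- regmatches(x, gregexpr("(?<== )[A-Za-z0-9]+", x, perl = TRUE))[[1]]

data.frame(Qualification = qual, Subject = subject, Grade = grade)
```

The lookarounds only assert what sits next to the match, so the ":" and "= " delimiters never end up in the extracted text.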

Extract numbers from free text input, sum them, and add to df

I have a dataframe with a column that contains free text entries on years of education. From the free text entries I want to extract all of the numbers and sum them.
Example: data_en$education[1] gives "6 primary school 10 highschool"
With the following code I can extract both numbers and sum them.
library(stringr)
x <- as.numeric(str_extract_all(data_en$education[1], "[0-9]+")[[1]])
x <- as.vector(x)
x <- sum(x)
However, I would ideally like to do this for all free text entries (i.e. each row) and subsequently add the results to the dataframe per row (i.e. in a variable such as data_en$educationNum). I'm a bit stuck on how to proceed.
You can use sapply:
data_en$educationNum <- sapply(str_extract_all(data_en$education, "[0-9]+"),
function(i) sum(as.numeric(i)))
data_en
# education educationNum
# 1 6 primary school 10 highschool 16
# 2 10 primary school 2 highschool 12
# 3 no school 0
Data
data_en <- data.frame(education = c("6 primary school 10 highschool",
"10 primary school 2 highschool",
"no school"))
You just need to map over the output of str_extract_all
x <- c('300 primary 1 underworld', '6 secondary 9 dungeon lab')
library(stringr) # for str_extract_all
library(purrr)
map_dbl(str_extract_all(x, '\\d+'), ~ sum(as.numeric(.)))
# [1] 301 15
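If adding a package is not an option, the same extract-then-sum idea can be sketched in base R. This assumes the education column from the Data block above; nums is a name made up for the example.

```r
data_en <- data.frame(education = c("6 primary school 10 highschool",
                                    "10 primary school 2 highschool",
                                    "no school"))

# gregexpr() locates every run of digits, regmatches() pulls them out
txt <- as.character(data_en$education)
nums <- regmatches(txt, gregexpr("[0-9]+", txt))

# sum the numbers found in each row; rows without digits sum to 0
data_en$educationNum <- vapply(nums, function(n) sum(as.numeric(n)), numeric(1))
data_en$educationNum
# [1] 16 12  0
```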

Create anonymous names for each unique factor level (e.g., companies)

The goal is to create and use anonymous names for firms. Doing so makes it possible to distribute samples of plots without disclosing proprietary information about specific firms.
The toy data frame shows that there can be multiple instances of firms and that names of different firms vary in unpredictable ways. The code works but seems to be laborious and subject to mistakes.
Is there a more efficient way to rename each firm in a new variable that has an anonymous replacement name?
df <- data.frame(firm = c(rep("Alpha LLC",3), "Baker & Charlie", rep("Delta and Associates", 2), "Epsilon", "The Gamma Firm"), fees = rep(100, 8))
# create a translation table (named vector) where each firm has a unique "name" of the form "Firm LETTER number"
uniq <- as.character(unique(df$firm))
uniq.df <- data.frame(firmname = uniq, anonfirm = paste0("Firm ", LETTERS[seq(1:length(uniq))], seq(1:length(uniq))))
# create a "named vector" with firm on top (as names) and anonymous name on bottom
translation.vec <- uniq.df[ , 2] # the anonymous name firm name
names(translation.vec) <- uniq.df[ , 1] # original name as column name for anonymous firm name
df$anon <- translation.vec[df$firm] # finds index of firm; replaces w/anonymous
> df
                  firm fees    anon
1            Alpha LLC  100 Firm A1
2            Alpha LLC  100 Firm A1
3            Alpha LLC  100 Firm A1
4      Baker & Charlie  100 Firm B2
5 Delta and Associates  100 Firm C3
6 Delta and Associates  100 Firm C3
7              Epsilon  100 Firm D4
8       The Gamma Firm  100 Firm E5
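In R versions before 4.0.0, character columns passed to data.frame() become factors by default (stringsAsFactors = TRUE); from 4.0.0 on they stay character, so the column has to be wrapped in factor() first. Either way, it's pretty easy just to swap the names of the levels of a factor. For example
set.seed(15) # so sample() is reproducible
n <- nlevels(factor(df$firm)) # factor() also covers the case where firm is a character column
newnames <- paste0("Firm ", LETTERS[1:n], 1:n)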
df$anon <- factor(df$firm, labels=sample(newnames))
Here I just change the labels of the factor. I also throw in a sample(); otherwise the firms would be named in alphabetical order. This produces
                  firm fees    anon
1            Alpha LLC  100 Firm D4
2            Alpha LLC  100 Firm D4
3            Alpha LLC  100 Firm D4
4      Baker & Charlie  100 Firm A1
5 Delta and Associates  100 Firm C3
6 Delta and Associates  100 Firm C3
7              Epsilon  100 Firm B2
8       The Gamma Firm  100 Firm E5
The order of your new levels of your factor will still contain some information about the original order of the firms; you can eliminate that data by casting to character if you plan to share the R data set rather than save to a flat text file or just display the information.
df$anon <- as.character(factor(df$firm, labels=sample(newnames)))
Expanding on @LaurenGoodwin's very smart comment -
You can change to a factor, then to numeric, which will make each company a different number
companies <- LETTERS
anon <- as.numeric(as.factor(companies))
If you wanted it as more than a number, just change to a character and use paste.
anon <- paste('Firm', as.character(anon))
[1] "Firm 1" "Firm 2" "Firm 3" "Firm 4" "Firm 5" "Firm 6" "Firm 7"
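A variant that does not depend on whether the firm column is a factor or a character vector is to build the shuffled labels from unique() and look them up with match(). This is a sketch with made-up firm names; uniq, labels and anon are illustrative names.

```r
firm <- c("Alpha LLC", "Alpha LLC", "Baker & Charlie", "Epsilon")

set.seed(1)                     # only so the shuffle is repeatable
uniq <- unique(as.character(firm))

# one label per distinct firm, numbered in random order so the
# label does not reveal the alphabetical position of the firm
labels <- paste("Firm", sample(seq_along(uniq)))

# match() finds each firm's position in uniq and picks its label
anon <- labels[match(as.character(firm), uniq)]
anon
```

Keeping uniq and labels around gives a translation table for mapping results back to the real names later; it should of course not be shared along with the plots.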

Perform multiple summary functions and return a dataframe

I have a data set that includes a whole bunch of data about students, including their current school, zipcode of former residence, and a score:
students <- read.table(text = "zip school score
43050 'Hunter' 202.72974236
48227 'NYU' 338.49571519
48227 'NYU' 223.48658339
32566 'CCNY' 310.40666224
78596 'Columbia' 821.59318662
78045 'Columbia' 853.09842034
60651 'Lang' 277.48624384
32566 'Lang' 315.49753763
32566 'Lang' 80.296556533
94941 'LIU' 373.53839238
",header = TRUE,sep = "")
I want a heap of summary data about it, per school. How many students from each school are in the data set, how many unique zipcodes per school, average and cumulative score. I know I can get this by using tapply to create a bunch of tmp frames:
tmp.mean <- data.frame(tapply(students$score, students$school, mean))
tmp.sum <- data.frame(tapply(students$score, students$school, sum))
tmp.unique.zip <- data.frame(tapply(students$zip, students$school, function(x) length(unique(x))))
tmp.count <- data.frame(tapply(students$zip, students$school, function(x) length(x)))
Giving them better column names:
colnames(tmp.unique.zip) <- c("Unique zips")
colnames(tmp.count) <- c("Count")
colnames(tmp.mean) <- c("Mean Score")
colnames(tmp.sum) <- c("Total Score")
And using cbind to tie them all back together again:
school.stats <- cbind(tmp.mean, tmp.sum, tmp.unique.zip, tmp.count)
I think the cleaner way to do this is:
library(plyr)
school.stats <- ddply(students, .(school), summarise,
record.count=length(score),
unique.r.zips=length(unique(zip)),
mean.dist=mean(score),
total.dist=sum(score)
)
The resulting data looks about the same (actually, the ddply approach is cleaner and includes the schools as a column instead of as row names). Two questions: is there a better way to find out how many records are associated with each school? And am I using ddply efficiently here? I'm new to it.
If performance is an issue, you can also use data.table
require(data.table)
tab_s<-data.table(students)
setkey(tab_s,school)
tab_s[,list(total=sum(score),
avg=mean(score),
unique.zips=length(unique(zip)),
records=length(score)),
by="school"]
school total avg unique.zips records
1: Hunter 202.7297 202.7297 1 1
2: NYU 561.9823 280.9911 1 2
3: CCNY 310.4067 310.4067 1 1
4: Columbia 1674.6916 837.3458 2 2
5: Lang 673.2803 224.4268 2 3
6: LIU 373.5384 373.5384 1 1
Comments seem to be in general agreement: this looks good.
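For comparison, the same per-school summary can be sketched in base R with split() and lapply(), which avoids the temporary tapply frames without needing plyr or data.table. The four-row students frame below is a made-up subset with rounded scores, used only to keep the example short.

```r
# hypothetical, shortened version of the students data
students <- data.frame(
  zip    = c(48227, 48227, 78596, 78045),
  school = c("NYU", "NYU", "Columbia", "Columbia"),
  score  = c(338.5, 223.5, 821.5, 853.5)
)

# split the rows by school, summarise each piece, then stack the results
stats <- do.call(rbind, lapply(split(students, students$school), function(d) {
  data.frame(school      = as.character(d$school[1]),
             records     = nrow(d),               # number of students
             unique.zips = length(unique(d$zip)),
             avg         = mean(d$score),
             total       = sum(d$score))
}))
rownames(stats) <- NULL
stats
```

nrow(d) answers the record-count question directly; inside ddply the equivalent would be length() of any column, as in the question.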

Create variables from content of a row in R

I have hospital visit data that contain records for gender, age, main diagnosis, and hospital identifier. I intend to create separate variables for these entries. The data have some pattern: most observations start with a gender code (M or F), followed by age, then diagnosis, and usually the hospital identifier. But there are some exceptions: in some records the gender is coded 01 or 02, and in that case the gender identifier appears at the end.
I looked into the archives and found some examples of grep, but I could not apply them efficiently to my data. For example the code
ndiag<-dat[grep("copd", dat[,1], fixed = TRUE),]
could extract each diagnosis individually, but not all at once. How can I do this task?
Sample data showing the current situation (column 1) and what I intend to have (remaining columns):
diagnosis      hospital  diag   age    gender
m3034CVDA      A         cvd    30-34  M
m3034cardvA    A         cardv  30-34  M
f3034aceB      B         ace    30-34  F
m3034hfC       C         hf     30-34  M
m3034cereC     C         cere   30-34  M
m3034resPC     C         resp   30-34  M
3034copd_Z_01  Z         copd   30-34  M
3034copd_Z_01  Z         copd   30-34  M
fcereZ         Z         cere   NA     F
f3034respC     C         resp   30-34  F
3034copd_Z_02  Z         copd   30-34  F
There appear to be two key parts to this problem:
Dealing with the fact that strings are coded in two different ways
Splicing the string into the appropriate data columns
Note: as for applying a function over several values at once, many of the functions can handle vectors already. For example str_locate and substr.
Part 1 - Cleaning the strings for m/f // 01/02 coding
# We will be using this library later for str_detect, str_replace, etc
library(stringr)
# first, make sure diagnosis is character (strings) and not factor (category)
diagnosis <- as.character(diagnosis)
# We will use a temporary vector, to preserve the original, but this is not a necessary step.
diagnosisTmp <- diagnosis
males <- str_locate(diagnosisTmp, "_01")
females <- str_locate(diagnosisTmp, "_02")
# NOTE: All of this will work fine as long as '_01'/'_02' appears *__only__* as gender code.
# Therefore, we put in the next two lines to check for errors, make sure we didn't accidentally grab a "_01" from the middle of the string
#-------------------------
if (any(str_length(diagnosisTmp) != males[,2], na.rm=T)) stop ("Error in coding for males")
if (any(str_length(diagnosisTmp) != females[,2], na.rm=T)) stop ("Error in coding for females")
#------------------------
# remove all the '_01'/'_02' (replacing with "")
diagnosisTmp <- str_replace(diagnosisTmp, "_01", "")
diagnosisTmp <- str_replace(diagnosisTmp, "_02", "")
# append to front of string appropriate m/f code
diagnosisTmp[!is.na(males[,1])] <- paste0("m", diagnosisTmp[!is.na(males[,1])])
diagnosisTmp[!is.na(females[,1])] <- paste0("f", diagnosisTmp[!is.na(females[,1])])
# remove superfluous underscores
diagnosisTmp <- str_replace(diagnosisTmp, "_", "")
# display the original next to modified, for visual spot check
cbind(diagnosis, diagnosisTmp)
Part 2 - Splicing the string
# gender is the first char, hospital is the last.
gender <- toupper(str_sub(diagnosisTmp, 1,1))
hosp <- str_sub(diagnosisTmp, -1,-1)
# age, if present is char 2-5. A warning will be shown if values are missing. Age needs to be cleaned up
age <- as.numeric(str_sub(diagnosisTmp, 2,5)) # as.numeric will convert non-numbers to NA
age[!is.na(age)] <- paste(substr(age[!is.na(age)], 1, 2), substr(age[!is.na(age)], 3, 4), sep="-")
# diagnosis is variable length, so we have to find where to start
diagStart <- 2 + 4*(!is.na(age))
diag <- str_sub(diagnosisTmp, diagStart, -2)
# Put it all together into a data frame
dat <- data.frame(diagnosis, hosp, diag, age, gender)
## OR WITHOUT ORIGINAL DIAGNOSIS STRING ##
dat <- data.frame(hosp, diag, age, gender)
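As a more compact alternative, both parts can be sketched with base R regular expressions. This is not the method above but a swapped-in variant, and it assumes every string ends in a single uppercase hospital letter and that age, when present, is exactly four digits; x, parts and parsed are names made up for the example.

```r
x <- c("m3034CVDA", "fcereZ", "3034copd_Z_01", "3034copd_Z_02")

# Part 1: move a trailing _01/_02 gender code to the front as m/f
x <- sub("^(.*)_([A-Z])_01$", "m\\1\\2", x)
x <- sub("^(.*)_([A-Z])_02$", "f\\1\\2", x)

# Part 2: gender letter, optional digits, diagnosis, hospital letter
parts <- regmatches(x, regexec("^([mf])([0-9]*)(.*)([A-Z])$", x))
parts <- do.call(rbind, parts)   # column 1 is the full match

# turn "3034" into "30-34"; entries without a four-digit age become NA
digits <- parts[, 3]
age <- ifelse(nchar(digits) == 4,
              paste(substr(digits, 1, 2), substr(digits, 3, 4), sep = "-"),
              NA)

parsed <- data.frame(hospital = parts[, 5],
                     diag     = tolower(parts[, 4]),
                     age      = age,
                     gender   = toupper(parts[, 2]))
parsed
```

A string that does not fit the pattern would yield an empty match and break the rbind step, so real data should be checked for non-matching rows first.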
