Creating a dummy variable in R using loan default data

I'm working with the Lending Club data set and I'm trying to create a dummy variable for the target variable loan_status. My goal is for Charged Off to be 0, Fully Paid to be 1, and everything else to be NA. The variable loan_status has several values: Current, Fully Paid, Late, Grace Period, Delinquent, Charged Off, and Does not qualify due to credit profile. I only want to focus on Charged Off and Fully Paid. I've tried numerous times but still with no success. For example:
# Creating a new target variable
loan_status1 <- if (loan_status == 'Fully Paid') {'Yes'} else if
  (loan_status == 'Charged Off') {'No'} else 'NA'
Also I've tried this:
if (loan_status == 'Fully Paid') {
  0
} else if (loan_status == 'Charged Off') {
  1
} else (loan_status == 'NA')
I would appreciate any guidance.

Basically you could run a for-loop over your data, for example:
# Don't store missing values as the string 'NA'; use the actual NA value instead
# some example data
loan_status <- sample(c('Fully Paid', 'Charged Off', 'abc'), 100, replace = TRUE)
for (i in seq_along(loan_status)) {
  if (loan_status[i] == 'Fully Paid') {
    loan_status[i] <- 0
  } else if (loan_status[i] == 'Charged Off') {
    loan_status[i] <- 1
  } else {
    loan_status[i] <- NA  # assignment (<-), not comparison (==)
  }
}
# note: loan_status is still a character vector; use as.integer() afterwards if needed
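A vectorized alternative to the loop (a minimal sketch, using the coding the OP asked for: Fully Paid = 1, Charged Off = 0, everything else NA). Note that if() only handles a single TRUE/FALSE, which is why the attempts in the question fail on a whole column, while ifelse() works element-wise:
# starting again from the original character values of loan_status
loan_status1 <- ifelse(loan_status == 'Fully Paid', 1L,
                       ifelse(loan_status == 'Charged Off', 0L, NA_integer_))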
Or maybe you want to do this the easy way with the factor() function, for instance:
factor(loan_status, levels = c('Fully Paid', 'Charged Off'), labels = c(0, 1))

The OP requested a 1:1 replacement of selected values, i.e., only one data field is involved. Besides the nested ifelse approach, this can be done using factors, or with a join for larger data.
If more than two or three values need to be replaced, the "hard-coded" nested ifelse approach quickly becomes unwieldy.
Factor case 1: Yes, No
# create some data
loan_status <- c("Fully Paid", "Charged Off", "Something", "Else")
# do the conversion
factor(loan_status, levels = c("Fully Paid", "Charged Off"), labels = c("Yes", "No"))
#[1] Yes No <NA> <NA>
#Levels: Yes No
Or,
as.character(factor(loan_status, levels = c("Fully Paid", "Charged Off"), labels = c("Yes", "No")))
#[1] "Yes" "No" NA NA
if the result is expected as character.
Factor case 2: 0L, 1L as integers
If the result is expected to be of type integer, the factor approach can still be used but needs an additional conversion.
as.integer(as.character(factor(loan_status, levels = c("Fully Paid", "Charged Off"), labels = c("0", "1"))))
#[1] 0 1 NA NA
Note that the conversion to character is essential here. Otherwise, the result would be the integer codes of the factor levels:
as.integer(factor(loan_status, levels = c("Fully Paid", "Charged Off"), labels = c("0", "1")))
#[1] 1 2 NA NA
Join
For larger data with many items to be replaced, a data.table join might be an alternative worth considering:
library(data.table)
# create translation table
translation_map <- data.table(
  loan_status = c("Fully Paid", "Charged Off"),
  target = c(0L, 1L))
# create some user data
DT <- data.table(id = LETTERS[1:4],
                 loan_status = c("Fully Paid", "Charged Off", "Something", "Else"))
DT
# id loan_status
#1: A Fully Paid
#2: B Charged Off
#3: C Something
#4: D Else
# right join
translation_map[DT, on = "loan_status"]
# loan_status target id
#1: Fully Paid 0 A
#2: Charged Off 1 B
#3: Something NA C
#4: Else NA D
By default (nomatch = NA), data.table does a right join, i.e., it takes all rows of DT.
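If you prefer to keep DT's original row order and column layout and simply add the new column, an update join is another option (a quick sketch using the same two tables):
# add target to DT by reference; rows without a match get NA
DT[translation_map, on = "loan_status", target := i.target]
DT
#    id loan_status target
# 1:  A  Fully Paid      0
# 2:  B Charged Off      1
# 3:  C   Something     NA
# 4:  D        Else     NA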

Related

How to create a subset of data in equal random distribution considering multiple conditions in R

I have the below-mentioned df in R:
x <- structure(list(ID = c("I-1", "I-2", "I-3", "I-4", "I-5", "I-6",
"I-7", "I-8", "I-9", "I-10", "I-11"), Unique_Id = c("UR-112",
"UR-112", "UR-112", "UR-113", "UR-113", "UR-114", "UR-114", "UR-114",
"UR-115", "UR-115", "UR-116"), Date = c("2020-01-01 14:15:16",
"2020-02-12 14:15:16", "2020-03-23 14:15:16", "2020-01-01 14:15:16",
"2020-04-11 14:15:16", "2020-04-07 14:15:16", "2020-05-08 14:15:16",
"2020-05-09 14:15:16", "2020-01-18 14:15:16", "2020-03-23 14:15:16",
"2020-02-11 14:15:16"), Status = c("Approved", "In Process",
"In Process", "Hold", "Hold", "Approved", "Approved", "In Process",
"Approved", "Approved", "Approved")), class = "data.frame", row.names = c(NA,
-11L))
I need to create a subset of 3 randomly chosen Unique_Id values that are spread across all the dates, and these three Unique_Id values must cover the available Status values.
Required Output:
ID Unique_Id Date Status
I-1 UR-112 2020-01-01 14:15:16 Approved
I-2 UR-112 2020-02-12 14:15:16 In Process
I-3 UR-112 2020-03-23 14:15:16 In Process
I-4 UR-113 2020-01-01 14:15:16 Hold
I-5 UR-113 2020-04-11 14:15:16 Hold
I-11 UR-116 2020-02-11 14:15:16 Approved
I am trying the following code, but it is very slow even on a moderately sized dataset, and I need to apply this logic to a dataset with more than 1 million rows.
code:
id <- character(0)
while (length(id) != 3) {
  id <- character(0)
  for (i in unique(x$Status)) {
    id <- c(id, sample(setdiff(x$Unique_Id[x$Status == i], id), 1))
  }
}
x[x$Unique_Id %in% id, ]
Your problem description is somewhat confusing, so this may not be what you need:
library(data.table) #This is fast, specially if your data is big
setDT(x) # converts x to data.table. Don't worry, it remains a data.frame too!
x[Unique_Id %in% sample(unique(Unique_Id), 3), ]
sample(unique(Unique_Id), 3) takes the unique values of Unique_Id and randomly samples 3.
x[var %in% foo, ] is data.table-se for "filter my table x when variable var is contained in vector foo".
EDIT TO ADD:
After further clarification by the OP, the solution is more complex and looks like this:
First we need to find which Unique_Ids have roughly a 50% share of "Approved". Then we sample 3 out of those Unique_Ids and retrieve all the information associated with them.
Step by step solution
IDs_OK <- x[, .N,
            by = .(Unique_Id, Status == "Approved")][,
              dcast(.SD,
                    Unique_Id ~ Status,
                    fill = 0)][
              (`TRUE` / (`TRUE` + `FALSE`)) %between% c(.4, .6),
              sample(unique(Unique_Id), 3)]
x[, .N, by = .(Unique_Id, Status == "Approved")] counts cases (.N) for each combination of Unique_Id and status (the status column becomes TRUE where approved and FALSE otherwise).
We chain that result (that's the ][) and convert the table to wide format (dcast with Unique_Id in the rows and TRUE and FALSE in the columns); fill = 0 instructs it to fill with 0 when no case is found.
We chain that result and keep the cases where the proportion of TRUEs is between 40 and 60%.
For those cases, we take the unique Unique_Ids and sample 3 of them.
We assign those 3 Unique_Ids to a variable called IDs_OK.
x[Unique_Id %in% IDs_OK, ] # this is your expected result.
One-line solution:
It is possible to use a join in the data.table style (X[Y, on = "var"] joins X to Y on variable var):
x[x[, .N, by = .(Unique_Id, Status == "Approved")][, dcast(.SD, Unique_Id ~ Status, fill = 0)][(`TRUE` / (`TRUE` + `FALSE`)) %between% c(.4, .6), .(Unique_Id = sample(unique(Unique_Id), 3))], on = "Unique_Id"]
The only difference is in the last line, where I used .(Unique_Id = sample(unique(Unique_Id), 3)). The dot returns the result as a data.table, a condition necessary to make the join.
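Note that both versions rely on sample(), so the three Unique_Ids will differ between runs; if you need a reproducible subset, fix the seed first (a minimal sketch):
set.seed(2020)  # any fixed value makes the sampled Unique_Ids reproducible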

How to get one row per unique ID with multiple columns per values of particular column

I have a dataset that looks like (A) and I'm trying to get (B):
#(A)
event <- c('A', 'A', 'A', 'B', 'B', 'C', 'D', 'D', 'D')
person <- c('Ann', 'Sally', 'Ryan', 'Ann', 'Ryan', 'Sally', 'Ann', 'Sally', 'Ryan')
birthday <- c('1990-10-10', NA, NA, NA, '1985-01-01', NA, '1990-10-10', '1950-04-02', NA)
data <- data.frame(event, person, birthday)
#(B)
person <- c('Ann', 'Sally', 'Ryan')
A <- c(1, 1, 1)
B <- c(1, 0, 1)
C <- c(0, 0, 1)
D <- c(1, 1, 1)
birthday <- c('1990-10-10', '1950-04-02', '1985-01-01')
data <- data.frame(person, A, B, C, D, birthday)
Basically, I have a sign-up list of events and can see people who attended various ones. I want to get a list of all the unique people with columns for which events they did/didn't attend. I also got profile data from some of the events, but some had more data than others - so I also want to keep the most filled out data (i.e. couldn't identify Ryan's birthday from event D but could from event B).
I've tried looking up many different things but get confused between whether I should be looking at reshaping, vs. dcast, vs. spread/gather... new to R so any help is appreciated!
EDIT: Additional question - instead of indicating 1/0 for whether someone went to an event, if multiple events were in the same category, how would you count how many times someone went to that category of event? E.g., I would have events called A1, A2, and A3 in the dataset as well. The final table would still have a column called A, but instead of just 1/0, it would show 0 if the person attended no A events, and 1, 2, or 3 if the person attended 1, 2, or 3 A events.
A data.table option
dcast(
setDT(data),
person + na.omit(birthday)[match(person, person[!is.na(birthday)])] ~ event,
fun = length
)
gives
person birthday A B C D
1: Ann 1990-10-10 1 1 0 1
2: Ryan 1985-01-01 1 1 0 1
3: Sally 1950-04-02 1 0 1 1
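The birthday part of the formula above uses a small match() trick to look up, for each person, their first non-missing birthday. A toy sketch of just that piece (made-up values, not the OP's data):
p  <- c('Ann', 'Sally', 'Ann')
bd <- c(NA, '1950-04-02', '1990-10-10')
na.omit(bd)[match(p, p[!is.na(bd)])]
# [1] "1990-10-10" "1950-04-02" "1990-10-10"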
A base R option using reshape
reshape(
transform(
data,
birthday = na.omit(birthday)[match(person, person[!is.na(birthday)])],
cnt = 1
),
direction = "wide",
idvar = c("person", "birthday"),
timevar = "event"
)
gives
person birthday cnt.A cnt.B cnt.C cnt.D
1 Ann 1990-10-10 1 1 NA 1
2 Sally 1950-04-02 1 NA 1 1
3 Ryan 1985-01-01 1 1 NA 1
First, you should isolate the birthdays, which are not represented cleanly in your table; then reshape, and finally merge the birthdays back.
Using the package reshape2 :
birthdays <- unique(data[!is.na(data$birthday),c("person","birthday")])
reshaped <- reshape2::dcast(data,person ~ event, value.var = "event",fun.aggregate = length)
final <- merge(reshaped,birthdays)
Explanation: I told reshape2::dcast to put person into rows and event into columns, and to count every occurrence of event (via the aggregation function length).
EDIT: for your additional question, it works just the same; just apply substr() to the event variable:
reshaped <- reshape2::dcast(data,person ~ substr(event,1,1), value.var = "event",fun.aggregate = length)
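A quick toy check of the counting idea (hypothetical A1/A2/B1 events, not the OP's data; here the category is precomputed into its own column before casting):
data2 <- data.frame(event  = c('A1', 'A2', 'A1', 'B1'),
                    person = c('Ann', 'Ann', 'Sally', 'Ann'))
data2$category <- substr(data2$event, 1, 1)
reshape2::dcast(data2, person ~ category, value.var = "event", fun.aggregate = length)
#   person A B
# 1    Ann 2 1
# 2  Sally 1 0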

Best practice for handling different datasets with same type of data but different column names

For example, suppose I want to build a package for analyzing customer transactions. In a nice world, every transactions dataset would look like
TransactionId CustomerId TransactionDate
1: 1 1 2017-01-01
2: 2 2 2017-01-15
3: 3 1 2017-05-20
4: 4 3 2017-06-11
Then I could make nice functions like
num_customers <- function(transactions){
  length(unique(transactions$CustomerId))
}
In reality, the column names people use vary. (E.g. "CustomerId", "CustomerID", and "cust_id" might all be used by different companies).
My question is, what is the best way for me to deal with this? I plan on relying heavily on data.table, so my instinct was to make the users provide a mapping from their column names to the ones I use, stored as an attribute of their table, like
mytransactions <- data.table(
  transaction_id = c(1L, 2L, 3L, 4L),
  customer_id = c(1L, 2L, 1L, 3L),
  transaction_date = as.Date(c("2017-01-01", "2017-01-15", "2017-05-20", "2017-06-11"))
)
setattr(
  mytransactions,
  name = "colmap",
  value = c(TransactionID = "transaction_id", CustomerID = "customer_id", TransactionDate = "transaction_date")
)
attributes(mytransactions)
However, unfortunately, as soon as they subset their data this attribute gets removed.
attributes(mytransactions[1:2])
If you expect data to have a specific shape and set of attributes, define a class. It's really easy to do in R using the S3 system, since you only need to change the class attribute.
The best way to let users create S3 objects is through a function. To keep the original "feel" of adapting existing datasets, have the users provide a dataset and name which columns to use for different values. Default argument values can keep your package code succinct and reward standards-respecting users.
transaction_table <- function(dataset,
                              cust_id = "CustomerId",
                              trans_id = "TransactionId",
                              trans_date = "TransactionDate") {
  keep_columns <- c(
    CustomerId = cust_id,
    TransactionId = trans_id,
    TransactionDate = trans_date
  )
  out_table <- dataset[, keep_columns, with = FALSE]
  setnames(out_table, names(keep_columns))
  setattr(out_table, "class", c("transaction_table", class(out_table)))
  out_table
}
standardized <- transaction_table(
mytransactions,
cust_id = "customer_id",
trans_id = "transaction_id",
trans_date = "transaction_date"
)
standardized
# CustomerId TransactionId TransactionDate
# 1: 1 1 2017-01-01
# 2: 2 2 2017-01-15
# 3: 1 3 2017-05-20
# 4: 3 4 2017-06-11
As a bonus, you can now take full advantage of the S3 system, defining class-specific methods for generic functions.
print.transaction_table <- function(x, ...) {
  time_range <- range(x[["TransactionDate"]])  # use x, the object being printed
  formatted_range <- strftime(time_range)
  cat("Transactions from", formatted_range[1], "to", formatted_range[2], "\n")
  NextMethod()
}
print(standardized)
# Transactions from 2017-01-01 to 2017-06-11
# CustomerId TransactionId TransactionDate
# 1: 1 1 2017-01-01
# 2: 2 2 2017-01-15
# 3: 1 3 2017-05-20
# 4: 3 4 2017-06-11
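For example, the num_customers() helper from the question could now be written as a generic with a class-specific method (a sketch, assuming the standardized table created above):
num_customers <- function(transactions) UseMethod("num_customers")
num_customers.transaction_table <- function(transactions) {
  # the constructor guarantees a CustomerId column, so this lookup is safe
  length(unique(transactions[["CustomerId"]]))
}
num_customers(standardized)
# [1] 3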

Q-How to fill a new column in data.frame based on row values by two conditions in R

I am trying to figure out how to generate a new column in R that records whether a politician "i" remains in the same party or defects in a given legislature "l". Politicians and parties are identified by indexes. Here is an example of what my data originally looks like:
## example of data
names <- c("Jesus Martinez", "Anrita blabla", "Paco Pico", "Reiner Steingress", "Jesus Martinez Porras")
Parti.affiliation <- c("Winner","Winner","Winner", "Loser", NA)#NA, "New party", "Loser", "Winner", NA
Legislature <- c(rep(1, 5), rep(2,5), rep(3,5), rep(4,5), rep(5,5), rep(6,5))
selection <- c(rep("majority", 15), rep("PR", 15))
sex<- c("Male", "Female", "Male", "Female", "Male")
Election<- c(rep(1955, 5), rep(1960, 5), rep(1965, 5), rep(1970,5), rep(1975,5), rep(1980,5))
d<- data.frame(names =factor(rep(names, 6)), party.affiliation = c(rep(Parti.affiliation,5), NA, "New party", "Loser", "Winner", NA), legislature = Legislature, selection = selection, gender =rep(sex, 6), Election.date = Election)
## generating an id for politician and party.affiliation
library(dplyr)  # for arrange()
d$id_pers <- paste(d$names, sep = "")
d <- arrange(d, id_pers)
d <- transform(d, id_pers = as.numeric(factor(id_pers)))
d$party.affiliation1 <- as.numeric(d$party.affiliation)
The expected outcome is the following: if a politician (identified by the column "id_pers") changes their value in the column "party.affiliation1", a value of 1 should be assigned in a new column called "switch", otherwise 0. The same should be done for every politician in the dataset, so the expected outcome would be:
d["switch"]<- c(1, rep(0,4), NA, rep(0,6), rep(NA, 6),1, rep(0,5), rep (0,5),1) # 0= remains in the same party / 1= switch party affiliation.
As example, you can see in this data.frame that the first politician, called "Anrita blabla", was a candidate of the party '3' from the 1st to 5th legislature. However, we can observe that "Anrita" changes her party affiliation in the 6th legislature, so she was a candidate for the party '2'. Therefore, the new column "switch" should contain a value '1' to reflect this Anrita's change of party affiliation, and '0' to show that "Anrita" did not change her party affiliation for the first 5 legislatures.
I have tried several approaches to do that (e.g. loops). I have found this strategy the simplest one, but it does not work :(
## add a new column based on raw values
ind <- c(FALSE, party.affiliation1[-1L]!= party.affiliation1[-length(party.affiliation1)] & party.affiliation1!= 'Null')
d <- d %>% group_by(id_pers) %>% mutate(this = ifelse(ind, 1, 0))
I hope you find this explanation clear. Thanks in advance!!!
I think you could do:
library(tidyverse)
d %>%
  group_by(id_pers) %>%
  mutate(switch = as.numeric(party.affiliation1 - lag(party.affiliation1) != 0))
The first entry will be NA as we don't have information on whether their previous, if any, party affiliation was different.
Edit: We use the default= parameter of lag() with ifelse() nested to differentiate the first values.
df <- d %>%
  group_by(id_pers) %>%
  mutate(switch = ifelse(party.affiliation1 - lag(party.affiliation1, default = -99) > 90, 99,
                         ifelse(party.affiliation1 - lag(party.affiliation1) != 0, 1, 0)))
Another approach, using data.table:
library(data.table)
# Convert to data.table
d <- as.data.table(d)
# Order by election date
d <- d[order(Election.date)]
# Get the previous affiliation, for each id_pers
d[, previous_party_affiliation := shift(party.affiliation), by = id_pers]
# If the current affiliation is different from the previous one, set to 1
d[, switch := ifelse(party.affiliation != previous_party_affiliation, 1, 0)]
# Remove the column
d[, previous_party_affiliation := NULL]
As Haboryme has pointed out, the first entry of each person will be NA, due to the lack of information on previous elections. And the result would give this:
names party.affiliation legislature selection gender Election.date id_pers party.affiliation1 switch
1: Anrita blabla Winner 1 majority Female 1955 1 NA NA
2: Anrita blabla Winner 2 majority Female 1960 1 NA 0
3: Anrita blabla Winner 3 majority Female 1965 1 NA 0
4: Anrita blabla Winner 4 PR Female 1970 1 NA 0
5: Anrita blabla Winner 5 PR Female 1975 1 NA 0
6: Anrita blabla New party 6 PR Female 1980 1 NA 1
(...)
EDITED
In order to identify the first party-affiliation entry for each politician and assign the value 99 to it, you can use this modified version:
# Note the "fill" parameter passed to the function shift
d[, previous_party_affiliation := shift(party.affiliation, fill = "First"), by = id_pers]
# Set 99 to the first occurrence
d[, switch := ifelse(party.affiliation != previous_party_affiliation, ifelse(previous_party_affiliation == "First", 99, 1), 0)]
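A more compact variant of the same idea (a sketch, with d already converted to a data.table as above): compare directly against the shifted value and coerce the logical result to integer; the first row for each politician then stays NA.
d[, switch := as.integer(party.affiliation != shift(party.affiliation)), by = id_pers]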

Performing Operations on a Subset Using Data Table

I have a survey data set in wide form. For a particular question, a set of variables was created in the raw data to represent the fact that the survey question was asked in a particular month.
I wish to create a new set of variables that have month-invariant names; the value of these variables will correspond to the value of a month-variant question for the month observed.
Please see an example / fictitious data set:
require(data.table)
data <- data.table(month = rep(c('may', 'jun', 'jul'), each = 5),
may.q1 = rep(c('yes', 'no', 'yes'), each = 5),
jun.q1 = rep(c('breakfast', 'lunch', 'dinner'), each = 5),
jul.q1 = rep(c('oranges', 'apples', 'oranges'), each = 5),
may.q2 = rep(c('econ', 'math', 'science'), each = 5),
jun.q2 = rep(c('sunny', 'foggy', 'cloudy'), each = 5),
jul.q2 = rep(c('no rain', 'light mist', 'heavy rain'), each = 5))
In this survey, there are really only two questions: "q1" and "q2". Each of these questions is repeatedly asked for several months. However, the observation contains a valid response only if the month observed in the data matches up with the survey question for a particular month.
For example: "may.q1" is observed as "yes" for any observation in "May". I would like a new "Q1" variable to represent "may.q1", "jun.q1", and "jul.q1". The value of "Q1" will take on the value of "may.q1" when the month is "may", and the value of "Q1" will take on the value of "jun.q1" when the month is "jun".
If I were to try and do this by hand using data table, I would want something like:
mdata <- data[month == 'may', c('month', 'may.q1', 'may.q2'), with = F]
setnames(mdata, names(mdata), gsub('may\\.', '', names(mdata)))
I would want this repeated "by = month".
If I were to use the "plyr" package for a data frame, I would solve using the following approach:
require(plyr)
data <- data.frame(data)
mdata <- ddply(data, .(month), function(dfmo) {
dfmo <- dfmo[, c(1, grep(dfmo$month[1], names(dfmo)))]
names(dfmo) <- gsub(paste0(dfmo$month[1], '\\.'), '', names(dfmo))
return(dfmo)
})
Any help using a data.table method would be greatly appreciated, as my data are large. Thank you.
A different way to illustrate:
data[, .SD[,paste0(month,c(".q1",".q2")), with=FALSE], by=month]
month may.q1 may.q2
1: may yes econ
2: may yes econ
3: may yes econ
4: may yes econ
5: may yes econ
6: jun lunch foggy
7: jun lunch foggy
8: jun lunch foggy
9: jun lunch foggy
10: jun lunch foggy
11: jul oranges heavy rain
12: jul oranges heavy rain
13: jul oranges heavy rain
14: jul oranges heavy rain
15: jul oranges heavy rain
But note the column names come from the first group (can rename afterwards using setnames). And it may not be the most efficient if there are a great number of columns with only a few needed. In that case Arun's solution melting to long format should be faster.
Edit: Seems very inefficient on bigger data. Check out #MatthewDowle's answer for a really fast and neat solution.
Here's a solution using data.table.
dd <- melt.dt(data, id.var=c("month"))[month == gsub("\\..*$", "", ind)][,
ind := gsub("^.*\\.", "", ind)][, split(values, ind), by=list(month)]
The function melt.dt is a small function (still with room for improvement) I wrote to melt a data.table, similar to the melt function in reshape2 (copy/paste the function shown below before trying out the code above).
melt.dt <- function(DT, id.var) {
  stopifnot(inherits(DT, "data.table"))
  measure.var <- setdiff(names(DT), id.var)
  ind <- rep.int(measure.var, rep.int(nrow(DT), length(measure.var)))
  m1 <- lapply(c("list", id.var), as.name)
  m2 <- as.call(lapply(c("factor", "ind"), as.name))
  m3 <- as.call(lapply(c("c", measure.var), as.name))
  quoted <- as.call(c(m1, ind = m2, values = m3))
  DT[, eval(quoted)]
}
The idea: first melt the data.table with id.var = the month column. All the melted column names are then of the form month.question. By stripping ".question" from the melted column name and comparing it with the month column, we can drop every entry whose column does not belong to the row's month. Once that is done, we no longer need the "month." prefix in the melted column "ind", so we use gsub to remove it and keep just q1, q2, etc. After this, we reshape (cast) back: grouping by month and splitting the values column by ind (which holds either q1 or q2), you get two columns for every month, which are then stitched together to give the desired output.
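With a current version of data.table, the built-in melt() and dcast() can do the same thing without a custom helper. A possible sketch (the column names col, question and row_in_month are my own choices, and data here is the original table that still contains the month column):
library(data.table)
# melt all month-question columns to long form, keeping month as the id
long <- melt(data, id.vars = "month", variable.name = "col", variable.factor = FALSE)
# keep only rows whose column prefix matches the row's month, then extract the question part
long <- long[month == sub("\\..*$", "", col)][, question := sub("^.*\\.", "", col)]
# pair the q1/q2 values back up within each month and cast to wide
long[, row_in_month := rowid(month, question)]
dcast(long, month + row_in_month ~ question, value.var = "value")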
What about something like this
data <- data.table(
may.q1 = rep(c('yes', 'no', 'yes'), each = 5),
jun.q1 = rep(c('breakfast', 'lunch', 'dinner'), each = 5),
jul.q1 = rep(c('oranges', 'apples', 'oranges'), each = 5),
may.q2 = rep(c('econ', 'math', 'science'), each = 5),
jun.q2 = rep(c('sunny', 'foggy', 'cloudy'), each = 5),
jul.q2 = rep(c('no rain', 'light mist', 'heavy rain'), each = 5)
)
tmp <- reshape(data, direction = "long", varying = 1:6, sep = ".", timevar = "question")
str(tmp)
## Classes ‘data.table’ and 'data.frame': 30 obs. of 5 variables:
## $ question: chr "q1" "q1" "q1" "q1" ...
## $ may : chr "yes" "yes" "yes" "yes" ...
## $ jun : chr "breakfast" "breakfast" "breakfast" "breakfast" ...
## $ jul : chr "oranges" "oranges" "oranges" "oranges" ...
## $ id : int 1 2 3 4 5 6 7 8 9 10 ...
If you want to go further and melt this data again, you can use melt() from the reshape2 package
require(reshape2)
## remove the id column if you want (id is the last col so ncol(tmp))
res <- melt(tmp[,-ncol(tmp), with = FALSE], measure.vars = c("may", "jun", "jul"), value.name = "response", variable.name = "month")
str(res)
## 'data.frame': 90 obs. of 3 variables:
## $ question: chr "q1" "q1" "q1" "q1" ...
## $ month : Factor w/ 3 levels "may","jun","jul": 1 1 1 1 1 1 1 1 1 1 ...
## $ response: chr "yes" "yes" "yes" "yes" ...
