Converting Categorical Columns into Multiple Binary Columns in R

I am trying to convert a column that has categorical data ('A', 'B', or 'C') to 3 columns where 1,0,0 would be 'A'; 0,1,0 would represent 'B', etc.
I found this code online:
flags = data.frame(Reduce(cbind,
lapply(levels(d$purpose), function(x){(d$purpose == x)*1})
))
names(flags) = levels(d$purpose)
d = cbind(d, flags)
# Include the new columns as input variables
levelnames = paste(names(flags), collapse = " + ")
neuralnet(paste("output ~ ", levelnames), d)
(Source: the question "Converting categorical variables in R for ANN (neuralnet)".)
But I'm very new to R. Can anyone break down what this complicated looking code is doing?
edit:
Implementing @nongkrong's recommendations, I'm running into a problem:
CSV:
X1,X2,X3
A,D,Q
B,E,R
C,F,S
B,G,T
C,H,U
A,D,Q
R:
newData <- read.csv("new.csv")
newerData <- model.matrix(~ X1 + X2 + X3 -1, data=newData)
newerData
R Output:
X1A X1B X1C X2E X2F X2G X2H X3R X3S X3T X3U
1 1 0 0 0 0 0 0 0 0 0 0
2 0 1 0 1 0 0 0 1 0 0 0
3 0 0 1 0 1 0 0 0 1 0 0
4 0 1 0 0 0 1 0 0 0 1 0
5 0 0 1 0 0 0 1 0 0 0 1
6 1 0 0 0 0 0 0 0 0 0 0
It works great with 1 column, but is missing X2D and X3Q. Any ideas why?

@nongkrong is right--read ?formula and you'll see that most functions that accept formulas as input (e.g. lm, glm, etc.) will automatically convert categorical variables (stored as factors or characters) to dummies; you can force this on numeric variables that are not stored as factors by specifying as.factor(var) in your formula.
That said, I've encountered situations where it's convenient to have created these indicators by hand anyway--e.g., a data set with an ethnicity variable where <1% of the data fit in one or several of the ethnicity codes. There are other ways to deal with this (simply delete the minority-minority observations, e.g.), but I find that varies by situation.
So, I've annotated the code for you:
flags = data.frame(Reduce(cbind,
lapply(levels(d$purpose), function(x){(d$purpose == x)*1})
))
Lots going on in this first line, so let's go bit-by-bit:
d$purpose==x checks each entry of d$purpose for equality to x; the result will be TRUE or FALSE (or NA if there are missing values). Multiplying by 1 (*1) forces the output to be numeric (so TRUE becomes 1 and FALSE becomes 0).
lapply applies the function in its second argument to each element of its first argument--so for each element of levels(d$purpose) (i.e., each level of d$purpose), we output a vector of 0s and 1s, where the 1s correspond to the elements of d$purpose matching the given level. The output of lapply is a list (hence l in front of apply), with one list element corresponding to each of the levels of d$purpose.
We want to get this into our data.frame, so a list isn't very useful; Reduce is what we use to back out the information from the list to a data.frame form. Reduce(cbind,LIST) is the same as cbind(LIST[[1]],LIST[[2]],LIST[[3]],...)--convenient shorthand, especially when we don't know the length of LIST.
Wrapping this in data.frame converts the resulting matrix into an object of class data.frame.
#This line simply puts column names on each of the indicator variables
# Note that you can replace the RHS of this line with whatever
# naming convention you want for the levels--a common approach might
# be to specify paste0(levels(d$purpose),"_flag"), e.g.
names(flags) = levels(d$purpose)
#this line adds all the indicator variables to the original
# data.frame
d = cbind(d, flags)
#this creates a string of the form "level1 + level2 + ... + leveln"
levelnames = paste(names(flags), collapse = " + ")
#finally we create a formula of the form output~d1+d2+d3
# where each of the d* is a dummy for a level of the categorical variable
neuralnet(paste("output ~ ", levelnames), d)
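To see what the annotated steps produce, here is a minimal sketch on a toy data frame. The data frame d and its values are made up for illustration, and the neuralnet call itself is left out so that only base R is needed; the formula is built with as.formula, which is an addition to the original code.

```r
# Toy data (hypothetical values, for illustration only)
d <- data.frame(output  = c(1, 0, 1, 0),
                purpose = factor(c("car", "house", "car", "boat")))

# One 0/1 vector per level, bound together column-wise
flags <- data.frame(Reduce(cbind,
  lapply(levels(d$purpose), function(x) (d$purpose == x) * 1)))
names(flags) <- levels(d$purpose)
d <- cbind(d, flags)

# Right-hand side of the formula, built from the level names
levelnames <- paste(names(flags), collapse = " + ")
f <- as.formula(paste("output ~", levelnames))

print(d)
print(f)
```

Each row of flags contains exactly one 1, in the column named after that row's level.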
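Also note that something like this could have been done much more simply with the data.table package:
library(data.table)
library(neuralnet)
setDT(d)
# purpose is a column of d, so its levels are taken from d$purpose
l = levels(d$purpose)
# add one 0/1 integer column per level, by reference
d[ , (l) := lapply(l, function(x) as.integer(purpose == x))]
# build the formula explicitly and pass the data to neuralnet
neuralnet(as.formula(paste0("output ~ ", paste0(l, collapse = " + "))), data = d)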
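On the missing X2D and X3Q columns in the question's edit: with the intercept removed (-1), model.matrix keeps every level of the first factor only, and codes each later factor with the default treatment contrasts, which drop that factor's first level. A minimal sketch of one way to get all levels, assuming the three columns are read as factors, is to pass full (non-contrast) coding through contrasts.arg:

```r
# Same data as the CSV in the question, entered inline
newData <- data.frame(
  X1 = factor(c("A", "B", "C", "B", "C", "A")),
  X2 = factor(c("D", "E", "F", "G", "H", "D")),
  X3 = factor(c("Q", "R", "S", "T", "U", "Q")))

# Default contrasts: only the first factor keeps all of its levels
dropped <- model.matrix(~ X1 + X2 + X3 - 1, data = newData)

# Ask for full indicator coding on every factor instead
full <- model.matrix(~ X1 + X2 + X3 - 1, data = newData,
  contrasts.arg = lapply(newData, contrasts, contrasts = FALSE))

colnames(dropped)
colnames(full)
```

Note that the full coding makes the indicator columns linearly dependent, which matters for regression-type models such as lm or glm.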

Related

Is there a best practice in producing a series of columns filled with values based on logical checks?

Consider the following data:
mydf<-data.frame(ID=c('a','b','c','d','e'),
home=c(0,2,1,0,0),
aroundtown=c(0,3,0,1,2),
outinspace=c(5,0,0,2,1))
My goal is to produce columns _any and _exclusive for each of home, aroundtown, and outinspace. The _any columns should be filled with the original values associated with the ID and the variable. The _exclusive columns should hold the original value only when no other column for that ID has a value other than zero, and 0 otherwise.
I am able to get this done with the following (using dplyr and tidyr):
library(dplyr)
library(tidyr)
md2<-mydf%>%
pivot_longer(cols=-ID)%>%
group_by(ID)%>%
mutate(tot=sum(value))%>%
mutate(Exclusive=ifelse(value==tot,"Exclusive",""))%>%
mutate(freq=ifelse(Exclusive=="",1,2))%>%
select(-Exclusive)%>%
uncount(freq)%>%
group_by(ID,name)%>%
mutate(exclusive=ifelse(cumsum(value)>value,"exclusive","Any"))%>%
select(-tot)%>%
pivot_wider(names_from = c(name,exclusive),values_from = value)%>%
replace(is.na(.),0)
I can't help but wonder if there is a better, or preferred-by-some, means of achieving this.
The ideal solution would:
Avoid explicitly naming home, aroundtown, and outinspace
Have the flexibility to allow any number of variables beyond ID
Allow for multiple logical checks similar in nature (for instance _mostly comes to mind), resulting in columns for each check.

Since several of your requirements are about flexibility, I would prefer a solution that is readable and adaptable over one that is computationally optimized.
I decided to separate each of your logical categories (any, exclusive) into different functions. This way, you can add whatever criteria you want in the future by creating a new function (for instance is_Mostly).
is_Any <- function(dfrow){
return(dfrow)
}
is_Exclusive <- function(dfrow){
  nonzeros <- which(dfrow > 0)
  # exactly one nonzero entry: all other entries are already 0, so keep the row as is
  if (length(nonzeros) == 1){
    return(dfrow)
  }
  # otherwise no column is exclusive
  return(rep(0, length(dfrow)))
}
is_Mostly <- function(dfrow){
dfrow[!dfrow == max(dfrow)] <- 0
return(dfrow)
}
The master function will be called using the numerical columns of your choice (without naming them) and with the selected conditions.
add_new_columns <- function(df, columns, conditions){
  for (condition in conditions){
    # look up the function named is_<condition>
    eval_func <- match.fun(paste0("is_", condition))
    new_cols <- paste0(columns, "_", condition)
    df[, new_cols] <- t(apply(df[, columns], 1, eval_func))
  }
  return(df)
}
my_columns <- colnames(mydf)[-1]
my_conditions <- c("Any", "Exclusive")
mydf2 <- add_new_columns(mydf, my_columns, my_conditions)
mydf2
# ID home aroundtown outinspace home_Any aroundtown_Any outinspace_Any home_Exclusive aroundtown_Exclusive outinspace_Exclusive
#1 a 0 0 5 0 0 5 0 0 5
#2 b 2 3 0 2 3 0 0 0 0
#3 c 1 0 0 1 0 0 1 0 0
#4 d 0 1 2 0 1 2 0 0 0
#5 e 0 2 1 0 2 1 0 0 0
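As a hedged alternative to the row-wise apply above, the same Exclusive and Mostly checks can be sketched with matrix arithmetic. This is not the answer's own method; it assumes ID is the first column and that all remaining columns are numeric, and the names vals, exclusive and mostly are made up for the example.

```r
mydf <- data.frame(ID = c("a", "b", "c", "d", "e"),
                   home = c(0, 2, 1, 0, 0),
                   aroundtown = c(0, 3, 0, 1, 2),
                   outinspace = c(5, 0, 0, 2, 1))

vals <- as.matrix(mydf[-1])          # every column except ID

# Keep a row's values only if exactly one entry in that row is nonzero
exclusive <- vals * (rowSums(vals != 0) == 1)
# Keep only the entries equal to their row's maximum
mostly <- vals * (vals == apply(vals, 1, max))

colnames(exclusive) <- paste0(colnames(vals), "_Exclusive")
colnames(mostly)    <- paste0(colnames(vals), "_Mostly")
result <- cbind(mydf, exclusive, mostly)
result
```

A further check then becomes one more line that builds a matrix of the same shape as vals.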

How can I use rowSums with conditions to return binary value?

Say I have a data frame with a column for summed data. What is the most efficient way to return a binary 0 or 1 in a new column if any value in columns a, b, or c are NOT zero? rowSums is fine for a total, but I also need a simple indicator if anything differs from a value.
tt <- data.frame(a=c(0,-5,0,0), b=c(0,5,10,0), c=c(-5,0,0,0))
tt[, ncol(tt)+1] <- rowSums(tt)
This yields:
> tt
   a  b  c V4
1  0  0 -5 -5
2 -5  5  0  0
3  0 10  0 10
4  0  0  0  0
The fourth column is a simple sum of the data in the first three columns. How can I add a fifth column that returns a binary 1/0 value if any value in the first three columns differs from a given criterion?
For example, is there a simple way to return a 1 if any of a, b, or c are NOT 0?

as.numeric(rowSums(tt != 0) > 0)
# [1] 1 1 1 0
tt != 0 gives us a logical matrix telling us where there are values not equal to zero in tt.
When the sum of each row is greater than zero (rowSums(tt != 0) > 0), we know that at least one value in that row is not zero.
Then we convert the result to numeric (as.numeric(.)) and we've got a binary vector result.
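One detail worth spelling out: once the sum column has been added to tt, the check should probably look only at a, b and c. A minimal sketch, in which the column names total, any_nonzero and any_above_4 and the "> 4" criterion are made up for illustration:

```r
tt <- data.frame(a = c(0, -5, 0, 0), b = c(0, 5, 10, 0), c = c(-5, 0, 0, 0))
tt$total <- rowSums(tt)

check_cols <- c("a", "b", "c")   # columns the test should look at

# 1 if any of a, b, c is not zero
tt$any_nonzero <- as.numeric(rowSums(tt[check_cols] != 0) > 0)
# same pattern with a different criterion: 1 if any of a, b, c exceeds 4
tt$any_above_4 <- as.numeric(rowSums(tt[check_cols] > 4) > 0)
tt
```

Only the comparison inside rowSums changes from one criterion to the next.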

We can use Reduce
+(Reduce(`|`, lapply(tt, `!=`, 0)))
#[1] 1 1 1 0

One could also use the good old apply loop:
+apply(tt != 0, 1, any)
#[1] 1 1 1 0
The argument tt != 0 is a logical matrix with entries stating whether the value is different from zero. Then apply() with margin 1 is used for a row-wise operation to check if any of the entries is true. The prefix + converts the logical output into 0 or 1. It is shorthand for as.integer() (the result is an integer vector, which prints the same way as the numeric one).

R: find difference if at least one of the two values are zero

I would like to calculate the abs(difference) of rows (current with preceding row), if at least one of the values is zero. If both values are non-zeros, it should return 0.
Example: the column 'A' shows the initial data, 'Diff' shows what I want to get:
df <- data.frame(A=c(0,0,1,2,3,4,0,0),Diff=c(0,0,1,0,0,0,4,0))
Thanks

Here's a way using the dplyr package and the lag function:
library(dplyr)
df1 <- data.frame(A=c(0,0,1,2,3,4,0,0),Diff=c(0,0,1,0,0,0,4,0))
df1 %>%
mutate(lag_A = lag(A),
Diff2 = abs(A - lag_A)) %>%
mutate(Diff2 = ifelse(is.na(lag_A), 0,
ifelse(A == 0 | lag_A == 0, Diff2, 0)))
A Diff lag_A Diff2
1 0 0 NA 0
2 0 0 0 0
3 1 1 0 1
4 2 0 1 0
5 3 0 2 0
6 4 0 3 0
7 0 4 4 4
8 0 0 0 0
You could then use the select function to grab the columns you want. I left all of the columns in the table to illustrate the solution.

Assumption
I'm going to assume the first value of Diff is not the result of the operation you want, because the first value of A has no preceding value. I therefore chose to repeat A[1] in the answer, but you can put any value there instead.
Edit: I also assumed you wanted a solution using only base R; otherwise, bouncyball's solution works like a charm.
Solution
A = df$A # the input column from the question
Diff = rep(0,(length(A)-1))
Diff[which(mapply(xor,A[-1],A[-length(A)]))] = abs(A[-1][which(mapply(xor,A[-1],A[-length(A)]))]-A[-length(A)][which(mapply(xor,A[-1],A[-length(A)]))])
Diff = c(A[1],Diff)
You can improve upon this by making it a named function like this:
foo = function(A){
Diff = rep(0,(length(A)-1))
Diff[which(mapply(xor,A[-1],A[-length(A)]))] = abs(A[-1][which(mapply(xor,A[-1],A[-length(A)]))]-A[-length(A)][which(mapply(xor,A[-1],A[-length(A)]))])
Diff = c(A[1],Diff)
Diff
}
Explanation
Instead of operating on A directly, we use two shifted copies:
A[-1], which yields 0 1 2 3 4 0 0 (equal to A but without the first element)
and A[-length(A)], which yields 0 0 1 2 3 4 0 (equal to A, but without the last element).
Subtracting one from the other gives the difference between each element and its predecessor, but we still need to know where these subtractions should be kept.
By combining the functions xor, mapply and which (all of them part of base), like this:
which(mapply(xor,A[-1],A[-length(A)]))
we find 2 6, the positions in A[-1] and A[-length(A)] where exactly one of the two values is zero, i.e. where a subtraction should happen. (When both values are zero the difference is 0 anyway, so those positions need no special handling.)
Now we create a vector of zeros called Diff, the same length as our shifted vectors, using:
Diff = rep(0,length(A)-1)
Then we change only the positions that should not contain 0s using:
Diff[which(mapply(xor,A[-1],A[-length(A)]))] = abs(A[-1][which(mapply(xor,A[-1],A[-length(A)]))]-A[-length(A)][which(mapply(xor,A[-1],A[-length(A)]))])
And finally we put the first element back in front using Diff = c(A[1],Diff).
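The same idea can be sketched more compactly with plain vectorized operations. This is a variant rather than the code above: the names prev, curr and step are made up, and the first element is set to 0 here (instead of A[1]) as an assumption.

```r
A <- c(0, 0, 1, 2, 3, 4, 0, 0)

prev <- A[-length(A)]   # each element's predecessor
curr <- A[-1]           # the elements that have a predecessor

step <- abs(curr - prev)
# when both neighbours are nonzero the result should be 0
step[prev != 0 & curr != 0] <- 0

Diff <- c(0, step)      # assumed value for the first row
Diff
# [1] 0 0 1 0 0 0 4 0
```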

Randomly drop a column selected from a group, excluding one

I have the following data frame, which is going to be used as an input in a logit regression:
my_frame<-data.frame(y=c(1,0,1),A=c(0,1,1),B=c(1,0,0),C=c(0,0,0),t=c(1,1,1),x=c(1,0,0),z=c(1,0,1))
Knowing that the dummy variables A, B and C are connected through a linear equation (A+B+C=1), I need to drop one of the three before proceeding.
y A B C t x z
1 0 1 0 1 1 1
0 1 0 0 1 0 0
1 1 0 0 1 0 1
Now, here is the difficult part. I want to randomly exclude one column from the group made up of A, B, C and D, but not the one that has 1 as its value in the last row of the data frame.
In my example, I want one of B and C to be excluded randomly.
Column D is not present, because in this particular data frame it would always be valued 0, but it is still part of the same group of variables.

I don't really understand what you mean by your last sentence about column D, but anyway, you could try this:
my_frame<-data.frame(y=c(1,0,1),A=c(0,1,1),B=c(1,0,0),C=c(0,0,0),t=c(1,1,1),x=c(1,0,0),z=c(1,0,1))
allRelevantCols <- c("A", "B", "C")
# Get all columns, which can be excluded
allColsToExclude <- allRelevantCols[which(my_frame[nrow(my_frame), allRelevantCols] == 0)]
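n_runs <- 3 # how often you would like to run this (placeholder value)
reduced_frames <- vector("list", n_runs)
for (i in seq_len(n_runs)) {
  colsToExclude <- sample(allColsToExclude, 1)
  # keep every column except the randomly chosen one
  reduced_frames[[i]] <- my_frame[, !(colnames(my_frame) %in% colsToExclude)]
}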

need to count number of specific transitions in a vector in R

I am programming a sampler in R, which basically is a big for loop, and in every iteration I have to count the number of transitions in a vector. I have a vector called k, which contains zeros and ones, with 1000 entries in the vector.
I have used the following, horribly slow, code:
#we determine the number of transitions n00,n01,n10,n11
n00=n01=n10=n11=0 #reset number of transitions between states from last time
for(j in 1:(1000-1)){
if(k[j+1]==1 && k[j]==0) {n01<-n01+1}
else { if(k[j+1]==1 && k[j]==1) {n11<-n11+1}
else { if(k[j+1]==0 && k[j]==1) {n10<-n10+1}
else{n00<-n00+1}
}
}
}
So for every time the loop goes, the variables n00,n01,n10,n11 counts the transitions in the vector. For example, n00 counts number of times a 0 is followed by another 0. And so on...
This is very slow, and I am very new to R, so I am kind of desperate here. I do not understand how to use grep, if that even is possible.
Thank you for your help

Try something like this:
x <- sample(0:1,20,replace = TRUE)
table(paste0(head(x,-1),tail(x,-1)))
00 01 10 11
4 3 4 8
The head and tail return portions of the vector x: all but the last element, and then all but the first element. This means that the corresponding elements are the consecutive pairs from x.
Then paste0 converts both vectors to character and pastes them together element by element: the first element of one with the first element of the other, the second with the second, etc. The result is a character vector with elements like "00", "01", etc. Then table just counts up how many of each there are.
You can assign the result to a new variable like so:
tab <- table(paste0(head(x,-1),tail(x,-1)))
Experiment yourself with each piece of the code to see how it works. Run just head(x,-1), etc. to see what each piece does.
To address the comment below, to ensure that all types appear with counts when you run table, convert it to a factor first:
x1 <- factor(paste0(head(x,-1),tail(x,-1)),levels = c('00','01','10','11'))
table(x1)
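A related sketch, not taken from the answer itself: passing the two shifted vectors to table as separate factors gives a 2x2 transition matrix, from which n00, n01, n10 and n11 can be read by name. The example vector k and the names states and trans are made up for illustration.

```r
k <- c(0, 0, 1, 1, 0, 1, 0, 0)   # made-up example vector
states <- c(0, 1)

# rows: state at position j, columns: state at position j + 1
trans <- table(from = factor(head(k, -1), levels = states),
               to   = factor(tail(k, -1), levels = states))
trans

n00 <- trans["0", "0"]
n01 <- trans["0", "1"]
n10 <- trans["1", "0"]
n11 <- trans["1", "1"]
```

Fixing the factor levels ensures that all four cells exist even when a transition type never occurs.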

If we don't care about distinguishing the n00 and n11 cases, then this becomes much simpler:
x <- sample(0:1,20,replace = TRUE)
# [1] 0 0 0 0 0 1 0 1 1 0 0 0 1 0 0 1 0 0 0 0
table(diff(x))
# -1 0 1
# 4 11 4
Since the question says that you're primarily interested in the transitions, this may be acceptable, otherwise one of the other answers would be preferable.

x <- sample(0:1, 10, replace = TRUE)
# my sample: [1] 0 0 0 0 0 1 0 1 1 0
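rl <- rle(x)
zero_to_zero <- sum(rl$lengths[rl$values == 0 & rl$lengths > 1] - 1)
one_to_one <- sum(rl$lengths[rl$values == 1 & rl$lengths > 1] - 1)
# diff of consecutive run values: +1 marks a 0 -> 1 switch, -1 marks a 1 -> 0 switch
zero_to_one <- sum(diff(rl$values) == 1)
one_to_zero <- sum(diff(rl$values) == -1)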
x
# [1] 0 0 0 0 0 1 0 1 1 0
zero_to_zero
# [1] 4
one_to_one
# [1] 1
zero_to_one
# [1] 2
one_to_zero
# [1] 2
@joran's answer is far cleaner, though. Still, I thought I might as well finish the stroll I started down (the dirty) trail, and share the result.
