Converting a categorical variable to multiple binary variables [duplicate] - r

I wish to convert part of my data to binary wide format.
This is my input:
mydf <- data.frame( transaction =c (1,0,1,1,1,0,0), quality = c("NEW", "OLD","OLD", "OLD","OLD","NEW","NEW"), brand = c(1,2,3,1,2,2,1))
  transaction quality brand
1           1     NEW     1
2           0     OLD     2
3           1     OLD     3
4           1     OLD     1
5           1     OLD     2
6           0     NEW     2
7           0     NEW     1
and I wish to convert the brand column to wide format so that I have the following output:
  transaction quality brand_1 brand_2 brand_3
1           1     NEW       1       0       0
2           0     OLD       0       1       0
3           1     OLD       0       0       1
4           1     OLD       1       0       0
5           1     OLD       0       1       0
6           0     NEW       0       1       0
7           0     NEW       1       0       0
I tried different approaches, such as the model.matrix function, but couldn't get my desired output.

For every row, we select its corresponding column, which needs to be changed to 1. We generate the row/column combinations using seq (to select rows) and paste0 (to select columns). For each row/column combination we use mapply to set the corresponding value to 1 using the not-so-famous global assignment operator (<<-).
#Generate new columns to be added
cols <- paste0("brand-", 1:3)
#Initialise the columns to 0
mydf[cols] <- 0
mapply(function(x, y) mydf[x, y] <<- 1, seq(nrow(mydf)),
paste0("brand-", mydf$brand))
mydf
# transaction quality brand brand-1 brand-2 brand-3
#1 1 NEW 1 1 0 0
#2 0 OLD 2 0 1 0
#3 1 OLD 3 0 0 1
#4 1 OLD 1 1 0 0
#5 1 OLD 2 0 1 0
#6 0 NEW 2 0 1 0
#7 0 NEW 1 1 0 0
We can remove the original brand column if we no longer require it using
mydf$brand <- NULL
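The question mentions model.matrix, so here is a minimal sketch of how that route could give the same columns. It assumes the brand_ prefix from the desired output; the names dummies and result are only illustrative.

```r
# Rebuild the example data from the question
mydf <- data.frame(
  transaction = c(1, 0, 1, 1, 1, 0, 0),
  quality = c("NEW", "OLD", "OLD", "OLD", "OLD", "NEW", "NEW"),
  brand = c(1, 2, 3, 1, 2, 2, 1)
)

# Treat brand as a factor and drop the intercept (- 1),
# so that every level gets its own 0/1 indicator column
dummies <- model.matrix(~ factor(brand) - 1, data = mydf)

# Replace the default names ("factor(brand)1", ...) with brand_1, brand_2, ...
colnames(dummies) <- paste0("brand_", levels(factor(mydf$brand)))

# Keep the other columns and append the indicator columns
result <- cbind(mydf[c("transaction", "quality")], dummies)
result
```

Dropping the intercept matters here: with the default intercept, model.matrix would omit the first level as the reference category.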

For a tidy approach
library(dplyr)
library(tidyr)
library(tibble)
mydf %>%
rownames_to_column() %>%
group_by(rowname, transaction, quality, brand) %>%
summarise(count = n()) %>%
spread(brand, count, sep = "-", fill = 0) %>%
ungroup() %>%
select(-rowname)
# # A tibble: 7 x 5
# transaction quality `brand-1` `brand-2` `brand-3`
# * <dbl> <fctr> <dbl> <dbl> <dbl>
# 1 1 NEW 1 0 0
# 2 0 OLD 0 1 0
# 3 1 OLD 0 0 1
# 4 1 OLD 1 0 0
# 5 1 OLD 0 1 0
# 6 0 NEW 0 1 0
# 7 0 NEW 1 0 0
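In newer versions of tidyr, spread is superseded by pivot_wider. A sketch of the same idea with pivot_wider, assuming tidyr 1.1.0 or later (for names_sort); the helper columns row and value and the name wide are made up for this example.

```r
library(dplyr)
library(tidyr)

mydf <- data.frame(
  transaction = c(1, 0, 1, 1, 1, 0, 0),
  quality = c("NEW", "OLD", "OLD", "OLD", "OLD", "NEW", "NEW"),
  brand = c(1, 2, 3, 1, 2, 2, 1)
)

wide <- mydf %>%
  # row keeps otherwise identical rows apart; value is what fills the new cells
  mutate(row = row_number(), value = 1) %>%
  pivot_wider(
    names_from = brand,
    values_from = value,
    names_prefix = "brand_",
    names_sort = TRUE,   # brand_1, brand_2, brand_3 in that order
    values_fill = 0      # brands not present in a row become 0
  ) %>%
  select(-row)
wide
```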

Related

Mark IDs that have a specific value across columns

I am trying to create a variable 'check' with values 1/0. These should be assigned based on whether, for each ID, every column from V1 to V3 contains at least one value equal to 1.
DF <- data.frame (ID= c(1,1,1,2,3,3,4,5,5,6), V1= c(1,0,0,1,1,0,0,1,0,0),
V2= c(1,1,0,0,1,0,1,1,0,0), V3= c(0,1,0,1,0,0,0,0,1,0))
This is the code I am using, but group_by doesn't seem to work. It goes across columns and marks as 1 every row that has at least one value of 1, but not by ID.
DF %>% dplyr::group_by(ID) %>%
dplyr::mutate(Check= case_when(if_any('V1':'V3',~.x!=0)~1,TRUE ~0)) %>%
dplyr::ungroup()
So the output I am looking for is this one:
ID V1 V2 V3 check
 1  1  1  0     1
 1  0  1  1     1
 1  0  0  0     1
 2  1  0  1     0
 3  1  1  0     0
 3  0  0  0     0
 4  0  1  0     0
 5  1  1  0     1
 5  0  0  1     1
 6  0  0  0     0
Could you help?
Many thanks!
Edit: apologies, I have noticed a mistake in the output, it should be fine now.
Please check the code below. These are the steps I followed:
After grouping by the ID column, derive new columns in which every 0 is replaced with NA.
Then fill the previous values down to the other rows, so a 1 is carried to the later rows within the group.
Then sum the three new variables; if the sum is 3, derive the check variable and set it to 1.
Carry that 1 to the other rows within the group, otherwise set check to zero.
library(dplyr)
library(tidyr)
DF %>% group_by(ID) %>%
mutate(across(starts_with('V'), ~ ifelse(.x==0, NA, .x), .names = 'new_{col}')) %>%
fill(starts_with('new')) %>%
mutate(check=ifelse(rowSums(across(starts_with('new')))==3,1,0)) %>%
fill(check, .direction = 'downup') %>% mutate(check=replace_na(check,0)) %>%
select(-starts_with('new'))
Created on 2023-02-03 with reprex v2.0.2
# A tibble: 10 × 5
# Groups: ID [6]
ID V1 V2 V3 check
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 1 0 1
2 1 0 1 1 1
3 1 0 0 0 1
4 2 1 0 1 0
5 3 1 1 0 0
6 3 0 0 0 0
7 4 0 1 0 0
8 5 1 1 0 1
9 5 0 0 1 1
10 6 0 0 0 0
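Judging from the desired output, check is 1 only when each of V1, V2 and V3 contains at least one 1 within the ID. Under that reading (an inference from the output, not something stated in the question), a shorter sketch tests each column with any() inside the group; result is just an illustrative name.

```r
library(dplyr)

DF <- data.frame(
  ID = c(1, 1, 1, 2, 3, 3, 4, 5, 5, 6),
  V1 = c(1, 0, 0, 1, 1, 0, 0, 1, 0, 0),
  V2 = c(1, 1, 0, 0, 1, 0, 1, 1, 0, 0),
  V3 = c(0, 1, 0, 1, 0, 0, 0, 0, 1, 0)
)

result <- DF %>%
  group_by(ID) %>%
  # any() returns one value per group, which mutate recycles to every row of the ID
  mutate(check = as.integer(any(V1 == 1) & any(V2 == 1) & any(V3 == 1))) %>%
  ungroup()
result
```

Because any() looks at the whole group at once, this does not depend on the order of the rows within an ID.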

Mutate multiple columns based on conditional and column name

I have a dataframe with the following structure (See example). The dots after OperatedIn2007 column signify multiple columns with same name, changing only the year (e.g OperatedIn2008, OperatedIn2009, etc.).
I wish to do the following procedure:
If the group is 1, then add one in all columns whose names start with OperatedIn.
The expected result should be similar to the one presented in the desired output.
A nonscalable solution would be to use:
df <- df %>%
mutate(OperatedIn2006 = ifelse(group == 1, 1, 0)) %>%
[...]
I imagine there is some slick solution using dplyr or data.table, but I could not think of it myself.
Example
ID group OperatedIn2006 OperatedIn2007 ...
1 1 0 0
2 2 0 0
3 3 0 0
4 4 0 0
5 1 0 0
6 2 0 0
Desired output
ID group OperatedIn2006 OperatedIn2007 ...
1 1 1 1
2 2 0 0
3 3 0 0
4 4 0 0
5 1 1 1
6 2 0 0
We could use across with an ifelse statement:
library(dplyr)
df %>%
mutate(across(-c(ID, group), ~ifelse(group==1, 1, .)))
  ID group OperatedIn2006 OperatedIn2007
1  1     1              1              1
2  2     2              0              0
3  3     3              0              0
4  4     4              0              0
5  5     1              1              1
6  6     2              0              0
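The question gives no reproducible data, so the sketch below builds a small df from the example table (only the two year columns shown there) and selects the columns by name with starts_with("OperatedIn") instead of excluding ID and group. This is a variation on the answer above, not a replacement for it.

```r
library(dplyr)

# Example data reconstructed from the question's table (assumed, two year columns only)
df <- data.frame(
  ID = 1:6,
  group = c(1, 2, 3, 4, 1, 2),
  OperatedIn2006 = 0,
  OperatedIn2007 = 0
)

result <- df %>%
  # set every OperatedIn* column to 1 where group is 1, leave other rows unchanged
  mutate(across(starts_with("OperatedIn"), ~ ifelse(group == 1, 1, .x)))
result
```

Selecting by prefix keeps the call safe if the data frame later gains other columns that should not be touched. If the intent is literally to add one rather than set the value to 1, `.x + 1` could replace the `1` in the ifelse; with all-zero input both give the desired output.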

is there an r code that can select based on 3 columns and returns a value based on the options

I want to create a new variable based on roof, wall and floor: if any of them is 1, the new variable is assigned 1, and 0 otherwise.
roof<-c(1,1,1,0,1,0)
wall<-c(0,1,1,0,0,0)
floor<-c(1,1,1,0,1,0)
data<-data.frame(roof,wall,floor)
data
data$code<-c(1,1,1,0,1,0)
You can use pmap:
library(tidyverse)
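roof<-c(1,1,1,0,1,0)
wall<-c(0,1,1,0,0,0)
floor<-c(1,1,1,0,1,0)
data<-data.frame(roof,wall,floor)
# expected result from the question, kept for comparison
data$code<-c(1,1,1,0,1,0)
# for each row, convert the three columns to logical and test whether any is TRUE
data %>%
mutate(code_want = pmap_int(data %>%
select(roof:floor) %>%
mutate_all(as.logical), any))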
# roof wall floor code code_want
#1 1 0 1 1 1
#2 1 1 1 1 1
#3 1 1 1 1 1
#4 0 0 0 0 0
#5 1 0 1 1 1
#6 0 0 0 0 0
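For plain 0/1 columns, a row-wise function is not strictly needed. A base R sketch of two vectorised alternatives; the column names code_rowsums and code_pmax are made up for this example, and the pmax shortcut assumes the columns contain only 0 and 1.

```r
roof  <- c(1, 1, 1, 0, 1, 0)
wall  <- c(0, 1, 1, 0, 0, 0)
floor <- c(1, 1, 1, 0, 1, 0)
data  <- data.frame(roof, wall, floor)

# Count how many of the three columns equal 1 in each row, then test for at least one
data$code_rowsums <- as.integer(rowSums(data[c("roof", "wall", "floor")] == 1) > 0)

# With strictly 0/1 values, the row-wise maximum gives the same answer
data$code_pmax <- pmax(data$roof, data$wall, data$floor)
data
```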

Only Use The First Match For Every N Rows

I have a data.frame that looks like this.
Date Number
1 1
2 0
3 1
4 0
5 0
6 1
7 0
8 0
9 1
I would like to create a new column that puts a 1 in the column if it is the first 1 of every 3 rows. Otherwise put a 0. For example, this is how I would like the new data.frame to look
Date Number New
1 1 1
2 0 0
3 1 0
4 0 0
5 0 0
6 1 1
7 0 0
8 0 0
9 1 1
Every three rows we find the first 1 and populate the column otherwise we place a 0. Thank you.
Hmm, at first glance I thought akrun's answer provided the solution. However, it is not exactly what I am looking for. Here is what @akrun's solution provides.
df1 = data.frame(Number = c(1,0,1,0,1,1,1,0,1,0,0,0))
head(df1,9)
Number
1 1
2 0
3 1
4 0
5 1
6 1
7 1
8 0
9 1
Attempt at solution:
df1 %>%
group_by(grp = as.integer(gl(n(), 3, n()))) %>%
mutate(New = +(Number == row_number()))
Number grp New
<dbl> <int> <int>
1 1 1 1
2 0 1 0
3 1 1 0
4 0 2 0
5 1 2 0 #should be a 1
6 1 2 0
7 1 3 1
8 0 3 0
9 1 3 0
As you can see, the code misses the 1 on row 5. I am looking for the first 1 in every chunk. Everything else should be 0.
Sorry if I was unclear, akrun.
Edit: akrun's new answer is exactly what I am looking for. Thank you very much.
Here is an option that creates a grouping column with gl and then compares row_number() with the index of the first matched 1 using ==. Here, match returns only the index of the first match.
library(dplyr)
df1 %>%
group_by(grp = as.integer(gl(n(), 3, n()))) %>%
mutate(New = +(row_number() == match(1, Number, nomatch = 0)))
# A tibble: 12 x 3
# Groups: grp [4]
# Number grp New
# <dbl> <int> <int>
# 1 1 1 1
# 2 0 1 0
# 3 1 1 0
# 4 0 2 0
# 5 1 2 1
# 6 1 2 0
# 7 1 3 1
# 8 0 3 0
# 9 1 3 0
#10 0 4 0
#11 0 4 0
#12 0 4 0
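The same chunk logic can be sketched in base R with ave, using ceiling(row / 3) as the chunk id (as in another answer on this page). The names grp and first_one are illustrative only.

```r
df1 <- data.frame(Number = c(1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0))

# Chunk id: rows 1-3 -> 1, rows 4-6 -> 2, and so on
grp <- ceiling(seq_len(nrow(df1)) / 3)

# Within a chunk, flag only the position of the first 1;
# nomatch = 0 means a chunk without any 1 gets all zeros
first_one <- function(x) as.integer(seq_along(x) == match(1, x, nomatch = 0))

df1$New <- ave(df1$Number, grp, FUN = first_one)
df1
```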
Looking at the logic, perhaps you want to check if Number == 1 and that the prior 2 values were both 0. If that is not correct please let me know.
library(dplyr)
df %>%
mutate(New = ifelse(Number == 1 & lag(Number, n = 1L, default = 0) == 0 & lag(Number, n = 2L, default = 0) == 0, 1, 0))
Output
Date Number New
1 1 1 1
2 2 0 0
3 3 1 0
4 4 0 0
5 5 0 0
6 6 1 1
7 7 0 0
8 8 0 0
9 9 1 1
You can replace Number value to 0 except for the 1st occurrence of 1 in each 3 rows.
library(dplyr)
df %>%
group_by(gr = ceiling(row_number()/3)) %>%
mutate(New = replace(Number, -which.max(Number), 0)) %>%
#Or to be safe and specific use
#mutate(New = replace(Number, -which(Number == 1)[1], 0)) %>%
ungroup() %>% select(-gr)
# A tibble: 9 x 3
# Date Number New
# <int> <int> <int>
#1 1 1 1
#2 2 0 0
#3 3 1 0
#4 4 0 0
#5 5 0 0
#6 6 1 1
#7 7 0 0
#8 8 0 0
#9 9 1 1

convert long to wide format with two factors in R [duplicate]

I have the following data set:
sample.data <- data.frame(Step = c(1,2,3,4,1,2,1,2,3,1,1),
Case = c(1,1,1,1,2,2,3,3,3,4,5),
Decision = c("Referred","Referred","Referred","Approved","Referred","Declined","Referred","Referred","Declined","Approved","Declined"),
Reason = c("Docs","Slip","Docs","","Docs","","Slip","Docs","","",""))
sample.data
Step Case Decision Reason
1 1 1 Referred Docs
2 2 1 Referred Slip
3 3 1 Referred Docs
4 4 1 Approved
5 1 2 Referred Docs
6 2 2 Declined
7 1 3 Referred Slip
8 2 3 Referred Docs
9 3 3 Declined
10 1 4 Approved
11 1 5 Declined
Is it possible in R to translate this into a wide table format, with the decisions on the header, and the value of each cell being the count of the occurrence, for example:
Case Referred Approved Declined Docs Slip
1 3 1 0 2 0
2 1 0 1 1 0
3 2 0 1 1 1
4 0 1 0 0 0
5 0 0 1 0 0
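library(reshape2)
# count the rows per Case for each Decision/Reason combination
df1 <- dcast(sample.data, Case~Decision+Reason, fun.aggregate = length)
# columns 2:5 are Approved_, Declined_, Referred_Docs and Referred_Slip
names(df1)[2:5] <- c("Approved", "Declined", "Docs", "Slip")
# every Referred row has a reason of either Docs or Slip
df1$Referred <- df1$Docs + df1$Slip
df1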
# Case Approved Declined Docs Slip Referred
# 1: 1 1 0 2 1 3
# 2: 2 0 1 1 0 1
# 3: 3 0 1 1 1 2
# 4: 4 1 0 0 0 0
# 5: 5 0 1 0 0 0
Using:
library(reshape2)
tmp <- melt(sample.data, id.var=c("Step", "Case"))
tmp <- tmp[tmp$value!="",]
dcast(tmp, Case ~ value, value.var="Case", length)
you get:
Case Approved Declined Docs Referred Slip
1: 1 1 0 2 3 1
2: 2 0 1 1 1 0
3: 3 0 1 1 2 1
4: 4 1 0 0 0 0
5: 5 0 1 0 0 0
Using the data.table-package, you can use the same melt and dcast functionality as with reshape2, but you don't need a temporary dataframe:
library(data.table)
dcast(melt(setDT(sample.data), id.var=c("Step", "Case"))[value!=""],
Case ~ value, value.var="Case", length)
which will give you the same result.
We can use gather/spread from tidyr
library(tidyr)
library(dplyr)
gather(sample.data, Var, Val, 3:4) %>%
group_by(Case, Val) %>%
summarise(n=n()) %>%
filter(Val!='') %>%
spread(Val, n, fill=0)
# Case Approved Declined Docs Referred Slip
# (dbl) (dbl) (dbl) (dbl) (dbl) (dbl)
#1 1 1 0 2 3 1
#2 2 0 1 1 1 0
#3 3 0 1 1 2 1
#4 4 1 0 0 0 0
#5 5 0 1 0 0 0
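gather and spread are superseded in current tidyr. A sketch of the same reshaping with pivot_longer and pivot_wider, assuming tidyr 1.0 or later; Var, Val and wide_counts are just names chosen for this example.

```r
library(dplyr)
library(tidyr)

sample.data <- data.frame(
  Step = c(1, 2, 3, 4, 1, 2, 1, 2, 3, 1, 1),
  Case = c(1, 1, 1, 1, 2, 2, 3, 3, 3, 4, 5),
  Decision = c("Referred", "Referred", "Referred", "Approved", "Referred", "Declined",
               "Referred", "Referred", "Declined", "Approved", "Declined"),
  Reason = c("Docs", "Slip", "Docs", "", "Docs", "", "Slip", "Docs", "", "", ""),
  stringsAsFactors = FALSE
)

wide_counts <- sample.data %>%
  # stack Decision and Reason into a single value column
  pivot_longer(c(Decision, Reason), names_to = "Var", values_to = "Val") %>%
  # drop the empty reasons
  filter(Val != "") %>%
  # count occurrences of each value per Case
  count(Case, Val) %>%
  # one column per value, 0 where a value never occurs for a Case
  pivot_wider(names_from = Val, values_from = n, values_fill = 0L)
wide_counts
```

The columns appear in order of first occurrence rather than alphabetically, so select them by name rather than by position.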
