Create categorical variable from mutually exclusive dummy variables [duplicate] - r

How can I create a categorical variable from mutually exclusive dummy variables (taking values 0/1)?
Basically I am looking for the exact opposite of this solution: (https://subscription.packtpub.com/book/big_data_and_business_intelligence/9781787124479/1/01lvl1sec22/creating-dummies-for-categorical-variables).
Would appreciate a base R solution.
For example, I have the following data:
dummy.df <- structure(c(1L, 0L, 1L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 1L, 0L,
0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 1L,
0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 1L, 0L, 1L),
.Dim = c(10L, 4L),
.Dimnames = list(NULL, c("State.NJ", "State.NY", "State.TX", "State.VA")))
State.NJ State.NY State.TX State.VA
[1,] 1 0 0 0
[2,] 0 1 0 0
[3,] 1 0 0 0
[4,] 0 0 0 1
[5,] 0 1 0 0
[6,] 0 0 1 0
[7,] 1 0 0 0
[8,] 0 0 0 1
[9,] 0 0 1 0
[10,] 0 0 0 1
I would like to get the following results
state
1 NJ
2 NY
3 NJ
4 VA
5 NY
6 TX
7 NJ
8 VA
9 TX
10 VA
cat.var <- structure(list(state = structure(c(1L, 2L, 1L, 4L, 2L, 3L, 1L,
4L, 3L, 4L), .Label = c("NJ", "NY", "TX", "VA"), class = "factor")),
class = "data.frame", row.names = c(NA, -10L))

# toy data
df <- data.frame(a = c(1,0,0,0,0), b = c(0,1,0,1,0), c = c(0,0,1,0,1))
df$cat <- apply(df, 1, function(i) names(df)[which(i == 1)])
Result:
> df
a b c cat
1 1 0 0 a
2 0 1 0 b
3 0 0 1 c
4 0 1 0 b
5 0 0 1 c
To generalize, you'll need to play with the df and names(df) part, but you get the drift. One option would be to make a function, e.g.,
catmaker <- function(data, varnames, catname) {
data[,catname] <- apply(data[,varnames], 1, function(i) varnames[which(i == 1)])
return(data)
}
newdf <- catmaker(data = df, varnames = c("a", "b", "c"), catname = "newcat")
One nice aspect of the functional approach is that it is robust to variations in the order of names in the vector of column names you feed into it. I.e., varnames = c("c", "a", "b") produces the same result as varnames = c("a", "b", "c").
P.S. You added some example data after I posted this. The function works on your example, as long as you convert dummy.df to a data frame first, e.g., catmaker(data = as.data.frame(dummy.df), varnames = colnames(dummy.df), "State") does the job.
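One caveat with the apply() approach above, sketched on a small made-up data frame: if any row has no dummy equal to 1 (or more than one), the anonymous function returns results of different lengths and apply() hands back a list rather than a character vector. The names `toy`, `cat_list` and `cat_vec` are illustrative only.

```r
# Toy data (hypothetical): the third row has no dummy set to 1.
toy <- data.frame(a = c(1, 0, 0), b = c(0, 1, 0))

# Same idea as the answer above; the results have lengths 1, 1 and 0,
# so apply() cannot simplify and returns a list.
cat_list <- apply(toy, 1, function(i) names(toy)[which(i == 1)])

# Collapse to a plain character vector, using NA for rows that do not
# have exactly one match.
cat_vec <- vapply(
  cat_list,
  function(x) if (length(x) == 1) x else NA_character_,
  character(1)
)
cat_vec
```

If the dummies are guaranteed to be mutually exclusive and exhaustive, this extra step should not be needed.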

You can use tidyr::pivot_longer:
library(dplyr)
library(tidyr)
as_tibble(dummy.df) %>%
mutate(id =1:n()) %>%
pivot_longer(., -id, values_to = "Value",
names_to = c("txt","State"), names_sep = "\\.") %>%
filter(Value ==1) %>% select(State)
#> # A tibble: 10 x 1
#> State
#> <chr>
#> 1 NJ
#> 2 NY
#> 3 NJ
#> 4 VA
#> 5 NY
#> 6 TX
#> 7 NJ
#> 8 VA
#> 9 TX
#> 10 VA

You can do:
states <- names(dummy.df)[max.col(dummy.df)]
Or if, as in your example, it's a matrix, you'd need to use colnames():
states <- colnames(dummy.df)[max.col(dummy.df)]
Then just clean it up with sub():
sub(".*\\.", "", states)
"NJ" "NY" "NJ" "VA" "NY" "TX" "NJ" "VA" "TX" "VA"
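Note that max.col() breaks ties at random by default, so a row of all zeros would be assigned an arbitrary column. Here is a minimal sketch of a more defensive variant, assuming you want NA for rows that are not exactly one-hot. The helper name `dummies_to_factor` and the small matrix `m` are made up for illustration.

```r
# Hypothetical helper, not part of the original answer.
dummies_to_factor <- function(m, labels = colnames(m)) {
  m <- as.matrix(m)
  n_hot <- rowSums(m == 1)                  # number of 1s in each row
  idx <- max.col(m, ties.method = "first")  # deterministic tie-breaking
  idx[n_hot != 1] <- NA                     # not exactly one dummy set
  factor(labels[idx], levels = labels)
}

# Toy matrix: row 3 has no dummy set, row 4 has two.
m <- cbind(State.NJ = c(1, 0, 0, 1), State.NY = c(0, 1, 0, 1))
state <- dummies_to_factor(m, labels = sub(".*\\.", "", colnames(m)))
state
```

Returning a factor with fixed levels keeps categories that happen not to occur in the data.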

EDIT : with your data
One way with model.matrix for dummy creation and matrix multiplication :
dummy.df<-structure(c(1L, 0L, 1L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 1L, 0L,
0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 1L,
0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 1L, 0L, 1L), .Dim = c(10L, 4L
), .Dimnames = list(NULL, c("State.NJ", "State.NY", "State.TX",
"State.VA")))
level_names <- colnames(dummy.df)
# use matrix multiplication to extract wanted level
res <- dummy.df%*%1:ncol(dummy.df)
# clean up
res <- as.numeric(res)
factor(res, labels = level_names)
#> [1] State.NJ State.NY State.NJ State.VA State.NY State.TX State.NJ
#> [8] State.VA State.TX State.VA
#> Levels: State.NJ State.NY State.TX State.VA
General reprex :
# create factor and dummy target y
dfr <- data.frame(vec = gl(n = 3, k = 3, labels = letters[1:3]),
y = 1:9)
dfr
#> vec y
#> 1 a 1
#> 2 a 2
#> 3 a 3
#> 4 b 4
#> 5 b 5
#> 6 b 6
#> 7 c 7
#> 8 c 8
#> 9 c 9
# dummies creation
dfr_dummy <- model.matrix(y ~ 0 + vec, data = dfr)
# use matrix multiplication to extract wanted level
res <- dfr_dummy%*%c(1,2,3)
# clean up
res <- as.numeric(res)
factor(res, labels = letters[1:3])
#> [1] a a a b b b c c c
#> Levels: a b c

Related

subset rows in dataframe using combinations of conditions

I have a data frame:
table = structure(list(Plot = 1:10, Sp1 = c(0L, 0L, 1L, 1L, 0L, 1L, 0L,
0L, 1L, 0L), Sp2 = c(1L, 0L, 1L, 0L, 0L, 0L, 1L, 0L, 0L, 0L),
Sp3 = c(1L, 1L, 1L, 1L, 1L, 0L, 1L, 1L, 0L, 0L), Sp4 = c(0L,
1L, 1L, 0L, 1L, 0L, 0L, 1L, 1L, 0L)), class = "data.frame", row.names = c(NA,
-10L))
0 represents a species (Sp) being absent from a plot. 1 represents a species being present.
First, I want to subset my data frame so that only plots with Sp1 or Sp3 or Sp4 remain. This can be done easily with filter from dplyr:
reduced_table <- table %>% filter(Sp1 == 1 | Sp3 == 1 | Sp4 == 1)
But what if I want to reduce the table so that only plots containing any combination of two of these species remain? For example, plots with Sp1 and Sp3, or Sp1 and Sp4, or Sp3 and Sp4 would remain.
Can this be done elegantly, for example with filter? My real data has many more species, and therefore many more combinations, so explicitly writing out the combinations is not ideal.
We can use if_any with filter
library(dplyr)
table %>%
filter(if_any(c(Sp1, Sp3, Sp4), ~ .== 1))
Output:
# Plot Sp1 Sp2 Sp3 Sp4
#1 1 0 1 1 0
#2 2 0 0 1 1
#3 3 1 1 1 1
#4 4 1 0 1 0
#5 5 0 0 1 1
#6 6 1 0 0 0
#7 7 0 1 1 0
#8 8 0 0 1 1
#9 9 1 0 0 1
Or using combinations of columns
library(purrr)
combn(c("Sp1", "Sp3", "Sp4"), 2, simplify = FALSE) %>%
map_dfr( ~ table %>%
filter(if_all(all_of(.x), ~ . == 1))) %>%
distinct()
If the intention is to do filtering on pairwise column checks, use combn from base R
subset(table, Reduce(`|`, combn(c("Sp1", "Sp3", "Sp4"), 2,
FUN = function(x) rowSums(table[x] == 1) == 2, simplify = FALSE)))
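If "any combination of two" simply means "at least two of the listed species are present", the pairwise enumeration may not be needed at all: counting the 1s per row gives the same rows. A minimal base R sketch on the question's data, where the names `plots`, `species` and `n_present` are chosen here for illustration:

```r
# The question's data, rebuilt under a different name to avoid masking table().
plots <- data.frame(
  Plot = 1:10,
  Sp1 = c(0, 0, 1, 1, 0, 1, 0, 0, 1, 0),
  Sp2 = c(1, 0, 1, 0, 0, 0, 1, 0, 0, 0),
  Sp3 = c(1, 1, 1, 1, 1, 0, 1, 1, 0, 0),
  Sp4 = c(0, 1, 1, 0, 1, 0, 0, 1, 1, 0)
)

species <- c("Sp1", "Sp3", "Sp4")

# Number of the selected species present in each plot.
n_present <- rowSums(plots[species] == 1)

# Keep plots where at least two of them occur.
reduced <- plots[n_present >= 2, ]
reduced$Plot
```

Changing the threshold (for example `n_present >= 3`) generalises this to larger combinations without listing them.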

create new column with mutate() when value in any of several other columns is TRUE (or 1)

I have a dataframe (my_dataframe) with 5 columns. All have 0 or 1 values. I would like to create a new column called cn7_any, which should have values of 1 when any values from columns 2:5 are ==1.
structure(list(cn7_normal = c(1L, 1L, 1L, 1L, 1L, 1L),
cn7_right_paralysis_central = c(0L, 0L, 0L, 0L, 0L, 0L),
cn7_right_paralysis_peripheral = c(0L, 0L, 0L, 0L, 0L, 0L),
cn7_left_paralysis_central = c(0L, 0L, 0L, 0L, 0L, 0L),
cn7_left_paralysis_peripheral = c(0L, 0L, 0L, 0L, 0L, 0L)),
row.names = c(NA, -6L), class = c("tbl_df", "tbl", "data.frame"
))
> head(my_dataframe)
# A tibble: 6 x 5
cn7_normal cn7_right_paralysis_cen… cn7_right_paralysis_perip… cn7_left_paralysis_cen… cn7_left_paralysis_peri…
<int> <int> <int> <int> <int>
1 1 0 0 0 0
2 1 0 0 0 0
I could do it successfully with case_when():
my_dataframe<-my_dataframe%>%
mutate(cn7_paralisis_any=case_when(cn7_right_paralysis_central==1 ~ 1,
cn7_right_paralysis_peripheral==1 ~ 1,
cn7_left_paralysis_central==1 ~ 1,
cn7_left_paralysis_peripheral==1 ~ 1,
TRUE ~ 0)
)
Although it worked, I wonder whether there is a simpler, less verbose solution. I feel I should be using any() somehow. Any ideas?
my_dataframe$cn7_any <- apply(my_dataframe[ , 2:5], 1, max)
Your data is all zeroes, so I'll change a couple to prove the point.
rowSums(my_dataframe[,2:5]) > 0
# [1] FALSE TRUE FALSE TRUE FALSE FALSE
+(rowSums(my_dataframe[,2:5]) > 0)
# [1] 0 1 0 1 0 0
my_dataframe$cn7_any <- +(rowSums(my_dataframe[,2:5]) > 0)
Within dplyr,
my_dataframe %>%
mutate(cn7_any = rowSums(across(-cn7_normal, ~ . > 0)) > 0)
# # A tibble: 6 x 6
# cn7_normal cn7_right_paralysis_central cn7_right_paralysis_peripheral cn7_left_paralysis_central cn7_left_paralysis_peripheral cn7_any
# <int> <int> <int> <int> <int> <lgl>
# 1 1 0 0 0 0 FALSE
# 2 1 0 0 0 1 TRUE
# 3 1 0 0 0 0 FALSE
# 4 1 0 0 1 0 TRUE
# 5 1 0 0 0 0 FALSE
# 6 1 0 0 0 0 FALSE
It seems like a logical thing you're doing, not a number thing, but if you want numbers, just use the +(.) trick as above:
my_dataframe %>%
mutate(cn7_any = +(rowSums(across(-cn7_normal, ~ . > 0)) > 0))
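For readers unfamiliar with the +(.) trick: unary plus coerces a logical vector to integer, so it behaves much like as.integer(). A small sketch on made-up data, where the column names `normal`, `right` and `left` are placeholders and not the question's columns:

```r
# Toy data (hypothetical): only the second row has a non-zero flag.
d <- data.frame(
  normal = c(1L, 1L, 1L),
  right  = c(0L, 1L, 0L),
  left   = c(0L, 0L, 0L)
)

# TRUE where any of the selected columns is non-zero.
any_flag <- rowSums(d[, c("right", "left")]) > 0

plus_version <- +any_flag            # logical -> integer, keeps names
int_version  <- as.integer(any_flag) # same values, names dropped
int_version
```

as.integer() is a little more explicit about the intent, at the cost of a few more characters.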
Similar to Using any() vs | in dplyr::mutate
I also changed a few digits in your dataset.
V2: Using or |
V3: Using dplyr::rowwise() prior to mutate to effectively group the input by rows, then using the any() function (without rowwise(), any() looks at the entire vector, which is why you get the unexpected result)
my_dataframe<-structure(list(cn7_normal = c(1L, 1L, 1L, 1L, 1L, 1L),
cn7_right_paralysis_central = c(0L, 0L, 0L, 0L, 0L, 0L),
cn7_right_paralysis_peripheral = c(1L, 0L, 0L, 0L, 0L, 0L),
cn7_left_paralysis_central = c(0L, 1L, 0L, 0L, 0L, 0L),
cn7_left_paralysis_peripheral = c(0L, 0L, 0L, 0L, 0L, 0L)),
row.names = c(NA, -6L),
class = c("tbl_df", "tbl", "data.frame"))
my_dataframe%>%
rowwise() %>% ### rowwise ###
mutate(cn7_paralisis_any=case_when(cn7_right_paralysis_central==1 ~ 1,
cn7_right_paralysis_peripheral==1 ~ 1,
cn7_left_paralysis_central==1 ~ 1,
cn7_left_paralysis_peripheral==1 ~ 1,
TRUE ~ 0),
cn7_v2=(cn7_right_paralysis_central|cn7_right_paralysis_peripheral|cn7_left_paralysis_central|cn7_left_paralysis_peripheral),
cn7_v3=any(cn7_right_paralysis_central ,cn7_right_paralysis_peripheral, cn7_left_paralysis_central, cn7_left_paralysis_peripheral)
) %>%
select(cn7_paralisis_any,cn7_v2,cn7_v3)
# A tibble: 6 x 3
# Rowwise:
# cn7_paralisis_any cn7_v2 cn7_v3
# <dbl> <lgl> <lgl>
#1 1 TRUE TRUE
#2 1 TRUE TRUE
#3 0 FALSE FALSE
#4 0 FALSE FALSE
#5 0 FALSE FALSE
#6 0 FALSE FALSE
I now use dplyr::if_any and dplyr::if_all in such cases. I think it makes the code very clear and readable whenever we must perform such rowwise logical operations in dplyr.
For this particular case, I would now use:
library(dplyr)
my_dataframe %>%
mutate(cn7_paralisis_any = +if_any(-cn7_normal, ~ .x == 1))

Weighted mean using aggregated

Sorry for asking what might be a very basic question, but I am stuck in a conundrum and cannot seem to get out of it.
I have data that looks like
Medicine Biology Business sex weights
0 1 0 1 0.5
0 0 1 0 1
1 0 0 1 05
0 1 0 0 0.33
0 0 1 0 0.33
1 0 0 1 1
0 1 0 0 0.33
0 0 1 1 1
1 0 0 1 1
Where the first three are fields of study, and the fourth variable is gender. Obviously with many more observations.
What I want to get is the mean level of each field of study (medicine, biology, business) by the variable sex (so the mean for men and the mean for women). To do so, I have used the following code:
barplot_sex<-aggregate(x=df_dummies[,1:19] , by=list(df$sex),
FUN= function(x) mean(x))
Which works perfectly and gives me what I needed. My problem is that I need to use a weighted mean now, but I cannot use
FUN= function(x) weighted.mean(x, weights)
as there are many more observations than fields of study.
The only alternative I managed to do was to edit(boxplot) and change the values manually, but then R doesn't save the changes. Plus, I am sure there must be a trivial way to do exactly what I need.
Any help would be greatly appreciated.
Bests,
Gabriele
Using by.
by(dat, dat$sex, function(x) sapply(x[, 1:3], weighted.mean, x[, "weights"]))
# dat$sex: 0
# Medicine Biology Business
# 0.0000000 0.3316583 0.6683417
# ---------------------------------------------------------------------------------------
# dat$sex: 1
# Medicine Biology Business
# 0.82352941 0.05882353 0.11764706
Data:
dat <- structure(list(Medicine = c(0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L
), Biology = c(1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L), Business = c(0L,
1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L), sex = c(1L, 0L, 1L, 0L, 0L,
1L, 0L, 1L, 1L), weights = c(0.5, 1, 5, 0.33, 0.33, 1, 0.33,
1, 1)), class = "data.frame", row.names = c(NA, -9L))
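The reason aggregate() is awkward here is that it passes each column to FUN on its own, so the matching weights are not available inside the function. Splitting the whole data frame by group keeps the weights aligned with their rows. A minimal sketch on made-up numbers (the data frame `toy` is illustrative, not the question's data):

```r
# Toy data (hypothetical values).
toy <- data.frame(
  Medicine = c(0, 0, 1, 0),
  Biology  = c(1, 0, 0, 1),
  sex      = c(1, 0, 1, 0),
  weights  = c(0.5, 1, 2, 1)
)

# For each sex, compute the weighted mean of every field using that
# group's own weights; the result has fields as rows and sexes as columns.
res <- sapply(
  split(toy, toy$sex),
  function(g) sapply(g[c("Medicine", "Biology")], weighted.mean, w = g$weights)
)
res

# A weighted mean is sum(x * w) / sum(w); check one cell by hand.
manual <- sum(c(0, 1) * c(0.5, 2)) / sum(c(0.5, 2))  # Medicine, sex == 1
```

The by() call in the answer above does the same job; this variant just returns a plain matrix, which may be easier to pass on to barplot().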

String matching where strings contain punctuation

I want to find a case insensitive match using grepl().
I have the following list of keywords that I want to find in a Text column of my data frame df.
# There is a long list of words, but for simplification I have provided only a subset.
I, I'm, the, and, to, a, of
I want to have the counts of these words separately for each of the data rows.
I define this word list to be used in the code as:
word_list = c('\\bI\\b','\\bthe\\b','\\band\\b','\\bto\\b','\\ba\\b','\\bof\\b')
# Note that I'm is not currently in this word_list
In my dataframe df I add the columns as below to keep the counts of above words:
df$I = 0
df$IM = 0 # this is where I need help
df$THE = 0
df$AND = 0
df$TO = 0
df$A = 0
df$OF = 0
Then I use the following for-loop for each word of the word list to iterate over each row of the required column.
# for each word of my word_list
for (i in 1:length(word_list)){
# to search in each row of text response
for(j in 1:nrow(df)){
if(grepl(word_list[i], df$Text[j], ignore.case = T)){
df[j,i+4] = df[j,i+4] + 1 # 4 is added to go to the specific column
}#if
}#for
}#for
For a reproducible example dput(df) is as below:
dput(df)
structure(list(cluster3 = c(2L, 1L, 1L, 1L, 2L, 1L, 1L, 2L, 2L, 1L), userID = c(3016094L, 3042038L, 3079341L, 3079396L, 3130832L, 3130864L, 3148118L, 3148914L, 3149040L, 3150222L), Text = structure(c(3L, 4L, 2L, 9L, 6L, 10L, 7L, 1L, 5L, 8L), .Label = c("I'm alright","I'm stressed", "I am a good person.", "I don't care", "I have a difficult task", "I like it", "I think it doesn't matter", "Let's not argue about this", "Let's see if I can run", "No, I'm not in a mood"), class = "factor"), I = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), IM = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), AND = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), THE = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), TO = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), OF = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L)), class = "data.frame", row.names = c(NA, -10L))
I would suggest a more streamlined approach:
## use a named vector for the word patterns
## with the column names you want to add to `df`
word_list = c('I' = '\\bi\\b', 'THE' = '\\bthe\\b', 'AND' = '\\band\\b',
'TO' = '\\bto\\b', 'A' = '\\ba\\b', 'OF' = '\\bof\\b', 'IM' = "\\bim\\b")
## use `stringr::str_count` instead of `grepl`
## sapply does the looping and result gathering for us
library(stringr)
results = sapply(word_list, str_count,
string = gsub("[[:punct:]]", "", tolower(df$Text))
)
results
# I THE AND TO A OF IM
# [1,] 1 3 2 1 1 1 0
# [2,] 0 0 1 0 0 0 0
# [3,] 0 0 0 0 0 0 0
# [4,] 2 2 3 2 1 1 1
# [5,] 0 0 0 1 1 0 0
# [6,] 0 3 2 2 0 0 0
# [7,] 1 3 0 1 1 0 0
# [8,] 1 2 0 1 1 1 0
# [9,] 0 0 0 0 0 0 0
# [10,] 0 0 0 1 2 0 0
## put the results into the data frame based on the names
df[colnames(results)] = data.frame(results)
Since we rely on str_count which is vectorized, this should be much faster than the row-by-row approach.
I was able to make my code work by putting the expression in double quotes:
word_list = c('\\bI\\b',"\\bI'm\\b",'\\bthe\\b','\\band\\b','\\bto\\b','\\ba\\b','\\bof\\b')
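One thing to watch with these patterns: an apostrophe is not a word character, so '\\bI\\b' also matches the "I" inside "I'm". Here is a base R sketch, assuming you want "I" and "I'm" counted separately; the helper `count_matches` and the sample sentences are made up for illustration. It relies on a Perl-style negative lookahead, hence perl = TRUE.

```r
txt <- c("I'm alright", "I am a good person.",
         "No, I'm not in a mood", "I think I can")

# Hypothetical helper: number of case-insensitive matches per string.
count_matches <- function(pattern, x) {
  m <- gregexpr(pattern, x, ignore.case = TRUE, perl = TRUE)
  # gregexpr() returns -1 when there is no match, so count positive positions.
  vapply(m, function(hit) sum(hit > 0), integer(1))
}

im_count <- count_matches("\\bI'm\\b", txt)
# "I" as a whole word that is NOT followed by an apostrophe.
i_count  <- count_matches("\\bI\\b(?!')", txt)

im_count
i_count
```

Whether "I'm" should also count towards "I" depends on the analysis, so treat the lookahead as optional.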

R: use a row as a grouping vector for row sums

If I have a data set laid out like:
Cohort Food1 Food2 Food 3 Food 4
--------------------------------
Group 1 1 2 3
A 1 1 0 1
B 0 0 1 0
C 1 1 0 1
D 0 0 0 1
I want to sum each row, where I can define food groups into different categories. So I would like to use the Group row as the defining vector.
Which would mean that food1 and food2 are in group 1, food3 is in group 2, food 4 is in group 3.
Ideal output something like:
Cohort Group1 Group2 Group3
A 2 0 1
B 0 1 0
C 2 0 1
D 0 0 1
I tried using rowsum()-based functions but had no luck. Do I need to use ddply() instead?
Example data from comment:
dat <-
structure(list(species = c("group", "princeps", "bougainvillei",
"hombroni", "lindsayi", "concretus", "galatea", "ellioti", "carolinae",
"hydrocharis"), locust = c(1L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L,
0L), grasshopper = c(1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 1L, 0L),
snake = c(2L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L), fish = c(2L,
1L, 0L, 1L, 1L, 0L, 1L, 0L, 1L, 0L), frog = c(2L, 0L, 0L,
0L, 0L, 0L, 0L, 1L, 0L, 0L), toad = c(2L, 0L, 0L, 0L, 0L,
1L, 0L, 0L, 0L, 0L), fruit = c(3L, 0L, 0L, 0L, 0L, 1L, 1L,
0L, 0L, 0L), seed = c(3L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L,
0L)), .Names = c("species", "locust", "grasshopper", "snake",
"fish", "frog", "toad", "fruit", "seed"), class = "data.frame", row.names = c(NA,
-10L))
There are most likely more direct approaches, but here is one you can try:
First, create a copy of your data minus the second header row.
dat2 <- dat[-1, ]
melt() and dcast() and so on from the "reshape2" package don't work nicely with duplicated column names, so let's make the column names more "reshape2 appropriate".
Seq <- ave(as.vector(unlist(dat[1, -1])),
as.vector(unlist(dat[1, -1])),
FUN = seq_along)
names(dat2)[-1] <- paste("group", dat[1, 2:ncol(dat)],
".", Seq, sep = "")
melt() the dataset
library(reshape2)
m.dat2 <- melt(dat2, id.vars="species")
Use the colsplit() function to split the columns correctly.
m.dat2 <- cbind(m.dat2[-2],
colsplit(m.dat2$variable, "\\.",
c("group", "time")))
head(m.dat2)
# species value group time
# 1 princeps 0 group1 1
# 2 bougainvillei 0 group1 1
# 3 hombroni 1 group1 1
# 4 lindsayi 0 group1 1
# 5 concretus 0 group1 1
# 6 galatea 0 group1 1
Proceed with dcast() as usual
dcast(m.dat2, species ~ group, sum)
# species group1 group2 group3
# 1 bougainvillei 0 0 0
# 2 carolinae 1 1 0
# 3 concretus 0 2 2
# 4 ellioti 0 1 0
# 5 galatea 1 1 1
# 6 hombroni 2 1 0
# 7 hydrocharis 0 0 0
# 8 lindsayi 0 1 0
# 9 princeps 0 1 0
Note: Edited because original answer was incorrect.
Update: An easier way in base R
This problem is much more easily solved if you start by transposing your data.
dat3 <- t(dat[-1, -1])
dat3 <- as.data.frame(dat3)
names(dat3) <- dat[[1]][-1]
t(do.call(rbind, lapply(split(dat3, as.numeric(dat[1, -1])), colSums)))
# 1 2 3
# princeps 0 1 0
# bougainvillei 0 0 0
# hombroni 2 1 0
# lindsayi 0 1 0
# concretus 0 2 2
# galatea 1 1 1
# ellioti 0 1 0
# carolinae 1 1 0
# hydrocharis 0 0 0
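Since the question mentions rowsum(): it sums rows by group, so it can be applied to the transposed data directly. A minimal sketch on the small example from the question, assuming the group labels are kept in a separate vector rather than as a first data row (the names `food` and `group` are chosen here for illustration):

```r
# The question's toy table as a numeric matrix, cohorts in rows.
food <- matrix(
  c(1, 1, 0, 1,
    0, 0, 1, 0,
    1, 1, 0, 1,
    0, 0, 0, 1),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("A", "B", "C", "D"),
                  c("Food1", "Food2", "Food3", "Food4"))
)

# One group label per food column (the "Group" row in the question).
group <- c(1, 1, 2, 3)

# rowsum() adds up rows sharing a group label, so transpose first to
# put foods in rows, then transpose back to get cohorts in rows.
res <- t(rowsum(t(food), group = group))
colnames(res) <- paste0("Group", colnames(res))
res
```

With the `dat` layout used above, the equivalent inputs would be `dat[-1, -1]` for the values and `unlist(dat[1, -1])` for the grouping vector.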
You can do this using base R fairly easily. Here's an example.
First, figure out which animals belong in which group:
groupings <- as.data.frame(table(as.numeric(dat[1,2:9]),names(dat)[2:9]))
attach(groupings)
grp1 <- groupings[Freq==1 & Var1==1,2]
grp2 <- groupings[Freq==1 & Var1==2,2]
grp3 <- groupings[Freq==1 & Var1==3,2]
detach(groupings)
Then, use the groups to do a rowSums() on the correct columns.
dat <- cbind(dat,rowSums(dat[as.character(grp1)]))
dat <- cbind(dat,rowSums(dat[as.character(grp2)]))
dat <- cbind(dat,rowSums(dat[as.character(grp3)]))
Delete the initial row and the intermediate columns:
dat <- dat[-1,-c(2:9)]
Then, just rename things correctly:
row.names(dat) <- NULL
names(dat) <- c("species","group_1","group_2","group_3")
And you ultimately get:
species group_1 group_2 group_3
bougainvillei 0 0 0
carolinae 1 1 0
concretus 0 2 2
ellioti 0 1 0
galatea 1 1 1
hombroni 2 1 0
hydrocharis 0 0 0
lindsayi 0 1 0
princeps 0 1 0
EDITED: Changed sort order to alphabetical, like other answer.
