Related
Survey.ID Quit Boss Subord Subord2 Subord3 Subord4
1 1 0 0 0 0 1 0
2 2 1 0 0 1 0 0
3 3 0 0 0 0 0 0
4 4 0 0 0 0 1 0
5 5 0 0 0 0 1 0
6 6 1 0 0 0 1 0
I have a df above. Each of the variables is a binary variable that categorizes if someone is a boss, or a certain level of subordinate. I am trying to see what is most predictive of someone quitting the past month. I am using the logistic regression
model <- glm(Quit ~ Subord, family=binomial, data = df)
summary(model)
exp(cbind(RR = coef(model), confint.default(model)))
I would like to find the Relative Risk (RR) for each group of employees: Boss, Subord, Subord2, Subord3, Subord4. However, I would like to reference group to be Subord4. I believe right now, the reference is set to boss? How do I fix this?
I think this might help
Libraries
library(tidyverse)
Sample data
df <-
structure(list(id = 1:6, Quit = c(0L, 1L, 0L, 0L, 0L, 1L), Boss = c(0L,
0L, 0L, 0L, 0L, 0L), Subord = c(0L, 0L, 0L, 0L, 0L, 0L), Subord2 = c(0L,
1L, 0L, 0L, 0L, 0L), Subord3 = c(1L, 0L, 0L, 1L, 1L, 1L), Subord4 = c(0L,
0L, 0L, 0L, 0L, 0L)), class = "data.frame", row.names = c(NA,
-6L))
Code
df %>%
#Create a single column of variables: Boss Subord Subord2 Subord3 Subord4
pivot_longer(cols = -c(id,Quit)) %>%
#Keeping only those with value = 1
filter(value == 1) %>%
#Making "Subord4" the baseline, as the first level of the factor
mutate(name = fct_relevel(as.factor(name),"Subord4")) %>%
glm(data = .,formula = Quit~name, family=binomial)
I have a data frame:
table = structure(list(Plot = 1:10, Sp1 = c(0L, 0L, 1L, 1L, 0L, 1L, 0L,
0L, 1L, 0L), Sp2 = c(1L, 0L, 1L, 0L, 0L, 0L, 1L, 0L, 0L, 0L),
Sp3 = c(1L, 1L, 1L, 1L, 1L, 0L, 1L, 1L, 0L, 0L), Sp4 = c(0L,
1L, 1L, 0L, 1L, 0L, 0L, 1L, 1L, 0L)), class = "data.frame", row.names = c(NA,
-10L))
0 represents a species (Sp) being absent from a plot. 1 represents a species being present.
First, I want to subset my data frame so that only plots with Sp1 or Sp3 or Sp4 remain. This can be done easily with filter from dplyr:
reduced_table <- table %>% filter(table$Sp1 == 1 |table$Sp3 == 1 | table$Sp4 == 1)
But, what if I want to reduce the table so that only plots that have any combination of two of these species is present. For example plots with Sp1 & Sp3, or Sp1 and Sp4, or Sp3 and Sp4 would remain.
Can this be done eloquently like using filter? My real situation has many more species and therefore many more combinations so explicitly writing out the combinations is not ideal.
We can use if_any with filter
library(dplyr)
table %>%
filter(if_any(c(Sp1, Sp3, Sp4), ~ .== 1))
-output
# Plot Sp1 Sp2 Sp3 Sp4
#1 1 0 1 1 0
#2 2 0 0 1 1
#3 3 1 1 1 1
#4 4 1 0 1 0
#5 5 0 0 1 1
#6 6 1 0 0 0
#7 7 0 1 1 0
#8 8 0 0 1 1
#9 9 1 0 0 1
Or using a combnation of columns
library(purrr)
combn(c("Sp1", "Sp3", "Sp4"), 2, simplify = FALSE) %>%
map_dfr( ~ table %>%
filter(if_all(.x, ~ . == 1))) %>%
distinct
If the intention is to do filtering on pairwise column checks, use combn from base R
subset(table, Reduce(`|`, combn(c("Sp1", "Sp3", "Sp4"), 2,
FUN = function(x) rowSums(table[x] == 1) == 2, simplify = FALSE)))
This question already has answers here:
Reconstruct a categorical variable from dummies in R [duplicate]
(3 answers)
Closed 3 years ago.
How can I create a categorical variable from mutually exclusive dummy variables (taking values 0/1)?
Basically I am looking for the exact opposite of this solution: (https://subscription.packtpub.com/book/big_data_and_business_intelligence/9781787124479/1/01lvl1sec22/creating-dummies-for-categorical-variables).
Would appreciate a base R solution.
For example, I have the following data:
dummy.df <- structure(c(1L, 0L, 1L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 1L, 0L,
0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 1L,
0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 1L, 0L, 1L),
.Dim = c(10L, 4L),
.Dimnames = list(NULL, c("State.NJ", "State.NY", "State.TX", "State.VA")))
State.NJ State.NY State.TX State.VA
[1,] 1 0 0 0
[2,] 0 1 0 0
[3,] 1 0 0 0
[4,] 0 0 0 1
[5,] 0 1 0 0
[6,] 0 0 1 0
[7,] 1 0 0 0
[8,] 0 0 0 1
[9,] 0 0 1 0
[10,] 0 0 0 1
I would like to get the following results
state
1 NJ
2 NY
3 NJ
4 VA
5 NY
6 TX
7 NJ
8 VA
9 TX
10 VA
cat.var <- structure(list(state = structure(c(1L, 2L, 1L, 4L, 2L, 3L, 1L,
4L, 3L, 4L), .Label = c("NJ", "NY", "TX", "VA"), class = "factor")),
class = "data.frame", row.names = c(NA, -10L))
# toy data
df <- data.frame(a = c(1,0,0,0,0), b = c(0,1,0,1,0), c = c(0,0,1,0,1))
df$cat <- apply(df, 1, function(i) names(df)[which(i == 1)])
Result:
> df
a b c cat
1 1 0 0 a
2 0 1 0 b
3 0 0 1 c
4 0 1 0 b
5 0 0 1 c
To generalize, you'll need to play with the df and names(df) part, but you get the drift. One option would be to make a function, e.g.,
catmaker <- function(data, varnames, catname) {
data[,catname] <- apply(data[,varnames], 1, function(i) varnames[which(i == 1)])
return(data)
}
newdf <- catmaker(data = df, varnames = c("a", "b", "c"), catname = "newcat")
One nice aspect of the functional approach is that it is robust to variations in the order of names in the vector of column names you feed into it. I.e., varnames = c("c", "a", "b") produces the same result as varnames = c("a", "b", "c").
P.S. You added some example data after I posted this. The function works on your example, as long as you convert dummy.df to a data frame first, e.g., catmaker(data = as.data.frame(dummy.df), varnames = colnames(dummy.df), "State") does the job.
You can use tidyr::gather:
library(dplyr)
library(tidyr)
as_tibble(dummy.df) %>%
mutate(id =1:n()) %>%
pivot_longer(., -id, values_to = "Value",
names_to = c("txt","State"), names_sep = "\\.") %>%
filter(Value ==1) %>% select(State)
#> # A tibble: 10 x 1
#> State
#> <chr>
#> 1 NJ
#> 2 NY
#> 3 NJ
#> 4 VA
#> 5 NY
#> 6 TX
#> 7 NJ
#> 8 VA
#> 9 TX
#> 10 VA
You can do:
states <- names(dummy.df)[max.col(dummy.df)]
Or if as in your example it's a matrix you'd need to use colnames():
colnames(dummy.df)[max.col(dummy.df)]
Then just clean it up with sub():
sub(".*\\.", "", states)
"NJ" "NY" "NJ" "VA" "NY" "TX" "NJ" "VA" "TX" "VA"
EDIT : with your data
One way with model.matrix for dummy creation and matrix multiplication :
dummy.df<-structure(c(1L, 0L, 1L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 1L, 0L,
0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 1L,
0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 1L, 0L, 1L), .Dim = c(10L, 4L
), .Dimnames = list(NULL, c("State.NJ", "State.NY", "State.TX",
"State.VA")))
level_names <- colnames(dummy.df)
# use matrix multiplication to extract wanted level
res <- dummy.df%*%1:ncol(dummy.df)
# clean up
res <- as.numeric(res)
factor(res, labels = level_names)
#> [1] State.NJ State.NY State.NJ State.VA State.NY State.TX State.NJ
#> [8] State.VA State.TX State.VA
#> Levels: State.NJ State.NY State.TX State.VA
General reprex :
# create factor and dummy target y
dfr <- data.frame(vec = gl(n = 3, k = 3, labels = letters[1:3]),
y = 1:9)
dfr
#> vec y
#> 1 a 1
#> 2 a 2
#> 3 a 3
#> 4 b 4
#> 5 b 5
#> 6 b 6
#> 7 c 7
#> 8 c 8
#> 9 c 9
# dummies creation
dfr_dummy <- model.matrix(y ~ 0 + vec, data = dfr)
# use matrix multiplication to extract wanted level
res <- dfr_dummy%*%c(1,2,3)
# clean up
res <- as.numeric(res)
factor(res, labels = letters[1:3])
#> [1] a a a b b b c c c
#> Levels: a b c
I want to find a case insensitive match using grepl().
I have the following list of keywords that I want to find in a Text column of my data frame df.
# There is a long list of words, but for simplification I have provided only a subset.
I, I'm, the, and, to, a, of
I want to have the counts of these words separately for each of the data rows.
I define this word list to be used in the code as:
word_list = c('\\bI\\b','\\bthe\\b','\\band\\b','\\bto\\b','\\ba\\b','\\bof\\b')
# Note that I'm is not currently in this word_list
In my dataframe df I add the columns as below to keep the counts of above words:
df$I = 0
df$IM = 0 # this is where I need help
df$THE = 0
df$AND = 0
df$TO = 0
df$A = 0
df$OF = 0
Then I use the following for-loop for each word of the word list to iterate over each row of the required column.
# for each word of my word_list
for (i in 1:length(word_list)){
# to search in each row of text response
for(j in 1:nrow(df)){
if(grepl(word_list[i], df$Text[j], ignore.case = T)){
df[j,i+4] = (df[j,i+4]) # 4 is added to go to the specific column
}#if
}#for
}#for
For a reproducible example dput(df) is as below:
dput(df)
structure(list(cluster3 = c(2L, 1L, 1L, 1L, 2L, 1L, 1L, 2L, 2L, 1L), userID = c(3016094L, 3042038L, 3079341L, 3079396L, 3130832L, 3130864L, 3148118L, 3148914L, 3149040L, 3150222L), Text = structure(c(3L, 4L, 2L, 9L, 6L, 10L, 7L, 1L, 5L, 8L), .Label = c("I'm alright","I'm stressed", "I am a good person.", "I don't care", "I have a difficult task", "I like it", "I think it doesn't matter", "Let's not argue about this", "Let's see if I can run", "No, I'm not in a mood"), class = "factor"), I = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), IM = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), AND = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), THE = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), TO = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), OF = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L)), class = "data.frame", row.names = c(NA, -10L))
I would suggest a more streamlined approach:
## use a named vector for the word patterns
## with the column names you want to add to `df`
word_list = c('I' = '\\bi\\b', 'THE' = '\\bthe\\b', 'AND' = '\\band\\b',
'TO' = '\\bto\\b', 'A' = '\\ba\\b', 'OF' = '\\bof\\b', 'IM' = "\\bim")
## use `stringr::str_count` instead of `grepl`
## sapply does the looping and result gathering for us
library(stringr)
results = sapply(word_list, str_count,
string = gsub("[[:punct:]]", "", tolower(df$Text))
)
results
# I THE AND TO A OF IM
# [1,] 1 3 2 1 1 1 0
# [2,] 0 0 1 0 0 0 0
# [3,] 0 0 0 0 0 0 0
# [4,] 2 2 3 2 1 1 1
# [5,] 0 0 0 1 1 0 0
# [6,] 0 3 2 2 0 0 0
# [7,] 1 3 0 1 1 0 0
# [8,] 1 2 0 1 1 1 0
# [9,] 0 0 0 0 0 0 0
# [10,] 0 0 0 1 2 0 0
## put the results into the data frame based on the names
df[colnames(results)] = data.frame(results)
Since we rely on str_count which is vectorized, this should be much faster than the row-by-row approach.
I am able to make my code working by adding the expression in double quotes:
word_list = c('\\bI\\b',"\\bI'm\\b",'\\bthe\\b','\\band\\b','\\bto\\b','\\ba\\b','\\bof\\b')
If I have a data set laid out like:
Cohort Food1 Food2 Food 3 Food 4
--------------------------------
Group 1 1 2 3
A 1 1 0 1
B 0 0 1 0
C 1 1 0 1
D 0 0 0 1
I want to sum each row, where I can define food groups into different categories. So I would like to use the Group row as the defining vector.
Which would mean that food1 and food2 are in group 1, food3 is in group 2, food 4 is in group 3.
Ideal output something like:
Cohort Group1 Group2 Group3
A 2 0 1
B 0 1 0
C 2 0 1
D 0 0 1
I tried using this rowsum() based functions but no luck, do I need to use ddply() instead?
Example data from comment:
dat <-
structure(list(species = c("group", "princeps", "bougainvillei",
"hombroni", "lindsayi", "concretus", "galatea", "ellioti", "carolinae",
"hydrocharis"), locust = c(1L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L,
0L), grasshopper = c(1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 1L, 0L),
snake = c(2L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L), fish = c(2L,
1L, 0L, 1L, 1L, 0L, 1L, 0L, 1L, 0L), frog = c(2L, 0L, 0L,
0L, 0L, 0L, 0L, 1L, 0L, 0L), toad = c(2L, 0L, 0L, 0L, 0L,
1L, 0L, 0L, 0L, 0L), fruit = c(3L, 0L, 0L, 0L, 0L, 1L, 1L,
0L, 0L, 0L), seed = c(3L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L,
0L)), .Names = c("species", "locust", "grasshopper", "snake",
"fish", "frog", "toad", "fruit", "seed"), class = "data.frame", row.names = c(NA,
-10L))
There are most likely more direct approaches, but here is one you can try:
First, create a copy of your data minus the second header row.
dat2 <- dat[-1, ]
melt() and dcast() and so on from the "reshape2" package don't work nicely with duplicated column names, so let's make the column names more "reshape2 appropriate".
Seq <- ave(as.vector(unlist(dat[1, -1])),
as.vector(unlist(dat[1, -1])),
FUN = seq_along)
names(dat2)[-1] <- paste("group", dat[1, 2:ncol(dat)],
".", Seq, sep = "")
melt() the dataset
m.dat2 <- melt(dat2, id.vars="species")
Use the colsplit() function to split the columns correctly.
m.dat2 <- cbind(m.dat2[-2],
colsplit(m.dat2$variable, "\\.",
c("group", "time")))
head(m.dat2)
# species value group time
# 1 princeps 0 group1 1
# 2 bougainvillei 0 group1 1
# 3 hombroni 1 group1 1
# 4 lindsayi 0 group1 1
# 5 concretus 0 group1 1
# 6 galatea 0 group1 1
Proceed with dcast() as usual
dcast(m.dat2, species ~ group, sum)
# species group1 group2 group3
# 1 bougainvillei 0 0 0
# 2 carolinae 1 1 0
# 3 concretus 0 2 2
# 4 ellioti 0 1 0
# 5 galatea 1 1 1
# 6 hombroni 2 1 0
# 7 hydrocharis 0 0 0
# 8 lindsayi 0 1 0
# 9 princeps 0 1 0
Note: Edited because original answer was incorrect.
Update: An easier way in base R
This problem is much more easily solved if you start by transposing your data.
dat3 <- t(dat[-1, -1])
dat3 <- as.data.frame(dat3)
names(dat3) <- dat[[1]][-1]
t(do.call(rbind, lapply(split(dat3, as.numeric(dat[1, -1])), colSums)))
# 1 2 3
# princeps 0 1 0
# bougainvillei 0 0 0
# hombroni 2 1 0
# lindsayi 0 1 0
# concretus 0 2 2
# galatea 1 1 1
# ellioti 0 1 0
# carolinae 1 1 0
# hydrocharis 0 0 0
You can do this using base R fairly easily. Here's an example.
First, figure out which animals belong in which group:
groupings <- as.data.frame(table(as.numeric(dat[1,2:9]),names(dat)[2:9]))
attach(groupings)
grp1 <- groupings[Freq==1 & Var1==1,2]
grp2 <- groupings[Freq==1 & Var1==2,2]
grp3 <- groupings[Freq==1 & Var1==3,2]
detach(groupings)
Then, use the groups to do a rowSums() on the correct columns.
dat <- cbind(dat,rowSums(dat[as.character(grp1)]))
dat <- cbind(dat,rowSums(dat[as.character(grp2)]))
dat <- cbind(dat,rowSums(dat[as.character(grp3)]))
Delete the initial row and the intermediate columns:
dat <- dat[-1,-c(2:9)]
Then, just rename things correctly:
row.names(dat) <- rm()
names(dat) <- c("species","group_1","group_2","group_3")
And you ultimately get:
species group_1 group_2 group_3
bougainvillei 0 0 0
carolinae 1 1 0
concretus 0 2 2
ellioti 0 1 0
galatea 1 1 1
hombroni 2 1 0
hydrocharis 0 0 0
lindsayi 0 1 0
princeps 0 1 0
EDITED: Changed sort order to alphabetical, like other answer.