Survey.ID Quit Boss Subord Subord2 Subord3 Subord4
1 1 0 0 0 0 1 0
2 2 1 0 0 1 0 0
3 3 0 0 0 0 0 0
4 4 0 0 0 0 1 0
5 5 0 0 0 0 1 0
6 6 1 0 0 0 1 0
I have the df above. Each variable is binary and indicates whether someone is a boss or a certain level of subordinate. I am trying to see what is most predictive of someone quitting in the past month. I am using logistic regression:
model <- glm(Quit ~ Subord, family=binomial, data = df)
summary(model)
exp(cbind(RR = coef(model), confint.default(model)))
I would like to find the Relative Risk (RR) for each group of employees: Boss, Subord, Subord2, Subord3, Subord4. However, I would like the reference group to be Subord4. I believe the reference is currently set to Boss. How do I change it?
I think this might help
Libraries
library(tidyverse)
Sample data
df <-
structure(list(id = 1:6, Quit = c(0L, 1L, 0L, 0L, 0L, 1L), Boss = c(0L,
0L, 0L, 0L, 0L, 0L), Subord = c(0L, 0L, 0L, 0L, 0L, 0L), Subord2 = c(0L,
1L, 0L, 0L, 0L, 0L), Subord3 = c(1L, 0L, 0L, 1L, 1L, 1L), Subord4 = c(0L,
0L, 0L, 0L, 0L, 0L)), class = "data.frame", row.names = c(NA,
-6L))
Code
df %>%
  # Create a single column of variables: Boss Subord Subord2 Subord3 Subord4
  pivot_longer(cols = -c(id, Quit)) %>%
  # Keep only the rows with value = 1
  filter(value == 1) %>%
  # Make "Subord4" the baseline, as the first level of the factor
  mutate(name = fct_relevel(as.factor(name), "Subord4")) %>%
  glm(data = ., formula = Quit ~ name, family = binomial)
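The same idea also works in base R with relevel(). Below is a minimal sketch on simulated data, assuming the five dummies have already been collapsed into a single factor; the column name role and the simulated values are hypothetical, not taken from the question. Two things worth noting: a model with only Quit ~ Subord compares Subord == 1 against everyone else rather than against Boss, and the exponentiated coefficients of a binomial glm with the default logit link are odds ratios, not relative risks, so the column is labelled OR here.

```r
# Simulated stand-in for the survey data (hypothetical values)
set.seed(42)
roles <- c("Boss", "Subord", "Subord2", "Subord3", "Subord4")
df <- data.frame(
  role = factor(sample(roles, 200, replace = TRUE), levels = roles),
  Quit = rbinom(200, 1, 0.3)
)

# Make Subord4 the reference (first) level of the factor
df$role <- relevel(df$role, ref = "Subord4")

# One coefficient per non-reference group, each compared with Subord4
model <- glm(Quit ~ role, family = binomial, data = df)

# Exponentiated logit coefficients are odds ratios, with Wald intervals
or_table <- exp(cbind(OR = coef(model), confint.default(model)))
or_table
```

If relative risks are really what is needed, a different link function or a different estimator would have to be used; that is outside what this sketch shows.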
I have a matrix (data1) looking like:
gene S869 S907 S909 S016 S090 S160
1 S1 0 0 0 0.000000 0 0
2 S2 0 0 0 0.000000 0 0
3 S3 0 0 0 0.423405 0 0
4 S4 0 0 0 0.000000 0 0
5 S5 0 0 0 0.000000 0 0
6 S6 0 0 0 0.000000 0 0
I have another dataset (data2) looking like:
Cultivar Dose
S869 10
S907 5
S909 7
S016 19
S090 15
S160 12
Then I want to do a linear regression using
for (gene in 1:ngenes){
model = lm(Dose~X[gene,])
}
I want to check regression between genes and dose using these two datasets. So that I get p values for all genes for dose across all varieties. Thank you in advance!
We can use Map to loop over the 'Cultivar' and 'Dose' columns, add a 'Dose' column to the subset of 'data1' selected by 'Cultivar', and apply lm
Map(function(x, y) lm(Dose ~ ., transform(data1[x], Dose = y)),
as.character(data2$Cultivar), data2$Dose)
data
data1 <- structure(list(gene = c("S1", "S2", "S3", "S4", "S5", "S6"),
S869 = c(0L, 0L, 0L, 0L, 0L, 0L), S907 = c(0L, 0L, 0L, 0L,
0L, 0L), S909 = c(0L, 0L, 0L, 0L, 0L, 0L), S016 = c(0, 0,
0.423405, 0, 0, 0), S090 = c(0L, 0L, 0L, 0L, 0L, 0L), S160 = c(0L,
0L, 0L, 0L, 0L, 0L)), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6"))
data2 <- structure(list(Cultivar = c("S869", "S907", "S909", "S016", "S090",
"S160"), Dose = c(10L, 5L, 7L, 19L, 15L, 12L)), class = "data.frame",
row.names = c(NA,
-6L))
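If the goal is one regression per gene (Dose explained by that gene's expression across the six cultivars), a per-row loop may be closer to what the question asks for. The following is only a sketch under that assumption: the expression values, the gene names geneA/geneB/geneC and the object names expr and pvals are made up for illustration. A gene that is constant across cultivars (like most rows in the sample data) has no estimable slope, so it gets NA.

```r
# Hypothetical expression matrix: rows are genes, columns are cultivars
expr <- rbind(
  geneA = c(0.1, 0.0, 0.2, 0.9, 0.7, 0.5),
  geneB = c(0.3, 0.8, 0.1, 0.4, 0.6, 0.2),
  geneC = c(0, 0, 0, 0, 0, 0)
)
colnames(expr) <- c("S869", "S907", "S909", "S016", "S090", "S160")

data2 <- data.frame(
  Cultivar = c("S869", "S907", "S909", "S016", "S090", "S160"),
  Dose = c(10, 5, 7, 19, 15, 12)
)

# Line the doses up with the columns of the expression matrix
dose <- data2$Dose[match(colnames(expr), data2$Cultivar)]

# Fit Dose ~ expression for each gene and keep the slope's p-value
pvals <- apply(expr, 1, function(x) {
  coefs <- summary(lm(dose ~ x))$coefficients
  # A constant gene has an aliased slope, which is dropped from the table
  if (nrow(coefs) < 2) NA_real_ else coefs["x", "Pr(>|t|)"]
})
pvals
```

With thousands of genes the resulting p-values would normally be adjusted for multiple testing, for example with p.adjust().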
How can I create a categorical variable from mutually exclusive dummy variables (taking values 0/1)?
Basically I am looking for the exact opposite of this solution: (https://subscription.packtpub.com/book/big_data_and_business_intelligence/9781787124479/1/01lvl1sec22/creating-dummies-for-categorical-variables).
Would appreciate a base R solution.
For example, I have the following data:
dummy.df <- structure(c(1L, 0L, 1L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 1L, 0L,
0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 1L,
0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 1L, 0L, 1L),
.Dim = c(10L, 4L),
.Dimnames = list(NULL, c("State.NJ", "State.NY", "State.TX", "State.VA")))
State.NJ State.NY State.TX State.VA
[1,] 1 0 0 0
[2,] 0 1 0 0
[3,] 1 0 0 0
[4,] 0 0 0 1
[5,] 0 1 0 0
[6,] 0 0 1 0
[7,] 1 0 0 0
[8,] 0 0 0 1
[9,] 0 0 1 0
[10,] 0 0 0 1
I would like to get the following results
state
1 NJ
2 NY
3 NJ
4 VA
5 NY
6 TX
7 NJ
8 VA
9 TX
10 VA
cat.var <- structure(list(state = structure(c(1L, 2L, 1L, 4L, 2L, 3L, 1L,
4L, 3L, 4L), .Label = c("NJ", "NY", "TX", "VA"), class = "factor")),
class = "data.frame", row.names = c(NA, -10L))
# toy data
df <- data.frame(a = c(1,0,0,0,0), b = c(0,1,0,1,0), c = c(0,0,1,0,1))
df$cat <- apply(df, 1, function(i) names(df)[which(i == 1)])
Result:
> df
a b c cat
1 1 0 0 a
2 0 1 0 b
3 0 0 1 c
4 0 1 0 b
5 0 0 1 c
To generalize, you'll need to play with the df and names(df) part, but you get the drift. One option would be to make a function, e.g.,
catmaker <- function(data, varnames, catname) {
  data[, catname] <- apply(data[, varnames], 1, function(i) varnames[which(i == 1)])
  return(data)
}
newdf <- catmaker(data = df, varnames = c("a", "b", "c"), catname = "newcat")
One nice aspect of the functional approach is that it is robust to variations in the order of names in the vector of column names you feed into it. I.e., varnames = c("c", "a", "b") produces the same result as varnames = c("a", "b", "c").
P.S. You added some example data after I posted this. The function works on your example, as long as you convert dummy.df to a data frame first, e.g., catmaker(data = as.data.frame(dummy.df), varnames = colnames(dummy.df), "State") does the job.
You can use tidyr::pivot_longer:
library(dplyr)
library(tidyr)
as_tibble(dummy.df) %>%
  mutate(id = 1:n()) %>%
  pivot_longer(., -id, values_to = "Value",
               names_to = c("txt", "State"), names_sep = "\\.") %>%
  filter(Value == 1) %>%
  select(State)
#> # A tibble: 10 x 1
#> State
#> <chr>
#> 1 NJ
#> 2 NY
#> 3 NJ
#> 4 VA
#> 5 NY
#> 6 TX
#> 7 NJ
#> 8 VA
#> 9 TX
#> 10 VA
You can do:
states <- names(dummy.df)[max.col(dummy.df)]
Or, if it's a matrix as in your example, you'd need to use colnames():
states <- colnames(dummy.df)[max.col(dummy.df)]
Then just clean it up with sub():
sub(".*\\.", "", states)
"NJ" "NY" "NJ" "VA" "NY" "TX" "NJ" "VA" "TX" "VA"
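Putting those steps together, here is a small self-contained sketch on a made-up matrix (the object names dummy and state are just for illustration). max.col() breaks ties at random by default, which is harmless when every row has exactly one 1, so the sketch checks that assumption first and also asks for ties.method = "first" to keep the result deterministic.

```r
# Toy dummy matrix with exactly one 1 per row
dummy <- matrix(
  c(1L, 0L, 0L,
    0L, 1L, 0L,
    0L, 0L, 1L,
    1L, 0L, 0L),
  ncol = 3, byrow = TRUE,
  dimnames = list(NULL, c("State.NJ", "State.NY", "State.TX"))
)

# The dummies must be mutually exclusive and exhaustive for this to be valid
stopifnot(all(rowSums(dummy) == 1))

# Column holding the 1 in each row, then strip the "State." prefix
picked <- colnames(dummy)[max.col(dummy, ties.method = "first")]
state <- factor(sub("^State\\.", "", picked))
state
```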
EDIT : with your data
One way with model.matrix for dummy creation and matrix multiplication :
dummy.df<-structure(c(1L, 0L, 1L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 1L, 0L,
0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 1L,
0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 1L, 0L, 1L), .Dim = c(10L, 4L
), .Dimnames = list(NULL, c("State.NJ", "State.NY", "State.TX",
"State.VA")))
level_names <- colnames(dummy.df)
# use matrix multiplication to extract wanted level
res <- dummy.df%*%1:ncol(dummy.df)
# clean up
res <- as.numeric(res)
factor(res, labels = level_names)
#> [1] State.NJ State.NY State.NJ State.VA State.NY State.TX State.NJ
#> [8] State.VA State.TX State.VA
#> Levels: State.NJ State.NY State.TX State.VA
General reprex :
# create factor and dummy target y
dfr <- data.frame(vec = gl(n = 3, k = 3, labels = letters[1:3]),
y = 1:9)
dfr
#> vec y
#> 1 a 1
#> 2 a 2
#> 3 a 3
#> 4 b 4
#> 5 b 5
#> 6 b 6
#> 7 c 7
#> 8 c 8
#> 9 c 9
# dummies creation
dfr_dummy <- model.matrix(y ~ 0 + vec, data = dfr)
# use matrix multiplication to extract wanted level
res <- dfr_dummy%*%c(1,2,3)
# clean up
res <- as.numeric(res)
factor(res, labels = letters[1:3])
#> [1] a a a b b b c c c
#> Levels: a b c
I want to find a case insensitive match using grepl().
I have the following list of keywords that I want to find in a Text column of my data frame df.
# There is a long list of words, but for simplification I have provided only a subset.
I, I'm, the, and, to, a, of
I want to have the counts of these words separately for each of the data rows.
I define this word list to be used in the code as:
word_list = c('\\bI\\b','\\bthe\\b','\\band\\b','\\bto\\b','\\ba\\b','\\bof\\b')
# Note that I'm is not currently in this word_list
In my dataframe df I add the columns as below to keep the counts of above words:
df$I = 0
df$IM = 0 # this is where I need help
df$THE = 0
df$AND = 0
df$TO = 0
df$A = 0
df$OF = 0
Then I use the following for-loop for each word of the word list to iterate over each row of the required column.
# for each word of my word_list
for (i in 1:length(word_list)){
# to search in each row of text response
for(j in 1:nrow(df)){
if(grepl(word_list[i], df$Text[j], ignore.case = T)){
df[j,i+4] = df[j,i+4] + 1 # 4 is added to go to the specific column
}#if
}#for
}#for
For a reproducible example dput(df) is as below:
dput(df)
structure(list(cluster3 = c(2L, 1L, 1L, 1L, 2L, 1L, 1L, 2L, 2L, 1L), userID = c(3016094L, 3042038L, 3079341L, 3079396L, 3130832L, 3130864L, 3148118L, 3148914L, 3149040L, 3150222L), Text = structure(c(3L, 4L, 2L, 9L, 6L, 10L, 7L, 1L, 5L, 8L), .Label = c("I'm alright","I'm stressed", "I am a good person.", "I don't care", "I have a difficult task", "I like it", "I think it doesn't matter", "Let's not argue about this", "Let's see if I can run", "No, I'm not in a mood"), class = "factor"), I = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), IM = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), AND = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), THE = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), TO = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), OF = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L)), class = "data.frame", row.names = c(NA, -10L))
I would suggest a more streamlined approach:
## use a named vector for the word patterns
## with the column names you want to add to `df`
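## `I'm` becomes `im` once punctuation is stripped and the text is lower-cased;
## the trailing \\b keeps it from matching words that merely start with "im"
word_list = c('I' = '\\bi\\b', 'THE' = '\\bthe\\b', 'AND' = '\\band\\b',
'TO' = '\\bto\\b', 'A' = '\\ba\\b', 'OF' = '\\bof\\b', 'IM' = '\\bim\\b')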
## use `stringr::str_count` instead of `grepl`
## sapply does the looping and result gathering for us
library(stringr)
results = sapply(word_list, str_count,
string = gsub("[[:punct:]]", "", tolower(df$Text))
)
results
# I THE AND TO A OF IM
# [1,] 1 3 2 1 1 1 0
# [2,] 0 0 1 0 0 0 0
# [3,] 0 0 0 0 0 0 0
# [4,] 2 2 3 2 1 1 1
# [5,] 0 0 0 1 1 0 0
# [6,] 0 3 2 2 0 0 0
# [7,] 1 3 0 1 1 0 0
# [8,] 1 2 0 1 1 1 0
# [9,] 0 0 0 0 0 0 0
# [10,] 0 0 0 1 2 0 0
## put the results into the data frame based on the names
df[colnames(results)] = data.frame(results)
Since we rely on str_count which is vectorized, this should be much faster than the row-by-row approach.
I was able to get my code working by putting the expression in double quotes, so that the apostrophe does not end the string:
word_list = c('\\bI\\b',"\\bI'm\\b",'\\bthe\\b','\\band\\b','\\bto\\b','\\ba\\b','\\bof\\b')
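For reference, the counting can also be done in base R without the nested loops. This is only a sketch: the sample sentences are taken from the dput above, while the helper name count_matches and the reduced pattern list are mine. Note that \\bI\\b also matches the "I" inside "I'm", because the apostrophe counts as a word boundary.

```r
texts <- c("I'm stressed", "I am a good person.",
           "No, I'm not in a mood", "Let's see if I can run")

# Named patterns: the names become the column names of the result
patterns <- c(I = "\\bI\\b", IM = "\\bI'm\\b", A = "\\ba\\b")

# Count case-insensitive matches of one pattern in each string
count_matches <- function(pattern, x) {
  hits <- gregexpr(pattern, x, ignore.case = TRUE, perl = TRUE)
  # gregexpr() returns -1 when there is no match, so count positive positions
  vapply(hits, function(h) sum(h > 0), integer(1))
}

counts <- sapply(patterns, count_matches, x = texts)
counts
```

The result is a matrix with one row per text and one column per pattern, which can be bound onto the data frame with cbind().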
I am attempting to use the spp.est function of the package called "fossil" in RStudio. I have created a matrix called "akimiskibb" of abundance data with species as the columns and sites as the rows. When I try to use the function spp.est, I type this:
spp.est(akimiskibb, rand = 10, abund = TRUE, counter = FALSE, max.est = 'all')
The problem comes in because my abundance data has a lot of zeroes, so I get this error message:
Error in if (max(x) == 1) warning("cannot use incidence data for abundance-based analyses. If the data is incidence based, please run this function again with the option of 'abund=FALSE'") :
missing value where TRUE/FALSE needed
This function has worked in the past with matrices with a lot of zeroes (which are also abundance data, not presence/absence). I don't know what I am doing wrong.
Has anyone experienced something similar and found a way around this?
Thank you,
Kayla
Data:
matrix format:
         sp1 sp2 sp3 sp4 sp5 sp6 sp7 sp8 sp9 sp10 sp11 sp12 sp13 sp14 sp15 sp16 sp17
sample1    0   0   0   0   0   0   0   1   0    0    0    0    0    0    1    0    0
sample 2   0   0   0   1   0   0   1  25   7    0   18   12    0    0    0    1    1
sample3    0   0   0   0   0   0   0   3   0    0    3    1    0    0    0    5    4
         sp18 sp19 sp20 sp21 sp22 sp23 sp24 sp25 sp26 sp27 sp28 sp29 sp30 sp31 sp32
sample1     0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
sample 2    1    0    1    0    0    0    0    0    0    0    0    3    2    0    3
sample3     0    0    1    0    0   11    0    0    0    0    0    0    0    0    1
         sp33 sp34 sp35 sp36 sp37 sp38 sp39 sp40 sp41 sp42 sp43  X
sample1     0    0    0    0    0    0    0    0    0    0    0 NA
sample 2    0    0    3    2    1    0    0    1    8    0    0 NA
sample3     0    0    0    0    0    0    0    0    0    0    0 NA
dput:
structure(list(sp1 = c(0L, 0L, 0L), sp2 = c(0L, 0L, 0L), sp3 = c(0L,
0L, 0L), sp4 = c(0L, 1L, 0L), sp5 = c(0L, 0L, 0L), sp6 = c(0L,
0L, 0L), sp7 = c(0L, 1L, 0L), sp8 = c(1L, 25L, 3L), sp9 = c(0L,
7L, 0L), sp10 = c(0L, 0L, 0L), sp11 = c(0L, 18L, 3L), sp12 = c(0L,
12L, 1L), sp13 = c(0L, 0L, 0L), sp14 = c(0L, 0L, 0L), sp15 = c(1L,
0L, 0L), sp16 = c(0L, 1L, 5L), sp17 = c(0L, 1L, 4L), sp18 = c(0L,
1L, 0L), sp19 = c(0L, 0L, 0L), sp20 = c(0L, 1L, 1L), sp21 = c(0L,
0L, 0L), sp22 = c(0L, 0L, 0L), sp23 = c(0L, 0L, 11L), sp24 = c(0L,
0L, 0L), sp25 = c(0L, 0L, 0L), sp26 = c(0L, 0L, 0L), sp27 = c(0L,
0L, 0L), sp28 = c(0L, 0L, 0L), sp29 = c(0L, 3L, 0L), sp30 = c(0L,
2L, 0L), sp31 = c(0L, 0L, 0L), sp32 = c(0L, 3L, 1L), sp33 = c(0L,
0L, 0L), sp34 = c(0L, 0L, 0L), sp35 = c(0L, 3L, 0L), sp36 = c(0L,
2L, 0L), sp37 = c(0L, 1L, 0L), sp38 = c(0L, 0L, 0L), sp39 = c(0L,
0L, 0L), sp40 = c(0L, 1L, 0L), sp41 = c(0L, 8L, 0L), sp42 = c(0L,
0L, 0L), sp43 = c(0L, 0L, 0L), X = c(NA, NA, NA)), .Names = c("sp1",
"sp2", "sp3", "sp4", "sp5", "sp6", "sp7", "sp8", "sp9", "sp10",
"sp11", "sp12", "sp13", "sp14", "sp15", "sp16", "sp17", "sp18",
"sp19", "sp20", "sp21", "sp22", "sp23", "sp24", "sp25", "sp26",
"sp27", "sp28", "sp29", "sp30", "sp31", "sp32", "sp33", "sp34",
"sp35", "sp36", "sp37", "sp38", "sp39", "sp40", "sp41", "sp42",
"sp43", "X"), class = "data.frame", row.names = c("sample1",
"sample 2", "sample3"))
packages used:
fossil (made in R version 3.4.4)
Version of R: R x64 3.4.1
OK, not familiar with this package so thank you for introducing it to me. Always great to see how R is being used in so many disciplines.
I see two issues with your application. First, according to the fossil documentation for spp.est, your data needs to have your samples as columns and your species as rows. The second issue is the "species" X with the NA values. You need to get rid of it because the function can't handle NAs.
Here's the code:
library(fossil)
library(tidyverse)
# transpose so species are rows, then drop the all-NA "X" row
akimiskibb <- as.data.frame(t(akimiskibb)) %>%
  filter(!is.na(sample1))
spp.est(akimiskibb, rand = 10, abund = TRUE, counter = FALSE, max.est = 'all')
# which results in
N.obs S.obs S.obs(+95%) S.obs(-95%) Chao1 Chao1(upper) Chao1(lower)
[1,] 1 11.8 25.30995 -1.709948 20.90 28.38331 13.41669
[2,] 2 16.0 25.46770 6.532301 27.15 33.37680 20.92320
[3,] 3 20.0 20.00000 20.000000 26.00 29.24037 22.75963
ACE ACE(upper) ACE(lower) Jack1 Jack1(upper) Jack1(lower)
[1,] 14.56286 31.16289 -2.037175 16.82501 36.43129 -2.781268
[2,] 19.19286 30.64000 7.745713 22.21008 35.28448 9.135677
[3,] 22.28571 22.28571 22.285714 25.95082 25.95082 25.950820
attr(,"data.type")
[1] "abundance"
Warning messages:
1: In ests[[k]](b) :
This data appears to be presence/absence based, but this estimator is for abundance data only
2: In ests[[k]](b) :
This data appears to be presence/absence based, but this estimator is for abundance data only
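As a side note on why the original call failed: max() returns NA as soon as the data contains an NA, and if (NA == 1) is exactly what produces "missing value where TRUE/FALSE needed". The zeroes themselves are not the problem. Here is a base R sketch of the same cleanup, run on a cut-down, made-up version of the data (only two species plus the X column) and without calling fossil itself; the names all_na and cleaned are just for illustration.

```r
# Cut-down stand-in for akimiskibb: samples as rows, species as columns
akimiskibb <- data.frame(
  sp4 = c(0L, 1L, 0L),
  sp8 = c(1L, 25L, 3L),
  X   = c(NA, NA, NA),
  row.names = c("sample1", "sample 2", "sample3")
)

# This is what trips the check inside the function: the maximum is NA
max(as.matrix(akimiskibb))

# Drop columns that are entirely NA, then transpose to species x samples
all_na  <- vapply(akimiskibb, function(col) all(is.na(col)), logical(1))
cleaned <- t(as.matrix(akimiskibb[, !all_na, drop = FALSE]))

# No missing values left, so max() is a real number again
stopifnot(!anyNA(cleaned))
cleaned
```

The cleaned matrix can then be passed to spp.est in place of the original object.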