String matching where strings contain punctuation - r

I want to find case-insensitive matches using grepl().
I have the following list of keywords that I want to find in a Text column of my data frame df.
# There is a long list of words, but for simplification I have provided only a subset.
I, I'm, the, and, to, a, of
I want to have the counts of these words separately for each of the data rows.
I define this word list to be used in the code as:
word_list = c('\\bI\\b','\\bthe\\b','\\band\\b','\\bto\\b','\\ba\\b','\\bof\\b')
# Note that I'm is not currently in this word_list
In my dataframe df I add the columns as below to keep the counts of above words:
df$I = 0
df$IM = 0 # this is where I need help
df$THE = 0
df$AND = 0
df$TO = 0
df$A = 0
df$OF = 0
Then I use the following for-loop for each word of the word list to iterate over each row of the required column.
# for each word of my word_list
for (i in 1:length(word_list)){
  # to search in each row of text response
  for(j in 1:nrow(df)){
    if(grepl(word_list[i], df$Text[j], ignore.case = T)){
      df[j,i+4] = df[j,i+4] + 1 # 4 is added to go to the specific column
    }#if
  }#for
}#for
For a reproducible example dput(df) is as below:
dput(df)
structure(list(cluster3 = c(2L, 1L, 1L, 1L, 2L, 1L, 1L, 2L, 2L, 1L), userID = c(3016094L, 3042038L, 3079341L, 3079396L, 3130832L, 3130864L, 3148118L, 3148914L, 3149040L, 3150222L), Text = structure(c(3L, 4L, 2L, 9L, 6L, 10L, 7L, 1L, 5L, 8L), .Label = c("I'm alright","I'm stressed", "I am a good person.", "I don't care", "I have a difficult task", "I like it", "I think it doesn't matter", "Let's not argue about this", "Let's see if I can run", "No, I'm not in a mood"), class = "factor"), I = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), IM = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), AND = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), THE = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), TO = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), OF = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L)), class = "data.frame", row.names = c(NA, -10L))

I would suggest a more streamlined approach:
## use a named vector for the word patterns
## with the column names you want to add to `df`
word_list = c('I' = '\\bi\\b', 'THE' = '\\bthe\\b', 'AND' = '\\band\\b',
              'TO' = '\\bto\\b', 'A' = '\\ba\\b', 'OF' = '\\bof\\b',
              'IM' = '\\bim\\b') # "I'm" becomes "im" once punctuation is stripped below
## use `stringr::str_count` instead of `grepl`
## sapply does the looping and result gathering for us
library(stringr)
results = sapply(word_list, str_count,
string = gsub("[[:punct:]]", "", tolower(df$Text))
)
results
# I THE AND TO A OF IM
# [1,] 1 3 2 1 1 1 0
# [2,] 0 0 1 0 0 0 0
# [3,] 0 0 0 0 0 0 0
# [4,] 2 2 3 2 1 1 1
# [5,] 0 0 0 1 1 0 0
# [6,] 0 3 2 2 0 0 0
# [7,] 1 3 0 1 1 0 0
# [8,] 1 2 0 1 1 1 0
# [9,] 0 0 0 0 0 0 0
# [10,] 0 0 0 1 2 0 0
## put the results into the data frame based on the names
df[colnames(results)] = data.frame(results)
Since we rely on str_count which is vectorized, this should be much faster than the row-by-row approach.

I was able to get my code working by putting the expression in double quotes:
word_list = c('\\bI\\b',"\\bI'm\\b",'\\bthe\\b','\\band\\b','\\bto\\b','\\ba\\b','\\bof\\b')
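One thing to watch with the apostrophe pattern: an apostrophe is not a word character, so '\\bI\\b' also matches the I in I'm, and those rows end up counted under both I and IM. Below is a minimal base R sketch of one way to keep the two apart, assuming a PCRE negative lookahead is acceptable. The sample text, the patterns vector and the count_matches helper are made up for illustration and are not part of the original code.

```r
# Hypothetical sample text (not the asker's full data)
text <- c("I am a good person.", "I'm stressed",
          "No, I'm not in a mood", "I think I can")

# Named patterns; the names become the column names of the result.
# (?!') is a negative lookahead: match "I" only when no apostrophe follows.
patterns <- c(I = "\\bI\\b(?!')", IM = "\\bI'm\\b", A = "\\ba\\b")

# Count every match of one pattern in each element of x
count_matches <- function(pattern, x) {
  m <- gregexpr(pattern, x, ignore.case = TRUE, perl = TRUE)
  # gregexpr returns -1 when there is no match, so count positive positions
  vapply(m, function(hit) sum(hit > 0), integer(1))
}

counts <- sapply(patterns, count_matches, x = text)
counts
```

The result is a matrix with one row per text and one column per pattern, so it can be attached to a data frame with df[colnames(counts)] = data.frame(counts), in the same way as the answer above.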

Related

How to change the reference level for Risk Ratio in logistic regression in R?

Survey.ID Quit Boss Subord Subord2 Subord3 Subord4
1 1 0 0 0 0 1 0
2 2 1 0 0 1 0 0
3 3 0 0 0 0 0 0
4 4 0 0 0 0 1 0
5 5 0 0 0 0 1 0
6 6 1 0 0 0 1 0
I have the df above. Each of the variables is a binary variable that indicates whether someone is a boss or a certain level of subordinate. I am trying to see what is most predictive of someone quitting in the past month. I am using the following logistic regression:
model <- glm(Quit ~ Subord, family=binomial, data = df)
summary(model)
exp(cbind(RR = coef(model), confint.default(model)))
I would like to find the Relative Risk (RR) for each group of employees: Boss, Subord, Subord2, Subord3, Subord4. However, I would like the reference group to be Subord4. I believe the reference is currently set to Boss. How do I change this?
I think this might help
Libraries
library(tidyverse)
Sample data
df <-
structure(list(id = 1:6, Quit = c(0L, 1L, 0L, 0L, 0L, 1L), Boss = c(0L,
0L, 0L, 0L, 0L, 0L), Subord = c(0L, 0L, 0L, 0L, 0L, 0L), Subord2 = c(0L,
1L, 0L, 0L, 0L, 0L), Subord3 = c(1L, 0L, 0L, 1L, 1L, 1L), Subord4 = c(0L,
0L, 0L, 0L, 0L, 0L)), class = "data.frame", row.names = c(NA,
-6L))
Code
df %>%
#Create a single column of variables: Boss Subord Subord2 Subord3 Subord4
pivot_longer(cols = -c(id,Quit)) %>%
#Keeping only those with value = 1
filter(value == 1) %>%
#Making "Subord4" the baseline, as the first level of the factor
mutate(name = fct_relevel(as.factor(name),"Subord4")) %>%
glm(data = .,formula = Quit~name, family=binomial)
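For comparison, here is a minimal base R sketch of the same idea, assuming the dummy columns have already been collapsed into a single role column. The emp data frame and its values are hypothetical. Note that exponentiated coefficients from a binomial glm with the default logit link are odds ratios relative to the reference level, not risk ratios.

```r
# Hypothetical data: one row per employee, one categorical 'role' column
emp <- data.frame(
  Quit = c(0, 1, 0, 0,  1, 1, 0, 1,  0, 1, 0, 1),
  role = rep(c("Boss", "Subord", "Subord4"), each = 4)
)

# relevel() moves the chosen level to the front, which makes it the baseline
emp$role <- relevel(factor(emp$role), ref = "Subord4")
levels(emp$role)

model <- glm(Quit ~ role, family = binomial, data = emp)

# Odds ratios of quitting for each role compared with Subord4
exp(coef(model))
```

With treatment contrasts (the R default), each coefficient compares one role with the first factor level, which is why changing the level order is enough to change the reference group.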

Weighted mean using aggregated

Sorry for asking what might be a very basic question, but I am stuck in a conundrum and cannot seem to get out of it.
I have data that looks like:
Medicine Biology Business sex weights
0 1 0 1 0.5
0 0 1 0 1
1 0 0 1 05
0 1 0 0 0.33
0 0 1 0 0.33
1 0 0 1 1
0 1 0 0 0.33
0 0 1 1 1
1 0 0 1 1
The first three columns are fields of study and the fourth variable is gender. The real data obviously has many more observations.
What I want to get is the mean of each field of study (medicine, biology, business) by the variable sex (so the mean for men and the mean for women). To do so, I have used the following code:
barplot_sex<-aggregate(x=df_dummies[,1:19] , by=list(df$sex),
FUN= function(x) mean(x))
This works perfectly and gives me what I needed. My problem is that I now need to use a weighted mean, but I cannot use
FUN= function(x) weighted.mean(x, weights)
as there are many more observations than fields of study.
The only alternative I managed to do was to edit(boxplot) and change the values manually, but then R doesn't save the changes. Plus, I am sure there must be a trivial way to do exactly what I need.
Any help would be greatly appreciated.
Bests,
Gabriele
Using by.
by(dat, dat$sex, function(x) sapply(x[, 1:3], weighted.mean, x[, "weights"]))
# dat$sex: 0
# Medicine Biology Business
# 0.0000000 0.3316583 0.6683417
# ---------------------------------------------------------------------------------------
# dat$sex: 1
# Medicine Biology Business
# 0.82352941 0.05882353 0.11764706
Data:
dat <- structure(list(Medicine = c(0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L
), Biology = c(1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L), Business = c(0L,
1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L), sex = c(1L, 0L, 1L, 0L, 0L,
1L, 0L, 1L, 1L), weights = c(0.5, 1, 5, 0.33, 0.33, 1, 0.33,
1, 1)), class = "data.frame", row.names = c(NA, -9L))
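If a single matrix is more convenient than the printed by object (for example to pass to barplot), the same calculation can be sketched with split and sapply. This is only an illustration under the assumption that the data match the dat shown above; the names fields and wm are made up here.

```r
# Same values as the 'dat' data frame above
dat <- data.frame(
  Medicine = c(0, 0, 1, 0, 0, 1, 0, 0, 1),
  Biology  = c(1, 0, 0, 1, 0, 0, 1, 0, 0),
  Business = c(0, 1, 0, 0, 1, 0, 0, 1, 0),
  sex      = c(1, 0, 1, 0, 0, 1, 0, 1, 1),
  weights  = c(0.5, 1, 5, 0.33, 0.33, 1, 0.33, 1, 1)
)

fields <- c("Medicine", "Biology", "Business")

# Split the rows by sex, then compute a weighted mean per field in each group,
# using only the weights that belong to that group's rows
wm <- sapply(split(dat, dat$sex), function(g) {
  sapply(g[fields], weighted.mean, w = g$weights)
})

# Rows are fields, columns are the levels of sex ("0" and "1")
wm
```

The key point is that the weights are subset together with the rows, which is what aggregate with FUN = weighted.mean cannot do, because FUN only sees one column at a time.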

how to make a vector of different values in R for linear regression

I have a matrix (data1) looking like:
gene S869 S907 S909 S016 S090 S160
1 S1 0 0 0 0.000000 0 0
2 S2 0 0 0 0.000000 0 0
3 S3 0 0 0 0.423405 0 0
4 S4 0 0 0 0.000000 0 0
5 S5 0 0 0 0.000000 0 0
6 S6 0 0 0 0.000000 0 0
I have another dataset (data2) looking like:
Cultivar Dose
S869 10
S907 5
S909 7
S016 19
S090 15
S160 12
Then I want to do a linear regression using
for (gene in 1:ngenes){
model = lm(Dose~X[gene,])
}
I want to check regression between genes and dose using these two datasets. So that I get p values for all genes for dose across all varieties. Thank you in advance!
We can use Map to loop over the 'Cultivar' and 'Dose' columns of 'data2', subset 'data1' to the matching 'Cultivar' column, add a 'Dose' column to that subset, and apply lm:
Map(function(x, y) lm(Dose ~ ., transform(data1[x], Dose = y)),
as.character(data2$Cultivar), data2$Dose)
data
data1 <- structure(list(gene = c("S1", "S2", "S3", "S4", "S5", "S6"),
S869 = c(0L, 0L, 0L, 0L, 0L, 0L), S907 = c(0L, 0L, 0L, 0L,
0L, 0L), S909 = c(0L, 0L, 0L, 0L, 0L, 0L), S016 = c(0, 0,
0.423405, 0, 0, 0), S090 = c(0L, 0L, 0L, 0L, 0L, 0L), S160 = c(0L,
0L, 0L, 0L, 0L, 0L)), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6"))
data2 <- structure(list(Cultivar = c("S869", "S907", "S909", "S016", "S090",
"S160"), Dose = c(10L, 5L, 7L, 19L, 15L, 12L)), class = "data.frame",
row.names = c(NA,
-6L))
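If the goal is instead one regression per gene (Dose against that gene's values across all cultivars), a minimal sketch could look like the following. It assumes the columns of data1 can be aligned to data2$Cultivar by name. The expression values below are hypothetical, since the posted sample is almost entirely zero, and slope_p is a made-up helper name.

```r
data2 <- data.frame(
  Cultivar = c("S869", "S907", "S909", "S016", "S090", "S160"),
  Dose     = c(10, 5, 7, 19, 15, 12)
)

# Hypothetical expression values; G3 is constant to mimic the all-zero genes
data1 <- data.frame(
  gene = c("G1", "G2", "G3"),
  S869 = c(1.0, 0.2, 0), S907 = c(0.4, 0.9, 0), S909 = c(0.8, 0.1, 0),
  S016 = c(2.1, 0.5, 0), S090 = c(1.4, 0.7, 0), S160 = c(1.3, 0.3, 0)
)

# Numeric matrix with genes in rows, columns ordered like data2$Cultivar
X <- as.matrix(data1[, data2$Cultivar])
rownames(X) <- data1$gene

# p-value of the slope for one gene; NA when the gene does not vary,
# because the slope cannot be estimated in that case
slope_p <- function(expr, dose) {
  if (var(expr) == 0) return(NA_real_)
  fit <- lm(dose ~ expr)
  coef(summary(fit))["expr", "Pr(>|t|)"]
}

pvals <- apply(X, 1, slope_p, dose = data2$Dose)
pvals
```

Indexing data1 by data2$Cultivar keeps each dose paired with the right sample even if the two tables list the cultivars in a different order.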

how to make a vector of different values in R

I have a matrix (data1) looking like:
gene S869 S907 S909 S016 S090 S160
1 S1 0 0 0 0.000000 0 0
2 S2 0 0 0 0.000000 0 0
3 S3 0 0 0 0.423405 0 0
4 S4 0 0 0 0.000000 0 0
5 S5 0 0 0 0.000000 0 0
6 S6 0 0 0 0.000000 0 0
I want to make a vector (Say X) for this matrix. I have another dataset (data2) looking like:
Cultivar Dose
S869 10
S907 5
S909 7
S016 19
S090 15
S160 12
Then I want to do a linear regression using
for (gene in 1:ngenes){
model = lm(Dose~X[gene,])
}
and also want to get the p values from above regression. Thank you in advance!
If we want to use 'Cultivar' to pick the columns, we can loop over the corresponding 'Cultivar' and 'Dose' values with Map, extract the column of 'data1' named by the 'Cultivar' value, create a new 'Dose' column with transform, apply lm, and extract the p-values from the coefficient table:
lst1 <- Map(function(x, y) {
tmpdat <- transform(as.data.frame(data1)[x], Dose = y)
model <- lm(Dose ~ ., data = tmpdat)
coef(summary(model))[, "Pr(>|t|)"]}, data2$Cultivar, data2$Dose)
lst1
#$S869
#[1] 1.218223e-78
#$S907
#[1] 1.218223e-78
#$S909
#[1] 2.265096e-79
#...
data
data1 <- structure(list(gene = c("S1", "S2", "S3", "S4", "S5", "S6"),
S869 = c(0L, 0L, 0L, 0L, 0L, 0L), S907 = c(0L, 0L, 0L, 0L,
0L, 0L), S909 = c(0L, 0L, 0L, 0L, 0L, 0L), S016 = c(0, 0,
0.423405, 0, 0, 0), S090 = c(0L, 0L, 0L, 0L, 0L, 0L), S160 = c(0L,
0L, 0L, 0L, 0L, 0L)), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6"))
data2 <- structure(list(Cultivar = c("S869", "S907", "S909", "S016", "S090",
"S160"), Dose = c(10L, 5L, 7L, 19L, 15L, 12L)), class = "data.frame",
row.names = c(NA,
-6L))

delete the columns which contain just NA or 0 or both values [duplicate]

This question already has an answer here:
Remove the columns with the colsums=0
(1 answer)
Closed 7 years ago.
What should I do if I want to remove the columns that contain only 0, only NA, or a mix of both?
mat <- structure(c(0L, 0L, 1L, 0L, 2L, 0L, 0L, 0L, NA, 0L, 0L, 0L, 0L,
0L, 2L, 0L, 0L, NA, 0L, 0L, 0L, 1L, 0L, 0L, NA, 0L, NA, 0L, 0L,
0L, 0L, NA, 0L, 2L, 0L, 0L), .Dim = c(6L, 6L), .Dimnames = list(
c("A05363", "A05370", "A05380", "A05397", "A05400", "A05426"), c("X1.110590170", "X1.110888172", "X1.110906406", "X1.110993854", "X1.110996710", "X1.111144756")))
My output should be like this:
X1.110590170 X1.110906406 X1.110993854 X1.111144756
A05363 0 0 0 0
A05370 0 0 0 NA
A05380 1 2 0 0
A05397 0 0 1 2
A05400 2 0 0 0
A05426 0 NA 0 0
You can filter out the columns using the apply function along the columns.
You will simply have to use the all function to make sure that all values in the column satisfy the logic: is.na(x) | x == 0.
filter_cols <- apply(mat, 2, function(x) !all(is.na(x) | x == 0))
mat[,filter_cols]
#' X1.110590170 X1.110906406 X1.110993854 X1.111144756
#' A05363 0 0 0 0
#' A05370 0 0 0 NA
#' A05380 1 2 0 0
#' A05397 0 0 1 2
#' A05400 2 0 0 0
#' A05426 0 NA 0 0
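A vectorised variant of the same filter can be sketched with colSums, which avoids the explicit apply call. The matrix below reproduces the values of mat from the question; keep and result are illustrative names.

```r
mat <- matrix(
  c(0, 0, 1, 0, 2, 0,
    0, 0, NA, 0, 0, 0,
    0, 0, 2, 0, 0, NA,
    0, 0, 0, 1, 0, 0,
    NA, 0, NA, 0, 0, 0,
    0, NA, 0, 2, 0, 0),
  nrow = 6,
  dimnames = list(
    c("A05363", "A05370", "A05380", "A05397", "A05400", "A05426"),
    c("X1.110590170", "X1.110888172", "X1.110906406",
      "X1.110993854", "X1.110996710", "X1.111144756")
  )
)

# mat != 0 is TRUE for non-zero values, FALSE for zeros and NA for missing
# values; with na.rm = TRUE the column sum counts the non-zero, non-NA entries
keep <- colSums(mat != 0, na.rm = TRUE) > 0

# drop = FALSE keeps the result a matrix even if only one column survives
result <- mat[, keep, drop = FALSE]
result
```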
