Understanding how to count the number of occurrences of a pattern in a string - R

my input:
library(tidyverse)
library(stringi)
tdf<-data.frame("foo"=c('|ReviewNG-BB.2|ReviewNG-BB.3','|ReviewNG-BB.2|ReviewNG-BB.3','|ReviewNG-BB.2|ReviewNG-BB.3','|ReviewNG-BB.2|ReviewNG-BB.3','|ReviewNG-BB.2|ReviewNG-BB.3','|ReviewNG-BB.2|NG-BB.3','|ReviewNG-BB.2|NG-BB.3','|ReviewNG-BB.2|NG-BB.3','|ReviewNG-BB.2|NG-BB.3','|ReviewNG-BB.2|NG-BB.3','|ReviewNG-BB.2|NG-BB.3','|TI'),
"bar"=c('|AI|BB.2','|AI|BB.2','|AI|BB.2','|AI|BB.2','|AI|BB.2','|AI|BB.2','|AI|BB.2','|AI|ReviewNG-BB.2','|AI|ReviewNG-BB.2','|AI|ReviewNG-BB.2','|AI|ReviewNG-BB.2','|AI'),
"xyz" = c('|ICV|NG-AI','|ICV|NG-AI','|ICV|NG-AI','|ICV|NG-AI','|ICV|NG-AI','|ICV|NG-AI','|ICV|NG-AI','|ReviewNG-ICV|TI|BB.2',
'|ReviewNG-ICV|TI|BB.2','|ReviewNG-ICV|TI|BB.2','|ReviewNG-ICV|TI|BB.2','|ICV'),
"gaz" = c('|BB.3|ReviewNG-AI|NG-TI','|BB.3|ReviewNG-AI|NG-TI','|BB.3|ReviewNG-AI|NG-TI','|BB.3|ReviewNG-AI|NG-TI',
'|BB.3|ReviewNG-AI|NG-TI','|BB.3|ReviewNG-AI|NG-TI','|BB.3|ReviewNG-AI|NG-TI','|NG-BB.2|ICV|AI|TI','|NG-BB.2|ICV|AI|TI','|NG-BB.2|ICV|AI|TI',
'|NG-BB.2|ICV|AI|TI','|BB.2'))
I am trying to count the number of occurrences of each label in my tdf. Every label has 4 "forms": the total count of occurrences, ReviewNG-label, NG-label, and finally the "pure" form, |label or |label|. For example, the label AI has a total over all matches, plus ReviewNG-AI, NG-AI, and the pure form |AI or |AI|. This is my code:
pt_t <- c("AI" )
sum(stringi::stri_count_fixed(tdf, regex(pt_t)))
pt_rng <- c("ReviewNG-AI")
sum(stringi::stri_count_fixed(tdf, regex(pt_rng)))
pt_ng<-c("NG-AI")
sum(stringi::stri_count_fixed(tdf, regex(pt_ng)))
pt<-c("|AI","|AI|")
sum(stringi::stri_count_fixed(tdf, regex(pt)))
And my output:
Warning in stringi::stri_count_fixed(tdf, regex(pt_t)) :
argument is not an atomic vector; coercing
[1] 30
Warning in stringi::stri_count_fixed(tdf, regex(pt_rng)) :
argument is not an atomic vector; coercing
[1] 7
Warning in stringi::stri_count_fixed(tdf, regex(pt_ng)) :
argument is not an atomic vector; coercing
[1] 14
Warning in stringi::stri_count_fixed(tdf, regex(pt)) :
argument is not an atomic vector; coercing
[1] 15
First of all, I don't understand the warning message.
Now let's look at the counts. The total is OK, and ReviewNG-AI is still good. But the next ones are problematic:
for NG-AI, I understand that it double counts NG plus ReviewNG. For the last one, the "pure" count of |AI or |AI|, I don't understand at all how it equals 15 when I count 16 manually.
I also tried stringr from the tidyverse, but the output there is really erroneous:
sum(str_count(tdf,pt))
res<-tdf %>%
summarise(across(everything(),
~sum(str_count(.x, paste(pt)))))
rowSums(res)

Your problem here is the use of a special character in regex: | is reserved for "or" (alternation). If we want to search for a literal |, we need to escape it as \\|. So, for example:
library(dplyr)
library(stringr)
pt <- c("\\|AI", "\\|AI\\|")
Now, we want to count every occurrence of |AI and |AI|, so the search pattern looks like this:
paste(pt, collapse = "|")
#> [1] "\\|AI|\\|AI\\|"
So, putting it all together:
tdf %>%
summarise(across(everything(),
~sum(str_count(.x, paste(pt, collapse = "|")))))
returns
foo bar xyz gaz
1 0 12 0 4
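On the other two points in the question: as far as I can tell, the warning appears because tdf is a data frame (a list) rather than an atomic character vector, so stringi has to coerce it. The NG-AI count includes ReviewNG-AI because the first string is a substring of the second. Here is a minimal sketch of both ideas, assuming a small made-up vector x in place of unlist(tdf). Fixed patterns are matched literally, so | needs no escaping here:

```r
library(stringi)

# Made-up sample standing in for unlist(tdf), which would avoid the coercion warning
x <- c("|AI|BB.2", "|ICV|NG-AI", "|BB.3|ReviewNG-AI|NG-TI", "|NG-BB.2|ICV|AI|TI")

total   <- sum(stri_count_fixed(x, "AI"))           # every form of the label
review  <- sum(stri_count_fixed(x, "ReviewNG-AI"))  # ReviewNG- form only
ng_all  <- sum(stri_count_fixed(x, "NG-AI"))        # also hits ReviewNG-AI
ng_only <- ng_all - review                          # NG- form without the overlap
pure    <- sum(stri_count_fixed(x, "|AI"))          # "|" is literal in fixed mode

print(c(total = total, review = review, ng_only = ng_only, pure = pure))
```

Note that the four forms add up: review + ng_only + pure equals total.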

Here is another kind of solution. Martin has already explained the why and how, so we could try a different strategy.
If all labels are separated by |, we could use pivot_longer and count them. Depending on your desired output:
library(dplyr)
library(tidyr)
tdf %>%
pivot_longer(
everything()
) %>%
mutate(value = sub('\\|', '', value)) %>%
separate_rows(value, sep = "\\|") %>%
group_by(name, value) %>%
summarise(Labels = n())
name value Labels
<chr> <chr> <int>
1 bar AI 12
2 bar BB.2 7
3 bar ReviewNG-BB.2 4
4 foo NG-BB.3 6
5 foo ReviewNG-BB.2 11
6 foo ReviewNG-BB.3 5
7 foo TI 1
8 gaz AI 4
9 gaz BB.2 1
10 gaz BB.3 7
11 gaz ICV 4
12 gaz NG-BB.2 4
13 gaz NG-TI 7
14 gaz ReviewNG-AI 7
15 gaz TI 4
16 xyz BB.2 4
17 xyz ICV 8
18 xyz NG-AI 7
19 xyz ReviewNG-ICV 4
20 xyz TI 4

Related

Error in 2 * "X2B" : non-numeric argument to binary operator

I am trying to look at the baseball data from 1903 through 1960 from the Lahman database. I am doing this for my own research. I want to use the batting table, which does not include batting average, slugging, OBP or OPS.
I want to calculate those, but I first need to get total bases. I am having trouble getting the program to calculate total bases with X2B and X3B.
I've looked into trying as.numeric, but I couldn't get it to work. This is using R and RStudio. I've tried it both with and without quotes around X2B and X3B (the doubles and triples).
batting_1960 <- batting_1903 %>%
filter(yearID <= 1960 & G >= 90) %>%
mutate(Batting_Average = H/AB, TB = (2*"X2B")+(3*"X3B")+HR+(H-"X2B"-"X3B"-HR)) %>%
arrange(yearID, desc(Batting_Average))
I expect that for each row of data, that the total bases will be calculated in a new column but I get the error:
Error in 2 * "X2B" : non-numeric argument to binary operator
This is so that I can eventually calculate OPS, OBP and slugging.
Your code is trying to multiply 2 by the literal string "X2B", which is not going to work. Column names should be unquoted in mutate().
Your error:
> tibble(X2B = 1:10) %>% mutate(TB = 2 * "X2B")
Error in 2 * "X2B" : non-numeric argument to binary operator
Should be, for example:
> tibble(X2B = 1:10) %>% mutate(TB = 2 * X2B)
# A tibble: 10 x 2
X2B TB
<int> <dbl>
1 1 2
2 2 4
3 3 6
4 4 8
5 5 10
6 6 12
7 7 14
8 8 16
9 9 18
10 10 20
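Applied to the original problem, the unquoted version could look like the sketch below. The batting_toy rows are made up for illustration and only stand in for the Lahman batting table. One assumption to flag: total bases are conventionally singles + 2 x doubles + 3 x triples + 4 x home runs, so this sketch weights HR by 4, which differs from the formula in the question.

```r
library(dplyr)

# Hypothetical toy rows standing in for the Lahman batting table
batting_toy <- tibble(
  H   = c(150, 120),
  AB  = c(500, 450),
  X2B = c(30, 20),
  X3B = c(5, 2),
  HR  = c(20, 10)
)

result <- batting_toy %>%
  mutate(
    Batting_Average = H / AB,
    # column names are unquoted; the last term is the number of singles
    TB = (2 * X2B) + (3 * X3B) + (4 * HR) + (H - X2B - X3B - HR)
  )

print(result)
```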

Cut function alternative in R

I have some data in the form:
Person.ID Household.ID Composition
1 4593 1A_0C
2 4992 2A_1C
3 9843 1A_1C
4 8385 2A_2C
5 9823 8A_1C
6 3458 1C_9C
7 7485 2C_0C
: : :
We can think of the composition variable as a count of adults/children, i.e. 2A_1C would equate to two adults and one child.
What I want to do is reduce the number of possible levels of composition. For person 5 we have a composition of 8A_1C, and I am looking for a way to reduce this to 4+A_1C. So, for example, we would have 4+ for any composition value with more than 4 adults (or children).
Person.ID Household.ID Composition
5 9823 4+A_1C
6 3458 1A_4+C
: : :
I am unsure of how to do this in R. I am thinking of using filter() or select() from dplyr. Otherwise I would need to use some sort of regular expression.
Any help would be appreciated. Thanks
Data:
library(dplyr) # provides tibble() and %>%
Person.ID <- c(1,2,3,4,5,6,7) # seven people, matching the other two vectors
Household.ID <- c(4593,4992,9843,8385,9823,3458,7485)
Composition <- c("1A_0C","2A_1C","1A_1C","2A_2C","8A_1C","1A_9C","2A_0C")
dat <- tibble(Person.ID, Household.ID, Composition)
Function:
above4 <- function(f){
# keep only the digits and compare as a number, not as a string
ff <- as.numeric(gsub("[^0-9]","",f))
if(ff>4){return("4+")}
return(as.character(ff))
}
Apply function (done on separated data, but can recombine after):
dat_ <- dat %>% tidyr::separate(., col=Composition,
into=c("Adults", "Children"),
sep="_") %>%
dplyr::mutate(Adults_ = unlist(lapply(Adults,above4)),
Children_ = unlist(lapply(Children,above4)))
You might then use select, filter to get your required dataset.
dat_ %>% dplyr::mutate(Composition_ = paste0(Adults_, "A_", Children_, "C")) %>%
dplyr::select(Person.ID, Household.ID, Composition=Composition_)
# A tibble: 7 x 3
Person.ID Household.ID Composition
<dbl> <dbl> <chr>
1 1. 4593. 1A_0C
2 2. 4992. 2A_1C
3 3. 9843. 1A_1C
4 4. 8385. 2A_2C
5 5. 9823. 4+A_1C
6 6. 3458. 1A_4+C
7 7. 7485. 2A_0C
We can use gsub:
df$Composition <- gsub("(?<!\\d)([5-9]|\\d{2,})(?=[AC])", "4+", df$Composition, perl = TRUE)
This assumes that 2 or more consecutive digits represent a number that's always greater than 4 (i.e. no 01, 02, or 001).
Output:
Person.ID Household.ID Composition
1 1 4593 1A_0C
2 2 4992 2A_1C
3 3 9843 1A_1C
4 4 8385 2A_2C
5 5 9823 4+A_1C
6 6 3458 1C_4+C
7 7 7485 2C_0C
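To see how the pattern treats multi-digit counts and values that are exactly 4, here is a small check on a plain character vector. The comp values are made up for illustration:

```r
# Hypothetical sample values, including two-digit counts and an exact 4
comp <- c("1A_0C", "8A_1C", "1A_9C", "12A_10C", "4A_4C")

# Same pattern as above: a lone digit 5-9, or any run of 2+ digits, before A or C
capped <- gsub("(?<!\\d)([5-9]|\\d{2,})(?=[AC])", "4+", comp, perl = TRUE)

print(capped)
```

Values of exactly 4 are left as they are, so "4+" here means "more than 4", in line with the above4 function in the other answer.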

Dynamically determine if a dataframe column exists and mutate if it does

I have code that pulls and processes data from a database based upon a client name. Some clients may have data that does not include a specific column name, e.g., last_name or first_name. For clients that do not use last_name or first_name, I don't care. For clients that do use either of those fields, I need to mutate() those columns with toupper() so that I can join on those standardized fields later in the ETL process.
Right now, I'm using a series of if() statements and some helper functions to look into the names of a dataframe then mutate if they exist. I'm using if() statements because ifelse() is mostly vectorized and doesn't handle dataframes well.
library(dplyr)
set.seed(256)
b <- data.frame(id = sample(1:100, 5, FALSE),
col_name = sample(1000:9999, 5, FALSE),
another_col = sample(1000:9999, 5, FALSE))
d <- data.frame(id = sample(1:100, 5, FALSE),
col_name = sample(1000:9999, 5, FALSE),
last_name = sample(letters, 5, FALSE))
mutate_first_last <- function(df){
mutate_first_name <- function(df){
df %>%
mutate(first_name = first_name %>% toupper())
}
mutate_last_name <- function(df){
df %>%
mutate(last_name = last_name %>% toupper())
}
n <- c("first_name", "last_name") %in% names(df)
if (n[1] & n[2]) return(df %>% mutate_first_name() %>% mutate_last_name())
if (n[1] & !n[2]) return(df %>% mutate_first_name())
if (!n[1] & n[2]) return(df %>% mutate_last_name())
if (!n[1] & !n[2]) return(df)
}
I get what I expect to get this way
> b %>% mutate_first_last()
id col_name another_col
1 48 8318 6207
2 39 7155 7170
3 16 4486 4321
4 55 2521 8024
5 15 1412 4875
> d %>% mutate_first_last()
id col_name last_name
1 64 7438 A
2 43 4551 Q
3 48 7401 K
4 78 3682 Z
5 87 2554 J
but is this the best way to handle this kind of task? To dynamically look to see if a column name exists in a dataframe then mutate it if it does? It seems strange to have to have multiple if() statements in this function. Is there a more streamlined way to process these data?
You can use mutate_at with one_of, both from dplyr. This mutates a column only if it matches one of c("first_name", "last_name"). If there is no match, it generates a simple warning, which you can ignore or suppress.
library(dplyr)
d %>%
mutate_at(vars(one_of(c("first_name", "last_name"))), toupper)
id col_name last_name
1 19 7461 V
2 52 9651 H
3 56 1901 P
4 13 7866 Z
5 25 9527 U
# example with no match
b %>%
mutate_at(vars(one_of(c("first_name", "last_name"))), toupper)
id col_name another_col
1 34 9315 8686
2 26 5598 4124
3 17 3318 2182
4 32 1418 4369
5 49 4759 6680
Warning message:
Unknown variables: `first_name`, `last_name`
Here are a bunch of other ?select_helpers in dplyr -
These functions allow you to select variables based on their names.
starts_with(): starts with a prefix
ends_with(): ends with a suffix
contains(): contains a literal string
matches(): matches a regular expression
num_range(): a numerical range like x01, x02, x03.
one_of(): variables in character vector.
everything(): all variables.
Update dplyr 1.0.0
In dplyr 1.0, the scoped variants of mutate such as _at or _all were superseded by across().
In addition, the best tidyselect helper for this case is any_of, as it operates on the variables that exist and ignores those that don't (without a warning message).
As a result, you can write the following:
# purrr syntax
d %>% mutate(across(any_of(c("first_name", "last_name")), ~toupper(.x)))
# function name syntax
d %>% mutate(across(any_of(c("first_name", "last_name")), toupper))
which both return the mutated column
id col_name last_name
1 19 4398 Q
2 72 1135 S
3 54 9767 V
4 60 4364 K
5 35 1564 X
while
b %>% mutate(across(any_of(c("first_name", "last_name")), toupper))
ignores the columns and thus returns (without warning message):
id col_name another_col
1 42 7601 4482
2 22 1773 7072
3 47 2719 5884
4 1 9595 5945
5 81 8044 3927
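To replace the original mutate_first_last() entirely, the across()/any_of() call can be wrapped in a small helper. This is only a sketch: the name upper_names and its cols argument are made up here, and b2 and d2 are tiny stand-ins for the b and d data frames in the question.

```r
library(dplyr)

# Hypothetical helper: upper-cases whichever of the listed columns exist
upper_names <- function(df, cols = c("first_name", "last_name")) {
  df %>% mutate(across(any_of(cols), toupper))
}

# Small stand-ins for the question's data frames
b2 <- data.frame(id = 1:2, col_name = c(10, 20))
d2 <- data.frame(id = 1:2, last_name = c("smith", "jones"))

print(upper_names(b2))  # no matching columns, returned unchanged
print(upper_names(d2))  # last_name is upper-cased
```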

R unnest_tokens and calculate positions (start and end location) of each token

How to get the position of all the tokens after using unnest_tokens?
Here is a simple example -
df<-data.frame(id=1,
doc=c("Patient: [** Name **], [** Name **] Acct.#:
[** Medical_Record_Number **] MR #: [** Medical_Record_Number **]
Location: [** Location **] "))
Tokenize by white space using tidytext -
library(tidytext)
tokens_df<-df %>%
unnest_tokens(tokens,doc,token = stringr::str_split,
pattern = "\\s",
to_lower = F, drop = F)
How to get the position of all the tokens?
id tokens start end
1 Patient: 1 8
1 9 9
1 [** 12 14
1 Name 16 19
Here is the non-tidy approach to the problem.
library(stringr) # str_extract_all(), str_locate_all()
library(magrittr) # %>%
regex = "([^\\s]+)"
df_i = str_extract_all(df$doc, regex)
df_ii = str_locate_all(df$doc, regex)
output1 = Map(function(x, y, z){
if(length(y) == 0){
y = NA
}
if(nrow(z) == 0){
z = rbind(z, list(start = NA, end = NA))
}
data.frame(id = x, token = y, z)
}, df$id, df_i, df_ii) %>%
do.call(rbind,.) %>%
merge(df, .)
I think the first answerer here has the right idea that the best approach is to use string handling, rather than tokenization and NLP, if tokens split on whitespace and character positions is the output you want.
If you also do want to use tidy data principles and end up with a data frame, try out something like this:
library(tidyverse)
df <- tibble(id=1,
doc=c("Patient: [** Name **], [** Name **] Acct.#: [** Medical_Record_Number **] "))
df %>%
mutate(tokens = str_extract_all(doc, "([^\\s]+)"),
locations = str_locate_all(doc, "([^\\s]+)"),
locations = map(locations, as.data.frame)) %>%
select(-doc) %>%
unnest(c(tokens, locations))
#> # A tibble: 11 x 4
#> id tokens start end
#> <dbl> <chr> <int> <int>
#> 1 1.00 Patient: 1 8
#> 2 1.00 [** 12 14
#> 3 1.00 Name 16 19
#> 4 1.00 **], 21 24
#> 5 1.00 [** 26 28
#> 6 1.00 Name 30 33
#> 7 1.00 **] 35 37
#> 8 1.00 Acct.#: 39 45
#> 9 1.00 [** 50 52
#> 10 1.00 Medical_Record_Number 54 74
#> 11 1.00 **] 76 78
This will work for multiple documents with id columns for each string, and it is removing actual whitespace from the output because of the way the regex is constructed.
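If you'd rather avoid extra packages, the same start and end positions can be sketched in base R with gregexpr() and regmatches(). The doc string below is a shortened, made-up sample rather than the original document:

```r
# Shortened, made-up sample document
doc <- "Patient: [** Name **]"

# Locate every run of non-whitespace characters
m <- gregexpr("\\S+", doc, perl = TRUE)

tokens <- regmatches(doc, m)[[1]]                     # the matched tokens
start  <- as.integer(m[[1]])                          # 1-based start positions
end    <- start + attr(m[[1]], "match.length") - 1L   # inclusive end positions

positions <- data.frame(token = tokens, start = start, end = end)
print(positions)
```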
EDITED:
In a comment, the original poster asked for an approach that would allow tokenizing by sentence and also keeping track of the positions of each word. The following code does that, in the sense that we get the start and end position for each token within each sentence. Could you use a combination of the sentenceID column with the start and end columns to find what you're looking for?
library(tidyverse)
library(tidytext)
james <- paste0(
"The question thus becomes a verbal one\n",
"again; and our knowledge of all these early stages of thought and feeling\n",
"is in any case so conjectural and imperfect that farther discussion would\n",
"not be worth while.\n",
"\n",
"Religion, therefore, as I now ask you arbitrarily to take it, shall mean\n",
"for us _the feelings, acts, and experiences of individual men in their\n",
"solitude, so far as they apprehend themselves to stand in relation to\n",
"whatever they may consider the divine_. Since the relation may be either\n",
"moral, physical, or ritual, it is evident that out of religion in the\n",
"sense in which we take it, theologies, philosophies, and ecclesiastical\n",
"organizations may secondarily grow.\n"
)
d <- tibble(txt = james)
d %>%
unnest_tokens(sentence, txt, token = "sentences") %>%
mutate(sentenceID = row_number(),
tokens = str_extract_all(sentence, "([^\\s]+)"),
locations = str_locate_all(sentence, "([^\\s]+)"),
locations = map(locations, as.data.frame)) %>%
select(-sentence) %>%
unnest(c(tokens, locations))
#> # A tibble: 112 x 4
#> sentenceID tokens start end
#> <int> <chr> <int> <int>
#> 1 1 the 1 3
#> 2 1 question 5 12
#> 3 1 thus 14 17
#> 4 1 becomes 19 25
#> 5 1 a 27 27
#> 6 1 verbal 29 34
#> 7 1 one 36 38
#> 8 1 again; 40 45
#> 9 1 and 47 49
#> 10 1 our 51 53
#> # ... with 102 more rows
Notice that these aren't quite "tokenized" in the normal sense from unnest_tokens(); they will still have their closing punctuation attached to each word like commas and periods. It seemed like you wanted that from your original question.

filtering a dataset dependant on a value within a string

I am currently working with Google Analytics and R and have a query I hope someone can help me with.
I have exported my data from GA into R and have it in a dataframe ready for processing.
I want to create a for loop which goes through my data and sums a number of columns in my dataframe if one column contains a certain value.
For example, my dataframe looks like this
I have a list of ID's which are the individual 3 digit numbers, which I can use in a for loop.
From my past experience with R, I have been able to filter the data so that I have
data[data$ID == 341,] -> datanew
and I have found some code which can see if there is a certain string within a string producing a bool
grepl(value, chars)
Is there a way to link these up together so that I have a sum code similar to below
aggregate(cbind(users, conversion)~ID,data=datanew,FUN=sum) -> resultforID
Basically taking that data and for every 341 add the users and conversions..
I hope I have explained this the best way possible.
Thanks in advance
The data table has 3 columns: ID, users, and Conversion, with the users and Conversion linked to the IDs.
Some IDs are on their own, e.g. 341, others are 341|246, and some have three numbers separated by |.
# toy data
mydata = data.frame(ID = c("341|243","341|243","341|242","341","243",
"999","111|341|222"),
Users = 10:16,
Conv = 5:11)
# ID Users Conv
# 1 341|243 10 5
# 2 341|243 11 6
# 3 341|242 12 7
# 4 341 13 8
# 5 243 14 9
# 6 999 15 10
# 7 111|341|222 16 11
# are you looking for something like below:
# presume you just want to filter those IDs have 341.
library(dplyr)
mydata[grep("341",mydata$ID),] %>%
group_by(ID) %>%
summarise_each(funs(sum))
# ID Users Conv
# 1 111|341|222 16 11
# 2 341 13 8
# 3 341|242 12 7
# 4 341|243 21 11
If I understand your question correctly, you may want to look at cSplit from my "splitstackshape" package.
Using @KFB's sample data (which is hopefully representative of your actual data), try:
library(splitstackshape)
cSplit(mydata, "ID", "|", "long")[, lapply(.SD, sum), by = ID]
# ID Users Conv
# 1: 341 62 37
# 2: 243 35 20
# 3: 242 12 7
# 4: 999 15 10
# 5: 111 16 11
# 6: 222 16 11
Alternatively, from the Hadleyverse, you can use "dplyr" and "tidyr" together, like this:
library(dplyr)
library(tidyr)
mydata %>%
transform(ID = strsplit(as.character(ID), "|", fixed = TRUE)) %>%
unnest(ID) %>%
group_by(ID) %>%
summarise_each(funs(sum))
# Source: local data frame [6 x 3]
#
# ID Users Conv
# 1 111 16 11
# 2 222 16 11
# 3 242 12 7
# 4 243 35 20
# 5 341 62 37
# 6 999 15 10
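On dplyr 1.0 or later, where summarise_each() and funs() are deprecated, the same result can be sketched with separate_rows() and across(). This assumes the same toy mydata as above:

```r
library(dplyr)
library(tidyr)

# Same toy data as in the first answer
mydata <- data.frame(ID = c("341|243", "341|243", "341|242", "341", "243",
                            "999", "111|341|222"),
                     Users = 10:16,
                     Conv = 5:11)

res <- mydata %>%
  separate_rows(ID, sep = "\\|") %>%        # one row per individual ID
  group_by(ID) %>%
  summarise(across(c(Users, Conv), sum))    # sum both columns within each ID

print(res)
```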
I think this should work:
library(dplyr)
sumdf <- yourdf %>%
group_by(ID) %>%
summarise_each(funs(sum))
I'm not clear about the structure of your ID column, but if you need to just get the numbers you could try this:
library(tidyr)
newdf <- separate(yourdf, ID, c('id1', 'id2'), '\\|') %>% # sep is a regular expression, so | must be escaped
filter(id1 == 341) # optional if you just want one ID
Here are two answers. The first uses subset and the second uses grep with a string.
Initial run:
x1<-sample(1:4,10,replace=TRUE)
x2<-sample(10:40,10)
x3<-sample(10:40,10)
dat<-as.data.frame(cbind(x1,x2,x3))
for(i in unique(dat$x1)) {
dat1<-subset(dat,subset=x1==i)
z<-(aggregate(.~x1,data=dat1,FUN=sum))
assign(paste0('x1',i),z)
}
with GREP
x1<-sample(letters[1:3],10,replace=TRUE)
x2<-sample(10:40,10)
x3<-sample(10:40,10)
dat<-data.frame(x1,x2,x3) # cbind() would coerce every column to character here, breaking sum
for(i in unique(dat$x1)) {
dat1<-dat[grep(i,dat$x1),]
z<-(aggregate(.~x1,data=dat1,FUN=sum))
assign(paste0('x1',i),z) #this will assign separate objects as your aggregates with names based on the string
}
