keeping document number in tidytext - r

When I use unnest_tokens() on a list I enter manually, the output includes the row number each word came from.
library(dplyr)
library(tidytext)
library(tidyr)
library(NLP)
library(tm)
library(SnowballC)
library(widyr)
library(textstem)
#test data
text <- c("furloughs", "Working MORE for less pay", "total burnout and exhaustion")
#break text file into single words and list which row they are in
text_df <- tibble(text = text)
tidy_text <- text_df %>%
  mutate_all(as.character) %>%
  mutate(row_name = row_number()) %>%
  unnest_tokens(word, text) %>%
  mutate(word = wordStem(word))
The results look like this, which is what I want.
row_name word
<int> <chr>
1 1 furlough
2 2 work
3 2 more
4 2 for
5 2 less
6 2 pai
7 3 total
8 3 burnout
9 3 and
10 3 exhaust
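(Side note: textstem is loaded above; its lemmatize_words() would return dictionary forms, e.g. "pay" instead of the stem "pai". A minimal sketch, in case that is preferable:
tidy_text <- text_df %>%
  mutate(row_name = row_number()) %>%
  unnest_tokens(word, text) %>%
  mutate(word = lemmatize_words(word))  # textstem's lemmatizer: "pay", not "pai"
)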
But when I try to read in the real responses from a CSV file:
#Import data
text <- read.csv("TextSample.csv", stringsAsFactors=FALSE)
and otherwise use the same code:
#break text file into single words and list which row they are in
text_df <- tibble(text = text)
tidy_text <- text_df %>%
  mutate_all(as.character) %>%
  mutate(row_name = row_number()) %>%
  unnest_tokens(word, text) %>%
  mutate(word = wordStem(word))
I get the entire token list assigned to row 1 and then again assigned to row 2 and so on.
row_name word
<int> <chr>
1 1 c
2 1 furlough
3 1 work
4 1 more
5 1 for
6 1 less
7 1 pai
8 1 total
9 1 burnout
10 1 and
Or, if I move mutate(row_name = row_number()) to after the unnest_tokens() call, I get a row number for each token instead.
word row_name
<chr> <int>
1 c 1
2 furlough 2
3 work 3
4 more 4
5 for 5
6 less 6
7 pai 7
8 total 8
9 burnout 9
10 and 10
What am I missing?

I guess the difference is that when you import the text using text <- read.csv("TextSample.csv", stringsAsFactors = FALSE), text is a data frame, while when you enter it manually it is a character vector.
If you alter the code to text_df <- tibble(text = text$col_name), selecting the column from the data frame (a column is a vector), you should get the same result as before in the CSV case.
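For example, a minimal sketch of the CSV case, assuming the responses live in a column called col_name (substitute your actual column name):
#Import data
text <- read.csv("TextSample.csv", stringsAsFactors = FALSE)
#break text file into single words and list which row they are in
text_df <- tibble(text = text$col_name)  # pull the column (a character vector)
tidy_text <- text_df %>%
  mutate(row_name = row_number()) %>%
  unnest_tokens(word, text) %>%
  mutate(word = wordStem(word))
Now row_number() counts responses again, rather than numbering the single cell that held the whole data frame.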

Related

dplyr::full_join two data frames with part-match in the "by" argument in R

I would like to join two data sets that look like the following data sets. The matching rule would be that the Item variable from mykey matches the first part of the Item entry in mydata to some degree.
mydata <- tibble(Item = c("ab_kssv", "ab_kd", "cde_kh", "cde_ksa", "cde"),
                 Answer = c(1, 2, 3, 4, 5),
                 Avg = rep(-100, length(Item)))
mykey <- tibble(Item = c("ab", "cde"),
                Avg = c(0, 10))
The result should be the following:
Item Answer Avg
1 ab_kssv 1 0
2 ab_kd 2 0
3 cde_kh 3 10
4 cde_ksa 4 10
5 cde 5 10
I looked at these three SO questions, but did not find a nice solution there. I also briefly tried the fuzzyjoin package, but that did not work. Finally, I have a for-loop-based solution:
library(stringr)  # for str_starts()
for (currLine in 1:nrow(mydata)) {
  mydata$Avg[currLine] <- mykey$Avg[str_starts(mydata$Item[currLine], mykey$Item)]
}
It does the job, but is not nice to read or understand, and I wonder if there is a possibility to make the "by" argument of full_join() from the dplyr package a bit more tolerant in its matching. Any help will be appreciated!
Using fuzzyjoin::regex_left_join, you could do the following.
Note: I renamed the Item column in your mykey dataset to regex to make clear that this is the regex to match by, and I added a "^" to ensure that we match at the beginning of the Item column in the mydata dataset.
library(fuzzyjoin)
library(dplyr)
mykey <- mykey %>%
  rename(regex = Item) %>%
  mutate(regex = paste0("^", regex))
mydata %>%
  select(-Avg) %>%
  regex_left_join(mykey, by = c(Item = "regex")) %>%
  select(-regex)
#> # A tibble: 5 × 3
#> Item Answer Avg
#> <chr> <dbl> <dbl>
#> 1 ab_kssv 1 0
#> 2 ab_kd 2 0
#> 3 cde_kh 3 10
#> 4 cde_ksa 4 10
#> 5 cde 5 10
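Alternatively, if you would rather avoid fuzzyjoin, here is a sketch with plain dplyr (assuming dplyr >= 1.1.0 for cross_join()): pair every row of mydata with every key, then keep the pairs where the key is a prefix of Item, the same rule the for loop applies.
library(dplyr)
library(stringr)
mydata %>%
  select(-Avg) %>%
  cross_join(mykey) %>%                   # duplicated names become Item.x / Item.y
  filter(str_starts(Item.x, Item.y)) %>%  # keep rows where the key prefixes Item
  select(Item = Item.x, Answer, Avg)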

Split columns considering only the first dot in R using separate

This is my dataframe:
df <- tibble(col1 = c("1. word", "2. word", "3. word", "4. word",
                      "5. N. word", "6. word", "7. word", "8. word"))
I need to split it into two columns using the separate function, naming one Number and the other Words. I've been trying this, but it's not working:
df %>% separate(col = col1 , into = c('Number','Words'), sep = "^. ")
The problem is that the fifth entry has two dots, and I don't know how to handle that in the regex.
Any help?
Here is an alternative using readr's parse_number and a regex:
library(dplyr)
library(readr)
df %>%
  mutate(Numbers = parse_number(col1), .before = 1) %>%
  mutate(col1 = gsub('\\d+\\. ', '', col1))
Numbers col1
<dbl> <chr>
1 1 word
2 2 word
3 3 word
4 4 word
5 5 N. word
6 6 word
7 7 word
8 8 word
A tidyverse approach would be to first clean the data then separate.
df %>%
  mutate(col1 = gsub("\\s.*(?=word)", "", col1, perl = TRUE)) %>%
  tidyr::separate(col1, into = c("Number", "Words"), sep = "\\.")
Result:
# A tibble: 8 x 2
Number Words
<chr> <chr>
1 1 word
2 2 word
3 3 word
4 4 word
5 5 word
6 6 word
7 7 word
8 8 word
I'm assuming that you would like to keep the cumbersome "N." in the result. For that, my advice is to use extract instead of separate:
df %>%
  extract(
    col = col1,
    into = c('Number', 'Words'),
    regex = "([0-9]+)\\. (.*)"
  )
The regular expression ([0-9]+)\\. (.*) looks first for a number, which goes into the first column, followed by a dot and a space (\\. ) that are discarded; everything after that goes into the second column.
The result:
# A tibble: 8 × 2
Number Words
<chr> <chr>
1 1 word
2 2 word
3 3 word
4 4 word
5 5 N. word
6 6 word
7 7 word
8 8 word
Try read.table + sub. Since sub replaces only the first "." in each string, the second dot in "5. N. word" survives, and read.table then splits on the inserted comma:
> read.table(text = sub("\\.", ",", df$col1), sep = ",")
V1 V2
1 1 word
2 2 word
3 3 word
4 4 word
5 5 N. word
6 6 word
7 7 word
8 8 word
I am not sure how to do this with tidyr, but the following should work with base R.
df$col1 <- gsub('N\\. ', '', df$col1)                            # drop the stray "N. " (dot escaped)
df$Numbers <- as.numeric(sapply(strsplit(df$col1, ' '), '[', 1)) # piece before the space
df$Words <- sapply(strsplit(df$col1, ' '), '[', 2)               # piece after the space
df$col1 <- NULL
Result
> head(df)
Numbers Words
1 1 word
2 2 word
3 3 word
4 4 word
5 5 word
6 6 word
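For completeness, separate() itself can be told to split only at the first ". " via extra = "merge", which folds everything after the first match back into the last column and so keeps "N. word" intact. A minimal sketch:
library(tidyr)
df %>%
  separate(col1, into = c("Number", "Words"), sep = "\\. ", extra = "merge")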

Tidytext - set expressions as a single token

I am trying to separate my text data into tokens using the unnest_tokens function from the tidytext package. The thing is that some expressions appear multiple times, and I would like to keep them as a single token instead of multiple tokens.
Normal outcome:
df <- data.frame(
  Id = c(1, 2),
  Text = c('A first nice text', 'A second nice text')
)
df %>%
  unnest_tokens(Word, Text)
Id Word
1 1 a
2 1 first
3 1 nice
4 1 text
5 2 a
6 2 second
7 2 nice
8 2 text
What I would like (expression = "nice text"):
df <- data.frame(
  Id = c(1, 2),
  Text = c('A first nice text', 'A second nice text')
)
df %>%
  unnest_tokens(Word, Text)
Id Word
1 1 a
2 1 first
3 1 nice text
4 2 a
5 2 second
6 2 nice text
Here's a concise solution based on negative lookahead, (?!...), to disallow separate_rows from separating Text on a whitespace \\s that has "nice" to its left and "text" to its right (\\b are word-boundary anchors, in case you have, say, "nice texts", which you do want to separate):
library(tidyr)
df %>%
  separate_rows(Text, sep = "(?!\\bnice\\b)\\s(?!\\btext\\b)")
# A tibble: 6 × 2
Id Text
<dbl> <chr>
1 1 A
2 1 first
3 1 nice text
4 2 A
5 2 second
6 2 nice text
A more advanced regex is with (*SKIP)(*F):
df %>%
  separate_rows(Text, sep = "(\\bnice text\\b)(*SKIP)(*F)|\\s")
For more info: How do (*SKIP) or (*F) work on regex? In short, the first alternative matches the whole phrase "nice text" and then (*SKIP)(*F) forces that match to fail without retrying inside it, so only the remaining \\s alternative can act as a separator.
A bit verbose, and there might be an option to exclude certain phrases in the unnest_tokens, but it does the trick:
library(tidyverse)
library(tidytext)
df <- data.frame(Id = c(1, 2),
                 Text = c('A first nice text', 'A second nice text')) %>%
  unnest_tokens('Word', Text)
df %>%
  group_by(Id) %>%
  summarize(Word = paste(if_else(lag(Word) == 'nice' & Word == 'text', 'nice text', Word))) %>%
  mutate(temp_id = row_number()) %>%
  filter(temp_id != temp_id[Word == 'nice text'] - 1) %>%
  ungroup() %>%
  select(-temp_id)
which gives:
# A tibble: 6 x 2
Id Word
<dbl> <chr>
1 1 a
2 1 first
3 1 nice text
4 2 a
5 2 second
6 2 nice text
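Another option is the placeholder trick used for "/" in the next question: protect the phrase before tokenizing and restore it afterwards. A sketch, with "nice_text" as an arbitrary placeholder (the default word tokenizer keeps underscores inside tokens):
library(dplyr)
library(tidytext)
library(stringr)
df %>%
  mutate(Text = str_replace_all(Text, "nice text", "nice_text")) %>%  # protect the phrase
  unnest_tokens(Word, Text) %>%
  mutate(Word = str_replace_all(Word, "nice_text", "nice text"))      # restore it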

How do I keep certain special characters when making ngrams using tidytext::unnest_tokens()?

I'm working on text that has character combinations like "3/8" and "5/8" when referring to particular sizes of things and I'm making bigrams to help analyze the text. I'd like to not have the "/" character removed but am not finding a way to do that. Here is an example:
library(tidyverse)
library(tidytext)
tibble(text = "My example is 3/8 pipe and 5/8 wrench") %>%
  unnest_tokens(bigrams, text, token = "ngrams", n = 2)
Here is the output:
# A tibble: 9 x 1
bigrams
<chr>
1 my example
2 example is
3 is 3
4 3 8
5 8 pipe
6 pipe and
7 and 5
8 5 8
9 8 wrench
Thank you for your input.
Edit: I've found one way around this, but it is crude, and I would love to hear more elegant solutions.
library(tidyverse)
library(tidytext)
library(stringr)
tibble(text = "My example is 3/8 pipe and 5/8 wrench") %>%
  mutate(text = str_replace_all(text, "\\/", "forwardslash")) %>%
  unnest_tokens(bigrams, text, token = "ngrams", n = 2) %>%
  mutate(bigrams = str_replace_all(bigrams, "forwardslash", "/"))
Output:
# A tibble: 7 x 1
bigrams
<chr>
1 my example
2 example is
3 is 3/8
4 3/8 pipe
5 pipe and
6 and 5/8
7 5/8 wrench
We may also use chartr for the replacement:
library(tidytext)
tibble(text = "My example is 3/8 pipe and 5/8 wrench") %>%
  mutate(text = chartr("/", "_", text)) %>%
  unnest_tokens(bigrams, text, token = "ngrams", n = 2) %>%
  mutate(bigrams = chartr("_", "/", bigrams))
Output:
# A tibble: 7 × 1
bigrams
<chr>
1 my example
2 example is
3 is 3/8
4 3/8 pipe
5 pipe and
6 and 5/8
7 5/8 wrench
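If you prefer to avoid the placeholder round-trip entirely, one sketch is to tokenize on whitespace only (token = "regex") and then build the bigrams yourself with lead():
library(dplyr)
library(tidytext)
tibble(text = "My example is 3/8 pipe and 5/8 wrench") %>%
  unnest_tokens(word, text, token = "regex", pattern = "\\s+") %>%  # split on whitespace only
  mutate(bigrams = paste(word, lead(word))) %>%                     # pair each word with the next
  filter(!is.na(lead(word))) %>%                                    # drop the dangling last row
  select(bigrams)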

Replace data entry errors with the most common value - dplyr

I have a data frame which contains some data entry errors.
I wish to replace these outlier values per group with the most common value per group.
My data looks as follows:
df <- data.frame(
  CODE = c("J1745", "J1745", "J1745", "J1745", "J1100", "J1100", "J1100", "J1100", "J1100", "J1100"),
  NDC = c(1234, 1234, 1234, 1234, 5678, 5678, 5678, 5678, 5678, 5678),
  DOSAGE = c("10ML", "10 ML", "10 ML", "10 ML", "5 ML", "5 ML", "5 ML", "5 ML", "50 ML", "5 ML"),
  DESC = c("TEXT1", "TEXT 1", "TEXT 1", "TEXT 1", "TEXT 2", "TEXT 2", "TEXT 2", "TEXT 2", "TEXT 10", "TEXT 2")
)
As you can see my DOSAGE and DESC columns contain some inconsistencies and I would like to replace them with the most common value within each group.
My desired output looks as follows:
CODE NDC DOSAGE DESC
1 J1745 1234 10 ML TEXT 1
2 J1745 1234 10 ML TEXT 1
3 J1745 1234 10 ML TEXT 1
4 J1745 1234 10 ML TEXT 1
5 J1100 5678 5 ML TEXT 2
6 J1100 5678 5 ML TEXT 2
7 J1100 5678 5 ML TEXT 2
8 J1100 5678 5 ML TEXT 2
9 J1100 5678 5 ML TEXT 2
10 J1100 5678 5 ML TEXT 2
I agree with the comment that this is potentially dangerous.
The code below replaces elements that have <= a specified number of occurrences with the most common value. I use base-R machinery within the replacement function because that's what I know how to do.
repl_common <- function(x, n = 1) {
  tt <- tapply(x, x, length)       ## count number of instances of each value
  m <- names(tt)[which.max(tt)]    ## find the mode
  x[tt[as.character(x)] <= n] <- m ## replace values occurring <= n times
  return(x)
}
## apply by group across specified columns
df %>%
  group_by(CODE) %>%
  mutate(across(c(DOSAGE, DESC), repl_common))
You can use the Mode function from here to get the most common value.
Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}
Apply this function by group.
library(dplyr)
df %>%
  group_by(CODE, NDC) %>%
  mutate(across(c(DOSAGE, DESC), Mode)) %>%
  ungroup()
# CODE NDC DOSAGE DESC
# <chr> <dbl> <chr> <chr>
# 1 J1745 1234 10 ML TEXT 1
# 2 J1745 1234 10 ML TEXT 1
# 3 J1745 1234 10 ML TEXT 1
# 4 J1745 1234 10 ML TEXT 1
# 5 J1100 5678 5 ML TEXT 2
# 6 J1100 5678 5 ML TEXT 2
# 7 J1100 5678 5 ML TEXT 2
# 8 J1100 5678 5 ML TEXT 2
# 9 J1100 5678 5 ML TEXT 2
#10 J1100 5678 5 ML TEXT 2
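Given the "potentially dangerous" caveat above, it may be worth inspecting the per-group value frequencies before overwriting anything, for example:
library(dplyr)
# list each distinct combination with its frequency; rare rows are the suspects
df %>% count(CODE, NDC, DOSAGE, DESC, sort = TRUE)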
