I am new to text-mining in R. I want to remove stopwords (i.e. extract keywords) from my data frame's column and put those keywords into a new column.
I tried to make a corpus, but it didn't help me.
df$C3 is what I currently have. I would like to add column df$C4, but I can't get it to work.
df <- structure(list(C3 = structure(c(3L, 4L, 1L, 7L, 6L, 9L, 5L, 8L,
10L, 2L), .Label = c("Are doing good", "For the help", "hello everyone",
"hope you all", "I Hope", "I need help", "In life", "It would work",
"On Text-Mining", "Thanks"), class = "factor"), C4 = structure(c(2L,
4L, 1L, 6L, 3L, 7L, 5L, 9L, 8L, 3L), .Label = c("doing good",
"everyone", "help", "hope", "Hope", "life", "Text-Mining", "Thanks",
"work"), class = "factor")), .Names = c("C3", "C4"), row.names = c(NA,
-10L), class = "data.frame")
head(df)
# C3 C4
# 1 hello everyone everyone
# 2 hope you all hope
# 3 Are doing good doing good
# 4 In life life
# 5 I need help help
# 6 On Text-Mining Text-Mining
This solution uses packages dplyr and tidytext.
library(dplyr)
library(tidytext)
# subset of your dataset
dt = data.frame(C1 = c(108,20, 999, 52, 400),
C2 = c(1,3,7, 6, 9),
C3 = c("hello everyone","hope you all","Are doing good","in life","I need help"), stringsAsFactors = F)
# function to combine words (by pasting one next to the other)
f = function(x) { paste(x, collapse = " ") }
dt %>%
unnest_tokens(word, C3) %>% # split phrases into words
filter(!word %in% stop_words$word) %>% # keep appropriate words
group_by(C1, C2) %>% # for each combination of C1 and C2
summarise(word = f(word)) %>% # combine multiple words (if there are multiple)
ungroup() # forget the grouping
# # A tibble: 2 x 3
# C1 C2 word
# <dbl> <dbl> <chr>
# 1 20 3 hope
# 2 52 6 life
The problem here is that the stop word list built into that package (stop_words) also filters out some of the words you want to keep. You therefore need a manual step that names the words to retain. You can do something like this:
dt %>%
unnest_tokens(word, C3) %>% # split phrases into words
filter(!word %in% stop_words$word | word %in% c("everyone","doing","good")) %>% # keep appropriate words
group_by(C1, C2) %>% # for each combination of C1 and C2
summarise(word = f(word)) %>% # combine multiple words (if there are multiple)
ungroup() # forget the grouping
# # A tibble: 4 x 3
# C1 C2 word
# <dbl> <dbl> <chr>
# 1 20 3 hope
# 2 52 6 life
# 3 108 1 everyone
# 4 999 7 doing good
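The keep-list condition above can also be expressed by removing the wanted words from the stop list once, before filtering. A minimal base R sketch of that idea follows; `stop_list`, `keep_words` and `words` are hypothetical stand-ins for illustration (in the answer above the stop list would be `stop_words$word`):

```r
# Hypothetical, deliberately small stand-in for a packaged stop-word list
stop_list  <- c("hello", "everyone", "you", "all", "are", "doing", "good",
                "in", "i", "need", "help")
# Words that should survive even though the stop list contains them
keep_words <- c("everyone", "doing", "good")

# "not a stop word OR in the keep list" is the same as
# "not in (stop list minus keep list)"
custom_stop <- setdiff(stop_list, keep_words)

words <- c("hello", "everyone", "are", "doing", "good")
kept  <- words[!words %in% custom_stop]
kept
```

Building the reduced list once keeps the filter condition short when the keep list grows.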
This is one of the first things I did in R, so it may not be the best approach, but you could try something like:
library(stringi)
df2 <- do.call(rbind, lapply(stop$stop, function(x){
t <- data.frame(c1= df[,1], c2 = df[,2], words = stri_extract(df[,3], coll=x))
t<-na.omit(t)}))
Example data:
df = data.frame(c1 = c(108,20,99), c2 = c(1,3,7), c3 = c("hello everyone", "hope you all", "are doing well"))
stop = data.frame(stop = c("you", "all"))
Afterwards you can reshape df2 using:
df2 = data.frame(c1 = unique(df2$c1), c2 = unique(df2$c2), words = paste(df2$words, collapse= ','))
Then cbind df and df2.
I would use the tm package. It includes a small dictionary of English stopwords. You can replace these stopwords with a single space using gsub():
library(tm)
prep <- tolower(paste(" ", df$C3, " "))
regex_pat <- paste(stopwords("en"), collapse = " | ")
df$C4 <- gsub(regex_pat, " ", prep)
df$C4 <- gsub(regex_pat, " ", df$C4)
# C3 C4
# 1 hello everyone hello everyone
# 2 hope you all hope
# 3 Are doing good good
# 4 In life life
# 5 I need help need help
You can easily add new words like c("hello", "othernewword", stopwords("en")).
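As a package-free alternative to the regex replacement above, the phrases can be split into words and filtered against the stop list. This is only a sketch: `my_stopwords` is a hypothetical, hand-picked list (in practice it could be `stopwords("en")` from tm, possibly extended as described above), and `remove_stopwords` and `df_demo` are names invented for the example.

```r
# Hypothetical stop-word list for illustration only
my_stopwords <- c("i", "you", "all", "are", "in", "on", "for", "the",
                  "it", "would", "hello")

# Split each phrase on whitespace, drop stop words (case-insensitively),
# and paste the remaining words back together
remove_stopwords <- function(x, stopwords) {
  words <- strsplit(as.character(x), "\\s+")
  vapply(words, function(w) {
    paste(w[!tolower(w) %in% stopwords], collapse = " ")
  }, character(1))
}

df_demo <- data.frame(
  C3 = c("hello everyone", "hope you all", "Are doing good",
         "In life", "I need help", "On Text-Mining"),
  stringsAsFactors = FALSE
)
df_demo$C4 <- remove_stopwords(df_demo$C3, my_stopwords)
df_demo
```

Unlike the tolower() call in the gsub() version, this keeps the original capitalisation of the words that remain.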
I need to prepare queries that are made of characters strings (DOI, Digital Object Identifier) stored in a data frame. All strings associated with the same case have to be joined to produce one query.
The df looks like this:
Case  DOI
1     1212313/dfsjk23
1     322332/jdkdsa12
2     21323/xsw.w3
2     311331313/q1231
2     1212121/1231312
The output should be a data frame looking like this:
Case  Query
1     DO=(1212313/dfsjk23 OR 322332/jdkdsa12)
2     DO=(21323/xsw.w3 OR 311331313/q1231 OR 1212121/1231312)
The prefix ("DO="), suffix (")") and "OR" are not critical; I can add them later. But how do I aggregate character strings by case number?
In base R you could do:
aggregate(DOI~Case, df1, function(x) sprintf('DO=(%s)', paste0(x, collapse = ' OR ')))
Case DOI
1 1 DO=(1212313/dfsjk23 OR 322332/jdkdsa12)
2 2 DO=(21323/xsw.w3 OR 311331313/q1231 OR 1212121/1231312)
If you are using R 4.1.0 or later:
aggregate(DOI~Case, df1, \(x)sprintf('DO=(%s)', paste0(x, collapse = ' OR ')))
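The aggregate() call above leaves the result column named DOI, whereas the question asks for Query. A self-contained sketch that also renames the column is shown here; `make_query` and `queries` are hypothetical helper names, and the data is retyped from the question.

```r
df1 <- data.frame(
  Case = c(1L, 1L, 2L, 2L, 2L),
  DOI  = c("1212313/dfsjk23", "322332/jdkdsa12", "21323/xsw.w3",
           "311331313/q1231", "1212121/1231312"),
  stringsAsFactors = FALSE
)

# Collapse all DOIs of one case into a single query string
make_query <- function(doi) sprintf("DO=(%s)", paste(doi, collapse = " OR "))

queries <- aggregate(DOI ~ Case, data = df1, FUN = make_query)
# Rename the aggregated column to match the desired output
names(queries)[names(queries) == "DOI"] <- "Query"
queries
```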
We can use glue with str_c to collapse the 'DOI' column after grouping by 'Case'
library(stringr)
library(dplyr)
df1 %>%
group_by(Case) %>%
summarise(Query = glue::glue("DO=({str_c(DOI, collapse= ' OR ')})"))
-output
## A tibble: 2 x 2
# Case Query
# <int> <glue>
#1 1 DO=(1212313/dfsjk23 OR 322332/jdkdsa12)
#2 2 DO=(21323/xsw.w3 OR 311331313/q1231 OR 1212121/1231312)
data
df1 <- structure(list(Case = c(1L, 1L, 2L, 2L, 2L), DOI = c("1212313/dfsjk23",
"322332/jdkdsa12", "21323/xsw.w3", "311331313/q1231", "1212121/1231312"
)), class = "data.frame", row.names = c(NA, -5L))
Given a CSV with the following structure,
id, postCode, someThing, someOtherThing
1,E3 4AX, cats, dogs
2,E3 4AX, elephants, sheep
3,E8 KAK, mice, rats
4,VH3 2K2, humans, whales
I wish to create two tables, based on whether the value in the postCode column is unique or not. The values of the other columns do not matter to me, but they have to be copied to the new tables.
My end data should look like this, with one table based on unique postCodes:
id, postCode, someThing, someOtherThing
3,E8 KAK, mice, rats
4,VH3 2K2, humans, whales
And another where postCode values are duplicated
id, postCode, someThing, someOtherThing
1,E3 4AX, cats, dogs
2,E3 4AX, elephants, sheep
So far I can load the data but I'm not sure of the next step:
myData <- read.csv("path/to/my.csv",
header=TRUE,
sep=",",
stringsAsFactors=FALSE
)
New to R so help appreciated.
Data in dput format.
df <-
structure(list(id = 1:4, postCode = structure(c(1L, 1L, 2L, 3L
), .Label = c("E3 4AX", "E8 KAK", "VH3 2K2"), class = "factor"),
someThing = structure(c(1L, 2L, 4L, 3L), .Label = c(" cats",
" elephants", " humans", " mice"), class = "factor"),
someOtherThing = structure(c(1L, 3L, 2L, 4L),
.Label = c(" dogs", " rats", " sheep", " whales "
), class = "factor")), class = "data.frame",
row.names = c(NA, -4L))
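If df is the name of your data.frame, which can be created as follows (sep and strip.white are needed so that the comma-separated fields, including the post codes with spaces, are parsed correctly):
df <- read.table(header = TRUE, sep = ",", strip.white = TRUE, text = "
id, postCode, someThing, someOtherThing
1, E3 4AX, cats, dogs
2, E3 4AX, elephants, sheep
3, E8 KAK, mice, rats
4, VH3 2K2, humans, whales
")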
Then the unique and duplicated rows can be found using the dplyr function n(), which returns the number of observations in the current group:
library(dplyr)
uniques = df %>%
group_by(postCode) %>%
filter(n() == 1)
dupes = df %>%
group_by(postCode) %>%
filter(n() > 1)
Unclear why someone edited this response. Maybe they hate tribbles
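For readers without dplyr, the per-group count that n() provides can be approximated in base R with ave(). This is a sketch of the same idea, not the answer's method; `n_per_code`, `uniques_base` and `dupes_base` are hypothetical names and the data is retyped from the question.

```r
df <- data.frame(
  id = 1:4,
  postCode = c("E3 4AX", "E3 4AX", "E8 KAK", "VH3 2K2"),
  someThing = c("cats", "elephants", "mice", "humans"),
  someOtherThing = c("dogs", "sheep", "rats", "whales"),
  stringsAsFactors = FALSE
)

# For every row, count how many rows share its postCode
n_per_code <- ave(rep(1L, nrow(df)), df$postCode, FUN = sum)

uniques_base <- df[n_per_code == 1, ]  # postCode occurs exactly once
dupes_base   <- df[n_per_code > 1, ]   # postCode occurs more than once
```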
If a list of the two data.frames is acceptable, which is arguably better than keeping many related objects in .GlobalEnv, try split.
f <- rev(cumsum(rev(duplicated(df$postCode))))
split(df, f)
#$`0`
# id postCode someThing someOtherThing
#3 3 E8 KAK mice rats
#4 4 VH3 2K2 humans whales
#
#$`1`
# id postCode someThing someOtherThing
#1 1 E3 4AX cats dogs
#2 2 E3 4AX elephants sheep
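The grouping vector f above gives the intended split for this example, where the duplicated post codes come first. A variant that should not depend on row order flags every row whose post code occurs more than once by combining duplicated() in both directions. The sketch below uses hypothetical names (`is_dup`, `tables`) and data retyped from the question.

```r
df <- data.frame(
  id = 1:4,
  postCode = c("E3 4AX", "E3 4AX", "E8 KAK", "VH3 2K2"),
  someThing = c("cats", "elephants", "mice", "humans"),
  someOtherThing = c("dogs", "sheep", "rats", "whales"),
  stringsAsFactors = FALSE
)

# TRUE for every occurrence of a repeated postCode, not only the later ones
is_dup <- duplicated(df$postCode) | duplicated(df$postCode, fromLast = TRUE)

# Named list with one data.frame per group
tables <- split(df, ifelse(is_dup, "duplicated", "unique"))
tables$unique
tables$duplicated
```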
These reports come from QuickBooks, downloaded as Excel files. Notice that the left column is a nested hierarchy encoded by the amount of leading whitespace.
I need to split the Description column into separate columns based on the number of leading spaces.
I've been working with financial reports recently, and this layout is very common and extremely difficult to work with. Is there a package or function for importing this type of data?
Here is example reproducible input dataframe:
df1 <- structure(list(Description = c("asset", " current asset", " bank acc",
" banner", " clearing",
" total bank accounts",
" total current assets"),
Total = c(NA, NA, NA, 10L, 20L, 30L, 30L)),
.Names = c("Description", "Total"),
class = "data.frame",
row.names = c(NA, -7L))
You can try tidyxl and unpivotr for these Excel wrangling tasks. Here are the docs:
unpivotr: https://github.com/nacnudus/unpivotr
tidyxl: https://nacnudus.github.io/tidyxl/
Here's a nice tutorial: https://blog.davisvaughan.com/2018/02/16/tidying-excel-cash-flow-spreadsheets-in-r/
I think the real question is:
"How do I treat the number of leading spaces as an indicator of the nth column?"
If so, try this example. The code could be improved, but the idea is that n leading spaces put the value in column n + 1.
# example input, we will have similar input after reading in
# the Excel sheet into R.
df1 <- data.frame(x = c("x1", " x2", " x2", "  x3", "x1", " x2"),
y = c(NA, 22, 33, 44, 55, 66),
stringsAsFactors = FALSE)
library(dplyr)
cbind(
bind_rows(
lapply(df1$x, function(i){
x <- data.frame(t(strsplit(i, split = " ")[[1]]), stringsAsFactors = FALSE)
colnames(x) <- paste0("col", 1:ncol(x))
x
})
),
df1[, "y", drop = FALSE])
#   col1 col2 col3  y
# 1   x1 <NA> <NA> NA
# 2        x2 <NA> 22
# 3        x2 <NA> 33
# 4             x3 44
# 5   x1 <NA> <NA> 55
# 6        x2 <NA> 66
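The same "leading spaces pick the column" idea can be sketched in base R by counting the spaces directly and writing each label into a matrix cell. This is an illustrative variant, not the answer's code: `level`, `label`, `out` and `result` are hypothetical names, and it assumes exactly one space per nesting level.

```r
x <- c("x1", " x2", " x2", "  x3", "x1", " x2")
y <- c(NA, 22, 33, 44, 55, 66)

# Number of leading spaces + 1 gives the target column (1-based)
level <- nchar(x) - nchar(sub("^ +", "", x)) + 1L
label <- trimws(x)

# Empty character matrix with one column per nesting level
out <- matrix(NA_character_, nrow = length(x), ncol = max(level),
              dimnames = list(NULL, paste0("col", seq_len(max(level)))))
# Matrix indexing: row i, column level[i] receives label[i]
out[cbind(seq_along(x), level)] <- label

result <- data.frame(out, y = y, stringsAsFactors = FALSE)
result
```

Here the unused cells are NA throughout, rather than the mix of empty strings and NA produced by the strsplit() version.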
I have two data frames: df1 and df2
df1<- structure(list(sample_1 = structure(c(7L, 6L, 5L, 1L, 2L, 4L,
3L), .Label = c("P41182;Q9HCP0", "Q09472", "Q9Y6H1;Q5T1J5", "Q9Y6I3",
"Q9Y6Q9", "Q9Y6U3", "Q9Y6W5"), class = "factor"), sample_2 = structure(c(7L,
6L, 4L, 3L, 2L, 5L, 1L), .Label = c("O15143", "P31908", "P3R117",
"P41356;P54612;A41PH2", "P54112", "P61809;Q92831", "Q16835"), class = "factor")), .Names = c("sample_1",
"sample_2"), class = "data.frame", row.names = c(NA, -7L))
df2<- structure(list(subunits..UniProt.IDs. = structure(c(4L, 6L, 5L,
12L, 3L, 9L, 14L, 16L, 15L, 11L, 13L, 8L, 1L, 2L, 10L, 7L), .Label = c("O55102,Q9CWG9,Q5U5M8,Q8VED2,Q91WZ8,Q8R015,Q9R0C0,Q9Z266",
"P30561,O08915,P07901,P11499", "P30561,P53762", "P41182,P56524",
"P41182,Q8WUI4", "P41182,Q9UQL6", "P61160,P61158,O15143,O15144,O15145,P59998,O15511",
"P78537,Q6QNY1,Q6QNY0,Q9NUP1,Q96EV8,Q8TDH9,Q9UL45,O95295", "Q15021,Q9BPX3,Q15003,O95347,Q9NTJ3",
"Q8WMR7,(P67776,P11493),(P54612,P54613)", "Q91VB4,P59438,Q8BLY7",
"Q92793,Q09472,Q9Y6Q9,Q92831", "Q92828,Q13227,O15379,O75376,O60907,Q9BZK7",
"Q92902,Q9NQG7", "Q92903,Q96NY9", "Q969F9,Q9UPZ3,Q86YV9"), class = "factor")), .Names = "subunits..UniProt.IDs.", class = "data.frame", row.names = c(NA,
-16L))
I want to look at each semicolon-separated string in df1 and if it contains a match to one of the comma-separated strings in df2, then remove it. So, my output will look like below:
       sample_1      sample_2
1        Q9Y6W5        Q16835
2        Q9Y6U3        P61809
3               P41356;A41PH2
4        Q9HCP0        P3R117
5                      P31908
6        Q9Y6I3        P54112
7 Q9Y6H1;Q5T1J5
The sample_1 has strings in row 3, 4 and 5 that match one of the strings in df2, and those matching strings are removed.
The sample_2 has strings in row 2, 3 and 7 that match strings in df2, and those matching strings are removed.
First, you could gather all the possible strings to remove:
toRmv <- unique(unlist(strsplit(as.character(df2[,1]), ",", fixed = TRUE)))
toRmv <- gsub("\\W", "", toRmv, perl = TRUE)
Then remove them. I like the stringi package here for its ability to replace multiple strings with an empty string using the handy vectorize_all argument set to FALSE.
library(stringi)
df1[] <- lapply(df1, stri_replace_all_fixed,
pattern = toRmv, replacement = "", vectorize_all = FALSE)
df1
#       sample_1       sample_2
#1        Q9Y6W5         Q16835
#2        Q9Y6U3        P61809;
#3               P41356;;A41PH2
#4       ;Q9HCP0         P3R117
#5                       P31908
#6        Q9Y6I3         P54112
#7 Q9Y6H1;Q5T1J5
Now, it's just a matter of getting rid of leading semicolons (^;), trailing semicolons (;$), and multiple semicolons ((?<=;);):
df1[] <- lapply(df1, gsub, pattern = "^;|;$|(?<=;);", replacement = "", perl = TRUE)
df1
#       sample_1      sample_2
#1        Q9Y6W5        Q16835
#2        Q9Y6U3        P61809
#3               P41356;A41PH2
#4        Q9HCP0        P3R117
#5                      P31908
#6        Q9Y6I3        P54112
#7 Q9Y6H1;Q5T1J5
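To see what the cleanup pattern does in isolation, here is a small sketch applying it to a few strings of the kind left behind by the replacement step. The vector `messy` is made up for illustration.

```r
# Leftovers typical of the replacement step: trailing, leading and doubled ";"
messy <- c("P61809;", ";Q9HCP0", "P41356;;A41PH2", "Q9Y6H1;Q5T1J5", "")

# ^;       a semicolon at the start of the string
# ;$       a semicolon at the end of the string
# (?<=;);  a semicolon directly preceded by another semicolon (needs perl = TRUE)
clean <- gsub("^;|;$|(?<=;);", "", messy, perl = TRUE)
clean
```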
As requested in the comment, here it is in function form. I didn't test this part. Feel free to test and adjust as you see fit:
stringRemove <- function(removeFrom, toRemove) {
library(stringi)
toRemove <- unique(unlist(strsplit(as.character(toRemove), ",", fixed = TRUE)))
toRemove <- gsub("\\W", "", toRemove, perl = TRUE)
removeFrom[] <- lapply(removeFrom, stri_replace_all_fixed,
pattern = toRemove, replacement = "", vectorize_all = FALSE)
removeFrom[] <- lapply(removeFrom, gsub,
pattern = "^;|;$|(?<=;);", replacement = "", perl = TRUE)
removeFrom
}
# use it
stringRemove(removeFrom = df1, toRemove = df2[,1])
Firstly, you should almost certainly rearrange your data so it's tidy, i.e. has a column for each variable and a row for each observation, but not knowing what it is or how it's related, I can't do that for you. Thus, the only way left is to hack through what are effectively list columns:
library(dplyr)
# For each column,
df1 %>% mutate_each(funs(
# convert to character,
as.character(.) %>%
# split each string into a list of strings to evaluate,
strsplit(';') %>%
# loop over the items in each list,
lapply(function(x){
# replacing any in a similarly split and unlisted df2 with NA,
ifelse(x %in% unlist(strsplit(as.character(df2[,1]), '[(),]+')),
NA_character_, x)
}) %>%
# then loop over them again,
sapply(function(x){
# removing NAs where there are non-NA strings.
ifelse(all(is.na(x)), list(NA_character_), list(x[!is.na(x)]))
})))
# sample_1 sample_2
# 1 Q9Y6W5 Q16835
# 2 Q9Y6U3 P61809
# 3 NA P41356, A41PH2
# 4 Q9HCP0 P3R117
# 5 NA P31908
# 6 Q9Y6I3 P54112
# 7 Q9Y6H1, Q5T1J5 NA
If you want to collapse the list columns you end up with back into strings, you can do so with paste, but really, list columns are more useful.
Edit
If your data is big enough that it's worth the annoyance to make it faster, take the munging of df2 out of the chain and store it separately so you don't calculate it for every iteration. Here's a version that does so, built in purrr, which works with lists instead of data.frames and can be faster than mutate_each for non-trivial functions. Edit as you like.
library(purrr)
df2_unlisted <- df2 %>% map(as.character) %>% # convert; unnecessary if stringsAsFactors = FALSE
map(strsplit, '[(),]') %>% # split
unlist() # unlist to vector
df1 %>% map(as.character) %>% # convert; unnecessary if stringsAsFactors = FALSE
map(strsplit, ';') %>% # split
at_depth(2, ~.x[!.x %in% df2_unlisted]) %>% # subset out unwanted
at_depth(2, ~if(is_empty(.x)) NA_character_ else .x) %>% # insert NA for chr(0)
as_data_frame() %>% data.frame() # for printing
Results are identical.
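Since mutate_each(), funs() and at_depth() have been deprecated in later dplyr and purrr releases, a dependency-free sketch of the same split, filter and re-join logic may be useful. It is not the answer's code: `to_remove`, `drop_ids` and `cleaned` are hypothetical names, the data is a trimmed subset retyped from the question, and fully removed cells come back as empty strings rather than NA.

```r
# Trimmed subset of the question's data, as plain character columns
df1 <- data.frame(
  sample_1 = c("Q9Y6W5", "Q9Y6U3", "Q9Y6Q9", "P41182;Q9HCP0"),
  sample_2 = c("Q16835", "P61809;Q92831", "P41356;P54612;A41PH2", "P3R117"),
  stringsAsFactors = FALSE
)
df2_ids <- c("P41182,P56524",
             "Q92793,Q09472,Q9Y6Q9,Q92831",
             "Q8WMR7,(P67776,P11493),(P54612,P54613)")

# Split df2 on commas and parentheses once, outside any loop
to_remove <- unlist(strsplit(df2_ids, "[(),]+"))
to_remove <- to_remove[nzchar(to_remove)]

# Split each cell on ";", keep only IDs that are not in to_remove, re-join
drop_ids <- function(col) {
  vapply(strsplit(as.character(col), ";", fixed = TRUE), function(ids) {
    paste(ids[!ids %in% to_remove], collapse = ";")
  }, character(1))
}

cleaned <- df1
cleaned[] <- lapply(df1, drop_ids)
cleaned
```

Because whole IDs are compared with %in%, an ID that merely contains another ID as a substring is left alone, and no semicolon cleanup is needed afterwards.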
I have two data tables as shown below:
bigrams
w1w2           freq  w1          w2
common names   1     common      names
department of  4     department  of
family name    6     family      name
bigrams = setDT(structure(list(w1w2 = c("common names", "department of", "family name"
), freq = c(1L, 4L, 6L), w1 = c("common", "department", "family"
), w2 = c("names", "of", "name")), .Names = c("w1w2", "freq",
"w1", "w2"), row.names = c(NA, -3L), class = "data.frame"))
unigrams
w1          freq
common      2
department  3
family      4
name        5
names       1
of          9
unigrams = setDT(structure(list(w1 = c("common", "department", "family", "name",
"names", "of"), freq = c(2L, 3L, 4L, 5L, 1L, 9L)), .Names = c("w1",
"freq"), row.names = c(NA, -6L), class = "data.frame"))
desired output
w1w2           freq  w1          w2     w1freq  w2freq
common names   1     common      names  2       1
department of  4     department  of     3       9
family name    6     family      name   4       5
What I have done so far
setkey(bigrams, w1)
setkey(unigrams, w1)
result <- bigrams[unigrams]
This gives me the i.freq column for w1 but when I try to do the same for w2 the i.freq column is updated to reflect the freq of w2.
How can I get freq for both w1 and w2 in separate columns?
Note: I have already seen the solutions to "data.table Lookup value and translate" and "Modify column of a data.table based on another column and add the new column".
You can do two joins; since v1.9.6 of data.table you can use the on= argument to join on columns with differing names.
library(data.table)
bigrams[unigrams, on=c("w1"), nomatch = 0][unigrams, on=c(w2 = "w1"), nomatch = 0]
            w1w2 freq         w1    w2 i.freq i.freq.1
1:   family name    6     family  name      4        5
2:  common names    1     common names      2        1
3: department of    4 department    of      3        9
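If only the two frequency columns are needed, a lookup with match() avoids the joins altogether and keeps the original row order. This is a base R sketch on plain data.frames retyped from the question; the column names `w1freq` and `w2freq` follow the desired output.

```r
bigrams <- data.frame(
  w1w2 = c("common names", "department of", "family name"),
  freq = c(1L, 4L, 6L),
  w1   = c("common", "department", "family"),
  w2   = c("names", "of", "name"),
  stringsAsFactors = FALSE
)
unigrams <- data.frame(
  w1   = c("common", "department", "family", "name", "names", "of"),
  freq = c(2L, 3L, 4L, 5L, 1L, 9L),
  stringsAsFactors = FALSE
)

# match() returns, for each word, its row position in unigrams
bigrams$w1freq <- unigrams$freq[match(bigrams$w1, unigrams$w1)]
bigrams$w2freq <- unigrams$freq[match(bigrams$w2, unigrams$w1)]
bigrams
```

Words missing from unigrams would get NA instead of being dropped, which differs from the nomatch = 0 behaviour of the joins above.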
You can do this with a bit of reshaping.
library(dplyr)
library(tidyr)
bigrams %>%
rename(w1w2_string = w1w2,
w1w2_freq = freq) %>%
gather(order, string,
w1, w2) %>%
left_join(unigrams %>%
rename(string = w1) ) %>%
gather(type, value,
string, freq) %>%
unite(order_type, order, type) %>%
spread(order_type, value)
Edit: Explanation
The first observation you can make is that bigrams in fact contains information about three different units of analysis: a bigram and two unigrams. Convert to long form so that the unit of analysis is a unigram. Then we can merge in the other unigram data. Now note that each unigram row carries two different pieces of information: the frequency of the unigram and the text of the unigram. Convert to long form again so that the unit of analysis is a piece of information about a unigram. Now spread, so that each new column is a type of information about a unigram.