I have a data frame with some repeated entries that (inconsistently) carry additional information. I would like to get rid of that extra information and keep the simplest version of each entry.
db <- data.frame(company=c("ENTRY_X","ENTRY_X COUNTY_1","COUNTY_2 ENTRY_X","ENTRY_Y"))
db_desiderata <- data.frame(company=c(rep("ENTRY_X",3),"ENTRY_Y"))
Entries are possibly lengthy strings (some with spaces). Some examples are: "General Motors Company" and "General Motors".
I managed to isolate all the entries that need to be substituted with their substring (in db$included), and I plan to run the substitution recursively.
Attempted code (it all works; I just get stuck on how to proceed):
db$included <- lapply(db$company, function(x) c(grep(x,db$company,value=T)))
db$length <- lapply(db$included, function(x) length(unlist(x)))
db$included <- ifelse(db$length==1,NA,db$included)
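For illustration, a rough sketch of one possible finishing step, assuming plain substring containment decides which version is simplest ('simplest' is just an illustrative column name; the answers below take different routes):
# Rough sketch (assumption: plain substring containment decides which
# version is simplest): keep the shortest entry contained in each company.
db$simplest <- sapply(db$company, function(x) {
  hits <- db$company[sapply(db$company, grepl, x = x, fixed = TRUE)]
  hits[which.min(nchar(hits))]
})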
The following should work if the data strictly conforms to these patterns:
The desired name must come first in its run of alternative names.
The desired name must be the shortest in its run of alternative names, and a name must not be followed by a shorter name that is a substring of it.
I'll use a variation of Chuck P's data to illustrate how this works and the problems if the patterns aren't followed.
db <- data.frame(company = c("General Foods","More General Foods","General Foods Cereal Division","General Auto",
"General Motors Company", "General Motors", "European General Motors Company",
"General", "Asia General Toys") )
companies <- Reduce( f = function(y,x) {if(grepl(pattern = y, x=x)) y else x},
x=db$company, accumulate = TRUE)
which gives
companies
[1] General Foods General Foods General Foods General Auto General Motors Company
[6] General Motors General Motors General General
I think I understand your situation a little better after your comment, but I would still be very wary of a fully automated solution. One slip, or one term that is too general (pun intended), and you're hosed...
I've taken your early work and done a little renaming. Think of your original length column as a measure of potential: I'd review the potential column with a human eye and pick and choose the places to replace. I'd approach the replacement with stringr::str_replace_all. If you use a named vector as I've shown below, you should be able to handle a wide array of cases with cut and paste. The pattern "^.*General Motors.*$" simply matches the whole string whenever "General Motors" appears anywhere in it, front or back. You can work iteratively and just keep adding to the named vector until you have it cleaned.
library(dplyr)
library(stringr)
db <- data.frame(company = c("General Foods","More General Foods","General Foods Cereal Division","General", "General Auto", "General Motors Company", "General Motors", "European General Motors Company"))
db$similar_company <- sapply(db$company, function(x) c(grep(x, db$company, value=T)), simplify = TRUE)
db$potential <- sapply(db$similar_company, function(x) length(unlist(x)), simplify = TRUE)
glimpse(db)
#> Rows: 8
#> Columns: 3
#> $ company <chr> "General Foods", "More General Foods", "General Foods…
#> $ similar_company <named list> [<"General Foods", "More General Foods", "Gene…
#> $ potential <int> 3, 1, 1, 8, 1, 2, 3, 1
db %>% arrange(desc(potential)) %>% select(-similar_company)
#> company potential
#> 1 General 8
#> 2 General Foods 3
#> 3 General Motors 3
#> 4 General Motors Company 2
#> 5 More General Foods 1
#> 6 General Foods Cereal Division 1
#> 7 General Auto 1
#> 8 European General Motors Company 1
db$newcompany <- str_replace_all(db$company,
                                 c("^.*General Foods.*$"  = "General Foods",
                                   "^.*General Motors.*$" = "General Motors"))
db %>% select(company, newcompany)
#> company newcompany
#> 1 General Foods General Foods
#> 2 More General Foods General Foods
#> 3 General Foods Cereal Division General Foods
#> 4 General General
#> 5 General Auto General Auto
#> 6 General Motors Company General Motors
#> 7 General Motors General Motors
#> 8 European General Motors Company General Motors
Created on 2020-05-08 by the reprex package (v0.3.0)
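To keep cleaning iteratively, you just grow the named vector; for example (the "General Auto" rule is a hypothetical extra entry):
rules <- c("^.*General Foods.*$"  = "General Foods",
           "^.*General Motors.*$" = "General Motors",
           "^.*General Auto.*$"   = "General Auto")  # hypothetical extra rule
db$newcompany <- str_replace_all(db$company, rules)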
I need to mutate a new column "Group" based on those keywords. I tried using %in% but did not get the data I expected.
I want to create an extra column named 'group' in my df data frame.
In this column, I want to label every row using some keywords
(from the keyword vectors, or perhaps another keyword data frame).
For example:
library(tibble)
df <- tibble(Title = c("Iran: How we are uncovering the protests and crackdowns",
"Deepak Nirula: The man who brought burgers and pizzas to India",
"Phil Foden: Manchester City midfielder signs new deal with club until 2027",
"The Danish tradition we all need now",
"Slovakia LGBT attack"),
Text = c("Iranian authorities have been disrupting the internet service in order to limit the flow of information and control the narrative, but Iranians are still sending BBC Persian videos of protests happening across the country via messaging apps. Videos are also being posted frequently on social media.
Before a video can be used in any reports, journalists need to establish where and when it was filmed.They can pinpoint the location by looking for landmarks and signs in the footage and checking them against satellite images, street-level photos and previous footage. Weather reports, the position of the sun and the angles of shadows it creates can be used to confirm the timing.",
"For anyone who grew up in capital Delhi during the 1970s and 1980s, Nirula's - run by the family of Deepak Nirula who died last week - is more than a restaurant. It's an emotion.
The restaurant transformed the eating-out culture in the city and introduced an entire generation to fast food, American style, before McDonald's and KFC came into the country. For many it was synonymous with its hot chocolate fudge.",
"Stockport-born Foden, who has scored two goals in 18 caps for England, has won 11 trophies with City, including four Premier League titles, four EFL Cups and the FA Cup.He has also won the Premier League Young Player of the Season and PFA Young Player of the Year awards in each of the last two seasons.
City boss Pep Guardiola handed him his debut as a 17-year-old and Foden credited the Spaniard for his impressive development over the last five years.",
"Norwegian playwright and poet Henrik Ibsen popularised the term /friluftsliv/ in the 1850s to describe the value of spending time in remote locations for spiritual and physical wellbeing. It literally translates to /open-air living/, and today, Scandinavians value connecting to nature in different ways – something we all need right now as we emerge from an era of lockdowns and inactivity.",
"The men were shot dead in the capital Bratislava on Wednesday, in a suspected hate crime.Organisers estimated that 20,000 people took part in the vigil, mourning the men's deaths and demanding action on LGBT rights.Slovak President Zuzana Caputova, who has raised the rainbow flag over her office, spoke at the event.")
)
keyword1 <- c("authorities", "Iranian", "Iraq", "control", "Riots")
keyword2 <- c("McDonald's","KFC", "McCafé", "fast food")
keyword3 <- c("caps", "trophies", "season", "seasons")
keyword4 <- c("travel", "landscape", "living", "spiritual")
keyword5 <- c("LGBT", "lesbian", "les", "rainbow", "Gay", "Bisexual","Transgender")
I need to mutate a new column "Group" using those keywords:
if it matches keyword1, label it "Politics";
if it matches keyword2, label it "Food";
if it matches keyword3, label it "Sport";
if it matches keyword4, label it "Travel";
if it matches keyword5, label it "LGBT".
Can it also ignore case?
Below is the expected output:
Title         Text        Group
Iran: How..   Iranian...  Politics
Deepak Nir..  For any...  Food
Phil Foden..  Stockpo...  Sport
The Danish..  Norwegi...  Travel
Slovakia L..  The men...  LGBT
Thanks to everyone who spends time on this.
You could try this:
df %>%
  rowwise() %>%
  mutate(
    ## add column with words found in title or text (splitting by non-word character):
    words = list(strsplit(split = '\\W', paste(Title, Text)) %>% unlist),
    group = {
      categories <- list(keyword1, keyword2, keyword3, keyword4, keyword5)
      ## i indexes those items (= keyword vectors) of list 'categories'
      ## which share at least one word with column Title or Text (so that length > 0)
      i <- categories %>%
        lapply(\(category) length(intersect(unlist(words), category))) %>%
        as.logical
      ## pick group name via index; join with ',' if more than one category applies
      c('Politics', 'Food', 'Sport', 'Travel', 'LGBT')[i] %>% paste(collapse = ',')
    }
  )
output:
## # A tibble: 5 x 4
## # Rowwise:
## Title Text words group
## <chr> <chr> <lis> <chr>
## 1 Iran: How we are uncovering the protests and crackdowns "Ira~ <chr> Poli~
## 2 Deepak Nirula: The man who brought burgers and pizzas to In~ "For~ <chr> Food
## 3 Phil Foden: Manchester City midfielder signs new deal with ~ "Sto~ <chr> Sport
## 4 The Danish tradition we all need now "Nor~ <chr> Trav~
## 5 Slovakia LGBT attack                                          "The~ <chr> LGBT
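The asker also wanted case-insensitive matching, and intersect() is case-sensitive. A hedged variant (an addition, not part of the original answer) lowercases both sides before intersecting:
df %>%
  rowwise() %>%
  mutate(
    words = list(unlist(strsplit(paste(Title, Text), split = '\\W'))),
    group = {
      categories <- list(keyword1, keyword2, keyword3, keyword4, keyword5)
      ## lowercase document words and keywords so matching ignores case:
      i <- categories %>%
        lapply(\(category) length(intersect(tolower(unlist(words)),
                                            tolower(category)))) %>%
        as.logical()
      paste(c('Politics', 'Food', 'Sport', 'Travel', 'LGBT')[i], collapse = ',')
    }
  )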
Check this out. The basic idea is to define all the keyword* vectors case-insensitively (hence the (?i) in the patterns) as alternation patterns (hence the | used to collapse them), with word boundaries (hence the \\b before and after the alternatives, so that "caps" is matched but not, for example, "capsize"), and then use nested ifelse statements to assign the Group labels:
library(tidyverse)
df %>%
mutate(
All = str_c(Title, Text),
Group = ifelse(str_detect(All, str_c("(?i)\\b(", str_c(keyword1, collapse = "|"), ")\\b")), "Politics",
ifelse(str_detect(All, str_c("(?i)\\b(", str_c(keyword2, collapse = "|"), ")\\b")), "Food",
ifelse(str_detect(All, str_c("(?i)\\b(", str_c(keyword3, collapse = "|"), ")\\b")), "Sport",
ifelse(str_detect(All, str_c("(?i)\\b(", str_c(keyword4, collapse = "|"), ")\\b")), "Travel", "LGBT"))))
) %>%
select(Group)
# A tibble: 5 × 1
Group
<chr>
1 Politics
2 Food
3 Sport
4 Travel
5 LGBT
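If you prefer to avoid the nesting, the same logic can be expressed with a small pattern-builder helper and dplyr::case_when() (an equivalent sketch, not the original answer):
pat <- function(kw) str_c("(?i)\\b(", str_c(kw, collapse = "|"), ")\\b")
df %>%
  mutate(
    All = str_c(Title, Text),
    Group = case_when(
      str_detect(All, pat(keyword1)) ~ "Politics",
      str_detect(All, pat(keyword2)) ~ "Food",
      str_detect(All, pat(keyword3)) ~ "Sport",
      str_detect(All, pat(keyword4)) ~ "Travel",
      TRUE ~ "LGBT"
    )
  ) %>%
  select(Group)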
I have this character vector of lines from a journal:
test_1 <- c(" Journal of Neonatal Nursing 27 (2021) 106–110",
" Contents lists available at ScienceDirect",
" Journal of Neonatal Nursing",
" journal homepage: www.elsevier.com/locate/jnn",
"Comparison of inter-facility transports of critically ill neonates who died",
"after admission vs. survivors", "Robert Schultz a, *, Jennifer Berk-King a, Laura Wallace a, Girija Natarajan a, b",
"a", " Children’s Hospital of Michigan, Detroit, MI, USA",
"b", " Division of Neonatology, Wayne State University School of Medicine, Detroit, MI, USA",
"A R T I C L E I N F O A B S T R A C T",
"Keywords: Objective: To compare characteristics before, during and after inter-facility transports (IFT), and changes in the",
"Inter-facility transport Transport Risk Index of Physiologic Stability (TRIPS) before and after inter-facility transports (IFT) in infants",
"Neonatal intensive care who died within 7 days of admission to a level IV NICU versus matched survivors.",
"Mortality", " Study design: This retrospective case-control study included infants who died within 7 days of IFT and controls",
" matched for gestational age and reason for admission. Unplanned events were temperature or respiratory de",
" rangements. Therapeutic interventions included increased respiratory support, resuscitation or blood product",
" transfusion.",
" Results: Our cohort was predominantly preterm and male. Cases had a higher rate of resuscitation, lower Apgar",
" scores, more respiratory acidosis, lower BP and higher TRIPS, compared to controls. Deterioration in TRIPS was",
" independently associated with male gender and unplanned events; not with patient group.",
" Conclusions: Rates of unplanned events, therapeutic interventions, and deterioration in TRIPS following IFT by a",
" transport team are comparable in cases and controls.",
" outcomes. The Transport Risk Index of Physiologic Stability (TRIPS) is",
"1. Introduction an assessment measure of infant status before and after transport (Lee"
)
I want to extract the keywords from these lines, which are Inter-facility transport, Neonatal intensive care, and Mortality. I've tried to get the line that has "Keywords" with test_1[str_detect(test_1, "^Keywords:")]. Now I want to get all the keywords below this line and above "1. Introduction".
What regex or stringr functions will do this?
Thanks
If I understood correctly, you are in effect scanning text extracted from a journal PDF. I think you should find a better way to scan your PDFs.
Until then, the best option could be this:
library(stringr)
# get the line after ^Keywords:
start <- which(str_detect(test_1, "^Keywords:")) +1
# get the line before ^1. Introduction
end <- which(str_detect(test_1, "^1\\. Introduction")) - 1
# get the lines in between
x <- test_1[start:end]
# Extract keywords
x <- str_trim(str_sub(x, 1, 60))
x <- x[x!=""]
x
#> [1] "Inter-facility transport" "Neonatal intensive care" "Mortality"
EDIT:
You can define a function to find the index of the line at which Keywords occurs and the indices of the lines below that line:
find_keywords <- function(pattern, text) {
index <- which(grepl(pattern, text))
sort(c(index + 1, index + 2, index + 3)) # If you suspect there are more than three keywords, then just `index + ...`
}
Based on that function, you can extract the keywords:
library(stringr)
str_extract(test_1[find_keywords(pattern = "^Keywords:", text = test_1)], "^\\S+")
[1] "Inter-facility" "Neonatal" "Mortality"
I have tried to resolve this problem all day but without any improvement.
I am trying to replace the following abbreviations with the corresponding desired words in my dataset:
Abbreviation   Desired word
USA            United States of America
H2O            Water
Type 3         Type 3 Disease
T3             Type 3 Disease
bp             blood pressure
The input data is for example
[1] I have type 3, its considered the highest severe stage of the disease.
[2] Drinking more H2O will make your skin glow.
[3] Do I have T2 or T3? Please someone help.
[4] We don't have this on the USA but I've heard that will be available in the next 3 years.
[5] Having a high bp means that I will have to look after my diet?
The desired output is
[1] i have type 3 disease, its considered the highest severe stage
of the disease.
[2] drinking more water will make your skin glow.
[3] do I have type 3 disease? please someone help.
[4] we don't have this in the united states of america but i've heard that will be available in the next 3 years.
[5] having a high blood pressure means that I will have to look after my diet?
I have tried the following code but without success:
data <- read.csv("C:/xxxxxxx", header = TRUE)
lowercase= tolower(data$MESSAGE)
dict=list("\\busa\\b"= "united states of america", "\\bh2o\\b"=
"water", "\\btype 3\\b|\\bt3\\"= "type 3 disease", "\\bbp\\b"=
"blood pressure")
for(i in 1:length(dict1)){
lowercasea= gsub(paste0("\\b", names(dict)[i], "\\b"),
dict[[i]], lowercase)}
I know that I am definitely doing something wrong. Could anyone guide me on this? Thank you in advance.
If you need to replace only whole words (e.g. bp in Some bp. and not in bpcatalogue), you will have to build a regular expression out of the abbreviations using word boundaries. Since you have multi-word abbreviations, you must also sort them by length in descending order (otherwise a shorter alternative such as type could trigger a replacement before the longer type three).
An example code:
abbreviations <- c("USA", "H2O", "Type 3", "T3", "bp")
desired_words <- c("United States of America", "Water", "Type 3 Disease", "Type 3 Disease", "blood pressure")
df <- data.frame(abbreviations, desired_words, stringsAsFactors = FALSE)
x <- 'Abbreviations: USA, H2O, Type 3, T3, bp'
sort.by.length.desc <- function (v) v[order( -nchar(v)) ]
library(stringr)
str_replace_all(x,
paste0("\\b(",paste(sort.by.length.desc(abbreviations), collapse="|"), ")\\b"),
function(z) df$desired_words[df$abbreviations==z][[1]][1]
)
The paste0("\\b(",paste(sort.by.length.desc(abbreviations), collapse="|"), ")\\b") code creates a regex like \b(Type 3|USA|H2O|T3|bp)\b, it matches Type 3, or USA, etc. as whole word only as \b is a word boundary. If a match is found, stringr::str_replace_all replaces it with the corresponding desired_word.
I have a dataframe with a column with some text in it. I want to do three data pre-processing steps:
1) remove words that occur only once
2) remove words with low inverse document frequency (IDF)
3) remove words that occur most frequently
This is an example of the data:
head(stormfront_data$stormfront_self_content)
Output:
[1] " , , stormfront! thread members post introduction, \".\" stumbled white networking site, reading & decided register account, largest networking site white brothers, sisters! read : : guidelines posting - stormfront introduction stormfront - stormfront main board consists forums, -forums : newslinks & articles - stormfront ideology philosophy - stormfront activism - stormfront network local level: local regional - stormfront international - stormfront , . addition main board supply social groups utilized networking. final note: steps sustaining member, core member site online, affords additional online features. sf: shopping cart stormfront!"
[2] "bonjour warm brother ! forward speaking !"
[3] " check time time forums. frequently moved columbia distinctly numbered. groups gatherings "
[4] " ! site pretty nice. amount news articles. main concern moment islamification."
[5] " , discovered site weeks ago. finally decided join found article wanted share . proud race long time idea site people shared views existed."
[6] " white brothers, names jay member years, bit info ? stormfront meet ups ? stay strong guys jay, uk"
Any help would be greatly appreciated, as I am not too familiar with R.
Here's a solution to requirement 1) in several steps:
Step 1: clean data by removing anything that is not alphanumeric (\\W):
data2 <- trimws(paste0(gsub("\\W+", " ", data), collapse = ""))
Step 2: Make a sorted frequency list of the words:
fw <- as.data.frame(sort(table(unlist(strsplit(data2, "\\s+"))), decreasing = TRUE))
Step 3: define a pattern to match (namely, all the words that occur only once), making sure you wrap them in word-boundary markers (\\b) so that only exact matches get matched (e.g., network but not networking):
pattern <- paste0("\\b(", paste0(fw$Var1[fw$Freq==1], collapse = "|"), ")\\b")
Step 4: remove matched words:
data3 <- gsub(pattern, "", data2)
Step 5: clean up by removing superfluous spaces:
data4 <- trimws(gsub("\\s{1,}", " ", data3))
Result:
[1] "stormfront introduction white networking site decided networking site white brothers stormfront introduction stormfront stormfront main board forums forums articles stormfront stormfront stormfront local local stormfront stormfront main board groups networking member member site online online stormfront time time forums groups site articles main site decided time site white brothers jay member stormfront jay"
Here is an approach with tidytext
library(tidytext)
library(dplyr)
word_count <- tibble(document = seq(1,nrow(data)), text = data) %>%
unnest_tokens(word, text) %>%
count(document, word, sort = TRUE)
total_count <- tibble(document = seq(1,nrow(data)), text = data) %>%
unnest_tokens(word, text) %>%
group_by(word) %>%
summarize(total = n())
words <- left_join(word_count,total_count)
words %>%
bind_tf_idf(word, document, n)
# A tibble: 111 x 7
document word n total tf idf tf_idf
<int> <chr> <int> <int> <dbl> <dbl> <dbl>
1 1 stormfront 10 11 0.139 1.10 0.153
2 1 networking 3 3 0.0417 1.79 0.0747
3 1 site 3 6 0.0417 0.693 0.0289
4 1 board 2 2 0.0278 1.79 0.0498
5 1 forums 2 3 0.0278 1.10 0.0305
6 1 introduction 2 2 0.0278 1.79 0.0498
7 1 local 2 2 0.0278 1.79 0.0498
8 1 main 2 3 0.0278 1.10 0.0305
9 1 member 2 3 0.0278 1.10 0.0305
10 1 online 2 2 0.0278 1.79 0.0498
# … with 101 more rows
From here it is trivial to filter with dplyr::filter, but since you don't define any specific criteria other than "only once", I'll leave that to you.
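For instance, one hedged illustration of such filtering (the idf cutoff of 0.5 is an arbitrary placeholder):
words %>%
  bind_tf_idf(word, document, n) %>%
  filter(total > 1,           # drop words occurring only once
         total < max(total),  # drop the most frequent word(s)
         idf > 0.5)           # drop low-idf words (placeholder threshold)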
Data
data <- structure(c(" , , stormfront! thread members post introduction, \".\" stumbled white networking site, reading & decided register account, largest networking site white brothers, sisters! read : : guidelines posting - stormfront introduction stormfront - stormfront main board consists forums, -forums : newslinks & articles - stormfront ideology philosophy - stormfront activism - stormfront network local level: local regional - stormfront international - stormfront , . addition main board supply social groups utilized networking. final note: steps sustaining member, core member site online, affords additional online features. sf: shopping cart stormfront!",
"bonjour warm brother ! forward speaking !", " check time time forums. frequently moved columbia distinctly numbered. groups gatherings ",
" ! site pretty nice. amount news articles. main concern moment islamification.",
" , discovered site weeks ago. finally decided join found article wanted share . proud race long time idea site people shared views existed.",
" white brothers, names jay member years, bit info ? stormfront meet ups ? stay strong guys jay, uk"
), .Dim = c(6L, 1L))
Base R solution:
# Remove double spacing and punctuation at the start of strings:
# cleaned_str => character vector
cstr <- trimws(gsub("\\s*[[:punct:]]+", "", trimws(gsub('\\s+|^\\s*[[:punct:]]+|"',
        ' ', data), "both")), "both")
# Calculate the document frequency: document_freq => data.frame
document_freq <- data.frame(table(unlist(sapply(cstr, function(x){
unique(unlist(strsplit(x, "[^a-z]+")))}))))
# Store the inverse document frequency as a vector: idf => double vector:
document_freq$idf <- log(length(cstr)/document_freq$Freq)
# For each record remove terms that occur only once, occur the maximum number
# of times a word occurs in the dataset, or words with a "low" idf:
# pp_records => character vector
pp_records <- do.call("rbind", lapply(cstr, function(x){
# Store the term and corresponding term frequency as a data.frame: tf_dataf => data.frame
tf_dataf <- data.frame(table(as.character(na.omit(gsub("^$", NA_character_,
unlist(strsplit(x, "[^a-z]+")))))),
stringsAsFactors = FALSE)
# Store a vector containing each term's idf: idf => double vector
tf_dataf$idf <- document_freq$idf[match(tf_dataf$Var1, document_freq$Var1)]
# Explicitly return the ppd vector: .GlobalEnv() => character vector
return(
data.frame(
cleaned_record = x,
pp_records =
paste0(unique(unlist(
strsplit(gsub("\\s+", " ",
trimws(
gsub(paste0(tf_dataf$Var1[tf_dataf$Freq == 1 |
tf_dataf$idf < (quantile(tf_dataf$idf, .25) - (1.5 * IQR(tf_dataf$idf))) |
tf_dataf$Freq == max(tf_dataf$Freq)],
collapse = "|"), "", x), "both"
)), "\\s")
)), collapse = " "),
row.names = NULL,
stringsAsFactors = FALSE
)
)
}
))
# Column bind cleaned strings with the original records: ppd_cleaned_df => data.frame
ppd_cleaned_df <- cbind(orig_record = data, pp_records)
# Output to console: ppd_cleaned_df => stdout (console)
ppd_cleaned_df
This question is a possible duplicate of Lemmatizer in R or python (am, are, is -> be?), but I'm adding it again since the previous one was closed as too broad, and its only answer is not efficient (it accesses an external website, which is too slow for the very large corpus I need lemmas for). So part of this question will be similar to the one mentioned above.
According to Wikipedia, lemmatization is defined as:
Lemmatisation (or lemmatization) in linguistics, is the process of grouping together the different inflected forms of a word so they can be analysed as a single item.
A simple Google search for lemmatization in R only points to the R package wordnet. When I tried this package, expecting that a character vector c("run", "ran", "running") passed to the lemmatization function would yield c("run", "run", "run"), I found that it only provides functionality similar to the grepl function, through various filter names and a dictionary.
An example from the wordnet package, which gives a maximum of 5 words starting with "car", as the filter name itself explains:
filter <- getTermFilter("StartsWithFilter", "car", TRUE)
terms <- getIndexTerms("NOUN", 5, filter)
sapply(terms, getLemma)
The above is NOT the lemmatization I'm looking for. What I'm looking for is, using R, to find the true roots of words: e.g. from c("run", "ran", "running") to c("run", "run", "run").
Hello, you can try the koRpus package, which allows you to use TreeTagger:
tagged.results <- treetag(c("run", "ran", "running"), treetagger = "manual", format = "obj",
                          TT.tknz = FALSE, lang = "en",
                          TT.options = list(path = "./TreeTagger", preset = "en"))
tagged.results@TT.res
## token tag lemma lttr wclass desc stop stem
## 1 run NN run 3 noun Noun, singular or mass NA NA
## 2 ran VVD run 3 verb Verb, past tense NA NA
## 3 running VVG run 7 verb Verb, gerund or present participle NA NA
See the lemma column for the result you're asking for.
As a previous post mentioned, the function lemmatize_words() from the R package textstem can perform this and give you what I understand to be your desired results:
library(textstem)
vector <- c("run", "ran", "running")
lemmatize_words(vector)
## [1] "run" "run" "run"
@Andy and @Arunkumar are correct when they say the textstem library can be used to perform stemming and/or lemmatization. However, lemmatize_words() will only work on a vector of words. In a corpus, we do not have vectors of words; we have strings, with each string being a document's content. Hence, to perform lemmatization on a corpus, you can pass the function lemmatize_strings() as an argument to tm_map() of the tm package.
> corpus[[1]]
[1] " earnest roughshod document serves workable primer regions recent history make
terrific th-grade learning tool samuel beckett applied iranian voting process bard
black comedy willie loved another trumpet blast may new mexican cinema -bornin "
> corpus <- tm_map(corpus, lemmatize_strings)
> corpus[[1]]
[1] "earnest roughshod document serve workable primer region recent history make
terrific th - grade learn tool samuel beckett apply iranian vote process bard black
comedy willie love another trumpet blast may new mexican cinema - bornin"
Do not forget to run the following line of code after you have done lemmatization:
> corpus <- tm_map(corpus, PlainTextDocument)
This is because, in order to create a document-term matrix, you need a 'PlainTextDocument'-type object, and that type is lost after you use lemmatize_strings() (to be more specific, the corpus object no longer contains the content and metadata of each document; it is now just a structure containing the documents' content, which is not the type of object DocumentTermMatrix() takes as an argument).
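Putting the sequence together (a minimal sketch; docs is an assumed character vector of document contents, and the object names are illustrative):
library(tm)
library(textstem)
corpus <- VCorpus(VectorSource(docs))        # docs: assumed character vector
corpus <- tm_map(corpus, lemmatize_strings)  # lemmatize each document's text
corpus <- tm_map(corpus, PlainTextDocument)  # restore 'PlainTextDocument' type
dtm <- DocumentTermMatrix(corpus)            # now builds as expected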
Hope this helps!
Maybe stemming is enough for you? Typical natural language processing tasks make do with stemmed texts. You can find several packages in the CRAN Task View on NLP: http://cran.r-project.org/web/views/NaturalLanguageProcessing.html
If you really do require something more complex, there are specialized solutions based on mapping sentences to neural nets. As far as I know, these require massive amounts of training data. There is a lot of open software created and made available by the Stanford NLP Group.
If you really want to dig into the topic, you can work through the event archives linked in the same Stanford NLP Group publications section. There are some books on the topic as well.
I think the answers here are a bit outdated. You should be using the R package udpipe now - available at https://CRAN.R-project.org/package=udpipe - see https://github.com/bnosac/udpipe or the docs at https://bnosac.github.io/udpipe/en
Notice the difference between the word meeting (NOUN) and the word meet (VERB) in the following example when lemmatising versus stemming, and the annoying mangling of the word 'someone' to 'someon' by the stemmer.
library(udpipe)
x <- c(doc_a = "In our last meeting, someone said that we are meeting again tomorrow",
doc_b = "It's better to be good at being the best")
anno <- udpipe(x, "english")
anno[, c("doc_id", "sentence_id", "token", "lemma", "upos")]
#> doc_id sentence_id token lemma upos
#> 1 doc_a 1 In in ADP
#> 2 doc_a 1 our we PRON
#> 3 doc_a 1 last last ADJ
#> 4 doc_a 1 meeting meeting NOUN
#> 5 doc_a 1 , , PUNCT
#> 6 doc_a 1 someone someone PRON
#> 7 doc_a 1 said say VERB
#> 8 doc_a 1 that that SCONJ
#> 9 doc_a 1 we we PRON
#> 10 doc_a 1 are be AUX
#> 11 doc_a 1 meeting meet VERB
#> 12 doc_a 1 again again ADV
#> 13 doc_a 1 tomorrow tomorrow NOUN
#> 14 doc_b 1 It it PRON
#> 15 doc_b 1 's be AUX
#> 16 doc_b 1 better better ADJ
#> 17 doc_b 1 to to PART
#> 18 doc_b 1 be be AUX
#> 19 doc_b 1 good good ADJ
#> 20 doc_b 1 at at SCONJ
#> 21 doc_b 1 being be AUX
#> 22 doc_b 1 the the DET
#> 23 doc_b 1 best best ADJ
lemmatisation <- paste.data.frame(anno, term = "lemma",
group = c("doc_id", "sentence_id"))
lemmatisation
#> doc_id sentence_id
#> 1 doc_a 1
#> 2 doc_b 1
#> lemma
#> 1 in we last meeting , someone say that we be meet again tomorrow
#> 2 it be better to be good at be the best
library(SnowballC)
tokens <- strsplit(x, split = "[[:space:][:punct:]]+")
stemming <- lapply(tokens, FUN = function(x) wordStem(x, language = "en"))
stemming
#> $doc_a
#> [1] "In" "our" "last" "meet" "someon" "said"
#> [7] "that" "we" "are" "meet" "again" "tomorrow"
#>
#> $doc_b
#> [1] "It" "s" "better" "to" "be" "good" "at" "be"
#> [9] "the" "best"
Lemmatization can be done easily in R with the textstem package.
The steps are:
1) Install textstem
2) Load the package with library(textstem)
3) stem_word <- lemmatize_words(word, dictionary = lexicon::hash_lemmas)
where stem_word is the result of lemmatization and word is the input word.
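For example, with the vector used earlier:
library(textstem)
lemmatize_words(c("run", "ran", "running"), dictionary = lexicon::hash_lemmas)
## [1] "run" "run" "run"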