Using grep() to extract characters in R

The link below contains a list of warning messages:
https://drive.google.com/file/d/1pz-jSkqU5nG_ipaezFCvWNI6WHgekAdE/view?usp=sharing
I am trying to match:
only at the start of the string,
where the pattern "x = ECC " exists,
and retrieve only the "ECC" portion.
The pattern tested successfully here:
regex.com
But this code returns nothing in R:
grep("(?<=\\A\"x\\s=\\s')[A-Z]*", names(warnings), value = TRUE, perl = TRUE)
#> character(0)
What's not working?

In this data you have additional spaces inside the quotes (e.g. "x = 'GEN '"), so the pattern does not match. You can switch to str_match here:
stringr::str_match(names(warnings), "x\\s=\\s'(\\w+)\\s+'")[, 2]
# [1] "ECC" "ECC" "ECOM" "ECOM" "ETX" "ETX" NA NA NA "FEI"
#[11] "FEI" "GEN" "GEN" NA NA NA "SAND" "SAND" NA NA
#[21] NA "STAR" "STAR" NA NA NA
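If you would rather stay in base R, the same capture works with regexpr() and a lookbehind; here is a small sketch on made-up strings (not your full warnings list):

```r
msgs <- c("x = 'ECC ' not found", "x = 'GEN ' not found", "no match here")
# perl = TRUE enables the (?<=...) lookbehind; regmatches() keeps only the hits
hits <- regmatches(msgs, regexpr("(?<=x = ')\\w+", msgs, perl = TRUE))
hits  # "ECC" "GEN"
```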


rtweet "search_tweets" function does not return coordinates

I'm using the rtweet library (the search_tweets function) to retrieve data from Twitter. All data is present except the location data, for which I just get NA. What's even weirder is that occasionally I do get coordinates (but very infrequently).
search_tweets("coronavirus", n = 1000, include_rts = FALSE, lang="en")
This is what I get for the coords_coords variable (just showing a random section here to show that I do sometimes get a value). Thanks in advance for your help!
[[917]]
[1] NA NA
[[918]]
[1] NA NA
[[919]]
[1] NA NA
[[920]]
[1] NA NA
[[921]]
[1] 43.77429 -79.71640
[[922]]
[1] NA NA
[[923]]
[1] NA NA
From Tweet location FAQs:
This feature is off by default and you will need to opt in to use it.
Since you get location data occasionally, I'd say it is likely that most of the data simply doesn't have this info.
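To quantify how sparse the coordinates really are, you can count the non-NA pairs in coords_coords; a sketch on a toy list shaped like the output above:

```r
# toy stand-in for the coords_coords column: a list of lat/long pairs, mostly NA
coords <- list(c(NA, NA), c(NA, NA), c(43.77429, -79.71640), c(NA, NA))
has_coords <- vapply(coords, function(p) all(!is.na(p)), logical(1))
sum(has_coords)   # how many tweets actually carry coordinates (here: 1)
mean(has_coords)  # the share of geotagged tweets (here: 0.25)
```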

Using write.table to create tab separated file with fields which contain commas

I am trying to write a data frame containing installed packages to a tab separated file using write.table:
ip <- as.data.frame(installed.packages())
rownames(ip) <- NULL
write.table(ip,file="installed_packages.tsv",quote = F,sep = "\t",row.names = F)
Unfortunately this seems to result in a file where commas within a field sometimes cause the creation of a new line. However, this is not consistent. Quoting the fields makes no difference. Does anyone have any ideas?
Edit: added an example of the broken output, using quote = T instead:
"Package" "LibPath" "Version" "Priority" "Depends" "Imports" "LinkingTo" "Suggests" "Enhances" "License" "License_is_FOSS" "License_restricts_use" "OS_type" "MD5sum" "NeedsCompilation" "Built"
"abind" "/home/administrator/R/x86_64-pc-linux-gnu-library/3.4" "1.4-5" NA "R (>= 1.5.0)" "methods, utils" NA NA NA "LGPL (>= 2)" NA NA NA NA "no" "3.4.4"
"acepack" "/home/administrator/R/x86_64-pc-linux-gnu-library/3.4" "1.4.1" NA NA NA NA "testthat" NA "MIT + file LICENSE" NA NA NA NA "yes" "3.4.4"
"addTextLabels" "/home/administrator/R/x86_64-pc-linux-gnu-library/3.4" "0.0.0.9000" NA "R (>= 3.4.4)" NA NA NA NA "GPL-3" NA NA NA NA NA "3.4.4"
"backports" "/home/administrator/R/x86_64-pc-linux-gnu-library/3.4" "1.1.2" NA "R (>= 3.0.0)" "utils" NA NA NA "GPL-2" NA NA NA NA "yes" "3.4.4"
"ballgownMod" "/home/administrator/R/x86_64-pc-linux-gnu-library/3.4" "2.5.3" NA "R (>= 3.1.1),
methods" "GenomicRanges (>= 1.17.25),
IRanges (>= 1.99.22),
S4Vectors (>= 0.9.39),
RColorBrewer,
splines,
sva,
limma,
rtracklayer (>= 1.29.25),
Biobase (>= 2.25.0),
GenomeInfoDb" NA "testthat,
knitr" NA "Artistic-2.0" NA NA NA NA NA "3.4.4"
"base64enc" "/home/administrator/R/x86_64-pc-linux-gnu-library/3.4" "0.1-3" NA "R (>= 2.9.0)" NA NA NA "png" "GPL-2 | GPL-3" NA NA NA $
"bdsmatrix" "/home/administrator/R/x86_64-pc-linux-gnu-library/3.4" "1.3-3" NA "methods, R (>= 2.0.0)" NA NA NA NA "LGPL-2" NA NA $
"broom" "/home/administrator/R/x86_64-pc-linux-gnu-library/3.4" "0.5.0" NA "R (>= 3.1)" "backports, dplyr, methods, nlme, purrr, reshape2, stringr,
tibble, tidyr" NA "AER, akima, AUC, bbmle, betareg, biglm, binGroup, boot, brms,
btergm, car, caret, coda, covr, e1071, emmeans, ergm, gam (>=
1.15), gamlss, gamlss.data, gamlss.dist, geepack, ggplot2,
glmnet, gmm, Hmisc, irlba, joineRML, Kendall, knitr, ks,
Lahman, lavaan, lfe, lme4, lmodel2, lmtest, lsmeans, maps,
maptools, MASS, Matrix, mclust, mgcv, muhaz, multcomp, network,
nnet, orcutt (>= 2.2), ordinal, plm, plyr, poLCA, psych,
quantreg, rgeos, rmarkdown, robust, rsample, rstan, rstanarm,
sp, speedglm, statnet.common, survey, survival, testthat,
tseries, xergm, zoo" NA "MIT + file LICENSE" NA NA NA NA "no" "3.4.4"
I note that a new line is also created after the line ending 'gam (>='. I also observe that the file opens fine in LibreOffice Calc with the fields delimited correctly (but only when using the quote = T option), yet not in nano, vi, or the default Ubuntu text editor (with either quote = T or quote = F). So this has something to do with the text editor.
I think this should do the trick:
a <- installed.packages()
a <- a[, 1]
names(a) <- c() # optional line
write.table(a, "installed.tsv", quote = F, sep = "\t", row.names = F)
The result of installed.packages() is a messy matrix, so you need to subset it to get just the vector of package names.
Result in R:
[1] "abind" "Amelia" "ape" "assertthat"
[5] "backports" "base" "base64enc" "BBmisc"
[9] "BelgiumMaps.StatBel" "BH" "bindr" "bindrcpp"
Result in TextEdit (Mac):
abind
Amelia
ape
assertthat
backports
base
base64enc
Hope it helps!
You can write the data frame directly using:
write.csv(ip,"installed_packages.csv")
Here, commas within a field won't result in the creation of a new line.
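For the original symptom itself: the stray line breaks are not caused by the commas but by literal newlines embedded in fields such as Depends and Suggests (visible in the broken output above). A sketch that flattens those newlines before writing keeps the full data frame intact:

```r
ip <- as.data.frame(installed.packages(), stringsAsFactors = FALSE)
rownames(ip) <- NULL
# fields like Depends/Suggests can contain embedded newlines; collapse them
ip[] <- lapply(ip, function(col) gsub("\\s*\\n\\s*", " ", col))
write.table(ip, "installed_packages.tsv", quote = FALSE, sep = "\t", row.names = FALSE)
```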

Extract a pattern from text in R from a subset of patterns

I have list of codes as below
ccode<-c('S','PD','CH','ML','MD','VA','BVI','DB','KD','KE','PW','COL','AD','MET','VP','SI','VR','GAO','LK','RP','PAD','WAN','PWD','PMP','PBR','VN','PPC','NK','K','AH','I','JP','JU','UDZ','CHM','DDN','LN','CL','CLH','DKM','GK','WD','ED','DDK','DLN','DRN','DFD','GZB','DVV','GUR','GGN','ND','HHN','HAS','HYD','HKP','BWF','BBW','BKM','BSN','BL','BIN','ST','KN')
Now, I want to extract strings from the sample below that start with one of the codes:
consolidated_csv_v2 <- c("pt paid rs-8488/- remaining amt","Credit Card Sales","ML 2926 VARSHA LAKHANI (AG)","IMRAN KHAN-PW-4798","Deepali Mishra Ah-5564 Tst", "MANJU S-11226 T","SNEHA S-16191","SUMIT SETHI AH-5747 AG","SUJATA VORA AH-5361 AG","Deepali Mishra Ah-5564 Tst")
The data is spread across 477,326 rows.
The expected output is a code followed by a number.
str_extract(consolidated_csv_v2, "AH.*$")
 [1] NA           NA           NA           NA           NA           NA
 [7] NA           "AH-5747 AG" "AH-5361 AG" NA
This pattern worked only with the static code "AH". How can I do the same to match any of the codes in ccode?
I assume you need to extract the substrings starting with a code after a word boundary and followed by a hyphen.
Then, use
"\\b(?:S|PD|CH|ML|MD|VA|BVI|DB|KD|KE|PW|COL|AD|MET|VP|SI|VR|GAO|LK|RP|PAD|WAN|PWD|PMP|PBR|VN|PPC|NK|K|AH|I|JP|JU|UDZ|CHM|DDN|LN|CL|CLH|DKM|GK|WD|ED|DDK|DLN|DRN|DFD|GZB|DVV|GUR|GGN|ND|HHN|HAS|HYD|HKP|BWF|BBW|BKM|BSN|BL|BIN|ST|KN)-\\w*"
where \b stands for a word boundary, then a group of code alternatives follows ((?:...)), and then a hyphen (-) followed by zero or more word characters (\w*).
And here is a demo code:
> consolidated_csv_v2 <- c("Head Office","(cancelled)","(cancelled)","(cancelled)","Deepali Mishra Ah-5564 Tst", "(cancelled)","SHRUTI BHAGAT AH-2445 AG","SUMIT SETHI AH-5747 AG","SUJATA VORA AH-5361 AG","Deepali Mishra Ah-5564 Tst")
> ccode<-c('S','PD','CH','ML','MD','VA','BVI','DB','KD','KE','PW','COL','AD','MET','VP','SI','VR','GAO','LK','RP','PAD','WAN','PWD','PMP','PBR','VN','PPC','NK','K','AH','I','JP','JU','UDZ','CHM','DDN','LN','CL','CLH','DKM','GK','WD','ED','DDK','DLN','DRN','DFD','GZB','DVV','GUR','GGN','ND','HHN','HAS','HYD','HKP','BWF','BBW','BKM','BSN','BL','BIN','ST','KN')
> reg <- paste0("\\b(?:", paste(ccode, collapse="|"),")-\\w*")
> str_extract(consolidated_csv_v2, reg)
[1] NA NA NA NA NA NA "AH-2445"
[8] "AH-5747" "AH-5361" NA
UPDATE
not all the words are followed by '-'; some are followed by a space and some don't have any character in between.
The requirement is rather general, but we can meet it with a lazy dot (.*?) after the group of alternatives, matching any zero or more characters other than a newline, as few as possible, up to the first run of digits (\d+) followed by a word boundary (\b). Use
reg <- paste0("(?i)\\b(?:", paste(ccode, collapse="|"),").*?\\d+\\b")
To make this pattern case-insensitive, just add a (?i) in front of the first \b.
We can try
pat <- paste0("(?i)\\b(", paste(ccode, collapse="|"),")-.*")
str_extract(v1, pat)
#[1] NA NA NA NA "Ah-5564 Tst" NA "AH-2445 AG" "AH-5747 AG" "AH-5361 AG" "Ah-5564 Tst"
data
v1 <- c("Head Office", "(cancelled)", "(cancelled)", "(cancelled)",
"Deepali Mishra Ah-5564 Tst", "(cancelled)", "SHRUTI BHAGAT AH-2445 AG",
"SUMIT SETHI AH-5747 AG", "SUJATA VORA AH-5361 AG", "Deepali Mishra Ah-5564 Tst")
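If you also need the code and its number as separate pieces, here is a sketch building on the same paste0()/collapse idea; the [- ]? allows a hyphen, a space, or nothing between code and number, per the update (the code list is shortened here for illustration):

```r
library(stringr)

v1 <- c("Deepali Mishra Ah-5564 Tst", "IMRAN KHAN-PW-4798", "Head Office")
codes <- c("AH", "PW", "ML")  # shortened list; use the full ccode in practice
pat <- paste0("(?i)\\b(", paste(codes, collapse = "|"), ")[- ]?(\\d+)\\b")
m <- str_match(v1, pat)
m[, 2:3]  # column 1: matched code, column 2: its number; NA where no match
```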

Get all of the nouns in a book (.txt file) in R and make a frequency table and wordcloud

I am trying to find all of the nouns in a text file. Originally I converted an .epub to a .pdf file, then converted the .pdf to a .txt file and removed the first half of the text, since I only need the nouns from the last half of the book. I want to do this so I can find the frequency of the nouns and then identify them for finals.
I can make the frequency tables and a wordcloud from the original text files without any transformations, but I cannot seem to filter only the nouns. Any ideas?
cname <- file.path(".","Desktop", "egypt", "pdf")
mytxtfiles <- list.files(path = cname, pattern = "txt", full.names = TRUE)
# nouns2 and nouns don't seem to work; I've tried both ways
nouns2 <- regmatches(mytxtfiles, gregexpr("^([A-Z][a-z]+)+$", mytxtfiles, perl = TRUE))
nouns <- lapply(mytxtfiles, function(i) {
  j <- paste0(scan(i, what = character()), collapse = " ")
  regmatches(j, gregexpr("^([A-Z][a-z]+)+$", j, perl = TRUE))
})
#transformation if nouns do not work
docs <- tm_map(docs[1], removeWords, stopwords("english"))
#working wordcloud and freq data
dtm <- DocumentTermMatrix(docs)
findFreqTerms(dtm, lowfreq=100)
findAssocs(dtm, "data", corlimit=0.6)
freq <- sort(colSums(as.matrix(dtm)), decreasing=TRUE)
wf <- data.frame(word=names(freq), freq=freq)
p <- ggplot(subset(wf, freq >500), aes(word, freq))
p <-p + geom_bar(stat ="identity")
p <- p + theme(axis.text.x=element_text(angle=45, hjust=1))
library(wordcloud)
wordcloud(names(freq),freq,min.freq=100, colors=brewer.pal(6,"Dark2"))
I have tried nouns2 and nouns but they return something like:
nouns2
[[1]]
character(0)
[[2]]
character(0)
[[3]]
character(0)
Here is a method to find all the nouns, using the qdap package. You can go from here.
text <- "To further enhance our practice, the president was honored to have him join the firm, former commissioner and the first to institute patent reexaminations, bringing a wealth of experience and knowledge to the firm and our clients."
library(qdap)
pos.text <- pos(text) # tells the count and parts of speech in the text
vec.tagged <- as.vector(pos.text[[2]]) # retains only the tagged terms in a vector
vec.tagged.split <- str_split(vec.tagged$POStagged, "/") # breaks the vector apart at the "/"
all.nouns <- str_extract(vec.tagged.split[[1]], "^NN .+") # identifies the nouns
all.nouns <- str_replace(all.nouns, "NN\\s", "") # removes NN tag
all.nouns
[1] NA NA NA NA NA "novak" "druce"
[8] "was" NA NA NA NA NA NA
[15] NA NA NA NA NA "commissioner" "and"
[22] NA NA NA NA NA "reexaminations" NA
[29] NA NA "of" NA "and" NA "to"
[36] NA NA "and" NA NA NA
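The split-and-filter steps above can be sketched on a toy word/TAG vector without qdap (the tags here are hypothetical Penn-style labels, not real tagger output):

```r
library(stringr)

tagged <- c("the/DT", "president/NN", "was/VBD", "honored/VBN", "firm/NN")
toks <- str_split(tagged, "/", simplify = TRUE)  # column 1: word, column 2: tag
nouns <- toks[toks[, 2] == "NN", 1]              # keep only NN-tagged words
nouns  # "president" "firm"
```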

Getting mysterious NA's when trying to parse date

I have not had experience with using dates in R. I have read all of the docs but I still can't figure out why I am getting this error. I am trying to convert a vector of strings into a vector of dates using a specified format. I have tried both a for loop, converting each date individually, and vectorised functions like sapply, but neither works. Here is the code using a for loop:
dates = rawData[,ind] # get vector of date strings
print("single date example")
print(as.Date(dates[1]))
dDates = rep(1,length(dates)) # initialize vector of dates
class(dDates)="Date"
for (i in 1:length(dates)) {
  dDates[i] <- as.Date(dates[i])
}
print(dDates[1:10])
EDIT: info on "dates" variables
[1] "dates"
V16 V17 V18 V19 V36
[1,] "2014-01-16" "2014-01-30" "2014-01-16" "2014-01-17" "1999-03-16 12:00"
[2,] "2014-01-04" "2014-01-18" "2014-01-04" "2014-01-08" "1998-09-04 12:00"
[3,] "2014-03-05" "2014-03-19" "2014-03-05" "2014-03-07" "1996-09-30 05:00"
[4,] "2014-01-21" "2014-02-04" "2014-01-22" "2014-01-24" "1995-08-21 12:00"
[5,] "2014-01-07" "2014-01-21" "2014-01-07" "2014-01-09" "1994-04-07 12:00"
[1] "class(dates)"
[1] "matrix"
[1] "class(dates[1,1])"
[1] "character"
[1] "dim(dates)"
[1] 56557 8
The result I am getting is as follows:
[1] "single date example"
[1] "2014-01-16"
Error in charToDate(x) :
character string is not in a standard unambiguous format
So basically, when I try to parse a single element of the date string into a date, it works fine. But when I try to parse the dates in a loop, it breaks. How could this be?
The reason why I am using a loop instead of sapply is because that was returning an even stranger result. When I try to run:
dDates = sapply(dDates, function(x) as.Date(x, format = "%Y-%m-%d"))
I am getting the following output:
2014-01-16 2014-01-04 2014-03-05 2014-01-21 2014-01-07 2014-01-02 2014-01-08
NA NA NA NA NA NA NA
2014-02-22 2014-01-09 2014-02-22
NA NA NA
Which is very strange. As you can see, since my format was correct, it was able to parse out the dates. But for some reason, it is also giving a time value of NA (or at least that is what I think the NA means). Maybe this is happening because some of my date strings have times, while others don't. But the thing is I left the time out of the format because I don't care about time.
Does anyone know why this is happening or how to fix it? I can't find anywhere online where you can "set" the time value of a date object easily; I just can't seem to get rid of that NA. And somehow even a for loop doesn't work! Either way, the output is strange and I am not getting the expected results, even though my format is correct. It is very frustrating that a simple thing like parsing a vector of dates is so much harder than in MATLAB or Java.
Any help please?
EDIT: when I try simply
dDates = as.Date(dates,format="%m/%d/%Y")
I get the output
"dDates[1:10]"
[1] NA NA NA NA NA NA NA NA NA NA
still those mysterious NA's. I am also getting an error
Error in as.Date.default(value) :
do not know how to convert 'value' to class “Date”
Using a subset of your data,
v <- c("2014-01-16", "2014-01-30", "2014-01-16", "2014-01-17", "1999-03-16 12:00")
these statements are equivalent, since your format is the default one:
as.Date(v)
[1] "2014-01-16" "2014-01-30" "2014-01-16" "2014-01-17" "1999-03-16"
as.Date(v, format = "%Y-%m-%d")
[1] "2014-01-16" "2014-01-30" "2014-01-16" "2014-01-17" "1999-03-16"
If you would like to format the output of your date, use format:
format(as.Date(v), format = "%m/%d/%Y")
[1] "01/16/2014" "01/30/2014" "01/16/2014" "01/17/2014" "03/16/1999"
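As for the "mysterious NAs" from the sapply route: sapply() simplifies its result to the underlying numbers and drops the "Date" class, so even when parsing succeeds you don't get dates back; the vectorised as.Date() call keeps the class. A quick sketch:

```r
v <- c("2014-01-16", "2014-01-30", "1999-03-16 12:00")
d1 <- sapply(v, function(x) as.Date(x, format = "%Y-%m-%d"))
class(d1)  # "numeric": sapply() strips the "Date" class, leaving raw day counts
d2 <- as.Date(v, format = "%Y-%m-%d")
class(d2)  # "Date": the vectorised call preserves the class
```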
