how to extract word from row - r

I have a 46 MB csv file containing data. Essentially, I would like select only those rows that have particular word like "PRODUCT". There are 600 000 rows for this data. I have used grep() to search for the string matching. Following are few lines of my data.
head(test)
Item.Description UQC Year
1 PHARMACEUTICALS PRODUCTS.(MEDICINE) DOLEYKA SYRUP 100 ML NOS 2015
2 Multani mati hesh100gm x 160 (AyurvedicProducts) PAC 2015
3 Amla /Shikakai/ Aritha powder 100gm x 160 (Ayurvedic Products) PAC 2015
4 Godrej h.dye blk 40ml x 36 (Ayurvedic Products) PAC 2015
5 DR. COOLERS HERBAL LOZENGES.(2) DR. COOLERS HERBAL LOZENGES (MINT FLAVOUR) PAC 2015
6 Eno lemon/ regular 100gm x 48 (AyurvedicProducts) PAC 2015
Identifier RITC.Code
30049099
30049011
30049011
30049011
30049011
30049011
I have used test[grep("PRODUCT", rownames(test)), ]. It gives me an error.

1)try grepl, it works better.
2)The upper/lower case matters here, and you have both of them in your text.
So try:
1) test$Item.Description <- tolower(test$Item.Description)
2) products <- test[grepl("product", test$Item.Description),].
And yes, the usage of the needed column (ItemDescription) instead of rownames matters too

open csv file using ms-excel
go to menu 'data' and click 'filter'
in filter drop down select 'Text Filters' then select 'contains'
then type word 'product'
list contains word 'product' will be filtered

Related

Compiling API outputs in XML format in R

I have searched everywhere trying to find an answer to this question and I haven't quite found what I'm looking for yet so I'm hoping asking directly will help.
I am working with the USPS Tracking API, which provides an output an XML format. The API is limited to 35 results per call (i.e. you can only provide 35 tracking numbers to get info on each time you call the API) and I need information on ~90,000 tracking numbers, so I am running my calls in a for loop. I was able to store the results of the call in a list, but then I had trouble exporting the list as-is into anything usable. However, when I tried to convert the results from the list into JSON, it dropped the attribute tag, which contained the tracking number I had used to generate the results.
Here is what a sample result looks like:
<TrackResponse>
<TrackInfo ID="XXXXXXXXXXX1">
<TrackSummary> Your item was delivered at 6:50 am on February 6 in BARTOW FL 33830.</TrackSummary>
<TrackDetail>February 6 6:49 am NOTICE LEFT BARTOW FL 33830</TrackDetail>
<TrackDetail>February 6 6:48 am ARRIVAL AT UNIT BARTOW FL 33830</TrackDetail>
<TrackDetail>February 6 3:49 am ARRIVAL AT UNIT LAKELAND FL 33805</TrackDetail>
<TrackDetail>February 5 7:28 pm ENROUTE 33699</TrackDetail>
<TrackDetail>February 5 7:18 pm ACCEPT OR PICKUP 33699</TrackDetail>
Here is the script I ran to get the output I'm currently working with:
final_tracking_info <- list()
for (i in 1:x) { # where x = the number of calls to the API the loop will need to make
usps = input_tracking_info[i] # input_tracking_info = GET commands
usps = read_xml(usps)
final_tracking_info1[[i+1]]<-usps$TrackResponse
gc()
}
final_output <- toJSON(final_tracking_info)
write(final_output,"final_tracking_info.json") # tried converting to JSON, lost the ID attribute
cat(capture.output(print(working_list),file = "Final_Tracking_Info.txt")) # exported the list to a textfile, was not an ideal format to work with
What I ultimately want tog et from this data is a table containing the tracking number, the first track detail, and the last track detail. What I'm wondering is, is there a better way to compile this in XML/JSON that will make it easier to convert to a tibble/df down the line? Is there any easy way/preferred format to select based on the fact that I know most of the columns will have the same name ("Track Detail") and the DFs will have to be different lengths (since each package will have a different number of track details) when I'm trying to compile 1,000 of results into one final output?
Using XML::xmlToList() will store the ID attribute in .attrs:
$TrackSummary
[1] " Your item was delivered at 6:50 am on February 6 in BARTOW FL 33830."
$TrackDetail
[1] "February 6 6:49 am NOTICE LEFT BARTOW FL 33830"
$TrackDetail
[1] "February 6 6:48 am ARRIVAL AT UNIT BARTOW FL 33830"
$TrackDetail
[1] "February 6 3:49 am ARRIVAL AT UNIT LAKELAND FL 33805"
$TrackDetail
[1] "February 5 7:28 pm ENROUTE 33699"
$TrackDetail
[1] "February 5 7:18 pm ACCEPT OR PICKUP 33699"
$.attrs
ID
"XXXXXXXXXXX1"
A way of using that output which assumes that the Summary and ID are always present as first and last elements, respectively, is:
xml_data <- XML::xmlToList("71563898.xml") %>%
unlist() %>% # flattening
unname() # removing names
data.frame (
ID = tail(xml_data, 1), # getting last element
Summary = head(xml_data, 1), # getting first element
Info = xml_data %>% head(-1) %>% tail(-1) # remove first and last elements
)

trying to create a value of strings using the | or operator

I am trying to scrape a website link. So far I downloaded the text and set it as a dataframe. I have the folllowing;
keywords <- c(credit | model)
text_df <- as.data.frame.table(text_df)
text_df %>%
filter(str_detect(text, keywords))
where credit and model are two values I want to search the website, i.e. return row with the word credit or model in.
I get the following error
Error in filter_impl(.data, dots) : object 'credit' not found
The code only returns the results with the word "model" in and ignores the word "credit".
How can I go about returning all results with either the word "credit" or "model" in.
My plan is to have keywords <- c(credit | model | more_key_words | something_else | many values)
Thanks in advance.
EDIT:
text_df:
Var 1 text
1 Here is some credit information
2 Some text which does not expalin any keywords but messy <li> text9182edj </i>
3 This line may contain the keyword model
4 another line which contains nothing of use
So I am trying to extract just rows 1 and 3.
I think the issue is you need to pass a string as an argument to str_detect. To check for "credit" or "model" you can paste them into a single string separated by |.
library(tidyverse)
library(stringr)
text_df <- read_table("Var 1 text
1 Here is some credit information
2 Some text which does not expalin any keywords but messy <li> text9182edj </i>
3 This line may contain the keyword model
4 another line which contains nothing of use")
keywords <- c("credit", "model")
any_word <- paste(keywords, collapse = "|")
text_df %>% filter(str_detect(text, any_word))
#> # A tibble: 2 x 3
#> Var `1` text
#> <int> <chr> <chr>
#> 1 1 Here is some credit information
#> 2 3 This line may contain the keyword model
Ok I have checked it and I think it will not work you way, as you must use the or | operator inside filter() not inside str_detect()
So it would work this way:
keywords <- c("virg", "tos")
library(dplyr)
library(stringr)
iris %>%
filter(str_detect(Species, keywords[1]) | str_detect(Species, keywords[2]))
as a keywords[1] etc you have to specify each "keyword" from this variable
I would recommend staying away from regex when you're dealing with words. There are packages tailored for your particular task that you can use. Try, for example, the following
library(corpus)
text <- readLines("http://norvig.com/big.txt") # sherlock holmes
terms <- c("watson", "sherlock holmes", "elementary")
text_locate(text, terms)
## text before instance after
## 1 1 …Book of The Adventures of Sherlock Holmes
## 2 27 Title: The Adventures of Sherlock Holmes
## 3 40 … EBOOK, THE ADVENTURES OF SHERLOCK HOLMES ***
## 4 50 SHERLOCK HOLMES
## 5 77 To Sherlock Holmes she is always the woman. I…
## 6 85 …," he remarked. "I think, Watson , that you have put on seve…
## 7 89 …t a trifle more, I fancy, Watson . And in practice again, I …
## 8 145 …ere's money in this case, Watson , if there is nothing else.…
## 9 163 …friend and colleague, Dr. Watson , who is occasionally good …
## 10 315 … for you. And good-night, Watson ," he added, as the wheels …
## 11 352 …s quite too good to lose, Watson . I was just balancing whet…
## 12 422 …as I had pictured it from Sherlock Holmes ' succinct description, but…
## 13 504 "Good-night, Mister Sherlock Holmes ."
## 14 515 …t it!" he cried, grasping Sherlock Holmes by either shoulder and loo…
## 15 553 "Mr. Sherlock Holmes , I believe?" said she.
## 16 559 "What!" Sherlock Holmes staggered back, white with…
## 17 565 …tter was superscribed to " Sherlock Holmes , Esq. To be left till call…
## 18 567 "MY DEAR MR. SHERLOCK HOLMES ,--You really did it very w…
## 19 569 …est to the celebrated Mr. Sherlock Holmes . Then I, rather imprudentl…
## 20 571 …s; and I remain, dear Mr. Sherlock Holmes ,
## ⋮ (189 rows total)
Note that this matches the term regardless of the case.
For your specific use case, do
ix <- text_detect(text, terms)
or
matches <- text_subset(text, terms)

Extract state abbreviation and zip code from strings

I want to extract state abbreviation (2 letters) and zip code (either 4 or 5 numbers) from the following string
address <- "19800 Eagle River Road, Eagle River AK 99577
907-481-1670
230 Colonial Promenade Pkwy, Alabaster AL 35007
205-620-0360
360 Connecticut Avenue, Norwalk CT 06854
860-409-0404
2080 S Lincoln, Jerome ID 83338
208-324-4333
20175 Civic Center Dr, Augusta ME 4330
207-623-8223
830 Harvest Ln, Williston VT 5495
802-878-5233
"
For the zip code, I tried few methods that I found on here but it didn't work mainly because of the 5 number street address or zip codes that have only 4 numbers
text <- readLines(textConnection(address))
library(stringi)
zip <- stri_extract_last_regex(text, "\\d{5}")
zip
library(qdapRegex)
rm_zip3 <- rm_(pattern="(?<!\\d)\\d{5}(?!\\d)", extract = TRUE)
zip <- rm_zip3(text)
zip
[1] "99577" "1670" "35007" "0360" "06854" "0404" "83338" "4333" "4330" "8223" "5495" "5233" NA
For the state abbreviation, I have no idea how to extract
Any help is appreciated! Thanks in advance!
Edit 1: Include phone numbers
Code to extract zip code:
zip <- str_extract(text, "\\d{5}")
Code to extract state code:
states <- str_extract(text, "\\b[A-Z]{2}(?=\\s+\\d{5}$)")
Code to extract phone numbers:
phone <- str_extract(text, "\\b\\d{3}-\\d{3}-\\d{4}\\b")
NOTE: Looks like there's an issue with your data because the last 2 zip codes should be 5 characters long and not 4. 4330 should actually be 04330. If you don't have control over the data source, but know for sure that they are US codes you could pad 0's on the left as required. However since you are looking for a solution for 4 or 5 characters, you can use this:
Code to extract zip code (looks for space in front and newline at the back so that parts of a phone number or an address aren't picked)
zip <- str_extract(text, "(?<= )\\d{4,5}(?=\\n|$)")
Code to extract state code:
states <- str_extract(text, "\\b[A-Z]{2}(?=\\s+\\d{4,5}$)")
Demo: https://regex101.com/r/7Im0Mu/2
I am using address as input not the text, see if it works for your case.
Assumptions on regex: Two capital letters followed by 4 or 5 numeric letters are for state and zip, The phone numbers are always on next line.
Input:
address <- "19800 Eagle River Road, Eagle River AK 99577
907-481-1670
230 Colonial Promenade Pkwy, Alabaster AL 35007
205-620-0360
360 Connecticut Avenue, Norwalk CT 06854
860-409-0404
2080 S Lincoln, Jerome ID 83338
208-324-4333
20175 Civic Center Dr, Augusta ME 4330
207-623-8223
830 Harvest Ln, Williston VT 5495
802-878-5233
"
I am using stringr library , you may choose any other to extract the information as you wish.
library(stringr)
df <- data.frame(do.call("rbind",strsplit(str_extract_all(address,"[A-Z][A-Z]\\s\\d{4,5}\\s\\d{3}-\\d{3}-\\d{4}")[[1]],split="\\s|\\n")))
names(df) <- c("state","Zip","Phone")
EDIT:
In case someone want to use text as input,
text <- readLines(textConnection(address))
text <- data.frame(text)
st_zip <- setNames(data.frame(str_extract_all(text$text,"[A-Z][A-Z]\\s\\d{4,5}",simplify = T)),"St_zip")
pin <- setNames(data.frame(str_extract_all(text$text,"\\d{3}-\\d{3}-\\d{4}",simplify = T)),"pin")
st_zip <- st_zip[st_zip$St_zip != "",]
df1 <- setNames(data.frame(do.call("rbind",strsplit(st_zip,split=' '))),c("State","Zip"))
pin <- pin[pin$pin != "",]
df2 <- data.frame(cbind(df1,pin))
OUTPUT:
State Zip pin
1 AK 99577 907-481-1670
2 AL 35007 205-620-0360
3 CT 06854 860-409-0404
4 ID 83338 208-324-4333
5 ME 4330 207-623-8223
6 VT 5495 802-878-5233
Thank you #Rahul. Both would be great. At least can you show me how to do it with Notepad++?
Extraction using Notepad++
Well first copy your whole data in a file.
Go to Find by pressing Ctrl + F. This will open search dialog box. Choose Replace tab search with regex ([A-Z]{2}\s*\d{4,5})$ and replace with \n-\1-\n. This will search for state abbreviation and ZIP code and place them in new line with - as prefix and suffix.
Now go to Mark tab. Check Bookmark Line checkbox then search with -(.*?)- and press Mark All. This will mark state abb and ZIP which are in newlines with -.
Now go to Search --> Bookmark --> Remove Unmarked Lines
Finally search with ^-|-$ and replace with empty string.
Update
So now there will be phone numbers too ? In that case you only have to remove $ from regex in step 2. Regex to use will be ([A-Z]{2}\s*\d{4,5}). Rest all steps will be same.

how to extract text from anchor tag inside div class in r

I am trying to fetch text from anchor tag, which is embedded in div tag. Following is the link of website `http://mmb.moneycontrol.com/forum-topics/stocks-1.html
The text I want to extract is Mawana Sugars
Mawana Sugars
So I want to extract all the stocks names listed on this website and description of it.
Here is my attempt to do it in R
doc <- htmlParse("http://mmb.moneycontrol.com/forum-topics/stocks-1.html")
xpathSApply(doc,"//div[#class='clearfix PR PB5']//text()",xmlValue)
But, it does not return anything. How can I do it in R?
My answer is essentially the same as the one I just gave here.
The data is dynamically loaded, and cannot be retrieved directly from the html. But, looking at "Network" in Chrome DevTools for instance, we can find a nicely formatted JSON at http://mmb.moneycontrol.com/index.php?q=topic/ajax_call&section=get_messages&offset=&lmid=&isp=0&gmt=cat_lm&catid=1&pgno=1
To get you started:
library(jsonlite)
dat <- fromJSON("http://mmb.moneycontrol.com/index.php?q=topic/ajax_call&section=get_messages&offset=&lmid=&isp=0&gmt=cat_lm&catid=1&pgno=1")
Output looks like:
dat[1:3, c("msg_id", "user_id", "topic", "heading", "flag", "price", "message")]
# msg_id user_id topic heading flag
# 1 47730730 liontrade NMDC Stocks APR
# 2 47730726 agrawalknath Glenmark Glenmark APR
# 3 47730725 bissy91 Infosys Stocks APR
# price
# 1 Price when posted : BSE: Rs. 127.90 NSE: Rs. 128.15
# 2 Price when posted : NSE: Rs. 714.10
# 3 Price when posted : BSE: Rs. 956.50 NSE: Rs. 955.00
# message
# 1 There is no mention of dividend in the announcement.
# 2 Eagerly Waiting for 670 to 675 to BUY second phase of Buying in Cash Delivery. Already Holding # 800.
# 3 6 ✂ ✂--Don t Pay High Brokerage While Trading. Take Delivery Free & Rs 20 to trade in any size - Join Today .👉 goo.gl/hDqLnm

R - pipe("pbcopy") columns not lining up with pasting

As a follow-up to a question I wrote a few days ago, I finally figured out how to copy to the clipboard to paste into other applications (read: Excel).
However, when using the function to copy and paste, the variable column headers are not lining up correctly when pasting.
Data (taken from a Flowing Data example I happened to be looking at):
data <- read.csv("http://datasets.flowingdata.com/post-data.txt")
Copy function:
write.table(file = pipe("pbcopy"), data, sep = "\t")
When loaded in, the data looks like this:
id views comments category
1 5019 148896 28 Artistic Visualization
2 1416 81374 26 Visualization
3 1416 81374 26 Featured
4 3485 80819 37 Featured
5 3485 80819 37 Mapping
6 3485 80819 37 Data Sources
There is a row number without a column variable name (1, 2, 3, 4, ...)
Using the read.table(pipe("pbpaste")) function, it will load back into R fine.
However, when I paste it into Excel, or TextEdit, the column name for the second variable will be in the first variable column name slot, like this:
id views comments category
1 5019 148896 28 Artistic Visualization
2 1416 81374 26 Visualization
3 1416 81374 26 Featured
4 3485 80819 37 Featured
5 3485 80819 37 Mapping
6 3485 80819 37 Data Sources
Which leaves the trailing column without a column name.
Is there a way to ensure the data copied to the clipboard is aligned and labeled correctly?
The row numbers do not have a column name in an R data.frame. They were not in the original dataset but they are put into the output to the clipboard unless you suppress it. The default for that option is set to TRUE but you can override it. If you want such a column as a named column, you need to make it. Try this when sending to excel.
df$rownums <- rownames(df)
edf <- df[ c( length(df), 1:(length(df)-1))] # to get the rownums/rownames first
write.table(file = pipe("pbcopy"), edf, row.names=FALSE, sep = "\t")
You may just want to add the argument col.names=NA to your call to write.table(). It has the effect of adding an empty character string (a blank column name) to the header row for the first column.
write.table(file = pipe("pbcopy"), data, sep = "\t", col.names=NA)
To see the difference, compare these two function calls:
write.table(data[1:2,], sep="\t")
# "id" "views" "comments" "category"
# "1" 5019 148896 28 "Artistic Visualization"
# "2" 1416 81374 26 "Visualization"
write.table(data[1:2,], sep="\t", col.names=NA)
# "" "id" "views" "comments" "category"
# "1" 5019 148896 28 "Artistic Visualization"
# "2" 1416 81374 26 "Visualization"

Resources