How to read a .txt file into a dataframe with readr?

How to read a .txt file into a dataframe with readr? - r

I have the following data that I obtained from a .txt file using the read_lines function from readr
txtread<-read_lines("expenses_copy1.txt")
txtread
[1] "Amount:Category:Date:Description"
[2] "5.25:supply:20170222:box of staples"
[3] "79.81:meal:20170222:lunch with ABC Corp. clients Al, Bob, and Cy"
[4] "43.00:travel:20170222:cab back to office"
[5] "383.75:travel:20170223:flight to Boston, to visit ABC Corp."
[6] "55.00:travel:20170223:cab to ABC Corp. in Cambridge, MA"
[7] "23.25:meal:20170223:dinner at Logan Airport"
[8] "318.47:supply:20170224:paper, toner, pens, paperclips, tape"
[9] "142.12:meal:20170226:host dinner with ABC clients, Al, Bob, Cy, Dave, Ellie"
[10] "303.94:util:20170227:Peoples Gas"
[11] "121.07:util:20170227:Verizon Wireless"
[12] "7.59:supply:20170227:Python book (used)"
[13] "79.99:supply:20170227:spare 20\" monitor"
[14] "49.86:supply:20170228:Stoch Cal for Finance II"
[15] "6.53:meal:20170302:Dunkin Donuts, drive to Big Inc. near DC"
[16] "127.23:meal:20170302:dinner, Tavern64"
[17] "33.07:meal:20170303:dinner, Uncle Julio's"
[18] "86.00:travel:20170304:mileage, drive to/from Big Inc., Reston, VA"
[19] "22.00:travel:20170304:tolls"
[20] "378.81:travel:20170304:Hyatt Hotel, Reston VA, for Big Inc. meeting"
I want to read each of these in to vectors that are "Amount", "Category", "Date" and "Description" and create a dataframe out of them so that I have a dataset I can work with
I tried the following
for (i in length(txtread) ) {
data<-read.table(textConnection(txtread[[i]]))
print(data)
}
However this does't seem to work.
how can I read this data into a dataframe in R

Related

lapply() with XPath to obtain all text after a specific tag not working

Background:
I am scraping this website to obtain a list of all people named under a respective section of the editorial board.
In total, there are 6 sections, each one beginning with a <b>...</b> part. (It actually should be 5, but the code is a bit messy.)
My goal:
I want to get a list of all people per section (a list of 6 elements called people).
My approach:
I try to fetch all the text, or text(), after each respective <b>...</b>-tag.
However, with the following R-code and XPath, I fail to get the correct list:
journal_url <- "https://aepi.biomedcentral.com/about/editorial-board"
webpage <- xml2::read_html(url(journal_url))
# get a list of 6 sections
all_sections <- rvest::html_nodes(wholepage, css = '#editorialboard p')
# the following does not work properly
people <- lapply(all_sections, function(x) rvest::html_nodes(x, xpath = '//b/following-sibling::text()'))
The mistaken outcome:
Instead of giving me a list of 6 elements comprising the people per section, it gives me a list of 6 elements comprising all people in every element.
The expected outcome:
The expected output would start with:
people
[[1]]
[1] Shichuo Li
[[2]]
[1] Zhen Hong
[2] Hermann Stefan
[3] Dong Zhou
[[3]]
[1] Jie Mu
# etc etc

The double forward slash xpath selects all nodes in the whole document, even when the object is a single node. Use the current node selector .
people <- lapply(all_sections, function(x) {
rvest::html_nodes(x, xpath = './b/following-sibling::text()')
})
Output:
[[1]]
{xml_nodeset (1)}
[1] Shichuo Li,
[[2]]
{xml_nodeset (3)}
[1] Zhen Hong,
[2] Hermann Stefan,
[3] Dong Zhou,
[[3]]
{xml_nodeset (0)}
[[4]]
{xml_nodeset (1)}
[1] Jie Mu,
[[5]]
{xml_nodeset (2)}
[1] Bing Liang,
[2] Weijia Jiang,
[[6]]
{xml_nodeset (35)}
[1] Aye Mye Min Aye,
[2] Sándor Beniczky,
[3] Ingmar Blümcke,
[4] Martin J. Brodie,
[5] Eric Chan,
[6] Yanchun Deng,
[7] Ding Ding,
[8] Yuwu Jiang,
[9] Hennric Jokeit,
[10] Heung Dong Kim,
[11] Patrick Kwan,
[12] Byung In Lee,
[13] Weiping Liao,
[14] Xiaoyan Liu,
[15] Guoming Luan,
[16] Imad M. Najm,
[17] Terence O'Brien,
[18] Jiong Qin,
[19] Markus Reuber,
[20] Ley J.W. Sander,
...

How do you scrape multiple pages from same website on Rstudio

so I want to download data from multiple pages of the same website using RStudio
https://www.irishjobs.ie/ShowResults.aspx?Keywords=Data&autosuggestEndpoint=%2fautosuggest&Location=0&Category=&Recruiter=Company&btnSubmit=Search&Page=2
The difference between page 2 and page 3, is …at the end of the hyperlink we just have a 3 instead of a 2
I have no problem getting what I need from 25 jobs in 1 page, but I want to get 100 jobs from 4 pages.
I am using the selector gadget chrome extension.
I tried the for loop
for (page_result in seq(from =1, to = 101, by = 25)) {
link = paste0(“ https://www.irishjobs.ie/ShowResults.aspx?Keywords=Data&autosuggestEndpoint=%2fautosuggest&Location=0&Category=&Recruiter=Company&btnSubmit=Search&Page=2)
page = read_html(link)
I can’t figure out how to do it
I think I need to fit in page_result into the link, but I don’t know where.
I welcome any ideas.
i have the rvest package and the dplyr package. But I want the for loop to go through each page. Any idea how best to do this, thanks

4 links can be easily put in for loop.
Copy the CSS link from DOM and iterate over 5 to 30 to get all 25 jobs.
AllJOBS <- vector()
for (i in 1:4) {
print("s")
url <- paste0("https://www.irishjobs.ie/ShowResults.aspx?Keywords=Data&autosuggestEndpoint=%2fautosuggest&Location=0&Category=&Recruiter=Company&btnSubmit=Search&Page=",i,sep="")
for (k in 5:30) {
jobs <- read_html(url) %>% html_node(css = paste0("#page > div.container > div.column-wrap.order-one-two > div.two-thirds > div:nth-child(",k,") > div > div.job-result-logo-title > div.job-result-title > h2 > a")) %>% html_text()
AllJOBS <- append(AllJOBS,jobs)
Sys.sleep(runif(1,1,2))
print(k)
}
print(paste0("Page",i))
}
output
> AllJOBS
[1] "Senior Consultant - Fund Static Data"
[2] "Data Warehouse Engineer"
[3] "Senior Software Engineer - Big Data DevOps"
[4] "HR Data Analyst"
[5] "Data Insights Engineer - Dublin - Permanent/Contract - SQL Server"
[6] NA
[7] "Data Engineer - Master Data Services - SQL Server - Permanent/Contract"
[8] "Senior Data Protection Officer (DPO) - Contract"
[9] "QC Data Analyst (Trending)"
[10] "Senior Data Warehouse Developer"
[11] "Senior Data Analyst FTC"
[12] "Compliance Advisory and Data Protection Relationship Manager"
[13] "Contracts Manager-Data Center"
[14] "Payments Product Data Analyst"
[15] "Data Center Product Hardware Platform Engineer"
[16] "People Data Privacy Program Lead"
[17] "Head of Data Science"
[18] "Data Protection Counsel (Product or Compliance)"
[19] "Data Engineer, GMS"
[20] "Data Protection Associate General Counsel"
[21] "Senior Data Engineer"
[22] "Geospatial Data Scientist"
[23] "Data Solutions Manager"
[24] "Data Protection Solicitor"
[25] "Junior Data Scientist"
[26] "Master Data Specialist"
[27] "Temp QC Electronic Data Management Analyst"
[28] "20725 -Data Scientist - Limerick"
[29] "Technical Support Specialist - Data Centre"
[30] "Lead QC Micro Analyst (data review and compliance)"
[31] "Temp QC Data Analyst"
[32] "#Abbvie Compliance Engineer (Data Integrity)"
[33] "People Data Analyst"
[34] "Senior Electrical Design Engineer - Data Centre Ex"
[35] "Laboratory Data Entry Assistant, UCD NVRL"
[36] "Data Migrations Specialist"
[37] "Data Protection Officer"
[38] "Data Center Operations Engineer (Linux)"
[39] "Senior Electrical Engineer | Data Centre LV Design"
[40] "Data Scientist - (Process Sciences)"
[41] "Mgr Supply Logistics Global Materials Data"
[42] "Data Protection / Privacy Delivery Consultant"
[43] "Global Supply Chain Data Analyst"
[44] "QC Data Analyst"
[45] "0582GradeVIIFOIOLOL1120 - Grade VII Data Protection / Freedom of Information & Compliance Officer"
[46] "DPO001 - Deputy Data Protection Officer (General Manager) Office of the Head of Data Protection, HSE"
[47] "Senior Campaign Data Analyst"
[48] "Data & Reporting Analyst II"
[49] "Azure Data Analytics Solution Architect"
[50] "Head of Risk Assurance for IT, Data, Projects and Outsourcing"
[51] "Trainee Data Technician, Ireland"
[52] NA
You can deal with NAs separately. Does this answer your question or I misinterpreted it?

Coercing "factor" to "transactions" R when using Apriori

I want to perform the Apriori method on the Cuisines column on my dataset.
Cusisine column sample:
[4] Japanese, Sushi
[5] Japanese, Korean
[6] Chinese
[7] Asian, European
[8] Seafood, Filipino, Asian, European
[9] European, Asian, Indian
[10] Filipino
[11] Filipino, Mexican
My code:
install.packages("arules")
library("arules")
itemsets <- apriori(dataSet$Cuisines, parameter=list(support=0.02, minlen=1, maxlen=1, target="frequent itemsets"))
However i keep getting:
no method or default for coercing “factor” to “transactions”
What goes wrong in here?
Is it illogical to use Apriori method on this column in my dataset?
If yes, what type of columns should i use the apriori method on?

You have to convert your data to transaction type:
dats <- strsplit(as.character(dats$Cuisines),',',fixed=T) # split by comma
trans <- as(dats, "transactions")
inspect(trans)
items
[1] { Sushi,Japanese}
[2] { Korean,Japanese}
[3] {Chinese}
[4] { European,Asian}
[5] { Asian, European, Filipino,Seafood}
[6] { Asian, Indian,European}
[7] {Filipino}
[8] { Mexican,Filipino}
Then you can do apriori:
itemsets <- apriori(trans, parameter=list(support=0.3))
inspect(itemsets)
With data:
dats <- read.table(text =" Cuisines
[1] 'Japanese, Sushi'
[2] 'Japanese, Korean'
[3] 'Chinese'
[4] 'Asian, European'
[5] 'Seafood, Filipino, Asian, European'
[6] 'European, Asian, Indian'
[7] 'Filipino'
[8] 'Filipino, Mexican' ", header = T)

Collapse elements separated by ""

I have raw bibliographic data as follows:
bib =
c("Bernal, Martin, \\\"Liu Shi-p\\'ei and National Essence,\\\" in Charlotte",
"Furth, ed., *The Limit of Change, Essays on Conservative Alternatives in",
"Republican China*, Cambridge: Harvard University Press, 1976.",
"", "Chen,Hsi-yuan, \"*Last Chapter Unfinished*: The Making of the *Draft Qing",
"History* and the Crisis of Traditional Chinese Historiography,\"",
"*Historiography East & West*2.2 (Sept. 2004): 173-204", "",
"Dennerline, Jerry, *Qian Mu and the World of Seven Mansions*, New Haven:",
"Yale University Press, 1988.", "")
[1] "Bernal, Martin, \\\"Liu Shi-p\\'ei and National Essence,\\\" in Charlotte"
[2] "Furth, ed., *The Limit of Change, Essays on Conservative Alternatives in"
[3] "Republican China*, Cambridge: Harvard University Press, 1976."
[4] ""
[5] "Chen,Hsi-yuan, \"*Last Chapter Unfinished*: The Making of the *Draft Qing"
[6] "History* and the Crisis of Traditional Chinese Historiography,\""
[7] "*Historiography East & West*2.2 (Sept. 2004): 173-204"
[8] ""
[9] "Dennerline, Jerry, *Qian Mu and the World of Seven Mansions*, New Haven:"
[10] "Yale University Press, 1988."
[11] ""
I would like to collapse elements between the ""s in one line so that:
clean_bib[1]=paste(bib[1], bib[2], bib[3])
clean_bib[2]=paste(bib[5], bib[6], bib[7])
clean_bib[3]=paste(bib[9], bib[10])
[1] "Bernal, Martin, \\\"Liu Shi-p\\'ei and National Essence,\\\" in Charlotte Furth, ed., *The Limit of Change, Essays on Conservative Alternatives in Republican China*, Cambridge: Harvard University Press, 1976."
[2] "Chen,Hsi-yuan, \"*Last Chapter Unfinished*: The Making of the *Draft Qing History* and the Crisis of Traditional Chinese Historiography,\" *Historiography East & West*2.2 (Sept. 2004): 173-204"
[3] "Dennerline, Jerry, *Qian Mu and the World of Seven Mansions*, New Haven: Yale University Press, 1988."
Is there a one-liner that does this automatically?

You can use tapply while grouping with all "" then paste together the groups
unname(tapply(bib,cumsum(bib==""),paste,collapse=" "))
[1] "Bernal, Martin, \\\"Liu Shi-p\\'ei and National Essence,\\\" in Charlotte Furth, ed., *The Limit of Change, Essays on Conservative Alternatives in Republican China*, Cambridge: Harvard University Press, 1976."
[2] " Chen,Hsi-yuan, \"*Last Chapter Unfinished*: The Making of the *Draft Qing History* and the Crisis of Traditional Chinese Historiography,\" *Historiography East & West*2.2 (Sept. 2004): 173-204"
[3] " Dennerline, Jerry, *Qian Mu and the World of Seven Mansions*, New Haven: Yale University Press, 1988."
[4] ""
you can also do:
unname(c(by(bib,cumsum(bib==""),paste,collapse=" ")))
or
unname(tapply(bib,cumsum(grepl("^$",bib)),paste,collapse=" "))
etc

Similar to the other answer. This uses split and sapply. The second line is just to remove any elements with only has "".
vec <- unname(sapply(split(bib, f = cumsum(bib %in% "")), paste0, collapse = " "))
vec[!vec %in% ""]

Find out POI (within 2km) using latitude and longitude

I have a dataset which corresponding of Zipcode along with lat and log.I want to find out list of hospital/bank(within 2km) from that latitude and longitude.
How to do it?
The Long/Lat data looks like
store_zip lon lat
410710 73.8248981 18.5154681
410209 73.0907 19.0218215
400034 72.8148177 18.9724162
400001 72.836334 18.9385352
400102 72.834424 19.1418961
400066 72.8635299 19.2313448
400078 72.9327444 19.1570343
400078 72.9327444 19.1570343
400007 72.8133825 18.9618411
400050 72.8299518 19.0551695
400062 72.8426858 19.1593396
400083 72.9374227 19.1166191
400603 72.9781047 19.1834148
401107 72.8929 19.2762702
401105 72.8663173 19.3053477
400703 72.9992013 19.0793547
401209 NA NA
401203 72.7983705 19.4166761
400612 73.0287209 19.1799265
400612 73.0287209 19.1799265
400612 73.0287209 19.1799265

If your Points of Interest are unknown and you need to find them, you can use Google's API through my googleway package (as you've suggested in the comments). You will need a valid API key for this to work.
As the API can only accept one request at a time, you'll need to iterate over your data one row at a time. For that you can use whatever looping method you're most comforatable with
library(googleway) ## using v2.4.0 on CRAN
set_key("your_api_key")
lst <- lapply(1:nrow(df), function(x){
google_places(search_string = "Hospital",
location = c(df[x, 'lat'], df[x, 'lon']),
radius = 2000)
})
lst is now a list that contains the results of the queries. For example, the names of the hospitals it has returned for the first row of your data is
place_name(lst[[1]])
# [1] "Jadhav Hospital"
# [2] "Poona Hospital Medical Store"
# [3] "Sanjeevan Hospital"
# [4] "Suyash Hospital"
# [5] "Mehta Hospital"
# [6] "Deenanath Mangeshkar Hospital"
# [7] "Sushrut Hospital"
# [8] "Deenanath Mangeshkar Hospital and Research Centre"
# [9] "MMF Ratna Memorial Hospital"
# [10] "Maharashtra Medical Foundation's Joshi Multispeciality Hospital"
# [11] "Sahyadri Hospitals"
# [12] "Deendayal Memorial Hospital"
# [13] "Jehangir Specialty Hospital"
# [14] "Global Hospital And Research Institute"
# [15] "Prayag Hospital"
# [16] "Apex Superspeciality Hospital"
# [17] "Deoyani Multi Speciality Hospital"
# [18] "Shashwat Hospital"
# [19] "Deccan Multispeciality Hardikar Hospital"
# [20] "City Hospital"
You can also view them on a map
set_key("map_api_key", api = "map")
## the lat/lon of the returned results are found through `place_location()`
# place_location(lst[[1]])
df_hospitals <- place_location(lst[[1]])
df_hospitals$name <- place_name(lst[[1]])
google_map() %>%
add_circles(data = df[1, ], radius = 2000) %>%
add_markers(data = df_hospitals, info_window = "name")
Note:
Google's API is limited to 2,500 queries per key per day, unless you pay for a premium account.

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

How to read a .txt file into a dataframe with readr? - r

Related

lapply() with XPath to obtain all text after a specific tag not working

How do you scrape multiple pages from same website on Rstudio

Coercing "factor" to "transactions" R when using Apriori

Collapse elements separated by ""

Find out POI (within 2km) using latitude and longitude

Categories

Resources