Unable to remove foreign-language Unicode codes in R
I have a .csv file with information in multiple foreign languages (Russian, Japanese, Arabic, etc.). For example, a column entry looks like this: <U+03BA><U+03BF><U+03C5>. I want to remove the rows which contain this kind of info.
I tried various solutions, all of them with no result:
test_fb5 <- read_csv('test_fb_data.csv', encoding = 'UTF-8')
or, applied to a column:
gsub("[<].*[>]", "")` or `sub("^\\s*<U\\+\\w+>\\s*", "")
or
gsub("\\s*<U\\+\\w+>$", "")
It seems that R 4.1.0 doesn't find the respective characters. I cannot find a way to attach a small chunk of the file here.
Here is a capture of the file:
address
33085 9848a 33 avenue nw t6n 1c6 edmonton ab canada alberta
33086 1075 avenue laframboise j2s 4w7 sainthyacinthe qc canada quebec
33087 <U+03BA><U+03BF><U+03C5><U+03BD><U+03BF><U+03C5>p<U+03B9>tsa 18050 spétses greece attica region
33088 390 progress ave unit 2 m1p 2z6 toronto on canada ontario
name
33085 md legals canada inc
33086 les aspirateurs jpg inc
33087 p<U+03AC>t<U+03C1>a<U+03BB><U+03B7><U+03C2>patralis
33088 wrench it up plumbing mechanical
category
33085 general practice attorneys divorce family law attorneys notaries
33086 <NA>
33087 mediterranean restaurants fish seafood restaurants
33088 plumbing services damage restoration mold remediation
phone
33085 17808512828
33086 14507781003
33087 302298072134
33088 14168005050
The 33085–33088 values are the row numbers of the dataset.
Thank you for your time!
You can use a negated character class to remove the <U...> codes:
gsub("<[^>]+>", "", x)
This matches any substring that:
starts with <,
is followed by one or more characters other than >, and
ends with >.
If there are other substrings between < and > which you do not want to remove, add U to target the Unicode codes more specifically: <U[^>]+>
Data:
x <- "address 33085 9848a 33 avenue nw t6n 1c6 edmonton ab canada alberta 33086 1075 avenue laframboise j2s 4w7 sainthyacinthe qc canada quebec 33087 <U+03BA><U+03BF><U+03C5><U+03BD><U+03BF><U+03C5>p<U+03B9>tsa 18050 spétses greece attica region 33088 390 progress ave unit 2 m1p 2z6 toronto on canada ontario name 33085 md legals canada inc 33086 les aspirateurs jpg inc 33087 p<U+03AC>t<U+03C1>a<U+03BB><U+03B7><U+03C2>patralis 33088 wrench it up plumbing mechanical category 33085 general practice attorneys divorce family law attorneys notaries 33086 <NA> 33087 mediterranean restaurants fish seafood restaurants 33088 plumbing services damage restoration mold remediation phone 33085 17808512828 33086 14507781003 33087 302298072134 33088 14168005050"
Related
Unknown issue prevents geocode_reverse from working
I recently updated RStudio to version 2022.07.1, working on Windows 10. When I tried different reverse-geocoding functions (input is a coordinate, output is the address), they all returned nothing found.

Example 1:

library(revgeo)
revgeo(-77.016472, 38.785026)

This is supposed to return "146 National Plaza, Fort Washington, Maryland, 20745, United States of America", but I got:

"Getting geocode data from Photon: http://photon.komoot.de/reverse?lon=-77.016472&lat=38.785026"
[[1]]
[1] "House Number Not Found Street Not Found, City Not Found, State Not Found, Postcode Not Found, Country Not Found"

Data from https://github.com/mhudecheck/revgeo

Example 2:

library(tidygeocoder)
library(dplyr)
path <- "filepath"
df <- read.csv(paste(path, "sample.csv", sep = ""))
reverse <- df %>%
  reverse_geocode(lat = longitude, long = latitude, method = 'osm',
                  address = address_found, full_results = TRUE)
reverse

where sample.csv is

name                 addr                                        latitude  longitude
White House          1600 Pennsylvania Ave NW, Washington, DC    38.89770  -77.03655
Transamerica Pyramid 600 Montgomery St, San Francisco, CA 94111  37.79520  -122.40279
Willis Tower         233 S Wacker Dr, Chicago, IL 60606          41.87535  -87.63576

This is supposed to give

name                 addr                                        latitude  longitude   address_found
White House          1600 Pennsylvania Ave NW, Washington, DC    38.89770  -77.03655   White House, 1600, Pennsylvania Avenue Northwest, Washington, District of Columbia, 20500, United States
Transamerica Pyramid 600 Montgomery St, San Francisco, CA 94111  37.79520  -122.40279  Transamerica Pyramid, 600, Montgomery Street, Chinatown, San Francisco, San Francisco City and County, San Francisco, California, 94111, United States
Willis Tower         233 S Wacker Dr, Chicago, IL 60606          41.87535  -87.63576   South Wacker Drive, Printer’s Row, Loop, Chicago, Cook County, Illinois, 60606, United States

but I got

# A tibble: 3 × 5
  name                 addr                            latitude longitude address_found
  <chr>                <chr>                              <dbl>     <dbl> <chr>
1 White House          1600 Pennsylvania Ave NW, Wash…     38.9     -77.0 NA
2 Transamerica Pyramid 600 Montgomery St, San Francis…     37.8     -122. NA
3 Willis Tower         233 S Wacker Dr, Chicago, IL 6…     41.9     -87.6 NA

Data source: https://cran.r-project.org/web/packages/tidygeocoder/readme/README.html

However, when I tried

reverse_geo(lat = 38.895865, long = -77.0307713, method = "osm")

I was able to get

# A tibble: 1 × 3
    lat  long address
  <dbl> <dbl> <chr>
1  38.9 -77.0 Pennsylvania Avenue, Washington, District of Columbia, 20045, United States

I contacted the tidygeocoder developer, who didn't find any problem; details in https://github.com/jessecambon/tidygeocoder/issues/175. Not sure which part goes wrong. Anyone want to try this in their RStudio?
The updated revgeo needs to be submitted to CRAN; this has nothing to do with RStudio. Going to http://photon.komoot.de/reverse?lon=-77.016472&lat=38.785026 in my browser also returns an error. However, I searched for the Photon reverse geocoder, and their example uses .io, not .de, in the URL; https://photon.komoot.io/reverse?lon=-77.016472&lat=38.785026 works. Photon also includes a note at the bottom of their examples:

Until October 2020 the API was available under photon.komoot.de. Requests still work as they redirected to photon.komoot.io but please update your apps accordingly.

Seems like that redirect is either broken or deprecated. The version of revgeo on GitHub already has this change made, so you can get a working version by using

remotes::install_github("https://github.com/mhudecheck/revgeo")
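To verify the fix after installing the GitHub build, you can rerun the failing call from the question; the expected address is the one quoted there:

library(revgeo)
# should now query photon.komoot.io and return
# "146 National Plaza, Fort Washington, Maryland, 20745, United States of America"
revgeo(-77.016472, 38.785026)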
How to quickly access elements in a list of lists without a loop in R? (Googleway package)
I was working with the googleway package and I had a bunch of addresses for which I needed to parse out the various address components, which were in a nested list of lists. Loops (not encouraged) and apply functions both seemed confusing, and I was not sure if there was a tidy solution. I found that the map function (specifically the pluck function it calls on lists on the back end) could accomplish my goal, so I will share my solution.

Problem: I need to pull out certain information about the White House, such as latitude and longitude.

You need to set up your Google Cloud API key with googleway::set_key(API_KEY), but this is just an example of a nested list that I hope someone working with this package will see.

# Addresses for the White House and the Lincoln Memorial
address_vec <- c(
  "1600 Pennsylvania Ave NW, Washington, DC 20006",
  "2 Lincoln Memorial Cir NW, Washington, DC 20002"
)
address_vec <- pmap(list(address_vec), googleway::google_geocode)

This outputs:

[[1]]
[[1]]$results
  address_components
1 1600, Pennsylvania Avenue Northwest, Northwest Washington, Washington, District of Columbia, United States, 20500, 1600, Pennsylvania Avenue NW, Northwest Washington, Washington, DC, US, 20500, street_number, route, neighborhood, political, locality, political, administrative_area_level_1, political, country, political, postal_code
  formatted_address                                      geometry.bounds.northeast.lat
1 1600 Pennsylvania Avenue NW, Washington, DC 20500, USA 38.8979
  geometry.bounds.northeast.lng geometry.bounds.southwest.lat geometry.bounds.southwest.lng geometry.location.lat
1 -77.03551                     38.89731                      -77.03796                     38.89766
  geometry.location.lng geometry.location_type geometry.viewport.northeast.lat geometry.viewport.northeast.lng
1 -77.03657             ROOFTOP                38.89895                        -77.03539
  geometry.viewport.southwest.lat geometry.viewport.southwest.lng place_id
1 38.89626                        -77.03808                       ChIJGVtI4by3t4kRr51d_Qm_x58
  types
1 establishment, point_of_interest, premise

[[1]]$status
[1] "OK"

[[2]]
[[2]]$results
  address_components
1 2, Lincoln Memorial Circle Northwest, Southwest Washington, Washington, District of Columbia, United States, 20037, 2, Lincoln Memorial Cir NW, Southwest Washington, Washington, DC, US, 20037, street_number, route, neighborhood, political, locality, political, administrative_area_level_1, political, country, political, postal_code
  formatted_address                                    geometry.location.lat geometry.location.lng
1 2 Lincoln Memorial Cir NW, Washington, DC 20037, USA 38.88927              -77.05018
  geometry.location_type geometry.viewport.northeast.lat geometry.viewport.northeast.lng
1 ROOFTOP                38.89062                        -77.04883
  geometry.viewport.southwest.lat geometry.viewport.southwest.lng place_id
1 38.88792                        -77.05152                       ChIJgRuEham3t4kRFju4R6De__g
  plus_code.compound_code     plus_code.global_code types
1 VWQX+PW Washington, DC, USA 87C4VWQX+PW           street_address

[[2]]$status
[1] "OK"
Here's some code that I got from the Googleway vignette:

df <- google_geocode(address = "Flinders Street Station",
                     key = key,
                     simplify = TRUE)

geocode_coordinates(df)
#         lat      lng
# 1 -37.81827 144.9671

It looks like what you need to do is:

df <- google_geocode("1600 Pennsylvania Ave")
geocode_coordinates(df)
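The same idea extends to the whole vector of addresses; a minimal sketch, assuming key holds a valid API key and addresses holds the raw strings from the question:

library(googleway)
library(purrr)

addresses <- c("1600 Pennsylvania Ave NW, Washington, DC 20006",
               "2 Lincoln Memorial Cir NW, Washington, DC 20002")

# geocode each address, then pull the coordinates out of each response
results <- map(addresses, google_geocode, key = key)
map(results, geocode_coordinates)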
The solution I came up with is a custom function that can access any section of the list:

geocode_accessor <- function(df, accessor, ...) {
  unlist(map(df, list(accessor, ...)))
}

This has three important parts to understand:

The map function is calling the pluck function for us (it replaces the use of [[). You can read more about what is happening here, but just know this lets us access things by name.
The ... in the function's definition, as well as in the list, allows us to access multiple levels. Again, the use of list() to access further levels in a list is explained in the pluck documentation.
The use of unlist converts the list to a vector (what I want in my instance).

Putting this all together, we can get the latitude of the White House and the Lincoln Memorial:

geocode_accessor(address_vec, "results", "geometry", "location", "lat")
[1] 38.89766 38.88927
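For reference, purrr can do the same walk in a single call; a minimal sketch, assuming each geocode response contains exactly one result:

library(purrr)

# pluck() steps one level per argument, just like the list() inside map()
map_dbl(address_vec, pluck, "results", "geometry", "location", "lat")
# [1] 38.89766 38.88927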
Can I create a vector with regexps?
My data looks something like this:

412 U CA, Riverside
413 U British Columbia
414 CREI
415 U Pompeu Fabra
416 Office of the Comptroller of the Currency, US Department of the Treasury
417 Bureau of Economics, US Federal Trade Commission
418 U Carlos III de Madrid
419 U Brescia
420 LUISS Guido Carli
421 U Alicante
422 Harvard Society of Fellows
423 Toulouse School of Economics
424 Decision Economics Inc, Boston, MA
425 ECARES, Free U Brussels

I will need to geocode this data in order to get the coordinates for each specific institution. To do that, I need all state names to be spelled out. At the same time, I don't want acronyms like "ECARES" to be transformed into "ECaliforniaRES". I have been toying with the idea of converting the state.abb and state.name vectors into vectors of regular expressions, so that state.abb would look something like this (using Alabama and California as states 1 and 2):

c("^AL "|" AL "|" AL,"|",AL "|" AL$", "^CA "[....])

and the state.name vector something like this:

c("^Alabama "|" Alabama "|" Alabama,"|",Alabama "|" Alabama$", "^California "[....])

Hopefully, I can then use the mgsub function to replace all expressions in the modified state.abb vector with the corresponding entries in the modified state.name vector. For some reason, however, it doesn't seem to be possible to put regexps in a vector:

test <- c(^AL, ^AB)
Error: unexpected '^' in "test<-c(^"

I have tried escaping the "^" signs, but this doesn't really seem to work:

test <- c(\^AL, \^AB)
Error: unexpected input in "test<-c(\"
> test <- c(\\^AL, \\^AB)

Is there any way of putting regexps in a vector, or is there another way of achieving my goal (that is, to replace all two-letter state abbreviations with state names without messing up other acronyms in the process)? Excerpt of my data:

c("U Lausanne", "Swiss Finance Institute", "U CA, Riverside", "U British Columbia",
  "CREI", "U Pompeu Fabra", "Office of the Comptroller of the Currency, US Department of the Treasury",
  "Bureau of Economics, US Federal Trade Commission", "U Carlos III de Madrid",
  "U Brescia", "LUISS Guido Carli", "U Alicante", "Harvard Society of Fellows",
  "Toulouse School of Economics", "Decision Economics Inc, Boston, MA",
  "ECARES, Free U Brussels", "Baylor U", "Research Centre for Education",
  "the Labour Market, Maastricht U", "U Bonn", "Swarthmore College")
We can make use of the state.abb vector and paste it together, collapsing with |:

pat1 <- paste0("\\b(", paste(state.abb, collapse="|"), ")\\b")

The \\b signifies a word boundary, so that indiscriminate matches (e.g. the "AL" in "SAL") are avoided. Similarly with state.name, paste ^ and $ as prefix and suffix to mark the start and end of the string, respectively:

pat2 <- paste0("^(", paste(state.name, collapse="|"), ")$")
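To perform the actual abbreviation-to-name replacement, here is a minimal sketch, assuming stringr is acceptable (str_replace_all accepts a named vector of pattern = replacement pairs):

library(stringr)

# one \b-wrapped pattern per abbreviation; names are patterns, values are replacements
repl <- setNames(state.name, paste0("\\b", state.abb, "\\b"))

affil <- c("U CA, Riverside", "Decision Economics Inc, Boston, MA",
           "ECARES, Free U Brussels")
str_replace_all(affil, repl)
# [1] "U California, Riverside"
# [2] "Decision Economics Inc, Boston, Massachusetts"
# [3] "ECARES, Free U Brussels"    # the acronym is left untouched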
Getting 'Error in `[.data.frame`(.x, ...) : undefined columns selected' using purrr::map
I've been trying to learn purrr, as I've been working with some deeply nested JSON data, but I keep getting errors that don't seem to be appearing elsewhere online. Below is my JSON data / code: json <- {"countries":[[{"holdings":[{"quantity":"50,000","cost":"7,597,399","currency":"USD","evaluation":"6,853,500","percentageOfNetAssets":"4.52","title":"Alibaba Group Holding Ltd ADR ","page":"66"},{"quantity":"625,000","cost":"1,842,933","currency":"HKD","evaluation":"3,033,457","percentageOfNetAssets":"2.00","title":"Anhui Conch Cement Co Ltd ","page":"66"},{"quantity":"1,949,000","cost":"1,480,711","currency":"HKD","evaluation":"0","percentageOfNetAssets":"0.00","title":"China Animal Healthcare Ltd ","page":"66"},{"quantity":"2,888,000","cost":"2,992,011","currency":"HKD","evaluation":"2,382,890","percentageOfNetAssets":"1.57","title":"China Construction Bank Corp ","page":"66"},{"quantity":"2,030,000","cost":"2,994,592","currency":"HKD","evaluation":"3,137,298","percentageOfNetAssets":"2.07","title":"CNOOC Ltd ","page":"66"},{"quantity":"400,000","cost":"3,127,007","currency":"HKD","evaluation":"3,548,187","percentageOfNetAssets":"2.34","title":"ENN Energy Holdings Ltd ","page":"66"},{"quantity":"349,297","cost":"3,288,042","currency":"CNH","evaluation":"3,497,876","percentageOfNetAssets":"2.31","title":"Foshan Haitian Flavouring & Food Co Ltd ","page":"66"},{"quantity":"929,012","cost":"3,187,360","currency":"CNH","evaluation":"3,093,845","percentageOfNetAssets":"2.04","title":"Inner Mongolia Yili Industrial Group Co Ltd ","page":"66"},{"quantity":"630,000","cost":"4,720,889","currency":"HKD","evaluation":"5,564,255","percentageOfNetAssets":"3.67","title":"Ping An Insurance Group Co of China Ltd ","page":"66"},{"quantity":"422,000","cost":"3,030,250","currency":"HKD","evaluation":"4,783,603","percentageOfNetAssets":"3.16","title":"Shenzhou International Group Holdings Ltd ","page":"66"},{"quantity":"250,000","cost":"7,161,493","currency":"HKD","evaluation":"10,026,375","percentageOfNetAssets":"6.61","title":"Tencent Holdings Ltd ","page":"66"},{"quantity":"986,000","cost":"2,263,521","currency":"HKD","evaluation":"2,525,024","percentageOfNetAssets":"1.67","title":"TravelSky Technology Ltd ","page":"66"},{"quantity":"50,600","cost":"5,969,458","currency":"USD","evaluation":"2,956,558","percentageOfNetAssets":"1.95","title":"Weibo Corp ADR ","page":"66"},{"quantity":"100,000","cost":"3,581,365","currency":"USD","evaluation":"3,353,000","percentageOfNetAssets":"2.21","title":"Yum China Holdings Inc ","page":"66"}],"total":{"page":"66","cost":"53,237,031","evaluation":"54,755,868","percentageOfNetAssets":"36.12"},"title":"China ","page":"66"},{"holdings":[{"quantity":"1,130,000","cost":"8,812,716","currency":"HKD","evaluation":"9,381,366","percentageOfNetAssets":"6.19","title":"AIA Group Ltd ","page":"66"},{"quantity":"700,000","cost":"3,099,505","currency":"HKD","evaluation":"2,601,749","percentageOfNetAssets":"1.71","title":"BOC Hong Kong Holdings Ltd ","page":"66"},{"quantity":"12,500,000","cost":"2,685,700","currency":"HKD","evaluation":"2,378,869","percentageOfNetAssets":"1.57","title":"Pacific Basin Shipping Ltd ","page":"66"}],"total":{"page":"66","cost":"14,597,921","evaluation":"14,361,984","percentageOfNetAssets":"9.47"},"title":"Hong Kong ","page":"66"},{"holdings":[{"quantity":"380,000","cost":"2,713,957","currency":"INR","evaluation":"2,731,820","percentageOfNetAssets":"1.80","title":"Future Retail Ltd 
","page":"66"},{"quantity":"135,000","cost":"3,260,344","currency":"INR","evaluation":"3,806,163","percentageOfNetAssets":"2.51","title":"Housing Development Finance Corp Ltd ","page":"66"},{"quantity":"192,556","cost":"2,280,854","currency":"INR","evaluation":"4,411,012","percentageOfNetAssets":"2.91","title":"IndusInd Bank Ltd ","page":"66"},{"quantity":"40,000","cost":"2,584,823","currency":"INR","evaluation":"4,277,304","percentageOfNetAssets":"2.82","title":"Maruti Suzuki India Ltd ","page":"66"},{"quantity":"89,000","cost":"2,460,333","currency":"INR","evaluation":"2,413,256","percentageOfNetAssets":"1.59","title":"Tata Consultancy Services Ltd ","page":"66"},{"quantity":"265,281","cost":"3,375,523","currency":"INR","evaluation":"3,537,586","percentageOfNetAssets":"2.34","title":"Titan Co Ltd ","page":"66"}],"total":{"page":"66","cost":"16,675,834","evaluation":"21,177,141","percentageOfNetAssets":"13.97"},"title":"India ","page":"66"},{"holdings":[{"quantity":"3,280,100","cost":"4,020,564","currency":"IDR","evaluation":"5,930,640","percentageOfNetAssets":"3.91","title":"Bank Central Asia Tbk PT ","page":"66"}],"total":{"page":"66","cost":"4,020,564","evaluation":"5,930,640","percentageOfNetAssets":"3.91"},"title":"Indonesia ","page":"66"},{"holdings":[{"quantity":"1,550,000","cost":"3,239,787","currency":"MYR","evaluation":"3,143,134","percentageOfNetAssets":"2.07","title":"Malaysia Airports Holdings Bhd ","page":"66"}],"total":{"page":"66","cost":"3,239,787","evaluation":"3,143,134","percentageOfNetAssets":"2.07"},"title":"Malaysia ","page":"66"},{"holdings":[{"quantity":"3,687,000","cost":"2,747,584","currency":"PHP","evaluation":"2,846,671","percentageOfNetAssets":"1.88","title":"Ayala Land Inc ","page":"66"}],"total":{"page":"66","cost":"2,747,584","evaluation":"2,846,671","percentageOfNetAssets":"1.88"},"title":"Philippines ","page":"66"},{"holdings":[{"quantity":"234,969","cost":"4,056,529","currency":"SGD","evaluation":"4,083,944","percentageOfNetAssets":"2.69","title":"DBS Group Holdings Ltd ","page":"66"}],"total":{"page":"66","cost":"4,056,529","evaluation":"4,083,944","percentageOfNetAssets":"2.69"},"title":"Singapore ","page":"66"},{"holdings":[{"quantity":"18,000","cost":"6,727,232","currency":"KRW","evaluation":"5,597,777","percentageOfNetAssets":"3.69","title":"LG Chem Ltd ","page":"66"},{"quantity":"3,800","cost":"4,844,479","currency":"KRW","evaluation":"3,749,597","percentageOfNetAssets":"2.48","title":"LG Household & Health Care Ltd ","page":"66"},{"quantity":"171,000","cost":"5,875,373","currency":"KRW","evaluation":"5,930,902","percentageOfNetAssets":"3.91","title":"Samsung Electronics Co Ltd ","page":"66"},{"quantity":"40,000","cost":"2,595,521","currency":"KRW","evaluation":"2,168,847","percentageOfNetAssets":"1.43","title":"SK Hynix Inc ","page":"66"}],"total":{"page":"66","cost":"20,042,605","evaluation":"17,447,123","percentageOfNetAssets":"11.51"},"title":"South Korea ","page":"66"},{"holdings":[{"quantity":"244,000","cost":"2,788,471","currency":"TWD","evaluation":"1,786,121","percentageOfNetAssets":"1.18","title":"Catcher Technology Co Ltd ","page":"67"},{"quantity":"23,000","cost":"3,167,793","currency":"TWD","evaluation":"2,405,732","percentageOfNetAssets":"1.59","title":"Largan Precision Co Ltd ","page":"67"},{"quantity":"1,400,000","cost":"8,639,650","currency":"TWD","evaluation":"10,271,009","percentageOfNetAssets":"6.77","title":"Taiwan Semiconductor Manufacturing Co Ltd 
","page":"67"}],"total":{"page":"67","cost":"14,595,914","evaluation":"14,462,862","percentageOfNetAssets":"9.54"},"title":"Taiwan ","page":"67"},{"holdings":[{"quantity":"1,739,000","cost":"3,416,652","currency":"THB","evaluation":"3,431,534","percentageOfNetAssets":"2.27","title":"Airports of Thailand PCL ","page":"67"},{"quantity":"1,269,500","cost":"2,379,289","currency":"THB","evaluation":"2,914,470","percentageOfNetAssets":"1.92","title":"Central Pattana PCL ","page":"67"}],"total":{"page":"67","cost":"Total - Transferable securities admitted to an official stock exchange listing 139,009,710","evaluation":"144,555,371","percentageOfNetAssets":"95.35"},"title":"Thailand ","page":"67"}]],"total":[{}],"page":["66"],"title":["Shares "]} library(jsonlite) library(purrr) library(magrittr) test <- fromJSON(json) %>% purrr::map(extract, c("page", "title")) Error in `[.data.frame`(.x, ...) : undefined columns selected Basically, I'm trying to see if is.unsorted(test$countries[[1]]$title) returns true, and if it does, then I need to keep test$page. I thought extracting countries and page would be the best way to do this, but I am super unfamiliar with purrr, and the extract (or [) methods I see online are not working. Does anyone see what I'm doing wrong, or know of an alternative approach?
If you want to return an element of a list when something else is true, why not just a simple ifelse?

data <- fromJSON(json)
ifelse(is.unsorted(data$countries[[1]]$title), data$page, FALSE)

# Look at the negation (so if it's not unsorted)
ifelse(!is.unsorted(data$countries[[1]]$title), data$page, FALSE)
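For the nested access itself, purrr::pluck avoids the chained [[; a minimal sketch, assuming the structure shown in the question:

library(jsonlite)
library(purrr)

data <- fromJSON(json)
# pluck() walks one level per argument: "countries", its first entry,
# then the "title" column of that data frame
titles <- pluck(data, "countries", 1, "title")
if (is.unsorted(titles)) data$page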
Extract dates from texts in a corpus in R
I have a corpus object from which I want to extract data so I can add them as docvars. The object looks like this:

v1 <- c("(SE22-y -7 A go q ,, Document of The World Bank FOR OFFICIAL USE ONLY il I ( >I8.( )]i 1 t'f-l±E C 4'( | Report No. 9529-LSO l il .rt N ,- / . t ,!I . 1. 'i 1( T v f) (: AR.) STAFF APPRAISAL REPORT KINGDOM OF LESOTHO EDUCATION SECTOR DEVELOPMENT PROJECT JUNE 19, 1991 Population and Human Resources Division Southern Africa Department This document has a restricted distribution and may be used by reipients only in the performance of their official duties. Its contents may not otherwise be disclosed without World Bank authorization.",
"Document of The World Bank Report No. 13611-PAK STAFF APPRAISAL REPORT PAKISTAN POPULATION WELFARE PROGRAM PROJECT FREBRUARY 10, 1995 Population and Human Resources Division Country Department I South Asia Region",
"I Toward an Environmental Strategy for Asia A Summary of a World Bank Discussion Paper Carter Brandon Ramesh Ramankutty The World Bank Washliington, D.C. (C 1993 The International Bank for Reconstruction and Development / THiE WORLD BANK 1818 H Street, N.W. Washington, D.C. 20433 All rights reserved Manufactured in the United States of America First printing November 1993",
"Report No. PID9188 Project Name East Timor-TP-Emergency School (#) Readiness Project Region East Asia and Pacific Region Sector Other Education Project ID TPPE70268 Borrower(s) EAST TIMOR Implementing Agency Address UNTAET (UN TRANSITIONAL ADMINISTRATION FOR EAST TIMOR) Contact Person: Cecilio Adorna, UNTAET, Dili, East Timor Fax: 61-8 89 422198 Environment Category C Date PID Prepared June 16, 2000 Projected Appraisal Date May 27, 2000 Projected Board Date June 20, 2000",
"Page 1 CONFORMED COPY CREDIT NUMBER 2447-CHA (Reform, Institutional Support and Preinvestment Project) between PEOPLE'S REPUBLIC OF CHINA and INTERNATIONAL DEVELOPMENT ASSOCIATION Dated December 30, 1992")
c1 <- corpus(v1)

The first thing I want to do is extract the first occurring date. Mostly it occurs as "Month Year" (December 1990) or "Month Day, Year" (JUNE 19, 1991), or with a typo (FREBRUARY 10, 1995), in which case the month could be discarded. My code is a combination of "Extract date text from string" and "Extract Dates in any format from Text in R":

lapply(c1$documents$texts, function(x) anydate(str_extract_all(c1$documents$texts, "[[:alnum:]]+[ /]*\\d{2}[ /]*\\d{4}")))

and I get the error:

Error in anytime_cpp(x = x, tz = tz, asUTC = asUTC, asDate = TRUE, useR = useR, : Unsupported Type

However, I do not know how to supply the date format. Furthermore, I don't really get how to write the correct regular expressions:
https://www.regular-expressions.info/dates.html
https://www.regular-expressions.info/rlanguage.html

Other questions on this subject:
Extract date from text
Need to extract date from a text file of strings in R
http://r.789695.n4.nabble.com/Regexp-extract-first-occurrence-of-date-in-string-td997254.html
Extract date from given string in r
str_extract_all(
  texts(c1),
  "(\\b(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Nov(?:ember)?|Oct(?:ober)?|Dec(?:ember)?) (?:19[7-9]\\d|2\\d{3})(?=\\D|$))|(\\b(?:JAN(?:UARY)?|FEB(?:RUARY)?|MAR(?:CH)?|APR(?:IL)?|MAY|JUN(?:E)?|JUL(?:Y)?|AUG(?:UST)?|SEP(?:TEMBER)?|NOV(?:EMBER)?|OCT(?:OBER)?|DEC(?:EMBER)?) (?:19[7-9]\\d|2\\d{3})(?=\\D|$))|((Jan(uary)?|Feb(ruary)?|Mar(ch)?|Apr(il)?|May|Jun(e)?|Jul(y)?|Aug(ust)?|Sep(tember)?|Oct(ober)?|Nov(ember)?|Dec(ember)?)\\s+\\d{1,2},\\s+\\d{4})|(\\b(JAN(UARY)?|FEB(RUARY)?|MAR(CH)?|APR(IL)?|MAY|JUN(E)?|JUL(Y)?|AUG(UST)?|SEP(TEMBER)?|OCT(OBER)?|NOV(EMBER)?|DEC(EMBER)?)\\s+\\d{1,2},\\s+\\d{4})",
  simplify = TRUE
)[, 1]

This gives the first occurrence in either format, e.g. JUNE 19, 1991 or December 1990.
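To turn those matches into Date values, a minimal sketch (pattern is a hypothetical name standing for the regular expression above; assumes an English locale for the %B month names):

library(stringr)

dates  <- str_extract(texts(c1), pattern)   # first match per document
x      <- str_to_title(dates)               # "JUNE 19, 1991" -> "June 19, 1991"
parsed <- as.Date(x, format = "%B %d, %Y")  # "Month Day, Year" layout
idx    <- is.na(parsed)
# "Month Year" layout: prepend a day so as.Date can parse it
parsed[idx] <- as.Date(paste("1", x[idx]), format = "%d %B %Y")
parsed  # the misspelled "FREBRUARY 10, 1995" stays NA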