Extract dates from texts in a corpus in R

I have a corpus object from which I want to extract dates so I can add them as docvars.
The object looks like this:
v1 <- c("(SE22-y -7 A go q ,, Document of The World Bank FOR OFFICIAL USE ONLY il I ( >I8.( )]i 1 t'f-l±E C 4'( | Report No. 9529-LSO l il .rt N ,- / . t ,!I . 1. 'i 1( T v f) (: AR.) STAFF APPRAISAL REPORT KINGDOM OF LESOTHO EDUCATION SECTOR DEVELOPMENT PROJECT JUNE 19, 1991 Population and Human Resources Division Southern Africa Department This document has a restricted distribution and may be used by reipients only in the performance of their official duties. Its contents may not otherwise be disclosed without World Bank authorization.",
"Document of The World Bank Report No. 13611-PAK STAFF APPRAISAL REPORT PAKISTAN POPULATION WELFARE PROGRAM PROJECT FREBRUARY 10, 1995 Population and Human Resources Division Country Department I South Asia Region",
"I Toward an Environmental Strategy for Asia A Summary of a World Bank Discussion Paper Carter Brandon Ramesh Ramankutty The World Bank Washliington, D.C. (C 1993 The International Bank for Reconstruction and Development / THiE WORLD BANK 1818 H Street, N.W. Washington, D.C. 20433 All rights reserved Manufactured in the United States of America First printing November 1993",
"Report No. PID9188 Project Name East Timor-TP-Emergency School (#) Readiness Project Region East Asia and Pacific Region Sector Other Education Project ID TPPE70268 Borrower(s) EAST TIMOR Implementing Agency Address UNTAET (UN TRANSITIONAL ADMINISTRATION FOR EAST TIMOR) Contact Person: Cecilio Adorna, UNTAET, Dili, East Timor Fax: 61-8 89 422198 Environment Category C Date PID Prepared June 16, 2000 Projected Appraisal Date May 27, 2000 Projected Board Date June 20, 2000",
"Page 1 CONFORMED COPY CREDIT NUMBER 2447-CHA (Reform, Institutional Support and Preinvestment Project) between PEOPLE'S REPUBLIC OF CHINA and INTERNATIONAL DEVELOPMENT ASSOCIATION Dated December 30, 1992")
c1 <- corpus(v1)
The first thing I want to do is extract the first occurring date. Mostly it occurs as "Month Year" (December 1990) or "Month Day, Year" (JUNE 19, 1991), or with a typo (FREBRUARY 10, 1995), in which case the month could be discarded.
My code is a combination of "Extract date text from string" and "Extract Dates in any format from Text in R":
lapply(c1$documents$texts, function(x) anydate(str_extract_all(x, "[[:alnum:]]+[ /]*\\d{2}[ /]*\\d{4}")))
and I get the error:
Error in anytime_cpp(x = x, tz = tz, asUTC = asUTC, asDate = TRUE, useR = useR, : Unsupported Type
However, I do not know how to supply the date format. Furthermore, I don't really get how to write the correct regular expressions.
https://www.regular-expressions.info/dates.html & https://www.regular-expressions.info/rlanguage.html
Other questions on this subject are:
Extract date from text
Need to extract date from a text file of strings in R
http://r.789695.n4.nabble.com/Regexp-extract-first-occurrence-of-date-in-string-td997254.html
Extract date from given string in r

str_extract_all(texts(c1)
, "(\\b(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Nov(?:ember)?|Oct(?:ober)?|Dec(?:ember)?) (?:19[7-9]\\d|2\\d{3})(?=\\D|$))|(\\b(?:JAN(?:UARY)?|FEB(?:RUARY)?|MAR(?:CH)?|APR(?:IL)?|MAY|JUN(?:E)?|JUL(?:Y)?|AUG(?:UST)?|SEP(?:TEMBER)?|NOV(?:EMBER)?|OCT(?:OBER)?|DEC(?:EMBER)?) (?:19[7-9]\\d|2\\d{3})(?=\\D|$))|((Jan(uary)?|Feb(ruary)?|Mar(ch)?|Apr(il)?|May|Jun(e)?|Jul(y)?|Aug(ust)?|Sep(tember)?|Oct(ober)?|Nov(ember)?|Dec(ember)?)\\s+\\d{1,2},\\s+\\d{4})|(\\b(JAN(UARY)?|FEB(RUARY)?|MAR(CH)?|APR(IL)?|MAY|JUN(E)?|JUL(Y)?|AUG(UST)?|SEP(TEMBER)?|OCT(OBER)?|NOV(EMBER)?|DEC(EMBER)?)\\s+\\d{1,2},\\s+\\d{4})"
, simplify = TRUE)[,1]
This gives the first occurrence in the format JUNE 19, 1991 or December 1990.
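The four near-duplicate alternations above (mixed-case and upper-case, with and without a day) can be collapsed into one case-insensitive pattern. A minimal base-R sketch, assuming `v1` is the raw text vector from the question (the function name `extract_first_date` is made up for illustration):

```r
# One case-insensitive month alternation instead of four duplicated ones.
months <- paste0("Jan(uary)?|Feb(ruary)?|Mar(ch)?|Apr(il)?|May|Jun(e)?|Jul(y)?|",
                 "Aug(ust)?|Sep(tember)?|Oct(ober)?|Nov(ember)?|Dec(ember)?")
pat <- paste0("\\b(", months, ")\\s+([0-9]{1,2},\\s+)?(19|20)[0-9]{2}\\b")

# First date-like match per document; NA where nothing matches
# (e.g. the FREBRUARY typo simply produces no match for that document).
extract_first_date <- function(x) {
  m <- regexpr(pat, x, ignore.case = TRUE, perl = TRUE)
  out <- rep(NA_character_, length(x))
  out[m > 0] <- regmatches(x, m)
  out
}
```

Note that the original `anydate()` error came from feeding it the list returned by `str_extract_all()`; `extract_first_date()` returns a plain character vector, which `anytime::anydate()` or `as.Date()` with an explicit format should be able to parse, and which can then be attached with `docvars(c1, "date") <- ...`.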

Extracting first word after a specific expression in R

I have a column that contains thousands of descriptions like this (example):
Description
Building a hospital in the city of LA, USA
Building a school in the city of NYC, USA
Building shops in the city of Chicago, USA
I'd like to create a column with the first word after "city of", like this:
Description                                  City
Building a hospital in the city of LA, USA   LA
Building a school in the city of NYC, USA    NYC
Building shops in the city of Chicago, USA   Chicago
I tried the following code after seeing the topic "Extracting string after specific word", but my column is only filled with missing values:
library(stringr)
df$city <- data.frame(str_extract(df$Description, "(?<=city of:\\s)[^;]+"))
df$city <- data.frame(str_extract(df$Description, "(?<=of:\\s)[^;]+"))
I took a look at the dput() output, and it is the same as the descriptions I see in the dataframe directly.
Solution
This should do the trick for the data you showed:
df$city <- str_extract(df$Description, "(?<=city of )(\\w+)")
df
#> Description city
#> 1 Building a hospital in the city of LA, USA LA
#> 2 Building a school in the city of NYC, USA NYC
#> 3 Building shops in the city of Chicago, USA Chicago
Alternative
However, in case you want the whole string up to the first comma (for example for cities with a space in the name), you can go with:
df$city <- str_extract(df$Description, "(?<=city of )(.+)(?=,)")
Check out the following example:
df <- data.frame(Description = c("Building a hospital in the city of LA, USA",
                                 "Building a school in the city of NYC, USA",
                                 "Building shops in the city of Chicago, USA",
                                 "Building a church in the city of Salt Lake City, USA"))
str_extract(df$Description, "(?<=the city of )(\\w+)")
#> [1] "LA" "NYC" "Chicago" "Salt"
str_extract(df$Description, "(?<=the city of )(.+)(?=,)")
#> [1] "LA" "NYC" "Chicago" "Salt Lake City"
Documentation
Check out ?regex:
Patterns (?=...) and (?!...) are zero-width positive and negative
lookahead assertions: they match if an attempt to match the ...
forward from the current position would succeed (or not), but use up
no characters in the string being processed. Patterns (?<=...) and
(?<!...) are the lookbehind equivalents: they do not allow repetition
quantifiers nor \C in ....
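A two-line illustration of the quoted passage, in base R (perl = TRUE enables the lookarounds; the example strings are made up):

```r
x <- c("Total: 30 USD", "ID=42;")
# Lookahead: match digits only when followed by " USD"; "30" is returned,
# but " USD" is not consumed.
ahead  <- regmatches(x[1], regexpr("\\d+(?= USD)", x[1], perl = TRUE))  # "30"
# Lookbehind: match digits only when preceded by "ID="; "42" is returned.
behind <- regmatches(x[2], regexpr("(?<=ID=)\\d+", x[2], perl = TRUE))  # "42"
```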

Unable to remove foreign-language Unicode codes

I have a .csv file containing info in multiple foreign languages (Russian, Japanese, Arabic, etc.). For example, a column entry looks like this: <U+03BA><U+03BF><U+03C5>. I want to remove rows which have this kind of info.
I tried various solutions, all of them with no result:
test_fb5 <- read_csv('test_fb_data.csv', encoding = 'UTF-8')
or applied for a column:
gsub("[<].*[>]", "", x)
or
sub("^\\s*<U\\+\\w+>\\s*", "", x)
or
gsub("\\s*<U\\+\\w+>$", "", x)
It seems that R 4.1.0 doesn't find the respective characters. I cannot find a way to attach a small chunk of the file here.
Here is the capture of the file:
address
33085 9848a 33 avenue nw t6n 1c6 edmonton ab canada alberta
33086 1075 avenue laframboise j2s 4w7 sainthyacinthe qc canada quebec
33087 <U+03BA><U+03BF><U+03C5><U+03BD><U+03BF><U+03C5>p<U+03B9>tsa 18050 spétses greece attica region
33088 390 progress ave unit 2 m1p 2z6 toronto on canada ontario
name
33085 md legals canada inc
33086 les aspirateurs jpg inc
33087 p<U+03AC>t<U+03C1>a<U+03BB><U+03B7><U+03C2>patralis
33088 wrench it up plumbing mechanical
category
33085 general practice attorneys divorce family law attorneys notaries
33086 <NA>
33087 mediterranean restaurants fish seafood restaurants
33088 plumbing services damage restoration mold remediation
phone
33085 17808512828
33086 14507781003
33087 302298072134
33088 14168005050
The 3308x numbers are the row indices of the dataset.
Thank you for your time!
You can use a negative character class to remove the <U...> codes:
gsub("<[^>]+>", "", x)
This matches any substring that:
starts with <,
is followed one or more times by any character except the > character, and
ends with >.
If you have other substrings between < and > which you do not want to remove, just add U to target the Unicode codes more specifically: <U[^>]+>
Data:
x <- "address 33085 9848a 33 avenue nw t6n 1c6 edmonton ab canada alberta 33086 1075 avenue laframboise j2s 4w7 sainthyacinthe qc canada quebec 33087 <U+03BA><U+03BF><U+03C5><U+03BD><U+03BF><U+03C5>p<U+03B9>tsa 18050 spétses greece attica region 33088 390 progress ave unit 2 m1p 2z6 toronto on canada ontario name 33085 md legals canada inc 33086 les aspirateurs jpg inc 33087 p<U+03AC>t<U+03C1>a<U+03BB><U+03B7><U+03C2>patralis 33088 wrench it up plumbing mechanical category 33085 general practice attorneys divorce family law attorneys notaries 33086 <NA> 33087 mediterranean restaurants fish seafood restaurants 33088 plumbing services damage restoration mold remediation phone 33085 17808512828 33086 14507781003 33087 302298072134 33088 14168005050"
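Since the stated goal is to drop whole rows rather than strip the codes, the same pattern can drive a row filter. A sketch under the assumption that the relevant columns are plain character vectors (the small data frame and its column names here are made up for illustration):

```r
df <- data.frame(
  address = c("9848a 33 avenue nw edmonton ab canada",
              "<U+03BA><U+03BF><U+03C5>nta 18050 spetses greece"),
  name    = c("md legals canada inc", "p<U+03AC>tralis"),
  stringsAsFactors = FALSE
)

# TRUE for every row in which any column contains a <U+xxxx> escape.
has_code <- Reduce(`|`, lapply(df, function(col) grepl("<U\\+[0-9A-Fa-f]{4}>", col)))
clean_df <- df[!has_code, , drop = FALSE]
```

Filtering on the raw `<U+xxxx>` text only works if the codes really are literal characters in the file, as the question suggests, rather than an artifact of how the console prints non-ASCII text.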

How to clean irregular strings & organize them into a dataframe with the right columns

I have two long strings that look like this in a vector:
x <- c("Job Information\n\nLocation: \n\n\nScarsdale, New York, 10583-3050, United States \n\n\n\n\n\nJob ID: \n53827738\n\n\nPosted: \nApril 22, 2020\n\n\n\n\nMin Experience: \n3-5 Years\n\n\n\n\nRequired Travel: \n0-10%",
"Job Information\n\nLocation: \n\n\nGlenview, Illinois, 60025, United States \n\n\n\n\n\nJob ID: \n53812433\n\n\nPosted: \nApril 21, 2020\n\n\n\n\nSalary: \n$110,000.00 - $170,000.00 (Yearly Salary)")
and my goal is to organize them neatly in a dataframe (output form), something like this:
#View(df)
Location Job ID Posted Min Experience Required Travel Salary
[1] Scarsdale,... 53827738 April 22... 3-5 Years 0-10% NA
[2] Glenview,... 53812433 April 21... NA NA $110,000.00 - $170,000.00 (Yearly Salary)
(...) was done to present the dataframe here neatly.
However, as you see, the two strings don't necessarily have the same attributes. For example, the first string has Min Experience and Required Travel, but in the second string those fields don't exist, and it has Salary instead. So this is getting very tricky for me. I thought I would split on the \n characters, but they are not consistent: some fields are separated by two newlines, others by four or five. I was wondering if someone can help me out. I will appreciate it!
We can split each string on one or more '\n' ('\n{1,}') and remove the first element of each result (which is 'Job Information'), as we don't need it anywhere (x <- x[-1]). The remaining elements come in pairs of the form columnname, columnvalue. We build a dataframe from each using alternating indices, and bind_rows combines them all by name.
dplyr::bind_rows(sapply(strsplit(gsub(':', '', x), '\n{1,}'), function(x) {
  x <- x[-1]  # drop the leading "Job Information"
  # odd elements are column names, even elements are the values
  setNames(as.data.frame(t(x[c(FALSE, TRUE)])), x[c(TRUE, FALSE)])
}))
# Location Job ID Posted Min Experience
#1 Scarsdale, New York, 10583-3050, United States 53827738 April 22, 2020 3-5 Years
#2 Glenview, Illinois, 60025, United States 53812433 April 21, 2020 <NA>
# Required Travel Salary
#1 0-10% <NA>
#2 <NA> $110,000.00 - $170,000.00 (Yearly Salary)

How to extract a unique string between string patterns in full text in R?

I'm looking to extract names and professions of those who testified in front of Congress from the following text:
text <- c(("FULL COMMITTEE HEARINGS\\", \\" 2017\\",\n\\" April 6, 2017—‘‘The 2017 Tax Filing Season: Internal Revenue\\", \", \"\\"\nService Operations and the Taxpayer Experience.’’ This hearing\\", \\" examined\nissues related to the 2017 tax filing season, including\\", \\" IRS performance,\ncustomer service challenges, and information\\", \\" technology. Testimony was\nheard from the Honorable John\\", \\" Koskinen, Commissioner, Internal Revenue\nService, Washington,\\", \", \"\\" DC.\\", \\" May 25, 2017—‘‘Fiscal Year 2018 Budget\nProposals for the Depart-\\", \\" ment of Treasury and Tax Reform.’’ The hearing\ncovered the\\", \\" President’s 2018 Budget and touched on operations of the De-\n\\", \\" partment of Treasury and Tax Reform. Testimony was heard\\", \\" from the\nHonorable Steven Mnuchin, Secretary of the Treasury,\\", \", \"\\" United States\nDepartment of the Treasury, Washington, DC.\\", \\" July 18, 2017—‘‘Comprehensive\nTax Reform: Prospects and Chal-\\", \\" lenges.’’ The hearing covered issues\nsurrounding potential tax re-\\", \\" form plans including individual, business,\nand international pro-\\", \\" posals. Testimony was heard from the Honorable\nJonathan Talis-\\", \", \"\\" man, former Assistant Secretary for Tax Policy 2000–\n2001,\\", \\" United States Department of the Treasury, Washington, DC; the\\",\n\\" Honorable Pamela F. Olson, former Assistant Secretary for Tax\\", \\" Policy\n2002–2004, United States Department of the Treasury,\\", \\" Washington, DC; the\nHonorable Eric Solomon, former Assistant\\", \", \"\\" Secretary for Tax Policy\n2006–2009, United States Department of\\", \\" the Treasury, Washington, DC; and\nthe Honorable Mark J.\\", \\" Mazur, former Assistant Secretary for Tax Policy\n2012–2017,\\", \\" United States Department of the Treasury, Washington, DC.\\",\n\\" (5)\\", \\"VerDate Sep 11 2014 14:16 Mar 28, 2019 Jkt 000000 PO 00000 Frm 00013\nFmt 6601 Sfmt 6601 R:\\\\DOCS\\\\115ACT.000 TIM\\"\", \")\")"
)
The full text is available here: https://www.congress.gov/116/crpt/srpt19/CRPT-116srpt19.pdf
It seems that the names sit between "Testimony was heard from" and the next ".". So, how can I extract the names between these two patterns? The text is much longer (a 50-page document), but I figured that if I can do it once, I can do it for the rest of the text.
I know I can't rely on plain NLP name extraction alone, because the text also contains names of persons that didn't testify, for example.
NLP is likely unavoidable because of the many abbreviations in the text. Try this workflow:
Tokenize by sentence
Remove sentences without "Testimony"
Extract persons + professions from remaining sentences
There are a couple of packages with sentence tokenizers, but openNLP has generally worked best for me when dealing with abbreviation-laden sentences. The following code should get you close to your goal:
library(tidyverse)
library(pdftools)
library(openNLP)
# Get the data
testimony_url <- "https://www.congress.gov/116/crpt/srpt19/CRPT-116srpt19.pdf"
download.file(testimony_url, "testimony.pdf")
text_raw <- pdf_text("testimony.pdf")
# Clean the character vector and smoosh into one long string.
text_string <- str_squish(text_raw) %>%
  str_replace_all("- ", "") %>%
  paste(collapse = " ") %>%
  NLP::as.String()
# Annotate and extract the sentences.
annotations <- NLP::annotate(text_string, Maxent_Sent_Token_Annotator())
sentences <- text_string[annotations]
# Some sentences starting with "Testimony" list multiple persons. We need to
# split these and clean up a little.
name_title_vec <- str_subset(sentences, "Testimony was") %>%
  str_split(";") %>%
  unlist() %>%
  str_trim() %>%
  str_remove("^(Testimony .*? from|and) ") %>%
  str_subset("^\\(\\d\\)", negate = TRUE)
# Put in data frame and separate name from profession/title.
testimony_tibb <- tibble(name_title_vec) %>%
  separate(name_title_vec, c("name", "title"), sep = ", ", extra = "merge")
You should end up with the below data frame. Some additional cleaning may be necessary:
# A tibble: 95 x 2
name title
<chr> <chr>
1 the Honorable John Koskin… Commissioner, Internal Revenue Service, Washington, DC.
2 the Honorable Steven Mnuc… Secretary of the Treasury, United States Department of the Treasury…
3 the Honorable Jonathan Ta… former Assistant Secretary for Tax Policy 2000–2001, United States …
4 the Honorable Pamela F. O… former Assistant Secretary for Tax Policy 2002–2004, United States …
5 the Honorable Eric Solomon former Assistant Secretary for Tax Policy 2006–2009, United States …
6 the Honorable Mark J. Maz… "former Assistant Secretary for Tax Policy 2012–2017, United States…
7 Mr. Daniel Garcia-Diaz Director, Financial Markets and Community Investment, United States…
8 Mr. Grant S. Whitaker president, National Council of State Housing Agencies, Washington, …
9 the Honorable Katherine M… Ph.D., professor of public policy and planning, and faculty directo…
10 Mr. Kirk McClure Ph.D., professor, Urban Planning Program, School of Public Policy a…
# … with 85 more rows

Getting 'Error in `[.data.frame`(.x, ...) : undefined columns selected' using purrr::map

I've been trying to learn purrr, as I've been working with some deeply nested JSON data, but I keep getting errors that don't seem to be appearing elsewhere online. Below is my JSON data / code:
json <- {"countries":[[{"holdings":[{"quantity":"50,000","cost":"7,597,399","currency":"USD","evaluation":"6,853,500","percentageOfNetAssets":"4.52","title":"Alibaba Group Holding Ltd ADR ","page":"66"},{"quantity":"625,000","cost":"1,842,933","currency":"HKD","evaluation":"3,033,457","percentageOfNetAssets":"2.00","title":"Anhui Conch Cement Co Ltd ","page":"66"},{"quantity":"1,949,000","cost":"1,480,711","currency":"HKD","evaluation":"0","percentageOfNetAssets":"0.00","title":"China Animal Healthcare Ltd ","page":"66"},{"quantity":"2,888,000","cost":"2,992,011","currency":"HKD","evaluation":"2,382,890","percentageOfNetAssets":"1.57","title":"China Construction Bank Corp ","page":"66"},{"quantity":"2,030,000","cost":"2,994,592","currency":"HKD","evaluation":"3,137,298","percentageOfNetAssets":"2.07","title":"CNOOC Ltd ","page":"66"},{"quantity":"400,000","cost":"3,127,007","currency":"HKD","evaluation":"3,548,187","percentageOfNetAssets":"2.34","title":"ENN Energy Holdings Ltd ","page":"66"},{"quantity":"349,297","cost":"3,288,042","currency":"CNH","evaluation":"3,497,876","percentageOfNetAssets":"2.31","title":"Foshan Haitian Flavouring & Food Co Ltd ","page":"66"},{"quantity":"929,012","cost":"3,187,360","currency":"CNH","evaluation":"3,093,845","percentageOfNetAssets":"2.04","title":"Inner Mongolia Yili Industrial Group Co Ltd ","page":"66"},{"quantity":"630,000","cost":"4,720,889","currency":"HKD","evaluation":"5,564,255","percentageOfNetAssets":"3.67","title":"Ping An Insurance Group Co of China Ltd ","page":"66"},{"quantity":"422,000","cost":"3,030,250","currency":"HKD","evaluation":"4,783,603","percentageOfNetAssets":"3.16","title":"Shenzhou International Group Holdings Ltd ","page":"66"},{"quantity":"250,000","cost":"7,161,493","currency":"HKD","evaluation":"10,026,375","percentageOfNetAssets":"6.61","title":"Tencent Holdings Ltd 
","page":"66"},{"quantity":"986,000","cost":"2,263,521","currency":"HKD","evaluation":"2,525,024","percentageOfNetAssets":"1.67","title":"TravelSky Technology Ltd ","page":"66"},{"quantity":"50,600","cost":"5,969,458","currency":"USD","evaluation":"2,956,558","percentageOfNetAssets":"1.95","title":"Weibo Corp ADR ","page":"66"},{"quantity":"100,000","cost":"3,581,365","currency":"USD","evaluation":"3,353,000","percentageOfNetAssets":"2.21","title":"Yum China Holdings Inc ","page":"66"}],"total":{"page":"66","cost":"53,237,031","evaluation":"54,755,868","percentageOfNetAssets":"36.12"},"title":"China ","page":"66"},{"holdings":[{"quantity":"1,130,000","cost":"8,812,716","currency":"HKD","evaluation":"9,381,366","percentageOfNetAssets":"6.19","title":"AIA Group Ltd ","page":"66"},{"quantity":"700,000","cost":"3,099,505","currency":"HKD","evaluation":"2,601,749","percentageOfNetAssets":"1.71","title":"BOC Hong Kong Holdings Ltd ","page":"66"},{"quantity":"12,500,000","cost":"2,685,700","currency":"HKD","evaluation":"2,378,869","percentageOfNetAssets":"1.57","title":"Pacific Basin Shipping Ltd ","page":"66"}],"total":{"page":"66","cost":"14,597,921","evaluation":"14,361,984","percentageOfNetAssets":"9.47"},"title":"Hong Kong ","page":"66"},{"holdings":[{"quantity":"380,000","cost":"2,713,957","currency":"INR","evaluation":"2,731,820","percentageOfNetAssets":"1.80","title":"Future Retail Ltd ","page":"66"},{"quantity":"135,000","cost":"3,260,344","currency":"INR","evaluation":"3,806,163","percentageOfNetAssets":"2.51","title":"Housing Development Finance Corp Ltd ","page":"66"},{"quantity":"192,556","cost":"2,280,854","currency":"INR","evaluation":"4,411,012","percentageOfNetAssets":"2.91","title":"IndusInd Bank Ltd ","page":"66"},{"quantity":"40,000","cost":"2,584,823","currency":"INR","evaluation":"4,277,304","percentageOfNetAssets":"2.82","title":"Maruti Suzuki India Ltd 
","page":"66"},{"quantity":"89,000","cost":"2,460,333","currency":"INR","evaluation":"2,413,256","percentageOfNetAssets":"1.59","title":"Tata Consultancy Services Ltd ","page":"66"},{"quantity":"265,281","cost":"3,375,523","currency":"INR","evaluation":"3,537,586","percentageOfNetAssets":"2.34","title":"Titan Co Ltd ","page":"66"}],"total":{"page":"66","cost":"16,675,834","evaluation":"21,177,141","percentageOfNetAssets":"13.97"},"title":"India ","page":"66"},{"holdings":[{"quantity":"3,280,100","cost":"4,020,564","currency":"IDR","evaluation":"5,930,640","percentageOfNetAssets":"3.91","title":"Bank Central Asia Tbk PT ","page":"66"}],"total":{"page":"66","cost":"4,020,564","evaluation":"5,930,640","percentageOfNetAssets":"3.91"},"title":"Indonesia ","page":"66"},{"holdings":[{"quantity":"1,550,000","cost":"3,239,787","currency":"MYR","evaluation":"3,143,134","percentageOfNetAssets":"2.07","title":"Malaysia Airports Holdings Bhd ","page":"66"}],"total":{"page":"66","cost":"3,239,787","evaluation":"3,143,134","percentageOfNetAssets":"2.07"},"title":"Malaysia ","page":"66"},{"holdings":[{"quantity":"3,687,000","cost":"2,747,584","currency":"PHP","evaluation":"2,846,671","percentageOfNetAssets":"1.88","title":"Ayala Land Inc ","page":"66"}],"total":{"page":"66","cost":"2,747,584","evaluation":"2,846,671","percentageOfNetAssets":"1.88"},"title":"Philippines ","page":"66"},{"holdings":[{"quantity":"234,969","cost":"4,056,529","currency":"SGD","evaluation":"4,083,944","percentageOfNetAssets":"2.69","title":"DBS Group Holdings Ltd ","page":"66"}],"total":{"page":"66","cost":"4,056,529","evaluation":"4,083,944","percentageOfNetAssets":"2.69"},"title":"Singapore ","page":"66"},{"holdings":[{"quantity":"18,000","cost":"6,727,232","currency":"KRW","evaluation":"5,597,777","percentageOfNetAssets":"3.69","title":"LG Chem Ltd ","page":"66"},{"quantity":"3,800","cost":"4,844,479","currency":"KRW","evaluation":"3,749,597","percentageOfNetAssets":"2.48","title":"LG Household & 
Health Care Ltd ","page":"66"},{"quantity":"171,000","cost":"5,875,373","currency":"KRW","evaluation":"5,930,902","percentageOfNetAssets":"3.91","title":"Samsung Electronics Co Ltd ","page":"66"},{"quantity":"40,000","cost":"2,595,521","currency":"KRW","evaluation":"2,168,847","percentageOfNetAssets":"1.43","title":"SK Hynix Inc ","page":"66"}],"total":{"page":"66","cost":"20,042,605","evaluation":"17,447,123","percentageOfNetAssets":"11.51"},"title":"South Korea ","page":"66"},{"holdings":[{"quantity":"244,000","cost":"2,788,471","currency":"TWD","evaluation":"1,786,121","percentageOfNetAssets":"1.18","title":"Catcher Technology Co Ltd ","page":"67"},{"quantity":"23,000","cost":"3,167,793","currency":"TWD","evaluation":"2,405,732","percentageOfNetAssets":"1.59","title":"Largan Precision Co Ltd ","page":"67"},{"quantity":"1,400,000","cost":"8,639,650","currency":"TWD","evaluation":"10,271,009","percentageOfNetAssets":"6.77","title":"Taiwan Semiconductor Manufacturing Co Ltd ","page":"67"}],"total":{"page":"67","cost":"14,595,914","evaluation":"14,462,862","percentageOfNetAssets":"9.54"},"title":"Taiwan ","page":"67"},{"holdings":[{"quantity":"1,739,000","cost":"3,416,652","currency":"THB","evaluation":"3,431,534","percentageOfNetAssets":"2.27","title":"Airports of Thailand PCL ","page":"67"},{"quantity":"1,269,500","cost":"2,379,289","currency":"THB","evaluation":"2,914,470","percentageOfNetAssets":"1.92","title":"Central Pattana PCL ","page":"67"}],"total":{"page":"67","cost":"Total - Transferable securities admitted to an official stock exchange listing 139,009,710","evaluation":"144,555,371","percentageOfNetAssets":"95.35"},"title":"Thailand ","page":"67"}]],"total":[{}],"page":["66"],"title":["Shares "]}
library(jsonlite)
library(purrr)
library(magrittr)
test <- fromJSON(json) %>% purrr::map(extract, c("page", "title"))
Error in `[.data.frame`(.x, ...) : undefined columns selected
Basically, I'm trying to see if is.unsorted(test$countries[[1]]$title) returns true, and if it does, then I need to keep test$page. I thought extracting countries and page would be the best way to do this, but I am super unfamiliar with purrr, and the extract (or [) methods I see online are not working. Does anyone see what I'm doing wrong, or know of an alternative approach?
If you want to return an element of a list if something else is true, why not just a simple ifelse?
data <- fromJSON(json)
ifelse(is.unsorted(data$countries[[1]]$title), data$page, FALSE)
# Look at negation (so if it's not unsorted)
ifelse(!is.unsorted(data$countries[[1]]$title), data$page, FALSE)
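Two hedged footnotes on this answer. First, fromJSON() needs the JSON as a string, so the literal shown in the question has to be wrapped in quotes (json <- '{...}') before it will parse. Second, ifelse() is vectorised and recycles/truncates its yes and no arguments to the length of the condition, so for a single logical test a plain if/else is the safer idiom (page_if_unsorted is a made-up helper name):

```r
# if/else returns data$page intact; ifelse() would silently return only
# its first element if page ever had length > 1.
page_if_unsorted <- function(data) {
  if (is.unsorted(data$countries[[1]]$title)) data$page else NULL
}
```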
