I have created a function that calls an API and parses the id, label, description, and score of each annotation, but I can't seem to get the data frame to display properly.
Here's the code:
get_wikidata_links <- function(input_text, minimum_score) {
#
# Function which takes a character vector of length 1 as input (i.e. all text
# needs to be combined into a single character) as well as a minimum certainty
# score, and returns a tibble with key information and links to Wikidata
#
# Input
# - input_text: Text input (character)
# - minimum_score: Minimum score that every returned entity needs to have
# (numeric)
#
# Output
# - top_wikidata_links: Table with the first four columns being 'id', 'label',
# 'description', 'score' (tibble)
#
base_url <- "https://opentapioca.org/api/annotate"
r <- GET(base_url, query = list(query = input_text))
data = content(r)$annotations
framed = list()
vec = list()
dummy = 0
for (i in 1:length(data)) {
data1 = data[[i]]$tags
for (j in 1:length(data1)) {
data2 = data1[[j]]
if (data2$score>minimum_score) {
vec[1] <- data2$id
vec[2] <- data2$label
vec[3] <- data2$desc
vec[4] <- data2$score
dummy <- dummy + 1
framed[[dummy]] <- vec
}
}
}
data_matrix <- do.call("rbind", framed)
top_wikidata_links <- as.data.frame(data_matrix, stringsAsFactors = FALSE)
colnames(top_wikidata_links) <- c("ID", "Label", "Description", "Score")
return(top_wikidata_links)
}
Now I test this function with a couple phrases:
# Test 1
text_example_1 <- c("Karl Popper worked at the LSE.")
get_wikidata_links(text_example_1, -0.5)
#
# Hint: The output should be a tibble similar to the one outlined below
#
# | id | label | description | score |
# | "Q81244" | "Karl Popper" | "Austrian-British philosopher of science" | 2.4568285 |
# | "Q174570" | "London School of Economics and Political Science" | "university in Westminster, UK" | "1.4685043" |
# | "Q171240" | "London Stock Exchange" | "stock exchange in the City of London" | "-0.4124461" |
# Test 2
text_example_2 <- c("Claude Shannon studied at the University of Michigan and at MIT.")
get_wikidata_links(text_example_2, 0)
For some reason the matrix data_matrix looks fine, but the conversion to a data frame does not display properly.
I guess it's a bit easier to manage with some hoisting and unnesting. Inspired by https://tidyr.tidyverse.org/articles/rectangle.html :
library(httr)
library(tidyr)
library(dplyr)
get_wikidata_links <- function(input_text, minimum_score) {
base_url <- "https://opentapioca.org/api/annotate"
r <- GET(base_url, query = list(query = input_text))
tibble(link = content(r)$annotations) %>%
hoist(link, tags = "tags") %>%
unnest_longer(tags) %>%
hoist(tags, ID = "id", Label = "label", Description = "desc", Score = "score") %>%
select(ID:Score) %>%
filter(Score >= minimum_score)
}
text_example_1 <- c("Karl Popper worked at the LSE.")
get_wikidata_links(text_example_1, -0.5)
#> # A tibble: 3 × 4
#> ID Label Description Score
#> <chr> <chr> <chr> <dbl>
#> 1 Q81244 Karl Popper Austrian-Brit… 2.46
#> 2 Q174570 London School of Economics and Political Science university in… 1.47
#> 3 Q171240 London Stock Exchange stock exchang… -0.412
text_example_2 <- c("Claude Shannon studied at the University of Michigan and at MIT.")
get_wikidata_links(text_example_2, 0)
#> # A tibble: 3 × 4
#> ID Label Description Score
#> <chr> <chr> <chr> <dbl>
#> 1 Q92760 Claude Shannon American mathematician an… 1.96
#> 2 Q230492 University of Michigan public research universit… 1.29
#> 3 Q49108 Massachusetts Institute of Technology research university in Ca… 0.902
Created on 2023-01-19 with reprex v2.0.2
The service itself seems a bit unstable or overloaded.
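As for why the original conversion misbehaves: each row collected in framed is itself a list, so do.call("rbind", framed) builds a matrix whose cells are list elements, and as.data.frame() then produces list columns rather than character and numeric ones. A minimal base R sketch of that effect and one possible way to flatten it, using made-up rows in place of the API response (the ids, labels and scores below are placeholders, not real results):

```r
# Hypothetical rows standing in for the parsed API tags (no network call)
framed <- list(
  list("Q1", "First label", "First description", 2.5),
  list("Q2", "Second label", "Second description", -0.4)
)

data_matrix <- do.call("rbind", framed)
typeof(data_matrix)  # the matrix has mode "list", it is not a character matrix

top_links <- as.data.frame(data_matrix, stringsAsFactors = FALSE)
sapply(top_links, class)  # every column is a list column

# Flatten each list column into an ordinary atomic vector
flat_links <- as.data.frame(lapply(top_links, unlist), stringsAsFactors = FALSE)
colnames(flat_links) <- c("ID", "Label", "Description", "Score")
str(flat_links)
```

This assumes every tag supplies all four fields; a missing field would shorten the unlisted vector and would need handling before this step.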
I am trying to pivot_wider() the column X of a data frame containing various persons' names. Within each group of another variable Y of the df (via group_by()), there are always 2 of these names. I would like R to take the 2 unique X name values within each unique identifier of Y and put them in 2 new columns, ex_X_Name_1 and ex_X_Name_2.
My data frame is looking like this:
df <- data.frame(Student = rep(c(17383, 16487, 17646, 2648, 3785), each = 2),
Referee = c("Paul Severe", "Cathy Nice", "Jean Exigeant", "Hilda Ehrlich", "John Rates",
"Eva Luates", "Fred Notebien", "Aldous Grading", "Hans Streng", "Anna Filaktic"),
Rating = format(round(x = sqrt(sample(15:95, 10, replace = TRUE)), digits = 3), nsmall = 3)
)
df
I would like to make the transformation of the Referee column to 2 new columns Referee_1 and Referee_2 with the 2 unique Referees assigned to each student and end with this result:
# TRUE for rows 1, 3, 5, ... (the first referee listed for each student)
first_ref <- as.logical(seq_len(nrow(df)) %% 2)
df_wanted <- data.frame(
Student = unique(df$Student),
Referee_1 = df$Referee[first_ref],
Rating_Ref_1 = df$Rating[first_ref],
Referee_2 = df$Referee[!first_ref],
Rating_Ref_2 = df$Rating[!first_ref]
)
df_wanted
I guess I could achieve this by subsetting unique rows of student/referee combinations and joining them, but is there a way to handle this in one call to pivot_wider?
You should create a row id per group first:
library(dplyr)
library(tidyr)
df %>%
group_by(Student) %>%
mutate(row_n = row_number()) %>%
ungroup() %>%
pivot_wider(names_from = "row_n", values_from = c("Referee", "Rating"))
# A tibble: 5 × 5
Student Referee_1 Referee_2 Rating_1 Rating_2
<dbl> <chr> <chr> <chr> <chr>
1 17383 Paul Severe Cathy Nice 9.165 7.810
2 16487 Jean Exigeant Hilda Ehrlich 5.196 6.557
3 17646 John Rates Eva Luates 7.211 5.568
4 2648 Fred Notebien Aldous Grading 4.000 8.124
5 3785 Hans Streng Anna Filaktic 7.937 6.325
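The same idea (number the rows within each group, then spread on that number) can also be sketched in base R with ave() and reshape(). This is only an illustration on the first two students from the question; the ratings are fixed values copied in for reproducibility, since the original ones are random.

```r
# Two students from the question, with fixed illustrative ratings
df <- data.frame(
  Student = rep(c(17383, 16487), each = 2),
  Referee = c("Paul Severe", "Cathy Nice", "Jean Exigeant", "Hilda Ehrlich"),
  Rating  = c("9.165", "7.810", "5.196", "6.557"),
  stringsAsFactors = FALSE
)

# Row id per group: 1, 2 within each Student
df$row_n <- ave(seq_len(nrow(df)), df$Student, FUN = seq_along)

# Spread Referee and Rating over the row id, one row per Student
wide <- reshape(df, idvar = "Student", timevar = "row_n",
                direction = "wide", sep = "_")
rownames(wide) <- NULL
wide
```

Without the per-group row id there is nothing to tell the two referees of a student apart, which is why both this sketch and the pivot_wider() call need it.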
Using data.table:
library(data.table)
setDT(df)
merge(df[, .SD[1], Student], df[, .SD[2], Student], by = "Student", suffixes = c("_1", "_2"))
# Student Referee_1 Rating_1 Referee_2 Rating_2
# 1: 2648 Fred Notebien 6.708 Aldous Grading 9.747
# 2: 3785 Hans Streng 6.245 Anna Filaktic 8.775
# 3: 16487 Jean Exigeant 7.681 Hilda Ehrlich 4.359
# 4: 17383 Paul Severe 4.583 Cathy Nice 7.616
# 5: 17646 John Rates 6.708 Eva Luates 8.246
Here is a small sample of a larger character string that I have (no whitespaces). It contains fictional details of individuals.
Each individual is separated by a period (.). There are 10 attributes for each individual.
txt = "EREKSON(Andrew,Hélène),female10/06/2011#Geneva(Switzerland),PPF,2000X007707,dist.093,Dt.043/996.BOUKAR(Mohamed,El-Hadi),male04/12/1956#London(England),PPF,2001X005729,dist.097,Dt.043/997.HARIMA(Olak,N’nassik,Gerad,Elisa,Jeremie),female25/06/2013#Paris(France),PPF,2009X005729,dist.088,Dt.043/998.THOMAS(Hajil,Pau,Joëli),female03/03/1980#Berlin(Germany),VAT,2010X006016,dist.078,Dt.043/999."
I'd like to parse this into a dataframe, with as many observations as there are individuals and 10 columns for each variable.
I've tried using regex and looking at other text extraction solutions on stackoverflow, but haven't been able to reach the output I want.
This is the final dataframe I have in mind, based on the character string input -
result = data.frame(first_names = c('Hélène Andrew','Mohamed El-Hadi','Olak N’nassik Gerad Elisa Jeremie','Joëli Pau Hajil'),
family_name = c('EREKSON','BOUKAR','HARIMA','THOMAS'),
gender = c('female','male','female','female'),
birthday = c('10/06/2011','04/12/1956','25/06/2013','03/03/1980'),
birth_city = c('Geneva','London','Paris','Berlin'),
birth_country = c('Switzerland','England','France','Germany'),
acc_type = c('PPF','PPF','PPF','VAT'),
acc_num = c('2000X007707','2001X005729','2009X005729','2010X006016'),
district = c('dist.093','dist.097','dist.088','dist.078'),
code = c('Dt.043/996','Dt.043/997','Dt.043/998','Dt.043/999'))
Any help would be much appreciated
Here's a tidy solution with tidyr's functions separate_rows and extract:
library(tidyr)
data.frame(txt) %>%
# separate `txt` into rows using the dot `.` *if*
# preceded by `Dt\\.\\d{3}/\\d{3}` as splitting pattern:
separate_rows(txt, sep = "(?<=Dt\\.\\d{3}/\\d{3})\\.(?!$)") %>%
extract(
# select column from which to extract:
txt,
# define column names into which to extract:
into = c("family_name","first_names","gender",
"birthday","birth_city","birth_country",
"acc_type","acc_num","district","code"),
# describe the string exhaustively using capturing groups
# `(...)` to delimit what's to be extracted:
regex = "([A-Z]+)\\(([^)]+)\\),([a-z]+)([\\d/]+)#(\\w+)\\((\\w+)\\),([A-Z]+),(\\w+),dist\\.(\\d+),Dt\\.([\\d/]+)")
# A tibble: 4 × 10
family_name first_names gender birthday birth_city birth_country acc_type acc_num
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 EREKSON Andrew,Peter male 10/06/2011 Geneva Switzerland PPF 2000X007…
2 OBAMA Barack,Hussian male 04/12/1956 London England PPF 2001X005…
3 CLINTON Hillary female 25/06/2013 Paris France PPF 2009X005…
4 GATES Melinda female 03/03/1980 Berlin Germany VAT 2010X006…
# … with 2 more variables: district <chr>, code <chr>
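The capture-group approach can also be sketched in base R with gregexpr(), regexec() and regmatches(). The sample below is a shortened, ASCII-only variant of the question's string (the second record drops one first name), and the pattern is my own assumption about the record layout, not the one used above: it accepts anything up to the closing parenthesis as first names, so hyphens and apostrophes do not break the match.

```r
# Shortened ASCII sample in the same layout as the question's txt
txt <- paste0(
  "BOUKAR(Mohamed,El-Hadi),male04/12/1956#London(England),PPF,2001X005729,dist.097,Dt.043/997.",
  "THOMAS(Hajil,Pau),female03/03/1980#Berlin(Germany),VAT,2010X006016,dist.078,Dt.043/999."
)

# One capture group per attribute; the final \\. is the record separator
pattern <- paste0(
  "([A-Z]+)\\(([^)]+)\\),(female|male)([0-9/]+)#([^(]+)\\(([^)]+)\\),",
  "([A-Z]+),([^,]+),(dist\\.[0-9]+),(Dt\\.[0-9]+/[0-9]+)\\."
)

# Find every record, then pull the capture groups out of each one
records <- regmatches(txt, gregexpr(pattern, txt, perl = TRUE))[[1]]
parts <- regmatches(records, regexec(pattern, records, perl = TRUE))

# Drop the full match (first element) and keep the 10 groups
result <- as.data.frame(do.call(rbind, lapply(parts, function(p) p[-1])),
                        stringsAsFactors = FALSE)
names(result) <- c("family_name", "first_names", "gender", "birthday",
                   "birth_city", "birth_country", "acc_type", "acc_num",
                   "district", "code")
result$first_names <- gsub(",", " ", result$first_names, fixed = TRUE)
result
```

Unlike the extract() call, this keeps the "dist." and "Dt." prefixes inside the captured values, matching the desired result in the question.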
Here is a solution using the tidyverse which pipes together different stringr functions to clean the string, before having readr read it, basically as a CSV:
library(dplyr, warn.conflicts = FALSE) # for pipes
df <-
txt %>%
# Replace "." sep with newline
stringr::str_replace_all(
"\\.[A-Z]",
function(x) stringr::str_replace(x, "\\.", "\n")
) %>%
# Replace all commas in (First[,Middle1,Middle2,...]) with space
stringr::str_replace_all(
# Match anything inside brackets, but as few times as possible, so we don't
# match multiple brackets
"\\(.*?\\)",
# Inside the regex that was matched, replace comma with space
function(x) stringr::str_replace_all(x, ",", " ")
) %>%
# Replace ( with ,
stringr::str_replace_all("\\(", ",") %>%
# Remove )
stringr::str_remove_all("\\)") %>%
# Replace # with ,
stringr::str_replace_all("#", ",") %>%
# Replace the trailing "." with a newline
stringr::str_replace_all("\\.$", "\n") %>%
# Add , after female/male
stringr::str_replace_all("male", "male,") %>%
# Read as comma delimited file (works since string contains \n)
readr::read_delim(
file = .,
delim = ",",
col_names = FALSE,
show_col_types = FALSE
)
# Add names (could also be done directly in read_delim with col_names argument)
names(df) <- c(
"family_name",
"first_names",
"gender",
"birthday",
"birth_city",
"birth_country",
"acc_type",
"acc_num",
"district",
"code"
)
df
#> # A tibble: 4 × 10
#> family_name first_names gender birthday birth_city birth_country acc_type
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 EREKSON Andrew Hélène female 10/06/2… Geneva Switzerland PPF
#> 2 BOUKAR Mohamed El-Hadi male 04/12/1… London England PPF
#> 3 HARIMA Olak N’nassik G… female 25/06/2… Paris France PPF
#> 4 THOMAS Hajil Pau Joëli female 03/03/1… Berlin Germany VAT
#> # … with 3 more variables: acc_num <chr>, district <chr>, code <chr>
Created on 2022-03-20 by the reprex package (v2.0.1)
Note that more efficient regexes probably exist, but I believe this approach is simpler and easier to change later.
I have a couple of PDFs and I wish to extract the shareholders table. How can I specify that only the table appearing after the string 'TWENTY LARGEST SHAREHOLDERS' is extracted?
I tried, but I was not quite sure about the function part.
library("pdftools")
library("tidyverse")
url <- c("https://www.computershare.com/News/Annual%20Report%202019.pdf?2")
raw_text <- map(url, pdf_text)
clean_table <- function(table){
table <- str_split(table, "\n", simplify = TRUE)
table_start <- stringr::str_which(table, "TWENTY LARGEST SHAREHOLDERS")
table <- table[1, (table_start +1 ):(table_end - 1)]
table <- str_replace_all(table, "\\s{2,}", "|")
text_con <- textConnection(table)
data_table <- read.csv(text_con, sep = "|")
colnames(data_table) <- c("Name", "Number of Shares", "Percentage")
}
shares <- map_df(raw_text, clean_table)
Try this. Besides some minor issues, the main change is that I first get the page which contains the desired table. BTW: You have to search for "Twenty Largest Shareholders" and not "TWENTY LARGEST SHAREHOLDERS".
library(pdftools)
library(tidyverse)
# download pdf
url <- c("https://www.computershare.com/News/Annual%20Report%202019.pdf?2")
raw_text <- map(url, pdf_text)
clean_table1 <- function(raw) {
# Split the single pages
raw <- map(raw, ~ str_split(.x, "\\n") %>% unlist())
# Concatenate the split pages
raw <- reduce(raw, c)
table_start <- stringr::str_which(tolower(raw), "twenty largest shareholders")
table_end <- stringr::str_which(tolower(raw), "total")
table_end <- table_end[min(which(table_end > table_start))]
table <- raw[(table_start + 3 ):(table_end - 1)]
table <- str_replace_all(table, "\\s{2,}", "|")
text_con <- textConnection(table)
data_table <- read.csv(text_con, sep = "|")
colnames(data_table) <- c("Name", "Number of Shares", "Percentage")
data_table
}
shares <- map_df(raw_text, clean_table1)
head(shares)
#> Name Number of Shares
#> 1 J P Morgan Nominees Australia Pty Limited 109,500,852
#> 2 Citicorp Nominees Pty Limited 57,714,777
#> 3 Mr Chris Morris 32,231,000
#> 4 National Nominees Limited 19,355,892
#> 5 Welas Pty Ltd 18,950,000
#> 6 BNP Paribas Nominees Pty Ltd <Agency Lending DRP A/C> 11,520,882
#> Percentage
#> 1 20.17
#> 2 10.63
#> 3 5.94
#> 4 3.56
#> 5 3.49
#> 6 2.12
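The slicing logic (find the heading, find the next "Total" line after it, keep the lines in between, split on runs of whitespace) can be tried without downloading the PDF. The sketch below runs it on a few made-up lines; the company names and figures are invented, and the offset of 2 after the heading fits this mock layout only (the real report needed 3).

```r
# Made-up page text; names and numbers are placeholders
page <- c(
  "Some unrelated heading",
  "Twenty Largest Shareholders",
  "Name    Number of Shares    Percentage",
  "Alpha Pty Ltd    1,000    50.00",
  "Beta Nominees Limited    500    25.00",
  "Total    1,500    75.00",
  "Footer text"
)

lower <- tolower(page)
table_start <- grep("twenty largest shareholders", lower, fixed = TRUE)[1]

# First "total" line that comes after the heading
total_rows <- grep("total", lower, fixed = TRUE)
table_end <- total_rows[total_rows > table_start][1]

# Skip the heading and the column header, stop before the total line
body <- page[(table_start + 2):(table_end - 1)]

# Columns are separated by two or more whitespace characters
fields <- strsplit(trimws(body), "[[:space:]]{2,}")
shares <- as.data.frame(do.call(rbind, fields), stringsAsFactors = FALSE)
names(shares) <- c("Name", "Number of Shares", "Percentage")

# Convert the numeric columns, dropping thousands separators
shares[["Number of Shares"]] <- as.numeric(gsub(",", "", shares[["Number of Shares"]], fixed = TRUE))
shares$Percentage <- as.numeric(shares$Percentage)
shares
```

Matching on lower-cased text keeps the search independent of how the heading is capitalised in a given report.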
I have data in which one column is genre (chr), with values like "Drama | Musical | Crime". I need to split these values and create a new row for every entry. This value contains 3 genres, for example, so I need three rows, each keeping all the other columns of the data frame.
imdbId <- "tt0118578"
title <-"Albela"
releaseYear<- 2010
releaseDate <- "2-12-2010"
genre <- "Adventure | Drama | Musical"
writers <- "Ashutosh Gowariker (story) | Ashutosh Gowariker (screenplay) |
Kumar Dave (screenplay) | Sanjay Dayma (screenplay) | K.P. Saxena
(dialogue)"
actors <-"Aamir Khan | Gracy Singh | Rachel Shelley | Paul Blackthorne"
directors<-"Ashutosh Gowariker"
sequel <-"No"
hitFlop <-2
df <- data.frame(imdbId, title, releaseYear, releaseDate, genre,
writers, actors, directors, sequel, hitFlop
, stringsAsFactors=FALSE)
This is the structure of the data frame. I need to split the genre values so that each film gets one row per genre.
Something like this could work:
data:
multiChar<-
"tt0169102
Lagaan: Once Upon a Time in India
2001
08 May 2002
Adventure | Drama | Musical
Ashutosh Gowariker (story) | Ashutosh Gowariker (screenplay) | Kumar Dave (screenplay) | Sanjay Dayma (screenplay) | K.P. Saxena (dialogue)
Aamir Khan | Gracy Singh | Rachel Shelley | Paul Blackthorne
Ashutosh Gowariker
0
6"
code:
library(magrittr)
patterni <- "(?i)(?<=\\n).*(adventure|drama|musical)(\\s+?(\\|)?\\s+?).*(?=\\n)"
getGenres <- stringr::str_extract(multiChar, patterni) %>%
stringr::str_split("\\|", simplify = TRUE) %>% c %>% trimws
result <- purrr::map(getGenres, ~sub(patterni, ., multiChar, perl = TRUE))
result:
lapply(result,cat)
please note:
You probably have to come up with a more precise pattern patterni.
This here takes the 5th ROW (genre Row). If your genre is always in the 5th row that's your pattern.
patterni <- "^(.*?\\n){4}.*(?=\\n)"
getGenres <- stringr::str_extract(multiChar, patterni) %>% sub(".*\\n", "", .) %>%
stringr::str_split("\\|", simplify = TRUE) %>% c %>% trimws
Answering a question is easy ... if the question is well framed. No code is supplied, so let's assume a dataframe:
title <- "Lagaan: Once Upon a Time in India"
year <- 2001
genre <- "Adventure | Drama | Musical"
df <- data.frame(title, year, genre, stringsAsFactors=FALSE)
Add or duplicate as many rows as are required. Then replace the values in the genre column as appropriate.
For a single vector of genre names:
genres <- strsplit(df$genre, " \\| ")[[1]]
For a list of vectors of genre names:
genres <- strsplit(df$genre, " \\| ")
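Putting the two steps together (duplicate the rows, then replace the genre values), a minimal base R sketch could look like the following. The second film and its genre are invented so that the example has rows with different numbers of genres.

```r
df <- data.frame(
  title = c("Lagaan: Once Upon a Time in India", "Some Other Film"),  # second title is hypothetical
  year  = c(2001, 2010),
  genre = c("Adventure | Drama | Musical", "Romance"),
  stringsAsFactors = FALSE
)

# One character vector of genres per film
genres <- strsplit(df$genre, " | ", fixed = TRUE)

# Repeat each row once per genre, then overwrite genre with the single values
long <- df[rep(seq_len(nrow(df)), lengths(genres)), ]
long$genre <- unlist(genres)
rownames(long) <- NULL
long
```

With fixed = TRUE the separator is matched literally, so the pipe does not need escaping; this does assume the separator is always exactly " | ".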
I created a function that uses stringr to split a column, given the pattern and a name prefix for the generated columns.
library(stringr)
library(tibble)
split_into_multiple <- function(column, pattern = ", ", into_prefix){
cols <- str_split_fixed(column, pattern, n = Inf)
# Replace the ""s produced by padding the matrix to the right with NAs, which are more useful
cols[which(cols == "")] <- NA
# Name the columns 'into_prefix_1', 'into_prefix_2', ..., 'into_prefix_m',
# where m is the number of columns of 'cols'
m <- dim(cols)[2]
colnames(cols) <- paste(into_prefix, 1:m, sep = "_")
as_tibble(cols)
}
We can then use split_into_multiple in a dplyr pipe as follows:
after <- BollywoodMovieDetail %>%
bind_cols(split_into_multiple(.$genre,"\\|", "genre")) %>%
# selecting those that start with 'genre_' will remove the original 'genre' column
select(imdbId, starts_with("genre_"))
> after
# A tibble: 1,284 x 4
imdbId genre_1 genre_2 genre_3
<chr> <chr> <chr> <chr>
1 tt0118578 Romance NA NA
2 tt0169102 "Adventure " " Drama " " Musical"
3 tt0187279 "Action " " Comedy" NA
4 tt0222024 "Drama " " Romance" NA
# ... with 1,274 more rows
And then we can use gather to tidy up...
> after %>%
+ gather(key, val, -imdbId, na.rm = T)
# A tibble: 2,826 x 3
imdbId key val
* <chr> <chr> <chr>
1 tt0118578 genre_1 Romance
2 tt0169102 genre_1 "Adventure "
3 tt0187279 genre_1 "Action "
4 tt0222024 genre_1 "Drama "
5 tt0227194 genre_1 "Action "
# ... with 2,816 more rows
Trying to make do.call() work in the context of tidy evaluation:
library(rlang)
library(dplyr)
data <- tibble(item_name = c("apple", "bmw", "bmw"))
mutate(data, category = case_when(item_name == "apple" ~ "fruit",
item_name == "bmw" ~ "car"))
# # A tibble: 3 x 2
# item_name category
# <chr> <chr>
# 1 apple fruit
# 2 bmw car
# 3 bmw car
What differs between:
category_fn <- function(df, ...){
# browser()
cat1 <- quos(...)
mutate(df, category = case_when(!!! cat1))
}
category_fn(df = data, item_name == "apple" ~ "fruit",
item_name == "bmw" ~ "car")
# # A tibble: 3 x 2
# item_name category
# <chr> <chr>
# 1 apple fruit
# 2 bmw car
# 3 bmw car
and:
cat <- list(item_name == "apple" ~ "fruit", item_name == "bmw" ~ "car")
do.call(category_fn, c(list(df = data), cat), quote = FALSE)
# Or:
do.call(category_fn, c(list(df = data), cat), quote = TRUE)
# Or:
rlang::invoke(category_fn, c(list(df = data), cat))
which all give the same error:
# Error in mutate_impl(.data, dots) :
# Evaluation error: object 'item_name' not found.
I stepped into the function with browser(), examined the arguments, ran expr(mutate(df, category = case_when(!!! cat1))) there (as suggested as a universal debugging strategy in http://rpubs.com/lionel-/programming-draft), with the same output in both cases: mutate(df, category = case_when(~(item_name == "apple" ~ "fruit"), ~(item_name == "bmw" ~ "car"))).
I've also tried to tweak the envir or .env arguments to no avail.
My understanding is that it has likely something to do with different quosure environments, but environment(cat1[[1]]) is also identical (<environment: R_GlobalEnv>).
Note:
This is somewhat of a follow-up to "Tidy evaluation programming with dplyr::case_when", which I was trying to answer.
> sessioninfo::session_info()
─ Session info ────────────────────────────────────────────────────────
setting value
version R version 3.4.3 (2017-11-30)
os Linux Mint 18
system x86_64, linux-gnu
[...]
─ Packages ────────────────────────────────────────────────────────────
package * version date source
[...]
dplyr * 0.7.4 2017-09-28 CRAN (R 3.4.3)
[...]
rlang * 0.1.6 2017-12-21 CRAN (R 3.4.3)
[...]
We could create 'cat' as a list of quosures with quos() and then splice it into the call with !!!:
cat <- quos(item_name == "apple" ~ "fruit", item_name == "bmw" ~ "car")
category_fn(data, !!!(cat))
# A tibble: 3 x 2
# item_name category
# <chr> <chr>
#1 apple fruit
#2 bmw car
#3 bmw car
I think it's a similar issue to the other post; quoting the list itself is not the same as quoting the elements of the list individually.
I have modified the cat definition to quote the elements individually, and changed the function slightly to drop the quos() call and name the argument explicitly. In the do.call statements, cat is passed as a single element of the second argument, the list of arguments supplied to the function.
With these modifications the two do.call statements and the invoke then return the same result as the direct execution in your post:
data <- tibble(item_name = c("apple", "bmw", "bmw"))
cat <- list(quo(item_name == "apple" ~ "fruit"),
quo(item_name == "bmw" ~ "car"))
category_fn <- function(df, category){
mutate(df, category = case_when(!!! category))
}
> do.call(category_fn, list(data, cat), quote = FALSE)
# A tibble: 3 x 2
item_name category
<chr> <chr>
1 apple fruit
2 bmw car
3 bmw car
> # Or:
> do.call(category_fn, list(data, cat), quote = TRUE)
# A tibble: 3 x 2
item_name category
<chr> <chr>
1 apple fruit
2 bmw car
3 bmw car
> # Or:
> rlang::invoke(category_fn, list(df = data, cat))
# A tibble: 3 x 2
item_name category
<chr> <chr>
1 apple fruit
2 bmw car
3 bmw car
The value of the quote argument makes no difference in the two do.call examples.
I find quosures conceptually difficult, and the current "Programming with dplyr" vignette on CRAN does not make them a great deal easier.
The answer in part (1a) of my response to Tidy evaluation programming with dplyr::case_when works here too.
If cat, data and category_fn are as in the present question then this works. The first line transforms cat to cat_ which is of a form that will work here.
cat_ <- lapply(cat, function(x) do.call("substitute", list(x)))
do.call("category_fn", c(list(df = data), cat_))
giving:
# A tibble: 3 x 2
item_name category
<chr> <chr>
1 apple fruit
2 bmw car
3 bmw car
Regarding the question at the end, which seems to ask for alternatives to quosures: my answer to the original problem, linked above, includes solutions using the wrapr package and base R. The seplyr package, by the author of wrapr, may also be an alternative.