how to feed a tibble to spacyr? - r

Consider this simple example
bogustib <- tibble(doc_id = c(1,2,3),
text = c('bug', 'one love', '838383838'))
# A tibble: 3 x 2
  doc_id text
   <dbl> <chr>
1      1 bug
2      2 one love
3      3 838383838
This tibble is called bogustib because I know spacyr will fail on row 3.
> spacy_parse('838383838', lemma = FALSE, entity = TRUE, nounphrase = TRUE)
Error in `$<-.data.frame`(`*tmp*`, "doc_id", value = "text1") :
replacement has 1 row, data has 0
So, naturally, feeding the tibble to spacyr fails as well:
spacy_parse(bogustib, lemma = FALSE, entity = TRUE, nounphrase = TRUE)
Error in `$<-.data.frame`(`*tmp*`, "doc_id", value = "3") :
replacement has 1 row, data has 0
My question is: I think I can avoid this issue by calling spacy_parse row by row.
However, this looks inefficient and I would like to use the multithread argument of spacyr to speed up the computation over my large tibble.
Is there any solution here?
Thanks!

Actually, it does not happen in my environment. In my environment, the output is like:
library(tidyverse)
library(spacyr)
bogustib <- tibble(doc_id = c(1,2,3),
text = c('bug', 'one love', '838383838'))
spacy_parse(bogustib)
spacy_parse('838383838', lemma = FALSE, entity = TRUE, nounphrase = TRUE)
## No noun phrase found in documents.
## doc_id sentence_id token_id token pos entity
## 1 text1 1 1 838383838 NUM CARDINAL_B
To get this result, I used the latest master on GitHub. However, I was able to reproduce your error when I ran the CRAN version of spacyr. I'm sure that I fixed the bug a while ago, but the fix does not seem to be reflected in the CRAN version yet. We will try to update CRAN as soon as possible.
In the meantime, you can:
devtools::install_github('quanteda/spacyr')
Or zip download the repo and run:
devtools::install('******')
Here ****** is the path to the unzipped repository.
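Until the fixed version is installed, one way to find out which documents trigger the error is to parse each text separately inside tryCatch(). The sketch below is only an illustration: find_failing_docs() and fake_parse() are hypothetical names, and fake_parse() is a stand-in that mimics the failure so the example runs without spaCy. In real use you would pass a wrapper around spacy_parse() as parse_fn.

```r
# Hypothetical helper: returns the doc_ids whose text makes parse_fn() error.
find_failing_docs <- function(df, parse_fn) {
  failed <- vapply(
    df$text,
    function(txt) {
      tryCatch({
        parse_fn(txt)
        FALSE                      # parsed without error
      }, error = function(e) TRUE) # parsing raised an error
    },
    logical(1),
    USE.NAMES = FALSE
  )
  df$doc_id[failed]
}

# Stand-in parser (assumption): errors on digit-only text, like the CRAN bug
fake_parse <- function(txt) {
  if (grepl("^[0-9]+$", txt)) stop("replacement has 1 row, data has 0")
  data.frame(token = strsplit(txt, " ")[[1]])
}

docs <- data.frame(
  doc_id = c(1, 2, 3),
  text = c("bug", "one love", "838383838"),
  stringsAsFactors = FALSE
)
failing <- find_failing_docs(docs, fake_parse)
print(failing)
```

This is slow on a large tibble, so it is meant for diagnosing bad rows rather than as a replacement for parsing everything in one call.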

Related

r - check if columns renamed when unnesting

Is there a way to know whether a table's columns were renamed in the process of unnesting? I would like to intercept any messages that start with New names: so that I can give more context about solutions.
# min reprex
library(tidyverse)
f <- function() {
tibble(
x = 1:2,
y = 2:1,
z = tibble(x = 1)
) |>
unnest_wider(z, names_repair = "unique")
}
f()
New names:
• `x` -> `x...1`
• `x` -> `x...3`
x...1 y x...3
----- ----- -----
1 2 1
2 1 1
More context:
The message stems from
vctrs::vec_as_names(c("x", "x"), repair = "universal")
I see information about withCallingHandlers() but not sure if that is the right route. I thought there was a way for errors/messages to have classes that you can intercept but I can't remember what I read.
Something in testthat::expect_message() may help. I thought there would be a has_message() function out there.
There is a lot of tidy evaluation and comparing names before and after might be tricky. I could look for the names with the regex "\\.+\\d+$" but not sure that is robust enough since data could have fields with that syntax already.
Thank you!
Taking inspiration from hadley's answer on the question that @ritchie-sacramento linked, you should check out the evaluate package.
> eval_res <- evaluate::evaluate("f()")
> eval_res[[2]]$message
[1] "New names:\n* x -> x...1\n* x -> x...3\n"
This will require more testing to see what happens to the data structure when there are errors, warnings, or even multiple messages. But this seems like the right track.
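The withCallingHandlers() route mentioned in the question can also work without extra packages. The following is a minimal sketch in base R: capture_messages() and noisy() are hypothetical names, and noisy() only imitates the repair message so the example is self-contained. It assumes the name-repair notice is signalled as an ordinary message condition.

```r
# Hypothetical helper: evaluate expr, collect any messages, and muffle them.
capture_messages <- function(expr) {
  msgs <- character()
  value <- withCallingHandlers(
    expr,
    message = function(m) {
      msgs <<- c(msgs, conditionMessage(m))
      invokeRestart("muffleMessage")  # stop the message from printing
    }
  )
  list(value = value, messages = msgs)
}

# Stand-in for f(): emits a message shaped like the name-repair notice
noisy <- function() {
  message("New names:\n* x -> x...1\n* x -> x...3")
  42
}

res <- capture_messages(noisy())
renamed <- any(grepl("^New names:", res$messages))
print(renamed)
```

Because the result and the messages come back together, the caller can decide whether to re-emit the original message with added context.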

Issue reading data with ipumsr using PUMAs

I'm trying to read some data from IPUMS USA. It has worked before, but I'm suddenly getting the error "Error in `levels<-`(`*tmp*`, value = as.character(levels)) : factor level [2] is duplicated". Earlier, on a different computer, just trying to display the PUMA data also gave me "Error: 'labels' must be unique". The code I was using is below. I've been using this data with PUMA and this hasn't happened before. Can anyone tell me what this means or what changed?
ddi <- read_ipums_ddi("usa_00021.xml")
data <- read_ipums_micro(ddi)
data[13] #13 is the IND column and this produces the error
data$IND #this does not produce an error
This produces the "Error in `levels<-`(`*tmp*`, value = as.character(levels)) : factor level [2] is duplicated" error on my current computer.
ddi <- read_ipums_ddi("usa_00021.xml")
data <- read_ipums_micro(ddi)
data[8] #this is the PUMA column
This produces the "Error: 'labels' must be unique" error on the other computer. That computer has the same issue listed above, but also gives me this one. It is also the computer I had been using with no previous issue.
(Sorry if anything is formatted wrong, this is my first question.)
This is related to an error in the print formatting introduced by recent versions of ipumsr and haven.
It has been fixed as a pull request into haven, so if you're able to install C++ packages from github, you can run the following command:
# install.packages("devtools")
devtools::install_github("tidyverse/haven", pull = 425)
If that's not an option, you can disable the printing behavior by doing the following:
options(haven.show_pillar_labels = FALSE)
options(ipumsr.show_pillar_labels = FALSE)
Edit:
Just to confirm - this is how the options work on my computer - I'm curious why this wouldn't work on yours. If you have time, can you see if this code works for you?
library(ipumsr)
x <- tibble::tibble(x = haven::labelled(c(1, 2, 3), c(x = 1, x = 2)))
x
#> Error in `levels<-`(`*tmp*`, value = as.character(levels)): factor level [2] is duplicated
options(haven.show_pillar_labels = FALSE)
options(ipumsr.show_pillar_labels = FALSE)
x
#> # A tibble: 3 x 1
#> x
#> <dbl+lbl>
#> 1 1
#> 2 2
#> 3 3
Created on 2019-04-10 by the reprex package (v0.2.1)
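The reprex above triggers the error with a label set in which the name x is used for two values. A small base R sketch for spotting such repeated label names follows. duplicated_labels() is a hypothetical helper, and the named vector stands in for the "labels" attribute that haven stores on a labelled column.

```r
# Hypothetical helper: label names that occur more than once
duplicated_labels <- function(labels) {
  nms <- names(labels)
  unique(nms[duplicated(nms)])
}

# Stand-in for attr(column, "labels") on a labelled column
labs <- c(x = 1, x = 2, y = 3)
dups <- duplicated_labels(labs)
print(dups)
```

On real data, something like duplicated_labels(attr(data$IND, "labels")) would be the analogous call, assuming the column is a haven labelled vector.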

Why is R not angry with my tibble: the tale of the dangling comma that could

R wants things to be just so. Commands must be exactly correct, and quite rightly so.
Thus, dangling commas are bad.
For example, on a vector:
> c(1,)
Error in c(1, ) : argument 2 is empty
Or a data frame:
> data.frame(a = 1,)
Error in data.frame(a = 1, ) : argument is missing, with no default.
But not on a tibble for some reason:
> tibble(a = 1,)
# A tibble: 1 x 1
a
<dbl>
1 1
Why is it so? What's gone ... right?
I believe that the code works because the arguments to tibble() are name-value pairs which are processed using rlang::quos().
quos() has an argument .ignore_empty = c("trailing", "none", "all").
So the default for .ignore_empty is "trailing" - i.e. the last argument to tibble is ignored if empty. If you change this, you'll see an error:
tibble(a = 1, .ignore_empty = "none",)
Error in eval_tidy(xs[[i]], unique_output) : object '' not found
See ?tibble and ?quos for the details.
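The effect of .ignore_empty can be seen directly on quos(), without going through tibble(). This is a small sketch that assumes the rlang package is installed. The variable names are only illustrative.

```r
library(rlang)

# Default (.ignore_empty = "trailing"): the dangling comma is dropped
trailing <- quos(a = 1, )
length(trailing)

# With "none", the empty trailing argument is kept as its own entry
kept <- quos(a = 1, , .ignore_empty = "none")
length(kept)
```

Comparing the two lengths shows that the empty argument is silently discarded in the default case, which is why tibble(a = 1,) succeeds.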

Selecting features from a feature set using mRMRe package

I am a new user of R and trying to use mRMRe R package (mRMR is one of the good and well known feature selection approaches) to obtain feature subset from a feature set. Please excuse if my question is simple as I really want to know how I can fix an error. Below is the detail.
Suppose, I have a csv file (gene.csv) having feature set of 6 attributes ([G1.1.1.1], [G1.1.1.2], [G1.1.1.3], [G1.1.1.4], [G1.1.1.5], [G1.1.1.6]) and a target class variable [Output] ('1' indicates positive class and '-1' stands for negative class). Here's a sample gene.csv file:
[G1.1.1.1] [G1.1.1.2] [G1.1.1.3] [G1.1.1.4] [G1.1.1.5] [G1.1.1.6] [Output]
11.688312 0.974026 4.87013 7.142857 3.571429 10.064935 -1
12.538226 1.223242 3.669725 6.116208 3.363914 9.174312 1
10.791367 0.719424 6.115108 6.47482 3.597122 10.791367 -1
13.533835 0.37594 6.766917 7.142857 2.631579 10.902256 1
9.737828 2.247191 5.992509 5.992509 2.996255 8.614232 -1
11.864407 0.564972 7.344633 4.519774 3.389831 7.909605 -1
11.931818 0 7.386364 5.113636 3.409091 6.818182 1
16.666667 0.333333 7.333333 4.333333 2 8.333333 -1
I am trying to get best feature subset of 2 attributes (out of above 6 attributes) and wrote following R code.
library(mRMRe)
file_n<-paste0("E:\\gene", ".csv")
df <- read.csv(file_n, header = TRUE)
f_data <- mRMR.data(data = data.frame(df))
featureData(f_data)
mRMR.ensemble(data = f_data, target_indices = 7,
feature_count = 2, solution_count = 1)
When I run this code, I am getting following error for the statement f_data <- mRMR.data(data = data.frame(df)):
Error in .local(.Object, ...) :
data columns must be either of numeric, ordered factor or Surv type
However, the data in each column of the csv file are real numbers. So, how can I change the R code to fix this problem? Also, I am not sure what the value of target_indices should be in the statement mRMR.ensemble(data = f_data, target_indices = 7, feature_count = 2, solution_count = 1), as my target class variable is named "[Output]" in the gene.csv file.
I would much appreciate it if anyone could help me obtain the best feature subset from the gene.csv file using the mRMRe R package.
I solved the problem by modifying my code as follows.
library(mRMRe)
file_n<-paste0("E:\\gene", ".csv")
df <- read.csv(file_n, header = TRUE)
df[[7]] <- as.numeric(df[[7]])
f_data <- mRMR.data(data = data.frame(df))
results <- mRMR.classic("mRMRe.Filter", data = f_data, target_indices = 7,
feature_count = 2)
solutions(results)
It worked fine. The output of the code gives the indices of the selected 2 features.
I think it has to do with your Output column which is probably of class integer. You can check that using class(df[[7]]).
To convert it to numeric, as required by the error message, just type:
df[[7]] <- as.numeric(df[[7]])
That worked for me.
As for the other question, after reading the documentation, setting target_indices = 7 seems the right choice.
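If more than one column is read in as integer, the same conversion can be applied to all of them at once. The sketch below uses a small made-up data frame in place of gene.csv (the column names G1 and Output are assumptions for illustration) and only base R.

```r
# Made-up stand-in for the data read from gene.csv
df <- data.frame(
  G1 = c(11.688312, 12.538226, 10.791367),
  Output = c(-1L, 1L, -1L)  # read.csv() would typically give integer here
)

# Inspect the class of every column
print(sapply(df, class))

# Convert every integer column to numeric (double)
int_cols <- vapply(df, is.integer, logical(1))
df[int_cols] <- lapply(df[int_cols], as.numeric)

print(sapply(df, class))
```

After this step every column is of type double, which is what the error message from mRMR.data() asks for.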

Missing `parse` information inside vignette build

Goal
The goal is to create a package that parses R scripts and lists functions (from the package - like mvbutils- but also imports).
Function
The main function relies on parsing R script with
d<-getParseData(x = parse(text = deparse(x)))
Reproducible code
For example in an interactive R session the output of
x<-test<-function(x){x+1}
d<-getParseData(x = parse(text = deparse(x)))
Has for first few lines:
   line1 col1 line2 col2 id parent          token terminal     text
23     1    1     4    1 23      0           expr    FALSE
1      1    1     1    8  1     23       FUNCTION     TRUE function
2      1   10     1   10  2     23            '('     TRUE        (
3      1   11     1   11  3     23 SYMBOL_FORMALS     TRUE        x
4      1   12     1   12  4     23            ')'     TRUE        )
Error
When building a vignette with knitr that contains this chunk, either with Knit HTML from RStudio or with devtools::build_vignettes(), the output of the previous chunk of code is NULL. On the other hand, calling knitr::knit() inside an interactive R session gives the correct output.
Questions:
Is there a reason for the parser to behave differently inside the knit function/environment, and is there a way to bypass this?
Update
Changing code to:
x<-test<-function(x){x+1}
d<-getParseData(x = parse(text = deparse(x),keep.source = TRUE))
Fixes the issue, but this does not answer the question of why the same function behaves differently.
From the help page ?options:
keep.source:
When TRUE, the source code for functions (newly defined or loaded) is stored internally allowing comments to be kept in the right places. Retrieve the source by printing or using deparse(fn, control = "useSource").
The default is interactive(), i.e., TRUE for interactive use.
When building the vignette, you are running a non-interactive R session, so the source code is discarded in parse().
parse(file = "", n = NULL, text = NULL, prompt = "?",
keep.source = getOption("keep.source"), srcfile,
encoding = "unknown")
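The difference can be reproduced in any session by setting keep.source explicitly instead of relying on the option. A minimal sketch in base R follows. The variable names are only illustrative.

```r
src <- "function(x) { x + 1 }"

# Source references discarded: there is no parse data to return
without_src <- utils::getParseData(parse(text = src, keep.source = FALSE))
print(is.null(without_src))

# Source references kept: parse data is available as a data frame
with_src <- utils::getParseData(parse(text = src, keep.source = TRUE))
print(head(with_src[, c("token", "text")]))
```

Passing keep.source = TRUE in the call, as in the update to the question, makes the result independent of whether the session is interactive.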
