Issue reading data with ipumsr using PUMAs

I'm trying to read some data from IPUMS USA, and it has worked before, but I'm suddenly getting the error "Error in `levels<-`(`*tmp*`, value = as.character(levels)) : factor level [2] is duplicated". Earlier, when just trying to display the PUMA data on a different computer, I also got "Error: 'labels' must be unique". I'll put the code I was using below. I've been using this data with PUMA before and this hasn't happened. Can anyone tell me what this means or what changed?
ddi <- read_ipums_ddi("usa_00021.xml")
data <- read_ipums_micro(ddi)
data[13] #13 is the IND column and this produces the error
data$IND #this does not produce an error
This gets the "Error in `levels<-`(`*tmp*`, value = as.character(levels)) : factor level [2] is duplicated" error on my current computer.
ddi <- read_ipums_ddi("usa_00021.xml")
data <- read_ipums_micro(ddi)
data[8] #this is the PUMA column
This gets the "Error: 'labels' must be unique" error on the other computer. That computer has the same issue listed above but also gives me this one, and it is the computer I had been using with no previous problems.
(Sorry if anything is formatted wrong--first question.)

This is related to an error in the print formatting introduced by recent versions of ipumsr and haven.
It has been fixed by a pull request to haven, so if you're able to install packages from GitHub that need compilation, you can run the following:
# install.packages("devtools")
devtools::install_github("tidyverse/haven", pull = 425)
If that's not an option, you can disable the printing behavior by doing the following:
options(haven.show_pillar_labels = FALSE)
options(ipumsr.show_pillar_labels = FALSE)
Edit:
Just to confirm - this is how the options work on my computer - I'm curious why this wouldn't work on yours. If you have time, can you see if this code works for you?
library(ipumsr)
x <- tibble::tibble(x = haven::labelled(c(1, 2, 3), c(x = 1, x = 2)))
x
#> Error in `levels<-`(`*tmp*`, value = as.character(levels)): factor level [2] is duplicated
options(haven.show_pillar_labels = FALSE)
options(ipumsr.show_pillar_labels = FALSE)
x
#> # A tibble: 3 x 1
#>           x
#>   <dbl+lbl>
#> 1         1
#> 2         2
#> 3         3
Created on 2019-04-10 by the reprex package (v0.2.1)
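Another stopgap, if you only need the underlying numeric codes rather than the value labels, is to strip the labels before printing. This is a minimal sketch of my own (not part of the answer above), using haven::zap_labels():
library(ipumsr)
# Assumes the objects from the question; zap_labels() removes value labels
# from every labelled column, so printing no longer tries to build a factor
# with duplicated levels (the labels are lost, the codes are untouched).
ddi  <- read_ipums_ddi("usa_00021.xml")
data <- read_ipums_micro(ddi)
data_codes <- haven::zap_labels(data)
data_codes[8]   # PUMA column now prints without the error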

Related

Incorrect Dimensions error with the function MRM, in the package ecodist

When using the MRM function in the ecodist package, I get the following error:
Error in xj[i, , drop = FALSE] : incorrect number of dimensions
I get this error no matter what I do; I even get it with the example code from the documentation:
data(graze)
# Abundance of this grass is related to forest cover but not location
MRM(dist(LOAR10) ~ dist(sitelocation) + dist(forestpct), data=graze, nperm=10)
I don't know what's going on. I have tried other computers and get the same error, so it's not even confined to my machine (Windows 10, fully updated).
Best,
Joe
Thanks to Torsten Biemann for pointing me at this. I don't check Stack Overflow regularly, but you are always welcome to email me at the ecodist maintainer address or open an issue at https://github.com/phiala/ecodist
As pointed out above, the example works correctly in a clean R session but fails if spdep is loaded. I haven't figured out the conflict yet, but the problem is in the implicit coercion of the distance objects to vectors within the mechanics of using a formula. If you do that conversion explicitly, the command works properly. I'll work on a patch, which will appear first at the GitHub repository above and be sent to CRAN after testing.
# R --vanilla --no-save
library(ecodist)
data(graze)
# Works
set.seed(1234)
MRM(dist(LOAR10) ~ dist(sitelocation) + dist(forestpct), data=graze, nperm=10)
$coef
                   dist(LOAR10) pval
Int                   6.9372046  1.0
dist(sitelocation)   -0.4840631  0.6
dist(forestpct)       0.1456083  0.1

$r.squared
        R2       pval
0.04927212 0.10000000

$F.test
       F  F.pval
31.66549 0.10000
library(spdep)
# Fails
MRM(dist(LOAR10) ~ dist(sitelocation) + dist(forestpct), data=graze, nperm=10)
Error in xj[i, , drop = FALSE] : incorrect number of dimensions
# Explicit conversion to vector
graze.d <- with(graze, data.frame(LOAR10 = as.vector(dist(LOAR10)),
                                  sitelocation = as.vector(dist(sitelocation)),
                                  forestpct = as.vector(dist(forestpct))))
# Works
set.seed(1234)
MRM(LOAR10 ~ sitelocation + forestpct, data=graze.d, nperm=10)
$coef
                 LOAR10 pval
Int           6.9372046  1.0
sitelocation -0.4840631  0.6
forestpct     0.1456083  0.1

$r.squared
        R2       pval
0.04927212 0.10000000

$F.test
       F  F.pval
31.66549 0.10000
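If you have more models to fit before the patch lands, the explicit conversion can be wrapped in a small helper. This is only a sketch of my own (dist_frame is not an ecodist function), under the assumption that every variable should simply be run through dist() and flattened:
library(ecodist)
# Hypothetical helper: compute dist() on each named column and return a plain
# data frame of distance vectors, so MRM() never has to coerce dist objects
# inside the formula (which is where the spdep conflict bites).
dist_frame <- function(df, cols) {
  as.data.frame(lapply(df[cols], function(x) as.vector(dist(x))))
}
data(graze)
graze.d <- dist_frame(graze, c("LOAR10", "sitelocation", "forestpct"))
set.seed(1234)
MRM(LOAR10 ~ sitelocation + forestpct, data = graze.d, nperm = 10)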

How to feed a tibble to spacyr?

Consider this simple example:
bogustib <- tibble(doc_id = c(1, 2, 3),
                   text = c('bug', 'one love', '838383838'))
# A tibble: 3 x 2
  doc_id text
   <dbl> <chr>
1      1 bug
2      2 one love
3      3 838383838
This tibble is called bogustib because I know spacyr will fail on row 3.
> spacy_parse('838383838', lemma = FALSE, entity = TRUE, nounphrase = TRUE)
Error in `$<-.data.frame`(`*tmp*`, "doc_id", value = "text1") :
replacement has 1 row, data has 0
So, naturally, feeding the tibble to spacyr fails as well:
spacy_parse(bogustib, lemma = FALSE, entity = TRUE, nounphrase = TRUE)
Error in `$<-.data.frame`(`*tmp*`, "doc_id", value = "3") :
replacement has 1 row, data has 0
My question is: I think I can avoid this issue by calling spacy_parse row by row.
However, this looks inefficient and I would like to use the multithread argument of spacyr to speed up the computation over my large tibble.
Is there any solution here?
Thanks!
Actually, this does not happen in my environment; the output looks like this:
library(tidyverse)
library(spacyr)
bogustib <- tibble(doc_id = c(1, 2, 3),
                   text = c('bug', 'one love', '838383838'))
spacy_parse(bogustib)
spacy_parse('838383838', lemma = FALSE, entity = TRUE, nounphrase = TRUE)
## No noun phrase found in documents.
##   doc_id sentence_id token_id     token pos     entity
## 1  text1           1        1 838383838 NUM CARDINAL_B
To get this result, I used the latest master branch from GitHub. However, I was able to reproduce your error when I ran the CRAN version of spacyr. I'm sure I fixed this bug a while ago, but the fix does not seem to be in the CRAN version yet. We will try to update CRAN as soon as possible.
In the meantime, you can:
devtools::install_github('quanteda/spacyr')
Or download the repository as a zip, unzip it, and run:
devtools::install('******')
where ****** is the path to the unzipped repository.
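If you can't update right away, a row-by-row fallback with error handling can serve as a stopgap. This is only a sketch of my own (safe_parse is not a spacyr function); it gives up spacyr's multithreaded batching and silently drops any documents that still fail:
library(spacyr)
# Hypothetical fallback, using bogustib from above: parse each row separately
# and catch errors so one bad document (e.g. a purely numeric string on the
# CRAN version) does not abort the whole batch.
safe_parse <- function(tib) {
  pieces <- lapply(seq_len(nrow(tib)), function(i) {
    res <- tryCatch(spacy_parse(tib$text[i], lemma = FALSE, entity = TRUE),
                    error = function(e) NULL)    # drop rows that still fail
    if (!is.null(res)) res$doc_id <- as.character(tib$doc_id[i])
    res
  })
  do.call(rbind, pieces)
}
parsed <- safe_parse(bogustib)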

Selecting features from a feature set using mRMRe package

I am a new user of R, trying to use the mRMRe package (mRMR is a well-known feature selection approach) to obtain a feature subset from a feature set. Please excuse me if my question is simple; I really want to know how to fix an error. The details are below.
Suppose I have a csv file (gene.csv) with a feature set of 6 attributes ([G1.1.1.1], [G1.1.1.2], [G1.1.1.3], [G1.1.1.4], [G1.1.1.5], [G1.1.1.6]) and a target class variable [Output] ('1' indicates the positive class and '-1' the negative class). Here's a sample gene.csv file:
[G1.1.1.1] [G1.1.1.2] [G1.1.1.3] [G1.1.1.4] [G1.1.1.5] [G1.1.1.6] [Output]
11.688312 0.974026 4.87013 7.142857 3.571429 10.064935 -1
12.538226 1.223242 3.669725 6.116208 3.363914 9.174312 1
10.791367 0.719424 6.115108 6.47482 3.597122 10.791367 -1
13.533835 0.37594 6.766917 7.142857 2.631579 10.902256 1
9.737828 2.247191 5.992509 5.992509 2.996255 8.614232 -1
11.864407 0.564972 7.344633 4.519774 3.389831 7.909605 -1
11.931818 0 7.386364 5.113636 3.409091 6.818182 1
16.666667 0.333333 7.333333 4.333333 2 8.333333 -1
I am trying to get the best feature subset of 2 attributes (out of the 6 above) and wrote the following R code.
library(mRMRe)
file_n <- paste0("E:\\gene", ".csv")
df <- read.csv(file_n, header = TRUE)
f_data <- mRMR.data(data = data.frame(df))
featureData(f_data)
mRMR.ensemble(data = f_data, target_indices = 7,
              feature_count = 2, solution_count = 1)
When I run this code, I get the following error on the statement f_data <- mRMR.data(data = data.frame(df)):
Error in .local(.Object, ...) :
data columns must be either of numeric, ordered factor or Surv type
However, the data in each column of the csv file are real numbers. So, how can I change the R code to fix this problem? Also, I am not sure what the value of target_indices should be in the call mRMR.ensemble(data = f_data, target_indices = 7, feature_count = 2, solution_count = 1), since my target class variable is named "[Output]" in the gene.csv file.
I would much appreciate it if anyone could help me obtain the best feature subset from gene.csv using the mRMRe package.
I solved the problem by modifying my code as follows.
library(mRMRe)
file_n <- paste0("E:\\gene", ".csv")
df <- read.csv(file_n, header = TRUE)
df[[7]] <- as.numeric(df[[7]])
f_data <- mRMR.data(data = data.frame(df))
results <- mRMR.classic("mRMRe.Filter", data = f_data, target_indices = 7,
                        feature_count = 2)
solutions(results)
It worked fine. The output of the code gives the indices of the selected 2 features.
I think it has to do with your Output column, which is probably of class integer. You can check that with class(df[[7]]).
To convert it to numeric, as the error message requires, just type:
df[[7]] <- as.numeric(df[[7]])
That worked for me.
As for the other question, after reading the documentation, setting target_indices = 7 seems the right choice.
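To turn the returned indices into column names, you can index back into the original data frame. A short sketch, assuming the objects from the code above and the default structure returned by solutions() (a list with one matrix of feature indices per target):
# solutions(results)[[1]] is a feature_count x solution_count matrix of column
# indices into the data passed to mRMR.data(); map them back to names.
sel <- solutions(results)[[1]]
selected_features <- colnames(df)[as.vector(sel)]
selected_features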

R: trouble assigning values to a dynamic variable in a dataframe

I am trying to assign values to a dataframe variable defined by the user. The user specifies the name of the variable, let's call this x, in the dataframe df. For simplicity I want to assign a value of 3 to everything in the column the user specifies. The simplified code is:
variableName <- paste("df$", x, sep="")
eval(parse(text=variableName)) <- 3
But I get an error:
Error in file(filename, "r") : cannot open the connection
In addition: Warning message:
In file(filename, "r") :
cannot open file 'df$x': No such file or directory
I've tried all kinds of remedies to no avail. If I simply try to print the values of the column:
eval(parse(text=variableName))
I get no errors and it prints out fine. It's only when I try to give that column a value that I get the error. Any help would be appreciated.
I believe the issue is that there is no way to use the result of eval() on the left-hand side of an assignment.
df = data.frame(foo = 1:5,
                bar = -3)
x = "bar"
variableName <- paste("df$", x, sep="")
eval(parse(text=variableName)) <- 3
#> Warning in file(filename, "r"): cannot open file 'df$bar': No such file or
#> directory
#> Error in file(filename, "r"): cannot open the connection
## This error is a bit misleading. Breaking it apart I get a different error.
eval(expression(df$bar)) <- 3
#> Error in eval(expression(df$bar)) <- 3: could not find function "eval<-"
## And it works if you put the whole assignment in the string to be parsed.
ex1 <- paste0("df$", x, "<-3")
eval(parse(text=ex1))
df
#>   foo bar
#> 1   1   3
#> 2   2   3
#> 3   3   3
#> 4   4   3
#> 5   5   3
## But I doubt that's the best way to do it!
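One simpler route, not shown in the answer above, is to skip parse()/eval() entirely and index the column by name with [[, which does work on the left-hand side of an assignment:
df <- data.frame(foo = 1:5, bar = -3)
x <- "bar"
# [[ accepts a character column name, so no string building or parsing is needed.
df[[x]] <- 3
df
#>   foo bar
#> 1   1   3
#> 2   2   3
#> 3   3   3
#> 4   4   3
#> 5   5   3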

subsetting data.cube inside custom function

I am trying to write a function of my own to subset a data.cube in R and format the result automatically for some predefined plots I aim to build.
This is my function:
require(data.table)
require(data.cube)
secciona <- function(cubo = NULL,
                     fecha_valor = list(),
                     loc_valor = list(),
                     prod_valor = list(),
                     drop = FALSE){
  cubo[fecha_valor, loc_valor, prod_valor, drop = drop]
  ## The line above will really be an assignment of the form y <- format(cubo[...drop])
  ## Rest of the code, which will end up plotting the subset
}
The thing is, I keep getting the error: Error in eval(expr, envir, enclos) : object 'fecha_valor' not found
What is strangest to me is that everything works fine on the console, but not inside this subsetting function of mine.
In console:
> dc[list(as.Date("2013/01/01"))]
> dc[list(as.Date("2013/01/01")),]
> dc[list(as.Date("2013/01/01")),,]
> dc[list(as.Date("2013/01/01")),list(),list()]
all give the following result:
<data.cube>
fact:
  5627 rows x 2 dimensions x 1 measures (0.32 MB)
dimensions:
  localizacion : 4 entities x 3 levels (0.01 MB)
  producto : 153994 entities x 3 levels (21.29 MB)
total size: 21.61 MB
But whenever I try
secciona(dc)
secciona(dc, fecha_valor = list(as.Date("2013/01/01")))
secciona(dc, fecha_valor = list())
I always get the error mentioned above.
Any ideas why this is happening? Should I take a different approach to building the subset for plotting?
This is the standard issue that R users face when dealing with non-standard evaluation. It is a consequence of the Computing on the language feature of the R language.
The [.data.cube function expects to be used interactively, which extends the flexibility of the arguments passed to it but imposes some restrictions. In that respect it is similar to [.data.table when passing expressions from a wrapper function to the [ subset operator. I've added a dummy example to make it reproducible.
I see you are already using the data.cube-oop branch, so just to clarify for other readers: the data.cube-oop branch is 92 commits ahead of the master branch. To install it, use the following.
install.packages("data.cube", repos = paste0("https://", c(
  "jangorecki.gitlab.io/data.cube",
  "Rdatatable.github.io/data.table",
  "cran.rstudio.com"
)))
library(data.cube)
set.seed(1)
ar = array(rnorm(8,10,5), rep(2,3),
           dimnames = list(color = c("green","red"),
                           year = c("2014","2015"),
                           country = c("IN","UK"))) # sorted
dc = as.data.cube(ar)
f = function(color=list(), year=list(), country=list(), drop=FALSE){
  expr = substitute(
    dc[color=.color, year=.year, country=.country, drop=.drop],
    list(.color=color, .year=year, .country=country, .drop=drop)
  )
  eval(expr)
}
f(year=list(c("2014","2015")), country="UK")
#<data.cube>
#fact:
# 4 rows x 3 dimensions x 1 measures (0.00 MB)
#dimensions:
# color : 2 entities x 1 levels (0.00 MB)
# year : 2 entities x 1 levels (0.00 MB)
# country : 1 entities x 1 levels (0.00 MB)
#total size: 0.01 MB
You can track the expression just by putting print(expr) before eval(expr), or instead of it.
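For example, you can reproduce outside f() what that print(expr) would show for the call above (the exact deparsing may differ slightly):
substitute(
  dc[color = .color, year = .year, country = .country, drop = .drop],
  list(.color = list(), .year = list(c("2014", "2015")), .country = "UK", .drop = FALSE)
)
#> dc[color = list(), year = list(c("2014", "2015")), country = "UK", drop = FALSE]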
Read more about non-standard evaluation:
- R Language Definition: Computing on the language
- Advanced R: Non-standard evaluation
- the help page for the substitute function
And some related SO questions:
- Passing on non-standard evaluation arguments to the subset function
- In R, why is [ better than subset?
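For readers who know data.table better than data.cube, the same wrapper pattern looks like this. A minimal sketch of my own (subset_dt is not a data.table function), mirroring f() above by splicing argument values into a subset call with substitute() and then evaluating it:
library(data.table)
dt <- data.table(year = c(2014L, 2014L, 2015L), value = 1:3)
# Hypothetical wrapper: build the i-expression with the caller's values spliced
# in, then evaluate it, so [.data.table sees a complete literal call.
subset_dt <- function(years) {
  expr <- substitute(dt[year %in% .years], list(.years = years))
  eval(expr)
}
subset_dt(c(2014L, 2015L))
#>    year value
#> 1: 2014     1
#> 2: 2014     2
#> 3: 2015     3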

Resources