I have an unusual problem with an SPSS dataset that I import into R via the haven package (I also made a post about this on GitHub). The dataset is full of variables with missing value definitions that are not included among the value labels, which leads to errors in R. E.g., -77 is defined as a missing value, but not as a value label. Indexing the variable's column returns
Error: `x` and `labels` must be same type
The only way I've found to fix the issue is to apply a label, remove the missing value, then remove the label:
ds <- read_spss(sav.file, user_na=TRUE)
val_label(ds[[1]], -77) <- "temp"
na_values(ds[[1]]) <- NULL
val_label(ds[[1]], -77) <- NULL
The solution relies on double brackets (or $). I'm wondering what the fastest way to apply this to all numeric variables in a large dataset would be. I could easily do it with a for loop, but I'm looking for something faster.
I am trying to learn how to produce pretty tables using the package huxtable. It's a learning curve, but so far I am really impressed. However, I have run into a few problems that I can't seem to solve.
Firstly, I am trying to format numbers so that there is a comma separator at the thousands position (using the mutate_at function from the dplyr package, and prettyNum). It works well except that, for columns of class numeric, internal zeros are excised (e.g., 1001 becomes 1,1 instead of the desired 1,001). If the column class is integer, then the desired output is produced. Also, the correct output is produced if the input data is a data frame rather than a huxtable, regardless of whether the column is numeric or integer.
Secondly, when I add other table formatting (in particular, a caption), the caption does not seem to be carried over when I write the table to a Word file. Additionally, a note is produced:
Note: zip::zip() is deprecated, please use zip::zipr() instead
Below is some example code that I think illustrates the issue.
My questions are:
1) Why does the mutate function produce the odd result for numeric columns in huxtables, but not in data frames, and how can I ensure that it does work? I could, of course, do the number formatting before converting the data frame to a table, but I'd still like to know what is going on here.
2) Why is the table formatting not preserved in the output file?
3) What does the note about using zipr mean, and could the issue it references also be responsible for the failure to export table properties?
Thanks,
Glenn
library(dplyr)
library(flextable)
library(huxtable)
test=data.frame(var1=1918:1925,var2=c(9009,1000:1006),var3 = 1100:1107)
str(test)
HUX <- hux(test)
number_format(HUX)
number_format(HUX[,2]) <- 0
# works as expected on data frame
mutate_at(test,-1,.funs=list(~prettyNum(.,big.mark=",")))
# does not work as expected on huxtable, for var2 of class numeric
mutate_at(HUX,-1,.funs=list(~prettyNum(., big.mark=",")))
# add caption, borders, and colnames
set_caption(HUX,"Example table") %>%
set_caption_pos("topleft") %>%
set_top_border(1,,1) %>%
set_bottom_border(final(1), , 1) %>%
add_colnames()
# write out the table (this produces a note about zipr)
quick_docx(HUX)
Re the note about using zipr: see https://github.com/awalker89/openxlsx/issues/454
Re mutate_at: your data is being transformed correctly, but huxtable is displaying it wrongly. It is recognising each number, before and after the comma, as separate. (Number recognition is hard, let's go shopping…) I would suggest using number_format instead of transforming the data directly:
number_format(HUX)[,2:3] <- list(function(x) prettyNum(x, big.mark=","))
Finally, your second problem has a simple solution: you are changing all of the features of HUX, but you're not saving the result back to the original variable. Remember that R is a functional language: objects are very rarely modified in place. Add HUX <- to the start of your dplyr chain.
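To make the "not modified in place" point concrete, here is a minimal base-R sketch. The add_caption helper is hypothetical (it is not a huxtable function); it only stands in for setters like set_caption, which likewise return a modified copy.

```r
# Hypothetical stand-in for a setter such as set_caption():
# it returns a modified copy and leaves its argument untouched.
add_caption <- function(tbl, text) {
  attr(tbl, "caption") <- text
  tbl
}

tbl <- data.frame(a = 1:3)

# The result is printed and then thrown away.
add_caption(tbl, "Example table")
is.null(attr(tbl, "caption"))   # TRUE: tbl itself is unchanged

# Assigning the result back is what keeps the change.
tbl <- add_caption(tbl, "Example table")
attr(tbl, "caption")            # "Example table"
```

The same reasoning applies to the piped chain in the question: without an assignment on the left, quick_docx() still receives the original, unformatted HUX.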
I would like to know how to make a reference to a data frame and variable generic, please. Say I have a data frame named 's' and a variable in that data frame named 'Y'.
Regular R code:
look = s$Y
What I would like to do:
data = s
variable = Y
look = data$variable (which functions the same as look = s$Y)
Any thoughts? The reason I would like to do this is that I have s$Y throughout my code, and later I may want to change s for t (or Y for some other variable), and don't want to have to go through all of my code manually replacing s$Y with t$Y where I need it changed.
Thanks!
This is the reason that the $-operator is considered poor practice inside function definitions: it "locks you in" to a particular spelling of a column name. You are not going to do this, however:
variable = Y
Rather you are going to do this:
variable = "Y"
And that is because the first version would have caused the R interpreter to go out and try to find a value for the symbol Y, first in the current environment and then along what is known as its "search path", which is, roughly speaking, the global environment followed by every attached package. In the case of the second version, "Y" is its own value and no further searching is needed. With that fundamental confusion corrected, you would now do this:
look <- data[[ variable ]] # although using 'data' as a name is another "poor-practice"
Whereupon R will look for a value of variable and find it in the global environment, returning the character "Y" and delivering a column named "Y" from the dataset s. Column names are not considered first-class objects in R, whereas named data frames are. The "names" of columns are not true R names (even though they are called colnames). The $-operator is just shorthand for "[[" with a character value. Here's a full transcript to test this:
> s <- data.frame(Y=1:10, X=LETTERS[1:10]); data = s
>
> variable <- "Y"
>
> look1 <- data$Y; look2 <- data[["Y"]]
> identical(look1, look2)
[1] TRUE
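The transcript above defines variable but never indexes with it, so here is a small continuation as a sketch. It rebuilds the same s so it can be run on its own; look3 is just an illustrative name.

```r
s <- data.frame(Y = 1:10, X = LETTERS[1:10])

# The column name lives in a character variable...
variable <- "Y"
look3 <- s[[variable]]
identical(look3, s$Y)    # TRUE

# ...so switching columns means changing one string, not every s$Y.
variable <- "X"
head(s[[variable]], 3)   # "A" "B" "C"

# By contrast, $ takes the name literally: this looks for a column
# actually called "variable", finds none, and returns NULL.
is.null(s$variable)      # TRUE
```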
The confusion that this "non-standard evaluation" (NSE) shorthand feature of R has caused new users appears to be one of the motivations for the creation of first the ggplot aes function and later the evolution of the package-dplyr and the tidyverse-bundle-of-packages. Those packages allow the use of non-quoted names or tokens to refer to column identities.
In addition to @42-'s answer, you can dynamically reference columns like this:
colName <- "something"
myDataFrame[, colName]
Edit: Since you also asked about dynamically referencing data.frames, @Rich Scriven suggested making a function that takes the data.frame as an argument, which is one working solution. You can also just load the data you need at the top of your script, which is easy to change on the fly if you need:
fileName <- "file1.csv"
data <- read.table(fileName, header = TRUE, stringsAsFactors = FALSE)
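For the function-based suggestion, a minimal sketch might look like the following. The name get_column and the two toy data frames are made up for illustration; the point is only that both the data frame and the column name become arguments.

```r
# Hypothetical helper: both the data frame and the column are parameters.
get_column <- function(df, col_name) {
  stopifnot(is.data.frame(df), col_name %in% names(df))
  df[[col_name]]
}

s     <- data.frame(Y = 1:3)
other <- data.frame(Y = 4:6, Z = 7:9)

get_column(s, "Y")       # 1 2 3
get_column(other, "Z")   # 7 8 9
```

Swapping s for another data frame, or Y for another column, is then a change at the call site only.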
As per @42- above, the best choice seems to be the packages referenced. Using a function is close but doesn't seem to allow 'data' and 'variable' to be generic in 'data$variable'.
Thanks everyone!
I would like to perform a HCPC on the columns of my dataset, after performing a CA. For some reason I also have to specify at the start, that all of my columns are of type 'factor', just to loop over them afterwards again and convert them to numeric. I don't know why exactly, because if I check the type of each column (without specifying them as factor) they appear to be numeric... When I don't load and convert the data like this, however, I get an error like the following:
Error in eigen(crossprod(t(X), t(X)), symmetric = TRUE) : infinite or missing values in 'x'
Could this be due to the fact that there are columns in my dataset that contain only 0's? If so, how come it works perfectly fine when I read everything in as factor first and then convert it to numeric before applying the CA, instead of just performing the CA directly?
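One thing that may be relevant here: as.numeric() applied to a factor returns the internal level codes, not the original values. The sketch below uses made-up vectors (zeros and f are illustrative, not taken from the dataset); whether this is what hides the error in the real data is an assumption, since the data are not shown.

```r
# An all-zero column read in as a factor has the single level "0"...
zeros <- factor(c("0", "0", "0"))
as.numeric(zeros)                  # 1 1 1  (level codes, not the values)
as.numeric(as.character(zeros))    # 0 0 0  (the original values)

# ...and other columns are silently recoded as well.
f <- factor(c("10", "0", "5"))
levels(f)                          # "0" "10" "5" (sorted as strings)
as.numeric(f)                      # 2 1 3
as.numeric(as.character(f))        # 10 0 5
```

If that is what happens, the factor round trip would turn all-zero columns into all-one columns, which avoids the zero-margin problem but also means the CA is run on codes rather than on the original counts.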
The original issue with the HCPC, then, is the following:
# read in data; 40 x 267 data frame
data_for_ca <- read.csv("./data/data_clean_CA_complete.csv",row.names=1,colClasses = c(rep('factor',267)))
# loop over first 267 columns, converting them to numeric
for(i in 1:267)
data_for_ca[[i]] <- as.numeric(data_for_ca[[i]])
# perform CA
data.ca <- CA(data_for_ca,graph = F)
# perform HCPC for rows (i.e. individuals); up until here everything works just fine
data.hcpc <- HCPC(data.ca,graph = T)
# now I start having trouble
# perform HCPC for columns (i.e. variables); use their coordinates that are stocked in the CA-object that was created earlier
data.cols.hcpc <- HCPC(data.ca$col$coord,graph = T)
The code above shows me a dendrogram in the last case and even lets me cut it into clusters, but then I get the following error:
Error in catdes(data.clust, ncol(data.clust), proba = proba, row.w = res.sauv$call$row.w.init) : object 'data.clust' not found
It's worth noting that when I perform MCA on my data and try to perform HCPC on my columns in that case, I get the exact same error. Would anyone have any clue as how to fix this or what I am doing wrong exactly? For completeness I insert a screenshot of the upper-left corner of my dataset to show what it looks like:
Thanks in advance for any possible help!
I know this is old, but because I've been troubleshooting this problem for a while today:
HCPC says that it accepts a data frame, but any time I try to simply pass it $col$coord or $colcoord from a standard ca object, it returns this error. My best guess is that there's some metadata it actually needs/is looking for that isn't in a data frame of coordinates, but I can't figure out what that is or how to pass it in.
The current version of FactoMineR will actually just allow you to give HCPC the whole CA object and tell it whether to cluster the rows or columns. So your last line of code should be:
data.cols.hcpc <- HCPC(data.ca, cluster.CA = "columns", graph = T)
I am having an issue with subsetting my Spark DataFrame.
I have a DataFrame called nfe, which contains a column called ITEM_PRODUTO that is formatted as a string. I would like to subset this DataFrame based on whether the item column contains the word "AREIA". I can easily subset the data based on an exact phrase:
nfe.subset1 <- subset(nfe, nfe$ITEM_PRODUTO == "AREIA LAVADA FINA")
nfe.subset2 <- subset(nfe, nfe$ITEM_PRODUTO %in% "AREIA")
However, what I would like is a subset of all rows that contain the word "AREIA" in the ITEM_PRODUTO column. When I try to use grep, though, I receive an error message:
nfe.subset3 <- subset(nfe, grep("AREIA", nfe$ITEM_PRODUTO))
# Error in as.character.default(x) :
# no method for coercing this S4 class to a vector
I've tried multiple iterations of syntax, and tried grepl as well, but nothing seems to work. It's probably a syntax error, but could anyone help me out?
Thanks!
Standard R functions cannot be applied to a SparkDataFrame. Use either like:
where(nfe, like(nfe$ITEM_PRODUTO, "%AREIA%"))
or rlike:
where(nfe, rlike(nfe$ITEM_PRODUTO, ".*AREIA.*"))
New to R and having a problem with a very simple task! I have read a few columns of .csv data into R. The variables take values in the natural numbers plus zero, and have missing values. After trying to use the non-parametric package, I have two problems: first, if I use the simple command bw=npregbw(ydat=y, xdat=x, na.omit), where x and y are column vectors, I get the error that "number of regression data and response data do not match". Why do I get this, as I have the same number of elements in each vector?
Second, I would like to call the data ordered and tell npregbw this, using the command bw=npregbw(ydat=y, xdat=ordered(x)). When I do that, I get the error that x must be atomic for sort.list. But how is x not atomic, it is just a vector with natural numbers and NA's?
Any clarifications would be greatly appreciated!
1) You probably have a different number of NA's in y and x.
2) Can't be sure about this, since there is no example. If it is of following type:
x <- c(3,4,NA,2)
Then ordered(x) should work fine. Please provide an example of your case.
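For point 1, a quick base-R check along these lines can confirm whether the NA counts differ and build vectors that are complete in both. The values of y are made up for illustration, and nothing here calls npregbw itself.

```r
x <- c(3, 4, NA, 2)
y <- c(1, NA, NA, 6)   # illustrative values only

# Compare the number of missing values in each vector.
sum(is.na(x))   # 1
sum(is.na(y))   # 2

# Keep only the positions where both x and y are observed,
# so the two vectors stay paired and equally long.
keep <- complete.cases(x, y)
x_ok <- x[keep]   # 3 2
y_ok <- y[keep]   # 1 6

# For point 2: on a plain numeric vector, ordered() runs without error.
ordered(x)
```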
EDIT: You of course tried bw=npregbw(ydat=y, xdat=x)? ordered() makes your vector an ordered factor (see ?ordered), which is not an atomic vector (see 2.1.1 link and ?factor)
EDIT2: So the problem was the way of subsetting data. Note the difference between the various ways of subsetting: data$x and data[,i] (where i = column number of column x) give you vectors, while data[c("x")] and data[i] give a data frame. Functions expect vectors, unless they call for data = (your data). In that case they work with column names.
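The difference is easy to check on a toy data frame. The object below is made up for illustration (and keeps the name data only to match the text above):

```r
data <- data.frame(x = c(3, 4, NA, 2), y = c(1, 0, 2, 5))
i <- 1   # column number of x

# These forms return a plain numeric vector.
is.numeric(data$x)        # TRUE
is.numeric(data[, i])     # TRUE
is.numeric(data[["x"]])   # TRUE

# These forms return a one-column data frame (a list), not a vector.
is.data.frame(data["x"])  # TRUE
is.data.frame(data[i])    # TRUE
is.atomic(data["x"])      # FALSE
```

Passing one of the data frame forms to a function that expects a vector is the kind of thing that produces "must be atomic" errors.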