I can't really create a code example because I'm not quite sure what the problem is, and my actual problem is rather involved. That said, it seems like a fairly generic problem that somebody may have seen before.
Basically I'm constructing 3 different data frames and rbinding them together, which all goes smoothly as expected, but when I try to write the merged frame back to the DB I get this error:
Error in .External2(C_writetable, x, file, nrow(x), p, rnames, sep, eol, :
unimplemented type 'list' in 'EncodeElement'
I've tried manually coercing them using as.data.frame() before and after the rbinds, and the returned object (the same one that fails to write with the above error message) exists in the environment with class data.frame. So why does dbWriteTable not seem to have got the memo?
Sorry, I should have said: I'm connecting to a MySQL DB using RMySQL. Looking a little closer while trying to explain myself, I think the problem is that the columns of my data frame are themselves lists (all of the same length), which sort of makes sense of the error. I'd think (or like to think, anyway) that a call to as.data.frame() would take care of that, but I guess not?
My str() output is long, so here is a portion of it:
.. [list output truncated]
$ stcong :List of 29809
..$ : int 3
..$ : int 8
..$ : int 4
..$ : int 2
I guess I'm wondering if there's an easy way to force this coercion?
Hard to say for sure, since you provided so little concrete information, but this would be one way to convert a list column to an atomic vector column:
> d <- data.frame(x = 1:5)
> d$y <- as.list(letters[1:5])
> str(d)
'data.frame': 5 obs. of 2 variables:
$ x: int 1 2 3 4 5
$ y:List of 5
..$ : chr "a"
..$ : chr "b"
..$ : chr "c"
..$ : chr "d"
..$ : chr "e"
> d$y <- unlist(d$y)
> str(d)
'data.frame': 5 obs. of 2 variables:
$ x: int 1 2 3 4 5
$ y: chr "a" "b" "c" "d" ...
This assumes that each element of your list column is only a length one vector. If any aren't, things will be more complicated, and you'd likely need to rethink your data structure anyhow.
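If you are not sure which columns are affected, you can check before flattening. The following is only a sketch in base R, using a made-up data frame d rather than your real data: it finds the list columns, keeps only those whose elements all have length one, and unlists just those.

```r
# Toy data: x is atomic, y and z are list columns with length-one elements
d <- data.frame(x = 1:5)
d$y <- as.list(letters[1:5])
d$z <- as.list(c(3L, 8L, 4L, 2L, 7L))

# Which columns are lists?
is_list_col <- vapply(d, is.list, logical(1))

# Of those, which can be flattened safely (every element has length 1)?
flattenable <- vapply(d[is_list_col],
                      function(col) all(lengths(col) == 1L),
                      logical(1))
to_fix <- names(flattenable)[flattenable]

# Replace each such list column with the equivalent atomic vector
d[to_fix] <- lapply(d[to_fix], unlist)
str(d)
```

Any list column that fails the length check is left untouched, so you can inspect it separately.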
I'm trying to import a csv with blanks read as "". Unfortunately they're all reading as "NA" now.
To better demonstrate the problem I'm also showing how NA, "NA", and "" are all mapping to the same thing (except in the very bottom example), which would prevent the easy workaround dt[is.na(dt)] <- ""
> write.csv(matrix(c("0","",NA,"NA"),ncol = 2),"MRE.csv")
Opening this in notepad, it looks like this
"","V1","V2"
"1","0",NA
"2","","NA"
So reading that back...
> fread("MRE.csv")
V1 V1 V2
1: 1 0 NA
2: 2 NA NA
The documentation seems to suggest this but it does not work as described
> fread("MRE.csv",na.strings = NULL)
V1 V1 V2
1: 1 0 NA
2: 2 NA NA
Also tried this which reads the NA as an actual NA, but the problem remains for the empty string which is read as "NA"
> fread("MRE.csv",colClasses=c(V1="character",V2="character"))
V1 V1 V2
1: 1 0 <NA>
2: 2 NA NA
> fread("MRE.csv",colClasses=c(V1="character",V2="character"))[,V2]
[1] NA "NA"
data.table version 1.11.4
R version 3.5.1
A few possible things going on here:
Regardless of you writing "0" here, the reading function (fread) is inferring based on looking at a portion of the file. This is not uncommon (readr does it, too), and is controllable (with colClasses=).
This might be unique to your question here (and not your real data), but your call to write.csv is implicitly putting the literal NA letters in the file (not to be confused with "NA" where you have the literal string). This might be confusing things, even when you override with colClasses=.
You might already know this, but since fread is inferring that those columns are really integer classes, then they cannot contain empty strings: once determined to be a number column, anything non-number-like will be NA.
Let's redo your first csv-generating side to make sure we don't confound the situation.
write.csv(matrix(c("0","",NA,"NA"),ncol = 2), "MRE.csv", na="")
(Below, I'm using magrittr's pipe operator %>% merely for presentation, it is not required.)
The first example demonstrates fread's inference. The second shows our overriding that behavior, and now we have blank strings in each NA spot that is not the literal string "NA".
fread("MRE.csv") %>% str
# Classes 'data.table' and 'data.frame': 2 obs. of 3 variables:
# $ V1: int 1 2
# $ V1: int 0 NA
# $ V2: logi NA NA
# - attr(*, ".internal.selfref")=<externalptr>
fread("MRE.csv", colClasses="character") %>% str
# Classes 'data.table' and 'data.frame': 2 obs. of 3 variables:
# $ V1: chr "1" "2"
# $ V1: chr "0" ""
# $ V2: chr "" "NA"
# - attr(*, ".internal.selfref")=<externalptr>
This can also be controlled on a per-column basis. One issue with this example is that fread is for some reason forcing the column of row-names to be named V1, the same as the next column. This looks like a bug to me, perhaps you can look at Rdatatable's issues and potentially post a new one. (I might be wrong, perhaps this is intentional/known behavior.)
Because of this, per-column overriding seems to stop at the first occurrence of a column name.
fread("MRE.csv", colClasses=c(V1="character", V2="character")) %>% str
# Classes 'data.table' and 'data.frame': 2 obs. of 3 variables:
# $ V1: chr "1" "2"
# $ V1: int 0 NA
# $ V2: chr "" "NA"
# - attr(*, ".internal.selfref")=<externalptr>
One way around this is to go with an unnamed vector, requiring the same number of classes as the number of columns:
fread("MRE.csv", colClasses=c("character","character","character")) %>% str
# Classes 'data.table' and 'data.frame': 2 obs. of 3 variables:
# $ V1: chr "1" "2"
# $ V1: chr "0" ""
# $ V2: chr "" "NA"
# - attr(*, ".internal.selfref")=<externalptr>
Another way (thanks @thelatemail) is with a list:
fread("MRE.csv", colClasses=list(character=2:3)) %>% str
# Classes 'data.table' and 'data.frame': 2 obs. of 3 variables:
# $ V1: int 1 2
# $ V1: chr "0" ""
# $ V2: chr "" "NA"
# - attr(*, ".internal.selfref")=<externalptr>
Side note: if you need to preserve them as ints/nums, then:
- If your concern is about how it affects follow-on calculations, then you can:
  - fix the source of the data so that nulls are not provided;
  - filter out the incomplete observations (rows); or
  - fix the calculations to deal intelligently with missing data.
- If your concern is about how it looks in a report, then whatever tool you are using to render in your report should have a mechanism for how to display NA values; for example, setting options(knitr.kable.NA="") before knitr::kable(...) will present them as empty strings.
- If your concern is about how it looks on your console, you have two options:
  - interfere with the data by iterating over each (intended) column and changing NA values to ""; this only works on character columns, and is irreversible; or
  - write your own subclass of data.frame that changes how it is displayed on the console; the benefit to this is that it is non-destructive; the problem is that you have to re-class each object where you want this behavior, and most (if not all) functions that output frames will likely inadvertently strip or omit that class from your input. (You'll need to write an S3 method of print for your subclass to do this.)
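As a rough illustration of the "interfere with the data" option, here is a base R sketch on a made-up frame df (the column names are hypothetical). It touches only character columns, so numeric NA values stay as they are.

```r
# Toy frame with NA in a character column and in a numeric column
df <- data.frame(id = 1:3,
                 name = c("a", NA, "c"),
                 score = c(1.5, NA, 3),
                 stringsAsFactors = FALSE)

# Only character columns can hold "" without changing type
chr_cols <- names(df)[vapply(df, is.character, logical(1))]

# Irreversible: after this, a former NA and a real "" look the same
for (col in chr_cols) {
  df[[col]][is.na(df[[col]])] <- ""
}
str(df)
```

The same loop works on a data.table, though data.table users may prefer an in-place approach.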
library(dtplyr)
library(xlsx)
library(lubridate)
'data.frame': 612 obs. of 7 variables:
$ Company : Factor w/ 10 levels "Harbor","HCG",..: 6 10 10 3 6 8 6 8 6 6 ...
$ Title : chr NA NA NA NA ...
$ Send.Offer.Letter :Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 612 obs. of 1 variable:
..$ Send Offer Letter: Date, format: NA NA NA NA ...
..- attr(*, "spec")=List of 2
.. ..$ cols :List of 1
.. .. ..$ Send Offer Letter: list()
.. .. .. ..- attr(*, "class")= chr "collector_character" "collector"
.. ..$ default: list()
.. .. ..- attr(*, "class")= chr "collector_guess" "collector"
.. ..- attr(*, "class")= chr "col_spec"
$ Accepted.Position : chr NA NA NA NA ...
$ Application.Date : chr NA NA NA NA ...
$ Hire.Date..Start. :Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 612 obs. of 1 variable:
..$ Hire Date (Start): POSIXct, format: "2008-05-20" NA NA "2008-05-13" ...
$ Rehire..Yes.or.No.: Factor w/ 23 levels "??","36500","continuing intern",..: NA NA NA NA NA NA NA NA NA NA ...
I have an extremely messy dataset (it was entered entirely freehand on excel spreadsheets) regarding new hires. Variables associated with dates are, of course, making things difficult. There was no consistency in entry format, sometimes random character strings were a part of a date (think 5/17, day tbd) etc. I finally got the dates consistently formatted into POSIXct format, but it led to the odd situation you see above where it appears there are nested variables in my columns. I have already coerced two date variables into as.character ($Accepted.Position and $Application.Date), as I have seen examples of POSIXct date formatting causing issues with write.xlsx.
When I attempt to write to xlsx, I get the following:
write.xlsx(forstack, file = "forstackover.xlsx", col.names = TRUE)
Error in .jcall(cell, "V", "setCellValue", value) :
method setCellValue with signature ([D)V not found
In addition: There were 50 or more warnings (use warnings() to see the first 50)
My dput is too long to post here, so here is the pastebin for it:
Dput forstack
Attempting to coerce $Hire.Date..Start with as.character produces the odd result which I have partially pasted here:
as.character result
I am not sure what action to take here. I found a similar discussion here:
stack question similar to this one
but this user was trying to call a specific portion of a column for ggplot2 graphing. Any help is appreciated.
I had this issue when trying to write a tibble (tbl_df) to xlsx using the xlsx package.
The error was thrown when I added the row.names = FALSE option; without it there was no error.
I converted the tbl_df to a data.frame and it worked.
I agree with @greg dubrow's solution. Here is the same suggestion as code:
write.xlsx(as.data.frame(forstack), file = "forstackover.xlsx", col.names = TRUE)
You can also choose the output file interactively with file.choose():
write.xlsx(as.data.frame(forstack), file = file.choose(), col.names = TRUE)
By the way, as in @Lee's case, my code gave an error with row.names = FALSE. Converting to a data.frame resolves that as well, so the call can be expanded a little further:
write.xlsx(as.data.frame(forstack), file = file.choose(), col.names = TRUE, row.names=FALSE)
For me the issue was because the data.frame was grouped so I added ungroup(forstack) prior to write.xlsx and that fixed the issue.
It's a trivial issue. But this trick works.
write.xlsx(as.data.frame(forstack), file = "forstackover.xlsx", col.names = TRUE)
This happens because when col.names or row.names are specified, the input object must be a data.frame and not a tibble.
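The str() output in the question also shows columns that are themselves one-column data frames (Send.Offer.Letter and Hire.Date..Start.). If converting the outer object is not enough, flattening those nested columns may help. This is only a sketch in base R on a small made-up frame; the names and values are stand-ins, and it assumes each nested frame holds exactly one column.

```r
# Toy frame with one nested one-column data frame, mimicking the str() output
forstack <- data.frame(Company = c("Harbor", "HCG"), stringsAsFactors = FALSE)
forstack$Hire.Date <- data.frame(
  `Hire Date (Start)` = as.POSIXct(c("2008-05-20", NA), tz = "UTC"),
  check.names = FALSE
)

# Find columns that are data frames and replace each with its single inner column
nested <- vapply(forstack, is.data.frame, logical(1))
for (col in names(forstack)[nested]) {
  forstack[[col]] <- forstack[[col]][[1]]
}
str(forstack)
```

After this, every column is a plain vector, which is the shape write.xlsx expects.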
I want to first calculate a Markov transition matrix and then raise it to a power. To achieve the first goal I use the markovchainFit function from the markovchain package, and it returns a data.frame rather than a matrix, so I need to convert it to a matrix before taking the power.
My R code snippet is like
#################################
# Estimate Transition Matrix #
#################################
setwd("G:/Data_backup/GDP_per_Capita")
library("foreign")
library("Hmisc")
mydata <- stata.get("G:/Data_backup/GDP_per_Capita/states.dta")
mydata
library(markovchain)
library(expm)
rgdp_e=mydata[,2:7]
rgdp_o=mydata[,8:13]
createSequenceMatrix(rgdp_e)
rgdp_e_trans<-markovchainFit(data=rgdp_e,,method="bootstrap",nboot=5, name="Bootstrap Mc")
rgdp_e_trans<-as.numeric(unlist(rgdp_e_trans))
rgdp_e_trans<-as.matrix(rgdp_e_trans)
is.matrix(rgdp_e_trans)
rgdp_e_trans %^% 1/5
rgdp_e_trans is a data frame, and I try to convert it to a numeric matrix. This seems to work when I test it with is.matrix. However, the final line gives me an error saying:
Error in rgdp_e_trans %^% 2 :
(list) object cannot be coerced to type 'double'
After some searching on Stack Overflow, I found this question with a similar problem and used rgdp_e_trans<-as.numeric(unlist(rgdp_e_trans)) to coerce the object to 'double', but it does not seem to work.
Besides, the data.frame rgdp_e_trans contains no factors or characters.
The output in the console looks like this:
> rgdp_e=mydata[,2:7]
> rgdp_o=mydata[,8:13]
> createSequenceMatrix(rgdp_e)
Error: not compatible with STRSXP
> rgdp_e_trans<-markovchainFit(data=rgdp_e,,method="bootstrap",nboot=5, name="Bootstrap Mc")
> rgdp_e_trans
$estimate
1 2 3 4 5
1 0.6172840 0.18930041 0.09053498 0.074074074 0.02880658
2 0.1125828 0.59602649 0.28476821 0.006622517 0.00000000
3 0.0000000 0.03846154 0.60256410 0.358974359 0.00000000
4 0.0000000 0.01162791 0.03488372 0.691860465 0.26162791
5 0.0000000 0.00000000 0.00000000 0.044247788 0.95575221
> rgdp_e_trans<-as.numeric(unlist(rgdp_e_trans))
Error: (list) object cannot be coerced to type 'double'
> rgdp_e_trans<-as.matrix(rgdp_e_trans)
> is.matrix(rgdp_e_trans)
[1] TRUE
> rgdp_e_trans %^% 1/5
Error in rgdp_e_trans %^% 1 :
(list) object cannot be coerced to type 'double'
>
Any suggestions to fix the problem, or an alternative way to calculate the power? Thank you.
Additional:
> str(rgdp_e_trans)
List of 1
$ estimate:Formal class 'markovchain' [package "markovchain"] with 4 slots
.. ..@ states : chr [1:5] "1" "2" "3" "4" ...
.. ..@ byrow : logi TRUE
.. ..@ transitionMatrix: num [1:5, 1:5] 0.617 0.113 0 0 0 ...
.. .. ..- attr(*, "dimnames")=List of 2
.. .. .. ..$ : chr [1:5] "1" "2" "3" "4" ...
.. .. .. ..$ : chr [1:5] "1" "2" "3" "4" ...
.. ..@ name : chr "Bootstrap Mc"
And here is the code with the as.matrix part commented out:
rgdp_e=mydata[,2:7]
rgdp_o=mydata[,8:13]
createSequenceMatrix(rgdp_e)
rgdp_e_trans<-markovchainFit(data=rgdp_e,,method="bootstrap",nboot=5, name="Bootstrap Mc")
rgdp_e_trans
str(rgdp_e_trans)
# rgdp_e_trans<-as.numeric(unlist(rgdp_e_trans))
# rgdp_e_trans<-as.matrix(rgdp_e_trans)
# is.matrix(rgdp_e_trans)
rgdp_e_trans$estimate %^% 1/5
You can access the transition matrix directly from the object returned by markovchainFit as:
rgdp_e_trans$estimate@transitionMatrix
Here rgdp_e_trans is your return value from markovchainFit, which is actually a list containing the information from the fitting process. You access the estimate item of that list by using the $ operator. The estimate object is from a formal S4 class (see e.g. Advanced R by Hadley Wickham for a description of the object systems used in R), which is why in order to access its items you have to use the @ operator instead of the standard $ used for the more common S3 objects.
If you print out the return value of as.matrix(rgdp_e_trans) it should be immediately obvious where your initial approach went wrong. In general it's a good idea to check the structure of an object with the str function - instead of relying on its print method - when you encounter unexpected results or are working with new types of objects.
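To make the $ versus @ distinction concrete without needing the markovchain package, here is a small base R sketch. The class toyChain and the operator %pow% are hypothetical stand-ins invented for this example; they are not part of markovchain or expm. The last lines also illustrate a precedence point that affects the question's final line: any %op% binds tighter than /, so x %^% 1/5 is read as (x %^% 1)/5.

```r
# Toy S4 class standing in for the fitted chain (hypothetical, not the real markovchain class)
setClass("toyChain",
         representation(states = "character", transitionMatrix = "matrix"))

P <- matrix(c(0.9, 0.1,
              0.4, 0.6), nrow = 2, byrow = TRUE,
            dimnames = list(c("1", "2"), c("1", "2")))
fit <- list(estimate = new("toyChain", states = c("1", "2"), transitionMatrix = P))

# `$` picks the list element, `@` picks the S4 slot
tm <- fit$estimate@transitionMatrix
stopifnot(is.matrix(tm), is.numeric(tm))

# Integer matrix power via repeated matrix multiplication
tm2 <- tm %*% tm

# Any %op% binds tighter than `/`
`%pow%` <- function(a, b) a^b   # hypothetical scalar operator, for illustration only
a <- 2 %pow% 1/5      # parsed as (2^1) / 5
b <- 2 %pow% (1/5)    # parsed as 2^(1/5)
```

As far as I know, expm's %^% is meant for integer exponents, so a fractional power such as 1/5 would likely need a different tool even once the parentheses are fixed.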
From what I can see here I would assume that data.table v1.8.0+ does not automatically convert strings to factors.
Specifically, to quote Matthew Dowle from that page:
No need for stringsAsFactors. Done like this in v1.8.0 : o character columns are now allowed in keys and are preferred to factor. data.table() and setkey() no longer coerce character to factor. Factors are still supported.
I'm not seeing that ... here's my R session transcript:
First, I make sure I have a recent enough version of data.table > 1.8.0
> library(data.table)
data.table 1.8.8 For help type: help("data.table")
Next, I create a 2x2 data.table. Notice that it creates factors ...
> m <- matrix(letters[1:4], ncol=2)
> str(data.table(m))
Classes ‘data.table’ and 'data.frame': 2 obs. of 2 variables:
$ V1: Factor w/ 2 levels "a","b": 1 2
$ V2: Factor w/ 2 levels "c","d": 1 2
- attr(*, ".internal.selfref")=<externalptr>
When I use stringsAsFactors in data.frame() and then call data.table(), all is well ...
> str(data.table(data.frame(m, stringsAsFactors=FALSE)))
Classes ‘data.table’ and 'data.frame': 2 obs. of 2 variables:
$ X1: chr "a" "b"
$ X2: chr "c" "d"
- attr(*, ".internal.selfref")=<externalptr>
What am I missing? Is data.frame() supposed to convert strings to factors, and if so, is there a "better way" of turning that behavior off?
Thanks!
Update:
This issue seems to have slipped past somehow until now. Thanks to @fpinter for filing the issue recently. It is now fixed in commit 1322. From NEWS, No:39 under bug fixes for v1.9.3:
as.data.table.matrix does not convert strings to factors by default. data.table likes and prefers using character vectors to factors. Closes #745. Thanks to @fpinter for reporting the issue on the github issue tracker and to vijay for reporting here on SO.
It appears that this non-coercion is not yet implemented.
data.table deals with matrix arguments using as.data.table
if (is.matrix(xi) || is.data.frame(xi)) {
xi = as.data.table(xi, keep.rownames = keep.rownames)
x[[i]] = xi
numcols[i] = length(xi)
}
and
as.data.table.matrix
contains
if (mode(x) == "character") {
for (i in ic) value[[i]] <- as.factor(x[, i])
}
Might be worth reporting this to the bug tracker. (it is still implemented in 1.8.9, the current r-forge version)
As a workaround, and to complete @mnel's answer, if you want to turn off the default behavior of data.frame you can use the dedicated option.
options(stringsAsFactors=FALSE)
str(data.table(data.frame(m)))
Classes ‘data.table’ and 'data.frame': 2 obs. of 2 variables:
$ X1: chr "a" "b"
$ X2: chr "c" "d"
- attr(*, ".internal.selfref")=<externalptr>
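If you would rather not change a global option, passing stringsAsFactors explicitly when converting the matrix has the same effect. A minimal base R sketch follows; note that from R 4.0.0 onward the default for stringsAsFactors is already FALSE, so the argument mainly matters on older versions.

```r
m <- matrix(letters[1:4], ncol = 2)

# Convert the character matrix without creating factors
df <- as.data.frame(m, stringsAsFactors = FALSE)
str(df)
```

The resulting data frame can then be handed to data.table() as in the question.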
For starters: I have searched for hours on this problem by now, so if the answer turns out to be trivial, please forgive me...
What I want to do is delete a row (no. 101) from a data.frame. It contains test data and should not appear in my analyses. My problem is: Whenever I subset from the data.frame, the attributes (esp. comments) are lost.
str(x)
# x has comments for each variable
x <- x[1:100,]
str(x)
# now x has lost all comments
It is well documented that subsetting will drop all attributes - so far, it's perfectly clear. The manual (e.g. http://stat.ethz.ch/R-manual/R-devel/library/base/html/Extract.data.frame.html) even suggests a way to preserve the attributes:
## keeping special attributes: use a class with a
## "as.data.frame" and "[" method:
as.data.frame.avector <- as.data.frame.vector
`[.avector` <- function(x,i,...) {
r <- NextMethod("[")
mostattributes(r) <- attributes(x)
r
}
d <- data.frame(i= 0:7, f= gl(2,4),
u= structure(11:18, unit = "kg", class="avector"))
str(d[2:4, -1]) # 'u' keeps its "unit"
I am not yet far enough into R to understand what exactly happens here. However, simply running these lines (except the last three) does not change the behavior of my subsetting. Using subset() with an appropriate logical vector (100 times TRUE plus 1 FALSE) gives me the same result. And simply storing the attributes in a variable and restoring them after the subset does not work either.
# Does not work...
tmp <- attributes(x)
x <- x[1:100,]
attributes(x) <- tmp
Of course, I could write all comments to a vector (var=>comment), subset and write them back using a loop - but that does not seem a well-founded solution. And I am quite sure I will encounter datasets with other relevant attributes in future analyses.
So this is where my efforts in stackoverflow, Google, and brain power got stuck. I would very much appreciate if anyone could help me out with a hint. Thanks!
If I understand you correctly, you have some data in a data.frame, and the columns of the data.frame have comments associated with them. Perhaps something like the following?
set.seed(1)
mydf<-data.frame(aa=rpois(100,4),bb=sample(LETTERS[1:5],
100,replace=TRUE))
comment(mydf$aa)<-"Don't drop me!"
comment(mydf$bb)<-"Me either!"
So this would give you something like
> str(mydf)
'data.frame': 100 obs. of 2 variables:
$ aa: atomic 3 3 4 7 2 7 7 5 5 1 ...
..- attr(*, "comment")= chr "Don't drop me!"
$ bb: Factor w/ 5 levels "A","B","C","D",..: 4 2 2 5 4 2 1 3 5 3 ...
..- attr(*, "comment")= chr "Me either!"
And when you subset this, the comments are dropped:
> str(mydf[1:2,]) # comment dropped.
'data.frame': 2 obs. of 2 variables:
$ aa: num 3 3
$ bb: Factor w/ 5 levels "A","B","C","D",..: 4 2
To preserve the comments, define the function [.avector, as you did above (from the documentation) then add the appropriate class attributes to each of the columns in your data.frame (EDIT: to keep the factor levels of bb, add "factor" to the class of bb.):
mydf$aa<-structure(mydf$aa, class="avector")
mydf$bb<-structure(mydf$bb, class=c("avector","factor"))
So that the comments are preserved:
> str(mydf[1:2,])
'data.frame': 2 obs. of 2 variables:
$ aa:Class 'avector' atomic [1:2] 3 3
.. ..- attr(*, "comment")= chr "Don't drop me!"
$ bb: Factor w/ 5 levels "A","B","C","D",..: 4 2
..- attr(*, "comment")= chr "Me either!"
EDIT:
If there are many columns in your data.frame that have attributes you want to preserve, you could use lapply (EDITED to include original column class):
mydf2 <- data.frame( lapply( mydf, function(x) {
structure( x, class = c("avector", class(x) ) )
} ) )
However, this drops comments associated with the data.frame itself (such as comment(mydf)<-"I'm a data.frame"), so if you have any, assign them to the new data.frame:
comment(mydf2)<-comment(mydf)
And then you have
> str(mydf2[1:2,])
'data.frame': 2 obs. of 2 variables:
$ aa:Classes 'avector', 'numeric' atomic [1:2] 3 3
.. ..- attr(*, "comment")= chr "Don't drop me!"
$ bb: Factor w/ 5 levels "A","B","C","D",..: 4 2
..- attr(*, "comment")= chr "Me either!"
- attr(*, "comment")= chr "I'm a data.frame"
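If adding a class to every column feels too heavy for a one-off subset, the save-and-restore idea mentioned in the question can be written fairly compactly. This is only a sketch in base R on a made-up frame x, and it handles the comment attribute only, not arbitrary attributes.

```r
# Toy frame with a comment on each column
x <- data.frame(a = 1:5, b = letters[1:5], stringsAsFactors = FALSE)
comment(x$a) <- "first"
comment(x$b) <- "second"

# Save the per-column comments (these live on the columns, not on the data.frame)
saved <- lapply(x, comment)

# Subsetting rows drops the column comments
y <- x[1:3, ]

# Put each saved comment back on the matching column
for (nm in names(saved)) {
  comment(y[[nm]]) <- saved[[nm]]
}
str(y)
```

The earlier attempt with attributes(x) fails because that call returns the attributes of the data.frame itself (names, class, row.names), not those of its columns.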
For those who look for the "all-in" solution based on BenBarnes explanation: Here it is.
(give your upvote to BenBarnes's post if this works for you)
# Define the avector-subselection method (from the manual)
as.data.frame.avector <- as.data.frame.vector
`[.avector` <- function(x,i,...) {
r <- NextMethod("[")
mostattributes(r) <- attributes(x)
r
}
# Assign each column in the data.frame the (additional) class avector
# Note that this will "lose" the data.frame's attributes, therefore write to a copy
df2 <- data.frame(
lapply(df, function(x) {
structure( x, class = c("avector", class(x) ) )
} )
)
# Finally copy the attribute for the original data.frame if necessary
mostattributes(df2) <- attributes(df)
# Now subselects work without losing attributes :)
df2 <- df2[1:100,]
str(df2)
The good thing: once the class has been attached to all of the data.frame's columns, subsetting no longer drops their attributes.
Okay, sometimes I am stunned at how complicated it is to do the simplest operations in R. But I surely would not have learned about the "classes" feature if I had just marked and deleted the case in SPSS ;)
This is solved by the sticky package. (Full Disclosure: I am the package author.) Apply the sticky() to your vectors and the attributes are preserved through subset operations. For example:
> df <- data.frame(
+ sticky = sticky( structure(1:5, comment="sticky attribute") ),
+ nonstick = structure( letters[1:5], comment="non-sticky attribute" )
+ )
>
> comment(df[1:3, "nonstick"])
NULL
> comment(df[1:3, "sticky"])
[1] "sticky attribute"
This works for any attribute and not only comment.
See the sticky package for details:
on Github
on CRAN
I spent hours trying to figure out how to retain attribute data (specifically variable labels) when subsetting a dataframe (removing columns). The answer was so simple, I couldn't believe it. Just use the function spss.get from the Hmisc package, and then no matter how you subset, the variable labels are retained.