R dataframe define column names at creation - r

I get monthly price value for the two assets below from Yahoo:
if(!require("tseries") | !require(its) ) { install.packages(c("tseries", 'its')); require("tseries"); require(its) }
startDate <- as.Date("2000-01-01", format="%Y-%m-%d")
MSFT.prices = get.hist.quote(instrument="msft", start=startDate,
                             quote="AdjClose", provider="yahoo", origin="1970-01-01",
                             compression="m", retclass="its")
SP500.prices = get.hist.quote(instrument="^gspc", start=startDate,
                              quote="AdjClose", provider="yahoo", origin="1970-01-01",
                              compression="m", retclass="its")
I want to put these two into a single data frame with specified column names (pandas allows this now, which is a bit ironic since it took the data.frame concept from R). Below, I assign names to the two time series:
MSFTSP500.prices <- data.frame(msft = MSFT.prices, sp500= SP500.prices )
However, this does not preserve the column names [msft, sp500] I assigned. I need to define the column names in a separate line of code:
colnames(MSFTSP500.prices) <- c("msft", "sp500")
I tried to put colnames and col.names inside the data.frame() call but it doesn't work. How can I define column names while creating the data frame?
I found ?data.frame very unhelpful...

The code fails with an error message indicating no availability of as.its. So I added the missing code (which appears to have been successful after two failed attempts.) Once you issue the missing require() call you can use str to see what sort of object get.hist.quote actually returns. It is neither a dataframe nor a zoo object, although it resembles a zoo-object in many ways:
> str(SP500.prices)
Formal class 'its' [package "its"] with 2 slots
..@ .Data: num [1:180, 1] 1394 1366 1499 1452 1421 ...
.. ..- attr(*, "dimnames")=List of 2
.. .. ..$ : chr [1:180] "2000-01-02" "2000-01-31" "2000-02-29" "2000-04-02" ...
.. .. ..$ : chr "AdjClose"
..@ dates: POSIXct[1:180], format: "2000-01-02 16:00:00" "2000-01-31 16:00:00" ...
If you run cbind on those two objects you get a regular matrix with dimnames:
> str(cbind(SP500.prices, MSFT.prices) )
num [1:180, 1:2] 1394 1366 1499 1452 1421 ...
- attr(*, "dimnames")=List of 2
..$ : chr [1:180] "2000-01-02" "2000-01-31" "2000-02-29" "2000-04-02" ...
..$ : chr [1:2] "AdjClose" "AdjClose"
You will still need to change the column names, since there does not seem to be a cbind.its that lets you assign column names. I would caution against using the data.frame method, since the resulting object might behave in confusing ways:
> str( MSFTSP500.prices )
'data.frame': 180 obs. of 2 variables:
$ AdjClose :Formal class 'AsIs', 'its' [package ""] with 1 slot
.. ..@ .S3Class: chr "AsIs" "its"
$ AdjClose.1:Formal class 'AsIs', 'its' [package ""] with 1 slot
.. ..@ .S3Class: chr "AsIs" "its"
The columns are still S4 objects. I suppose that might be useful if you were going to pass them to other its-methods but could be confusing otherwise. This might be what you were shooting for:
> MSFTSP500.prices <- data.frame(msft = as.vector(MSFT.prices),
                                 sp500 = as.vector(SP500.prices),
                                 row.names = as.character(MSFT.prices@dates))
> str( MSFTSP500.prices )
'data.frame': 180 obs. of 2 variables:
$ msft : num 35.1 32 38.1 25 22.4 ...
$ sp500: num 1394 1366 1499 1452 1421 ...
> head(rownames(MSFTSP500.prices))
[1] "2000-01-02 16:00:00" "2000-01-31 16:00:00" "2000-02-29 16:00:00"
[4] "2000-04-02 17:00:00" "2000-04-30 17:00:00" "2000-05-31 17:00:00"

MSFT.prices is a zoo object, which seems to behave like a data frame: it carries its own column name, and that name gets transferred to the result. Compare
tmp <- data.frame(a=1:10)
b <- data.frame(lost=tmp)
which loses the name given in the second call (lost).
If you do
MSFTSP500.prices <- data.frame(msft = as.vector(MSFT.prices),
sp500=as.vector(SP500.prices))
then you will get the colnames you want (though you won't get zoo-specific behaviours). Not sure why you object to renaming columns in a second command, though.
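
The renaming behaviour discussed in the answers above can be reproduced without downloading any prices. The following is a minimal sketch, assuming that one-column matrices with their own column name are a fair stand-in for the its objects; the names msft_m, sp500_m, overridden and named are made up for illustration.

```r
# Hypothetical stand-ins for the two price series: one-column matrices that
# carry their own column name, like the "AdjClose" column in the str() output.
msft_m  <- matrix(c(35.1, 32.0, 38.1), ncol = 1, dimnames = list(NULL, "AdjClose"))
sp500_m <- matrix(c(1394, 1366, 1499), ncol = 1, dimnames = list(NULL, "AdjClose"))

# Each input already has a column name, so data.frame() keeps that inner name
# instead of the argument names msft and sp500.
overridden <- data.frame(msft = msft_m, sp500 = sp500_m)
print(names(overridden))

# as.vector() drops dim and dimnames, so the argument names are used.
named <- data.frame(msft = as.vector(msft_m), sp500 = as.vector(sp500_m))
print(names(named))
```

Stripping the inputs down to plain vectors is what lets the names be set inside the data.frame() call; the trade-off, as noted above, is that any class-specific behaviour of the original objects is lost.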

Related

remove the extra information (attr) after reading an SPSS file using read_sav

I used read_sav() to read an SPSS file in R.
How do I remove the extra information (attr)?
I don't know how to create a reprex for this question, but I have a sample below. I wish to remove the attributes from the column PersonID and convert the result into a normal data frame/tibble.
Thanks
'data.frame': 543 obs. of 1 variable:
$ PersonID : num 1 2 3 4 5 6 7 8 9 10 ...
..- attr(*, "label")= chr "Person identifier"
..- attr(*, "format.spss")= chr "F8.0"
To remove all the attributes of the column you can use :
attributes(data$PersonID) <- NULL
To remove only specific ones you can do :
attr(data$PersonID, 'format.spss') <- NULL
To remove all attributes from all the columns :
data[] <- lapply(data, function(x) {attributes(x) <- NULL;x})
We can also use zap_labels and zap_formats from haven.
library(haven)
data <- zap_formats(zap_labels(data))
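
The base R variants above can be tried without an SPSS file. This is a small sketch, assuming a hand-built data frame whose column carries the same two attributes as in the question; the object names one_removed, all_removed and f are hypothetical.

```r
# Hypothetical stand-in for the imported data: a numeric column with the
# "label" and "format.spss" attributes shown in the question.
data <- data.frame(PersonID = as.numeric(1:5))
attr(data$PersonID, "label") <- "Person identifier"
attr(data$PersonID, "format.spss") <- "F8.0"

# Remove a single attribute and keep the rest.
one_removed <- data
attr(one_removed$PersonID, "format.spss") <- NULL
print(names(attributes(one_removed$PersonID)))

# Remove every attribute from every column; data[] keeps the data.frame itself.
all_removed <- data
all_removed[] <- lapply(all_removed, function(x) { attributes(x) <- NULL; x })
print(attributes(all_removed$PersonID))

# Caveat: class and levels are attributes too, so a factor treated this way
# is reduced to its integer codes.
f <- factor(c("a", "b", "a"))
attributes(f) <- NULL
print(f)
```

Because of that caveat, removing only the named attributes (or using the haven helpers) is the safer choice when the data contains factors or dates.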

Error in .jcall(cell, "V", "setCellValue", value) : method setCellValue with signature ([D)V not found when attempting write.xlsx

library(dtplyr)
library(xlsx)
library(lubridate)
'data.frame': 612 obs. of 7 variables:
$ Company : Factor w/ 10 levels "Harbor","HCG",..: 6 10 10 3 6 8 6 8 6 6 ...
$ Title : chr NA NA NA NA ...
$ Send.Offer.Letter :Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 612 obs. of 1 variable:
..$ Send Offer Letter: Date, format: NA NA NA NA ...
..- attr(*, "spec")=List of 2
.. ..$ cols :List of 1
.. .. ..$ Send Offer Letter: list()
.. .. .. ..- attr(*, "class")= chr "collector_character" "collector"
.. ..$ default: list()
.. .. ..- attr(*, "class")= chr "collector_guess" "collector"
.. ..- attr(*, "class")= chr "col_spec"
$ Accepted.Position : chr NA NA NA NA ...
$ Application.Date : chr NA NA NA NA ...
$ Hire.Date..Start. :Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 612 obs. of 1 variable:
..$ Hire Date (Start): POSIXct, format: "2008-05-20" NA NA "2008-05-13" ...
$ Rehire..Yes.or.No.: Factor w/ 23 levels "??","36500","continuing intern",..: NA NA NA NA NA NA NA NA NA NA ...
I have an extremely messy dataset (it was entered entirely freehand on excel spreadsheets) regarding new hires. Variables associated with dates are, of course, making things difficult. There was no consistency in entry format, sometimes random character strings were a part of a date (think 5/17, day tbd) etc. I finally got the dates consistently formatted into POSIXct format, but it led to the odd situation you see above where it appears there are nested variables in my columns. I have already coerced two date variables into as.character ($Accepted.Position and $Application.Date), as I have seen examples of POSIXct date formatting causing issues with write.xlsx.
When I attempt to write to xlsx, I get the following:
write.xlsx(forstack, file = "forstackover.xlsx", col.names = TRUE)
Error in .jcall(cell, "V", "setCellValue", value) :
method setCellValue with signature ([D)V not found
In addition: There were 50 or more warnings (use warnings() to see the first 50)
My dput is too long to post here, so here is the pastebin for it:
Dput forstack
Attempting to coerce $Hire.Date..Start with as.character produces the odd result which I have partially pasted here:
as.character result
I am not sure what action to take here. I found a similar discussion here:
stack question similar to this one
but this user was trying to call a specific portion of a column for ggplot2 graphing. Any help is appreciated.
I had this issue when trying to write a tibble tbl_df to xlsx using the xlsx package.
The error was thrown when I added the row.names = FALSE option; there was no error without the row.names argument.
I converted the tbl_df to data.frame and it worked.
I agree with @greg dubrow's solution. I have a simpler suggestion as code.
write.xlsx(as.data.frame(forstack), file = "forstackover.xlsx", col.names = TRUE)
You can also pick the output file interactively with file.choose():
write.xlsx(as.data.frame(forstack), file = file.choose(), col.names = TRUE)
By the way, my code, like @Lee's, gave an error with row.names = FALSE. Converting to a data.frame resolves that as well, so the call can be expanded a little:
write.xlsx(as.data.frame(forstack), file = file.choose(), col.names = TRUE, row.names=FALSE)
For me the issue was because the data.frame was grouped so I added ungroup(forstack) prior to write.xlsx and that fixed the issue.
It's a trivial issue. But this trick works.
write.xlsx(as.data.frame(forstack), file = "forstackover.xlsx", col.names = TRUE)
This happens because when col.names or row.names is specified, the input must be a data.frame and not a tibble.
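
The str() output in the question also shows columns that are themselves one-column data frames (Send.Offer.Letter and Hire.Date..Start.). The answers above do not say whether that nesting contributes to the error, so the following is only a sketch, under that assumption, of how such columns could be detected and flattened in base R before writing. The object forstack_demo and its values are hypothetical.

```r
# Hypothetical stand-in for `forstack`: one ordinary column plus one column
# that is itself a one-column data frame, as in the str() output above.
forstack_demo <- data.frame(Company = c("A", "B"), stringsAsFactors = FALSE)
forstack_demo$Hire.Date <- data.frame(
  `Hire Date (Start)` = as.Date(c("2008-05-20", NA)),
  check.names = FALSE
)

# Find the columns that are data frames rather than plain vectors.
nested <- vapply(forstack_demo, is.data.frame, logical(1))
print(nested)

# Replace each nested one-column data frame with the vector it contains.
for (nm in names(forstack_demo)[nested]) {
  forstack_demo[[nm]] <- forstack_demo[[nm]][[1]]
}
str(forstack_demo)
```

After this step every column is a plain vector, and the as.data.frame() conversion suggested in the answers can be applied as before.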

as.matrix(A$mat) for a given list A

I have n matrices to which I am trying to apply nearPD() from the Matrix package.
I have done this using the following code:
A<-lapply(b, nearPD)
where b is the list of n matrices.
I now would like to convert the list A into matrices. For an individual matrix I would use the following code:
A<-matrix(runif(n*n),ncol = n)
PD_mat_A<-nearPD(A)
B<-as.matrix(PD_mat_A$mat)
But I am trying to do this with a list. I have tried the following code but it doesn't seem to work:
d<-lapply(c, as.matrix($mat))
Any help would be appreciated. Thank you.
Here is a code so you can try and reproduce this:
n<-10
generate<-function (n){
matrix(runif(10*10),ncol = 10)
}
b<-lapply(1:n, generate)
Here is the simplest method using as.matrix as noted by @nicola in the comments below and (a version using apply) by @cimentadaj in the comments above:
d <- lapply(A, function(i) as.matrix(i$mat))
My original answer, which exploits the nearPD data structure, was the following. With a little fiddling with the nearPD object type, here is an extraction method:
d <- lapply(A, function(i) matrix(i$mat@x, ncol=i$mat@Dim[2]))
Below is some commentary on how I arrived at my answer.
This object is fairly complicated as str(A[[1]]) returns
List of 7
$ mat :Formal class 'dpoMatrix' [package "Matrix"] with 5 slots
.. ..@ x : num [1:100] 0.652 0.477 0.447 0.464 0.568 ...
.. ..@ Dim : int [1:2] 10 10
.. ..@ Dimnames:List of 2
.. .. ..$ : NULL
.. .. ..$ : NULL
.. ..@ uplo : chr "U"
.. ..@ factors : list()
$ eigenvalues: num [1:10] 4.817 0.858 0.603 0.214 0.15 ...
$ corr : logi FALSE
$ normF : num 1.63
$ iterations : num 2
$ rel.tol : num 0
$ converged : logi TRUE
- attr(*, "class")= chr "nearPD"
You are interested in "mat", which is accessed by $mat. The @ symbols show that "mat" is an S4 object whose components (slots) are accessed using @. The components of interest are "x", the matrix content, and "Dim", the dimensions of the matrix. The code above puts this information together to extract the matrices from the list of "nearPD" objects.
Below is a brief explanation of why as.matrix works in this case. Note the matrix inside a nearPD object is not a matrix:
is.matrix(A[[1]]$mat)
[1] FALSE
However, it is a "Matrix":
class(A[[1]]$mat)
[1] "dpoMatrix"
attr(,"package")
[1] "Matrix"
From the note in the help file, help("as.matrix,Matrix-method"),
Loading the Matrix namespace “overloads” as.matrix and as.array in the base namespace by the equivalent of function(x) as(x, "matrix"). Consequently, as.matrix(m) or as.array(m) will properly work when m inherits from the "Matrix" class.
So, the Matrix package is taking care of the as.matrix conversion "under the hood."
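
Putting the pieces together, here is a short end-to-end sketch of the as.matrix approach. It assumes the Matrix package is installed; set.seed() and the function argument name res are additions for illustration and are not part of the original question.

```r
library(Matrix)

set.seed(1)  # only so that repeated runs produce the same random input
n <- 10

# A list of n random 10 x 10 matrices, as in the question.
b <- lapply(seq_len(n), function(i) matrix(runif(n * n), ncol = n))

# nearPD() returns a list-like "nearPD" object; the matrix is in $mat.
A <- lapply(b, nearPD)

# Convert each $mat (a "Matrix" class object) into a base R matrix.
d <- lapply(A, function(res) as.matrix(res$mat))

print(class(A[[1]]$mat))
print(is.matrix(d[[1]]))
print(dim(d[[1]]))
```

Going through as.matrix() relies on the conversion documented by the Matrix package, whereas the slot-based version depends on how the object happens to be stored internally.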

Extract the data from the sublist into dataframe

The structure of the list is as follows (the list goes on with the same structure):
> str(parsedData)
List of 1658
 $ :List of 2
  ..$ Date : chr "2010-08-16"
  ..$ Volatility: num 11.1
 $ :List of 2
  ..$ Date : chr "2010-08-17"
  ..$ Volatility: num 26.2
As you can see, the elements at the first level of the structure have no names. I tried to extract the elements but failed:
> parsedData$Date
NULL
Can anyone tell me how to extract only the Date and Volatility from this list (especially since the elements have no names) and put them all in the same data frame, like this? Thanks!
Date Volatility
2010-08-16 11.1
2010-08-17 26.2
... ...
(this is the first time i ask question, sorry for any editing mistake :) )
Not tested:
setNames(data.frame(do.call(rbind,lapply(1:length(parsedData),function(i)cbind(parsedData[[i]][1],parsedData[[i]][2])))),c("Date","Volatility"))
OR:
setNames(data.frame(do.call(rbind,lapply(1:length(parsedData),function(i)t(parsedData[[i]][1:2])))),c("Date","Volatility"))
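
As an alternative to the untested one-liners above, the two fields can be pulled out column by column. This is a minimal sketch, assuming every sub-list has exactly one Date string and one numeric Volatility; the two-element parsedData below is a made-up sample with the same shape as the real list.

```r
# Hypothetical two-element sample with the same structure as the question.
parsedData <- list(
  list(Date = "2010-08-16", Volatility = 11.1),
  list(Date = "2010-08-17", Volatility = 26.2)
)

# vapply() extracts one field from every sub-list and checks its type.
result <- data.frame(
  Date       = vapply(parsedData, function(x) x$Date, character(1)),
  Volatility = vapply(parsedData, function(x) x$Volatility, numeric(1)),
  stringsAsFactors = FALSE
)

print(result)
str(result)
```

Building each column as a plain vector keeps Date as character and Volatility as numeric, rather than ending up with list columns. If needed, result$Date can be converted afterwards with as.Date().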

read.zoo works but then as.xts fails with "currently unsupported data type"

I've a csv file of daily bars, with just two lines:
"datestamp","Open","High","Low","Close","Volume"
"2012-07-02",79.862,79.9795,79.313,79.509,48455
(That file was an xts that was converted to a data.frame then passed on to write.csv)
I load it with this:
z=read.zoo(file='tmp.csv',sep=',',header=T,format = "%Y-%m-%d")
And it is fine as print(z) shows:
Open High Low Close Volume
2012-07-02 79.862 79.9795 79.313 79.509 48455
But then as.xts(z) gives: Error in coredata.xts(x) : currently unsupported data type
Here is the str(z) output:
‘zoo’ series from 2012-07-02 to 2012-07-02
Data:List of 5
$ : num 79.9
$ : num 80
$ : num 79.3
$ : num 79.5
$ : int 48455
- attr(*, "dim")= int [1:2] 1 5
- attr(*, "dimnames")=List of 2
..$ : NULL
..$ : chr [1:5] "Open" "High" "Low" "Close" ...
Index: Date[1:1], format: "2012-07-02"
I've so far confirmed it is not that 4 columns are num and one column is int, as I still get the error even after removing the Volume column. But, then, what could that error message be talking about?
As Sebastian pointed out in the comments, the problem is in the single row. Specifically the coredata is a list when read.zoo reads a single row, but something else (a matrix?) when there are 2+ rows.
I replaced the call to read.zoo with the following, and it works fine whether 1 or 2+ rows:
d=read.table(fname,sep=',',header=T)
x=as.xts(subset(d,select=-datestamp),order.by=as.Date(d$datestamp))
