I used read_sav() to read an SPSS file in R.
How do I remove the extra information (attr)?
I don't know how to create a reprex for this question, but I have a sample below. I wish to remove the attributes from the column PersonID and convert it into a normal data frame/tibble.
Thanks
'data.frame': 543 obs. of 1 variable:
$ PersonID : num 1 2 3 4 5 6 7 8 9 10 ...
..- attr(*, "label")= chr "Person identifier"
..- attr(*, "format.spss")= chr "F8.0"
To remove all the attributes of the column you can use:
attributes(data$PersonID) <- NULL
To remove only specific ones you can do:
attr(data$PersonID, 'format.spss') <- NULL
To remove all attributes from all the columns:
data[] <- lapply(data, function(x) {attributes(x) <- NULL;x})
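We can also use the zap_*() helpers from haven. zap_formats() removes the format attributes and zap_label() removes the variable label (the "label" attribute shown above); zap_labels() removes value labels from labelled columns.
library(haven)
data <- zap_formats(zap_labels(zap_label(data)))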
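A minimal base-R sketch of the attribute-stripping approaches above, using a made-up data frame in place of the read_sav() result (the toy data is an assumption; the attribute values are copied from the str() output in the question). One caveat: setting attributes(x) <- NULL on every column also drops class attributes, so factors and dates would collapse to their underlying codes. Here the column is a plain numeric, so nothing is lost.

```r
# Toy stand-in for a data frame imported with read_sav() (hypothetical data)
data <- data.frame(PersonID = as.numeric(1:5))
attr(data$PersonID, "label") <- "Person identifier"
attr(data$PersonID, "format.spss") <- "F8.0"

# Remove one specific attribute; the "label" attribute is still there afterwards
attr(data$PersonID, "format.spss") <- NULL
remaining <- names(attributes(data$PersonID))

# Remove every attribute from every column, keeping the data frame itself
data[] <- lapply(data, function(x) { attributes(x) <- NULL; x })
str(data)
```

The data[] <- form matters: assigning to data[] keeps the data frame and its column names, whereas data <- lapply(...) would return a plain list.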
I am doing hierarchical clustering in R and need the elements of each cluster separately.
When I use the following code, the data is split into a list of 3 plain numeric vectors (the first is num [1:2628]), and no column information from the original data (dataA) is carried over:
clusterA <- hclust(dist(dataA),method = "single")
NumA = 3
label <- cutree(clusterA, NumA)
clusterXlist<-split(dataA,f=label)
str(clusterXlist[[1]])
How can I make sure that it maintains the structure of dataA?
Edit:
In my case
> str(clusterXlist[[1]])
num [1:2628] 0.0529 -0.3909 -0.4465 0.1 0.8393 ...
whereas for dataA
> str(dataA)
num [1:440, 1:6] 0.0529 -0.3909 -0.4465 0.1 0.8393 ...
- attr(*, "dimnames")=List of 2
..$ : NULL
..$ : chr [1:6] "Fresh" "Milk" "Grocery" "Frozen" ...
- attr(*, "scaled:center")= Named num [1:6] 12000 5796 7951 3072 2881 ...
..- attr(*, "names")= chr [1:6] "Fresh" "Milk" "Grocery" "Frozen" ...
- attr(*, "scaled:scale")= Named num [1:6] 12647 7380 9503 4855 4768 ...
..- attr(*, "names")= chr [1:6] "Fresh" "Milk" "Grocery" "Frozen" ...
Edit 2:
For dataA:
> dput(head(dataA,n=20))
structure(c(0.0528730042415329, -0.390857056063646, -0.44652098379972,
0.0999975794271863, 0.839284119671916, -0.204572661537808, 0.00993903725191922,
-0.349583518736614, -0.477357534676238, -0.473957607271904, -0.682697336282181,
0.0905884780058897, 1.55872457204484, 0.728746944991474, 1.00042486502152,
-0.138155475034538, -0.868191050016313, -0.484236457564077, 0.521904849881291,
-0.333690834823332, 0.522972471408079, 0.543838613660349, 0.408073194590386,
-0.623310408164662, -0.0523368792616442, 0.333686752405346, -0.351915064454946,
-0.113851350576777, -0.291078065290861, 0.717677967619194, -0.053285340273111,
-0.63306600713975, 0.883794139056095, 0.0557876760455718, 0.497093035238056,
-0.634420951441845, 0.409157150032062, 0.0488774601048851, 0.0719115132405076,
-0.447303143322465, -0.0410681453901357, 0.170124700204028, -0.0281250860936324,
-0.3925300807586, -0.0792659545334748, -0.297298628211157, -0.10273182626616,
0.15518230654465, -0.185125447641461, 1.15011422238562, 0.528531691780372,
-0.360751187201331, 0.400469064432042, 0.739829765498898, 0.435615257968889,
-0.434621330503326, 0.438772101699743, -0.528063904936618, 0.226000834240152,
0.159180975270399, -0.588697039406295, -0.269829034507317, -0.137379339965946,
0.68636300602308, 0.173661155768845, -0.495590877769126, -0.533904475256987,
-0.288985833251248, -0.545233764836731, -0.394039245717966, 0.273564891153861,
-0.340276616984998, -0.573659982327726, 0.00475174748902491,
-0.572218072744849, -0.551001403168238, -0.605176006067741, -0.459955112363749,
-0.178576756619561, -0.494972916519322, -0.0435191938188023,
0.0863085949200282, 0.13308015693741, -0.498021323377842, -0.23165413161966,
-0.227878848586867, 0.0542186891412866, 0.0921812574154842, -0.244448146341904,
0.952945788892319, 0.649245242698738, -0.489212329634658, 0.209634507324604,
0.802353943473126, 0.456496070080021, -0.40217108193415, 0.341140199633565,
-0.526755422016323, -0.0240135648160378, -0.0762383134363428,
-0.066263629344282, 0.0890496850231094, 2.24074190324533, 0.0933048443208461,
1.29786952218849, -0.0261942126239276, -0.347458739603052, 0.369181005457445,
-0.274766434933383, 0.203229792845712, 0.0777025935624781, -0.364479376793999,
0.498608767430271, -0.327246732938803, 0.228051555415843, -0.394620088486301,
-0.157749554245622, 1.04716972023017, 0.587257919466454, -0.36306099036142
), .Dim = c(20L, 6L), .Dimnames = list(NULL, c("Fresh", "Milk",
"Grocery", "Frozen", "Detergents_Paper", "Delicassen")))
For clusterXlist[[1]], which was obtained by splitting dataA:
> dput(head(clusterXlist[[1]],n=20))
c(0.0528730042415329, -0.390857056063646, -0.44652098379972,
0.0999975794271863, 0.839284119671916, -0.204572661537808, 0.00993903725191922,
-0.349583518736614, -0.477357534676238, -0.473957607271904, -0.682697336282181,
0.0905884780058897, 1.55872457204484, 0.728746944991474, 1.00042486502152,
-0.138155475034538, -0.868191050016313, -0.484236457564077, 0.521904849881291,
-0.333690834823332)
What you have there is a matrix, not a data frame.
class(dataA)
# [1] "matrix"
The quick and easy way to split() would be to do
split(as.data.frame(dataA), label)
However, this may cause issues in later calculations and you may need to resort to coercing those list elements back to a matrix. I would recommend you use lapply() to split the data, as follows.
clusterXlist <- lapply(
unique(label),
function(i) dataA[label == i, , drop = FALSE]
)
This properly maintains the matrix structure in every list element.
str(clusterXlist[[1]])
# num [1:18, 1:6] 0.0529 -0.3909 0.1 0.8393 -0.2046 ...
# - attr(*, "dimnames")=List of 2
# ..$ : NULL
# ..$ : chr [1:6] "Fresh" "Milk" "Grocery" "Frozen" ...
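To see the difference on something small, here is a sketch with a made-up 4 x 3 matrix and a hand-written label vector standing in for the cutree() output (both are assumptions, not the asker's data). It shows split() flattening the matrix, the data frame route, and the row-subsetting route side by side.

```r
# Small stand-in for the scaled matrix dataA (hypothetical values)
dataA <- matrix(1:12, nrow = 4,
                dimnames = list(NULL, c("Fresh", "Milk", "Grocery")))
label <- c(1, 2, 1, 2)  # stand-in for cutree() output: one cluster id per row

# split() treats the matrix as a plain vector and recycles label along it,
# so the dim and dimnames attributes are lost
flat <- split(dataA, label)
str(flat[[1]])

# Converting to a data frame first makes split() work row-wise
by_df <- split(as.data.frame(dataA), label)

# Row subsetting with drop = FALSE keeps each piece a matrix
clusterXlist <- lapply(unique(label),
                       function(i) dataA[label == i, , drop = FALSE])
names(clusterXlist) <- unique(label)  # lapply() alone leaves the list unnamed
str(clusterXlist[["1"]])
```

The drop = FALSE argument is what protects clusters with a single row: without it, a one-row subset would be simplified to a vector again.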
library(dtplyr)
library(xlsx)
library(lubridate)
'data.frame': 612 obs. of 7 variables:
$ Company : Factor w/ 10 levels "Harbor","HCG",..: 6 10 10 3 6 8 6 8 6 6 ...
$ Title : chr NA NA NA NA ...
$ Send.Offer.Letter :Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 612 obs. of 1 variable:
..$ Send Offer Letter: Date, format: NA NA NA NA ...
..- attr(*, "spec")=List of 2
.. ..$ cols :List of 1
.. .. ..$ Send Offer Letter: list()
.. .. .. ..- attr(*, "class")= chr "collector_character" "collector"
.. ..$ default: list()
.. .. ..- attr(*, "class")= chr "collector_guess" "collector"
.. ..- attr(*, "class")= chr "col_spec"
$ Accepted.Position : chr NA NA NA NA ...
$ Application.Date : chr NA NA NA NA ...
$ Hire.Date..Start. :Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 612 obs. of 1 variable:
..$ Hire Date (Start): POSIXct, format: "2008-05-20" NA NA "2008-05-13" ...
$ Rehire..Yes.or.No.: Factor w/ 23 levels "??","36500","continuing intern",..: NA NA NA NA NA NA NA NA NA NA ...
I have an extremely messy dataset about new hires (it was entered entirely freehand in Excel spreadsheets). Variables associated with dates are, of course, making things difficult. There was no consistency in entry format, and sometimes random character strings were part of a date (think "5/17, day tbd"). I finally got the dates consistently formatted as POSIXct, but it led to the odd situation you see above, where there appear to be nested variables in my columns. I have already coerced two date variables to character with as.character() ($Accepted.Position and $Application.Date), as I have seen examples of POSIXct dates causing issues with write.xlsx.
When I attempt to write to xlsx, I get the following:
write.xlsx(forstack, file = "forstackover.xlsx", col.names = TRUE)
Error in .jcall(cell, "V", "setCellValue", value) :
method setCellValue with signature ([D)V not found
In addition: There were 50 or more warnings (use warnings() to see the first 50)
My dput is too long to post here, so here is the pastebin for it:
Dput forstack
Attempting to coerce $Hire.Date..Start with as.character produces the odd result which I have partially pasted here:
as.character result
I am not sure what action to take here. I found a similar discussion here:
stack question similar to this one
but this user was trying to call a specific portion of a column for ggplot2 graphing. Any help is appreciated.
I had this issue when trying to write a tibble (tbl_df) to xlsx using the xlsx package.
The error was thrown when I added the row.names = FALSE option; there was no error without it.
I converted the tbl_df to data.frame and it worked.
I agree with @greg dubrow's solution. Here is a simpler suggestion, as code:
write.xlsx(as.data.frame(forstack), file = "forstackover.xlsx", col.names = TRUE)
You can also choose the output file interactively with file.choose():
write.xlsx(as.data.frame(forstack), file = file.choose(), col.names = TRUE)
By the way, my code, like @Lee's, gave an error with row.names = FALSE. Converting to a data frame resolves that error too, so we can expand the call a little further:
write.xlsx(as.data.frame(forstack), file = file.choose(), col.names = TRUE, row.names=FALSE)
For me the issue was because the data.frame was grouped so I added ungroup(forstack) prior to write.xlsx and that fixed the issue.
It's a trivial issue. But this trick works.
write.xlsx(as.data.frame(forstack), file = "forstackover.xlsx", col.names = TRUE)
This happens because when col.names or row.names is specified, the input object must be a data.frame and not a tibble.
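Separately from the tibble conversion, the str() output in the question shows columns that are themselves one-column data frames. The following is a base-R sketch of one way to detect and flatten such columns, using a made-up two-row stand-in for forstack. It is not taken from any of the answers, and whether this step is needed before write.xlsx is an assumption.

```r
# Hypothetical two-row stand-in for forstack with one nested column
forstack <- data.frame(Company = c("Harbor", "HCG"))
forstack$Hire.Date..Start. <- data.frame(
  `Hire Date (Start)` = as.POSIXct(c("2008-05-20", NA), tz = "UTC"),
  check.names = FALSE
)

# Find the columns that are themselves data frames
nested <- vapply(forstack, is.data.frame, logical(1))

# Replace each nested one-column data frame with the vector it holds
for (nm in names(forstack)[nested]) {
  forstack[[nm]] <- forstack[[nm]][[1]]
}
str(forstack)
```

After the loop every column is an ordinary vector, so as.character() or format() can be applied to the date column in the usual way.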
I'm new to R so I'm sure this is simple but I can't figure it out. You can see the structure of my object n below. I want to loop through n and take each non-null value from the right side of the colon (e.g. "57454470") and apply a function to it.
> str(n)
List of 1
$ :List of 10
..$ 15793766: NULL
..$ 15793767: chr "57454470"
..$ 15793769: chr "123652395"
..$ 15793770: chr "38098549"
..$ 15793771: chr "56864789"
..$ 15793776: chr "38722835"
..$ 15793779: chr "37962343"
..$ 15793784: chr "2100162920"
..$ 15793787: chr "2099439832"
..$ 15793791: chr "37992986"
..- attr(*, "dim")= int 10
..- attr(*, "dimnames")=List of 1
.. ..$ rmaddrs$ReportID: chr [1:10] "15793766" "15793767" "15793769" "15793770" ...
..- attr(*, "call")= language by.data.frame(data = rmaddrs, INDICES = rmaddrs$ReportID, FUN = getValueFromXML)
..- attr(*, "class")= chr "by"
Here is the result of dput:
dput(n[1])
list(structure(list(`15793766` = NULL, `15793767` = "57454470",
`15793769` = "123652395", `15793770` = "38098549", `15793771` = "56864789",
`15793776` = "38722835", `15793779` = "37962343", `15793784` = "2100162920",
`15793787` = "2099439832", `15793791` = "37992986"), .Dim = 10L, .Dimnames = structure(list(
`rmaddrs$ReportID` = c("15793766", "15793767", "15793769",
"15793770", "15793771", "15793776", "15793779", "15793784",
"15793787", "15793791")), .Names = "rmaddrs$ReportID"), call = by.data.frame(data = rmaddrs,
INDICES = rmaddrs$ReportID, FUN = getValueFromXML), class = "by"))
UPDATE: I removed the "print" testing and I'm trying to use mean() for a better test.
sapply(n[1], function(x) mean(x, na.rm=TRUE))
Then I had to use unlist and as.numeric and now I think I have what I need to use my custom function.
The way you are using sapply, it prints everything, but it also returns the object, which (since it isn't assigned) is printed as well. To avoid printing the returned object, you can wrap the call in invisible() or assign the result:
invisible(sapply(n[1], print))
xx = sapply(n[1], print)
(Note: this is the same printing you get when you enter 1 + 1 in the console and the resulting 2 is printed, whereas entering x = 1 + 1 prints nothing. I also simplified your sapply by omitting the anonymous function, but that isn't related to your issue.)
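For the step mentioned in the update (unlist and as.numeric), here is a minimal sketch on a cut-down copy of the list from the question. The "by" attributes are left out, and the doubling function is a hypothetical stand-in for the custom function.

```r
# Cut-down version of the structure in the question ("by" attributes omitted)
n <- list(list(`15793766` = NULL,
               `15793767` = "57454470",
               `15793769` = "123652395"))

# unlist() drops the NULL entry and returns a named character vector
ids <- unlist(n[[1]])

# Convert to numeric before doing arithmetic on the values
vals <- as.numeric(ids)

# Assigning the result (or wrapping the call in invisible()) avoids the
# extra printout of the returned object
doubled <- sapply(vals, function(v) v * 2)  # hypothetical custom function
avg <- mean(vals)
```

Because unlist() already discards NULL elements, no separate filtering step is needed for this particular structure.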
I get monthly price value for the two assets below from Yahoo:
if(!require("tseries") | !require(its) ) { install.packages(c("tseries", 'its')); require("tseries"); require(its) }
startDate <- as.Date("2000-01-01", format="%Y-%m-%d")
MSFT.prices = get.hist.quote(instrument="msft", start= startDate,
quote="AdjClose", provider="yahoo", origin="1970-01-01",
compression="m", retclass="its")
SP500.prices = get.hist.quote(instrument="^gspc", start=startDate,
quote="AdjClose", provider="yahoo", origin="1970-01-01",
compression="m", retclass="its")
I want to put these two into a single data frame with specified column names (Pandas allows this now, which is a bit ironic since it took the data.frame concept from R). As below, I pass the two time series with names:
MSFTSP500.prices <- data.frame(msft = MSFT.prices, sp500= SP500.prices )
However, this does not preserve the column names [msft, sp500] I assigned. I need to define the column names in a separate line of code:
colnames(MSFTSP500.prices) <- c("msft", "sp500")
I tried putting colnames and col.names inside the data.frame() call, but neither works. How can I define column names while creating the data frame?
I found ?data.frame very unhelpful...
The code fails with an error message indicating no availability of as.its. So I added the missing code (which appears to have been successful after two failed attempts.) Once you issue the missing require() call you can use str to see what sort of object get.hist.quote actually returns. It is neither a dataframe nor a zoo object, although it resembles a zoo-object in many ways:
> str(SP500.prices)
Formal class 'its' [package "its"] with 2 slots
..@ .Data: num [1:180, 1] 1394 1366 1499 1452 1421 ...
.. ..- attr(*, "dimnames")=List of 2
.. .. ..$ : chr [1:180] "2000-01-02" "2000-01-31" "2000-02-29" "2000-04-02" ...
.. .. ..$ : chr "AdjClose"
..@ dates: POSIXct[1:180], format: "2000-01-02 16:00:00" "2000-01-31 16:00:00" ...
If you run cbind on those two objects you get a regular matrix with dimnames:
> str(cbind(SP500.prices, MSFT.prices) )
num [1:180, 1:2] 1394 1366 1499 1452 1421 ...
- attr(*, "dimnames")=List of 2
..$ : chr [1:180] "2000-01-02" "2000-01-31" "2000-02-29" "2000-04-02" ...
..$ : chr [1:2] "AdjClose" "AdjClose"
You will still need to change the column names, since there does not seem to be a cbind.its that lets you assign column names. I would be cautious about using the data.frame method, since the resulting object might behave in confusing ways:
> str( MSFTSP500.prices )
'data.frame': 180 obs. of 2 variables:
$ AdjClose :Formal class 'AsIs', 'its' [package ""] with 1 slot
.. ..@ .S3Class: chr "AsIs" "its"
$ AdjClose.1:Formal class 'AsIs', 'its' [package ""] with 1 slot
.. ..@ .S3Class: chr "AsIs" "its"
The columns are still S4 objects. I suppose that might be useful if you were going to pass them to other its-methods but could be confusing otherwise. This might be what you were shooting for:
> MSFTSP500.prices <- data.frame(msft = as.vector(MSFT.prices),
sp500= as.vector(SP500.prices) ,
row.names= as.character(MSFT.prices@dates) )
> str( MSFTSP500.prices )
'data.frame': 180 obs. of 2 variables:
$ msft : num 35.1 32 38.1 25 22.4 ...
$ sp500: num 1394 1366 1499 1452 1421 ...
> head(rownames(MSFTSP500.prices))
[1] "2000-01-02 16:00:00" "2000-01-31 16:00:00" "2000-02-29 16:00:00"
[4] "2000-04-02 17:00:00" "2000-04-30 17:00:00" "2000-05-31 17:00:00"
MSFT.prices is a time-series object (an its object here, since retclass="its"; zoo objects behave similarly) that acts like a data frame with its own column name, and that name gets transferred to the new data frame. Compare
tmp <- data.frame(a=1:10)
b <- data.frame(lost=tmp)
which loses the name lost: the only column of b is still called a.
If you do
MSFTSP500.prices <- data.frame(msft = as.vector(MSFT.prices),
sp500=as.vector(SP500.prices))
then you will get the colnames you want (though you won't get zoo-specific behaviours). Not sure why you object to renaming columns in a second command, though.
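The naming behaviour can be reproduced without the tseries or its packages. This sketch uses one-column matrices with a column name as stand-ins for the its objects (the values and the matrix stand-ins are assumptions); data.frame() treats a one-column matrix argument the same way, keeping the inner column name instead of the argument name.

```r
# One-column matrices with a column name, standing in for the its objects
# (hypothetical values)
dates <- c("2000-01-02", "2000-01-31", "2000-02-29")
MSFT.prices  <- matrix(c(35.1, 32.0, 38.1), ncol = 1,
                       dimnames = list(dates, "AdjClose"))
SP500.prices <- matrix(c(1394, 1366, 1499), ncol = 1,
                       dimnames = list(dates, "AdjClose"))

# The inner column name wins over the argument name
lost <- data.frame(msft = MSFT.prices, sp500 = SP500.prices)
names(lost)

# Plain vectors carry no column name of their own, so the argument names are used
kept <- data.frame(msft  = as.vector(MSFT.prices),
                   sp500 = as.vector(SP500.prices),
                   row.names = dates)
names(kept)
```

So the names can be set inside the data.frame() call itself, as long as the arguments are plain vectors when they are passed in.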