Why can't an sf object use all data.table methods in R?

I am learning sf in R. Since I like data.table very much, I thought I could use both together. However, it seems that an sf object derived from a data.table can no longer use data.table methods. Here is an example:
First I create a very simple data.table and convert it to an sf object. So far so good.
> library(data.table)
> library(sf)
> dfr <- data.table(id = c("hwy1", "hwy2"),
+ cars_per_hour = c(78, 22),
+ lat = c(1, 2),
+ lon = c(3, 4))
> my_sf <- st_as_sf(dfr , coords = c("lon", "lat"))
Then I check the structure of my_sf. It is an sf object, a data.table and a data.frame.
> str(my_sf)
Classes ‘sf’, ‘data.table’ and 'data.frame': 2 obs. of 3 variables:
$ id : chr "hwy1" "hwy2"
$ cars_per_hour: num 78 22
$ geometry :sfc_POINT of length 2; first list element: 'XY' num 3 1
- attr(*, "sf_column")= chr "geometry"
- attr(*, "agr")= Factor w/ 3 levels "constant","aggregate",..: NA NA
..- attr(*, "names")= chr "id" "cars_per_hour"
Then I tried an arbitrary function, unique, and it does not work. In fact, my_sf does not behave like a data.table at all.
> my_sf[, unique(id)]
Error in unique(id) : object 'id' not found
Does anyone know the reason for this? Is it not possible to use data.table syntax on an sf object?

My guess is that st_as_sf() has dropped the .internal.selfref attribute, so the object behaves like a plain data.frame again even though the data.table class name has been preserved. For comparison, the original data.table still carries that attribute:
> str(dfr)
#Classes ‘data.table’ and 'data.frame': 2 obs. of 4 variables:
#$ id : chr "hwy1" "hwy2"
#$ cars_per_hour: num 78 22
#$ lat : num 1 2
#$ lon : num 3 4
#- attr(*, ".internal.selfref")=<externalptr>
setDT(my_sf) might be enough to turn it back into a working data.table.

Related

Keep specific columns in a data.frame while unlisting a column which is a data.frame type

I have a data.frame called StockWeights. The structure of the data.frame is as follows:
'data.frame': 3 obs. of 6 variables:
$ Id : chr "159347" "161863" "22646"
$ ISIN : chr "DK0061156759" "DK0061533726" "DK0060681468"
$ $id : chr "21" "22" "23"
$ Name : chr "159347" "161863" "22646"
$ SumPeriod:'data.frame': 3 obs. of 27 variables:
..$ AccPeriodBasTwrAtMarketPrice : num 0.0969 0.538 -0.1071
..$ AccPeriodLocTwrAtMarketPrice : num 0.0969 0.538 -0.1071
..$ BopDate : chr "2022-02-28T00:00:00" "2022-02-28T00:00:00" "2022-02-28T00:00:00"
..$ BopBasHoldingValueAtMarketPrice: num 7592267 5135961 7166816
My question is: how can I "unlist" this SumPeriod data.frame column and display the BopBasHoldingValueAtMarketPrice column together with the Id and ISIN columns? What I have done so far is to use the pluck function from the purrr package:
StockWeights %>%
  pluck('SumPeriod') %>%
  select("EopBasHoldingValueAtMarketPrice")
Which only gives me the "EopBasHoldingValueAtMarketPrice" column:
'data.frame': 3 obs. of 1 variable:
$ EopBasHoldingValueAtMarketPrice: num 7599626 5163591 7159142
But I can't find a way to get these three values together with the corresponding "Id" and "ISIN" from the original data.frame. Does anyone have an idea how to achieve this? Sorry for not providing a reproducible example. The data I am looking at comes from an API call, and I am having some trouble recreating it manually. The end goal is to get a data.frame that looks like:
df = data.frame(
  Id = c("159347", "161863", "22646"),
  ISIN = c("DK0061156759", "DK0061533726", "DK0060681468"),
  BopBasHoldingValueAtMarketPrice = c(7592267, 5135961, 7166816)
)
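A data.frame stored as a column can be reached with $ like any other column, so one possible approach is to pick the inner column directly and bind it to Id and ISIN. The following is only a base R sketch: StockWeights here is a hand-built stand-in with values copied from the str() output above, and the real object from the API call has more columns.

```r
# Hand-built stand-in for StockWeights (assumption: the real object has
# the same nesting, i.e. SumPeriod is a data.frame stored as a column)
StockWeights <- data.frame(
  Id   = c("159347", "161863", "22646"),
  ISIN = c("DK0061156759", "DK0061533726", "DK0060681468"),
  stringsAsFactors = FALSE
)
StockWeights$SumPeriod <- data.frame(
  BopDate = rep("2022-02-28T00:00:00", 3),
  BopBasHoldingValueAtMarketPrice = c(7592267, 5135961, 7166816),
  stringsAsFactors = FALSE
)

# Keep Id and ISIN and pull one column out of the nested data.frame
result <- data.frame(
  StockWeights[c("Id", "ISIN")],
  BopBasHoldingValueAtMarketPrice =
    StockWeights$SumPeriod$BopBasHoldingValueAtMarketPrice
)
str(result)
```

Because the rows of SumPeriod line up with the rows of the outer data.frame, no join key is needed.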

Remove data.table column labels/attributes (imported data)

This seems like a rudimentary task, but I'm having trouble removing data.table column labels/attributes from data imported from SAS.
My data.table DT is an import from a SAS file. Not all columns have labels, and some have two attributes (a label and a SAS format). I can't share the data (and can't easily replicate it), but here is a partial structure of DT:
> str(DT)
Classes ‘data.table’ and 'data.frame': 96293709 obs. of 150 variables:
$ Col1 : chr "Y" "N" "N" "N" ...
..- attr(*, "label")= chr "some label, description goes on and on"
$ Col2 : chr "N" "N" "N" "Y" ...
..- attr(*, "label")= chr "some label 2, description goes on and on"
$ Col3 : Date, format: "1994-08-07" "1994-08-07" "1994-08-07" "1994-08-07" ...
$ Col4 : chr "M" "M" "M" "M" ...
..- attr(*, "label")= chr "some label 3, description goes on and on"
..- attr(*, "format.sas")= chr "$"
$ Col5 : num 1e+07 1e+07 1e+07 1e+07 1e+07 ...
..- attr(*, "label")= chr "some label 4, description goes on and on"
$ Col6 : Date, format: "2000-01-01" "2005-03-10" "2013-06-01" "2015-06-01" ...
I'm trying to remove all attributes, because when I use certain columns to create new ones, these attributes are inherited by the new columns, which is annoying and undesired (it prevents me from merging with another data.table that lacks the labels). I thought the only way to prevent that is to remove the attributes (labels) from the original data DT.
I tried
> setattr(DT, "label", NULL)
> setattr(DT, "format.sas", NULL)
I get no error, but nothing happens: when I check the structure afterwards, it is the same as before, and the labels/attributes have not been removed.
What am I doing wrong here?
I know I have to use setattr somehow, as I don't want DT to be copied (it's rather large).
I think the attributes are stored on each column, not on the data.table as a whole. Compare attributes(DT) with lapply(DT, attributes) to see if this is the case. Here's an example which I think replicates what you're trying to do:
DT <- data.table(a=1:3,b=2:4)
attr(DT$a, "label") <- "a label"
attr(DT$b, "label") <- "a label"
attr(DT$b, "sas format") <- "ddmmyy10."
str(DT)
#Classes ‘data.table’ and 'data.frame': 3 obs. of 2 variables:
# $ a: atomic 1 2 3
# ..- attr(*, "label")= chr "a label"
# $ b: atomic 2 3 4
# ..- attr(*, "label")= chr "a label"
# ..- attr(*, "sas format")= chr "ddmmyy10."
# - attr(*, ".internal.selfref")=<externalptr>
DT[, names(DT) := lapply(.SD, setattr, "label", NULL)]
DT[, names(DT) := lapply(.SD, setattr, "sas format", NULL)]
str(DT)
#Classes ‘data.table’ and 'data.frame': 3 obs. of 2 variables:
# $ a: int 1 2 3
# $ b: int 2 3 4
# - attr(*, ".internal.selfref")=<externalptr>
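A variation on the same idea, sketched on the same toy table: loop over the columns and call setattr() on each one. setattr() changes attributes by reference, so this should avoid copying the table. The attribute names "label" and "sas format" are the ones from the toy example; for the real data the second one would presumably be "format.sas".

```r
library(data.table)

# Toy table with the same kind of column attributes as in the example above
DT <- data.table(a = 1:3, b = 2:4)
attr(DT$a, "label") <- "a label"
attr(DT$b, "label") <- "a label"
attr(DT$b, "sas format") <- "ddmmyy10."

# Strip the unwanted attributes column by column, by reference.
# Setting an attribute that is not present to NULL is harmless.
for (col in names(DT)) {
  for (att in c("label", "sas format")) {
    setattr(DT[[col]], att, NULL)
  }
}

lapply(DT, attributes)  # every element should now be NULL
```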

merge data.frame with multidimensional list

I have a data frame 'QARef' with 25 variables. There are only 5 unique jobs (3rd column) but lots of rows per job:
str(QARef)
'data.frame': 648 obs. of 25 variables:
I'm using tapply to generate mean values across all 5 jobs for certain rows:
RefMean <- tapply(QARef$MTN,
                  list(QARef$Target_CD, QARef$Feature_Type, QARef$Orientation, QARef$Contrast, QARef$Prox),
                  FUN = mean, trim = 0, na.rm = TRUE)
and I get something that I hope is called a multidimensional list:
str(RefMean)
num [1:17, 1:2, 1:2, 1:2, 1:2] 34.1 34.2 25.2 28.9 29.2 ...
- attr(*, "dimnames")=List of 5
..$ : chr [1:17] "55" "60" "70" "80" ...
..$ : chr [1:2] "LINE" "SQUARE"
..$ : chr [1:2] "X" "Y"
..$ : chr [1:2] "CLEAR" "DARK"
..$ : chr [1:2] "1:1" "Iso"
What I want to do is add a column to QARef which contains the correct RefMean value for each row depending on a match between values in columns of QARef and dimnames of RefMean. E.g. QARef column Feature_Type=="LINE" should match the dimname "LINE" etc.
Any hint how to do this or where to find the answer would be highly appreciated.
I think I found a solution. It is probably not elegant, but it works:
RefMean <- data.frame(tapply(QARef$MTN,
                             paste(QARef$Target_CD, QARef$Feature_Type, QARef$Orientation,
                                   QARef$Contrast, QARef$Prox, QARef$Measurement_Type),
                             FUN = mean, trim = 0, na.rm = TRUE))
colnames(RefMean) <- c("MTN_Ref")
Ident <- do.call(rbind, strsplit(rownames(RefMean), " "))
RefMean["Target_CD"] <- Ident[,1]
RefMean["Feature_Type"] <- Ident[,2]
RefMean["Orientation"] <- Ident[,3]
RefMean["Contrast"] <- Ident[,4]
RefMean["Prox"] <- Ident[,5]
RefMean["Measurement_Type"] <- Ident[,6]
QA4 <- merge(QARef, RefMean,
             by = c("Target_CD", "Feature_Type", "Orientation", "Contrast", "Prox", "Measurement_Type"),
             all.x = TRUE, sort = FALSE)
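An alternative that avoids pasting and splitting keys: the array returned by tapply can be indexed with a character matrix that has one column per dimension, and each row of that matrix is matched against the dimnames. The sketch below uses a small made-up data frame with only three grouping columns (all values are invented for illustration), so it shows the lookup technique rather than the full QARef case.

```r
# Made-up stand-in for QARef with three grouping columns
QARef <- data.frame(
  Target_CD    = c(55, 55, 60, 60, 55, 60),
  Feature_Type = c("LINE", "LINE", "LINE", "SQUARE", "SQUARE", "SQUARE"),
  Orientation  = c("X", "X", "Y", "Y", "X", "Y"),
  MTN          = c(34.1, 34.3, 25.2, 28.9, 29.2, 30.1),
  stringsAsFactors = FALSE
)
grp <- c("Target_CD", "Feature_Type", "Orientation")

# Group means as a 3-dimensional array with dimnames
RefMean <- tapply(QARef$MTN, QARef[grp], FUN = mean, na.rm = TRUE)

# Character index matrix: one row per QARef row, one column per dimension
idx <- do.call(cbind, lapply(QARef[grp], as.character))

# Each row of idx selects the matching cell of RefMean
QARef$MTN_Ref <- RefMean[idx]
QARef
```

If only the per-row group mean is needed and the array itself is not, ave(QARef$MTN, QARef$Target_CD, QARef$Feature_Type, QARef$Orientation) should give the same column in one step.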

Sorting several dates by one observation

I am at a loss! I am trying to sort my data by business_id. Each id has several dates associated with it. I am trying to create a new variable that shows the time in days between the first and last date associated with a business_id, such that:
row.names business_id Days
1 x8453 DxUn-ukNL27GOuwjnFGFKA 876
The data currently is structured as:
row.names date business_id
1 X27038 2012-04-21 FV0BkoGOd3Yu_eJnXY15ZA
2 X60951 2012-05-14 Trar_9cFAj6wXiXfKfEqZA
3 X60462 2011-10-05 DxUn-ukNL27GOuwjnFGFKA
4 X2078 2010-12-19 PlcCjELzSI3SqX7mPF5cCw
5 X166883 2011-09-29 pF7uRzygyZsltbmVpjIyvw
6 X177828 2010-09-19 XkNQVTkCEzBrq7OlRHI11Q
7 X128628 2012-05-05 6TWRuHn24DL6vnW8Uyu4Vw
8 X202882 2011-12-10 Xo9Im4LmIhQrzJcO4R3ZbA
9 X64569 2012-02-07 Z67obTep38V9HMtA10yu5A
10 X14667 2009-07-18 xsSnuGCCJD4OgWnOZ0zB4A
11 X17432 2012-08-11 XkNQVTkCEzBrq7OlRHI11Q
Thanks in advance!
Update:
str(data)
'data.frame': 2299 obs. of 2 variables:
$ date :List of 2299
..$ X2736 : chr "2012-05-29"
..$ X160403: chr "2011-08-29"
..$ X19897 : chr "2010-09-27"
..$ X44519 : chr "2012-05-22"
..$ X75910 : chr "2012-10-22"
..$ X13052 : chr "2010-07-14"
$ business_id:List of 2299
..$ X2736 : chr "EFJAVVBQQqftuqY5Wb3WtQ"
..$ X160403: chr "YDlk9buwF8JQE3JgQgraOw"
..$ X19897 : chr "sc1UacpE3cVNJueMdXiCyA"
..$ X44519 : chr "VY_tvNUCCXGXQeSvJl757Q"
..$ X75910 : chr "fowXs9zAM0TQhSfSkPeVuw"
..$ X13052 : chr "xM5F0cLAlKWoB8rOgt5ZOw"
..$ X87807 : chr "nLL0sjLdZ13YdvhXKyss7A"
Edit now that the OP has provided the structure:
Your data is structured quite oddly. A usual structure in R is a data.frame, which is technically a list of vectors where the vectors are all the same length. In your case, you have a list of two (named) lists.
Store the names somewhere else for the time being:
old.names <- names(x[[1]])
Then turn the data into an ordinary data.frame, using the handy unlist() function:
x$date <- unlist(x$date)
x$business_id <- unlist(x$business_id)
Use str(x) to see the difference. The names can go back in now, and it's also a good time to turn your "date" column from a character into a proper date, and sort by date order.
x$old.names <- old.names
x$date <- as.POSIXct(x$date)
x <- x[order(x$date), ]
My original answer should now work.
Original answer:
Like agstudy I'd use the plyr package, but if you have the "date" column in a date format and want to keep it that way, you could try:
require(plyr)
ddply(x, "business_id", summarise,
      duration = difftime(max(date), min(date), units = "days"),
      old.names = old.names[1])
This also gives you flexibility on the units.
With your example data sorted by date in ascending order (dat <- dat[order(dat$date), ]), old.names[1] gives you the name of the earliest row and old.names[length(old.names)] would give you the name of the most recent row, but I don't know whether that is reliable given the magic inside ddply.
Further edit:
I only showed how to handle the names because they're in your example. They look as though they were originally column headers from imported data, and R has prepended "X" to them because names aren't allowed to begin with numerals.
Using the plyr package:
ddply(dat, .(business_id), function(x)
  if (length(x$date) > 1)
    diff(range(as.POSIXct(x$date)))
  else 0)
business_id V1
1 6TWRuHn24DL6vnW8Uyu4Vw 0
2 DxUn-ukNL27GOuwjnFGFKA 0
3 FV0BkoGOd3Yu_eJnXY15ZA 0
4 pF7uRzygyZsltbmVpjIyvw 0
5 PlcCjELzSI3SqX7mPF5cCw 0
6 Trar_9cFAj6wXiXfKfEqZA 0
7 XkNQVTkCEzBrq7OlRHI11Q 692
8 Xo9Im4LmIhQrzJcO4R3ZbA 0
9 xsSnuGCCJD4OgWnOZ0zB4A 0
10 Z67obTep38V9HMtA10yu5A 0
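For comparison, the same per-id duration can be sketched in base R with aggregate() instead of plyr. This uses four rows copied from the question's sample data and converts the dates with as.Date rather than as.POSIXct, which is a substitution on my part to keep the differences in whole days.

```r
# Four rows taken from the sample data in the question
dat <- data.frame(
  date = c("2012-04-21", "2011-10-05", "2010-09-19", "2012-08-11"),
  business_id = c("FV0BkoGOd3Yu_eJnXY15ZA", "DxUn-ukNL27GOuwjnFGFKA",
                  "XkNQVTkCEzBrq7OlRHI11Q", "XkNQVTkCEzBrq7OlRHI11Q"),
  stringsAsFactors = FALSE
)
dat$date <- as.Date(dat$date)

# Days between the first and last date of each business_id
res <- aggregate(date ~ business_id, data = dat,
                 FUN = function(d) as.numeric(difftime(max(d), min(d), units = "days")))
names(res)[names(res) == "date"] <- "Days"
res  # XkNQVTkCEzBrq7OlRHI11Q gets 692; ids with a single date get 0
```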

How to extract attribute values from a bystats object and place them in variables

I'm using the bystats function from the Hmisc package in R. How can I extract attribute values and place them into variables? For example, I want to calculate the mean and SD for variable aaf and put them in a data frame or matrix.
t <- with(d.aaf, bystats(y = aaf, plot_bid, fun = function(x) {
  c(Mean = round(mean(x), digits = 2), SD = round(sd(x), digits = 2))
}))
> str(t)
bystats [1:121, 1:3] 5 5 5 5 5 4 5 5 3 4 ...
- attr(*, "dimnames")=List of 2
..$ : chr [1:121] "P00000000006001288020278" "P00000000006001288085814"
"P00000000006001288151350" "P00000000006001288216886" ...
..$ : chr [1:3] "N" "Mean" "SD"
- attr(*, "heading")= chr "function(x) { c(Mean = round(mean(x),digits=2),
SD = round(sd(x),digits=2)) }
of aaf by plot_bid"
- attr(*, "byvarnames")= chr "plot_bid"
The way I'm doing it is by first converting "t" into a dataframe, which I do not think is very efficient.
Thanks for your suggestions.
You could use ddply from the plyr package which outputs directly to a data frame.
library(plyr)
t <- ddply(d.aaf, "plot_bid", summarise, mean = round(mean(aaf), 2), SD = round(sd(aaf), 2))
SD <- t$SD
mean <- t$mean
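If you would rather keep the bystats result, the str() output in the question suggests it is a numeric matrix with dimnames, so the statistics can probably be pulled out by column name. The sketch below does not call Hmisc: bs is a hypothetical stand-in built by hand with the same shape (rows are groups, columns are N, Mean and SD), and the data values are invented. A real bystats object may carry extra rows or attributes.

```r
# Invented data and a hand-built stand-in for the bystats result
d.aaf <- data.frame(plot_bid = rep(c("P1", "P2"), each = 3),
                    aaf = c(1, 2, 3, 4, 6, 8),
                    stringsAsFactors = FALSE)
m <- do.call(rbind, lapply(split(d.aaf$aaf, d.aaf$plot_bid), function(x)
  c(N = length(x), Mean = round(mean(x), 2), SD = round(sd(x), 2))))
bs <- structure(m, class = "bystats")  # mimic the class seen in str(t)

# Drop the class, then index the plain matrix by column name
stats <- unclass(bs)
res <- data.frame(plot_bid = rownames(stats),
                  Mean = stats[, "Mean"],
                  SD = stats[, "SD"],
                  row.names = NULL,
                  stringsAsFactors = FALSE)
res
```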
