Remove data.table column labels/attributes (imported data) - r

This seems like a rudimentary task, but I'm having trouble removing data.table column labels/attributes from data imported from SAS.
My data.table DT is an import from a SAS file. Not all columns have labels, and some have two labels. I can't share my data as it's imported (so I can't replicate it), but here is a partial structure of DT:
> str(DT)
Classes ‘data.table’ and 'data.frame': 96293709 obs. of 150 variables:
$ Col1 : chr "Y" "N" "N" "N" ...
..- attr(*, "label")= chr "some label, description goes on and on"
$ Col2 : chr "N" "N" "N" "Y" ...
..- attr(*, "label")= chr "some label 2, description goes on and on"
$ Col3 : Date, format: "1994-08-07" "1994-08-07" "1994-08-07" "1994-08-07" ...
$ Col4 : chr "M" "M" "M" "M" ...
..- attr(*, "label")= chr "some label 3, description goes on and on"
..- attr(*, "format.sas")= chr "$"
$ Col5 : num 1e+07 1e+07 1e+07 1e+07 1e+07 ...
..- attr(*, "label")= chr "some label 4, description goes on and on"
$ Col6 : Date, format: "2000-01-01" "2005-03-10" "2013-06-01" "2015-06-01" ...
I'm trying to remove all attributes because, when I use certain columns to create new ones, these attributes are inherited by the new columns, which is annoying and undesired (it prevents me from merging with another data.table that has no labels). I thought the only way to prevent that is to remove the attributes (labels) from the original data DT.
I tried
> setattr(DT, "label", NULL)
> setattr(DT, "format.sas", NULL)
I get no error, but nothing happens: when I check the structure afterwards, it is the same as before and the labels/attributes have not been removed.
What am I doing wrong here?
I know I have to use setattr somehow, as I don't want DT to be copied (it's rather large).

I think the attributes are stored on each column, not on the data.table as a whole, so setattr(DT, "label", NULL) only touches a table-level attribute that isn't there. Compare attributes(DT) with lapply(DT, attributes) to see if this is the case. Here's an example which I think replicates what you're trying to do:
DT <- data.table(a=1:3,b=2:4)
attr(DT$a, "label") <- "a label"
attr(DT$b, "label") <- "a label"
attr(DT$b, "sas format") <- "ddmmyy10."
str(DT)
#Classes ‘data.table’ and 'data.frame': 3 obs. of 2 variables:
# $ a: atomic 1 2 3
# ..- attr(*, "label")= chr "a label"
# $ b: atomic 2 3 4
# ..- attr(*, "label")= chr "a label"
# ..- attr(*, "sas format")= chr "ddmmyy10."
# - attr(*, ".internal.selfref")=<externalptr>
DT[, names(DT) := lapply(.SD, setattr, "label", NULL)]
DT[, names(DT) := lapply(.SD, setattr, "sas format", NULL)]
str(DT)
#Classes ‘data.table’ and 'data.frame': 3 obs. of 2 variables:
# $ a: int 1 2 3
# $ b: int 2 3 4
# - attr(*, ".internal.selfref")=<externalptr>
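An alternative sketch, assuming the same data.table package: loop over the columns and call setattr() on each column vector directly. The name drop_attrs is made up for this illustration, and the attribute names in it are taken from the str() output in the question, so adjust them to whatever your import actually carries.

```r
library(data.table)

# Small stand-in for the imported table.
DT <- data.table(a = 1:3, b = 2:4)
setattr(DT$a, "label", "a label")
setattr(DT$b, "label", "another label")
setattr(DT$b, "format.sas", "$")

# Attributes to strip (assumed names; extend as needed).
drop_attrs <- c("label", "format.sas")

for (col in names(DT)) {
  for (a in drop_attrs) {
    # setattr() changes the column vector in place, so DT is not copied.
    setattr(DT[[col]], a, NULL)
  }
}

# Each column should now report NULL attributes.
lapply(DT, attributes)
```

Setting an attribute that a column doesn't have to NULL is harmless, so the loop doesn't need to check which columns carry which attribute.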


Automatic num-to-char conversion in R using apply and data.table

I'd like to calculate the mean difference of two columns of my data.frame, grouping by a third.
apply doesn't even let me compute any arithmetic operation without explicit conversion of already-numeric columns.
data.table makes the operation and grouping but returns a character vector.
dplyr syntax returns numeric values correctly.
Why does apply() convert numeric vectors to character? Why does data.table convert the results to char?
library(dplyr); library(data.table)
a <- letters[c(1,1:9)]
b <- (1:10)/10
c <- sin(1:10)
dat <- data.frame(a,b,c)
table(dat$a)
typeof(dat$b) #double
dat$bb <- apply(dat, 1,function(x) x["b"])
typeof(dat$bb) #character
dat$bb <- apply(dat, 1,function(x) x["b"]-x["c"])
# Error in x["b"] - x["c"] : non-numeric argument to binary operator
tidydat <- dat %>% group_by(a) %>% summarise(diffr = mean(b-c))
typeof(tidydat$diffr) #double
dt <- data.table(dat)
dt[,bb:=mean(b-c), by=a]
typeof(dt$bb) #character
> dt$bb
[1] "-0.725384205816789" "-0.725384205816789" "0.158879991940133" "1.15680249530793" "1.45892427466314"
[6] "0.879415498198926" "0.0430134012812109" "-0.189358246623382" "0.487881514758243" "1.54402111088937"
> tidydat$diffr
[1] -0.7253842 0.1588800 1.1568025 1.4589243 0.8794155 0.0430134 -0.1893582 0.4878815 1.5440211
EDIT: the data.table part is untrue; I was just modifying by reference an already existing character column (thanks to @akrun).
apply converts the dataset from a data.frame to a matrix:
> is.matrix(apply(dat, 1, I))
[1] TRUE
A matrix can hold only a single type, so if there is any character element, the whole data is converted to character. Instead, use lapply (if the operation is column-wise), or subset the numeric columns before calling apply:
out <- apply(dat[-1], 1,function(x) x["b"]-x["c"])
Output:
> out
[1] -0.7414710 -0.7092974 0.1588800 1.1568025 1.4589243 0.8794155 0.0430134 -0.1893582 0.4878815 1.5440211
> str(out)
num [1:10] -0.741 -0.709 0.159 1.157 1.459 ...
The reason for the change in behavior is that a vector can have only a single type, whereas in a data.frame/data.table/tibble the columns, not the rows, are the list elements; that is, the class is specific to a column, not to a row.
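A minimal base R sketch of the same point, reusing the data from the question. It shows the coercion directly via as.matrix() (which apply() calls on a data.frame) and then computes the difference and its group mean without going through rows at all. The column name diff is just an illustrative choice.

```r
a <- letters[c(1, 1:9)]
b <- (1:10) / 10
c <- sin(1:10)
dat <- data.frame(a, b, c)

# One non-numeric column forces every cell of the matrix to character.
typeof(as.matrix(dat))      # "character"
typeof(as.matrix(dat[-1]))  # "double"

# Arithmetic on columns is vectorised, so no apply() is needed.
dat$diff <- dat$b - dat$c

# Mean of the difference within each level of a, repeated on every row.
dat$bb <- ave(dat$diff, dat$a)
typeof(dat$bb)              # "double"
```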
Regarding the data.table case
> library(data.table)
> dt <- as.data.table(dat)
> dt$bb <- NULL # in case if the character column was already created
> dt[,bb:=mean(b-c), by=a]
> str(dt)
Classes ‘data.table’ and 'data.frame': 10 obs. of 4 variables:
$ a : chr "A" "A" "B" "C" ...
$ b : num 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
$ c : num 0.841 0.909 0.141 -0.757 -0.959 ...
$ bb: num -0.725 -0.725 0.159 1.157 0.704 ...
I think @akrun has provided sufficient information for understanding the reason behind this. You can also try the code below to see what's going on when you use apply by rows:
> apply(dat, 1, str)
Named chr [1:3] "a" "0.1" " 0.8414710"
- attr(*, "names")= chr [1:3] "a" "b" "c"
Named chr [1:3] "a" "0.2" " 0.9092974"
- attr(*, "names")= chr [1:3] "a" "b" "c"
Named chr [1:3] "b" "0.3" " 0.1411200"
- attr(*, "names")= chr [1:3] "a" "b" "c"
Named chr [1:3] "c" "0.4" "-0.7568025"
- attr(*, "names")= chr [1:3] "a" "b" "c"
Named chr [1:3] "d" "0.5" "-0.9589243"
- attr(*, "names")= chr [1:3] "a" "b" "c"
Named chr [1:3] "e" "0.6" "-0.2794155"
- attr(*, "names")= chr [1:3] "a" "b" "c"
Named chr [1:3] "f" "0.7" " 0.6569866"
- attr(*, "names")= chr [1:3] "a" "b" "c"
Named chr [1:3] "g" "0.8" " 0.9893582"
- attr(*, "names")= chr [1:3] "a" "b" "c"
Named chr [1:3] "h" "0.9" " 0.4121185"
- attr(*, "names")= chr [1:3] "a" "b" "c"
Named chr [1:3] "i" "1.0" "-0.5440211"
- attr(*, "names")= chr [1:3] "a" "b" "c"
NULL
As you can see, when you run apply(dat, 1, FUN = ...), the data passed to FUN is coerced to a character vector; it is no longer a data.frame.

Why can't an sf object use all data.table methods in R?

I am learning sf in R. Since I like data.table very much, I thought I could use both. However, it seems that an sf object derived from a data.table cannot use data.table methods any more. Here is an example:
First I generate a very simple data.table and turn it into an sf object. So far so good.
> dfr <- data.table(id = c("hwy1", "hwy2"),
+ cars_per_hour = c(78, 22),
+ lat = c(1, 2),
+ lon = c(3, 4))
> my_sf <- st_as_sf(dfr , coords = c("lon", "lat"))
Then I check the structure of the my_sf. It is an sf object, a data.table and a data.frame.
> str(my_sf)
Classes ‘sf’, ‘data.table’ and 'data.frame': 2 obs. of 3 variables:
$ id : chr "hwy1" "hwy2"
$ cars_per_hour: num 78 22
$ geometry :sfc_POINT of length 2; first list element: 'XY' num 3 1
- attr(*, "sf_column")= chr "geometry"
- attr(*, "agr")= Factor w/ 3 levels "constant","aggregate",..: NA NA
..- attr(*, "names")= chr "id" "cars_per_hour"
Then I tried an arbitrary function, unique, and it does not work. In fact, my_sf does not behave as a data.table at all.
> my_sf[, unique(id)]
Error in unique(id) : object 'id' not found
Does anyone know the reason for it? Is it not possible to use data.table for sf?
My guess is that st_as_sf has destroyed the .internal.selfref attribute, turning the data.table back into a data.frame even though the class name has been preserved.
> str(dfr)
#Classes ‘data.table’ and 'data.frame': 2 obs. of 4 variables:
#$ id : chr "hwy1" "hwy2"
#$ cars_per_hour: num 78 22
#$ lat : num 1 2
#$ lon : num 3 4
#- attr(*, ".internal.selfref")=<externalptr>
setDT(my_sf) might be enough to turn the data.frame back into a data.table.
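One further possibility, offered as an assumption rather than a confirmed diagnosis: the str() output in the question lists 'sf' before 'data.table' in the class vector, and S3 dispatch tries methods in that order. If a `[` method exists for the first class, it handles my_sf[, unique(id)] and data.table's `[` never sees the call. The toy sketch below uses two made-up classes, first_cls and second_cls, as stand-ins to show that ordering effect in base R only.

```r
# Hypothetical classes standing in for 'sf' and 'data.table'.
`[.first_cls`  <- function(x, ...) "handled by first_cls"
`[.second_cls` <- function(x, ...) "handled by second_cls"

obj <- structure(list(id = c("hwy1", "hwy2")),
                 class = c("first_cls", "second_cls"))

res_before <- obj[1]        # the method for the first class wins

class(obj) <- "second_cls"  # drop the leading class
res_after <- obj[1]         # now the second method is used

res_before
res_after
```

If that is what is happening here, something like unique(my_sf$id) sidesteps `[` entirely, since it only extracts the column.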

How can I extract row names after PCA implementation?

I am reducing the dimensionality of a test data frame (30 rows and 750 columns) with PCA (using the FactoMineR library) as follows:
pca_base <- PCA(test, ncp=5, graph=T)
I used the function dimdesc() [in FactoMineR], for dimension description, to identify the variables most significantly associated with a given principal component, as follows:
pca_dim<-dimdesc(pca_base)
pca_dim is a list of length 3.
My question is: how can I extract the row names of pca_dim from list[1] and list[2]?
I tried this code:
#to select dim 1,2 use axes
pca_dim<-dimdesc(pca_base,axes = c(1,2))
rownames(pca_dim[[1]])
But the result was NULL.
For instance, I'll use the demo data set decathlon2 from the factoextra package:
It contains 27 individuals (athletes) described by 13 variables.
library(FactoMineR) # provides PCA() and dimdesc()
library(factoextra) # provides the decathlon2 data set
data(decathlon2)
decathlon2.active <- decathlon2[1:23, 1:10]
res.pca <- PCA(decathlon2.active,scale.unit = TRUE, graph = FALSE)
res.desc <- dimdesc(res.pca, axes = c(1,2))
Thanks!
When you have this kind of issue accessing information in an R object, the best way to solve it is to start by examining the output of the function str.
str(pca_dim)
#List of 2
# $ Dim.1:List of 1
# ..$ quanti: num [1:8, 1:2] 0.794 0.743 0.734 0.61 0.428 ...
# .. ..- attr(*, "dimnames")=List of 2
# .. .. ..$ : chr [1:8] "Long.jump" "Discus" "Shot.put" "High.jump" ...
# .. .. ..$ : chr [1:2] "correlation" "p.value"
# $ Dim.2:List of 1
# ..$ quanti: num [1:3, 1:2] 8.07e-01 7.84e-01 -4.65e-01 3.21e-06 9.38e-06 ...
# .. ..- attr(*, "dimnames")=List of 2
# .. .. ..$ : chr [1:3] "Pole.vault" "X1500m" "High.jump"
# .. .. ..$ : chr [1:2] "correlation" "p.value"
So the structure of the object is simple, it is a list of two lists. Each of these sublists has just one member, a matrix with the dimnames attribute set.
So you can use standard accessor functions to get those attributes.
rownames(pca_dim$Dim.1$quanti)
#[1] "Long.jump" "Discus" "Shot.put" "High.jump" "Javeline"
#[6] "X400m" "X110m.hurdle" "X100m"
rownames(pca_dim$Dim.2$quanti)
#[1] "Pole.vault" "X1500m" "High.jump"
Alternatively, you can convert each element of the dimdesc result to a data.frame, like this:
> rownames(data.frame(res.desc[1]))
[1] "Long.jump" "Discus" "Shot.put" "High.jump" "Javeline" "X400m" "X110m.hurdle"
[8] "X100m"
> rownames(data.frame(res.desc[2]))
[1] "Pole.vault" "X1500m" "High.jump"
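To pull the names for all dimensions in one go, a small sketch follows. It doesn't run the PCA itself; it builds a mock list with the same shape as the str() output shown above (the helper make_quanti and the NA values are placeholders for illustration, not real results). It also shows why rownames(pca_dim[[1]]) came back NULL: that element is a list, and the row names live on the quanti matrix inside it.

```r
# Mock object mirroring the structure printed by str(); values are
# NA placeholders, not real correlations or p-values.
make_quanti <- function(vars) {
  matrix(NA_real_, nrow = length(vars), ncol = 2,
         dimnames = list(vars, c("correlation", "p.value")))
}
res.desc <- list(
  Dim.1 = list(quanti = make_quanti(c("Long.jump", "Discus", "Shot.put"))),
  Dim.2 = list(quanti = make_quanti(c("Pole.vault", "X1500m", "High.jump")))
)

# The outer element is a list, so it has no row names of its own.
rownames(res.desc[[1]])           # NULL

# Row names of the 'quanti' matrix for every dimension at once.
var_names <- lapply(res.desc, function(d) rownames(d$quanti))
var_names$Dim.2
```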

merge data.frame with multidimensional list

I have a data frame 'QARef' with 25 variables. There are only 5 unique jobs (3rd column) but lots of rows per job:
str(QARef)
'data.frame': 648 obs. of 25 variables:
I'm using tapply to generate mean values across all 5 jobs for certain rows:
RefMean <- tapply(QARef$MTN,
list(QARef$Target_CD, QARef$Feature_Type, QARef$Orientation, QARef$Contrast, QARef$Prox),
FUN=mean, trim=0, na.rm=TRUE)
and I get something that I hope is called a multidimensional list (str reports it as a 5-dimensional numeric array):
str(RefMean)
num [1:17, 1:2, 1:2, 1:2, 1:2] 34.1 34.2 25.2 28.9 29.2 ...
- attr(*, "dimnames")=List of 5
..$ : chr [1:17] "55" "60" "70" "80" ...
..$ : chr [1:2] "LINE" "SQUARE"
..$ : chr [1:2] "X" "Y"
..$ : chr [1:2] "CLEAR" "DARK"
..$ : chr [1:2] "1:1" "Iso"
What I want to do is add a column to QARef which contains the correct RefMean value for each row depending on a match between values in columns of QARef and dimnames of RefMean. E.g. QARef column Feature_Type=="LINE" should match the dimname "LINE" etc.
Any hint how to do this or where to find the answer would be highly appreciated.
I think I found a solution. It's probably not elegant, but it works:
RefMean <- data.frame(tapply(QARef$MTN,
  paste(QARef$Target_CD, QARef$Feature_Type, QARef$Orientation,
        QARef$Contrast, QARef$Prox, QARef$Measurement_Type),
  FUN=mean, trim=0, na.rm=TRUE))
colnames(RefMean) <- c("MTN_Ref")
Ident <- do.call(rbind, strsplit(rownames(RefMean), " "))
RefMean["Target_CD"] <- Ident[,1]
RefMean["Feature_Type"] <- Ident[,2]
RefMean["Orientation"] <- Ident[,3]
RefMean["Contrast"] <- Ident[,4]
RefMean["Prox"] <- Ident[,5]
RefMean["Measurement_Type"] <- Ident[,6]
QA4 <- merge(QARef, RefMean,
  by=c("Target_CD","Feature_Type","Orientation","Contrast","Prox","Measurement_Type"),
  all.x=TRUE, sort=FALSE)
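A shorter route to the same kind of result, sketched in base R: ave() computes a statistic per group and returns it aligned with the original rows, so no paste/strsplit/merge round trip is needed. The toy QARef below is made up for illustration and uses only two of the grouping columns; with the real data you would pass all the grouping columns in the same way.

```r
# Toy stand-in for QARef; values are invented for illustration only.
QARef <- data.frame(
  Target_CD    = c("55", "55", "60", "60", "55"),
  Feature_Type = c("LINE", "LINE", "LINE", "SQUARE", "LINE"),
  MTN          = c(10, 20, 30, 40, NA)
)

# Group mean of MTN, written back on every row of its group.
QARef$MTN_Ref <- ave(
  QARef$MTN, QARef$Target_CD, QARef$Feature_Type,
  FUN = function(x) mean(x, na.rm = TRUE)
)

QARef
```

Because the result comes back in row order, it can be assigned straight to a new column.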

Import contingency table (.csv-format) as "table" rather than "data.frame" in R

I am working with the (I think) very cool Titanic data that is publicly available.
There are two main ways to import it into R:
(1) You can either use the built-in dataset Titanic (library(datasets)) or
(2) you can download it as .csv-file, e.g. here.
Now, the data is aggregated frequency data. I would like to convert the multi-dimensional contingency table into an individual-level data frame.
PROBLEM: If I use the built-in dataset, this is no problem; if I use the imported .csv-file, however, it doesn't work. This is the error message I get:
Error in rep(1:nrow(tablevars), counts) : invalid 'times' argument
In addition: Warning message:
In expand.table(Titanic.table) : NAs introduced by coercion
Why? And what am I doing wrong? Many thanks.
R CODE
#required packages
library(datasets)
library(epitools)
#(1) Expansion of built-in data set
data(Titanic)
Titanic.raw <- Titanic
class(Titanic.raw) # data is stored as "table"
Titanic.expand <- expand.table(Titanic.raw)
#(2) Expansion of imported data set
Titanic.raw <- read.table("Titanic.csv", header=TRUE, sep=",", row.names=1)
class(Titanic.raw) #data is stored as "data.frame"
Titanic.table <- as.table(as.matrix(Titanic.raw))
class(Titanic.table) #data is stored as "table"
Titanic.expand <- expand.table(Titanic.table)
I think you probably want xtabs. Watch out: the factor coding differs between the Titanic and Titanic.new objects. By default, factor levels are in lexicographic order, while two of the Titanic factors (Sex and Age) are not:
str(Titanic)
table [1:4, 1:2, 1:2, 1:2] 0 0 35 0 0 0 17 0 118 154 ...
- attr(*, "dimnames")=List of 4
..$ Class : chr [1:4] "1st" "2nd" "3rd" "Crew"
..$ Sex : chr [1:2] "Male" "Female"
..$ Age : chr [1:2] "Child" "Adult"
..$ Survived: chr [1:2] "No" "Yes"
Titanic.raw <- read.table("~/Downloads/Titanic.csv", header=TRUE, sep=",", row.names=1)
str( Titanic.new <-
xtabs( Freq ~ Class + Sex + Age +Survived, data=Titanic.raw))
xtabs [1:4, 1:2, 1:2, 1:2] 4 13 89 3 118 154 387 670 0 0 ...
- attr(*, "dimnames")=List of 4
..$ Class : chr [1:4] "1st" "2nd" "3rd" "Crew"
..$ Sex : chr [1:2] "Female" "Male"
..$ Age : chr [1:2] "Adult" "Child"
..$ Survived: chr [1:2] "No" "Yes"
- attr(*, "class")= chr [1:2] "xtabs" "table"
- attr(*, "call")= language xtabs(formula = Freq ~ Class + Sex + Age + Survived, data = Titanic.raw)
An 'xtabs'-object inherits from 'table'-class so you can use that expand.table function.
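As for why the original attempt failed, a likely reading (an assumption, since the CSV itself isn't shown) is that the file is in long format with Class, Sex, Age, Survived and Freq columns, as the xtabs formula implies. as.matrix() on such a data frame gives a character matrix, so the "counts" can't be used as repeat counts. For a long-format frame with a Freq column, the expansion can also be done in base R without epitools; the sketch below uses the built-in Titanic table, and the names titanic_df, idx and titanic_long are illustrative.

```r
# Built-in contingency table -> one row per cell, with a Freq column.
titanic_df <- as.data.frame(Titanic)

# Repeat each row Freq times, then drop the count column.
idx <- rep(seq_len(nrow(titanic_df)), times = titanic_df$Freq)
titanic_long <- titanic_df[idx, names(titanic_df) != "Freq"]
rownames(titanic_long) <- NULL

# One row per person: the row count equals the total frequency.
nrow(titanic_long) == sum(Titanic)
```

The same two lines should work on the imported Titanic.raw if it has the assumed column layout.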
