I'd like to calculate the mean difference of two columns of my data.frame, grouping by a third.
apply doesn't even let me compute any arithmetic operation without explicit conversion of already-numeric columns.
data.table performs the operation and the grouping but returns a character vector.
dplyr syntax returns numeric values correctly.
Why does apply() convert numeric vectors to character? Why does data.table convert the results to char?
library(dplyr); library(data.table)
a <- letters[c(1,1:9)]
b <- (1:10)/10
c <- sin(1:10)
dat <- data.frame(a,b,c)
table(dat$a)
typeof(dat$b) #double
dat$bb <- apply(dat, 1,function(x) x["b"])
typeof(dat$bb) #character
dat$bb <- apply(dat, 1,function(x) x["b"]-x["c"])
# Error in x["b"] - x["c"] : non-numeric argument to binary operator
tidydat <- dat %>% group_by(a) %>% summarise(diffr = mean(b-c))
typeof(tidydat$diffr) #double
dt <- data.table(dat)
dt[,bb:=mean(b-c), by=a]
typeof(dt$bb) #character
> dt$bb
[1] "-0.725384205816789" "-0.725384205816789" "0.158879991940133" "1.15680249530793" "1.45892427466314"
[6] "0.879415498198926" "0.0430134012812109" "-0.189358246623382" "0.487881514758243" "1.54402111088937"
> tidydat$diffr
[1] -0.7253842 0.1588800 1.1568025 1.4589243 0.8794155 0.0430134 -0.1893582 0.4878815 1.5440211
EDIT: the data.table part is not true. As #Akrun pointed out, I was just modifying by reference a character column that already existed.
apply first converts the dataset from a data.frame to a matrix:
> is.matrix(apply(dat, 1, I))
[1] TRUE
and a matrix can hold only a single type, i.e. if there is one character element, the whole matrix is converted to character. Instead, use lapply (if the operation is column-wise), or subset the numeric columns before calling apply:
out <- apply(dat[-1], 1,function(x) x["b"]-x["c"])
Output:
> out
[1] -0.7414710 -0.7092974 0.1588800 1.1568025 1.4589243 0.8794155 0.0430134 -0.1893582 0.4878815 1.5440211
> str(out)
num [1:10] -0.741 -0.709 0.159 1.157 1.459 ...
The reason for this behavior is that a vector can hold only a single type, and in a data.frame/data.table/tibble the columns, not the rows, are the list elements, i.e. the type is specific to a column, not to a row.
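A minimal sketch of the coercion described above, reusing the toy data from the question. It uses base R only; tapply is swapped in here as one base alternative for the grouped mean and is not something the answer itself proposed.

```r
# Toy data from the question
dat <- data.frame(a = letters[c(1, 1:9)], b = (1:10) / 10, c = sin(1:10))

# apply() works on as.matrix(dat); one character column makes every cell character
typeof(as.matrix(dat))        # "character"

# Without the character column the matrix stays numeric
typeof(as.matrix(dat[-1]))    # "double"

# Vectorised column arithmetic never builds a matrix, so the types are kept
row_diff <- dat$b - dat$c

# Base R grouped mean of the difference, one value per level of a
grp_mean <- tapply(dat$b - dat$c, dat$a, mean)
```

For a plain difference of two columns, the vectorised form is usually all that is needed; apply is only worth reaching for when the data is already a homogeneous matrix.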
Regarding the data.table case
> library(data.table)
> dt <- as.data.table(dat)
> dt$bb <- NULL # in case if the character column was already created
> dt[,bb:=mean(b-c), by=a]
> str(dt)
Classes ‘data.table’ and 'data.frame': 10 obs. of 4 variables:
$ a : chr "A" "A" "B" "C" ...
$ b : num 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
$ c : num 0.841 0.909 0.141 -0.757 -0.959 ...
$ bb: num -0.725 -0.725 0.159 1.157 0.704 ...
I think #akrun has provided sufficient information for understanding the reason behind this. You can run the code below to see what happens when you use apply by rows:
> apply(dat, 1, str)
Named chr [1:3] "a" "0.1" " 0.8414710"
- attr(*, "names")= chr [1:3] "a" "b" "c"
Named chr [1:3] "a" "0.2" " 0.9092974"
- attr(*, "names")= chr [1:3] "a" "b" "c"
Named chr [1:3] "b" "0.3" " 0.1411200"
- attr(*, "names")= chr [1:3] "a" "b" "c"
Named chr [1:3] "c" "0.4" "-0.7568025"
- attr(*, "names")= chr [1:3] "a" "b" "c"
Named chr [1:3] "d" "0.5" "-0.9589243"
- attr(*, "names")= chr [1:3] "a" "b" "c"
Named chr [1:3] "e" "0.6" "-0.2794155"
- attr(*, "names")= chr [1:3] "a" "b" "c"
Named chr [1:3] "f" "0.7" " 0.6569866"
- attr(*, "names")= chr [1:3] "a" "b" "c"
Named chr [1:3] "g" "0.8" " 0.9893582"
- attr(*, "names")= chr [1:3] "a" "b" "c"
Named chr [1:3] "h" "0.9" " 0.4121185"
- attr(*, "names")= chr [1:3] "a" "b" "c"
Named chr [1:3] "i" "1.0" "-0.5440211"
- attr(*, "names")= chr [1:3] "a" "b" "c"
NULL
As you can see, when you run apply(dat, 1, FUN = ...), the data passed to FUN has been coerced to a character vector; it is no longer a data.frame.
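If a row-wise computation is really needed, one option (a sketch, not the answerer's code) is to walk the numeric columns in parallel with mapply, so that no row is ever flattened into a single vector:

```r
# Toy data from the question
dat <- data.frame(a = letters[c(1, 1:9)], b = (1:10) / 10, c = sin(1:10))

# mapply walks b and c element by element; each argument keeps its own type
dat$bb <- mapply(function(b, c) b - c, dat$b, dat$c)
typeof(dat$bb)       # "double"

# A single row taken with [ is still a data.frame, so its columns keep their types
str(dat[1, ])
```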
I'm having trouble removing data.table column labels/attributes from imported data (SAS)
My data.table DT is an import from a SAS file. Not all columns have labels, and some have two attributes. I can't share my data as it's imported (so I can't replicate it), but here is a partial structure of DT:
> str(DT)
Classes ‘data.table’ and 'data.frame': 96293709 obs. of 150 variables:
$ Col1 : chr "Y" "N" "N" "N" ...
..- attr(*, "label")= chr "some label, description goes on and on"
$ Col2 : chr "N" "N" "N" "Y" ...
..- attr(*, "label")= chr "some label 2, description goes on and on"
$ Col3 : Date, format: "1994-08-07" "1994-08-07" "1994-08-07" "1994-08-07" ...
$ Col4 : chr "M" "M" "M" "M" ...
..- attr(*, "label")= chr "some label 3, description goes on and on"
..- attr(*, "format.sas")= chr "$"
$ Col5 : num 1e+07 1e+07 1e+07 1e+07 1e+07 ...
..- attr(*, "label")= chr "some label 4, description goes on and on"
$ Col6 : Date, format: "2000-01-01" "2005-03-10" "2013-06-01" "2015-06-01" ...
I'm trying to remove all attributes, because when I use certain columns to create new ones, these attributes are inherited by the new column, which is annoying and undesired (it prevents me from merging with another data.table that lacks the labels). I thought the only way to prevent that was to remove the attributes (labels) from the original data DT.
I tried
> setattr(DT, "label", NULL)
> setattr(DT, "format.sas", NULL)
and I get no error, but nothing happens.
When I check the structure afterwards, I get the same thing as before: the labels/attributes have not been removed.
What am I doing wrong here?
I know I have to use setattr somehow, as I don't want DT to be copied (it's rather large).
The attributes are stored against each column, not for the data.table as a whole I think. Check attributes(DT) vs lapply(DT, attributes) and see if this is the case. Here's an example which I think replicates what you're trying to do:
DT <- data.table(a=1:3,b=2:4)
attr(DT$a, "label") <- "a label"
attr(DT$b, "label") <- "a label"
attr(DT$b, "sas format") <- "ddmmyy10."
str(DT)
#Classes ‘data.table’ and 'data.frame': 3 obs. of 2 variables:
# $ a: atomic 1 2 3
# ..- attr(*, "label")= chr "a label"
# $ b: atomic 2 3 4
# ..- attr(*, "label")= chr "a label"
# ..- attr(*, "sas format")= chr "ddmmyy10."
# - attr(*, ".internal.selfref")=<externalptr>
DT[, names(DT) := lapply(.SD, setattr, "label", NULL)]
DT[, names(DT) := lapply(.SD, setattr, "sas format", NULL)]
str(DT)
#Classes ‘data.table’ and 'data.frame': 3 obs. of 2 variables:
# $ a: int 1 2 3
# $ b: int 2 3 4
# - attr(*, ".internal.selfref")=<externalptr>
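An equivalent way to write the same idea is an explicit loop over the columns. This is a sketch on the toy table from the example above; the vector drop_attrs is a hypothetical helper name, and the attribute names are the ones used in that example (the real import uses "format.sas").

```r
library(data.table)

# Toy table with column-level attributes, as in the example above
DT <- data.table(a = 1:3, b = 2:4)
attr(DT$a, "label") <- "a label"
attr(DT$b, "label") <- "a label"
attr(DT$b, "sas format") <- "ddmmyy10."

# The table itself carries no "label"; the columns do
attributes(DT)$label          # NULL
lapply(DT, attributes)

# Hypothetical helper: names of the column attributes to drop
drop_attrs <- c("label", "sas format")

# setattr() changes each column in place, so the table is not copied
for (col in names(DT)) {
  for (att in drop_attrs) {
    setattr(DT[[col]], att, NULL)
  }
}

lapply(DT, attributes)
```

Setting an attribute that a column does not have to NULL is harmless, so the loop does not need to check which columns carry which attribute.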
I have a seemingly very simple task but can't figure out what I'm doing wrong. I have a list of 3 n x 2 tibbles (2 x 2 in the example), each with a character vector and an integer vector. I want to convert it to a list of 3 named vectors where the letters are the vector elements and the numbers are the names. Here is my approach:
library(tibble)
tbl <- tibble(numbers = 1:2, letters = letters[1:2])
vec_names <- c("name1", "name2", "name3")
lst <- list(tbl, tbl, tbl)
names(lst) <- vec_names
lst_n <- lapply(lst, function(x) x[["letters"]])
lst_n <- sapply(vec_names,
function(x) names(lst_n[[x]]) <- lst[[x]]$numbers)
I get this result
lst_n
name1 name2 name3
[1,] 1 1 1
[2,] 2 2 2
and I can't see my mistake.
Doing
names(lst_n[["name1"]]) <- lst[["name1"]]$numbers
gives me exactly what I want for "name1" but why doesn't it work with sapply?
I had [] before and changed it to [[]] to access the tibbles inside the list instead of the list elements but it still doesn't work. Can anyone help? It seems like a very basic task.
Here's one way to do it, all in one anonymous function:
z = lapply(lst, function(x) {
result = x$letters
names(result) = x$numbers
return(result)
})
str(z)
# List of 3
# $ name1: Named chr [1:2] "a" "b"
# ..- attr(*, "names")= chr [1:2] "1" "2"
# $ name2: Named chr [1:2] "a" "b"
# ..- attr(*, "names")= chr [1:2] "1" "2"
# $ name3: Named chr [1:2] "a" "b"
# ..- attr(*, "names")= chr [1:2] "1" "2"
Your approach got stuck because, after extracting all the letters, you need to iterate over both the letters and the numbers to set the names, but lapply and sapply only let you iterate over one thing. (Assigning inside the anonymous function doesn't help either: the assignment only changes a local copy of lst_n, and the only thing that matters is the returned object, which here is the value of the assignment, i.e. the numbers.)
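A small sketch of why the sapply version returned the numbers: in R, the value of an assignment expression is its right-hand side, and a function only modifies its own copy of the object. The names x, f and res are made up for the illustration.

```r
x <- c("a", "b")

# The last expression is an assignment, so the function returns 1:2,
# and only the local copy v gets names
f <- function(v) names(v) <- 1:2

res <- f(x)
res         # the numbers, not the named letters
names(x)    # NULL: x outside the function is untouched
```

Returning the modified object explicitly, as in the anonymous function above, avoids both problems.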
If you couldn't use the approach above, doing things in one pass through lst, you can use Map instead which iterates over multiple lists. We'll use the setNames function instead of names<-(), which is what is called when you try to do names(x) <-.
Map(
f = setNames,
object = lapply(lst, "[[", "letters"),
nm = lapply(lst, "[[", "numbers")
)
# $`name1`
# 1 2
# "a" "b"
#
# $name2
# 1 2
# "a" "b"
#
# $name3
# 1 2
# "a" "b"
I have a data frame 'QARef' with 25 variables. There are only 5 unique jobs (3rd column) but lots of rows per job:
str(QARef)
'data.frame': 648 obs. of 25 variables:
I'm using tapply to generate mean values across all 5 jobs for certain rows:
RefMean <- tapply(QARef$MTN,
list(QARef$Target_CD, QARef$Feature_Type, QARef$Orientation, QARef$Contrast, QARef$Prox),
FUN=mean, trim=0, na.rm=TRUE)
and I get what I believe is a multidimensional array:
str(RefMean)
num [1:17, 1:2, 1:2, 1:2, 1:2] 34.1 34.2 25.2 28.9 29.2 ...
- attr(*, "dimnames")=List of 5
..$ : chr [1:17] "55" "60" "70" "80" ...
..$ : chr [1:2] "LINE" "SQUARE"
..$ : chr [1:2] "X" "Y"
..$ : chr [1:2] "CLEAR" "DARK"
..$ : chr [1:2] "1:1" "Iso"
What I want to do is add a column to QARef which contains the correct RefMean value for each row depending on a match between values in columns of QARef and dimnames of RefMean. E.g. QARef column Feature_Type=="LINE" should match the dimname "LINE" etc.
Any hint how to do this or where to find the answer would be highly appreciated.
I think I found a solution. Probably not elegant, but it works:
RefMean <- data.frame(tapply(
QARef$MTN,
paste(QARef$Target_CD, QARef$Feature_Type, QARef$Orientation,
QARef$Contrast, QARef$Prox, QARef$Measurement_Type),
FUN = mean, trim = 0, na.rm = TRUE))
colnames(RefMean) <- c("MTN_Ref")
Ident <- do.call(rbind, strsplit(rownames(RefMean), " "))
RefMean["Target_CD"] <- Ident[,1]
RefMean["Feature_Type"] <- Ident[,2]
RefMean["Orientation"] <- Ident[,3]
RefMean["Contrast"] <- Ident[,4]
RefMean["Prox"] <- Ident[,5]
RefMean["Measurement_Type"] <- Ident[,6]
QA4 <- merge(QARef, RefMean,
by = c("Target_CD", "Feature_Type", "Orientation", "Contrast", "Prox", "Measurement_Type"),
all.x = TRUE, sort = FALSE)
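Two shorter routes may also work, sketched here on a made-up stand-in for QARef (only the column names come from the question; the values and the reduction to two grouping columns are assumptions). The first indexes the tapply array with a character matrix that has one column per dimension; the second uses ave, which returns the group mean aligned with the original rows.

```r
# Toy stand-in for QARef (hypothetical values)
QARef <- data.frame(
  Feature_Type = c("LINE", "LINE", "SQUARE", "SQUARE", "LINE"),
  Orientation  = c("X", "X", "Y", "Y", "Y"),
  MTN          = c(10, 20, 30, NA, 50)
)

RefMean <- tapply(QARef$MTN, list(QARef$Feature_Type, QARef$Orientation),
                  FUN = mean, na.rm = TRUE)

# One index column per array dimension, matched against the dimnames
idx <- cbind(as.character(QARef$Feature_Type), as.character(QARef$Orientation))
QARef$MTN_Ref <- RefMean[idx]

# ave() computes the same per-row group means without building the array
QARef$MTN_Ref2 <- ave(QARef$MTN, QARef$Feature_Type, QARef$Orientation,
                      FUN = function(v) mean(v, na.rm = TRUE))
```

Unlike the paste/strsplit route, neither variant depends on the grouping values being free of spaces.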
I want to write a loop. I have a matrix (named 'a') and I want to get all the column names:
k<-arrayInd(2,dim(a))
colnames(a)[k[,1]]
colnames(a)[k[,2]]
colnames(a)[k[,3]]
.
.
.
colnames(a)[k[,n]]
I guess the loop will be something like this:
aa<-list()
for (i in 1:n) {
aa[[i]]<-colnames(a)[k[,i]]
}
But I don't get any results. I think the loop itself is OK, but that I need to change
aa<-list()
and replace the list with something else.
Suppose you have a matrix mat, which looks like this:
mat <- matrix(1:4, ncol = 2, dimnames = list(letters[1:2], LETTERS[1:2]))
You can inspect its structure like this:
str(mat)
# int [1:2, 1:2] 1 2 3 4
# - attr(*, "dimnames")=List of 2
# ..$ : chr [1:2] "a" "b"
# ..$ : chr [1:2] "A" "B"
And you get the column names by using this:
colnames(mat)
# [1] "A" "B"
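To connect this to the arrayInd call in the question, here is a sketch on the same mat. It assumes the goal is to translate a linear index into the matching row and column names; the preallocated list aa mirrors the loop in the question.

```r
mat <- matrix(1:4, ncol = 2, dimnames = list(letters[1:2], LETTERS[1:2]))

# arrayInd() turns a linear index into one (row, column) pair per index
k <- arrayInd(2, dim(mat))
k

# Column 1 of k indexes rows, column 2 indexes columns
row_name <- rownames(mat)[k[, 1]]
col_name <- colnames(mat)[k[, 2]]

# Collecting every column name in a list with a loop
aa <- vector("list", ncol(mat))
for (i in seq_len(ncol(mat))) {
  aa[[i]] <- colnames(mat)[i]
}
```

Note that the loop in the question needs n to be defined beforehand; seq_len(ncol(mat)) takes the bound from the matrix instead.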
The data is generated by a process similar to this:
x <- rnorm(10)
y <- c("a", "b", "c")
# chr vectors might have varying length and contents, simplified for sake of example
data_list <- list()
for(i in 1:length(x)) {
data_list <- append(data_list, list(list(numeric = x[i], char = y)))
}
Basically, structure of generated list looks like:
$ :List of 2
..$ numeric: num 0.928
..$ char : chr [1:3] "a" "b" "c"
$ :List of 2
..$ numeric: num 1.4
..$ char : chr [1:3] "a" "b" "c"
...
I would like to sort this list by numeric in ascending order, retaining the initial structure.
I have tried the solution explained here, but it disrupts the structure of the chr vectors.
data_list = data_list[order(sapply(data_list, `[[`, i=1))]
For example:
data_list[order(rapply(data_list,"numeric",f=c))]
The rapply call extracts all the numeric values, which are then passed to order.
Also:
data_list[order(unlist(do.call(`c`, data_list)[c(TRUE, FALSE)]))]
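A variant that names the sort key explicitly, sketched with fixed numbers instead of rnorm so the result is reproducible. vapply is swapped in for rapply here; keys and sorted are illustrative names.

```r
x <- c(0.9, -1.2, 0.3)
y <- c("a", "b", "c")
data_list <- lapply(x, function(v) list(numeric = v, char = y))

# Pull exactly one number out of each element; vapply checks type and length
keys <- vapply(data_list, function(el) el$numeric, numeric(1))

# Reordering the outer list leaves every inner list untouched
sorted <- data_list[order(keys)]
str(sorted)
```

Because only the outer list is reindexed, the chr vectors keep their length and contents even when they differ between elements.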