Quantiles of a data.frame - r

There is a data.frame() whose columns I'd like to calculate quantiles for:
tert <- c(0:3)/3
data <- dbGetQuery(dbCon, "SELECT * FROM tablename")
quans <- mapply(quantile, data, probs=tert, name=FALSE)
But the result only contains the last element of quantile's return list, not the whole result. I also get a warning, "longer argument not a multiple of length of shorter". How can I modify my code to make it work?
PS: The function alone works like a charm, so I could use a for loop:
quans <- quantile(a$fileName, probs=tert, name=FALSE)
PPS: What also works is not specifying probs
quans <- mapply(quantile, data, name=FALSE)

The problem is that mapply applies the given function to the elements of all of the supplied arguments in parallel, recycling the shorter ones. Since you only want to loop over one argument (the columns of data), you should use lapply, not mapply:
lapply(data, quantile, probs=tert, name=FALSE)
Alternatively, you can still use mapply but specify the arguments that are not to be looped over in the MoreArgs argument.
mapply(quantile, data, MoreArgs=list(probs=tert, name=FALSE))
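For a quick check, using mtcars as a stand-in for the query result, either call returns one set of tertiles per column; a sapply variant collapses the list into a matrix with one column per variable:
tert <- c(0:3)/3
lapply(mtcars, quantile, probs = tert, names = FALSE)   # list, one element per column
sapply(mtcars, quantile, probs = tert, names = FALSE)   # 4 x 11 matrix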

I finally found a workaround which I don't like, but it kind of works. Perhaps someone can tell me the right way to do it:
q <- function(x) { quantile(x, probs=c(0:3)/3, names=FALSE) }
mapply(q, data)
works; I have no idea where the difference is.
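Most likely the difference is that the wrapper fixes probs inside the function body, so mapply no longer treats probs as a vectorized argument to recycle against the columns. An anonymous function does the same thing without defining a named helper:
mapply(function(x) quantile(x, probs = tert, names = FALSE), data)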

Related

Expand grid in R with paste

I am trying to analyse a data frame using hierarchical clustering with the hclust function in R.
I would like to pass in a vector of p values written beforehand (say c(5/4, 3/2, 7/4, 9/4)) and have each of them appear as a separate Minkowski-distance option when I use expand.grid. Ideally, when hyperparams is viewed, it should also be clear which value of p has been used for each minkowski entry, i.e. they should be labelled. So where hyperparams currently has just one minkowski under Dists for each of the methods in Meths, supplying the p vector c(5/4, 3/2, 7/4, 9/4) should instead give four rows for Minkowski distance: minkowski, p=5/4; minkowski, p=3/2; minkowski, p=7/4; minkowski, p=9/4 (or something looking like that, making the p values clear). Any ideas?
(Note: no packages please, only base R!)
Edit: I worded it poorly before, now rewritten. Let's take the following example instead:
acc <- function(x){
  first = sum(x)
  second = sum(x^2)
  return(list(First = first, Second = second))
}
iris0 <- iris
iris1 <- cbind(log(iris[,1:4]),iris[5])
iris2 <- cbind(sqrt(iris[,1:4]),iris[5])
Now the important bit:
tests <- expand.grid(Dists = c("euclidean", "maximum", "manhattan", "canberra", "binary"),
                     DS = c("iris0", "iris1", "iris2"))
Table <- Map(function(x, ds) {
  acc(table(get(ds)$Species, cutree(hclust(dist(get(ds)[, 1:4], method = x)), 3)))
}, as.character(tests[[1]]), as.character(tests[[2]]))
This will work. But now if I want to include a term like "minkowski",p=3 in expand.grid, how would I do it?
tests <- expand.grid(Dists = c("euclidean", "maximum", "manhattan", "canberra", "binary", "minkowski,p=3"),
                     DS = c("iris0", "iris1", "iris2"))
Table <- Map(function(x, ds) {
  acc(table(get(ds)$Species, cutree(hclust(dist(get(ds)[, 1:4], method = x)), 3)))
}, as.character(tests[[1]]), as.character(tests[[2]]))
This gives an error.
In reality there should be no p argument unless method="minkowski". I have tried using strsplit to get the first part of the expression as the method, and a switch with strsplit to get the second part and then parse it (it would return NULL if the length of the strsplit result was not 2, which should pass no argument, I think). The issue seems to be that strsplit(x, ",") is not applied to the vectorized x but rather tries to evaluate the character x, which is not a string. Can anyone suggest a workaround/fix, or another method for including the minkowski,p=1.6 terms and the like?
We can create a 'p' value column by first encoding the p value in the method label:
tests <- expand.grid(Dists = c("euclidean", "maximum", "manhattan", "canberra", "binary",
                               "minkowski3", "minkowski4", "minkowski5"),
                     DS = c("iris0", "iris1", "iris2"))
With an additional column of 'p' values in 'tests', the solution above can be changed to:
tests$p <- as.list(args(dist))$p   # default value
i1 <- grepl("minkowski", tests$Dists)
tests$Dists <- sub("[0-9.]+$", "", tests$Dists)
tests$p[i1] <- rep(3:5, length.out = sum(i1))
Map(function(x, ds, p) {
  dist1 <- dist(get(ds)[, 1:4], method = x, p = p)
  ct <- cutree(hclust(dist1), 3)
  acc(table(get(ds)$Species, ct))
}, as.character(tests[[1]]), as.character(tests[[2]]), tests$p)
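As an alternative to hard-coding rep(3:5, ...), here is a minimal sketch, starting again from the freshly built 'tests' and assuming labels of the form "minkowski3", "minkowski4", ..., that parses p straight out of the label:
lbl <- as.character(tests$Dists)
i1 <- grepl("^minkowski[0-9.]+$", lbl)
tests$p <- as.list(args(dist))$p                         # default p
tests$p[i1] <- as.numeric(sub("^minkowski", "", lbl[i1]))
tests$Dists <- sub("[0-9.]+$", "", lbl)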

get() not working for column in a data frame in a list in R (phew)

I have a list of data frames. I want to use lapply on a specific column of each of those data frames, but I keep getting errors when I try methods from similar answers.
The setup is something like this:
a <- list(*a series of data frames that each have a column named DIM*)
dim_loc <- lapply(1:length(a), function(x){paste0("a[[", x, "]]$DIM")})
Eventually, I'll want to write something like results <- lapply(dim_loc, *some function on the DIMs*)
However, when I try get(dim_loc[[1]]), say, I get an error: Error in get(dim_loc[[1]]) : object 'a[[1]]$DIM' not found
But I can pass a[[1]]$DIM to a function and get values back all day long. It's there.
I've tried working around this by using as.name() in the dim_loc assignment, but that doesn't seem to do the trick either.
I'm curious 1. what's up with get(), and 2. if there's a better solution. I'm constraining myself to the apply family of functions because I want to try to get out of the for-loop habit, and this name-as-list method seems to be preferred based on something like R- how to dynamically name data frames?, but I'd be interested in other, more elegant solutions, too.
I'd say that if you want to modify an object in place you are better off using a for loop, since lapply would require the <<- assignment operator (<- doesn't work inside lapply). Like so:
set.seed(1)
aList <- list(cars = mtcars, iris = iris)
for(i in seq_along(aList)){
  aList[[i]][["newcol"]] <- runif(nrow(aList[[i]]))
}
As opposed to...
invisible(
  lapply(seq_along(aList), function(x){
    aList[[x]][["newcol"]] <<- runif(nrow(aList[[x]]))
  })
)
You have to use invisible(), otherwise lapply would print the output to the console. The <<- assigns the vector runif(...) to the newly created column.
If you want to produce another set of data.frames using lapply then you do:
lapply(seq_along(aList), function(x){
  aList[[x]][["newcol"]] <- runif(nrow(aList[[x]]))
  return(aList[[x]])
})
Also, may I suggest the use of seq_along(list) in lapply and for loops as opposed to 1:length(list) since it avoids unexpected behavior such as:
# no length list
seq_along(list()) # prints integer(0)
1:length(list()) # prints 1 0, so a loop over it would still run twice
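As for get() itself: it looks up a single object by name in an environment, so it cannot evaluate a composite expression such as "a[[1]]$DIM"; that string is not the name of any object, hence the 'not found' error. A minimal sketch of extracting the columns without get() (mean is just a stand-in for whatever function is actually needed):
dims <- lapply(a, `[[`, "DIM")    # pull the DIM column out of each data frame
results <- lapply(dims, mean)     # mean is a placeholder for the real function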

Way to do this using apply?

I want to take an average for each row across different data frames. Does anyone know of a more clever way to do this using apply statements? Sorry for the wall of code.
You would need the vector 1000:1006 for the hiXXXX data frames and the vector 2:13 for the columns. I have used mapply for something weird like this before, so maybe that could do it somehow?
for (i in 1:nrow(subavg)) {
subavg[i,c(2)] <- mean(c(hi1000[i,c(2)],hi1001[i,c(2)],hi1002[i,c(2)],hi1003[i,c(2)],hi1004[i,c(2)],hi1005[i,c(2)],hi1006[i,c(2)]))
subavg[i,c(3)] <- mean(c(hi1000[i,c(3)],hi1001[i,c(3)],hi1002[i,c(3)],hi1003[i,c(3)],hi1004[i,c(3)],hi1005[i,c(3)],hi1006[i,c(3)]))
subavg[i,c(4)] <- mean(c(hi1000[i,c(4)],hi1001[i,c(4)],hi1002[i,c(4)],hi1003[i,c(4)],hi1004[i,c(4)],hi1005[i,c(4)],hi1006[i,c(4)]))
subavg[i,c(5)] <- mean(c(hi1000[i,c(5)],hi1001[i,c(5)],hi1002[i,c(5)],hi1003[i,c(5)],hi1004[i,c(5)],hi1005[i,c(5)],hi1006[i,c(5)]))
subavg[i,c(6)] <- mean(c(hi1000[i,c(6)],hi1001[i,c(6)],hi1002[i,c(6)],hi1003[i,c(6)],hi1004[i,c(6)],hi1005[i,c(6)],hi1006[i,c(6)]))
subavg[i,c(7)] <- mean(c(hi1000[i,c(7)],hi1001[i,c(7)],hi1002[i,c(7)],hi1003[i,c(7)],hi1004[i,c(7)],hi1005[i,c(7)],hi1006[i,c(7)]))
subavg[i,c(8)] <- mean(c(hi1000[i,c(8)],hi1001[i,c(8)],hi1002[i,c(8)],hi1003[i,c(8)],hi1004[i,c(8)],hi1005[i,c(8)],hi1006[i,c(8)]))
subavg[i,c(9)] <- mean(c(hi1000[i,c(9)],hi1001[i,c(9)],hi1002[i,c(9)],hi1003[i,c(9)],hi1004[i,c(9)],hi1005[i,c(9)],hi1006[i,c(9)]))
subavg[i,c(10)] <- mean(c(hi1000[i,c(10)],hi1001[i,c(10)],hi1002[i,c(10)],hi1003[i,c(10)],hi1004[i,c(10)],hi1005[i,c(10)],hi1006[i,c(10)]))
subavg[i,c(11)] <- mean(c(hi1000[i,c(11)],hi1001[i,c(11)],hi1002[i,c(11)],hi1003[i,c(11)],hi1004[i,c(11)],hi1005[i,c(11)],hi1006[i,c(11)]))
subavg[i,c(12)] <- mean(c(hi1000[i,c(12)],hi1001[i,c(12)],hi1002[i,c(12)],hi1003[i,c(12)],hi1004[i,c(12)],hi1005[i,c(12)],hi1006[i,c(12)]))
subavg[i,c(13)] <- mean(c(hi1000[i,c(13)],hi1001[i,c(13)],hi1002[i,c(13)],hi1003[i,c(13)],hi1004[i,c(13)],hi1005[i,c(13)],hi1006[i,c(13)]))
}
As there are only 7 datasets, we can pass them as arguments to Map, cbind the corresponding columns, and get the rowMeans:
Map(function(...) rowMeans(cbind(...)), hi1000, hi1001, hi1002, hi1003,
hi1004, hi1005, hi1006)
Or use + with Reduce after getting the datasets in a list and then divide by the total number of datasets, i.e. 7
Reduce(`+`, mget(paste0("hi", 1000:1006)))/7
The second solution is more compact, but if there are NAs in the datasets it is better to use the first one, as rowMeans has an na.rm argument. By default it is FALSE, but we can set it to TRUE.
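For example, a sketch of the first approach with NA handling switched on:
Map(function(...) rowMeans(cbind(...), na.rm = TRUE), hi1000, hi1001, hi1002,
    hi1003, hi1004, hi1005, hi1006)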

Applying multiple functions via sapply

I'm trying to replicate a solution on applying multiple functions in sapply posted on R-Bloggers, but I can't get it to work in the desired manner. I'm working with a simple data set, similar to the one generated below:
require(datasets)
crs_mat <- cor(mtcars)
# Triangle function
get_upper_tri <- function(cormat){
  cormat[lower.tri(cormat)] <- NA
  return(cormat)
}
require(reshape2)
crs_mat <- melt(get_upper_tri(crs_mat))
I would like to replace some text values across columns Var1 and Var2. The erroneous syntax below illustrates what I am trying to achieve:
crs_mat[,1:2] <- sapply(crs_mat[,1:2], function(x) {
# Replace first phrase
gsub("mpg","MPG",x),
# Replace second phrase
gsub("gear", "GeArr",x)
# Ideally, perform other changes
})
Naturally, the code is not syntactically correct and fails. To summarise, I would like to do the following:
Go through all the values in first two columns (Var1 and Var2) and perform simple replacements via gsub.
Ideally, I would like to avoid defining a separate function, as discussed in the linked post, and keep everything within the sapply call
I don't want a nested loop
I had a look at the broadly similar subject discussed here and here but, if possible, I would like to avoid making use of plyr. I'm also interested in replacing the column values not in creating new columns and I would like to avoid specifying any column names. While working with my existing data frame it is more convenient for me to use column numbers.
Edit
Following very useful comments, what I'm trying to achieve can be summarised in the solution below:
library(stringr)   # for str_wrap()

fun.clean.columns <- function(x, str_width = 15) {
  # Make character
  x <- as.character(x)
  # Replace various phrases
  x <- gsub("perc85", "something else", x)
  x <- gsub("perc95", "again", x)   # pattern "perc95" is a guess; the original line was missing its first argument
  x <- gsub("more", "even more", x)
  x <- gsub("abc", "ohmg", x)
  # Clean spaces
  x <- trimws(x)
  # Wrap strings
  x <- str_wrap(x, width = str_width)
  # Return object
  return(x)
}
mean_data[,1:2] <- sapply(mean_data[,1:2], fun.clean.columns)
I don't need this function in my global env, so I can rm() it afterwards, but an even nicer solution would squeeze this within the apply syntax.
We can use mgsub from library(qdap) to replace multiple patterns. Here, I am looping over the first and second columns using lapply and assigning the results back to crs_mat[,1:2]. Note that I am using lapply instead of sapply, as lapply keeps the structure intact.
library(qdap)
crs_mat[,1:2] <- lapply(crs_mat[,1:2], mgsub,
pattern=c('mpg', 'gear'), replacement=c('MPG', 'GeArr'))
Here is a start of a solution for you; I think you're capable of extending it yourself. There are probably more elegant approaches available, but I don't see them at the moment.
crs_mat[,1:2] <- sapply(crs_mat[,1:2], function(x) {
  # Replace first phrase
  step1 <- gsub("mpg", "MPG", x)
  # Replace second phrase. Note that this operates on the already-modified step1.
  step2 <- gsub("gear", "GeArr", step1)
  # Ideally, perform other changes
  return(step2)
  # or as one nested line, not practical if more needs to be done
  # return(gsub("gear", "GeArr", gsub("mpg", "MPG", x)))
})
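If pulling in qdap is not an option, roughly the same thing can be done in base R by looping over a named replacement vector inside lapply; a minimal sketch, reusing the pattern/replacement pairs from above:
repl <- c(mpg = "MPG", gear = "GeArr")
crs_mat[, 1:2] <- lapply(crs_mat[, 1:2], function(x) {
  x <- as.character(x)
  for (p in names(repl)) x <- gsub(p, repl[[p]], x)
  x
})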

Replace characters in a column, based on a translation table from another data frame

I have a data.frame mapping which contains path and map.
I also have another data.frame DATA which contains the raw path and value.
EDIT: Path might have two components or more: e.g. "A>C" or "A>C>B"
set.seed(24);
DATA <- data.frame(
path=paste0(sample(LETTERS[1:3], 25, replace=TRUE), ">", sample(LETTERS[1:3], 25, replace=TRUE)),
value=rnorm(25)
)
mapping <- data.frame(path=c("A","B","C"), map=c("X","Y","Z"))
lapply(mapping, function (x) {
  for (i in 1:nrow(DATA)) {
    DATA$path[i] <- gsub(as.character(x["path"]), as.character(x["map"]), as.character(DATA$path[i]))
  }
})
I'm trying to replace the path components in DATA with the map values from mapping, but this doesn't seem to be working for me. "A>C" should be converted to "X>Z".
I understand that for loops are not good in R, but I can't think of another way to code it. The data I'm working with has 6m rows in DATA and 16k rows in mapping.
Clarification on the data: while the paths consist of single letters (A, B, C) here, the real path components are actually domain names. The number of steps in a path is also not fixed at 2 and can be any number.
You can use chartr
DATA$path <- chartr('ABC', 'XYZ', DATA$path)
Or if we are using the data from 'mapping'
DATA$path <- chartr(paste(mapping$path, collapse=''),
paste(mapping$map, collapse=''), DATA$path)
Or using gsubfn
library(gsubfn)
pat <- paste0('[', paste(mapping$path, collapse=''),']')
indx <- setNames(as.character(mapping$map), mapping$path)
gsubfn(pat, as.list(indx), as.character(DATA$path))
Or a base R option based on #smci's comment
vapply(strsplit(as.character(DATA$path), '>'),
       function(x) paste(indx[x], collapse = ">"), character(1L))
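Since the real path components are domain names rather than single letters, note that chartr only remaps single characters; the strsplit/lookup approach generalizes as long as indx is keyed by the full component strings. A tiny sketch with hypothetical names:
indx2 <- c("siteA.com" = "X", "siteB.com" = "Y")   # hypothetical mapping
vapply(strsplit("siteA.com>siteB.com", ">"), function(x)
       paste(indx2[x], collapse = ">"), character(1L))
# returns "X>Y"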
Using data.table (1.9.5+) is especially advisable because of the size of your data.
library(data.table)
setDT(DATA); setDT(mapping)
DATA[,paste0("path",1:2):=tstrsplit(path,split=">")]
setkey(DATA,path1)[mapping,new.path1:=i.map]
setkey(DATA,path2)[mapping,new.path2:=i.map]
DATA[,new.path:=paste0(new.path1,">",new.path2)]
If you want to get rid of the extra columns:
DATA[,paste0(c("","","new.","new."),"path",rep(1:2,2)):=NULL]
If you just want to overwrite path, use path on the LHS of the last line instead of new.path.
This could also be written more concisely:
library(data.table)
setDT(mapping)
setkey(setkey(setDT(DATA)[, paste0("path", 1:2) := tstrsplit(path, split = ">")],
              path1)[mapping, new.path1 := i.map],
       path2)[mapping, new.path := paste0(new.path1, ">", i.map)]
I think you're using the wrong apply.
mapply allows you to use two arguments to the function, here the path and the map. Note that in mapply the argument FUN comes first. You also do not need to do this row by row; you can do the entire column at once. Finally, inside an apply the variables do not get updated as they do in a for loop, so you need to assign them in the .GlobalEnv. You can do this with an explicit call to assign() or with <<-, which assigns to the first enclosing environment where it finds the variable; in this case, that will be back in .GlobalEnv.
After defining mapping and DATA as you do above, try this.
head(DATA)
invisible(mapply(function (x, y) {
  DATA$path <<- gsub(x, y, DATA$path)
}, mapping$path, mapping$map))
head(DATA)
Note that the call to invisible() suppresses the output from mapply.
If you really want to use lapply, you can, but you need to transpose mapping. Doing that converts it to a matrix, so you have to convert it back to a data frame. Then you can use the same <<- trick, without a for loop, to get this code:
invisible(lapply(as.data.frame(t(mapping)), function (x) {
  DATA$path <<- gsub(x[1], x[2], DATA$path)
}))
head(DATA)
Thanks for sharing, I learned a lot answering this question.
