subsetting quickly over many columns - r

I have some code that identifies outliers in a data frame and then either removes or caps them. I'm trying to speed up the removal process using an apply() function (or perhaps another method).
Example data
https://github.com/crossfitAL/so_ex_data/blob/master/subset
# this is the contents of a csv file, you will need to load it into your R session.
# set up an example decision-matrix
# rm.mat is a {length(cols) x 4} matrix -- in this example 8 x 4
# rm.mat[,1:2] - identify the values for min/max outliers, respectively.
# rm.mat[,3:4] - identify if you wish to remove min/max outliers, respectively.
cols <- c(1, 6:12) # specify the columns you wish to examine
rm.mat <- matrix(nrow = length(cols), ncol= 4,
dimnames= list(names(fico2[cols]),
c("out.min", "out.max","rm outliers?", "rm outliers?")))
# add example decision criteria
rm.mat[, 1] <- apply(fico2[, cols], 2, quantile, probs= .05)
rm.mat[, 2] <- apply(fico2[, cols], 2, quantile, probs= .95)
rm.mat[, 3] <- replicate(4, c(0,1))
rm.mat[, 4] <- replicate(4, c(1,0))
Here's my current code for subsetting:
df2 <- fico2 # create a copy of the data frame
cnt <- 1 # add a count variable
for (i in cols) {
# for each column of interest in the data frame. Determine if there are min/max
# outliers that you wish to remove, remove them.
if (rm.mat[cnt, 3] == 1 & rm.mat[cnt, 4] == 1) {
# subset / remove min and max outliers
df2 <- df2[df2[, i] >= rm.mat[cnt, 1] & df2[, i] <= rm.mat[cnt, 2], ]
} else if (rm.mat[cnt, 3] == 1 & rm.mat[cnt, 4] == 0) {
# subset / remove min outliers
df2 <- df2[df2[, i] >= rm.mat[cnt, 1], ]
} else if (rm.mat[cnt, 3] == 0 & rm.mat[cnt, 4] == 1) {
# subset / remove max outliers
df2 <- df2[df2[, i] <= rm.mat[cnt, 2], ]
}
cnt <- cnt + 1
}
proposed solution:
I think I should be able to do this via an apply-type function, with the removal of the for loop (vectorization) speeding up the code. The problem I'm running into is that I want to apply a function if and only if the decision matrix indicates that I should, i.e. use a logical vector, rm.mat[,3] or rm.mat[,4], to determine whether the subsetting operator "[" should be applied to the data frame df2.
Any help you have would be greatly appreciated! Also, please let me know if the example data / code is sufficient.

Here is a partial solution, mainly to clarify your code. I hope others can use it to give a better solution.
So, if I understand correctly, you have a decision matrix that looks like this:
rm.mat
c1 c2 c3 c4
amount.funded.by.investors 27925.000 NA 0 1
monthly.income 11666.670 NA 1 0
open.credit.lines 18.000 NA 0 1
revolving.credit.balance 40788.750 NA 1 0
inquiries.in.the.last.6.months 3.000 NA 0 1
debt.to.inc 28.299 NA 1 0
int.rate 20.490 NA 0 1
fico.num 775.000 NA 1 0
And you try to filter a big matrix according to the values of this matrix
colnames(rm.mat) <- paste('c',1:4,sep='')
rm.mat <- as.data.frame(rm.mat)
apply(rm.mat,1,function(y){
h <- paste(y['c3'],y['c4'],sep='')
switch(h,
'11'= apply(df2,2, function(x)
df2[x >= y['c1'] & x <= y['c2'],]), ## we never have this!!
'10'= apply(df2,2, function(x)
df2[x >= y['c1'] , ]), ## here we apply by columns!
'01'= apply(df2,2,function(x)
df2[x <= y['c2'], ])) ## c2 is NA!! so !!!
}
)
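For comparison, here is a minimal vectorized sketch of the filtering the question describes. It swaps every disabled limit for -Inf or Inf, so all columns can be checked in one pass and the data frame is subset only once. The toy data and the names toy, lims, lo, hi and keep are assumptions made for illustration; they stand in for fico2 and rm.mat, which are not reproduced here.

```r
# Toy data standing in for fico2 (hypothetical values)
set.seed(1)
toy  <- data.frame(a = rnorm(200), b = runif(200), c = rexp(200))
cols <- c("a", "b", "c")

# Decision matrix in the same layout as rm.mat:
# columns 1-2 hold the cut-offs, columns 3-4 flag whether to apply them
lims <- cbind(out.min = unname(sapply(toy[cols], quantile, probs = 0.05)),
              out.max = unname(sapply(toy[cols], quantile, probs = 0.95)),
              rm.min  = c(1, 0, 1),
              rm.max  = c(0, 1, 1))

# A limit that is switched off becomes -Inf / Inf, so it never excludes a row
lo <- ifelse(lims[, "rm.min"] == 1, lims[, "out.min"], -Inf)
hi <- ifelse(lims[, "rm.max"] == 1, lims[, "out.max"],  Inf)

# Compare every column against its own limits, then keep rows passing all checks
m    <- as.matrix(toy[cols])
ok   <- sweep(m, 2, lo, ">=") & sweep(m, 2, hi, "<=")
keep <- rowSums(ok) == length(cols)

toy2 <- toy[keep, ]
```

Because the cut-offs are fixed before any rows are dropped, this should keep the same rows as subsetting column by column in a loop; rows containing NA would need separate handling.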

Related

Trying to create a new column in a data frame using a function in R

I have a large data frame, and I would like to create a new column for the data frame in R but I am struggling.
I am a relative beginner and I would be very grateful for some help.
Essentially I am looking to create a new column of AKI stage, based on an individual's peak and baseline creatinine measurements, and whether they have been on renal-replacement therapy (RRT), according to the following criteria:
stage 1: Peak Cr/Baseline Cr = 1.5–1.9 OR Peak Cr ≥ Baseline Cr + 26.5 µmol/l
stage 2: Peak Cr/Baseline Cr = 2.0–2.9
stage 3: Peak Cr/Baseline Cr ≥ 3 OR Peak Cr ≥ 353.6 µmol/l OR initiation of RRT
My data looks like this, in which I have 3 main variables.
head(data)
Peak.Creatinine.1 baseline.Cr.within.12.months new.RRT
1 421 82 1
2 659 98 1
3 569 89 1
4 533 113 1
5 533 212 1
6 396 65 1
I would like to create a new column called "AKI.stage", which returns a number: 0, 1, 2 or 3.
Which essentially uses this function:
akistage <- function(peak_cr, bl_cr, rrt=0) {
ratio <- peak_cr / bl_cr
if (rrt == "1"){return(3)}
else if (ratio >= 3){return(3)}
else if (peak_cr > 353.6){return(3)}
else if (ratio > 2 & ratio <3){return(2)}
else if (ratio > 1.5 & ratio <2){return(1)}
else if ((peak_cr >= bl_cr + 26.5)){return(1)}
else {return (0)}
}
The function works well when I test it, but I can't seem to apply it to the dataframe in order to create the new column.
I have attempted this in multiple ways including using apply,mapply,mutate,transform etc but I just can't seem to get it to work.
Here are some of my failed attempts:
data2$Peak.Creatinine.1 <- as.numeric(data2$Peak.Creatinine.1)
data2$baseline.Cr.within.12.months <- as.numeric(data2$baseline.Cr.within.12.months)
data2$test <- apply(data2, 1, function(x){
ratio <- x[1] / x[2]
peak_cr <- x[1]
bl_cr <- x[2]
rrt <- x[3]
if (rrt == "1"){return(3)}
else if (ratio >= 3){return(3)}
else if (peak_cr > 353.6){return(3)}
else if (ratio > 2 & ratio <3){return(2)}
else if (ratio > 1.5 & ratio <2){return(1)}
else if ((peak_cr >= bl_cr + 26.5)){return(1)}
else {return (0)}
})
But this returns the following error message, despite the columns being of class numeric:
Error in x[1]/x[2] : non-numeric argument to binary operator
Another attempt:
data2 %>%
mutate(test =
akistage(Peak.Creatinine.1,baseline.Cr.within.12.months,new.RRT))
Returns
Warning message:
In if (rrt == "1") { :
the condition has length > 1 and only the first element will be used
I have attempted it in lots of other ways, and I'm not sure why it's not working.
It does not seem very difficult to do, I would be extremely grateful if someone could come up with a solution!
Many thanks for your help!
The following vectorized function does what the question describes. It uses index vectors to assign the return values to a previously created vector AKI.stage.
library(dplyr)
akistage <- function(peak_cr, bl_cr, rrt = 0) {
  AKI.stage <- numeric(length(peak_cr))   # default stage is 0
  ratio <- peak_cr / bl_cr
  # interval index: 1 = [0, 1.5), 2 = [1.5, 2), 3 = [2, 3), 4 = [3, Inf)
  i <- findInterval(ratio, c(0, 1.5, 2, 3, Inf))
  # assign from least to most severe, so a higher stage always overwrites a lower one
  AKI.stage[i == 2 | peak_cr >= bl_cr + 26.5] <- 1
  AKI.stage[i == 3] <- 2
  AKI.stage[rrt == 1 | i == 4 | peak_cr > 353.6] <- 3
  AKI.stage
}
data %>%
  mutate(test = akistage(Peak.Creatinine.1, baseline.Cr.within.12.months, new.RRT))
Here are several ways to add a new column to a data.frame using only base R:
df <- data.frame(v1 = rep(0, 100), v2 = seq(1, 100))
v3 <- rep(0, 100)
# first way with a $
df$v3 <- v3
# second way with cbind
df <- cbind(df, v3)
# third way
df[, 3] <- 3
EDIT 1
Your problem comes from the fact that your third column is a factor, so when you use apply, which first converts the data frame to a matrix, all your data are transformed into character. The right way to do what you want is:
sapply(1:nrow(data2), function(i, df){
x <- df[i,]
ratio <- x[1] / x[2]
peak_cr <- x[1]
bl_cr <- x[2]
rrt <- x[3]
if (rrt == "1"){return(3)}
else if (ratio >= 3){return(3)}
else if (peak_cr > 353.6){return(3)}
else if (ratio > 2 & ratio <3){return(2)}
else if (ratio > 1.5 & ratio <2){return(1)}
else if ((peak_cr >= bl_cr + 26.5)){return(1)}
else {return (0)}
}, df = data2)
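Another way to keep a scalar if/else function is to let mapply call it once per row. The sketch below is only an illustration: the function name akistage_scalar and the toy values are made up, the thresholds use inclusive bounds (ratio >= 2, ratio >= 1.5) following the criteria listed in the question rather than the strict inequalities of the posted function, and the factor column is converted to numeric before the call.

```r
# Scalar staging rule (hypothetical rewrite; expects length-1 inputs)
akistage_scalar <- function(peak_cr, bl_cr, rrt = 0) {
  ratio <- peak_cr / bl_cr
  if (rrt == 1 || ratio >= 3 || peak_cr > 353.6) return(3)
  if (ratio >= 2) return(2)
  if (ratio >= 1.5 || peak_cr >= bl_cr + 26.5) return(1)
  0
}

# Toy data with the same column names as the question (made-up values)
toy <- data.frame(
  Peak.Creatinine.1            = c(421, 150, 120, 100, 90),
  baseline.Cr.within.12.months = c(82, 70, 70, 70, 85),
  new.RRT                      = factor(c(1, 0, 0, 0, 0))
)

# Convert the factor to its numeric labels, not its internal integer codes
rrt_num <- as.numeric(as.character(toy$new.RRT))

# mapply walks the three vectors in parallel, one row at a time
toy$AKI.stage <- mapply(akistage_scalar,
                        toy$Peak.Creatinine.1,
                        toy$baseline.Cr.within.12.months,
                        rrt_num)
toy$AKI.stage
# 3 2 1 1 0
```

This avoids both problems from the question: each call sees a single value, so if never receives a vector, and no row is coerced to character.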

Change cell value in one raster based on another raster

I have two raster maps from two points in time (t1 and t2) with two land-cover categories in each (LC1, LC2). I want to impose a rule that an LC2 cell in t1 cannot change to an LC1 cell in t2, i.e., only LC1 can change to LC2 through time but not the other way around. I am having a hard time coming up with a rule for that in R. What I had in mind was something like this:
#create test rasters
r <- raster(nrows=25, ncols=25, vals=round(rnorm(625, 3), 0)) #land-use/cover raster
r[ r > 2 ] <- 2
r[ r < 1 ] <- 1
r2 <- r
plot(r2) #r2 is t2
r <- raster(nrows=25, ncols=25, vals=round(rnorm(625, 3), 0)) #land-use/cover raster
r[ r > 2 ] <- 2
r[ r < 1 ] <- 1
plot(r) #r is t1
r_fix <- overlay(r, r2, fun = function(x, y) {
if (x[ x==2 ] & y[ y==1 ]) { #1 is LC1, 2 is LC2
x[ x==2 ] <- 1 }
return(x)
})
But it returns an error (because of the way I am using the if statement with rasters?):
Error in (function (x, fun, filename = "", recycle = TRUE, forcefun = FALSE, :
cannot use this formula, probably because it is not vectorized
I wonder if there is a simple way to implement something similar to that that works with rasters? Thank you in advance.
You were really close,
overlay(r, r2, fun = function(x, y) {x[x == 2 & y == 1] <- 1; x})
seems to do the job.
In terms of your solution,
x[x == 2] <- 1
doesn't cause any errors, although it's not exactly what you want to use in your case either. However,
if (x[x == 2] & y[y == 1])
is a problem because x[x == 2] & y[y == 1] returns a logical vector that generally has more than one element, while if wants just a single logical value. Subsetting, on the other hand, accepts a logical index of any length, which is exactly what is happening in x[x == 2 & y == 1].
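The same distinction can be seen with plain numeric vectors, which is roughly what the function passed to overlay works on. This is only an illustration with made-up cell values; x plays the role of t1 and y the role of t2, and no raster package is needed to run it.

```r
x <- c(2, 1, 2, 2, 1)   # toy t1 cell values
y <- c(1, 1, 2, 1, 2)   # toy t2 cell values

# Subsetting each vector separately loses the cell positions:
x[x == 2]               # 2 2 2  (three cells, positions 1, 3, 4)
y[y == 1]               # 1 1 1  (three cells, positions 1, 2, 4)

# The element-wise test keeps one TRUE/FALSE per cell
cond <- x == 2 & y == 1
cond
# TRUE FALSE FALSE TRUE FALSE

# A logical index of full length changes only the matching cells
x_fixed <- x
x_fixed[cond] <- 1
x_fixed
# 1 1 2 1 1
```

Passing cond to if would fail or warn, depending on the R version, because it has five elements rather than one.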

How do I pass multiple columns to a function within dplyr::summarize

I am trying to pass all columns from a data.frame matching a criterion to a function within the summarize function of dplyr as follows:
df %>% group_by(Version, Type) %>%
summarize(mcll(TrueClass, starts_with("pred")))
Error: argument is of length zero
Is there a way to do this? A working example follows:
Build a simulated data.frame of sample predictions. These are interpreted as the output of a classification algorithm.
library(dplyr)
nrow <- 40
ncol <- 4
set.seed(567879)
getProbs <- function(i) {
p <- runif(i)
return(p / sum(p))
}
df <- data.frame(matrix(NA, nrow, ncol))
for (i in seq(nrow)) df[i, ] <- getProbs(ncol)
names(df) <- paste0("pred.", seq(ncol))
add a column indicating the true class
df$TrueClass <- factor(ceiling(runif(nrow, min = 0, max = ncol)))
add categorical columns for sub-setting
df$Type <- c(rep("a", nrow / 2), rep("b", nrow / 2))
df$Version <- rep(1:4, times = nrow / 4)
now I want to calculate the Multiclass LogLoss for these predictions using the function below:
mcll <- function (act, pred)
{
if (class(act) != "factor") {
stop("act must be a factor")
}
pred[pred == 0] <- 1e-15
pred[pred == 1] <- 1 - 1e-15
dummies <- model.matrix(~act - 1)
if (nrow(dummies) != nrow(pred)) {
return(0)
}
return(-1 * (sum(dummies * log(pred)))/length(act))
}
this is easily done with the entire data set
act <- df$TrueClass
pred <- df %>% select(starts_with("pred"))
mcll(act, pred)
but I want to use dplyr group_by to calculate mcll for each subset of the data
df %>% group_by(Version, Type) %>%
summarize(mcll(TrueClass, starts_with("pred")))
Ideally I could do this without changing the mcll() function, but I am open to doing that if it simplifies the other code.
Thanks!
EDIT: Note that the input to mcll is a vector of true values and a matrix of probabilities with one column for each "pred" column. For each subset of data, mcll should return a scalar. I can get exactly what I want with the code below, but I was hoping for something within the context of dplyr.
mcll_df <- data.frame(matrix(ncol = 3, nrow = 8))
names(mcll_df) <- c("Type", "Version", "mcll")
count = 1
for (ver in unique(df$Version)) {
for (type in unique(df$Type)) {
subdat <- df %>% filter(Type == type & Version == ver)
val <- mcll(subdat$TrueClass, subdat %>% select(starts_with("pred")))
mcll_df[count, ] <- c(Type = type, Version = ver, mcll = val)
count = count + 1
}
}
head(mcll_df)
Type Version mcll
1 a 1 1.42972507510096
2 b 1 1.97189000832723
3 a 2 1.97988830406062
4 b 2 1.21387875938737
5 a 3 1.30629638026735
6 b 3 1.48799237895462
This is easy to do using data.table:
library(data.table)
setDT(df)[, mcll(TrueClass, .SD), by = .(Version, Type), .SDcols = grep("^pred", names(df))]
# Version Type V1
#1: 1 a 1.429725
#2: 2 a 1.979888
#3: 3 a 1.306296
#4: 4 a 1.668330
#5: 1 b 1.971890
#6: 2 b 1.213879
#7: 3 b 1.487992
#8: 4 b 1.171286
I had to change the mcll function a little bit but then it worked. The problem is occurring with the second if statement. You are telling the function to get nrow(pred), but if you are summarizing over multiple columns you are actually only supplying a vector each time (because each column gets analyzed separately). Additionally, I switched the order of the arguments being entered into the function.
mcll <- function (pred, act)
{
if (class(act) != "factor") {
stop("act must be a factor")
}
pred[pred == 0] <- 1e-15
pred[pred == 1] <- 1 - 1e-15
dummies <- model.matrix(~act - 1)
if (nrow(dummies) != length(pred)) { # the main change is here
return(0)
}
return(-1 * (sum(dummies * log(pred)))/length(act))
}
From there we can use the summarise_each function.
df %>% group_by(Version,Type) %>% summarise_each(funs(mcll(., TrueClass)), matches("pred"))
Version Type pred.1 pred.2 pred.3 pred.4
(int) (chr) (dbl) (dbl) (dbl) (dbl)
1 1 a 1.475232 1.972779 1.743491 1.161984
2 1 b 2.030829 1.331629 1.397577 1.484865
3 2 a 1.589256 1.740858 1.898906 2.005511
I checked this against a subset of the data and it looks like it works.
mcll(df$pred.1[which(df$Type=="a" & df$Version==1)],
df$TrueClass[which(df$Type=="a" & df$Version==1)])
[1] 1.475232 #pred.1 mcll when Version equals 1 and Type equals a.
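If a single scalar per group is wanted, with all pred columns passed together as in the EDIT to the question, a base R sketch is to split the data frame by group and call the function on each piece. This is not a dplyr solution, and the data below are regenerated in a slightly different way, so the numbers will not match the output shown in the question. The mcll here is a lightly condensed version of the question's function (it converts pred to a matrix and drops the row-count check).

```r
# Simulated predictions: each row of probabilities sums to 1
set.seed(567879)
n <- 40
k <- 4
p  <- matrix(runif(n * k), n, k)
df <- as.data.frame(p / rowSums(p))
names(df) <- paste0("pred.", seq_len(k))
df$TrueClass <- factor(sample(seq_len(k), n, replace = TRUE), levels = seq_len(k))
df$Type      <- rep(c("a", "b"), each = n / 2)
df$Version   <- rep(1:4, times = n / 4)

# Multiclass log loss: act is a factor, pred has one column per class
mcll <- function(act, pred) {
  stopifnot(is.factor(act))
  pred <- as.matrix(pred)
  pred[pred == 0] <- 1e-15
  pred[pred == 1] <- 1 - 1e-15
  dummies <- model.matrix(~ act - 1)   # one indicator column per factor level
  -sum(dummies * log(pred)) / length(act)
}

# One data frame per Version/Type combination, then one scalar per group
pred_cols <- grep("^pred", names(df))
groups    <- split(df, list(df$Version, df$Type))
res       <- sapply(groups, function(d) mcll(d$TrueClass, d[pred_cols]))
res
```

The result is a named vector with one entry per group, with names such as "1.a" built by split from the Version and Type values.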

Repeated conditional change with sapply or a loop in R

I am trying to do a conditional change for a list of 11 columns in R. My conditional is always the same survey$only0 == 1. I wrote the following code:
survey$w.house[survey$only0 == 1] <- 1
survey$w.inc[survey$only0 == 1] <- 1
survey$w.jobs[survey$only0 == 1] <- 1
survey$w.com[survey$only0 == 1] <- 1
survey$w.edu[survey$only0 == 1] <- 1
survey$w.env[survey$only0 == 1] <- 1
survey$w.health[survey$only0 == 1] <- 1
survey$w.satisf[survey$only0 == 1] <- 1
survey$w.safe[survey$only0 == 1] <- 1
survey$w.bal[survey$only0 == 1] <- 1
survey$w.civic[survey$only0 == 1] <- 1
My code works well, but I would like to shorten it using a loop or a function such as sapply or lapply. Does anyone know how to do it?
Thank you for your help !
David
We can do this easily with lapply by looping through the columns of interest ('nm1'), and replace the values of it to 1 where 'only0' is 1.
survey[nm1] <- lapply(survey[nm1], function(x) replace(x, survey$only0==1, 1))
Or, as @Vlo mentioned, the anonymous function call is not needed:
survey[nm1] <- lapply(survey[nm1], replace, list = survey$only0==1, values=1)
where
nm1 <- c("w.house", "w.inc", "w.jobs", "w.com", "w.edu", "w.env",
"w.health", "w.satisf", "w.safe", "w.bal", "w.civic")
You can try,
survey[survey$only0 == 1, cols] <- 1
where cols are the columns for which you want to check the condition.
cols <- c("w.house", "w.inc", "w.jobs", "w.com", "w.edu", "w.env",
"w.health", "w.satisf", "w.safe", "w.bal", "w.civic")
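A small check on made-up data suggests that the lapply version and the direct row/column assignment give the same result. The toy survey below has only two of the eleven columns plus an unrelated column called other, which should stay untouched.

```r
# Toy survey data (hypothetical values)
survey <- data.frame(only0   = c(1, 0, 1, 0),
                     w.house = c(5, 6, 7, 8),
                     w.inc   = c(2, 3, 4, 5),
                     other   = c(9, 9, 9, 9))
cols <- c("w.house", "w.inc")

# Approach 1: lapply over the columns of interest
a <- survey
a[cols] <- lapply(a[cols], function(x) replace(x, a$only0 == 1, 1))

# Approach 2: assign to the selected rows and columns directly
b <- survey
b[b$only0 == 1, cols] <- 1

a$w.house
# 1 6 1 8
all.equal(a, b)
```

If only0 can contain NA, wrapping the condition in which(), as in survey[which(survey$only0 == 1), cols] <- 1, avoids an error from missing values in the row index.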

Extracting pixels from a raster based on specific value of another raster using R

I imported two rasters (raster A and B) into R using the raster function.
I would like to extract the pixels of A where B equals 1 into a data frame.
I am trying the following, however, all the pixels I obtain have the same values, although they are various in the original dataset.
The two rasters have the same dimensions (ncols, nrows, ncell, resolution, extent, projection).
library(raster)
library(rgdal)
# import inputs
A <- raster('/pat/to/rasterA.tif')
B <- raster('/pat/to/rasterB.tif')
# extract raster values from A over raster B where B == 1
mydata <- data.frame(A[B[B == 1]])
EDIT 1
Might be that when I do A[B[B == 1]], the class of object A and B from RasterLayer becomes numeric, and this creates problems? I discovered this by doing class(A[B[B == 1]]), which gives numeric.
EDIT 2
Ok, this is weird. I tried mydata <- data.frame(A[B]) and now the output contains the original A values only at the B == 1 locations. When I tried this before, it extracted all the pixels from A (as I would expect). I can confirm it is right by counting the number of ones in B and the number of elements in mydata, which are the same. It is as if the indexing has skipped all the zeros in B. Can anyone explain this?
Example data:
library(raster)
r <- raster(nrow=5, ncol=5, xmn=0, xmx=1, ymn=0, ymx=1)
set.seed(1010)
A <- setValues(r, sample(0:5, ncell(r), replace=TRUE))
B <- setValues(r, sample(0:2, ncell(r), replace=TRUE))
Now you can do:
s <- stack(A,B)
v <- as.data.frame(s)
v[v[,2] == 1, 1]
Alternatively:
A[B==1]
Or:
D <- overlay(A, B, fun=function(x,y){ x[y!=1] <- NA; x})
na.omit(values(D))
Or:
xy <- rasterToPoints(B, function(x) x == 1)
extract(A, xy[,1:2])
Or:
A[B!=1] <- NA
rasterToPoints(A)[, 3]
etc...
Now why does this: A[B[B == 1]] not work? Unpack it:
B[B == 1]
# [1] 1 1 1 1 1 1 1 1 1 1
The cell values of B where B==1 are, of course, 1. A[B[B == 1]] thus becomes A[c(1,1,1,..)], and returns the value of the first cell many times.
A[B] is equivalent to A[B != 0], because B is treated as a logical index in this case: 0 is FALSE and all other values are TRUE.
this should work for the 1 values:
A[B == 1]
as well as this, for the 0 values
A[B == 0]
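The indexing mistake can be reproduced with plain vectors, without the raster package. This is only an analogy with made-up values: A and B here are numeric vectors, not RasterLayer objects, and the last line uses as.logical() to mimic the logical treatment of B described above.

```r
A <- c(10, 20, 30, 40, 50)   # toy "pixel" values
B <- c(0, 1, 2, 1, 0)        # toy mask values

B[B == 1]
# 1 1          the values of B where B is 1, not their positions

A[B[B == 1]]
# 10 10        so this asks for element 1 of A, twice

A[B == 1]
# 20 40        the logical index keeps the positions

A[as.logical(B)]
# 20 30 40     0 is FALSE, every other value is TRUE
```

The last result also includes the cell where B is 2, which is why A[B] only matches A[B == 1] when B contains nothing but zeros and ones.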
