Basically, I have a time-series of rasters in a stack. Here is my workflow:
Convert the stack to a data frame so that each row represents a pixel and each column represents a date. This process is fairly straightforward, so no issues here.
For each row (pixel), identify outliers and set them to NA. I want to define the outlier threshold myself; for example, set all values larger than the 75th percentile to NA. The goal is that when I calculate the mean, the outliers don't affect the calculation: in this case they are several orders of magnitude higher, so they influence the mean significantly.
I got some help online and came up with this code:
my_data %>%
  rowwise() %>%
  mutate(across(is.numeric, ~ if (. > as.numeric(quantile(across(), .75, na.rm=TRUE))) NA else .))
The problem is that since it is a raster, there are a lot of NA values in some rows, and I need the quantile function to ignore them while evaluating the cells (see below). Using na.rm=TRUE seemed to be the solution, but now I am encountering a new error:
Error: Problem with mutate() input ..1.
i ..1 = across(...).
x missing value where TRUE/FALSE needed
i The error occurred in row 1.
I understand that to get around this, I need to tell the if function to ignore the value if it is NA, but the dplyr syntax is very complicated for me, so I need some help on how to do this.
Looking forward to learning more, and to hearing whether there is a better way to do what I'm trying to do. I don't think I did a good job of explaining it, but hopefully the code helps.
When asking an R question, you should always include some example data. Either create data with code (see below) or use a file that ships with R (do not use dput if it can be avoided). See the help files that ship with R, or other questions on this site, for examples and inspiration.
Example data:
library(terra)
r <- rast(ncols=10, nrows=10, nlyr=10)
set.seed(1)
v <- runif(ncell(r) * nlyr(r))   # one value for each cell in each layer
v[sample(length(v), 100)] <- NA  # add some missing values
values(r) <- v
Solution:
First, write a function that does what you want and works on a vector:
f <- function(x) {
  # set values above the 75th percentile to NA
  q <- quantile(x, .75, na.rm=TRUE)
  x[x > q] <- NA
  x
}
Now apply it to the raster data:
x <- app(r, f)
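If the end goal from the question is a per-pixel mean that the outliers no longer affect, you can follow up with app() again (a sketch, assuming that is the goal):
m <- app(x, mean, na.rm=TRUE)   # per-pixel mean across layers, ignoring the trimmed values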
With the raster package, it would go like this:
library(raster)
rr <- brick(r)
xx <- calc(rr, f)
Note that you should not create a data.frame; but if you did, you could do something like dd <- t(apply(d, 1, f)) (spelled out below).
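Spelled out, that data.frame route could look like this (a sketch only; d stands for the pixels-by-dates table described in the question):
d <- as.data.frame(r, na.rm=FALSE)   # one row per pixel, one column per layer
dd <- t(apply(d, 1, f))              # apply f to each row; t() restores the orientation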
Related
I have a data set with Air Quality data. The data frame has 153 rows and 5 columns.
I want to find the mean of the first column in this Data Frame.
There are missing values in the column, so I want to exclude those while finding the mean.
And finally, I want to do that using control structures (for loops and if-else statements).
I have tried writing code as seen below. I have created 'y' instead of the actual Air Quality data set to have a reproducible example.
y <- c(1,2,3,NA,5,6,NA,NA,9,10,11,NA,13,NA,15)
x <- matrix(y,nrow=15)
for(i in 1:15){
  if(is.na(data.frame[i,1]) == FALSE){
    New.Vec <- c(x[i,1])
  }
}
print(mean(New.Vec))
I expected the output to be the mean, but the error I received is this:
Error: object 'New.Vec' not found
One line of code, no need for a for loop.
mean(data.frame$name_of_the_first_column, na.rm = TRUE)
Setting na.rm = TRUE makes the mean function ignore NAs.
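For example, with the reproducible vector y from your question:
y <- c(1,2,3,NA,5,6,NA,NA,9,10,11,NA,13,NA,15)
mean(y, na.rm = TRUE)   # 7.5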
Here, we can make use of na.aggregate from zoo
library(zoo)
df1[] <- na.aggregate(df1)
This assumes that 'df1' is a data.frame with all-numeric columns and that you want to fill the NA elements with the corresponding mean of that column. By default, na.aggregate uses mean as its fun.aggregate.
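A minimal sketch with made-up data, under that assumption:
library(zoo)
df1 <- data.frame(a = c(1, NA, 3), b = c(NA, 5, 7))
df1[] <- na.aggregate(df1)   # each NA becomes its column mean: 2 for 'a', 6 for 'b'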
I can't see your data, but it's probably something like this: the vector needed to be initialized. It's better to avoid loops in R when you can...
myDataFrame <- read.csv("hw1_data.csv")
New.Vec <- c()
for(i in 1:153){
  if(!is.na(myDataFrame[i,1])){
    New.Vec <- c(New.Vec, myDataFrame[i,1])
  }
}
print(mean(New.Vec))
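And here is the same result without the loop, using logical indexing (myDataFrame as above):
New.Vec <- myDataFrame[!is.na(myDataFrame[,1]), 1]   # keep only non-NA values of column 1
mean(New.Vec)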
I think this is a very beginner question, but searching the web (and SO) hasn't led me to figure out the answer despite trying quite a few solutions. Here's the problem:
I have a csv dataset with many columns, for example: yearID X Y Z. I read this in using: data<-read.csv("/foo/bar.csv")
From there, I use X, Y, and Z to calculate A for each line: data$A <- (data$X + data$Y)/data$Z
Now I want to plot the average A in each year, so I do: list_df <- split(data, data$yearID). Hooray, I can see that if I do summary(list_df[[5]]) I see a summary of X Y Z and A for the fifth year.
Here is where I'm stuck, I then try to do something like:
for(year in list_df){
  xy <- data.frame(mean(year$yearID, na.rm=T), mean(year$A, na.rm=T))
}
This loop "works" (it doesn't throw an error), but what comes out in xy is just the last year and the average A for that year. Ideally, I want to eventually plot "Avg A vs YearID." I've tried a number of permutations on the for loop based on other code examples I've found, but none have yet given me a working solution. Suggestions are most welcome to any part of this process, as I've just started learning R.
Cheers,
Zach
Unless you need the list split out for other reasons, you can use aggregate:
data <- data.frame(yearId=rep(2010:2014,each=2),X=runif(10,1,100),Y=runif(10,50,150),Z=runif(10,100,200))
data$A <- (data$X+data$Y)/data$Z
data2 <- aggregate(A~yearId,data,mean)
plot(data2$yearId,data2$A)
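If you do want to keep the split list from your question, sapply over it gives one mean per year (this assumes list_df as defined in your post):
avgA <- sapply(list_df, function(d) mean(d$A, na.rm = TRUE))
plot(as.numeric(names(avgA)), avgA, xlab = "yearID", ylab = "Avg A")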
This is a follow up question to my earlier post (covariance matrix by group) regarding a large data set. I have 6 variables (HML, RML, FML, TML, HFD, and BIB) and I am trying to create group specific covariance matrices for them (based on variable Group). However, I have a lot of missing data in these 6 variables (not in Group) and I need to be able to use that data in the analysis - removing or omitting by row is not a good option for this research.
I narrowed the data set down into a matrix of the actual variables of interest with:
>MMatrix = MMatrix2[1:2187,4:10]
This worked fine for calculating an overall covariance matrix with:
>cov(MMatrix, use="pairwise.complete.obs",method="pearson")
So to get this to list the covariance matrices by group, I turned the original data matrix into a data frame (so I could use the $ indicator) with:
>CovDataM <- as.data.frame(MMatrix)
I then used the following suggested code to get covariances by group, but it keeps returning NULL:
>cov.list <- lapply(unique(CovDataM$group),function(x)cov(CovDataM[CovDataM$group==x,-1]))
I figured this was because of my NAs, so I tried adding use = "pairwise.complete.obs" as well as use = "na.or.complete" (when desperate) to the end of the code, and it still returned NULLs. I read somewhere that "pairwise.complete.obs" can only be used if method = "pearson", but adding that at the end didn't make a difference either. I need to get covariance matrices of these variables by group, with all the available data included if possible, and I am way stuck.
Here is an example that should get you going:
# Create some fake data
m <- matrix(runif(6000), ncol=6,
            dimnames=list(NULL, c('HML', 'RML', 'FML', 'TML', 'HFD', 'BIB')))
# Insert random NAs
m[sample(6000, 500)] <- NA
# Create a factor indicating group levels
grp <- gl(4, 250, labels=paste('group', 1:4))
# Covariance matrices by group
covmats <- by(m, grp, cov, use='pairwise')
The resulting object, covmats, is a list with four elements (in this case), which correspond to the covariance matrices for each of the four groups.
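You can pull a single group's matrix out by position, for example:
covmats[[1]]   # the 6x6 covariance matrix for 'group 1'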
Your problem is that lapply is treating your list oddly. If you run this code (which I hope is pretty much analogous to yours):
CovData <- matrix(1:75, 15)
CovData[3,4] <- NA
CovData[1,3] <- NA
CovData[4,2] <- NA
CovDataM <- data.frame(CovData, "group" = c(rep("a",5),rep("b",5),rep("c",5)))
colnames(CovDataM) <- c("a","b","c","d","e", "group")
lapply(unique(as.character(CovDataM$group)), function(x) print(x))
You can see that lapply is evaluating the list in a different manner than you intend. The NAs don't appear to be the problem. When I run:
by(CovDataM[ ,1:5], CovDataM$group, cov, use = "pairwise.complete.obs", method = "pearson")
It seems to work fine. Hopefully that generalizes to your problem.
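Translated to the data described in the question (column and group names assumed from the post; adjust them to your actual ones):
cov.list <- by(CovDataM[, c("HML","RML","FML","TML","HFD","BIB")],
               CovDataM$group, cov,
               use = "pairwise.complete.obs", method = "pearson")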
I am running into some difficulty after trying to divide each element in a given row by that row's mean. A dummy set of data:
set.seed(1)
x <- cbind(Plant = letters[1:5],
           as.data.frame(matrix(rnorm(60), ncol = 12)))
x
Therefore, for Plant a, I would like V1, V2, ..., V12 to be divided by the mean of that row.
I thought it could be done using:
x/rowMeans(x)
But I get the error:
Error in rowMeans(x) : 'x' must be numeric
I assume that this error is due to the format of the data, because it's a data.frame and not a vector. I did, however, manage to calculate the mean per row by changing the data's format:
library(data.table)
x.T <- as.data.table(x)
x.T[,list(Mean=rowMeans(.SD)), by=Plant]
From there, I am not sure where to go. I am thinking that a loop would work, but from some searching I see that loops are not advised. I would therefore like to have the normalized data for each Plant sample. Any suggestions, please?
The first error comes from trying to take the mean including the Plant variable/column, which is non-numeric. Try:
cbind(x$Plant, x[,-1]/rowMeans(x[,-1]))
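As a quick sanity check (a sketch; x["Plant"] is used instead of x$Plant just to keep the column name):
norm <- cbind(x["Plant"], x[,-1] / rowMeans(x[,-1]))
rowMeans(norm[,-1])   # every row mean is now 1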
How do I tell R to remove an outlier when calculating correlation? I identified a potential outlier from a scatter plot, and am trying to compare correlation with and without this value. This is for an intro stats course; I am just playing with this data to start understanding correlation and outliers.
My data looks like this:
"Australia" 35.2 31794.13
"Austria" 29.1 33699.6
"Canada" 32.6 33375.5
"CzechRepublic" 25.4 20538.5
"Denmark" 24.7 33972.62
...
and so on, for 26 lines of data. I am trying to find the correlation of the first and second numbers.
I did read this question; however, I am only trying to remove a single point, not a percentage of points. Is there a command in R to do this?
You can't do that with the basic cor() function, but you can either
use a correlation function from one of the robust statistics packages, e.g. covRob() from package robust, or
use a winsorize() function, e.g. from robustHD, to treat your data.
Here is a quick example of the second approach:
R> set.seed(42)
R> x <- rnorm(100)
R> y <- rnorm(100)
R> cor(x,y) # correlation of two unrelated series: almost zero
[1] 0.0312798
Then we "contaminate" one point in each with a big outlier:
R> x[50] <- y[50] <- 10
R> cor(x,y) # bigger correlation due to one bad data point
[1] 0.534996
So let's winsorize:
R> x <- robustHD::winsorize(x)
R> y <- robustHD::winsorize(y)
R> cor(x,y)
[1] 0.106519
and we're back down to a less correlated measure.
If you apply the same conditional expression to both vectors, you can exclude that "point":
cor( DF[2][ DF[2] > 100 ], # items in 2nd column excluded based on their values
DF[3][ DF[2] > 100 ] ) # items in 3rd col excluded based on the 2nd col values
In the following, I worked from the presumption (which I read between your lines) that you have identified that single outlier visually (i.e., from a graph). With your limited data set it's probably easy to identify that point based on its value. If you had more data points, you could use something like this:
tmp <- qqnorm(bi$bias.index)
qqline(bi$bias.index)
(X <- identify(tmp, , labels=rownames(bi)))
qqnorm(bi$bias.index[-X])
qqline(bi$bias.index[-X])
Note that I just copied my own code because I couldn't work from sample code from you. Also check ?identify first.
It makes sense to put all your data in a data frame, so it's easier to handle.
I always like to keep track of outliers by using an extra column (in this case, B) in my data frame.
df <- data.frame(A=c(1,2,3,4,5), B=c(T,T,T,F,T))
And then filter out data I don't want before getting into the good analytical stuff.
myFilter <- with(df, B==T)
df[myFilter, ]
This way, you don't lose track of the outliers, and you are able to manage them as you see fit.
EDIT:
Improving upon my answer above, you could also use conditionals to define the outliers.
df <- data.frame(A=c(1,2,15,1,2))
df$B<- with(df, A > 2)
subset(df, B == F)
You are getting some great and informative answers here, but they seem to be answers to more complex questions. Correct me if I'm wrong, but it sounds like you just want to remove a single observation by hand. Specifying the negative of its index will remove it.
Assuming your data frame is A and the columns are V1 and V2:
WithAus <- cor(A$V1, A$V2)
WithoutAus <- cor(A$V1[-1], A$V2[-1])
Or you can remove several indices; let's say 1, 5 and 20:
ToRemove <- c(-1, -5, -20)
WithAus <- cor(A$V1, A$V2)
WithoutAus <- cor(A$V1[ToRemove], A$V2[ToRemove])