Accessing Data after Splitting into Lists - r

I think this is a very beginner question, but searching the web (and SO) hasn't led me to figure out the answer despite trying quite a few solutions. Here's the problem:
I have a csv dataset with many columns, for example: yearID X Y Z. I read this in using: data<-read.csv("/foo/bar.csv")
From there, I use X Y and Z to calculate A for each line: data$A<-(X+Y)/Z
Now I want to plot the average A in each year, so I do: list_df <- split(data, data$yearID). Hooray, I can see that if I do summary(list_df[[5]]) I see a summary of X Y Z and A for the fifth year.
Here is where I'm stuck, I then try to do something like:
for (year in list_df) {
  xy <- data.frame(mean(year$yearID, na.rm=T), mean(year$A, na.rm=T))
}
This loop "works" (it doesn't throw an error), but what comes out in xy is just the last year and the average A for that year. Ideally, I want to eventually plot "Avg A vs YearID." I've tried a number of permutations on the for loop based on other code examples I've found, but none have yet given me a working solution. Suggestions are most welcome to any part of this process, as I've just started learning R.
Cheers,
Zach

Unless you need the list split out for other reasons, you can use aggregate:
data <- data.frame(yearId=rep(2010:2014,each=2),X=runif(10,1,100),Y=runif(10,50,150),Z=runif(10,100,200))
data$A <- (data$X+data$Y)/data$Z
data2 <- aggregate(A~yearId,data,mean)
plot(data2$yearId,data2$A)
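If you do want to keep the split-list approach from the question, the fix is to build one result row per list element and bind them together, rather than overwriting xy on every pass. A sketch using the example data above:
list_df <- split(data, data$yearId)
xy <- do.call(rbind, lapply(list_df, function(year) {
  data.frame(yearId = year$yearId[1], meanA = mean(year$A, na.rm = TRUE))
}))
plot(xy$yearId, xy$meanA)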

Related

Removing outliers in time series rasters per pixel in R

Basically, I have a time-series of rasters in a stack. Here is my workflow:
Convert the stack to a data frame so each row represents a pixel and each column represents a date. This process is fairly straightforward, so no issues here.
For each row (pixel), identify outliers and set them to NA. In this case, I want to define what counts as an outlier; for example, let's say I want to set all the values larger than the 75th percentile to NA. The goal is that when I calculate the mean, the outliers don't affect the calculation. The outliers in this case are several orders of magnitude higher, so they influence the mean significantly.
I got some help online and came up with this code:
my_data %>%
  rowwise() %>%
  mutate(across(is.numeric, ~ if (. > as.numeric(quantile(across(), .75, na.rm=TRUE))) NA else .))
The problem is that since it is a raster, there are a lot of NA values in some rows that I need the quantile function to ignore while evaluating the cells (see below).
Using na.rm=TRUE seemed to be the solution, but now I am encountering a new error:
Error: Problem with `mutate()` input `..1`.
ℹ `..1 = across(...)`.
✖ missing value where TRUE/FALSE needed
ℹ The error occurred in row 1.
I understand that to get around this, I need to tell the if function to ignore the value if it is NA, but the dplyr syntax is very complicated for me, so I need some help on how to do this.
Looking forward to learning more, and to hearing whether there is a better way to do what I'm trying to do. I don't think I did a good job explaining it, but hopefully the code helps.
When asking an R question, you should always include some example data. Either create data with code (see below) or use a file that ships with R (do not use dput if it can be avoided). See the help files that ship with R, or other questions on this site, for examples and inspiration.
Example data:
library(terra)
r <- rast(ncols=10, nrows=10, nlyr=10)
set.seed(1)
v <- runif(ncell(r) * nlyr(r))
v[sample(ncell(r) * nlyr(r), 100)] <- NA
values(r) <- v
Solution:
First write a function that does what you want, and works with a vector
f <- function(x) {
  q <- quantile(x, .75, na.rm=TRUE)
  x[x > q] <- NA
  x
}
Now apply it to the raster data
x <- app(r, f)
With the raster package it would go like this:
library(raster)
rr <- brick(r)
xx <- calc(rr, f)
Note that you should not create a data.frame, but if you did, you could do something like dd <- t(apply(d, 1, f)), where d is that data.frame.
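For completeness, if you do keep the pixel-by-date data.frame despite the advice above and want to stay with dplyr as in the question, one way to avoid the NA-in-if error is to compute the row threshold once and then mask with a vectorised test. This is only a sketch, assuming my_data is the all-numeric data.frame from the question:
library(dplyr)
my_data %>%
  rowwise() %>%
  mutate(q75 = quantile(c_across(where(is.numeric)), 0.75, na.rm = TRUE)) %>%
  mutate(across(-q75, ~ ifelse(!is.na(.x) & .x > q75, NA_real_, .x))) %>%
  ungroup() %>%
  select(-q75)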

Loop Aggregate with Weighted Mean in R

Apologies in advance for the wording; English is not my native language and this is my first post. I have been able to aggregate my data to this point, but am having issues condensing it further. I am trying to get the weighted average depth by biomass of several species.
My data currently has columns (station, time, layer, depth, biomass_X, biomass_Y, biomass_Z, ...) and I want to condense it to (station, time, weighted_depth_X, weighted_depth_Y, weighted_depth_Z, ...).
I got this code to work, but is there a way to loop it so it can complete all my columns?
library(plyr)
newData<-ddply(data, ~station+time, summarize, weighted.mean(data[,6], w=depth))
There is certainly a nicer way but this should work:
# data: dataframe containing columns to be averaged
# weights: vector containing the corresponding weights
weighted_mean_all_cols <- function(data, weights){
  res <- do.call(cbind, llply(colnames(data), function(col) {
    weighted.mean(data[,col], w=weights)
  }))
  colnames(res) <- colnames(data)
  res
}
# positions of the target columns to average
targetCols <- grep("^biomass", colnames(data))
# apply the weighted average by group, for every target column
newData <- ddply(data, c('station','time'), function(groupDF) {
  print(groupDF[targetCols])
  weighted_mean_all_cols(groupDF[,targetCols], groupDF$depth)
})
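For what it is worth, the same per-group calculation can be sketched with dplyr, assuming the biomass_* column layout from the question; as in the code above, each biomass column is averaged with depth as the weight:
library(dplyr)
newData <- data %>%
  group_by(station, time) %>%
  summarise(across(starts_with("biomass"),
                   ~ weighted.mean(.x, w = depth, na.rm = TRUE)),
            .groups = "drop")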

Applying function to entire data-set x number of times for each observation

I am trying to apply a function to a simulated data set (the easy part), though I have to apply this function x number of times. My overall goal in this project is to simulate abundances of species over time for many populations over many years. In this example we will work with 15 species over thirty years for one community.
I have used a function and called it:
curve<-function(Ao,m,r,a,g){(Ao*((((x-m)/r)+(a/(a+g)))^a)*((1-(((x-m)/r)+(a/(a+g))))^g))/(((a/(a+g))^a)*((1-(a/(a+g)))^g))}
x<-seq(1,365, by=14) # the times at which abundances are sampled; note that curve() reads this x from the calling environment
I then run a loop and create an abundance table, along with a table giving me the values of each variable.
TotSpecies<-15
Community<-30
for(n in 1:TotSpecies){
  Ao<-rlnorm(TotSpecies,3,2)
  m<-sample(seq(min(x)+5:max(x)-5),TotSpecies)
  r<-runif(TotSpecies,min=0,max=max(x))
  a<-(runif(TotSpecies,min=.1,max=4))
  g<-(runif(TotSpecies,min=.1,max=4))
}
Abundance <- matrix(0,nrow=length(x),ncol=TotSpecies)
colnames(Abundance) = c("Sp1","Sp2","Sp3","Sp4","Sp5","Sp6","Sp7","Sp8","Sp9","Sp10","Sp11","Sp12","Sp13","Sp14","Sp15")
for(L in 1:TotSpecies){
  Abundance[,L] <- curve(Ao[L],m[L],r[L],a[L],g[L])
}
#Alter matrix to remove NaNs and replace them with zeroes
Abundance.NA<-is.na(Abundance)
Abundance[Abundance.NA]<-0 #this makes Abundance have 0's where abundance is NaN
Pres.Abs<-Abundance
Pres.Abs[Pres.Abs>0]<-1 #presence-absence matrix
#creates a data frame with the values of each variable
Species<-1:TotSpecies
Year<-rep(1,TotSpecies)
year1data<-data.frame(Species,Year,Ao,m,r,a,g)
At this point, I only have data of abundances for one year and one community. Now I want to simulate for this community over thirty years, altering the abundances of species sequentially from year to year by adding error.
TotSpeciesData <- do.call(
  rbind,                                    # bind the per-species tables by rows
  lapply(                                   # apply the function to each element of the list
    split(year1data, year1data$Species),    # split the data into groups by species
    function(data)
      with(
        data,
        data.frame(Species=Species, Year=1:Community,
                   Ao=c(Ao, Ao + cumsum(rnorm((TotSpecies-1),0,2))),
                   m=m, r=r, a=a, g=g)      # one row per year: Species, Year, and the parameters
      )))
TotSpeciesData$Ao[TotSpeciesData$Ao<0]<-0 #any values less than 0 go to 0
TotSpeciesData<-TotSpeciesData[order(TotSpeciesData$Year),] #orders the data frame by Year
This is now a data frame with each given variable for each species for each year. Now I do not know how to apply the function to this table and create an abundance table that has all fifteen species for the thirty years.
I started off thinking that a nested loop would be best, instead of trying to use an apply function, because I thought the apply functions could not handle running the function x number of times (or am I wrong about this?).
TotSpeciesAbundance<-matrix(0,nrow=Community*length(x),ncol=TotSpecies)
colnames(TotSpeciesAbundance) = c("Sp1","Sp2","Sp3","Sp4","Sp5","Sp6","Sp7","Sp8","Sp9","Sp10","Sp11","Sp12","Sp13","Sp14","Sp15")
Year<-rep(1:Community, each=length(x))
TotSpeciesAbundance<-cbind(TotSpeciesAbundance,Year)
for(p in 1:450){
  for(j in 1:TotSpecies){
    Ao<-TotSpeciesData$Ao
    m<-TotSpeciesData$m
    r<-TotSpeciesData$r
    a<-TotSpeciesData$a
    g<-TotSpeciesData$g
    TotSpeciesAbundance[,j]<- curve(Ao[j],m[j],r[j],a[j],g[j])
  }
}
I have tried a number of different ways to alter the double loop, but cannot find a way to get it to work. This may be a bit amateurish, but can anyone help with this?
I still don't really understand what you want, but I can make a guess. Your second for loop contains the same problem as your first one: you overwrite the same data.frame 450 times. I think what you intended is something like this:
# Bug fix: make TotSpeciesAbundance a data.frame so with works
TotSpeciesAbundance<-data.frame(TotSpeciesAbundance)
# Make a storage list beforehand.
big.list<-list()
for(p in 1:450){
  TotSpeciesAbundance <- matrix(nrow=length(x), ncol=TotSpecies)
  for(j in 1:TotSpecies){
    TotSpeciesAbundance[,j] <- with(TotSpeciesData, curve(Ao[j],m[j],r[j],a[j],g[j]))
  }
  big.list[[p]] <- TotSpeciesAbundance
}
But, conveniently, you can replace the inner for:
for(p in 1:450) {
big.list[[p]] <- with(TotSpeciesAbundance,mapply(curve,Ao,m,r,a,g))
}
Which makes it quite clear that you were not only rerunning the thing 450 times, but you were doing it with exactly the same thing. You could replace this with:
replicate(450,with(TotSpeciesAbundance,mapply(curve,Ao,m,r,a,g)),simplify=FALSE)
I am guessing you want to add a bit of noise or something each time, but I can't figure out exactly what you want. Perhaps if you clearly explained what you mean by an abundance matrix, and gave a small example of what the output data should look like, that would help.
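If the goal is one abundance matrix per year, built from that year's parameters already stored in TotSpeciesData, a sketch along those lines (a guess at the intent, not a confirmed reading) would be:
# one length(x)-by-TotSpecies matrix per year, using that year's Ao, m, r, a and g
abund_by_year <- lapply(split(TotSpeciesData, TotSpeciesData$Year), function(d) {
  with(d, mapply(curve, Ao, m, r, a, g))
})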

Using ddply() to Get Frequency of Certain IDs, by Appearance in Multiple Rows (in R)

Goal
If the following description is hard to follow, please see the "before" and "after" examples below for a straightforward illustration.
I have bartering data, with unique trade ids, and two sides of the trade. Side1 and Side2 are baskets, lists of item ids that represent both sides of the barter transaction.
I'd like to count the frequency with which each ITEM appears across TRADES. E.g., if item "001" appeared in 3 trades, I'd have a count of 3 (ignoring how many times the item appeared within each trade).
Further, I'd like to do this with the plyr ddply function.
(If you're interested in my motivation: I'm working with many hundreds of thousands of transactions and am already using ddply to calculate several other summary statistics. I'd like to add this to the ddply call I'm already using, rather than calculate it afterwards and merge it into the ddply output. Sorry if that was difficult to follow.)
In terms of pseudo code I'm working off of:
merge each row of Side1 and Side2
by row, get unique() appearances of each item id
apply table() function
transpose and relabel output from table
Example of the structure of my data, and the output I desire.
Data Example (before):
df <- data.frame(TradeID = c("01","02","03","04"))
df$Side1 <- list(c("001","001","002"),
                 c("002","002","003"),
                 c("001","004"),
                 c("001","002","003","004"))
df$Side2 <- list(c("001"), c("007"), c("009"), c())
Desired Output (after):
df.ItemRelFreq_byTradeID <- data.frame(ItemID = c("001","002","003","004","007","009"),
RelFreq_byTrade = c(3,3,2,2,1,1))
One method to do this without ddply
I've worked out one way to do this below. My problem is that I can't quite seem to get ddply to do this for me.
temp <- table(unlist(sapply(mapply(c,df$Side1,df$Side2), unique)))
df.ItemRelFreq_byTradeID <- data.frame(ItemID = names(temp),
RelFreq_byTrade = temp[])
Thanks for any help you can offer!
Curtis
I believe this will do what you're asking for. It uses ddply. Twice! The first call expands each trade into one long (ItemID, TradeID) row per item on either side; the second counts the number of distinct trades each item appears in.
res <- ddply(df, .(TradeID), function(df)
  data.frame(ItemID = c(df$Side1[[1]], df$Side2[[1]]), TradeID = df$TradeID))
ddply(res, .(ItemID), summarise, RelFreq_byTrade = length(unique(TradeID)))
Note that the ItemIDs are slightly out of order.
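As an aside, since the question mentions hundreds of thousands of transactions: the same two steps can be sketched with dplyr/tidyr (an alternative to the plyr approach asked for, using the example df above):
library(dplyr)
library(tidyr)
df %>%
  mutate(ItemID = Map(c, Side1, Side2)) %>%   # merge both sides of each trade
  select(TradeID, ItemID) %>%
  unnest_longer(ItemID) %>%                   # one row per (trade, item)
  distinct(TradeID, ItemID) %>%               # count each item at most once per trade
  count(ItemID, name = "RelFreq_byTrade")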

How do I tell R to remove the outlier from a correlation calculation?

How do I tell R to remove an outlier when calculating correlation? I identified a potential outlier from a scatter plot, and am trying to compare correlation with and without this value. This is for an intro stats course; I am just playing with this data to start understanding correlation and outliers.
My data looks like this:
"Australia" 35.2 31794.13
"Austria" 29.1 33699.6
"Canada" 32.6 33375.5
"CzechRepublic" 25.4 20538.5
"Denmark" 24.7 33972.62
...
and so on, for 26 lines of data. I am trying to find the correlation of the first and second numbers.
I did read this question; however, I am only trying to remove a single point, not a percentage of points. Is there a command in R to do this?
You can't do that with the basic cor() function, but you can
use a correlation function from one of the robust statistics packages, e.g. robCov() from package robust
use a winsorize() function, e.g. from robustHD, to treat your data
Here is a quick example of the second approach:
R> set.seed(42)
R> x <- rnorm(100)
R> y <- rnorm(100)
R> cor(x,y) # correlation of two unrelated series: almost zero
[1] 0.0312798
The we "contaminate" one point each with a big outlier:
R> x[50] <- y[50] <- 10
R> cor(x,y) # bigger correlation due to one bad data point
[1] 0.534996
So let's winsorize:
R> x <- robustHD::winsorize(x)
R> y <- robustHD::winsorize(y)
R> cor(x,y)
[1] 0.106519
R>
and we're back down to a less correlated measure.
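For the first route (a robust estimator) the example would look similar; here is a hedged sketch using MASS::cov.rob(), which ships with R, run on the contaminated x and y from before the winsorize step (note this substitutes a different robust estimator for the robust-package function mentioned above):
library(MASS)
rc <- cov.rob(cbind(x, y))   # robust covariance estimate (MVE by default)
cov2cor(rc$cov)[1, 2]        # convert the robust covariance to a correlation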
If you apply the same conditional expression for both vectors you could exclude that "point".
cor( DF[2][ DF[2] > 100 ], # items in 2nd column excluded based on their values
DF[3][ DF[2] > 100 ] ) # items in 3rd col excluded based on the 2nd col values
In the following, I worked from the presumption (reading between your lines) that you have identified that single outlier visually (i.e., from a graph). From your limited data set it's probably easy to identify that point based on its value. If you have more data points, you could use something like this.
tmp <- qqnorm(bi$bias.index)
qqline(bi$bias.index)
(X <- identify(tmp, , labels=rownames(bi)))
qqnorm(bi$bias.index[-X])
qqline(bi$bias.index[-X])
Note that I just copied my own code here because I couldn't work from sample code from you. Also check ?identify beforehand.
It makes sense to put all your data on a data frame, so it's easier to handle.
I always like to keep track of outliers by using an extra column (in this case, B) in my data frame.
df <- data.frame(A=c(1,2,3,4,5), B=c(T,T,T,F,T))
And then filter out data I don't want before getting into the good analytical stuff.
myFilter <- with(df, B==T)
df[myFilter, ]
This way, you don't lose track of the outliers, and you are able to manage them as you see fit.
EDIT:
Improving upon my answer above, you could also use conditionals to define the outliers.
df <- data.frame(A=c(1,2,15,1,2))
df$B<- with(df, A > 2)
subset(df, B == F)
You are getting some great and informative answers here, but they seem to be answers to more complex questions. Correct me if I'm wrong, but it sounds like you just want to remove a single observation by hand. Specifying the negative of its index will remove it.
Assuming your data frame is A and its columns are V1 and V2:
WithAus <- cor(A$V1,A$V2)
WithoutAus <- cor(A$V1[-1], A$V2[-1])
Or you can remove several indexes at once; let's say 1, 5 and 20:
ToRemove <- c(-1,-5,-20)
WithAus <- cor(A$V1,A$V2)
WithoutAus <- cor(A$V1[ToRemove], A$V2[ToRemove])
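If you would rather not hard-code the index, here is a variant that drops the point by name; this assumes the countries ended up as the row names of A (for example via read.table(..., row.names = 1)), which may not match how you read the data:
keep <- rownames(A) != "Australia"
WithoutAus <- cor(A$V1[keep], A$V2[keep])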
