I have a data frame (9000 x 304), but it looks like this:
date           a          b
1997-01-01     8.720551   10.61597
1997-01-02     NA         NA
1997-01-03     8.774251   NA
1997-01-04     8.808079   11.09641
I want to calculate differences between rows, such as:
first <- data[i-1,] - data[i-2,]
second <- data[i,] - data[i-1,]
third <- data[i,] - data[i-2,]
I want to ignore the NA values: if there is an NA, I want to use the last non-NA value in that column.
For example, for the second diff with i = 4 in column b:
11.09641 - 10.61597 is the value of b_diff on 1997-01-04
This is what I did, but it keeps generating data with NA:
first <- NULL
for (i in 3:nrow(data)) {
  first <- rbind(first, data[i-1, ] - data[i-2, ])
}
second <- NULL
for (i in 3:nrow(data)) {
  second <- rbind(second, data[i, ] - data[i-1, ])
}
third <- NULL
for (i in 3:nrow(data)) {
  third <- rbind(third, data[i, ] - data[i-2, ])
}
There may be a way to solve this with the aggregate function, but I need a solution that can be applied to big data, and I can't specify each column name separately. Moreover, my column names are in a foreign language.
Thank you very much! I hope I gave you all the information you need to help me; otherwise, please let me know.
You can use fill to replace NAs with the last non-missing value, and then use across and lag to compute the new variables. It is unclear what exactly your expected output is, but you can also replace the default value of lag when the lagged value does not exist (e.g. for the first row), using lag(.x, default = ...).
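For a reproducible starting point, the small example above can be reconstructed like this (a minimal sketch; your real data has 304 columns, this just mirrors the fragment shown):
# reconstruct the example fragment (date assumed to be a Date column)
data <- data.frame(
  date = as.Date(c("1997-01-01", "1997-01-02", "1997-01-03", "1997-01-04")),
  a = c(8.720551, NA, 8.774251, 8.808079),
  b = c(10.61597, NA, NA, 11.09641)
)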
library(dplyr)
library(tidyr)
data %>%
  fill(a, b) %>%
  mutate(across(a:b, ~ lag(.x) - lag(.x, n = 2), .names = "first_{.col}"),
         across(a:b, ~ .x - lag(.x), .names = "second_{.col}"),
         across(a:b, ~ .x - lag(.x, n = 2), .names = "third_{.col}"))
date a b first_a first_b second_a second_b third_a third_b
1 1997-01-01 8.720551 10.61597 NA NA NA NA NA NA
2 1997-01-02 8.720551 10.61597 NA NA 0.000000 0.00000 NA NA
3 1997-01-03 8.774251 10.61597 0.0000 0 0.053700 0.00000 0.053700 0.00000
4 1997-01-04 8.808079 11.09641 0.0537 0 0.033828 0.48044 0.087528 0.48044
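Since you mention roughly 300 columns that you cannot list by name, here is a minimal sketch of the same idea applied to every column except date (assuming date is the only non-value column; num_cols is a helper name introduced here):
# select all value columns without typing their names
num_cols <- setdiff(names(data), "date")
data %>%
  fill(all_of(num_cols)) %>%
  mutate(across(all_of(num_cols), ~ lag(.x) - lag(.x, n = 2), .names = "first_{.col}"),
         across(all_of(num_cols), ~ .x - lag(.x), .names = "second_{.col}"),
         across(all_of(num_cols), ~ .x - lag(.x, n = 2), .names = "third_{.col}"))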
I would like to calculate enthalpy by using a steam table function.
I want to apply the function to a tibble that includes temp and pressure, but I failed.
For example, I want to add an enthalpy column.
sample_table
temp pressure
800 16
900 17
1000 18
sample_table_add_enthalpy <- sample_table %>%
  mutate(enthalpy = hTp(temp, pressure))
The result is
temp pressure enthalpy
800 16 3375.08509
900 17 3375.08509
1000 18 3375.08509
In this case, the calculation only uses the values from the first row.
How can I calculate it for all rows using mutate?
After thinking more about your question, I now understand that you were not talking about multiple columns. Instead, it seems you would like a function that can process data for multiple rows.
Here I provide two solutions. The first is to use the Vectorize function to convert your function into a version that can handle vectorized input and produce vectorized output.
library(IAPWS95)
library(tidyverse)
hTp_vectorize <- Vectorize(hTp)
sample_table_add_enthalpy <- sample_table %>%
  mutate(enthalpy = hTp_vectorize(temp, pressure))
sample_table_add_enthalpy
# temp pressure enthalpy
# 1 800 16 3375.08509
# 2 900 17 3636.88144
# 3 1000 18 3889.57761
The second is to use map2_dbl from the purrr package to vectorize the operation (map2_dbl returns a numeric vector, whereas plain map2 would give a list column):
sample_table_add_enthalpy <- sample_table %>%
  mutate(enthalpy = map2_dbl(temp, pressure, hTp))
sample_table_add_enthalpy
# temp pressure enthalpy
# 1 800 16 3375.08509
# 2 900 17 3636.88144
# 3 1000 18 3889.57761
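If you prefer to stay in base R, mapply does the same row-by-row pairing; a minimal sketch, assuming sample_table is the tibble shown above:
# Base-R alternative: mapply calls hTp once per (temp, pressure) pair and returns a numeric vector
sample_table$enthalpy <- mapply(hTp, sample_table$temp, sample_table$pressure)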
My data looks like this:
colnames(dati) <- c("grupa", "regions5", "regions6", "novads.rep", "pilseta.lt", "specialists", "limenis.1", "limenis.2", "cipari.3", "ratio", "gads", "KV", "DS")
and I have manually applied split to it in order to have 24 splits (12 including year and 12 without splitting by year). I did them the following way:
k1<-split(dati$ratio, list(dati$gads, dati$grupa), drop=TRUE)
k2<-split(dati$ratio, list(dati$gads, dati$grupa, dati$regions5), drop=TRUE)
...
k13<-split(dati$ratio,list(dati$grupa),drop=TRUE)
k14<-split(dati$ratio,list(dati$grupa,dati$regions5),drop=TRUE)
...etc
and what I want to do is to apply these splits to my function as follows:
function(k1,k13)
but instead of inserting the values manually, I would like to index them so that I can call my function similar to this:
for(i in 1:12){function(k[i],k[i+12])}
I just can't seem to find the right way to do it.
dati, after I split it, looks like this:
grupa regions5 regions6 novads.rep pilseta.lt specialists
1 1* Zemgales Zemgales Novads lauki Silva
2 1* Kurzemes Kurzemes Novads lauki Sniedze
3 3* Kurzemes Kurzemes REP pilsēta AnitaE
4 1* Vidzemes Vidzemes Novads pilsēta Dainis
limenis.1 limenis.2 cipari.3 ratio gads KV
1 Jelgavas nov. Svētes pag. 1 0.8682626 2011 2162
2 Ventspils nov. Vārves pag. 1 0.3923857 2011 27467
3 _Liepāja _Liepāja 4 0.4069100 2011 30107
4 Alūksnes nov. Alūksne 2 0.5641127 2011 8147
DS
1 2490.03
2 70000.00
3 73989.33
4 14442.15
...
and here is the output I'm looking for:
count mean lowermean uppermean median ...
2011.1*.Kurzemes 119 0.83322820 7.719323e-01 0.8945241 0.79888324
2012.1*.Kurzemes 171 0.82800498 7.836221e-01 0.8723879 0.84424821
2013.1*.Kurzemes 144 0.77551814 7.347631e-01 0.8162731 0.80745150
2014.1*.Kurzemes 180 0.78134649 7.396007e-01 0.8230923 0.81635065
2015.1*.Kurzemes 80 0.78146588 7.135070e-01 0.8494248 0.73659659
2011.10*.Kurzemes 16 1.09552970 6.930780e-01 1.4979814 1.02127841
2012.10*.Kurzemes 22 0.87442906 5.721409e-01 1.1767172 0.74787482
2013.10*.Kurzemes 25 0.84406131 6.947097e-01 0.9934129 0.91786319
2014.10*.Kurzemes 22 0.79385199 5.880507e-01 0.9996533 0.71708060
2015.10*.Kurzemes 12 1.19059850 8.213604e-01 1.5598365 1.25322750
2012.11*.Kurzemes 1 0.09461065 NA NA 0.09461065
2013.11*.Kurzemes 2 0.18134522 -1.823437e+00 2.1861274 0.18134522
2014.11*.Kurzemes 1 0.11097174 NA NA 0.11097174
2013.12*.Kurzemes 1 0.44620780 NA NA 0.44620780
...
You could use a list:
k <- list()
k[[1]] <- split(dati$ratio, list(dati$gads, dati$grupa), drop=TRUE)
k[[2]] <- split(dati$ratio, list(dati$gads, dati$grupa, dati$regions5), drop=TRUE)
# etc
Then the following is valid (writing myfunction as a placeholder for the function you want to apply):
for (i in 1:12) {
  myfunction(k[[i]], k[[i + 12]])
}
Note that k3 is the name of a variable, which could be x, myvar32, whatever. When you type k[3], you state that you want to access the third cell of the vector k. Note that k and k3 are totally distinct variables. If you want to be able to access your variables using k[i], you must first create the vector k and store what you need in k[i]...
The double bracket notation is used to access lists, which are basically handy containers that can store anything -- which is what you need in your case.
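To make the difference concrete, here is a tiny illustration (with made-up values) of the distinction between a standalone variable named k3 and the third element of a list k:
k3 <- c(1, 2, 3)       # an ordinary variable that just happens to be called "k3"
k <- list()            # a list whose elements can be addressed programmatically
k[[3]] <- c(1, 2, 3)   # third element of k (elements 1 and 2 are filled with NULL)
i <- 3
identical(k3, k[[i]])  # TRUE -- same contents, but only k[[i]] can be used inside a loop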
In the graph below, is it possible to create the same graph with fewer lines of code? Since each of Figs. A-D has different label settings, I have to write the settings for each figure, which makes the code longer.
The graph below was produced from the data using the pdf device.
Any help with these issues is highly appreciated (newbie to R!). Since all the code is too long to post here, I have posted only the part relevant to the problem, for Fig. C:
#FigC
library(Hmisc)  # needed for minor.tick()
label1 <- c(0, 100, 200, 300)
plot(data$TimeVariable2C, data$Variable2C, axes = FALSE, ylab = "", xlab = "",
     xlim = c(0, 24), ylim = c(0, 2.4), xaxs = "i", yaxs = "i", pch = 19)
lines(data$TimeVariable3C, data$Variable3C)
axis(2, tick = TRUE, at = seq(0.0, 2.4, by = 0.6), label = seq(0.0, 2.4, by = 0.6))
axis(1, tick = TRUE, at = seq(0, 24, by = 6), label = seq(0, 24, by = 6))
mtext("(C)", side = 1, outer = FALSE, line = -10, adj = 0.8)
minor.tick(nx = 5, ny = 5)
par(new = TRUE)
plot(data$TimeVariable1C, data$Variable1C, axes = FALSE, xlab = "", ylab = "", type = "l",
     ylim = c(800, 0), xaxs = "i", yaxs = "i")
axis(3, xlim = c(0, 24), tick = TRUE, at = seq(0, 24, by = 6), label = seq(0, 24, by = 6),
     col.axis = "violetred4", col = "violetred4")
axis(4, tick = TRUE, at = label1, label = label1, col.axis = "violetred4", col = "violetred4")
polygon(data$TimeVariable1C, data$Variable1C, col = 'violetred4', border = NA)
You ask many questions in the same post. I will try to answer just one: how to simplify your code, or rather how to call it once for each letter. I think it is better to put your data in long format. For example, this will create a list of 4 elements:
ll <- lapply(LETTERS[1:4], function(let){
  dat.let <- dat[, grepl(let, colnames(dat))]
  dd <- reshape(dat.let, direction = 'long',
                v.names = c('TimeVariable', 'Variable'),
                varying = 1:6)
  dd$time <- factor(dd$time)
  dd$Type <- let
  dd
})
ll is a list of 4 data.frames, where each one looks like:
head(ll[[1]])
time TimeVariable Variable id Type
1.1 1 0 0 1 A
2.1 1 0 5 2 A
3.1 1 8 110 3 A
4.1 1 16 0 4 A
5.1 1 NA NA 5 A
6.1 1 NA NA 6 A
Then you can use it like this, for example:
library(Hmisc)
layout(matrix(1:4, 2, 2, byrow = TRUE))
lapply(ll, function(data){
  label1 <- c(0, 100, 200, 300)
  Type <- unique(data$Type)
  ## points for the second series (Variable2X in the wide data)
  dat <- subset(data, time == 2)
  x.mm <- max(dat$TimeVariable, na.rm = TRUE)
  plot(dat$TimeVariable, dat$Variable, axes = FALSE, ylab = "", xlab = "",
       xlim = c(0, x.mm), ylim = c(0, 2.4), xaxs = "i", yaxs = "i", pch = 19)
  ## line for the third series
  dat <- subset(data, time == 3)
  lines(dat$TimeVariable, dat$Variable)
  axis(2, tick = TRUE, at = seq(0.0, 2.4, by = 0.6), label = seq(0.0, 2.4, by = 0.6))
  axis(1, tick = TRUE, at = seq(0, x.mm, by = 6), label = seq(0, x.mm, by = 6))
  mtext(Type, side = 1, outer = FALSE, line = -10, adj = 0.8)
  minor.tick(nx = 5, ny = 5)
  par(new = TRUE)
  ## polygon for the first series, drawn against a reversed secondary scale
  dat <- subset(data, time == 1)
  plot(dat$TimeVariable, dat$Variable, axes = FALSE, xlab = "", ylab = "", type = "l",
       ylim = c(800, 0), xaxs = "i", yaxs = "i")
  axis(3, xlim = c(0, 24), tick = TRUE, at = seq(0, 24, by = 6), label = seq(0, 24, by = 6),
       col.axis = "violetred4", col = "violetred4")
  axis(4, tick = TRUE, at = label1, label = label1, col.axis = "violetred4", col = "violetred4")
  polygon(dat$TimeVariable, dat$Variable, col = "violetred4", border = NA)
})
Another advantage of using the long data format is that you can use `ggplot2` and `facet_wrap`, for example.
## transform your data to a data.frame
dat.l <- do.call(rbind,ll)
library(ggplot2)
ggplot(subset(dat.l, time != 1)) +
  geom_line(aes(x = TimeVariable, y = Variable, group = time, color = time)) +
  geom_polygon(data = subset(dat.l, time == 1),
               aes(x = TimeVariable, y = 60 - Variable/10, fill = Type)) +
  geom_line(data = subset(dat.l, time == 1),
            aes(x = TimeVariable, y = Variable, fill = Type)) +
  facet_wrap(~ Type, scales = 'free')
I would like to know the number of unique dams which gave birth on each of the birth dates recorded. My data frame is similar to this one:
dam <- c("2A11","2A11","2A12","2A12","2A12","4D23","4D23","1X23")
bdate <- c("2009-10-01","2009-10-01","2009-10-01","2009-10-01",
"2009-10-01","2009-10-03","2009-10-03","2009-10-03")
mydf <- data.frame(dam,bdate)
mydf
# dam bdate
# 1 2A11 2009-10-01
# 2 2A11 2009-10-01
# 3 2A12 2009-10-01
# 4 2A12 2009-10-01
# 5 2A12 2009-10-01
# 6 4D23 2009-10-03
# 7 4D23 2009-10-03
# 8 1X23 2009-10-03
I used aggregate(dam ~ bdate, data=mydf, FUN=length), but it counts all the dams that gave birth on a particular date:
bdate dam
1 2009-10-01 5
2 2009-10-03 3
Instead, I need to have something like this:
mydf2
bdate dam
1 2009-10-01 2
2 2009-10-03 2
Your help is very much appreciated!
What about:
aggregate(dam ~ bdate, data=mydf, FUN=function(x) length(unique(x)))
You could also run unique on the data first:
aggregate(dam ~ bdate, data = unique(mydf[c("dam", "bdate")]), FUN = length)
Then you could also use table instead of aggregate, though the output is a little different.
> table(unique(mydf[c("dam", "bdate")])$bdate)
2009-10-01 2009-10-03 
         2          2 
This is just an example of how to think about the problem and one approach to solving it.
split.mydf <- with(mydf, split(x = dam, f = bdate)) # each list element holds the dams that gave birth on one date
# it's just a matter of counting unique dams
unique.mydf <- lapply(X = split.mydf, FUN = unique)
# and then counting the number of unique elements
unilen.mydf <- lapply(unique.mydf, length)
# you can do these two last steps in one go like so
lapply(split.mydf, FUN = function(x) length(unique(x)))
as.data.frame(unlist(unilen.mydf)) # a data.frame is just a special list, so this is water to your mill
unlist(unilen.mydf)
2009-10-01 2
2009-10-03 2
In dplyr you can use n_distinct:
library(tidyverse)
mydf %>%
  group_by(bdate) %>%
  summarize(dam = n_distinct(dam))
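For the example data above this gives one row per birth date, with dam = 2 for both 2009-10-01 and 2009-10-03, matching the desired mydf2.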