How to repeat a calculation on subsets of a data frame - r

I have a large data frame with many transects, and for each transect I want to calculate, for each year, the intercept of the cros (x) and the value (y). Then I want to know how the intercept changed over the different years. I know how to calculate the intercept, but I have many transects and would have to repeat this many times, so I would like to do it more automatically.
This is what my data looks like:
df
transects year cros value
10 1996 11 -3
10 1996 12 5
10 2005 11 -9
10 2005 12 -3
10 2010 11 -8
10 2010 12 -8
11 1996 11 7
11 1996 12 -4
11 2005 11 -6
11 2005 12 9
11 2010 11 6
11 2010 12 17
12 1996 14 -16
12 1996 15 -17
12 2005 14 -18
12 2005 15 -11
12 2010 14 16
12 2010 15 7
So I made a function to subset the dataset and do some calculations with that subset.
Here is the code. I used lapply because I want to put the outcome in a list; however, it could be that lapply is not the right function for this problem.
transect <- c(10, 11, 12)
o <- lapply(1:length(transect), function(i) {
  s101 <- subset(df, along == transect[[i+1]])
  # I want to create a subset for every transect and do multiple calculations with that subset.
  # Dune volume
  # This makes sure that I have an intercept, even if there is no value above 3
  AUC96 <- 0
  AUC05 <- 0
  AUC10 <- 0
  # Here I calculate the intercept for the different years.
  d96 <- subset(s101, (cros >= 3.00) & (year == 1996))
  AUC96 <- sintegral(d96$cros, d96$value)$int
  lengthdune96 <- max(d96$value) - min(d96$value)
  AUC962 <- lengthdune96 * 8.00
  AUC96 <- AUC96 + AUC962
  d05 <- subset(s101, (cros >= 3.00) & (year == 2005))
  AUC05 <- sintegral(d05$cros, d05$value)$int
  lengthdune05 <- max(d05$alti) - min(d05$value)
  AUC052 <- lengthdune05 * 8.00
  AUC05 <- AUC05 + AUC052
  d10 <- subset(s101, (cros >= 3.00) & (year == 2010))
  AUC10 <- sintegral(d10$cros, d10$value)$int
  lengthdune10 <- max(d05$value) - min(d05$value)
  AUC102 <- lengthdune10 * 8.00
  AUC10 <- AUC10 + AUC102
  # Here the difference between the years
  dune96.05 <- AUC05 - AUC96
  dune05.10 <- AUC10 - AUC05
  c(transect[[i+1]], dune96.05, dune05.10)
})
out <- as.data.frame(do.call(rbind, o))
However, when I try this I get this error:
`Error in approx(x, fx, n = 2 * n.pts + 1) :
need at least two non-NA values to interpolate`
This is the first time that I have tried to write such a function, so it could be that I am doing this completely wrong. I hope that you can help me.
EDIT:
So I changed the answer a bit, because it did not fully work out. However, I still get an error message and I am really stuck. I also tried different ways to solve this, e.g. looking at the plyr package, but I still get the same error:
This is what my code looks like now:
test <- lapply(unique(df$transect), function(i) {
  s101 <- subset(df, df$transect == i)
  AUC96 <- 0
  AUC05 <- 0
  AUC10 <- 0
  d96 <- subset(s101, (cros >= 3.00) & (year == 1996))
  AUC96 <- sintegral(d96$cros, d96$value)$int
  lengthdune96 <- max(d96$value) - min(d96$value)
  AUC962 <- lengthdune96 * 8.00
  AUC96 <- AUC96 + AUC962
  d05 <- subset(s101, (cros >= 3.00) & (year == 2005))
  AUC05 <- sintegral(d05$cros, d05$value)$int
  lengthdune05 <- max(d05$alti) - min(d05$value)
  AUC052 <- lengthdune05 * 8.00
  AUC05 <- AUC05 + AUC052
  d10 <- subset(s101, (cros >= 3.00) & (year == 2010))
  AUC10 <- sintegral(d10$cros, d10$value)$int
  lengthdune10 <- max(d05$value) - min(d05$value)
  AUC102 <- lengthdune10 * 8.00
  AUC10 <- AUC10 + AUC102
  dune96.05 <- AUC05 - AUC96
  dune05.10 <- AUC10 - AUC05
  c(i, dune96.05, dune05.10)
})
However, I still get this error message:
`Error in approx(x, fx, n = 2 * n.pts + 1) :
need at least two non-NA values to interpolate`
I am not really sure what I am doing wrong; the function should work like this. I hope that somebody can help me.

I see two problems with your use of lapply. You index transect as if it were a list (it's a vector), and you do not pass it (nor df) as an argument to your function in lapply, hence no luck with the subsetting. Try something like this:
lapply(unique(df$transect), function(i, df) {
  s101 <- subset(df, transect == i)
  ...
  c(i, dune96.05, dune05.10)
}, df)
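Putting that advice together, a minimal self-contained sketch might look like the following. Note that sintegral() comes from the Bolstad2 package; to keep this sketch dependency-free, a hand-rolled trapezoid-rule helper stands in for it, so the numbers are illustrative only, not the OP's actual results.

```r
# Sketch: one lapply pass over transects, using the sample data above.
# trapz() is an assumed stand-in for Bolstad2::sintegral()$int.
df <- data.frame(
  transects = rep(c(10, 11, 12), each = 6),
  year      = rep(c(1996, 1996, 2005, 2005, 2010, 2010), times = 3),
  cros      = c(11, 12, 11, 12, 11, 12, 11, 12, 11, 12, 11, 12,
                14, 15, 14, 15, 14, 15),
  value     = c(-3, 5, -9, -3, -8, -8, 7, -4, -6, 9, 6, 17,
                -16, -17, -18, -11, 16, 7)
)
trapz <- function(x, y) sum(diff(x) * (head(y, -1) + tail(y, -1)) / 2)

o <- lapply(unique(df$transects), function(tr) {
  s <- subset(df, transects == tr)              # one subset per transect
  auc <- sapply(c(1996, 2005, 2010), function(yr) {
    d <- subset(s, year == yr)
    trapz(d$cros, d$value)                      # per-year calculation
  })
  c(transect = tr, dune96.05 = auc[2] - auc[1], dune05.10 = auc[3] - auc[2])
})
out <- as.data.frame(do.call(rbind, o))
```

The key point is that each anonymous-function call receives one transect id, subsets df itself, and returns one row; do.call(rbind, ...) then stacks the rows.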

Related

How to rearrange time series data in 10 days time window in R?

I am currently working on a time series of streamflow data ranging from 2002-10-04 to 2012-10-28. What I need to do is develop regression models using 10-day time-window data. To be more specific, I need to use Oct 04 to Oct 13 across all years from 2002 to 2012 to build a regression model. Then I need to use Oct 05 to Oct 14 across all years from 2002 to 2012 to build another regression model, then Oct 06 to Oct 15 across all years to build the next model, and so on, repeatedly, until the end.
This is what my data looks like.
> head(CFbasin)
Data Qkl Qllt Qhaw Qdp Qlit
1 2002-10-04 19.25546 8.353470 4.502379 2.217209 1.985011
2 2002-10-05 19.56694 8.126935 4.615646 1.622555 1.628219
3 2002-10-06 19.73684 7.560598 4.389111 1.251605 1.492298
4 2002-10-07 18.12278 7.079212 3.992675 1.158159 1.413011
5 2002-10-08 18.12278 6.824360 3.794457 1.070377 1.393189
6 2002-10-09 17.83961 6.739409 3.369705 1.073208 1.353545
> tail(CFbasin)
Data Qkl Qllt Qhaw Qdp Qlit
3673 2012-10-23 24.89051 16.67862 8.608321 2.477724 1.432832
3674 2012-10-24 25.00378 16.48040 11.638224 2.820358 1.393189
3675 2012-10-25 25.37189 16.99011 7.758816 3.001586 1.322397
3676 2012-10-26 26.07982 16.87684 6.484558 2.814695 1.279921
3677 2012-10-27 27.41071 17.07506 4.813864 3.086536 1.228951
3678 2012-10-28 28.88318 17.16001 5.635052 3.114853 1.220456
I only tried once and this is my code:
CFbasin %>% filter(month(Date) == 10 & day(Date) >= 4 & day(Date) <= 14)
This allows me to get all the data within Oct 4 to Oct 13 from 2002 to 2012 and then conduct a linear regression. But I am not sure how to make it work on the whole dataset and then conduct the linear regressions. I am considering a for loop and the function rollapply(), but I am really unclear about how to arrange my dataset. Any suggestions and recommendations will be really appreciated; thank you in advance!
Here is an example using the data in the Note at the end and a width of 5.
library(zoo)
coefs <- function(x) coef(lm.fit(cbind(1, x[, -1]), x[, 1]))
rollapplyr(CFbasin[, -1], 5, coefs, by.column = FALSE, fill = NA)
Added
This uses NA for the first 4 output rows, then uses the rows that correspond to rows 1:5 of all years to form the next regression, rows 2:6 of all years to form the regression after that, and so on. Dec 31st is not used in leap years.
yday <- as.POSIXlt(CFbasin$Data)$yday
coefs <- function(ix) {
  x <- CFbasin[yday %in% ix, -1]
  if (NROW(x) == 0) NA else coef(lm(Qkl ~ ., x))
}
rollapplyr(0:364, 5, coefs, fill = NA)
Note
Lines <- "
Data Qkl Qllt Qhaw Qdp Qlit
1 2002-10-04 19.25546 8.353470 4.502379 2.217209 1.985011
2 2002-10-05 19.56694 8.126935 4.615646 1.622555 1.628219
3 2002-10-06 19.73684 7.560598 4.389111 1.251605 1.492298
4 2002-10-07 18.12278 7.079212 3.992675 1.158159 1.413011
5 2002-10-08 18.12278 6.824360 3.794457 1.070377 1.393189
6 2002-10-09 17.83961 6.739409 3.369705 1.073208 1.353545"
CFbasin <- read.table(text = Lines)
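As a side note, the windowing that rollapplyr performs can be sketched in base R with embed(); this is only meant to show which rows fall into each right-aligned window, not to replace the zoo answer above.

```r
# Base-R sketch of right-aligned rolling windows of width 5 over row indices,
# analogous to what rollapplyr(..., 5, ...) passes to its function.
n <- 10
width <- 5
# embed() returns indices newest-first, so reverse the columns
windows <- embed(seq_len(n), width)[, width:1, drop = FALSE]
windows[1, ]  # first full window: rows 1..5
windows[2, ]  # next window: rows 2..6
```

Each row of `windows` is one window; rollapplyr additionally pads the first width - 1 outputs with `fill = NA`.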

Using approx function within tapply or by in R

I have temperature profiler (tp) data with date, depth, and temperature. The depth for each date is not exactly the same, so I need to unify the depths and set the temperature at each unified depth by linear approximation. I was able to do this with a loop using the approx function (see the first part of the enclosed code). But I know that I should do it better without a loop (considering I will have about 600,000 rows). I tried to do it with the by function but was not successful in transforming the result (a list) into a data frame or matrix (see the second part of the code).
Keep in mind that the length of the rounded depth is not always the same as in the example.
The rounded depth is in the Depth2 column; the interpolated temperature is put in Temp2.
What is the ‘right’ way to solve this?
# create df manually
tp <- data.frame(Date=double(31), Depth=double(31), Temperature=double(31))
tp$Date[1:11] <- '2009-12-17' ; tp$Date[12:22] <- '2009-12-18'; tp$Date[23:31] <- '2009-12-19'
tp$Depth <- c(24.92,25.50,25.88,26.33,26.92,27.41,27.93,28.37,28.82,29.38,29.92,25.07,25.56,26.06,26.54,27.04,27.53,28.03,28.52,29.02,29.50,30.01,25.05,25.55,26.04,26.53,27.02,27.52,28.01,28.53,29.01)
tp$Temperature <- c(19.08,19.06,19.06,18.87,18.67,17.27,16.53,16.43,16.30,16.26,16.22,17.62,17.43,17.11,16.72,16.38,16.28,16.20,16.15,16.13,16.11,16.08,17.54,17.43,17.32,17.14,16.89,16.53,16.28,16.20,16.13)
# create rounded depth column
tp$Depth2 <- round(tp$Depth)
# loop over dates to calculate the linear approximation at the rounded depths
dtgrp <- tp[!duplicated(tp[, 1]), 1]
for (i in dtgrp) {
  x1 <- tp[tp$Date == i, "Depth"]
  y1 <- tp[tp$Date == i, "Temperature"]
  x2 <- tp[tp$Date == i, "Depth2"]
  tpa <- approx(x = x1, y = y1, xout = x2, rule = 2)
  tp[tp$Date == i, "Temp2"] <- tpa$y
}
# reduce result to rounded depth
tp1 <- tp[!duplicated(tp[,-c(2:3)]),-c(2:3)]
# not part of the question, but the end need is for a matrix, so this complete it:
library(reshape2)
tpbydt <- acast(tp1, Date~Depth2, value.var="Temp2")
# second part: I tried to use the by function (instead of a loop) but got lost when trying to convert the result to a data frame or matrix
rdpth <- function(x1, y1, x2) {
  tpa <- approx(x = x1, y = y1, xout = x2, rule = 2)
  return(tpa)
}
tp2 <- by(tp, tp$Date,function(tp) rdpth(tp$Depth,tp$Temperature,tp$Depth2), simplify = TRUE)
Very close with the by call, but remember it returns a list of objects. Therefore, consider building a list of data frames to be row-bound at the very end:
df_list <- by(tp, tp$Date, function(sub) {
  tpa <- approx(x = sub$Depth, y = sub$Temperature, xout = sub$Depth2, rule = 2)
  df <- unique(data.frame(Date = sub$Date,
                          Depth2 = sub$Depth2,
                          Temp2 = tpa$y,
                          stringsAsFactors = FALSE))
  return(df)
})
tp2 <- do.call(rbind, unname(df_list))
tp2
# Date Depth2 Temp2
# 1 2009-12-17 25 19.07724
# 2 2009-12-17 26 19.00933
# 5 2009-12-17 27 18.44143
# 7 2009-12-17 28 16.51409
# 9 2009-12-17 29 16.28714
# 11 2009-12-17 30 16.22000
# 12 2009-12-18 25 17.62000
# 21 2009-12-18 26 17.14840
# 4 2009-12-18 27 16.40720
# 6 2009-12-18 28 16.20480
# 8 2009-12-18 29 16.13080
# 10 2009-12-18 30 16.08059
# 13 2009-12-19 25 17.54000
# 22 2009-12-19 26 17.32898
# 41 2009-12-19 27 16.90020
# 61 2009-12-19 28 16.28510
# 81 2009-12-19 29 16.13146
And if you reset row.names, this is exactly identical to your tp1 output:
identical(data.frame(tp1, row.names = NULL),
data.frame(tp2, row.names = NULL))
# [1] TRUE
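One detail worth noting in the solution above is the rule = 2 argument to approx(). A tiny hypothetical example (values made up for illustration) shows its effect at the edges of a profile:

```r
# rule = 2 clamps xout values outside the range of x to the nearest endpoint,
# while points inside the range are linearly interpolated.
x <- c(1, 2, 4)
y <- c(10, 20, 40)
res <- approx(x, y, xout = c(0, 3), rule = 2)$y
res  # 10 (clamped to the left endpoint) and 30 (interpolated between 20 and 40)
```

Without rule = 2 (the default rule = 1), the out-of-range request would return NA, which matters here because a rounded depth can fall just outside the measured depth range for a given date.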

R - Automatically split time series in equal parts

I am trying to do a regression model with calibration periods. For that I want to split my time series into 4 equal parts.
library(lubridate)
date_list = seq(ymd('2000-12-01'),ymd('2018-01-28'),by='day')
date_list = date_list[which(month(date_list) %in% c(12,1,2))]
testframe = as.data.frame(date_list)
testframe$values = seq (1, 120, length = nrow(testframe))
The testframe above is 18 seasons long, and I want to divide it into 4 parts, meaning 2 periods of 4 winter seasons and 2 periods of 5 winter seasons.
My try was:
library(lubridate)
aj = year(testframe[1,1])
ej = year(testframe[nrow(testframe),1])
diff = ej - aj
But when I divide diff by 4 now, it's 4.5, whereas I would need something like 4, 4, 5, 5 and would use that to extract the seasons. Any idea how to do that automatically?
You can start with something like this:
library(lubridate)
testframe$year_ <- year(testframe$date_list)
testframe$season <- getSeason(testframe$date_list)
If you're wondering about the origin of the getSeason() function, read this. Now you can split the dataset by seasons:
by4_1 <- testframe[testframe$year_ %in% as.data.frame(table(testframe$year_))$Var1[1:4],]
by4_2 <- testframe[testframe$year_ %in% as.data.frame(table(testframe$year_))$Var1[5:8],]
by5_1 <- testframe[testframe$year_ %in% as.data.frame(table(testframe$year_))$Var1[9:13],]
by5_2 <- testframe[testframe$year_ %in% as.data.frame(table(testframe$year_))$Var1[14:18],]
Now you can test it, for example:
table(by4_1$year_, by4_1$season)
Fall Winter
2000 14 17
2001 14 76
2002 14 76
2003 14 76
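To get the 4, 4, 5, 5 grouping automatically rather than hard-coding the year indices, one possible sketch (my own suggestion, not part of the answer above) is to compute group sizes that differ by at most one:

```r
# Split n seasons into k groups whose sizes differ by at most one,
# smaller groups first: 4, 4, 5, 5 for n = 18, k = 4.
n <- 18
k <- 4
sizes <- rep(n %/% k, k) + c(rep(0, k - n %% k), rep(1, n %% k))
sizes
grp <- rep(seq_len(k), sizes)  # group label for each season, in order
```

The grp vector can then index into the unique season/year values instead of the literal 1:4, 5:8, 9:13, 14:18 slices.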

R : Create specific bin based on data range

I am attempting to repeatedly add a "fixed number" to a numeric vector, depending on a specified bin size. However, the "fixed number" depends on the data range.
For instance, I have a data range of 10 to 1010, and I wish to separate the data into 100 bins.
Since 1010 - 10 = 1000, and 1000 / 100 (the number of bins specified) = 10, the ideal data would look like this:
bin1 - 10 (initial data)
bin2 - 20 (initial data + 10)
bin3 - 30 (initial data + 20)
bin4 - 40 (initial data + 30)
bin100 - 1010 (initial data + 1000)
Now the real data is slightly more complex: there is not just one data range but multiple ranges. Hopefully the example below clarifies.
# Some fixed values
start <- c(10, 5000, 4857694)
end <- c(1010, 6500, 4897909)
Ideally I wish to get something like
10 20
20 30
30 40
.. ..
5000 5015
5015 5030
5030 5045
.. ..
4857694 4858096 # Note: theoretically this would have decimal places,
                # but I do not want any decimal places
4858096 4858498
.. ..
So far I was thinking along the lines of this kind of function, but it seems inefficient because:
1) I have to retype the function 100 times (because my number of bins is 100)
2) I can't find a way to repeat the function along my values - in other words, my function can only deal with the data 10-1010 and not the next range, 5000-6500
# The range of the variable
width <- end - start
# The bin size (Number of required bin)
bin_size <- 100
bin_count <- width/bin_size
# Create a function
f1 <- function(x, y) {
  c(x[1],
    x[1] + y[1],
    x[1] + y[1]*2,
    x[1] + y[1]*3)
}
f1(x = start, y = bin_count)
[1] 10 20 30 40
Any hints or ideas would be greatly appreciated. Thanks in advance!
After a few hours of trying, I managed to answer my own question, so I thought to share it. I used the package "binr" and the function in the package called "bins" to get the required bins. Please find below my attempt to answer my question; it's slightly different from the intended output, but for my purpose it is still okay.
library(binr)
# Some fixed values
start <- c(10, 5000, 4857694)
end <- c(1010, 6500, 4897909)
tmp_list_start <- list() # Create an empty list
# This just extracts the output from the "bins" function into a list
for (i in seq_along(start)) {
  tmp <- bins(start[i]:end[i], target.bins = 100, max.breaks = 100)
  # Now I need to convert one of the outputs from bins into numeric values
  s <- gsub(",.*", "", names(tmp$binct))
  s <- gsub("\\[", "", s)
  tmp_list_start[[i]] <- as.numeric(s)
}
# Repeating the same thing with a slight modification to get the end value of each bin
tmp_list_end <- list()
for (i in seq_along(end)) {
  tmp <- bins(start[i]:end[i], target.bins = 100, max.breaks = 100)
  e <- gsub(".*,", "", names(tmp$binct))
  e <- gsub("]", "", e)
  tmp_list_end[[i]] <- as.numeric(e)
}
v1 <- unlist(tmp_list_start)
v2 <- unlist(tmp_list_end)
df <- data.frame(start=v1, end=v2)
head(df)
start end
1 10 20
2 21 30
3 31 40
4 41 50
5 51 60
6 61 70
Pardon my crappy code. Please share if there is a better way of doing this. It would be nice if someone could comment on how to wrap this into a function.
Here's a way that may help with base R:
bin_it <- function(START, END, BINS = 100) {
  range <- END - START
  jump <- range/BINS
  v1 <- c(START, seq(START + jump + 1, END, jump))
  v2 <- seq(START + jump - 1, END, jump) + 1
  data.frame(v1, v2)
}
It uses the seq function to create the vectors of numbers leading up to the ending number. (BINS defaults to 100, matching the calls below.) It may not work for every case, but for the ranges you gave, it should give the desired output.
bin_it(10, 1010)
v1 v2
1 10 20
2 21 30
3 31 40
4 41 50
5 51 60
bin_it(5000, 6500)
v1 v2
1 5000 5015
2 5016 5030
3 5031 5045
4 5046 5060
5 5061 5075
bin_it(4857694, 4897909)
v1 v2
1 4857694 4858096
2 4858097 4858498
3 4858499 4858900
4 4858901 4859303
5 4859304 4859705
6 4859706 4860107

How do I write a generic function to pick out distance between positive values?

I have a dataset that looks like so:
x y
1 0.0000 0.4459183993
2 125.1128 0.4068805502
3 250.2257 0.3678521348
4 375.3385 0.3294434397
5 500.4513 0.2922601919
6 625.5642 0.2566381551
7 750.6770 0.2229130927
8 875.7898 0.1914207684
9 1000.9026 0.1624969456
10 1126.0155 0.1364773879
11 1251.1283 0.1136978589
12 1376.2411 0.0944717371
13 1501.3540 0.0786550515
14 1626.4668 0.0656763159
15 1751.5796 0.0549476349
16 1876.6925 0.0458811131
17 2001.8053 0.0378895151
18 2126.9181 0.0304416321
19 2252.0309 0.0231041362
20 2377.1438 0.0154535572
21 2502.2566 0.0070928195
22 2627.3694 -0.0020708606
23 2752.4823 -0.0119351534
24 2877.5951 -0.0223944877
25 3002.7079 -0.0332811155
26 3127.8208 -0.0442410358
27 3252.9336 -0.0548855203
...
Full data available here.
It's easier to see visually by plotting x and y with a zero intercept line:
ggplot(dat,aes(x,y)) + geom_line() + geom_hline(yintercept=0)
You can see the plot here (if you don't want to download the data and plot it yourself).
I want to pick out 'patches', defined as the distance along x from where the line goes above zero on the y axis until it goes below zero again. This will always happen at least once (since the line starts above zero), but it can happen many times.
Picking out the first patch is easy.
patch1=dat[min(which(dat$y<=0.000001)),]
But how would I loop through and pick up subsequent patches?
Here's a complete working solution:
# sample data
df <- data.frame(x=1:10, y=rnorm(10))
# find positive changes in "y"
idx <- which(c(FALSE, diff(df$y > 0) == 1))
# get the change in "x"
patches <- diff(c(0, df[idx, "x"]))
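To see what those two lines return, here is a tiny hand-made example (the values are made up for illustration, not the question's data):

```r
df <- data.frame(x = 1:10,
                 y = c(1, 2, -1, -2, 3, 4, -1, 2, -3, 1))
# indices where y switches from non-positive to positive
idx <- which(c(FALSE, diff(df$y > 0) == 1))
idx       # rows 5, 8, and 10 start a new patch
# distance along x between successive upward crossings (0 marks the start)
patches <- diff(c(0, df[idx, "x"]))
```

diff(df$y > 0) coerces the logical to 0/1, so a value of 1 marks exactly the upward zero crossings; prepending FALSE realigns the diff with the original rows.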
