I am trying to fit a regression model with calibration periods. For that I want to split my time series into 4 roughly equal parts.
library(lubridate)
date_list = seq(ymd('2000-12-01'),ymd('2018-01-28'),by='day')
date_list = date_list[which(month(date_list) %in% c(12,1,2))]
testframe = as.data.frame(date_list)
testframe$values = seq(1, 120, length = nrow(testframe))
The testframe above is 18 seasons long and I want to divide it into 4 parts, meaning 2 periods of 4 winter seasons and 2 periods of 5 winter seasons.
My try was:
library(lubridate)
aj = year(testframe[1,1])
ej = year(testframe[nrow(testframe),1])
diff = ej - aj
But when I divide diff by 4 I get 4.5, whereas I would need something like 4, 4, 5, 5 and use that to extract the seasons. Any idea how to do that automatically?
You can start with something like this:
library(lubridate)
testframe$year_ <- year(testframe$date_list)
testframe$season <- getSeason(testframe$date_list)
If you're wondering about the origin of the getSeason() function, read this. In case the link is unavailable, it is essentially the following (reproduced from memory, so treat it as a sketch):
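getSeason <- function(DATES) {
  WS <- as.Date("2012-12-15", format = "%Y-%m-%d") # approximate start of winter
  SE <- as.Date("2012-3-15",  format = "%Y-%m-%d") # approximate start of spring
  SS <- as.Date("2012-6-15",  format = "%Y-%m-%d") # approximate start of summer
  FE <- as.Date("2012-9-15",  format = "%Y-%m-%d") # approximate start of fall
  # Map every date onto the year 2012 (a leap year) so only month/day matter
  d <- as.Date(strftime(DATES, format = "2012-%m-%d"))
  ifelse(d >= WS | d < SE, "Winter",
         ifelse(d >= SE & d < SS, "Spring",
                ifelse(d >= SS & d < FE, "Summer", "Fall")))
}
Now you can split the dataset by season and year: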
by4_1 <- testframe[testframe$year_ %in% as.data.frame(table(testframe$year_))$Var1[1:4],]
by4_2 <- testframe[testframe$year_ %in% as.data.frame(table(testframe$year_))$Var1[5:8],]
by5_1 <- testframe[testframe$year_ %in% as.data.frame(table(testframe$year_))$Var1[9:13],]
by5_2 <- testframe[testframe$year_ %in% as.data.frame(table(testframe$year_))$Var1[14:18],]
Now you can test it, for example:
table(by4_1$year_, by4_1$season)
     Fall Winter
2000   14     17
2001   14     76
2002   14     76
2003   14     76
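If you'd rather compute the 4/4/5/5 grouping automatically instead of hard-coding the 1:4, 5:8, 9:13, 14:18 indices, here is a sketch; it assumes the testframe from the question and counts December toward the following winter season, which yields exactly 18 seasons:
library(lubridate)
# Label each row with its winter season: Dec 2000 + Jan/Feb 2001 -> 2001, etc.
testframe$season_yr <- with(testframe,
  ifelse(month(date_list) == 12, year(date_list) + 1, year(date_list)))
seasons <- sort(unique(testframe$season_yr))  # 18 season labels: 2001..2018
n_groups  <- 4
base_size <- length(seasons) %/% n_groups     # 4
extra     <- length(seasons) %%  n_groups     # 2
sizes <- c(rep(base_size, n_groups - extra),  # 4, 4
           rep(base_size + 1, extra))         # 5, 5
groups  <- split(seasons, rep(seq_len(n_groups), sizes))
periods <- lapply(groups, function(g) testframe[testframe$season_yr %in% g, ])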
I am currently working on time-series streamflow data ranging from 2002-10-04 to 2012-10-28. What I need to do is develop regression models using 10-day time windows. To be more specific, I need to use Oct-04 to Oct-13 across all years from 2002 to 2012 to build one regression model, then Oct-05 to Oct-14 across all years to build another, then Oct-06 to Oct-15 for the next, and so on until the end.
This is how my data look like.
> head(CFbasin)
Data Qkl Qllt Qhaw Qdp Qlit
1 2002-10-04 19.25546 8.353470 4.502379 2.217209 1.985011
2 2002-10-05 19.56694 8.126935 4.615646 1.622555 1.628219
3 2002-10-06 19.73684 7.560598 4.389111 1.251605 1.492298
4 2002-10-07 18.12278 7.079212 3.992675 1.158159 1.413011
5 2002-10-08 18.12278 6.824360 3.794457 1.070377 1.393189
6 2002-10-09 17.83961 6.739409 3.369705 1.073208 1.353545
> tail(CFbasin)
Data Qkl Qllt Qhaw Qdp Qlit
3673 2012-10-23 24.89051 16.67862 8.608321 2.477724 1.432832
3674 2012-10-24 25.00378 16.48040 11.638224 2.820358 1.393189
3675 2012-10-25 25.37189 16.99011 7.758816 3.001586 1.322397
3676 2012-10-26 26.07982 16.87684 6.484558 2.814695 1.279921
3677 2012-10-27 27.41071 17.07506 4.813864 3.086536 1.228951
3678 2012-10-28 28.88318 17.16001 5.635052 3.114853 1.220456
I only tried once and this is my code:
library(dplyr)
library(lubridate)
CFbasin %>% filter(month(Data) == 10 & day(Data) >= 4 & day(Data) <= 13)
this allows me to get all the data within Oct 4 to Oct 13 from 2002 to 2012, and then conduct a linear regression. But I am not sure how to make this work across the whole dataset and run the regressions repeatedly; I am considering a for loop and the function rollapply(), but I'm really unclear about how to arrange my dataset. Any suggestion or recommendation would be appreciated, thank you in advance!
Here is an example using the data in the Note at the end and a width of 5.
library(zoo)
# For each rolling window of 5 rows, regress the first column (Qkl) on the
# remaining columns; lm.fit() takes a model matrix, so bind on a column of
# ones for the intercept
coefs <- function(x) coef(lm.fit(cbind(1, x[, -1]), x[, 1]))
rollapplyr(CFbasin[, -1], 5, coefs, by.column = FALSE, fill = NA)
Added
This uses NA for the first 4 output rows; after that, the rows corresponding to days 1:5 of all years form one regression, days 2:6 of all years form the next, and so on. Dec 31st is not used in leap years.
# 0-based day of the year for each observation
yday <- as.POSIXlt(CFbasin$Data)$yday
coefs <- function(ix) {
  # pool the rows, across all years, whose day of year lies in the window
  x <- CFbasin[yday %in% ix, -1]
  if (NROW(x) == 0) NA else coef(lm(Qkl ~ ., x))
}
rollapplyr(0:364, 5, coefs, fill = NA)
Note
Lines <- "
Data Qkl Qllt Qhaw Qdp Qlit
1 2002-10-04 19.25546 8.353470 4.502379 2.217209 1.985011
2 2002-10-05 19.56694 8.126935 4.615646 1.622555 1.628219
3 2002-10-06 19.73684 7.560598 4.389111 1.251605 1.492298
4 2002-10-07 18.12278 7.079212 3.992675 1.158159 1.413011
5 2002-10-08 18.12278 6.824360 3.794457 1.070377 1.393189
6 2002-10-09 17.83961 6.739409 3.369705 1.073208 1.353545"
CFbasin <- read.table(text = Lines)
I have two data frames, df1 has information about a publication's year, outlet name, total articles in this publication in a year, and a cumulative sum of articles over the period of time I'm studying. df2 has a random sample of article IDs, with potential values ranging from 1 to the total number of articles given by df1$cumsum.
What I need to do is to grab each article ID in df2 and identify in which publication and year it falls under, using the information contained in df1.
Here's a minimally reproducible example:
set.seed(890)
df1 <- NULL
df1$year <- c(2000:2009, 2000:2009)
df1$outlet <- c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2,2,2,2,2,2,2,2,2,2)
df1$article_total <- sample(1:200, 20, replace = T)
df1$cumsum <- cumsum(df1$article_total)
df1 <- as.data.frame(df1)
df2 <- NULL
df2$art_num <- sample(1:2102, 100, replace = T) # get random sample of article IDs for the total number of articles I have in this db
df2 <- as.data.frame(df2)
Ideally, I would also like to calculate each article's number within its year. For example, in the data above, outlet 1 has 14 articles in the year 2000 and 168 in 2001 (cumsum = 183). If I have an article ID of 156, I would like to know that it is the 142nd article in the year 2001 of publication 1. And so on and so forth for every article ID I have in this database.
I was thinking I should do this with a for loop, but I'm 100% lost in writing it. Here's what I began writing, but I have a feeling I'm not on the right track with it:
for (i in 1:nrow(df2)) {
  article_number <- df2$art_num[i]
  if (article_number %in% df1$cumsum) { # note: cumsum should be an interval before doing this?
    # get article number, year, publication in new df
    # also calculate article ID in each year/publication
  }
}
Thanks in advance for any help! I'm still lost with writing loops in R...
EDITED EXAMPLE as per Frank's suggestion
set.seed(890)
df1 <- NULL
df1$year <- c(2000:2002, 2000:2002)
df1$outlet <- c(1, 1, 1, 2,2,2)
df1$article_total <- sample(1:50, 6, replace = T)
df1$cumsum <- cumsum(df1$article_total)
df1 <- as.data.frame(df1)
df2 <- NULL
df2$art_id <- c(66, 120, 77, 156, 24)
df2 <- as.data.frame(df2)
Here's the output I'm looking for:
art_id outlet year article_number
1 66 1 2002 19
2 120 2 2000 35
3 77 1 2002 30
4 156 2 2001 35
5 24 1 2000 20
This example shows my ideal output in df3, which I calculated/built by hand. It has one column with the article's ID, the appropriate outlet, the year, and a new variable art_number. This differs from the article ID in that I calculated it from df1$cumsum and df3$art_id. In this example, the first row shows that the first article in my database has an ID of 66. I obtain an art_number value of 19 because this article (id = 66) is the 19th article published in the year 2002 by outlet 1. I calculated this value by locating the year and outlet based on df1$cumsum, and then subtracting the previous year's df1$cumsum value from the art_id value. So for this specific article, df3$art_number[1] = df3$art_id[1] - df1$cumsum[2].
I need to do this calculation for every article in my database so that I don't have to do this process by hand forever.
I think your data structure makes sense, though it would be easier with one additional column, for the first article in a year and outlet:
library(data.table)
setDT(df1); setDT(df2)
df1[, art_cstart := shift(cumsum(article_total), fill=0L) + 1L]
year outlet article_total cumsum art_cstart
1: 2000 1 4 4 1
2: 2001 1 43 47 5
3: 2002 1 38 85 48
4: 2000 2 36 121 86
5: 2001 2 39 160 122
6: 2002 2 8 168 161
Now, we can do a rolling update join, "rolling" each art_id forward to the next cumsum at or above it and computing each desired column:
df2[, c("outlet", "year", "art_num") := df1[df2, on=.(cumsum = art_id), roll=-Inf, .(
x.year,
x.outlet,
i.art_id - x.art_cstart + 1L
)]]
   art_id outlet year art_num
1:     66      1 2002      19
2:    120      2 2000      35
3:     77      1 2002      30
4:    156      2 2001      35
5:     24      1 2001      20
How it works
x[i, on=, roll=, j] is the syntax for a join, looking up each row of i in x.
In this join j evaluates to a list of columns, .(...) shorthand for list(...).
Column assignment is done with (colnames) := .(...).
The assignment is to the existing table df2 instead of unnecessarily creating a new table.
For details on how data.table syntax works, see the startup messages...
> library(data.table)
data.table 1.10.4
The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
Release notes, videos and slides: http://r-datatable.com
This is the code you need, I think:
df3 <- data.frame(matrix(ncol = 3, nrow = 0))
colnames(df3) <- c("articleNumber", "year", "publication")

for (i in 1:nrow(df2)) {
  for (j in 1:nrow(df1)) {
    # treat the cumsums as intervals: an article belongs to the first
    # df1 row whose cumsum reaches its ID
    lower <- if (j == 1) 0 else df1$cumsum[j - 1]
    if ((df2$art_num[i] > lower) && (df2$art_num[i] <= df1$cumsum[j])) {
      # record article number, year, publication in the new df
      df3[i, 1] <- df2$art_num[i]
      df3[i, 2] <- df1$year[j]
      df3[i, 3] <- df1$outlet[j]
      # the within-year article number would be df2$art_num[i] - lower
    }
  }
}
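For what it's worth, the same lookup can be done without loops; here is a sketch in base R using findInterval(), assuming the edited example's df1 and df2 (where the ID column is art_id):
# Treat the cumsum values as (low, high] breaks: each art_id belongs to
# the first df1 row whose cumsum reaches it
row_idx <- findInterval(df2$art_id, df1$cumsum, left.open = TRUE) + 1
df3 <- data.frame(
  art_id         = df2$art_id,
  outlet         = df1$outlet[row_idx],
  year           = df1$year[row_idx],
  article_number = df2$art_id - c(0, df1$cumsum)[row_idx]
)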
I am attempting to repeatedly add a "fixed number" to a numeric vector, depending on a specified bin size. However, the "fixed number" depends on the data range.
For instance: I have a data range of 10 to 1010, and I wish to separate the data into 100 bins.
Since 1010 - 10 = 1000
and 1000 / 100 (the number of bins specified) = 10,
the ideal data would look like this:
bin1 - 10 (initial data)
bin2 - 20 (initial data + 10)
bin3 - 30 (initial data + 20)
bin4 - 40 (initial data + 30)
bin100 - 1010 (initial data + 1000)
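For a single range, these break points are essentially what seq() produces:
seq(10, 1010, by = (1010 - 10) / 100)  # 101 break points = 100 bins of width 10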
Now the real data is slightly more complex: there is not just one data range but multiple ranges. Hopefully the example below clarifies:
# Some fixed values
start <- c(10, 5000, 4857694)
end <- c(1010, 6500, 4897909)
Ideally I wish to get something like
10 20
20 30
30 40
.. ..
5000 5015
5015 5030
5030 5045
.. ..
4857694 4858096 # Note: theoretically these would have decimal places,
                # but I do not want any decimal places
4858096 4858498
.. ..
So far I was thinking along the lines of this kind of function, but it seems inefficient because:
1) I would have to retype the function 100 times (because my number of bins is 100);
2) I can't find a way to repeat the function along my values. In other words, my function can only deal with the range 10-1010 and not the next one, 5000-6500.
# The range of the variable
width <- end - start
# The number of bins required
bin_size <- 100
bin_count <- width/bin_size
# Create a function
f1 <- function(x,y){
c(x[1],
x[1] + y[1],
x[1] + y[1]*2,
x[1] + y[1]*3)
}
f1(x = start, y = bin_count)
[1] 10 20 30 40
Perhaps any hint or ideas would be greatly appreciated. Thanks in advance!
After a few hours of trying, I managed to answer my own question, so I thought I'd share it. I used the package "binr" and its function "bins" to get the required bins. Please find my attempt below; the output is slightly different from the intended one, but for my purpose it is still okay.
library(binr)
# Some fixed values
start <- c(10, 5000, 4857694)
end <- c(1010, 6500, 4897909)
tmp_list_start <- list() # Create an empty list
# This just extracts the output from the "bins" function into a list
for (i in seq_along(start)){
tmp <- bins(start[i]:end[i],target.bins = 100,max.breaks = 100)
# Now I need to convert one of the outputs from bins into numeric values
s <- gsub(",.*", "", names(tmp$binct))
s <- gsub("\\[","",s)
tmp_list_start[[i]] <- as.numeric(s)
}
# Repeating the same thing with slight modification to get the end value of the bin
tmp_list_end <- list()
for (i in seq_along(end)){
tmp <- bins(start[i]:end[i],target.bins = 100,max.breaks = 100)
e <- gsub(".*,", "", names(tmp$binct))
e <- gsub("]","",e)
tmp_list_end[[i]] <- as.numeric(e)
}
v1 <- unlist(tmp_list_start)
v2 <- unlist(tmp_list_end)
df <- data.frame(start=v1, end=v2)
head(df)
start end
1 10 20
2 21 30
3 31 40
4 41 50
5 51 60
6 61 70
Pardon my crappy code; please share if there is a better way of doing this. It would be nice if someone could comment on how to wrap this into a function.
Here's a way that may help with base R:
bin_it <- function(START, END, BINS = 100) { # BINS defaults to 100, as in the calls below
  range <- END - START
  jump <- range / BINS
  v1 <- c(START, seq(START + jump + 1, END, jump))
  v2 <- seq(START + jump - 1, END, jump) + 1
  data.frame(v1, v2)
}
It uses the function seq to create the vectors of numbers leading up to the ending number. It may not work for every case, but for the ranges you gave it produces the desired output.
bin_it(10, 1010)
v1 v2
1 10 20
2 21 30
3 31 40
4 41 50
5 51 60
bin_it(5000, 6500)
v1 v2
1 5000 5015
2 5016 5030
3 5031 5045
4 5046 5060
5 5061 5075
bin_it(4857694, 4897909)
v1 v2
1 4857694 4858096
2 4858097 4858498
3 4858499 4858900
4 4858901 4859303
5 4859304 4859705
6 4859706 4860107
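To run bin_it() over all of the start/end pairs at once, Map() should do it (a sketch, using the start and end vectors from above; BINS falls back to its default of 100):
all_bins <- do.call(rbind, Map(bin_it, start, end))
head(all_bins)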
I have an R dataframe which describes the evolution of the sales of a product in approx. 2000 shops on a quarterly basis, with 5 columns (i.e. 5 periods of time). I'd like to know how to analyse it with R.
I've already tried some basic analysis, that is to say determining the average sales for the 1st period, the 2nd period, etc., and then comparing the evolution of each shop relative to this general evolution. For instance, there is a total of 55 000 sales for the 1st period and 35 000 for the 5th, so I assume that for each shop the normal sales in the 5th period should be 35/55 = 0.63 times the amount of the 1st period's sales: if shop X sold 100 items in the first period, I assume it should normally sell 63 items in the 5th period.
Obviously, this is an easy-to-do method, but it is not statistically sound.
I would like a method that determines a trend curve which best fits the data (minimizing the residual sum of squares, i.e. maximizing R-squared). My objective is to analyse the sales of the shops while neutralizing the general trend: I'd like to know precisely which shops are underperforming and which are overperforming, with a statistically correct approach.
My dataframe is structured in this way :
shopID | sum | qt1 | qt2 | qt3 | qt4 | qt5
000001 | 150 | 45 | 15 | 40 | 25 | 25
000002 | 100 | 20 | 20 | 20 | 20 | 20
000003 | 500 | 200 | 0 | 100 | 100 | 100
... (2200 rows)
I've tried to put my time series in a list, which is successful, with the following function:
reversesales=t(data.frame(sales$qt1,sales$qt2,sales$qt3,sales$qt4,sales$qt5))
# I reverse rows and columns of the frame in order that the time periods be the rows
timeser<-ts(reversesales,start=1,end=5, deltat=1/4)
# deltat=1/4 because it is a quarterly basis, 1 and 5 because I have 5 quarters
Still, I am unable to do anything with this variable. I can't plot it (with the "plot" function), as there are 2200 series, one per shop (and so R wants to draw 2200 successive plots, which is obviously not what I want).
In addition, I don't know how to determine the theoretical trend and the theoretical value of the sales for each period for each shop...
Thank you for your help! (and merry Christmas)
An implementation of a mixed model:
install.packages("nlme")
library("nlme")
library(dplyr)
# Generating some data with a structure like yours:
start <- round(sample(10:100, 50, replace = TRUE)*runif(50))
df <- data_frame(shopID = 1:50,
                 qt1 = start,
                 qt2 = round(qt1 * runif(50, .5, 2)),
                 qt3 = round(qt2 * runif(50, .5, 2)),
                 qt4 = round(qt3 * runif(50, .5, 2)),
                 qt5 = round(qt4 * runif(50, .5, 2)))
df <- as.data.frame(df)
# Converting in into the long format:
df <- reshape(df, idvar = "shopID", varying = names(df)[-1], direction = "long", sep = "")
Estimating the model:
mod <- lme(qt ~ time, random = ~ time | shopID, data = df)
# Extract the random effects for comparison:
random.effects(mod)
(Intercept) time
1 74.0790805 3.7034172
2 7.8713699 4.2138001
3 -8.0670810 -5.8754060
4 -16.5114428 16.4920663
5 -16.7098229 6.4685228
6 -11.9630688 -8.0411504
7 -12.9669777 21.3071366
8 -24.1099280 32.9274361
9 8.5107335 -9.7976905
10 -13.2707679 -6.6028927
11 3.6206163 -4.1017784
12 21.2342886 -6.7120725
13 -14.6489512 11.6847109
14 -14.7291647 2.1365768
15 10.6791941 3.2097199
16 -14.1524187 -1.6933291
17 5.2120647 8.0119320
18 -2.5172933 -6.5011416
19 -9.0094366 -5.6031271
20 1.4857512 -5.9913865
21 -16.5973442 3.5164298
22 -26.7724763 27.9264081
23 49.0764631 -12.9800871
24 -0.1512509 2.3589947
25 15.7723150 -7.9295698
26 2.1955489 11.0318875
27 -8.0890346 -5.4145977
28 0.1338790 -8.3551182
29 9.7113758 -9.5799588
30 -6.0257683 42.3140432
31 -15.7655545 -8.6226255
32 -4.1450984 18.7995079
33 4.1510104 -1.6384103
34 2.5107652 -2.0871890
35 -23.8640815 7.6680185
36 -10.8228653 -7.7370976
37 -14.1253093 -8.1738468
38 42.4114024 -9.0436585
39 -10.7453627 2.4590883
40 -12.0947901 -5.2763010
41 -7.6578305 -7.9630013
42 -14.9985612 -0.4848326
43 -13.4081771 -7.2655456
44 -11.5646620 -7.5365387
45 6.9116844 -10.5200339
46 70.7785492 -11.5522014
47 -7.3556367 -8.3946072
48 27.3830419 -6.9049164
49 14.3188079 -9.9334156
50 -15.2077850 -7.9161690
I would interpret the values as follows: consider them as deviations from zero, so that positive values are positive deviations from the average, whereas negative values are negative deviations from the average. The averages of the two columns are zero, as checked below:
round(apply(random.effects(mod), 2, mean))
(Intercept) time
0 0
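Since the goal is to find under- and over-performing shops, one way to use these values (a sketch) is to sort on the time effect, i.e. each shop's deviation from the average trend:
re <- random.effects(mod)
# shops with the most negative trend deviations (underperformers)
head(re[order(re$time), ])
# shops with the most positive trend deviations (overperformers)
tail(re[order(re$time), ])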
library(zoo)
#Reconstructing the data with four quarter columns (instead of five quarters as in your example)
shopID <- c(1, 2, 3, 4, 5)
sum <- c(150, 100, 500, 350, 50)
qt1 <- c(40, 10, 130, 50, 10)
qt2 <- c(40, 40, 110, 100, 15)
qt3 <- c(50, 30, 140, 150, 10)
qt4 <- c(20, 20, 120, 50, 15)
myDF <- data.frame(shopID, sum, qt1, qt2, qt3, qt4)
#The ts() function converts a numeric vector into an R time series object
ts1 <- ts(as.numeric((myDF[1,3:6])), frequency=4)
ts2 <- ts(as.numeric((myDF[2,3:6])), frequency=4)
ts3 <- ts(as.numeric((myDF[3,3:6])), frequency=4)
ts4 <- ts(as.numeric((myDF[4,3:6])), frequency=4)
ts5 <- ts(as.numeric((myDF[5,3:6])), frequency=4)
#Merge time series objects
tsm <- merge(a = as.zoo(ts1), b = as.zoo(ts2), c = as.zoo(ts3), d = as.zoo(ts4), e = as.zoo(ts5))
#Plotting the Time Series
plot.ts(tsm, plot.type = "single", lty = 1:5, xlab = "Time", ylab = "Sales")
The code is not optimized and can be improved. More about time series analysis can be read here. Hope this gives some direction.
I have a large data frame with many transects, and for each transect I want to calculate, for each year, the intercept of the cros (x) and the value (y). Then I want to know how the intercept changed over the different years. I know how to calculate the intercept; however, I have a lot of transects and would have to repeat this many times, so I would like to do it more automatically.
So this is how my data looks like:
df
transects year cros value
10 1996 11 -3
10 1996 12 5
10 2005 11 -9
10 2005 12 -3
10 2010 11 -8
10 2010 12 -8
11 1996 11 7
11 1996 12 -4
11 2005 11 -6
11 2005 12 9
11 2010 11 6
11 2010 12 17
12 1996 14 -16
12 1996 15 -17
12 2005 14 -18
12 2005 15 -11
12 2010 14 16
12 2010 15 7
So I made a function to subset the dataset and do some calculations with each subset.
Here is the code. I used lapply because I want to put the outcome of the code in a list; however, it could be that lapply is not the right function for this problem.
transect <- c(10, 11, 12)
o <- lapply(1:length(transect), function(i) {
s101 <- subset(df, along == transect[[i+1]])
# I want to create a subset for every transect and with that subset I want to do multiple calculations.
# Dune volume
# This makes sure that I have an intercept, also if there is no value above the 3
AUC96<-0
AUC05<-0
AUC10<-0
# Here I calculate the intercept for the different years.
d96 <- subset(s101, (cros >= 3.00) & (year == 1996))
AUC96<-sintegral(d96$cros,d96$value)$int
lengthdune96 <- max(d96$value)-min(d96$value)
AUC962 <- lengthdune96*8.00
AUC96 <- AUC96 +AUC962
d05 <- subset(s101, (cros >= 3.00) & (year == 2005))
AUC05<-sintegral(d05$cros,d05$value)$int
lengthdune05 <- max(d05$alti)-min(d05$value)
AUC052 <- lengthdune05*8.00
AUC05 <- AUC05 +AUC052
d10 <- subset(s101, (cros >= 3.00) & (year == 2010))
AUC10<-sintegral(d10$cros,d10$value)$int
lengthdune10 <- max(d05$value)-min(d05$value)
AUC102 <- lengthdune10*8.00
AUC10 <- AUC10 +AUC102
# Here the difference between the years
dune96.05 <- AUC05-AUC96
dune05.10 <- AUC10-AUC05
c(transect[[i+1]], dune96.05, dune05.10)
})
out <- as.data.frame(do.call(rbind, o))
However, when I try this I get this error:
Error in approx(x, fx, n = 2 * n.pts + 1) :
  need at least two non-NA values to interpolate
This is the first time I have tried to write such a function, so it could be that I am doing this totally wrong. I hope that you can help me.
EDIT:
So after I adapted the answer a bit (because it did not totally work out), I still get an error message and I am really stuck. I also tried different ways to solve this, e.g. looking at the plyr package, but I still get the same error.
This is what my code looked like:
test<-lapply(unique(df$transect),function(i){s101 <- subset(df,df$transect==i)
{
AUC96<-0
AUC05<-0
AUC10<-0
d96 <- subset(s101, (cros >= 3.00) & (year == 1996))
AUC96<-sintegral(d96$cros,d96$value)$int
lengthdune96 <- max(d96$value)-min(d96$value)
AUC962 <- lengthdune96*8.00
AUC96 <- AUC96 +AUC962
d05 <- subset(s101, (cros >= 3.00) & (year == 2005))
AUC05<-sintegral(d05$cros,d05$value)$int
lengthdune05 <- max(d05$alti)-min(d05$value)
AUC052 <- lengthdune05*8.00
AUC05 <- AUC05 +AUC052
d10 <- subset(s101, (cros >= 3.00) & (year == 2010))
AUC10<-sintegral(d10$cros,d10$value)$int
lengthdune10 <- max(d05$value)-min(d05$value)
AUC102 <- lengthdune10*8.00
AUC10 <- AUC10 +AUC102
dune96.05 <- AUC05-AUC96
dune05.10 <- AUC10-AUC05
}
c(i,dune96.05, dune05.10)
})
However I still get this error message:
Error in approx(x, fx, n = 2 * n.pts + 1) :
  need at least two non-NA values to interpolate
I am not really sure what I am doing wrong; the function should work like this. I hope that somebody can help me.
I see two problems with your use of lapply: you index transect as if it were a list (it's a vector), and you do not pass it (nor df) as an argument to your function in lapply, hence no luck with the subsetting. Try something like this:
lapply(unique(df$transect), function(i, df) {
  s101 <- subset(df, transect == i)
  # ... your calculations here ...
  c(i, dune96.05, dune05.10)
}, df)
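The remaining approx() error comes from sintegral() being called on a subset with fewer than two rows. Here is a fuller sketch along the same lines, assuming sintegral() comes from the Bolstad2 package, guarding against too-small subsets, and using the transects column name from your sample data (it also avoids the stray d05$alti references):
library(Bolstad2)  # assumed source of sintegral()

auc_year <- function(s101, yr) {
  d <- subset(s101, cros >= 3.00 & year == yr)
  if (nrow(d) < 2) return(NA)  # sintegral()/approx() need at least two points
  sintegral(d$cros, d$value)$int + (max(d$value) - min(d$value)) * 8.00
}

o <- lapply(unique(df$transects), function(i) {
  s101 <- subset(df, transects == i)
  auc <- sapply(c(1996, 2005, 2010), auc_year, s101 = s101)
  c(transect = i, dune96.05 = auc[2] - auc[1], dune05.10 = auc[3] - auc[2])
})
out <- as.data.frame(do.call(rbind, o))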