How to index a loop by year in dataset - r

My dataset looks like this:
Year Risk Resource Utilization Band Percent
2014 0 .25
2014 1 .19
2014 2 .17
2014 3 .31
2014 4 .06
2014 5 .01
2015 0 .23
2015 1 .21
2015 2 .19
2015 3 .31
2015 4 .06
2015 5 .31
I am trying to compute the year-to-year change in Percent for the dataset I am working with. For example, band 0 went from .25 in 2014 to .23 in 2015, a decrease of .02. So far, I have created a loop that bins the rows by year and runs the calculation. The issue I am having is that the loop indexes every iteration as 1, so I get a series of repeating 1s next to my calculations. Here is the code I have been using; any help is much appreciated.
Results.data <- data.frame()
head(data)
percent <- 0
baseyear <- 0
nextyear <- 0
bin <- 0
yearPlus1 <-0
bin2 <-0
percent1 <-0
percent2 <-0
percentDif <-0
for (i in 1:nrow(data)) {
  percent[i] <- data$PERCENT[i]
  baseyear[i] <- as.numeric(data$YEAR_RISK[i])
  bin[i] <- as.numeric(data$RESOURCE_UTILIZATION_BAND[i])
  #print(percent[i])
  #print(baseyear[i])
  #print(bin[i])
}
for (k in 1:nrow(data)) {
  for (j in 1:nrow(data)) {
    yearPlus1 <- as.numeric(baseyear[j]) - 1
    firstYear <- as.numeric(baseyear[k])
    bin2 <- bin[j]
    bin1 <- bin[k]
    percent1 <- as.numeric(percent[k])
    percent2 <- as.numeric(percent[j])
    if (firstYear == yearPlus1 && bin1 == bin2) {
      percentDif <- percent2 - percent1
      print(percentDif)
      Results.data <- rbind(Results.data, c(percentDif))
    }
  }
}

If I understand your question, you can use grouping and vectorization to avoid loops. Here's an example using the dplyr package.
The code below first sorts by Year_Risk so that the data are ordered properly by time. Then we group by Resource_Utilization_Band so that we can get results separately for each level of Resource_Utilization_Band. Finally, we calculate the difference in Percent from year to year. The lag function returns the previous value in a sequence. (Instead of lag, we could have done Change = c(NA, diff(Percent)) as well.) All of these operations are chained one after the other using the dplyr chaining operator (%>%).
(Note that when I imported your data, I also changed your column names by adding underscores to make them legal R column names.)
library(dplyr)
# Year-over-year change within each Resource_Utilization_Band
# (Assuming your starting data frame is called "dat")
dat %>% arrange(Year_Risk) %>%
  group_by(Resource_Utilization_Band) %>%
  mutate(Change = Percent - lag(Percent))
Year_Risk Resource_Utilization_Band Percent Change
1 2014 0 0.25 NA
2 2014 1 0.19 NA
3 2014 2 0.17 NA
4 2014 3 0.31 NA
5 2014 4 0.06 NA
6 2014 5 0.01 NA
7 2015 0 0.23 -0.02
8 2015 1 0.21 0.02
9 2015 2 0.19 0.02
10 2015 3 0.31 0.00
11 2015 4 0.06 0.00
12 2015 5 0.31 0.30
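For readers who prefer not to load dplyr, the c(NA, diff(Percent)) alternative mentioned above can be sketched in base R with ave(). This is a minimal sketch, assuming the data frame is called dat and uses the underscored column names from this answer; the data are retyped from the question.

```r
# Data from the question, with the underscored column names used in this answer
dat <- data.frame(
  Year_Risk = rep(c(2014, 2015), each = 6),
  Resource_Utilization_Band = rep(0:5, times = 2),
  Percent = c(.25, .19, .17, .31, .06, .01,
              .23, .21, .19, .31, .06, .31)
)

# Sort by year so that diff() compares each year with the one before it
dat <- dat[order(dat$Year_Risk), ]

# Within each band, the first year has no predecessor (NA);
# later years get the difference from the previous year
dat$Change <- ave(dat$Percent, dat$Resource_Utilization_Band,
                  FUN = function(p) c(NA, diff(p)))
dat
```

ave() returns a vector in the original row order, so the result can be assigned straight back as a new column.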

Related

Calculate percent from row total in R

I have this table:
group  May 1990  Jun 1990  Jul 1990
1      581       552       465
2      193       184       176
3      207       177       165
Total  981       913       806
I want to calculate percent on row level from the row total.
group  May 1990  Jun 1990  Jul 1990
1      0.59      0.60      0.58
2      0.19      0.21      0.22
3      0.21      0.19      0.20
Total  1         1         1
This is as far as I have got, but it is not what I want.
df <- data.frame(group=c('1','2','3','Total'),may_1990=c(581,193,207,981),jun_1990=c(552,184,177,913),jul_1990=c(465,176,165,806))
total <- df %>% slice_tail(n = 1)
z <- df %>% rowwise() %>% mutate(across(where(is.numeric), ~ .x/total[-1]))
With across:
library(dplyr)
df %>%
mutate(across(where(is.numeric), ~ .x / .x[group == "Total"]))
group may_1990 jun_1990 jul_1990
1 1 0.5922528 0.6046002 0.5769231
2 2 0.1967380 0.2015334 0.2183623
3 3 0.2110092 0.1938664 0.2047146
4 Total 1.0000000 1.0000000 1.0000000
With the nature of your data, this could also work if you prefer base R:
df[-1] <- sapply(df[-1], proportions) * 2
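The factor of 2 works because the Total row is itself part of each column, so every column sums to twice its total and proportions() returns values that are half as large as wanted. A short sketch that checks this on the question's data (the names num and res are just for illustration):

```r
df <- data.frame(group = c("1", "2", "3", "Total"),
                 may_1990 = c(581, 193, 207, 981),
                 jun_1990 = c(552, 184, 177, 913),
                 jul_1990 = c(465, 176, 165, 806))

num <- df[-1]

# Each column sum counts the groups once and the Total row once,
# so this ratio is 2 for every column
colSums(num) / unlist(num[4, ])

# proportions() divides by the column sum, hence the correction factor of 2
res <- df
res[-1] <- sapply(num, proportions) * 2
res
```

This shortcut only holds when the data contain exactly one total row that equals the sum of the other rows.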
I think the easiest way to build this kind of table is to use the table helper functions proportions() and addmargins():
df <- data.frame(group=c('1','2','3','Total'),may_1990=c(581,193,207,981),jun_1990=c(552,184,177,913),jul_1990=c(465,176,165,806))
# Compute proportions for the central data
prop = proportions(as.matrix(df[-4,-1]), 2)
# Add total at the column level (margin = 1)
prop = addmargins(prop, 1)
# Create the final table
df_end = data.frame(
group=c('1','2','3','Total'),
prop
)
You obtain this:
group may_1990 jun_1990 jul_1990
1 1 0.5922528 0.6046002 0.5769231
2 2 0.1967380 0.2015334 0.2183623
3 3 0.2110092 0.1938664 0.2047146
Sum Total 1.0000000 1.0000000 1.0000000

Apply an R script over multiple .txt files in a folder

I am extremely new to building functions and loops. I have looked at previous questions that are similar to my issue but I can't seem to find the solution for my problem. My goal is to extract climate data from a webpage like this:
https://mesonet.agron.iastate.edu/cgi-bin/request/coop.py?network=NECLIMATE&stations=NE3065&year1=2020&month1=1&day1=1&year2=2020&month2=12&day2=31&vars%5B%5D=gdd_50_86&model=apsim&what=view&delim=comma&gis=no&scenario_year=2019
where I will use this data to calculate growing degree days for a crop growth model. I have had success pulling data using a for loop.
uticaNE <- "https://mesonet.agron.iastate.edu/cgi-bin/request/coop.py?network=NECLIMATE&stations=NE8745&year1=2020&month1=1&day1=1&year2=2020&month2=12&day2=31&vars%5B%5D=gdd_50_86&model=apsim&what=view&delim=comma&gis=no&scenario_year=2019"
friendNE <- "https://mesonet.agron.iastate.edu/cgi-bin/request/coop.py?network=NECLIMATE&stations=NE3065&year1=2020&month1=1&day1=1&year2=2020&month2=12&day2=31&vars%5B%5D=gdd_50_86&model=apsim&what=view&delim=comma&gis=no&scenario_year=2019"
location.urls <- c(uticaNE, friendNE)
location.meso.files <- c("uticaNe.txt", "friendNE.txt")
for(i in seq_along(location.urls)){
download.file(location.urls[i], location.meso.files[i], method="libcurl")
}
I will have around 20 locations that I will be pulling data from daily. What I want to do is apply a task that calculates Fahrenheit, GDD, etc. for each file and saves the output of each file separately.
This is the following code I have currently.
files <- list.files(pattern="*.txt", full.names=TRUE, recursive=FALSE)
func <- for (i in 1:length(files)){
df <- read.table(files[i], skip=10, stringsAsFactors =
FALSE)
colnames(df) <- c("year", "day", "solrad", "maxC",
"minC", "precipmm")
df$year <- as.factor(df$year)
df$day <- as.factor(df$day)
df$maxF <- (df$maxC * (9/5) + 32)
df$minF <- (df$minC * (9/5) + 32)
df$GDD <- (((df$maxF + df$minF)/2)-50)
df$GDD[df$GDD <= 0] <- 0
df$GDD.cumulative <- cumsum(df$GDD)
df$precipmm.cumulative <- cumsum(df$precipmm)
return(df)
write.table(df, path="./output", quote=FALSE,
row.names=FALSE, col.names=TRUE)
}
data <- apply(files, func)
Any help would be greatly appreciated.
-ML
Here is an approach using base R, and lapply() with an anonymous function to download the data, read it into a data frame, add the conversions to fahrenheit and cumulative precipitation, and write to output files.
First, we create the list of weather stations for which we will download data
# list of 10 stations
stationList <- c("NE3065","NE8745","NE0030","NE0050","NE0130",
"NE0245","NE0320","NE0355","NE0375","NE0420")
Here we create two URL fragments, one for the URL content prior to the station identifier, and another one for the URL content after the station identifier.
urlFragment1 <- "https://mesonet.agron.iastate.edu/cgi-bin/request/coop.py?network=NECLIMATE&stations="
urlFragment2 <- "&year1=2020&month1=1&day1=1&year2=2020&month2=12&day2=31&vars%5B%5D=gdd_50_86&model=apsim&what=view&delim=comma&gis=no&scenario_year=2019"
Next, we create input and output directories, one to store the downloaded climate input files, and another for the output files.
# create input and output file directories if they do not already exist
if(!dir.exists("./data")) dir.create("./data")
if(!dir.exists("./data/output")) dir.create("./data/output")
The lapply() function uses paste0() to add the station names to the URL fragments we created above, enabling us to automate the download and subsequent operations against each input file.
stationData <- lapply(stationList, function(x) {
  theURL <- paste0(urlFragment1, x, urlFragment2)
  download.file(theURL, paste0("./data/", x, ".txt"), method = "libcurl")
  df <- read.table(paste0("./data/", x, ".txt"), skip = 11,
                   stringsAsFactors = FALSE)
  colnames(df) <- c("year", "day", "solrad", "maxC", "minC", "precipmm")
  df$year <- as.factor(df$year)
  df$day <- as.factor(df$day)
  df$maxF <- (df$maxC * (9/5) + 32)
  df$minF <- (df$minC * (9/5) + 32)
  df$GDD <- (((df$maxF + df$minF)/2) - 50)
  df$GDD[df$GDD <= 0] <- 0
  df$GDD.cumulative <- cumsum(df$GDD)
  df$precipmm.cumulative <- cumsum(df$precipmm)
  df$station <- x
  write.table(df, file = paste0("./data/output/", x, ".txt"), quote = FALSE,
              row.names = FALSE, col.names = TRUE)
  df
})
# add names to the data frames returned by lapply()
names(stationData) <- stationList
The output is a directory containing one file for each station listed in the stationList object.
Finally, here is the data that has been written to the ./data/output/NE3065.txt file.
year day solrad maxC minC precipmm maxF minF GDD GDD.cumulative precipmm.cumulative station
2020 1 8.992 2.2 -5 0 35.96 23 0 0 0 NE3065
2020 2 9.604 5.6 -3.9 0 42.08 24.98 0 0 0 NE3065
2020 3 4.933 5.6 -3.9 0 42.08 24.98 0 0 0 NE3065
2020 4 8.699 3.9 -7.2 0 39.02 19.04 0 0 0 NE3065
2020 5 9.859 6.1 -7.8 0 42.98 17.96 0 0 0 NE3065
2020 6 10.137 7.2 -5 0 44.96 23 0 0 0 NE3065
2020 7 8.754 6.1 -4.4 0 42.98 24.08 0 0 0 NE3065
2020 8 10.121 7.8 -5 0 46.04 23 0 0 0 NE3065
2020 9 9.953 7.2 -5 0 44.96 23 0 0 0 NE3065
2020 10 8.905 7.2 -5 0 44.96 23 0 0 0 NE3065
2020 11 0.416 -3.9 -15.6 2.29 24.98 3.92 0 0 2.29 NE3065
2020 12 10.694 -4.4 -16.1 0 24.08 3.02 0 0 2.29 NE3065
2020 13 1.896 -4.4 -11.1 0.51 24.08 12.02 0 0 2.8 NE3065
2020 14 0.851 0 -7.8 0 32 17.96 0 0 2.8 NE3065
2020 15 11.043 -1.1 -8.9 0 30.02 15.98 0 0 2.8 NE3065
2020 16 10.144 -2.8 -17.2 0 26.96 1.04 0 0 2.8 NE3065
2020 17 10.75 -5.6 -17.2 3.05 21.92 1.04 0 0 5.85 NE3065
Note that there are 11 rows of header data in the input files, so one must set the skip= argument in read.table() to 11, not 10 as was used in the OP.
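If the files have already been downloaded, the same calculations can be wrapped in a named function and applied to every .txt file in a folder, which is closer to what the OP attempted with func and apply(). The following is a minimal sketch, not the code used above: process_station_file, in_dir, out_dir and the demo file are hypothetical names, and the demo file is made up (11 placeholder header lines, one row copied from the NE3065 output above, one invented row) so that the example runs without network access.

```r
# Hypothetical helper: read one station file, add the derived columns,
# write the result to out_dir, and return the data frame
process_station_file <- function(path, out_dir, skip = 11) {
  df <- read.table(path, skip = skip, stringsAsFactors = FALSE)
  colnames(df) <- c("year", "day", "solrad", "maxC", "minC", "precipmm")
  df$maxF <- df$maxC * 9 / 5 + 32
  df$minF <- df$minC * 9 / 5 + 32
  # pmax(..., 0) has the same effect as setting GDD values <= 0 to 0
  df$GDD <- pmax((df$maxF + df$minF) / 2 - 50, 0)
  df$GDD.cumulative <- cumsum(df$GDD)
  df$precipmm.cumulative <- cumsum(df$precipmm)
  write.table(df, file = file.path(out_dir, basename(path)),
              quote = FALSE, row.names = FALSE, col.names = TRUE)
  df
}

# Demo on a made-up input file in temporary directories
in_dir <- tempfile("in")
out_dir <- tempfile("out")
dir.create(in_dir)
dir.create(out_dir)
writeLines(c(rep("header line", 11),
             "2020 1 8.992 2.2 -5 0",     # row taken from the output above
             "2020 2 9.604 30 20 1.5"),   # invented warm day
           file.path(in_dir, "demo.txt"))

# Apply the function to every .txt file in the input folder
files <- list.files(in_dir, pattern = "\\.txt$", full.names = TRUE)
results <- lapply(files, process_station_file, out_dir = out_dir)
names(results) <- basename(files)
results[["demo.txt"]]
```

The key difference from the OP's code is that the loop body becomes a function that returns a value; a for loop itself has no return value, so it cannot be assigned to func and passed to apply().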
Enhancing the code
The last line in the anonymous function returns the data frame to the parent environment, resulting in a list of 10 data frames stored in the stationData object. Since we assigned the station name to a column in each data frame, we can combine the data frames into a single data frame for subsequent analysis, using do.call() with rbind() as follows.
combinedData <- do.call(rbind,stationData)
Since this code was run on January 17th, the resulting data frame contains 170 observations, or 17 observations for each of the 10 stations whose data we downloaded.
At this point the data can be analyzed by station, such as finding the average year to date precipitation by station.
> aggregate(precipmm ~ station,combinedData,mean)
station precipmm
1 NE0030 0.01470588
2 NE0050 0.56764706
3 NE0130 0.32882353
4 NE0245 0.25411765
5 NE0320 0.28411765
6 NE0355 1.49411765
7 NE0375 0.55235294
8 NE0420 0.13411765
9 NE3065 0.34411765
10 NE8745 0.47823529
>
Instead of using base R, you can install the tidyverse library.
https://www.tidyverse.org/
With it, you can load a link into a data frame as TSV (tab-separated values) using the read_tsv function.
dataframe<-read_tsv(url("http://some.where.net/"))
Then create a loop in R and do the calculations
something<-c('link1','link2') #vector in R
for(i in something){
 # do the calculations for each link here (indentation is optional in R)
}
At the end, you save the data frame to a file using
write_csv(dataframe, file = "c:\\myname\\yourfile.csv")

Multiply previous row value by constant R

Have a simple R question but cannot seem to find an answer:
I have a data frame like this:
assumption_val year
1.2 2015
0 2016
0 2017
0 2018
0 2019
I want each value to be 20% greater than the previous year's value, to output something like this:
assumption_val year
1.2 2015
1.44 2016
1.73 2017
2.07 2018
2.49 2019
How can I reference the previous row and multiply it by 1.2 to achieve this?
Thanks!
You are looking for cumprod:
cumprod(rep(1.2, 5))
Like its better known friend, cumsum, it accumulates past results, but it performs a multiplication rather than addition.
df <- data.frame(assumption_val=cumprod(rep(1.2, 5)),
year=2015:2019)
A nice generalization of these functions is Reduce. For example, here is Reduce performing this calculation. You can replace the "*" with "+" and have cumsum.
Reduce("*", rep(1.2, 5), accumulate = TRUE)
A nice feature of this method is that you can adjust the growth rate in each period. For instance if you wanted to start at 1.5 rather than 1.2, you would simply adjust your growth vector to c(1.5, rep(1.2, 4)) to calculate the new growth as follows:
cumprod(c(1.5, rep(1.2, 4)))
data <- read.table(textConnection("
assumption_val year
1.2 2015
0 2016
0 2017
0 2018
0 2019"), header = TRUE)
data$assumption_val <- data$assumption_val[1]^(1:nrow(data))
data
## assumption_val year
## 1 1.20000 2015
## 2 1.44000 2016
## 3 1.72800 2017
## 4 2.07360 2018
## 5 2.48832 2019
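The power trick above relies on the starting value being equal to the growth factor (both are 1.2). A minimal sketch of a more general form, assuming a separate growth factor stored in a variable called growth (a name introduced here for illustration):

```r
data <- data.frame(assumption_val = c(1.2, 0, 0, 0, 0),
                   year = 2015:2019)

growth <- 1.2      # assumed growth factor: 20% per year
n <- nrow(data)

# Starting value times growth^0, growth^1, ..., growth^(n - 1)
data$assumption_val <- data$assumption_val[1] * growth^(0:(n - 1))

# The same series via cumprod(): the leading 1 leaves the first year unchanged
alt <- data$assumption_val[1] * cumprod(c(1, rep(growth, n - 1)))

data
```

With this form, changing the first row to another starting value or growth to another rate works without touching the rest of the code.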

Integrating Data

I have a large data frame as follows which is a subset of a larger data frame.
tree=data.frame(INVYR=tree$INVYR,
DIA=tree$DIA,PLOT=tree$PLOT,SPCD=tree$SPCD,
D.2=tree$D.2, BA.T=tree$BA.T)
What I am attempting to do is calculate the total BA.T per Plot per Year (plots are remeasured in subsequent years). I do this by ...
x<-aggregate(tree$BA.T,list(tree$INVYR,tree$PLOT),FUN=sum)
x$PLOT<-x$Group.2
x<- x[with(x, order(Group.1,Group.2)), ]
This gives me the data frame...
x=data.frame(Group.1,Group.2,x,PLOT)
Where Group.1 is the INVYR, Group.2 is the PLOT, and x is the total BA.T per plot per year. So far this works great. Here is where my problem begins: I then want to integrate this back into my original tree data.frame. If I merge the data by plot, it doesn't account for year and quadruples the data set because of the four remeasurements. I can't use an if statement because the data sets are not of equal length. The data.frame I wish to end up with is
tree=data.frame(INVYR, DIA, PLOT, SPCD, D.2, BA.T, x)
where x is the total BA.T for the given INVYR and PLOT of that record.
Any thoughts would be greatly appreciated. Thanks.
Edit
INVYR=rbind(1982,1982,1982,1982,1982,1995,1995,1995,1995,1995,2000,2000,2000,2000,2000)
PLOT=rbind(1,1,2,2,3,1,1,2,2,3,1,1,2,2,3)
BA.T=rbind(.1,.2,.3,.4,.2,.3,.5,.8,.3,.6,.7,.2,.1,1,1.02)
tree=data.frame(INVYR,PLOT,BA.T)
head(tree)
x<-aggregate(tree$BA.T,list(tree$INVYR,tree$PLOT),FUN=sum)
x$PLOT<-x$Group.2
x$INVYR<-x$Group.1
x<- x[with(x, order(Group.1,Group.2)), ]
head(x)
One solution is to use the reshape2 package.
library(reshape2)
tree.m <- melt(data=tree,id.vars=c('INVYR','PLOT')) ## Note the choice of id variables: these are the keys
dcast(tree.m,formula=...~variable,fun.aggregate=sum)
INVYR PLOT BA.T
1 1982 1 0.30
2 1982 2 0.70
3 1982 3 0.20
4 1995 1 0.80
5 1995 2 1.10
6 1995 3 0.60
7 2000 1 0.90
8 2000 2 1.10
9 2000 3 1.02
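To get the per-plot, per-year total back onto every row of the original tree data frame, which is what the question asks for, one option is base R's ave(), and another is merge() on both keys rather than on PLOT alone. This is a sketch on the example data from the Edit; the column names BA.T.total and BA.T.total2 are made up for illustration.

```r
# Example data from the question's Edit, built as plain vectors
tree <- data.frame(
  INVYR = rep(c(1982, 1995, 2000), each = 5),
  PLOT  = rep(c(1, 1, 2, 2, 3), times = 3),
  BA.T  = c(.1, .2, .3, .4, .2, .3, .5, .8, .3, .6, .7, .2, .1, 1, 1.02)
)

# Option 1: ave() returns the group sum repeated for every row of the group,
# so the number of rows in tree does not change
tree$BA.T.total <- ave(tree$BA.T, tree$INVYR, tree$PLOT, FUN = sum)

# Option 2: aggregate, then merge on BOTH keys so that year is accounted for
totals <- aggregate(BA.T ~ INVYR + PLOT, data = tree, FUN = sum)
names(totals)[3] <- "BA.T.total2"
merged <- merge(tree, totals, by = c("INVYR", "PLOT"))

head(merged)
```

Merging on c("INVYR", "PLOT") keeps one row per tree record, whereas merging on PLOT alone matches each record against every remeasurement year.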

row minus row within different list R

How can I calculate the difference between consecutive rows within each element of a list, when the list elements have different dimensions?
I use the following code:
names(ri1)
[1] "Sedol" "code" "ri" "date"
ri1<-ri1[order(ri1$Sedol,ri1$date),]
sri<-split(ri1,ri1$Sedol)
ri1$r<-as.vector(sapply(seq_along(sri), function(x) diff(c(0, sri[[x]][,3]))))
however it shows the result
"Error in `$<-.data.frame`(`*tmp*`, "r", value = list(c(100, 0.00790000000000646, :
replacement has 1485 rows, data has 4687655"
for example
I have three lists
date ri
1990 1
1991 2
1992 3
date ri
1990 1
1991 2
1992 3
1993 4
date ri
1990 1
1991 2
I want the results like
date ri r
1990 1 0%
1991 2 100%
1992 3 100%
date ri r
1990 1 0%
1991 2 100%
1992 3 100%
1993 4 100%
date ri r
1990 1 0%
1991 2 100%
Note: r = ri(t+1)/ri(t) - 1
Using diff and lapply you can get something like
# I generate some data
dat1 <- data.frame(date = seq(1990,1999,length.out=5),ri = seq(1,10,length.out=5))
dat2 <- data.frame(date = seq(1990,1999,length.out=5),ri=seq(1,5,length.out=5))
# I put the data.frame in a list
ll <- list(dat1,dat2)
# I use lapply:
ll <- lapply(ll,function(dat){
# I apply the formula you give in a vector version
# maybe you need only diff in percent?
dat$r <- round(c(0,diff(dat$ri))/dat$ri*100)
dat
})
ll
[[1]]
date ri r
1 1990.00 1.00 0
2 1992.25 3.25 69
3 1994.50 5.50 41
4 1996.75 7.75 29
5 1999.00 10.00 22
[[2]]
date ri r
1 1990.00 1 0
2 1992.25 2 50
3 1994.50 3 33
4 1996.75 4 25
5 1999.00 5 20
You should use a combination of head and tail as follows:
r.fun <- function(ri) c(0, tail(ri, -1) / head(ri, -1) - 1)
lapply(sri, transform, r = r.fun(ri))
If your goal is to recombine (rbind) your data afterwards, then know that you can split/apply/combine everything within a single call to ave from the base package, or ddply from the plyr package:
transform(ri1, r = ave(ri, Sedol, FUN = r.fun))
or
library(plyr)
ddply(ri1, "Sedol", transform, r = r.fun(ri))
Edit: If you want the output to be in XX% as in your example, replace r.fun with:
r.fun <- function(ri) paste0(round(100 * c(0, tail(ri, -1) / head(ri, -1) - 1)), "%")
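Here is a small self-contained sketch of the ave() approach using the numeric version of r.fun. The Sedol codes "A", "B" and "C" are invented stand-ins for the three example groups in the question, which have 3, 4 and 2 rows.

```r
# Three groups of different lengths, mirroring the question's example
ri1 <- data.frame(
  Sedol = rep(c("A", "B", "C"), times = c(3, 4, 2)),
  date  = c(1990:1992, 1990:1993, 1990:1991),
  ri    = as.numeric(c(1:3, 1:4, 1:2))
)

# Return of each row relative to the previous row; 0 for the first row
r.fun <- function(ri) c(0, tail(ri, -1) / head(ri, -1) - 1)

# ave() applies r.fun within each Sedol and keeps the original row order,
# so groups of different lengths are not a problem
ri1$r <- ave(ri1$ri, ri1$Sedol, FUN = r.fun)
ri1
```

Under the formula r = ri(t+1)/ri(t) - 1, the step from 1 to 2 gives 1 (100%) and the step from 2 to 3 gives 0.5 (50%).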
