getting minimum value after tapply - r

I started learning R recently and am completely new to it. Sorry if my question seems lame to some of you, but I have spent more than an hour trying to research how to do this using indexing or subset() and couldn't find anything.
So here it goes:
I have a file which has
temperature lower rain month yr
10.8 6.5 12.2 1 1987
10.5 4.5 1.3 1 1987
7.5 -1 0.1 1 1987
This file contains 6,940 lines of data.
I read the file into R, and I wanted to find the average rainfall per year, for which I used:
A <- tapply(temperature,yr,mean)
This call returned:
1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005
13.27014 13.79126 15.54986 15.62986 14.11945 14.61612 14.30984 15.12877 15.81260 13.98082 15.63918 15.02568 15.63736 14.94071 14.90849 15.47589 16.03260 15.25109 15.06000
Now the question: I need the year where the average is at its minimum.
When I apply:
min(A)
it returns 13.27014, which corresponds to the year 1987, but how do I query for the year that corresponds to the minimum value?
And when I try:
A[,min(A)]
it returns an error.
Sorry again for the lame question, but this is driving me crazy.
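For reference, a minimal base R sketch of one way to do this (the toy values below stand in for the question's temperature and yr columns): tapply() returns a named vector, so which.min() gives the position of the smallest element and names() recovers its year label.
# Toy stand-ins for the file's temperature and yr columns
yr <- c(1987, 1987, 1988, 1988)
temperature <- c(10.8, 10.5, 12.0, 13.0)
A <- tapply(temperature, yr, mean)   # named numeric vector; the names are the years
A[which.min(A)]                      # the minimum value, labelled with its year
names(A)[which.min(A)]               # just the year, e.g. "1987"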

Related

How to create a loop for sum calculations which then are inserted into a new row?

I have tried to find a solution via similar topics, but haven't found anything suitable. This may be due to the search terms I have used. If I have missed something, please accept my apologies.
Here is an excerpt of my data UN_ (the provided sample should be sufficient):
country year sector UN
AT 1990 1 1.407555
AT 1990 2 1.037137
AT 1990 3 4.769618
AT 1990 4 2.455139
AT 1990 5 2.238618
AT 1990 Total 7.869005
AT 1991 1 1.484667
AT 1991 2 1.001578
AT 1991 3 4.625927
AT 1991 4 2.515453
AT 1991 5 2.702081
AT 1991 Total 8.249567
....
BE 1994 1 3.008115
BE 1994 2 1.550344
BE 1994 3 1.080667
BE 1994 4 1.768645
BE 1994 5 7.208295
BE 1994 Total 1.526016
BE 1995 1 2.958820
BE 1995 2 1.571759
BE 1995 3 1.116049
BE 1995 4 1.888952
BE 1995 5 7.654881
BE 1995 Total 1.547446
....
What I want to do is add another row where UN_$sector = Residual. The value of the residual will be (UN_$sector = Total) minus the sum of column UN over the sectors c("1", "2", "3", "4", "5") for a given year AND country.
This is how it should look:
country year sector UN
AT 1990 1 1.407555
AT 1990 2 1.037137
AT 1990 3 4.769618
AT 1990 4 2.455139
AT 1990 5 2.238618
----> AT 1990 Residual TO BE CALCULATED
AT 1990 Total 7.869005
As I don't want to write many, many lines of code, I'm looking for a way to automate this. I was told about loops but can't really follow the concept at the moment.
Thank you very much for any type of help!!
Best,
Constantin
PS: (for Parfait)
country year sector UN ETS
UK 2012 1 190336512 NA
UK 2012 2 18107910 NA
UK 2012 3 8333564 NA
UK 2012 4 11269017 NA
UK 2012 5 2504751 NA
UK 2012 Total 580957306 NA
UK 2013 1 177882200 NA
UK 2013 2 20353347 NA
UK 2013 3 8838575 NA
UK 2013 4 11051398 NA
UK 2013 5 2684909 NA
UK 2013 Total 566322778 NA
Consider calculating the residual first and then stacking it with the other pieces of data:
# CALCULATE RESIDUALS BY MERGED COLUMNS
agg <- within(merge(aggregate(UN ~ country + year, data = subset(df, sector != 'Total'), sum),
                    aggregate(UN ~ country + year, data = subset(df, sector == 'Total'), sum),
                    by = c("country", "year")),
              {UN <- UN.y - UN.x
               sector <- 'Residual'})

# ROW BIND DIFFERENT PIECES
final_df <- rbind(subset(df, sector != 'Total'),
                  agg[c("country", "year", "sector", "UN")],
                  subset(df, sector == 'Total'))

# ORDER ROWS AND RESET ROW NAMES
final_df <- with(final_df, final_df[order(country, year, as.character(sector)), ])
row.names(final_df) <- NULL
final_df
# country year sector UN
# 1 AT 1990 1 1.407555
# 2 AT 1990 2 1.037137
# 3 AT 1990 3 4.769618
# 4 AT 1990 4 2.455139
# 5 AT 1990 5 2.238618
# 6 AT 1990 Residual -4.039062
# 7 AT 1990 Total 7.869005
# 8 AT 1991 1 1.484667
# 9 AT 1991 2 1.001578
# 10 AT 1991 3 4.625927
# 11 AT 1991 4 2.515453
# 12 AT 1991 5 2.702081
# 13 AT 1991 Residual -4.080139
# 14 AT 1991 Total 8.249567
# 15 BE 1994 1 3.008115
# 16 BE 1994 2 1.550344
# 17 BE 1994 3 1.080667
# 18 BE 1994 4 1.768645
# 19 BE 1994 5 7.208295
# 20 BE 1994 Residual -13.090050
# 21 BE 1994 Total 1.526016
# 22 BE 1995 1 2.958820
# 23 BE 1995 2 1.571759
# 24 BE 1995 3 1.116049
# 25 BE 1995 4 1.888952
# 26 BE 1995 5 7.654881
# 27 BE 1995 Residual -13.643015
# 28 BE 1995 Total 1.547446
I think there are multiple ways you can do this. What I recommend is taking advantage of the tidyverse suite of packages, which includes dplyr.
Without getting too far into what dplyr and the tidyverse can achieve, the key tools here are dplyr's group_by(), summarise(), arrange(), and bind_rows() functions. There are also tons of great tutorials, cheat sheets, and documentation for all the tidyverse packages.
Although it is less and less relevant these days, we generally want to avoid for loops in R. Therefore, we will create a new data frame which contains all of the Residual values and then bring it back into your original data frame.
Step 1: Calculating all residual values
For each country and year, the residual is the Total value minus the sum of the UN values over sectors 1 through 5. Grouping by country and year, we can compute this in a single summarise() call (note that the Total rows must be kept out of the sector sum):
res_UN = UN_ %>% group_by(country, year) %>%
  summarise(UN = sum(UN[sector == "Total"]) - sum(UN[sector != "Total"], na.rm = T))
Step 2: Add a sector column to res_UN with the value 'Residual'
Step 1 yields a data frame containing country, year, and UN; we now need to add a sector column with the value 'Residual' to satisfy your specification.
res_UN$sector = 'Residual'
Step 3: Add res_UN back to UN_ and order accordingly
res_UN and UN_ now have the same columns, so they can be added back together.
UN_ = bind_rows(UN_, res_UN) %>% arrange(country, year, sector)
Piecing this all together should answer your question, and it can be achieved in a couple of lines!
TLDR:
res_UN = UN_ %>% group_by(country, year) %>%
  summarise(UN = sum(UN[sector == "Total"]) - sum(UN[sector != "Total"], na.rm = T))
res_UN$sector = 'Residual'
UN_ = bind_rows(UN_, res_UN) %>% arrange(country, year, sector)
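One detail worth noting: arrange() sorts sector as text here (assuming it is stored as character), so 'Residual' lands between '5' and 'Total', which matches the desired row order.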

How do I run a thornthwaite function (which computes standard precipitation index) by location and latitude in R?

I have the following data:
dat <- read.table(text="
id YEAR MONTH TMED PRCP lat
1 1986 1 -14.5 2.3 42.4863
1 1986 2 -13.9 5.7 42.4863
2 1986 1 -12.9 7.2 42.46
2 1986 2 -11.6 19.7 42.46", header=TRUE)
where
id is my location unit ranging from 1 to 90
TMED - mean monthly temperature by location
PRCP - precipitation by location
lat - latitude by location
YEAR ranges from 1986 to 2016
MONTH ranges from 1 to 12
I need to run the following function in R to calculate SPEI index (Standard Precipitation-Evapotranspiration Index) for each location:
library(SPEI)
dat$PET <- thornthwaite(dat$TMED, dat$lat[1])
dat$BAL <- dat$PRCP-dat$PET
spei1 <- spei(dat$BAL, scale = 1)
dat$spei1 <- spei1$fitted
This code works for one location. But I need to make a loop over latitude and location. One problem is that latitude should enter the function thornthwaite as a number (not as a list/variable).
I played around with the Thornthwaite equation a bit for my ecology master's, and this implementation appears a bit strange. Despite how it might look here, the equation needs more than just mean temperatures and a latitude as input. It actually needs the average day length of each month, but this can be calculated from latitude and date, and thornthwaite() gets the date by simply assuming the first data point represents January, with the rest following in sequence. The Thornthwaite equation also depends on a yearly heat index, which means you need monthly temperature means for the entire year; thornthwaite() solves this by aggregating over the temperature vector you supply.
In summary, for thornthwaite() to work you need monthly mean temperatures, in order, starting in January and spanning at least one year. As such, the function won't work on the data you supplied.
I suggest you make sure your series is long enough, and also split it into separate data.frames for each location. You can use split() for this (split(dat, dat$id)).
There are a few examples in ?thornthwaite, including one demonstrating its use on time series, which is useful if your series doesn't start in January.
I made a mockup demonstrating one possible approach (notice the function will return values even if the data doesn't cover a full year, but those values will be quite unreliable):
dat <- read.table(text="
id YEAR MONTH TMED PRCP lat
1 1986 1 -14.5 2.3 42.4863
1 1986 2 -13.9 5.7 42.4863
1 1986 3 -10.5 2.3 42.4863
1 1986 4 -7.9 5.7 42.4863
1 1986 5 -4.5 2.3 42.4863
1 1986 6 0.9 5.7 42.4863
1 1986 7 10.5 2.3 42.4863
1 1986 8 17.9 5.7 42.4863
2 1986 1 -12.9 7.2 42.46
2 1986 2 -11.6 19.7 42.46
2 1986 3 -8.9 7.2 42.46
2 1986 4 -5.9 7.2 42.46
2 1986 5 1.6 19.7 42.46
2 1986 6 12.9 7.2 42.46
2 1986 7 21.6 19.7 42.46
2 1986 8 25.6 19.7 42.46", header=TRUE)
dat.s <- split(dat, dat$id)
lapply(dat.s, function(x) thornthwaite(x$TMED, x$lat[1]))
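To extend this to the full SPEI calculation from the question, a sketch along the same lines (assuming each location's real series covers whole years starting in January; the res and dat2 names are illustrative):
library(SPEI)
res <- lapply(dat.s, function(x) {
  x$PET   <- as.numeric(thornthwaite(x$TMED, x$lat[1]))  # PET from temperatures and a single latitude
  x$BAL   <- x$PRCP - x$PET                              # climatic water balance
  x$spei1 <- as.numeric(spei(x$BAL, scale = 1)$fitted)   # 1-month SPEI values
  x
})
dat2 <- do.call(rbind, res)  # reassemble into a single data.frame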

Exclude values from data.frame in R

I have the following dataframe:
Count Year
32 2018
346 2017
524 2016
533 2015
223 2014
1 2010
3 2008
1 1992
Is it possible to exclude the years 1992 and 2008? I tried different ways but couldn't find a flexible solution.
I would like to have the same data frame without the years 1992 and 2008.
Many thanks in advance,
jeemer
library(dplyr); filter(df, Year != 1992 & Year != 2008)
Note that the two conditions must be combined with & rather than |: every year is unequal to at least one of the two values, so an | condition would keep all rows.
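For completeness, a base R sketch that does the same thing without loading any packages (assuming the data frame is named df, as above):
df[!df$Year %in% c(1992, 2008), ]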

decompose() for yearly time series in R

I'm trying to perform analysis on a time series of inflation rates from 1960 to 2015. The dataset is a yearly time series spanning 56 years, with one value per year:
Year Inflation percentage
1960 1.783264746
1961 1.752021563
1962 3.57615894
1963 2.941176471
1964 13.35403727
1965 9.479452055
1966 10.81081081
1967 13.0532972
1968 2.996404315
1969 0.574712644
1970 5.095238095
1971 3.081105573
1972 6.461538462
1973 16.92815855
1974 28.60169492
1975 5.738605162
1976 -7.63438068
1977 8.321619342
1978 2.517518817
1979 6.253164557
1980 11.3652609
1981 13.11510484
1982 7.887270664
1983 11.86886396
1984 8.32157969
1985 5.555555556
1986 8.730811404
1987 8.798689021
1988 9.384775808
1989 3.26256011
1990 8.971233545
1991 13.87024609
1992 11.78781925
1993 6.362038664
1994 10.21150033
1995 10.22488756
1996 8.977149075
1997 7.16425362
1998 13.2308409
1999 4.669821024
2000 4.009433962
2001 3.684807256
2002 4.392199745
2003 3.805865922
2004 3.76723848
2005 4.246353323
2006 6.145522388
2007 6.369996746
2008 8.351816444
2009 10.87739112
2010 11.99229692
2011 8.857845297
2012 9.312445605
2013 10.90764331
2014 6.353194544
2015 5.872426595
'stock1' contains my data where the first column stands for Year, and the second for 'Inflation.percentage', as follows:
stock1<-read.csv("India-Inflation time series.csv", header=TRUE, stringsAsFactors=FALSE, as.is=TRUE)
The following is my code for creating the time series object:
stock <- ts(stock1$Inflation.percentage,start=(1960), end=(2015),frequency=1)
Following this, I am trying to decompose the time series object 'stock' using the following line of code:
decom_add <- (decompose(stock, type ="additive"))
Here I get an error:
Error in decompose(stock, type = "additive") : time series has no or less than 2 periods
Why is this so? I initially thought it had something to do with frequency, but since the data is annual, the frequency has to be 1, right? And if it is 1, then aren't there definitely more than 2 periods in the data?
Why isn't decompose() working? What am I doing wrong?
Thanks a lot in advance!
decompose() requires a frequency greater than 1, so the call will only run if you set, for example, frequency = 2; but that would misrepresent your data, since a yearly series with one observation per year has no within-year seasonal component for decompose() to estimate. The better way is to load data that also contains a month column, so the frequency can be 12.
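A minimal sketch with synthetic values (rnorm() stands in for real inflation figures) illustrating the frequency requirement:
# Annual series, frequency 1: no seasonal component for decompose() to estimate
x_yearly <- ts(rnorm(56), start = 1960, frequency = 1)
# decompose(x_yearly)  # Error: time series has no or less than 2 periods

# Monthly series, frequency 12: decompose() works
x_monthly <- ts(rnorm(56 * 12), start = c(1960, 1), frequency = 12)
str(decompose(x_monthly, type = "additive"))  # seasonal, trend, and random parts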

Split and randomly reassemble a time series, but maintain leap years in R

I need to create datasets of weather data to use for modeling over the next 50 years. I am planning to do this by using historical weather data (daily, 1980-2012), but mixing up the years in a random order and then relabeling them with 2014-2054. However, I cannot be completely random, because it is important to maintain leap years. I want to have as many datasets as possible so I can get an average response of the model to different weather patterns.
Here is an example of what the historical data looks like (except there is data for every day). How could I reassemble it so the years are in a different order, but make sure years with 366 days (1980, 1984, 1988) end up in future leap years (2016, 2020, 2024, 2028, 2052)? And then do that at least 50 more times?
year day radn maxt
1980 1 5.827989 -1.59375
1980 2 5.655813 -1.828125
1980 3 6.159346 -0.96875
1981 4 6.065136 -1.84375
1981 5 5.961181 -2.34375
1981 6 5.758733 -2.0625
1981 7 6.458055 -2.90625
1982 8 6.73056 -2.890625
1982 9 6.89472 -1.796875
1983 10 6.687879 -2.140625
1984 11 6.585833 -1.609375
1984 12 6.466392 -0.71875
1984 13 7.100092 -0.515625
1985 14 7.176402 -1.734375
1985 15 7.236122 -2.5
1985 16 7.455515 -2.375
1986 17 7.395174 -1.390625
1986 18 7.341537 -2.21875
1987 19 7.678102 -2.828125
1987 20 7.539239 -2.875
1987 21 7.231031 -2.390625
1988 22 7.397067 -0.21875
1988 23 7.947912 -0.5
1989 24 8.355059 -1.03125
1990 25 8.145792 -1.5
1990 26 8.591616 -2.078125
Here is a function that scrambles the years of a passed data frame df, returning a new data frame:
scramble.years = function(df) {
  # Build convenience vectors of years
  early.leap = seq(1980, 2012, 4)
  late.leap = seq(2016, 2052, 4)
  early.nonleap = seq(1980, 2012)[!seq(1980, 2012) %in% early.leap]
  late.nonleap = seq(2014, 2054)[!seq(2014, 2054) %in% late.leap]
  # Build map from late years to early years
  map = data.frame(from = c(sample(early.leap, length(late.leap), replace = T),
                            sample(early.nonleap, length(late.nonleap), replace = T)),
                   to = c(late.leap, late.nonleap))
  # Build a new data frame with the correct years/days for the later period
  return.list = lapply(2014:2054, function(x) {
    get.df = subset(df, year == map$from[map$to == x])
    get.df$year = x
    return(get.df)
  })
  return(do.call(rbind, return.list))
}
You can call scramble.years any number of times to get new scrambled data frames.
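A hypothetical usage example (the weather data frame name and the number of replicates are illustrative; set.seed() just makes the draws reproducible):
set.seed(42)                                                  # reproducible scrambles
futures <- lapply(1:50, function(i) scramble.years(weather))  # list of 50 scrambled datasets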
