I'd like to calculate monthly temperature anomalies on a time series with several stations.
By "anomaly" I mean the difference between a single value and a mean calculated over a reference period.
My data frame looks like this (let's call it "data"):
Station Year Month Temp
A 1950 1 15.6
A 1980 1 12.3
A 1990 2 11.4
A 1950 1 15.6
B 1970 1 12.3
B 1977 2 11.4
B 1977 4 18.6
B 1980 1 12.3
B 1990 11 7.4
First, I took the subset of years from 1980 to 1990:
data2 <- subset(data, Year >= 1980 & Year <= 1990)
Second, I used plyr to calculate the monthly mean (let's call this "MeanBase") between 1980 and 1990 for each station:
data3 <- ddply(data2, .(Station, Month), summarise,
MeanBase = mean(Temp, na.rm=TRUE))
Now I'd like to calculate, for each row of data, the difference between Temp and the corresponding MeanBase, but I'm not sure I'm on the right track (I don't see how to use data3).
You can use ave in base R to get this.
transform(data,
          Demeaned = Temp - ave(replace(Temp, Year < 1980 | Year > 1990, NA),
                                Station, Month,
                                FUN = function(t) mean(t, na.rm = TRUE)))
# Station Year Month Temp Demeaned
# 1 A 1950 1 15.6 3.3
# 2 A 1980 1 12.3 0.0
# 3 A 1990 2 11.4 0.0
# 4 A 1950 1 15.6 3.3
# 5 B 1970 1 12.3 0.0
# 6 B 1977 2 11.4 NaN
# 7 B 1977 4 18.6 NaN
# 8 B 1980 1 12.3 0.0
# 9 B 1990 11 7.4 0.0
The result column will have NaNs for Month-Station combinations that have no years in your specified range.
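For readers who would rather go through a table of baseline means like data3, here is a minimal sketch of merging the means back onto the data and subtracting. It assumes the sample data from the question; aggregate stands in for ddply so that no package is needed, and the names base, means, merged and Anomaly are only illustrative.

```r
# Sample data copied from the question
data <- data.frame(
  Station = c("A", "A", "A", "A", "B", "B", "B", "B", "B"),
  Year    = c(1950, 1980, 1990, 1950, 1970, 1977, 1977, 1980, 1990),
  Month   = c(1, 1, 2, 1, 1, 2, 4, 1, 11),
  Temp    = c(15.6, 12.3, 11.4, 15.6, 12.3, 11.4, 18.6, 12.3, 7.4)
)

# Baseline period and per-station, per-month means (plays the role of data3)
base  <- subset(data, Year >= 1980 & Year <= 1990)
means <- aggregate(Temp ~ Station + Month, data = base, FUN = mean)
names(means)[names(means) == "Temp"] <- "MeanBase"

# Left join: keep every row of data, even without a baseline mean
merged <- merge(data, means, by = c("Station", "Month"), all.x = TRUE)
merged$Anomaly <- merged$Temp - merged$MeanBase
merged
```

Note that merge reorders the rows by the key columns, and Station-Month combinations with no baseline years come out as NA here rather than NaN.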
Sample data
set.seed(123)
df <- data.frame(loc.id = rep(1:1000, each = 35),
                 year = rep(1980:2014, times = 1000),
                 month.id = sample(c(1:4, 8:10, 12), 35 * 1000, replace = TRUE))
The data frame has 1000 locations x 35 years of data for a variable called month.id, which is basically the month of a year. For each year, I want to calculate the percent occurrence of each month. For example, for 1980:
month.vec <- df[df$year == 1980,]
table(month.vec$month.id)
1 2 3 4 8 9 10 12
106 132 116 122 114 130 141 139
To calculate the percent occurrence of months:
table(month.vec$month.id)/length(month.vec$month.id) * 100
1 2 3 4 8 9 10 12
10.6 13.2 11.6 12.2 11.4 13.0 14.1 13.9
I want to have a table something like this:
year month percent
1980 1 10.6
1980 2 13.2
1980 3 11.6
1980 4 12.2
1980 5 NA
1980 6 NA
1980 7 NA
1980 8 11.4
1980 9 13
1980 10 14.1
1980 11 NA
1980 12 13.9
Since months 5, 6, 7 and 11 are missing, I just want to add rows with NA for those months. If possible, I would like a dplyr solution, something like this:
library(dplyr)
df %>% group_by(year) %>% summarise(percentage.contri = table(month.id)/length(month.id)*100)
Solution using dplyr and tidyr
# To get month as integer use (or add as.integer to mutate):
# df$month.id <- as.integer(df$month.id)
library(dplyr)
library(tidyr)
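df %>%
  group_by(year, month.id) %>%
  # Count occurrences per year & month
  summarise(n = n()) %>%
  # Get percent per month (the result is still grouped by year, so sum(n) is the yearly total)
  mutate(percent = n / sum(n) * 100) %>%
  # Drop the grouping first: some tidyr versions do not allow completing a grouping column
  ungroup() %>%
  # Fill in missing months (omit `fill` to leave them as NA)
  complete(year, month.id = 1:12, fill = list(percent = 0)) %>%
  select(year, month.id, percent)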
year month.id percent
<int> <dbl> <dbl>
1 1980 1.00 10.6
2 1980 2.00 13.2
3 1980 3.00 11.6
4 1980 4.00 12.2
5 1980 5.00 0
6 1980 6.00 0
7 1980 7.00 0
8 1980 8.00 11.4
9 1980 9.00 13.0
10 1980 10.0 14.1
11 1980 11.0 0
12 1980 12.0 13.9
A base R solution:
tab <- table(month.vec$year, factor(month.vec$month.id, levels = 1:12))/length(month.vec$month.id) * 100
dfnew <- as.data.frame(tab)
which gives:
> dfnew
Var1 Var2 Freq
1 1980 1 10.6
2 1980 2 13.2
3 1980 3 11.6
4 1980 4 12.2
5 1980 5 0.0
6 1980 6 0.0
7 1980 7 0.0
8 1980 8 11.4
9 1980 9 13.0
10 1980 10 14.1
11 1980 11 0.0
12 1980 12 13.9
Or with data.table:
library(data.table)
setDT(month.vec)[, .N, by = .(year, month.id)
][.(year = 1980, month.id = 1:12), on = .(year, month.id)
][, N := 100 * N/sum(N, na.rm = TRUE)][]
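The base R idea above can be extended to every year at once and made to return NA (as the question asked) instead of 0. This is a sketch, not a drop-in replacement: the names tab, pct and out are illustrative, and it assumes that a zero count always means "this month never occurs", so it should be shown as NA.

```r
set.seed(123)
df <- data.frame(loc.id = rep(1:1000, each = 35),
                 year = rep(1980:2014, times = 1000),
                 month.id = sample(c(1:4, 8:10, 12), 35 * 1000, replace = TRUE))

# Year x month counts, with all 12 months forced to appear as columns
tab <- table(df$year, factor(df$month.id, levels = 1:12))

# Row-wise proportions (each year sums to 100)
pct <- prop.table(tab, margin = 1) * 100

# Long format: one row per year/month combination
out <- as.data.frame(pct, stringsAsFactors = FALSE)
names(out) <- c("year", "month", "percent")
out$year  <- as.integer(out$year)
out$month <- as.integer(out$month)

# Assumption: a month with zero occurrences is reported as NA
out$percent[out$percent == 0] <- NA
out <- out[order(out$year, out$month), ]
head(out, 12)
```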
I have a panel dataset where I want to average over a specified number of time periods (t) by variable (column).
An example:
Country Year Var 1 Var 2 Var 3
Austria 1984 1 3.6 95
Austria 1985 2 4.1 94.6
Austria 1986 1 2.6 93.6
Austria 1987 1 3 94.4
Austria 1988 1 3.9 95.2
What I want is a new column (or a new data frame) holding, for each variable, its average over the 5-year period (1984-1988): one for Var 1, one for Var 2, one for Var 3, and so on.
I also want to apply the function to the other countries in my dataset without the averaging mixing up countries. I was thinking of matching on a string pattern (for instance code %in% AUT in this case, since I have a variable with country codes), but I couldn't figure out how to do it.
Thank you very much in advance
1) Using the sample input in the Note at the end, read in the country and year from the row names and round the year up to the end of the current 5 year period so that each year from 1984 to 1988 gets rounded up to 1988, etc. Then use aggregate to calculate the means of each column by both country and year. No packages are used.
By0 <- read.table(text = rownames(DF), col.names = c("Country", "Year"))
By <- transform(By0, Year = 5 * ((Year - min(Year)) %/% 5) + min(Year) + 4)
aggregate(DF, By, mean)
giving the following:
Country Year Var 1 Var 2 Var 3
1 Australia 1988 1.6 18.46 95.52
2 Austria 1988 1.2 3.44 94.56
2) Or, if what was wanted was to append the columns to the original data frame, lapply over the columns, using ave to take the mean of each by Country and 5-year period:
out <- cbind(DF, lapply(DF, function(x) with(By, ave(x, Country, Year, FUN = mean))))
names(out) <- c(names(DF), paste("Mean", names(DF)))
giving:
> out
Var 1 Var 2 Var 3 Mean Var 1 Mean Var 2 Mean Var 3
Austria 1984 1 3.6 95.0 1.2 3.44 94.56
Austria 1985 2 4.1 94.6 1.2 3.44 94.56
Austria 1986 1 2.6 93.6 1.2 3.44 94.56
Austria 1987 1 3.0 94.4 1.2 3.44 94.56
Austria 1988 1 3.9 95.2 1.2 3.44 94.56
Australia 1984 1 3.6 95.0 1.6 18.46 95.52
Australia 1985 2 4.1 94.6 1.6 18.46 95.52
Australia 1986 1 2.6 93.6 1.6 18.46 95.52
Australia 1987 1 3.0 94.4 1.6 18.46 95.52
Australia 1988 3 79.0 100.0 1.6 18.46 95.52
Note
The input used, shown reproducibly, is:
Lines <- "
Var 1,Var 2,Var 3
Austria 1984,1,3.6,95
Austria 1985,2,4.1,94.6
Austria 1986,1,2.6,93.6
Austria 1987,1,3,94.4
Austria 1988,1,3.9,95.2
Australia 1984,1,3.6,95
Australia 1985,2,4.1,94.6
Australia 1986,1,2.6,93.6
Australia 1987,1,3,94.4
Australia 1988,3,79,100"
DF <- read.csv(text = Lines, check.names = FALSE)
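If Country and Year are ordinary columns, as in the table shown in the question, rather than row names, the same grouping idea can be sketched as follows. The data frame panel, the syntactic column names Var1 to Var3, and the Period column (labelled by the first year of each 5-year block) are assumptions made for illustration.

```r
# Same numbers as the Note above, with Country and Year as columns
panel <- data.frame(
  Country = rep(c("Austria", "Australia"), each = 5),
  Year    = rep(1984:1988, times = 2),
  Var1    = c(1, 2, 1, 1, 1,  1, 2, 1, 1, 3),
  Var2    = c(3.6, 4.1, 2.6, 3, 3.9,  3.6, 4.1, 2.6, 3, 79),
  Var3    = c(95, 94.6, 93.6, 94.4, 95.2,  95, 94.6, 93.6, 94.4, 100)
)

# Label each row with the first year of its 5-year block
panel$Period <- 5 * ((panel$Year - min(panel$Year)) %/% 5) + min(panel$Year)

# Grouping by Country as well as Period keeps countries from being mixed
means <- aggregate(cbind(Var1, Var2, Var3) ~ Country + Period,
                   data = panel, FUN = mean)
means
```

Because Country is one of the grouping variables, no string matching on country codes is needed.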
I want to convert long daily time series of temperature and rainfall to monthly values, then interpolate and compute the temporal trend for a country using 90 meteorological stations.
The data format looks like this:
Year Month Day Rain MaxT MinT
1 1970 1 1 0.0 23.0 -99.9
2 1970 1 2 0.0 23.0 -99.9
3 1970 1 3 0.0 23.0 -99.9
4 1970 1 4 0.0 24.0 -99.9
5 1970 1 5 0.0 23.0 -99.9
6 1970 1 6 0.0 23.0 -99.9
7 1970 1 7 0.0 23.0 -99.9
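One possible starting point for the monthly aggregation step is sketched below. It is only a sketch under stated assumptions: -99.9 is assumed to be a missing-value code, rainfall is assumed to be summed and temperatures averaged, and the names daily, monthly_rain, monthly_temp and monthly are illustrative. Interpolation and trend estimation are not covered.

```r
# The seven sample rows from the question
daily <- data.frame(Year = 1970, Month = 1, Day = 1:7,
                    Rain = 0,
                    MaxT = c(23, 23, 23, 24, 23, 23, 23),
                    MinT = -99.9)

# Assumption: -99.9 marks a missing observation
vars <- c("Rain", "MaxT", "MinT")
daily[vars] <- lapply(daily[vars], function(v) replace(v, v == -99.9, NA))

# Monthly totals for rain, monthly means for temperature
monthly_rain <- aggregate(daily["Rain"], by = daily[c("Year", "Month")],
                          FUN = sum, na.rm = TRUE)
monthly_temp <- aggregate(daily[c("MaxT", "MinT")], by = daily[c("Year", "Month")],
                          FUN = function(v) mean(v, na.rm = TRUE))
monthly <- merge(monthly_rain, monthly_temp, by = c("Year", "Month"))
monthly
```

A month in which every value is missing gives NaN for a mean and 0 for a sum with na.rm = TRUE, so such months may need to be checked separately.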
I have a very large data set, structured as the sample below.
I have been trying to use the na.spline function in order to
1) identify the "fips" category with missing Yield.
2) if fewer than 3 Yield values are NA per fips (here 1-3), the spline function should kick in and fill in the NAs.
3) If 3 or more Yields are NA for a "fips" the code should remove the entire "fips" subset, in this case fips 2 should be removed.
My code so far:
finX <- dataset
finxx <- transform(subset(finX, ave(na.spline(finX$Yield), fips, FUN=sum)<2))
#or
finxx <- transform(subset(finX, ave(is.na(finX$Yield), fips, FUN=sum)<2))
Year fips Max Min Rain Yield
1980 1 24.7 0.0 71 37
1981 1 22.8 0.0 62 40
1982 1 22.6 0.0 47 37
1983 1 24.2 0.0 51 39
1984 1 23.8 0.0 61 47
1985 1 25.1 0.0 67 43
1980 2 24.8 0.0 72 34
1981 2 23.2 0.4 54 **NA**
1982 2 25.3 0.1 83 55
1983 2 23.0 0.0 68 **NA**
1984 2 22.4 0.7 70 **NA**
1985 2 24.6 0.0 47 31
1980 3 25.5 0.0 51 31
1981 3 25.5 0.0 51 31
1982 3 25.5 0.0 51 31
1983 3 25.5 0.0 51 **NA**
1984 3 25.5 0.0 51 31
...
Currently the code above either does not fill in all the NAs in the final product, or simply gives no result at all.
Any guidance would be very useful, thank you.
Yield needs to be converted from character to numeric or NA. Then use by to divide finX into separate data frames by fips value. For each data frame with fewer than 3 NAs, do the spline interpolation. Those with 3 or more are returned as NULL. Finally, combine the list of returned data frames into a single data frame. The code would look like:
library(zoo)
# convert finX$Yield values from character to either numeric or NA
finX$Yield <- sapply(finX$Yield, function(x) if(x =="**NA**") NA_real_ else as.numeric(x))
# use spline interpolation on fips sets with less than 3 NA's
finxx <- by(finX, finX$fips, function(x) if(sum(is.na(x$Yield)) < 3) transform(x, Yield=na.spline(object=Yield, x=Year)) )
# combine results into a single data frame
finxx <- do.call(rbind, finxx)
Alternatively after the conversion to numeric values, you could use ave on the Yield column where spline interpolation returns values on fips sets with less than 3 NA's and all NA's on any other sets. All rows with any NA's in the final result would then be deleted. Code is as follows:
finxx2 <- transform(finX, Yield=ave(Yield, fips, FUN=function(x) if(sum(is.na(x)) < 3) na.spline(object=x) else NA))
finxx2 <- na.omit(finxx2)
Both versions give the same result for the sample data, but the first version using by lets you work with the full data frame for each fips set rather than with just Yield. Here, that allowed Year to be specified as the x values in the spline interpolation, so a data set with a missing Year would still be interpolated correctly. In that situation the ave version, which treats the observations as equally spaced, would give an incorrect answer. So the by version seems more robust.
There's also the dplyr version which is very much like the by version above and gives the same answer as the base R versions. If you're OK with working with dplyr, this is probably the most straightforward and robust approach.
library(dplyr)
finxx3 <- finX %>% group_by(fips) %>%
filter(sum(is.na(Yield)) < 3) %>%
mutate(Yield=na.spline(object=Yield, x=Year))
The first version returns
Year fips Max Min Rain Yield
1.1 1980 1 24.7 0 71 37
1.2 1981 1 22.8 0 62 40
1.3 1982 1 22.6 0 47 37
1.4 1983 1 24.2 0 51 39
1.5 1984 1 23.8 0 61 47
1.6 1985 1 25.1 0 67 43
3.13 1980 3 25.5 0 51 31
3.14 1981 3 25.5 0 51 31
3.15 1982 3 25.5 0 51 31
3.16 1983 3 25.5 0 51 31
3.17 1984 3 25.5 0 51 31
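For comparison, the same keep-or-drop logic can be sketched without zoo, using stats::spline directly. This is not the code from the answer above: the helper fill_group and the trimmed-down sample (only Year, fips and Yield, with the **NA** markers already converted to NA) are assumptions for illustration.

```r
# Year, fips and Yield from the question's sample
finX <- data.frame(
  Year  = c(1980:1985, 1980:1985, 1980:1984),
  fips  = c(rep(1, 6), rep(2, 6), rep(3, 5)),
  Yield = c(37, 40, 37, 39, 47, 43,
            34, NA, 55, NA, NA, 31,
            31, 31, 31, NA, 31)
)

# Hypothetical helper: drop a fips group with 3 or more NAs,
# otherwise fill its NAs by cubic spline interpolation over Year
fill_group <- function(g) {
  miss <- is.na(g$Yield)
  if (sum(miss) >= 3) return(NULL)
  if (any(miss)) {
    g$Yield[miss] <- spline(g$Year[!miss], g$Yield[!miss],
                            xout = g$Year[miss])$y
  }
  g
}

groups <- lapply(split(finX, finX$fips), fill_group)
filled <- do.call(rbind, Filter(Negate(is.null), groups))
filled
```

Passing Year as the x values means gaps in the years are respected, which is the same reason the by version was preferred over the ave version.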
I am new to R and I have a question I am trying to find the answer to. I have a file organized as follows (it has thousands of rows, but I just show a sample for simplicity):
YEAR Month day S1 T1 T2 R1
1965 3 2 11.7 20.6 11.1 18.8
1965 3 3 14.0 16.7 3.3 0.0
1965 3 4 -99.9 -99.9 -99.9 -99.9
1965 3 5 9.2 5.6 0.0 -99.9
1965 3 6 10.1 6.7 0.0 -99.9
1965 3 7 9.7 7.2 1.1 0.0
I would like to know, for each column (T1, T2, and R1), in which Year, Month and Day the -99.9 values are located; e.g. from 1980/1/3 to 1980/1/27 there are X values of -99.9 for T1, from 1990/2/10 to 1990/3/30 there are Y values of -99.9 for T1, and so on. The same for T2 and R1.
How can I do this in R?
This is only one file, but I have almost 2000 files with the same problem. I know I will need a loop, but once I know how to do it for one file I can write the loop myself.
I really appreciate any help. Thank you very much in advance for reading and helping!!!
I took the liberty of renaming your last dataframe column "R1"
lapply(c('T1', 'T2', 'R1'), function(x) {
  dfrm[dfrm[[x]] == -99.9,  # rows to select
       1:3]                 # columns to return
})
#-------------
[[1]]
YEAR Month day
3 1965 3 4
[[2]]
YEAR Month day
3 1965 3 4
[[3]]
YEAR Month day
3 1965 3 4
4 1965 3 5
5 1965 3 6
It wasn't clear whether you wanted the values or counts (and I don't think you can have both in the same report.) If you wanted to name the entries:
> misdates <- .Last.value
> names(misdates) <- c('T1', 'T2', 'R1')
If you wanted a count:
lapply(misdates, NROW)
$T1
[1] 1
$T2
[1] 1
$R1
[1] 3
(You might want to learn how to use NA values. Using numbers as missing values is not recommended data management.)
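As a small illustration of that advice, the sentinel can be turned into NA while reading the file, via the na.strings argument of read.table. This sketch assumes -99.9 is always written exactly as "-99.9" in the file, since na.strings is matched against the field text; the inline text stands in for one of the real files.

```r
dfrm <- read.table(text = "YEAR Month day S1 T1 T2 R1
1965 3 2 11.7 20.6 11.1 18.8
1965 3 3 14.0 16.7 3.3 0.0
1965 3 4 -99.9 -99.9 -99.9 -99.9
1965 3 5 9.2 5.6 0.0 -99.9
1965 3 6 10.1 6.7 0.0 -99.9
1965 3 7 9.7 7.2 1.1 0.0",
                   header = TRUE,
                   na.strings = c("NA", "-99.9"))  # treat -99.9 as missing

# Number of missing values per measurement column
na_counts <- colSums(is.na(dfrm[c("S1", "T1", "T2", "R1")]))
na_counts
```

After this, is.na() and the na.rm arguments of functions such as mean() work as usual.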
If I understand correctly, you want to count how many "-99.9"s there are per month and per column.
Here's my code for S1 using plyr. Note that I expanded your example with one more month of data.
library(plyr)
my.table <-read.table(text="YEAR Month day S1 T1 T2 R1
1965 3 2 11.7 20.6 11.1 18.8
1965 3 3 14.0 16.7 3.3 0.0
1965 3 4 -99.9 -99.9 -99.9 -99.9
1965 3 5 9.2 5.6 0.0 -99.9
1965 3 6 10.1 6.7 0.0 -99.9
1965 3 7 9.7 7.2 1.1 0.0
1966 1 7 -99.9 7.2 1.1 0.0
1966 1 8 -99.9 7.2 1.1 0.0
", header=TRUE, as.is=TRUE,sep = " ")
#Create a year/month column to summarise per month
my.table$yearmonth <-paste(my.table$YEAR,"/",my.table$Month,sep="")
S1 <-count(my.table[my.table$S1==-99.9,],"yearmonth")
S1
yearmonth freq
1 1965/3 1
2 1966/1 2
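The question also asked for the start and end date of each stretch of -99.9 values. A minimal sketch using rle is given below; the helper missing_runs and its output columns are hypothetical, and it assumes the rows are already in date order, so that neighbouring rows count as one stretch.

```r
dfrm <- read.table(text = "YEAR Month day S1 T1 T2 R1
1965 3 2 11.7 20.6 11.1 18.8
1965 3 3 14.0 16.7 3.3 0.0
1965 3 4 -99.9 -99.9 -99.9 -99.9
1965 3 5 9.2 5.6 0.0 -99.9
1965 3 6 10.1 6.7 0.0 -99.9
1965 3 7 9.7 7.2 1.1 0.0", header = TRUE)

# Hypothetical helper: one row per consecutive run of `code` in column `col`
missing_runs <- function(d, col, code = -99.9) {
  r      <- rle(d[[col]] == code)      # run lengths of TRUE/FALSE
  ends   <- cumsum(r$lengths)          # last row index of each run
  starts <- ends - r$lengths + 1       # first row index of each run
  keep   <- which(r$values)            # only the runs of missing values
  dates  <- sprintf("%d/%d/%d", d$YEAR, d$Month, d$day)
  data.frame(column = rep(col, length(keep)),
             from   = dates[starts[keep]],
             to     = dates[ends[keep]],
             n      = r$lengths[keep],
             stringsAsFactors = FALSE)
}

runs <- do.call(rbind, lapply(c("T1", "T2", "R1"), missing_runs, d = dfrm))
runs
```

For many files, the same function could be called inside a loop over the file names, reading each file with read.table first.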