Yearly average temperature function in R

I need to write a function to calculate the average annual temperature. I have data for each month's temperature from 1880 to 2017 and need a vector that displays each of the 138 average temperatures. This is what I have tried; I know it's not much and not very good, but I am very new to this, so bear with me:
average <- function(x) {
  if (any(is.na(x)))
    stop("x is missing values")
  c(mean(x[, 1:138]))
}
sapply(gistemp.new[2:13], average)
The gistemp.new is the name I gave for the data frame and the first column is just the year. It is like this:
Year Jan Feb Mar Apr May Jun
1 1880 -0.29 -0.18 -0.11 -0.19 -0.11 -0.23
2 1881 -0.15 -0.17 0.04 0.04 0.02 -0.20
3 1882 0.15 0.15 0.04 -0.18 -0.16 -0.26
4 1883 -0.31 -0.39 -0.13 -0.17 -0.20 -0.12

df1 <- read.table(text="Year Jan Feb Mar Apr May Jun
1 1880 -0.29 -0.18 -0.11 -0.19 -0.11 -0.23
2 1881 -0.15 -0.17 0.04 0.04 0.02 -0.20
3 1882 0.15 0.15 0.04 -0.18 -0.16 -0.26
4 1883 -0.31 -0.39 -0.13 -0.17 -0.20 -0.12", header = TRUE, stringsAsFactors = FALSE)
rowMeans(df1[-1])
# 1 2 3 4
# -0.18500000 -0.07000000 -0.04333333 -0.22000000
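Applied to your full data frame (a sketch, assuming, as in your sapply call, that the twelve monthly columns are columns 2 to 13), the same idea gives all 138 yearly averages at once, named by year:
# one mean per row (year), taken across the month columns
yearly <- rowMeans(gistemp.new[, 2:13])
names(yearly) <- gistemp.new$Year
yearly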

Related

R: Add column to df where rows have the name of the element from a list

I have a list of all the files (each containing a data frame) in a directory:
library("plyr")
library("dplyr")
library("broom")
library("tidyr")
snp_list <- list.files(pattern = "\\.txt$", all.files = TRUE, full.names = FALSE)
I also have a dataframe A obtained through the following function:
pv1 <- lapply(snp_list, function(x) tidy(lm(PV ~ GT*SEX + M + GT*N, read.table(x, header = TRUE)))) %>%
  bind_rows()
Data frame A has 7 rows ((Intercept), GT, SEX, M, N, GT:SEX, GT:N) for each element in list snp_list. In this toy example the list has 3 elements (rs1406947.txt, rs25904.txt, rs7133579.txt), but in reality there are 1,200,000 elements.
A:
term estimate st.error statistic p.value
(Intercept) 7.68 0.17 44.64 0
GT 0.01 0.01 0.07 0.19
SEX 1.52 0.14 10.87 0.1
M 0.12 0.29 0.41 0.67
N -0.06 0.12 -0.48 0.63
GT:SEX -0.03 0.08 -0.44 0.65
GT:N -0.00 0.06 -0.08 0.93
(Intercept) 9.23 0.20 34.64 0
GT 0.05 0.04 0.12 0.22
SEX 1.67 0.76 10.34 0.1
M 0.14 0.39 0.51 0.55
N -0.08 0.05 -0.46 0.55
GT:SEX -0.19 0.11 -0.34 0.44
GT:N -0.22 0.33 -0.44 0.55
(Intercept) 7.99 0.66 44.44 0
GT 0.01 0.3 0.04 0.33
SEX 1.22 0.22 10.44 0.15
M 0.88 0.22 0.33 0.44
N -0.5 0.5 -0.5 0.6
GT:SEX -0.06 0.09 -0.74 0.35
GT:N -0.00 0.03 -0.04 0.78
I want to add a new column "SNP" to A, where each row has the name of the element the row belongs to (nrows = 7*1,200,000). I would get this:
term estimate st.error statistic p.value SNP
(Intercept) 7.68 0.17 44.64 0 rs1406947
GT 0.01 0.01 0.07 0.19 rs1406947
SEX 1.52 0.14 10.87 0.1 rs1406947
M 0.12 0.29 0.41 0.67 rs1406947
N -0.06 0.12 -0.48 0.63 rs1406947
GT:SEX -0.03 0.08 -0.44 0.65 rs1406947
GT:N -0.00 0.06 -0.08 0.93 rs1406947
(Intercept) 9.23 0.20 34.64 0 rs25904
GT 0.05 0.04 0.12 0.22 rs25904
SEX 1.67 0.76 10.34 0.1 rs25904
M 0.14 0.39 0.51 0.55 rs25904
N -0.08 0.05 -0.46 0.55 rs25904
GT:SEX -0.19 0.11 -0.34 0.44 rs25904
GT:N -0.22 0.33 -0.44 0.55 rs25904
(Intercept) 7.99 0.66 44.44 0 rs7133579
GT 0.01 0.3 0.04 0.33 rs7133579
SEX 1.22 0.22 10.44 0.15 rs7133579
M 0.88 0.22 0.33 0.44 rs7133579
N -0.5 0.5 -0.5 0.6 rs7133579
GT:SEX -0.06 0.09 -0.74 0.35 rs7133579
GT:N -0.00 0.03 -0.04 0.78 rs7133579
Here's how to do what you asked:
A$SNP <- rep(NA_character_, nrow(A))
for (i in 1:nrow(A)) {
  # row i belongs to element ((i - 1) %/% 7) + 1 of snp_list; drop the ".txt" extension
  A$SNP[i] <- sub("\\.txt$", "", snp_list[((i - 1) %/% 7) + 1])
}
Integer division maps each block of 7 coefficient rows to the corresponding element of snp_list.
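Since each file contributes exactly 7 rows, in the same order as snp_list, a vectorized alternative is possible; this is a sketch that assumes the 7-rows-per-file structure holds throughout, and it avoids an 8.4-million-iteration loop:
# repeat each file name once per coefficient row (7 per file), dropping ".txt"
A$SNP <- rep(sub("\\.txt$", "", snp_list), each = 7)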

Filter rows of dataframe based on combinations of conditions

Let's say we have df1 with p values:
Symbol p1 p2 p3 p4 p5
AABT 0.01 0.12 0.23 0.02 0.32
ABC1 0.13 0.01 0.01 0.12 0.02
ACDC 0.15 0.01 0.34 0.24 0.01
BAM1 0.01 0.02 0.04 0.01 0.02
BCR 0.01 0.36 0.02 0.07 0.04
BDSM 0.02 0.43 0.01 0.03 0.41
BGL 0.27 0.77 0.01 0.04 0.02
and df2 with Fold Changes:
Symbol FC1 FC2 FC3 FC4 FC5
AABT 1.21 -0.32 0.23 -0.72 0.45
ABC1 0.13 0.93 -1.61 0.12 1.03
ACDC 0.23 1.31 0.42 -0.39 1.50
BAM1 -1.33 -1.27 -0.89 1.22 -1.03
BCR 1.43 -0.25 1.29 0.54 0.97
BDSM 1.20 0.23 -1.98 -1.09 -0.31
BGL 0.33 0.12 -1.33 -1.14 -1.23
I would like to do the following in df2:
1. Keep rows that, in df1, have values < 0.05 in at least 3 of the 5 columns.
2. Eliminate rows whose FCs show discordant signs; an FC should only be considered when the corresponding p value in df1 is below 0.05 (i.e. significant).
3. Sort the resulting data in an intuitive order that separates rows with positive FCs from rows with negative FCs and, if possible, separates rows whose significant FCs occur consecutively (e.g. FC3 FC4 FC5) from those that don't (e.g. FC1 FC3 FC5).
For example, step 1 would result in:
Symbol FC1 FC2 FC3 FC4 FC5
ABC1 0.13 0.93 -1.61 0.12 1.03
BAM1 -1.33 -1.27 -0.89 1.22 -1.03
BCR 1.43 -0.25 1.29 0.54 0.97
BDSM 1.20 0.23 -1.98 -1.09 -0.31
BGL 0.33 0.12 -1.33 -1.14 -1.23
and step 2, in:
Symbol FC1 FC2 FC3 FC4 FC5
BCR 1.43 -0.25 1.29 0.54 0.97
BGL 0.33 0.12 -1.33 -1.14 -1.23
How can this be achieved? I imagine using a for loop and the count function would do the job for step 1, but steps 2 and 3 look somewhat complicated to me. Thank you in advance for your elegant solutions.
data
df1:
df1 <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "Symbol p1 p2 p3 p4 p5
AABT 0.01 0.12 0.23 0.02 0.32
ABC1 0.13 0.01 0.01 0.12 0.02
ACDC 0.15 0.01 0.34 0.24 0.01
BAM1 0.01 0.02 0.04 0.01 0.02
BCR 0.01 0.36 0.02 0.07 0.04
BDSM 0.02 0.43 0.01 0.03 0.41
BGL 0.27 0.77 0.01 0.04 0.02")
df2:
df2 <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "Symbol FC1 FC2 FC3 FC4 FC5
AABT 1.21 -0.32 0.23 -0.72 0.45
ABC1 0.13 0.93 -1.61 0.12 1.03
ACDC 0.23 1.31 0.42 -0.39 1.50
BAM1 -1.33 -1.27 -0.89 1.22 -1.03
BCR 1.43 -0.25 1.29 0.54 0.97
BDSM 1.20 0.23 -1.98 -1.09 -0.31
BGL 0.33 0.12 -1.33 -1.14 -1.23")
I'm not sure how elegant this is, but you can get the result you requested using apply and sapply with subsetting, like this:
# Create logical matrix telling us whether p values are significant
sig <- apply(df1[-1], 2, function(x) x < 0.05)
# Create numeric matrix of the sign of each FC (will be either -1 or 1)
sign <- apply(df2[-1], 2, function(x) sign(x))
# Create a vector telling us whether there were 3 or more p < 0.05 in each row
ss1 <- apply(sig, 1, function(x) length(which(x)) > 2)
# Create a vector telling us whether all FC signs match excluding p = ns
ss2 <- sapply(seq(nrow(df1)), function(i) length(table(sign[i,][sig[i,]])) == 1)
# Subset the data frames accordingly:
df1[ss1, ]
#> Symbol p1 p2 p3 p4 p5
#> 2 ABC1 0.13 0.01 0.01 0.12 0.02
#> 4 BAM1 0.01 0.02 0.04 0.01 0.02
#> 5 BCR 0.01 0.36 0.02 0.07 0.04
#> 6 BDSM 0.02 0.43 0.01 0.03 0.41
#> 7 BGL 0.27 0.77 0.01 0.04 0.02
df2[ss1 & ss2, ]
#> Symbol FC1 FC2 FC3 FC4 FC5
#> 5 BCR 1.43 -0.25 1.29 0.54 0.97
#> 7 BGL 0.33 0.12 -1.33 -1.14 -1.23
Created on 2020-07-10 by the reprex package (v0.3.0)
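For step 3, a partial sketch (it only separates rows with positive significant FCs from rows with negative ones; the "sequential significance" grouping is not attempted) can reuse the sig and sign matrices from above:
res <- df2[ss1 & ss2, ]
# the common sign (+1 or -1) of each kept row's significant fold changes
row_sign <- sapply(which(ss1 & ss2), function(i) unique(sign[i, ][sig[i, ]]))
res[order(row_sign, decreasing = TRUE), ]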

How to subset a time series in R

In particular, I'd like to subset the temperature measurements from 1960 onwards in the time series gtemp in the package astsa:
require(astsa)
gtemp
Time Series:
Start = 1880
End = 2009
Frequency = 1
[1] -0.28 -0.21 -0.26 -0.27 -0.32 -0.32 -0.29 -0.36 -0.27 -0.17 -0.39 -0.27 -0.32
[14] -0.33 -0.33 -0.25 -0.14 -0.11 -0.25 -0.15 -0.07 -0.14 -0.24 -0.30 -0.34 -0.24
[27] -0.19 -0.39 -0.33 -0.35 -0.33 -0.34 -0.32 -0.30 -0.15 -0.10 -0.30 -0.39 -0.33
[40] -0.20 -0.19 -0.14 -0.26 -0.22 -0.22 -0.17 -0.02 -0.15 -0.12 -0.26 -0.08 -0.02
[53] -0.08 -0.19 -0.07 -0.12 -0.05 0.07 0.10 0.01 0.04 0.10 0.03 0.09 0.19
[66] 0.06 -0.05 0.00 -0.04 -0.07 -0.16 -0.04 0.03 0.11 -0.10 -0.10 -0.17 0.08
[79] 0.08 0.06 -0.01 0.07 0.04 0.08 -0.21 -0.11 -0.03 -0.01 -0.04 0.08 0.03
[92] -0.10 0.00 0.14 -0.08 -0.05 -0.16 0.12 0.01 0.08 0.18 0.26 0.04 0.26
[105] 0.09 0.05 0.12 0.26 0.31 0.19 0.37 0.35 0.12 0.13 0.23 0.37 0.29
[118] 0.39 0.56 0.32 0.33 0.48 0.56 0.55 0.48 0.62 0.54 0.57 0.43 0.57
The individual time points are not labeled with years, so although I can do gtemp[3] (which returns -0.26), I can't do something like gtemp[as.date(1960)] to get, for instance, the value for 1960.
How can I bring out the correspondence between year and measurements, so as to later subset values?
We can make use of the window function
gtemp1 <- window(gtemp, start = 1960)
gtemp1
#Time Series:
#Start = 1960
#End = 2009
#Frequency = 1
#[1] -0.01 0.07 0.04 0.08 -0.21 -0.11 -0.03 -0.01 -0.04 0.08 0.03
#[12] -0.10 0.00 0.14 -0.08 -0.05 -0.16 0.12 0.01 0.08 0.18 0.26
#[23] 0.04 0.26 0.09 0.05 0.12 0.26 0.31 0.19 0.37 0.35 0.12
#[34] 0.13 0.23 0.37 0.29 0.39 0.56 0.32 0.33 0.48 0.56 0.55
#[45] 0.48 0.62 0.54 0.57 0.43 0.57
The time() function can also help to answer your question:
How can I bring out the correspondence between year and measurements, so as to later subset values?
head(time(gtemp))
[1] 1880 1881 1882 1883 1884 1885
If you want the value that corresponds to 1961, you can write
gtemp[time(gtemp) == 1961]
[1] 0.07
As mentioned in the first answer, you can also use the window() function:
window(gtemp, start = 1961, end = 1961)
Time Series:
Start = 1961
End = 1961
Frequency = 1
[1] 0.07
which returns the result as a one-point time series. You can convert it into a plain number with:
as.numeric(window(gtemp, start = 1961, end = 1961))
[1] 0.07
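If you prefer ordinary data-frame subsetting, you can also pair the years with the values explicitly; a small sketch using base functions only:
gtemp.df <- data.frame(year = as.numeric(time(gtemp)), temp = as.numeric(gtemp))
subset(gtemp.df, year >= 1960)   # all measurements from 1960 onwards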

Put result of forecast::ma() as a matrix and compute RMSE

I am really new to R. I am trying to calculate some MA[n] forecasts in R.
Here is my code,
# simple reproducible example
set.seed(0); factory <- round(rnorm(84), 1)
library(forecast)
factory.ts <- ts(factory, start = 1947, frequency = 12)
fit_EMA <- ma(factory.ts, order=5)
It works fine. Below is what fit_EMA looks like in the R console. But I don't like the format, as I couldn't find a way to extract the fitted points for further use. For example, how can I extract a row or column?
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
1947 NA NA 0.80 0.24 0.12 -0.20 -0.46 -0.06 0.40 0.42 0.26 0.20
1948 -0.34 -0.58 -0.36 -0.32 -0.18 -0.36 -0.32 -0.30 -0.10 -0.02 0.20 0.34
1949 0.48 0.32 -0.10 -0.08 -0.22 -0.54 -0.48 -0.34 -0.20 0.08 0.38 0.38
1950 0.74 0.54 0.66 0.58 0.56 0.16 -0.02 -0.60 -1.04 -0.70 -0.38 -0.18
1951 0.10 0.34 0.58 0.26 0.28 0.28 0.48 -0.04 -0.32 -0.56 -0.54 -0.66
1952 -0.80 -0.38 -0.28 -0.32 -0.60 -0.34 -0.28 -0.10 -0.14 0.20 0.00 -0.06
1953 0.06 0.28 0.24 0.34 0.18 -0.24 -0.62 -0.38 -0.20 -0.06 NA NA
Also, how can I calculate RMSE or other error measures? forecast::ma, TTR::SMA, and TTR::EMA don't give calculated error measures in their summaries. Or have I missed a library function?
The result of forecast::ma() is always a "ts" object. Although your fit_EMA appears as a matrix when you print it to the screen (because frequency = 12, so you have 12 columns), it is essentially a vector. You can use str(fit_EMA) to inspect it. You can do
mat <- matrix(fit_EMA, ncol = 12, byrow = TRUE)
to get a matrix. Then mat[1, ] gives the fitted values for the first year (year 1947).
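If it helps, you can also label the matrix as you build it (a sketch; it relies on your 84 monthly values running from 1947 through 1953), so rows and columns can be selected by name:
mat <- matrix(fit_EMA, ncol = 12, byrow = TRUE,
              dimnames = list(1947:1953, month.abb))
mat["1950", ]   # fitted values for 1950
mat[, "Jan"]    # the fitted value for every January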
Getting RMSE is so straightforward that a function / library routine is not needed. Do:
MSE <- mean((fit_EMA - factory.ts) ^ 2, na.rm = TRUE)
# [1] 0.55876
RMSE <- sqrt(MSE)
# [1] 0.7475025
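Other error measures can be computed the same way; for example, the mean absolute error (a sketch, again comparing the smoothed series to the original observations and skipping the NA ends):
MAE <- mean(abs(fit_EMA - factory.ts), na.rm = TRUE)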

A better way to plot lots of lines (in ggplot perhaps)?

Using R 3.0.2, I have a data frame whose head() looks like this:
0 5 10 15 30 60 120 180 240
YKL134C 0.08 -0.03 -0.74 -0.92 -0.80 -0.56 -0.54 -0.42 -0.48
YMR056C -0.33 -0.26 -0.56 -0.58 -0.97 -1.47 -1.31 -1.53 -1.55
YBR085W 0.55 3.33 4.11 3.47 2.16 2.19 2.01 2.09 1.55
YJR155W -0.44 -0.92 -0.27 0.75 0.28 0.45 0.45 0.38 0.51
YNL331C 0.42 0.01 -0.05 0.23 0.19 0.43 0.73 0.95 0.86
YOL165C -0.49 -0.46 -0.25 0.03 -0.26 -0.16 -0.12 -0.37 -0.34
The row.names() are the variable names, the names() are measurement times, and the values are measurements. It's several thousand rows deep. Let's call it tmp.
I want to do a sanity check by plotting every variable as a time-versus-value line on one plot. What's a better way to do it than naively plotting each line with plot() and lines():
timez <- as.numeric(names(tmp))
plot(x = timez, y = as.numeric(tmp[1, ]), type = "l", ylim = c(-5, 5))
for (i in 2:nrow(tmp)) {
  lines(x = timez, y = as.numeric(tmp[i, ]))
}
The above crude approach is good enough, but I'm looking for the right way to do this. I had a concussion recently, so sorry if I'm missing something obvious; I've been doing that a lot.
Could it be something with transposing the data.frame so that each time point is observed across several thousand variables? Or melt()-ing the data.frame in some meaningful way? Is there some way of handling it in ggplot using aggregate()s of data.frames or something? This isn't the right way to do this, is it?
At a loss.
I personally prefer ggplot2 for all of my plotting needs. Assuming I've understood you correctly, you can put the data in long format with reshape2 and then use ggplot2 to plot all of your lines on the same plot:
library(reshape2)
df$var <- rownames(df)                 # the variable names are stored as row names
df2 <- melt(df, id.var = "var")
names(df2) <- c("var", "time", "value")
df2$time <- as.numeric(substring(df2$time, 2))  # drop the leading "X" that R prepends to numeric column names
library(ggplot2)
ggplot(df2, aes(x = time, y = value, colour = var)) + geom_line()
You can simply use matplot as follows
DF
## 0 5 10 15 30 60 120 180 240
## YKL134C 0.08 -0.03 -0.74 -0.92 -0.80 -0.56 -0.54 -0.42 -0.48
## YMR056C -0.33 -0.26 -0.56 -0.58 -0.97 -1.47 -1.31 -1.53 -1.55
## YBR085W 0.55 3.33 4.11 3.47 2.16 2.19 2.01 2.09 1.55
## YJR155W -0.44 -0.92 -0.27 0.75 0.28 0.45 0.45 0.38 0.51
## YNL331C 0.42 0.01 -0.05 0.23 0.19 0.43 0.73 0.95 0.86
## YOL165C -0.49 -0.46 -0.25 0.03 -0.26 -0.16 -0.12 -0.37 -0.34
matplot(t(DF), type = "l", xaxt = "n", ylab = "")
axis(side = 1, at = seq_along(DF), labels = names(DF))
xaxt = "n" suppresses plotting of the x-axis annotations. The axis() function lets you specify details for any axis; here we use it to set the x-axis labels.
It should produce a plot with one line per row of DF.
