Correlation for subsets at a time - r

Maybe this question has been asked before, but I couldn't find anything that helps me.
I have a data frame which is a time series of 40 years with 4 columns: the first is the year, the second is the month (numbers from 1 to 12), and the third and fourth are the precipitation for place1 and place2. I would like to run a correlation analysis with cor() on the precipitation of place1 and place2, but for every 5 years at a time. The series also contains NA values. Is there a way to do this?
Here's some sample data:
year <- rep(1940:1959, each = 12)
month <- rep(1:12, 20)
place1<-c(14.7,26.3,10.2,132.4,286.3,158.2,72,99.5,217.6,267.9,80.3,NA,38.9,20.9,29.1,312.2,110.1,245,163.2,38.3,251.3,95.3,89.4,13.5,13.3,49.1,26.9,105.6,188.7,186.1,140.5,241.6,143.2,156.9,37.4,29.8,19.6,27.3,80.7,102.9,222.5,88.4,59.1,107.3,119.5,451.2,52.2,0,14.3,7.9,55.4,31.1,152.2,190.7,251,200.2,158.7,93,44.3,40.3,18.6,15.2,11.4,110.3,377.9,42.3,68.2,289.5,219.7,133.2,114.4,115.2,15.3,14,86.7,66.1,204.1,33.9,51.8,83,238.8,231.4,70.6,41.7,99.5,176.4,1.3,63,238.2,48.6,82.6,66.9,257,141.4,14.5,35.5,28.6,32.5,1.3,50.7,300.8,74.1,110.9,64.8,128,309.9,71.1,22.6,2.5,2.3,57.6,24.4,171.9,91,116.3,224.3,123.5,149.1,17.8,26,62.8,47.1,9.6,38.1,72.2,141.2,52.2,110.7,246.6,330.5,8.6,38.6,57.5,26.7,0,210,601.2,79.4,166.2,128.8,133.5,81.8,42,30.4,12.5,20.3,27.7,191.6,223.6,63.5,175.3,42.3,277.9,60.9,26.5,9.7,59.7,9.4,40.5,70.1,307.1,163.5,230,51.8,160.4,115.9,54.4,25.3,15.3,67.6,77.9,108.8,283.5,297.2,99.9,103.4,277.4,474.6,91.8,23.9,43.4,12.7,3,179.5,259.4,154.3,201.1,363.3,253.7,257.9,38.2,71.3,29.5,95.1,128.2,36.7,137.8,182.6,85.8,23.6,48.7,218.1,30.4,42.3,35,43.9,30,58.2,139.2,99,39.6,13.9,152.6,117.6,39,25.9,169.6,31.2,63.1,124.2,377.4,279.8,168.2,100,191.9,108.6,55.2,27.7,16,8.1,5.6,75.7,38.8,131.7,131,135.9,97.4,188.9,304.8,34.6)
place2<-c(5.4,18,0,19,111.5,30.6,39.2,178.8,77.3,292.5,28,21,45.9,31.5,16.5,54.9,117.8,270.2,131.6,45.5,248.6,55.5,32.5,16.3,42.9,18,19.4,112.4,77,315.8,71.9,201.8,37.3,84.8,25.4,10.6,31.3,12.1,54.1,112.4,122.4,44.4,55.6,160.3,81,257.1,65.8,3.8,11.9,10.7,16.5,51.9,81.4,142,321.5,251.7,144.4,97.6,3,1.8,11.1,16.6,13.9,41.7,218,55.7,50.6,159.8,94,57.9,48.1,121.8,8.6,3.3,64.2,21.8,169.8,55.9,26.4,79,77.5,75.5,67.1,41.9,40.9,132.4,37.3,93.7,67.1,128,52.6,17.2,184.9,97.6,4.3,15.2,21.1,39.9,1.5,53.3,89.4,43,97.7,55.1,232.3,27.9,118.2,5.1,0,4.3,66.1,9.2,122.1,191.4,81.1,80.4,79.8,112.9,51.5,13.9,14,21.3,42,16.7,261.1,287,26.1,134.1,106.3,205.1,29.5,1.5,5.9,14.5,1,219.1,451.3,107,213.6,48.2,92.4,105.2,11.5,6.9,3,13.7,44.5,61.2,99.3,95.7,193.4,13.2,217.1,87.8,11.2,3,75.7,5.3,0,31.1,167.8,198.2,42.2,121.6,180.2,121.9,31.3,22.8,31.9,25.5,69.9,19.4,109.6,179.2,73.2,198.6,425,612.1,26.8,3,71.4,34.9,7.1,8.8,69.8,227.7,86.6,88.7,126,195.4,13.5,36.6,1,80.5,23.4,24.1,31.4,139.5,68.6,53.6,40,232.9,77,32.2,21.1,23.1,9.1,15.3,48.6,140.2,50.8,55.8,59.6,46.2,10.2,18.3,105.9,11.1,0,46.3,307.7,110.2,294.2,200.5,74.3,147.9,30.9,31,67.9,15.8,30.1,56.1,128,25.9,119.2,41.1,56.2,235.4,22.9,10.8)
data <- data.frame(year, month, place1, place2)

You can bin the years with cut() and then compute the correlation within each bin. Note right = FALSE so that 1940 falls inside the first interval (the default right-closed intervals would turn 1940 into NA) and each group spans exactly five years:
data$year.group <- cut(data$year, seq(1940, 1960, by = 5), right = FALSE)
lapply(unique(data$year.group),
       function(x) with(data[data$year.group == x, ],
                        cor(place1, place2, use = 'pairwise.complete.obs')))
Alternatively, if you want to extend this to multiple columns, try this:
lapply(unique(data$year.group),
       function(x) cor(data[data$year.group == x, c('place1', 'place2', 'place3')],
                       use = 'pairwise.complete.obs'))
(and adjust the use argument to whatever NA handling you want)
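Not from the original answer, but the same per-group correlation can be sketched with dplyr (assuming the data and year.group objects created above); it returns one labelled row per 5-year bin instead of an unnamed list:
library(dplyr)
data %>%
  group_by(year.group) %>%
  summarise(r = cor(place1, place2, use = 'pairwise.complete.obs'))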

Related

How can I group a dataframe's observations 3 by 3?

I am struggling with a data frame of exchange-rate observations taken 3 times a day for approximately 30 days, so the data frame currently holds 90 observations. For my research I need to reduce this to 1 observation per day (30 observations), ideally by taking the mean of every 3 observations. In short, I need code that takes the observations 3 by 3 and outputs one observation for every 3. I have tried several approaches but they have all failed. I was wondering if someone has had to do something similar and managed.
Thanks!
Use group_by and summarise like this:
library(tidyverse)
df <- tibble(
  day = rep(1:30, each = 3),
  rate = rnorm(90)
)
df %>%
  group_by(day) %>%
  summarise(mrate = mean(rate))
P.S. Attach your data next time; it is easier to help with specific data.
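For completeness, here is a base R sketch of the same reduction (not from the original answer; it assumes the 90 rates arrive in observation order, 3 per day):
rate <- rnorm(90)                          # stand-in for the exchange rates
day <- rep(1:30, each = 3)                 # day index, 3 observations per day
daily <- tapply(rate, day, mean)           # one mean per day
# or, with no day index at all, average consecutive triples directly:
daily2 <- colMeans(matrix(rate, nrow = 3))
Both give 30 values, one per day.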

How to extrapolate/interpolate in R

I am trying to interpolate/extrapolate NA values. My dataset comes from a measuring station that records soil temperature at 4 depths every 5 minutes. In this specific example there are erroneous values (-888.88) at the end of the measurements for the 0 cm depth variable and the 1-5 cm depth variable, which I transformed to NA. My professor now wants me to interpolate/extrapolate these for this and all my other datasets. I am aware that extrapolating so many values after the last observation could be statistically unsound, but I am trying to at least come up with working code. So far I have tried to extrapolate one of the variables (SoilTemp_1.5cm). The final line runs, but when I open the data frame the NAs are still there.
library(dplyr)
library(Hmisc)
MyD <- read.csv("2319538_Bodentemp_braun_WILDKOGEL_17_18 - Copy.csv",header=TRUE, sep=";")
MyD$date <- as.Date(MyD$Date, "%d.%m.%Y")
MyD$datetime <- as.POSIXct(MyD$Date.Time..GMT.01.00, format = "%d.%m.%Y %H:%M")
MyD[,-c(1,2,3,4,9)][MyD[,-c(1,2,3,4,9)] == -888.88] <- NA #convert erroneous data to NA
MyD %>% mutate(`SoilTemp_1.5cm`=approxExtrap(x=SoilTemp_5cm, y=SoilTemp_1.5cm, xout=SoilTemp_5cm)$y)
I also tried the following, which gives me a list of 2 that has a lot of columns instead of rows when I convert it to a data frame. Not going to lie, the approxExtrap syntax confuses me a little bit.
MyD1 <- approxExtrap(MyD$SoilTemp_5cm, MyD$SoilTemp_1.5cm,xout=MyD$SoilTemp_5cm)
MyD1
I am honestly not sure how to make the data reproducible, so here is a pastebin link with dput() output: https://pastebin.com/NFZdmm4L. I tried to include as much output as I could. Bear in mind that I excluded some columns when running dput(), so the code MyD[,-c(1,2,3,4,9)][MyD[,-c(1,2,3,4,9)] == -888.88] might differ. In any case, the dput() output already has the NAs included, so you may not even need that step.
Thanks in advance.
Best regards,
Zorin
na.approx will fill in NAs with interpolated values, and rule = 2 will extend the first and last non-NA values to the ends. (Note, too, that mutate() returns a new data frame rather than modifying MyD in place; your NAs were still there because the result was never assigned back to MyD.)
library(zoo)
x <- c(NA, 4, NA, 5, NA) # test input
na.approx(x, rule = 2)
## [1] 4.0 4.0 4.5 5.0 5.0
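Applied to the data frame from the question, that could look like the sketch below (column names are assumed from the question's description; the datetime column is not needed because the observations are equally spaced every 5 minutes):
library(zoo)
# interpolate within the series and extend the first/last values to the ends;
# na.rm = FALSE keeps any still-unfillable NAs instead of dropping them
MyD$SoilTemp_1.5cm <- na.approx(MyD$SoilTemp_1.5cm, na.rm = FALSE, rule = 2)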

Creating a Dataframe from data in an existing Dataframe based on bucketed rows in R

So I get that the title is terrible and generic; I have no idea how to concisely describe what I am trying to do.
I've got a 2-column data frame in R: column A has data values, and column B has data that has now been binned (it was the year associated with column A, and is now a bin label based on year ranges).
I need to generate a new data frame that uses the bin labels as columns, with the associated data values as row entries, preferably sorted, back-filled with NA to prevent columns of different lengths.
Sample data:
df <- data.frame(values = c(1, NA, 3, NA, 5:6, 7:9),
                 bins = rep(c("yr1_yr2", "yr2_yr3", "yr3_yr4"), each = 3))
SOLUTION EDIT: After a lot of experimentation I was able to do what I wanted with my data by using the cut_width function from ggplot2 to slice my data into bins, then plotting it as a distribution graph.
Thank you all for your attempts, and sorry again for the vague question and lack of sample data.
Not quite sure if this is getting close to what you want...
library(tidyverse)
reshape2::melt(df, id.vars='bins', measure.vars='values')
returns
bins variable value
1 yr1_yr2 values 1
2 yr1_yr2 values NA
3 yr1_yr2 values 3
4 yr2_yr3 values NA
5 yr2_yr3 values 5
6 yr2_yr3 values 6
7 yr3_yr4 values 7
8 yr3_yr4 values 8
9 yr3_yr4 values 9
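If you do want the wide, NA-padded layout the question describes (one column per bin, values sorted, shorter columns padded with NA), here is a base R sketch of one way to build it (my own construction, not from the original answers):
groups <- split(df$values, df$bins)                # one vector of values per bin
groups <- lapply(groups, sort, na.last = TRUE)     # sort within each bin, NAs last
n <- max(lengths(groups))                          # length of the longest bin
wide <- data.frame(lapply(groups, `length<-`, n))  # pad shorter columns with NA
wide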

How to generate a plot for reported values and missing values in R - timeseries

Hi, I am using R to analyze my data. I have time-series data in the following format:
dates ID
2008-02-12 3
2008-03-12 3
2008-05-12 3
2008-09-12 3
2008-02-12 8
2008-04-12 6
I would like to create a plot with dates on the x axis and ID on the y axis, such that it draws a point if an id is reported for that date and nothing if there is no data.
In the original dataset I only have an id if a value was reported on that date. For example, for 2008-02-12 there is no data reported for id 6, hence it is missing from my dataset.
I was able to get all the dates with unique(df$dates), but I don't know enough about R data structures to loop through the data, build a 0/1 matrix for all ids, and plot it.
I would be grateful if you could help me with the code or give me some pointers on an effective way to approach this problem.
Thanks in advance.
It seems you want something like a scatter plot:
# input data
DF <- read.csv(
  text = 'Year,ID
2008-02-12,3
2008-03-12,3
2008-05-12,3
2008-09-12,3
2008-02-12,8
2008-04-12,6',
  colClasses = c('character', 'integer'))
# convert the first column from character to dates
DF$Year <- as.POSIXct(DF$Year, format = '%Y-%m-%d', tz = 'GMT')
# scatter plot
plot(x = DF$Year, y = DF$ID, type = 'p', xlab = 'Date', ylab = 'ID',
     main = 'Reported Values', pch = 19, col = 'red')
Result: (image omitted: scatter plot of ID against date, one red point per reported pair)
But this approach has a problem: if, for example, unique(ids) = c(1, 2, 1000), the space on the y axis between id = 2 and id = 1000 will be very big (the same holds for the dates on the x axis).
Maybe you want a sort of id-date "map" instead, like the following:
# input data
DF <- read.csv(
  text = 'Year,ID
2008-02-12,3
2008-03-12,3
2008-05-12,3
2008-09-12,3
2008-02-12,8
2008-04-12,6',
  colClasses = c('character', 'integer'))
dates <- as.factor(DF$Year)
ids <- as.factor(DF$ID)
plot(x = as.integer(dates), y = as.integer(ids), type = 'p',
     xlim = c(0.5, length(levels(dates)) + 0.5),
     ylim = c(0.5, length(levels(ids)) + 0.5),
     xaxs = 'i', yaxs = 'i',
     xaxt = 'n', yaxt = 'n', main = 'Reported Values',
     xlab = 'Date', ylab = 'ID', pch = 19, col = 'red')
axis(1, at = 1:length(levels(dates)), labels = levels(dates))
axis(2, at = 1:length(levels(ids)), labels = levels(ids))
# add grid lines between cells
abline(v = (1:(length(levels(dates)) - 1)) + 0.5, col = 'gray80', lty = 2)
abline(h = (1:(length(levels(ids)) - 1)) + 0.5, col = 'gray80', lty = 2)
Result: (image omitted: grid "map" with one cell per date/ID level, points marking reported pairs)
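As a further variant (not part of the original answer), ggplot2 gives evenly spaced discrete axes almost for free by treating both variables as factors; a sketch, assuming the DF built above:
library(ggplot2)
ggplot(DF, aes(x = factor(Year), y = factor(ID))) +
  geom_point(colour = 'red', size = 3) +                 # one point per reported pair
  labs(title = 'Reported Values', x = 'Date', y = 'ID')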

Sum up equal variables in data frame in R keeping values in different columns [duplicate]

This question already has an answer here: Apply function conditionally (1 answer). Closed 8 years ago.
I would like to sum equal values in a given data set. Unfortunately I do not really have a clue where to begin, especially which function to use.
Let's say I have a data frame like this:
count <- c(1, 4, 7, 3, 7, 9, 3, 4, 2, 8)
clone <- c("aaa", "aaa", "aaa", "bbb", "aaa", "aaa", "aaa", "bbb", "ccc", "aaa")
a <- c("d", "e", "k", "v", "o", "s", "a", "y", "q", "f")
b <- c("g", "e", "j", "v", "i", "q", "a", "x", "l", "p")
test <- data.frame(count, clone, a, b)
The problem is that there are lots of repeated values which need to be combined into one (all the "aaa" and the two "bbb").
So I would like to aggregate(?) all equal values in column "clone", summing up the "count" values and taking the values for "a" and "b" from the clone row with the highest count.
My final data set should look like:
count <- c(39, 7, 2)
clone <- c("aaa", "bbb", "ccc")
a <- c("s", "y", "q")
b <- c("q", "x", "l")
test <- data.frame(count, clone, a, b)
Do you have a suggestion as to which function I could use for that? Thanks a lot in advance.
EDIT: Sorry, I was too tired and forgot to include the "a" and "b" columns, which makes quite a difference: aggregating just by clone and count drops these two columns, which carry essential information that I need in my final data set.
Use aggregate
> aggregate(count~clone, FUN=sum, data=test)
clone count
1 aaa 39
2 bbb 7
3 ccc 2
Also see this answer for further alternatives.
This can be handled with tapply:
tapply(count, clone, sum)
# aaa bbb ccc
# 39 7 2
You can also do this with ddply from plyr
library(plyr)
ddply(test, .(clone), function(x) sum(x$count))
A dplyr solution:
library('dplyr')
summarize(group_by(test, clone), count = sum(count))
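None of the answers above keeps the a and b columns that the EDIT asks for. As a sketch of how that could look in dplyr (taking a and b from the row with the highest count within each clone):
library(dplyr)
test %>%
  group_by(clone) %>%
  summarise(a = a[which.max(count)],  # a from the highest-count row
            b = b[which.max(count)],  # b from the highest-count row
            count = sum(count))       # defined last, so a and b see the original counts
This reproduces the desired output: aaa/39/s/q, bbb/7/y/x, ccc/2/q/l.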
