Specify time series parameters for data over millions of years - r

I have foram delta-18O data spanning 6 million years, with one data point every 0.003 Myr, like this:
Age (Myr) | d18O
0 | 3.43
0.003 | 3.37
0.006 | 3.54
0.009 | 3.87
0.012 | 4.36
0.015 | 4.90
0.018 | 5.01
0.021 | 4.96
0.024 | 4.87
0.027 | 4.67
0.03 | 4.58
ts_d18O <- ts(d18O[,2])
plot(ts_d18O)
However, when I tell R to treat it as a time series and plot it, the x-axis runs all the way to 2000, i.e. the scale is not in millions of years.
How do I fix this? I need it to be a time series because I have to do spectral analysis on it, e.g. periodograms.

You need to specify the frequency of observations. You have one observation every 3,000 years, so if you want the time axis in years the frequency would be 1/3000. I'll assume you want it in millions of years, in which case the frequency is 1/0.003, i.e. about 333.333 samples per million years.
You should specify a start time as well (-5 in this example represents 5 Mya).
Finally, you can label your x-axis as required.
ts_d18O <- ts(d18O[,2], start = -5, frequency = 1/0.003)
ts_d18O
#> Time Series:
#> Start = -5
#> End = -4.97
#> Frequency = 333.333333333333
#> [1] 3.43 3.37 3.54 3.87 4.36 4.90 5.01 4.96 4.87 4.67 4.58
plot(ts_d18O, xlab = "Million years ago")
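Since the stated goal is spectral analysis, here is a minimal sketch of a raw periodogram using base R's spec.pgram; because frequency was set to samples per million years, the frequency axis comes out in cycles per Myr (smoothing, tapering and padding choices are left to you):
# raw periodogram; frequencies are in cycles per million years
spec <- spec.pgram(ts_d18O, taper = 0, detrend = TRUE, log = "no")
# period (in Myr) corresponding to the strongest peak
1 / spec$freq[which.max(spec$spec)]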

Related

Grouping Frame Values

I have a dataset of ingredients for cookies. I'm trying to answer which group (A, B, C, etc) of cookies has the most sugar in them. The dataset is structured as follows:
group id mois prot fat hocolate sugar carb cal
1 A 14069 27.82 21.43 44.87 5.11 1.77 0.77 4.93
2 A 14053 28.49 21.26 43.89 5.34 1.79 1.02 4.84
3 A 14025 28.35 19.99 45.78 5.08 1.63 0.80 4.95
4 B 14016 30.55 20.15 43.13 4.79 1.61 1.38 4.74
5 B 14005 30.49 21.28 41.65 4.82 1.64 1.76 4.67
6 A 14075 31.14 20.23 42.31 4.92 1.65 1.40 4.67
7 C 14082 31.21 20.97 41.34 4.71 1.58 1.77 4.63
8 C 14097 28.76 21.41 41.60 5.28 1.75 2.95 4.72
etc....
How can I plot the mean of each grouping to show that one of them has a higher average of sugar than the others? Or at the least, how can I print off the results of the grouped averages of sugar to defend my argument that one has more sugar than the other?
After saving your text to CSV and loading this file into R, it's pretty easy to obtain the mean sugar quantity per group, which I'm assuming is what you need.
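For instance, a minimal sketch of that loading step, assuming the sample was saved as cookies.csv (a hypothetical file name) with the column names in the first row:
df <- read.csv("cookies.csv")
str(df)  # check that group is a character/factor column and sugar is numeric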
You first group your data by the variable group and then summarize it using the mean() function.
library(dplyr)
(cookies <- df %>%
  group_by(group) %>%
  summarize(meanSugar = mean(sugar)))
group meanSugar
<chr> <dbl>
1 A 1.71
2 B 1.62
3 C 1.66
As you can see, group A has sugar content a bit higher than the others based on your data.
If you want to go a step further and actually plot this data, you can do that:
library(ggplot2)
cookies %>%
  ggplot(aes(x = meanSugar, y = reorder(group, meanSugar),
             fill = group, label = meanSugar)) +
  geom_col() +
  labs(y = "Cookie groups", x = "Mean sugar") +
  geom_label(stat = "identity", hjust = 1.2, color = "white") +
  theme(legend.position = "none")
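If you prefer base R, the same group means can also be obtained with aggregate (a sketch of an alternative to the dplyr pipeline above, assuming the same df):
aggregate(sugar ~ group, data = df, FUN = mean)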
If you have any questions on some of these steps, let me know!
Note: please try to provide better data next time so it's easier to reproduce what you need and give you a quick answer :)

Time-series average of cross-sectional correlations

I have a panel dataset looking like this:
head(panel_data)
date symbol close rv rv_plus rv_minus rskew rkurt Mkt.RF SMB HML
1 1999-11-19 a 25.4 19.3 6.76 12.6 -0.791 4.36 -0.11 0.35 -0.5
2 1999-11-22 a 26.8 10.1 6.44 3.69 0.675 5.38 0.02 0.22 -0.92
3 1999-11-23 a 25.2 8.97 2.56 6.41 -1.04 4.00 -1.29 0.08 0.3
4 1999-11-24 a 25.6 5.81 2.86 2.96 -0.505 5.45 0.87 0.08 -0.89
5 1999-11-26 a 25.6 2.78 1.53 1.25 0.617 5.60 0.23 0.92 -0.2
6 1999-11-29 a 26.1 5.07 2.76 2.30 -0.236 7.27 -0.6 0.570 -0.14
where the variable symbol depicts different stocks. I want to calculate the time-series average of the cross-sectional correlation between the variables rskew and rkurt. This means I need to compute the correlation between rskew and rkurt over all different stocks at each point in time and then calculate the time-series average afterwards.
I tried to do it with the rollapply function from the zoo package, but since the number of different stocks is not the same for all dates, I cannot simply define width as an integer. Here is what I tried for a sample width of 20:
panel_data <- panel_data %>%
  group_by(date) %>%
  mutate(cor_skew_kurt = rollapply(data = panel_data[7:8],
                                   width = 20,
                                   FUN = cor,
                                   align = "right",
                                   na.rm = TRUE,
                                   fill = NA)) %>%
  ungroup()
Is there a way to do this without having to define a fixed width for each date group?
Or should I maybe use a different approach to do this?
[Edited] Can you try running the code below? I have recreated an example emulating your issue. If I understood your problem correctly, this code should at least put you on the path to the right solution, as it solves the issue of unequal time-window lengths.
###################
# Recreate an example dataset with unequal dates across stocks
set.seed(1)
date6 <- c('1999-11-19','1999-11-22','1999-11-23','1999-11-24','1999-11-26','1999-11-29')
date5 <- c('1999-11-19','1999-11-22','1999-11-23','1999-11-24','1999-11-26')
date4 <- c('1999-11-19','1999-11-22','1999-11-23','1999-11-24')
cor_skew_kurt <- rep(NaN, 21)
symbol <- c(rep('a', 6), rep('b', 5), rep('c', 4), rep('d', 6))
rskew <- rnorm(21, mean = 1, sd = 1)
rkurt <- rnorm(21, mean = 5, sd = 1)
panel_data <- cbind.data.frame(date = c(date6, date5, date4, date6),
                               symbol = symbol, rskew = rskew, rkurt = rkurt,
                               cor_skew_kurt = cor_skew_kurt)
panel_data$date <- as.Date(panel_data$date, '%Y-%m-%d')

# Compute cor_skew_kurt per date and fill the table <- ANSWER TO YOUR QUESTION
for (date in unique(panel_data$date)) {
  panel_data[panel_data$date == date, "cor_skew_kurt"] <-
    cor(panel_data[panel_data$date == date, 'rskew'],
        panel_data[panel_data$date == date, 'rkurt'])
}
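If you would rather stay inside the dplyr pipeline you already started, here is a minimal sketch of the same computation (one cross-sectional correlation per date, then its time-series average), assuming panel_data as built above:
library(dplyr)
cross_sec_cor <- panel_data %>%
  group_by(date) %>%
  summarize(cor_skew_kurt = cor(rskew, rkurt, use = "complete.obs"))
# time-series average of the cross-sectional correlations;
# dates with a single observation give NA and are dropped by na.rm
mean(cross_sec_cor$cor_skew_kurt, na.rm = TRUE)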

Is there a way to resolve this cardinality_threshold error?

I tried to use ggpairs to visualise my dataset, but I don't understand the error message I am getting. Can someone please help me?
> describe(Mydata)
vars n mean sd median trimmed mad min max range skew
Time 1 192008 4257.07 2589.28 4156.44 4210.33 3507.03 0 8869.91 8869.91 0.09
Source* 2 192008 9.32 5.95 8.00 8.53 2.97 1 51.00 50.00 3.39
Destination* 3 192008 8.22 6.49 7.00 7.31 2.97 1 51.00 50.00 3.07
Protocol* 4 192008 16.14 4.29 19.00 16.77 0.00 1 20.00 19.00 -1.26
Length 5 192008 166.12 464.07 74.00 96.25 11.86 60 21786.00 21726.00 14.40
Info* 6 192008 63731.70 46463.90 60732.50 62899.62 69904.59 1 131625.00 131624.00 0.14
kurtosis se
Time -1.28 5.91
Source* 15.94 0.01
Destination* 13.21 0.01
Protocol* 0.66 0.01
Length 349.17 1.06
Info* -1.47 106.04
> Mydata[,1][Mydata[,1] ==0]<-NA
> ggpairs(Mydata)
Error in stop_if_high_cardinality(data, columns, cardinality_threshold) :
Column 'Source' has more levels (51) than the threshold (15) allowed.
Please remove the column or increase the 'cardinality_threshold' parameter. Increasing the
cardinality_threshold may produce long processing times
As the error suggests, the way to get rid of it is to set cardinality_threshold = NULL or cardinality_threshold = 51, since Source and Destination are both factor variables with 51 levels.
However, it will likely be hard to see any detail in the plots, if they render at all, because one of the panels would be attempting to fit 51 bar plots of 51 columns each. You may want to consider whether grouping your factor levels makes sense for the analysis you're interested in, or whether to exclude the factors (although that leaves only two continuous variables).
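For example, a minimal sketch (expect slow rendering and very crowded panels with 51-level factors):
library(GGally)
# either disable the level-count check entirely...
ggpairs(Mydata, cardinality_threshold = NULL)
# ...or raise the threshold just high enough for the 51-level factors
ggpairs(Mydata, cardinality_threshold = 51)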

Ramp up/down missing time-series data in R

I have a set of time-series data (GPS speed data, specifically) which includes gaps of missing values where the signal was lost. Missing periods of short duration I am able to fill simply using na.spline; however, this is inappropriate for longer gaps. For those I would like to ramp the values from the last true value down to zero, based on predefined acceleration limits.
#create sample data frame
test <- as.data.frame(c(6,5.7,5.4,5.14,4.89,4.64,4.41,4.19,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,5,5.1,5.3,5.4,5.5))
names(test)[1] <- "speed"
#set rate of acceleration for ramp
ramp <- 6
#set sampling rate of receiver
Hz <- 1/10
So for missing data the ramp would use the previous value and the rate of acceleration to get the next data point, until the speed reaches zero (i.e. last speed [4.19] - (Hz * ramp)), yielding the following values:
3.59
2.99
2.39
1.79
1.19
0.59
0
Lastly, I need to do this in the reverse fashion, to ramp up from zero when the signal picks back up again.
Hope this is clear.
Cheers
It's not really elegant, but you can do it in a loop.
na.pos <- which(is.na(test$speed))
acc <- FALSE  # FALSE = ramping down, TRUE = ramping back up
for (i in na.pos) {
  if (acc) {
    speed <- test$speed[i - 1] + (Hz * ramp)
  } else {
    speed <- test$speed[i - 1] - (Hz * ramp)
    if (round(speed, 1) < 0) {
      acc <- TRUE
      speed <- test$speed[i - 1] + (Hz * ramp)
    }
  }
  test[i, ] <- speed
}
The result is:
speed
1 6.00
2 5.70
3 5.40
4 5.14
5 4.89
6 4.64
7 4.41
8 4.19
9 3.59
10 2.99
11 2.39
12 1.79
13 1.19
14 0.59
15 -0.01
16 0.59
17 1.19
18 1.79
19 2.39
20 2.99
21 3.59
22 4.19
23 4.79
24 5.00
25 5.10
26 5.30
27 5.40
28 5.50
Note the '-0.01': 0.59 - (Hz * ramp) = 0.59 - 0.6 is -0.01, not 0. You can round it later; I decided not to.
When the question says "ramp the values from the last true value down to zero" in each run of NAs, I assume that means any remaining NAs in the run after reaching zero are also to be replaced by zero.
Now, use rleid from data.table to create a grouping vector, the same length as test$speed, identifying each run in is.na(test$speed), and use ave to create sequence numbers within those groups (seqno). Then calculate the declining sequence ramp_down by combining na.locf(test$speed) with seqno. Finally, replace the NAs.
library(data.table)
library(zoo)
test_speed <- test$speed
seqno <- ave(test_speed, rleid(is.na(test_speed)), FUN = seq_along)
ramp_down <- pmax(na.locf(test_speed) - seqno * ramp * Hz, 0)
result <- ifelse(is.na(test_speed), ramp_down, test_speed)
giving:
> result
[1] 6.00 5.70 5.40 5.14 4.89 4.64 4.41 4.19 3.59 2.99 2.39 1.79 1.19 0.59 0.00
[16] 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 5.00 5.10 5.30 5.40 5.50
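The same machinery extends to the ramp-up the question also asks for: build mirror-image sequence numbers within each run and carry the next observed value backwards with na.locf(..., fromLast = TRUE). A sketch, assuming (as above) that the series starts and ends with non-missing values:
# reverse sequence numbers within each run of NAs / non-NAs
rev_seqno <- ave(test_speed, rleid(is.na(test_speed)),
                 FUN = function(x) rev(seq_along(x)))
# ramp up towards the next observed value, clamped at zero
ramp_up <- pmax(na.locf(test_speed, fromLast = TRUE) - rev_seqno * ramp * Hz, 0)
# within each gap take whichever ramp is higher at each position
result2 <- ifelse(is.na(test_speed), pmax(ramp_down, ramp_up), test_speed)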

Histogram frequency count for column

I have a set of data (in the region of 800000 lines), in three columns (longitude, latitude and earthquake magnitude) that are not sorted in any way. A small example below...
-118.074 36.930 2.97
-118.005 36.898 2.61
-116.526 36.621 2.72
-116.488 36.650 2.68
-117.675 36.820 2.00
-117.963 36.514 1.30
-118.090 36.757 1.94
-117.651 36.518 1.40
-116.434 36.506 1.90
-117.914 36.531 2.10
-118.235 36.882 2.00
I am required to create a histogram of the earthquake magnitudes (in the range of 1.0 to 7.0), but I am not sure how to go about creating the frequency of magnitudes.
I understand that in order to create a histogram, I will need to discern the unique values, and set them in ascending order in a column. I believe I can then run a for command with a count function for each value... but I need a bit of help in doing so!
Thank you for any help you can offer!
awk '{counts[$3]++} END {for (c in counts) print c, counts[c]}' inputs.txt | sort -nk2
will print each unique magnitude with its count, sorted by count in ascending order:
1.30 1
1.40 1
1.90 1
1.94 1
2.10 1
2.61 1
2.68 1
2.72 1
2.97 1
2.00 2
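If you would rather do the counting and plotting in R, here is a minimal sketch, assuming the file is whitespace-delimited and named quakes.txt (a hypothetical name) and that all magnitudes fall within 1.0-7.0:
quakes <- read.table("quakes.txt", col.names = c("lon", "lat", "mag"))
# frequency of each unique magnitude
table(quakes$mag)
# histogram over the 1.0-7.0 magnitude range
hist(quakes$mag, breaks = seq(1, 7, by = 0.5),
     xlab = "Magnitude", main = "Earthquake magnitudes")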
