Transforming tibble to tsibble

I've got a tibble that I'm struggling to turn into a tsibble.
# A tibble: 13 x 8
year `Administration, E~ `All Staff` `Ambulance staff` `Healthcare Assi~ `Medical and De~ `Nursing, Midwife~
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2009 3.97 5.08 7.16 6.94 1.36 6.19
2 2010 4.12 5.07 6.89 7.02 1.41 6.02
3 2011 4.06 5.03 6.69 7.06 1.36 6.02
4 2012 4.40 5.40 7.79 7.48 1.52 6.44
5 2013 4.28 5.35 8.19 7.46 1.48 6.44
6 2014 4.45 5.56 8.87 7.82 1.53 6.67
7 2015 4.30 5.29 6.86 7.54 1.44 6.30
8 2016 4.21 5.15 7.56 7.15 1.66 6.17
9 2017 4.33 5.13 7.32 7.20 1.69 6.04
10 2018 4.58 5.30 7.96 7.00 1.73 6.38
11 2019 4.71 5.52 7.66 7.96 1.94 6.65
12 2020 4.69 5.98 7.49 8.37 2.11 7.56
13 2021 4.19 5.72 9.62 8.47 1.71 7.29
# ... with 1 more variable: `Scientific, Therapeutic and Technical staff` <dbl>
How would I turn this into a tsibble so that I can plot graphs with ggplot2?
When I try as_tsibble():
absence_ts <- as_tsibble(absence, key = absence$`All Staff`, index = absence$year)
it comes up with the following error:
Error: Must subset columns with a valid subscript vector. x Can't convert from <double> to <integer> due to loss of precision.
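The error comes from passing whole vectors (`absence$year`, `absence$`All Staff``) where `as_tsibble()` expects bare column names. `year` is also stored as character and should be made integer first. A sketch of a fix, assuming `absence` is the wide tibble printed above: since each year appears only once, no `key` is needed, and for ggplot2 it is usually easier to pivot to long form and use the staff group as the key.

```r
library(tsibble)
library(dplyr)
library(tidyr)
library(ggplot2)

# Pass bare column names, not absence$... vectors, and make year numeric.
absence_ts <- absence %>%
  mutate(year = as.integer(year)) %>%
  as_tsibble(index = year)

# Long form is more convenient for plotting: one row per year x staff group.
absence_long <- absence %>%
  mutate(year = as.integer(year)) %>%
  pivot_longer(-year, names_to = "staff_group", values_to = "absence_rate") %>%
  as_tsibble(index = year, key = staff_group)

ggplot(absence_long, aes(year, absence_rate, colour = staff_group)) +
  geom_line()
```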

Related

R data.table, select columns with no NA

I have a table of stock prices here:
https://drive.google.com/file/d/1S666wiCzf-8MfgugN3IZOqCiM7tNPFh9/view?usp=sharing
Some columns have NAs because the company did not yet exist (until later dates), or the company folded.
What I want to do is select the columns that have no NAs. I use data.table because it is faster. Here is my working code:
example <- fread(file = "example.csv", key = "date")
example_select <- example[,
  lapply(.SD, function(x) not(sum(is.na(x) > 0)))
] %>%
  as.logical(.)
example[, ..example_select]
Is there better (less lines) code to do the same? Thank you!
Try:
example[, lapply(.SD, function(x) if (anyNA(x)) NULL else x)]
There are lots of ways you could do this. Here's how I usually do it - a data.table approach without lapply:
example[, .SD, .SDcols = colSums(is.na(example)) == 0]
An answer using tidyverse packages
library(readr)
library(dplyr)
library(purrr)
data <- read_csv("~/Downloads/example.csv")
map2_dfc(data, names(data), .f = function(x, y) {
  column <- tibble("{y}" := x)
  if (any(is.na(column))) {
    return(NULL)
  } else {
    return(column)
  }
})
Output
# A tibble: 5,076 x 11
date ACU ACY AE AEF AIM AIRI AMS APT ARMP ASXC
<date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2001-01-02 2.75 4.75 14.4 8.44 2376 250 2.5 1.06 490000 179.
2 2001-01-03 2.75 4.5 14.5 9 2409 250 2.5 1.12 472500 193.
3 2001-01-04 2.75 4.5 14.1 8.88 2508 250 2.5 1.06 542500 301.
4 2001-01-05 2.38 4.5 14.1 8.88 2475 250 2.25 1.12 586250 301.
5 2001-01-08 2.56 4.75 14.3 8.75 2376 250 2.38 1.06 638750 276.
6 2001-01-09 2.56 4.75 14.3 8.88 2409 250 2.38 1.06 568750 264.
7 2001-01-10 2.56 5.5 14.5 8.69 2310 300 2.12 1.12 586250 274.
8 2001-01-11 2.69 5.25 14.4 8.69 2310 300 2.25 1.19 564375 333.
9 2001-01-12 2.75 4.81 14.6 8.75 2541 275 2 1.38 564375 370.
10 2001-01-16 2.75 4.88 14.9 8.94 2772 300 2.12 1.62 595000 358.
# … with 5,066 more rows
Using Filter :
library(data.table)
Filter(function(x) all(!is.na(x)), fread('example.csv'))
# date ACU ACY AE AEF AIM AIRI AMS APT
# 1: 2001-01-02 2.75 4.75 14.4 8.44 2376.00 250.00 2.50 1.06
# 2: 2001-01-03 2.75 4.50 14.5 9.00 2409.00 250.00 2.50 1.12
# 3: 2001-01-04 2.75 4.50 14.1 8.88 2508.00 250.00 2.50 1.06
# 4: 2001-01-05 2.38 4.50 14.1 8.88 2475.00 250.00 2.25 1.12
# 5: 2001-01-08 2.56 4.75 14.3 8.75 2376.00 250.00 2.38 1.06
# ---
#5072: 2021-03-02 36.95 10.59 28.1 8.77 2.34 1.61 2.48 14.33
#5073: 2021-03-03 38.40 10.00 30.1 8.78 2.26 1.57 2.47 12.92
#5074: 2021-03-04 37.90 8.03 30.8 8.63 2.09 1.44 2.27 12.44
#5075: 2021-03-05 35.68 8.13 31.5 8.70 2.05 1.48 2.35 12.45
#5076: 2021-03-08 37.87 8.22 31.9 8.59 2.01 1.52 2.47 12.15
# ARMP ASXC
# 1: 4.90e+05 178.75
# 2: 4.72e+05 192.97
# 3: 5.42e+05 300.62
# 4: 5.86e+05 300.62
# 5: 6.39e+05 276.25
# ---
#5072: 5.67e+00 3.92
#5073: 5.58e+00 4.54
#5074: 5.15e+00 4.08
#5075: 4.49e+00 3.81
#5076: 4.73e+00 4.15
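If you are already in the tidyverse, one more sketch using dplyr's `select(where(...))` (available from dplyr 1.0.0) reads much like the `Filter()` answer, assuming the same `example.csv` file:

```r
library(data.table)
library(dplyr)

example <- fread("example.csv", key = "date")

# where() keeps only the columns for which the predicate is TRUE,
# i.e. the columns containing no missing values.
example %>% select(where(~ !anyNA(.x)))
```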

creating an array of grouped values (means)

I have a large dataset ("bsa", drawn from a 23-year period) which includes a variable ("leftrigh") for "left-right" views (political orientation). I'd like to summarise how the cohorts change over time. For example, in 1994 the average value of this scale for people aged 45 was (say) 2.6; in 1995 the average value of this scale for people aged 46 was (say) 2.7 -- etc etc. I've created a year-of-birth variable ("yrbrn") to facilitate this.
I've successfully created the means:
bsa <- bsa %>% group_by(yrbrn, syear) %>% mutate(meanlr = mean(leftrigh))
Where I'm struggling is to summarise the means by year (of the survey) and age (at the time of the survey). If I could create an array (containing these means) organised by age x survey-year, I could see the change over time by inspecting the diagonals. But I have no clue how to do this -- my skills are very limited...
A tibble: 66,744 x 10
Groups: yrbrn [104]
Rsex Rage leftrigh OldWt syear yrbrn coh per agecat meanlr
1 1 [Male] 40 1 [left] 1.12 2017 1977 17 2017 [37,47) 2.61
2 2 [Female] 79 1.8 0.562 2017 1938 9 2017 [77,87) 2.50
3 2 [Female] 50 1.5 1.69 2017 1967 15 2017 [47,57) 2.59
4 1 [Male] 73 2 0.562 2017 1944 10 2017 [67,77) 2.57
5 2 [Female] 31 3 0.562 2017 1986 19 2017 [27,37) 2.56
6 1 [Male] 74 2.2 0.562 2017 1943 10 2017 [67,77) 2.50
7 2 [Female] 58 2 0.562 2017 1959 13 2017 [57,67) 2.56
8 1 [Male] 59 1.2 0.562 2017 1958 13 2017 [57,67) 2.53
9 2 [Female] 19 4 1.69 2017 1998 21 2017 [17,27) 2.46
Possible format for presenting this information to see change over time:
1994 1995 1996 1997 1998 1999 2000
18
19
20
21
22
23
24
25
etc.
You can group_by both age and year at the same time:
# Setup (make reproducible data)
library(dplyr)
library(tidyr)
n <- 10000
df1 <- data.frame(
  'yrbrn' = sample(1920:1995, size = n, replace = TRUE),
  'Syear' = sample(2005:2015, size = n, replace = TRUE),
  'leftrigh' = sample(seq(0, 5, 0.1), size = n, replace = TRUE))
# Solution
df1 %>%
  group_by(yrbrn, Syear) %>%
  summarise(meanLR = mean(leftrigh)) %>%
  spread(Syear, meanLR)
Produces the following:
# A tibble: 76 x 12
# Groups: yrbrn [76]
yrbrn `2005` `2006` `2007` `2008` `2009` `2010` `2011` `2012` `2013` `2014` `2015`
<int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1920 3.41 1.68 2.26 2.66 3.21 2.59 2.24 2.39 2.41 2.55 3.28
2 1921 2.43 2.71 2.74 2.32 2.24 1.89 2.85 3.27 2.53 1.82 2.65
3 1922 2.28 3.02 1.39 2.33 3.25 2.09 2.35 1.83 2.09 2.57 1.95
4 1923 3.53 3.72 2.87 2.05 2.94 1.99 2.8 2.88 2.62 3.14 2.28
5 1924 1.77 2.17 2.71 2.18 2.71 2.34 2.29 1.94 2.7 2.1 1.87
6 1925 1.83 3.01 2.48 2.54 2.74 2.11 2.35 2.65 2.57 1.82 2.39
7 1926 2.43 3.2 2.53 2.64 2.12 2.71 1.49 2.28 2.4 2.73 2.18
8 1927 1.33 2.83 2.26 2.82 2.34 2.09 2.3 2.66 3.09 2.2 2.27
9 1928 2.34 2.02 2.1 2.88 2.14 2.44 2.58 1.67 2.57 3.11 2.93
10 1929 2.31 2.29 2.93 2.08 2.11 2.47 2.39 1.76 3.09 3 2.9

how to summarize yearly average temperature from all given daily observation time series with dplyr?

I am wondering whether dplyr provides any useful utilities for quick aggregation of land surface temperature time series. I extracted gridded data for Germany from the E-OBS dataset (E-OBS grid data) and exported the raster grid as tabular data in Excel format. The exported data has one geo-coordinate pair per row with 15 years of daily temperature observations (1012 rows, 15 x 365/366 columns). Please take a look at the attached time series data.
Here is what I want to do: since the original observations are daily, I want to aggregate the data by year. For each geo-coordinate pair I intend to calculate an average yearly temperature, for each of the 15 years. After the aggregation, the result should go in a new data.frame that keeps the original geo-coordinate pairs but adds new columns such as 1980_avg_temp, 1981_avg_temp, 1982_avg_temp and so on, reducing the data's dimensions by replacing the daily columns with yearly averages.
How can I do this with dplyr or data.table on the Excel data? Is there an easier way to perform this aggregation on the attached time series data? Any thoughts?
I tried this:
library(tidyverse)
library(readxl)
df <- read_excel("YOUR_XLSX_FILE")
df %>%
  gather(date, temp, -x, -y) %>%
  separate(date, c("year", "month", "day")) %>%
  separate(year, c("trash", "year"), sep = "X") %>%
  select(-trash) %>%
  group_by(year, x, y) %>%
  summarise(avg_temp = mean(temp)) %>%
  spread(year, avg_temp)
output is:
# A tibble: 19 x 17
# Groups: x [11]
x y `1980` `1981` `1982` `1983` `1984` `1985` `1986` `1987` `1988` `1989` `1990` `1991`
* <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 8.88 54.4 7.79 8.02 8.76 9.20 8.32 7.51 7.88 7.43 9.20 9.63 9.76 8.55
2 8.88 54.9 7.54 7.61 8.41 8.84 8.15 7.15 7.53 7.15 8.97 9.51 9.55 8.42
3 9.12 54.4 7.65 7.86 8.62 9.05 8.17 7.34 7.70 7.28 9.01 9.46 9.60 8.37
4 9.12 54.6 7.44 7.59 8.38 8.81 8.02 7.11 7.50 7.13 8.88 9.36 9.47 8.31
5 9.12 54.9 7.33 7.36 8.25 8.67 8.02 7.05 7.49 7.10 8.91 9.48 9.55 8.41
6 9.38 54.4 7.69 7.91 8.61 9.02 8.15 7.31 7.69 7.24 8.98 9.49 9.64 8.35
7 9.38 54.6 7.45 7.62 8.46 8.85 8.05 7.16 7.59 7.18 8.92 9.48 9.61 8.41
8 9.38 54.9 7.24 7.29 8.21 8.62 7.95 7.04 7.56 7.15 8.94 9.57 9.66 8.53
9 9.62 54.4 7.65 7.90 8.60 9.01 8.14 7.24 7.64 7.16 8.93 9.52 9.65 8.33
10 9.62 54.6 7.39 7.60 8.45 8.82 8.01 7.10 7.56 7.12 8.86 9.46 9.55 8.34
11 9.62 54.9 7.28 7.38 8.28 8.69 7.98 7.07 7.61 7.18 8.96 9.60 9.68 8.54
12 9.88 54.4 7.70 8.00 8.69 9.14 8.23 7.36 7.76 7.23 9.03 9.63 9.73 8.41
13 9.88 54.6 7.40 7.65 8.46 8.87 8.05 7.11 7.58 7.12 8.87 9.47 9.50 8.30
14 10.1 54.4 7.76 8.12 8.78 9.21 8.30 7.49 7.90 7.34 9.08 9.69 9.79 8.52
15 10.4 54.4 7.66 8.09 8.70 9.17 8.23 7.41 7.87 7.29 9.03 9.70 9.82 8.60
16 11.1 54.9 7.61 8.14 8.74 9.14 8.33 7.32 7.92 7.22 9.17 9.93 10.1 8.86
17 11.4 54.9 7.59 8.17 8.74 9.14 8.32 7.29 7.92 7.20 9.17 9.95 10.1 8.87
18 11.9 54.9 7.54 8.15 8.71 9.10 8.28 7.19 7.85 7.15 9.10 9.92 10.1 8.84
19 12.1 54.9 7.52 8.12 8.69 9.08 8.27 7.12 7.80 7.11 9.05 9.91 10.0 8.82
# ... with 3 more variables: `1992` <dbl>, `1993` <dbl>, `1994` <dbl>
To show that the geo-coordinates are not changed in a tibble (the display is just rounded), add as.data.frame() at the end of the pipe and look at your data. An example:
df %>%
  gather(date, temp, -x, -y) %>%
  separate(date, c("year", "month", "day")) %>%
  separate(year, c("trash", "year"), sep = "X") %>%
  select(-trash) %>%
  group_by(year, x, y) %>%
  summarise(avg_temp = mean(temp)) %>%
  spread(year, avg_temp) %>%
  as.data.frame() %>% # add this
  head()
output is:
# x y 1980 1981 1982 1983 1984 1985 1986 1987 1988
# 1 8.875 54.375 7.792978 8.021342 8.762274 9.203424 8.317131 7.505370 7.879068 7.427260 9.197431
# 2 8.875 54.875 7.536229 7.607507 8.414877 8.841260 8.154945 7.151890 7.532164 7.147945 8.969781
# 3 9.125 54.375 7.651393 7.862466 8.620904 9.052630 8.169262 7.337589 7.701205 7.282657 9.014590
# 4 9.125 54.625 7.435983 7.590548 8.381753 8.808904 8.019399 7.109096 7.499589 7.127370 8.875656
# 5 9.125 54.875 7.332978 7.363370 8.247205 8.669370 8.024645 7.045425 7.487424 7.098849 8.911776
# 6 9.375 54.375 7.693907 7.914630 8.612438 9.022055 8.150164 7.305068 7.688164 7.242274 8.984207
# 1989 1990 1991 1992 1993 1994
# 1 9.625781 9.760931 8.550356 9.678907 8.208109 9.390904
# 2 9.513863 9.552767 8.420109 9.425328 8.010082 9.134466
# 3 9.462959 9.602876 8.374575 9.465164 8.052794 9.207041
# 4 9.358986 9.473178 8.305863 9.353743 7.935507 9.050109
# 5 9.478192 9.545781 8.412329 9.403005 7.998877 9.074740
# 6 9.493205 9.635561 8.352740 9.385819 8.017260 9.184959
This works on the data that you provided.
library(tidyverse)
library(lubridate)
demo_data %>%
  gather(date, temp, -x, -y) %>%
  mutate(date = ymd(str_remove(date, "X"))) %>%
  mutate(year = year(date)) %>%
  group_by(x, y, year) %>%
  summarise_at(vars(temp), mean, na.rm = TRUE) %>%
  spread(year, temp)
# # A tibble: 19 x 17
# # Groups: x, y [19]
# x y `1980` `1981` `1982` `1983` `1984` `1985` `1986` `1987` `1988`
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 8.88 54.4 7.79 8.02 8.76 9.20 8.32 7.51 7.88 7.43 9.20
# 2 8.88 54.9 7.54 7.61 8.41 8.84 8.15 7.15 7.53 7.15 8.97
# 3 9.12 54.4 7.65 7.86 8.62 9.05 8.17 7.34 7.70 7.28 9.01
# 4 9.12 54.6 7.44 7.59 8.38 8.81 8.02 7.11 7.50 7.13 8.88
# 5 9.12 54.9 7.33 7.36 8.25 8.67 8.02 7.05 7.49 7.10 8.91
# 6 9.38 54.4 7.69 7.91 8.61 9.02 8.15 7.31 7.69 7.24 8.98
# 7 9.38 54.6 7.45 7.62 8.46 8.85 8.05 7.16 7.59 7.18 8.92
# 8 9.38 54.9 7.24 7.29 8.21 8.62 7.95 7.04 7.56 7.15 8.94
# 9 9.62 54.4 7.65 7.90 8.60 9.01 8.14 7.24 7.64 7.16 8.93
# 10 9.62 54.6 7.39 7.60 8.45 8.82 8.01 7.10 7.56 7.12 8.86
# 11 9.62 54.9 7.28 7.38 8.28 8.69 7.98 7.07 7.61 7.18 8.96
# 12 9.88 54.4 7.70 8.00 8.69 9.14 8.23 7.36 7.76 7.23 9.03
# 13 9.88 54.6 7.40 7.65 8.46 8.87 8.05 7.11 7.58 7.12 8.87
# 14 10.1 54.4 7.76 8.12 8.78 9.21 8.30 7.49 7.90 7.34 9.08
# 15 10.4 54.4 7.66 8.09 8.70 9.17 8.23 7.41 7.87 7.29 9.03
# 16 11.1 54.9 7.61 8.14 8.74 9.14 8.33 7.32 7.92 7.22 9.17
# 17 11.4 54.9 7.59 8.17 8.74 9.14 8.32 7.29 7.92 7.20 9.17
# 18 11.9 54.9 7.54 8.15 8.71 9.10 8.28 7.19 7.85 7.15 9.10
# 19 12.1 54.9 7.52 8.12 8.69 9.08 8.27 7.12 7.80 7.11 9.05
# # ... with 6 more variables: `1989` <dbl>, `1990` <dbl>, `1991` <dbl>,
# # `1992` <dbl>, `1993` <dbl>, `1994` <dbl>
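`gather()` and `spread()` have since been superseded in tidyr by `pivot_longer()` and `pivot_wider()`. A sketch of the same aggregation with the newer verbs, assuming the same `demo_data` with `x`, `y`, and daily columns named like `X1980.01.01`:

```r
library(tidyverse)
library(lubridate)

demo_data %>%
  pivot_longer(-c(x, y), names_to = "date", values_to = "temp") %>%
  # strip the leading "X", parse the date, and keep just the year
  mutate(year = year(ymd(str_remove(date, "X")))) %>%
  group_by(x, y, year) %>%
  summarise(temp = mean(temp, na.rm = TRUE), .groups = "drop") %>%
  # one column per year, as in the outputs above
  pivot_wider(names_from = year, values_from = temp)
```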

R, too much data in data.frame

I have a problem with data in R. I'm loading the data with:
data<-read.csv2("ceny_paliwo.csv", dec = ",")
data
and this gives me:
X Jan Feb Mar Apr May Jun Jul Aug Sept Oct Nov Dec
1 2014 5.32 5.34 5.34 5.27 5.29 5.23 5.29 5.22 5.19 5.17 4.98 4.75
2 2015 4.46 4.47 4.62 4.58 4.65 4.71 4.66 4.49 4.30 4.28 4.36 4.21
3 2016 3.87 3.73 3.86 3.90 4.07 4.23 4.17 4.10 4.26 4.35 4.32 4.53
4 2017 4.62 4.58 4.53 4.48 4.36 4.19 4.17 4.31 4.37 4.44 4.59 4.59
after this:
data2 <- round(unname(unlist(as.data.frame(data))), digits = 2)
data2
I'm receiving:
[1] 2014.00 2015.00 2016.00 2017.00 5.32 4.46 3.87 4.62 5.34 4.47 3.73 4.58 5.34
[14] 4.62 3.86 4.53 5.27 4.58 3.90 4.48 5.29 4.65 4.07 4.36 5.23 4.71
[27] 4.23 4.19 5.29 4.66 4.17 4.17 5.22 4.49 4.10 4.31 5.19 4.30 4.26
[40] 4.37 5.17 4.28 4.35 4.44 4.98 4.36 4.32 4.59 4.75 4.21 4.53 4.59
What I'm trying to do is avoid having the year values 2014.00 2015.00 2016.00 2017.00 at the start of the resulting vector.
Any idea how to do it?
Select only the data from the second column onwards, like this:
data2 <- round(unname(unlist(as.data.frame(data[,c(2:ncol(data))]))), digits = 2)
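An equivalent, slightly shorter sketch, assuming `data` is the data.frame shown above: negative indexing drops the first column without spelling out the column range, and `use.names = FALSE` replaces the `unname()` call.

```r
# data[-1] drops the first (year) column; unlist() then flattens the
# remaining monthly columns into a single numeric vector.
data2 <- round(unlist(data[-1], use.names = FALSE), digits = 2)
```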

subset xts or data.frame to just one particular day every year

I am new to quantmod. It has many ways to subset by date, but I need to subset to a specific day of the year, i.e. 12/24 of every year, out of a data set spanning many years, and quantmod does not seem to have this function. Is there a way to do that?
Example:
getSymbols('AMD',src='google')
and you get data starting from 2007; I want to subset it to a data frame with just:
2007-12-24 ...
2008-12-24 ...
2016-12-26 ...
#and so on.
You can try something like this:
getSymbols('AMD',src='google')
# .indexmon(AMD) == 11 selects December (months are zero-based) and
# .indexmday(AMD) == 24 selects the 24th
AMD[.indexmon(AMD) == 11 & .indexmday(AMD) == 24]
# AMD.Open AMD.High AMD.Low AMD.Close AMD.Volume
#2007-12-24 7.78 7.88 7.68 7.77 9193719
#2008-12-24 1.98 2.03 1.97 1.99 2912312
#2009-12-24 9.79 9.95 9.78 9.91 11331966
#2012-12-24 2.54 2.57 2.47 2.48 9625363
#2013-12-24 3.77 3.80 3.75 3.77 5798855
#2014-12-24 2.63 2.70 2.63 2.65 4624005
#2015-12-24 2.88 3.00 2.86 2.92 11900888
Just to add to LyzandeR's answer, you could also convert the data to a tibble and use lubridate:
library(tidyverse)
library(lubridate)
library(quantmod)
getSymbols('AMD',src='google')
AMD %>%
  as_tibble() %>%
  rownames_to_column("date") %>%
  filter(month(date) == 12, day(date) == 24)
date AMD.Open AMD.High AMD.Low AMD.Close AMD.Volume
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2007-12-24 7.78 7.88 7.68 7.77 9193719
2 2008-12-24 1.98 2.03 1.97 1.99 2912312
3 2009-12-24 9.79 9.95 9.78 9.91 11331966
4 2012-12-24 2.54 2.57 2.47 2.48 9625363
5 2013-12-24 3.77 3.80 3.75 3.77 5798855
6 2014-12-24 2.63 2.70 2.63 2.65 4624005
7 2015-12-24 2.88 3.00 2.86 2.92 11900888
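One more option that works directly on the xts object, without converting to a tibble, is to format the index as a month-day string. Note that the 'google' source has since been retired from quantmod, so this sketch uses 'yahoo':

```r
library(quantmod)

getSymbols("AMD", src = "yahoo")  # src = "google" is no longer available

# format() renders each index date as an "mm-dd" string; keep only Dec 24.
AMD[format(index(AMD), "%m-%d") == "12-24"]
```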
