Converting month column table to chronological order in R - r

I have a table of the following format:
Initial Table Formatting
And I'm seeking an output resembling the following:
Date           Value
January 1659   Value 1
February 1659  Value 2
March 1659     Value 3
April 1659     Value 4
and so on (numerical representations of the month and year are perfectly fine too).
I've attempted using merge operations but I'm thinking there must be an easier way (possibly using packages). I've found somewhat similar questions asked but none obviously applicable yet.

You can use pivot_longer and unite, both from the tidyr package:
library(tidyr)
pivot_longer(df, -Year) |>
unite(date, name, Year, sep = " ")
#> # A tibble: 120 x 2
#> date value
#> <chr> <int>
#> 1 Jan 1659 68
#> 2 Feb 1659 97
#> 3 Mar 1659 89
#> 4 Apr 1659 74
#> 5 May 1659 44
#> 6 Jun 1659 2
#> 7 Jul 1659 81
#> 8 Aug 1659 22
#> 9 Sep 1659 87
#> 10 Oct 1659 1
#> # ... with 110 more rows
Data used
set.seed(1)
df <- cbind(1659:1668, replicate(12, sample(99, 10))) |>
as.data.frame() |>
setNames(c("Year", month.abb))
df
#> Year Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
#> 1 1659 68 97 89 74 44 2 81 22 87 1 76 43
#> 2 1660 39 85 37 42 25 45 13 93 83 43 39 1
#> 3 1661 1 21 34 38 70 18 40 28 90 59 24 29
#> 4 1662 34 54 99 20 39 22 89 48 48 26 53 78
#> 5 1663 87 74 44 28 51 78 48 33 64 15 92 22
#> 6 1664 43 7 79 96 42 65 96 45 94 58 86 70
#> 7 1665 14 73 33 44 6 70 23 21 60 29 40 28
#> 8 1666 82 79 84 87 24 87 84 31 51 24 83 37
#> 9 1667 59 98 35 70 32 93 29 17 34 42 90 61
#> 10 1668 51 37 70 40 14 75 98 73 10 48 35 46
Created on 2022-11-29 with reprex v2.0.2
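If real dates are wanted rather than character labels, the month name can be mapped to its number through the built-in month.abb vector. The following is a minimal base-R sketch on a small made-up table (the values and the names month_cols and long are illustrative, not from the question); it assumes the month columns use the three-letter abbreviations from the sample data above.

```r
# Small stand-in for the wide table (hypothetical values, years deliberately unsorted).
df <- data.frame(Year = c(1660, 1659), Jan = c(3, 1), Feb = c(4, 2))

# Reshape to long form: one row per Year/month pair.
month_cols <- setdiff(names(df), "Year")
long <- data.frame(
  Year  = rep(df$Year, times = length(month_cols)),
  month = rep(month_cols, each = nrow(df)),
  value = unlist(df[month_cols], use.names = FALSE)
)

# Build a Date (first day of the month) from the year and the month's
# position in month.abb, then sort chronologically.
long$date <- as.Date(sprintf("%d-%02d-01", long$Year, match(long$month, month.abb)))
long <- long[order(long$date), c("date", "value")]
long
```

Sorting on a Date column makes the chronological order explicit, so it no longer depends on the row order of the original table.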

Related

Sum Columns in a dataframe where the names match a vector list

I have a dataframe made up largely of integers and community names.
I have made a list of the community names grouped by their regions, like so:
RegionA <- c(a,c,d)
RegionB <- c(b,e,f)
RegionC <- c(g,h,i)
Year a b c d e f g h i `5`
<dbl> <int> <int> <int> <int> <int> <int> <int> <int> <int> <dbl>
1 2021 61 44 1 78 37 46 33 16 57 5
2 2020 60 54 60 2 72 59 60 34 60 5
3 2019 53 77 39 66 85 82 65 95 50 5
4 2018 78 20 63 26 41 29 19 82 46 5
5 2017 62 38 22 23 6 11 20 51 65 5
6 2021 39 15 38 74 90 83 73 12 71 5
7 2020 28 23 76 57 100 89 62 14 56 5
8 2019 82 48 40 45 93 72 40 45 29 5
9 2018 13 69 100 13 5 52 99 52 47 5
10 2017 92 13 13 96 98 17 46 49 74 5
I am trying to select the names from the Region vectors and sum them in new columns.
I have tried using
df <- df %>%
mutate(Region_A = rowSums(select(., colnames %in% RegionA)))
and
df <- df %>%
rowwise %>%
mutate(Region_A = sum(c_across(where(colnames %in% RegionA))))
with no success, getting this error
Caused by error in `match()`:
! 'match' requires vector arguments
What could be the proper solution?
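The error arises because colnames is a function, not a vector of names, so colnames %in% RegionA passes a function to match(). The region vectors also need to hold the column names as quoted strings. A possible solution: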
library(dplyr)
RegionA <- c("a","c","d")
RegionB <- c("b","e","f")
RegionC <- c("g","h","i")
df %>%
rowwise %>%
mutate(RegionA = sum(c_across(all_of(RegionA))),
RegionB = sum(c_across(all_of(RegionB))),
RegionC = sum(c_across(all_of(RegionC)))) %>%
ungroup
#> # A tibble: 10 × 13
#> Year a b c d e f g h i RegionA RegionB
#> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
#> 1 2021 61 44 1 78 37 46 33 16 57 140 127
#> 2 2020 60 54 60 2 72 59 60 34 60 122 185
#> 3 2019 53 77 39 66 85 82 65 95 50 158 244
#> 4 2018 78 20 63 26 41 29 19 82 46 167 90
#> 5 2017 62 38 22 23 6 11 20 51 65 107 55
#> 6 2021 39 15 38 74 90 83 73 12 71 151 188
#> 7 2020 28 23 76 57 100 89 62 14 56 161 212
#> 8 2019 82 48 40 45 93 72 40 45 29 167 213
#> 9 2018 13 69 100 13 5 52 99 52 47 126 126
#> 10 2017 92 13 13 96 98 17 46 49 74 201 128
#> # … with 1 more variable: RegionC <int>
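When there are many regions, keeping them in a named list and looping over it avoids writing one mutate() line per region. This is a base-R sketch with rowSums(), not the approach from the answer above; the list name regions is an assumption made for illustration, and the two-row data frame holds only the first two rows of the question's table.

```r
# First two rows of the question's data, restricted to the columns used here.
df <- data.frame(
  Year = c(2021, 2020),
  a = c(61, 60), b = c(44, 54), c = c(1, 60),
  d = c(78, 2),  e = c(37, 72), f = c(46, 59)
)

# Hypothetical named list: region name -> character vector of column names.
regions <- list(
  RegionA = c("a", "c", "d"),
  RegionB = c("b", "e", "f")
)

# Add one row-sum column per region; df[cols] selects columns by name.
for (r in names(regions)) {
  df[[r]] <- unname(rowSums(df[regions[[r]]]))
}
df
```

The resulting RegionA and RegionB values for these two rows agree with the first two rows of the output shown above.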

R: order() not working as expected outside of RStudio

I have a character vector (consisting of randomly arranged numbers or letters) that I want to use to order a dataframe:
vals = as.numeric(dict$keys)
## ONE
vals = order(vals)
## TWO
dict = dict[vals,]
At ONE:
> vals
[1] 1 1 1 1 1 2 2 2 3 3 3 3 3 3 3 4 4 4 4 4 4 4 5 5 5
[26] 6 7 7 7 7 8 8 8 8 8 8 8 8 8 8 8 8 8 9 9 9 9 9 10 10
[51] 10 10 10 11 11 11 11 11 12 12 12 12 12 12 12 12 12 13 13 13 14 14 15 15 15
[76] 15 16 16 16 16 16 16 16 16 16 16 17 17 17 17 17 18 18 18 18 18 18 18 18 18
[101] 18 19 19 19 19 19 19 20 20 20 20 20 20 20 20 20 20 21 21 21 21 21 21 21 21
[126] 22 22 22 22 22 22 22 22 22 22 22 22 23
At TWO:
> vals
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
[19] 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
[37] 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54
[55] 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72
[73] 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90
[91] 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108
[109] 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126
[127] 127 128 129 130 131 132 133 134 135 136 137 138
When I execute this snippet in RStudio in Windows, it orders the dataframe dict fine. Numbers are ordered first, then letters are at the end (this is what I want).
However, on a Linux remote desktop where I run the script with Rscript, this snippet doesn't work and the dataframe stays as it was before these lines were executed.
I fixed this by setting stringsAsFactors = F in every call to data.frame in the script, as Henrik suggested. The issue lay in the different versions of R I was using on the two systems.
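A likely explanation, offered here as an assumption rather than something confirmed in the post: before R 4.0.0, data.frame() converted strings to factors by default, and as.numeric() on a factor returns the internal level codes instead of the numbers the labels spell out. The sketch below uses made-up keys to show the difference.

```r
keys_chr <- c("10", "2", "b", "1")  # hypothetical keys: numbers and a letter
keys_fct <- factor(keys_chr)        # what a factor column would hold

# On a factor, as.numeric() returns the level codes, not the labels.
codes <- as.numeric(keys_fct)

# On a character vector it parses the text; non-numeric entries become NA.
parsed <- suppressWarnings(as.numeric(keys_chr))

# order() puts NA last by default, so the letter ends up at the end.
idx <- order(parsed)
keys_chr[idx]

# Converting the factor to character first recovers the parsed values.
same <- identical(suppressWarnings(as.numeric(as.character(keys_fct))), parsed)
same
```

Wrapping a column in as.character() before as.numeric() gives the same ordering whether or not the column was created as a factor.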

Filter using paste and name in dplyr

Sample data
df <- data.frame(loc.id = rep(1:5, each = 6), day = sample(1:365,30),
ref.day1 = rep(c(20,30,50,80,90), each = 6),
ref.day2 = rep(c(10,28,33,49,67), each = 6),
ref.day3 = rep(c(31,49,65,55,42), each = 6))
For each loc.id, if I want to keep days that are >= than ref.day1, I do this:
df %>% group_by(loc.id) %>% dplyr::filter(day >= ref.day1)
I want to make 3 data frames, each whose rows are filtered by ref.day1, ref.day2,ref.day3 respectively
I tried this:
col.names <- c("ref.day1","ref.day2","ref.day3")
temp.list <- list()
for(cl in seq_along(col.names)){
col.sub <- col.names[cl]
columns <- c("loc.id","day",col.sub)
df.sub <- df[,columns]
temp.dat <- df.sub %>% group_by(loc.id) %>% dplyr::filter(day >= paste0(col.sub)) # this line does not work
temp.list[[cl]] <- temp.dat
}
final.dat <- rbindlist(temp.list)
I was wondering how to refer to columns by names and paste function in dplyr in order to filter it out.
Your original code doesn't work because the entries of col.names are strings, while dplyr functions use non-standard evaluation. paste0(col.sub) just returns the string "ref.day1", so day is compared with that string rather than with the column. You need to convert the string into a symbol that refers to the column; rlang::sym() can do that.
Also, you can use map function in purrr package, which is much more compact:
library(dplyr)
library(purrr)
col_names <- c("ref.day1","ref.day2","ref.day3")
map(col_names, ~ df %>% dplyr::filter(day >= !!rlang::sym(.x)))
# returns a list of data frames
By the way, I removed group_by() because it doesn't seem to be needed here.
Returned result:
[[1]]
loc.id day ref.day1 ref.day2 ref.day3
1 1 362 20 10 31
2 1 69 20 10 31
3 1 65 20 10 31
4 1 88 20 10 31
5 1 142 20 10 31
6 2 355 30 28 49
7 2 255 30 28 49
8 2 136 30 28 49
9 2 156 30 28 49
10 2 194 30 28 49
11 2 204 30 28 49
12 3 129 50 33 65
13 3 254 50 33 65
14 3 279 50 33 65
15 3 201 50 33 65
16 3 282 50 33 65
17 4 351 80 49 55
18 4 114 80 49 55
19 4 338 80 49 55
20 4 283 80 49 55
21 5 199 90 67 42
22 5 141 90 67 42
23 5 241 90 67 42
24 5 187 90 67 42
[[2]]
loc.id day ref.day1 ref.day2 ref.day3
1 1 16 20 10 31
2 1 362 20 10 31
3 1 69 20 10 31
4 1 65 20 10 31
5 1 88 20 10 31
6 1 142 20 10 31
7 2 355 30 28 49
8 2 255 30 28 49
9 2 136 30 28 49
10 2 156 30 28 49
11 2 194 30 28 49
12 2 204 30 28 49
13 3 129 50 33 65
14 3 254 50 33 65
15 3 279 50 33 65
16 3 201 50 33 65
17 3 282 50 33 65
18 4 351 80 49 55
19 4 114 80 49 55
20 4 338 80 49 55
21 4 283 80 49 55
22 4 79 80 49 55
23 5 199 90 67 42
24 5 67 90 67 42
25 5 141 90 67 42
26 5 241 90 67 42
27 5 187 90 67 42
[[3]]
loc.id day ref.day1 ref.day2 ref.day3
1 1 362 20 10 31
2 1 69 20 10 31
3 1 65 20 10 31
4 1 88 20 10 31
5 1 142 20 10 31
6 2 355 30 28 49
7 2 255 30 28 49
8 2 136 30 28 49
9 2 156 30 28 49
10 2 194 30 28 49
11 2 204 30 28 49
12 3 129 50 33 65
13 3 254 50 33 65
14 3 279 50 33 65
15 3 201 50 33 65
16 3 282 50 33 65
17 4 351 80 49 55
18 4 114 80 49 55
19 4 338 80 49 55
20 4 283 80 49 55
21 4 79 80 49 55
22 5 199 90 67 42
23 5 67 90 67 42
24 5 141 90 67 42
25 5 241 90 67 42
26 5 187 90 67 42
You may also want to check these:
https://dplyr.tidyverse.org/articles/programming.html
Use variable names in functions of dplyr
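For comparison, the same filtering can be written in base R, where a column is looked up from its string name with [[ and no tidy evaluation is involved. This is a sketch on a small made-up data frame (the values and the name filtered are illustrative), not the method from the answer above.

```r
# Hypothetical data with the same column layout as the question, but smaller.
df <- data.frame(
  loc.id   = rep(1:2, each = 3),
  day      = c(5, 25, 40, 10, 35, 60),
  ref.day1 = rep(c(20, 30), each = 3),
  ref.day2 = rep(c(10, 50), each = 3)
)

col_names <- c("ref.day1", "ref.day2")

# df[[col]] fetches the column whose name is stored in the string col.
filtered <- lapply(col_names, function(col) df[df$day >= df[[col]], ])
names(filtered) <- col_names

filtered$ref.day1
filtered$ref.day2
```

Naming the list elements after the reference columns makes it easy to tell which filter produced which data frame.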

R dataset not found?

I'm trying to load the dataset life.expectancy.1971, but seem to have trouble loading it. I'm inputting
data(life.expectancy.1971)
life.expectancy.1971
and keep getting the following error:
data set ‘life.expectancy.1971’ not found
Error: object 'life.expectancy.1971' not found
I'm still pretty new to R so it could be a simple error on my part, but I haven't been able to figure out what's wrong since that has worked for loading other datasets. Can anyone help me figure out what I'm missing?
Pasting the answers from the comments to an answer so that the question can be closed.
Install the cluster.datasets package and its dependencies:
install.packages(c("cluster.datasets"), dependencies = TRUE)
Load cluster.datasets:
library(cluster.datasets)
Load the dataset life.expectancy.1971:
data(life.expectancy.1971)
Look at the dataset life.expectancy.1971:
life.expectancy.1971
#> country year m0 m25 m50 m75 f0 f25 f50 f75
#> 1 Algeria 1965 63 51 30 13 67 54 34 15
#> 2 Cameroon 1964 34 29 13 5 38 32 17 6
#> 3 Madagascar 1966 38 30 17 7 38 34 20 7
#> 4 Mauritius 1966 59 42 20 6 64 46 25 8
#> 5 Reunion 1963 56 38 18 7 62 46 25 10
#> 6 Seychelles 1960 62 44 24 7 69 50 28 14
#> 7 South Africa (Nonwhite) 1961 50 39 20 7 55 43 23 8
#> 8 South Africa (White) 1961 65 44 22 7 72 50 27 9
#> 9 Tunisia 1960 56 46 24 11 63 54 33 19
#> 10 Canada 1966 69 47 24 8 75 53 29 10
#> 11 Costa Rica 1966 65 48 26 9 68 50 27 10
#> 12 Dominican Republic 1966 64 50 28 11 66 51 29 11
#> 13 El Salvador 1961 56 44 25 10 61 48 27 12
#> 14 Greenland 1960 60 44 22 6 65 45 25 9
#> 15 Grenada 1961 61 45 22 8 65 49 27 10
#> 16 Guatemala 1964 49 40 22 9 51 41 23 8
#> 17 Honduras 1966 59 42 22 6 61 43 22 7
#> 18 Jamaica 1963 63 44 23 8 67 48 26 9
#> 19 Mexico 1966 59 44 24 8 63 46 25 8
#> 20 Nicaragua 1965 65 48 28 14 68 51 29 13
#> 21 Panama 1966 65 48 26 9 67 49 27 10
#> 22 Trinidad 1962 64 43 21 7 68 47 25 9
#> 23 Trinidad 1967 64 43 21 6 68 47 24 8
#> 24 US 1966 67 45 23 8 74 51 28 10
#> 25 US (Nonwhite) 1966 61 40 21 10 67 46 25 11
#> 26 US (White) 1966 68 46 23 8 75 52 29 10
#> 27 US 1967 67 45 23 8 74 51 28 10
#> 28 Argentina 1964 65 46 24 9 71 51 28 10
#> 29 Chile 1967 59 43 23 10 66 49 27 12
#> 30 Columbia 1965 58 44 24 9 62 47 25 10
#> 31 Ecuador 1965 57 46 25 9 60 49 28 11
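More generally, data() only finds datasets in packages that are installed, and by default it searches only the attached ones. A quick way to check what a package provides is sketched below. It uses the built-in datasets package and mtcars as stand-ins so that it runs anywhere; the guarded block at the end assumes nothing about whether cluster.datasets is installed.

```r
# List the datasets shipped with an installed package.
available <- data(package = "datasets")$results[, "Item"]
"mtcars" %in% available

# Load a dataset from a named package without attaching that package.
data("mtcars", package = "datasets")
head(mtcars, 3)

# The same pattern for the package from the question, run only if it is installed.
if (requireNamespace("cluster.datasets", quietly = TRUE)) {
  data("life.expectancy.1971", package = "cluster.datasets")
}
```

If a name is missing from the list for every installed package, the dataset lives in a package that still has to be installed.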

Time series, change monthly data to quarterly

Now I have some monthly data like :
1/1/90,620
2/1/90,591
3/1/90,574
4/1/90,542
5/1/90,534
6/1/90,545
#...etc
If I use ts() function, it's easy to make the data into time series structure like:
Jan Feb Mar ... Nov Dec
1990 620 591 574 ... 493 464
1991 100 200 300 ...........
Is there any way to change it into a quarterly layout like this:
1st 2nd 3rd 4th
1990-Q1 620 591 574 464
1990-Q2 100 200 300 400
1990-Q3 ...
1990-Q4 ...
1991-Q1 ...
I tried to change
ts(mydata,start=c(1990,1),frequency=12)
to
ts(mydata,start=c(as.yearqrt("1990-1",1)),frequency=4)
but it doesn't seem to work.
Could anyone help me? Thank you very much.
monthly <- ts(mydata, start = c(1990, 1), frequency = 12)
quarterly <- aggregate(monthly, nfrequency = 4)
I don't agree with Hyndman on this one, which is rare, as Hyndman can usually do no wrong. However, I can show you that his solution doesn't give the OP what they want.
test<-c(1:100)
test_ts <- ts(test, start=c(2000,1), frequency=12)
test_ts
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
2000 1 2 3 4 5 6 7 8 9 10 11 12
2001 13 14 15 16 17 18 19 20 21 22 23 24
2002 25 26 27 28 29 30 31 32 33 34 35 36
2003 37 38 39 40 41 42 43 44 45 46 47 48
2004 49 50 51 52 53 54 55 56 57 58 59 60
2005 61 62 63 64 65 66 67 68 69 70 71 72
2006 73 74 75 76 77 78 79 80 81 82 83 84
2007 85 86 87 88 89 90 91 92 93 94 95 96
2008 97 98 99 100
test_agg <- aggregate(test_ts, nfrequency=4)
test_agg
Qtr1 Qtr2 Qtr3 Qtr4
2000 6 15 24 33
2001 42 51 60 69
2002 78 87 96 105
2003 114 123 132 141
2004 150 159 168 177
2005 186 195 204 213
2006 222 231 240 249
2007 258 267 276 285
2008 294
Well, wait: that first quarter isn't the average of the 3 months, it's the sum (1 + 2 + 3 = 6, but you want it to show the mean, 2). So you will need to modify that a tad.
test_agg <- aggregate(test_ts, nfrequency=4)/3
# divisor is (old freq)/(new freq) = 12/4 = 3
Qtr1 Qtr2 Qtr3 Qtr4
2000 2 5 8 11
2001 14 17 20 23
2002 26 29 32 35
2003 38 41 44 47
2004 50 53 56 59
2005 62 65 68 71
2006 74 77 80 83
2007 86 89 92 95
2008 98
Which now shows you the mean of the monthly data written as quarterly.
The divisor is the trick here. If you had weekly (freq=52) and wanted quarterly (freq=4) you'd divide by 52/4=13.
If you want the mean instead of the sum, just add "mean":
quarterly <- aggregate(monthly, nfrequency=4,mean)
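The two approaches above can be checked against each other. This is a small sketch on a made-up series of two full years (the object names q_sum and q_mean are illustrative): aggregate() sums by default, so dividing by 12/4 = 3 gives the same result as passing FUN = mean.

```r
# Two complete years of hypothetical monthly data: 1, 2, ..., 24.
monthly <- ts(1:24, start = c(2000, 1), frequency = 12)

q_sum  <- aggregate(monthly, nfrequency = 4)              # default FUN = sum
q_mean <- aggregate(monthly, nfrequency = 4, FUN = mean)  # quarterly means

# Dividing the sums by (old frequency) / (new frequency) reproduces the means.
divisor <- frequency(monthly) / frequency(q_sum)
all.equal(as.numeric(q_mean), as.numeric(q_sum) / divisor)

q_mean
```

Passing FUN = mean states the intent directly, while the divisor form only works because every quarter contains the same number of months.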
