create sequences of dates depending on another variable - r

I need some help with my work.
I have a dataset like this:
DATE COD QTA
2014-01-02 87 11
2014-01-05 87 5
2015-02-03 45 3
2015-06-21 45 92
2014-09-18 74 34
2015-04-21 74 27
For each value of the variable COD, I need to create the sequence of all dates from the minimum date (for example, for COD 87 the minimum date is 2014-01-02) up to Sys.Date(). Dates that are missing from the original data should get QTA = 0. The final result I would like looks something like this:
DATE COD QTA
2014-01-02 87 11
2014-01-03 87 0
2014-01-04 87 0
2014-01-05 87 5
2014-01-06 87 0
... 87 ...
Sys.Date() 87 x
2015-02-03 45 3
2015-02-04 45 0
2015-02-05 45 0
... 45 ...
Sys.Date() 45 x
How can I do that? Thanks guys!

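A data.table solution (this assumes DATE is already of class Date):
library(data.table)
dt <- as.data.table(df)
# Build the full daily sequence per COD, then join the original rows onto it;
# dates with no match get QTA = NA, which is replaced by 0
dt[dt[, list(DATE = seq(min(DATE), Sys.Date(), by = "day")), by = COD],
   on = c("COD", "DATE")][, QTA := ifelse(is.na(QTA), 0, QTA)][]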
# DATE COD QTA
# 1: 2014-01-02 87 11
# 2: 2014-01-03 87 0
# 3: 2014-01-04 87 0
# 4: 2014-01-05 87 5
# 5: 2014-01-06 87 0
# ---
#2601: 2016-12-19 74 0
#2602: 2016-12-20 74 0
#2603: 2016-12-21 74 0
#2604: 2016-12-22 74 0
#2605: 2016-12-23 74 0
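For readers without data.table, the same idea can be sketched in base R: build one daily sequence per COD, left-join the original rows onto it, and fill the gaps with 0. This is only a minimal sketch on a cut-down copy of the question's data; the fixed `end_date` is an assumption made for reproducibility and stands in for Sys.Date().

```r
# Cut-down copy of the question's data (DATE must be of class Date)
df <- data.frame(
  DATE = as.Date(c("2014-01-02", "2014-01-05", "2015-02-03", "2015-06-21")),
  COD  = c(87, 87, 45, 45),
  QTA  = c(11, 5, 3, 92)
)

# Assumption: a fixed end date for reproducibility; use Sys.Date() in practice
end_date <- as.Date("2015-06-30")

# One row per day and COD, from that COD's first date up to end_date
grid <- do.call(rbind, lapply(split(df, df$COD), function(g) {
  data.frame(DATE = seq(min(g$DATE), end_date, by = "day"), COD = g$COD[1])
}))

# Left join the observed rows, then replace missing quantities with 0
out <- merge(grid, df, by = c("COD", "DATE"), all.x = TRUE)
out$QTA[is.na(out$QTA)] <- 0
out <- out[order(out$COD, out$DATE), c("DATE", "COD", "QTA")]
head(out)
```

The merge keeps every row of the daily grid (`all.x = TRUE`), so days without an observation survive the join and can be filled afterwards.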

Related

Conditionally replace column names in a dataframe based on values in another dataframe

I have downloaded a table of stream diversion data ("df_downloaded"). The column names of this table are primarily taken from the ID numbers of the gauging stations.
I want to conditionally replace the ID numbers that have been used for column names with text for the station names, which will help make the data more readable when I'm sharing the results. I created a table ("stationIDs") with the ID numbers and station names to use as a reference for changing the column names of "df_downloaded".
I can replace the column names individually, but I want to write a loop of some kind that will address all of the columns of "df_downloaded" and change the names of the columns referenced in the dataframe "stationIDs".
An example of what I'm trying to do is below.
Downloaded Data ("df_downloaded")
A portion of the downloaded data is similar to this:
df_downloaded <- data.frame(Var1 = seq(as.Date("2012-01-01"),as.Date("2012-12-01"), by="month"),
Var2 = sample(50:150,12, replace =TRUE),
Var3 = sample(10:100,12, replace =TRUE),
Var4 = sample(15:45,12, replace =TRUE),
Var5 = sample(50:200,12, replace =TRUE),
Var6 = sample(15:100,12, replace =TRUE),
Var7 = c(rep(0,3),rep(13,6),rep(0,3)),
Var8 = rep(5,12))
colnames(df_downloaded) <- c("Diversion.Date","360410059","360410060",
"360410209","361000655","361000656","Irrigation","Seep")
df_downloaded # not run
#
# Diversion.Date 360410059 360410060 360410209 361000655 361000656 Irrigation Seep
# 1 2012-01-01 93 57 28 101 16 0 5
# 2 2012-02-01 102 68 19 124 98 0 5
# 3 2012-03-01 124 93 36 109 56 0 5
# 4 2012-04-01 94 96 23 54 87 13 5
# 5 2012-05-01 83 70 43 119 15 13 5
# 6 2012-06-01 78 63 45 195 15 13 5
# 7 2012-07-01 86 77 20 130 63 13 5
# 8 2012-08-01 118 29 27 118 57 13 5
# 9 2012-09-01 142 18 45 116 27 13 5
# 10 2012-10-01 74 68 34 182 79 0 5
# 11 2012-11-01 106 48 27 95 74 0 5
# 12 2012-12-01 91 41 20 179 55 0 5
Reference Table ("stationIDs")
stationIDs <- data.frame(ID = c("360410059", "360410060", "360410209", "361000655", "361000656"),
Names = c("RimView", "IPCO", "WMA.Ditch", "RV.Bypass", "LowerFalls"))
stationIDs # not run
#
# ID Names
# 1 360410059 RimView
# 2 360410060 IPCO
# 3 360410209 WMA.Ditch
# 4 361000655 RV.Bypass
# 5 361000656 LowerFalls
I can replace the column names in "df_downloaded" using individual statements. I show the first three iterations below.
After three iterations "RimView", "IPCO", and "WMA.Ditch" have replaced their respective gauge ID numbers.
names(df_downloaded) <- gsub(stationIDs$ID[1],stationIDs$Name[1],names(df_downloaded))
# head(df_downloaded)
# Diversion.Date RimView 360410060 360410209 361000655 361000656 Irrigation Seep
# 1 2012-01-01 93 57 28 101 16 0 5
# 2 2012-02-01 102 68 19 124 98 0 5
# 3 2012-03-01 124 93 36 109 56 0 5
# 4 2012-04-01 94 96 23 54 87 13 5
# 5 2012-05-01 83 70 43 119 15 13 5
# 6 2012-06-01 78 63 45 195 15 13 5
names(df_downloaded) <- gsub(stationIDs$ID[2],stationIDs$Name[2],names(df_downloaded))
# head(df_downloaded)
# Diversion.Date RimView IPCO 360410209 361000655 361000656 Irrigation Seep
# 1 2012-01-01 93 57 28 101 16 0 5
# 2 2012-02-01 102 68 19 124 98 0 5
# 3 2012-03-01 124 93 36 109 56 0 5
# 4 2012-04-01 94 96 23 54 87 13 5
# 5 2012-05-01 83 70 43 119 15 13 5
# 6 2012-06-01 78 63 45 195 15 13 5
names(df_downloaded) <- gsub(stationIDs$ID[3],stationIDs$Name[3],names(df_downloaded))
# head(df_downloaded)
# Diversion.Date RimView IPCO WMA.Ditch 361000655 361000656 Irrigation Seep
# 1 2012-01-01 93 57 28 101 16 0 5
# 2 2012-02-01 102 68 19 124 98 0 5
# 3 2012-03-01 124 93 36 109 56 0 5
# 4 2012-04-01 94 96 23 54 87 13 5
# 5 2012-05-01 83 70 43 119 15 13 5
# 6 2012-06-01 78 63 45 195 15 13 5
If I try to do the renaming using a for loop, I end up with NAs for column names.
for(i in seq_along(names(df_downloaded))){
names(df_downloaded) <- gsub(stationIDs$ID[i],stationIDs$Name[i],names(df_downloaded))
}
# head(df_downloaded)
# NA NA NA NA NA NA NA NA
# 1 2012-01-01 93 57 28 101 16 0 5
# 2 2012-02-01 102 68 19 124 98 0 5
# 3 2012-03-01 124 93 36 109 56 0 5
# 4 2012-04-01 94 96 23 54 87 13 5
# 5 2012-05-01 83 70 43 119 15 13 5
# 6 2012-06-01 78 63 45 195 15 13 5
I really want to be able to change the names with a for loop or something similar, because the number of stations that I download data from changes depending on the years that I am analyzing.
Thanks for taking time to look at my question.
We can use match
#Convert factor columns to character
stationIDs[] <- lapply(stationIDs, as.character)
#Match names of df_downloaded with stationIDs$ID
inds <- match(names(df_downloaded), stationIDs$ID)
#Replace the matched name with corresponding Names from stationIDs
names(df_downloaded)[which(!is.na(inds))] <- stationIDs$Names[inds[!is.na(inds)]]
df_downloaded
# Diversion.Date RimView IPCO WMA.Ditch RV.Bypass LowerFalls Irrigation Seep
#1 2012-01-01 142 14 41 200 79 0 5
#2 2012-02-01 97 100 35 176 22 0 5
#3 2012-03-01 85 59 26 88 71 0 5
#4 2012-04-01 68 49 34 63 15 13 5
#5 2012-05-01 62 58 44 87 16 13 5
#6 2012-06-01 70 59 33 145 87 13 5
#7 2012-07-01 112 65 25 52 64 13 5
#8 2012-08-01 75 12 27 103 19 13 5
#9 2012-09-01 73 65 36 172 68 13 5
#10 2012-10-01 87 35 27 146 42 0 5
#11 2012-11-01 122 17 33 183 32 0 5
#12 2012-12-01 108 65 15 120 99 0 5
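As a side note on the loop in the question: it iterates over the columns of df_downloaded (8 of them), while stationIDs has only 5 rows, so from the sixth iteration on the pattern passed to gsub is NA and every name becomes NA. A minimal sketch of a loop that iterates over the rows of stationIDs instead is shown below. The data here is a cut-down, made-up version of the question's tables, and exact comparison with `==` is swapped in for gsub so that partial matches cannot occur.

```r
# Cut-down stand-ins for the question's tables (values are illustrative only)
df_downloaded <- data.frame(a = 1:2, b = 3:4, c = 5:6, d = 7:8)
names(df_downloaded) <- c("Diversion.Date", "360410059", "360410060", "Seep")

stationIDs <- data.frame(
  ID    = c("360410059", "360410060"),
  Names = c("RimView", "IPCO"),
  stringsAsFactors = FALSE
)

# Loop over the reference rows, not over the columns of the data
for (i in seq_len(nrow(stationIDs))) {
  hit <- names(df_downloaded) == stationIDs$ID[i]
  names(df_downloaded)[hit] <- stationIDs$Names[i]
}

names(df_downloaded)
```

Columns whose names do not appear in stationIDs$ID (here Diversion.Date and Seep) are left untouched.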
You can do this with dplyr and tidyr. The idea is to make the data long so that the IDs end up in a column, join that column against the reference table of IDs and names, and then make the data wide again.
library(dplyr)
library(tidyr)
df_downloaded %>%
  # long format: one row per date and station ID
  gather(ID, value, -Diversion.Date, -Irrigation, -Seep) %>%
  # look up the station name for each ID
  left_join(stationIDs, by = "ID") %>%
  dplyr::select(-ID) %>%
  # back to wide format, one column per station name
  spread(Names, value)

creating a Sparse matrix from a list of lists- R

I have a list called res which contains 83 data frames in the following format. I need to generate one sparse matrix out of them. Row and Columns are the row and column indices of the sparse matrix, and Freq is the entry at the corresponding index.
Example of format for res[82] and res[83]:
[[82]]
Row Columns Freq
2 82 33 1
3 82 173 1
4 82 211 1
5 82 247 2
6 82 480 2
7 82 541 1
8 82 974 1
9 82 1197 1
10 82 1416 1
11 82 1531 1
12 82 1797 7
13 82 2416 2
14 82 2530 1
15 82 2772 1
16 82 2970 2
17 82 3264 4
18 82 3416 1
19 82 3995 4
20 82 5593 1
21 82 6557 1
22 82 8141 1
23 82 9044 1
24 82 11889 1
25 82 12608 1
26 82 13352 1
27 82 13463 1
28 82 17937 1
29 82 29730 1
30 82 37712 1
31 82 258434 1
[[83]]
Row Columns Freq
2 83 309 1
3 83 447 1
4 83 480 2
5 83 487 1
6 83 619 1
7 83 651 1
8 83 913 1
9 83 1555 1
10 83 1874 1
11 83 2416 1
12 83 3101 1
13 83 3856 1
14 83 3964 1
15 83 3995 1
16 83 4017 1
17 83 4362 1
18 83 10551 1
19 83 17130 1
20 83 29730 1
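We can use sparseMatrix from the Matrix package after rbinding the list elements.
library(Matrix)
# Stack all list elements into one data frame with columns Row, Columns, Freq
d1 <- do.call(rbind, res)
# i = row indices, j = column indices, x = values
sp <- sparseMatrix(i = d1[, 1], j = d1[, 2], x = d1[, 3])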
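To see the rbind-then-sparseMatrix approach on something small, here is a minimal sketch with a made-up two-element list in the same Row/Columns/Freq layout as the question. The values are illustrative only, and `sp` is just a name chosen for the result.

```r
library(Matrix)

# Made-up list in the same layout as the question's res
res <- list(
  data.frame(Row = c(1, 1), Columns = c(2, 4), Freq = c(1, 2)),
  data.frame(Row = c(2, 2), Columns = c(1, 4), Freq = c(7, 1))
)

# Stack the pieces, then build the matrix from (row, column, value) triplets
d1 <- do.call(rbind, res)
sp <- sparseMatrix(i = d1$Row, j = d1$Columns, x = d1$Freq)

dim(sp)
sp[2, 1]
```

By default the dimensions are taken from the largest row and column index; if the matrix should be wider or taller than that, sparseMatrix accepts a `dims` argument.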

Calculate quarterly Mean of data.frame

I have this data frame, (df1):
Month index
1 2015-09-01 1.21418847
2 2015-08-01 -4.37919039
3 2015-07-01 -1.16004624
4 2015-06-01 -1.09754890
5 2015-05-01 -4.37919039
6 2015-04-01 -4.37919039
7 2015-03-01 4.37919039
8 2015-02-01 4.37919039
9 2015-01-01 -0.11285150
10 2014-12-01 0.45712044
11 2014-11-01 0.97597018
12 2014-10-01 0.87560496
13 2014-09-01 0.66278156
14 2014-08-01 4.37919039
15 2014-07-01 1.15440685
16 2014-06-01 1.38021497
17 2014-05-01 1.67663242
18 2014-04-01 2.08358406
19 2014-03-01 2.50222843
20 2014-02-01 2.71665822
21 2014-01-01 3.13692051
22 2013-12-01 2.91702023
23 2013-11-01 3.02603774
24 2013-10-01 2.55812363
25 2013-09-01 3.12586325
26 2013-08-01 3.26063617
27 2013-07-01 2.91702023
28 2013-06-01 3.15504505
29 2013-05-01 2.53958494
30 2013-04-01 2.61528861
31 2013-03-01 2.84742861
32 2013-02-01 2.82097624
33 2013-01-01 2.53196473
34 2012-12-01 2.35786991
35 2012-11-01 2.40611260
36 2012-10-01 2.42408844
37 2012-09-01 2.91702023
38 2012-08-01 2.33372249
39 2012-07-01 2.00140636
40 2012-06-01 2.24721387
41 2012-05-01 1.89189602
42 2012-04-01 1.98807663
43 2012-03-01 1.89563925
44 2012-02-01 1.19541625
45 2012-01-01 2.91702023
46 2011-12-01 0.29072412
47 2011-11-01 -2.91702023
48 2011-10-01 -2.91702023
49 2011-09-01 -0.36402331
50 2011-08-01 -0.55409805
51 2011-07-01 -0.05902839
52 2011-06-01 -0.03946940
53 2011-05-01 0.30898661
54 2011-04-01 2.91702023
55 2011-03-01 0.80556310
56 2011-02-01 1.07001901
57 2011-01-01 2.91702023
58 2010-12-01 1.34682208
59 2010-11-01 1.30446466
60 2010-10-01 0.97753435
61 2010-09-01 0.90434619
62 2010-08-01 0.80415571
63 2010-07-01 1.41129808
64 2010-06-01 2.03576435
65 2010-05-01 2.85757135
66 2010-04-01 2.91702023
67 2010-03-01 3.96563441
68 2010-02-01 4.37919039
69 2010-01-01 4.57358010
70 2009-12-01 4.63589893
71 2009-11-01 4.40042885
72 2009-10-01 4.21359930
73 2009-09-01 4.10739350
74 2009-08-01 2.91702023
75 2009-07-01 3.85460338
76 2009-06-01 3.07796824
77 2009-05-01 2.91702023
78 2009-04-01 1.90359672
79 2009-03-01 0.68355248
80 2009-02-01 0.36218125
81 2009-01-01 -0.50814101
82 2008-12-01 0.49310633
83 2008-11-01 2.98877210
84 2008-10-01 2.28716199
85 2008-09-01 0.61433048
86 2008-08-01 0.51258623
87 2008-07-01 1.74079440
88 2008-06-01 2.91702023
89 2008-05-01 1.60899848
90 2008-04-01 2.01574569
91 2008-03-01 1.81341196
92 2008-02-01 1.48482933
93 2008-01-01 1.89122725
94 2007-12-01 1.84400308
95 2007-11-01 1.23545695
96 2007-10-01 0.44341718
97 2007-09-01 0.55630846
98 2007-08-01 0.42806839
99 2007-07-01 -0.75234218
100 2007-06-01 -1.44397151
101 2007-05-01 -2.10673018
102 2007-04-01 -1.40817350
103 2007-03-01 -0.73608848
104 2007-02-01 -0.69200513
105 2007-01-01 -0.51056142
106 2006-12-01 -0.40504212
107 2006-11-01 -0.04161989
108 2006-10-01 -0.10478629
109 2006-09-01 0.07423530
110 2006-08-01 0.13076121
111 2006-07-01 2.91702023
112 2006-06-01 1.02865488
113 2006-05-01 -0.08979180
114 2006-04-01 -1.52792341
115 2006-03-01 -2.52839603
116 2006-02-01 -3.39026284
117 2006-01-01 -3.04045769
I want to calculate quarterly mean for each year. This will result in a data.frame with 39 rows.
I wrote this code to compute the quarterly mean:
final<-df1[, mean(index), by = quarterly(Month)]
The error message is:
Error in `[.data.frame`(df1, , mean(index), :
unused argument (by = month(Month))
Information:
class(df1$index)
"numeric"
class(df1$Month)
"factor"
What did I do wrong?
Thanks
It seems you are trying to use data.table syntax on a data frame. So first do
library(data.table)
setDT(df1)
to load the data.table package and convert df1 to a data.table by reference. Then you can do
final <- df1[, mean(index), keyby = .(year(Month), quarter(Month))]
str(final)
# Classes ‘data.table’ and 'data.frame': 39 obs. of 3 variables:
# $ year : int 2006 2006 2006 2006 2007 2007 2007 2007 2008 2008 ...
# $ quarter: int 1 2 3 4 1 2 3 4 1 2 ...
# $ V1 : num -2.986 -0.196 1.041 -0.184 -0.646 ...
# - attr(*, "sorted")= chr "year" "quarter"
# - attr(*, ".internal.selfref")=<externalptr>
This shows we have 39 rows in the result, as you desire. Some notes: The function is named quarter() not quarterly(), you needed a capital M in Month, and needed to group by year and quarter.
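If data.table is not an option, the same grouping by year and quarter can be sketched in base R with aggregate(). This is only an illustration on a few made-up rows; the helper columns `year` and `quarter` are names introduced here, and Month is converted from factor to Date first because the question reports it as a factor.

```r
# A few made-up rows in the same shape as df1 (Month stored as a factor)
df1 <- data.frame(
  Month = factor(c("2015-03-01", "2015-02-01", "2015-01-01", "2014-12-01")),
  index = c(3, 2, 1, 10)
)

# Convert the factor to Date, then derive year and quarter (1 to 4)
d <- as.Date(as.character(df1$Month))
df1$year    <- as.integer(format(d, "%Y"))
df1$quarter <- (as.integer(format(d, "%m")) - 1L) %/% 3L + 1L

# Mean of index within each year/quarter combination
final <- aggregate(index ~ year + quarter, data = df1, FUN = mean)
final <- final[order(final$year, final$quarter), ]
final
```

Grouping by year as well as quarter matters here for the same reason as in the data.table answer: grouping by quarter alone would pool the same quarter across all years.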

Download Google Trends data with R

I cannot find a similar problem on StackOverflow. I apologize if it is out there...
I have a list of dataframes, all with a date column and a value for each date. I would like to combine this list into one dataframe, with one date column and one value column from each list element.
I have this:
> list
$IBM
Date IBM
1 2012-03-01 98
2 2012-03-02 94
3 2012-03-03 49
4 2012-03-04 48
$AAPL
Date AAPL
1 2012-03-01 43
2 2012-03-02 38
3 2012-03-03 13
4 2012-03-04 10
$HPQ
Date HPQ
1 2012-03-01 62
2 2012-03-02 67
3 2012-03-03 24
4 2012-03-04 37
I would like this:
Date IBM AAPL HPQ
1 2012-03-01 98 43 62
2 2012-03-02 94 38 67
3 2012-03-03 49 13 24
4 2012-03-04 48 10 37
Using do.call("cbind", list) I get this:
> do.call("cbind", test)
IBM.Date IBM.IBM AAPL.Date AAPL.AAPL HPQ.Date HPQ.HPQ
1 2012-03-01 98 2012-03-01 43 2012-03-01 62
2 2012-03-02 94 2012-03-02 38 2012-03-02 67
3 2012-03-03 49 2012-03-03 13 2012-03-03 24
4 2012-03-04 48 2012-03-04 10 2012-03-04 37
This is very similar to what I want, but with multiple repeated date columns. Is there a way to do this? Preferably in the base package?
Thanks!!!
If your list looks like this
#sample data
dts<-c("2012-03-01","2012-03-02","2012-03-03","2012-03-04")
dd<-list(
IBM=data.frame(Date=dts, IBM=c(98,94,49,48)),
AAPL=data.frame(Date=dts, AAPL=c(43,38,13,10)),
HPQ=data.frame(Date=dts, HPQ=c(62,67,24,37))
)
Then you can create the output you want with
Reduce(merge, dd)
# Date IBM AAPL HPQ
# 1 2012-03-01 98 43 62
# 2 2012-03-02 94 38 67
# 3 2012-03-03 49 13 24
# 4 2012-03-04 48 10 37
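Reduce(merge, dd) relies on merge's defaults, which join on all shared column names and keep only dates present in every data frame. If the data frames might cover different dates, a hedged variant is to name the key explicitly and ask for a full outer join. The sketch below uses made-up data in which AAPL is missing one date; `wide` is just a name chosen for the result.

```r
dts <- as.Date(c("2012-03-01", "2012-03-02", "2012-03-03"))

# Made-up list where AAPL lacks the last date
dd <- list(
  IBM  = data.frame(Date = dts,      IBM  = c(98, 94, 49)),
  AAPL = data.frame(Date = dts[1:2], AAPL = c(43, 38)),
  HPQ  = data.frame(Date = dts,      HPQ  = c(62, 67, 24))
)

# Merge pairwise on Date, keeping dates that appear in any element
wide <- Reduce(function(a, b) merge(a, b, by = "Date", all = TRUE), dd)
wide
```

With `all = TRUE` the missing AAPL value shows up as NA instead of the whole date being dropped.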

How can I "roll up" values into subsequent records?

I have a data set (x) that looks like this:
DATE WEEKDAY A B C D
2011-02-04 Friday 113 67 109 72
2011-02-05 Saturday 1 0 0 1
2011-02-06 Sunday 9 5 0 0
2011-02-07 Monday 154 48 85 60
str(x):
'data.frame': 4 obs. of 6 variables:
$ DATE : Date, format: "2011-02-04" "2011-02-05" "2011-02-06" "2011-02-07"
$ WEEKDAY: Factor w/ 7 levels "Friday","Monday",..: 1 3 4 2
$ A : num 113 1 9 154
$ B : num 67 0 5 48
$ C : num 109 0 0 85
$ D : num 72 1 0 60
Tuesday - Saturday values don't change, but I want Sunday to be the sum of Saturday and Sunday and Monday to be the sum of Saturday, Sunday, and Monday.
I tried shifting Saturday's and Sunday's dates to date + 2 and date + 1 respectively, then aggregating by date, but I lose the weekend records.
For my example, the correct results would be the following:
DATE WEEKDAY A B C D
2011-02-04 Friday 113 67 109 72
2011-02-05 Saturday 1 0 0 1
2011-02-06 Sunday 10 5 0 1
2011-02-07 Monday 164 53 85 61
How can I roll up weekend values into the next day?
Three weeks' worth of data:
DATE WEEKDAY A B C D
1 2011-01-02 Sunday 2 1 0 0
2 2011-01-03 Monday 153 51 7 1
3 2011-01-04 Tuesday 182 103 13 5
4 2011-01-05 Wednesday 192 102 14 12
5 2011-01-06 Thursday 160 67 50 20
6 2011-01-07 Friday 154 96 50 39
7 2011-01-09 Sunday 0 0 0 1
8 2011-01-10 Monday 195 94 48 39
9 2011-01-11 Tuesday 206 72 71 38
10 2011-01-12 Wednesday 232 94 96 52
11 2011-01-13 Thursday 178 113 93 52
12 2011-01-14 Friday 173 97 68 56
13 2011-01-15 Saturday 2 0 1 0
14 2011-01-17 Monday 170 91 66 52
15 2011-01-18 Tuesday 176 76 70 78
16 2011-01-19 Wednesday 164 159 117 37
17 2011-01-20 Thursday 198 87 95 111
18 2011-01-21 Friday 213 86 89 90
19 2011-01-24 Monday 195 73 102 52
20 2011-01-25 Tuesday 193 108 116 70
21 2011-01-26 Wednesday 193 102 118 63
Since you've only provided a small sample, I haven't been able to test this on a larger data set, but the idea is as follows. I'll use data.table, as I find it can be very efficient here.
The code:
require(data.table)
my_days <- c("Saturday", "Sunday", "Monday")
dt <- data.table(x)  # x is the data frame from the question
dt[, `:=`(DATE = as.Date(DATE))]
setkey(dt, "DATE")
dt[WEEKDAY %in% my_days, `:=`(A = cumsum(A), B = cumsum(B),
C = cumsum(C), D = cumsum(D)), by = format(DATE-1, "%W")]
The idea:
First, change the DATE Column to actual Date type using as.Date (line 4).
Second, ensure that the columns are sorted by DATE column by setting the key column of dt to DATE (line 5).
Now, the last line (line 6) is where all the magic happens and is the trickiest:
The first part of the expression, WEEKDAY %in% my_days, restricts the data.table dt to the rows where the day is Saturday, Sunday or Monday.
The last part of the same line, by = format(DATE-1, "%W"), groups those rows by the week they belong to. "%W" numbers weeks starting on Monday, so a Monday would normally fall in a new week; subtracting 1 from the date before taking the week number moves it back into the week of the preceding weekend. The effect is that Tuesday through the following Monday share the same week number.
The expression in the middle, `:=`(A = ... , D = ...), computes the cumsum within each group and replaces just those values by reference.
For the new data you've posted, I get this as the result. Let me know if it's not what you seek.
# DATE WEEKDAY A B C D
# 1: 2011-01-02 Sunday 2 1 0 0
# 2: 2011-01-03 Monday 155 52 7 1
# 3: 2011-01-04 Tuesday 182 103 13 5
# 4: 2011-01-05 Wednesday 192 102 14 12
# 5: 2011-01-06 Thursday 160 67 50 20
# 6: 2011-01-07 Friday 154 96 50 39
# 7: 2011-01-09 Sunday 0 0 0 1
# 8: 2011-01-10 Monday 195 94 48 40
# 9: 2011-01-11 Tuesday 206 72 71 38
# 10: 2011-01-12 Wednesday 232 94 96 52
# 11: 2011-01-13 Thursday 178 113 93 52
# 12: 2011-01-14 Friday 173 97 68 56
# 13: 2011-01-15 Saturday 2 0 1 0
# 14: 2011-01-17 Monday 172 91 67 52
# 15: 2011-01-18 Tuesday 176 76 70 78
# 16: 2011-01-19 Wednesday 164 159 117 37
# 17: 2011-01-20 Thursday 198 87 95 111
# 18: 2011-01-21 Friday 213 86 89 90
# 19: 2011-01-24 Monday 195 73 102 52
# 20: 2011-01-25 Tuesday 193 108 116 70
# 21: 2011-01-26 Wednesday 193 102 118 63
# DATE WEEKDAY A B C D
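The same grouping trick can be checked without data.table. The following base R sketch applies it to the four-row example from the question (columns A and D only, to keep it short). Two choices here are assumptions rather than part of the answer: weekdays are taken from as.POSIXlt()$wday so the code does not depend on the locale's day names, and the year is added to the grouping key so that equal week numbers from different years are not mixed.

```r
# Four-row example from the question, reduced to columns A and D
x <- data.frame(
  DATE = as.Date(c("2011-02-04", "2011-02-05", "2011-02-06", "2011-02-07")),
  A = c(113, 1, 9, 154),
  D = c(72, 1, 0, 60)
)
x <- x[order(x$DATE), ]

# wday: 0 = Sunday, 1 = Monday, ..., 6 = Saturday
wday <- as.POSIXlt(x$DATE)$wday
roll <- wday %in% c(6, 0, 1)

# Shift by one day so Monday shares a week label with the preceding weekend
grp <- format(x$DATE - 1, "%Y-%W")

# Cumulative sums over Saturday, Sunday, Monday within each shifted week
for (col in c("A", "D")) {
  x[[col]][roll] <- ave(x[[col]][roll], grp[roll], FUN = cumsum)
}
x
```

On this input, Sunday becomes Saturday plus Sunday and Monday becomes the sum of all three days, which matches the expected result given in the question.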
