R, lag( ) has inconsistent behavior for xts and ts objects

I would like to take a lag of an xts variable, and the lag() function returns a lag. However, if I use it on a ts variable, it gives a lead. Is this a bug, or working as intended?
library('xts')
a = as.xts(ts(c(5,3,7,2,4,8,3), start=c(1980,1), freq=4))
cbind(a, lag(a)) # provides lag 1
# ..1 ..2
# 1980 Q1 5 NA
# 1980 Q2 3 5
# 1980 Q3 7 3
# 1980 Q4 2 7
# 1981 Q1 4 2
# 1981 Q2 8 4
# 1981 Q3 3 8
b = ts(c(5,3,7,2,4,8,3), start=c(1980,1), freq=4)
cbind(b, lag(b)) # provides lead 1
# b lag(b)
# 1979 Q4 NA 5
# 1980 Q1 5 3
# 1980 Q2 3 7
# 1980 Q3 7 2
# 1980 Q4 2 4
# 1981 Q1 4 8
# 1981 Q2 8 3
# 1981 Q3 3 NA

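As documented in ?lag.xts, this is the intended behavior. For xts objects a positive k gives a true lag (the value at time t is the value from time t-1), which deliberately differs from lag.default for ts objects, where a positive k shifts the time base back and therefore acts as a lead.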
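If you need a true lag from a ts object, one option is to pass a negative k to stats::lag. The following is a minimal sketch using only base R; the name b_lag1 is just an illustrative choice.

```r
# Quarterly ts series, same values as in the question
b <- ts(c(5, 3, 7, 2, 4, 8, 3), start = c(1980, 1), frequency = 4)

# For ts objects a negative k moves the series forward in time,
# so each quarter is paired with the previous quarter's value
b_lag1 <- stats::lag(b, k = -1)

# cbind aligns both series on the union of their time indices
both <- cbind(b, b_lag1)
print(both)
```

The first row of the second column is NA because there is no observation before 1980 Q1.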


Extracting strings from links using regex in R

I have a list of URLs and I want to extract parts of each string and save them in other variables. The sample data is below:
sample<- c("http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr01f2009.pdf",
"http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr02f2001.pdf",
"http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr03f2002.pdf",
"http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr04f2004.pdf",
"http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr05f2005.pdf",
"http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr06f2018.pdf",
"http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr07f2016.pdf",
"http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr08f2015.pdf",
"http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr09f2020.pdf",
"http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr10f2014.pdf")
sample
[1] "http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr01f2009.pdf"
[2] "http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr02f2001.pdf"
[3] "http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr03f2002.pdf"
[4] "http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr04f2004.pdf"
[5] "http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr05f2005.pdf"
[6] "http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr06f2018.pdf"
[7] "http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr07f2016.pdf"
[8] "http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr08f2015.pdf"
[9] "http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr09f2020.pdf"
[10] "http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr10f2014.pdf"
I want to extract week and year using regex.
week year
1 1 2009
2 2 2001
3 3 2002
4 4 2004
5 5 2005
6 6 2018
7 7 2016
8 8 2015
9 9 2020
10 10 2014
You could use str_match to capture numbers after 'owgr' and 'f' :
library(stringr)
str_match(sample, 'owgr(\\d+)f(\\d+)')[, -1]
You can convert this to a data frame, convert the columns to numeric, and assign column names. The first capture group is the week and the second is the year.
setNames(type.convert(data.frame(
str_match(sample, 'owgr(\\d+)f(\\d+)')[, -1]), as.is = TRUE), c('week', 'year'))
# week year
#1 1 2009
#2 2 2001
#3 3 2002
#4 4 2004
#5 5 2005
#6 6 2018
#7 7 2016
#8 8 2015
#9 9 2020
#10 10 2014
Another way could be to extract all the numbers from the last part of each URL. We can get the last part with basename.
str_extract_all(basename(sample), '\\d+', simplify = TRUE)
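If you prefer not to load stringr, the same capture-group idea can be sketched in base R with regexec and regmatches. This is only an illustration on two of the URLs from the question; the names urls, m, parts, and res are arbitrary.

```r
urls <- c(
  "http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr01f2009.pdf",
  "http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr10f2014.pdf"
)

# regexec returns the full match plus one entry per capture group
m <- regmatches(urls, regexec("owgr([0-9]+)f([0-9]+)", urls))

# Drop the full match (first element) and convert the groups to integers
parts <- do.call(rbind, lapply(m, function(x) as.integer(x[-1])))

res <- data.frame(week = parts[, 1], year = parts[, 2])
print(res)
```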
Another way you can try
library(dplyr)
library(stringr)
df <- data.frame(sample)
df2 <- df %>%
transmute(week = as.integer(str_extract(sample, "(?<=wgr)\\d{1,2}(?=f)")),
year = as.integer(str_extract(sample, "(?<=f)\\d{4}(?=\\.pdf)")))
df2
# week year
# 1 1 2009
# 2 2 2001
# 3 3 2002
# 4 4 2004
# 5 5 2005
# 6 6 2018
# 7 7 2016
# 8 8 2015
# 9 9 2020
# 10 10 2014
You could use {unglue} :
library(unglue)
unglue_data(
sample,
"http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr{week}f{year}.pdf")
#> week year
#> 1 01 2009
#> 2 02 2001
#> 3 03 2002
#> 4 04 2004
#> 5 05 2005
#> 6 06 2018
#> 7 07 2016
#> 8 08 2015
#> 9 09 2020
#> 10 10 2014

Merging two different size datasets, copying the same row conditionally from smaller dataset to several rows in larger dataset

I am completely new to R. I have searched for a solution to my problem for some time without finding an adequate answer, so I hope asking here will help.
I need to merge two data sets of different sizes: one contains annual data (df_f) and the other monthly data (df_m). The rows of the smaller df_f should be merged conditionally into the larger df_m.
Here is a descriptive example of my problem (with some very basic reproducible numbers):
first dataset
a <- c(1990)
b <- c(1980:1981)
c <- c(1994:1995)
aa <- rep("A", 1)
bb <- rep("B", 2)
cc <- rep("C", 2)
df1 <- data.frame(comp=factor(c(aa, bb, cc)))
df2 <- data.frame(year=factor(c(a, b, c)))
other.columns <- rep("other_columns", nrow(df1))
df_f <- cbind(df1, df2, other.columns ) # first dataset
second dataset
z <- c(10:12)
x <- c(7:12)
xx <- c(1:9)
v <- c(2:9)
w <- rep(1990, length(z))
e <- rep(1980, length(x))
ee <- rep (1981, length(xx))
r <- rep(1995, length(v))
t <- rep("A", length(z))
y <- rep("B", length(x) + length(xx))
u <- rep("C", length(v))
df3 <- data.frame(month=factor(c(z, x, xx, v)))
df4 <- data.frame(year=factor(c(w, e, ee, r)))
df5 <- data.frame(comp=factor(c(t, y, u)))
df_m <- cbind(df5, df4, df3) # second dataset
Output:
> df_m
comp year month
1 A 1990 10
2 A 1990 11
3 A 1990 12
4 B 1980 7
5 B 1980 8
6 B 1980 9
7 B 1980 10
8 B 1980 11
9 B 1980 12
10 B 1981 1
11 B 1981 2
12 B 1981 3
13 B 1981 4
14 B 1981 5
15 B 1981 6
16 B 1981 7
17 B 1981 8
18 B 1981 9
19 C 1995 2
20 C 1995 3
21 C 1995 4
22 C 1995 5
23 C 1995 6
24 C 1995 7
25 C 1995 8
26 C 1995 9
> df_f
comp year other.columns
1 A 1990 other_columns
2 B 1980 other_columns
3 B 1981 other_columns
4 C 1994 other_columns
5 C 1995 other_columns
I want the rows from df_f added to df_m (stored as new columns in df_m) according to conditions on comp, year, and month. Comp (company) must always match, but how the year is matched depends on the month: if month is > 6, the year must be the same in both datasets; if month is < 7, the previous year in df_f is used (year in df_m equals year in df_f plus one). Note that a single row in df_f should be placed into several rows of df_m according to these conditions.
The wanted output clarifies the problem and the goal:
Wanted output:
comp year month comp year other.columns
1 A 1990 10 A 1990 other_columns
2 A 1990 11 A 1990 other_columns
3 A 1990 12 A 1990 other_columns
4 B 1980 7 B 1980 other_columns
5 B 1980 8 B 1980 other_columns
6 B 1980 9 B 1980 other_columns
7 B 1980 10 B 1980 other_columns
8 B 1980 11 B 1980 other_columns
9 B 1980 12 B 1980 other_columns
10 B 1981 1 B 1980 other_columns
11 B 1981 2 B 1980 other_columns
12 B 1981 3 B 1980 other_columns
13 B 1981 4 B 1980 other_columns
14 B 1981 5 B 1980 other_columns
15 B 1981 6 B 1980 other_columns
16 B 1981 7 B 1981 other_columns
17 B 1981 8 B 1981 other_columns
18 B 1981 9 B 1981 other_columns
19 C 1995 2 C 1994 other_columns
20 C 1995 3 C 1994 other_columns
21 C 1995 4 C 1994 other_columns
22 C 1995 5 C 1994 other_columns
23 C 1995 6 C 1994 other_columns
24 C 1995 7 C 1995 other_columns
25 C 1995 8 C 1995 other_columns
26 C 1995 9 C 1995 other_columns
Thank you very much in advance! I hope the question is clear enough, it was somewhat difficult to explain it at least.
The basic idea to solve your problem is to add an extra column with the year that should be used for matching. I will use the package dplyr for this and the other manipulation steps.
Before the tables can be combined, the year and month columns, which are stored as factors, must be converted to numeric:
library(dplyr)
df_m <- mutate(df_m, year = as.numeric(as.character(year)),
month = as.numeric(as.character(month)))
df_f <- mutate(df_f, year = as.numeric(as.character(year)))
The reason is that you want to be able to do numerical comparison with the month (month > 6) and subtract one from the year. You cannot do this with a factor.
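As a small illustration of why the conversion goes through as.character first: calling as.numeric directly on a factor returns the internal level codes, not the original numbers. The vector below is made up for the example.

```r
year <- factor(c(1990, 1980, 1981))

# Direct conversion gives the positions of the levels (sorted: 1980, 1981, 1990)
codes <- as.numeric(year)

# Going through character recovers the actual values
values <- as.numeric(as.character(year))

print(codes)
print(values)
```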
Then I add the column to be used for matching:
df_m <- mutate(df_m, match_year = ifelse(month >= 7, year, year - 1))
And in the last step, I join the two tables:
df_new <- left_join(df_m, df_f, by = c("comp", "match_year" = "year"))
The argument by determines which columns of the two data frames should be matched. The output agrees with your result:
## comp year month match_year other.columns
## 1 A 1990 10 1990 other_columns
## 2 A 1990 11 1990 other_columns
## 3 A 1990 12 1990 other_columns
## 4 B 1980 7 1980 other_columns
## 5 B 1980 8 1980 other_columns
## 6 B 1980 9 1980 other_columns
## 7 B 1980 10 1980 other_columns
## 8 B 1980 11 1980 other_columns
## 9 B 1980 12 1980 other_columns
## 10 B 1981 1 1980 other_columns
## 11 B 1981 2 1980 other_columns
## 12 B 1981 3 1980 other_columns
## 13 B 1981 4 1980 other_columns
## 14 B 1981 5 1980 other_columns
## 15 B 1981 6 1980 other_columns
## 16 B 1981 7 1981 other_columns
## 17 B 1981 8 1981 other_columns
## 18 B 1981 9 1981 other_columns
## 19 C 1995 2 1994 other_columns
## 20 C 1995 3 1994 other_columns
## 21 C 1995 4 1994 other_columns
## 22 C 1995 5 1994 other_columns
## 23 C 1995 6 1994 other_columns
## 24 C 1995 7 1995 other_columns
## 25 C 1995 8 1995 other_columns
## 26 C 1995 9 1995 other_columns
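If you would rather not depend on dplyr, the same matching-year idea can probably be expressed with base merge(). This is a rough sketch on a tiny made-up subset; the values "row_1980" and "row_1981" are placeholders so that it is visible which annual row was matched.

```r
# Toy monthly and annual data (hypothetical values, same column layout as above)
df_m <- data.frame(comp = c("B", "B", "B"),
                   year = c(1980, 1981, 1981),
                   month = c(12, 3, 8),
                   stringsAsFactors = FALSE)
df_f <- data.frame(comp = c("B", "B"),
                   year = c(1980, 1981),
                   other.columns = c("row_1980", "row_1981"),
                   stringsAsFactors = FALSE)

# Months 7-12 match the same year, months 1-6 match the previous year
df_m$match_year <- ifelse(df_m$month >= 7, df_m$year, df_m$year - 1)

# all.x = TRUE keeps every monthly row, like a left join
df_new <- merge(df_m, df_f,
                by.x = c("comp", "match_year"),
                by.y = c("comp", "year"),
                all.x = TRUE)
print(df_new)
```

Note that merge() sorts the result by the key columns, so the row order can differ from the left_join output.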

aggregate + mean returns wrong result

Using R, I am trying to calculate groupwise means with aggregate(..., mean). However, the mean returned is wrong.
testdata <-read.table(text="
a b c d year
2 10 1 NA 1998
1 7 NA NA 1998
4 6 NA NA 1998
2 2 NA NA 1998
4 3 2 1 1998
2 6 NA NA 1998
3 NA NA NA 1998
2 7 NA 3 1998
1 8 NA 4 1998
2 7 2 5 1998
1 NA NA 4 1998
2 5 NA 6 1998
2 4 NA NA 1998
3 11 2 7 1998
1 18 4 10 1998
3 12 7 5 1998
2 17 NA NA 1998
2 11 4 5 1998
1 3 1 1 1998
3 5 1 3 1998
",header=TRUE,sep="")
aggregate(. ~ year, testdata,
function(x) c(mean = round(mean(x, na.rm=TRUE), 2)))
colMeans(subset(testdata, year=="1998", select=d), na.rm=TRUE)
aggregate says the mean of d for group 1998 is 4.62, but it is 4.5.
Reducing the data to one column only, aggregate gets it right:
aggregate(. ~ year, testdata[4:5],
function(x) c(mean = round(mean(x, na.rm=TRUE), 2)))
What's wrong with my aggregate() + mean() function?
The formula method of aggregate uses na.action = na.omit by default, so it drops every row that contains an NA in any column before the data reach the mean function. Try running your aggregate call without na.rm=TRUE - it will still work.
To fix this, you need to change the default na.action in aggregate to na.pass:
aggregate(. ~ year, testdata,
function(x) c(mean = round(mean(x, na.rm=TRUE), 2)), na.action = na.pass)
year a b c d
1 1998 2.15 7.89 2.67 4.5
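The effect is easy to see on a tiny made-up data frame where only one column has a missing value. This is just a sketch; the names d, dropped, and passed are arbitrary.

```r
d <- data.frame(g = c(1, 1, 1),
                x = c(1, 2, 3),
                y = c(NA, 4, 6))

# Default na.action = na.omit: the whole first row is removed,
# so the mean of x is computed from 2 and 3 only
dropped <- aggregate(. ~ g, d, mean, na.rm = TRUE)

# na.pass keeps all rows; na.rm = TRUE then handles NAs column by column
passed <- aggregate(. ~ g, d, mean, na.rm = TRUE, na.action = na.pass)

print(dropped)
print(passed)
```

With the default, x averages to 2.5 although none of its values is missing; with na.pass it averages to 2.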

Taking Average and Median by Month and then Ordering by Date and Factor in R

Let's suppose I have the following data:
set.seed(123)
Dates <- c("2013-10-07","2013-10-14","2013-11-21","2013-11-28" , "2013-12-04" , "2013-12-11","2013-01-18","2013-01-18")
Dates.New <- c(Dates,Dates)
Values <- sample(seq(1:10),16,replace = TRUE)
Factor <- c(rep("Group 1",8),rep("Group 2",8))
df <- data.frame(Dates.New,Values,Factor)
df[sample(1:nrow(df)),]
This returns
Dates.New Values Factor
4 2013-11-28 9 Group 1
1 2013-10-07 3 Group 1
5 2013-12-04 10 Group 1
13 2013-12-04 7 Group 2
11 2013-11-21 10 Group 2
8 2013-01-18 9 Group 1
7 2013-01-18 6 Group 1
9 2013-10-07 6 Group 2
6 2013-12-11 1 Group 1
14 2013-12-11 6 Group 2
16 2013-01-18 9 Group 2
3 2013-11-21 5 Group 1
2 2013-10-14 8 Group 1
15 2013-01-18 2 Group 2
12 2013-11-28 5 Group 2
10 2013-10-14 5 Group 2
What I am trying to do here is find the monthly average and median for both of my factors, then order each group by month in a new data frame. So the new data frame would have the median and average for months 10, 11, 12, 1 for Group 1 bundled together, and the next 4 rows would have the median and average for months 10, 11, 12, 1 for Group 2 bundled together as well. I am open to packages. Thanks!
Here is a data.table solution. The question seems to be looking for both mean and median. See if this suits your need.
library(zoo); library(data.table)
setDT(df)[, list(Mean = mean(Values),
Median = median(Values)),
by = list(Factor, as.yearmon(Dates.New))][order(Factor, as.yearmon)]
# Factor as.yearmon Mean Median
# 1: Group 1 Jan 2013 7.5 7.5
# 2: Group 1 Oct 2013 5.5 5.5
# 3: Group 1 Nov 2013 7.0 7.0
# 4: Group 1 Dec 2013 5.5 5.5
# 5: Group 2 Jan 2013 5.5 5.5
# 6: Group 2 Oct 2013 5.5 5.5
# 7: Group 2 Nov 2013 7.5 7.5
# 8: Group 2 Dec 2013 6.5 6.5
Like this?
df$Dates.New <- as.Date(df$Dates.New)
library(zoo) # for as.yearmon(...)
result <- aggregate(Values~as.yearmon(Dates.New)+Factor,df,mean)
names(result)[1] <- "Year.Mon"
result
# Year.Mon Factor Values
# 1 Jan 2013 Group 1 7.5
# 2 Oct 2013 Group 1 5.5
# 3 Nov 2013 Group 1 7.0
# 4 Dec 2013 Group 1 5.5
# 5 Jan 2013 Group 2 5.5
# 6 Oct 2013 Group 2 5.5
# 7 Nov 2013 Group 2 7.5
# 8 Dec 2013 Group 2 6.5
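The aggregate call above returns only the mean. If both mean and median are wanted without extra packages, one possible sketch is to let the function return a named vector and then flatten the result. The small data frame here is made up, and the month key is built with format() instead of as.yearmon.

```r
df <- data.frame(
  Dates.New = c("2013-10-07", "2013-10-14", "2013-11-21", "2013-10-07"),
  Values = c(3, 8, 5, 6),
  Factor = c("Group 1", "Group 1", "Group 1", "Group 2"),
  stringsAsFactors = FALSE
)

# Year-month key as text, e.g. "2013-10"
df$Month <- format(as.Date(df$Dates.New), "%Y-%m")

# The function returns two statistics, which aggregate stores as a matrix column
result <- aggregate(Values ~ Month + Factor, df,
                    function(v) c(Mean = mean(v), Median = median(v)))

# Flatten the matrix column into ordinary columns Values.Mean and Values.Median
result <- do.call(data.frame, result)

# Order by group first, then by month within each group
result <- result[order(result$Factor, result$Month), ]
print(result)
```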

For Loop and Table Printing in R

I'm trying to compute confidence intervals for many rows of a table using a for loop, and would like output that is more readable. Here is a snippet of the data.
QUESTION X_YEAR X_PARTNER X_CAMP X_N X_CODE1
1 Q1 2011 SCSD ITC 15 4
2 Q1 2011 SCSD Nottingham 4 1
3 Q1 2011 SCSD ALL 19 5
4 Q1 2011 CP CP1 18 4
5 Q1 2011 ALL ALL 37 9
6 Q1 2012 SCSD ITC 8 1
7 Q1 2012 SCSD Nottingham 8 2
8 Q1 2012 SCSD ALL 16 3
9 Q1 2012 CP CP1 18 2
10 Q1 2012 CP CP1 22 2
11 Q1 2012 CP ALL 40 4
I'm trying to print out a confidence interval, with the Question, Year and Camp included. I'd like the output to be in table form like this
QUESTION YEAR CAMP X N MEAN LOWER UPPER
Q1 2011 ITC 4 15 0.26 0.07 0.55
Q1 2011 NOTTINGHAM 1 4 0.25 0.006 0.8
with the first three columns being taken directly from the data table, and the latter 4 extracted from a confidence interval test I'm using.
The code I'm currently using:
for (i in 1:26){
print(data[i,1],max.levels=0)
print(data[i,2],max.levels=0)
print(data[i,4],max.levels=0)
print(binom.confint(data[i,6],data[i,5],conf.level=0.95,methods="exact"))
}
provides output that (I have a lot more data than the snippet) will be far too time consuming to sift through...
[1] Q1
[1] 2011
[1] ITC
method x n mean lower upper
1 exact 4 15 0.2666667 0.07787155 0.5510032
[1] Q1
[1] 2011
[1] Nottingham
method x n mean lower upper
1 exact 1 4 0.25 0.006309463 0.8058796
Any advice is appreciated!
If df is the name of your data, and you only want to do this for where QUESTION is Q1 (see comments), then
library(binom)
df2 <- df[df$QUESTION == "Q1",]
x <- vector("list", nrow(df2))
for(i in seq_len(nrow(df2))) {
x[[i]] <- binom.confint(df2[i,6], df2[i,5], methods = "exact")
}
cbind(df2[c(1,2,4)], do.call(rbind, x)[,-1])
# QUESTION X_YEAR X_CAMP x n mean lower upper
# 1 Q1 2011 ITC 4 15 0.26666667 0.077871546 0.5510032
# 2 Q1 2011 Nottingham 1 4 0.25000000 0.006309463 0.8058796
# 3 Q1 2011 ALL 5 19 0.26315789 0.091465785 0.5120293
# 4 Q1 2011 CP1 4 18 0.22222222 0.064092048 0.4763728
# 5 Q1 2011 ALL 9 37 0.24324324 0.117725174 0.4119917
# 6 Q1 2012 ITC 1 8 0.12500000 0.003159724 0.5265097
# 7 Q1 2012 Nottingham 2 8 0.25000000 0.031854026 0.6508558
# 8 Q1 2012 ALL 3 16 0.18750000 0.040473734 0.4564565
# 9 Q1 2012 CP1 2 18 0.11111111 0.013751216 0.3471204
# 10 Q1 2012 CP1 2 22 0.09090909 0.011205586 0.2916127
# 11 Q1 2012 ALL 4 40 0.10000000 0.027925415 0.2366374
Note that conf.level = 0.95 is the default setting for binom.confint, so you don't need to include it in your call.
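If the binom package is not available, the exact (Clopper-Pearson) interval can also be obtained from binom.test in base R. Below is a loop-free sketch on the first two rows of the snippet; the names ci and out are arbitrary, and the columns are selected by name rather than by position.

```r
df <- data.frame(QUESTION = c("Q1", "Q1"),
                 X_YEAR = c(2011, 2011),
                 X_CAMP = c("ITC", "Nottingham"),
                 X_N = c(15, 4),
                 X_CODE1 = c(4, 1),
                 stringsAsFactors = FALSE)

# One exact interval per row; mapply returns a 2 x nrow matrix, so transpose it
ci <- t(mapply(function(x, n) binom.test(x, n, conf.level = 0.95)$conf.int,
               df$X_CODE1, df$X_N))

out <- data.frame(df[c("QUESTION", "X_YEAR", "X_CAMP")],
                  x = df$X_CODE1,
                  n = df$X_N,
                  mean = df$X_CODE1 / df$X_N,
                  lower = ci[, 1],
                  upper = ci[, 2])
print(out)
```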
