Extracting strings from links using regex in R - r

I have a list of URL links and I want to extract parts of each string and save them in another variable. The sample data is below:
sample<- c("http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr01f2009.pdf",
"http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr02f2001.pdf",
"http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr03f2002.pdf",
"http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr04f2004.pdf",
"http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr05f2005.pdf",
"http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr06f2018.pdf",
"http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr07f2016.pdf",
"http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr08f2015.pdf",
"http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr09f2020.pdf",
"http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr10f2014.pdf")
sample
[1] "http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr01f2009.pdf"
[2] "http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr02f2001.pdf"
[3] "http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr03f2002.pdf"
[4] "http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr04f2004.pdf"
[5] "http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr05f2005.pdf"
[6] "http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr06f2018.pdf"
[7] "http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr07f2016.pdf"
[8] "http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr08f2015.pdf"
[9] "http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr09f2020.pdf"
[10] "http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr10f2014.pdf"
I want to extract week and year using regex.
week year
1 1 2009
2 2 2001
3 3 2002
4 4 2004
5 5 2005
6 6 2018
7 7 2016
8 8 2015
9 9 2020
10 10 2014

You could use str_match to capture numbers after 'owgr' and 'f' :
library(stringr)
str_match(sample, 'owgr(\\d+)f(\\d+)')[, -1]
You can convert this to a data frame, change the columns to numeric and assign column names. The first capture group is the week and the second is the year.
setNames(type.convert(data.frame(
str_match(sample, 'owgr(\\d+)f(\\d+)')[, -1]), as.is = TRUE), c('week', 'year'))
# week year
#1 1 2009
#2 2 2001
#3 3 2002
#4 4 2004
#5 5 2005
#6 6 2018
#7 7 2016
#8 8 2015
#9 9 2020
#10 10 2014
Another way could be to extract all the numbers from last part of sample. We can get the last part with basename.
str_extract_all(basename(sample), '\\d+', simplify = TRUE)
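If you would rather avoid a package dependency, base R can do the same capture. The sketch below uses strcapture from the utils package (available since R 3.4.0) with the same 'owgr...f...' pattern. The two-element vector urls is a shortened stand-in for sample, made up for illustration.

```r
# Hypothetical shortened input; the real vector is `sample` from the question
urls <- c(
  "http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr01f2009.pdf",
  "http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr10f2014.pdf"
)

# The prototype data frame sets the column names and converts
# both capture groups to integer
res <- strcapture(
  "owgr([0-9]+)f([0-9]+)\\.pdf$",
  basename(urls),
  proto = data.frame(week = integer(), year = integer())
)
res
```

Because the prototype columns are integer, the leading zero in "01" is dropped during conversion.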

Another way you can try
library(dplyr)
library(stringr)
df <- data.frame(sample)
df2 <- df %>%
transmute(week = as.integer(str_extract(sample, "(?<=owgr)\\d{1,2}(?=f)")),
year = as.integer(str_extract(sample, "(?<=f)\\d{4}(?=\\.pdf)")))
# week year
# 1 1 2009
# 2 2 2001
# 3 3 2002
# 4 4 2004
# 5 5 2005
# 6 6 2018
# 7 7 2016
# 8 8 2015
# 9 9 2020
# 10 10 2014

You could use {unglue} :
library(unglue)
unglue_data(
sample,
"http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr{week}f{year}.pdf")
#> week year
#> 1 01 2009
#> 2 02 2001
#> 3 03 2002
#> 4 04 2004
#> 5 05 2005
#> 6 06 2018
#> 7 07 2016
#> 8 08 2015
#> 9 09 2020
#> 10 10 2014

Related

Repeating annual values multiple times to form a monthly dataframe

I have an annual dataset as below:
year <- c(2016,2017,2018)
xxx <- c(1,2,3)
yyy <- c(4,5,6)
df <- data.frame(year,xxx,yyy)
print(df)
year xxx yyy
1 2016 1 4
2 2017 2 5
3 2018 3 6
Where the values in column xxx and yyy correspond to values for that year.
I would like to expand this dataframe (or create a new dataframe), which retains the same column names, but repeats each value 12 times (corresponding to the month of that year) and repeat the yearly value 12 times in the first column.
As mocked up by the code below:
year <- rep(2016:2018,each=12)
xxx <- rep(1:3,each=12)
yyy <- rep(4:6,each=12)
df2 <- data.frame(year,xxx,yyy)
print(df2)
year xxx yyy
1 2016 1 4
2 2016 1 4
3 2016 1 4
4 2016 1 4
5 2016 1 4
6 2016 1 4
7 2016 1 4
8 2016 1 4
9 2016 1 4
10 2016 1 4
11 2016 1 4
12 2016 1 4
13 2017 2 5
14 2017 2 5
15 2017 2 5
16 2017 2 5
17 2017 2 5
18 2017 2 5
19 2017 2 5
20 2017 2 5
21 2017 2 5
22 2017 2 5
23 2017 2 5
24 2017 2 5
25 2018 3 6
26 2018 3 6
27 2018 3 6
28 2018 3 6
29 2018 3 6
30 2018 3 6
31 2018 3 6
32 2018 3 6
33 2018 3 6
34 2018 3 6
35 2018 3 6
36 2018 3 6
Any help would be greatly appreciated!
I'm new to R and I can see how I would do this with a loop statement but was wondering if there was an easier solution.
Convert df to a matrix, take the Kronecker product with a vector of 12 ones and then convert back to a data.frame. The Kronecker product drops the column names, so restore them with setNames. The as.data.frame can be omitted if a matrix result is OK.
setNames(as.data.frame(as.matrix(df) %x% rep(1, 12)), names(df))
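Another option, shown here as a minimal sketch rather than as part of the answer above, is to repeat the row indices and subset with them. This keeps the column names and column types without going through a matrix. The name idx is only illustrative.

```r
df <- data.frame(year = c(2016, 2017, 2018),
                 xxx  = c(1, 2, 3),
                 yyy  = c(4, 5, 6))

# Each row index appears 12 times in a row: 1,1,...,1,2,2,...,3
idx <- rep(seq_len(nrow(df)), each = 12)

df2 <- df[idx, ]
rownames(df2) <- NULL   # replace row names like "1.1", "1.2" with 1..36
head(df2, 3)
```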

How to lump sum the number of days of a data of several year?

I have data similar to this. I would like to compute a running day count (I'm not sure whether "lump sum" is the right term) and store it in a new column "date", so that the new column counts the days across the 3 years of data in ascending order.
year month day
2011 1 5
2011 2 14
2011 8 21
2012 2 24
2012 3 3
2012 4 4
2012 5 6
2013 2 14
2013 5 17
2013 6 24
I wrote this code, but the result is wrong and the code is too long. It doesn't count February correctly, since February usually has only 28 days. Are there any shorter ways?
cday <- function(data,syear=2011,smonth=1,sday=1){
year <- data[1]
month <- data[2]
day <- data[3]
cmonth <- c(0,31,28,31,30,31,30,31,31,30,31,30,31)
date <- (year-syear)*365+sum(cmonth[1:month])+day
for(yr in c(syear:year)){
if(yr==year){
if(yr%%4==0&&month>2){date<-date+1}
}else{
if(yr%%4==0){date<-date+1}
}
}
return(date)
}
op10$day.no <- apply(op10[,c("year","month","day")],1,cday)
I expect the result like this:
year month day date
2011 1 5 5
2011 1 14 14
2011 1 21 21
2011 1 24 24
2011 2 3 31
2011 2 4 32
2011 2 6 34
2011 2 14 42
2011 2 17 45
2011 2 24 52
Thank you for helping!!
Use Date classes. Dates and times are complicated, look for tools to do this for you rather than writing your own. Pick whichever of these you want:
df$date = with(df, as.Date(paste(year, month, day, sep = "-")))
df$julian_day = as.integer(format(df$date, "%j"))
df$days_since_2010 = as.integer(df$date - as.Date("2010-12-31"))
df
# year month day date julian_day days_since_2010
# 1 2011 1 5 2011-01-05 5 5
# 2 2011 2 14 2011-02-14 45 45
# 3 2011 8 21 2011-08-21 233 233
# 4 2012 2 24 2012-02-24 55 420
# 5 2012 3 3 2012-03-03 63 428
# 6 2012 4 4 2012-04-04 95 460
# 7 2012 5 6 2012-05-06 127 492
# 8 2013 2 14 2013-02-14 45 776
# 9 2013 5 17 2013-05-17 137 868
# 10 2013 6 24 2013-06-24 175 906
# using this data
df = read.table(text = "year month day
2011 1 5
2011 2 14
2011 8 21
2012 2 24
2012 3 3
2012 4 4
2012 5 6
2013 2 14
2013 5 17
2013 6 24", header = TRUE)
This is all using base R. If you handle dates and times frequently, you may also want to look at the lubridate package.
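To see why Date arithmetic removes the need for a hand-written month table, here is a small sketch. The helper day_number is a hypothetical name introduced for illustration. It wraps the same as.Date subtraction used above, with 2010-12-31 as the assumed origin.

```r
# Hypothetical helper: days elapsed since `origin` for a year/month/day triple
day_number <- function(year, month, day, origin = as.Date("2010-12-31")) {
  as.integer(as.Date(paste(year, month, day, sep = "-")) - origin)
}

day_number(2011, 1, 5)    # 5, matches the first row above
day_number(2012, 2, 24)   # 420, matches the fourth row above

# Leap years are handled by the Date class itself:
# 2012-02-28 to 2012-03-01 spans two days because 2012-02-29 exists
day_number(2012, 3, 1) - day_number(2012, 2, 28)   # 2
day_number(2013, 3, 1) - day_number(2013, 2, 28)   # 1
```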

Search in a column based on the value of a different column

I have a simple table with three columns ("Year", "Target", "Value") and I would like to create a new column (Resp) containing the "Year" where "Value" is higher than "Target". The selected value (from column "Year") corresponds to the first time that "Value" is higher than "Target".
This is part of the table:
db <- data.frame(Year=2010:2017, Target=c(3,5,2,7,5,8,3,6), Value=c(4,5,2,7,4,9,5,8))
print(db)
Year Target Value
1 2010 3 4
2 2011 5 5
3 2012 2 2
4 2013 7 3
5 2014 5 4
6 2015 8 9
7 2016 3 5
8 2017 6 8
The intended result is:
Year Target Value Resp
1 2010 3 4 2011
2 2011 5 5 2015
3 2012 2 2 2013
4 2013 7 3 2015
5 2014 5 4 2015
6 2015 8 9 NA
7 2016 3 5 2017
8 2017 6 8 NA
Any suggestions on how I can solve this problem?
In addition to the 'Resp' column, I want to create a new one (Black.Y) containing the "Year" corresponding to the minimum of "Value" until 'Value' is higher than "Target".
The intended result is:
Year Target Value Resp Black.Y
1 2010 3 4 2011 NA
2 2011 5 5 2015 2012
3 2012 2 2 2013 NA
4 2013 7 3 2015 2014
5 2014 5 4 2015 NA
6 2015 8 9 NA 2016
7 2016 3 5 2017 NA
8 2017 6 8 NA NA
Any suggestions on how I can solve this problem?
Here's an approach in base R:
o <- outer(db$Target, db$Value, `<`) # compute a logical matrix
o[lower.tri(o, diag = TRUE)] <- FALSE # replace lower.tri and diag with FALSE
idx <- max.col(o, ties.method = "first") # get the index of the first maximum
idx <- replace(idx, rowSums(o) == 0, NA) # take care of cases without greater Value
db$Resp <- db$Year[idx] # add new column
The resulting table is:
# Year Target Value Resp
# 1 2010 3 4 2011
# 2 2011 5 5 2013
# 3 2012 2 2 2013
# 4 2013 7 7 2015
# 5 2014 5 4 2015
# 6 2015 8 9 NA
# 7 2016 3 5 2017
# 8 2017 6 8 NA
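The same Resp column can also be computed row by row, which may be easier to follow than the matrix version. This is only a sketch of an alternative, not the method above. The helper first_exceed is a hypothetical name, and db is rebuilt from the data.frame call in the question.

```r
db <- data.frame(Year   = 2010:2017,
                 Target = c(3, 5, 2, 7, 5, 8, 3, 6),
                 Value  = c(4, 5, 2, 7, 4, 9, 5, 8))

# Hypothetical helper: for row i, find the first later row whose Value
# exceeds Target[i] and return its Year (NA if there is none)
first_exceed <- function(i) {
  later <- seq_len(nrow(db)) > i
  hit <- which(later & db$Value > db$Target[i])
  if (length(hit) == 0) NA_integer_ else db$Year[hit[1]]
}

db$Resp <- vapply(seq_len(nrow(db)), first_exceed, integer(1))
db
```

For large tables the outer() approach avoids the explicit loop, but it builds an n-by-n matrix, so the row-wise version uses less memory.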

Performing a dplyr full_join without a common variable to blend data frames

Using the dplyr full_join() operation, I am trying to perform the equivalent of a basic merge() operation in which no common variables exist (unable to satisfy the "by=" argument). This will blend two data frames and return all possible combinations.
However, the current full_join() function requires a common variable. I am unable to locate another dplyr function that can help with this. How can I perform this operation using functions specific to the dplyr library?
df_a = data.frame(department=c(1,2,3,4))
df_b = data.frame(period=c(2014,2015,2016,2017))
#This works as desired
big_df = merge(df_a,df_b)
#I'd like to perform the following in a much bigger operation:
big_df = dplyr::full_join(df_a,df_b)
#Error: No common variables. Please specify `by` param.
You can use crossing from tidyr:
library(tidyr)
crossing(df_a,df_b)
department period
1 1 2014
2 1 2015
3 1 2016
4 1 2017
5 2 2014
6 2 2015
7 2 2016
8 2 2017
9 3 2014
10 3 2015
11 3 2016
12 3 2017
13 4 2014
14 4 2015
15 4 2016
16 4 2017
If there are duplicate rows, crossing doesn't give the same result as merge.
Instead use full_join with by = character() to perform a cross-join which generates all combinations of df_a and df_b.
library("tidyverse") # version 1.3.2
# Add duplicate rows for illustration.
df_a <- tibble(department = c(1, 2, 3, 3))
df_b <- tibble(period = c(2014, 2015, 2016, 2017))
merge doesn't de-duplicate.
df_a_merge_b <- merge(df_a, df_b)
df_a_merge_b
#> department period
#> 1 1 2014
#> 2 2 2014
#> 3 3 2014
#> 4 3 2014
#> 5 1 2015
#> 6 2 2015
#> 7 3 2015
#> 8 3 2015
#> 9 1 2016
#> 10 2 2016
#> 11 3 2016
#> 12 3 2016
#> 13 1 2017
#> 14 2 2017
#> 15 3 2017
#> 16 3 2017
crossing drops duplicate rows.
df_a_crossing_b <- crossing(df_a, df_b)
df_a_crossing_b
#> # A tibble: 12 × 2
#> department period
#> <dbl> <dbl>
#> 1 1 2014
#> 2 1 2015
#> 3 1 2016
#> 4 1 2017
#> 5 2 2014
#> 6 2 2015
#> 7 2 2016
#> 8 2 2017
#> 9 3 2014
#> 10 3 2015
#> 11 3 2016
#> 12 3 2017
full_join doesn't remove duplicates either.
df_a_full_join_b <- full_join(df_a, df_b, by = character())
df_a_full_join_b
#> # A tibble: 16 × 2
#> department period
#> <dbl> <dbl>
#> 1 1 2014
#> 2 1 2015
#> 3 1 2016
#> 4 1 2017
#> 5 2 2014
#> 6 2 2015
#> 7 2 2016
#> 8 2 2017
#> 9 3 2014
#> 10 3 2015
#> 11 3 2016
#> 12 3 2017
#> 13 3 2014
#> 14 3 2015
#> 15 3 2016
#> 16 3 2017
packageVersion("tidyverse")
#> [1] '1.3.2'
Created on 2023-01-13 with reprex v2.0.2
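The reprex above was run with tidyverse 1.3.2. From dplyr 1.1.0 onward there is a dedicated cross_join() function, and by = character() is deprecated in its favour. A minimal sketch, assuming dplyr 1.1.0 or later is installed:

```r
library(dplyr)  # assumes dplyr >= 1.1.0, where cross_join() was added

df_a <- data.frame(department = c(1, 2, 3, 3))          # duplicate row kept on purpose
df_b <- data.frame(period = c(2014, 2015, 2016, 2017))

# Every row of df_a paired with every row of df_b; duplicates are preserved
res <- cross_join(df_a, df_b)
nrow(res)
```

As with merge and full_join, the duplicated department 3 is not collapsed, so the result has 4 x 4 = 16 rows.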

Subset by multiple conditions

Maybe it's something basic, but I couldn't find the answer.
I have
Id Year V1
1 2009 33
1 2010 67
1 2011 38
2 2009 45
3 2009 65
3 2010 74
4 2009 47
4 2010 51
4 2011 14
I need to select only the rows whose Id appears in all three years 2009, 2010 and 2011.
Id Year V1
1 2009 33
1 2010 67
1 2011 38
4 2009 47
4 2010 51
4 2011 14
I tried
d1_3 <- subset(d1, Year==2009 |Year==2010 |Year==2011 )
but it doesn't work.
Can anyone suggest how I can do this in R?
I think ave could be useful here. I call your original data frame 'df'. For each Id, check whether each of 2009-2011 is present in Year (2009:2011 %in% x). This gives a logical vector, which can be summed. Test whether the sum equals 3 (if all three years are present, the sum is 3). Because ave returns a vector of the same type as Year, the TRUE/FALSE result comes back as 1/0, so compare it with 1 to get the logical vector used to subset rows of the data frame.
df[ave(df$Year, df$Id, FUN = function(x) sum(2009:2011 %in% x) == 3) == 1, ]
# Id Year V1
# 1 1 2009 33
# 2 1 2010 67
# 3 1 2011 38
# 7 4 2009 47
# 8 4 2010 51
# 9 4 2011 14
Another way of using ave
DF
## Id Year V1
## 1 1 2009 33
## 2 1 2010 67
## 3 1 2011 38
## 4 2 2009 45
## 5 3 2009 65
## 6 3 2010 74
## 7 4 2009 47
## 8 4 2010 51
## 9 4 2011 14
DF[ave(DF$Year, DF$Id, FUN = function(x) all(2009:2011 %in% x)) == 1, ]
## Id Year V1
## 1 1 2009 33
## 2 1 2010 67
## 3 1 2011 38
## 7 4 2009 47
## 8 4 2010 51
## 9 4 2011 14
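The == 1 comparison matters because ave() keeps the type of its first argument, so a logical result from FUN is stored as a number. A small sketch with made-up vectors (yrs and ids are illustrative names):

```r
yrs <- c(2009, 2010, 2009)   # made-up years
ids <- c(1, 1, 2)            # made-up group ids

# FUN returns TRUE/FALSE per group, but the result is written back
# into a numeric vector, so it comes out as 1/0
flag <- ave(yrs, ids, FUN = function(v) all(2009:2010 %in% v))
flag
is.logical(flag)

# Comparing with 1 turns it back into a logical index
flag == 1
```

Using flag directly as a row index would select rows by position (row 1 repeated) rather than filter them.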
This should do the job :)
library(plyr)
ds<-ddply(ds,.(Id),mutate,Nobs=length(Year))
ds[ds$Nobs == 3 & ds$Year %in% 2009:2011,]
I think an approach using ave is reasonable. But there are lots of ways to solve this problem. I show a few other ways using base R. Then in the last 2 examples I'll introduce the package data.table.
Again, just throwing this out there to provide some options to use different aspects of the language.
d1 <- data.frame(ID=c(1,1,1,2,3,3,4,4,4), Year=c(2009,2010,2011, 2009,2009, 2010, 2009, 2010, 2011), V1=c(33, 67, 38, 45, 65, 74, 47, 51, 14))
# long way
use_years <- as.character(2009:2011)
cnts <- table(d1[,c("ID","Year")])[,use_years]
use_id <- rownames(cnts)[rowSums(cnts)==length(use_years)]
d1[d1[,"ID"]%in%use_id,]
# ID Year V1
# 1 1 2009 33
# 2 1 2010 67
# 3 1 2011 38
# 7 4 2009 47
# 8 4 2010 51
# 9 4 2011 14
# another longish way
ind1 <- d1[,"Year"]%in%2009:2011
d1_ind <- d1[ind1,"ID"]
ind2 <- d1_ind %in% unique(d1_ind)[tabulate(d1_ind)==3]
d1[ind1,][ind2,]
# ID Year V1
# 1 1 2009 33
# 2 1 2010 67
# 3 1 2011 38
# 7 4 2009 47
# 8 4 2010 51
# 9 4 2011 14
OK, let's try out a couple methods using data.table. One of my favorite packages of all time. Can be a little tricky at first though, so make sure your boots are on tight (Oh, yeah, it's fast!) :)
# medium way
library(data.table)
d2 <- as.data.table(d1)
d2[ID%in%d2[Year%in%2009:2011, list(logic=nrow(.SD)==3),by="ID"][(logic),ID]]
# ID Year V1
# 1: 1 2009 33
# 2: 1 2010 67
# 3: 1 2011 38
# 4: 4 2009 47
# 5: 4 2010 51
# 6: 4 2011 14
# short way
d2[Year%in%2009:2011][ID%in%unique(ID)[table(ID)==3]]
# ID Year V1
# 1: 1 2009 33
# 2: 1 2010 67
# 3: 1 2011 38
# 4: 4 2009 47
# 5: 4 2010 51
# 6: 4 2011 14
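For completeness, the same group-wise test can be written with dplyr. This is a sketch of one more option, assuming dplyr is installed. It reuses the d1 definition from above and keeps an ID only if all of 2009, 2010 and 2011 occur in its Year values.

```r
library(dplyr)

d1 <- data.frame(ID   = c(1, 1, 1, 2, 3, 3, 4, 4, 4),
                 Year = c(2009, 2010, 2011, 2009, 2009, 2010, 2009, 2010, 2011),
                 V1   = c(33, 67, 38, 45, 65, 74, 47, 51, 14))

res <- d1 %>%
  group_by(ID) %>%
  filter(all(2009:2011 %in% Year)) %>%  # condition is evaluated once per ID
  ungroup()
res
```

Unlike the count-based versions, this checks for the three years explicitly, so an ID with three rows from other years would not pass.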
