Split column into multiple columns using str_match - r

I want to split column into multiple columns by matching the patterns
test <- data.frame("id" = c("Albertson's Inc.","Albertson's Inc."), "V3" = c("Reiterates FY 2004, Significant Developments, 2 June 2004, 53 words, (English)(Document MULTI00020050122e06201fkk)","EBITDA Hits Four Year Low, Stock Diagnostics, 16:00 GMT, 9 June 2004, 245 words, (English)(Document STODIA0020040609e0690006g)"), stringsAsFactors = F)
So far the code I'm using to get desired result is like
library(stringr)
df <- as.data.frame(str_match(test$V3, "^(.*)GMT,(.*),(.*)words,(.*)Document (.*)$")[,-1], stringsAsFactors = F)
I'm having two issues with above code
First it does not show results when GMT is missing secondly I want "id" column in the output df as well, any suggestion or different approach should I use for results please share thanks to all the moderators programmers for such a helpful forum.

not 100% sure about your "GTM" problem. here is my try:
your rep data:
test <- data.frame("id" = c("Albertson's Inc.","Albertson's Inc."), "V3" = c("Reiterates FY 2004, Significant Developments, 2 June 2004, 53 words, (English)(Document MULTI00020050122e06201fkk)","EBITDA Hits Four Year Low, Stock Diagnostics, 16:00 GMT, 9 June 2004, 245 words, (English)(Document STODIA0020040609e0690006g)"), stringsAsFactors = F)
code:
library(tidyverse)
test$V3 %>% map(~str_split(.,",(?!\\s*\\d{1,2}:\\d{1,2})|(?<=\\))(?=\\()") %>% unlist %>% trimws) %>%
do.call(rbind,.) %>%
cbind(test["id"],.)
result:
# id 1 2 3 4 5 6
# 1 Albertson's Inc. Reiterates FY 2004 Significant Developments 2 June 2004 53 words (English) (Document MULTI00020050122e06201fkk)
# 2 Albertson's Inc. EBITDA Hits Four Year Low Stock Diagnostics, 16:00 GMT 9 June 2004 245 words (English) (Document STODIA0020040609e0690006g)

Related

extract specific digits from column of numbers in R

Apologies if this is a repeat question, I searched and could not find the specific answer I am looking for.
I have a data frame where one column is a 16-digit code, and there are a number of other columns. Here is a simplified example:
code = c("1109619910224003", "1157919910102001", "1539820070315001", "1563120190907002")
year = c(1991, 1991, 2007, 2019)
month = c(02, 01, 03, 09)
dat = as.data.frame(cbind(code,year,month))
dat
> dat
code year month
1 1109619910224003 1991 2
2 1157919910102001 1991 1
3 1539820070315001 2007 3
4 1563120190907002 2019 9
As you can see, the code contains year, month, and day information. I already have columns for year and month in my dataframe, but I need to also create a day column, which would be 24, 02, 15, and 07 in this example. The date is always in the format yyyymmdd and begins as the 6th digit in the code. So I essentially need to extract the 12th and 13th digits from each code to create my day column.
I then need to create another column for day of year from the date information, so I end up with the following:
day = c(24, 02, 15, 07)
dayofyear = c(55, 2, 74, 250)
dat2 = as.data.frame(cbind(code,year,month,day,dayofyear))
dat2
> dat2
code year month day dayofyear
1 1109619910224003 1991 2 24 55
2 1157919910102001 1991 1 2 2
3 1539820070315001 2007 3 15 74
4 1563120190907002 2019 9 7 250
Any suggestions? Thanks!
You can leverage the Date data type in R to accomplish all of these tasks. First we will parse out the date portion of the code (characters 6 to 13), and convert them to Date format using readr::parse_date(). Once the date is converted, we can simply access all of the values you want rather than calculating them ourselves.
library(tidyverse)
out <- dat %>%
mutate(
date=readr::parse_date(substr(code, 6, 13), format="%Y%m%d"),
day=format(date, "%d"),
month=format(date, "%m"),
year=format(date, "%Y"),
day.of.year=format(date, "%j")
)
(I'm using tidyverse syntax here because I find it quicker for these types of problems)
Once we create these columns, we can look at the updated data.frame out:
code year month date day day.of.year
1 1109619910224003 1991 02 1991-02-24 24 055
2 1157919910102001 1991 01 1991-01-02 02 002
3 1539820070315001 2007 03 2007-03-15 15 074
4 1563120190907002 2019 09 2019-09-07 07 250
Edit: note that the output for all the new columns is character. We can tell this without using str() because of the leading zeros in the new columns. To get rid of this, we can do something like out <- out %>% mutate_all(as.integer), or just append the mutate_all call to the end of our existing pipeline.

ID variable changes over time, how to calculate pct change by ID

The problem starts from the difficulty of explaining it.
I have a data set that has a time dimension, my ID variables change name over time making it difficult to calculate e.g. percentage changes over time by ID variable.
ID YR Value
01 2004 100
02 2005 50
03 2005 50
04 2005 10
I need to calculate pct. Change in Value over time by ID. The problem is in Yr 2005 the ID variable 01 is split into three IDs (02,03,04), such that one has to aggregate the values for the three IDs in 2005 to get the corresponding value for ID 01 in 2005. The percent change of ID 01 is NOT 50/100, rather sum(50,50,10)/100.
I have data.frame of IDs only matching the changes over time, it looks like this:
x2004 x2005
01 01
01 02
01 03
I used group_by from dplyr to create matching between IDs in the two years
group_by(x2004) %>%
summarize(onetomany = paste(sort(unique(x2005)),collapse=", "))
Which gave me a data.frame of the form
cv2004 onetomany
1 1 1, 2, 3
Where I can see which IDs belong to the same group, and that is where I stopped the percentage calculation.
I totally understand that the problem it self is not easy to understand. This is a common problem in trade statistics, commodity codes change name over time but not content, and one has to keep track of the changes to get the picture of developments in trade over time by commodity. Any suggestion is appreciated.
df <- data.frame("ID" = c("01", "02", "03", "04"),
"YR" = c(2004, 2005, 2005, 2005),
"Value" = c(100, 50, 50, 10))
df %>% group_by(YR) %>% summarise(sum = sum(Value))
# A tibble: 2 x 2
YR sum
<dbl> <dbl>
1 2004 100
2 2005 110

How to find out how many trading days in each month in R?

I have a dataframe like this. The time span is 10 years. Because it's Chinese market data, and China has Lunar Holidays. So each year have different holiday times in terms of the western calendar.
When it is a holiday, the stock market does not open, so it is a non-trading day. Weekends are non-trading days too.
I want to find out which month of which year has the least number of trading days, and most importantly, what number is that.
There are not repeated days.
date change open high low close volume
1 1995-01-03 -1.233 637.72 647.71 630.53 639.88 234518
2 1995-01-04 2.177 641.90 655.51 638.86 653.81 422220
3 1995-01-05 -1.058 656.20 657.45 645.81 646.89 430123
4 1995-01-06 -0.948 642.75 643.89 636.33 640.76 487482
5 1995-01-09 -2.308 637.52 637.55 625.04 625.97 509851
6 1995-01-10 -2.503 616.16 617.60 607.06 610.30 606925
If there are not repeated days, you can count days per month and year by:
library(data.table) "maxx"))), .Names = c("X2005", "X2006", "X2007", "X2008"))
library(lubridate)
dt <- as.data.table(dt)
dt_days <- dt[, .(count_day=.N), by=.(year(date), month(date))]
Then you only need to do this to get the min:
dt_days[count_day==min(count_day)]
The chron and bizdays packages deal with business days but neither actually contains a usable calendar of holidays limiting their usefulness.
We will use chron below assuming you have defined the .Holidays vector of dates that are holidays. (If you run the code below without doing that only weekdays will be regarded as business days as the default .Holidays vector supplied by chron has very few dates in it.) DF has 120 rows (one row for each year/month) and the last line subsets that to just the month in each year having least business days.
library(chron)
library(zoo)
st <- as.yearmon("2001-01")
en <- as.yearmon("2010-12")
ym <- seq(st, en, 1/12) # sequence of year/months of interest
# no of business days in each yearmonth
busdays <- sapply(ym, function(x) {
s <- seq(as.Date(x), as.Date(x, frac = 1), "day")
sum(!is.weekend(s) & !is.holiday(s))
})
# data frame with one row per year/month
yr <- as.integer(ym)
DF <- data.frame(year = yr, month = cycle(ym), yearmon = ym, busdays)
# data frame with one row per year
wx.min <- ave(busdays, yr, FUN = function(x) which.min(x) == seq_along(x))
DF[wx.min == 1, ]
giving:
year month yearmon busdays
2 2001 2 Feb 2001 20
14 2002 2 Feb 2002 20
26 2003 2 Feb 2003 20
38 2004 2 Feb 2004 20
50 2005 2 Feb 2005 20
62 2006 2 Feb 2006 20
74 2007 2 Feb 2007 20
95 2008 11 Nov 2008 20
98 2009 2 Feb 2009 20
110 2010 2 Feb 2010 20

Calculate difference between values using different column and with gaps using R

Can anyone help me figure out how to calculate the difference in values based on my monthly data? For example I would like to calculate the difference in groundwater values between Jan-Jul, Feb-Aug, Mar-Sept etc, for each well by year. Note in some years there will be some months missing. Any tidyverse solutions would be appreciated.
Well year month value
<dbl> <dbl> <fct> <dbl>
1 222 1995 February 8.53
2 222 1995 March 8.69
3 222 1995 April 8.92
4 222 1995 May 9.59
5 222 1995 June 9.59
6 222 1995 July 9.70
7 222 1995 August 9.66
8 222 1995 September 9.46
9 222 1995 October 9.49
10 222 1995 November 9.31
# ... with 18,400 more rows
df1 <- subset(df, month %in% c("February", "August"))
test <- df1 %>%
dcast(site + year + Well ~ month, value.var = "value") %>%
mutate(Diff = February - August)
Thanks,
Simon
So I attempted to manufacture a data set and use dplyr to create a solution. It is best practice to include a method of generating a sample data set, so please do so in future questions.
# load required library
library(dplyr)
# generate data set of all site, well, and month combinations
## define valid values
sites = letters[1:3]
wells = 1:5
months = month.name
## perform a series of merges
full_sites_wells_months_set <-
merge(sites, wells) %>%
dplyr::rename(sites = x, wells = y) %>% # this line and the prior could be replaced on your system with initial_tibble %>% dplyr::select(sites, wells) %>% unique()
merge(months) %>%
dplyr::rename(months = y) %>%
dplyr::arrange(sites, wells)
# create sample initial_tibble
## define fraction of records to simulate missing months
data_availability <- 0.8
initial_tibble <-
full_sites_wells_months_set %>%
dplyr::sample_frac(data_availability) %>%
dplyr::mutate(values = runif(nrow(full_sites_wells_months_set)*data_availability)) # generate random groundwater values
# generate final result by joining full expected set of sites, wells, and months to actual data, then group by sites and wells and perform lag subtraction
final_tibble <-
full_sites_wells_months_set %>%
dplyr::left_join(initial_tibble) %>%
dplyr::group_by(sites, wells) %>%
dplyr::mutate(trailing_difference_6_months = values - dplyr::lag(values, 6L))

Grouping and conditions without loop (big data)

I have several observations of the same groups, and for each observation I have a year.
dat = data.frame(group = rep(c("a","b","c"),each = 3), year = c(2000, 1996, 1975, 2002, 2010, 1980, 1990,1986,1995))
group year
1 a 2000
2 a 1996
3 a 1975
4 b 2002
5 b 2010
6 b 1980
7 c 1990
8 c 1986
9 c 1995
For each observation, i would like to know if another observation of the same group can be found with given conditions relative to the focal observation. e.g. : "Is there any other observation (than the focal one) that has been done during the last 6 years (starting from the focal year) in the same group".
Ideally the dataframe should be like that
group year six_years
1 a 2000 1 # there is another member of group a that is year = 1996 (2000-6 = 1994, this value is inside the threshold)
2 a 1996 0
3 a 1975 0
4 b 2002 0
5 b 2010 0
6 b 1980 0
7 c 1990 1
8 c 1986 0
9 c 1995 1
Basically for each row we should look into the subset of groups, and see if any(dat$year == conditions). It is very easy to do with a for loop, but it's of no use here : the dataframe is massive (several millions of row) and a loop would take forever.
I am searching for an efficient way with vectorized functions or a fast package.
Thanks !
EDITED
Actually thinking about it you will probably have a lot of recurring year/group combinations, in which case much quicker to pre-calculate the frequencies using count() - which is also a plyr function:
90M rows took ~4sec
require(plyr)
dat <- data.frame(group = sample(c("a","b","c"),size=9000000,replace=TRUE),
year = sample(c(2000, 1996, 1975, 2002, 2010, 1980, 1990,1986,1995),size=9000000,replace=TRUE))
test<-function(y,g,df){
d<-df[df$year>=y-6 &
df$year<y &
df$group== g,]
return(nrow(d))
}
rollup<-function(){
summ<-count(dat) # add a frequency to each combination
return(ddply(summ,.(group,year),transform,t=test(as.numeric(year),group,summ)*freq))
}
system.time(rollup())
user system elapsed
3.44 0.42 3.90
My dataset had too many different groups, and the plyr option proposed by Troy was too slow.
I found a hack (experts would probably say "an ugly one") with package data.table : the idea is to merge the data.table with itself quickly with the fast merge function. It gives every possible combination between a given year of a group and all others years from the same group.
Then proceed with an ifelse for every row with the condition you're looking for.
Finally, aggregate everything with a sum function to know how many times every given years can be found in a given timespan relative to another year.
On my computer, it took few milliseconds, instead of the probable hours that plyr was going to take
dat = data.table(group = rep(c("a","b","c"),each = 3), year = c(2000, 1996, 1975, 2002, 2010, 1980, 1990,1986,1995), key = "group")
Produces this :
group year
1 a 2000
2 a 1996
3 a 1975
4 b 2002
5 b 2010
6 b 1980
7 c 1990
8 c 1986
9 c 1995
Then :
z = merge(dat, dat, by = "group", all = T, allow.cartesian = T) # super fast
z$sixyears = ifelse(z$year.y >= z$year.x - 6 & z$year.y < z$year.x, 1, 0) # creates a 0/1 column for our condition
z$sixyears = as.numeric(z$sixyears) # we want to sum this up after
z$year.y = NULL # useless column now
z2 = z[ , list(sixyears = sum(sixyears)), by = list(group, year.x)]
(Years with another year of the same group in the last six years are given a "1" :
group year x
1 a 1975 0
2 b 1980 0
3 c 1986 0
4 c 1990 1 # e.g. here there is another "c" which was in the timespan 1990 -6 ..
5 c 1995 1 # <== this one. This one too has another reference in the last 6 years, two rows above.
6 a 1996 0
7 a 2000 1
8 b 2002 0
9 b 2010 0
Icing on the cake : it deals with NA seamlessly.
Here's another possibility also using data.table but including diff().
dat <- data.table(group = rep(c("a","b","c"), each = 3),
year = c(2000, 1996, 1975, 2002, 2010, 1980, 1990,1986,1995),
key = "group")
valid_case <- subset(dt[,list(valid_case = diff(year)), by=key(dt)],
abs(valid_case)<6)
dat$valid_case <- ifelse(dat$group %in% valid_case$group, 1, 0)
I am not sure how this compares in terms of speed or NA handling (I think it should be fine with NAs since they propagate in diff() and abs()), but I certainly find it more readable. Joins are really fast in data.table, but I'd have to think avoiding that all together helps. There's probably a more idiomatic way to do the condition in the ifelse statement using data.table joins. That could potentially speed things up, although my experience has never found %in% to be the limiting factor.

Resources