How to merge text of two different dataframes by date? - r

I have two different dataframes, each of which contains different texts by month. What I want to do is merge the texts that have the same date into one single dataframe.
Let me take an example to clarify. This is dataframe_A where the third column (Article) contains some text for each date:
Date Title Article
1 1 January 2000 PRESS CONFERENCE Article_topic_A_1
2 1 February 2000 PRESS CONFERENCE Article_topic_A_2
3 1 March 2000 PRESS CONFERENCE Article_topic_A_3
This is dataframe_B, which contains different text for the same dates:
Date Title Article
1 1 January 2000 PRESS CONFERENCE Article_topic_B_1
2 1 February 2000 PRESS CONFERENCE Article_topic_B_2
3 1 March 2000 PRESS CONFERENCE Article_topic_B_3
Now, I want to combine the text of Article_topic_A_1 with the text of Article_topic_B_1, text of Article_topic_A_2 with the text of Article_topic_B_2, and so on. For the same date (e.g.: 1 January 2000), I want to combine different articles (e.g.: Article_topic_A_1 and Article_topic_B_1). Basically, the final dataframe needs to look like this:
Date Title Article
1 1 January 2000 PRESS CONFERENCE Article1
2 1 February 2000 PRESS CONFERENCE Article2
3 1 March 2000 PRESS CONFERENCE Article3
The third column will contain the merged texts that have been grouped by "date".
I tried to use merge and subset but I did not manage to do it.
Can you help me with it?
Thanks a lot!

Here's a solution using merge, with the two texts for each date separated by a comma.
df_a <- data.frame(
Date = c("1 January 2000", "1 February 2000", "1 March 2000"),
Title = rep("PRESS CONFERENCE", 3),
Article = c("Article_topic_A_1", "Article_topic_A_2", "Article_topic_A_3")
)
df_b <- data.frame(
Date = c("1 January 2000", "1 February 2000", "1 March 2000"),
Title = rep("PRESS CONFERENCE", 3),
Article = c("Article_topic_B_1", "Article_topic_B_2", "Article_topic_B_3")
)
df <- merge(df_a, df_b, by = c("Date", "Title"))
df$Article <- paste(df$Article.x, df$Article.y, sep = ", ")
df <- df[, !(names(df) %in% c("Article.x", "Article.y"))]
df
#> Date Title Article
#> 1 1 February 2000 PRESS CONFERENCE Article_topic_A_2, Article_topic_B_2
#> 2 1 January 2000 PRESS CONFERENCE Article_topic_A_1, Article_topic_B_1
#> 3 1 March 2000 PRESS CONFERENCE Article_topic_A_3, Article_topic_B_3
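If there are more than two data frames, or several articles per date, a possible alternative is to stack the frames with rbind and collapse the text per date with aggregate. This is only a minimal base R sketch: df_a and df_b are repeated so it runs on its own, and the names stacked and combined are just illustrative.

```r
df_a <- data.frame(
  Date = c("1 January 2000", "1 February 2000", "1 March 2000"),
  Title = rep("PRESS CONFERENCE", 3),
  Article = c("Article_topic_A_1", "Article_topic_A_2", "Article_topic_A_3"),
  stringsAsFactors = FALSE
)
df_b <- data.frame(
  Date = c("1 January 2000", "1 February 2000", "1 March 2000"),
  Title = rep("PRESS CONFERENCE", 3),
  Article = c("Article_topic_B_1", "Article_topic_B_2", "Article_topic_B_3"),
  stringsAsFactors = FALSE
)

# Stack the rows, then paste together all articles sharing a Date and Title
stacked <- rbind(df_a, df_b)
combined <- aggregate(Article ~ Date + Title, data = stacked,
                      FUN = paste, collapse = ", ")

# aggregate() sorts the groups alphabetically; restore the order used in df_a
combined <- combined[match(df_a$Date, combined$Date), ]
rownames(combined) <- NULL
combined
```

Because the dates are plain strings here, the last step reorders by the original data frame rather than parsing the dates.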

Related

How can I use a vectorised function to multiply all values in a different data frame for a given ID in R?

I have a huge dataset with 750,000 IDs, for which I want to aggregate monthly values to yearly values by multiplying all values for a given ID. The ID consists of a combination of an identification number and a year.
The data I want to extract:
ID
monthly value
1 - 1997
Product of Monthly Values in Year 1997
1 - 1998
Product of Monthly Values in Year 1998
1 - 1999
Product of Monthly Values in Year 1999
...
...
2 - 1997
Product of Monthly Values in Year 1997
2 - 1998
Product of Monthly Values in Year 1998
2 - 1999
Product of Monthly Values in Year 1999
...
...
The dataset which is the source:
ID
monthly value
1 - 1997
Monthly Value 1 in Year 1997
1 - 1997
Monthly Value 2 in Year 1997
1 - 1997
Monthly Value 3 in Year 1997
...
...
2 - 1997
Monthly Value 1 in Year 1997
2 - 1997
Monthly Value 2 in Year 1997
2 - 1997
Monthly Value 3 in Year 1997
...
...
I have written a for loop, which takes about 0.74s for 10 IDs, which is way too slow. It would take about 15 hours for the whole data to run through. The for loop multiplies all monthly values for a given ID and stores the result in a separate data frame.
for (i in 1:nrow(yearlyreturns)){
yearlyreturns[i, "yret"] <- prod(monthlyreturns[monthlyreturns$ID == yearlyreturns[i,"ID"],"change"]) - 1
yearlyreturns[i, "monthcount"] <- length(monthlyreturns[monthlyreturns$ID == yearlyreturns[i,"ID"],"change"])
}
I don't know how to get from here to a vectorised function, which takes less time.
Is this possible to do in R?
Something like this:
library(dplyr)
library(stringr)  # provides str_replace()
df %>%
mutate(monthly_value = paste("Product of", str_replace(monthly_value, 'Value\\s\\d', 'Values'))) %>%
group_by(ID, monthly_value) %>%
summarise()
ID monthly_value
<chr> <chr>
1 1 - 1997 Product of Monthly Values in Year 1997
2 2 - 1997 Product of Monthly Values in Year 1997
data:
structure(list(ID = c("1 - 1997", "1 - 1997", "1 - 1997", "2 - 1997",
"2 - 1997", "2 - 1997"), monthly_value = c("Monthly Value 1 in Year 1997",
"Monthly Value 2 in Year 1997", "Monthly Value 3 in Year 1997",
"Monthly Value 1 in Year 1997", "Monthly Value 2 in Year 1997",
"Monthly Value 3 in Year 1997")), class = "data.frame", row.names = c(NA,
-6L))
Based on the for loop code, this may be done with a join
library(data.table)
setDT(yearlyreturns)[monthlyreturns, c("yret", "monthcount")
:= .(prod(change) -1, .N), on = .(ID), by = .EACHI]
In addition to the most excellent previous answers - here's a link to an earlier post comparing 10 common ways to calculate means by group. Data.table based solutions are definitely the way to go - especially for datasets with millions of rows. Unless you're writing to individual output files - I'm not sure why this would take hours rather than minutes.
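For comparison, the loop can also be replaced by a grouped aggregation in base R. This is a rough sketch, assuming monthlyreturns has the columns ID and change as in the loop; the sample values below are made up for illustration.

```r
# Hypothetical sample data with the same column names as in the question
monthlyreturns <- data.frame(
  ID = c("1 - 1997", "1 - 1997", "1 - 1997", "2 - 1997", "2 - 1997"),
  change = c(1.10, 0.90, 1.05, 1.02, 1.01),
  stringsAsFactors = FALSE
)

# One pass per statistic over the groups, instead of one subset per ID
yret <- tapply(monthlyreturns$change, monthlyreturns$ID, prod) - 1
monthcount <- tapply(monthlyreturns$change, monthlyreturns$ID, length)

yearlyreturns <- data.frame(
  ID = names(yret),
  yret = as.numeric(yret),
  monthcount = as.integer(monthcount),
  stringsAsFactors = FALSE
)
yearlyreturns
```

Each ID is visited once per statistic, rather than rescanning the whole monthly table for every row of yearlyreturns as the loop does.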

In R, how do you add a string or characters to only the first number of rows of the data frame

I have a date column as such:
id <- c(1, 2, 3)
date <- c("4 May 20", "5 June 20", "16 April 2021")
I want to add "20" to the end of the first 2 rows only and create a new column to make the dataframe look like this:
id date new_date
1 4 May 20 4 May 2020
2 5 June 20 5 June 2020
3 16 April 2021 16 April 2021
@akrun has answered the question you asked, but if what you're really doing is trying to parse dates, lubridate::dmy can handle your problem very easily:
library(lubridate)
data$new_date <- dmy(data$date)
data
id date new_date
1 1 4 May 20 2020-05-04
2 2 5 June 2020 2020-06-05
3 3 16 April 2021 2021-04-16
Data
data <- structure(list(id = 1:3, date = c("4 May 20", "5 June 2020",
"16 April 2021")), class = "data.frame", row.names = c(NA, -3L
))
We can use sub to match one or more spaces (\\s+) followed by exactly two digits (\\d{2}) at the end ($) of the string. The digits are captured as a group ((...)), and the replacement inserts 20 in front of the backreference (\\1) to that group
df1$date <- sub("\\s+(\\d{2})$", " 20\\1", df1$date)
If the OP wants to do this only on a predetermined subset of rows of the original data (here the first 10 rows of a larger data frame; use 1:2 for the three-row example)
df1$date[1:10] <- sub("\\s+(\\d{2})$", " 20\\1", df1$date[1:10])
-output
df1
id date
1 1 4 May 2020
2 2 5 June 2020
3 3 16 April 2021
data
df1 <- structure(list(id = c(1, 2, 3), date = c("4 May 20", "5 June 20",
"16 April 2021")), class = "data.frame", row.names = c(NA, -3L
))
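To get exactly the layout asked for in the question, i.e. a separate new_date column with only the first two rows changed, something along these lines should work. It is a sketch on the three-row example; n and rows are illustrative names, not part of the answers above.

```r
df1 <- data.frame(
  id = c(1, 2, 3),
  date = c("4 May 20", "5 June 20", "16 April 2021"),
  stringsAsFactors = FALSE
)

n <- 2                                  # number of leading rows to modify
rows <- seq_len(min(n, nrow(df1)))      # never index past the last row

# Start from a copy of date, then expand the two-digit year in the first n rows
df1$new_date <- df1$date
df1$new_date[rows] <- sub("\\s+(\\d{2})$", " 20\\1", df1$date[rows])
df1
```

Rows whose year already has four digits are left alone by the pattern anyway, so restricting to the first n rows only matters when later rows should keep a two-digit year.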

How to combine row with column headings?

I have a large dataset which, simplified, looks something like this:
Year
Name
January
February
March
April
May
Street
2000
Bob
$100
$197
$124
$100
ABC
2000
Abe
$100
$100
$117
$123
$100
ABC
2001
Bob
$100
$100
$197
$103
$150
DEF
2001
Abe
$140
$100
$127
$526
$123
ABC
2002
Abe
$100
$100
$198
$102
$101
DEF
2002
Bob
$102
$110
ABC
2003
Carly
$100
$100
$197
ABC
I am trying to combine this data so that each person has one line, with the goal of counting and graphing how many months they paid in a row.
I was thinking of trying to recode the data so that each person gets their own row, with a timeline of how much they paid by year and season, with column names like this, but I am having trouble figuring out how to do that.
Name
2000 January
2000 February
2000 March
2000 April
2000 May
2001 January
2001 February
2001 March
2001 April
2001 May
2002 January
2002 February
2002 March
2002 April
2002 May
Street
Is there a way to condense variables in this way somehow?
Thank you so much!
Using pivot_wider from {tidyr} will achieve this. Calling your dataframe yeardata, you can do the following:
library(tidyr)  # pivot_wider(), all_of(), and the %>% pipe
selectmonths <- c("January", "February", "March", "April", "May")
result <- yeardata %>%
  pivot_wider(names_from = "Year", values_from = all_of(selectmonths))
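Once each person's payments are laid out in time order, the "months paid in a row" part of the question could be handled with rle. The following is only a sketch in base R: longest_streak is a hypothetical helper, and it assumes a missing payment shows up as NA in a numeric vector.

```r
# Length of the longest run of TRUE values in a logical vector
longest_streak <- function(paid) {
  runs <- rle(paid)
  if (!any(runs$values)) {
    return(0L)                          # no paid month at all
  }
  max(runs$lengths[runs$values])
}

# Hypothetical payments for one person, in chronological order (NA = not paid)
payments <- c(100, 197, NA, 124, 100, 100, NA)
longest_streak(!is.na(payments))
```

Applied to each row of the wide result (after stripping the $ signs and converting to numeric), this would give one streak length per person.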

Merge rows by pattern in R

I am trying to merge rows by pattern.
The dataframe has only one column (string) and normally, it should follow a pattern of date, company_name and salary. However, some cases just don't have the salary.
Is there a way I can merge the rows by the pattern of the date? By doing so, I can later split them into columns. The reason I didn't want to use pivot_wider earlier is that the company name and salary are likely to get mismatched because the rows are unbalanced. So I think it's better to merge the rows by the date pattern, as the date is never missing and always follows a pattern.
dataset:
# A tibble: 10 x 1
detail
<chr>
1 26 January 2021
2 NatWest Group - Bristol, BS2 0PT
3 26 January 2021
4 NatWest Group - Manchester, M3 3AQ
5 15 February 2021
6 Brook Street - Liverpool, Merseyside, L21AB
7 £13.84 per hour
8 16 February 2021
9 Anglo Technical Recruitment - London, WC2N 5DU
10 £400.00 per day
dput for the dataset:
structure(list(detail = c("26 January 2021", "NatWest Group - Bristol, BS2 0PT",
"26 January 2021", "NatWest Group - Manchester, M3 3AQ", "15 February 2021",
"Brook Street - Liverpool, Merseyside, L21AB", "£13.84 per hour",
"16 February 2021", "Anglo Technical Recruitment - London, WC2N 5DU",
"£400.00 per day")), row.names = c(NA, -10L), class = c("tbl_df",
"tbl", "data.frame"))
Expected outcome:
detail
<chr>
1 26 January 2021 NatWest Group - Bristol, BS2 0PT
2 26 January 2021 NatWest Group - Manchester, M3 3AQ
3 15 February 2021 Brook Street - Liverpool, Merseyside, L21AB £13.84 per hour
4 16 February 2021 Anglo Technical Recruitment - London, WC2N 5DU £400.00 per day
dput for expected outcome:
df <- structure(list(detail = c("26 January 2021 NatWest Group - Bristol, BS2 0PT",
"26 January 2021 NatWest Group - Manchester, M3 3AQ", "15 February 2021 Brook Street - Liverpool, Merseyside, L21AB £13.84 per hour",
"16 February 2021 Anglo Technical Recruitment - London, WC2N 5DU £400.00 per day")), row.names = c(NA, -4L), class = c("tbl_df",
"tbl", "data.frame"))
Preface each line with a tag and then use read.dcf to create a 3 column character matrix mat. At the end we convert that to a character vector with one element per logical record but you may just want to use mat since that seems like a more useful format.
We assume that the dates have the %d %B %Y format (see ?strptime for the percent codes), that salary lines start with £ and other lines are Address lines.
library(dplyr)
mat <- dat %>%
mutate(detail = case_when(
!is.na(as.Date(detail, "%d %B %Y")) ~ paste("\nDate:", detail),
grepl("^£", detail) ~ paste("Salary:", detail),
TRUE ~ paste("Address:", detail))) %>%
{ read.dcf(textConnection(.$detail)) }
mat %>%
apply(1, toString) %>%
sub(", NA$", "", .)
Update
Simplified assumptions and code.
One more solution, assuming only that the first row contains a date. It'll work irrespective of the number of rows in between two dates.
library(tidyverse)
df %>% group_by(d = cumsum(str_detect(detail, "^(^\\d\\d? \\w+ \\d{4})$"))) %>%
mutate(c = paste0("Col", as.character(row_number()))) %>%
pivot_wider(id_cols = d, values_from = detail, names_from = c)
# A tibble: 4 x 4
# Groups: d [4]
d Col1 Col2 Col3
<int> <chr> <chr> <chr>
1 1 26 January 2021 NatWest Group - Bristol, BS2 0PT NA
2 2 26 January 2021 NatWest Group - Manchester, M3 3AQ NA
3 3 15 February 2021 Brook Street - Liverpool, Merseyside, L21AB £13.84 per hour
4 4 16 February 2021 Anglo Technical Recruitment - London, WC2N 5DU £400.00 per day
Here is a pure data.table approach
library( data.table )
#make it a data.table
setDT( df )
#first, summarise by block separated by days, collapse the text, using ## as separator
ans <- df[, .( paste0( detail, collapse = "##") ),
by = .(d = cumsum( ( grepl( "[0-9]{2} [a-zA-Z]+ [0-9]{4}", detail) ) ) ) ]
#split text again to cols, based on the ## introduced in the collapse. Number of cols is dynamic!
ans[, paste0( "Col", 1:length( tstrsplit(ans$V1, "##" ))) := tstrsplit( V1, "##" )][, V1 := NULL ][]
# d Col1 Col2 Col3
# 1: 1 26 January 2021 NatWest Group - Bristol, BS2 0PT <NA>
# 2: 2 26 January 2021 NatWest Group - Manchester, M3 3AQ <NA>
# 3: 3 15 February 2021 Brook Street - Liverpool, Merseyside, L21AB £13.84 per hour
# 4: 4 16 February 2021 Anglo Technical Recruitment - London, WC2N 5DU £400.00 per day
Here is a data.table approach which uses dcast() and rowid() to reshape to wide format. It returns a data.table with four columns: a record number (rn) followed by Col1, Col2, and Col3, which hold the date, company name, and salary.
library(data.table)
setDT(df1)[, rn := cumsum(!is.na(lubridate::dmy(detail)))]
dcast(df1, rn ~ rowid(rn, prefix = "Col"), value.var = "detail")
rn Col1 Col2 Col3
1: 1 26 January 2021 NatWest Group - Bristol, BS2 0PT <NA>
2: 2 26 January 2021 NatWest Group - Manchester, M3 3AQ <NA>
3: 3 15 February 2021 Brook Street - Liverpool, Merseyside, L21AB £13.84 per hour
4: 4 16 February 2021 Anglo Technical Recruitment - London, WC2N 5DU £400.00 per day
For detecting rows which start a new record, i.e., rows with a date, this approach borrows from Anil's answer as well as from G.Grothendieck's.
dcast() allows packing everything into a "one-liner" (if the library() calls are not counted):
library(data.table)
library(lubridate)
dcast(setDT(df1), cumsum(!is.na(dmy(detail))) ~ rowid(cumsum(!is.na(dmy(detail))), prefix = "Col"),
value.var = "detail")
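If the goal is literally the single-column outcome shown in the question (one merged string per record), the same "a date starts a new record" idea can be sketched in base R. The date pattern below is an assumption (day, month name, four-digit year on a line of its own), and is_date, record and merged are illustrative names.

```r
detail <- c("26 January 2021", "NatWest Group - Bristol, BS2 0PT",
            "26 January 2021", "NatWest Group - Manchester, M3 3AQ",
            "15 February 2021", "Brook Street - Liverpool, Merseyside, L21AB",
            "£13.84 per hour",
            "16 February 2021", "Anglo Technical Recruitment - London, WC2N 5DU",
            "£400.00 per day")

# A line consisting only of a date marks the start of a new record
is_date <- grepl("^\\d{1,2} [A-Za-z]+ \\d{4}$", detail)
record <- cumsum(is_date)

# Paste together all lines belonging to the same record
merged <- as.vector(tapply(detail, record, paste, collapse = " "))
data.frame(detail = merged, stringsAsFactors = FALSE)
```

Records without a salary line simply end up shorter, so nothing gets shifted between records.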

extract specific digits from column of numbers in R

Apologies if this is a repeat question, I searched and could not find the specific answer I am looking for.
I have a data frame where one column is a 16-digit code, and there are a number of other columns. Here is a simplified example:
code = c("1109619910224003", "1157919910102001", "1539820070315001", "1563120190907002")
year = c(1991, 1991, 2007, 2019)
month = c(02, 01, 03, 09)
dat = as.data.frame(cbind(code,year,month))
dat
> dat
code year month
1 1109619910224003 1991 2
2 1157919910102001 1991 1
3 1539820070315001 2007 3
4 1563120190907002 2019 9
As you can see, the code contains year, month, and day information. I already have columns for year and month in my dataframe, but I need to also create a day column, which would be 24, 02, 15, and 07 in this example. The date is always in the format yyyymmdd and begins as the 6th digit in the code. So I essentially need to extract the 12th and 13th digits from each code to create my day column.
I then need to create another column for day of year from the date information, so I end up with the following:
day = c(24, 02, 15, 07)
dayofyear = c(55, 2, 74, 250)
dat2 = as.data.frame(cbind(code,year,month,day,dayofyear))
dat2
> dat2
code year month day dayofyear
1 1109619910224003 1991 2 24 55
2 1157919910102001 1991 1 2 2
3 1539820070315001 2007 3 15 74
4 1563120190907002 2019 9 7 250
Any suggestions? Thanks!
You can leverage the Date data type in R to accomplish all of these tasks. First we will parse out the date portion of the code (characters 6 to 13), and convert them to Date format using readr::parse_date(). Once the date is converted, we can simply access all of the values you want rather than calculating them ourselves.
library(tidyverse)
out <- dat %>%
mutate(
date=readr::parse_date(substr(code, 6, 13), format="%Y%m%d"),
day=format(date, "%d"),
month=format(date, "%m"),
year=format(date, "%Y"),
day.of.year=format(date, "%j")
)
(I'm using tidyverse syntax here because I find it quicker for these types of problems)
Once we create these columns, we can look at the updated data.frame out:
code year month date day day.of.year
1 1109619910224003 1991 02 1991-02-24 24 055
2 1157919910102001 1991 01 1991-01-02 02 002
3 1539820070315001 2007 03 2007-03-15 15 074
4 1563120190907002 2019 09 2019-09-07 07 250
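Edit: note that all the new columns are character. We can tell this without using str() because of the leading zeros in the new columns. To convert them to integers, we can do something like out <- out %>% mutate(across(c(day, month, year, day.of.year), as.integer)), or just append that mutate call to the end of our existing pipeline. (Applying as.integer to every column with mutate_all would also turn the 16-digit code into NA, since it is too large for an integer.)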
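The same idea also works without the tidyverse. A minimal base R sketch, assuming as in the question that the date occupies characters 6 to 13 of the code in yyyymmdd form:

```r
code <- c("1109619910224003", "1157919910102001",
          "1539820070315001", "1563120190907002")

# Characters 6-13 hold the date as yyyymmdd
date <- as.Date(substr(code, 6, 13), format = "%Y%m%d")

dat2 <- data.frame(
  code = code,
  day = as.integer(format(date, "%d")),        # day of month
  dayofyear = as.integer(format(date, "%j")),  # day of year, 1-366
  stringsAsFactors = FALSE
)
dat2
```

Converting with as.integer right away drops the leading zeros, so the columns match the numeric values listed in the question.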
