I am trying to merge rows by pattern.
The dataframe has only one column (string). Normally the rows follow a pattern of date, company_name and salary, but some cases are missing the salary.
Is there a way I can merge the rows by the date pattern? That way I can later split them into columns. The reason I didn't want to do pivot_wider straight away is that the company name and salary are likely to get mismatched, since the rows are unbalanced. So I think it's better to merge the rows by the date pattern, as the date is never missing and always follows the same format.
dataset:
# A tibble: 10 x 1
detail
<chr>
1 26 January 2021
2 NatWest Group - Bristol, BS2 0PT
3 26 January 2021
4 NatWest Group - Manchester, M3 3AQ
5 15 February 2021
6 Brook Street - Liverpool, Merseyside, L21AB
7 £13.84 per hour
8 16 February 2021
9 Anglo Technical Recruitment - London, WC2N 5DU
10 £400.00 per day
dput for the dataset:
structure(list(detail = c("26 January 2021", "NatWest Group - Bristol, BS2 0PT",
"26 January 2021", "NatWest Group - Manchester, M3 3AQ", "15 February 2021",
"Brook Street - Liverpool, Merseyside, L21AB", "£13.84 per hour",
"16 February 2021", "Anglo Technical Recruitment - London, WC2N 5DU",
"£400.00 per day")), row.names = c(NA, -10L), class = c("tbl_df",
"tbl", "data.frame"))
Expected outcome:
detail
<chr>
1 26 January 2021 NatWest Group - Bristol, BS2 0PT
2 26 January 2021 NatWest Group - Manchester, M3 3AQ
3 15 February 2021 Brook Street - Liverpool, Merseyside, L21AB £13.84 per hour
4 16 February 2021 Anglo Technical Recruitment - London, WC2N 5DU £400.00 per day
dput for expected outcome:
df <- structure(list(detail = c("26 January 2021 NatWest Group - Bristol, BS2 0PT",
"26 January 2021 NatWest Group - Manchester, M3 3AQ", "15 February 2021 Brook Street - Liverpool, Merseyside, L21AB £13.84 per hour",
"16 February 2021 Anglo Technical Recruitment - London, WC2N 5DU £400.00 per day")), row.names = c(NA, -4L), class = c("tbl_df",
"tbl", "data.frame"))
Preface each line with a tag and then use read.dcf to create a 3-column character matrix, mat. At the end we convert that to a character vector with one element per logical record, but you may just want to use mat, since that seems like a more useful format.
We assume that the dates have the %d %B %Y format (see ?strptime for the percent codes), that salary lines start with £, and that all other lines are address lines.
library(dplyr)
# dat is the one-column tibble from the question
mat <- dat %>%
  mutate(detail = case_when(
    # a parseable date starts a new record; the leading \n produces the
    # blank line that read.dcf uses as a record separator
    !is.na(as.Date(detail, "%d %B %Y")) ~ paste("\nDate:", detail),
    grepl("^£", detail) ~ paste("Salary:", detail),
    TRUE ~ paste("Address:", detail))) %>%
  { read.dcf(textConnection(.$detail)) }
# one string per logical record, dropping the trailing NA when salary is absent
mat %>%
  apply(1, toString) %>%
  sub(", NA$", "", .)
Update
Simplified assumptions and code.
One more solution, assuming only that the first row contains a date. It will work irrespective of the number of rows between two dates.
library(tidyverse)
df %>%
  group_by(d = cumsum(str_detect(detail, "^\\d\\d? \\w+ \\d{4}$"))) %>%
  mutate(c = paste0("Col", row_number())) %>%
  pivot_wider(id_cols = d, values_from = detail, names_from = c)
# A tibble: 4 x 4
# Groups: d [4]
d Col1 Col2 Col3
<int> <chr> <chr> <chr>
1 1 26 January 2021 NatWest Group - Bristol, BS2 0PT NA
2 2 26 January 2021 NatWest Group - Manchester, M3 3AQ NA
3 3 15 February 2021 Brook Street - Liverpool, Merseyside, L21AB £13.84 per hour
4 4 16 February 2021 Anglo Technical Recruitment - London, WC2N 5DU £400.00 per day
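If the goal is literally the single merged column from the expected outcome rather than a wide table, a hedged variant of the same cumsum() trick can collapse each block instead (assuming df is the original one-column tibble, as above):
library(tidyverse)
df %>%
  # each group starts at a row matching the date pattern
  group_by(d = cumsum(str_detect(detail, "^\\d\\d? \\w+ \\d{4}$"))) %>%
  summarise(detail = paste(detail, collapse = " ")) %>%
  select(-d)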
Here is a pure data.table approach
library( data.table )
#make it a data.table
setDT( df )
#first, summarise by block (each block starts at a date row), collapsing the text with ## as separator
ans <- df[, .( paste0( detail, collapse = "##") ),
by = .(d = cumsum( ( grepl( "[0-9]{2} [a-zA-Z]+ [0-9]{4}", detail) ) ) ) ]
#split the text again into columns, based on the ## introduced in the collapse. The number of columns is dynamic!
ans[, paste0( "Col", 1:length( tstrsplit(ans$V1, "##" ))) := tstrsplit( V1, "##" )][, V1 := NULL ][]
# d Col1 Col2 Col3
# 1: 1 26 January 2021 NatWest Group - Bristol, BS2 0PT <NA>
# 2: 2 26 January 2021 NatWest Group - Manchester, M3 3AQ <NA>
# 3: 3 15 February 2021 Brook Street - Liverpool, Merseyside, L21AB £13.84 per hour
# 4: 4 16 February 2021 Anglo Technical Recruitment - London, WC2N 5DU £400.00 per day
Here is a data.table approach which uses dcast() and rowid() to reshape to wide format. It returns a data.table with four columns: a record number plus the date, company name, and salary (as Col1 to Col3; a renaming sketch follows at the end of this answer).
library(data.table)
setDT(df1)[, rn := cumsum(!is.na(lubridate::dmy(detail)))]
dcast(df1, rn ~ rowid(rn, prefix = "Col"), value.var = "detail")
rn Col1 Col2 Col3
1: 1 26 January 2021 NatWest Group - Bristol, BS2 0PT <NA>
2: 2 26 January 2021 NatWest Group - Manchester, M3 3AQ <NA>
3: 3 15 February 2021 Brook Street - Liverpool, Merseyside, L21AB £13.84 per hour
4: 4 16 February 2021 Anglo Technical Recruitment - London, WC2N 5DU £400.00 per day
For detecting rows which start a new record, i.e., rows with a date, this approach borrows from Anil's answer as well as from G.Grothendieck's.
dcast() allows packing everything into a "one-liner" (not counting the library() calls):
library(data.table)
library(lubridate)
dcast(setDT(df1), cumsum(!is.na(dmy(detail))) ~ rowid(cumsum(!is.na(dmy(detail))), prefix = "Col"),
value.var = "detail")
Related
Suppose I have the following DataFrame
ID Result Date
1 Pos 4th Jan, 2020
1 Pos 20th Jan, 2020
1 Neg 21st Jan, 2020
2 Pos 5th Jan, 2020
2 Neg 7th Jan, 2020
I want to record the delta (in days) by ID when the result changes from positive to negative ONLY.
So I would like the answer for this test case to be:
ID Result Date Delta Time_Spent_Pos
1 Pos 4th Jan, 2020 0 17
1 Pos 20th Jan, 2020 16 17
1 Neg 21st Jan, 2020 17 17
2 Pos 5th Jan, 2020 0 2
2 Neg 7th Jan, 2020 2 2
Where I plan to use the time_spent_pos column for further analysis.
Further Testing Case
I also would like to point out the data could look like
ID Result Date
1 Neg 12th Dec, 2019
1 Pos 4th Jan, 2020
1 Pos 20th Jan, 2020
1 Neg 21st Jan, 2020
2 Neg 2nd Jan, 2020
2 Pos 5th Jan, 2020
2 Neg 7th Jan, 2020
In that case I would still like the old output. So it is important to find the first time an ID was positive (and record that permanently), then find the first time it changed to negative, and push the delta to a column.
Any tips + help is appreciated.
You can write a function to do this calculation: get the first date where Result = 'Pos' and subtract it from the first 'Neg' date that follows.
get_delta <- function(res, date) {
  # first date with a 'Pos' result
  d1 <- date[match('Pos', res)]
  # days from d1 to the first 'Neg' date after it
  as.integer(min(date[res == 'Neg' & date > d1]) - d1)
}
library(dplyr)
df %>%
  mutate(Date = lubridate::dmy(Date)) %>%
  group_by(ID) %>%
  mutate(Time_Spent_Pos = get_delta(Result, Date)) %>%
  ungroup()
# ID Result Date Time_Spent_Pos
# <int> <chr> <date> <int>
#1 1 Pos 2020-01-04 17
#2 1 Pos 2020-01-20 17
#3 1 Neg 2020-01-21 17
#4 2 Pos 2020-01-05 2
#5 2 Neg 2020-01-07 2
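The expected output also contains a Delta column, which the code above omits. In the first test case, Delta reads as days since the first 'Pos' date within each ID, so a hedged sketch to add it alongside Time_Spent_Pos:
df %>%
  mutate(Date = lubridate::dmy(Date)) %>%
  group_by(ID) %>%
  # days elapsed since the first 'Pos' date in this ID
  mutate(Delta = as.integer(Date - Date[match('Pos', Result)]),
         Time_Spent_Pos = get_delta(Result, Date)) %>%
  ungroup()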
A simple idea is to create two separate columns: one holding the date whenever the result is positive, the other whenever it is negative, then group and take the minimum/maximum value of each of these columns.
Here is how you can do it:
# Reading required libraries
library(dplyr)
library(lubridate)
# Create a sample dataframe
df <-
  data.frame(ID = c(1, 1, 1, 1, 2, 2, 2),
             Result = c("Neg", "Pos", "Pos", "Neg", "Neg", "Pos", "Neg"),
             Date = c("12th Dec, 2019", "4th Jan, 2020", "20th Jan, 2020",
                      "21st Jan, 2020", "2nd Jan, 2020", "5th Jan, 2020",
                      "7th Jan, 2020"))
df %>%
  # Convert date into yyyy-mm-dd to manipulate it easily
  mutate(Date = dmy(Date),
         # If positive/negative, create a column holding that date
         POSITIVE = as.Date(ifelse(Result == "Pos", Date, NA), origin = lubridate::origin),
         NEGATIVE = as.Date(ifelse(Result == "Neg", Date, NA), origin = lubridate::origin)) %>%
  # Grouping by ID
  group_by(ID) %>%
  # Getting first positive and last negative
  mutate(POSITIVE = min(POSITIVE, na.rm = TRUE),
         NEGATIVE = max(NEGATIVE, na.rm = TRUE)) %>%
  ungroup() %>%
  # Calculating the difference between positive and negative
  mutate(Time_Spent_Pos = NEGATIVE - POSITIVE)
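With the further-testing data above this yields a Time_Spent_Pos of 17 days for ID 1 and 2 days for ID 2, matching the expected output. Note the design choice: it uses the overall first positive and last negative date per ID, which agrees with the examples shown but would differ from the get_delta() approach if an ID flipped between positive and negative more than once.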
I have a date column as such:
id <- c(1, 2, 3)
date <- c("4 May 20", "5 June 20", "16 April 2021")
I want to add "20" to the end of the first 2 rows only and create a new column to make the dataframe look like this:
id date new_date
1 4 May 20 4 May 2020
2 5 June 20 5 June 2020
3 16 April 2021 16 April 2021
#akrun has answered the question you asked, but if what you're really doing is trying to parse dates, lubridate::dmy can handle your problem very easily:
library(lubridate)
data$new_date <- dmy(data$date)
data
id date new_date
1 1 4 May 20 2020-05-04
2 2 5 June 2020 2020-06-05
3 3 16 April 2021 2021-04-16
Data
data <- structure(list(id = 1:3, date = c("4 May 20", "5 June 2020",
"16 April 2021")), class = "data.frame", row.names = c(NA, -3L
))
We can use sub to match the space (\\s+) followed by 2 digits (\\d{2}) at the end ($) of the string, captured as a group ((...)). In the replacement, insert 20 followed by the backreference (\\1) to the captured group:
df1$date <- sub("\\s+(\\d{2})$", " 20\\1", df1$date)
If the OP wanted to do this only on a predetermined subset of rows of the original data:
df1$date[1:10] <- sub("\\s+(\\d{2})$", " 20\\1", df1$date[1:10])
-output
df1
id date
1 1 4 May 2020
2 2 5 June 2020
3 3 16 April 2021
data
df1 <- structure(list(id = c(1, 2, 3), date = c("4 May 20", "5 June 20",
"16 April 2021")), class = "data.frame", row.names = c(NA, -3L
))
I have to transform my dataframe from the current format to the new format (see the structure below), and I have no idea how to accomplish that. I want a year for each ID, from 2013-2018 (so each ID has 6 rows, one for every year). The dates are the dates of moving to an address (entry date) and of leaving that address (end date). So each combination of ID and year gives the zipcode and city where they lived. The place an ID lived in a given year should be where they lived the longest that year. I've already set the end date to 31-12-2018 if they still live there (here shown with NA). Below are the structures and the first 3 rows. Hopefully you guys can help me out!
Current format:
ID (1, 1, 2)
ZIPCODE (1234AB, 5678CD, 9012EF)
CITY (NEWYORK, LA, MIAMI)
ENTRY_DATE (2-1-2014, 13-3-2017, 10-11-2011)
END_DATE (13-5-2017, 21-12-2018, 6-9-2017)
New format:
ID (1, 1, 1, 1, 1, 1, 2)
YEAR (2013, 2014, 2015, 2016, 2017, 2018, 2013)
ZIPCODE (NA, 1234AB, 1234AB, 1234AB, 5678CD, 5678CD, 9012EF)
CITY (NA, NEWYORK, NEWYORK, NEWYORK, LA, LA, MIAMI)
Here is one approach.
First, create date intervals for each location from the start to the end dates. Using map2 and unnest, you create an additional row for each year.
Since you want the location where the person spent the greatest number of days in a given calendar year, you can look at the overlap between two intervals: one interval is the calendar year, the other is ENTRY_DATE to END_DATE. For each year, filter by max(WEEKS) (or, to ensure a single address per year, arrange in descending order by WEEKS and slice(1); with recent tidyr, consider slice_max). This keeps the row with the greatest overlap in duration between the two intervals.
The final complete() ensures you have rows for all years from 2013 to 2018.
library(tidyverse)
library(lubridate)
df %>%
  mutate(ENTRY_END_INT = interval(ENTRY_DATE, END_DATE),
         YEAR = map2(year(ENTRY_DATE), year(END_DATE), seq)) %>%
  unnest(YEAR) %>%
  mutate(YEAR_INT = interval(as.Date(paste0(YEAR, '-01-01')), as.Date(paste0(YEAR, '-12-31'))),
         WEEKS = as.duration(intersect(ENTRY_END_INT, YEAR_INT))) %>%
  group_by(ID, YEAR) %>%
  arrange(desc(WEEKS)) %>%
  slice(1) %>%
  group_by(ID) %>%
  complete(YEAR = seq(2013, 2018, 1)) %>%
  arrange(ID, YEAR) %>%
  select(-c(ENTRY_DATE, END_DATE, ENTRY_END_INT, YEAR_INT, WEEKS))
Output
# A tibble: 14 x 4
# Groups: ID [2]
ID YEAR ZIPCODE CITY
<dbl> <dbl> <chr> <chr>
1 1 2013 NA NA
2 1 2014 1234AB NEWYORK
3 1 2015 1234AB NEWYORK
4 1 2016 1234AB NEWYORK
5 1 2017 5678CD LA
6 1 2018 5678CD LA
7 2 2011 9012EF MIAMI
8 2 2012 9012EF MIAMI
9 2 2013 9012EF MIAMI
10 2 2014 9012EF MIAMI
11 2 2015 9012EF MIAMI
12 2 2016 9012EF MIAMI
13 2 2017 9012EF MIAMI
14 2 2018 NA NA
Data
df <- structure(list(ID = c(1, 1, 2), ZIPCODE = c("1234AB", "5678CD",
"9012EF"), CITY = c("NEWYORK", "LA", "MIAMI"), ENTRY_DATE = structure(c(16072,
17238, 15288), class = "Date"), END_DATE = structure(c(17299,
17896, 17415), class = "Date")), class = "data.frame", row.names = c(NA,
-3L))
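One design note on the WEEKS column: as.duration() measures the interval overlap in seconds, so WEEKS is really a duration rather than a count of weeks. Since the column is only used for ranking within a year, the unit does not matter; if you did want literal weeks, lubridate::time_length(intersect(ENTRY_END_INT, YEAR_INT), "weeks") would give them.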
I have a data frame in the following format that represents a large data set I have:
F.names<-c('M','M','M','A','A')
L.names<-c('Ab','Ab','Ab','Ac','Ac')
year<-c('August 2015','September 2014','September 2016', 'August 2014','September 2013')
grade<-c(NA,'9th Grade','11th Grade',NA,'11th grade')
df.have<-data.frame(F.names,L.names,year,grade)
F.names L.names year grade
1 M Ab August 2015 <NA>
2 M Ab September 2014 9th Grade
3 M Ab September 2016 11th Grade
4 A Ac August 2014 <NA>
5 A Ac September 2013 11th grade
The year column is in factor format in the original data set, and there are several missing values for grade. Basically I want to fill in the missing grade values based on the year column so that it looks like the following:
F.names L.names year grade
1 M Ab August 2015 10th Grade
2 M Ab September 2014 9th Grade
3 M Ab September 2016 11th Grade
4 A Ac August 2014 12th Grade
5 A Ac September 2013 11th grade
I was thinking that my first step would be to convert the year column, which is in factor format, to a date format, then arrange the rows in order and use something like fill from tidyr to fill in the missing values. How should I go about doing this, or is there a better way to approach it?
F.names<-c('M','M','M','A','A')
L.names<-c('Ab','Ab','Ab','Ac','Ac')
year<-c('August 2015','September 2014','September 2016', 'August 2014','September 2013')
grade<-c(NA,'9th Grade','11th Grade',NA,'11th grade')
df.have<-data.frame(F.names,L.names,year,grade)
library(tidyverse)
df.have %>%
  separate(year, c("m", "y"), convert = TRUE, remove = FALSE) %>%
  separate(grade, c("num", "type"), sep = "th", convert = TRUE) %>%
  arrange(F.names, y) %>%
  group_by(F.names) %>%
  mutate(num = ifelse(is.na(num), lag(num) + 1, num),
         type = "grade") %>%
  ungroup() %>%
  unite(grade, num, type, sep = "th ") %>%
  select(-m, -y)
# F.names L.names year grade
# 1 A Ac September 2013 11th grade
# 2 A Ac August 2014 12th grade
# 3 M Ab September 2014 9th grade
# 4 M Ab August 2015 10th grade
# 5 M Ab September 2016 11th grade
This solution assumes that you won't have 2 or more consecutive NAs for a given F.names value.
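As a side note on the OP's original idea (converting the factor year column to a date for ordering), a hedged sketch: lubridate::my() parses "month year" strings like "August 2015" directly (factors are coerced to character), so arranging by a real date could replace the separate()/arrange(y) step:
library(dplyr)
library(lubridate)
df.have %>%
  # my() reads month-year strings and returns the first day of that month
  mutate(date = my(year)) %>%
  arrange(F.names, date)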
I have a dataframe df with 3 columns (months, year, value).
>head(df)
months year value
January 01 23875.00
February 01 15343.25
March 01 9584.25
April 01 19026.33
May 01 26324.00
June 01 31228.00
Every 12 rows (starting from the first January), the year goes 02, 03, 04, etc., until 16.
I need to calculate seasonal means, i.e.:
for summer, the mean of December, January and February; for autumn, the mean of March, April and May; for winter, the mean of June, July and August; and for spring, the mean of September, October and November.
Then make a new dataframe with season, year, and the mean value, to get something like this:
>head(seasdf)
season year value
DJF 01
MAM 01
JJA 01
SON 01
DJF 02
MAM 02
With all the years until 16. I searched for similar questions with this kind of dataframe, but I couldn't find a way to do it.
Sorry for this noob question.
We assume that adjacent months in the same quarter should all have the same quarter name and year and that quarters are named after the year in which the quarter ends. For example, Dec 2001, Jan 2002 and Feb 2002 would all be part of the DJF 2002 quarter.
First convert the year and month to a "yearmon" class variable, ym, and then add 1/12 to push the months forward one. This is based on the fact that yearmon variables are stored as the year + 0 for Jan, 1/12 for Feb, 2/12 for Mar, etc. Then convert that to a "yearqtr" class variable, yq. Now aggregate value by yq noting that yearqtr variables sort correctly so that 2001 Q1 will come before 2001 Q2, etc. Finally reconstitute the aggregated data frame with the columns shown in the question.
library(zoo) # yearmon and yearqtr classes
ym <- as.yearmon(paste(DF$months, DF$year), "%B %y")
yq <- as.yearqtr(ym + 1/12)
Ag <- aggregate(value ~ yq, DF, mean)
season.name <- c("DJF", "MAM", "JJA", "SON")
with(Ag, data.frame(year = as.integer(yq), season = season.name[cycle(yq)], value))
giving:
year season value
1 2001 DJF 19609.12
2 2001 MAM 18311.53
3 2001 JJA 31228.00
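To see the 1/12 shift in action, a small worked example:
library(zoo)
# Dec 2001 is stored as 2001 + 11/12; adding 1/12 moves it to Jan 2002,
# so it lands in 2002 Q1, i.e. the DJF 2002 quarter.
as.yearqtr(as.yearmon("Dec 2001", "%b %Y") + 1/12)
## [1] "2002 Q1"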
If the exact layout shown in the question is not important, then we could omit the last two lines of code above and just use Ag:
> Ag
yq value
1 2001 Q1 19609.12
2 2001 Q2 18311.53
3 2001 Q3 31228.00
Note: The input DF in reproducible form was assumed to be:
DF <- structure(list(months = c("January", "February", "March", "April",
"May", "June"), year = c("01", "01", "01", "01", "01", "01"),
value = c(23875, 15343.25, 9584.25, 19026.33, 26324, 31228
)), .Names = c("months", "year", "value"), class = "data.frame", row.names = c(NA, -6L))
It seems like your months variable holds standard month names, so you can match it against the built-in month.name variable to get the month as a number (i.e., January will be 1, February will be 2, etc.), then use modular arithmetic (%% 12 %/% 3) to derive the season as another grouping variable besides year; after that it is trivial to group by year and season and take the average:
library(dplyr)
df %>%
  group_by(season = match(months, month.name) %% 12 %/% 3, year) %>%
  summarise(value = mean(value)) %>%
  ungroup() %>%
  # optional: convert the season from a number to meaningful labels, which could
  # also be summer, autumn, winter and spring
  mutate(season = factor(season, levels = c(0, 1, 2, 3),
                         labels = c("DJF", "MAM", "JJA", "SON")))
# A tibble: 3 × 3
# season year value
# <fctr> <int> <dbl>
#1 DJF 1 19609.12
#2 MAM 1 18311.53
#3 JJA 1 31228.00
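A few worked cases of the season index, to show why %% 12 %/% 3 groups the months correctly:
match("January", month.name) %% 12 %/% 3    # 1 %% 12 = 1;  1 %/% 3 = 0 -> DJF
match("March", month.name) %% 12 %/% 3      # 3 %% 12 = 3;  3 %/% 3 = 1 -> MAM
match("December", month.name) %% 12 %/% 3   # 12 %% 12 = 0; 0 %/% 3 = 0 -> DJF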
If December needs to be rolled into the next year's summer, you can add one to the year variable when months == "December":
df %>%
  group_by(season = match(months, month.name) %% 12 %/% 3,
           year = ifelse(months == "December", year + 1, year)) %>%
  summarise(value = mean(value)) %>%
  ungroup() %>%
  # optional: convert the season from a number to meaningful labels, which could
  # also be summer, autumn, winter and spring
  mutate(season = factor(season, levels = c(0, 1, 2, 3),
                         labels = c("DJF", "MAM", "JJA", "SON")))