How to fill in missing values based on dates in R?

I have a data frame in the following format, which represents a large data set that I have:
F.names<-c('M','M','M','A','A')
L.names<-c('Ab','Ab','Ab','Ac','Ac')
year<-c('August 2015','September 2014','September 2016', 'August 2014','September 2013')
grade<-c(NA,'9th Grade','11th Grade',NA,'11th grade')
df.have<-data.frame(F.names,L.names,year,grade)
F.names L.names year grade
1 M Ab August 2015 <NA>
2 M Ab September 2014 9th Grade
3 M Ab September 2016 11th Grade
4 A Ac August 2014 <NA>
5 A Ac September 2013 11th grade
The year column is stored as a factor in the original data set, and several grade values are missing. Basically, I want to fill in the missing grade values based on the year column so that the result looks like the following.
F.names L.names year grade
1 M Ab August 2015 10th Grade
2 M Ab September 2014 9th Grade
3 M Ab September 2016 11th Grade
4 A Ac August 2014 12th Grade
5 A Ac September 2013 11th grade
I was thinking that my first step would be to convert the year column, which is a factor, to a date format, then arrange the rows in order and use something like fill from tidyr to fill in the missing values. How should I go about doing this, or is there a better way to approach this?
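As a rough sketch of that first step (not part of the original post), a factor of month-year strings can be converted to Date in base R by prepending a day and parsing with as.Date(). The %B code matches full month names in the current locale, so this assumes an English locale, and the day-of-month 1 is an arbitrary placeholder.

```r
# Factor of "Month YYYY" strings, as in the question
year <- factor(c("August 2015", "September 2014", "September 2016",
                 "August 2014", "September 2013"))

# Prepend an arbitrary day so the string is a complete date, then parse.
# Assumption: month names are in English and so is the session locale.
dates <- as.Date(paste("1", as.character(year)), format = "%d %B %Y")

# Row order that sorts the data chronologically
ord <- order(dates)
```

The vector ord can then be used to sort the data frame, for example df.have[ord, ].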

F.names<-c('M','M','M','A','A')
L.names<-c('Ab','Ab','Ab','Ac','Ac')
year<-c('August 2015','September 2014','September 2016', 'August 2014','September 2013')
grade<-c(NA,'9th Grade','11th Grade',NA,'11th grade')
df.have<-data.frame(F.names,L.names,year,grade)
library(tidyverse)
df.have %>%
separate(year, c("m","y"), convert = T, remove = F) %>%
separate(grade, c("num","type"), sep="th", convert = T) %>%
arrange(F.names, y) %>%
group_by(F.names) %>%
mutate(num = ifelse(is.na(num), lag(num) + 1, num),
type = "grade") %>%
ungroup() %>%
unite(grade, num, type, sep="th ") %>%
select(-m, -y)
# F.names L.names year grade
# 1 A Ac September 2013 11th grade
# 2 A Ac August 2014 12th grade
# 3 M Ab September 2014 9th grade
# 4 M Ab August 2015 10th grade
# 5 M Ab September 2016 11th grade
This solution assumes that you won't have 2 or more consecutive NAs for a given F.names value.
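If consecutive NAs can occur, one possible workaround is to derive the grade from the calendar year instead of the previous row. The following base R sketch is not from the original answer. It assumes each student moves up exactly one grade per calendar year and has at least one known grade; the names offset and num_filled are made up for illustration.

```r
df <- data.frame(
  F.names = c("M", "M", "M", "A", "A"),
  year    = c("August 2015", "September 2014", "September 2016",
              "August 2014", "September 2013"),
  grade   = c(NA, "9th Grade", "11th Grade", NA, "11th grade")
)

# Calendar year: everything after the last space
df$y <- as.integer(sub(".* ", "", as.character(df$year)))
# Grade number: everything before "th" (stays NA where grade is missing)
df$num <- as.integer(sub("th.*", "", as.character(df$grade)))

# Per student, grade minus year is constant under the stated assumption;
# take it from the first row where the grade is known
offset <- ave(df$num - df$y, df$F.names,
              FUN = function(x) x[!is.na(x)][1])

# Reconstruct every grade number, including the missing ones
df$num_filled <- df$y + offset
```

Because every row is computed from the student's own offset, the number of missing rows in a run no longer matters.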

Related

In 0:(b - 1) : numerical expression has 6 elements: only the first used

I've been working on a project which includes time series. My issue is that I don't have information for each year, only for each period. I basically want to duplicate each row for as long as the period lasts: for an n-year period, I want to create (n-1) new rows with exactly the same information. So far, so good.
stack = data.frame(c("ville","commune","université","pole emploi", "ministère","collège"),
c(2014,2015,2016,2014,2015,2014),
c(5,3,2,6,4,1))
colnames(stack) = c("benefit recipient","beginning year", "length of the period")
b = stack$`length of the period`
stack2 = stack[rep(rownames(stack),b),]
Now I want to turn the beginning year into the effective year of each row, so that the year increases by one from each row to the next within a period. To visualise it, here is some code where I do it manually (along with screenshots of what I have and what I want in my real project).
stack3 = data.frame(c("ville","ville","ville","ville","ville","commune","commune","commune","université","université","pole emploi","pole emploi","pole emploi","pole emploi","pole emploi","pole emploi", "ministère","ministère","ministère","ministère","collège"),
c(2014,2015,2016,2017,2018,2015,2016,2017,2016,2017,2014,2015,2016,2017,2018,2019,2015,2016,2017,2018,2014),
c(5,5,5,5,5,3,3,3,2,2,6,6,6,6,6,6,4,4,4,4,1))
colnames(stack3) = c("benefit recipient","effective year", "length of the period")
So far, my idea was to split my period and to add the value of this new vector to my table. I tried with the function:
c = c(0:(b-1))
But it didn't work, I have the message In 0:(b - 1) : numerical expression has 6 elements: only the first used. It's a shame because it did exactly what I wanted but, only for the first element...
Do you have any idea how I can solve it?
Thanks a lot for your time!
[Screenshots: "What I have" and "What I would like to have"]
Solution using lapply() and seq():
b = stack$`beginning year`
c = stack$`length of the period`
# repeat each row once per year of its period
stack2 = stack[rep(rownames(stack),c),]
stack2$`beginning year` = unlist(lapply(seq_along(b), function(x) seq(b[x], b[x]+c[x]-1, by=1)))
We can use rowwise
library(dplyr)
library(tidyr)
stack %>%
rowwise %>%
mutate(year = list(`beginning year`:(`beginning year` +
`length of the period` - 1))) %>%
unnest(year)
You can use map2 to create sequence and unnest to create new rows.
library(tidyverse)
stack %>%
mutate(year = map2(`beginning year`, `beginning year` + `length of the period` - 1, seq)) %>%
unnest(year)
# `benefit recipient` `beginning year` `length of the period` year
# <chr> <dbl> <dbl> <int>
# 1 ville 2014 5 2014
# 2 ville 2014 5 2015
# 3 ville 2014 5 2016
# 4 ville 2014 5 2017
# 5 ville 2014 5 2018
# 6 commune 2015 3 2015
# 7 commune 2015 3 2016
# 8 commune 2015 3 2017
# 9 université 2016 2 2016
#10 université 2016 2 2017
# … with 11 more rows
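The warning in the question comes from the fact that the : operator is not vectorised: given a vector, it uses only the first element. A base R alternative, sketched here with shortened, made-up column names (recipient, start, len), is sequence(), which builds one 1..n run per element of its argument.

```r
stack <- data.frame(
  recipient = c("ville", "commune", "pole emploi"),
  start     = c(2014, 2015, 2014),
  len       = c(5, 3, 6)
)

# Repeat each row once per year of its period
expanded <- stack[rep(seq_len(nrow(stack)), stack$len), ]

# sequence(c(5, 3, 6)) is 1..5, 1..3, 1..6; subtracting 1 gives the
# per-row offset from the beginning year
expanded$year <- expanded$start + sequence(stack$len) - 1
```

This avoids an explicit loop over rows and needs no packages.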

Merge rows by pattern in R

I am trying to merge rows by pattern.
The dataframe has only one column (string) and normally, it should follow a pattern of date, company_name and salary. However, some cases just don't have the salary.
Is there a way I can merge the rows by the pattern of the date? By doing so, I can later split them into columns. The reason I didn't want to use pivot_wider earlier is that the company name and salary are likely to get mismatched because the rows are unbalanced. So I think it's better to merge the rows by the date pattern, as the date is never missing and always follows a pattern.
dataset:
# A tibble: 10 x 1
detail
<chr>
1 26 January 2021
2 NatWest Group - Bristol, BS2 0PT
3 26 January 2021
4 NatWest Group - Manchester, M3 3AQ
5 15 February 2021
6 Brook Street - Liverpool, Merseyside, L21AB
7 £13.84 per hour
8 16 February 2021
9 Anglo Technical Recruitment - London, WC2N 5DU
10 £400.00 per day
dput for the dataset:
structure(list(detail = c("26 January 2021", "NatWest Group - Bristol, BS2 0PT",
"26 January 2021", "NatWest Group - Manchester, M3 3AQ", "15 February 2021",
"Brook Street - Liverpool, Merseyside, L21AB", "£13.84 per hour",
"16 February 2021", "Anglo Technical Recruitment - London, WC2N 5DU",
"£400.00 per day")), row.names = c(NA, -10L), class = c("tbl_df",
"tbl", "data.frame"))
Expected outcome:
detail
<chr>
1 26 January 2021 NatWest Group - Bristol, BS2 0PT
2 26 January 2021 NatWest Group - Manchester, M3 3AQ
3 15 February 2021 Brook Street - Liverpool, Merseyside, L21AB £13.84 per hour
4 16 February 2021 Anglo Technical Recruitment - London, WC2N 5DU £400.00 per day
dput for expected outcome:
df <- structure(list(detail = c("26 January 2021 NatWest Group - Bristol, BS2 0PT",
"26 January 2021 NatWest Group - Manchester, M3 3AQ", "15 February 2021 Brook Street - Liverpool, Merseyside, L21AB £13.84 per hour",
"16 February 2021 Anglo Technical Recruitment - London, WC2N 5DU £400.00 per day")), row.names = c(NA, -4L), class = c("tbl_df",
"tbl", "data.frame"))
Preface each line with a tag and then use read.dcf to create a 3 column character matrix mat. At the end we convert that to a character vector with one element per logical record but you may just want to use mat since that seems like a more useful format.
We assume that the dates have the %d %B %Y format (see ?strptime for the percent codes), that salary lines start with £ and other lines are Address lines.
library(dplyr)
mat <- dat %>%
mutate(detail = case_when(
!is.na(as.Date(detail, "%d %B %Y")) ~ paste("\nDate:", detail),
grepl("^£", detail) ~ paste("Salary:", detail),
TRUE ~ paste("Address:", detail))) %>%
{ read.dcf(textConnection(.$detail)) }
mat %>%
apply(1, toString) %>%
sub(", NA$", "", .)
Update
Simplified assumptions and code.
One more solution, assuming only that the first row contains a date. It works irrespective of the number of rows between two dates.
library(tidyverse)
df %>% group_by(d = cumsum(str_detect(detail, "^(^\\d\\d? \\w+ \\d{4})$"))) %>%
mutate(c = paste0("Col", as.character(row_number()))) %>%
pivot_wider(id_cols = d, values_from = detail, names_from = c)
# A tibble: 4 x 4
# Groups: d [4]
d Col1 Col2 Col3
<int> <chr> <chr> <chr>
1 1 26 January 2021 NatWest Group - Bristol, BS2 0PT NA
2 2 26 January 2021 NatWest Group - Manchester, M3 3AQ NA
3 3 15 February 2021 Brook Street - Liverpool, Merseyside, L21AB £13.84 per hour
4 4 16 February 2021 Anglo Technical Recruitment - London, WC2N 5DU £400.00 per day
Here is a pure data.table approach
library( data.table )
#make it a data.table
setDT( df )
#first, summarise by block (each block starts at a date row), collapse the text, using ## as separator
ans <- df[, .( paste0( detail, collapse = "##") ),
by = .(d = cumsum( ( grepl( "[0-9]{2} [a-zA-Z]+ [0-9]{4}", detail) ) ) ) ]
#split text again to cols, based on the ## introduced in the collapse. Number of cols is dynamic!
ans[, paste0( "Col", 1:length( tstrsplit(ans$V1, "##" ))) := tstrsplit( V1, "##" )][, V1 := NULL ][]
# d Col1 Col2 Col3
# 1: 1 26 January 2021 NatWest Group - Bristol, BS2 0PT <NA>
# 2: 2 26 January 2021 NatWest Group - Manchester, M3 3AQ <NA>
# 3: 3 15 February 2021 Brook Street - Liverpool, Merseyside, L21AB £13.84 per hour
# 4: 4 16 February 2021 Anglo Technical Recruitment - London, WC2N 5DU £400.00 per day
Here is a data.table approach which uses dcast() and rowid() to reshape to wide format. It returns a data.table with four columns: a record number, date, company_name, and salary.
library(data.table)
setDT(df1)[, rn := cumsum(!is.na(lubridate::dmy(detail)))]
dcast(df1, rn ~ rowid(rn, prefix = "Col"), value.var = "detail")
rn Col1 Col2 Col3
1: 1 26 January 2021 NatWest Group - Bristol, BS2 0PT <NA>
2: 2 26 January 2021 NatWest Group - Manchester, M3 3AQ <NA>
3: 3 15 February 2021 Brook Street - Liverpool, Merseyside, L21AB £13.84 per hour
4: 4 16 February 2021 Anglo Technical Recruitment - London, WC2N 5DU £400.00 per day
For detecting rows which start a new record, i.e., rows with a date, this approach borrows from Anil's answer as well as from G.Grothendieck's.
dcast() makes it possible to pack everything into a "one-liner" (if the library() calls are not counted):
library(data.table)
library(lubridate)
dcast(setDT(df1), cumsum(!is.na(dmy(detail))) ~ rowid(cumsum(!is.na(dmy(detail))), prefix = "Col"),
value.var = "detail")
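For the single-column result shown in the question's expected outcome, the same cumsum() grouping idea can be sketched in base R. This is an illustration rather than one of the posted answers; it assumes that every record starts with a line consisting only of a date such as "26 January 2021", and the names is_date, record and merged are made up.

```r
detail <- c("26 January 2021", "NatWest Group - Bristol, BS2 0PT",
            "26 January 2021", "NatWest Group - Manchester, M3 3AQ",
            "15 February 2021", "Brook Street - Liverpool, Merseyside, L21AB",
            "£13.84 per hour",
            "16 February 2021", "Anglo Technical Recruitment - London, WC2N 5DU",
            "£400.00 per day")

# TRUE for lines that consist only of "<day> <month name> <4-digit year>"
is_date <- grepl("^[0-9]{1,2} [A-Za-z]+ [0-9]{4}$", detail)

# Running count of date lines gives a record id for every line
record <- cumsum(is_date)

# Paste together all lines that share a record id
merged <- as.vector(tapply(detail, record, paste, collapse = " "))
```

merged has one element per record, whether or not a salary line is present.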

Dataframe format transformation in R: how to go from dates to years (one new row per ID per year)

I have to transform my dataframe from the current format to the new format (see image or structure below), and I have no idea how to accomplish that. I want a row for each ID and year from 2013 to 2018 (so each ID has 6 rows, one for every year). The dates are the date they started living at that address (entry date) and the date they left that address (end date). So each ID and year gives the zipcode and city where they lived. The place recorded for an ID in a given year should be where they lived the longest that year. I've already set the end date to 31-12-2018 if they still live there (shown here as NA). Below are a picture and the first 3 rows. Hopefully you guys can help me out!
Current format:
ID (1, 1, 2)
ZIPCODE (1234AB, 5678CD, 9012EF)
CITY (NEWYORK, LA, MIAMI)
ENTRY_DATE (2-1-2014, 13-3-2017, 10-11-2011)
END_DATE (13-5-2017, 21-12-2018, 6-9-2017)
New format:
ID (1, 1, 1, 1, 1, 1, 2)
YEAR (2013, 2014, 2015, 2016, 2017, 2018, 2013)
ZIPCODE (NA, 1234AB, 1234AB, 1234AB, 5678CD, 5678CD, 9012EF)
CITY (NA, NEWYORK, NEWYORK, NEWYORK, LA, LA, MIAMI)
Here is one approach.
First, create date intervals for each location from start to end dates. Using map2 and unnest you will create additional rows for each year.
Since you wish to keep the location with the greatest number of days in each calendar year, you could look at the overlap between 2 intervals: one interval is the calendar year, and the second is the interval from ENTRY_DATE to END_DATE. For each year, you can filter by max(WEEKS) (or, to ensure a single address per year, arrange in descending order by WEEKS and slice(1); with dplyr 1.0.0 or later, consider slice_max). This will keep the row with the longest overlap between the two intervals.
The final complete will ensure you have rows for all years between 2013-2018.
library(tidyverse)
library(lubridate)
df %>%
mutate(ENTRY_END_INT = interval(ENTRY_DATE, END_DATE),
YEAR = map2(year(ENTRY_DATE), year(END_DATE), seq)) %>%
unnest(YEAR) %>%
mutate(YEAR_INT = interval(as.Date(paste0(YEAR, '-01-01')), as.Date(paste0(YEAR, '-12-31'))),
WEEKS = as.duration(intersect(ENTRY_END_INT, YEAR_INT))) %>%
group_by(ID, YEAR) %>%
arrange(desc(WEEKS)) %>%
slice(1) %>%
group_by(ID) %>%
complete(YEAR = seq(2013, 2018, 1)) %>%
arrange(ID, YEAR) %>%
select(-c(ENTRY_DATE, END_DATE, ENTRY_END_INT, YEAR_INT, WEEKS))
Output
# A tibble: 14 x 4
# Groups: ID [2]
ID YEAR ZIPCODE CITY
<dbl> <dbl> <chr> <chr>
1 1 2013 NA NA
2 1 2014 1234AB NEWYORK
3 1 2015 1234AB NEWYORK
4 1 2016 1234AB NEWYORK
5 1 2017 5678CD LA
6 1 2018 5678CD LA
7 2 2011 9012EF MIAMI
8 2 2012 9012EF MIAMI
9 2 2013 9012EF MIAMI
10 2 2014 9012EF MIAMI
11 2 2015 9012EF MIAMI
12 2 2016 9012EF MIAMI
13 2 2017 9012EF MIAMI
14 2 2018 NA NA
Data
df <- structure(list(ID = c(1, 1, 2), ZIPCODE = c("1234AB", "5678CD",
"9012EF"), CITY = c("NEWYORK", "LA", "MIAMI"), ENTRY_DATE = structure(c(16072,
17238, 15288), class = "Date"), END_DATE = structure(c(17299,
17896, 17415), class = "Date")), class = "data.frame", row.names = c(NA,
-3L))
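To make the overlap reasoning concrete, here is a small base R sketch (not from the answer above) that counts the days a stay shares with a calendar year. The helper name overlap_days is hypothetical, and both end points are counted as days lived at the address.

```r
# Days (inclusive) that the stay [entry, end] shares with calendar year `year`
overlap_days <- function(entry, end, year) {
  year_start <- as.Date(paste0(year, "-01-01"))
  year_end   <- as.Date(paste0(year, "-12-31"))
  # The overlap runs from the later start to the earlier end
  days <- as.numeric(pmin(end, year_end) - pmax(entry, year_start)) + 1
  # A negative value means the stay and the year do not overlap at all
  pmax(days, 0)
}

# ID 1 in 2017: the first address ends on 13 May, the second starts on 13 March
ny <- overlap_days(as.Date("2014-01-02"), as.Date("2017-05-13"), 2017)
la <- overlap_days(as.Date("2017-03-13"), as.Date("2018-12-21"), 2017)
```

Here la is larger than ny, which agrees with the 2017 row for ID 1 in the output above.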

R, dplyr: How to divide date frame elements by specific elements

edit: Solution at the end.
I have a dataframe that contains different variables, plus the sum of these variables stored as a variable called "total".
I want to add a new column that calculates each variable's share of the "total" variable.
Example:
library(dplyr)
name <- c('A','A',
'B','B')
month = c("oct 2018", "nov 2018",
"oct 2018", "nov 2018")
value <- seq(1:length(month))
df = data.frame(name, month, value)
# Create total variable
dfTotal =
df%>%
group_by(month)%>%
summarize(value = sum(value, na.rm = TRUE))
dfTotal[["name"]] <- "Total"
dfTotal = as.data.frame(dfTotal)
# Add total column to dataframe
df2 = rbind(df, dfTotal)
df2
which gives the dataframe
name month value
1 A oct 2018 1
2 A nov 2018 2
3 B oct 2018 3
4 B nov 2018 4
5 Total nov 2018 6
6 Total oct 2018 4
What I want is to produce a new column with the shares of the total for each month in the above dataframe, so that I get something like
name month value share
1 A oct 2018 1 0.25 (=1/4)
2 A nov 2018 2 0.33 (=2/6)
3 B oct 2018 3 0.75 (=3/4)
4 B nov 2018 4 0.67 (=4/6)
5 Total nov 2018 6 1.00 (=6/6)
6 Total oct 2018 4 1.00 (=4/4)
Does anybody know how I can produce the last column of the second dataframe from the first dataframe?
Solution:
Based on tmfmnk's comment, the following solves the problem:
df2 =
df2 %>%
group_by(month) %>%
mutate(share = value/max(value))
df2
which gives
name month value share
<fct> <fct> <int> <dbl>
1 A oct 2018 1 0.25
2 A nov 2018 2 0.333
3 B oct 2018 3 0.75
4 B nov 2018 4 0.667
5 Total nov 2018 6 1
6 Total oct 2018 4 1
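Dividing by max(value) works here because, with non-negative values, the "Total" row is always the largest in its month. A variant that does not rely on this, sketched in base R as an illustration (the lookup vector totals is a made-up name), divides by the value of the "Total" row explicitly.

```r
df2 <- data.frame(
  name  = c("A", "A", "B", "B", "Total", "Total"),
  month = c("oct 2018", "nov 2018", "oct 2018", "nov 2018",
            "nov 2018", "oct 2018"),
  value = c(1, 2, 3, 4, 6, 4)
)

# Named lookup: month -> value of that month's "Total" row
is_total <- df2$name == "Total"
totals <- setNames(df2$value[is_total], as.character(df2$month[is_total]))

# Divide every value by the total of its own month
df2$share <- unname(df2$value / totals[as.character(df2$month)])
```

This gives the same shares as above and would still be correct if some values were negative.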

Use dplyr/tidyr to turn rows into columns in R data frame

I have a data frame like this:
year <-c(floor(runif(100,min=2015, max=2017)))
month <- c(floor(runif(100, min=1, max=13)))
inch <- c(floor(runif(100, min=0, max=10)))
mm <- c(floor(runif(100, min=0, max=100)))
df = data.frame(year, month, inch, mm);
year month inch mm
2016 11 0 10
2015 9 3 34
2016 6 3 33
2015 8 0 77
I only care about the columns year, month, and mm.
I need to rearrange the data frame so that the first column is the name of the month and the remaining columns hold the mm values, one column per year.
Months 2015 2016
Jan # #
Feb
Mar
Apr
May
Jun
Jul
Aug
Sep
Oct
Nov
Dec
So two things need to happen.
(1) The month needs to become a string of the first three letters of the month.
(2) I need to group by year, and then put the mm values in a column under that year.
So far I have this code, but I can't figure it out:
df %>%
select(-inch) %>%
group_by(month) %>%
summarize(mm = mm) %>%
ungroup()
To convert the month numbers to names, you can use month.abb. Then you can summarise by year and month and spread to wide format:
library(dplyr)
library(tidyr)
df %>%
group_by(year, month = month.abb[month]) %>%
summarise(mm = mean(mm)) %>% # use mean as an example, could also be sum or other
# intended aggregation methods
spread(year, mm) %>%
arrange(match(month, month.abb)) # rearrange month in chronological order
# A tibble: 12 x 3
# month `2015` `2016`
# <chr> <dbl> <dbl>
# 1 Jan 65.50000 28.14286
# 2 Feb 54.40000 30.00000
# 3 Mar 23.50000 95.00000
# 4 Apr 7.00000 43.60000
# 5 May 45.33333 44.50000
# 6 Jun 70.33333 63.16667
# 7 Jul 72.83333 52.00000
# 8 Aug 53.66667 66.50000
# 9 Sep 51.00000 64.40000
#10 Oct 74.00000 39.66667
#11 Nov 66.20000 58.71429
#12 Dec 38.25000 51.50000
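For comparison, a similar month-by-year table can be sketched in base R with tapply(). This is an alternative to the answer above, not a replacement for it; it uses a small hand-made data frame instead of the random data, and mean() is again only an example aggregation.

```r
df <- data.frame(
  year  = c(2015, 2015, 2016, 2016, 2016),
  month = c(1, 1, 1, 2, 2),
  mm    = c(10, 20, 30, 40, 60)
)

# A factor with all 12 levels keeps the months in calendar order and
# produces a row even for months that have no data
df$month <- factor(month.abb[df$month], levels = month.abb)

# Rows are months, columns are years, cells are the mean of mm
wide <- tapply(df$mm, list(Months = df$month, Year = df$year), mean)
```

The result is a matrix with month names as row names; cells with no observations are NA.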
