Select last non-contemporaneous date in group - R

People are buying stuff, and I have the dates when someone last purchased the item in their zip code. For each purchase, I want to grab the most recent non-contemporaneous (i.e. strictly earlier) date in that group.
library(dplyr)

ZCTA5 = c("b", "c", "a", "b", "b", "c", "a", "a", "a", "c")
App.Complete.Date = c("2005-01-23", "2005-01-23",
                      "2006-07-13", "2006-11-21",
                      "2006-11-21", "2006-11-21",
                      "2007-01-01", "2007-01-01",
                      "2007-01-01", "2007-01-01")
xxx <- data.frame(ZCTA5, App.Complete.Date) %>%
  arrange(ZCTA5, App.Complete.Date); xxx

Last.Unique.Date.In.ZCTA5 = c(NA, "2006-07-13", "2006-07-13", "2006-07-13",
                              NA, "2005-01-23", "2005-01-23",
                              NA, "2005-01-23", "2006-11-21")
Desired output
ZCTA5 App.Complete.Date Last.Unique.Date.In.ZCTA5
1 a 2006-07-13 <NA>
2 a 2007-01-01 2006-07-13
3 a 2007-01-01 2006-07-13
4 a 2007-01-01 2006-07-13
5 b 2005-01-23 <NA>
6 b 2006-11-21 2005-01-23
7 b 2006-11-21 2005-01-23
8 c 2005-01-23 <NA>
9 c 2006-11-21 2005-01-23
10 c 2007-01-01 2006-11-21
I don't want to drop any observations. Mutating in place would be ideal, but I understand joining by ZCTA5 (and an individual ID, not shown here but available) later would be fine.
I couldn't figure out a way to mutate a new variable by lagging the unique App.Complete.Date values, so I'm stuck. Slicing has also been too cumbersome, since I still need the last date without removing contemporaneous dates.
EDIT: If the NA is the same row's App.Complete.Date, that's acceptable.

Try the following:
xxx = xxx %>%
  mutate(App.Complete.Date = as.Date(App.Complete.Date),
         rn = row_number())
Initial setup to ensure the date column is of type Date, adding row numbers so duplicate dates from the original data are preserved.
yyy = xxx %>%
  left_join(xxx, by = "ZCTA5") %>%
  # discard all the out-of-scope dates; note that ifelse() strips the
  # Date class, which is why we convert back after the summarise
  mutate(App.Complete.Date.y = ifelse(App.Complete.Date.y < App.Complete.Date.x,
                                      App.Complete.Date.y, NA)) %>%
  # we need to include the row number here to preserve all rows of the original
  group_by(ZCTA5, App.Complete.Date.x, rn.x) %>%
  # na.rm = TRUE handles all the values blanked out in the previous mutate
  summarise(App.Complete.Date.y = max(App.Complete.Date.y, na.rm = TRUE),
            .groups = "drop") %>%
  # convert the numeric result back to Date
  mutate(App.Complete.Date.y = as.Date(App.Complete.Date.y, origin = "1970-01-01")) %>%
  # rename to match the desired output
  select(ZCTA5,
         App.Complete.Date = App.Complete.Date.x,
         Last.Unique.Date.In.ZCTA5 = App.Complete.Date.y)
You may need to change the origin argument in the last mutate depending on your system's base date. When my computer returned 13342 instead of '2006-07-13', I determined the base date was '1970-01-01', because '2006-07-13' is 13342 days after '1970-01-01'.
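If you would rather mutate in place, here is a minimal sketch of the same logic as a grouped mutate (my addition, not the answer above; it assumes the xxx built in the question and dplyr loaded):
library(dplyr)

xxx %>%
  mutate(App.Complete.Date = as.Date(App.Complete.Date)) %>%
  group_by(ZCTA5) %>%
  # for each row, take the latest strictly earlier date within the ZCTA5
  mutate(Last.Unique.Date.In.ZCTA5 = as.Date(
    sapply(App.Complete.Date, function(d) {
      earlier <- App.Complete.Date[App.Complete.Date < d]
      if (length(earlier) == 0) NA else max(earlier)
    }),
    origin = "1970-01-01")) %>%
  ungroup()
This avoids the self-join at the cost of a quadratic scan within each group, which is fine at this scale.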

Specify number of columns to read when first row is missing values

I have data from a logger that inserts timestamps as rows within the comma-separated data. I've sorted out a way to wrangle those timestamps into a tidy data frame (thanks to the responses to this question).
The issue I'm having now is that the timestamp lines don't have the same number of comma-separated values as the data rows (3 vs 6), and readr defaults to reading in only 3 columns, despite me manually specifying column types and names for 6. Last summer (when I last used the logger) readr read the data in correctly, but to my dismay the current version (2.1.1) throws a warning and lumps columns 3:6 all together. I'm hoping there's some option for "correcting" back to the old behaviour, or some work-around I haven't thought of (editing the logger files is not an option).
Example code:
library(tidyverse)
# example data
txt1 <- "
,,Logger Start 12:34
-112,53,N=1,9,15,.25
-112,53,N=2,12,17,.17
"
# example without timestamp header
txt2 <- "
-112,53,N=1,9,15,.25
-112,53,N=2,12,17,.17
"
# throws a warning and reads 3 columns
read_csv(
  txt1,
  col_names = c("lon", "lat", "n", "red", "nir", "NDVI"),
  col_types = "ddcddc"
)

# works correctly
read_csv(
  txt2,
  col_names = c("lon", "lat", "n", "red", "nir", "NDVI"),
  col_types = "ddcddc"
)

# this is the table that older readr versions would create
# and that I'm hoping to get back to
tribble(
  ~lon, ~lat, ~n,                   ~red, ~nir, ~NDVI,
  NA,   NA,   "Logger Start 12:34", NA,   NA,   NA,
  -112, 53,   "N=1",                9,    15,   ".25",
  -112, 53,   "N=2",                12,   17,   ".17"
)
Use the base read.csv, then convert to a tibble if need be:
read.csv(text = txt1, header = FALSE,
         col.names = c("lon", "lat", "n", "red", "nir", "NDVI"))
lon lat n red nir NDVI
1 NA NA Logger Start 12:34 NA NA NA
2 -112 53 N=1 9 15 0.25
3 -112 53 N=2 12 17 0.17
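For instance (a sketch, assuming the tibble package is available for as_tibble()):
library(tibble)

as_tibble(read.csv(text = txt1, header = FALSE,
                   col.names = c("lon", "lat", "n", "red", "nir", "NDVI")))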
I think I would use read_lines and write_lines to convert the "bad CSV" into "good CSV", and then read in the converted data.
Assuming you have a file test.csv like this:
,,Logger Start 12:34
-112,53,N=1,9,15,.25
-112,53,N=2,12,17,.17
Try something like this:
library(readr)
library(dplyr)
library(tidyr)

read_lines("test.csv") %>%
  # assumes all timestamp lines have the same format
  gsub(",,Logger Start (.*?)$", "\\1,,,,,,", ., perl = TRUE) %>%
  # assumes that NDVI (the last column) is always present and ends with a digit;
  # you'll need to alter the regex if that's not the case
  gsub("^(.*?\\d)$", ",\\1", ., perl = TRUE) %>%
  write_lines("test_out.csv")
test_out.csv now looks like this:
12:34,,,,,,
,-112,53,N=1,9,15,.25
,-112,53,N=2,12,17,.17
So we now have 7 columns, the first being the timestamp.
This code reads the new file, fills in the missing timestamp values, and removes rows where n is NA. You may not want that last step; I've assumed n is only missing because of the original timestamp rows.
mydata <- read_csv("test_out.csv",
                   col_names = c("ts", "lon", "lat", "n", "red", "nir", "NDVI")) %>%
  fill(ts) %>%
  filter(!is.na(n))
The final mydata:
# A tibble: 2 x 7
ts lon lat n red nir NDVI
<time> <dbl> <dbl> <chr> <dbl> <dbl> <dbl>
1 12:34 -112 53 N=1 9 15 0.25
2 12:34 -112 53 N=2 12 17 0.17
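As a side note, readr 2.x still bundles the old parsing engine, so wrapping the original call in with_edition() should restore the first-edition behaviour the question asks about (a sketch; see ?readr::with_edition for your version):
library(readr)

with_edition(1, read_csv(
  txt1,
  col_names = c("lon", "lat", "n", "red", "nir", "NDVI"),
  col_types = "ddcddc"
))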

How can I combine several columns into one variable, tacking each onto the end of the other and grouping by values in an ID variable?

I have a dataframe with multiple columns pertaining to the same variable that I'd like to combine into a single column. However, most of the answers I can find about this concern concatenating columns (e.g. Merge 2 columns into one in dataframe), whereas I want to preserve each individual cell of data and just stack them into one single column.
For clarity, here's a sample of what my input data approximately look like.
a         b         c         ID
string1   string11  string21  1111
string2   string12  string22  2222
Here is what I would like these data to look like:
newvar     ID
string1    1111
string11   1111
string21   1111
string2    2222
string12   2222
string22   2222
So far, I've been trying to use "pivot_longer()" to accomplish this, like so:
pivot_longer(df, c("a", "b", "c"), "newvar")
but I think I must misunderstand pivot_longer(), because the df it returns has cells populated with the values a, b, and c rather than with the row values from those columns. I'm also not sure pivot_longer() can group by the ID column as I wish, except maybe through piping. Any help is much appreciated.
Edit: I've realized my issue: the third positional argument of pivot_longer() is names_to, so "newvar" above was taken as the name column. I need to pass it to values_to instead:
pivot_longer(df, c("a", "b", "c"), values_to = "newvar")
This code mostly accomplishes what I need.
Set the inputs of pivot_longer() correctly, namely cols and values_to. cols = ... defines the columns you are taking the values from; values_to = ... names the new column into which those values are written. You were actually doing fine: pivot_longer() always also returns the names of the columns the values came from, unless you try trickier things.
library(tidyverse)

df = data.frame(
  a  = c("string1", "string2"),
  b  = c("string11", "string12"),
  c  = c("string21", "string22"),
  ID = c("1111", "2222")
)

df %>%
  pivot_longer(cols = names(df)[1:3],
               values_to = "newvar") %>%
  select(newvar, ID)
Output:
# A tibble: 6 x 2
newvar ID
<chr> <chr>
1 string1 1111
2 string11 1111
3 string21 1111
4 string2 2222
5 string12 2222
6 string22 2222
Or with data.table:
library(data.table)

df = data.table(a = c("string1", "string2"),
                b = c("string11", "string12"),
                c = c("string21", "string22"),
                ID = c(1111, 2222))

df_final = melt(df,
                id.vars = "ID",
                measure.vars = c("a", "b", "c"),
                value.name = "newvar")[order(ID)][, c("ID", "newvar")]
Output:
> df_final
ID newvar
1: 1111 string1
2: 1111 string11
3: 1111 string21
4: 2222 string2
5: 2222 string12
6: 2222 string22
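A base R version is also possible. This sketch (my addition, assuming the plain data.frame df from the first answer) transposes the a/b/c block so the values come out row by row, matching the order above:
# read the values row-wise by transposing, then repeat each ID three times
out <- data.frame(newvar = as.vector(t(df[c("a", "b", "c")])),
                  ID = rep(df$ID, each = 3))
out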

Find all the records belonging to a particular ID that are decommissioned

I need to find the records which are decommissioned based on a date.
Below is the example data frame.
Input date: 2020-08-01 (YYYY-MM-DD)
df <- data.frame(
  cel  = c("cel12", "cel34", "cel05", "cel98", "cel67",
           "cel35", "cel05", "cel45", "cel12", "cel99", "cel45"),
  sect = c("sect56", "sect56", "sect56", "sect78", "sect78",
           "sect60", "sect51", "sect51",
           "sect98", "sect98", "sect98"),
  site = c("site14", "site14", "site08", "site08", "site08",
           "site89", "site89", "site08", "site24",
           "site24", "site36"),
  decomdate = as.Date(c("2020-02-01", "2020-03-01", "2020-12-01", "2020-05-01",
                        NA, NA, "2020-12-01", "2020-07-01", "2020-06-01", NA, NA))
)
If all the 'cel' in a particular 'sect' at a particular 'site' are decommissioned (i.e. decomdate < input date), then that 'sect' is decommissioned.
Expected Output: sect column with decommissioned sects
sect
sect56
sect51
For each sect and site combination we can check whether all the decomdate values are earlier than the input date. (Grouping by sect alone would not give the expected output: sect56 also contains cel05 at site08, whose decomdate is after the input date. Groups containing NA yield NA from all() and are dropped by filter().)
library(dplyr)
input <- as.Date('2020-08-01')

df %>% group_by(sect, site) %>% filter(all(input > decomdate))
#  cel   sect   site   decomdate
#  <chr> <chr>  <chr>  <date>
#1 cel12 sect56 site14 2020-02-01
#2 cel34 sect56 site14 2020-03-01
#3 cel45 sect51 site08 2020-07-01
To get only sect back we can use distinct (after ungrouping, since distinct keeps the grouping columns):
df %>%
  group_by(sect, site) %>%
  filter(all(input > decomdate)) %>%
  ungroup() %>%
  distinct(sect)
#  sect
#  <chr>
#1 sect56
#2 sect51
This can also be done in base R with subset and ave:
unique(subset(df, ave(input > decomdate, sect, site, FUN = all), select = sect))
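For completeness, a data.table sketch of the same sect/site check (my addition; isTRUE() guards the groups whose decomdate contains NA):
library(data.table)

dt <- as.data.table(df)
# one TRUE/FALSE per sect/site combination
decom <- dt[, .(done = isTRUE(all(input > decomdate))), by = .(sect, site)]
unique(decom[done == TRUE, sect])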

Create a new column based on partial string matches, without repeats

I have a dataframe with 2 columns, GL and GLDESC, and want to add a 3rd column called KIND based on some of the data inside column GLDESC.
DF:
GL GLDESC
1 515100 Payroll-ISL
2 515900 Payroll-ICA
3 532300 Bulk Gas
4 551000 Supply AB
5 551000 Supply XPTO
6 551100 Supply AB
7 551300 Intern
For each row of the data table:
If GLDESC contains the word Payroll anywhere in the string then I want KIND to be Payroll.
If GLDESC contains the word Supply anywhere in the string then I want KIND to be Supply.
In all other cases I want KIND to be Other.
Then, I found this:
DF$KIND <- ifelse(grepl("supply", DF$GLDESC, ignore.case = TRUE), "Supply",
                  ifelse(grepl("payroll", DF$GLDESC, ignore.case = TRUE), "Payroll", "Other"))
But with that, everything that matches Supply, for example, gets classified. However, as in DF lines 4 and 5, the same GL can have two Supply rows, which for me is unnecessary: I need only the first GLDESC of a kind to be matched when it repeats within the same GL.
Edit: I cannot delete any rows. I want this as the output:
GL GLDESC KIND
A Supply1 Supply
A Supply2 N/A
A Supply3 N/A
A Supply4 N/A
A Supply5 N/A
A Supply6 N/A
A Payroll1 Payroll
B Supply2 Supply
B Payroll Payroll
If we need the repeating elements to be NA, use duplicated on 'GLDESC' to get a logical vector and assign those elements of the KIND column created with ifelse to NA:
DF$KIND[duplicated(DF$GLDESC)] <- NA_character_
If we need to change the values by a grouping variable
library(dplyr)

DF %>%
  group_by(GL) %>%
  mutate(KIND = replace(KIND, duplicated(KIND) & KIND == "Supply", NA_character_))
# A tibble: 9 x 3
# Groups: GL [2]
# GL GLDESC KIND
# <chr> <chr> <chr>
#1 A Supply1 Supply
#2 A Supply2 <NA>
#3 A Supply3 <NA>
#4 A Supply4 <NA>
#5 A Supply5 <NA>
#6 A Supply6 <NA>
#7 A Payroll1 Payroll
#8 B Supply2 Supply
#9 B Payroll Payroll
Or with the full set of changes:
library(stringr)

DF1 %>%
  mutate(KIND = str_remove(GLDESC, "\\d+"),
         KIND = replace(KIND, !KIND %in% c("Supply", "Payroll"), "Other")) %>%
  group_by(GL) %>%
  mutate(KIND = replace(KIND, duplicated(KIND) & KIND == "Supply", NA_character_))
data
DF1 <- structure(list(GL = c("A", "A", "A", "A", "A", "A", "A", "B",
"B"), GLDESC = c("Supply1", "Supply2", "Supply3", "Supply4",
"Supply5", "Supply6", "Payroll1", "Supply2", "Payroll")), row.names = c(NA,
-9L), class = "data.frame")
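A base R sketch of the same grouped replacement (my addition), using duplicated() over the GL/KIND pair:
DF1$KIND <- sub("\\d+$", "", DF1$GLDESC)
DF1$KIND[!DF1$KIND %in% c("Supply", "Payroll")] <- "Other"
# blank out repeated Supply rows within each GL
dup <- duplicated(data.frame(DF1$GL, DF1$KIND))
DF1$KIND[dup & DF1$KIND == "Supply"] <- NA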

Select duplicates from two columns by row and create a new variable in R

I have a data frame with various duplicates of ID and date. I want to detect the duplicates in one column that are also duplicated in the other, so that I can:
1. Remove the rows with a duplicate id, a duplicate datee, and a missing T (the second record in this table).
2. Then, where there is a duplicate id and a duplicate datee, choose the row with T == "high".
id<-c("a", "a", "a", "a", "b", "c")
datee<-c("12/02/10", "12/02/10", "12/02/10","10/03/11", "10/04/18","1/04/18" )
T<-c("high", NA, "low","high", "low", "medium")
mydata<-data.frame(id, datee, T)
It looks like this:
id datee T
a 12/02/10 high
a 12/02/10 <NA>
a 12/02/10 low
a 10/03/11 high
b 10/04/18 low
c 1/04/18 medium
A step-by-step solution.
Step 1 - Remove missing values
mydata <- mydata[!is.na(mydata[, 3]), ]
Step 2 - Identify rows duplicated on ID
dup_rows_ID <- duplicated(mydata[, 1], fromLast = TRUE) |
  duplicated(mydata[, 1], fromLast = FALSE)
mydata_dup <- mydata[dup_rows_ID, ]
Step 3 - Identify rows duplicated on ID and datee
dup_rows_ID_datee <- duplicated(mydata_dup[, c(1, 2)], fromLast = TRUE) |
  duplicated(mydata_dup[, c(1, 2)], fromLast = FALSE)
Step 4 - Among those duplicates, keep T == "high"
mydata_dup2 <- mydata_dup[!dup_rows_ID_datee | mydata_dup$T == "high", ]
Your output
rbind(mydata_dup2, mydata[!dup_rows_ID, ])
id datee T
1 a 12/02/10 high
4 a 10/03/11 high
5 b 10/04/18 low
6 c 1/04/18 medium
Note that for id == "a" there are two dates with T == "high"; you have to choose whether you want the one with the higher or the lower datee.
You can first do this:
# flag, per column, which values are duplicates / missing
is_duplicate <- lapply(X = mydata, FUN = duplicated, incomparables = FALSE)
is_na <- lapply(X = mydata, FUN = is.na)
and combine these logical vectors to, for example, remove the rows with a duplicate id, a duplicate datee, and a missing T, like this:
drop_idx <- is_duplicate$id & is_duplicate$datee & is_na$T
mydata[!drop_idx, ]
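If dplyr is an option, both rules can also be expressed as grouped filters. A sketch (my addition; it assumes the T == "high" rule only applies where id and datee are duplicated):
library(dplyr)

mydata %>%
  group_by(id, datee) %>%
  # rule 1: drop rows whose id/datee combination repeats and whose T is missing
  filter(!(n() > 1 & is.na(T))) %>%
  # rule 2: among the remaining duplicates, keep the T == "high" row
  filter(n() == 1 | T == "high") %>%
  ungroup()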