I have this dataframe:
a <- c(1,2,3,4,5)
b <- c(100, 300, NA, 430, 270)
c <- c('2018-02-01', '2020-02-17', '2019-10-24', '2019-10-24', '2020-02-17')
df <- data.frame(a,b,c)
names(df) <- c('id', "value", "Date")
All I want is to sort Date and value in descending order. I found that Date can be ordered using rev(), but with NA the order is not correct:
df[rev(order(df$Date, df$value)), ]
The result that I want should be like this:
id value Date
2 2 300 2020-02-17
5 5 270 2020-02-17
4 4 430 2019-10-24
3 3 NA 2019-10-24
1 1 100 2018-02-01
Any suggestions on how to deal with the NA in this dataframe without dropping it?
You can try dplyr functions, since arrange() sorts NA values to the end by default:
library(dplyr)

df %>%
  arrange(desc(Date), desc(value))
gives:
id value Date
1 2 300 2020-02-17
2 5 270 2020-02-17
3 4 430 2019-10-24
4 3 NA 2019-10-24
5 1 100 2018-02-01
Using na.last = FALSE should do the trick:
df[rev(order(df$Date, df$value, na.last = FALSE)),]
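For reference, this reproduces the desired output from the question. A closely related single call, as a sketch assuming the radix method's documented na.last/decreasing behaviour (the same engine the data.table answer below uses), is:
# Same idea without rev(): sort both keys descending and push NAs last within ties
df[order(df$Date, df$value, decreasing = TRUE, na.last = TRUE, method = "radix"), ]
#   id value       Date
# 2  2   300 2020-02-17
# 5  5   270 2020-02-17
# 4  4   430 2019-10-24
# 3  3    NA 2019-10-24
# 1  1   100 2018-02-01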
If you want to use the data.table package, you can do it like this:
library(data.table)
setorderv(df, cols = c("Date", "value"), order = -1, na.last = TRUE)  # sorts df in place, by reference
I have 2 tables
df1 = data.frame("dates" = c(seq(as.Date("2020-1-1"), as.Date("2020-1-10"), by = "days")))
df2 = data.frame("observations" = c("a", "b", "c", "d"), "start" = as.Date(c("2019-12-30", "2020-1-1", "2020-1-5","2020-1-10")), "end"=as.Date(c("2020-1-3", "2020-1-2", "2020-1-12","2020-1-14")))
I would like to know the number of observation periods that occur on each day of df1, based on the start/stop dates in df2. E.g. on 1/1/2020, observations a and b were in progress, hence "2".
The expected output would be as follows:
        dates number
1  2020-01-01      2
2  2020-01-02      2
3  2020-01-03      1
4  2020-01-04      0
5  2020-01-05      1
6  2020-01-06      1
7  2020-01-07      1
8  2020-01-08      1
9  2020-01-09      1
10 2020-01-10      2
I've tried using sum():
df1$number = sum(as.Date(df2$start) <= df1$dates & as.Date(df2$end)>=df1$dates)
But that compares the entire columns at once and returns a single total.
I've then tried to create a custom function for this:
df1$number = apply(df1, 1, function(x) sum(df2$start <= x & df2$end>=x))
But it returns an NA value.
I then tried to embed an ifelse() within it, but got the same issue with NAs:
apply(df1, 1, function(x) sum(ifelse(df2$start <= x & df2$end>=x, 1, 0)))
Can anyone suggest what the issue is? Thanks!
Edit: an interval join was suggested, which is not what I'm trying to get; I think naming the observations with a numeric label was what caused confusion. I am trying to find the TOTAL number of observations with periods that fall within each day, as opposed to a 1:1 match.
Define the comparison in a function f and pass it to outer(); rowSums() of the resulting logical matrix is what you're looking for.
# f(x, y): is the x-th date of df1 inside the y-th start/end interval of df2?
f <- \(x, y) df1[x, 1] >= df2[y, 2] & df1[x, 1] <= df2[y, 3]
cbind(df1, number=rowSums(outer(1:nrow(df1), 1:nrow(df2), f)))
# dates number
# 1 2020-01-01 2
# 2 2020-01-02 2
# 3 2020-01-03 1
# 4 2020-01-04 0
# 5 2020-01-05 1
# 6 2020-01-06 1
# 7 2020-01-07 1
# 8 2020-01-08 1
# 9 2020-01-09 1
# 10 2020-01-10 2
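If you'd rather stay close to your original attempt: the apply() calls most likely fail because apply() coerces the data frame to a character matrix (and sweeps in any number column added by an earlier attempt), so the date comparisons misbehave. Looping over row indices instead keeps the Date class intact — a minimal base R sketch:
# count, for each day, how many start/end intervals of df2 contain it
df1$number <- vapply(
  seq_len(nrow(df1)),
  function(i) sum(df2$start <= df1$dates[i] & df2$end >= df1$dates[i]),
  numeric(1)
)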
Here is a potential solution using dplyr/tidyverse functions and the %within% function from the lubridate package. This approach is similar to Left Join Subset of Column Based on Date Interval, but with some important differences: use summarise() instead of filter() to avoid 'losing' dates where number == 0, and join with by = character() (a cross join), as there are no common columns between the datasets:
library(dplyr)
library(lubridate)
df1 = data.frame("dates" = c(seq(as.Date("2020-1-1"),
as.Date("2020-1-10"),
by = "days")))
df2 = data.frame("observations" = c("1", "2", "3", "4"),
"start" = as.Date(c("2019-12-30", "2020-1-1", "2020-1-5","2020-1-10")),
"end"=as.Date(c("2020-1-3", "2020-1-2", "2020-1-12","2020-1-14")))
df1 %>%
  full_join(df2, by = character()) %>%
  mutate(number = dates %within% interval(start, end)) %>%
  group_by(dates) %>%
  summarise(number = sum(number))
#> # A tibble: 10 × 2
#> dates number
#> <date> <dbl>
#> 1 2020-01-01 2
#> 2 2020-01-02 2
#> 3 2020-01-03 1
#> 4 2020-01-04 0
#> 5 2020-01-05 1
#> 6 2020-01-06 1
#> 7 2020-01-07 1
#> 8 2020-01-08 1
#> 9 2020-01-09 1
#> 10 2020-01-10 2
Created on 2022-06-27 by the reprex package (v2.0.1)
Does this approach work with your actual data?
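A small version note: in dplyr >= 1.1.0, joining with by = character() is superseded, and cross_join() is the replacement, so the same pipeline would read:
df1 %>%
  cross_join(df2) %>%
  mutate(number = dates %within% interval(start, end)) %>%
  group_by(dates) %>%
  summarise(number = sum(number))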
I'm fairly new to R and struggling to get this. The type of problem I'm trying to address involves one data frame containing books and the start and end pages of particular chapters.
book <- c("Dune", "LOTR", "LOTR", "OriginOfSpecies", "OldManSea")
chapt.start <- c(300, 8, 94, 150, 600)
chapt.end <- c(310, 19, 110, 158, 630)
df1 <- data.frame(book, chapt.start, chapt.end)
df1
book chapt.start chapt.end
1 Dune 300 310
2 LOTR 8 19
3 LOTR 94 110
4 OriginOfSpecies 150 158
5 OldManSea 600 630
My second dataframe contains a list of book titles and a single page.
title <- c("LOTR", "LOTR", "LOTR", "OriginOfSpecies", "OldManSea", "OldManSea")
page <- c(4, 12, 30, 200, 620, 650)
df2 <- data.frame(title, page)
df2
title page
1 LOTR 4
2 LOTR 12
3 LOTR 30
4 OriginOfSpecies 200
5 OldManSea 620
6 OldManSea 650
What I'm trying to ask is, for each row in df1, whether df2 contains any row with the corresponding book title and a page within the chapter, i.e. df2$title == df1$book and df2$page > df1$chapt.start and df2$page < df1$chapt.end.
The desired output for these data would be FALSE, TRUE, FALSE, FALSE, TRUE
Is this best approached as some kind of for/ifelse loop, sapply, or something different? Thanks for your help!
This is a range-based join. There are three good ways to do this in R. All of these return the page number itself instead of TRUE/FALSE; it should be straightforward to convert to logical with something like !is.na(page).
sqldf
library(sqldf)
sqldf(
"select df1.*, df2.page
from df1
left join df2 on df1.book=df2.title
and df2.page between df1.[chapt.start] and df1.[chapt.end]")
# book chapt.start chapt.end page
# 1 Dune 300 310 NA
# 2 LOTR 8 19 12
# 3 LOTR 94 110 NA
# 4 OriginOfSpecies 150 158 NA
# 5 OldManSea 600 630 620
fuzzyjoin
(Edited out, see @IanCampbell's answer.)
data.table
library(data.table)
DT1 <- as.data.table(df1)
DT2 <- as.data.table(df2)
DT2[, p2 := page][DT1, on = .(title == book, p2 >= chapt.start, p2 <= chapt.end)]
# title page p2 p2.1
# <char> <num> <num> <num>
# 1: Dune NA 300 310
# 2: LOTR 12 8 19
# 3: LOTR NA 94 110
# 4: OriginOfSpecies NA 150 158
# 5: OldManSea 620 600 630
The reason I add p2 as a copy of page is that in a data.table range-join the left table's (inequality) join columns are replaced with the bounds from the right (or something like that), so without the copy we'd lose that bit of info.
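If the TRUE/FALSE vector from the question is what you ultimately want, the !is.na(page) conversion mentioned above works on this join result too, for example:
res <- DT2[, p2 := page][DT1, on = .(title == book, p2 >= chapt.start, p2 <= chapt.end)]
res[, !is.na(page)]
# [1] FALSE  TRUE FALSE FALSE  TRUE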
You're looking for a non-equi join. This can be accomplished in many ways, but I prefer the fuzzyjoin package:
library(fuzzyjoin)
fuzzy_left_join(df1, df2,
                by = c("book" = "title", "chapt.start" = "page", "chapt.end" = "page"),
                match_fun = c(`==`, `<=`, `>=`))
book chapt.start chapt.end title page
1 Dune 300 310 <NA> NA
2 LOTR 8 19 LOTR 12
3 LOTR 94 110 <NA> NA
4 OriginOfSpecies 150 158 <NA> NA
5 OldManSea 600 630 OldManSea 620
From here it's easy to get to the desired output:
library(dplyr)
fuzzy_left_join(df1, df2,
                by = c("book" = "title", "chapt.start" = "page", "chapt.end" = "page"),
                match_fun = c(`==`, `<=`, `>=`)) %>%
  mutate(result = !is.na(page)) %>%
  select(-c(title, page))
book chapt.start chapt.end result
1 Dune 300 310 FALSE
2 LOTR 8 19 TRUE
3 LOTR 94 110 FALSE
4 OriginOfSpecies 150 158 FALSE
5 OldManSea 600 630 TRUE
Using dplyr only, i.e. without purrr or fuzzyjoin:
df2 %>%
  right_join(df1 %>% mutate(id = row_number()), by = c("title" = "book")) %>%
  group_by(id, title) %>%
  summarise(desired = ifelse(is.na(as.logical(sum(chapt.start <= page & page <= chapt.end))),
                             FALSE,
                             as.logical(sum(chapt.start <= page & page <= chapt.end))))
# A tibble: 5 x 3
# Groups: id [5]
id title desired
<int> <chr> <lgl>
1 1 Dune FALSE
2 2 LOTR TRUE
3 3 LOTR FALSE
4 4 OriginOfSpecies FALSE
5 5 OldManSea TRUE
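The double sum()/ifelse() can be avoided: any() with na.rm = TRUE returns FALSE both when nothing matches and when page is NA (the unmatched Dune row), so a leaner sketch of the same idea is:
df2 %>%
  right_join(df1 %>% mutate(id = row_number()), by = c("title" = "book")) %>%
  group_by(id, title) %>%
  summarise(desired = any(chapt.start <= page & page <= chapt.end, na.rm = TRUE))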
Another approach, using purrr, without joining the data.
Create a logical check variable for df1
library(dplyr)
library(purrr)
# This function is designed to take ..., which is a row of data from pmap(),
# and then look up whether any record in page_df (df2 here) matches the conditions
look_up_check_df1 <- function(..., page_df) {
  book_record <- tibble(...)
  any_record <- page_df %>%
    filter(title == book_record[["book"]],
           page >= book_record[["chapt.start"]],
           page <= book_record[["chapt.end"]])
  nrow(any_record) > 0
}
df1$check <- pmap_lgl(df1, look_up_check_df1, page_df = df2)
df1
#> book chapt.start chapt.end check
#> 1 Dune 300 310 FALSE
#> 2 LOTR 8 19 TRUE
#> 3 LOTR 94 110 FALSE
#> 4 OriginOfSpecies 150 158 FALSE
#> 5 OldManSea 600 630 TRUE
The same logic, just applied to df2:
# If the check is for df2, the function just needs a slight revision
look_up_check <- function(..., book_chapters_df) {
  page_record <- tibble(...)
  any_record <- book_chapters_df %>%
    filter(book == page_record[["title"]],
           chapt.start <= page_record[["page"]],
           chapt.end >= page_record[["page"]])
  nrow(any_record) > 0
}
# Run pmap_lgl(), which passes each row of df2 into look_up_check()
# and returns a vector of logical TRUE/FALSE
df2$check <- pmap_lgl(df2, look_up_check, book_chapters_df = df1)
df2
#> title page check
#> 1 LOTR 4 FALSE
#> 2 LOTR 12 TRUE
#> 3 LOTR 30 FALSE
#> 4 OriginOfSpecies 200 FALSE
#> 5 OldManSea 620 TRUE
#> 6 OldManSea 650 FALSE
Created on 2021-04-12 by the reprex package (v1.0.0)
I have the following data frame in data.table:
library(data.table)

df <- data.table(id = c(1,1,2,2,3,3,4,4),
                 date = c("2013-11-22","2017-01-24","2017-06-24","2020-02-10",
                          "2011-01-03","2013-11-24","2015-01-24","2017-08-24"),
                 status = c("Former","Current","Former","Never","Current",NA,"Current","Former"))
df
id date status
1: 1 2013-11-22 Former
2: 1 2017-01-24 Current
3: 2 2017-06-24 Former
4: 2 2020-02-10 Never
5: 3 2011-01-03 Current
6: 3 2013-11-24 <NA>
7: 4 2015-01-24 Current
8: 4 2017-08-24 Former
I want to create a unique row per id with the following logic. The latest date should be kept. If the status at the latest date is <NA> or Never and there was another status at an earlier date, then the row with the earlier date should be kept.
I solved this with the following functions:
library(anytime)  # anydate() comes from the anytime package

unique1 <- df[df$status %in% c("Former", "Current"), ]
unique1 <- unique1[, .SD[which.max(anydate(date))], by = .(id)]
unique_final <- unique(df[order(id, ordered(status, c("Former", "Current", "Never", NA)))], by = 'id')
unique_final[match(unique1$id, unique_final$id), ] <- unique1
and get these results
id date status
1: 1 2017-01-24 Current
2: 2 2017-06-24 Former
3: 3 2011-01-03 Current
4: 4 2017-08-24 Former
Is there a way to combine these two logical subsetting steps? I would like to avoid creating a new data frame and then matching them.
I am working with data.table, and a solution that scales to a larger data set would be great.
Thanks!
Could try:
library(data.table)
df[, .SD[
if (all(status %in% c(NA, 'Never'))) .N
else max(which(!status %in% c(NA, 'Never')))
], by = id]
Output:
id date status
1: 1 2017-01-24 Current
2: 2 2017-06-24 Former
3: 3 2011-01-03 Current
4: 4 2017-08-24 Former
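Note that this relies on the rows already being sorted by date within each id, as in the example data; if that is not guaranteed, order first:
setorder(df, id, date)  # ISO-formatted date strings sort correctly even as character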
Here is a dplyr-based solution. It recodes status so that Current and Former share the same level, then sorts and takes the first row for each id:
library(dplyr)
library(data.table)
df <- data.table(id=c(1,1,2,2,3,3,4,4),
date=c("2013-11-22","2017-01-24","2017-06-24","2020-02-10","2011-01-03","2013-11-24","2015-01-24","2017-08-24"),
status=c("Former","Current","Former","Never","Current",NA,"Current","Former"))
df %>%
  mutate(
    status = factor(status, levels = c("Never", "Former", "Current")),
    status2 = forcats::fct_recode(status, "Current" = "Former")
  ) %>%
  group_by(id) %>%
  arrange(desc(status2), desc(date)) %>%
  select(-status2) %>%
  slice(1)
#> # A tibble: 4 x 3
#> # Groups: id [4]
#> id date status
#> <dbl> <chr> <fct>
#> 1 1 2017-01-24 Current
#> 2 2 2017-06-24 Former
#> 3 3 2011-01-03 Current
#> 4 4 2017-08-24 Former
Created on 2020-08-29 by the reprex package (v0.3.0)
Here is a base R option using subset() + ave():
subset(
  df[!df$status %in% c(NA, "Never"), ],
  as.logical(ave(date, id, FUN = function(x) x == max(x)))
)
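This returns the same four rows as above. Note that date is compared as character here, which works because ISO "YYYY-MM-DD" strings sort in date order:
#    id       date  status
# 1:  1 2017-01-24 Current
# 2:  2 2017-06-24  Former
# 3:  3 2011-01-03 Current
# 4:  4 2017-08-24  Former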
I have df1 sorted by date like this:
Date <- c("12/17/17","12/19/17","12/20/17","12/30/17","12/31/17","1/1/18")
Jon <- c(388,299,412,NA,NA,353)
Eric <- c(121,NA,321,473,832,NA)
Scott <- c(NA,122,NA,NA,NA,424)
df1 <- data.frame(Date,Jon,Eric,Scott)
df1$Date <- as.Date(df1$Date,format='%m/%d/%y')
# df1 (as entered, before the Date conversion):
Date Jon Eric Scott
1 12/17/17 388 121 NA
2 12/19/17 299 NA 122
3 12/20/17 412 321 NA
4 12/30/17 NA 473 NA
5 12/31/17 NA 832 NA
6 1/1/18 353 NA 424
I'm trying to create a new list that includes only the data that is within the last 12 days of each person's most recent date with a non-NA value. If there is only one non-NA value within 12 days of the person's most recent non-NA value, then I want to take the 2 most recent non-NA values for that person, even if one falls outside of the 12 day date range.
The code below successfully puts data within the last 12 days of each person's most recent non-NA value in a new list:
df2 <- lapply(df1[-1], function(x) x[which((m <- tail(df1$Date[!is.na(x)], 1) - df1$Date) >= 0 & m <= 12)])
This code successfully takes the 2 most recent non-NA entries, regardless of whether or not it's within the 12 day range:
df3 <- lapply(df1[-1], function(x) tail(x[!is.na(x)], n = 2))
This code comes very close to doing what I want it to do, except it loses the column names. Notice that the column names are replaced with numbers, unlike the lapply statements above, which both keep the column names.
withinRange <- lapply(df1[-1], function(x) x[which((m <- tail(df1$Date[!is.na(x)], 1) - df1$Date) >= 0 & m <= 12)]) %>%
  lapply(function(x) length(x[!is.na(x)])) %>%
  as.data.frame()
df4 <- ifelse(withinRange[colnames(df1[-1])] > 1,
              lapply(df1[-1], function(x) x[which((m <- tail(df1$Date[!is.na(x)], 1) - df1$Date) >= 0 & m <= 12)]),
              lapply(df1[-1], function(x) tail(x[!is.na(x)], n = 2)))
How can I maintain the column names?
I would approach this problem using the tidyverse packages.
Data
library(tidyr)
library(dplyr)
library(lubridate)
df <- tibble(
my_date = as.Date(
c("12/17/17", "12/19/17", "12/20/17", "12/30/17", "12/31/17", "1/1/18"),
"%m/%d/%y"
),
jon = c(388, 299, 412, NA, NA, 353),
eric = c(121, NA, 321, 473, 832, NA),
scott = c(NA, 122, NA, NA, NA, 424)
)
Long format data frame
This output feels more natural.
df_long <- df %>%
  gather(key, value, -my_date) %>%
  drop_na %>%
  group_by(key) %>%
  mutate(
    in_date = if_else(my_date >= max(my_date) - days(12), TRUE, FALSE),
    count = sum(in_date)
  ) %>%
  filter(in_date | count < 2) %>%
  top_n(2, my_date) %>%
  ungroup %>%
  select(-c(in_date, count))
df_long
# # A tibble: 6 x 3
# my_date key value
# <date> <chr> <dbl>
# 1 2017-12-20 jon 412
# 2 2018-01-01 jon 353
# 3 2017-12-30 eric 473
# 4 2017-12-31 eric 832
# 5 2017-12-19 scott 122
# 6 2018-01-01 scott 424
Wide format
Thankfully, it is only one additional step to spread to your original columns.
df_long %>% spread(key, value)
# # A tibble: 5 x 4
# my_date eric jon scott
# * <date> <dbl> <dbl> <dbl>
# 1 2017-12-19 NA NA 122
# 2 2017-12-20 NA 412 NA
# 3 2017-12-30 473 NA NA
# 4 2017-12-31 832 NA NA
# 5 2018-01-01 NA 353 424
Seems like the easiest thing to do for me is to store the column headers in a variable and then reattach them:
myHeaders <- names(df1[-1])
withinRange <- lapply(df1[-1], function(x) x[which((m <- tail(df1$Date[!is.na(x)], 1) - df1$Date) >= 0 & m <= 12)]) %>%
  lapply(function(x) length(x[!is.na(x)])) %>%
  as.data.frame()
df4 <- ifelse(withinRange[colnames(df1[-1])] > 1,
              lapply(df1[-1], function(x) x[which((m <- tail(df1$Date[!is.na(x)], 1) - df1$Date) >= 0 & m <= 12)]),
              lapply(df1[-1], function(x) tail(x[!is.na(x)], n = 2)))
names(df4) <- myHeaders
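For what it's worth, the root cause is that ifelse() shapes its result on its test argument, so the names of the yes/no lists are dropped. A sketch that avoids the renaming step entirely: build both candidate lists, then pick per column with Map(), which keeps the names of its first argument:
withinDates <- lapply(df1[-1], function(x)
  x[which((m <- tail(df1$Date[!is.na(x)], 1) - df1$Date) >= 0 & m <= 12)])
lastTwo <- lapply(df1[-1], function(x) tail(x[!is.na(x)], n = 2))
# keep the in-range values when more than one is non-NA, else the last two non-NA values
df4 <- Map(function(w, l2) if (sum(!is.na(w)) > 1) w else l2, withinDates, lastTwo)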
I'm trying to create a column in a dataset that tells me the (approximate) number of months a customer has been with the company.
This is my current attempt:
dat <- data.frame(ID = c(1:4), start.date = as.Date(c('2015-04-09', '2014-03-24', '2016-07-01', '2011-02-02')))
dat$months.customer <- apply(dat[2], 1, function(x) (as.numeric(Sys.Date()) - as.numeric(x)) / 30)
It's returning all NAs.
You can use difftime:
dat$months.customer <-
as.numeric(floor(difftime(Sys.Date(),dat$start.date,units="days")/30))
# ID start.date months.customer
# 1 1 2015-04-09 16
# 2 2 2014-03-24 29
# 3 3 2016-07-01 1
# 4 4 2011-02-02 67
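As for why the apply() version returns all NAs: apply() coerces dat[2] to a character matrix, and as.numeric() of a string like "2015-04-09" is NA. Working on the Date column directly avoids the problem; a minimal sketch:
# Date subtraction gives a difftime in days; divide by 30 for approximate months
dat$months.customer <- floor(as.numeric(Sys.Date() - dat$start.date) / 30)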