Merging two dataframes with different size

I have two data frames, each with two columns: one for the date and one for numeric data. The two data frames have different sizes. Below is an example of what I have and what I need.
This is what I have:
DF1
2015-01-02 0
2015-01-03 0
2015-01-04 0
DF2
2015-01-03 200
This is what I need:
DF1
2015-01-02 0
2015-01-03 200
2015-01-04 0
I have tried comparing the two data frames (with the compare function), but I have not found a solution.
In case it helps (or makes the functions faster): the dates are sorted in both data frames.
Could someone help me?
Thank you very much,
Gobya

It's not clear how you want to choose which row to keep when both data frames contain the same date (per @user295691's comment), so I've provided two selection options below; both give the result you specified.
DF1 <- data.frame(date = c("2015-01-02", "2015-01-03", "2015-01-04"),
                  value = c(0, 0, 0), stringsAsFactors = FALSE)
DF2 <- data.frame(date = c("2015-01-03"), value = c(200), stringsAsFactors = FALSE)
DF1$source <- "DF1"
DF2$source <- "DF2"
library(dplyr)
# Option 1: keep the greatest value for each date
newDF <- DF1 %>% bind_rows(DF2) %>%
  group_by(date) %>%
  filter(value == max(value))
# Option 2: if a date has more than one row,
# keep the row(s) from DF2 for that date
newDF <- DF1 %>% bind_rows(DF2) %>%
  group_by(date) %>%
  mutate(n = n()) %>%
  filter(ifelse(n > 1, source == "DF2", source == "DF1")) %>%
  select(-n)
FYI, for the second approach, I thought the following would work, but it excludes the rows with date 2015-01-03. I'm not sure why and would be interested in any ideas about what's going wrong:
DF1 %>% bind_rows(DF2) %>%
  group_by(date) %>%
  filter(ifelse(n() > 1, source == "DF2", source == "DF1"))
        date value source
1 2015-01-02     0    DF1
2 2015-01-04     0    DF1
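One likely explanation, offered as a sketch rather than a confirmed diagnosis: ifelse() returns a result with the same length as its test. Inside filter(), n() > 1 is a single TRUE or FALSE per group, so ifelse() keeps only the first element of source == "DF2". For the 2015-01-03 group that first element belongs to the DF1 row and is FALSE, and filter() recycles it across the whole group, so both rows are dropped. In the working version n is a column, so the test has one element per row. The base R snippet below reproduces the effect outside dplyr; src and n_rows are hypothetical stand-ins for the source column and the row count of that group.

```r
# Hypothetical stand-in for the `source` column of the 2015-01-03 group
src <- c("DF1", "DF2")
n_rows <- length(src)

# Scalar test: ifelse() returns only the first element of the chosen branch
scalar_test <- ifelse(n_rows > 1, src == "DF2", src == "DF1")

# Vector test (what mutate(n = n()) provides): one result per row
vector_test <- ifelse(rep(n_rows, n_rows) > 1, src == "DF2", src == "DF1")

# A plain if/else keeps the whole vector and also avoids the problem
if_else_form <- if (n_rows > 1) src == "DF2" else src == "DF1"

print(scalar_test)   # a single FALSE
print(vector_test)   # FALSE TRUE
print(if_else_form)  # FALSE TRUE
```

Under this reading, writing the condition as if (n() > 1) source == "DF2" else source == "DF1" inside filter() should behave like the mutate(n = n()) version.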

Using full_join() from the dplyr package:
library(dplyr)
DF1 <- data.frame(date = c("2015-01-02", "2015-01-03", "2015-01-04"),
                  number = c(0, 0, 0))
DF2 <- data.frame(date = c("2015-01-03"), number = c(200))
DF3 <- full_join(DF1, DF2, by = "date")
# Take the DF2 value where one exists, otherwise keep the DF1 value
DF3$newColumn <- ifelse(is.na(DF3$number.y), DF3$number.x, DF3$number.y)

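# Assumes both data frames have columns named V1 (date) and V2 (value)
newdf <- merge(DF1, DF2, by = 'V1', all = TRUE)
# Overwrite the DF1 value wherever DF2 supplies one
has_y <- !is.na(newdf[, 3])
newdf[has_y, 2] <- newdf[has_y, 3]
newdf[-3]
#           V1 V2.x
# 1 2015-01-02    0
# 2 2015-01-03  200
# 3 2015-01-04    0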
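For completeness, the same update can be sketched in base R without a merge, by looking up each DF2 date in DF1 with match(). This is a minimal sketch that assumes the date/value column names from the first answer and that each date appears at most once in DF1; idx and found are illustrative names.

```r
DF1 <- data.frame(date = c("2015-01-02", "2015-01-03", "2015-01-04"),
                  value = c(0, 0, 0), stringsAsFactors = FALSE)
DF2 <- data.frame(date = "2015-01-03", value = 200, stringsAsFactors = FALSE)

# Position in DF1 of each DF2 date (NA when the date is absent from DF1)
idx <- match(DF2$date, DF1$date)
found <- !is.na(idx)

# Overwrite only the matched rows; unmatched DF2 rows are ignored
DF1$value[idx[found]] <- DF2$value[found]
print(DF1)
```

Unlike the merge-based approaches, this never adds rows to DF1, so dates that exist only in DF2 are dropped rather than appended.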

Related

How to subtract using max(date) and second latest (month) date

I'm trying to create a new variable which equals the latest month's value minus the previous month's (or 3 months prior, etc.).
A quick df:
country <- c("XYZ", "XYZ", "XYZ")
my_dates <- c("2021-10-01", "2021-09-01", "2021-08-01")
var1 <- c(1, 2, 3)
df1 <- country %>% cbind(my_dates) %>% cbind(var1) %>% as.data.frame()
df1$my_dates <- as.Date(df1$my_dates)
df1$var1 <- as.numeric(df1$var1)
For example, I've tried (partially from: How to subtract months from a date in R?)
library(tidyverse)
library(lubridate)  # provides %m-%
df2 <- df1 %>%
  mutate(dif_1month = var1[my_dates == max(my_dates)] - var1[my_dates == max(my_dates) %m-% months(1)])
I've also tried different variations of using lag():
df2 <- df1 %>%
  mutate(dif_1month = var1[my_dates == max(my_dates)] - var1[my_dates == max(my_dates) - lag(max(my_dates), n = 1L)])
Any suggestions on how to grab the value of a variable when dates equal the second latest observation?
Thanks for help, and apologies for not including any data. Can edit if necessary.
Edited with a few potential answers:
# This gives me the value of var1 at the latest date
df2 <- df1 %>%
  mutate(value_1month = var1[my_dates == max(my_dates)])
# This gives me the second latest date
df2 <- df1 %>%
  mutate(month1 = max(my_dates) %m-% months(1))
# This gives me the second to latest value
df2 <- df1 %>%
  mutate(var1_1month = var1[my_dates == max(my_dates) %m-% months(1)])
# This gives me the difference between the latest and the second to latest value of var1
df2 <- df1 %>%
  mutate(diff_1month = var1[my_dates == max(my_dates)] - var1[my_dates == max(my_dates) %m-% months(1)])

mutate() requires each new column to have either length 1 or the same length as the number of rows in the data. Subsetting can return a different length, so we may need ifelse or case_when:
library(dplyr)
library(lubridate)
df1 %>%
  mutate(diff_1month = case_when(my_dates == max(my_dates) ~
                                   my_dates %m-% months(1)))
NOTE: Without a reproducible example, the column types and values are not clear.
Based on the OP's update, we may arrange by date first, grab the last two values of 'var1', and take the difference:
df1 %>%
  arrange(my_dates) %>%
  mutate(dif_1month = diff(tail(var1, 2)))
    .   my_dates var1 dif_1month
1 XYZ 2021-08-01    3         -1
2 XYZ 2021-09-01    2         -1
3 XYZ 2021-10-01    1         -1
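The diff(tail(...)) approach compares the last two rows, whatever their dates are. If the comparison should instead be against the value exactly k calendar months before the latest date (the "or 3 months prior" case in the question), a base R sketch could look like the following. diff_months_back is a hypothetical helper, not part of any package, and it assumes first-of-month dates as in the example; seq() with a month step can roll over for end-of-month dates.

```r
df1 <- data.frame(
  country  = "XYZ",
  my_dates = as.Date(c("2021-10-01", "2021-09-01", "2021-08-01")),
  var1     = c(1, 2, 3)
)

# Hypothetical helper: latest value minus the value k months earlier
diff_months_back <- function(dates, values, k = 1) {
  latest <- max(dates)
  # seq() steps back k calendar months from the latest date
  earlier <- seq(latest, by = paste0("-", k, " month"), length.out = 2)[2]
  values[dates == latest] - values[dates == earlier]
}

dif_1month <- diff_months_back(df1$my_dates, df1$var1, k = 1)
dif_2month <- diff_months_back(df1$my_dates, df1$var1, k = 2)
print(c(dif_1month, dif_2month))
```

If no row exists for the earlier date, the helper returns a zero-length vector, so a caller would need to handle that case before assigning the result to a column.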

How to group_by more elegantly my data frame

I have two tables:
df1, where I know for sure that every user has used a feature called "Folder"
df2, where I don't know whether the users have used the "Folder" feature
I want to build a graph showing the number of users that used the Folder feature on each date.
So the main data frame I want to build has ALL the dates (from df1 and df2) and, for each date, the number of users who used the "Folder" feature.
Here is a reproducible example:
df1 <- data.frame(Date=c("2021-05-12","2021-05-15","2021-05-20"),user_ID=c("RZ625","TDH65","EJ7336"))
colnames(df1) <- c("Date", "user_ID")
df2 <- data.frame(Date=c("2021-05-12","2021-05-15","2021-05-22"),user_ID=c("IZ823","TDH65","SI826"))
colnames(df2) <- c("Date", "user_ID")
The only way I found so far was to create some kind of flag Folder_True where it's a 1 if we know this user used Folder feature on this date and 0 if we don't know. I then used it with dplyr combining group_by and sum. But I think it's not very elegant and I would like to learn a more logical/efficient way to do this data wrangling.
Thanks!
library(dplyr)
df1 <- data.frame(Date = c("2021-05-12", "2021-05-15", "2021-05-20"),
                  user_ID = c("RZ625", "TDH65", "EJ7336"),
                  Folder_True = c(0, 0, 0))
df2 <- data.frame(Date = c("2021-05-12", "2021-05-15", "2021-05-22"),
                  user_ID = c("IZ823", "TDH65", "SI826"),
                  Folder_True = c(1, 0, 1))
combined_df <- rbind(df1, df2)
combined_df <-
  combined_df %>%
  group_by(Date, user_ID) %>%
  summarise(Folder_True = sum(Folder_True))
final_df <-
  combined_df %>%
  group_by(Date) %>%
  summarise(Nb_Users_Folder_True = sum(Folder_True))

For each Date you can find out unique users who have Folder_True = 1.
library(dplyr)
combined_df <- rbind(df1, df2)
combined_df %>%
  group_by(Date) %>%
  summarise(Nb_Users_Folder_True = n_distinct(user_ID[Folder_True == 1]))
# Date Nb_Users_Folder_True
# <chr> <int>
#1 2021-05-12 1
#2 2021-05-15 0
#3 2021-05-20 0
#4 2021-05-22 1

Using uniqueN from data.table
library(data.table)
rbindlist(list(df1, df2))[,
  .(Nb_Users_Folder_True = uniqueN(user_ID[as.logical(Folder_True)])), Date]
Output:
# Date Nb_Users_Folder_True
#1: 2021-05-12 1
#2: 2021-05-15 0
#3: 2021-05-20 0
#4: 2021-05-22 1
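If neither dplyr nor data.table is available, the same count can be sketched in base R by splitting the combined data by Date and counting distinct flagged users in each piece. This is a minimal sketch using the question's example data with the Folder_True flag; counts and final_df are illustrative names.

```r
df1 <- data.frame(Date = c("2021-05-12", "2021-05-15", "2021-05-20"),
                  user_ID = c("RZ625", "TDH65", "EJ7336"),
                  Folder_True = c(0, 0, 0), stringsAsFactors = FALSE)
df2 <- data.frame(Date = c("2021-05-12", "2021-05-15", "2021-05-22"),
                  user_ID = c("IZ823", "TDH65", "SI826"),
                  Folder_True = c(1, 0, 1), stringsAsFactors = FALSE)
combined_df <- rbind(df1, df2)

# One group per Date; count distinct users whose flag is 1 in that group
counts <- sapply(split(combined_df, combined_df$Date),
                 function(g) length(unique(g$user_ID[g$Folder_True == 1])))

final_df <- data.frame(Date = names(counts),
                       Nb_Users_Folder_True = unname(counts),
                       stringsAsFactors = FALSE)
print(final_df)
```

Because split() keeps every Date that occurs in the combined data, dates with no flagged users still appear with a count of 0.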

Rearranging column (descending/ascending) with dplyr without a prior vector

The following code can rearrange columns if one has prior knowledge of what columns are available, but what if one wants to rearrange the columns by either desc/ascending order? There are a few similar posts on StackOverflow, but not one that can do so without prior knowledge of what columns are available.
  type value
1  rna     1
2  rna     2
3  rna     3
4  dna    20
5  dna    30
d <- data.frame(type = c("rna", "rna", "rna"), value = c(1, 2, 3))
d2 <- data.frame(type = c("dna", "dna"), value = c(20, 30))
df <- rbind(d, d2)
library(dplyr)
df %>%
  group_by(type) %>%
  summarise_all(sum) %>%
  data.frame() %>%
  arrange(desc(value)) %>%                # reorder rows
  select_(.dots = c("value", "type"))     # reorder columns

sort(names(.)) or rev(sort(names(.))) should work...
d <- data.frame(type = c("rna", "rna", "rna"), value = c(1, 2, 3))
d2 <- data.frame(type = c("dna", "dna"), value = c(20, 30))
df <- rbind(d, d2)
library(dplyr)
df %>%
  group_by(type) %>%
  summarise_all(sum) %>%
  data.frame() %>%
  arrange(desc(value)) %>%   # reorder rows
  select(sort(names(.)))     # reorder columns alphabetically
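The same idea works outside a pipeline, since it only relies on sorting the column names. A minimal base R sketch, where summed is a hypothetical stand-in for the summarised result above:

```r
# Hypothetical stand-in for the summarised data frame
summed <- data.frame(type = c("dna", "rna"), value = c(50, 6),
                     stringsAsFactors = FALSE)

# Columns in ascending and descending alphabetical order
cols_asc  <- summed[, sort(names(summed)), drop = FALSE]
cols_desc <- summed[, sort(names(summed), decreasing = TRUE), drop = FALSE]

print(names(cols_asc))   # "type" "value"
print(names(cols_desc))  # "value" "type"
```

drop = FALSE keeps the result a data frame even if only one column is present.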

Subsetting rows based on dates and criteria across two data frames

I have one data frame outlining pollution levels continuously measured from two sites.
Dates <- as.data.frame(seq(as.Date("2015/01/01"), as.Date("2017/01/01"), "day"))
Pollution_Site.A <- as.data.frame(c(seq(from = 1, to = 366, by = 1),
                                    seq(from = 366, to = 1, by = -1)))
Pollution_Site.B <- as.data.frame(c(seq(from = 0, to = 365, by = 1),
                                    seq(from = 365, to = 0, by = -1)))
df1 <- cbind(Dates, Pollution_Site.A, Pollution_Site.B)
colnames(df1) <- c("Dates", "Site.A", "Site.B")
I have a separate data frame highlighting when surveyors (each site has one unique surveyor) visited each site.
Site <- c("Site.A", "Site.A", "Site.B", "Site.B")
Survey_Dates <- as.data.frame(as.POSIXct(c("2014/08/17", "2016/08/01",
                                           "2015/02/01", "2016/10/31")))
df2 <- as.data.frame(cbind(Site, Survey_Dates))
colnames(df2) <- c("Site", "Survey_Dates")
What I want to do is (i) define a high pollution event for each site (perhaps some form of 'apply' function would be a better way to do this iteratively across multiple sites):
High_limit_Site.A <- 1.5*median(df1$Site.A)
High_limit_Site.B <- 1.5*median(df1$Site.B)
Then I want to (ii) subset the second data frame to show which surveyors visited their site both before and after a high pollution event, within 1 year of it (provided there is pollution data as well). I presume something based on the 'difftime' function would work here, but I am not sure how to apply it.
Finally, I would like (iii) the subsetted data frame to highlight whether the surveyor was out before or after the pollution event.
So in the example above, the desired output should only contain Site B. This is because Site A's first survey date precedes the first pollution measurement AND was over a year before the high pollution event. Thank you in advance for any help on this.

You need to pivot df1 and then cross-join it with df2
library(dplyr)
library(tidyr)
df1 %>% gather(key = Site, value = Pollution, -Dates) %>%
  group_by(Site) %>%
  mutate(HighLimit = as.numeric(Pollution > 1.5 * median(Pollution))) %>%
  filter(HighLimit == 1) %>%
  # this will function as a cross-join because Site is not a unique ID
  left_join(df2, by = c("Site")) %>%
  mutate(Time_Lag = as.numeric(as.Date(Survey_Dates) - as.Date(Dates)),
         Been_Before = ifelse(Time_Lag > 0, "after", "before")) %>%
  filter(abs(Time_Lag) < 365) %>%
  group_by(Site, Survey_Dates, Been_Before) %>%
  summarise(Event_date_min = min(Dates),
            Event_date_max = max(Dates))
Here you can see the earliest and latest events corresponding to each visit:
# A tibble: 3 x 5
# Groups:   Site, Survey_Dates [?]
  Site   Survey_Dates Been_Before Event_date_min Event_date_max
  <chr>  <dttm>       <chr>       <date>         <date>
1 Site.A 2016-08-01   after       2015-10-03     2016-04-01
2 Site.B 2015-02-01   before      2015-10-02     2016-01-30
3 Site.B 2016-10-31   after       2015-11-01     2016-04-02
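The two building blocks the question asks about, an 'apply'-style threshold per site and a 'difftime'-based lag, can also be sketched in base R. This is only an illustration under the question's own definitions (limit = 1.5 times the site median); the survey and event dates are taken from the last row of the output above, and site_cols, high_limits, and lag_days are illustrative names.

```r
df1 <- data.frame(
  Dates  = seq(as.Date("2015-01-01"), as.Date("2017-01-01"), by = "day"),
  Site.A = c(1:366, 366:1),
  Site.B = c(0:365, 365:0)
)

# Every column except Dates is treated as a site
site_cols <- setdiff(names(df1), "Dates")

# One high-pollution limit per site, computed in a single pass
high_limits <- sapply(df1[site_cols], function(x) 1.5 * median(x))
print(high_limits)

# Signed lag in days: positive means the survey came after the event
survey_date <- as.Date("2016-10-31")
event_date  <- as.Date("2016-04-02")
lag_days <- as.numeric(difftime(survey_date, event_date, units = "days"))
print(lag_days)
```

With more sites, only the columns of df1 change; the sapply() call picks up each additional site column automatically.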

Just to build on the answer @dmi3kno gave above, I can then keep only the sites that have both a "before" and an "after" visit.
Output_df <- df1 %>% gather(key = Site, value = Pollution, -Dates) %>%
  group_by(Site) %>%
  mutate(HighLimit = as.numeric(Pollution > 1.5 * median(Pollution))) %>%
  filter(HighLimit == 1) %>%
  left_join(df2, by = c("Site")) %>%
  mutate(Time_Lag = as.numeric(as.Date(Survey_Dates) - as.Date(Dates)),
         Been_Before = ifelse(Time_Lag > 0, "after", "before")) %>%
  filter(abs(Time_Lag) < 365) %>%
  group_by(Site, Survey_Dates, Been_Before) %>%
  summarise(Event_date_min = min(Dates),
            Event_date_max = max(Dates))
Then using dplyr again:
Final_df <- Output_df %>%
  group_by(Site) %>%
  filter(all(c("before", "after") %in% Been_Before))

How to use dplyr to subset observations based on the difference between two dates

I've got a data frame (df1) with an ID variable and two date variables (dat1 and dat2).
I'd like to subset the data frame so that I get the observations for which the difference between dat2 and dat1 is less than or equal to 30 days.
I'm trying to use dplyr() but I can't get it to work.
Any help would be much appreciated.
Starting point (df1):
df1 <- data.frame(ID = c("a", "b", "c", "d", "e", "f"),
                  dat1 = c("01/05/2017", "01/05/2017", "01/05/2017",
                           "01/05/2017", "01/05/2017", "01/05/2017"),
                  dat2 = c("14/05/2017", "05/06/2017", "23/05/2017",
                           "15/10/2017", "15/11/2017", "15/12/2017"),
                  stringsAsFactors = FALSE)
Desired outcome (dfgoal):
dfgoal <- data.frame(ID = c("a", "c"),
                     dat1 = c("01/05/2017", "01/05/2017"),
                     dat2 = c("14/05/2017", "23/05/2017"),
                     newvar = c(13, 22))
Current code:
library(dplyr)
df2 <- df1 %>% mutate(newvar = as.Date(dat2) - as.Date(dat1)) %>%
  filter(newvar <= 30)

We need to convert the columns to Date class before doing the subtraction:
library(dplyr)
library(lubridate)
df1 %>%
  mutate_at(2:3, dmy) %>%
  mutate(newvar = as.numeric(dat2 - dat1)) %>%
  filter(newvar <= 30)
as.Date also needs the format argument; otherwise it assumes one of its default formats, such as %Y-%m-%d. Here the dates are in %d/%m/%Y:
df1 %>%
  mutate(newvar = as.numeric(as.Date(dat2, "%d/%m/%Y") - as.Date(dat1, "%d/%m/%Y"))) %>%
  filter(newvar <= 30)
#   ID       dat1       dat2 newvar
# 1  a 01/05/2017 14/05/2017     13
# 2  c 01/05/2017 23/05/2017     22
