I am trying to create a new variable using ifelse by combining data from two data.frames (similar to this question but without factors).
My problem is that df1 features yearly data, whereas vars in df2 are temporally aggregated: e.g. df1 has multiple obs (1997,1998,...,2005) and df2 only has a range (1900-2001).
For illustration, a 2x2 example would look like
df1$id <- c("2","20")
df1$year <- c("1960","1870")
df2$id <- df1$id
df2$styear <- c("1800","1900")
df2$endyear <- c("2001","1950")
I want to combine both in such a way that the id (same variable exists in both) is matched, and further, the year in df1 is within the range of df2. I tried the following
df1$new.var <- ifelse(df1$id==df2$id & df1$year>=df2$styear &
df1$year<df2$endyear,1,0)
Which ideally should return 1 and 0, respectively.
But instead I get warning messages:
1: In df1$id == df2$id : longer object length is not a multiple of
shorter object length
2: In df1$year >= df2$styear : longer object length is not a
multiple of shorter object length
3: In df1$year < df2$endyear : longer object length is not a
multiple of shorter object length
For the record, the 'real' df1 has 500 obs and df2 has 14. How can I make this work?
Edit: I realised some obs in df2 are repeated, with multiple periods e.g.
id styear endyear
1 1800 1915
1 1950 2002
2 1912 1988
3 1817 2000
So, I believe what I need is something like a double-ifelse:
df1$new.var <- ifelse(df1$id==df2$id & df1$year>=df2$styear &
df1$year<df2$endyear | df1$year>=df2$styear &
df1$year<df2$endyear,1,0)
Obviously, this wouldn't work as written, but it shows how I'd like to get around the duplicates problem.
For example, if id=1 in df1$year=1801, it will pass the first year-range test (1801 is between 1800-1915), but fail the second one (1801 is not between 1950-2002), so it is only coded once and no extra rows are added (currently the duplicates add extra rows).
df1$id <- c("2","20")
df1$year <- c("1960","1870")
df2$id <- df1$id
df2$styear <- c("1800","1900")
df2$endyear <- c("2001","1950")
library(dplyr)
df3 <- left_join(df1, df2, by = "id") %>% filter(year >= styear, year <= endyear)
I highly recommend the dplyr package for data manipulation.
With base R:
df1 <- data.frame(id=c(2,20,22), year=c(1960,1870, 2016))
df2 <- data.frame(id=c(2,20,21), styear=c(1800,1900,2000), endyear=c(2001,1950,2016))
df1
id year
1 2 1960
2 20 1870
3 22 2016
df2
id styear endyear
1 2 1800 2001
2 20 1900 1950
3 21 2000 2016
df1 <- merge(df1, df2, by='id', all.x = TRUE)
df1$new.var <- !is.na(df1$styear) & df1$year >= df1$styear & df1$year < df1$endyear
df1 <- df1[c('id', 'year', 'new.var')]
df1
id year new.var
1 2 1960 TRUE
2 20 1870 FALSE
3 22 2016 FALSE
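If some ids carry several (styear, endyear) periods, as in the edit above, the same merge idea still works: test every id-period pair, then collapse back to one row per (id, year) with any(). A minimal base R sketch (the data here are made up for illustration, not taken from the question):

```r
df1 <- data.frame(id = c(1, 1, 2), year = c(1801, 1960, 1900))
df2 <- data.frame(id = c(1, 1, 2, 3),
                  styear  = c(1800, 1950, 1912, 1817),
                  endyear = c(1915, 2002, 1988, 2000))

# one row per id-period combination
m <- merge(df1, df2, by = "id")
m$hit <- m$year >= m$styear & m$year < m$endyear

# collapse: TRUE if the year falls in ANY of that id's periods
agg <- aggregate(hit ~ id + year, data = m, FUN = any)
df1$new.var <- as.integer(agg$hit[match(paste(df1$id, df1$year),
                                        paste(agg$id, agg$year))])
df1$new.var
# 1 1 0
```

This avoids the extra rows from duplicated ids because the per-period tests are aggregated before being attached back to df1.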
Alright, I made it work for myself. Beware, it is quite convoluted and probably contains some redundancies. After a brief look at the data wrangling cheatsheet, assuming you have df1 and df2 with a shared id variable and df2 contains new.var, one can do the following:
library(dplyr)
#Join everything, all values and rows
df3 <- full_join(df1,df2,by="id")
#filter out obs whose year is greater than endyear
df3 <- filter(df3, year <= endyear)
#same, the other way around
df3 <- filter(df3, year >= styear)
df3 <- distinct(df3) #remove duplicate rows (at least I had some)
As far as I can tell by looking at the end result, this method only extracts information from the correct time period while dropping all other time periods in df2. Then, it is a matter of merging with the original data.frame (df1) and filling in the NAs:
df1 <- merge(df1, df3, by = "id", all.x = TRUE)
df1 <- distinct(df1) #just to make sure, I still had three
df1$new.var <- ifelse(is.na(df1$new.var),0,df1$new.var)
which is what I wanted.
This can be solved easily and efficiently using non-equi joins in data.table devel version (1.9.7+):
library(data.table)
setDT(df1); setDT(df2) # converting to data.table in place
df1[, new.var := df2[df1, on = .(id, styear <= year, endyear >= year),
.N > 0, by = .EACHI]$V1]
df1
# id year new.var
#1: 2 1960 TRUE
#2: 20 1870 FALSE
The above join looks for matches in df2 for each row of df1 (by = .EACHI) and checks whether the number of matching rows (.N) is positive.
Related
I have two dataframes: 1) an old dataframe (let's call it "df1") and 2) an updated dataframe ("df2"). I need to identify what has been added to or removed from df1 to create df2. So, I need a new dataframe with a new column identifying what rows should be added to or removed from df1 in order to get df2.
The two dataframes are of differing lengths, and Vessel_ID is the only unique identifier.
Here is a reproducible example:
df1 <- data.frame(Name=c('Vessel1', 'Vessel2', 'Vessel3', 'Vessel4', 'Vessel5'),
Vessel_ID=c('1','2','3','4','5'), special_NO=c(10,20,30,40,50),
stringsAsFactors=F)
df2 <- data.frame(Name=c('Vessel1', 'x', 'y', 'Vessel3', 'x', 'Vessel6'),
Vessel_ID=c('1', '6', '7', '3', '5', '10'), special_NO=NA,
stringsAsFactors=F)
Ideally I would want an output like this:
df3
Name Vessel_ID special_NO add_remove
Vessel2 2 20 remove
Vessel4 4 40 remove
Vessel6 10 NA add
x 6 NA add
y 7 NA add
Also, if the Vessel_ID matches, I want to substitute the special_NO from df1 for NA in df2...but maybe that's for another question.
I tried adding a new column to both df1 and df2 to identify which df they originally belonged to, then merging the dataframes and using the duplicated() function. This seemed to work, but I still wasn't sure which rows to remove or add, and got different results depending on whether I specified fromLast=T or fromLast=F.
An approach using bind_rows
library(dplyr)
bind_rows(df1 %>% mutate(add_remove="remove"),
df2 %>% mutate(add_remove="add")) %>%
group_by(Vessel_ID) %>%
filter(n() == 1) %>%
ungroup()
# A tibble: 5 × 4
Name Vessel_ID special_NO add_remove
<chr> <chr> <dbl> <chr>
1 Vessel2 2 20 remove
2 Vessel4 4 40 remove
3 x 6 NA add
4 y 7 NA add
5 Vessel6 10 NA add
Thanks for the comment! That looks like it would work too. Here's another solution a friend gave me, mostly in base R (the one dplyr join can be swapped for merge(), as noted below):
df1$old_new <- "old"
df2$old_new <- "new"
#' Use the full_join function in the dplyr package to join both data.frames based on Name and Vessel_ID
df.comb <- dplyr::full_join(df1, df2, by = c("Name", "Vessel_ID"))
#' If you want to go fully base, you can use the merge() function to get the same result.
# df.comb <- merge(df1, df2, by = c("Name", "Vessel_ID"), all = TRUE, sort = FALSE)
#' Create a new column that sets the 'status' of a row
#' If old_new.x is NA, that row came from df2, so it is "new"
df.comb$status[is.na(df.comb$old_new.x)] <- "new"
# If old_new.x is not NA and old_new.y is NA then that row was in df1, but isn't in df2, so it has been "deleted"
df.comb$status[!is.na(df.comb$old_new.x) & is.na(df.comb$old_new.y)] <- "deleted"
# If old_new.x is not NA and old_new.y is not NA then that row was in both df1 and df2 = "same"
df.comb$status[!is.na(df.comb$old_new.x) & !is.na(df.comb$old_new.y)] <- "same"
# only keep the columns you need
df.comb <- df.comb[, c("Name", "Vessel_ID", "special_NO", "status")]
How do you sum a value that occurs between two dates in R?
For example, I have two data tables, df1 has start and end dates, df2 has values corresponding to certain dates between the start and end dates in df1. I would like to sum the values in df2 between each Start and End date in df1 and record that information in df1.
df1 <- data.frame(Start = c('1/1/20', '5/1/20', '10/1/20', '2/2/21', '3/20/21'),
End = c('1/7/20', '5/7/20', '10/7/20', '2/7/21', '3/30/21'))
df2 <- data.frame(Date = c('1/1/20','1/3/20' ,'5/1/20','5/2/20','6/2/20' ,'6/4/20','10/1/20', '2/2/21', '3/20/21'),value=c('1','2','5','15','20','2','3','78','100'))
I have tried following the example at the following link that provides information on counting between two dates in R but I am struggling to apply it to the function sum. Sum/count between two dates in R
Thank you!
We can use a non-equi join in data.table after converting the date columns to Date class
library(data.table)
library(lubridate)
setDT(df1); setDT(df2)
df1[, value := df2[df1, sum(value),
    on = .(Date >= Start, Date <= End), by = .EACHI]$V1]
-output
df1
#         Start        End value
#1: 2020-01-01 2020-01-07     3
#2: 2020-05-01 2020-05-07    20
#3: 2020-10-01 2020-10-07     3
#4: 2021-02-02 2021-02-07    78
#5: 2021-03-20 2021-03-30   100
data
df1[] <- lapply(df1, mdy)
df2$Date <- mdy(df2$Date)
df2$value <- as.numeric(df2$value)
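For comparison, here is a base R sketch of the same computation using sapply(); the dates are rebuilt with as.Date so the snippet is self-contained (only the first two ranges are shown, the full data would be converted with mdy() as above):

```r
df1 <- data.frame(Start = as.Date(c("2020-01-01", "2020-05-01")),
                  End   = as.Date(c("2020-01-07", "2020-05-07")))
df2 <- data.frame(Date  = as.Date(c("2020-01-01", "2020-01-03",
                                    "2020-05-01", "2020-05-02", "2020-06-02")),
                  value = c(1, 2, 5, 15, 20))

# for each (Start, End) row, sum the df2 values whose Date falls in the range
df1$value <- sapply(seq_len(nrow(df1)), function(i)
  sum(df2$value[df2$Date >= df1$Start[i] & df2$Date <= df1$End[i]]))
df1$value
# 3 20
```

This is O(nrow(df1) * nrow(df2)) and will be slower than the non-equi join on large data, but it has no package dependencies.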
I want to merge two dataframes. DF2 has one temperature value for each day while DF1 has several entries for each day. So I want to look up the temperature for one day in DF2 and have it added to every entry of this day in dataframe 1.
I guess a loop would work best, but being quite new to R I can't figure out what it should look like.
DF1$Date <- c("1.8.18", "1.8.18", "2.8.18")
DF2$Date <- c("1.8.18", "2.8.18", "3.8.18")
DF2$Temperature <- c(17, 18, 17)
DF2$Difference <- c(0.5, 0.4, 0.5)
This is the expected output:
DF1$Date <- c("1.8.18", "1.8.18", "2.8.18")
DF1$Temperature <- c(17, 17, 18)
DF1$Difference <- c(0.5, 0.5, 0.4)
I would highly recommend using the tidyverse library for general data wrangling (and lubridate for date manipulation, although you don't necessarily need lubridate for this question).
This could work in your case:
library(tidyverse)
# Create the dataframes
DF1 <- data.frame(c("1.8.18", "1.8.18", "2.8.18"))
DF2 <- data.frame(c("1.8.18", "2.8.18", "3.8.18"),
c(17,18,17),
c(0.5,0.4,0.5)
)
names(DF1) <- "Date"
names(DF2) <- c("Date", "Temperature", "Difference")
#### OUTPUT ####
> DF1
# Date
# 1 1.8.18
# 2 1.8.18
# 3 2.8.18
> DF2
# Date Temperature Difference
# 1 1.8.18 17 0.5
# 2 2.8.18 18 0.4
# 3 3.8.18 17 0.5
So above I just recreated your dataframes. DF1 has just the one column, DF2 has 3 columns.
# join dataframes by what the "Date" columns have in common
DF3 <- left_join(x = DF1, y = DF2, by = "Date")
This should get your expected output.
> DF3
# Date Temperature Difference
# 1 1.8.18 17 0.5
# 2 1.8.18 17 0.5
# 3 2.8.18 18 0.4
For more details check out the join function in dplyr (which is part of tidyverse library).
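If you would rather avoid the dependency, base R's merge() gives the same left join; a sketch rebuilding the DF1/DF2 from this answer:

```r
DF1 <- data.frame(Date = c("1.8.18", "1.8.18", "2.8.18"))
DF2 <- data.frame(Date = c("1.8.18", "2.8.18", "3.8.18"),
                  Temperature = c(17, 18, 17),
                  Difference = c(0.5, 0.4, 0.5))

# all.x = TRUE keeps every row of DF1 (a left join);
# sort = FALSE avoids re-sorting the result by the join key
DF3 <- merge(DF1, DF2, by = "Date", all.x = TRUE, sort = FALSE)
```

Note that merge() does not guarantee the original row order of DF1, so re-sort afterwards if order matters.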
I would treat your Date variable either as a Date or as a character variable; I would not use a factor for this purpose.
library(tidyverse)
DF1$Date = as.Date(DF1$Date, "%d.%m.%y")
DF2$Date= as.Date(DF2$Date, "%d.%m.%y")
left_join(x = DF1, y = DF2, by = "Date")
OR
DF1$Date = as.character(DF1$Date)
DF2$Date = as.character(DF2$Date)
left_join(x = DF1, y = DF2, by = "Date")
Using it as a factor, you will get an error message, and you have a good chance of getting it wrong.
I found myself at the limits of the grep() function or perhaps there are efficient ways of doing this.
Start off a sample data-frame:
Date <- c( "31-DEC-2014","31-DEC-2014","31-DEC-2014","30-DEC-2014",
"30-DEC-2014","30-DEC-2014", "29-DEC-2014","29-DEC-2014","29-DEC-2014" )
ISIN <- c("LU0168343191", "TW0002418001", "GB00B3FFY088","LU0168343191",
"TW0002418001", "GB00B3FFY088","LU0168343191", "TW0002418001", "GB00B3FFY088")
price <-c(seq(1:9))
df <- as.data.frame(cbind(Date, ISIN, price))
And the desired Result is a list() containing subsets of the main data file which looks like the below (x3 for the 3 individual Identifiers in Result_I)
The idea is that the data should first be filtered by ISIN and then by Date. This two-step process should keep my data intact.
Result_d <- c("31-DEC-2014", "30-DEC-2014","29-DEC-2014")
Result_I <- c("LU0168343191","LU0168343191","LU0168343191")
Result_P <- c(1,4,7)
Result_df <- cbind(Result_d, Result_I, Result_P)
Please keep in mind the above is for demo purposes; the real data set has 5M rows and 50 columns over 450+ different dates (as per Result_d), so I am looking for something applicable irrespective of nrow or ncol.
What i have so far:
I take all unique dates and store:
Unique_Dates <- unique(df$Date)
The same for the Identifiers:
Unique_ID <- unique(df$ISIN)
Now the grepping issue:
If i wanted all rows containing Unique_Dates i would do something like:
pattern <- paste(Unique_Dates, collapse = "|")
result <- as.matrix(df[grep(pattern, df$Date),])
and this will retrieve basically the entire data set. I am wondering if anyone knows an efficient way of doing this.
Thanks in advance.
Using dplyr:
library(dplyr)
Date <- c( "31-Dec-2014","31-Dec-2014","31-Dec-2014","30-Dec-2014",
"30-Dec-2014","30-Dec-2014", "29-Dec-2014","29-Dec-2014","29-Dec-2014" )
ISIN <- c("LU0168343191", "TW0002418001", "GB00B3FFY088","LU0168343191",
"TW0002418001", "GB00B3FFY088","LU0168343191", "TW0002418001", "GB00B3FFY088")
price <-c(seq(1:9))
DF <- data.frame(Date, ISIN, price,stringsAsFactors=FALSE)
DF$Date=as.Date(DF$Date,"%d-%b-%Y")
#Examine data ranges and frequencies
#date range
range(DF$Date)
#date frequency count
table(DF$Date)
#ISIN frequency count
table(DF$ISIN)
#select ISINs for filtering, with user defined choice of filters
# numISIN = 2
# subISIN = head(names(sort(table(DF$ISIN))),numISIN)
subISIN = names(sort(table(DF$ISIN)))[2]
subDF=DF %>%
dplyr::group_by(ISIN) %>%
dplyr::arrange(ISIN,Date) %>%
dplyr::filter(ISIN %in% subISIN) %>%
as.data.frame()
#> subDF
# Date ISIN price
#1 2014-12-29 LU0168343191 7
#2 2014-12-30 LU0168343191 4
#3 2014-12-31 LU0168343191 1
We convert the 'data.frame' to 'data.table' (setDT(df)), grouped by 'Date', specify the 'i' based on the index returned with grep and Subset the Data.table (.SD) based on the 'i' index.
library(data.table)
setDT(df)[grep("LU", ISIN), .SD, by = Date]
# Date ISIN price
#1: 31-DEC-2014 LU0168343191 1
#2: 30-DEC-2014 LU0168343191 4
#3: 29-DEC-2014 LU0168343191 7
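Since the question asks for a list() of per-identifier subsets, plain split() may also be worth noting; it needs no grep() pattern at all and scales to any number of identifiers (a sketch on the demo data):

```r
Date <- c("31-DEC-2014","31-DEC-2014","31-DEC-2014","30-DEC-2014",
          "30-DEC-2014","30-DEC-2014","29-DEC-2014","29-DEC-2014","29-DEC-2014")
ISIN <- c("LU0168343191","TW0002418001","GB00B3FFY088","LU0168343191",
          "TW0002418001","GB00B3FFY088","LU0168343191","TW0002418001","GB00B3FFY088")
price <- 1:9
df <- data.frame(Date, ISIN, price, stringsAsFactors = FALSE)

# one data.frame per ISIN, named by ISIN, rows kept in original order
result <- split(df, df$ISIN)
result[["LU0168343191"]]$price
# 1 4 7
```

Each element of result is the full subset for one ISIN, which can then be ordered by Date within the list if needed.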
I have a data frame, df2, containing observations grouped by a ID factor that I would like to subset. I have used another function to identify which rows within each factor group that I want to select. This is shown below in df:
df <- data.frame(ID = c("A","B","C"),
pos = c(1,3,2))
df2 <- data.frame(ID = c(rep("A",5), rep("B",5), rep("C",5)),
obs = c(1:15))
In df, pos corresponds to the index of the row that I want to select within the factor level mentioned in ID, not in the whole dataframe df2. I'm looking for a way to select the rows for each ID according to the right index (i.e., their row number within each factor level of df2).
So, in this example, I want to select the first value in df2 with ID == 'A', the third value in df2 with ID == 'B' and the second value in df2 with ID == 'C'.
This would then give me:
df3 <- data.frame(ID = c("A", "B", "C"),
obs = c(1, 8, 12))
dplyr
library(dplyr)
merge(df,df2) %>%
group_by(ID) %>%
filter(row_number() == pos) %>%
select(-pos)
# ID obs
# 1 A 1
# 2 B 8
# 3 C 12
base R
df2m <- merge(df,df2)
do.call(rbind,
by(df2m, df2m$ID, function(SD) SD[SD$pos[1], setdiff(names(SD),"pos")])
)
by splits the merged data frame df2m by df2m$ID and operates on each part; it returns results in a list, so they must be rbinded together at the end. Each subset of the data (associated with each value of ID) is filtered by pos and deselects the "pos" column using normal data.frame syntax.
data.table suggested by #DavidArenburg in a comment
library(data.table)
setkey(setDT(df2),"ID")[df][,
.SD[pos[1L], !"pos", with=FALSE]
, by = ID]
The first part -- setkey(setDT(df2),"ID")[df] -- is the merge. After that, the resulting table is split by = ID, and each Subset of Data, .SD is operated on. pos[1L] is subsetting in the normal way, while !"pos", with=FALSE corresponds to dropping the pos column.
See #eddi's answer for a better data.table approach.
Here's the base R solution:
df2$pos <- ave(df2$obs, df2$ID, FUN=seq_along)
merge(df, df2)
ID pos obs
1 A 1 1
2 B 3 8
3 C 2 12
If df2 is sorted by ID, you can just do df2$pos <- sequence(table(df2$ID)) for the first line.
Using data.table version 1.9.5+:
setDT(df2)[df, .SD[pos], by = .EACHI, on = 'ID']
which merges on ID column, then selects the pos row for each of the rows of df.