Extracting uncommon values from 2 data frames in R

Given two data frames containing dates:
d1
# dates
# 2016-08-01
# 2016-08-02
# 2016-08-03
# 2016-08-04
d2
# dates
# 2016-08-02
# 2016-08-03
# 2016-08-04
# 2016-08-05
# 2016-08-06
How do I create a 3rd dataframe that would have the not-common values?
d3
# dates
# 2016-08-01
# 2016-08-05
# 2016-08-06
Data:
df1 <- structure(list(dates = structure(c(17014, 17015, 17016, 17017 ),
class = "Date")), .Names = "dates", row.names = c(NA, -4L), class =
"data.frame")
df2 <- structure(list(dates = structure(c(17015, 17016, 17017, 17018,
17019), class = "Date")), .Names = "dates", row.names = c(NA, -5L), class
= "data.frame")

Suppose you have two vectors x and y, elements that are not shared are
c(x[!(x %in% y)], y[!(y %in% x)])
If you work with data frames, provided that your dates column is "character" or "Date" instead of "factor", you can do
rbind(subset(df1, !(df1$dates %in% df2$dates)),
subset(df2, !(df2$dates %in% df1$dates)))
Simple vector example
x <- 1:5
y <- 3:8
c(x[!(x %in% y)], y[!(y %in% x)])
# [1] 1 2 6 7 8
Vector of "Date"
x <- seq(from = as.Date("2016-01-01"), length = 5, by = 1)
y <- seq(from = as.Date("2016-01-03"), length = 5, by = 1)
c(x[!(x %in% y)], y[!(y %in% x)])
# [1] "2016-01-01" "2016-01-02" "2016-01-06" "2016-01-07"
Example data frame in your question
rbind(subset(df1, !(df1$dates %in% df2$dates)),
subset(df2, !(df2$dates %in% df1$dates)))
# dates
#1 2016-08-01
#4 2016-08-05
#5 2016-08-06

You could probably just use a join as others have shown. Personally I like using ?setops in base R. Something like this:
# if they are just character/factor variables
setdiff(d1$dates, d2$dates)
# if they are date variables
setdiff(as.character(d1$dates), as.character(d2$dates))
# then convert back to as.Date(setdiff(...))
Applying this, you could filter the data.frame based on the result or, as @ZheyuanLi has indirectly identified, use matching to exclude:
# If they are date variables
d2[!as.character(d2$dates) %in% as.character(d1$dates),]
# If they are character/factor variables
d2[!d2$dates %in% d1$dates,]
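As a minimal sketch, the setdiff route can be run in both directions to get the values that appear in only one of the two data frames. It assumes the data frames are named df1 and df2 as in the question's Data block; only_in_1, only_in_2 and d3 are illustrative names. The dates are compared as character and converted back with as.Date at the end.

```r
# Data equivalent to the question's df1 and df2
df1 <- data.frame(dates = as.Date("2016-08-01") + 0:3)
df2 <- data.frame(dates = as.Date("2016-08-02") + 0:4)

# setdiff(x, y) keeps what is in x but not in y, so call it once per direction
only_in_1 <- setdiff(as.character(df1$dates), as.character(df2$dates))
only_in_2 <- setdiff(as.character(df2$dates), as.character(df1$dates))

# Combine both one-sided differences and convert back to Date
d3 <- data.frame(dates = as.Date(c(only_in_1, only_in_2)))
d3
```

Going through character avoids relying on how setdiff treats the Date class, which is the reason the answer above converts before comparing.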

Related

more dynamic melting with data.table

I am looking for the most efficient form to transform
ARTNR FILGRP
1 1 9827
2 2 9348
3 3 9335, 9827, 9339
into this
ARTNR FILGRP
1 1 9827
2 2 9348
3 3 9335
4 3 9827
5 3 9339
I tried the following code and it works, but it is not elegant and has some shortcomings:
setDT(artnrs)
artnrs[, c("P1", "P2", "P3") := tstrsplit(FILGRP, ",", fixed=TRUE)] # 1)
artnrs <- melt(artnrs, c("ARTNR"), measure = patterns("^P")) # 2)
artnrs[,variable:=NULL] # 3)
artnrs <- na.omit(artnrs, cols="value") # 4)
names(artnrs)[2] <- "FILGRP" # 5)
ad 1) splits the last column into three new ones. How can I make this dynamic so that it also works for five or ten?
ad 2-5) rather clumsy operations; could I chain these better?
It is based on data.table but performance is not that critical so an easy to understand tidyverse solution would be ok. But the fewer packages, the better.
Thanks!
dput output:
structure(list(ARTNR = c(1, 2, 3), FILGRP = c("9827", "9348", "9335, 9827, 9339")),
row.names = c(NA, -3L), class = "data.frame")
df <- structure(list(ARTNR = c(1, 2, 3), FILGRP = c("9827", "9348", "9335, 9827, 9339")),
row.names = c(NA, -3L), class = "data.frame")
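# split on ", " so the pieces carry no leading space
df2 <- strsplit(df$FILGRP, split = ", ", fixed = TRUE)
# repeat each ARTNR once per piece it produced
df2 <- data.frame(ARTNR = rep(df$ARTNR, lengths(df2)), FILGRP = unlist(df2))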
Here is a data.table approach:
library( data.table )
# DT is assumed to be the question's data frame (artnrs above)
setDT(DT)
melt( DT[, paste0( "v", 1:length(tstrsplit( DT$FILGRP, ", ") ) ) := tstrsplit( FILGRP, ", ") ],
id.vars = "ARTNR",
measure.vars = patterns( "^v" ),
value.name = "FILGRP" )[!is.na(FILGRP), .SD, .SDcols = c(1,3) ]
# ARTNR FILGRP
# 1: 1 9827
# 2: 2 9348
# 3: 3 9335
# 4: 3 9827
# 5: 3 9339
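A shorter variant that avoids the intermediate columns altogether is sketched below. It is not taken from the answers above: it assumes the data.table package is installed and that the separator is always ", ". The name long is illustrative. Because the split happens per ARTNR group, the number of pieces does not need to be known in advance.

```r
library(data.table)

# Sample data from the question
DT <- data.table(ARTNR = c(1, 2, 3),
                 FILGRP = c("9827", "9348", "9335, 9827, 9339"))

# Split each FILGRP string; `by` repeats ARTNR once per resulting piece
long <- DT[, .(FILGRP = unlist(strsplit(FILGRP, ", ", fixed = TRUE))), by = ARTNR]
long
```

The whole reshaping is one chained expression, so there is no melt, no column to delete afterwards and no NA padding to remove.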

How can I filter based on 2 conditions

I am not able to filter based on 2 conditions. as1 is a data frame:
as1
da cat
1 2016-06-04 04:05:45 A
2 2016-06-04 04:05:46 B
3 2016-06-04 04:05:45 C
4 2016-06-04 04:05:46 D
as2 <- as1 %>% filter(as.POSIXct("2016-06-04 04:05:45") && cat == "A")
I need below dataframe
as2
da cat
1 2016-06-04 04:05:45 A
Let's make some reproducible data as your question is missing it:
as1 <- read.csv(header = TRUE, text = "
da,cat
2016-06-04 04:05:45,A
2016-06-04 04:05:46,B
2016-06-04 04:05:45,C
2016-06-04 04:05:46,D", stringsAsFactors = FALSE)
Now first thing you want to check is if the column "da" is, in fact, POSIXct.
class(as1$da)
#> [1] "character"
In my sample it is not, so I add an extra line to the dplyr pipe.
library(dplyr)
as2 <- as1 %>%
mutate(da = as.POSIXct(da)) %>% # add only if column isn't POSIXct
filter(da == as.POSIXct("2016-06-04 04:05:45") & cat == "A")
Basically, what you did wrong was leaving as.POSIXct("2016-06-04 04:05:45") as the whole expression. filter evaluates a condition and keeps only the rows where it is TRUE. Hence, to match "2016-06-04 04:05:45" you need a test: da == as.POSIXct("2016-06-04 04:05:45").
For why you need & here and not &&, see this answer. In short, & compares element by element, while && is meant for single TRUE/FALSE values.
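The difference between & and && can be seen without dplyr. The sketch below uses base R only; the vector names da, category, target and keep are illustrative, and the time zone is fixed to UTC purely so the example behaves the same everywhere.

```r
# Two columns as plain vectors, mirroring the question's data
da <- as.POSIXct(c("2016-06-04 04:05:45", "2016-06-04 04:05:46",
                   "2016-06-04 04:05:45", "2016-06-04 04:05:46"), tz = "UTC")
category <- c("A", "B", "C", "D")
target <- as.POSIXct("2016-06-04 04:05:45", tz = "UTC")

# `&` works element by element and returns one TRUE/FALSE per row,
# which is what filter() needs
keep <- da == target & category == "A"
keep

# `&&` is for a single condition, e.g. checking just the first row
first_row_matches <- da[1] == target && category[1] == "A"
first_row_matches
```

Here keep has one entry per row, and only the first row satisfies both conditions. Recent versions of R raise an error when && is given vectors longer than one, which is another reason to use & inside filter.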
You were almost there. This is a possible solution for you. You needed to convert the data to a date-time format using lubridate before filtering.
# load library
library(dplyr)
# create data
x = data.frame(da = c("2019-10-04 07:05:02","2019-10-04 07:05:03","2019-10-04 07:05:02","2019-10-04 07:05:03","2019-10-04 07:05:04"),
db = c("a","a","c","a","a"), stringsAsFactors = F)
# convert to date time format
x$da = lubridate::ymd_hms(x$da)
# see the structure of data
str(x)
# filter the data
x %>% filter(da <= lubridate::ymd_hms('2019-10-04 07:05:02') & db == 'a' )
# da db
#1 2019-10-04 07:05:02 a
Your data
# Data
x = structure(list(da = structure(c(1464993345, 1464993346, 1464993345, 1464993346), class = c("POSIXct", "POSIXt"), tzone = ""), cat = structure(1:4, .Label = c("A", "B", "C", "D"), class = "factor")), class = "data.frame", row.names = c(NA, -4L))
# da is already POSIXct here (tzone = ""), so the times shown below depend on the local time zone
# convert to date time format
x$da = lubridate::ymd_hms(x$da)
# see the structure of data
str(x)
# filter the data
x %>% filter(da <= lubridate::ymd_hms('2016-06-03 15:35:45') & cat == 'A' )
# da cat
#1 2016-06-03 15:35:45 A

Subsetting data frame with multiple date conditions for ranges in between

I need subsets between multiple dates.
Example data frame:
testdf <- data.frame(short_date = seq(as.Date("2007-03-01"),
as.Date("2008-09-01"), by = 'day'))
An example of data frame with values for date ranges:
dates_cut <- structure(list(emergence = structure(c(13627, 13997), class = "Date"), disease_onset = structure(c(13694, 14062), class = "Date")), .Names = c("emergence", "disease_onset"), row.names = c(NA, -2L), class = c("tbl_df",
"tbl", "data.frame"))
Obviously this is just a sample; there are a number of years for which I need the subsets of data in between ($emergence date and $disease_onset).
This works for one date range:
testdf %>% filter(short_date >= dates_cut[1,1], short_date <= dates_cut[1,2])
The problem is when there are multiple date ranges.
Thanks.
One option would be to lapply over the rows of dates_cut and store each subset in a list. After that you can rbind them all together with do.call:
subsets <- lapply(seq_len(nrow(dates_cut)), function(i) {
testdf[which(testdf$short_date >= dates_cut$emergence[i] &
testdf$short_date <= dates_cut$disease_onset[i]), , drop = FALSE]})
res <- do.call(rbind, subsets)
head(res)
# short_date
#55 2007-04-24
#56 2007-04-25
#57 2007-04-26
#58 2007-04-27
#59 2007-04-28
#60 2007-04-29
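A variation on the lapply answer above, sketched with base R only: instead of collecting subsets and binding them, build one logical mask per date range and combine the masks with |. The names masks, in_any_range and res are illustrative. One consequence of this design is that a row falling inside two overlapping ranges is kept once rather than duplicated.

```r
# Data from the question (dates_cut rebuilt as a plain data frame)
testdf <- data.frame(short_date = seq(as.Date("2007-03-01"),
                                      as.Date("2008-09-01"), by = "day"))
dates_cut <- data.frame(
  emergence     = as.Date(c(13627, 13997), origin = "1970-01-01"),
  disease_onset = as.Date(c(13694, 14062), origin = "1970-01-01")
)

# One logical vector per range: TRUE where short_date lies inside that range
masks <- lapply(seq_len(nrow(dates_cut)), function(i) {
  testdf$short_date >= dates_cut$emergence[i] &
    testdf$short_date <= dates_cut$disease_onset[i]
})

# A row is kept if it falls inside at least one range
in_any_range <- Reduce(`|`, masks)
res <- testdf[in_any_range, , drop = FALSE]
head(res)
```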

replacing blank not NA

I have two variables a and b:
a      b
vessel hot
parts
nest   NA
best   true
neat   smooth
I want to replace the blanks in b with the value of a:
la$b[i1] <- ifelse(la$b[i1] == "",la$a[i1],la$b[i1])
But it is not working.
We can use data.table. Convert the 'data.frame' to 'data.table' (setDT(df1)), specify the condition in 'i' (b==''), and assign the values of 'a' that corresponds to TRUE values in 'i' to 'b'. It should be fast as we are assigning in place.
library(data.table)
setDT(df1)[b=='', b:= a]
df1
# a b
#1: vessel hot
#2: parts parts
#3: nest NA
#4: best true
#5: neat smooth
Or we can just use base R:
i1 <- df1$b=='' & !is.na(df1$b)
df1$b[i1] <- df1$a[i1]
data
df1 <- structure(list(a = c("vessel", "parts", "nest", "best", "neat"
), b = c("hot", "", NA, "true", "smooth")), .Names = c("a", "b"
), class = "data.frame", row.names = c(NA, -5L))
Instead of
# la$b[i1] <- ifelse(la$b[i1] == "",la$a[i1],la$b[i1])
# what is i1? it doesn't seem to have any obvious function here
... it should be:
la$b <- ifelse(la$b == "", la$a, la$b)
assuming that you want to replace every blank in b with the corresponding value of a.
It works:
df <- structure(list(a = c("vessel", "parts", "nest", "best", "neat"
), b = c("hot", "", NA, "true", "smooth")), .Names = c("a",
"b"), row.names = c(NA, -5L), class = "data.frame")
df$b <- ifelse(df$b=="", df$a, df$b)
# or, with `with`: df$b <- with(df, ifelse(b=="",a,b))
# > df
# a b
# 1 vessel hot
# 2 parts parts
# 3 nest <NA>
# 4 best true
# 5 neat smooth
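Another way to build the index of blank entries without a separate is.na check is sketched below, in base R. It relies on %in% returning FALSE for NA, whereas == returns NA. The name blank is illustrative, and the data is the df1 from the answer above.

```r
df1 <- data.frame(a = c("vessel", "parts", "nest", "best", "neat"),
                  b = c("hot", "", NA, "true", "smooth"),
                  stringsAsFactors = FALSE)

# %in% yields FALSE (not NA) for missing values, so NA rows are never selected
blank <- df1$b %in% ""

# Copy a into b only where b is an empty string; NA stays NA
df1$b[blank] <- df1$a[blank]
df1
```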

Use ddply on multiple dataframes and create corresponding new dataframes

I have 18 dataframes named ageharmonic1, ageharmonic2, ageharmonic3, ..., ageharmonic18.
All of the dataframes have similar content and the same number of rows. Here is the head of one dataframe:
ageharmonic1 <- structure(list(Time = c(129, 129.041687011719, 129.08332824707,
129.125015258789, 129.166687011719, 129.20832824707), Dye = c(0.99999612569809,
0.999995410442352, 0.999996840953827, 0.999998211860657, 1.00000166893005,
0.999999165534973), ageconc = c(583.908142089844, 576.525756835938,
572.939453125, 572.553527832031, 573.761291503906, 578.520263671875
), id = c("station1", "station1", "station1", "station1", "station1",
"station1"), dist = c(0, 0, 0, 0, 0, 0), age = c(0.00675822227239628,
0.00667278244035045, 0.00663126461889936, 0.00662678879212212,
0.00664074460576439, 0.0066958419725371)), .Names = c("Time",
"Dye", "ageconc", "id", "dist", "age"), row.names = c(NA, 6L), class = "data.frame")
> head(ageharmonic1)
Time Dye ageconc id dist age
1 129.0000 0.9999961 583.9081 station1 0 0.006758222
2 129.0417 0.9999954 576.5258 station1 0 0.006672782
3 129.0833 0.9999968 572.9395 station1 0 0.006631265
4 129.1250 0.9999982 572.5535 station1 0 0.006626789
5 129.1667 1.0000017 573.7613 station1 0 0.006640745
6 129.2083 0.9999992 578.5203 station1 0 0.006695842
What I want to do now is to aggregate the dataframe by the id variable using the ddply function from the plyr package:
aggreg1 <- ddply(ageharmonic1, .(id), summarise, meanage=mean(age))
I want to apply the same formula to all the dataframes and automatically create the dataframes aggreg1, aggreg2, aggreg3, ..., aggreg18.
This is what I have tried:
for (i in 1:18){
aggreg[i] <- ddply(paste0("ageharmonic",i),.(id),summarise,meanage=mean(age))
}
The expression paste0("ageharmonic",i) is a character string and doesn't seem to represent the dataframe that I am trying to work on.
If you put your data frames in a list, you could try this:
# a small example
# create some data frames
df1 <- data.frame(id = rep(1:2, each = 3), age = rnorm(6))
df2 <- data.frame(id = rep(3:4, each = 3), age = rnorm(6))
df3 <- data.frame(id = rep(1:2, each = 3), age = rnorm(6))
# create a list of data frames
mylist <- list(df1, df2, df3)
mylist
# for each element in the list (i.e. a single data frame), apply the function 'aggregate',
# where mean age per id is calculated
# store aggregated results in a new list
mylist2 <- lapply(seq_along(mylist), function(x) aggregate(age ~ id, data = mylist[[x]], mean))
mylist2
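If the data frames already exist as separate objects named ageharmonic1, ageharmonic2 and so on, one way to get them into a list is mget, which looks objects up by name. The sketch below uses base R and two tiny made-up data frames in place of the real 18; frames and aggregs are illustrative names, and the values are invented for the example.

```r
# Two small stand-ins for the question's ageharmonic data frames (made-up values)
ageharmonic1 <- data.frame(id = rep("station1", 3), age = c(1, 2, 3))
ageharmonic2 <- data.frame(id = rep("station2", 3), age = c(4, 5, 6))

# mget() turns a vector of object names into a named list of the objects
frames <- mget(paste0("ageharmonic", 1:2))

# Mean age per id for every data frame in the list
aggregs <- lapply(frames, function(d) aggregate(age ~ id, data = d, FUN = mean))

# Name the results aggreg1, aggreg2, ... and access them by name
names(aggregs) <- paste0("aggreg", seq_along(aggregs))
aggregs$aggreg1
```

Keeping the results in one named list, rather than creating 18 separate objects, makes later steps such as binding them together easier.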
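library(plyr)
# two small data frames with the same columns (mpg, cyl)
mydata1 <- mtcars[1:10, 1:2]
mydata2 <- mtcars[11:20, 1:2]
mydata <- list(mydata1, mydata2)
# mean mpg per cyl for each data frame in the list
kk <- Map(function(x) ddply(x, .(cyl), summarize, mpg = mean(mpg)), mydata)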
> kk
[[1]]
cyl mpg
1 4 23.33333
2 6 20.14000
3 8 16.50000
[[2]]
cyl mpg
1 4 32.23333
2 6 17.80000
3 8 14.06667
