merging time series datasets, but missing the time series column - r

I've tried methods from others for merging time series datasets; however, the time column is missing from the result (see the captured screenshot).
Here is the example of my datasets.
df1 = data.frame(Time = round(seq(1, 200, length.out= 50)), Var1 = runif(50,1, 10))
df2 = data.frame(Time = round(seq(1, 200, length.out= 80)), Var2 = runif(80,1, 10))
df3 = data.frame(Time = round(seq(1, 200, length.out= 100)), Var3 = runif(100,1, 10))
Here is what I've tried.
library(zoo)
a = read.zoo(df1, drop = FALSE)
b = read.zoo(df2, drop = FALSE)
c = read.zoo(df3, drop = FALSE)
abc = merge(a, b, c)
How can I add the Time as the first column? Any comments about this task that I can learn from would be appreciated.
Thanks.

This converts all three data frames to zoo and merges them into a combined zoo object.
z <- do.call("merge", lapply(list(df1, df2, df3), read.zoo, drop = FALSE))
Note that in zoo objects the time is stored in the index attribute; it is not a column. The statement shown above already includes the time, derived from the first column of each of the data frames.
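If you then need a plain data frame again with Time as the first column, the merged zoo object can be converted back. A minimal sketch (res is just an illustrative name; fortify.zoo ships with the zoo package):
res <- fortify.zoo(z)    # the index becomes an ordinary column called "Index"
names(res)[1] <- "Time"  # rename it to Time
# equivalent base construction: data.frame(Time = index(z), coredata(z))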

Related

Conditional operations between factor level pairs

I have a dataframe (df) that contains Start times and End times for observations of different IDs:
df <- structure(list(ID = 1:4,
                     Start = c("2021-05-12 13:22:00", "2021-05-12 13:25:00",
                               "2021-05-12 13:30:00", "2021-05-12 13:42:00"),
                     End = c("2021-05-13 8:15:00", "2021-05-13 8:17:00",
                             "2021-05-13 8:19:00", "2021-05-13 8:12:00")),
                class = "data.frame", row.names = c(NA, -4L))
I want to create a new dataframe that shows the latest Start time and the earliest End time for each possible pairwise comparison between the levels of ID.
I was able to accomplish this by making a duplicate column of ID called ID2, using tidyr::expand to expand them, and saving that in an object called Pairs:
library(dplyr)
library(tidyr)
df$ID2 <- df$ID
Pairs <- df %>%
  expand(ID, ID2)
I then made two new objects, a and b, that store the Start and End times for each comparison separately, and combined them into df2:
a <- left_join(df, Pairs, by = "ID") %>%
  rename(StartID1 = Start, EndID1 = End, ID2 = ID2.y) %>%
  select(-ID2.x)
b <- left_join(Pairs, df, by = "ID2") %>%
  rename(StartID2 = Start, EndID2 = End) %>%
  select(ID2, StartID2, EndID2)
df2 <- cbind(a, b)
df2 <- df2[, -4]
Finally, I used dplyr::if_else to find the LatestStart time and the EarliestEnd time for each of the comparisons:
df2 <- df2 %>%
  mutate(LatestStart = if_else(StartID1 > StartID2, StartID1, StartID2),
         EarliestEnd = if_else(EndID1 > EndID2, EndID2, EndID1))
This seems like such a simple task to perform; is there a more concise way to achieve it from df without creating all of these extra objects?
For such computations, outer usually comes in handy:
df %>%
  mutate(across(c("Start", "End"), lubridate::ymd_hms)) %>%
  {
    data.frame(
      ID1 = rep(.$ID, each = nrow(.)),
      ID2 = rep(.$ID, nrow(.)),
      LatestStart = c(outer(.$Start, .$Start, pmax)),
      EarliestEnd = c(outer(.$End, .$End, pmin))
    )
  }
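A hedged alternative sketch, not part of the original answer (d and df2_alt are just illustrative names): cross-join df with itself using base merge(by = NULL) and take the pairwise max/min directly, which keeps the question's column names:
library(dplyr)
library(lubridate)

d <- df %>% mutate(across(c(Start, End), ymd_hms))

df2_alt <- merge(
  d %>% rename(ID1 = ID, StartID1 = Start, EndID1 = End),  # each row once as "ID1"
  d %>% rename(ID2 = ID, StartID2 = Start, EndID2 = End),  # crossed with every row as "ID2"
  by = NULL                                                # by = NULL gives the Cartesian product
) %>%
  mutate(LatestStart = pmax(StartID1, StartID2),
         EarliestEnd = pmin(EndID1, EndID2))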

matching large vector of string against large vector of patterns

I have a very large dataframe with a column containing postal codes:
data <- data.frame(data = rnorm(n = 4),
                   code = c("1001", "1130", "2001", "9010"),
                   stringsAsFactors = F)
I also have a second large-ish dataframe with postal code patterns mapped to a zone.
mapping <- data.frame(code = c("10*", "20*"),
                      zone = c("zone1", "zone2"),
                      stringsAsFactors = F)
I would like to join those two tables to add the zone column to the data dataframe but the volume of the data is too large to do a "rowwise" grepl. What is the most efficient way of doing this?
The most efficient way to deal with large objects is data.table. To do joins, you need a common column in both objects. I'm using substr to get only the first two digits of the code column in the data object. Also note that I removed the "*" from mapping as that character is not present in data.
library(data.table)
setDT(data)
setDT(mapping)
data[, code := substr(code, start = 1, stop = 2)]
mapping[data, on="code"]
code zone data
1: 10 zone1 -1.0481912
2: 11 <NA> 1.1339476
3: 20 zone2 -0.8072921
4: 90 <NA> 1.5883562
DATA
data <- data.frame(data = rnorm(n = 4),
                   code = c("1001", "1130", "2001", "9010"),
                   stringsAsFactors = F)
mapping <- data.frame(code = c("10", "20"),
                      zone = c("zone1", "zone2"),
                      stringsAsFactors = F)
I am not sure what specific method you are using when you say "rowwise" but here is what I would do in the dplyr world.
mapping <- dplyr::rename(mapping, codeString = code)  # rename for joining
data <- data %>%
  dplyr::mutate(codeString = paste0(substr(code, 1, 2), "*")) %>%
  dplyr::left_join(mapping, by = "codeString")
You should be able to join like this and avoid any rowwise operation, since the pattern you're looking for is easy to create.

mutate_at - using a function with map2

My objective is to convert a set of monthly revenue columns from AUD to USD. To achieve this, I need to apply a different exchange rate to each of the revenue columns.
data for analysis:
pacman::p_load(lubridate, purrr, dplyr)
df1 <- data.frame(
  Date = seq(dmy("01/01/2017"), by = "day", length.out = 3),
  Customer = "a",
  Product = "xxx",
  Revenue1 = c(10, 20, 30),
  Revenue2 = c(100, 200, 300))
df2 <- data.frame(Factor1 = c(10),
                  Factor2 = c(20))
df3 <- select(df1, Revenue1:Revenue2)
This is my function
fx_adjust <- function(x, y = df2){map2_df(x, y, ~ .x * .y)}
These two work:
fx_adjust(df3, df2)
mutate_at(df1, vars(contains("Revenue")), funs(. * 10))
But this does not work:
mutate_at(df1, vars(contains("Revenue")), funs(fx_adjust(.)))
Could someone kindly explain why mutate_at is misbehaving?
This is because mutate_at calls your function separately for each column; it does not pass all of the columns at once as the . placeholder.
Observe this example:
fx_dump<-function(...) print(list(...))
mutate_at(df1, vars(contains("Revenue")), funs(fx_dump(.)))
You'll see that fx_dump is called twice, once for each column. You cannot pass multiple parameters at a time to your function using mutate_at.
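One possible workaround, sketched here rather than taken from the original answer (rev_cols and df1_adj are illustrative names, and it assumes the columns of df2 are in the same order as the revenue columns): apply each factor to its matching revenue column in one step with map2, outside of mutate_at, and write the result back.
library(purrr)

rev_cols <- grep("Revenue", names(df1), value = TRUE)     # "Revenue1" "Revenue2"
df1_adj <- df1
df1_adj[rev_cols] <- map2(df1[rev_cols], df2, ~ .x * .y)  # Revenue1 * Factor1, Revenue2 * Factor2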

long to wide data restructure in data.table

I have a data.table which is in long format and I want to restructure it to wide format, something similar to CASESTOVARS in SPSS. The data below is created in long format using melt:
library(data.table)
set.seed(71)
DT <- data.table(town = rep(c('A','B'), each = 17),
                 tc = rep(c('C','D'), 17),
                 one = rnorm(34, 1, 1),
                 two = rnorm(34, 2, 1),
                 three = rnorm(34, 3, 1),
                 four = rnorm(34, 4, 1),
                 five = rnorm(34, 5, 2),
                 six = rnorm(34, 6, 2),
                 seven = rnorm(34, 7, 2),
                 eight = rnorm(34, 28, 3))
DT1 <- melt(DT, id.vars = c("town","tc"),measure=3:10)
DT1[, `:=`(mn = mean(value, na.rm = TRUE),
           sdev = sd(value, na.rm = TRUE),
           uplimit = mean(value, na.rm = TRUE) + 1.96 * sd(value, na.rm = TRUE),
           lowlimit = mean(value, na.rm = TRUE) - 1.96 * sd(value, na.rm = TRUE)),
    by = .(town, tc, variable)][
      , outlier := +(value < mn - 1.96 * sdev | value > mn + 1.96 * sdev)]
So, originally the data had 34 records and eight key variables, "one" to "eight". Using melt we get data similar to the one generated by the code above, and the column "value" holds the values from the original data. On this data we then compute the new variables "mn", "sdev", "uplimit", "lowlimit", and "outlier". This data now needs to be merged with the original data, which has only 34 records, so I want to restructure it so that the result has 34 records and eight columns each for "mn", "sdev", "uplimit", "lowlimit", and "outlier". How can I achieve this? I was trying dcast, but I am not clear how ~ and + are used in the formula; it is not clearly explained in ?dcast and other notes. Do you have something to share which explains this with an example? Can the requirement be achieved using dcast?
Finally, after multiple tries, I have found the answer to my question. The key thing to note is that the original data has to have a unique row identifier, which in this case was not present. So I added sequential unique numbers at the end in the variable "unique".
library(data.table)
set.seed(71)
DT <- data.table(town = rep(c('A','B'), each = 17),
                 tc = rep(c('C','D'), 17),
                 one = rnorm(34, 1, 1),
                 two = rnorm(34, 2, 1),
                 three = rnorm(34, 3, 1),
                 four = rnorm(34, 4, 1),
                 five = rnorm(34, 5, 2),
                 six = rnorm(34, 6, 2),
                 seven = rnorm(34, 7, 2),
                 eight = rnorm(34, 28, 3),
                 unique = 1:34)
This gave me the data with a column holding unique IDs. Then, when converting the data to long format, I used this unique variable in melt as well.
DT1 <- melt(DT, id.vars = c("town","tc","unique"),measure=3:10)
Then I created the new variables that were needed:
DT1[, `:=`(mn = mean(value, na.rm = TRUE),
           sdev = sd(value, na.rm = TRUE),
           uplimit = mean(value, na.rm = TRUE) + 1.96 * sd(value, na.rm = TRUE),
           lowlimit = mean(value, na.rm = TRUE) - 1.96 * sd(value, na.rm = TRUE)),
    by = .(town, tc, variable)][
      , outlier := +(value < mn - 1.96 * sdev | value > mn + 1.96 * sdev)]
Then, when converting this long data to wide format so that I have the same number of records as in the original data, I made use of this unique variable in dcast:
DT2 <- dcast(DT1,town+tc+unique~variable,value.var = c("mn","sdev","outlier","uplimit","lowlimit"))
This gave me the desired output with the same number of records (34) as in the original data, the desired number of columns, and each cell holding the same data as in the long format.
I hope you all find this useful :)!!
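If the new summary columns then need to sit next to the original 34 records, a final join on the key columns should do it; a short sketch (DT_final is just an illustrative name):
DT_final <- merge(DT, DT2, by = c("town", "tc", "unique"))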

Matching across two data frames with certain observations having multiple entries to match against

I am working with two data frames corresponding to the sample below:
# Data sets
set.seed(1)
dta_a <- data.frame(some_value = runif(n = 10),
                    identifier = c("A0001","A0002","A0003","A0004","A0005",
                                   "A0006","B0001","B0002","B0003","B0004"),
                    other_val = runif(n = 10))
dta_b <- data.frame(variable_abc = runif(n = 6),
                    identifier = c("A0001","A0002","A0003,A0004,A0005,C0001",
                                   "B0001,B0002","B0003","B0004"),
                    variable_df = runif(n = 6))
I would like to merge those two data frames. The resulting data frame should have the following qualities:
For the observations where only one identifier is present, the merge should behave like merge with all.y = TRUE and all.x = FALSE, assuming that y is dta_b.
For the observations where multiple identifiers are provided, only the first matched value from dta_a is taken and the remaining values are ignored. If there is no match on the first identifier (A0003), I would like the command to attempt to match the next one (A0004).
I referred to the merge command but, naturally, dplyr and other solutions are fine.
You can 'melt' dta_b so as to have one row per identifier with a preference order, and then join on the identifiers:
library(dplyr)
library(tidyr)
melt_dta_b = lapply(1:nrow(dta_b), function(i){
  split_identifier = strsplit(as.character(dta_b$identifier[i]), split = ",", fixed = TRUE)[[1]]
  tibble(identifier = split_identifier,
         original_identifier = dta_b$identifier[i], original_row = i,
         preference = 1:length(identifier),
         variable_abc = dta_b$variable_abc[i], variable_df = dta_b$variable_df[i])
})
melt_dta_b = bind_rows(melt_dta_b)
At that point you can keep, for each original row, only the most preferred match (the lowest preference value):
joined_df = left_join(melt_dta_b, dta_a) %>%
  filter(!is.na(some_value)) %>%
  group_by(original_row) %>%
  filter(preference == min(preference)) %>%
  ungroup()
UPDATE
In order not to call the variables explicitly by name, you can use the following code, which binds all the 'unused' columns of the original df:
melt_dta_b = lapply(1:nrow(dta_b), function(i){
  tmp = dta_b[i,]
  split_identifier = strsplit(as.character(tmp$identifier), split = ",", fixed = TRUE)[[1]]
  colnames(tmp)[2] = "original_identifier"
  tibble(identifier = split_identifier, original_row = i, preference = 1:length(identifier)) %>%
    cbind(tmp)
})
melt_dta_b = bind_rows(melt_dta_b)
Just one way of doing it, and probably not the best, but here is an attempt.
Split the identifiers and merge according to the first one.
dta_a$identifier = as.vector(dta_a$identifier)
dta_a1 = data.frame(dta_a, identifier_split = do.call(rbind, strsplit(dta_a$identifier, split = ",", fixed = T)))
dta_b$identifier = as.vector(dta_b$identifier)
dta_b1 = data.frame(dta_b, identifier_split = do.call(rbind, strsplit(dta_b$identifier, split = ",", fixed = T)))
dta_join = merge(dta_a1, dta_b1, by.x = "identifier_split", by.y = "identifier_split.1", all.x = F, all.y = T)
In cases where you don't have a match on the first one, you'll see NAs; you can subset those and merge on the second one ("identifier_split.2").
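For completeness, a more recent hedged sketch of the same "melt, then prefer the first match" idea (joined_alt is an illustrative name; it assumes reasonably recent tidyr and dplyr versions, for separate_rows and slice_min):
library(dplyr)
library(tidyr)

joined_alt <- dta_b %>%
  mutate(identifier = as.character(identifier),
         original_identifier = identifier,
         original_row = row_number()) %>%
  separate_rows(identifier, sep = ",") %>%         # one row per single identifier
  group_by(original_row) %>%
  mutate(preference = row_number()) %>%            # order within the original cell
  ungroup() %>%
  left_join(dta_a %>% mutate(identifier = as.character(identifier)),
            by = "identifier") %>%
  filter(!is.na(some_value)) %>%                   # drop identifiers with no match
  group_by(original_row) %>%
  slice_min(preference, n = 1) %>%                 # keep the first identifier that matched
  ungroup()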
