R date time aligning and fill through values

I have multiple data frames; for this example, suppose there are two.
Each frame comprises two columns: an index column and a value column.
sz<-5;
frame_1<-data.frame(index=sort(sample(1:10,sz,replace=F)),value=rpois(sz,50));
frame_2<-data.frame(index=sort(sample(1:10,sz,replace=F)),value=rpois(sz,50));
frame_1:
index value
1 49
6 62
7 58
8 30
10 50
frame_2:
index value
4 60
5 64
6 48
7 46
9 57
The goal is to create a third frame, frame_3, whose indices will be the union of those in frame_1 and frame_2,
frame_3<-data.frame(index = sort(union(frame_1$index,frame_2$index)));
and which will comprise two additional columns, value_1 and value_2.
frame_3$value_1 will be filled out from frame_1$value, frame_3$value_2 will be filled out from frame_2$value;
These should be filled out like so:
frame_3:
index value_1 value_2
1 49 NA
4 49 60 # value_1 is filled through with previous value
5 49 64 # value_1 is filled through with previous value
6 62 48
7 58 46
8 30 46 # value_2 is filled through with previous value
9 30 57 # value_1 is filled through with previous value
10 50 57 # value_1 is filled through with previous value
I'm looking for an efficient solution, as I'm dealing with hundreds of thousands of records.

This problem screams for data.table. You can use a loop to recursively construct columns one by one using x[y, roll=TRUE].
require(data.table)
dt1 <- data.table(frame_1)
dt2 <- data.table(frame_2)
setkey(dt1, index)
setkey(dt2, index)
dt3 <- data.table(index = sort(unique(c(dt1$index, dt2$index))))
> dt1[dt2[dt3, roll=TRUE], roll=TRUE]
# index value value.1
# 1: 1 49 NA
# 2: 4 49 60
# 3: 5 49 64
# 4: 6 62 48
# 5: 7 58 46
# 6: 8 30 46
# 7: 9 30 57
# 8: 10 50 57
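For more than two frames, the loop mentioned above might look like the following minimal sketch. The names `frames`, `result`, and the `value_k` column naming are assumptions made for illustration; the data are the two frames from the question, and each pass adds one column with a rolling join.

```r
library(data.table)

# The two frames from the question; more frames can be appended to this list
frames <- list(
  data.frame(index = c(1, 6, 7, 8, 10), value = c(49, 62, 58, 30, 50)),
  data.frame(index = c(4, 5, 6, 7, 9),  value = c(60, 64, 48, 46, 57))
)

# Union of all indices, sorted
all_index <- sort(unique(unlist(lapply(frames, function(f) f$index))))
result <- data.table(index = all_index)

for (k in seq_along(frames)) {
  dt_k <- as.data.table(frames[[k]])
  # Rolling join: for each index in result, take the last value at or before it
  filled <- dt_k[result, value, on = "index", roll = TRUE]
  set(result, j = paste0("value_", k), value = filled)
}

result
```

Each iteration joins on `index` only, so the columns already added to `result` do not affect later passes.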

If your data.frames aren't very large, you can just use merge combined with zoo::na.locf.
R> library(zoo)
R> frame_3 <- merge(frame_1, frame_2, by="index",
+ all=TRUE, suffixes=paste(".",1:2,sep=""))
R> (frame_3 <- na.locf(frame_3))
index value.1 value.2
1 1 49 NA
2 4 49 60
3 5 49 64
4 6 62 48
5 7 58 46
6 8 30 46
7 9 30 57
8 10 50 57
Or, just use zoo objects to begin with, assuming your "value" columns are all one type (like a matrix, you can't mix types in zoo objects).
R> z1 <- zoo(frame_1$value, frame_1$index)
R> z2 <- zoo(frame_2$value, frame_2$index)
R> (z3 <- na.locf(merge(z1, z2)))
z1 z2
1 49 NA
4 49 60
5 49 64
6 62 48
7 58 46
8 30 46
9 30 57
10 50 57
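If adding a package is not an option, the same fill-through can be sketched in base R with `findInterval`. This is only a sketch: the helper name `fill_through` is hypothetical, and it assumes each frame's index column is already sorted in ascending order, as in the question.

```r
frame_1 <- data.frame(index = c(1, 6, 7, 8, 10), value = c(49, 62, 58, 30, 50))
frame_2 <- data.frame(index = c(4, 5, 6, 7, 9),  value = c(60, 64, 48, 46, 57))

# Hypothetical helper: carry the last observed value forward onto idx.
# Assumes frame$index is sorted ascending.
fill_through <- function(frame, idx) {
  pos <- findInterval(idx, frame$index)  # 0 when idx precedes every index
  pos[pos == 0] <- NA                    # nothing to carry forward yet
  frame$value[pos]
}

idx <- sort(union(frame_1$index, frame_2$index))
frame_3 <- data.frame(index   = idx,
                      value_1 = fill_through(frame_1, idx),
                      value_2 = fill_through(frame_2, idx))
frame_3
```

`findInterval` does a binary search per lookup, so this should remain practical for hundreds of thousands of rows.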

Related

Vectorizing lagged operations

How can I vectorize the following operation in R that involves modifying column Z recursively using lagged values of Z?
library(dplyr)
set.seed(5)
initial_Z=1000
df <- data.frame(X=round(100*runif(10),0), Y=round(100*runif(10),0))
df
X Y
1 20 27
2 69 49
3 92 32
4 28 56
5 10 26
6 70 20
7 53 39
8 81 89
9 96 55
10 11 84
df <- df %>% mutate(Z=if_else(row_number()==1, initial_Z-Y, NA_real_))
df
X Y Z
1 20 27 973
2 69 49 NA
3 92 32 NA
4 28 56 NA
5 10 26 NA
6 70 20 NA
7 53 39 NA
8 81 89 NA
9 96 55 NA
10 11 84 NA
for (i in 2:nrow(df)) {
df$Z[i] <- (df$Z[i-1]*df$X[i-1]/df$X[i])-df$Y[i]
}
df
X Y Z
1 20 27 973.000000
2 69 49 233.028986
3 92 32 142.771739
4 28 56 413.107143
5 10 26 1130.700000
6 70 20 141.528571
7 53 39 147.924528
8 81 89 7.790123
9 96 55 -48.427083
10 11 84 -506.636364
So the first value of Z is set from initial_Z and the first value of Y. The rest of the values of Z are calculated from the lagged values of X and Z and the current value of Y.
My actual df is large, and I need to repeat this operation thousands of times in a simulation. Using a for loop takes too much time. I prefer implementing this using dplyr, but other approaches are also welcome.
Many thanks in advance for any help.
I don't know that you can avoid the effect of for loops, but in general R should be pretty good at them. Given that, here is a Reduce variant that might suffice for you:
set.seed(5)
initial_Z=1000
df <- data.frame(X=round(100*runif(10),0), Y=round(100*runif(10),0))
df$Z <- with(df, Reduce(function(prevZ, i) {
if (i == 1) return(prevZ - Y[i])
prevZ*X[i-1]/X[i] - Y[i]
}, seq_len(nrow(df)), init = initial_Z, accumulate = TRUE))[-1]
df
# X Y Z
# 1 20 27 973.000000
# 2 69 49 233.028986
# 3 92 32 142.771739
# 4 28 56 413.107143
# 5 10 26 1130.700000
# 6 70 20 141.528571
# 7 53 39 147.924528
# 8 81 89 7.790123
# 9 96 55 -48.427083
# 10 11 84 -506.636364
To be clear, Reduce uses for loops internally to get through the data. I generally don't like using indices as the values for Reduce's x, but since Reduce only iterates over one value, and we need both X and Y, the indices (rows) are a required step.
The same can be accomplished using purrr::accumulate2. Note that these are still just for loops. If the loop is causing a performance problem in R, consider writing it in Rcpp.
df %>%
mutate(Z = accumulate2(Y, c(1, head(X, -1)/X[-1]), ~ ..1 * ..3 -..2, .init = 1000)[-1])
X Y Z
1 20 27 973
2 69 49 233.029
3 92 32 142.7717
4 28 56 413.1071
5 10 26 1130.7
6 70 20 141.5286
7 53 39 147.9245
8 81 89 7.790123
9 96 55 -48.42708
10 11 84 -506.6364
You could unlist(Z):
df %>%
mutate(Z = unlist(accumulate2(Y, c(1, head(X, -1)/X[-1]), ~ ..1 * ..3 -..2, .init = 1000))[-1])
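For this particular recurrence a loop-free form also appears to exist. Multiplying both sides by X[i] gives Z[i]*X[i] = Z[i-1]*X[i-1] - X[i]*Y[i], so the product Z*X is a running sum: Z[i]*X[i] = X[1]*initial_Z - cumsum(X*Y)[i]. The sketch below assumes X contains no zeros, uses the X and Y values printed in the question, and checks the result against the original loop. Floating-point results may differ from the loop in the last digits.

```r
# X and Y as printed in the question
X <- c(20, 69, 92, 28, 10, 70, 53, 81, 96, 11)
Y <- c(27, 49, 32, 56, 26, 20, 39, 89, 55, 84)
initial_Z <- 1000

# Closed form derived from Z[i] * X[i] = X[1] * initial_Z - cumsum(X * Y)[i]
Z_closed <- (X[1] * initial_Z - cumsum(X * Y)) / X

# Reference: the loop from the question
Z_loop <- numeric(length(X))
Z_loop[1] <- initial_Z - Y[1]
for (i in 2:length(X)) {
  Z_loop[i] <- Z_loop[i - 1] * X[i - 1] / X[i] - Y[i]
}

all.equal(Z_closed, Z_loop)
```

Inside a dplyr pipeline the same expression could be used as `mutate(Z = (X[1] * initial_Z - cumsum(X * Y)) / X)`.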

Creating new column according to the position of other column in R

For the following data set,
mydat=data.frame(sl=c(1,3,8,10,4,6,5,7,2,9),x=c(50,42,15,49,56,30,66,52,40,38))
mydat
sl x
1 1 50
2 3 42
3 8 15
4 10 49
5 4 56
6 6 30
7 5 66
8 7 52
9 2 40
10 9 38
I would like to create another column according to the position of sl. The first value of the new column, say xval, should be 50, the second 40, the third 42, and so on, so the new column should look like xval=50,40,42,56,...,49.
Any help is appreciated.
Using data.table
require(data.table); setDT(mydat)
mydat[, New := x[order(sl)]]
Using Base R
Contribution from Onyambu:
transform(mydat,New = x[order(sl)])
Alternatively:
mydat$New = mydat$x[order(mydat$sl)]
Result
> mydat
sl x New
1: 1 50 50
2: 3 42 40
3: 8 15 42
4: 10 49 56
5: 4 56 66
6: 6 30 30
7: 5 66 52
8: 7 52 15
9: 2 40 38
10: 9 38 49
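To see why `x[order(sl)]` gives the requested column, it may help to look at what `order` returns. The following is a small base R sketch using the data from the question; the variable name `pos` is introduced only for illustration.

```r
mydat <- data.frame(sl = c(1, 3, 8, 10, 4, 6, 5, 7, 2, 9),
                    x  = c(50, 42, 15, 49, 56, 30, 66, 52, 40, 38))

# order(sl) lists the row that holds sl == 1, then sl == 2, and so on
pos <- order(mydat$sl)
pos
# 1 9 2 5 7 6 8 3 10 4

# Indexing x by those rows picks the x value belonging to sl = 1, 2, 3, ...
mydat$xval <- mydat$x[pos]
mydat$xval
# 50 40 42 56 66 30 52 15 38 49
```

When sl is a permutation of 1 to n, as it is here, `match(seq_along(mydat$sl), mydat$sl)` returns the same positions.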

How can I create a relative distance matrix in r using string variables?

I have a dataset on NPIs, with the information about each NPI stored mostly in string variables.
I've simplified it for this example:
data <- as.data.frame(cbind(51:60, sample(1:10, 10, replace = TRUE), sample(1:10, 10, replace = TRUE), sample(1:10, 10, replace = TRUE)), stringsAsFactors = FALSE)
colnames(data) <- c("npi", "a", "b", "c")
for instance:
npi a b c
51 6 2 1
52 6 2 6
53 10 9 2
54 7 4 7
55 7 10 5
56 8 5 7
57 7 2 10
58 5 9 3
59 8 4 6
60 1 10 2
I want to create a distance matrix showing the relative distances between the different NPI's
I want them to have a large distance when they're not very similar and a small distance when they are very similar. By similar I mean that they share values on variables. The variables in the real dataset are names and addresses, so I cannot simply use dist().
This is how I got the distance between two npi's
length(intersect(npi1,npi2))/3
But I don't know how to create a loop or a function to run through the whole dataset and give me a distance matrix like this:
51 52 53 54 55 56 57 58 59 60
51 0 distance 51 to 52
52 0
53 0
54 0
55 0
56 0
57 0
58 0
59 0
60 0
Would you be able to point me in the right direction which kind of loop or function to use for this problem?
#sample data
df <- read.table(text='npi a b c
51 6 2 1
52 6 2 6
53 10 9 2
54 7 4 7
55 7 10 5
56 8 5 7
57 7 2 10
58 5 9 3
59 8 4 6
60 1 10 2', header=T, sep='')
#convert 1st column of data as the row index
df1 <- df[,-1]
rownames(df1) <- df[,1]
#calculate distance
library(proxy)
dist_func <- function(x, y) length(intersect(x,y))/3
proxy::dist(df1, method = dist_func)
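Note that `length(intersect(x, y))/3` grows as two rows become more alike, so it behaves as a similarity score rather than a distance, and `intersect` ignores which column a value came from. A possible base R sketch of an actual distance is shown below. It assumes "sharing a value" means having the same value in the same column, which is an interpretation of the question and not something it states outright; the function name `shared_distance` is hypothetical, and only the first three rows of the sample data are used.

```r
df <- read.table(text = "npi a b c
51 6 2 1
52 6 2 6
53 10 9 2", header = TRUE)

vals <- as.matrix(df[, -1])
rownames(vals) <- df$npi

# Hypothetical helper: 1 minus the share of columns in which two rows agree.
# Works for character matrices as well, since it only tests equality.
shared_distance <- function(m) {
  n <- nrow(m)
  d <- matrix(0, n, n, dimnames = list(rownames(m), rownames(m)))
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      d[i, j] <- 1 - mean(m[i, ] == m[j, ])
    }
  }
  d
}

d <- shared_distance(vals)
as.dist(d)  # lower-triangle view, usable with hclust() and similar functions
```

The double loop is quadratic in the number of rows, so for a large dataset a package such as proxy may be the better choice.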

Subset data frame where values are greater than another data frame

Say I have a data frame with 3 columns of data (a,b,c) and 1 column of categories with multiple instances of each category (class).
set.seed(273)
a <- floor(runif(20,0,100))
b <- floor(runif(20,0,100))
c <- floor(runif(20,0,100))
class <- floor(runif(20,0,6))
df1 <- data.frame(a,b,c,class)
print(df1)
a b c class
1 31 73 28 3
2 44 33 57 3
3 19 35 53 0
4 68 70 39 4
5 92 7 57 2
6 13 67 23 3
7 73 50 14 2
8 59 14 91 5
9 37 3 72 5
10 27 3 13 4
11 63 28 0 5
12 51 7 35 4
13 11 36 76 3
14 72 25 8 5
15 23 24 6 3
16 15 1 16 5
17 55 24 5 5
18 2 54 39 1
19 54 95 20 3
20 60 39 65 1
And I have another data frame with the same 3 columns of data and category column, however this only has one instance per category (class).
a <- floor(runif(6,0,20))
b <- floor(runif(6,0,20))
c <- floor(runif(6,0,20))
class <- seq(0,5)
df2 <- data.frame(a,b,c,class)
print(df2)
a b c class
1 8 15 13 0
2 0 3 6 1
3 14 4 0 2
4 7 10 6 3
5 18 18 16 4
6 17 17 11 5
How do I subset the first data frame so that it keeps only rows where a, b, and c are all greater than the values in the second data frame for the same class? For example, I only want rows where class == 0 if a > 8 & b > 15 & c > 13.
Note that I don't want to join the data frames, as the second data frame holds the lowest acceptable values for the first data frame.
As commented by Frank this can be done with non-equi joins.
# coerce to data.table
tmp <- setDT(df1)[
# non-equi join to find which rows of df1 fulfill conditions in df2
setDT(df2), on = .(class, a > a, b > b, c > c), nomatch = 0L, which = TRUE]
# return subset in original order of df1
df1[sort(tmp)]
a b c class
1: 31 73 28 3
2: 44 33 57 3
3: 19 35 53 0
4: 68 70 39 4
5: 92 7 57 2
6: 13 67 23 3
7: 73 50 14 2
8: 11 36 76 3
9: 2 54 39 1
10: 54 95 20 3
11: 60 39 65 1
The parameter which = TRUE returns a vector of the matching row numbers instead of the joined data set. This saves us from creating a row id column before the join. (Credit to @Frank for reminding me of the which parameter!)
Note that there is no row in df1 which fulfills the condition for class == 5 in df2. Therefore, the parameter nomatch = 0L is used to exclude non-matching rows from the result.
This can be put together in a "one-liner":
setDT(df1)[sort(df1[setDT(df2), on = .(class, a > a, b > b, c > c), nomatch = 0L, which = TRUE])]
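For readers not using data.table, the same filter can be sketched in base R by looking up each row's thresholds with `match`. This is an alternative technique, not the answer's method. The variable names `cols`, `thresholds`, `above`, and `keep` are assumptions, and the data are six rows taken from the question's printed df1 plus the full printed df2.

```r
# Rows 1, 2, 3, 4, 5 and 8 of df1 as printed in the question
df1 <- data.frame(a = c(31, 44, 19, 68, 92, 59),
                  b = c(73, 33, 35, 70, 7, 14),
                  c = c(28, 57, 53, 39, 57, 91),
                  class = c(3, 3, 0, 4, 2, 5))
df2 <- data.frame(a = c(8, 0, 14, 7, 18, 17),
                  b = c(15, 3, 4, 10, 18, 17),
                  c = c(13, 6, 0, 6, 16, 11),
                  class = 0:5)

cols <- c("a", "b", "c")

# For every row of df1, fetch the threshold row of its class from df2
thresholds <- df2[match(df1$class, df2$class), cols]

# TRUE where all three values exceed their thresholds
above <- as.matrix(df1[cols]) > as.matrix(thresholds)
keep <- rowSums(above) == length(cols)
keep[is.na(keep)] <- FALSE  # classes missing from df2 are dropped

df1[keep, ]
```

The last row (class 5, b = 14) is dropped because 14 is not greater than the class 5 threshold of 17.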

How to properly combine columns into one column using R

I have 3 sets of data. Each one is a column of variables:
A B C
81 35 31
62 34 33
46 36 31
45 31 33
81 35 31
62 34 33
46 36 31
45 31 33
81 35 31
62 34 33
46 36 31
45 31 33
I have been trying to use rbind to combine these three data sets into one dataset with one column.
Combine<-rbind(A,B,C)
Instead I get something like this, where not only do I end up with a series of shorter columns, but the numbers all change. How do I stop this from happening?
V1 V2 V3 V4
14 9 9 5
19 15 14 5
# example data frames
dt1 = data.frame(A = 1:5)
dt2 = data.frame(B = 3:10)
dt3 = data.frame(C = 5:7)
# change to a common column name
names(dt1) = "x"
names(dt2) = "x"
names(dt3) = "x"
# bind rows
rbind(dt1, dt2, dt3)
# x
# 1 1
# 2 2
# 3 3
# 4 4
# 5 5
# 6 3
# 7 4
# 8 5
# 9 6
# 10 7
# 11 8
# 12 9
# 13 10
# 14 5
# 15 6
# 16 7
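If A, B, and C are plain numeric vectors, a shorter route may be to concatenate them with `c()`, or to use `stack()` to keep track of where each value came from. The sketch below uses the first four values of each column from the question. The factor conversion is an assumption about one possible cause of the changed numbers, since the question does not show how the data were read in.

```r
A <- c(81, 62, 46, 45)
B <- c(35, 34, 36, 31)
C <- c(31, 33, 31, 33)

# One column, values in the order A, B, C
combined <- data.frame(x = c(A, B, C))

# Same values plus a column recording the source vector
long <- stack(list(A = A, B = B, C = C))

# Assumption: if a column was read in as a factor, converting it directly
# yields integer level codes, so go through character first
A_factor <- factor(c("81", "62", "46", "45"))
A_numeric <- as.numeric(as.character(A_factor))
```

`stack()` returns a data frame with the columns `values` and `ind`.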
