R: Calculate differences between rows in data.table

Rprof revealed that the following operation I perform is rather slow:
stockHistory[.(p), stock:=stockHistory[.(p), stock] - (backorderedDemands[.(p-1),backlog] - backorderedDemands[.(p),backlog])]
I suppose this is because of the subtraction
backorderedDemands[.(p-1),backlog] - backorderedDemands[.(p),backlog]
Is there any way to speed up this operation?
.(p) subsets the data.table for a period p, and .(p-1) subsets the previous period (see example data below). Would it maybe be faster to apply some kind of diff() here? I do not know how to do this, though.
Example data:
backorderedDemands<-CJ(period=1:1000, articleID=letters[1:10], backlog=0)[,backlog:=round(runif(10000)*42,0)]
setkey(backorderedDemands,period, articleID)
stockHistory<-CJ(period=1:1000, articleID=letters[1:10], stock=0)[,stock:=round(runif(10000)*42+66,0)]
setkey(stockHistory,period, articleID)

You can first calculate a difference column in backorderedDemands.
backorderedDemands[, diff := c(NA, -diff(backlog)), by=articleID]
Also, it is not necessary to use stockHistory[.(p), stock] on the right-hand side; it's enough to just use stock.
stockHistory[.(p), stock := stock - backorderedDemands[.(p), diff]]
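Equivalently (just a sketch on my side, assuming a data.table version with shift(), i.e. 1.9.6 or later), the diff column can be built with shift() instead of c(NA, -diff(backlog)):
# previous backlog minus current backlog, per article
backorderedDemands[, diff := shift(backlog) - backlog, by = articleID]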

If you want to compute first differences of your data, you can do it as shown below. It is fast; I have included the step-by-step computation.
library(data.table)
library(dplyr)
Data
set.seed(1)
backorderedDemands <-
  CJ(period = 1:1000,
     articleID = letters[1:10],
     backlog = 0)[, backlog := round(runif(10000) * 42, 0)]
stockHistory <-
  CJ(period = 1:1000,
     articleID = letters[1:10],
     stock = 0)[, stock := round(runif(10000) * 42 + 66, 0)]
Solution
merge(stockHistory, backorderedDemands,
      by = c("period", "articleID")) %>%
  group_by(articleID) %>%
  mutate(lag_backlog = lag(backlog, 1),
         my_backlog_diff = backlog - lag_backlog,
         my_diff = stock + my_backlog_diff) %>%
  as.data.frame(.) %>%
  head(., 20)
period articleID stock backlog lag_backlog my_backlog_diff my_diff
1 1 a 69 11 NA NA NA
2 1 b 94 16 NA NA NA
3 1 c 97 24 NA NA NA
4 1 d 71 38 NA NA NA
5 1 e 68 8 NA NA NA
6 1 f 71 38 NA NA NA
7 1 g 103 40 NA NA NA
8 1 h 101 28 NA NA NA
9 1 i 102 26 NA NA NA
10 1 j 67 3 NA NA NA
11 2 a 71 9 11 -2 69
12 2 b 89 7 16 -9 80
13 2 c 71 29 24 5 76
14 2 d 96 16 38 -22 74
15 2 e 96 32 8 24 120
16 2 f 99 21 38 -17 82
17 2 g 92 30 40 -10 82
18 2 h 87 42 28 14 101
19 2 i 85 16 26 -10 75
20 2 j 67 33 3 30 97
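For comparison, a data.table-only sketch of the same pipeline (my own rendering, assuming the tables created above; it uses shift() for the lag):
merged <- merge(stockHistory, backorderedDemands,
                by = c("period", "articleID"))
# per article: stock plus the first difference of backlog
merged[, my_diff := stock + (backlog - shift(backlog)), by = articleID]
head(merged, 20)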

Related

R | Mutate with condition for multiple columns

I want to calculate the mean in a row if at least three out of six observations in the row are != NA. If four or more NAs are present, the mean should show NA.
An example which gives me the mean, ignoring the NAs:
require(dplyr)
a <- 1:10
b <- a+10
c <- a+20
d <- a+30
e <- a+40
f <- a+50
df <- data.frame(a,b,c,d,e,f)
df[2,c(1,3,4,6)] <- NA
df[5,c(1,4,6)] <- NA
df[8,c(1,2,5,6)] <- NA
df <- df %>% mutate(mean = rowMeans(df[,1:6], na.rm=TRUE))
I thought about using case_when, but I'm not sure how to use it correctly:
df <- df %>% mutate(mean = case_when( ~ rowMeans(df[,1:6], na.rm=TRUE), TRUE ~ NA))
You can try a base R solution, saving the number of non-NA values in a new variable and then using ifelse() for the mean:
#Data
a <- 1:10
b <- a+10
c <- a+20
d <- a+30
e <- a+40
f <- a+50
df <- data.frame(a,b,c,d,e,f)
df[2,c(1,3,4,6)] <- NA
df[5,c(1,4,6)] <- NA
df[8,c(1,2,5,6)] <- NA
#Code
#Count number of non-NA values
df$count <- rowSums(!is.na(df[, 1:6]))
#Compute mean
df$Mean <- ifelse(df$count >= 3, rowMeans(df[, 1:6], na.rm = TRUE), NA)
Output:
a b c d e f count Mean
1 1 11 21 31 41 51 6 26.00000
2 NA 12 NA NA 42 NA 2 NA
3 3 13 23 33 43 53 6 28.00000
4 4 14 24 34 44 54 6 29.00000
5 NA 15 25 NA 45 NA 3 28.33333
6 6 16 26 36 46 56 6 31.00000
7 7 17 27 37 47 57 6 32.00000
8 NA NA 28 38 NA NA 2 NA
9 9 19 29 39 49 59 6 34.00000
10 10 20 30 40 50 60 6 35.00000
You could do:
library(dplyr)
df %>%
  rowwise() %>%
  mutate(
    mean = case_when(
      sum(is.na(c_across())) < 4 ~ mean(c_across(), na.rm = TRUE),
      TRUE ~ NA_real_)
  ) %>% ungroup()
Output:
# A tibble: 10 x 7
a b c d e f mean
<int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 11 21 31 41 51 26
2 NA 12 NA NA 42 NA NA
3 3 13 23 33 43 53 28
4 4 14 24 34 44 54 29
5 NA 15 25 NA 45 NA 28.3
6 6 16 26 36 46 56 31
7 7 17 27 37 47 57 32
8 NA NA 28 38 NA NA NA
9 9 19 29 39 49 59 34
10 10 20 30 40 50 60 35
This leverages rowwise() and c_across(), which basically means operating at the row level, so you can use vectorized functions such as sum(), mean(), etc. in their usual way (also with case_when()).
c_across also has a cols argument where you can specify which columns you want to take into account. For example, if you'd like to take into account columns 1:6, you can specify this as:
df %>%
  rowwise() %>%
  mutate(
    mean = case_when(
      sum(is.na(c_across(1:6))) < 4 ~ mean(c_across(1:6), na.rm = TRUE),
      TRUE ~ NA_real_)
  ) %>% ungroup()
Alternatively, if you'd e.g. like to take into account all columns except column number 2, you would do c_across(-2). You can also use column names, e.g. for the first example c_across(a:f) (all columns) or for the second c_across(-b) (all columns except b).
This is implemented internally in dplyr, but you could also do usual vector subsetting with taking the whole c_across() (which defaults to all columns, i.e. everything()) and do e.g. c_across()[1:6] or c_across()[-2].
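For instance, a minimal sketch of the name-based variant on the same data (an illustration of the above, not part of the original answer):
df %>%
  rowwise() %>%
  mutate(
    mean = case_when(
      # count NAs over columns a to f only
      sum(is.na(c_across(a:f))) < 4 ~ mean(c_across(a:f), na.rm = TRUE),
      TRUE ~ NA_real_)
  ) %>% ungroup()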
We can create an index first and then do the assignment based on the index (assigning into the not-yet-existing Mean column pads the remaining rows with NA):
i1 <- rowSums(!is.na(df)) >=3
df$Mean[i1] <- rowMeans(df[i1,], na.rm = TRUE)
df
# a b c d e f Mean
#1 1 11 21 31 41 51 26.00000
#2 NA 12 NA NA 42 NA NA
#3 3 13 23 33 43 53 28.00000
#4 4 14 24 34 44 54 29.00000
#5 NA 15 25 NA 45 NA 28.33333
#6 6 16 26 36 46 56 31.00000
#7 7 17 27 37 47 57 32.00000
#8 NA NA 28 38 NA NA NA
#9 9 19 29 39 49 59 34.00000
#10 10 20 30 40 50 60 35.00000

Apply Lag function Dynamically on all columns

Hi, I have a data frame with 100 columns, and I am trying to calculate current_value - lag(current_value).
I am able to calculate this for a single column, but I am unable to run the code on all available columns dynamically.
Sample df
Class <- c("A","A","A","A","B","B","B","C","C","C","C","C","C")
A<-c(23,33,45,56,22,34,34,45,65,5,57,75,57)
D<-c(2,133,5,60,23,312,341,25,75,50,3,9,21)
M<-c(34,35,67,325,46,56,547,47,67,67,68,3,12)
df <- data.frame(Class,A,D,M)
I have tried the below code:
df <- df %>% group_by(Class) %>%
  mutate(A_lag = A - lag(A))
Help me calculate all the lag columns dynamically.
Thanks in advance
Class <- c("A","A","A","A","B","B","B","C","C","C","C","C","C")
A<-c(23,33,45,56,22,34,34,45,65,5,57,75,57)
D<-c(2,133,5,60,23,312,341,25,75,50,3,9,21)
M<-c(34,35,67,325,46,56,547,47,67,67,68,3,12)
df <- data.frame(Class,A,D,M)
library(dplyr)
df %>%
  group_by(Class) %>%
  mutate_all(~ . - lag(.)) %>%
  ungroup()
# # A tibble: 13 x 4
# Class A D M
# <fct> <dbl> <dbl> <dbl>
# 1 A NA NA NA
# 2 A 10 131 1
# 3 A 12 -128 32
# 4 A 11 55 258
# 5 B NA NA NA
# 6 B 12 289 10
# 7 B 0 29 491
# 8 C NA NA NA
# 9 C 20 50 20
#10 C -60 -25 0
#11 C 52 -47 1
#12 C 18 6 -65
#13 C -18 12 9
or if you want to add new columns to the existing ones
df %>%
  group_by(Class) %>%
  mutate_all(funs(new = . - lag(.))) %>%
  ungroup()
# # A tibble: 13 x 7
# Class A D M A_new D_new M_new
# <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 A 23 2 34 NA NA NA
# 2 A 33 133 35 10 131 1
# 3 A 45 5 67 12 -128 32
# 4 A 56 60 325 11 55 258
# 5 B 22 23 46 NA NA NA
# 6 B 34 312 56 12 289 10
# 7 B 34 341 547 0 29 491
# 8 C 45 25 47 NA NA NA
# 9 C 65 75 67 20 50 20
#10 C 5 50 67 -60 -25 0
#11 C 57 3 68 52 -47 1
#12 C 75 9 3 18 6 -65
#13 C 57 21 12 -18 12 9
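As a side note, mutate_all() and funs() are superseded in current dplyr; assuming dplyr >= 1.0, an across()-based sketch of the same idea would be:
df %>%
  group_by(Class) %>%
  # ".names" appends "_new" to each generated column name
  mutate(across(everything(), ~ . - lag(.), .names = "{.col}_new")) %>%
  ungroup()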
Using data.table and calculating the lagged difference for the numeric variables only:
library(data.table)
setDT(df)
df[,
   lapply(.SD, function(x) x - shift(x)),
   by = Class,
   .SDcols = sapply(df, is.numeric)
]
Class A D M
1: A NA NA NA
2: A 10 131 1
3: A 12 -128 32
4: A 11 55 258
5: B NA NA NA
6: B 12 289 10
7: B 0 29 491
8: C NA NA NA
9: C 20 50 20
10: C -60 -25 0
11: C 52 -47 1
12: C 18 6 -65
13: C -18 12 9
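If you would rather add the differences as new columns by reference, a sketch continuing from the setDT(df) above:
cols <- names(df)[sapply(df, is.numeric)]
df[, paste0(cols, "_new") := lapply(.SD, function(x) x - shift(x)),
   by = Class, .SDcols = cols]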

Distance across multiple GPS points

I'm relatively new to R.
I'm working on creating a tortuosity index of a line along GPS points. To work this out, I'm trying to calculate the distance between point n and point n+10 in my latitude and longitude data.
I have already used distanceTrack to calculate the distance between successive points in the series:
lat<-data$Latitude
long<-data$Longitude
distanceTrack(lat,long)
But I wonder if there is an alternative for doing this over 10 steps?
I tried another method that used the sp package and spDistsN1 to get distances between successive points:
coordinates(gpsdat)<-~Longitude+Latitude
proj4string(gpsdat)<-CRS("+proj=longlat +datum=WGS84")
dist<-sapply(seq_along(gpsdat[-1,]), function(i)
spDistsN1(pts=gpsdat[i,],pt=gpsdat[i+1,],longlat=TRUE))
I thought I might be able to simply change some of the script so it was:
dist<-sapply(seq_along(gpsdat[-1,]), function(i)
spDistsN1(pts=gpsdat[i,],pt=gpsdat[i+10,],longlat=TRUE))
But this change returns a "subscript out of bounds" error.
I wonder if anyone might have an idea of how to progress with this?
Thank you for any help you can offer!
You could do this without an apply or loop structure, using data.table and geosphere.
For that, we shift the lat/lon rows (by 5 in my example), keep only the rows whose shifted values are not NA, calculate the distance per row, and reassign the distances to the original data.table. You can vary the shift amount as desired.
My example also groups by L1, as it assumes different line IDs. You can adapt the grouping column or just omit it.
library(data.table)
library(geosphere)
## Data
points = data.table(
  id = 1:20,
  lon = seq(11, 30, 1),
  lat = seq(51, 70, 1),
  L1 = rep(seq(1, 2, 1), each = 10)
)
## Shift Rows
points[, c("next_id", "next_lon", "next_lat") := data.table::shift(.SD, 5, NA, "lead"),
.SDcols=c("id","lon","lat"), by = L1];
## Filter NA values
points_not_NA = points[!is.na(points$next_lon),]
## Get Distance
distRes <- distCosine(as.matrix(points_not_NA[, c("lon", "lat")]),
                      as.matrix(points_not_NA[, c("next_lon", "next_lat")]))
## Assign initial value for distance
points$dist <- 0
## Assign resulting distance to filtered table
points[!is.na(points$next_lon),]$dist <- distRes
points
id lon lat L1 next_id next_lon next_lat dist
1: 1 11 51 1 6 16 56 647237.8260
2: 2 12 52 1 7 17 57 643248.9003
3: 3 13 53 1 8 18 58 639283.2756
4: 4 14 54 1 9 19 59 635346.2618
5: 5 15 55 1 10 20 60 631443.2317
6: 6 16 56 1 NA NA NA 0.0000
7: 7 17 57 1 NA NA NA 0.0000
8: 8 18 58 1 NA NA NA 0.0000
9: 9 19 59 1 NA NA NA 0.0000
10: 10 20 60 1 NA NA NA 0.0000
11: 11 21 61 2 16 26 66 609045.7607
12: 12 22 62 2 17 27 67 605534.8132
13: 13 23 63 2 18 28 68 602102.1704
14: 14 24 64 2 19 29 69 598753.3917
15: 15 25 65 2 20 30 70 595494.0003
16: 16 26 66 2 NA NA NA 0.0000
17: 17 27 67 2 NA NA NA 0.0000
18: 18 28 68 2 NA NA NA 0.0000
19: 19 29 69 2 NA NA NA 0.0000
20: 20 30 70 2 NA NA NA 0.0000
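For the original question's 10-step spacing, the same pattern applies; a hedged sketch, assuming gpsdat is a data.table with lon and lat columns:
## shift lon/lat forward by 10 rows
gpsdat[, c("next_lon", "next_lat") := shift(.SD, 10, NA, "lead"),
       .SDcols = c("lon", "lat")]
## distance only where a point 10 steps ahead exists
gpsdat[!is.na(next_lon),
       dist := distCosine(cbind(lon, lat), cbind(next_lon, next_lat))]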

Enumerate instances of a factor level

I have a data frame with 150,000 lines in long format with multiple occurrences of the same id variable. I'm using reshape (from stats, rather than the reshape(2) package) to convert this to wide format. I am generating a variable to count each occurrence of a given level of id to use as an index.
I've got this working with a small data frame using plyr, but it is far too slow for my full df. Can I program this more efficiently?
I've struggled doing this with the reshape package as I have around 30 other variables. It may be best to reshape only what I'm looking at (rather than the whole df) for each individual analysis.
> # u=id variable with three value variables
> u<-c(rep("a",4), rep("b", 3),rep("c", 6), rep("d", 5))
> u<-factor(u)
> v<-1:18
> w<-20:37
> x<-40:57
> df<-data.frame(u,v,w,x)
> df
u v w x
1 a 1 20 40
2 a 2 21 41
3 a 3 22 42
4 a 4 23 43
5 b 5 24 44
6 b 6 25 45
7 b 7 26 46
8 c 8 27 47
9 c 9 28 48
10 c 10 29 49
11 c 11 30 50
12 c 12 31 51
13 c 13 32 52
14 d 14 33 53
15 d 15 34 54
16 d 16 35 55
17 d 17 36 56
18 d 18 37 57
>
> library(plyr)
> df2<-ddply(df, .(u), transform, count=rank(u, ties.method="first"))
> df2
u v w x count
1 a 1 20 40 1
2 a 2 21 41 2
3 a 3 22 42 3
4 a 4 23 43 4
5 b 5 24 44 1
6 b 6 25 45 2
7 b 7 26 46 3
8 c 8 27 47 1
9 c 9 28 48 2
10 c 10 29 49 3
11 c 11 30 50 4
12 c 12 31 51 5
13 c 13 32 52 6
14 d 14 33 53 1
15 d 15 34 54 2
16 d 16 35 55 3
17 d 17 36 56 4
18 d 18 37 57 5
> reshape(df2, idvar="u", timevar="count", direction="wide")
u v.1 w.1 x.1 v.2 w.2 x.2 v.3 w.3 x.3 v.4 w.4 x.4 v.5 w.5 x.5 v.6 w.6 x.6
1 a 1 20 40 2 21 41 3 22 42 4 23 43 NA NA NA NA NA NA
5 b 5 24 44 6 25 45 7 26 46 NA NA NA NA NA NA NA NA NA
8 c 8 27 47 9 28 48 10 29 49 11 30 50 12 31 51 13 32 52
14 d 14 33 53 15 34 54 16 35 55 17 36 56 18 37 57 NA NA NA
I still can't quite figure out why you would ultimately want to convert your dataset from long to wide, because to me, that seems like it would make an extremely unwieldy dataset to work with.
If you're looking to speed up the enumeration of your factor levels, you can consider using ave() in base R, or .N from the "data.table" package. Considering that you are working with a lot of rows, you might want to consider the latter.
First, let's make up some data:
set.seed(1)
df <- data.frame(u = sample(letters[1:6], 150000, replace = TRUE),
                 v = runif(150000, 0, 10),
                 w = runif(150000, 0, 100),
                 x = runif(150000, 0, 1000))
list(head(df), tail(df))
# [[1]]
# u v w x
# 1 b 6.368412 10.52822 223.6556
# 2 c 6.579344 75.28534 450.7643
# 3 d 6.573822 36.87630 283.3083
# 4 f 9.711164 66.99525 681.0157
# 5 b 5.337487 54.30291 137.0383
# 6 f 9.587560 44.81581 831.4087
#
# [[2]]
# u v w x
# 149995 b 4.614894 52.77121 509.0054
# 149996 f 5.104273 87.43799 391.6819
# 149997 f 2.425936 60.06982 160.2324
# 149998 a 1.592130 66.76113 118.4327
# 149999 b 5.157081 36.90400 511.6446
# 150000 a 3.565323 92.33530 252.4982
table(df$u)
#
# a b c d e f
# 25332 24691 24993 24975 25114 24895
Load our required packages:
library(plyr)
library(data.table)
Create a "data.table" version of our dataset
DT <- data.table(df, key = "u")
DT # Notice that the data are now automatically sorted
# u v w x
# 1: a 6.2378578 96.098294 643.2433
# 2: a 5.0322400 46.806132 544.6883
# 3: a 9.6289786 87.915303 334.6726
# 4: a 4.3393403 1.994383 753.0628
# 5: a 6.2300123 72.810359 579.7548
# ---
# 149996: f 0.6268414 15.608049 669.3838
# 149997: f 2.3588955 40.380824 658.8667
# 149998: f 1.6383619 77.210309 250.7117
# 149999: f 5.1042725 87.437989 391.6819
# 150000: f 2.4259363 60.069820 160.2324
DT[, .N, by = key(DT)] # Like "table"
# u N
# 1: a 25332
# 2: b 24691
# 3: c 24993
# 4: d 24975
# 5: e 25114
# 6: f 24895
Now let's run a few basic tests. The results from ave() aren't sorted, but they are in "data.table" and "plyr", so we should also test the timing for sorting when using ave().
system.time(AVE <- within(df, {
  count <- ave(as.numeric(u), u, FUN = seq_along)
}))
# user system elapsed
# 0.024 0.000 0.027
# Now time the sorting
system.time(AVE2 <- AVE[order(AVE$u, AVE$count), ])
# user system elapsed
# 0.264 0.000 0.262
system.time(DDPLY <- ddply(df, .(u), transform,
                           count = rank(u, ties.method = "first")))
# user system elapsed
# 0.944 0.000 0.984
system.time(DT[, count := 1:.N, by = key(DT)])
# user system elapsed
# 0.008 0.000 0.004
all(DDPLY == AVE2)
# [1] TRUE
all(data.frame(DT) == AVE2)
# [1] TRUE
That syntax for "data.table" sure is compact, and its speed is blazing!
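As an aside, newer data.table versions (1.9.8 and later, if I recall correctly) expose this enumeration directly as rowid():
DT[, count := rowid(u)]  # same result as 1:.N by the key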
Using base R to create an empty matrix and then fill it in appropriately can often be significantly faster. In the code below I suspect the slow part would be converting the data frame to a matrix and transposing, as in the first two lines; if so, that could perhaps be avoided if it could be stored differently to start with.
g <- df$u                       # grouping variable
x <- t(as.matrix(df[, -1]))     # values, transposed so df columns become rows
k <- split(seq_along(g), g)     # row indices per group
n <- max(sapply(k, length))     # size of the largest group
out <- matrix(ncol = n * nrow(x), nrow = length(k))
for(idx in seq_along(k)) {
  out[idx, seq_len(length(k[[idx]]) * nrow(x))] <- x[, k[[idx]]]
}
rownames(out) <- names(k)
colnames(out) <- paste(rep(rownames(x), n), rep(seq_len(n), each = nrow(x)), sep = ".")
out
# v.1 w.1 x.1 v.2 w.2 x.2 v.3 w.3 x.3 v.4 w.4 x.4 v.5 w.5 x.5 v.6 w.6 x.6
# a 1 20 40 2 21 41 3 22 42 4 23 43 NA NA NA NA NA NA
# b 5 24 44 6 25 45 7 26 46 NA NA NA NA NA NA NA NA NA
# c 8 27 47 9 28 48 10 29 49 11 30 50 12 31 51 13 32 52
# d 14 33 53 15 34 54 16 35 55 17 36 56 18 37 57 NA NA NA

Replace last NA of a segment of NAs in a column with last valid value

Here is a sample data frame:
> df = data.frame(rep(seq(0, 120, length.out=6), times = 2), c(sample(1:50, 4),
+ NA, NA, NA, sample(1:50, 5)))
> colnames(df) = c("Time", "Pat1")
> df
Time Pat1
1 0 33
2 24 48
3 48 7
4 72 8
5 96 NA
6 120 NA
7 0 NA
8 24 1
9 48 6
10 72 28
11 96 31
12 120 32
The NAs which have to be replaced are identified with which() and logical operators:
x = which(is.na(df$Pat1) & df$Time == 0)
I know the na.locf() command from zoo, but it replaces all NAs. How can I replace only the NAs at position x in a multi-column df?
EDIT: Here is a link to my original dataset: link
And that's how far I get:
require(reshape2)
require(zoo)
pad.88 <- read.csv2("pad_88.csv")
colnames(pad.88) = c("Time", "Increment", "Side", 4:length(pad.88)-3)
attach(pad.88)
x = which(Time == 240 & Increment != 5)
pad.88 = pad.88[c(1:x[1], x[1]:x[2], x[2]:x[3], x[3]:x[4], x[4]:x[5], x[5]:x[6],x[6]:x[7], x[7]:x[8], x[8]:nrow(pad.88)),]
y = which(duplicated(pad.88))
pad.88$Time[y] = 0
pad.88$Increment[y] = Increment[x] + 1
z = which(is.na(pad.88[4:ncol(pad.88)] & pad.88$Time == 0), arr.ind=T)
a = na.locf(pad.88[4:ncol(pad.88)])
My next step is something like pat.cols[z] = a[z], which doesn't work.
This is how the result should look:
Time Increment Side 1 2 3 4 5 ...
150 4 0 27,478 24,076 27,862 20,001 25,261
165 4 0 27,053 24,838 27,231 20,001 NA
180 4 0 27,599 24,166 27,862 20,687 NA
195 4 0 27,114 23,403 27,862 20,001 NA
210 4 0 26,993 24,076 27,189 19,716 NA
225 4 0 26,629 24,21 26,221 19,887 NA
240 4 0 26,811 26,228 26,431 20,001 NA
0 5 1 26,811 26,228 26,431 20,001 25,261
15 5 1 ....
The last valid value in col 5 is 25,261. This value replaces the NA at Time 0/Col 5.
You can change it so that x records all the NA values and use the first and last from that to identify the locations you want.
df
Time Pat1
1 0 36
2 24 13
3 48 32
4 72 38
5 96 NA
6 120 NA
7 0 NA
8 24 5
9 48 10
10 72 7
11 96 25
12 120 28
x <- which(is.na(df$Pat1))
df[rev(x)[1],"Pat1"] <- df[x[1]-1,"Pat1"]
df
Time Pat1
1 0 36
2 24 13
3 48 32
4 72 38
5 96 NA
6 120 NA
7 0 38
8 24 5
9 48 10
10 72 7
11 96 25
12 120 28
For the multi-column example, use the same idea in an sapply() call:
cbind(df[1], sapply(df[-1], function(x) {
  y <- which(is.na(x))
  x[rev(y)[1]] <- x[y[1] - 1]
  x
}))
Time Pat1 Pat2
1 0 41 42
2 24 8 30
3 48 3 41
4 72 14 NA
5 96 NA NA
6 120 NA NA
7 0 14 41
8 24 5 37
9 48 29 48
10 72 31 11
11 96 50 43
12 120 46 21
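Note that this replaces only the last NA per column overall, so it assumes a single NA segment. If a column can contain several NA runs, a hedged generalization (my own sketch, assuming zoo is loaded as above) would locate the end of each run with rle() and fill it from na.locf():
fill_last_na <- function(x) {
  r <- rle(is.na(x))
  ends <- cumsum(r$lengths)[r$values]   # index of the last NA in each run
  x[ends] <- na.locf(x, na.rm = FALSE)[ends]
  x
}
df[-1] <- lapply(df[-1], fill_last_na)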
