compare rows and fill gaps with previous values in a data table

compare rows and fill gaps with previous values in a data table - r

I have data table that looks like this:
library(data.table)
data <- data.table(time = c(seq(0, 14)),
anom = c(0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0),
gier = c(0, 0, 0, 4, 9, 7, 2, 0, 3, 1, 4, 2, 0, 0, 0))
Now I want to fill the gaps (zeros) with ones in column anom so that the result looks like this:
res <- data.table(time = c(seq(0, 14)),
anom = c(0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0),
gier = c(0, 0, 0, 4, 9, 7, 2, 0, 3, 1, 4, 2, 0, 0, 0))
In addition there is the problem that I want to be flexible with the gap size so I can choose how big the gap can be. There must be an easy way to do something similar to LOCF only for real values (maybe filling it with the previous value of the row and not only ones or zeros) and not only for NA's like the functions fill or na.locf.

An example using the maxgap argument to select the maximum gap size
library(zoo)
na.fill(
na.locf(
replace(data$anom,data$anom==0,NA),
na.rm=F,
maxgap=2
),0
)
[1] 0 0 0 1 1 1 1 1 1 1 1 1 0 0 0

Here is another option using rolling join:
maxgap <- 1L
data[, c("rn", "lu") := .(.I, anom)]
data[anom==0L, lu := fcoalesce(
data[anom!=0L][.SD, on=.(rn=rn), roll=maxgap, rollends=c(FALSE, FALSE), x.anom],
anom)
]
output:
time anom gier rn lu
1: 0 0 0 1 0
2: 1 0 0 2 0
3: 2 0 0 3 0
4: 3 1 4 4 1
5: 4 1 9 5 1
6: 5 1 7 6 1
7: 6 0 2 7 1
8: 7 0 0 8 0
9: 8 1 3 9 1
10: 9 0 1 10 1
11: 10 1 4 11 1
12: 11 1 2 12 1
13: 12 0 0 13 0
14: 13 0 0 14 0
15: 14 0 0 15 0

Related

rolling computation to fill gaps by finding following or previous values in a data.table time series

I have a data.table that looks like this:
tsdata <- data.table(time = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
signal = c(0, 1, 1, 0, 0, 1, 0, 0, 0, 1))
I am trying to fill the gaps between the ones, but only if the gap of zeros is small. So a flexible solution to define the gap would be nice. In this example the gap with zeros shouldn't be bigger than 2.
The result should look like this:
tsdata <- data.table(time = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
signal = c(0, 1, 1, 1, 1, 1, 0, 0, 0, 1))
My real time series data is much bigger than this, so any help is appreciated.

Group by rleid(signal) and then fill in short 0 sequences not at the beginning or end with 1.
tsdata[, signal2 := ifelse(signal[1] == 0 &
.N <= 2 &
time[1] > min(tsdata$time) &
time[.N] < max(tsdata$time), 1, signal),
by = rleid(signal)]
tsdata
giving:
time signal signal2
1: 1 0 0
2: 2 1 1
3: 3 1 1
4: 4 0 1
5: 5 0 1
6: 6 1 1
7: 7 0 0
8: 8 0 0
9: 9 0 0
10: 10 1 1

How to create multiple data frame from one data frame with multiple condition in R

I would like to create four data sets from the following given data frame by multiple conditions in x1 and x2
mydata=structure(list(y = c(-3, 24, 4, 5, 3, -3, -3, 24, 5, 4, 8, 7,
9, 2, 4, 8, 7, 3, 8, 12, 9, 10, 12, 11, 2),
x1 = c(0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1,
0, 1, 0, 1, 1, 0, 0, 1, 1, 1
),
x2 = c(1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0,
0, 1, 0, 0, 1, 1, 1, 0)), class = "data.frame",
row.names = c(NA,25L))
The first data set is mydata00 which is constructed with these conditions x1=0 and x2=0,
mydata00=filter(mydata, c(mydata$x1==0 & mydata$x2==0))
> mydata00
y x1 x2
1 -3 0 0
2 -3 0 0
3 8 0 0
4 3 0 0
5 9 0 0
Now, I need only the unique values of y and corresponding x1 and x2. Finally, I would like to sort y. So my final data set must look like
y x1 x2
1 -3 0 0
2 3 0 0
3 8 0 0
4 9 0 0
I would like to do the job for mydata11, mydata10, mydata01, where ,
mydata11=filter(mydata, c(mydata$x1==1 & mydata$x2==1))
mydata10=filter(mydata, c(mydata$x1==1 & mydata$x2==0))
mydata01=filter(mydata, c(mydata$x1==0 & mydata$x2==1))
Can I use any for loop or builtin functionn to create these data sets?
Any help is appreciated.

We can split the data based on unique values of x1 and x2 and get unique rows in each list after ordering it by y.
temp <- lapply(split(mydata, list(mydata$x1, mydata$x2)), function(x)
unique(x[order(x$y), ]))
temp
#$`0.0`
# y x1 x2
#6 -3 0 0
#18 3 0 0
#16 8 0 0
#21 9 0 0
#$`1.0`
# y x1 x2
#14 2 1 0
#5 3 1 0
#10 4 1 0
#4 5 1 0
#...
If we need data as a separate dataframe, we can name them appropriately and use list2env.
names(temp) <- paste0("mydata", names(temp))
list2env(temp, .GlobalEnv)
tidyverse way of doing this would be :
library(tidyverse)
mydata %>% group_split(x1, x2) %>% map(~.x %>% arrange(y) %>% distinct)

Issues with plotting network in igraph

I am having some issues in realizing a bipartite network in R with the library igraph. Here is my script:
library(igraph)
library(reshape2)
setwd("....")
getwd()
library(readxl)
network=read_excel("network1.xlsx")
print(network)
subjects=as.character(unlist(network[,1]))
agents=colnames(network[-1])
print(network)
network = network[,-1]
g=graph.incidence(network, weighted = T)
V(g)$type
V(g)$name=c(subjects,agents)
V(g)$color = V(g)$type
V(g)$color=gsub("FALSE","red",V(g)$color)
V(g)$color=gsub("TRUE","lightblue",V(g)$color)
plot(g, edge.arrow.width = 0.3,
vertex.size = 5,
edge.arrow.size = 0.5,
vertex.size2 = 5,
vertex.label.cex = 1,
vertex.label.color="black",
asp = 0.35,
margin = 0,
edge.color="grey",
edge.width=(E(g)$weight),
layout=layout_as_bipartite)
The network is properly plotted
as you can see
however I have two issues
(1) I don't understand the order in which the vertexs are showed in the plot. They are not in the same order of the excel file, neither in alphabetical or numerical order. They seem to be in random order. How could I choose the order in which the vertex should be placed?
(2) I don't understand why some vertex are closer toghether, and some are more far apart. I would all vertexes at the same distance. How could I do it?
Thank you a lot for your invaluable help.

Since you do not provide your data, I will illustrate with a made-up example.
Sample graph data
library(igraph)
set.seed(123)
EL = matrix(c(sample(8,18, replace=T),
sample(LETTERS[1:6], 18, replace=T)), ncol=2)
g = simplify(graph_from_edgelist(EL))
V(g)$type = bipartite_mapping(g)$type
VCol = c("#FF000066", "#0000FF66")[as.numeric(V(g)$type)+1]
plot(g, layout=layout_as_bipartite(g), vertex.color=VCol)
As with your graph, this has two problems. The nodes are ordered arbitrarily
and the lower row is oddly spaced. Let's address those problems one at a time.
To do so, we will need to take control of the layout instead of using any of
the automated layout functions. A layout is simply a vcount(g) * 2 matrix
giving the x-y coordinates of the vertices for plotting. Here, I will put one
type of nodes in the top row by specifying the y coordinate as 1 and the other
nodes in a lower row by specifying y=0. We want to specify the order horizontally
by rank (alphabetically) within each group. So
LO = matrix(0, nrow=vcount(g), ncol=2)
LO[!V(g)$type, 2] = 1
LO[V(g)$type, 1] = rank(V(g)$name[V(g)$type])
LO[!V(g)$type, 1] = rank(V(g)$name[!V(g)$type])
plot(g, layout=LO, vertex.color=VCol)
Now both rows are ordered and evenly spaced, but because there are fewer
vertices in the bottom row, there is an unattractive, unbalanced look. We
can fix that by stretching the bottom row. I find it easier to make the right
scale factor if the coordinates go from 0 to (number of nodes) - 1 rather than
1 to (number of nodes) as above. Doing this, we get
LO[V(g)$type, 1] = rank(V(g)$name[V(g)$type]) - 1
LO[!V(g)$type, 1] = (rank(V(g)$name[!V(g)$type]) - 1) *
(sum(V(g)$type) - 1) / (sum(!V(g)$type) - 1)
plot(g, layout=LO, vertex.color=VCol)

thank you a lot. I performed your very very helpful example, and with the step one I did it work properly with my data, keeping the different thickness of the edges and all as in my plot, but with the proper order. This is very important, thank you a lot. However, I have some troubles in understanding how to rescale properly the top and the bottom row with my data, because they always seem to bee too near. probably I did not understand completly the coordinates on which I have to work. Here are my data.
> `> network=read_excel("network1.xlsx",2)
> dput(network)
structure(list(`NA` = c(2333, 2439, 2450, 2451, 2452, 2453, 2454,
2455, 2456, 2457, 2458, 2459, 2460, 2461, 2480, 2490, 2491, 2492,
2493, 2494, 2495), A = c(12, 2, 2, 5, 2, 0, 5, 3, 0, 0, 7, 0,
0, 0, 6, 2, 10, 7, 1, 2, 5), B = c(0, 1, 0, 1, 0, 0, 2, 0, 0,
0, 0, 0, 1, 0, 5, 0, 2, 0, 0, 0, 0), C = c(0, 0, 0, 0, 1, 0,
4, 0, 0, 0, 0, 1, 0, 0, 2, 0, 4, 4, 2, 1, 0), D = c(2, 0, 0,
0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 7, 0, 4, 0, 1, 4, 0), E = c(11,
2, 3, 3, 3, 8, 3, 6, 4, 1, 1, 0, 12, 0, 5, 0, 4, 6, 4, 8, 9),
F = c(2, 0, 0, 3, 1, 0, 10, 1, 0, 0, 0, 1, 0, 0, 9, 0, 0,
1, 1, 3, 3), G = c(0, 3, 1, 1, 0, 0, 0, 0, 0, 3, 2, 0, 0,
0, 1, 0, 0, 2, 0, 1, 0), H = c(0, 0, 2, 0, 0, 0, 1, 0, 0,
0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 1), I = c(0, 0, 0, 0, 0,
0, 3, 0, 6, 3, 0, 0, 1, 0, 7, 0, 0, 4, 1, 2, 0), J = c(0,
0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
0)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,
-21L), .Names = c(NA, "A", "B", "C", "D", "E", "F", "G", "H",
"I", "J"))
> print(network)
NA A B C D E F G H I J
1 2333 12 0 0 2 11 2 0 0 0 0
2 2439 2 1 0 0 2 0 3 0 0 0
3 2450 2 0 0 0 3 0 1 2 0 0
4 2451 5 1 0 0 3 3 1 0 0 0
5 2452 2 0 1 0 3 1 0 0 0 0
6 2453 0 0 0 0 8 0 0 0 0 1
7 2454 5 2 4 2 3 10 0 1 3 0
8 2455 3 0 0 0 6 1 0 0 0 0
9 2456 0 0 0 0 4 0 0 0 6 0
10 2457 0 0 0 0 1 0 3 0 3 0
11 2458 7 0 0 0 1 0 2 0 0 0
12 2459 0 0 1 0 0 1 0 0 0 0
13 2460 0 1 0 0 12 0 0 0 1 0
14 2461 0 0 0 0 0 0 0 0 0 0
15 2480 6 5 2 7 5 9 1 2 7 1
16 2490 2 0 0 0 0 0 0 0 0 0
17 2491 10 2 4 4 4 0 0 0 0 0
18 2492 7 0 4 0 6 1 2 0 4 0
19 2493 1 0 2 1 4 1 0 0 1 0
20 2494 2 0 1 4 8 3 1 0 2 0
21 2495 5 0 0 0 9 3 0 1 0 0
> `

Broken stick (or piecewise) regression with 2 breakpoints

I want to estimate two breakpoints of a function with the next data:
df = data.frame (x = 1:180,
y = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 0, 0, 2, 2, 4, 2, 2, 3, 2, 1, 2,0, 1, 0, 1, 4, 0, 1, 2, 3, 1, 1, 1, 0, 2, 0, 3, 2, 1, 1, 1, 1, 5, 4, 2, 1, 0, 2, 1, 1, 2, 0, 0, 2, 2, 1, 1, 1, 0, 0, 0, 0,
2, 3, 0, 3, 2, 0, 0, 0, 0, 0, 0, 0,0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0))
# plotting y ~ x
plot(df)
I know that the function have two breakpoints such that:
y = y1 if x < b1;
y = y2 if b1 < x < b2;
y = y3 if b2 < x;
And I want to find b1 and b2 to fit a kind of rectangular function with the following form
Can anyone help me or point me in the right direction? Thanks!

1) kmeans Try kmeans like this:
set.seed(123)
km <- kmeans(df, 3, nstart = 25)
> fitted(km, "classes") # or equivalently km$cluster
[1] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
[38] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[75] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[112] 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[149] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
> unique(fitted(km, "centers")) # or except for order km$centers
x y
3 30.5 0.5166667
1 90.5 0.9000000
2 150.5 0.0000000
> # groups are x = 1-60, 61-120 and 121-180
> simplify2array(tapply(df$x, km$cluster, range))
1 2 3
[1,] 61 121 1
[2,] 120 180 60
plot(df, col = km$cluster)
lines(fitted(km)[, "y"] ~ x, df)
2) brute force Another approach is a brute force approach in which we calculate every possible pair of breakpoints and choose the pair whose sum of squares in a linear model is least.
grid <- subset(expand.grid(b1 = 1:180, b2 = 1:80), b1 < b2)
# the groups are [1, b1], (b1, b2], (b2, Inf)
fit <- function(b1, b2, x, y) {
grp <- factor((x > b1) + (x > b2))
lm(y ~ grp)
}
dv <- function(...) deviance(fit(...))
wx <- which.min(mapply(dv, grid$b1, grid$b2, MoreArgs = df))
grid[wx, ]
## b1 b2
## 14264 44 80
plot(df)
lines(fitted(fit(grid$b1[wx], grid$b2[wx], x, y)) ~ x, df)

I can see that y is integer numbers, so maybe this is best estimated with a Poisson or Binomial model. Here is a solution using the R package mcp:
# Three intercept segments
model = list(
y ~ 1,
~ 1,
~ 1
)
library(mcp)
fit = mcp(model, df, family = poisson(), par_x = "x", adapt = 2000)
plot(fit)
Notice that mcp is one of the only packages to estimate the uncertainty around the change point ant parameter estimates. The summary shows where the change point is estimated to be (cp_1 and cp_2) as well as the other parameters (on a log-scale since that's the default link function for Poisson models):
summary(fit)
Family: poisson(link = 'log')
Iterations: 9000 from 3 chains.
Segments:
1: y ~ 1
2: y ~ 1 ~ 1
3: y ~ 1 ~ 1
Population-level parameters:
name mean lower upper Rhat n.eff
cp_1 39.57 37.8 45.00 1 54
cp_2 99.82 99.0 101.21 1 2211
int_1 -4.00 -6.5 -1.88 1 577
int_2 0.32 0.1 0.54 1 6288
int_3 -11.02 -20.9 -3.56 1 2487

cumsum according to certain restricts in r

I have a large data of car accidents and a sample of it is provided below.
accident is a binary variable of whether the accident happens or
not.
shift_number is the number of the shift, 0 means the driver is
taking a rest and not a shift.
time_diff is the amount of time at each observation.
df <- data.frame(
accident = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1),
shift_number = c(1, 1, 0, 0, 0, 2, 2, 2, 0, 0, 3, 3, 3, 3, 3),
time_diff = 3:17
)
My question is to measure the total amount of working time since the driver starts this shift for each accident.
wanted <- data.frame
(
accident = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1),
shift_number = c(1, 1, 0, 0, 0, 2, 2, 2, 0, 0, 3, 3, 3, 3, 3),
time_diff = 3:17,
cum_time = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 27, 0, 0, 75)
)
Does anyone have ideas on solving this problem with R? It's better to have data.table or vectorised solution because I've got huge data to deal with.

df$cum_time = 0
accident = which(df$accident == 1)
df$cum_time[accident] <- sapply(accident, function(x) {
sum(df$time_diff[(which.max(cumsum(df$shift_number[1:x] == 0)) + 1): x])
})
df
# accident shift_number time_diff cum_time
#1 0 1 3 0
#2 0 1 4 0
#3 0 0 5 0
#4 0 0 6 0
#5 0 0 7 0
#6 0 2 8 0
#7 0 2 9 0
#8 0 2 10 0
#9 0 0 11 0
#10 0 0 12 0
#11 0 3 13 0
#12 1 3 14 27
#13 0 3 15 0
#14 0 3 16 0
#15 1 3 17 75
We first make all the values in cum_time variable as 0. We find the indices where accident has occurred. For each of those indices we find the latest 0 in shift_number and calculate the sum of values of time_diff from the latest 0 to x and assign it to its respective indices.

Use the ave function to compute the cumulative sum of time_diff by shift_number:
cumsum_by_shift <- ave(df$time_diff, df$shift_number, FUN=cumsum)
#[1] 3 7 5 11 18 8 17 27 29 41 13 27 42 58 75
Pick out elements of cumsum_by_shift where accidents occur:
cum_time <- ifelse(df$accident == 1, cumsum_by_shift, 0)
#[1] 0 0 0 0 0 0 0 0 0 0 0 27 0 0 75
Note the use of the vectorized ifelse function.

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

compare rows and fill gaps with previous values in a data table - r

An example using the maxgap argument to select the maximum gap size library(zoo) na.fill( na.locf( replace(data$anom,data$anom==0,NA), na.rm=F, maxgap=2 ),0 ) [1] 0 0 0 1 1 1 1 1 1 1 1 1 0 0 0

Related

rolling computation to fill gaps by finding following or previous values in a data.table time series

How to create multiple data frame from one data frame with multiple condition in R

Issues with plotting network in igraph

Broken stick (or piecewise) regression with 2 breakpoints

cumsum according to certain restricts in r

Categories

Resources