I have a data frame that looks like this:
df <- data.frame(Trip =c(rep("A",10),rep("B",10)),
State =c(0,0,0,1,1,1,0,0,1,0,0,1,1,0,0,0,1,1,1,0),
Distance = c(0,2,9,4,3,1,4,5,6,3,2,6,1,5,3,3,6,1,8,2),
DistanceToNext = c(NA,NA,NA,3,1,15,NA,NA,NA,NA,NA,1,17,NA,NA,NA,1,8,NA,NA))
Trip State Distance DistanceToNext
1 A 0 0 NA
2 A 0 2 NA
3 A 0 9 NA
4 A 1 4 3
5 A 1 3 1
6 A 1 1 15
7 A 0 4 NA
8 A 0 5 NA
9 A 1 6 NA
10 A 0 3 NA
11 B 0 2 NA
12 B 1 6 1
13 B 1 1 17
14 B 0 5 NA
15 B 0 3 NA
16 B 0 3 NA
17 B 1 6 1
18 B 1 1 8
19 B 1 8 NA
20 B 0 2 NA
The State column indicates whether a fishing boat is fishing (State = 1) or not fishing (State = 0). I want to calculate the distance travelled between consecutive fishing events (State = 1).
The Distance column indicates the distance between that row's location and the previous row's (i.e. it is the lag distance).
The DistanceToNext column is the answer I am trying to generate. It should be NA for all rows in the Trip until the first row where State = 1. For this row, DistanceToNext should equal the sum of the Distance column over the subsequent rows, up to and including the next row where State = 1.
For example, row 4 is the first fishing event (State = 1) in Trip A. Its DistanceToNext should be the distance travelled before the next fishing event, which in this case is the very next row (row 5), which has a Distance of 3.
For row 5, the next fishing event is again the very next row (row 6), which has a Distance of 1. However, for row 6 there isn't another fishing event until row 9, so I want the cumulative sum of the Distance column for rows 7 to 9, which is 15.
If it is the last State = 1 row in its Trip grouping (A or B), there isn't another fishing event, so there is no distance to calculate and I want it to give NA.
Here is another solution you could use. I wrote a custom function that takes the State and Distance vectors of each group and returns the desired output:
library(dplyr)

fn <- function(State, Distance) {
  out <- rep(NA_real_, length(State))
  inds <- which(State == 1)  # row positions of the fishing events
  for (i in inds) {
    if (i == inds[length(inds)]) {
      # last fishing event in the trip: no next event, so leave NA
      next
    } else if (State[i + 1] == 1) {
      # the very next row is also a fishing event
      out[i] <- Distance[i + 1]
    } else {
      # sum the distances up to and including the next fishing event
      nx <- which(inds == i)
      out[i] <- sum(Distance[(i + 1):inds[nx + 1]])
    }
  }
  out
}
df %>%
  group_by(Trip) %>%
  mutate(MyDistance = fn(State, Distance))
# A tibble: 20 x 5
# Groups: Trip [2]
Trip State Distance DistanceToNext MyDistance
<chr> <dbl> <dbl> <dbl> <dbl>
1 A 0 0 NA NA
2 A 0 2 NA NA
3 A 0 9 NA NA
4 A 1 4 3 3
5 A 1 3 1 1
6 A 1 1 15 15
7 A 0 4 NA NA
8 A 0 5 NA NA
9 A 1 6 NA NA
10 A 0 3 NA NA
11 B 0 2 NA NA
12 B 1 6 1 1
13 B 1 1 17 17
14 B 0 5 NA NA
15 B 0 3 NA NA
16 B 0 3 NA NA
17 B 1 6 1 1
18 B 1 1 8 8
19 B 1 8 NA NA
20 B 0 2 NA NA
In base R you would do:
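fun <- function(df) {
  a <- which(df$State == 1)  # rows with a fishing event
  b <- rep(NA, nrow(df))
  # for each fishing row, sum Distance up to and including the next fishing row
  d <- mapply(function(x, y) sum(df$Distance[(x + 1):y]), head(a, -1), tail(a, -1))
  b[a] <- c(d, NA)           # the last fishing row has no next event
  transform(df, DisttoNext = b)
}
do.call(rbind, by(df, df$Trip, fun))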
Trip State Distance DistanceToNext DisttoNext
A.1 A 0 0 NA NA
A.2 A 0 2 NA NA
A.3 A 0 9 NA NA
A.4 A 1 4 3 3
A.5 A 1 3 1 1
A.6 A 1 1 15 15
A.7 A 0 4 NA NA
A.8 A 0 5 NA NA
A.9 A 1 6 NA NA
A.10 A 0 3 NA NA
B.11 B 0 2 NA NA
B.12 B 1 6 1 1
B.13 B 1 1 17 17
B.14 B 0 5 NA NA
B.15 B 0 3 NA NA
B.16 B 0 3 NA NA
B.17 B 1 6 1 1
B.18 B 1 1 8 8
B.19 B 1 8 NA NA
B.20 B 0 2 NA NA
A data.table alternative.
library(data.table)
setDT(df)
df[,`:=`(next_dist = shift(Distance, type = "lead"), g = cumsum(State), ri = .I),
by = Trip]
d = df[ , .(ri = ri[1], State = State[1], s = sum(next_dist)), by = .(Trip, g)]
df[d[State == 1, .SD[-.N], by = Trip], on = .(ri), s := s]
df[ , `:=`(ri = NULL, next_dist = NULL, g = NULL)]
# Trip State Distance DistanceToNext s
# 1: A 0 0 NA NA
# 2: A 0 2 NA NA
# 3: A 0 9 NA NA
# 4: A 1 4 3 3
# 5: A 1 3 1 1
# 6: A 1 1 15 15
# 7: A 0 4 NA NA
# 8: A 0 5 NA NA
# 9: A 1 6 NA NA
# 10: A 0 3 NA NA
# 11: B 0 2 NA NA
# 12: B 1 6 1 1
# 13: B 1 1 17 17
# 14: B 0 5 NA NA
# 15: B 0 3 NA NA
# 16: B 0 3 NA NA
# 17: B 1 6 1 1
# 18: B 1 1 8 8
# 19: B 1 8 NA NA
# 20: B 0 2 NA NA
Explanation:
Convert data to data.table (setDT(df)).
For each 'Trip' (by = Trip), create new variables by reference (:=): next distance (shift(Distance, type = "lead")), a grouping variable which increases every time 'State' is 1 (cumsum(State)), and a row index used to join the result (.I; this could also be done first, without the grouping).
For each 'Trip' and 'State group' (by = .(Trip, g)), select first row index (ri[1]), first 'State' (State = State[1]), and sum the lead distances (sum(next_dist)).
From the result above, select rows where 'State' is 1 (State == 1). Then, for each 'Trip' (by = Trip), select the subset of data (.SD) except the last row (-.N). Join to the original data on row index (on = .(ri)). Create a new column, sum of distances, 's' by reference (:=). If desired, remove temp variables.
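The same per-trip sums can also be obtained in base R from a running total, since the sum of Distance over rows x+1 to y equals cumsum(Distance)[y] - cumsum(Distance)[x]. The following is only a sketch of that idea, not one of the answers above; the helper name dist_to_next is made up for illustration.

```r
# Hypothetical helper (name is an assumption): distance from each fishing row
# to the next fishing row within one trip, NA everywhere else.
dist_to_next <- function(State, Distance) {
  out <- rep(NA_real_, length(State))
  a <- which(State == 1)                 # rows with a fishing event
  if (length(a) > 1) {
    cum <- cumsum(Distance)              # running total of the lag distances
    out[a[-length(a)]] <- diff(cum[a])   # cum[next event] - cum[current event]
  }
  out
}

df <- data.frame(Trip = c(rep("A", 10), rep("B", 10)),
                 State = c(0,0,0,1,1,1,0,0,1,0,0,1,1,0,0,0,1,1,1,0),
                 Distance = c(0,2,9,4,3,1,4,5,6,3,2,6,1,5,3,3,6,1,8,2))

# Apply the helper to the row indices of each trip
df$DistanceToNext <- ave(seq_len(nrow(df)), df$Trip,
                         FUN = function(i) dist_to_next(df$State[i], df$Distance[i]))
df
```

Because the helper checks length(a) > 1, a trip with zero or one fishing event simply gets NA throughout.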
I have a data frame as below
df<- data.frame(a = c(1,2,3,4,5),
b = c(6,7,8,9,10),
c = c(11,12,13,14,15),
z = c("b","c","a","a","b"))
I'm trying to replace the value in each row with NA wherever the column name is equal to that row's value in column z. The desired output is below:
a b c z
1 1 NA 11 b
2 2 7 NA c
3 NA 8 13 a
4 NA 9 14 a
5 5 NA 15 b
I was thinking of something like the following logic applied to each row:
if the column name is equal to z, replace the value with NA
but I can't figure it out. Any help appreciated.
Cheers!
Use matrix indexing, matching the z column to the column names:
df[cbind(seq(nrow(df)),match(df$z,colnames(df[1:3])))] <- NA
df
a b c z
1 1 NA 11 b
2 2 7 NA c
3 NA 8 13 a
4 NA 9 14 a
5 5 NA 15 b
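To make the matrix indexing above easier to follow, here is a small sketch that builds the index matrix as a separate object first. The name idx and its column labels are only for illustration; each row of idx is a (row, column) pair pointing at one cell to blank out.

```r
df <- data.frame(a = c(1, 2, 3, 4, 5),
                 b = c(6, 7, 8, 9, 10),
                 c = c(11, 12, 13, 14, 15),
                 z = c("b", "c", "a", "a", "b"))

# One (row, col) pair per row of df: the column position whose name equals z
idx <- cbind(row = seq_len(nrow(df)),
             col = match(df$z, names(df)))
idx

# A two-column numeric matrix selects individual cells of the data frame
df[idx] <- NA
df
```

Matching against names(df) rather than assuming a column order means this does not depend on the letter columns being sorted.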
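This only works if z is a factor (the default for strings in data.frame() before R 4.0.0; converted explicitly here) and the letter columns are in lexicographic order, so that the factor codes match the column positions:
> df[cbind(1:5,as.numeric(factor(df$z)))] <- rep(NA,5)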
> df
a b c z
1 1 NA 11 b
2 2 7 NA c
3 NA 8 13 a
4 NA 9 14 a
5 5 NA 15 b
I have two datasets, x and y:
> x
a index b
1 1 1 5
2 NA 2 6
3 2 3 NA
4 NA 4 9
> y
index a
1 2 100
2 4 101
>
I would like to fill the missing values of x with the values contained in y.
I have tried to use the merge function but the result is not what I want.
> merge(x,y, by = 'index', all=T)
index a.x b a.y
1 1 1 5 NA
2 2 NA 6 100
3 3 2 NA NA
4 4 NA 9 101
In the real problem there are additional constraints:
1 - y does not fill all the missing values
2 - x and y share more variables (not only a and index)
EDIT: A more realistic example
> x
a index b c
1 1 1 5 NA
2 NA 2 6 NA
3 2 3 NA 5
4 NA 4 9 NA
5 NA 5 10 6
> y
index a c
1 2 100 4
2 4 101 NA
>
A solution in either Python or R would be welcome.
I used your merge idea and did the following using dplyr. I am sure there will be better ways of doing this task.
index <- 1:5
a <- c(1, NA, 2, NA, NA)
b <- c(5,6,NA,9,10)
c <- c(NA,NA,5,NA,6)
ana <- data.frame(index, a,b,c, stringsAsFactors=F)
index <- c(2,4)
a <- c(100, 101)
c <- c(4, NA)
bob <- data.frame(index, a,c, stringsAsFactors=F)
> ana
index a b c
1 1 1 5 NA
2 2 NA 6 NA
3 3 2 NA 5
4 4 NA 9 NA
5 5 NA 10 6
> bob
index a c
1 2 100 4
2 4 101 NA
library(dplyr)

ana %>%
  merge(bob, by = "index", all = TRUE) %>%
  # where the value from ana is missing, take the one from bob
  mutate(a.x = ifelse(is.na(a.x), a.y, a.x),
         c.x = ifelse(is.na(c.x), c.y, c.x))
index a.x b c.x a.y c.y
1 1 1 5 NA NA NA
2 2 100 6 4 100 4
3 3 2 NA 5 NA NA
4 4 101 9 NA 101 NA
5 5 NA 10 6 NA NA
I overwrote a.x (ana$a) with a.y (bob$a) wherever a.x was NA, using mutate. I did the same for c.x (ana$c). If you remove a.y and c.y at the end, that will be the outcome you expect, I think.
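The overwrite-then-drop step can also be avoided by filling the missing values in place, without creating the .x/.y columns at all. This is only a base R sketch using the question's more realistic x and y; the names shared, pos and fill are illustrative, and it assumes index identifies rows uniquely in y.

```r
x <- data.frame(a = c(1, NA, 2, NA, NA), index = 1:5,
                b = c(5, 6, NA, 9, 10), c = c(NA, NA, 5, NA, 6))
y <- data.frame(index = c(2, 4), a = c(100, 101), c = c(4, NA))

# Columns present in both data frames, excluding the key
shared <- setdiff(intersect(names(x), names(y)), "index")

# For each row of x, the matching row of y (NA when y has no such index)
pos <- match(x$index, y$index)

for (col in shared) {
  # only touch cells that are missing in x and have a matching row in y
  fill <- is.na(x[[col]]) & !is.na(pos)
  x[[col]][fill] <- y[[col]][pos[fill]]
}
x
```

Values that y cannot supply (such as a at index 5, or c at index 4) simply stay NA.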
Try:
xa = x[,c(1,2)]
m1 = merge(y,xa,all=T)
m1 = m1[!duplicated(m1$index),]
m1$b = x$b[match(m1$index, x$index)]
m1$c = x$c[match(m1$index, x$index)]
m1
index a b c
1 1 1 5 NA
2 2 100 6 NA
4 3 2 NA 5
5 4 101 9 NA
7 5 NA 10 6
or, if there are many other columns like b and c:
xa = x[,c(1,2)]
m1 = merge(y,xa,all=T)
m1 = m1[!duplicated(m1$index),]
for(nn in names(x)[3:4]) m1[,nn] = x[,nn][match(m1$index, x$index)]
m1
index a b c
1 1 1 5 NA
2 2 100 6 NA
4 3 2 NA 5
5 4 101 9 NA
7 5 NA 10 6
If there are multiple columns to replace, you could try converting from wide to long form, as in the first two methods below, and replace in one step:
m1 <- merge(x,y, by="index", all=TRUE)
m1L <- reshape(m1, idvar="index", varying=grep("\\.", colnames(m1)), direction="long", sep=".")
row.names(m1L) <- 1:nrow(m1L)
lst1 <- split(m1L, m1L$time)
indx <- is.na(lst1[[1]][,4:5])
lst1[[1]][,4:5][indx] <- lst1[[2]][,4:5][indx]
res <- lst1[[1]][,c(4,1,2,5)]
res
# a index b c
#1 1 1 5 NA
#2 100 2 6 4
#3 2 3 NA 5
#4 101 4 9 NA
#5 NA 5 10 6
Or you could use dplyr with tidyr
library(dplyr)
library(tidyr)
z <- left_join(x, y, by="index") %>%
gather(Var, Val, matches("\\.")) %>%
separate(Var, c("Var1", "Var2"))
indx1 <- which(is.na(z$Val) & z$Var2=="x")
z$Val[indx1] <- z$Val[indx1+nrow(z)/2]
z %>%
spread(Var1, Val) %>%
filter(Var2=="x") %>%
select(-Var2)
# index b a c
#1 1 5 1 NA
#2 2 6 100 4
#3 3 NA 2 5
#4 4 9 101 NA
#5 5 10 NA 6
Or split the columns by matching the names before the . and use sapply to replace the NAs:
indx <- grep("\\.", colnames(m1),value=TRUE)
res <- cbind(m1[!names(m1) %in% indx],
sapply(split(indx, gsub("\\..*", "", indx)), function(x) {
x1 <- m1[x]
indx1 <- is.na(x1[,1])
x1[,1][indx1] <- x1[,2][indx1]
x1[,1]} ))
res
# index b a c
#1 1 5 1 NA
#2 2 6 100 4
#3 3 NA 2 5
#4 4 9 101 NA
#5 5 10 NA 6
Let's say I have this kind of data frame:
df <- data.frame(
t=rep(seq(0,2),6),
no=rep(c(1,2,3,4,5,6),each=3),
value=rnorm(18),g=rep(c("nc","c1", NA),each=3)
)
t no value g
1 0 1 0.5022163 nc
2 1 1 0.5687227 nc
3 2 1 -0.2922622 nc
4 0 2 -0.3587089 c1
5 1 2 -0.9028012 c1
6 2 2 0.1926774 c1
7 0 3 0.6771236 NA
8 1 3 0.3752632 NA
9 2 3 0.2795892 NA
10 0 4 -0.4565521 nc
11 1 4 -0.1241807 nc
12 2 4 -1.2603695 nc
13 0 5 -0.6323118 c1
14 1 5 -0.6283850 c1
15 2 5 -0.2052317 c1
16 0 6 1.5996913 NA
17 1 6 -0.4802057 NA
18 2 6 -0.4255056 NA
I want to set the values in df$value to NA whenever there is NA in df$g (only in the same rows).
And similarly, set the values in df$value to NA, if df$no is, e.g., 1 or 5.
I was fooling around with for loops, but I could not get it right.
Any help will be much appreciated.
Thanks
With a for loop:
for (i in seq_len(nrow(df))) {
  # blank the value when no is 1 or 5, or when g is missing
  if (df$no[i] %in% c(1, 5) || is.na(df$g[i])) {
    df$value[i] <- NA
  }
}
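The loop above can also be written as a single vectorized assignment with a logical index, which is the more common idiom in R. A minimal sketch follows; the set.seed call and the name drop are additions for illustration only and are not part of the question.

```r
set.seed(1)  # assumption: only here to make the random values reproducible
df <- data.frame(
  t = rep(seq(0, 2), 6),
  no = rep(c(1, 2, 3, 4, 5, 6), each = 3),
  value = rnorm(18),
  g = rep(c("nc", "c1", NA), each = 3)
)

# TRUE for rows where g is missing or no is 1 or 5
drop <- is.na(df$g) | df$no %in% c(1, 5)
df$value[drop] <- NA
df
```

Using %in% rather than == keeps the condition well defined even if no contained missing values.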
I am new to R, so I am still getting my head around the way it works. My problem is as follows: I have a data frame and a prioritised list of columns (pl), and I need:
To find the maximum value from the columns in pl for each row and create a new column with this value (df$max)
Using the priority list, to take the first non-NA value in priority order (the priority value) and return the absolute difference between it and the maximum
Probably better with an example:
My priority list is
pl <- c("E","D","A","B")
and the data frame is:
A B C D E F G
1 15 5 20 9 NA 6 1
2 3 2 NA 5 1 3 2
3 NA NA 3 NA NA NA NA
4 0 1 0 7 8 NA 6
5 1 2 3 NA NA 1 6
So for the first row the maximum is from column A (15) and the priority value is from column D (9), since E is NA. The answer I want should look like this:
A B C D E F G MAX MAX-PR
1 15 5 20 9 NA 6 1 15 6
2 3 2 NA 5 1 3 2 5 4
3 NA NA 3 NA NA NA NA NA NA
4 0 1 0 7 8 NA 6 8 0
5 1 2 3 NA NA 1 6 2 1
How about this?
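# row-wise maximum over the priority columns (gives -Inf with a warning if all are NA)
df$MAX <- apply(df[,pl], 1, max, na.rm = TRUE)
# subtract the first non-NA value in priority order
df$MAX_PR <- df$MAX - apply(df[,pl], 1, function(x) x[!is.na(x)][1])
# rows with no non-NA priority value get NA instead of -Inf
df$MAX[is.infinite(df$MAX)] <- NA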
> df
# A B C D E F G MAX MAX_PR
# 1 15 5 20 9 NA 6 1 15 6
# 2 3 2 NA 5 1 3 2 5 4
# 3 NA NA 3 NA NA NA NA NA NA
# 4 0 1 0 7 8 NA 6 8 0
# 5 1 2 3 NA NA 1 6 2 1
Example:
df <- data.frame(A=c(1,NA,2,5,3,1),B=c(3,5,NA,6,NA,10),C=c(NA,3,4,5,1,4))
pl <- c("B","A","C")
#now we find the maximum per row over the priority columns, ignoring NAs
max.per.row <- apply(df[,pl],1,max,na.rm=TRUE)
#and the first element according to the priority list, ignoring NAs
#(there may be a more efficient way to do this)
first.per.row <- apply(df[,pl],1, function(x) as.vector(na.omit(x))[1])
#and finally compute the difference
max.less.first.per.row <- max.per.row - first.per.row
Note that this code does not handle a row that is all NA: max() then returns -Inf with a warning and the first element is NA. There is no check against that.
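One way to add the missing check for all-NA rows is to return NA explicitly when no priority value is left after dropping the NAs. The sketch below uses the data frame from the question rather than the example above; the names dat, res and out and the column labels MAX and MAX_PR are chosen for illustration.

```r
dat <- data.frame(A = c(15, 3, NA, 0, 1), B = c(5, 2, NA, 1, 2),
                  C = c(20, NA, 3, 0, 3), D = c(9, 5, NA, 7, NA),
                  E = c(NA, 1, NA, 8, NA), F = c(6, 3, NA, NA, 1),
                  G = c(1, 2, NA, 6, 6))
pl <- c("E", "D", "A", "B")

res <- t(apply(dat[, pl], 1, function(x) {
  x <- x[!is.na(x)]                     # keep non-NA values, in priority order
  if (length(x) == 0) {
    # no usable value in the priority columns for this row
    return(c(MAX = NA_real_, MAX_PR = NA_real_))
  }
  c(MAX = max(x), MAX_PR = max(x) - x[[1]])
}))

out <- cbind(dat, res)
out
```

Because the first non-NA priority value is itself one of the values the maximum is taken over, the difference is never negative, so no abs() is needed here.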
Here is a simple version. First, I take only the pl columns; then, for each row, I remove the NAs and compute the max.
df <- dat[,pl]
cbind(dat, t(apply(df, 1, function(x) {
  x <- na.omit(x)
  # maximum, and maximum minus the first non-NA priority value
  c(max(x), max(x) - x[1])
})))
A B C D E F G 1 2
1 15 5 20 9 NA 6 1 15 6
2 3 2 NA 5 1 3 2 5 4
3 NA NA 3 NA NA NA NA -Inf NA
4 0 1 0 7 8 NA 6 8 0
5 1 2 3 NA NA 1 6 2 1