How do I assign progressive numbers to a column every time a given condition is met in another one? Given this data:
a <- data.frame(var = c(1, 0, 0, 1, 4, 5, 6, 1, 7, 1, 1))
I would like to construct a column that is progressively augmented by 1 every time var == 1 and returns NAs for the rest. The new column should then be filled with:
1, NA, NA, 2, NA, NA, NA, 3, NA, 4, 5
I thought about ifelse but I didn't manage to make it work.
Thanks!
You can use ifelse and cumsum:
a$newvar <- ifelse(a$var==1, cumsum(a$var==1), NA)
var newvar
1 1 1
2 0 NA
3 0 NA
4 1 2
5 4 NA
6 5 NA
7 6 NA
8 1 3
9 7 NA
10 1 4
11 1 5
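I'd just like to add a variation for the general case, where you want to do the same thing for 4, 5, or any other value: flag the matching rows first, then run cumsum over just those rows (replace the 1 in the condition with the value you want).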
a <- data.frame(var = c(1, 0, 0, 1, 4, 5, 6, 1, 7, 1, 1))
a$New <- ifelse(a$var == 1,1,NA)
a$New[!is.na(a$New)] <- cumsum(a$New[!is.na(a$New)])
Output:
> print(a)
var New
1 1 1
2 0 NA
3 0 NA
4 1 2
5 4 NA
6 5 NA
7 6 NA
8 1 3
9 7 NA
10 1 4
11 1 5
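Both answers above follow the same recipe: number the rows where the condition holds and leave NA everywhere else. A minimal base R sketch of that recipe is below. Note that count_when is a hypothetical helper name (not from any package), and treating an NA condition as "not met" is my assumption.

```r
# Hypothetical helper: running count of the rows where `cond` is TRUE,
# NA everywhere else. NA conditions are treated as "not met" (assumption).
count_when <- function(cond) {
  cond <- !is.na(cond) & cond
  out <- rep(NA_integer_, length(cond))
  out[cond] <- seq_len(sum(cond))
  out
}

a <- data.frame(var = c(1, 0, 0, 1, 4, 5, 6, 1, 7, 1, 1))
a$newvar <- count_when(a$var == 1)
a$newvar
# [1]  1 NA NA  2 NA NA NA  3 NA  4  5
```

The same helper should work for any other condition, e.g. count_when(a$var >= 4).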
We can also do this with a variation of cumsum
a$newVar <- with(a, cumsum(var ==1) * NA^(var!=1))
a$newVar
#[1] 1 NA NA 2 NA NA NA 3 NA 4 5
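In case the NA^ part looks like magic: it relies on R returning 1 for anything raised to the power 0, including NA, while NA^1 stays NA. A small sketch of the trick on a shorter vector (v, mask and res are just illustrative names):

```r
NA^0   # 1: in R, y^0 is always 1, even for NA
NA^1   # NA

v <- c(1, 0, 4, 1)
mask <- NA^(v != 1)             # 1 where v == 1, NA elsewhere
res <- cumsum(v == 1) * mask    # running count, blanked out where v != 1
res
# [1]  1 NA NA  2
```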
Or using data.table: we convert the 'data.frame' to a 'data.table' by reference (setDT(a)), specify the logical condition in 'i' (var == 1), and assign the cumsum of 'var' to 'newvar' with :=, which is efficient because it assigns in place. The cumsum acts as a counter here only because 'var' is 1 in every selected row. By default, the elements of 'newvar' that do not correspond to the logical condition are filled with NA.
library(data.table)
setDT(a)[var==1, newvar := cumsum(var)]
a
# var newvar
# 1: 1 1
# 2: 0 NA
# 3: 0 NA
# 4: 1 2
# 5: 4 NA
# 6: 5 NA
# 7: 6 NA
# 8: 1 3
# 9: 7 NA
#10: 1 4
#11: 1 5
Or, instead of cumsum, we can use the sequence of matching rows, which also works when the condition is something other than var == 1:
setDT(a)[var==1, newvar := seq_len(.N)]
I have a dataframe df of integers across 6 variables.
a <- c(NA, NA, NA, 0, 0, 1, 1, 1)
b <- c(NA, NA, NA, 2, 2, 3, 3, 3)
c <- c(NA, NA, NA, 2, 2, 3, 3, 3)
d <- c(NA, NA, NA, 1, 1, 2, 2, 2)
e <- c(NA, NA, NA, 0, 0, 1, 1, 1)
f <- c(NA, NA, NA, 0, 0, 1, 1, 1)
df <- data.frame(a, b, c, d, e, f)
print(df)
a b c d e f
1 NA NA NA NA NA NA
2 NA NA NA NA NA NA
3 NA NA NA NA NA NA
4 0 2 2 1 0 0
5 0 2 2 1 0 0
6 1 3 3 2 1 1
7 1 3 3 2 1 1
8 1 3 3 2 1 1
I would like to add 1 to each row that contains a zero, resulting in:
a b c d e f
1 NA NA NA NA NA NA
2 NA NA NA NA NA NA
3 NA NA NA NA NA NA
4 1 3 3 2 1 1
5 1 3 3 2 1 1
6 1 3 3 2 1 1
7 1 3 3 2 1 1
8 1 3 3 2 1 1
I've been able to test if a row contains a zero with the following code, which adds a new column of "TRUE" or "FALSE".
df$cont0 <- apply(df, 1, function(x) any(x %in% "0"))
I thought I would use this new value to then add 1 to each row where df$cont0 == "TRUE":
ifelse(df$cont0 == "TRUE", df + 1, df)
This ends up creating a nested list that still does not perform the correct operation. I understand that ifelse is already vectorized, but other than that I'm not sure how to approach this issue. I am open to splitting apart the df into "TRUE" and "FALSE" conditions, then performing the operation on df$cont0 == "TRUE", but they need to be re-merged in the original order as the data are chronological and row order therefore matters. However I suspect there's an easier solution. Thank you!
Create a logical index with rowSums on the logical matrix df == 0, and use it as a row index to add 1 to the matching rows:
i1 <- rowSums(df == 0, na.rm = TRUE) > 0
df[i1,] <- df[i1, ] + 1
-output
> df
a b c d e f
1 NA NA NA NA NA NA
2 NA NA NA NA NA NA
3 NA NA NA NA NA NA
4 1 3 3 2 1 1
5 1 3 3 2 1 1
6 1 3 3 2 1 1
7 1 3 3 2 1 1
8 1 3 3 2 1 1
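As for why ifelse fails here: ifelse returns a result with the same length as its test argument and expects the yes and no arguments to match it. In the OP's case the test is a logical vector with one element per row, while df + 1 and df are whole data frames (lists of columns), so the result is a list rather than a data frame.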
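To see the ifelse problem in isolation, here is a small sketch on a made-up 3-row data frame (not the OP's data; has_zero and res are illustrative names). It shows that the result follows the shape of the test vector, and that a logical row index does the job instead:

```r
df <- data.frame(a = c(NA, 0, 1), b = c(NA, 2, 3))
has_zero <- rowSums(df == 0, na.rm = TRUE) > 0   # FALSE TRUE FALSE

res <- ifelse(has_zero, df + 1, df)
is.list(res)   # TRUE
length(res)    # 3: one element per row of the test, not a data frame

# Row indexing keeps the data frame intact and preserves row order
df[has_zero, ] <- df[has_zero, ] + 1
df
```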
Just get the row index first:
index <- rowSums(df == 0, na.rm = TRUE) > 0
df[index, ] <- df[index, ] + 1
It should work.
Being still rather new to R, I struggle with the following problem:
I am looking at several particles moving along the x axis (in reality this is in 3D, but this simplifies matters for our purpose here). I have a data frame with each particle's ID and its corresponding position at each time point.
Here's an example:
x.position1 <- c(5, NA, 4, 7, 1, NA, 2, NA, NA, 3)
x.position2 <- c(6, NA, 8, 7, 2, 1, 2, NA, NA, 1)
x.position3 <- c(6, 2, 7, 7, 4, 3, 1, NA, NA, 6)
x.position4 <- c(7, 4, 9, 7, 5, 5, 0, 0, 5, 7)
x.position5 <- c(9, 5, NA, 7, 6, NA, 2, 3, 8, 11)
particule.ID <- c(1:10)
df <- data.frame(particule.ID, x.position1, x.position2, x.position3, x.position4, x.position5)
df
particule.ID x.position1 x.position2 x.position3 x.position4 x.position5
1 1 5 6 6 7 9
2 2 NA NA 2 4 5
3 3 4 8 7 9 NA
4 4 7 7 7 7 7
5 5 1 2 4 5 6
6 6 NA 1 3 5 NA
7 7 2 2 1 0 2
8 8 NA NA NA 0 3
9 9 NA NA NA 5 8
10 10 3 1 6 7 11
My aim is to calculate the displacement of each particle at each time point i. This displacement is therefore xi - x1. This newly calculated displacement is to be placed in a newly created column.
Here's the script I originally wrote for this:
for (i in 1:5){ # iterate for each time point i
df$Disp <- df[,2+i-1]-df[,2] # create a new column with the calculated displacement for time point i
nam.Disp <- paste("Disp", i, sep = "") #rename new column Disp+time point number
names(df)[names(df) == 'Disp'] <- nam.Disp
}
df
particule.ID x.position1 x.position2 x.position3 x.position4 x.position5 Disp1 Disp2 Disp3 Disp4 Disp5
1 1 5 6 6 7 9 0 1 1 2 4
2 2 NA NA 2 4 5 NA NA NA NA NA
3 3 4 8 7 9 NA 0 4 3 5 NA
4 4 7 7 7 7 7 0 0 0 0 0
5 5 1 2 4 5 6 0 1 3 4 5
6 6 NA 1 3 5 NA NA NA NA NA NA
7 7 2 2 1 0 2 0 0 -1 -2 0
8 8 NA NA NA 0 3 NA NA NA NA NA
9 9 NA NA NA 5 8 NA NA NA NA NA
10 10 3 1 6 7 11 0 -2 3 4 8
However, as you may notice in the data frame, sometimes a particle is not detected at i=1, or later, which means that I get an NA value. I therefore tried adding another loop with an IF, so that if the first time point is NA, R would move on to the next time point until it found a non-NA value to subtract.
I therefore came up with this, using ifelse instead of IF, since the latter can only deal with one value while my input is actually a column:
for (i in 1:5){ # iterate for each time point i
for (j in 1:5){ # if first time point has no value (NA) scan the row for next time point until an object is detected
ifelse(!is.na(df[,2+j-1]),
df$Disp <- (df[,2+i-1]-df[,2+j-1]), # create a new column with the calculated displacement for i time point
next) # if time point is NA go to next j (next fixed initial time point to test)
}
nam.Disp <- paste("Disp", i, sep = "") #rename new column Disp+time point number
names(df)[names(df) == 'Disp'] <- nam.Disp
}
df
particule.ID x.position1 x.position2 x.position3 x.position4 x.position5 Disp1 Disp2 Disp3 Disp4 Disp5
1 1 5 6 6 7 9 -4 -3 -3 -2 0
2 2 NA NA 2 4 5 NA NA -3 -1 0
3 3 4 8 7 9 NA NA NA NA NA NA
4 4 7 7 7 7 7 0 0 0 0 0
5 5 1 2 4 5 6 -5 -4 -2 -1 0
6 6 NA 1 3 5 NA NA NA NA NA NA
7 7 2 2 1 0 2 0 0 -1 -2 0
8 8 NA NA NA 0 3 NA NA NA -3 0
9 9 NA NA NA 5 8 NA NA NA -3 0
10 10 3 1 6 7 11 -8 -10 -5 -4 0
Somehow this returns wrong values. It looks like the computation occurred backwards: Disp1 = x1-x5, Disp2 = x2-x5, Disp3 = x3-x5 etc., while what I expected was: Disp1 = x1-x1, Disp2 = x2-x1, Disp3 = x3-x1 etc.
How can this inclusion of the new for loop and the ifelse function have caused this? What am I doing wrong?
There may be a way to do this manually, but since in reality I have at least 60 time points if not more, the task would be daunting.
Also, if you think there's a much smarter way to do this, please do share! And if I forgot to include important details that would help you understand better, just let me know.
Many thanks!
Flo
An example on how to deal with your problem in long format:
library(data.table)
library(zoo)
library(stringr)
df <- setDT(df)
dflong <- melt(df, id.vars = "particule.ID", measure.vars = patterns("position"), value.name = "position")
dflong[, time := as.integer(str_extract(variable, "[0-9]+$"))] # integer, so that time 10 sorts after time 9
setkey(dflong, particule.ID, time)
dflong[, displacement := c(NA, diff(na.locf(position, na.rm = FALSE))), by = particule.ID]
particule.ID variable position time displacement
1: 1 x.position1 5 1 NA
2: 1 x.position2 6 2 1
3: 1 x.position3 6 3 0
4: 1 x.position4 7 4 1
5: 1 x.position5 9 5 2
6: 2 x.position1 NA 1 NA
7: 2 x.position2 NA 2 NA
8: 2 x.position3 2 3 NA
9: 2 x.position4 4 4 2
10: 2 x.position5 5 5 1
11: 3 x.position1 4 1 NA
12: 3 x.position2 8 2 4
13: 3 x.position3 7 3 -1
14: 3 x.position4 9 4 2
15: 3 x.position5 NA 5 0
.....
I use data.table here, but the same can be done with dplyr and tidyr, using pivot_longer.
melt(df, id.vars = "particule.ID", measure.vars = patterns("position"), value.name = "position")
transforms your data into long format, with 3 variables: particule.ID, variable (the original column name, which contains the time), and position.
I then extract the time with str_extract from stringr, converting it to an integer so that it sorts numerically:
dflong[, time := as.integer(str_extract(variable, "[0-9]+$"))]
I then order the table by ID and time with setkey(dflong, particule.ID, time).
I then use diff(position) within each particule.ID to calculate the displacement between consecutive time points. Because NAs would otherwise propagate through diff, I first apply na.locf from zoo, which replaces each NA with the last available value (leading NAs are kept because na.rm = FALSE).
Since you want the difference between time i and time 1 for each particle, you will get NA whenever either of the two positions is NA.
In your case, we can create a second data frame df2 that holds the change from time 1 to time i for each of the 10 particles.
df2 <- data.frame(matrix(NA, nrow = 10, ncol = 4))
colnames(df2) <- c("chp2", "chp3", "chp4", "chp5")
We can capture the differences in a loop over the time points; each subtraction is vectorised over the particles:
for (i in 1:4) {
df2[[i]] <- df[[i + 2]] - df[[2]] # position at time i + 1 minus position at time 1
}
Now we can put them all in one data frame if we want: df1 <- cbind(df, df2)
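A possible alternative that also handles a missing first time point, as the question asks: take each particle's first non-NA position as the reference and subtract it from the whole row. This is only a base R sketch, and the names pos, first_obs, disp and res are made up for illustration. (In the nested loop from the question, each pass over j appears to overwrite the whole Disp column, so only the last j = 5 survives, which would explain the backwards-looking result.)

```r
df <- data.frame(
  particule.ID = 1:10,
  x.position1 = c(5, NA, 4, 7, 1, NA, 2, NA, NA, 3),
  x.position2 = c(6, NA, 8, 7, 2, 1, 2, NA, NA, 1),
  x.position3 = c(6, 2, 7, 7, 4, 3, 1, NA, NA, 6),
  x.position4 = c(7, 4, 9, 7, 5, 5, 0, 0, 5, 7),
  x.position5 = c(9, 5, NA, 7, 6, NA, 2, 3, 8, 11)
)

# Matrix of positions: one row per particle, one column per time point
pos <- as.matrix(df[, grep("^x\\.position", names(df))])

# First detected (non-NA) position of each particle
first_obs <- apply(pos, 1, function(r) r[!is.na(r)][1])

# Recycling runs down the columns, so row i is reduced by first_obs[i]
disp <- pos - first_obs
colnames(disp) <- paste0("Disp", seq_len(ncol(disp)))

res <- cbind(df, disp)
res[2, ]   # particle 2: Disp1..Disp5 are NA NA 0 2 3
```

Because nothing is hard-coded per time point, the same lines should carry over to 60 or more position columns.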
Suppose I have this dataframe
df <- data.frame(
x=c(1, NA, NA, 4, 5, NA),
y=c(NA, 2, 3, NA, NA, 6))
which looks like this
x y
1 1 NA
2 NA 2
3 NA 3
4 4 NA
5 5 NA
6 NA 6
How can I merge the two columns into one? Basically the NA values are in complementary rows. It would be nice to also obtain (in the process) a flag column containing 0 if the entry comes from x and 1 if the entry comes from y.
We can try using the coalesce function from the dplyr package:
library(dplyr)
df$merged <- coalesce(df$x, df$y)
df$flag <- ifelse(is.na(df$y), 0, 1)
df
x y merged flag
1 1 NA 1 0
2 NA 2 2 1
3 NA 3 3 1
4 4 NA 4 0
5 5 NA 5 0
6 NA 6 6 1
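If you would rather not load dplyr, the same idea can be sketched in base R with ifelse alone. This assumes, as in the question, that the NAs sit in complementary rows, and it derives the flag from x instead of y:

```r
df <- data.frame(
  x = c(1, NA, NA, 4, 5, NA),
  y = c(NA, 2, 3, NA, NA, 6)
)

# Take y where x is missing, otherwise keep x
df$merged <- ifelse(is.na(df$x), df$y, df$x)

# 0 if the value came from x, 1 if it came from y
df$flag <- as.integer(is.na(df$x))
df
```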
We can also use base R methods with max.col on the logical matrix to get the column index, cbind with row index and extract the values that are not NA
df$merged <- df[cbind(seq_len(nrow(df)), max.col(!is.na(df)))]
df$flag <- +(!is.na(df$y))
df
# x y merged flag
#1 1 NA 1 0
#2 NA 2 2 1
#3 NA 3 3 1
#4 4 NA 4 0
#5 5 NA 5 0
#6 NA 6 6 1
Or we can use fcoalesce from data.table which is written in C and is multithreaded for numeric and factor types.
library(data.table)
setDT(df)[, c('merged', 'flag' ) := .(fcoalesce(x, y), +(!is.na(y)))]
df
# x y merged flag
#1: 1 NA 1 0
#2: NA 2 2 1
#3: NA 3 3 1
#4: 4 NA 4 0
#5: 5 NA 5 0
#6: NA 6 6 1
You can do that using dplyr as follows:
library(dplyr)
# Creating dataframe
df <-
data.frame(
x = c(1, NA, NA, 4, 5, NA),
y = c(NA, 2, 3, NA, NA, 6))
df %>%
# If x is NA then replace it with y
mutate(merged = coalesce(x, y),
# If x is NA then put 1 else put 0
flag = if_else(is.na(x), 1, 0))
# x y merged flag
# 1 1 NA 1 0
# 2 NA 2 2 1
# 3 NA 3 3 1
# 4 4 NA 4 0
# 5 5 NA 5 0
# 6 NA 6 6 1
I would like to calculate the rainfall that has fallen over the last three days for each grid square, and add this as a new column in my data.table. To be clear, I want to sum up the current and PREVIOUS two (2) days of rainfall, for each meteorological grid square:
library ( zoo )
library (data.table)
# making the data.table
rain <- c(NA, NA, NA, 0, 0, 5, 1, 0, 3, 10) # rainfall values to work with
square <- c(1,1,1,1,1,1,1,1,1,2) # the geographic grid square for the rainfall measurement
desired_result <- c(NA, NA, NA, NA, NA, 5, 6, 6, 4, NA ) # this is the result I'm looking for (the last NA as we are now on to the first day of the second grid square)
weather <- data.table(rain, square, desired_result) # making the data.table
My attempt to answer: this line used to work, but no longer does
weather[, rain_3 := filter(rain, rep(1, 2), sides = 1), by = list(square)]
So here I am trying another method:
# this next line gets the numbers right, but sums the following values, not the preceding ones.
weather$rain_3 <- rollapply(zoo(weather$rain), list(seq(-2,0)), sum)
# here I add in the by weather$ square, but still no success
weather$rain_3 <- rollapply(zoo(weather$rain), list(seq(-2,0)), sum, by= list(weather$square))
I would greatly appreciate any insights or suggestions you may have.
Many thanks!
Here's a quick and efficient solution using the latest data.table version (v 1.9.6+)
weather[, rain_3 := Reduce(`+`, shift(rain, 0:2)), by = square]
weather
# rain square desired_result rain_3
# 1: NA 1 NA NA
# 2: NA 1 NA NA
# 3: NA 1 NA NA
# 4: 0 1 NA NA
# 5: 0 1 NA NA
# 6: 5 1 5 5
# 7: 1 1 6 6
# 8: 0 1 6 6
# 9: 3 1 4 4
# 10: 10 2 NA NA
The basic idea here is to shift the rain column by 0, 1 and 2 rows (shift(rain, 0:2) returns a list of these three lagged vectors) and then sum them element-wise with Reduce.
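For anyone curious what the shift-and-add idea looks like without data.table, here is a rough base R sketch. lag_vec is a hypothetical helper I wrote for this illustration (it is not data.table's shift), and ave stands in for the by = square grouping:

```r
rain   <- c(NA, NA, NA, 0, 0, 5, 1, 0, 3, 10)
square <- c(1, 1, 1, 1, 1, 1, 1, 1, 1, 2)

# Hypothetical helper: move x down by k positions, padding the top with NA
lag_vec <- function(x, k) c(rep(NA, k), x)[seq_along(x)]

# Sum of the current value and the two previous ones, within each square
rain_3 <- ave(rain, square,
              FUN = function(x) x + lag_vec(x, 1) + lag_vec(x, 2))
rain_3
# [1] NA NA NA NA NA  5  6  6  4 NA
```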
The rollapply solution would be done like this:
weather[, rain_3 := rollapplyr(rain, 3, sum, fill = NA_real_), by = square]
giving:
rain square desired_result rain_3
1: NA 1 NA NA
2: NA 1 NA NA
3: NA 1 NA NA
4: 0 1 NA NA
5: 0 1 NA NA
6: 5 1 5 5
7: 1 1 6 6
8: 0 1 6 6
9: 3 1 4 4
10: 10 2 NA NA
Update
I have simplified the answer based on a version of zoo that came out since this question was originally asked.
weather[, rain_3 := filter(rain, rep(1, 3), sides = 1), by = list(square)]
#Error in filter(rain, rep(1, 3), sides = 1) :
# 'filter' is longer than time series
weather[, rain_3 := if(.N > 2) filter(rain, rep(1, 3), sides = 1) else NA_real_,
by = square]
# rain square desired_result rain_3
# 1: NA 1 NA NA
# 2: NA 1 NA NA
# 3: NA 1 NA NA
# 4: 0 1 NA NA
# 5: 0 1 NA NA
# 6: 5 1 5 5
# 7: 1 1 6 6
# 8: 0 1 6 6
# 9: 3 1 4 4
#10: 10 2 NA NA
Take care that dplyr is not loaded because it masks filter. If you need dplyr, you can call stats::filter explicitly.
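A tiny sketch of the explicit call on a plain vector (one grid square only, made-up values without NAs), which should behave the same whether or not dplyr is attached:

```r
rain <- c(0, 0, 5, 1, 0, 3)

# sides = 1 uses the current and past values only; as.numeric drops the ts class
rain_3 <- as.numeric(stats::filter(rain, rep(1, 3), sides = 1))
rain_3
# [1] NA NA  5  6  6  4
```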
You have almost got the answer yourself. rollsum (or rollapply in your case) gives you the vector of length N-2, so you just have to fill the desired cells with NAs. It can be simply done like this: roll<-c(NA,NA,rollsum(yourvector,k=3))
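To illustrate the padding idea without any extra package, the sketch below uses base R's embed as a stand-in for rollsum (my substitution, not what the line above uses): it produces the N-2 complete windows, which are then padded with two NAs at the front.

```r
x <- c(0, 0, 5, 1, 0, 3)

# Each row of embed(x, 3) holds x[t], x[t-1], x[t-2] for t = 3..length(x)
inner <- rowSums(embed(x, 3))   # length(x) - 2 window sums
roll  <- c(NA, NA, inner)       # pad so that roll lines up with x
roll
# [1] NA NA  5  6  6  4
```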
Here is how I do it. I use roll_sum from the {RcppRoll} package because it is much faster and handles NAs more easily. The by argument of data.table then groups the result by square.
library(RcppRoll)
weather[, rain_3 := if (.N > 2) c(NA, NA, roll_sum(rain, n = 3)) else NA_real_, by = square]
weather
rain square desired_result rain_3
1: NA 1 NA NA
2: NA 1 NA NA
3: NA 1 NA NA
4: 0 1 NA NA
5: 0 1 NA NA
6: 5 1 5 5
7: 1 1 6 6
8: 0 1 6 6
9: 3 1 4 4
10: 10 2 NA NA
Late to the party, but a more recent version of data.table package (1.12.8 for me) has frollsum function that will accomplish this a bit more cleanly than earlier (but very much valid) answers:
library (data.table)
# making the data.table
rain <- c(NA, NA, NA, 0, 0, 5, 1, 0, 3, 10) # rainfall values to work with
square <- c(1,1,1,1,1,1,1,1,1,2) # the geographic grid square for the rainfall measurement
desired_result <- c(NA, NA, NA, NA, NA, 5, 6, 6, 4, NA ) # this is the result I'm looking for (the last NA as we are now on to the first day of the second grid square)
weather <- data.table(rain, square, desired_result) # making the data.table
# using `frollsum`
weather[, rain3 := frollsum(rain, n = 3), by = square][]
#> rain square desired_result rain3
#> 1: NA 1 NA NA
#> 2: NA 1 NA NA
#> 3: NA 1 NA NA
#> 4: 0 1 NA NA
#> 5: 0 1 NA NA
#> 6: 5 1 5 5
#> 7: 1 1 6 6
#> 8: 0 1 6 6
#> 9: 3 1 4 4
#> 10: 10 2 NA NA
Created on 2020-07-09 by the reprex package (v0.3.0)
A dplyr solution:
library(dplyr)
weather %>%
group_by(square) %>%
mutate(rain_3 = rain + lag(rain) + lag(rain, n = 2L))
Result:
Source: local data table [10 x 4]
rain square desired_result rain_3
(dbl) (dbl) (dbl) (dbl)
1 NA 1 NA NA
2 NA 1 NA NA
3 NA 1 NA NA
4 0 1 NA NA
5 0 1 NA NA
6 5 1 5 5
7 1 1 6 6
8 0 1 6 6
9 3 1 4 4
10 10 2 NA NA
If you want to assign rain_3 to your dataset, you can use the %<>% operator from magrittr in your pipe:
library(magrittr)
weather %<>%
group_by......
Related to my other question - given the following data frame
df0 <- data.frame (id = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3),
val = c(0, 1, 1, 0, 0, 0, 1, 0, 1, 1))
what is the most straightforward way (i.e. one liner) to add a third column in which, for each id, sequence numbers appear only when val == 1, like the following?
id val seq
1 1 0 NA
2 1 1 1
3 1 1 2
4 2 0 NA
5 2 0 NA
6 2 0 NA
7 3 1 1
8 3 0 NA
9 3 1 2
10 3 1 3
Using data.table:
require(data.table)
setDT(df0)[val == 1L, seq := seq_len(.N), by=id]
# id val seq
# 1: 1 0 NA
# 2: 1 1 1
# 3: 1 1 2
# 4: 2 0 NA
# 5: 2 0 NA
# 6: 2 0 NA
# 7: 3 1 1
# 8: 3 0 NA
# 9: 3 1 2
# 10: 3 1 3
.N contains the number of observations per group. Start with the HTML vignettes for more.
With 0 instead of NA:
library(dplyr)
df0 <- df0 %>% group_by(id) %>% mutate(seq = (val==1)*cumsum(val))
With NA:
df0[df0$seq==0, 3] <- NA
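For completeness, a base R sketch of the same per-id counter, assuming val only ever contains 0 and 1 as in the example (ave applies cumsum within each id):

```r
df0 <- data.frame(id  = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3),
                  val = c(0, 1, 1, 0, 0, 0, 1, 0, 1, 1))

# Running count of val within each id ...
df0$seq <- ave(df0$val, df0$id, FUN = cumsum)
# ... kept only on the rows where val == 1
df0$seq[df0$val != 1] <- NA
df0$seq
# [1] NA  1  2 NA NA NA  1 NA  2  3
```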