Replacing row values in R based on previous rows - r

Below is my data frame df
df <- data.frame(A=c(1,1,1,1,0,0,-1,-1,-1,1,1,1,1))
I would like to have another variable T_D which keeps the first value whenever the value of A changes to either 1 or -1, and replaces the following rows with 0.
Expected Output:
A T_D
1 1
1 0
1 0
1 0
0 0
0 0
-1 -1
-1 0
-1 0
1 1
1 0
1 0
1 0

dplyr's window functions make this easy. You can use the lag function to look at the previous value and see if it equals the current value. The first row of the table doesn't have a previous value, so lag(a) is NA there and T_D comes out as NA. Fortunately T_D in that row should always equal a, so it's an easy problem to fix with a second mutate (or df[1,2] <- df[1,1]).
library(tidyverse) # Loads dplyr and other useful packages
df <- tibble(a = c(1, 1, 1, 1, 0, 0, -1, -1, -1, 1, 1, 1, 1))
df %>%
mutate(T_D = ifelse(a == lag(a), 0, a)) %>%
mutate(T_D = ifelse(is.na(T_D), a, T_D))

Base R solution, this seems to work for you:
df$T_D = df$A*!c(FALSE,diff(df$A,lag=1)==0)
Find the difference between sequential rows. If the difference is non-zero (or the row is the first one), take the entry from column A, otherwise set to 0.
OUTPUT
A T_D
1 1 1
2 1 0
3 1 0
4 1 0
5 0 0
6 0 0
7 -1 -1
8 -1 0
9 -1 0
10 1 1
11 1 0
12 1 0
13 1 0
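To see why the base R one-liner above works, it can help to unpack it into its intermediate vectors. This is only a sketch on the question's data; the names step, unchanged and keep are illustrative and not part of the original answer.

```r
df <- data.frame(A = c(1, 1, 1, 1, 0, 0, -1, -1, -1, 1, 1, 1, 1))

# Difference between each row and the one before it (length n - 1)
step <- diff(df$A)

# TRUE where a row repeats the previous value; the first row has no
# predecessor, so it is marked FALSE (never "unchanged")
unchanged <- c(FALSE, step == 0)

# Keep A where the value changed (TRUE counts as 1), zero it elsewhere
keep <- !unchanged
df$T_D <- df$A * keep

df$T_D
```

Multiplying by a logical vector is what does the masking here: TRUE behaves as 1 and FALSE as 0.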

df$T_D <- sign(abs(df$A)*diff(c(0, df$A)))

A data.table approach would be,
library(data.table)
setDT(df)[, T_D := replace(A, duplicated(A), 0), by = rleid(A)][]
# A T_D
# 1: 1 1
# 2: 1 0
# 3: 1 0
# 4: 1 0
# 5: 0 0
# 6: 0 0
# 7: -1 -1
# 8: -1 0
# 9: -1 0
#10: 1 1
#11: 1 0
#12: 1 0
#13: 1 0
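The grouping by rleid(A) above treats each run of identical values as its own group. A rough base R equivalent of that idea, assuming no extra packages, uses rle() to find where each run starts; the names runs and starts are made up for this sketch.

```r
df <- data.frame(A = c(1, 1, 1, 1, 0, 0, -1, -1, -1, 1, 1, 1, 1))

# Run-length encoding: lengths and values of consecutive identical entries
runs <- rle(df$A)

# Row index at which each run begins
starts <- cumsum(c(1, head(runs$lengths, -1)))

# Start from all zeros and write each run's value at its first row
T_D <- numeric(nrow(df))
T_D[starts] <- runs$values
df$T_D <- T_D

df$T_D
```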

Related

How can I create a new column with values 1/0, where the value in the new column is 1 only if values in two other columns are both 1?

I have two columns within a DF, "wet" and "cold", each containing values of 1 or 0, e.g.:
Wet Cold
1 1
0 1
0 1
1 0
1 1
0 0
I would like to create a new column, wet&cold, where only if wet=1 and cold=1, then wet&cold=1. If any or both of them are 0 or not matching, then wet&cold=0.
I tried to work around with grepl, but without success.
Base R solution
df$`wet&cold` <- df$Wet*df$Cold
df
Wet Cold wet&cold
1 1 1 1
2 0 1 0
3 0 1 0
4 1 0 0
5 1 1 1
6 0 0 0
dplyr solution
df %>%
mutate(`wet&cold`=Wet*Cold)
Wet Cold wet&cold
1 1 1 1
2 0 1 0
3 0 1 0
4 1 0 0
5 1 1 1
6 0 0 0
Another option is to check whether all row values equal 1 across all the columns and convert the TRUE/FALSE to 1/0 with as.integer, like this:
df$wet_cold = as.integer(rowSums(df == 1) == ncol(df))
df
#> Wet Cold wet_cold
#> 1 1 1 1
#> 2 0 1 0
#> 3 0 1 0
#> 4 1 0 0
#> 5 1 1 1
#> 6 0 0 0
Created on 2023-01-18 with reprex v2.0.2
The other solution works great with the clever multiplication. Here's perhaps a more general solution using ifelse(), which works well for this two-column case.
df <- data.frame(
wet = c(1, 0, 0, 1, 1, 0),
cold = c(1, 1, 1, 0, 1, 0)
)
df$wet_cold <- ifelse(df$wet == 1 & df$cold == 1, 1, 0)
df
# df
# wet cold wet_cold
# 1 1 1 1
# 2 0 1 0
# 3 0 1 0
# 4 1 0 0
# 5 1 1 1
# 6 0 0 0
You can use & to check if both are 1 and use + to convert TRUE or FALSE to 1 and 0.
DF["wet&cold"] <- +(DF$wet & DF$cold)
#DF
# wet cold wet&cold
#1 1 1 1
#2 0 1 0
#3 0 1 0
#4 1 0 0
#5 1 1 1
#6 0 0 0
Two more general approaches, which work for more than two columns and for conditions other than 1, are:
DF["wet&cold"] <- +(apply(DF==1, 1, all))
DF["wet&cold"] <- +(rowSums(DF != 1) == 0)
Data
DF <- data.frame(wet = c(1, 0, 0, 1, 1, 0), cold = c(1, 1, 1, 0, 1, 0))
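One thing to watch with the whole-data-frame versions above is that they inspect every column, including a result column added by an earlier run. A small sketch that restricts the check to named columns; the vector cols and the column name wet_cold are choices made for this example only.

```r
DF <- data.frame(wet = c(1, 0, 0, 1, 1, 0), cold = c(1, 1, 1, 0, 1, 0))

# Columns to test (an assumption for this example; adjust as needed)
cols <- c("wet", "cold")

# 1 when none of the selected columns differs from 1, otherwise 0
DF$wet_cold <- +(rowSums(DF[cols] != 1) == 0)

# Running it again gives the same answer, because wet_cold itself
# is not among the inspected columns
DF$wet_cold <- +(rowSums(DF[cols] != 1) == 0)

DF$wet_cold
```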

Apply "or" function across any number of data.frame columns and preserve missingness

I create datasets in R regularly and often find I need to take two or more binary variables and "or" them into one new variable that indicates if any were 1, none were 1, or all were missing.
Simply using | does not handle NA's the way I would like.
So given a data.frame, df of three columns:
x = c( 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1,NA,NA,NA,NA,NA,NA,NA,NA,NA)
y = c( 0, 0, 0, 1, 1, 1,NA,NA,NA, 0, 0, 0, 1, 1, 1,NA,NA,NA, 0, 0, 0, 1, 1, 1,NA,NA,NA)
z = c( 0, 1,NA, 0, 1,NA, 0, 1,NA, 0, 1,NA, 0, 1,NA, 0, 1,NA, 0, 1,NA, 0, 1,NA, 0, 1,NA)
df = data.frame(x,y,z)
The output I am looking for is:
myFunction(df)
[1] 0 1 0 1 1 1 0 1 0 1 1 1 1 1 1 1 1 1 0 1 0 1 1 1 0 1 NA
But simply using | does not handle 0's the way I am looking for as it prioritizes NA's over 0's:
as.numeric(df$x | df$y | df$z)
[1] 0 1 NA 1 1 1 NA 1 NA 1 1 1 1 1 1 1 1 1 NA 1 NA 1 1 1 NA 1 NA
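The NA results above follow from R's three-valued logic: an unknown value combined with FALSE stays unknown, while TRUE wins regardless of what the missing value might have been. A short illustration with made-up vectors u and v:

```r
# A missing value OR FALSE could be either, so the result is NA
f_or_na <- FALSE | NA

# A missing value OR TRUE is TRUE whatever the missing value is
t_or_na <- TRUE | NA

# The same rule applied element-wise to numeric 0/1 vectors
u <- c(0, 0, 1, NA)
v <- c(0, NA, NA, NA)
res <- as.numeric(u | v)

res
```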
This is the best solution I came up with:
myFunction <- function(...) {
as.numeric(apply(data.frame(...),1,function(x) { ifelse(all(is.na(x)),NA,sum(x,na.rm = T)) }) > 0)
}
df$xyz = myFunction(df)
df$xyz
[1] 0 1 0 1 1 1 0 1 0 1 1 1 1 1 1 1 1 1 0 1 0 1 1 1 0 1 NA
Is there a package with this functionality or a better way to write this so I don't have to copy paste this mess across all my scripts? Am I over thinking this?
We can use rowSums and convert to binary
df$new_col <- +(rowSums(df, na.rm = TRUE) > 0) * NA^(!rowSums(!is.na(df)))
-output
df$new_col
[1] 0 1 0 1 1 1 0 1 0 1 1 1 1 1 1 1 1 1 0 1 0 1 1 1 0 1 NA
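The NA^(...) factor in the one-liner above relies on NA^0 being 1 and NA^1 being NA, so it only blanks out rows where every entry is missing. Since the question asks for something reusable, here is a sketch of the same logic spelled out as a function; row_or is a hypothetical name, not an existing package function.

```r
# Hypothetical helper: row-wise "or" that is NA only when the whole row is NA
row_or <- function(d) {
  any_one     <- rowSums(d, na.rm = TRUE) > 0   # at least one 1 in the row
  all_missing <- rowSums(!is.na(d)) == 0        # every entry in the row is NA
  out <- as.numeric(any_one)
  out[all_missing] <- NA
  out
}

# Same data as in the question, built with rep()
x <- rep(c(0, 1, NA), each = 9)
y <- rep(rep(c(0, 1, NA), each = 3), times = 3)
z <- rep(c(0, 1, NA), times = 9)
df <- data.frame(x, y, z)

xyz <- row_or(df)
xyz
```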
It is also possible in a compact way if we use sum_ from hablar
library(hablar)
+(apply(df, 1, sum_) > 0)
[1] 0 1 0 1 1 1 0 1 0 1 1 1 1 1 1 1 1 1 0 1 0 1 1 1 0 1 NA
If you want your output as a new column in the dataframe:
dplyr::if_any is most helpful here. We can use if_any() to create a logical vector that outputs TRUE if any of the elements in the data is TRUE, rowwise. Then replace NAs with zeroes with coalesce.
library(dplyr)
df %>% mutate(new_col=coalesce(if_any(everything()), 0))
x y z new_col
1 0 0 0 0
2 0 0 1 1
3 0 0 NA 0
4 0 1 0 1
5 0 1 1 1
6 0 1 NA 1
7 0 NA 0 0
8 0 NA 1 1
9 0 NA NA 0
10 1 0 0 1
11 1 0 1 1
12 1 0 NA 1
13 1 1 0 1
14 1 1 1 1
15 1 1 NA 1
16 1 NA 0 1
17 1 NA 1 1
18 1 NA NA 1
19 NA 0 0 0
20 NA 0 1 1
21 NA 0 NA 0
22 NA 1 0 1
23 NA 1 1 1
24 NA 1 NA 1
25 NA NA 0 0
26 NA NA 1 1
27 NA NA NA 0
We use coalesce to replace NAs with 0s inside the mutate call, so the NAs from the original columns are preserved.
We can also use reduce(`|`) to create the new column, then coerce to numeric with +.
library(dplyr)
library(purrr)
df %>% mutate(new_col = +(map_dfc(df, coalesce, 0) %>% reduce(`|`)))
Or just use the reduce(`|`) method first, then replace NAs with 0 with coalesce at the end:
library(dplyr)
library(purrr)
df %>% mutate(new_col = coalesce(reduce(., `|`), 0))
If you want just the vector, use:
coalesce(Reduce(`|`, df), 0)
[1] 0 1 0 1 1 1 0 1 0 1 1 1 1 1 1 1 1 1 0 1 0 1 1 1 0 1 0
observation
For row-wise logical operations, if_any/if_all, reduce(`|`) and reduce(`&`), and rowSums(condition) are more robust than rowwise %>% max because max can't handle rows with all NAs (with na.rm = TRUE it returns -Inf and a warning).
In case you want to have NAs as the output when all values are NAs for a given row
For that, pipe the result into replace(), using if_all(everything(), is.na) as the condition, as in the following code:
output<-df %>% mutate(new_col=coalesce(if_any(everything()), 0) %>%
replace(., if_all(everything(), is.na), NA))
output$new_col
[1] 0 1 0 1 1 1 0 1 0 1 1 1 1 1 1 1 1 1 0 1 0 1 1 1 0 1 NA
Another way that I thought of:
library(dplyr)
df %>%
rowwise() %>%
mutate(out = max(c_across(),na.rm = TRUE)) %>%
pull(out) %>%
replace(is.infinite(.), NA)
[1] 0 1 0 1 1 1 0 1 0 1 1 1 1 1 1 1 1 1 0 1 0 1 1 1 0 1 NA
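The is.infinite() step in the pipeline above is needed because max() has nothing left to compare once na.rm = TRUE drops every value of an all-NA row. A base R illustration of that edge case, with made-up names all_na, m and out:

```r
# A row in which every value is missing
all_na <- c(NA_real_, NA_real_, NA_real_)

# With na.rm = TRUE the input is effectively empty, so max() returns -Inf
# (and emits a warning, silenced here)
m <- suppressWarnings(max(all_na, na.rm = TRUE))

# Turning the infinite placeholder back into NA
out <- c(0, 1, m)
out[is.infinite(out)] <- NA

out
```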

Iterative replacing of value with lagged values using dplyr

I have the following data frame -
x <- c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1)
y <- c(0,0,0,1,0,-1,0,-1,0,1,0,-1,0,1,0,0,0)
data <- data.frame(x,y)
and I would like to create a type of momentum indicator. Effectively, if y is non-zero, x takes y's value and if y is 0, x takes on the value of the lagged x value. Essentially, I am replacing x's value row by row. Doing this in a for loop is simple -
for (i in 1:nrow(data)) {
data$x[i] <-
ifelse(data$y[i] == 1, 1, ifelse(data$y[i] == -1, -1, data$x[i-1]))}
Giving me this output (what I am looking for)
x y
1 NA 0
2 NA 0
3 NA 0
4 1 1
5 1 0
6 -1 -1
7 -1 0
8 -1 -1
9 -1 0
10 1 1
11 1 0
12 -1 -1
13 -1 0
14 1 1
15 1 0
16 1 0
17 1 0
However, on really large datasets, this for loop is extremely inefficient. I'd like to implement this in dplyr, however the best solution I have managed to come up with does not do the trick
data2 <- data.frame(x,y)
data2 <-
data2 %>%
mutate(x = ifelse(y == 1, 1, ifelse(y == -1, 0, Lag(x))))
which return this
x y
1 NA 0
2 1 0
3 1 0
4 1 1
5 1 0
6 0 -1
7 1 0
8 0 -1
9 1 0
10 1 1
11 1 0
12 0 -1
13 1 0
14 1 1
15 1 0
16 1 0
17 1 0
My guess is that the way I am currently attempting to do this in dplyr does not control for the iterative nature of what I want to do, namely replace x as I move down the rows. Does anyone have ideas as to how I could do this through dplyr?
One option is to replace 0 with NA, and then do a forward fill:
library(dplyr); library(tidyr)
data %>% mutate(x = na_if(y, 0)) %>% fill(x)
# x y
#1 NA 0
#2 NA 0
#3 NA 0
#4 1 1
#5 1 0
#6 -1 -1
#7 -1 0
#8 -1 -1
#9 -1 0
#10 1 1
#11 1 0
#12 -1 -1
#13 -1 0
#14 1 1
#15 1 0
#16 1 0
#17 1 0
Here is another option using na.locf from zoo
library(zoo)
data$x <- with(data, na.locf(y*(NA^!y), na.rm=FALSE))
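If neither tidyr nor zoo is available, the same forward fill can be sketched in base R by tracking the position of the most recent non-zero y. This is an alternative technique, not what the answers above use, and last_nonzero is a name invented for the example.

```r
y <- c(0, 0, 0, 1, 0, -1, 0, -1, 0, 1, 0, -1, 0, 1, 0, 0, 0)

# Position of the latest non-zero entry seen so far (0 if none yet)
last_nonzero <- cummax(seq_along(y) * (y != 0))

# Prepend NA so that position 0 maps to NA, then look up by shifted index
x <- c(NA, y)[last_nonzero + 1]

data <- data.frame(x, y)
data$x
```

Because cummax() and indexing are vectorised, this avoids the row-by-row loop from the question.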

how to select subset only by [] in r?

a<-data.frame(q1=rep(c(1,'A','B'),4),q2=c(1,'A','B','C'),w1=c(1,'A','B','C'))
I want to convert the elements of q1 and q2 which are != 1 to 0, and I want to use only []. I believe all subsetting can be done with [].
a[grep("q\\d",colnames(a),perl=TRUE)!=1,grep("q\\d",colnames(a),perl=TRUE)]<-0
but it doesn't work, what's the problem?
We create a numeric index of the column names that start with 'q' followed by numbers ('nm1'), use that to subset the columns in 'a', and assign 0 to the values in that subset that are not equal to 1.
nm1 <- grep("q\\d+", names(a))
a[nm1][a[nm1] != 1] <- 0
and make sure we have the columns as character class by using stringsAsFactors= FALSE in the data.frame
The above replacement is based on a logical matrix (a[nm1]!=1) which may create memory problems if the dataset is really big. In that case, it is better to loop through the columns and replace with 0
a[nm1] <- lapply(a[nm1], function(x) replace(x, x!=1, 0))
data
a <- data.frame(q1=rep(c(1,'A','B'),4),q2=c(1,'A','B','C'),
w1=c(1,'A','B','C'), stringsAsFactors=FALSE)
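As for what goes wrong in the question's attempt: grep() returns column positions, so comparing its result with 1 tests the positions rather than the cell values. A sketch contrasting the two, with idx, row_sel, rows_hit, mask and fixed as illustrative names:

```r
a <- data.frame(q1 = rep(c(1, "A", "B"), 4), q2 = c(1, "A", "B", "C"),
                w1 = c(1, "A", "B", "C"), stringsAsFactors = FALSE)

# grep() gives the positions of the matching columns (q1 and q2)
idx <- grep("q\\d", colnames(a))

# This compares the positions 1 and 2 with 1, not the data
row_sel <- idx != 1

# A short logical index is recycled, so as a row selector it would
# pick every second row regardless of content
rows_hit <- seq_len(nrow(a))[row_sel]

# Testing the cell values instead yields a logical matrix, one entry per cell
mask <- a[idx] != 1
fixed <- a
fixed[idx][mask] <- 0

fixed
```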
Just in case, if you know column names, you can use them for indexing.
a<-data.frame(q1=rep(c(1,'A','B'),4), q2=c(1,'A','B','C'),
w1=c(1,'A','B','C'), stringsAsFactors=FALSE)
col_n <- c("q1", "q2")
a[, col_n][a[, col_n]!=1]<-0
> a
q1 q2 w1
1 1 1 1
2 0 0 A
3 0 0 B
4 1 0 C
5 0 1 1
6 0 0 A
7 1 0 B
8 0 0 C
9 0 1 1
10 1 0 A
11 0 0 B
12 0 0 C
data.table approach:
library(data.table)
a<-data.table(q1=rep(c(1,'A','B'),4),q2=c(1,'A','B','C'),w1=c(1,'A','B','C'))
a[,grep("^q", colnames(a), value = TRUE):=lapply(a[,grep("^q", colnames(a), value = TRUE), with = FALSE], function(x) ifelse(x == 1, 1, 0))]
> a
q1 q2 w1
1: 1 1 1
2: 0 0 A
3: 0 0 B
4: 1 0 C
5: 0 1 1
6: 0 0 A
7: 1 0 B
8: 0 0 C
9: 0 1 1
10: 1 0 A
11: 0 0 B
12: 0 0 C

How to find out how many times a certain pattern in x rows corresponds to a value in another row?

I want to find out how many times a certain pattern in columns one two and three corresponds to a certain value in the fourth column (class). my data.frame looks as follows:
one <- c(-1, 1, 1, -1, -1, 1, 1, 1, 1, -1, -1, -1, -1, -1, 1, 1, 1, -1, -1, 1)
two <- c(0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1)
three <- c(0, 0, 0, 0, -1, 0, 0, 0, 0, -1, -1, 0, -1, -1, 0, 0, 0, -1, -1, 0)
class <- c(0, 1, 1, 0, -1, -1, 1, 0, 1, -1, 1, 0, -1, -1, 1, 0, 1, -1, -1, 1)
mydf <- data.frame(one, two, three, class)
mydf
one two three class
1 -1 0 0 0
2 1 1 0 1
3 1 1 0 1
4 -1 0 0 0
5 -1 0 -1 -1
6 1 1 0 -1
7 1 1 0 1
8 1 1 0 0
9 1 1 0 1
10 -1 0 -1 -1
11 -1 0 -1 1
12 -1 0 0 0
13 -1 0 -1 -1
14 -1 0 -1 -1
15 1 1 0 1
16 1 1 0 0
17 1 1 0 1
18 -1 0 -1 -1
19 -1 0 -1 -1
20 1 1 0 1
# column one contains only value 1 or -1
# column two contains only value 1 and 0
# column three contains only values 0 and -1
# column class contains all values 1, 0 and -1
columns one two and three should be seen as a separate table. since each of these columns takes only two values, there are 8 possible patterns for each row.
pattern1: -1 0 -1
pattern2: -1 0 0
pattern3: -1 1 -1
pattern4: -1 1 0
pattern5: 1 0 -1
pattern6: 1 0 0
pattern7: 1 1 -1
pattern8: 1 1 0
i want to find out how many times each pattern corresponds to a 1, 0 and -1 in the last column (class).
how can i do that?? i was thinking if i had characters instead of numbers (e.g. 1=a, 0=b, -1=c) i could merge the columns one two three into a single column containing a certain term (eg. abc, acb, bac, bca,...). then i could find out how many times the term abc corresponds to a 1, 0 and -1 in the fourth column. one could even merge columns one to four and count the number of rows containing the resulting terms (abca, abcb, abcc, acba, acbb,...)
i'd be happy if someone knows a direct (and more elegant) way to do that!
Thank you very much!!
EDIT / NEW TASK:
# with your answers i get:
x <- do.call(paste, expand.grid(lapply(mydf[-4], unique)))
## Paste together the first three columns
y <- do.call(paste, mydf[-4])
## Tabulate
x <- factor(x)
table1 <- table(pattern = x[match(y, x)], value = mydf[, 4])
table1
value
pattern -1 0 1
-1 0 -1 6 0 1
-1 0 0 0 3 0
-1 1 -1 0 0 0
-1 1 0 0 0 0
1 0 -1 0 0 0
1 0 0 0 0 0
1 1 -1 0 0 0
1 1 0 1 2 7
my new task is the following: i get a new data.frame with only columns one two and three, but without column 4. e.g.
one.new <- c(-1, -1, -1, 1, 1)
two.new <- c(1, 1, 0, 1, 0)
three.new <- c(-1, 0, 0, -1, 0)
mydf.new <- data.frame(one.new, two.new, three.new)
mydf.new
one.new two.new three.new
# 1 -1 1 -1
# 2 -1 1 0
# 3 -1 0 0
# 4 1 1 -1
# 5 1 0 0
i now want to get a fourth column, that assigns the pattern of each row to the class-value with the highest frequency in table1.
so for example the first row will get a value of -1 in the fourth column.
# first row of table1:
# value
# pattern -1 0 1
# -1 0 -1 6 0 1
(there are patterns that don't occur in this example. in this case, there should be a 0 in the fourth column)
does anyone have an idea on how to do that? Thank you!!
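One possible reading of the new task above, as a hedged base R sketch: build a lookup from each observed pattern to its most frequent class, then apply it to the new rows with 0 as the fallback. The helpers pattern_key and assign_class, the column class.new and the probe data frame are all hypothetical names introduced here, and ties are resolved by taking the first maximum, which the question does not specify.

```r
one   <- c(-1, 1, 1, -1, -1, 1, 1, 1, 1, -1, -1, -1, -1, -1, 1, 1, 1, -1, -1, 1)
two   <- c(0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1)
three <- c(0, 0, 0, 0, -1, 0, 0, 0, 0, -1, -1, 0, -1, -1, 0, 0, 0, -1, -1, 0)
class <- c(0, 1, 1, 0, -1, -1, 1, 0, 1, -1, 1, 0, -1, -1, 1, 0, 1, -1, -1, 1)
mydf  <- data.frame(one, two, three, class)

# Collapse the pattern columns of a data frame into one key per row
pattern_key <- function(d) do.call(paste, unname(as.list(d)))

# Count classes per observed pattern, then pick the most frequent class
# (which.max takes the first maximum, so ties go to the lower class value)
tab  <- table(pattern = pattern_key(mydf[-4]), value = mydf$class)
best <- setNames(as.numeric(colnames(tab))[apply(tab, 1, which.max)],
                 rownames(tab))

# Look up each new row; patterns never seen in mydf fall back to 0
assign_class <- function(newdata) {
  out <- unname(best[pattern_key(newdata)])
  out[is.na(out)] <- 0
  out
}

mydf.new <- data.frame(one.new   = c(-1, -1, -1, 1, 1),
                       two.new   = c(1, 1, 0, 1, 0),
                       three.new = c(-1, 0, 0, -1, 0))
mydf.new$class.new <- assign_class(mydf.new)

# Hypothetical rows whose patterns do occur in mydf
probe <- data.frame(one = c(-1, 1), two = c(0, 1), three = c(-1, 0))
probe_class <- assign_class(probe)
```

With the mydf.new from the question, only the third row matches a pattern that occurs in mydf, so the other rows receive the fallback value.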
Here are some ways. They use mydf as constructed in the code of the question (which differs from the displayed version of mydf). There is one row for each pattern and class combination that appears in the data and the last column shows how many of such combinations exist.
1) aggregate
aggregate(count ~., cbind(count = 1, mydf), length)
giving:
one two three class count
1 -1 -1 -1 -1 6
2 1 1 0 -1 1
3 -1 -1 0 0 3
4 1 1 0 0 2
5 -1 -1 -1 1 1
6 1 1 0 1 7
2) sqldf
library(sqldf)
sqldf("select one, two, three, class, count(*)
from mydf
group by class, one, two, three")
giving:
one two three class count(*)
1 -1 -1 -1 -1 6
2 1 1 0 -1 1
3 -1 -1 0 0 3
4 1 1 0 0 2
5 -1 -1 -1 1 1
6 1 1 0 1 7
3) data.table
library(data.table)
DT <- data.table(mydf, key = "class,one,two,three")
DT[, list(count = .N), by = key(DT)]
class one two three count
1: -1 -1 -1 -1 6
2: -1 1 1 0 1
3: 0 -1 -1 0 3
4: 0 1 1 0 2
5: 1 -1 -1 -1 1
6: 1 1 1 0 7
4) reshape2. If you prefer the class along the top then try this:
library(reshape2)
dcast(mydf, ... ~ class, fun = length)
Using class as value column: use value.var to override.
one two three -1 0 1
1 -1 -1 -1 6 0 1
2 -1 -1 0 0 3 0
3 1 1 0 1 2 7
ADDED aggregate, data.table, reshape2.
Here is my interpretation of what you're asking:
## Create the combinations that are possible
x <- do.call(paste,
expand.grid(lapply(mydf[-4], unique)))
## Paste together the first three columns
y <- do.call(paste, mydf[-4])
## Tabulate
table(pattern = x[match(y, x)], value = mydf[, 4])
# value
# pattern -1 0 1
# -1 0 0 0 3 0
# -1 0 -1 6 0 1
# 1 1 0 1 2 7
Edit: Updated to match final data and fix a typo...
UPDATE
To get all 8 patterns in the output, factor "x" before tabulating. Continuing from above:
x <- factor(x)
table(pattern = x[match(y, x)], value = mydf[, 4])
# value
# pattern -1 0 1
# -1 0 0 0 3 0
# 1 0 0 0 0 0
# -1 0 -1 6 0 1
# 1 0 -1 0 0 0
# -1 1 0 0 0 0
# 1 1 0 1 2 7
# -1 1 -1 0 0 0
# 1 1 -1 0 0 0
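The reason factoring x brings back the empty rows is that subsetting a factor keeps all of its levels, and table() reports every level even when its count is zero. A toy illustration with stand-in values rather than the real patterns:

```r
x <- c("a", "b", "c")   # stand-in for the full set of possible patterns
y <- c("a", "a", "b")   # stand-in for the patterns actually observed

# Character vector: only values that occur are tabulated
plain <- table(x[match(y, x)])

# Factor: unused levels survive the subsetting and show up with count 0
xf   <- factor(x)
full <- table(xf[match(y, xf)])

full
```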
