Reference previous row in a data.frame - r

I have a very simple problem. I am trying to set the value of column X to 0 if column Y in row n does not equal column Y in row n-1. My issue is that I do not know how to reference a previous row's value in R and then use that value to set the value of another column.
As an example:
Y X
1 5
1 1
2 0
2 2
X[3,2] is 0 because Y[3,1] does not equal Y[2,1].
I basically need to find all instances of this in a large data set and set the corresponding X value to 0. In pseudocode:
data$X <- 0 if data$Y[n] != data$Y[n-1]
Is there a simple solution to this in R? It really feels as though there should be.
Thank you

Similarly to the post from @markus, with dplyr you can do:
df %>%
mutate(X = (Y == lag(Y, default = first(Y))) * X)
Y X
1 1 5
2 1 1
3 2 0
4 2 2

Given
Y <- c(1, 1, 2, 2)
X <- c(5, 1, 10, 2)
an option would be diff
X * (c(0, diff(Y)) == 0)
# [1] 5 1 0 2
The idea is to check whether Y[i] - Y[i - 1] equals zero, which gives a logical vector that we multiply by X (TRUE counts as 1 and FALSE as 0).
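The same comparison can also be used as a logical index, which stays close to the pseudocode in the question. A minimal sketch, assuming the data frame is called df (a name chosen here for illustration) and that Y has no missing values:

```r
df <- data.frame(Y = c(1, 1, 2, 2), X = c(5, 1, 10, 2))

# TRUE where Y differs from the row above; the first row has no predecessor
changed <- c(FALSE, df$Y[-1] != df$Y[-nrow(df)])

# zero out X only in those rows, leaving every other value untouched
df$X[changed] <- 0
df$X
# [1] 5 1 0 2
```

Unlike the multiplication trick, this form assigns the replacement value directly, so it also works if the replacement should be something other than 0.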

Another base R option
with(df, X * c(TRUE, !(Y[-1] - Y[-length(Y)])))
#[1] 5 1 0 2
Or using dplyr
library(dplyr)
df %>%
mutate(X = c(X[1], ((duplicated(Y) * X)[-1])))
# Y X
#1 1 5
#2 1 1
#3 2 0
#4 2 2
data
df <- structure(list(Y = c(1L, 1L, 2L, 2L), X = c(5L, 1L, 0L, 2L)),
class = "data.frame", row.names = c(NA, -4L))

Related

How to make a For loop that keeps the original row value

I am trying to run multiple conditional statements in a loop. My first conditional is an if, else if with 3 conditions (4 technically if nothing matches). My second really only needs one condition, and I want to keep the original row value if it doesn't meet that condition. The problem is my output doesn't match the row numbers, and I'm not sure how to output only to a specific row in a loop.
I want to loop over each column, and within each column I use sapply to check each value for falling outside of a range1 (gets marked with 4), inside of range1 (gets marked with 1), is.na (gets marked with 9), otherwise is marked -999. A narrower range would then be used, if each value in a column falls inside of range2, mark with a 3, otherwise don't update.
My partially working code, and a reproducible example is below. My input and first loop is:
df <- structure(list(A = c(-2, 3, 5, 10, NA), A.c = c(NA, NA, NA, NA, NA), B = c(2.2, -55, 3, NA, 99), B.c = c(NA, NA, NA, NA, NA)), class = "data.frame", row.names = c(NA, -5L))
> df
A A.c B B.c
1 -2 NA 2.2 NA
2 3 NA -55.0 NA
3 5 NA 3.0 NA
4 10 NA NA NA
5 NA NA 99.0 NA
min1 <- 0
max1 <- 8
test1.func <- function(x) {
  # outside range1 -> 4, inside range1 -> 1, NA -> 9, anything else -> -999
  val <- if (!is.na(x) & is.numeric(x) & (x < min1 | x > max1)) {
    num = 4
  } else if (!is.na(x) & is.numeric(x) & x >= min1 & x <= max1) {
    num = 1
  } else if (is.na(x)) { # TODO it would be better to make this just what is already present in the row
    num = 9
  } else {
    num = -999
  }
  val
}
Test1 <- function(x) {
i <- NA
for(i in seq(from = 1, to = ncol(x), by = 2)){
x[, i + 1] <- sapply(x[[i]], test1.func)
}
x
}
df_result <- Test1(df)
> df_result
A A.c B B.c
1 -2 4 2.2 1
2 3 1 -55.0 4
3 5 1 3.0 1
4 10 4 NA 9
5 NA 9 99.0 4
The next loop and conditional (any existing values of 4 or 9 would remain):
min2 <- 3
max2 <- 5
test2.func <- function(x) {
  # no else branch: returns NULL when the condition is not met
  val <- if (!is.na(x) & is.numeric(x) & (x < min2 | x > max2)) {
    num = 3
  }
  val
}
Test2 <- function(x) {
i <- NA
for(i in seq(from = 1, to = ncol(x), by = 2)){
x[, i + 1] <- sapply(x[[i]], test2.func)
}
x
}
df_result2 <- Test2(df_result)
# Only 2.2 matches, if working correctly would output
> df_result2
A A.c B B.c
1 -2 4 2.2 3
2 3 1 -55.0 4
3 5 1 3.0 1
4 10 4 NA 9
5 NA 9 99.0 4
The current code gives a warning, since there is only one match:
Warning messages:
1: In `[<-.data.frame`(`*tmp*`, , i + 1, value = list(3, NULL, NULL, :
provided 5 variables to replace 1 variables
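The warning comes from the if without an else: for values that do not match, the function returns NULL, so sapply gives back a list instead of a vector. One way to keep the existing flag is to update only the matching positions. A sketch, where flag_if_outside and its keep argument are hypothetical names and not part of the original code:

```r
# Set the flag to 3 where the value lies outside [lo, hi],
# but leave flags listed in `keep` (here 4 and 9) and all other rows as they are.
flag_if_outside <- function(value, flag, lo, hi, keep = c(4, 9)) {
  hit <- !is.na(value) & (value < lo | value > hi) & !(flag %in% keep)
  flag[hit] <- 3
  flag
}

B   <- c(2.2, -55, 3, NA, 99)
B.c <- c(1, 4, 1, 9, 4)
B.c <- flag_if_outside(B, B.c, lo = 3, hi = 5)
B.c
# [1] 3 4 1 9 4
```

Because the function works on whole vectors, it can be applied to each value/flag column pair without sapply.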
Some thoughts.
for loops are not necessary, it is better to capitalize on R's vectorized operations;
it appears that your values of 4 and 3 are really something like "outside band 1" and "outside band 2", in which case this can be resolved in one function.
Testing for == "NA" is a bit off ... if one of the values in a column is a string "NA" (and not R's NA value), then all values in that column are strings and you have other problems. Because of this, I don't explicitly check for is.numeric, though it is not hard to work back in.
Try this:
func <- function(x, range1, range2) {
ifelse(is.na(x), 9L,
ifelse(x < range1[1] | x > range1[2], 4L,
ifelse(x < range2[1] | x > range2[2], 3L,
1L)))
}
df[,c("A.c", "B.c")] <- lapply(df[,c("A", "B")], func, c(0, 8), c(3, 5))
df
# A A.c B B.c
# 1 -2 4 2.2 3
# 2 3 1 -55.0 4
# 3 5 1 3.0 1
# 4 10 4 NA 9
# 5 NA 9 99.0 4
One problem I have with this is that it uses three nested ifelse calls. While this works fine, it can be difficult to trace and troubleshoot (and ifelse has problems of its own). If you have other conditions to incorporate, it might be nice to use dplyr::case_when.
func2 <- function(x, range1, range2) {
dplyr::case_when(
is.na(x) ~ 9L,
x < range1[1] | x > range1[2] ~ 4L,
x < range2[1] | x > range2[2] ~ 3L,
TRUE ~ 1L
)
}
I find this second method much easier to read, though it does have the added dependency of dplyr (which, while it definitely has advantages and strengths, includes an army of other dependencies). If you are already using any of the tidyverse packages in your workflow, though, this is likely the better solution.
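If the extra dependency is a concern, a base R sketch that avoids nested ifelse is to start from the default flag and overwrite it step by step, with later assignments taking priority. func3 is a hypothetical name; it is meant to mirror func above, not to replace it:

```r
func3 <- function(x, range1, range2) {
  out <- rep(1L, length(x))                        # default: inside both ranges
  out[which(x < range2[1] | x > range2[2])] <- 3L  # outside the narrow range
  out[which(x < range1[1] | x > range1[2])] <- 4L  # outside the wide range overrides 3
  out[is.na(x)] <- 9L                              # missing values
  out
}

df <- data.frame(A = c(-2, 3, 5, 10, NA), B = c(2.2, -55, 3, NA, 99))
df$A.c <- func3(df$A, c(0, 8), c(3, 5))
df$B.c <- func3(df$B, c(0, 8), c(3, 5))
df$A.c
# [1] 4 1 1 4 9
df$B.c
# [1] 3 4 1 9 4
```

The order of the assignments encodes the priority, so adding another condition means adding one line in the right place.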

How do I code 2 separate categorical variables into a single one in R?

I have two continuous variables that I each dummy coded into a categorical variable with 2 levels. Each of these variables is coded either 0 or 1 for low and high levels of the variable. Both variables were z-scored so that I know whether a value falls below or above the mean.
MeanAboveAvo <- ifelse(Dataframeforstudy2$avo < 0, 0, 1)
MeanAboveAnx <- ifelse(Dataframeforstudy2$anx < 0, 0 , 1)
My question is how do I dummy code these two variables together? I want to create a single variable with 4 different levels using these two variables (MeanAboveAvo & MeanAboveAnx). I want a single variable that is coded with either 1,2,3,4 and the 1 is (0,0), 2 is (0,1), 3 is (1,0) and 4 is (1,1).
My code is this:
stats <- while(MeanAboveAnx = 0 || MeanAboveAvx = 1) {
if(MeanAboveAnx = 0 & MeanAboveAvo = 0 ){
1
}
else if (MeanAboveAnx = 0 & MeanAboveAvo = 1){
2
}
else if(MeanAboveAnx = 1 & MeanAboveAvo = 0){
3
}
else {
4
}}
It is not coding it at all and I am getting an error message. What can I do differently to get the results I want?
Thank you for your help in advance!
Base R has the function interaction precisely for this type of problem. The code below could be a one-liner; I leave it like this to make it clearer.
f <- with(df, interaction(anx, avo, lex.order = TRUE))
as.integer(f)
# [1] 1 2 1 1 2 3 3 3 4 2
Edit.
I was using the data in TomasIsCoding's answer; here is a solution closer to the question's problem, with anx and avo as z-scores. Thanks to @KonradRudolph for his comment.
f <- with(df, interaction(as.integer(anx < 0),
as.integer(avo < 0),
lex.order = TRUE))
f
# [1] 1.1 0.1 0.1 1.0 0.0 0.1 1.1 1.1 1.1 1.0
#Levels: 0.0 0.1 1.0 1.1
as.integer(f)
# [1] 4 2 2 3 1 2 4 4 4 3
Data.
set.seed(1234)
df <- data.frame(anx = rnorm(10), avo = rnorm(10))
Categorical variables in R don’t need to be numeric (and making them so has several drawbacks!): there’s consequently no need for your ifelse:
MeanAboveAvo <- Dataframeforstudy2$avo >= 0
MeanAboveAnx <- Dataframeforstudy2$anx >= 0
Next, the code using these encodings contains multiple mistakes:
It’s not clear what the while here is supposed to mean.
All = signs need to be converted to == because you’re performing comparisons.
if, unlike ifelse, isn’t vectorised so you cannot use it to assign its result to a vector of length > 1.
If I understand you correctly, then the following is one (canonical) way of encoding the stats:
stats <- paste(MeanAboveAvo, MeanAboveAnx)
This converts the logical vectors into character vectors and concatenates them element-wise. Once again, it is unnecessary (and unconventional!) in R to convert these categories into a numeric variable; though it may make sense to convert it to a factor via as.factor.
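If numeric codes 1 to 4 are still needed, one option is to make the factor's level order explicit so that the integer codes follow the mapping in the question. A sketch with made-up z-scores; the vectors avo and anx and the names above_avo and above_anx are assumptions for illustration:

```r
avo <- c(-0.5, 0.3, -1.2, 0.8)
anx <- c(-0.1, -0.7, 0.4, 1.1)
above_avo <- avo >= 0
above_anx <- anx >= 0

# The level order fixes the codes for (anx, avo):
# (FALSE, FALSE) -> 1, (FALSE, TRUE) -> 2, (TRUE, FALSE) -> 3, (TRUE, TRUE) -> 4
stats <- factor(paste(above_anx, above_avo),
                levels = c("FALSE FALSE", "FALSE TRUE", "TRUE FALSE", "TRUE TRUE"))
as.integer(stats)
# [1] 1 2 3 4
```

Without the levels argument, factor sorts the labels alphabetically, which happens to give the same order here but is easy to break once the labels change.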
Given the mapping rule for coding anx and avo, you don't actually need a while loop, since your mapping is a shifted binary-to-decimal conversion. In this case, you can do it as below:
df <- within(df,code <- 2*anx + avo + 1)
such that
> df
anx avo code
1 0 0 1
2 0 1 2
3 0 0 1
4 0 0 1
5 0 1 2
6 1 0 3
7 1 0 3
8 1 0 3
9 1 1 4
10 0 1 2
Dummy Data
df <- structure(list(anx = c(0L, 0L, 0L, 0L, 0L, 1L, 1L, 1L, 1L, 0L
), avo = c(0L, 1L, 0L, 0L, 1L, 0L, 0L, 0L, 1L, 1L)), class = "data.frame", row.names = c(NA,
-10L))
Try this:
as.integer(factor(paste0(MeanAboveAvo, MeanAboveAnx)))
For example:
set.seed(123)
x <- sample(0:1, 10, T) # [1] 0 0 0 1 0 1 1 1 0 0
y <- sample(0:1, 10, T) # [1] 1 1 1 0 1 0 1 0 0 0
as.integer(factor(paste0(x, y)))
# [1] 2 2 2 3 2 3 4 3 1 1

Dispatch values in list column to separate columns

I have a data.table with a list column "c":
df <- data.table(a = 1:3, c = list(1L, 1:2, 1:3))
df
a c
1: 1 1
2: 2 1,2
3: 3 1,2,3
I want to create separate columns for the values in "c".
I create a set of new columns F_1, F_2, F_3:
mmax <- max(df$a)
flux <- paste("F", 1:mmax, sep = "_")
df[, (flux) := 0]
df
a c F_1 F_2 F_3
1: 1 1 0 0 0
2: 2 1,2 0 0 0
3: 3 1,2,3 0 0 0
I want to dispatch values in "c" to columns F_1, F_2, F_3 like this:
df
a c F_1 F_2 F_3
1: 1 1 1 0 0
2: 2 1,2 1 2 0
3: 3 1,2,3 1 2 3
What I have tried:
comp_vect <- function(vec, mmax){
vec <- vec %>% unlist()
n <- length(vec)
answr <- c(vec, rep(0, l = mmax -n))
}
df[ , ..flux := mapply(comp_vect, c, mmax)]
The expected data.table is :
> df
a c F_1 F_2 F_3
1: 1 1 1 0 0
2: 2 1,2 1 2 0
3: 3 1,2,3 1 2 3
I followed a radically different approach. I rbinded the list column and then dcasted it, obtaining the desired result. Last part is to set the names.
library(data.table)
df <- data.table(a = 1:3, d = list(1L, c(1L, 2L), c(1L, 2L, 3L)))
df2 <- df[, rbind(d), by = a][, dcast(.SD, a ~ V1, fill = 0)]
setnames(df2, 2:4, flux)[]
a F_1 F_2 F_3
1: 1 1 0 0
2: 2 1 2 0
3: 3 1 2 3
where flux is the variable of names that you defined in your question.
Please note that I avoided using the column name c, as it may be confused with the function c().
Solution:
for(idx in seq(max(sapply(df$c, length)))){ # maximum number of values according to all the elements of the list
set(x = df,
i = NULL,
j = paste0("F_",idx), # column's name
value = sapply(df$c, function(x){
if(is.na(x[idx])){
return(0) # 0 instead of NA
} else {
return(x[idx])
}
})
)
}
Explanation:
We can extract the values from a list like this:
sapply(df$c, function(ll) return(ll[1])) # first value
[1] 1 1 1
sapply(df$c, function(ll) return(ll[2])) # second value
[1] NA 2 2
sapply(df$c, function(ll) return(ll[3])) # third value
[1] NA NA 3
We see that if there is no value at a position, we get NA.
We need an iterator to extract all values at position idx. For that, we find the number of values in each element of df$c (the list) and keep the maximum.
max(sapply(df$c, length))
[1] 3
If we want zeros instead of NAs, we need a function inside the sapply to convert them:
vec <- c(NA, 5, 1, NA)
> sapply(vec, function(x) if(is.na(x)) return(0) else return(x))
[1] 0 5 1 0
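The same idea can be written without the explicit loop by building one padded vector per position and assigning them all at once with :=. A sketch, assuming every element of the list column is an integer vector; dt is used as the table name here to avoid reusing df:

```r
library(data.table)

dt <- data.table(a = 1:3, c = list(1L, 1:2, 1:3))
mmax <- max(lengths(dt$c))                    # longest element of the list column
flux <- paste("F", seq_len(mmax), sep = "_")  # "F_1" "F_2" "F_3"

# For each position i, take the i-th value of every element, or 0 if it is missing
padded <- lapply(seq_len(mmax), function(i) {
  vapply(dt$c, function(v) if (i <= length(v)) v[i] else 0L, integer(1))
})

# Assign all new columns by reference in one step
dt[, (flux) := padded]
```

vapply with integer(1) fails loudly if an element is not an integer, which is a deliberate choice here; for numeric list columns, numeric(1) and 0 would be the counterparts.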

if function comes across 0, do nothing in R

I have this code:
df[, -1] = apply(df[, -1], 2, function(x){x * log(x)})
df looks like:
sample a b c
a2 2 1 2
a3 3 0 45
The problem I am having is that some of my values in df are 0. You cannot take ln(0). So I would like to tell my program to return 0 whenever it would try to take ln(0).
You could use ifelse here:
df[,-1] = apply(df[,-1], 2, function(x){ ifelse(x != 0, x*log(x), 0) })
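An equivalent sketch without apply computes x * log(x) for a whole column and then overwrites the positions where x is 0. xlogx is a hypothetical helper name, and the data frame is rebuilt with numeric columns for the example:

```r
xlogx <- function(x) {
  out <- x * log(x)        # NaN where x is 0, because 0 * -Inf is NaN
  out[which(x == 0)] <- 0  # replace exactly those positions with 0
  out
}

df <- data.frame(sample = c("a2", "a3"), a = c(2, 3), b = c(1, 0), c = c(2, 45))
df[-1] <- lapply(df[-1], xlogx)
df$b
# [1] 0 0
```

Using lapply on the data frame keeps each column's type, whereas apply first converts the selected columns to a matrix.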
You can take advantage of floating point precision by adding to x a tiny amount that is smaller than the floating point error. log(0) is -Inf, so 0 * log(0) is NaN, but the log of a tiny positive number such as 0.000...000223 is finite (roughly -708), so 0 multiplied by it is 0. The results for other numbers change only by amounts smaller than the floating point error, meaning for practical purposes not at all.
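A quick check of the three facts this relies on, as a sketch; the variable names are chosen here for illustration:

```r
tiny <- .Machine$double.xmin     # smallest positive normalized double, about 2.2e-308

at_zero   <- 0 * log(0)          # NaN, because log(0) is -Inf
with_tiny <- 0 * log(0 + tiny)   # 0, because log(tiny) is finite
unchanged <- (2 + tiny) == 2     # TRUE: the offset is lost in rounding

c(is.nan(at_zero), with_tiny == 0, unchanged)
# [1] TRUE TRUE TRUE
```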
Avoiding the iteration and using .Machine$double.xmin for a very, very small number,
df <- data.frame(sample = c("a2", "a3"),
a = 2:3,
b = c(1L, 0L),
c = c(2L, 45L))
df
#> sample a b c
#> 1 a2 2 1 2
#> 2 a3 3 0 45
df[-1] <- df[-1] * log(df[-1] + .Machine$double.xmin)
df
#> sample a b c
#> 1 a2 1.386294 0 1.386294
#> 2 a3 3.295837 0 171.299812
To check the results, let's use another approach, changing 0 values to 1 so that they return 0 (since 1 * log(1) is 0):
df2 <- data.frame(sample = c("a2", "a3"),
a = 2:3,
b = c(1L, 0),
c = c(2L, 45L))
df2[df2 == 0] <- 1
df2[-1] <- df2[-1] * log(df2[-1])
df2
#> sample a b c
#> 1 a2 1.386294 0 1.386294
#> 2 a3 3.295837 0 171.299812
Because the change is less than floating point error, the results are identical according to R:
identical(df, df2)
#> [1] TRUE

Get rownames associated with each column based on binary matrix in R

I have a matrix with binary data representing whether each column field is relevant to each row element. I'm looking to create a two column dataframe identifying the name of each field associated with each row. How can I do this in R?
Here is an example of what I'm starting with:
A B C
W 1 1 0
X 0 1 1
Y 1 1 1
Z 0 1 1
And I'm looking to end up with this:
Element | Relevant Field
W|A
W|B
X|B
X|C
Y|A
Y|B
Y|C
Z|B
Z|C
Any hints? Thanks!
If your starting value is a matrix like this
mm <- matrix(c(1L, 0L, 1L, 0L, 1L, 1L, 1L, 1L, 0L, 1L, 1L, 1L),
ncol=3, dimnames = list(c("W", "X", "Y", "Z"), c("A", "B", "C")))
You can treat it like a table and unroll the data fairly easily
subset(as.data.frame(as.table(mm)), Freq>0)
# Var1 Var2 Freq
# 1 W A 1
# 3 Y A 1
# 5 W B 1
# 6 X B 1
# 7 Y B 1
# 8 Z B 1
# 10 X C 1
# 11 Y C 1
# 12 Z C 1
We can use base R methods (here m1 is the input matrix)
data.frame(Element = rep(rownames(m1), each = ncol(m1)),
Relevant_Field = rep(colnames(m1), nrow(m1)))[as.vector(t(m1))!=0,]
Or with CJ
library(data.table)
CJ(Element = row.names(m1), Relevant_Field = colnames(m1))[as.vector(t(m1)!=0)]
# Element Relevant_Field
#1: W A
#2: W B
#3: X B
#4: X C
#5: Y A
#6: Y B
#7: Y C
#8: Z B
#9: Z C
Or as @Frank suggested, we can melt (using reshape2) to a three column dataset, convert to data.table and remove the 0 values
library(reshape2)
setDT(melt(m1))[ value == 1 ][, value := NULL][]
Here is another base R method that uses which and subsetting.
# get the positions of 1s in matrix (row / column) output
posMat <- which(mm==1, arr.ind=TRUE)
# build the data.frame
myDf <- data.frame(rowVals=rownames(mm)[posMat[, 1]],
colVals=colnames(mm)[posMat[, 2]])
or other structures...
# matrix
myMat <- cbind(rowVals=rownames(mm)[posMat[, 1]],
colVals=colnames(mm)[posMat[, 2]])
# vector with pipe separator
myVec <- paste(rownames(mm)[posMat[, 1]], colnames(mm)[posMat[, 2]], sep="|")
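Note that which returns positions in column-major order, so the rows come out grouped by field rather than by element. A sketch that reorders them to match the layout requested in the question, reusing the mm matrix from the earlier answer; pos and res are names chosen here for illustration:

```r
mm <- matrix(c(1L, 0L, 1L, 0L, 1L, 1L, 1L, 1L, 0L, 1L, 1L, 1L),
             ncol = 3, dimnames = list(c("W", "X", "Y", "Z"), c("A", "B", "C")))

# positions of the 1s, as a two-column matrix with columns "row" and "col"
pos <- which(mm == 1, arr.ind = TRUE)

# sort by row first, then by column, to group the output by element
pos <- pos[order(pos[, "row"], pos[, "col"]), ]

res <- data.frame(Element = rownames(mm)[pos[, "row"]],
                  Relevant_Field = colnames(mm)[pos[, "col"]])
paste(res$Element, res$Relevant_Field, sep = "|")
# [1] "W|A" "W|B" "X|B" "X|C" "Y|A" "Y|B" "Y|C" "Z|B" "Z|C"
```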

Resources