compare current cell and previous cell in excel style without loop - r

I want to create an indicator variable by comparing the current value of a variable with the previous value. The logic is: if current value == previous value, the indicator is 1, otherwise 0. The first indicator value is NA because there is nothing to compare it with.
It needs to be fast because my data contain many groups to compare (I left the group variable out for simplicity).
> dt<-c('a','a','a','b','a','a','c','c')
> indicator
[1] NA 1 1 0 0 1 0 1

Using base R, you can drop the last element with head() and the first element with tail(), compare the two shifted vectors, and then add the NA to the front.
c(NA, as.numeric(head(dt, -1) == tail(dt, -1)))
If dt were a vector of numbers, you could use diff like
dn <- c(1,1,1,2,1,1,3,3)
c(NA, (diff(dn)==0)+0)
(using +0 rather than as.numeric to make the booleans 1's and 0's.)
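Since the question mentions groups, here is a minimal sketch of how the same shifted comparison might be extended to grouped data without a loop. The group vector g is hypothetical (the question does not show it), and the sketch assumes the rows are already sorted so that each group is contiguous.

```r
dt <- c('a','a','a','b','a','a','c','c')
g  <- c(1, 1, 1, 1, 2, 2, 2, 2)   # hypothetical group ids, assumed contiguous

# Compare each element with its predecessor across the whole vector
indicator <- c(NA, as.numeric(head(dt, -1) == tail(dt, -1)))

# The first row of every group has no predecessor within its group
group_start <- c(TRUE, head(g, -1) != tail(g, -1))
indicator[group_start] <- NA

indicator
# [1] NA  1  1  0 NA  1  0  1
```

Everything stays vectorized, so the cost does not grow with the number of groups.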

You can use Lag from the Hmisc package, dropping the first comparison with [-1] and adding NA at the beginning.
library(Hmisc)
c(NA, as.numeric(dt== Lag(dt))[-1])
#[1] NA 1 1 0 0 1 0 1
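If adding a package is not an option, the lagged vector can be built by hand. This is a sketch of a base R stand-in, not how Hmisc implements Lag; the helper name lag1 is made up for illustration.

```r
dt <- c('a','a','a','b','a','a','c','c')

# Hypothetical helper: shift a vector one position to the right, padding with NA
lag1 <- function(x) c(NA, head(x, -1))

# Comparing with NA yields NA, so the first element needs no special handling
indicator <- as.numeric(dt == lag1(dt))
indicator
# [1] NA  1  1  0  0  1  0  1
```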

You could also use rle in base R:
v <- rle(dt)[[1]]
x <- rep(1:length(v),v)
indicator <- c(NA, (diff(x)==0)*1)
#[1] NA 1 1 0 0 1 0 1
v: the length of each run of identical consecutive values
x: a run number for each element of dt, giving a numeric vector that diff can work on
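To make the intermediate values concrete, here is the same idea spelled out step by step. It is only a sketch: the extra variable d and the use of seq_along() instead of 1:length() are my additions.

```r
dt <- c('a','a','a','b','a','a','c','c')

v <- rle(dt)$lengths          # run lengths: 3 1 2 2
x <- rep(seq_along(v), v)     # run id per element: 1 1 1 2 3 3 4 4
d <- diff(x)                  # 0 inside a run, 1 at a boundary: 0 0 1 1 0 1 0

indicator <- c(NA, (d == 0) * 1)
indicator
# [1] NA  1  1  0  0  1  0  1
```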

Related

Using row number to create a 0/1 column in R

I want to create a new column in my dataset for when 'death_code' contains an 'I' (could be I001-I100) then it would return a 1, otherwise it would return a 0
death_code
I099
E045
T054
I065
I022
I have used grepl to search for rows in a variable which contain 'I' and saved the row numbers
rows<-which(grepl('I', fulldata$deathcode))
However, I now want to assign a 1 to these rows in a new column and I cannot work out how to do this.
This is what I anticipate the data to look like
death_code CVD_death
I099 1
E045 0
T054 0
I065 1
I022 1
Instead of using which, use as.integer on the grepl result - TRUE/FALSE will be converted to 1/0.
fulldata$CVD_death <- as.integer(grepl("I", fulldata$deathcode))
Alternately, you could do it with which by setting all values in the column to 0, and then setting the which values to 1:
fulldata$CVD_death <- 0
fulldata$CVD_death[which(grepl("I", fulldata$deathcode))] <- 1
Using stringr approach:
library(dplyr)
library(stringr)
df %>% mutate(CVD_death = case_when(str_detect(death_code, '^I\\d{3}') ~ 1, TRUE ~ 0))
# A tibble: 5 x 2
death_code CVD_death
<chr> <dbl>
1 I099 1
2 E045 0
3 T054 0
4 I065 1
5 I022 1
Another option is + to convert the logical to integer
fulldata$CVD_death <- +(grepl("I", fulldata$deathcode))
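One caveat worth illustrating: grepl("I", ...) matches an I anywhere in the code, whereas the question is about codes that start with I. The sketch below compares the two patterns on the question's codes plus one made-up code ("EI45") that is added purely for illustration.

```r
# The last code is hypothetical and only there to show the difference
death_code <- c("I099", "E045", "T054", "I065", "I022", "EI45")

anywhere <- as.integer(grepl("I", death_code))       # I at any position
at_start <- as.integer(grepl("^I", death_code))      # I as the first character
prefix   <- as.integer(startsWith(death_code, "I"))  # same test without a regex

anywhere
# [1] 1 0 0 1 1 1
at_start
# [1] 1 0 0 1 1 0
```

For the five codes shown in the question, both patterns give the same answer; the anchored version only matters if an I can appear later in a code.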

As.numeric function with binary data in R

I recently encountered a problem in R that I had not seen before. I have a data set with a dependent variable Accuracy which has only two values, "0" and "1". Previously, I used data$Accuracy=as.numeric(data$Accuracy) to turn these two levels into numbers, and it worked.
This time, however, when I did the same thing, the "0"s turned into 1s and the "1"s turned into 2s. Is this due to recent changes in R? How do I work around this issue?
Thanks!!
It could be that the column is of class factor. When we call as.numeric on a factor, we get the underlying integer codes, which start at 1, rather than the labels. In that case, we can convert to character and then to numeric
data$Accuracy <- as.numeric(as.character(data$Accuracy))
If it is a factor the manual recommends
as.numeric(levels(data$Accuracy))[data$Accuracy]
to transform it to approximately its original numeric values.
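A small self-contained illustration of the three conversions, using a made-up factor f in place of data$Accuracy:

```r
# Stand-in for data$Accuracy when it has been stored as a factor
f <- factor(c("0", "1", "1", "0"))

codes     <- as.numeric(f)                # integer codes, not the labels
via_char  <- as.numeric(as.character(f))  # labels converted to numbers
via_level <- as.numeric(levels(f))[f]     # convert levels once, then index

codes
# [1] 1 2 2 1
via_char
# [1] 0 1 1 0
via_level
# [1] 0 1 1 0
```

The levels-based form converts each distinct level once instead of every element, which can matter for long vectors with few levels.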
I guess there could be a problem with the data frame definition or with reading from a file. If the original data were only 0 and 1, data$Accuracy should be of class integer or numeric. But any non-numeric character in just one row turns the column into character, which becomes a factor column when stringsAsFactors is TRUE (the default before R 4.0.0). For example:
> zz<-data.frame(c(0, 0, 1, 1))
> zz
c.0..0..1..1.
1 0
2 0
3 1
4 1
> zz<-data.frame(c(0, 0, 1, 1, "")) # an empty string
> zz
c.0..0..1..1.....
1 0
2 0
3 1
4 1
5
> class(zz$c.0..0..1..1.....)
[1] "factor"
> zz<-data.frame(c(0, 0, 1, 1, NA)) # empty numeric data
> zz
c.0..0..1..1..NA.
1 0
2 0
3 1
4 1
5 NA
> class(zz$c.0..0..1..1..NA.)
[1] "numeric"
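The transcript above reflects the old default of stringsAsFactors = TRUE. The sketch below sets the argument explicitly so that it behaves the same on any R version; the column name Accuracy and the object names are chosen for illustration.

```r
# One empty string is enough to coerce the whole vector to character
raw <- c(0, 0, 1, 1, "")

zz_factor <- data.frame(Accuracy = raw, stringsAsFactors = TRUE)
zz_char   <- data.frame(Accuracy = raw, stringsAsFactors = FALSE)

class(zz_factor$Accuracy)
# [1] "factor"
class(zz_char$Accuracy)
# [1] "character"

# Going through character recovers the numbers; the empty string becomes NA
recovered <- as.numeric(as.character(zz_factor$Accuracy))
```

Checking class() right after reading a file is a quick way to spot this before any conversion is attempted.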

How can I use rowSums with conditions to return binary value?

Say I have a data frame with a column for summed data. What is the most efficient way to return a binary 0 or 1 in a new column if any value in columns a, b, or c is NOT zero? rowSums is fine for a total, but I also need a simple indicator of whether anything differs from a given value.
tt <- data.frame(a=c(0,-5,0,0), b=c(0,5,10,0), c=c(-5,0,0,0))
tt[, ncol(tt)+1] <- rowSums(tt)
This yields:
> tt
a b c V4
1 0 0 -5 -5
2 -5 5 0 0
3 0 10 0 10
4 0 0 0 0
The fourth column is a simple sum of the data in the first three columns. How can I add a fifth column that returns a binary 1/0 value if any value differs from a criterion set on the first three columns?
For example, is there a simple way to return a 1 if any of a, b, or c are NOT 0?
as.numeric(rowSums(tt != 0) > 0)
# [1] 1 1 1 0
tt != 0 gives us a logical matrix telling us where there are values not equal to zero in tt.
When the sum of each row is greater than zero (rowSums(tt != 0) > 0), we know that at least one value in that row is not zero.
Then we convert the result to numeric (as.numeric(.)) and we've got a binary vector result.
We can use Reduce
+(Reduce(`|`, lapply(tt, `!=`, 0)))
#[1] 1 1 1 0
One could also use the good old apply loop:
+apply(tt != 0, 1, any)
#[1] 1 1 1 0
The argument tt != 0 is a logical matrix whose entries state whether each value differs from zero. apply() with margin 1 then works row-wise to check whether any entry is TRUE. The prefix + converts the logical output to integer 0 or 1; it is similar in effect to as.integer().
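Because the question appends the total to tt before asking for the indicator, it may be safer to restrict the test to the original columns. A minimal sketch, with the column names total and any_nonzero chosen for illustration:

```r
tt <- data.frame(a = c(0, -5, 0, 0), b = c(0, 5, 10, 0), c = c(-5, 0, 0, 0))
cols <- c("a", "b", "c")   # only these columns take part in the test

tt$total       <- rowSums(tt[cols])
tt$any_nonzero <- as.integer(rowSums(tt[cols] != 0) > 0)

tt$total
# [1] -5  0 10  0
tt$any_nonzero
# [1] 1 1 1 0
```

Row 2 shows why the total alone is not enough: the values cancel to 0 even though two of them are nonzero.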

Check and replace column values in R dataframe

I have multiple files to read in using R. I iterate through the files in a loop, obtain dataframes and then try to change values of a particular column. Examples of the R dataframes are as follows:
df_A:
ID ZN
1 0
2 1
3 1
4 0
df_B:
ID ZN
1 2
2 1
3 1
4 2
As shown above, the column 'ZN' in some data frames may have 0's and 1's, while other data frames have 1's and 2's. As I iterate through the files, I want to change only the data frames whose ZN column has 1's and 2's, like this: 1 to 0 and 2 to 1. Data frames with ZN values of 0's and 1's should be left unchanged.
My attempt did not work:
if(dataframe$ZN > 1){
dataframe$ZN<-recode(dataframe$ZN,"1=0;2=1")
}
else{
dataframe$ZN
}
Any solutions please?
One approach might be to decrement the value of ZN by one if we detect a single value of 2 anywhere in the column:
if (max(df_A$ZN) == 2) {
df_A$ZN = df_A$ZN - 1
}
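The attempt in the question fails because if() needs a single TRUE or FALSE, while dataframe$ZN > 1 yields one value per row. Here is a sketch of how the check above might be wrapped for use in the file loop. The helper name shift_zn and the list dfs are hypothetical, and the sketch assumes that a 1/2-coded file always contains at least one 2.

```r
# Hypothetical helper: shift ZN down by one only when a 2 is present
shift_zn <- function(df) {
  if (any(df$ZN == 2, na.rm = TRUE)) {
    df$ZN <- df$ZN - 1
  }
  df
}

df_A <- data.frame(ID = 1:4, ZN = c(0, 1, 1, 0))
df_B <- data.frame(ID = 1:4, ZN = c(2, 1, 1, 2))

# Stand-in for the data frames obtained while iterating over the files
dfs <- lapply(list(A = df_A, B = df_B), shift_zn)

dfs$A$ZN
# [1] 0 1 1 0
dfs$B$ZN
# [1] 1 0 0 1
```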
If the column holds only the two values 0 and 1, the following maps 0 to 2 and keeps 1 as 1:
df_A$ZN <- (df_A$ZN==0) + 1
df_A$ZN
#[1] 2 1 1 2
Or using case_when for multiple values
library(dplyr)
df_A %>%
mutate(ZN = case_when(ZN==0 ~2, TRUE ~ 1))

creating a new column with logical values if value in another column > 0

New to R, so bear with me. I have a dataframe with values
x y
1 2
4 4
5 3
6 0
I want to create a third column that indicates with TRUE or FALSE whether the value in column y is greater than 0.
x y z
1 2 TRUE
4 4 TRUE
5 3 TRUE
6 0 FALSE
The > compares the lhs and rhs to get a logical vector. By assigning the output as a new column ('z'), we create the new column in the original dataset 'df1'.
df1$z <- df1$y > 0
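For completeness, a self-contained version with the question's data; the as.integer() line is an optional extra for readers who prefer a 0/1 indicator over a logical one.

```r
df1 <- data.frame(x = c(1, 4, 5, 6), y = c(2, 4, 3, 0))

# The comparison is vectorized, so no loop or if-else is needed
df1$z <- df1$y > 0
df1$z
# [1]  TRUE  TRUE  TRUE FALSE

z01 <- as.integer(df1$z)
z01
# [1] 1 1 1 0
```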
You can also first create the column with a default value, which avoids an if-else construct.
Something like this could work as well (though the solution proposed above is of course better):
df$z <- "False"
df$z[df$y > 0] <- "True"
Drop the quotes (use FALSE and TRUE) if you want a logical variable rather than a string.
An alternative solution that is faster (although this only becomes a critical issue when working with large datasets)
library(data.table)
setDT(df1)
df1[, z := y > 0]
