Create a new variable based on other columns values - r

I have a panel-data data frame, something like this:
df <- data.frame("id" = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3),
"Status_2014" = c(1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0),
"Status_2015" = c(0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0),
"Status_2016" = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0))
I want to generate a new dummy variable that takes the value 1 if a row contains a 1 in any of the three status columns, and 0 otherwise. It should end up like this:
df <- data.frame("id" = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3),
"Status_2014" = c(1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0),
"Status_2015" = c(0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0),
"Status_2016" = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0),
"Final_status" = c(1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0))
Can anyone help me achieve this?

We can use if_any on the columns that start with 'Status' (selected with starts_with) to check for any 1 in a row. It returns TRUE if there is one and FALSE otherwise; the logical result is then coerced to 0/1 with as.integer or unary +.
library(dplyr)
df %>%
  mutate(Final_status = +(if_any(starts_with('Status'), ~ . == 1)))
-output
id Status_2014 Status_2015 Status_2016 Final_status
1 1 1 0 0 1
2 1 1 0 0 1
3 1 1 0 0 1
4 1 1 0 0 1
5 2 0 1 0 1
6 2 0 1 0 1
7 2 0 1 0 1
8 2 0 1 0 1
9 3 0 0 0 0
10 3 0 0 0 0
11 3 0 0 0 0
12 3 0 0 0 0
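As a small illustration of the coercion step mentioned above, the following sketch shows that unary + and as.integer give the same result on a logical vector. The name flag is only a placeholder for this example.

```r
# A logical vector such as the one if_any() returns
flag <- c(TRUE, FALSE, TRUE)

plus_version <- +flag              # unary plus coerces logical to integer
int_version  <- as.integer(flag)   # explicit coercion

identical(plus_version, int_version)
```

as.integer is arguably easier to read; the + form is just shorter.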
Or using rowSums from base R
df$Final_status <- +(rowSums(df[-1] > 0) > 0)
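The rowSums line above relies on id being the first column and on every other column being a status column. A slightly more defensive sketch, assuming the status columns can be recognised by the prefix Status_, selects them by name instead. The shortened data frame and the helper name status_cols are made up for this example.

```r
# Shortened version of the question's data
df <- data.frame(id = c(1, 1, 2, 2, 3, 3),
                 Status_2014 = c(1, 1, 0, 0, 0, 0),
                 Status_2015 = c(0, 0, 1, 1, 0, 0),
                 Status_2016 = c(0, 0, 0, 0, 0, 0))

# Pick the status columns by name rather than by position
status_cols <- grep("^Status_", names(df), value = TRUE)

# 1 if any status column equals 1 in that row, otherwise 0
df$Final_status <- as.integer(rowSums(df[status_cols] == 1) > 0)
df$Final_status
```

Because Final_status does not match the prefix, re-running the last line does not feed the new column back into the sum.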

You can write an if condition that sets the variable to 1 or 0; the most straightforward way to do that is inside a dplyr pipe.
I don't have the dplyr syntax in my head (it has been too long since I used it), but dplyr is what you want.
https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf
best greetings

Related

How do I merge two vectors of same length into one vector that has the same length as well in R

I have two vectors like this:
vec1<-c(0, 0, 1, 1, 0, 0, 0, 0, 0, 0)
vec2<-c(0, 0, 0, 1, 0, 0, 1, 0, 1, 0)
I want to merge it somehow to turn it into this:
vec<-c(0, 0, 1, 1, 0, 0, 1, 0, 1, 0)
Is there any way to do this?
We can use pmax
pmax(vec1, vec2)
[1] 0 0 1 1 0 0 1 0 1 0
or with |
+(vec1|vec2)
Here is a solution with ifelse:
ifelse(vec1==0 & vec2==0, 0, 1)
[1] 0 0 1 1 0 0 1 0 1 0
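For 0/1 input the three answers above should agree. A quick check, with variable names chosen only for this sketch, could look like this:

```r
vec1 <- c(0, 0, 1, 1, 0, 0, 0, 0, 0, 0)
vec2 <- c(0, 0, 0, 1, 0, 0, 1, 0, 1, 0)

by_pmax   <- pmax(vec1, vec2)                       # element-wise maximum
by_or     <- +(vec1 | vec2)                         # logical OR, coerced to integer
by_ifelse <- ifelse(vec1 == 0 & vec2 == 0, 0, 1)    # 0 only when both are 0

all(by_pmax == by_or & by_or == by_ifelse)
```

The results differ only in type (the | version is integer, the other two are double). If the vectors could hold values other than 0 and 1, pmax would return the larger value rather than 1, so the choice depends on the data.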

How to remove specific (side-by-side) duplicates in r?

Suppose I have the following vector:
l1 = c(0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1)
and I only want to keep the "FIRST new 1" of each run, that is, my desired outcome for the above vector is:
l1 = c(0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1)
I tried to shift and subtract the lists, whatever is not 1, set to 0; but this way doesn't work.
You may try (base R way)
x <- c(0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1)
y <- rle(x)
z<- cumsum(y$lengths)[y$values == 0] + 1
w <- rep(0, length(x))
w[z] <- 1
w
[1] 0 0 1 0 0 0 1 0 0 0 0 1 0 0 0 0 1
dplyr way
library(dplyr)
library(data.table)  # for rleid()
x <- data.frame(
  l1 = c(0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1)
)
x %>%
  mutate(y = rleid(l1)) %>%   # run id: consecutive equal values share an id
  group_by(y) %>%
  # within each run of 1s, keep the first element and set the rest to 0
  mutate(l1 = ifelse(l1 == 1 & row_number() > 1, 0, l1)) %>%
  ungroup() %>%
  select(-y) %>%
  pull(l1)
[1] 0 0 1 0 0 0 1 0 0 0 0 1 0 0 0 0 1
Clumsy way
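# Plain loop: keep a 1 only if it starts a new run of 1s
x <- c(0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1)
in_run <- FALSE
for (i in seq_along(x)) {
  if (x[i] == 1) {
    if (in_run) x[i] <- 0   # a later 1 in the same run
    in_run <- TRUE
  } else {
    in_run <- FALSE         # a 0 ends the current run
  }
}
x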
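Another possible approach, sketched here as an alternative to the answers above rather than a correction of them, compares each element with its predecessor: a 1 is kept only when the previous element is not 1. The names prev and w are chosen for this example.

```r
x <- c(0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1)

# Previous element of x; the first element is treated as preceded by 0
prev <- c(0, head(x, -1))

# Keep a 1 only where a run of 1s starts
w <- as.integer(x == 1 & prev != 1)
w
```

This also covers a vector that starts with 1 or ends with 0, since it never indexes past the ends of x.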

Create new columns using across() and if_else()

I have survey data that has a binary 1, 0 (indicating peak or off-peak) variable with the related peak or off-peak numbers in two separate columns.
structure(list(q9_jul_2019 = c(1, 0, 1, 0, 1, 0), q9_aug_2019 = c(1,
0, 1, 0, 1, 0), q9_sep_2019 = c(1, 0, 1, 0, 1, 0), q9_oct_2019 = c(0,
0, 1, 0, 1, 0), q9_nov_2019 = c(0, 0, 1, 0, 1, 0), q9_dec_2019 = c(0,
0, 1, 0, 0, 0), q9_jan_2020 = c(0, 0, 1, 0, 0, 0), q9_feb_2020 = c(0,
1, 0, 1, 0, 0), q9_mar_2020 = c(1, 1, 0, 1, 0, 0), q9_apr_2020 = c(1,
1, 1, 1, 0, 1), q9_may_2020 = c(0, 1, 0, 0, 0, 0), q9_jun_2020 = c(0,
0, 0, 0, 0, 0), q15 = c(1, 10, 30, 0, 2, 0), q22 = c(0, 10, 6,
0, 0, 0)), row.names = c(NA, 6L), class = "data.frame")
I have created new monthly columns that have the associated visitation numbers in that column but I'm sure there must be a neater way to do it using across(). I haven't been able to make it work though, so at the moment I'm stuck at the following:
survey <- survey %>%
mutate(visitation_jul_19 = if_else(q9_jul_2019 == 1, q15, q22),
visitation_aug_19 = if_else(q9_aug_2019 == 1, q15, q22),
visitation_sep_19 = if_else(q9_sep_2019 == 1, q15, q22),
visitation_oct_19 = if_else(q9_oct_2019 == 1, q15, q22),
visitation_nov_19 = if_else(q9_nov_2019 == 1, q15, q22),
visitation_dec_19 = if_else(q9_dec_2019 == 1, q15, q22),
visitation_jan_20 = if_else(q9_jan_2020 == 1, q15, q22),
visitation_feb_20 = if_else(q9_feb_2020 == 1, q15, q22),
visitation_mar_20 = if_else(q9_mar_2020 == 1, q15, q22),
visitation_apr_20 = if_else(q9_apr_2020 == 1, q15, q22),
visitation_may_20 = if_else(q9_may_2020 == 1, q15, q22),
visitation_jun_20 = if_else(q9_jun_2020 == 1, q15, q22))
You may try
library(dplyr)
survey %>%
mutate(across(q9_jul_2019:q9_jun_2020, ~ ifelse(.x == 1, q15, q22)))
q9_jul_2019 q9_aug_2019 q9_sep_2019 q9_oct_2019 q9_nov_2019 q9_dec_2019 q9_jan_2020 q9_feb_2020 q9_mar_2020 q9_apr_2020
1 1 1 1 0 0 0 0 0 1 1
2 10 10 10 10 10 10 10 10 10 10
3 30 30 30 30 30 30 30 6 6 30
4 0 0 0 0 0 0 0 0 0 0
5 2 2 2 2 2 0 0 0 0 0
6 0 0 0 0 0 0 0 0 0 0
q9_may_2020 q9_jun_2020 q15 q22
1 0 0 1 0
2 10 10 10 10
3 6 6 30 6
4 0 0 0 0
5 0 0 2 0
6 0 0 0 0
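The call above overwrites the q9_ columns. If the original indicators should be kept and the results written to new columns, the .names argument of across() can be used. The sketch below assumes only two of the monthly columns for brevity, and it produces names such as visitation_jul_2019 rather than the visitation_jul_19 form used in the question.

```r
library(dplyr)

# Two of the question's indicator columns plus q15/q22, for brevity
survey <- data.frame(
  q9_jul_2019 = c(1, 0, 1, 0, 1, 0),
  q9_oct_2019 = c(0, 0, 1, 0, 1, 0),
  q15 = c(1, 10, 30, 0, 2, 0),
  q22 = c(0, 10, 6, 0, 0, 0)
)

out <- survey %>%
  # write results to new columns instead of replacing the q9_ ones
  mutate(across(starts_with("q9_"),
                ~ if_else(.x == 1, q15, q22),
                .names = "visitation_{.col}")) %>%
  # drop the q9_ part from the new column names
  rename_with(~ sub("^visitation_q9_", "visitation_", .x),
              starts_with("visitation_"))

names(out)
```

The rename_with step is optional; without it the new columns are simply called visitation_q9_jul_2019 and so on.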

Split comma- and pound-separated strings into different columns in R

I have a data frame, one column of which contains comma- and pound-separated strings.
data$col1
col1
1: 3#Tier_III_Uncertain EVS=[1, 0, 0, 1, 0, 0, 0, 0, 0, -1, 1, 1]
2: 3#Tier_III_Uncertain EVS=[0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0]
3: 4#Tier_III_Uncertain EVS=[0, 0, 0, 1, 0, 0, 0, 0, 2, 0, 1, 0]
4: 2#Tier_IV_benign EVS=[0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0]
5: 3#Tier_III_Uncertain EVS=[0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0]
6: 5#Tier_III_Uncertain EVS=[0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1]
I want to extract the elements of the string and split it into different columns.
col1 col2 col3 EVS1 ... EVS12
3#Tier_III_Uncertain EVS=[1, 0, 0, 1, 0, 0, 0, 0, 0, -1, 1, 1] 3 Tier_III_Uncertain 1 1
3#Tier_III_Uncertain EVS=[0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0] 3 Tier_III_Uncertain 0 0
4#Tier_III_Uncertain EVS=[0, 0, 0, 1, 0, 0, 0, 0, 2, 0, 1, 0] 4 Tier_III_Uncertain 0 0
2#Tier_IV_benign EVS=[0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0] 2 Tier_IV_benign 0 0
3#Tier_III_Uncertain EVS=[0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0] 3 Tier_III_Uncertain 0 0
5#Tier_III_Uncertain EVS=[0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1] 5 Tier_III_Uncertain 0 1
read.table(text=gsub("[^A-Za-z_0-9-]", " ", data$col1),
col.names = c(paste0('col', 2:4), paste0('EVS', 1:12)))[-3]
col2 col3 EVS1 EVS2 EVS3 EVS4 EVS5 EVS6 EVS7 EVS8 EVS9 EVS10 EVS11 EVS12
1 3 Tier_III_Uncertain 1 0 0 1 0 0 0 0 0 -1 1 1
2 3 Tier_III_Uncertain 0 0 0 1 0 0 0 0 0 1 1 0
3 4 Tier_III_Uncertain 0 0 0 1 0 0 0 0 2 0 1 0
4 2 Tier_IV_benign 0 0 0 1 0 0 0 0 0 0 1 0
5 3 Tier_III_Uncertain 0 0 0 1 0 0 0 0 1 0 1 0
6 5 Tier_III_Uncertain 0 0 1 1 0 0 0 0 1 0 1 1
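Assuming DT as shown reproducibly in the Note at the end, replace EVS= and every character that is not alphanumeric, an underscore, or a minus sign with a space (a plain \W would also strip the minus sign from -1). Then read the result with fread and set the names. Finally, cbind DT to it.
DT2 <- fread(text = gsub("EVS=|[^[:alnum:]_-]", " ", DT$col1))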
names(DT2) <- c("col2", "col3", paste0("EVS", 1:(ncol(DT2)-2)))
cbind(DT, DT2)
Note
library(data.table)
L <- "3#Tier_III_Uncertain EVS=[1, 0, 0, 1, 0, 0, 0, 0, 0, -1, 1, 1]
3#Tier_III_Uncertain EVS=[0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0]
4#Tier_III_Uncertain EVS=[0, 0, 0, 1, 0, 0, 0, 0, 2, 0, 1, 0]
2#Tier_IV_benign EVS=[0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0]
3#Tier_III_Uncertain EVS=[0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0]
5#Tier_III_Uncertain EVS=[0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1]"
DT <- data.table(col1 = trimws(readLines(textConnection(L))))
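For comparison, here is a base R sketch that pulls the pieces out one at a time instead of re-reading the cleaned text. It assumes every string has the form number#label EVS=[...] as in the question; the names s, evs_txt, and out are made up for this example.

```r
# Two of the question's strings, including the one with a negative value
s <- c("3#Tier_III_Uncertain EVS=[1, 0, 0, 1, 0, 0, 0, 0, 0, -1, 1, 1]",
       "2#Tier_IV_benign EVS=[0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0]")

col2 <- as.integer(sub("#.*", "", s))        # number before the '#'
col3 <- sub("^[^#]*#([^ ]+) .*", "\\1", s)   # label between '#' and the space

# Text inside the square brackets, split on commas
evs_txt <- sub(".*\\[(.*)\\]", "\\1", s)
evs <- do.call(rbind, lapply(strsplit(evs_txt, ", *"), as.integer))
colnames(evs) <- paste0("EVS", seq_len(ncol(evs)))

out <- data.frame(col2, col3, evs)
out
```

This is more verbose than the one-liners above, but each step is explicit, which can help if the format turns out to vary between rows.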

R: expand sequence of binary values from a time column

I have a table of time and binary values,
> head(x,10)
Time binary
1 358.214 1
2 359.240 1
3 360.039 0
4 361.163 0
5 361.164 1
6 362.113 1
7 362.114 0
8 365.038 0
9 365.039 0
10 367.488 0
I want to check, one second later, whether the value in the binary column is 1 or 0, and then create a new column with the new values. The time here is not continuous. For example, the first value is 358.214 and its binary value is 1; if I add one second I get 359.214, and the value is still 1, carried over from the previous value, because 359.214 is not in the dataset.
I want to add two new columns, one for the seconds increasing and one for the new binary values.
time2 new_binary
1 358.214 1
2 359.214 1
3 360.214 0
4 361.214 1
5 362.214 0
6 363.214 0
7 364.214 0
8 365.214 0
9 366.214 0
10 367.214 0
How can I do this in R?
The dataset,
Time <- c(358.214, 359.240, 360.039, 361.163, 361.164, 362.113, 362.114, 365.038, 365.039, 367.488, 367.489, 368.763, 368.764, 371.538, 371.539, 384.013, 384.014, 386.088, 386.089, 389.463, 389.464, 392.663, 392.664, 414.588, 414.589, 421.463, 421.464, 427.863, 427.864, 431.488, 431.489, 432.074, 432.075, 437.124, 437.125, 439.024, 439.025, 451.724, 451.725, 456.224, 456.225, 457.301, 457.302, 459.526, 459.527, 470.776, 470.777, 471.951, 471.952, 477.651, 477.652, 479.601, 479.602, 480.426, 480.427, 480.950, 480.951, 494.626, 494.627, 516.551, 516.552, 539.901, 539.902, 545.276, 545.277, 546.536, 546.537, 548.436, 548.437, 551.111, 551.112, 556.086, 556.087, 557.561, 557.562, 567.799, 567.800, 580.049, 580.050, 583.249, 583.250, 587.374, 587.375, 588.599, 588.600, 596.199, 596.200, 597.674, 597.675, 601.249, 601.250, 602.499, 602.500, 620.699, 620.700, 631.099, 631.100, 637.249, 637.250, 638.999, 639.000, 650.574, 650.575, 658.199, 658.200, 658.696, 658.697, 668.396, 668.397, 676.021, 676.022, 678.846, 678.847, 688.121, 688.122, 690.371, 690.372, 701.946, 701.947, 704.921, 704.922, 712.346, 712.347, 719.321, 719.322, 721.146, 721.147, 723.496, 723.497, 725.696, 725.697, 727.121, 727.122, 729.871, 729.872, 733.721, 733.722, 739.054, 758.078, 761.321, 761.322, 764.221, 764.222, 768.679, 768.680, 774.529, 774.530, 776.679, 776.680, 778.129, 778.130, 780.779, 780.780, 837.204, 837.205, 842.079, 842.080, 846.329, 846.330, 847.579)
binary <- c(1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0 ,0 ,1 ,1, 0, 0, 1, 1, 0, 0, 1, 1 ,0, 0 ,1 ,1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0 ,0 ,1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1 ,0 ,0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1)
Update my attempts:
First I got a sequence of the new seconds (which is longer than the original Time)
time2 <- seq(x$Time[1], x$Time[length(x$Time)])
Then I used ifelse to go through Time and compare it with time2: if the value in time2 is not equal to the value in Time, use the previous binary value of Time; otherwise, use the binary value. So I want a function that keeps comparing two columns of different lengths.
What I did is this,
View(vec_new <-data.frame(time2))
vec_new <- vec_new %>%
mutate(new_Binary = ifelse((x$Time != vec_new$time2)&(vec_new$time2 %l% x$Time),lag(x$binary), x$binary))
However, I got this warning because of the different length columns.
"longer object length is not a multiple of shorter object length"
Also, the results are not quite what I expected. I don't understand how this steps through all the values, although I did get a complete set of binary values through the end of time2.
Any idea how to achieve this in R?
If you use mutate from the dplyr package the solution is relatively easy:
library(dplyr)
df <- data.frame(Time, binary) %>%
mutate(Time=Time-Time[1]) %>%
mutate(binary=as.logical(binary))
Output
head(df)
# Time binary
# 1 0.000 TRUE
# 2 1.026 TRUE
# 3 1.825 FALSE
# 4 2.949 FALSE
# 5 2.950 TRUE
# 6 3.899 TRUE
If you want to create new columns you simply have to give them a new name.
df <- data.frame(Time, binary) %>%
mutate(time2=Time-Time[1]) %>%
mutate(new_binary=as.logical(binary))
Output
head(df)
# Time binary time2 new_binary
# 1 358.214 1 0.000 TRUE
# 2 359.240 1 1.026 TRUE
# 3 360.039 0 1.825 FALSE
# 4 361.163 0 2.949 FALSE
# 5 361.164 1 2.950 TRUE
# 6 362.113 1 3.899 TRUE
And this solution gives you the time according to your desired output (I hope).
df <- data.frame(Time, binary) %>%
  mutate(time2 = Time[1] + row_number() - 1) %>%
  mutate(new_binary = as.logical(binary))
Output
head(df)
# Time binary time2 new_binary
# 1 358.214 1 358.214 TRUE
# 2 359.240 1 359.214 TRUE
# 3 360.039 0 360.214 FALSE
# 4 361.163 0 361.214 FALSE
# 5 361.164 1 362.214 TRUE
# 6 362.113 1 363.214 TRUE
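The question's desired output carries the last observed binary value forward to each one-second step. One way to sketch that, as an alternative to the mutate calls above and using only the first ten rows of the data, is findInterval, which returns for each value of time2 the index of the last Time that is less than or equal to it. The names idx and result are chosen for this example.

```r
# First ten observations from the question
Time   <- c(358.214, 359.240, 360.039, 361.163, 361.164,
            362.113, 362.114, 365.038, 365.039, 367.488)
binary <- c(1, 1, 0, 0, 1, 1, 0, 0, 0, 0)

# One-second grid starting at the first observation
time2 <- seq(Time[1], Time[length(Time)], by = 1)

# Index of the most recent observation at or before each grid point
idx <- findInterval(time2, Time)

result <- data.frame(time2 = time2, new_binary = binary[idx])
result
```

findInterval expects Time to be sorted in increasing order, which holds for the data shown. On these ten rows new_binary comes out as 1 1 0 1 0 0 0 0 0 0, the same as the desired table in the question.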
