Good afternoon, friends!
I'm currently performing some calculations in R (df is displayed below). My goal is to display in a new column the first non-null value from selected cells for each row.
My df is:
MD <- c(100, 200, 300, 400, 500)
liv <- c(0, 0, 1, 3, 4)
liv2 <- c(6, 2, 0, 4, 5)
liv3 <- c(1, 1, 1, 1, 1)
liv4 <- c(1, 0, 0, 3, 5)
liv5 <- c(0, 2, 7, 9, 10)
df <- data.frame(MD, liv, liv2, liv3, liv4, liv5)
I want to display (in a column called "liv6") the first non-null value from 5 cells (given the data, liv1 = 0, liv2 = 6 , liv3 = 1, liv 4 = 1 and liv5 = 1). The result should be 6. And this calculation should be repeated fro each row in my dataframe..
I do know how to do this in Python, but not in R..
Any help is highly appreciated!
One option with dplyr could be:
df %>%
rowwise() %>%
mutate(liv6 = with(rle(c_across(liv:liv5)), values[which.max(values != 0)]))
MD liv liv2 liv3 liv4 liv5 liv6
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 100 0 6 1 1 0 6
2 200 0 2 1 0 2 2
3 300 1 0 1 0 7 1
4 400 3 4 1 3 9 3
5 500 4 5 1 5 10 4
A Base R solution:
df$liv6 <- apply(df[-1], 1, function(x) x[min(which(x != 0))])
output
df
MD liv liv2 liv3 liv4 liv5 liv6
1 100 0 6 1 1 0 2
2 200 0 2 1 0 2 2
3 300 1 0 1 0 7 1
4 400 3 4 1 3 9 1
5 500 4 5 1 5 10 1
A simple base R option is to apply across relevant columns (I exclude MD here, you can use any data frame subsetting style you want), then just take the first value of the non-zero values of that row.
df$liv6 <- apply(df[-1], 1, \(x) head(x[x > 0], 1))
df
#> MD liv liv2 liv3 liv4 liv5 liv6
#> 1 100 0 6 1 1 0 6
#> 2 200 0 2 1 0 2 2
#> 3 300 1 0 1 0 7 1
#> 4 400 3 4 1 3 9 3
#> 5 500 4 5 1 5 10 4
One approach is to use purrr::detect to detect the first non-zero element of each row.
We define a function which takes a numeric vector (row) and returns a boolean indicating whether each element is non-zero:
is_nonzero <- function(x) x != 0
We use this function to detect the first non-zero element in each row via purrr:detect
first_nonzero <- apply(df %>% dplyr::select(liv:liv5), 1, function(x) {
purrr::detect(x, is_nonzero, .dir = "forward")
})
We finally create the new column:
df$liv6 <- first_nonzero
As a result, we have
> df
MD liv liv2 liv3 liv4 liv5 liv6
100 0 6 1 1 0 6
200 0 2 1 0 2 2
300 1 0 1 0 7 1
400 3 4 1 3 9 3
500 4 5 1 5 10 4
Another straightforward solution is:
Reduce(function(x, y) ifelse(!x, y, x), df[, -1])
#[1] 6 2 1 3 4
This way should be very efficient, since we "scan" by column, as, presumably, the data have much fewer columns than rows.
The Reduce approach is a more functional form of a simple, old-school, loop:
ans = df[, 2]
for(j in 3:ncol(df)) {
i = !ans
ans[i] = df[i, j]
}
ans
#[1] 6 2 1 3 4
Related
I want to classify the rows of a data frame based on a threshold applied to a given numeric reference column. If the reference column has a value below the threshold, then the result is 0, which I want to add to a new column. If the reference column value is over the threshold, then the new column will have value 1 in all consecutive rows with value over the threshold until a new 0 result comes up. If a new reference value is over the threshold then the value to add is 2, and so on.
If we set up the threshold > 2 then an example of what I would like to obtain is:
row
reference
result
1
2
0
2
1
0
3
4
1
4
3
1
5
1
0
6
6
2
7
8
2
8
4
2
9
1
0
10
3
3
11
6
3
row <- c(1:11)
reference <- c(2,1,4,3,1,6,8,4,1,3,6)
result <- c(0,0,1,1,0,2,2,2,0,3,3)
table <- cbind(row, reference, result)
Thank you!
We can use run-length encoding (rle) for this.
The below assumes a data.frame:
r <- rle(quux$reference <= 2)
r$values <- ifelse(r$values, 0, cumsum(r$values))
quux$result2 <- inverse.rle(r)
quux
# row reference result result2
# 1 1 2 0 0
# 2 2 1 0 0
# 3 3 4 1 1
# 4 4 3 1 1
# 5 5 1 0 0
# 6 6 6 2 2
# 7 7 8 2 2
# 8 8 4 2 2
# 9 9 1 0 0
# 10 10 3 3 3
# 11 11 6 3 3
Data
quux <- structure(list(row = 1:11, reference = c(2, 1, 4, 3, 1, 6, 8, 4, 1, 3, 6), result = c(0, 0, 1, 1, 0, 2, 2, 2, 0, 3, 3)), row.names = c(NA, -11L), class = "data.frame")
As noted in the comments by #Sotos, would consider alternative name for your object.
Since it wasn't clear if data.frame or matrix, assume we have a data.frame df based on your data:
df <- as.data.frame(table)
And have a threshold of 2:
threshold = 2
You can adapt this solution by #flodel:
df$new_result = ifelse(
x <- reference > threshold,
cumsum(c(x[1], diff(x) == 1)),
0)
df
In this case, the diff(x) will include a vector, where values of 1 indicate where result should be increased by cumsum (in the sample data, this occurs in rows 3, 6, and 10). These are transitions from FALSE to TRUE (0 to 1), where reference goes from below to above threshold. Note that x[1] is added/combined since the diff values will be 1 element shorter in length.
Using the ifelse, these new incremental values only apply to those where reference exceeds threshold, otherwise set at 0.
Output
row reference result new_result
1 1 2 0 0
2 2 1 0 0
3 3 4 1 1
4 4 3 1 1
5 5 1 0 0
6 6 6 2 2
7 7 8 2 2
8 8 4 2 2
9 9 1 0 0
10 10 3 3 3
11 11 6 3 3
I've got a dataset that has a lot of numerical columns (in the example below these columns are x, y, z). I want to create individual flagging variables for each of those columns (x_YN, y_YN, z_YN) such that, if the numerical column is > 0, the flagging variable is = 1 and otherwise it's = 0. What might be the most efficient way to tackle this?
Thanks for the help!
x <- c(3, 7, 0, 10)
y <- c(5, 2, 20, 0)
z <- c(0, 0, 4, 12)
df <- data.frame(x,y,z)
We may use a logical matrix and coerce
df[paste0(names(df), "_YN")] <- +(df > 0)
-output
> df
x y z x_YN y_YN z_YN
1 3 5 0 1 1 0
2 7 2 0 1 1 0
3 0 20 4 0 1 1
4 10 0 12 1 0 1
The dplyr alternative:
library(dplyr)
df %>%
mutate(across(everything(), ~ +(.x > 0), .names = "{col}_YN"))
output
x y z x_YN y_YN z_YN
1 3 5 0 1 1 0
2 7 2 0 1 1 0
3 0 20 4 0 1 1
4 10 0 12 1 0 1
Being a novice on R, I have trouble setting up the appropriate code (I would still say that it must include if/else commands and a loop).
In concrete terms, I would like to compare two pieces of information (see simplified example, because my actual database is rather long): "Monthly_category" and "Ref_category". The "Ref_category" to be taken into consideration is calculated only at the 5th period for each element (because then we move to the next element), thanks to the mode formula, for each element (Element_id).
Months Element_Id Monthly_Category Ref_Category Expected_output
1 1 3 NA 0
2 1 2 NA 0
3 1 2 NA 1
4 1 1 NA 1
5 1 3 3 0
1 2 6 2 0
2 2 6 6 1
3 2 NA 1 0
4 2 NA 6 0
5 2 1 1 0
More precisely, I would like to put 1 as soon as the "Monthly_category" differs 2 periods in a row from the selected "Ref_category" which is calculated every 5 observations. Otherwise, set 0.
In addition, I would like the lines or Monthly_category = NA to give 0 directly because in the end, I will only take into account lines where I have 1s (and NA doesn't interest me).
For each element (1 element = 5 lines), the reference category is calculated at the end of the 5 periods using the mode. However, by stretching the formula, we have values in each line while I have to consider each time only the last value (so every 5 lines). That's why I thought we needed 2 loops: one to check each line for the monthly category and one to check the reference category every 5 lines.
Do you have any idea of the code that could allow me to do this?
A very big thank you if someone can enlighten me,
Vanie
First of all, please have a look at the questions that #John Coleman and I asked you into the comments because my solution may change based on your request.
Anyway, you don't need an explicit for loop or an explicit if else to get the job done.
In R, you usually prefer not to write directly any for loop. You'd better use a functional like lapply. In this case the dplyr package takes care of any implicit looping.
df <- tibble::tribble(~Months, ~Element_Id, ~Monthly_Category, ~Ref_Category, ~Expected_output,
1 , 1, 3, NA, 0,
2 , 1, 2, NA, 0,
3 , 1, 2, NA, 1,
4 , 1, 1, NA, 1,
5 , 1, 3, 3, 0,
1 , 2, 6, 2, 0,
2 , 2, 6, 6, 1,
3 , 2, 1, 1, 0,
4 , 2, 1, 6, 0,
5 , 2, 1, 1, 0)
library(dplyr)
library(purrr)
df %>%
# check if elements are equal
mutate(Real_Expected_output = !map2_lgl(Monthly_Category, Ref_Category, identical)) %>%
# sort by Element_Id and Months just in case your data is messy
arrange(Element_Id, Months) %>%
# For each Element_Id ...
group_by(Element_Id) %>%
# ... define your Expected Output
mutate(Real_Expected_output = as.integer(lag(Real_Expected_output, default = FALSE) &
lag(Real_Expected_output, 2, default = FALSE))) %>%
ungroup()
# Months Element_Id Monthly_Category Ref_Category Expected_output Real_Expected_output
# <dbl> <dbl> <dbl> <dbl> <dbl> <int>
# 1 1 3 NA 0 0
# 2 1 2 NA 0 0
# 3 1 2 NA 1 1
# 4 1 1 NA 1 1
# 5 1 3 3 0 1
# 1 2 6 2 0 0
# 2 2 6 6 1 0
# 3 2 1 1 0 0
# 4 2 1 6 0 0
# 5 2 1 1 0 0
Real_Expected_output is not the same of your Expected_output just because I do believe your expected result contradicts your written requests as I said in one of the comments.
EDIT:
Based on your comment, I suppose this is what you're looking for.
Again: no loops, you just need to use wisely the tools that the dplyr package is already providing, i.e. last, group_by, mutate
df %>%
# sort by Element_Id and Months just in case your data is messy
arrange(Element_Id, Months) %>%
# For each Element_Id ...
group_by(Element_Id) %>%
# ... check if Monthly Category is equal to the last Ref_Category
mutate(Real_Expected_output = !map2_lgl(Monthly_Category, last(Ref_Category), identical)) %>%
# ... and define your Expected Output
mutate(Real_Expected_output = as.integer(Real_Expected_output &
lag(Real_Expected_output, default = FALSE))) %>%
ungroup()
# Months Element_Id Monthly_Category Ref_Category Expected_output Real_Expected_output
# <dbl> <dbl> <dbl> <dbl> <dbl> <int>
# 1 1 3 NA 0 0
# 2 1 2 NA 0 0
# 3 1 2 NA 1 1
# 4 1 1 NA 1 1
# 5 1 3 3 0 0
# 1 2 6 2 0 0
# 2 2 6 6 1 1
# 3 2 1 1 0 0
# 4 2 1 6 0 0
# 5 2 1 1 0 0
EDIT 2:
I'll edit it again based on your request. At this point I'd suggest you to create an external function to handle your problem. It looks cleaner.
df <- tibble::tribble(~Months, ~Element_Id, ~Monthly_Category, ~Ref_Category, ~Expected_output,
1 , 1, 3, NA, 0,
2 , 1, 2, NA, 0,
3 , 1, 2, NA, 1,
4 , 1, 1, NA, 1,
5 , 1, 3, 3, 0,
1 , 2, 6, 2, 0,
2 , 2, 6, 6, 1,
3 , 2, NA, 1, 0,
4 , 2, NA, 6, 0,
5 , 2, 1, 1, 0)
library(dplyr)
library(purrr)
get_output <- function(mon, ref){
# set here your condition
exp <- !is.na(mon) & !map2_lgl(mon, last(ref), identical)
# check exp and lag(exp), then convert to integer
as.integer(exp & lag(exp, default = FALSE))
}
df %>%
# sort by Element_Id and Months just in case your data is messy
arrange(Element_Id, Months) %>%
# For each Element_Id ...
group_by(Element_Id) %>%
# ... launch your function
mutate(Real_Expected_output = get_output(Monthly_Category, Ref_Category)) %>%
ungroup()
# # A tibble: 10 x 6
# Months Element_Id Monthly_Category Ref_Category Expected_output Real_Expected_output
# <dbl> <dbl> <dbl> <dbl> <dbl> <int>
# 1 1 1 3 NA 0 0
# 2 2 1 2 NA 0 0
# 3 3 1 2 NA 1 1
# 4 4 1 1 NA 1 1
# 5 5 1 3 3 0 0
# 6 1 2 6 2 0 0
# 7 2 2 6 6 1 1
# 8 3 2 NA 1 0 0
# 9 4 2 NA 6 0 0
# 10 5 2 1 1 0 0
Hi I would really appreciate some help for this, I really couldn't find the solution in previous questions.
I have a tibble in long format (rows grouped by id and arranged by time).
I want to create a variable "eleg" based on "varx". The condition would be that "eleg" = 1 if "varx" in the previous 3 rows == 0 and in the current row varx == 1, if not = 0, for each ID. If possible using dplyr.
id <- c(1,1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,3)
time <- c(1,2,3,4,5,6,7,1,2,3,4,5,6,1,2,3,4)
varx <- c(0,0,0,0,1,1,0,0,1,1,1,1,1,0,0,0,1)
eleg <- c(0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1)
table <- data.frame(id, time, varx, eleg)
In my real dataset the condition is "in the previous 24 rows" and the same ID could have eleg == 1 more than one time if it suits the condition.
Thank you.
One of the approach could be
library(dplyr)
m <- 3 #number of times previous rows are looked back
df %>%
group_by(id) %>%
mutate(eleg = ifelse(rowSums(sapply(1:m, function(k) lag(varx, n = k, order_by = id, default = 1) == 0)) == m & varx == 1,
1,
0)) %>%
data.frame()
which gives
id time varx eleg
1 1 1 0 0
2 1 2 0 0
3 1 3 0 0
4 1 4 0 0
5 1 5 1 1
6 1 6 1 0
7 1 7 0 0
8 2 1 0 0
9 2 2 1 0
10 2 3 1 0
11 2 4 1 0
12 2 5 1 0
13 2 6 1 0
14 3 1 0 0
15 3 2 0 0
16 3 3 0 0
17 3 4 1 1
Sample data:
df <- structure(list(id = c(1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2,
3, 3, 3, 3), time = c(1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6,
1, 2, 3, 4), varx = c(0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1,
0, 0, 0, 1)), .Names = c("id", "time", "varx"), row.names = c(NA,
-17L), class = "data.frame")
library(data.table)
df %>%
mutate(elegnew = ifelse(Reduce("+", shift(df$varx, 1:3)) == 0 & df$varx == 1, 1, 0))
id time varx eleg elegnew
1 1 1 0 0 0
2 1 2 0 0 0
3 1 3 0 0 0
4 1 4 0 0 0
5 1 5 1 1 1
6 1 6 1 0 0
7 1 7 0 0 0
8 2 1 0 0 0
9 2 2 1 0 0
10 2 3 1 0 0
11 2 4 1 0 0
12 2 5 1 0 0
13 2 6 1 0 0
14 3 1 0 0 0
15 3 2 0 0 0
16 3 3 0 0 0
17 3 4 1 1 1
Here's another approach, using dplyr and zoo:
library(dplyr)
library(zoo)
df %>%
group_by(id) %>%
mutate(elegnew = as.integer(varx == 1 &
rollsum(varx == 1, k = 4, align = "right", fill = 0) == 1))
# # A tibble: 17 x 5
# # Groups: id [3]
# id time varx eleg elegnew
# <dbl> <dbl> <dbl> <dbl> <int>
# 1 1. 1. 0. 0. 0
# 2 1. 2. 0. 0. 0
# 3 1. 3. 0. 0. 0
# 4 1. 4. 0. 0. 0
# 5 1. 5. 1. 1. 1
# 6 1. 6. 1. 0. 0
# 7 1. 7. 0. 0. 0
# 8 2. 1. 0. 0. 0
# 9 2. 2. 1. 0. 0
# 10 2. 3. 1. 0. 0
# 11 2. 4. 1. 0. 0
# 12 2. 5. 1. 0. 0
# 13 2. 6. 1. 0. 0
# 14 3. 1. 0. 0. 0
# 15 3. 2. 0. 0. 0
# 16 3. 3. 0. 0. 0
# 17 3. 4. 1. 1. 1
The idea is to group by id and then check a) whether varx is 1 and b) whether the sum of varx=1 events in the previous 3 plus current row (k=4) is 1 (which means all previous 3 must be 0). I assume that varx is either 0 or 1.
You have asked for a dplyr solution, preferably.
The following is a base R one, with a function that you can adapt to "in the previous 24 rows", just pass n = 24 to the function.
fun <- function(DF, crit = "varx", new = "eleg", n = 3){
DF[[new]] <- 0
for(i in seq_len(nrow(DF))[-seq_len(n)]){
if(all(DF[[crit]][(i - n):(i - 1)] == 0) && DF[[crit]][i] == 1)
DF[[new]][i] <- 1
}
DF
}
sp <- split(table[-4], table[-4]$id)
new_df <- do.call(rbind, lapply(sp, fun))
row.names(new_df) <- NULL
identical(table, new_df)
#[1] TRUE
Note that if you are creating a new column, eleg, you would probably not need to split table[-4], just table since the 4th column wouldn't exist yet.
You could do do.call(rbind, lapply(sp, fun, n = 24)) and the rest would be the same.
I have a vector with repeated numbers. I want to count the number of repeated numbers and print the output.
This is my input:
deg <- c(2, 1, 4, 3, 2, 4, 2, 5, 2, 2, 1, 2)
df <- data.frame(table(deg))
This is my output:
deg Freq
1 1 2
2 2 6
3 3 1
4 4 2
5 5 1
Here in my output I want to print the data frame from 0 to 5, where 0 is the starting element and 5 is the max element in the vector. The output I want to get is:
deg Freq
1 0 0
2 1 2
3 2 6
4 3 1
5 4 2
6 5 1
Someone please help with this!!!
If we're starting from df we can just unpack the data, add zero as a factor level, then re-tabulate:
f <- with(df, factor(rep(deg, Freq), levels = union(0, levels(deg))))
as.data.frame(table(deg = f))
# deg Freq
# 1 0 0
# 2 1 2
# 3 2 6
# 4 3 1
# 5 4 2
# 6 5 1
If we're starting with the vector deg, it's easier. We can just add zero as a factor level then tabulate:
f <- factor(deg, levels = union(0, sort(unique(deg))))
as.data.frame(table(deg = f))
# deg Freq
# 1 0 0
# 2 1 2
# 3 2 6
# 4 3 1
# 5 4 2
# 6 5 1
Try this:
df <- data.frame(deg=seq(0,max(deg)),
Freq=sapply(seq(0,max(deg)),function(x) length(which(deg==x))))
Output:
deg Freq
1 0 0
2 1 2
3 2 6
4 3 1
5 4 2
6 5 1
You can add a row to df:
#convert deg from factor back to numeric
df$deg = as.numeric(as.character(df$deg))
# add 0 deg with 0 freq if it doesn't exist already in df
if (!any(df$deg == 0)) {
df = rbind(df, c(0,0))
# sort df by deg
df = df[order(df$deg),]
}
Try this
rbind(data.frame(deg=0, Freq=0)[!(c(0) %in% deg)], as.data.frame(table(deg)))
# deg Freq
# 1 0 0
# 2 1 2
# 3 2 6
# 4 3 1
# 5 4 2
# 6 5 1
The expand_df function below can help you get the desired output
deg = c(2, 1, 4, 3, 2, 4, 2, 5, 2, 2, 1, 2)
df = as.data.frame(table(deg))
expand_df = function(df){
upd_list = 0: max(as.numeric(as.character(df[,1])))
upd_df = as.data.frame(upd_list)
merged_df = merge(upd_df, df,all.x=TRUE,by.x=colnames(upd_df)[1], by.y=colnames(df)[1])
merged_df[,2] = ifelse(is.na(merged_df[,2]),0,merged_df[,2])
merged_df
}
expand_df(df)