Being new to R, I am having trouble writing the appropriate code (I suspect it needs if/else statements and a loop).
In concrete terms, I would like to compare two pieces of information (see simplified example, because my actual database is rather long): "Monthly_category" and "Ref_category". The "Ref_category" to be taken into consideration is calculated only at the 5th period for each element (because then we move to the next element), thanks to the mode formula, for each element (Element_id).
Months Element_Id Monthly_Category Ref_Category Expected_output
1 1 3 NA 0
2 1 2 NA 0
3 1 2 NA 1
4 1 1 NA 1
5 1 3 3 0
1 2 6 2 0
2 2 6 6 1
3 2 NA 1 0
4 2 NA 6 0
5 2 1 1 0
More precisely, I would like to put 1 as soon as the "Monthly_category" differs 2 periods in a row from the selected "Ref_category" which is calculated every 5 observations. Otherwise, set 0.
In addition, I would like the lines where Monthly_category = NA to give 0 directly because, in the end, I will only take into account lines where I have 1s (and NA doesn't interest me).
For each element (1 element = 5 lines), the reference category is calculated at the end of the 5 periods using the mode. However, by stretching the formula, we have values in each line while I have to consider each time only the last value (so every 5 lines). That's why I thought we needed 2 loops: one to check each line for the monthly category and one to check the reference category every 5 lines.
Do you have any idea of the code that could allow me to do this?
A very big thank you if someone can enlighten me,
Vanie
First of all, please have a look at the questions that @John Coleman and I asked you in the comments, because my solution may change depending on your answers.
Anyway, you don't need an explicit for loop or an explicit if else to get the job done.
In R, you usually prefer not to write directly any for loop. You'd better use a functional like lapply. In this case the dplyr package takes care of any implicit looping.
df <- tibble::tribble(~Months, ~Element_Id, ~Monthly_Category, ~Ref_Category, ~Expected_output,
1 , 1, 3, NA, 0,
2 , 1, 2, NA, 0,
3 , 1, 2, NA, 1,
4 , 1, 1, NA, 1,
5 , 1, 3, 3, 0,
1 , 2, 6, 2, 0,
2 , 2, 6, 6, 1,
3 , 2, 1, 1, 0,
4 , 2, 1, 6, 0,
5 , 2, 1, 1, 0)
library(dplyr)
library(purrr)
df %>%
# check if elements are equal
mutate(Real_Expected_output = !map2_lgl(Monthly_Category, Ref_Category, identical)) %>%
# sort by Element_Id and Months just in case your data is messy
arrange(Element_Id, Months) %>%
# For each Element_Id ...
group_by(Element_Id) %>%
# ... define your Expected Output
mutate(Real_Expected_output = as.integer(lag(Real_Expected_output, default = FALSE) &
lag(Real_Expected_output, 2, default = FALSE))) %>%
ungroup()
# Months Element_Id Monthly_Category Ref_Category Expected_output Real_Expected_output
# <dbl> <dbl> <dbl> <dbl> <dbl> <int>
# 1 1 3 NA 0 0
# 2 1 2 NA 0 0
# 3 1 2 NA 1 1
# 4 1 1 NA 1 1
# 5 1 3 3 0 1
# 1 2 6 2 0 0
# 2 2 6 6 1 0
# 3 2 1 1 0 0
# 4 2 1 6 0 0
# 5 2 1 1 0 0
Real_Expected_output is not the same as your Expected_output because I believe your expected result contradicts your written requirements, as I said in one of the comments.
EDIT:
Based on your comment, I suppose this is what you're looking for.
Again: no loops. You just need to make good use of the tools that the dplyr package already provides, i.e. last, group_by and mutate.
df %>%
# sort by Element_Id and Months just in case your data is messy
arrange(Element_Id, Months) %>%
# For each Element_Id ...
group_by(Element_Id) %>%
# ... check if Monthly Category is equal to the last Ref_Category
mutate(Real_Expected_output = !map2_lgl(Monthly_Category, last(Ref_Category), identical)) %>%
# ... and define your Expected Output
mutate(Real_Expected_output = as.integer(Real_Expected_output &
lag(Real_Expected_output, default = FALSE))) %>%
ungroup()
# Months Element_Id Monthly_Category Ref_Category Expected_output Real_Expected_output
# <dbl> <dbl> <dbl> <dbl> <dbl> <int>
# 1 1 3 NA 0 0
# 2 1 2 NA 0 0
# 3 1 2 NA 1 1
# 4 1 1 NA 1 1
# 5 1 3 3 0 0
# 1 2 6 2 0 0
# 2 2 6 6 1 1
# 3 2 1 1 0 0
# 4 2 1 6 0 0
# 5 2 1 1 0 0
EDIT 2:
I'll edit it again based on your request. At this point I'd suggest writing a separate function to handle your condition. It looks cleaner.
df <- tibble::tribble(~Months, ~Element_Id, ~Monthly_Category, ~Ref_Category, ~Expected_output,
1 , 1, 3, NA, 0,
2 , 1, 2, NA, 0,
3 , 1, 2, NA, 1,
4 , 1, 1, NA, 1,
5 , 1, 3, 3, 0,
1 , 2, 6, 2, 0,
2 , 2, 6, 6, 1,
3 , 2, NA, 1, 0,
4 , 2, NA, 6, 0,
5 , 2, 1, 1, 0)
library(dplyr)
library(purrr)
get_output <- function(mon, ref){
# set here your condition
exp <- !is.na(mon) & !map2_lgl(mon, last(ref), identical)
# check exp and lag(exp), then convert to integer
as.integer(exp & lag(exp, default = FALSE))
}
df %>%
# sort by Element_Id and Months just in case your data is messy
arrange(Element_Id, Months) %>%
# For each Element_Id ...
group_by(Element_Id) %>%
# ... launch your function
mutate(Real_Expected_output = get_output(Monthly_Category, Ref_Category)) %>%
ungroup()
# # A tibble: 10 x 6
# Months Element_Id Monthly_Category Ref_Category Expected_output Real_Expected_output
# <dbl> <dbl> <dbl> <dbl> <dbl> <int>
# 1 1 1 3 NA 0 0
# 2 2 1 2 NA 0 0
# 3 3 1 2 NA 1 1
# 4 4 1 1 NA 1 1
# 5 5 1 3 3 0 0
# 6 1 2 6 2 0 0
# 7 2 2 6 6 1 1
# 8 3 2 NA 1 0 0
# 9 4 2 NA 6 0 0
# 10 5 2 1 1 0 0
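For readers who prefer to avoid extra packages, the same rule can be sketched in base R. This is only an illustration, not part of the answer above: the helper name flag_two_in_a_row and the column name flag are made up here, and the sketch assumes that a group whose last Ref_Category is NA should produce 0 everywhere (the purrr version treats that case differently).

```r
# Sample data from the question (Expected_output omitted)
df <- data.frame(
  Months = rep(1:5, 2),
  Element_Id = rep(1:2, each = 5),
  Monthly_Category = c(3, 2, 2, 1, 3, 6, 6, NA, NA, 1),
  Ref_Category = c(NA, NA, NA, NA, 3, 2, 6, 1, 6, 1)
)

# Hypothetical helper: 1 when the monthly category differs from the
# group's last reference category in this period and the previous one
flag_two_in_a_row <- function(mon, ref) {
  ref_last <- ref[length(ref)]
  # NA months never count as "different" (assumption: NA reference either)
  differs <- !is.na(mon) & !is.na(ref_last) & mon != ref_last
  previous <- c(FALSE, head(differs, -1))
  as.integer(differs & previous)
}

# Sort, then apply the helper to the row indices of each element
df <- df[order(df$Element_Id, df$Months), ]
idx <- split(seq_len(nrow(df)), df$Element_Id)
flags <- lapply(idx, function(i) {
  flag_two_in_a_row(df$Monthly_Category[i], df$Ref_Category[i])
})
df$flag <- unsplit(flags, df$Element_Id)
df
```

On the sample data the flag column should agree with Real_Expected_output from EDIT 2.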
I want to classify the rows of a data frame based on a threshold applied to a given numeric reference column. If the reference column has a value below the threshold, then the result is 0, which I want to add to a new column. If the reference column value is over the threshold, then the new column will have value 1 in all consecutive rows with value over the threshold until a new 0 result comes up. If a new reference value is over the threshold then the value to add is 2, and so on.
If we set up the threshold > 2 then an example of what I would like to obtain is:
row reference result
1 2 0
2 1 0
3 4 1
4 3 1
5 1 0
6 6 2
7 8 2
8 4 2
9 1 0
10 3 3
11 6 3
row <- c(1:11)
reference <- c(2,1,4,3,1,6,8,4,1,3,6)
result <- c(0,0,1,1,0,2,2,2,0,3,3)
table <- cbind(row, reference, result)
Thank you!
We can use run-length encoding (rle) for this.
The below assumes a data.frame:
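# flag runs above the threshold, so a series that starts above it is still numbered from 1
r <- rle(quux$reference > 2)
# number the above-threshold runs 1, 2, 3, ...; runs at or below the threshold get 0
r$values <- ifelse(r$values, cumsum(r$values), 0)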
quux$result2 <- inverse.rle(r)
quux
# row reference result result2
# 1 1 2 0 0
# 2 2 1 0 0
# 3 3 4 1 1
# 4 4 3 1 1
# 5 5 1 0 0
# 6 6 6 2 2
# 7 7 8 2 2
# 8 8 4 2 2
# 9 9 1 0 0
# 10 10 3 3 3
# 11 11 6 3 3
Data
quux <- structure(list(row = 1:11, reference = c(2, 1, 4, 3, 1, 6, 8, 4, 1, 3, 6), result = c(0, 0, 1, 1, 0, 2, 2, 2, 0, 3, 3)), row.names = c(NA, -11L), class = "data.frame")
As noted in the comments by @Sotos, consider an alternative name for your object, since table is also the name of a base R function.
Since it wasn't clear whether you have a data.frame or a matrix, assume a data.frame df based on your data:
df <- as.data.frame(table)
And have a threshold of 2:
threshold = 2
You can adapt this solution by @flodel:
df$new_result = ifelse(
  x <- df$reference > threshold,
  cumsum(c(x[1], diff(x) == 1)),
  0)
df
In this case, diff(x) returns a vector in which a value of 1 marks each place where result should be incremented by cumsum (in the sample data, this occurs in rows 3, 6, and 10). These are the transitions from FALSE to TRUE (0 to 1), where reference goes from at or below the threshold to above it. Note that x[1] is prepended because diff returns a vector one element shorter than its input.
With ifelse, these incremental values are applied only to rows where reference exceeds the threshold; all other rows are set to 0.
Output
row reference result new_result
1 1 2 0 0
2 2 1 0 0
3 3 4 1 1
4 4 3 1 1
5 5 1 0 0
6 6 6 2 2
7 7 8 2 2
8 8 4 2 2
9 9 1 0 0
10 10 3 3 3
11 11 6 3 3
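To make the diff/cumsum step easier to follow, here is a small sketch that keeps each intermediate vector under its own name. The names starts and run_id are chosen only for this illustration.

```r
reference <- c(2, 1, 4, 3, 1, 6, 8, 4, 1, 3, 6)
threshold <- 2

# TRUE where the reference value is above the threshold
x <- reference > threshold

# TRUE at the first row of every above-threshold run;
# x[1] covers a series that already starts above the threshold
starts <- c(x[1], diff(x) == 1)

# running count of runs started so far
run_id <- cumsum(starts)

# keep the run number only inside a run, 0 elsewhere
result <- ifelse(x, run_id, 0)
result
```

Printing x, starts and run_id one after the other shows how each row ends up with its run number.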
Good afternoon, friends!
I'm currently performing some calculations in R (df is displayed below). My goal is to display in a new column the first non-zero value from selected cells for each row.
My df is:
MD <- c(100, 200, 300, 400, 500)
liv <- c(0, 0, 1, 3, 4)
liv2 <- c(6, 2, 0, 4, 5)
liv3 <- c(1, 1, 1, 1, 1)
liv4 <- c(1, 0, 0, 3, 5)
liv5 <- c(0, 2, 7, 9, 10)
df <- data.frame(MD, liv, liv2, liv3, liv4, liv5)
I want to display (in a column called "liv6") the first non-zero value among the five cells liv to liv5. For the first row (liv = 0, liv2 = 6, liv3 = 1, liv4 = 1 and liv5 = 0), the result should be 6. This calculation should be repeated for each row in my data frame.
I do know how to do this in Python, but not in R..
Any help is highly appreciated!
One option with dplyr could be:
library(dplyr)
df %>%
rowwise() %>%
mutate(liv6 = with(rle(c_across(liv:liv5)), values[which.max(values != 0)]))
MD liv liv2 liv3 liv4 liv5 liv6
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 100 0 6 1 1 0 6
2 200 0 2 1 0 2 2
3 300 1 0 1 0 7 1
4 400 3 4 1 3 9 3
5 500 4 5 1 5 10 4
A Base R solution:
df$liv6 <- apply(df[-1], 1, function(x) x[min(which(x != 0))])
output
df
MD liv liv2 liv3 liv4 liv5 liv6
1 100 0 6 1 1 0 6
2 200 0 2 1 0 2 2
3 300 1 0 1 0 7 1
4 400 3 4 1 3 9 3
5 500 4 5 1 5 10 4
A simple base R option is to apply across relevant columns (I exclude MD here, you can use any data frame subsetting style you want), then just take the first value of the non-zero values of that row.
df$liv6 <- apply(df[-1], 1, \(x) head(x[x > 0], 1))
df
#> MD liv liv2 liv3 liv4 liv5 liv6
#> 1 100 0 6 1 1 0 6
#> 2 200 0 2 1 0 2 2
#> 3 300 1 0 1 0 7 1
#> 4 400 3 4 1 3 9 3
#> 5 500 4 5 1 5 10 4
One approach is to use purrr::detect to detect the first non-zero element of each row.
We define a function which takes a numeric vector (a row) and returns a logical vector indicating whether each element is non-zero:
is_nonzero <- function(x) x != 0
We use this function to detect the first non-zero element in each row via purrr::detect:
first_nonzero <- apply(dplyr::select(df, liv:liv5), 1, function(x) {
purrr::detect(x, is_nonzero, .dir = "forward")
})
We finally create the new column:
df$liv6 <- first_nonzero
As a result, we have
> df
MD liv liv2 liv3 liv4 liv5 liv6
100 0 6 1 1 0 6
200 0 2 1 0 2 2
300 1 0 1 0 7 1
400 3 4 1 3 9 3
500 4 5 1 5 10 4
Another straightforward solution is:
Reduce(function(x, y) ifelse(!x, y, x), df[, -1])
#[1] 6 2 1 3 4
This way should be very efficient, since we "scan" by column, as, presumably, the data have much fewer columns than rows.
The Reduce approach is a more functional form of a simple, old-school, loop:
ans = df[, 2]
for(j in 3:ncol(df)) {
i = !ans
ans[i] = df[i, j]
}
ans
#[1] 6 2 1 3 4
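Another vectorised option, not taken from the answers above, is to locate the first non-zero column of each row with max.col and then pick the values by matrix indexing. This is a sketch under one assumption worth stating: a row made up entirely of zeros returns the value in the first column, i.e. 0.

```r
df <- data.frame(
  MD   = c(100, 200, 300, 400, 500),
  liv  = c(0, 0, 1, 3, 4),
  liv2 = c(6, 2, 0, 4, 5),
  liv3 = c(1, 1, 1, 1, 1),
  liv4 = c(1, 0, 0, 3, 5),
  liv5 = c(0, 2, 7, 9, 10)
)

# numeric matrix of the columns to scan
m <- as.matrix(df[, c("liv", "liv2", "liv3", "liv4", "liv5")])

# column index of the first TRUE (non-zero) entry in each row
first_col <- max.col(m != 0, ties.method = "first")

# pick one value per row using (row, column) index pairs
liv6 <- m[cbind(seq_len(nrow(m)), first_col)]
liv6
```

Assigning the result with df$liv6 <- liv6 adds it as a new column.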
I want to find a way to replace the consecutive identical values at the beginning of each trial with 0, but once the value has changed, the replacement should stop and the remaining values should be kept. This should happen for every trial of every subject.
For example, first subject has multiple trials (1, 2, etc). At the beginning of each trial, there may be some consecutive rows with the same value (e.g., 1, 1, 1). For these values, I would like to replace them to 0. However, once the value has changed from 1 to 0, I want to keep the values in the rest of the trial (e.g., 0, 0, 1).
subject <- c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1)
trial <- c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2)
value <- c(1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1)
df <- data.frame(subject, trial, value)
Thus, from the original data frame, I would like to have a new variable (value_new) like below.
subject trial value value_new
1 1 1 1 0
2 1 1 1 0
3 1 1 1 0
4 1 1 0 0
5 1 1 0 0
6 1 1 1 1
7 1 2 1 0
8 1 2 1 0
9 1 2 0 0
10 1 2 1 1
11 1 2 1 1
12 1 2 1 1
I was thinking to use tidyr and group_by(subject, trial) and mutate a new variable using conditional statement, but no idea how to do that. I guess I need to use rle(), but again, have no clue of how to replace the consecutive values into 0, and stop replacing once the value has changed and keep the rest of the values.
Any suggestions or advice would be really appreciated!
You can use rleid from data.table :
library(data.table)
setDT(df)[, new_value := value * +(rleid(value) > 1), .(subject, trial)]
df
# subject trial value new_value
# 1: 1 1 1 0
# 2: 1 1 1 0
# 3: 1 1 1 0
# 4: 1 1 0 0
# 5: 1 1 0 0
# 6: 1 1 1 1
# 7: 1 2 1 0
# 8: 1 2 1 0
# 9: 1 2 0 0
#10: 1 2 1 1
#11: 1 2 1 1
#12: 1 2 1 1
You can also do this with dplyr, still borrowing rleid from data.table:
library(dplyr)
df %>%
group_by(subject, trial) %>%
mutate(new_value = value * +(data.table::rleid(value) > 1))
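If neither package is available, the same idea can be sketched in base R with ave. The logic mirrors rleid(value) > 1: a row is kept once at least one change of value has occurred earlier in its trial. The column name value_new follows the question; the local name changed exists only in this sketch.

```r
df <- data.frame(
  subject = rep(1, 12),
  trial   = rep(1:2, each = 6),
  value   = c(1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1)
)

df$value_new <- ave(df$value, df$subject, df$trial, FUN = function(v) {
  # TRUE from the first row whose value differs from the previous row onward
  changed <- cumsum(c(FALSE, diff(v) != 0)) > 0
  # rows in the leading run are multiplied by FALSE (0), the rest by TRUE (1)
  v * changed
})
df
```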
I want to make a table consisting of 0 and 1.
If a variable is larger than 0, it will be 1 otherwise 0.
As the dataset has over 1,000 columns, I assume I should use something like the 'sapply' function for this.
How should I write the code?
We can specify the condition and replace the value for a data frame. No "apply" family function is needed.
# Create an example data frame
dt <- data.frame(A = c(0, 1, 2, 3, 4),
B = c(4, 6, 8, 0, 7),
C = c(0, 0, 5, 5, 2))
# View dt
dt
# A B C
# 1 0 4 0
# 2 1 6 0
# 3 2 8 5
# 4 3 0 5
# 5 4 7 2
# Replace values larger than 0 to be 1
dt[dt > 0] <- 1
# View dt again
dt
# A B C
# 1 0 1 0
# 2 1 1 0
# 3 1 1 1
# 4 1 0 1
# 5 1 1 1
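Note that the replacement above leaves any negative values untouched. If the data may contain negatives and those should also become 0, one possible sketch is to build the 0/1 table directly from the comparison. The example data and the name binary are made up for this illustration.

```r
# Example data with a negative value (assumed for illustration)
dt <- data.frame(A = c(0, 1, -2),
                 B = c(4, 0, 7))

# dt > 0 is a logical matrix; unary + turns TRUE/FALSE into 1/0
binary <- as.data.frame(+(dt > 0))
binary
```

This keeps dt unchanged and works the same way regardless of the number of columns.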
I have a vector with repeated numbers. I want to count the number of repeated numbers and print the output.
This is my input:
deg <- c(2, 1, 4, 3, 2, 4, 2, 5, 2, 2, 1, 2)
df <- data.frame(table(deg))
This is my output:
deg Freq
1 1 2
2 2 6
3 3 1
4 4 2
5 5 1
Here in my output I want to print the data frame from 0 to 5, where 0 is the starting element and 5 is the max element in the vector. The output I want to get is:
deg Freq
1 0 0
2 1 2
3 2 6
4 3 1
5 4 2
6 5 1
Someone please help with this!!!
If we're starting from df we can just unpack the data, add zero as a factor level, then re-tabulate:
f <- with(df, factor(rep(deg, Freq), levels = union(0, levels(deg))))
as.data.frame(table(deg = f))
# deg Freq
# 1 0 0
# 2 1 2
# 3 2 6
# 4 3 1
# 5 4 2
# 6 5 1
If we're starting with the vector deg, it's easier. We can just add zero as a factor level then tabulate:
f <- factor(deg, levels = union(0, sort(unique(deg))))
as.data.frame(table(deg = f))
# deg Freq
# 1 0 0
# 2 1 2
# 3 2 6
# 4 3 1
# 5 4 2
# 6 5 1
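Adding only 0 as an extra level does not fill gaps inside the range (for example, if no 3 occurred in deg). A small variation, sketched here on the assumption that deg holds non-negative whole numbers, is to use every integer from 0 to max(deg) as the levels. The name counts is used only for this example.

```r
deg <- c(2, 1, 4, 3, 2, 4, 2, 5, 2, 2, 1, 2)

# every integer from 0 to the maximum becomes a level,
# so values that never occur are tabulated with a count of 0
counts <- as.data.frame(table(deg = factor(deg, levels = 0:max(deg))))
counts
```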
Try this:
df <- data.frame(deg=seq(0,max(deg)),
Freq=sapply(seq(0,max(deg)),function(x) length(which(deg==x))))
Output:
deg Freq
1 0 0
2 1 2
3 2 6
4 3 1
5 4 2
6 5 1
You can add a row to df:
#convert deg from factor back to numeric
df$deg = as.numeric(as.character(df$deg))
# add 0 deg with 0 freq if it doesn't exist already in df
if (!any(df$deg == 0)) {
df = rbind(df, c(0,0))
# sort df by deg
df = df[order(df$deg),]
}
Try this
rbind(data.frame(deg=0, Freq=0)[!(c(0) %in% deg)], as.data.frame(table(deg)))
# deg Freq
# 1 0 0
# 2 1 2
# 3 2 6
# 4 3 1
# 5 4 2
# 6 5 1
The expand_df function below can help you get the desired output
deg = c(2, 1, 4, 3, 2, 4, 2, 5, 2, 2, 1, 2)
df = as.data.frame(table(deg))
expand_df = function(df){
upd_list = 0: max(as.numeric(as.character(df[,1])))
upd_df = as.data.frame(upd_list)
merged_df = merge(upd_df, df,all.x=TRUE,by.x=colnames(upd_df)[1], by.y=colnames(df)[1])
merged_df[,2] = ifelse(is.na(merged_df[,2]),0,merged_df[,2])
merged_df
}
expand_df(df)