R reset counter based on two columns [duplicate]

This question already has an answer here:
R code to assign a sequence based off of multiple variables [duplicate]
(1 answer)
Closed 3 years ago.
I have the following kind of data and I need the output shown in the second data frame:
a <- c(1,1,1,1,2,2,2,2,2,2,2)
b <- c(1,1,1,2,3,3,3,3,4,5,6)
d <- c(1,2,3,4,1,2,3,4,5,6,7)
df <- as.data.frame(cbind(a,b,d))
output <- c(1,1,1,2,1,1,1,1,2,3,4)
df_output <- as.data.frame(cbind(df,output))
I have tried cumsum but I am not able to get the desired results. Please guide. Regards, Enthu.
Whenever the value in column a changes, the counter in b should reset and start from one. Within a group, rows with the same value of b should get the same counter value. For example, in the 5th record col b has the value 3, so the counter resets to 1; rows 5 to 8 all have b equal to 3, so they all get 1, and every subsequent change in b increments the counter by 1.

We can do a group by column 'a' and then create the new column by either matching 'b' against its unique values:
library(dplyr)
df2 <- df %>%
  group_by(a) %>%
  mutate(out = match(b, unique(b)))
df2
# A tibble: 11 x 4
# Groups:   a [2]
#        a     b     d   out
#    <dbl> <dbl> <dbl> <int>
#  1     1     1     1     1
#  2     1     1     2     1
#  3     1     1     3     1
#  4     1     2     4     2
#  5     2     3     1     1
#  6     2     3     2     1
#  7     2     3     3     1
#  8     2     3     4     1
#  9     2     4     5     2
# 10     2     5     6     3
# 11     2     6     7     4
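To see what match is doing here: it returns, for each element of b, the position of that element within unique(b). A small standalone illustration (b2 is just a throwaway name for the b values of the second group, a == 2):
b2 <- c(3, 3, 3, 3, 4, 5, 6)
match(b2, unique(b2))
# [1] 1 1 1 1 2 3 4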
Or another option is to coerce a factor variable to integer:
df %>%
  group_by(a) %>%
  mutate(out = as.integer(factor(b)))
Data
df <- data.frame(a, b, d)
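For reference, a minimal base R sketch of the same idea (not part of the original answer), using ave on the df defined just above:
df$out <- ave(df$b, df$a, FUN = function(x) match(x, unique(x)))
df$out
# [1] 1 1 1 2 1 1 1 1 2 3 4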

How to use a for loop to change consecutive values in R?

How can I run a loop over multiple columns, converting cumulative values into per-bin values?
For example, if I have a dataframe like this...
Time Value Bin Subject_ID
   1     6   1          1
   3    10   2          1
   7    18   3          1
   8    20   4          1
I want to show the binned values...
Time Value Bin Subject_ID
   1     6   1          1
   2     4   2          1
   4     8   3          1
   1     2   4          1
Is there a way to do it in a loop?
I tried this code...
for (row in 2:nrow(df)) {
  if (df[row - 1, "Subject_ID"] == df[row, "Subject_ID"]) {
    df[row, 1:2] = df[row, 1:2] - df[row - 1, 1:2]
  }
}
But the code changed it line by line and did not give the correct values for each bin.
If you still insist on using a for loop, you can use the following solution. It is very simple, but you first have to create a copy of your data set, because your desired output values are differences between rows of the original data. We take the copy DF before the for loop so the original values remain available; otherwise each iteration would overwrite a row in df that the next iteration still needs, and the final output would be incorrect:
df <- read.table(header = TRUE, text = "
Time Value Bin Subject_ID
1 6 1 1
3 10 2 1
7 18 3 1
8 20 4 1")
DF <- df[, c("Time", "Value")]
for (i in 2:nrow(df)) {
  df[i, c("Time", "Value")] <- DF[i, ] - DF[i-1, ]
}
df
  Time Value Bin Subject_ID
1    1     6   1          1
2    2     4   2          1
3    4     8   3          1
4    1     2   4          1
The problem with the code in the question is that after row i is changed, the changed row is used in calculating row i+1 rather than the original row i. To fix that, run the loop in reverse order, that is, use nrow(df):2 in the for statement. Alternatively, try one of the approaches below, which do not use any loops and also have the advantage of not overwriting the input, something which makes the code easier to debug.
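For completeness, a minimal sketch of that reversed loop applied to the question's own code; since we iterate from the last row backwards, row - 1 still holds its original values when row is updated:
for (row in nrow(df):2) {
  if (df[row - 1, "Subject_ID"] == df[row, "Subject_ID"]) {
    df[row, 1:2] <- df[row, 1:2] - df[row - 1, 1:2]  # row - 1 is not yet modified here
  }
}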
1) Base R: use ave to apply Diff by group, where Diff uses diff to perform the actual differencing.
Diff <- function(x) c(x[1], diff(x))
transform(df,
          Time  = ave(Time, Subject_ID, FUN = Diff),
          Value = ave(Value, Subject_ID, FUN = Diff))
giving:
  Time Value Bin Subject_ID
1    1     6   1          1
2    2     4   2          1
3    4     8   3          1
4    1     2   4          1
2) dplyr: we write the same thing, except we use lag:
library(dplyr)
df %>%
  group_by(Subject_ID) %>%
  mutate(Time  = Time - lag(Time, default = 0),
         Value = Value - lag(Value, default = 0)) %>%
  ungroup
giving:
# A tibble: 4 x 4
   Time Value   Bin Subject_ID
  <dbl> <dbl> <int>      <int>
1     1     6     1          1
2     2     4     2          1
3     4     8     3          1
4     1     2     4          1
or using across:
library(dplyr)
df %>%
  group_by(Subject_ID) %>%
  mutate(across(Time:Value, ~ .x - lag(.x, default = 0))) %>%
  ungroup
Note
Lines <- "Time Value Bin Subject_ID
1 6 1 1
3 10 2 1
7 18 3 1
8 20 4 1"
df <- read.table(text = Lines, header = TRUE)
Here is a base R one-liner with diff in an lapply loop (fine here, since there is only a single Subject_ID; with several subjects it would need to be applied per group, e.g. via ave):
df[1:2] <- lapply(df[1:2], function(x) c(x[1], diff(x)))
df
#   Time Value Bin Subject_ID
# 1    1     6   1          1
# 2    2     4   2          1
# 3    4     8   3          1
# 4    1     2   4          1
Data
df <- read.table(text = "
Time Value Bin Subject_ID
1 6 1 1
3 10 2 1
7 18 3 1
8 20 4 1
", header = TRUE)
A dplyr one-liner:
library(dplyr)
df %>% mutate(across(c(Time, Value), ~c(first(.), diff(.))))
#>   Time Value Bin Subject_ID
#> 1    1     6   1          1
#> 2    2     4   2          1
#> 3    4     8   3          1
#> 4    1     2   4          1
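For completeness, a data.table sketch of the same per-group differencing (not from the original answers; it modifies df in place):
library(data.table)
setDT(df)[, c("Time", "Value") := lapply(.SD, function(x) c(x[1], diff(x))),  # keep first value, then successive differences
          by = Subject_ID, .SDcols = c("Time", "Value")]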

Column bind several list elements based on id variable [duplicate]

This question already has answers here:
Simultaneously merge multiple data.frames in a list
(9 answers)
Closed 1 year ago.
Assuming the following list:
a <- data.frame(id = 1:3, x = 1:3)
b <- data.frame(id = 3:1, y = 4:6)
my_list <- list(a, b)
my_list
# [[1]]
#   id x
# 1  1 1
# 2  2 2
# 3  3 3
# [[2]]
#   id y
# 1  3 4
# 2  2 5
# 3  1 6
I now want to column bind the list elements into a data frame/tibble while matching the respective rows based on the id variable, i.e. the outcome should be:
# A tibble: 3 x 3
     id     x     y
  <int> <int> <int>
1     1     1     6
2     2     2     5
3     3     3     4
I know how I can do it with some pivoting, but I'm wondering if there's a smarter way of doing it and I hoped there was some binding function that simply allows for specifying an id column?
Current approach:
library(tidyverse)
my_list %>%
  bind_rows() %>%
  pivot_longer(cols = -id) %>%
  filter(!is.na(value)) %>%
  pivot_wider()
With purrr's reduce and a dplyr full_join:
reduce(my_list, full_join, by = 'id')
  id x y
1  1 1 6
2  2 2 5
3  3 3 4
If it's only two data frames:
invoke(full_join, my_list, by='id')
  id x y
1  1 1 6
2  2 2 5
3  3 3 4
If you are using base R, either of the following should work (do.call(merge, my_list) only works here because the list has exactly two elements):
Reduce(merge, my_list)
do.call(merge, my_list)
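As a quick sketch of why Reduce scales while do.call does not, add a hypothetical third element z_df to the list; Reduce merges pairwise through the list, whereas do.call(merge, ...) would pass z_df as merge's by argument and fail:
z_df <- data.frame(id = 1:3, z = 7:9)  # hypothetical extra element
Reduce(merge, list(a, b, z_df))
#   id x y z
# 1  1 1 6 7
# 2  2 2 5 8
# 3  3 3 4 9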

How to combine data points in a data frame in R?

The data frame x has a column in which the values are periodic. For each unique value in that column, I want to calculate the sum of the second column. If x is something like this:
x <- data.frame(a=c(1:2,1:2,1:2),b=c(1,4,5,2,3,4))
  a b
1 1 1
2 2 4
3 1 5
4 2 2
5 1 3
6 2 4
The output I want is the following data frame:
  a  b
1 1  9
2 2 10
Using aggregate as follows will get you your desired result:
aggregate(b ~ a, x, sum)
Here is an option with dplyr:
library(dplyr)
x %>%
  group_by(a) %>%
  summarise(b = sum(b))
# A tibble: 2 x 2
#       a     b
#   <int> <dbl>
# 1     1  9.00
# 2     2 10.0
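For reference, base R can do the same group sum without aggregate (not part of the original answers); tapply returns a named vector and rowsum a one-column matrix:
tapply(x$b, x$a, sum)
#  1  2
#  9 10
rowsum(x$b, x$a)
#   [,1]
# 1    9
# 2   10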

Filter all rows of a group according to specific member of group [duplicate]

This question already has an answer here:
How to filter (with dplyr) for all values of a group if variable limit is reached?
(1 answer)
Closed 5 years ago.
I want to filter an entire group based on a value at a specified row.
In the data below, I'd like to remove all rows of a group ID according to the value of Metric where Hour == '2'. (Note that I am not trying to filter based on two conditions here; I'm trying to filter based on one condition evaluated at a specific row.)
Sample data:
ID <- c('A','A','A','A','A','B','B','B','B','C','C')
Hour <- c('0','2','5','6','9','0','2','5','6','0','2')
Metric <- c(3,4,1,6,7,8,8,3,6,1,1)
x <- data.frame(ID, Hour, Metric)
   ID Hour Metric
1   A    0      3
2   A    2      4
3   A    5      1
4   A    6      6
5   A    9      7
6   B    0      8
7   B    2      8
8   B    5      3
9   B    6      6
10  C    0      1
11  C    2      1
I want to filter each ID based on whether Metric > 5 for Hour == '2'. The result should look like this (all rows of ID B are removed):
   ID Hour Metric
1   A    0      3
2   A    2      4
3   A    5      1
4   A    6      6
5   A    9      7
10  C    0      1
11  C    2      1
A dplyr-based solution would be preferred, but any help is much appreciated.
Adapting How to filter (with dplyr) for all values of a group if variable limit is reached?
we get:
x %>%
  group_by(ID) %>%
  filter(any(Metric[Hour == '2'] <= 5))
# # A tibble: 7 x 3
# # Groups: ID [2]
#       ID   Hour Metric
#   <fctr> <fctr>  <dbl>
# 1      A      0      3
# 2      A      2      4
# 3      A      5      1
# 4      A      6      6
# 5      A      9      7
# 6      C      0      1
# 7      C      2      1
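One caveat worth knowing (my addition, not from the original answer): any() over an empty vector is FALSE, so a group with no Hour == '2' row at all would also be dropped by this filter. If such groups should be kept, negate the opposite condition instead; on this data the result is identical:
x %>%
  group_by(ID) %>%
  filter(!any(Metric[Hour == '2'] > 5))  # groups with no Hour == '2' row pass, since any(logical(0)) is FALSE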
These types of problems can also be answered by first creating a by-group intermediate variable to flag whether rows should be removed.
Method 1:
x %>%
  group_by(ID) %>%
  mutate(keep_group = any(Metric[Hour == '2'] <= 5)) %>%
  ungroup %>%
  filter(keep_group) %>%
  select(-keep_group)
Method 2:
groups_to_keep <-
  x %>%
  filter(Hour == '2', Metric <= 5) %>%
  select(ID) %>%
  distinct()  # N.B. this sorts groups_to_keep by ID, which may not be desired
#   ID
# 1  A
# 2  C
x %>%
  inner_join(groups_to_keep, by = 'ID')
#   ID Hour Metric
# 1  A    0      3
# 2  A    2      4
# 3  A    5      1
# 4  A    6      6
# 5  A    9      7
# 6  C    0      1
# 7  C    2      1
Method 3, as suggested by @thelatemail (safe with respect to duplicates in ID):
groups_not_to_keep <-
  x %>%
  filter(Hour == 2, Metric > 5) %>%
  select(ID)
x %>%
  anti_join(groups_not_to_keep, by = 'ID')
A "not in" condition, i.e. !(x %in% y), should be useful here. Try this:
library(dplyr)
filter(x, Metric > 5 & Hour == '2')$ID # gives B
subset(x, !(ID %in% filter(x, Metric > 5 & Hour == '2')$ID))
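A grouped base R variant of the same logic (my sketch, not from the original answer): flag each row as acceptable, then keep an ID only if all of its rows pass:
x[ave(!(x$Metric > 5 & x$Hour == '2'), x$ID, FUN = all), ]  # TRUE only for IDs whose every row passes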

R, dplyr: cumulative version of n_distinct

I have a dataframe as follows. It is ordered by column time.
Input:
df = data.frame(time = 1:20,
                grp  = sort(rep(1:5, 4)),
                var1 = rep(c('A', 'B'), 10))
head(df, 10)
   time grp var1
1     1   1    A
2     2   1    B
3     3   1    A
4     4   1    B
5     5   2    A
6     6   2    B
7     7   2    A
8     8   2    B
9     9   3    A
10   10   3    B
I want to create another variable var2 which computes the number of distinct var1 values so far, i.e. up to that point in time, within each group grp. This is a little different from what I'd get if I were to use n_distinct.
Expected output:
   time grp var1 var2
1     1   1    A    1
2     2   1    B    2
3     3   1    A    2
4     4   1    B    2
5     5   2    A    1
6     6   2    B    2
7     7   2    A    2
8     8   2    B    2
9     9   3    A    1
10   10   3    B    2
I want to create a function, say cum_n_distinct, for this and use it as:
d_out = df %>%
  arrange(time) %>%
  group_by(grp) %>%
  mutate(var2 = cum_n_distinct(var1))
A dplyr solution inspired by @akrun's answer.
The logic is basically to set the 1st occurrence of each unique value of var1 to 1 and the rest to 0 within each group grp, and then apply cumsum on it:
df = df %>%
  arrange(time) %>%
  group_by(grp, var1) %>%
  mutate(var_temp = ifelse(row_number() == 1, 1, 0)) %>%
  group_by(grp) %>%
  mutate(var2 = cumsum(var_temp)) %>%
  select(-var_temp)
head(df, 10)
Source: local data frame [10 x 4]
Groups: grp

   time grp var1 var2
1     1   1    A    1
2     2   1    B    2
3     3   1    A    2
4     4   1    B    2
5     5   2    A    1
6     6   2    B    2
7     7   2    A    2
8     8   2    B    2
9     9   3    A    1
10   10   3    B    2
Assuming the data is ordered by time already, first define a cumulative distinct function:
dist_cum <- function(var)
  sapply(seq_along(var), function(x) length(unique(head(var, x))))
Then a base solution that uses ave to create groups (note: this assumes var1 is a factor) and applies our function to each group:
transform(df, var2=ave(as.integer(var1), grp, FUN=dist_cum))
A data.table solution, basically doing the same thing:
library(data.table)
(data.table(df)[, var2:=dist_cum(var1), by=grp])
And dplyr, again, same thing:
library(dplyr)
df %>% group_by(grp) %>% mutate(var2=dist_cum(var1))
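As a side note (not from the original answers), the helper can be avoided entirely: !duplicated() flags the first occurrence of each value within the group, and cumsum() turns those flags into a running distinct count:
df %>% group_by(grp) %>% mutate(var2 = cumsum(!duplicated(var1)))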
Update
With your new dataset, here is an approach in base R:
df$var2 <- unlist(lapply(split(df, df$grp), function(x) {
  x$var2 <- 0
  indx <- match(unique(x$var1), x$var1)
  x$var2[indx] <- 1
  cumsum(x$var2)
}))
head(df, 7)
#    time grp var1 var2
# 1     1   1    A    1
# 2     2   1    B    2
# 3     3   1    A    2
# 4     4   1    B    2
# 5     5   2    A    1
# 6     6   2    B    2
# 7     7   2    A    2
Here's another solution using data.table that's pretty quick.
Generic Function
cum_n_distinct <- function(x, na.include = TRUE){
  # Given a vector x, returns a corresponding vector y
  # where the ith element of y gives the number of unique
  # elements observed up to and including index i.
  # If na.include = TRUE (default), NA is counted as an
  # additional unique element; otherwise it's essentially ignored.
  temp <- data.table(x, idx = seq_along(x))
  firsts <- temp[temp[, .I[1L], by = x]$V1]
  if (na.include == FALSE) firsts <- firsts[!is.na(x)]
  y <- rep(0, times = length(x))
  y[firsts$idx] <- 1
  y <- cumsum(y)
  return(y)
}
Example Use
cum_n_distinct(c(5,10,10,15,5)) # 1 2 2 3 3
cum_n_distinct(c(5,NA,10,15,5)) # 1 2 3 4 4
cum_n_distinct(c(5,NA,10,15,5), na.include = FALSE) # 1 1 2 3 3
Solution To Your Question
d_out = df %>%
  arrange(time) %>%
  group_by(grp) %>%
  mutate(var2 = cum_n_distinct(var1))
