R data.table set specific columns to their last values, by group - r

Data looks like this:
Col1 Col2 Col3 Group
1 1 2 1
1 1 3 1
2 2 4 1
2 3 3 2
2 3 4 2
2 4 5 2
3 4 6 2
I want to set Col1 and Col3 to their LAST value, within Group
For instance, the last value of Col1 Group 2 is 3. So in Group 2, I want all values of Col1 to be set to 3.
Expected result:
Col1 Col2 Col3 Group
2 1 4 1
2 1 4 1
2 2 4 1
3 3 6 2
3 3 6 2
3 4 6 2
3 4 6 2
How can this be done with data.table?

We can use the tidyverse. Group by 'Group', use mutate_at to select the variables of interest, and replace each one with its last value within the group
library(dplyr)
df1 %>%
group_by(Group) %>%
mutate_at(vars(Col1, Col3), last)
# A tibble: 7 x 4
# Groups: Group [2]
# Col1 Col2 Col3 Group
# <int> <int> <int> <int>
#1 2 1 4 1
#2 2 1 4 1
#3 2 2 4 1
#4 3 3 6 2
#5 3 3 6 2
#6 3 4 6 2
#7 3 4 6 2
Or with data.table, using the same logic (if the object is not a data.table, convert it with setDT): specify the columns of interest in .SDcols, loop over the Subset of Data.table (.SD) with lapply, take the last value of each column, and assign (:=) it back to those columns
library(data.table)
nm1 <- c("Col1", "Col3")
setDT(df1)[, (nm1) := lapply(.SD, last), by = Group, .SDcols = nm1]
data
df1 <- structure(list(Col1 = c(1L, 1L, 2L, 2L, 2L, 2L, 3L), Col2 = c(1L,
1L, 2L, 3L, 3L, 4L, 4L), Col3 = c(2L, 3L, 4L, 3L, 4L, 5L, 6L),
Group = c(1L, 1L, 1L, 2L, 2L, 2L, 2L)), class = "data.frame",
row.names = c(NA,
-7L))

library(data.table)
cols <- c("Col1", "Col3")
DT[, (cols) := .SD[.N], by = Group, .SDcols = cols][]
# Col1 Col2 Col3 Group
# 1: 2 1 4 1
# 2: 2 1 4 1
# 3: 2 2 4 1
# 4: 3 3 6 2
# 5: 3 3 6 2
# 6: 3 4 6 2
# 7: 3 4 6 2
Data
DT <- fread("Col1 Col2 Col3 Group
1 1 2 1
1 1 3 1
2 2 4 1
2 3 3 2
2 3 4 2
2 4 5 2
3 4 6 2")
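For comparison, the same result can be sketched in base R with ave, which applies a function within groups and returns a vector of the original length. This is an illustrative alternative rather than part of the answers above, and the helper name last_value is made up for the example.

```r
# Same example data as in the question
df1 <- data.frame(
  Col1  = c(1L, 1L, 2L, 2L, 2L, 2L, 3L),
  Col2  = c(1L, 1L, 2L, 3L, 3L, 4L, 4L),
  Col3  = c(2L, 3L, 4L, 3L, 4L, 5L, 6L),
  Group = c(1L, 1L, 1L, 2L, 2L, 2L, 2L)
)

# Hypothetical helper: last element of a vector
last_value <- function(v) v[length(v)]

cols <- c("Col1", "Col3")
# ave() recycles the single last value over every row of its group
df1[cols] <- lapply(df1[cols], function(x) ave(x, df1$Group, FUN = last_value))
df1
```

Col2 is left untouched because it is not listed in cols.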


How to add new rows conditionally on R

I have a df with
v1 t1 c1 o1
1 1 9 1
1 1 12 2
1 2 2 1
1 2 7 2
2 1 3 1
2 1 6 2
2 2 3 1
2 2 12 2
And I would like to add 2 rows each time that v1 changes its value, in order to get this:
v1 t1 c1 o1
1 1 1 1
1 1 1 2
1 2 9 1
1 2 12 2
1 3 2 1
1 3 7 2
2 1 1 1
2 1 1 2
2 2 3 1
2 2 6 2
2 3 3 1
2 3 12 2
So what I'm doing is that every time v1 changes its value I'm adding 2 rows of ones and adding a 1 to the values of t1. This is kind of tricky. I've been able to do it in Excel but I would like to scale to big files in R.
We may do the expansion in group_modify
library(dplyr)
df1 %>%
group_by(v1) %>%
group_modify(~ .x %>%
slice_head(n = 2) %>%
mutate(across(-o1, ~ 1)) %>%
bind_rows(.x) %>%
mutate(t1 = as.integer(gl(n(), 2, n())))) %>%
ungroup
-output
# A tibble: 12 × 4
v1 t1 c1 o1
<int> <int> <dbl> <int>
1 1 1 1 1
2 1 1 1 2
3 1 2 9 1
4 1 2 12 2
5 1 3 2 1
6 1 3 7 2
7 2 1 1 1
8 2 1 1 2
9 2 2 3 1
10 2 2 6 2
11 2 3 3 1
12 2 3 12 2
Or do a group by summarise
df1 %>%
group_by(v1) %>%
summarise(t1 = as.integer(gl(n() + 2, 2, n() + 2)),
c1 = c(1, 1, c1), o1 = rep(1:2, length.out = n() + 2),
.groups = 'drop')
-output
# A tibble: 12 × 4
v1 t1 c1 o1
<int> <int> <dbl> <int>
1 1 1 1 1
2 1 1 1 2
3 1 2 9 1
4 1 2 12 2
5 1 3 2 1
6 1 3 7 2
7 2 1 1 1
8 2 1 1 2
9 2 2 3 1
10 2 2 6 2
11 2 3 3 1
12 2 3 12 2
data
df1 <- structure(list(v1 = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L), t1 = c(1L,
1L, 2L, 2L, 1L, 1L, 2L, 2L), c1 = c(9L, 12L, 2L, 7L, 3L, 6L,
3L, 12L), o1 = c(1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L)),
class = "data.frame", row.names = c(NA,
-8L))
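A base R sketch of the same expansion, for readers who want to avoid a dplyr dependency. It assumes, as in the example data, that shifting the existing t1 values up by one is equivalent to regenerating them; the function name expand_group is hypothetical.

```r
df1 <- data.frame(
  v1 = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L),
  t1 = c(1L, 1L, 2L, 2L, 1L, 1L, 2L, 2L),
  c1 = c(9L, 12L, 2L, 7L, 3L, 6L, 3L, 12L),
  o1 = c(1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L)
)

# Hypothetical helper: prepend two rows of ones and shift t1 by one
expand_group <- function(g) {
  header <- data.frame(v1 = g$v1[1], t1 = 1L, c1 = 1L, o1 = 1:2)
  g$t1 <- g$t1 + 1L
  rbind(header, g)
}

# Split by v1, expand each piece, and stack the pieces again
res <- do.call(rbind, lapply(split(df1, df1$v1), expand_group))
rownames(res) <- NULL
res
```

As in the answers above, o1 keeps its 1, 2 pattern in the added rows while t1 and c1 are set to 1.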

Count number of observations by group

I'm trying to count the number of non-missing observations of each variable within each group of a dataset.
The data looks like this:
grp v1 vn
1 2 5
2 4 NA
3 3 4
1 NA 3
1 2 12
4 NA 5
5 3 6
5 6 NA
The Result should be a table like this:
grp v1 vn
1 2 3
2 1 0
3 1 1
4 0 1
5 2 1
I tried to use
x %>% group_by(grp) %>% summarise(across(everything(),n = n()))
but it didn't really work.
Any help is appreciated. Thanks in advance!
You can also use the following solution:
library(dplyr)
df %>%
group_by(grp) %>%
summarise(across(v1:vn, ~ sum(!is.na(.x))))
# A tibble: 5 x 3
grp v1 vn
<int> <int> <int>
1 1 2 3
2 2 1 0
3 3 1 1
4 4 0 1
5 5 2 1
Get the data in long format, count non-NA values for each column in each group and get the data in wide format.
library(dplyr)
library(tidyr)
df %>%
pivot_longer(cols = -grp) %>%
group_by(grp, name) %>%
summarise(n = sum(!is.na(value))) %>%
ungroup %>%
pivot_wider(names_from = name, values_from = n)
# grp v1 vn
# <int> <int> <int>
#1 1 2 3
#2 2 1 0
#3 3 1 1
#4 4 0 1
#5 5 2 1
data
df <- structure(list(grp = c(1L, 2L, 3L, 1L, 1L, 4L, 5L, 5L), v1 = c(2L,
4L, 3L, NA, 2L, NA, 3L, 6L), vn = c(5L, NA, 4L, 3L, 2L, 5L, 6L,
NA)), class = "data.frame", row.names = c(NA, -8L))
Using data.table
library(data.table)
setDT(df)[, lapply(.SD, function(x) sum(!is.na(x))), grp]
# grp v1 vn
#1: 1 2 3
#2: 2 1 0
#3: 3 1 1
#4: 4 0 1
#5: 5 2 1
Using aggregate.
aggregate(cbind(v1, vn) ~ grp, replace(dat, is.na(dat), 0), function(x) sum(as.logical(x)))
# grp v1 vn
# 1 1 2 3
# 2 2 1 0
# 3 3 1 1
# 4 4 0 1
# 5 5 2 1
Data:
dat <- read.table(header=T, text='grp v1 vn
1 2 5
2 4 NA
3 3 4
1 NA 3
1 2 12
4 NA 5
5 3 6
5 6 NA
')
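The aggregate call above replaces NA with 0 and then counts non-zero entries, so a genuine 0 in the data would not be counted. A sketch that counts non-missing values directly is shown below; it relies on passing na.action = na.pass so that the formula interface keeps rows containing NA.

```r
dat <- read.table(header = TRUE, text = "grp v1 vn
1 2 5
2 4 NA
3 3 4
1 NA 3
1 2 12
4 NA 5
5 3 6
5 6 NA
")

# na.pass keeps rows with NA; FUN then counts non-missing values per column
res <- aggregate(cbind(v1, vn) ~ grp, data = dat,
                 FUN = function(x) sum(!is.na(x)),
                 na.action = na.pass)
res
```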

Define an indicator when the number of duplicate rows minus 1 equals the value of one of the columns

I have some duplicate rows that are the same in some columns. I want to define an indicator that is 1 if the number of duplicate rows minus 1 equals the value of one of the columns.
Example:
SAMPN PERNO ARR_HR HHMEM
1 1 2 1
1 2 2 1
2 1 3 2
2 3 3 2
3 1 4 2
3 2 4 2
3 3 4 2
Rows are duplicates if they are the same in the first, second and third columns. I want the indicator to be 1 if the number of duplicate rows minus 1 equals HHMEM.
For example, the first 2 rows are duplicates, so 2 - 1 = 1 = HHMEM and the indicator is 1.
Output:
SAMPN PERNO ARR_HR HHMEM indicator
1 1 2 1 1
1 2 2 1 1
2 1 3 2 0
2 3 3 2 0
3 1 4 2 1
3 2 4 2 1
3 3 4 2 1
After grouping by 'SAMPN' and other grouping variables (from OP's comments) create the 'indicator' by coercing the logical vector ((n()- 1) == HHMEM) into binary with as.integer
library(dplyr)
df1 %>%
group_by(SAMPN, ARR_HR, HHMEM) %>%
mutate(indicator = as.integer((n()-1) == HHMEM))
# A tibble: 7 x 5
# Groups: SAMPN [3]
# SAMPN PERNO ARR_HR HHMEM indicator
# <int> <int> <int> <int> <int>
#1 1 1 2 1 1
#2 1 2 2 1 1
#3 2 1 3 2 0
#4 2 3 3 2 0
#5 3 1 4 2 1
#6 3 2 4 2 1
#7 3 3 4 2 1
NOTE: We don't need to create any additional column and then remove it later
Or the same logic in base R with ave
df1$indicator <- +(with(df1, HHMEM == ave(HHMEM, HHMEM, SAMPN,
ARR_HR, FUN = length)-1))
Or using duplicated with table
i1 <- table(cumsum(!duplicated(df1[c(1, 3, 4)])))
as.integer(rep(i1, i1) - 1 == df1$HHMEM)
data
df1 <- structure(list(SAMPN = c(1L, 1L, 2L, 2L, 3L, 3L, 3L), PERNO = c(1L,
2L, 1L, 3L, 1L, 2L, 3L), ARR_HR = c(2L, 2L, 3L, 3L, 4L, 4L, 4L
), HHMEM = c(1L, 1L, 2L, 2L, 2L, 2L, 2L)), class = "data.frame",
row.names = c(NA,
-7L))
We can use add_count to get count and compare it with HHMEM.
library(dplyr)
df1 %>%
add_count(SAMPN, ARR_HR, HHMEM) %>%
mutate(indicator = as.integer(n - 1 == HHMEM)) %>%
select(-n)
# SAMPN PERNO ARR_HR HHMEM indicator
# <int> <int> <int> <int> <int>
#1 1 1 2 1 1
#2 1 2 2 1 1
#3 2 1 3 2 0
#4 2 3 3 2 0
#5 3 1 4 2 1
#6 3 2 4 2 1
#7 3 3 4 2 1
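One caveat worth illustrating: the duplicated/cumsum variant numbers groups by their first appearance, so it assumes duplicate rows sit next to each other. The sketch below uses ave to compute the group size per row, which does not depend on row order. The helper name add_indicator is hypothetical.

```r
df1 <- data.frame(
  SAMPN  = c(1L, 1L, 2L, 2L, 3L, 3L, 3L),
  PERNO  = c(1L, 2L, 1L, 3L, 1L, 2L, 3L),
  ARR_HR = c(2L, 2L, 3L, 3L, 4L, 4L, 4L),
  HHMEM  = c(1L, 1L, 2L, 2L, 2L, 2L, 2L)
)

# Hypothetical helper: 1 when (group size - 1) equals HHMEM, else 0
add_indicator <- function(d) {
  n <- ave(seq_len(nrow(d)), d$SAMPN, d$ARR_HR, d$HHMEM, FUN = length)
  d$indicator <- as.integer(n - 1L == d$HHMEM)
  d
}

res <- add_indicator(df1)

# Same rows in a different order: each row still gets its group's indicator
shuffled <- add_indicator(df1[c(1, 3, 5, 2, 4, 6, 7), ])
```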

How do I count only previous value not using summarize in R?

This is my dataset.
num col1
1 SENSOR_01
2 SENSOR_01
3 SENSOR_01
4 SENSOR_05
5 SENSOR_05
6 SENSOR_05
7 NA
8 SENSOR_01
9 SENSOR_01
10 SENSOR_05
11 SENSOR_05
structure(list(num = 1:11, col1 = structure(c(1L, 1L, 1L, 2L, 2L, 2L,
NA, 1L, 1L, 2L, 2L), .Label = c("SENSOR_01", "SENSOR_05" ), class =
"factor"), count = c(3L, 3L, 3L, 3L, 3L, 3L, 0L, 2L, 2L, 2L, 2L)),
class = "data.frame", row.names = c(NA, -11L))
I would like to count only consecutive duplicated rows. In rows 1-3, SENSOR_01 is repeated 3 times, so count = 3. Here is my expected outcome.
num col1 count
1 SENSOR_01 3
2 SENSOR_01 3
3 SENSOR_01 3
4 SENSOR_05 3
5 SENSOR_05 3
6 SENSOR_05 3
7 NA 1
8 SENSOR_01 2
9 SENSOR_01 2
10 SENSOR_05 2
11 SENSOR_05 2
How can I produce this outcome using dplyr?
We can use rleid to create groups and then count number of rows in each group.
library(dplyr)
df %>%
group_by(group = data.table::rleid(col1)) %>%
mutate(n = n()) %>%
ungroup() %>%
dplyr::select(-group)
# A tibble: 11 x 4
# num col1 count n
# <int> <fct> <int> <int>
# 1 1 SENSOR_01 3 3
# 2 2 SENSOR_01 3 3
# 3 3 SENSOR_01 3 3
# 4 4 SENSOR_05 3 3
# 5 5 SENSOR_05 3 3
# 6 6 SENSOR_05 3 3
# 7 7 NA 1 1
# 8 8 SENSOR_01 2 2
# 9 9 SENSOR_01 2 2
#10 10 SENSOR_05 2 2
#11 11 SENSOR_05 2 2
Keeping both the columns for comparison purposes.
Or using data.table
library(data.table)
setDT(df)[, n := .N, by = rleid(col1)]
As another option, we can use the row order (the row names in a traditional data.frame). The idea is simple:
Within each group of identical sensor names, set a flag to 1 where the gap between a record's global row number and that of the previous record in the group is greater than 1, and to 0 otherwise (0 for the first record of the group);
Still within the group of identical sensor names, take the cumulative sum of the flags, which identifies each subgroup of records that appear consecutively in the full data set;
Still within the group, count the number of elements in each individual subgroup;
Repeat for each group of records.
In tidyverse:
dat %>%
mutate(tmp = 1:n()) %>%
group_by(col1) %>%
add_count(tmp = cumsum(c(0, diff(tmp)) > 1)) %>%
ungroup() %>%
select(-tmp)
# # A tibble: 11 x 3
# num col1 n
# <int> <fct> <int>
# 1 1 SENSOR_01 3
# 2 2 SENSOR_01 3
# 3 3 SENSOR_01 3
# 4 4 SENSOR_05 3
# 5 5 SENSOR_05 3
# 6 6 SENSOR_05 3
# 7 7 NA 1
# 8 8 SENSOR_01 2
# 9 9 SENSOR_01 2
# 10 10 SENSOR_05 2
# 11 11 SENSOR_05 2
Data:
dat <- structure(
list(
num = 1:11,
col1 = structure(
c(1L, 1L, 1L, 2L, 2L, 2L, NA, 1L, 1L, 2L, 2L),
.Label = c("SENSOR_01", "SENSOR_05" ),
class = "factor")
),
class = "data.frame",
row.names = c(NA, -11L)
)
We can use base R with rle to create the 'count' column
df$count <- with(rle(df$col1), rep(lengths, lengths))
df$count
#[1] 3 3 3 3 3 3 1 2 2 2 2
Or the dplyr implementation of the above
library(dplyr)
df %>%
mutate(count = with(rle(col1), rep(lengths, lengths)))
Or an option with the tidyverse only (replace_na comes from tidyr, so it is loaded as well)
library(dplyr)
library(tidyr)
df %>%
group_by(grp = replace_na(col1, "VALUE"),
grp = cumsum(grp != lag(grp, default = first(grp)))) %>%
mutate(count = n()) %>%
ungroup %>%
select(-grp)
# A tibble: 11 x 3
# num col1 count
# <int> <chr> <int>
# 1 1 SENSOR_01 3
# 2 2 SENSOR_01 3
# 3 3 SENSOR_01 3
# 4 4 SENSOR_05 3
# 5 5 SENSOR_05 3
# 6 6 SENSOR_05 3
# 7 7 <NA> 1
# 8 8 SENSOR_01 2
# 9 9 SENSOR_01 2
#10 10 SENSOR_05 2
#11 11 SENSOR_05 2
data
df <- structure(list(num = 1:11, col1 = c("SENSOR_01", "SENSOR_01",
"SENSOR_01", "SENSOR_05", "SENSOR_05", "SENSOR_05", NA, "SENSOR_01",
"SENSOR_01", "SENSOR_05", "SENSOR_05")),
class = "data.frame", row.names = c(NA,
-11L))
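Note that rle only accepts plain atomic vectors, so it stops with an error when col1 is a factor, as in the question's dput output. A minimal base R sketch that converts to character first (assuming the factor labels are the values of interest):

```r
df <- data.frame(
  num  = 1:11,
  col1 = factor(c("SENSOR_01", "SENSOR_01", "SENSOR_01",
                  "SENSOR_05", "SENSOR_05", "SENSOR_05", NA,
                  "SENSOR_01", "SENSOR_01", "SENSOR_05", "SENSOR_05"))
)

# rle() rejects factors, so work on the character labels instead
runs <- rle(as.character(df$col1))

# Repeat each run length once per row of that run
df$count <- rep(runs$lengths, runs$lengths)
df$count
```

Each NA forms its own run in rle, which is why the single NA row gets a count of 1.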

Creating a new variable based on the orders of existing variables using R

I am hoping to create a new variable X based on three existing variables: "SubID", "Day" and "Time". I used to apply three sorts in Excel to do this manually: first by "SubID", then by "Day", and lastly by "Time". X should run from 1 to the number of rows for each SubID, following the order of Day and Time.
SubID: assigned subject number
Day: each subject's day number (1,2,3...21)
Time: 1, 2, 3
X: the number of rows marked as the same SubID
SubID Day Time X
1 1 1 1
1 1 2 2
1 1 3 3
1 2 1 4
1 2 2 5
2 1 1 1
2 1 2 2
2 1 3 3
2 2 3 6
2 2 2 5
2 2 1 4
I have been doing this manually in Excel and I am sure there must be a smarter way to do it in R, but I am new to R and don't know how. Thank you in advance!
This can be done with the data.table package. You will have to install it if you haven't already; the install command is commented out below.
# install.packages("data.table")
library(data.table)
We can generate data similar to yours in the following way.
df <- data.frame(SubId=sample(1:2,10,replace=TRUE),
Day=sample(1:2,10,replace=TRUE),
Time=sample(1:2,10,replace=TRUE))
Then convert the data.frame into data.table.
setDT(df)
##> df
## SubId Day Time
## 1: 1 2 1
## 2: 1 1 1
## 3: 1 1 2
## 4: 2 2 1
## 5: 2 1 1
## 6: 1 2 2
## 7: 1 2 1
## 8: 1 2 2
## 9: 2 1 1
## 10: 2 1 2
Finally, we can order by SubId, Day and Time. Once the rows are ordered as we want, we just have to number them from 1 to the number of observations in each SubId.
df[order(SubId,Day,Time),X:=1:.N,SubId]
##> df
## SubId Day Time X
## 1: 1 2 1 3
## 2: 1 1 1 1
## 3: 1 1 2 2
## 4: 2 2 1 4
## 5: 2 1 1 1
## 6: 1 2 2 5
## 7: 1 2 1 4
## 8: 1 2 2 6
## 9: 2 1 1 2
## 10: 2 1 2 3
Maybe this helps
library(dplyr)
df1 %>%
group_by(SubID) %>%
mutate(X1 = row_number(as.numeric(paste0(Day, Time))))
# A tibble: 11 x 5
# Groups: SubID [2]
# SubID Day Time X X1
# <int> <int> <int> <int> <int>
# 1 1 1 1 1 1
# 2 1 1 2 2 2
# 3 1 1 3 3 3
# 4 1 2 1 4 4
# 5 1 2 2 5 5
# 6 2 1 1 1 1
# 7 2 1 2 2 2
# 8 2 1 3 3 3
# 9 2 2 3 6 6
#10 2 2 2 5 5
#11 2 2 1 4 4
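Or using order. Note that order returns the sorting permutation rather than each row's rank, so it is applied twice to get the position of each row (for the example data both give the same result)
df1 %>%
group_by(SubID) %>%
mutate(X1 = order(order(Day, Time)))
Or with data.table
library(data.table)
setDT(df1)[, X1 := order(order(Day, Time)), by = SubID]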
data
df1 <- structure(list(SubID = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L,
2L, 2L), Day = c(1L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 2L, 2L, 2L),
Time = c(1L, 2L, 3L, 1L, 2L, 1L, 2L, 3L, 3L, 2L, 1L), X = c(1L,
2L, 3L, 4L, 5L, 1L, 2L, 3L, 6L, 5L, 4L)), class = "data.frame",
row.names = c(NA,
-11L))
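One thing worth checking with order(): it returns the permutation that sorts the rows, not the rank of each row, and the two only coincide for some inputs. The base R sketch below uses a small made-up data set where they differ, and numbers the rows within SubID by writing a running index back through the sort order.

```r
# Made-up example in which the rows are not already close to sorted
df <- data.frame(
  SubID = c(1L, 1L, 1L, 2L, 2L),
  Day   = c(2L, 1L, 1L, 1L, 1L),
  Time  = c(1L, 1L, 2L, 2L, 1L)
)

# Row indices in sorted order (by SubID, then Day, then Time)
ord <- order(df$SubID, df$Day, df$Time)

# Number the sorted rows 1..n within each SubID, then write the
# numbers back to the original row positions
df$X <- NA_integer_
df$X[ord] <- ave(seq_along(ord), df$SubID[ord], FUN = seq_along)
df

# For SubID 1 alone: permutation versus rank
perm <- order(c(2L, 1L, 1L), c(1L, 1L, 2L))         # 2 3 1
rnk  <- order(order(c(2L, 1L, 1L), c(1L, 1L, 2L)))  # 3 1 2
```

Here X for the first row is 3, because (Day 2, Time 1) sorts last within SubID 1, whereas a single order() call would have given 2.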
