How to remove observations in R dependent on a specific condition

I am trying to drop observations in R from my dataset. I need each Person_ID to have wave 0 AND (wave 1 OR wave 3 OR wave 6 OR wave 12 OR wave 18). Can someone help me?
Initial dataset
Person_ID wave
1 0
1 1
1 3
1 6
1 12
1 18
2 0
3 0
3 1
4 6
4 12
Wanted result
Person_ID wave
1 0
1 1
1 3
1 6
1 12
1 18
3 0
3 1
Thanks!

You can do a grouped filter. We keep a person if both 0 and any of 1, 3, 6, 12, 18 are in their corresponding wave values.
library(tidyverse)
tbl <- read_table2(
"Person_ID wave
1 0
1 1
1 3
1 6
1 12
1 18
2 0
3 0
3 1
4 6
4 12"
)
tbl %>%
group_by(Person_ID) %>%
filter(0 %in% wave, any(c(1, 3, 6, 12, 18) %in% wave))
#> # A tibble: 8 x 2
#> # Groups: Person_ID [2]
#> Person_ID wave
#> <dbl> <dbl>
#> 1 1 0
#> 2 1 1
#> 3 1 3
#> 4 1 6
#> 5 1 12
#> 6 1 18
#> 7 3 0
#> 8 3 1
Created on 2019-03-25 by the reprex package (v0.2.1)
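To see why the two conditions inside filter() do the job, it can help to evaluate the same test on each person's waves outside of dplyr. This is only a sketch: has_baseline_and_followup and waves_by_person are hypothetical names made up for the illustration, not part of the answer above.

```r
# Hypothetical helper: TRUE when a person's waves contain the baseline
# wave 0 and at least one of the follow-up waves.
has_baseline_and_followup <- function(wave, followups = c(1, 3, 6, 12, 18)) {
  0 %in% wave && any(followups %in% wave)
}

# Waves per person, copied from the question's example data
waves_by_person <- list(
  `1` = c(0, 1, 3, 6, 12, 18),
  `2` = 0,
  `3` = c(0, 1),
  `4` = c(6, 12)
)

# One logical per person: persons 1 and 3 pass, 2 and 4 do not
keep <- vapply(waves_by_person, has_baseline_and_followup, logical(1))
keep
```

Inside a grouped filter() each condition is evaluated once per Person_ID, so a single TRUE or FALSE keeps or drops all rows of that person.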

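We can also do this in base R, keeping the IDs that appear both among the follow-up waves and among the wave-0 rows (!wave is TRUE only where wave is 0)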
df1[with(df1, Person_ID %in% intersect(Person_ID[wave %in% c(1, 3, 6, 12, 18)],
Person_ID[!wave])),]
# Person_ID wave
#1 1 0
#2 1 1
#3 1 3
#4 1 6
#5 1 12
#6 1 18
#8 3 0
#9 3 1
data
df1 <- structure(list(Person_ID = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 3L,
3L, 4L, 4L), wave = c(0L, 1L, 3L, 6L, 12L, 18L, 0L, 0L, 1L, 6L,
12L)), class = "data.frame", row.names = c(NA, -11L))
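The one-liner above can be unpacked into named steps, which may be easier to check. A minimal sketch, assuming the same df1; followup_ids, baseline_ids and keep_ids are names introduced here for illustration only.

```r
df1 <- data.frame(
  Person_ID = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 3L, 3L, 4L, 4L),
  wave      = c(0L, 1L, 3L, 6L, 12L, 18L, 0L, 0L, 1L, 6L, 12L)
)

# IDs with at least one follow-up wave
followup_ids <- unique(df1$Person_ID[df1$wave %in% c(1, 3, 6, 12, 18)])

# IDs with a baseline row (wave == 0 is the explicit form of !wave)
baseline_ids <- unique(df1$Person_ID[df1$wave == 0])

# IDs that satisfy both conditions
keep_ids <- intersect(followup_ids, baseline_ids)

# Keep every row belonging to those IDs
result <- df1[df1$Person_ID %in% keep_ids, ]
result
```

Writing wave == 0 instead of !wave gives the same selection here and does not rely on 0 being coerced to FALSE.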

Related

How to add new rows conditionally on R

I have a df with
v1 t1 c1 o1
1 1 9 1
1 1 12 2
1 2 2 1
1 2 7 2
2 1 3 1
2 1 6 2
2 2 3 1
2 2 12 2
And I would like to add 2 rows each time that v1 changes its value, in order to get this:
v1 t1 c1 o1
1 1 1 1
1 1 1 2
1 2 9 1
1 2 12 2
1 3 2 1
1 3 7 2
2 1 1 1
2 1 1 2
2 2 3 1
2 2 6 2
2 3 3 1
2 3 12 2
So what I'm doing is that every time v1 changes its value I'm adding 2 rows of ones and adding a 1 to the values of t1. This is kind of tricky. I've been able to do it in Excel but I would like to scale to big files in R.
We may do the expansion in group_modify
library(dplyr)
df1 %>%
group_by(v1) %>%
group_modify(~ .x %>%
slice_head(n = 2) %>%
mutate(across(-o1, ~ 1)) %>%
bind_rows(.x) %>%
mutate(t1 = as.integer(gl(n(), 2, n())))) %>%
ungroup
-output
# A tibble: 12 × 4
v1 t1 c1 o1
<int> <int> <dbl> <int>
1 1 1 1 1
2 1 1 1 2
3 1 2 9 1
4 1 2 12 2
5 1 3 2 1
6 1 3 7 2
7 2 1 1 1
8 2 1 1 2
9 2 2 3 1
10 2 2 6 2
11 2 3 3 1
12 2 3 12 2
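The t1 column in the code above comes from gl(), which may not be obvious at first sight. A small sketch of what that call produces for a group of 6 rows (n_rows stands in for n() and is just an illustrative name):

```r
n_rows <- 6

# gl(n, k, length): n levels, each repeated k times, cut to `length` values.
# Converted to integer this numbers consecutive pairs of rows.
pair_index <- as.integer(gl(n_rows, 2, n_rows))
pair_index

# The same numbering written with rep(), assuming n_rows is even
pair_index_alt <- rep(seq_len(n_rows / 2), each = 2)
pair_index_alt
```

Both vectors are 1 1 2 2 3 3, which is the renumbered t1 for one v1 group after the two extra rows have been added.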
Or do a group by summarise
df1 %>%
group_by(v1) %>%
summarise(t1 = as.integer(gl(n() + 2, 2, n() + 2)),
c1 = c(1, 1, c1), o1 = rep(1:2, length.out = n() + 2),
.groups = 'drop')
-output
# A tibble: 12 × 4
v1 t1 c1 o1
<int> <int> <dbl> <int>
1 1 1 1 1
2 1 1 1 2
3 1 2 9 1
4 1 2 12 2
5 1 3 2 1
6 1 3 7 2
7 2 1 1 1
8 2 1 1 2
9 2 2 3 1
10 2 2 6 2
11 2 3 3 1
12 2 3 12 2
data
df1 <- structure(list(v1 = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L), t1 = c(1L,
1L, 2L, 2L, 1L, 1L, 2L, 2L), c1 = c(9L, 12L, 2L, 7L, 3L, 6L,
3L, 12L), o1 = c(1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L)),
class = "data.frame", row.names = c(NA,
-8L))
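For comparison, the same expansion can be sketched in base R with split() and lapply(). This is not from the answer above; add_leading_pair is a hypothetical helper, and it assumes every v1 group has an even number of rows (at least two) whose first two rows carry o1 = 1 and 2.

```r
df1 <- data.frame(
  v1 = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L),
  t1 = c(1L, 1L, 2L, 2L, 1L, 1L, 2L, 2L),
  c1 = c(9L, 12L, 2L, 7L, 3L, 6L, 3L, 12L),
  o1 = c(1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L)
)

# Hypothetical helper: prepend two rows of ones to a group, then renumber t1
add_leading_pair <- function(g) {
  header <- g[1:2, ]          # reuse the first pair to keep v1 and o1
  header$c1 <- 1L             # the new rows carry c1 = 1
  out <- rbind(header, g)
  out$t1 <- rep(seq_len(nrow(out) / 2), each = 2)  # 1 1 2 2 3 3 ...
  out
}

expanded <- do.call(rbind, lapply(split(df1, df1$v1), add_leading_pair))
rownames(expanded) <- NULL
expanded
```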

Select first and last occurrences of a value in a given month

I have a dataset that records the changes in a group from a certain ID, in a given month.
In the example, in July, the ID 5 changed from group 2 to group 1, then from group 1 to 2, and so on.
I need to get only the first and the last changes made in this ID/month.
ID groupTO groupFROM MONTH
5 2 1 6
5 1 2 7
5 2 1 7
5 3 2 7
5 1 3 7
5 2 1 8
5 1 2 8
5 2 1 8
6 1 2 6
6 3 1 6
6 2 1 7
6 3 2 8
6 1 3 8
In this case, I need the results to be:
ID groupTO groupFROM MONTH
5 2 1 6
5 1 2 7
5 1 3 7
5 2 1 8
5 2 1 8
6 1 2 6
6 3 1 6
6 2 1 7
6 3 2 8
6 1 3 8
If I remove the duplicates (ID/MONTH), I can get the first occurrence, but how do I get the last one?
Here's an easy way to do it with dplyr:
library(dplyr)
# Create data
dt <-
data.frame(Id = c(rep(5, 8), rep(6, 5)),
groupTO = c(2, 1, 2, 3, 1, 2, 1, 2, 1, 3, 2, 3, 1),
groupFROM = c(1, 2, 1, 2, 3, 1, 2, 1, 2, 1, 1, 2, 3),
MONTH = c(6, 7, 7, 7, 7, 8, 8, 8, 6, 6, 7, 8, 8))
dt %>%
# Group by ID and month
group_by(Id, MONTH) %>%
# Get first and last row
slice(c(1, n())) %>%
# To remove cases where first is same as last
distinct()
# # A tibble: 9 x 4
# # Groups: Id, MONTH [6]
# Id groupTO groupFROM MONTH
# <dbl> <dbl> <dbl> <dbl>
# 5 2 1 6
# 5 1 2 7
# 5 1 3 7
# 5 2 1 8
# 6 1 2 6
# 6 3 1 6
# 6 2 1 7
# 6 3 2 8
# 6 1 3 8
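Note that distinct() also drops a last row that happens to have the same values as the first one (ID 5 in month 8), so the result above has 9 rows while the question lists 10. If both rows should be kept, one possible variant is to deduplicate the row positions instead of the row contents. A sketch under that assumption, using the same dt:

```r
library(dplyr)

dt <- data.frame(
  Id        = c(rep(5, 8), rep(6, 5)),
  groupTO   = c(2, 1, 2, 3, 1, 2, 1, 2, 1, 3, 2, 3, 1),
  groupFROM = c(1, 2, 1, 2, 3, 1, 2, 1, 2, 1, 1, 2, 3),
  MONTH     = c(6, 7, 7, 7, 7, 8, 8, 8, 6, 6, 7, 8, 8)
)

first_last <- dt %>%
  group_by(Id, MONTH) %>%
  # unique() collapses c(1, 1) for single-row groups, so a lone row is
  # kept once, while first and last rows at different positions both stay
  slice(unique(c(1, n()))) %>%
  ungroup()

nrow(first_last)
```

Which behaviour is wanted depends on whether two identical changes in the same month count as one change or two.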
Using data.table
library(data.table)
unique(setDT(df1)[, .SD[c(1, .N)], .(ID, MONTH)])
# ID MONTH groupTO groupFROM
#1: 5 6 2 1
#2: 5 7 1 2
#3: 5 7 1 3
#4: 5 8 2 1
#5: 6 6 1 2
#6: 6 6 3 1
#7: 6 7 2 1
#8: 6 8 3 2
#9: 6 8 1 3
data
df1 <- structure(list(ID = c(5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 6L, 6L,
6L, 6L, 6L), groupTO = c(2L, 1L, 2L, 3L, 1L, 2L, 1L, 2L, 1L,
3L, 2L, 3L, 1L), groupFROM = c(1L, 2L, 1L, 2L, 3L, 1L, 2L, 1L,
2L, 1L, 1L, 2L, 3L), MONTH = c(6L, 7L, 7L, 7L, 7L, 8L, 8L, 8L,
6L, 6L, 7L, 8L, 8L)), class = "data.frame", row.names = c(NA,
-13L))
Here is a base R solution using split
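# drop = TRUE skips Id/MONTH combinations that never occur in the data
dfout <- do.call(rbind, c(make.row.names = FALSE,
  lapply(split(df, df[c("Id", "MONTH")], lex.order = TRUE, drop = TRUE),
    function(v) if (nrow(v) == 1) v[1, ] else v[c(1, nrow(v)), ])))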
such that
> dfout
Id groupTO groupFROM MONTH
1 5 2 1 6
2 5 1 2 7
3 5 1 3 7
4 5 2 1 8
5 5 2 1 8
6 6 1 2 6
7 6 3 1 6
8 6 2 1 7
9 6 3 2 8
10 6 1 3 8
A base R approach with ave: flag the first and last row within each ID and MONTH, then keep the unique rows of the data frame.
unique(subset(df, ave(groupTO == 1, ID, MONTH, FUN = function(x)
seq_along(x) %in% c(1, length(x)))))
# ID groupTO groupFROM MONTH
#1 5 2 1 6
#2 5 1 2 7
#5 5 1 3 7
#6 5 2 1 8
#9 6 1 2 6
#10 6 3 1 6
#11 6 2 1 7
#12 6 3 2 8
#13 6 1 3 8

how to add value of different column with respect of a column

suppose I have
SAMPN PERNO loop car bus walk mode
1 1 1 3.4 2.5 1.5 1
1 1 1 3 2 1 2
1 1 1 4 2 5 3
1 1 2 14 1 3 1
1 1 2 5 8 2 1
2 1 1 1 5 5 3
2 1 1 9 4 3 3
The mode column corresponds to car, bus and walk.
mode==1 walk
mode==2 car
mode==3 bus
SAMPN is the family index, PERNO the member within the family and loop the tour of each person. For each person in each family and each loop, I want to add up the values of the column that mode points to.
for example in first family SAMPN==1 first person PERNO==1 we have 3 rows for first trip loop==1. in this tour mode of first row is walk (mode==1),mode of second row is car (mode==2),mode of third row is bus (mode==3)
so I will add walk of first row by car of second and bus of third 3.4+2+5=10.4. same for others
Output:
SAMPN PERNO loop car bus walk mode utility
1 1 1 3.4 2.5 1.5 1 10.4
1 1 1 3 2 1 2 10.4
1 1 1 4 2 5 3 10.4
1 1 2 14 1 3 1 19
1 1 2 5 8 2 1 19
2 1 1 1 5 5 3 8
2 1 1 9 4 3 3 8
library(dplyr)
df %>%
mutate(utility = case_when(mode == 1 ~ car, # using the order in the example,
mode == 2 ~ bus, # not the order in the table
mode == 3 ~ walk,
TRUE ~ 0)) %>%
count(SAMPN, PERNO, loop, wt = utility, name = "utility")
## A tibble: 3 x 4
# SAMPN PERNO loop utility
# <int> <int> <int> <dbl>
#1 1 1 1 10.4
#2 1 1 2 19
#3 2 1 1 8
Or, to get the exact output:
df %>%
mutate(utility= case_when(mode == 1 ~ car,
mode == 2 ~ bus,
mode == 3 ~ walk,
TRUE ~ 0)) %>%
group_by(SAMPN, PERNO, loop) %>%
mutate(utility = sum(utility))
## A tibble: 7 x 8
## Groups: SAMPN, PERNO, loop [3]
# SAMPN PERNO loop car bus walk mode utility
# <int> <int> <int> <dbl> <dbl> <dbl> <int> <dbl>
#1 1 1 1 3.4 2.5 1.5 1 10.4
#2 1 1 1 3 2 1 2 10.4
#3 1 1 1 4 2 5 3 10.4
#4 1 1 2 14 1 3 1 19
#5 1 1 2 5 8 2 1 19
#6 2 1 1 1 5 5 3 8
#7 2 1 1 9 4 3 3 8
Here is an option using base R. Create a column index by matching 'mode' against a named vector of column names ('nm1'), then cbind it with the row index, extract the corresponding elements from the dataset, and use ave to get the sum grouped by 'SAMPN', 'PERNO' and 'loop', which is assigned to 'utility'
nm1 <- setNames(names(df1)[4:6], 1:3)[as.character(df1$mode)]
i1 <- cbind(seq_len(nrow(df1)), match(nm1, names(df1)))
df1$utility <- ave(df1[i1], df1$SAMPN, df1$PERNO, df1$loop, FUN = sum)
df1$utility
#[1] 10.4 10.4 10.4 19.0 19.0 8.0 8.0
data
df1 <- structure(list(SAMPN = c(1L, 1L, 1L, 1L, 1L, 2L, 2L), PERNO = c(1L,
1L, 1L, 1L, 1L, 1L, 1L), loop = c(1L, 1L, 1L, 2L, 2L, 1L, 1L),
car = c(3.4, 3, 4, 14, 5, 1, 9), bus = c(2.5, 2, 2, 1, 8,
5, 4), walk = c(1.5, 1, 5, 3, 2, 5, 3), mode = c(1L, 2L,
3L, 1L, 1L, 3L, 3L)), class = "data.frame", row.names = c(NA,
-7L))
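The key step in the base R answer is indexing with a two-column matrix of (row, column) positions, which picks exactly one cell per row. A small sketch on the first three rows of the example; scores, idx and picked are illustrative names, and the mapping 1 = car, 2 = bus, 3 = walk follows the column order used in the answer.

```r
scores <- data.frame(
  car  = c(3.4, 3, 4),
  bus  = c(2.5, 2, 2),
  walk = c(1.5, 1, 5)
)
mode <- c(1L, 2L, 3L)   # column to pick in each row

# One (row, column) pair per row
idx <- cbind(seq_len(nrow(scores)), mode)

# Matrix indexing returns one value per pair: car of row 1,
# bus of row 2, walk of row 3
picked <- as.matrix(scores)[idx]
picked

sum(picked)
```

The sum is 10.4, the utility of the first loop in the question.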

How to refill a column with the help of 2 other columns?

I have data grouped by 3 columns: SAMPN, PERNO, loop.
There are 2 columns, mode1 and mode2, and a column called int.
SAMPN PERNO loop mode1 mode2 int
1 1 1 1 2 NA
1 1 1 2 1 NA
1 1 1 3 2 0
1 2 1 3 2 NA
1 2 1 1 1 2
2 2 1 3 2 NA
2 2 1 1 3 NA
2 2 1 3 1 0
2 2 2 1 2 NA
2 2 2 3 1 2
SAMPN is the family index, PERNO is the index of persons in each family and loop is the tour of each person. The last row of each loop for each person is 0 or 2 and the rest of the loop is NA. In each family, for each person and each loop, I want to copy the column mode1 into int if the last row of the loop is 0 and copy mode2 if the last row of the loop is 2.
output
SAMPN PERNO loop mode1 mode2 int
1 1 1 1 2 1
1 1 1 2 1 2
1 1 1 3 2 3
1 2 1 3 2 2
1 2 1 1 1 1
2 2 1 3 2 3
2 2 1 1 3 1
2 2 1 3 1 3
2 2 2 1 2 2
2 2 2 3 1 1
the first 3 rows is loop of first person in the first family, I filled that loop by mode1 because the third row was 0. and so on
Here's a way using dplyr
df <- read.table(header = TRUE, text = "SAMPN PERNO loop mode1 mode2 int
1 1 1 1 2 NA
1 1 1 2 1 NA
1 1 1 3 2 0
1 2 1 3 2 NA
1 2 1 1 1 2
2 2 1 3 2 NA
2 2 1 1 3 NA
2 2 1 3 1 0
2 2 2 1 2 NA
2 2 2 3 1 2")
library(dplyr)
df %>%
group_by(loop, SAMPN, PERNO) %>%
mutate(int = if(last(int) == 0) mode1 else mode2) %>%
ungroup()
#> # A tibble: 10 x 6
#> SAMPN PERNO loop mode1 mode2 int
#> <int> <int> <int> <int> <int> <int>
#> 1 1 1 1 1 2 1
#> 2 1 1 1 2 1 2
#> 3 1 1 1 3 2 3
#> 4 1 2 1 3 2 2
#> 5 1 2 1 1 1 1
#> 6 2 2 1 3 2 3
#> 7 2 2 1 1 3 1
#> 8 2 2 1 3 1 3
#> 9 2 2 2 1 2 2
#> 10 2 2 2 3 1 1
If you have more values than 0 or 2, switch could be a good alternative :
df %>%
group_by(loop, SAMPN, PERNO) %>%
mutate(int = switch(
as.character(last(int)),
`0` = mode1,
`2` = mode2)) %>%
ungroup()
# same output!
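One thing to keep in mind with switch(): when no name matches and there is no default, it returns NULL, which can fail in a confusing way further down the pipeline. A possible safeguard is an unnamed last argument that acts as the default. The sketch below wraps this in a hypothetical helper, pick_mode, which is not part of the answer above.

```r
# Hypothetical helper: choose a column based on the closing value of a loop
pick_mode <- function(last_int, mode1, mode2) {
  switch(as.character(last_int),
         `0` = mode1,
         `2` = mode2,
         # unnamed last argument = default branch for any other value
         stop("unexpected closing value: ", last_int))
}

pick_mode(0, 1:3, 4:6)   # first candidate
pick_mode(2, 1:3, 4:6)   # second candidate
```

Inside the grouped mutate() this could be called as pick_mode(last(int), mode1, mode2).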
We can also use case_when
library(dplyr)
df %>%
group_by(loop, SAMPN, PERNO) %>%
mutate(int = case_when(rep(last(int) == 0, n()) ~ mode1, TRUE ~mode2))
# A tibble: 10 x 6
# Groups: loop, SAMPN, PERNO [4]
# SAMPN PERNO loop mode1 mode2 int
# <int> <int> <int> <int> <int> <int>
# 1 1 1 1 1 2 1
# 2 1 1 1 2 1 2
# 3 1 1 1 3 2 3
# 4 1 2 1 3 2 2
# 5 1 2 1 1 1 1
# 6 2 2 1 3 2 3
# 7 2 2 1 1 3 1
# 8 2 2 1 3 1 3
#9 2 2 2 1 2 2
#10 2 2 2 3 1 1
data
df <- structure(list(SAMPN = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L,
2L), PERNO = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), loop = c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L), mode1 = c(1L, 2L, 3L, 3L,
1L, 3L, 1L, 3L, 1L, 3L), mode2 = c(2L, 1L, 2L, 2L, 1L, 2L, 3L,
1L, 2L, 1L), int = c(NA, NA, 0L, NA, 2L, NA, NA, 0L, NA, 2L)),
class = "data.frame", row.names = c(NA,
-10L))
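The same idea can also be sketched in base R: broadcast the last int of each group to all of its rows with ave(), then choose between mode1 and mode2 row by row. This is an illustration under the assumption that the last int of every group is 0 or 2; key and last_int are names introduced here.

```r
df <- data.frame(
  SAMPN = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L),
  PERNO = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L),
  loop  = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L),
  mode1 = c(1L, 2L, 3L, 3L, 1L, 3L, 1L, 3L, 1L, 3L),
  mode2 = c(2L, 1L, 2L, 2L, 1L, 2L, 3L, 1L, 2L, 1L),
  int   = c(NA, NA, 0L, NA, 2L, NA, NA, 0L, NA, 2L)
)

# One grouping factor per family / person / loop combination
key <- interaction(df$SAMPN, df$PERNO, df$loop, drop = TRUE)

# Repeat the last int of each group on every row of that group
last_int <- ave(df$int, key, FUN = function(x) x[length(x)])

# 0 selects mode1, anything else (here: 2) selects mode2
df$int <- ifelse(last_int == 0, df$mode1, df$mode2)
df
```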

applying sorting function in r for every four rows returns dataframe sorted but without extended selection, other columns are not sorted accordingly

I need every four rows to be sorted by the 4th column, separately from the next four rows, made a function :
for (i in seq(1,nrow(data_frame), by=4)) {
data_frame[i:(i+3),4] <- sort(data_frame[i:(i+3),4], decreasing=TRUE) }
The problem is that only the 4th column gets sorted; the other columns in each row stay where they were.
from
x y z userID
-1 1 2 5 1
-2 1 1 2 2
-3 0 0 5 5
-6 1 2 5 3
-4 1 1 2 6
-5 0 0 5 4
-4 1 1 2 1
-5 0 0 5 5
to -
x y z userID
-1 1 2 5 5
-2 1 1 2 3
-3 0 0 5 2
-6 1 2 5 1
-4 1 1 2 6
-5 0 0 5 5
-4 1 1 2 4
-5 0 0 5 1
With tidyverse, we can use %/% to create a grouping column and use that to sort the 'userID'
library(tidyverse)
df1 %>%
group_by(grp = (row_number()-1) %/% 4 + 1) %>%
#or use
#group_by(grp = cumsum(rep(c(TRUE, FALSE, FALSE, FALSE), length.out = n()))) %>%
mutate(userID = sort(userID, decreasing = TRUE))
# A tibble: 8 x 5
# Groups: grp [2]
# x y z userID grp
# <int> <int> <int> <int> <dbl>
#1 1 2 5 5 1
#2 1 1 2 3 1
#3 0 0 5 2 1
#4 1 2 5 1 1
#5 1 1 2 6 2
#6 0 0 5 5 2
#7 1 1 2 4 2
#8 0 0 5 1 2
Or using base R with ave
with(df1, ave(userID, (seq_along(userID)-1) %/% 4 + 1,
FUN = function(x) sort(x, decreasing = TRUE)))
#[1] 5 3 2 1 6 5 4 1
data
df1 <- structure(list(x = c(1L, 1L, 0L, 1L, 1L, 0L, 1L, 0L), y = c(2L,
1L, 0L, 2L, 1L, 0L, 1L, 0L), z = c(5L, 2L, 5L, 5L, 2L, 5L, 2L,
5L), userID = c(1L, 2L, 5L, 3L, 6L, 4L, 1L, 5L)), row.names = c(NA,
-8L), class = "data.frame")
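The grouping expression can be checked on its own, and if whole rows should move together with userID, a single order() call on the block id and the sort column is one way to do it. A minimal base R sketch, assuming the same df1; block_id and sorted are illustrative names.

```r
df1 <- data.frame(
  x      = c(1L, 1L, 0L, 1L, 1L, 0L, 1L, 0L),
  y      = c(2L, 1L, 0L, 2L, 1L, 0L, 1L, 0L),
  z      = c(5L, 2L, 5L, 5L, 2L, 5L, 2L, 5L),
  userID = c(1L, 2L, 5L, 3L, 6L, 4L, 1L, 5L)
)

# Integer division numbers the rows in blocks of four: 1 1 1 1 2 2 2 2
block_id <- (seq_len(nrow(df1)) - 1) %/% 4 + 1
block_id

# Order by block first, then by userID descending within each block;
# indexing the data frame with that order moves entire rows
sorted <- df1[order(block_id, -df1$userID), ]
rownames(sorted) <- NULL
sorted
```

Unlike gl(nrow(df) / 4, 4), this block id also works when the number of rows is not a multiple of 4; the last block is then simply shorter.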
In base R, we can split every 4 rows, order the fourth column and return the updated dataframe back.
df[] <- do.call(rbind, lapply(split(df, gl(nrow(df)/4, 4)),
function(p) p[order(p[[4]], decreasing = TRUE), ]))
df
# x y z userID
#1 0 0 5 5
#2 1 2 5 3
#3 1 1 2 2
#4 1 2 5 1
#5 1 1 2 6
#6 0 0 5 5
#7 0 0 5 4
#8 1 1 2 1
tidyverse approach using the same logic would be
library(tidyverse)
df %>%
group_split(gl(n()/4, 4), .keep = FALSE) %>%
map_dfr(. %>% arrange(desc(userID)))
