Keep duplicate values only if they are represented in first sampling period - r

I am trying to clean my data so that only duplicate values that have an observation in my first sampling period are kept. For instance, if my data frame looks like this:
df <- data.frame(ID = c(1,1,1,2,2,2,3,3,4,4), period = c(1,2,3,1,2,3,2,3,1,3), mass = rnorm(10, 5, 2))
df
ID period mass
1 1 1 3.313674
2 1 2 6.371979
3 1 3 5.449435
4 2 1 4.093022
5 2 2 2.615782
6 2 3 3.622842
7 3 2 4.466666
8 3 3 6.940979
9 4 1 6.226222
10 4 3 4.233397
I would like to keep only the duplicated observations for individuals that were measured during period 1. My new data frame would then look like this:
ID period mass
1 1 1 3.313674
2 1 2 6.371979
3 1 3 5.449435
4 2 1 4.093022
5 2 2 2.615782
6 2 3 3.622842
9 4 1 6.226222
10 4 3 4.233397
Using suggestions on this page (Remove all unique rows) I have tried using the following command, but it leaves in the observations for individual 3 (which was not measured in period 1).
subset(df, duplicated(ID) | duplicated(ID, fromLast=TRUE))

If you want a base R solution, the following should work as well.
> df_new <- df[df$ID %in% df$ID[df$period == 1], ]
> df_new
ID period mass
1 1 1 3.238832
2 1 2 3.428847
3 1 3 1.205347
4 2 1 8.498452
5 2 2 7.523085
6 2 3 3.613678
9 4 1 3.324095
10 4 3 1.932733
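Note that the subset above keeps every ID seen in period 1, even one that occurs only once. If the duplicate requirement from the question also has to hold, the two conditions can be combined. A minimal base R sketch, assuming a hypothetical extra ID 5 (not in the original data) that is measured once, in period 1 only:

```r
set.seed(42)  # only so the random masses are reproducible

# Question's data plus a hypothetical ID 5 with a single period-1 row
df <- data.frame(ID = c(1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 5),
                 period = c(1, 2, 3, 1, 2, 3, 2, 3, 1, 3, 1),
                 mass = rnorm(11, 5, 2))

# TRUE for rows whose ID was measured in period 1
in_period1 <- df$ID %in% df$ID[df$period == 1]
# TRUE for rows whose ID occurs more than once
is_dup <- duplicated(df$ID) | duplicated(df$ID, fromLast = TRUE)

df_new <- df[in_period1 & is_dup, ]
unique(df_new$ID)  # IDs 1, 2 and 4 remain
```

ID 3 is dropped because it has no period-1 row, and the hypothetical ID 5 is dropped because it is not duplicated.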

You can use dplyr as follows:
library(dplyr)
df %>% group_by(ID) %>% filter(1 %in% period)
#Source: local data frame [8 x 3]
#Groups: ID [3]
# ID period mass
# <dbl> <dbl> <dbl>
#1 1 1 7.622950
#2 1 2 7.960665
#3 1 3 5.045723
#4 2 1 4.366568
#5 2 2 4.400645
#6 2 3 6.088367
#7 4 1 2.282713
#8 4 3 2.461640

Related

Code values in new column based on whether values in another column are unique

Given the following data I would like to create a new column new_sequence based on the condition:
If an id is present only once, the new value should be 0. If an id is present several times, the new value should be numbered according to the values present in sequence.
dat <- tibble(id = c(1,2,3,3,3,4,4),
sequence = c(1,1,1,2,3,1,2))
# A tibble: 7 x 2
id sequence
<dbl> <dbl>
1 1 1
2 2 1
3 3 1
4 3 2
5 3 3
6 4 1
7 4 2
So, for the example data I am looking to produce the following output:
# A tibble: 7 x 3
id sequence new_sequence
<dbl> <dbl> <dbl>
1 1 1 0
2 2 1 0
3 3 1 1
4 3 2 2
5 3 3 3
6 4 1 1
7 4 2 2
I have tried the code below, but it does not work: the first row of every id is coded as 0, not only the ids that occur once.
dat %>% mutate(new_sequence = ifelse(!duplicated(id), 0, sequence))
Use dplyr::add_count() rather than !duplicated():
library(dplyr)
dat %>%
add_count(id) %>%
mutate(new_sequence = ifelse(n == 1, 0, sequence)) %>%
select(!n)
Output:
# A tibble: 7 x 3
id sequence new_sequence
<dbl> <dbl> <dbl>
1 1 1 0
2 2 1 0
3 3 1 1
4 3 2 2
5 3 3 3
6 4 1 1
7 4 2 2
You can also try the following. After grouping by id, check whether the number of rows in the group, n(), is 1. Use a plain if/else instead of ifelse here: the condition n() == 1 has length 1, so ifelse would return a single value rather than the whole sequence of the group.
dat %>%
group_by(id) %>%
mutate(new_sequence = if(n() == 1) 0 else sequence)
Output
id sequence new_sequence
<dbl> <dbl> <dbl>
1 1 1 0
2 2 1 0
3 3 1 1
4 3 2 2
5 3 3 3
6 4 1 1
7 4 2 2
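The same idea also works without dplyr. A minimal base R sketch, assuming a plain data.frame in place of the tibble; ave() is used here to attach each id's row count to every row, and n_per_id is a hypothetical name:

```r
dat <- data.frame(id = c(1, 2, 3, 3, 3, 4, 4),
                  sequence = c(1, 1, 1, 2, 3, 1, 2))

# Number of rows per id, repeated on every row of that id
n_per_id <- ave(dat$id, dat$id, FUN = length)

# 0 for ids that occur once, otherwise the existing sequence value
dat$new_sequence <- ifelse(n_per_id == 1, 0, dat$sequence)
dat$new_sequence
```

Here ifelse is fine because n_per_id and sequence have the same length, one element per row.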

R Tidyverse - Randomize by ID

I have a df like this one:
id <- c(1,1,2,2,3,3,4,4,5,5)
v1 <- c(3,1,2,3,4,5,6,1,5,4)
pos <- c(1,2,1,2,1,2,1,2,1,2)
df <- data.frame(id,v1,pos)
How can I "randomize" the values of v1 while keeping the order of the "id" variable and the values of "pos", so that I get a df with randomized values like this:
id v1 pos
1 1 1
1 3 2
2 2 1
2 3 2
3 5 1
3 4 2
4 6 1
4 1 2
5 5 1
5 4 2
Above is an example of the resulting df, with id and pos staying as originally created and v1 randomized.
Thx!
Is sample what you're looking for?
df %>%
group_by(id) %>%
mutate(v1 = sample(v1, size = length(v1)))
# A tibble: 10 x 3
# Groups: id [5]
id v1 pos
<dbl> <dbl> <dbl>
1 1 3 1
2 1 1 2
3 2 3 1
4 2 2 2
5 3 4 1
6 3 5 2
7 4 1 1
8 4 6 2
9 5 5 1
10 5 4 2
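One caveat with sample(): when a group holds a single number n >= 1, sample(n, size = 1) draws from 1:n instead of returning n. Every id in the example has two rows, so the answer above is unaffected, but a defensive variant shuffles positions rather than values. A minimal base R sketch; shuffle is a hypothetical helper name:

```r
id  <- c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5)
v1  <- c(3, 1, 2, 3, 4, 5, 6, 1, 5, 4)
pos <- c(1, 2, 1, 2, 1, 2, 1, 2, 1, 2)
df <- data.frame(id, v1, pos)

# Hypothetical helper: permute by index, which is safe for length-1 input
shuffle <- function(x) x[sample.int(length(x))]

shuffled <- df
shuffled$v1 <- ave(df$v1, df$id, FUN = shuffle)  # shuffle v1 within each id
shuffled
```

The id and pos columns are left untouched; only the order of v1 inside each id can change.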

Filter on groups where at the max value of one variable, another variable equals a particular value

I want to filter on groups where at the max value of one variable, another variable equals a particular value.
I have data like so:
library(tidyverse)
df1 <- data.frame(grp = rep(letters[1:2],each=5),
day = 1:5,
value = c(0,5,7,1,1,5,8,5,3,0)) %>%
group_by(grp)
grp day value
1 a 1 0
2 a 2 5
3 a 3 7
4 a 4 1
5 a 5 1
6 b 1 5
7 b 2 8
8 b 3 5
9 b 4 3
10 b 5 0
And I want to filter on groups where at the max(day), value equals 1.
So the output would look like this:
grp day value
1 a 1 0
2 a 2 5
3 a 3 7
4 a 4 1
5 a 5 1
Data.table or dplyr solutions are welcome. Thanks!
Because the data is already grouped, simply apply filter and check whether the 'value' at the row with the maximum day (located with which.max(day)) is 1:
library(dplyr)
df1 %>%
filter(value[which.max(day)] ==1)
# A tibble: 5 x 3
# Groups: grp [1]
# grp day value
# <fct> <int> <dbl>
#1 a 1 0
#2 a 2 5
#3 a 3 7
#4 a 4 1
#5 a 5 1
Or combine the two conditions and wrap them in any():
df1 %>%
filter(any(value ==1 & day == max(day)))
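The same check can also be written in base R. A minimal sketch, assuming an ungrouped copy of the data; keep_grp and res are hypothetical names:

```r
df1 <- data.frame(grp = rep(letters[1:2], each = 5),
                  day = 1:5,
                  value = c(0, 5, 7, 1, 1, 5, 8, 5, 3, 0))

# For each group: is value equal to 1 on the row with the largest day?
keep_grp <- sapply(split(df1, df1$grp),
                   function(g) g$value[which.max(g$day)] == 1)

# Keep all rows of the groups that passed the check
res <- df1[df1$grp %in% names(keep_grp)[keep_grp], ]
res
```

which.max() returns the first position of the maximum, so if a group could contain the maximum day more than once, the any() variant is the safer choice.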

Subsetting data frame based on multiple criteria for deletion of rows

Consider the following data frame consisting of column names "id" and "x", where each id is repeated four times. Data is as follows:
df<-data.frame("id"=c(1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4),
"x"=c(2,2,1,1,2,3,3,3,1,2,2,3,2,2,3,3))
The question is about how to subset the data frame by the following criteria:
(1) keep all entries of an id if its values in column x do not contain 3, or if 3 appears only as the last value.
(2) for an id with multiple 3s in column x, keep all values up to and including the first 3 and delete the remaining 3s. The expected output would look like:
id x
1 1 2
2 1 2
3 1 1
4 1 1
5 2 2
6 2 3
7 3 1
8 3 2
9 3 2
10 3 3
11 4 2
12 4 2
13 4 3
I am familiar with the 'filter' function in the dplyr package for subsetting data, but this particular situation confuses me because of the complexity of the above criteria. Any help on this would be greatly appreciated.
Here's one solution that creates some helper columns to filter on:
library(dplyr)
df<-data.frame("id"=c(1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4),
"x"=c(2,2,1,1,2,3,3,3,1,2,2,3,2,2,3,3))
df %>%
group_by(id) %>% # for each id
mutate(num_threes = sum(x == 3), # count number of 3s
flag = ifelse(unique(num_threes) > 0, # if there is a 3
min(row_number()[x == 3]), # keep the row of the first 3
0)) %>% # otherwise put a 0
filter(num_threes == 0 | row_number() <= flag) %>% # keep ids with no 3s or up to first 3
ungroup() %>%
select(-num_threes, -flag) # remove helper columns
# # A tibble: 13 x 2
# id x
# <dbl> <dbl>
# 1 1 2
# 2 1 2
# 3 1 1
# 4 1 1
# 5 2 2
# 6 2 3
# 7 3 1
# 8 3 2
# 9 3 2
# 10 3 3
# 11 4 2
# 12 4 2
# 13 4 3
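The rule can also be expressed with a running count: a row is kept when no 3 has appeared earlier within its id. A minimal base R sketch of that idea, following the same keep-up-to-the-first-3 logic as the dplyr answer above; is3 and threes_before are hypothetical names:

```r
df <- data.frame(id = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4),
                 x  = c(2, 2, 1, 1, 2, 3, 3, 3, 1, 2, 2, 3, 2, 2, 3, 3))

is3 <- as.integer(df$x == 3)

# Per id: how many 3s occur strictly before the current row
threes_before <- ave(is3, df$id, FUN = function(z) cumsum(z) - z)

res <- df[threes_before == 0, ]  # rows up to and including the first 3
rownames(res) <- NULL
res
```

Because the count is taken within each id, the result does not depend on the 3s being adjacent or on where one id ends and the next begins.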
This works for me:
data
df<-data.frame("id"=c(1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4),
"x"=c(2,2,1,1,2,3,3,3,1,2,2,3,2,2,3,3))
commands
library(dplyr)
df <- mutate(df, before = lag(x))
df$condition1 <- 1
df$condition1[df$x == 3 & df$before == 3] <- 0
final_df <- df[df$condition1 == 1, 1:2]
result
id x
1 2
1 2
1 1
1 1
2 2
2 3
3 1
3 2
3 2
3 3
4 2
4 2
4 3
One idea is to pick out the rows with x == 3 and apply unique() to them, which leaves a single row with a 3 per id. Then append those rows to the rest of the data frame, and finally restore the original row order.
Here is a solution with base R for the idea above:
res <- (r <- with(df,rbind(df[x!=3,],unique(df[x==3,]))))[order(as.numeric(rownames(r))),]
rownames(res) <- seq(nrow(res))
which gives
> res
id x
1 1 2
2 1 2
3 1 1
4 1 1
5 2 2
6 2 3
7 3 1
8 3 2
9 3 2
10 3 3
11 4 2
12 4 2
13 4 3
DATA
df<-data.frame("id"=c(1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4),
"x"=c(2,2,1,1,2,3,3,3,1,2,2,3,2,2,3,3))

Shifting rows up in columns and flush remaining ones

I have a problem with shifting values up into the row above. When a row becomes completely NA as a result, I would like to drop it. My current approach, however, still keeps the second row of each group.
Here is my approach
data <- data.frame(gr=c(rep(1:3,each=2)),A=c(1,NA,2,NA,4,NA), B=c(NA,1,NA,3,NA,7),C=c(1,NA,4,NA,5,NA))
> data
gr A B C
1 1 1 NA 1
2 1 NA 1 NA
3 2 2 NA 4
4 2 NA 3 NA
5 3 4 NA 5
6 3 NA 7 NA
so using this approach
data.frame(apply(data,2,function(x){x[complete.cases(x)]}))
gr A B C
1 1 1 1 1
2 1 2 3 4
3 2 4 7 5
4 2 1 1 1
5 3 2 3 4
6 3 4 7 5
As we can see, I still get the second row of each group!
The expected output
> data
gr A B C
1 1 1 1 1
2 2 2 3 4
3 3 4 7 5
thanks!
If there's at most one valid value per gr, you can use na.omit then take the first value from it:
data %>% group_by(gr) %>% summarise_all(~ na.omit(.)[1])
# [1] is optional depending on your actual data
# A tibble: 3 x 4
# gr A B C
# <int> <dbl> <dbl> <dbl>
#1 1 1 1 1
#2 2 2 3 4
#3 3 4 7 5
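You can do it with dplyr and tidyr like this:
library(dplyr)
library(tidyr) # fill() comes from tidyr
data$ind <- rep(c(1, 2), length.out = nrow(data)) # 1 = first row of each pair, 2 = second
data %>% fill(A, B, C) %>% filter(ind == 2) %>% mutate(ind = NULL)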
gr A B C
1 1 1 1 1
2 2 2 3 4
3 3 4 7 5
Depending on how consistent your full data is, this may need to be adjusted.
One more solution, using data.table together with na.locf() from zoo:
data <- data.frame(gr=c(rep(1:3,each=2)),A=c(1,NA,2,NA,4,NA), B=c(NA,1,NA,3,NA,7),C=c(1,NA,4,NA,5,NA))
library(data.table)
library(zoo)
setDT(data)
data[, A := na.locf(A), by = gr]
data[, B := na.locf(B), by = gr]
data[, C := na.locf(C), by = gr]
data <- unique(data)
data
gr A B C
1: 1 1 1 1
2: 2 2 3 4
3: 3 4 7 5
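As for why the original attempt keeps six rows: apply() returns columns of different lengths (six values for gr, three each for A, B and C), and data.frame() then recycles the shorter ones. Collapsing within each group avoids this. A minimal base R sketch using aggregate(), under the same assumption of at most one non-missing value per column and group; res is a hypothetical name:

```r
data <- data.frame(gr = rep(1:3, each = 2),
                   A = c(1, NA, 2, NA, 4, NA),
                   B = c(NA, 1, NA, 3, NA, 7),
                   C = c(1, NA, 4, NA, 5, NA))

# na.action = na.pass keeps rows containing NA so that FUN gets to see them
res <- aggregate(. ~ gr, data = data,
                 FUN = function(x) na.omit(x)[1],  # first non-missing value
                 na.action = na.pass)
res
```

Without na.action = na.pass, the formula method would drop every row containing an NA before FUN runs, which for this data means all of them.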
