Get the IDs that are not sampled [duplicate]

I would like to get the IDs that are not sampled.
library(dplyr)

# simulate 10 subjects (id), each measured under treatments A and B
id <- rep(1:10, each = 2)
trt <- rep(c("A", "B"), 10)
score <- rnorm(20, 0, 1)
df <- data.frame(id, trt, score)
df$id <- as.factor(df$id)
df
id trt score
1 1 A 0.4920104
2 1 B 0.5030771
3 2 A 1.4030437
4 2 B 0.4132130
5 3 A -2.4449382
6 3 B -1.0981531
7 4 A -0.6013329
8 4 B -0.8411616
9 5 A -0.2696329
10 5 B -0.9869931
11 6 A 1.0681588
12 6 B 1.7500570
13 7 A 0.6008876
14 7 B -0.2181209
15 8 A -1.2943954
16 8 B -2.4495156
17 9 A 0.7680115
18 9 B 0.5497457
19 10 A -1.9713569
20 10 B -0.7696987
df <- df %>% filter(id %in% sample(levels(id),5))
df
id trt score
1 3 A 1.8816245
2 3 B 0.8614810
3 5 A 0.5508704
4 5 B -1.4144959
5 7 A 0.5174229
6 7 B 0.5244466
7 9 A 0.4318934
8 9 B -1.6376436
9 10 A 0.1746228
10 10 B 1.6319294
Here I would like to get the other IDs. How can I code this? Suppose there are many IDs, so it is not possible to select them manually:
id trt score
1 1 A 0.07040075
2 1 B -0.70388700
3 2 A 0.78421333
4 2 B -0.90052385
7 4 A -0.48052247
8 4 B -0.66198818
11 6 A 1.12168455
12 6 B 0.90454813
15 8 A 1.54550328
16 8 B 0.64822307

If we assign the filtered object to a new name ('df1') instead of overwriting the original object, one option is anti_join:
library(dplyr)
anti_join(df, df1, by = 'id')
Another option is filter:
df %>%
  filter(!id %in% df1$id)
data
df1 <- df %>%
  filter(id %in% sample(levels(id), 5))
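For completeness, a base R sketch of the same subsetting (assuming the df and df1 objects defined above), using logical indexing instead of a join:
# rows of df whose id does not occur in the sampled subset df1
df[!df$id %in% df1$id, ]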

Related

Keep row as soon as cumulative value reaches a certain threshold R

I have a dataframe where I would like to keep a row as soon as the cumulative value of a column reaches a certain level. The dataset could look like this:
set.seed(0)
n <- 10
dat <- data.frame(id = 1:n,
                  group = rep(LETTERS[1:2], n/2),
                  age = sample(18:30, n, replace = TRUE),
                  type = factor(paste("type", 1:n)),
                  x = abs(rnorm(n)))
dat
id group age type x
1 1 A 26 type 1 0.928567035
2 2 B 21 type 2 0.294720447
3 3 A 24 type 3 0.005767173
4 4 B 18 type 4 2.404653389
5 5 A 19 type 5 0.763593461
6 6 B 30 type 6 0.799009249
7 7 A 24 type 7 1.147657009
8 8 B 28 type 8 0.289461574
9 9 A 19 type 9 0.299215118
10 10 B 28 type 10 0.411510833
I want to keep a row as soon as the cumulative value of x reaches a threshold (e.g. 1), starting the count again as soon as a row has been retained. This would result in the following output:
id group age type x
2 2 B 21 type 2 0.294720447
4 4 B 18 type 4 2.404653389
6 6 B 30 type 6 0.799009249
7 7 A 24 type 7 1.147657009
10 10 B 28 type 10 0.411510833
I am trying to get a dplyr-based solution but can't seem to figure it out. Any tips?
You can use purrr::accumulate to compute the cumulative sum with a threshold, then use dplyr::slice_tail to get the last value before the cumulative sum crosses the threshold:
library(dplyr)
library(purrr)
dat %>%
  group_by(a = cumsum(x == accumulate(x, ~ ifelse(.x <= 1, .x + .y, .y)))) %>%
  slice_tail(n = 1)
# id group age type x a
# 1 2 B 21 type 2 0.295 1
# 2 4 B 18 type 4 2.40 2
# 3 6 B 30 type 6 0.799 3
# 4 7 A 24 type 7 1.15 4
# 5 10 B 28 type 10 0.412 5
Another option is to use MESS::cumsumbinning, which may be friendlier to use:
library(MESS)
library(dplyr)
dat %>%
  group_by(a = cumsumbinning(x, 1, cutwhenpassed = TRUE)) %>%
  slice_tail(n = 1)
Mael beat me to it with cumsumbinning() from the MESS package...
Here is a data.table option using that function:
library(MESS)
library(data.table)
setDT(dat)[, .SD[.N], by = MESS::cumsumbinning(x, 1, cutwhenpassed = TRUE)]
# MESS id group age type
# 1: 1 2 B 21 type 2
# 2: 2 4 B 18 type 4
# 3: 3 6 B 30 type 6
# 4: 4 7 A 24 type 7
# 5: 5 10 B 28 type 10
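If you prefer to avoid extra packages, here is a base R sketch of the same reset-and-accumulate idea (assuming the dat data frame from the question, before setDT() has been applied); Reduce() with accumulate = TRUE plays the role of purrr::accumulate:
# running sum of x that restarts once it has passed the threshold of 1
acc <- Reduce(function(s, v) if (s <= 1) s + v else v, dat$x, accumulate = TRUE)
# a restart copies x verbatim, so exact equality marks the start of each group
grp <- cumsum(dat$x == acc)
# keep the last row of each group
dat[!duplicated(grp, fromLast = TRUE), ]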

Class totals on each row in an R dataframe [duplicate]

I have a dataframe like the following in R:
df <- matrix(c('A','A','A','A','B','B','B','B','C','C','C','C',4,6,8,2,2,7,2,8,9,1,2,5),ncol=2)
For each row of this dataframe, I want to include the total value for each class (A, B, C) so that the dataframe will look like this:
df <- matrix(c('A','A','A','A','B','B','B','B','C','C','C','C',4,6,8,2,2,7,2,8,9,1,2,5,20,20,20,20,19,19,19,19,17,17,17,17),ncol=3)
What's a way I could accomplish this?
Thanks in advance for your help.
Using base R:
df <- data.frame(df)
df$Total <- ave(as.numeric(df$X2), df$X1, FUN = sum)
A dplyr solution would be this:
library(dplyr)
data.frame(df) %>%
  group_by(X1) %>%
  mutate(Sum = sum(as.numeric(X2)))
# A tibble: 12 × 3
# Groups: X1 [3]
X1 X2 Sum
<chr> <chr> <dbl>
1 A 4 20
2 A 6 20
3 A 8 20
4 A 2 20
5 B 2 19
6 B 7 19
7 B 2 19
8 B 8 19
9 C 9 17
10 C 1 17
11 C 2 17
12 C 5 17
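For completeness, a data.table sketch of the same grouped total (assuming the character matrix df from the question; wrapping it in data.frame() supplies the default column names X1 and X2):
library(data.table)
dt <- as.data.table(data.frame(df))
# the group total, repeated on each row of the group
dt[, Total := sum(as.numeric(X2)), by = X1]
dt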

How to replace repeating entries in a data frame with n-(number of times it's repeated) in R?

In my data I have repeating entries in a column. If an entry n is repeated more than 2 times within a column, I want to replace that entry with n - (number_of_times_it_has_repeated - 2). For example, if my data looks like this:
df <- data.frame(
  A = c(1, 2, 2, 4, 5, 7, 7, 7, 7, 2, 8, 8),
  B = c(2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13)
)
> df
A B
1 2
2 3
2 4
4 5
5 6
7 7
7 8
7 9
7 10
2 11
8 12
8 13
we can see that in df$A the number 7 is repeated 4 times. If an entry is repeated more than 2 times, I want to replace it. So in my example, the 1st and 2nd instances of the number 7 would remain unchanged. The 3rd instance of the number 7 would be replaced by 7 - (3-2), and the 4th instance by 7 - (4-2).
We can also see that in df$A the number 2 is repeated 3 times. Using the same method, the 3rd instance of the number 2 would be replaced with 2 - (3-2).
As there are no repeating values in df$B, that column would remain unchanged.
For clarity, my expected result would be:
dfNew <- data.frame(
  A = c(1, 2, 2, 4, 5, 7, 7, 6, 5, 1, 8, 8),
  B = c(2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13)
)
> dfNew
A B
1 2
2 3
2 4
4 5
5 6
7 7
7 8
6 9
5 10
1 11
8 12
8 13
Here's how you can do it for one column:
library(dplyr)
df %>%
  group_by(A) %>%
  transmute(A = A - c(rep(0, 2), row_number())[row_number()]) %>%
  ungroup()
# A
# <dbl>
# 1 1
# 2 2
# 3 2
# 4 4
# 5 5
# 6 7
# 7 7
# 8 6
# 9 5
#10 1
#11 8
#12 8
To do it for all the columns you can use map_dfc:
purrr::map_dfc(names(df), ~ {
  df %>%
    group_by(.data[[.x]]) %>%
    transmute(!!.x := .data[[.x]] - c(rep(0, 2), row_number())[row_number()]) %>%
    ungroup()
})
# A B
# <dbl> <dbl>
# 1 1 2
# 2 2 3
# 3 2 4
# 4 4 5
# 5 5 6
# 6 7 7
# 7 7 8
# 8 6 9
# 9 5 10
#10 1 11
#11 8 12
#12 8 13
The logic here is that for each number we subtract 0 from the first 2 occurrences and then 1, 2, and so on from the later ones.
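As a small illustration (not part of the answer), here are the offsets that indexing trick produces for a value occurring 4 times; seq_len(n) stands in for row_number() outside the pipeline:
n <- 4
c(rep(0, 2), seq_len(n))[seq_len(n)]
# [1] 0 0 1 2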
You can skip the ordering if you don't want it. Here is my approach; if you have data where duplicates remain after the changes, I can rework the answer into a function or something similar.
library(dplyr)

my_df <- data.frame(A = c(1, 2, 2, 4, 5, 7, 7, 7, 7, 2, 8, 8),
                    B = c(2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13),
                    stringsAsFactors = FALSE)
my_df <- my_df[order(my_df$A, my_df$B), ]
my_df$Id <- seq.int(from = 1, to = nrow(my_df), by = 1)
my_temp <- my_df %>%
  group_by(A) %>%
  filter(n() > 2) %>%
  mutate(Count = seq.int(from = 1, to = n(), by = 1)) %>%
  filter(Count > 2) %>%
  mutate(A = A - (Count - 2))
my_var <- which(my_df$Id %in% my_temp$Id)
if (length(my_var)) {
  my_df <- my_df[-my_var, ]
  my_df <- rbind(my_df, my_temp[, c("A", "B", "Id")])
}
my_df <- my_df[order(my_df$A, my_df$B), ]
A base R option using ave + pmax + seq_along:
list2DF(
  lapply(
    df,
    function(x) {
      x - ave(x, x, FUN = function(v) pmax(seq_along(v) - 2, 0))
    }
  )
)
gives
A B
1 1 2
2 2 3
3 2 4
4 4 5
5 5 6
6 7 7
7 7 8
8 6 9
9 5 10
10 1 11
11 8 12
12 8 13
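Not taken from the answers above, but the same ave() idea also drops into a single dplyr call via across(), which applies it to every column at once (a sketch, assuming dplyr >= 1.0):
library(dplyr)
df %>%
  mutate(across(everything(),
                ~ .x - ave(.x, .x, FUN = function(v) pmax(seq_along(v) - 2, 0))))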

Merging multiple connected columns

I have two different columns for several samples, and the columns are connected. I want to merge all columns of type 1 into one column and all columns of type 2 into another, while the rows stay connected.
Example:
a1 <- c(1, 2, 3, 4, 5)
b1 <- c(1, 4, 9, 16, 25)
a2 <- c(2, 4, 6, 8, 10)
b2 <- c(4, 8, 12, 16, 20)
df1 <- data.frame(a1, b1, a2, b2)
a1 b1 a2 b2
1 1 1 2 4
2 2 4 4 8
3 3 9 6 12
4 4 16 8 16
5 5 25 10 20
I want to have it like this:
a b
1 1 1
2 2 4
3 2 4
4 3 9
5 4 8
6 4 16
7 5 25
8 6 12
9 8 16
10 10 20
My case
This is the example in my case. I have a lot of columns with different names and I want to extract abs_dist_1, ... abs_dist_5 and mean_vel_1, ... mean_vel_5 in a new data frame, with all abs_dist in one column and all mean_vel in one column, but still connected.
I tried with unlist, but then of course the connection gets lost.
Thanks in advance.
A base R option using reshape:
subset(
  reshape(
    setNames(df1, gsub("(\\d)", ".\\1", names(df1))),
    direction = "long",
    varying = 1:ncol(df1)
  ),
  select = -c(time, id)
)
gives
a b
1.1 1 1
2.1 2 4
3.1 3 9
4.1 4 16
5.1 5 25
1.2 2 4
2.2 4 8
3.2 6 12
4.2 8 16
5.2 10 20
An option is pivot_longer from tidyr, specifying names_sep as a regex lookaround that matches the position between a lower-case letter ([a-z]) and a digit in the column names:
library(dplyr)
library(tidyr)
df1 %>%
  pivot_longer(cols = everything(), names_to = c('.value', 'grp'),
               names_sep = "(?<=[a-z])(?=[0-9])") %>%
  select(-grp)
Output:
# A tibble: 10 x 2
# a b
# <dbl> <dbl>
# 1 1 1
# 2 2 4
# 3 2 4
# 4 4 8
# 5 3 9
# 6 6 12
# 7 4 16
# 8 8 16
# 9 5 25
#10 10 20
With the edited post, we need to change the names_sep, i.e. the delimiter is now _ between a lower-case letter and a digit:
df1 %>%
  pivot_longer(cols = everything(), names_to = c('.value', 'grp'),
               names_sep = "(?<=[a-z])_(?=[0-9])") %>%
  select(-grp)
Or, with base R, use split.default on the digit-stripped column names to split df1 into a list of data.frames, then unlist each list element and wrap the results in a data.frame:
data.frame(lapply(split.default(df1, sub("\\d+", "", names(df1))),
                  unlist, use.names = FALSE))
For the sake of completeness, here is a solution which uses data.table::melt() and the patterns() function to specify columns which belong together:
library(data.table)
melt(setDT(df1), measure.vars = patterns(a = "a", b = "b"))[
  order(a, b), !"variable"]
a b
1: 1 1
2: 2 4
3: 2 4
4: 3 9
5: 4 8
6: 4 16
7: 5 25
8: 6 12
9: 8 16
10: 10 20
This reproduces the expected result for the OP's sample dataset.
A more realistic example: reshape only selected columns
With the edit of the question, the OP has clarified that the production data contain many more columns than those which need to be reshaped:
I have a lot of columns with different names and I want to extract abs_dist_1, ... abs_dist_5 and mean_vel_1, ... mean_vel_5 in a new data frame, with all abs_dist in one column and all mean_vel in one column, but still connected.
So, the OP wants to extract and reshape the columns of interest in one go while ignoring all other data in the dataset.
To simulate this situation, we need a more elaborate dataset which includes other columns as well:
df2 <- cbind(df1, c1 = 11:15, c2 = 21:25)
df2
a1 b1 a2 b2 c1 c2
1 1 1 2 4 11 21
2 2 4 4 8 12 22
3 3 9 6 12 13 23
4 4 16 8 16 14 24
5 5 25 10 20 15 25
With a modified version of the code above
library(data.table)
cols <- c("a", "b")
result <- melt(setDT(df2), measure.vars = patterns(cols), value.name = cols)[, ..cols]
setorderv(result, cols)
result
we get
a b
1: 1 1
2: 2 4
3: 3 9
4: 4 16
5: 5 25
6: 2 4
7: 4 8
8: 6 12
9: 8 16
10: 10 20
For the production dataset as pictured in the edit, the OP needs to set
cols <- c("abs_dist", "mean_vel")
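As a minimal sketch on made-up data mimicking the OP's naming scheme (df3 and its values are hypothetical, for illustration only):
library(data.table)
df3 <- data.frame(abs_dist_1 = 1:3, mean_vel_1 = 4:6,
                  abs_dist_2 = 7:9, mean_vel_2 = 10:12,
                  other = letters[1:3])   # an unrelated column that is ignored
cols <- c("abs_dist", "mean_vel")
melt(setDT(df3), measure.vars = patterns(cols), value.name = cols)[, ..cols]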

Checking the presence of values in multiple datasets

I have a number of tables, and the "a" columns of all the tables must have identical values for the analysis I am conducting. The actual tables are very big, so I will use simplified (mock) data frames.
Let's say I have the following data:
A <- data.frame(a = c(3,4,5,6,7,8), b = c(4,5,6,7,8,9), c = c(5,6,7,8,9,10))
B <- data.frame(a = c(2,3,4,5,6,7), b = c(4,5,6,7,8,9), c = c(5,6,7,8,9,10))
C <- data.frame(a = c(1,2,3,4,5,6), b = c(4,5,6,7,8,9), c = c(5,6,7,8,9,10))
D <- data.frame(a = c(4,5,6,7,8,9), b = c(4,5,6,7,8,9), c = c(5,6,7,8,9,10))
Now, the data frames do not all have identical values in their "a" columns. My goal is to delete the entire rows whose "a" values do not appear in all the other tables.
In order to have identical values in column "a" for all tables A, B and C, I could use the following operations:
A <- A[A$a %in% B$a,]
B <- B[B$a %in% A$a,]
C <- C[C$a %in% B$a,]
B <- B[B$a %in% C$a,]
A <- A[A$a %in% C$a,]
This is already getting very tedious, as you can see. What if I throw table D or other data frames into the mix? It becomes almost impossible to proceed, as each table contains at least one unique value.
One dplyr option could be (note that bind_rows() with .id creates a character ID column, so max(ID) is a lexicographic maximum, which is fine here with fewer than 10 datasets):
library(dplyr)
bind_rows(list(A, B, C, D), .id = "ID") %>%
  mutate(n_datasets = max(ID)) %>%
  group_by(a) %>%
  filter(n_distinct(ID) == n_datasets)
ID a b c n_datasets
<chr> <dbl> <dbl> <dbl> <chr>
1 1 4 5 6 4
2 1 5 6 7 4
3 1 6 7 8 4
4 2 4 6 7 4
5 2 5 7 8 4
6 2 6 8 9 4
7 3 4 7 8 4
8 3 5 8 9 4
9 3 6 9 10 4
10 4 4 4 5 4
11 4 5 5 6 4
12 4 6 6 7 4
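For completeness, a base R sketch that scales to any number of tables (assuming the A, B, C, D data frames from the question): intersect the "a" columns once, then subset each table by the common values.
# values of 'a' that are present in every table
common <- Reduce(intersect, list(A$a, B$a, C$a, D$a))
# keep only the rows whose 'a' is shared by all tables
lapply(list(A = A, B = B, C = C, D = D), function(d) d[d$a %in% common, ])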
