Checking the presence of values in multiple datasets


I have a number of tables and all the "a" columns of the tables must have identical values for the analysis I am conducting. The actual tables are very big so I will use simplified (mock) data frames.
Let's say I have the following data:
A <- data.frame(a = c(3,4,5,6,7,8), b = c(4,5,6,7,8,9), c = c(5,6,7,8,9,10))
B <- data.frame(a = c(2,3,4,5,6,7), b = c(4,5,6,7,8,9), c = c(5,6,7,8,9,10))
C <- data.frame(a = c(1,2,3,4,5,6), b = c(4,5,6,7,8,9), c = c(5,6,7,8,9,10))
D <- data.frame(a = c(4,5,6,7,8,9), b = c(4,5,6,7,8,9), c = c(5,6,7,8,9,10))
Now, the values in column "a" differ between the data frames. My goal is to delete every row whose "a" value does not appear in all the other tables.
To get identical values in column "a" for tables A, B and C, I could use the following operations:
A <- A[A$a %in% B$a,]
B <- B[B$a %in% A$a,]
C <- C[C$a %in% B$a,]
B <- B[B$a %in% C$a,]
A <- A[A$a %in% C$a,]
This is already getting very tedious, as you can see. What if I add table D or other data frames to the mix? It becomes almost impossible to proceed, as each table contains at least one unique value.

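One dplyr option could be the following. It counts the datasets with n_distinct(ID) rather than max(ID), because ID is a character column and max() would compare strings (with 10 or more tables, "9" sorts after "10"):
library(dplyr)
bind_rows(list(A, B, C, D), .id = "ID") %>%
  mutate(n_datasets = n_distinct(ID)) %>%
  group_by(a) %>%
  filter(n_distinct(ID) == n_datasets)
ID a b c n_datasets
<chr> <dbl> <dbl> <dbl> <int>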
1 1 4 5 6 4
2 1 5 6 7 4
3 1 6 7 8 4
4 2 4 6 7 4
5 2 5 7 8 4
6 2 6 8 9 4
7 3 4 7 8 4
8 3 5 8 9 4
9 3 6 9 10 4
10 4 4 4 5 4
11 4 5 5 6 4
12 4 6 6 7 4
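
If the tables should stay separate rather than being stacked, a base R sketch along the following lines may also work. It is not part of the original answer: the names tables, common_a and filtered are made up for illustration, and it assumes every table has a column called a.

```r
# Mock data from the question
A <- data.frame(a = c(3,4,5,6,7,8), b = c(4,5,6,7,8,9), c = c(5,6,7,8,9,10))
B <- data.frame(a = c(2,3,4,5,6,7), b = c(4,5,6,7,8,9), c = c(5,6,7,8,9,10))
C <- data.frame(a = c(1,2,3,4,5,6), b = c(4,5,6,7,8,9), c = c(5,6,7,8,9,10))
D <- data.frame(a = c(4,5,6,7,8,9), b = c(4,5,6,7,8,9), c = c(5,6,7,8,9,10))

# Hypothetical container: put all tables in one named list
tables <- list(A = A, B = B, C = C, D = D)

# Values of "a" that occur in every table
common_a <- Reduce(intersect, lapply(tables, function(d) d$a))

# Keep only the rows whose "a" is shared by all tables
filtered <- lapply(tables, function(d) d[d$a %in% common_a, ])

common_a
filtered$A
```

Adding another table then only means adding it to the list; the intersect and filter steps stay the same.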

Related

How to replace repeating entries in a data frame with n-(number of times it's repeated) in R?

In my data I have repeating entries in a column. What I'm trying to do is if an entry n is repeated more than 2 times within a column, then I want to replace that entry with n-(number_of_times_it_has_repeated - 2). For example, if my data looks like this:
df <- data.frame(
A = c(1,2,2,4,5,7,7,7,7,2,8,8),
B = c(2,3,4,5,6,7,8,9,10,11,12,13)
)
> df
A B
1 2
2 3
2 4
4 5
5 6
7 7
7 8
7 9
7 10
2 11
8 12
8 13
We can see that in df$A the number 7 is repeated 4 times. If an entry is repeated more than 2 times, then I want to replace that entry. So in my example, the 1st and 2nd instances of the number 7 would remain unchanged. The 3rd instance of the number 7 would be replaced by 7 - (3-2). The 4th instance of the number 7 would be replaced by 7 - (4-2).
We can also see that in df$A the number 2 is repeated 3 times. Using the same method, the 3rd instance of the number 2 would be replaced with 2 - (3-2).
As there are no repeating values in df$B, that column would remain unchanged.
For clarity, my expected result would be:
dfNew <- data.frame(
A = c(1,2,2,4,5,7,7,6,5,1,8,8),
B = c(2,3,4,5,6,7,8,9,10,11,12,13)
)
> dfNew
A B
1 2
2 3
2 4
4 5
5 6
7 7
7 8
6 9
5 10
1 11
8 12
8 13

Here's how you can do it for one column -
library(dplyr)
df %>%
group_by(A) %>%
transmute(A = A - c(rep(0, 2), row_number())[row_number()]) %>%
ungroup
# A
# <dbl>
# 1 1
# 2 2
# 3 2
# 4 4
# 5 5
# 6 7
# 7 7
# 8 6
# 9 5
#10 1
#11 8
#12 8
To do it for all the columns you can use map_dfc -
purrr::map_dfc(names(df), ~{
df %>%
group_by(.data[[.x]]) %>%
transmute(!!.x := .data[[.x]] - c(rep(0, 2), row_number())[row_number()])%>%
ungroup
})
# A B
# <dbl> <dbl>
# 1 1 2
# 2 2 3
# 3 2 4
# 4 4 5
# 5 5 6
# 6 7 7
# 7 7 8
# 8 6 9
# 9 5 10
#10 1 11
#11 8 12
#12 8 13
The logic here is that, within each group of equal values, we subtract 0 from the first 2 occurrences, then 1 from the third, 2 from the fourth, and so on.
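
To make the indexing trick concrete, here is a small base R sketch for a single group of four equal values (the four 7s from the question). The variable names n and occurrence_offsets are made up for illustration, and seq_len(n) stands in for what row_number() returns inside one group.

```r
n <- 4  # size of the group, e.g. the four 7s in df$A

# Prepend two zeros to 1..n, then keep the first n entries:
# c(0, 0, 1, 2, 3, 4)[1:4] gives 0, 0, 1, 2
occurrence_offsets <- c(rep(0, 2), seq_len(n))[seq_len(n)]
occurrence_offsets

# Subtracting the offsets leaves the first two 7s unchanged
7 - occurrence_offsets
```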

Here is my approach; you can skip the ordering step if you don't need it. If your data still contains duplicates after these changes, I can extend the answer and wrap it in a function.
library(dplyr)
my_df <- data.frame(A = c(1,2,2,4,5,7,7,7,7,2,8,8),
B = c(2,3,4,5,6,7,8,9,10,11,12,13),
stringsAsFactors = FALSE)
my_df <- my_df[order(my_df$A, my_df$B),]
my_df$Id <- seq.int(from = 1, to = nrow(my_df), by = 1)
# Rows that are the 3rd or later occurrence of a value, with A reduced by (occurrence - 2)
my_temp <- my_df %>%
  group_by(A) %>%
  filter(n() > 2) %>%
  mutate(Count = seq.int(from = 1, to = n(), by = 1)) %>%
  filter(Count > 2) %>%
  ungroup() %>%  # ungroup before changing A, since A is the grouping column
  mutate(A = A - (Count - 2))
my_var <- which(my_df$Id %in% my_temp$Id)
if (length(my_var)) {
my_df <- my_df[-my_var,]
my_df <- rbind(my_df, my_temp[, c("A", "B", "Id")])
}
my_df <- my_df[order(my_df$A, my_df$B),]

A base R option using ave + pmax + seq_along
list2DF(
lapply(
df,
function(x) {
x - ave(x, x, FUN = function(v) pmax(seq_along(v) - 2, 0))
}
)
)
gives
A B
1 1 2
2 2 3
3 2 4
4 4 5
5 5 6
6 7 7
7 7 8
8 6 9
9 5 10
10 1 11
11 8 12
12 8 13

How to remove all rows from dataframe if count of similar `person_id` values is not `== 2`

I need to remove all rows from a data frame if the count of similar person_id values is not == 2. For example:
a1 <- data.frame(person_id = 1:5, b=letters[1:5])
a2 <- data.frame(person_id = 2:6, b=letters[6:10])
data = rbind(a1, a2)
person_id b
1 1 a
2 2 b
3 3 c
4 4 d
5 5 e
6 2 f
7 3 g
8 4 h
9 5 i
10 6 j
Rows 1 and 10 must be removed, because person_id==1 and person_id==6 each have only 1 record. By contrast, person_id==2 has 2 rows.
How can I get a new dataset with only the rows whose person_id occurs exactly 2 times (and, in future, 3 or 4 times)?

Base R solution:
subset(
data,
ave(person_id, person_id, FUN = length) == 2
)

To remove the rows where count of person_id isn't equal to 2:
library(dplyr)
data %>%
group_by(person_id) %>%
filter(n() == 2)
person_id b
<int> <chr>
1 2 b
2 3 c
3 4 d
4 5 e
5 2 f
6 3 g
7 4 h
8 5 i
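
Since the question mentions that the required count may later become 3 or 4, the count can be made a parameter. The following is only a sketch in base R: the function name keep_ids_with_count and its arguments are hypothetical, and it assumes the id column is called person_id.

```r
a1 <- data.frame(person_id = 1:5, b = letters[1:5])
a2 <- data.frame(person_id = 2:6, b = letters[6:10])
data <- rbind(a1, a2)

# Hypothetical helper: keep rows whose person_id occurs exactly k times
keep_ids_with_count <- function(d, k) {
  # ave() returns, for every row, the size of that row's person_id group
  n_per_id <- ave(seq_along(d$person_id), d$person_id, FUN = length)
  d[n_per_id == k, ]
}

res <- keep_ids_with_count(data, 2)
res
```

Calling keep_ids_with_count(data, 3) would then keep only ids that occur three times (none in this example).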

Get the id's that are not sampled [duplicate]

I would like to get the ids that are not sampled:
id <- rep(1:10,each=2)
trt <- rep(c("A","B"),2)
score <- rnorm(20,0,1)
df <- data.frame(id,trt,score)
df$id <- as.factor(df$id)
df
id trt score
1 1 A 0.4920104
2 1 B 0.5030771
3 2 A 1.4030437
4 2 B 0.4132130
5 3 A -2.4449382
6 3 B -1.0981531
7 4 A -0.6013329
8 4 B -0.8411616
9 5 A -0.2696329
10 5 B -0.9869931
11 6 A 1.0681588
12 6 B 1.7500570
13 7 A 0.6008876
14 7 B -0.2181209
15 8 A -1.2943954
16 8 B -2.4495156
17 9 A 0.7680115
18 9 B 0.5497457
19 10 A -1.9713569
20 10 B -0.7696987
df <- df %>% filter(id %in% sample(levels(id),5))
df
id trt score
1 3 A 1.8816245
2 3 B 0.8614810
3 5 A 0.5508704
4 5 B -1.4144959
5 7 A 0.5174229
6 7 B 0.5244466
7 9 A 0.4318934
8 9 B -1.6376436
9 10 A 0.1746228
10 10 B 1.6319294
Here I would like to get the other ids. How can I code this? Suppose there are many ids, so it is not possible to select them manually. The desired output looks like this:
id trt score
1 1 A 0.07040075
2 1 B -0.70388700
3 2 A 0.78421333
4 2 B -0.90052385
7 4 A -0.48052247
8 4 B -0.66198818
11 6 A 1.12168455
12 6 B 0.90454813
15 8 A 1.54550328
16 8 B 0.64822307

If we assign the filtered object to a new one ('df1') instead of overwriting the original object name, one option is anti_join:
library(dplyr)
anti_join(df, df1, by = 'id')
Or another option is filter
df %>%
filter(! id %in% df1$id)
Data (the sampled subset, stored as df1):
df1 <- df %>%
filter(id %in% sample(levels(id),5))
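
For a reproducible run without extra packages, the same idea can be sketched in base R. The seed and the names sampled_ids, df_sampled and df_rest are assumptions added for illustration; the key point is to store the sampled ids once and reuse them for both subsets.

```r
set.seed(42)  # assumption: fixed seed so the sketch is reproducible

df <- data.frame(
  id = factor(rep(1:10, each = 2)),
  trt = rep(c("A", "B"), 10),
  score = rnorm(20, 0, 1)
)

# Draw the ids once and keep them, instead of sampling inside the filter
sampled_ids <- sample(levels(df$id), 5)

df_sampled <- df[df$id %in% sampled_ids, ]     # rows of the sampled ids
df_rest    <- df[!(df$id %in% sampled_ids), ]  # rows of all other ids

unique(as.character(df_rest$id))
```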

Separate data frame depending on one column duplicates

I have a large data frame with a lot of rows and columns. One column contains characters; some of them occur only once, others multiple times. I would now like to split the whole data frame so that I end up with two data frames: one with all the rows whose character repeats in this column, and another with all the rows whose character occurs only once. For example:
One = c(1,2,3,4,5,6,7,8,9,10)
Two = c(4,5,3,6,2,7,1,8,1,9)
Three = c("a", "b", "c", "d","d","e","f","e","g","c")
df <- data.frame(One, Two, Three)
> df
One Two Three
1 1 4 a
2 2 5 b
3 3 3 c
4 4 6 d
5 5 2 d
6 6 7 e
7 7 1 f
8 8 8 e
9 9 1 g
10 10 9 c
I wish to have two data frames like
> dfSingle
One Two Three
1 1 4 a
2 2 5 b
7 7 1 f
9 9 1 g
> dfMultiple
One Two Three
3 3 3 c
4 4 6 d
5 5 2 d
6 6 7 e
8 8 8 e
10 10 9 c
I tried with the duplicated() function
dfSingle = subset(df, !duplicated(df$Three))
dfMultiple = subset(df, duplicated(df$Three))
but it does not work, as the first occurrence of each of "c", "d" and "e" goes to "dfSingle".
I also tried to do a for-loop
MulipleValues = unique(df$Three[c(which(duplicated(df$Three)))])
dfSingle = data.frame()
x = 1
dfMultiple = data.frame()
y = 1
for (i in 1:length(df$One)) {
if(df$Three[i] %in% MulipleValues){
dfMultiple[x,] = df[i,]
x = x+1
} else {
dfSingle[y,] = df[i,]
y = y+1
}
}
It seems to do the right thing, as the data frames now have the right number of rows, but they somehow have 0 columns.
> dfSingle
data frame with 0 columns and 4 rows
> dfMultiple
data frame with 0 columns and 6 rows
What am I doing wrong? Or is there another way to do this?
Thanks for your help!

In base R, we can use split with duplicated, which returns a list of two data frames.
df1 <- split(df, duplicated(df$Three) | duplicated(df$Three, fromLast = TRUE))
df1
#$`FALSE`
# One Two Three
#1 1 4 a
#2 2 5 b
#7 7 1 f
#9 9 1 g
#$`TRUE`
# One Two Three
#3 3 3 c
#4 4 6 d
#5 5 2 d
#6 6 7 e
#8 8 8 e
#10 10 9 c
where df1[[1]] can be considered as dfSingle and df1[[2]] as dfMultiple.
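
If two named data frames are preferred over a list, the same logical mask can be used for direct subsetting. This is a sketch with a made-up variable name (is_repeated). It also sidesteps the loop from the question, which most likely lost its columns because data.frame() starts with zero columns.

```r
df <- data.frame(
  One = 1:10,
  Two = c(4, 5, 3, 6, 2, 7, 1, 8, 1, 9),
  Three = c("a", "b", "c", "d", "d", "e", "f", "e", "g", "c"),
  stringsAsFactors = FALSE
)

# TRUE for every row whose value in Three occurs more than once;
# scanning from both ends also marks the first occurrence
is_repeated <- duplicated(df$Three) | duplicated(df$Three, fromLast = TRUE)

dfSingle   <- df[!is_repeated, ]
dfMultiple <- df[is_repeated, ]

dfSingle
dfMultiple
```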

Here is a dplyr one for fun,
library(dplyr)
df %>%
group_by(Three) %>%
mutate(new = n() > 1) %>%
split(.$new)
which gives,
$`FALSE`
# A tibble: 4 x 4
# Groups: Three [4]
One Two Three new
<dbl> <dbl> <fct> <lgl>
1 1 4 a FALSE
2 2 5 b FALSE
3 7 1 f FALSE
4 9 1 g FALSE
$`TRUE`
# A tibble: 6 x 4
# Groups: Three [3]
One Two Three new
<dbl> <dbl> <fct> <lgl>
1 3 3 c TRUE
2 4 6 d TRUE
3 5 2 d TRUE
4 6 7 e TRUE
5 8 8 e TRUE
6 10 9 c TRUE

A way with dplyr:
library(dplyr)
df %>%
group_split(Duplicated = (add_count(., Three) %>% pull(n)) > 1)
Output:
[[1]]
# A tibble: 4 x 4
One Two Three Duplicated
<dbl> <dbl> <fct> <lgl>
1 1 4 a FALSE
2 2 5 b FALSE
3 7 1 f FALSE
4 9 1 g FALSE
[[2]]
# A tibble: 6 x 4
One Two Three Duplicated
<dbl> <dbl> <fct> <lgl>
1 3 3 c TRUE
2 4 6 d TRUE
3 5 2 d TRUE
4 6 7 e TRUE
5 8 8 e TRUE
6 10 9 c TRUE

You can do it using base R
One = c(1,2,3,4,5,6,7,8,9,10)
Two = c(4,5,3,6,2,7,1,8,1,9)
Three = c("a", "b", "c", "d","d","e","f","e","g","c")
df <- data.frame(One, Two, Three)
str(df)
df$Three <- as.character(df$Three)
df$count <- as.numeric(ave(df$Three,df$Three,FUN = length))
dfSingle = subset(df,df$count == 1)
dfMultiple = subset(df,df$count > 1)

Rearranging data frame columns in R (mutate, dplyr)

I have a data frame like so
Type Number Species
A 1 G
A 2 R
A 7 Q
A 4 L
B 4 S
B 5 T
B 3 H
B 9 P
C 12 K
C 11 T
C 6 U
C 5 Q
Where I have used group_by(Type)
My goal is to collapse this data so that Number holds the top 2 values of the Number column within each Type, and a new column (Number_2) holds the remaining 2 values.
I would also like the Species values for the bottom two numbers to be dropped, so that Species corresponds to the higher number in each row.
I would like to use dplyr and the final would look like this
Type Number Number_2 Species
A 7 1 Q
A 4 2 L
B 5 3 T
B 9 4 P
C 12 6 K
C 11 5 T
For now, the order of Number_2 doesn't matter, as long as the values stay within the same Type.
I don't know if this is possible, but if it is, does anyone know how?
Thanks!

You can try
library(data.table)
setDT(df1)[order(-Number), list(Number1=Number[1:2],
Number2=Number[3:4],
Species=Species[1:2]), keyby = Type]
# Type Number1 Number2 Species
#1: A 7 2 Q
#2: A 4 1 L
#3: B 9 4 P
#4: B 5 3 T
#5: C 12 6 K
#6: C 11 5 T
Or using dplyr with do
library(dplyr)
df1 %>%
group_by(Type) %>%
arrange(desc(Number)) %>%
do(data.frame(Type=.$Type[1L],
Number1=.$Number[1:2],
Number2 = .$Number[3:4],
Species=.$Species[1:2], stringsAsFactors=FALSE))
# Type Number1 Number2 Species
#1 A 7 2 Q
#2 A 4 1 L
#3 B 9 4 P
#4 B 5 3 T
#5 C 12 6 K
#6 C 11 5 T

Here's a different dplyr approach.
library(dplyr)
# Start creating the data set with top 2 values and store as df1:
df1 <- df %>%
group_by(Type) %>%
top_n(2, Number) %>%
ungroup() %>%
arrange(Type, Number)
# Then, get the anti-joined data (the non-top-2 values), arrange, rename and select
# the Number column, and cbind it to df1:
out <- df %>%
anti_join(df1, c("Type","Number")) %>%
arrange(Type, Number) %>%
select(Number2 = Number) %>%
cbind(df1, .)
This results in:
> out
# Type Number Species Number2
#1 A 4 L 1
#2 A 7 Q 2
#3 B 5 T 3
#4 B 9 P 4
#5 C 11 T 5
#6 C 12 K 6

This could be another option using ddply
library(plyr)
ddply(dat[order(dat$Number), ], .(Type), summarize,
Number1 = Number[4:3], Number2 = Number[2:1], Species = Species[4:3])
# Type Number1 Number2 Species
#1 A 7 2 Q
#2 A 4 1 L
#3 B 9 4 P
#4 B 5 3 T
#5 C 12 6 K
#6 C 11 5 T
