Filter dataset based on occurrence [duplicate]

I have a large dataset (more than 500,000 rows) that I want to filter in R. I only want to retain the most relevant information, so I thought it would be a good idea to keep just the rows whose values occur more than some number of times. For example, I have this data:
A B
2 5
4 7
2 8
3 7
2 9
4 2
1 0
And I want to retain the rows whose value in column A occurs more than once. In this case the output would be:
A B
2 5
4 7
2 8
2 9
4 2
I know how to do it with for loops and rbind, but since the dataset I am using is very big, that approach is far too slow. Any advice?
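For a reproducible example, the sample data above can be built as follows (the answers below assume the data frame is named df1):
df1 <- data.frame(A = c(2, 4, 2, 3, 2, 4, 1),
                  B = c(5, 7, 8, 7, 9, 2, 0))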

We can do this using data.table, dplyr, or base R methods. With data.table, we convert the 'data.frame' to a 'data.table' (setDT(df1)), group by 'A', and, if the number of rows in a group (.N) is greater than 1, return the Subset of Data.table (.SD) for that group.
library(data.table)
setDT(df1)[, if(.N>1) .SD, by = A]
Or we use dplyr: we group by 'A' and keep the groups that have more than one row (n() > 1).
library(dplyr)
df1 %>%
  group_by(A) %>%
  filter(n() > 1)
Or using ave from base R, we get a logical index and use that to subset the dataset
df1[with(df1, ave(seq_along(A), A, FUN=length))> 1,]
Or without using any groupings, we can use duplicated to get the index and subset
df1[duplicated(df1$A)|duplicated(df1$A, fromLast=TRUE),]
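As a quick check (my own addition, not part of the original answer), the duplicated approach reproduces the expected output:
df1[duplicated(df1$A) | duplicated(df1$A, fromLast = TRUE), ]
#   A B
# 1 2 5
# 2 4 7
# 3 2 8
# 5 2 9
# 6 4 2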


Is there an R function for performing basic operations on every column of a data frame? [duplicate]

I have a data frame with n columns, all of them numeric, like the one below (the example has only 3 columns, but the actual one has an unknown number).
col_1 col_2 col_3
1 3 7
3 8 9
5 5 2
8 10 1
11 9 2
I'm trying to transform the data in every column based on this equation: (x - min(col)) / (max(col) - min(col)), so that every element is scaled based on the values in its column.
Is there a way to do this without using a for loop to iterate through every column? Would sapply or tapply work here?
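For reproducibility (my own addition; the answers below refer to this data frame as df1 or df), the sample data can be built as:
df1 <- data.frame(col_1 = c(1, 3, 5, 8, 11),
                  col_2 = c(3, 8, 5, 10, 9),
                  col_3 = c(7, 9, 2, 1, 2))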
We can use scale on the dataset (note that by default scale standardizes each column to zero mean and unit variance; for the exact min-max formula above, use one of the approaches below):
scale(df1)
Or, if we want a custom function for the exact min-max transformation, create the function, loop over the columns with lapply, apply the function, and assign the result back to the data frame:
f1 <- function(x) (x - min(x)) / (max(x) - min(x))
df1[] <- lapply(df1, f1)
Or this can be done with mutate_all
library(dplyr)
df1 %>%
  mutate_all(f1)
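A side note not in the original answer: in dplyr 1.0.0 and later, mutate_all is superseded, and the same transformation is written with across:
df1 %>%
  mutate(across(everything(), f1))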
In complement to @akrun's answer, you can also do this using data.table:
library(data.table)
setDT(df)
df[, lapply(.SD, function(x) (x - min(x)) / (max(x) - min(x)))]
If you want to use a subset of columns, you can use .SDcols argument, e.g.
library(data.table)
df[, lapply(.SD, function(x) (x - min(x)) / (max(x) - min(x))),
   .SDcols = c('a', 'b')]
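As a quick sanity check (my own addition), applying f1 to the first example column gives the expected scaled values:
f1(c(1, 3, 5, 8, 11))
# [1] 0.0 0.2 0.4 0.7 1.0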

Find last values by condition [duplicate]

I have a very large data frame that I need to subset by last values. I know that the data.table library includes a last() function that returns the last value of a vector, but what I need is to subset foo by the last value of id for every separate value of track. Values in id are consecutive integers, but the last value will be different for every track.
> head(foo)
track id coords.x coords.y
1 0 0 -79.90732 43.26133
2 0 1 -79.90733 43.26124
3 0 2 -79.90733 43.26124
4 0 3 -79.90733 43.26124
5 0 4 -79.90725 43.26121
6 0 5 -79.90725 43.26121
The output would look something like this.
track id coords.x coords.y
1 0 57 -79.90756 43.26123
2 1 98 -79.90777 43.26231
3 2 61 -79.90716 43.26200
... and so on
How would one apply the last() function (or another function like tail()) to produce this output?
We can try with dplyr, grouping by track and selecting only the last row of every group.
library(dplyr)
df %>%
  group_by(track) %>%
  filter(row_number() == n())
We can use data.table. Convert the 'data.frame' to a 'data.table' (setDT(df1)) and, grouped by 'track', get the last row with tail:
library(data.table)
setDT(df1)[, tail(.SD, 1), by = track]
As the OP also mentioned that the 'id' values are consecutive within a track, we can create a logical index using diff, get the row indices (.I), and subset the rows. Within each track diff(id) is always 1, so appending TRUE to the index marks exactly the last row of each group:
setDT(df1)[df1[, .I[c(diff(id) != 1, TRUE)], by = track]$V1]
Or we can do this using base R itself
df1[!duplicated(df1$track, fromLast=TRUE),]
Or another option is dplyr
library(dplyr)
df1 %>%
  group_by(track) %>%
  slice(n())
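A side note not in the original answers: in dplyr 1.0.0 and later, slice_tail expresses the same operation directly:
df1 %>%
  group_by(track) %>%
  slice_tail(n = 1)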

R Data-Frame: Get Maximum of Variable B conditional on Variable A [duplicate]

I am searching for an efficient and fast way to do the following:
I have a data frame with, say, 2 variables, A and B, where the values for A can occur several times:
mat<-data.frame('VarA'=rep(seq(1,10),2),'VarB'=rnorm(20))
VarA VarB
1 0.95848233
2 -0.07477916
3 2.08189370
4 0.46523827
5 0.53500190
6 0.52605101
7 -0.69587974
8 -0.21772252
9 0.29429577
10 3.30514605
1 0.84938361
2 1.13650996
3 1.25143046
Now I want to get a vector giving me for every unique value of VarA
unique(mat$VarA)
the maximum of VarB conditional on VarA.
In the example here that would be
1 0.95848233
2 1.13650996
3 2.08189370
etc...
My data-frame is very big so I want to avoid the use of loops.
Try this:
library(dplyr)
mat %>%
  group_by(VarA) %>%
  summarise(max = max(VarB))
Try the data.table package.
library(data.table)
mat <- data.table(mat)
result <- mat[, max(VarB), by = VarA]
print(result)
Try this:
library(plyr)
ddply(mat, .(VarA), summarise, VarB = max(VarB))
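A base R alternative, not from the original answers, uses aggregate with the formula interface:
aggregate(VarB ~ VarA, data = mat, FUN = max)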

Select groups with more than one distinct value per group [duplicate]

I have data like below:
ID category class
1 a m
1 a s
1 b s
2 a m
3 b s
4 c s
5 d s
I want to subset the data so that it only includes those "ID" values which have several (> 1) different categories.
My expected output:
ID category class
1 a m
1 a s
1 b s
Is there a way of doing so?
I tried
library(dplyr)
df %>%
  group_by(ID) %>%
  filter(n_distinct(category, class) > 1)
But it gave me an error:
# Error: expecting a single value
Using data.table
library(data.table) #see: https://github.com/Rdatatable/data.table/wiki for more
setDT(data) #convert to native 'data.table' type by reference
data[ , if(uniqueN(category) > 1) .SD, by = ID]
uniqueN is data.table's fast native equivalent of length(unique()), and .SD is the whole data.table (in more general cases it can represent a subset of columns, e.g. when the .SDcols argument is used). So the middle statement (j, the column-selection argument) says to return all columns and rows associated with an ID for which there are at least two distinct values of category.
Use the by argument to extend this to cases involving counts of multiple columns.
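For completeness (my own addition), the dplyr attempt from the question also works once n_distinct is given a single column:
library(dplyr)
df %>%
  group_by(ID) %>%
  filter(n_distinct(category) > 1)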

Delete rows in data frame based on multiple columns from another data frame in R [duplicate]

I would like to remove rows that have specific values for columns that match values in another data frame.
a<-c(1,1,2,2,2,4,5,5,5,5)
b<-c(10,10,22,30,30,30,40,40,40,40)
c<-c(1,2,1,2,2,2,2,1,1,2)
d<-rnorm(1:10)
data<-data.frame(a,b,c,d)
a<-c(2,5)
b<-c(30,40)
c<-c(2,1)
x<-data.frame(a,b,c)
I want to remove the rows of data whose values of a, b, and c match a row in x, so that y becomes:
a b c d
1 10 1 -0.2509255
1 10 2 0.4142277
2 22 1 -0.1340514
4 30 2 -1.5372009
5 40 2 1.9001932
5 40 2 -1.2825212
I tried the following, which did not work:
y<-data[!data$a==a & !data$b==b & !data$c==c,]
y<-subset(data, !data$a==x$a & !data$b==x$b & !data$c==x$c)
I also tried to just flag the ones that should be removed in order to subset in a second step, but this did not work either:
y<-data
y$rm<-ifelse(y$a==x$a & y$b==x$b & y$c==x$c, 1, 0)
The real "data" and "x" are much longer, and there are variable number of rows in data that match each row in x.
We can use anti_join from dplyr. It returns all rows from 'data' that do not have matching values in 'x'. We specify the variables to be considered in the by argument:
library(dplyr)
anti_join(data, x, by=c('a', 'b', 'c'))
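A base R equivalent (my own sketch, assuming exact matching on all three columns) builds a composite key per row and keeps the rows whose key does not appear in x:
key_data <- do.call(paste, data[c("a", "b", "c")])
key_x <- do.call(paste, x[c("a", "b", "c")])
y <- data[!(key_data %in% key_x), ]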
