Print rows that contain values in vector/column in R

I have some data looking like this:
1. matrix
a b c d e
f 4 5 6 7 8
g 1 2 3 4 5
h 3 2 1 6 7
2. column/vector
v <- c(5,4,8,6,0)
How can I print all the rows that contain the data in vector?
I've seen there's a function called filter that could work, or maybe lapply / grep?

Use filter with if_any:
library(dplyr)
df %>%
filter(if_any(everything(), ~ . %in% v))

Or using base R (assuming it is a matrix)
m1[rowSums(`dim<-`(m1 %in% v, dim(m1))) > 0,]
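For reference, here is a self-contained sketch of the base R approach, rebuilding the matrix from the question. The helper name row_has_any and the second lookup vector c(8, 0) are made up for illustration. With the original v, every row contains at least one match, so all three rows come back.

```r
# Matrix and vector from the question
m1 <- matrix(c(4, 5, 6, 7, 8,
               1, 2, 3, 4, 5,
               3, 2, 1, 6, 7),
             nrow = 3, byrow = TRUE,
             dimnames = list(c("f", "g", "h"), c("a", "b", "c", "d", "e")))
v <- c(5, 4, 8, 6, 0)

# row_has_any() is a hypothetical helper: TRUE for each row of m
# that has at least one element found in `values`
row_has_any <- function(m, values) {
  # %in% drops the dim attribute, so reshape the logical vector
  hits <- matrix(m %in% values, nrow = nrow(m))
  rowSums(hits) > 0
}

# drop = FALSE keeps the result a matrix even if only one row matches
m1[row_has_any(m1, v), , drop = FALSE]
m1[row_has_any(m1, c(8, 0)), , drop = FALSE]
```

The second call keeps only row f, since 8 appears nowhere else and 0 does not appear at all.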

Related

Find rows where multiple columns have the same value

library(tidyverse)
d = data.frame(x=c('A','B','C'), y=c('A','B','D'), z=c('X','B','C'), a=1:3)
print(d)
x y z a
1 A A X 1
2 B B B 2
3 C D C 3
d %>% filter(x==y) # Returns rows 1 and 2
d %>% filter(x==z) # Returns rows 2 and 3
d %>% filter(x==y & x==z) # Returns row 2
How can I do what the very last line is doing with more concise syntax for some arbitrary set of columns? For example, filter(all.equal(x,y,z)) which doesn't work but expresses the idea.
When comparing several columns, an easier option is to hold one column out (x) and loop over the rest with if_all, comparing each one to x with ==. A row is then kept only when every comparison for that row is TRUE.
library(dplyr)
d %>%
filter(if_all(y:z, ~ x == .x))
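The same idea works with across instead of if_all, although recent dplyr versions deprecate across() inside filter() in favour of if_all()/if_any():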
d %>%
filter(across(y:z, ~`==`(.x, x)))
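For comparison, a base R sketch of the same row-wise check, assuming the data frame d from the question. rows_all_equal is a hypothetical helper name, not part of any package; it compares every listed column against the first one and combines the results with &.

```r
d <- data.frame(x = c("A", "B", "C"),
                y = c("A", "B", "D"),
                z = c("X", "B", "C"),
                a = 1:3)

# rows_all_equal() is a hypothetical helper: TRUE for rows where all
# columns named in `cols` hold the same value
rows_all_equal <- function(data, cols) {
  first <- data[[cols[1]]]
  same <- lapply(cols[-1], function(col) data[[col]] == first)
  # start from all TRUE and AND in each column comparison
  Reduce(`&`, same, rep(TRUE, nrow(data)))
}

d[rows_all_equal(d, c("x", "y")), ]       # rows 1 and 2
d[rows_all_equal(d, c("x", "y", "z")), ]  # row 2
```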

R multiple regular expressions, dataframe column names

I have a dataframe data with a lot of columns in the form of
...v1...min ...v1...max ...v2...min ...v2...max
1 a a a a
2 b b b b
3 c c c c
where in place ... there could be any expression.
I would like to create a function createData that takes three arguments:
X: a dataframe,
cols: a vector containing first part of the column, so i.e. c("v1", "v2")
fun: a vector containing second part of the column, so i.e. c("min"), or c("max", "min")
and returns filtered dataframe, so - for example:
createData(X, c("v1"), NULL) would return this kind of dataframe:
...v1...min ...v1...max
1 a a
2 b b
3 c c
while createData(X, c("v1", "v2"), c("min")) would give me
...v1...min ...v2...min
1 a a
2 b b
3 c c
At this point I decided I need to use i.e. select(contains()) from dplyr package.
createData <- function(data, fun, cols)
{
X %>% select(contains())
return(X)
}
What I struggle with is:
how to filter columns whose names contain two (or maybe more?) strings, i.e. both v1 and min? I tried data[grepl(".*(v1*min|min*v1).*", colnames(data), ignore.case=TRUE)] but it doesn't seem to work, and my expressions aren't fixed anyway: they depend on the vector I pass,
how to filter multiple columns with different names, i.e. c("v1", "v2"), passed in a vector, and how to combine that with the first question?
I don't really need to stick with dplyr package, it was just for the sake of the example. Thanks!
EDIT:
A reproducible example:
data = data.frame(AXv1c2min = c(1,2,3),
subv1trwmax = c(4,5,6),
ss25v2xxmin = c(7,8,9),
cwfv2urttmmax = c(10,11,12))
If you pass a vector to contains, it acts as a logical OR, while chained select statements act as a logical AND, because each one narrows the previous selection. So for your example data:
We can filter for (v1 OR v2) AND min like this:
library(tidyverse)
data %>%
select(contains(c('v1','v2'))) %>%
select(contains('min'))
AXv1c2min ss25v2xxmin
1 1 7
2 2 8
3 3 9
So as a function where either argument is optional:
createData <- function(data, fun=NULL, cols=NULL) {
if (!is.null(fun)) data <- select(data, contains(fun))
if (!is.null(cols)) data <- select(data, contains(cols))
return(data)
}
A series of examples:
createData(data, cols=c('v1', 'v2'), fun='min')
AXv1c2min ss25v2xxmin
1 1 7
2 2 8
3 3 9
createData(data, cols=c('v1'))
AXv1c2min subv1trwmax
1 1 4
2 2 5
3 3 6
createData(data, fun=c('min'))
AXv1c2min ss25v2xxmin
1 1 7
2 2 8
3 3 9
createData(data, cols=c('v1'), fun=c('min', 'max'))
AXv1c2min subv1trwmax
1 1 4
2 2 5
3 3 6
createData(data, cols=c('v1'), fun=c('max'))
subv1trwmax
1 4
2 5
3 6
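Since the question says dplyr is not required, here is a rough base R sketch of the same OR-within, AND-between logic using grepl. The name createDataBase is made up for illustration. Note one difference: with fixed = TRUE the match is case-sensitive, whereas contains() ignores case by default.

```r
data <- data.frame(AXv1c2min = c(1, 2, 3),
                   subv1trwmax = c(4, 5, 6),
                   ss25v2xxmin = c(7, 8, 9),
                   cwfv2urttmmax = c(10, 11, 12))

# createDataBase() is a hypothetical base R counterpart of createData()
createDataBase <- function(data, cols = NULL, fun = NULL) {
  keep <- rep(TRUE, ncol(data))
  for (patterns in list(cols, fun)) {
    if (!is.null(patterns)) {
      # OR within one argument: a name matches if it contains any pattern
      hit <- Reduce(`|`, lapply(patterns, grepl, x = names(data), fixed = TRUE))
      # AND between arguments: a column must pass every supplied filter
      keep <- keep & hit
    }
  }
  data[, keep, drop = FALSE]
}

createDataBase(data, cols = c("v1", "v2"), fun = "min")
createDataBase(data, cols = "v1", fun = "max")
```

Using fixed = TRUE treats v1 and min as literal substrings, so no regular expression has to be assembled from the vectors passed in.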

R dataframe subset collation optimization

This isn't a question about how to do something per se, it's more about how to do something better.
In R, say I have a dataframe, df:
df<-read.table(text="
Column1 Column2 Category
1 1 A
2 1 B
3 1 D
4 1 E
5 2 B
6 3 B
7 4 C
8 4 C
9 5 E
10 6 A", header=TRUE)
Now I would like to create a list (of dataframes) where each dataframe in the list is a subset of df where each subset is conditional on Category. I can create this as follows:
mylist <-list()
mylist[[1]] <- subset(df,df$Category=='A')
mylist[[2]] <- subset(df,df$Category=='B')
mylist[[3]] <- subset(df,df$Category=='C')
mylist[[4]] <- subset(df,df$Category=='D')
mylist[[5]] <- subset(df,df$Category=='E')
Now this works but is pretty clunky, is effectively a hard-coded loop and won't scale easily if I have more than five categories.
Is there a tighter/better way to do this?
You can use the function split
split(df,df$Category)
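A short sketch of what split returns for the data in the question. The variable name by_category is just for illustration. The result is a list named by category, so elements can be pulled out by name rather than by position.

```r
df <- read.table(text = "
Column1 Column2 Category
1 1 A
2 1 B
3 1 D
4 1 E
5 2 B
6 3 B
7 4 C
8 4 C
9 5 E
10 6 A", header = TRUE)

# One data frame per distinct Category value
by_category <- split(df, df$Category)

names(by_category)   # the category labels, in sorted order
by_category[["B"]]   # the subset where Category == "B"
```

This scales to any number of categories without changes to the code.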
You can use dplyr library and loop for this case:
library(dplyr)
mylist <-list()
for ( v in unique(df$Category)){
mylist[[length(mylist)+1]] <- filter(df, Category == v)
}
mylist

Filtering a R DataFrame with repeated values in columns

I have an R DataFrame and I want to make another DF from it, keeping only the rows whose value in a given column appears more than X times.
>DataFrame
Value Column
1 a
4 a
2 b
6 c
3 c
4 c
9 a
1 d
For example, I want a new DataFrame with only the values in Column which appear more than 2 times, to get something like this:
>NewDataFrame
Value Column
1 a
4 a
6 c
3 c
4 c
9 a
Thank you very much for your time.
We can use table to get the count of values in 'Column' and subset the dataset ('DataFrame') based on the names in 'tbl' that have a count greater than 'n'
n <- 2
tbl <- table(DataFrame$Column) > n
NewDataFrame <- subset(DataFrame, Column %in% names(tbl)[tbl])
# Value Column
#1 1 a
#2 4 a
#4 6 c
#5 3 c
#6 4 c
#7 9 a
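Or using ave from base R (counting over seq_along(Column) keeps the result numeric even when Column is character)
NewDataFrame <- DataFrame[with(DataFrame, ave(seq_along(Column), Column, FUN=length)>n),]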
Or using data.table
library(data.table)
NewDataFrame <- setDT(DataFrame)[, .SD[.N>n] , by = Column]
Or
NewDataFrame <- setDT(DataFrame)[, if(.N > n) .SD, by = Column]
Or dplyr
library(dplyr)
NewDataFrame <- DataFrame %>%
group_by(Column) %>%
filter(n()>2)
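To try the base R variants on the data from the question, the following sketch rebuilds DataFrame and checks that the table route and a group-size route agree. The intermediate names (group_size, by_table, by_size) are illustrative only.

```r
DataFrame <- data.frame(Value  = c(1, 4, 2, 6, 3, 4, 9, 1),
                        Column = c("a", "a", "b", "c", "c", "c", "a", "d"))
n <- 2

# Route 1: count each value with table(), keep rows whose value is frequent
tbl <- table(DataFrame$Column) > n
by_table <- subset(DataFrame, Column %in% names(tbl)[tbl])

# Route 2: attach the group size to every row, then filter on it
group_size <- ave(seq_along(DataFrame$Column), DataFrame$Column, FUN = length)
by_size <- DataFrame[group_size > n, ]

identical(by_table, by_size)  # both keep the same rows
by_size
```

Both routes keep the original row order and row names, which is why the two results compare as identical.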

How to create a table in R that includes column totals

I'm somewhat new to R programming and am in need of assistance.
I'm looking to take the sum of 4 columns in a dataframe and list these totals in a simple table.
Essentially, take the sum of 4 columns (A, B, C, D) and list the total in a table (table = column 1: A, B, C, D column 2: sum of column A, B, C, D) - something along the lines of:
A = 3
B = 4
C = 4
D = 3
Does anyone know how to get this output? Also, the less "manual" the response, the better (i.e. trying to avoid having to input several lines of code to get this output if possible).
Thank you.
If your data looks like this:
a <- c(1:4)
b <- c(2:5)
c <- c(3:6)
d <- c(4:7)
df <- data.frame(a,b,c,d)
a b c d
1 1 2 3 4
2 2 3 4 5
3 3 4 5 6
4 4 5 6 7
Use
> res <- sapply(df,sum)
to get
a b c d
10 14 18 22
To apply the function only to numeric columns, try
> res <- colSums(df[sapply(df,is.numeric)])
There is colSums:
colSums(Filter(is.numeric, df))
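Since the question asks for a two-column table (column name and total), one possible way to reshape the named vector from colSums is sketched below. The names totals, totals_table, column and total are chosen for illustration.

```r
df <- data.frame(a = 1:4, b = 2:5, c = 3:6, d = 4:7)

# Named numeric vector of column sums, restricted to numeric columns
totals <- colSums(Filter(is.numeric, df))

# Two-column table: one row per original column
totals_table <- data.frame(column = names(totals),
                           total  = unname(totals))
totals_table
```

Filter(is.numeric, df) guards against non-numeric columns, which would otherwise make colSums fail.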
