I'm trying to figure out how to extract some specific information from very big tables (e.g., 30'000 rows and 50 columns).
Imagine I have this data frame:
S1 <- c(1,2,1,1,3,1)
S2 <- c(2,1,3,2,1,1)
S3 <- c(1,2,2,1,3,1)
S4 <- c(3,3,4,2,3,1)
S5 <- c(3,2,5,3,2,2)
count <- c(10,5,3,1,1,1)
df <- data.frame(count,S1,S2,S3,S4,S5)
What I need is to sum the column "count" when, for instance, S1 and S3 share the same value (it doesn't matter which value), but no other column has that value.
In this example, it should return 11, because I should only take into consideration the values of the column "count" from rows 1 and 4.
In rows 2, 5 and 6, S1 and S3 do share the same value, but I don't want to consider them because other columns also have that value. And row 3 is not considered simply because S1 and S3 have different values.
I know how to do it easily in Excel, but I was wondering how I could do it in R. I've tried some commands from dplyr, but I failed.
If any of you could help, I'd be very grateful.
A little more complex, but it works, using only base R. The form for comparing multiple columns in a simple way is taken from this question.
sum(df[df$S1==df$S3 & rowSums(sapply(df[,c(3,5,6)],`==`,e2=df$S1)) == 0,1])
[1] 11
The most complex part is how to check multiple columns. In this case we use sapply to compare the columns c(3,5,6) (that is, S2, S4 and S5) for equality ('==') with S1 (e2 is the second argument of the '==' function).
As ycw mentions, it can be a little complicated to define all the columns by a vector, so this form allows you to check all the columns except the ones we want to leave out.
sum(df[df$S1==df$S3 & rowSums(sapply(df[,!(colnames(df) %in% c("count", "S1", "S3"))],`==`,e2=df$S1)) == 0,1])
Applying the same procedure to both comparisons and defining only the vector of columns that must share the same value:
equals <- c("S1", "S3")
not_equals <- !(colnames(df) %in% c("count", equals))
sum(df[rowSums(sapply(df[,equals,drop=FALSE],`==`,e2=df[equals[1]])) == length(equals) &
rowSums(sapply(df[,not_equals,drop=FALSE],`==`,e2=df[equals[1]])) == 0, 1])
Note: use drop=FALSE when selecting a single column of a data frame, to avoid it being dropped to a vector, or simply omit the comma, this way:
sum(df[rowSums(sapply(df[equals],`==`,e2=df[equals[1]])) == length(equals) &
rowSums(sapply(df[not_equals],`==`,e2=df[equals[1]])) == 0, 1])
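As an aside (a quick sketch, not part of the original answer), requiring S2 and S4 to be the matching pair only means changing the equals vector; with this toy data only row 4 qualifies, so the expected result is 1:
equals <- c("S2", "S4")
not_equals <- !(colnames(df) %in% c("count", equals))
sum(df[rowSums(sapply(df[equals], `==`, e2 = df[equals[1]])) == length(equals) &
       rowSums(sapply(df[not_equals], `==`, e2 = df[equals[1]])) == 0, 1])
# [1] 1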
A solution using dplyr. There are two steps. The first filter() call finds rows with S1 == S3. The second, filter_at(), checks that all columns other than S1, S3, and count are not equal to S1 (which is the same as S3 after the first filter).
library(dplyr)
df2 <- df %>%
filter(S1 == S3) %>%
filter_at(vars(-S1, -S3, -count), all_vars(. != S1))
df2
count S1 S2 S3 S4 S5
1 10 1 2 1 3 3
2 1 1 2 1 2 3
Then the total count is as follows.
sum(df2$count)
[1] 11
Using dplyr, rowwise and filter:
library(dplyr)
df %>%
rowwise() %>%
filter(S1==S3 & !S1 %in% c(S2,S4,S5)) %>%
pull(count) %>%
sum()
# [1] 11
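Since only a fixed set of columns is compared, a vectorised variant without rowwise() should also work (a sketch, not part of the original answer):
library(dplyr)
df %>%
filter(S1 == S3, S1 != S2, S1 != S4, S1 != S5) %>%
summarise(total = sum(count))
# total should again be 11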
Background
Here's a toy df:
df <- data.frame(ID = c("a","b","c","d","e","f"),
gender = c("f","f","m","f","m","m"),
zip = c(48601,NA,29910,54220,NA,44663),stringsAsFactors=FALSE)
As you can see, I've got a couple of NA values in the zip column.
Problem
I'm trying to randomly sample 2 entire rows from df, but I want them to be rows for which zip is not NA.
What I've tried
This code gets me a basic (i.e. non-conditional) random sample:
df2 <- df[sample(nrow(df), 2), ]
But of course, that only gets me halfway to my goal -- a bunch of the time it's going to return a row with an NA value in zip. This code attempts to add the condition:
df2 <- df[sample(nrow(df$zip != NA), 2), ]
I think I'm close, but this yields the error "invalid first argument".
Any ideas?
We can use is.na:
tmp <- df[!is.na(df$zip),]
> tmp[sample(nrow(tmp), 2),]
We can use rownames + na.omit to sample the rows
> df[sample(rownames(na.omit(df["zip"])), 2),]
ID gender zip
3 c m 29910
4 d f 54220
Here is a base R solution with complete.cases()
# define a logical vector to identify NA
x <- complete.cases(df)
# subset only not NA values
df_no_na <- df[x,]
# do the sample
df_no_na[sample(nrow(df_no_na), 2),]
Output:
ID gender zip
3 c m 29910
6 f m 44663
For the tidyverse lovers out there...
library("dplyr")
df %>%
tidyr::drop_na() %>%
dplyr::slice_sample(n = 2)
If it's only NAs in the zip column that you care about, then:
df %>%
tidyr::drop_na(zip) %>%
dplyr::slice_sample(n = 2)
The important thing here is to avoid creating an unnecessary second data frame with the NA values dropped. You could use the solution using na.omit given in another answer, but alternatively you can use which to return a list of valid rows to sample from. For example:
nsamp <- 2
df[sample(which(!is.na(df$zip)), nsamp), ]
The advantage to doing it this way is that the condition inside the which can be anything you like, whether or not it involves missing values. For example this version will sample from all the rows with female gender in zip codes starting with 336:
df[sample(which(df$gender=='f' & grepl('^336', df$zip)), nsamp), ]
I have a data.frame called c41 (HERE). Some column names (e.g., type) in this data frame are repeated once or twice. As a result, data.frame adds a ".number" suffix to distinguish between them.
Suppose I want to subset variable type == 3 among all column names that have a "type" root in their names. Currently, I drop the ".number" suffixes and then subset but that incorrectly returns nothing.
Question: In BASE R, how can I subset a variable value (type == 3) without needing to include the ".number" suffixes (e.g., type == 3 instead of type.1 == 3)?
In other words, how can I find any "type" whose value is 3 regardless of its numeric suffix.
c41 <- read.csv("https://raw.githubusercontent.com/izeh/l/master/c4.csv")
c42 <- setNames(c41, sub("\\.\\d+$", "", names(c41))) # Take off the `".number"` suffixes
subset(c42, type == 3) # Now subset! But it returns nothing!
Renaming the columns to make them non-unique is a recipe for a headache and is not advisable. Without renaming the columns, in base R you could do something like this instead:
c41[rowSums(c41[grep("^type", names(c41))] == 3, na.rm = TRUE) > 0,]
I don't think subset() can be used here if column names are duplicated.
EDIT: I see that you edited your question to specify base R. Can't help you there! But perhaps a dplyr solution is of interest.
You can use dplyr::filter_at and the starts_with helper.
library(dplyr)
library(readr)
c4 <- read_csv("https://raw.githubusercontent.com/izeh/l/master/c4.csv")
c4 %>%
filter_at(vars(starts_with("type")), any_vars(. == 3))
Adding a select_at to display just the relevant columns:
c4 %>%
filter_at(vars(starts_with("type")), any_vars(. == 3)) %>%
select_at(vars(starts_with("type")))
Result:
# A tibble: 2 x 2
type type_1
<dbl> <dbl>
1 1 3
2 2 3
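filter_at() and select_at() are superseded in more recent dplyr; assuming dplyr >= 1.0.0, a rough equivalent using if_any() would be:
library(dplyr)
c4 %>%
filter(if_any(starts_with("type"), ~ . == 3)) %>%
select(starts_with("type"))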
My aim is to find row indices of a matrix (dat) that contain matching rows of another matrix (xy).
I find it easy to do this with smaller matrices, as shown in the examples below, but in my case both matrices have a very large number of rows.
For a toy example, the matrices dat and xy are given below. The aim is to recover the indices 14, 58, 99.
# toy data
dat <- iris
dat$Sepal.Length <- dat$Sepal.Length * (1 + runif(150))
xy <- dat[c(14, 58, 99), c(1, 5)]
For small matrices, the solutions would be
# solution 1
ind <- NULL
for(j in 1:nrow(xy)) {
ind[j] <- which((dat$Sepal.Length == xy[j, 1]) & (dat$Species == xy[j, 2]))
}
Or
# solution 2
which(outer(dat$Sepal.Length, xy[, 1], "==") &
outer(dat$Species, xy[, 2], "=="), arr.ind=TRUE)
But given the size of my data, these methods are not feasible. The first method takes a lot of time and the other fails due to lack of memory.
I wish I knew more data.table and dplyr.
With data.table, it's a join:
library(data.table)
setDT(dat); setDT(xy)
dat[xy, on=names(xy), which=TRUE]
# [1] 14 58 99
You could try this dplyr solution; whether it is practical depends on how big your data frames are.
#use dplyr filter
library(dplyr)
dat %>%
mutate(row_no = row_number()) %>%
filter(Sepal.Length %in% xy$Sepal.Length & Species %in% xy$Species) %>%
select(row_no)
#> row_no
#> 1 14
#> 2 58
#> 3 99
I used paste0() to concatenate Sepal.Length and Species into a temporary variable.
Then match() to return the index of the matches between the two temporary variables.
Then negate with '!': !is.na() removes the non-matches and gives a logical vector.
Then return which() indices are true.
which(!is.na(match(paste0(dat$Sepal.Length, dat$Species), paste0(xy$Sepal.Length, xy$Species))))
[1] 14 58 99
PS: merge() accepts multiple variables in by.x and by.y:
merge(dat, xy, by.x=c("Sepal.Length", "Species"), by.y=c("Sepal.Length", "Species"), all.x=FALSE, all.y=TRUE)
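If you want the row indices from merge() itself, one way (sketched here, not part of the original answer) is to carry an index column through the join:
dat$row_num <- seq_len(nrow(dat))
merged <- merge(dat, xy, by = c("Sepal.Length", "Species"))
sort(merged$row_num)
# [1] 14 58 99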
Following chinsoon12's suggestion, try this:
library(dplyr)
dat$rowind <- 1:nrow(dat) # adds row index if wanted (not necessary though)
newDf <- semi_join(dat, xy, by = c("Species", "Sepal.Length"))
For the setup you provided, you could use:
library(tidyverse)
dat %>%
mutate(row_num = row_number()) %>%
inner_join(xy, by = c("Sepal.Length", "Species")) %>%
pull(row_num)
This adds a column for the initial row number, does an inner join to produce a data frame with the rows in dat that match rows from xy, and then pulls the indices. (An inner join can duplicate a row of dat if several rows of xy match it, while a semi-join returns each matching row of dat at most once.)
It's worth noting that in this example we are dealing with data frames, not matrices:
> class(xy)
[1] "data.frame"
> class(dat)
[1] "data.frame"
The above code won't work if the data is in matrix form - can you convert your matrices to data frames or tibbles?
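If they really are matrices, a minimal conversion sketch (assuming the columns end up with sensible types after conversion) would be:
dat <- as.data.frame(dat)
xy <- as.data.frame(xy)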
If your data is huge, you can hash the rows of both matrices first and then match the row hash values, using the digest package.
library(digest)
target_matrix <- iris
query_matrix <- iris[c(14, 58, 99), ]
target_row_hash <- apply(target_matrix, 1, digest)
query_row_hash <- apply(query_matrix, 1, digest)
row_nums <- match(query_row_hash, target_row_hash)
row_nums
output:
[1] 14 58 99
I wish to do exactly this: Take dates from one dataframe and filter data in another dataframe - R
except without joining, as I am afraid that after I join my data the result will be too big to fit in memory, prior to the filter.
Here is sample data:
tmp_df <- data.frame(a = 1:10)
I wish to do an operation that looks like this:
lower_bound <- c(2, 4)
upper_bound <- c(2, 5)
tmp_df %>%
filter(a >= lower_bound & a <= upper_bound) # does not work as <= is vectorised inappropriately
and my desired result is:
> tmp_df[(tmp_df$a <= 2 & tmp_df$a >= 2) | (tmp_df$a <= 5 & tmp_df$a >= 4), , drop = F]
# one way to get indices to subset data frame, impractical for a long range vector
a
2 2
4 4
5 5
My problem with memory requirements (with respect to the join solution linked) is when tmp_df has many more rows and the lower_bound and upper_bound vectors have many more entries. A dplyr solution, or a solution that can be part of pipe is preferred.
Maybe you could borrow the inrange function from data.table, which
checks whether each value in x is in between any of the
intervals provided in lower,upper.
Usage:
inrange(x, lower, upper, incbounds=TRUE)
library(dplyr); library(data.table)
tmp_df %>% filter(inrange(a, c(2,4), c(2,5)))
# a
#1 2
#2 4
#3 5
If you'd like to stick with dplyr it has similar functionality provided through the between function.
# ranges I want to check between
my_ranges <- list(c(2,2), c(4,5), c(6,7))
tmp_df <- data.frame(a=1:10)
tmp_df %>%
filter(apply(bind_rows(lapply(my_ranges,
FUN=function(x, a){
data.frame(t(between(a, x[1], x[2])))
}, a)
), 2, any))
a
1 2
2 4
3 5
4 6
5 7
Just be aware that with between the interval boundaries are included by default and that cannot be changed, unlike with inrange.
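For comparison, a sketch of the exclusive-bounds behaviour with inrange, using the incbounds argument shown in its usage above:
library(dplyr); library(data.table)
tmp_df %>% filter(inrange(a, c(2, 4), c(2, 5), incbounds = FALSE))
# with this integer toy data the open intervals (2,2) and (4,5) contain nothing, so no rows are returned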
How do I remove all rows in a data frame where a certain column meets a string match criterion?
For example:
A,B,C
4,3,Foo
2,3,Bar
7,5,Zap
How would I return a dataframe that excludes all rows where C = Foo:
A,B,C
2,3,Bar
7,5,Zap
Just use == with the negation operator (!). If dtfm is the name of your data.frame:
dtfm[!dtfm$C == "Foo", ]
Or, to move the negation in the comparison:
dtfm[dtfm$C != "Foo", ]
Or, even shorter using subset():
subset(dtfm, C!="Foo")
You can use the dplyr package to easily remove those particular rows.
library(dplyr)
df <- filter(df, C != "Foo")
I had a column (A) in a data frame with 3 values in it (yes, no, unknown). I wanted to keep only those rows which had the value "yes", and this is the code I used; hope it helps you as well:
df <- df [(!(df$A=="no") & !(df$A=="unknown")),]
If you wish to use dplyr to remove the rows where C is "Foo":
df %>%
filter(!C == "Foo")
I know this has been answered but here is another option:
library(dplyr)
df %>% filter(!C == "Foo")
OR
df[!df$C == "Foo", ]
If your exclusion conditions are stored in another data frame you could use rows_delete:
library(dplyr)
removal_df <- data.frame(C = "Foo")
df %>%
rows_delete(removal_df, by = "C")
A B C
1 2 3 Bar
2 7 5 Zap
This is also handy if you have multiple exclusion conditions so you do not have to write out a long filter statement.
Note: rows_delete is only available if you have dplyr >= 1.0.0
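For instance, extending the example above with two exclusion values (a sketch):
library(dplyr)
removal_df <- data.frame(C = c("Foo", "Zap"))
df %>%
rows_delete(removal_df, by = "C")
#   A B   C
# 1 2 3 Bar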