R - ff package: find the most frequent element in an ffdf and delete the rows where it is located

I need a suggestion for how to find the most frequent element in an ffdf and then delete the rows where it occurs.
I decided to try the ff package as I'm working with very big data and with base R I am running out of memory.
Here is a little example:
# create a base R Matrix
> z<-matrix(c("a", "b", "a", "c", "b", "b", "c", "c", "b", "a"),nrow=5,ncol=2,byrow = TRUE)
> z
[,1] [,2]
[1,] "a" "b"
[2,] "a" "c"
[3,] "b" "b"
[4,] "c" "c"
[5,] "b" "a"
# convert z to ffdf
> u=as.data.frame(z, stringsAsFactors=TRUE)
> u=as.ffdf(u)
> u
ffdf data
V1 V2
1 a b
2 a c
3 b b
4 c c
5 b a
I'm looking for a way to:
Export the most frequent element in ffdf (in this case it is "b")
Delete from ffdf all the rows where "b" is located
So, the new ffdf must be as below:
V1 V2
1 a c
2 c c
In base R I found the way with the "table" function
temp <- table(as.vector(z))
t1<-names(temp)[temp == max(temp)]
z1<- z[rowSums(z== t1[1]) == 0, ]
But working with huge data I need something like the ff package.

require(ff)
z <- matrix(c("a","b","f","c","f","b","e","c","b","e"),nrow=5,ncol=2,byrow = TRUE)
u <- as.data.frame(z, stringsAsFactors=TRUE)
u <- as.ffdf(u)
u
The following should work on a dataset of any size. It uses table.ff and ffwhich from ffbase, ffrowapply from ff, and indexing based on ff integer vectors.
require(ffbase)
require(plyr)
## Detect most frequent item (assuming the levels of all columns can be different)
columnfreqs <- lapply(colnames(u), FUN=function(column) table(u[[column]]))
columnfreqs <- lapply(columnfreqs, FUN=function(x) as.data.frame(t(as.matrix(x))))
itemfreqs <- colSums(do.call(rbind.fill, columnfreqs), na.rm=TRUE)
mostfrequent <- names(sort(itemfreqs, decreasing = TRUE))[1]
## Identify the lines where the most frequent item occurs in each row of the ffdf
idx <- ffrowapply(
EXPR = apply(u[i1:i2,], MARGIN=1, FUN=function(row) any(row %in% mostfrequent)),
X=u,
RETURN = TRUE, FF_RETURN = TRUE, RETCOL = NULL, VMODE = "logical")
idx <- ffwhich(idx, idx != TRUE) # keep the rows where it does not occur + convert logicals to integer positions
## Remove them
u[idx, ]
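The chunk-wise idea behind the answer above can be tried out in plain base R on the question's small example. This is only a sketch under stated assumptions: it does not use ff or ffbase at all, `count_items` and `chunk_size` are hypothetical names introduced for illustration, and an ordinary data frame stands in for the ffdf.

```r
# Small in-memory stand-in for the ffdf from the question.
z <- matrix(c("a", "b", "a", "c", "b", "b", "c", "c", "b", "a"),
            nrow = 5, ncol = 2, byrow = TRUE)
u <- as.data.frame(z, stringsAsFactors = TRUE)

# Count item frequencies chunk by chunk, so that only one block of rows
# has to be tabulated at a time (count_items and chunk_size are hypothetical).
count_items <- function(df, chunk_size = 2) {
  totals <- integer(0)
  for (start in seq(1, nrow(df), by = chunk_size)) {
    end <- min(start + chunk_size - 1, nrow(df))
    chunk <- as.vector(as.matrix(df[start:end, , drop = FALSE]))
    tab <- table(chunk)
    for (item in names(tab)) {
      previous <- if (item %in% names(totals)) totals[[item]] else 0L
      totals[item] <- previous + tab[[item]]
    }
  }
  totals
}

itemfreqs <- count_items(u)
mostfrequent <- names(itemfreqs)[which.max(itemfreqs)]

# Keep only the rows in which the most frequent item does not occur.
keep <- rowSums(as.matrix(u) == mostfrequent) == 0
u_clean <- u[keep, , drop = FALSE]
u_clean
```

On this data the most frequent item is "b", and the two remaining rows are (a, c) and (c, c), which matches the result the question asks for.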

Related

How to obtain the list of elements from a Venn diagram

I have a Venn diagram made from 3 lists. I would like to obtain all the different sub-lists: the elements common to two lists, the elements common to all three of them, and the elements unique to each list. Is there a way to do this as straightforwardly as possible?
AW.DL <- c("a","b","c","d")
AW.FL <- c("a","b", "e", "f")
AW.UL <- c("a","c", "e", "g")
venn.diagram(
x = list(AW.DL, AW.FL, AW.UL),
category.names = c("AW.DL" , "AW.FL","AW.UL" ),
filename = '#14_venn_diagramm.png',
output=TRUE,
na = "remove"
)
I found that the package VennDiagram has a function calculate.overlap(), but I wasn't able to find a way to name the sections it returns. However, if you use the package gplots, there is the function venn(), which returns the sections in its intersections attribute.
AW.DL <- c("a","b","c","d")
AW.FL <- c("a","b", "e", "f")
AW.UL <- c("a","c", "e", "g")
library(gplots)
lst <- list(AW.DL,AW.FL,AW.UL)
ItemsList <- venn(lst, show.plot = FALSE)
lengths(attributes(ItemsList)$intersections)
Output:
> lengths(attributes(ItemsList)$intersections)
A B C A:B A:C B:C A:B:C
1 1 1 1 1 1 1
To get elements, just print attributes(ItemsList)$intersections:
> attributes(ItemsList)$intersections
$A
[1] "d"
$B
[1] "f"
$C
[1] "g"
$`A:B`
[1] "b"
$`A:C`
[1] "c"
$`B:C`
[1] "e"
$`A:B:C`
[1] "a"
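The same regions can also be derived with base R set operations, which may help when checking what each section contains. This is a minimal sketch that does not use gplots; the result names (`in_all`, `dl_fl_only`, and so on) are made up for illustration.

```r
AW.DL <- c("a", "b", "c", "d")
AW.FL <- c("a", "b", "e", "f")
AW.UL <- c("a", "c", "e", "g")

# Elements shared by all three lists.
in_all <- Reduce(intersect, list(AW.DL, AW.FL, AW.UL))

# Elements shared by exactly two lists (the third list is excluded with setdiff).
dl_fl_only <- setdiff(intersect(AW.DL, AW.FL), AW.UL)
dl_ul_only <- setdiff(intersect(AW.DL, AW.UL), AW.FL)
fl_ul_only <- setdiff(intersect(AW.FL, AW.UL), AW.DL)

# Elements that appear in one list only.
dl_unique <- setdiff(AW.DL, union(AW.FL, AW.UL))
fl_unique <- setdiff(AW.FL, union(AW.DL, AW.UL))
ul_unique <- setdiff(AW.UL, union(AW.DL, AW.FL))

list(all = in_all, DL_FL = dl_fl_only, DL_UL = dl_ul_only, FL_UL = fl_ul_only,
     DL = dl_unique, FL = fl_unique, UL = ul_unique)
```

This spells out every region by hand, so it scales poorly beyond three lists, but it makes the membership rule of each section explicit.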

Finding specific elements in lists

I am stuck at one of the challenges proposed in a tutorial I am reading.
# Using the following code:
challenge_list <- list(words = c("alpha", "beta", "gamma"),
numbers = 1:10,
letter = letters)
# challenge_list
# Extract the following things:
#
# - The word "gamma"
# - The letters "a", "e", "i", "o", and "u"
# - The numbers less than or equal to 3
I have tried the following:
## 1
challenge_list$"gamma"
## 2
challenge_list [[1]["gamma"]]
But nothing works.
> challenge_list$words[challenge_list$words == "gamma"]
[1] "gamma"
> challenge_list$letter[challenge_list$letter %in% c("a","e","i","o","u")]
[1] "a" "e" "i" "o" "u"
> challenge_list$numbers[challenge_list$numbers<=3]
[1] 1 2 3
We can write a function that subsets differently depending on whether the element is numeric, and then use Map to pair each element of the list with its corresponding filter value and apply f1 to each pair. This returns a new list with the filtered values.
f1 <- function(x, y) if(is.numeric(x)) x[ x <= y] else x [x %in% y]
out <- Map(f1, challenge_list, list('gamma', 3, c("a","e","i","o","u")))
out
-output
#$words
#[1] "gamma"
#$numbers
#[1] 1 2 3
#$letter
#[1] "a" "e" "i" "o" "u"
Try this. Most R objects can be filtered using brackets. In the case of lists you have to use a pair of them, like [[]][], because the first one points to the object inside the list and the second one refers to the elements inside that object. For vectors the task is easier, as you only need a single pair of brackets with a condition to extract elements. Here is the code:
#Data
challenge_list <- list(words = c("alpha", "beta", "gamma"),
numbers = 1:10,
letter = letters)
#Code
challenge_list[[1]][3]
challenge_list$letter[challenge_list$letter %in% c("a", "e", "i", "o","u")]
challenge_list$numbers[challenge_list$numbers<=3]
Since your data is in a list, you can also select the elements by position, like this:
#Data 2
challenge_list <- list(words = c("alpha", "beta", "gamma"),numbers = 1:10,letter = letters)
#Code 2
challenge_list[[1]][3]
challenge_list[[3]][challenge_list[[3]] %in% c("a", "e", "i", "o","u")]
challenge_list[[2]][challenge_list[[2]]<=3]
Output:
challenge_list[[1]][3]
[1] "gamma"
challenge_list[[3]][challenge_list[[3]] %in% c("a", "e", "i", "o","u")]
[1] "a" "e" "i" "o" "u"
challenge_list[[2]][challenge_list[[2]]<=3]
[1] 1 2 3
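The difference between the bracket forms discussed above can be checked directly. The following is a minimal sketch using the same list; the variable names `as_list` and `as_vector` are only illustrative.

```r
challenge_list <- list(words = c("alpha", "beta", "gamma"),
                       numbers = 1:10,
                       letter = letters)

# Single brackets keep the list wrapper: the result is a list of length 1.
as_list <- challenge_list["words"]

# Double brackets (or $) return the element itself, here a character vector.
as_vector <- challenge_list[["words"]]

is.list(as_list)
is.character(as_vector)

# Only the extracted vector can be filtered by a condition on its values.
as_vector[as_vector == "gamma"]
```

This is why `challenge_list$"gamma"` returns NULL: `$` looks up an element named "gamma" in the list, and the list only has elements named words, numbers and letter.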

Matching across datasets and columns

I have a vector with words, e.g., like this:
w <- LETTERS[1:5]
and a dataframe with tokens of these words but also tokens of other words in different columns, e.g., like this:
set.seed(21)
df <- data.frame(
w1 = c(sample(LETTERS, 10)),
w2 = c(sample(LETTERS, 10)),
w3 = c(sample(LETTERS, 10)),
w4 = c(sample(LETTERS, 10))
)
df
w1 w2 w3 w4
1 U R A Y
2 G X P M
3 Q B S R
4 E O V T
5 V D G W
6 T A Q E
7 C K L U
8 D F O Z
9 R I M G
10 O T T I
# convert factor to character:
df[] <- lapply(df[], as.character)
I'd like to extract from df all the tokens of those words that are contained in the vector w. I can do it like this, but that doesn't look nice and is highly repetitive and error-prone if the dataframe is larger:
extract <- c(df$w1[df$w1 %in% w],
df$w2[df$w2 %in% w],
df$w3[df$w3 %in% w],
df$w4[df$w4 %in% w])
I tried this, using paste0 to avoid addressing each column separately but that doesn't work:
extract <- df[paste0("w", 1:4)][df[paste0("w", 1:4)] %in% w]
extract
data frame with 0 columns and 10 rows
What's wrong with this code? Or which other code would work?
To answer your question, "What's wrong with this code?": The code df[paste0("w", 1:4)][df[paste0("w", 1:4)] %in% w] is the equivalent of df[df %in% w] because df[paste0("w", 1:4)], which you use twice, simply returns the entirety of df. That means df %in% w will return FALSE FALSE FALSE FALSE because none of the variables in df are in w (w contains strings but not vectors of strings), and df[c(F, F, F, F)] returns an empty data frame.
If you're dealing with a single data type (strings), and the output can be a character vector, then use a matrix instead of a data frame, which is faster and is, in this case, a little easier to subset:
mat <- as.matrix(df)
mat[mat %in% w]
#[1] "B" "D" "E" "E" "A" "B" "E" "B"
This produces the same output as your attempt above with extract <- ….
If you want to keep some semblance of the original data frame structure then you can try the following, which outputs a list (necessary as the returned vectors for each variable might have different lengths):
lapply(df, function(x) x[x %in% w])
#### OUTPUT ####
$w1
[1] "B" "D" "E"
$w2
[1] "E" "A"
$w3
[1] "B"
$w4
[1] "E" "B"
Just call unlist on the returned list if you want a vector.
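The explanation of why `df %in% w` fails can be reproduced on a tiny, fixed data frame. This is a sketch with made-up values (not the sampled data from the question), so the results do not depend on the random seed or the R version.

```r
w <- LETTERS[1:5]

# Hypothetical two-column data frame with fixed contents.
df <- data.frame(w1 = c("U", "E", "C"),
                 w2 = c("B", "X", "A"),
                 stringsAsFactors = FALSE)

# On a data frame, %in% compares whole columns: one result per column.
per_column <- df %in% w

# On a matrix, %in% compares individual cells: one result per cell.
mat <- as.matrix(df)
per_cell <- mat %in% w

# Logical indexing on the matrix then extracts the matching tokens.
tokens <- mat[per_cell]

# Per-column counts, keeping the column names.
counts <- vapply(df, function(x) sum(x %in% w), integer(1))
```

Here `per_column` has length 2 and is all FALSE, while `tokens` contains "E", "C", "B" and "A" in column-major order.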

Subsetting a data frame using another data frame

I'm having trouble with something that shouldn't be that hard to get around. What I would like to do is subset a data.frame by using another data.frame, and more precisely, by using a certain parameter.
Here goes the example:
df1<- t(data.frame(A=c("ABI", "BET", "ALN", "SPH", "PTI", "DIC", "PTD"), B=c("0.5","3","0","0","5","0","15"), C=c("0","0","3","15","15","0","0"), D=c("0.5","0.5","0.5","0","0","0","0"), E=c("37.5","37.5","0.5","62.5","0.5","0.5","1")))
df2<- data.frame(A=c("ABI", "BET", "ALN", "SPH", "PTI", "DIC", "PTD"), B=c("vasc", "vasc","vasc","spha", "moss","moss","moss"), C=c("a", "a", "b", "a", "c","d","a"))
Now, let's say that I want to keep in my df1 only the objects A (here they are species) that are "vasc" in df2.
For that I've tried a few things such as:
df3 <- subset(df2, B=="vasc")
df4 <- df1[,c(df1, as.vector(df2))]
But in doing so, I have an error of type:
Error in df1[, c(df1, as.vector(df2))] : invalid subscript type 'list'
Therefore, I've tried to unlist my dataframes but nothing seems to work. I've been on this problem for a while now and I did explore the forum to see if anyone had an elegant solution to my problem but it looks like not.
Another way of doing this subsetting was to do the following bit of code, but it didn't work either even though I felt closer to the solution:
try11 <- list(df2, df1)%>% rbindlist(., fill=T) # with df1 not transposed
df11 <- try11[try11=="vasc",]
I hope the code is good enough and my explanation clear enough.
Thank you!
You might try:
library(data.table)
setDT(df1)
setDT(df2)
dtPruned <- df1[A %in% df2[B == "vasc", A]]
Make sure to remove the t() call in your df1 definition for this to work, however. Basically, it selects the A column in df2 where B == "vasc", and then selects the rows from df1 whose A is among those values.
You can do it with dplyr
library(dplyr)
species <- as.character(df2[df2$B == "vasc",1])
# df1 defined without the t() call
df1 %>%
filter(A %in% species)
## A tibble: 3 x 5
# A B C D E
# <fct> <fct> <fct> <fct> <fct>
#1 ABI 0.5 0 0.5 37.5
#2 BET 3 0 0.5 37.5
#3 ALN 0 3 0.5 0.5
PS
Your data contains only factors. Maybe you want to store the numbers as numeric values instead.
This should do it. First we create a character vector (x) of all A values where B == "vasc" in df2. Then we select the columns of df1 whose value in row A is in x:
# Create a character vector of all A values when B == vasc
x <- as.character(df2[df2$B == "vasc", 1])
# Select columns where row A == x
df1[, which(df1[1, ] %in% x)]
[,1] [,2] [,3]
A "ABI" "BET" "ALN"
B "0.5" "3" "0"
C "0" "0" "3"
D "0.5" "0.5" "0.5"
E "37.5" "37.5" "0.5"
If we avoid the t call, we can do:
df1[df1$A %in% df2[df2$B == "vasc", 1], ]
A B C D E
1 ABI 0.5 0 0.5 37.5
2 BET 3 0 0.5 37.5
3 ALN 0 3 0.5 0.5
We could transpose the data frame to retain the same format as above:
t(df1[df1$A %in% df2[df2$B == "vasc", 1], ])
1 2 3
A "ABI" "BET" "ALN"
B "0.5" "3" "0"
C "0" "0" "3"
D "0.5" "0.5" "0.5"
E "37.5" "37.5" "0.5"
Data:
df1 <- t(data.frame(
A = c("ABI", "BET", "ALN", "SPH", "PTI", "DIC", "PTD"),
B = c("0.5","3","0","0","5","0","15"),
C = c("0","0","3","15","15","0","0"),
D = c("0.5","0.5","0.5","0","0","0","0"),
E = c("37.5","37.5","0.5","62.5","0.5","0.5","1")
)
)
df2 <- data.frame(
A = c("ABI", "BET", "ALN", "SPH", "PTI", "DIC", "PTD"),
B = c("vasc", "vasc","vasc","spha", "moss","moss","moss"),
C = c("a", "a", "b", "a", "c","d","a")
)
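Putting the suggestions from the answers together, here is a base R sketch that skips the t() call and stores the measurements as numbers rather than strings. Both choices are assumptions about what is wanted, not something the question states; the names `vasc_species`, `by_match` and `by_merge` are illustrative.

```r
# df1 without the transpose and with numeric measurement columns (an assumption).
df1 <- data.frame(
  A = c("ABI", "BET", "ALN", "SPH", "PTI", "DIC", "PTD"),
  B = c(0.5, 3, 0, 0, 5, 0, 15),
  C = c(0, 0, 3, 15, 15, 0, 0),
  D = c(0.5, 0.5, 0.5, 0, 0, 0, 0),
  E = c(37.5, 37.5, 0.5, 62.5, 0.5, 0.5, 1),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  A = c("ABI", "BET", "ALN", "SPH", "PTI", "DIC", "PTD"),
  B = c("vasc", "vasc", "vasc", "spha", "moss", "moss", "moss"),
  C = c("a", "a", "b", "a", "c", "d", "a"),
  stringsAsFactors = FALSE
)

# Species labelled "vasc" in df2.
vasc_species <- df2$A[df2$B == "vasc"]

# Option 1: logical row filter, keeps the original row order of df1.
by_match <- df1[df1$A %in% vasc_species, ]

# Option 2: inner join on A; merge() sorts the result by the key column.
by_merge <- merge(df1, df2[df2$B == "vasc", "A", drop = FALSE], by = "A")
```

Both options keep the same three species; they differ only in row order, because merge() sorts by the key by default.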

In R, how can I set the names of an object and return it in one line?

I would like to set the names of my R object and return it in one line. It should look something like:
names(doWork(), c("a", "b", "c"))
And perform the equivalent of:
x <- doWork()
names(x) <- c("a", "b", "c")
x
Is this possible?
You can try setNames
x <- setNames(doWork(), letters[1:3])
To add to what @rawr states:
`names<-`(x, letters[1:3])
works. This isn't super interesting for setting names, since setNames exists, but there are many other attribute replacement functions that don't have a corresponding attribute setting function, so this can become useful (when playing code golf). For example, if we want to set column names for a list of matrices:
mats <- replicate(2, matrix(sample(1:100, 4), 2), simplify=F) # list of matrices
lapply(mats, `colnames<-`, LETTERS[1:2])
Produces:
[[1]]
A B
[1,] 78 59
[2,] 39 93
[[2]]
A B
[1,] 99 54
[2,] 1 16
