Frequency table with common values of 5 tables - r

I have 5 data frames and I have to analyze just the first column of each. From these, I must obtain a frequency table of their common words (not necessarily common to all data frames; for example, a word can appear in just two or more of them).
Then I must obtain a frequency table of the common words of ALL data frames.
I tried a for loop but it seems very complicated. Moreover, the data frames have different dimensions. I didn't find any useful function.
Then I tried doing
lst1 <- list(a,b,c,d,e)
newdat <- stack(setNames(lapply(lst1, "[", 1), seq_along(lst1)))[2:1]
library(dplyr)
newdat %>% group_by(val) %>% filter(uniqueN(ind) > 1) %>% count(val)
but it gives me an error
> stack(setNames(lapply(lst1, "[", 1), seq_along(lst1)))
Error in stack.default(setNames(lapply(lst1, "[", 1), seq_along(lst1))):
at least one vector element is required
Thank you

Here's my solution using purrr & dplyr:
library(purrr)
library(dplyr)
lst1 <- list(mtcars=mtcars, iris=iris, chick=chickwts, cars=cars, airqual=airquality)
lst1 %>%
  map_dfr(select, value = 1, .id = "df") %>%          # select the first column of every dataframe and name it "value"
  group_by(value) %>%
  summarise(freq = n(),                               # frequency over all dataframes
            n_df = n_distinct(df),                    # number of dataframes this value occurs in
            dfs = paste(unique(df), collapse = ",")) %>%
  filter(n_df > 1)   # use filter(n_df == 5) instead if the value has to be in all 5 dataframes
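A side note on the error in the question: stack() keeps only list elements that are atomic vectors (or factors), and lapply(lst1, "[", 1) returns one-column data frames, so every element is dropped and stack() stops with "at least one vector element is required". A possible fix for that original approach (a sketch, assuming the first columns have compatible types) is to extract each column as a vector with "[[":
newdat <- stack(setNames(lapply(lst1, "[[", 1), seq_along(lst1)))[2:1]
Note that stack() names the stacked column values (not val), and uniqueN() is from data.table; n_distinct() is the dplyr equivalent.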

group_by list of dataframes grouping by wildcarded columns in R

A Stack Overflow member (Gregor Thomas) helped me, in my previous post, to learn about pivot_longer in order to transform my dataset to do operations on it.
This works great if there is a constant grouping column(s).
However, I found that I have many index columns TS_Wafer(n), resulting in many dataframes.
I combined the dataframes into a list and was able to use lapply to perform the pivot_longer on the list of dataframes; however, I am stuck when trying to perform the group_by operation.
The grouping needs to be done such that the n in TS_Wafer(n) matches the Wafer number.
So for example if the dataset is:
TS_Wafer1                   TS_Wafer2                   Wafer  value
2022-06-29T03:43:53.767582                              1      418.274905
2022-06-29T03:43:53.767582                              1      449.370044
2022-06-29T03:43:53.767582                              1      412.800065
2022-06-29T03:43:53.767582                              1      429.350565
                            2022-06-29T02:11:52.485032  2      439.345743
                            2022-06-29T02:11:52.485032  2      415.363545
                            2022-06-29T02:11:52.485032  2      427.456437
                            2022-06-29T02:11:52.485032  2      438.357252
I want to find the max and min of value where the dataset is grouped by TS_Wafer1 and Wafer = 1.
Here is the code I have so far:
dflist <- lapply(ls(pattern="df[0-9]+"), function(x) get(x)) # combine dataframes into list
apply_long_func <- function(df) {
  df %>%
    pivot_longer(
      cols = -starts_with("TS"),
      names_pattern = "([0-9]+).*([0-9]+)",
      names_to = c("Wafer", "Radius"),
      values_to = "Temperature"
    ) %>%
    as.data.frame
}
dflong <- lapply(dflist, apply_long_func) #Gives the dataset shown in the example above
# This is where I'm not sure
apply_group_func <- function(df) {
  df %>%
    group_by(TS, Wafer) %>%
    summarize(
      max = max(value),
      min = min(value),
      .groups = "drop"
    ) %>%
    as.data.frame
}
I would then use the same lapply approach for the group_by, but how do I specify TS_Wafer(i)?
Should I use a for loop?
Any help would be greatly appreciated.
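One possible approach (a sketch, not from the original thread, assuming the long data looks like the example above with a value column): pivot the TS_Wafer columns into long form as well, so the wafer number embedded in each column name becomes a regular column that can be matched against Wafer before grouping:
library(dplyr)
library(tidyr)
apply_group_func <- function(df) {
  df %>%
    # collapse TS_Wafer1, TS_Wafer2, ... into a name/value pair,
    # keeping the numeric suffix of each column name as TS_n
    pivot_longer(
      cols = starts_with("TS_Wafer"),
      names_prefix = "TS_Wafer",
      names_to = "TS_n",
      values_to = "TS",
      values_drop_na = TRUE
    ) %>%
    # keep only the timestamp whose suffix matches the Wafer number
    filter(TS_n == Wafer) %>%
    group_by(TS, Wafer) %>%
    summarize(max = max(value), min = min(value), .groups = "drop")
}
dfgrouped <- lapply(dflong, apply_group_func)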

Determine the size of string in a particular cell in dataframe: R

In a data frame, I have a column (type: chr) that contains answers separated by a comma. I want to create another column based on the size of the string and award points. For example, some of the entries in a column are:
Column1
word1,word2,word3
word1,word2
word1
Now, for the first cell, I want the size of the cell to be evaluated as 3 (as it contains three distinct words and there are no duplicates among the cell values). I'm not sure how to achieve this.
An option is to split the column with strsplit into a list of vectors, get the unique elements by looping over the list with lapply, and get the lengths:
df1$Size <- lengths(lapply(strsplit(df1$Column1, ",\\s*"), unique))
Another option is separate_rows from tidyr
library(dplyr)
library(tidyr)
df1 %>%
  mutate(rn = row_number()) %>%
  separate_rows(Column1) %>%
  group_by(rn) %>%
  summarise(Size = n_distinct(Column1), .groups = 'drop') %>%
  select(Size) %>%
  bind_cols(df1, .)
Output:
#             Column1 Size
#1 word1,word2,word3    3
#2       word1,word2    2
#3             word1    1
Data:
df1 <- data.frame(Column1 = c('word1,word2,word3', 'word1,word2', 'word1'))
Original Answer:
Another option:
library(dplyr)
library(stringr)
df %>%
  mutate(Lengths = str_count(Column1, ",") + 1)
Edit:
I hadn't noticed the OP's requirement properly (about non-duplicates). As #Onyambu pointed out in the comments, this chunk only works if there are no duplicated words in the data.
It basically counts how many words there are.
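If duplicates do need to be handled while staying close to this stringr style, one possible adjustment (a sketch, not from the original answer) is to split each cell and count the distinct words per row:
library(dplyr)
library(purrr)
library(stringr)
df %>%
  mutate(Size = map_int(str_split(Column1, ",\\s*"), n_distinct))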

Split a data.frame by group into a list of vectors rather than a list of data.frames

I have a data.frame which maps an id column to a group column, and the id column is not unique because the same id can map to multiple groups:
set.seed(1)
df <- data.frame(id = paste0("id", sample(1:10,300,replace = T)), group = c(rep("A",100), rep("B",100), rep("C",100)), stringsAsFactors = F)
I'd like to convert this data.frame into a list where each element is the ids in each group.
This seems a bit slow for the size of data I'm working with:
library(dplyr)
df.list <- lapply(unique(df$group), function(g) dplyr::filter(df, group == g)$id)
So I was thinking about this:
df.list <- df %>%
dplyr::group_by(group) %>%
dplyr::group_split()
Assuming it is faster than my first option, any idea how to get it to return the same output as in the first option rather than a list of data.frames?
Using base R only, with split. It should be faster than the == comparison with unique:
with(df, split(id, group))
Or, with tidyverse, we can pull the column after the group_split. group_split returns a list of data.frames/tibbles and could be slower compared to the split-only method above. But here we can make some performance improvements by removing the group column (keep = FALSE) and then, within the list, pulling the 'id' column to create the list of vectors:
library(dplyr)
library(purrr)
df %>%
  group_split(group, keep = FALSE) %>% # note: this argument was renamed to .keep in dplyr >= 1.0.0
  map(~ .x %>%
        pull(id))
Or use {} with the pipe
df %>%
  {split(.$id, .$group)}
Or wrap with with
df %>%
  with(., split(id, group))
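To check the speed claim on your own data, a quick timing sketch (the microbenchmark package is an assumption here; any timing tool works):
library(dplyr)
library(microbenchmark)
microbenchmark(
  base  = with(df, split(id, group)),
  dplyr = lapply(unique(df$group), function(g) dplyr::filter(df, group == g)$id),
  times = 100
)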

Count number of zeros in each row of large data.frame using purrr::map function

I have a very large dataframe (280,000 x 20) and many of the rows (obs) contain only 1 or 0 nonzero values. The function I'm using needs at least 2 values per operation. I can iterate with a for loop, but it takes a long time. I would like to use one of the purrr map functions to increase speed, as I will be doing this many times. This is how I've been doing it with a for loop:
library(Matrix)
M1 <- as.matrix(rsparsematrix(100, 20, .1, rand.x = runif))
x <- vector("integer")
for(i in 1:dim(M1)[1]){
  l <- length(which(M1[i, ] == 0))
  x <- c(x, l)
}
ind <- which(x == 19 | x == 20)
M1 <- M1[-ind,]
I haven't figured out the right way to do it using map. I assume it requires creating another column using mutate.
M1 %>% mutate(zero_count = length(map(which(. == 0))))
It is not clear what the expected output is. First, we convert the matrix to a tibble or data.frame, then mutate the columns to logical, reduce to a single vector by adding (+) all the TRUE values in each row, and cbind the resulting vector with the original matrix ('M1'):
library(tidyverse)
M1 %>%
  as_tibble %>%
  mutate_all(funs(. == 0)) %>%
  reduce(`+`) %>%
  cbind(M1, Count = .)
Update
For subsetting the rows based on the sum (note this keeps the rows with 19 or 20 zeros; negate the condition to drop them instead, as in the question):
M1 %>%
  as_tibble %>%
  mutate_all(funs(. == 0)) %>%
  reduce(`+`) %>%
  `%in%`(19:20) %>%
  magrittr::extract(M1, ., )
With base R, it is rowSums on a logical matrix and cbind with the original matrix
cbind(M1, Count = rowSums(!M1))
Or subsetting with the rowSums
M1[rowSums(!M1) %in% 19:20, ]
You can achieve the same thing with apply
apply(M1, 1 , function(x) sum(!x))
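For reference, funs() used above has since been deprecated in dplyr; a sketch of the same row-wise zero count in the newer across() style (assuming dplyr >= 1.0):
library(dplyr)
M1 %>%
  as_tibble() %>%
  mutate(Count = rowSums(across(everything(), ~ .x == 0)))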

Extract a dataframe from a column of dataframes (tidyverse approach)

I have been able to do some nice things with purrr to work with dataframe columns within a dataframe, by which I mean a column of a dataframe where every cell contains a dataframe itself.
I am trying to find out the idiomatic approach for extracting one of these dataframes back out.
Example
# Create a couple of dataframes:
df1 <- tibble::tribble(~a, ~b,
                       1, 2,
                       3, 4)
df2 <- tibble::tribble(~a, ~b,
                       11, 12,
                       13, 14)
# Make a dataframe with a dataframe column containing
# our first two dfs as cells:
meta_df <- tibble::tribble(~df_name, ~dfs,
                           "One", df1,
                           "Two", df2)
My question is, what is the tidyverse-preferred way of getting one of these dataframes back out of meta_df? Say I get the cell I want using select() and filter():
library("magrittr")
# This returns a 1x1 tibble with the only cell containing the 2x2 tibble that
# I'm actually after:
meta_df %>%
  dplyr::filter(df_name == "Two") %>%
  dplyr::select(dfs)
This works, but seems non-tidyverse-ish:
# To get the actual tibble that I'm after I can wrap the whole lot in brackets
# and then use position [[1, 1]] to index into it:
(meta_df %>%
   dplyr::filter(df_name == "Two") %>%
   dplyr::select(dfs))[[1, 1]]
# Or a pipeable version:
meta_df %>%
  dplyr::filter(df_name == "Two") %>%
  dplyr::select(dfs) %>%
  `[[`(1, 1)
I have a feeling that this might be a situation where the answer is in purrr rather than dplyr, and that it might be a simple trick once you know it, but I'm coming up blank so far.
Better solution:
Use tidyr::unnest():
meta_df %>%
  dplyr::filter(df_name == "Two") %>%
  dplyr::select(dfs) %>%
  tidyr::unnest() # in tidyr >= 1.0, specify the column: tidyr::unnest(cols = dfs)
Other solution:
You can use pull (the tidyverse way to select a column, equivalent to $), but it returns a one-element list of tibbles, so you need to add %>% .[[1]] at the end.
meta_df %>%
  dplyr::filter(df_name == "Two") %>%
  dplyr::pull(dfs) %>% .[[1]]
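Since the question hints that the answer might be in purrr: an equivalent sketch (my suggestion, not from the original answers) uses purrr::pluck to take the first element of the pulled list-column:
meta_df %>%
  dplyr::filter(df_name == "Two") %>%
  dplyr::pull(dfs) %>%
  purrr::pluck(1)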
