Loop over left joins - r

I've been trying to loop over left joins in R. I need to create a table whose columns each represent a different sample drawn from a larger table.
library(dplyr)
library(tidyr)
largetable <- data.frame(PlotCode=c(rep("Plot1",20),rep("Plot2",20)),
Category=c(rep("A",8),rep("B",8),rep("C",4),rep("A",12),rep("B",4),rep("C",4)))
a <- data.frame(PlotCode=c("Plot1","Plot1","Plot2","Plot2"),
Category=c("A","B","A","B"))
##example of code to loop over 100 left joins derived from samples of two elements from a large table. It fails to create the columns.
for (i in 1:100){
count <- largetable %>% group_by(PlotCode) %>% sample_n(2, replace = TRUE)%>%
count(PlotCode,Category)
colnames(count)[3] <- paste0("n",i)
b <- left_join(a, count, by = c("PlotCode","Category"))
}
##example of desired output table. Columns n1 to n100 should change depending on the samples.
b <- data.frame(PlotCode=c("Plot1","Plot1","Plot2","Plot2"),
Category=c("A","B","A","B"),
n1=c(2,1,0,1),
n2=c(1,1,1,1),
n3=c(2,0,1,2))
How can I loop over left joins so each column corresponds to a different sample?

Instead of a for loop, we can use rerun() or replicate() to repeat a process n times.
In each iteration we randomly select 2 rows from each PlotCode and count their Category values, which gives a list of n data frames. These are joined together with reduce(); the count columns are then renamed as you like and NA is replaced with 0.
library(dplyr)
library(purrr)
n <- 10
rerun(n, largetable %>%
group_by(PlotCode) %>%
slice_sample(n = 2, replace = TRUE) %>%
count(PlotCode,Category)) %>%
reduce(full_join, by = c('PlotCode', 'Category')) %>%
rename_with(~paste0('n', seq_along(.)), starts_with('n')) %>%
mutate(across(starts_with('n'), tidyr::replace_na, 0))
# PlotCode Category n1 n2 n3 n4 n5 n6 n7 n8 n9 n10
# <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 Plot1 A 1 0 2 2 0 1 0 1 2 2
#2 Plot1 B 1 0 0 0 1 1 2 1 0 0
#3 Plot2 B 1 0 0 0 1 0 0 0 0 0
#4 Plot2 C 1 2 0 0 0 0 1 1 0 0
#5 Plot1 C 0 2 0 0 1 0 0 0 0 0
#6 Plot2 A 0 0 2 2 1 2 1 1 2 2
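If you would rather keep the for loop from the question, here is a minimal sketch of one way to repair it. The original loop joined each sample onto the untouched `a` and overwrote `b`, so only the last sample survived; joining onto `b` itself accumulates one column per iteration. The seed, the loop length of 5 and the helper name `smp` are my own choices for illustration, not part of the original post.

```r
library(dplyr)
library(tidyr)

largetable <- data.frame(
  PlotCode = c(rep("Plot1", 20), rep("Plot2", 20)),
  Category = c(rep("A", 8), rep("B", 8), rep("C", 4),
               rep("A", 12), rep("B", 4), rep("C", 4))
)
a <- data.frame(PlotCode = c("Plot1", "Plot1", "Plot2", "Plot2"),
                Category = c("A", "B", "A", "B"))

set.seed(1)  # arbitrary seed, only so the sketch is reproducible
b <- a       # start from the key table and grow it inside the loop
for (i in 1:5) {
  # draw 2 rows per PlotCode and count them into a column named n<i>
  smp <- largetable %>%
    group_by(PlotCode) %>%
    slice_sample(n = 2, replace = TRUE) %>%
    ungroup() %>%
    count(PlotCode, Category, name = paste0("n", i))
  # join onto b (not a) so earlier columns are kept
  b <- left_join(b, smp, by = c("PlotCode", "Category"))
}
# combinations that were not sampled come back as NA; turn them into 0
b <- b %>% mutate(across(starts_with("n"), ~ replace_na(.x, 0L)))
```

Because this joins onto `a` with left_join, only the A and B rows listed in `a` are kept, which matches the desired output in the question; the full_join version above also keeps category C.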

Related

One hot encode list of vectors

Is there a quick way to one-hot encode lists of vectors (with different lengths) in R, preferably using tidyverse?
For example:
vals <- list(a=c(1), b=c(2,3), c=c(1,2))
The wanted result is a wide dataframe:
1 2 3
a 1 0 0
b 0 1 1
c 1 1 0
Thanks!
We can enframe the list and convert them into separate rows, create a dummy column and convert the data into wide-format using pivot_wider.
library(tidyverse)
enframe(vals) %>%
unnest(value) %>%
mutate(temp = 1) %>%
pivot_wider(names_from = value, values_from = temp, values_fill = list(temp = 0))
# name `1` `2` `3`
# <chr> <dbl> <dbl> <dbl>
#1 a 1 0 0
#2 b 0 1 1
#3 c 1 1 0
One base R option could be:
t(table(stack(vals)))
values
ind 1 2 3
a 1 0 0
b 0 1 1
c 1 1 0
Another base R approach:
do.call(rbind, lapply(vals, function(i) as.integer(!is.na(match(unique(unlist(vals)), i)))))
# [,1] [,2] [,3]
#a 1 0 0
#b 0 1 1
#c 1 1 0
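The last base R answer returns a matrix without column names. As a small sketch of the same idea that keeps the labels, one could test each vector against the sorted set of distinct values; the names `lvls` and `onehot` are mine, chosen for illustration.

```r
vals <- list(a = c(1), b = c(2, 3), c = c(1, 2))

# all distinct values across the list, in sorted order
lvls <- sort(unique(unlist(vals)))

# one row per list element, one column per distinct value
onehot <- t(sapply(vals, function(v) as.integer(lvls %in% v)))
colnames(onehot) <- lvls
onehot
```

The row names come from the names of the list, so this assumes `vals` is a named list.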

Separate a row of data on different columns with the count of each item

I have a dataset with two columns where I want to separate the second one (delimited by |) into many columns where each column has the name of the item and the observation has the count.
id column
1 a|b|a
2 a|b|c|d|e
3 a|c|c
I would like to have columns with the name of each item and its count. For example:
id a b c d e
1 2 1 0 0 0
2 1 1 1 1 1
3 1 0 2 0 0
How do I get to separate this data such that the values are distributed in columns as such?
A tidyverse approach, assuming data frame named mydata:
library(dplyr)
library(tidyr)
mydata %>%
separate_rows(column, sep = "\\|") %>%
count(id, column) %>%
spread(column, n) %>%
replace(., is.na(.), 0) # or just spread(column, n, fill = 0)
Result:
# A tibble: 3 x 6
id a b c d e
<int> <int> <dbl> <dbl> <dbl> <dbl>
1 1 2 1 0 0 0
2 2 1 1 1 1 1
3 3 1 0 2 0 0
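In current tidyr, spread() is superseded by pivot_wider(), so here is a hedged sketch of the same pipeline in that style. The construction of `mydata` is my assumption, reconstructed from the sample shown in the question.

```r
library(dplyr)
library(tidyr)

# assumed reconstruction of the example data from the question
mydata <- data.frame(id = 1:3,
                     column = c("a|b|a", "a|b|c|d|e", "a|c|c"))

wide <- mydata %>%
  separate_rows(column, sep = "\\|") %>%   # one row per item
  count(id, column) %>%                    # items per id
  pivot_wider(names_from = column, values_from = n,
              values_fill = 0L)            # absent items become 0
wide
```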

Using mutate to create columns from column values [duplicate]

This question already has answers here:
Faster ways to calculate frequencies and cast from long to wide
With the following data frame, I would like to create new columns based on the "Type" column values using 'mutate' and count the number of instances that appear. The data should be grouped by "Group" and "Choice".
Over time, the "Type" column will have new values added in that aren't already listed, so the code should be flexible in that respect.
Is this possible using the dplyr library?
library(dplyr)
df <- data.frame(Group = c("A","A","A","B","B","C","C","D","D","D","D","D"),
Choice = c("Yes","Yes","No","No","Yes","Yes","Yes","Yes","No","No","No","No"),
Type = c("Fruit","Construction","Fruit","Planes","Fruit","Trips","Construction","Cars","Trips","Fruit","Planes","Trips"))
The desired result should be the following:
result <- data.frame(Group = c("A","A","B","B","C","D","D"),
Choice = c("Yes","No","Yes","No","Yes","Yes","No"),
Fruit = c(1,1,0,1,0,0,1),
Construction = c(0,1,0,0,1,0,0),
Planes = c(0,0,1,0,0,0,1),
Trips = c(0,0,0,0,1,0,2),
Cars = c(0,0,0,0,0,1,0))
We can do a count and then spread
library(tidyverse)
df %>%
count(Group, Choice, Type) %>%
spread(Type, n, fill = 0)
# A tibble: 7 x 7
# Group Choice Cars Construction Fruit Planes Trips
# <fct> <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 A No 0 0 1 0 0
#2 A Yes 0 1 1 0 0
#3 B No 0 0 0 1 0
#4 B Yes 0 0 1 0 0
#5 C Yes 0 1 0 0 1
#6 D No 0 0 1 1 2
#7 D Yes 1 0 0 0 0

I am trying to identify patterns of missing values in rows of a dataset

I am trying to find patterns in missing values in rows.
For example if I have this data set:
a b c d
1 0.1 NA NA
2 NA 3 4
5 NA 6 NA
I expect the output to be:
n a b c d m
1 0 0 1 1 2
1 0 1 0 0 1
1 0 1 0 1 2
where column m gives the number of missing values in the pattern, column n gives the number of rows showing that pattern, and 1s in the remaining columns mark the missing values. That is, the first row of the output reads: 1 row is missing 2 values, in variables c and d; the second row: 1 row is missing 1 value, in variable b; and so on.
I have tried using the subtable() function in the extracat package (archived version), but I can't find the locations of the missing values in each variable. I can only find frequencies.
rowmiss<-rowSums(is.na(dat1[1:ncol(dat1)]))
r1<-matrix(rowmiss, nrow=nrow(dat1))
subtable(rowmiss,1)
I expect the output to be as shown above. What I am finding so far is the frequency of missing values in rows but I expect patterns and positions of missing values.
Here's a tidyverse approach. The n column seems redundant; should it be doing something else?
library(tidyverse)
df %>%
rowid_to_column() %>%
gather(col, val, -rowid) %>%
mutate(val = is.na(val) * 1) %>%
group_by(rowid) %>% mutate(m = sum(val)) %>% ungroup() %>%
spread(col, val) %>%
mutate(n = 1) %>%
select(n, a:d, m)
# A tibble: 3 x 6
n a b c d m
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 0 0 1 1 2
2 1 0 1 0 0 1
3 1 0 1 0 1 2
An alternative way of doing this with tidyverse:
library(tidyverse)
df %>%
mutate_all(~ is.na(.) %>% as.numeric()) %>%
mutate(m = rowSums(.)) %>%
group_by_all() %>%
count()
Output (you may also want to ungroup() if doing anything further with the df):
# A tibble: 3 x 6
# Groups: a, b, c, d, m [3]
a b c d m n
<dbl> <dbl> <dbl> <dbl> <dbl> <int>
1 0 0 1 1 2 1
2 0 1 0 0 1 1
3 0 1 0 1 2 1
mice::md.pattern() also does basically what you want, but returns a matrix with some of the useful info in the rownames, so it would require a bit of processing to turn into a dataframe.
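For reference, the second tidyverse answer can also be written with across() instead of the older mutate_all() and group_by_all() helpers. This is only a sketch, and the data frame `df` is my reconstruction of the example in the question.

```r
library(dplyr)

# assumed reconstruction of the example data from the question
df <- data.frame(a = c(1, 2, 5),
                 b = c(0.1, NA, NA),
                 c = c(NA, 3, 6),
                 d = c(NA, 4, NA))

patterns <- df %>%
  mutate(across(everything(), ~ as.integer(is.na(.x)))) %>%  # 1 = missing
  mutate(m = rowSums(across(everything()))) %>%              # missing values per row
  count(across(everything()))                                # rows per pattern
patterns
```

Here `n` only exceeds 1 when several rows share the same missingness pattern, which is the case the question's n column seems meant for.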

How do I sum recurring values according to a level in a column and output a table of counts?

I'm new to R and I have data that looks something like this:
categories <- c("A","B","C","A","A","B","C","A","B","C","A","B","B","C","C")
animals <- c("cat","cat","cat","dog","mouse","mouse","rabbit","rat","shark","shark","tiger","tiger","whale","whale","worm")
dat <- cbind(categories,animals)
Some animals repeat according to the category. For example, "cat" appears in all three categories A, B, and C.
I would like my new dataframe output to look something like this:
A B C count
1 1 1 1
1 1 0 2
1 0 1 0
0 1 1 2
1 0 0 2
0 1 0 0
0 0 1 2
0 0 0 0
The number 1 under A, B, and C means that the animal appears in that category, 0 means the animal does not appear in that category. For example, the first line has 1s in all three categories. The count is 1 for the first line because "cat" is the only animal that repeats itself in each category.
Is there a function in R that will help me achieve this? Thank you in advance.
We can use table to create a cross-tabulation of categories and animals, transpose, convert to data.frame, group_by all categories and count the frequency per combination:
library(dplyr)
library(tidyr)
as.data.frame.matrix(t(table(dat))) %>%
group_by_all() %>%
summarize(Count = n())
Result:
# A tibble: 5 x 4
# Groups: A, B [?]
A B C Count
<int> <int> <int> <int>
1 0 0 1 2
2 0 1 1 2
3 1 0 0 2
4 1 1 0 2
5 1 1 1 1
Edit (thanks to @C. Braun). Here is how to also include the zero A, B, C combinations:
as.data.frame.matrix(t(table(dat))) %>%
bind_rows(expand.grid(A = c(0,1), B = c(0,1), C = c(0,1))) %>%
group_by_all() %>%
summarize(Count = n()-1)
or with complete, as suggested by @Ryan:
as.data.frame.matrix(t(table(dat))) %>%
mutate(non_missing = 1) %>%
complete(A, B, C) %>%
group_by(A, B, C) %>%
summarize(Count = sum(ifelse(is.na(non_missing), 0, 1)))
Result:
# A tibble: 8 x 4
# Groups: A, B [?]
A B C Count
<dbl> <dbl> <dbl> <dbl>
1 0 0 0 0
2 0 0 1 2
3 0 1 0 0
4 0 1 1 2
5 1 0 0 2
6 1 0 1 0
7 1 1 0 2
8 1 1 1 1
We have
xxtabs <- function(df, formula) {
xt <- xtabs(formula, df)
xxt <- xtabs( ~ . , as.data.frame.matrix(xt))
as.data.frame(xxt)
}
and
> xxtabs(dat, ~ animals + categories)
A B C Freq
1 0 0 0 0
2 1 0 0 2
3 0 1 0 0
4 1 1 0 2
5 0 0 1 2
6 1 0 1 0
7 0 1 1 2
8 1 1 1 1
(dat should really be constructed as data.frame(animals, categories)). This base approach uses xtabs() to form the first cross-tabulation
xt <- xtabs(~ animals + categories, dat)
then coerces using as.data.frame.matrix() to a second data.frame, and uses a second cross-tabulation of all columns of the computed data.frame
xxt <- xtabs(~ ., as.data.frame.matrix(xt))
coerced to the desired form
as.data.frame(xxt)
I originally said this approach was 'arcane', because it relies on knowing the difference between as.data.frame() and as.data.frame.matrix(); I think of xtabs() as a tool that users of base R should know. I see, though, that the other solutions also require this arcane knowledge, as well as knowledge of more obscure parts of the tidyverse (e.g., complete(), group_by_all(), funs()). Also, the other answers are not easily generalizable (or at least not written in a way that allows it): xxtabs() does not actually know anything about the structure of the incoming data.frame, whereas implicit knowledge of the incoming data is present throughout the other answers.
One 'lesson learned' from the tidy approach is to place the data argument first, allowing piping
dat %>% xxtabs(~ animals + categories)
If I understood you correctly, this should do the trick.
require(tidyverse)
dat %>%
mutate(value = 1) %>%
spread(categories, value) %>%
mutate_if(is.numeric, funs(replace(., is.na(.), 0))) %>%
mutate(count = rowSums(data.frame(A, B, C), na.rm = TRUE)) %>%
group_by(A, B, C) %>%
summarize(Count = n())
# A tibble: 5 x 4
# Groups: A, B [?]
A B C Count
<dbl> <dbl> <dbl> <int>
1 0. 0. 1. 2
2 0. 1. 1. 2
3 1. 0. 0. 2
4 1. 1. 0. 2
5 1. 1. 1. 1
Adding a data.table solution. First, pivot animals against categories in dat using dcast. Then, create all combinations of A, B, C using CJ. Join those combinations with the pivoted table and count the number of occurrences for each combination.
library(data.table)
dcast(as.data.table(dat), animals ~ categories, length)[
CJ(A=0:1, B=0:1, C=0:1), .(count=.N), on=c("A","B","C"), by=.EACHI]
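Several of the answers above assume that dat is a data frame rather than the character matrix that cbind() produces. As a self-contained sketch under that assumption, the whole task can be done by tabulating membership, counting the patterns, and then filling in the unseen combinations; the names `membership` and `counts` are mine.

```r
library(dplyr)
library(tidyr)

categories <- c("A","B","C","A","A","B","C","A","B","C","A","B","B","C","C")
animals <- c("cat","cat","cat","dog","mouse","mouse","rabbit","rat","shark",
             "shark","tiger","tiger","whale","whale","worm")
dat <- data.frame(animals, categories)  # data frame, not cbind()

# one row per animal, one 0/1 column per category
membership <- as.data.frame.matrix(table(dat$animals, dat$categories))

counts <- membership %>%
  count(A, B, C, name = "count") %>%            # animals per pattern
  complete(A = 0:1, B = 0:1, C = 0:1,
           fill = list(count = 0L))             # add patterns never observed
counts
```

This relies on each animal appearing at most once per category, as in the example; repeated pairs would give cell values above 1.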

Resources