Cross-tab in R with data.tables

Sorry if this question has already been asked. I have been playing with toy data to learn how to manipulate data.tables. My goal was to go from this data:
toy_data = data.table(from=c("A","A","A","C","E","E","A","A","A","C","E","E"),
to=c("B","C","A","D","F","E","E","A","A","A","C",NA))
to arrive at this result:
final_matrix
L A B C D E F
1: A 3 1 2 <NA> 1 <NA>
2: B 1 0 <NA> <NA> <NA> <NA>
3: C 2 <NA> 0 1 1 <NA>
4: D <NA> <NA> 1 0 <NA> <NA>
5: E 1 <NA> 1 <NA> 1 1
6: F <NA> <NA> <NA> <NA> 1 0
7: tot 7 1 4 1 4 1
(Eventually I also wanted zeros instead of NAs, but got bored.) I suppose in Stata this would be an easy cross-tab. Instead, I built a function, looped over the unique values in the columns (sigh :/), merged the tables, and then added a final row with the totals. Although I've learned a lot, I wonder what the clean R way to obtain such cross-tabs would be, since the following doesn't work:
table(toy_data$from,toy_data$to)
A B C D E F
A 3 1 1 0 1 0
C 1 0 0 1 0 0
E 0 0 1 0 1 1
Thanks. Here is my function; if you have general improvements or best practices to suggest, I would be super happy:
create_edge_cols<- function(dt,column){
#this function takes a df and a column,
#computes the number of edges among this column and all the other in dt
#returns a column (list) with the cross-tabulation of columns
tot_edges_i = dim(dt[from==column|to==column][,.(to=na.omit(to))])[1] # E better! without NAs
print(tot_edges_i)
# now tabulate links of column
tab = data.table(table(unlist(dt[(from==column&to!=column)|
(from!=column&to==column)])))
setnames(tab, "V1", "L")
setnames(tab, "N", column)
setorder(tab,"L")
tab[L==column,column] = length(dt[to==column & to == from,from])
#tab[,`:=`(L=L,column=column/as.numeric(tot_edges_i))]
return(tab)
}
#this should be the first column of our table
first_column = data.table("L"=unique(toy_data[,c(to[!is.na(to)],from)]))
#loop through the values of the columns and merge to a unique df
for (col in sort(unique(toy_data[!is.na(to),c(to,from)]))){
info_column = copy(create_edge_cols(toy_data,col))
first_column = merge.data.table(first_column,info_column,all.x = TRUE,all.y = TRUE)
}
## function to set first row as name
header.true <- function(df) {
names(df) <- as.character(unlist(df[1,]))
df[-1,]
}
# this should be the last row of our matrix:
last_row = transpose(data.table(table(unlist(toy_data[!is.na(toy_data$to),c(from,to[to!=from])]))))
last_row = cbind(data.table(matrix(c("L","tot"), ncol=1)),last_row)
last_row = header.true(last_row)
last_row
# let's concatenate
final_matrix = rbind(first_column,last_row)
final_matrix
EDIT: solution suggested by previous answer now deleted:
library(igraph)
g <- graph_from_data_frame(na.omit(toy_data), directed = F)
am <- as_adjacency_matrix(g, type = "both")
addmargins(as.matrix(am[order(rownames(am)), order(colnames(am))]), 1)

Here is a way. What is missing in the question's table statement are factor levels: table only processes what is in the data. Coerce the columns to factors with the same levels and assign NA to counts equal to zero.
There is also a print issue, see the final two instructions. By default, the print method for S3 class "table" does not print NAs. This can be changed manually.
library(data.table)
toy_data = data.table(from=c("A","A","A","C","E","E","A","A","A","C","E","E"),
to=c("B","C","A","D","F","E","E","A","A","A","C",NA))
levels <- sort(unique(unlist(toy_data)))
levels <- levels[!is.na(levels)]
toy_data[, c("from", "to") := lapply(.SD, factor, levels = levels)]
tbl <- table(toy_data)
is.na(tbl) <- tbl == 0
tbl
#> to
#> from A B C D E F
#> A 3 1 1 1
#> B
#> C 1 1
#> D
#> E 1 1 1
#> F
print(tbl, na.print = NA)
#> to
#> from A B C D E F
#> A 3 1 1 <NA> 1 <NA>
#> B <NA> <NA> <NA> <NA> <NA> <NA>
#> C 1 <NA> <NA> 1 <NA> <NA>
#> D <NA> <NA> <NA> <NA> <NA> <NA>
#> E <NA> <NA> 1 <NA> 1 1
#> F <NA> <NA> <NA> <NA> <NA> <NA>
Created on 2022-03-28 by the reprex package (v2.0.1)
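The role of the factor levels can be seen in isolation with a small sketch. The vector x below is made up for illustration and is not part of the question's data; it only shows that table counts observed values unless the levels are declared up front.

```r
# Illustrative vector: only three of the six possible labels occur.
x <- c("A", "C", "E")

# Without levels, table() only reports the values that are present.
observed_only <- table(x)

# With explicit levels, absent labels appear with a count of zero.
all_levels <- table(factor(x, levels = LETTERS[1:6]))

observed_only
all_levels
```

This is why the cross table in the answer gains rows for B, D and F once both columns share the same levels.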
Edit
To add a row of column sums at the bottom of the cross table, rbind the result above with colSums. Note that there is no longer any need for print(tbl, na.print = NA), because the print method being called (autoprint) is now the matrix method.
library(data.table)
toy_data = data.table(from=c("A","A","A","C","E","E","A","A","A","C","E","E"),
to=c("B","C","A","D","F","E","E","A","A","A","C",NA))
levels <- sort(unique(unlist(toy_data)))
levels <- levels[!is.na(levels)]
toy_data[, c("from", "to") := lapply(.SD, factor, levels = levels)]
tbl <- table(toy_data)
class(tbl) # check the output object class
#> [1] "table"
tbl <- rbind(tbl, tot = colSums(tbl, na.rm = TRUE))
is.na(tbl) <- tbl == 0
class(tbl) # check the output object class, it's no longer "table"
#> [1] "matrix" "array"
tbl
#> A B C D E F
#> A 3 1 1 NA 1 NA
#> B NA NA NA NA NA NA
#> C 1 NA NA 1 NA NA
#> D NA NA NA NA NA NA
#> E NA NA 1 NA 1 1
#> F NA NA NA NA NA NA
#> tot 4 1 2 1 2 1
Created on 2022-03-29 by the reprex package (v2.0.1)
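The table above counts ordered (from, to) pairs, whereas the target matrix in the question appears to treat each pair as undirected (A-C and C-A are pooled, and the tot row reads 7 1 4 1 4 1). A possible base R sketch of that variant, which is an assumption about the intent rather than part of the original answer, adds the table to its transpose while counting the diagonal only once. The names lv, m, sym and res are illustrative.

```r
from <- c("A","A","A","C","E","E","A","A","A","C","E","E")
to   <- c("B","C","A","D","F","E","E","A","A","A","C",NA)

# Common set of labels; sort() drops the NA by default.
lv <- sort(unique(c(from, to)))

# Directed counts as a plain integer matrix (pairs with NA are ignored by table).
tbl <- table(factor(from, levels = lv), factor(to, levels = lv))
m <- matrix(as.integer(tbl), nrow = length(lv), dimnames = list(lv, lv))

# Pool both directions, but keep self-links (the diagonal) counted once.
sym <- m + t(m)
diag(sym) <- diag(m)

# Append the column totals as a final row.
res <- rbind(sym, tot = colSums(sym))
res
```

Zeros can be turned into NA afterwards with is.na(res) <- res == 0, as in the answer above.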

Related

Split variable from comma into an ordered dataframe

I have a dataframe like this, where the values are separated by comma.
# Events
# A,B,C
# C,D
# B,A
# D,B,A,E
# A,E,B
I would like to have the following data frame
# Event1 Event2 Event3 Event4 Event5
# A B C NA NA
# NA NA C D NA
# A B NA NA NA
# A B NA D E
# A B NA NA E
I have tried with cSplit but I don't get the desired df. Is this possible?
NOTE: The values don't appear in the same position as the variable Event in the second dataframe.
1) Here is a base R solution. Split each row, giving the list s, and create cols, which contains the possible values. Then iterate over s, keeping each value of cols that is present in the row and NA otherwise, and convert the result to a data frame.
Note that this does not hard code the column names and continues to work even if some column names are substrings of other column names.
s <- strsplit(DF$Events, ",")
cols <- unique(sort(unlist(s)))
data.frame(Event = t(sapply(s, function(x) ifelse(cols %in% x, cols, NA))))
giving:
Event.1 Event.2 Event.3 Event.4 Event.5
1 A B C <NA> <NA>
2 <NA> <NA> C D <NA>
3 A B <NA> <NA> <NA>
4 A B <NA> D E
5 A B <NA> <NA> E
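The remark about substrings can be illustrated with a small sketch. The value "AB" is made up for illustration and does not occur in the question's data; the point is only that membership via %in% on the split pieces is exact, while a pattern search on the raw string is not.

```r
# Hypothetical row in which one label ("AB") contains another ("A").
row <- "AB,C"

# A substring search reports "A" as present even though it is not an element.
substring_hit <- grepl("A", row, fixed = TRUE)

# Splitting first and testing membership compares whole elements only.
exact_hit <- "A" %in% strsplit(row, ",")[[1]]

c(substring = substring_hit, exact = exact_hit)
```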
2) This base R solution uses strsplit as above and then names the components since stack requires a named list and then invokes stack. Then we expand that into a wide form using tapply and convert it to a data frame and fix up the names.
s <- strsplit(DF$Events, ",")
names(s) <- seq_along(s)
stk <- stack(s)
mat <- t(tapply(stk$values, stk, c))
colnames(mat) <- NULL
data.frame(Event = mat)
giving:
Event.1 Event.2 Event.3 Event.4 Event.5
1 A B C <NA> <NA>
2 <NA> <NA> C D <NA>
3 A B <NA> <NA> <NA>
4 A B <NA> D E
5 A B <NA> <NA> E
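To see what the tapply step works on, it may help to look at the intermediate long format that stack produces. This sketch rebuilds the input as a plain character vector (an assumption, equivalent to DF$Events) and inspects the result.

```r
Events <- c("A,B,C", "C,D", "B,A", "D,B,A,E", "A,E,B")

s <- strsplit(Events, ",")
names(s) <- seq_along(s)   # stack() needs a named list

# One row per (value, original row) pair: column 'values' holds the letter,
# column 'ind' is a factor recording which input row it came from.
stk <- stack(s)
head(stk)
```

Cross-classifying ind by values then leaves exactly one cell per letter that occurs in a row, which is what the wide result is built from.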
This could also be represented as an R 4.2+ pipeline:
DF |>
with(setNames(Events, seq_along(Events))) |>
strsplit(",") |>
stack() |>
with(tapply(values, data.frame(ind, values), c)) |>
`colnames<-`(NULL) |>
data.frame(Event = _)
Note
The input in reproducible form:
Lines <- "Events
A,B,C
C,D
B,A
D,B,A,E
A,E,B"
DF <- read.table(text = Lines, header = TRUE, strip.white = TRUE)
Another approach using tidyverse:
library(dplyr)
library(purrr)
library(stringr)
Events = c("A,B,C", 'C,D', "B,A", "D,B,A,E", "A,E,B")
letters <- Events %>% str_split(",") %>% unlist() %>% unique()
df <- data.frame(Events)
df %>%
map2_dfc(.y = letters, ~ ifelse(str_detect(.x, .y), .y, NA)) %>%
set_names(nm = paste0("Events", 1:length(letters)))
#> # A tibble: 5 × 5
#> Events1 Events2 Events3 Events4 Events5
#> <chr> <chr> <chr> <chr> <chr>
#> 1 A B C <NA> <NA>
#> 2 <NA> <NA> C D <NA>
#> 3 A B <NA> <NA> <NA>
#> 4 A B <NA> D E
#> 5 A B <NA> <NA> E
Created on 2022-07-11 by the reprex package (v2.0.1)
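This tidyverse solution is probably the most economical in terms of amount of code used, though note that it keeps the values in their original order rather than aligning each letter to a fixed column: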
library(tidyverse)
data.frame(Events) %>%
# split the strings by the comma:
mutate(Events = str_split(Events, ",")) %>%
# unnest the split values wider into columns:
unnest_wider(Events, names_sep = "")
# A tibble: 5 × 4
Events1 Events2 Events3 Events4
<chr> <chr> <chr> <chr>
1 A B C NA
2 C D NA NA
3 B A NA NA
4 D B A E
5 A E B NA
Data:
Events = c("A,B,C", 'C,D', "B,A", "D,B,A,E", "A,E,B")
We can try the following base R code
> d <- t(table(stack(setNames(strsplit(df$Events, ","), 1:nrow(df)))))
> as.data.frame.matrix(`dim<-`(colnames(d)[ifelse(d > 0, d * col(d), NA)], dim(d)))
V1 V2 V3 V4 V5
1 A B C <NA> <NA>
2 <NA> <NA> C D <NA>
3 A B <NA> <NA> <NA>
4 A B <NA> D E
5 A B <NA> <NA> E

Conditional replacement of values in R

I have a question about R. I have a dataset in which I would like to change cells based on the value in the neighbouring column.
Data <- tibble(a = 1:5,
b = c("G","H","I","J","K"),
c = c("G","H","J","I","J"))
I would like to change the values to NA wherever b and c contain the same string.
Desired output
Data <- tibble(a = 1:5,
b = c("NA","NA","I","J","K"),
c = c("NA","NA","J","I","J"))
Thanks a lot for your help in advance.
library(data.table)
setDT(Data)[b == c, c("b", "c") := NA]
# a b c
# 1: 1 <NA> <NA>
# 2: 2 <NA> <NA>
# 3: 3 I J
# 4: 4 J I
# 5: 5 K J
With base R:
Data[Data$b == Data$c, c('b', 'c')] <- "NA"
Data
# # A tibble: 5 x 3
# a b c
# <int> <chr> <chr>
# 1 1 NA NA
# 2 2 NA NA
# 3 3 I J
# 4 4 J I
# 5 5 K J
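Note that the assignment above stores the literal string "NA" rather than a missing value. If a true NA is wanted, a minimal sketch might look like the following. It assumes a plain data.frame copy of the question's data, and it guards the comparison so that rows already containing NA are left alone.

```r
Data <- data.frame(
  a = 1:5,
  b = c("G", "H", "I", "J", "K"),
  c = c("G", "H", "J", "I", "J"),
  stringsAsFactors = FALSE
)

# TRUE only where both values are present and equal.
same <- !is.na(Data$b) & !is.na(Data$c) & Data$b == Data$c

# Assign a real missing value to both columns in those rows.
Data[same, c("b", "c")] <- NA
Data
```

The columns stay character vectors, so is.na(Data$b) can be used afterwards to find the replaced cells.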
Using which to subset Data on the rows where b and c have the same values:
Data[c("b","c")][which(Data$b == Data$c),] <- NA
Result:
Data
# A tibble: 5 x 3
a b c
<int> <chr> <chr>
1 1 NA NA
2 2 NA NA
3 3 I J
4 4 J I
5 5 K J
With dplyr
library(dplyr)
Data %>%
rowwise() %>%
mutate(b = ifelse(b %in% c & c %in% b, "NA", b))%>%
mutate(c = ifelse(b == "NA", "NA", c))
Output:
a b c
<int> <chr> <chr>
1 1 NA NA
2 2 NA NA
3 3 I J
4 4 J I
5 5 K J
Another base R option
cols <- c("b", "c")
Data[cols] <- replace(Data[cols], Data[cols] == Data[rev(cols)], NA)
gives
> Data
# A tibble: 5 x 3
a b c
<int> <chr> <chr>
1 1 NA NA
2 2 NA NA
3 3 I J
4 4 J I
5 5 K J

Transforming an R Dataframe with 2 columns and delimiter in rows

I have a dataframe that has two columns "id" and "detail" (df_current below). I need to group the dataframe by id, and spread the file so that the columns become "Interface1", "Interface2", etc. and the contents under the interface columns are the immediate values under each time the interface value appears. Essentially the "!" is working as a separator, but it is not needed in the output.
The desired output is shown below as: "df_needed_from_current".
I have tried multiple approaches (group_by, spread, reshape, dcast etc.), but can't get it to work. Any help would be greatly appreciated!
Sample Current Dataframe (code to create under):
id detail
1  !
1  Interface1
1  a
1  b
1  !
1  Interface2
1  a
1  b
2  !
2  Interface1
2  a
2  b
2  c
2  !
2  Interface2
2  a
3  !
3  Interface1
3  a
3  b
3  c
3  d
df_current <- data.frame(
id = c("1","1","1","1","1","1","1","1","2",
"2","2","2","2","2","2","2","3","3",
"3","3","3","3","4","4","4","4","4",
"4","4","4","4","4","4","4","4","4",
"5","5","5","5","5","5","5","5","5",
"5","5","5","5"),
detail = c("!", "Interface1","a","b","!",
"Interface2","a","b","!","Interface1",
"a","b","c","!","Interface2","a",
"!", "Interface1","a","b","c","d",
"!", "Interface1","a","b","!",
"Interface2","a","b","c","!","Interface3",
"a","b","c","!","Interface1","a","b","!",
"Interface2","a","b","c","!","Interface3",
"a","b"))
Dataframe Needed (code to create under):
ID Interface1 Interface2 Interface3
1  a          a          NA
1  b          b          NA
2  a          a          NA
2  b          NA         NA
2  c          NA         NA
3  a          NA         NA
3  b          NA         NA
3  c          NA         NA
3  d          NA         NA
df_needed_from_current <- data.frame(
id = c("1","1","2","2","2","3","3","3","3","4","4","4","5","5","5"),
Interface1 = c("a","b","a","b","c","a","b","c","d","a","b","NA","a","b","NA"),
Interface2 = c("a","b","a","NA","NA","NA","NA","NA","NA","a","b","c","a","b","c"),
Interface3 = c("NA","NA","NA","NA","NA","NA","NA","NA","NA","a","b","c","a","b","NA")
)
We first remove the rows where 'detail' is "!", then create a new column 'interface' that holds only the 'detail' values containing 'Interface' (NA otherwise). Within each 'id', fill from tidyr replaces each NA with the previous non-NA value. We then filter out the rows where 'detail' is the same as 'interface' (the header rows), create a row sequence id with rowid (from data.table), and reshape to 'wide' format with pivot_wider.
library(dplyr)
library(tidyr)
library(data.table)
library(stringr)
df_current %>%
filter(detail != "!") %>%
mutate(interface = case_when(str_detect(detail, 'Interface') ~ detail)) %>%
group_by(id) %>%
fill(interface) %>%
ungroup %>%
filter(detail != interface) %>%
mutate(rn = rowid(id, interface)) %>%
pivot_wider(names_from = interface, values_from = detail) %>%
select(-rn)
# A tibble: 15 x 4
# id Interface1 Interface2 Interface3
# <chr> <chr> <chr> <chr>
# 1 1 a a <NA>
# 2 1 b b <NA>
# 3 2 a a <NA>
# 4 2 b <NA> <NA>
# 5 2 c <NA> <NA>
# 6 3 a <NA> <NA>
# 7 3 b <NA> <NA>
# 8 3 c <NA> <NA>
# 9 3 d <NA> <NA>
#10 4 a a a
#11 4 b b b
#12 4 <NA> c c
#13 5 a a a
#14 5 b b b
#15 5 <NA> c <NA>
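The "label, then fill down" idea at the heart of the pipeline can also be sketched in base R. This is only an illustration on a shortened, single-id excerpt of the data, and it assumes (as in the question) that every block starts with an Interface header; the names is_header, headers and long are made up for the example.

```r
# Shortened excerpt of df_current for one id.
df <- data.frame(
  id = rep("1", 7),
  detail = c("!", "Interface1", "a", "b", "!", "Interface2", "a"),
  stringsAsFactors = FALSE
)

# Rows that carry an interface name.
is_header <- startsWith(df$detail, "Interface")
headers <- df$detail[is_header]

# Fill down: each row gets the most recent header seen so far
# (NA for rows that come before the first header).
interface <- c(NA, headers)[cumsum(is_header) + 1]

# Keep only the value rows, dropping separators and the headers themselves.
keep <- df$detail != "!" & !is_header
long <- data.frame(
  id = df$id[keep],
  interface = interface[keep],
  detail = df$detail[keep],
  stringsAsFactors = FALSE
)
long
```

The resulting long table has one row per value, tagged with its interface, which is the shape that the rowid and pivot_wider steps then spread into columns.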

Dataframe: Order column contents matching first column [duplicate]

I can't believe I'm struggling with this after just 3 years of taking a break from R...
Basically, I have a dataframe whose first column lists all possible values. The subsequent columns contain either one of those values or NA. I would like to reorder every subsequent column so that any value it contains ends up in the row where that value appears in the first column.
Probably easier to explain with an example:
Original:
Letters Category1 Category2 Category3 Category4
A NA A NA NA
B A NA NA D
C NA NA NA A
D NA C NA NA
E E B C NA
Desired state:
Letters Category1 Category2 Category3 Category4
A A A NA A
B NA B NA NA
C NA C C NA
D NA NA NA D
E E NA NA NA
Am I overlooking a built-in function / library to do this elegantly? My approach would probably to build a function that creates a new dataframe and checks the contents row by row, but this seems very inefficient...
Using tidyverse:
df <- read.table(text = "Letters Category1 Category2 Category3 Category4
A NA A NA NA
B A NA NA D
C NA NA NA A
D NA C NA NA
E E B C NA", header = T)
library(tidyverse)
df %>%
pivot_longer(-Letters, values_drop_na = TRUE) %>%
mutate(Letters = value) %>%
arrange(name) %>%
pivot_wider(id_cols = Letters, names_from = name, values_from = value)
#> # A tibble: 5 x 5
#> Letters Category1 Category2 Category3 Category4
#> <chr> <chr> <chr> <chr> <chr>
#> 1 A A A <NA> A
#> 2 E E <NA> <NA> <NA>
#> 3 C <NA> C C <NA>
#> 4 B <NA> B <NA> <NA>
#> 5 D <NA> <NA> <NA> D
Created on 2020-12-01 by the reprex package (v0.3.0)
Using data.table:
library(data.table)
dt <- as.data.table(df)
dt_long <- melt(data = dt, id.vars = "Letters", na.rm = TRUE)
dcast(dt_long, value ~ variable, value.var = "value")
#> value Category1 Category2 Category3 Category4
#> 1: A A A <NA> A
#> 2: B <NA> B <NA> <NA>
#> 3: C <NA> C C <NA>
#> 4: D <NA> <NA> <NA> D
#> 5: E E <NA> <NA> <NA>
Created on 2020-12-01 by the reprex package (v0.3.0)
One way using dplyr and tidyr would be :
library(dplyr)
library(tidyr)
vals <- df$Letters
df %>%
pivot_longer(cols = starts_with('Category'),
values_drop_na = TRUE) %>%
group_by(value) %>%
mutate(Letters = vals[cur_group_id()]) %>%
arrange(name) %>%
pivot_wider() %>%
arrange(Letters) -> result
result
# Letters Category1 Category2 Category3 Category4
# <chr> <chr> <chr> <chr> <chr>
#1 A A A NA A
#2 B NA B NA NA
#3 C NA C C NA
#4 D NA NA NA D
#5 E E NA NA NA
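Because every value in the Category columns also appears in Letters, the alignment can be sketched without reshaping at all: for each column, keep a letter in its own row if the column contains it anywhere. This base R version is an alternative sketch, not taken from the answers above, and it rebuilds the example data by hand.

```r
df <- data.frame(
  Letters   = c("A", "B", "C", "D", "E"),
  Category1 = c(NA, "A", NA, NA, "E"),
  Category2 = c("A", NA, NA, "C", "B"),
  Category3 = c(NA, NA, NA, NA, "C"),
  Category4 = c(NA, "D", "A", NA, NA),
  stringsAsFactors = FALSE
)

cats <- setdiff(names(df), "Letters")

# For each category column, place a letter in its row when the column contains it.
aligned <- df
aligned[cats] <- lapply(df[cats], function(col) {
  ifelse(df$Letters %in% col, df$Letters, NA)
})
aligned
```

This relies on the values being unique within each column; duplicates would collapse to a single entry.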

Matching values from multiple columns in 1 data frame to key in second data frame and creating columns

I have 2 data frames. One (df1) looks like this:
var.1 var.2 var.3 var.4
1 7 9 1 2
2 4 6 9 7
3 2 NA NA NA
And the other (df2) looks like this:
var.a var.b var.c var.d
1 1 b c d
2 2 f g h
3 4 j k l
3 7 j k z
...
where every value that appears in var.1 to var.4 of df1 is listed in var.a of df2.
I want to match var.a from df2 across all of the columns listed in df1 and then add these columns to df1 with new/combined column names. So for instance it'll look like this:
var.1 var1.b var1.c var1.d ... var.4 var4.b var4.c var4.d
1 7 j k z 2 f g h
2 4 j k l 7 j k z
3 2 f g h NA NA NA NA
Thanks in advance!
Here's a tidyverse solution. First, I define the data frames.
df1 <- read.table(text = " var.1 var.2 var.3 var.4
1 7 9 1 2
2 4 6 9 7
3 2 NA NA NA", header = TRUE)
df2 <- read.table(text = " var.a var.b var.c var.d
1 1 b c d
2 2 f g h
3 4 j k l
4 7 j k z", header=TRUE)
Then, I load the libraries.
# Load libraries
library(tidyr)
library(dplyr)
library(tibble)
Finally, I restructure the data.
# Manipulate data
df1 %>%
rownames_to_column() %>%
gather(variable, value, -rowname) %>%
left_join(df2, by = c("value" = "var.a")) %>%
gather(foo, bar, -variable, -rowname) %>%
unite(goop, variable, foo) %>%
spread(goop, bar) %>%
select(-rowname)
#> Warning: attributes are not identical across measure variables;
#> they will be dropped
which gives,
#> var.1_value var.1_var.b var.1_var.c var.1_var.d var.2_value var.2_var.b
#> 1 7 j k z 9 <NA>
#> 2 4 j k l 6 <NA>
#> 3 2 f g h <NA> <NA>
#> var.2_var.c var.2_var.d var.3_value var.3_var.b var.3_var.c var.3_var.d
#> 1 <NA> <NA> 1 b c d
#> 2 <NA> <NA> 9 <NA> <NA> <NA>
#> 3 <NA> <NA> <NA> <NA> <NA> <NA>
#> var.4_value var.4_var.b var.4_var.c var.4_var.d
#> 1 2 f g h
#> 2 7 j k z
#> 3 <NA> <NA> <NA> <NA>
Created on 2019-05-30 by the reprex package (v0.3.0)
This is a little bit convoluted, but I'll try to explain.
I turn row numbers into a column at first, as this will help me put the data back together at the very end.
I go from wide to long format for df1.
I join df2 to df1 by matching var.a against the df1 values (now held in the column called value).
I go from wide to long again.
I combine the variable names from each data frame into one variable.
Finally, I go from long to wide format (this is where the row numbers come in handy) and drop the row numbers.
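The same lookup can be sketched in base R with match, which avoids the two rounds of reshaping. This is an alternative under the assumption that var.a is unique in df2; the "_" separator in the new column names mirrors the output above, and the names pieces and out are illustrative.

```r
df1 <- data.frame(var.1 = c(7, 4, 2), var.2 = c(9, 6, NA),
                  var.3 = c(1, 9, NA), var.4 = c(2, 7, NA))
df2 <- data.frame(var.a = c(1, 2, 4, 7),
                  var.b = c("b", "f", "j", "j"),
                  var.c = c("c", "g", "k", "k"),
                  var.d = c("d", "h", "l", "z"),
                  stringsAsFactors = FALSE)

# For each df1 column, look up the matching df2 rows and prefix their names.
pieces <- lapply(names(df1), function(nm) {
  hit <- df2[match(df1[[nm]], df2$var.a), -1, drop = FALSE]  # NA where no match
  names(hit) <- paste0(nm, "_", names(hit))
  rownames(hit) <- NULL
  cbind(df1[nm], hit)
})

# Bind the per-column blocks side by side.
out <- do.call(cbind, pieces)
out
```

Values of df1 that are missing from df2 (such as 9 and 6 here) simply yield NA in the looked-up columns.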
