Split variable from comma into an ordered dataframe - r

I have a dataframe like this, where the values are separated by comma.
# Events
# A,B,C
# C,D
# B,A
# D,B,A,E
# A,E,B
I would like to get the following data frame:
# Event1 Event2 Event3 Event4 Event5
# A B C NA NA
# NA NA C D NA
# A B NA NA NA
# A B NA D E
# A B NA NA E
I have tried cSplit, but I don't get the desired df. Is this possible?
NOTE: The values don't appear in the same position as the Event variables in the second dataframe.

1) Here is a base R solution. Split each row, giving the list s, and create cols, which contains the possible values. Then iterate over s, keeping each value of cols that is present in the row and NA otherwise, and convert the result to a data frame.
Note that this does not hard code the column names and continues to work even if some column names are substrings of other column names.
s <- strsplit(DF$Events, ",")
cols <- unique(sort(unlist(s)))
data.frame(Event = t(sapply(s, function(x) ifelse(cols %in% x, cols, NA))))
giving:
Event.1 Event.2 Event.3 Event.4 Event.5
1 A B C <NA> <NA>
2 <NA> <NA> C D <NA>
3 A B <NA> <NA> <NA>
4 A B <NA> D E
5 A B <NA> <NA> E
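The point about substrings can be checked with a small sketch. The helper name split_ordered and the test values "A" and "AB" are made up for illustration; the logic is the same strsplit plus %in% idea, which compares whole values rather than searching inside strings.

```r
# Hypothetical helper wrapping the strsplit / %in% approach.
split_ordered <- function(x, sep = ",") {
  s <- strsplit(x, sep, fixed = TRUE)      # one character vector per row
  cols <- sort(unique(unlist(s)))          # every distinct value, sorted
  # For each row, keep a value where it is present and NA elsewhere.
  mat <- do.call(rbind, lapply(s, function(v) ifelse(cols %in% v, cols, NA_character_)))
  colnames(mat) <- paste0("Event", seq_along(cols))
  as.data.frame(mat, stringsAsFactors = FALSE)
}

# "A" is a substring of "AB", which would confuse pattern matching:
grepl("A", "AB", fixed = TRUE)   # TRUE, although "A" is not one of the values in "AB"

# Exact matching keeps the two values apart.
res <- split_ordered(c("A,AB", "AB", "B,A"))
res
```

Because %in% tests for equality, the second row gets NA in the "A" column even though its only value contains the letter A.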
2) This base R solution uses strsplit as above and then names the components since stack requires a named list and then invokes stack. Then we expand that into a wide form using tapply and convert it to a data frame and fix up the names.
s <- strsplit(DF$Events, ",")
names(s) <- seq_along(s)
stk <- stack(s)
mat <- t(tapply(stk$values, stk, c))
colnames(mat) <- NULL
data.frame(Event = mat)
giving:
Event.1 Event.2 Event.3 Event.4 Event.5
1 A B C <NA> <NA>
2 <NA> <NA> C D <NA>
3 A B <NA> <NA> <NA>
4 A B <NA> D E
5 A B <NA> <NA> E
This could also be represented as an R 4.2+ pipeline:
DF |>
with(setNames(Events, seq_along(Events))) |>
strsplit(",") |>
stack() |>
with(tapply(values, data.frame(ind, values), c)) |>
`colnames<-`(NULL) |>
data.frame(Event = _)
Note
The input in reproducible form:
Lines <- "Events
A,B,C
C,D
B,A
D,B,A,E
A,E,B"
DF <- read.table(text = Lines, header = TRUE, strip.white = TRUE)

Another approach using tidyverse:
library(dplyr)
library(purrr)
library(stringr)
Events = c("A,B,C", 'C,D', "B,A", "D,B,A,E", "A,E,B")
letters <- Events %>% str_split(",") %>% unlist() %>% unique()
df <- data.frame(Events)
df %>%
map2_dfc(.y = letters, ~ ifelse(str_detect(.x, .y), .y, NA)) %>%
set_names(nm = paste0("Events", 1:length(letters)))
#> # A tibble: 5 × 5
#> Events1 Events2 Events3 Events4 Events5
#> <chr> <chr> <chr> <chr> <chr>
#> 1 A B C <NA> <NA>
#> 2 <NA> <NA> C D <NA>
#> 3 A B <NA> <NA> <NA>
#> 4 A B <NA> D E
#> 5 A B <NA> <NA> E
Created on 2022-07-11 by the reprex package (v2.0.1)

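This tidyverse solution is easily the most economical in terms of the amount of code used. Note that it keeps the values in their original order within each row, so they are not aligned to a fixed column per value: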
library(tidyverse)
data.frame(Events) %>%
# split the strings by the comma:
mutate(Events = str_split(Events, ",")) %>%
# unnest splitted values wider into columns:
unnest_wider(Events, names_sep = "")
# A tibble: 5 × 4
Events1 Events2 Events3 Events4
<chr> <chr> <chr> <chr>
1 A B C NA
2 C D NA NA
3 B A NA NA
4 D B A E
5 A E B NA
Data:
Events = c("A,B,C", 'C,D', "B,A", "D,B,A,E", "A,E,B")

We can try the following base R code:
d <- t(table(stack(setNames(strsplit(df$Events, ","), 1:nrow(df)))))
as.data.frame.matrix(`dim<-`(colnames(d)[ifelse(d > 0, d * col(d), NA)], dim(d)))
V1 V2 V3 V4 V5
1 A B C <NA> <NA>
2 <NA> <NA> C D <NA>
3 A B <NA> <NA> <NA>
4 A B <NA> D E
5 A B <NA> <NA> E
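The one-liner above is dense, so here is a step-by-step sketch of the same idea (count which value occurs in which row, then replace each positive count by its column label). The intermediate names long, hit and out are mine, and the input is assumed to be the Events vector from the question.

```r
Events <- c("A,B,C", "C,D", "B,A", "D,B,A,E", "A,E,B")

# Named list: one element per input row, named by row number.
s <- setNames(strsplit(Events, ","), seq_along(Events))

# Long form: column `values` holds the letters, `ind` the row they came from.
long <- stack(s)

# Cross-tabulate and transpose: rows = input rows, columns = distinct values.
d <- unclass(t(table(long)))

# Start from an all-NA character matrix and fill in the column label
# wherever the count is positive.
out <- matrix(NA_character_, nrow(d), ncol(d))
hit <- d > 0
out[hit] <- colnames(d)[col(d)[hit]]

res <- as.data.frame(out, stringsAsFactors = FALSE)
res
```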

Related

Fill a column with one of four date columns based on another R

I have a DF with 6 columns like so:
A B Date1 Date2 Date3 Date4
1 x NA NA NA
2 NA y NA NA
3 NA NA z NA
4 NA NA NA f
I want to use the dplyr package and the case_when() function to state something like this
df <- df %>%
mutate(B = case_when(
A == 1 ~ B == Date1,
A == 2 ~ B == Date2,
A == 3 ~ B == Date3,
A == 4 ~ B == Date4))
Essentially based on the value of A I would like to fill B with one of 4 date columns.
A is of class character, B and the Date are all class Date.
The problem is that when I apply this to the dataframe, it simply doesn't work. It returns NAs and changes the class of B to logical. I am using R version 4.1.2. Any help is appreciated.
You can use coalesce() to find first non-missing element.
library(dplyr)
df %>%
mutate(B = coalesce(!!!df[-1]))
# A Date1 Date2 Date3 Date4 B
# 1 1 x <NA> <NA> <NA> x
# 2 2 <NA> y <NA> <NA> y
# 3 3 <NA> <NA> z <NA> z
# 4 4 <NA> <NA> <NA> f f
The above code is just a shortcut of
df %>%
mutate(B = coalesce(Date1, Date2, Date3, Date4))
If B needs to be filled based on the value of A, then here is an idea with c_across():
df %>%
rowwise() %>%
mutate(B = c_across(starts_with("Date"))[A]) %>%
ungroup()
# # A tibble: 4 × 6
# A Date1 Date2 Date3 Date4 B
# <int> <chr> <chr> <chr> <chr> <chr>
# 1 1 x NA NA NA x
# 2 2 NA y NA NA y
# 3 3 NA NA z NA z
# 4 4 NA NA NA f f
The other answers are superior, but if you must use your current code for the actual application, the corrected version is:
df %>%
mutate(B = case_when(
A == 1 ~ Date1,
A == 2 ~ Date2,
A == 3 ~ Date3,
A == 4 ~ Date4))
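To see why the original attempt turned B into a logical column, note that the right-hand side B == Date1 is a comparison, not a value. A minimal base R sketch follows; the dates are invented, since the question only shows the placeholders x, y, z and f, and A is stored as character as described in the question.

```r
# Made-up data in the shape described in the question.
df <- data.frame(
  A     = c("1", "2", "3", "4"),
  Date1 = as.Date(c("2022-01-01", NA, NA, NA)),
  Date2 = as.Date(c(NA, "2022-02-01", NA, NA)),
  Date3 = as.Date(c(NA, NA, "2022-03-01", NA)),
  Date4 = as.Date(c(NA, NA, NA, "2022-04-01")),
  stringsAsFactors = FALSE
)
df$B <- as.Date(NA)

# A comparison yields TRUE/FALSE/NA, never a Date.
class(df$B == df$Date1)   # "logical"

# Pick, for each row, the Date column whose number is stored in A.
date_cols <- paste0("Date", df$A)
picked <- Map(function(col, i) df[[col]][i], date_cols, seq_len(nrow(df)))
df$B <- do.call(c, unname(picked))   # c() on Date objects keeps the Date class
df
```

Returning the column itself (Date1, Date2, ...) instead of a comparison is the same fix that the corrected case_when() call applies.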
Output:
# A B Date1 Date2 Date3 Date4
# 1 x x <NA> <NA> <NA>
# 2 y <NA> y <NA> <NA>
# 3 z <NA> <NA> z <NA>
# 4 f <NA> <NA> <NA> f
Since it seems you want the diagonal values from the Date columns, you can use diag:
df$B <- diag(as.matrix(df[grepl("Date", colnames(df))]))
#[1] "x" "y" "z" "f"
Other answers (if you want to coalesce):
With max:
df$B <- apply(df[2:5], 1, \(x) max(x, na.rm = TRUE))
With c_across:
df %>%
rowwise() %>%
mutate(B = max(c_across(Date1:Date4), na.rm = TRUE))
output
A Date1 Date2 Date3 Date4 B
1 1 x <NA> <NA> <NA> x
2 2 <NA> y <NA> <NA> y
3 3 <NA> <NA> z <NA> z
4 4 <NA> <NA> <NA> f f

Cross-tab in R with data.tables

Sorry if this question has already been asked. I played with my toy data to learn how to manipulate data.tables. My goal was to go from this data:
toy_data = data.table(from=c("A","A","A","C","E","E","A","A","A","C","E","E"),
to=c("B","C","A","D","F","E","E","A","A","A","C",NA))
to arrive at this result:
final_matrix
L A B C D E F
1: A 3 1 2 <NA> 1 <NA>
2: B 1 0 <NA> <NA> <NA> <NA>
3: C 2 <NA> 0 1 1 <NA>
4: D <NA> <NA> 1 0 <NA> <NA>
5: E 1 <NA> 1 <NA> 1 1
6: F <NA> <NA> <NA> <NA> 1 0
7: tot 7 1 4 1 4 1
(Eventually also with zeros instead of NAs, but I got bored.) I suppose in Stata this would be an easy cross-tab. I built a function, looped over the unique values in the columns (sigh :/), merged the tables, and then added a final line with the totals. Although I've learned a lot, I wonder what the clean R way to obtain such cross-tabs would be, since the following doesn't work:
table(toy_data$from,toy_data$to)
A B C D E F
A 3 1 1 0 1 0
C 1 0 0 1 0 0
E 0 0 1 0 1 1
Thanks. Here is my function; if you have general improvements or best practices, I would be super happy:
create_edge_cols<- function(dt,column){
#this function takes a df and a column,
#computes the number of edges among this column and all the other in dt
#returns a column (list) with the cross-tabulation of columns
tot_edges_i = dim(dt[from==column|to==column][,.(to=na.omit(to))])[1] # E better! without NAs
print(tot_edges_i)
# now tabulate links of column
tab = data.table(table(unlist(dt[(from==column&to!=column)|
(from!=column&to==column)])))
setnames(tab, "V1", "L")
setnames(tab, "N", column)
setorder(tab,"L")
tab[L==column,column] = length(dt[to==column & to == from,from])
#tab[,`:=`(L=L,column=column/as.numeric(tot_edges_i))]
return(tab)
}
#this should be the first column of our table
first_column = data.table("L"=unique(toy_data[,c(to[!is.na(to)],from)]))
#loop through the values of the columns and merge to a unique df
for (col in sort(unique(toy_data[!is.na(to),c(to,from)]))){
info_column = copy(create_edge_cols(toy_data,col))
first_column = merge.data.table(first_column,info_column,all.x = TRUE,all.y = TRUE)
}
## function to set first row as name
header.true <- function(df) {
names(df) <- as.character(unlist(df[1,]))
df[-1,]
}
# this should be the last row of our matrix:
last_row = transpose(data.table(table(unlist(toy_data[!is.na(toy_data$to),c(from,to[to!=from])]))))
last_row = cbind(data.table(matrix(c("L","tot"), ncol=1)),last_row)
last_row = header.true(last_row)
last_row
# let's concatenate
final_matrix = rbind(first_column,last_row)
final_matrix
EDIT: solution suggested by previous answer now deleted:
library(igraph)
g <- graph_from_data_frame(na.omit(toy_data), directed = F)
am <- as_adjacency_matrix(g, type = "both")
addmargins(as.matrix(am[order(rownames(am)), order(colnames(am))]), 1)
Here is a way. What is missing in the question's table statement are factor levels; table only processes what is in the data. Coerce the columns to factors with the same levels and assign NA to counts equal to zero.
There is also a print issue, see the final two instructions. By default, the print method for S3 class "table" does not print NAs. This can be changed manually.
library(data.table)
toy_data = data.table(from=c("A","A","A","C","E","E","A","A","A","C","E","E"),
to=c("B","C","A","D","F","E","E","A","A","A","C",NA))
levels <- sort(unique(unlist(toy_data)))
levels <- levels[!is.na(levels)]
toy_data[, c("from", "to") := lapply(.SD, factor, levels = levels)]
tbl <- table(toy_data)
is.na(tbl) <- tbl == 0
tbl
#> to
#> from A B C D E F
#> A 3 1 1 1
#> B
#> C 1 1
#> D
#> E 1 1 1
#> F
print(tbl, na.print = NA)
#> to
#> from A B C D E F
#> A 3 1 1 <NA> 1 <NA>
#> B <NA> <NA> <NA> <NA> <NA> <NA>
#> C 1 <NA> <NA> 1 <NA> <NA>
#> D <NA> <NA> <NA> <NA> <NA> <NA>
#> E <NA> <NA> 1 <NA> 1 1
#> F <NA> <NA> <NA> <NA> <NA> <NA>
Created on 2022-03-28 by the reprex package (v2.0.1)
Edit
To add a row of column sums at the bottom of the cross table, rbind the result above with colSums. Note that print(tbl, na.print = NA) is no longer needed, because the print method being called (autoprint) is now the matrix method.
library(data.table)
toy_data = data.table(from=c("A","A","A","C","E","E","A","A","A","C","E","E"),
to=c("B","C","A","D","F","E","E","A","A","A","C",NA))
levels <- sort(unique(unlist(toy_data)))
levels <- levels[!is.na(levels)]
toy_data[, c("from", "to") := lapply(.SD, factor, levels = levels)]
tbl <- table(toy_data)
class(tbl) # check the output object class
#> [1] "table"
tbl <- rbind(tbl, tot = colSums(tbl, na.rm = TRUE))
is.na(tbl) <- tbl == 0
class(tbl) # check the output object class, it's no longer "table"
#> [1] "matrix" "array"
tbl
#> A B C D E F
#> A 3 1 1 NA 1 NA
#> B NA NA NA NA NA NA
#> C 1 NA NA 1 NA NA
#> D NA NA NA NA NA NA
#> E NA NA 1 NA 1 1
#> F NA NA NA NA NA NA
#> tot 4 1 2 1 2 1
Created on 2022-03-29 by the reprex package (v2.0.1)
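The table above counts directed pairs (from in rows, to in columns), whereas the final_matrix in the question is symmetric, for example A-C is 2 in both directions. A possible base R sketch of that undirected version adds the table to its transpose and restores the diagonal so that self-pairs are not counted twice. The names lv, sym and res are mine, and plain vectors are used instead of a data.table to keep the example self-contained.

```r
from <- c("A","A","A","C","E","E","A","A","A","C","E","E")
to   <- c("B","C","A","D","F","E","E","A","A","A","C",NA)

# Common levels for both columns; sort() drops the NA.
lv <- sort(unique(c(from, to)))

# Directed counts; the pair with a missing `to` is ignored by table().
tbl <- table(factor(from, levels = lv), factor(to, levels = lv))

# Undirected counts: add both directions, but keep self-pairs as counted once.
sym <- tbl + t(tbl)
diag(sym) <- diag(tbl)

# Append the column totals as a last row.
res <- rbind(unclass(sym), tot = colSums(sym))
res
```

Zeros can be turned into NA afterwards with is.na(res) <- res == 0, as in the answer above.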

Dataframe: Order column contents matching first column [duplicate]

This question already has answers here:
How to reshape data from long to wide format
(14 answers)
Closed 2 years ago.
I can't believe I'm struggling with this after just 3 years of taking a break from R...
Basically, I have a dataframe where the first column lists all possible values. The subsequent columns contain either one of those values or NA. I would like to sort every subsequent column so that, when it contains a value, that value sits in the row where the same value appears in the first column.
Probably easier to explain with an example:
Original:
Letters Category1 Category2 Category3 Category4
A NA A NA NA
B A NA NA D
C NA NA NA A
D NA C NA NA
E E B C NA
Desired state:
Letters Category1 Category2 Category3 Category4
A A A NA A
B NA B NA NA
C NA C C NA
D NA NA NA D
E E NA NA NA
Am I overlooking a built-in function or library that does this elegantly? My approach would probably be to build a function that creates a new dataframe and checks the contents row by row, but this seems very inefficient...
Using tidyverse:
df <- read.table(text = "Letters Category1 Category2 Category3 Category4
A NA A NA NA
B A NA NA D
C NA NA NA A
D NA C NA NA
E E B C NA", header = T)
library(tidyverse)
df %>%
pivot_longer(-Letters, values_drop_na = TRUE) %>%
mutate(Letters = value) %>%
arrange(name) %>%
pivot_wider(id_cols = Letters, names_from = name, values_from = value)
#> # A tibble: 5 x 5
#> Letters Category1 Category2 Category3 Category4
#> <chr> <chr> <chr> <chr> <chr>
#> 1 A A A <NA> A
#> 2 E E <NA> <NA> <NA>
#> 3 C <NA> C C <NA>
#> 4 B <NA> B <NA> <NA>
#> 5 D <NA> <NA> <NA> D
Created on 2020-12-01 by the reprex package (v0.3.0)
Using data.table:
library(data.table)
dt <- as.data.table(df)
dt_long <- melt(data = dt, id.vars = "Letters", na.rm = TRUE)
dcast(dt_long, value ~ variable, value.var = "value")
#> value Category1 Category2 Category3 Category4
#> 1: A A A <NA> A
#> 2: B <NA> B <NA> <NA>
#> 3: C <NA> C C <NA>
#> 4: D <NA> <NA> <NA> D
#> 5: E E <NA> <NA> <NA>
Created on 2020-12-01 by the reprex package (v0.3.0)
One way using dplyr and tidyr would be :
library(dplyr)
library(tidyr)
vals <- df$Letters
df %>%
pivot_longer(cols = starts_with('Category'),
values_drop_na = TRUE) %>%
group_by(value) %>%
mutate(Letters = vals[cur_group_id()]) %>%
arrange(name) %>%
pivot_wider() %>%
arrange(Letters) -> result
result
# Letters Category1 Category2 Category3 Category4
# <chr> <chr> <chr> <chr> <chr>
#1 A A A NA A
#2 B NA B NA NA
#3 C NA C C NA
#4 D NA NA NA D
#5 E E NA NA NA
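For comparison, the same alignment can be sketched in base R without reshaping: for every category column, ask which entries of Letters occur anywhere in that column. The names cats and res are mine; the data is the example from the question.

```r
df <- read.table(text = "Letters Category1 Category2 Category3 Category4
A NA A NA NA
B A NA NA D
C NA NA NA A
D NA C NA NA
E E B C NA", header = TRUE, stringsAsFactors = FALSE)

# All columns except the reference column.
cats <- setdiff(names(df), "Letters")

# Keep a letter in its own row if the column contains it anywhere, else NA.
res <- df
res[cats] <- lapply(df[cats], function(col) ifelse(df$Letters %in% col, df$Letters, NA))
res
```

This keeps the original row order of Letters, so no final sort is needed.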

Matching values from multiple columns in 1 data frame to key in second data frame and creating columns

I have 2 data frames. One (df1) looks like this:
var.1 var.2 var.3 var.4
1 7 9 1 2
2 4 6 9 7
3 2 NA NA NA
And the other (df2) looks like this:
var.a var.b var.c var.d
1 1 b c d
2 2 f g h
3 4 j k l
4 7 j k z
...
where every value that appears in var.1 to var.4 of df1 is listed in var.a of df2.
I want to match var.a from df2 across all of the columns listed in df1 and then add these columns to df1 with new/combined column names. So for instance it'll look like this:
var.1 var1.b var1.c var1.d ... var.4 var4.b var4.c var4.d
1 7 j k z 2 f g h
2 4 j k l 7 j k z
3 2 f g h NA NA NA NA
Thanks in advance!
Here's a tidyverse solution. First, I define the data frames.
df1 <- read.table(text = " var.1 var.2 var.3 var.4
1 7 9 1 2
2 4 6 9 7
3 2 NA NA NA", header = TRUE)
df2 <- read.table(text = " var.a var.b var.c var.d
1 1 b c d
2 2 f g h
3 4 j k l
4 7 j k z", header=TRUE)
Then, I load the libraries.
# Load libraries
library(tidyr)
library(dplyr)
library(tibble)
Finally, I restructure the data.
# Manipulate data
df1 %>%
rownames_to_column() %>%
gather(variable, value, -rowname) %>%
left_join(df2, by = c("value" = "var.a")) %>%
gather(foo, bar, -variable, -rowname) %>%
unite(goop, variable, foo) %>%
spread(goop, bar) %>%
select(-rowname)
#> Warning: attributes are not identical across measure variables;
#> they will be dropped
which gives,
#> var.1_value var.1_var.b var.1_var.c var.1_var.d var.2_value var.2_var.b
#> 1 7 j k z 9 <NA>
#> 2 4 j k l 6 <NA>
#> 3 2 f g h <NA> <NA>
#> var.2_var.c var.2_var.d var.3_value var.3_var.b var.3_var.c var.3_var.d
#> 1 <NA> <NA> 1 b c d
#> 2 <NA> <NA> 9 <NA> <NA> <NA>
#> 3 <NA> <NA> <NA> <NA> <NA> <NA>
#> var.4_value var.4_var.b var.4_var.c var.4_var.d
#> 1 2 f g h
#> 2 7 j k z
#> 3 <NA> <NA> <NA> <NA>
Created on 2019-05-30 by the reprex package (v0.3.0)
This is a little bit convoluted, but I'll try to explain.
I turn row numbers into a column at first, as this will help me put the data back together at the very end.
I go from wide to long format for df1.
I join df2 to df1 by matching var.a against the values from var.1 to var.4 (now held in a single column called value).
I go from wide to long again.
I combine the variable names from each data frame into one variable.
Finally, I go from long to wide format (this is where the row numbers come in handy) and drop the row numbers.
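If the reshaping feels heavy, a lookup with match() is an alternative. This is only a sketch in base R: it builds one block of columns per variable of df1 and binds the blocks side by side. The names pieces, hit and res are mine, and the column names follow the var1.b pattern requested in the question.

```r
df1 <- data.frame(var.1 = c(7, 4, 2), var.2 = c(9, 6, NA),
                  var.3 = c(1, 9, NA), var.4 = c(2, 7, NA))
df2 <- data.frame(var.a = c(1, 2, 4, 7),
                  var.b = c("b", "f", "j", "j"),
                  var.c = c("c", "g", "k", "k"),
                  var.d = c("d", "h", "l", "z"),
                  stringsAsFactors = FALSE)

pieces <- lapply(names(df1), function(v) {
  # Rows of df2 whose key equals the value in this column (NA if no match).
  hit <- df2[match(df1[[v]], df2$var.a), -1, drop = FALSE]
  # Rename, e.g. var.b looked up for var.1 becomes var1.b.
  names(hit) <- paste0(sub(".", "", v, fixed = TRUE), ".",
                       sub("var.", "", names(hit), fixed = TRUE))
  rownames(hit) <- NULL
  cbind(df1[v], hit)
})

res <- do.call(cbind, pieces)
res
```

Values of df1 that have no key in df2 (9 and 6 in this sample) simply produce NA in the looked-up columns.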

Matching columns with other columns in data frames and adding certain columns of matching values

I have searched but cannot find what I need. I have found similar threads, but they still don't do what I want. I know there should be an easy way to do this without writing a loop. Here it goes:
I have two data frames, df1 and df2:
df1 <- data.frame(ID = c("a", "b", "c", "d", "e", "f"), y = 1:6 )
df2 <- data.frame(x = c("a", "c", "g", "f"), f=c("M","T","T","M"), obj=c("F70", "F60", "F71", "F82"))
df2$f <- as.factor(df2$f)
Now I want to match the "ID" column of df1 with the "x" column of df2, and add the corresponding columns from df2 to df1 as new columns. The final df1 should look like this:
ID y obj f1 f2
a 1 F70 M NA
b 2 NA NA NA
c 3 F60 NA T
d 4 NA NA NA
e 5 NA NA NA
f 6 F82 M NA
We can do this with tidyverse by joining the two datasets and then spreading the 'f' column:
library(tidyverse)
left_join(df1, df2, by = c(ID = "x")) %>%
group_by(f) %>%
spread(f, f) %>%
select(-6) %>%
rename(f1 = M, f2 = T)
# A tibble: 6 × 5
# ID y obj f1 f2
#* <chr> <int> <fctr> <fctr> <fctr>
#1 a 1 F70 M NA
#2 b 2 NA NA NA
#3 c 3 F60 NA T
#4 d 4 NA NA NA
#5 e 5 NA NA NA
#6 f 6 F82 M NA
Or a similar approach with data.table
library(data.table)
dcast(setDT(df2)[df1, on = .(x = ID)], x+obj + y ~ f, value.var = 'f')[, -6, with = FALSE]
Here is a base R process.
# combine the data.frames
dfNew <- merge(df1, df2, by.x="ID", by.y="x", all.x=TRUE)
# add f1 and f2 variables
dfNew[c("f1", "f2")] <- lapply(c("M", "T"),
function(i) factor(ifelse(as.character(dfNew$f) == i, i, NA)))
# remove original factor variable
dfNew <- dfNew[-3]
ID y obj f1 f2
1 a 1 F70 M <NA>
2 b 2 <NA> <NA> <NA>
3 c 3 F60 <NA> T
4 d 4 <NA> <NA> <NA>
5 e 5 <NA> <NA> <NA>
6 f 6 F82 M <NA>
