how to use melt and dcast on tough data frame

I have a data frame with one value in each cell, except for the last column, which is a list.
For example, here there are 3 columns. Columns X and Y hold one value per row, but column Z is a list and can hold multiple values per cell.
X Y Z
1 a d h, i, j
2 b e j, k
3 c f l, m, n, o
I need to create this:
X Y Z
1 a d h
2 a d i
3 a d j
4 b e j
5 b e k
6 c f l
7 c f m
8 c f n
9 c f o
Can someone help me figure this out? I am not sure how to use melt, dcast, or any other function for this.
Thanks.

unnest() from tidyr works, provided Z is a list column:
library(tidyr)
unnest(dat, Z)
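The unnest() call above assumes Z is a true list column. As a minimal base R sketch of the same reshaping (the construction of dat here is an assumption, since the question does not show how the data frame was built), each row can be repeated once per element of Z:

```r
# Assumed reconstruction of the question's data: Z is a list column.
dat <- data.frame(X = c("a", "b", "c"),
                  Y = c("d", "e", "f"),
                  stringsAsFactors = FALSE)
dat$Z <- list(c("h", "i", "j"), c("j", "k"), c("l", "m", "n", "o"))

# Number of values held in each cell of Z.
n <- lengths(dat$Z)

# Repeat X and Y once per value, and flatten Z into a plain vector.
long <- data.frame(X = rep(dat$X, times = n),
                   Y = rep(dat$Y, times = n),
                   Z = unlist(dat$Z),
                   stringsAsFactors = FALSE)
long
```

If Z is instead stored as one comma-separated string per row, strsplit(dat$Z, ",\\s*") would turn it into a list first, after which the same steps apply.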

Related

R - Adding a total row in Excel output

I want to add a total row (as in the Excel tables) while writing my data.frame in a worksheet.
Here is my present code (using openxlsx):
writeDataTable(wb=WB, sheet="Data", x=X, withFilter=F, bandedRows=F, firstColumn=T)
X contains a data.frame with 8 character variables and 1 numeric variable, so the total row should contain a total only for the numeric column. Ideally I would enable Excel's built-in total row feature while writing the table to the workbook object (as I did with firstColumn), rather than adding a total row manually.
I searched for a solution on Stack Overflow and in the official openxlsx documentation, but to no avail. Please suggest solutions using openxlsx.
EDIT:
Adding data sample:
A B C D E F G H I
a b s r t i s 5 j
f d t y d r s 9 s
w s y s u c k 8 f
After Total row:
A B C D E F G H I
a b s r t i s 5 j
f d t y d r s 9 s
w s y s u c k 8 f
na na na na na na na 22 na

library(janitor)
adorn_totals(df, "row")
#> A B C D E F G H I
#> a b s r t i s 5 j
#> f d t y d r s 9 s
#> w s y s u c k 8 f
#> Total - - - - - - 22 -
If you prefer empty space instead of - in the character columns you can specify fill = "" or fill = NA.

Assuming your data is stored in a data.frame called df:
df <- read.table(text =
"A B C D E F G H I
a b s r t i s 5 j
f d t y d r s 9 s
w s y s u c k 8 f",
header = TRUE,
stringsAsFactors = FALSE)
You can create a totals row using lapply()
totals <- lapply(df, function(col) {
  # sum numeric columns; use NA for everything else
  if (is.numeric(col)) sum(col) else NA
})
and add it to df using rbind()
df <- rbind(df, totals)
head(df)
A B C D E F G H I
1 a b s r t i s 5 j
2 f d t y d r s 9 s
3 w s y s u c k 8 f
4 <NA> <NA> <NA> <NA> <NA> <NA> <NA> 22 <NA>

Select a row based on two other columns R [duplicate]

This question already has answers here:
Find max per group and return another column
(4 answers)
Closed 5 years ago.
I have a dataframe df
df = data.frame(L = rep(letters[1:6], each = 2),
M = rep(letters[7:12], times = 2),
freq = rep(c(5, 10), times = 6))
L M freq
1 a g 5
2 a h 10
3 b i 5
4 b j 10
5 c k 5
6 c l 10
7 d g 5
8 d h 10
9 e i 5
10 e j 10
11 f k 5
12 f l 10
I want to select the most frequent M for each L.
In this example the output would show:
h, j, l, h, j, l
Frequency is not necessarily every second value in the actual problem.
How can I do this easily?
I tried a tapply approach but got stuck, because it seems to apply only to single variables and can't be used to subset a data frame by group. (It didn't produce anything close, so I won't post it.)

We can do
library(data.table)
setDT(df)[, .(M = M[which.max(freq)]), L]
# L M
#1: a h
#2: b j
#3: c l
#4: d h
#5: e j
#6: f l
Or order the 'freq' and select the first 'M' for each 'L'
setDT(df)[order(-freq), .(M = M[1]) , L]

Another solution using dplyr:
library(dplyr)
df %>% group_by(L) %>% top_n(1, freq) %>% pull(M)
# [1] h j l h j l
If M is a factor, wrap the result in as.character() to get a character vector.
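For comparison, here is a minimal base R sketch of the same "max per group" selection. It rebuilds the example data deterministically (an assumption, so that the result is reproducible) and uses split() with which.max(), which returns only the first row when there are ties:

```r
# Deterministic version of the example data (assumed for reproducibility).
df <- data.frame(L = rep(letters[1:6], each = 2),
                 M = rep(letters[7:12], times = 2),
                 freq = rep(c(5, 10), times = 6),
                 stringsAsFactors = FALSE)

# Split the rows by L, then pick the M with the largest freq in each piece.
best <- sapply(split(df, df$L), function(g) g$M[which.max(g$freq)])
best
```

The result is a character vector named by the values of L.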

Merge multiple dataframes in R [duplicate]

This question already has answers here:
Simultaneously merge multiple data.frames in a list
(9 answers)
Closed 6 years ago.
I need to merge multiple data frames on matching values in column A. What is the most efficient way to do this and get the result below?
df1
A B C
2 x r
1 c r
3 y t
df2
A D E
3 e y
1 t t
2 y t
df3
A F G
1 g y
2 f y
3 h k
result
A B C D E F G
1 c r t t g y
2 x r y t f y
3 y t e y h k

One solution is to use the dplyr package and its inner_join function, as follows:
library(dplyr)
df <- inner_join(df1, df2)
df <- inner_join(df, df3)
Resulting output is:
df
A B C D E F G
1 2 x r y t f y
2 1 c r t t g y
3 3 y t e y h k
Note, inner_join keeps only rows where A matches.
If you want it arranged by column A, you can add this line:
arrange(df, A)
A B C D E F G
1 1 c r t t g y
2 2 x r y t f y
3 3 y t e y h k
To merge a variable length list of data frames, it appears Reduce can be helpful along with the above inner_join:
df <- Reduce(inner_join, list(df1, df2, df3))
arrange(df, A)
A B C D E F G
1 1 c r t t g y
2 2 x r y t f y
3 3 y t e y h k
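The same Reduce pattern also works without dplyr. The following is a minimal base R sketch that uses merge() as the pairwise step; the data frames are retyped from the question, and by = "A" is stated explicitly rather than relying on the shared column being detected:

```r
# Data frames retyped from the question.
df1 <- data.frame(A = c(2, 1, 3), B = c("x", "c", "y"), C = c("r", "r", "t"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(A = c(3, 1, 2), D = c("e", "t", "y"), E = c("y", "t", "t"),
                  stringsAsFactors = FALSE)
df3 <- data.frame(A = c(1, 2, 3), F = c("g", "f", "h"), G = c("y", "y", "k"),
                  stringsAsFactors = FALSE)

# Fold the list pairwise: merge(merge(df1, df2), df3), always joining on A.
merged <- Reduce(function(x, y) merge(x, y, by = "A"), list(df1, df2, df3))
merged
```

By default merge() sorts the result by the key column, so no separate arrange() step is needed here.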

split data.frame into list based on row values across columns

I would like to split a data.frame into a list based on row values/characters across all columns of the data.frame.
I wrote lists of data.frames to file using write.list {erer}
So now when I read them in again, they look like this:
dummy data
set.seed(1)
df <- cbind(data.frame(col1=c(sample(LETTERS, 4),"col1",sample(LETTERS, 7))),
data.frame(col2=c(sample(LETTERS, 4),"col2",sample(LETTERS, 7))),
data.frame(col3=c(sample(LETTERS, 4),"col3",sample(LETTERS, 7))))
col1 col2 col3
1 G E Q
2 J R D
3 N J G
4 U Y I
5 col1 col2 col3
6 F M A
7 W R J
8 Y X U
9 P I H
10 N Y K
11 B T M
12 E E Y
And I would like to split into lists by c("col1","col2","col3") producing
[[1]]
col1 col2 col3
1 G E Q
2 J R D
3 N J G
4 U Y I
[[2]]
col1 col2 col3
1 F M A
2 W R J
3 Y X U
4 P I H
5 N Y K
6 B T M
7 E E Y
Feels like it should be straightforward using split, but my attempts so far have failed. Also, as you see, I can't split by a certain row interval.
Any pointers would be highly appreciated, thanks!

Try
lapply(split(df, cumsum(grepl(names(df)[1], df$col1))), function(x) x[!grepl(names(df)[1], x$col1), ])
#$`0`
# col1 col2 col3
#1 G E Q
#2 J R D
#3 N J G
#4 U Y I
#$`1`
# col1 col2 col3
#6 F M A
#7 W R J
#8 Y X U
#9 P I H
#10 N Y K
#11 B T M
#12 E E Y

This should be general, if you want to split if a line is exactly like the colnames:
dfSplit<-split(df,cumsum(Reduce("&",Map("==",df,colnames(df)))))
for (i in 2:length(dfSplit)) dfSplit[[i]]<-dfSplit[[i]][-1,]
The second line can be written a little more R-style, as @DavidArenburg suggested in the comments.
dfSplit[-1] <- lapply(dfSplit[-1], function(x) x[-1, ])
It also has the added benefit of doing nothing if dfSplit has length 1 (unlike my original second line, which would throw an error).
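To make the two steps concrete, here is a small self-contained sketch of the split on repeated header rows. The data frame below is a shortened, hand-typed stand-in for the question's data (an assumption, so that the result does not depend on the random seed):

```r
# Shortened stand-in for the question's data; row 3 repeats the column names.
df <- data.frame(col1 = c("G", "J", "col1", "F", "W", "Y"),
                 col2 = c("E", "R", "col2", "M", "R", "X"),
                 col3 = c("Q", "D", "col3", "A", "J", "U"),
                 stringsAsFactors = FALSE)

# TRUE for rows in which every cell equals the name of its own column.
is_header <- Reduce(`&`, Map(`==`, df, colnames(df)))

# cumsum() starts a new group at each header row; split on that group id.
parts <- split(df, cumsum(is_header))

# Every group except the first begins with a header row, so drop it.
parts[-1] <- lapply(parts[-1], function(x) x[-1, ])
parts
```

The list elements are named "0" and "1" after the running header count, and the original row names are kept.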

merge two dataframe based on matching two exchangable columns in each dataframe

I have two data frames in R.
dataframe 1
A B C D E F G
1 2 a a a a a
2 3 b b b c c
4 1 e e f f e
dataframe 2
X Y Z
1 2 g
2 1 h
3 4 i
1 4 j
I want to match columns A and B of dataframe 1 against columns X and Y of dataframe 2. The comparison is NOT order-sensitive, i.e. row 1 of dataframe 1 (A=1, B=2) is considered the same as both row 1 (X=1, Y=2) and row 2 (X=2, Y=1) of dataframe 2.
When a match is found, I would like to add columns C, D, E, F and G of dataframe 1 to the matched row of dataframe 2, with NA where there is no match, as follows:
Final dataframe
X Y Z C D E F G
1 2 g a a a a a
2 1 h a a a a a
3 4 i na na na na na
1 4 j e e f f e
I only know how to match on a single column. Matching on two exchangeable columns and merging the two data frames based on the result is difficult for me. Please suggest a smart way of doing this.
For ease of discussion (thanks to Vincent and DWin, who commented on my previous question that I should test the code), here is the code for loading dataframes 1 and 2 into R.
df1 <- data.frame(A = c(1,2,4), B=c(2,3,1), C=c('a','b','e'),
D=c('a','b','e'), E=c('a','b','f'),
F=c('a','c','f'), G=c('a','c', 'e'))
df2 <- data.frame(X = c(1,2,3,1), Y=c(2,1,4,4), Z=letters[7:10])

The following works, but no doubt can be improved.
I first create a little helper function that performs a row-wise sort on A and B (and renames it to V1 and V2).
replace_index <- function(dat){
x <- as.data.frame(t(sapply(seq_len(nrow(dat)),
function(i)sort(unlist(dat[i, 1:2])))))
names(x) <- paste("V", seq_len(ncol(x)), sep="")
data.frame(x, dat[, -(1:2), drop=FALSE])
}
replace_index(df1)
V1 V2 C D E F G
1 1 2 a a a a a
2 2 3 b b b c c
3 1 4 e e f f e
This means you can use a straightforward merge to combine the data.
merge(replace_index(df1), replace_index(df2), all.y=TRUE)
V1 V2 C D E F G Z
1 1 2 a a a a a g
2 1 2 a a a a a h
3 1 4 e e f f e j
4 3 4 <NA> <NA> <NA> <NA> <NA> i

This is slightly clunky and has some potential collision and ordering issues, but it works with your example:
df1a <- df1; df1a$A <- df1$B; df1a$B <- df1$A #reverse A and B
merge(df2, rbind(df1,df1a), by.x=c("X","Y"), by.y=c("A","B"), all.x=TRUE)
to produce
X Y Z C D E F G
1 1 2 g a a a a a
2 1 4 j e e f f e
3 2 1 h a a a a a
4 3 4 i <NA> <NA> <NA> <NA> <NA>

One approach would be to create an id key for matching that is order invariant.
# create id key to match
library(plyr)
df1 = adply(df1, 1, transform, id = paste(min(A, B), "-", max(A, B)))
df2 = adply(df2, 1, transform, id = paste(min(X, Y), "-", max(X, Y)))
# combine data frames using `match`
cbind(df2, df1[match(df2$id, df1$id),3:7])
This produces the output
X Y Z id C D E F G
1 1 2 g 1 - 2 a a a a a
1.1 2 1 h 1 - 2 a a a a a
NA 3 4 i 3 - 4 <NA> <NA> <NA> <NA> <NA>
3 1 4 j 1 - 4 e e f f e

You could also join the tables both ways (X == A and Y == B, then X == B and Y == A) and rbind them. This will produce duplicate pairs where one way yielded a match and the other yielded NA, so you would then reduce duplicates by slicing only a single row for each X-Y combination, the one without NA if one exists.
library(dplyr)
m <- left_join(df2,df1,by = c("X" = "A","Y" = "B"))
n <- left_join(df2,df1,by = c("Y" = "A","X" = "B"))
rbind(m,n) %>%
group_by(X,Y) %>%
arrange(C,D,E,F,G) %>% # sort to put NA rows on bottom of pairs
slice(1) # take top row from combination
Produces:
Source: local data frame [4 x 8]
Groups: X, Y
X Y Z C D E F G
1 1 2 g a a a a a
2 1 4 j e e f f e
3 2 1 h a a a a a
4 3 4 i NA NA NA NA NA

Here's another possible solution in base R. This solution cbind()s new key columns (K1 and K2) to both data.frames using the vectorized pmin() and pmax() functions to derive the canonical order of the key columns, and merges on those:
merge(
  cbind(df2, K1 = pmin(df2$X, df2$Y), K2 = pmax(df2$X, df2$Y)),
  cbind(df1, K1 = pmin(df1$A, df1$B), K2 = pmax(df1$A, df1$B)),
  all.x = TRUE
)[, -c(1:2, 6:7)]
## X Y Z C D E F G
## 1 1 2 g a a a a a
## 2 2 1 h a a a a a
## 3 1 4 j e e f f e
## 4 3 4 i <NA> <NA> <NA> <NA> <NA>
Note that the use of pmin() and pmax() is only possible for this problem because you have only two key columns. If you had more, you would have to use some kind of apply+sort solution to achieve the canonical key order for merging, similar to what @Andrie does in his helper function. That would work for any number of key columns, but would be less performant.
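As a rough sketch of that apply+sort idea, the helper below builds one order-invariant key per row from any number of key columns. The function name canonical_key and the "-" separator are hypothetical choices for illustration, not part of any answer above:

```r
# Data from the question.
df1 <- data.frame(A = c(1, 2, 4), B = c(2, 3, 1), C = c("a", "b", "e"),
                  D = c("a", "b", "e"), E = c("a", "b", "f"),
                  F = c("a", "c", "f"), G = c("a", "c", "e"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(X = c(1, 2, 3, 1), Y = c(2, 1, 4, 4), Z = letters[7:10],
                  stringsAsFactors = FALSE)

# Hypothetical helper: sort the key values within each row and paste them
# into a single string, so that (1, 2) and (2, 1) give the same key.
canonical_key <- function(dat, cols) {
  apply(dat[, cols, drop = FALSE], 1,
        function(r) paste(sort(r), collapse = "-"))
}

df1$key <- canonical_key(df1, c("A", "B"))
df2$key <- canonical_key(df2, c("X", "Y"))

# Left join on the key, then keep only the columns asked for.
res <- merge(df2, df1[, c("key", "C", "D", "E", "F", "G")],
             by = "key", all.x = TRUE)
res <- res[, c("X", "Y", "Z", "C", "D", "E", "F", "G")]
res
```

This assumes the key columns share one type and that the separator never occurs inside a key value. As noted above, the row-wise apply() is slower than the vectorized pmin()/pmax() version.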