Given a two-column data.frame with one column of group labels and a second column of integer values ordered from smallest to largest, how can the data be expanded into pairs of combinations of the integer column?
I'm not sure of the best way to state this. I'm not interested in all ordered pairs, only in each unique combination, listed with the lower value first.
In R, the combn function gives the desired output when groups are ignored, for example:
t(combn(1:4, 2))
[,1] [,2]
[1,] 1 2
[2,] 1 3
[3,] 1 4
[4,] 2 3
[5,] 2 4
[6,] 3 4
Since the first value is 1, we get the unique combination (1,2) but not the additional combination (2,1), which I don't need. How would one apply a similar method by group?
For example, given this data.frame:
test <- data.frame(Group = rep(c("A","B"),each=4),
Val = c(1,3,6,8,2,4,5,7))
test
Group Val
1 A 1
2 A 3
3 A 6
4 A 8
5 B 2
6 B 4
7 B 5
8 B 7
I was able to come up with this solution that gives the desired output:
test <- data.frame(Group = rep(c("A","B"),each=4),
Val = c(1,3,6,8,2,4,5,7))
library(dplyr) # filter() below comes from dplyr
j=1
for(i in unique(test$Group)){
if(j==1){
one <- filter(test,i == Group)
two <- data.frame(t(combn(one$Val,2)))
test1 <- data.frame(Group = i,Val1=two$X1,Val2=two$X2)
j=j+1
}else{
one <- filter(test,i == Group)
two <- data.frame(t(combn(one$Val,2)))
test2 <- data.frame(Group = i,Val1=two$X1,Val2=two$X2)
test1 <- rbind(test1,test2)
}
}
test1
Group Val1 Val2
1 A 1 3
2 A 1 6
3 A 1 8
4 A 3 6
5 A 3 8
6 A 6 8
7 B 2 4
8 B 2 5
9 B 2 7
10 B 4 5
11 B 4 7
12 B 5 7
However, this is not elegant and is really slow as the number of groups and length of each group become large. It seems like there should be a more elegant and efficient solution but so far I have not come across anything on SO.
I would appreciate any ideas!
Here is a data.table approach:
library( data.table )
#make test a data.table
setDT(test)
#split by group
L <- split( test, by = "Group")
#get unique combinations of 2 Vals
L2 <- lapply( L, function(x) {
as.data.table( t( combn( x$Val, m = 2, simplify = TRUE ) ) )
})
#merge them back together
data.table::rbindlist( L2, idcol = "Group" )
# Group V1 V2
# 1: A 1 3
# 2: A 1 6
# 3: A 1 8
# 4: A 3 6
# 5: A 3 8
# 6: A 6 8
# 7: B 2 4
# 8: B 2 5
# 9: B 2 7
#10: B 4 5
#11: B 4 7
#12: B 5 7
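The same split, combine, and stack idea can also be sketched in base R without extra packages. This is only an illustration under the question's setup (the `test` data.frame with `Group` and `Val`); the names `pairs_by_group`, `stacked`, and `result` are made up for the example.

```r
test <- data.frame(Group = rep(c("A", "B"), each = 4),
                   Val = c(1, 3, 6, 8, 2, 4, 5, 7),
                   stringsAsFactors = FALSE)

# One data.frame of pairs per group. combn() keeps the input order,
# so each pair appears once, with the earlier element first.
pairs_by_group <- lapply(split(test$Val, test$Group), function(v) {
  p <- t(combn(v, 2))
  data.frame(Val1 = p[, 1], Val2 = p[, 2])
})

# Stack the pieces and rebuild the Group column from the list names.
stacked <- do.call(rbind, pairs_by_group)
result <- data.frame(
  Group = rep(names(pairs_by_group), vapply(pairs_by_group, nrow, integer(1))),
  stacked,
  stringsAsFactors = FALSE
)
rownames(result) <- NULL
result
```

One caveat: if a group contains a single positive number n, combn(n, 2) is interpreted as combinations of seq_len(n), so groups with fewer than two values would need to be filtered out first.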
You can set simplify = FALSE in combn() and then use unnest_wider() from tidyr.
library(dplyr)
library(tidyr)
test %>%
group_by(Group) %>%
summarise(Val = combn(Val, 2, simplify = F)) %>%
unnest_wider(Val, names_sep = "_")
# Group Val_1 Val_2
# <chr> <dbl> <dbl>
# 1 A 1 3
# 2 A 1 6
# 3 A 1 8
# 4 A 3 6
# 5 A 3 8
# 6 A 6 8
# 7 B 2 4
# 8 B 2 5
# 9 B 2 7
# 10 B 4 5
# 11 B 4 7
# 12 B 5 7
library(tidyverse)
# 'test' is the data.frame from the question; n is taken from each group's size
df2 <- split(test$Val, test$Group) %>%
map(~gtools::combinations(n = length(.x), r = 2, v = .x)) %>%
map(~as_tibble(.x, .name_repair = "unique")) %>%
bind_rows(.id = "Group")
I have a data.frame where one of the variables is a vector (or a list), like this:
MyColumn <- c("A, B,C", "D,E", "F","G")
MyDF <- data.frame(group_id=1:4, val=11:14, cat=MyColumn)
# group_id val cat
# 1 1 11 A, B,C
# 2 2 12 D,E
# 3 3 13 F
# 4 4 14 G
I'd like to have a new data frame with as many rows as the vector
FlatColumn <- unlist(strsplit(MyColumn,split=","))
which looks like this:
MyNewDF <- data.frame(group_id=c(rep(1,3),rep(2,2),3,4), val=c(rep(11,3),rep(12,2),13,14), cat=FlatColumn)
# group_id val cat
# 1 1 11 A
# 2 1 11 B
# 3 1 11 C
# 4 2 12 D
# 5 2 12 E
# 6 3 13 F
# 7 4 14 G
In essence, for every factor which is an element of the list of MyColumn (the letters A to G), I want to assign the corresponding values of the list. Every factor appears only once in MyColumn.
Is there a neat way for this kind of reshaping/unlisting/merging? I've come up with a very cumbersome for-loop over the rows of MyDF and the length of the corresponding element of strsplit(MyColumn,split=","). I'm very sure that there has to be a more elegant way.
You can use separate_rows from tidyr:
tidyr::separate_rows(MyDF, cat)
# group_id val cat
# 1 1 11 A
# 2 1 11 B
# 3 1 11 C
# 4 2 12 D
# 5 2 12 E
# 6 3 13 F
# 7 4 14 G
How about
lst <- strsplit(MyColumn, split = ",")
k <- lengths(lst) ## expansion size
FlatColumn <- unlist(lst, use.names = FALSE)
MyNewDF <- data.frame(group_id = rep.int(MyDF$group_id, k),
val = rep.int(MyDF$val, k),
cat = FlatColumn)
# group_id val cat
#1 1 11 A
#2 1 11 B
#3 1 11 C
#4 2 12 D
#5 2 12 E
#6 3 13 F
#7 4 14 G
We can use cSplit from splitstackshape
library(splitstackshape)
cSplit(MyDF, "cat", ",", "long")
# group_id val cat
#1: 1 11 A
#2: 1 11 B
#3: 1 11 C
#4: 2 12 D
#5: 2 12 E
#6: 3 13 F
#7: 4 14 G
We can also do this in base R: use strsplit to split the 'cat' column into a list, replicate the row indices of 'MyDF' according to the lengths of 'lst', and create the 'cat' column by unlisting 'lst'.
lst <- strsplit(as.character(MyDF$cat), ",")
transform(MyDF[rep(1:nrow(MyDF), lengths(lst)),-3], cat = unlist(lst))
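One detail worth checking with the base R variants: the example string "A, B,C" has a space after the first comma, and splitting on a bare "," keeps that space in " B". A small sketch of two ways to handle it, using the question's data (the names `lst` and `lst_trimmed` are only for illustration):

```r
MyColumn <- c("A, B,C", "D,E", "F", "G")
MyDF <- data.frame(group_id = 1:4, val = 11:14, cat = MyColumn,
                   stringsAsFactors = FALSE)

# Splitting on a bare comma keeps the space that follows it.
strsplit(MyColumn[1], ",")[[1]]

# Option 1: let the separator absorb any whitespace after the comma.
lst <- strsplit(MyDF$cat, ",\\s*")

# Option 2 (equivalent for this data): split on "," and trim each piece.
lst_trimmed <- lapply(strsplit(MyDF$cat, ","), trimws)

# Repeat the other columns once per extracted element.
MyNewDF <- data.frame(group_id = rep(MyDF$group_id, lengths(lst)),
                      val = rep(MyDF$val, lengths(lst)),
                      cat = unlist(lst),
                      stringsAsFactors = FALSE)
MyNewDF
```

separate_rows() and cSplit() did not show this issue in the outputs above, so the trimming only matters when splitting by hand.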
I've simplified my problem conceptually as follows. I have a list (mylist) comprised of two data frames. I know how to reorder them both (say, on the 4th and 1st variable) using lapply:
mylist<-lapply(mylist, function(x) x[order(x[,4],x[,1]),])
Now I am trying to use lapply() and rank() to add a 5th column to each dataframe in the list, and populate the column with the rank (the rank within that dataframe, on the 4th variable say).
I've tried dozens of permutations of this:
mylist[,5]<-lapply(mylist, function(x) rank(x[,4], ties.method="first"))
Nothing works right. Help! Thanks
> mylist
[[1]]
a b c d
1 1 4 7 A
2 2 5 8 A
3 3 6 9 B
[[2]]
a b c d
1 9 6 3 A
2 8 5 2 A
3 7 4 1 B
Well it couldn't be:
mylist[,5]<-lapply(mylist, function(x) rank(x[,4], ties.method="first"))
... because mylist[,5] doesn't make sense: mylist is a two-element list, so it has no columns to index. You need to loop over the elements and add the column to each data frame individually:
mylist <- lapply(mylist, function(x) {
rl <- rank(x[, 4], ties.method = "first")
x <- cbind(x, rl = rl)
x[order(x$rl), ] # order on the vector x$rl, not the one-column data frame x['rl']
})
Your lapply returns the output of rank, which is a vector. You need to append a column to each data.frame and then return the data.frame:
mylist <- list(data.frame(a = rnorm(5), b = rpois(5, 3)), data.frame(a = rnorm(5), b = rpois(5, 3)) )
mylist
## [[1]]
## a b
## 1 -1.31730854 4
## 2 0.04395243 1
## 3 0.15370905 0
## 4 -0.77556501 4
## 5 1.12879380 4
##
## [[2]]
## a b
## 1 -0.96314478 3
## 2 -0.54824004 6
## 3 0.34943917 1
## 4 -0.07077913 0
## 5 1.10519356 3
lapply(mylist, function(x) { x$c <- rank(x$b); x })
## [[1]]
## a b c
## 1 -1.31730854 4 4
## 2 0.04395243 1 2
## 3 0.15370905 0 1
## 4 -0.77556501 4 4
## 5 1.12879380 4 4
##
## [[2]]
## a b c
## 1 -0.96314478 3 3.5
## 2 -0.54824004 6 5.0
## 3 0.34943917 1 2.0
## 4 -0.07077913 0 1.0
## 5 1.10519356 3 3.5
Using the data provided by @JakeBurkhead, you can simply use transform:
set.seed(123)
mylist <- list(data.frame(a = rnorm(5), b = rpois(5, 3)),
data.frame(a = rnorm(5), b = rpois(5, 3)) )
lapply(mylist, transform, c = rank(b, ties.method = "first"))
## [[1]]
## a b c
## 1 -0.560476 6 5
## 2 -0.230177 3 2
## 3 1.558708 4 4
## 4 0.070508 3 3
## 5 0.129288 1 1
## [[2]]
## a b c
## 1 1.28055 4 5
## 2 -1.72727 3 3
## 3 1.69018 3 4
## 4 0.50381 2 2
## 5 2.52834 1 1
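Putting the question's two steps together (reorder, then add a rank column) might look like the sketch below. It assumes data shaped like the question's printout, rebuilt here by hand; the column name `rank_d` and the object name `ranked` are invented for the example.

```r
mylist <- list(
  data.frame(a = 1:3, b = 4:6, c = 7:9, d = c("A", "A", "B"),
             stringsAsFactors = FALSE),
  data.frame(a = 9:7, b = 6:4, c = 3:1, d = c("A", "A", "B"),
             stringsAsFactors = FALSE)
)

ranked <- lapply(mylist, function(x) {
  # Sort on the 4th column, breaking ties with the 1st column.
  x <- x[order(x[, 4], x[, 1]), ]
  # Add the rank as a new column, then return the whole data frame.
  x$rank_d <- rank(x[, 4], ties.method = "first")
  x
})
ranked
```

The key point is that the function passed to lapply() returns the modified data frame, so the result is again a list of data frames rather than a list of rank vectors.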
Hoping there's a simple answer here but I can't find it anywhere.
I have a numeric matrix with row names and column names:
# 1 2 3 4
# a 6 7 8 9
# b 8 7 5 7
# c 8 5 4 1
# d 1 6 3 2
I want to melt the matrix to a long format, with the values in one column and matrix row and column names in one column each. The result could be a data.table or data.frame like this:
# col row value
# 1 a 6
# 1 b 8
# 1 c 8
# 1 d 1
# 2 a 7
# 2 b 7
# 2 c 5
# 2 d 6
...
Any tips appreciated.
Use melt from reshape2:
library(reshape2)
#Fake data
x <- matrix(1:12, ncol = 3)
colnames(x) <- letters[1:3]
rownames(x) <- 1:4
x.m <- melt(x)
x.m
Var1 Var2 value
1 1 a 1
2 2 a 2
3 3 a 3
4 4 a 4
...
The as.table and as.data.frame functions together will do this:
> m <- matrix( sample(1:12), nrow=4 )
> dimnames(m) <- list( One=letters[1:4], Two=LETTERS[1:3] )
> as.data.frame( as.table(m) )
One Two Freq
1 a A 7
2 b A 2
3 c A 1
4 d A 5
5 a B 9
6 b B 6
7 c B 8
8 d B 10
9 a C 11
10 b C 12
11 c C 3
12 d C 4
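Two details of this approach are easy to miss: the value column is called Freq by default, and the label columns come back as factors. A minimal sketch, assuming the matrix from the question rebuilt by hand with named dimnames, uses the responseName and stringsAsFactors arguments of the table method of as.data.frame() to control both:

```r
m <- matrix(c(6, 8, 8, 1,  7, 7, 5, 6,  8, 5, 4, 3,  9, 7, 1, 2),
            nrow = 4,
            dimnames = list(row = letters[1:4], col = as.character(1:4)))

# The names of the dimnames become the label column names;
# responseName renames the default "Freq" column.
long <- as.data.frame(as.table(m),
                      responseName = "value",
                      stringsAsFactors = FALSE)
head(long)
```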
Assuming 'm' is your matrix...
data.frame(col = rep(colnames(m), each = nrow(m)),
row = rep(rownames(m), ncol(m)),
value = as.vector(m))
This executes extremely fast on a large matrix and also shows you a bit about how a matrix is made, how to access things in it, and how to construct your own vectors.
A modification that doesn't require you to know anything about the storage structure, and that extends easily to higher-dimensional arrays if you use the dimnames and slice.index functions:
data.frame(row=rownames(m)[as.vector(row(m))],
col=colnames(m)[as.vector(col(m))],
value=as.vector(m))
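As a quick check of the two constructions above, the sketch below rebuilds the matrix from the question by hand and confirms that both give the same labels for every value. The names `by_rep`, `by_index`, and `same` are made up for the example.

```r
m <- matrix(c(6, 8, 8, 1,  7, 7, 5, 6,  8, 5, 4, 3,  9, 7, 1, 2),
            nrow = 4,
            dimnames = list(letters[1:4], as.character(1:4)))

# as.vector() walks down column 1 first, then column 2, and so on.
head(as.vector(m), 4)

# Version 1: rely on column-major storage and repeat the labels to match.
by_rep <- data.frame(col = rep(colnames(m), each = nrow(m)),
                     row = rep(rownames(m), ncol(m)),
                     value = as.vector(m),
                     stringsAsFactors = FALSE)

# Version 2: look up each cell's row and column index explicitly.
by_index <- data.frame(row = rownames(m)[as.vector(row(m))],
                       col = colnames(m)[as.vector(col(m))],
                       value = as.vector(m),
                       stringsAsFactors = FALSE)

same <- identical(by_rep$row, by_index$row) &&
  identical(by_rep$col, by_index$col) &&
  identical(by_rep$value, by_index$value)
same
```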
Hi, I have a table with comma-delimited columns and I need to convert the comma-delimited values to new rows. For example, the given table is
Name Start End
A 1,2,3 4,5,6
B 1,2 4,5
C 1,2,3,4 6,7,8,9
I need to convert it like
Name Start End
A 1 4
A 2 5
A 3 6
B 1 4
B 2 5
C 1 6
C 2 7
C 3 8
C 4 9
I can do that using VB script but I need to solve it using R
Can anyone solve this?
You might have asked this on SO instead, since the question has nothing to do with statistics :)
Anyway, I made up a quite complicated and ugly solution which might work for you:
# load your data
x <- structure(list(Name = c("A", "B", "C"), Start = c("1,2,3", "1,2",
"1,2,3,4"), End = c("4,5,6", "4,5", "6,7,8,9")), .Names = c("Name",
"Start", "End"), row.names = c(NA, -3L), class = "data.frame")
Which looks like this in R:
> x
Name Start End
1 A 1,2,3 4,5,6
2 B 1,2 4,5
3 C 1,2,3,4 6,7,8,9
Data transformation with the help of strsplit calls:
data <- data.frame(cbind(
rep(x$Name,as.numeric(lapply(strsplit(x$Start,","), length))),
unlist(lapply(strsplit(x$Start,","), cbind)),
unlist(lapply(strsplit(x$End,","), cbind))
))
Naming the new data frame:
names(data) <- c("Name", "Start", "End")
Which looks like:
> data
Name Start End
1 A 1 4
2 A 2 5
3 A 3 6
4 B 1 4
5 B 2 5
6 C 1 6
7 C 2 7
8 C 3 8
9 C 4 9
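The same transformation can be written a little more directly. The following is only a sketch under the assumption that Start and End always hold the same number of comma-separated values in each row (which it checks); `start_parts`, `end_parts`, and `long` are illustrative names.

```r
x <- data.frame(Name = c("A", "B", "C"),
                Start = c("1,2,3", "1,2", "1,2,3,4"),
                End = c("4,5,6", "4,5", "6,7,8,9"),
                stringsAsFactors = FALSE)

start_parts <- strsplit(x$Start, ",", fixed = TRUE)
end_parts <- strsplit(x$End, ",", fixed = TRUE)

# Each row must contribute the same number of Start and End values.
stopifnot(identical(lengths(start_parts), lengths(end_parts)))

# Repeat each Name once per value, then flatten the split pieces.
long <- data.frame(Name = rep(x$Name, lengths(start_parts)),
                   Start = as.integer(unlist(start_parts)),
                   End = as.integer(unlist(end_parts)),
                   stringsAsFactors = FALSE)
long
```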
Here's an approach that should work for you. I'm assuming that your three input vectors are in different objects. We are going to create a list of those inputs and write a function that processes each object and returns it in the form of a data.frame with plyr.
The things to take note of here are the splitting of the character vector into its component parts, then using as.numeric to convert the numbers from the character form they were in when split. Since R fills matrices by column, we define a two-column matrix and let R fill in the values for us. We then retrieve the Name column and put it all together in a data.frame. plyr is nice enough to process the list and convert it into a data.frame for us automatically.
library(plyr)
a <- paste("A",1, 2,3,4,5,6, sep = ",", collapse = "")
b <- paste("B",1, 2,4,5, sep = ",", collapse = "")
c <- paste("C",1, 2,3,4,6,7,8,9, sep = ",", collapse = "")
input <- list(a,b,c)
splitter <- function(x) {
x <- unlist(strsplit(x, ","))
out <- data.frame(x[1], matrix(as.numeric(x[-1]), ncol = 2))
colnames(out) <- c("Name", "Start", "End")
return(out)
}
ldply(input, splitter)
And the output:
> ldply(input, splitter)
Name Start End
1 A 1 4
2 A 2 5
3 A 3 6
4 B 1 4
5 B 2 5
6 C 1 6
7 C 2 7
8 C 3 8
9 C 4 9
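The column-fill step this answer relies on can be seen in isolation in the short sketch below, which uses the string that the paste() call produces for "A". It only works because all Start values come before all End values in the input.

```r
x <- unlist(strsplit("A,1,2,3,4,5,6", ","))

# Drop the name, convert the rest to numbers.
vals <- as.numeric(x[-1])

# matrix() fills column 1 top to bottom before moving to column 2,
# so the first half of vals becomes Start and the second half End.
mat <- matrix(vals, ncol = 2)
mat
```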
The separate_rows() function in tidyr is the boss for observations with multiple delimited values...
# create data
library(tidyverse)
d <- tibble( # tibble() replaces the deprecated data_frame()
Name = c("A", "B", "C"),
Start = c("1,2,3", "1,2", "1,2,3,4"),
End = c("4,5,6", "4,5", "6,7,8,9")
)
d
# # A tibble: 3 x 3
# Name Start End
# <chr> <chr> <chr>
# 1 A 1,2,3 4,5,6
# 2 B 1,2 4,5
# 3 C 1,2,3,4 6,7,8,9
# tidy data
separate_rows(d, Start, End)
# # A tibble: 9 x 3
# Name Start End
# <chr> <chr> <chr>
# 1 A 1 4
# 2 A 2 5
# 3 A 3 6
# 4 B 1 4
# 5 B 2 5
# 6 C 1 6
# 7 C 2 7
# 8 C 3 8
# 9 C 4 9
# set convert = TRUE to get integer columns
separate_rows(d, Start, End, convert = TRUE)
# # A tibble: 9 x 3
# Name Start End
# <chr> <int> <int>
# 1 A 1 4
# 2 A 2 5
# 3 A 3 6
# 4 B 1 4
# 5 B 2 5
# 6 C 1 6
# 7 C 2 7
# 8 C 3 8
# 9 C 4 9
Here's another, just for fun. Take d as the original data.
f <- function(x, ul = TRUE)
{
x <- deparse(substitute(x))
if(ul) unlist(strsplit(d[[x]], ','))
else strsplit(d[[x]], ',')
}
> data.frame(Name = rep(d$Name, sapply(f(End, F), length)),
Start = f(Start), End = f(End))
# Name Start End
# 1 A 1 4
# 2 A 2 5
# 3 A 3 6
# 4 B 1 4
# 5 B 2 5
# 6 C 1 6
# 7 C 2 7
# 8 C 3 8
# 9 C 4 9