Given a two-column data.frame, with one column containing group labels and a second containing integer values ordered from smallest to largest, how can the data be expanded to create pairs of combinations of the integer column?
I'm not sure of the best way to state this. I'm not interested in all possible combinations, but instead all unique combinations starting from the lowest value.
In R, the combn function gives the desired output, not considering groups. For example:
t(combn(seq(1:4),2))
[,1] [,2]
[1,] 1 2
[2,] 1 3
[3,] 1 4
[4,] 2 3
[5,] 2 4
[6,] 3 4
Since the first value is 1, we get the unique combination (1,2) and not the additional combination (2,1), which I don't need. How would one then apply a similar method by group?
For example, given a data.frame:
test <- data.frame(Group = rep(c("A","B"), each = 4),
                   Val = c(1,3,6,8,2,4,5,7))
test
Group Val
1 A 1
2 A 3
3 A 6
4 A 8
5 B 2
6 B 4
7 B 5
8 B 7
I was able to come up with this solution that gives the desired output:
library(dplyr)  # for filter()
test <- data.frame(Group = rep(c("A","B"), each = 4),
                   Val = c(1,3,6,8,2,4,5,7))
j <- 1
for(i in unique(test$Group)){
  if(j == 1){
    one <- filter(test, i == Group)
    two <- data.frame(t(combn(one$Val, 2)))
    test1 <- data.frame(Group = i, Val1 = two$X1, Val2 = two$X2)
    j <- j + 1
  } else {
    one <- filter(test, i == Group)
    two <- data.frame(t(combn(one$Val, 2)))
    test2 <- data.frame(Group = i, Val1 = two$X1, Val2 = two$X2)
    test1 <- rbind(test1, test2)
  }
}
test1
Group Val1 Val2
1 A 1 3
2 A 1 6
3 A 1 8
4 A 3 6
5 A 3 8
6 A 6 8
7 B 2 4
8 B 2 5
9 B 2 7
10 B 4 5
11 B 4 7
12 B 5 7
However, this is not elegant, and it becomes really slow as the number of groups and the length of each group grow. It seems like there should be a more elegant and efficient solution, but so far I have not come across anything on SO.
I would appreciate any ideas!
Here is a data.table approach:
library( data.table )
#make test a data.table
setDT(test)
#split by group
L <- split( test, by = "Group")
#get unique combinations of 2 Vals
L2 <- lapply( L, function(x) {
  as.data.table( t( combn( x$Val, m = 2, simplify = TRUE ) ) )
})
#merge them back together
data.table::rbindlist( L2, idcol = "Group" )
# Group V1 V2
# 1: A 1 3
# 2: A 1 6
# 3: A 1 8
# 4: A 3 6
# 5: A 3 8
# 6: A 6 8
# 7: B 2 4
# 8: B 2 5
# 9: B 2 7
#10: B 4 5
#11: B 4 7
#12: B 5 7
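If you prefer a single data.table call, essentially the same combn() idea can be written more compactly; a sketch (test is already a data.table after the setDT() call above):
test[, as.data.table(t(combn(Val, 2))), by = Group]
Here the grouped j expression returns a two-column data.table per group, and data.table binds the pieces together with the Group column prepended, giving the same output as above.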
You can set simplify = FALSE in combn() and then use unnest_wider() from tidyr.
library(dplyr)
library(tidyr)
test %>%
  group_by(Group) %>%
  summarise(Val = combn(Val, 2, simplify = FALSE)) %>%
  unnest_wider(Val, names_sep = "_")
# Group Val_1 Val_2
# <chr> <dbl> <dbl>
# 1 A 1 3
# 2 A 1 6
# 3 A 1 8
# 4 A 3 6
# 5 A 3 8
# 6 A 6 8
# 7 B 2 4
# 8 B 2 5
# 9 B 2 7
# 10 B 4 5
# 11 B 4 7
# 12 B 5 7
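A small variant of the same pipeline, assuming dplyr >= 1.1.0 (only relevant if summarise() warns about returning more than one row per group): reframe() with its .by argument replaces group_by()/summarise() and gives the same 12-row result.
test %>%
  reframe(Val = combn(Val, 2, simplify = FALSE), .by = Group) %>%
  unnest_wider(Val, names_sep = "_")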
library(tidyverse)
# split Val by Group, generate all ascending pairs per group, then bind back together
test2 <- split(test$Val, test$Group) %>%
  map(~ gtools::combinations(n = length(.x), r = 2, v = .x)) %>%
  map(~ as_tibble(.x, .name_repair = "unique")) %>%
  bind_rows(.id = "Group")
I have a dataframe in the following format
> x <- data.frame("a" = c(1,1),"b" = c(2,2),"c" = c(3,4))
> x
a b c
1 1 2 3
2 1 2 4
I'd like to add 3 new columns which are a cumulative product of the columns a, b, c; however, I need a reverse cumulative product, i.e. the output for row 1 should be:
result_d = 1*2*3 = 6, result_e = 2*3 = 6, result_f = 3
and similarly for row 2.
The end result will be
a b c result_d result_e result_f
1 1 2 3 6 6 3
2 1 2 4 8 8 4
The column names do not matter; this is just an example. Does anyone have any idea how to do this?
As per my comment, is it possible to do this on a subset of columns, e.g. only for columns b and c, to return:
a b c results_e results_f
1 1 2 3 6 3
2 1 2 4 8 4
so that column "a" is effectively ignored?
One option is to loop through the rows, apply cumprod over the reversed elements, and then reverse the result:
nm1 <- paste0("result_", c("d", "e", "f"))
x[nm1] <- t(apply(x, 1,
                  function(x) rev(cumprod(rev(x)))))
x
# a b c result_d result_e result_f
#1 1 2 3 6 6 3
#2 1 2 4 8 8 4
Or a vectorized option is rowCumprods
library(matrixStats)
x[nm1] <- rowCumprods(as.matrix(x[ncol(x):1]))[,ncol(x):1]
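Regarding the follow-up about a subset of columns, here is a minimal sketch along the same rowCumprods() lines, recreating the example data and ignoring column a (the result_ names are just illustrative):
library(matrixStats)
x <- data.frame(a = c(1, 1), b = c(2, 2), c = c(3, 4))  # the original example again
cols <- c("b", "c")                                     # columns to include; "a" is ignored
x[paste0("result_", c("e", "f"))] <-
  rowCumprods(as.matrix(x[rev(cols)]))[, length(cols):1]
x
# a b c result_e result_f
#1 1 2 3 6 3
#2 1 2 4 8 4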
# Reduce("*", ..., accumulate = TRUE) builds the running product over the reversed
# columns; reversing the accumulated result again gives the right-to-left cumulative product
temp <- data.frame(Reduce("*", x[NCOL(x):1], accumulate = TRUE))
setNames(cbind(x, temp[NCOL(temp):1]),
         c(names(x), c("res_d", "res_e", "res_f")))
# a b c res_d res_e res_f
#1 1 2 3 6 6 3
#2 1 2 4 8 8 4
I have a list of events and sequences. I would like to print the sequences in a separate table if event = x is included somewhere in the sequence. See table below:
Event Sequence
1 a 1
2 a 1
3 x 1
4 a 2
5 a 2
6 a 3
7 a 3
8 x 3
9 a 4
10 a 4
In this case I would like a new table that includes only the sequences where Event=x was included:
Event Sequence
1 a 1
2 a 1
3 x 1
4 a 3
5 a 3
6 x 3
Base R solution:
d[d$Sequence %in% d$Sequence[d$Event == "x"], ]
Event Sequence
1: a 1
2: a 1
3: x 1
4: a 3
5: a 3
6: x 3
data.table solution:
library(data.table)
setDT(d)[Sequence %in% Sequence[Event == "x"]]
As you can see, the syntax/logic is quite similar between these two solutions:
Find events that are equal to "x"
Extract their Sequence
Subset the table according to the extracted Sequence values
We can use dplyr to group the data and keep only the sequences that contain any "x".
library(dplyr)
df2 <- df %>%
  group_by(Sequence) %>%
  filter(any(Event %in% "x")) %>%
  ungroup()
df2
# A tibble: 6 x 2
Event Sequence
<chr> <int>
1 a 1
2 a 1
3 x 1
4 a 3
5 a 3
6 x 3
DATA
df <- read.table(text = " Event Sequence
1 a 1
2 a 1
3 x 1
4 a 2
5 a 2
6 a 3
7 a 3
8 x 3
9 a 4
10 a 4",
header = TRUE, stringsAsFactors = FALSE)
I am dealing with a dataset that is in wide format, as in
> data=read.csv("http://www.kuleuven.be/bio/ento/temp/data.csv")
> data
factor1 factor2 count_1 count_2 count_3
1 a a 1 2 0
2 a b 3 0 0
3 b a 1 2 3
4 b b 2 2 0
5 c a 3 4 0
6 c b 1 1 0
where factor1 and factor2 are different factors which I would like to take along (in fact I have more than 2, but that shouldn't matter), and count_1 to count_3 are counts of aggressive interactions on an ordinal scale (3>2>1). I would now like to convert this dataset to long format, to get something like
factor1 factor2 aggression
1 a a 1
2 a a 2
3 a a 2
4 a b 1
5 a b 1
6 a b 1
7 b a 1
8 b a 2
9 b a 2
10 b a 3
11 b a 3
12 b a 3
13 b b 1
14 b b 1
15 b b 2
16 b b 2
17 c a 1
18 c a 1
19 c a 1
20 c a 2
21 c a 2
22 c a 2
23 c a 2
24 c b 1
25 c b 2
Would anyone happen to know how to do this without using for loops, e.g. using package reshape2? (I realize it should work using melt, but I just haven't been able to figure out the right syntax yet.)
Edit: For those of you who happen to need this kind of functionality, here is Ananda's answer below wrapped into a little function:
widetolong.ordinal <- function(data, factors, responses, responsename) {
  library(reshape2)
  data$ID <- 1:nrow(data)                              # add an ID to preserve row order
  dL <- melt(data, id.vars = c("ID", factors))         # `melt` the data
  dL <- dL[order(dL$ID), ]                             # sort the molten data
  dL[, responsename] <- match(dL$variable, responses)  # convert responses to ordinal scores
  dL[, responsename] <- factor(dL[, responsename], ordered = TRUE)
  dL <- dL[dL$value != 0, ]                            # drop rows where `value == 0`
  out <- dL[rep(rownames(dL), dL$value), c(factors, responsename)]  # use `rep` to "expand" the data.frame & drop unwanted columns
  rownames(out) <- NULL
  return(out)
}
# example
data <- read.csv("http://www.kuleuven.be/bio/ento/temp/data.csv")
widetolong.ordinal(data, c("factor1", "factor2"), c("count_1", "count_2", "count_3"), "aggression")
melt from "reshape2" will only get you part of the way through this problem. To go the rest of the way, you just need to use rep from base R:
data <- read.csv("http://www.kuleuven.be/bio/ento/temp/data.csv")
library(reshape2)
## Add an ID if the row order is important to you
data$ID <- 1:nrow(data)
## `melt` the data
dL <- melt(data, id.vars=c("ID", "factor1", "factor2"))
## Sort the molten data, if necessary
dL <- dL[order(dL$ID), ]
## Extract the numeric portion of the "variable" variable
dL$aggression <- gsub("count_", "", dL$variable)
## Drop rows where `value == 0`
dL <- dL[dL$value != 0, ]
## Use `rep` to "expand" your `data.frame`.
## Drop any unwanted columns at this point.
out <- dL[rep(rownames(dL), dL$value), c("factor1", "factor2", "aggression")]
This is what the output finally looks like. If you want to remove the funny row names, just use rownames(out) <- NULL.
out
# factor1 factor2 aggression
# 1 a a 1
# 7 a a 2
# 7.1 a a 2
# 2 a b 1
# 2.1 a b 1
# 2.2 a b 1
# 3 b a 1
# 9 b a 2
# 9.1 b a 2
# 15 b a 3
# 15.1 b a 3
# 15.2 b a 3
# 4 b b 1
# 4.1 b b 1
# 10 b b 2
# 10.1 b b 2
# 5 c a 1
# 5.1 c a 1
# 5.2 c a 1
# 11 c a 2
# 11.1 c a 2
# 11.2 c a 2
# 11.3 c a 2
# 6 c b 1
# 12 c b 2
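For completeness, a minimal sketch of the same melt-then-rep idea with current tidyr (assuming tidyr >= 1.0.0 and dplyr are available): pivot_longer() plays the role of melt() and uncount() plays the role of the rep() expansion.
library(tidyr)
library(dplyr)
data %>%
  pivot_longer(starts_with("count_"),
               names_prefix = "count_",
               names_to = "aggression",
               values_to = "n") %>%
  filter(n > 0) %>%
  uncount(n) %>%
  select(factor1, factor2, aggression)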
Hi, I have a table with comma-delimited columns and I need to convert the comma-delimited values to new rows. For example, the given table is:
Name Start End
A 1,2,3 4,5,6
B 1,2 4,5
C 1,2,3,4 6,7,8,9
I need to convert it like
Name Start End
A 1 4
A 2 5
A 3 6
B 1 4
B 2 5
C 1 6
C 2 7
C 3 8
C 4 9
I can do that using a VB script, but I need to solve it using R.
Can anyone solve this?
You might have asked this question on SO, as there is no statistics issue involved here :)
Anyway, I made up a rather complicated and ugly solution which might work for you:
# load your data
x <- structure(list(Name = c("A", "B", "C"), Start = c("1,2,3", "1,2",
"1,2,3,4"), End = c("4,5,6", "4,5", "6,7,8,9")), .Names = c("Name",
"Start", "End"), row.names = c(NA, -3L), class = "data.frame")
Which looks like this in R:
> x
Name Start End
1 A 1,2,3 4,5,6
2 B 1,2 4,5
3 C 1,2,3,4 6,7,8,9
Data transformation with the help of strsplit calls:
data <- data.frame(cbind(
  rep(x$Name, as.numeric(lapply(strsplit(x$Start, ","), length))),
  unlist(lapply(strsplit(x$Start, ","), cbind)),
  unlist(lapply(strsplit(x$End, ","), cbind))
))
Naming the new data frame:
names(data) <- c("Name", "Start", "End")
Which looks like:
> data
Name Start End
1 A 1 4
2 A 2 5
3 A 3 6
4 B 1 4
5 B 2 5
6 C 1 6
7 C 2 7
8 C 3 8
9 C 4 9
Here's an approach that should work for you. I'm assuming that your three input vectors are in different objects. We are going to create a list of those inputs and write a function that processes each object and returns it in the form of a data.frame, with the help of plyr.
The things to take note of here are the splitting of the character vector into its component parts, and then using as.numeric to convert the numbers from character form after the split. Since R fills matrices by column, we define a 2-column matrix and let R fill the values for us. We then retrieve the Name column and put it all together in a data.frame. plyr is nice enough to process the list and convert it into a data.frame for us automatically.
library(plyr)
a <- paste("A",1, 2,3,4,5,6, sep = ",", collapse = "")
b <- paste("B",1, 2,4,5, sep = ",", collapse = "")
c <- paste("C",1, 2,3,4,6,7,8,9, sep = ",", collapse = "")
input <- list(a,b,c)
splitter <- function(x) {
  x <- unlist(strsplit(x, ","))
  out <- data.frame(x[1], matrix(as.numeric(x[-1]), ncol = 2))
  colnames(out) <- c("Name", "Start", "End")
  return(out)
}
ldply(input, splitter)
And the output:
> ldply(input, splitter)
Name Start End
1 A 1 4
2 A 2 5
3 A 3 6
4 B 1 4
5 B 2 5
6 C 1 6
7 C 2 7
8 C 3 8
9 C 4 9
The separate_rows() function in tidyr is the boss for observations with multiple delimited values...
# create data
library(tidyverse)
d <- tibble(
  Name = c("A", "B", "C"),
  Start = c("1,2,3", "1,2", "1,2,3,4"),
  End = c("4,5,6", "4,5", "6,7,8,9")
)
d
# # A tibble: 3 x 3
# Name Start End
# <chr> <chr> <chr>
# 1 A 1,2,3 4,5,6
# 2 B 1,2 4,5
# 3 C 1,2,3,4 6,7,8,9
# tidy data
separate_rows(d, Start, End)
# # A tibble: 9 x 3
# Name Start End
# <chr> <chr> <chr>
# 1 A 1 4
# 2 A 2 5
# 3 A 3 6
# 4 B 1 4
# 5 B 2 5
# 6 C 1 6
# 7 C 2 7
# 8 C 3 8
# 9 C 4 9
# use convert = TRUE to get integer column types
separate_rows(d, Start, End, convert = TRUE)
# # A tibble: 9 x 3
# Name Start End
# <chr> <int> <int>
# 1 A 1 4
# 2 A 2 5
# 3 A 3 6
# 4 B 1 4
# 5 B 2 5
# 6 C 1 6
# 7 C 2 7
# 8 C 3 8
# 9 C 4 9
Here's another, just for fun. Take d as the original data.
f <- function(x, ul = TRUE)
{
  x <- deparse(substitute(x))
  if(ul) unlist(strsplit(d[[x]], ','))
  else strsplit(d[[x]], ',')
}
> data.frame(Name = rep(d$Name, sapply(f(End, F), length)),
             Start = f(Start), End = f(End))
# Name Start End
# 1 A 1 4
# 2 A 2 5
# 3 A 3 6
# 4 B 1 4
# 5 B 2 5
# 6 C 1 6
# 7 C 2 7
# 8 C 3 8
# 9 C 4 9