reshaping data with time represented as spells - r

I have a dataset in which time is represented as spells (i.e. from time 1 to time 2), like this:
d <- data.frame(id = c("A","A","B","B","C","C"),
t1 = c(1,3,1,3,1,3),
t2 = c(2,4,2,4,2,4),
value = 1:6)
I want to reshape this into a panel dataset, i.e. one row for each unit and time period, like this:
result <- data.frame(id = c("A","A","A","A","B","B","B","B","C","C","C","C"),
t= c(1:4,1:4,1:4),
value = c(1,1,2,2,3,3,4,4,5,5,6,6))
I am attempting to do this with tidyr and gather but not getting the desired result. I am trying something like this which is clearly wrong:
gather(d, 't1', 't2', key=t)
In the actual dataset the spells are irregular.

You were almost there.
Code
library(dplyr)
library(tidyr)
d %>%
# Gather the needed variables. Explanation:
# t_type: What should we call the column that will hold the former
# variable names (t1, t2)?
# t: What should we call the column that will hold the
# values of those variables?
# -id,
# -value: Which columns should stay as they are and NOT be gathered
# under t_type (key) and t (value)?
#
gather(t_type, t, -id, -value) %>%
# Select the right columns in the right order.
# Watch out: We did not select t_type, so it gets dropped.
select(id, t, value) %>%
# Arrange / sort the data by the following columns.
# For a descending order put a "-" in front of the column name.
arrange(id, t)
Result
id t value
1 A 1 1
2 A 2 1
3 A 3 2
4 A 4 2
5 B 1 3
6 B 2 3
7 B 3 4
8 B 4 4
9 C 1 5
10 C 2 5
11 C 3 6
12 C 4 6

So, the goal is to melt t1 and t2 columns and to drop the key column that will appear as a result. There are a couple of options. Base R's reshape seems to be tedious. We may, however, use melt:
library(reshape2)
melt(d, measure.vars = c("t1", "t2"), value.name = "t")[-3]
# id value t
# 1 A 1 1
# 2 A 2 3
# 3 B 3 1
# 4 B 4 3
# 5 C 5 1
# 6 C 6 3
# 7 A 1 2
# 8 A 2 4
# 9 B 3 2
# 10 B 4 4
# 11 C 5 2
# 12 C 6 4
where [-3] drops the key column (the third column of the melted result). We can also use gather, as in
gather(d, "key", "t", t1, t2)[-3]
# id value t
# 1 A 1 1
# 2 A 2 3
# 3 B 3 1
# 4 B 4 3
# 5 C 5 1
# 6 C 6 3
# 7 A 1 2
# 8 A 2 4
# 9 B 3 2
# 10 B 4 4
# 11 C 5 2
# 12 C 6 4
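Note that gathering t1 and t2 reproduces the desired panel here only because every spell covers exactly two consecutive periods. For irregular spells, as mentioned in the question, one possible approach is to expand each row into the full sequence from t1 to t2. The following is a minimal base R sketch, not a definitive implementation: the helper name expand_spells and the irregular example data are made up for illustration, and it assumes integer-valued periods with t1 <= t2.

```r
# Hypothetical helper: one output row per unit and period covered by a spell.
expand_spells <- function(d) {
  # Number of periods in each spell, counting both endpoints
  n <- d$t2 - d$t1 + 1
  # Repeat each spell row once per period it covers
  out <- d[rep(seq_len(nrow(d)), times = n), c("id", "value")]
  # Fill in the period index: t1, t1 + 1, ..., t2 for every spell
  out$t <- unlist(Map(seq, d$t1, d$t2))
  rownames(out) <- NULL
  out[, c("id", "t", "value")]
}

# Made-up data with irregular spell lengths (assumption, not from the question)
d_irregular <- data.frame(id = c("A", "A", "B"),
                          t1 = c(1, 3, 1),
                          t2 = c(2, 6, 4),
                          value = 1:3)
panel <- expand_spells(d_irregular)
```

Applied to the d from the question, the same helper should give the 12-row result shown there, since each of its spells spans two periods.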

Related

Combining elements of one column into two columns by group in R

Given a two-column data.frame in which one column contains group labels and the other contains integer values ordered from smallest to largest, how can the data be expanded to create pairs of combinations of the integer column?
I'm not sure of the best way to state this. I'm not interested in all possible combinations, only in the unique combinations starting from the lowest value.
In R, the combn function gives the desired output when groups are not considered, for example:
t(combn(seq(1:4),2))
[,1] [,2]
[1,] 1 2
[2,] 1 3
[3,] 1 4
[4,] 2 3
[5,] 2 4
[6,] 3 4
Since the first value is 1, we get the unique combination (1,2) and not the additional combination (2,1), which I don't need. How would one then apply a similar method by group?
For example, given a data.frame:
test <- data.frame(Group = rep(c("A","B"),each=4),
Val = c(1,3,6,8,2,4,5,7))
test
Group Val
1 A 1
2 A 3
3 A 6
4 A 8
5 B 2
6 B 4
7 B 5
8 B 7
I was able to come up with this solution that gives the desired output:
library(dplyr) # filter() below comes from dplyr
test <- data.frame(Group = rep(c("A","B"),each=4),
Val = c(1,3,6,8,2,4,5,7))
j=1
for(i in unique(test$Group)){
if(j==1){
one <- filter(test,i == Group)
two <- data.frame(t(combn(one$Val,2)))
test1 <- data.frame(Group = i,Val1=two$X1,Val2=two$X2)
j=j+1
}else{
one <- filter(test,i == Group)
two <- data.frame(t(combn(one$Val,2)))
test2 <- data.frame(Group = i,Val1=two$X1,Val2=two$X2)
test1 <- rbind(test1,test2)
}
}
test1
Group Val1 Val2
1 A 1 3
2 A 1 6
3 A 1 8
4 A 3 6
5 A 3 8
6 A 6 8
7 B 2 4
8 B 2 5
9 B 2 7
10 B 4 5
11 B 4 7
12 B 5 7
However, this is not elegant and is really slow as the number of groups and length of each group become large. It seems like there should be a more elegant and efficient solution but so far I have not come across anything on SO.
I would appreciate any ideas!
Here is a data.table approach:
library( data.table )
#make test a data.table
setDT(test)
#split by group
L <- split( test, by = "Group")
#get unique combinations of 2 Vals
L2 <- lapply( L, function(x) {
as.data.table( t( combn( x$Val, m = 2, simplify = TRUE ) ) )
})
#merge them back together
data.table::rbindlist( L2, idcol = "Group" )
# Group V1 V2
# 1: A 1 3
# 2: A 1 6
# 3: A 1 8
# 4: A 3 6
# 5: A 3 8
# 6: A 6 8
# 7: B 2 4
# 8: B 2 5
# 9: B 2 7
#10: B 4 5
#11: B 4 7
#12: B 5 7
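You can set simplify = FALSE in combn() and then use unnest_wider() from tidyr. Note that in dplyr 1.1.0 and later, summarise() warns when a group returns more than one row; reframe() is the replacement for that use.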
library(dplyr)
library(tidyr)
test %>%
group_by(Group) %>%
summarise(Val = combn(Val, 2, simplify = F)) %>%
unnest_wider(Val, names_sep = "_")
# Group Val_1 Val_2
# <chr> <dbl> <dbl>
# 1 A 1 3
# 2 A 1 6
# 3 A 1 8
# 4 A 3 6
# 5 A 3 8
# 6 A 6 8
# 7 B 2 4
# 8 B 2 5
# 9 B 2 7
# 10 B 4 5
# 11 B 4 7
# 12 B 5 7
Another option uses gtools::combinations() together with purrr:
library(tidyverse)
df2 <- split(test$Val, test$Group) %>%
map(~gtools::combinations(n = length(.x), r = 2, v = .x)) %>%
map(~as_tibble(.x, .name_repair = "unique")) %>%
bind_rows(.id = "Group")
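For comparison, the same split, combine, and bind idea can be sketched in base R without extra packages. This is only an illustration: the helper name pairs_by_group and the output column names Val1 and Val2 are assumptions, and the sketch assumes every group has at least two values.

```r
test <- data.frame(Group = rep(c("A", "B"), each = 4),
                   Val = c(1, 3, 6, 8, 2, 4, 5, 7))

# Hypothetical helper: all within-group pairs, in the order combn() emits them.
# Assumes every group has at least two values; combn() treats a single
# positive number n as seq_len(n), which would give misleading pairs.
pairs_by_group <- function(df) {
  vals <- split(df$Val, df$Group)
  pieces <- Map(function(g, v) {
    m <- t(combn(v, 2))  # one row per pair
    data.frame(Group = g, Val1 = m[, 1], Val2 = m[, 2])
  }, names(vals), vals)
  out <- do.call(rbind, pieces)
  rownames(out) <- NULL
  out
}

pair_df <- pairs_by_group(test)
```

Building one small data.frame per group and binding once at the end avoids the repeated rbind inside a loop that made the original attempt slow.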

cumulative product in R across column

I have a dataframe in the following format
> x <- data.frame("a" = c(1,1),"b" = c(2,2),"c" = c(3,4))
> x
a b c
1 1 2 3
2 1 2 4
I'd like to add 3 new columns that hold cumulative products of the columns a, b and c. However, I need a reverse cumulative product, i.e. the output should be
row 1:
result_d = 1*2*3 = 6 , result_e = 2*3 = 6, result_f = 3
and similarly for row 2
The end result will be
a b c result_d result_e result_f
1 1 2 3 6 6 3
2 1 2 4 8 8 4
The column names do not matter; this is just an example. Does anyone have any idea how to do this?
As per my comment, is it possible to do this on a subset of columns, e.g. only for columns b and c, to return:
a b c results_e results_f
1 1 2 3 6 3
2 1 2 4 8 4
so that column "a" is effectively ignored?
One option is to loop through the rows and apply cumprod over the reverse of elements and then do the reverse
nm1 <- paste0("result_", c("d", "e", "f"))
x[nm1] <- t(apply(x, 1,
function(x) rev(cumprod(rev(x)))))
x
# a b c result_d result_e result_f
#1 1 2 3 6 6 3
#2 1 2 4 8 8 4
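Or a vectorized option is rowCumprods from matrixStats (run this on the original x, before the result columns are added):
library(matrixStats)
x[nm1] <- rowCumprods(as.matrix(x[ncol(x):1]))[,ncol(x):1]
Another base R option uses Reduce with accumulate = TRUE, again starting from the original x:
temp = data.frame(Reduce("*", x[NCOL(x):1], accumulate = TRUE))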
setNames(cbind(x, temp[NCOL(temp):1]),
c(names(x), c("res_d", "res_e", "res_f")))
# a b c res_d res_e res_f
#1 1 2 3 6 6 3
#2 1 2 4 8 8 4
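The follow-up in the question asks whether this can be limited to a subset of columns, such as b and c only. One way to sketch that is to select the columns before applying the same reverse cumprod. The result column names below are placeholders, and the sketch assumes at least two columns are selected.

```r
x <- data.frame(a = c(1, 1), b = c(2, 2), c = c(3, 4))

# Columns to include; assumed to be at least two, since apply() returns
# a plain vector rather than a matrix when each row yields one value
cols <- c("b", "c")

# Reverse cumulative product within each row, restricted to `cols`
res <- t(apply(x[cols], 1, function(r) rev(cumprod(rev(r)))))
colnames(res) <- paste0("result_", cols)  # placeholder column names

# Column a is carried along unchanged and does not enter the products
x_out <- cbind(x, res)
```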

Subset data frame that include a variable

I have a list of events and sequences. I would like to print the sequences in a separate table if event = x is included somewhere in the sequence. See table below:
Event Sequence
1 a 1
2 a 1
3 x 1
4 a 2
5 a 2
6 a 3
7 a 3
8 x 3
9 a 4
10 a 4
In this case I would like a new table that includes only the sequences where Event=x was included:
Event Sequence
1 a 1
2 a 1
3 x 1
4 a 3
5 a 3
6 x 3
Base R solution:
d[d$Sequence %in% d$Sequence[d$Event == "x"], ]
Event Sequence
1: a 1
2: a 1
3: x 1
4: a 3
5: a 3
6: x 3
data.table solution:
library(data.table)
setDT(d)[Sequence %in% Sequence[Event == "x"]]
As you can see, the syntax and logic are quite similar between these two solutions:
1. Find the events that are equal to x
2. Extract their Sequence
3. Subset the table to those Sequence values
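The three steps can also be written out one at a time, which may make the one-liners above easier to follow. This is a sketch in base R; the data frame is rebuilt from the table in the question, and the intermediate names is_x and seq_with_x are chosen here for illustration.

```r
# Data rebuilt from the table in the question
d <- data.frame(Event = c("a", "a", "x", "a", "a", "a", "a", "x", "a", "a"),
                Sequence = c(1, 1, 1, 2, 2, 3, 3, 3, 4, 4))

# Step 1: find the events that are equal to "x"
is_x <- d$Event == "x"

# Step 2: extract the Sequence values of those events
seq_with_x <- unique(d$Sequence[is_x])

# Step 3: keep every row whose Sequence is among them
out <- d[d$Sequence %in% seq_with_x, ]
```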
We can use dplyr to group the data and filter the sequence with any "x" in it.
library(dplyr)
df2 <- df %>%
group_by(Sequence) %>%
filter(any(Event %in% "x")) %>%
ungroup()
df2
# A tibble: 6 x 2
Event Sequence
<chr> <int>
1 a 1
2 a 1
3 x 1
4 a 3
5 a 3
6 x 3
DATA
df <- read.table(text = " Event Sequence
1 a 1
2 a 1
3 x 1
4 a 2
5 a 2
6 a 3
7 a 3
8 x 3
9 a 4
10 a 4",
header = TRUE, stringsAsFactors = FALSE)

Convert datafile from wide to long format to fit ordinal mixed model in R

I am dealing with a dataset that is in wide format, as in
> data=read.csv("http://www.kuleuven.be/bio/ento/temp/data.csv")
> data
factor1 factor2 count_1 count_2 count_3
1 a a 1 2 0
2 a b 3 0 0
3 b a 1 2 3
4 b b 2 2 0
5 c a 3 4 0
6 c b 1 1 0
where factor1 and factor2 are different factors which I would like to take along (in fact I have more than 2, but that shouldn't matter), and count_1 to count_3 are counts of aggressive interactions on an ordinal scale (3>2>1). I would now like to convert this dataset to long format, to get something like
factor1 factor2 aggression
1 a a 1
2 a a 2
3 a a 2
4 a b 1
5 a b 1
6 a b 1
7 b a 1
8 b a 2
9 b a 2
10 b a 3
11 b a 3
12 b a 3
13 b b 1
14 b b 1
15 b b 2
16 b b 2
17 c a 1
18 c a 1
19 c a 1
20 c a 2
21 c a 2
22 c a 2
23 c a 2
24 c b 1
25 c b 2
Would anyone happen to know how to do this without using for...to loops, e.g. using package reshape2? (I realize it should work using melt, but I just haven't been able to figure out the right syntax yet)
Edit: For those of you that would also happen to need this kind of functionality, here is Ananda's answer below wrapped into a little function:
widetolong.ordinal<-function(data,factors,responses,responsename) {
library(reshape2)
data$ID=1:nrow(data) # add an ID to preserve row order
dL=melt(data, id.vars=c("ID", factors)) # `melt` the data
dL=dL[order(dL$ID), ] # sort the molten data
dL[,responsename]=match(dL$variable,responses) # convert responses to ordinal scores
dL[,responsename]=factor(dL[,responsename],ordered=T)
dL=dL[dL$value != 0, ] # drop rows where `value == 0`
out=dL[rep(rownames(dL), dL$value), c(factors, responsename)] # use `rep` to "expand" `data.frame` & drop unwanted columns
rownames(out) <- NULL
return(out)
}
# example
data <- read.csv("http://www.kuleuven.be/bio/ento/temp/data.csv")
widetolong.ordinal(data,c("factor1","factor2"),c("count_1","count_2","count_3"),"aggression")
melt from "reshape2" will only get you part of the way through this problem. To go the rest of the way, you just need to use rep from base R:
data <- read.csv("http://www.kuleuven.be/bio/ento/temp/data.csv")
library(reshape2)
## Add an ID if the row order is important to you
data$ID <- 1:nrow(data)
## `melt` the data
dL <- melt(data, id.vars=c("ID", "factor1", "factor2"))
## Sort the molten data, if necessary
dL <- dL[order(dL$ID), ]
## Extract the numeric portion of the "variable" variable
dL$aggression <- gsub("count_", "", dL$variable)
## Drop rows where `value == 0`
dL <- dL[dL$value != 0, ]
## Use `rep` to "expand" your `data.frame`.
## Drop any unwanted columns at this point.
out <- dL[rep(rownames(dL), dL$value), c("factor1", "factor2", "aggression")]
This is what the output finally looks like. If you want to remove the funny row names, just use rownames(out) <- NULL.
out
# factor1 factor2 aggression
# 1 a a 1
# 7 a a 2
# 7.1 a a 2
# 2 a b 1
# 2.1 a b 1
# 2.2 a b 1
# 3 b a 1
# 9 b a 2
# 9.1 b a 2
# 15 b a 3
# 15.1 b a 3
# 15.2 b a 3
# 4 b b 1
# 4.1 b b 1
# 10 b b 2
# 10.1 b b 2
# 5 c a 1
# 5.1 c a 1
# 5.2 c a 1
# 11 c a 2
# 11.1 c a 2
# 11.2 c a 2
# 11.3 c a 2
# 6 c b 1
# 12 c b 2
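The same expansion can also be sketched without reshape2, by repeating each ordinal score as many times as its count says. The version below is only an illustration in base R: the data frame is typed in from the table printed in the question, so it does not depend on the URL, and the intermediate names (counts, row_idx, scores) are chosen here.

```r
# Data typed in from the table shown in the question
data <- data.frame(factor1 = c("a", "a", "b", "b", "c", "c"),
                   factor2 = c("a", "b", "a", "b", "a", "b"),
                   count_1 = c(1, 3, 1, 2, 3, 1),
                   count_2 = c(2, 0, 2, 2, 4, 1),
                   count_3 = c(0, 0, 3, 0, 0, 0))

counts <- as.matrix(data[, c("count_1", "count_2", "count_3")])

# Each input row is repeated once per observed interaction
row_idx <- rep(seq_len(nrow(data)), times = rowSums(counts))

# Within a row, score k (1, 2 or 3) is repeated counts[i, k] times
scores <- unlist(lapply(seq_len(nrow(data)), function(i) {
  rep(seq_len(ncol(counts)), times = counts[i, ])
}))

out <- data.frame(data[row_idx, c("factor1", "factor2")],
                  aggression = factor(scores, ordered = TRUE))
rownames(out) <- NULL
```

Here the score is taken from the position of the count column, so the columns must be passed in order of increasing aggression.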

Separate Comma Delimited Cells To New Rows

Hi, I have a table with comma-delimited columns and I need to convert the comma-delimited values to new rows. For example, the given table is
Name Start End
A 1,2,3 4,5,6
B 1,2 4,5
C 1,2,3,4 6,7,8,9
I need to convert it like
Name Start End
A 1 4
A 2 5
A 3 6
B 1 4
B 2 5
C 1 6
C 2 7
C 3 8
C 4 9
I can do that using VB script but I need to solve it using R
Can anyone solve this?
You might have asked this question on SO as there is no issue dealing with statistics :)
Anyway, I made up a quite complicated and ugly solution which might work for you:
# load your data
x <- structure(list(Name = c("A", "B", "C"), Start = c("1,2,3", "1,2",
"1,2,3,4"), End = c("4,5,6", "4,5", "6,7,8,9")), .Names = c("Name",
"Start", "End"), row.names = c(NA, -3L), class = "data.frame")
In R, x looks like this:
> x
Name Start End
1 A 1,2,3 4,5,6
2 B 1,2 4,5
3 C 1,2,3,4 6,7,8,9
Data transformation with the help of strsplit calls:
data <- data.frame(cbind(
rep(x$Name,as.numeric(lapply(strsplit(x$Start,","), length))),
unlist(lapply(strsplit(x$Start,","), cbind)),
unlist(lapply(strsplit(x$End,","), cbind))
))
Naming the new data frame:
names(data) <- c("Name", "Start", "End")
Which looks like:
> data
Name Start End
1 A 1 4
2 A 2 5
3 A 3 6
4 B 1 4
5 B 2 5
6 C 1 6
7 C 2 7
8 C 3 8
9 C 4 9
Here's an approach that should work for you. I'm assuming that your three input vectors are in different objects. We are going to create a list of those inputs and write a function that processes each object and returns it in the form of a data.frame with plyr.
The things to take note of here are the splitting of the character vector into its component parts, then using as.numeric to convert the numbers from the character form they were in when they were split. Since R fills matrices by column, we define a 2-column matrix and let R fill the values for us. We then retrieve the Name column and put it all together in a data.frame. plyr is nice enough to process the list and convert it into a data.frame for us automatically.
library(plyr)
a <- paste("A",1, 2,3,4,5,6, sep = ",", collapse = "")
b <- paste("B",1, 2,4,5, sep = ",", collapse = "")
c <- paste("C",1, 2,3,4,6,7,8,9, sep = ",", collapse = "")
input <- list(a,b,c)
splitter <- function(x) {
x <- unlist(strsplit(x, ","))
out <- data.frame(x[1], matrix(as.numeric(x[-1]), ncol = 2))
colnames(out) <- c("Name", "Start", "End")
return(out)
}
ldply(input, splitter)
And the output:
> ldply(input, splitter)
Name Start End
1 A 1 4
2 A 2 5
3 A 3 6
4 B 1 4
5 B 2 5
6 C 1 6
7 C 2 7
8 C 3 8
9 C 4 9
The separate_rows() function in tidyr is the boss for observations with multiple delimited values...
# create data
library(tidyverse)
d <- tibble( # data_frame() is deprecated in favour of tibble()
Name = c("A", "B", "C"),
Start = c("1,2,3", "1,2", "1,2,3,4"),
End = c("4,5,6", "4,5", "6,7,8,9")
)
d
# # A tibble: 3 x 3
# Name Start End
# <chr> <chr> <chr>
# 1 A 1,2,3 4,5,6
# 2 B 1,2 4,5
# 3 C 1,2,3,4 6,7,8,9
# tidy data
separate_rows(d, Start, End)
# # A tibble: 9 x 3
# Name Start End
# <chr> <chr> <chr>
# 1 A 1 4
# 2 A 2 5
# 3 A 3 6
# 4 B 1 4
# 5 B 2 5
# 6 C 1 6
# 7 C 2 7
# 8 C 3 8
# 9 C 4 9
# use convert set to TRUE for integer column modes
separate_rows(d, Start, End, convert = TRUE)
# # A tibble: 9 x 3
# Name Start End
# <chr> <int> <int>
# 1 A 1 4
# 2 A 2 5
# 3 A 3 6
# 4 B 1 4
# 5 B 2 5
# 6 C 1 6
# 7 C 2 7
# 8 C 3 8
# 9 C 4 9
Here's another, just for fun. Take d as the original data.
f <- function(x, ul = TRUE)
{
x <- deparse(substitute(x))
if(ul) unlist(strsplit(d[[x]], ','))
else strsplit(d[[x]], ',')
}
data.frame(Name = rep(d$Name, sapply(f(End, FALSE), length)),
Start = f(Start), End = f(End))
# Name Start End
# 1 A 1 4
# 2 A 2 5
# 3 A 3 6
# 4 B 1 4
# 5 B 2 5
# 6 C 1 6
# 7 C 2 7
# 8 C 3 8
# 9 C 4 9
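All of the approaches above rely on Start and End holding the same number of items in every row. A compact base R sketch that makes this assumption explicit could look as follows; the check and the conversion to integer are additions made here for illustration, not part of the answers above.

```r
x <- data.frame(Name = c("A", "B", "C"),
                Start = c("1,2,3", "1,2", "1,2,3,4"),
                End = c("4,5,6", "4,5", "6,7,8,9"),
                stringsAsFactors = FALSE)

# Split each cell on commas; the result is a list with one vector per row
start <- strsplit(x$Start, ",", fixed = TRUE)
end <- strsplit(x$End, ",", fixed = TRUE)

# Assumption made explicit: both columns have the same item count per row
stopifnot(identical(lengths(start), lengths(end)))

# Repeat each Name once per item and flatten the split values
long <- data.frame(Name = rep(x$Name, lengths(start)),
                   Start = as.integer(unlist(start)),
                   End = as.integer(unlist(end)))
```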
