I have a data.frame which consists of linear intervals for each id:
df <- data.frame(id = c(rep("a", 3), rep("b", 4), rep("d", 4)),
                 start = c(3, 4, 10, 5, 6, 9, 12, 8, 12, 15, 27),
                 end = c(7, 8, 12, 8, 9, 13, 13, 10, 15, 26, 30))
I'm looking for an efficient function that will unite all intersecting intervals per id. For df the result will be:
res.df <- data.frame(id = c("a", "a", "b", "d", "d", "d"),
                     start = c(3, 10, 5, 8, 12, 27),
                     end = c(8, 12, 13, 10, 26, 30))
Eventually I'll sum up all the united intervals per id to get their combined lengths:
sapply(unique(res.df$id), function(x) sum(res.df$end[which(res.df$id == x)]-res.df$start[which(res.df$id == x)]+1))
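(For res.df above this gives a combined length of 9 for a, 9 for b, and 22 for d; the +1 assumes inclusive integer endpoints.)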
# source("https://bioconductor.org/biocLite.R")
# biocLite("IRanges")
# (biocLite is retired on current Bioconductor; use BiocManager::install("IRanges"))
library(IRanges)
df1 <- as(df, "RangedData")
as.data.frame(reduce(df1, by = "id", min.gapwidth = 0.5))
# space start end width id
#1 1 3 8 6 a
#2 1 10 12 3 a
#3 1 5 13 9 b
#4 1 8 10 3 d
#5 1 12 26 15 d
#6 1 27 30 4 d
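RangedData has since been deprecated in IRanges, so here is a base R sketch of the same merge, assuming (as in the expected output) that intervals are united when they overlap or touch at an endpoint:
merge_intervals <- function(d) {
  d <- d[order(d$start), ]
  # A new united interval starts whenever the current start exceeds the
  # running maximum of all earlier ends.
  grp <- cumsum(c(TRUE, d$start[-1] > cummax(d$end)[-nrow(d)]))
  data.frame(id = d$id[1],
             start = tapply(d$start, grp, min),
             end = tapply(d$end, grp, max))
}
do.call(rbind, lapply(split(df, df$id), merge_intervals))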
I have to reshape a dataset that consists of observations (obs) and elements (p) linked to those observations. The characteristics of the elements are in separate columns attached to the observations.
An MWE looks like this:
set.seed(1)
data <- data.frame(obs_id = c(1:3),
                   p1 = 1,
                   p1_char1 = sample(11:20, 3),
                   p2 = 2,
                   p2_char1 = sample(11:20, 3),
                   p3 = 3,
                   p3_char1 = sample(11:20, 3))
This results in data that look as follows:
> data
obs_id p1 p1_char1 p2 p2_char1 p3 p3_char1
1 1 1 20 2 12 3 16
2 2 1 16 2 20 3 11
3 3 1 18 2 16 3 13
obs_id are the observations, pX indicates the various elements, and pX_charX their characteristics.
Now, I have to bring the data into a long format with two new columns. The first one should be named p and contain all element numbers. So far, so good. This can, e.g., easily be achieved with gather from the tidyr package:
library(magrittr)
library(dplyr)   # for filter() and bind_cols() used below
library(tidyr)
data_long1 <- gather(data, key = p_variable, value = p,
                     p1, p2, p3)
Filtering to the first observation, everything is as it should be:
> data_long1 %>% filter(obs_id == 1)
obs_id p1_char1 p2_char1 p3_char1 p_variable p
1 1 20 12 16 p1 1
2 1 20 12 16 p2 2
3 1 20 12 16 p3 3
Now, the second new column should be named char and be filled with characteristics of the elements. I can stack them independently with gather, too.
data_long2 <- gather(data, key = char_variable, value = char,
                     p1_char1, p2_char1, p3_char1)
> data_long2 %>% filter(obs_id == 1)
obs_id p1 p2 p3 char_variable char
1 1 1 2 3 p1_char1 20
2 1 1 2 3 p2_char1 12
3 1 1 2 3 p3_char1 16
Now, I could combine the two with bind_cols() to get what I want:
data_long <- bind_cols(data_long1, data_long2)
> data_long %>%
+ select(obs_id, p, char) %>%
+ filter(obs_id == 1)
obs_id p char
1 1 1 20
2 1 2 12
3 1 3 16
The problem is that I would need to do this for every new variable of the elements that I want to stack on top of each other.
My question is this: Is there a more efficient way to create two or more columns when I format the data from wide to long? What if I have a pX_char2 variable in the original data that I want to transform in a char2 variable in the final data?
As @domaeg points out in the comments, this can be accomplished with the new pivot_longer function from tidyr 1.0.0:
library(tidyverse)
data %>%
  pivot_longer(ends_with("char1"),
               names_to = "p",
               names_pattern = "p([0-9])",
               values_to = "char") %>%
  select(obs_id, p, char)
Which produces:
# A tibble: 9 x 3
obs_id p char
<int> <chr> <int>
1 1 1 20
2 1 2 12
3 1 3 16
4 2 1 16
5 2 2 20
6 2 3 11
7 3 1 18
8 3 2 16
9 3 3 13
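The question also asks about a hypothetical pX_char2 column. A sketch for that more general case, using tidyr's ".value" sentinel so that each name suffix (char1, char2, ...) becomes its own column:
data %>%
  pivot_longer(-c(obs_id, p1, p2, p3),
               names_to = c("p", ".value"),
               names_pattern = "p([0-9])_(.*)")
Here every pX_charY column feeds an output column named charY, keyed by the element number p.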
For good measure: I couldn't reproduce your data with the given seed, so I set it up directly like this, in case anyone else wants a shot:
txt <- "obs_id p1 p1_char1 p2 p2_char1 p3 p3_char1
1 1 1 20 2 12 3 16
2 2 1 16 2 20 3 11
3 3 1 18 2 16 3 13"
data <- read.table(text = txt, header = TRUE)
Let's suppose I want to estimate the time lag between two groups within a data.frame.
Here is an example of my data:
df_1 = data.frame(time = c(1,3,5,6,8,11,15,16,18,20), group = 'a') # create group 'a' data
df_2 = data.frame(time = c(2,7,10,13,19,25), group = 'b') # create group 'b' data
df = rbind(df_1, df_2) # merge groups
df = df[with(df, order(time)), ] # order by time
rownames(df) = NULL #remove row names
> df
time group
1 1 a
2 2 b
3 3 a
4 5 a
5 6 a
6 7 b
7 8 a
8 10 b
9 11 a
10 13 b
11 15 a
12 16 a
13 18 a
14 19 b
15 20 a
16 25 b
Now I need to subtract the time of the immediately preceding group a observation from each group b observation,
i.e. 2-1, 7-6, 10-8, 13-11, 19-18 and 25-20.
# Expected output
> out
[1] 1 1 2 2 1 5
How can I achieve this?
We can find the indices of the b rows and subtract the time value at each one's previous index.
inds <- which(df$group == "b")
# Each "b" row here is immediately preceded by the "a" row of interest,
# so subtracting the value at the previous index gives the lag.
df$time[inds] - df$time[inds - 1]
#[1] 1 1 2 2 1 5
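If the groups don't strictly alternate (say, two consecutive b rows), a base R sketch that instead pairs each b with the most recent preceding a (assuming df is sorted by time):
a_idx <- which(df$group == "a")
b_idx <- which(df$group == "b")
# findInterval() finds, for each b index, the latest a index before it.
df$time[b_idx] - df$time[a_idx[findInterval(b_idx, a_idx)]]
#[1] 1 1 2 2 1 5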
Here's a tidyverse solution. First, use transmute to build a column that, for each group b row, holds the difference from the preceding row's time (via lag), and 0 otherwise. Then filter down to those results and convert to a vector with deframe:
library(tidyverse)
df %>%
transmute(result = if_else(group == "b", time - lag(time), 0)) %>%
filter(result != 0) %>%
deframe()
Result:
[1] 1 1 2 2 1 5
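Equivalently, dplyr::pull(result) could replace deframe() here, since the filtered tibble has only the single result column.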
I need help with programming in R. I have a data.frame B with one column
x <- c("300","300","300","400","400","400","500","500","500", ...)  # 2 million rows
and I need to create a second column holding a within-group rank. It should look like
y <- c(1,2,3,1,2,3,1,2,3, ...)
I used a for loop:
B$y[1] <- 1
for (i in 2:length(B$x)) {
  B$y[i] <- ifelse(B$x[i] == B$x[i-1], B$y[i-1] + 1, 1)
}
The process ran for 4 hours.
So I need help speeding this up, or an alternative approach.
Thanks for your answer.
Here is a solution with base R:
B <- data.frame(x = rep(c(300, 400, 500), sample(5:10, 3)))
B
# Number the rows within each distinct value of x. Using seq_along(B$x)
# as the first argument keeps y numeric even when x is character.
B$y <- ave(seq_along(B$x), B$x, FUN = seq_along)
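At 2 million rows, a data.table sketch of the same per-group counter may also be worth trying; rowid() numbers rows within each distinct value of x:
library(data.table)
setDT(B)            # convert B to a data.table by reference
B[, y := rowid(x)]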
Here's an approach with dplyr that takes about 0.2 seconds on 2 million rows.
First I make sample data:
n = 2E6 # number of rows in test
library(dplyr)
sample_data <- data.frame(
x = round(runif(n = n, min = 1, max = 100000), digits = 0)
) %>%
arrange(x) # Optional, added to make output clearer so that each x is adjacent to the others that match.
Then I group by x and make y record which occurrence of x each row is within its group.
sample_data_with_rank <- sample_data %>%
group_by(x) %>%
mutate(y = row_number()) %>%
ungroup()
head(sample_data_with_rank, 20)
# A tibble: 20 x 2
x y
<dbl> <int>
1 1 1
2 1 2
3 1 3
4 1 4
5 1 5
6 1 6
7 1 7
8 1 8
9 1 9
10 1 10
11 1 11
12 1 12
13 1 13
14 1 14
15 1 15
16 2 1
17 2 2
18 2 3
19 2 4
20 2 5
I have the following data frame as an example
df <- data.frame(score=letters[1:15], total1=1:15, total2=16:30)
> df
score total1 total2
1 a 1 16
2 b 2 17
3 c 3 18
4 d 4 19
5 e 5 20
6 f 6 21
7 g 7 22
8 h 8 23
9 i 9 24
10 j 10 25
11 k 11 26
12 l 12 27
13 m 13 28
14 n 14 29
15 o 15 30
I would like to aggregate my data frame by summing over groups of three consecutive rows, i.e.
groups sum1 sum2
'a-b-c' 6 51
'd-e-f' 15 60
etc.
All the given answers to this kind of question assume that the grouping strings repeat across rows.
The usual aggregate function that I use to obtain the summary delivers a different result:
aggregate(df$total1, by=list(sum1=df$score %in% c('a','b','c'), sum2=df$score %in% c('d','e','f')), FUN=sum)
sum1 sum2 x
1 FALSE FALSE 99
2 TRUE FALSE 6
3 FALSE TRUE 15
If you want a tidyverse solution, here is one possibility:
library(tidyverse)
df <- data.frame(score = letters[1:15], total1 = 1:15, total2 = 16:30)
df %>%
mutate(groups = case_when(
score %in% c("a","b","c") ~ "a-b-c",
score %in% c("d","e","f") ~ "d-e-f"
)) %>%
group_by(groups) %>%
summarise_if(is.numeric, sum)
returns
# A tibble: 3 x 3
groups total1 total2
<chr> <int> <int>
1 a-b-c 6 51
2 d-e-f 15 60
3 <NA> 99 234
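(On current dplyr, summarise_if is superseded; summarise(across(where(is.numeric), sum)) is the equivalent spelling.)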
Add a "groups" column with the category value.
df$groups = NA
and then define each group like this:
df$groups[df$score=="a" | df$score=="b" | df$score=="c" ] = "a-b-c"
Finally, aggregate by that column, for example:
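Here is a sketch of that final step with base R's aggregate (note that the formula interface silently drops rows whose groups value is still NA):
aggregate(cbind(total1, total2) ~ groups, data = df, FUN = sum)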
Here's a solution that works for any sized data frame.
df <- data.frame(score=letters[1:15], total1=1:15, total2=16:30)
# I'm adding a row to demonstrate that the grouping pattern works when the
# number of rows is not equally divisible by 3.
df <- rbind(df, data.frame(score = letters[16], total1 = 16, total2 = 31))
# A vector that represents the correct groupings for the data frame:
# consecutive rows fall into groups of three.
groups <- ceiling(seq_len(nrow(df)) / 3)
# Your method of aggregation by `groups`. I'm going to use `data.table`.
require(data.table)
dt <- as.data.table(df)
dt[, group := groups]
aggDT <- dt[, list(score = paste0(score, collapse = "-"),
total1 = sum(total1), total2 = sum(total2)), by = group][
, group := NULL]
aggDT
score total1 total2
1: a-b-c 6 51
2: d-e-f 15 60
3: g-h-i 24 69
4: j-k-l 33 78
5: m-n-o 42 87
6: p 16 31
I have a dataset with multiple duplicate IDs which have different categorical values. Following is an example data set.
suppressMessages(library(dplyr))
DUMMY_DATA <- data.frame(ID = c(11, 22, 22, 33, 33, 33, 44, 44, 55, 55, 55, 55),
                         CATEGORY1 = c("E","B","C","C","C","D","A","A","B","C","E","B"),
                         CATEGORY2 = c("AA","AA","BB","CC","DD","BB","AA","EE","AA","CC","BB","EE"),
                         stringsAsFactors = FALSE)
> DUMMY_DATA
ID CATEGORY1 CATEGORY2
1 11 E AA
2 22 B AA
3 22 C BB
4 33 C CC
5 33 C DD
6 33 D BB
7 44 A AA
8 44 A EE
9 55 B AA
10 55 C CC
11 55 E BB
12 55 B EE
I want to aggregate the values per ID using another dataset that gives the rank of each categorical value, as follows.
Category_Rank1 <- data.frame(VAR = c("A","B","C","D","E"),
                             RANK = c(1,2,3,4,5),
                             stringsAsFactors = FALSE)
> Category_Rank1
VAR RANK
1 A 1
2 B 2
3 C 3
4 D 4
5 E 5
Category_Rank2 <- data.frame(VAR = c("AA","BB","CC","DD","EE"),
                             RANK = c(1,2,3,4,5),
                             stringsAsFactors = FALSE)
> Category_Rank2
VAR RANK
1 AA 1
2 BB 2
3 CC 3
4 DD 4
5 EE 5
For each group of IDs from DUMMY_DATA, I want to look up the Category_Rank and then assign to the ID the category with the best (lowest) rank. Following is my solution.
hierarchyTransform <- function(x, dataset){
  x <- unique(x)
  # Keep only the rank rows for categories present in this group ...
  dataset <- dataset %>%
    filter(dataset[, 1] %in% x)
  # ... then keep the row(s) with the best (minimum) rank.
  dataset <- dataset %>%
    filter(dataset[, 2] == min(dataset[, 2]))
  return(dataset[1, 1])
}
NEW_DATA <- DUMMY_DATA %>%
  group_by(ID) %>%
  summarise(CATEGORY1_CLEAN = hierarchyTransform(x = CATEGORY1,
                                                 dataset = Category_Rank1),
            CATEGORY2_CLEAN = hierarchyTransform(x = CATEGORY2,
                                                 dataset = Category_Rank2))
I get the following result:
> NEW_DATA
# A tibble: 5 × 3
ID CATEGORY1_CLEAN CATEGORY2_CLEAN
<dbl> <chr> <chr>
1 11 E AA
2 22 B AA
3 33 C BB
4 44 A AA
5 55 B AA
This is exactly what I want, but the problem is the time this operation takes. My original dataset has around 1 million rows, and when I group it by ID I get about 200,000 groups. So the hierarchyTransform function is applied to 200,000 groups, which takes about 15 minutes for a single variable, and I have to perform this operation for 10 other variables, which increases the time further. Is there any way to reduce the time taken for this operation?
If you know the rank order of the levels of CATEGORY (which is alphabetical in your example), then you can turn CATEGORY into a factor with the levels ordered according to the desired ranking. Then sort by CATEGORY, group by ID, and take the first row for each ID.
DUMMY_DATA$CATEGORY = factor(DUMMY_DATA$CATEGORY, levels=LETTERS[1:5], ordered=TRUE)
DUMMY_DATA %>%
arrange(ID, CATEGORY) %>%
group_by(ID) %>%
slice(1)
ID CATEGORY
1 11 E
2 22 B
3 33 C
4 44 A
5 55 B
UPDATE: To respond to your comment and updated question: The code below will, for each ID, select the value of highest rank from each category column.
DUMMY_DATA$CATEGORY1 = factor(DUMMY_DATA$CATEGORY1, levels=LETTERS[1:5], ordered=TRUE)
DUMMY_DATA$CATEGORY2 = factor(DUMMY_DATA$CATEGORY2, levels=c("AA","BB","CC","DD","EE"), ordered=TRUE)
Now you can do either of the following:
DUMMY_DATA %>% group_by(ID) %>%
  summarise(CATEGORY1 = min(CATEGORY1),
            CATEGORY2 = min(CATEGORY2))
DUMMY_DATA %>% group_by(ID) %>%
  summarise_all(funs(min))  # funs() is deprecated on newer dplyr; summarise_all(min) also works
ID CATEGORY1 CATEGORY2
1 11 E AA
2 22 B AA
3 33 C BB
4 44 A AA
5 55 B AA
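If the desired order ever differs from the alphabetical one, the same trick works with levels built directly from the rank tables above (VAR ordered by RANK); a sketch:
DUMMY_DATA$CATEGORY1 <- factor(DUMMY_DATA$CATEGORY1,
                               levels = Category_Rank1$VAR[order(Category_Rank1$RANK)],
                               ordered = TRUE)
DUMMY_DATA$CATEGORY2 <- factor(DUMMY_DATA$CATEGORY2,
                               levels = Category_Rank2$VAR[order(Category_Rank2$RANK)],
                               ordered = TRUE)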