Regroup, summarise and combine variables - r

I've been racking my brain over this, but so far I haven't found an easy solution.
I have the following dataset:
Itin Origin Destination Passengers
1 A B 1
1 B C 1
2 A B 3
3 E B 10
4 A C 2
5 E B 4
What I'm trying to do is create a Path variable based on the Itin variable, while keeping the Passengers variable.
The easiest way to understand this is to think of a flight with a layover somewhere. For example, in Itin = 1 one passenger goes from A to B to C. All that has to be kept is origin A, destination B, destination C, and the passenger count as it is, which is 1, just like in the example below.
Path Passengers
A-B-C 1
A-B 3
E-B 10
A-C 2
E-B 4
I've tried several options with group_by in dplyr, as it is often quicker than the base options, but I couldn't get the result shown in the second example, with a new Path variable. I also thought of using tidyr, but I'm not sure how it would help here.
Any idea on how to do this?
Edit: As for the Path variable, it doesn't really matter if it ends up as A-B-C, A,B,C or A B C, as I will only look at the syntax.

EDIT: A faster solution using data.table (note that Itin 1 appears twice in the result because Passengers is not aggregated):
df1<-read.table(text="Itin Origin Destination Passengers
1 A B 1
1 B C 1
2 A B 3
3 E B 10
4 A C 2
5 E B 4",header=TRUE, stringsAsFactors=FALSE)
library(data.table)
DT <-data.table(df1)
DT[,.(Passengers, Path = paste(Origin[1],paste(Destination, collapse = " "),
collapse = " ")), by=Itin]
Itin Passengers Path
1: 1 1 A B C
2: 1 1 A B C
3: 2 3 A B
4: 3 10 E B
5: 4 2 A C
6: 5 4 E B
Here's my original solution with dplyr, which returns one row per itinerary:
df1<-read.table(text="Itin Origin Destination Passengers
1 A B 1
1 B C 1
2 A B 3
3 E B 10
4 A C 2
5 E B 4",header=TRUE, stringsAsFactors=FALSE)
library(dplyr)
df1 %>%
group_by(Itin) %>%
summarise(Passengers=max(Passengers),
Path = paste(Origin[1],paste(Destination, collapse = " "),
collapse = " "))
# A tibble: 5 × 3
Itin Passengers Path
<int> <int> <chr>
1 1 1 A B C
2 2 3 A B
3 3 10 E B
4 4 2 A C
5 5 4 E B

Reading data:
read.table(textConnection("Itin Origin Destination Passengers
1 A B 1
1 B C 1
2 A B 3
3 E B 10
4 A C 2
5 E B 4"), header=T, stringsAsFactors=F) -> df
Using base R in this case:
Path <- lapply(unique(df$Itin), function(it) {
x <- subset(df, Itin==it)
c(x$Origin[1], x$Destination)
})
new_df <- unique(df[,c("Itin", "Passengers")])
new_df$Path <- Path
> new_df
Itin Passengers Path
1 1 1 A, B, C
3 2 3 A, B
4 3 10 E, B
5 4 2 A, C
6 5 4 E, B
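If a single string such as A-B-C is preferred over a list column, the same idea can be sketched in base R as follows. This is only an illustration: it assumes the rows of each itinerary are already in travel order and that Passengers is constant within an itinerary (so max() simply picks that value). The names paths, pax and result are made up for this example.

```r
df <- read.table(text = "Itin Origin Destination Passengers
1 A B 1
1 B C 1
2 A B 3
3 E B 10
4 A C 2
5 E B 4", header = TRUE, stringsAsFactors = FALSE)

# One path string per itinerary: first origin, then every destination in order
paths <- sapply(split(df, df$Itin), function(x) {
  paste(c(x$Origin[1], x$Destination), collapse = "-")
})

# Passengers per itinerary (assumed constant within an itinerary)
pax <- tapply(df$Passengers, df$Itin, max)

result <- data.frame(Itin = as.integer(names(paths)),
                     Path = unname(paths),
                     Passengers = as.vector(pax),
                     stringsAsFactors = FALSE)
result
```

Changing collapse = "-" to " " or "," gives the other separators mentioned in the question.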

Related

Simplest way to replace a list of values in a data frame with a list of new values

Say we have a data frame with a factor (Group) that is a grouping variable for a list of IDs:
set.seed(123)
data <- data.frame(Group = factor(sample(5,10, replace = T)),
ID = c(1:10))
In this example, the ID's belong to one of 5 Groups, labeled 1:5. We simply want to replace 1:5 with A:E. In other words, if Group == 1, we want to change it to A, if Group == 2, we want to change it to B, and so on. What is the simplest way to achieve this?
You may assign new labels= in a named list by calling factor once again.
data$Group1 <- factor(data$Group, labels=list("1"="A", "2"="B", "3"="C", "4"="D", "5"="E"))
## more succinct:
data$Group2 <- factor(data$Group, labels=setNames(list("A", "B", "C", "D", "E"), 1:5))
data
# Group ID Group1 Group2
# 1 3 1 C C
# 2 3 2 C C
# 3 2 3 B B
# 4 2 4 B B
# 5 3 5 C C
# 6 5 6 E E
# 7 4 7 D D
# 8 1 8 A A
# 9 2 9 B B
# 10 3 10 C C
This works in general; if capital letters are indeed what you want, see @RonakShah's solution.
You can use R's built-in constant LETTERS:
data$new_group <- LETTERS[data$Group]
data
# Group ID new_group
#1 3 1 C
#2 3 2 C
#3 2 3 B
#4 2 4 B
#5 3 5 C
#6 5 6 E
#7 4 7 D
#8 1 8 A
#9 2 9 B
#10 3 10 C
Created a new column (new_group) here for comparison purposes. You can overwrite the same column if you wish to.
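One caveat may be worth a small sketch: both factor(labels=) and LETTERS[data$Group] work by position (level order or integer code), not by the label itself. In the example above that makes no difference, because every value in 1:5 occurs. The hypothetical vector grp below, in which some values are missing, shows how a named lookup vector (lookup is an illustrative name) matches on the label instead.

```r
# Hypothetical example in which the values 1 and 4 never occur
grp <- factor(c(3, 3, 2, 5))          # levels are "2", "3", "5"

# Named lookup vector: names are the old values, elements the new ones
lookup <- c("1" = "A", "2" = "B", "3" = "C", "4" = "D", "5" = "E")

by_name <- unname(lookup[as.character(grp)])  # matches on the label
by_code <- LETTERS[grp]                       # indexes by the integer code

by_name   # "C" "C" "B" "E"
by_code   # "B" "B" "A" "C"
```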

find the length of the longest chain of characters in each row of a dataframe

I want to find the length of the longest chain of characters following a pattern. Let's say I have this data frame and want, for each row, the length of the longest run of consecutive "a" values. How do I find it?
id = c(1, 2, 3,4,5)
A = c("a","a","a","a","a")
B = c("a","a","b","a","d")
C = c("b","a","c","a","a")
D = c("a","a","a","b","c")
E = c("a","a","e","c","a")
df = data.frame(id,A,B,C,D,E,stringsAsFactors=FALSE)
df$Count = c(2,5,1,3,1)
id A B C D E Count
1 a a b a a 2
2 a a a a a 5
3 a b c a e 1
4 a a a b c 3
5 a d a c a 1
You can use rle (run-length encoding).
rles = apply(df[2:6], 1, rle)
result = sapply(rles, function(x) max(x$lengths[x$values == "a"]))
df$new_count = result
df
# id A B C D E Count new_count
# 1 1 a a b a a 2 2
# 2 2 a a a a a 5 5
# 3 3 a b c a e 1 1
# 4 4 a a a b c 3 3
# 5 5 a d a c a 1 1
See ?rle or many other questions on this site if you search for "[r] rle" for additional details.
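To make the rle step concrete, here is a small sketch on the first row of the example. It also shows one way to handle a row that contains no "a" at all, where max() on an empty vector would return -Inf with a warning. The helper longest_run is a made-up name for this illustration.

```r
row <- c("a", "a", "b", "a", "a")   # first row of the example data
r <- rle(row)
r$lengths   # 2 1 2
r$values    # "a" "b" "a"

# Hypothetical helper: longest run of `target`, or 0 if it never occurs
longest_run <- function(x, target = "a") {
  r <- rle(x)
  hits <- r$lengths[r$values == target]
  if (length(hits) == 0) 0L else max(hits)
}

longest_run(row)            # 2
longest_run(c("b", "c"))    # 0
```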

Determining if values of previous rows repeat in dataframe

I have some data organized like this:
set.seed(12)
ids <- matrix(replicate(1000,sample(LETTERS[1:4],2)),ncol=2,byrow=T)
df <- data.frame(
event = 1:100,
id1 = ids[,1],
id2 = ids[,2],
grp = rep(1:10, each=100), stringsAsFactors=F)
head(df,10)
event id1 id2 grp
1 1 A C 1
2 2 D A 1
3 3 A D 1
4 4 A B 1
5 5 A D 1
6 6 B C 1
7 7 B D 1
8 8 B D 1
9 9 B D 1
10 10 C A 1
There are pairs of ids (id1 & id2). Within a row they are never the same. There is a variable called grp. There are 10 groups. Each group could be considered a separate sample of data. The event variable goes from 1-100 in each group.
The first question I have is quite straightforward. Within each group, for each row, is the combination of the two ids (id1-id2) the same as the previous row, the reverse of the previous row, or neither of these two options. Obviously, if there is an A-C combination on row 100 of one group, I am not interested in whether it is reversed, the same or whatever on row 1 of the following group.
This is my temporary solution:
library(dplyr) # lag() below is dplyr::lag; stats::lag() does not shift a plain vector
# Give each id pair an identifier:
df$pair <- paste(pmin(df$id1,df$id2), pmax(df$id1,df$id2))
# For each grp, use `lag` to work out whether the previous row contains the same pair of ids, and whether they are in the same or reversed order:
df.sp <- split(df, df$grp)
df$value <- unlist(lapply(df.sp, function(x) ifelse(x$pair!=lag(x$pair), NA, ifelse(x$id1==lag(x$id1), 1, 0)) ))
This gives:
head(df,10)
event id1 id2 grp pair value
1 1 A C 1 A C NA
2 2 D A 1 A D NA
3 3 A D 1 A D 0
4 4 A B 1 A B NA
5 5 A D 1 A D NA
6 6 B C 1 B C NA
7 7 B D 1 B D NA
8 8 B D 1 B D 1
9 9 B D 1 B D 1
10 10 C A 1 A C NA
This works - showing 0 as a reversal, 1 as a copy and NA as neither.
The more complex question I am interested in is the following. Within each group (grp), for each row, find if its combination of two ids (the pair) previously occurred in that grp. If they did, then return whether they were in the same order or reversed order the immediate previous time they occurred.
That result would look like this:
event id1 id2 grp pair value
1 1 A C 1 A C NA
2 2 D A 1 A D NA
3 3 A D 1 A D 0
4 4 A B 1 A B NA
5 5 A D 1 A D 1
6 6 B C 1 B C NA
7 7 B D 1 B D NA
8 8 B D 1 B D 1
9 9 B D 1 B D 1
10 10 C A 1 A C 0
E.g. row 10 returns 0 because the combination A-C previously occurred in the reverse order (row 1). Row 5 returns 1 because A-D previously occurred in the same order on row 3.
You're almost there! The second question is equivalent to the first question, just grouping by pair as well as group. I converted the code to dplyr (though I appreciate the spirit behind keeping the question in base). I also removed the second ifelse, replacing it with a numeric conversion of the logical, which should be more performant (and some will find easier to read).
df %>% group_by(grp) %>%
mutate(
pair = paste(pmin(id1, id2), pmax(id1, id2)),
prev_row = ifelse(pair != lag(pair), NA, as.numeric(id1 == lag(id1)))
) %>%
group_by(grp, pair) %>%
mutate(prev_any = ifelse(pair != lag(pair), NA, as.numeric(id1 == lag(id1)))) %>%
head(10)
# Source: local data frame [10 x 7]
# Groups: grp, pair [5]
#
# event id1 id2 grp pair prev_row prev_any
# (int) (chr) (chr) (int) (chr) (dbl) (dbl)
# 1 1 A C 1 A C NA NA
# 2 2 D A 1 A D NA NA
# 3 3 A D 1 A D 0 0
# 4 4 A B 1 A B NA NA
# 5 5 A D 1 A D NA 1
# 6 6 B C 1 B C NA NA
# 7 7 B D 1 B D NA NA
# 8 8 B D 1 B D 1 1
# 9 9 B D 1 B D 1 1
# 10 10 C A 1 A C NA 0
For such grouping, filtering and mutating tasks, I find dplyr to be very helpful. Here is one way to achieve your goal:
df %>% group_by(grp) %>% mutate(value = ifelse(id1 == lag(id1) & id2 == lag(id2), 1, ifelse(id1 == lag(id2) & id2 == lag(id1), 0, NA)))
Within each group, you compare the ID values with those of the previous row and conditionally assign a new value column. Hope this helps.
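For readers who prefer to stay in base R, the "previous occurrence of the same pair" logic can be sketched with ave(). This is a minimal illustration, not a tuned solution: it uses only the first ten rows of group 1 from the question, and shift1, key and prev_id1 are names invented for the example.

```r
# First ten rows of group 1 from the question
df <- data.frame(
  id1 = c("A", "D", "A", "A", "A", "B", "B", "B", "B", "C"),
  id2 = c("C", "A", "D", "B", "D", "C", "D", "D", "D", "A"),
  grp = 1L, stringsAsFactors = FALSE)

df$pair <- paste(pmin(df$id1, df$id2), pmax(df$id1, df$id2))

# Shift a vector down by one position, padding with NA (a plain lag)
shift1 <- function(x) c(NA, x[-length(x)])

# id1 from the previous occurrence of the same pair within the same grp
key <- paste(df$grp, df$pair)
prev_id1 <- ave(df$id1, key, FUN = shift1)

# 1 = same order as last time, 0 = reversed, NA = pair not seen before
df$value <- as.integer(df$id1 == prev_id1)
df
```

Because the pair is the same within each key, comparing id1 alone is enough to tell "same order" from "reversed".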

how to group my string vector in two data frame?

Hi: I have a simple question, but it confuses me a lot. Below is my code:
a <- data.frame(url = c("1","2","3","4","5"),
id = c("a","b","c","d","e")
)
b <- data.frame(url = c("1","1","2","2","2","3","3","3","3","4","4","5","5"),
price = c(10,10,20,20,20,30,30,30,30,40,40,50,50),
recipt=c("n","n","n","n","n","n","n","n","n","y","y","n","n")
)
I want newdata, which merges b$recipt into a, to look like this:
>newdata
url id recipt
1 a n
2 b n
3 c n
4 d y
5 e n
Please give me a hint, thanks.
You could try this:
a$recipt <- sapply(1:nrow(a),function(x) b$recipt[b$url==a$url[x]][1])
#> a
# url id recipt
#1 1 a n
#2 2 b n
#3 3 c n
#4 4 d y
#5 5 e n
Here it is assumed that the recipt entries are the same for any given value of url in b. If this is not the case, things become more complicated.
If you want to keep a unchanged and generate a new frame newdata with the new column, then the above code can be slightly modified in a rather trivial way:
newdata <- a
newdata$recipt <- sapply(1:nrow(a),function(x) b$recipt[b$url==a$url[x]][1])
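The assumption that recipt is the same for every row of a given url can be checked before relying on the first match. The following is a rough sketch of such a check; n_distinct and conflicts are illustrative names.

```r
b <- data.frame(url = c("1","1","2","2","2","3","3","3","3","4","4","5","5"),
                recipt = c("n","n","n","n","n","n","n","n","n","y","y","n","n"),
                stringsAsFactors = FALSE)

# Number of distinct recipt values per url
n_distinct <- tapply(b$recipt, b$url, function(x) length(unique(x)))

# urls whose recipt values disagree (empty if the assumption holds)
conflicts <- names(n_distinct)[n_distinct > 1]
conflicts
```

For the data in the question, conflicts is empty, so taking the first matching entry is safe.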
You could use match
transform(a, recipt= b$recipt[match(url, b$url)])
# url id recipt
#1 1 a n
#2 2 b n
#3 3 c n
#4 4 d y
#5 5 e n
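To see why match works here, it may help to look at the index it returns: for each url in a, the position of its first occurrence in b. The sketch below uses plain vectors (a_url, b_url, b_recipt and idx are illustrative names) rather than the data frames.

```r
a_url    <- c("1", "2", "3", "4", "5")
b_url    <- c("1","1","2","2","2","3","3","3","3","4","4","5","5")
b_recipt <- c("n","n","n","n","n","n","n","n","n","y","y","n","n")

# Position of the first occurrence of each a_url in b_url
idx <- match(a_url, b_url)
idx               # 1 3 6 10 12

# Look up recipt at those positions
b_recipt[idx]     # "n" "n" "n" "y" "n"

# A url that does not occur in b gives NA rather than an error
match("6", b_url) # NA
```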
Or use data.table. The on= argument used below requires v1.9.5 or later, which was the development version when this was written.
library(data.table)#v1.9.5+
setDT(a)[unique(b[c(1,3)], by='url'), on='url']
# url id recipt
#1: 1 a n
#2: 2 b n
#3: 3 c n
#4: 4 d y
#5: 5 e n
So I think what you want is to merge a onto b as there are multiple prices with the same url. Thus b is your base data frame and you want to append an id value to it. Some of the id values would be repeated.
One easy way is to do this with dplyr.
library(dplyr)
a <- data.frame(url = c("1","2","3","4","5"),
id = c("a","b","c","d","e")
)
b <- data.frame(url = c("1","1","2","2","2","3","3","3","3","4","4","5","5"),
price = c(10,10,20,20,20,30,30,30,30,40,40,50,50),
recipt=c("n","n","n","n","n","n","n","n","n","y","y","n","n")
)
left_join(b, a, by = "url")
url price recipt id
1 1 10 n a
2 1 10 n a
3 2 20 n b
4 2 20 n b
5 2 20 n b
6 3 30 n c
7 3 30 n c
8 3 30 n c
9 3 30 n c
10 4 40 y d
11 4 40 y d
12 5 50 n e
13 5 50 n e

R: fill a new column in a data frame with a value by matching variables in reverse

I apologize for the title of this question. I can't figure out a good way to briefly describe what I want to do.
I have something like this, with >8000 rows:
x y value_xy
A B 7
A C 2
B A 3
B C 6
C A 2
C B 1
I want to create a new column, value_yx, that looks like this:
x y value_xy value_yx
A B 7 3
A C 2 2
B A 3 7
B C 6 1
C A 2 2
C B 1 6
For each pair of x and y, I want the new column to hold the value for the reversed pair, y to x (where y appears later in the x column). Sometimes these values are equal, other times they aren't.
I have explored using for loops, ave(), and several other functions, but I haven't been able to make it work.
Try merge. The by.x and by.y arguments specify columns to be matched, and here the order of matching columns is reversed in by.y:
merge(x = df, y = df, by.x = c("x", "y"), by.y = c("y", "x"))
# x y value_xy.x value_xy.y
# 1 A B 7 3
# 2 A C 2 2
# 3 B A 3 7
# 4 B C 6 1
# 5 C A 2 2
# 6 C B 1 6
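To get the column names asked for in the question, the merged columns can be renamed afterwards. The sketch below also adds all.x = TRUE, which is an addition for illustration and not part of the answer above: it keeps any pair that has no reverse row and fills value_yx with NA. The data frame is rebuilt from the question's table.

```r
df <- data.frame(x = c("A", "A", "B", "B", "C", "C"),
                 y = c("B", "C", "A", "C", "A", "B"),
                 value_xy = c(7, 2, 3, 6, 2, 1),
                 stringsAsFactors = FALSE)

# all.x = TRUE keeps pairs that have no reverse row (value_yx becomes NA)
res <- merge(x = df, y = df, by.x = c("x", "y"), by.y = c("y", "x"),
             all.x = TRUE)

# merge() suffixes the clashing column with .x and .y; rename to the target
names(res)[names(res) == "value_xy.x"] <- "value_xy"
names(res)[names(res) == "value_xy.y"] <- "value_yx"
res
```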
Looks like I was beaten to it, but here is an alternative solution with mapply:
df$value_yx = mapply(function(x_flip, y_flip) df[df$x == y_flip & df$y == x_flip,]$value_xy, df$x, df$y)
# x y value_xy value_yx
#1 A B 7 3
#2 A C 2 2
#3 B A 3 7
#4 B C 6 1
#5 C A 2 2
#6 C B 1 6
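With more than 8000 rows, a vectorised lookup may be worth considering instead of scanning the data frame once per row. The following is a sketch using match on pasted keys. It assumes each (x, y) pair occurs at most once and that pasting with a space cannot make two different pairs look alike; key and rev_key are illustrative names. If a pair has no reverse row, this gives NA, whereas the mapply version would no longer return a plain vector.

```r
df <- data.frame(x = c("A", "A", "B", "B", "C", "C"),
                 y = c("B", "C", "A", "C", "A", "B"),
                 value_xy = c(7, 2, 3, 6, 2, 1),
                 stringsAsFactors = FALSE)

key     <- paste(df$x, df$y)   # "A B", "A C", ...
rev_key <- paste(df$y, df$x)   # the pair each row is looking for

# Row order is preserved; unmatched pairs would give NA
df$value_yx <- df$value_xy[match(rev_key, key)]
df
```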
xtabs will return a value matrix that can be indexed by a two-column, character-valued matrix formed from the first two columns, which are probably factors (hence the need for the as.character() conversion):
> dfrm$value_yx <- xtabs(value_xy~x+y, dfrm)[
sapply(dfrm[2:1],as.character) ]
> dfrm
x y value_xy value_yx
1 A B 7 3
2 A C 2 2
3 B A 3 7
4 B C 6 1
5 C A 2 2
6 C B 1 6
See what is being indexed:
> xtabs(value_xy~x+y, dfrm)
y
x A B C
A 0 7 2
B 3 0 6
C 2 1 0

Resources