I have a data frame called data, and I would like to add a new column holding a value taken from the previous row of another column, provided the factor (Id) is the same.
Here is a sample:
data <- structure(list(Id = c("a", "b", "b", "b", "a", "a", "b", "b",
"a", "a"), duration.minutes = c(NA, 139L, 535L, 150L, NA, NA,
145L, 545L, 144L, NA), event = structure(c(1L, 4L, 3L, 4L, 2L,
1L, 4L, 3L, 4L, 2L), .Label = c("enter", "exit", "stop", "trip"
), class = "factor")), .Names = c("Id", "duration.minutes", "event"
), class = "data.frame", row.names = 265:274)
and I would like to add a new column called "duration.minutes.past" like this:
data <- structure(list(Id = c("a", "b", "b", "b", "a", "a", "b", "b",
"a", "a"), duration.minutes = c(NA, 139L, 535L, 150L, NA, NA,
145L, 545L, 144L, NA), event = structure(c(1L, 4L, 3L, 4L, 2L,
1L, 4L, 3L, 4L, 2L), .Label = c("enter", "exit", "stop", "trip"
), class = "factor"), duration.minutes.past = c(NA, NA, 139,
NA, NA, NA, NA, 145, NA, NA)), .Names = c("Id", "duration.minutes",
"event", "duration.minutes.past"), row.names = 265:274, class = "data.frame")
As you can see, the new column duration.minutes.past holds the duration.minutes of the previous trip for the same Id. If the Id is different, or if the event is not a stop, the value of duration.minutes.past is NA.
Help is much appreciated!
A possible solution using dplyr:
library(dplyr)
# the sample data frame from the question is called data
data %>%
group_by(Id) %>%
mutate(new = replace(lag(duration.minutes), event != 'stop', NA))
#Source: local data frame [10 x 4]
#Groups: Id [2]
# Id duration.minutes event new
# <chr> <int> <fctr> <int>
#1 a NA enter NA
#2 b 139 trip NA
#3 b 535 stop 139
#4 b 150 trip NA
#5 a NA exit NA
#6 a NA enter NA
#7 b 145 trip NA
#8 b 545 stop 145
#9 a 144 trip NA
#10 a NA exit NA
We can do this with data.table. Convert the 'data.frame' to a 'data.table' (setDT(data)); grouped by 'Id', create the lag column of 'duration.minutes' using shift, then set the value to 'NA' where 'event' is not equal to 'stop'.
library(data.table)
setDT(data)[, duration.minutes.past := shift(duration.minutes),
Id][event != "stop", duration.minutes.past := NA][]
data
# Id duration.minutes event duration.minutes.past
#1: a NA enter NA
#2: b 139 trip NA
#3: b 535 stop 139
#4: b 150 trip NA
#5: a NA exit NA
#6: a NA enter NA
#7: b 145 trip NA
#8: b 545 stop 145
#9: a 144 trip NA
#10: a NA exit NA
Or this can be done with base R using ave
data$duration.minutes.past <- with(data, NA^(event != "stop") *
ave(duration.minutes, Id, FUN = function(x) c(NA, x[-length(x)])))
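The `NA^(event != "stop")` factor in the base R one-liner works because `NA^0` is `1` and `NA^1` is `NA` in R. A more explicit variant of the same idea, as a sketch only: `lag1` is a helper name made up for this illustration, and the sample data is rebuilt without the original row names.

```r
# Sample data from the question (row names omitted for brevity)
data <- data.frame(
  Id = c("a", "b", "b", "b", "a", "a", "b", "b", "a", "a"),
  duration.minutes = c(NA, 139L, 535L, 150L, NA, NA, 145L, 545L, 144L, NA),
  event = factor(c("enter", "trip", "stop", "trip", "exit",
                   "enter", "trip", "stop", "trip", "exit")),
  stringsAsFactors = FALSE
)

# lag1 is a hypothetical helper: shift a vector down by one, padding with NA
lag1 <- function(x) c(NA, x[-length(x)])

# Lag duration.minutes within each Id ...
lagged <- ave(data$duration.minutes, data$Id, FUN = lag1)

# ... then keep the lagged value only on "stop" rows
data$duration.minutes.past <- ifelse(data$event == "stop", lagged, NA)
```

The `ifelse` form is longer than the `NA^` trick but makes the "only on stop rows" condition easier to read.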
I've been trying to summarize data by multiple groups, where the new column should be a summary of the proportion of one column to another, by these groups. Because these two columns never both contain a value, their proportions cannot be calculated per row.
Below is an example.
Grouped by P_Common and Number_7, I'd like the total N_count / A_count.
structure(list(P_Common = c("B", "B", "C", "C", "D", "E", "E",
"F", "G", "G", "B", "G", "E", "D", "F", "C"), Number_7 = c(1L,
1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 1L, 3L, 1L, 2L, 1L, 1L),
N_count = c(0L, 4L, 22L, NA, 7L, 0L, 44L, 16L, NA, NA, NA,
NA, NA, NA, NA, NA), A_count = c(NA, NA, NA, NA, NA, NA,
NA, NA, 0L, 4L, 7L, NA, 23L, 4L, 7L, 17L)), class = "data.frame", row.names = c(NA,
-16L))
P_Common Number_7 N_count A_count
B 1 0 NA
B 1 4 NA
C 1 22 NA
C 1 NA NA
D 2 7 NA
E 2 0 NA
E 2 44 NA
F 2 16 NA
B 1 NA 7
G 3 NA NA
E 1 NA 23
D 2 NA 4
F 1 NA 7
C 1 NA 17
In this example there would be quite a few 0 and NA values; that's okay, they can stay in. Overall, the result would look like this:
P_Common Number_7 Propo
B 1 0.571428571
C 1 1.294117647
D 2 1.75
... etc
You can do:
library(dplyr)
df %>%
group_by(P_Common, Number_7) %>%
summarise(Propo = sum(N_count, na.rm = TRUE) / sum(A_count, na.rm = TRUE))
P_Common Number_7 Propo
<chr> <int> <dbl>
1 B 1 0.571
2 C 1 1.29
3 D 2 1.75
4 E 1 0
5 E 2 Inf
6 F 1 0
7 F 2 Inf
8 G 3 0
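The `Inf` rows above come from groups whose A_count total is zero. A base R sketch of the same grouped ratio follows, assuming you would rather see `NA` than `Inf` in that case (that preference is my assumption, not something the question asks for):

```r
df <- data.frame(
  P_Common = c("B", "B", "C", "C", "D", "E", "E", "F",
               "G", "G", "B", "G", "E", "D", "F", "C"),
  Number_7 = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 1L, 3L, 1L, 2L, 1L, 1L),
  N_count  = c(0L, 4L, 22L, NA, 7L, 0L, 44L, 16L, NA, NA, NA, NA, NA, NA, NA, NA),
  A_count  = c(NA, NA, NA, NA, NA, NA, NA, NA, 0L, 4L, 7L, NA, 23L, 4L, 7L, 17L)
)

# Sum both count columns per (P_Common, Number_7) group, ignoring NAs.
# na.action = na.pass keeps rows that have an NA in either column.
totals <- aggregate(cbind(N_count, A_count) ~ P_Common + Number_7,
                    data = df, FUN = sum, na.rm = TRUE, na.action = na.pass)

# Divide the totals; a zero denominator becomes NA instead of Inf
totals$Propo <- with(totals, ifelse(A_count == 0, NA, N_count / A_count))
```

Summing first and dividing afterwards is what lets the two columns be combined even though no single row has both values.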
Let's say I have this data frame. How would I go about removing only the NA values associated with name a without physically removing them manually?
name ID1 ID2
a 1 4
a 7 3
a NA 4
a 6 3
a NA 4
a NA 3
a 2 4
a NA 3
a 1 4
b NA 2
c 3 NA
I've tried using the function !is.na, but that removes all the NA values in the column ID1 for all the names. How would I specifically target the ones that are associated with name a?
You could subset your data frame as follows:
df_new <- df[!(df$name == "a" & is.na(df$ID1)), ]
This can also be written as:
df_new <- df[df$name != "a" | !is.na(df$ID1), ]
With dplyr:
library(dplyr)
df %>%
filter(!(name == "a" & is.na(ID1)))
Or with subset:
subset(df, !(name == "a" & is.na(ID1)))
Output
name ID1 ID2
1 a 1 4
2 a 7 3
3 a 6 3
4 a 2 4
5 a 1 4
6 b NA 2
7 c 3 NA
Data
df <- structure(list(name = c("a", "a", "a", "a", "a", "a", "a", "a",
"a", "b", "c"), ID1 = c(1L, 7L, NA, 6L, NA, NA, 2L, NA, 1L, NA,
3L), ID2 = c(4L, 3L, 4L, 3L, 4L, 3L, 4L, 3L, 4L, 2L, NA)), class = "data.frame", row.names = c(NA,
-11L))
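The two base R subsets in this answer are the same condition written two ways (De Morgan's law: "not (A and B)" equals "not A, or not B"). A quick check on the sample data, as a sketch; the names `drop_form` and `keep_form` are mine:

```r
df <- data.frame(
  name = c("a", "a", "a", "a", "a", "a", "a", "a", "a", "b", "c"),
  ID1  = c(1L, 7L, NA, 6L, NA, NA, 2L, NA, 1L, NA, 3L),
  ID2  = c(4L, 3L, 4L, 3L, 4L, 3L, 4L, 3L, 4L, 2L, NA)
)

# "Drop rows where name is a AND ID1 is missing"
drop_form <- df[!(df$name == "a" & is.na(df$ID1)), ]

# "Keep rows where name is not a OR ID1 is present"
keep_form <- df[df$name != "a" | !is.na(df$ID1), ]

# Both forms select exactly the same rows
identical(drop_form, keep_form)
```

Note that the test uses `is.na(df$ID1)` rather than `df$ID1 == NA`, since any comparison with `NA` yields `NA` and cannot be used to pick rows.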
I am trying in vain to get the segments within each bar of the bar chart to order based on the value (largest value within a bar on the bottom, smallest on top).
I've researched this and would think this should work, but something is not right and I can't find the issue. I tried this solution here, but no luck.
Here is a reproducible example:
library(dplyr)
library(ggplot2)
my_repro <- structure(list(Date = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "2020-04-01", class = "factor"),
Grp = c("A", "A", "A", "B", "B", "B", "C", "C", "C", "D",
"D", "D", "E", "E", "E"), Segment = structure(c(1L, 2L, 3L,
1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L), .Label = c("Seg1",
"Seg2", "Seg3"), class = "factor", scores = structure(c(Seg1 = NA_real_,
Seg2 = NA_real_, Seg3 = NA_real_), .Dim = 3L, .Dimnames = list(
c("Seg1", "Seg2", "Seg3")))), Value = c(220, 75, NA,
NA, 400, NA, 350, NA, NA, 170, NA, NA, 375, 100,
NA)), row.names = c(NA, -15L), class = c("tbl_df", "tbl",
"data.frame"))
# Reorder the Group based on the Value
my_repro$Grp <- reorder(my_repro$Grp, -my_repro$Value)
#my_repro$Grp <- as.character(my_repro$Grp) # I later added this line too, no luck
# Plot
ggplot(my_repro, aes(x=Segment, y=Value, fill=Grp)) +
geom_col()
This gives the following tibble:
# A tibble: 15 x 4
Date Grp Segment Value
<fct> <fct> <fct> <dbl>
1 2020-04-01 A Seg1 220
2 2020-04-01 A Seg2 75
3 2020-04-01 A Seg3 NA
4 2020-04-01 B Seg1 NA
5 2020-04-01 B Seg2 400
6 2020-04-01 B Seg3 NA
7 2020-04-01 C Seg1 350
8 2020-04-01 C Seg2 NA
9 2020-04-01 C Seg3 NA
10 2020-04-01 D Seg1 170
11 2020-04-01 D Seg2 NA
12 2020-04-01 D Seg3 NA
13 2020-04-01 E Seg1 375
14 2020-04-01 E Seg2 100
15 2020-04-01 E Seg3 NA
And the following graph:
It appears that bars are being ordered alphabetically, which I know is an issue on many questions like this, but I thought this line of code would solve it: my_repro$Grp <- reorder(my_repro$Grp, -my_repro$Value)
I then added this line just before the plot code, so that Grp would not be a factor put in alphabetical order: my_repro$Grp <- as.character(my_repro$Grp), but I get the same plot.
Any idea how to fix?
Thanks!
Try the fct_reorder function from the forcats package:
library(dplyr)
library(forcats)
library(ggplot2)
my_repro <- my_repro %>%
group_by(Segment) %>%
mutate(Grp = fct_reorder(Grp, Value))
# Plot
ggplot(my_repro, aes(x=Segment, y=Value, fill=Grp)) +
geom_col()
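As for why the original `reorder(my_repro$Grp, -my_repro$Value)` call had no visible effect: `reorder` summarises each group with `mean` by default, and every Grp in the sample has at least one NA in Value, so every score is NA. A small base R sketch of that behaviour, using only the Grp and Value columns from the question (the names `plain` and `fixed` are mine):

```r
Grp   <- factor(rep(c("A", "B", "C", "D", "E"), each = 3))
Value <- c(220, 75, NA, NA, 400, NA, 350, NA, NA, 170, NA, NA, 375, 100, NA)

# Default FUN is mean(); one NA in a group makes that group's score NA
plain <- reorder(Grp, -Value)
attr(plain, "scores")

# Dropping NAs inside the summary gives each group a usable score,
# so the levels are ordered from largest to smallest mean Value
fixed <- reorder(Grp, -Value, FUN = function(v) mean(v, na.rm = TRUE))
levels(fixed)
```

This only sets one global level order for the fill variable; it does not by itself order the segments independently within each bar.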
I have this sample:
> a
Ship duration.minutes event Location
1 a NA enter Skagen
2 a 1616 trip <NA>
3 a 4308 stop Copenhagen
4 b 1646 trip <NA>
5 b 5751 stop Gdynia
6 b 75 trip <NA>
7 b 45666 stop Gdansk
8 c 2531 trip <NA>
9 c 5360 stop Szczecin
10 d 287 trip <NA>
I would like to add a new column called "Destination" and fill in the name of the destination on each trip row.
The output would be:
> output
Ship duration.minutes event Location Destination
1 a NA enter Skagen NA
2 a 1616 trip <NA> Copenhagen
3 a 4308 stop Copenhagen <NA>
4 b 1646 trip <NA> Gdynia
5 b 5751 stop Gdynia <NA>
6 b 75 trip <NA> Gdansk
7 b 45666 stop Gdansk <NA>
8 c 2531 trip <NA> Szczecin
9 c 5360 stop Szczecin <NA>
10 d 287 trip <NA> <NA>
This should work per Ship: for each trip row, Destination is the next Location recorded for that same ship, so ship a only gets destinations from ship a's own rows.
I tried with moves <- setDT(a)[, .(from = Location[-.N], to = Location[-1L]) , Ship] but it does not keep the column duration.minutes:
> dput(moves)
structure(list(Ship = c("a", "a", "b", "b", "b", "c"), from = structure(c(4L,
NA, NA, 3L, NA, NA), .Label = c("Copenhagen", "Gdansk", "Gdynia",
"Skagen", "Szczecin"), class = "factor"), to = structure(c(NA,
1L, 3L, NA, 2L, 5L), .Label = c("Copenhagen", "Gdansk", "Gdynia",
"Skagen", "Szczecin"), class = "factor")), row.names = c(NA,
-6L), class = c("data.table", "data.frame"), .Names = c("Ship",
"from", "to"), .internal.selfref = <pointer: 0x00000000003e0788>)
It looks like this:
> moves
Ship from to
1: a Skagen <NA>
2: a <NA> Copenhagen
3: b <NA> Gdynia
4: b Gdynia <NA>
5: b <NA> Gdansk
6: c <NA> Szczecin
The sample of the data called a is:
> dput(a)
structure(list(Ship = c("a", "a", "a", "b", "b", "b", "b", "c",
"c", "d"), duration.minutes = c(NA, 1616L, 4308L, 1646L, 5751L,
75L, 45666L, 2531L, 5360L, 287L), event = structure(c(1L, 3L,
2L, 3L, 2L, 3L, 2L, 3L, 2L, 3L), .Label = c("enter", "stop",
"trip"), class = "factor"), Location = structure(c(4L, NA, 1L,
NA, 3L, NA, 2L, NA, 5L, NA), .Label = c("Copenhagen", "Gdansk",
"Gdynia", "Skagen", "Szczecin"), class = "factor")), .Names = c("Ship",
"duration.minutes", "event", "Location"), row.names = c(NA, -10L
), class = c("data.table", "data.frame"))
I am afraid it is hard to work with setDT. Is there a way to keep the column duration.minutes?
I'm not sure if this covers all of your use cases, but you could use the lead function to capture the next value for each Ship. It also seems like it makes more sense to have all of the values in a single column, rather than separate Location and Destination columns.
library(tidyverse)
a %>%
group_by(Ship) %>%
mutate(Destination = lead(Location),
Location = coalesce(Location, Destination)) %>%
select(-Destination)
Ship duration.minutes event Location
<chr> <int> <fct> <fct>
1 a NA enter Skagen
2 a 1616 trip Copenhagen
3 a 4308 stop Copenhagen
4 b 1646 trip Gdynia
5 b 5751 stop Gdynia
6 b 75 trip Gdansk
7 b 45666 stop Gdansk
8 c 2531 trip Szczecin
9 c 5360 stop Szczecin
10 d 287 trip <NA>
If you want to keep separate columns, then you can shorten the code to this:
a %>%
group_by(Ship) %>%
mutate(Destination = lead(Location))
For the data sample you provided, fill can also create a single column in one step:
a %>%
group_by(Ship) %>%
fill(Location, .direction="up")
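If you would rather not load the tidyverse, the same per-ship "next value" idea can be sketched in base R with `ave`. This assumes Location is stored as character rather than factor, and `lead1` is a helper name made up for this illustration:

```r
a <- data.frame(
  Ship = c("a", "a", "a", "b", "b", "b", "b", "c", "c", "d"),
  duration.minutes = c(NA, 1616L, 4308L, 1646L, 5751L,
                       75L, 45666L, 2531L, 5360L, 287L),
  event = c("enter", "trip", "stop", "trip", "stop",
            "trip", "stop", "trip", "stop", "trip"),
  Location = c("Skagen", NA, "Copenhagen", NA, "Gdynia",
               NA, "Gdansk", NA, "Szczecin", NA),
  stringsAsFactors = FALSE
)

# lead1 is a hypothetical helper: shift a vector up by one, padding with NA
lead1 <- function(x) c(x[-1], NA)

# Apply the shift within each Ship so values never cross ship boundaries
a$Destination <- ave(a$Location, a$Ship, FUN = lead1)
```

All original columns, including duration.minutes, are kept because the new column is added in place.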
I would like to create a new column for a data frame with values from the intersection of a row and a column.
I have a data.frame called "time":
q 1 2 3 4 5
a 1 13 43 5 3
b 2 21 12 3353 34
c 3 21 312 123 343
d 4 123 213 123 35
e 4556 11 123 12 3
And another table, called "event":
q dt
a 1
b 3
c 4
d 2
e 1
I want to add another column called inter to the second table, filled with the values found at the intersection of each row's q and the column of the first data.frame named by dt. So the result would be this:
q dt inter
a 1 1
b 3 12
c 4 123
d 2 123
e 1 4556
I have tried merge(event, time, by.x = "q", by.y = "dt"), but it generates an error saying that they aren't the same id. I have also tried transposing the time data.frame to cross-reference the values, but without success.
library(reshape2)
merge(event, melt(time, id.vars = "q"),
by.x=c('q','dt'), by.y=c('q','variable'), all.x = TRUE)
Output:
q dt value
1 a 1 1
2 b 3 12
3 c 4 123
4 d 2 123
5 e 1 4556
Notes
We use the function melt from the package reshape2 to convert the data frame time from wide to long format. We then merge (left outer join) the data frames event and the melted time by two columns (q and dt in event, q and variable in the melted time).
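For comparison, the same lookup can be sketched in base R without reshaping, by indexing a matrix with (row, column) pairs. This is an alternative technique, not the reshape2 approach above; `m`, `rows` and `cols` are names introduced here, and the sample data is rebuilt with character q columns:

```r
time <- data.frame(
  q   = c("a", "b", "c", "d", "e"),
  `1` = c(1L, 2L, 3L, 4L, 4556L),
  `2` = c(13L, 21L, 21L, 123L, 11L),
  `3` = c(43L, 12L, 312L, 213L, 123L),
  `4` = c(5L, 3353L, 123L, 123L, 12L),
  `5` = c(3L, 34L, 343L, 35L, 3L),
  check.names = FALSE, stringsAsFactors = FALSE
)
event <- data.frame(q = c("a", "b", "c", "d", "e"),
                    dt = c(1L, 3L, 4L, 2L, 1L),
                    stringsAsFactors = FALSE)

# Value columns only, as a matrix whose column names are "1" to "5"
m <- as.matrix(time[, -1])

# Row position of each event's q, and column position of each event's dt
rows <- match(event$q, time$q)
cols <- match(as.character(event$dt), colnames(m))

# A two-column index matrix picks one cell per (row, column) pair
event$inter <- m[cbind(rows, cols)]
```

Matching on names rather than positions means this still works if the rows of event are in a different order from those of time.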
Data:
time <- structure(list(q = structure(1:5, .Label = c("a", "b", "c", "d",
"e"), class = "factor"), `1` = c(1L, 2L, 3L, 4L, 4556L), `2` = c(13L,
21L, 21L, 123L, 11L), `3` = c(43L, 12L, 312L, 213L, 123L), `4` = c(5L,
3353L, 123L, 123L, 12L), `5` = c(3L, 34L, 343L, 35L, 3L)), .Names = c("q",
"1", "2", "3", "4", "5"), class = "data.frame", row.names = c(NA,
-5L))
event <- structure(list(q = structure(1:5, .Label = c("a", "b", "c", "d",
"e"), class = "factor"), dt = c(1L, 3L, 4L, 2L, 1L)), .Names = c("q",
"dt"), class = "data.frame", row.names = c(NA, -5L))
This may be a little clunky but it works:
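# merge once, outside the loop; columns of xx are q, 1..5, dt
xx <- merge(time, event, by = 'q')
dt <- xx$dt
inter <- c()
for (i in 1:nrow(xx)) {
# column dt[i] + 1 because the first column of xx is q
z <- xx[i, dt[i] + 1]
inter <- c(inter, z)
}
# q is a factor, so cbind stores its integer codes
final <- cbind(time[, 1], dt, inter)
colnames(final) <- c('q', 'dt', 'inter')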
Hope it helps.
Output:
q dt inter
[1,] 1 1 1
[2,] 2 3 12
[3,] 3 4 123
[4,] 4 2 123
[5,] 5 1 4556