I need to fill in the missing dates in a data frame by duplicating rows, so that every date in the range is present.
Suppose this df:
df <- data.frame(date = c("2022-07-05", "2022-07-07", "2022-07-11", "2022-07-15", "2022-07-18"), letter = c("a", "b", "a", "b", "c"))
The desired output is this df_new:
df_new <- data.frame(date = c("2022-07-05", "2022-07-06",
"2022-07-07", "2022-07-08", "2022-07-09", "2022-07-10",
"2022-07-11", "2022-07-12", "2022-07-13", "2022-07-14",
"2022-07-15"),
letter = c("a", "a",
"b", "b", "b", "b",
"a", "a", "a", "a",
"c"))
Could you please help?
We could use complete() from tidyr to expand the data to every date between the minimum and maximum date, in steps of '1 day', and then use fill() to replace the NA elements in 'letter' with the previous non-NA element.
library(dplyr)
library(tidyr)
df %>%
  mutate(date = as.Date(date)) %>%
  complete(date = seq(min(date), max(date), by = '1 day')) %>%
  fill(letter)
I am trying to generate a running number in dplyr using row_number(), but the results are not as desired. I would like, say, cat A to have 1, 2, 3 in var1. Any leads?
library(dplyr)
df <- tibble(
cat = c("A", "B", "C", "A", "A", "B"),
date = seq.Date(Sys.Date(), Sys.Date() + 5,
by = 1),
age = c(12, 13, 34,23,32,34)
)
df <- df %>%
arrange(cat, date) %>%
group_by(cat, date) %>%
mutate(var1 = row_number())
df
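A possible fix, offered as a sketch rather than a confirmed answer: every cat/date pair is unique in this data, so grouping by both columns puts each row in its own group and row_number() returns 1 everywhere. Assuming the goal is a counter that restarts for each cat, grouping by cat alone gives 1, 2, 3 for A. The name df_numbered is only illustrative.

```r
library(dplyr)

# Same example data as in the question
df <- tibble(
  cat  = c("A", "B", "C", "A", "A", "B"),
  date = seq.Date(Sys.Date(), Sys.Date() + 5, by = 1),
  age  = c(12, 13, 34, 23, 32, 34)
)

df_numbered <- df %>%
  arrange(cat, date) %>%        # order rows so the counter follows the dates
  group_by(cat) %>%             # group by cat only, not by cat and date
  mutate(var1 = row_number()) %>%
  ungroup()

df_numbered
```

The arrange() call decides the order in which rows are numbered within each cat; dropping date from group_by() is the only change to the original pipeline, apart from the final ungroup().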
I have a spark dataframe I'm manipulating using sparklyr that looks like the following:
input_data <- data.frame(id = c(10,10,10,20,20,30,30,40,40,40,50,60,70, 80,80,80,100,100,110,110,120,120,120,130,140,150,160,170),
date = c("2021-01-01","2021-01-02","2021-01-03","2021-01-01","2021-01-02","2021-01-01","2021-01-02","2021-01-02","2021-01-01","2021-01-02","2021-01-01","2021-01-02","2021-01-05","2021-01-01","2021-01-02","2021-01-03","2021-01-01","2021-01-02","2021-01-01","2021-01-02","2021-01-02","2021-01-01","2021-01-02","2021-01-01","2021-01-02","2021-01-05","2021-01-01","2021-01-05"),
group = c("A", "B", "C", "B", "C", "A", "C", "A", "A", "A", "C", "A","B","A", "B", "C", "B", "C", "A", "C", "A", "A", "A", "C", "A", "A", "B","A"),
event = c(1,1,1,0,1,0,1,0,0,1,1,1,0,1,1,1,0,1,0,1,0,0,1,1,1,1,1,0))
I'd like to aggregate the data such that I have a count of the number of "events" (where event == 1 ) and "non_events" (where event == 0) for each combination such that the final output looks like the following:
data.frame(group_a = c(1,0,0,1,0,1),
group_b = c(0,1,0,1,1,0),
group_c = c(0,0,1,0,1,1),
event_occured = c(3,1,2,0,2,2),
event_not_occured = c(4,2,2,0,2,2))
So, for example, there were no combinations where A and B were groups for the same ID, so that combination gets a 0 for event and non_event. There were 4 IDs in which group A was involved, of which 3 resulted in an event and 1 resulted in a non_event, and so on.
What approach using sparklyr (or dplyr or pyspark) would allow for aggregation as described above? I tried the following but I'm getting the exact same number of event as event_not_occurred so I must be doing something wrong but can't pinpoint it:
combo_path_sdf <- input_data %>%
group_by(id) %>%
arrange(date) %>%
mutate(order_seq = ifelse(event > 0, 1, NA)) %>%
mutate(order_seq = lag(cumsum(ifelse(is.na(order_seq), 0, order_seq)))) %>%
mutate(order_seq = ifelse((row_number() == 1) & (event > 0), -1, ifelse(row_number() == 1, 0, order_seq))) %>%
ungroup()
combo_path_sdf %>%
group_by(id, order_seq) %>%
summarize(group_a = max(ifelse(group_a == "A", 1, 0)),
group_b = max(ifelse(group_b == "B", 1, 0)),
group_c = max(ifelse(group_c == "C", 1, 0)),
events = sum(event)) %>%
group_by(order_seq, group_a, group_b, group_c) %>%
summarize(event = sum(events),
total_sequences = n()) %>%
mutate(event_not_occured = total_sequences - event)
Final output in the following format would be ok too:
data.frame(group_a = c("A", "B", "C", "A,B", "B,C", "A,C"),
event_occured = c(3,1,2,1,2,2),
event_not_occured = c(4,2,2,1,2,2))
(image below for A,B is incorrect, should be 1,1 not 0,0)
The following matches your requested output format and processes the data in the way I understand you want but, as per the comment by @Martin Gal, it does not match the example result you provided.
input_data %>%
group_by(id) %>%
summarise(group_a = max(ifelse(group == 'A', 1, 0)),
group_b = max(ifelse(group == 'B', 1, 0)),
group_c = max(ifelse(group == 'C', 1, 0)),
event_occured = sum(ifelse(event == 1, 1, 0)),
event_not_occured = sum(ifelse(event == 0, 1, 0)),
.groups = "drop") %>%
group_by(group_a, group_b, group_c) %>%
summarise(event_occured = sum(event_occured),
event_not_occured = sum(event_not_occured),
.groups = "drop")
The idea is a two-step summary process. The first summarise creates an indicator for each group within each id and counts the number of events and non-events. The second summarise combines all ids that share the same group indicators.
Regarding the code you are using that produces the same number of events and non-events: take a look at hts_combined. It is not defined in the code you have shared, so your script might be reading a variable from elsewhere.
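For the alternative output format mentioned in the question (one label such as "A,B" per combination), the same two-step idea can be sketched by pasting the groups seen for each id into a single string. This is a minimal sketch on a local data frame: the toy data and the names toy, groups and combos are made up for illustration, and whether paste() with collapse translates to Spark SQL through sparklyr is not something I have checked.

```r
library(dplyr)

# Hypothetical toy data, not the question's input_data
toy <- data.frame(
  id    = c(1, 1, 2, 3, 3, 4),
  group = c("A", "B", "A", "B", "C", "A"),
  event = c(1, 0, 1, 0, 1, 0)
)

combos <- toy %>%
  group_by(id) %>%
  # Step 1: one row per id with its group combination and event counts
  summarise(groups = paste(sort(unique(group)), collapse = ","),
            event_occured = sum(event == 1),
            event_not_occured = sum(event == 0),
            .groups = "drop") %>%
  group_by(groups) %>%
  # Step 2: add up the counts of all ids sharing the same combination
  summarise(event_occured = sum(event_occured),
            event_not_occured = sum(event_not_occured),
            .groups = "drop")

combos
```

Sorting the unique groups before pasting means an id seen as B then A gets the same "A,B" label as one seen as A then B.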
I have two sets of data. Each contains a column for the name of the molecule and a column for the number of times that molecule appears in the sample. I want to create a scatterplot with the number of times a molecule appears in dataset #1 on the x-axis and the number of times it appears in dataset #2 on the y-axis. If a molecule is in one dataset and not the other, it appears 0 times in the dataset that lacks it.
Example:
dat1 <- data.frame(
  name = c("A", "B", "D", "E"),
  count = c(10, 1, 30, 10)
)
dat2 <- data.frame(
  name = c("A", "B", "C", "F"),
  count = c(1, 3, 50, 40)
)
Point #1 would be (10,1) corresponding to A, Point #2 would be (1,3), Point #3 would be (0,50) and so on. I don't want to label my points since my datasets contain tens of thousands of molecules.
Try joining the data.frames
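library(dplyr)
library(ggplot2)
# keep every molecule from both data sets, then turn missing counts into 0
full_join(dat1, dat2, by = "name") %>%
  mutate_all(function(xx) ifelse(is.na(xx), 0, xx)) %>%
  ggplot(aes(count.x, count.y)) +
  geom_point()
which produces a scatterplot of count.x against count.y.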
You would need a full_join():
library(dplyr)
library(ggplot2)
#Data
dat1 <- data.frame(
name = c("A", "B", "D", "E"),
count = c(10, 1, 30, 10)
)
dat2 <- data.frame(
name = c("A", "B", "C", "F"),
count = c(1, 3, 50, 40)
)
#Code
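dat1 %>% full_join(dat2 %>% rename(count2 = count), by = "name") %>%
  replace(is.na(.), 0) %>%
  ggplot(aes(x = count, y = count2)) +
  geom_point() +
  geom_text(aes(label = name), vjust = -0.5)
Output: a scatterplot of count against count2, with each point labeled by name.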
I have two data sets from which I would like to generate histograms showing how the data overlap by name (A, B, C). I have written a custom function so I can use ggplot with map2.
I would like the graphs to be titled according to the name of each data set, so "A", "B", "C". Does anyone know of a way to do this?
# load packages
library(ggplot2)
library(dplyr)
library(purrr)
## load and format data 1
df1_raw <- data.frame(name = c("A", "B", "C", "A", "C", "B"),
start = c(1, 3, 4, 5, 2, 1),
end = c(6, 5, 7, 8, 6, 7))
df1 <- split(x = df1_raw, f = df1_raw$name) # split data by name
df1 <- lapply(df1, function(x) Map(seq.int, x$start, x$end)) # generate sequence intervals
df1 <- map(df1, unlist) # unlist sequences
df1 <- lapply(df1, data.frame) # convert to df
## load and format data 2
df2_raw <- data.frame(name = c("C", "B", "C", "A", "A", "B"),
start = c(5, 4, 3, 4, 4, 5),
end = c(7, 8, 7, 6, 9, 6))
df2 <- split(x = df2_raw, f = df2_raw$name) # split data by name
df2 <- lapply(df2, function(x) Map(seq.int, x$start, x$end)) # generate sequence intervals
df2 <- map(df2, unlist) # unlist sequences
df2 <- lapply(df2, data.frame) # convert to df
## write custom ggplot function and generate graphs
gplot <- function(data1, data2) {
ggplot() +
geom_histogram(data = data1, aes(x = X..i..), binwidth = 1, color = "grey", fill = "grey") +
geom_histogram(data = data2, aes(x = X..i..), binwidth = 1, fill = "pink", alpha = 0.7) +
labs(
title = ls(data1))
}
hist <- map2(df1, df2, gplot)
I also tried the following in the title field in my function:
deparse(substitute(data1))
Another option, similar to what @GregorThomas mentioned in the comments: you could add a name variable to your data.frames and pull from that in your gplot() function. I've also shown how you might combine a few of your data manipulation steps:
# load packages
library(ggplot2)
library(dplyr)
library(purrr)
## load and format data 1
df1_raw <- data.frame(name = c("A", "B", "C", "A", "C", "B"),
start = c(1, 3, 4, 5, 2, 1),
end = c(6, 5, 7, 8, 6, 7))
df1 <- df1_raw %>%
split(.$name) %>% # split data by name
imap(function(x, x_name) {
data.frame(value = Map(seq.int, x$start, x$end) %>% unlist,
name = x_name)
})
## load and format data 2
df2_raw <- data.frame(name = c("C", "B", "C", "A", "A", "B"),
start = c(5, 4, 3, 4, 4, 5),
end = c(7, 8, 7, 6, 9, 6))
df2 <- df2_raw %>%
split(.$name) %>% # split data by name
imap(function(x, x_name) {
data.frame(value = Map(seq.int, x$start, x$end) %>% unlist,
name = x_name)
})
## change the title component of your previous function
gplot <- function(data1, data2) {
ggplot() +
geom_histogram(data = data1, aes(x = value), binwidth = 1, color = "grey", fill = "grey") +
geom_histogram(data = data2, aes(x = value), binwidth = 1, fill = "pink", alpha = 0.7) +
ggtitle(data1$name[1])
}
## plot it
map2(df1, df2, gplot)
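If you would rather not store the name inside each data.frame, another way to sketch this is to let purrr::imap() hand the list name to the plotting function. The small lists below are hypothetical stand-ins for df1 and df2, and gplot_named and plots are illustrative names; the sketch assumes both lists carry the same names.

```r
library(ggplot2)
library(purrr)

# Hypothetical stand-ins for the named lists df1 and df2 built above
df1 <- list(A = data.frame(value = c(1, 2, 2, 3)),
            B = data.frame(value = c(3, 4, 5)))
df2 <- list(A = data.frame(value = c(2, 3, 3)),
            B = data.frame(value = c(4, 4, 6)))

# Same layers as gplot(), but the title is passed in as an argument
gplot_named <- function(data1, data2, title) {
  ggplot() +
    geom_histogram(data = data1, aes(x = value), binwidth = 1,
                   color = "grey", fill = "grey") +
    geom_histogram(data = data2, aes(x = value), binwidth = 1,
                   fill = "pink", alpha = 0.7) +
    ggtitle(title)
}

# imap() supplies each element of df1 together with its name;
# the name is reused to pick the matching element of df2
plots <- imap(df1, function(d1, nm) gplot_named(d1, df2[[nm]], nm))
```

The result is a named list of plots, so plots$A is the histogram titled "A".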