I have a dataframe with approximately 3 million rows. Each row is assigned a unique ID and has up to 4 dates. I wish to create a set of new columns for month and year (i.e. Jan-21, Feb-21, Mar-21, etc) and assign a value of "0" for each month/year prior to the first date, and then a value of "1" for the month/year containing the date for each ID, and maintain the value of "1" in each subsequent month/year column until the next column that matches the 2nd date.
I understand that it's easier to help me with examples, so I have put together this dput output with an example of what my current data looks like:
structure(list(id = c(1, 2, 3, 4, 5), date1 = structure(c(1623801600,
1615420800, 1654560000, 1620259200, 1615248000), class = c("POSIXct",
"POSIXt"), tzone = "UTC"), date2 = structure(c(1629158400, 1621987200,
1658448000, 1623974400, NA), class = c("POSIXct", "POSIXt"), tzone = "UTC"),
date3 = structure(c(NA, 1630454400, 1662076800, 1647907200,
NA), class = c("POSIXct", "POSIXt"), tzone = "UTC"), date4 = structure(c(NA,
1639008000, NA, NA, NA), class = c("POSIXct", "POSIXt"), tzone = "UTC")), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -5L))
And this is what I would like it to look like:
structure(list(id = c(1, 2, 3, 4, 5), `Mar-21` = c(0, 1, 0, 0,
1), `Apr-21` = c(0, 1, 0, 0, 1), `May-21` = c(0, 2, 0, 1, 1),
`Jun-21` = c(1, 2, 0, 2, 1), `Jul-21` = c(1, 2, 0, 2, 1),
`Aug-21` = c(2, 2, 0, 2, 1), `Sep-21` = c(2, 3, 0, 2, 1),
`Oct-21` = c(2, 3, 0, 2, 1), `Nov-21` = c(2, 3, 0, 2, 1),
`Dec-21` = c(2, 4, 0, 2, 1), `Jan-22` = c(2, 4, 0, 2, 1),
`Feb-22` = c(2, 4, 0, 2, 1), `Mar-22` = c(2, 4, 0, 3, 1),
`Apr-22` = c(2, 4, 0, 3, 1), `May-22` = c(2, 4, 0, 3, 1),
`Jun-22` = c(2, 4, 1, 3, 1), `Jul-22` = c(2, 4, 2, 3, 1),
`Aug-22` = c(2, 4, 2, 3, 1), `Sep-22` = c(2, 4, 3, 3, 1)), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -5L))
Just a note that I have this dataset in both wide and long format, in case using it in a long format makes more sense.
Thank you!
This was a fun exercise! I'm sure there are more efficient ways to do this, but I think this works and it was a fun puzzle for me. I first put the dates into long format to get the min and max. Then I made a sequence of months spanning that range. I then used expand.grid to make all combinations of the months with each ID and joined that to the original data frame. Then I just counted how many of date1 to date4 fall on or before each month in the sequence. I had to use floor_date to change date1 to date4 to the first of the month. Hopefully this helps!
library(dplyr)
library(lubridate)
library(tidyr)
# Long format, used only to find the earliest and latest date
dat2 <- dat %>%
  tidyr::pivot_longer(cols = -id, values_drop_na = TRUE)
dat_min_max <- data.frame("Min" = min(dat2$value), "Max" = max(dat2$value))
# One timestamp per month; the extra month makes sure the last month is covered
month_seq <- seq(dat_min_max$Min, dat_min_max$Max + months(1), by = "month")
dat3 <- dat %>%
  # Floor each date to the first of its month
  mutate(date1 = floor_date(date1, "month"),
         date2 = floor_date(date2, "month"),
         date3 = floor_date(date3, "month"),
         date4 = floor_date(date4, "month")
  ) %>%
  # One row per id/month combination
  left_join(expand.grid(dat$id, month_seq), by = c("id" = "Var1")) %>%
  rowwise() %>%
  # Count how many of the four dates fall on or before each month
  mutate(c = sum(date1 <= Var2, date2 <= Var2, date3 <= Var2, date4 <= Var2, na.rm = TRUE)) %>%
  mutate(Var2 = format(Var2, "%b-%y")) %>%
  select(-date1, -date2, -date3, -date4) %>%
  tidyr::pivot_wider(names_from = Var2, values_from = c)
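To make the counting step concrete, here is a minimal base R sketch for a single ID. It assumes the two dates of id 1 from the example (June and August 2021), already floored to the first of the month, and a shortened month range. The names dates, month_starts and counts are illustrative only and are not part of the code above.

```r
# Dates for id 1, floored to the first of the month (date3 and date4 are missing)
dates <- as.Date(c("2021-06-01", "2021-08-01", NA, NA))

# Shortened month range for illustration: March to September 2021
month_starts <- seq(as.Date("2021-03-01"), as.Date("2021-09-01"), by = "month")

# For each month, count how many dates fall on or before it
counts <- vapply(month_starts,
                 function(m) sum(dates <= m, na.rm = TRUE),
                 integer(1))
counts
# 0 0 0 1 1 2 2
```

The result matches the first row of the desired output for Mar-21 through Sep-21.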
I have this dataframe:
df <- data.frame(PatientID = c("3454","345","5","348","567","79"),
clas1 = c(1, 0, 5, NA, NA, 4),
clas2 = c(4, 1, 0, 3, 1, 0),
clas3 = c(1, NA, 0, 5, 5, 5), stringsAsFactors = F)
I would like to create a heatmap, with patient ID in the x axis and clas1, clas2 and clas3 in the y axis. The values represented in the heat map would be the raw value of each "clas". Here I post a drawing of what I would like
I apologise because I don't have available more colours to represent this, but this is only an example and any colour scale could be used.
An important thing is that I would like to distinguish between zeros and NAs so ideally NAs have their own colour or appear in white (empty).
I hope this is understandable enough.
But any questions just ask
Many thanks!
df <- data.frame(PatientID = c("3454","345","5","348","567","79"),
clas1 = c(1, 0, 5, NA, NA, 4),
clas2 = c(4, 1, 0, 3, 1, 0),
clas3 = c(1, NA, 0, 5, 5, 5), stringsAsFactors = F)
library(tidyverse)
df %>% pivot_longer(!PatientID) %>%
ggplot(aes(x= PatientID, y = name, fill = value)) +
geom_tile()
Created on 2021-05-25 by the reprex package (v2.0.0)
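Since the question asks for NAs to be distinguishable from zeros, one possible tweak (a sketch, not part of the answer above) is to set na.value on the fill scale so missing tiles get their own colour. The colour choices here are arbitrary assumptions.

```r
library(tidyr)
library(ggplot2)

df <- data.frame(PatientID = c("3454", "345", "5", "348", "567", "79"),
                 clas1 = c(1, 0, 5, NA, NA, 4),
                 clas2 = c(4, 1, 0, 3, 1, 0),
                 clas3 = c(1, NA, 0, 5, 5, 5), stringsAsFactors = FALSE)

long <- pivot_longer(df, !PatientID)

p <- ggplot(long, aes(x = PatientID, y = name, fill = value)) +
  geom_tile(colour = "grey80") +                     # thin border around each tile
  scale_fill_gradient(low = "lightyellow", high = "darkred",
                      na.value = "white")            # NA tiles drawn in white
p
```

Zeros then take the low end of the gradient, while NAs appear as white tiles.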
Here is a base R option with heatmap:
heatmap(t(`row.names<-`(as.matrix(df[-1]), df$PatientID)))
# Which is like
# x <- as.matrix(df[-1])
# row.names(x) <- df$PatientID
# heatmap(t(x))
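Note that by default heatmap() reorders rows and columns by hierarchical clustering and scales each row. If the raw values and the original patient order are wanted, as the question describes, a sketch along these lines switches both off (the name m is illustrative):

```r
df <- data.frame(PatientID = c("3454", "345", "5", "348", "567", "79"),
                 clas1 = c(1, 0, 5, NA, NA, 4),
                 clas2 = c(4, 1, 0, 3, 1, 0),
                 clas3 = c(1, NA, 0, 5, 5, 5), stringsAsFactors = FALSE)

# Rows = clas1..clas3, columns = patients
m <- t(as.matrix(df[-1]))
colnames(m) <- df$PatientID

# Rowv/Colv = NA: no dendrograms, no reordering; scale = "none": raw values
heatmap(m, Rowv = NA, Colv = NA, scale = "none")
```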
Preparing the data
I'll give 4 options; in all four you need to assign the row names and remove the ID column, i.e.:
df <- data.frame(PatientID = c("3454","345","5","348","567","79"),
clas1 = c(1, 0, 5, NA, NA, 4),
clas2 = c(4, 1, 0, 3, 1, 0),
clas3 = c(1, NA, 0, 5, 5, 5), stringsAsFactors = F)
rownames(df) <- df$PatientID
df$PatientID <- NULL
df
The output is:
> df
     clas1 clas2 clas3
3454     1     4     1
345      0     1    NA
5        5     0     0
348     NA     3     5
567     NA     1     5
79       4     0     5
Base R
With base R (decent output):
heatmap(as.matrix(df))
gplots
With gplots (a bit ugly, but many more parameters to control):
library(gplots)
heatmap.2(as.matrix(df))
heatmaply
With heatmaply you have nicer defaults to use for the dendrograms (it also organizes them in a more "optimal" way).
You can learn more about the package here.
Static
Static heatmap with heatmaply (better defaults, IMHO)
library(heatmaply)
ggheatmap(df)
Now with colored dendrograms
library(heatmaply)
ggheatmap(df, k_row = 3, k_col = 2)
With no dendrogram:
library(heatmaply)
ggheatmap(df, dendrogram = F)
Interactive
Interactive heatmap with heatmaply (hover tooltip, and the ability to zoom - it's interactive!):
library(heatmaply)
heatmaply(df)
And anything you can do with the static ggheatmap you can also do with the interactive heatmaply version.
Here is another option:
df <- data.frame(PatientID = c("3454","345","5","348","567","79"),
clas1 = c(1, 0, 5, NA, NA, 4),
clas2 = c(4, 1, 0, 3, 1, 0),
clas3 = c(1, NA, 0, 5, 5, 5), stringsAsFactors = F)
# named vector for heatmap
cols <- c("0" = "white",
"1" = "green",
"2" = "orange",
"3" = "yellow",
"4" = "pink",
"5" = "black",
"99" = "grey")
labels_legend <- c("0" = "0",
"1" = "1",
"2" = "2",
"3" = "3",
"4" = "4",
"5" = "5",
"99" = "NA")
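library(tidyverse)

# Long format; NA is recoded as 99 so that it picks up the "99" colour and the "NA" legend label
df1 <- df %>%
  pivot_longer(
    cols = starts_with("clas"),
    names_to = "names",
    values_to = "values"
  ) %>%
  mutate(values = replace_na(values, 99),
         PatientID = factor(PatientID, levels = c("3454", "345", "5", "348", "567", "79")))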
ggplot(
df1,
aes(factor(PatientID), factor(names))) +
geom_tile(aes(fill= factor(values))) +
# geom_text(aes(label = values), size = 5, color = "black") + # text in tiles
scale_fill_manual(
values = cols,
breaks = c("0", "1", "2", "3", "4", "5", "99"),
labels = labels_legend,
aesthetics = c("colour", "fill"),
drop = FALSE
) +
scale_y_discrete(limits=rev) +
coord_equal() +
theme(line = element_blank(),
title = element_blank()) +
theme(legend.direction = "horizontal", legend.position = "bottom")
I am new here and still learning R, and I have run into an error.
Here is what I get from console
Don't know how to automatically pick scale for object of type haven_labelled/vctrs_vctr/double. Defaulting to continuous.
I don't know what I can do to make it work. I want to get a scatterplot.
ggplot(data = diagnoza, aes(x = Plecc, y = P32.01))
Don't know how to automatically pick scale for object of type haven_labelled/vctrs_vctr/double. Defaulting to continuous.
Adding geom_point as suggested by @zx8754 gives me a scatter plot. The message you reported still appears; it is related to some of your variables being of type haven_labelled, so I guess you imported your data from SPSS.
To get rid of this warning you could convert your variables to R factors using haven::as_factor. Probably it would be best to do that for the whole dataset after importing your data.
diagnoza <- structure(list(Plecc = c(2, 2, 2, 1, 2, 1, 1, 1, 2, 2, 1, 2,
1, 1, 1, 1, 2, 1, 1, 2), P32.01 = structure(c(3, 4, 5, 5, 5,
5, 5, 4, 3, 5, 3, 4, 3, 4, 5, 5, 5, 3, 4, 5), label = "P32.01. odpoczynek w domu (oglądanie TV)", format.spss = "F1.0", display_width = 12L, labels = c(Nigdy = 1,
Rzadko = 2, `Od czasu do czasu` = 3, Często = 4, `Bardzo często` = 5
), class = c("haven_labelled", "vctrs_vctr", "double"))), row.names = c(NA,
-20L), class = c("tbl_df", "tbl", "data.frame"))
library(haven)
library(ggplot2)
# Convert labelled vector to a factor
diagnoza$P32.01 <- haven::as_factor(diagnoza$P32.01)
ggplot(data = diagnoza, aes(x = Plecc, y = P32.01)) +
geom_point()
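For the whole-dataset conversion mentioned above, haven::as_factor() can also be applied to a data frame, in which case it converts only the labelled columns and leaves the rest alone. A minimal sketch with a three-row stand-in for diagnoza (the shortened data is an assumption for illustration):

```r
library(haven)

# Small stand-in for the imported data: one plain column, one labelled column
diagnoza <- tibble::tibble(
  Plecc  = c(2, 1, 2),
  P32.01 = labelled(c(3, 4, 5),
                    labels = c("Nigdy" = 1, "Rzadko" = 2, "Od czasu do czasu" = 3,
                               "Często" = 4, "Bardzo często" = 5))
)

# Convert every labelled column to a factor in one call
diagnoza <- as_factor(diagnoza)
str(diagnoza)
```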
I am working with a long-format longitudinal dataset where each person has 1, 2 or 3 time points. In order to perform certain analyses I need to make sure that each person has the same number of rows, even if some rows consist of NAs because the person did not complete a given time point.
Here is a sample of the data before adding the rows:
structure(list(Values = c(23, 24, 45, 12, 34, 23), P_ID = c(1,
1, 2, 2, 2, 3), Event_code = c(1, 2, 1, 2, 3, 1), Site_code = c(1,
1, 3, 3, 3, 1)), class = "data.frame", row.names = c(NA, -6L))
This is the data I aim to get after adding the relevant rows:
structure(list(Values = c(23, 24, NA, 45, 12, 34, 23, NA, NA),
P_ID = c(1, 1, 1, 2, 2, 2, 3, 3, 3), Event_code = c(1, 2,
3, 1, 2, 3, 1, 2, 3), Site_code = c(1, 1, 1, 3, 3, 3, 1,
1, 1)), class = "data.frame", row.names = c(NA, -9L))
I want to come up with code that would automatically add rows to the dataset conditionally on whether the participant has had 1, 2 or 3 visits. Ideally it would make rest of data all NAs while copying Participant_ID and site_code but if not possible I would be satisfied just with creating the right number of rows.
We could use fill after doing a complete
library(dplyr)
library(tidyr)
ExpandedDataset %>%
complete(P_ID, Event_code) %>%
fill(Site_code)
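As a self-contained sketch using the question's sample data (here called dat, an assumed name), grouping by P_ID before fill() keeps a site code from leaking from one participant into the next, which could otherwise happen if a participant's first event were the missing one.

```r
library(dplyr)
library(tidyr)

dat <- data.frame(Values = c(23, 24, 45, 12, 34, 23),
                  P_ID = c(1, 1, 2, 2, 2, 3),
                  Event_code = c(1, 2, 1, 2, 3, 1),
                  Site_code = c(1, 1, 3, 3, 3, 1))

out <- dat %>%
  complete(P_ID, Event_code) %>%                 # one row per participant/event pair
  group_by(P_ID) %>%
  fill(Site_code, .direction = "downup") %>%     # copy the site code within each participant
  ungroup()
out
```

The added rows keep NA in Values, while P_ID, Event_code and Site_code are filled in.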
I came up with fairly long code, but you could wrap it in a function to make it easier to use:
Here's your dataframe:
df <- data.frame(ID = c(rep("P1", 2), rep("P2", 3), "P3"),
Event = c("baseline", "visit 2", "baseline", "visit 2", "visit 3", "baseline"),
Event_code = c(1, 2, 1, 2, 3, 1),
Site_code = c(1, 1, 2, 2, 2, 1))
How many records do you have per ID?
values <- summary(factor(df$ID))  # factor() is needed: since R 4.0, data.frame() keeps strings as character
What is the maximum number of records for a single patient?
target <- max(values)
Which specific patients have fewer records than the maximum?
uncompliant <- names(which(values<target))
And how many records do you have for those patients who have missing information?
rowcount <- values[which(values<target)]
So now, let's create the vectors of the data frame we will add to your original one. First, IDs:
IDs <- vector()
for(i in seq_along(rowcount)){
  # repeat each ID once per missing record
  y <- rep(uncompliant[i], target - rowcount[i])
  IDs <- c(IDs, y)
}
And now, the site codes:
SC <- vector()
for(i in seq_along(rowcount)){
  # repeat the patient's site code once per missing record
  y <- rep(unique(df$Site_code[which(df$ID == uncompliant[i])]), target - rowcount[i])
  SC <- c(SC, y)
}
Finally, a data frame with the values we will introduce:
introduce <- data.frame(ID = IDs, Event = rep(NA, length(IDs)),
Event_code = rep(NA, length(IDs)),
Site_code = SC)
Combine the original dataframe with the new values to be added and sort it so it looks nice:
final <- as.data.frame(rbind(df, introduce))
final <- final[order(final$ID), ]
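The same padding can be sketched more compactly in base R by merging against a full ID by event grid. Note that this is an alternative technique, not the loop-based approach above, and unlike that approach it fills in Event_code for the added rows. The names grid and full are illustrative.

```r
df <- data.frame(ID = c("P1", "P1", "P2", "P2", "P2", "P3"),
                 Event_code = c(1, 2, 1, 2, 3, 1),
                 Site_code = c(1, 1, 2, 2, 2, 1))

# Every combination of patient and event code
grid <- expand.grid(ID = unique(df$ID),
                    Event_code = unique(df$Event_code),
                    stringsAsFactors = FALSE)

# Keep all grid rows; combinations absent from df get NA in the other columns
full <- merge(grid, df, by = c("ID", "Event_code"), all.x = TRUE)

# Copy each patient's site code onto every row, including the added ones
full$Site_code <- df$Site_code[match(full$ID, df$ID)]
full <- full[order(full$ID, full$Event_code), ]
full
```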
I need to display two datasets on the same faceted plots with ggplot2. The first dataset (dat) is to be shown as crosses like this:
While the second dataset (dat2) is to be shown as a color line. For an element of context, the second dataset is actually the Pareto frontier of the first set...
Both datasets (dat and dat2) look like this:
       modu mnc       eff
1 0.3080473   0 0.4420544
2 0.3110355   4 0.4633741
3 0.3334024   9 0.4653061
Here's my code so far:
library(ggplot2)
dat <- structure(list(modu = c(0.30947265625, 0.3094921875, 0.32958984375,
0.33974609375, 0.33767578125, 0.3243359375, 0.33513671875, 0.3076171875,
0.3203125, 0.3205078125, 0.3220703125, 0.28994140625, 0.31181640625,
0.352421875, 0.31978515625, 0.29642578125, 0.34982421875, 0.3289453125,
0.30802734375, 0.31185546875, 0.3472265625, 0.303828125, 0.32279296875,
0.3165234375, 0.311328125, 0.33640625, 0.3140234375, 0.33515625,
0.34314453125, 0.33869140625), mnc = c(15, 9, 6, 0, 10, 12, 14,
9, 5, 11, 0, 15, 0, 2, 14, 13, 14, 17, 11, 12, 13, 6, 4, 0, 13,
7, 10, 12, 7, 13), eff = c(0.492448979591836, 0.49687074829932,
0.49421768707483, 0.478571428571428, 0.493537414965986, 0.493809523809524,
0.49891156462585, 0.499319727891156, 0.495102040816327, 0.492285714285714,
0.482312925170068, 0.498911564625851, 0.479931972789116, 0.492857142857143,
0.495238095238095, 0.49891156462585, 0.49530612244898, 0.495850340136055,
0.50156462585034, 0.496, 0.492897959183673, 0.487959183673469,
0.495605442176871, 0.47795918367347, 0.501360544217687, 0.497850340136054,
0.493496598639456, 0.493741496598639, 0.496734693877551, 0.499659863945578
)), .Names = c("modu", "mnc", "eff"), row.names = c(NA, 30L), class = "data.frame")
dat2 <- structure(list(modu = c(0.26541015625, 0.282734375, 0.28541015625,
0.29216796875, 0.293671875), mnc = c(0.16, 0.28, 0.28, 0.28,
0.28), eff = c(0.503877551020408, 0.504149659863946, 0.504625850340136,
0.505714285714286, 0.508503401360544)), .Names = c("modu", "mnc",
"eff"), row.names = c(NA, 5L), class = "data.frame")
dat$modu = dat$modu
dat$mnc = dat$mnc*50
dat$eff = dat$eff
dat2$modu = dat2$modu
dat2$mnc = dat2$mnc*50
dat2$eff = dat2$eff
res <- do.call(rbind, combn(1:3, 2, function(ii)
cbind(setNames(dat[,c(ii, setdiff(1:3, ii))], c("x", "y")),
var=paste(names(dat)[ii], collapse="/")), simplify=F))
ggplot(res, aes(x=x, y=y))+ geom_point(shape=4) +
facet_wrap(~ var, scales="free")
How should I go about doing this? Do I need to add a layer? If so, how to do this in a faceted plot?
Thanks!
Here's one way:
pts <- do.call(rbind, combn(1:3, 2, function(ii)
cbind(setNames(dat[,c(ii, setdiff(1:3, ii))], c("x", "y")),
var=paste(names(dat)[ii], collapse="/")), simplify=F))
lns <- do.call(rbind, combn(1:3, 2, function(ii)
cbind(setNames(dat2[,c(ii, setdiff(1:3, ii))], c("x", "y")),
var=paste(names(dat2)[ii], collapse="/")), simplify=F))
gg.df <- rbind(cbind(geom="pt",pts),cbind(geom="ln",lns))
ggplot(gg.df,aes(x,y)) +
geom_point(data=gg.df[gg.df$geom=="pt",], shape=4)+
geom_path(data=gg.df[gg.df$geom=="ln",], color="red")+
facet_wrap(~var, scales="free")
The basic idea is to create separate data.frames for the points and the lines, then bind them together row-wise with an extra column (geom) indicating which geometry the data goes with. Then we plot the points based on the subset of gg.df with geom=="pt" and similarly with the lines.
The result isn't very interesting with your limited example, but this seems (??) to be what you want. Notice the use of geom_path(...) rather than geom_line(...). The latter orders the x-values before plotting.
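A variation on the same idea, sketched under assumptions: because each ggplot2 layer accepts its own data argument, the two data frames can also be kept separate rather than bound together, as long as both carry the faceting column. The small data frames below are made up for illustration and pairs_long is a hypothetical helper, not part of the code above.

```r
library(ggplot2)

# Made-up data in the same shape as dat and dat2
dat  <- data.frame(modu = c(0.31, 0.33, 0.34), mnc = c(15, 9, 6),
                   eff = c(0.49, 0.50, 0.48))
dat2 <- data.frame(modu = c(0.27, 0.28, 0.29), mnc = c(8, 14, 14),
                   eff = c(0.504, 0.505, 0.509))

# Hypothetical helper: stack every pair of columns as x/y plus a facet label
pairs_long <- function(d) {
  do.call(rbind, combn(names(d), 2, function(p) {
    data.frame(x = d[[p[1]]], y = d[[p[2]]], var = paste(p, collapse = "/"))
  }, simplify = FALSE))
}

pts <- pairs_long(dat)   # points layer
lns <- pairs_long(dat2)  # line layer

p <- ggplot(pts, aes(x, y)) +
  geom_point(shape = 4) +
  geom_path(data = lns, colour = "red") +  # this layer uses its own data frame
  facet_wrap(~ var, scales = "free")
p
```

Because lns has the same var column as pts, each line lands in the matching facet.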