I am fairly new to R and am trying to make some figures, but having trouble with renaming data. Basically, I had a super large data set from SPSS that I imported into R and created a smaller data table with one variable I am trying to look at. I was successful in getting my data into the long format, but my Time column is not represented the way I'd like.
When I got my data into the long format, I made a data Time column and the data in that column says TIME1COMPOSITE, TIME2COMPOSITE, TIME3COMPOSITE - which are the original column names from the SPSS file. I would prefer for it to instead read Time1, Time2, or Time3 (so that it can look better on the axis label for the graph I am making). Is there a simple way to do this? Either to rename the data points or to just rename the labels on the graph?
Here is an example of what my code looks like:
dt<- data.table(dt)
#Putting into long format
dt <- melt(dt, measure.vars = c("TIME1COMPOSITE", "TIME2COMPOSITE", "TIME3COMPOSITE"), variable.name = "Time", value.name = "CompositeScore")
#Computing means
dt[, meanCompositeScore:= mean(CompositeScore), by=c("Condition", "Time")]
#Plotting
plot <- ggplot(dt, aes(x=Time, y=meanCompositeScore, color=Condition)) + geom_point()
plot
The easiest method with the code you have would be to change the column names at the beginning, before melting, using the colnames() function.
colnames(dt) <- c("colname1","colname2", ...)
Another method using the tidy format would be to use the rename() function (from dplyr).
dt %>%
rename(Time1 = TIME1COMPOSITE, Time2 = TIME2COMPOSITE, Time3 = TIME3COMPOSITE)
To change the names after the calculations have been done, you could convert Time to a factor and relabel its levels. We can use the as.factor() function to convert the column and revalue() (from plyr) to relabel it. Note that revalue() expects c(old = "new") and that its result has to be assigned back.
dt$Time <- as.factor(dt$Time)
dt$Time <- plyr::revalue(dt$Time, c("TIME1COMPOSITE" = "Time1", "TIME2COMPOSITE" = "Time2", "TIME3COMPOSITE" = "Time3"))
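If there are many time points, the new level names can also be derived from the old ones instead of being typed out. Here is a minimal sketch in base R, assuming every level follows the pattern TIME<number>COMPOSITE; the toy vector `time` is a made-up stand-in for dt$Time.

```r
# Toy stand-in for dt$Time after melt(); the values are made up for illustration
time <- factor(c("TIME1COMPOSITE", "TIME2COMPOSITE", "TIME3COMPOSITE",
                 "TIME1COMPOSITE"))

# Rewrite each level: keep the number, replace the surrounding text
levels(time) <- sub("^TIME([0-9]+)COMPOSITE$", "Time\\1", levels(time))

levels(time)
```

Levels that do not match the pattern are left unchanged by sub(), so it is worth checking levels() afterwards.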
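To relabel in the graph itself, we can convert Time to a factor at the line where we build the plot. This needs factor() rather than as.factor(), since as.factor() has no levels argument: the original names go in levels and the new names in labels.
time_levels <- c("TIME1COMPOSITE", "TIME2COMPOSITE", "TIME3COMPOSITE")
time_labels <- c("Time1", "Time2", "Time3")
plot <- ggplot(dt, aes(x=factor(Time, levels = time_levels, labels = time_labels), y=meanCompositeScore, color=Condition)) + geom_point() + xlab("Time")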
A final method would be to relabel the graph labels rather than the values using the scale_x_discrete() ggplot function.
plot <- ggplot(dt, aes(x=Time, y=meanCompositeScore, color=Condition)) +
geom_point() +
scale_x_discrete(labels = c('Time1','Time2','Time3'))
Let me know if any method doesn't work for you and I will attempt to clarify the method or rectify the mistake.
I have 2 data frames:
df1 <- setNames(data.frame(as.POSIXct(c("2022-07-29 00:00:00","2022-07-29 00:05:00","2022-07-29 00:10:00","2022-07-29 00:15:00","2022-07-29 00:20:00")), c(1,2,3,4,5)), c("timeStamp", "value"))
df2 <- setNames(data.frame(as.POSIXct(c("2022-07-29 00:00:05","2022-07-29 00:05:05","2022-07-29 00:20:05")), c("a","b","c")), c("timeStamp", "text"))
I want to plot them so that the main graph is a geom_point with a numerical y scale, and then overlay the labels (a, b, c) from the second data frame at the correct timeStamps on the continuous time series x axis.
ggplot() +
geom_point(data=df1, aes(x=timeStamp, y= value)) +
geom_text(data=df2, aes(x=timeStamp, y= text))
The difficulty, I think, lies in the fact that the timeStamps do not perfectly match up, and I keep getting "Error: Discrete value supplied to continuous scale". Can anybody please offer some advice here?
The end result should look something like this (this is an example from a much larger data frame):
[Image: labeled time series using labels from different time series dataframe]
Thank you
The issue is not the timeStamp but that for geom_point you are mapping a numeric (continuous) variable on y, while for geom_text you are mapping a discrete one on y. Hence you get the error
Error: Discrete value supplied to continuous scale
To fix that, map your text on the label aes (which, BTW, is required for geom_text) and use the y aes to specify the position where you want to add the labels:
library(ggplot2)
ggplot() +
geom_point(data=df1, aes(x=timeStamp, y= value)) +
geom_text(data=df2, aes(x=timeStamp, label = text, y = 6))
DATA
df1 <- setNames(data.frame(as.POSIXct(c("2022-07-29 00:00:00","2022-07-29 00:05:00","2022-07-29 00:10:00","2022-07-29 00:15:00","2022-07-29 00:20:00")), c(1,2,3,4,5)), c("timeStamp", "value"))
df2 <- setNames(data.frame(as.POSIXct(c("2022-07-29 00:00:05","2022-07-29 00:05:05","2022-07-29 00:20:05")), c("a","b","c")), c("timeStamp", "text"))
Update: I removed my first answer.
I am still not sure what you are after, and @stefan's answer seems more correct, but maybe you are thinking of something like this:
If you want to position the labels from df2 on top of the points from df1, based on the nearest time points between df1 and df2, then we need to use roll from data.table. This answer was adapted from "Merging two sets of data by data.table roll='nearest' function".
library(data.table)
library(tidyverse)
setDT(df1)
setDT(df2)
# Create time column by which to do a rolling join
df1[, time := timeStamp]
df2[, time := timeStamp]
setkey(df1, time)
setkey(df2, time)
set_merged <- df2[df1, roll = "nearest"]
set_merged %>%
as_tibble() %>%
ggplot(aes(x = time, y=value, group=1)) +
geom_point() +
geom_line()+
geom_text(aes(x=time, y=max(value)+0.1, label=text))+
theme_minimal()
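For small data, the same nearest-timestamp matching can be sketched in base R without a rolling join. This is only an illustration, not a replacement for the data.table approach: the column names matchedTime and value on df2 are made up here, and tz = "UTC" is an assumption added so the result does not depend on the local time zone.

```r
# Same example data as in the question (tz = "UTC" is an added assumption)
df1 <- data.frame(
  timeStamp = as.POSIXct(c("2022-07-29 00:00:00", "2022-07-29 00:05:00",
                           "2022-07-29 00:10:00", "2022-07-29 00:15:00",
                           "2022-07-29 00:20:00"), tz = "UTC"),
  value = c(1, 2, 3, 4, 5)
)
df2 <- data.frame(
  timeStamp = as.POSIXct(c("2022-07-29 00:00:05", "2022-07-29 00:05:05",
                           "2022-07-29 00:20:05"), tz = "UTC"),
  text = c("a", "b", "c")
)

# For each label, find the row of df1 whose timestamp is closest
nearest <- vapply(
  as.numeric(df2$timeStamp),
  function(t) which.min(abs(as.numeric(df1$timeStamp) - t)),
  integer(1)
)

# Attach the matched timestamp and value so geom_text() can use them as x and y
df2$matchedTime <- df1$timeStamp[nearest]
df2$value <- df1$value[nearest]
df2
```

This compares every label against every point, so it scales poorly; for large data the keyed rolling join is the better fit.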
How can I put the x axis in month order? The x-axis variable is of class character, not Date. I tried using reorder and sort, but neither works in my case.
Two approaches.
Fake data:
set.seed(42) # R-4.0.2
dat <- data.frame(
when = sample(c("Apr20", "Feb20", "Mar20"), size = 500, replace = TRUE),
charge = 10000 * rexp(500)
)
ggplot(dat, aes(charge, when)) +
geom_boxplot() +
coord_flip()
Date class
This is what I'll call "The Right Way (tm)", for two reasons: if the data is date-like, then let's use Date; and it allows R to handle the ordering naturally.
dat$when2 <- as.Date(paste0("01", dat$when), "%d%b%y")
ggplot(dat, aes(charge, when2, group = when)) +
geom_boxplot() +
coord_flip() +
scale_y_date(labels = function(z) format(z, format = "%b%y"))
(I should note that I need both when2 and group=when: since when2 is a continuous variable, ggplot2 is not going to auto-group things based on it, so we need group=.)
factor
I think this is the wrong approach, for two reasons: (1) not using dates as the numeric data they are; and (2) the more months you have, the more you have to manually control the levels within the factors.
However, having said that:
dat$when3 <- factor(dat$when, levels = c("Feb20", "Mar20", "Apr20"))
ggplot(dat, aes(charge, when3)) +
geom_boxplot() +
coord_flip()
(You could easily overwrite dat$when instead of creating a new variable dat$when3, but I kept it separate because I went back and forth during code-testing here. Frankly, if you prefer to not go the Date route, then doing this allows other things to be ordered correctly, too.)
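If you do go the factor route, the manual level list can be avoided by sorting the labels by the dates they represent. A minimal sketch, assuming all labels look like "Feb20" (English month abbreviation plus two-digit year); the vector `when` below holds made-up values, and the locale is switched to "C" because %b parsing is locale-dependent.

```r
# Month labels in arbitrary order (made-up example values)
when <- c("Apr20", "Feb20", "Mar20", "Feb20", "Jan21")

# %b parsing depends on the locale; "C" gives English month abbreviations
invisible(Sys.setlocale("LC_TIME", "C"))

# Parse the unique labels as first-of-month dates, then sort the labels by date
labs <- unique(when)
parsed <- as.Date(paste0("01", labs), format = "%d%b%y")
when_f <- factor(when, levels = labs[order(parsed)])

levels(when_f)
```

The factor keeps the original text as its labels, so the axis still reads "Feb20", "Mar20", and so on, only in calendar order.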
I'm currently working on automating some basic experiential analysis using R. Currently, I've got my script setup as follows which generates the plot shown below.
data <- list()
for (experiment in experiments) {
path = paste('../out/', experiment, '/', plot, '.csv', sep="")
data[[experiment]] <- read.csv(path, header=F)
}
df <- data.frame(Year=1:40,
'current'=colMeans(data[['current']]),
'vip'=colMeans(data[['vip']]),
'vipbonus'=colMeans(data[['vipbonus']]))
df <- melt(df, id.vars = 'Year', variable.name = 'Series')
plotted <- ggplot(df, aes(Year, value)) +
geom_line(aes(colour = Series)) +
labs(y = ylabel, title = title)
file = paste(plot, '.png', sep="")
ggsave(filename = file, plot = plotted)
While this is close to what we want the final product to look like, the series labels need to be updated. Ideally we want them to be something like "VIP, no bonus", "VIP, with bonus" and so forth, but obviously using labels like that as column names in the data frame is not valid R (and invalid characters are automatically replaced with . even with backticks). Since these experiments are a work in progress, we also know that we are going to need more series labels in the future, so we don't want to lose the ability of ggplot to automatically set the colors for us.
How can I set the series labels to be appropriate for humans?
The OP explained that he is currently working on automating some basic experiential analysis, part of which is the relabeling of the series. The OP showed also some code which is used to prepare the data to be plotted.
Based on the additional information supplied in comments, I believe the overall processing could be streamlined which will address the series labeling issue as well.
Some preparations
# used for creating file paths
experiments <- c("current", "vip", "vipbonus")
# used for labeling the series
exp_labels <- c("Current", "VIP, no bonus", "VIP, with bonus")
plot <- "dataset1" # e.g.
paths <- paste0(file.path("../out", experiments, plot), ".csv")
paths
#[1] "../out/current/dataset1.csv" "../out/vip/dataset1.csv" "../out/vipbonus/dataset1.csv"
Read data
library(data.table) #version 1.10.4 used here
# read all files into one large data.table
# add running count in column "Series" to identify the source of each row
DT <- rbindlist(lapply(paths, fread, header = FALSE), idcol = "Series")
# rename file chunks = Series, use predefined labels
DT[, Series := factor(Series, labels = exp_labels)]
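To see what the factor() call does in isolation, here is a small sketch on a made-up id vector. The name series_id is hypothetical; in the real data the ids come from the idcol argument of rbindlist().

```r
exp_labels <- c("Current", "VIP, no bonus", "VIP, with bonus")

# idcol = "Series" yields 1, 2, 3 for the first, second, third file;
# this vector is a made-up stand-in for that column
series_id <- c(1L, 1L, 2L, 3L, 3L)

# The sorted unique ids (1, 2, 3) are paired with the labels in order
series <- factor(series_id, labels = exp_labels)
as.character(series)
```

Because the pairing is purely positional, exp_labels has to be kept in the same order as experiments.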
Reshape and aggregate by groups
# reshape from wide to long
molten <- melt(DT, id.vars = "Series")
# compute means by Series and Year = variable
aggregated <- molten[, .(value = mean(value)), by = .(Series, variable)]
# take factor level number of "variable" as Year
aggregated[, Year := as.integer(variable)]
Note that aggregation is done in long format (after melt()) to save typing the same command for each column.
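The point about aggregating in long format can be illustrated on a toy table. This sketch deliberately swaps in base R's stack() and aggregate() for melt() and the data.table grouping, just to keep it dependency-free; the data values are made up.

```r
# Toy wide data: two runs (rows) of three yearly values (columns V1..V3)
wide <- data.frame(V1 = c(1, 3), V2 = c(2, 4), V3 = c(10, 20))

# Long format: one row per (column, value) pair; columns are "values" and "ind"
long <- stack(wide)

# One grouped call replaces a separate mean() per column
means <- aggregate(values ~ ind, data = long, FUN = mean)
means
```

The result matches colMeans(wide), but the grouped call does not change when more columns are added.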
Create chart & save to disk
library(ggplot2)
ggplot(aggregated, aes(Year, value)) +
geom_line(aes(colour = Series)) +
labs(y = "ylabel", title = "title")
file = paste(plot, '.png', sep="")
ggsave(filename = file) # by default, the last plot is saved
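If only the legend text needs to change, another option is to leave the data untouched and relabel on the colour scale, which keeps ggplot's automatic colours. A minimal sketch under the assumption that the series are still called current, vip and vipbonus in the data; the toy data frame and the name series_labels are made up for illustration.

```r
library(ggplot2)

# Named vector: names are the raw series ids, values are the display labels
series_labels <- c(current  = "Current",
                   vip      = "VIP, no bonus",
                   vipbonus = "VIP, with bonus")

# Toy stand-in for the molten data frame from the question
df <- data.frame(Year   = rep(1:2, times = 3),
                 Series = rep(c("current", "vip", "vipbonus"), each = 2),
                 value  = c(1, 2, 3, 4, 5, 6))

# Colours are still picked automatically; only the legend text is replaced
plotted <- ggplot(df, aes(Year, value)) +
  geom_line(aes(colour = Series)) +
  scale_colour_discrete(breaks = names(series_labels),
                        labels = unname(series_labels))
plotted
```

Adding a new experiment then means adding one entry to series_labels.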
While this may not be an ideal approach, what worked for us was to update the relevant series labels after the melt command was performed:
df$Series <- as.character(df$Series)
df$Series[df$Series == "current"] <- "Current"
df$Series[df$Series == "vip"] <- "VIP, no bonus"
df$Series[df$Series == "vipbonus"] <- "VIP, with bonus"
Which results in plots like the following:
You can try this
library(tidyverse)
df <- df %>% dplyr::mutate(Series = as.character(Series),
Series = fct_recode(Series,
"Current" = "current",
"VIP, no bonus" = "vip",
"VIP, with bonus" = "vipbonus"))
Here's a facsimile of my data:
d1 <- data.frame(
e=rnorm(3000,10,10)
)
d2 <- data.frame(
e=rnorm(2000,30,30)
)
So, I got around the problem of plotting two different density distributions from two very different datasets on the same graph by doing this:
ggplot() +
geom_density(aes(x=e),fill="red",data=d1) +
geom_density(aes(x=e),fill="blue",data=d2)
But when I try to manually add a legend, like so:
ggplot() +
geom_density(aes(x=e),fill="red",data=d1) +
geom_density(aes(x=e),fill="blue",data=d2) +
scale_fill_manual(name="Data", values = c("XXXXX" = "red","YYYYY" = "blue"))
Nothing happens. Does anybody know what's going wrong? I thought I could actually manually add legends if need be.
Generally ggplot works best when your data is in a single data.frame and in long format. In your case we therefore want to combine the data from both data.frames. For this simple example, we just concatenate the data into a long variable called d and use an additional column id to indicate to which dataset that value belongs.
d.f <- data.frame(id = rep(c("XXXXX", "YYYYY"), c(3000, 2000)),
d = c(d1$e, d2$e))
More complex data manipulations can be done using packages such as reshape2 and tidyr. I find this cheat sheet often useful. Then when we plot we map fill to id, and ggplot will take care of the legend automatically.
ggplot(d.f, aes(x = d, fill = id)) +
geom_density()
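As for why the manual legend did nothing: fill = "red" outside aes() is a fixed setting, not a mapping, so the fill scale never sees it and scale_fill_manual() has nothing to build a legend from. If you would rather keep the two data frames separate, a sketch like the following should work, where fill is mapped to a constant string inside aes(). The alpha = 0.5 is my own addition so that both densities stay visible.

```r
library(ggplot2)

set.seed(1)  # arbitrary seed, only to make the sketch reproducible
d1 <- data.frame(e = rnorm(3000, 10, 10))
d2 <- data.frame(e = rnorm(2000, 30, 30))

# fill is mapped inside aes() to a constant string, so it goes through the
# fill scale and scale_fill_manual() can assign colours and build the legend
p <- ggplot() +
  geom_density(aes(x = e, fill = "XXXXX"), data = d1, alpha = 0.5) +
  geom_density(aes(x = e, fill = "YYYYY"), data = d2, alpha = 0.5) +
  scale_fill_manual(name = "Data",
                    values = c("XXXXX" = "red", "YYYYY" = "blue"))
p
```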
I am trying to plot the following data
factor <- as.factor(c(1,2,3))
V1_mean <- c(100,200,300)
V2_mean <- c(350,150,60)
V1_stderr <- c(5,9,3)
V2_stderr <- c(12,9,10)
plot <- data.frame(factor,V1_mean,V2_mean,V1_stderr,V2_stderr)
I want to create a plot with factor on the x-axis, value on the y-axis and separate lines for V1 and V2 (hence the points are the values of V1_mean on one line and V2_mean on the other). I would also like to add error bars for these means based on V1_stderr and V2_stderr.
Many thanks
I'm not sure regarding your desired output, but here's a possible solution.
First of all, I wouldn't call your data plot, as this is the name of a commonly used built-in R function.
Second, when you want to plot two lines in ggplot, you'll usually have to tidy your data using functions such as melt (from the reshape2 package) or gather (from the tidyr package).
Here's a possible approach
library(ggplot2)
library(reshape2)
dat <- data.frame(factor, V1_mean, V2_mean, V1_stderr, V2_stderr)
mdat <- cbind(melt(dat[1:3], "factor"), melt(dat[c(1, 4:5)], "factor"))
names(mdat) <- make.names(names(mdat), unique = TRUE)
ggplot(mdat, aes(factor, value, color = variable)) +
geom_line(aes(group = variable)) + # You can also add `geom_point(aes(group = variable)) + ` if you want to see the actual points
geom_errorbar(aes(ymin = value - value.1, ymax = value + value.1))
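The cbind of two melts leaves columns called value and value.1, which is easy to misread. As an alternative sketch, the long data frame can be stacked by hand with explicit column names; the names long, mean and stderr are my own choice, and width = 0.1 on the error bars is only a cosmetic assumption.

```r
library(ggplot2)

# Data from the question
dat <- data.frame(factor = as.factor(c(1, 2, 3)),
                  V1_mean = c(100, 200, 300), V2_mean = c(350, 150, 60),
                  V1_stderr = c(5, 9, 3), V2_stderr = c(12, 9, 10))

# Stack V1 and V2 into one long data frame with explicit column names
long <- rbind(
  data.frame(factor = dat$factor, variable = "V1",
             mean = dat$V1_mean, stderr = dat$V1_stderr),
  data.frame(factor = dat$factor, variable = "V2",
             mean = dat$V2_mean, stderr = dat$V2_stderr)
)

# One line per variable, with points and error bars of +/- one standard error
p <- ggplot(long, aes(factor, mean, color = variable, group = variable)) +
  geom_line() +
  geom_point() +
  geom_errorbar(aes(ymin = mean - stderr, ymax = mean + stderr), width = 0.1)
p
```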