I am looking for a way to transform time series of different lengths into a common length.
I think this question has already been asked, but I can't find it. I guess I am just not using the right vocabulary for the question.
Data 1: 20 variables x 250 observations (time points)
Data 2: 20 variables x 50 observations (time points)
I would like to transform these data into 100 observations while keeping the shape of the curves for the 20 variables in both cases.
Thanks a lot
Sample data
set.seed(123)
data <- matrix(0, 250, 20)
data[1, ] <- rnorm(20)
for (i in 2:nrow(data)) {
data[i, ] <- data[i - 1, ] + rnorm(20, 0, 0.02)
}
rownames(data) <- 0:249
One way of handling this is with reshape2 and dplyr:
library("reshape2")
library("dplyr")
library("ggplot2")
molten <- melt(data, varnames = c("Time", "Variable"))
Plot of original data:
ggplot(molten, aes(x = Time, y = value, colour = factor(Variable))) + geom_line()
Now reduce the data.frame by a factor of 5 using means of the values in each time period:
shorter <- molten %>%
group_by(Variable, Time %/% 5) %>%
summarise(value = mean(value), Time = mean(Time))
Plot new data:
ggplot(shorter, aes(x = Time, y = value, colour = factor(Variable))) + geom_line()
If you want the data back in the original wide form:
shorterWide <- acast(shorter, Time ~ Variable)
I think I found a way using this function:
Basic two-dimensional cubic spline fitting in R
I guess the keyword I was missing was "cubic spline".
In my case I want to do something like this:
spline(Data1, n = 100)
spline(Data2, n = 100)
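Applied to the sample matrix above, that would look something like this (a sketch: spline() is run once per column, so each of the 20 curves is resampled onto 100 evenly spaced time points):
# resample every column of 'data' (250 x 20) onto 100 evenly spaced time points
time <- as.numeric(rownames(data))
new_time <- seq(min(time), max(time), length.out = 100)
resampled <- apply(data, 2, function(y) spline(x = time, y = y, xout = new_time)$y)
rownames(resampled) <- new_time
dim(resampled)  # 100 x 20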
I'm trying to draw 18 graphs using R and the ggplot2 package. My data look like this:
v1 v2 v3 ... v18 subject group
534 543 512 ... 410 1 (6.5, 18]
437 576 465 ... 420 2 (0, 6.5]
466 487 492 ... 501 3 (18, 55]
And I need to create a "faceted" histogram showing distributions for all of the groups in one frame (i.e. to conveniently present all of the subgroups' distributions) like this:
I came up with this code for a single plot:
ggplot(data = df, aes (x = v1)) + geom_histogram (boundary = 500) + facet_wrap(~Group, nrow = 2)
But since there are 18 variables (v1, v2,...), I'm looking for a way to write an efficient function/loop/command that would draw all the 18 graphs without me having to copy/paste and change the variable name 18 times. Like this:
ggplot(data = df, aes (x = **v1**)) + geom_histogram (boundary = 500) + facet_wrap(~Group, nrow = 2)
ggplot(data = df, aes (x = **v2**)) + geom_histogram (boundary = 500) + facet_wrap(~Group, nrow = 2)
ggplot(data = df, aes (x = **v3**)) + geom_histogram (boundary = 500) + facet_wrap(~Group, nrow = 2)
I know that the solution probably lies in looping and it seems like a useful skill to have, so I'm also using this opportunity to learn this right.
Thank you, any help is appreciated! (And thanks for all the suggestions so far!)
This is where I've gotten so far with the kind help of the user below:
for (v in c(v1,v2)) {
pdf("plots.pdf")
histograms <- ggplot(data = data, aes (x = v)) + geom_histogram (boundary = 500) + facet_wrap(~Group, nrow = 2)
print(histograms)
}
dev.off()
EDIT: A significantly revised answer is provided now that the requirements have been clarified.
The problem presents several common issues, each of which is addressed in other posts. However, perhaps this suggestion allows for a one-stop solution to these common issues.
My first suggestion is to reformat the data into a "long" format. There are many resources describing this and packages to help. Many users embrace the "tidyverse" set of tools and I'll leave that to others. I'll demonstrate a simple approach using base functions. I don't recommend the reshape() function in the stats package. I find it to be useful for repeated measures with time as one of the variables but find it rather complicated for other data.
A large fake data set will be generated in the "wide" format with demographic data (id, sex, weight, age, group) and 18 variables named "v01", "v02", ..., "v18" as random integers between 400 and 600.
# Set random number generator and number of "individuals" in fake data
set.seed(1234) # to ensure reproducibility
N <- 936 # number of "individuals" in the fake data
# Create typical fake demographic data and divide the age into 4 groups
id <- factor(sample(1e4:9e4, N, replace = FALSE))
age <- rpois(N, 36)
sex <- sample(c("F","M"), N, replace = TRUE)
weight <- 16 * log(age)
group <- cut(age, breaks = c(12, 32, 36, 40, 62))
Generate 18 fake values for each individual for the wide format and then create the fake "wide" data.frame.
# 18 variable measurements for wide format
V <- replicate(18, sample(400:600, N, replace = TRUE), simplify = FALSE)
names(V) <- sprintf("v%02d", 1:18)
# Add a little variation to the fake data
adj <- sample(1:6, 18, replace = TRUE)
V <- Map("/", V, adj) # divide each value by the number in 'adj'
V <- lapply(V, round, 1) # simplify
# Create data.frame with variable data in wide format
vars <- as.data.frame(V)
names(vars)
# Assemble demographic and variable data into a typical "wide" data set
wide <- data.frame(id, sex, weight, age, group, vars)
names(wide)
head(wide)
In the "wide" format, each row corresponds to a unique individual with demographic information and 18 values for 18 variables. This is going to be changed into the "long" format with each value represented by a row. The new "long" data frame will have two new variables for the data (values) and a factor indicating the group from which the data came (ind). Typically they get renamed but I will simply work with the default names here.
As noted above, the simple base function stack() will be used to stack the variables into a single vector. In contrast to cbind(), the data.frame() function recycles shorter arguments only when the longer length is a whole multiple of the shorter one. The following code takes advantage of this property to build the "long" data.frame.
# Identify those variables to be stacked (they all start with 'v')
sel <- grepl("^v", names(wide))
long <- data.frame(wide[!sel], stack(wide[sel]))
head(long)
My second suggestion is to use one of the "apply" functions to create a list of ggplot objects. By storing the plots in this variable, you have the option of plotting them with different formats without running the plotting code each time.
The code creates a plot for each of the 18 different variables, which are identified by the new variable ind. I changed boundary = 500 to bins = 10 since I don't know what your actual data looks like. I also added a "caption" to each plot identifying the original variable.
library(ggplot2) # to use ggplot...
plotList <- lapply(levels(long$ind), function(i)
ggplot(data = subset(long, ind == i), aes(x = values))
+ geom_histogram(bins = 10)
+ facet_wrap(~ group, nrow = 2)
+ labs(caption = paste("Variable", i)))
names(plotList) <- levels(long$ind) # name the list elements for convenience
Now to examine each of the 18 plots (this may not work in RStudio):
opar <- par(ask = TRUE)
plotList # This is the same as print(plotList)
par(opar) # turn off the 'ask' option
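If par(ask = TRUE) has no effect on the grid-based ggplot output (hence the note about RStudio above), devAskNewPage() is the grid-graphics counterpart and can be used the same way; a small sketch:
devAskNewPage(TRUE)                # prompt before each new page; works for ggplot/grid output
invisible(lapply(plotList, print)) # print the 18 plots one page at a time
devAskNewPage(FALSE)               # turn prompting back off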
To save the plots to file, the advice of Imo is good. But it would be wise to take control of the size and nature of the file output. I suggest you look at the help files for pdf() and dev.print(). The last part of this answer shows one possibility with the pdf() function using a for loop to generate single page plots.
for (v in levels(long$ind)) {
fname <- paste(v, "pdf", sep = ".")
fname <- file.path("~", fname) # change this to specify a directory
pdf(fname, width = 6.5, height = 7, paper = "letter")
print(plotList[[v]])
dev.off()
}
And just to add another possible approach, here's a solution with lattice showing 6 of the variables per page. (Personally, I'm a fan of this simpler approach.)
library(lattice)
idx <- split(levels(long$ind), gl(3, 6, 18))
opar <- par(ask = TRUE)
for (i in idx)
plot(histogram(~values | group + ind, data = long,
subset = ind %in% i, as.table = TRUE))
par(opar)
I've got a dataset of different energies (eV) and related counts. I changed the detection wavelength during the measurement, which resulted in a first column with all wavelengths and then further columns. In those columns some rows are filled with NAs because no data were measured at the specific wavelength.
I would like to plot the spectra in R, but it doesn't work because the lengths of the x and y values differ for each column.
It would be great, if someone could help me.
Thank you very much.
It would be better if we could work with (simulated) data that you provide. Here's my attempt at visualizing your problem the way I see it.
library(ggplot2)
library(tidyr)
# create and fudge the data
xy <- data.frame(measurement = 1:20, red = rnorm(20), green = rnorm(20, mean = 10), uv = NA)
xy[16:20, "green"] <- NA
xy[16:20, "uv"] <- rnorm(5, mean = -3)
# flow it into "long" format
xy <- gather(xy, key = color, value = value, - measurement)
# plot
ggplot(xy, aes(x = measurement, y = value, group = color)) +
theme_bw() +
geom_line()
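If the NA rows trigger "removed missing values" warnings from geom_line(), gather() can drop them during the reshaping step instead (an optional variant of the gather() call above):
# drop the NA cells while reshaping, so geom_line() never sees them
xy <- gather(xy, key = color, value = value, -measurement, na.rm = TRUE)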
I have panel data with ID=1,2,3... year=2007,2008,2009... and a factor foreign=0,1, and a variable X.
I would like to create a time series plot with x-axis=year, y-axis=values of X that compares the average (=mean) development of each factor over time. As there are 2 factors, there should be two lines, one solid and one dashed.
I assume that the first step involves the calculation of the means for each year and factor of X, i.e. in a panel setting. The second step should look something like this:
ggplot(data, aes(x=year, y=MEAN(X), group=Foreign, linetype=Foreign))+geom_line()+theme_bw()
Many thanks.
Using dplyr to calculate the means and ggplot2 for plotting:
library(dplyr)
library(ggplot2)
# generate some data (because you didn't provide any, or any way of generating it...)
data = data.frame(ID = 1:200,
year = rep(1951:2000, each = 4),
foreign = rep(c(0, 1), 100),
x = rnorm(200))
# For each year, and separately for foreign or not, calculate mean x.
data.means <- data %>%
group_by(year, foreign) %>%
summarize(xmean = mean(x))
# plot. You don't need group = foreign
ggplot(data.means, aes(x = year, y = xmean, linetype = factor(foreign))) +
geom_line() +
theme_bw()
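If you prefer to skip the explicit aggregation step, ggplot2 can also compute the yearly means on the fly with stat_summary(); a sketch of the same plot using the unaggregated data:
ggplot(data, aes(x = year, y = x, linetype = factor(foreign))) +
  stat_summary(fun = mean, geom = "line") +  # older ggplot2 versions use fun.y = mean
  theme_bw()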
I need to interpolate annual data from 5-year intervals, and so far I found how to do it for one observation using approx(). But I have a large data set, and when trying to use ddply() to apply it to each row, no matter what I try in the last row of code I keep receiving error messages.
e.g:
town <- data.frame(name = c("a","b","c"), X1990 = c(100,300,500), X1995=c(200,400,700))
d1990 <-c(1990)
d1995 <-c(1995)
town_all <- cbind(town,d1990,d1995)
library(plyr)
Input <- data.frame(town_all)
x <- c(town_all$X1990, town_all$X1995)
y <- c(town_all$d1990, town_all$d1995)
approx_frame <- function(df) (approx(x=x, y=y, method="linear", n=6, ties="mean"))
ddply(Input, town_all$X1990, approx_frame)
Also, if you know what function calculates geometric interpolation, it will be great. (I was only able to find examples of spline or constant methods.)
I would first put the data in long format (each column corresponds to a variable, so one column for 'year' and one for 'value'). Then, I use data.table, but the same approach could be followed with dplyr or another split-apply-combine method. This interp function is meant to do geometric interpolation with a constant rate calculated for each interval.
## Sample data (added one more year)
towns <- data.frame(name=c('a', 'b', 'c'),
x1990=c(100, 300, 500),
x1995=c(200, 400, 700),
x2000=c(555, 777, 999))
## First, transform data from wide -> long format, clean year column
library(data.table) # or use reshape2::melt
towns <- melt(as.data.table(towns), id.vars='name', variable.name='year') # wide -> long
towns[, year := as.integer(sub('[[:alpha:]]', '', year))] # convert years to integers
## Function to interpolate at constant rate for each interval
interp <- function(yrs, values) {
tt <- diff(yrs) # interval lengths
N <- head(values, -1L)
P <- tail(values, -1L)
r <- (log(P) - log(N)) / tt # rate for interval
const_rate <- function(N, r, time) N*exp(r*(0:(time-1L)))
list(year=seq.int(min(yrs), max(yrs), by=1L),
value=c(unlist(Map(const_rate, N, r, tt)), tail(P, 1L)))
}
## geometric interpolation for each town
res <- towns[, interp(year, value), by=name]
## Plot
library(ggplot2)
ggplot(res, aes(year, value, color=name)) +
geom_line(lwd=1.3) + theme_bw() +
geom_point(data=towns, cex=2, color='black') + # original data points between which values are interpolated
scale_color_brewer(palette='Pastel1')
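As a quick sanity check, town 'a' grows from 100 in 1990 to 200 in 1995, so its constant rate over that interval is log(200/100)/5 and the interpolated values in res for those years should be:
r <- log(200/100) / 5          # constant rate for the 1990-1995 interval
round(100 * exp(r * 0:5), 2)   # 100.00 114.87 131.95 151.57 174.11 200.00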
I have a long format data frame where rows represent the responses (one of four categories) of different people. An example dataset is provided here:
df <- data.frame(person=c(rep("A",100),rep("B",100)),resp=c(sample(4,100,replace=TRUE),sample(4,100,replace=TRUE)))
df$resp <- factor(df$resp)
summary(df)
 person   resp 
 A:100    1:52 
 B:100    2:55 
          3:54 
          4:39 
I want to present a chart where the x-axis plots the response category, the y-axis shows the proportion of responses in a category, and where error bars are calculated via bootstrapping (sampling with replacement).
I can calculate the proportion (in an extremely kludgy way; I'm sure this could be improved but this is not my main concern):
pFrame <- ddply(df,.(person,resp),summarise,trials = length(resp))
# can't figure out how to calculate the proportion with plyr.
pFrame$prop <- NA
people <- unique(df$person)
responses <- unique(df$resp)
for (i in 1 : length(people)){
nTrials <- nrow(subset(df,person==people[i]))
for (j in 1 : 4){
pFrame$prop[pFrame$person==people[i] & pFrame$resp==responses[j]] <- pFrame$trials[pFrame$person==people[i] & pFrame$resp==responses[j]] / nTrials
}
}
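(As an aside, the same proportions can apparently be computed more compactly with a second ddply() pass, sketched below, though again this is not my main concern:)
pFrame <- ddply(df, .(person, resp), summarise, trials = length(resp))
pFrame <- ddply(pFrame, .(person), mutate, prop = trials / sum(trials))  # normalise within each person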
and plot it:
ggplot(pFrame,aes(x=resp,y=prop,colour=person)) + geom_point()
but I would really like to use something like stat_summary(fun.data="mean_cl_boot") to show the variability on the proportions (i.e. acting on the original data frame df, and bootstrapping over the rows). I've tried a few attempts at creating custom functions but this doesn't seem trivial because the factor levels need to be transformed for the bootstrap first.
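One transformation that might make mean_cl_boot usable (an untested sketch; mean_cl_boot wraps Hmisc::smean.cl.boot, so Hmisc must be installed) is recoding each response level as a 0/1 indicator per trial, so the proportion becomes an ordinary mean:
# one row per trial and response level: hit = 1 if that trial gave that response
indicators <- do.call(rbind, lapply(levels(df$resp), function(r)
  data.frame(person = df$person, resp = r, hit = as.numeric(df$resp == r))))
ggplot(indicators, aes(x = resp, y = hit, colour = person)) +
  stat_summary(fun.data = "mean_cl_boot", position = position_dodge(width = 0.3))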
I couldn't get ggplot's "mean_cl_boot" to work. Here is an alternative solution though:
library(boot)
library(reshape2) # for melt()
summary_for_plot <- melt(prop.table(table(df), 1))
names(summary_for_plot) <- c("person", "resp", "V1")
# function for boot()
summary_function <- function(df, d){
melt(prop.table(table(df[d,]), 1))[, 3]
}
bootres <- boot(df, statistic = summary_function, R=100)
# get the bootstrap standard deviation of each proportion, used for the error bars
summary_for_plot$sd <- apply(bootres$t, 2, sd)
ggplot(summary_for_plot, aes(x= resp, y = V1, color = person)) + geom_point() +
geom_errorbar(aes(ymin = V1-sd, ymax = V1+sd), width = 0.2)