I need to interpolate annual values from data recorded at 5-year intervals. So far I have worked out how to do it for one observation using approx(), but I have a large data set, and when I try to use ddply() to apply it to each row I keep getting error messages, no matter what I try in the last line of code.
For example:
town <- data.frame(name = c("a","b","c"), X1990 = c(100,300,500), X1995=c(200,400,700))
d1990 <-c(1990)
d1995 <-c(1995)
town_all <- cbind(town,d1990,d1995)
library(plyr)
Input <- data.frame(town_all)
x <- c(town_all$X1990, town_all$X1995)
y <- c(town_all$d1990, town_all$d1995)
approx_frame <- function(df) (approx(x=x, y=y, method="linear", n=6, ties="mean"))
ddply(Input, town_all$X1990, approx_frame)
Also, if you know which function calculates geometric interpolation, that would be great. (I was only able to find examples of spline or constant methods.)
I would first put the data in long format (each column corresponds to a variable, so one column for 'year' and one for 'value'). Then, I use data.table, but the same approach could be followed with dplyr or another split-apply-combine method. This interp function is meant to do geometric interpolation with a constant rate calculated for each interval.
## Sample data (added one more year)
towns <- data.frame(name=c('a', 'b', 'c'),
x1990=c(100, 300, 500),
x1995=c(200, 400, 700),
x2000=c(555, 777, 999))
## First, transform data from wide -> long format, clean year column
library(data.table) # or use reshape2::melt
towns <- melt(as.data.table(towns), id.vars='name', variable.name='year') # wide -> long
towns[, year := as.integer(sub('[[:alpha:]]', '', year))] # convert years to integers
## Function to interpolate at constant rate for each interval
interp <- function(yrs, values) {
tt <- diff(yrs) # interval lengths
N <- head(values, -1L)
P <- tail(values, -1L)
r <- (log(P) - log(N)) / tt # rate for interval
const_rate <- function(N, r, time) N*exp(r*(0:(time-1L)))
list(year=seq.int(min(yrs), max(yrs), by=1L),
value=c(unlist(Map(const_rate, N, r, tt)), tail(P, 1L)))
}
## geometric interpolation for each town
res <- towns[, interp(year, value), by=name]
## Plot
library(ggplot2)
ggplot(res, aes(year, value, color=name)) +
geom_line(lwd=1.3) + theme_bw() +
geom_point(data=towns, cex=2, color='black') + # add points interpolated between
scale_color_brewer(palette='Pastel1')
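As a cross-check on the constant-rate idea above, geometric interpolation can also be written as linear interpolation on the log scale, which lets base R's approx() do the work. This is only a minimal sketch: the helper name geo_interp is made up for illustration, and it assumes strictly positive values (log() is undefined otherwise).

```r
# Hypothetical helper: geometric interpolation = linear interpolation of log(values).
# Assumes all values are > 0.
geo_interp <- function(yrs, values) {
  out_years <- seq.int(min(yrs), max(yrs), by = 1L)
  log_fit <- approx(x = yrs, y = log(values), xout = out_years, method = "linear")
  data.frame(year = out_years, value = exp(log_fit$y))
}

# Town 'a' from the sample data: 100 (1990), 200 (1995), 555 (2000)
town_a <- geo_interp(c(1990, 1995, 2000), c(100, 200, 555))
print(town_a)
```

Within each interval the value grows by the same factor every year, for example by 2^(1/5) per year between 1990 and 1995 for town 'a'.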
I'm trying to plot a 3D plane from three variables. I've read many questions on the topic but haven't been able to find what I'm looking for.
I have two sets of variables:
prob <- seq(0,1,by=0.01)
n <- seq(999,9999, by = 1000)
n <- c(9,99,n)
combis <- expand.grid(prob,n)
which I then use to produce my results:
res <- apply(combis,1,calc,pos=pos)
where the values of res can be between 0 and 2/3.
So I'm trying to make a 3D plot where n and prob are x and z, and y is res. However, most packages I've found require matrices etc., and I cannot get this to work.
Any help would be appreciated, and sorry if I have missed an existing answer.
Assuming that res is just a vector, you can combine your data, reshape it into a wide-format matrix, and then plot it with something like the lattice package:
prob <- seq(0,1,by=0.01)
n <- seq(999,9999, by = 1000)
n <- c(9,99,n)
combis <- expand.grid(prob,n)
res <- runif(n=nrow(combis), 0, 0.67) #generate sample data for res
dat <- cbind(combis, res)
library(reshape2)
datm <- acast(data = dat, Var1~Var2, value.var = "res") #cast it into wide format
library(lattice)
library(latticeExtra)
cloud(datm, panel.3d.cloud = panel.3dbars, xlab="prob", ylab="n", zlab="res") # rows of datm are prob, columns are n, bar heights are res
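If you would rather stay in base R, the same reshaping can be sketched without reshape2. The example below relies on the fact that expand.grid() varies its first argument fastest, so filling a matrix column by column gives rows for prob and columns for n. The random res values and the viewing angles are assumptions for illustration only.

```r
prob <- seq(0, 1, by = 0.01)
n <- c(9, 99, seq(999, 9999, by = 1000))
combis <- expand.grid(prob = prob, n = n)

set.seed(1)
combis$res <- runif(nrow(combis), 0, 2/3)  # stand-in for the real results

# expand.grid varies 'prob' fastest, so column-wise filling gives
# rows = prob values, columns = n values
res_mat <- matrix(combis$res, nrow = length(prob), ncol = length(n),
                  dimnames = list(prob, n))

# persp() needs increasing x and y vectors and a matrix of heights
persp(x = prob, y = n, z = res_mat, theta = 40, phi = 25,
      xlab = "prob", ylab = "n", zlab = "res", ticktype = "detailed")
```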
I'm new to R and I'm having some trouble solving this problem.
I have the following table/dataframe:
I am trying to generate a boxplot like this one:
However, I want the x-axis to be scaled according to the labels 1000, 2000, 5000, etc.
That is, I want the distance between 1000 and 2000 to differ from the distance between 50000 and 100000, since the actual differences are not the same.
Is it possible to do that in R?
Thank you everyone and have a nice day!
Maybe try to convert the data set into this format, i.e. with the x values as integers in a column rather than as header titles?
# packages
library(ggplot2)
library(reshape2)
# data in ideal format
dt <- data.frame(x=rep(c(1,10,100), each=5),
y=runif(15))
# data that we have. Use reshape2::dcast to get data in to this format
dt$id <- rep(1:5, 3)
dt_orig <- dcast(dt, id~x, value.var = "y")
dt_orig$id <- NULL
names(dt_orig) <- paste0("X", names(dt_orig))
# let's get back to dt, the ideal format :)
# melt puts it in (variable, value) form. Need to configure variable column
dt2 <- melt(dt_orig)
# firstly, remove X from the string
dt2$variable <- gsub("X", "", dt2$variable)
# almost there, can see we have a character, but need it to be an integer
class(dt2$variable)
dt2$variable <- as.integer(dt2$variable)
# boxplot with variable X axis
ggplot(dt2, aes(x=variable, y=value, group=variable)) + geom_boxplot() + theme_minimal()
Base way of re-shaping data: https://www.statmethods.net/management/reshape.html
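For completeness, here is a rough base-R sketch of the same idea, using stack() to go from wide to long and boxplot()'s at argument to place each box at its numeric x position. The column names X1000, X2000 and X5000 and the random values are made up for illustration.

```r
set.seed(42)
# Made-up wide data: one column per x value, as in the question
wide <- data.frame(X1000 = runif(5), X2000 = runif(5), X5000 = runif(5))

# stack() returns a long data frame with columns 'values' and 'ind'
long <- stack(wide)

# Strip the leading "X" and convert the labels to numbers
long$x <- as.numeric(sub("^X", "", as.character(long$ind)))

# 'at' places each box at its true numeric position on the x-axis
boxplot(values ~ x, data = long, at = sort(unique(long$x)),
        xlab = "x", ylab = "value")
```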
I am looking for a way to transform time series of different length into a unique length.
I think this question has already been asked, but I can't find it. I guess I am just not using the right vocabulary for the question.
Data 1: 20 variables x 250 observations (time points)
Data 2: 20 variables x 50 observations (time points)
I would like to transform these data into 100 observations while keeping the shape of curves for the 20 variables in both cases.
Thanks a lot
Sample data
set.seed(123)
data <- matrix(0, 250, 20)
data[1, ] <- rnorm(20)
for (i in 2:nrow(data)) {
data[i, ] <- data[i - 1, ] + rnorm(20, 0, 0.02)
}
rownames(data) <- 0:249
One way of handling this is with reshape2 and dplyr:
library("reshape2")
library("dplyr")
library("ggplot2")
molten <- melt(data, varnames = c("Time", "Variable"))
Plot of original data:
ggplot(molten, aes(x = Time, y = value, colour = factor(Variable))) + geom_line()
Now reduce the data.frame by a factor of 5 using means of the values in each time period:
shorter <- molten %>%
group_by(Variable, Time %/% 5) %>%
summarise(value = mean(value), Time = mean(Time))
Plot new data:
ggplot(shorter, aes(x = Time, y = value, colour = factor(Variable))) + geom_line()
If you want original wide form of data:
shorterWide <- acast(shorter, Time ~ Variable, value.var = "value")
I think I found a way using the function discussed in this question:
Basic two-dimensional cubic spline fitting in R
I guess the keyword I was missing was "cubic spline".
In my case I want to do something like this:
spline(Data1, n = 100)
spline(Data2, n = 100)
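Since spline() works on one x/y series at a time, a matrix with 20 variables probably needs to be resampled column by column. The following is only a sketch under that assumption: resample_series is a hypothetical helper, and data1 and data2 are random-walk stand-ins for the two data sets.

```r
# Hypothetical helper: resample every column of a matrix to n_out equally
# spaced time points, using a cubic spline or linear interpolation.
resample_series <- function(mat, n_out = 100, method = c("spline", "linear")) {
  method <- match.arg(method)
  t_in <- seq_len(nrow(mat))
  apply(mat, 2, function(col) {
    if (method == "spline") {
      spline(t_in, col, n = n_out)$y
    } else {
      approx(t_in, col, n = n_out)$y
    }
  })
}

set.seed(123)
# Stand-in data: 20 random walks with 250 and 50 time points
data1 <- apply(matrix(rnorm(250 * 20, 0, 0.02), 250, 20), 2, cumsum)
data2 <- apply(matrix(rnorm(50 * 20, 0, 0.02), 50, 20), 2, cumsum)

out1 <- resample_series(data1)  # 250 -> 100 time points
out2 <- resample_series(data2)  #  50 -> 100 time points
```

Note that interpolation only samples the fitted curve; when shortening a series (250 to 100 points) it does no averaging, so the binned means shown in the other answer may be preferable for noisy data.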
I have a long format data frame where rows represent the responses (one of four categories) of different people. An example dataset is provided here:
df <- data.frame(person=c(rep("A",100),rep("B",100)),resp=c(sample(4,100,replace=TRUE),sample(4,100,replace=TRUE)))
df$resp <- factor(df$resp)
summary(df)
 person   resp
 A:100    1:52
 B:100    2:55
          3:54
          4:39
I want to present a chart where the x-axis plots the response category, the y-axis shows the proportion of responses in a category, and where error bars are calculated via bootstrapping (sampling with replacement).
I can calculate the proportion (in an extremely kludgy way; I'm sure this could be improved but this is not my main concern):
pFrame <- ddply(df,.(person,resp),summarise,trials = length(resp))
# can't figure out how to calculate the proportion with plyr.
pFrame$prop <- NA
people <- unique(df$person)
responses <- unique(df$resp)
for (i in 1 : length(people)){
nTrials <- nrow(subset(df,person==people[i]))
for (j in 1 : 4){
pFrame$prop[pFrame$person==people[i] & pFrame$resp==responses[j]] <- pFrame$trials[pFrame$person==people[i] & pFrame$resp==responses[j]] / nTrials
}
}
and plot it:
ggplot(pFrame,aes(x=resp,y=prop,colour=person)) + geom_point()
but I would really like to use something like stat_summary(fun.data="mean_cl_boot") to show the variability on the proportions (i.e. acting on the original data frame df, and bootstrapping over the rows). I've tried a few attempts at creating custom functions but this doesn't seem trivial because the factor levels need to be transformed for the bootstrap first.
I couldn't get ggplot's "mean_cl_boot" to work. Here is an alternative solution though:
library(boot)
library(reshape2) # provides melt()
library(ggplot2)
summary_for_plot <- melt(prop.table(table(df), 1))
names(summary_for_plot) <- c("person", "resp", "V1")
# function for boot()
summary_function <- function(df, d){
melt(prop.table(table(df[d,]), 1))[, 3]
}
bootres <- boot(df, statistic = summary_function, R=100)
# get the bootstrap standard deviation of each proportion (one column of bootres$t per person/resp cell), used for the error bars
summary_for_plot$sd <- apply(bootres$t, 2, sd)
ggplot(summary_for_plot, aes(x= resp, y = V1, color = person)) + geom_point() +
geom_errorbar(aes(ymin = V1-sd, ymax = V1+sd), width = 0.2)
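If percentile intervals are preferred over plus/minus one standard deviation, the bootstrap can also be sketched in base R. This version resamples rows within each person, which is an assumption on my part (the boot() call above resamples over all rows); n_boot = 500 and the 95% level are likewise arbitrary choices.

```r
set.seed(1)
df <- data.frame(person = rep(c("A", "B"), each = 100),
                 resp = factor(sample(4, 200, replace = TRUE), levels = 1:4))

# Proportion of each response category within each person
prop_by_person <- function(d) prop.table(table(d$person, d$resp), margin = 1)

n_boot <- 500  # number of bootstrap replicates (assumed)
rows_by_person <- split(seq_len(nrow(df)), df$person)

# Each replicate resamples rows with replacement within each person
boot_props <- replicate(n_boot, {
  idx <- unlist(lapply(rows_by_person, function(i) sample(i, replace = TRUE)))
  as.vector(prop_by_person(df[idx, ]))
})

# Point estimates in the same (person-fastest) order as the rows of boot_props
est <- as.data.frame(prop_by_person(df), responseName = "prop")
names(est)[1:2] <- c("person", "resp")
est$lower <- apply(boot_props, 1, quantile, probs = 0.025)
est$upper <- apply(boot_props, 1, quantile, probs = 0.975)
print(est)
```

The resulting est data frame can be passed to ggplot with geom_errorbar(aes(ymin = lower, ymax = upper)) in place of the V1-sd and V1+sd limits.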
Warning: still new to R.
I'm trying to construct some charts (specifically, a bubble chart) in R that shows political donations to a campaign. The idea is that the x-axis will show the amount of contributions, the y-axis the number of contributions, and the area of the circles the total amount contributed at this level.
The data looks like this:
CTRIB_NAML CTRIB_NAMF CTRIB_AMT FILER_ID
John Smith $49 123456789
The FILER_ID field is used to filter the data for a particular candidate.
I've used the following functions to convert this data frame into a bubble chart (thanks to help here and here).
vals<-sort(unique(dfr$CTRIB_AMT))
sums<-tapply( dfr$CTRIB_AMT, dfr$CTRIB_AMT, sum)
counts<-tapply( dfr$CTRIB_AMT, dfr$CTRIB_AMT, length)
symbols(vals,counts, circles=sums, fg="white", bg="red", xlab="Amount of Contribution", ylab="Number of Contributions")
text(vals, counts, sums, cex=0.75)
However, this results in way too many intervals on the x-axis. There are several million records all told, and divided up for some candidates could still result in an overwhelming amount of data. How can I convert the absolute contributions into ranges? For instance, how can I group the vals into ranges, e.g., 0-10, 11-20, 21-30, etc.?
----EDIT----
Following comments, I can convert vals to numeric and then slice into intervals, but I'm not sure then how I combine that back into the bubble chart syntax.
new_vals <- as.numeric(as.character(sub("\\$","",vals)))
new_vals <- cut(new_vals,100)
But regraphing:
symbols(new_vals,counts, circles=sums)
Is nonsensical -- all the values line up at zero on the x-axis.
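Once the amounts are binned into a factor with cut, you can use tapply again to find the counts and the sums for the new breaks. The grouping factor needs one entry per row of dfr, so bin the full CTRIB_AMT column (converted to numeric) rather than the unique vals. For example:
amt = as.numeric(sub("\\$", "", as.character(dfr$CTRIB_AMT)))
amt_bins = cut(amt, 100)
counts = tapply(amt, amt_bins, length)
sums = tapply(amt, amt_bins, sum)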
For this type of thing, though, you might find the plyr and ggplot2 packages helpful. Here is a complete reproducible example:
library(ggplot2)
library(plyr) # provides ddply()
# Options
n = 1000
breaks = 10
# Generate data
set.seed(12345)
CTRIB_NAML = replicate(n, paste(letters[sample(10)], collapse=''))
CTRIB_NAMF = replicate(n, paste(letters[sample(10)], collapse=''))
CTRIB_AMT = paste('$', round(runif(n, 0, 100), 2), sep='')
FILER_ID = replicate(10, paste(as.character((0:9)[sample(9)]), collapse=''))[sample(10, n, replace=TRUE)]
dfr = data.frame(CTRIB_NAML, CTRIB_NAMF, CTRIB_AMT, FILER_ID)
# Format data
dfr$CTRIB_AMT = as.numeric(sub('\\$', '', dfr$CTRIB_AMT))
dfr$CTRIB_AMT_cut = cut(dfr$CTRIB_AMT, breaks)
# Summarize data for plotting
plot_data = ddply(dfr, 'CTRIB_AMT_cut', function(x) data.frame(count=nrow(x), total=sum(x$CTRIB_AMT)))
# Make plot
dev.new(width=4, height=4)
qplot(CTRIB_AMT_cut, count, data=plot_data, geom='point', size=total) + theme(axis.text.x=element_text(angle=90, hjust=1)) # opts()/theme_text() were removed from ggplot2; theme()/element_text() replace them
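To get fixed-width ranges such as 0-10, 10-20 and so on, as asked in the question, cut() can also be given explicit breaks, and the result fed back into the original symbols() call using the bin midpoints as x positions. This is a base-R sketch on made-up amounts; the break width of 10, the upper limit of 100 and the inches value are assumptions.

```r
set.seed(12345)
# Made-up contribution amounts stored as "$..." strings, as in the question
amt_raw <- paste0("$", round(runif(1000, 0, 100), 2))
amt <- as.numeric(sub("\\$", "", amt_raw))

# Explicit breaks give the ranges [0,10], (10,20], ..., (90,100]
breaks <- seq(0, 100, by = 10)
bins <- cut(amt, breaks = breaks, include.lowest = TRUE)

counts <- as.vector(table(bins))             # number of contributions per range
sums <- as.vector(tapply(amt, bins, sum))    # total amount per range
mids <- head(breaks, -1) + diff(breaks) / 2  # x position: midpoint of each range

# Radius proportional to sqrt(total), so that circle AREA reflects the total
symbols(mids, counts, circles = sqrt(sums / pi), inches = 0.3,
        fg = "white", bg = "red",
        xlab = "Amount of Contribution", ylab = "Number of Contributions")
text(mids, counts, round(sums), cex = 0.75)
```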