I have two variables that i want to bootstrap and graph the resulting linear regression line on the same xy plane from each new data set.
I was thinking that i can hold each resulting intercept and slope from the lm() but i dont know how i could graph that information for each resulting pair of information in the same graph. I know that abline() can do one pair but not all of them. Feel free to throw anything at me.
intercept_stuff<-rep(NA,T)
opp_stuff<-rep(NA,T)
N<-1000
for(t in 1:T){
idx <- sample(1:N, size =N, replace=TRUE)
intercept_stuff[t]<- lm(oppose_any~local_topic ,data = facebook[idx,
])$coefficient[1]
opp_stuff[t]<- lm(oppose_any~local_topic ,data = facebook[idx,
])$coefficient[2]
}
Here is an example of how to do multiple pairs of lines on ggplot with some simulated data. Hopefully this will give you some useful clues:
library(reshape2)
library(tibble)
# simulate some data
obs <- c(1:90)
values1 <- rnorm(90,mean=0,sd=1)
values2 <- rnorm(90,mean=5,sd=2)
values3 <- rnorm(90,mean=10,sd=3)
df <- as.tibble(cbind(obs,values1,values2,values3))
It looks like this:
> df
# A tibble: 90 x 4
obs values1 values2 values3
<dbl> <dbl> <dbl> <dbl>
1 1. -0.162 7.47 10.7
2 2. 0.518 5.17 7.61
3 3. 1.52 7.66 4.42
# ... with 80 more rows
Then melt it into long form:
m.df <- melt(df,id="obs",measures=c("values1","values2","values3"))
to look like this:
> m.df
obs variable value
1 1 values1 -0.16228714
2 2 values1 0.51755370
3 3 values1 1.52433685
4 4 values1 -1.82067006
5 5 values1 -1.42180601
...
Then plot many lines (as long as there are groups like the color group here, they will be unique lines):
ggplot(m.df,aes(x=obs,y=value,color=variable)) + geom_line()
And here is the plot:
Good luck!
Related
I have a for loop I would like to run by group. I would like it to run through a set of data, creates a time series for most rows, and then output a forecast for that row of data (based on that time point and the ones preceding it) in the group The issue I am having is running that loop for every 'group' within my data. I want to avoid doing so manually as that would take hours and surely there is a better way.
Allow to me explain in more detail.
I have a large dataset (1.6M rows), each row has a year, country A, country B, and a number of measures which concern the relationship between the two.
So far, I have been successful in extracting a single (country A, country B) relationship into a new table and using a for loop to output the necessary forecast data to a new variable in the dataset. I'd like to create to have that for loop run over every (country A, country B) grouping with more than 3 entries.
The data:
Here I will replicate a small slice of the data, and will include a missing value for realism.
set.seed(2000)
df <- data.frame(year = rep(c(1946:1970),length.out=50),
ccode1 = rep(c("2"), length.out = 50),
ccode2 = rep(c("20","31"), each=25),
kappavv = rnorm(50,mean = 0, sd=0.25),
output = NA)
df$kappavv[12] <- NA
What I've done:
NOTE: I start forecasting from the third data point of each group but based on all time points preceding the forecast.
for(i in 3:nrow(df)){
dat_ts <- ts(df[, 4], start = c(min(df$year), 1), end = c(df$year[i], 1), frequency = 1)
dat_ts_corr <- na_interpolation(dat_ts)
trialseries <- holt(dat_ts_corr, h=1)
df$output[i] <- trialseries$mean
}
This part works and outputs what I want when I apply it to a single pairing of ccode1 and ccode2 when arranged correctly in ascending order of years.
What isn't working:
I am having some serious problems getting my head around applying this for loop by grouping of ccode2. Some of my data is uneven: sometimes groups are different sizes, having different start/end points, and there are missing data.
I have tried expressing the loop as a function, using group_by() and piping, using various types of apply() functions.
Your help is appreciated. Thanks in advance. I am glad to answer any clarifying questions you have.
You can put the for loop code in a function.
library(dplyr)
library(purrr)
apply_func <- function(df) {
for(i in 3:nrow(df)){
dat_ts <- ts(df[, 4], start = c(min(df$year), 1),
end = c(df$year[i], 1), frequency = 1)
dat_ts_corr <- imputeTS::na_interpolation(dat_ts)
trialseries <- forecast::holt(dat_ts_corr, h=1)
df$output[i] <- trialseries$mean
}
return(df)
}
Split the data by ccode2 and apply apply_func.
df %>%group_split(ccode2) %>% map_df(apply_func)
# year ccode1 ccode2 kappavv output
# <int> <chr> <chr> <dbl> <dbl>
# 1 1946 2 20 -0.213 NA
# 2 1947 2 20 -0.0882 NA
# 3 1948 2 20 0.223 0.286
# 4 1949 2 20 0.435 0.413
# 5 1950 2 20 0.229 0.538
# 6 1951 2 20 -0.294 0.477
# 7 1952 2 20 -0.485 -0.675
# 8 1953 2 20 0.524 0.405
# 9 1954 2 20 0.0564 0.0418
#10 1955 2 20 0.294 0.161
# … with 40 more rows
I'm generating scatter plot of the comparison of 2 variables in a data frame, where the colors of the points were determined by their type (another column in the data frame). However, there are many points that overlap each other and therefore I want an estimation of how many points there are in a given part of the grid. (for example: In case of hexbin package, one can estimate how many values there are in a given hexagon on the grid)
My question is: how can I use the hexbin package where for every group I will have another color scale, so in that way it will be possible to distinguish between the groups and get an evaluation of how many values there are.
I tried to google it but didn't find any satisfying solution.
All the options that I found were focused only on the distinction between the groups.
My code so far:
ggplot(data_for_scatter_plot,aes(x=Log2FoldChange, y=TT_frequency, color=factor(Type))) +
geom_point(alpha = 0.6)
where my data_for_scatter_plot data frame is:
Gene Log2FoldChange length TT_frequency Type
ENSG00000007968 1.928153 24791 0.05623008 up_regulated
ENSG00000009724 2.209263 20711 0.05842306 down_regulated
ENSG00000010219 1.794972 53099 0.08250626 other_genes
ENSG00000053438 3.815411 2479 0.10851150 up_regulated
And the graph that I get is:
While I want to get the following graph for each group in a different color scale:
While this is not exactly what you asked for it might be a push in the right direction.
I don't think multiple color/cont scales are possible at the same time. You can change the color of the lines around the hexes, and increase line thickness.
First some sample data (please provide this yourself in the future)
library(tidyverse)
library(hexbin)
set.seed(1)
data_for_scatter_plot <-
crossing(
tibble(Trial =seq(1:100)),
tibble(Extra = seq(1:10)),
tribble(
~Gene, ~Log2FoldChange, ~length, ~TT_frequency, ~Type,
"ENSG00000007968", 1.928153, 24791, 0.05623008, "up_regulated",
"ENSG00000009724", 2.209263, 20711, 0.05842306, "down_regulated",
"ENSG00000010219", 1.794972, 53099, 0.08250626, "other_genes",
"ENSG00000053438", 3.815411, 2479 , 0.10851150, "up_regulate")) %>%
mutate(
Log2FoldChange = Log2FoldChange*0.001*Trial+rnorm(n=n(), mean=0, sd = 0.1),
TT_frequency = TT_frequency-0.00001*Trial+rnorm(n=n(), mean=0, sd = 0.005)
)
data_for_scatter_plot
# A tibble: 4,000 x 7
Trial Extra Gene Log2FoldChange length TT_frequency Type
<int> <int> <chr> <dbl> <dbl> <dbl> <chr>
1 1 1 ENSG00000007968 -0.0607 24791 0.0505 up_regulated
2 1 1 ENSG00000009724 0.0206 20711 0.0622 down_regulated
3 1 1 ENSG00000010219 -0.0818 53099 0.0853 other_genes
4 1 1 ENSG00000053438 0.163 2479 0.102 up_regulate
5 1 2 ENSG00000007968 0.0349 24791 0.0461 up_regulated
6 1 2 ENSG00000009724 -0.0798 20711 0.0614 down_regulated
7 1 2 ENSG00000010219 0.0505 53099 0.0754 other_genes
8 1 2 ENSG00000053438 0.0776 2479 0.117 up_regulate
9 1 3 ENSG00000007968 0.0595 24791 0.0654 up_regulated
10 1 3 ENSG00000009724 -0.0283 20711 0.0653 down_regulated
# ... with 3,990 more rows
Now the plot:
data_for_scatter_plot %>%
ggplot(aes(x=Log2FoldChange, y=TT_frequency, color=factor(Type)))+
geom_hex(size = 1)
I hope this helps.
---EDIT after comment---
I think a better solution for you is to facet. This will also make you able to distinguish overlapping groups
data_for_scatter_plot %>%
ggplot(aes(x=Log2FoldChange, y=TT_frequency, color=factor(Type)))+
geom_hex()+
facet_grid(cols = vars(factor(Type)))
I have a dataframe with character and numeric data. I would like to use dplyr to create a summary grouped by time points and trials generating the following:
averages
standard deviations
variation
ratio between time points
(etc etc)
I feel like all of this could be done in the dplyr pipe, but I am struggling to make a ratio of averages between time points within trials.
I fully admit that I may be carrying around a hammer looking for nails, so please feel free to recommend solutions that utilize other packages or functions, but ideally I'd like simple/straight forward code for ease of use by multiple collaborators.
library(dplyr)
# creating an example DF
num <- runif(100, 50, 3200)
smpl <- 1:100
df <- data.frame( num, smpl)
df$time <- "time1"
df$time[seq(2,100,2)] <- "time2"
df$trial <- "a"
df$trial[26:50] <- "b"
df$trial[51:75] <- "c"
df$trial[75:100] <- "d"
# using the magic of pipelines to calculate useful things
df1 <- df %>%
group_by(time, trial) %>%
summarise(avg = mean(num),
var = var(num),
stdev = sd(num))
I'd love to get [the ratio time2/time1 of the avg for each trial] included in this block above, but I don't know how to call "avg" specifically by "time1" vs "time2" within the pipe.
From here on, nothing does quite what I'm hoping for...
df1 <- df1[with(df1,order(trial,time)),]
# this better ressembles my actual DF structure,
# so reordering it will make some of my next attempts to solve this make more sense
I tried to use the fact that 'every other line' is different (this is not ideal because each df will have a different number of rows, so I will either introduce NAs or it will require constantly change these #'s (or writing a function to constantly change them))
tm2 <- data.frame(x=df1$avg[seq(2,4,2)])
tm1 <- data.frame(x=df1$avg[seq(1,3,2)])
so minimally, this is the ratio I'd like included in the df, but tied to the avg & trial columns:
tm2/tm1
It doesn't matter to me 'which' time row this ratio ends up in, so long as it is consistent across all the trials (so if a column of ratios has "blank" for every "time1" and "value" for every "time2", that's fine).
# I added in a separate column to allow 'match' later
tm1$time <- "time1"
tm2$time <- "time1" # to keep them all 'in row'
df1$avg_tm1 <- tm1$x[match(df1$time, tm1$time)]
df1$avg_tm2 <- tm2$x[match(df1$time, tm2$time)]
but this fails to match by 'trial' also, since that info is lost in this new tm1 df ; this really makes me think it should all be done in dplry the first time...
Then I tried to create a new column in the tm1 df with the ratio
tm2$ratio <-tm2$x/tm1$x
and add in the ratio values only if the avg matches
df1$ratio <- tm2$ratio[match(tm2$x, df1$avg)]
This might work, but when I extract the avg values, it rounds, so the numbers do not match exactly. I'm also cautious about this because if I process ridiculous amounts of data, there's a higher and higher chance that two random averages will be similar enough to misplace these ratios.
I tried several other things that completely failed, so let's pretend that something worked and entered the ratio into the df1 as separate columns
Then any further calculations or annotations are straight forward:
df2 <- df1 %>%
mutate(ratio = avg_tm2/avg_tm1,
lost = 1- ratio,
word = paste0(round(lost*100),"%"))
But I am still stuck on 'how' to call specific cells inside the pipe or which other tools/packages to use to calculate deltas or ratios between cells in the same column.
Thanks in advance
We could group by 'trial' and mutate to create the 'ratio' column
df1 %>%
group_by(trial) %>%
mutate(ratio = last(avg)/first(avg))
# A tibble: 8 x 6
# Groups: trial [4]
# time trial avg var stdev ratio
# <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#1 time1 a 1815. 715630. 846. 0.795
#2 time1 b 2012. 1299823. 1140. 0.686
#3 time1 c 1505. 878168. 937. 1.09
#4 time1 d 1387. 902364. 950. 1.17
#5 time2 a 1444. 998943. 999. 0.795
#6 time2 b 1380. 720135. 849. 0.686
#7 time2 c 1641. 1205778. 1098. 1.09
#8 time2 d 1619. 582418. 763. 1.17
NOTE: We used set.seed(2) for creating the dataset
Work out a separate data.frame:
set.seed(2)
# your code above to generate df1
df2 <- select(df1, time, trial, avg) %>%
spread(time, avg) %>%
mutate(ratio = time2/time1)
df2
# # A tibble: 4 × 4
# trial time1 time2 ratio
# <chr> <dbl> <dbl> <dbl>
# 1 a 1815.203 1443.731 0.7953555
# 2 b 2012.436 1379.981 0.6857266
# 3 c 1505.474 1641.439 1.0903135
# 4 d 1386.876 1619.341 1.1676176
and now you can merge the relevant column onto the original frame:
left_join(df1, select(df2, trial, ratio), by="trial")
# Source: local data frame [8 x 6]
# Groups: time [?]
# time trial avg var stdev ratio
# <chr> <chr> <dbl> <dbl> <dbl> <dbl>
# 1 time1 a 1815.203 715630.4 845.9494 0.7953555
# 2 time1 b 2012.436 1299823.3 1140.0979 0.6857266
# 3 time1 c 1505.474 878168.3 937.1063 1.0903135
# 4 time1 d 1386.876 902363.7 949.9282 1.1676176
# 5 time2 a 1443.731 998943.3 999.4715 0.7953555
# 6 time2 b 1379.981 720134.6 848.6074 0.6857266
# 7 time2 c 1641.439 1205778.0 1098.0792 1.0903135
# 8 time2 d 1619.341 582417.5 763.1629 1.1676176
I have this accelerometer dataset and, let's say that I have some n number of observations for each subject (30 subjects total) for body-acceleration x time.
I want to make a plot so that it plots these body acceleration x time points for each subject in a different color on the y axis and the x axis is just an index. I tried this:
ggplot(data = filtered_data_walk, aes(x = seq_along(filtered_data_walk$'body-acceleration-mean-y-time'), y = filtered_data_walk$'body-acceleration-mean-y-time')) +
geom_line(aes(color = filtered_data_walk$subject))
But, the problem is that it doesn't superimpose the 30 lines, instead, they run along side each other. In other words, I end up with n1 + n2 + n3 + ... + n30 x index points, instead of max{n1, n2, ..., n30}. This is my first time posting, so I hope this makes sense (I know my formatting is bad).
One solution I thought of was to create a new variable which gives a value of 1 to n for all the observations of each subject. So, for example, if I had 6 observations for subject1, 4 observations for subject2, and 9 observations for subject3, this new variable would be sequenced like:
1 2 3 4 5 6 1 2 3 4 1 2 3 4 5 6 7 8 9
Is there an easy way to do this? Please help, ty.
Assuming your data is formatted as a data.frame or matrix, for a toy dataset like
x <- data.frame(replicate(5, rnorm(10)))
x
# X1 X2 X3 X4 X5
# 1 -1.36452272 -1.46446475 2.0444381 0.001585876 -1.1085990
# 2 -1.41303046 -0.14690269 1.6179084 -0.310162018 -1.5528733
# 3 -0.15319554 -0.18779791 -0.3005058 0.351619212 1.6282955
# 4 -0.38712167 -0.14867239 -1.0776359 0.106694311 -0.7065382
# 5 -0.50711166 -0.95992916 1.3522922 1.437085757 -0.7921355
# 6 -0.82377208 0.50423328 -0.5366513 -1.315263679 1.0604499
# 7 -0.01462037 -1.15213287 0.9910678 0.372623508 1.9002438
# 8 1.49721113 -0.84914197 0.2422053 0.337141898 1.2405208
# 9 1.95914245 -1.43041783 0.2190829 -1.797396822 0.4970690
# 10 -1.75726827 -0.04123615 -0.1660454 -1.071688768 -0.3331887
...you might be able to get there with something like
plot(x[,1], type='l', xlim=c(1, nrow(x)), ylim=c(min(x), max(x)))
for(i in 2:ncol(x)) lines(x[,i], col=i)
You could play with formatting some more, of course, do things with lty= and lwd= and maybe a color ramp of your own choosing, etc.
If your data is in the format below...
x <- data.frame(id=c("A","A","A","B","B","B","B","C","C"), acc=rnorm(9))
x
# id acc
# 1 A 0.1796964
# 2 A 0.8770237
# 3 A -2.4413527
# 4 B 0.9379746
# 5 B -0.3416141
# 6 B -0.2921062
# 7 B 0.1440221
# 8 C -0.3248310
# 9 C -0.1058267
...you could get there with
maxn <- max(with(x, tapply(acc, id, length)))
ids <- sort(unique(x$id))
plot(x$acc[x$id==ids[1]], type='l', xlim=c(1,maxn), ylim=c(min(x$acc),max(x$acc)))
for(i in 2:length(ids)) lines(x$acc[x$id==ids[i]], col=i)
Hope this helps, and that I interpreted your problem right--
That's pretty quick to do if you are OK with using dplyr. group_by to enforce a separate counter for each subject, mutate to add the actual counter, and your ggplot should work. Example with iris dataset:
group_by(iris, Species) %>%
mutate(index = seq_along(Petal.Length)) %>%
ggplot() + geom_line(aes(x=index, y=Petal.Length, color=Species))
I am trying to plot the CDF curve for a large dataset containing about 29 million values using ggplot. The way I am computing this is like this:
mycounts = ddply(idata.frame(newdata), .(Type), transform, ecd = ecdf(Value)(Value))
plot = ggplot(mycounts, aes(x=Value, y=ecd))
This is taking ages to plot. I was wondering if there is a clean way to plot only a sample of this dataset (say, every 10th point or 50th point) without compromising on the actual result?
I am not sure about your data structure, but a simple sample call might be enough:
n <- nrow(mycounts) # number of cases in data frame
mycounts <- mycounts[sample(n, round(n/10)), ] # get an n/10 sample to the same data frame
Instead of taking every n-th point, can you quantize your data set down to a sufficient resolution before plotting it? That way, you won't have to plot resolution you don't need (or can't see).
Here's one way you can do it. (The function I've written below is generic, but the example uses names from your question.)
library(ggplot2)
library(plyr)
## A data set containing two ramps up to 100, one by 1, one by 10
tens <- data.frame(Type = factor(c(rep(10, 10), rep(1, 100))),
Value = c(1:10 * 10, 1:100))
## Given a data frame and ddply-style arguments, partition the frame
## using ddply and summarize the values in each partition with a
## quantized ecdf. The resulting data frame for each partition has
## two columns: value and value_ecdf.
dd_ecdf <- function(df, ..., .quantizer = identity, .value = value) {
value_colname <- deparse(substitute(.value))
ddply(df, ..., .fun = function(rdf) {
xs <- rdf[[value_colname]]
qxs <- sort(unique(.quantizer(xs)))
data.frame(value = qxs, value_ecdf = ecdf(xs)(qxs))
})
}
## Plot each type's ECDF (w/o quantization)
tens_cdf <- dd_ecdf(tens, .(Type), .value = Value)
qplot(value, value_ecdf, color = Type, geom = "step", data = tens_cdf)
## Plot each type's ECDF (quantizing to nearest 25)
rounder <- function(...) function(x) round_any(x, ...)
tens_cdfq <- dd_ecdf(tens, .(Type), .value = Value, .quantizer = rounder(25))
qplot(value, value_ecdf, color = Type, geom = "step", data = tens_cdfq)
While the original data set and the ecdf set had 110 rows, the quantized-ecdf set is much reduced:
> dim(tens)
[1] 110 2
> dim(tens_cdf)
[1] 110 3
> dim(tens_cdfq)
[1] 10 3
> tens_cdfq
Type value value_ecdf
1 1 0 0.00
2 1 25 0.25
3 1 50 0.50
4 1 75 0.75
5 1 100 1.00
6 10 0 0.00
7 10 25 0.20
8 10 50 0.50
9 10 75 0.70
10 10 100 1.00
I hope this helps! :-)