How to plot recurrences in R

How can I plot a recurrence in R?
Any solution with base plot, ggplot2, lattice, or a dedicated package is welcome.
For example:
Imagine I have these data:
mydata <- data.frame(t=1:10, Y=runif(10))
t Y
1 0.3744869
2 0.6314202
3 0.3900789
4 0.6896278
5 0.6894134
6 0.5549006
7 0.4296244
8 0.4527201
9 0.3064433
10 0.5783539
I could transform it like this:
mydata2 <- data.frame(t=c(NA,mydata$t),Y=c(NA,mydata$Y),Y2=c(mydata$Y, NA))
t Y Y2
NA NA 0.9103703
1 0.9103703 0.1426041
2 0.1426041 0.4150476
3 0.4150476 0.2109258
4 0.2109258 0.4287504
5 0.4287504 0.1326900
6 0.1326900 0.4600964
7 0.4600964 0.9429571
8 0.9429571 0.7619739
9 0.7619739 0.9329098
10 0.9329098 NA
(or a similar method, although missing data can cause problems)
And plot it
plot(Y2~Y, data=mydata2)
I guess I must use some grouping function such as ave or apply. But it's not an elegant solution, and if I have more columns it can become difficult to generalize the transformation.
For example
mydata3 <- data.frame(x=sample(10,100, replace=T),t=1:100, Y=2*runif(100)+1)
For every x (or combination of values on other columns) I want to plot Y_{i+1} ~ Y_i, on the same plot.
Other tools, such as Mathematica, have functions to plot sequences directly.

I've found a solution, though not a very elegant one, for this sample data:
mydata <- data.frame(x=sample(4,25, replace=T),t=1:25, Y=2*runif(25)+1)
newdata <- mydata[order(mydata$x, mydata$t), ]
newdata$prev <- ave(newdata$Y, newdata$x, FUN=function(x) c(NA,head(x,-1)))
plot(Y~prev, data=newdata)
In this example there is not a row for every t value within each x, so you would first need to generate NAs for the missing time steps. It is only a quick solution; in my real data I have many observations for each t.
lag.plot can plot recurrence plots but not within each subgroup.
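The missing-row caveat above can be handled before lagging. The following is a minimal base-R sketch, not part of the original answer: lag_within is a hypothetical helper name, and it assumes integer-like time steps and at most one observation per (x, t) pair.

```r
# Hypothetical helper (not from the original answer): pair each Y with the
# previous Y inside every group, making gaps in t explicit as NA rows first.
# Assumes at most one observation per (group, time) combination.
lag_within <- function(df, group = "x", time = "t", value = "Y") {
  # One row per (group, time) combination, so missing steps become NA
  grid <- expand.grid(unique(df[[group]]),
                      seq(min(df[[time]]), max(df[[time]])))
  names(grid) <- c(group, time)
  full <- merge(grid, df, all.x = TRUE)
  full <- full[order(full[[group]], full[[time]]), ]
  # Shift the value by one time step within each group
  full$prev <- ave(full[[value]], full[[group]],
                   FUN = function(v) c(NA, head(v, -1)))
  full
}

set.seed(42)  # only to make the sketch reproducible
mydata <- data.frame(x = sample(4, 25, replace = TRUE), t = 1:25,
                     Y = 2 * runif(25) + 1)
newdata <- lag_within(mydata)

# One colour per group, all groups on the same plot
if (interactive()) plot(Y ~ prev, data = newdata, col = newdata$x, pch = 19)
```

Because the gaps are filled with NA, a point is only drawn when both Y_i and Y_{i+1} were actually observed at consecutive time steps.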

Related

Construct dataframe columns based on a function of other columns in R

I am trying to figure out a way to do this in R, ideally with something in the apply() family of functions (i.e. not with a for loop).
I want to use a function based on four other columns in my data frame and I want to save the results of that function in three new columns of the data frame.
For example if I have (with test data):
x <- c("var1","var2","var3","var4")
A_x <- c(5,4,3,2)
A_notx <- c(5,6,7,8)
B_x <- c(10,10,5,15)
B_notx <- c(10,10,15,5)
example <- data.frame(A_x,A_notx,B_x,B_notx)
rownames(example) <- x
A_x A_notx B_x B_notx
var1 5 5 10 10
var2 4 6 10 10
var3 3 7 5 15
var4 2 8 15 5
And I want to use oddsratio() from the epitools library on these counts, how could I save the odds ratio as well as the upper and lower bounds as 3 new columns? I would like example$odds, example$upper, and example$lower to exist in my dataframe.
I have messed around a bit with apply() and by() but can't seem to figure it out. With apply() the row changes structure from data frame to matrix, and setting column values from inside the function is outside its scope. Perhaps this whole thing is better served by a list object than a data frame? In the end I want to have all the information on hand (counts, statistics, etc.) for a given variable name, with a variable in each column.
Perhaps this is what you're looking for?
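library(epitools)  # provides oddsratio()
# For each row, build the 2x2 table and keep the odds-ratio row of $measure
example <- cbind(example,
  t(apply(example, 1, function(x) {
    oddsratio(as.table(rbind(x[1:2], x[3:4])))$measure[2, ]
  })))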
example
A_x A_notx B_x B_notx estimate lower upper
var1 5 5 10 10 0.99999998 0.20537812 4.8690679
var2 4 6 10 10 0.68116864 0.13043731 3.2586139
var3 3 7 5 15 1.28836297 0.20019246 7.2563905
var4 2 8 15 5 0.09603445 0.01039156 0.5446693
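The same apply(), t(), cbind() pattern works for any function that returns a named vector per row. As an illustration that runs without epitools, here is a sketch using a hand-computed Wald odds ratio. This is a swapped-in technique, so its numbers differ slightly from the estimates oddsratio() reports above, and wald_or is a hypothetical helper name.

```r
example <- data.frame(A_x = c(5, 4, 3, 2), A_notx = c(5, 6, 7, 8),
                      B_x = c(10, 10, 5, 15), B_notx = c(10, 10, 15, 5),
                      row.names = c("var1", "var2", "var3", "var4"))

# Hypothetical helper: Wald odds ratio with a 95% interval for one row of
# counts given in the order A_x, A_notx, B_x, B_notx.
# This is NOT the estimator epitools::oddsratio() uses by default.
wald_or <- function(counts) {
  counts <- unname(counts)              # drop names so the result names stay clean
  est <- (counts[1] * counts[4]) / (counts[2] * counts[3])
  se  <- sqrt(sum(1 / counts))          # standard error of log(odds ratio)
  z   <- qnorm(0.975)
  c(odds  = est,
    lower = exp(log(est) - z * se),
    upper = exp(log(est) + z * se))
}

# apply() returns one column per row of the input, hence the t()
example <- cbind(example, t(apply(example, 1, wald_or)))
example$odds
```

After this, example$odds, example$lower and example$upper exist as ordinary columns next to the counts.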

loess() on subsets of data

Apologies if this has been answered before, but I am unable to find an applicable example.
I am trying to detrend some data for variogram analysis.
I have a dataframe 'aa' with columns 'y' 'long' 'lat' and 'z'.
I am trying to run:
loess(aa2$y ~ aa$long + aa$lat, aa, degree =2) on each level of factor z.
In the end, I need a dataframe of 'Long' 'Lat' 'Residual' and 'Z', with the residuals coming from the multiple factor-specific loess objects.
Given my limited knowledge of R, I cannot figure out the proper syntax to make this happen.
I am assuming one of the *apply functions could be used but I don't know the language well enough to write it properly.
Thank you for any guidance or clarification.
Like this?
aa <- data.frame(y=rnorm(100),long=rnorm(100),lat=rnorm(100),Z=rep(1:4, each=25))
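result <- do.call(rbind, lapply(unique(aa$Z), function(z) {
  df <- aa[aa$Z == z, ]                          # rows for this level of Z
  fit <- loess(y ~ long + lat, df, degree = 2)   # one fit per level
  # data.frame() rather than cbind(), so the combined result is a data frame
  data.frame(Z = z, long = df$long, lat = df$lat, residuals = residuals(fit))
}))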
head(result)
# Z long lat residuals
# 1 1 0.9622113 0.03114804 -0.2189496
# 2 1 -0.6539525 0.32908716 1.3904483
# 3 1 1.0066978 -0.78833830 0.1044707
# 4 1 -1.0873116 -0.55218226 1.8526030
# 5 1 -1.1286776 1.68879949 0.2459814
# 6 1 -1.0052768 -0.85890027 -0.9842824

Histograms in R with a "more" category, similar to MS Excel

Consider the following frequency data:
> table(income)
income
3 5 6 7 8 5000
2 7 2 2 2 1
When I run hist(income), most income values are concentrated around 5 and one value lies far from the others, so the histogram does not look very good. MS Excel can treat the 5000 value as belonging to a separate category, so the data would look like this instead:
> table(income)
income
3 5 6 7 8 more
2 7 2 2 2 1
Plotting this as a histogram would look much better, because the frequencies are shown within a shorter range.
Is there any way to do this, either with the hist() function or with functions from lattice or ggplot2? However, I don't want to overwrite the values that exceed a certain threshold, so that I don't lose any information.
Thanks a lot!
Data generation:
income <- c(rep(3,2), rep(5,7), rep(6,2), rep(7,2), rep(8,2), 5000)
Function for preparing data for plotting:
nice.data <- function(x, threshold=10){
x[x>threshold] <- "More"
x
}
Plotting:
library(ggplot2)
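# nice.data() returns a character (discrete) vector, so use geom_bar();
# current ggplot2 versions reject discrete x in geom_histogram()
ggplot() + geom_bar(aes(x=nice.data(income))) + xlab("Income")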
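A variation on the same idea, as a sketch: keep income itself untouched and build a separate factor whose levels are listed explicitly, so the categories stay in numeric order with "More" last. The names threshold, binned and counts are illustrative assumptions, and base barplot() is used here instead of ggplot2.

```r
income <- c(rep(3, 2), rep(5, 7), rep(6, 2), rep(7, 2), rep(8, 2), 5000)
threshold <- 10   # assumed cut-off; anything above it is shown as "More"

# Labels for plotting only; the original income vector is not modified
lab <- ifelse(income > threshold, "More", as.character(income))

# Explicit levels keep the numeric order and put "More" at the end
lev <- c(sort(unique(income[income <= threshold])), "More")
binned <- factor(lab, levels = lev)

counts <- table(binned)
counts

if (interactive()) barplot(counts, xlab = "Income", ylab = "Frequency")
```

Setting the levels by hand matters once values such as 10 or 12 appear, because plain character sorting would place "10" before "3".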

Get ordered kmeans cluster labels

Say I have a data set x and do the following kmeans cluster:
fit <- kmeans(x,2)
My question is in regards to the output of fit$cluster: I know that it will give me a vector of integers (from 1:k) indicating the cluster to which each point is allocated. Instead, is there a way to have the clusters be labeled 1,2, etc... in order of decreasing numerical value of their center?
For example: If x=c(1.5,1.4,1.45,.2,.3,.3) , then fit$cluster should result in (1,1,1,2,2,2) but not result in (2,2,2,1,1,1)
Similarly, if x=c(1.5,.2,1.45,1.4,.3,.3) then fit$cluster should return (1,2,1,1,2,2), instead of (2,1,2,2,1,1)
Right now, fit$cluster seems to label the cluster numbers randomly. I've looked into documentation but haven't been able to find anything. Please let me know if you can help!
I had a similar problem. I had a vector of ages that I wanted to separate into 5 factor groups based on a logical ordinal set. I did the following:
I ran the k-means function:
k5 <- kmeans(all_data$age, centers = 5, nstart = 25)
I built a data frame of the k-means indexes and centres, then arranged it by centre value.
library(dplyr)  # for tibble(), arrange(), mutate() and %>%
kmeans_index <- as.numeric(rownames(k5$centers))
k_means_centres <- as.numeric(k5$centers)
# tibble() replaces the deprecated data_frame()
k_means_df <- tibble(index=kmeans_index, centres=k_means_centres)
k_means_df <- k_means_df %>%
arrange(centres)
Now that the centres are in the df in ascending order, I created my 5 element factor list and bound it to the data frame:
factors <- c("very_young", "young", "middle_age", "old", "very_old")
k_means_df <- cbind(k_means_df, factors)
Looks like this:
> k_means_df
index centres factors
1 2 23.33770 very_young
2 5 39.15239 young
3 1 55.31727 middle_age
4 4 67.49422 old
5 3 79.38353 very_old
I saved my cluster values in a data frame and created a dummy factor column:
cluster_vals <- tibble(cluster=k5$cluster, factor=NA_character_)
Finally, I iterated through the factor options in k_means_df and replaced the cluster value with my factor/character value within the cluster_vals data frame:
for (i in 1:nrow(k_means_df))
{
index_val <- k_means_df$index[i]
factor_val <- as.character(k_means_df$factors[i])
cluster_vals <- cluster_vals %>%
mutate(factor=replace(factor, cluster==index_val, factor_val))
}
Voila; I now have a vector of factors/characters that were applied based on their ordinal logic to the randomly created cluster vector.
# A tibble: 3,163 x 2
cluster factor
<int> <chr>
1 4 old
2 2 very_young
3 2 very_young
4 2 very_young
5 3 very_old
6 3 very_old
7 4 old
8 4 old
9 2 very_young
10 5 young
# ... with 3,153 more rows
Hope this helps.
K-means is a randomized algorithm, so it is expected that the labels are neither consistent across runs nor sorted in any particular order.
You can, of course, remap the labels however you like.
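One way to do that remapping, as a minimal sketch: rank the cluster centres and use the ranks as a lookup table for fit$cluster. The names new_label and ordered_cluster are illustrative, and set.seed() and nstart are there only to make the sketch reproducible.

```r
set.seed(1)  # only for reproducibility of the sketch
x <- c(1.5, 1.4, 1.45, 0.2, 0.3, 0.3)
fit <- kmeans(x, 2, nstart = 10)

# Row i of fit$centers belongs to cluster i. Ranking the negated centres
# gives label 1 to the largest centre, 2 to the next, and so on.
new_label <- rank(-fit$centers[, 1], ties.method = "first")

# Use the ranks as a lookup table indexed by the original cluster ids
ordered_cluster <- as.integer(new_label[fit$cluster])
ordered_cluster
```

For multivariate data the same lookup works once you decide which column of fit$centers, or which summary of it, defines the order.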
You seem to be using 1-dimensional data. Then k-means is actually not the best choice for you.
In contrast to 2- and higher-dimensional data, 1-dimensional data can efficiently be sorted. If your data is 1-dimensional, use an algorithm that exploits this for efficiency. There are much better algorithms for 1-dimensional data than for multivariate data.
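As one illustration of exploiting the ordering, the sketch below sorts the values and splits them at the k - 1 largest gaps. This is a simple heuristic chosen for the example, not the optimal k-means solution and not necessarily the algorithm the answer has in mind. All variable names are illustrative.

```r
x <- c(1.5, 0.2, 1.45, 1.4, 0.3, 0.3)
k <- 2

sx <- sort(x)
gaps <- diff(sx)

# Positions of the k - 1 largest gaps between neighbouring sorted values
cut_pos <- sort(order(gaps, decreasing = TRUE)[seq_len(k - 1)])

# Put each break halfway across its gap
breaks <- (sx[cut_pos] + sx[cut_pos + 1]) / 2

ascending <- findInterval(x, breaks) + 1   # 1 = group with the lowest values
label <- k + 1 - ascending                 # 1 = group with the highest values
label
```

The labels come out ordered by construction, so no remapping step is needed.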

How to pass a list to ggplot2?

I'm trying to draw a boxplot of a list of values with ggplot2, but the problem is that it doesn't know how to deal with lists. What should I try?
E.g.:
k <- list(c(1,2,3,4,5),c(1,2,3,4),c(1,3,6,8,14),c(1,3,7,8,10,37))
k
[[1]]
[1] 1 2 3 4 5
[[2]]
[1] 1 2 3 4
[[3]]
[1] 1 3 6 8 14
[[4]]
[1] 1 3 7 8 10 37
If I pass k as an argument to boxplot() it will handle it flawlessly and produce a nice (well not so nice... hehehe) boxplot with the range of all the values as the Y-axis and the list index (each element) as the X-axis.
How can I achieve the exact same effect with ggplot2? I think that data frames or matrices are not an option because the vectors are of different lengths.
Thanks
The answer is that you don't. ggplot2 is designed to work with data frames, particularly long form data frames. That means you need your data as one tall vector, with a grouping factor:
d <- data.frame(x = unlist(k),
grp = rep(letters[1:length(k)],times = sapply(k,length)))
ggplot(d,aes(x = grp, y = x)) + geom_boxplot()
And as pointed out in the comments, melt achieves the same result as this manual reshaping and is much simpler. I guess I like to make things difficult.
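For comparison, base R's stack() performs a similar list-to-long reshaping without extra packages. It is shown here as an alternative sketch, not as the melt call the comments referred to; the list must be named first, and the column names x and grp are chosen to match the manual version above.

```r
k <- list(c(1, 2, 3, 4, 5), c(1, 2, 3, 4), c(1, 3, 6, 8, 14), c(1, 3, 7, 8, 10, 37))

# stack() needs names; use the list positions as group labels
d <- stack(setNames(k, seq_along(k)))
names(d) <- c("x", "grp")   # stack() calls these columns "values" and "ind"
str(d)

# Plot only if ggplot2 is installed
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(d, ggplot2::aes(x = grp, y = x)) + ggplot2::geom_boxplot()
  if (interactive()) print(p)
}
```

The grp column is a factor with one level per list element, so vectors of different lengths are not a problem.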
