I've got panel data and have been playing around with k-means clustering. So now I've got a panel of factor values that are mostly stable but I'd like to smooth that out a bit more so that (for example) the data says "Wyoming was in group 1 in earlier years, moved into group 2, then moved into group 5" rather than "Wyoming was in group 1,1,1,2,3,2,2,5,5,5".
So the approach I'm taking is to use rollapply() to calculate the modal value. Below is code that works to calculate the mode ("Mode()"), and a wrapper for that ("ModeR()") that (perhaps clumsily) resolves the problem of multi-modal windows by randomly picking a mode. All that is fine, but when I put it into rollapply() I'm getting problems.
library(dplyr)  # arrange(), filter(), desc()
library(zoo)    # rollapply()

Mode <- function(vect){ # take a vector as input
  temp <- as.data.frame(table(vect))
  temp <- arrange(temp, desc(Freq))     # sort by frequency, descending (from dplyr)
  max.f <- temp[1, 2]
  temp <- filter(temp, Freq == max.f)   # cut out anything that isn't modal
  return(temp[, 1])
}
ModeR <- function(vect){
  out <- Mode(vect)
  # randomly pick one mode if the window is multi-modal
  return(out[round(runif(1, min = 0.5000001, max = length(out) + 0.499999999))])
}
temp <- round(runif(20,min=1,max=10)) # A vector to test this out on.
cbind(temp,rollapply(data=temp,width=5,FUN=ModeR,fill=NA,align="right"))
which returned:
temp
[1,] 5 NA
[2,] 6 NA
[3,] 5 NA
[4,] 5 NA
[5,] 7 1
[6,] 6 1
[7,] 5 1
[8,] 5 1
[9,] 3 2
[10,] 1 3
[11,] 5 3
[12,] 7 3
[13,] 5 3
[14,] 4 3
[15,] 3 3
[16,] 4 2
[17,] 8 2
[18,] 5 2
[19,] 6 3
[20,] 6 3
Compare that with:
> ModeR(temp[1:5])
[1] 5
Levels: 5 6 7
> ModeR(temp[2:6])
[1] 6
Levels: 5 6 7
So it seems like the problem is in how ModeR is being applied in rollapply(). Any ideas?
Thanks!
Rick
Thanks to /u/murgs! His comment pointed me in the right direction (in addition to helping me streamline ModeR() using sample()).
ModeR() as written above returns a factor (as does Mode()). I need it to be a number. I can fix this by updating my code as follows:
Mode <- function(vect){ # take a vector as input
  temp <- as.data.frame(table(vect))
  temp <- arrange(temp, desc(Freq))
  max.f <- temp[1, 2]
  temp <- filter(temp, Freq == max.f)             # cut out anything that isn't modal
  return(as.numeric(as.character(temp[, 1])))     # HERE'S THE BIG CHANGE
}
ModeR <- function(vect){
  out <- Mode(vect)
  return(out[sample(1:length(out), 1)])           # HERE'S SOME IMPROVED CODE!
}
Now rollapply() does what I expected it to do! As for that weird as.character() bit: table() stores the values as a factor, and calling as.numeric() directly on a factor returns the internal integer level codes rather than the original values (which is why it looked like the numbers were being rounded down); converting to character first recovers the actual values.
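Here is a minimal illustration of that factor gotcha (not from the original post):

# as.numeric() on a factor returns the internal level codes, not the values
f <- factor(c(5, 6, 7))
as.numeric(f)                # 1 2 3  -- the level codes
as.numeric(as.character(f))  # 5 6 7  -- the original values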
I have 12 plots in ggplot, and I am arranging them with grid.arrange. I manually set the number of rows in the grid to 4, the number of columns to 3. Since 3 x 4 = 12, this works like a charm.
But what if I have an arbitrary number of plots? Say 13... How would I programmatically find the number of rows and columns to use that makes the entire plot the most "square-like" shape?
I'd like to do this in R.
Update
Link to data: http://github.com/ngfrey/DataGenii/blob/master/exampleMedicalData.csv
Here is the code I was working on this morning. Hopefully it will provide a more illustrative example. Note how I set the number of rows and columns in the return(list(plots=plots, numrow=4, numcol=3)) part of the function:
library(ggplot2)  # ggplot(), geom_histogram(), geom_density()

makePlots <- function(fdf){
  idx <- which(sapply(fdf, is.numeric))
  idx <- data.frame(idx)
  names(idx) <- "idx"
  idx$names <- rownames(idx)
  plots <- list()
  for(i in 2:length(idx$idx)) {
    varname <- idx$names[i]
    mydata <- fdf[, idx$names[i]]
    mydata <- data.frame(mydata)
    names(mydata) <- varname
    g <- ggplot(data = mydata, aes_string(x = varname))
    g <- g + geom_histogram(aes(y = ..density..), color = "black", fill = "skyblue") +
      geom_density() + xlab(paste(varname))
    print(g)
    plots <- c(plots, list(g))
  }
  return(list(plots = plots, numrow = 4, numcol = 3))
}
res<- makePlots(fdf)
do.call(grid.arrange, c(res$plots, nrow=res$numrow, ncol=res$numcol))
?n2mfrow finds a default layout for you; in fact, it's already used by grid.arrange if nrow and ncol are missing
library(grid)       # rectGrob()
library(gridExtra)  # grid.arrange()
grid.arrange(grobs = replicate(7, rectGrob(), simplify = FALSE))
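For a sense of what n2mfrow() picks (rows first, then columns; the exact splits can vary slightly across R versions):

n2mfrow(7)   # 3 3
n2mfrow(12)  # 4 3
n2mfrow(13)  # 4 4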
In practice there are a limited number of plots that can reasonably be displayed and few arrangements that are aesthetically pleasing, so you could just tabulate.
However, what we can do is specify a tolerance for the ratio of the larger dimension to the smaller. Then we find the two closest factors of our target. If their ratio is within tolerance, we are done. Otherwise, we increment the target, accepting some empty cells as waste. This terminates at the sooner of finding a suitable pair or reaching the next largest perfect square. (The tolerance should be > 1.)
Smallest of the closest two numbers that factor a given n
fact <- function(n) {
  k <- floor(sqrt(n))
  for(i in k:1) { if (n %% i == 0) return(i) }
}
Search for a near square within tolerance
nearsq <- function(n, tol = 5/3 + 0.001) {
  m <- ceiling(sqrt(n))^2
  for(i in n:m) {
    a <- fact(i)
    b <- i/a
    if(b/a < tol) return(c(a, b))
  }
}
Examples
#probably too many plots
nearsq(83)
#> [1] 8 11
#a more reasonable range of plots, tabulated
cbind(12:36,t(Vectorize(nearsq)(12:36)))
[,1] [,2] [,3]
[1,] 12 3 4
[2,] 13 3 5
[3,] 14 3 5
[4,] 15 3 5
[5,] 16 4 4
[6,] 17 4 5
[7,] 18 4 5
[8,] 19 4 5
[9,] 20 4 5
[10,] 21 4 6
[11,] 22 4 6
[12,] 23 4 6
[13,] 24 4 6
[14,] 25 5 5
[15,] 26 5 6
[16,] 27 5 6
[17,] 28 5 6
[18,] 29 5 6
[19,] 30 5 6
[20,] 31 5 7
[21,] 32 5 7
[22,] 33 5 7
[23,] 34 5 7
[24,] 35 5 7
[25,] 36 6 6
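To tie this back to the question, the resulting pair can be handed straight to grid.arrange(). A sketch, where plots stands in for your list of ggplot objects:

# Near-square layout for an arbitrary list of plots
# (`plots` is assumed to be your list of ggplot objects)
dims <- nearsq(length(plots))                            # c(smaller, larger)
grid.arrange(grobs = plots, nrow = dims[1], ncol = dims[2])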
Here's how I got this bad boy to work:
(I could still tighten up the axis labels, and probably compress the first 2 if statements in the makePlots() function so it would run faster, but I'll tackle that at a later date/post)
library(gmp)        # isprime()
library(ggplot2)
library(gridExtra)
############
factors <- function(n)
{
  if(length(n) > 1)
  {
    lapply(as.list(n), factors)
  } else
  {
    one.to.n <- seq_len(n)
    one.to.n[(n %% one.to.n) == 0]
  }
}
###########
makePlots <- function(fdf){
  idx <- which(sapply(fdf, is.numeric))
  idx <- data.frame(idx)
  names(idx) <- "idx"
  idx$names <- rownames(idx)
  plots <- list()
  for(i in 2:length(idx$idx)) {
    varname <- idx$names[i]
    mydata <- fdf[, idx$names[i]]
    mydata <- data.frame(mydata)
    names(mydata) <- varname
    g <- ggplot(data = mydata, aes_string(x = varname))
    g <- g + geom_histogram(aes(y = ..density..), color = "black", fill = "skyblue") +
      geom_density() + xlab(paste(varname))
    print(g)
    plots <- c(plots, list(g))
  }
  numplots <- 0
  # Note: The reason for length(idx$idx)-1 is that the first column is the row indices,
  # which are usually numeric ;)
  # isprime() returns 0 for non-prime numbers, 2 for prime numbers
  if(length(idx$idx) == 2){
    numplots <- length(idx$idx)
    ncolx <- 1
    nrowx <- 2
  } else if(length(idx$idx) == 3){
    numplots <- length(idx$idx)
    ncolx <- 1
    nrowx <- 3
  } else if(isprime(length(idx$idx) - 1) != 0){
    numplots <- length(idx$idx)
    facts <- factors(numplots)
    ncolx <- facts[length(facts)/2]
    nrowx <- facts[(length(facts)/2) + 1]
  } else {
    numplots <- length(idx$idx) - 1
    facts <- factors(numplots)
    ncolx <- facts[length(facts)/2]
    nrowx <- facts[(length(facts)/2) + 1]
  }
  # If the grid is too oblong, widen it by one column and recompute the rows
  if(abs(nrowx - ncolx) > 2){
    ncolx <- ncolx + 1
    nrowx <- ceiling(numplots/ncolx)
  }
  return(list(plots = plots, numrow = nrowx, numcol = ncolx))
}
res<- makePlots(fdf)
do.call(grid.arrange, c(res$plots, nrow=res$numrow, ncol=res$numcol))
I admit that I am totally new to R and have a few beginner's problems;
my problem is the following:
I have a rather long matrix TEST with 5000 rows and 2 columns (column 1 = time; column 2 = concentration of a species).
I want to use the right concentration values for the calculation of propensities in stochastic simulations.
I already have an algorithm that gives me the simulation time t_sim; what I need is a line of code that returns the respective concentration value at t = t_sim.
Also: the time vector might have a big step size, so t_sim might have to be rounded up to the nearest time point in order to look up the respective concentration value.
I know this is probably quite an easy problem, but I really do not see the solution in R.
Best wishes and many thanks,
Arne
Without sample data this answer is kind of a shot in the dark, but I think that this might work:
t_conc <- TEST[which.min(abs(t_sim-TEST[,1])),2]
where TEST is the matrix with two columns as described in the OP and the output t_conc is the concentration that corresponds to the value of time in the matrix that is closest to the input value t_sim.
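For instance, on a small made-up matrix (values purely illustrative):

# Illustrative data, not from the question
TEST <- cbind(time = c(0, 10, 20, 30), conc = c(1.0, 0.8, 0.5, 0.2))
t_sim <- 23
TEST[which.min(abs(t_sim - TEST[, 1])), 2]
## [1] 0.5    (time 20 is the closest to 23)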
Here's another shot in the dark:
set.seed(1);
N <- 20; test <- matrix(c(sort(sample(100,N)),rnorm(N,0.5,0.2)),N,dimnames=list(NULL,c('time','concentration')));
test;
## time concentration
## [1,] 6 0.80235623
## [2,] 16 0.57796865
## [3,] 19 0.37575188
## [4,] 20 0.05706002
## [5,] 27 0.72498618
## [6,] 32 0.49101328
## [7,] 34 0.49676195
## [8,] 37 0.68876724
## [9,] 43 0.66424424
## [10,] 57 0.61878026
## [11,] 58 0.68379547
## [12,] 61 0.65642726
## [13,] 62 0.51491300
## [14,] 63 0.10212966
## [15,] 67 0.62396515
## [16,] 83 0.48877425
## [17,] 86 0.46884090
## [18,] 88 0.20584952
## [19,] 89 0.40436999
## [20,] 97 0.58358831
t_sim <- 39;
test[findInterval(t_sim,test[,'time']),'concentration'];
## concentration
## 0.6887672
Note that findInterval() returns the index of the lesser time value if t_sim falls between two time values, as my example shows. If you want the greater, you need a bit more work:
i <- findInterval(t_sim,test[,'time']);
if (test[i,'time'] != t_sim && i < nrow(test)) i <- i+1;
test[i,'concentration'];
## concentration
## 0.6642442
If you want the nearest, see R: find nearest index.
I'm working on an excel-file consisting of a 261 x 10 matrix. The matrix consists of the weekly returns of 10 stocks from 2010 to 2015. So, I have 10 variables (stocks) and 261 observations (weekly returns) for each variable.
For my master's thesis I have to apply a "rearrangement algorithm" developed by Rüschendorf and Puccetti (2012) to my matrix. I'm not going into further detail on the theoretical side of that concept. The thing is that I downloaded a package capable of performing the rearrangement algorithm in R. I tested it out and it works perfectly.
Actually, the only thing I need to know is how to import my Excel matrix into R so that I can perform the rearrangement algorithm on it. I could enter my matrix into R manually, encoding every element with the matrix() constructor:
A = matrix( c(), nrow= , ncol= , byrow=TRUE)
The problem is that doing so for such a big matrix (261 x 10) would be very time consuming. Is there any way to import my Excel matrix into R such that R recognizes it as a matrix of numerical values ready for calculations (just as in the manual case), so that I only have to run the "rearrangement algorithm" function provided in R?
Thanks in advance.
I made a selection within an open Excel sheet and copied it to the clipboard. This then worked on a Mac:
> con=pipe("pbpaste")
> dat <- data.matrix( read.table(con) )
> dat
V1 V2 V3
[1,] 1 1 1
[2,] 2 2 2
[3,] 3 3 3
[4,] 4 4 4
[5,] 5 5 5
[6,] 6 6 6
[7,] 7 7 7
[8,] 8 8 8
[9,] 9 9 9
[10,] 10 10 10
[11,] 11 11 11
[12,] 12 12 12
[13,] 13 13 13
[14,] 14 14 14
The method is somewhat different on Windows devices but the help page for ?connections should have your OS-specific techniques.
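On Windows, for example, the "clipboard" connection usually serves the same purpose (a sketch; see ?connections):

# Windows sketch: read a copied selection straight from the clipboard
dat <- data.matrix(read.table("clipboard"))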
You didn't provide a minimal reproducible example, so the answers are probably going to be of lesser quality. Anyway, you should be able to load the Excel file with something like:
require(XLConnect)
wrkbk <- loadWorkbook("path/to/your/data.xlsx")
df <- readWorksheet(wrkbk, sheet = 1, header = TRUE)
And then convert the data.frame to a matrix via
ans <- as.matrix(df)
Otherwise, you need to save your file as a .txt or .csv plain-text file and use read.table or read.csv and the like. Consult their respective help pages.
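For instance, assuming you export the sheet to a file called returns.csv (a hypothetical name), something like this should hand you a numeric matrix:

# Hypothetical file name and layout; adjust header/sep to match your export
returns <- as.matrix(read.csv("returns.csv", header = TRUE))
dim(returns)  # expect 261 x 10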
I have two files, "testi" containing a few numbers and "testo" containing their square roots. I have another file, named "test", which contains some numbers for which I want the square roots. I used the commands
library(RSNNS)  # mlp() is assumed here to come from the RSNNS package
model <- mlp(testi, testo, size = 50, learnFuncParams = c(0.001), maxit = 5000)
xyz <- predict(model, test)
The values which I get from "xyz" are
xyz
#[1,] 0.9971085
#[2,] 0.9992253
#[3,] 0.9992997
#[4,] 0.9993009
#[5,] 0.9993009
#[6,] 0.9993009
#[7,] 0.9993009
Whereas "test" contains
1 4
2 16
3 36
4 64
5 100
6 144
7 196
Please let me know why this happens.
mlp() has logistic output units by default, which squash every prediction into (0, 1); you need to specify linOut=TRUE. In general, normalizing your data would also help.
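That is, something along these lines (the same call as in the question with one extra argument; mlp() is assumed to be RSNNS::mlp()):

library(RSNNS)
# linOut = TRUE swaps the logistic output unit for a linear one,
# so predictions are no longer squashed into (0, 1)
model <- mlp(testi, testo, size = 50, learnFuncParams = c(0.001),
             maxit = 5000, linOut = TRUE)
xyz <- predict(model, test)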
I want to reduce a very large dataset with two variables into a smaller file. Wherever two or more consecutive data points share the same value, I want to keep only the starting and ending points and remove all the data points in between. For example,
the sample dataset looks like the following:
363.54167 23.3699
363.58333 23.3699
363.625 0
363.66667 0
363.70833 126.16542
363.75 126.16542
363.79167 126.16542
363.83333 126.16542
363.875 126.16542
363.91667 0
363.95833 0
364 0
364.04167 0
364.08333 0
364.125 0
364.16667 0
364.20833 0
364.25 127.79872
364.29167 127.79872
364.33333 127.79872
364.375 127.79872
364.41667 127.79872
364.45833 127.79872
364.5 0
364.54167 0
364.58333 0
364.625 0
364.66667 0
364.70833 127.43202
364.75 135.44052
364.79167 135.25522
364.83333 135.12892
364.875 20.32986
364.91667 0
364.95833 0
Here, the first two points have the same value, 23.3699, so I keep them as they are. I need a condition saying: if two or more consecutive data points have the same value, keep only the starting and ending points. The next two values also share a value (0), so I keep both of those as well. After that, however, there are five data points with the same value (126.16542); there I want to keep just the two endpoints, 363.70833 and 363.875, and remove the data points in between. Likewise I then keep only the two endpoints of the following zero run, 363.91667 and 364.20833, and so on.
The sample output I am looking for is as follows:
363.54167 23.3699
363.58333 23.3699
363.625 0
363.66667 0
363.70833 126.16542
363.875 126.16542
363.91667 0
364.20833 0
364.25 127.79872
364.45833 127.79872
364.5 0
364.66667 0
364.70833 127.43202
364.75 135.44052
364.79167 135.25522
364.83333 135.12892
364.875 20.32986
364.91667 0
364.95833 0
If your data is in a dataframe DF with column names a and b, then
runs <- rle(DF$b)
firsts <- cumsum(c(0, runs$lengths[-length(runs$lengths)])) + 1
lasts <- cumsum(runs$lengths)
edges <- unique(sort(c(firsts, lasts)))
DF[edges, ]
gives
> DF[edges,]
a b
1 363.5417 23.36990
2 363.5833 23.36990
3 363.6250 0.00000
4 363.6667 0.00000
5 363.7083 126.16542
9 363.8750 126.16542
10 363.9167 0.00000
17 364.2083 0.00000
18 364.2500 127.79872
23 364.4583 127.79872
24 364.5000 0.00000
28 364.6667 0.00000
29 364.7083 127.43202
30 364.7500 135.44052
31 364.7917 135.25522
32 364.8333 135.12892
33 364.8750 20.32986
34 364.9167 0.00000
35 364.9583 0.00000
rle gives the lengths of the groups that have the same value (floating point precision may be an issue if you have more decimal places). firsts and lasts give the row index of the first row of a group and the last row of a group, respectively. Put the indexes together, sort them, and get rid of duplicates (since a group of size one will list the same row as the first and last) and then index DF by the row numbers.
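If it helps, the whole recipe folds into one small helper; a sketch, assuming the same DF with columns a and b, with a digits argument to blunt the floating-point issue mentioned above:

keep_run_edges <- function(DF, digits = 6) {
  runs   <- rle(round(DF$b, digits))    # round first to sidestep floating-point noise
  lasts  <- cumsum(runs$lengths)        # last row index of each run
  firsts <- lasts - runs$lengths + 1L   # first row index of each run
  DF[unique(sort(c(firsts, lasts))), ]
}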
I'd use rle here (no surprise to those who know me :-)). Keeping in mind that you will want to check for approximate equality to avoid floating-point rounding problems, here's the concept: rle returns two sequences, one telling you how many times a value is repeated and the other the value itself. Since you want to keep at most two points per run, we'll essentially "shrink" every run that is longer than that.
Edit: I recognize that this is relatively clunky code and a gentle touch with melt/cast should be far more efficient. I just liked doing this.
df <- cbind(1:20, sample(1:3, 20, replace = TRUE))
rdf <- rle(df[, 2])
lenfoo <- rdf$lengths
cfoo <- cumsum(lenfoo)                 # last row index of each run
repfoo <- ifelse(lenfoo == 1, 1, 2)    # emit the run's last row once for singletons, twice otherwise
outfoo <- matrix(ncol = 2)             # note: this seeds the output with an all-NA first row
for(j in 1:length(cfoo)) {
  outfoo <- rbind(outfoo, matrix(rep(df[cfoo[j], ], times = repfoo[j]), ncol = 2, byrow = TRUE))
}
Rgames> df
[,1] [,2]
[1,] 1 2
[2,] 2 2
[3,] 3 3
[4,] 4 3
[5,] 5 3
[6,] 6 3
[7,] 7 3
[8,] 8 2
[9,] 9 2
[10,] 10 3
[11,] 11 1
[12,] 12 2
[13,] 13 2
[14,] 14 3
[15,] 15 1
[16,] 16 2
[17,] 17 1
[18,] 18 2
[19,] 19 3
[20,] 20 1
Rgames> outfoo
[,1] [,2]
[1,] NA NA
[2,] 2 2
[3,] 2 2
[4,] 7 3
[5,] 7 3
[6,] 9 2
[7,] 9 2
[8,] 10 3
[9,] 11 1
[10,] 13 2
[11,] 13 2
[12,] 14 3
[13,] 15 1
[14,] 16 2
[15,] 17 1
[16,] 18 2
[17,] 19 3
[18,] 20 1
x = tapply(df[[1]], df[[2]], range)
gives the values
cbind(unlist(x, use.names=FALSE), as.numeric(rep(names(x), each=2)))
gets a matrix. More explicitly, and avoiding coercion to / from character vectors
u = unique(df[[2]])
rng = sapply(split(df[[1]], match(df[[2]], u)), range)
cbind(as.vector(rng), rep(u, each=2))
If the data is very large, then sort by df[[1]] and find the first (min) and last (max) rows for each distinct value of df[[2]]; combine these:
df = df[order(df[[1]]),]
res = rbind(df[!duplicated(df[[2]]),], df[!duplicated(df[[2]], fromLast=TRUE),])
res[order(res[[2]]),]
perhaps setting the row names of the subset to NULL.
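A toy run of the tapply() version above, on a hypothetical two-column data frame:

# Hypothetical data frame with the assumed columns
df <- data.frame(a = 1:6, b = c(10, 10, 10, 0, 0, 7))
x <- tapply(df[[1]], df[[2]], range)
cbind(unlist(x, use.names = FALSE), as.numeric(rep(names(x), each = 2)))
##      [,1] [,2]
## [1,]    4    0
## [2,]    5    0
## [3,]    6    7
## [4,]    6    7
## [5,]    1   10
## [6,]    3   10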