I am trying to code a function that converts Chinese numerals to Arabic numerals.
The idea is to get a set of scaled_digit vectors and a set of scale_factor vectors, then multiply and add them up to generate the desired output.
digit=c('〇'=0,'一'=1,'二'=2,'三'=3,'四'=4,'五'=5,'六'=6,'七'=7,'八'=8,'九'=9)
scale=c('十'=10^1,'百'=10^2,'千'=10^3,'萬'=10^4,'億'=10^8,'兆'=10^12)
One of the problems I encounter is this:
When I have a number that is a few hundred 10^4s (萬) and a few thousand 10^12s (兆), I am left with a scale-factor vector that is as follows:
scale_factor=
structure(c(1000, 1e+12, 100, 10, 10000, 1000, 100, 10), .Names = c("千",
"兆", "百", "十", "萬", "千", "百", "十"))
[千] 兆 [百] [十] 萬 千 百 十
[1e+03] 1e+12 [1e+02] [1e+01] 1e+04 1e+03 1e+02 1e+01
Scale factors to be adjusted have been marked by [ ].
The positions of the cardinal scale_factors can be found with the following code:
cardinal_scale_factor <- which(diff(scale_factor) >= 0) + 1
兆 萬
2 5
How do I code so that scale_factor[1] and scale_factor[3:4] can be multiplied by scale_factor[2] and [5] respectively?
Expected result:
[千] 兆 [百] [十] 萬 千 百 十
[1e+15] 1e+12 [1e+06] [1e+05] 1e+04 1e+03 1e+02 1e+01
A possible solution:
w1 <- which(rev(cummax(rev(scale_factor)) > rev(scale_factor)))
grp <- cumsum(c(1,diff(w1)) > 1) + 1
w2 <- aggregate(w1, list(grp), max)[[2]] + 1
scale_factor[w1] <- scale_factor[w1] * scale_factor[w2][grp]
which gives:
> scale_factor
千 兆 百 十 萬 千 百 十
1e+15 1e+12 1e+06 1e+05 1e+04 1e+03 1e+02 1e+01
What this does:
With cummax(rev(scale_factor)) you get the cumulative maximum of the reversed scale.
Comparing that with the reversed scale (cummax(rev(scale_factor)) > rev(scale_factor)) gives a logical vector.
Wrapping the logical vector from step 2 in rev and then which, you get an index vector of positions w1 that do not conform to the decreasing condition.
With cumsum(c(1, diff(w1)) > 1) + 1 you group consecutive positions, in this case the 3rd and 4th values in the example data.
With aggregate(w1, list(grp), max)[[2]] + 1 you determine the positions of the multipliers.
Finally, you multiply the values in scale_factor at the positions in w1 with the multipliers from w2. You need to index w2 with the group numbers from grp.
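The intermediate steps can be traced on the example vector (reconstructed here from the data shown above):

```r
scale_factor <- c('千' = 1e3, '兆' = 1e12, '百' = 1e2, '十' = 10,
                  '萬' = 1e4, '千' = 1e3, '百' = 1e2, '十' = 10)

w1 <- which(rev(cummax(rev(scale_factor)) > rev(scale_factor)))
unname(w1)    # 1 3 4: positions that break the decreasing condition

grp <- cumsum(c(1, diff(w1)) > 1) + 1
unname(grp)   # 1 2 2: position 1 is one group, positions 3-4 another

w2 <- aggregate(w1, list(grp), max)[[2]] + 1
w2            # 2 5: the positions of the cardinal scales 兆 and 萬

scale_factor[w1] <- scale_factor[w1] * scale_factor[w2][grp]
unname(scale_factor)
# 1e+15 1e+12 1e+06 1e+05 1e+04 1e+03 1e+02 1e+01
```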
I have a couple of questions about the code shown below. The function called "func1" returns a matrix of 50 rows and 15 columns; I call it "M", and "M2" is just its transpose. W0 is the initial value for the next part of the code; if I run the function called "Rowresult", it also gives me a 50*15 matrix.
My first question is: if I want to run the "Rowresult" function for different W0 values, such as W0 = 10, 20, 30, and I want to get 3 matrices of size 50*15 (one per W0 value) as results, how could I achieve it?
My second question is: if you try my code in R, you will see a matrix called "wealth_result2" as a result. Once I have this big 50*15 matrix, I would like to divide it into three matrices of equal size, each 50*5 (they share the same rows but different columns: the first takes columns 1-5, the second columns 6-10, the third columns 11-15). Then I want to work out how many positive rows (rows with all numbers positive) there are in each 50*5 matrix. How could I achieve this?
N=15
func1 <- function(N){
  alpha1  = 8.439e-02
  beta1   = 8.352e-01
  mu      = 7.483e-03
  omega   = 1.343e-04
  X_0     = -3.092031e-02
  sigma_0 = 0.03573968
  eps = rt(N, 7.433e+00)
  # recursion: update sigma, then draw Xn
  Xn    = numeric(N)
  sigma = numeric(N)
  sigma[1] = sigma_0
  Xn[1]    = X_0
  for (t in 2:N){
    sigma[t] = sqrt(omega + alpha1 * Xn[t-1]^2 + beta1 * sigma[t-1]^2)
    Xn[t]    = sigma[t] * eps[t]
  }
  Y = mu + Xn
  Y  # return the simulated series explicitly
}
# simulate 50 paths; replicate returns a 15 x 50 matrix
M <- replicate(50, func1(N))
# transpose so each row is one simulated path (50 x 15)
M2 <- t(M)
View(M2)
# wealth with initial wealth 10
W0=10
# 10,20,30,40
r= c(0.101309031, -0.035665516, -0.037377270, -0.005928941, 0.036612849,
0.062404039, 0.124240950, -0.034843633, 0.004770613, 0.005018101,
0.097685945, -0.090660099, 0.004863099, 0.029215984, 0.020835366)
Rowresult<- function(r){
const = exp(cumsum(r))
exp.cum = cumsum(1/const)
wealth=const*(W0 - exp.cum)
wealth
}
# wealth matrix
wealth_result <-apply(M2,1,Rowresult)
wealth_result2 <-t(wealth_result )
View(wealth_result2)
This delivers the desired counts of (all-)positive rows; comparing rowSums() with 5 keeps only the rows in which all 5 entries are positive (since M2 is built from random draws, the counts themselves will vary from run to run):
> sapply(1:3, function(m) sum(rowSums(wealth_result2[, (1:5)+(m-1)*5] > 0) == 5))
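For the first question, one sketch (using a stand-in for the simulated M2, since the real one comes from random draws) is to make W0 an explicit argument of Rowresult and loop over the starting values:

```r
# Stand-in for the 50 x 15 return matrix M2 from the question
set.seed(1)
M2 <- matrix(rnorm(50 * 15, mean = 0.005, sd = 0.05), nrow = 50, ncol = 15)

# Make the initial wealth an explicit argument instead of a global
Rowresult <- function(r, W0) {
  const   <- exp(cumsum(r))
  exp.cum <- cumsum(1 / const)
  const * (W0 - exp.cum)
}

# One 50 x 15 wealth matrix per starting value
wealth_list <- lapply(c(10, 20, 30),
                      function(w0) t(apply(M2, 1, Rowresult, W0 = w0)))
length(wealth_list)    # 3
dim(wealth_list[[1]])  # 50 15
```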
How can I round off a number like 0.0000234889 (or in the form 8.829847e-07) to a power of ten, either below or above (whichever is my choice), ie here 0.00001 or 0.0001 ?
I tried round(...., digits = -100000) but it returns NaN.
For example, round(2e-07, digits = 6) gives 0, while I would like 1e-06 (and another function to give 1e-07).
# Is this what you're looking for?
# find the nearest power of ten for some number
x <- 0.0000234889 # Set test input value
y <- log10(x) # What is the fractional base ten logarithm?
yy <- round(y) # What is the nearest whole number base ten log?
xx <- 10 ^ yy # What integer power of ten is nearest the input?
print(xx)
# [1] 1e-05
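If you want to choose the direction explicitly (the power of ten just below or just above the input), replacing round with floor or ceiling on the logarithm does it:

```r
x <- 0.0000234889
10 ^ floor(log10(x))     # power of ten below the input: 1e-05
10 ^ ceiling(log10(x))   # power of ten above the input: 1e-04
```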
The digits argument to the round() function controls the number of decimal places, not the displayed exponent. round(value, n) rounds to n decimals, but the result still prints with whatever exponent its magnitude dictates: round(0.0000234889, 6) gives 2.3e-05. (Notice that 6 digits were specified but the exponent is still 5.)
Use options("scipen" = ) like this:
num <- 0.0000234889
> num
[1] 2.34889e-05
options("scipen" = 10)
options()$scipen
> num
[1] 0.0000234889
This will change the global option for the session. Read the documentation here: https://stat.ethz.ch/R-manual/R-devel/library/base/html/options.html
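If you prefer not to change the global option, format can suppress scientific notation per value (a minimal sketch):

```r
num <- 0.0000234889
# Ask for fixed notation for this one value only
format(num, scientific = FALSE)   # "0.0000234889"
```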
I have a vector with around 4000 values. I would just need to bin it into 60 equal intervals for which I would then have to calculate the median (for each of the bins).
v<-c(1:4000)
v is really just a vector. I read about cut, but that needs me to specify the breakpoints; I just want 60 equal intervals.
Use cut and tapply:
> tapply(v, cut(v, 60), median)
(-3,67.7] (67.7,134] (134,201] (201,268]
34.0 101.0 167.5 234.0
(268,334] (334,401] (401,468] (468,534]
301.0 367.5 434.0 501.0
(534,601] (601,668] (668,734] (734,801]
567.5 634.0 701.0 767.5
(801,867] (867,934] (934,1e+03] (1e+03,1.07e+03]
834.0 901.0 967.5 1034.0
(1.07e+03,1.13e+03] (1.13e+03,1.2e+03] (1.2e+03,1.27e+03] (1.27e+03,1.33e+03]
1101.0 1167.5 1234.0 1301.0
(1.33e+03,1.4e+03] (1.4e+03,1.47e+03] (1.47e+03,1.53e+03] (1.53e+03,1.6e+03]
1367.5 1434.0 1500.5 1567.0
(1.6e+03,1.67e+03] (1.67e+03,1.73e+03] (1.73e+03,1.8e+03] (1.8e+03,1.87e+03]
1634.0 1700.5 1767.0 1834.0
(1.87e+03,1.93e+03] (1.93e+03,2e+03] (2e+03,2.07e+03] (2.07e+03,2.13e+03]
1900.5 1967.0 2034.0 2100.5
(2.13e+03,2.2e+03] (2.2e+03,2.27e+03] (2.27e+03,2.33e+03] (2.33e+03,2.4e+03]
2167.0 2234.0 2300.5 2367.0
(2.4e+03,2.47e+03] (2.47e+03,2.53e+03] (2.53e+03,2.6e+03] (2.6e+03,2.67e+03]
2434.0 2500.5 2567.0 2634.0
(2.67e+03,2.73e+03] (2.73e+03,2.8e+03] (2.8e+03,2.87e+03] (2.87e+03,2.93e+03]
2700.5 2767.0 2833.5 2900.0
(2.93e+03,3e+03] (3e+03,3.07e+03] (3.07e+03,3.13e+03] (3.13e+03,3.2e+03]
2967.0 3033.5 3100.0 3167.0
(3.2e+03,3.27e+03] (3.27e+03,3.33e+03] (3.33e+03,3.4e+03] (3.4e+03,3.47e+03]
3233.5 3300.0 3367.0 3433.5
(3.47e+03,3.53e+03] (3.53e+03,3.6e+03] (3.6e+03,3.67e+03] (3.67e+03,3.73e+03]
3500.0 3567.0 3633.5 3700.0
(3.73e+03,3.8e+03] (3.8e+03,3.87e+03] (3.87e+03,3.93e+03] (3.93e+03,4e+03]
3767.0 3833.5 3900.0 3967.0
In the past, I've used this function:
evenbins <- function(x, bin.count = 10, order = TRUE) {
  bin.size <- rep(length(x) %/% bin.count, bin.count)
  bin.size <- bin.size + ifelse(1:bin.count <= length(x) %% bin.count, 1, 0)
  bin <- rep(1:bin.count, bin.size)
  if (order) {
    bin <- bin[rank(x, ties.method = "random")]
  }
  return(factor(bin, levels = 1:bin.count, ordered = order))
}
and then I can run it with
v.bin <- evenbins(v, 60)
and check the sizes with
table(v.bin)
and see they all contain 66 or 67 elements. By default this will order the values, just like cut does, so each of the factor levels will have increasing values. If you want to bin them based on their original order, use
v.bin <- evenbins(v, 60, order = FALSE)
instead. This just splits the data up in the order it appears.
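To get the per-bin medians the question actually asks for, tapply works on the grouping factor just as it does with cut (evenbins is copied here so the example runs standalone):

```r
# evenbins as defined above, repeated so this snippet is self-contained
evenbins <- function(x, bin.count = 10, order = TRUE) {
  bin.size <- rep(length(x) %/% bin.count, bin.count)
  bin.size <- bin.size + ifelse(1:bin.count <= length(x) %% bin.count, 1, 0)
  bin <- rep(1:bin.count, bin.size)
  if (order) bin <- bin[rank(x, ties.method = "random")]
  factor(bin, levels = 1:bin.count, ordered = order)
}

v <- 1:4000
v.bin <- evenbins(v, 60)

# One median per bin
bin.medians <- tapply(v, v.bin, median)
head(bin.medians, 3)
#   1   2   3
#  34 101 168
```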
This result shows the 59 midpoints between successive break-points. The bins are probably as close to equal as possible, but not exactly equal (note also that seq(1, 4000, length = 60) yields 59 intervals; use length = 61 if you want exactly 60 bins).
> sq <- seq(1, 4000, length = 60)
> sapply(2:length(sq), function(i) median(c(sq[i-1], sq[i])))
# [1] 34.88983 102.66949 170.44915 238.22881 306.00847 373.78814
# [7] 441.56780 509.34746 577.12712 644.90678 712.68644 780.46610
# ......
Actually, after checking, the bins are pretty darn close to being equal.
> unique(diff(sq))
# [1] 67.77966 67.77966 67.77966
I have a log-log plot displaying tick marks on the y-axis from 1 to 7. I would like to obtain the interval between each tickmark for a function.
> par("yaxp")
[1] 1 7 -6
I was expecting par("yaxp")[3] to be 1.
Why is it -6? How can I get R to return the interval between tickmarks?
As said in the other answer, when log is used the meaning of xaxp is a little bit different. But here, since n is negative, we are still in the linear case. In other words, the distance between steps is
(7 - 1) / abs(-6) = 1
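For the negative-n (linear) case, the spacing can be computed directly from the three values; a small sketch using the par("yaxp") values reported in the question:

```r
yaxp <- c(1, 7, -6)   # the par("yaxp") values from the question

# A negative third element means the ticks are linear, and the
# spacing between tick marks is (x2 - x1) / abs(n)
interval <- (yaxp[2] - yaxp[1]) / abs(yaxp[3])
interval
# [1] 1
```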
You can use axTicks to get the tick positions used internally by axis to draw the ticks. Here is an example, modified from the help page of axTicks, showing different values of n:
op <- par(mfrow = c(4, 1))
for(x in 9999 * c(1/99999,1, 2, 8)) {
plot(x, 9, log = "x")
cat(formatC(par("xaxp"), width = 5),";", T <- axTicks(1),"\n")
rug(T, col = adjustcolor("red", 0.5), lwd = 4)
}
par(op)
## n negative (your case): linear case
0.05 0.25 -4 ; 0.05 0.1 0.15 0.2 0.25
## n =3 : k 10^j with k in {1,2,5} & j in par("usr")[1:2]
1000 1e+05 3 ; 200 500 1000 2000 5000 10000 20000 50000 1e+05 2e+05 5e+05
## n =2 k 10^j with k in {1,5} & j in par("usr")[1:2]
1000 1e+06 2 ; 500 1000 5000 10000 50000 1e+05 5e+05 1e+06
## n =1 : 10^j with j in par("usr")[1:2]
1000 1e+07 1 ; 1000 10000 1e+05 1e+06 1e+07
This is definitely due to the fact that it is a log scale.
From the ?par documentation:
xaxp -
A vector of the form c(x1, x2, n) giving the coordinates of the
extreme tick marks and the number of intervals between tick-marks when
par("xlog") is false. Otherwise, when log coordinates are active, the
three values have a different meaning: For a small range, n is
negative, and the ticks are as in the linear case, otherwise, n is in
1:3, specifying a case number, and x1 and x2 are the lowest and
highest power of 10 inside the user coordinates, 10 ^ par("usr")[1:2].
(The "usr" coordinates are log10-transformed here!)
see:
x<-1:100
y<-1:100
plot(x,y,log="y")
par("yaxp")
par("ylog")
compared to
plot(log(x),log(y))
par("ylog")
par("yaxp")
I want to perform winsorization in a dataframe like this:
event_date beta_before beta_after
2000-05-05 1.2911707054 1.3215648954
1999-03-30 0.5089734305 0.4269575657
2000-05-05 0.5414700258 0.5326762272
2000-02-09 1.5491034852 1.2839988507
1999-03-30 1.9380674599 1.6169735009
1999-03-30 1.3109909155 1.4468207148
2000-05-05 1.2576420753 1.3659492507
1999-03-30 1.4393018341 0.7417777965
2000-05-05 0.2624037804 0.3860641307
2000-05-05 0.5532216441 0.2618245169
2000-02-08 2.6642931822 2.3815576738
2000-02-09 2.3007578964 2.2626960407
2001-08-14 3.2681270302 2.1611010935
2000-02-08 2.2509121123 2.9481325199
2000-09-20 0.6624503316 0.947935581
2006-09-26 0.6431111805 0.8745333151
By winsorization I mean to find the max and min of beta_before, for example. That value should be replaced by the second-highest or second-lowest value in the same column, without losing the rest of the details in the observation. For example, in beta_before the max value is 3.2681270302 and should be replaced by the second-highest value, 2.6642931822. The same process will be followed for the min, and then for the beta_after variable. Therefore, only 2 values per column will be changed, the highest and the lowest; the rest will remain the same.
Any advice? I tried different approaches in plyr, but I ended up replacing the whole observation, which I don't want to do. I would like to create 2 new variables, for example beta_before_winsorized and beta_after_winsorized.
I thought winsorizing usually finds the value x% (typically 10%, 15%, or 20%) from the bottom of the ordered list, and replaces all the values below it with that value. Same with the top. Here you're just choosing the top and bottom value, but winsorizing usually involves specifying a percentage of values at the top and bottom to replace.
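For reference, a minimal base-R sketch of that percentage-based winsorization, clamping everything outside chosen quantiles (the 5% cutoff below is an arbitrary choice):

```r
winsorize_pct <- function(x, p = 0.05) {
  lo <- quantile(x, p)        # lower cutoff
  hi <- quantile(x, 1 - p)    # upper cutoff
  pmin(pmax(x, lo), hi)       # clamp values outside [lo, hi]
}

set.seed(42)
z <- c(rnorm(20), 10)              # one obvious outlier
max(winsorize_pct(z, 0.05)) < 10   # TRUE: the outlier has been pulled in
```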
Here is a function that does the winsorization you describe:
winsorize <- function(x) {
Min <- which.min(x)
Max <- which.max(x)
ord <- order(x)
x[Min] <- x[ord][2]
x[Max] <- x[ord][length(x)-1]
x
}
If your data are in a data frame dat, then we can winsorize it using your procedure via:
dat2 <- dat
dat2[, -1] <- sapply(dat[,-1], winsorize)
which results in:
R> dat2
event_date beta_before beta_after
1 2000-05-05 1.2911707 1.3215649
2 1999-03-30 0.5089734 0.4269576
3 2000-05-05 0.5414700 0.5326762
4 2000-02-09 1.5491035 1.2839989
5 1999-03-30 1.9380675 1.6169735
6 1999-03-30 1.3109909 1.4468207
7 2000-05-05 1.2576421 1.3659493
8 1999-03-30 1.4393018 0.7417778
9 2000-05-05 0.5089734 0.3860641
10 2000-05-05 0.5532216 0.3860641
11 2000-02-08 2.6642932 2.3815577
12 2000-02-09 2.3007579 2.2626960
13 2001-08-14 2.6642932 2.1611011
14 2000-02-08 2.2509121 2.3815577
15 2000-09-20 0.6624503 0.9479356
16 2006-09-26 0.6431112 0.8745333
I'm not sure where you got the value you suggest should replace the max in beta_before: the second-highest value in the snippet of data provided is 2.6642932, and that is what my function has used to replace the maximum value.
Note the function will only work if there is a single minimum and a single maximum value in each column, owing to the way which.min() and which.max() are documented to work. If multiple entries take the same max or min value, we need something different:
winsorize2 <- function(x) {
Min <- which(x == min(x))
Max <- which(x == max(x))
ord <- order(x)
x[Min] <- x[ord][length(Min)+1]
x[Max] <- x[ord][length(x)-length(Max)]
x
}
should do it (the latter is not tested).
Strictly speaking, "winsorization" is the act of replacing the most extreme data points with an acceptable percentile (as mentioned in some of the other answers). One fairly standard R function to do this is winsor from the psych package. Try:
dat$beta_before = psych::winsor(dat$beta_before, trim = 0.0625)
dat$beta_after = psych::winsor(dat$beta_after , trim = 0.0625)
I chose trim = 0.0625 (the 6.25th and 93.75th percentiles) because you only have 16 data points and you want to "rein in" the top and bottom ones: 1/16 = 0.0625
Note that this might make the extreme data equal to a percentile number which may or may not exist in your data set: the theoretical n-th percentile of the data.
The statar package works very well for this. Copying the relevant snippet from the readme file:
# winsorize (default based on 5 x interquartile range)
v <- c(1:4, 99)
winsorize(v)
winsorize(v, replace = NA)
winsorize(v, probs = c(0.01, 0.99))
winsorize(v, cutpoints = c(1, 50))
https://github.com/matthieugomez/statar
Follow-up from my previous point about actually replacing the to-be-trimmed values with the value at the trim position:
winsorized.sample <- function (x, trim = 0, na.rm = FALSE, ...)
{
  if (!is.numeric(x) && !is.complex(x) && !is.logical(x)) {
    warning("argument is not numeric or logical: returning NA")
    return(NA_real_)
  }
  if (na.rm)
    x <- x[!is.na(x)]
  if (!is.numeric(trim) || length(trim) != 1L)
    stop("'trim' must be numeric of length one")
  n <- length(x)
  if (trim > 0 && n) {
    if (is.complex(x))
      stop("trimmed sample is not defined for complex data")
    if (any(is.na(x)))
      return(NA_real_)
    if (trim >= 0.5) {
      warning("trim >= 0.5 is odd... trying it anyway")
    }
    lo <- floor(n * trim) + 1
    hi <- n + 1 - lo
    # this line would work for just trimming:
    # x <- sort.int(x, partial = unique(c(lo, hi)))[lo:hi]
    # instead, we replace what would be trimmed with the value
    # at the trim position
    idx <- seq(1, n)
    myframe <- data.frame(idx, x)
    myframe <- myframe[order(x, idx), ]
    myframe$x[1:lo] <- myframe$x[lo]   # use the sorted values here,
    myframe$x[hi:n] <- myframe$x[hi]   # not the original unsorted x
    myframe <- myframe[order(myframe$idx), ]  # restore the original order
    x <- myframe$x
  }
  x
}
# test it
mydist <- c(1, 20, 1, 5, 2, 40, 5, 2, 6, 1, 5)
mydist2 <- winsorized.sample(mydist, trim = .2)
mydist
mydist2
descStat(mydist)   # descStat comes from an add-on package, not base R
descStat(mydist2)