Report the mean number of characters in a corpus document - r

So I have a corpus set up reading a bunch of text files with paragraphs in them.
library('tm')
my.text.location <- "C:/Users//.../*/"
apapers <- VCorpus(DirSource(my.text.location))
Now I need to find the mean number of characters in each text. Running
mean(nchar(apapers), na.rm = TRUE) results in a very weird output, much larger than the actual character counts.
Any other way to get the mean?

You didn't supply a reproducible example, but rowMeans(sapply(apapers, nchar)) will return the mean number of characters over all documents. "content" is the figure you need.
A longer version runs sapply over the corpus, counting the number of characters per document. Transpose this data and turn it into a data.frame. The data.frame will contain two columns, content and meta; content is the one you need. Taking the mean of the content column gives you the average number of characters in a document. The advantage of this is that you keep the table in case you need to report the numbers.
# your code
my_count <- data.frame(t(sapply(apapers, nchar)))
mean(my_count$content)
Reproducible example using the crude dataset:
library(tm)
data("crude")
crude <- as.VCorpus(crude)
# in one statement
rowMeans(sapply(crude, nchar))
content meta
1220.30 453.15
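If you only want the single number, you can index the named result (a small addition on my part):
rowMeans(sapply(crude, nchar))["content"]
content
 1220.3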
# longer version keeping intermediate results.
my_count <- data.frame(t(sapply(crude, nchar)))
mean(my_count$content)
[1] 1220.3
my_count
content meta
127 527 440
144 2634 458
191 330 444
194 394 441
211 552 441
236 2774 455
237 2747 477
242 930 453
246 2115 440
248 2066 466
273 2241 458
349 593 492
352 621 468
353 591 445
368 629 440
489 876 445
502 1166 446
543 463 447
704 1797 456
708 360 451
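Alternatively, if you only need the text lengths and not the meta counts, a minimal sketch (my variation, not part of the original answer) that pulls each document's text with tm's content() accessor:
doc_chars <- sapply(crude, function(d) sum(nchar(content(d))))
mean(doc_chars)  # should match the content figure above, 1220.3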

Related

Extracting seasonal effect without using stl or decompose

I have a dataset named 'bicoal', which consists of annual bituminous coal production in the United States from 1920 to 1968.
Time Series:
Start = 1920
End = 1968
Frequency = 1
[1] 569 416 422 565 484 520 573 518 501 505 468 382 310 334 359 372 439 446 349 395
[21] 461 511 583 590 620 578 534 631 600 438 516 534 467 457 392 467 500 493 410 412
[41] 416 403 422 459 467 512 534 552 545
I made a time series, saved under the name time_series, and wanted to extract the seasonal effect using the code plot(decompose(time_series)) and plot(stl(time_series)), but got these error messages:
Error in stl(time_series) :
series is not periodic or has less than two periods
Error in decompose(time_series) :
time series has no or less than 2 periods
If neither stl nor decompose works, is there a way to extract the seasonal effect?
Without seeing how your time series is constructed, I think this might be your problem: by default ts has frequency = 1, so there are no periods to decompose.
data <- rep(seq(1, 5), 5)
ts.1 <- ts(data)  # frequency defaults to 1
stl(ts.1)         # reproduces your error: series is not periodic
To fix this issue, the ts function has a frequency argument that defines the period of the data.
ts.2 <- ts(data, frequency = 5)
stl(ts.2, s.window = "periodic")
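With the frequency set, stl runs and the seasonal effect can be read from the fit; a short sketch continuing the example (time.series is the matrix stl returns, with seasonal, trend and remainder columns):
fit <- stl(ts.2, s.window = "periodic")
head(fit$time.series[, "seasonal"])  # the extracted seasonal effect
plot(fit)  # panels for seasonal, trend and remainder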

Automate portfolio volatility computation in R

Thanks for reading my post. I have a series of portfolios created from combinations of several stocks. I need to compute the volatility of those portfolios using the historical daily performance of each stock. I have all the combinations in one data frame (called Final_output) and all stock returns in another data frame (called perf, where the columns are stocks and the rows are days), but I don't know the most efficient way to automate the process. Below you can find an extract:
> Final_output
ISIN_1 ISIN_2 ISIN_3 ISIN_4
2 CH0595726594 CH1111679010 XS1994697115 CH0587331973
3 CH0595726594 CH1111679010 XS1994697115 XS2027888150
4 CH0595726594 CH1111679010 XS1994697115 XS2043119358
5 CH0595726594 CH1111679010 XS1994697115 XS2011503617
6 CH0595726594 CH1111679010 XS1994697115 CH1107638921
7 CH0595726594 CH1111679010 XS1994697115 XS2058783270
8 CH0595726594 CH1111679010 XS1994697115 JE00BGBBPB95
> perf
CH0595726594 CH1111679010 XS1994697115 CH0587331973
626 0.0055616769 -0.0023656130 1.363791e-03 1.215922e-03
627 0.0086094443 0.0060037334 0.000000e+00 2.519220e-03
628 0.0053802380 0.0009027081 0.000000e+00 7.508635e-04
629 -0.0025213543 -0.0022046297 4.864050e-05 1.800720e-04
630 0.0192416817 0.0093401627 -6.079767e-03 3.800836e-03
631 -0.0101224820 0.0051741294 6.116956e-03 -1.345184e-03
632 -0.0013293793 -0.0100475153 -4.494163e-03 -1.746106e-03
633 0.0036350604 0.0012999350 3.801130e-03 -5.997121e-05
634 0.0030097434 -0.0011484496 -1.187614e-03 -2.069131e-03
635 0.0002034381 0.0030493901 -1.851762e-03 -3.806280e-04
636 -0.0035594427 0.0167455769 -2.148123e-04 -4.709560e-04
637 0.0007654623 -0.0051958237 -3.711191e-04 1.604010e-04
638 0.0107592678 -0.0016260163 4.298764e-04 3.397951e-03
639 0.0050953486 -0.0007403020 2.011738e-03 8.790770e-04
640 0.0008532851 -0.0071121648 -9.746114e-04 5.389598e-04
641 -0.0068204614 0.0133810874 -9.755622e-05 -1.346674e-03
642 0.0091395678 0.0102591793 1.717157e-03 -1.977785e-03
643 0.0027520640 -0.0157912638 1.256440e-03 -1.301119e-04
644 -0.0048902196 0.0039494471 -1.624514e-03 -3.373340e-03
645 -0.0116838833 0.0062450826 6.625549e-04 1.205255e-03
646 0.0004566442 -0.0018570102 -3.456636e-03 4.474138e-03
647 0.0041586368 0.0085679315 4.435933e-03 1.957455e-03
648 0.0007575758 0.0002912621 0.000000e+00 2.053306e-03
649 0.0046429473 -0.0138309230 -4.435798e-03 1.541798e-03
650 0.0049731250 -0.0488164953 4.181975e-03 -9.733133e-04
651 0.0008497451 -0.0033110870 2.724477e-04 -7.555498e-04
652 0.0004494831 0.0049831300 -8.657588e-04 -1.790813e-04
653 -0.0058905751 0.0020143588 8.178287e-04 -1.213991e-03
654 0.0000000000 0.0167525773 4.864050e-05 9.365068e-04
655 0.0010043186 0.0048162231 0.000000e+00 -2.110146e-03
656 -0.0024079462 -0.0100403633 -2.431907e-03 -9.176600e-04
657 -0.0095544604 -0.0193670047 0.000000e+00 -8.935435e-03
658 0.0008123477 0.0114339172 2.437835e-03 5.530483e-03
659 0.0022828734 -0.0015415446 -3.239300e-03 2.765060e-03
660 0.0049096523 -0.0001029283 3.199079e-02 2.327835e-03
661 -0.0027702226 -0.0357198003 9.456712e-04 3.189602e-04
662 -0.0008081216 -0.0139311449 -2.891020e-02 -1.295363e-03
663 -0.0033867462 0.0068745264 -2.529552e-03 -1.496588e-04
664 -0.0015216068 -0.0558572120 -3.023653e-03 -7.992975e-03
665 0.0052829422 0.0181072771 4.304652e-03 -3.319519e-03
666 0.0084386054 0.0448545861 -8.182748e-04 4.279284e-03
667 -0.0076664829 -0.0059415480 -2.047362e-04 6.059936e-03
668 -0.0062108665 -0.0039847073 7.313506e-04 5.993467e-04
669 -0.0053350948 0.0068119154 -1.042631e-02 -2.056524e-03
670 -0.0263588067 0.0245395479 -2.188962e-02 -6.732491e-03
671 -0.0021511018 0.0220649895 1.412435e-02 1.702085e-03
672 0.0205058100 -0.0007179119 3.057527e-03 -1.002423e-02
673 0.0096862280 -0.0194488633 1.207407e-03 -1.553899e-03
674 0.0007143951 -0.0068557672 6.227450e-03 1.790274e-03
675 -0.0021926470 -0.0051114507 -6.267498e-03 -1.035691e-03
676 0.0076655765 -0.0139300847 6.583825e-03 3.059472e-03
677 -0.0032457653 0.0180480206 -4.635495e-03 1.064002e-03
678 0.0036633764 0.0060676410 -2.762676e-04 5.364970e-04
679 -0.0008111122 -0.0013635410 -1.065898e-03 1.214059e-03
680 0.0050228311 0.0055141267 3.003507e-03 1.121643e-03
681 -0.0007067495 0.0147281558 -2.699002e-03 -1.514035e-04
682 -0.0024248548 0.0002573473 -2.113685e-03 -1.423409e-03
683 -0.0002025624 0.0138417207 -4.374895e-03 1.415328e-04
684 -0.0141822418 -0.0169517332 -3.578920e-03 -1.799234e-03
685 -0.0005651749 -0.0259693324 -5.926428e-03 -3.635333e-03
686 0.0004112688 0.0133043570 -1.545642e-03 1.981828e-03
687 -0.0150565262 -0.0107757493 -1.717916e-02 -1.328749e-02
688 0.0039129754 -0.0441013167 -8.376631e-03 -5.653841e-04
689 0.0019748467 0.0115063340 -2.835394e-02 7.868428e-03
690 0.0072614108 0.0358764014 3.586897e-02 7.960077e-03
691 -0.0003604531 0.0106119001 1.024769e-04 -2.733651e-04
What I need to do is look up each portfolio (each row of Final_output is a portfolio, i.e. a 4-stock portfolio) in perf and compute the volatility (standard deviation) of that portfolio using the stocks' historical daily performance over the last three months. (Of course, I have pasted only 4 stocks' performance here for simplicity.) Once done for the first row, I need to do the same for all the other rows (portfolios).
Below is the formula I used for computing the volatility:
#formula for computing the volatility
sqrt(t(weights) %*% covariance_matrix %*% weights)
#where covariance_matrix is
cov(portfolio_component_monthly_returns)
#All the portfolios are equally weighted
weights <- c(0.25, 0.25, 0.25, 0.25)
Since yesterday I have been trying to automate the process for all the rows; I have more than 10,000 of them. I'm new to R, so even after experimenting and searching the web I have no results and no idea how to automate it. Would someone have a clue how to do it?
I hope I have been as clear as possible; if not, please don't hesitate to ask.
Many thanks
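A minimal sketch of one way to automate this over all rows (an assumption on my part: perf's columns are named by ISIN, already cover just the three-month window, and include every ISIN appearing in Final_output):
# hedged sketch, not from the original post
portfolio_vol <- function(isins, perf) {
  weights <- rep(1 / length(isins), length(isins))  # equally weighted
  covariance_matrix <- cov(perf[, isins])           # as in the formula above
  c(sqrt(t(weights) %*% covariance_matrix %*% weights))
}
Final_output$volatility <- apply(Final_output, 1, portfolio_vol, perf = perf)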

R is not taking the parameter hgap in layout_with_sugiyama

I'm working in R on a graph and I'd like a hierarchical plot based on the values in the vector S (one value per node).
lay2 <- layout_with_sugiyama(grafo, attributes="all", layers = S, hgap=10, vgap=10)
plot(lay2$extd_graph, vertex.label.cex=0.5)
However, the parameters hgap and vgap seem to be ignored, and the graph is really cluttered (not least because I've got 162 nodes).
Am I doing something wrong, or is there another way to draw a hierarchical graph?
I believe that layout_with_sugiyama is working just fine, but you may be misinterpreting the output. Since you do not provide any data, I will illustrate with some randomly generated data.
library(igraph)
set.seed(1234)
grafo = erdos.renyi.game(162, 0.03)
lay2 <- layout_with_sugiyama(grafo, attributes="all",
hgap=10, vgap=10)
plot(lay2$extd_graph, vertex.label.cex=0.5, vertex.size=9)
I think the source of your question is the fact that the nodes are a bit crowded together in the horizontal direction. But that should be expected. Let's analyze the layout, starting with the easy part, the vertical direction.
table(lay2$layout[,2])
1 11 21 31 41
24 82 42 13 1
You can see that vgap worked: the spacing is 10 units apart. The second line up (y = 11) has 82 nodes. Unless the nodes are tiny, 82 nodes on a single horizontal line will overlap. But aren't they supposed to have spacing of at least 10? They do! Let's look at that second line.
sort(lay2$layout[lay2$layout[,2]==11,1])
[1] -25 -15 -5 5 15 25 35 45 55 65 75 85 95 105 115 125 135 230
[19] 240 260 270 280 290 300 310 320 330 340 350 360 370 380 390 400 410 420
[37] 430 440 450 460 470 480 490 500 510 520 530 540 550 560 570 580 590 600
[55] 610 620 630 640 655 665 675 685 695 720 730 740 750 760 770 780 790 800
[73] 810 820 830 840 850 860 870 880 890 910
Looking at the whole graph, there is a slightly broader range.
range(lay2$layout[,1])
[1] -65 910
None of the numbers are less than 10 apart, as requested. hgap worked too!
However, what happens when you try to plot that? If you read the part of the ?igraph.plotting help page that refers to the parameter rescale, you will see:
rescale: Logical constant, whether to rescale the coordinates to the [-1,1] x [-1,1] interval. Defaults to TRUE, the layout will be rescaled.
So the layout will be rescaled to a range of [-1,1] and then plotted. Scaled or not, you need to fit 82 nodes in a single horizontal row, so it is very difficult to avoid overlapping nodes.
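If you do want the computed spacing to survive plotting, one possible workaround (my assumption, not something the original answer or help page promises) is to switch rescaling off and supply the limits yourself; the nodes will then appear much smaller, since the same vertex size now spans a far wider coordinate range:
plot(lay2$extd_graph, vertex.label.cex = 0.5, vertex.size = 9,
     rescale = FALSE, xlim = range(lay2$layout[, 1]),
     ylim = range(lay2$layout[, 2]))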

Convert time values to numeric while keeping time characteristics

I have a data set which contains interval times of different events occurring. What I want to do is convert the data into a numeric vector so it's easier to manipulate, run summaries, make graphs etc., while keeping its time characteristics. Here is a snippet of my data:
data <- c( "03:31", "12:17", "16:29", "09:52", "04:01", "09:00", "06:29",
"04:17", "04:42")
class(data)
[1] "character"
The obvious answer is :
as.numeric(data)
But I get this warning:
Warning message:
NAs introduced by coercion
I thought of maybe taking the ':' out, but then it loses its time characteristics. By that I mean that if I sum values together, say 347 and 543, it would give me 890 as opposed to 930 (3:47 plus 5:43 is 9:30). Here is the code I would use to take the colon out, which works fine for its purpose:
Nocolon <- gsub("[:]", "", data, perl=TRUE)
"0331" "1217" "1629" "0952" "0401" "0900" "0629" "0417" "0442"
So essentially, what I want is for my time values to be in a form which is easy to manipulate and analyse. My idea is a numeric vector, but that comes from my minimal understanding of R. My actual data has thousands of time values, and I want to create a plot that will let me view the values and determine whether they follow a statistical distribution.
Thanks in advance!
Here are some approaches. All convert to minutes. For example, the first component is "03:31" which is 3 * 60 + 31 = 211 minutes. (1) to (5) do not use any packages.
1) %*% This works by reading data into a two-column data frame of hours and minutes. That is converted to a matrix so that it can be matrix-multiplied by c(60, 1). Finally, unravel the result with c.
c(as.matrix(read.table(text = data, sep = ":")) %*% c(60, 1))
[1] 211 737 989 592 241 540 389 257 282
2) with This variation is even shorter. It creates the same data frame and then simply multiplies the first column (V1) by 60 and adds the second column (V2).
with(read.table(text = data, sep = ":"), 60*V1+V2)
[1] 211 737 989 592 241 540 389 257 282
3) complex This converts each component to a complex number and then performs the required arithmetic on the real and imaginary parts:
data_c <- as.complex(sub(":(\\d+)", "+\\1i", data))
60 * Re(data_c) + Im(data_c)
## [1] 211 737 989 592 241 540 389 257 282
3a) This variation of (3) also works and avoids regular expressions:
data_c <- as.complex(paste0(chartr(":", "+", data), "i"))
60 * Re(data_c) + Im(data_c)
## [1] 211 737 989 592 241 540 389 257 282
4) eval This converts each component into an arithmetic expression which evaluates to the number of minutes and then performs the evaluation. Using eval is not really recommended when you can avoid it, so this one is less desirable:
sapply(parse(text = sub("(\\d+):", "60*\\1+", data)), eval)
## [1] 211 737 989 592 241 540 389 257 282
5) POSIXlt We can convert to "POSIXlt" class and then use the hour and min components:
with(unclass(as.POSIXlt(data, format = "%H:%M")), 60 * hour + min)
## [1] 211 737 989 592 241 540 389 257 282
6) chron Using the chron package we can paste on the seconds, convert to "times" class and then convert to minutes:
library(chron)
24 * 60 * as.numeric(times(paste0(data, ":00")))
## [1] 211 737 989 592 241 540 389 257 282
7) lubridate Using the lubridate package we can convert it using hm and then to numeric giving seconds and finally dividing by 60 to give minutes:
as.numeric(hm(data)) / 60
## [1] 211 737 989 592 241 540 389 257 282
Use the as.difftime function designed for this:
as.difftime(data, format="%H:%M", units="mins")
#Time differences in mins
#[1] 211 737 989 592 241 540 389 257 282
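Whichever conversion you use, the resulting minutes behave like ordinary numbers, so sums, summaries and plots work directly; a quick sketch using the difftime result above:
mins <- as.numeric(as.difftime(data, format = "%H:%M", units = "mins"))
mins[1] + mins[3]  # 211 + 989 = 1200 minutes, i.e. 20:00
hist(mins)  # e.g. to eyeball the distribution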

How do I make sure numbers are numeric from a .txt?

I'm setting up a script to extract the thickness and voltages from a single column text file and perform a Weibull distribution on it. When I try to use fitdistr() I get an error stating "'x' must be a non-empty numeric vector". R is supposed to interpret numbers in text files as numeric but that doesn't seem to be happening. Any thoughts?
filename <- "SampleBreakdownSet.txt"
d <- read.table(filename, header = FALSE, sep = "")
#Extract thickness from the dataset; set to variable t
t = d[1,1]
#Extract the breakdown voltages and toss into dataset, BDV
BDV = tail(d,(nrow(d)-1))
#Calculates the breakdown field from the thickness and BDV
BDF = (BDV*10000) / t
#Calculates the Weibull parameters from the input breakdown voltages.
fitdistr(BDF, densfun ="weibull", lower = 0)
Running that last line fails with:
Error in fitdistr(BDF, densfun = "weibull", lower = 0) :
  'x' must be a non-empty numeric vector
Sample data I'm using:
2
200
250
450
320
100
400
200
403
502
203
420
120
342
304
253
423
534
534
243
253
423
123
433
534
234
633
432
342
543
532
123
453
231
532
342
213
243
You are passing a data.frame to fitdistr, but you should be passing the vector itself.
Try this:
d <- read.table(text='200
250
450
320
100
400
200
403
502
203
420
120
342
304
253
423
534
534
243
253
423
123
433
534
234
633
432
342
543
532
123
453
231
532
342
213
243', header=FALSE)
t <- d[1,1]
#Extract the breakdown voltages and toss into dataset, BDV
BDV <- d[-1, 1]
BDF <- (BDV*10000) / t
library(MASS)
fitdistr(BDF, densfun ="weibull", lower = 0)
If you instead keep BDV as a one-column data frame (as your tail(d, nrow(d)-1) does), refer to the relevant column when calling fitdistr, e.g.:
fitdistr(BDF$V1, densfun ="weibull", lower = 0)
# shape scale
# 2.745485e+00 1.997509e+04
# (3.716797e-01) (1.283667e+03)
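For what it's worth, the error comes down to the class of what you pass in; a tiny illustration using the d defined above:
is.numeric(d)        # FALSE: d is a data.frame, which fitdistr rejects
is.numeric(d[-1, 1]) # TRUE: single-bracket column indexing drops to a numeric vector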
