How to use read.table if data are thousand separated by dots - r

I would like to read.table from a csv file that has dots as thousand separators.
Resulting numbers should be numeric.
This is somewhat complicated as read.table allows to configure decimal signs and quote signs but not thousand separators.
The command gsub(input[10,10],pattern='[.]',replacement='') could delete the dots but transforms everything to characters. The conversation with as.numeric does work for single numbers:
> input[4,4]
[1] 1.742
97 Levels: 0 1.034 1.132 1.137 1.153 1.164 1.178 1.190 1.208 1.251 1.282 ... 950
> gsub(input[4,4],pattern='[.]',replacement='')
[1] "1742"
> as.numeric(gsub(input[4,4],pattern='[.]',replacement=''))
[1] 1742
but not for tables, as gsub(input,pattern='[.]',replacement='') yields
…
[4] "c(17, 21, 31, 38, 39, 48, 56, 52, 57, 63, 66, 68, 71, 76, 78, 79, 75, 77, 74, 73, 65, 64, 55, 50, 45, 43, 34, 36, 44, 42, 32, 5, 96, 10, 9, 6, 22, 53, 54, 14, 15, 16, 24, 18, 23, 33, 25, 28, 35, 47, 49, 51, 62, 70, 72, 69, 67, 58, 26, 94, 93, 97, 8, 41, 37, 46, 29, 40, 27, 30, 20, 19, 12, 13, 11, 7, 3, 4, 2, 95, 92, 90, 89, 87, 86, 83, 81, 80, 61, 60, 59, 91, 82, 88, 84, 85, 1, 1, 1, 1)" …
which is a vector of NA if converted to numeric. Furthermore, something else seems to be wrong with that command since most values are larger than thousand.
Is there anything else that could be useful, besides editing the original .csv files?

You can use the same answer as here, just change the coma (,) to the escaped period (\\.) in the gsub call to remove the periods.

Assuming input is of type character to begin with, this should work -
library(data.table)
dt <- data.table(dt)
dt[,input := as.numeric(gsub(input,pattern='[.]',replacement='')), by = 'input']

Related

How to consecutively subset an array and count the number of iterations?

I need to repeatedly apply a function on the resultant arrays until all data in the array is reduced to a single set, and count the number of iterations.
Data
Array ar
structure(c(0, 11, 17, 15, 22, 67, 73, 68, 31, 31, 28, 33, 34,
32, 11, 0, 9, 12, 21, 67, 73, 67, 35, 30, 34, 67, 60, 36, 17,
9, 0, 6, 19, 70, 74, 68, 36, 36, 36, 64, 66, 39, 15, 12, 6, 0,
13, 64, 69, 66, 34, 37, 39, 77, 65, 45, 22, 21, 19, 13, 0, 59,
60, 66, 38, 39, 39, 40, 43, 43, 67, 67, 70, 64, 59, 0, 10, 18,
77, 75, 78, 93, 93, 85, 73, 73, 74, 69, 60, 10, 0, 15, 76, 74,
80, 103, 101, 95, 68, 67, 68, 66, 66, 18, 15, 0, 59, 65, 73,
90, 87, 82, 31, 35, 36, 34, 38, 77, 76, 59, 0, 8, 19, 24, 28,
32, 31, 30, 36, 37, 39, 75, 74, 65, 8, 0, 12, 20, 22, 23, 28,
34, 36, 39, 39, 78, 80, 73, 19, 12, 0, 6, 14, 18, 33, 67, 64,
77, 40, 93, 103, 90, 24, 20, 6, 0, 2, 8, 34, 60, 66, 65, 43,
93, 101, 87, 28, 22, 14, 2, 0, 6, 32, 36, 39, 45, 43, 85, 95,
82, 32, 23, 18, 8, 6, 0), .Dim = c(14L, 14L))
From
a<-colSums(ar<25)
b<-which.max(a)
c<-ar[ar[,b] > 25,, drop = FALSE]
we get
structure(c(0, 11, 17, 15, 22, 67, 73, 68, 11, 0, 9, 12, 21,
67, 73, 67, 17, 9, 0, 6, 19, 70, 74, 68, 15, 12, 6, 0, 13, 64,
69, 66, 22, 21, 19, 13, 0, 59, 60, 66, 67, 67, 70, 64, 59, 0,
10, 18, 73, 73, 74, 69, 60, 10, 0, 15, 68, 67, 68, 66, 66, 18,
15, 0, 31, 35, 36, 34, 38, 77, 76, 59, 31, 30, 36, 37, 39, 75,
74, 65, 28, 34, 36, 39, 39, 78, 80, 73, 33, 67, 64, 77, 40, 93,
103, 90, 34, 60, 66, 65, 43, 93, 101, 87, 32, 36, 39, 45, 43,
85, 95, 82), .Dim = c(8L, 14L))
then from
a<-colSums(c<25)
b<-which.max(a)
d<-c[c[,b]>25,,drop=FALSE]
we get
structure(c(67, 73, 68, 67, 73, 67, 70, 74, 68, 64, 69, 66, 59,
60, 66, 0, 10, 18, 10, 0, 15, 18, 15, 0, 77, 76, 59, 75, 74,
65, 78, 80, 73, 93, 103, 90, 93, 101, 87, 85, 95, 82), .Dim = c(3L,
14L))
applying once more
a<-colSums(d<25)
b<-which.max(a)
e<-d[d[,b]>25,,drop=FALSE]
results in a array with no values
structure(numeric(0), .Dim = c(0L, 14L))
Then, the operation can be performed a number of times, and I need to know how many times; in this case it was 3 times.
Also, I need to reduce the number of the lines of the code probably with a loop function, as the action is repetitive.
You can use recursion as shown below:
fun<- function(x, i=1){
a<-which.max(colSums(x<25))
b<-x[x[,a] > 25,, drop = FALSE]
if(length(b)) Recall(b, i+1) else i
}
fun(ar)
[1] 3

Discretize the columns first - Apriori in R

I extracted this data from a file:
forests<-read.table("~/Desktop/f.txt", header = FALSE, sep = " ", fill = TRUE)
library(arules) //after installing the package
I get an error when I type
forests<- apriori(forests, parameter = list(support=0.3))
Error in asMethod(object) : column(s) 1, 2, 3, 4, 5, 6, 7, 8, 9,
10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26,
27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43,
44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60,
61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77,
78, 79, 80, 81, 82, 83, 84, 85 not logical or a factor. Discretize the
columns first.
I tried discritize(forests). It still doesn't work.
itemFrequencyPlot(forests) also gives an error saying unable to find inherited method.
Have I imported the data incorrectly?

Decomposition of time series data

Can anybody help be decipher the output of ucm. My main objective is to check if the ts data is seasonal or not. But i cannot plot and look and every time. I need to automate the entire process and provide an indicator for the seasonality.
I want to understand the following output
ucmxmodel$s.season
# Time Series:
# Start = c(1, 1)
# End = c(4, 20)
# Frequency = 52
# [1] -2.391635076 -2.127871717 -0.864021134 0.149851212 -0.586660213 -0.697838635 -0.933982269 0.954491859 -1.531715424 -1.267769820 -0.504165631
# [12] -1.990792301 1.273673437 1.786860414 0.050859315 -0.685677002 -0.921831488 -1.283081922 -1.144376739 -0.964042949 -1.510837956 1.391991657
# [23] -0.261175626 5.419494363 0.543898305 0.002548125 1.126895943 1.474427901 2.154721023 2.501352782 0.515453691 -0.470886132 1.209419689
ucmxmodel$vs.season
# [1] 1.375832 1.373459 1.371358 1.369520 1.367945 1.366632 1.365582 1.364795 1.364270 1.364007 1.364007 1.364270 1.364795 1.365582 1.366632 1.367945
# [17] 1.369520 1.371358 1.373459 1.375816 1.784574 1.784910 1.785223 1.785514 1.785784 1.786032 1.786258 1.786461 1.786643 1.786802 1.786938 1.787052
# [33] 1.787143 1.787212 1.787257 1.787280 1.787280 1.787257 1.787212 1.787143 1.787052 1.786938 1.786802 1.786643 1.786461 1.786258 1.786032 1.785784
# [49] 1.785514 1.785223 1.784910 1.784578 1.375641 1.373276 1.371175 1.369337 1.367762 1.366449 1.365399 1.364612 1.364087 1.363824 1.363824 1.364087
# [65] 1.364612 1.365399 1.366449 1.367762 1.369337 1.371175 1.373276 1.375636 1.784453 1.784788 1.785101 1.785392 1.785662 1.785910 1.786136 1.786339
ucmxmodel$est.var.season
# Season_Variance
# 0.0001831373
How can i use the above info without looking at the plots to determine the seasonality and at what level ( weekly, monthly, quarterly or yearly)?
In addition, i am getting NULL in est
ucmxmodel$est
# NULL
Data
The data for a test is:
structure(c(44, 81, 99, 25, 69, 42, 6, 25, 75, 90, 73, 65, 55,
9, 53, 43, 19, 28, 48, 71, 36, 1, 66, 46, 55, 56, 100, 89, 29,
93, 55, 56, 35, 87, 77, 88, 18, 32, 6, 2, 15, 36, 48, 80, 48,
2, 22, 2, 97, 14, 31, 54, 98, 43, 62, 94, 53, 17, 45, 92, 98,
7, 19, 84, 74, 28, 11, 65, 26, 97, 67, 4, 25, 62, 9, 5, 76, 96,
2, 55, 46, 84, 11, 62, 54, 99, 84, 7, 13, 26, 18, 42, 72, 1,
83, 10, 6, 32, 3, 21, 100, 100, 98, 91, 89, 18, 88, 90, 54, 49,
5, 95, 22), .Tsp = c(1, 3.15384615384615, 52), class = "ts")
and
structure(c(40, 68, 50, 64, 26, 44, 108, 90, 62, 60, 90, 64, 120, 82, 68, 60,
26, 32, 60, 74, 34, 16, 22, 44, 50, 16, 34, 26, 42, 14, 36, 24, 14, 16, 6, 6,
12, 20, 10, 34, 12, 24, 46, 30, 30, 46, 54, 42, 44, 42, 12, 52, 42, 66, 40,
60, 42, 44, 64, 96, 70, 52, 66, 44, 64, 62, 42, 86, 40, 56, 50, 50, 62, 22,
24, 14, 14, 18, 18, 10, 20, 10, 4, 18, 10, 10, 14, 20, 10, 32, 12, 22, 20, 20,
26, 30, 36, 28, 56, 34, 14, 54, 40, 30, 42, 36, 52, 30, 32, 52, 42, 62, 46,
64, 70, 48, 40, 64, 40, 120, 58, 36, 40, 34, 36, 26, 18, 28, 16, 32, 18, 12,
20), .Tsp = c(1, 4.36, 52), class = "ts")
I think the most straightforward approach would be to follow Rob Hyndman's approach (he is the author of many time series packages in R). For your data it would work as follows,
require(fma)
# Create a model with multiplicative errors (see https://www.otexts.org/fpp/7/7).
fit1 <- stlf(test2)
# Create a model with additive errors.
fit2 <- stlf(data, etsmodel = "ANN")
deviance <- 2 * c(logLik(fit1$model) - logLik(fit2$model))
df <- attributes(logLik(fit1$model))$df - attributes(logLik(fit2$model))$df
# P-value
1 - pchisq(deviance, df)
# [1] 1
Based on this analysis we find the p-value of 1 which would lead us to conclude there is no seasonality.
I quite like the stl() function provided in R. Try this minimal example:
# some random data
x <- rnorm(200)
# as a time series object
xt <- ts(x, frequency = 10)
# do the decomposition
xts <- stl(xt, s.window = "periodic")
# plot the results
plot(xts)
Now you can get an estimate of the 'seasonality' by comparing the variances.
vars <- apply(xts$time.series, 2, var)
vars['seasonal'] / sum(vars)
You now have the seasonal variance as a proportion of sum of variances after decomposition.
I highly recommend reading the original paper so that you understand whats happening under the hood here. Its very accessible and I like this method as it is quite intuitive.

R - hist plot colours by quantile

I am trying to do a simple hist plot and colour the bins by quantile.
I was wondering why when the bins size change the colours gets all messed up.
Maybe I am not doing it right from the beginning.
The quantiles are
quantile(x)
0% 25% 50% 75% 100%
0.00 33.75 58.00 78.25 123.00
Then I am setting the colours with the quantile values
k = ifelse(test = x <= 34, yes = "#8DD3C7",
no = ifelse(test = (x > 34 & x <= 58), yes = "#FFFFB3",
no = ifelse(test = (x > 58 & x <= 79), yes = "#BEBADA",
no = ifelse(test = (x > 79), yes = "#FB8072", 'grey'))) )
Then when I plot with larger bin, I get :
hist(dt, breaks = 10, col = k)
Which seems right, even though the last bin is wrong (?!).
But when I try with smaller bins, the colours are not right.
Could someone help me understand why is it wrong ? Or if my code is wrong ?
The x in question
x = c(23, 23, 16, 16, 34, 34, 43, 43, 97, 97, 63, 63, 39, 39, 29,
29, 63, 63, 48, 48, 7, 7, 80, 80, 69, 69, 110, 110, 103, 103,
43, 43, 39, 39, 46, 46, 14, 14, 56, 56, 76, 76, 52, 52, 18, 18,
32, 32, 66, 66, 70, 70, 26, 26, 40, 40, 105, 105, 62, 62, 51,
51, 58, 58, 37, 37, 55, 55, 42, 42, 11, 11, 89, 89, 55, 55, 109,
109, 49, 49, 27, 27, 96, 96, 27, 27, 65, 65, 74, 74, 17, 17,
33, 33, 89, 89, 63, 63, 18, 18, 25, 25, 36, 36, 108, 108, 3,
3, 52, 52, 83, 83, 74, 74, 56, 56, 99, 99, 6, 6, 25, 25, 51,
51, 4, 4, 100, 100, 17, 17, 44, 44, 23, 23, 70, 70, 85, 85, 14,
14, 22, 22, 89, 89, 45, 45, 2, 2, 29, 29, 14, 14, 69, 69, 96,
96, 10, 10, 58, 58, 97, 97, 54, 54, 60, 60, 65, 65, 2, 2, 54,
54, 4, 4, 28, 28, 107, 107, 74, 74, 72, 72, 71, 71, 42, 42, 92,
92, 64, 64, 39, 39, 111, 111, 72, 72, 73, 73, 58, 58, 41, 41,
56, 56, 73, 73, 18, 18, 73, 73, 36, 36, 60, 60, 49, 49, 47, 47,
95, 95, 19, 19, 8, 8, 7, 7, 38, 38, 38, 38, 38, 38, 28, 28, 79,
79, 53, 53, 30, 30, 19, 19, 14, 14, 53, 53, 68, 68, 39, 39, 42,
42, 87, 87, 33, 33, 18, 18, 77, 77, 83, 83, 19, 19, 14, 14, 7,
7, 32, 32, 94, 94, 30, 30, 55, 55, 89, 89, 30, 30, 45, 45, 84,
84, 38, 38, 59, 59, 73, 73, 77, 77, 22, 22, 55, 55, 31, 31, 52,
52, 20, 20, 26, 26, 62, 62, 55, 55, 46, 46, 26, 26, 49, 49, 22,
22, 65, 65, 67, 67, 73, 73, 29, 29, 88, 88, 86, 86, 76, 76, 32,
32, 12, 12, 19, 19, 14, 14, 8, 8, 63, 63, 63, 63, 65, 65, 84,
84, 34, 34, 42, 42, 26, 26, 75, 75, 68, 68, 28, 28, 95, 95, 17,
17, 76, 76, 33, 33, 91, 91, 93, 93, 80, 80, 89, 89, 64, 64, 81,
81, 98, 98, 47, 47, 70, 70, 46, 46, 11, 11, 92, 92, 69, 69, 95,
95, 51, 51, 87, 87, 61, 61, 50, 50, 47, 47, 35, 35, 31, 31, 39,
39, 19, 19, 81, 81, 35, 35, 68, 68, 68, 68, 67, 67, 57, 57, 7,
7, 9, 9, 23, 23, 50, 50, 89, 89, 41, 41, 54, 54, 53, 53, 57,
57, 89, 89, 32, 32, 40, 40, 48, 48, 35, 35, 15, 15, 90, 90, 1,
1, 17, 17, 53, 53, 73, 73, 76, 76, 59, 59, 45, 45, 68, 68, 21,
21, 37, 37, 33, 33, 51, 51, 61, 61, 31, 31, 15, 15, 23, 23, 29,
29, 45, 45, 96, 96, 87, 87, 37, 37, 104, 104, 50, 50, 58, 58,
103, 103, 91, 91, 72, 72, 73, 73, 27, 27, 60, 60, 23, 23, 99,
99, 28, 28, 78, 78, 27, 27, 82, 82, 63, 63, 34, 34, 84, 84, 62,
62, 2, 2, 99, 99, 22, 22, 85, 85, 39, 39, 47, 47, 66, 66, 17,
17, 74, 74, 45, 45, 70, 70, 87, 87, 28, 28, 97, 97, 89, 89, 33,
33, 50, 50, 79, 79, 86, 86, 69, 69, 91, 91, 75, 75, 52, 52, 76,
76, 13, 13, 71, 71, 42, 42, 20, 20, 28, 28, 56, 56, 69, 69, 16,
16, 47, 47, 60, 60, 45, 45, 72, 72, 78, 78, 107, 107, 4, 4, 64,
64, 88, 88, 9, 9, 3, 3, 10, 10, 92, 92, 41, 41, 5, 5, 35, 35,
31, 31, 24, 24, 70, 70, 47, 47, 41, 41, 32, 32, 92, 92, 90, 90,
75, 75, 3, 3, 78, 78, 30, 30, 93, 93, 60, 60, 17, 17, 25, 25,
48, 48, 70, 70, 69, 69, 66, 66, 76, 76, 104, 104, 31, 31, 72,
72, 56, 56, 64, 64, 92, 92, 68, 68, 102, 102, 100, 100, 27, 27,
40, 40, 47, 47, 29, 29, 76, 76, 78, 78, 20, 20, 13, 13, 10, 10,
113, 113, 17, 17, 61, 61, 69, 69, 65, 65, 16, 16, 100, 100, 5,
5, 18, 18, 24, 24, 54, 54, 41, 41, 64, 64, 66, 66, 90, 90, 29,
29, 97, 97, 37, 37, 42, 42, 84, 84, 37, 37, 74, 74, 65, 65, 12,
12, 49, 49, 31, 31, 108, 108, 9, 9, 93, 93, 71, 71, 39, 39, 70,
70, 79, 79, 92, 92, 60, 60, 104, 104, 79, 79, 103, 103, 38, 38,
93, 93, 46, 46, 66, 66, 79, 79, 51, 51, 31, 31, 65, 65, 93, 93,
25, 25, 22, 22, 91, 91, 123, 123, 51, 51, 34, 34, 64, 64, 31,
31, 24, 24, 74, 74, 57, 57, 95, 95, 83, 83, 28, 28, 56, 56, 72,
72, 43, 43, 18, 18, 66, 66, 32, 32, 17, 17, 67, 67, 10, 10, 44,
44, 66, 66, 57, 57, 89, 89, 57, 57, 55, 55, 18, 18, 78, 78, 82,
82, 103, 103, 110, 110, 92, 92, 54, 54, 35, 35, 8, 8, 53, 53,
86, 86, 45, 45, 99, 99, 19, 19, 84, 84, 94, 94, 92, 92, 80, 80,
69, 69, 45, 45, 22, 22, 59, 59, 9, 9, 41, 41, 72, 72, 24, 24,
117, 117, 79, 79, 57, 57, 29, 29, 96, 96, 47, 47, 23, 23, 64,
64, 33, 33, 48, 48, 80, 80, 30, 30, 42, 42, 10, 10, 42, 42, 68,
68, 46, 46, 58, 58, 39, 39, 82, 82, 79, 79, 80, 80, 89, 89, 85,
85, 24, 24, 106, 106, 40, 40, 90, 90, 69, 69, 92, 92, 84, 84,
82, 82, 86, 86, 80, 80, 73, 73, 78, 78, 39, 39, 27, 27, 55, 55,
100, 100, 63, 63, 21, 21, 46, 46, 94, 94, 6, 6, 45, 45, 66, 66,
94, 94, 52, 52, 78, 78, 59, 59, 86, 86, 67, 67, 76, 76, 54, 54,
47, 47, 37, 37, 76, 76, 32, 32, 49, 49, 87, 87, 122, 122, 27,
27, 82, 82, 51, 51, 50, 50, 22, 22, 32, 32, 99, 99, 77, 77, 54,
54, 29, 29, 82, 82, 80, 80, 85, 85, 30, 30, 57, 57, 41, 41, 50,
50, 65, 65, 51, 51, 109, 109, 89, 89, 50, 50, 6, 6, 66, 66, 42,
42, 48, 48, 88, 88, 67, 67, 89, 89, 109, 109, 80, 80, 64, 64,
64, 64, 95, 95, 76, 76, 76, 76, 78, 78, 44, 44, 51, 51, 19, 19,
29, 29, 31, 31, 75, 75, 11, 11, 10, 10, 64, 64, 80, 80, 29, 29,
73, 73, 67, 67, 38, 38, 27, 27, 23, 23, 74, 74, 79, 79, 49, 49,
78, 78, 29, 29, 59, 59, 70, 70, 8, 8, 24, 24, 39, 39, 80, 80,
27, 27, 29, 29, 36, 36, 94, 94, 86, 86, 35, 35, 84, 84, 99, 99,
83, 83, 92, 92, 81, 81, 58, 58, 2, 2, 64, 64, 75, 75, 29, 29,
53, 53, 58, 58, 11, 11, 38, 38, 83, 83, 108, 108, 86, 86, 56,
56, 12, 12, 84, 84, 76, 76, 38, 38, 54, 54, 37, 37, 27, 27, 61,
61, 83, 83, 37, 37, 59, 59, 81, 81, 76, 76, 70, 70, 61, 61, 101,
101, 77, 77, 68, 68, 74, 74, 83, 83, 70, 70, 93, 93, 53, 53,
64, 64, 89, 89, 1, 1, 53, 53, 67, 67, 81, 81, 71, 71, 51, 51,
85, 85, 35, 35, 67, 67, 53, 53, 37, 37, 31, 31, 65, 65, 82, 82,
47, 47, 60, 60, 81, 81, 21, 21, 94, 94, 75, 75, 92, 92, 113,
113, 93, 93, 84, 84, 77, 77, 82, 82, 84, 84, 58, 58, 83, 83,
84, 84, 80, 80, 1, 1, 49, 49, 73, 73, 22, 22, 99, 99, 74, 74,
28, 28, 33, 33, 74, 74, 91, 91, 83, 83, 70, 70, 99, 99, 69, 69,
38, 38, 68, 68, 47, 47, 61, 61, 47, 47, 70, 70, 85, 85, 20, 20,
100, 100, 3, 3, 49, 49, 100, 100, 85, 85, 54, 54, 8, 8, 3, 3,
47, 47, 46, 46, 45, 45, 27, 27, 87, 87, 20, 20, 24, 24, 51, 51,
50, 50, 105, 105, 73, 73, 13, 13, 18, 18, 51, 51, 75, 75, 55,
55, 62, 62, 85, 85, 56, 56, 51, 51, 66, 66, 74, 74, 63, 63, 2,
2, 81, 81, 85, 85, 19, 19, 16, 16, 83, 83, 36, 36, 79, 79, 63,
63, 41, 41, 45, 45, 76, 76, 62, 62, 67, 67, 74, 74, 92, 92, 47,
47, 41, 41, 80, 80, 57, 57, 100, 100, 66, 66, 58, 58, 65, 65,
59, 59, 20, 20, 54, 54, 10, 10, 79, 79, 64, 64, 106, 106, 44,
44, 28, 28, 41, 41, 49, 49, 80, 80, 61, 61, 20, 20, 75, 75, 59,
59, 93, 93, 32, 32, 38, 38, 30, 30, 41, 41, 8, 8, 8, 8, 54, 54,
56, 56, 83, 83, 81, 81, 77, 77, 42, 42, 59, 59, 11, 11, 21, 21,
77, 77, 84, 84, 86, 86, 84, 84, 34, 34, 48, 48, 80, 80, 92, 92,
18, 18, 66, 66, 40, 40, 45, 45, 60, 60, 80, 80, 2, 2, 5, 5, 84,
84, 66, 66, 70, 70, 70, 70, 95, 95, 62, 62, 0, 0, 67, 67, 61,
61, 71, 71, 73, 73, 82, 82, 45, 45, 54, 54, 43, 43)
It is because you mistunderstand the col argument of hist.
The col argument is a vector where col[i] is the colour of the ith bar of the histogram.
Your k vector has one element per element of x, which is many more than the number of bars in the histogram.
In the first case, only the first 13 elements of k are used to colour the bars (in that order), since there are only 13 bars. In the second case, the first n elements of k are used to colour the bars, where n is the number of bars (see how the first 13 bars of the small-bin histogram have the same colour as the first 13 of the first histogram?).
If you want to colour the bars by quantile, you will have to work out how many bars are in each quantile (not how many data points), and create your k like that.
To do this, you need to know the histogram breaks - the breakpoints of your bins. The output of hist returns an object where you can get the breakpoints and so on - see ?hist.
# do the histogram counts to get the break points
# don't plot yet
h <- hist(x, breaks=20, plot=F) # h$breaks and h$mids
To work out the colour the bar should be, you can use either the starting coordinate of each bar (all but the last element of h$breaks), the ending coordinate of each bar (all but the first element of h$breaks) or the midpoint coordinate of each bar (h$mids). Set your colours like you did above.
The findInterval(h$mids, quantile(x), ...) works out which quantile each bar is in (determined by the bar's midpoint); it returns an integer with which interval it is in, or 0 if it's outside (though by definition every bar of the histogram is between the 0th and 100th quantile, so technically your "grey" colour is not ever used). rightmost.closed makes sure the 100% quantile value is included in the top-most colour bracket. The cols[findInterval(...)+1] is just a cool/tricksy way to do your ifelse(h$mids <= ..., "$8DD3C7", ifelse(h$mids <= ..., .....)); you could do it the ifelse way if you prefer.
cols <- c('grey', "#8DD3C7", "#FFFFB3", "#BEBADA", "#FB8072")
k <- cols[findInterval(h$mids, quantile(x), rightmost.closed=T, all.inside=F) + 1]
# plot the histogram with the colours
plot(h, col=k)
Have a look at k - it is only as long as the number of bars in the histogram, rather than as long as the number of datapoints in x.

Cleaning text of tweet messages

I have a csv of tweets. I got it using this ruby library:
https://github.com/sferik/twitter .
The csv is two columns and 150 rows, the second column is the text message:
Text
1 RT #AlstomTransport: #Alstom and OHL to supply a #metro system to #Guadalajara #rail #Mexico http://t.co/H88paFoYc3 http://t.co/fuBPPqNts4
I have to do a sentiment analysis, so i need to clean the text message, removing links, RT, Via, and everything useless for the analysis.
I tried with R, using code found in several tutorials:
> data1 = gsub("(RT|via)((?:\\b\\W*#\\w+)+)", "", data1)
But the output is without any sense:
[1] "1:150"
[2] "c(113, 46, 38, 11, 108, 100, 45, 44, 9, 89, 99, 93, 102, 101, 110, 93, 61, 57, 104, 66, 86, 53, 42, 43, 37, 7, 88, 32, 122, 131, 14, 102, 105, 12, 54, 13, 72, 87, 55, 132, 29, 28, 10, 15, 81, 81, 107, 87, 106, 81, 98, 73, 65, 52, 94, 97, 65, 59, 60, 50, 48, 121, 117, 75, 79, 111, 115, 119, 118, 91, 79, 31, 76, 111, 85, 62, 91, 103, 79, 120, 78, 47, 49, 8, 129, 123, 124, 58, 71, 25, 36, 80, 127, 112, 23, 22, 35, 21, 30, 74, 82, 51, 63, 130, 135, 134, 90, 83, 63, 128, 16, 20, 19, 34, 27, 26, 33, 77, \n114, 126, 64, 69, 4, 135, 41, 40, 17, 67, 92, 96, 84, 92, 56, 18, 125, 5, 6, 133, 24, 39, 70, 95, 116, 68, 84, 109, 92, 3, 1, 2)"
Can anyone help me? Thank you.
Looks like you tried to pass in the entire data.frame to gsub rather than just the text column. gsub prefers to work on character vectors. Instead you should do
data1[,2] = gsub("(RT|via)((?:\\b\\W*#\\w+)+)", "", data1[,2])
to just transform the second column.

Resources