I want to make a vector from 3:50 in R, looking like
3 4 4 5 6 6 7 8 8 .. 50 50
I want to use a for loop in a for loop but it's not doing what I want.
f <- c()
for (i in 3:50) {
for(j in 1:2) {
f = c(f, i)
}
}
What is wrong with it?
Another option is to use an embedded rep:
rep(3:50, rep(1:2, 24))
which gives:
[1] 3 4 4 5 6 6 7 8 8 9 10 10 11 12 12 13 14 14 15 16 16 17 18 18 19 20 20
[28] 21 22 22 23 24 24 25 26 26 27 28 28 29 30 30 31 32 32 33 34 34 35 36 36 37 38 38
[55] 39 40 40 41 42 42 43 44 44 45 46 46 47 48 48 49 50 50
This utilizes the fact that the times argument of rep can also be an integer vector of the same length as the x argument, in which case each element of x is repeated the corresponding number of times.
You can generalize this to:
s <- 3
e <- 50
v <- 1:2
rep(s:e, rep(v, (e-s+1)/2))
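The vector form of the times argument used above is easier to see on a tiny input. This is only an illustrative sketch; the names letters3, each_twice and per_element are made up for the example.

```r
letters3 <- c("a", "b", "c")

# A single number repeats the whole vector that many times.
each_twice <- rep(letters3, times = 2)
each_twice
# [1] "a" "b" "c" "a" "b" "c"

# A vector of the same length as x repeats each element individually.
per_element <- rep(letters3, times = c(1, 2, 3))
per_element
# [1] "a" "b" "b" "c" "c" "c"
```

In rep(3:50, rep(1:2, 24)) the inner rep builds the 48 counts 1 2 1 2 ..., so odd values appear once and even values twice.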
Yet another option, using a mix of rep and rep_len:
v <- 3:50
rep(v, rep_len(1:2, length(v)))
A solution based on sapply.
as.vector(sapply(0:23 * 2 + 2, function(x) x + c(1, 2, 2)))
# [1] 3 4 4 5 6 6 7 8 8 9 10 10 11 12 12 13 14 14 15 16 16 17 18 18 19 20 20 21 22 22 23 24 24 25 26 26
# [37] 27 28 28 29 30 30 31 32 32 33 34 34 35 36 36 37 38 38 39 40 40 41 42 42 43 44 44 45 46 46 47 48 48 49 50 50
Benchmarking
Here is a performance comparison of all the current answers. The results show that cumsum(rep(c(1, 1, 0), 24)) + 2L (m8) is the fastest, with rep(3:50, rep(1:2, 24)) (m1) almost as fast.
library(microbenchmark)
library(ggplot2)
perf <- microbenchmark(
m1 = {rep(3:50, rep(1:2, 24))},
m2 = {rep(3:50, each = 2)[c(TRUE, FALSE, TRUE, TRUE)]},
m3 = {v <- 3:50; sort(c(v,v[v %% 2 == 0]))},
m4 = {as.vector(t(cbind(seq(3,49,2),seq(4,50,2),seq(4,50,2))))},
m5 = {as.vector(sapply(0:23 * 2 + 2, function(x) x + c(1, 2, 2)))},
m6 = {sort(c(3:50, seq(4, 50, 2)))},
m7 = {rep(seq(3, 50, 2), each=3) + c(0, 1, 1)},
m8 = {cumsum(rep(c(1, 1, 0), 24)) + 2L},
times = 10000L
)
perf
# Unit: nanoseconds
#  expr   min    lq      mean median    uq     max neval
#    m1   514  1028  1344.980   1029  1542  190200 10000
#    m2  1542  2570  3083.716   3084  3085  191229 10000
#    m3 26217 30329 35593.596  31871 34442 5843267 10000
#    m4 43180 48321 56988.386  50891 55518 6626173 10000
#    m5 30843 35984 42077.543  37526 40611 6557289 10000
#    m6 40611 44209 50092.131  46779 50891  446714 10000
#    m7 13879 16449 19314.547  17478 19020 6309001 10000
#    m8     0  1028  1256.715   1028  1542   71454 10000
Use the rep function together with a recycled logical index, ...[c(TRUE, FALSE, TRUE, TRUE)]:
rep(3:50, each = 2)[c(TRUE, FALSE, TRUE, TRUE)]
## [1] 3 4 4 5 6 6 7 8 8 9 10 10 11 12 12 13 14 14 15 16 16 17 18 18 19
## [26] 20 20 21 22 22 23 24 24 25 26 26 27 28 28 29 30 30 31 32 32 33 34 34 35 36
## [51] 36 37 38 38 39 40 40 41 42 42 43 44 44 45 46 46 47 48 48 49 50 50
If you use a logical vector (TRUE/FALSE) as an index (inside [ ]), a TRUE leads to selection of the corresponding element and a FALSE leads to omission. If the logical index vector (c(TRUE, FALSE, TRUE, TRUE)) is shorter than the indexed vector (rep(3:50, each = 2) in your case), the index vector is recycled.
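A minimal sketch of that recycling on a short vector may help; the variable names here are illustrative only.

```r
x <- 1:8

# The two-element index is recycled four times: keep, drop, keep, drop, ...
every_other <- x[c(TRUE, FALSE)]
every_other
# [1] 1 3 5 7

# The four-element index is recycled twice: the 2nd of every 4 is dropped.
drop_second_of_four <- x[c(TRUE, FALSE, TRUE, TRUE)]
drop_second_of_four
# [1] 1 3 4 5 7 8
```

Applied to rep(3:50, each = 2), which starts 3 3 4 4 5 5 6 6, the same four-element index drops the duplicate of each odd value.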
Also a side note: Whenever you use R code like
x = c(x, something)
or
x = rbind(x, something)
or similar, you are adopting a C-like programming style in R. This makes your code unnecessarily complex and might lead to low performance and out-of-memory issues if you work with large (say, 200MB+) data sets. R is designed to spare you that kind of low-level tinkering with data structures.
For more information about the gluttons and their punishment, read The R Inferno, Circle 2: Growing Objects.
The easiest way I have found is to create a second vector containing only the even values (based on the OP's intention) and then simply join the two vectors and sort the result. For example:
v <- 3:50
sort(c(v,v[v %% 2 == 0]))
# [1] 3 4 4 5 6 6 7 8 8 9 10 10 11 12 12 13 14 14 15 16 16
# [22] 17 18 18 19 20 20 21 22 22 23 24 24 25 26 26 27 28 28
# [40] 29 30 30 31 32 32 33 34 34 35 36 36 37 38 38 39 40 40 41 42 42
# [61] 43 44 44 45 46 46 47 48 48 49 50 50
Here is a loop-free one-line solution:
> as.vector(t(cbind(seq(3,49,2),seq(4,50,2),seq(4,50,2))))
[1] 3 4 4 5 6 6 7 8 8 9 10 10 11 12 12 13 14 14 15 16 16 17
[23] 18 18 19 20 20 21 22 22 23 24 24 25 26 26 27 28 28 29 30 30 31 32
[45] 32 33 34 34 35 36 36 37 38 38 39 40 40 41 42 42 43 44 44 45 46 46
[67] 47 48 48 49 50 50
It forms a matrix whose first column is the odd numbers in the range 3:50 and whose second and third columns are the even numbers in that range and then (by taking the transpose) reads it off row by row.
The problem with your nested loop approach is that the fundamental pattern is one of length 3, repeated 24 times (instead of a pattern of length 2 repeated 48 times, once for each value in 3:50). If you wanted to use a nested loop, the outer loop could iterate 24 times and the inner loop 3 times. The first pass through the outer loop could construct 3,4,4. The second pass could construct 5,6,6. Etc. Since there are 24*3 = 72 elements, you can pre-allocate the vector (by using f <- vector("numeric",72) ) so that you aren't growing it 1 element at a time. The idiom f <- c(f,i) that you are using at each stage copies all of the old elements just to create a new vector which is only 1 element longer. Here there are too few elements for it to really make a difference, but if you try to create large vectors that way the performance can be shockingly bad.
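The nested loop described above could be sketched as follows. This is one possible way to write it, not the only one; the names n_groups, pattern and k are chosen just for this illustration.

```r
n_groups <- 24
f <- vector("numeric", n_groups * 3)  # pre-allocate all 72 slots
k <- 1                                # next position to fill in f

for (g in seq_len(n_groups)) {
  odd <- 2 * g + 1                    # 3, 5, ..., 49
  pattern <- c(odd, odd + 1, odd + 1) # e.g. 3 4 4
  for (j in 1:3) {
    f[k] <- pattern[j]                # assign in place, no copying of f
    k <- k + 1
  }
}

head(f, 9)
# [1] 3 4 4 5 6 6 7 8 8
```

The inner loop is kept only to mirror the loop-in-a-loop structure from the question; assigning f[(3 * g - 2):(3 * g)] <- pattern would fill each group in one step.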
Here is a method that combines portions of a couple of the other answers.
rep(seq(3, 50, 2), each=3) + c(0, 1, 1)
[1] 3 4 4 5 6 6 7 8 8 9 10 10 11 12 12 13 14 14 15 16
[21] 16 17 18 18 19 20 20 21 22 22 23 24 24 25 26 26 27 28 28 29
[41] 30 30 31 32 32 33 34 34 35 36 36 37 38 38 39 40 40 41 42 42
[61] 43 44 44 45 46 46 47 48 48 49 50 50
Here is a second method, using cumsum:
cumsum(rep(c(1, 1, 0), 24)) + 2L
This should be very quick.
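To see why the cumsum approach produces the pattern, it may help to look at the intermediate steps on three groups instead of 24. The name steps is used only for this sketch.

```r
# Increments: go up by 1, up by 1, then stay put (which repeats the even value).
steps <- rep(c(1, 1, 0), 3)
steps
# [1] 1 1 0 1 1 0 1 1 0

# The running total turns the increments into 1 2 2 3 4 4 ...
cumsum(steps)
# [1] 1 2 2 3 4 4 5 6 6

# Shifting by 2 makes the sequence start at 3.
cumsum(steps) + 2L
# [1] 3 4 4 5 6 6 7 8 8
```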
This should do too.
sort(c(3:50, seq(4, 50, 2)))
Another idea, though not competing in speed with fastest solutions:
mat <- matrix(3:50,nrow=2)
c(rbind(mat,mat[2,]))
# [1] 3 4 4 5 6 6 7 8 8 9 10 10 11 12 12 13 14 14 15 16 16 17 18 18 19 20 20 21 22 22
# [31] 23 24 24 25 26 26 27 28 28 29 30 30 31 32 32 33 34 34 35 36 36 37 38 38 39 40 40 41 42 42
# [61] 43 44 44 45 46 46 47 48 48 49 50 50
I have a data frame (df) that looks like this:
a b c
12 14 21
71 23 58
20 33 64
3 22 12
25 55 19
31 14 20
29 20 31
10 10 41
20 37 33
31 99 43
42 24 34
The values in this data frame follow no pattern.
list<-c(1,3,5)
My current code is
df$d<-NA
for (i in 1:length(list)){
for( j in 1:nrow(df)){
df$d[j]<- df$c[j]- df$b[j+i]
print(mean(df$d, na.rm=TRUE))
}
}
For each element i in "list", I loop over the rows, compute d as c minus the value of b that is i rows further down, and calculate mean(df$d). Then I repeat this for the next element and find mean(df$d) again.
Expected result:
when i=1
a b c d
12 14 21 -2 (=21-23)
71 23 58 25 (=58-33)
20 33 64 42
3 22 12 -43
25 55 19 5
31 14 20 0
29 20 31 21
10 10 41 4
20 37 33 -66
31 99 43 19
42 24 34 NA
Then, find the mean of column "d" with mean(df$d, na.rm=TRUE), which is 5/10 rows = 0.5. This mean is what I really need.
when i=3
a b c d
12 14 21 -1 (=21-22)
71 23 58 3 (=58-55)
20 33 64 50
3 22 12 -8
25 55 19 9
31 14 20 -17
29 20 31 -68
10 10 41 17
20 37 33 NA
31 99 43 NA
42 24 34 NA
Then, find the mean of column "d" with mean(df$d, na.rm=TRUE), which is -15/8 rows = -1.875. This mean is what I really need.
This code is very slow since it has two nested loops. The whole data set has more than 50K rows and the real list has more than 15 elements, so it takes forever. Would someone please help me with this? Thank you very much.
We can loop over each element in list using sapply. We use lead from dplyr to shift the values of b up by x rows, subtract them from the c column, and then calculate the mean of the result, removing the NA values.
library(dplyr)
sapply(list, function(x) mean(df$c - lead(df$b, x), na.rm = TRUE))
#[1] 0.500000 -1.875000 -1.666667
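If you would rather not depend on dplyr, the same shift can be done with plain indexing. This is a sketch under the assumption that every shift is smaller than nrow(df); lead_mean and shifts are hypothetical names introduced here, and df is rebuilt from the data in the question.

```r
df <- data.frame(
  a = c(12, 71, 20, 3, 25, 31, 29, 10, 20, 31, 42),
  b = c(14, 23, 33, 22, 55, 14, 20, 10, 37, 99, 24),
  c = c(21, 58, 64, 12, 19, 20, 31, 41, 33, 43, 34)
)
shifts <- c(1, 3, 5)

# Mean of c[j] - b[j + k] over all rows j where row j + k exists.
# Assumes 0 < k < nrow(df).
lead_mean <- function(df, k) {
  n <- nrow(df)
  mean(df$c[seq_len(n - k)] - df$b[seq(k + 1, n)])
}

res <- sapply(shifts, function(k) lead_mean(df, k))
res
# [1]  0.500000 -1.875000 -1.666667
```

Because only the rows that have a partner are indexed, no NA values are produced and na.rm is not needed.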
I am working with readHTMLTable and am having difficulties performing calculations on the columns: when I convert a column with as.numeric, its values are replaced by their ranks.
Can anyone help?
a=readHTMLTable("http://www.nhl.com/ice/standings.htm?season=20132014&type=LEA",which=3,trim=F)
> a[,5]
[1] 54 54 52 52 51 51 46 46 46 46 43 45 42 43 39 40 38 37 38 35 37 37 38 36 36 34 35 29 29 21
Levels: 21 29 34 35 36 37 38 39 40 42 43 45 46 51 52 54
> a[,5]=as.numeric(a[,5])
> a[,5]
[1] 16 16 15 15 14 14 13 13 13 13 11 12 10 11 8 9 7 6 7 4 6 6 7 5 5 3 4 2 2 1
I would like to be able to apply functions to the values of a[,5], not the ranks. For example, mean(a[,5]) should be (54+54+52...+21)/30, not
mean(a[,5])
[1] 8.933333
The problem is trying to convert a factor variable to numeric. See this post.
The canonical way to handle the problem would be as.numeric(levels(a[,5]))[a[,5]]
However, the method I often use is as.numeric(as.character(a[,5])) because it's easier to remember.
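A small illustration of the difference, using a made-up four-element factor rather than the NHL table (f, codes, via_levels and via_char are names invented for this sketch):

```r
f <- factor(c("54", "54", "52", "21"))

# as.numeric() on a factor returns the internal level codes, not the labels.
codes <- as.numeric(f)
codes
# [1] 3 3 2 1

# Both conversions below recover the original numbers.
via_levels <- as.numeric(levels(f))[f]
via_char <- as.numeric(as.character(f))
via_char
# [1] 54 54 52 21

mean(via_char)
# [1] 45.25
```

The levels-based form converts each distinct level only once, which can matter for long factors with few levels; for small inputs the two are interchangeable.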
I am building an App using shiny and openair to analyze wind data.
Right now the data needs to be “cleaned” before uploading by the user.
I am interested in doing this automatically.
Some of the data is empty and some of it is not numeric, so it is not possible to build a wind rose.
I want to:
1. Estimate how much of the data is not numeric
2. Cut it out and leave only numeric data
here is an example of the data:
The "NO2.mg" column is read as a factor and not as int because it does not consist only of numbers.
Here is a reproducible example:
no2<-factor(c(5,4,"c1",54,"c5",seq(2:50)))
no2
[1] 5 4 c1 54 c5 1 2 3 4 5 6 7 8 9 10 11 12 13 14
[20] 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33
[39] 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49
52 Levels: 1 10 11 12 13 14 15 16 17 18 19 2 20 21 22 ... c5
> as.numeric(no2)
[1] 45 34 51 46 52 1 12 23 34 45 47 48 49 50 2 3 4 5 6
[20] 7 8 9 10 11 13 14 15 16 17 18 19 20 21 22 24 25 26 27
[39] 28 29 30 31 32 33 35 36 37 38 39 40 41 42 43 44
Worst R haiku ever:
Some of the data is empty,
some of is not numeric,
so it is not possible to build a wind rose.
To convert a factor to numeric, you need to convert to character first:
no2<-factor(c(5,4,"c1",54,"c5",seq(2:50)))
no2_num <- as.numeric(as.character(no2))
#Warning message:
# NAs introduced by coercion
no2_clean <- na.omit(no2_num) #remove NAs resulting from the bad data
# [1] 5 4 54 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
# [40] 37 38 39 40 41 42 43 44 45 46 47 48 49
# attr(,"na.action")
# [1] 3 5
# attr(,"class")
# [1] "omit"
length(attr(no2_clean,"na.action"))/length(no2)*100
#[1] 3.703704
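For use inside an app, the steps above could be wrapped into one helper that returns both the cleaned values and the share of bad entries. This is only a sketch: clean_numeric is a hypothetical name, and it assumes that anything as.numeric cannot parse (including empty strings) should count as bad data. Note that seq(2:50) in the example evaluates to 1:49, which is what is written out here.

```r
# Hypothetical helper: convert a factor to numeric and report bad entries.
clean_numeric <- function(f) {
  # Coercion warnings are expected for non-numeric labels, so silence them.
  num <- suppressWarnings(as.numeric(as.character(f)))
  bad <- is.na(num) & !is.na(f)  # entries that were present but not numeric
  list(
    values  = num[!is.na(num)],  # numeric entries only
    pct_bad = 100 * mean(bad)    # percentage of unparseable entries
  )
}

no2 <- factor(c(5, 4, "c1", 54, "c5", 1:49))
res <- clean_numeric(no2)
length(res$values)
# [1] 52
res$pct_bad
# [1] 3.703704
```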
OK, this is how I did it. I am sure someone has a better way,
and I'd love it if you shared it with me.
this is my data:
no2<-factor(c(5,4,"c1",54,"c5",seq(2:50)))
To count the "bad data":
sum(is.na((as.numeric(as.vector(no2)))))
And to estimate the percentage of bad data:
sum(is.na((as.numeric(as.vector(no2)))))/length(no2)*100
Say I have vector:
x <- c(11,6,5,3,2,1,25,10,16,12,22,24,19,14,18,32,17,15,8,7,
33,4,27,9,29,13,30,23,20,31,26,21,28)
x
[1] 11 6 5 3 2 1 25 10 16 12 22 24 19 14 18 32 17 15 8 7 33 4 27 9 29 13 30 23 20
[30] 31 26 21 28
I want to identify which elements are not ascending. So, for example, elements 2 to 6 (values 6,5,3,2,1) are out of order because they are less than element 1 (11). Then element 7 (25) is in order because it's greater than 11, then all elements until element 16 (32) are out of order. I want to remove those elements.
Is there a vectorized/shortcut way of doing this?
Create some data:
set.seed(1)
x <- sample(100, 30)
x
[1] 27 37 57 89 20 86 97 62 58 6 19 16 61 34 67 43 88 83 32 63 75 17 51 10 21 29 1 28 81 25
Select only those elements that are greater than or equal to the cumulative maximum:
x[x >= cummax(x)]
[1] 27 37 57 89 97
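Applied to the vector from the question, the same idea might look like this (running_max and kept are names chosen for the illustration):

```r
x <- c(11, 6, 5, 3, 2, 1, 25, 10, 16, 12, 22, 24, 19, 14, 18, 32, 17, 15, 8, 7,
       33, 4, 27, 9, 29, 13, 30, 23, 20, 31, 26, 21, 28)

# cummax(x)[i] is the largest value seen in x[1:i].
running_max <- cummax(x)
head(running_max, 8)
# [1] 11 11 11 11 11 11 25 25

# An element is "in order" exactly when it equals the running maximum.
kept <- x[x >= running_max]
kept
# [1] 11 25 32 33
```

Using >= keeps ties with the current maximum; switch to x[c(TRUE, diff(cummax(x)) > 0)] if repeated maxima should be dropped.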