I generated a series of 10,000 random numbers through:
rand_x = rf(10000, 3, 5)
Now I want to produce another series that contains the variances at each point, i.e. the column should look like this:
[variance(first two numbers)]
[variance(first three numbers)]
[variance(first four numbers)]
[variance(first five numbers)]
.
.
.
.
[variance of 10,000 numbers]
I have written the code as:
c ( var(rand_x[1:1]) : var(rand_x[1:10000])
but I am only getting 157 elements in the column rather than 10,000. Can someone guide me on what I am doing wrong here?
An option is to loop over the indices from 2 to 10000 in sapply, extract the elements of 'rand_x' from position 1 to the looped index, apply var, and return a vector of variances:
out <- sapply(2:10000, function(i) var(rand_x[1:i]))
Your code creates a sequence incrementing by one, with the variance of the first two elements as the start value and the variance of the whole vector as the limit. That is why you got 157 elements rather than 10,000: with your data, the two endpoint variances happened to differ by a bit more than 156.
var(rand_x[1:2]):var(rand_x[1:n])
# [1] 0.9026262 1.9026262 2.9026262
## compare:
.9026262:3.33433
# [1] 0.9026262 1.9026262 2.9026262
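The length of such a sequence is floor(to - from) + 1, which you can check directly with the endpoint values above:
length(.9026262:3.33433)
# [1] 3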
What you want is to loop over the vector indices (using seq_along) and take the variance of a sequence that grows by one element each time. To see what needs to be done, I will first show you a (rather slow) for loop.
vars <- numeric() ## initialize numeric vector
for (i in seq_along(rand_x)) {
vars[i] <- var(rand_x[1:i])
}
vars
# [1] NA 0.9026262 1.4786540 1.2771584 1.7877717 1.6095619
# [7] 1.4483273 1.5653797 1.8121144 1.6192175 1.4821020 3.5005254
# [13] 3.3771453 3.1723564 2.9464537 2.7620001 2.7086317 2.5757641
# [19] 2.4330738 2.4073546 2.4242747 2.3149455 2.3192964 2.2544765
# [25] 3.1333738 3.0343781 3.0354998 2.9230927 2.8226541 2.7258979
# [31] 2.6775278 2.6651541 2.5995346 3.1333880 3.0487177 3.0392603
# [37] 3.0483917 4.0446074 4.0463367 4.0465158 3.9473870 3.8537925
# [43] 3.8461463 3.7848464 3.7505158 3.7048694 3.6953796 3.6605357
# [49] 3.6720684 3.6580296
The first element has to be NA because the variance of one element is not defined (division by zero).
However, the for loop is slow. We would rather use a function from the *apply family, e.g. vapply, which is faster. In vapply we initialize with numeric(1) (or just 0) because the result of each iteration is of length one.
vars <- vapply(seq_along(rand_x), function(i) var(rand_x[1:i]), numeric(1))
vars
# [1] NA 0.9026262 1.4786540 1.2771584 1.7877717 1.6095619
# [7] 1.4483273 1.5653797 1.8121144 1.6192175 1.4821020 3.5005254
# [13] 3.3771453 3.1723564 2.9464537 2.7620001 2.7086317 2.5757641
# [19] 2.4330738 2.4073546 2.4242747 2.3149455 2.3192964 2.2544765
# [25] 3.1333738 3.0343781 3.0354998 2.9230927 2.8226541 2.7258979
# [31] 2.6775278 2.6651541 2.5995346 3.1333880 3.0487177 3.0392603
# [37] 3.0483917 4.0446074 4.0463367 4.0465158 3.9473870 3.8537925
# [43] 3.8461463 3.7848464 3.7505158 3.7048694 3.6953796 3.6605357
# [49] 3.6720684 3.6580296
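For long vectors, recomputing var from scratch at every position does O(n^2) work in total. A fully vectorized alternative (a sketch, assuming the same rand_x as above) computes the running variances from cumulative sums via the one-pass identity var(x[1:i]) = (sum(x[1:i]^2) - sum(x[1:i])^2 / i) / (i - 1); note that this formula can be numerically less stable than var:
i <- seq_along(rand_x)
cs <- cumsum(rand_x)     ## running sums
cs2 <- cumsum(rand_x^2)  ## running sums of squares
vars2 <- (cs2 - cs^2 / i) / (i - 1)
vars2[1] <- NA  ## var of a single element is NA; the formula gives NaN (0/0)
all.equal(vars2[-1], vars[-1])  ## TRUE up to floating-point tolerance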
Data:
n <- 50
set.seed(42)
rand_x <- rf(n, 3, 5)
I have an igraph object, which I created with the igraph library. This object is a list. Some of the components of this list have a length of 2, and I would like to remove all of them.
IGRAPH clustering walktrap, groups: 114, mod: 0.79
+ groups:
$`1`
[1] "OTU0041" "OTU0016" "OTU0062"
[4] "OTU1362" "UniRef90_A0A075FHQ0" "UniRef90_A0A075FSE2"
[7] "UniRef90_A0A075FTT8" "UniRef90_A0A075FYU2" "UniRef90_A0A075G543"
[10] "UniRef90_A0A075G6B2" "UniRef90_A0A075GIL8" "UniRef90_A0A075GR85"
[13] "UniRef90_A0A075H910" "UniRef90_A0A075HTF5" "UniRef90_A0A075IFG0"
[16] "UniRef90_A0A0C1R539" "UniRef90_A0A0C1R6X4" "UniRef90_A0A0C1R985"
[19] "UniRef90_A0A0C1RCN7" "UniRef90_A0A0C1RE67" "UniRef90_A0A0C1RFI5"
[22] "UniRef90_A0A0C1RFN8" "UniRef90_A0A0C1RGE0" "UniRef90_A0A0C1RGX0"
[25] "UniRef90_A0A0C1RHM1" "UniRef90_A0A0C1RHR5" "UniRef90_A0A0C1RHZ4"
+ ... omitted several groups/vertices
For example, this one:
> a[[91]]
[1] "OTU0099" "UniRef90_UPI0005B28A7E"
I tried this, but it does not work:
a[lapply(a,length)>2]
Any help?
Since you didn't provide any reproducible data or example, I had to produce some dummy data:
# create dummy data
a <- list(x = 1, y = 1:4, z = 1:2)
# remove elements in list with lengths greater than 2:
a[which(lapply(a, length) > 2)] <- NULL
In case you wanted to remove the items with lengths exactly equal to 2 (the question is ambiguous), the last line should be replaced by:
a[which(lapply(a, length) == 2)] <- NULL
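As a side note, base R (3.2.0 and later) has lengths(), which returns the element lengths directly and avoids looping with lapply; a minimal sketch on the same dummy data:
a <- list(x = 1, y = 1:4, z = 1:2)
a <- a[lengths(a) != 2]  # keep only elements whose length is not 2
a
# $x
# [1] 1
#
# $y
# [1] 1 2 3 4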
unique() finds the unique values of a vector.
If I have a data frame:
test_data <- data.frame(
  x = c(rep(1.00050239485720394857, 4), 1.00050239485720394854,
        rep(2.0002230948570293845, 5), rep(3.0005903847502398475, 5)),
  y = c(rep(4.00423409872345, 5), rep(2.034532039485722, 5),
        rep(1.1234152304957, 5)))
sapply(test_data,unique)
R returns:
x y
[1,] 1.000502 4.004234
[2,] 2.000223 2.034532
[3,] 3.000590 1.123415
As expected.
But say I fit an lm() or aov() model and then try to find the unique fitted values:
set.seed(123)
y = rf(100,50,3,3)
x1 <- factor(c(rep("blue",25),
rep("green",25),
rep("orange",25),
rep("purple",25)))
bsFit <- aov(y ~ x1)
unique(bsFit$fitted.values)
R returns:
[1] 2.709076 2.709076 2.709076 2.709076 2.709076 2.709076
[7] 2.709076 4.060080 4.060080 4.060080 4.060080 3.314801
[13] 3.314801 3.314801 3.314801 1.960280 1.960280 1.960280
[19] 1.960280 1.960280
There are clearly duplicates here.
As others have said (@Tim-Biegeleisen especially), R is formatting the output to a limited number of significant digits (remember that anything printed to the console goes through R's default formatting, controlled by options(digits)). So the "duplicates", once formatted to show more decimal places, aren't duplicates at all.
We can use format to show more decimal places:
format(unique(bsFit$fitted.values), digits = 22)
[1] "2.7090760788376542" "2.7090760788376773" "2.7090760788376604" "2.7090760788376622" "2.7090760788376627"
[6] "2.7090760788376649" "2.7090760788376640" "4.0600797479202155" "4.0600797479202164" "4.0600797479202200"
[11] "4.0600797479202146" "3.3148005388803132" "3.3148005388803128" "3.3148005388803146" "3.3148005388803137"
[16] "1.9602804435309986" "1.9602804435309984" "1.9602804435309982" "1.9602804435309988" "1.9602804435310004
I experimented with the number of digits before an error was thrown; 22 is the largest value format accepts.
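If what you actually want is one fitted value per group, you can collapse the floating-point noise by rounding before calling unique. A minimal sketch; the choice of 8 decimal places is arbitrary, but well above the ~1e-13 differences seen here:
unique(round(bsFit$fitted.values, digits = 8))
# [1] 2.709076 4.060080 3.314801 1.960280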
I tried to find two values in the following vector which are close to 10. The expected values are 10.12099196 and 10.63054170. Your inputs would be appreciated.
[1] 0.98799517 1.09055728 1.20383713 1.32927166 1.46857509 1.62380423 1.79743107 1.99241551 2.21226576 2.46106916 2.74346924 3.06455219 3.42958354 3.84350238 4.31005838
[16] 4.83051356 5.40199462 6.01590035 6.65715769 7.30532785 7.93823621 8.53773241 9.09570538 9.61755743 10.12099196 10.63018180 11.16783243 11.74870531 12.37719092 13.04922392
[31] 13.75661322 14.49087793 15.24414627 16.00601247 16.75709565 17.46236358 18.06882072 18.51050094 18.71908344 18.63563523 18.22123225 17.46709279 16.40246292 15.09417699 13.63404124
[46] 12.11854915 10.63054170 9.22947285 7.95056000 6.80923943 5.80717982 4.93764782 4.18947450 3.54966795 3.00499094 2.54283599 2.15165780 1.82114213 1.54222565 1.30703661
[61] 1.10879707 0.94170986 0.80084308 0.68201911 0.58171175 0.49695298 0.42525021 0.36451350 0.31299262 0.26922281 0.23197860 0.20023468 0.17313291 0.14995459 0.13009730
[76] 0.11305559 0.09840485 0.08578789 0.07490387 0.06549894 0.05735864
Another alternative is to let the user control the "tolerance" in order to define what "closeness" means; this can be done with a simple function:
close <- function(x, value, tol = NULL) {
  if (!is.null(tol)) {
    x[abs(x - value) <= tol]   # values within tol of value
  } else {
    x[order(abs(x - value))]   # all values, ordered by closeness to value
  }
}
Here x is a vector of values, value is the reference value for closeness, and tol is a numeric tolerance: if it is NULL, the function returns all values ordered by their "closeness" to value; otherwise it returns just the values within tol of value.
> close(x, value=10, tol=.7)
[1] 9.617557 10.120992 10.630182 10.630542
> close(x, value=10)
[1] 10.12099196 9.61755743 10.63018180 10.63054170 9.22947285 9.09570538 11.16783243
[8] 8.53773241 11.74870531 7.95056000 7.93823621 12.11854915 12.37719092 7.30532785
[15] 13.04922392 6.80923943 6.65715769 13.63404124 13.75661322 6.01590035 5.80717982
[22] 14.49087793 5.40199462 4.93764782 15.09417699 4.83051356 15.24414627 4.31005838
[29] 4.18947450 16.00601247 3.84350238 16.40246292 3.54966795 3.42958354 16.75709565
[36] 3.06455219 3.00499094 2.74346924 2.54283599 17.46236358 17.46709279 2.46106916
[43] 2.21226576 2.15165780 1.99241551 18.06882072 1.82114213 1.79743107 18.22123225
[50] 1.62380423 1.54222565 18.51050094 1.46857509 18.63563523 1.32927166 1.30703661
[57] 18.71908344 1.20383713 1.10879707 1.09055728 0.98799517 0.94170986 0.80084308
[64] 0.68201911 0.58171175 0.49695298 0.42525021 0.36451350 0.31299262 0.26922281
[71] 0.23197860 0.20023468 0.17313291 0.14995459 0.13009730 0.11305559 0.09840485
[78] 0.08578789 0.07490387 0.06549894 0.05735864
In the first example I defined "closeness" as a difference of at most 0.7 between value and each element of x. In the second example, close returns a vector in which the first values are the closest to value and the last values are the farthest from it.
Since my solution does not provide an easy (practical) way to choose tol, as @Arun pointed out, one way to find the closest values is to set tol=NULL and ask for the exact number of close values, as in:
> close(x, value=10)[1:3]
[1] 10.120992 9.617557 10.630182
This shows the three values in x closest to 10.
I can't think of a way without using sort. However, you can speed it up by using partial sort.
x[abs(x-10) %in% sort(abs(x-10), partial=1:2)[1:2]]
# [1] 9.617557 10.120992
If the same value is present more than once, you'll get all occurrences here. So you can either wrap this in unique, or use match instead, as follows:
x[match(sort(abs(x-10), partial=1:2)[1:2], abs(x-10))]
# [1] 10.120992 9.617557
dput output:
dput(x)
c(0.98799517, 1.09055728, 1.20383713, 1.32927166, 1.46857509,
1.62380423, 1.79743107, 1.99241551, 2.21226576, 2.46106916, 2.74346924,
3.06455219, 3.42958354, 3.84350238, 4.31005838, 4.83051356, 5.40199462,
6.01590035, 6.65715769, 7.30532785, 7.93823621, 8.53773241, 9.09570538,
9.61755743, 10.12099196, 10.6301818, 11.16783243, 11.74870531,
12.37719092, 13.04922392, 13.75661322, 14.49087793, 15.24414627,
16.00601247, 16.75709565, 17.46236358, 18.06882072, 18.51050094,
18.71908344, 18.63563523, 18.22123225, 17.46709279, 16.40246292,
15.09417699, 13.63404124, 12.11854915, 10.6305417, 9.22947285,
7.95056, 6.80923943, 5.80717982, 4.93764782, 4.1894745, 3.54966795,
3.00499094, 2.54283599, 2.1516578, 1.82114213, 1.54222565, 1.30703661,
1.10879707, 0.94170986, 0.80084308, 0.68201911, 0.58171175, 0.49695298,
0.42525021, 0.3645135, 0.31299262, 0.26922281, 0.2319786, 0.20023468,
0.17313291, 0.14995459, 0.1300973, 0.11305559, 0.09840485, 0.08578789,
0.07490387, 0.06549894, 0.05735864)
I'm not sure your question is clear, so here's another approach. To find the value closest to your first desired value, 10.12099196, subtract it from the vector, take the absolute value, and then find the index of the smallest element. Explicitly:
delx <- abs( 10.12099196 - x)
min.index <- which.min(delx) #returns index of first minimum if there are duplicates
x[min.index] #gets you the value itself
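To generalize this to the k closest values (the partial-sort answer above does this for k = 2), order by the absolute difference; a short sketch:
k <- 2
x[order(abs(x - 10))[1:k]]  # the k values of x nearest to 10
# [1] 10.120992  9.617557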
Apologies if this was not the intent of your question.
I have a data frame, deflator.
I want to get a new data frame inflation which can be calculated by:
inflation[i] = (deflator[i] - deflator[i-4]) / deflator[i-4] * 100
The data frame deflator has 71 numbers:
> deflator
[1] 0.9628929 0.9596746 0.9747274 0.9832532 0.9851884
[6] 0.9797770 0.9913502 1.0100561 1.0176906 1.0092516
[11] 1.0185932 1.0241043 1.0197975 1.0174097 1.0297328
[16] 1.0297071 1.0313232 1.0244618 1.0347808 1.0480411
[21] 1.0322142 1.0351968 1.0403264 1.0447121 1.0504402
[26] 1.0487097 1.0664664 1.0935239 1.0965951 1.1141851
[31] 1.1033155 1.1234482 1.1333870 1.1188136 1.1336276
[36] 1.1096461 1.1226584 1.1287245 1.1529588 1.1582911
[41] 1.1691221 1.1782178 1.1946234 1.1963453 1.1939922
[46] 1.2118189 1.2227960 1.2140535 1.2228828 1.2314258
[51] 1.2570788 1.2572214 1.2607763 1.2744415 1.2982076
[56] 1.3318808 1.3394186 1.3525902 1.3352815 1.3492751
[61] 1.3593859 1.3368135 1.3642940 1.3538567 1.3658135
[66] 1.3710932 1.3888638 1.4262185 1.4309707 1.4328823
[71] 1.4497201
This is a very tricky question for me.
I tried to do this using a for loop:
> d <- data.frame(deflator)
> for (i in 1:71) {d <-rbind(d,c(delfaotr ))}
I think I might be doing it wrong.
Why use data frames? This is a straightforward vector operation.
inflation = 100 * (deflator[-(1:4)] - deflator[1:67]) / deflator[1:67]
I agree with @Fhnuzoag that your example suggests calculations on a numeric vector, not a data frame. Here's an additional way to do your calculations, taking advantage of the lag argument in the diff function (with indexes that match those in your question):
lagBy <- 4 # The number of indexes by which to lag
laggedDiff <- diff(deflator, lag = lagBy) # The numerator above
theDenom <- deflator[seq_len(length(deflator) - lagBy)] # The denominator above
inflation <- laggedDiff / theDenom * 100  # multiply by 100, per the formula above
The first few results are:
head(inflation)
# [1] 2.315470 2.094710 1.705379 2.725941 3.299085 3.008297
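If you do want the result in a data frame, as the question asks, pad the first four entries with NA so the column lengths line up. A minimal sketch:
n <- length(deflator)
inflation <- c(rep(NA, 4),
               100 * (deflator[5:n] - deflator[1:(n - 4)]) / deflator[1:(n - 4)])
result <- data.frame(deflator, inflation)
## rows 1-4 have NA inflation; row 5 is
## (0.9851884 - 0.9628929) / 0.9628929 * 100, about 2.3155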