I have a vector whose elements I would like to repeat according to the counts in another vector.
How can I repeat the elements of my vector without having to type them out manually?
I want to use these two vectors:
```
NumberofTimes<-c(4,1,2,3)
Spread<-c(0.060,0.170,0.140,0.070)
```
That is, I want a vector with 4 copies of 0.060, 1 of 0.170, 2 of 0.140, and so on,
instead of writing:
```
Spread<-c(0.060,0.060,0.060,0.060,0.170,0.140,0.140,0.070,0.070,0.070)
```
Use the base R rep() function with its times argument:
> rep(Spread, times = NumberofTimes)
[1] 0.06 0.06 0.06 0.06 0.17 0.14 0.14 0.07 0.07 0.07
Try mapply() with a function that calls rep() on each pair of elements from Spread and NumberofTimes. Here is the code:
#Data
NumberofTimes<-c(4,1,2,3)
Spread<-c(0.060,0.170,0.140,0.070)
#Apply
vecval <- unlist(mapply(function(x,y) rep(x,y),x=Spread,y=NumberofTimes))
Output:
[1] 0.06 0.06 0.06 0.06 0.17 0.14 0.14 0.07 0.07 0.07
Another option uses Map():
unlist(Map(rep, Spread, NumberofTimes))
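As a quick check, the three approaches above can be compared on the question's data. This is only a minimal sketch; the names by_rep, by_mapply and by_map are made up for illustration.

```r
NumberofTimes <- c(4, 1, 2, 3)
Spread <- c(0.060, 0.170, 0.140, 0.070)

# Vectorised: rep() pairs each element of Spread with its own count
by_rep <- rep(Spread, times = NumberofTimes)

# Element-wise alternatives; both build a list that unlist() flattens
by_mapply <- unlist(mapply(function(x, y) rep(x, y), x = Spread, y = NumberofTimes))
by_map <- unlist(Map(rep, Spread, NumberofTimes))

# All three should give the same 10-element vector
stopifnot(isTRUE(all.equal(by_rep, by_mapply)),
          isTRUE(all.equal(by_rep, by_map)))
print(by_rep)
```

Since rep() is already vectorised over times, it is usually the simplest of the three.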
Imagine there are 4 cards on the desk, arranged in several rows (e.g., 5 rows in the demo). The value of each card is listed in the demo data frame, while the position of each card is given by the pos columns; see the demo data generated below.
To restore each card's value to its original position, I reorder the cards within each row using [] indexing. The following code already does this. To avoid an explicit loop, I wonder whether I can achieve the same effect with a vectorized function from the tidyverse family, e.g. pmap or a related function from the purrr package?
# 1. data generation ------------------------------------------------------
rm(list=ls())
vect<-matrix(round(runif(20),2),nrow=5)
colnames(vect)<-paste0('card',1:4)
order<-rbind(c(2,3,4,1),c(3,4,1,2),c(1,2,3,4),c(4,3,2,1),c(3,4,2,1))
colnames(order)=paste0('pos',1:4)
dat<-data.frame(vect,order,stringsAsFactors = F)
# 2. data swap ------------------------------------------------------------
for (i in 1:dim(dat)[1]){
  orders=dat[i,paste0('pos',1:4)]
  card=dat[i,paste0('card',1:4)]
  vec<-card[order(unlist(orders))]
  names(vec)=paste0('deck',1:4)
  dat[i,paste0('deck',1:4)]<-vec
}
dat
You could use pmap_dfr:
card_cols <- grep('card', names(dat))
pos_cols <- grep('pos', names(dat))
dat[paste0('deck', seq_along(card_cols))] <- purrr::pmap_dfr(dat, ~{
  x <- c(...)
  as.data.frame(t(unname(x[card_cols][order(x[pos_cols])])))
})
dat
# card1 card2 card3 card4 pos1 pos2 pos3 pos4 deck1 deck2 deck3 deck4
#1 0.05 0.07 0.16 0.86 2 3 4 1 0.86 0.05 0.07 0.16
#2 0.20 0.98 0.79 0.72 3 4 1 2 0.79 0.72 0.20 0.98
#3 0.50 0.79 0.72 0.10 1 2 3 4 0.50 0.79 0.72 0.10
#4 0.03 0.98 0.48 0.06 4 3 2 1 0.06 0.48 0.98 0.03
#5 0.41 0.72 0.91 0.84 3 4 2 1 0.84 0.91 0.41 0.72
One thing to note: make sure the output from the pmap function does not carry the original column names. If it did, the columns would be reshuffled according to those names and the output would not be in the correct order. I use unname here to remove the names.
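For comparison, the same reordering can be done without purrr by indexing a matrix with (row, column) pairs. This is a base R alternative, not the pmap approach above; the names cards, pos, ord and deck, and the call to set.seed(), are assumptions added for illustration.

```r
# Assumption: a fixed seed so the sketch is reproducible; the question used none
set.seed(1)
cards <- matrix(round(runif(20), 2), nrow = 5)
pos <- rbind(c(2, 3, 4, 1), c(3, 4, 1, 2), c(1, 2, 3, 4),
             c(4, 3, 2, 1), c(3, 4, 2, 1))

# For every row, the column order that sorts its positions
ord <- t(apply(pos, 1, order))

# Pair each row number with the column to pick, then index the matrix once
deck <- matrix(cards[cbind(as.vector(row(ord)), as.vector(ord))],
               nrow = nrow(cards))

# Reference result computed row by row, as in the question's loop
expected <- t(sapply(seq_len(nrow(cards)),
                     function(i) cards[i, order(pos[i, ])]))
stopifnot(isTRUE(all.equal(deck, expected)))
```

Working on plain matrices keeps everything numeric and avoids building one small data frame per row.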
I have a df which is 67200 obs long, with 5 vars. I would like to create a list of subsequences from one var, each of equal length (600 obs). I would like the sequence to be iterative so that I can identify rolling sequences i.e. seq1 = 0:600, seq2 = 1:601, seq3 = 2:602, and so on. I will then sum the data from each subsequence to identify the sequence with the highest total.
I understand how to make a basic sequence using seq, however after reading around SO and other sites, I can only find info on how to identify specific sequences. Any help with ideas on ways to create said subsequences would be great.
Sample Data:
Var1 Var2 Var3 Var4 Var5
0.00 0.31 0.32 0.00 0.01
0.10 0.46 0.46 0.13 0.01
0.20 0.46 0.47 0.14 0.02
0.30 0.40 0.21 0.14 0.02
0.40 0.38 0.11 0.20 0.03
0.50 0.38 0.07 0.25 0.04
Expected Output:
A list containing each subsequence
To answer your question, I think you can achieve your expected output with lapply and seq:
x <- 600
n <- 0:(nrow(df) - 600)
lapply(n, function(i) seq(i, i+x))
However, from the description it seems you are trying to perform a rolling calculation, and the above is not the best approach for that. Look into the zoo library: it has functions like rollsum and rollmean, and a general rollapply, which offer a better way to do this.
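If installing a package is not an option, a rolling sum can also be sketched in base R with cumsum(). This is not what zoo does internally, just one way to get the window totals; roll_sum is a hypothetical helper and the toy vector stands in for one column of df. Note that R indexes from 1, so a window of exactly 600 rows starting at row i is i:(i + 599).

```r
# Hypothetical helper: sum of every window of `width` consecutive values
roll_sum <- function(v, width) {
  cs <- c(0, cumsum(v))  # cs[k + 1] holds the sum of the first k values
  cs[(width + 1):length(cs)] - cs[seq_len(length(cs) - width)]
}

v <- c(1, 2, 3, 4, 5)    # toy data standing in for one column of df
width <- 3               # would be 600 for the data in the question

sums <- roll_sum(v, width)
start <- which.max(sums)            # first row of the window with the largest total
rows <- start:(start + width - 1)   # the rows that make up that window
print(sums)
print(rows)
```

On long floating-point vectors the cumulative sums can pick up small rounding errors, so the totals may differ from a direct sum in the last digits.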
I have tried to sample values from two related columns (diversification rates of several sister groups), but I have no idea how to do it. I am trying the sample function, but it is too limited: I cannot specify any further condition.
df<-data.frame("M"=c(0.06,0.14,0.05,0.07), "H"=c(0.06,0.08,0.04,0.05))
df
# M H
# 1 0.06 0.06
# 2 0.14 0.08
# 3 0.05 0.04
# 4 0.07 0.05
sample(df,size=1000,replace=TRUE)
When I use this command, it resamples rows and columns:
H M M.1 M.2 M.3
1 0.06 0.06 0.06 0.06 0.06
2 0.08 0.14 0.14 0.14 0.14
3 0.04 0.05 0.05 0.05 0.05
4 0.05 0.07 0.07 0.07 0.07
...
But I want it to only sample one value from each row, and go to the next row with the same condition until the end of the rows. Finally, when there are no more rows, it should start all over again up to size=1000 so I can have a vector of length 1000.
Example of what I want (r = row, c = column): 0.06(r1c1), 0.14(r2c1), 0.05(r3c1), 0.05(r4c2), 0.06(r1c2), 0.14(r2c1), 0.04(r3c2), 0.07(r4c1) and so on.
Thank you in advance for your help!
EDITED:
I think that what I am looking for is something like a loop function, but I still do not know how to do it.
You should first create an indexing matrix of two columns (row index and column index), then index the original matrix with it.
idx <- matrix(c(rep(1:4,250), sample(1:2, 1000, replace=T)), ncol=2)
res <- as.matrix(df)[idx]
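The same idea can be written with the two index columns named explicitly. A minimal sketch, assuming the df from the question; set.seed() and the names n, row_idx and col_idx are added here for illustration only.

```r
df <- data.frame(M = c(0.06, 0.14, 0.05, 0.07),
                 H = c(0.06, 0.08, 0.04, 0.05))

set.seed(42)  # assumption: fixed seed so the sketch is reproducible
n <- 1000

# Rows cycle 1, 2, 3, 4, 1, 2, ... until n values have been drawn
row_idx <- rep_len(seq_len(nrow(df)), n)
# One random column per draw
col_idx <- sample(ncol(df), n, replace = TRUE)

# A two-column matrix used as an index picks one cell per (row, column) pair
res <- as.matrix(df)[cbind(row_idx, col_idx)]
head(res, 8)
```

Using rep_len() rather than rep(1:4, 250) also works when n is not a multiple of the number of rows.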
With your specifications, you'll need to use a custom function.
Here's one small way to do it:
myfunc <- function(dataframe, nsamples = 1000){
  # cycle through the rows in order: 1, 2, ..., nrow, 1, 2, ...
  rows <- ((0:(nsamples - 1)) %% nrow(dataframe)) + 1
  # pick a random column for each sample
  cols <- sample(ncol(dataframe), nsamples, replace = TRUE)
  # sapply returns the sampled values as a vector
  sapply(1:nsamples, function(x) dataframe[rows[x], cols[x]])
}
myfunc(df,10)
[1] 0.08 0.05 0.07 0.06 0.08 0.05 0.05 0.06 0.08 0.05
I am iterating through a list which contains 4 lists. Below is the output that I get. I am wondering why the precision is displayed this way; for example, why is the first value of the fourth list not just 1.00, as it is in the other cases?
[[1]]
[1] 1.00 0.96 0.84 0.74 0.66 0.56 0.48 0.36 0.26 0.16 0.06 0.00
[[2]]
[1] 1.00 0.98 0.84 0.74 0.66 0.56 0.48 0.38 0.26 0.16 0.06 0.00
[[3]]
[1] 1.00 0.94 0.84 0.74 0.66 0.56 0.48 0.36 0.26 0.16 0.06 0.00
[[4]]
[1] 1.000000e+00 9.400000e-01 8.400000e-01 7.400000e-01 6.600000e-01 5.800000e-01 4.600000e-01 3.600000e-01 2.600000e-01 1.600000e-01 6.000000e-02 1.110223e-16
As I commented when you first posted it as a follow-up comment on your previous question, this is more of a display issue. The last number is effectively zero:
R> identical(0, 1.1e-16)
[1] FALSE
R> all.equal(0, 1.1e-16)
[1] TRUE
R>
While its binary representation is not zero, it evaluates to something close enough under most circumstances. So you could run a filter over your data and replace 'near-zeros' with zero, or you could debug the code and see how/why it comes out as non-zero.
Also see the R FAQ and general references on issues related to floating-point computations.
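The "filter" mentioned above could look like the following. It is only a sketch: the cutoff tol is an assumption (it matches the default tolerance of all.equal(), about 1.5e-8), and the right threshold depends on the scale of your data.

```r
# Values taken from the fourth list in the question
x <- c(1, 0.94, 0.06, 1.110223e-16)

# Assumption: anything smaller in magnitude than sqrt(machine epsilon) counts as zero
tol <- sqrt(.Machine$double.eps)

# replace() returns a copy with the near-zero entries set to exactly 0
cleaned <- replace(x, abs(x) < tol, 0)
print(cleaned)
```

Once the tiny value is gone, the vector prints in fixed notation like the other three.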
If you want floating point numbers displayed rounded to the second decimal digit then use:
lapply( mylist, round, digits=2)
This approach has the advantage that it returns numeric values, which a format() call would not. It can also be used with a larger digits value, in which case it acts as an effective "zero-filter":
lapply(list(c(1,2), c(1.000000e+00, 9.400000e-01, 6.000000e-02, 1.110223e-16 )), round,
digits=13)
[[1]]
[1] 1 2
[[2]]
[1] 1.00 0.94 0.06 0.00
I am not sure of the exact algorithm R uses to choose the format. It is clear that a single format is used for all values in each list. It is also clear that the last list contains values of vastly different orders of magnitude: 1.000000e+00 and 1.110223e-16. I therefore think it's reasonable that R chooses to print the last list using scientific notation.
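The "single format per vector" behaviour can be seen with format(), which builds the same kind of strings that print() shows. A small sketch using three of the values from the question:

```r
x <- c(1, 0.94, 1.110223e-16)

with_tiny <- format(x)                       # one common format: scientific for all
without_tiny <- format(x[1:2])               # tiny value dropped: fixed notation
forced_fixed <- format(x, scientific = FALSE) # explicitly ask for fixed notation
rounded <- format(round(x, 2))               # or round first, as suggested above

print(with_tiny)
print(without_tiny)
print(forced_fixed)
print(rounded)
```

Setting options(scipen = ...) to a positive number is another way to bias printing towards fixed notation for a whole session.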
I always transpose by using the t(file) command in R.
But it is not running properly (not running at all) on a big data file (250,000 rows and 200 columns). Any ideas?
I need to calculate the correlation between the 2nd row (PTBP1) and all other rows (except 8 rows including the header). In order to do this I transpose rows to columns and then use the cor function.
But I am stuck at the transpose step. Any help would be really appreciated!
I copied an example from one of the posts on Stack Overflow (they are discussing almost the same problem, but there seems to be no answer yet!)
ID A B C D E F G H I [200 columns]
Row0$-1 0.08 0.47 0.94 0.33 0.08 0.93 0.72 0.51 0.55
Row02$1 0.37 0.87 0.72 0.96 0.20 0.55 0.35 0.73 0.44
Row03$ 0.19 0.71 0.52 0.73 0.03 0.18 0.13 0.13 0.30
Row04$- 0.08 0.77 0.89 0.12 0.39 0.18 0.74 0.61 0.57
Row05$- 0.09 0.60 0.73 0.65 0.43 0.21 0.27 0.52 0.60
Row06-$ 0.60 0.54 0.70 0.56 0.49 0.94 0.23 0.80 0.63
Row07$- 0.02 0.33 0.05 0.90 0.48 0.47 0.51 0.36 0.26
Row08$_ 0.34 0.96 0.37 0.06 0.20 0.14 0.84 0.28 0.47
........
250,000 rows
Use a matrix instead. The only advantage of a dataframe over a matrix is the capacity to have different classes in the columns and you clearly do not have that situation, since a transposed dataframe could not support such a result.
I don't get why you want to transpose the data.frame. If you just use cor it doesn't matter if your data is in rows or columns.
Actually, it is one of the major advantages of R that it doesn't matter whether your data fits the classical row-column pattern that SPSS and other programs require.
There are numerous ways to correlate the first row with all other rows (I don't get which rows you want to exclude). One is using a loop (here the loop is implicit in the call to one of the *apply family functions):
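lapply(2:nrow(fn), function(x) cor(unlist(fn[1, ]), unlist(fn[x, ])))
Note that I expect your data.frame to be called fn and to hold only the numeric columns; unlist() turns each one-row data.frame into a plain numeric vector that cor() can work with. To skip some rows, change the 2 to the row number you want to start from. Furthermore, I would probably use vapply here.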
I hope this answer points you in the correct direction and that is to not use t() if you absolutely don't need it.
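Following the matrix suggestion, the correlations of one row with all the others can also be computed in a single pass with matrix arithmetic, with no call to t() and no loop over 250,000 rows. This is a sketch under stated assumptions: row_cor is a hypothetical helper, m stands for the numeric part of the data as a matrix, and the toy data below replaces the real file. It needs room for a couple of extra copies of the matrix in memory.

```r
# Hypothetical helper: Pearson correlation of row `ref` with every other row of m
row_cor <- function(m, ref) {
  centered <- m - rowMeans(m)            # subtract each row's mean from that row
  target <- centered[ref, ]
  num <- as.vector(centered %*% target)  # cross-products with the reference row
  den <- sqrt(rowSums(centered^2) * sum(target^2))
  (num / den)[-ref]                      # drop the row's correlation with itself
}

set.seed(1)                              # assumption: toy data, not the real file
m <- matrix(runif(8 * 9), nrow = 8)      # 8 rows, 9 numeric columns
res <- row_cor(m, ref = 2)

# Same numbers as the row-by-row approach
check <- vapply(c(1, 3:8), function(i) cor(m[2, ], m[i, ]), numeric(1))
stopifnot(isTRUE(all.equal(res, check)))
```

Rows to be excluded can be removed from m before the call, or dropped from the result afterwards.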