Sample variables from two columns in R

I am trying to sample values from two related columns (diversification rates of several sister groups), but I have no idea how to do it. I tried the sample function, but it is too limited: I cannot add any further conditions.
df<-data.frame("M"=c(0.06,0.14,0.05,0.07), "H"=c(0.06,0.08,0.04,0.05))
df
# M H
# 1 0.06 0.06
# 2 0.14 0.08
# 3 0.05 0.04
# 4 0.07 0.05
sample(df,size=1000,replace=TRUE)
When I use this command, it resamples entire columns rather than individual values:
H M M.1 M.2 M.3
1 0.06 0.06 0.06 0.06 0.06
2 0.08 0.14 0.14 0.14 0.14
3 0.04 0.05 0.05 0.05 0.05
4 0.05 0.07 0.07 0.07 0.07
...
But I want it to only sample one value from each row, and go to the next row with the same condition until the end of the rows. Finally, when there are no more rows, it should start all over again up to size=1000 so I can have a vector of length 1000.
Example of what I want (r = row, c = column): 0.06 (r1c1), 0.14 (r2c1), 0.05 (r3c1), 0.05 (r4c2), 0.06 (r1c2), 0.14 (r2c1), 0.04 (r3c2), 0.07 (r4c1) and so on.
Thank you in advance for your help!
EDITED:
I think that what I am looking for is something like a loop function, but I still do not know how to do it.

You should first create an indexing matrix of two columns (row index and column index), then index the original matrix with it.
idx <- matrix(c(rep(1:4, 250), sample(1:2, 1000, replace = TRUE)), ncol = 2)
res <- as.matrix(df)[idx]
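The same matrix-indexing idea can be written without hard-coding the four rows. This is only a minimal sketch, assuming the goal is one value per draw while cycling through the rows in order; the helper name `sample_cycling` and the `set.seed()` call are illustrative additions, not part of the answer above.

```r
# Hypothetical helper: cycle through the rows in order and pick a random
# column at each step, returning a vector of length n.
sample_cycling <- function(data, n = 1000) {
  m <- as.matrix(data)
  rows <- rep_len(seq_len(nrow(m)), n)        # 1, 2, ..., nrow, 1, 2, ...
  cols <- sample(ncol(m), n, replace = TRUE)  # one random column per draw
  m[cbind(rows, cols)]                        # index by (row, column) pairs
}

df <- data.frame(M = c(0.06, 0.14, 0.05, 0.07),
                 H = c(0.06, 0.08, 0.04, 0.05))

set.seed(1)  # assumption: only here to make the illustration reproducible
res <- sample_cycling(df, 1000)
length(res)
```

Because `rep_len()` recycles the row numbers, the function should also work when `n` is not a multiple of the number of rows.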

With your specifications, you'll need to use a custom function.
Here's one small way to do it:
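myfunc <- function(dataframe, nsamples = 1000){
  # cycle through the rows 1, 2, ..., nrow, 1, 2, ... (%% is the modulo operator)
  rows <- ((seq_len(nsamples) - 1) %% nrow(dataframe)) + 1
  # draw a random column index for each sample
  cols <- sample(ncol(dataframe), nsamples, replace = TRUE)
  # pick one value per (row, column) pair and return them as a vector
  sapply(seq_len(nsamples), function(x) dataframe[rows[x], cols[x]])
}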
myfunc(df,10)
[1] 0.08 0.05 0.07 0.06 0.08 0.05 0.05 0.06 0.08 0.05

Related

How to use the `map` family of functions in the **purrr** package to swap columns across rows in a data frame?

Imagine there are 4 cards on the desk and there are several rows of them (e.g., 5 rows in the demo). The value of each card is already listed in the demo data frame. However, the exact position of the card is indexed by the pos columns, see the demo data I generated below.
To achieve this, I reorder the cards row by row with [] indexing so that each card's value goes back to its original position. The following code already does this. To avoid an explicit loop, I wonder whether I can get the same result with a vectorized function from the tidyverse family, e.g. pmap or a related function from the purrr package?
# 1. data generation ------------------------------------------------------
rm(list=ls())
vect<-matrix(round(runif(20),2),nrow=5)
colnames(vect)<-paste0('card',1:4)
order<-rbind(c(2,3,4,1),c(3,4,1,2),c(1,2,3,4),c(4,3,2,1),c(3,4,2,1))
colnames(order)=paste0('pos',1:4)
dat<-data.frame(vect,order,stringsAsFactors = F)
# 2. data swap ------------------------------------------------------------
for (i in 1:dim(dat)[1]){
orders=dat[i,paste0('pos',1:4)]
card=dat[i,paste0('card',1:4)]
vec<-card[order(unlist(orders))]
names(vec)=paste0('deck',1:4)
dat[i,paste0('deck',1:4)]<-vec
}
dat
You could use pmap_dfr :
card_cols <- grep('card', names(dat))
pos_cols <- grep('pos', names(dat))
dat[paste0('deck', seq_along(card_cols))] <- purrr::pmap_dfr(dat, ~{
x <- c(...)
as.data.frame(t(unname(x[card_cols][order(x[pos_cols])])))
})
dat
# card1 card2 card3 card4 pos1 pos2 pos3 pos4 deck1 deck2 deck3 deck4
#1 0.05 0.07 0.16 0.86 2 3 4 1 0.86 0.05 0.07 0.16
#2 0.20 0.98 0.79 0.72 3 4 1 2 0.79 0.72 0.20 0.98
#3 0.50 0.79 0.72 0.10 1 2 3 4 0.50 0.79 0.72 0.10
#4 0.03 0.98 0.48 0.06 4 3 2 1 0.06 0.48 0.98 0.03
#5 0.41 0.72 0.91 0.84 3 4 2 1 0.84 0.91 0.41 0.72
One thing to note here: make sure the output of the function passed to pmap_dfr does not keep the original column names. If it does, the rows are bound by name, the columns get reshuffled accordingly, and the output is no longer in the correct order. I use unname here to remove the names.
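If a loop-free base R version is acceptable, the same swap can be done with matrix indexing instead of purrr. This is a sketch under one assumption that is not stated in the question: every row of the position columns is a permutation of 1 to 4. The names `cards`, `pos`, and `deck` are illustrative, and the values are copied from the printed output above.

```r
# Card values and positions copied from the printed example (5 rows).
cards <- rbind(c(0.05, 0.07, 0.16, 0.86),
               c(0.20, 0.98, 0.79, 0.72),
               c(0.50, 0.79, 0.72, 0.10),
               c(0.03, 0.98, 0.48, 0.06),
               c(0.41, 0.72, 0.91, 0.84))
pos <- rbind(c(2, 3, 4, 1), c(3, 4, 1, 2), c(1, 2, 3, 4),
             c(4, 3, 2, 1), c(3, 4, 2, 1))

# Assumption: each row of `pos` is a permutation of 1..ncol(cards).
deck <- matrix(NA_real_, nrow(cards), ncol(cards),
               dimnames = list(NULL, paste0("deck", seq_len(ncol(cards)))))

# The card in column j of row i is written to column pos[i, j] of row i.
deck[cbind(as.vector(row(pos)), as.vector(pos))] <- as.vector(cards)
deck
```

Writing each card to its target column is the inverse view of `card[order(pos)]`, so for permutations the two should give the same deck columns.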

Generate subsequences in R

I have a df which is 67200 obs long, with 5 vars. I would like to create a list of subsequences from one var, each of equal length (600 obs). I would like the sequence to be iterative so that I can identify rolling sequences i.e. seq1 = 0:600, seq2 = 1:601, seq3 = 2:602, and so on. I will then sum the data from each subsequence to identify the sequence with the highest total.
I understand how to make a basic sequence using seq, however after reading around SO and other sites, I can only find info on how to identify specific sequences. Any help with ideas on ways to create said subsequences would be great.
Sample Data:
Var1 Var2 Var3 Var4 Var5
0.00 0.31 0.32 0.00 0.01
0.10 0.46 0.46 0.13 0.01
0.20 0.46 0.47 0.14 0.02
0.30 0.40 0.21 0.14 0.02
0.40 0.38 0.11 0.20 0.03
0.50 0.38 0.07 0.25 0.04
Expected Output:
List of x subsequences
To answer your question, I think you can achieve your expected output with lapply and seq:
x <- 600
n <- 0:(nrow(df) - x)
# each element is i:(i + x), mirroring the 0:600, 1:601, ... pattern in the question;
# for row indices of exactly x observations, start n at 1 and use seq(i, i + x - 1)
lapply(n, function(i) seq(i, i + x))
However, from your description it seems you are trying to perform a rolling calculation, and the above is not the best approach for that. Look into the zoo package: it has functions such as rollsum and rollmean, plus the general rollapply, which handle this much better.
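For the stated goal (find the window with the highest total), the rolling sum can also be computed in base R without building the list of index sequences. The sketch below uses cumulative sums rather than zoo; the function name `rolling_sum` and the toy vector are made up for illustration.

```r
# Hypothetical helper: sum of every window of `width` consecutive values,
# computed from differences of the cumulative sum.
rolling_sum <- function(x, width) {
  stopifnot(width >= 1, width <= length(x))
  cs <- c(0, cumsum(x))
  cs[(width + 1):length(cs)] - cs[1:(length(x) - width + 1)]
}

x <- c(1, 3, 2, 5, 4)                     # toy data standing in for one column
sums <- rolling_sum(x, 3)                 # windows 1:3, 2:4, 3:5
best_start <- which.max(sums)             # first row of the best window
best_rows <- seq(best_start, length.out = 3)
sums
best_rows
```

With the real data this would be called as `rolling_sum(df$Var2, 600)`, assuming `Var2` is the column of interest and contains no missing values.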

Is there an R function to order a matrix using two or more columns?

I simulated a data of 3 columns
aa <- rep(seq(0,1,0.05), seq(21,1,-1))
bb <- NA
for(i in length(seq(0,1,0.05)):1){
bb <- c(bb,rep(seq(0,1,0.05),len = i))
}
bb <- bb[-1]
cc <- 1-(aa+bb)
Dominance <- cbind(aa,bb,cc)
Then, in my problem, a row containing (0,0,1) is equal to a row containing (1,0,0) and (0,1,0).
So I use this code below to organize my data
for(i in 1:dim(Dominance)[1]){
Dominance[i,] <- Dominance[i, order(Dominance[i,], decreasing = FALSE)]
}
The problem is that when I try to order the rows using the code below, it orders the first column correctly, but not the second column.
Dominance[order(Dominance[,1],Dominance[,2],Dominance[,3]),]
I got this as a result
[1,] 0.00 0.00 1.00
[2,] 0.00 0.00 1.00
[3,] 0.00 0.00 1.00
[4,] 0.00 0.05 0.95
...
[59,] 0.00 0.50 0.50
[60,] 0.00 0.50 0.50
[61,] 0.05 0.35 0.60
[62,] 0.05 0.35 0.60
[63,] 0.05 0.05 0.90
[64,] 0.05 0.05 0.90
The problem starts at row 61: rows 61 and 63 have the same value in the first column (0.05), but row 61 has 0.35 in the second column while row 63 has the smaller value 0.05, so row 63 should come first.
Any ideas?
I have tried to use two other functions but they got the same results.
With the tidyverse approach, this is as simple as:
library(tidyverse)
Dominance %>%
  as_tibble() %>%
  arrange(aa, bb, cc) %>%
  as.matrix()
Hope it helps!
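A possible explanation for the odd ordering, which is not confirmed in this thread, is floating-point error: values that print as 0.05 can differ in their last bits (for example, 0.05 taken from seq versus 0.05 computed as 1 - (aa + bb)), so order treats them as different numbers. The sketch below assumes every value is meant to be a multiple of 0.05 and orders the rows by integer hundredths instead; the names `sorted_rows`, `key`, and `ordered` are illustrative.

```r
# Rebuild the simulated data from the question.
aa <- rep(seq(0, 1, 0.05), seq(21, 1, -1))
bb <- unlist(lapply(21:1, function(i) rep(seq(0, 1, 0.05), length.out = i)))
cc <- 1 - (aa + bb)
Dominance <- cbind(aa, bb, cc)

# Sort the values within each row (same effect as the loop in the question).
sorted_rows <- t(apply(Dominance, 1, sort))
dimnames(sorted_rows) <- NULL

# Assumption: values are multiples of 0.05, so whole hundredths are exact keys.
key <- round(sorted_rows * 100)
ordered <- key[order(key[, 1], key[, 2], key[, 3]), ] / 100
head(ordered)
```

Comparing `sorted_rows[, 1] == 0.05` with `key[, 1] == 5` is a quick way to check whether this is really what happens on a given machine.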

Assign different values to a large number of columns

I have a large set of financial data that has hundreds of columns. I have cleaned and sorted the data based on date. Here is a simplified example:
df1 <- data.frame(matrix(vector(),ncol=5, nrow = 4))
colnames(df1) <- c("Date","0.4","0.3","0.2","0.1")
df1[1,] <- c("2000-01-31","0","0","0.05","0.07")
df1[2,] <- c("2000-02-29","0","0.13","0.17","0.09")
df1[3,] <- c("2000-03-31","0.03","0.09","0.21","0.01")
df1[4,] <- c("2004-04-30","0.05","0.03","0.19","0.03")
df1
Date 0.4 0.3 0.2 0.1
1 2000-01-31 0 0 0.05 0.07
2 2000-02-29 0 0.13 0.17 0.09
3 2000-03-31 0.03 0.09 0.21 0.01
4 2000-04-30 0.05 0.03 0.19 0.03
I assigned individual weights (based on market value from the raw data) as column headers, because I don’t care about the company names and I need the weights for calculating the result.
My ultimate goal is to get: 1. Sum of the weighted returns; and 2. Sum of the weights when returns are non-zero. With that being said, below is the result I want to get:
Date SWeightedR SWeights
1 2000-01-31 0.017 0.3
2 2000-02-29 0.082 0.6
3 2000-03-31 0.082 1
4 2000-04-30 0.07 1
For instance, the SWeightedR for 2000-01-31 = 0.4x0+0.3x0+0.2x0.05+0.1x0.07, and SWeights = 0.2+0.1.
My initial idea was to assign the weights to each column, like WCol2 <- 0.4, then use cbind to create new columns and use c(as.matrix() %*% ) to get the sums. I soon realized that this is impractical, as there are hundreds of columns. Any advice or suggestion is appreciated!
Here's a simple solution using matrix multiplications (as you were suggesting yourself).
First of all, your data seem to be of character type. I'm not sure that is the case with your real data, but I would first convert the columns to an appropriate type:
df1[-1] <- lapply(df1[-1], type.convert, as.is = TRUE)
Next, we will convert the column names to a numeric class too
vec <- as.numeric(names(df1)[-1])
Finally, we can create the new columns in two simple steps. This does incur the overhead of converting to a matrix, but maybe you should be working with matrices in the first place. Either way, this is fully vectorized:
df1["SWeightedR"] <- as.matrix(df1[, -1]) %*% vec
# != 0 counts every non-zero return, including negative ones
df1["SWeights"] <- (df1[, -c(1, ncol(df1))] != 0) %*% vec
df1
# Date 0.4 0.3 0.2 0.1 SWeightedR SWeights
# 1 2000-01-31 0.00 0.00 0.05 0.07 0.017 0.3
# 2 2000-02-29 0.00 0.13 0.17 0.09 0.082 0.6
# 3 2000-03-31 0.03 0.09 0.21 0.01 0.082 1.0
# 4 2004-04-30 0.05 0.03 0.19 0.03 0.070 1.0
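For checking the arithmetic, the same two sums can be reproduced on a plain numeric matrix. This is a self-contained sketch using the four example rows from the question; the names `returns`, `weights`, `weighted_sum`, and `active_weight` are illustrative only.

```r
# Returns from the example, one row per date and one column per weight.
returns <- rbind(c(0.00, 0.00, 0.05, 0.07),
                 c(0.00, 0.13, 0.17, 0.09),
                 c(0.03, 0.09, 0.21, 0.01),
                 c(0.05, 0.03, 0.19, 0.03))
weights <- c(0.4, 0.3, 0.2, 0.1)

# Sum of weight * return for each row.
weighted_sum <- drop(returns %*% weights)

# Sum of the weights whose return is non-zero in that row.
active_weight <- drop((returns != 0) %*% weights)

round(cbind(weighted_sum, active_weight), 3)
```

The rounding is only for display; the unrounded values carry the usual floating-point noise.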
Or, you could convert to a long format first (here's a data.table example), though I believe it will be less efficient, as these are basically by-row operations:
library(data.table)
res <- melt(setDT(df1), id = 1L, variable.factor = FALSE
)[, c("value", "variable") := .(as.numeric(value), as.numeric(variable))]
res[, .(SWeightedR = sum(variable * value),
        SWeights = sum(variable * (value != 0))), by = Date]
# Date SWeightedR SWeights
# 1: 2000-01-31 0.017 0.3
# 2: 2000-02-29 0.082 0.6
# 3: 2000-03-31 0.082 1.0
# 4: 2004-04-30 0.070 1.0

Use value from the previous row for manipulation of multiple column at once in R

Hi everyone, I need some help.
I have a data set similar to this one, with multiple rows and columns.
df<- data.frame(A=seq(0.01,0.05,0.01),
B=c(0.01, -0.24, 0, -0.21, 0),
C=seq(0.03,0.07,0.01),
D=c(0.4,0.5,0,0,2))
I used the shift command from data.table to create another column, E.
df[ , E := shift(A)+A]
Now I want to apply the same operation to the whole data frame df and create columns F, G, and H in the same way as E, all at once.
Thank you.
If we are using data.table, the shift can take multiple columns at once and output the lag of those. If we are not selecting any particular sets of columns, specifying shift(.SD) (.SD represents the Subset of Data.table) gives the lag of all the columns in the dataset. By assigning (:=) it to different column names (LETTERS[5:8]), we get a new set of lag columns in the original dataset.
library(data.table)
setDT(df)[, LETTERS[5:8] := shift(.SD)+.SD]
df
# A B C D E F G H
#1: 0.01 0.01 0.03 0.4 NA NA NA NA
#2: 0.02 -0.24 0.04 0.5 0.03 -0.23 0.07 0.9
#3: 0.03 0.00 0.05 0.0 0.05 -0.24 0.09 0.5
#4: 0.04 -0.21 0.06 0.0 0.07 -0.21 0.11 0.0
#5: 0.05 0.00 0.07 2.0 0.09 -0.21 0.13 2.0
Or we can loop over the columns with lapply:
setDT(df)[, LETTERS[5:8] := lapply(.SD, function(x) x+shift(x))]
Here is an alternative for this:
new_cols <- c('E','F','G','H')
old_cols <- colnames(df)
for (i in seq_along(new_cols)){
eval(parse(text = paste0("df[,",new_cols[i],":= shift(",old_cols[i],")+",old_cols[i],"]")))
}
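For comparison, the same columns can be built in base R without data.table. This is a sketch, not one of the answers above: `lag1` is a hypothetical stand-in for `shift()` with its default lag of one row.

```r
df <- data.frame(A = seq(0.01, 0.05, 0.01),
                 B = c(0.01, -0.24, 0, -0.21, 0),
                 C = seq(0.03, 0.07, 0.01),
                 D = c(0.4, 0.5, 0, 0, 2))

# Hypothetical helper: value from the previous row, NA for the first row.
lag1 <- function(x) c(NA, head(x, -1))

# Add previous-row value + current value for each of the four columns.
new_cols <- LETTERS[5:8]
df[new_cols] <- lapply(df[1:4], function(x) x + lag1(x))
df
```

The first row of E to H is NA because there is no previous row to add.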
