I have a data frame (d) with 640 observations of 55 variables.
I would like to randomly split this data frame into 10 sub data frames of 64
observations each (all 55 variables). I don't want any observation to appear in
more than one sub data frame.
This code works for one sample:
d1 <- d[sample(nrow(d),64,replace=F),]
How can I repeat this ten times?
This attempt only gives me an object with 10 columns (each one is one sample of row indices):
d1 <- replicate(10,sample(nrow(d),64,replace = F))
Can anyone help me?
Here's a solution that returns the result in a list of data.frames:
d <- data.frame(A=1:640, B=sample(LETTERS, 640, replace=TRUE)) # an exemplary data.frame
idx <- sample(rep(1:10, length.out=nrow(d)))
res <- split(d, idx)
res[[1]] # first data frame
res[[10]] # last data frame
The only tricky part is creating idx. idx[i] is a value in {1,...,10} that identifies the resulting data.frame in which the ith row of d will occur. Because every row gets exactly one label, no row can end up in more than one data.frame.
Also, note that rep(1:10, length.out=nrow(d)) yields (1,2,...,10,1,2,...,10,...) with each label appearing 64 times, and sample returns a random permutation of that vector, so every resulting data.frame has exactly 64 rows.
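To check that the split behaves as described, a small sketch like the following can help. It assumes a hypothetical id column (not in the original example) so rows can be traced back, and a fixed seed purely for reproducibility:

```r
set.seed(42)  # assumption: fixed seed only to make the sketch reproducible

# example data with a hypothetical 'id' column to trace each row
d <- data.frame(id = 1:640, B = sample(LETTERS, 640, replace = TRUE))

# one label in 1..10 per row, each label used 64 times, in random order
idx <- sample(rep(1:10, length.out = nrow(d)))
res <- split(d, idx)

# number of rows in each of the 10 pieces
sizes <- vapply(res, nrow, integer(1))

# ids of all rows across all pieces; duplicates would mean overlap
all_ids <- unlist(lapply(res, function(x) x$id))

print(sizes)
print(anyDuplicated(all_ids))  # 0 means no row occurs twice
```

If nrow(d) were not a multiple of 10, the pieces would differ in size by at most one row.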
Another approach is to use:
apply(matrix(sample(nrow(d)), ncol=10), 2, function(idx) d[idx,])
In R, I have an object dataList which is a list, where each entry is a dataframe. Each dataframe has 2 columns, both of the same length (300, if it matters. dataList is 1000 entries long).
I need to take the average of all of the ith positions within this list. I.e. I need the average of all of the entries (i,2) of each dataframe. So, all 300 of the (1,2) entries should be averaged and I would like this number to be stored in the 1st spot of a new list.
I am open to any solutions as to how to do this; if there is a better way to store the data that would probably be preferable.
Here's a minimal example which should help you:
# create dummy data
d1 <- data.frame(weight = c(23,78,98,50), height=c(50,170,190,150))
d2 <- data.frame(weight = c(13,58,78,90), height=c(20,140,172,200))
# create a list
data_list <- list(d1,d2)
# find the mean of the second column of each data frame, stored in a new list
l1 <- lapply(data_list, function(x) mean(x[[2]]))
print(l1)
[[1]]
[1] 140
[[2]]
[1] 133
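Note that the example above gives one mean per data frame. If the goal is instead the average at each row position across all data frames, as the question describes, one possible sketch is to bind the second columns into a matrix and take row means. The names second_cols and position_means are made up for illustration, and this assumes every data frame has the same number of rows:

```r
# same dummy data as above
d1 <- data.frame(weight = c(23, 78, 98, 50), height = c(50, 170, 190, 150))
d2 <- data.frame(weight = c(13, 58, 78, 90), height = c(20, 140, 172, 200))
data_list <- list(d1, d2)

# matrix with one column per data frame, holding each second column
second_cols <- sapply(data_list, function(x) x[[2]])

# average across data frames at each row position
position_means <- rowMeans(second_cols)
print(position_means)

# wrap in as.list() if a list is really required
position_list <- as.list(position_means)
```

With 1000 data frames of 300 rows each, second_cols would be a 300 x 1000 matrix and position_means a vector of length 300.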
My data frame contains 22 columns: "DATE", "INDEX" and S1, S2, S3 ... S20. There are over 4322 rows. I want to calculate log returns and store the results in a data frame. That should give me 4321 rows.
I run the code below, but I am sure there is a much shorter and more elegant way to do the calculation.
# count the sum of rows in order to make the following formula work appropriately - (n-1)
n <- nrow(df)
# calculating the log returns (natural logarithm), of INDEX and S1-20
LogRet_INDEX <- log(df$INDEX[2:n])-log(df$INDEX[1:(n-1)])
LogRet_S1 <- log(df$S1[2:n])-log(df$S1[1:(n-1)])
LogRet_S2 <- log(df$S2[2:n])-log(df$S2[1:(n-1)])
LogRet_S3 <- log(df$S3[2:n])-log(df$S3[1:(n-1)])
LogRet_S4 <- log(df$S4[2:n])-log(df$S4[1:(n-1)])
LogRet_S5 <- log(df$S5[2:n])-log(df$S5[1:(n-1)])
LogRet_S6 <- log(df$S6[2:n])-log(df$S6[1:(n-1)])
LogRet_S7 <- log(df$S7[2:n])-log(df$S7[1:(n-1)])
LogRet_S8 <- log(df$S8[2:n])-log(df$S8[1:(n-1)])
LogRet_S9 <- log(df$S9[2:n])-log(df$S9[1:(n-1)])
LogRet_S10 <- log(df$S10[2:n])-log(df$S10[1:(n-1)])
LogRet_S11 <- log(df$S11[2:n])-log(df$S11[1:(n-1)])
LogRet_S12 <- log(df$S12[2:n])-log(df$S12[1:(n-1)])
LogRet_S13 <- log(df$S13[2:n])-log(df$S13[1:(n-1)])
LogRet_S14 <- log(df$S14[2:n])-log(df$S14[1:(n-1)])
LogRet_S15 <- log(df$S15[2:n])-log(df$S15[1:(n-1)])
LogRet_S16 <- log(df$S16[2:n])-log(df$S16[1:(n-1)])
LogRet_S17 <- log(df$S17[2:n])-log(df$S17[1:(n-1)])
LogRet_S18 <- log(df$S18[2:n])-log(df$S18[1:(n-1)])
LogRet_S19 <- log(df$S19[2:n])-log(df$S19[1:(n-1)])
LogRet_S20 <- log(df$S20[2:n])-log(df$S20[1:(n-1)])
# adding the results from the previous calculation (log returns) to a data frame
LogRet_df <- data.frame(LogRet_INDEX, LogRet_S1, LogRet_S2, LogRet_S3, LogRet_S4, LogRet_S5, LogRet_S6, LogRet_S7, LogRet_S8, LogRet_S9, LogRet_S10, LogRet_S11, LogRet_S12, LogRet_S13, LogRet_S14, LogRet_S15, LogRet_S16, LogRet_S17, LogRet_S18, LogRet_S19, LogRet_S20)
Is there a way to make this code shorter? Maybe some kind of loop, such as a for loop? Since I am quite new to R, I am trying to improve my knowledge.
Any kind of help is highly appreciated!
You can use sapply to apply a function to each column of the data.frame.
The code below 1) takes columns 2 to 22 of the data frame called df (INDEX and S1 to S20), 2) for each of these columns, takes the logarithm and then the difference between neighboring rows, and 3) converts the result to a data.frame called df2.
df2 <- as.data.frame(sapply(df[2:22], function(x) diff(log(x))))
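As a quick sanity check, the sketch below runs the same idea on a tiny made-up data frame (the values and the shortened column set are assumptions for illustration) and compares one column with the manual formula from the question. It selects columns by name rather than by position and restores the LogRet_ prefix:

```r
# tiny made-up example with the same layout: DATE, INDEX, then price columns
df <- data.frame(
  DATE  = as.Date("2020-01-01") + 0:4,
  INDEX = c(100, 101, 99, 102, 103),
  S1    = c(10, 11, 12, 11, 13)
)

# every column except DATE is treated as a price series
price_cols <- setdiff(names(df), "DATE")

# diff(log(x)) is log(x[2:n]) - log(x[1:(n-1)]) for each column
log_ret <- as.data.frame(lapply(df[price_cols], function(x) diff(log(x))))
names(log_ret) <- paste0("LogRet_", price_cols)

# manual version for one column, as written in the question
n <- nrow(df)
manual_S1 <- log(df$S1[2:n]) - log(df$S1[1:(n - 1)])

print(log_ret)
print(all.equal(log_ret$LogRet_S1, manual_S1))
```

The result has one row fewer than the input, matching the 4322 to 4321 rows mentioned in the question.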
I want to apply a statistical function to increasingly larger subsets of a data frame, starting at row 1 and incrementing by, say, 10 rows each time. So the first subset is rows 1-10, the second rows 1-20, and the final subset is rows 1-nrows. Can this be done without a for loop? And if so, how?
Here is one solution:
# some sample data
df <- data.frame(x = sample(1:105, 105))
# endpoints of the subsets: 10, 20, ..., plus the last row
# (start at 10, not 0; unique() avoids a duplicate when nrow(df) is a multiple of 10)
row_seq <- unique(c(seq(10, nrow(df), by = 10), nrow(df)))
# subsets of df from row 1 to each endpoint
# (df has a single column, so df[1:x, ] is a plain numeric vector)
data.subsets <- lapply(row_seq, function(x) df[1:x, ])
# apply the mean function to each subset;
# just replace mean with whatever function you want to use
lapply(data.subsets, mean)
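If the subsets themselves are not needed, a variant is to compute the statistic directly for each endpoint. This is only a sketch: the names ends and expanding_means are made up, the seed is fixed for reproducibility, and it assumes the data frame has at least 10 rows:

```r
set.seed(1)  # assumption: fixed seed only to make the sketch reproducible
df <- data.frame(x = sample(1:105, 105))

# endpoints 10, 20, ..., 100 plus the final row 105
ends <- unique(c(seq(10, nrow(df), by = 10), nrow(df)))

# statistic on rows 1..k for each endpoint k, returned as a numeric vector
expanding_means <- vapply(
  ends,
  function(k) mean(df$x[seq_len(k)]),
  numeric(1)
)

print(data.frame(end_row = ends, mean_x = expanding_means))
```

vapply returns a plain numeric vector instead of a list, which is often easier to plot or bind to the endpoints.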
I have a dataframe which looks like this (with far fewer variables than the original data I need to work with):
woe <- c('1:woe', '2:woe', '3:woe', '4:woe', '5:woe')
svi <- c('stated','verified','verified','stated','stated')
fico_avg <- ceiling(runif(5,750, 780))
count <- c(8,12,34,24,7)
df <- data.frame(cbind(woe,svi,fico_avg,count))
woe svi fico_avg count
1:woe stated 771 8
2:woe verified 759 12
3:woe verified 752 34
4:woe stated 776 24
5:woe stated 767 7
I would like to create a dataset in which the first row is repeated 8 times (filling the first 8 rows), the second row 12 times, the third row 34 times, and so on, depending on the value of the variable 'count'. I tried the function InsertRow() from the DataCombine package. InsertRow() requires RowNum as one of its arguments to insert a new row, but RowNum changes as I insert new rows into the frame. The basic idea is to extract each row from the original dataframe, copy it x times (if count = x), and finally row-bind all those rows into one frame. Any help is appreciated. Thanks in advance.
If your dataset is large, this should probably be quicker:
df <- data.frame(woe,svi,fico_avg,count)
df[rep(seq.int(1,nrow(df)), df$count),]
Works.
Try:
outdf = df
outdf = outdf[-c(1:nrow(outdf)),]
for(i in 1:nrow(df)){
for(j in 1:df[i,]$count) outdf[nrow(outdf)+1,]= df[i,]
}
outdf
You should use:
df <- data.frame(woe,svi,fico_avg,count)
rather than
df <- data.frame(cbind(woe,svi,fico_avg,count))
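There is no need for cbind here. cbind first builds a character matrix, so count is no longer numeric in the resulting data frame (it becomes a factor in R versions before 4.0.0 and a character vector from 4.0.0 on).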
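The difference can be seen in a short sketch. The fico_avg values are fixed to those shown in the printed table (an assumption, since the original draws them at random), and df_bad, df_good and df_long are names made up for illustration:

```r
woe      <- c("1:woe", "2:woe", "3:woe", "4:woe", "5:woe")
svi      <- c("stated", "verified", "verified", "stated", "stated")
fico_avg <- c(771, 759, 752, 776, 767)  # fixed values taken from the table
count    <- c(8, 12, 34, 24, 7)

# cbind() goes through a character matrix, so numeric columns are lost
df_bad  <- data.frame(cbind(woe, svi, fico_avg, count))
# data.frame() on the vectors keeps each column's own type
df_good <- data.frame(woe, svi, fico_avg, count)

print(is.numeric(df_bad$count))
print(is.numeric(df_good$count))

# repeat row i exactly count[i] times
df_long <- df_good[rep(seq_len(nrow(df_good)), df_good$count), ]
print(nrow(df_long))
```

The expanded data frame should have sum(count) rows, with each original row repeated as often as its count.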
Try this:
df_long <- df[rep(1:nrow(df), df$count), ]
Hope it helps
I would like to be able to create a new dataframe with 6 columns from an existing dataframe with 4 columns. The two extra columns should hold the values of the loop counters (i and j) at the time each row was added.
My draft code is as follows:
a is binary,
b is categorical
c is a number (in this case 1 to 200)
d is a number (in this example 1 to 5, in real life 1 to 2500)
#### make an example of mydata
a<- c(0,0,0,0,0,0,0,0,0,0,1,1,0,1)
b<- c("a","b","a","b","b","c","a","e","c","a","a","b","d","f")
c<- c(20,30,40,40,54,76,23,23,78,23,34,1,88,1)
d<- c(1,1,1,2,2,2,3,3,4,5,5,5,5,5)
mydata<-data.frame(a,b,c,d)
## this just generates random numbers to randomly
##select row to bind together later
set.seed(1)
choose.test<- data.frame(matrix(NA, nrow = 20, ncol = 30))
for (i in 1:20)
{
choose.test[,i]<-sample(5, 20, replace = TRUE, prob = NULL)
#random selection of sites WITH replacement
}
# this is the bit I am having trouble with
data<- NULL
for( j in 1:10){
for (i in choose.test[,j])
{ data <- rbind(data, mydata[mydata[,4]== i,])
data[,5]<-j
data[,6]<-i
}}
It would also be acceptable to create a separate dataframe at each loop iteration (in the second loop, using i as a counter). I am open to other, better suggestions as I am new to R. I also tried using assign to do this, with no luck.
At each iteration I need to rbind together all the rows whose value in column 4 equals a random number between 1 and 5 (in this example; in real life it will be between 1 and 2500 sites). These random numbers are stored in a data frame called choose.test. The random numbers in each column are used only once, then the next iteration of the outer loop moves on to the next column.
Without the "data[,5]<-j data[,6]<-i" lines it does almost what I want, but I would really like to have a 5th and 6th column that identify which iteration of the i and j loops the rows came from, so I can analyse the data at each iteration (I am bootstrapping with this data). Clearly the code above does not work, but I am not sure how to get it to do what I want. In the current version, each assignment overwrites columns 5 and 6 for all rows, so every row ends up with the last counter values.
Many thanks,
Ben
The following code fixed my problem
data<- NULL
for( j in 1:10){
for (i in choose.test[,j])
{ data <- rbind(data, cbind(mydata[mydata[,4]== i,], i=i, j=j))}}
Credit goes to MrFlick for providing a useful comment!
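Calling rbind inside a loop copies the growing data frame on every iteration, which can get slow with 2500 sites. One possible variant, sketched here under the assumption that every sampled site has at least one matching row, collects the pieces in a list and binds them once at the end. The names pieces and result are made up for illustration, and choose.test is built with replicate instead of the original loop:

```r
a <- c(0,0,0,0,0,0,0,0,0,0,1,1,0,1)
b <- c("a","b","a","b","b","c","a","e","c","a","a","b","d","f")
c <- c(20,30,40,40,54,76,23,23,78,23,34,1,88,1)
d <- c(1,1,1,2,2,2,3,3,4,5,5,5,5,5)
mydata <- data.frame(a, b, c, d)

set.seed(1)
# 20 random site numbers per column, one column per outer iteration
choose.test <- as.data.frame(replicate(10, sample(5, 20, replace = TRUE)))

pieces <- list()
for (j in 1:10) {
  for (i in choose.test[, j]) {
    rows <- mydata[mydata$d == i, ]
    # tag the selected rows with the counters they were drawn under
    pieces[[length(pieces) + 1]] <- cbind(rows, i = i, j = j)
  }
}

# bind everything once instead of growing a data frame in the loop
result <- do.call(rbind, pieces)
head(result)
```

The i and j columns can then be used to group rows, for example with split(result, result$j).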