Split dataframe based on one column in R

I have a huge dataframe of around 1M rows and want to split it based on one column at different ranges.
Example dataframe:
length<-sample(rep(1:400),100)
var1<-rnorm(1:100)
var2<-sample(rep(letters[1:25],4))
test<-data.frame(length,var1,var2)
I want to split the dataframe based on length at different ranges (ex: all rows for length between 1 and 50).
range_length<-list(1:50,51:100,101:150,151:200,201:250,251:300,301:350,351:400)
I can do this by subsetting from the dataframe, e.g.: test1 <- test[test$length > 1 & test$length < 50, ]
But I am looking for a more efficient way using "split" (just one line).

range = seq(0,400,50)
split(test, cut(test$length, range))
But do heed Justin's suggestion and look into using data.table instead of data.frame. I'll also add that it's very unlikely that you actually need to split the data.frame/table at all.
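For reference, a minimal sketch of that data.table idea, assuming the test data frame from the question and the same 50-wide bins; instead of splitting, tag each row with its bin and work by group:
library(data.table)
setDT(test)                                  # convert to a data.table in place
test[, bin := cut(length, seq(0, 400, 50))]  # tag each row with its length bin
test[, .(mean_var1 = mean(var1)), by = bin]  # e.g. summarise per bin without splitting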

Related

Loop over even/odd columns & stack them under specific ones

I have a data set from Douglas Montgomery's book Introduction to Time Series Analysis & Forecasting.
I created a data frame called pharm from this spreadsheet. We only have two variables but they're repeated over several columns. I'd like to take all odd "Week" columns past the 2nd column and stack them under the 1st Week column in order. Conversely I'd like to do the same thing with the even "Sales, in thousands" columns. Here's what I've tried so far:
pharm2 <- data.frame(week=c(pharm$week, pharm[,3], pharm[,5], pharm[,7]), sales=c(pharm$sales, pharm[,4], pharm[,6], pharm[,8]))
This works because there aren't many columns, but I need a way to do this more efficiently because hard coding won't be practical with many columns. Does anyone know a more efficient way to do this?
If the columns are alternating, just subset with a recycling logical vector, unlist, and create a new data.frame:
out <- data.frame(week = unlist(pharm[c(TRUE, FALSE)]),
sales = unlist(pharm[c(FALSE, TRUE)]))
You may use the seq function to generate a sequence of column indices to extract alternating columns.
pharm2 <- data.frame(week = unlist(pharm[seq(1, ncol(pharm), 2)]),
sales = unlist(pharm[seq(2, ncol(pharm), 2)]))
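A small made-up example (not the book's data) showing the recycled logical vector at work on a frame with alternating week/sales columns:
pharm <- data.frame(week1 = 1:3, sales1 = 11:13, week2 = 4:6, sales2 = 14:16)
data.frame(week  = unlist(pharm[c(TRUE, FALSE)]),   # columns 1, 3, ...
           sales = unlist(pharm[c(FALSE, TRUE)]))   # columns 2, 4, ...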

How to populate/fill a dataframe column with cell values of another dataframe

I have two dataframes.
Dataframe 1 has around a million rows and has two columns named 'row' and 'columns' that hold the row and column indices of another dataframe (i.e. dataframe 2).
I want to extract the values from dataframe 2 at the indices stated in the 'row' and 'columns' columns, for each row in dataframe 1.
I used a simple for loop to get the solution, but it is time consuming and takes around 9 minutes. Is there any other way, using functions in R, to solve this problem?
for(i in 1:nrow(dataframe1)) {
  dataframe1$value[i] <- dataframe2[dataframe1$row[i], dataframe1$columns[i]]
}
You actually don't need a for loop to do this. Just add the new column to the data frame using the row and column indices:
DataFrame1$value <- DataFrame2[DataFrame1$row, DataFrame1$column]
This should work a lot faster. If you wanted to try a different approach, you could add the values to a new vector and then use cbind to join the vector to the data frame. The fact that you're trying to update the whole data frame during the loop is most likely what's slowing it down.
Maybe you can try the code below
dataframe1$value <- dataframe2[as.matrix(dataframe1[c("row","columns")])]
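A tiny sketch with made-up frames: each row of the two-column matrix picks one (row, column) cell of dataframe2, so the result is a vector the same length as nrow(dataframe1).
dataframe2 <- data.frame(a = 1:3, b = 4:6)
dataframe1 <- data.frame(row = c(1, 3), columns = c(2, 1))
dataframe1$value <- dataframe2[as.matrix(dataframe1[c("row", "columns")])]
dataframe1$value   # 4 3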
Since your loop only considers the rows in df1, you can cut the surplus rows from df2 and then use cbind:
dataframe2 <- dataframe2[1:nrow(dataframe1), ]
df3 <- cbind(dataframe1, dataframe2)

Creating multiple rows in a dataframe with a single sampling command

I want to create a dataframe consisting of a hundred rows, each row representing the output of this code:
sample(seq(100, 1000),20,replace=T)
Is there a way to tell R to replicate the above code 100 times and create a dataframe out of it?
We can use replicate
t(replicate(100, sample(seq(100, 1000),20,replace=T)))
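Since the question asks for a dataframe, you could wrap the transposed result, for example:
out <- as.data.frame(t(replicate(100, sample(seq(100, 1000), 20, replace = TRUE))))
dim(out)   # 100 rows, 20 columns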

R: comparing rows of values in two dataframes

I have two dataframes, one original and one that should be the original plus several additional columns of data after processing. I would like to make sure that the correspondence between original columns was preserved between dataframes (i.e., all subject identifiers still match up to the original vectors of data in each row).
If the original (orig) is 5000 x 50 and the post-processing version (pp) is 5000 x 100, and the first 50 columns should be the same in each, how can I check? Is there something like setdiff() that can compare full dataframes?
SETDIFF <- setdiff(orig[,c(1:50)], pp[,c(1:50)])
In reply to the comment above: to find the row and column indices where values are not equal, use which(orig[,1:50] != pp[,1:50], arr.ind = TRUE).
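A minimal sketch with made-up frames, assuming the shared columns should be identical:
orig <- data.frame(id = 1:3, x = c(1, 2, 3))
pp   <- data.frame(id = 1:3, x = c(1, 9, 3), extra = letters[1:3])
all.equal(orig, pp[, names(orig)])                 # summarises any differences
which(orig != pp[, names(orig)], arr.ind = TRUE)   # row/col indices of mismatches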

R - Subset based on column name

My data frame has over 120 columns (variables) and I would like to create subsets based on column names.
For example I would like to create a subset where the column name includes the string "mood". Is this possible?
I generally use
SubData <- myData[,grep("whatIWant", colnames(myData))]
I know very well that the "," is not necessary and that colnames could be replaced by names, but names would not work with matrices and I hate to change the formalism when changing objects.
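For the "mood" example from the question, a minimal sketch with a made-up frame:
myData  <- data.frame(mood_am = 1:3, mood_pm = 4:6, sleep = 7:9)
SubData <- myData[, grep("mood", colnames(myData))]
names(SubData)   # "mood_am" "mood_pm"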
