Splitting a data frame into N subsets with an equal number of columns

How can I divide my data frame, which has 250 columns, into 5 subsets of 50 columns each and assign them to 5 different variables?
I have tried this
df2 <- split(df, sample(1:5, ncol(df), replace=T))
But this only splits based on the number of rows, not the number of columns.
I want something like this:
ncol(df2_1) = 50
ncol(df2_2) = 50
ncol(df2_3) = 50
ncol(df2_4) = 50
ncol(df2_5) = 50
And each subset should contain a distinct, non-overlapping set of columns.

Using the comment by @markus to use split.default, we can modify the initial code and change the sampling so that we get exactly 50 columns in each subset.
Making some dummy data:
df <- data.frame(matrix(1:250, ncol = 250))
Then splitting (as pointed out by @markus, this is the safer/more robust version):
df2 <- lapply(split.data.frame(t(df), sample(rep(1:5, ncol(df)/5))), t)
A simpler, though less robust, option is:
df2 <- split.default(df, sample(rep(1:5, ncol(df)/5)))
Either way, this gives us:
> ncol(df2$`1`)
[1] 50
> ncol(df2$`2`)
[1] 50
> ncol(df2$`3`)
[1] 50
> ncol(df2$`4`)
[1] 50
> ncol(df2$`5`)
[1] 50
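One way (a small sketch, using only base R) to get the five separate variables the question asks for is to name the list elements and push them into the global environment with list2env:
names(df2) <- paste0("df2_", seq_along(df2))
list2env(df2, envir = .GlobalEnv)
ncol(df2_1)
# [1] 50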

Related

Split a group of integers into two subgroups with approximately the same sums

I have a group of integers, as in this R data.frame:
set.seed(1)
df <- data.frame(id = paste0("id",1:100), length = as.integer(runif(100,10000,1000000)), stringsAsFactors = F)
So each element has an id and a length.
I'd like to split df into two data.frames with approximately equal sums of length.
Any idea of an R function to achieve that?
I thought that Hmisc's cut2 might do it but I don't think that's its intended use:
library(Hmisc) # cut2
ll <- split(df, cut2(df$length, g=2))
> sum(ll[[1]]$length)
[1] 14702139
> sum(ll[[2]]$length)
[1] 37564671
This is known as the bin packing problem; this link may be helpful: https://en.wikipedia.org/wiki/Bin_packing_problem
Using the BBmisc::binPack function:
library(BBmisc) # binPack
df$bins <- binPack(df$length, sum(df$length)/2 + 1)
tapply(df$length, df$bins, sum)
The result looks like:
       1        2        3 
25019106 24994566    26346 
Now, since you want two groups:
df$bins[df$bins == 3] <- 2 # merge bin 3 into bin 2, whose sum is smaller
The result is:
       1        2 
25019106 25020912 
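For reference, a package-free sketch of the same idea (object names here are illustrative, not from the original answer): a greedy pass that always adds the next-largest element to the group with the smaller running sum. Not optimal, but usually close.
df_sorted <- df[order(-df$length), ]
grp <- integer(nrow(df_sorted))
sums <- c(0, 0)
for (i in seq_len(nrow(df_sorted))) {
  g <- which.min(sums)        # group with the smaller sum so far
  grp[i] <- g
  sums[g] <- sums[g] + df_sorted$length[i]
}
ll2 <- split(df_sorted, grp)
sapply(ll2, function(d) sum(d$length)) # two roughly equal sums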

How to count missing values from two columns in R

I have a data frame which looks like this:
Contig_A   Contig_B
Contig_0   Contig_1
Contig_3   Contig_5
Contig_4   Contig_1
Contig_9   Contig_0
I want to count how many contig ids (from Contig_0 to Contig_1193) are not present in either the Contig_A or the Contig_B column.
For example, if there were only 10 contigs in total for this data frame (Contig_0 to Contig_9), the answer would be 4 (Contig_2, Contig_6, Contig_7, Contig_8).
Create a vector of all the values that you want to check (all_contig), which is Contig_0 to Contig_10 here. Use setdiff to find the absent values and length to count them.
cols <- c('Contig_A', 'Contig_B')
#If there are lot of 'Contig' columns that you want to consider
#cols <- grep('Contig', names(df), value = TRUE)
all_contig <- paste0('Contig_', 0:10)
missing_contig <- setdiff(all_contig, unlist(df[cols]))
#[1] "Contig_2" "Contig_6" "Contig_7" "Contig_8" "Contig_10"
count_missing <- length(missing_contig)
#[1] 5
Using match:
contigs <- paste0("Contig_", 0:9) # paste0 is vectorized, so no sapply is needed
df1 <- data.frame(
  Contig_A = c("Contig_0", "Contig_3", "Contig_4", "Contig_9"),
  Contig_B = c("Contig_1", "Contig_5", "Contig_1", "Contig_0")
)
xx <- c(df1$Contig_A,df1$Contig_B)
contigs[is.na(match(contigs, xx))]
[1] "Contig_2" "Contig_6" "Contig_7" "Contig_8"
In your case, just use contigs <- paste0("Contig_", 0:1193).
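If only the count is needed, the same check collapses into a one-liner (a sketch assuming the columns are character, i.e. stringsAsFactors = FALSE):
sum(!contigs %in% c(df1$Contig_A, df1$Contig_B))
# [1] 4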

Separating data frame randomly but keeping identical values together

I have a large data set that I am trying to work with. I am currently trying to separate it into three different data frames that will be used at different stages of testing.
ind<-sample(3, nrow(df1), replace =TRUE, prob=c(0.40, 0.50, 0.10))
df2<-as.data.frame(df1[ind==1,1:27])
df3<-as.data.frame(df1[ind==2, 1:27])
df4<-as.data.frame(df1[ind==3,1:27])
However, the first column in df1 is an invoice number, and multiple rows can have the same invoice number, as returns and mistakes are included. I am trying to find a way that will split the data up randomly, but keep all rows with the same invoice number together.
Any suggestions on how I may manage to accomplish this?
Instead of sampling the rows, you could sample the unique invoice numbers and then select the rows with those invoice numbers.
## Some sample data
df1 = data.frame(invoice=sample(10,20, replace=T), V = rnorm(20))
## sample the unique values
ind = sample(3, length(unique(df1$invoice)), replace=T)
## Select rows by sampled invoice number
df1[df1$invoice %in% unique(df1$invoice)[ind==1], 1:2]
   invoice           V
2        8 -0.67717939
6        9 -0.89222154
9        8 -0.71756069
14       8 -0.03539096
15       2  0.38453752
16       9 -0.16298835
17       9 -0.30823521
20       2 -0.60198259
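A sketch extending this idea to produce all three subsets at once, using the 40/50/10 weights from the question (the names ids, ind, and dfs are illustrative):
ids <- unique(df1$invoice)
ind <- sample(3, length(ids), replace = TRUE, prob = c(0.40, 0.50, 0.10))
dfs <- lapply(1:3, function(k) df1[df1$invoice %in% ids[ind == k], ])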
ind1 <- which(df1[,1] == 1)
ind2 <- which(df1[,1] == 2)
ind3 <- which(df1[,1] == 3)
df2 <- as.data.frame(df1[sample(ind1, length(ind1), replace = TRUE), 1:27])
df3 <- as.data.frame(df1[sample(ind2, length(ind2), replace = TRUE), 1:27])
df4 <- as.data.frame(df1[sample(ind3, length(ind3), replace = TRUE), 1:27])
ind1, ind2, and ind3 determine which rows contain the invoice numbers 1, 2, and 3. To create the random data frames, a random sample is then taken from only the rows that you want. Hope this helps.

Reduce nrows of large data frame to nrows of smaller data frame when dimensions are not divisible

I have two data frames. One is ~133 rows and one is ~4337 rows. They each have two columns containing the same type of information: sun elevation in the first column and radiance in the second. I would like to reduce the number of rows of the large data frame to the number of rows in the small one so that I can proceed with analysis without getting dimension errors. I do not want to combine them into a single data frame.
The thing is, I don't want to lose any data. On further inspection, I realized that I also can't use means, because they are not physically meaningful for my data.
I've been trying to find something in dplyr or reshape2 that will do this, but have not had luck so far.
Notes:
Dimensions in the example are smaller than my real world dimensions for simplicity
The solution presented here appears to be close: Calculate the mean of every 13 rows in data frame in R
However, rounding problems leave me with either too many or too few rows in the resulting data frame.
Code example trying to implement the above-linked solution:
set.seed(123)
df1 <- data.frame(sunel = sample(c(-6:4), 133, replace = TRUE),
                  rad = sample(c(1000:500000), 133, replace = TRUE))
df2 <- data.frame(sunel = sample(c(-15:15), 4337, replace = TRUE),
                  rad = sample(c(100:5000000), 4337, replace = TRUE))
df2a <- df2[df2$sunel >= -6 & df2$sunel <= 4,]
n <- (nrow(df2a) %/% 133) - 1
df3 <- aggregate(df2a, list(rep(1:(nrow(df2a) %/% n+1), each = n, len = nrow(df2a))), mean)
nrow(df1)
# [1] 133
nrow(df2a)
# [1] 1520
nrow(df3)
# [1] 150
min(df1$sunel);max(df1$sunel)
# [1] -6
# [1] 4
min(df2a$sunel);max(df2a$sunel)
# [1] -6
# [1] 4
min(df3$sunel);max(df3$sunel)
# [1] -3.2
# [1] 1.9
I have tried changing n, but due to rounding it results in either too few rows (~130) or too many (as shown in the example). Another problem: it is important for me to maintain roughly the same range of sunel, and the range in df3 is not acceptable.
Here is a hack solution I've found using caret. I would appreciate any advice on a more elegant solution.
library(caret)
133/1520
# [1] 0.0875
inTrain <- createDataPartition(df2a$sunel, p = .0875, list = FALSE)
nrow(inTrain)
# [1] 135 #Nope
inTrain <- createDataPartition(df2a$sunel, p = .0874, list = FALSE)
nrow(inTrain)
# [1] 135 #Still nope
inTrain <- createDataPartition(df2a$sunel, p = .086, list = FALSE)
nrow(inTrain)
# [1] 133 #Awesome
df3a <- df2a[inTrain, ]
nrow(df3a)
# [1] 133
min(df3a$sunel);max(df3a$sunel)
# [1] -6
# [1] 4
I suggest you bootstrap: http://www.ats.ucla.edu/stat/r/library/bootstrap.htm
Resampling is your solution for getting a representative sample of your big dataset!
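A minimal base-R sketch of that resampling idea (object names are illustrative): draw exactly 133 rows, either at random or at evenly spaced positions after sorting by sunel, which preserves the full -6 to 4 range.
## Simple random resample of exactly 133 rows:
df3b <- df2a[sample(nrow(df2a), 133), ]
## Or: evenly spaced rows after sorting by sunel, keeping the endpoints:
df2s <- df2a[order(df2a$sunel), ]
df3b <- df2s[round(seq(1, nrow(df2s), length.out = 133)), ]
nrow(df3b)
# [1] 133
range(df3b$sunel)
# [1] -6  4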
Would the simecol package and its approxTime function hold the answer for you? It might be too restrictive for your dataset, and you would need to work out the interpolation points for the xout vector.

lapply on single column in data frame

I have a data frame which I populate from a csv file as follows (sample data only):
> csv_data <- read.csv('test.csv')
> csv_data
  gender country income
1      1      20  10000
2      2      20  12000
3      2      23   3000
I want to convert country to factor. However, when I do the following, it fails:
> csv_data[,2] <- lapply(csv_data[,2], factor)
Warning message:
In `[<-.data.frame`(`*tmp*`, , 2, value = list(1L, 1L, 1L)) :
provided 3 variables to replace 1 variables
However, if I convert both gender and country to factor, it succeeds:
> csv_data[,1:2] <- lapply(csv_data[,1:2], factor)
> is.factor(csv_data[,1])
[1] TRUE
> is.factor(csv_data[,2])
[1] TRUE
Is there something I am doing wrong? I want to use lapply because I need to convert the columns to factors programmatically, and the number of columns to be converted could be just 1 (it could be more as well; the number is driven by arguments to a function). Is there any way I can do it using lapply only?
When subsetting for one single column, you'll need to change it slightly.
There's a big difference between
lapply(df[,2], factor)
and
lapply(df[2], factor)
## and/or
lapply(df[, 2, drop=FALSE], factor)
Have a look at the output of each. If you remove the comma, everything should work fine. Using the comma in [,] drops a single column down to a vector, so each element of the vector is factored individually. Leaving the comma out keeps the column as a one-column data frame (a list), which is what you want to give to lapply in this situation. Alternatively, with drop=FALSE you can keep the comma and the column will remain a list/data.frame.
No good:
df[,2] <- lapply(df[,2], factor)
# Warning message:
# In `[<-.data.frame`(`*tmp*`, , 2, value = list(1L, 1L, 1L)) :
# provided 3 variables to replace 1 variables
Succeeds on a single column:
df[,2] <- lapply(df[,2,drop=FALSE], factor)
df[,2]
# [1] 20 20 23
# Levels: 20 23
In my opinion, the best way to subset data frame columns is without the comma. This also succeeds:
df[2] <- lapply(df[2], factor)
df[[2]]
# [1] 20 20 23
# Levels: 20 23
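For the programmatic case from the question (one column or many), a small sketch using single-bracket indexing with a vector of column names (the cols vector here is illustrative):
cols <- c("country") # could equally be c("gender", "country")
csv_data[cols] <- lapply(csv_data[cols], factor)
sapply(csv_data[cols], is.factor)
# country
#    TRUE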
