Related
I want to split a data frame into several smaller ones. This looks like a very trivial question, however I cannot find a solution from web search.
You may also want to cut the data frame into an arbitrary number of smaller dataframes. Here, we cut into two dataframes.
x = data.frame(num = 1:26, let = letters, LET = LETTERS)
set.seed(10)
split(x, sample(rep(1:2, 13)))
gives
$`1`
num let LET
3 3 c C
6 6 f F
10 10 j J
12 12 l L
14 14 n N
15 15 o O
17 17 q Q
18 18 r R
20 20 t T
21 21 u U
22 22 v V
23 23 w W
26 26 z Z
$`2`
num let LET
1 1 a A
2 2 b B
4 4 d D
5 5 e E
7 7 g G
8 8 h H
9 9 i I
11 11 k K
13 13 m M
16 16 p P
19 19 s S
24 24 x X
25 25 y Y
You can also split a data frame based upon an existing column. For example, to create three data frames based on the cyl column in mtcars:
split(mtcars,mtcars$cyl)
If you want to split a dataframe according to values of some variable, I'd suggest using daply() from the plyr package.
library(plyr)
x <- daply(df, .(splitting_variable), function(x)return(x))
Now, x is an array of dataframes. To access one of the dataframes, you can index it with the name of the level of the splitting variable.
x$Level1
#or
x[["Level1"]]
I'd be sure that there aren't other more clever ways to deal with your data before splitting it up into many dataframes though.
You could also use
data2 <- data[data$sum_points == 2500, ]
This will make a dataframe with the values where sum_points = 2500
It gives :
airfoils sum_points field_points init_t contour_t field_t
...
491 5 2500 5625 0.000086 0.004272 6.321774
498 5 2500 5625 0.000087 0.004507 6.325083
504 5 2500 5625 0.000088 0.004370 6.336034
603 5 250 10000 0.000072 0.000525 1.111278
577 5 250 10000 0.000104 0.000559 1.111431
587 5 250 10000 0.000072 0.000528 1.111524
606 5 250 10000 0.000079 0.000538 1.111685
....
> data2 <- data[data$sum_points == 2500, ]
> data2
airfoils sum_points field_points init_t contour_t field_t
108 5 2500 625 0.000082 0.004329 0.733109
106 5 2500 625 0.000102 0.004564 0.733243
117 5 2500 625 0.000087 0.004321 0.733274
112 5 2500 625 0.000081 0.004428 0.733587
I just posted a kind of a RFC that might help you: Split a vector into chunks in R
x = data.frame(num = 1:26, let = letters, LET = LETTERS)
## number of chunks
n <- 2
dfchunk <- split(x, factor(sort(rank(row.names(x))%%n)))
dfchunk
$`0`
num let LET
1 1 a A
2 2 b B
3 3 c C
4 4 d D
5 5 e E
6 6 f F
7 7 g G
8 8 h H
9 9 i I
10 10 j J
11 11 k K
12 12 l L
13 13 m M
$`1`
num let LET
14 14 n N
15 15 o O
16 16 p P
17 17 q Q
18 18 r R
19 19 s S
20 20 t T
21 21 u U
22 22 v V
23 23 w W
24 24 x X
25 25 y Y
26 26 z Z
Cheers,
Sebastian
The answer you want depends very much on how and why you want to break up the data frame.
For example, if you want to leave out some variables, you can create new data frames from specific columns of the database. The subscripts in brackets after the data frame refer to row and column numbers. Check out Spoetry for a complete description.
newdf <- mydf[,1:3]
Or, you can choose specific rows.
newdf <- mydf[1:3,]
And these subscripts can also be logical tests, such as choosing rows that contain a particular value, or factors with a desired value.
What do you want to do with the chunks left over? Do you need to perform the same operation on each chunk of the database? Then you'll want to ensure that the subsets of the data frame end up in a convenient object, such as a list, that will help you perform the same command on each chunk of the data frame.
subset() is also useful:
subset(DATAFRAME, COLUMNNAME == "")
For a survey package, maybe the survey package is pertinent?
http://faculty.washington.edu/tlumley/survey/
If you want to split by values in one of the columns, you can use lapply. For instance, to split ChickWeight into a separate dataset for each chick:
data(ChickWeight)
lapply(unique(ChickWeight$Chick), function(x) ChickWeight[ChickWeight$Chick == x,])
Splitting the data frame seems counter-productive. Instead, use the split-apply-combine paradigm, e.g., generate some data
df = data.frame(grp=sample(letters, 100, TRUE), x=rnorm(100))
then split only the relevant columns and apply the scale() function to x in each group, and combine the results (using split<- or ave)
df$z = 0
split(df$z, df$grp) = lapply(split(df$x, df$grp), scale)
## alternative: df$z = ave(df$x, df$grp, FUN=scale)
This will be very fast compared to splitting data.frames, and the result remains usable in downstream analysis without iteration. I think the dplyr syntax is
library(dplyr)
df %>% group_by(grp) %>% mutate(z=scale(x))
In general this dplyr solution is faster than splitting data frames but not as fast as split-apply-combine.
I want to split a data frame into several smaller ones. This looks like a very trivial question, however I cannot find a solution from web search.
You may also want to cut the data frame into an arbitrary number of smaller dataframes. Here, we cut into two dataframes.
x = data.frame(num = 1:26, let = letters, LET = LETTERS)
set.seed(10)
split(x, sample(rep(1:2, 13)))
gives
$`1`
num let LET
3 3 c C
6 6 f F
10 10 j J
12 12 l L
14 14 n N
15 15 o O
17 17 q Q
18 18 r R
20 20 t T
21 21 u U
22 22 v V
23 23 w W
26 26 z Z
$`2`
num let LET
1 1 a A
2 2 b B
4 4 d D
5 5 e E
7 7 g G
8 8 h H
9 9 i I
11 11 k K
13 13 m M
16 16 p P
19 19 s S
24 24 x X
25 25 y Y
You can also split a data frame based upon an existing column. For example, to create three data frames based on the cyl column in mtcars:
split(mtcars,mtcars$cyl)
If you want to split a dataframe according to values of some variable, I'd suggest using daply() from the plyr package.
library(plyr)
x <- daply(df, .(splitting_variable), function(x)return(x))
Now, x is an array of dataframes. To access one of the dataframes, you can index it with the name of the level of the splitting variable.
x$Level1
#or
x[["Level1"]]
I'd be sure that there aren't other more clever ways to deal with your data before splitting it up into many dataframes though.
You could also use
data2 <- data[data$sum_points == 2500, ]
This will make a dataframe with the values where sum_points = 2500
It gives :
airfoils sum_points field_points init_t contour_t field_t
...
491 5 2500 5625 0.000086 0.004272 6.321774
498 5 2500 5625 0.000087 0.004507 6.325083
504 5 2500 5625 0.000088 0.004370 6.336034
603 5 250 10000 0.000072 0.000525 1.111278
577 5 250 10000 0.000104 0.000559 1.111431
587 5 250 10000 0.000072 0.000528 1.111524
606 5 250 10000 0.000079 0.000538 1.111685
....
> data2 <- data[data$sum_points == 2500, ]
> data2
airfoils sum_points field_points init_t contour_t field_t
108 5 2500 625 0.000082 0.004329 0.733109
106 5 2500 625 0.000102 0.004564 0.733243
117 5 2500 625 0.000087 0.004321 0.733274
112 5 2500 625 0.000081 0.004428 0.733587
I just posted a kind of a RFC that might help you: Split a vector into chunks in R
x = data.frame(num = 1:26, let = letters, LET = LETTERS)
## number of chunks
n <- 2
dfchunk <- split(x, factor(sort(rank(row.names(x))%%n)))
dfchunk
$`0`
num let LET
1 1 a A
2 2 b B
3 3 c C
4 4 d D
5 5 e E
6 6 f F
7 7 g G
8 8 h H
9 9 i I
10 10 j J
11 11 k K
12 12 l L
13 13 m M
$`1`
num let LET
14 14 n N
15 15 o O
16 16 p P
17 17 q Q
18 18 r R
19 19 s S
20 20 t T
21 21 u U
22 22 v V
23 23 w W
24 24 x X
25 25 y Y
26 26 z Z
Cheers,
Sebastian
The answer you want depends very much on how and why you want to break up the data frame.
For example, if you want to leave out some variables, you can create new data frames from specific columns of the database. The subscripts in brackets after the data frame refer to row and column numbers. Check out Spoetry for a complete description.
newdf <- mydf[,1:3]
Or, you can choose specific rows.
newdf <- mydf[1:3,]
And these subscripts can also be logical tests, such as choosing rows that contain a particular value, or factors with a desired value.
What do you want to do with the chunks left over? Do you need to perform the same operation on each chunk of the database? Then you'll want to ensure that the subsets of the data frame end up in a convenient object, such as a list, that will help you perform the same command on each chunk of the data frame.
subset() is also useful:
subset(DATAFRAME, COLUMNNAME == "")
For a survey package, maybe the survey package is pertinent?
http://faculty.washington.edu/tlumley/survey/
If you want to split by values in one of the columns, you can use lapply. For instance, to split ChickWeight into a separate dataset for each chick:
data(ChickWeight)
lapply(unique(ChickWeight$Chick), function(x) ChickWeight[ChickWeight$Chick == x,])
Splitting the data frame seems counter-productive. Instead, use the split-apply-combine paradigm, e.g., generate some data
df = data.frame(grp=sample(letters, 100, TRUE), x=rnorm(100))
then split only the relevant columns and apply the scale() function to x in each group, and combine the results (using split<- or ave)
df$z = 0
split(df$z, df$grp) = lapply(split(df$x, df$grp), scale)
## alternative: df$z = ave(df$x, df$grp, FUN=scale)
This will be very fast compared to splitting data.frames, and the result remains usable in downstream analysis without iteration. I think the dplyr syntax is
library(dplyr)
df %>% group_by(grp) %>% mutate(z=scale(x))
In general this dplyr solution is faster than splitting data frames but not as fast as split-apply-combine.
I want to split a data frame into several smaller ones. This looks like a very trivial question, however I cannot find a solution from web search.
You may also want to cut the data frame into an arbitrary number of smaller dataframes. Here, we cut into two dataframes.
x = data.frame(num = 1:26, let = letters, LET = LETTERS)
set.seed(10)
split(x, sample(rep(1:2, 13)))
gives
$`1`
num let LET
3 3 c C
6 6 f F
10 10 j J
12 12 l L
14 14 n N
15 15 o O
17 17 q Q
18 18 r R
20 20 t T
21 21 u U
22 22 v V
23 23 w W
26 26 z Z
$`2`
num let LET
1 1 a A
2 2 b B
4 4 d D
5 5 e E
7 7 g G
8 8 h H
9 9 i I
11 11 k K
13 13 m M
16 16 p P
19 19 s S
24 24 x X
25 25 y Y
You can also split a data frame based upon an existing column. For example, to create three data frames based on the cyl column in mtcars:
split(mtcars,mtcars$cyl)
If you want to split a dataframe according to values of some variable, I'd suggest using daply() from the plyr package.
library(plyr)
x <- daply(df, .(splitting_variable), function(x)return(x))
Now, x is an array of dataframes. To access one of the dataframes, you can index it with the name of the level of the splitting variable.
x$Level1
#or
x[["Level1"]]
I'd be sure that there aren't other more clever ways to deal with your data before splitting it up into many dataframes though.
You could also use
data2 <- data[data$sum_points == 2500, ]
This will make a dataframe with the values where sum_points = 2500
It gives :
airfoils sum_points field_points init_t contour_t field_t
...
491 5 2500 5625 0.000086 0.004272 6.321774
498 5 2500 5625 0.000087 0.004507 6.325083
504 5 2500 5625 0.000088 0.004370 6.336034
603 5 250 10000 0.000072 0.000525 1.111278
577 5 250 10000 0.000104 0.000559 1.111431
587 5 250 10000 0.000072 0.000528 1.111524
606 5 250 10000 0.000079 0.000538 1.111685
....
> data2 <- data[data$sum_points == 2500, ]
> data2
airfoils sum_points field_points init_t contour_t field_t
108 5 2500 625 0.000082 0.004329 0.733109
106 5 2500 625 0.000102 0.004564 0.733243
117 5 2500 625 0.000087 0.004321 0.733274
112 5 2500 625 0.000081 0.004428 0.733587
I just posted a kind of a RFC that might help you: Split a vector into chunks in R
x = data.frame(num = 1:26, let = letters, LET = LETTERS)
## number of chunks
n <- 2
dfchunk <- split(x, factor(sort(rank(row.names(x))%%n)))
dfchunk
$`0`
num let LET
1 1 a A
2 2 b B
3 3 c C
4 4 d D
5 5 e E
6 6 f F
7 7 g G
8 8 h H
9 9 i I
10 10 j J
11 11 k K
12 12 l L
13 13 m M
$`1`
num let LET
14 14 n N
15 15 o O
16 16 p P
17 17 q Q
18 18 r R
19 19 s S
20 20 t T
21 21 u U
22 22 v V
23 23 w W
24 24 x X
25 25 y Y
26 26 z Z
Cheers,
Sebastian
The answer you want depends very much on how and why you want to break up the data frame.
For example, if you want to leave out some variables, you can create new data frames from specific columns of the database. The subscripts in brackets after the data frame refer to row and column numbers. Check out Spoetry for a complete description.
newdf <- mydf[,1:3]
Or, you can choose specific rows.
newdf <- mydf[1:3,]
And these subscripts can also be logical tests, such as choosing rows that contain a particular value, or factors with a desired value.
What do you want to do with the chunks left over? Do you need to perform the same operation on each chunk of the database? Then you'll want to ensure that the subsets of the data frame end up in a convenient object, such as a list, that will help you perform the same command on each chunk of the data frame.
subset() is also useful:
subset(DATAFRAME, COLUMNNAME == "")
For a survey package, maybe the survey package is pertinent?
http://faculty.washington.edu/tlumley/survey/
If you want to split by values in one of the columns, you can use lapply. For instance, to split ChickWeight into a separate dataset for each chick:
data(ChickWeight)
lapply(unique(ChickWeight$Chick), function(x) ChickWeight[ChickWeight$Chick == x,])
Splitting the data frame seems counter-productive. Instead, use the split-apply-combine paradigm, e.g., generate some data
df = data.frame(grp=sample(letters, 100, TRUE), x=rnorm(100))
then split only the relevant columns and apply the scale() function to x in each group, and combine the results (using split<- or ave)
df$z = 0
split(df$z, df$grp) = lapply(split(df$x, df$grp), scale)
## alternative: df$z = ave(df$x, df$grp, FUN=scale)
This will be very fast compared to splitting data.frames, and the result remains usable in downstream analysis without iteration. I think the dplyr syntax is
library(dplyr)
df %>% group_by(grp) %>% mutate(z=scale(x))
In general this dplyr solution is faster than splitting data frames but not as fast as split-apply-combine.
I want to split a data frame into several smaller ones. This looks like a very trivial question, however I cannot find a solution from web search.
You may also want to cut the data frame into an arbitrary number of smaller dataframes. Here, we cut into two dataframes.
x = data.frame(num = 1:26, let = letters, LET = LETTERS)
set.seed(10)
split(x, sample(rep(1:2, 13)))
gives
$`1`
num let LET
3 3 c C
6 6 f F
10 10 j J
12 12 l L
14 14 n N
15 15 o O
17 17 q Q
18 18 r R
20 20 t T
21 21 u U
22 22 v V
23 23 w W
26 26 z Z
$`2`
num let LET
1 1 a A
2 2 b B
4 4 d D
5 5 e E
7 7 g G
8 8 h H
9 9 i I
11 11 k K
13 13 m M
16 16 p P
19 19 s S
24 24 x X
25 25 y Y
You can also split a data frame based upon an existing column. For example, to create three data frames based on the cyl column in mtcars:
split(mtcars,mtcars$cyl)
If you want to split a dataframe according to values of some variable, I'd suggest using daply() from the plyr package.
library(plyr)
x <- daply(df, .(splitting_variable), function(x)return(x))
Now, x is an array of dataframes. To access one of the dataframes, you can index it with the name of the level of the splitting variable.
x$Level1
#or
x[["Level1"]]
I'd be sure that there aren't other more clever ways to deal with your data before splitting it up into many dataframes though.
You could also use
data2 <- data[data$sum_points == 2500, ]
This will make a dataframe with the values where sum_points = 2500
It gives :
airfoils sum_points field_points init_t contour_t field_t
...
491 5 2500 5625 0.000086 0.004272 6.321774
498 5 2500 5625 0.000087 0.004507 6.325083
504 5 2500 5625 0.000088 0.004370 6.336034
603 5 250 10000 0.000072 0.000525 1.111278
577 5 250 10000 0.000104 0.000559 1.111431
587 5 250 10000 0.000072 0.000528 1.111524
606 5 250 10000 0.000079 0.000538 1.111685
....
> data2 <- data[data$sum_points == 2500, ]
> data2
airfoils sum_points field_points init_t contour_t field_t
108 5 2500 625 0.000082 0.004329 0.733109
106 5 2500 625 0.000102 0.004564 0.733243
117 5 2500 625 0.000087 0.004321 0.733274
112 5 2500 625 0.000081 0.004428 0.733587
I just posted a kind of a RFC that might help you: Split a vector into chunks in R
x = data.frame(num = 1:26, let = letters, LET = LETTERS)
## number of chunks
n <- 2
dfchunk <- split(x, factor(sort(rank(row.names(x))%%n)))
dfchunk
$`0`
num let LET
1 1 a A
2 2 b B
3 3 c C
4 4 d D
5 5 e E
6 6 f F
7 7 g G
8 8 h H
9 9 i I
10 10 j J
11 11 k K
12 12 l L
13 13 m M
$`1`
num let LET
14 14 n N
15 15 o O
16 16 p P
17 17 q Q
18 18 r R
19 19 s S
20 20 t T
21 21 u U
22 22 v V
23 23 w W
24 24 x X
25 25 y Y
26 26 z Z
Cheers,
Sebastian
The answer you want depends very much on how and why you want to break up the data frame.
For example, if you want to leave out some variables, you can create new data frames from specific columns of the database. The subscripts in brackets after the data frame refer to row and column numbers. Check out Spoetry for a complete description.
newdf <- mydf[,1:3]
Or, you can choose specific rows.
newdf <- mydf[1:3,]
And these subscripts can also be logical tests, such as choosing rows that contain a particular value, or factors with a desired value.
What do you want to do with the chunks left over? Do you need to perform the same operation on each chunk of the database? Then you'll want to ensure that the subsets of the data frame end up in a convenient object, such as a list, that will help you perform the same command on each chunk of the data frame.
subset() is also useful:
subset(DATAFRAME, COLUMNNAME == "")
For a survey package, maybe the survey package is pertinent?
http://faculty.washington.edu/tlumley/survey/
If you want to split by values in one of the columns, you can use lapply. For instance, to split ChickWeight into a separate dataset for each chick:
data(ChickWeight)
lapply(unique(ChickWeight$Chick), function(x) ChickWeight[ChickWeight$Chick == x,])
Splitting the data frame seems counter-productive. Instead, use the split-apply-combine paradigm, e.g., generate some data
df = data.frame(grp=sample(letters, 100, TRUE), x=rnorm(100))
then split only the relevant columns and apply the scale() function to x in each group, and combine the results (using split<- or ave)
df$z = 0
split(df$z, df$grp) = lapply(split(df$x, df$grp), scale)
## alternative: df$z = ave(df$x, df$grp, FUN=scale)
This will be very fast compared to splitting data.frames, and the result remains usable in downstream analysis without iteration. I think the dplyr syntax is
library(dplyr)
df %>% group_by(grp) %>% mutate(z=scale(x))
In general this dplyr solution is faster than splitting data frames but not as fast as split-apply-combine.
I have a dataframe where I have values, and for each value I have the counts associated with that value. So, plotting counts against values gives me the histogram. I have three types, a, b, and c.
value counts type
0 139648267 a
1 34945930 a
2 5396163 a
3 1400683 a
4 485924 a
5 204631 a
6 98599 a
7 53056 a
8 30929 a
9 19556 a
10 12873 a
11 8780 a
12 6200 a
13 4525 a
14 3267 a
15 2489 a
16 1943 a
17 1588 a
... ... ...
How do I get from this to a CDF?
So far, my approach is super inefficient: I first write a function that sums up the counts up to that value:
get_cumulative <- function(x) {
result <- numeric(nrow(x))
for (i in seq_along(result)) {
result[i] = sum(x[x$num_groups <= x$num_groups[i], ]$count)
}
x$cumulative <- result
x
}
Then I wrap this in a ddply that splits by the type. This is obviously not the best way, and I'd love any suggestions on how to proceed.
You can use ave and cumsum (assuming your data is in df and sorted by value):
transform(df, cdf=ave(counts, type, FUN=function(x) cumsum(x) / sum(x)))
Here is a toy example:
df <- data.frame(counts=sample(1:100, 10), type=rep(letters[1:2], each=5))
transform(df, cdf=ave(counts, type, FUN=function(x) cumsum(x) / sum(x)))
that produces:
counts type cdf
1 55 a 0.2750000
2 61 a 0.5800000
3 27 a 0.7150000
4 20 a 0.8150000
5 37 a 1.0000000
6 45 b 0.1836735
7 79 b 0.5061224
8 12 b 0.5551020
9 63 b 0.8122449
10 46 b 1.0000000
If your data is in data.frame DF then following should do
do.call(rbind, lapply(split(DF, DF$type), FUN=cumsum))
The HistogramTools package on CRAN has several functions for converting between Histograms and CDFs, calculating information loss or error margins, and plotting functions to help with this.
If you have a histogram h then calculating the Empirical CDF of the underlying dataset is as simple as:
library(HistogramTools)
h <- hist(runif(100), plot=FALSE)
plot(HistToEcdf(h))
If you first need to convert your input data of breaks and counts into an R Histogram object, then see the PreBinnedHistogram function first.