R data frame extraction of elements - r

I have 2 dataframes and i only want the the numbers which are in both frames. I use this function:
CH3[CH2[,2] %in% CH3[,2],]
The Problem is, the data frames have a different length and this operation CH2[,2] %in% CH3[,2] delivers 1400 true values. I have searched for a while now but could not find a solution. If i apply it on CH3 which is 1700 long i get all appending data too which are not marked as wrong. Is there a function parameter i can use or do i need a workaround?
Edit1:
CH3<-read.table("unkown1.txt")
CH2<-read.table("unkown2.txt")
Now i only want the elements which are in both tables. I used:
CH3[CH3[,2] %in% CH2[,2],]
Which only works if CH3 is larger

merge is an option, depending on exactly what you are trying to do:
CH2 <- data.frame(a=letters[1:20], b=1:20)
CH3 <- data.frame(a=letters[15:26], b=15:26)
merge(CH2, CH3, by=2)
produces:
b a.x a.y
1 15 o o
2 16 p p
3 17 q q
4 18 r r
5 19 s s
6 20 t t
Another alternative is with intersect:
x <- intersect(CH2[, 2], CH3[, 2])
Where x is:
[1] 15 16 17 18 19 20
You can then do either of the following:
CH2[CH2[, 2] %in% x, ]
CH3[CH3[, 2] %in% x, ]
To get:
a b
15 o 15
16 p 16
17 q 17
18 r 18
19 s 19
20 t 20

Related

For loop only returns one output instead of several R [duplicate]

I want to split a data frame into several smaller ones. This looks like a very trivial question, however I cannot find a solution from web search.
You may also want to cut the data frame into an arbitrary number of smaller dataframes. Here, we cut into two dataframes.
x = data.frame(num = 1:26, let = letters, LET = LETTERS)
set.seed(10)
split(x, sample(rep(1:2, 13)))
gives
$`1`
num let LET
3 3 c C
6 6 f F
10 10 j J
12 12 l L
14 14 n N
15 15 o O
17 17 q Q
18 18 r R
20 20 t T
21 21 u U
22 22 v V
23 23 w W
26 26 z Z
$`2`
num let LET
1 1 a A
2 2 b B
4 4 d D
5 5 e E
7 7 g G
8 8 h H
9 9 i I
11 11 k K
13 13 m M
16 16 p P
19 19 s S
24 24 x X
25 25 y Y
You can also split a data frame based upon an existing column. For example, to create three data frames based on the cyl column in mtcars:
split(mtcars,mtcars$cyl)
If you want to split a dataframe according to values of some variable, I'd suggest using daply() from the plyr package.
library(plyr)
x <- daply(df, .(splitting_variable), function(x)return(x))
Now, x is an array of dataframes. To access one of the dataframes, you can index it with the name of the level of the splitting variable.
x$Level1
#or
x[["Level1"]]
I'd be sure that there aren't other more clever ways to deal with your data before splitting it up into many dataframes though.
You could also use
data2 <- data[data$sum_points == 2500, ]
This will make a dataframe with the values where sum_points = 2500
It gives :
airfoils sum_points field_points init_t contour_t field_t
...
491 5 2500 5625 0.000086 0.004272 6.321774
498 5 2500 5625 0.000087 0.004507 6.325083
504 5 2500 5625 0.000088 0.004370 6.336034
603 5 250 10000 0.000072 0.000525 1.111278
577 5 250 10000 0.000104 0.000559 1.111431
587 5 250 10000 0.000072 0.000528 1.111524
606 5 250 10000 0.000079 0.000538 1.111685
....
> data2 <- data[data$sum_points == 2500, ]
> data2
airfoils sum_points field_points init_t contour_t field_t
108 5 2500 625 0.000082 0.004329 0.733109
106 5 2500 625 0.000102 0.004564 0.733243
117 5 2500 625 0.000087 0.004321 0.733274
112 5 2500 625 0.000081 0.004428 0.733587
I just posted a kind of a RFC that might help you: Split a vector into chunks in R
x = data.frame(num = 1:26, let = letters, LET = LETTERS)
## number of chunks
n <- 2
dfchunk <- split(x, factor(sort(rank(row.names(x))%%n)))
dfchunk
$`0`
num let LET
1 1 a A
2 2 b B
3 3 c C
4 4 d D
5 5 e E
6 6 f F
7 7 g G
8 8 h H
9 9 i I
10 10 j J
11 11 k K
12 12 l L
13 13 m M
$`1`
num let LET
14 14 n N
15 15 o O
16 16 p P
17 17 q Q
18 18 r R
19 19 s S
20 20 t T
21 21 u U
22 22 v V
23 23 w W
24 24 x X
25 25 y Y
26 26 z Z
Cheers,
Sebastian
The answer you want depends very much on how and why you want to break up the data frame.
For example, if you want to leave out some variables, you can create new data frames from specific columns of the database. The subscripts in brackets after the data frame refer to row and column numbers. Check out Spoetry for a complete description.
newdf <- mydf[,1:3]
Or, you can choose specific rows.
newdf <- mydf[1:3,]
And these subscripts can also be logical tests, such as choosing rows that contain a particular value, or factors with a desired value.
What do you want to do with the chunks left over? Do you need to perform the same operation on each chunk of the database? Then you'll want to ensure that the subsets of the data frame end up in a convenient object, such as a list, that will help you perform the same command on each chunk of the data frame.
subset() is also useful:
subset(DATAFRAME, COLUMNNAME == "")
For a survey package, maybe the survey package is pertinent?
http://faculty.washington.edu/tlumley/survey/
If you want to split by values in one of the columns, you can use lapply. For instance, to split ChickWeight into a separate dataset for each chick:
data(ChickWeight)
lapply(unique(ChickWeight$Chick), function(x) ChickWeight[ChickWeight$Chick == x,])
Splitting the data frame seems counter-productive. Instead, use the split-apply-combine paradigm, e.g., generate some data
df = data.frame(grp=sample(letters, 100, TRUE), x=rnorm(100))
then split only the relevant columns and apply the scale() function to x in each group, and combine the results (using split<- or ave)
df$z = 0
split(df$z, df$grp) = lapply(split(df$x, df$grp), scale)
## alternative: df$z = ave(df$x, df$grp, FUN=scale)
This will be very fast compared to splitting data.frames, and the result remains usable in downstream analysis without iteration. I think the dplyr syntax is
library(dplyr)
df %>% group_by(grp) %>% mutate(z=scale(x))
In general this dplyr solution is faster than splitting data frames but not as fast as split-apply-combine.

How to automatically split data frame into smaller data frames based on (and named by levels of) existing variable? [duplicate]

I want to split a data frame into several smaller ones. This looks like a very trivial question, however I cannot find a solution from web search.
You may also want to cut the data frame into an arbitrary number of smaller dataframes. Here, we cut into two dataframes.
x = data.frame(num = 1:26, let = letters, LET = LETTERS)
set.seed(10)
split(x, sample(rep(1:2, 13)))
gives
$`1`
num let LET
3 3 c C
6 6 f F
10 10 j J
12 12 l L
14 14 n N
15 15 o O
17 17 q Q
18 18 r R
20 20 t T
21 21 u U
22 22 v V
23 23 w W
26 26 z Z
$`2`
num let LET
1 1 a A
2 2 b B
4 4 d D
5 5 e E
7 7 g G
8 8 h H
9 9 i I
11 11 k K
13 13 m M
16 16 p P
19 19 s S
24 24 x X
25 25 y Y
You can also split a data frame based upon an existing column. For example, to create three data frames based on the cyl column in mtcars:
split(mtcars,mtcars$cyl)
If you want to split a dataframe according to values of some variable, I'd suggest using daply() from the plyr package.
library(plyr)
x <- daply(df, .(splitting_variable), function(x)return(x))
Now, x is an array of dataframes. To access one of the dataframes, you can index it with the name of the level of the splitting variable.
x$Level1
#or
x[["Level1"]]
I'd be sure that there aren't other more clever ways to deal with your data before splitting it up into many dataframes though.
You could also use
data2 <- data[data$sum_points == 2500, ]
This will make a dataframe with the values where sum_points = 2500
It gives :
airfoils sum_points field_points init_t contour_t field_t
...
491 5 2500 5625 0.000086 0.004272 6.321774
498 5 2500 5625 0.000087 0.004507 6.325083
504 5 2500 5625 0.000088 0.004370 6.336034
603 5 250 10000 0.000072 0.000525 1.111278
577 5 250 10000 0.000104 0.000559 1.111431
587 5 250 10000 0.000072 0.000528 1.111524
606 5 250 10000 0.000079 0.000538 1.111685
....
> data2 <- data[data$sum_points == 2500, ]
> data2
airfoils sum_points field_points init_t contour_t field_t
108 5 2500 625 0.000082 0.004329 0.733109
106 5 2500 625 0.000102 0.004564 0.733243
117 5 2500 625 0.000087 0.004321 0.733274
112 5 2500 625 0.000081 0.004428 0.733587
I just posted a kind of a RFC that might help you: Split a vector into chunks in R
x = data.frame(num = 1:26, let = letters, LET = LETTERS)
## number of chunks
n <- 2
dfchunk <- split(x, factor(sort(rank(row.names(x))%%n)))
dfchunk
$`0`
num let LET
1 1 a A
2 2 b B
3 3 c C
4 4 d D
5 5 e E
6 6 f F
7 7 g G
8 8 h H
9 9 i I
10 10 j J
11 11 k K
12 12 l L
13 13 m M
$`1`
num let LET
14 14 n N
15 15 o O
16 16 p P
17 17 q Q
18 18 r R
19 19 s S
20 20 t T
21 21 u U
22 22 v V
23 23 w W
24 24 x X
25 25 y Y
26 26 z Z
Cheers,
Sebastian
The answer you want depends very much on how and why you want to break up the data frame.
For example, if you want to leave out some variables, you can create new data frames from specific columns of the database. The subscripts in brackets after the data frame refer to row and column numbers. Check out Spoetry for a complete description.
newdf <- mydf[,1:3]
Or, you can choose specific rows.
newdf <- mydf[1:3,]
And these subscripts can also be logical tests, such as choosing rows that contain a particular value, or factors with a desired value.
What do you want to do with the chunks left over? Do you need to perform the same operation on each chunk of the database? Then you'll want to ensure that the subsets of the data frame end up in a convenient object, such as a list, that will help you perform the same command on each chunk of the data frame.
subset() is also useful:
subset(DATAFRAME, COLUMNNAME == "")
For a survey package, maybe the survey package is pertinent?
http://faculty.washington.edu/tlumley/survey/
If you want to split by values in one of the columns, you can use lapply. For instance, to split ChickWeight into a separate dataset for each chick:
data(ChickWeight)
lapply(unique(ChickWeight$Chick), function(x) ChickWeight[ChickWeight$Chick == x,])
Splitting the data frame seems counter-productive. Instead, use the split-apply-combine paradigm, e.g., generate some data
df = data.frame(grp=sample(letters, 100, TRUE), x=rnorm(100))
then split only the relevant columns and apply the scale() function to x in each group, and combine the results (using split<- or ave)
df$z = 0
split(df$z, df$grp) = lapply(split(df$x, df$grp), scale)
## alternative: df$z = ave(df$x, df$grp, FUN=scale)
This will be very fast compared to splitting data.frames, and the result remains usable in downstream analysis without iteration. I think the dplyr syntax is
library(dplyr)
df %>% group_by(grp) %>% mutate(z=scale(x))
In general this dplyr solution is faster than splitting data frames but not as fast as split-apply-combine.

Splitting data longitudinally in R based on year [duplicate]

I want to split a data frame into several smaller ones. This looks like a very trivial question, however I cannot find a solution from web search.
You may also want to cut the data frame into an arbitrary number of smaller dataframes. Here, we cut into two dataframes.
x = data.frame(num = 1:26, let = letters, LET = LETTERS)
set.seed(10)
split(x, sample(rep(1:2, 13)))
gives
$`1`
num let LET
3 3 c C
6 6 f F
10 10 j J
12 12 l L
14 14 n N
15 15 o O
17 17 q Q
18 18 r R
20 20 t T
21 21 u U
22 22 v V
23 23 w W
26 26 z Z
$`2`
num let LET
1 1 a A
2 2 b B
4 4 d D
5 5 e E
7 7 g G
8 8 h H
9 9 i I
11 11 k K
13 13 m M
16 16 p P
19 19 s S
24 24 x X
25 25 y Y
You can also split a data frame based upon an existing column. For example, to create three data frames based on the cyl column in mtcars:
split(mtcars,mtcars$cyl)
If you want to split a dataframe according to values of some variable, I'd suggest using daply() from the plyr package.
library(plyr)
x <- daply(df, .(splitting_variable), function(x)return(x))
Now, x is an array of dataframes. To access one of the dataframes, you can index it with the name of the level of the splitting variable.
x$Level1
#or
x[["Level1"]]
I'd be sure that there aren't other more clever ways to deal with your data before splitting it up into many dataframes though.
You could also use
data2 <- data[data$sum_points == 2500, ]
This will make a dataframe with the values where sum_points = 2500
It gives :
airfoils sum_points field_points init_t contour_t field_t
...
491 5 2500 5625 0.000086 0.004272 6.321774
498 5 2500 5625 0.000087 0.004507 6.325083
504 5 2500 5625 0.000088 0.004370 6.336034
603 5 250 10000 0.000072 0.000525 1.111278
577 5 250 10000 0.000104 0.000559 1.111431
587 5 250 10000 0.000072 0.000528 1.111524
606 5 250 10000 0.000079 0.000538 1.111685
....
> data2 <- data[data$sum_points == 2500, ]
> data2
airfoils sum_points field_points init_t contour_t field_t
108 5 2500 625 0.000082 0.004329 0.733109
106 5 2500 625 0.000102 0.004564 0.733243
117 5 2500 625 0.000087 0.004321 0.733274
112 5 2500 625 0.000081 0.004428 0.733587
I just posted a kind of a RFC that might help you: Split a vector into chunks in R
x = data.frame(num = 1:26, let = letters, LET = LETTERS)
## number of chunks
n <- 2
dfchunk <- split(x, factor(sort(rank(row.names(x))%%n)))
dfchunk
$`0`
num let LET
1 1 a A
2 2 b B
3 3 c C
4 4 d D
5 5 e E
6 6 f F
7 7 g G
8 8 h H
9 9 i I
10 10 j J
11 11 k K
12 12 l L
13 13 m M
$`1`
num let LET
14 14 n N
15 15 o O
16 16 p P
17 17 q Q
18 18 r R
19 19 s S
20 20 t T
21 21 u U
22 22 v V
23 23 w W
24 24 x X
25 25 y Y
26 26 z Z
Cheers,
Sebastian
The answer you want depends very much on how and why you want to break up the data frame.
For example, if you want to leave out some variables, you can create new data frames from specific columns of the database. The subscripts in brackets after the data frame refer to row and column numbers. Check out Spoetry for a complete description.
newdf <- mydf[,1:3]
Or, you can choose specific rows.
newdf <- mydf[1:3,]
And these subscripts can also be logical tests, such as choosing rows that contain a particular value, or factors with a desired value.
What do you want to do with the chunks left over? Do you need to perform the same operation on each chunk of the database? Then you'll want to ensure that the subsets of the data frame end up in a convenient object, such as a list, that will help you perform the same command on each chunk of the data frame.
subset() is also useful:
subset(DATAFRAME, COLUMNNAME == "")
For a survey package, maybe the survey package is pertinent?
http://faculty.washington.edu/tlumley/survey/
If you want to split by values in one of the columns, you can use lapply. For instance, to split ChickWeight into a separate dataset for each chick:
data(ChickWeight)
lapply(unique(ChickWeight$Chick), function(x) ChickWeight[ChickWeight$Chick == x,])
Splitting the data frame seems counter-productive. Instead, use the split-apply-combine paradigm, e.g., generate some data
df = data.frame(grp=sample(letters, 100, TRUE), x=rnorm(100))
then split only the relevant columns and apply the scale() function to x in each group, and combine the results (using split<- or ave)
df$z = 0
split(df$z, df$grp) = lapply(split(df$x, df$grp), scale)
## alternative: df$z = ave(df$x, df$grp, FUN=scale)
This will be very fast compared to splitting data.frames, and the result remains usable in downstream analysis without iteration. I think the dplyr syntax is
library(dplyr)
df %>% group_by(grp) %>% mutate(z=scale(x))
In general this dplyr solution is faster than splitting data frames but not as fast as split-apply-combine.

How to convert a Data Frame to list in R using one column as names of the list? [duplicate]

I want to split a data frame into several smaller ones. This looks like a very trivial question, however I cannot find a solution from web search.
You may also want to cut the data frame into an arbitrary number of smaller dataframes. Here, we cut into two dataframes.
x = data.frame(num = 1:26, let = letters, LET = LETTERS)
set.seed(10)
split(x, sample(rep(1:2, 13)))
gives
$`1`
num let LET
3 3 c C
6 6 f F
10 10 j J
12 12 l L
14 14 n N
15 15 o O
17 17 q Q
18 18 r R
20 20 t T
21 21 u U
22 22 v V
23 23 w W
26 26 z Z
$`2`
num let LET
1 1 a A
2 2 b B
4 4 d D
5 5 e E
7 7 g G
8 8 h H
9 9 i I
11 11 k K
13 13 m M
16 16 p P
19 19 s S
24 24 x X
25 25 y Y
You can also split a data frame based upon an existing column. For example, to create three data frames based on the cyl column in mtcars:
split(mtcars,mtcars$cyl)
If you want to split a dataframe according to values of some variable, I'd suggest using daply() from the plyr package.
library(plyr)
x <- daply(df, .(splitting_variable), function(x)return(x))
Now, x is an array of dataframes. To access one of the dataframes, you can index it with the name of the level of the splitting variable.
x$Level1
#or
x[["Level1"]]
I'd be sure that there aren't other more clever ways to deal with your data before splitting it up into many dataframes though.
You could also use
data2 <- data[data$sum_points == 2500, ]
This will make a dataframe with the values where sum_points = 2500
It gives :
airfoils sum_points field_points init_t contour_t field_t
...
491 5 2500 5625 0.000086 0.004272 6.321774
498 5 2500 5625 0.000087 0.004507 6.325083
504 5 2500 5625 0.000088 0.004370 6.336034
603 5 250 10000 0.000072 0.000525 1.111278
577 5 250 10000 0.000104 0.000559 1.111431
587 5 250 10000 0.000072 0.000528 1.111524
606 5 250 10000 0.000079 0.000538 1.111685
....
> data2 <- data[data$sum_points == 2500, ]
> data2
airfoils sum_points field_points init_t contour_t field_t
108 5 2500 625 0.000082 0.004329 0.733109
106 5 2500 625 0.000102 0.004564 0.733243
117 5 2500 625 0.000087 0.004321 0.733274
112 5 2500 625 0.000081 0.004428 0.733587
I just posted a kind of a RFC that might help you: Split a vector into chunks in R
x = data.frame(num = 1:26, let = letters, LET = LETTERS)
## number of chunks
n <- 2
dfchunk <- split(x, factor(sort(rank(row.names(x))%%n)))
dfchunk
$`0`
num let LET
1 1 a A
2 2 b B
3 3 c C
4 4 d D
5 5 e E
6 6 f F
7 7 g G
8 8 h H
9 9 i I
10 10 j J
11 11 k K
12 12 l L
13 13 m M
$`1`
num let LET
14 14 n N
15 15 o O
16 16 p P
17 17 q Q
18 18 r R
19 19 s S
20 20 t T
21 21 u U
22 22 v V
23 23 w W
24 24 x X
25 25 y Y
26 26 z Z
Cheers,
Sebastian
The answer you want depends very much on how and why you want to break up the data frame.
For example, if you want to leave out some variables, you can create new data frames from specific columns of the database. The subscripts in brackets after the data frame refer to row and column numbers. Check out Spoetry for a complete description.
newdf <- mydf[,1:3]
Or, you can choose specific rows.
newdf <- mydf[1:3,]
And these subscripts can also be logical tests, such as choosing rows that contain a particular value, or factors with a desired value.
What do you want to do with the chunks left over? Do you need to perform the same operation on each chunk of the database? Then you'll want to ensure that the subsets of the data frame end up in a convenient object, such as a list, that will help you perform the same command on each chunk of the data frame.
subset() is also useful:
subset(DATAFRAME, COLUMNNAME == "")
For a survey package, maybe the survey package is pertinent?
http://faculty.washington.edu/tlumley/survey/
If you want to split by values in one of the columns, you can use lapply. For instance, to split ChickWeight into a separate dataset for each chick:
data(ChickWeight)
lapply(unique(ChickWeight$Chick), function(x) ChickWeight[ChickWeight$Chick == x,])
Splitting the data frame seems counter-productive. Instead, use the split-apply-combine paradigm, e.g., generate some data
df = data.frame(grp=sample(letters, 100, TRUE), x=rnorm(100))
then split only the relevant columns and apply the scale() function to x in each group, and combine the results (using split<- or ave)
df$z = 0
split(df$z, df$grp) = lapply(split(df$x, df$grp), scale)
## alternative: df$z = ave(df$x, df$grp, FUN=scale)
This will be very fast compared to splitting data.frames, and the result remains usable in downstream analysis without iteration. I think the dplyr syntax is
library(dplyr)
df %>% group_by(grp) %>% mutate(z=scale(x))
In general this dplyr solution is faster than splitting data frames but not as fast as split-apply-combine.

getting from histogram counts to cdf

I have a dataframe where I have values, and for each value I have the counts associated with that value. So, plotting counts against values gives me the histogram. I have three types, a, b, and c.
value counts type
0 139648267 a
1 34945930 a
2 5396163 a
3 1400683 a
4 485924 a
5 204631 a
6 98599 a
7 53056 a
8 30929 a
9 19556 a
10 12873 a
11 8780 a
12 6200 a
13 4525 a
14 3267 a
15 2489 a
16 1943 a
17 1588 a
... ... ...
How do I get from this to a CDF?
So far, my approach is super inefficient: I first write a function that sums up the counts up to that value:
get_cumulative <- function(x) {
result <- numeric(nrow(x))
for (i in seq_along(result)) {
result[i] = sum(x[x$num_groups <= x$num_groups[i], ]$count)
}
x$cumulative <- result
x
}
Then I wrap this in a ddply that splits by the type. This is obviously not the best way, and I'd love any suggestions on how to proceed.
You can use ave and cumsum (assuming your data is in df and sorted by value):
transform(df, cdf=ave(counts, type, FUN=function(x) cumsum(x) / sum(x)))
Here is a toy example:
df <- data.frame(counts=sample(1:100, 10), type=rep(letters[1:2], each=5))
transform(df, cdf=ave(counts, type, FUN=function(x) cumsum(x) / sum(x)))
that produces:
counts type cdf
1 55 a 0.2750000
2 61 a 0.5800000
3 27 a 0.7150000
4 20 a 0.8150000
5 37 a 1.0000000
6 45 b 0.1836735
7 79 b 0.5061224
8 12 b 0.5551020
9 63 b 0.8122449
10 46 b 1.0000000
If your data is in data.frame DF then following should do
do.call(rbind, lapply(split(DF, DF$type), FUN=cumsum))
The HistogramTools package on CRAN has several functions for converting between Histograms and CDFs, calculating information loss or error margins, and plotting functions to help with this.
If you have a histogram h then calculating the Empirical CDF of the underlying dataset is as simple as:
library(HistogramTools)
h <- hist(runif(100), plot=FALSE)
plot(HistToEcdf(h))
If you first need to convert your input data of breaks and counts into an R Histogram object, then see the PreBinnedHistogram function first.

Resources