dplyr: Split data_frame into two randomly - r

How can I split a data_frame randomly into two without creating an index? sample_n works for me to get one part of it, but how can I collect the other part?

You can do an anti_join with the extracted part as y-dataframe and the original as x-dataframe. A small example:
library(dplyr)
df <- data_frame(x=1:20,y=runif(20))
dfy <- df %>% sample_n(10, replace=FALSE)
dfx <- anti_join(df, dfy, by="x")
This results in the following data frames:
> df
Source: local data frame [20 x 2]
x y
1 1 0.64147504
2 2 0.35766839
3 3 0.44875782
4 4 0.01905876
5 5 0.85655599
6 6 0.88191481
7 7 0.46532067
8 8 0.09831802
9 9 0.31158184
10 10 0.39504048
11 11 0.81358862
12 12 0.41702158
13 13 0.80441008
14 14 0.69928890
15 15 0.19040897
16 16 0.94120853
17 17 0.65289448
18 18 0.46844427
19 19 0.63177479
20 20 0.58288923
One half:
> dfx
Source: local data frame [10 x 2]
x y
1 19 0.6317748
2 17 0.6528945
3 16 0.9412085
4 15 0.1904090
5 14 0.6992889
6 11 0.8135886
7 7 0.4653207
8 6 0.8819148
9 5 0.8565560
10 3 0.4487578
The other half:
> dfy
Source: local data frame [10 x 2]
x y
1 18 0.46844427
2 8 0.09831802
3 12 0.41702158
4 4 0.01905876
5 2 0.35766839
6 10 0.39504048
7 13 0.80441008
8 9 0.31158184
9 1 0.64147504
10 20 0.58288923
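Note that anti_join(df, dfy, by="x") relies on x being unique in df. If no such key column exists, one possible alternative is to sample row positions once and use them for both halves. The following is only a sketch in base R; idx is a temporary vector introduced here for illustration, not a column added to the data.

```r
# Sketch: split a data frame into two random halves via row positions.
# `idx` is a hypothetical temporary vector, not part of the original answer.
set.seed(1)                          # only to make the example reproducible
df <- data.frame(x = 1:20, y = runif(20))

idx <- sample(nrow(df), 10)          # 10 distinct row positions
dfy <- df[idx, ]                     # the sampled half
dfx <- df[-idx, ]                    # everything that was not sampled
```

Because the complement is taken by position, this also works when the data contain duplicated rows.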

Related

How can I create a df/dt where each column is the value of a row of an existing df/dt without loops?

I have a dataframe
data <- data.frame(v=c(15,25,24), x_val=c(12,7,2), y_val=c(6,6,18))
I want the resulting data to look like this with the data repeated in rows a specified number of times (here 2 times).
v1 x1 y1 v2 x2 y2 v3 x3 y3
15 12 6 25 7 6 24 2 18
15 12 6 25 7 6 24 2 18
I managed to get the data all in one row with the right column names but I'm not sure how to extend the column to a specified length with the values repeated. Further, how can I do this without loops? I want to run this with a larger dataset which can be quite slow with loops.
My code is below which gives the values in a single row.
r=NULL
r<- as.data.frame(matrix(nrow=1, ncol=1))
n<-2
# loop over the rows of data (defined above) and bind each row as new columns
for (i in 1:nrow(data)){
  datainarow <- data[i,]
  r=cbind(r,as.data.frame(datainarow))
  colnames(r)[n] <- paste0("v",i)
  colnames(r)[n+1] <- paste0("x",i)
  colnames(r)[n+2] <- paste0("y",i)
  n <- n+3
}
Thank you!
You can use uncount in the tidyr package
If you already have your data in the single row format, just do:
n=4
data %>% tidyr::uncount(n)
# A tibble: 4 x 9
v1 v2 v3 x1 x2 x3 y1 y2 y3
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 15 25 24 12 7 2 6 6 18
2 15 25 24 12 7 2 6 6 18
3 15 25 24 12 7 2 6 6 18
4 15 25 24 12 7 2 6 6 18
Here is one way to get that result from the initial three-row data frame:
library(tidyverse)
n=4
data %>%
rename_all(~c("v","x","y")) %>%
mutate(id = row_number()) %>%
pivot_wider(names_from = id, values_from = v:y,names_sep = "") %>%
uncount(n)
This is a one-liner in base R
as.data.frame(t(as.vector(t(data))))[rep(1, 2),]
#> V1 V2 V3 V4 V5 V6 V7 V8 V9
#> 1 15 12 6 25 7 6 24 2 18
#> 1.1 15 12 6 25 7 6 24 2 18
Or if you wish to use the naming convention described, and have a more generalizable solution, you could use the following function:
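expand_data <- function(data, reps) {
  # flatten the data row by row into a single row, then repeat that row
  df <- as.data.frame(t(as.vector(t(data))))[rep(1, reps),]
  # each row of data contributes ncol(data) values, so repeat the row index ncol(data) times
  names(df) <- paste(names(data), rep(seq(nrow(data)), each = ncol(data)), sep = "_")
  rownames(df) <- NULL
  df
}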
which allows:
expand_data(data, 10)
v_1 x_val_1 y_val_1 v_2 x_val_2 y_val_2 v_3 x_val_3 y_val_3
1 15 12 6 25 7 6 24 2 18
2 15 12 6 25 7 6 24 2 18
3 15 12 6 25 7 6 24 2 18
4 15 12 6 25 7 6 24 2 18
5 15 12 6 25 7 6 24 2 18
6 15 12 6 25 7 6 24 2 18
7 15 12 6 25 7 6 24 2 18
8 15 12 6 25 7 6 24 2 18
9 15 12 6 25 7 6 24 2 18
10 15 12 6 25 7 6 24 2 18
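The same flattening idea can be checked on input that is not square. The following is a small sketch with made-up two-row data (not from the question); the column suffix is the row number, repeated once per column of the input.

```r
# Sketch: flatten a 2-row, 3-column data frame into one row and repeat it.
# The two-row `data` here is illustrative only.
data <- data.frame(v = c(15, 25), x_val = c(12, 7), y_val = c(6, 6))

# t(data) is a 3 x 2 matrix; as.vector() reads it column by column,
# which corresponds to reading the original data row by row
wide <- as.data.frame(t(as.vector(t(data))))
names(wide) <- paste(names(data),
                     rep(seq_len(nrow(data)), each = ncol(data)),
                     sep = "_")

out <- wide[rep(1, 3), ]             # repeat the single row 3 times
rownames(out) <- NULL
```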

How to randomly split a data frame into halves that are balanced on subject and item

The following randomly splits a data frame into halves.
df <- read.csv("https://raw.githubusercontent.com/HirokiYamamoto2531/data/master/data.csv")
head(df, 3)
# dv iv subject item
#1 562 -0.5 1 7
#2 790 0.5 1 21
#3 NA -0.5 1 19
r <- seq_len(nrow(df))
first <- sample(r, 240)
second <- r[!r %in% first]
df_1 <- df[first, ]
df_2 <- df[second, ]
However, in this way, each data frame (df_1 and df_2) is not balanced on subject and item: e.g.,
table(df_1$subject)
# 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
# 7 8 3 5 5 3 8 1 5 7 7 6 7 7 9 8 8 9 6 7 8 5 4 4 5 2 7 6 9
# 30 31 32 33 34 35 36 37 38 39 40
# 7 5 7 7 7 3 5 7 5 3 8
table(df_1$item)
# 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
# 12 11 12 12 9 11 11 8 11 12 10 8 14 7 14 10 8 7 9 9 7 11 9 8
# There are 40 subjects and 24 items, and each subject is assigned to 12 items and each item to 20 subjects.
I would like to know how to split the data frame into halves that are balanced on subject and item (i.e., exactly 6 data points from each subject and 10 data points from each item).
You can use the createDataPartition function from the caret package to create a balanced partition of one variable. For a numeric variable the function stratifies by percentile groups rather than by individual values, so convert subject to a factor to sample within each subject.
The code below creates a balanced partition of the dataset according to the variable subject:
df <- read.csv("https://raw.githubusercontent.com/HirokiYamamoto2531/data/master/data.csv")
partition <- caret::createDataPartition(factor(df$subject), p = 0.5, list = FALSE)
first.half <- df[partition, ]
second.half <- df[-partition, ]
table(first.half$subject)
table(second.half$subject)
I'm not sure whether it's possible to balance two variables at once. You can try balancing for one variable and checking if you're happy with the partition of the second variable.
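Balancing on a single variable can also be sketched in base R by sampling row positions within each subject. The data below are synthetic and only mimic the design described in the question (40 subjects, 12 rows each); the item column and the variable names first, df_1 and df_2 are assumptions for illustration. This balances subject only; item counts still have to be checked separately.

```r
# Sketch: split into halves with exactly 6 rows per subject (base R only).
# Synthetic stand-in for the real data: 40 subjects x 12 items each.
set.seed(42)
df <- data.frame(
  subject = rep(1:40, each = 12),
  item    = unlist(lapply(1:40, function(s) ((s - 1) %% 2) * 12 + 1:12))
)

# For every subject, pick 6 of its row positions at random
rows_by_subject <- split(seq_len(nrow(df)), df$subject)
first <- unlist(
  lapply(rows_by_subject, function(rows) rows[sample.int(length(rows), 6)]),
  use.names = FALSE
)

df_1 <- df[first, ]
df_2 <- df[-first, ]

table(df_1$subject)   # 6 rows for every subject
table(df_1$item)      # not guaranteed to be balanced
```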

How to extract a sample of pairs in grouping variable

My data looks like this:
x y
1 1
2 2
3 2
4 4
5 5
6 6
7 6
8 8
9 9
10 9
11 11
12 12
13 13
14 13
15 14
16 15
17 14
18 16
19 17
20 18
y is a grouping variable. I would like to see how well this grouping went.
To do so, I want to extract a sample of n pairs of cases that are grouped together by variable y and n pairs of cases that are not grouped together by variable y, in order to calculate the number of false positives and false negatives (pairs that were falsely grouped or falsely left ungrouped). How do I extract a sample of grouped pairs and a sample of not-grouped pairs?
I would like the samples to look like this (for n=6):
Grouped sample:
x y
2 2
3 2
9 9
10 9
15 14
17 14
Not-grouped sample:
x y
1 1
2 2
6 8
6 8
11 11
19 17
How would I go about this in R?
I'm not entirely clear on what you'd like to do, partly because I feel there is some context missing as to what you're trying to achieve. I also don't quite understand your expected output (for example, the not-grouped sample contains an entry 6 8 that does not exist in your original data...)
That aside, here is a possible approach.
# Maximum number of samples per group
n <- 3;
# Set fixed RNG seed for reproducibility
set.seed(2017);
# Grouped samples
df.grouped <- do.call(rbind.data.frame, lapply(split(df, df$y),
function(x) if (nrow(x) > 1) x[sample(min(n, nrow(x))), ]));
df.grouped;
# x y
#2.3 3 2
#2.2 2 2
#6.6 6 6
#6.7 7 6
#9.10 10 9
#9.9 9 9
#13.13 13 13
#13.14 14 13
#14.15 15 14
#14.17 17 14
# Ungrouped samples
df.ungrouped <- df[sample(nrow(df.grouped)), ];
df.ungrouped;
# x y
#7 7 6
#1 1 1
#9 9 9
#4 4 4
#3 3 2
#2 2 2
#5 5 5
#6 6 6
#10 10 9
#8 8 8
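Explanation: Split df by y; for each subset x with more than one row, take min(n, nrow(x)) rows in random order; rbinding the results gives df.grouped. Note that df[sample(nrow(df.grouped)), ] only shuffles the first nrow(df.grouped) rows of df, which is why df.ungrouped above contains rows 1 to 10. To draw that many rows from all of df, use df[sample(nrow(df), nrow(df.grouped)), ] instead.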
Sample data
df <- read.table(text =
"x y
1 1
2 2
3 2
4 4
5 5
6 6
7 6
8 8
9 9
10 9
11 11
12 12
13 13
14 13
15 14
16 15
17 14
18 16
19 17
20 18", header = TRUE)
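Since the question asks for pairs of cases rather than single rows, another possible reading is to enumerate all pairs of rows, flag whether both members share the same y, and sample n pairs from each pool. This is a sketch under that assumption; the names pairs, same_group, grouped_pool and ungrouped_pool are introduced here and are not part of the answer above.

```r
# Sketch: sample n grouped and n not-grouped pairs of cases.
df <- data.frame(
  x = 1:20,
  y = c(1, 2, 2, 4, 5, 6, 6, 8, 9, 9, 11, 12, 13, 13, 14, 15, 14, 16, 17, 18)
)

# All unordered pairs of row positions (i < j)
pairs <- as.data.frame(t(combn(nrow(df), 2)))
names(pairs) <- c("i", "j")
pairs$x1 <- df$x[pairs$i]
pairs$x2 <- df$x[pairs$j]
pairs$same_group <- df$y[pairs$i] == df$y[pairs$j]

grouped_pool   <- pairs[pairs$same_group, ]
ungrouped_pool <- pairs[!pairs$same_group, ]

set.seed(2017)
n <- 3
grouped_pairs   <- grouped_pool[sample(nrow(grouped_pool), min(n, nrow(grouped_pool))), ]
ungrouped_pairs <- ungrouped_pool[sample(nrow(ungrouped_pool), min(n, nrow(ungrouped_pool))), ]
```

Enumerating all pairs grows quadratically with the number of rows, so for large data it may be preferable to sample candidate pairs instead of listing them all.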

name columns of dynamically chosen data.frame

I am trying to name the columns of a data frame, but the data frame is chosen dynamically. Any idea why this does not work? Below is an example, but in my real case, I get a different error. As of now, I would just like to know what causes either of the errors:
Error in file(filename, "r") : cannot open the connection
In addition: Warning message:
In file(filename, "r") :
cannot open file 'df': No such file or directory
#ASSIGN data frame name dynamically
> assign(as.character("df"), data.frame(c(1:10), c(11:20)))
>
#It worked
> df
c.1.10. c.11.20.
1 1 11
2 2 12
3 3 13
4 4 14
5 5 15
6 6 16
7 7 17
8 8 18
9 9 19
10 10 20
>
#Call the data frame dynamically, it works
> eval(parse(text = c("df")))
c.1.10. c.11.20.
1 1 11
2 2 12
3 3 13
4 4 14
5 5 15
6 6 16
7 7 17
8 8 18
9 9 19
10 10 20
>
#name the columns
> colnames(df) <- c("a", "b")
> df
a b
1 1 11
2 2 12
3 3 13
4 4 14
5 5 15
6 6 16
7 7 17
8 8 18
9 9 19
10 10 20
>
#name columns of dynamically chosen data frame, returns an error
> colnames(eval(parse(text = c("df")))) <- c("c", "d")
Error in colnames(eval(parse(text = c("df")))) <- c("c", "d") :
target of assignment expands to non-language object
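It doesn't work because the target of an assignment like colnames(...) <- has to be a variable name (or an indexing of one), not the result of a call such as eval(parse()). More to the point, R doesn't want you to use assign and (argh!) eval(parse()) for this sort of basic stuff. Lists! This is why the Lord created lists!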
l <- list()
l[["df"]] <- data.frame(c(1:10), c(11:20))
colnames(l[["df"]]) <- c("a","b")
> l
$df
a b
1 1 11
2 2 12
3 3 13
4 4 14
5 5 15
6 6 16
7 7 17
8 8 18
9 9 19
10 10 20
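If the data frame really has to live in a variable whose name is only known as a string, one workaround is to fetch it with get(), modify the copy, and write it back with assign(). This is a sketch only; nm and tmp are hypothetical names, and the list approach above is usually the cleaner choice.

```r
# Sketch: rename columns of a data frame that is chosen by name at run time.
nm <- "df"                                   # name known only as a string
assign(nm, data.frame(c(1:10), c(11:20)))    # create the data frame

tmp <- get(nm)                               # fetch a copy by name
colnames(tmp) <- c("c", "d")                 # a plain variable is a valid target
assign(nm, tmp)                              # write the modified copy back

colnames(get(nm))
```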

Finding corresponding row value of a data frame using an other data frame

Is it possible, using R commands, to find the rows of one data frame that correspond to the rows of another data frame, and then store the result in a third data frame?
Example:
data1 = airquality[1:14,]
data2 = data.frame(index=data1$Ozone[6:14])
I want another data frame holding the dates for the rows that these two data frames have in common, treating the Ozone values of data1 as the index.
So what I finally want to get in data3 is something like this:
index Month Day
28 5 6
23 5 7
19 5 8
8 5 9
NA 5 10
7 5 11
16 5 12
11 5 13
14 5 14
You could use the %in% operator:
data3 <- data1[data1$Ozone %in% data2$index, c("Ozone", "Month", "Day")]
data3
Ozone Month Day
5 NA 5 5
6 28 5 6
7 23 5 7
8 19 5 8
9 8 5 9
10 NA 5 10
11 7 5 11
12 16 5 12
13 11 5 13
14 14 5 14
Your index example contains an NA. Because NA %in% c(..., NA) is TRUE, every row of data1 whose Ozone is NA is selected, which is why row 5 appears in the result. Unless you want all of those rows, avoid NAs in the index.
If you wanted to use row names, you could do something like this:
data1[!rownames(data1) %in% 1:5, c("Ozone", "Month", "Day")]
Ozone Month Day
6 28 5 6
7 23 5 7
8 19 5 8
9 8 5 9
10 NA 5 10
11 7 5 11
12 16 5 12
13 11 5 13
14 14 5 14
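If the NA rows are not wanted at all, one option is to drop NA from the lookup values before matching. The following sketch uses the built-in airquality data as in the question; keys is a name introduced here for illustration.

```r
# Sketch: match on Ozone values while ignoring NA in the index.
data1 <- airquality[1:14, ]
data2 <- data.frame(index = data1$Ozone[6:14])

NA %in% c(1, NA)                             # TRUE: %in% matches NA against NA

keys  <- data2$index[!is.na(data2$index)]    # lookup values without NA
data3 <- data1[data1$Ozone %in% keys, c("Ozone", "Month", "Day")]
data3
```

With this variant neither row 5 nor row 10 is selected, so it differs from the desired output in the question, which keeps the NA on day 10.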
