Sampling elements in data frame [duplicate] - r

This question already has answers here:
Repeat each row of data.frame the number of times specified in a column
(10 answers)
Closed 4 years ago.
I'm trying to resample the elements of a data frame. I'm open to using other data structures if recommended, but my understanding is that a data frame is better for combining strings, numbers, etc.
Let's say my input is this data frame:
16 x y z 2
11 a b c 1
.........
And I'd like to build as output another data structure (I take, another df) like this:
16 x y z
16 x y z
11 a b c
.........
I guess my main issue is the way to append the content, which is on columns df[,1:4].
Thanks in advance, p.

Your description is unclear, but your desired output implies that you want to repeat each row of columns 1:4 the number of times given in column 5. This should do the job:
df[rep(seq_len(nrow(df)), df[, 5]), -5]
# V1 V2 V3 V4
# 1 16 x y z
# 1.1 16 x y z
# 2 11 a b c
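To make the indexing step explicit, here is a minimal reproducible sketch. It assumes the input has the default column names V1 to V5 (the question does not give names) and that the count column holds non-negative whole numbers; the names idx and expanded are only illustrative.

```r
# Hypothetical reconstruction of the question's input
df <- data.frame(V1 = c(16, 11), V2 = c("x", "a"), V3 = c("y", "b"),
                 V4 = c("z", "c"), V5 = c(2, 1), stringsAsFactors = FALSE)

# Build a vector of row numbers in which row i appears V5[i] times
idx <- rep(seq_len(nrow(df)), times = df$V5)

# Index the rows with that vector and drop the count column
expanded <- df[idx, -5]

# Optional: replace row names such as "1.1" with a plain 1..n sequence
rownames(expanded) <- NULL
expanded
```

A count of 0 simply drops the row, because rep() then emits that row number zero times.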

Assuming you're starting with something like:
mydf
# V1 V2 V3 V4 V5
# 1 16 x y z 2
# 2 11 a b c 1
Then, you can just use expandRows from my "splitstackshape" package, like this:
library(splitstackshape)
expandRows(mydf, count = "V5")
# V1 V2 V3 V4
# 1 16 x y z
# 1.1 16 x y z
# 2 11 a b c
By default, the function assumes that you are expanding your dataset based on an existing column, but you can just as easily add a numeric vector as the count argument, and set count.is.col = FALSE.

If you want to sample with replacement n rows from df data frame:
df[sample(nrow(df), n, replace=TRUE), ]
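A small sketch of that one-liner in context, assuming a two-row data frame like the one in the question and an arbitrary n of 5 (both are illustrative choices, as is the seed):

```r
set.seed(42)  # only so the draw is reproducible

df <- data.frame(V1 = c(16, 11), V2 = c("x", "a"), stringsAsFactors = FALSE)
n <- 5

# sample() draws n row numbers from 1..nrow(df); with replace = TRUE
# the same row can be drawn more than once, so n may exceed nrow(df)
resampled <- df[sample(nrow(df), n, replace = TRUE), ]
nrow(resampled)
```

Unlike the rep() approach, the number of copies of each row here is random rather than taken from a count column.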

Related

R Difference with previous column across multiple columns

I have a dataframe like this that resulted from a cumsum of variables:
id v1 v2 v3
1 4 5 9
2 1 1 4
I would like to get the difference between consecutive columns, so that the dataframe is transformed to:
id v1 v2 v3
1 4 1 4
2 1 0 3
So effectively "de-accumulating" the values by taking the difference with the previous column. This is a small example; the original df has around 150 columns.
Thx!
x <- read.table(header=TRUE, text="
id v1 v2 v3
1 4 5 9
2 1 1 4")
x[,c("v1","v2","v3")] <- cbind(x[,"v1"], t(apply(x[,c("v1","v2","v3")], 1, diff)))
x
# id v1 v2 v3
# 1 1 4 1 4
# 2 2 1 0 3
Explanation:
Up front, a note: when using apply on a data.frame, it converts the argument to a matrix. This means that if you have any character columns in the argument passed to apply, then the entire matrix will be character, likely not what you want. Because of this, it is safer to only select columns you need (and reassign them specifically).
apply(.., MARGIN=1, ...) returns its output in an orientation transposed from what you might expect, so I have to wrap it in t(...).
I'm using diff, which returns a vector of length one shorter than the input, so I'm cbinding the original column to the return from t(apply(...)).
Just as I had to be specific about which columns to pass to apply, I'm similarly specific about which columns will be replaced by the return value.
A simple for loop might do the trick, but for larger data it will be slower than other approaches.
df <- data.frame(id = c(1,2), v1 = c(4,1), v2 = c(5,1))
df2 <- df
for (i in 3:ncol(df)) {
  df2[, i] <- df[, i] - df[, i - 1]
}
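For wide data (the question mentions around 150 columns), the same de-accumulation can also be written without a loop or apply, by subtracting a copy of the columns shifted by one position. This is only a sketch, assuming every value column is numeric and that vars (a name introduced here) lists them in cumulative order:

```r
x <- data.frame(id = 1:2, v1 = c(4, 1), v2 = c(5, 1), v3 = c(9, 4))

# Names of the cumulative columns, in order
vars <- c("v1", "v2", "v3")

result <- x
# Columns 2..k minus columns 1..(k-1); data frame arithmetic is positional,
# and the right-hand side is computed from the untouched original x
result[vars[-1]] <- x[vars[-1]] - x[vars[-length(vars)]]
result
# v1 is unchanged; v2 becomes 1, 0 and v3 becomes 4, 3
```

Because the columns stay in a data frame throughout, there is no matrix conversion and no need for t().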

Union of data frame in R

I have 4 dataframes in a list L like below:
L[[1]]:
V1 V2
B C
A B
Z B
L[[2]]:
V1 V2
B D
A B
Z B
L[[3]]:
V1 V2
Z Y
X Z
N Z
L[[4]]:
V1 V2
Z J
X Z
N Z
These come from graphs with the heads C, D, Y, and J.
Obviously, C and D are from the same graph, and so are Y and J.
How can I merge C with D and Y with J, given that these dataframes are in a list L?
What I'm thinking of is iterating over the list and comparing pairwise: if dfx intersects with dfy, merge them. Can anyone help with the R code?
Edit:
What I'm thinking is like this:
Get the first element and compare it to the second. If they match, merge them, save the result to the first element, and remove the second element. Move on to the next element until the last one is reached. Repeat until no more elements are removed. The list will then consist of the remaining, merged elements. Does anyone know how to implement this in code?
Output expected :
L[[1]]:
V1 V2
B C
B D
A B
Z B
L[[2]]:
V1 V2
Z Y
Z J
X Z
N Z
Could this be an approach to a solution for you?
# create list of data.frames
ld <- list(
data.frame(V1 = c("B","A","Z"), V2 = c("C","B","B")),
data.frame(V1 = c("B","A","Z"), V2 = c("D","B","B")),
data.frame(V1 = c("Z","X","N"), V2 = c("Y","Z","Z")),
data.frame(V1 = c("Z","X","N"), V2 = c("J","Z","Z"))
)
# suggested solution
union_ld <- data.table::rbindlist(ld)
unique(union_ld)
Results:
V1 V2
1: B C
2: A B
3: Z B
4: B D
5: Z Y
6: X Z
7: N Z
8: Z J
Update 1
Quick hack: two data frames in a list as requested by the OP. According to comment of OP, the order of the rows within each result data frame doesn't matter.
list(
unique(data.table::rbindlist(ld[1:2])),
unique(data.table::rbindlist(ld[3:4]))
)
results in:
[[1]]
V1 V2
1: B C
2: A B
3: Z B
4: B D
[[2]]
V1 V2
1: Z Y
2: X Z
3: N Z
4: Z J
The proposed solution combines the first two data frames in the list into one data frame and removes the duplicate rows. This is repeated for the last two data frames in the list. Then, the resulting data frames are combined into a list again.
Update 2
This solution uses rbindlist from package data.table. If you don't like this, the result can be returned as "pure" data frames like this
library(data.table)
list(
setDF(unique(rbindlist(ld[1:2]))),
setDF(unique(rbindlist(ld[3:4])))
)
Update 3
According to OP's comment there are more data frames which need to be combined in several groups.
# set up a list of vectors of numbers of data.frames to combine
dfs_to_combine <- list(c(1:2), c(3:4))
dfs_to_combine
[[1]]
[1] 1 2
[[2]]
[1] 3 4
# now, combine data.frames as specified
library(data.table)
lapply(dfs_to_combine, function(x) setDF(unique(rbindlist(ld[x]))))
[[1]]
V1 V2
1 B C
2 A B
3 Z B
4 B D
[[2]]
V1 V2
1 Z Y
2 X Z
3 N Z
4 Z J
This is just to reproduce your initial example. If you want to combine differently change the numbers, e.g.,
dfs_to_combine <- list(c(1), c(2, 4), c(3))
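If the groups are not known in advance, the iterate-and-merge idea from the question's edit can be sketched in base R. This is a simplified, single-pass version under stated assumptions: two data frames "intersect" when they share at least one identical row, and each data frame is merged into the first earlier group it overlaps. It does not handle a later data frame that bridges two existing groups. The names row_keys and merged are introduced here for illustration.

```r
ld <- list(
  data.frame(V1 = c("B", "A", "Z"), V2 = c("C", "B", "B"), stringsAsFactors = FALSE),
  data.frame(V1 = c("B", "A", "Z"), V2 = c("D", "B", "B"), stringsAsFactors = FALSE),
  data.frame(V1 = c("Z", "X", "N"), V2 = c("Y", "Z", "Z"), stringsAsFactors = FALSE),
  data.frame(V1 = c("Z", "X", "N"), V2 = c("J", "Z", "Z"), stringsAsFactors = FALSE)
)

# One string per row, so rows can be compared with set operations
row_keys <- function(d) paste(d$V1, d$V2, sep = "|")

merged <- list()
for (d in ld) {
  placed <- FALSE
  for (i in seq_along(merged)) {
    # Shared rows mean the two data frames belong to the same group
    if (length(intersect(row_keys(merged[[i]]), row_keys(d))) > 0) {
      merged[[i]] <- unique(rbind(merged[[i]], d))
      placed <- TRUE
      break
    }
  }
  # No overlap with any existing group: start a new one
  if (!placed) merged[[length(merged) + 1]] <- d
}

length(merged)
```

For the example list this yields two data frames of four rows each, matching the grouping that dfs_to_combine <- list(c(1:2), c(3:4)) specifies by hand.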

How to stretch a data frame in R? [duplicate]

This question already has answers here:
Repeat each row of data.frame the number of times specified in a column
(10 answers)
Closed 4 years ago.
Given such a data frame:
V1 V2
x 3
y 2
z 4
...
I'd like to transform it to:
V1
x
x
x
y
y
z
z
z
z
Each element in V1 is repeated n times, where n is the corresponding value in V2. Do you know how to implement this quickly without a for loop? Thanks in advance!
Simple, where x is your data.frame:
data.frame(V1 = rep(x$V1, x$V2))
We could use expandRows
library(splitstackshape)
expandRows(df1, "V2")
# V1
#1 x
#1.1 x
#1.2 x
#2 y
#2.1 y
#3 z
#3.1 z
#3.2 z
#3.3 z

In R, extract rows based on strings in different columns

Sorry if the solution to my problem is already out there, and I overlooked it. There are a lot of similar topics which all helped me understand the basics of what I'm trying to do, but did not quite solve my exact problem.
I have a data frame df:
> type = c("A","A","A","A","A","A","B","B","B","B","B","B")
> place = c("x","y","z","x","y","z","x","y","z","x","y","z")
> value = c(1:12)
>
> df=data.frame(type,place,value)
> df
type place value
1 A x 1
2 A y 2
3 A z 3
4 A x 4
5 A y 5
6 A z 6
7 B x 7
8 B y 8
9 B z 9
10 B x 10
11 B y 11
12 B z 12
>
(my real data has 3 different values in type and 10 in place, if that makes a difference)
I want to extract rows based on the strings in columns m and n.
E.g. I want to extract all rows that contain A in type and x and z in place, or all rows with A and B in type and y in place.
This works perfectly with subset, but I want to run my scripts on different combinations of extracted rows, and adjusting the subset command every time isn't very effective.
I thought of using a vector containing as elements what to get from type and place, respectively.
I tried:
v=c("A","x","z")
df.extract <- df[df$type&df$place %in% v]
but this returns an error.
I'm a total beginner with R and programming, so please bear with me.
You could try
df[df$type=='A' & df$place %in% c('x','y'),]
# type place value
#1 A x 1
#2 A y 2
#4 A x 4
#5 A y 5
For the second case
df[df$type %in% c('A', 'B') & df$place=='y',]
Update
Suppose you have many columns and need to subset the dataset based on values from several of them. For example:
set.seed(24)
df1 <- cbind(df, df[sample(1:nrow(df)),], df[sample(1:nrow(df)),])
colnames(df1) <- paste0(c('type', 'place', 'value'), rep(1:3, each=3))
row.names(df1) <- NULL
You can create a list of the values from the columns of interest
v1 <- setNames(list('A', 'x', c('A', 'B'), 'x', 'B', 'z'),
               paste0(c('type', 'place'), rep(1:3, each=2)))
and then use Reduce
df1[Reduce(`&`,Map(`%in%`, df1[names(v1)], v1)),]
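The same Map/Reduce pattern also covers the original two-column case, which may be easier to follow. A minimal sketch on the question's df, where cond and keep are names introduced here: each list element is named after a column and holds the values allowed in it.

```r
df <- data.frame(
  type  = rep(c("A", "B"), each = 6),
  place = rep(c("x", "y", "z"), times = 4),
  value = 1:12,
  stringsAsFactors = FALSE
)

# Allowed values per column; change this list to extract a different subset
cond <- list(type = "A", place = c("x", "z"))

# Map() tests each named column against its allowed values,
# Reduce(`&`, ...) keeps only the rows that pass every test
keep <- Reduce(`&`, Map(`%in%`, df[names(cond)], cond))
df.extract <- df[keep, ]
df.extract
```

Only the cond list has to change between runs, which addresses the wish to avoid rewriting the subset call each time.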
You can write a function extract:
extract <- function(df, type, place) {
  df[df$type %in% type & df$place %in% place, ]
}
that will work for the different subsets you want:
df.extract<-extract(df=df,type="A",place=c("x","y")) # or just extract(df,"A",c("x","y"))
> df.extract
type place value
1 A x 1
2 A y 2
4 A x 4
5 A y 5
df.extract<-extract(df=df,type=c("A","B"),place="y") # or just extract(df,c("A","B"),"y")
> df.extract
type place value
2 A y 2
5 A y 5
8 B y 8
11 B y 11

Select random row for each unique value in one specific column of data frame

I have quite a simple request that I cannot, however, solve with a single line of code.
All I want is to subset an input data frame so that the output data frame contains only one randomly selected row for each unique value (factor level) of one particular column.
E.g. I have (v2 is the column in question):
v1 v2
1 A 1
2 B 1
3 C 2
4 A 1
5 B 2
6 B 1
7 B 1
8 C 2
9 D 1
10 E 1
And want to have as an output data frame:
v1 v2
1 B 1
2 C 2
Thank you for any suggestions in advance!
This is way more than what you asked for, but I wrote a function called stratified that lets you take random samples from a data.frame by one or more group variables.
You can load it and use it like this:
library(devtools)
source_gist("https://gist.github.com/mrdwab/6424112")
# [1] "https://raw.github.com/gist/6424112"
# SHA-1 hash of file is 0006d8548785ec8a5651c3dd599648cc88d153a4
## One row
stratified(mydf, "v2", 1)
# v1 v2
# 10 E 1
# 8 C 2
## Two rows
stratified(mydf, "v2", 2)
# v1 v2
# 2 B 1
# 6 B 1
# 3 C 2
# 5 B 2
I'll add official documentation to the function at some point, but here's a summary to help you get the best use out of it:
The arguments to stratified are:
df: The input data.frame
group: A character vector of the column or columns that make up the "strata".
size: The desired sample size.
If size is a value less than 1, a proportionate sample is taken from each stratum.
If size is a single integer of 1 or more, that number of samples is taken from each stratum.
If size is a vector of integers, the specified number of samples is taken for each stratum. It is recommended that you use a named vector. For example, if you have two strata, "A" and "B", and you wanted 5 samples from "A" and 10 from "B", you would enter size = c(A = 5, B = 10).
select: This allows you to subset the groups in the sampling process. This is a list. For instance, if your group variable was "Group", and it contained three strata, "A", "B", and "C", but you only wanted to sample from "A" and "C", you can use select = list(Group = c("A", "C")).
replace: For sampling with replacement.
You can iterate over the unique values in your column, find the row indices for each value, and select one row index at random using sample. Like this:
# Set seed for reproducible results
set.seed(1)
# Generate indices
ind <- sapply( unique( df$v2 ) , function(x) sample( which(df$v2==x) , 1 ) )
# Subset data.frame
df[ ind , ]
# v1 v2
#2 B 1
#5 B 2
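One caveat with sample(which(...), 1): when a value occurs in only one row, which() returns a single number k, and sample(k, 1) then draws from 1:k instead of returning k. A hedged variant that avoids this by sampling a position within the index vector is sketched below; pick_one is a helper name introduced here, and the data frame is rebuilt from the question's example.

```r
set.seed(1)  # only so the draw is reproducible

df <- data.frame(
  v1 = c("A", "B", "C", "A", "B", "B", "B", "C", "D", "E"),
  v2 = c(1, 1, 2, 1, 2, 1, 1, 2, 1, 1),
  stringsAsFactors = FALSE
)

# Pick one element of idx by position, which is safe even when length(idx) == 1
pick_one <- function(idx) idx[sample.int(length(idx), 1)]

# split() groups the row numbers by v2; vapply() picks one row number per group
ind <- vapply(split(seq_len(nrow(df)), df$v2), pick_one, integer(1))
res <- df[ind, ]
res
```

The result has exactly one row per distinct value of v2, ordered by that value.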

Resources