Replicate rows of dataframe - r

df <- data.frame(x = c(5, 4, 10, 7), y = c(rep(1, 2), rep(2, 2)))
I'm trying to replicate each x value y times and save the result to a variable, one group of y at a time, so that the result looks like this:
a = c(5, 4)
then
a = c(10, 10, 7, 7)
It's probably an easy one, but I'm new to programming. Thanks in advance.

You can use the split function, which returns a list with one element per distinct value of y:
split(rep(df$x, df$y), rep(df$y, df$y))
$`1`
[1] 5 4
$`2`
[1] 10 10 7 7
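To see why this works, it can help to look at the two intermediate vectors. This is only a sketch using the df from the question; the names values and groups are made up for illustration:

```r
df <- data.frame(x = c(5, 4, 10, 7), y = c(1, 1, 2, 2))

# rep() with a vector of counts repeats each x[i] exactly y[i] times
values <- rep(df$x, df$y)   # 5 4 10 10 7 7
groups <- rep(df$y, df$y)   # 1 1 2 2 2 2

# split() then collects the repeated values by their group label
a <- split(values, groups)
a[["2"]]                    # 10 10 7 7
```

Each group can then be pulled out by name with a[["1"]] or a[["2"]].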

Related

Random division of an array into (more than two) equal subarrays in R

I need to divide my data (a single variable) into multiple subgroups of equal size, but the assignment of elements to groups must be random.
Let x <- c(1:12)
and I want to divide it randomly into three subgroups, for example:
G1 <- c(1, 3, 5, 10)
G2 <- c(2, 6, 11, 7)
G3 <- c(12, 4, 9, 8)
You can do:
x <- sample(x)
n_grps = 3
grps <- split(x, rep_len(1:n_grps, length(x)))
print(grps)
$`1`
[1] 1 12 8 9
$`2`
[1] 3 10 5 4
$`3`
[1] 6 11 7 2
Looks like there are two parts to this: randomly shuffle your data, then break the vector apart into a list of (sub)vectors.
You can try something like the following:
x = rnorm(12)
nsplit = 3
split(x[sample(length(x))],rep(1:nsplit,each = length(x)%/%nsplit))
If nsplit doesn't divide evenly into the length of your vector, there are implementation details to take care of, but this is the gist.
P.S. Not to be too pedantic, but x = 1:12 doesn't need the c().
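For the case where the length is not a multiple of the number of groups, one possible approach (a sketch, not the only way to handle it) is to build the group labels with rep_len, as in the first answer, so that group sizes differ by at most one. The seed and the 13-element vector are assumptions chosen only for illustration:

```r
set.seed(42)                 # only to make the sketch reproducible
x <- 1:13                    # 13 elements do not divide evenly into 3 groups
nsplit <- 3

# shuffle, then deal the elements out to groups 1, 2, 3, 1, 2, 3, ...
grps <- split(sample(x), rep_len(seq_len(nsplit), length(x)))
lengths(grps)                # group sizes: 5, 4, 4
```

Every element still lands in exactly one group; the first group simply gets the leftover element.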

R - Fixed length of vector while using lag function

I have a csv file (call it 'csv') and want to use a lag function. Below is my code (ColA and ColB are column names in csv).
X <- subset(csv, ColA == 1)
Y <- c(NA, lag(X$ColB, 1))
Let's say there are 10 rows that satisfy ColA == 1. I want a vector of length 10, but the result after the lag function is a vector of length 11. How do I fix it?
You can use the lagpad function in the ecm package. This will leave off the last element of the vector to retain the same length.
library(ecm)
X <- 1:10
Y <- lagpad(X)
Y
[1] NA 1 2 3 4 5 6 7 8 9
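If you would rather not add a package, the same shift can be sketched in base R. The length of 11 comes from stats::lag not shifting the values of a plain vector (it only changes the time base of a time series), so c(NA, ...) just adds one element. A minimal base-R sketch, using the same toy X as above:

```r
X <- 1:10

# drop the last element and prepend NA, so the length stays 10
Y <- c(NA, head(X, -1))
Y
length(Y)   # 10
```

For the question's data, X would be X$ColB instead of 1:10.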

Named arrays, dataframes and matrices

If I split the rows of my data matrix according to class labels in another vector y, the result is a list with names, like this:
> X <- matrix(c(1,2,3,4,5,6,7,8),nrow=4,ncol=2)
> y <- c(1,3,1,3)
> X_split <- split(as.data.frame(X),y)
> X_split
$`1`
V1 V2
1 1 5
3 3 7
$`3`
V1 V2
2 2 6
4 4 8
I want to loop through the results and do some operations on each matrix, for example sum the elements or sum the columns. How do I access each matrix in a loop so I can do that?
labels = names(X_split)
for (k in labels) {
  # How do I get X_split[k] as a matrix?
  sum_class = sum(X_split[k]) # Doesn't work
}
In fact, I don't really want to deal with dataframes and named arrays at all. Is there a way I can call split without as.data.frame and get a list of matrices or something similar?
To split without converting to a data frame:
X_split <- list(X[c(1, 3), ], X[c(2, 4), ])
More generally, given a vector y of length nrow(X) indicating the group to which each row belongs, you can write this as (drop = FALSE keeps a group with a single row as a matrix):
X_split <- lapply(unique(y), function(i) X[y == i, , drop = FALSE])
To sum the results:
X_sum <- lapply(X_split, sum)
# [[1]]
# [1] 16
# [[2]]
# [1] 20
(or use sapply if you want the result as a vector)
Another option is not to split in the first place and just sum per y. Here's a possible data.table approach
library(data.table)
as.data.table(X)[, sum(sapply(.SD, sum)), by = y]
# y V1
# 1: 1 16
# 2: 3 20
Pretty sure operating directly on the matrix is most efficient:
tapply(rowSums(X),y,sum)
# 1 3
# 16 20
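If you do want to keep the original split and loop over it, the piece missing from the question's loop is double brackets: X_split[k] is a one-element list, while X_split[[k]] is the data frame itself. A small sketch along those lines (the names m and totals are just illustrative):

```r
X <- matrix(c(1, 2, 3, 4, 5, 6, 7, 8), nrow = 4, ncol = 2)
y <- c(1, 3, 1, 3)
X_split <- split(as.data.frame(X), y)

totals <- numeric(0)
for (k in names(X_split)) {
  # [[k]] extracts the data frame; as.matrix() turns it into a numeric matrix
  m <- as.matrix(X_split[[k]])
  totals[k] <- sum(m)      # sum of all elements for this class
  print(colSums(m))        # per-column sums for this class
}
totals
```

Here totals ends up as a named vector with 16 for class 1 and 20 for class 3, matching the results above.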

using sample() or an equivalent on 2 variables of a dataframe

I have a large dataset, a subset of which looks like this:
Var1 Var2
9 29_13x
14 41y
9 51_13x
4 101_13x
14 105y
14 109y
9 113_13x
9 114_13x
14 116y
14 123y
4 124_13x
14 124y
14 126y
4 134_13x
4 135_13x
4 137_13x
9 138_13x
4 139_13x
14 140y
9 142_13x
4 143_13x
My code sits inside a loop, and I would like to sample, without replacement, a certain number of Var2 values (defined by the loop iteration) from each of the Var1 categories. So for i=4 I'd like to get something like this:
29_13x
51_13x
113_13x
138_13x
which are all from Var1=9
41y
109y
126y
140y
from Var1=14, and
101_13x
134_13x
137_13x
139_13x
all from Var1=4.
I can't get sample() to work across more than one variable and can't find any other way to do this. Any suggestions would be greatly appreciated.
Here are two options.
Using sample with by or tapply:
by(mydf$Var2, mydf$Var1, FUN=function(x) sample(x, 4))
tapply(mydf$Var2, mydf$Var1, FUN=function(x) sample(x, 4))
Here's some example output with tapply:
out <- tapply(mydf$Var2, mydf$Var1, FUN=function(x) sample(x, 4))
out
# $`4`
# [1] "101_13x" "143_13x" "124_13x" "134_13x"
#
# $`9`
# [1] "114_13x" "113_13x" "142_13x" "29_13x"
#
# $`14`
# [1] "116y" "109y" "140y" "105y"
You can also extract individual vectors by index position or by name:
out[[3]]
# [1] "116y" "109y" "140y" "105y"
out[["14"]]
# [1] "116y" "109y" "140y" "105y"
Subsetting based on a random variable ranked by a grouping variable:
x <- rnorm(nrow(mydf))
mydf[ave(x, mydf$Var1, FUN = rank) %in% 1:4, ]
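Since the sample size in the question comes from the loop iteration, the tapply version could be wrapped roughly like this. It is a sketch on a handful of rows taken from the question's table; the loop range 1:3 is an assumption, and simplify = FALSE is passed so the result is a list even when i is 1:

```r
# a few rows from the question's data, for illustration only
mydf <- data.frame(
  Var1 = c(9, 14, 9, 4, 14, 14, 9, 9, 4, 4),
  Var2 = c("29_13x", "41y", "51_13x", "101_13x", "105y",
           "109y", "113_13x", "114_13x", "124_13x", "134_13x"),
  stringsAsFactors = FALSE
)

for (i in 1:3) {
  # draw i values of Var2 without replacement within each Var1 group
  out <- tapply(mydf$Var2, mydf$Var1,
                FUN = function(x) sample(x, i), simplify = FALSE)
  print(out)
}
```

Note that sample(x, i) raises an error if a group has fewer than i values, so i has to stay at or below the size of the smallest group.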

Replicate variable based off match of two other variables in R

I've got a seemingly simple question that I can't answer: I've got three vectors:
x <- c(1,2,3,4)
weight <- c(5,6,7,8)
y <- c(1,1,1,2,2,2)
I want to create a new vector that repeats the value of weight each time an element of y matches an element of x, producing the following new weight vector associated with y:
y_weight <- c(5,5,5,6,6,6)
Any thoughts on how to do this (either loop or vectorized)? Thanks
You want the match function:
match(y, x)
returns the indices of the matches, which you then use to build your new weight vector:
weight[match(y, x)]
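Put together with the vectors from the question, this looks roughly as follows. The name idx and the value 99 are only for illustration; the last line shows what happens for a value of y that has no match in x:

```r
x <- c(1, 2, 3, 4)
weight <- c(5, 6, 7, 8)
y <- c(1, 1, 1, 2, 2, 2)

idx <- match(y, x)        # position in x of each element of y: 1 1 1 2 2 2
y_weight <- weight[idx]   # 5 5 5 6 6 6

# a value that is absent from x yields NA rather than an error
weight[match(c(2, 99), x)]   # 6 NA
```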
# Using plyr
library(plyr)
df <- as.data.frame(cbind(x, weight))  # convert to a data frame
df <- rename(df, c(x = "y"))           # rename x to y so the join key matches
y <- as.data.frame(y)                  # convert y to a data frame
mydata <- join(df, y, by = "y", type = "right")
> mydata
y weight
1 1 5
2 1 5
3 1 5
4 2 6
5 2 6
6 2 6
