I have a dataframe in R that I want to randomize, keeping the first column as it is but randomizing the last two columns together, so that values that appear in the same row in these columns still appear in the same row after randomizing. So if I started with this:
1 a b c
2 d e f
3 g h i
when randomized it might look like:
1 a e f
2 d h i
3 g b c
I know that sample works fine, but does it preserve the correspondence between the columns within each row?
> t <- data.frame(matrix(nrow=4,ncol=10,data=1:40))
> t
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
1 1 5 9 13 17 21 25 29 33 37
2 2 6 10 14 18 22 26 30 34 38
3 3 7 11 15 19 23 27 31 35 39
4 4 8 12 16 20 24 28 32 36 40
> columns_to_random <- c(8,9,10)
> t[,columns_to_random] <- t[sample(1:nrow(t),size=nrow(t)), columns_to_random]
> t
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
1 1 5 9 13 17 21 25 32 36 40
2 2 6 10 14 18 22 26 29 33 37
3 3 7 11 15 19 23 27 30 34 38
4 4 8 12 16 20 24 28 31 35 39
Just sample one column at a time and you'll be fine. For example:
data[,2] = sample(data[,2])
data[,3] = sample(data[,3])
...
If you have many columns, you can extend this like:
data[,-1] = apply(data[,-1], 2, sample)
EDIT: With your clarification about row equivalence, this is just:
data[,-1] = data[sample(nrow(data)),-1]
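A minimal sketch of why indexing with one shared row permutation keeps the shuffled columns aligned. The data frame, its column names (id, col1, col2, col3) and the seed are assumptions made up for illustration, modelled on the example in the question:

```r
set.seed(42)  # assumption: fixed seed only so the sketch is reproducible

# Hypothetical data mirroring the question's example
data <- data.frame(id   = 1:3,
                   col1 = c("a", "d", "g"),
                   col2 = c("b", "e", "h"),
                   col3 = c("c", "f", "i"),
                   stringsAsFactors = FALSE)

cols <- c("col2", "col3")       # columns to shuffle together
perm <- sample(nrow(data))      # ONE permutation of the row indices

shuffled <- data
shuffled[, cols] <- data[perm, cols]  # same permutation applied to both columns

# Every (col2, col3) pair of the result also exists in the original data
pairs_before <- paste(data$col2, data$col3)
pairs_after  <- paste(shuffled$col2, shuffled$col3)
```

Because both columns are reordered by the same perm, a pair such as ("e", "f") can move to another row but is never split up, whereas calling sample separately on each column would break the pairing.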
What do you mean by "values equivalence"?
Honestly, I do not quite get the question, but here's my guess. As you said, you could use sample, but apply it separately to each of your columns, e.g. with apply:
# create a reproducible example
test <- data.frame(indx=c(1,2,3),col1=c("a","d","g"),
col2=c("b","e","h"),col3=c("c","f","i"))
xyz <- apply(test[,-1],MARGIN=2,sample)
as.data.frame(xyz)
Approach using colwise in plyr for elegant column wise permutation:
test <- data.frame(matrix(nrow=4,ncol=10,data=1:40))
Load plyr
require(plyr)
Create a column-wise "sample" function
colwise.sample <- colwise(sample)
Apply to the desired columns
permutation.test <- test
permutation.test[,c(1,3,4)] <- colwise.sample(test[,c(1,3,4)])
Related
I have a dataframe of incident cases of a disease, by year and age, which looks like this (it is much larger than this example):
88 89 90 91
22 1 2 5 14
23 1 6 9 15
24 2 5 12 11
25 3 3 7 20
What I would like to do is iteratively sum the diagonals, to get this result
88 89 90 91
22 1 2 5 14
23 1 7 11 20
24 2 6 19 22
25 3 5 13 39
Or, put another way; original dataset:
Y1 Y2 Y3 Y4
22 A1 B1 C1 D1
23 A2 B2 C2 D2
24 A3 B3 C3 D3
25 A4 B4 C4 D4
Final dataset:
Y1 Y2 Y3 Y4
22 A1 B1 C1 D1
23 A2 A1+B2 B1+C2 C1+D2
24 A3 A2+B3 A1+B2+C3 B1+C2+D3
25 A4 A3+B4 A2+B3+C4 A1+B2+C3+D4
Is there any way to do this in R?
I have seen the question "How to sum over diagonals of data frame", but that asker only wants the total sum, whereas I want the cumulative sum along each diagonal.
Thanks.
Use ave noting that row(m) - col(m) is constant on diagonals:
ave(m, row(m) - col(m), FUN = cumsum)
## 88 89 90 91
## 22 1 2 5 14
## 23 1 7 11 20
## 24 2 6 19 22
## 25 3 5 13 39
It is assumed that m is a matrix as in the Note below. If you have a data frame then convert it to a matrix first.
Note
The input matrix m in reproducible form is:
Lines <- " 88 89 90 91
22 1 2 5 14
23 1 6 9 15
24 2 5 12 11
25 3 3 7 20"
m <- as.matrix(read.table(text = Lines, check.names = FALSE))
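To see why the grouping works, here is a small sketch that prints the diagonal index itself. It rebuilds the same matrix with matrix() instead of read.table; the names d and res are only illustrative:

```r
# Same data as the example above, entered column by column
m <- matrix(c(1, 1, 2, 3,
              2, 6, 5, 3,
              5, 9, 12, 7,
              14, 15, 11, 20),
            nrow = 4,
            dimnames = list(22:25, 88:91))

# row(m) - col(m) takes the same value for every cell on one diagonal
d <- row(m) - col(m)
print(d)

# ave() splits the cells by that index and applies cumsum within each group;
# cells of one diagonal are visited from top-left to bottom-right
res <- ave(m, d, FUN = cumsum)
print(res)
```

The main diagonal has index 0, the diagonals below it have positive indices and those above it negative ones, so each diagonal gets its own running total.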
I have a dataframe df and a row vector named mult of the same size as a row in df.
I need to multiply each row of df by mult as element-wise multiplication. But I do not want to write a loop, as there are probably faster ways to do this.
Here are my failed attempts:
df = data.frame(matrix(nrow = 5, ncol = 5))
df[,] = 5
mult = as.data.frame(c(1,2,3,4,5))
df * t(mult[1:5,])
Whether or not I transpose mult[1:5,], I get the same result.
The correct answer should be a dataframe of five rows of 5 10 15 20 25.
However, I am getting the result as if I am doing element-wise multiplying by mult as a column vector.
5 5 5 5 5
10 10 10 10 10
15 15 15 15 15
20 20 20 20 20
25 25 25 25 25
Multiplying a row at a time works, but that will involve a loop.
I have searched SO and found sweep(), but it does not seem to work in my case.
df * matrix(rep(mult[,1], NROW(df)), nrow = NROW(df), byrow = TRUE)
# X1 X2 X3 X4 X5
#1 5 10 15 20 25
#2 5 10 15 20 25
#3 5 10 15 20 25
#4 5 10 15 20 25
#5 5 10 15 20 25
We could also just replicate 'mult' along the columns:
df * mult[,1][col(df)]
# X1 X2 X3 X4 X5
#1 5 10 15 20 25
#2 5 10 15 20 25
#3 5 10 15 20 25
#4 5 10 15 20 25
#5 5 10 15 20 25
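Since the question mentions sweep(), here is a hedged sketch of how it could be applied. It assumes mult is stored as a plain numeric vector rather than a one-column data frame, and it converts df to a matrix first; the names scaled and alt are only illustrative:

```r
df   <- data.frame(matrix(5, nrow = 5, ncol = 5))
mult <- c(1, 2, 3, 4, 5)   # assumption: plain numeric vector, one entry per column

# MARGIN = 2 lines mult up with the columns, so column j is multiplied by mult[j]
scaled <- sweep(as.matrix(df), 2, mult, `*`)

# Equivalent trick: R recycles a vector down the columns, so transpose,
# multiply, and transpose back
alt <- t(t(as.matrix(df)) * mult)

print(scaled)
```

Both forms rely on the vector having exactly one entry per column; wrap the result in as.data.frame() if a data frame is needed afterwards.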
I have the following data:
x1 x2 x3 x4
34 14 45 53
2 8 18 17
34 14 45 20
19 78 21 48
2 8 18 5
In rows 1 and 3, and in rows 2 and 5, the values in columns x1, x2 and x3 are equal. How can I output only those 4 rows, which share equal values? The output should be in the following format:
x1 x2 x3 x4
34 14 45 53
34 14 45 20
2 8 18 17
2 8 18 5
Please ask me questions if something is unclear.
ADDITIONAL QUESTION: in the output
x1 x2 x3 x4
34 14 45 53
34 14 45 20
2 8 18 17
2 8 18 5
find the sum of values in last column:
x1 x2 x3 x4
34 14 45 73
2 8 18 22
You can do this with duplicated, which flags duplicated rows when passed a matrix or data frame. Since you're only checking the first three columns, pass dat[,-4] to the function. The second call with fromLast=TRUE also flags the first occurrence of each duplicated row.
dat[duplicated(dat[,-4]) | duplicated(dat[,-4], fromLast=TRUE),]
# x1 x2 x3 x4
# 1 34 14 45 53
# 2 2 8 18 17
# 3 34 14 45 20
# 5 2 8 18 5
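For the additional question, a possible base R sketch that reuses the same duplicated mask and then sums x4 per group with aggregate. The data frame is retyped from the question, and the names key, dup and sums are only illustrative:

```r
dat <- data.frame(x1 = c(34, 2, 34, 19, 2),
                  x2 = c(14, 8, 14, 78, 8),
                  x3 = c(45, 18, 45, 21, 18),
                  x4 = c(53, 17, 20, 48, 5))

key <- dat[, c("x1", "x2", "x3")]

# TRUE for every row whose (x1, x2, x3) combination occurs more than once
dup <- duplicated(key) | duplicated(key, fromLast = TRUE)

# Sum x4 within each repeated combination
sums <- aggregate(x4 ~ x1 + x2 + x3, data = dat[dup, ], FUN = sum)
print(sums)
```

Note that aggregate returns the groups in sorted order, so the rows may need reordering if the original order matters.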
An alternative using ave:
dat[ave(dat[,1], dat[-4], FUN=length) > 1,]
# x1 x2 x3 x4
#1 34 14 45 53
#2 2 8 18 17
#3 34 14 45 20
#5 2 8 18 5
Learned this one the other day. You won't need to re-order the output.
s <- split(dat, do.call(paste, dat[-4]))
Reduce(rbind, Filter(function(x) nrow(x) > 1, s))
# x1 x2 x3 x4
# 2 2 8 18 17
# 5 2 8 18 5
# 1 34 14 45 53
# 3 34 14 45 20
There is another way to solve both questions using two packages.
library(DescTools)
library(dplyr)
dat[AllDuplicated(dat[1:3]), ] %>% # this line is to find duplicates
group_by(x1, x2) %>% # the lines followed are to sum up
mutate(x4 = sum(x4)) %>%
unique()
# Source: local data frame [2 x 4]
# Groups: x1, x2
#
# x1 x2 x3 x4
# 1 34 14 45 73
# 2 2 8 18 22
You can also use the table command:
> d1 = ddf[ddf$x1 %in% names(which(table(ddf$x1)>1)),]
> d2 = ddf[ddf$x2 %in% names(which(table(ddf$x2)>1)),]
> rr = rbind(d1, d2)
> rr[!duplicated(rr),]
x1 x2 x3 x4
1 34 14 45 53
2 2 8 18 17
3 34 14 45 20
5 2 8 18 5
For the sum in the last column:
> library(data.table)
> rrt = data.table(rr[!duplicated(rr),])
> rrt[,x4:=sum(x4),by=x1]
> rrt[!duplicated(x1)]
x1 x2 x3 x4
1: 34 14 45 73
2: 2 8 18 22
The first part is similar to the above; let z be your data.frame:
library(DescTools)
(zz <- Sort(z[AllDuplicated(z[, -4]), ], decreasing=TRUE) )
# now aggregate
aggregate(zz[, 4], zz[, -4], FUN=sum)
# use Sort again, if needed...
I have a dataframe of records of varying lengths, with NAs at the end. If there are more than three x-values in a record, I want to make the value of the third x-value equal to the value of the last x-value. Each record already tells me how many x-values it has.
I can make x3 be equal to the name of the last x-value (x4 or x5 etc) but what I really need is to make x3 take the value of that last x-value.
I'm sure there is some simple answer. Any help would be greatly appreciated! Thank you.
Here is a simple case:
ii <- "n x1 x2 x3 x4 x5 x6
1 3 30 40 20 NA NA NA
2 4 10 50 16 25 NA NA
3 6 20 15 26 16 18 28
4 5 10 10 18 17 19 NA
5 2 65 41 NA NA NA NA
6 5 10 11 23 16 23 NA
7 1 99 NA NA NA NA NA"
df <- read.table(text=ii, header = TRUE, na.strings="NA", colClasses="character")
oo <- "n x1 x2 x3
1 3 30 40 20
2 4 10 50 25
3 6 20 15 28
4 5 10 10 19
5 2 65 41 NA
6 5 10 11 23
7 1 99 NA NA"
desireddf <- read.table(text=oo, header = TRUE, na.strings="NA", colClasses="character")
df$lastx <- as.character(paste("x", df$n, sep=""))
#df$lastx <- df[[get(df$lastx)]] #How can I make lastx equal to the _value_ of lastx???
df[df$n>3, c('x3')] <- df[df$n>3, 'lastx']
df <- df[,1:4]
print(df)
yields the following, not the desireddf above.
n x1 x2 x3
1 3 30 40 20
2 4 10 50 x4
3 6 20 15 x6
4 5 10 10 x5
5 2 65 41 <NA>
6 5 10 11 x5
7 1 99 <NA> <NA>
This seems like a pretty arbitrary task, but here goes:
desireddf <- data.frame(n=df$n, x1=df$x1, x2=df$x2, x3=df[cbind(1:nrow(df), paste("x", pmax(3,as.numeric(df$n)), sep=""))])
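The one-liner relies on indexing with a two-column matrix, where each row of the index picks one (row, column) cell. A sketch of the same idea with numeric columns, assuming the data is read without colClasses="character"; the names m, last and x3_new are only illustrative:

```r
ii <- "n x1 x2 x3 x4 x5 x6
1 3 30 40 20 NA NA NA
2 4 10 50 16 25 NA NA
3 6 20 15 26 16 18 28
4 5 10 10 18 17 19 NA
5 2 65 41 NA NA NA NA
6 5 10 11 23 16 23 NA
7 1 99 NA NA NA NA NA"
df <- read.table(text = ii, header = TRUE)   # assumption: numeric columns

m <- as.matrix(df[, paste0("x", 1:6)])       # only the x-values

# Column to read for each record: the last x-value, but never left of x3
last <- pmax(3, df$n)

# Each row of the index matrix is one (row, column) pair
x3_new <- m[cbind(seq_len(nrow(df)), last)]

result <- data.frame(df[, c("n", "x1", "x2")], x3 = x3_new)
print(result)
```

Records with fewer than three x-values read their own x3 cell, which is NA, so they come out unchanged.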