The original vector x:
x = 1:20
What I am looking for is a vector y that duplicates every n-th element of x. For instance, when n = 4:
n = 4
y = c(1,2,3,4,4,5,6,7,8,8,9,10,11,12,12,13,14,15,16,16,17,18,19,20,20)
I'm actually doing this for matrices, and I think it relates to using apply with MARGIN = 2, but I couldn't figure it out right off the bat.
Could anyone kindly show me a quick solution?
We can also use
v1 <- rep(1, length(x))
v1[c(FALSE, FALSE, FALSE, TRUE)] <- 2
rep(x, v1)
#[1] 1 2 3 4 4 5 6 7 8 8 9 10 11 12 12 13 14 15 16 16 17 18 19 20 20
Or as @MichaelChirico commented, the 2nd line of code can be made more general with
v1[seq_along(v1) %% n == 0L] = 2
Or in a one-liner with ifelse (from @JonathanCarroll's comments)
rep(x, ifelse(seq_along(x) %% n, 1, 2))
Indeed matrices are the way to go
duplast = function(M) rbind(M, M[nrow(M), ])
c(duplast(matrix(x, nrow = 4L)))
# [1] 1 2 3 4 4 5 6 7 8 8 9 10 11 12 12 13 14 15 16 16 17 18 19 20
# [25] 20
If you wanted to use apply:
c(apply(matrix(x, nrow = 4L), 2L, function(C) c(C, C[length(C)])))
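Since the question mentions matrices, the same rep() idea can also be applied to row indices. This is only a minimal sketch, assuming the goal is to duplicate every n-th row of a matrix; dup_every_nth and dup_every_nth_row are hypothetical helper names, not taken from the answers above.

```r
# Duplicate every n-th element of a vector (same idea as the rep() one-liner).
dup_every_nth <- function(x, n) {
  times <- ifelse(seq_along(x) %% n == 0, 2L, 1L)
  rep(x, times)
}

# Duplicate every n-th row of a matrix by repeating its row indices.
dup_every_nth_row <- function(m, n) {
  rows <- seq_len(nrow(m))
  m[dup_every_nth(rows, n), , drop = FALSE]
}

x <- 1:20
y <- dup_every_nth(x, 4)

m <- matrix(1:16, nrow = 8)
m2 <- dup_every_nth_row(m, 4)  # rows 4 and 8 each appear twice
```

Working on indices rather than values keeps the matrix version independent of what the cells contain.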
I am trying to make a piece-wise function. This is a really basic one. I want y to be a list of values (preferably not just a list of integers but a list of real numbers like (1.34, 20.92) in the future).
How might I make a piece-wise function?
y <- 1:10
if (y < 2){
print("CAN'T COMPUTE")
} else if (y >= 2 & y < 6){
print(y^2)
} else {
print(y * 2)
}
Let me give it a try:
library("dplyr")
y <- 1:10
y %>%
as_tibble() %>%
mutate(res = case_when(y < 2 ~ "CAN'T COMPUTE",
y >= 2 & y < 6 ~ as.character(y^2),
TRUE ~ as.character(y*2)))
Here are the results:
# A tibble: 10 x 2
   value res
   <int> <chr>
 1     1 CAN'T COMPUTE
 2     2 4
 3     3 9
 4     4 16
 5     5 25
 6     6 12
 7     7 14
 8     8 16
 9     9 18
10    10 20
Here are some base R approaches. We use NA instead of a character string so that the result is a numeric vector. The first uses a nested ifelse. The second uses a single ifelse to select between NA and the other values, which are computed with a formula. The third computes which leg of the result is wanted (1, 2 or 3) and then uses switch to select that leg. The fourth is a variation of the third that uses findInterval to compute the leg number.
ifelse(y < 2, NA, ifelse(y < 6, y^2, 2*y))
## [1] NA 4 9 16 25 12 14 16 18 20
ifelse(y < 2, NA, (y < 6) * y^2 + (y >= 6) * 2*y)
## [1] NA 4 9 16 25 12 14 16 18 20
mapply(switch, 1 + (y >= 2) + (y >= 6), NA, y^2, 2*y)
## [1] NA 4 9 16 25 12 14 16 18 20
mapply(switch, findInterval(y, c(-Inf, 2, 6, Inf), left.open = FALSE), NA, y^2, 2*y)
## [1] NA 4 9 16 25 12 14 16 18 20
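The question also asks about real-valued input. As a small sketch, the nested ifelse can be wrapped in a function and called on non-integer values; the name piecewise is a hypothetical choice, and NA_real_ is used (an assumption about the desired output) so that the result stays numeric.

```r
# Piecewise function: NA below 2, y^2 on [2, 6), and 2 * y from 6 upward.
piecewise <- function(y) {
  ifelse(y < 2, NA_real_, ifelse(y < 6, y^2, 2 * y))
}

res <- piecewise(c(1.34, 3.5, 20.92))
res
```

Because ifelse is vectorized, the same function works on a single number, a vector, or a data frame column.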
df <- data.frame(x = seq(1:10))
I want this:
df$y <- c(1, 2, 3, 4, 5, 15, 20 , 25, 30, 35)
i.e. each y is the sum of previous five x values. This implies the first
five y will be same as x
What I get is this:
df$y1 <- c(df$x[1:4], RcppRoll::roll_sum(df$x, 5))
x y y1
1 1 1
2 2 2
3 3 3
4 4 4
5 5 15
6 15 20
7 20 25
8 25 30
9 30 35
10 35 40
In summary, I need y but I am only able to achieve y1
1) enhanced sum function Define a function Sum which sums its first 5 values if it receives 6 values and returns the last value otherwise. Then use it with partial=TRUE in rollapplyr:
library(zoo) # provides rollapplyr, rollsumr and rollapply
x <- df$x
Sum <- function(x) if (length(x) < 6) tail(x, 1) else sum(head(x, -1))
rollapplyr(x, 6, Sum, partial = TRUE)
## [1] 1 2 3 4 5 15 20 25 30 35
2) sum 6 and subtract off original Another possibility is to take the running sum of 6 elements filling in the first 5 elements with NA and subtracting off the original vector. Finally fill in the first 5.
replace(rollsumr(x, 6, fill = NA) - x, 1:5, head(x, 5))
## [1] 1 2 3 4 5 15 20 25 30 35
3) specify offsets A third possibility is to use the offset form of width to specify the prior 5 elements:
c(head(x, 5), rollapplyr(x, list(-(1:5)), sum))
## [1] 1 2 3 4 5 15 20 25 30 35
4) alternative specification of offsets In this alternative we specify an offset of 0 for each of the first 5 elements and offsets of -(1:5) for the rest.
width <- replace(rep(list(-(1:5)), length(x)), 1:5, list(0))
rollapply(x, width, sum)
## [1] 1 2 3 4 5 15 20 25 30 35
Note
The scheme for filling in the first 5 elements seems quite unusual. You might consider using partial sums for the first 5 instead, with NA or 0 for the first one, since there are no prior elements for that one:
rollapplyr(x, list(-(1:5)), sum, partial = TRUE, fill = NA)
## [1] NA 1 3 6 10 15 20 25 30 35
rollapplyr(x, list(-(1:5)), sum, partial = TRUE, fill = 0)
## [1] 0 1 3 6 10 15 20 25 30 35
rollapplyr(x, 6, sum, partial = TRUE) - x
## [1] 0 1 3 6 10 15 20 25 30 35
A simple approach would be:
df <- data.frame(x = seq(1:10))
mysum <- function(x, k = 5) {
res <- rep(NA, length(x))
for (i in seq_along(x)) {
if (i <= k) { # the first k elements are kept as-is
res[i] <- x[i]
} else {
res[i] <- sum(x[(i-k):(i-1)])
}
}
res
}
mysum(df$x)
# [1] 1 2 3 4 5 15 20 25 30 35
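mysum <- function(x, k = 5) {
  # the first k elements are kept as-is (assumes length(x) > k)
  res <- x[1:k]
  # element k + i is the sum of the k values x[i], ..., x[i + k - 1] before it
  rest <- sapply(seq_len(length(x) - k), function(i) sum(x[i:(i + k - 1)]))
  c(res, rest)
}
mysum(df$x)
# [1] 1 2 3 4 5 15 20 25 30 35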
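For long vectors, the sum of the previous k values can also be read off cumulative sums without any package or explicit loop. This is only a sketch under the question's convention that the first k values are kept as-is; prev_k_sum is a hypothetical name.

```r
# Sum of the previous k values via cumulative sums; the first k values are
# kept unchanged, matching the y requested in the question.
prev_k_sum <- function(x, k = 5) {
  n <- length(x)
  out <- x
  if (n > k) {
    cs <- c(0, cumsum(x))              # cs[j + 1] equals sum(x[1:j])
    idx <- (k + 1):n
    out[idx] <- cs[idx] - cs[idx - k]  # equals sum(x[(i - k):(i - 1)])
  }
  out
}

df <- data.frame(x = 1:10)
df$y <- prev_k_sum(df$x)
df$y
```

Each output element is a difference of two cumulative sums, so the cost does not grow with k.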
I have a lot of matrices with different numbers of rows, and I want to sum the odd-numbered rows and the even-numbered rows element-wise, as below:
o <- matrix(rep(c(1,2,3,4,5,6),6),ncol = 6)
o2 <- matrix(rep(c(1,2,3,4,5,6),12),ncol = 6)
#I want to sum the odd-number rows and even number rows element respectively
i=1
kg <- NULL
while(i <= 2){
op<-unlist(Map(sum,o[i,],o[i+2,],o[i+4,]))
kg <- c(kg,op)
i=i+1
}
i=1
kg2 <- NULL
while(i <= 2){
op2<-unlist(Map(sum,o2[i,],o2[i+2,],o2[i+4,],o2[i+6],o2[i+8],o2[i+10]))
kg2 <- c(kg2,op2)
i=i+1
}
kg
kg2 #the result should be a vector sequence like kg and kg2
> kg2
[1] 18 18 18 18 18 18 24 24 24 24 24 24
This is what I can do now. But my matrices have columns of many different lengths. Is there a method to do this quickly?
And how can I generate a string like "o2[i,],o2[i+2,],o2[i+4,],o2[i+6],o2[i+8],o2[i+10])" automatically according to the input number? Thank you for your help :)
Perhaps something like this?
o <- matrix(rep(c(1,2,3,4,5,6),6),ncol = 6)
o2 <- matrix(rep(c(1,2,3,4,5,6),12),ncol = 6)
even <- function(x) 2 * seq(1, nrow(x) / 2);
odd <- function(x) 2 * seq(1, nrow(x) / 2) - 1;
colSums(o[even(o), ]);
#[1] 12 12 12 12 12 12
colSums(o[odd(o), ]);
#[1] 9 9 9 9 9 9
colSums(o2[even(o2), ]);
#[1] 24 24 24 24 24 24
colSums(o2[odd(o2), ]);
#[1] 18 18 18 18 18 18
Explanation: even/odd return even/odd row indices of a matrix/data.frame; we can then use colSums to sum entries by column.
Update
To sum entries from rows 3, 6, 9, 12 (or any other sequence) you just need to define a corresponding function, e.g.
another_seq <- function(x) 3 * seq(1, nrow(x) / 3)
colSums(o2[another_seq(o2), ]);
#[1] 18 18 18 18 18 18
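The index helpers above can be folded into one function that takes the step and the starting row as arguments. A minimal sketch, assuming the starting row does not exceed nrow(x); every_kth is a hypothetical name.

```r
# Row indices start, start + k, start + 2k, ... up to nrow(x).
# Assumes start <= nrow(x).
every_kth <- function(x, k, start = k) seq(start, nrow(x), by = k)

o2 <- matrix(rep(c(1, 2, 3, 4, 5, 6), 12), ncol = 6)
third <- colSums(o2[every_kth(o2, 3), , drop = FALSE])             # rows 3, 6, 9, 12
odd   <- colSums(o2[every_kth(o2, 2, start = 1), , drop = FALSE])  # rows 1, 3, ..., 11
```

Because seq(start, nrow(x), by = k) stops at the last row, a matrix with an odd number of rows still has its final odd row included, and drop = FALSE keeps a matrix even when only one row is selected.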
In the OP's loop, if we want to change the Map to make it more automatic
unlist(do.call(Map, c(f = sum, as.data.frame(t(o2[seq(i, i+10, by = 2),])))))
Using the full code
o <- matrix(rep(c(1,2,3,4,5,6),6),ncol = 6)
o2 <- matrix(rep(c(1,2,3,4,5,6),12),ncol = 6)
#I want to sum the odd-number rows and even number rows
i=1
kg <- NULL
while(i <= 2){
#op<-unlist(Map(sum,o[i,],o[i+2,],o[i+4,]))
op <- unlist(do.call(Map, c(f = sum,
as.data.frame(t(o[seq(i, i+4, by = 2),]))))) # change here
kg <- c(kg,op)
i=i+1
}
i=1
kg2 <- NULL
while(i <= 2){
#op2<-unlist(Map(sum,o2[i,],o2[i+2,],o2[i+4,],o2[i+6],o2[i+8],o2[i+10]))
op2 <- unlist(do.call(Map, c(f = sum,
as.data.frame(t(o2[seq(i, i+10, by = 2),]))))) # change here
kg2 <- c(kg2,op2)
i=i+1
}
kg
#[1] 9 9 9 9 9 9 12 12 12 12 12 12
kg2
#[1] 18 18 18 18 18 18 24 24 24 24 24 24
In the OP's code, if we analyze the individual arguments of Map with just two arguments i.e. the first and 3rd row of 'o'
i <- 1
Map(function(x, y) c(x, y), o[i,], o[i+2,])
#[[1]]
#[1] 1 3
#[[2]]
#[1] 1 3
#[[3]]
#[1] 1 3
#[[4]]
#[1] 1 3
#[[5]]
#[1] 1 3
#[[6]]
#[1] 1 3
Here, each element of the list is the column values concatenated (c). If we need to get a similar structure, by subsetting the odd rows, we transpose the subset of rows, convert it to data.frame, so that each individual block is a column (that corresponds to the original rows subsetted)
do.call(Map, c(f=c, as.data.frame(t(o[c(i, i+2),]))))
#[[1]]
#V1 V2
# 1 3
#[[2]]
#V1 V2
# 1 3
#[[3]]
#V1 V2
# 1 3
#[[4]]
#V1 V2
# 1 3
#[[5]]
#V1 V2
# 1 3
#[[6]]
#V1 V2
# 1 3
Keeping it as a matrix will not solve it, as it takes the whole matrix as a single cell (a matrix is a vector with a dimension attribute)
do.call(Map, c(f=c, o[c(i, i+2),]))
#[[1]]
#[1] 1 3 1 3 1 3 1 3 1 3 1 3
while using Map directly will loop through each element of the matrix (vector) instead of each column
Map(c, o[c(i, i+2),]) # check the output
Another option would be to split the object by col and then do the sum
onew <- o[seq(i, i+4, by = 2),]
Map(sum, split(onew, col(onew)))
The above approach uses loops, but we can also use a vectorized approach (just like in @Maurits Evers' post). Instead of seq, here we use recycling of a logical vector to subset the rows and then apply colSums
i1 <- c(TRUE, FALSE)
colSums(cbind(o[i1,], o[!i1,]))
#[1] 9 9 9 9 9 9 12 12 12 12 12 12
colSums(cbind(o2[i1,], o2[!i1,]))
#[1] 18 18 18 18 18 18 24 24 24 24 24 24
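For matrices with any number of rows, base R's rowsum() can do the grouping directly. The following is only a sketch: sum_rows_by_step and its step argument are hypothetical names, and the flattening order is chosen to mimic kg and kg2 from the question.

```r
# Sum rows in groups of every step-th row using base R's rowsum().
# Group 1 holds rows 1, 1 + step, ...; group 2 holds rows 2, 2 + step, ...
sum_rows_by_step <- function(m, step = 2) {
  group <- (seq_len(nrow(m)) - 1) %% step + 1
  sums <- rowsum(m, group)  # one row of column sums per group
  as.vector(t(sums))        # flatten group by group, like kg and kg2
}

o  <- matrix(rep(c(1, 2, 3, 4, 5, 6), 6),  ncol = 6)
o2 <- matrix(rep(c(1, 2, 3, 4, 5, 6), 12), ncol = 6)
kg  <- sum_rows_by_step(o)
kg2 <- sum_rows_by_step(o2)
```

The group labels are computed from nrow(m), so no row indices have to be spelled out by hand for each matrix size.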
I have a complete dataframe. I want 20% of the values in the dataframe to be replaced by NAs to simulate random missing data.
A <- c(1:10)
B <- c(11:20)
C <- c(21:30)
df<- data.frame(A,B,C)
Can anyone suggest a quick way of doing that?
df <- data.frame(A = 1:10, B = 11:20, c = 21:30)
head(df)
## A B c
## 1 1 11 21
## 2 2 12 22
## 3 3 13 23
## 4 4 14 24
## 5 5 15 25
## 6 6 16 26
as.data.frame(lapply(df, function(cc) cc[ sample(c(TRUE, NA), prob = c(0.85, 0.15), size = length(cc), replace = TRUE) ]))
## A B c
## 1 1 11 21
## 2 2 12 22
## 3 3 13 23
## 4 4 14 24
## 5 5 NA 25
## 6 6 16 26
## 7 NA 17 27
## 8 8 18 28
## 9 9 19 29
## 10 10 20 30
It's a random process, so it might not give 15% every time.
You can unlist the data.frame and then take a random sample, then put back in a data.frame.
vals <- unlist(df)
n <- round(length(vals) * 0.15)
vals[sample(length(vals), n)] <- NA # sample positions, not values
as.data.frame(matrix(vals, ncol = ncol(df), dimnames = list(NULL, names(df))))
It can be done a bunch of different ways using sample().
If you are in the mood to use purrr instead of lapply, you can also do it like this:
> library(purrr)
> df <- data.frame(A = 1:10, B = 11:20, C = 21:30)
> df
A B C
1 1 11 21
2 2 12 22
3 3 13 23
4 4 14 24
5 5 15 25
6 6 16 26
7 7 17 27
8 8 18 28
9 9 19 29
10 10 20 30
> map_df(df, function(x) {x[sample(c(TRUE, NA), prob = c(0.8, 0.2), size = length(x), replace = TRUE)]})
# A tibble: 10 x 3
A B C
<int> <int> <int>
1 1 11 21
2 2 12 22
3 NA 13 NA
4 4 14 NA
5 5 15 25
6 6 16 26
7 7 17 27
8 8 NA 28
9 9 19 29
10 10 20 30
Same result, using binomial distribution:
dd=dim(df)
nna=20/100 #overall
df1<-df
df1[matrix(rbinom(prod(dd), size=1,prob=nna)==1,nrow=dd[1])]<-NA
df1
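The binomial draw above gives 20% missing values only in expectation. If an exact share is needed, cell positions can be sampled without replacement. This is only a sketch; add_na_exact and the example data frame dat are hypothetical names.

```r
# Replace an exact share of cells with NA by sampling cell positions
# without replacement.
add_na_exact <- function(df, prop = 0.2) {
  n_cells <- nrow(df) * ncol(df)
  mask <- matrix(FALSE, nrow = nrow(df), ncol = ncol(df))
  mask[sample(n_cells, round(n_cells * prop))] <- TRUE
  df[mask] <- NA  # a logical matrix selects individual cells
  df
}

set.seed(1)  # only to make the example reproducible
dat <- data.frame(A = 1:10, B = 11:20, C = 21:30)
dat_na <- add_na_exact(dat, 0.2)
sum(is.na(dat_na))  # 6 of the 30 cells
```

Using a logical mask rather than converting to a matrix of values leaves the column types of the data frame untouched.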
May I suggest a first function (ggNAadd) designed to do this, and improve it with a second function (ggNA) that plots the distribution of the NAs created.
What is neat is the possibility to input either a proportion or a fixed number of NAs.
ggNAadd = function(data, amount, plot=F){
temp <- data
amount2 <- ifelse(amount<1, round(prod(dim(data))*amount), amount)
if (amount2 >= prod(dim(data))) stop("exceeded data size")
for (i in 1:amount2) temp[sample.int(nrow(temp), 1), sample.int(ncol(temp), 1)] <- NA
if (plot) print(ggNA(temp))
return(temp)
}
And the plotting function:
ggNA = function(data, alpha=0.5){
require(ggplot2)
DF <- data
if (!is.matrix(data)) DF <- as.matrix(DF)
to.plot <- cbind.data.frame('y'=rep(1:nrow(DF), each=ncol(DF)),
'x'=as.logical(t(is.na(DF)))*rep(1:ncol(DF), nrow(DF)))
size <- 20 / log( prod(dim(DF)) ) # size of point depend on size of table
g <- ggplot(data=to.plot) + aes(x,y) +
geom_point(size=size, color="red", alpha=alpha) +
scale_y_reverse() + xlim(1,ncol(DF)) +
ggtitle("location of NAs in the data frame") +
xlab("columns") + ylab("lines")
pc <- round(sum(is.na(DF))/prod(dim(DF))*100, 2) # % NA
print(paste("percentage of NA data: ", pc))
return(g)
}
Which gives (using ggplot2 as graphical output):
ggNAadd(df, amount=0.20, plot=TRUE)
## [1] "percentage of NA data: 20"
## A B c
## 1 1 11 21
## 2 2 12 22
## 3 3 13 23
## 4 4 NA 24
## ..
Of course, as mentioned earlier, if you ask for too many NAs, the actual percentage will drop because of repetitions.
A mutate_all approach:
library(dplyr) # provides the %>% pipe
df %>%
dplyr::mutate_all(~ifelse(sample(c(TRUE, FALSE), size = length(.), replace = TRUE, prob = c(0.8, 0.2)),
as.character(.), NA))