Apply NA to the rows that meet a condition in R

I have a data.frame like such:
set.seed(126)
df <- data.frame(a=sample(c(1:100, NA), 10), b=sample(1:100, 10), c=sample(1:100, 10))
a b c
1 65 48 19
2 46 15 80
3 NA 47 84
4 68 34 46
5 23 75 42
6 92 87 68
7 79 28 48
8 84 55 9
9 28 43 38
10 94 99 77
I'd like to write a function that sets the values in all columns to NA in every row where df$a is NA. I don't want to assign NA to b and c by name; the function should blank out every column of the data.frame wherever is.na(a) holds, no matter how many columns there are.

I think you are just looking for
df[is.na(df$a), ] <- NA
# a b c
# 1 65 48 19
# 2 46 15 80
# 3 NA NA NA
# 4 68 34 46
# 5 23 75 42
# 6 92 87 68
# 7 79 28 48
# 8 84 55 9
# 9 28 43 38
# 10 94 99 77
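Since the question asks for a function, the one-liner above could be wrapped as follows. This is only a sketch: the name na_rows_where and its col argument are made up for illustration and are not part of the original answer.

```r
# Hypothetical helper: blank out every column of the rows where `col` is NA.
na_rows_where <- function(df, col) {
  hit <- is.na(df[[col]])   # logical vector, one entry per row
  df[hit, ] <- NA           # row-wise assignment covers all columns
  df
}

# Small example data (not the data from the question)
df <- data.frame(a = c(65, NA, 68), b = c(48, 47, 34), c = c(19, 84, 46))
res <- na_rows_where(df, "a")
```

Because the assignment is done by row, the function does not need to know how many columns the data.frame has.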

Related

Matrix.utils merge.Matrix and merge different results

I am very confused by the R package Matrix.utils and its implementation of merge.Matrix(). I want to merge two matrices with 0 common values, but merge common column names and fill the rest with zeros.
The results are inconsistent and sensitive to whether merge() or merge.Matrix() is specified. I expected this to behave like the dplyr join functions (e.g. full_join()), but it does not.
Simulating the data I plan to use:
mtx.x <- sample(1:100, 100) ; mtx.x <- matrix(mtx.x, nrow = 10)
mtx.y <- sample(1:100, 100) ; mtx.y <- matrix(mtx.y, nrow = 10)
colnames(mtx.x) <- letters[1:10] ; colnames(mtx.y) <- letters[6:15]
mtx.x ; mtx.y
a b c d e f g h i j
[1,] 82 61 76 36 27 67 85 38 29 87
[2,] 83 89 43 70 81 30 35 17 39 95
[3,] 1 75 69 54 66 3 10 47 93 73
[4,] 52 98 26 88 51 64 31 72 13 92
[5,] 44 74 86 9 63 58 50 56 6 49
[6,] 24 16 77 12 55 97 18 45 14 40
[7,] 11 5 79 94 2 80 37 15 41 42
[8,] 100 84 65 59 34 62 53 60 99 28
[9,] 19 78 8 25 96 21 90 46 68 71
[10,] 32 20 7 4 57 91 22 48 33 23
f g h i j k l m n o
[1,] 24 22 8 94 89 7 50 93 40 4
[2,] 63 80 32 44 64 83 16 96 46 47
[3,] 85 30 81 95 23 91 19 92 99 52
[4,] 21 55 61 58 27 76 67 65 37 14
[5,] 9 66 12 2 41 11 56 84 87 39
[6,] 18 57 88 3 68 100 74 62 82 25
[7,] 70 90 43 54 72 86 69 20 29 51
[8,] 1 59 60 45 79 75 15 5 73 10
[9,] 38 28 26 17 53 36 97 13 77 49
[10,] 6 71 98 35 42 31 78 33 48 34
Case 1: merge() with all.x/all.y set to TRUE does what I want
merge(x = mtx.x, y = mtx.y,
all.x = T, all.y = T)
f g h i j a b c d e k l m n o
1 1 59 60 45 79 NA NA NA NA NA 75 15 5 73 10
2 3 10 47 93 73 1 75 69 54 66 NA NA NA NA NA
3 6 71 98 35 42 NA NA NA NA NA 31 78 33 48 34
4 9 66 12 2 41 NA NA NA NA NA 11 56 84 87 39
5 18 57 88 3 68 NA NA NA NA NA 100 74 62 82 25
6 21 55 61 58 27 NA NA NA NA NA 76 67 65 37 14
7 21 90 46 68 71 19 78 8 25 96 NA NA NA NA NA
8 24 22 8 94 89 NA NA NA NA NA 7 50 93 40 4
9 30 35 17 39 95 83 89 43 70 81 NA NA NA NA NA
10 38 28 26 17 53 NA NA NA NA NA 36 97 13 77 49
11 58 50 56 6 49 44 74 86 9 63 NA NA NA NA NA
12 62 53 60 99 28 100 84 65 59 34 NA NA NA NA NA
13 63 80 32 44 64 NA NA NA NA NA 83 16 96 46 47
14 64 31 72 13 92 52 98 26 88 51 NA NA NA NA NA
15 67 85 38 29 87 82 61 76 36 27 NA NA NA NA NA
16 70 90 43 54 72 NA NA NA NA NA 86 69 20 29 51
17 80 37 15 41 42 11 5 79 94 2 NA NA NA NA NA
18 85 30 81 95 23 NA NA NA NA NA 91 19 92 99 52
19 91 22 48 33 23 32 20 7 4 57 NA NA NA NA NA
20 97 18 45 14 40 24 16 77 12 55 NA NA NA NA NA
Case 2: merge.Matrix() with same arguments wants me to specify by.x/by.y
merge.Matrix(x = mtx.x, y = mtx.y,
all.x = T, all.y = T)
Error in grr::matches(by.x, by.y, all.x, all.y, nomatch = NULL) :
argument "by.x" is missing, with no default
Case 3: specifying by.x/by.y as the respective column names does not merge the common columns. Also, I have no idea why it offsets the matrices by 5 rows rather than 10; the matrices have no common values.
merge.Matrix(x = mtx.x, y = mtx.y,
all.x = T, all.y = T,
by.x = colnames(mtx.x), by.y = colnames(mtx.y))
a b c d e f g h i j y.f y.g y.h y.i y.j k l m n o
82 61 76 36 27 67 85 38 29 87 NA NA NA NA NA NA NA NA NA NA
83 89 43 70 81 30 35 17 39 95 NA NA NA NA NA NA NA NA NA NA
1 75 69 54 66 3 10 47 93 73 NA NA NA NA NA NA NA NA NA NA
52 98 26 88 51 64 31 72 13 92 NA NA NA NA NA NA NA NA NA NA
44 74 86 9 63 58 50 56 6 49 NA NA NA NA NA NA NA NA NA NA
24 16 77 12 55 97 18 45 14 40 24 22 8 94 89 7 50 93 40 4
11 5 79 94 2 80 37 15 41 42 63 80 32 44 64 83 16 96 46 47
100 84 65 59 34 62 53 60 99 28 85 30 81 95 23 91 19 92 99 52
19 78 8 25 96 21 90 46 68 71 21 55 61 58 27 76 67 65 37 14
32 20 7 4 57 91 22 48 33 23 9 66 12 2 41 11 56 84 87 39
fill.x NA NA NA NA NA NA NA NA NA NA 18 57 88 3 68 100 74 62 82 25
fill.x NA NA NA NA NA NA NA NA NA NA 70 90 43 54 72 86 69 20 29 51
fill.x NA NA NA NA NA NA NA NA NA NA 1 59 60 45 79 75 15 5 73 10
fill.x NA NA NA NA NA NA NA NA NA NA 38 28 26 17 53 36 97 13 77 49
fill.x NA NA NA NA NA NA NA NA NA NA 6 71 98 35 42 31 78 33 48 34
Case 4: by.x/by.y specified as common column names, all.x/all.y set to TRUE and fill.x/fill.y set to 0 does not do a full join as the documentation claims
common <- intersect(colnames(mtx.x), colnames(mtx.y))
merge.Matrix(x = mtx.x, y = mtx.y,
all.x = T, all.y = T,
by.x = common, by.y = common)
a b c d e f g h i j y.f y.g y.h y.i y.j k l m n o
82 61 76 36 27 67 85 38 29 87 24 22 8 94 89 7 50 93 40 4
83 89 43 70 81 30 35 17 39 95 63 80 32 44 64 83 16 96 46 47
1 75 69 54 66 3 10 47 93 73 85 30 81 95 23 91 19 92 99 52
52 98 26 88 51 64 31 72 13 92 21 55 61 58 27 76 67 65 37 14
44 74 86 9 63 58 50 56 6 49 9 66 12 2 41 11 56 84 87 39
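If the goal is simply to stack the two matrices on the union of their column names and put zeros wherever a matrix has no such column, a base R sketch like the following may be enough. It does not use Matrix.utils at all, bind_fill_zero is a made-up name, and whether this matches the intended "full join" is an assumption.

```r
# Hypothetical helper: row-bind two matrices on the union of their column
# names, filling columns that a matrix lacks with 0.
bind_fill_zero <- function(x, y) {
  cols <- union(colnames(x), colnames(y))
  out <- matrix(0, nrow = nrow(x) + nrow(y), ncol = length(cols),
                dimnames = list(NULL, cols))
  out[seq_len(nrow(x)), colnames(x)] <- x            # rows coming from x
  out[nrow(x) + seq_len(nrow(y)), colnames(y)] <- y  # rows coming from y
  out
}

# Deterministic stand-ins for the simulated matrices in the question
mtx.x <- matrix(1:100, nrow = 10, dimnames = list(NULL, letters[1:10]))
mtx.y <- matrix(101:200, nrow = 10, dimnames = list(NULL, letters[6:15]))
res <- bind_fill_zero(mtx.x, mtx.y)
```

The result has 20 rows and 15 columns (a to o); the shared columns f to j are filled by both inputs.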

Extract the columns in a data frame when colnames match characters in a vector

I have a character vector v whose elements match some of the column names in a data frame. I want to extract/subset the columns that match v. How could I do that? Imagine a big data frame!
a <- sample(100,10)
b <- sample(100,10)
c <- sample(100,10)
d <- sample(100,10)
df <- data.frame(a,b,c,d)
df
a b c d
1 91 17 93 53
2 9 94 65 55
3 11 58 38 13
4 100 77 98 45
5 69 9 61 2
6 15 50 44 14
7 58 55 88 85
8 78 45 33 51
9 94 3 89 62
10 7 12 90 44
v <- c("a","c")
wanted output:
a c
1 91 93
2 9 65
3 11 38
4 100 98
5 69 61
6 15 44
7 58 88
8 78 33
9 94 89
10 7 90
We can use select
library(dplyr)
df %>%
select(all_of(v))
-output
# a c
#1 26 92
#2 34 15
#3 15 80
#4 4 88
#5 55 69
#6 96 78
#7 63 2
#8 69 62
#9 12 100
#10 16 22
> df
a b c d
1 38 28 68 88
2 18 21 99 40
3 30 33 20 91
4 85 64 88 33
5 9 46 82 51
6 59 86 40 80
7 80 81 46 49
8 57 61 83 37
9 64 6 15 27
10 72 13 37 68
> v <- c("a","c")
> df[v]
a c
1 38 68
2 18 99
3 30 20
4 85 88
5 9 82
6 59 40
7 80 46
8 57 83
9 64 15
10 72 37
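Both approaches above stop with an error if v contains a name that is not a column of df. If that can happen, one option is to intersect v with the column names first, sketched below in base R with made-up example data (dplyr's any_of() plays the same role inside select()).

```r
df <- data.frame(a = 1:3, b = 4:6, c = 7:9, d = 10:12)
v <- c("a", "c", "z")            # "z" is not a column of df

keep <- intersect(v, names(df))  # names present in both, in the order of v
res <- df[keep]
```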

Split a data frame by column using a list of vectors

I want to split this data frame df by column using a list of vectors ind as the column indices.
> df
1 2 3 4 5 6 7 8 9 10
1 1 11 21 31 41 51 61 71 81 91
2 2 12 22 32 42 52 62 72 82 92
3 3 13 23 33 43 53 63 73 83 93
4 4 14 24 34 44 54 64 74 84 94
5 5 15 25 35 45 55 65 75 85 95
6 6 16 26 36 46 56 66 76 86 96
7 7 17 27 37 47 57 67 77 87 97
8 8 18 28 38 48 58 68 78 88 98
9 9 19 29 39 49 59 69 79 89 99
10 10 20 30 40 50 60 70 80 90 100
The combined length of the vectors is equal to the number of columns in the data frame.
> ind
[[1]]
[1] 1 4 9
[[2]]
[1] 2 5 10 7 3
[[3]]
[1] 8 6
The desired result should look like this:
$`1`
1 4 9
1 1 31 81
2 2 32 82
3 3 33 83
4 4 34 84
5 5 35 85
6 6 36 86
7 7 37 87
8 8 38 88
9 9 39 89
10 10 40 90
$`2`
2 5 10 7 3
1 11 41 91 61 21
2 12 42 92 62 22
3 13 43 93 63 23
4 14 44 94 64 24
5 15 45 95 65 25
6 16 46 96 66 26
7 17 47 97 67 27
8 18 48 98 68 28
9 19 49 99 69 29
10 20 50 100 70 30
$`3`
8 6
1 71 51
2 72 52
3 73 53
4 74 54
5 75 55
6 76 56
7 77 57
8 78 58
9 79 59
10 80 60
Effectively, the code should generate sub-matrices, as data frames, from the data frame df based on the vectors in the list ind.
I have tried using split.default without achieving the desired result:
split.default(V, rep(seq_along(ind), lengths(ind)))
One purrr option could be:
library(purrr)
map(.x = ind, ~ df[, .x])
[[1]]
X1 X4 X9
1 1 31 81
2 2 32 82
3 3 33 83
[[2]]
X2 X5 X10 X7 X3
1 11 41 91 61 21
2 12 42 92 62 22
3 13 43 93 63 23
[[3]]
X8 X6
1 71 51
2 72 52
3 73 53
With ind defined as:
ind <- list(c(1, 4, 9),
c(2, 5, 10, 7, 3),
c(8, 6))
An option for a list of dfs:
map(ind, ~ map(df_list, `[`, .))
You can just do,
lapply(your_list, function(i) your_df[i])
You can try the following base R solution using subset + Map
r <- Map(function(k) subset(df,select = k),ind)
such that
> r
[[1]]
X1 X4 X9
1 1 31 81
2 2 32 82
3 3 33 83
4 4 34 84
5 5 35 85
6 6 36 86
7 7 37 87
8 8 38 88
9 9 39 89
10 10 40 90
[[2]]
X2 X5 X10 X7 X3
1 11 41 91 61 21
2 12 42 92 62 22
3 13 43 93 63 23
4 14 44 94 64 24
5 15 45 95 65 25
6 16 46 96 66 26
7 17 47 97 67 27
8 18 48 98 68 28
9 19 49 99 69 29
10 20 50 100 70 30
[[3]]
X8 X6
1 71 51
2 72 52
3 73 53
4 74 54
5 75 55
6 76 56
7 77 57
8 78 58
9 79 59
10 80 60
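The split.default() attempt from the question can also be made to work. split.default() hands out columns to groups in the order in which they appear, so the columns have to be put into the order of unlist(ind) first. A sketch, assuming df is built as below (the V1..V10 column names come from as.data.frame(), not from the question):

```r
df <- as.data.frame(matrix(1:100, nrow = 10))   # columns V1 .. V10
ind <- list(c(1, 4, 9),
            c(2, 5, 10, 7, 3),
            c(8, 6))

# Reorder the columns to follow ind, then cut them into consecutive groups
# whose sizes are the lengths of the vectors in ind.
res <- split.default(df[unlist(ind)], rep(seq_along(ind), lengths(ind)))
```

The list elements are named "1", "2" and "3", as in the desired output.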

Imputation of missing Values from a different Data Frame and generating many data frames

I have a sample input data frame DF1, with many rows and columns, that contains missing data. I want to impute the missing values from a different data frame DF2 and generate one output data frame per column of DF2, as shown below, saving each as a data frame. Can anyone help with this?
Input:
DF1:
GM A B C D E
1 22 34 56 345 76
2 34 44 777 67 NA
3 45 76 77 NA NA
4 56 88 NA NA NA
5 36 NA NA NA NA
DF2
V1 V2 V3
1 11 21
2 12 22
3 13 23
4 14 24
5 15 25
6 16 26
7 17 27
8 18 28
9 19 29
10 20 30
Output:
OutputV1:
GM A B C D E
1 22 34 56 345 76
2 34 44 777 67 1
3 45 76 77 2 3
4 56 88 4 5 6
5 36 7 8 9 10
OutputV2
GM A B C D E
1 22 34 56 345 76
2 34 44 777 67 11
3 45 76 77 12 13
4 56 88 14 15 16
5 36 17 18 19 20
OutputV3:
GM A B C D E
1 22 34 56 345 76
2 34 44 777 67 21
3 45 76 77 22 23
4 56 88 24 25 26
5 36 27 28 29 30
I included pictures (not reproduced here) of OutputV1 and OutputV2 to make clear how the values of DF2 are added to the output data frames.
It would be great if someone could help me solve this, as there are many variables in DF2 and many data frames need to be generated, depending on the number of variables.
You can transpose DF1, fill the missing values, and then transpose it back:
t_df <- t(df1)
t_df[is.na(t_df)] <- df2$V1
as.data.frame(t(t_df))
# GM A B C D E
#1 1 22 34 56 345 76
#2 2 34 44 777 67 1
#3 3 45 76 77 2 3
#4 4 56 88 4 5 6
#5 5 36 7 8 9 10
This works best if all columns have the same data type, otherwise the data types may get mixed up due to the transpose.
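impute_by_row <- function(df, values) {
  t_df <- t(df)                  # transpose: column-major filling now runs along the original rows
  t_df[is.na(t_df)] <- values    # fill the gaps in that order
  as.data.frame(t(t_df))         # transpose back
}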
impute_by_row(df1, df2$V1)
# GM A B C D E
#1 1 22 34 56 345 76
#2 2 34 44 777 67 1
#3 3 45 76 77 2 3
#4 4 56 88 4 5 6
#5 5 36 7 8 9 10
impute_by_row(df1, df2$V2)
# GM A B C D E
#1 1 22 34 56 345 76
#2 2 34 44 777 67 11
#3 3 45 76 77 12 13
#4 4 56 88 14 15 16
#5 5 36 17 18 19 20
impute_by_row(df1, df2$V3)
# GM A B C D E
#1 1 22 34 56 345 76
#2 2 34 44 777 67 21
#3 3 45 76 77 22 23
#4 4 56 88 24 25 26
#5 5 36 27 28 29 30
Apply the function to all columns of df2:
lapply(df2, function(v) impute_by_row(df1, v))
$V1
GM A B C D E
1 1 22 34 56 345 76
2 2 34 44 777 67 1
3 3 45 76 77 2 3
4 4 56 88 4 5 6
5 5 36 7 8 9 10
$V2
GM A B C D E
1 1 22 34 56 345 76
2 2 34 44 777 67 11
3 3 45 76 77 12 13
4 4 56 88 14 15 16
5 5 36 17 18 19 20
$V3
GM A B C D E
1 1 22 34 56 345 76
2 2 34 44 777 67 21
3 3 45 76 77 22 23
4 4 56 88 24 25 26
5 5 36 27 28 29 30
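If the number of NAs in df1 differs from the length of the replacement vector, R recycles or truncates the values, at best with a warning. A variant that refuses to run in that case could look like the sketch below; impute_by_row_checked is a name made up here, and df1 is rebuilt from the table in the question.

```r
# Hypothetical stricter variant of impute_by_row(): stop on a length mismatch.
impute_by_row_checked <- function(df, values) {
  t_df <- t(as.matrix(df))
  gaps <- is.na(t_df)                      # NA positions, in row order of df
  if (sum(gaps) != length(values)) {
    stop("need exactly ", sum(gaps), " values, got ", length(values))
  }
  t_df[gaps] <- values
  as.data.frame(t(t_df))
}

df1 <- data.frame(GM = 1:5,
                  A = c(22, 34, 45, 56, 36),
                  B = c(34, 44, 76, 88, NA),
                  C = c(56, 777, 77, NA, NA),
                  D = c(345, 67, NA, NA, NA),
                  E = c(76, NA, NA, NA, NA))
res <- impute_by_row_checked(df1, 1:10)
```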

In R, how do I locally shuffle a vector's elements

I have the following vector in R. Think of them as a vector of numbers.
x <- 1:100  # i.e. 1, 2, 3, 4, ..., 100
I want to randomize this vector "locally" based on some input number the "locality factor". For example if the locality factor is 3, then the first 3 elements are taken and randomized followed by the next 3 elements and so on. Is there an efficient way to do this? I know if I use sample, it would jumble up the whole array.
Thanks in advance
Arun didn't like how inefficient my other answer was, so here's something very fast just for him ;)
It requires just one call each to runif() and order(), and doesn't use sample() at all.
x <- 1:100
k <- 3
n <- length(x)
x[order(rep(seq_len(ceiling(n/k)), each=k, length.out=n) + runif(n))]
# [1] 3 1 2 6 5 4 8 9 7 11 12 10 13 14 15 18 16 17
# [19] 20 19 21 23 22 24 27 25 26 29 28 30 33 31 32 36 34 35
# [37] 37 38 39 40 41 42 43 44 45 47 48 46 51 49 50 52 54 53
# [55] 55 57 56 58 60 59 62 63 61 66 64 65 68 67 69 71 70 72
# [73] 75 74 73 76 77 78 81 80 79 84 82 83 86 85 87 89 88 90
# [91] 93 92 91 94 96 95 97 98 99 100
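Why this works: every element gets its block number (1, 1, 1, 2, 2, 2, ...) plus a random fraction strictly between 0 and 1, so sorting by that key never moves an element out of its block but puts the elements inside each block in random order. Wrapped as a function (local_shuffle is a name chosen here for illustration), with a quick check of that property:

```r
# Hypothetical wrapper around the one-liner above.
local_shuffle <- function(x, k) {
  n <- length(x)
  block <- rep(seq_len(ceiling(n / k)), each = k, length.out = n)
  x[order(block + runif(n))]   # sort key = block id + random fraction
}

set.seed(1)
x <- 1:100
y <- local_shuffle(x, 3)

# Each element stays inside its original block of 3
same_block <- all((y - 1) %/% 3 == (x - 1) %/% 3)
```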
General solution:
Edit: As @MatthewLundberg comments, the issue I pointed out with "repeating numbers in x" can be easily overcome by working on seq_along(x), which means the shuffled values are indices into x. So it would look like this:
k <- 3
x <- c(2,2,1, 1,3,4, 4,6,5, 3)
x.s <- seq_along(x)
y <- sample(x.s)
x[unlist(split(y, (match(y, x.s)-1) %/% k), use.names = FALSE)]
# [1] 2 2 1 3 4 1 4 5 6 3
Old answer:
The bottleneck here is the number of calls to sample(). As long as your numbers don't repeat, I think you can do this with just one call to sample(), like this:
k <- 3
x <- 1:20
y <- sample(x)
unlist(split(y, (match(y,x)-1) %/% k), use.names = FALSE)
# [1] 1 3 2 5 6 4 8 9 7 12 10 11 13 14 15 17 16 18 19 20
To put everything together in a function (I like the name scramble from @Roland's):
scramble <- function(x, k=3) {
  x.s <- seq_along(x)
  y.s <- sample(x.s)
  idx <- unlist(split(y.s, (match(y.s, x.s)-1) %/% k), use.names = FALSE)
  x[idx]
}
scramble(x, 3)
# [1] 2 1 2 3 4 1 5 4 6 3
scramble(x, 3)
# [1] 1 2 2 1 4 3 6 5 4 3
To reduce the answer (and get it faster) even more, following @flodel's comment:
scramble <- function(x, k=3L) {
  x.s <- seq_along(x)
  y.s <- sample(x.s)
  x[unlist(split(x.s[y.s], (y.s-1) %/% k), use.names = FALSE)]
}
For the record, the boot package (a recommended package that ships with standard R distributions, not part of base R itself) includes a function permutation.array(), reached via ::: below, that is used for just this purpose:
x <- 1:100
k <- 3
ii <- boot:::permutation.array(n = length(x),
R = 2,
strata = (seq_along(x) - 1) %/% k)[1,]
x[ii]
# [1] 2 1 3 6 5 4 9 7 8 12 11 10 15 13 14 16 18 17
# [19] 21 19 20 23 22 24 26 27 25 28 29 30 33 31 32 36 35 34
# [37] 38 39 37 41 40 42 43 44 45 46 47 48 51 50 49 53 52 54
# [55] 57 55 56 59 60 58 63 61 62 65 66 64 67 69 68 72 71 70
# [73] 75 73 74 76 77 78 79 80 81 82 83 84 86 87 85 89 88 90
# [91] 93 91 92 94 95 96 97 98 99 100
This will drop elements at the end (with a warning):
locality <- 3
x <- 1:100
c(apply(matrix(x, nrow=locality, ncol=length(x) %/% locality), 2, sample))
## [1] 1 2 3 4 6 5 8 9 7 12 10 11 13 15 14 16 18 17 19 20 21 22 24 23 26 25 27 28 30 29 32 33 31 35 34 36 38 39 37
## [40] 42 40 41 43 44 45 47 48 46 51 49 50 54 52 53 55 57 56 58 59 60 62 61 63 64 65 66 67 69 68 71 72 70 74 75 73 78 77 76
## [79] 80 81 79 83 82 84 87 85 86 88 89 90 92 93 91 96 94 95 99 98 97
v <- 1:16
scramble <- function(vec,n) {
  res <- tapply(vec,(seq_along(vec)+n-1)%/%n,
                FUN=function(x) x[sample.int(length(x), size=length(x))])
  unname(unlist(res))
}
set.seed(42)
scramble(v,3)
#[1] 3 2 1 6 5 4 9 7 8 12 10 11 15 13 14 16
scramble(v,4)
#[1] 2 3 1 4 5 8 6 7 10 12 9 11 14 15 16 13
I like Matthew's approach way better but here was the way I did the problem:
x <- 1:100
fact <- 3
y <- ceiling(length(x)/fact)
unlist(lapply(split(x, rep(1:y, each =fact)[1:length(x)]), function(x){
  if (length(x)==1) return(x)   # sample() on a single number n would draw from 1:n
  sample(x)
}), use.names = FALSE)
## [1] 3 1 2 6 4 5 8 9 7 11 10 12 13 15 14 17 16 18
## [19] 20 21 19 24 23 22 26 27 25 29 30 28 31 32 33 35 34 36
## [37] 39 37 38 41 42 40 45 43 44 47 46 48 51 49 50 52 53 54
## [55] 57 56 55 59 60 58 63 62 61 64 66 65 67 68 69 70 71 72
## [73] 75 73 74 77 76 78 80 79 81 82 84 83 85 86 87 90 89 88
## [91] 92 91 93 96 94 95 98 99 97 100
