Subsetting Data frame or matrix based on criteria of values - r

Suppose I have a matrix or a data frame and I want only those values that are greater than 15 and no values between 85 and 90 both inclusive
a<-matrix(1:100,nrow = 10, ncol = 10)
rownames(a) <- LETTERS[1:10]
colnames(a) <- LETTERS[1:10]
A B C D E F G H I J
A 1 11 21 31 41 51 61 71 81 91
B 2 12 22 32 42 52 62 72 82 92
C 3 13 23 33 43 53 63 73 83 93
D 4 14 24 34 44 54 64 74 84 94
E 5 15 25 35 45 55 65 75 85 95
F 6 16 26 36 46 56 66 76 86 96
G 7 17 27 37 47 57 67 77 87 97
H 8 18 28 38 48 58 68 78 88 98
I 9 19 29 39 49 59 69 79 89 99
J 10 20 30 40 50 60 70 80 90 100
Note: You can convert it into a data frame if this kind of operation is possible on a data frame.
Now I want my result in a format where only the values that are greater than 5 and less than 85 are retained, and everything else is deleted and replaced with a blank space.
My desired output is below:
   A  B  C  D  E  F  G  H  I   J
A    11 21 31 41 51 61 71 81  91
B    12 22 32 42 52 62 72 82  92
C    13 23 33 43 53 63 73 83  93
D    14 24 34 44 54 64 74 84  94
E  5 15 25 35 45 55 65 75 85  95
F  6 16 26 36 46 56 66 76     96
G  7 17 27 37 47 57 67 77     97
H  8 18 28 38 48 58 68 78     98
I  9 19 29 39 49 59 69 79     99
J 10 20 30 40 50 60 70 80    100
Is there any kind of function in R which can take my condition and produce the desired result? I want to be able to change the code according to the problem. I searched Stack Overflow but didn't find anything like this. I don't want to format based on rows or columns.
I tried
a[a> 5 & a!=c(85:90)]
but this gives me only the values and loses the structure.

Assuming that 'a' is a matrix, we can assign NA to the values of 'a' that are in 86:90 or (|) less than 5 (a < 5). Here, I am not assigning '' because that would change the class from numeric to character. Assigning NA is also more useful for later processing.
a[a %in% 86:90 | a<5] <- NA
However, if we need it to be '' (starting again from the original numeric 'a'):
a[a %in% 86:90 | a<5] <- ""
If we are using a data.frame (again built from the original numeric 'a'):
a1 <- as.data.frame(a)
a1[] <- lapply(a1, function(x) replace(x, x %in% 86:90| x <5, ""))
a1
#   A  B  C  D  E  F  G  H  I   J
#A    11 21 31 41 51 61 71 81  91
#B    12 22 32 42 52 62 72 82  92
#C    13 23 33 43 53 63 73 83  93
#D    14 24 34 44 54 64 74 84  94
#E  5 15 25 35 45 55 65 75 85  95
#F  6 16 26 36 46 56 66 76     96
#G  7 17 27 37 47 57 67 77     97
#H  8 18 28 38 48 58 68 78     98
#I  9 19 29 39 49 59 69 79     99
#J 10 20 30 40 50 60 70 80    100
NOTE: This changes the class of each column to character
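If the blanks are only needed for display, one possible sketch is to keep the matrix numeric with NA and blank the NAs at print time. The names b and drop_me are made up for this illustration; na.print is an argument of base R's print for vectors and matrices.

```r
# Rebuild the example matrix from the question
a <- matrix(1:100, nrow = 10, ncol = 10,
            dimnames = list(LETTERS[1:10], LETTERS[1:10]))

# Cells to blank out: below 5, or from 86 to 90 inclusive
# (drop_me and b are hypothetical names used only in this sketch)
drop_me <- a < 5 | (a >= 86 & a <= 90)

b <- a
b[drop_me] <- NA          # the matrix stays numeric and keeps its dimensions

print(b, na.print = "")   # NA cells are shown as blanks, only in the printout
```

Because b is still numeric, later calculations such as colMeans(b, na.rm = TRUE) keep working, which is not the case once the cells hold "".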
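In the OP's code, a != c(85:90) will not work as intended: 85:90 is recycled to the length of 'a', and each element of 'a' is compared only with the value at the corresponding position of the recycled vector. To test against a vector of length > 1, use %in% instead. Note also that a[condition] on its own returns a plain vector of the selected values, which is why the structure is lost; assigning into 'a' keeps its dimensions.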
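A small sketch of the difference between the two comparisons, using a made-up vector x (not from the question) that has the same length as 85:90 so that no recycling warning gets in the way:

```r
# x is a hypothetical example vector; four of its values lie in 85:90
x <- c(86, 85, 90, 87, 1, 2)

# Position-by-position comparison: 86 vs 85, 85 vs 86, 90 vs 87, ...
elementwise <- x != 85:90

# Membership test: is each value of x found anywhere in 85:90?
membership <- !(x %in% 85:90)

elementwise   # every position differs, so nothing would be excluded
membership    # only the last two values are outside 85:90
```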

Related

Create new variables by dividing all pre-existing variables by all other variables

I would like to create new variables by dividing all pre-existing variables by each other
e.g.
X1/X1, X1/X2, X1/X3, X1/X4, X1/X5, X1/X6, X1/X7, X1/X8, X1/X9, X1/X10,
X2/X1, X2/X2, X2/X3, X2/X4, X2/X5, X2/X6, X2/X7, X2/X8, X2/X9, X2/X10,
X3/X1, X3/X2 ...
I started by trying to do each individually, as below, but I need to replicate this with multiple variable names so an automation (I assume a function/lapply) would be ideal.
ds$rom_3_5m <- (ds$roll_open_mean_3m/ds$roll_open_mean_5m)
ds$rom_3_10m <- (ds$roll_open_mean_3m/ds$roll_open_mean_10m)
ds$rom_3_15m <- (ds$roll_open_mean_3m/ds$roll_open_mean_15m)
ds$rom_3_30m <- (ds$roll_open_mean_3m/ds$roll_open_mean_30m)
ds$rom_3_60m <- (ds$roll_open_mean_3m/ds$roll_open_mean_60m)
ds$rom_3_120m <- (ds$roll_open_mean_3m/ds$roll_open_mean_120m)
ds$rom_3_240m <- (ds$roll_open_mean_3m/ds$roll_open_mean_240m)
ds$rom_3_480m <- (ds$roll_open_mean_3m/ds$roll_open_mean_480m)
ds$rom_3_960m <- (ds$roll_open_mean_3m/ds$roll_open_mean_960m)
ds$rom_3_1920m <- (ds$roll_open_mean_3m/ds$roll_open_mean_1920m)
ds$rom_3_3840m <- (ds$roll_open_mean_3m/ds$roll_open_mean_3840m)
ds$rom_3_7680m <- (ds$roll_open_mean_3m/ds$roll_open_mean_7680m)
ds$rom_3_15360m <- (ds$roll_open_mean_3m/ds$roll_open_mean_15360m)
ds$rom_3_30720m <- (ds$roll_open_mean_3m/ds$roll_open_mean_30720m)
ds$rom_3_61440m <- (ds$roll_open_mean_3m/ds$roll_open_mean_61440m)
ds$rom_3_122880m <- (ds$roll_open_mean_3m/ds$roll_open_mean_122880m)
ds$rom_3_245760m <- (ds$roll_open_mean_3m/ds$roll_open_mean_245760m)
ds$rom_3_491520m <- (ds$roll_open_mean_3m/ds$roll_open_mean_491520m)
#5m
ds$rom_5_3m <- (ds$roll_open_mean_5m/ds$roll_open_mean_3m)
ds$rom_5_10m <- (ds$roll_open_mean_5m/ds$roll_open_mean_10m)
ds$rom_5_15m <- (ds$roll_open_mean_5m/ds$roll_open_mean_15m)
ds$rom_5_30m <- (ds$roll_open_mean_5m/ds$roll_open_mean_30m)
ds$rom_5_60m <- (ds$roll_open_mean_5m/ds$roll_open_mean_60m)
ds$rom_5_120m <- (ds$roll_open_mean_5m/ds$roll_open_mean_120m)
ds$rom_5_240m <- (ds$roll_open_mean_5m/ds$roll_open_mean_240m)
ds$rom_5_480m <- (ds$roll_open_mean_5m/ds$roll_open_mean_480m)
ds$rom_5_960m <- (ds$roll_open_mean_5m/ds$roll_open_mean_960m)
ds$rom_5_1920m <- (ds$roll_open_mean_5m/ds$roll_open_mean_1920m)
ds$rom_5_3840m <- (ds$roll_open_mean_5m/ds$roll_open_mean_3840m)
ds$rom_5_7680m <- (ds$roll_open_mean_5m/ds$roll_open_mean_7680m)
ds$rom_5_15360m <- (ds$roll_open_mean_5m/ds$roll_open_mean_15360m)
ds$rom_5_30720m <- (ds$roll_open_mean_5m/ds$roll_open_mean_30720m)
ds$rom_5_61440m <- (ds$roll_open_mean_5m/ds$roll_open_mean_61440m)
ds$rom_5_122880m <- (ds$roll_open_mean_5m/ds$roll_open_mean_122880m)
ds$rom_5_245760m <- (ds$roll_open_mean_5m/ds$roll_open_mean_245760m)
ds$rom_5_491520m <- (ds$roll_open_mean_5m/ds$roll_open_mean_491520m)
#10m
ds$rom_10_3m <- (ds$roll_open_mean_10m/ds$roll_open_mean_3m)
ds$rom_10_5m <- (ds$roll_open_mean_10m/ds$roll_open_mean_5m)
ds$rom_10_15m <- (ds$roll_open_mean_10m/ds$roll_open_mean_15m)
I have a data frame with 40+ variables and 6 million rows; I have attached a smaller example data frame below.
Thanks in advance!
Charlie
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
1 57 77 48 8 31 43 47 13 26 88
2 25 75 86 77 4 65 5 49 31 57
3 91 90 42 69 82 33 56 99 47 39
4 35 96 86 77 67 77 20 17 77 92
5 6 100 50 62 16 31 0 39 72 4
6 90 34 74 89 71 37 73 45 24 28
7 24 22 92 13 57 97 32 2 12 80
8 74 59 49 2 97 100 15 37 15 67
9 43 38 66 97 8 20 85 25 97 67
10 82 4 56 40 42 46 44 98 98 76
11 60 68 92 99 81 92 78 59 23 81
12 22 57 37 100 7 1 89 41 40 56
13 69 13 1 82 89 45 83 24 71 29
14 8 14 66 48 94 8 20 3 28 63
15 26 70 56 62 9 34 11 86 71 64
16 7 55 15 100 91 89 46 74 98 14
17 29 68 19 66 83 29 84 76 90 45
18 27 76 6 48 17 28 8 7 52 37
19 68 58 51 75 60 57 74 46 98 93
20 15 15 89 55 23 3 3 8 32 37
21 78 49 57 48 96 89 4 95 67 58
22 12 36 42 59 27 92 48 0 92 28
23 51 17 77 61 84 53 46 22 27 36
24 40 84 83 35 19 13 80 78 96 87
25 44 80 25 72 43 17 74 70 52 36
26 14 61 63 82 16 47 32 93 19 84
27 93 19 28 62 74 1 85 65 50 9
28 80 62 6 58 48 97 97 18 65 43
29 12 58 95 79 37 89 89 83 22 85
30 57 73 22 88 99 63 58 87 90 66
As @27 ϕ 9 suggested in the comments, you should use the lapply solution.
The version below also produces a single data frame with correct names (df is your data frame):
l <- lapply(df, `/`, df)
l <- unlist(l, recursive = FALSE)
data.frame(l)
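To see what the three lines produce, here is a minimal sketch on a tiny made-up data frame (the values and the name ratios are assumptions for illustration, not the asker's data). Each new column is named numerator.denominator.

```r
# Tiny hypothetical data frame standing in for the real one
df <- data.frame(X1 = c(2, 4), X2 = c(1, 2), X3 = c(4, 8))

# For every column x, x / df divides x by each column of df in turn
l <- lapply(df, `/`, df)

# Flatten one level: a list of 9 columns named "X1.X1", "X1.X2", ...
l <- unlist(l, recursive = FALSE)

ratios <- data.frame(l)
names(ratios)
ratios$X1.X2   # X1 divided by X2
```

The result has ncol(df)^2 columns, so 40 variables give 1600 ratio columns; with millions of rows it may be worth checking memory use before running this on the full data.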

How to cut the values in a regular interval and define them into the separate group? [duplicate]

This question already has answers here:
Split a vector into chunks
(22 answers)
Closed 3 years ago.
How to cut the values (1 to 100) in a regular interval (25) and place them into 4 groups as below:
sdr <- c(1:100)
Group1: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
Group2: 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
Group3: 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75
Group4: 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100
Any suggestion, please.
You could use split
sdr <- 1:100
split(sdr, rep(1:4, each = 25))
#$`1`
# [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
#
#$`2`
# [1] 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
#
#$`3`
# [1] 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75
#
#$`4`
# [1] 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100
This returns a list with 4 vector elements.
Also note that the c() around 1:100 is not necessary.
Or we can define the number of groups
ngroup <- 4
split(sdr, rep(1:ngroup, each = length(sdr) %/% ngroup))
giving the same result.
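Both calls above assume the length of the vector is an exact multiple of the group size. A possible variant, sketched here with a made-up vector of length 103, builds the group index from the positions instead, so a shorter last group is allowed:

```r
# Hypothetical input whose length is not a multiple of 25
sdr <- 1:103
size <- 25

# Positions 1..25 map to group 1, 26..50 to group 2, and so on
groups <- split(sdr, ceiling(seq_along(sdr) / size))

lengths(groups)   # number of elements in each group
```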
You can make a dataframe for your groups and then transpose using t:
df <- t(data.frame(Group1 = c(1:25), Group2 = c(26:50), Group3 = c(51:75), Group4 = c(76:100)))

loop over a sequence and rounding problem in R

I want to assign some values to a vector like:
a = rep(0, 101)
for(i in seq(0, 1, 0.01)){
u <- 100 * i + 1
a[u] <- u
}
a
plot(a)
The output is
> a
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 30 0 31 32 33 34
[35] 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 59 0 60 61 62 63 64 65 66 67 68
[69] 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101
There are problems on the 29th and the 59th elements. They should be 29 and 59, but it turns out to be 0, the default value. And the previous values, the 28th and 58th, are also incorrect. Why is this happening? Thank you!
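There is a problem with your indexing. The index u is computed in floating-point arithmetic, so for some values of i it comes out slightly below the intended whole number. R truncates a fractional index toward zero, so the assignment lands one position early (overwriting the previous element) and the intended position keeps its default value of 0. Here is a modification to your code that works because it loops over integer positions: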
a = rep(0, 101)
s<-seq(0, 1, 0.01)
for(i in 1:101){
a[i] <- 100 * s[i] + 1
}
a
plot(a)
In general it is best to avoid multiple indexes in the same loop as it can be confusing and difficult to diagnose problems.
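If you prefer to keep the original loop, a minimal sketch of an alternative fix is to round the computed index before using it. This keeps the variable names from the question; only the round() call is new.

```r
a <- rep(0, 101)
for (i in seq(0, 1, 0.01)) {
  # round() snaps values that are a hair below a whole number back to the
  # intended integer before they are used as an index
  u <- round(100 * i + 1)
  a[u] <- u
}
a
```

To see which values were affected, compare the unrounded index with its rounded version, for example with which(u != round(u)) on u <- 100 * seq(0, 1, 0.01) + 1.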

Generate sequence with alternating increments in R? [duplicate]

This question already has answers here:
Get a seq() in R with alternating steps
(6 answers)
Closed 6 years ago.
I want to use R to create the sequence of numbers 1:8, 11:18, 21:28, etc. through 1000 (or the closest it can get, i.e. 998). Obviously typing that all out would be tedious, but since the sequence increases by one 7 times and then jumps by 3 I'm not sure what function I could use to achieve this.
I tried seq(1, 998, c(1,1,1,1,1,1,1,3)) but it does not give me the results I am looking for so I must be doing something wrong.
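This is a perfect case for vectorisation (and recycling) in R; it is worth reading about both. The logical index below has length 10 and is recycled along 1:100 (use 1:1000 for the full range):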
(1:100)[rep(c(TRUE,FALSE), c(8,2))]
# [1] 1 2 3 4 5 6 7 8 11 12 13 14 15 16 17 18 21 22 23 24 25 26 27 28 31 32
#[27] 33 34 35 36 37 38 41 42 43 44 45 46 47 48 51 52 53 54 55 56 57 58 61 62 63 64
#[53] 65 66 67 68 71 72 73 74 75 76 77 78 81 82 83 84 85 86 87 88 91 92 93 94 95 96
#[79] 97 98
rep(seq(0,990,by=10), each=8) + seq(1,8)
You want to exclude numbers that are 0 or 9 (mod 10). So you can try this too:
n <- 1000 # upper bound
x <- 1:n
x <- x[! (x %% 10) %in% c(0,9)] # filter out (0, 9) mod (10)
head(x,80)
# [1] 1 2 3 4 5 6 7 8 11 12 13 14 15 16 17 18 21 22 23 24 25 26 27
# 28 31 32 33 34 35 36 37 38 41 42 43 44 45 46 47 48 51 52 53 54 55 56 57
# 58 61 62 63 64 65 66 67 68 71 72 73 74 75 76 77 78 81 82 83 84 85
# 86 87 88 91 92 93 94 95 96 97 98
Or in a single line using Filter:
Filter(function(x) !((x %% 10) %in% c(0,9)), 1:100)
# [1] 1 2 3 4 5 6 7 8 11 12 13 14 15 16 17 18 21 22 23 24 25 26 27 28 31 32 33 34 35 36 37 38 41 42 43 44 45 46 47 48 51 52 53 54 55 56 57
# [48] 58 61 62 63 64 65 66 67 68 71 72 73 74 75 76 77 78 81 82 83 84 85 86 87 88 91 92 93 94 95 96 97 98
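With a loop (the result must be initialised first; it is called out here rather than vector, which is the name of a base R function):
out <- c(); for (value in seq(1, 991, 10)) { out <- c(out, seq(value, value + 7)) }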
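As a cross-check, the approaches above can be compared with each other up to 1000. The sketch below also adds one more variant based on outer(), which is not from the answers; the by_* names are made up for illustration.

```r
n <- 1000

# Block starts 0, 10, ..., 990, each repeated 8 times, plus the recycled 1:8
by_recycling <- rep(seq(0, 990, by = 10), each = 8) + seq(1, 8)

# Drop every number that is 0 or 9 modulo 10
by_modulo <- (1:n)[!((1:n) %% 10) %in% c(0, 9)]

# outer() builds an 8 x 100 table of offset + start; as.vector() reads it
# column by column, which gives 1..8, 11..18, 21..28, ...
by_outer <- as.vector(outer(1:8, seq(0, 990, by = 10), "+"))

all(by_recycling == by_modulo)
all(by_outer == by_modulo)
```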

subsetting between two data frames

I want to subset everything from df1 except df2.
df1<-
A B C D E F G H I J
80 16 55 74 89 39 4 67 36 87
69 49 91 83 50 1 77 19 73 43
85 45 97 9 47 65 79 81 86 66
37 58 17 38 76 14 54 78 62 98
12 25 56 20 31 82 34 23 33 11
df2<-
C D E F
55 74 89 39
91 83 50 1
97 9 47 65
17 38 76 14
56 20 31 82
I would like to utilise this kind of approach if possible:
mydata<-df1[,!colnames(df2)]
Your attempt fails because ! cannot be applied to a character vector such as colnames(df2). If you want the columns that are in df1 but not in df2, you can select them by name:
not_in_df2 <- setdiff(colnames(df1), colnames(df2))
subSet_df1 <- df1[,not_in_df2]
Or you could define not_in_df2 via
not_in_df2 <- !(colnames(df1) %in% colnames(df2))
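A runnable sketch of both variants on a cut-down version of the data (only the first two rows and four columns of df1 are used here, and the names by_name and by_flag are made up). drop = FALSE is added so the result stays a data frame even if only one column is left.

```r
# Cut-down stand-ins for df1 and df2 from the question
df1 <- data.frame(A = c(80, 69), B = c(16, 49), C = c(55, 91), D = c(74, 83))
df2 <- df1[, c("C", "D")]

# Variant 1: column names present in df1 but not in df2
not_in_df2 <- setdiff(colnames(df1), colnames(df2))
by_name <- df1[, not_in_df2, drop = FALSE]

# Variant 2: logical flag per column of df1
keep <- !(colnames(df1) %in% colnames(df2))
by_flag <- df1[, keep, drop = FALSE]

by_name
identical(by_name, by_flag)
```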
