Ok, first of all let me generate some sample data:
A_X01 <- c(34, 65, 23, 43, 22)
A_X02 <- c(2, 4, 7, 8, 3)
B_X01 <- c(24, 45, 94, 23, 54)
B_X02 <- c(4, 2, 4, 9, 1)
C_X01 <- c(34, 65, 876, 45, 87)
C_X02 <- c(123, 543, 86, 87, 34)
Var <- c(3, 5, 7, 2, 3)
DF <- data.frame(A_X01, A_X02, B_X01, B_X02, C_X01, C_X02, Var)
What I want to do is apply an equation to the concurrent columns of A and B for both X01 and X02, with a third column "Var" used in the equation.
So far I have been doing this the following way:
DF$D_X01 <- (DF$A_X01 + DF$B_X01) * DF$Var
DF$D_X02 <- (DF$A_X02 + DF$B_X02) * DF$Var
My desired output is as follows:
A_X01 A_X02 B_X01 B_X02 C_X01 C_X02 Var D_X01 D_X02
1 34 2 24 4 34 123 3 174 18
2 65 4 45 2 65 543 5 550 30
3 23 7 94 4 876 86 7 819 77
4 43 8 23 9 45 87 2 132 34
5 22 3 54 1 87 34 3 228 12
As you'll appreciate this is a lot of lines of code to do something fairly simple. Meaning at present my scripts are rather long (as I have multiple columns in the actual dataset)!
One of the apply functions must be the way to go but I can't seem to get my head around it for concurrent columns. I did think about using lapply but how would I get this to work for the two lists of columns and for the right columns to be added together?
I've looked around and can't seem to find a way to do this which must be a fairly common problem?
Thanks.
EDIT:
Original question was a bit confusing so have updated with a desired output and some extra conditions.
Try this
indx <- gsub("\\D", "", grep("A_X|B_X", names(DF), value = TRUE)) # Retrieving indexes
indx2 <- DF[grep("A_X|B_X", names(DF))] # Considering only the columns of interest
DF[paste0("D_X", unique(indx))] <-
sapply(unique(indx), function(x) rowSums(indx2[which(indx == x)])*DF$Var)
DF
# A_X01 A_X02 B_X01 B_X02 C_X01 C_X02 Var D_X01 D_X02
# 1 34 2 24 4 34 123 3 174 18
# 2 65 4 45 2 65 543 5 550 30
# 3 23 7 94 4 876 86 7 819 77
# 4 43 8 23 9 45 87 2 132 34
# 5 22 3 54 1 87 34 3 228 12
You may also try
indxA <- grep("^A", colnames(DF))
indxB <- grep("^B", colnames(DF))
f1 <- function(x,y,z) (x+y)*z
DF[sprintf('D_X%02d', indxA)] <- Map(f1 , DF[indxA], DF[indxB], list(DF$Var))
DF
# A_X01 A_X02 B_X01 B_X02 C_X01 C_X02 Var D_X01 D_X02
#1 34 2 24 4 34 123 3 174 18
#2 65 4 45 2 65 543 5 550 30
#3 23 7 94 4 876 86 7 819 77
#4 43 8 23 9 45 87 2 132 34
#5 22 3 54 1 87 34 3 228 12
Or you could use mapply
DF[sprintf('D_X%02d', indxA)] <- mapply(`+`, DF[indxA],DF[indxB])*DF$Var
Related
seq can only use a single value in the by parameter. Is there a way to vectorize by, i.e. to use multiple intervals?
Something like this:
seq(1, 10, by = c(1, 2))
would return c(1, 2, 4, 5, 7, 8, 10). Now, this is possible to do this with e.g. seq(1, 10, by = 1)[c(T, T, F)] because it's a simple case, but is there a way to make it generalizable to more complex sequences?
Some examples
seq(1, 100, by = 1:5)
#[1] 1 2 4 7 11 16 17 19 22 26 31...
seq(8, -5, by = c(3, 8))
#[1] 8 5 -3
this looks like a close base R solution?
ans <- Reduce(`+`, rep(1:5, 100), init = 1, accumulate = TRUE)
ans[1:(which.max(ans >= 100) - 1)]
[1] 1 2 4 7 11 16 17 19 22 26 31 32 34 37 41 46 47 49 52 56 61 62 64
[24] 67 71 76 77 79 82 86 91 92 94 97
you would have to inverse a part of it, if you want to calculate 'down'
ans <- Reduce(`+`, rep(c(-3, -8), 20), init = 8, accumulate = TRUE)
ans[1:(which.max(ans <= -5) - 1)]
[1] 8 5 -3
still, you would have tu 'guess' the number of repetitions needed (20 of 100 in the examples above) to create ans.
This doesn't seem possible Maƫl. I suppose it's easy enough to write one?
seq2 <- function(from, to, by) {
vals <- c(0, cumsum(rep(by, abs(ceiling((to - from) / sum(by))))))
if(from > to) return((from - vals)[(from - vals) >= to])
else (from + vals)[(from + vals) <= to]
}
Testing:
seq2(1, 10, by = c(1, 2))
#> [1] 1 2 4 5 7 8 10
seq2(1, 100, by = 1:5)
#> [1] 1 2 4 7 11 16 17 19 22 26 31 32 34 37 41 46 47 49 52 56 61 62 64 67 71
#> [26] 76 77 79 82 86 91 92 94 97
seq2(8, -5, by = c(3, 8))
#> [1] 8 5 -3
Created on 2022-12-23 with reprex v2.0.2
We are looking to create a vector with the following sequence:
1,4,5,8,9,12,13,16,17,20,21,...
Start with 1, then skip 2 numbers, then add 2 numbers, then skip 2 numbers, etc., not going above 2000. We also need the inverse sequence 2,3,6,7,10,11,...
We may use recyling vector to filter the sequence
(1:21)[c(TRUE, FALSE, FALSE, TRUE)]
[1] 1 4 5 8 9 12 13 16 17 20 21
Here's an approach using rep and cumsum. Effectively, "add up alternating increments of 1 (successive #s) and 3 (skip two)."
cumsum(rep(c(1,3), 500))
and
cumsum(rep(c(3,1), 500)) - 1
Got this one myself - head(sort(c(seq(1, 2000, 4), seq(4, 2000, 4))), 20)
We can try like below
> (v <- seq(21))[v %% 4 %in% c(0, 1)]
[1] 1 4 5 8 9 12 13 16 17 20 21
You may arrange the data in a matrix and extract 1st and 4th column.
val <- 1:100
sort(c(matrix(val, ncol = 4, byrow = TRUE)[, c(1, 4)]))
# [1] 1 4 5 8 9 12 13 16 17 20 21 24 25 28 29 32 33
#[18] 36 37 40 41 44 45 48 49 52 53 56 57 60 61 64 65 68
#[35] 69 72 73 76 77 80 81 84 85 88 89 92 93 96 97 100
A tidyverse option.
library(purrr)
library(dplyr)
map_int(1:11, ~ case_when(. == 1 ~ as.integer(1),
. %% 2 == 0 ~ as.integer(.*2),
T ~ as.integer((.*2)-1)))
# [1] 1 4 5 8 9 12 13 16 17 20 21
I have three years of detection data. In each year there are 8 probabilities at a site. These are no, a, n, na, l, la, ln, lna. I've assigned the values below:
no = 0
a = 1
n = 1
na = 2
l = 100
la = 101
ln = 101
lna = 102
In year 2, I wish to calculate and label all outcomes, so any combination of 2 of the terms above, to describe a detection history numerically.
So essentially I'm trying to get a list of 64 terms ranging from no,no to lna,lna with their respective values.
For example, no,no = 0 and lna,lna = 204
In year 3, I wish for the same. I'd like to calculate and label all possibilities. This needs to be arranged in two columns, one with history text, and one with history value.
x1 x2
no,no,no 0
I'm sure this is possible, and possibly even basic. Though I have no idea where to begin.
Any help would be greatly appreciated.
Thanks in advance
I'm sure there are more elegant, concise ways to do it, but here's one approach:
Define the two lists of possibilities
poss = c("no", "a", "n", "na", "l", "la", "ln", "lna")
vals = c(1, 1, 2, 100, 101, 101, 101, 102)
Use expand.grid to enumerate the possibilities
output <- expand.grid(poss, poss, stringsAsFactors = FALSE)
comb_values <- expand.grid(vals, vals)
Write the ouput
output$names <- paste(output$Var1, output$Var2, sep = ",")
output$value <- comb_values$Var1 + comb_values$Var2
output$Var1 <- output$Var2 <- NULL
Result
names value
1 no,no 2
2 a,no 2
3 n,no 3
4 na,no 101
5 l,no 102
6 la,no 102
7 ln,no 102
8 lna,no 103
9 no,a 2
10 a,a 2
11 n,a 3
12 na,a 101
13 l,a 102
14 la,a 102
15 ln,a 102
16 lna,a 103
17 no,n 3
18 a,n 3
19 n,n 4
20 na,n 102
21 l,n 103
22 la,n 103
23 ln,n 103
24 lna,n 104
25 no,na 101
26 a,na 101
27 n,na 102
28 na,na 200
29 l,na 201
30 la,na 201
31 ln,na 201
32 lna,na 202
33 no,l 102
34 a,l 102
35 n,l 103
36 na,l 201
37 l,l 202
38 la,l 202
39 ln,l 202
40 lna,l 203
41 no,la 102
42 a,la 102
43 n,la 103
44 na,la 201
45 l,la 202
46 la,la 202
47 ln,la 202
48 lna,la 203
49 no,ln 102
50 a,ln 102
51 n,ln 103
52 na,ln 201
53 l,ln 202
54 la,ln 202
55 ln,ln 202
56 lna,ln 203
57 no,lna 103
58 a,lna 103
59 n,lna 104
60 na,lna 202
61 l,lna 203
62 la,lna 203
63 ln,lna 203
64 lna,lna 204
Same logic for three days, just replace poss, poss with poss, poss, poss etc.
I have a data frame with lot of company information separated by an id variable. I want to sort one of the variables and repeat it for every id. Let's take this example,
df <- structure(list(id = c(110, 110, 110, 90, 90, 90, 90, 252, 252
), var1 = c(26, 21, 54, 10, 18, 9, 16, 54, 39), var2 = c(234,
12, 43, 32, 21, 19, 16, 34, 44)), .Names = c("id", "var1", "var2"
), row.names = c(NA, -9L), class = "data.frame")
Which looks like this
df
id var1 var2
1 110 26 234
2 110 21 12
3 110 54 43
4 90 10 32
5 90 18 21
6 90 9 19
7 90 16 16
8 252 54 34
9 252 39 44
Now, I want to sort the data frame according to var1 by the vector id. Easiest solution I can think of is using apply function like this,
> apply(df, 2, sort)
id var1 var2
[1,] 90 9 12
[2,] 90 10 16
[3,] 90 16 19
[4,] 90 18 21
[5,] 110 21 32
[6,] 110 26 34
[7,] 110 39 43
[8,] 252 54 44
[9,] 252 54 234
However, this is not the output I am seeking. The correct output should be,
id var1 var2
1 110 21 12
2 110 26 234
3 110 54 43
4 90 9 19
5 90 10 32
6 90 16 16
7 90 18 21
8 252 39 44
9 252 54 34
Group by id and sort by var1 column and keep original id column order.
Any idea how to sort like this?
Note. As mentioned by Moody_Mudskipper, there is no need to use tidyverse and can also be done easily with base R:
df[order(ordered(df$id, unique(df$id)), df$var1), ]
A one-liner tidyverse solution w/o any temp vars:
library(tidyverse)
df %>% arrange(ordered(id, unique(id)), var1)
# id var1 var2
# 1 110 26 234
# 2 110 21 12
# 3 110 54 43
# 4 90 10 32
# 5 90 18 21
# 6 90 9 19
# 7 90 16 16
# 8 252 54 34
# 9 252 39 44
Explanation of why apply(df, 2, sort) does not work
What you were trying to do is to sort each column independently. apply runs over the specified dimension (2 in this case which corresponds to columns) and applies the function (sort in this case).
apply tries to further simplify the results, in this case to a matrix. So you are getting back a matrix (not a data.frame) where each column is sorted independently. For example this row from the apply call:
# [1,] 90 9 12
does not even exist in the original data.frame.
Another base R option using order and match
df[with(df, order(match(id, unique(id)), var1, var2)), ]
# id var1 var2
#2 110 21 12
#1 110 26 234
#3 110 54 43
#6 90 9 19
#4 90 10 32
#7 90 16 16
#5 90 18 21
#9 252 39 44
#8 252 54 34
We can convert the id to factor in order to split while preserving the original order. We can then loop over the list and order, and rbind again, i.e.
df$id <- factor(df$id, levels = unique(df$id))
do.call(rbind, lapply(split(df, df$id), function(i)i[order(i$var1),]))
# id var1 var2
#110.2 110 21 12
#110.1 110 26 234
#110.3 110 54 43
#90.6 90 9 19
#90.4 90 10 32
#90.7 90 16 16
#90.5 90 18 21
#252.9 252 39 44
#252.8 252 54 34
NOTE: You can reset the rownames by rownames(new_df) <- NULL
In base R we could use split<- :
split(df,df$id) <- lapply(split(df,df$id), function(x) x[order(x$var1),] )
or as #Markus suggests :
split(df, df$id) <- by(df, df$id, function(x) x[order(x$var1),])
output in either case :
df
# id var1 var2
# 1 110 21 12
# 2 110 26 234
# 3 110 54 43
# 4 90 9 19
# 5 90 10 32
# 6 90 16 16
# 7 90 18 21
# 8 252 39 44
# 9 252 54 34
With the following tidyverse pipe, the question's output is reproduced.
library(tidyverse)
df %>%
mutate(tmp = cumsum(c(0, diff(id) != 0))) %>%
group_by(id) %>%
arrange(tmp, var1) %>%
select(-tmp)
## A tibble: 9 x 3
## Groups: id [3]
# id var1 var2
# <dbl> <dbl> <dbl>
#1 110 21 12
#2 110 26 234
#3 110 54 43
#4 90 9 19
#5 90 10 32
#6 90 16 16
#7 90 18 21
#8 252 39 44
#9 252 54 34
Var1 <- 90:115
Var2 <- 1:26
Var3 <- 52:27
data <- data.frame(Var1, Var2, Var3)
Hi, I want to select from each column the 10 largest values and save them in a new data frame? I know that in my example the new data frame will contain 20 rows but I don't understand the correct workflow.
That's what I'm expecting:
Var1 Var2 Var3
90 1 52
91 2 51
92 3 50
93 4 49
94 5 48
95 6 47
96 7 46
97 8 45
98 9 44
99 10 43
106 17 36
107 18 35
108 19 34
109 20 33
110 21 32
111 22 31
112 23 30
113 24 29
114 25 28
115 26 27
I can solve my problem for three column with this approach
df <- subset(data, Var1 >=106 | Var2 >=17 | Var3 >=43)
but if I have to do that for 50+ columns it's not really the best solution.
This can be done by looping over the columns with lapply, sort them, and get the first 10 values with head
data.frame(lapply(data, function(x) head(sort(x,
decreasing=TRUE) ,10)))
If we need the first 10 rows, just use
head(data, 10)
Update
Based on the OP's edit
data[sort(Reduce(union,lapply(data, function(x)
order(x,decreasing=TRUE)[1:10]))),]
I think this is what you want:
data[sort(unique(c(sapply(data,order,decreasing=T)[1:10,]))),]
Basically index the top 10 elements from each column, merge them and remove duplicate, reorder and extract it from the original data.
A direct answer to your question:
nv1 <- sort(Var1,decreasing = TRUE)[1:10]
nv2 <- sort(Var2,decreasing = TRUE)[1:10]
nv3 <- sort(Var2,decreasing = TRUE)[1:10]
nd <- data.frame(nv1, nv2, nv3)
But why would you want to do such a thing? You're breaking the order of the data -- Var3 is increasing and the others are decreasing. Perhaps you want a list, rather than a data frame?
This might help:
thresh <- sapply(data,sort,decreasing=T)[10,]
data[!!rowSums(sapply(1:ncol(data),function(x) data[,x]>=thresh[x])),]
First, a vector thresh is defined, which contains the tenth largest value of each column. Then we perform a loop over the columns to check if any of the values is larger than or equal to the corresponding threshold value. The !! is a shorthand notation for as.logical(), which (owing to the combination with rowSums) selects those rows where at least one of the values is above or equal to the threshold. In your example this yields the output:
# Var1 Var2 Var3
#1 90 1 52
#2 91 2 51
#3 92 3 50
#4 93 4 49
#5 94 5 48
#6 95 6 47
#7 96 7 46
#8 97 8 45
#9 98 9 44
#10 99 10 43
#17 106 17 36
#18 107 18 35
#19 108 19 34
#20 109 20 33
#21 110 21 32
#22 111 22 31
#23 112 23 30
#24 113 24 29
#25 114 25 28
#26 115 26 27
Which is equal to the output that you obtain with the command you posted:
#> identical(data[!!rowSums(sapply(1:ncol(data),function(x) data[,x]>=thresh[x])),], subset(data, Var1 >=106 | Var2 >=17 | Var3 >=43))
[1] TRUE