I am trying to apply a function to specific rows of a data.frame.
My aim is to add a fifth column holding the result of a function that varies according to group and func. Say I would like to perform a t-test when func = a and calculate a mean difference when func = b. In other words, the first three rows of the fifth column should contain the result of a t-test (t.test(n1[1:3], n2[1:3])$p.value) comparing n1 and n2 in group 1. How is this possible?
n1<-c(58,94,58,94,65,87,65,91,20,16)
n2<-c(37,34,88,23,86,37,80,34,24,67)
group<-c(1,1,1,2,2,2,2,3,3,3)
func<-c('a','a','a','b','b','b','b','a','a','a')
data<-data.frame(n1,n2,group,func)
data
n1 n2 group func
1 58 37 1 a
2 94 34 1 a
3 58 88 1 a
4 94 23 2 b
5 65 86 2 b
6 87 37 2 b
7 65 80 2 b
8 91 34 3 a
9 20 24 3 a
10 16 67 3 a
EDIT:
Manually I can do it like this. But how could I do it if I had 1000+ rows with 100+ groups?
pvalue1<-t.test(c(58,94,58),c(37,34,88))$p.value
pvalue2<-chisq.test(c(94,65,87,65),c(23,86,37,80))$p.value
pvalue3<-t.test(c(91,20,16),c(34,24,67))$p.value
pvalue<-c(rep(pvalue1,3),rep(pvalue2,4),rep(pvalue3,3))
cbind(data,pvalue)
n1 n2 group func pvalue
1 58 37 1 a 0.4737073
2 94 34 1 a 0.4737073
3 58 88 1 a 0.4737073
4 94 23 2 b 0.2381033
5 65 86 2 b 0.2381033
6 87 37 2 b 0.2381033
7 65 80 2 b 0.2381033
8 91 34 3 a 0.9822272
9 20 24 3 a 0.9822272
10 16 67 3 a 0.9822272
You can do the calculations with dplyr like this:
library(dplyr)
my_df %>%
group_by(group) %>%
mutate(p_value = ifelse(func == 'a', t.test(n1, n2)$p.value, chisq.test(n1, n2)$p.value))
# A tibble: 10 x 5
# Groups: group [3]
# n1 n2 group func p_value
# <dbl> <dbl> <dbl> <fct> <dbl>
# 1 58. 37. 1. a 0.474
# 2 94. 34. 1. a 0.474
# 3 58. 88. 1. a 0.474
# 4 94. 23. 2. b 0.238
# 5 65. 86. 2. b 0.238
# 6 87. 37. 2. b 0.238
# 7 65. 80. 2. b 0.238
# 8 91. 34. 3. a 0.982
# 9 20. 24. 3. a 0.982
# 10 16. 67. 3. a 0.982
I've seen some cool stuff along these lines in Hadley's R4DS book. Check this out for an example and some discussion around my approach below.
The following goes some way to achieving what you'd like:
library(dplyr)
library(purrr)
library(tidyr)
test_function <- function(func, data) {
if (func == "a") {t.test(data$n1, data$n2)$p.value}
else if (func == "b") {chisq.test(data$n1, data$n2)$p.value}
}
df %>%
group_by(group, func) %>%
nest() %>%
mutate(p_value = map2_dbl(func, data, function(x, y) test_function(x, y))) %>%
unnest(data)
Consider base R's underused by() which can split dataframes by one or more factors and then pass subsets into a defined or anonymous function, returning a list of function's output.
Data (assuming functions are strings)
n1 <- c(58,94,58,94,65,87,65,91,20,16)
n2 <- c(37,34,88,23,86,37,80,34,24,67)
group <- c(1,1,1,2,2,2,2,3,3,3)
func <- c('t.test','t.test','t.test','chisq.test','chisq.test',
'chisq.test','chisq.test','t.test','t.test','t.test')
data <- data.frame(n1,n2,group,func)
By processing (using get() to retrieve actual function):
data_list <- by(data, data$group, function(sub){
func <- as.character(sub$func[[1]])
f <- get(func)
sub$pvalue <- f(sub$n1, sub$n2)$p.value
return(sub)
})
final_df <- do.call(rbind, data_list)
final_df
# n1 n2 group func pvalue
# 1.1 58 37 1 t.test 0.4737073
# 1.2 94 34 1 t.test 0.4737073
# 1.3 58 88 1 t.test 0.4737073
# 2.4 94 23 2 chisq.test 0.2381033
# 2.5 65 86 2 chisq.test 0.2381033
# 2.6 87 37 2 chisq.test 0.2381033
# 2.7 65 80 2 chisq.test 0.2381033
# 3.8 91 34 3 t.test 0.9822272
# 3.9 20 24 3 t.test 0.9822272
# 3.10 16 67 3 t.test 0.9822272
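If you would rather keep the short labels from the question ('a', 'b') than store function names as strings, a named list of functions can stand in for get(). This is only a sketch: the list name tests and the mapping of 'b' to chisq.test() are assumptions taken from the manual example in the question, and chisq.test() will warn on such small counts.

```r
n1    <- c(58,94,58,94,65,87,65,91,20,16)
n2    <- c(37,34,88,23,86,37,80,34,24,67)
group <- c(1,1,1,2,2,2,2,3,3,3)
func  <- c('a','a','a','b','b','b','b','a','a','a')
dat   <- data.frame(n1, n2, group, func, stringsAsFactors = FALSE)

# Hypothetical lookup table: one p-value function per label in `func`
tests <- list(
  a = function(x, y) t.test(x, y)$p.value,
  b = function(x, y) chisq.test(x, y)$p.value
)

pieces <- lapply(split(dat, dat$group), function(sub) {
  f <- tests[[as.character(sub$func[1])]]   # func is assumed constant within a group
  if (is.null(f)) stop("no test registered for func = ", sub$func[1])
  sub$pvalue <- f(sub$n1, sub$n2)           # scalar, recycled over the group's rows
  sub
})

# unsplit() puts the rows back in their original order
final_df <- unsplit(pieces, dat$group)
final_df
```

Adding a new test is then a matter of adding one entry to the list, and an unknown label fails loudly instead of picking up whatever object happens to have that name.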
I am trying to simulate the following "game":
There is a population of 100 units
You randomly sample 10 of these units, record the id's of the units you saw, and then put them back into the population
You then take a second sample, record the id's of the units you saw in this second sample along with the first sample, and then put the second sample back into the population
Repeat this many times
I wrote the following code in R that performs the above procedure:
library(dplyr)
var_1 = rnorm(100,10,10)
var_2 = rnorm(100,1,10)
var_3 = rnorm(100,5,10)
response = rnorm(100,1,1)
my_data = data.frame(var_1, var_2, var_3, response)
my_data$id = 1:100
results <- list()
results2<- list()
for (i in 1:100)
{
iteration_i = i
sample_i = my_data[sample(nrow(my_data), 10), ]
results_tmp = data.frame(iteration_i, sample_i)
results[[i]] <- results_tmp
}
results_df <- do.call(rbind.data.frame, results)
test_1 <- data.frame(results_df %>%
group_by(id) %>%
filter(iteration_i == min(iteration_i)) %>%
distinct)
summary_file = data.frame(test_1 %>% group_by(iteration_i) %>% summarise(Count = n()))
cumulative = cumsum(summary_file$Count)
summary_file$Cumulative = cumulative
summary_file$unobserved = 100 - cumulative
The result looks something like this:
> summary_file
iteration_i Count Cumulative unobserved
1 1 10 10 90
2 2 8 18 82
3 3 9 27 73
4 4 8 35 65
5 5 6 41 59
6 6 5 46 54
7 7 7 53 47
8 8 7 60 40
9 9 4 64 36
10 10 3 67 33
11 11 4 71 29
12 12 4 75 25
13 13 1 76 24
14 14 4 80 20
15 15 1 81 19
16 16 2 83 17
17 17 2 85 15
18 18 1 86 14
19 20 1 87 13
20 22 1 88 12
21 23 2 90 10
22 24 1 91 9
23 25 1 92 8
24 27 2 94 6
25 28 1 95 5
26 30 1 96 4
27 35 1 97 3
28 37 1 98 2
29 44 1 99 1
30 46 1 100 0
I would now like to repeat this "game" many times.
I would like to keep the "summary_file" for each "game" (e.g. summary_file_1, summary_file_2, summary_file_3, etc.)
I would then like to create a "total" summary file that shows the number of iterations that were required in each game to observe all units.
This total_summary_file would look something like this:
game_id iterations_required
1 game_1 47
2 game_2 45
3 game_3 44
4 game_4 42
5 game_5 42
Currently, I am just copy/pasting my earlier code several times and storing the results, then I append everything at the end and calculate the summary statistics - but I am trying to find a way to "loop the loop" and do everything at once. I do not know if it is possible to introduce a command like "results_df_i <- do.call(rbind.data.frame, results_i)" into the loop and efficiently create everything at the same time instead of manually copy/pasting the earlier loop.
You're making this a lot less efficient than it could be. To get, say, 100 repeated samples of 10 from the set 1:100 (with replacement), we can do replicate(100, sample(100, 10, TRUE)).
We can then coerce this into a vector and count the number of unique values every 10 entries along the vector until we get to 100. This gives us the number of iterations required to exhaust the samples.
If we put this inside an sapply, we don't even need an explicit loop, which means we can create the results data frame in a single call:
set.seed(1)
n_games <- 10
results <- data.frame(game_id = paste("game", seq(n_games), sep = "_"),
iterations_required = sapply(seq(n_games), function(x) {
samp <- c(replicate(100, sample(100, 10, TRUE)))
sum(sapply(1:100 * 10, function(n) length(unique(samp[1:n]))) < 100)
}))
results
#> game_id iterations_required
#> 1 game_1 59
#> 2 game_2 44
#> 3 game_3 54
#> 4 game_4 59
#> 5 game_5 57
#> 6 game_6 58
#> 7 game_7 96
#> 8 game_8 60
#> 9 game_9 71
#> 10 game_10 33
Created on 2022-06-11 by the reprex package (v2.0.1)
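If you want each draw of 10 to be without replacement, as in the question's my_data[sample(nrow(my_data), 10), ], and you don't want to cap a game at 100 iterations, a plain while loop is another option. This is a rough sketch; play_game and its arguments are names made up here.

```r
# Hypothetical helper: play one game and return the number of iterations
# needed until every unit has been seen at least once.
play_game <- function(n_units = 100, n_draw = 10) {
  seen <- logical(n_units)
  iterations <- 0
  while (!all(seen)) {
    # one iteration: draw n_draw distinct units, then "put them back"
    seen[sample(n_units, n_draw)] <- TRUE
    iterations <- iterations + 1
  }
  iterations
}

set.seed(1)
n_games <- 5
total_summary_file <- data.frame(
  game_id = paste0("game_", seq_len(n_games)),
  iterations_required = replicate(n_games, play_game())
)
total_summary_file
```

Because the loop stops as soon as the last unit shows up, no samples are generated beyond what the game actually needs.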
There are lots of ways to get your desired outcome; wrapping your loop in a function and running the function multiple times is another potential solution:
library(dplyr)
var_1 = rnorm(100,10,10)
var_2 = rnorm(100,1,10)
var_3 = rnorm(100,5,10)
response = rnorm(100,1,1)
my_data = data.frame(var_1, var_2, var_3, response)
my_data$id = 1:100
results <- list()
results2<- list()
sample_func <- function(output_file_name) {
for (i in 1:100)
{
iteration_i = i
sample_i = my_data[sample(nrow(my_data), 10), ]
results_tmp = data.frame(iteration_i, sample_i)
results[[i]] <- results_tmp
}
results_df <- do.call(rbind.data.frame, results)
test_1 <- data.frame(results_df %>%
group_by(id) %>%
filter(iteration_i == min(iteration_i)) %>%
distinct)
summary_file = data.frame(test_1 %>% group_by(iteration_i) %>% summarise(Count = n()))
cumulative = cumsum(summary_file$Count)
summary_file$Cumulative = cumulative
summary_file$unobserved = 100 - cumulative
output_file_name <- summary_file
return(output_file_name)
}
list_of_names <- paste0("game_", 1:10)
output <- lapply(list_of_names, sample_func)
names(output) <- list_of_names
head(output, n = 2)
#> $game_1
#> iteration_i Count Cumulative unobserved
#> 1 1 10 10 90
#> 2 2 9 19 81
#> 3 3 8 27 73
#> 4 4 8 35 65
#> 5 5 6 41 59
#> 6 6 6 47 53
#> 7 7 6 53 47
#> 8 8 4 57 43
#> 9 9 4 61 39
#> 10 10 3 64 36
#> 11 11 2 66 34
#> 12 12 1 67 33
#> 13 13 4 71 29
#> 14 14 1 72 28
#> 15 15 2 74 26
#> 16 16 2 76 24
#> 17 17 3 79 21
#> 18 18 4 83 17
#> 19 19 2 85 15
#> 20 20 2 87 13
#> 21 21 1 88 12
#> 22 24 1 89 11
#> 23 25 2 91 9
#> 24 26 1 92 8
#> 25 27 1 93 7
#> 26 30 1 94 6
#> 27 31 1 95 5
#> 28 33 1 96 4
#> 29 34 1 97 3
#> 30 36 1 98 2
#> 31 41 1 99 1
#> 32 66 1 100 0
#>
#> $game_2
#> iteration_i Count Cumulative unobserved
#> 1 1 10 10 90
#> 2 2 10 20 80
#> 3 3 7 27 73
#> 4 4 7 34 66
#> 5 5 8 42 58
#> 6 6 5 47 53
#> 7 7 7 54 46
#> 8 8 5 59 41
#> 9 9 1 60 40
#> 10 10 7 67 33
#> 11 11 3 70 30
#> 12 12 3 73 27
#> 13 13 1 74 26
#> 14 14 3 77 23
#> 15 15 4 81 19
#> 16 16 3 84 16
#> 17 17 2 86 14
#> 18 18 1 87 13
#> 19 19 2 89 11
#> 20 20 1 90 10
#> 21 21 2 92 8
#> 22 22 1 93 7
#> 23 25 2 95 5
#> 24 27 1 96 4
#> 25 29 2 98 2
#> 26 30 1 99 1
#> 27 41 1 100 0
The output is a list, so you can use purrr to apply functions to each element (e.g. https://purrr.tidyverse.org/reference/lmap.html) or use the bind_rows() function to create a single dataframe for further use, e.g.
df2 <- bind_rows(output, .id = "game") %>%
group_by("Game" = factor(game, levels = list_of_names)) %>%
summarise(rows_in_output = n(),
number_of_iterations = max(iteration_i))
df2
#> # A tibble: 10 × 3
#> Game rows_in_output number_of_iterations
#> <fct> <int> <int>
#> 1 game_1 32 66
#> 2 game_2 27 41
#> 3 game_3 27 48
#> 4 game_4 32 50
#> 5 game_5 27 35
#> 6 game_6 27 71
#> 7 game_7 28 68
#> 8 game_8 27 48
#> 9 game_9 29 43
#> 10 game_10 29 66
Created on 2022-06-17 by the reprex package (v2.0.1)
Or you can use list2env() to have each individual dataframe in your environment, e.g.
list2env(output, envir = .GlobalEnv)
ls()
#> [1] "df2" "game_1" "game_10" "game_2" "game_3" "game_4"
#> [7] "game_5" "game_6" "game_7" "game_8" "game_9" "list_of_names"
#> [13] "my_data" "output" "response" "results" "results2" "sample_func"
#> [19] "var_1" "var_2" "var_3"
This seems easily solvable using recursion:
fun <- function(x, i=1, size = 10){
a <- setdiff(x,sample(100, 10, TRUE)) # Remove the seen from x
if(length(a)) Recall(a, i+1) else i # if we have unobserved, call fun again
}
Now we can have as many games as we want:
data.frame(game = paste0('game',seq(10)), results = replicate(10, fun(1:100)))
game results
1 game1 62
2 game2 40
3 game3 51
4 game4 50
5 game5 34
6 game6 83
7 game7 38
8 game8 40
9 game9 53
10 game10 41
You could also do
hist(replicate(1000, fun(1:100)), breaks = 30)
Edit:
Note that this can be edited to take any size and vector, e.g.:
fun <- function(x, size = 10, y=x, i=1){
a <- setdiff(x,sample(y, size, TRUE))
cat('i', i, '\t a: ',a, '\n')
if(length(a)>0) Recall(a, size, y, i+1) else i
}
set.seed(117);fun(1:10, 1)
i 1 a: 1 2 4 5 6 7 8 9 10 # 3 removed
i 2 a: 1 2 4 5 7 8 9 10 # 6 removed
i 3 a: 1 2 4 5 7 8 9 # 10 removed
i 4 a: 1 2 4 5 8 9 # 7 removed
i 5 a: 1 2 5 8 9 # 4 removed
i 6 a: 1 2 5 8 9 # Nothing removed
i 7 a: 1 5 8 9 # 2 removed
i 8 a: 1 5 8 # 9 removed
i 9 a: 1 5 # Nothing removed
i 10 a: 1 5 # Nothing removed
i 11 a: 5 # 1 removed
i 12 a: # 5 removed
Using Markov chains, we can produce the cumulative distribution function for the number of iterations required for a game (up to machine precision). The resulting CDF can be sampled directly using findInterval.
We can simplify things slightly by starting with the second iteration, since the first iteration will always result in 90 unseen units.
First, set up a matrix for all possible transitions:
m <- matrix(c(rep(90:1, each = 11), sequence(rep(11,90), 90:1, -1)), ncol = 2, dimnames = list(NULL, c("from", "to")))
m <- m[m[,2] >= 0L,]
Then create a transition matrix with row 1 representing the state where all units have been seen and row 91 representing the state where 10 units have been seen:
mTrans <- matrix(0, 91, 91)
The number of previously unseen units selected follows the hypergeometric distribution.
mTrans[m + 1L] <- dhyper(m[,1] - m[,2], m[,1], 100L - m[,1], 10L)
Row 1 represents an absorbing state since all units have been seen.
mTrans[1, 1] <- 1
mTrans contains the probabilities of each state after the second iteration.
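As a sanity check on the transition matrix, the sketch below rebuilds it with an explicit loop (so it does not depend on the three-argument form of sequence()), confirms that every row sums to 1, and uses the standard absorbing-chain result that the expected number of steps to absorption is a row sum of (I - Q)^-1, where Q is the transient part of the matrix. The names P, Q and N are introduced here for illustration only.

```r
n_units <- 100L
n_draw  <- 10L

# State u = number of unseen units (0..90); row/column u + 1 holds state u.
P <- matrix(0, 91, 91)
for (u in 1:90) {
  k <- 0:min(u, n_draw)   # number of previously unseen units in one draw
  P[u + 1L, u - k + 1L] <- dhyper(k, u, n_units - u, n_draw)
}
P[1, 1] <- 1              # u = 0 is absorbing

# Each row is a probability distribution
row_sums_ok <- all(abs(rowSums(P) - 1) < 1e-12)

# Transient states are u = 1..90; N[u, ] holds expected visits starting from u
Q <- P[-1, -1]
N <- solve(diag(90) - Q)

# Start from u = 90 (the state after iteration 1), so add that first iteration
expected_iterations <- 1 + sum(N[90, ])
expected_iterations
```

The value can be compared against the mean of a large batch of simulated games as a further check.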
Initialize a while loop and calculate the CDF.
mm <- mTrans %*% mTrans
maxIter <- 1000L
p <- numeric(maxIter)
iter <- 3L
while (p[iter] < 1) {
if ((iter <- iter + 1L) > maxIter) {
p <- c(p, numeric(maxIter))
maxIter <- maxIter*2L
}
mm <- mm %*% mTrans
p[iter] <- mm[91, 1]
}
p <- p[1:iter]
iter
#> [1] 345
In double precision the CDF reaches 1 at iteration 345, so longer games cannot be represented. Plot the CDF:
plot(p, xlab = "iterations", ylab = "cumulative probability")
Using findInterval we can quickly generate a large number of random samples of the iterations required.
ngames <- 1e6L # one million games
results <- data.frame(game_id = 1:ngames, iterations_required = findInterval(runif(ngames), p))
head(results)
#> game_id iterations_required
#> 1 1 73
#> 2 2 69
#> 3 3 40
#> 4 4 41
#> 5 5 44
#> 6 6 43
Get a histogram of the sample number of iterations required.
hist(results$iterations_required)
OP here! I think I was able to find an answer to my own question:
library(dplyr)
var_1 <- rnorm(100, 10, 10)
var_2 <- rnorm(100, 1, 10)
var_3 <- rnorm(100, 5, 10)
response <- rnorm(100, 1, 1)
my_data <- data.frame(var_1, var_2, var_3, response)
my_data$id <- 1:100
simulate <- function() {
results <- list()
results2 <- list()
for (i in 1:100) {
iteration_i <- i
sample_i <- my_data[sample(nrow(my_data), 10), ]
results_tmp <- data.frame(iteration_i, sample_i)
results[[i]] <- results_tmp
}
results_df <- do.call(rbind.data.frame, results)
test_1 <- data.frame(results_df %>%
group_by(id) %>%
filter(iteration_i == min(iteration_i)) %>%
distinct)
summary_file <- data.frame(test_1 %>%
group_by(iteration_i) %>%
summarise(Count=n()))
cumulative <- cumsum(summary_file$Count)
summary_file$Cumulative <- cumulative
summary_file$unobserved <- 100 - cumulative
return(summary_file)
}
# now, loop 10 times!
results <- list()
for (i in 1:10) {
game_i <- i
s_i <- simulate()
results_tmp <- data.frame(game_i, s_i)
results[[i]] <- results_tmp
}
final_file <- do.call(rbind.data.frame, results)
Thanks for your help everyone!
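To get from a stacked file like final_file to the "total" summary described in the question, one base R option is aggregate(). The sketch below uses a tiny made-up stand-in (final_file_toy) with the same game_i and iteration_i columns; the numbers are illustrative, not simulation output. It also assumes every game observed all units within its 100 iterations, since otherwise the largest iteration_i is not the number of iterations required.

```r
# Toy stand-in for final_file: one row per (game, iteration with new units)
final_file_toy <- data.frame(
  game_i      = c(1, 1, 1, 2, 2),
  iteration_i = c(1, 2, 47, 1, 45),
  Count       = c(10, 8, 1, 10, 9)
)

# The last iteration that revealed a new unit is the iteration that finished the game
total_summary_file <- aggregate(iteration_i ~ game_i, data = final_file_toy, FUN = max)
names(total_summary_file)[2] <- "iterations_required"
total_summary_file$game_id <- paste0("game_", total_summary_file$game_i)
total_summary_file <- total_summary_file[, c("game_id", "iterations_required")]
total_summary_file
```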
I need to find the first two times my df meets a certain condition grouped by two variables. I am trying to use the ddply function, but I am doing something wrong with the ".variables" command.
So in this example, I'm trying to find the first two times x > 30 and y > 30 in each group / trial.
The way I'm using ddply is giving me the first two times in the dataset, then repeating that for every group.
set.seed(1)
df <- data.frame((matrix(nrow=200,ncol=5)))
colnames(df) <- c("group","trial","x","y","hour")
df$group <- rep(c("A","B","C","D"),each=50)
df$trial <- rep(c(rep(1,times=25),rep(2,times=25)),times=4)
df[,3:4] <- runif(400,0,50)
df$hour <- rep(1:25,times=8)
library(plyr)
ddply(.data=df, .variables=c("group","trial"), .fun=function(x) {
i <- which(df$x > 30 & df$y >30 )[1:2]
if (!is.na(i)) x[i, ]
})
Expected results:
group trial x y hour
13 A 1 34.3511423 38.161134 13
15 A 1 38.4920710 40.931734 15
36 A 2 33.4233369 34.481392 11
37 A 2 39.7119930 34.470671 12
52 B 1 43.0604738 46.645491 2
65 B 1 32.5435234 35.123126 15
But instead, my code is finding c(1,4) from the first group*trial and repeating that over for every group*trial:
group trial x y hour
1 A 1 34.351142 38.161134 13
2 A 1 38.492071 40.931734 15
3 A 2 5.397181 27.745031 13
4 A 2 20.563721 22.636003 15
5 B 1 22.953286 13.898301 13
6 B 1 32.543523 35.123126 15
I would also like for there to be rows of NA if a second occurrence isn't present in a group*trial.
Thanks,
I think this is what you want:
library(tidyverse)
df %>% group_by(group, trial) %>% filter(x > 30 & y > 30) %>% slice(1:2)
Result:
# A tibble: 16 x 5
# Groups: group, trial [8]
group trial x y hour
<chr> <dbl> <dbl> <dbl> <int>
1 A 1 33.5 46.3 4
2 A 1 32.6 42.7 11
3 A 2 35.9 43.6 4
4 A 2 30.5 42.7 14
5 B 1 33.0 38.1 2
6 B 1 40.5 30.4 7
7 B 2 48.6 33.2 2
8 B 2 34.1 30.9 4
9 C 1 33.0 45.1 1
10 C 1 30.3 36.7 17
11 C 2 44.8 33.9 1
12 C 2 41.5 35.6 6
13 D 1 44.2 34.3 12
14 D 1 39.1 40.0 23
15 D 2 39.4 47.5 4
16 D 2 42.1 40.1 10
(slightly different from your results, probably a different R version)
I recommend using dplyr or data.table rather than plyr. From the plyr github page:
plyr is retired: this means only changes necessary to keep it on CRAN
will be made. We recommend using dplyr (for data frames) or purrr (for
lists) instead.
Since someone has already provided a solution with dplyr, here is one option with data.table.
In the selection df[i, j, k] I am selecting rows which match your criteria in i, grouping by the given variables in k, and selecting the first two rows (head) of each group-specific subset of the data .SD. All of this inside the brackets is data.table specific, and only works because I converted df to a data.table first with setDT.
library(data.table)
setDT(df)
df[x > 30 & y > 30, head(.SD, 2), by = .(group, trial)]
# group trial x y hour
# 1: A 1 34.35114 38.16113 13
# 2: A 1 38.49207 40.93173 15
# 3: A 2 33.42334 34.48139 11
# 4: A 2 39.71199 34.47067 12
# 5: B 1 43.06047 46.64549 2
# 6: B 1 32.54352 35.12313 15
# 7: B 2 48.03090 38.53685 5
# 8: B 2 32.11441 49.07817 18
# 9: C 1 32.73620 33.68561 1
# 10: C 1 32.00505 31.23571 20
# 11: C 2 32.13977 40.60658 9
# 12: C 2 34.13940 49.47499 16
# 13: D 1 36.18630 34.94123 19
# 14: D 1 42.80658 46.42416 23
# 15: D 2 37.05393 43.24038 3
# 16: D 2 44.32255 32.80812 8
To try a solution that is closer to what you've tried so far we can do the following
ddply(.data=df, .variables=c("group","trial"), .fun=function(df_temp) {
i <- which(df_temp$x > 30 & df_temp$y >30 )[1:2]
df_temp[i, ]
})
Some explanation
One problem with the code you provided is that you used df inside the function passed to ddply. You defined .fun = function(x) but looked for cases of x > 30 & y > 30 in the full df rather than in the subset x. As a result, i holds row positions from df, yet it is used to index x. Finally, to my understanding there is no need for if (!is.na(i)) x[i, ]. If only one row meets your condition, you will get a row of NAs anyway, because you use which(df_temp$x > 30 & df_temp$y > 30)[1:2].
Using dplyr, you can also do:
df %>%
group_by(group, trial) %>%
slice(which(x > 30 & y > 30)[1:2])
group trial x y hour
<chr> <dbl> <dbl> <dbl> <int>
1 A 1 34.4 38.2 13
2 A 1 38.5 40.9 15
3 A 2 33.4 34.5 11
4 A 2 39.7 34.5 12
5 B 1 43.1 46.6 2
6 B 1 32.5 35.1 15
7 B 2 48.0 38.5 5
8 B 2 32.1 49.1 18
Since everything else is covered here is a base R version using split
output <- do.call(rbind, lapply(split(df, list(df$group, df$trial)),
function(new_df) new_df[with(new_df, head(which(x > 30 & y > 30), 2)), ]
))
rownames(output) <- NULL
output
# group trial x y hour
#1 A 1 34.351 38.161 13
#2 A 1 38.492 40.932 15
#3 B 1 43.060 46.645 2
#4 B 1 32.544 35.123 15
#5 C 1 32.736 33.686 1
#6 C 1 32.005 31.236 20
#7 D 1 36.186 34.941 19
#8 D 1 42.807 46.424 23
#9 A 2 33.423 34.481 11
#10 A 2 39.712 34.471 12
#11 B 2 48.031 38.537 5
#12 B 2 32.114 49.078 18
#13 C 2 32.140 40.607 9
#14 C 2 34.139 49.475 16
#15 D 2 37.054 43.240 3
#16 D 2 44.323 32.808 8
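The question also asked for rows of NA when a group*trial has fewer than two matches. head(which(...), 2) simply returns fewer rows in that case, whereas which(...)[1:2] pads with NA. A small sketch on made-up data (df_toy and pick_first_two are names invented here) that also fills group and trial back in on the padded rows:

```r
# Toy data: group A has two matching rows, group B has only one
df_toy <- data.frame(
  group = c("A", "A", "A", "B", "B", "B"),
  trial = 1,
  x     = c(35, 10, 40, 31, 5, 6),
  y     = c(36, 50, 45, 32, 40, 41),
  hour  = 1:6
)

pick_first_two <- function(d) {
  idx <- which(d$x > 30 & d$y > 30)[1:2]  # NA where there is no 1st/2nd match
  out <- d[idx, ]                          # an NA index yields an all-NA row
  out$group <- d$group[1]                  # restore the keys on padded rows
  out$trial <- d$trial[1]
  out
}

res <- do.call(rbind, lapply(split(df_toy, list(df_toy$group, df_toy$trial)),
                             pick_first_two))
rownames(res) <- NULL
res
```

Every group*trial then contributes exactly two rows, with x, y and hour set to NA where an occurrence is missing.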
I wrote a function (fxn-b) that works as expected when I manually apply it to each row of df1. I would like to apply it to df1 so that it runs automatically on every row.
Each iteration should produce a new df, and I would like the newly created dfs to be bound via rbind. When I try to apply fxn-b to df1, it appears to only make it to the first row. The warning messages also suggest it is not progressing past row 1 of df1.
BTW, fxn-b also calls another function (fxn-a), although I don't believe this is affecting the outcome. Nonetheless I will provide both.
fxn-a:
pythag.opp.leg<-function(Radius){
Diam<-Radius*2
opposite<-sqrt((Diam^2)/2)
opposite.rounded<-round(opposite)
box<-opposite.rounded/2
return(box)
}
fxn-b:
swc.fxn<-function(df1){
box<-pythag.opp.leg(df1$Radius)
box<-round(box)
Xhigh<-df1$X+box
Xlow<-df1$X-box
Yhigh<-df1$Y+box
Ylow<-df1$Y-box
swc.box<- data.frame(X=Xlow:Xhigh, Y=Ylow:Yhigh, Z=df1[1,3])
swc.box2<-expand(swc.box, X, Y, Z)
return(swc.box2)
}
here is df1:
df1<-data.frame(X=c(100,110,120,130), Y=c(90,90,90,90),
Z=c(10,10,15,15),Radius=c(2,2,2,2))
here is output:
#A tibble: 25 x 3
X Y Z
<int> <int> <dbl>
1 98 88 10
2 98 89 10
3 98 90 10
4 98 91 10
5 98 92 10
6 99 88 10
7 99 89 10
8 99 90 10
9 99 91 10
10 99 92 10
... with 15 more rows
Warning messages:
1: In Xlow:Xhigh :
numerical expression has 4 elements: only the first used
2: In Xlow:Xhigh :
numerical expression has 4 elements: only the first used
3: In Ylow:Yhigh :
numerical expression has 4 elements: only the first used
4: In Ylow:Yhigh :
numerical expression has 4 elements: only the first used
As the warning messages indicate, swc.fxn can only handle one row of input at a time.
swc.fxn(df1[1, ])
This works but if you pass all the rows it doesn't work. One way is to use Map to make it work for multiple rows simultaneously and use lapply to expand each dataframe.
swc.fxn <-function(df1){
box<-pythag.opp.leg(df1$Radius)
box<-round(box)
Xhigh<-df1$X+box
Xlow<-df1$X-box
Yhigh<-df1$Y+box
Ylow<-df1$Y-box
swc.box<- Map(function(a, b, c, d, e) data.frame(X = a:b, Y = c:d, Z = e),
Xlow, Xhigh, Ylow, Yhigh, df1$Z)
swc.box2<- lapply(swc.box, function(x) tidyr::expand(x, X, Y, Z))
return(swc.box2)
}
which will then return you a dataframe for each row
swc.fxn(df1)
#[[1]]
# A tibble: 25 x 3
# X Y Z
# <int> <int> <dbl>
# 1 98 88 10
# 2 98 89 10
# 3 98 90 10
# 4 98 91 10
# 5 98 92 10
# 6 99 88 10
# 7 99 89 10
# 8 99 90 10
# 9 99 91 10
#10 99 92 10
# … with 15 more rows
#[[2]]
# A tibble: 25 x 3
# X Y Z
# <int> <int> <dbl>
# 1 108 88 10
# 2 108 89 10
# 3 108 90 10
# 4 108 91 10
# 5 108 92 10
# 6 109 88 10
# 7 109 89 10
# 8 109 90 10
# 9 109 91 10
#10 109 92 10
# … with 15 more rows
#.....
#.....
If the final goal is to make this into one dataframe, we can use do.call(rbind... with lapply or use purrr::map_df. A concise version of the function could be
swc.fxn <-function(df1){
box<- round(pythag.opp.leg(df1$Radius))
swc.box<- Map(function(a, b, c, d, e) data.frame(X = a:b, Y = c:d, Z = e),
df1$X-box, df1$X+box, df1$Y-box, df1$Y+box, df1$Z)
swc.box2<- purrr::map_df(swc.box, function(x) tidyr::expand(x, X, Y, Z))
return(swc.box2)
}
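If you would rather avoid the tidyr and purrr dependencies, roughly the same result can be sketched in base R with expand.grid() and do.call(rbind, ...). The function name swc_box_base is made up here, and note that expand.grid() varies X fastest, so the row order differs from tidyr::expand() even though the set of rows should be the same.

```r
pythag.opp.leg <- function(Radius) {
  Diam <- Radius * 2
  opposite <- sqrt((Diam^2) / 2)
  round(opposite) / 2
}

# Hypothetical base R variant: one grid per row of df1, then stack them
swc_box_base <- function(df1) {
  box <- round(pythag.opp.leg(df1$Radius))
  boxes <- Map(function(x, y, z, b) {
    expand.grid(X = (x - b):(x + b), Y = (y - b):(y + b), Z = z)
  }, df1$X, df1$Y, df1$Z, box)
  do.call(rbind, boxes)
}

df1 <- data.frame(X = c(100, 110, 120, 130), Y = c(90, 90, 90, 90),
                  Z = c(10, 10, 15, 15), Radius = c(2, 2, 2, 2))
res <- swc_box_base(df1)
nrow(res)
head(res)
```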
I have a data.frame of 14 columns made up of test scores at 13 time periods, all numeric. The last column, say X, denotes the specific time point that each student (rows) received a failing grade. I would like to create a separate column that has each student's failing test score from their specific failing time point.
dataframe<-data.frame(TestA=c(58,92,65,44,88),
TestB=c(17,22,58,46,98),
TestC=c(88,98,2,45,80), TestD=c(33,25,65,66,5),
TestE=c(98,100,100,100,100), X=c(2,2,3,NA,4))
Above is a condensed version with mock data. The first student failed at time point two, etc., but the fourth student never failed. The resulting column should be 17, 22, 2, NA, 5. How can I accomplish this?
You can try
dataframe[cbind(1:nrow(dataframe), dataframe$X)]
#[1] 17 22 2 NA 5
From ?`[`
A third form of indexing is via a numeric matrix with the one column for each dimension: each row of the index matrix then selects a single element of the array, and the result is a vector. Negative indices are not allowed in the index matrix. NA and zero values are allowed: rows of an index matrix containing a zero are ignored, whereas rows containing an NA produce an NA in the result.
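To make the matrix indexing explicit, here is the same idea spelled out step by step. This is a sketch: the names scores and idx are introduced here, and restricting the lookup to the Test columns is an extra precaution so that a value of X can never point at the X column itself.

```r
dataframe <- data.frame(TestA = c(58, 92, 65, 44, 88),
                        TestB = c(17, 22, 58, 46, 98),
                        TestC = c(88, 98, 2, 45, 80),
                        TestD = c(33, 25, 65, 66, 5),
                        TestE = c(98, 100, 100, 100, 100),
                        X     = c(2, 2, 3, NA, 4))

# Only the score columns, as a numeric matrix
scores <- as.matrix(dataframe[, c("TestA", "TestB", "TestC", "TestD", "TestE")])

# Two-column index matrix: (row, column) of the element to pick for each student
idx <- cbind(seq_len(nrow(scores)), dataframe$X)

# One element per row of idx; a row containing NA yields NA
dataframe$fail_score <- scores[idx]
dataframe
```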
Two alternative solutions.
One using map function from purrr package
library(tidyverse)
dataframe %>%
group_by(student_id = row_number()) %>%
nest() %>%
mutate(fail_score = map(data, ~c(.$TestA, .$TestB, .$TestC, .$TestD, .$TestE)[.$X])) %>%
unnest()
# # A tibble: 5 x 8
# student_id fail_score TestA TestB TestC TestD TestE X
# <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1 17 58 17 88 33 98 2
# 2 2 22 92 22 98 25 100 2
# 3 3 2 65 58 2 65 100 3
# 4 4 NA 44 46 45 66 100 NA
# 5 5 5 88 98 80 5 100 4
And the other one uses rowwise
dataframe %>%
rowwise() %>%
mutate(fail_score = c(TestA, TestB, TestC, TestD, TestE)[X]) %>%
ungroup()
# # A tibble: 5 x 7
# TestA TestB TestC TestD TestE X fail_score
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 58 17 88 33 98 2 17
# 2 92 22 98 25 100 2 22
# 3 65 58 2 65 100 3 2
# 4 44 46 45 66 100 NA NA
# 5 88 98 80 5 100 4 5
I'm posting both because I have a feeling that the map approach would be faster if you have many students (i.e. rows) and tests (i.e. columns).
I have the data frame df and I want to subset df based on a number sequence within a categorical.
x <- c(1,2,3,4,5,7,9,11,13)
x2 <- x+77
df <- data.frame(x=c(x,x2),y= c(rep("A",9),rep("B",9)))
df
x y
1 1 A
2 2 A
3 3 A
4 4 A
5 5 A
6 7 A
7 9 A
8 11 A
9 13 A
10 78 B
11 79 B
12 80 B
13 81 B
14 82 B
15 84 B
16 86 B
17 88 B
18 90 B
I want only the rows where x increments by 1 and not the rows where x increases by two: e.g.
x y
1 1 A
2 2 A
3 3 A
4 4 A
5 5 A
10 78 B
11 79 B
12 80 B
13 81 B
14 82 B
I figured I have to do some sort of subtraction between elements and check if the difference is >1 and combine this with a ddply but this seems cumbersome. Is there a sort of sequence function I am missing?
using diff
df[which(c(1,diff(df$x))==1),]
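One thing to watch: the one-liner takes diff() over the whole column, so the first row of group B (78, preceded by 13 at the end of group A) is compared across the group boundary and dropped. A sketch that applies the same idea within each level of y, using base R's ave(); keep is a name introduced here:

```r
x  <- c(1, 2, 3, 4, 5, 7, 9, 11, 13)
x2 <- x + 77
df <- data.frame(x = c(x, x2), y = c(rep("A", 9), rep("B", 9)))

# Within each group: the first row is always kept (difference set to 1),
# later rows are kept only if they are exactly 1 above the previous row
keep <- ave(df$x, df$y, FUN = function(v) c(1, diff(v))) == 1

res <- df[keep, ]
res
```

This keeps rows 1 to 5 and 10 to 14, matching the expected output in the question.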
Your example seems to behave well and can be nicely handled by @agstudy's answer. Should your data act up one day, though...
myfun <- function(d, whichDiff = 1) {
# d is the data.frame you'd like to subset, containing the variable 'x'
# whichDiff is the difference between values of x you're looking for
theWh <- which(!as.logical(diff(d$x) - whichDiff))
# Take the diff of x, subtract whichDiff to get the desired values equal to 0
# Coerce this to a logical vector and take the inverse (!)
# which() gets the indexes that are TRUE.
# allWh <- sapply(theWh, "+", 1)
# Since the desired rows may be disjoint, use sapply to get each index + 1
# Seriously? sapply to add 1 to a numeric vector? Not even on a Friday.
allWh <- theWh + 1
return(d[sort(unique(c(theWh, allWh))), ])
}
> library(plyr)
>
> ddply(df, .(y), myfun)
x y
1 1 A
2 2 A
3 3 A
4 4 A
5 5 A
6 78 B
7 79 B
8 80 B
9 81 B
10 82 B