> df = data.frame(id = 1:5, ch_1 = 11:15,ch_2= 10:14,selection = c(11,13,12,14,12))
> df
id ch_1 ch_2 selection
1 1 11 10 11
2 2 12 11 13
3 3 13 12 12
4 4 14 13 14
5 5 15 14 12
Given this data set, I need an additional column that follows these rules:
if selection is one of the two choices (ch_1 and ch_2), return the number of the choice (1 or 2)
if selection is not one of the two choices, return 3
I need a way to do this for every row. For a single row, the following code works just fine, but I can't find a way to use it with apply to run it on every row of a data frame. I'm looking for a solution that can be applied to more than just two columns and that runs faster than a traditional loop.
df=df[1,]
if (df$selection %in% df[,paste("ch_",1:2,sep="")]) {
a = which(df[,paste("ch_",1:2,sep="")]==df$selection)
} else {
a = 3
}
# OR
ifelse(df$selection %in% df[,paste("ch_",1:2,sep="")],1,3)
# OR
match(df$selection,df[,paste("ch_",1:2,sep="")])
Compare the vector to the other columns with ==, add a final column which is always TRUE, and then take the index of the first TRUE in each row using max.col:
max.col(cbind(df$selection == df[c("ch_1","ch_2")], TRUE), "first")
#[1] 1 3 2 1 3
This should easily extend to n columns then.
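As a rough sketch of that extension, the choice columns can be picked up by name pattern instead of being listed by hand. The third column ch_3, its values, and the output column name choice are made up here for illustration; the fallback value becomes the number of choice columns plus one.

```r
# Toy data with a hypothetical third choice column (ch_3 is not in the question)
df <- data.frame(id = 1:5, ch_1 = 11:15, ch_2 = 10:14,
                 ch_3 = c(20, 13, 21, 22, 23),
                 selection = c(11, 13, 12, 14, 12))

# Find every column whose name starts with "ch_"
ch_cols <- grep("^ch_", names(df), value = TRUE)

# Logical matrix of matches, plus an always-TRUE fallback column at the end
hits <- cbind(df$selection == df[ch_cols], TRUE)

# Index of the first TRUE per row; the fallback column gives length(ch_cols) + 1
df$choice <- max.col(hits, "first")
df$choice
#[1] 1 3 2 1 4
```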
You could do this with nested ifelse,
with(df, ifelse(selection == ch_1, 1L, ifelse(selection == ch_2, 2L, 3L)))
# [1] 1 3 2 1 3
but I'm rarely fond of nesting them. If this is all you need (and you never need more than two), then this might suffice.
One alternative is using dplyr::case_when,
with(df, dplyr::case_when(selection == ch_1 ~ 1, selection == ch_2 ~ 2, TRUE ~ 3))
and it can be easily used within a dplyr::mutate if you are already using the package.
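A minimal sketch of the mutate version, assuming dplyr is installed; the new column name choice is an arbitrary pick for this example:

```r
library(dplyr)

df <- data.frame(id = 1:5, ch_1 = 11:15, ch_2 = 10:14,
                 selection = c(11, 13, 12, 14, 12))

# Conditions are checked in order; TRUE acts as the catch-all branch
df <- df %>%
  mutate(choice = case_when(
    selection == ch_1 ~ 1L,
    selection == ch_2 ~ 2L,
    TRUE ~ 3L
  ))
df$choice
#[1] 1 3 2 1 3
```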
Suppose I have the following dataset test
> test = data.frame(location = c("here", "there", "here", "there", "where"), x = 1:5, y = c(6,7,8,6,10))
> test
location x y
1 here 1 6
2 there 2 7
3 here 3 8
4 there 4 6
5 where 5 10
Then, I want to filter so that if y satisfies a condition in at least one row of a location, every row of that location is kept in the dataset, something like
test %>% filter_something(y == 6)
location x y
1 here 1 6
2 there 2 7
3 here 3 8
4 there 4 6
Note that rows 2 and 3 stay in the dataset even though their y is not 6, since each of their locations has at least one row with the 'right' y.
I can solve this by creating another dataset with y == 6 and then doing an inner join with test, but is there a more elegant option? I'm not filtering on just this variable; I'm using other columns too.
We can group_by location, then use any(condition)
library(dplyr)
test %>% group_by(location) %>%
filter(any(y==6))
If we want to use data.table, we could first get the locations associated with y ==6, and filter on those, all in one line:
library(data.table)
test <- setDT(test)
# keep only the locations associated with y == 6
test <- test[location %in% test[y==6]$location]
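The same per-group idea can also be sketched in base R, without either package. This is only an illustration: ave applies any within each location, and the names keep and result are arbitrary.

```r
test <- data.frame(location = c("here", "there", "here", "there", "where"),
                   x = 1:5, y = c(6, 7, 8, 6, 10))

# For every row, TRUE if any row sharing its location has y == 6
keep <- as.logical(ave(test$y == 6, test$location, FUN = any))

result <- test[keep, ]
result
```

Further conditions on other columns can be combined inside the comparison, for example test$y == 6 & test$x < 5.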
I'd like to make a data frame using only the last computed values from a Repeat loop.
For the repeat and sample functions, I'm using a data frame a with columns Plus and Prob. The numbers in the Prob column are the probabilities of each number in Plus occurring.
b <- 1
repeat {
  c <- sample(a$Plus, size=1, prob=(a$Prob))
  cat(b, '\t', c, '\n')
  b <- b + 1
  if (c >= 10) {
    break
  }
}
#I'm interested in the result greater than 10 only
If I run the code above, then it will compute something like
1 4
2 8
3 13
If I run this again, it will compute different results like..
1 9
2 3
3 7
4 3
5 11
What I'd like to do is to make a data frame using only the last outputs of each loop.
For example, using the computed data above, I'd like to make a frame that looks like
Trial Result
3 13
5 11
Is there any way to repeat this loop the number of times I want to and make a data frame using only the last outputs of each repeated function?
You can use a user defined function to do this. Since you haven't given your dataframe a, I've defined it as follows:
library(tidyverse)
a <- tibble(
Plus = 1:15,
Prob = seq(from = 15, to = 1, by = -1)
)
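The following function approximates your repeat loop, but stores the relevant results in a tibble. Note that it draws the whole vector at once without replacement (the default for sample), whereas your loop draws one value at a time, so the same value can come up more than once there. I've left your variable b out of this because as far as I can see, it doesn't contribute to your desired output.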
samplefun <- function(a) {
c <- sample(a$Plus, size=length(a$Plus), prob=a$Prob)
res <- tibble(
Trial = which(c >= 10)[1],
Result = c[which(c >= 10)[1]]
)
return(res)
}
Then use map_dfr to return as many samples as you like:
nsamples <- 5
map_dfr(1:nsamples, ~ samplefun(a))
Output:
# A tibble: 5 x 2
Trial Result
<int> <int>
1 4 11
2 6 14
3 5 11
4 2 10
5 4 15
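If the draw-one-at-a-time behaviour of the original repeat loop should be kept, one possible sketch in base R wraps the loop in a function and collects only the final draw of each run. The data frame a below is a stand-in (the real one was not posted), and the names run_trial, threshold, n_draws and results are made up for this example.

```r
# Stand-in for the data frame `a`; the real values are unknown
a <- data.frame(Plus = 1:15, Prob = 15:1)

# One run of the loop: draw until a value >= threshold appears,
# then return the draw count and that last value
run_trial <- function(a, threshold = 10) {
  n_draws <- 0
  repeat {
    n_draws <- n_draws + 1
    value <- sample(a$Plus, size = 1, prob = a$Prob)
    if (value >= threshold) {
      return(data.frame(Trial = n_draws, Result = value))
    }
  }
}

set.seed(1)
nsamples <- 5
# Run the loop nsamples times and stack the one-row results
results <- do.call(rbind, lapply(seq_len(nsamples), function(i) run_trial(a)))
results
```

Each row of results corresponds to one complete run of the loop, so Trial is the number of draws that run needed.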
I have a function that I want to iterate over only certain rows of my dataset, and then save the results in a variable in the dataset.
So for example say I have this set up:
library(tidyverse)
add_one <- function(vector, x_id){
return(vector[x_id] + 1)
}
test <- data.frame(x = c(1,2,3,4), y = c(1,2,3,4), run_on = c(TRUE,FALSE,TRUE,FALSE))
test
So the test data frame looks like:
> x y run_on
>1 1 1 TRUE
>2 2 2 FALSE
>3 3 3 TRUE
>4 4 4 FALSE
So what I want to do is iterate over the dataframe and set the y column to be the result of applying the function add_one() to the x column for just the rows where run_on is TRUE. I want the end result to look like this:
> x y run_on
>1 1 2 TRUE
>2 2 2 FALSE
>3 3 4 TRUE
>4 4 4 FALSE
I have been able to iterate the function over all of the rows using apply(). So for example:
test$y <- apply(test,1,add_one,x_id = 1)
test
> x y run_on
>1 1 2 TRUE
>2 2 3 FALSE
>3 3 4 TRUE
>4 4 5 FALSE
But this also applies the function to rows 2 and 4, which I do not want. I suspect there may be some way to do this using versions of the map() functions from purrr, which is why I tagged this post as such.
In reality, I am using this kind of procedure to repeatedly iterate over a large dataset multiple times, so I need it to be done automatically and cleanly. Any help or suggestions would be very much appreciated.
UPDATE
I managed to find a solution. Some of the solutions offered here did work in my toy example but did not extend to the more complex function I was actually using. Ultimately what worked was something similar to what tmfmnk suggested. I just wrapped the original function inside another function that included an if statement to determine whether or not to apply the original function. So to extend my toy example, my solution looks like this:
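add_one_if <- function(vector, x_id, y_id, run_on_id){
  # apply() passes each row to this function as a vector
  if(vector[run_on_id]){
    return(add_one(vector, x_id))
  } else {
    # keep the existing y value for rows that should not be updated
    return(vector[y_id])
  }
}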
test$y <- apply(test, 1, add_one_if, x_id = 1, y_id = 2, run_on_id = 3)
It seems a little convoluted, but it worked for me and is reproducible and reliable in the way I need it to be.
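Another way to get the same effect, sketched here on the toy example only, is to loop over just the row indices where run_on is TRUE and assign into those positions. The helper name rows is arbitrary, and whether this carries over to the more complex real function is an assumption.

```r
add_one <- function(vector, x_id){
  return(vector[x_id] + 1)
}
test <- data.frame(x = c(1,2,3,4), y = c(1,2,3,4),
                   run_on = c(TRUE,FALSE,TRUE,FALSE))

# Row numbers that should be updated
rows <- which(test$run_on)

# Call the function once per selected row, passing that row as a numeric vector
test$y[rows] <- vapply(
  rows,
  function(i) add_one(unlist(test[i, c("x", "y")]), 1),
  numeric(1)
)
test$y
#[1] 2 2 4 4
```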
You can also do:
add_one <- function(data, vector, x_id, n, is.true = TRUE) {
if (is.true) {
return(data[[vector]] + (data[[x_id]]) * n)
} else {
return(data[[vector]] + (!data[[x_id]]) * n)
}
}
add_one(test, vector = "y", x_id = "run_on", 1, is.true = TRUE)
[1] 2 2 4 4
add_one(test, vector = "y", x_id = "run_on", 5, is.true = FALSE)
[1] 1 7 3 9
It may be that your real case is more complicated than allowed by this, but why not just use ifelse?
# add_one() here is the two-argument version from the question
test$y <- ifelse(test$run_on, add_one(test$x, seq_along(test$x)), test$y)
Or even:
test$y[test$run_on] <- add_one(test$x, which(test$run_on))
You won't need to use purrr until you are applying the same function to multiple columns. Since you want to modify only one column, based on a condition, you can use mutate() + case_when().
mutate(test, y = case_when(run_on ~ add_one(y),
!run_on ~ y))
#> x y run_on
#> 1 1 2 TRUE
#> 2 2 2 FALSE
#> 3 3 4 TRUE
#> 4 4 4 FALSE
I am working on subsetting multiple variables in a dataset to remove data points that are not useful. When I enter the subset command for the first variable and check the dataset, the variable has been properly subset. However, after doing the same with the second variable, the first is no longer subset in the dataset. It seems as though the second subset command is overriding the first. In the example I came up with below the first variable (Height) is no longer subset once I subset the second variable (Weight). Any thoughts on how to resolve this?
rTestDataSet = TestDataSet
rTestDataSet = subset(TestDataSet, TestDataSet$Height < 4)
rTestDataSet = subset(TestDataSet, TestDataSet$Weight < 3)
You are applying both subsets to the original data. What you need to do is apply one subset, save it to a variable and then apply the second subset to this new variable. Also as already pointed out you don't need the $ when using subset.
try this:
Make some reproducible data:
set.seed(50)
TestDataSet <- data.frame("Height" = c(sample(1:10,30, replace = T)), Weight = sample(1:10,30, replace = T) )
rTestDataSet = TestDataSet
rTestDataSet = subset(rTestDataSet, Height < 4)
rTestDataSet
Height Weight
3 3 5
6 1 7
9 1 4
10 2 5
12 3 9
14 1 1
15 3 1
19 1 8
20 2 9
22 2 8
28 3 6
rTestDataSet = subset(rTestDataSet, Weight < 3)
rTestDataSet
Height Weight
14 1 1
15 3 1
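As a small illustration with made-up numbers (not the seeded data above), chaining the two subsets gives the same rows as a single subset call with both conditions joined by &. The names step1, step2 and both are arbitrary.

```r
# Small hand-written data set, used only for this illustration
TestDataSet <- data.frame(Height = c(1, 5, 3, 2, 8),
                          Weight = c(1, 2, 6, 2, 1))

# Two steps: the second subset works on the result of the first
step1 <- subset(TestDataSet, Height < 4)
step2 <- subset(step1, Weight < 3)

# One step: both conditions in a single call
both <- subset(TestDataSet, Height < 4 & Weight < 3)

both
#   Height Weight
# 1      1      1
# 4      2      2
```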
Why not use tidyverse? Chain the operations together to create your own logic. Instead of subset you can use filter to get the rows you want conditionally:
library(tidyverse)
TestDataSet %>%
filter(Height < 4) %>%
filter(Weight < 3)
or
TestDataSet %>%
filter(Height < 4 & Weight < 3)
I have a big data frame with varying numbers of columns and rows. I would like to search the data frame for the values of a given vector and remove every row in which any cell matches one of those values. I'd like to have this as a function because I have to run it on multiple data frames with varying numbers of rows and columns, and I would like to avoid for loops.
for example
ff<-structure(list(j.1 = 1:13, j.2 = 2:14, j.3 = 3:15), .Names = c("j.1","j.2", "j.3"), row.names = c(NA, -13L), class = "data.frame")
Remove all rows that have cells containing the values 8, 9, or 10.
I guess I could use ff[ !ff[,1] %in% c(8, 9, 10), ] or subset(ff, !ff[,1] %in% c(8,9,10) )
but in order to remove all the values from the dataset I would have to go through each column (probably with a for loop, something I wish to avoid).
Is there any other (cleaner) way?
Thanks a lot
apply your test to each row:
keeps <- apply(ff, 1, function(x) !any(x %in% 8:10))
which gives a boolean vector. Then subset with it:
ff[keeps,]
j.1 j.2 j.3
1 1 2 3
2 2 3 4
3 3 4 5
4 4 5 6
5 5 6 7
11 11 12 13
12 12 13 14
13 13 14 15
I suppose the apply strategy may turn out to be the most economical but one could also do either of these:
ff[ !rowSums( sapply( ff, function(x) x %in% 8:10) ) , ]
ff[ !Reduce("+", lapply( ff, function(x) x %in% 8:10) ) , ]
Both add up the logical vectors (a nonzero sum is equivalent to any) and then negate the result. I suspect the first one would be faster.
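Since the question asks for a reusable function, here is one possible wrapper around the same column-wise idea. The function name drop_rows_with and its argument names are made up for this sketch; it uses Reduce with | rather than a sum, which yields the per-row "any match" flag directly.

```r
# Remove every row of `data` in which any cell matches one of `values`
drop_rows_with <- function(data, values) {
  # One logical vector per column, combined with OR across columns
  hit <- Reduce(`|`, lapply(data, function(col) col %in% values))
  data[!hit, , drop = FALSE]
}

ff <- data.frame(j.1 = 1:13, j.2 = 2:14, j.3 = 3:15)
res <- drop_rows_with(ff, c(8, 9, 10))
rownames(res)
#[1] "1"  "2"  "3"  "4"  "5"  "11" "12" "13"
```

The same function can then be called on any data frame, whatever its number of rows and columns.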