Subset based on a variable value interpreted as a column in R

I'm trying to subset a data frame based on a variable I'm passing into it. My goal is to form a column name inside a function using some values I'm passing into it and filter on that newly formed column name.
Here's a reproducible example:
var_as_col_name <- function(df, col_var, filter_var) {
  subset(df, col_var == filter_var)
}
# this should return what subset(df, cty == 18) would return
var_as_col_name(mpg,"cty", 18)
# this should return what subset(df, cyl == 4) would return
var_as_col_name(mpg,"cyl", 4)
Also, apart from the filters on mpg$cty and mpg$cyl above, I might have another filter that is hardcoded, which I don't want to change, i.e. my requirement should hold for more than one filter. Is there a better approach without using subset (since it is meant for interactive use)?
I am doing this because I have some columns in my dataset like t_1, t_2, t_3...t_24 and I need to filter on either of them and another flag column, so I'm doing:
df_1 <- subset(my_df, flag == 0 & t_1 > 0 & t_1 < 1)  # when I want data after filtering on t_1
df_2 <- subset(my_df, flag == 1 & t_2 > 0 & t_2 < 1)  # when I want data after filtering on t_2
...
Instead of this I was thinking of writing a function that:
takes n from 1 to 24 and filters on that t_n,
takes 1 or 0 for the flag,
and then returns the subsetted data frame that I want.
Let me know if you need clarification on the question and thanks for your help...
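A minimal sketch of one way to do this in base R: build the column name as a string and look it up with [[ instead of relying on subset()'s non-standard evaluation. The flag column name and the 0 < t_n < 1 bounds are assumptions carried over from the description above.
# Sketch: look the column up by name with [[ rather than relying on subset()'s
# non-standard evaluation; which() drops NA comparisons the way subset() does.
var_as_col_name <- function(df, col_var, filter_var) {
  df[which(df[[col_var]] == filter_var), ]
}
var_as_col_name(mpg, "cty", 18)   # behaves like subset(mpg, cty == 18)

# And for the t_1 ... t_24 / flag use case described above:
filter_t <- function(df, n, flag_val) {
  col <- paste0("t_", n)          # e.g. n = 1 gives "t_1"
  df[which(df$flag == flag_val & df[[col]] > 0 & df[[col]] < 1), ]
}
filter_t(my_df, 1, 0)   # should match subset(my_df, flag == 0 & t_1 > 0 & t_1 < 1)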

Related

How to exclude values from random selection

This is independent of, but related to, this question: Randomly take wrong answers of a quiz question from dataframe column, instead of doing by hand.
Using the mtcars dataset, I have now managed to randomly select one value from a certain column, in this example from the cyl column.
mtcars[sample(1:nrow(mtcars), 1),2]
This code will randomly give
[1] 6 or [1] 8 or ...
Now I want to exclude a certain value from being chosen, in this example say cyl == 8.
I would store the value in a vector like:
not_select <- 8
mtcars[sample(1:nrow(mtcars), 1),2]
My question: how can I integrate not_select into mtcars[sample(1:nrow(mtcars), 1), 2]?
Expected Output: The random sample should not include 8
UPDATE:
e.g. the output should be:
6 or 4
UPDATE II due to unclear situation:
I want to randomly select one value from the cyl column. This value should not be, for example, 8. So the value will be 4 or 6.
Explanation: 8 is the correct answer. And I am constructing randomly false answers with the other values (e.g. 4 and 6) from cyl column.
Perhaps, another way -
tmp <- mtcars[, 2]
sample(tmp[tmp != not_select], 1)
The above selects each value with a probability proportional to how often it occurs in the dataset. If you want the probabilities to be equal irrespective of how many times the values occur, you may consider only the unique values.
tmp <- unique(mtcars[, 2])
sample(tmp[tmp != not_select], 1)
Couldn't you just add a filtering condition based on not_select?
mtcars[sample(1:nrow(mtcars), 1) & mtcars$cyl != not_select, 2]
Update: how about:
library(dplyr)

not_select <- 8
draw_cyl <- sample(unique(mtcars$cyl[mtcars$cyl != not_select]), 1)
mtcars %>%
  filter(cyl == draw_cyl) %>%
  slice_sample(n = 1) %>%
  pull(cyl)
Or, as suggested by TarJae themselves (so I don't take any credit for it!):
mtcars[sample(which(mtcars[, 2] != not_select), 1), 2]
This recursive function calls itself again if the output matches not_selected.
exclude_not_selected <- function(not_selected) {
  value <- mtcars[sample(1:nrow(mtcars), 1), 2]
  if (value == not_selected) {
    exclude_not_selected(not_selected)
  } else {
    return(value)
  }
}
exclude_not_selected(8)
[1] 4

How do I filter data in data frame and change column's cell values based on it using a loop?

Currently working with a larger data frame with various participant IDs that looks like this:
#ASC_new Data Frame
Pcp Choice Target ASC Product choice_consis
2393 zwyn27soc B A 1 USB drive 0
2394 zwyn27soc B A 1 job 0
2395 zwyn27soc B B 1 USB drive 0
2397 zwyn27soc B A 1 printer 0
2399 zwyn27soc B B 1 walking shoes 0
2400 zwyn27soc B A 1 printer 0
I would like to loop through each participant (Pcp) and look at their choices in the "Choice" column. For example, under both occurrences of the product "USB drive," the participant chose "B" (Choice). Therefore, under "choice_consis," I want a 1 to replace the 0, because the choices are consistent or equal. However, my for loop for going through the participants and product names isn't working:
# Examples/snippets of my values
pcp_list <- list("ybg606k3l", "yk83d2asc", "yl55v0zhm", "zwyn27soc")
product_list <- list("USB drive", "printer", "walking shoes", "job")

# for loop that isn't working
for (i in pcp_list) {          # iterating through participant codes
  for (j in product_list) {    # iterating through product names
    comparison <- filter(ASC_new, Pcp == i & Product == j)  # filtering participant data and products into a new data frame
    choice_1 <- ASC_new$Choice[1]  # creating labels for choice 1 and 2
    choice_2 <- ASC_new$Choice[2]
    if (isTRUE(choice_1 == choice_2)) {  # comparing choice 1 and choice 2 and adding a value of 1 to the choice_consis column if they are equal
      ASC_new$choice_consis[1] <- 1
      ASC_new$choice_consis[2] <- 1
    }
  }
}
In the end I would like a data frame where each participant's choice_consis is labeled with a 1 or 0 expressing if they chose the same item (A,B,D) both times that each product appeared.
This is something that's pretty natural to do using dplyr, if you don't care about collapsing across different choices. I'll illustrate on a toy data frame:
library(dplyr)

IDs <- 1:2
choices <- c('A', 'B')
products <- c('USB', 'Printer')

df <- data.frame(Pcp = rep(IDs, each = 4),
                 Choice = c(rep(choices, each = 2),
                            rep(choices, each = 2)),
                 Product = c(rep(products, times = 2),
                             rep(products, each = 2)))

df %>%
  dplyr::group_by(Pcp, Product) %>%
  dplyr::summarize(choice_consis = as.numeric(length(unique(Choice)) == 1))
This does (in essence) the same thing you're trying to do with your for loop: look at each combination of participants and products (that's what the group_by does) and then analyze that combination (that's what the summarize does). It's a little more succinct and readable than a double for loop. I'd check out Chapter 5 of Hadley Wickham's R for Data Science to learn more about these sorts of things.
As far as what's wrong with your for loop, the issue is that even though you create your comparison data frame, all the subsequent operations are on ASC_new. So if you wanted to use a for loop and maintain the structure of your original data, you could do something like:
for (i in pcp_list) {
  for (j in product_list) {
    compare <- (ASC_new$Pcp == i) & (ASC_new$Product == j)
    choices <- ASC_new$Choice[compare]
    if (length(unique(choices)) == 1) {
      ASC_new$choice_consis[compare] <- 1
    }
  }
}
Creating a new data frame as you did makes it a little harder to substitute values in the original (because we don't know "where" the filtered data frame came from), so I just get the indices of the original data frame corresponding to the participant-product combination. Note also that I eliminated the hard-coding of the fact that there are only two choices, as well as the isTRUE within the if statement (== will evaluate to TRUE or FALSE, as desired).
Hope this helps!
You can count the number of unique values of Choice for each Pcp and Product, and assign 1 if the count is 1, or 0 otherwise.
This can be done in base R:
df$choice_consis <- +(with(df, ave(Choice, Pcp, Product, FUN = function(x)
                           length(unique(x)))) == 1)
in dplyr:
library(dplyr)
df %>%
  group_by(Pcp, Product) %>%
  mutate(choice_consis = +(n_distinct(Choice) == 1))
and in data.table:
library(data.table)
setDT(df)[, choice_consis := as.integer(uniqueN(Choice) == 1), .(Pcp, Product)]

How to drop a buffer of rows in a data frame around rows of a certain condition

I am trying to remove rows in a data frame that are within x rows after rows meeting a certain condition.
I have a data frame with a response variable, a measurement type that represents the condition, and time. Here's a mock data set:
data <- data.frame(rlnorm(45, 0, 1),
                   c(rep(1, 15), rep(2, 15), rep(1, 15)),
                   seq(from = as.POSIXct("2012-1-1 0:00", tz = "EST"),
                       to = as.POSIXct("2012-1-1 0:44", tz = "EST"),
                       by = "min"))
names(data) <- c('Variable', 'Type', 'Time')
In this mock case, I want to delete the first 5 rows in condition 1 after condition 2 occurs.
The way I thought about solving this was to generate a separate vector that records, for each observation of Type 1, how far it is from the last Type 2 observation. Here's the code I wrote:
dist <- vector()
for (i in 1:nrow(data)) {
  if (data$Type[i] != 1) {
    dist[i] <- 0
  } else {
    position <- i
    tempcount <- 0
    while (position > 0 && data$Type[position] == 1) {
      position <- position - 1
      tempcount <- tempcount + 1
    }
    dist[i] <- tempcount
  }
}
This code will do the trick, but it's extremely inefficient. I was wondering if anyone had some cleverer, faster solutions.
If I understand you correctly, this should do the trick:
criteria1 <- which(data$Type[2:nrow(data)] == 2 &
                   data$Type[2:nrow(data)] != data$Type[1:(nrow(data) - 1)]) + 1
criteria2 <- as.vector(sapply(criteria1, function(x) seq(x, x + 5)))
data[-criteria2, ]
How it works:
criteria1 contains the indices where Type == 2 but the previous row is not of the same type. The strange-looking subsets like 2:nrow(data) are there because we want to compare each row to the previous row, and the first row has no previous row; that offset is why we add + 1 at the end.
criteria2 contains, for each number in criteria1, the sequence from that number to that number + 5.
The third line performs the subset.
This might need a small modification; I wasn't exactly clear on what criteria 1 and criteria 2 should be, based on your code. Let me know if this works or if you need any more advice!

function to subtract each column from one specific column in R

I want to subtract each column from a column called df$Means in R. I want to do this as a function, but I'm not sure how to iterate through each of the columns: each iteration relies on one column being subtracted from df$Means, and then there is a load of downstream code that uses the output. I have simplified the code here, as this is the bit that's giving me trouble. So far I have:
CopyNumberLoop <- function(i) {
  df$ZScore <- (df[3:5] - df$Means) / (df$sd)
}
apply(df[3:50], 2, CopyNumberLoop)
but I'm not sure how to make sure that the operation is done on one column at a time. I don't think df[3:5] is correct?
I have been asked to produce a reproducible example, so all the code is here:
df <- read.delim(file.choose(), header = TRUE)

# Take the control samples and average each row for three columns, excluding the first two columns;
# add the per-row means to the data frame
df$Means <- rowMeans(df[, 30:32])
RowVar <- function(x) {
  rowSums((x - rowMeans(x))^2) / (dim(x)[2] - 1)
}
df$sd <- sqrt(RowVar(df[, c(30:32)]))

# Get a Z score by dividing the test sample count at each locus by the average for the control samples
# and divide everything by the st dev for controls at each locus.
{
  df$ZScore <- (df[, 35] - df$Means) / (df$sd)

  ######################################## QUARTILE FILTER ########################################
  alpha <- 1.5
  numberofControls <- 3
  UL <- median(df$ZScore, na.rm = TRUE) + alpha * IQR(df$ZScore, na.rm = TRUE)
  LL <- median(df$ZScore, na.rm = TRUE) - alpha * IQR(df$ZScore, na.rm = TRUE)

  # Copy the Z score if the score is > or < a certain number, i.e. LL or UL.
  Zoutliers <- which(df$ZScore > UL | df$ZScore < LL)
  df$Zoutliers <- ifelse(df$ZScore > UL | df$ZScore < LL, 1, -1)
  tempout <- ifelse(df$ZScore[Zoutliers] > UL, 1, -1)

  ################################ Three neighbour isolation filter ################################
  finalSeb <- c()
  for (i in 2:(length(Zoutliers) - 1)) {
    j <- Zoutliers[i]
    if (sum(ifelse((j - 1) == Zoutliers, 1, 0)) > 0 & tempout[i] == tempout[i - 1] &
        sum(ifelse((j + 1) == Zoutliers, 1, 0)) > 0 & tempout[i] == tempout[i + 1]) {
      finalSeb <- c(finalSeb, i)
    }
  }
  finalset_row_number <- Zoutliers[finalSeb]
  # View(finalset_row_number)

  p_seq <- rep(0, nrow(df))
  for (i in 1:length(finalset_row_number)) {
    p_seq[(finalset_row_number[i] - 1):(finalset_row_number[i] + 1)] <-
      median(df$ZScore[(finalset_row_number[i] - 1):(finalset_row_number[i] + 1)])
  }
  nrow(as.data.frame(finalset_row_number))
}
For each column between 3 and 50 I'd like to generate nrow(as.data.frame(finalset_row_number)) and keep it in another data frame. Admittedly, my code is a mess, because I don't know how to create the function that will allow me to apply this to each column.
Your code isn’t using the parameter i at all. In fact, i is the current column, so that’s what you should use:
result = apply(df[, 3 : 50], 2, function (col) col - df$Means)
Or you can subtract the means directly:
result = df[, 3 : 50] - df$Means
This will return a new matrix consisting of the columns 3–50 from df, subtracting df$Means from each in turn. Or, if you want to calculate Z scores as your code seems to do:
result = (df[, 3 : 50] - df$Means) / df$sd
It appeared that you wanted the Z-scores assigned back into the original dataframe as named columns. If you want to loop over columns, it would be just as economical to use lapply or sapply. The receiving function will accept each column in turn and match it to the first parameter. Any other arguments offered after the receiving function will get matched by name or position to any other symbol/names in the parameter list. You do not do any assignment to 'df' inside the function:
CopyNumberLoop <- function(col) {
  (col - df$Means) / df$sd
}

df[, paste0('ZScore', 3:50)] <-      # assignment done outside the loop
  lapply(df[3:50], CopyNumberLoop)   # result is a list,
                                     # but the `[.data.frame<-` method will accept a list
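As a side note, a quick sketch of the argument matching described above: anything passed to lapply after the function is forwarded to it and matched by name or position to its remaining parameters (the m and s parameter names here are purely illustrative).
# Extra arguments after the function are forwarded to it by lapply
lapply(df[3:50], function(col, m, s) (col - m) / s, m = df$Means, s = df$sd)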
Using apply coerces to a matrix, which may have undesirable effects if a column is not numeric (say factor or date-time). It's better to get into the habit of using lapply when working on ranges of columns in data frames.
If you want to assign the result of this operation to a new data frame, then the lapply(.) result would need to be wrapped in as.data.frame, and then column names could be assigned. The same would need to be done to a result from apply(.).
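For instance, a minimal sketch of that wrapping, reusing CopyNumberLoop and the ZScore naming from above (the zscores name is purely illustrative):
# Wrap the lapply() result in as.data.frame() and name the columns afterwards
zscores <- as.data.frame(lapply(df[3:50], CopyNumberLoop))
names(zscores) <- paste0('ZScore', 3:50)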

adding a column based on other values

I have a dataframe with millions of rows and three columns labeled Keywords, Impressions, Clicks. I'd like to add a column with values depending on the evaluation of this function:
isType <- function(Impressions, Clicks) {
  if (Impressions >= 1 & Clicks >= 1) return("HasClicks")
  else if (Impressions >= 1 & Clicks == 0) return("NoClicks")
  else return("ZeroImp")
}
So far so good. I then try this to create the column, but 1) it takes forever and 2) it marks all the rows as "HasClicks", even the ones where it shouldn't.
# Creates a data frame
Type <- data.frame()
# Loops until the last row and stores the result in the data frame
for (i in c(1:dim(Mydf)[1])) {
  Type <- rbind(Type, isType(Mydf$Impressions[i], Mydf$Clicks[i]))
}
# Add the column to Mydf
Mydf <- transform(Mydf, Type = Type)
input data:
Keywords,Impressions,Clicks
"Hello",0,0
"World",1,0
"R",34,23
Wanted output:
Keywords,Impressions,Clicks,Type
"Hello",0,0,"ZeroImp"
"World",1,0,"NoClicks"
"R",34,23,"HasClicks"
Building on Joshua's solution, I find it cleaner to generate Type in a single shot (note however that this presumes Clicks >= 0...)
Mydf$Type = ifelse(Mydf$Impressions >= 1,
                   ifelse(Mydf$Clicks >= 1, 'HasClicks', 'NoClicks'),
                   'ZeroImp')
First, the if/else block in your function will return the warning:
Warning message:
In if (1:2 > 2:3) TRUE else FALSE :
the condition has length > 1 and only the first element will be used
which explains why all the rows are the same.
Second, you should allocate your data.frame and fill in the elements rather than repeatedly combining objects together. I imagine this is causing your long run-times.
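For illustration, a minimal sketch of that pre-allocate-and-fill pattern, reusing the isType() function from the question (still slower than a vectorized ifelse, but it avoids growing an object with rbind on every iteration):
# Pre-allocate the full-length result vector once, then fill it element by element
Type <- character(nrow(Mydf))
for (i in seq_len(nrow(Mydf))) {
  Type[i] <- isType(Mydf$Impressions[i], Mydf$Clicks[i])
}
Mydf$Type <- Type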
EDIT: My shared code. I'd love for someone to provide a more elegant solution.
Mydf <- data.frame(
  Keywords = sample(c("Hello", "World", "R"), 20, TRUE),
  Impressions = sample(0:3, 20, TRUE),
  Clicks = sample(0:3, 20, TRUE))

Mydf$Type <- "ZeroImp"
Mydf$Type <- ifelse(Mydf$Impressions >= 1 & Mydf$Clicks >= 1,
                    "HasClicks", Mydf$Type)
Mydf$Type <- ifelse(Mydf$Impressions >= 1 & Mydf$Clicks == 0,
                    "NoClicks", Mydf$Type)
This is a case where arithmetic can be cleaner and most likely faster than nested ifelse statements.
Again building on Joshua's solution:
Mydf$Type <- factor(with(Mydf, (Impressions >= 1) * 2 + (Clicks >= 1) * 1),
                    levels = c(0, 2, 3),   # 0 = zero impressions, 2 = impressions without clicks, 3 = impressions with clicks
                    labels = c("ZeroImp", "NoClicks", "HasClicks"))
