Unfiltering in R to mutate a variable in a data frame - r

I have a data set of open-answer responses in a survey. I have a set of topics that I want to assign to these open answer responses according to the content of the people's responses to a given question.
I want to specify keywords (strings) and search for matches in a specific variable (in this case, the variable is a question that was asked). Importantly, these strings are only partial matches to the responses: I'm just specifying keywords that will appear at some point in the response. For example, I might filter the responses down to those that use the word 'money' at some point in their answer to a given question. For the matched rows, I want to assign a "1" in a new variable; let's call it "Topic1".
I have this code:
df %>%
filter(str_detect(var, "money|cash|revenue|profit")) %>%
mutate(Topic1 = "1")
This code works fine to filter and create the new variable according to the conditions I specify. However, if I assign back to the dataframe, I have now subsetted the entire dataframe according to my filter. I would like to filter the rows by my provided keywords, create the new variable, then 'unfilter' the dataframe, and assign the result back to the df.
I'm wondering if there is an 'unfilter' that I can use to reselect all my data. OR perhaps there is a better solution with ifelse?
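A minimal sketch of the ifelse-style idea, assuming dplyr and stringr are loaded: instead of filtering, the match is computed inside mutate(), so every row is kept. The toy df and the "0" value for non-matching rows are assumptions for illustration, not part of the original data.

```r
library(dplyr)
library(stringr)

# Toy data standing in for the survey; `var` holds the open-answer text.
df <- data.frame(
  id  = 1:4,
  var = c("we need more money", "better schools", "profit is down", NA),
  stringsAsFactors = FALSE
)

# Flag matching rows in place; no rows are dropped.
df <- df %>%
  mutate(Topic1 = if_else(
    str_detect(var, "money|cash|revenue|profit"),
    "1", "0",
    missing = "0"  # assumption: a missing response counts as no match
  ))
```

Note that str_detect() is case-sensitive by default, so "Money" would not match "money" unless the pattern is wrapped in regex(..., ignore_case = TRUE).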

Related

R: Searching a column in a dataframe for matches to a reference list in another dataframe

I am trying to categorize genes with multiple GO descriptors into bins based on what those GO descriptors are related to. I have dataframe A which contains the raw data associated with a list of geneIDs (>500,000) and their associated GO descriptors and dataframe B which classifies these GO descriptors into larger groups.
Example of dataframe A
dfA
Example of dataframe B
dfB
Ideally, the final output would reference the entire list and generate a new column in dataframe A classifying the GeneIDs into the GO_Category's associated with its specific GO_IDs -- bonus points if it removes duplicate hits on the GO_Categorys.
Looking something like this...
Example of Ideal Solution
However, I know that the ideal solution might be difficult to obtain, and I already have dataframe B listed out based on the unique GO_Categories so a solution like this might be easier to obtain.
Example of Acceptable Solution
So far I have struggled with getting any command to search for partial strings using a list from another dataframe with the goal of returning all matches.
I have had partial success with the acceptable solution approach and using:
dfA <- dfA %>%
mutate(GO_Cat_1 = c('No', 'Yes')[1+str_detect(dfA$GO_IDs, as.character(dfB$GO_IDs))])
The solution seems okay, however, it does return an error along the lines of
problem with mutate() column GO_Cat_1.
i GO_Cat_1 = ...[].
i longer object length is not a multiple of shorter object length
I have also tried to look into applying grepl/grep - but struggled to feed it a list of terms to look for partial string matches in dfA.
Any assistance is greatly appreciated!
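One possible reading of the warning: str_detect() is vectorised over both the string and the pattern, so a whole column of patterns gets paired element-wise with the rows of dfA (with recycling) rather than each row being checked against every ID. A sketch of one way around this is to collapse the IDs for a category into a single alternation pattern. The data frames below are hypothetical stand-ins, including the category names and the "; " separator.

```r
library(dplyr)
library(stringr)

# Hypothetical stand-ins for dfA and dfB; the real columns may differ.
dfA <- data.frame(
  GeneID = c("gene1", "gene2", "gene3"),
  GO_IDs = c("GO:0000001; GO:0000002", "GO:0000003", "GO:0000002; GO:0000004"),
  stringsAsFactors = FALSE
)
dfB <- data.frame(
  GO_IDs      = c("GO:0000001", "GO:0000002", "GO:0000003"),
  GO_Category = c("CatA", "CatA", "CatB"),
  stringsAsFactors = FALSE
)

# Collapse one category's IDs into a single "id1|id2|..." pattern.
cat_a_pattern <- paste(dfB$GO_IDs[dfB$GO_Category == "CatA"], collapse = "|")

# Each row of dfA is now tested against the whole set of IDs at once.
dfA <- dfA %>%
  mutate(GO_Cat_A = if_else(str_detect(GO_IDs, cat_a_pattern), "Yes", "No"))
```

This relies on the IDs containing no regex metacharacters and on no ID being a prefix of another; if either could happen, the pattern would need escaping or word boundaries.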

R Studio "filter" function question. Filtering out items that do not contain a certain value

I have a dataset with two columns of interest, one a "Response" column (where participants in a task could respond through typing what they believed a presented image was - so the class being a "character" for their responses). The second column is an "Image" column (containing the name of the actual image presented).
What I would like to do is see how many of the Responses do not match what the image actually was. As there are multiple words participants can use to name an object, I would also like to have several options for what counts as an acceptable response. What I have done so far is try to use the filter function for each of the 300 images that were presented, selecting all responses to one individual image that contain the correct word. See below:
Image1CorrectAnswers <- data %>% filter(data$Image == "Image1.jpg", data$Response == "bike")
What I was wondering however, is 1) whether it is possible to use the filter function for responses that do not contain the correct word for that specific image? 2) As well as whether I can have multiple different "acceptable" words to "filter" out correct responses from the incorrect ones (as different participants can answer differently to the same image, and yet both be correct). The goal is to have a final variable for each of the 300 images containing only the incorrect responses.
Thank you in advance.
If I understood correctly, you need to determine which rows in your dataset have a "Response" value that matches the "Image" value, right?
Well, since your "Image" column holds a filename (say, ending with .jpg), you could strip the file extension and then directly find out which rows are correctly labelled (Response == Image).
Say,
correct_responses <- data$Response == sub("\\.jpg$", "", data$Image)
This way you can access the correct and incorrect rows by doing:
correct_data <- data[correct_responses, ]
incorrect_data <- data[!correct_responses, ]
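For the second part of the question (several acceptable words per image), one possible sketch is a lookup list keyed by image name, checked row by row. The toy data, the acceptable list, and the words in it are all hypothetical; in practice the list would need an entry for each of the 300 images.

```r
# Toy data; the image names and responses are made up for illustration.
data <- data.frame(
  Image    = c("Image1.jpg", "Image1.jpg", "Image1.jpg", "Image2.jpg"),
  Response = c("bike", "bicycle", "car", "dog"),
  stringsAsFactors = FALSE
)

# One entry per image, listing every response accepted as correct.
acceptable <- list(
  "Image1.jpg" = c("bike", "bicycle", "cycle"),
  "Image2.jpg" = c("cat", "kitten")
)

# Row-wise check: is this response among the accepted words for its image?
correct_responses <- mapply(
  function(img, resp) resp %in% acceptable[[img]],
  data$Image, data$Response,
  USE.NAMES = FALSE
)

# All incorrect responses in one data frame, across every image.
incorrect_data <- data[!correct_responses, ]
```

Keeping everything in one data frame and splitting afterwards, for example with split(incorrect_data, incorrect_data$Image), may be easier to manage than 300 separate variables.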

Filter tbl_sqlite column by string comparing with another column - dplyr

This may be a novice question. How do I filter rows by matching string values against values from another column using dplyr when working with a database?
e.g. I want to do something like this.
install.packages("nycflights13")
library(nycflights13)
head(flights)
Let's say I want to filter for rows whose origin value appears among the destination values. I tried:
filter(flights_sqlite, origin %in% (unique(select(flights_sqlite,dest))))
However, that operation is not allowed. I do not want to convert this to a data frame, as the database I am working with is large and would eat up all available RAM.
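One approach that may keep the work on the database side is a semi-join against the distinct destination values, since dplyr translates semi_join() on lazy tables to SQL. The sketch below uses a tiny in-memory SQLite table rather than the real data, and assumes the DBI, RSQLite and dbplyr packages are installed; the table contents are made up.

```r
library(dplyr)
library(DBI)

# In-memory SQLite database with a small toy table (hypothetical contents).
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "flights", data.frame(
  origin = c("JFK", "LGA", "EWR", "JFK"),
  dest   = c("LAX", "JFK", "ORD", "EWR"),
  stringsAsFactors = FALSE
))
flights_sqlite <- tbl(con, "flights")

# Distinct destinations; this is still a lazy query, nothing is pulled into R.
dests <- flights_sqlite %>% distinct(dest)

# Keep only rows whose origin appears among the destinations.
result <- flights_sqlite %>%
  semi_join(dests, by = c("origin" = "dest")) %>%
  collect()  # only the filtered rows are brought into memory

dbDisconnect(con)
```

Calling show_query() before collect() is a way to inspect the generated SQL and confirm that the filtering happens in the database.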

List rows with specific columns with a specific value in R from dataset

I have a dataset called "flights" and I am attempting to list all the rows that have the value of "Escanaba, Michigan" in the column Destination. I would like to show 5 columns and then all the rows that apply to Escanaba.
Currently I have...
flights[,c("FlightDate","Carrier","Destination","DestCityName","AirTime")]
That works perfectly for what I want, except it shows all rows.
How do I call out a specific value from a column in a dataset?
This is a pretty basic indexing question (see e.g. here, which was the first hit when I googled "R indexing"); you need to construct a logical vector that is TRUE for the relevant rows.
flights[flights$Destination=="Escanaba, Michigan",
c("FlightDate","Carrier","Destination","DestCityName","AirTime")]
A prettier alternative for interactive work (not entirely safe for programmatic use):
subset(flights,Destination=="Escanaba, Michigan",
select=c(FlightDate,Carrier,
Destination,DestCityName,AirTime))
If you want to allow for more than one possible value of Destination, try %in%
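A short sketch of the %in% variant, using a made-up flights data frame with the same five column names; the values and the wanted vector are assumptions for illustration.

```r
# Toy stand-in for the flights dataset (values are hypothetical).
flights <- data.frame(
  FlightDate   = c("2015-01-01", "2015-01-02", "2015-01-03"),
  Carrier      = c("AA", "DL", "UA"),
  Destination  = c("Escanaba, Michigan", "Detroit, Michigan", "Marquette, Michigan"),
  DestCityName = c("Escanaba", "Detroit", "Marquette"),
  AirTime      = c(55, 40, 62),
  stringsAsFactors = FALSE
)

wanted <- c("Escanaba, Michigan", "Marquette, Michigan")
cols   <- c("FlightDate", "Carrier", "Destination", "DestCityName", "AirTime")

# %in% yields TRUE for rows whose Destination is any of the wanted values.
selected <- flights[flights$Destination %in% wanted, cols]
```

Unlike ==, %in% returns FALSE rather than NA when Destination is missing, so no all-NA rows appear in the result.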

Changing hundreds of column names simultaneously in R

I have a data frame with hundreds of columns whose names I want to change. I'm very new to R, so it's rather easy to think through the logic of this, but I simply can't find a relevant example online.
The closest I could sort of get was this:
projectFileAllCombinedNames <- for (i in 1:200){names(projectFileAllCombined)[i+1] <-variableNames[i]}
Basically, starting at the second column of projectFileAllCombined, I want to loop through the columns in the dataframe and assign them the data values in the second data frame. I was able to change one column name manually with this code:
colnames(projectFileAllCombined)[2]<-"newColumnName"
but I can't possibly do that for hundreds of columns. I've spent multiple hours on this and can't crack it with any number of Google searches on "change multiple columns in r" or "change column names in r". The best I can find online is examples where people change a few columns with a c() function and I get how that works, but that still seems to require typing out all the column names as parameters to the function, unless there is a way to just pass the "variableNames" file into that c() function, but I don't know of one.
Will
colnames(projectFileAllCombined)[-1] <- variableNames
not suffice?
This assumes the ordering of columns in projectFileAllCombined is the same as the ordering of the new variable names in variableNames, and that
length(variableNames) == (ncol(projectFileAllCombined) - 1)
The key point here is that the replacement function 'colnames<-'() is vectorised and can replace any number of column names in a single call if passed a vector of replacement values.
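A small sketch of that single-call replacement on toy data, assuming variableNames is a plain character vector. If it is actually a data frame, as the question hints, the relevant column would need to be extracted first, for example with variableNames[[1]].

```r
# Toy data frame: the first column keeps its name, the rest are renamed.
projectFileAllCombined <- data.frame(
  id = 1:2,
  V1 = c(0.1, 0.2),
  V2 = c("a", "b"),
  V3 = c(TRUE, FALSE),
  stringsAsFactors = FALSE
)
variableNames <- c("height", "label", "flag")  # hypothetical new names

# Guard against a length mismatch before assigning.
stopifnot(length(variableNames) == ncol(projectFileAllCombined) - 1)

# One vectorised assignment renames every column except the first.
colnames(projectFileAllCombined)[-1] <- variableNames
```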
