How to check for the presence of multiple strings in each value of a particular column in an R dataframe?

How do we identify all those row entries in a particular column that contain a specific set of keywords?
For example, I have the following dataframe:
test <- data.frame(nom = 1:5, name = c("ser bla", "onlybla", "inspectiongfa serdafds", "inspection", "serbla blainspection"))
My keywords of interest are "ser" & "inspection"
What I'm looking for is to list all the values of the second column (i.e. name) in which both keywords are present together.
So basically, my output should list the name values of rows 3 and 5, viz. "inspectiongfa serdafds" & "serbla blainspection"
What I have tried is the following:
I first generate a truth table recording the presence of each keyword in each row of the dataframe, as follows:
as.data.frame(sapply(c("ser", "inspection"), grepl, test$name))
Once I get this, all I have to do is identify the row entries where the values are a TRUE TRUE pair, since those correspond to the cases where both keywords of interest are present. Here those are the same rows 3 & 5.
But I'm not able to figure out how to identify such TRUE TRUE rows, and I wonder whether this whole process is overkill and can be done more efficiently.
Any help would be appreciated. Thanks!

You're almost there :)
Here's a solution extending what you have done:
# store your logic test outcomes
conditions_df <- as.data.frame(sapply(c("ser", "inspection"), grepl, test$name))
# FALSE = 0 and TRUE = 1, so rowSums() counts matched keywords per row;
# which() then returns the indices of rows where both matched (sum == 2), i.e. the rows we need to filter test
locate_rows <- which(rowSums(conditions_df) == 2)
test$name[locate_rows]
[1] "inspectiongfa serdafds"
[2] "serbla blainspection"

Related

How do I pull the values from multiple columns, conditionally, into a new column?

I am a relatively novice R user, though familiar with dplyr and the tidyverse. I still can't seem to figure out how to pull the actual data from one column into a new column when it meets a certain condition.
Here is what I'm trying to do. Participants have ranked specific practices (n=5) and provided responses to questions that represent their beliefs about these practices. I want to have five new columns that assign their beliefs about the practices to their ranks, rather than the practices.
For example, they have a score for "beliefs about NI" called ni.beliefs; if a participant ranked NI as their first choice, I want the value of ni.beliefs pulled into the new column first.beliefs. Likewise, if a participant put PMII as their first-choice practice, their value of pmii.beliefs should be pulled into the first.beliefs column.
So, I need five new columns called first.beliefs, second.beliefs, third.beliefs, fourth.beliefs, and last.beliefs, and each of these needs to be filled conditionally from the practice-specific beliefs (ni.beliefs, dtt.beliefs, pmi.beliefs, sn.beliefs, script.beliefs) depending on the practice-specific ranks (a rank of 1-5 assigned to each practice: rank.ni, rank.dtt, rank.pmi, rank.sn, rank.script).
Here is what I have so far, but I am stuck and aware that this is not very close. Any help is appreciated!!!
Diss$first.beliefs <- ifelse(rank.ni==1, ni.beliefs,
                      ifelse(rank.dtt==1, dtt.beliefs,
                      ifelse(rank.pmi==1, pmi.beliefs,
                      ifelse(rank.sn, sn.beliefs,
                      ifelse(rank.script==1, script.beliefs)))))
Thank you!!
I'm not sure if I understood correctly (it would help if you showed what your data looks like), but this is what I'm thinking:
Without using additional packages: if the ranking columns are equivalent to the indices of the new columns you want (i.e. they rank each practice from 1 to 5, without repeats, and in the same order as the new columns first.beliefs, second.beliefs, etc.), then you can use that data as indices into the second set of columns:
for(j in 1:nrow(people_table)){
  people_table[j, ]$first.belief[[1]]  <- names(beliefs)[(people_table[j, c(A:B)]) %in% 1]
  people_table[j, ]$second.belief[[1]] <- names(beliefs)[(people_table[j, c(A:B)]) %in% 2]
  ...
}
Where
A -> index of the first preference rank column
B -> index of the last preference rank column
(people_table[j,c(A:B)] %in% 1) -> this returns something like (FALSE FALSE TRUE FALSE FALSE)
beliefs -> vector with the names of each belief
That should work. It's simple, no need for packages, and it'll be fast too. Just make sure you've initialized/created the new columns first, otherwise you'll get some errors.
This is done very easily with the case_when() function. You can improve on the code below.
library(dplyr)
Diss$first.beliefs <- case_when(
  Diss$rank.ni == 1 ~ Diss$ni.beliefs,
  Diss$rank.dtt == 1 ~ Diss$dtt.beliefs,
  Diss$rank.pmi == 1 ~ Diss$pmi.beliefs,
  Diss$rank.sn == 1 ~ Diss$sn.beliefs,
  Diss$rank.script == 1 ~ Diss$script.beliefs
)
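The same pattern then repeats for second.beliefs through last.beliefs. As a rough sketch (assuming the rank.* and *.beliefs column names from the question, and that each participant uses each rank exactly once), all five new columns can be filled in one loop:
practices <- c("ni", "dtt", "pmi", "sn", "script")
new_cols <- c("first.beliefs", "second.beliefs", "third.beliefs",
              "fourth.beliefs", "last.beliefs")
belief_mat <- as.matrix(Diss[, paste0(practices, ".beliefs")])
for (r in 1:5) {
  # for each row, find which practice received rank r ...
  idx <- apply(Diss[, paste0("rank.", practices)] == r, 1, which.max)
  # ... and pull that practice's belief score into the r-th new column
  Diss[[new_cols[r]]] <- belief_mat[cbind(seq_len(nrow(Diss)), idx)]
}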

R: set column value to another column's value based on string search

I'm trying to find a clean way to get the first column of my DT, for each row, to be equal to the user_id found in other columns. That is, I must perform a search of "user_id" across each row, and return the entirety of the cell where the instance is found.
I first tried to get the index of the column where the partial match is found, and then use this to set the first column's values, but it did not work. Example:
   user_id          1             2
1:     N/A        300    user_id154
2:     N/A user_id301 user_id125040
3:     N/A        302      user_id2
For instance, I want to obtain the following:
user_id
user_id154
user_id301
user_id2
Please bear in mind I am new to such data formatting in R (most of the work I do does not involve cleaning JSON files), and that my data.table has over 1M rows. The answer does not need to be super efficient, but it definitely shouldn't take more than 5 minutes, or it will be considered too slow by my boss.
Hopefully this is understandable.
I'm sure someone will provide a more elegant solution, but this does the trick:
dt[, user_id := str_extract(str_c(`1`, `2`, sep = " "), "user_id[0-9]*")]
This first combines the relevant columns row by row, then, for each row, extracts the first user_id found in the combined value.
(Requires the stringr package; the backticks around 1 and 2 are needed because the column names are numbers.)
For every row in your table, grep the first value that has "user_id" in it and put the result into column user_id.
df$user_id <- apply(df, 1, function(x) grep("user_id", x, value = TRUE)[1])
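A quick reproducible check of that approach, using hypothetical data matching the printout above:
library(data.table)
dt <- data.table(user_id = NA_character_,
                 `1` = c("300", "user_id301", "302"),
                 `2` = c("user_id154", "user_id125040", "user_id2"))
# grep each row of the two value columns for "user_id" and keep the first hit
dt[, user_id := apply(.SD, 1, function(x) grep("user_id", x, value = TRUE)[1]),
   .SDcols = c("1", "2")]
dt$user_id
[1] "user_id154" "user_id301" "user_id2"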

Excel or R: writing code to automate filtering of non-oscillatory changes in data

I am new to coding and need direction to turn my method into code.
In my lab I am working on a time-series project to discover which genes in a cell naturally change over the organism's cell cycle. I have a tabular data set with numerical values (originally 10 columns, 27,000 rows). To analyze whether a gene is cycling over the data set, I divided the values of one time point (or column) by each subsequent time point (or column), and continued that trend across the data set. (The top section of the picture is an example of the spreadsheet with a numerical value at each time point; the bottom section is an example of what the time comparisons looked like across the data.)
I then imposed an advanced filter with multiple AND / OR criteria that followed this logic (source: Jeeped):
WHERE (column A >= 2.0 AND column B <= 0.5)
OR (column A >= 2.0 AND column C <= 0.5)
OR (column A >= 2.0 AND column D <= 0.5)
OR (column A >= 2.0 AND column E <= 0.5)
(etc ...)
From there, I slid the advanced filter across the entire data set (in the photograph, A on the left is an example of the original filter, and B shows the filter sliding across the data).
The filters produced multiple sheets of genes that fit my criteria. To figure out how many unique genes met the criteria, I merged Column A (Gene_IDs) of all the sheets and removed duplicates to produce a list of unique gene IDs.
The process took me nearly 3 hours due to the size of each spreadsheet (37 columns, 27,000 rows before filtering). Can this process be expedited? And if so, can someone point me in the right direction or help me create the code to do so?
Thank you for your time, and if you need any clarification please don't hesitate to ask.
There are a few ways to do this in R, but a common and easy-to-reason-about way is to use the any() function. It takes a series of logical tests and puts an "OR" between them, so that if any of them returns TRUE then it returns TRUE. Because any() collapses everything to a single value, you apply it row by row and then combine the result with an AND for the logical test on column a. There are probably other ways to abstract this as well, but this should get you started:
df <- data.frame(
  a = 1:100,
  b = 1:100,
  c = 51:150,
  d = 101:200,
  value = rep("a", 100)
)
# apply any() to each row, so every row gets its own TRUE/FALSE
passes <- apply(df[, c("b", "c", "d")], 1, function(x) any(x > 5))
df[df$a > 2 & passes, "value"] <- "Test Passed!"
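Translated to the question's actual criteria, a rough sketch (assuming a data frame expr whose first column is Gene_ID and whose remaining columns hold the time-point ratios; both names are placeholders):
ratios <- expr[, -1]  # drop the Gene_ID column
# keep genes where the first ratio is >= 2 and any later ratio is <= 0.5
keep <- ratios[, 1] >= 2 & apply(ratios[, -1] <= 0.5, 1, any)
cycling_genes <- unique(expr$Gene_ID[keep])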

Subset indexing in R

I have a dataframe ma
it has a factor called type
type comprises the following levels: I210, I210plus, I210plusc, KV2c, KV2cplus
I'd like to put some of these factors in a vector, say, selected_types
so, selected_types<-c("I210plusc","KV2c")
then, have this command subset the dataframe ma
ma1<-subset(ma, type==selected_types)
such that ma1 would be a subset of ma consisting of only the observations whose type is I210plusc or KV2c.
However, when I do this, the number of observations in the resulting dataframe ma1 is less than the sum of the occurrences of the two types in selected_types in the original ma.
Any ideas on what I'm doing incorrectly?
Thank you
I originally had this in a comment, but it's a bit lengthy, plus I wanted to add to it. Here are some details on what's happening:
What you're doing with == is recycling your length-two vector, so that every odd row is compared to "I210plusc" and every even one to "KV2c". Your final result will therefore be the data frame of odd rows that are "I210plusc" and even rows that are "KV2c".
An alternate solution that might make the issue clear is as follows:
subset(ma, type == selected_types[[1]] | type == selected_types[[2]])
Or, more gracefully:
subset(ma, type %in% selected_types)
The %in% operator returns a logical vector of the same length as type, with TRUE at every position of type that "is in" selected_types (hence the name of the operator).
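To see the difference concretely, here's a toy version (hypothetical data, since the original ma isn't shown):
ma <- data.frame(type = c("I210plusc", "KV2c", "KV2c", "I210", "I210plusc"))
selected_types <- c("I210plusc", "KV2c")
# recycled comparison: a row matches only if its position happens to line up
nrow(subset(ma, type == selected_types))   # 3, plus a recycling warning
nrow(subset(ma, type %in% selected_types)) # 4, as intended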

How to delete rows containing a specific value in any of multiple columns?

Given data like this
C1<-c(3,-999.000,4,4,5)
C2<-c(3,7,3,4,5)
C3<-c(5,4,3,6,-999.000)
DF<-data.frame(ID=c("A","B","C","D","E"),C1=C1,C2=C2,C3=C3)
How do I go about removing the rows with -999.000 in any of the columns?
I know this works per column
DF2<-DF[!(DF$C1==-999.000 | DF$C2==-999.000 | DF$C3==-999.000),]
But I'd like to avoid referencing each column. I am thinking there is an easy way to reference all of the columns of a particular data frame at once, e.g.:
DF3<-DF[!(DF[,]==-999.000),]
or
DF3<-DF[!(DF[,(2:4)]==-999.000),]
but obviously these do not work
And out of curiosity, bonus points if you can tell me why I need that last comma before the closing square bracket, as in:
==-999.000),]
The following may work:
DF[!apply(DF == -999, 1, sum), ]
or, if you can have multiple -999 values in a row:
DF[!(apply(DF == -999, 1, sum) > 0), ]
or
DF[!apply(DF == -999, 1, any), ]
Based on your code, I'll assume that you want to remove all rows that contain -999.
DF2 <- DF[rowSums(DF == -999) == 0, ]
As for your bonus question: A data frame is a list of vectors, all of which have the same length. If we think of the vectors as columns, then a data frame can be thought of as a matrix where the columns might have different types (numeric, character, etc). R allows you to refer to elements of a data frame much the same way you refer to elements of a matrix; by using row and column indices. So DF[i, j] refers to the ith element in the jth vector of DF, which you can think of as the ith row and jth column. So if you want to retain only some of the rows of the data frame and all columns, you can use a matrix-like notation: DF[row.indices, ].
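A quick illustration of the difference, using the DF from the question:
DF[2, ]  # matrix-style: row 2, all columns
DF[2]    # list-style: the 2nd column (C1) as a one-column data frame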
To address your "bonus" question, if we go to the documentation for ?Extract.data.frame we will find:
Data frames can be indexed in several modes. When [ and [[ are used
with a single index (x[i] or x[[i]]), they index the data frame as if
it were a list. In this usage a drop argument is ignored, with a
warning.
and also:
When [ and [[ are used with two indices (x[i, j] and x[[i, j]]) they
act like indexing a matrix: [[ can only be used to select one element.
Note that for each selected column, xj say, typically (if it is not
matrix-like), the resulting column will be xj[i], and hence rely on
the corresponding [ method, see the examples section.
So you need the comma to ensure that R knows you are referring to a row, not a column.
I'm not sure if your goal is to remove all the rows that contain at least one -999, but if that is what you are looking for, this could be a possible answer (convert those values to NA, then drop the incomplete rows):
DF[DF==-999] <- NA
na.omit(DF)
ID C1 C2 C3
1 A 3 3 5
3 C 4 3 3
4 D 4 4 6
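For completeness, a tidyverse equivalent of the same row filter (a sketch assuming dplyr >= 1.0 for if_all()):
library(dplyr)
# keep only rows where every measurement column differs from -999
filter(DF, if_all(C1:C3, ~ .x != -999))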
