How do I pull the values from multiple columns, conditionally, into a new column? - r

I am a relatively novice R user, though familiar with dplyr and tidy verse. I still can't seem to figure out how to pull in the actual data from one column if it meets certain condition, into a new column.
Here is what I'm trying to do. Participants have ranked specific practices (n=5) and provided responses to questions that represent their beliefs about these practices. I want to have five new columns that assign their beliefs about the practices to their ranks, rather than the practices.
For example, they have a score for "beliefs about NI" called ni.beliefs, if a participant ranked NI as their first choice, I want the value for ni.beliefs to be pulled into the new column for first.beliefs. The same is true that if a participant put pmii as their first choice practice, their value for pmii.beliefs should be pulled into the first.beliefs column.
So, I need five new columns called: first.beliefs, second.beliefs, third.beliefs, fourth.beliefs, last.beliefs and then I need each of these to have the data pulled in conditionally from the practice specific beliefs (ni.beliefs, dtt.beliefs, pmi.beliefs, sn.beliefs, script.beliefs) dependent on the practice specific ranks (rank assigned of 1-5 for each practice, rank.ni, rank.dtt, rank.pmi, rank.sn, rank.script).
Here is what I have so far but I am stuck and aware that this is not very close. Any help is appreciated!!!
`
Diss$first.beliefs <-ifelse(rank.ni==1, ni.beliefs,
ifelse(rank.dtt==1, dtt.beliefs,
ifelse(rank.pmi==1, pmi.beliefs,
ifelse(rank.sn, sn.beliefs,
ifelse(rank.script==1, script.beliefs)))))
`
Thank you!!

I'm not sure if I understood correctly (it would help if you show how your data looks like), but this is what I'm thinking:
Without using additional packages, if the ranking columns are equivalent to the index of the new columns you want (i.e. they rank each practice from 1 to 5, without repeats, and in the same order as the new columns "firsts belief, second belief, etc"), then you can use that data as the indices for the second set of columns:
for(j in 1:nrow(people_table)){
people_table[j,]$first.belief[[1]] <- names(beliefs)[(people_table[j,c(A:B)]) %in% 1]
people_table[j,]$second.belief[[1]] <- names(beliefs)[(people_table[j,c(A:B)]) %in% 2]
...
}
Where
A -> index of the first preference rank column
B -> index of the last preference rank column
(people_table[j,c(A:B)] %in% 1) -> this returns something like (FALSE FALSE TRUE FALSE FALSE)
beliefs -> vector with the names of each belief
That should work. It's simple, no need for packages, and it'll be fast too. Just make sure you've initialized/created the new columns first, otherwise you'll get some errors. If

This is done very easily with the case_when() function. You can improve on the code below.
library(dplyr)
Diss$first.beliefs <- case_when(
rank.ni == 1 ~ ni.beliefs,
rank.dtt == 1 ~ dtt.beliefs,
rank.pmi == 1 ~ pmi.beliefs,
rank.sn ~ sn.beliefs,
rank.script == 1 ~ script.beliefs
)

Related

Conditional sub-setting or nulling

I have to include participants into a dataframe(or existing data frame) if they have higher score in invalid conditions relative to valid conditions. But I have two times of (T1-T3) data.
I have tried this one: data_new <- subset(data_raw, T1_invalid > T1_valid & T3_invalid > T3_valid)
However, it did not work because, for instance, some participants may have higher invalid score in just one time (T1), not in the second time (T3), or vice versa.
For example, a person can have higher invalid in one of the times, let's say T1_invalid > T1_valid. This should be included to the new data frame, it is okay. But, T3_invalid - T3_valid should be excluded because the invalid score is not higher than the valid score. But when you use AND operator, it excludes the person because, they have to have higher invalid scores in both T1 and T3. So, we over exclude in that case.
When you use OR operator it is the same. For example, a person has a higher score in T1_invalid > T1_valid, but not in the T3_invalid - T3_valid. Then, since one of the conditions is okay, it includes the person, but this person failed at T3. So, we should exclude T3_invalid - valid scores.
So basically, I was looking for something can check them separately. Then, I decided to make it null one by one like this:
data_raw[data_raw$T1_invalid < data_raw$T1_valid, c("T1_invalid", "T1_valid")] <- NA
data_raw[data_raw$T3_invalid < data_raw$T3_valid, c("T3_invalid", "T3_valid")] <- NA
However, it did not let me do this because I use the variables two times, for the condition part (>) and for make it null.
Does anyone have any idea? By the way they have to be in the same data frame for using in the model.
Here I provide a normal data.table solution. You can have a try.
library(data.table)
setDT(data_raw)
data_raw[, T1_invalid := ifelse(T1_invalid < T1_valid,NA,T1_invalid)]
data_raw[, T1_valid := ifelse(T1_invalid < T1_valid,NA,T1_valid)]
data_raw[, T3_invalid := ifelse(T3_invalid < T3_valid,NA,T3_valid)]
data_raw[, T3_valid := ifelse(T3_invalid < T3_valid,NA,T3_valid)]

Get next level from a given level factor

I am currently making my first steps using R with RStudio and right now I am struggling with the following problem:
I got some test data about a marathon with four columns, where the third column is a factor with 15 levels representing different age classes.
One age class randomAgeClass will be randomly selected at the beginning, and an object is created holding the data that matches this age class.
set.seed(12345678)
attach(marathon)
randomAgeClass <- sample(levels(marathon[,3]), 1)
filteredMara <- subset(marathon, AgeClass == randomAgeClass)
My goal is to store a second object that holds the data matching the next higher level, meaning that if age class 'kids' was randomly selected, I now want to access the data relating to 'teenagers', which is the next higher level. Looking something like this:
nextAgeClass <- .... randomAgeClass+1 .... ?
filteredMaraAgeClass <- subset(marathon, AgeClass == nextAgeClass)
Note that I already found this StackOverflow question, which seems to partially match my situation, but the accepted answer is not understandable to me, thus I wasn't able to apply it to my needs.
Thanks a lot for any patient help!
First you have to make sure thar the levels of your factor are ordered by age:
factor(marathon$AgeClass,levels=c("kids","teenagers",etc.))
Then you almost got there in your example:
next_pos<-which(levels(marathon$AgeClass)==randomAgeClass)+1 #here you get the desired position in the level vector
nextAgeClass <- levels(marathon$AgeClass) [next_pos]
filteredMaraAgeClass <- subset(marathon, AgeClass == nextAgeClass)
You might have a problem if the randomAgeClass is the last one, so make sure to avoid that problem

Finding the percentage of a specific value in the column of a data set

I have a dataset called college, and one of the columns is 'accepted'. There are two values for this column - 1 (which means student was accepted) and 0 (which means student was not accepted). I was to find the accepted student percentage.
I did this...
table(college$accepted)
which gave me the frequency of 1 and 0. (1 = 44,224 and 0 = 75,166). I then manually added those two values together (119,390) and divided the 44,224/119,390. This is fine and gets me the value I was looking for. But I would really like to know how I could do this with R code, since I'm sure there is a way to do it that I just haven't thought of.
Thanks!
Perhaps you can use prop.table like below
prop.table(table(college$accepted))["1"]
If it's a simple 0/1 column then you only need take the column mean.
mean_accepted <- mean(df$accepted)
you could first sum the column, and the count the total number in the column
sum(college$accepted)/length(college$accepted)
To make the code more explicit and describe your intent better, I suggest using a condition to identify the cases that meet your criteria for inclusion. For example:
college$accepted == 1
Then take the average of the logical vector to compute the proportion (between 0 and 1), multiply by 100 to make it a percentage.
100 * mean(college$accepted == 1, na.rm = TRUE)

How to check for the presence of multiple strings for each value of a particular column in R Dataframe?

How do we identify all those row entries in a particular column that contain a specific set of keywords?
For example, I have the following dataframe:
test <- data.frame(nom = 1:5, name = c("ser bla", "onlybla", "inspectiongfa serdafds", "inspection", "serbla blainspection"))
My keywords of interest are "ser" & "inspection"
What I'm looking for is to enlist all the values of the second column (i.e. name) in which both the keywords are present together.
So basically, my output should enlist the name values of rows 3 and 4 viz. "inspectiongfa serdafds" & "serbla blainspection"
What I have tried is the following:
I first generate a truth table to enlist the presence of each of the keywords for each row in the dataframe as follows:
as.data.frame(sapply(c("ser", "inspection"), grepl, test$name))
Once I get this, all I have to do is identify all those row entries where the values are a pair of TRUE TRUE. Hence, they'll correspond to the cases where the keywords of interest are present. Here it's the same rows 3 & 4.
But, I'm not able to figure out how to identify such row entries with the TRUE TRUE pair and whether this whole process is a bit of an overkill and it can be done in a much efficient manner.
Any help would be appreciated. Thanks!
You're almost there :)
Here's a solution extending what you have done:
# store your logic test outcomes
conditions_df <- as.data.frame(sapply(c("ser", "inspection"), grepl, test$name))
# False=0 & True=1. Can use rowSums to get the total and find ones that =2 ie True+True
# which gives you the indices of the TRUE outcomes ie the rows we need to filter test
locate_rows <- which(rowSums(conditions_df) == 2)
test$name[locate_rows]
[1] "inspectiongfa serdafds"
[2] "serbla blainspection"

Excell or R: writting code to automate filtering of non-osicllatory changes in data.

I am new to coding and need direction to turn my method into code.
In my lab I am working on a time-series project to discover which gene's in a cell naturally change over the organism's cell cycle. I have a tabular data set with numerical values (originally 10 columns, 27,000 rows). To analyze whether a gene is cycling over the data set I divided the values of one time point (or column) by each subsequent time point (or column), and continued that trend across the data set (the top section of the picture is an example of spread sheet with numerical value at each time-point. The bottom section is an example of what the time-comparisons looked like across the data.
I then imposed an advanced filter with multiple AND / OR criteria that followed the logic (Source Jeeped)
WHERE (column A >= 2.0 AND column B <= 0.5)
OR (column A >= 2.0 AND column C <= 0.5)
OR (column A >= 2.0 AND column D <= 0.5)
OR (column A >= 2.0 AND column E <= 0.5)
(etc ...)
From there, I slid the advanced filter across the entire data set(in the photograph, A on the left -- exanple of the original filter, and B -- the filter sliding across the data)
The filters produced multiple sheets of genes that fit my criteria. To figure how many unique genes met this criteria I merged Column A (Gene_ID's) of all the sheets and removed duplicates to produce a list of unique gene ID's.
The process took me nearly 3 hours due to the size of each spread sheet (37 columns, 27000 rows before filtering). Can this process be expedited? and if so can someone point me in the right direction or help me create the code to do so?
Thank you for your time, and if you need any clarification please don't hesitate to ask.
There are a few ways to do this in R. I think but a common an easy to think about way is to use the any function. This basically takes a series of logical tests and puts an "OR" between each of them, so that if any of them return true then it returns true. You can pass each column to it and then combine it with an AND for the logical test for column a. There are probably other ways to abstract this as well, but this should get you started:
df <- data.frame(
a = 1:100,
b = 1:100,
c = 51:150,
d = 101:200,
value = rep("a", 100)
)
df[ df$a > 2 & any(df$b > 5, df$c > 5, df$d > 5), "value"] <- "Test Passed!"

Resources