Conditional sub-setting or nulling - r

I need to keep participants in a data frame (a new or the existing one) only if their score is higher in the invalid condition than in the valid condition. However, I have data from two time points (T1 and T3).
I tried this: data_new <- subset(data_raw, T1_invalid > T1_valid & T3_invalid > T3_valid)
However, it did not work, because some participants have a higher invalid score at only one time point (T1 but not T3, or vice versa).
For example, a person can have T1_invalid > T1_valid, so their T1 scores should be kept in the new data frame. But their T3_invalid and T3_valid scores should be excluded if the invalid score is not higher than the valid score at T3. With the AND operator, the whole person is dropped, because they must have higher invalid scores at both T1 and T3. So we over-exclude in that case.
The OR operator has the opposite problem. For example, a person has T1_invalid > T1_valid but not T3_invalid > T3_valid. Since one of the conditions holds, the person is kept with all their scores, even though they failed at T3. Their T3_invalid and T3_valid scores should be excluded.
So basically, I was looking for something that can check each time point separately. I then decided to set the failing scores to NA one time point at a time, like this:
data_raw[data_raw$T1_invalid < data_raw$T1_valid, c("T1_invalid", "T1_valid")] <- NA
data_raw[data_raw$T3_invalid < data_raw$T3_valid, c("T3_invalid", "T3_valid")] <- NA
However, R did not let me do this. I think it is because I use the same variables twice, once in the condition (<) and once as the columns being set to NA.
Does anyone have any idea? By the way, the scores have to stay in the same data frame so that I can use them in the model.

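Here is a plain data.table solution you can try.
library(data.table)
setDT(data_raw)
# Set the T1 invalid score to NA where it is lower than the valid score.
data_raw[, T1_invalid := ifelse(T1_invalid < T1_valid, NA, T1_invalid)]
# T1_invalid is now NA in those rows, so the comparison returns NA and T1_valid becomes NA too.
data_raw[, T1_valid := ifelse(T1_invalid < T1_valid, NA, T1_valid)]
# Same two steps for T3.
data_raw[, T3_invalid := ifelse(T3_invalid < T3_valid, NA, T3_invalid)]
data_raw[, T3_valid := ifelse(T3_invalid < T3_valid, NA, T3_valid)]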
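For comparison, here is a minimal base R sketch of the same idea. The data are made up for illustration, and the loop over the "T1"/"T3" prefixes is my own assumption about how the columns are named. One possible reason the base R attempt in the question failed is that a comparison involving NA returns NA, and data frames do not accept NA in a logical row index during assignment. Wrapping the comparison in which() keeps only the rows where it is TRUE.

```r
# Hypothetical toy data: four participants, two time points, some missing scores.
data_raw <- data.frame(
  id         = 1:4,
  T1_invalid = c(500, 420, NA, 610),
  T1_valid   = c(450, 480, 400, 600),
  T3_invalid = c(430, 510, 520, 590),
  T3_valid   = c(470, 490, 500, NA)
)

for (tp in c("T1", "T3")) {
  inv <- paste0(tp, "_invalid")
  val <- paste0(tp, "_valid")
  # which() drops NA comparisons, so only rows that clearly fail are selected.
  fail <- which(data_raw[[inv]] < data_raw[[val]])
  # Blank out both scores of the failing pair; the participant's row stays.
  data_raw[fail, c(inv, val)] <- NA
}

print(data_raw)
```

Every participant stays in the data frame, and each time point is judged on its own, which is what the model needs if it can handle missing values.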

Related

Trying to create a loop for a population range and seem to be missing something

This is the prompt I am working from:
In the state population data, can you write a loop that pulls only states with populations between 200,000 and 250,000 in 1800? You could accomplish this a few different ways. Try to use next and break statements.
The dataset in question tracks how each state's population has shifted from year to year, decade to decade.
There are various columns representing each year of observations, including one for 1800 ("X1800"). The states are all also listed under one column ("state," of course).
I've been working on this one for quite a while, being new to coding. Right now, the code I have is as follows:
i <- 1
for (i in 1:length(statepopulations$X1800)) {
if ((statepopulations$X1800[i] < 200000) == TRUE)
next
if ((statepopulations$X1800[i] > 250000) == TRUE)
break
print (statepopulations$state[i])
}
I want to print the names of all states that fall within that population range for the year 1800.
I'm not sure where I'm going wrong. Any help is appreciated.
I keep getting a message that I'm "missing value where TRUE/FALSE needed."
In the condition, you don't need the == TRUE part.
(statepopulations$X1800[i] < 200000) works by itself: it returns TRUE or FALSE (or NA if the value is missing), which dictates what happens next.
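The "missing value where TRUE/FALSE needed" message appears when an if() condition evaluates to NA. That would happen here if X1800 contains missing values, which is only an assumption since the data are not shown. Note also that break ends the whole loop at the first state above 250,000, so it only gives the right answer if the rows are sorted by X1800. A minimal sketch with made-up data that skips missing and out-of-range rows with next:

```r
# Hypothetical toy data; the real statepopulations has more columns and rows.
statepopulations <- data.frame(
  state = c("State A", "State B", "State C", "State D", "State E"),
  X1800 = c(150000, 220000, NA, 600000, 245000),
  stringsAsFactors = FALSE
)

in_range <- character(0)
for (i in seq_len(nrow(statepopulations))) {
  pop <- statepopulations$X1800[i]
  # Skip missing values first, otherwise if() would receive NA and stop with an error.
  if (is.na(pop)) next
  # Skip anything outside the 200,000 to 250,000 range.
  if (pop < 200000 || pop > 250000) next
  in_range <- c(in_range, statepopulations$state[i])
  print(statepopulations$state[i])
}
```

If the exercise insists on break, one option is to sort the data frame by X1800 first, so that every row after the first value above 250,000 is known to be out of range as well.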

Get next level from a given level factor

I am currently making my first steps using R with RStudio and right now I am struggling with the following problem:
I got some test data about a marathon with four columns, where the third column is a factor with 15 levels representing different age classes.
One age class randomAgeClass will be randomly selected at the beginning, and an object is created holding the data that matches this age class.
set.seed(12345678)
attach(marathon)
randomAgeClass <- sample(levels(marathon[,3]), 1)
filteredMara <- subset(marathon, AgeClass == randomAgeClass)
My goal is to store a second object that holds the data matching the next higher level, meaning that if age class 'kids' was randomly selected, I now want to access the data relating to 'teenagers', which is the next higher level. Looking something like this:
nextAgeClass <- .... randomAgeClass+1 .... ?
filteredMaraAgeClass <- subset(marathon, AgeClass == nextAgeClass)
Note that I already found this StackOverflow question, which seems to partially match my situation, but the accepted answer is not understandable to me, thus I wasn't able to apply it to my needs.
Thanks a lot for any patient help!
First you have to make sure that the levels of your factor are ordered by age, and assign the result back to the column:
marathon$AgeClass <- factor(marathon$AgeClass, levels = c("kids", "teenagers", etc.))  # replace etc. with the remaining age classes, in order
Then you almost got there in your example:
next_pos <- which(levels(marathon$AgeClass) == randomAgeClass) + 1  # position of the next level in the level vector
nextAgeClass <- levels(marathon$AgeClass)[next_pos]
filteredMaraAgeClass <- subset(marathon, AgeClass == nextAgeClass)
You will have a problem if randomAgeClass is the last level: next_pos then points past the end of the level vector and nextAgeClass is NA, so make sure to handle that case.
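One way to handle the last-level case is to wrap the lookup in a small helper. This is only a sketch: next_level() is a hypothetical function name, and the three age classes and the toy marathon data are made up for illustration.

```r
# Hypothetical toy data with three ordered age classes instead of 15.
age_levels <- c("kids", "teenagers", "adults")
marathon <- data.frame(
  Runner   = c("r1", "r2", "r3", "r4"),
  AgeClass = factor(c("kids", "adults", "teenagers", "kids"), levels = age_levels)
)

# Return the level that follows `current`, or NA if there is none.
next_level <- function(f, current) {
  pos <- match(current, levels(f))
  if (is.na(pos) || pos == nlevels(f)) {
    return(NA_character_)
  }
  levels(f)[pos + 1]
}

randomAgeClass <- "kids"
nextAgeClass <- next_level(marathon$AgeClass, randomAgeClass)
filteredMaraAgeClass <- subset(marathon, AgeClass == nextAgeClass)
```

When the helper returns NA, the comparison inside subset() is NA for every row, so the result is an empty data frame rather than an error. Checking is.na(nextAgeClass) before subsetting makes that case explicit.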

Find row where multiple criteria are met, return a different column, for a full dataset

I have two datasets; 'data' and 'noiseaware'. noiseaware contains a RoomCode and a time stamp.
RoomCode last_trigger
GTX-513 2020-05-09 00:30:28
data contains a ton of things, including a reservation code, a check-in time stamp, a check-out time stamp, and a RoomCode, i.e.
ReservationID RoomCode checkin_time checkOutDate
25307070gawgw GTX-513 2020-04-09 00:30:28 2020-05-09 00:30:28
My objective is that for each line in noiseaware, I want to find the corresponding reservation ID that matches the following combination:
Is after the checkInDate
Is before the checkOutDate
Has the same RoomCode
That in logic is as follows:
noiseaware$last_trigger <= data$checkOutDate & noiseaware$last_trigger >= data$checkInDate & data$RoomCode == noiseaware$RoomCode
However, I can't work out how to turn that logic - which returns a vector of true and false values - into something that returns the ReservationId. If it makes any difference, there should only be one matching ID for the above criteria.
Once I can do that, I'd then want to loop through and do the same for each line in noiseaware. I suppose I could do that with lapply?
Sounds like something dplyr can handle easily.
You will need to left_join noiseaware to data by RoomCode, and then filter out the rows you don't need.
Here's an example. Without sample data I have no way to test this, so you may need to tweak the code to fit the actual data, but the basic idea is there. The comparisons are inclusive (>= and <=) to match the logic in the question.
library("dplyr")
noiseaware %>%
  left_join(data, by = "RoomCode") %>%
  filter(last_trigger >= checkin_time & last_trigger <= checkOutDate)
An option using data.table, which adds the matching ReservationID to noiseaware as a new column:
library(data.table)
setDT(noiseaware)[, ReservationID :=
  setDT(data)[.SD, on=.(RoomCode, checkInDate<=last_trigger, checkOutDate>=last_trigger),
    mult="last", x.ReservationID]
]
mult="last" uses the last match if there are multiple matching reservations for a row in noiseaware.
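To see the join-then-filter idea on concrete values, here is a minimal base R sketch that needs no packages. Only the first row of each table comes from the question; the second rows, the "hypothetical2" reservation ID, and the UTC time zone are assumptions added for illustration. It uses the column names from the sample table (checkin_time, checkOutDate).

```r
# Triggers: the first row is from the question, the second is made up.
noiseaware <- data.frame(
  RoomCode     = c("GTX-513", "GTX-513"),
  last_trigger = as.POSIXct(c("2020-05-09 00:30:28", "2020-06-01 12:00:00"), tz = "UTC"),
  stringsAsFactors = FALSE
)

# Reservations: the first row is from the question, the second is made up.
data <- data.frame(
  ReservationID = c("25307070gawgw", "hypothetical2"),
  RoomCode      = c("GTX-513", "GTX-513"),
  checkin_time  = as.POSIXct(c("2020-04-09 00:30:28", "2020-05-20 15:00:00"), tz = "UTC"),
  checkOutDate  = as.POSIXct(c("2020-05-09 00:30:28", "2020-06-05 10:00:00"), tz = "UTC"),
  stringsAsFactors = FALSE
)

# Pair every trigger with every reservation for the same room...
candidates <- merge(noiseaware, data, by = "RoomCode")
# ...then keep only the pairs where the trigger falls inside the stay (inclusive).
matched <- subset(candidates, last_trigger >= checkin_time & last_trigger <= checkOutDate)
result <- matched[, c("RoomCode", "last_trigger", "ReservationID")]
print(result)
```

No explicit loop or lapply over the rows of noiseaware is needed, because the merge builds all candidate pairs at once. On large tables the intermediate result can get big, which is where a non-equi join such as the data.table option is likely to be the better fit.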

How do I pull the values from multiple columns, conditionally, into a new column?

I am a relatively novice R user, though familiar with dplyr and the tidyverse. I still can't seem to figure out how to pull the actual data from one column, if it meets a certain condition, into a new column.
Here is what I'm trying to do. Participants have ranked specific practices (n=5) and provided responses to questions that represent their beliefs about these practices. I want to have five new columns that assign their beliefs about the practices to their ranks, rather than the practices.
For example, they have a score for "beliefs about NI" called ni.beliefs, if a participant ranked NI as their first choice, I want the value for ni.beliefs to be pulled into the new column for first.beliefs. The same is true that if a participant put pmii as their first choice practice, their value for pmii.beliefs should be pulled into the first.beliefs column.
So, I need five new columns called: first.beliefs, second.beliefs, third.beliefs, fourth.beliefs, last.beliefs and then I need each of these to have the data pulled in conditionally from the practice specific beliefs (ni.beliefs, dtt.beliefs, pmi.beliefs, sn.beliefs, script.beliefs) dependent on the practice specific ranks (rank assigned of 1-5 for each practice, rank.ni, rank.dtt, rank.pmi, rank.sn, rank.script).
Here is what I have so far but I am stuck and aware that this is not very close. Any help is appreciated!!!
Diss$first.beliefs <-ifelse(rank.ni==1, ni.beliefs,
  ifelse(rank.dtt==1, dtt.beliefs,
    ifelse(rank.pmi==1, pmi.beliefs,
      ifelse(rank.sn, sn.beliefs,
        ifelse(rank.script==1, script.beliefs)))))
Thank you!!
I'm not sure if I understood correctly (it would help if you showed what your data looks like), but this is what I'm thinking:
Without using additional packages, if the ranking columns are equivalent to the index of the new columns you want (i.e. they rank each practice from 1 to 5, without repeats, and in the same order as the new columns "firsts belief, second belief, etc"), then you can use that data as the indices for the second set of columns:
for(j in 1:nrow(people_table)){
people_table[j,]$first.belief[[1]] <- names(beliefs)[(people_table[j,c(A:B)]) %in% 1]
people_table[j,]$second.belief[[1]] <- names(beliefs)[(people_table[j,c(A:B)]) %in% 2]
...
}
Where
A -> index of the first preference rank column
B -> index of the last preference rank column
(people_table[j,c(A:B)] %in% 1) -> this returns something like (FALSE FALSE TRUE FALSE FALSE)
beliefs -> vector with the names of each belief
That should work. It's simple and needs no packages. Just make sure you've initialized/created the new columns first, otherwise you'll get some errors.
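If the goal is to pull the belief values themselves (rather than the practice names) into all five new columns, the same indexing idea can be written with matrix indexing instead of a row loop. This is a sketch on made-up data: the toy values are hypothetical, and it assumes each participant uses every rank from 1 to 5 exactly once.

```r
# Hypothetical toy data: three participants, ranks 1-5 without ties.
practices <- c("ni", "dtt", "pmi", "sn", "script")
Diss <- data.frame(
  rank.ni = c(1, 3, 5), rank.dtt = c(2, 1, 4), rank.pmi = c(3, 2, 1),
  rank.sn = c(4, 5, 2), rank.script = c(5, 4, 3),
  ni.beliefs = c(10, 11, 12), dtt.beliefs = c(20, 21, 22),
  pmi.beliefs = c(30, 31, 32), sn.beliefs = c(40, 41, 42),
  script.beliefs = c(50, 51, 52)
)

# Rank and belief columns as matrices, in the same practice order.
ranks   <- as.matrix(Diss[paste0("rank.", practices)])
beliefs <- as.matrix(Diss[paste0(practices, ".beliefs")])

new_cols <- c("first.beliefs", "second.beliefs", "third.beliefs",
              "fourth.beliefs", "last.beliefs")
rows <- seq_len(nrow(Diss))

for (k in seq_along(new_cols)) {
  # For each participant, the column position of the practice ranked k-th
  # (NA if no practice has that rank).
  pos <- apply(ranks == k, 1, function(hit) which(hit)[1])
  # Matrix indexing picks one belief value per row.
  Diss[[new_cols[k]]] <- beliefs[cbind(rows, pos)]
}
```

The loop runs over the five ranks rather than over participants, so the number of iterations stays at five regardless of how many rows the data frame has.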
This is done very easily with the case_when() function. You can improve on the code below.
library(dplyr)
# mutate() lets case_when() see the columns of Diss by name.
Diss <- Diss %>%
  mutate(first.beliefs = case_when(
    rank.ni == 1 ~ ni.beliefs,
    rank.dtt == 1 ~ dtt.beliefs,
    rank.pmi == 1 ~ pmi.beliefs,
    rank.sn == 1 ~ sn.beliefs,
    rank.script == 1 ~ script.beliefs
  ))

Use <= to search a range

I want to perform another calculation after checking the range $N$16:$N$9000 for all dates that are >=$C$6 as shown below.
=IF($N$16:$N$9000>=$C$6; "Y"; "N")
I really need a calculation that will test for >=$C$6 and <=$C$8.
The additional equation has been tested and works fine. It will replace the "Y" once I fix this portion of the logic.
Assuming you want to sum the values from another column according to date values >=$C$6 and <=$C$8, you could adapt an example from the Conditional Counting and Summation HowTo. It's based on the SUMPRODUCT function. All three ranges must have the same size.
=SUMPRODUCT($N$16:$N$9000 >= $C$6; $N$16:$N$9000 <= $C$8; $P$16:$P$9000)
(assuming that P16:P9000 holds the values to sum up)
