Creating new column with values based on old columns including NA

I have a dataset where I'm focusing on 2 specific columns. I want to create a new column with the following:
How can I do this in R?
Thanks
This is the code that I tried. It didn't work and gave me an error. Also, I wasn't sure how to include all the NA values in this code.
data_2=data_1%>%mutate(majoramp_indX=case_when(c("amplevel_r","amplevel_l")>=2~1,c("amplevel_r","amplevel_l")<2~0))
Then I also tried this, which gave me all 1s in the new column:
data_1$majoramp_indX=case_when(c("amplevel_r","amplevel_l")>=2~1,c("amplevel_r","amplevel_l")<2~0)

Welcome to SO Mufti!
There is a point that needs clarifying in your two conditions: what to do if, for example A = 1 and B = 2? Is the result 0 or 1? To get you going, I've done code for the two conditions.
To code your first condition, you can then do
library(dplyr)
myData <- myData |> mutate(Con1 = ifelse((A %in% c(2:5) | (B %in% c(2:5))), 1, NA))
For your second condition, you can do
myData <- myData |> mutate(Con2 = ifelse((A %in% c(1, NA) | (B %in% c(1,NA))), 0, NA))
Hope this helps! :-)
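As a side note on why the attempts in the question misbehave: c("amplevel_r","amplevel_l") is a character vector of column names, not the columns themselves, so the comparison with 2 is most likely a string comparison that comes out TRUE for both names, hence the 1s. Below is a minimal sketch of a single case_when() call that refers to the columns directly. The toy data is hypothetical, and the rule (1 if either side is 2 or more, otherwise 0, with NA counted as "not 2 or more") is an assumption based on the two conditions coded above, since the question does not settle the mixed case.

```r
library(dplyr)

# Hypothetical toy data using the column names from the question
data_1 <- data.frame(
  amplevel_r = c(0, 2, NA, 1, NA, 5),
  amplevel_l = c(1, NA, NA, 3, 0, 0)
)

data_2 <- data_1 %>%
  mutate(majoramp_indX = case_when(
    # either side at level 2 or above (assumed rule)
    amplevel_r >= 2 | amplevel_l >= 2 ~ 1,
    # everything else, including rows where the test is NA
    TRUE ~ 0
  ))

data_2
```

case_when() treats a test that evaluates to NA as not matched, so those rows fall through to the final TRUE ~ 0 branch. If rows with both values missing should stay NA instead, an extra branch for that case would need to go first.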

Related

In R, sample n rows from a df in which a certain column has non-NA values (sample conditionally)

Background
Here's a toy df:
df <- data.frame(ID = c("a","b","c","d","e","f"),
gender = c("f","f","m","f","m","m"),
zip = c(48601,NA,29910,54220,NA,44663),stringsAsFactors=FALSE)
As you can see, I've got a couple of NA values in the zip column.
Problem
I'm trying to randomly sample 2 entire rows from df -- but I want them to be rows for which zip is not null.
What I've tried
This code gets me a basic (i.e. non-conditional) random sample:
df2 <- df[sample(nrow(df), 2), ]
But of course, that only gets me halfway to my goal -- a bunch of the time it's going to return a row with an NA value in zip. This code attempts to add the condition:
df2 <- df[sample(nrow(df$zip != NA), 2), ]
I think I'm close, but this yields an error invalid first argument.
Any ideas?

We can use is.na
tmp <- df[!is.na(df$zip),]
tmp[sample(nrow(tmp), 2),]
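To see why the attempt in the question fails while is.na works, here is a small sketch on a hypothetical three-element vector: any comparison with NA yields NA, and nrow() of a plain vector is NULL, which is what sample() complains about.

```r
# Hypothetical short vector standing in for df$zip
zip <- c(48601, NA, 29910)

cmp <- zip != NA       # comparing with NA gives NA for every element
dims <- nrow(cmp)      # a plain vector has no rows, so this is NULL

keep <- !is.na(zip)    # is.na() is the reliable test for missing values
```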

We can use rownames + na.omit to sample the rows
> df[sample(rownames(na.omit(df["zip"])), 2),]
ID gender zip
3 c m 29910
4 d f 54220

Here is a base R solution with complete.cases()
# logical vector: TRUE for rows with no NA in any column
x <- complete.cases(df)
# keep only the complete rows
df_no_na <- df[x,]
# do the sample
df_no_na[sample(nrow(df_no_na), 2),]
Output:
ID gender zip
3 c m 29910
6 f m 44663

For the tidyverse lovers out there...
library("dplyr")
df %>%
tidyr::drop_na() %>%
dplyr::slice_sample(n = 2)
If it is only NA values in the zip column that you care about, then:
df %>%
tidyr::drop_na(zip) %>%
dplyr::slice_sample(n = 2)

The important thing here is to avoid creating an unnecessary second data frame with the NA values dropped. You could use the solution using na.omit given in another answer, but alternatively you can use which to return a vector of valid row indices to sample from. For example:
nsamp <- 23
df[sample(which(!is.na(df$zip)), nsamp), ]
The advantage to doing it this way is that the condition inside the which can be anything you like, whether or not it involves missing values. For example this version will sample from all the rows with female gender in zip codes starting with 336:
df[sample(which(df$gender=='f' & grepl('^336', df$zip)), nsamp), ]
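One caveat with sample(which(...), n): when which() returns a single index k, sample() treats it as "sample from 1:k" rather than "sample from the one value k". A minimal sketch that sidesteps this on the toy df from the question is below; sample_rows is a hypothetical helper name, not part of any package.

```r
df <- data.frame(ID = c("a", "b", "c", "d", "e", "f"),
                 gender = c("f", "f", "m", "f", "m", "m"),
                 zip = c(48601, NA, 29910, 54220, NA, 44663),
                 stringsAsFactors = FALSE)

# Hypothetical helper: pick n of the given row indices by position,
# so a single candidate index is never expanded to 1:k
sample_rows <- function(idx, n) {
  stopifnot(length(idx) >= n)
  idx[sample.int(length(idx), n)]
}

set.seed(1)
valid <- which(!is.na(df$zip))   # rows 1, 3, 4, 6
df2 <- df[sample_rows(valid, 2), ]
df2
```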

Loop only running for the last iteration in R - Looping over participants

I am very new to R and I am trying to run a loop, so any help is greatly appreciated.
I have longitudinal data with multiple timepoints for each participant, which looks like the attached picture.
I need to replace the NA values with the values from when the Years variable is equal to 0, and I want to write a loop to do this for each participant. I have written some code which seems to work, however it only gives output for the last iteration of the loop (the last participant). This is the code I am using:
x <- c(1:4)
n = length(x)
for(i in 1:n)
{
data <- subset(df, ID %in% c(x[i]))
data$outcome <- ifelse(is.na(data$outcome),
data[1,3],
data$outcome)
}
Using this code, the output gives only the last iteration (i.e. in this case, ID 4). I need to complete this for all IDs.
Any help is much appreciated! Thank you.

I'm not 100% clear on your intent, but this will, within an ID, fill all missing outcome values with the (first) outcome value from a row where Years == 0.
library(dplyr)
df %>%
group_by(ID) %>%
mutate(outcome = coalesce(outcome, first(outcome[Years == 0])))
Obviously untested, but if you provide some sample data I'll happily help debug.
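For anyone who wants to try the grouped coalesce() idea, here is a small sketch on made-up data. The column names ID, Years and outcome are assumed from the question's code and description, since the original data was only shown as a picture.

```r
library(dplyr)

# Hypothetical data: two participants, three timepoints each
df <- data.frame(
  ID      = c(1, 1, 1, 2, 2, 2),
  Years   = c(0, 1.5, 3, 0, 1.5, 3),
  outcome = c(10, NA, 12, 20, NA, NA)
)

filled <- df %>%
  group_by(ID) %>%
  # replace each missing outcome with the participant's value at Years == 0
  mutate(outcome = coalesce(outcome, first(outcome[Years == 0]))) %>%
  ungroup()

filled
```

Existing non-missing values (such as the 12 for ID 1) are left alone; only the NA entries take the baseline value.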

Your loop replaces data$outcome each iteration. That is why you only get the last result.
Here's my inelegant solution:
Making sample data to match yours (not including unused column)
my_dat <- data.frame("years" = sample(c(0, 1.5, 3), 30, replace = T),
"outcome" = as.numeric(sample(c("", 1, 2), 30, replace = T)))
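Find which rows have both years equal to 0 and a missing outcome (note the &: with * the multiplication would be evaluated before ==, so the NA test would be ignored)
my_index <- my_dat$years == 0 & is.na(my_dat$outcome)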
Assign 0 to replace NA:
my_dat$outcome[my_index] <- 0
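If you would rather keep the loop from the question, the overwriting can be avoided by storing each participant's result in a list and binding the pieces together at the end. This is only a sketch on hypothetical data with the column names assumed from the question (ID, Years, outcome), and it takes "the value when Years is 0" to mean the first outcome recorded at Years == 0 for that participant.

```r
# Hypothetical data: four participants, three timepoints each
df <- data.frame(
  ID      = rep(1:4, each = 3),
  Years   = rep(c(0, 1.5, 3), times = 4),
  outcome = c(5, NA, 7, 8, NA, NA, 3, 4, NA, 9, NA, 2)
)

ids <- unique(df$ID)
results <- vector("list", length(ids))   # one slot per participant

for (i in seq_along(ids)) {
  part <- df[df$ID == ids[i], ]
  # baseline = outcome at Years == 0 for this participant
  baseline <- part$outcome[part$Years == 0][1]
  part$outcome <- ifelse(is.na(part$outcome), baseline, part$outcome)
  results[[i]] <- part                    # keep this iteration's result
}

filled <- do.call(rbind, results)
filled
```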

A simpler tidyverse method:
library(tidyverse)
df %>%
filter(ID %in% x) %>%
mutate(outcome = ifelse(is.na(outcome), Years, outcome))

Your question could do with some clarification and a reproducible example. As I understand it from "I need to replace the NA values with the values from when the Years variable is equal to 0": if outcome equals NA and Years equals 0, you want outcome to equal 0?
set.seed(1984) # set the seed so that my_dat is the same each time
# using a modified df from markhogue answer...
my_dat <- data.frame(
ID = 1:30,
years = sample(c(0, 1.5, 3), 30, replace = T),
outcome = as.numeric(sample(c("", 1, 2), 30, replace = T))
)
my_dat # have a look at rows 9 and 22
# ifelse with two conditions: years == 0 and is.na(outcome)
my_dat$outcome <- ifelse(my_dat$years == 0 & is.na(my_dat$outcome), my_dat$years, my_dat$outcome)
my_dat # have a look at rows 9 and 22
Let me know if this is what you need :)

replacing cell values in dataframe for specific variables

I have thousands of rows in each column. I need to find specific values in column A based on the value of column B, and replace column A with a new value if it is greater than a specific value.
For example, when column B = 1, I want every value in column A that is greater than 2 to be replaced with 2.
I've tried this code:
if(dt$B=='1'){
dt <- dt %>% mutate(A = ifelse(A > 2, 2, A))
}
But this does not work. I've tried some other methods as well, but nothing I do works. Please let me know if you can help with this! Thank you.

We can combine both conditions with & in the test argument of ifelse
library(dplyr)
dt <- dt %>%
mutate(A = ifelse(A > 2 & B == 1, 2, A))
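The if() in the question does not work here because if() expects a single TRUE or FALSE, while dt$B == '1' is a whole vector. For comparison, a base R sketch of the same replacement using logical indexing, on hypothetical data:

```r
# Hypothetical data frame with the two columns from the question
dt <- data.frame(A = c(1, 3, 5, 4, 2),
                 B = c(1, 1, 2, 1, 2))

# cap A at 2, but only in rows where B is 1
dt$A[dt$B == 1 & dt$A > 2] <- 2

dt
```

Only rows 2 and 4 change; the 5 in row 3 is kept because B is 2 there.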

R function or loop that could go through a binary variable (1 and 0) in a dataframe and returns a third variable (y) value from a different column

I need some help. I am trying to build a function or a loop in R that goes through a binary variable (1 and 0) in a dataframe so that, every time a 1 is followed by a 0, I save the value of a third variable (y) from the row where the 0 occurs into a vector. I tried a couple of options based on previous posts, but nothing gives me anything close to that.
My data looks a bit like that:
ID <- rep(1001, 5)
variable <- c(1, 1, 0, 1, 0)
y <- c(10, 20, 30, 40, 50)
df <- cbind(ID, variable, y)
In this case, for example, the answer would give me a vector with the y values 30 and 50. Sorry if someone already has answered that, I could not find something similar. Thanks a lot!

Here's a 'vectorial' solution. Basically, I paste together variable in position i and i+1. Then I check to see if the combination is "10". The position you want is actually the next one (e.g. i+1), so we add 1.
df <- data.frame(ID, variable, y)
idx <- which(paste0(df$variable[-nrow(df)], df$variable[-1]) == "10") + 1
df$y[idx]
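The same idea can be written without pasting strings, by comparing each value with the one before it numerically. A sketch using the data from the question (prev and curr are just illustrative names):

```r
ID <- rep(1001, 5)
variable <- c(1, 1, 0, 1, 0)
y <- c(10, 20, 30, 40, 50)
df <- data.frame(ID, variable, y)

prev <- head(df$variable, -1)   # values at positions 1 .. n-1
curr <- tail(df$variable, -1)   # values at positions 2 .. n

# position i where a 1 is followed by a 0; +1 moves to the row holding the 0
idx <- which(prev == 1 & curr == 0) + 1
result <- df$y[idx]
result
```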

Here is an approach with tidyverse:
library(tidyverse)
df %>%
as_tibble() %>%
mutate(y1 = ifelse(lag(variable) == 1 & variable == 0, y, NA)) %>%
pull(y1)
#output
[1] NA NA 30 NA 50
and in base R:
ifelse(c(NA, df[-nrow(df),2]) == 1 & df[, 2] == 0, df[, 3], NA)
Both versions read: if the previous value of variable is 1 and the current value is 0, return y; otherwise return NA.
If you would like to remove the NA values, wrap the result in na.omit().

R Drop rows according to various criteria

I need some help with how to start implementing a problem in R. I have a data frame with rows which are grouped by the variable 'id'. For each 'id' I want to keep only one row. However, I have a number of criteria which specify which rows to drop.
These are some of my criteria:
I want to keep one random row within each group 'id' which has 'text' != NA (there might be several such rows); and I also want to keep all columns of this row, this is also the case for all following criteria.
If all rows in a group have 'text' == NA, then I want to keep one random row which has the variable 'check' == T (there might be several such rows)
If all rows in a group have 'text' == NA and 'check' == F, then I want to keep the row which has the variable 'newtext' which meets the condition !(grepl("None",df$newtext))
I can also provide a dataset if this makes it more clear. However, my most important issue is that I do not know how to implement this logic of dropping rows according to an ordered number of criteria.
It would be nice, if anyone can tell me how to implement such a code.
Thank you!
This would be an example dataset:
df <- data.frame(id = c(1,1,1,2,2,2,3,3,3),
text=c("asd",NA,"asd",NA,NA,NA,NA,NA,NA),
check = c(T,F,T,T,T,F,F,F,F),
newtext =
c("as","as","as","das","das","None","qwe","qwe2","None"),
othervars = c(1,2,3,45,5,6,6,7,1))
As an output, I want to keep the following rows:
row 1 or 3
row 4 or 5
row 7 or 8
The column othervars should be kept as well as I need this information later on.
Hope this makes it a bit clearer.

Alright, I've got something. I'm using filter() from dplyr to subset with unknown NA, because I ran into problems using either subset() or common df[,] subsetting from base R.
Data:
df <- data.frame(id = c(1,1,1,2,2,2,3,3,3),
text=c("asd",NA,"asd",NA,NA,NA,NA,NA,NA),
check = c(T,F,T,T,T,F,F,F,F),
newtext =
c("as","as","as","das","das","None","qwe","qwe2","None"),
othervars = c(1,2,3,45,5,6,6,7,1))
Initiating new empty dataframe:
df2 <- df[0,]
Loop to sample one row per id:
library(dplyr)
for(i in unique(df$id)){
temp <- filter(df, id == i)
if(nrow(filter(temp, !is.na(text))) > 0){
temp <- filter(temp, !is.na(text))
df2[i, ] <- temp[sample(nrow(temp), size = 1), ]
}else if(nrow(filter(temp, check)) > 0){
temp <- filter(temp, check)
df2[i, ] <- temp[sample(nrow(temp), size = 1), ]
}else{
temp <- filter(temp, !(grepl("None",temp$newtext)))
df2[i, ] <- temp[sample(nrow(temp), size = 1), ]
}
}
Output example:
> df2
id text check newtext othervars
2 1 asd TRUE as 1
1 2 <NA> TRUE das 45
3 3 <NA> FALSE qwe 6
Greetings.
Edit: Ignore the row numbers on the left, they are residuals from the different subsets within the loop.
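An alternative sketch of the same ordered-criteria logic without the leftover row numbers: split the data by id, apply the criteria in order inside a small function, and bind the results. pick_one is a hypothetical helper name; the sketch assumes every group has at least one row satisfying one of the three criteria, as in the example data.

```r
df <- data.frame(id = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
                 text = c("asd", NA, "asd", NA, NA, NA, NA, NA, NA),
                 check = c(T, F, T, T, T, F, F, F, F),
                 newtext = c("as", "as", "as", "das", "das", "None",
                             "qwe", "qwe2", "None"),
                 othervars = c(1, 2, 3, 45, 5, 6, 6, 7, 1))

# Hypothetical helper: apply the criteria in priority order to one group
pick_one <- function(g) {
  cand <- g[!is.na(g$text), ]                                  # criterion 1
  if (nrow(cand) == 0) cand <- g[g$check, ]                    # criterion 2
  if (nrow(cand) == 0) cand <- g[!grepl("None", g$newtext), ]  # criterion 3
  # errors if no row matches any criterion (assumed not to happen)
  cand[sample.int(nrow(cand), 1), ]
}

set.seed(42)
df2 <- do.call(rbind, lapply(split(df, df$id), pick_one))
rownames(df2) <- NULL   # drop row names carried over from the subsets
df2
```

Each later criterion is only consulted when the earlier ones leave no candidates, which is what gives the criteria their order.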
