I have 2 scenarios:
One where I would like to define a new variable (called df$x1) depending on whether all 16 of a set of 16 other columns are NA in a given row. My proposed code is:
cols <- 1:16
df %>% mutate(x1 = ifelse(rowSums(df[cols] == NA, na.rm = TRUE) == 16, 'Yes', 'No'))
In the second scenario, I would like to check whether there is at least one NA in a list of 12 variables.
How would you do that?
Thank you!
Continuing with your first approach, except that NAs are checked with is.na():
cols <- 1:12
df$x1 <- ifelse(rowSums(is.na(df[cols])) > 0, 'Yes', 'No')
1st scenario (with cols <- 1:16): df$x1 <- ifelse(rowSums(is.na(df[, cols])) == 16, "Yes", "No")
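For a quick sanity check, here is a minimal sketch on a hypothetical toy data frame (the data and column layout are invented for illustration, not taken from the question):
# hypothetical toy data: 3 rows, 16 numeric columns, row 2 entirely NA
df <- as.data.frame(matrix(rnorm(3 * 16), nrow = 3))
df[2, ] <- NA
cols <- 1:16
# scenario 1: flag rows where all 16 columns are NA
df$x1 <- ifelse(rowSums(is.na(df[cols])) == 16, "Yes", "No")
# scenario 2: flag rows with at least one NA among the first 12 columns
df$x2 <- ifelse(rowSums(is.na(df[1:12])) > 0, "Yes", "No")
df[, c("x1", "x2")]
#    x1  x2
# 1  No  No
# 2 Yes Yes
# 3  No  No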
Background
Here's a toy df:
df <- data.frame(ID = c("a","b","c","d","e","f"),
gender = c("f","f","m","f","m","m"),
zip = c(48601,NA,29910,54220,NA,44663),stringsAsFactors=FALSE)
As you can see, I've got a couple of NA values in the zip column.
Problem
I'm trying to randomly sample 2 entire rows from df -- but I want them to be rows for which zip is not null.
What I've tried
This code gets me a basic (i.e. non-conditional) random sample:
df2 <- df[sample(nrow(df), 2), ]
But of course, that only gets me halfway to my goal -- a bunch of the time it's going to return a row with an NA value in zip. This code attempts to add the condition:
df2 <- df[sample(nrow(df$zip != NA), 2), ]
I think I'm close, but this yields the error "invalid first argument".
Any ideas?
We can use is.na
tmp <- df[!is.na(df$zip),]
> tmp[sample(nrow(tmp), 2),]
We can use rownames + na.omit to sample the rows
> df[sample(rownames(na.omit(df["zip"])), 2),]
ID gender zip
3 c m 29910
4 d f 54220
Here is a base R solution with complete.cases()
# define a logical vector to identify NA
x <- complete.cases(df)
# subset only not NA values
df_no_na <- df[x,]
# do the sample
df_no_na[sample(nrow(df_no_na), 2),]
Output:
ID gender zip
3 c m 29910
6 f m 44663
For the tidyverse lovers out there...
library("dplyr")
df %>%
tidyr::drop_na() %>%
dplyr::slice_sample(n = 2)
If it is only NAs in the zip column that you care about, then:
df %>%
tidyr::drop_na(zip) %>%
dplyr::slice_sample(n = 2)
The important thing here is to avoid creating an unnecessary second data frame with the NA values dropped. You could use the na.omit solution given in another answer, but alternatively you can use which() to return the indices of the valid rows to sample from. For example:
nsamp <- 23
df[sample(which(!is.na(df$zip)), nsamp), ]
The advantage of doing it this way is that the condition inside the which() can be anything you like, whether or not it involves missing values. For example, this version will sample from all the rows with female gender and a zip code starting with 336:
df[sample(which(df$gender=='f' & grepl('^336', df$zip)), nsamp), ]
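Note that with only six rows in the toy df, sample() will error here, because nsamp exceeds the number of rows matching the condition and sampling without replacement cannot exceed the population size. A small defensive sketch, capping the sample size, might look like this:
idx <- which(!is.na(df$zip))
# take at most as many rows as actually match the condition
df[sample(idx, min(nsamp, length(idx))), ]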
I am very new to R and I am trying to run a loop, so any help is greatly appreciated.
I have longitudinal data with multiple timepoints for each participant, which looks like the attached picture.
I need to replace the NA values with the values from when the Years variable is equal to 0, and I want to write a loop to do this for each participant. I have written some code which seems to work; however, it only gives output for the last iteration of the loop (the last participant). This is the code I am using:
x <- c(1:4)
n = length(x)
for(i in 1:n)
{
data <- subset(df, ID %in% c(x[i]))
data$outcome <- ifelse(is.na(data$outcome),
data[1,3],
data$outcome)
}
Using this code, the output gives only the last iteration (i.e. in this case, ID 4). I need to complete this for all IDs.
Any help is much appreciated! Thank you.
I'm not 100% clear on your intent, but this will, within each ID, fill all missing outcome values with the (first) outcome value from a row where Years == 0.
library(dplyr)
df %>%
group_by(ID) %>%
mutate(outcome = coalesce(outcome, first(outcome[Years == 0])))
Obviously untested, but if you provide some sample data I'll happily help debug.
Your loop overwrites data (and hence data$outcome) on every iteration, which is why you only see the result for the last participant.
Here's my inelegant solution:
Making sample data to match yours (not including unused column)
my_dat <- data.frame("years" = sample(c(0, 1.5, 3), 30, replace = T),
"outcome" = as.numeric(sample(c("", 1, 2), 30, replace = T)))
Find which rows both have 0 for years and a missing outcome:
my_index <- (my_dat$years == 0) & is.na(my_dat$outcome)
Assign 0 to replace NA:
my_dat$outcome[my_index] <- 0
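If you would rather keep the loop structure from the question and just stop it from discarding each participant's result, one sketch (assuming the columns ID, Years, and outcome described in the question, with df and x as defined there) is to collect each subset in a list and bind the pieces at the end:
results <- vector("list", length(x))
for (i in seq_along(x)) {
  data <- subset(df, ID %in% x[i])
  # baseline value for this participant: the outcome where Years == 0
  baseline <- data$outcome[data$Years == 0][1]
  data$outcome <- ifelse(is.na(data$outcome), baseline, data$outcome)
  results[[i]] <- data
}
df_filled <- do.call(rbind, results)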
A simpler tidyverse method:
library(tidyverse)
df %>%
filter(ID %in% x) %>%
mutate(outcome = ifelse(is.na(outcome), Years, outcome))
Your question could do with some clarification and a reproducible example. As I understand it from "I need to replace the NA values with the values from when the Years variable is equal to 0": if outcome equals NA and Years equals 0, you want outcome to equal 0?
set.seed(1984) # set the seed so that my_dat is the same each time
# using a modified df from markhogue's answer...
my_dat <- data.frame(
ID = 1:30,
years = sample(c(0, 1.5, 3), 30, replace = T),
outcome = as.numeric(sample(c("", 1, 2), 30, replace = T))
)
my_dat # have a look at rows 9 and 22
# ifelse with two conditions: years == 0 and is.na(outcome)
my_dat$outcome <- ifelse(my_dat$years == 0 & is.na(my_dat$outcome), my_dat$years, my_dat$outcome)
my_dat # have a look at rows 9 and 22
Let me know if this is what you need :)
I do need some help. I am trying to build a function or a loop in R that goes through a binary variable (1 and 0) in a data frame so that, every time a 1 is followed by a 0, I save the value of a third variable (y) from the row where that 0 occurred. I tried a couple of options based on previous posts, but nothing gives me anything even close to that.
My data looks a bit like that:
ID <- rep(1001, 5)
variable <- c(1, 1, 0, 1, 0)
y <- c(10, 20, 30, 40, 50)
df <- cbind(ID, variable, y)
In this case, for example, the answer would give me a vector with the y values 30 and 50. Sorry if someone has already answered this; I could not find anything similar. Thanks a lot!
Here's a 'vectorial' solution. Basically, I paste together variable at position i and i+1. Then I check whether the combination is "10". The position you want is actually the next one (i.e. i+1), so we add 1.
df <- data.frame(ID, variable, y)
idx <- which(paste0(df$variable[-nrow(df)], df$variable[-1]) == "10") + 1
df$y[idx]
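An equivalent sketch using diff(), since a 1 followed by a 0 shows up as a difference of -1 (this is an alternative, not part of the original answer):
# positions where variable drops from 1 to 0; +1 moves to the row holding the 0
df$y[which(diff(df$variable) == -1) + 1]
# [1] 30 50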
Here is an approach with tidyverse:
library(tidyverse)
df %>%
as_tibble() %>%
mutate(y1 = ifelse(lag(variable) == 1 & variable == 0, y, NA)) %>%
pull(y1)
#output
[1] NA NA 30 NA 50
and in base R:
ifelse(c(NA, df[-nrow(df),2]) == 1 & df[, 2] == 0, df[, 3], NA)
If the lag of variable is 1 and variable is 0, then return y; else return NA.
If you would like to remove the NAs, wrap it in na.omit().
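For example, wrapping the base R expression above (as.vector() is only used here to drop the attributes that na.omit() attaches):
res <- na.omit(ifelse(c(NA, df[-nrow(df), 2]) == 1 & df[, 2] == 0, df[, 3], NA))
as.vector(res)
# [1] 30 50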
I am currently trying to clean a data frame for further machine learning analysis. I want to replace all instances of -1 with null. I know how to do this by column, but how do I do it across a lot of columns?
Let's assume a data frame containing 10 columns filled with 1 and -1:
DF <- data.frame(matrix(sample(c(1,-1), 1000, replace = TRUE), ncol = 10))
Then you simply replace the -1 values with NA:
DF[DF==-1] <- NA
This should work if your data is in a data frame.
df[df == -1] <- NA
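A tidyverse alternative, in case the rest of your cleaning pipeline is in dplyr (a sketch, assuming dplyr 1.0 or later for across(); not from the original answers):
library(dplyr)
DF_clean <- DF %>%
  mutate(across(everything(), ~ na_if(.x, -1)))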
The answer is similar to the ones posted above; I thought of a small tweak.
I think you mean replace -1 with NAs, since missing values are stored as NAs in R.
Depending on whether -1 is stored as a factor/character or as a numeric variable, you could try:
dfx = data.frame(x = c(0,1,2,-1), y = c("a", "b", "c","-1") )
dfx[dfx == -1 | dfx == "-1"] <- NA
I have the following simulated data.frame:
(please note that I have re-written large portions of the question, reflecting akrun's answer to my initial question)
set.seed(22)
df <- data.frame(f1 = rep("a", 20),
                 f2 = factor(sample(c("yes", "no", "maybe", "maybenot"), 20, replace = T)),
                 f3 = factor(sample(c("yes", "no"), 20, replace = T)),
                 f4 = factor(sample(c("yes", "no"), 20, replace = T)))
f1 f2 f3 f4
1 a maybe yes yes
2 a no yes yes
3 a yes no no
4 a maybe yes no
5 a maybe no yes
6 a maybenot no yes
...
I would like to exclude all rows that do not show a yes in df$f2, or that show a no in either df$f3 or df$f4. If I manually transformed the values into 0s and 1s (0 for everything except yes in df$f2), I could use rowSums as suggested by akrun. My current solution is to introduce a dummy column called df$exclude as follows and then subset on df$exclude:
df$exclude <- "no"
df[df$f2 != "yes" | df$f3 == "no" | df$f4 == "no",]$exclude <- "yes"
df <- subset(df, exclude == "no")
Can't this be accomplished more concisely, e.g. without a prior transformation of the columns f2, f3, and f4, or by using lapply (somehow combined with subset, and possibly an anonymous function)?
Thanks in advance for your answers.
If we need to exclude rows that have 0 values for 'f2', 'f3' and 'f4', just do rowSums to create a logical vector and subset the dataset:
subset(df, rowSums(df[2:4]!=0) != 0)
Update
Based on the update in the OP's post
df[!rowSums(df[2:4] != "yes"),]
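A dplyr equivalent of that last line, keeping only rows where f2, f3, and f4 are all "yes" (a sketch, assuming a recent dplyr version with if_all(); not part of akrun's answer):
library(dplyr)
df %>% filter(if_all(f2:f4, ~ .x == "yes"))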