Replacing NA in for/if loop in R

I'm running into an unexpected challenge in R. In my dataset, there are NAs in certain columns. Some of these NAs SHOULD be present (the values are truly missing), while others should be replaced with 0s. I used code like the following:
df1 <- data.frame(x = c(1, 2, 3, 4, 5), y = c(10, 10, NA, NA, 12), z = c(9, 9, 9, 9, 9))
for (i in nrow(df1)){
if(df1$x[i] > 3){
df1$y[i] = 0
df1$z[i] = 0
}
}
And obtained this output
x y z
1 1 10 9
2 2 10 9
3 3 NA 9
4 4 NA 9
5 5 0 0
The NA SHOULD be preserved in row 3, but the NA in row 4 should have been replaced with 0. Further, the z value in row 4 did not update. Any ideas as to what is happening?

You've used for (i in nrow(df1)), which evaluates to for (i in 5), so the loop body runs only once, for row 5. I'm guessing you meant to use for (i in 1:nrow(df1)), which would evaluate to for (i in 1:5) and include all rows.
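For reference, a minimal sketch of the corrected loop, assuming the intent was to visit every row. It uses seq_len(nrow(df1)) rather than 1:nrow(df1), since seq_len() also behaves sensibly on a data frame with zero rows.

```r
df1 <- data.frame(x = c(1, 2, 3, 4, 5),
                  y = c(10, 10, NA, NA, 12),
                  z = c(9, 9, 9, 9, 9))

# seq_len(nrow(df1)) yields 1, 2, ..., 5, so every row is visited
for (i in seq_len(nrow(df1))) {
  if (df1$x[i] > 3) {
    df1$y[i] <- 0
    df1$z[i] <- 0
  }
}
df1
```

Row 3 keeps its NA because x is 3 there, which is not greater than 3.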

Don't do it this way. R isn't Python; you get vectorized operations out of the box:
df1[df1$x > 3, c('y', 'z')] <- 0
df1
# x y z
# 1 1 10 9
# 2 2 10 9
# 3 3 NA 9
# 4 4 0 0
# 5 5 0 0

Related

How to replace NA with zero in a column if the neighboring columns have values, using R

I want a way to replace the NA in a column when the neighboring columns have a value. For example, if a worker has values in the other columns, he went to work that day, so an NA for him should be replaced with zero; if the surrounding columns have no values, he didn't go to work that day and the NA is correct.
I have been doing this by sorting the other columns, but it is very time-consuming.
Here is a sample of my data, called df; the real one has 30 columns and about 30,000 rows:
df <- data.frame(
hours = c(NA, 3, NA, 8),
interactions = c(NA, 3, 9, 9),
sales = c(1, 1, 1, NA)
)
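df$hours2 <- ifelse(
# rowSums() checks each row separately; any() would collapse the whole table to a single TRUE/FALSE
test = is.na(df$hours) & rowSums(!is.na(df[, c("interactions", "sales")])) > 0,
yes = 0,
no = df$hours)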
df
hours interactions sales hours2
1 NA NA 1 0
2 3 3 1 3
3 NA 9 1 0
4 8 9 NA 8
You could also do as follows:
library(dplyr)
# Replace hours only when it is NA and at least one other column has a value
mutate(df, X = if_else(is.na(hours) & (!is.na(interactions) | !is.na(sales)), 0, hours))
# hours interactions sales X
# 1 NA NA 1 0
# 2 3 3 1 3
# 3 NA 9 1 0
# 4 8 9 NA 8
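With around 30 columns, naming each neighbor by hand gets tedious. Here is a base R sketch, assuming every column other than hours counts as evidence that the worker was present. The names other_cols and has_activity are made up for illustration, and a fifth, all-NA row has been added to the sample to show an NA being kept.

```r
df <- data.frame(
  hours = c(NA, 3, NA, 8, NA),
  interactions = c(NA, 3, 9, 9, NA),
  sales = c(1, 1, 1, NA, NA)
)

# Hypothetical helper name: all columns except the one being repaired
other_cols <- setdiff(names(df), "hours")

# TRUE for rows where at least one of the other columns holds a value
has_activity <- rowSums(!is.na(df[other_cols])) > 0

# Start from the original column and zero out only the qualifying NAs
df$hours2 <- df$hours
df$hours2[is.na(df$hours) & has_activity] <- 0
df
```

In the last row nothing was recorded at all, so hours2 stays NA there.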

R - Find the sum for the lagging record and add another column to current value iteratively

So my dataframe is structured like so:
x s
NA 0
13 0
-3 0
2 0
-4 0
For each row, I would like to take lag(s), add it to column x, and store the result as the value of s.
My output data would therefore look like:
x s
NA 0
13 13
-3 10
2 12
-4 8
I tried the following function, but after fiddling I was only able to get all NA's or all 0's:
mydata$s = lag(mydata$s)+mydata$x
Note - if it helps, I can remove the first row.
You can use cumsum() to perform the job, and also replace NA with 0 during the calculation (without changing your original dataset).
library(tidyverse)
df %>% mutate(s = cumsum(ifelse(is.na(x), 0, x)))
x s
1 NA 0
2 13 13
3 -3 10
4 2 12
5 -4 8
It works for me.
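Set up (lag() here is dplyr::lag; base R's stats::lag does not shift the values of a plain vector):
library(dplyr)
mydata <- data.frame(x = c(NA, 13, -3, 2, -4), s = c(0, 13, 10, 12, 8) )
mydata$s <- lag(mydata$s)+mydata$x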
Gives:
mydata
x s
1 NA NA
2 13 13
3 -3 10
4 2 12
5 -4 8
The difference is my first s is NA. That should be expected as the first x is NA.
Base R solution:
mydata$s <- c(mydata$x[1], cumsum(mydata$x[-1]))
Data:
mydata <- data.frame(x = c(NA, 13, -3, 2, -4))
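If the first value of s should be 0 rather than NA, as in the desired output, one base R option is to zero out the NAs only for the purpose of summing. This is a sketch that assumes s is simply the running total of x:

```r
mydata <- data.frame(x = c(NA, 13, -3, 2, -4))

# Treat NA as 0 inside the running total; x itself is left untouched
mydata$s <- cumsum(replace(mydata$x, is.na(mydata$x), 0))
mydata
```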

How to replace missing points in a data set?

I want to write a function in R that takes any data set containing missing points (NA) and replaces them using the mean. What I have in mind is a function like this:
x <- function(data, type = "mean", lag = 2)
It should compute the mean of the two values after and the two values before the missing point (because lag is 2). For example, if the missing point is at position 12, the function should compute the mean of the values at positions 10, 11, 13, and 14 and substitute the result at position 12. In special cases, for example when the missing point is in the last position and there are no two values after it, the function should compute the mean of all the data in the corresponding column and substitute that for the missing point. Here is an example to make it clear. Consider the following data set:
3 7 8 0 8 12 2
5 8 9 2 8 9 1
1 2 4 5 0 6 7
5 6 0 NA 3 9 10
7 2 3 6 11 14 2
4 8 7 4 5 3 NA
In the above data set, the first NA should be replaced with the mean of 2 and 5 (the two values before) and 6 and 4 (the two values after), which is (2+5+6+4)/4 = 17/4. The last NA should be replaced with the mean of the last column, which is (2+1+7+10+2)/5 = 22/5.
My question is: how can I add code (if, if-else, or loops) to the above function so that it satisfies these requirements? I should highlight that I want to use the apply family of functions.
First we can define a function that smooths a single vector:
library(dplyr)
smooth = function(vec, n=2){
# Lead and lag the vector by 1..n positions (n = 2 by default) in both directions
purrr::map(1:n, function(i){
cbind(
lead(vec, i),
lag(vec, i)
)
}) %>%
# Bind the matrix together
do.call(cbind, .) %>%
# Take the mean of each row, ie the smoothed version at each position
# If there are NAs in the mean, it will itself be NA
rowMeans() %>%
# In order, take a) original values b) locally smoothed values
# c) globally smoothed values (ie the entire mean ignoring NAs)
coalesce(vec, ., mean(vec, na.rm=TRUE))
}
> smooth(c(0, 2, 5, NA, 6, 4))
[1] 0.00 2.00 5.00 4.25 6.00 4.00
> smooth(c(2, 1, 7, 10, 2, NA))
[1] 2.0 1.0 7.0 10.0 2.0 4.4
Then we can apply it to each column:
> c(3, 7, 8, 0, 8, 12, 2, 5, 8, 9, 2, 8, 9, 1, 1, 2, 4, 5, 0, 6, 7, 5, 6, 0, NA, 3, 9, 10, 7, 2, 3, 6, 11, 14, 2, 4, 8, 7, 4, 5, 3, NA) %>%
matrix(byrow=TRUE, ncol=7) %>%
as_tibble(.name_repair="universal") %>%
mutate(across(everything(), smooth))
# A tibble: 6 × 7
...1 ...2 ...3 ...4 ...5 ...6 ...7
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 3 7 8 0 8 12 2
2 5 8 9 2 8 9 1
3 1 2 4 5 0 6 7
4 5 6 0 4.25 3 9 10
5 7 2 3 6 11 14 2
6 4 8 7 4 5 3 4.4
Please find below one solution using the data.table library.
Reprex
Your data:
m1 <- "3 7 8 0 8 12 2
5 8 9 2 8 9 1
1 2 4 5 0 6 7
5 6 0 NA 3 9 10
7 2 3 6 11 14 2
4 8 7 4 5 3 NA"
myData <- read.table(text = m1, header = FALSE)
Code for the function replaceNA
library(data.table)
replaceNA <- function(data){
setDT(data)
# Create a data.table identifying rows and cols indexes of NA values in the data.table
NA_DT <- as.data.table(which(is.na(data), arr.ind=TRUE))
# Select row and column indexes of NAs that are not at the last row in the data.table
NA_not_Last <- NA_DT[row < nrow(data)]
# Select row and column indexes of NA that is at the last row in the data.table
NA_Last <- NA_DT[row == nrow(data)]
# Create a vector of column names where NA values are not at the last row in the data.table
Cols_NA_not_Last <- colnames(data)[NA_not_Last[,col]]
# Create a vector of column names where NA values are at the last row in the data.table
Cols_NA_Last <- colnames(data)[NA_Last[,col]]
# Replace NA values that are not at the last row in the data.table by the mean of the values located
# in the two previous lines and the two following lines of the line containing the NA value
data[, (Cols_NA_not_Last) := lapply(.SD, function(x) replace(x, which(is.na(x)), mean(c(x[which(is.na(x))-2], x[which(is.na(x))-1], x[which(is.na(x))+1], x[which(is.na(x))+2]), na.rm = TRUE))), .SDcols = Cols_NA_not_Last][]
# Replace NA values that are at the last row in the data.table by the mean of all the values in the column where the NA value is found
data[, (Cols_NA_Last) := lapply(.SD, function(x) replace(x, which(is.na(x)), mean(x, na.rm = TRUE))), .SDcols = Cols_NA_Last][]
return(data)
}
Test of the function with your data
replaceNA(myData)
#> V1 V2 V3 V4 V5 V6 V7
#> 1: 3 7 8 0.00 8 12 2.0
#> 2: 5 8 9 2.00 8 9 1.0
#> 3: 1 2 4 5.00 0 6 7.0
#> 4: 5 6 0 4.25 3 9 10.0
#> 5: 7 2 3 6.00 11 14 2.0
#> 6: 4 8 7 4.00 5 3 4.4
Created on 2021-11-08 by the reprex package (v2.0.1)
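Since the question asks for the apply family, here is a base R sketch of the same idea. The name fill_na is hypothetical, and it assumes that any incomplete window (too close to either end of the column, or containing another NA) falls back to the column mean, which is broader than the last-row case described in the question.

```r
# Fill each NA with the mean of the `lag` values before and after it;
# fall back to the column mean when that window is incomplete.
fill_na <- function(vec, lag = 2) {
  n <- length(vec)
  na_pos <- which(is.na(vec))
  vec[na_pos] <- vapply(na_pos, function(i) {
    idx <- c((i - lag):(i - 1), (i + 1):(i + lag))
    # The window is usable only if it lies inside the vector and has no NA
    complete <- all(idx >= 1 & idx <= n) && !anyNA(vec[idx])
    if (complete) mean(vec[idx]) else mean(vec, na.rm = TRUE)
  }, numeric(1))
  vec
}

m <- matrix(c(3, 7, 8, 0, 8, 12, 2,
              5, 8, 9, 2, 8, 9, 1,
              1, 2, 4, 5, 0, 6, 7,
              5, 6, 0, NA, 3, 9, 10,
              7, 2, 3, 6, 11, 14, 2,
              4, 8, 7, 4, 5, 3, NA),
            ncol = 7, byrow = TRUE)

# apply() over margin 2 runs fill_na on every column
result <- apply(m, 2, fill_na)
result
```

For a data frame, lapply(myData, fill_na) wrapped in as.data.frame() would be the analogous call.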

Subtracting the smaller column from the greater column in a dataframe in R

I have the input below and I would like to subtract the two columns, always subtracting the lowest value from the highest value.
I don't want negative values as a result, and sometimes the highest value is in the first column (PaternalOrigin) and other times in the second column (MaternalOrigin).
Input:
df <- PaternalOrigin MaternalOrigin
16 20
3 6
11 0
1 3
1 4
3 11
and the dput output is this:
df <- structure(list(PaternalOrigin = c(16, 3, 11, 1, 1, 3), MaternalOrigin = c(20, 6, 0, 3, 4, 11)), colnames = c("PaternalOrigin", "MaternalOrigin"), row.names= c(NA, -6L), class="data.frame")
Thus, my expected output would look like:
df2 <- PaternalOrigin MaternalOrigin Results
16 20 4
3 6 3
11 0 11
1 3 2
1 4 3
3 11 8
Please, can someone advise me?
Thanks.
We can wrap with abs
transform(df, Results = abs(PaternalOrigin - MaternalOrigin))
# PaternalOrigin MaternalOrigin Results
#1 16 20 4
#2 3 6 3
#3 11 0 11
#4 1 3 2
#5 1 4 3
#6 3 11 8
Or we can assign it to 'Results'
df$Results <- with(df, abs(PaternalOrigin - MaternalOrigin))
Or using data.table
library(data.table)
setDT(df)[, Results := abs(PaternalOrigin - MaternalOrigin)]
Or with dplyr
library(dplyr)
df %>%
mutate(Results = abs(PaternalOrigin - MaternalOrigin))
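A more literal reading of "highest minus lowest" uses pmax() and pmin(). For two columns this gives the same result as abs(); the sketch below is just an alternative spelling that also extends to more than two columns.

```r
df <- data.frame(PaternalOrigin = c(16, 3, 11, 1, 1, 3),
                 MaternalOrigin = c(20, 6, 0, 3, 4, 11))

# pmax()/pmin() take the element-wise larger and smaller value of each pair
df$Results <- with(df, pmax(PaternalOrigin, MaternalOrigin) -
                       pmin(PaternalOrigin, MaternalOrigin))
df
```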

Deleting incomplete cases across multiple rows in R studio

Say I have a longitudinal data set as below
ID <- c(1, 1, 2, 2, 3, 3, 4, 4)
time <- c(1, 2, 1, 2, 1, 2, 1, 2)
value <- c(7, 5, 9, 2, NA, 3, 7, NA)
mydata <- data.frame(ID, time, value)
ID time value
1 1 1 7
2 1 2 5
3 2 1 9
4 2 2 2
5 3 1 NA
6 3 2 3
7 4 1 7
8 4 2 NA
In this data-set, we have 4 cases with data at two time-points (let's say pre and post treatment)
What I want to do is set criteria to delete any cases that are not complete for both time-points. In this example, I would want to delete ID3 (who is missing timepoint 1) and ID4 (who is missing timepoint 2), like below:
ID time value
1 1 1 7
2 1 2 5
3 2 1 9
4 2 2 2
I am not having much luck. I've tried variants of complete.cases() or which() to no avail
I'm still new to R, and would be hugely appreciative if anyone could help me out
Edit: Thank you Ronak for answering my question. Upon reflection of my real data, I have encountered a second problem. My actual data is more reflected by the below:
ID <- c(1, 1, 2, 2, 3, 3, 4, 4, 5, 6, 7, 8)
time <- c(1, 2, 1, 2, 1, 2, 1, 2, 1, 1, 1, 1)
value <- c(7, 5, 9, 2, NA, 3, 7, NA, 8, 9, 7, 6)
mydata <- data.frame(ID, time, value)
ID time value
1 1 1 7
2 1 2 5
3 2 1 9
4 2 2 2
5 3 1 NA
6 3 2 3
7 4 1 7
8 4 2 NA
9 5 1 8
10 6 1 9
11 7 1 7
12 8 1 6
Where I would also want to remove cases 5, 6, 7 and 8. These IDs have an entry for Time 1, but not Time 2. Hopefully this makes sense
Thanks a heap
If you switch your data to wide format (where each time point is represented as its own column), then you can use na.omit. Using dplyr and tidyr functions:
library(dplyr)
mydata <- mydata %>%
tidyr::spread(key=time, value=value) %>% # reformat to wide
na.omit() %>% # delete cases with missingness on any variable (i.e. any time point)
tidyr::gather(key="time", value="value", -ID) # put it back in long format
> mydata
ID time value
1 1 1 7
2 2 1 9
3 1 2 5
4 2 2 2
Note that this will work (it will keep only cases with complete data for both time 1 and time 2) even when you have a time point missing without an explicit NA present in the data, like this:
> mydata
ID time value
1 1 1 7
2 1 2 5
3 2 1 9
4 2 2 2
5 3 1 NA
6 3 2 3
7 4 1 7
8 4 2 NA
9 5 1 8
10 6 1 9
11 7 1 7
12 8 1 6
You can do this easily with sqldf.
library(sqldf)
sqldf(' select * from (select ID, count(*) as cnt from mydata where value is not null group by id having cnt >1 ) t1 inner join mydata t2 on t1.ID=t2.ID')
This selects the IDs that have more than one non-NA value and then joins them back to the original data.
#Ronak already provided
mydata[!mydata$ID %in% mydata$ID[is.na(mydata$value)], ]
For the second part, you can just group over each id and filter on their frequency
k2 <- data.frame(table(mydata$ID))
k2$Var1[k2$Freq > 1]
and then do something like
mydata[mydata$ID %in% k2$Var1[k2$Freq > 1],]
See the updated answer
# Eliminates ID cases with NA
mydata = mydata[!mydata$ID %in% mydata[!complete.cases(mydata) ,]$ID, ]
library(plyr)
# counts all the IDs
cnt = count(mydata, "ID")
# Eliminates any ID that doesn't have 2 observations
mydata[mydata$ID %in% cnt[cnt$freq == 2, ]$ID, ]
ID time value
1 1 1 7
2 1 2 5
3 2 1 9
4 2 2 2
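Both conditions can also be checked in one step in base R. This is a sketch that assumes every complete case has exactly two time points; n_observed and complete are names chosen for illustration.

```r
ID <- c(1, 1, 2, 2, 3, 3, 4, 4, 5, 6, 7, 8)
time <- c(1, 2, 1, 2, 1, 2, 1, 2, 1, 1, 1, 1)
value <- c(7, 5, 9, 2, NA, 3, 7, NA, 8, 9, 7, 6)
mydata <- data.frame(ID, time, value)

# Per ID, count the non-missing values and repeat that count on every row
n_observed <- ave(as.integer(!is.na(mydata$value)), mydata$ID, FUN = sum)

# Keep only IDs with a non-missing value at both time points
complete <- mydata[n_observed == 2, ]
complete
```

An ID with a missing row and an ID with an explicit NA both end up with a count below 2, so they are dropped by the same test.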
