I have a dataframe that has x/y values every 5 seconds, with a depth value every second (time column). There is no depth where there is an x/y value.
x <- c("1430934", NA, NA, NA, NA, "1430939")
y <- c("4943206", NA, NA, NA, NA, "4943210")
time <- c(1:6)
depth <- c(NA, 10, 19, 84, 65, NA)
data <- data.frame(x, y, time, depth)
data
x y time depth
1 1430934 4943206 1 NA
2 NA NA 2 10
3 NA NA 3 19
4 NA NA 4 84
5 NA NA 5 65
6 1430939 4943210 6 NA
I would like to calculate the maximum depth between consecutive non-NA x/y values and add it to a new column, in the row of the starting x/y value. In this example that is the maximum depth over rows 2-5. The desired output:
x y time depth newvar
1 1430934 4943206 1 NA 84
2 NA NA 2 10 NA
3 NA NA 3 19 NA
4 NA NA 4 84 NA
5 NA NA 5 65 NA
6 1430939 4943210 6 NA NA
This is to repeat whenever a new x/y value is present.
You can use ave, with the groups built from cumsum(!is.na(data$x)), like this:
data$newvar <- ave(data$depth, cumsum(!is.na(data$x)), FUN=
function(x) if(all(is.na(x))) NA else {
c(max(x, na.rm=TRUE), rep(NA, length(x)-1))})
data
# x y time depth newvar
#1 1430934 4943206 1 NA 84
#2 <NA> <NA> 2 10 NA
#3 <NA> <NA> 3 19 NA
#4 <NA> <NA> 4 84 NA
#5 <NA> <NA> 5 65 NA
#6 1430939 4943210 6 NA NA
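To make the grouping step in the ave answer above easier to follow, here is a minimal base R sketch that builds the same groups with cumsum(!is.na(x)) and then places each group's maximum on the group's first row. The helper name safe_max is introduced here for illustration only and is not part of the original answer.

```r
x     <- c("1430934", NA, NA, NA, NA, "1430939")
depth <- c(NA, 10, 19, 84, 65, NA)

# Every non-NA x starts a new group, so the running count of
# non-NA values labels the rows: 1 1 1 1 1 2
starts <- !is.na(x)
grp    <- cumsum(starts)

# Hypothetical helper: max that returns NA (not -Inf) for an all-NA group
safe_max <- function(v) if (all(is.na(v))) NA_real_ else max(v, na.rm = TRUE)

# One maximum per group, then written only to the rows that start a group
group_max <- tapply(depth, grp, safe_max)
newvar <- rep(NA_real_, length(depth))
newvar[starts] <- group_max[as.character(grp[starts])]
newvar
```

Because the groups come from the data rather than from a fixed window, this also works when the x/y fixes are not exactly 5 rows apart.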
Using dplyr, we can create groups of every 5 rows and update the first row in group as max value in the group ignoring NA values.
library(dplyr)
df %>%
group_by(grp = ceiling(time/5)) %>%
mutate(depth = ifelse(row_number() == 1, max(depth, na.rm = TRUE), NA))
In base R, we can use tapply :
inds <- seq(1, nrow(df), 5)
df$depth[inds] <- tapply(df$depth, ceiling(df$time/5), max, na.rm = TRUE)
df$depth[-inds] <- NA
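One thing to watch with the fixed-window grouping above: when a window contains only NA depths (row 6 in the example data), max(..., na.rm = TRUE) returns -Inf with a warning instead of NA. The sketch below shows the effect and one possible guard; safe_max is a hypothetical helper name, not something the answers define.

```r
depth <- c(NA, 10, 19, 84, 65, NA)
time  <- 1:6
grp   <- ceiling(time / 5)   # 1 1 1 1 1 2

# Group 2 holds a single NA, so a plain max() has nothing left to compare
plain <- suppressWarnings(tapply(depth, grp, max, na.rm = TRUE))

# Hypothetical guard: return NA for groups without any depth reading
safe_max <- function(v) if (all(is.na(v))) NA_real_ else max(v, na.rm = TRUE)
guarded  <- tapply(depth, grp, safe_max)
```

Here plain holds 84 and -Inf, while guarded holds 84 and NA.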
Maybe you can try ave like below
df <- within(df,
newvar <- ave(depth,
ceiling(time/5),
FUN = function(x) ifelse(length(x)>1&is.na(x),max(na.omit(x)),NA)))
such that
> df
x y time depth newvar
1 1430934 4943206 1 NA 84
2 NA NA 2 10 NA
3 NA NA 3 19 NA
4 NA NA 4 84 NA
5 NA NA 5 65 NA
6 1430939 4943210 6 NA NA
DATA
df <- structure(list(x = c(1430934L, NA, NA, NA, NA, 1430939L), y = c(4943206L,
NA, NA, NA, NA, 4943210L), time = 1:6, depth = c(NA, 10L, 19L,
84L, 65L, NA)), class = "data.frame", row.names = c("1", "2",
"3", "4", "5", "6"))
Here is another option using data.table:
library(data.table)
setDT(data)[, newvar := replace(frollapply(depth, 5L, max, na.rm=TRUE, align="left"),
seq(.N) %% 5L != 1L, NA_integer_)]
I have a simplified dataframe:
test <- data.frame(
x = c(1,2,3,NA,NA,NA),
y = c(NA, NA, NA, 3, 2, NA),
a = c(NA, NA, NA, NA, NA, TRUE)
)
I want to create a new column rating that has the value of the number in either column x or column y. The dataset is such a way that whenever there's a numeric value in x, there's a NA in y. If both columns are NAs, then the value in rating should be NA.
In this case, the expected output is: 1,2,3,3,2,NA
With coalesce:
library(dplyr)
test %>%
mutate(rating = coalesce(x, y))
x y a rating
1 1 NA NA 1
2 2 NA NA 2
3 3 NA NA 3
4 NA 3 NA 3
5 NA 2 NA 2
6 NA NA TRUE NA
library(dplyr)
test %>%
mutate(rating = if_else(is.na(x),
y, x))
x y a rating
1 1 NA NA 1
2 2 NA NA 2
3 3 NA NA 3
4 NA 3 NA 3
5 NA 2 NA 2
6 NA NA TRUE NA
Here are several solutions.
# Input
test <- data.frame(
x = c(1,2,3,NA,NA,NA),
y = c(NA, NA, NA, 3, 2, NA),
a = c(NA, NA, NA, NA, NA, TRUE)
)
# Base R solution
test$rating <- ifelse(!is.na(test$x), test$x,
ifelse(!is.na(test$y), test$y, NA))
# dplyr solution
library(dplyr)
test <- test %>%
mutate(rating = case_when(!is.na(x) ~ x,
!is.na(y) ~ y,
TRUE ~ NA_real_))
# data.table solution
library(data.table)
setDT(test)
test[, rating := ifelse(!is.na(x), x, ifelse(!is.na(y), y, NA))]
Created on 2022-12-23 with reprex v2.0.2
test <- data.frame(
x = c(1,2,3,NA,NA,NA),
y = c(NA, NA, NA, 3, 2, NA),
a = c(NA, NA, NA, NA, NA, TRUE)
)
test$rating <- dplyr::coalesce(test$x, test$y)
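If dplyr is not available, the same "first non-NA value wins" idea can be sketched in base R with Reduce, which also extends to more than two columns. The function name coalesce_base is made up for this example, and it assumes all inputs have the same length and type.

```r
test <- data.frame(
  x = c(1, 2, 3, NA, NA, NA),
  y = c(NA, NA, NA, 3, 2, NA),
  a = c(NA, NA, NA, NA, NA, TRUE)
)

# Hypothetical helper: walk through the vectors left to right and
# fill the gaps of the running result with the next vector's values
coalesce_base <- function(...) {
  Reduce(function(a, b) ifelse(is.na(a), b, a), list(...))
}

test$rating <- coalesce_base(test$x, test$y)
test$rating
```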
I am trying to do rowSums but I got zero for the last row and I need it to be "NA".
My df is
a b c sum
1 1 4 7 12
2 2 NA 8 10
3 3 5 NA 8
4 NA NA NA NA
I used this code, based on this link: Sum of two Columns of Data Frame with NA Values
df$sum<-rowSums(df[,c("a", "b", "c")], na.rm=T)
Any advice will be greatly appreciated
For each row check if it is all NA and if so return NA; otherwise, apply sum. We have selected columns a, b and c even though that is all the columns because the poster indicated that there might be additional ones.
sum_or_na <- function(x) if (all(is.na(x))) NA else sum(x, na.rm = TRUE)
transform(df, sum = apply(df[c("a", "b", "c")], 1, sum_or_na))
giving:
a b c sum
1 1 4 7 12
2 2 NA 8 10
3 3 5 NA 8
4 NA NA NA NA
Note
df in reproducible form is assumed to be:
df <- structure(list(a = c(1L, 2L, 3L, NA), b = c(4L, NA, 5L, NA),
c = c(7L, 8L, NA, NA)),
row.names = c("1", "2", "3", "4"), class = "data.frame")
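As a possible alternative to the row-wise apply above, the same result can be sketched with two vectorised rowSums calls: one for the sum and one to find the rows that have no observed value at all. This is only a sketch under the assumption that the selected columns are numeric.

```r
df <- data.frame(a = c(1L, 2L, 3L, NA),
                 b = c(4L, NA, 5L, NA),
                 c = c(7L, 8L, NA, NA))

cols <- c("a", "b", "c")

# Sum while ignoring NA; rows that are entirely NA come out as 0 here
s <- rowSums(df[cols], na.rm = TRUE)

# Overwrite those all-NA rows with NA
s[rowSums(!is.na(df[cols])) == 0] <- NA

df$sum <- s
df$sum
```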
I need to combine rowSums and ifelse in order to create a new variable. My data looks like this:
boss var1 var2 var3 newvar
1 NA NA 3 NA
1 2 3 3 8
2 NA NA NA 0
2 NA NA NA 0
2 NA NA NA 0
1 1 NA 2 3
If boss==1 and there is more than one missing value in var1 to var3, newvar should be NA; otherwise, it should be the result of var1+var2+var3, ignoring a single missing value.
If boss==2, newvar should be automatically 0.
So far, I have been able to solve parts of the problem using dplyr:
mutate(newvar=rowSums(.[,2:4],na.rm=TRUE) +
ifelse(rowSums(is.na(.[,2:4]))>1 & boss==2,NA,0))
mutate(newvar=ifelse(boss==2,0,NA))
However, I'm struggling to combine the two. Any help is much appreciated.
Here is one option with case_when where we create an index ('i1') which computes the number of NA elements in the row. The index is used in the case_when to create logical conditions to assign the values
df %>%
mutate(i1 = rowSums(is.na(.[-1]))) %>%
mutate(newvar = case_when(i1 > 1 & boss==1 ~ NA_integer_,
boss==2 ~ 0L,
i1 <=1 & boss != 2~ as.integer(rowSums(.[2:4], na.rm = TRUE)))) %>%
select(-i1)
# boss var1 var2 var3 newvar
#1 1 NA NA 3 NA
#2 1 2 3 3 8
#3 2 NA NA NA 0
#4 2 NA NA NA 0
#5 2 NA NA NA 0
#6 1 1 NA 2 3
In base R, this can be done by creating a logical index, without using ifelse:
i1 <- df$boss != 2
tmp <- i1 * df[-1]
df$newvar <- NA^(rowSums(is.na(tmp)) > 1 & i1) * rowSums(tmp, na.rm = TRUE)
df$newvar
#[1] NA 8 0 0 0 3
data
df <- structure(list(boss = c(1L, 1L, 2L, 2L, 2L, 1L), var1 = c(NA,
2L, NA, NA, NA, 1L), var2 = c(NA, 3L, NA, NA, NA, NA), var3 = c(3L,
3L, NA, NA, NA, 2L)), .Names = c("boss", "var1", "var2", "var3"
), row.names = c(NA, -6L), class = "data.frame")
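The NA^ expression in the base R answer above relies on the fact that R evaluates NA^0 to 1 while NA^1 stays NA, so a logical vector can be turned into a multiplicative mask. A small sketch with made-up values:

```r
# TRUE marks the positions that should become NA
make_na <- c(TRUE, FALSE, FALSE)

# NA^TRUE is NA, NA^FALSE is 1
mask <- NA^make_na

totals <- c(3, 8, 0)

# Multiplying by the mask keeps values where mask is 1 and blanks the rest
result <- mask * totals
result
```

This gives NA, 8 and 0, which is the same as ifelse(make_na, NA, totals) written in a more compact form.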
A base R solution using apply could look like this:
df$newvar <- apply(df,1, function(x){
#retVal = NA
if(x["boss"]==2){
0
} else if(sum(is.na(x[-1])) > 1){
NA
} else{
sum(x[-1], na.rm = TRUE)
}
})
# boss var1 var2 var3 newvar
# 1 1 NA NA 3 NA
# 2 1 2 3 3 8
# 3 2 NA NA NA 0
# 4 2 NA NA NA 0
# 5 2 NA NA NA 0
# 6 1 1 NA 2 3
Data:
df <- read.table(text =
"boss var1 var2 var3
1 NA NA 3
1 2 3 3
2 NA NA NA
2 NA NA NA
2 NA NA NA
1 1 NA 2",
header = TRUE, stringsAsFactors = FALSE)
My data is a bit unusual, and I need to find the order of fields/columns that follows a certain pattern.
For instance, one field (say Sub3) has values for the first few rows, followed by NULL values; then another field (like Sub1) continues with some values and is again followed by NULL values.
In some cases, multiple fields have values in the same rows (like Sub2 and Sub4).
In the case below, the solution is a vector of field names that follows the pattern c(Sub3,Sub1,c(Sub2,Sub4),Sub5)
Here is the data in reproducible format.
structure(list(RollNo = 1:10, Sub1 = c(NA, NA, NA, NA, 3L, 2L,
NA, NA, NA, NA), Sub2 = c(NA, NA, NA, NA, NA, NA, "A", "B", NA,
NA), Sub3 = c(4L, 3L, 5L, 6L, NA, NA, NA, NA, NA, NA), Sub4 = c(NA,
NA, NA, NA, NA, NA, 2L, 5L, NA, NA), Sub5 = c(NA, NA, NA, NA,
NA, NA, NA, NA, 7L, NA)), .Names = c("RollNo", "Sub1", "Sub2",
"Sub3", "Sub4", "Sub5"), row.names = c(NA, -10L), class = c("data.table",
"data.frame"))
Sounds like you are sorting on the order of first non-NA data. If df is your data:
sapply(df, function(x) min(Inf, head(which(!is.na(x)),n=1)))
# RollNo Sub1 Sub2 Sub3 Sub4 Sub5
# 1 5 7 1 7 9
This gives the first non-NA row for each column. Ordering on it should be a stable sort, meaning tied columns keep their original relative order.
There are a couple of ways to use this, one such:
order(sapply(df, function(x) min(Inf, head(which(!is.na(x)),n=1))))
# [1] 1 4 2 3 5 6
df[,order(sapply(df, function(x) min(Inf, head(which(!is.na(x)),n=1))))]
# RollNo Sub3 Sub1 Sub2 Sub4 Sub5
# 1 1 4 NA <NA> NA NA
# 2 2 3 NA <NA> NA NA
# 3 3 5 NA <NA> NA NA
# 4 4 6 NA <NA> NA NA
# 5 5 NA 3 <NA> NA NA
# 6 6 NA 2 <NA> NA NA
# 7 7 NA NA A 2 NA
# 8 8 NA NA B 5 NA
# 9 9 NA NA <NA> NA 7
# 10 10 NA NA <NA> NA NA
I'm inferring from the column names that RollNo should always be first, so:
df[,c(1, 1 + order(sapply(df[-1], function(x) min(Inf, head(which(!is.na(x)),n=1)))))]
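To unpack the expression used in the answer above: which(!is.na(x)) lists the non-NA positions, head(..., n = 1) keeps the first one, and min(Inf, ...) supplies Inf when a column has no non-NA value at all, so such columns sort last. A minimal sketch on made-up columns, with first_non_na as an illustrative helper name:

```r
# Hypothetical helper wrapping the expression from the answer
first_non_na <- function(x) min(Inf, head(which(!is.na(x)), n = 1))

cols <- list(Sub1 = c(NA, NA, 3),
             Sub2 = c(NA, NA, NA),   # never observed, so it gets Inf
             Sub3 = c(4, NA, NA))

first_idx <- sapply(cols, first_non_na)

# Columns ordered by the row in which they first carry a value
ordered_names <- names(cols)[order(first_idx)]

# Ties keep their original relative order: positions 1 and 3 tie at 7
tie_order <- order(c(7, 1, 7))
```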
Using:
DT[, nms := paste0(names(.SD)[!is.na(.SD)], collapse = ','), 1:nrow(DT), .SDcols = 2:6]
will get you:
> DT
RollNo Sub1 Sub2 Sub3 Sub4 Sub5 nms
1: 1 NA NA 4 NA NA Sub3
2: 2 NA NA 3 NA NA Sub3
3: 3 NA NA 5 NA NA Sub3
4: 4 NA NA 6 NA NA Sub3
5: 5 3 NA NA NA NA Sub1
6: 6 2 NA NA NA NA Sub1
7: 7 NA A NA 2 NA Sub2,Sub4
8: 8 NA B NA 5 NA Sub2,Sub4
9: 9 NA NA NA NA 7 Sub5
10: 10 NA NA NA NA NA
If you just want the specified vector:
unique(DT[, paste0(names(.SD)[!is.na(.SD)], collapse = ','), 1:nrow(DT), .SDcols = 2:6][V1!='']$V1)
which returns:
[1] "Sub3" "Sub1" "Sub2,Sub4" "Sub5"
As @Frank pointed out in the comments, you can also use:
melt(DT, id=1, na.rm = TRUE)[, toString(unique(variable)), by = RollNo][order(RollNo)]
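For readers without data.table, the row-wise idea above (collect the names of the non-NA columns in each row, then keep the distinct combinations in order of appearance) can be sketched in base R. The data is rebuilt here as a plain data frame from the question.

```r
df <- data.frame(
  RollNo = 1:10,
  Sub1 = c(NA, NA, NA, NA, 3L, 2L, NA, NA, NA, NA),
  Sub2 = c(NA, NA, NA, NA, NA, NA, "A", "B", NA, NA),
  Sub3 = c(4L, 3L, 5L, 6L, NA, NA, NA, NA, NA, NA),
  Sub4 = c(NA, NA, NA, NA, NA, NA, 2L, 5L, NA, NA),
  Sub5 = c(NA, NA, NA, NA, NA, NA, NA, NA, 7L, NA),
  stringsAsFactors = FALSE
)

subs <- df[-1]   # drop RollNo

# For each row, paste together the names of the columns that hold a value
nms <- apply(!is.na(subs), 1,
             function(r) paste(names(subs)[r], collapse = ","))

# Distinct non-empty combinations, in order of first appearance
pattern <- unname(unique(nms[nms != ""]))
pattern
```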
I have a dataset with this structure:
ID = c(1,1,1,1,2,2,2,3,3,3,3)
L40 = c(1, NA, NA, NA, 1, NA, NA, NA, 1, NA, NA)
K50 = c(NA, NA, NA, NA, NA, 1, NA, NA, NA, NA, 1)
df = data.frame(ID, L40, K50)
# ID L40 K50
# 1 1 1 NA
# 2 1 NA NA
# 3 1 NA NA
# 4 1 NA NA
# 5 2 1 NA
# 6 2 NA 1
# 7 2 NA NA
# 8 3 NA NA
# 9 3 1 NA
# 10 3 NA NA
# 11 3 NA 1
When missing values occur in columns L40 and K50, I want to carry forward the last non-missing value in that column, conditional on ID being the same as the previous ID and the values in L40 and K50 in the current row being empty. I applied the following code:
library(dplyr)
library(tidyr)
df2 <- df %>% group_by(ID) %>% fill(L40:K50)
This does not achieve what I am looking for. I want the previous non-missing value to be carried forward into the next row only when the other columns (except ID) in that row are empty. This is what I want:
ID = c(1,1,1,1,2,2,2,3,3,3,3)
L40 = c(1, 1, 1, 1, 1, NA, NA, NA, 1, 1, NA)
K50 = c(NA, NA, NA, NA, NA, 1, 1, NA, NA, NA, 1)
df3 = data.frame(ID, L40, K50)
df3
# ID L40 K50
# 1 1 1 NA
# 2 1 1 NA
# 3 1 1 NA
# 4 1 1 NA
# 5 2 1 NA
# 6 2 NA 1
# 7 2 NA 1
# 8 3 NA NA
# 9 3 1 NA
# 10 3 1 NA
# 11 3 NA 1
We can use na.locf
library(data.table)
library(zoo)
setDT(df)[, if(any(is.na(K50[-1]))) lapply(.SD, na.locf) else .SD , by = ID]
# ID L40 K50
#1: 1 1 NA
#2: 1 1 NA
#3: 1 1 NA
#4: 1 1 NA
#5: 2 1 NA
#6: 2 NA 1
#7: 3 NA 1
#8: 3 NA 1
#9: 3 NA 1
An option using dplyr would be
library(dplyr)
df %>%
mutate(ind = rowSums(is.na(.))) %>%
group_by(ID) %>%
mutate_each(funs(if(any(ind>1)) na.locf(., na.rm=FALSE) else .), L40:K50) %>%
select(-ind)
# ID L40 K50
# <dbl> <dbl> <dbl>
#1 1 1 NA
#2 1 1 NA
#3 1 1 NA
#4 1 1 NA
#5 2 1 NA
#6 2 NA 1
#7 3 NA 1
#8 3 NA 1
#9 3 NA 1
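Another way to read the requirement is as a row-level carry-forward: a row in which both L40 and K50 are empty copies the values of the most recent non-empty row with the same ID. The base R sketch below follows that reading; it is an alternative to the na.locf answers above, not a restatement of them, and it assumes the rows are already in order within each ID.

```r
ID  <- c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3)
L40 <- c(1, NA, NA, NA, 1, NA, NA, NA, 1, NA, NA)
K50 <- c(NA, NA, NA, NA, NA, 1, NA, NA, NA, NA, 1)
df  <- data.frame(ID, L40, K50)

vals <- c("L40", "K50")

# TRUE where at least one of the value columns is filled in
non_empty <- unname(rowSums(!is.na(df[vals])) > 0)

# Row number of the latest non-empty row so far within each ID
# (0 means no such row has been seen yet in that ID)
src <- ave(ifelse(non_empty, seq_len(nrow(df)), 0L), df$ID, FUN = cummax)

# Copy the source row's values; rows without a source stay as they are
filled  <- df
has_src <- src > 0
filled[has_src, vals] <- df[src[has_src], vals]
filled
```

On the example data this reproduces the L40 and K50 columns of the desired df3.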
I played around with this question for a while, and with my limited knowledge of R I came up with the following work-around. I have added a date column to the original data frame for purpose of illustration:
ID = c(1,1,1,1,2,2,2,3,3,3,3)
date = c(1,2,3,4,1,2,3,1,2,3,4)
L40 = c(1, 1, NA, NA, 1, NA, NA, NA, 1, NA, NA)
K50 = c(NA, 1, 1, NA, NA, 1, NA, NA, NA, NA, 1)
df = data.frame(ID, date, L40, K50)
Here is what I did (using dplyr, tidyr and data.table):
#gather the diagnosis columns in rows and keep only those rows where the patient has the associated diagnosis.
df1 <- df %>% gather(diagnos, dummy, L40:K50) %>% filter(dummy==1) %>% arrange(ID, date)
#concatenate across rows by ID and date to collect all diagnoses of an ID at a particular date.
df2 <- df1 %>% group_by(ID, date) %>% mutate(diag = paste(diagnos, collapse=" ")) %>% select(-diagnos, -dummy)
#convert into data tables in preparation for join
Dt1 <- data.table(df)
Dt2 <- data.table(df2)
setkey(Dt1, ID, date)
setkey(Dt2, ID, date)
#Each observation in Dt1 is matched with the observation in Dt2 with the same date or, if that particular date is not present,
#with the nearest previous date:
final <- Dt2[Dt1, roll=TRUE] %>% distinct()
This carries forward the name(s) of the diagnosis until the next observed diagnosis.
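The "same date or nearest previous date" matching that the rolling join performs can be illustrated for a single ID in base R with findInterval. This is a different technique from data.table's roll = TRUE and is shown only to make the matching rule concrete; the dates and labels are made up.

```r
# Dates at which a diagnosis was observed, and the label seen at each
obs_dates <- c(1, 3)
diag      <- c("L40", "K50")

# Dates we want a label for (0 lies before the first observation)
query <- 0:4

# For each query date: index of the last observed date <= query (0 if none)
idx <- findInterval(query, obs_dates)

# Look up the label, leaving NA where nothing has been observed yet
carried <- rep(NA_character_, length(query))
carried[idx > 0] <- diag[idx[idx > 0]]
carried
```

Dates 1 and 2 pick up "L40", dates 3 and 4 pick up "K50", and date 0 stays NA because no earlier observation exists.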