Finding if a value is within the range of other columns - r

I have a dataframe df which looks like this:
Input:
df <- read.table(text =
"ID Q1_PM Q1_TP Q1_overall Q2_PM Q2_LS Q2_overall
1 1 2 3 1 2 2
2 0 NA NA 2 1 1
3 2 1 1 3 4 0
4 1 0 2 4 0 2
5 NA 1 NA 0 NA 0
6 2 0 1 1 NA NA"
, header = TRUE)
Desired Output:
To explain a little further, my desired output is as below:
ID Q1_PM Q1_TP Q1_overall Q2_PM Q2_LS Q2_overall Q1_check Q2_check
1 1 2 3 1 2 2 "above" "within"
2 0 NA NA 2 1 1 NA "within"
3 2 1 1 3 4 0 "within" "below"
4 1 0 2 4 0 2 "above" "within"
5 NA 1 NA 0 NA 0 NA "within"
6 2 0 1 1 NA NA "within" NA
Explanation:
Example 1:
Based on the value in columns Q1_PM and Q1_TP, I want to see whether the value in column Q1_overall is within their range or not? If, not in range, is the value above or below the range? To track this, I want to add an additional column Q1_check.
Example 2:
Similarly, based on the values of Q2_PM and Q2_LS, I want to check if the value of Q2_overall is within their range or not? If not in range, is it above or below the range? Again, to track this, I want to add an additional column Q2_check
Requirements:
1- For this, I want to add additional columns Q1_check and Q2_check where the first column is for the comparisons that involve Q1 items and the second column is for the comparisons that involve Q2 items.
2- The columns could contain the following values: above, below and within.
3- When an overall column is NA, the corresponding check column should also be NA.
Related posts:
I have looked for related posts such as:
Add column with values depending on another column to a dataframe
and Create categories by comparing a numeric column with a fixed value
but I am running into errors as discussed below.
Partial Solution:
The only solution I can think of is along these lines:
df$Q1_check <- ifelse(data$Q1_overall < data$Q1_PM, 'below',
ifelse(data$Q1_overall > data$Q1_TP, 'above',
ifelse(is.na(data$Q1_overall), NA, 'within')))
But it results in the following error: Error in data$Q1_overall : object of type 'closure' is not subsettable. I do not understand what the issue could be.
OR
df %>%
mutate(Regulation = case_when(Q1_overall < Q1_PM ~ 'below',
Q1_overall > Q1_TP ~ 'above',
Q1_PM < Q1_overall < Q1_TP, 'within'))
This also results in an error: Error: unexpected '<' in: "Q1_overall > Q1_TP ~ 'above', Q1_PM < Q1_overall <"
Edit 1:
How can the solution be extended if (let's say) the columns are these:
"Q1 Comm - 01 Scope Thesis"
"Q1 Comm - 02 Scope Project"
"Q1 Comm - 03 Learn Intern"
"Q1 Comm - 04 Biography"
"Q1 Comm - 05 Exhibit"
"Q1 Comm - 06 Social Act"
"Q1 Comm - 07 Post Project"
"Q1 Comm - 08 Learn Plant"
"Q1 Comm - 09 Study Narrate"
"Q1 Comm - 10 Learn Participate"
"Q1 Comm - 11 Write 1"
"Q1 Comm - 12 Read 2"
"Q1 Comm - Overall Study Plan"
How can we identify when the column Q1 Comm - Overall Study Plan is:
1 - Below the min() of all the other columns, or
2 - Above the max() of all the other columns, or
3 - Within the range of all the other columns
Edit 2:
For the updated fields, I am also including the dput(df)
dput(df)
structure(list(ï..ID = c(10L, 31L, 225L, 243L), Q1.Comm...01.Scope.Thesis = c(NA,
2L, 0L, NA), Q1.Comm...02.Scope.Project = c(NA, NA, NA, 2L),
Q1.Comm...03.Learn.Intern = c(4L, NA, NA, NA), Q1.Comm...04.Biography = c(NA,
NA, NA, 1L), Q1.Comm...05.Exhibit = c(4L, 2L, NA, NA), Q1.Comm...06.Social.Act = c(NA,
NA, NA, 3L), Q1.Comm...07.Post.Project = c(NA, NA, 3L, NA
), Q1.Comm...08.Learn.Plant = c(NA, NA, NA, 4L), Q1.Comm...09.Study.Narrate = c(NA,
NA, 0L, NA), Q1.Comm...10.Learn.Participate = c(4L, NA, NA,
NA), Q1.Comm...11.Write.1 = c(NA, 2L, NA, NA), Q1.Comm...12.Read.2 = c(NA,
NA, 1L, NA), Q1.Comm...Overall.Study.Plan = c(4L, 1L, 2L,
NA), X = c(NA, NA, NA, NA), X.1 = c(NA, NA, NA, NA), X.2 = c(NA,
NA, NA, NA)), class = "data.frame", row.names = c(NA, -4L
))
Any advice on how to achieve this would be greatly appreciated. Thank you!

This is a fairly long-winded approach, but it works:
library(dplyr)
comparison <- function(x, y, z) {
case_when(is.na(z) ~ NA_character_,
z >= x & z <= y |
z >= y & z <= x |
is.na(x) & y == z |
is.na(y) & x == z ~ 'within',
z > x & z > y ~ 'above',
TRUE ~ 'below')
}
df %>%
mutate(Q1_check = comparison(Q1.PM, Q1.TP, Q1.overall),
Q2_check = comparison(Q2.PM, Q2.LS, Q2.overall))
# ID Q1.PM Q1.TP Q1.overall Q2.PM Q2.LS Q2.overall Q1_check Q2_check
#1 1 1 2 3 1 2 2 above within
#2 2 0 NA NA 2 1 1 <NA> within
#3 3 2 1 1 3 4 0 within below
#4 4 1 0 2 4 0 2 above within
#5 5 NA 1 NA 0 NA 0 <NA> within
#6 6 2 0 1 1 NA NA within <NA>
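For comparison, the same check can be sketched in base R with pmin() and pmax(). It also shows what went wrong in the question's ifelse attempt: data is a built-in R function, so data$Q1_overall tries to subset a function (a "closure") rather than the data frame df. The helper name range_check is made up for this illustration, and the sketch assumes the underscore column names from the question.

```r
df <- read.table(text =
"ID Q1_PM Q1_TP Q1_overall Q2_PM Q2_LS Q2_overall
1 1 2 3 1 2 2
2 0 NA NA 2 1 1
3 2 1 1 3 4 0
4 1 0 2 4 0 2
5 NA 1 NA 0 NA 0
6 2 0 1 1 NA NA", header = TRUE)

# Hypothetical helper: compare `overall` with the row-wise range of the
# columns passed in `...`. NAs among those columns are ignored; an NA in
# `overall` propagates to the result.
range_check <- function(overall, ...) {
  lo <- pmin(..., na.rm = TRUE)
  hi <- pmax(..., na.rm = TRUE)
  ifelse(overall < lo, "below",
         ifelse(overall > hi, "above", "within"))
}

# Note df$..., not data$...
df$Q1_check <- range_check(df$Q1_overall, df$Q1_PM, df$Q1_TP)
df$Q2_check <- range_check(df$Q2_overall, df$Q2_PM, df$Q2_LS)
df
```

Because pmin() and pmax() accept any number of vectors, the same helper can take more than two comparison columns.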

df <- read.table(text =
"ID Q1-PM Q1-TP Q1-overall Q2-PM Q2-LS Q2-overall
1 1 2 3 1 2 2
2 0 NA NA 2 1 1
3 2 1 1 3 4 0
4 1 0 2 4 0 2
5 NA 1 NA 0 NA 0
6 2 0 1 1 NA NA"
, header = TRUE)
library(tidyverse)
f <- function(x, y, z){
case_when(
z < pmin(x, y, na.rm = TRUE) ~ "below",
z > pmax(x, y, na.rm = TRUE) ~ "above",
between(z, pmin(x, y, na.rm = TRUE), pmax(x, y, na.rm = TRUE)) ~ "within"
)
}
df %>%
rowwise() %>%
mutate(Q1_check = f(Q1.PM, Q1.TP, Q1.overall),
Q2_check = f(Q2.PM, Q2.LS, Q2.overall))
#> # A tibble: 6 x 9
#> # Rowwise:
#> ID Q1.PM Q1.TP Q1.overall Q2.PM Q2.LS Q2.overall Q1_check Q2_check
#> <int> <int> <int> <int> <int> <int> <int> <chr> <chr>
#> 1 1 1 2 3 1 2 2 above within
#> 2 2 0 NA NA 2 1 1 <NA> within
#> 3 3 2 1 1 3 4 0 within below
#> 4 4 1 0 2 4 0 2 above within
#> 5 5 NA 1 NA 0 NA 0 <NA> within
#> 6 6 2 0 1 1 NA NA within <NA>
Created on 2021-06-09 by the reprex package (v2.0.0)

If your columns are named consistently, you can do this for any number of Qs simultaneously.
Two changes to the sample data:
changed - in the column names to an acceptable _
changed Q2_LS to Q2_TP so that both questions follow the same naming pattern
What it does:
It picks up every column that ends with _overall (2 here, but it can be any number) and checks each value as follows:
If it is less than the smaller of the matching _PM and _TP columns (same name with _PM / _TP in place of _overall), it allocates the value below
If it is greater than the larger of the matching _PM and _TP columns, it allocates the value above
To access these column values I used get() along with cur_column() and a stringr string replacement function
If the current value is NA, it allocates NA_character_
Otherwise it allocates the value within
Finally, the new columns (all created at once) are named by removing _overall from the source column name and appending _check (I used the .names argument of across() here).
The .names argument takes a glue-style specification, so stringr::str_remove can be called inside the braces.
df <- read.table(text =
"ID Q1_PM Q1_TP Q1_overall Q2_PM Q2_TP Q2_overall
1 1 2 3 1 2 2
2 0 NA NA 2 1 1
3 2 1 1 3 4 0
4 1 0 2 4 0 2
5 NA 1 NA 0 NA 0
6 2 0 1 1 NA NA"
, header = TRUE)
df
#> ID Q1_PM Q1_TP Q1_overall Q2_PM Q2_TP Q2_overall
#> 1 1 1 2 3 1 2 2
#> 2 2 0 NA NA 2 1 1
#> 3 3 2 1 1 3 4 0
#> 4 4 1 0 2 4 0 2
#> 5 5 NA 1 NA 0 NA 0
#> 6 6 2 0 1 1 NA NA
library(tidyverse)
df %>% mutate(across(ends_with('overall'), ~ case_when(. < pmin(get(str_replace(cur_column(), '_overall', '_PM')),
get(str_replace(cur_column(), '_overall', '_TP'))) ~ 'below',
. > pmax(get(str_replace(cur_column(), '_overall', '_PM')),
get(str_replace(cur_column(), '_overall', '_TP'))) ~ 'above',
is.na(.) ~ NA_character_,
TRUE ~ 'within'),
.names = '{str_remove(.col,"_overall")}_check'))
#> ID Q1_PM Q1_TP Q1_overall Q2_PM Q2_TP Q2_overall Q1_check Q2_check
#> 1 1 1 2 3 1 2 2 above within
#> 2 2 0 NA NA 2 1 1 <NA> within
#> 3 3 2 1 1 3 4 0 within below
#> 4 4 1 0 2 4 0 2 above within
#> 5 5 NA 1 NA 0 NA 0 <NA> within
#> 6 6 2 0 1 1 NA NA within <NA>
Created on 2021-06-09 by the reprex package (v2.0.0)

Largely based on Ronak's great solution:
df <- structure(list(ID = c(10L, 31L, 225L, 243L),
`Q1 Comm - 01 Scope Thesis` = c(NA, 2L, 0L, NA),
`Q1 Comm - 02 Scope Project` = c(NA, NA, NA, 2L),
`Q1 Comm - 03 Learn Intern` = c(4L, NA, NA, NA),
`Q1 Comm - 04 Biography` = c(NA, NA, NA, 1L),
`Q1 Comm - 05 Exhibit` = c(4L, 2L, NA, NA),
`Q1 Comm - 06 Social Act` = c(NA, NA, NA, 3L),
`Q1 Comm - 07 Post Project` = c(NA, NA, 3L, NA),
`Q1 Comm - 08 Learn Plant` = c(NA, NA, NA, 4L),
`Q1 Comm - 09 Study Narrate` = c(NA, NA, 0L, NA),
`Q1 Comm - 10 Learn Participate` = c(4L, NA, NA,NA),
`Q1 Comm - 11 Write 1` = c(NA, 2L, NA, NA),
`Q1 Comm - 12 Read 2` = c(NA, NA, 1L, NA),
`Q1 Comm - Overall Study Plan` = c(4L, 1L, 2L, NA),
X = c(NA, NA, NA, NA),
`X 1` = c(NA, NA, NA, NA),
`X 2` = c(NA, NA, NA, NA)),
class = "data.frame", row.names = c(NA, -4L))
library(dplyr)
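comparison <- function(df, prefix) {
  # Keep only the columns that belong to this block of questions
  df <- df[grep(prefix, colnames(df))]
  # Row-wise min and max over every column except the "Overall" one
  min <- apply(df[-grep("Overall", colnames(df))], 1, min, na.rm = TRUE)
  max <- apply(df[-grep("Overall", colnames(df))], 1, max, na.rm = TRUE)
  # Extract the "Overall" column as a plain vector, not a one-column data frame
  z <- df[[grep("Overall", colnames(df))]]
  case_when(is.na(z) ~ NA_character_,
            z >= min & z <= max ~ 'within',
            z > max ~ 'above',
            TRUE ~ 'below')
}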
prefixes <- sub(" \\- Overall.*", "", colnames(df[grep("Overall", colnames(df))]))
for (i in prefixes) {
df <- df %>%
mutate("{i} - Check" := comparison(df, i))
}
> print(df)
ID Q1 Comm - 01 Scope Thesis Q1 Comm - 02 Scope Project Q1 Comm - 03 Learn Intern Q1 Comm - 04 Biography
1 10 NA NA 4 NA
2 31 2 NA NA NA
3 225 0 NA NA NA
4 243 NA 2 NA 1
Q1 Comm - 05 Exhibit Q1 Comm - 06 Social Act Q1 Comm - 07 Post Project Q1 Comm - 08 Learn Plant
1 4 NA NA NA
2 2 NA NA NA
3 NA NA 3 NA
4 NA 3 NA 4
Q1 Comm - 09 Study Narrate Q1 Comm - 10 Learn Participate Q1 Comm - 11 Write 1 Q1 Comm - 12 Read 2
1 NA 4 NA NA
2 NA NA 2 NA
3 0 NA NA 1
4 NA NA NA NA
Q1 Comm - Overall Study Plan X X 1 X 2 Q1 Comm - Check
1 4 NA NA NA within
2 1 NA NA NA below
3 2 NA NA NA within
4 NA NA NA NA <NA>
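As a variation on the answer above, the row-wise minimum and maximum over many columns can also be sketched with do.call() on pmin() and pmax(), which avoids apply(). One difference worth noting: for a row whose item columns are all NA, pmin(..., na.rm = TRUE) returns NA, whereas min(..., na.rm = TRUE) returns Inf with a warning. The data below is a trimmed copy of the sample (three item columns instead of twelve), and the variable names are illustrative only.

```r
# Trimmed copy of the sample data (three item columns instead of twelve)
df <- data.frame(
  ID = c(10L, 31L, 225L, 243L),
  `Q1 Comm - 01 Scope Thesis`    = c(NA, 2L, 0L, NA),
  `Q1 Comm - 05 Exhibit`         = c(4L, 2L, NA, NA),
  `Q1 Comm - 07 Post Project`    = c(NA, NA, 3L, NA),
  `Q1 Comm - Overall Study Plan` = c(4L, 1L, 2L, NA),
  check.names = FALSE
)

is_block   <- startsWith(names(df), "Q1 Comm")
is_overall <- grepl("Overall", names(df), fixed = TRUE)

# Item columns as an unnamed list, so they are passed positionally to pmin/pmax
items   <- unname(as.list(df[is_block & !is_overall]))
overall <- df[[which(is_block & is_overall)]]

# Row-wise min and max, ignoring NAs within each row
lo <- do.call(pmin, c(items, list(na.rm = TRUE)))
hi <- do.call(pmax, c(items, list(na.rm = TRUE)))

df[["Q1 Comm - Check"]] <- ifelse(overall < lo, "below",
                           ifelse(overall > hi, "above", "within"))
df[["Q1 Comm - Check"]]
```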

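library(data.table)
library(purrr)
library(stringr)
# data.table::transpose is named explicitly because purrr also exports a transpose()
comparison <- function(dt, group_cols, new_col, compare_col){
dt[,
c("min", "max") := data.table::transpose(pmap(.SD, range, na.rm = TRUE)), .SDcols = group_cols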
][,(new_col) := fcase(
is.na(get(compare_col)), NA_character_,
get(compare_col) < min, "below",
get(compare_col) > max, "above",
default = "within"
)
][]
}
group_cols <- names(df) %>%
str_subset("^Q[0-9]+") %>%
str_subset("overall", negate = TRUE) %>%
split(str_extract(., "^Q[0-9]+"))
new_cols <- names(group_cols) %>% str_c("_check")
compare_cols <- names(group_cols) %>% str_c("_overall")
setDT(df)
pwalk(list(group_cols, new_cols, compare_cols), ~comparison(df, ...))
df[, c("min", "max") := NULL]

Related

How to combine rowSums and ifelse with mutate

I need to combine rowSums and ifelse in order to create a new variable. My data looks like this:
boss var1 var2 var3 newvar
1 NA NA 3 NA
1 2 3 3 8
2 NA NA NA 0
2 NA NA NA 0
2 NA NA NA 0
1 1 NA 2 3
If boss==1 and there is more than one missing value in var1 to var3, newvar should be NA; otherwise it should be the sum var1+var2+var3, ignoring a single missing value.
If boss==2, newvar should automatically be 0.
So far, I have been able to solve parts of the problem using dplyr:
mutate(newvar=rowSums(.[,2:4],na.rm=TRUE) +
ifelse(rowSums(is.na(.[,2:4]))>1 & boss==2,NA,0))
mutate(newvar=ifelse(boss==2,0,NA))
However, I'm struggling to combine the two. Any help is much appreciated.
Here is one option with case_when where we create an index ('i1') which computes the number of NA elements in the row. The index is used in the case_when to create logical conditions to assign the values
df %>%
mutate(i1 = rowSums(is.na(.[-1]))) %>%
mutate(newvar = case_when(i1 > 1 & boss==1 ~ NA_integer_,
boss==2 ~ 0L,
i1 <=1 & boss != 2~ as.integer(rowSums(.[2:4], na.rm = TRUE)))) %>%
select(-i1)
# boss var1 var2 var3 newvar
#1 1 NA NA 3 NA
#2 1 2 3 3 8
#3 2 NA NA NA 0
#4 2 NA NA NA 0
#5 2 NA NA NA 0
#6 1 1 NA 2 3
In base R, this can be done with creating index and without using any ifelse
i1 <- df$boss != 2
tmp <- i1 * df[-1]
df$newvar <- NA^(rowSums(is.na(tmp)) > 1 & i1) * rowSums(tmp, na.rm = TRUE)
df$newvar
#[1] NA 8 0 0 0 3
data
df <- structure(list(boss = c(1L, 1L, 2L, 2L, 2L, 1L), var1 = c(NA,
2L, NA, NA, NA, 1L), var2 = c(NA, 3L, NA, NA, NA, NA), var3 = c(3L,
3L, NA, NA, NA, 2L)), .Names = c("boss", "var1", "var2", "var3"
), row.names = c(NA, -6L), class = "data.frame")
A solution in base-R using apply can be as:
df$newvar <- apply(df,1, function(x){
#retVal = NA
if(x["boss"]==2){
0
} else if(sum(is.na(x[-1])) > 1){
NA
} else{
sum(x[-1], na.rm = TRUE)
}
})
# boss var1 var2 var3 newvar
# 1 1 NA NA 3 NA
# 2 1 2 3 3 8
# 3 2 NA NA NA 0
# 4 2 NA NA NA 0
# 5 2 NA NA NA 0
# 6 1 1 NA 2 3
Data:
df <- read.table(text =
"boss var1 var2 var3
1 NA NA 3
1 2 3 3
2 NA NA NA
2 NA NA NA
2 NA NA NA
1 1 NA 2",
header = TRUE, stringsAsFactors = FALSE)
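The two partial attempts in the question can also be combined directly with a nested ifelse() around rowSums(). This is only a sketch under the rules stated in the question; the intermediate names n_missing and total are illustrative.

```r
df <- read.table(text =
"boss var1 var2 var3
1 NA NA 3
1 2 3 3
2 NA NA NA
2 NA NA NA
2 NA NA NA
1 1 NA 2",
header = TRUE, stringsAsFactors = FALSE)

vars <- c("var1", "var2", "var3")

n_missing <- rowSums(is.na(df[vars]))      # number of NAs per row
total     <- rowSums(df[vars], na.rm = TRUE)  # row sum ignoring NAs

# boss == 2 always gives 0; otherwise NA when more than one value is missing
df$newvar <- ifelse(df$boss == 2, 0,
                    ifelse(n_missing > 1, NA, total))
df
```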

Conditional Column Formatting

I have a data frame that looks like this:
cat df1 df2 df3
1 1 NA 1 NA
2 1 NA 2 NA
3 1 NA 3 NA
4 2 1 NA NA
5 2 2 NA NA
6 2 3 NA NA
I want to populate df3 so that when cat = 1, df3 = df2 and when cat = 2, df3 = df1. However I am getting a few different error messages.
My current code looks like this:
df$df3[df$cat == 1] <- df$df2
df$df3[df$cat == 2] <- df$df1
Try this code, which subsets both sides of each assignment with the same condition so that the lengths match:
df[df$cat==1,"df3"]<-df[df$cat==1,"df2"]
df[df$cat==2,"df3"]<-df[df$cat==2,"df1"]
The output:
df
cat df1 df2 df3
1 1 NA 1 1
2 1 NA 2 2
3 1 NA 3 3
4 2 1 NA 1
5 2 2 NA 2
6 2 3 NA 3
You can try
ifelse(df$cat == 1, df$df2, df$df1)
[1] 1 2 3 1 2 3
# saving
df$df3 <- ifelse(df$cat == 1, df$df2, df$df1)
# if there are other values than 1 and 2 you can try a nested ifelse
# that is setting other values to NA
ifelse(df$cat == 1, df$df2, ifelse(df$cat == 2, df$df1, NA))
# or you can try a tidyverse solution.
library(tidyverse)
df %>%
mutate(df3=case_when(cat == 1 ~ df2,
cat == 2 ~ df1))
cat df1 df2 df3
1 1 NA 1 1
2 1 NA 2 2
3 1 NA 3 3
4 2 1 NA 1
5 2 2 NA 2
6 2 3 NA 3
# data
df <- structure(list(cat = c(1L, 1L, 1L, 2L, 2L, 2L), df1 = c(NA, NA,
NA, 1L, 2L, 3L), df2 = c(1L, 2L, 3L, NA, NA, NA), df3 = c(NA,
NA, NA, NA, NA, NA)), .Names = c("cat", "df1", "df2", "df3"), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6"))

Rearrange data by matching columns

I am having an issue with rearranging some data.
The original data is:
structure(list(id = 1:3, artery.1 = structure(c(1L, 1L, 2L), .Label = c("a",
"b"), class = "factor"), artery.2 = structure(c(1L, NA, 2L), .Label = c("b",
"c"), class = "factor"), artery.3 = structure(c(1L, NA, 2L), .Label = c("c",
"d"), class = "factor"), artery.4 = structure(c(NA, NA, 1L), .Label = "e", class = "factor"), artery.5 = structure(c(NA, NA, 1L), .Label = "f", class = "factor"),
diameter.1 = c(3L, 2L, 1L), diameter.2 = c(2L, NA, 2L), diameter.3 = c(3L,
NA, 3L), diameter.4 = c(NA, NA, 4L), diameter.5 = c(NA, NA,
5L)), .Names = c("id", "artery.1", "artery.2", "artery.3",
"artery.4", "artery.5", "diameter.1", "diameter.2", "diameter.3",
"diameter.4", "diameter.5"), class = "data.frame", row.names = c(NA,
-3L))
# id artery.1 artery.2 artery.3 artery.4 artery.5 diameter.1 diameter.2 diameter.3 diameter.4 diameter.5
# 1 1 a b c <NA> <NA> 3 2 3 NA NA
# 2 2 a <NA> <NA> <NA> <NA> 2 NA NA NA NA
# 3 3 b c d e f 1 2 3 4 5
I would like to get to this:
structure(list(id = 1:3, a = c(3L, 2L, NA), b = c(2L, NA, 1L),
c = c(3L, NA, 2L), d = c(NA, NA, 3L), e = c(NA, NA, 4L),
f = c(NA, NA, 5L)), .Names = c("id", "a", "b", "c", "d",
"e", "f"), class = "data.frame", row.names = c(NA, -3L))
# id a b c d e f
# 1 1 3 2 3 NA NA NA
# 2 2 2 NA NA NA NA NA
# 3 3 NA 1 2 3 4 5
Basically, a to f represents arteries and the numerical values represent the corresponding diameter. Each row represents a patient.
Is there a neat way to sort this dataframe out?
Modern tidyr makes the solution even more succinct via the pivot_ functions:
library(dplyr)
library(tidyr)
df %>%
pivot_longer(-id, names_pattern = '(artery|diameter)\\.(\\d+)', names_to = c('.value', NA)) %>%
filter(!is.na(artery)) %>%
pivot_wider(names_from = artery, values_from = diameter)
id a b c d e f
<int> <int> <int> <int> <int> <int> <int>
1 1 3 2 3 NA NA NA
2 2 2 NA NA NA NA NA
3 3 NA 1 2 3 4 5
Here is the older solution, which uses the deprecated gather and spread functions:
library(dplyr)
library(tidyr)
new.df <- gather(df, variable, value, artery.1:diameter.5) %>%
separate(variable, c('variable', 'num')) %>%
spread(variable, value) %>%
subset(!is.na(artery)) %>%
mutate(diameter = as.numeric(diameter)) %>%
select(-num) %>%
spread(artery, diameter)
Output:
id a b c d e f
1 1 3 2 3 NA NA NA
2 2 2 NA NA NA NA NA
3 3 NA 1 2 3 4 5
Or using melt/dcast combination with data.table while selecting variables using regex in the patterns function
library(data.table) #v>=1.9.6
dcast(melt(setDT(df),
id = "id",
measure = patterns("artery", "diameter")),
id ~ value1,
sum,
value.var = "value2",
subset = .(!is.na(value2)),
fill = NA)
# id a b c d e f
# 1: 1 3 2 3 NA NA NA
# 2: 2 2 NA NA NA NA NA
# 3: 3 NA 1 2 3 4 5
As you can see, both melt and dcast are very flexible and you can use regex, specify a subset, pass multiple functions and specify how you want to fill missing values.
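You can use xtabs with reshape from base R. Use the latter to transform the data to long format and the former to build the table of diameters by id and artery. Note that xtabs sums the values, so combinations that do not occur show up as 0 rather than NA: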
xtabs(diameter ~ id + artery, reshape(df, varying = 2:11, sep = '.', dir = "long"))
# artery
#id a b c d e f
# 1 3 2 3 0 0 0
# 2 2 0 0 0 0 0
# 3 0 1 2 3 4 5
This can be done with two reshape() calls. First, we can longify both artery and diameter on id, then widen with artery as the time variable. To prevent a column of NAs, we also must subset out rows with NA values for artery in the intermediate frame.
reshape(subset(reshape(df,dir='l',varying=setdiff(names(df),'id'),timevar=NULL),!is.na(artery)),dir='w',timevar='artery');
## id diameter.a diameter.b diameter.c diameter.d diameter.e diameter.f
## 1.1 1 3 2 3 NA NA NA
## 2.1 2 2 NA NA NA NA NA
## 3.1 3 NA 1 2 3 4 5
The diameter. prefixes can be removed afterward, if desired. However, an advantage of this solution is that it would be capable of preserving multiple column sets, whereas the xtabs() solution cannot. The prefixes would be essential to distinguish the column sets in that case.
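Another base R sketch, for cases where missing combinations should stay NA (unlike the xtabs() result) and no prefix clean-up is wanted: tapply() over the id and artery of every non-missing cell. The data frame is rebuilt here with plain character columns for brevity; as.character() makes the same code work if the artery columns are factors, as in the question.

```r
df <- data.frame(
  id = 1:3,
  artery.1 = c("a", "a", "b"),
  artery.2 = c("b", NA, "c"),
  artery.3 = c("c", NA, "d"),
  artery.4 = c(NA, NA, "e"),
  artery.5 = c(NA, NA, "f"),
  diameter.1 = c(3L, 2L, 1L),
  diameter.2 = c(2L, NA, 2L),
  diameter.3 = c(3L, NA, 3L),
  diameter.4 = c(NA, NA, 4L),
  diameter.5 = c(NA, NA, 5L)
)

# Artery labels and diameters as two matrices with matching positions
art <- sapply(df[grep("^artery", names(df))], as.character)
dia <- as.matrix(df[grep("^diameter", names(df))])

keep <- !is.na(art)                 # cells that actually hold an artery
ids  <- df$id[row(art)[keep]]       # patient id of each kept cell

# One row per id, one column per artery; empty combinations stay NA
out <- tapply(dia[keep], list(id = ids, artery = art[keep]), sum)
out
```

The result is a matrix with id as row names; data.frame(id = rownames(out), out) would turn it back into a data frame if needed.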

Last observation carried forward conditional on multiple columns

I have a dataset with this structure:
ID = c(1,1,1,1,2,2,2,3,3,3,3)
L40 = c(1, NA, NA, NA, 1, NA, NA, NA, 1, NA, NA)
K50 = c(NA, NA, NA, NA, NA, 1, NA, NA, NA, NA, 1)
df = data.frame(ID, L40, K50)
# ID L40 K50
# 1 1 1 NA
# 2 1 NA NA
# 3 1 NA NA
# 4 1 NA NA
# 5 2 1 NA
# 6 2 NA 1
# 7 2 NA NA
# 8 3 NA NA
# 9 3 1 NA
# 10 3 NA NA
# 11 3 NA 1
When missing values occur in columns L40 and K50, I want to carry forward the last non-missing value in that column, conditional on ID being the same as the previous ID and the values in L40 and K50 in the current row being empty. I applied the following code:
library(tidyr)
df2 <- df %>% group_by(ID) %>% fill(L40:K50)
This does not achieve what I am looking for. I want the previous non-missing value to be carried forward into the next row only when the other columns (except ID) in that row are empty. This is what I want:
ID = c(1,1,1,1,2,2,2,3,3,3,3)
L40 = c(1, 1, 1, 1, 1, NA, NA, NA, 1, 1, NA)
K50 = c(NA, NA, NA, NA, NA, 1, 1, NA, NA, NA, 1)
df3 = data.frame(ID, L40, K50)
df3
# ID L40 K50
# 1 1 1 NA
# 2 1 1 NA
# 3 1 1 NA
# 4 1 1 NA
# 5 2 1 NA
# 6 2 NA 1
# 7 2 NA 1
# 8 3 NA NA
# 9 3 1 NA
# 10 3 1 NA
# 11 3 NA 1
We can use na.locf
library(data.table)
library(zoo)
setDT(df)[, if(any(is.na(K50[-1]))) lapply(.SD, na.locf) else .SD , by = ID]
# ID L40 K50
#1: 1 1 NA
#2: 1 1 NA
#3: 1 1 NA
#4: 1 1 NA
#5: 2 1 NA
#6: 2 NA 1
#7: 3 NA 1
#8: 3 NA 1
#9: 3 NA 1
An option using dplyr would be
library(dplyr)
df %>%
mutate(ind = rowSums(is.na(.))) %>%
group_by(ID) %>%
mutate_each(funs(if(any(ind>1)) na.locf(., na.rm=FALSE) else .), L40:K50) %>%
select(-ind)
# ID L40 K50
# <dbl> <dbl> <dbl>
#1 1 1 NA
#2 1 1 NA
#3 1 1 NA
#4 1 1 NA
#5 2 1 NA
#6 2 NA 1
#7 3 NA 1
#8 3 NA 1
#9 3 NA 1
I played around with this question for a while, and with my limited knowledge of R I came up with the following work-around. I have added a date column to the original data frame for purpose of illustration:
ID = c(1,1,1,1,2,2,2,3,3,3,3)
date = c(1,2,3,4,1,2,3,1,2,3,4)
L40 = c(1, 1, NA, NA, 1, NA, NA, NA, 1, NA, NA)
K50 = c(NA, 1, 1, NA, NA, 1, NA, NA, NA, NA, 1)
df = data.frame(ID, date, L40, K50)
Here is what I did:
library(dplyr)
library(tidyr)
library(data.table)
#gather the diagnosis columns in rows and keep only those rows where the patient has the associated diagnosis.
df1 <- df %>% gather(diagnos, dummy, L40:K50) %>% filter(dummy==1) %>% arrange(ID, date)
#concatenate across rows by ID and date to collect all diagnoses of an ID at a particular date.
df2 <- df1 %>% group_by(ID, date) %>% mutate(diag = paste(diagnos, collapse=" ")) %>% select(-diagnos, -dummy)
#convert into data tables in preparation for join
Dt1 <- data.table(df)
Dt2 <- data.table(df2)
setkey(Dt1, ID, date)
setkey(Dt2, ID, date)
#Each observation in Dt1 is matched with the observation in Dt2 with the same date or, if that particular date is not present,
#by the nearest previous date:
final <- Dt2[Dt1, roll=TRUE] %>% distinct()
This carries forward the name(s) of the diagnosis until the next observed diagnosis.
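The rule stated in the question (carry the previous row forward only when every diagnosis column in the current row is empty, and only within the same ID) can also be sketched in base R. The idea is to start a new block at each row that has at least one value and then copy that row's values to the empty rows that follow it. The names value_cols, has_value, block and filled are illustrative, not from the original posts.

```r
ID  <- c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3)
L40 <- c(1, NA, NA, NA, 1, NA, NA, NA, 1, NA, NA)
K50 <- c(NA, NA, NA, NA, NA, 1, NA, NA, NA, NA, 1)
df  <- data.frame(ID, L40, K50)

value_cols <- c("L40", "K50")

# TRUE for rows where at least one diagnosis column is filled in
has_value <- rowSums(!is.na(df[value_cols])) > 0

# Block counter within each ID: it increases at every row that has a value,
# so the empty rows that follow share the block of the last non-empty row
block <- ave(as.integer(has_value), df$ID, FUN = cumsum)

filled <- df
for (col in value_cols) {
  # Within each (ID, block), repeat the value from the block's first row
  filled[[col]] <- ave(df[[col]], df$ID, block,
                       FUN = function(v) rep(v[1], length(v)))
}
filled
```

Leading empty rows of an ID fall into block 0 and stay NA, which matches row 8 of the desired output.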

Restructuring data using apply family of functions

I have inherited a data set with 23 attributes measured for each of 13 names (between-subjects: each participant rated only one name on all of these attributes). Right now it's structured such that the attributes are the fastest-moving factor, followed by the name. So the data look like this:
Sub# N1-item1 N1-item2 N1-item3 […] N2-item1 N2-item2 N2-item3
1 3 5 3 NA NA NA
2 NA NA NA 1 5 3
3 3 5 3 NA NA NA
4 NA NA NA 2 2 1
It needs to be restructured such that it's collapsed over name, and all of the item1 entries are in the same column (subjects don't matter for this purpose), as below (bearing in mind that there are 23 items, not 3, and 13 names, not 2):
Name item1 item2 item3
N1 3 5 3
N2 1 5 3
I can do this with loops, but I'd rather do it in a manner more natural to R, which I'm guessing would be one of the apply family of functions, but I can't quite wrap my head around it. What is the smart way to do this?
Here's an answer using dplyr and tidyr:
library(dplyr)#loads libraries
library(tidyr)
dat %>% #name of your dataframe
gather(key, val, -Sub) %>% #gathers to long data, with id as Sub
filter(!is.na(val)) %>% #removes rows with NA for the value
separate(key, c("Name", "item")) %>% #split the column key into Name and item
spread(item, val) #spreads the data into wide format, with item as the columns
Sub Name item1 item2 item3
1 1 N1 3 5 3
2 2 N2 1 5 3
3 3 N1 3 5 3
4 4 N2 2 2 1
Spin the column names around to be itemX-NY and then let reshape sort it out:
names(dat)[-1] <- gsub("(^.+?)-(.+?$)", "\\2-\\1", names(dat)[-1])
na.omit(reshape(dat, direction="long", idvar="Sub", varying=-1, sep="-"))
# Sub time item1 item2 item3
#1.N1 1 N1 3 5 3
#3.N1 3 N1 3 5 3
#2.N2 2 N2 1 5 3
#4.N2 4 N2 2 2 1
Where the data was:
dat <- structure(list(Sub = 1:4, `item1-N1` = c(3L, NA, 3L, NA), `item2-N1` = c(5L,
NA, 5L, NA), `item3-N1` = c(3L, NA, 3L, NA), `item1-N2` = c(NA,
1L, NA, 2L), `item2-N2` = c(NA, 5L, NA, 2L), `item3-N2` = c(NA,
3L, NA, 1L)), .Names = c("Sub", "item1-N1", "item2-N1", "item3-N1",
"item1-N2", "item2-N2", "item3-N2"), row.names = c(NA, -4L), class = "data.frame")
