I have a dataframe df which looks like this:
Input:
df <- read.table(text =
"ID Q1_PM Q1_TP Q1_overall Q2_PM Q2_LS Q2_overall
1 1 2 3 1 2 2
2 0 NA NA 2 1 1
3 2 1 1 3 4 0
4 1 0 2 4 0 2
5 NA 1 NA 0 NA 0
6 2 0 1 1 NA NA"
, header = TRUE)
Desired Output:
To explain a little further, my desired output is as below:
ID Q1_PM Q1_TP Q1_overall Q2_PM Q2_LS Q2_overall Q1_check Q2_check
1 1 2 3 1 2 2 "above" "within"
2 0 NA NA 2 1 1 NA "within"
3 2 1 1 3 4 0 "within" "below"
4 1 0 2 4 0 2 "above" "within"
5 NA 1 NA 0 NA 0 NA "within"
6 2 0 1 1 NA NA "within" NA
Explanation:
Example 1:
Based on the values in columns Q1_PM and Q1_TP, I want to check whether the value in column Q1_overall lies within their range. If it does not, is it above or below the range? To track this, I want to add a column Q1_check.
Example 2:
Similarly, based on the values of Q2_PM and Q2_LS, I want to check whether the value of Q2_overall lies within their range and, if not, whether it is above or below the range. Again, to track this, I want to add a column Q2_check.
Requirements:
1- I want to add the columns Q1_check and Q2_check, where the first holds the comparisons for the Q1 items and the second holds the comparisons for the Q2 items.
2- The columns can contain the following values: above, below and within.
3- When an overall column is NA, the corresponding check column should also be NA.
Related posts:
I have looked for related posts such as:
Add column with values depending on another column to a dataframe
and Create categories by comparing a numeric column with a fixed value
but I am running into errors as discussed below.
Partial Solution:
The only solution I can think of is along these lines:
df$Q1_check <- ifelse(data$Q1_overall < data$Q1_PM, 'below',
ifelse(data$Q1_overall > data$Q1_TP, 'above',
ifelse(is.na(data$Q1_overall), NA, 'within')))
But it results in the following error: Error in data$Q1_overall : object of type 'closure' is not subsettable. I do not understand what the issue could be.
OR
df %>%
mutate(Regulation = case_when(Q1_overall < Q1_PM ~ 'below',
Q1_overall > Q1_TP ~ 'above',
Q1_PM < Q1_overall < Q1_TP, 'within'))
This also results in an error: Error: unexpected '<' in: "Q1_overall > Q1_TP ~ 'above', Q1_PM < Q1_overall <"
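Both errors seem to have simple causes. The data frame is called df, but the first snippet subsets data, which in a fresh R session is the built-in function data(), hence "object of type 'closure'". In the second snippet, R has no chained comparison, so a < b < c has to be written as a < b & b < c, and the last case_when() branch needs ~ instead of a comma. Below is a minimal base R sketch of the intended check. It assumes the range is the min/max of the two item columns with a missing item ignored, and check_range is a made-up helper name, not a function from any package.

```r
df <- read.table(text = "ID Q1_PM Q1_TP Q1_overall Q2_PM Q2_LS Q2_overall
1 1 2 3 1 2 2
2 0 NA NA 2 1 1
3 2 1 1 3 4 0
4 1 0 2 4 0 2
5 NA 1 NA 0 NA 0
6 2 0 1 1 NA NA", header = TRUE)

# check_range() is a hypothetical helper written for this example
check_range <- function(a, b, overall) {
  lo <- pmin(a, b, na.rm = TRUE)  # lower bound, ignoring a missing item
  hi <- pmax(a, b, na.rm = TRUE)  # upper bound, ignoring a missing item
  ifelse(is.na(overall), NA_character_,
         ifelse(overall < lo, "below",
                ifelse(overall > hi, "above", "within")))
}

df$Q1_check <- check_range(df$Q1_PM, df$Q1_TP, df$Q1_overall)
df$Q2_check <- check_range(df$Q2_PM, df$Q2_LS, df$Q2_overall)
df
```

Testing for NA first keeps the comparisons from having to deal with a missing overall value.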
Edit 1:
How can the solution be extended if (let's say) the columns are these:
"Q1 Comm - 01 Scope Thesis"
"Q1 Comm - 02 Scope Project"
"Q1 Comm - 03 Learn Intern"
"Q1 Comm - 04 Biography"
"Q1 Comm - 05 Exhibit"
"Q1 Comm - 06 Social Act"
"Q1 Comm - 07 Post Project"
"Q1 Comm - 08 Learn Plant"
"Q1 Comm - 09 Study Narrate"
"Q1 Comm - 10 Learn Participate"
"Q1 Comm - 11 Write 1"
"Q1 Comm - 12 Read 2"
"Q1 Comm - Overall Study Plan"
How can we identify when the column Q1 Comm - Overall Study Plan is:
1 - Below the min() of all the other columns, or
2 - Above the max() of all the other columns, or
3 - Within the range of all the other columns
Edit 2:
For the updated fields, I am also including the dput(df)
dput(df)
structure(list(ï..ID = c(10L, 31L, 225L, 243L), Q1.Comm...01.Scope.Thesis = c(NA,
2L, 0L, NA), Q1.Comm...02.Scope.Project = c(NA, NA, NA, 2L),
Q1.Comm...03.Learn.Intern = c(4L, NA, NA, NA), Q1.Comm...04.Biography = c(NA,
NA, NA, 1L), Q1.Comm...05.Exhibit = c(4L, 2L, NA, NA), Q1.Comm...06.Social.Act = c(NA,
NA, NA, 3L), Q1.Comm...07.Post.Project = c(NA, NA, 3L, NA
), Q1.Comm...08.Learn.Plant = c(NA, NA, NA, 4L), Q1.Comm...09.Study.Narrate = c(NA,
NA, 0L, NA), Q1.Comm...10.Learn.Participate = c(4L, NA, NA,
NA), Q1.Comm...11.Write.1 = c(NA, 2L, NA, NA), Q1.Comm...12.Read.2 = c(NA,
NA, 1L, NA), Q1.Comm...Overall.Study.Plan = c(4L, 1L, 2L,
NA), X = c(NA, NA, NA, NA), X.1 = c(NA, NA, NA, NA), X.2 = c(NA,
NA, NA, NA)), class = "data.frame", row.names = c(NA, -4L
))
Any advice on how to achieve this would be greatly appreciated. Thank you!
This seems a rather long-winded approach:
library(dplyr)
comparison <- function(x, y, z) {
case_when(is.na(z) ~ NA_character_,
z >= x & z <= y |
z >= y & z <= x |
is.na(x) & y == z |
is.na(y) & x == z ~ 'within',
z > x & z > y ~ 'above',
TRUE ~ 'below')
}
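# Column names such as Q1.PM are what read.table() produces from hyphenated
# headers (Q1-PM); with the underscore names shown in the question, use Q1_PM etc.
df %>%
mutate(Q1_check = comparison(Q1.PM, Q1.TP, Q1.overall),
Q2_check = comparison(Q2.PM, Q2.LS, Q2.overall))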
# ID Q1.PM Q1.TP Q1.overall Q2.PM Q2.LS Q2.overall Q1_check Q2_check
#1 1 1 2 3 1 2 2 above within
#2 2 0 NA NA 2 1 1 <NA> within
#3 3 2 1 1 3 4 0 within below
#4 4 1 0 2 4 0 2 above within
#5 5 NA 1 NA 0 NA 0 <NA> within
#6 6 2 0 1 1 NA NA within <NA>
df <- read.table(text =
"ID Q1-PM Q1-TP Q1-overall Q2-PM Q2-LS Q2-overall
1 1 2 3 1 2 2
2 0 NA NA 2 1 1
3 2 1 1 3 4 0
4 1 0 2 4 0 2
5 NA 1 NA 0 NA 0
6 2 0 1 1 NA NA"
, header = TRUE)
library(tidyverse)
f <- function(x, y, z){
case_when(
z < pmin(x, y, na.rm = TRUE) ~ "below",
z > pmax(x, y, na.rm = TRUE) ~ "above",
between(z, pmin(x, y, na.rm = TRUE), pmax(x, y, na.rm = TRUE)) ~ "within"
)
}
df %>%
rowwise() %>%
mutate(Q1_check = f(Q1.PM, Q1.TP, Q1.overall),
Q2_check = f(Q2.PM, Q2.LS, Q2.overall))
#> # A tibble: 6 x 9
#> # Rowwise:
#> ID Q1.PM Q1.TP Q1.overall Q2.PM Q2.LS Q2.overall Q1_check Q2_check
#> <int> <int> <int> <int> <int> <int> <int> <chr> <chr>
#> 1 1 1 2 3 1 2 2 above within
#> 2 2 0 NA NA 2 1 1 <NA> within
#> 3 3 2 1 1 3 4 0 within below
#> 4 4 1 0 2 4 0 2 above within
#> 5 5 NA 1 NA 0 NA 0 <NA> within
#> 6 6 2 0 1 1 NA NA within <NA>
Created on 2021-06-09 by the reprex package (v2.0.0)
If your columns are named consistently, you can do this for any number of Qs simultaneously. For this example I made two changes to the data:
I changed - in the column names to _, which is valid in R names.
I changed Q2_LS to Q2_TP so that both groups follow the same naming pattern.
What it does:
It picks up every column whose name ends with _overall (2 here, but it can be any number).
It then checks each of these columns' values as follows:
If the value is less than the smaller of the matching _PM and _TP columns (same name with _overall replaced), it assigns below.
If the value is greater than the larger of the matching _PM and _TP columns, it assigns above.
To access those columns I used get() along with cur_column() and stringr's str_replace().
If the current value is NA, it assigns NA_character_.
Otherwise it assigns within.
Finally, the new columns are named (all at once) by removing _overall from the original name and appending _check, using the .names argument of across().
For this I used stringr::str_remove() inside the .names string, which follows glue syntax.
df <- read.table(text =
"ID Q1_PM Q1_TP Q1_overall Q2_PM Q2_TP Q2_overall
1 1 2 3 1 2 2
2 0 NA NA 2 1 1
3 2 1 1 3 4 0
4 1 0 2 4 0 2
5 NA 1 NA 0 NA 0
6 2 0 1 1 NA NA"
, header = TRUE)
df
#> ID Q1_PM Q1_TP Q1_overall Q2_PM Q2_TP Q2_overall
#> 1 1 1 2 3 1 2 2
#> 2 2 0 NA NA 2 1 1
#> 3 3 2 1 1 3 4 0
#> 4 4 1 0 2 4 0 2
#> 5 5 NA 1 NA 0 NA 0
#> 6 6 2 0 1 1 NA NA
library(tidyverse)
df %>% mutate(across(ends_with('overall'), ~ case_when(. < pmin(get(str_replace(cur_column(), '_overall', '_PM')),
get(str_replace(cur_column(), '_overall', '_TP'))) ~ 'below',
. > pmax(get(str_replace(cur_column(), '_overall', '_PM')),
get(str_replace(cur_column(), '_overall', '_TP'))) ~ 'above',
is.na(.) ~ NA_character_,
TRUE ~ 'within'),
.names = '{str_remove(.col,"_overall")}_check'))
#> ID Q1_PM Q1_TP Q1_overall Q2_PM Q2_TP Q2_overall Q1_check Q2_check
#> 1 1 1 2 3 1 2 2 above within
#> 2 2 0 NA NA 2 1 1 <NA> within
#> 3 3 2 1 1 3 4 0 within below
#> 4 4 1 0 2 4 0 2 above within
#> 5 5 NA 1 NA 0 NA 0 <NA> within
#> 6 6 2 0 1 1 NA NA within <NA>
Created on 2021-06-09 by the reprex package (v2.0.0)
Largely based on Ronak's great solution:
df <- structure(list(ID = c(10L, 31L, 225L, 243L),
`Q1 Comm - 01 Scope Thesis` = c(NA, 2L, 0L, NA),
`Q1 Comm - 02 Scope Project` = c(NA, NA, NA, 2L),
`Q1 Comm - 03 Learn Intern` = c(4L, NA, NA, NA),
`Q1 Comm - 04 Biography` = c(NA, NA, NA, 1L),
`Q1 Comm - 05 Exhibit` = c(4L, 2L, NA, NA),
`Q1 Comm - 06 Social Act` = c(NA, NA, NA, 3L),
`Q1 Comm - 07 Post Project` = c(NA, NA, 3L, NA),
`Q1 Comm - 08 Learn Plant` = c(NA, NA, NA, 4L),
`Q1 Comm - 09 Study Narrate` = c(NA, NA, 0L, NA),
`Q1 Comm - 10 Learn Participate` = c(4L, NA, NA,NA),
`Q1 Comm - 11 Write 1` = c(NA, 2L, NA, NA),
`Q1 Comm - 12 Read 2` = c(NA, NA, 1L, NA),
`Q1 Comm - Overall Study Plan` = c(4L, 1L, 2L, NA),
X = c(NA, NA, NA, NA),
`X 1` = c(NA, NA, NA, NA),
`X 2` = c(NA, NA, NA, NA)),
class = "data.frame", row.names = c(NA, -4L))
library(dplyr)
comparison <- function(df, prefix) {
  # keep only the columns that belong to this prefix
  df <- df[grep(prefix, colnames(df))]
  is_overall <- grepl("Overall", colnames(df))
  # row-wise min and max over the item (non-Overall) columns
  lo <- apply(df[!is_overall], 1, min, na.rm = TRUE)
  hi <- apply(df[!is_overall], 1, max, na.rm = TRUE)
  # [[ ]] extracts the single Overall column as a plain vector
  z <- df[[which(is_overall)]]
  case_when(is.na(z) ~ NA_character_,
            z >= lo & z <= hi ~ 'within',
            z > hi ~ 'above',
            TRUE ~ 'below')
}
prefixes <- sub(" \\- Overall.*", "", colnames(df[grep("Overall", colnames(df))]))
for (i in prefixes) {
df <- df %>%
mutate("{i} - Check" := comparison(df, i))
}
> print(df)
ID Q1 Comm - 01 Scope Thesis Q1 Comm - 02 Scope Project Q1 Comm - 03 Learn Intern Q1 Comm - 04 Biography
1 10 NA NA 4 NA
2 31 2 NA NA NA
3 225 0 NA NA NA
4 243 NA 2 NA 1
Q1 Comm - 05 Exhibit Q1 Comm - 06 Social Act Q1 Comm - 07 Post Project Q1 Comm - 08 Learn Plant
1 4 NA NA NA
2 2 NA NA NA
3 NA NA 3 NA
4 NA 3 NA 4
Q1 Comm - 09 Study Narrate Q1 Comm - 10 Learn Participate Q1 Comm - 11 Write 1 Q1 Comm - 12 Read 2
1 NA 4 NA NA
2 NA NA 2 NA
3 0 NA NA 1
4 NA NA NA NA
Q1 Comm - Overall Study Plan X X 1 X 2 Q1 Comm - Check
1 4 NA NA NA within
2 1 NA NA NA below
3 2 NA NA NA within
4 NA NA NA NA <NA>
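library(data.table)
library(purrr)    # pmap(), pwalk()
library(stringr)  # str_subset(), str_extract(), str_c()
library(magrittr) # %>%
comparison <- function(dt, group_cols, new_col, compare_col){
dt[,
# data.table::transpose() is named explicitly because purrr::transpose() can mask it
c("min", "max") := data.table::transpose(pmap(.SD, range, na.rm = TRUE)), .SDcols = group_cols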
][,(new_col) := fcase(
is.na(get(compare_col)), NA_character_,
get(compare_col) < min, "below",
get(compare_col) > max, "above",
default = "within"
)
][]
}
group_cols <- names(df) %>%
str_subset("^Q[0-9]+") %>%
str_subset("overall", negate = TRUE) %>%
split(str_extract(., "^Q[0-9]+"))
new_cols <- names(group_cols) %>% str_c("_check")
compare_cols <- names(group_cols) %>% str_c("_overall")
setDT(df)
pwalk(list(group_cols, new_cols, compare_cols), ~comparison(df, ...))
df[, c("min", "max") := NULL]
Starting with data with the start of the desired sequences filled in with 1, I need to fill in the NA rows with sequences. Below is the starting data (first two columns) and the desired third column:
I can make this happen with a loop, below, but what is the better R programming way to do it?
for(i in 1:length(df2$col2)) {
df2$col3[i] <- ifelse(df2$col2[i] == 1, 1, df2$col3[i - 1] + 1)
if(is.na(df2$col2[i])) df2$col3[i] <- df2$col3[i - 1] + 1
}
Here is a 20-row data set of the first two columns:
structure(list(col1 = c(478.69, 320.45, 503.7, 609.3, 478.19,
478.69, 320.45, 503.7, 609.3, 478.19, 419.633683050051, 552.939975773916,
785.119385505095, 18.2542654918507, 98.6469651805237, 132.587260054424,
697.119552921504, 512.560374778695, 916.425200179219, 14.3385051051155
), col2 = c(1, NA, 1, NA, NA, 1, NA, 1, NA, NA, NA, NA, 1, NA,
NA, NA, NA, NA, NA, NA)), class = "data.frame", row.names = c(NA,
-20L))
Try:
library(data.table)
df2 <- data.table(df2)
df2[, col3 := col2[1] + 1 * (1:.N - 1), by = .(cumsum(!is.na(col2)))]
You can use ave with seq_along with grouping using cumsum.
df2$col3 <- ave(integer(nrow(df2)), cumsum(!is.na(df2$col2)), FUN=seq_along)
df2
# col1 col2 col3
#1 478.69000 1 1
#2 320.45000 NA 2
#3 503.70000 1 1
#4 609.30000 NA 2
#5 478.19000 NA 3
#6 478.69000 1 1
#7 320.45000 NA 2
#8 503.70000 1 1
#9 609.30000 NA 2
#10 478.19000 NA 3
#11 419.63368 NA 4
#12 552.93998 NA 5
#13 785.11939 1 1
#14 18.25427 NA 2
#15 98.64697 NA 3
#16 132.58726 NA 4
#17 697.11955 NA 5
#18 512.56037 NA 6
#19 916.42520 NA 7
#20 14.33851 NA 8
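To see why grouping by cumsum(!is.na(col2)) works, here is a small sketch on a made-up vector (not the 20-row data above). It assumes, as in the question, that every sequence starts with a non-NA marker and that the first element is not NA.

```r
col2 <- c(1, NA, 1, NA, NA, 1, NA)

# !is.na(col2) is TRUE at each start marker, so the running sum
# increases by one every time a new sequence begins
grp <- cumsum(!is.na(col2))
grp
# 1 1 2 2 2 3 3

# within each group, seq_along() simply counts 1, 2, 3, ...
col3 <- ave(integer(length(col2)), grp, FUN = seq_along)
col3
# 1 2 1 2 3 1 2
```

If the vector started with NAs, those leading rows would form a group 0 and be numbered from 1 as well, which may or may not be what you want.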
I have this kind of data :
daynight
[1] NA NA NA NA 2 1 NA NA
I want R to detect any run of at least x consecutive NAs and replace those NAs with another value.
For example, if x = 3 and the replacement value is 3, the output should be:
daynight
[1] 3 3 3 3 2 1 NA NA
Would you have any ideas?
We can use rle
daynight <- c(NA, NA, NA, NA ,2 ,1, NA, NA)
x <- 3
r <- 3
daynight[with(rle(is.na(daynight)), rep(lengths >= x & values, lengths))] <- r
daynight
#[1] 3 3 3 3 2 1 NA NA
Taking another example :
daynight <- c(NA, NA, NA, 3,2,1, NA, NA, 1, NA, NA, NA, 1, NA, NA)
daynight[with(rle(is.na(daynight)), rep(lengths >= x & values, lengths))] <- r
#[1] 3 3 3 3 2 1 NA NA 1 3 3 3 1 NA NA
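The one-liner is compact, so here is the same idea unpacked step by step on the first example. The intermediate names (runs, long_na, mask) are only for illustration.

```r
daynight <- c(NA, NA, NA, NA, 2, 1, NA, NA)
x <- 3  # minimum run length
r <- 3  # replacement value

# run-length encode the NA indicator: one entry per run of equal values
runs <- rle(is.na(daynight))
runs$lengths  # 4 2 2
runs$values   # TRUE FALSE TRUE

# keep only the runs that are NA runs AND long enough
long_na <- runs$values & runs$lengths >= x  # TRUE FALSE FALSE

# expand the per-run decision back to one flag per element
mask <- rep(long_na, runs$lengths)

daynight[mask] <- r
daynight
# 3 3 3 3 2 1 NA NA
```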
And here is another solution using the zoo package
library(zoo)
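replace_consecutive_NAs <- function(x, nrNAs = 3, replaceBy = nrNAs){
  # work on a separate 0/1 indicator so the original values in x are kept
  na <- as.numeric(is.na(x))
  # 1 where a window of nrNAs values starting at this position is all NA
  starts <- rollapply(na, nrNAs, prod, fill = 0, align = "left")
  # flag every position covered by at least one such all-NA window
  indexes <- rollapply(starts, nrNAs, max, align = "right", partial = TRUE) > 0
  x[indexes] <- replaceBy
  x
}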
x <- c(NA, NA, NA, NA ,2 ,1, NA, NA)
replace_consecutive_NAs(x, 3, 999)
[1] 999 999 999 999 2 1 NA NA
I have a data frame such as this (but of size 16 Billion):
structure(list(id1 = c(1, 2, 3, 4, 4, 4, 4, 4, 4, 4), id2 = c("a",
"b", "c", "d", "e", "f", "g", "h", "i", "j"), b1 = c(NA, NA,
NA, 1L, 1L, 1L, 1L, 1L, 1L, 1L), b2 = c(1, NA, NA, NA, NA, NA,
1, 1, 1, 1), b3 = c(NA, 1, NA, NA, NA, NA, NA, NA, 1, 1), b4 = c(NA,
NA, 1, NA, NA, NA, NA, NA, 1, 1)), .Names = c("id1", "id2", "b1",
"b2", "b3", "b4"), row.names = c(NA, 10L), class = "data.frame")
df
id1 id2 b1 b2 b3 b4
1 1 a NA 1 NA NA
2 2 b NA NA 1 NA
3 3 c NA NA NA 1
4 4 d 1 NA NA NA
5 4 e 1 NA NA NA
6 4 f 1 NA NA NA
7 4 g 1 1 NA NA
8 4 h 1 1 NA NA
9 4 i 1 1 1 1
10 4 j 1 1 1 1
I need to get it into long format, while ONLY keeping values of 1. Of course, I tried using gather from tidyr and also melt from data.table to no avail as the memory requirements of them are explosive. My original data had zeros and ones, but I filled zeroes with NA and hoped na.rm = TRUE option will help with memory issue. But, it does not.
With just ones retained and lengthened, my data frame will fit easily in memory I have.
Is there a better way to do this than the standard methods? Spending more compute time in exchange for a smaller memory footprint is acceptable.
My desired output is the equivalent of:
library(dplyr)
library(tidyr)
df %>% gather(b, value, -id1, -id2, na.rm = TRUE)
id1 id2 b value
1 4 d b1 1
2 4 e b1 1
3 4 f b1 1
4 4 g b1 1
5 4 h b1 1
6 4 i b1 1
7 4 j b1 1
8 1 a b2 1
9 4 g b2 1
10 4 h b2 1
11 4 i b2 1
12 4 j b2 1
13 2 b b3 1
14 4 i b3 1
15 4 j b3 1
16 3 c b4 1
17 4 i b4 1
18 4 j b4 1
# or
reshape2::melt(df, id=c("id1","id2"), na.rm=TRUE)
# or
library(data.table)
melt(setDT(df), id=c("id1","id2"), na.rm=TRUE)
Currently, the call to gather on my full data set gives me this error, which I believe is due to memory issue:
Error in .Call("tidyr_melt_dataframe", PACKAGE = "tidyr", data, id_ind, :
negative length vectors are not allowed
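One possible way to keep peak memory down is to avoid melting everything at once: handle one value column at a time, keep only the rows equal to 1, and stack the small pieces at the end. The sketch below uses base R on the sample data. It has not been benchmarked on data of the size described, and the names value_cols, pieces and long are made up for the example.

```r
df <- data.frame(
  id1 = c(1, 2, 3, 4, 4, 4, 4, 4, 4, 4),
  id2 = c("a", "b", "c", "d", "e", "f", "g", "h", "i", "j"),
  b1  = c(NA, NA, NA, 1L, 1L, 1L, 1L, 1L, 1L, 1L),
  b2  = c(1, NA, NA, NA, NA, NA, 1, 1, 1, 1),
  b3  = c(NA, 1, NA, NA, NA, NA, NA, NA, 1, 1),
  b4  = c(NA, NA, 1, NA, NA, NA, NA, NA, 1, 1),
  stringsAsFactors = FALSE
)

value_cols <- setdiff(names(df), c("id1", "id2"))

# one small long-format piece per value column
pieces <- lapply(value_cols, function(col) {
  keep <- which(df[[col]] == 1)  # which() drops NA, so only the 1s remain
  data.frame(id1 = df$id1[keep],
             id2 = df$id2[keep],
             b = rep(col, length(keep)),
             value = df[[col]][keep],
             stringsAsFactors = FALSE)
})

long <- do.call(rbind, pieces)
long
```

Only the rows that survive the filter are ever copied, so the intermediate objects stay proportional to the number of 1s rather than to rows times columns. With data.table loaded, the final do.call(rbind, pieces) could be swapped for rbindlist(pieces).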
I have a dataset with this structure:
ID = c(1,1,1,1,2,2,2,3,3,3,3)
L40 = c(1, NA, NA, NA, 1, NA, NA, NA, 1, NA, NA)
K50 = c(NA, NA, NA, NA, NA, 1, NA, NA, NA, NA, 1)
df = data.frame(ID, L40, K50)
# ID L40 K50
# 1 1 1 NA
# 2 1 NA NA
# 3 1 NA NA
# 4 1 NA NA
# 5 2 1 NA
# 6 2 NA 1
# 7 2 NA NA
# 8 3 NA NA
# 9 3 1 NA
# 10 3 NA NA
# 11 3 NA 1
When missing values occur in columns L40 and K50, I want to carry forward the last non-missing value in that column, conditional on ID being the same as the previous ID and the values in L40 and K50 in the current row being empty. I applied the following code:
library(dplyr)
library(tidyr)
df2 <- df %>% group_by(ID) %>% fill(L40:K50)
This does not achieve what I am looking for. I want the previous non-missing value to be carried forward into the next row only when the other columns (except ID) in that row are empty. This is what I want:
ID = c(1,1,1,1,2,2,2,3,3,3,3)
L40 = c(1, 1, 1, 1, 1, NA, NA, NA, 1, 1, NA)
K50 = c(NA, NA, NA, NA, NA, 1, 1, NA, NA, NA, 1)
df3 = data.frame(ID, L40, K50)
df3
# ID L40 K50
# 1 1 1 NA
# 2 1 1 NA
# 3 1 1 NA
# 4 1 1 NA
# 5 2 1 NA
# 6 2 NA 1
# 7 2 NA 1
# 8 3 NA NA
# 9 3 1 NA
# 10 3 1 NA
# 11 3 NA 1
We can use na.locf
library(data.table)
library(zoo)
setDT(df)[, if(any(is.na(K50[-1]))) lapply(.SD, na.locf) else .SD , by = ID]
# ID L40 K50
#1: 1 1 NA
#2: 1 1 NA
#3: 1 1 NA
#4: 1 1 NA
#5: 2 1 NA
#6: 2 NA 1
#7: 3 NA 1
#8: 3 NA 1
#9: 3 NA 1
An option using dplyr would be
library(dplyr)
df %>%
mutate(ind = rowSums(is.na(.))) %>%
group_by(ID) %>%
mutate_each(funs(if(any(ind>1)) na.locf(., na.rm=FALSE) else .), L40:K50) %>%
select(-ind)
# ID L40 K50
# <dbl> <dbl> <dbl>
#1 1 1 NA
#2 1 1 NA
#3 1 1 NA
#4 1 1 NA
#5 2 1 NA
#6 2 NA 1
#7 3 NA 1
#8 3 NA 1
#9 3 NA 1
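For comparison, here is a base R sketch that implements the rule exactly as stated in the question: a row is filled only if all of its diagnosis columns are NA, and it then copies the most recent non-empty row of the same ID. It assumes the rows are already in time order within each ID, and the helper names (cols, empty, src, fill) are made up for the example.

```r
ID  <- c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3)
L40 <- c(1, NA, NA, NA, 1, NA, NA, NA, 1, NA, NA)
K50 <- c(NA, NA, NA, NA, NA, 1, NA, NA, NA, NA, 1)
df  <- data.frame(ID, L40, K50)

cols  <- c("L40", "K50")
empty <- rowSums(!is.na(df[cols])) == 0  # rows with no diagnosis at all
rows  <- seq_len(nrow(df))

# for every row: index of the latest non-empty row within the same ID
# (0 if there has been none yet)
src <- ave(ifelse(empty, 0L, rows), df$ID, FUN = cummax)

# fill only the empty rows that have a non-empty predecessor
fill <- empty & src > 0
df[fill, cols] <- df[src[fill], cols]
df
```

On the sample data this reproduces df3 from the question: row 6 keeps L40 as NA because the row is not empty, and row 8 stays empty because ID 3 has no earlier diagnosis.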
I played around with this question for a while, and with my limited knowledge of R I came up with the following work-around. I have added a date column to the original data frame for purpose of illustration:
ID = c(1,1,1,1,2,2,2,3,3,3,3)
date = c(1,2,3,4,1,2,3,1,2,3,4)
L40 = c(1, 1, NA, NA, 1, NA, NA, NA, 1, NA, NA)
K50 = c(NA, 1, 1, NA, NA, 1, NA, NA, NA, NA, 1)
df = data.frame(ID, date, L40, K50)
Here is what I did:
library(dplyr)
library(tidyr)
library(data.table)
#gather the diagnosis columns into rows and keep only those rows where the patient has the associated diagnosis.
df1 <- df %>% gather(diagnos, dummy, L40:K50) %>% filter(dummy==1) %>% arrange(ID, date)
#concatenate across rows by ID and date to collect all diagnoses of an ID at a particular date.
df2 <- df1 %>% group_by(ID, date) %>% mutate(diag = paste(diagnos, collapse=" ")) %>% select(-diagnos, -dummy)
#convert into data tables in preparation for join
Dt1 <- data.table(df)
Dt2 <- data.table(df2)
setkey(Dt1, ID, date)
setkey(Dt2, ID, date)
#Each observation in Dt1 is matched with the observation in Dt2 with the same date or, if that particular date is not present,
#with the nearest previous date:
final <- Dt2[Dt1, roll=TRUE] %>% distinct()
This carries forward the name(s) of the diagnosis until the next observed diagnosis.