Create multiple sequences dependent on a data frame column in R

Starting with data in which the start of each desired sequence is marked by a 1 in col2, I need to fill in the NA rows with incrementing sequences. The starting data (first two columns) and the desired third column are shown in the output further down.
I can make this happen with the loop below, but what is the better R programming way to do it?
for(i in 1:length(df2$col2)) {
  df2$col3[i] <- ifelse(df2$col2[i] == 1, 1, df2$col3[i - 1] + 1)
  if(is.na(df2$col2[i])) df2$col3[i] <- df2$col3[i - 1] + 1
}
Here is a 20-row data set of the first two columns:
structure(list(col1 = c(478.69, 320.45, 503.7, 609.3, 478.19,
478.69, 320.45, 503.7, 609.3, 478.19, 419.633683050051, 552.939975773916,
785.119385505095, 18.2542654918507, 98.6469651805237, 132.587260054424,
697.119552921504, 512.560374778695, 916.425200179219, 14.3385051051155
), col2 = c(1, NA, 1, NA, NA, 1, NA, 1, NA, NA, NA, NA, 1, NA,
NA, NA, NA, NA, NA, NA)), class = "data.frame", row.names = c(NA,
-20L))

Try:
library(data.table)
df2 <- data.table(df2)
df2[, col3 := col2[1] + 1 * (1:.N - 1), by = .(cumsum(!is.na(col2)))]

You can use ave with seq_along, grouping by cumsum(!is.na(col2)):
df2$col3 <- ave(integer(nrow(df2)), cumsum(!is.na(df2$col2)), FUN=seq_along)
df2
# col1 col2 col3
#1 478.69000 1 1
#2 320.45000 NA 2
#3 503.70000 1 1
#4 609.30000 NA 2
#5 478.19000 NA 3
#6 478.69000 1 1
#7 320.45000 NA 2
#8 503.70000 1 1
#9 609.30000 NA 2
#10 478.19000 NA 3
#11 419.63368 NA 4
#12 552.93998 NA 5
#13 785.11939 1 1
#14 18.25427 NA 2
#15 98.64697 NA 3
#16 132.58726 NA 4
#17 697.11955 NA 5
#18 512.56037 NA 6
#19 916.42520 NA 7
#20 14.33851 NA 8
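
For completeness, here is a dplyr sketch of the same grouping idea (my own addition, not from the thread): cumsum(!is.na(col2)) defines the groups and row_number() restarts the counter within each group.
library(dplyr)
df2 %>%
  group_by(grp = cumsum(!is.na(col2))) %>%  # new group at each non-NA col2
  mutate(col3 = row_number()) %>%           # 1, 2, 3, ... within each group
  ungroup() %>%
  select(-grp)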

Related

collapsing columns values in a specific order and leaving the missing values as NA in R

I am using R.
I have 4 different databases. Each one has values for my variables. Some of the databases have more values than others, so I want to use first the one that has the most values and last the one that has the fewest. The data looks like this...
Variables   A   B   C   D
John        2   4
Mike            6
Walter          7
Jennifer        9   8
Amanda      3
Carlos      9
Michael         3
James                   5
Kevin       4
Dennis              7
Frank
Steven
Joseph
Elvis           2
Maria           1
So, in order to fill the data I need to create a new column that first uses the data of column B (because it is the one that contains the most values), then A, then C, then D, and the ones that are missing need to be NAs. I also need to add another column that gives me the source of the data: in other words, if I am using column B to get the value for John, I need a column that tells me the data pertains to column B.
The column should look like this...
Variables   E    D
John        4    B
Mike        6    B
Walter      7    B
Jennifer    9    B
Amanda      3    A
Carlos      9    A
Michael     3    B
James       5    D
Kevin       4    A
Dennis      7    C
Frank       NA   NA
Steven      NA   NA
Joseph      NA   NA
Elvis       2    B
Maria       1    B
With tidyverse you can do the following...
Use pivot_longer to put the data into long form. Make name an ordered factor with levels "B", "A", "C", "D". Then when you arrange by name, the first value within each person's group follows this priority order.
This assumes your missing data are NA. If they are instead blank character values, you can filter those out with filter(value != "") instead of drop_na(value).
library(tidyverse)
df %>%
  pivot_longer(cols = -Variables) %>%
  mutate(name = ordered(name, levels = c('B', 'A', 'C', 'D'))) %>%
  group_by(Variables) %>%
  drop_na(value) %>%
  arrange(name) %>%
  summarise(E = first(value),
            New_D = first(name)) %>%
  right_join(df)
Output
Variables E New_D A B C D
<chr> <dbl> <ord> <dbl> <dbl> <dbl> <dbl>
1 Amanda 3 A 3 NA NA NA
2 Carlos 9 A 9 NA NA NA
3 Dennis 7 C NA NA 7 NA
4 Elvis 2 B NA 2 NA NA
5 James 5 D NA NA NA 5
6 Jennifer 9 B NA 9 8 NA
7 John 4 B 2 4 NA NA
8 Kevin 4 A 4 NA NA NA
9 Maria 1 B NA 1 NA NA
10 Michael 3 B NA 3 NA NA
11 Mike 6 B NA 6 NA NA
12 Walter 7 B NA 7 NA NA
13 Frank NA NA NA NA NA NA
14 Steven NA NA NA NA NA NA
15 Joseph NA NA NA NA NA NA
Data
df <- structure(list(Variables = c("John", "Mike", "Walter", "Jennifer",
"Amanda", "Carlos", "Michael", "James", "Kevin", "Dennis", "Frank",
"Steven", "Joseph", "Elvis", "Maria"), A = c(2, NA, NA, NA, 3,
9, NA, NA, 4, NA, NA, NA, NA, NA, NA), B = c(4, 6, 7, 9, NA,
NA, 3, NA, NA, NA, NA, NA, NA, 2, 1), C = c(NA, NA, NA, 8, NA,
NA, NA, NA, NA, 7, NA, NA, NA, NA, NA), D = c(NA, NA, NA, NA,
NA, NA, NA, 5, NA, NA, NA, NA, NA, NA, NA)), class = "data.frame", row.names = c(NA,
-15L))
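
As a shorter alternative sketch (my own, not from the answer above), dplyr::coalesce can build E directly under the same fixed priority order B, A, C, D, with a row-wise lookup for the source column:
library(dplyr)
ord <- c("B", "A", "C", "D")
df$E <- coalesce(df$B, df$A, df$C, df$D)
# first non-NA column per row, in priority order; NA when the row is empty
df$New_D <- apply(df[ord], 1, function(r) {
  hit <- which(!is.na(r))
  if (length(hit)) ord[hit[1]] else NA_character_
})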

Find max value within a data frame interval

I have a dataframe that has x/y values every 5 seconds, with a depth value every second (time column). There is no depth where there is an x/y value.
x <- c("1430934", NA, NA, NA, NA, "1430939")
y <- c("4943206", NA, NA, NA, NA, "4943210")
time <- c(1:6)
depth <- c(NA, 10, 19, 84, 65, NA)
data <- data.frame(x, y, time, depth)
data
x y time depth
1 1430934 4943206 1 NA
2 NA NA 2 10
3 NA NA 3 19
4 NA NA 4 84
5 NA NA 5 65
6 1430939 4943210 6 NA
I would like to calculate the maximum depth between the non-NA x/y rows and add it to a new column in the row of the starting x/y values, i.e. the max depth of rows 2-5 here. An example of the desired output:
x y time depth newvar
1 1430934 4943206 1 NA 84
2 NA NA 2 10 NA
3 NA NA 3 19 NA
4 NA NA 4 84 NA
5 NA NA 5 65 NA
6 1430939 4943210 6 NA NA
This is to repeat whenever a new x/y value is present.
You can use ave, with cumsum(!is.na(data$x)) building the groups:
data$newvar <- ave(data$depth, cumsum(!is.na(data$x)),
                   FUN = function(x) if (all(is.na(x))) NA else
                     c(max(x, na.rm = TRUE), rep(NA, length(x) - 1)))
data
data
# x y time depth newvar
#1 1430934 4943206 1 NA 84
#2 <NA> <NA> 2 10 NA
#3 <NA> <NA> 3 19 NA
#4 <NA> <NA> 4 84 NA
#5 <NA> <NA> 5 65 NA
#6 1430939 4943210 6 NA NA
Using dplyr, we can create groups of every 5 rows and write the group maximum (ignoring NAs) into the first row of each group; the if guard keeps an all-NA group, such as the last one here, from returning -Inf:
library(dplyr)
df %>%
  group_by(grp = ceiling(time / 5)) %>%
  mutate(newvar = if (all(is.na(depth))) NA_real_ else
           ifelse(row_number() == 1, max(depth, na.rm = TRUE), NA_real_)) %>%
  ungroup() %>%
  select(-grp)
In base R, we can use tapply, again guarding all-NA groups:
inds <- seq(1, nrow(df), 5)
df$newvar <- NA
df$newvar[inds] <- tapply(df$depth, ceiling(df$time / 5),
                          function(x) if (all(is.na(x))) NA else max(x, na.rm = TRUE))
Maybe you can try ave like below
df <- within(df,
             newvar <- ave(depth,
                           ceiling(time/5),
                           FUN = function(x) ifelse(length(x) > 1 & is.na(x),
                                                    max(na.omit(x)), NA)))
such that
> df
x y time depth newvar
1 1430934 4943206 1 NA 84
2 NA NA 2 10 NA
3 NA NA 3 19 NA
4 NA NA 4 84 NA
5 NA NA 5 65 NA
6 1430939 4943210 6 NA NA
DATA
df <- structure(list(x = c(1430934L, NA, NA, NA, NA, 1430939L), y = c(4943206L,
NA, NA, NA, NA, 4943210L), time = 1:6, depth = c(NA, 10L, 19L,
84L, 65L, NA)), class = "data.frame", row.names = c("1", "2",
"3", "4", "5", "6"))
Here is another option using data.table:
library(data.table)
setDT(data)[, newvar := replace(frollapply(depth, 5L, max, na.rm = TRUE, align = "left"),
                                seq(.N) %% 5L != 1L, NA_integer_)]
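
Note that the ceiling(time/5) grouping assumes an x/y fix arrives exactly every 5 rows. As a sketch for irregular spacing (my own variation), the cumsum(!is.na(x)) grouping from the ave answer also works in dplyr:
library(dplyr)
data %>%
  group_by(grp = cumsum(!is.na(x))) %>%   # new group at each non-NA x/y row
  mutate(newvar = if (all(is.na(depth))) NA_real_ else
           ifelse(row_number() == 1, max(depth, na.rm = TRUE), NA_real_)) %>%
  ungroup() %>%
  select(-grp)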

With R, how to scan interval values and print out the matching subsets of another dataset?

Dataset 1:
dput(kk)
structure(list(V1 = c(1.05, NA, NA, NA, NA, NA, NA, NA, NA,
NA,
1.06, NA, NA, NA, NA, NA, NA, NA), V2 = c(NA, NA, 105.11, 105.12,
105.13, 105.14, 105.15, NA, 105.94, 105.99, NA, NA, 106.11, 106.12,
106.13, 106.14, 106.19, 106.2)), .Names = c("V1", "V2"), class = "data.frame", row.names = c(NA,
-18L))
show(kk)
V1 V2
1 1.05 NA
2 NA NA
3 NA 105.11
4 NA 105.12
5 NA 105.13
6 NA 105.14
7 NA 105.15
8 NA NA
9 NA 105.94
10 NA 105.99
11 1.06 NA
12 NA NA
13 NA 106.11
14 NA 106.12
15 NA 106.13
16 NA 106.14
17 NA 106.19
18 NA 106.20
Dataset 2:
structure(list(V1 = structure(1:4, .Label = c("1.05 ~ 1.06", "1.07",
"1.08", "1.09 ~ 1.10"), class = "factor")), .Names = "V1", class =
"data.frame", row.names = c(NA,
-4L))
V1
1 1.05 ~ 1.06
2 1.07
3 1.08
4 1.09 ~ 1.10
How can I scan the interval values of V1 in dataset 2 and print out the subsets of dataset 1 that each interval covers, as new datasets like the above?
If I understand correctly you are after something like this:
lapply(df2$V1, function(x) {
  z <- as.numeric(unlist(strsplit(as.character(x), split = " ~ ")))
  b <- which(df1$V1 %in% z)
  if (length(b) == 0) return(NULL)
  if (length(b) == 1) return(df1[b, ])
  if (length(b) == 2) return(df1[b[1]:b[2], ])
})
#result
[[1]]
V1 V2
1 1.05 NA
2 NA NA
3 NA 105.11
4 NA 105.12
5 NA 105.13
6 NA 105.14
7 NA 105.15
8 NA NA
9 NA 105.94
10 NA 105.99
11 1.06 NA
[[2]]
NULL
[[3]]
NULL
[[4]]
NULL
Explanation
With
lapply(df2$V1, function(x) ...)
you go through the elements of df2$V1 one by one and apply a function to each.
The function first splits the string at " ~ ", then converts to numeric after unlisting (strsplit returns a list, not a vector):
z <- as.numeric(unlist(strsplit(as.character(x), split = " ~ ")))
Then it determines which elements of df1$V1 are in z:
b <- which(df1$V1 %in% z)
If b has 0 elements, it returns NULL;
if b has 1 element, it returns the single matching row of df1;
if b has 2 elements, it returns the range of rows from b[1] to b[2].
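
One caveat (my own note, not from the answer): %in% relies on exact floating-point equality between df1$V1 and the parsed endpoints. A slightly more defensive sketch matches by numeric range instead:
# select the rows between the first and last df1$V1 value inside each interval
lapply(as.character(df2$V1), function(s) {
  z <- as.numeric(strsplit(s, " ~ ", fixed = TRUE)[[1]])
  b <- which(!is.na(df1$V1) & df1$V1 >= min(z) & df1$V1 <= max(z))
  if (length(b) == 0) return(NULL)
  df1[min(b):max(b), ]
})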

Lengthen an extremely large data frame with massive sparsity

I have a data frame such as this (but of size 16 billion):
structure(list(id1 = c(1, 2, 3, 4, 4, 4, 4, 4, 4, 4), id2 = c("a",
"b", "c", "d", "e", "f", "g", "h", "i", "j"), b1 = c(NA, NA,
NA, 1L, 1L, 1L, 1L, 1L, 1L, 1L), b2 = c(1, NA, NA, NA, NA, NA,
1, 1, 1, 1), b3 = c(NA, 1, NA, NA, NA, NA, NA, NA, 1, 1), b4 = c(NA,
NA, 1, NA, NA, NA, NA, NA, 1, 1)), .Names = c("id1", "id2", "b1",
"b2", "b3", "b4"), row.names = c(NA, 10L), class = "data.frame")
df
id1 id2 b1 b2 b3 b4
1 1 a NA 1 NA NA
2 2 b NA NA 1 NA
3 3 c NA NA NA 1
4 4 d 1 NA NA NA
5 4 e 1 NA NA NA
6 4 f 1 NA NA NA
7 4 g 1 1 NA NA
8 4 h 1 1 NA NA
9 4 i 1 1 1 1
10 4 j 1 1 1 1
I need to get it into long format while keeping ONLY the values of 1. Of course, I tried gather from tidyr and melt from data.table, to no avail, as their memory requirements are explosive. My original data had zeros and ones; I replaced the zeros with NA and hoped the na.rm = TRUE option would help with the memory issue, but it does not.
With just the ones retained and lengthened, my data frame will fit easily in the memory I have.
Is there a better way to get at this than the standard methods? Trading extra compute for a better memory fit is acceptable.
My desired output is the equivalent of:
library(dplyr)
library(tidyr)
df %>% gather(b, value, -id1, -id2, na.rm = TRUE)
id1 id2 b value
1 4 d b1 1
2 4 e b1 1
3 4 f b1 1
4 4 g b1 1
5 4 h b1 1
6 4 i b1 1
7 4 j b1 1
8 1 a b2 1
9 4 g b2 1
10 4 h b2 1
11 4 i b2 1
12 4 j b2 1
13 2 b b3 1
14 4 i b3 1
15 4 j b3 1
16 3 c b4 1
17 4 i b4 1
18 4 j b4 1
# or
reshape2::melt(df, id=c("id1","id2"), na.rm=TRUE)
# or
library(data.table)
melt(setDT(df), id=c("id1","id2"), na.rm=TRUE)
Currently, the call to gather on my full data set gives me this error, which I believe is due to memory issue:
Error in .Call("tidyr_melt_dataframe", PACKAGE = "tidyr", data, id_ind, :
negative length vectors are not allowed
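
One memory-light possibility (a base R sketch of my own, assuming the indicator columns are named b1 to b4 as in the example): walk the columns one at a time with which(), so only the row indices of the 1s are ever materialized rather than a full-length melted copy.
# lengthen column-by-column, keeping only rows equal to 1 (which() skips NAs)
lengthen_ones <- function(df, id_cols = c("id1", "id2")) {
  b_cols <- setdiff(names(df), id_cols)
  pieces <- lapply(b_cols, function(col) {
    idx <- which(df[[col]] == 1)
    if (length(idx) == 0) return(NULL)
    cbind(df[idx, id_cols, drop = FALSE],
          data.frame(b = col, value = 1))
  })
  do.call(rbind, pieces)
}
lengthen_ones(df)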

Last observation carried forward conditional on multiple columns

I have a dataset with this structure:
ID = c(1,1,1,1,2,2,2,3,3,3,3)
L40 = c(1, NA, NA, NA, 1, NA, NA, NA, 1, NA, NA)
K50 = c(NA, NA, NA, NA, NA, 1, NA, NA, NA, NA, 1)
df = data.frame(ID, L40, K50)
# ID L40 K50
# 1 1 1 NA
# 2 1 NA NA
# 3 1 NA NA
# 4 1 NA NA
# 5 2 1 NA
# 6 2 NA 1
# 7 2 NA NA
# 8 3 NA NA
# 9 3 1 NA
# 10 3 NA NA
# 11 3 NA 1
When missing values occur in columns L40 and K50, I want to carry forward the last non-missing value in that column, conditional on ID being the same as the previous ID and the values in L40 and K50 in the current row being empty. I applied the following code:
library(dplyr)
library(tidyr)
df2 <- df %>% group_by(ID) %>% fill(L40:K50)
This does not achieve what I am looking for. I want the previous non-missing value to be carried forward into the next row only when the other columns (except ID) in that row are empty. This is what I want:
ID = c(1,1,1,1,2,2,2,3,3,3,3)
L40 = c(1, 1, 1, 1, 1, NA, NA, NA, 1, 1, NA)
K50 = c(NA, NA, NA, NA, NA, 1, 1, NA, NA, NA, 1)
df3 = data.frame(ID, L40, K50)
df3
# ID L40 K50
# 1 1 1 NA
# 2 1 1 NA
# 3 1 1 NA
# 4 1 1 NA
# 5 2 1 NA
# 6 2 NA 1
# 7 2 NA 1
# 8 3 NA NA
# 9 3 1 NA
# 10 3 1 NA
# 11 3 NA 1
We can use na.locf (passing na.rm = FALSE so leading NAs are kept and all columns within a group stay the same length):
library(data.table)
library(zoo)
setDT(df)[, if(any(is.na(K50[-1]))) lapply(.SD, na.locf, na.rm = FALSE) else .SD, by = ID]
# ID L40 K50
#1: 1 1 NA
#2: 1 1 NA
#3: 1 1 NA
#4: 1 1 NA
#5: 2 1 NA
#6: 2 NA 1
#7: 3 NA 1
#8: 3 NA 1
#9: 3 NA 1
An option using dplyr would be
library(dplyr)
df %>%
  mutate(ind = rowSums(is.na(.))) %>%
  group_by(ID) %>%
  mutate_each(funs(if(any(ind > 1)) na.locf(., na.rm = FALSE) else .), L40:K50) %>%
  select(-ind)
# ID L40 K50
# <dbl> <dbl> <dbl>
#1 1 1 NA
#2 1 1 NA
#3 1 1 NA
#4 1 1 NA
#5 2 1 NA
#6 2 NA 1
#7 3 NA 1
#8 3 NA 1
#9 3 NA 1
I played around with this question for a while, and with my limited knowledge of R I came up with the following workaround. I have added a date column to the original data frame for the purpose of illustration:
ID = c(1,1,1,1,2,2,2,3,3,3,3)
date = c(1,2,3,4,1,2,3,1,2,3,4)
L40 = c(1, 1, NA, NA, 1, NA, NA, NA, 1, NA, NA)
K50 = c(NA, 1, 1, NA, NA, 1, NA, NA, NA, NA, 1)
df = data.frame(ID, date, L40, K50)
Here is what I did:
#gather the diagnosis columns into rows and keep only those rows where the patient has the associated diagnosis.
df1 <- df %>% gather(diagnos, dummy, L40:K50) %>% filter(dummy==1) %>% arrange(ID, date)
#concatenate across rows by ID and date to collect all diagnoses of an ID at a particular date.
df2 <- df1 %>% group_by(ID, date) %>% mutate(diag = paste(diagnos, collapse=" ")) %>% select(-diagnos, -dummy)
#convert into data tables in preparation for join
Dt1 <- data.table(df)
Dt2 <- data.table(df2)
setkey(Dt1, ID, date)
setkey(Dt2, ID, date)
#Each observation in Dt1 is matched with the observation in Dt2 with the same date or, if that particular date is not present,
#by the nearest previous date:
final <- Dt2[Dt1, roll=TRUE] %>% distinct()
This carries forward the name(s) of the diagnosis until the next observed diagnosis.
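
For reference, here is a plain-loop sketch (my own, not from the thread; fill_conditional is a hypothetical helper) that implements the stated rule directly and reproduces df3: carry values forward only into rows where every diagnosis column is NA and the ID matches the previous row.
# conditional LOCF: copy the previous row only when the current row is
# completely empty (all diagnosis columns NA) and belongs to the same ID
fill_conditional <- function(d, cols = c("L40", "K50")) {
  for (i in seq_len(nrow(d))[-1]) {
    if (d$ID[i] == d$ID[i - 1] && all(is.na(d[i, cols]))) {
      d[i, cols] <- d[i - 1, cols]
    }
  }
  d
}
fill_conditional(df)   # with the question's original df, this matches df3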
