Combine two similar columns in R

I'm trying to combine two columns of data that essentially contain the same information, but each column is missing values that the other has. Column "wasiIQw1" holds the data for half of the group, while column "w1iq" holds the data for the other half.
select(gadd.us,nidaid,wasiIQw1,w1iq)[1:10,]
nidaid wasiIQw1 w1iq
1 45-D11150341 104 NA
2 45-D11180321 82 NA
3 45-D11220022 93 93
4 45-D11240432 118 NA
5 45-D11270422 99 NA
6 45-D11290422 82 82
7 45-D11320321 99 99
8 45-D11500021 99 99
9 45-D11500311 95 95
10 45-D11520011 111 111
select(gadd.us,nidaid,wasiIQw1,w1iq)[384:394,]
nidaid wasiIQw1 w1iq
384 H1900442S NA 62
385 H1930422S NA 83
386 H1960012S NA 89
387 H1960321S NA 90
388 H2020011S NA 96
389 H2020422S NA 102
390 H2040011S NA 102
391 H2040331S NA 94
392 H2040422S NA 103
393 H2050051S NA 86
394 H2050341S NA 98
With the following code I joined df.a (a data frame with the id and wasiIQw1) to df.b (a data frame with the id and w1iq) and got the following results.
df.join <- semi_join(df.a, df.b, by = "nidaid")
nidaid w1iq
1 45-D11150341 NA
2 45-D11180321 NA
3 45-D11220022 93
4 45-D11240432 NA
5 45-D11270422 NA
6 45-D11290422 82
7 45-D11320321 99
8 45-D11500021 99
9 45-D11500311 95
10 45-D11520011 111
nidaid w1iq
384 H1900442S 62
385 H1930422S 83
386 H1960012S 89
387 H1960321S 90
388 H2020011S 96
389 H2020422S 102
390 H2040011S 102
391 H2040331S 94
392 H2040422S 103
393 H2050051S 86
394 H2050341S 98
All of this works except for the first four NAs, which won't merge. Other _join functions from dplyr have not worked either. Do you have any tips for combining these two columns so that no data is lost and every NA is filled in when the other column has a value?

I guess you can use coalesce() here, which finds the first non-missing value at each position.
library(dplyr)
gadd.us %>% mutate(w1iq = coalesce(w1iq, wasiIQw1))
This selects the value from w1iq if it is present; if w1iq is NA, it takes the value from wasiIQw1. You can switch the positions of w1iq and wasiIQw1 if you want to give priority to wasiIQw1.
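A quick toy illustration (made-up vectors, not the original data) of how the argument order sets the priority:
library(dplyr)
w1iq     <- c(NA, 82, 93)
wasiIQw1 <- c(104, NA, 93)
coalesce(w1iq, wasiIQw1)
# [1] 104  82  93   (NAs in w1iq filled from wasiIQw1)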

Here is a way to do it with base R (no packages).
Create reproducible data:
> dat<-data.frame(nidaid=paste0("H",c(1:5)), wasiIQw1=c(NA,NA,NA,75,9), w1iq=c(44,21,46,75,NA))
>
> dat
nidaid wasiIQw1 w1iq
1 H1 NA 44
2 H2 NA 21
3 H3 NA 46
4 H4 75 75
5 H5 9 NA
Create a new column named new to combine the two. With this ifelse() statement, we say: if the first column wasiIQw1 is not (!) NA (is.na()), take its value; otherwise take the value from the second column. As in Ronak's answer, you can switch the column names here to give one preference over the other.
> dat$new<-ifelse(!is.na(dat$wasiIQw1), dat$wasiIQw1, dat$w1iq)
>
> dat
nidaid wasiIQw1 w1iq new
1 H1 NA 44 44
2 H2 NA 21 21
3 H3 NA 46 46
4 H4 75 75 75
5 H5 9 NA 9

Using base R, we can do
gadd.us$w1iq <- with(gadd.us, pmax(w1iq, wasiIQw1, na.rm = TRUE))
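One caveat: pmax() takes the element-wise maximum (ignoring NAs via na.rm = TRUE), so it only matches coalesce() when the two columns agree wherever both are present, as they do in this data. A toy sketch of the difference:
x <- c(1, NA, 5)
y <- c(2, 3, NA)
pmax(x, y, na.rm = TRUE)  # 2 3 5 -- the larger value wins where both exist
dplyr::coalesce(x, y)     # 1 3 5 -- x wins wherever it is present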

Related

R impute with Kalman on large data

I have a large dataset: 4,666,972 obs. of 5 variables.
I want to impute one column, MPR, with the Kalman method within each group.
> str(dt)
Classes ‘data.table’ and 'data.frame': 4666972 obs. of 5 variables:
$ Year : int 1999 2000 2001 1999 2000 2001 1999 2000 2001 1999 ...
$ State: int 1 1 1 1 1 1 1 1 1 1 ...
$ CC : int 1 1 1 1 1 1 1 1 1 1 ...
$ ID : chr "1" "1" "1" "2" ...
$ MPR : num 54 54 55 52 52 53 60 60 65 70 ...
I tried the code below but it crashed after a while.
> library(imputeTS)
> data.table::setDT(dt)[, MPR_kalman := with(dt, ave(MPR, State, CC, ID, FUN=na_kalman))]
I don't know how to improve the time efficiency and impute successfully without crashing.
Is it better to split the dataset by ID into a list and impute each element with a for loop?
> length(unique(hpms_S3$Section_ID))
[1] 668184
> split(dt, dt$ID)
However, I don't think this will save much memory or avoid the crash: after splitting the dataset into 668,184 list elements, I would have to impute each one and then combine them back into a single dataset at the end.
Is there a better way to do this, or how can I optimize my code?
I provide the simple sample here:
# dt
Year State CC ID MPR
2002 15 3 3 NA
2003 15 3 3 NA
2004 15 3 3 193
2005 15 3 3 193
2006 15 3 3 348
2007 15 3 3 388
2008 15 3 3 388
1999 53 33 1 NA
2000 53 33 1 NA
2002 53 33 1 NA
2003 53 33 1 NA
2004 53 33 1 NA
2005 53 33 1 170
2006 53 33 1 170
2007 53 33 1 330
2008 53 33 1 330
EDIT:
As @r2evans mentioned in a comment, I modified the code:
> setDT(dt)[, MPR_kalman := ave(MPR, State, CC, ID, FUN=na_kalman), by = .(State, CC, ID)]
Error in optim(init[mask], getLike, method = "L-BFGS-B", lower = rep(0, :
L-BFGS-B needs finite values of 'fn'
I got the error above. I found the post here with discussions of this error. However, even if I use na_kalman(MPR, type = 'level'), I still get the error. I think there might be some repeated values within groups, which produces the error.
Perhaps splitting should be done using data.table's by= operator, which is likely more efficient.
Since I don't have imputeTS installed (there are several nested dependencies I don't have), I'll fake imputation using zoo::na.locf, both forward/backwards. I'm not suggesting this be your imputation mechanism, I'm using it to demonstrate a more-common pattern with data.table.
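# carry the last observation forward, then backward, so every NA gets filled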
myimpute <- function(z) zoo::na.locf(zoo::na.locf(z, na.rm = FALSE), fromLast = TRUE, na.rm = FALSE)
Now some equivalent calls, one with your with(dt, ...) and my alternatives (which are really a walk-through leading up to my ultimate suggestion, number 5):
dt[, MPR_kalman1 := with(dt, ave(MPR, State, CC, ID, FUN = myimpute))]
dt[, MPR_kalman2 := with(.SD, ave(MPR, State, CC, ID, FUN = myimpute))]
dt[, MPR_kalman3 := with(.SD, ave(MPR, FUN = myimpute)), by = .(State, CC, ID)]
dt[, MPR_kalman4 := ave(MPR, FUN = myimpute), by = .(State, CC, ID)]
dt[, MPR_kalman5 := myimpute(MPR), by = .(State, CC, ID)]
# Year State CC ID MPR MPR_kalman1 MPR_kalman2 MPR_kalman3 MPR_kalman4 MPR_kalman5
# 1: 2002 15 3 3 NA 193 193 193 193 193
# 2: 2003 15 3 3 NA 193 193 193 193 193
# 3: 2004 15 3 3 193 193 193 193 193 193
# 4: 2005 15 3 3 193 193 193 193 193 193
# 5: 2006 15 3 3 348 348 348 348 348 348
# 6: 2007 15 3 3 388 388 388 388 388 388
# 7: 2008 15 3 3 388 388 388 388 388 388
# 8: 1999 53 33 1 NA 170 170 170 170 170
# 9: 2000 53 33 1 NA 170 170 170 170 170
# 10: 2002 53 33 1 NA 170 170 170 170 170
# 11: 2003 53 33 1 NA 170 170 170 170 170
# 12: 2004 53 33 1 NA 170 170 170 170 170
# 13: 2005 53 33 1 170 170 170 170 170 170
# 14: 2006 53 33 1 170 170 170 170 170 170
# 15: 2007 53 33 1 330 330 330 330 330 330
# 16: 2008 53 33 1 330 330 330 330 330 330
All five calls produce the same results, but the last preserves many of the memory efficiencies that can make data.table preferred.
The use of with(dt, ...) is an anti-pattern in one case, and a real risk in another. For the "risk" part, realize that data.table can do a lot of things behind the scenes so that the calculations/function calls within the j= component (second argument) only see data that is relevant. A clear example is grouping, but another (unrelated to this) data.table example is conditional replacement, as in dt[is.na(x), x := -1]. With a reference to the entire table dt inside of j=, if there is ever something in the first argument (a conditional replacement) or a by= argument, then it can fail.
MPR_kalman2 mitigates this by using .SD, which is data.table's way of replacing the data-to-be-used with the "Subset of the Data" (ref). But it's still not taking advantage of data.table's significant efficiencies in dealing in-memory with groups.
MPR_kalman3 improves on this by grouping outside, still using with() but (unlike 2) in a friendlier way.
MPR_kalman4 removes the use of with(), since the MPR visible to ave() is only that of each group anyway. And when you think about it, since ave() is given no grouping variables, it just passes all of the MPR data straight through to myimpute(). From this we get MPR_kalman5, a direct method that follows the normal patterns of data.table.
While I don't know that it will mitigate your crashing, it is intended very much to be memory-efficient (in data.table's ways).
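For the real imputation, the same by-group pattern should carry over directly. A sketch, untested since I don't have imputeTS; the guard against small groups, and its threshold of two non-NA values, are assumptions intended to sidestep the L-BFGS-B error mentioned above:
library(imputeTS)
# impute per group; leave groups with too few observed values untouched
dt[, MPR_kalman := if (sum(!is.na(MPR)) > 2) na_kalman(MPR) else MPR,
   by = .(State, CC, ID)]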

What is a memory-efficient method to spread then gather columns? (see example)

I'm trying to rearrange my data for downstream processing. I found a way to accomplish what I want, but it is memory-intensive and I'm sure there is a more-efficient way.
Here is an example from the data:
X.1 Label X
81 81 21 367.138
82 82 21 384.295
83 83 21 159.496
84 84 21 269.927
85 85 22 364.118
86 86 22 154.475
87 87 22 265.861
I want to rearrange the data to create a table of X values for each separate object, as shown below:
1 2 3 4
1 367.138 384.295 159.496 269.927
2 364.118 154.475 265.861 NA
I can do this just fine using spread, apply, and ldply functions shown below:
X <- apply(tidyr::spread(X, Label, X), 2, function(x) na.omit(x))
X <- X[-1]
X <- plyr::ldply(X, rbind)
X <- as.data.frame(X[-1])
Here's the problem: the spread function generates the following table as an intermediate step:
X.1 1 2
1 81 367.138 NA
2 82 384.295 NA
3 83 159.496 NA
4 84 269.927 NA
5 85 NA 364.118
6 86 NA 154.475
7 87 NA 265.861
This is fine for small data sets, but for large data sets the table generated is huge and I'm running out of memory which produces the following error:
Error: cannot allocate vector of size 8.4 Gb
I'm sure there must be a more efficient way of doing this without generating that massive intermediate table. Any ideas?
An option using data.table
dcast(DT, rleid(Label) ~ rowid(Label), value.var = "X")
# Label 1 2 3 4
#1: 1 367.138 384.295 159.496 269.927
#2: 2 364.118 154.475 265.861 NA
data
library(data.table)
DT <- fread(text = " X.1 Label X
81 21 367.138
82 21 384.295
83 21 159.496
84 21 269.927
85 22 364.118
86 22 154.475
87 22 265.861")
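Here rleid(Label) assigns a run id to each consecutive block of equal Label values (giving the output rows), while rowid(Label) numbers the rows within each Label (giving the output columns); dcast() then pivots on those two keys. A small illustration on the sample data:
rleid(DT$Label)  # 1 1 1 1 2 2 2  -> row key
rowid(DT$Label)  # 1 2 3 4 1 2 3  -> column key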

Efficient way to add multiple columns to weekly data in data.table, based on other values of columns

I have data with this structure:
a <- data.table(week = 1:52, price = 101:152)
a <- a[rep(1:nrow(a), each = 12),]
a$index_in_week <- 1:12
How do I efficiently create 12 new columns that hold the prices for the next 12 weeks? For each week we have 12 rows of data, with an index column that always runs from 1 to 12. The new columns should contain the prices of the following 12 weeks starting from the current one, stepping one week at a time. For example, for week 1 the first new column will have the prices of weeks 1 to 12, the second column will have weeks 2 to 13, and so on.
I.e., here is how one can create the first two columns:
a$price_for_week_1 <- apply(a, 1, function(y) {
  return(head(a[week == (y[[1]] + y[[3]] - 1), price], 1))
})
a$price_for_week_2 <- apply(a, 1, function(y) {
  return(head(a[week == (y[[1]] + y[[3]] + 0), price], 1))
})
Here is an example of a for loop:
for (i in 1:12) {
  inside_i <- -2 + i
  a[, paste0('PRICE_WEEK_', i) := apply(a, 1, function(y) {
    return(head(a[week == (y[[1]] + y[[3]] + inside_i), price], 1))
  })]
}
The ways I see to do it (e.g., a for loop or the apply family) consume too much time, and I need efficiency.
What would be the way with data.table or maybe, as all columns are integer, some funky matrix operations?
P.s. I couldn't come up with better title, my apologies.
If I understand correctly, the OP wants to create a table for 52 weeks (rows) where the prices for the subsequent 12 weeks are printed horizontally.
For this, it is not necessary to create a data.table of 12 x 52 = 624 rows with an index_in_week helper column. docendo discimus has suggested applying the shift() function to the enlarged (624-row) data.table.
Instead, the shift() function can be applied directly to the data.table which contains weeks and prices (52 rows).
library(data.table)
a <- data.table(week = 1:52, price = 101:152)
print(a, nrows = 20L)
week price
1: 1 101
2: 2 102
3: 3 103
4: 4 104
5: 5 105
---
48: 48 148
49: 49 149
50: 50 150
51: 51 151
52: 52 152
a[, sprintf("wk%02i", 1:12) := shift(price, n = 0:11, type = "lead")]
print(a, nrows = 20L)
week price wk01 wk02 wk03 wk04 wk05 wk06 wk07 wk08 wk09 wk10 wk11 wk12
1: 1 101 101 102 103 104 105 106 107 108 109 110 111 112
2: 2 102 102 103 104 105 106 107 108 109 110 111 112 113
3: 3 103 103 104 105 106 107 108 109 110 111 112 113 114
4: 4 104 104 105 106 107 108 109 110 111 112 113 114 115
5: 5 105 105 106 107 108 109 110 111 112 113 114 115 116
---
48: 48 148 148 149 150 151 152 NA NA NA NA NA NA NA
49: 49 149 149 150 151 152 NA NA NA NA NA NA NA NA
50: 50 150 150 151 152 NA NA NA NA NA NA NA NA NA
51: 51 151 151 152 NA NA NA NA NA NA NA NA NA NA
52: 52 152 152 NA NA NA NA NA NA NA NA NA NA NA
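The trick is that shift() accepts a vector for n= and returns a list of shifted vectors, which := then assigns to the 12 new columns in one call. A quick illustration:
shift(1:5, n = 0:2, type = "lead")
# [[1]] 1 2 3 4 5
# [[2]] 2 3 4 5 NA
# [[3]] 3 4 5 NA NA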

Adding data frame below another data frame [duplicate]

This question already has answers here:
Combine two data frames by rows (rbind) when they have different sets of columns
(14 answers)
Closed 5 years ago.
I want to do the following:
I have an Actual Sales data frame:
Dates Actual
24/04/2017 58
25/04/2017 59
26/04/2017 58
27/04/2017 154
28/04/2017 117
29/04/2017 127
30/04/2017 178
Another data frame of Predicted values
Dates Predicted
01/05/2017 68.54159
02/05/2017 90.7313
03/05/2017 82.76875
04/05/2017 117.48913
05/05/2017 110.3809
06/05/2017 156.53363
07/05/2017 198.14819
Add the Predicted Sales data frame below the Actual data frame in the following manner:
Dates Actual Predicted
24/04/2017 58
25/04/2017 59
26/04/2017 58
27/04/2017 154
28/04/2017 117
29/04/2017 127
30/04/2017 178
01/05/2017 68.54159
02/05/2017 90.7313
03/05/2017 82.76875
04/05/2017 117.48913
05/05/2017 110.3809
06/05/2017 156.53363
07/05/2017 198.14819
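For a reproducible test, the two data frames can be rebuilt like this (d1 and d2 are the names the answers below assume):
d1 <- data.frame(Dates = c("24/04/2017", "25/04/2017", "26/04/2017", "27/04/2017",
                           "28/04/2017", "29/04/2017", "30/04/2017"),
                 Actual = c(58, 59, 58, 154, 117, 127, 178))
d2 <- data.frame(Dates = c("01/05/2017", "02/05/2017", "03/05/2017", "04/05/2017",
                           "05/05/2017", "06/05/2017", "07/05/2017"),
                 Predicted = c(68.54159, 90.7313, 82.76875, 117.48913,
                               110.3809, 156.53363, 198.14819))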
With:
library(dplyr)
bind_rows(d1, d2)
you get:
Dates Actual Predicted
1 24/04/2017 58 NA
2 25/04/2017 59 NA
3 26/04/2017 58 NA
4 27/04/2017 154 NA
5 28/04/2017 117 NA
6 29/04/2017 127 NA
7 30/04/2017 178 NA
8 01/05/2017 NA 68.54159
9 02/05/2017 NA 90.73130
10 03/05/2017 NA 82.76875
11 04/05/2017 NA 117.48913
12 05/05/2017 NA 110.38090
13 06/05/2017 NA 156.53363
14 07/05/2017 NA 198.14819
Or with:
library(data.table)
rbindlist(list(d1,d2), fill = TRUE)
Or with:
library(plyr)
rbind.fill(d1,d2)
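Or, without any packages, a minimal base R sketch: add the missing column to each data frame, then rbind() them with the columns aligned:
d1$Predicted <- NA
d2$Actual <- NA
rbind(d1, d2[names(d1)])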

How to add rownames with no dimensions in R

> Cases <- c(4,46,98,115,88,34)
> Cases
[1] 4 46 98 115 88 34
> str(Cases)
num [1:6] 4 46 98 115 88 34
I want to name the row "total.cases", but I got the error "attempt to set rownames with no dimensions". Please see the expected output below:
total.cases 4 46 98 115 88 34
Your problem is that Cases as you define it is an atomic vector. There is no concept of rows or columns.
I think you probably want a list
Cases <- list(total.cases = c(4,46,98,115,88,34))
Cases
## $total.cases
## [1] 4 46 98 115 88 34
str(Cases)
## List of 1
## $ total.cases: num [1:6] 4 46 98 115 88 34
Do you want to print the output in a particular way or do you actually want rownames?
To print Cases how you want, you could just use:
> cat("total.cases ",Cases,"\n")
total.cases 4 46 98 115 88 34
To assign a rowname, you need to actually have rows first. A vector (like Cases) doesn't have rows or columns as dimensions. You could, however, convert it to a matrix:
> matrix(Cases,nrow=1,dimnames=list("total.cases",1:length(Cases)))
1 2 3 4 5 6
total.cases 4 46 98 115 88 34
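If a data frame is preferred over a matrix, a similar sketch works on the original numeric vector (not the list version defined above):
Cases <- c(4, 46, 98, 115, 88, 34)
df <- as.data.frame(t(Cases))  # one row; default column names V1..V6
rownames(df) <- "total.cases"
df
##             V1 V2 V3  V4 V5 V6
## total.cases  4 46 98 115 88 34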
