Subsetting to drop rows where df$var=0 produces NA rows where var is NA - r

I have a data.frame from which I'm attempting to eliminate some observations. I want to drop any row in which out$SUB_AGE equals 0. However, when I subset the data frame on that condition, every row that has NA for out$SUB_AGE is turned into a row of NAs. The dput below doesn't actually contain any rows where out$SUB_AGE == 0, but it behaves exactly the same as the full dataset, which does contain zeroes.
# dput the data
> temp <- dput(droplevels(out[1:12, 1:4]))
structure(list(SUB_ID = c(5998784L, 6805295L, 318926L, 1270965L,
1635543L, 4296301L, 1001498L, 2388387L, 2190957L, 4168048L, 318926L,
4073180L), ORG_ID = c(10861L, 17361L, 10608L, 11099L, 13135L,
14803L, 12359L, 13151L, 13135L, 17252L, 10608L, 17317L), SUB_AGE = c(36,
NA, NA, 40, 60, 50, 52, 61, 56, 62, NA, NA), SUB_SEX = c(NA,
1, 2, 1, 2, 2, 1, 2, 2, NA, 2, 2)), .Names = c("SUB_ID", "ORG_ID",
"SUB_AGE", "SUB_SEX"), row.names = c(107L, 190L, 242L, 331L,
361L, 447L, 455L, 591L, 663L, 664L, 731L, 732L), class = "data.frame")
# table before subsetting
SUB_ID ORG_ID SUB_AGE SUB_SEX
107 5998784 10861 36 NA
190 6805295 17361 NA 1
242 318926 10608 NA 2
331 1270965 11099 40 1
361 1635543 13135 60 2
447 4296301 14803 50 2
455 1001498 12359 52 1
591 2388387 13151 61 2
663 2190957 13135 56 2
664 4168048 17252 62 NA
731 318926 10608 NA 2
732 4073180 17317 NA 2
# code to subset
temp <- temp[temp$SUB_AGE != 0,]
# table after subsetting
SUB_ID ORG_ID SUB_AGE SUB_SEX
107 5998784 10861 36 NA
NA NA NA NA NA
NA.1 NA NA NA NA
331 1270965 11099 40 1
361 1635543 13135 60 2
447 4296301 14803 50 2
455 1001498 12359 52 1
591 2388387 13151 61 2
663 2190957 13135 56 2
664 4168048 17252 62 NA
NA.2 NA NA NA NA
NA.3 NA NA NA NA
I'm sure there's something simple I'm missing here, but I've racked my brain and apparently couldn't come up with the right combination of search keywords to figure it out myself.

To understand the problem, try printing temp$SUB_AGE != 0:
[1] TRUE NA NA TRUE TRUE TRUE TRUE TRUE TRUE TRUE NA NA
You're using this vector to subset temp, but logical subsetting only works cleanly with TRUE/FALSE values: where the condition is NA, you get a row of NAs back. If you want to keep all the rows with NA values, you can add an extra condition:
temp[temp$SUB_AGE != 0 | is.na(temp$SUB_AGE),]
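Conversely, if you'd rather drop the NA rows along with the zeroes, which() and subset() are the usual idioms, since both treat an NA comparison as a non-match instead of returning a row of NAs. A quick sketch:
# which() returns only the indices where the condition is TRUE,
# so rows with NA comparisons are dropped rather than turned into NA rows
temp[which(temp$SUB_AGE != 0), ]
# subset() behaves the same way
subset(temp, SUB_AGE != 0)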

Related

Conditional filling NA rows with comparing non-NA labeled rows

I want to fill the NA rows based on the differences between the closest non-NA labeled rows. For instance:
data <- data.frame(sd_value = c(34, 33, 34, 37, 36, 45),
                   value = c(383, 428, 437, 455, 508, 509),
                   label = c("bad", rep(NA, 4), "unable"))
> data
sd_value value label
1 34 383 bad
2 33 428 <NA>
3 34 437 <NA>
4 37 455 <NA>
5 36 508 <NA>
6 45 509 unable
I want to work out how to fill the NA rows by checking the differences in sd_value and value relative to the closest bad and unable rows.
If we want to get the differences between the rows, we can do:
library(dplyr)
data %>%
  mutate(diff_val = c(0, diff(value)), diff_sd_val = c(0, diff(sd_value)))
sd_value value label diff_val diff_sd_val
1 34 383 bad 0 0
2 33 428 <NA> 45 -1
3 34 437 <NA> 9 1
4 37 455 <NA> 18 3
5 36 508 <NA> 53 -1
6 45 509 unable 1 9
The condition for labeling the NA rows is:
if diff_val < 50 and diff_sd_val < 9, label them with the last non-NA label; otherwise use the first non-NA label after the last NA row.
So that the expected output would be
sd_value value label diff_val diff_sd_val
1 34 383 bad 0 0
2 33 428 bad 45 -1
3 34 437 bad 9 1
4 37 455 bad 18 3
5 36 508 unable 53 -1
6 45 509 unable 1 9
The possible solution I cooked up so far:
custom_labelling <- function(x, y, label){
  diff_sd_val <- c(NA, diff(x))
  diff_val <- c(NA, diff(y))
  label <- NA
  for (i in 1:length(label)){
    if(is.na(label[i]) & diff_sd_val < 9 & diff_val < 50){
      label[i] <- label
    }
    else {
      label <- label[i]
    }
  }
  return(label)
}
which gives
data%>%
mutate(diff_val=c(0,diff(value)), diff_sd_val=c(0,diff(sd_value)))%>%
mutate(custom_label=custom_labelling(sd_value,value,label))
Error in mutate_impl(.data, dots) :
Evaluation error: missing value where TRUE/FALSE needed.
In addition: Warning message:
In if (is.na(label[i]) & diff_sd_val < 9 & diff_val < 50) { :
the condition has length > 1 and only the first element will be used
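The error itself comes from if(), which expects a single non-NA TRUE or FALSE: a vector condition uses only its first element (the warning above; an outright error in R >= 4.2), and an NA condition fails as shown. A minimal illustration:
# if() cannot decide a branch on a missing value
if (NA) 1
# Error in if (NA) 1 : missing value where TRUE/FALSE needed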
One option is to find the NA and non-NA indices and, based on the condition, select the label closest to each NA row.
library(dplyr)
# Create a new dataframe with diff_val and diff_sd_val
data1 <- data %>% mutate(diff_val = c(0, diff(value)), diff_sd_val = c(0, diff(sd_value)))
# Get the NA indices
NA_inds <- which(is.na(data1$label))
# Get the non-NA indices
non_NA_inds <- setdiff(1:nrow(data1), NA_inds)
# For every NA index
for (i in NA_inds) {
  # Check the condition
  if (data1$diff_sd_val[i] < 9 & data1$diff_val[i] < 50)
    # Get the last non-NA label
    data1$label[i] <- data1$label[non_NA_inds[which.max(i > non_NA_inds)]]
  else
    # Get the first non-NA label after the last NA value
    data1$label[i] <- data1$label[non_NA_inds[i < non_NA_inds]]
}
data1
data1
# sd_value value label diff_val diff_sd_val
#1 34 383 bad 0 0
#2 33 428 bad 45 -1
#3 34 437 bad 9 1
#4 37 455 bad 18 3
#5 36 508 unable 53 -1
#6 45 509 unable 1 9
You can remove diff_val and diff_sd_val columns later if not needed.
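For instance, a minimal sketch in base R (or with dplyr's select):
# drop the helper columns once the labels are filled in
data1[c("diff_val", "diff_sd_val")] <- NULL
# or: data1 %>% select(-diff_val, -diff_sd_val)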
We can also create a function
custom_label <- function(label, diff_val, diff_sd_val) {
NA_inds <- which(is.na(label))
non_NA_inds <- setdiff(1:length(label), NA_inds)
new_label = label
for (i in NA_inds) {
if(diff_sd_val[i] < 9 & diff_val[i] < 50)
new_label[i] <- label[non_NA_inds[which.max(i > non_NA_inds)]]
else
new_label[i] <- label[non_NA_inds[i < non_NA_inds]]
}
return(new_label)
}
and then apply it
data %>%
  mutate(diff_val = c(0, diff(value)),
         diff_sd_val = c(0, diff(sd_value)),
         new_label = custom_label(label, diff_val, diff_sd_val))
# sd_value value label diff_val diff_sd_val new_label
#1 34 383 bad 0 0 bad
#2 33 428 <NA> 45 -1 bad
#3 34 437 <NA> 9 1 bad
#4 37 455 <NA> 18 3 bad
#5 36 508 <NA> 53 -1 unable
#6 45 509 unable 1 9 unable
If we want to apply it by group, we can add a group_by statement and it should work (the group column below is illustrative; the example data doesn't include one).
data %>%
  group_by(group) %>%
  mutate(diff_val = c(0, diff(value)),
         diff_sd_val = c(0, diff(sd_value)),
         new_label = custom_label(label, diff_val, diff_sd_val))

Adding columns by splitting number, and removing duplicates

I have a dataframe like the following (this is a reduced example, I have many more rows and columns):
CH1 CH2 CH3
1 3434 282 7622
2 4442 6968 8430
3 4128 6947 478
4 6718 6716 3017
5 3735 9171 1128
6 65 4876 4875
7 9305 6944 3309
8 4283 6060 650
9 5588 2285 203
10 205 2345 9225
11 8634 4840 780
12 6383 0 1257
13 4533 7692 3760
14 9363 9846 4697
15 3892 79 4372
16 6130 5312 9651
17 7880 7386 6239
18 8515 8021 2295
19 1356 74 8467
20 9024 8626 4136
I need to create additional columns by splitting the values. For example, the value 1356 would have to be split into 6, 56, and 356. I do this in a for loop, splitting by string, which keeps any leading zeros. So far, decent.
# CREATE ADDITIONAL COLUMNS
library(stringr)
for (col in 1:3) {
  # Create a temporal variable
  temp <- as.character(data[, col])
  # Save the new column
  for (mod in c(-1, -2, -3)) {
    # Create the column
    temp <- cbind(temp, str_sub(as.character(data[, col]), mod))
  }
  # Merge to the row
  data <- cbind(data, temp)
}
My problem is that not all cells have 4 digits: some have 1, 2, or 3. Therefore, I get repeated values when I split. For example, for 79 I get: 79 (the original), 9, 79, and 79.
Problem: I need to remove the repeated values. Of course, I could use unique, but that gives me rows with uneven numbers of columns. I need to fill those gaps (i.e. the removed repeated values) with NA. The comparison can only be done within each row.
I checked CJ Yetman's answer here, but they only replace consecutive numbers. I only need to keep unique values.
Reproducible Example: Here is a fiddle with my code working: http://rextester.com/IKMP73407
Expected outcome: For example, for rows 11 & 12 of the example (see the link for the reproducible example), if this is my original:
8634 4 34 634 4840 0 40 840 780 0 80 780
6383 3 83 383 0 0 0 0 1257 7 57 257
I'd like to get this:
8634 4 34 634 4840 0 40 840 780 NA 80 NA
6383 3 83 383 0 NA NA NA 1257 7 57 257
You can use apply():
The data:
data <- structure(list(CH1 = c(3434L, 4442L, 4128L, 6718L, 3735L, 65L,
9305L, 4283L, 5588L, 205L, 8634L, 6383L, 4533L, 9363L, 3892L,
6130L, 7880L, 8515L, 1356L, 9024L), CH2 = c(282L, 6968L, 6947L,
6716L, 9171L, 4876L, 6944L, 6060L, 2285L, 2345L, 4840L, 0L, 7692L,
9846L, 79L, 5312L, 7386L, 8021L, 74L, 8626L), CH3 = c(7622L,
8430L, 478L, 3017L, 1128L, 4875L, 3309L, 650L, 203L, 9225L, 780L,
1257L, 3760L, 4697L, 4372L, 9651L, 6239L, 2295L, 8467L, 4136L
)), .Names = c("CH1", "CH2", "CH3"), row.names = c(NA, 20L), class = "data.frame")
Select row 11 and 12:
data <- data[11:12, ]
Using your code:
# CREATE ADDITIONAL COLUMNS
for (col in 1:3) {
  # Create a temporal variable
  temp <- data[, col]
  # Save the new column
  for (mod in c(10, 100, 1000)) {
    # Create the column
    temp <- cbind(temp, data[, col] %% mod)
  }
  data <- cbind(data, temp)
}
data[, 1:3] <- NULL
The result is:
temp V2 V3 V4 temp V2 V3 V4 temp V2 V3 V4
11 8634 4 34 634 4840 0 40 840 780 0 80 780
12 6383 3 83 383 0 0 0 0 1257 7 57 257
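Before deduplicating, one caveat worth flagging: the modulo version yields numbers, so a substring with a leading zero (say, the last two digits of 1005) comes out as 5 rather than "05". If the leading zeros matter, as the question suggests, keep the string version:
library(stringr)
str_sub(as.character(1005), -2)  # "05" -- leading zero preserved
1005 %% 100                      # 5   -- leading zero lost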
Then go through the data row by row and remove duplicates and transpose the outcome:
t(apply(data, 1, function(row) {
  row[duplicated(row)] <- NA
  return(row)
}))
The result is:
temp V2 V3 V4 temp V2 V3 V4 temp V2 V3 V4
11 8634 4 34 634 4840 0 40 840 780 NA 80 NA
12 6383 3 83 383 0 NA NA NA 1257 7 57 257
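Note that apply() converts its input to a matrix, so the result is a matrix rather than a data frame (and would be coerced to character if the columns had mixed types). If you need a data frame back, wrap the result:
res <- t(apply(data, 1, function(row) {
  row[duplicated(row)] <- NA
  return(row)
}))
res <- as.data.frame(res)  # restore the data.frame class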

Apply a rule to calculate sum of specific

Hi I have a data set like this.
Num C Pr Value Volume
111 aa Alen 111 222
111 aa Paul 100 200
222 vv Iva 444 555
222 vv John 333 444
I would like to filter the data by Num and add a new row that takes the sums of the Value and Volume columns, keeps the information from the Num and C columns, and puts Total in the Pr column. It should look like this:
Num C Pr Value Volume
222 vv Total 777 999
Could you suggest how to do it? I only need it for Num 222.
When I try to use the res command (from the answer below), I end up with this result:
# Num C Pr Value Volume
1: 111 aa Alen 111 222
2: 111 aa Paul 100 200
3: 111 aa Total NA NA
4: 222 vv Iva 444 555
5: 222 vv John 333 444
6: 222 vv Total NA NA
What causes this?
The structure of my data is as follows:
'data.frame': 4 obs. of 5 variables:
$ Num : Factor w/ 2 levels "111","222": 1 1 2 2
$ C : Factor w/ 2 levels "aa","vv": 1 1 2 2
$ Pr : Factor w/ 4 levels "Alen","Iva","John",..: 1 4 2 3
$ Value : Factor w/ 4 levels "100","111","333",..: 2 1 4 3
$ Volume: Factor w/ 4 levels "200","222","444",..: 2 1 4 3
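That str() output is the likely culprit: Value and Volume are factors, not numbers. sum() is not meaningful for factors, and as.numeric() on a factor returns the internal level codes rather than the values, which is a classic source of NA or nonsense totals. Converting through character first should fix the sums, e.g.:
# convert factor columns that hold numbers back to numeric;
# going through as.character() avoids getting the level codes
df1$Value  <- as.numeric(as.character(df1$Value))
df1$Volume <- as.numeric(as.character(df1$Volume))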
We could use data.table. We convert the 'data.frame' to a 'data.table' (setDT(df1)), group by the 'Num' and 'C' columns, and, with the columns to sum specified in .SDcols, loop over them with lapply to get the sums, then create the 'Pr' column. We can rbind the original dataset with the new summarised output ('DT1') and order the result by 'Num'.
library(data.table) # v1.9.5+
DT1 <- setDT(df1)[, lapply(.SD, sum), by = .(Num, C),
                  .SDcols = Value:Volume][, Pr := 'Total'][]
rbind(df1, DT1)[order(Num)]
# Num C Pr Value Volume
#1: 111 aa Alen 111 222
#2: 111 aa Paul 100 200
#3: 111 aa Total 211 422
#4: 222 vv Iva 444 555
#5: 222 vv John 333 444
#6: 222 vv Total 777 999
This can be done with base R methods as well. We get the sums of the 'Value' and 'Volume' columns grouped by 'Num' and 'C' using the formula method of aggregate, transform the output by creating the 'Pr' column, rbind with the original dataset, and order the output ('res') by 'Num'.
res <- rbind(df1,transform(aggregate(.~Num+C, df1[-3], FUN=sum), Pr='Total'))
res[order(res$Num),]
# Num C Pr Value Volume
#1 111 aa Alen 111 222
#2 111 aa Paul 100 200
#5 111 aa Total 211 422
#3 222 vv Iva 444 555
#4 222 vv John 333 444
#6 222 vv Total 777 999
EDIT: Noticed that the OP mentioned filter. If this is for a single 'Num', we subset the data first and then do the aggregate and transform steps.
transform(aggregate(.~Num+C, subset(df1, Num==222)[-3], FUN=sum), Pr='Total')
# Num C Value Volume Pr
#1 222 vv 777 999 Total
Or we may not need aggregate at all. After subsetting the data, we convert 'Num' to a factor, loop through the columns of the subset ('df2'), take the sum if a column is numeric and the first element otherwise, and wrap the result in data.frame.
df2 <- transform(subset(df1, Num==222), Num=factor(Num))
data.frame(c(lapply(df2[-3], function(x) if(is.numeric(x))
sum(x) else x[1]), Pr='Total'))
# Num C Value Volume Pr
#1 222 vv 777 999 Total
data
df1 <- structure(list(Num = c(111L, 111L, 222L, 222L), C = c("aa", "aa",
"vv", "vv"), Pr = c("Alen", "Paul", "Iva", "John"), Value = c(111L,
100L, 444L, 333L), Volume = c(222L, 200L, 555L, 444L)), .Names = c("Num",
"C", "Pr", "Value", "Volume"), class = "data.frame",
row.names = c(NA, -4L))
Or using dplyr:
library(dplyr)
df1 %>%
  filter(Num == 222) %>%
  summarise(Value = sum(Value),
            Volume = sum(Volume),
            Pr = 'Total',
            Num = Num[1],
            C = C[1])
# Value Volume Pr Num C
# 1 777 999 Total 222 vv
where we first filter to keep only Num == 222, and then use summarise to obtain the sums and the values for Num and C. This assumes that:
You do not want the result for each unique Num (I select one here; you could select multiple). If you need this, use group_by (see the sketch after this list).
There is only ever one C for every unique Num.
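For reference, a minimal sketch of that grouped variant, giving one Total row per Num (assuming, as above, one C per Num):
df1 %>%
  group_by(Num, C) %>%
  summarise(Pr = "Total",
            Value = sum(Value),
            Volume = sum(Volume))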
You can also use the dplyr package:
df %>%
  filter(Num == 222) %>%
  group_by(Num, C) %>%
  summarise(
    Pr = "Total",
    Value = sum(Value),
    Volume = sum(Volume)
  ) %>%
  rbind(df, .)
# Num C Pr Value Volume
# 1 111 aa Alen 111 222
# 2 111 aa Paul 100 200
# 3 222 vv Iva 444 555
# 4 222 vv John 333 444
# 5 222 vv Total 777 999
If you want the total for each Num value, just comment out the filter line.

How to find NAs in groups and create new column for data frame

I have a data frame consisting of an "ID" column and a "Diff" column. The ID column is responsible for marking groups of corresponding Diff values.
An example looks like this:
structure(list(ID = c(566, 566, 789, 789, 789, 487, 487, 11,
11, 189, 189), Diff = c(100, 277, 529, 43, NA, 860, 780, 445,
NA, 578, 810)), .Names = c("ID", "Diff"), row.names = c(9L, 10L,
20L, 21L, 22L, 25L, 26L, 51L, 52L, 62L, 63L), class = "data.frame")
My goal is to search each group for NAs in the Diff column and create a new column that holds TRUE or FALSE for each row, depending on whether the corresponding group has an NA in Diff.
I tried
x <- aggregate(Diff ~ ID, data, is.na)
and
y <- aggregate(Diff ~ ID, data, function(x) any(is.na(x)))
The idea was to merge the result back in by ID. However, neither of the above produced a useful result. I know R can do it … and after searching for quite a while, I'm asking you how :)
You can use plyr and its ddply:
require(plyr)
ddply(data, .(ID), transform, na_diff = any(is.na(Diff)))
## ID Diff na_diff
## 1 11 445 TRUE
## 2 11 NA TRUE
## 3 189 578 FALSE
## 4 189 810 FALSE
## 5 487 860 FALSE
## 6 487 780 FALSE
## 7 566 100 FALSE
## 8 566 277 FALSE
## 9 789 529 TRUE
## 10 789 43 TRUE
## 11 789 NA TRUE
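plyr has since been superseded by dplyr; for reference, an equivalent sketch with its successor (same values, modulo dplyr's grouped-tibble metadata):
library(dplyr)
data %>%
  group_by(ID) %>%
  mutate(na_diff = any(is.na(Diff))) %>%
  ungroup()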
A very similar solution to @dickoa's, except in base R:
do.call(rbind, by(data, data$ID, function(x) transform(x, na_diff = any(is.na(Diff)))))
# ID Diff na_diff
# 11.51 11 445 TRUE
# 11.52 11 NA TRUE
# 189.62 189 578 FALSE
# 189.63 189 810 FALSE
# 487.25 487 860 FALSE
# 487.26 487 780 FALSE
# 566.9 566 100 FALSE
# 566.10 566 277 FALSE
# 789.20 789 529 TRUE
# 789.21 789 43 TRUE
# 789.22 789 NA TRUE
Similarly, you could avoid transform with:
data$na_diff<-with(data,by(Diff,ID,function(x) any(is.na(x)))[as.character(ID)])
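ave() is another base option for broadcasting a per-group value back to every row. Since it returns a vector of the same type as its first argument, the logical any() comes back as 0/1 and needs one final comparison; a sketch:
# ave() recycles each group's any() result over that group's rows;
# the logical is coerced to numeric 0/1, hence the > 0
data$na_diff <- with(data, ave(Diff, ID, FUN = function(x) any(is.na(x))) > 0)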
(You have two viable strategies already, but here is another which may be conceptually easier to follow if you are relatively new to R and aren't familiar with the way plyr works.)
I often need to know how many NAs I have in different variables, so here is a convenience function I use as standard:
sna <- function(x){
  sum(is.na(x))
}
From there, I sometimes use aggregate(), but sometimes I find ?summaryBy in the doBy package to be more convenient. Here's an example:
library(doBy)
z <- summaryBy(Diff~ID, data=my.data, FUN=sna)
z
ID Diff.sna
1 11 1
2 189 0
3 487 0
4 566 0
5 789 1
After this, you just need to use ?merge and convert the count of NAs to a logical to get your final data frame:
my.data <- merge(my.data, z, by="ID")
my.data$Diff.sna <- my.data$Diff.sna>0
my.data
ID Diff Diff.sna
1 11 445 TRUE
2 11 NA TRUE
3 189 578 FALSE
4 189 810 FALSE
5 487 860 FALSE
6 487 780 FALSE
7 566 100 FALSE
8 566 277 FALSE
9 789 529 TRUE
10 789 43 TRUE
11 789 NA TRUE

R lag over missing data

Is there a variant of lag somewhere that keeps NAs in position? I want to compute returns of price data where data could be missing.
Col 1 is the price data
Col 2 is the lag of price
Col 3 shows p - lag(p): the return from 99 to 104 is effectively missed, so the path length of the computed returns will differ from the true one.
Col 4 shows the lag with NA position preserved
Col 5 shows the new difference - now the return of 5 for 2009-11-07 is available
Cheers, Dave
library(xts)
x <- xts(c(100, 101, 97, 95, 99, NA, 104, 103, 103, 100), as.Date("2009-11-01") + 0:9)
# fake the lag I want, with NA kept in position
x.pos.lag <- lag.xts(x)
x.pos.lag['2009-11-07'] <- 99
x.pos.lag['2009-11-06'] <- NA
cbind(x, lag.xts(x), x - lag.xts(x), x.pos.lag, x - x.pos.lag)
..1 ..2 ..3 ..4 ..5
2009-11-01 100 NA NA NA NA
2009-11-02 101 100 1 100 1
2009-11-03 97 101 -4 101 -4
2009-11-04 95 97 -2 97 -2
2009-11-05 99 95 4 95 4
2009-11-06 NA 99 NA NA NA
2009-11-07 104 NA NA 99 5
2009-11-08 103 104 -1 104 -1
2009-11-09 103 103 0 103 0
2009-11-10 100 103 -3 103 -3
There are no functions to do that natively in R, but you can create an index of the original NA positions and then swap the values there after the lag.
x <- xts(c(100, 101, 97, 95, 99, NA, 104, 103, 103, 100), as.Date("2009-11-01") + 0:9)
lag.xts.na <- function(x, ...) {
  na.idx <- which(is.na(x))
  x2 <- lag.xts(x, ...)
  x2[na.idx + 1, ] <- x2[na.idx, ]
  x2[na.idx, ] <- NA
  return(x2)
}
lag.xts.na(x)
[,1]
2009-11-01 NA
2009-11-02 100
2009-11-03 101
2009-11-04 97
2009-11-05 95
2009-11-06 NA
2009-11-07 99
2009-11-08 104
2009-11-09 103
2009-11-10 103
Incidentally, are you just trying to deal with weekends/holidays or something along those lines? If so, you might consider dropping those positions from your series; that will dramatically simplify things for you. Alternatively, the timeSeries package in Rmetrics has a number of functions for dealing with business days.
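As a quick sketch of that first suggestion: dropping the NA observations before lagging makes the 99-to-104 return fall out naturally.
x.clean <- x[!is.na(x)]                      # drop the missing observation entirely
cbind(x.clean, x.clean - lag.xts(x.clean))   # 2009-11-07 now shows the +5 return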
