How does the expression dt[is.na(dt)]=0 work? - r

I'm trying to replace NA cells by some value but only in one column. I found another thread explaining how to proceed but I don't understand how it works.
is.na(dt) returns a data table tracing the original dt but replacing all the values by either TRUE or FALSE depending on whether the original cell is NA. Now a datatable first parameters is supposed to accept a logical vector to select lines, not a whole datatable. And indeed dt[is.na(dt)] returns an error but dt[is.na(dt)]=0 will replace all the NA values with 0. Why does adding an =0 suddenly makes this call work ? Is it a special feature or part of datatable design.

The expression would work if it is a data.frame
dt[is.na(dt)]
#[1] NA NA NA NA NA
But, in a data.table, the syntax is different and converting to logical matrix is inefficient and not recommended in v1.10.0
setDT(dt)[is.na(dt)]
Error in [.data.table(setDT(dt), is.na(dt)) : i is invalid type
(matrix). Perhaps in future a 2 column matrix could return a list of
elements of DT (in the spirit of A[B] in FAQ 2.14). Please let
datatable-help know if you'd like this, or add your
A better option is set which replaces in place without copying
for(j in seq_along(dt)) {
set(dt, i = which(is.na(dt[[j]])), j = j, value = 0)
}
dt
# a b c
# 1: 1 0 2
# 2: 2 2 2
# 3: 2 1 1
# 4: 2 0 1
# 5: 0 1 2
# 6: 2 0 5
# 7: 1 1 4
# 8: 1 1 0
# 9: 2 1 5
#10: 2 1 1
Or another version is
setDT(dt)[, lapply(.SD, function(x) replace(x, is.na(x), 0))]
data
dt <- structure(list(a = c(1L, 2L, 2L, 2L, NA, 2L, 1L, 1L, 2L, 2L),
b = c(NA, 2L, 1L, NA, 1L, NA, 1L, 1L, 1L, 1L), c = c(2L,
2L, 1L, 1L, 2L, 5L, 4L, NA, 5L, 1L)), .Names = c("a", "b",
"c"), class = "data.frame", row.names = c(NA, -10L))

Related

subset error: longer object length is not a multiple of shorter object length [duplicate]

I have data similar to this:
dt <- structure(list(fct = structure(c(1L, 2L, 3L, 4L, 3L, 4L, 1L, 2L, 3L, 1L, 2L, 3L, 2L, 3L, 4L), .Label = c("a", "b", "c", "d"), class = "factor"), X = c(2L, 4L, 3L, 2L, 5L, 4L, 7L, 2L, 9L, 1L, 4L, 2L, 5L, 4L, 2L)), .Names = c("fct", "X"), class = "data.frame", row.names = c(NA, -15L))
I want to select rows from this data frame based on the values in the fct variable. For example, if I wish to select rows containing either "a" or "c" I can do this:
dt[dt$fct == 'a' | dt$fct == 'c', ]
which yields
1 a 2
3 c 3
5 c 5
7 a 7
9 c 9
10 a 1
12 c 2
14 c 4
as expected. But my actual data is more complex and I actually want to select rows based on the values in a vector such as
vc <- c('a', 'c')
So I tried
dt[dt$fct == vc, ]
but of course that doesn't work. I know I could code something to loop through the vector and pull out the rows needed and append them to a new dataframe, but I was hoping there was a more elegant way.
So how can I filter/subset my data based on the contents of the vector vc?
Have a look at ?"%in%".
dt[dt$fct %in% vc,]
fct X
1 a 2
3 c 3
5 c 5
7 a 7
9 c 9
10 a 1
12 c 2
14 c 4
You could also use ?is.element:
dt[is.element(dt$fct, vc),]
Similar to above, using filter from dplyr:
filter(df, fct %in% vc)
Another option would be to use a keyed data.table:
library(data.table)
setDT(dt, key = 'fct')[J(vc)] # or: setDT(dt, key = 'fct')[.(vc)]
which results in:
fct X
1: a 2
2: a 7
3: a 1
4: c 3
5: c 5
6: c 9
7: c 2
8: c 4
What this does:
setDT(dt, key = 'fct') transforms the data.frame to a data.table (which is an enhanced form of a data.frame) with the fct column set as key.
Next you can just subset with the vc vector with [J(vc)].
NOTE: when the key is a factor/character variable, you can also use setDT(dt, key = 'fct')[vc] but that won't work when vc is a numeric vector. When vc is a numeric vector and is not wrapped in J() or .(), vc will work as a rowindex.
A more detailed explanation of the concept of keys and subsetting can be found in the vignette Keys and fast binary search based subset.
An alternative as suggested by #Frank in the comments:
setDT(dt)[J(vc), on=.(fct)]
When vc contains values that are not present in dt, you'll need to add nomatch = 0:
setDT(dt, key = 'fct')[J(vc), nomatch = 0]
or:
setDT(dt)[J(vc), on=.(fct), nomatch = 0]

Filtering and Piping [duplicate]

I have data similar to this:
dt <- structure(list(fct = structure(c(1L, 2L, 3L, 4L, 3L, 4L, 1L, 2L, 3L, 1L, 2L, 3L, 2L, 3L, 4L), .Label = c("a", "b", "c", "d"), class = "factor"), X = c(2L, 4L, 3L, 2L, 5L, 4L, 7L, 2L, 9L, 1L, 4L, 2L, 5L, 4L, 2L)), .Names = c("fct", "X"), class = "data.frame", row.names = c(NA, -15L))
I want to select rows from this data frame based on the values in the fct variable. For example, if I wish to select rows containing either "a" or "c" I can do this:
dt[dt$fct == 'a' | dt$fct == 'c', ]
which yields
1 a 2
3 c 3
5 c 5
7 a 7
9 c 9
10 a 1
12 c 2
14 c 4
as expected. But my actual data is more complex and I actually want to select rows based on the values in a vector such as
vc <- c('a', 'c')
So I tried
dt[dt$fct == vc, ]
but of course that doesn't work. I know I could code something to loop through the vector and pull out the rows needed and append them to a new dataframe, but I was hoping there was a more elegant way.
So how can I filter/subset my data based on the contents of the vector vc?
Have a look at ?"%in%".
dt[dt$fct %in% vc,]
fct X
1 a 2
3 c 3
5 c 5
7 a 7
9 c 9
10 a 1
12 c 2
14 c 4
You could also use ?is.element:
dt[is.element(dt$fct, vc),]
Similar to above, using filter from dplyr:
filter(df, fct %in% vc)
Another option would be to use a keyed data.table:
library(data.table)
setDT(dt, key = 'fct')[J(vc)] # or: setDT(dt, key = 'fct')[.(vc)]
which results in:
fct X
1: a 2
2: a 7
3: a 1
4: c 3
5: c 5
6: c 9
7: c 2
8: c 4
What this does:
setDT(dt, key = 'fct') transforms the data.frame to a data.table (which is an enhanced form of a data.frame) with the fct column set as key.
Next you can just subset with the vc vector with [J(vc)].
NOTE: when the key is a factor/character variable, you can also use setDT(dt, key = 'fct')[vc] but that won't work when vc is a numeric vector. When vc is a numeric vector and is not wrapped in J() or .(), vc will work as a rowindex.
A more detailed explanation of the concept of keys and subsetting can be found in the vignette Keys and fast binary search based subset.
An alternative as suggested by #Frank in the comments:
setDT(dt)[J(vc), on=.(fct)]
When vc contains values that are not present in dt, you'll need to add nomatch = 0:
setDT(dt, key = 'fct')[J(vc), nomatch = 0]
or:
setDT(dt)[J(vc), on=.(fct), nomatch = 0]

Extracting data from dataframe by logic in R [duplicate]

I have data similar to this:
dt <- structure(list(fct = structure(c(1L, 2L, 3L, 4L, 3L, 4L, 1L, 2L, 3L, 1L, 2L, 3L, 2L, 3L, 4L), .Label = c("a", "b", "c", "d"), class = "factor"), X = c(2L, 4L, 3L, 2L, 5L, 4L, 7L, 2L, 9L, 1L, 4L, 2L, 5L, 4L, 2L)), .Names = c("fct", "X"), class = "data.frame", row.names = c(NA, -15L))
I want to select rows from this data frame based on the values in the fct variable. For example, if I wish to select rows containing either "a" or "c" I can do this:
dt[dt$fct == 'a' | dt$fct == 'c', ]
which yields
1 a 2
3 c 3
5 c 5
7 a 7
9 c 9
10 a 1
12 c 2
14 c 4
as expected. But my actual data is more complex and I actually want to select rows based on the values in a vector such as
vc <- c('a', 'c')
So I tried
dt[dt$fct == vc, ]
but of course that doesn't work. I know I could code something to loop through the vector and pull out the rows needed and append them to a new dataframe, but I was hoping there was a more elegant way.
So how can I filter/subset my data based on the contents of the vector vc?
Have a look at ?"%in%".
dt[dt$fct %in% vc,]
fct X
1 a 2
3 c 3
5 c 5
7 a 7
9 c 9
10 a 1
12 c 2
14 c 4
You could also use ?is.element:
dt[is.element(dt$fct, vc),]
Similar to above, using filter from dplyr:
filter(df, fct %in% vc)
Another option would be to use a keyed data.table:
library(data.table)
setDT(dt, key = 'fct')[J(vc)] # or: setDT(dt, key = 'fct')[.(vc)]
which results in:
fct X
1: a 2
2: a 7
3: a 1
4: c 3
5: c 5
6: c 9
7: c 2
8: c 4
What this does:
setDT(dt, key = 'fct') transforms the data.frame to a data.table (which is an enhanced form of a data.frame) with the fct column set as key.
Next you can just subset with the vc vector with [J(vc)].
NOTE: when the key is a factor/character variable, you can also use setDT(dt, key = 'fct')[vc] but that won't work when vc is a numeric vector. When vc is a numeric vector and is not wrapped in J() or .(), vc will work as a rowindex.
A more detailed explanation of the concept of keys and subsetting can be found in the vignette Keys and fast binary search based subset.
An alternative as suggested by #Frank in the comments:
setDT(dt)[J(vc), on=.(fct)]
When vc contains values that are not present in dt, you'll need to add nomatch = 0:
setDT(dt, key = 'fct')[J(vc), nomatch = 0]
or:
setDT(dt)[J(vc), on=.(fct), nomatch = 0]

Need help replacing values with NA when another condition is met in R (i.e. when another variable is a specific value)

I'm trying to delete some repeating information in my data set and replace it with NA. Here's an example of the data:
DataTable1
ID Day x y
1 1 1 3
1 2 1 3
2 1 2 5
2 2 2 5
3 1 3 4
3 2 3 4
4 1 4 6
4 2 4 6
I'm trying to replace "x" and "y" values with "NA" when Day=1. This is what I want:
ID Day x y
1 1 NA NA
1 2 1 3
2 1 NA NA
2 2 2 5
3 1 NA NA
3 2 3 4
4 1 NA NA
4 2 4 6
I'm not really sure where to start or how to go about this. I tried using the replace_with_na_if function from the naniar library. Otherwise, I am unsure what to try.
replace_with_na_if(data.frame=DataTable1$x,
condition=DataTable1$Day== 2)
I received an error message that reads:
Error in replace_with_na_if(data.frame = DataTable1$x, condition = DataTable1$Day == :
unused argument (data.frame = DataTable1$x)
An option in base R would be to create a logical vector based on the elements of 'Day'. Use that index to subset the 'x', 'y' columns and assign them to NA
i1 <- df1$Day == 1
df1[i1, c('x', 'y')] <- NA
Here's a data.table solution. Since you may be new to R, you need to install the data.table package first. If you have a large data set, data.table may work faster than using data frame. Also, I find the syntax to be easy to read and understand.
#Create the data frame:
df <- structure(list(ID = c(1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L), Day = c(1L, 2L, 1L,
2L, 1L, 2L, 1L, 2L), x = c(1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L), y = c(3L, 3L, 5L, 5L,
4L, 4L, 6L, 6L)), class = "data.frame", row.names = c(NA, -8L))
library(data.table)
dt <- setDT(df) # convert the data frame to a data.table
dt[Day == 1, c("x","y") := NA] # where Day equals 1, make the columns x and y equal NA
Good luck and welcome to stackoverflow!
Using dplyr, we can use mutate_at and replace like
library(dplyr)
df %>% mutate_at(vars(x, y), ~replace(., Day == 1, NA))
# ID Day x y
#1 1 1 NA NA
#2 1 2 1 3
#3 2 1 NA NA
#4 2 2 2 5
#5 3 1 NA NA
#6 3 2 3 4
#7 4 1 NA NA
#8 4 2 4 6
data
df <- structure(list(ID = c(1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L), Day = c(1L, 2L, 1L,
2L, 1L, 2L, 1L, 2L), x = c(1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L), y = c(3L, 3L, 5L, 5L,
4L, 4L, 6L, 6L)), class = "data.frame", row.names = c(NA, -8L))

Conditional data manipulation using data.table in R

I have 2 dataframes, testx and testy
testx
testx <- structure(list(group = 1:2), .Names = "group", class = "data.frame", row.names = c(NA,
-2L))
testy
testy <- structure(list(group = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L),
time = c(1L, 3L, 4L, 1L, 4L, 5L, 1L, 5L, 7L), value = c(50L,
52L, 10L, 4L, 84L, 2L, 25L, 67L, 37L)), .Names = c("group",
"time", "value"), class = "data.frame", row.names = c(NA, -9L
))
Based on this topic, I add missing time values using the following code, which works perfectly.
data <- setDT(testy, key='time')[, .SD[J(min(time):max(time))], by = group]
Now I would like to only add these missing time values IF the value for group appears in testx. In this example, I thus only want to add missing time values for groups matching the values for group in the file testx.
data <- setDT(testy, key='time')[,if(testy[group %in% testx[, group]]) .SD[J(min(time):max(time))], by = group]
The error I get is "undefined columns selected". I looked here, here and here, but I don't see why my code isn't working. I am doing this on large datasets, why I prefer using data.table.
You don't need to refer testy when you are within testy[] and are using group by, directly using group as a variable gives correct result, you need an extra else statement to return rows where group is not within testx if you want to keep all records in testy:
testy[, {if(group %in% testx$group) .SD[J(min(time):max(time))] else .SD}, by = group]
# group time value
# 1: 1 1 50
# 2: 1 2 NA
# 3: 1 3 52
# 4: 1 4 10
# 5: 2 1 4
# 6: 2 2 NA
# 7: 2 3 NA
# 8: 2 4 84
# 9: 2 5 2
# 10: 3 1 25
# 11: 3 5 67
# 12: 3 7 37

Resources