complete.obs option of the cor() function - R

I am computing a correlation matrix for my data, which looks like this:
df <- structure(list(V1 = c(56, 123, 546, 26, 62, 6, NA, NA, NA, 15),
                     V2 = c(21, 231, 5, 5, 32, NA, 1, 231, 5, 200),
                     V3 = c(NA, NA, 24, 51, 53, 231, NA, 153, 6, 700),
                     V4 = c(2, 10, NA, 20, 56, 1, 1, 53, 40, 5000)),
                .Names = c("V1", "V2", "V3", "V4"),
                row.names = c(NA, 10L), class = "data.frame")
This gives the following data frame:
    V1  V2  V3   V4
1   56  21  NA    2
2  123 231  NA   10
3  546   5  24   NA
4   26   5  51   20
5   62  32  53   56
6    6  NA 231    1
7   NA   1  NA    1
8   NA 231 153   53
9   NA   5   6   40
10  15 200 700 5000
I normally use complete.obs to compute my correlation matrix, with this command:
crm <- cor(df, use = "complete.obs", method = "pearson")
My question is: how does complete.obs treat the data? Does it omit every row containing an NA value, build an NA-free table, and compute the correlation matrix from that in one go, like this?
df2 <- structure(list(V1 = c(26, 62, 15), V2 = c(5, 32, 200),
                      V3 = c(51, 53, 700), V4 = c(20, 56, 5000)),
                 .Names = c("V1", "V2", "V3", "V4"),
                 row.names = c(NA, 3L), class = "data.frame")
Or does it omit NA values in a pairwise fashion? For example, when calculating the correlation between V1 and V2, do rows whose only NA is in V3 (such as rows 1 and 2 in my example) get omitted too?
If that is the case, I would like a command that preserves as much of the data as possible by omitting NA values in a pairwise fashion.
Many thanks,

Look at the help file for cor, i.e. ?cor. In particular,
If ‘use’ is ‘"everything"’, ‘NA’s will propagate conceptually, i.e., a
resulting value will be ‘NA’ whenever one of its contributing
observations is ‘NA’.
If ‘use’ is ‘"all.obs"’, then the presence of missing observations
will produce an error. If ‘use’ is ‘"complete.obs"’ then missing
values are handled by casewise deletion (and if there are no complete
cases, that gives an error).
To get a better feel for what is going on, create an (even) simpler example:
df1 <- df[1:5, 1:3]
cor(df1, use = "pairwise.complete.obs", method = "pearson")
cor(df1, use = "complete.obs", method = "pearson")
cor(df1[3:5, ], method = "pearson")  # identical to complete.obs: only rows 3-5 are complete
So, when we use complete.obs, we discard the entire row if an NA is present. In my example, this means we discard rows 1 and 2. However, pairwise.complete.obs uses the non-NA values when calculating the correlation between V1 and V2.
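For the original four-column df, the pairwise behaviour the question asks for is just a different value of the use argument (a minimal sketch; use = "pairwise.complete.obs" is documented in ?cor):
# Pairwise deletion: each pair of columns uses every row in which *both*
# columns are non-NA, so rows 1 and 2 still contribute to cor(V1, V2).
crm_pairwise <- cor(df, use = "pairwise.complete.obs", method = "pearson")
crm_pairwise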

Related

R code - How can I manipulate the df below to look like this:

Currently I wrote this code:
lc <- round(tabyl(x$Likelihood.to.Click, show_na = FALSE), 2)  # tabyl() is from the janitor package
lc$percent <- lc$percent * 100
and produced this chart:
But I need help manipulating it to create the df below (basically summing the percentages of rows 1 and 2, leaving 3 as is, and then summing 4 and 5):
First make your example reproducible using dput(lc) to pass the data:
lc <- structure(list(L2E = 1:5, n = c(7, 23, 84, 73, 33),
                     pct = c(3, 10, 38, 33, 15)),
                class = "data.frame", row.names = c(NA, -5L))
Now define the groups
groups <- list(1:2, 3, 4:5)
sumpct <- sapply(groups, function(x) sum(lc[x, 3]))
lc2 <- data.frame(group=seq(length(groups)), sumpct)
lc2
#   group sumpct
# 1     1     13
# 2     2     38
# 3     3     48
Note that row 3 is 38, not 32, and that in rows 4 and 5, 33 + 15 = 48, not 54.
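As a side note, the same per-group sums can be computed without sapply by expanding groups into a grouping vector (a sketch using base rowsum(); groups is the list defined above):
# grouping vector 1 1 2 3 3, then sum pct within each group
grp <- rep(seq_along(groups), lengths(groups))
rowsum(lc$pct, grp)
#   [,1]
# 1   13
# 2   38
# 3   48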

Get mean value from other rows into current row

I have a soil properties data.table with values for different locations and depths. Some values are NA, so I'd like to fill them with the mean of the values from the layers above and below. In the case of the top layer, I'd take the value from the next one down.
I was able to create columns indicating which are the upper and lower layers for each row, and I thought about doing a self merge, but I'm completely lost as to how to proceed.
Any clues as to how to do this? Below is an example data.table and what I'd like to achieve. The example has two locations with 3 layers each, but I have multiple locations and some have more layers than others.
library(data.table)
# I was able to identify which are the bottom and top layers
# using a function to identify the neighbors
dt <- data.table(id    = rep(c(1, 2), 1, each = 3),
                 depth = c(10, 20, 30, 10, 20, 30),
                 val   = c(12, 18, 11, 25, 27, 29),
                 bot_l = c(20, 30, NA, 20, 30, NA),
                 top_l = c(NA, 10, 20, NA, 10, 20))

# How can I calculate the average between the top and lower layers?
dt_desired <- data.table(id    = rep(c(1, 2), 1, each = 3),
                         depth = c(10, 20, 30, 10, 20, 30),
                         val   = c(12, 18, 11, 25, 27, 29),
                         bot_l = c(20, 30, NA, 20, 30, NA),
                         top_l = c(NA, 10, 20, NA, 10, 20),
                         mean_top_bot = c(18, 11.5, 18, 27, 27, 27))
To explain a bit more:
mean_top_bot[1] = mean of val[depth = 0] and val[depth = 20]. Since I don't have a value at depth 0, that becomes mean(NA, 18) = 18 (with na.rm = TRUE).
mean_top_bot[2] = mean of val[depth = 10] and val[depth = 30] = (12 + 11) / 2 = 11.5
I calculated the mean_top_bot values by hand; that's why I had some errors there :facepalm:
Solution using self merge
I was able to merge the table with itself by changing the by.x and by.y parameters, but I have a feeling that I'm doing this in the worst way possible.
dt1 <- merge(dt, dt[, .SD, .SDcols = !c('bot_l', 'top_l')],
             by.x = c('id', 'bot_l'),
             by.y = c('id', 'depth'),
             all = TRUE)[order(id, depth)]
   id bot_l depth val.x top_l val.y
1:  1    20    10    12    NA    18
2:  1    30    20    18    10    11
3:  1    NA    30    11    20    NA
4:  1    10    NA    NA    NA    12
5:  2    20    10    25    NA    27
6:  2    30    20    27    10    29
7:  2    NA    30    29    20    NA
8:  2    10    NA    NA    NA    25
Are there any easier ways to do this?
It should be easier to use data.table::shift directly, without computing the "top" and "bot" layers.
dt <- data.table(id    = rep(c(1, 2), 1, each = 3),
                 depth = c(10, 20, 30, 10, 20, 30),
                 val   = c(12, 18, 11, 25, 27, 29))

dt[, v := rowMeans(data.table::setDT(data.table::shift(val, c(1, -1))),
                   na.rm = TRUE),
   by = id]
Same, but with magrittr:
library(magrittr)
dt[, v := data.table::shift(val, c(1, -1)) %>% data.table::setDT() %>% rowMeans(na.rm = TRUE),
by = id]
The code above computes the mean of the previous and next val for a given depth. I assume the gap between a value and its top/bottom layers is one row, and that the data are already ordered by id and depth, as in your example.
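To see what shift() does here, a small illustrative sketch on the val values of id 1:
# n = c(1, -1) returns a list with the lagged and the leading vector;
# rowMeans() then averages the two, ignoring the NAs at the edges.
data.table::shift(c(12, 18, 11), c(1, -1))
# [[1]]
# [1] NA 12 18
#
# [[2]]
# [1] 18 11 NA
rowMeans(data.table::setDT(data.table::shift(c(12, 18, 11), c(1, -1))), na.rm = TRUE)
# [1] 18.0 11.5 18.0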
It took me a while to figure out, but this can be solved as well by a rolling mean:
dt[, mean_top_bot :=
zoo::rollapply(val, width = list(c(-1L, 1L)), FUN = mean, partial = TRUE), id][]
   id depth val bot_l top_l mean_top_bot
1:  1    10  12    20    NA           18
2:  1    20  18    30    10         11.5
3:  1    30  11    NA    20           18
4:  2    10  25    20    NA           27
5:  2    20  27    30    10           27
6:  2    30  29    NA    20           27
Two characteristics of zoo::rollapply() come in handy:
The width argument alternatively takes a list of integer offsets. So, list(c(-1L, 1L)) refers to the values of the preceding and subsequent rows while omitting the current row.
With partial = TRUE, only the subset of indexes that are in range are passed to FUN. E.g., for the first row, offset -1 refers to index 0 which is out of range. Therefore, only the value of index 2 (offset 1) is passed to mean(). Likewise for the last row, where only the second to last value is passed to mean().
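A tiny standalone illustration of those two points, using the val column of id 1:
library(zoo)
x <- c(12, 18, 11)
# width = list(c(-1L, 1L)): look at the previous and next element, not the current one
# partial = TRUE: at the edges, use whichever neighbour actually exists
rollapply(x, width = list(c(-1L, 1L)), FUN = mean, partial = TRUE)
# [1] 18.0 11.5 18.0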

Remove NAs from nested list data frame

The following really seems to be a tough nut to crack:
I have a data frame with a nested list:
df <- structure(list(zerobonds = c(1, 1, NA),
                     nominal = c(20, 20, NA),
                     calls = list(list(c(NA, -1), 1), list(list(NA, -1), 1), NA),
                     call_strike = list(list(c(NA, 90), 110), list(list(NA, 90), 110), NA),
                     puts = list(NA, NA, list(c(NA, 1), -1)),
                     put_strike = list(NA, NA, list(c(NA, 110), 90))),
                row.names = c(NA, -3L), class = "data.frame")
df
## zerobonds nominal calls call_strike puts put_strike
## 1 1 20 NA, -1, 1 NA, 90, 110 NA NA
## 2 1 20 NA, -1, 1 NA, 90, 110 NA NA
## 3 NA NA NA NA NA, 1, -1 NA, 110, 90
I want to print the structure without any NAs (dots instead of the blanks are ok too):
zerobonds nominal calls call_strike puts put_strike
1 1 20 -1, 1 90, 110
2 1 20 -1, 1 90, 110
3 1, -1 110, 90
I have tried all kinds of things; the best approach so far seems to be something like rapply(df, na.omit, how = "replace"), where I can't even suppress the warnings (suppressWarnings doesn't seem to work here!). print(df, na.print = "") doesn't help either.
I am really exhausted now, nothing seems to work... data frames in the form of nested lists don't seem to be a good idea after all... could anybody help?
You can try the code below
df[] <- rapply(Map(as.list, df), na.omit, how = "replace")
which gives
> df
zerobonds nominal calls call_strike puts put_strike
1 1 20 -1, 1 90, 110
2 1 20 -1, 1 90, 110
3 1, -1 110, 90
You can create your own recursive function and apply it to each column :
rm_nested_na <- function(x) {
  if (is.atomic(x)) {
    na.omit(x)
  } else {
    lapply(x, rm_nested_na)
  }
}
res <- df
listcol <- sapply(res, is.list)
res[listcol] <- lapply(res[listcol], rm_nested_na)
res
This is clearly inefficient if the nesting is deep.

Remove duplicate rows in nested list data frame

I have a data frame with a nested list:
df <- structure(list(zerobonds = c(1, 1, NA),
                     nominal = c(20, 20, NA),
                     calls = list(list(c(NA, -1), 1), list(list(NA, -1), 1), NA),
                     call_strike = list(list(c(NA, 90), 110), list(list(NA, 90), 110), NA),
                     puts = list(NA, NA, list(c(NA, 1), -1)),
                     put_strike = list(NA, NA, list(c(NA, 110), 90))),
                row.names = c(NA, -3L), class = "data.frame")
df
## zerobonds nominal calls call_strike puts put_strike
## 1 1 20 NA, -1, 1 NA, 90, 110 NA NA
## 2 1 20 NA, -1, 1 NA, 90, 110 NA NA
## 3 NA NA NA NA NA, 1, -1 NA, 110, 90
My question: You see that the first and second row are duplicated. I want to remove all duplicate rows in such data frames and I am looking for some general method.
What I tried: duplicated doesn't seem to work, I guess because of this special structure of a data frame with nested lists inside.
You may need to flatten the nested lists of each column and then apply unique, e.g.,
> unique({df[]<-Map(function(x) Map(unlist,x),df);df})
zerobonds nominal calls call_strike puts put_strike
1 1 20 NA, -1, 1 NA, 90, 110 NA NA
3 NA NA NA NA NA, 1, -1 NA, 110, 90
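The same flattening idea can also be written in two explicit steps with duplicated() (a sketch; unique.data.frame is implemented via duplicated.data.frame, so it should behave identically):
# flatten every cell to an atomic vector, then drop duplicated rows
df[] <- Map(function(col) Map(unlist, col), df)
df[!duplicated(df), ]
# returns the same two rows (1 and 3) as unique() above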

Replace NAs with mean values

I have a data frame with missing values. How can I replace the NAs with the mean of the values from the previous and next rows? In the example, that is (30 + 10) / 2 = 20.
id value
1 30
2 NA
3 10
4 20
Try
library(zoo)
na.approx(df$value)
#[1] 30 20 10 20
If the data have NA values in the first or last rows, or consecutive NAs (it is not clear from the post), the function with na.rm = FALSE would return:
na.approx(df1$value, na.rm=FALSE)
#[1] NA 20 25 24 23 22 27 28 29 NA
data
df <- structure(list(id = 1:4, value = c(30L, NA, 10L, 20L)), .Names = c("id",
"value"), class = "data.frame", row.names = c(NA, -4L))
df1 <- data.frame(id=1:10, value=c(NA, 20,25, NA, NA, 22, 27, 28, 29, NA))
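If you literally want the mean of the previous and next rows without pulling in zoo, here is a base-R sketch (it assumes every NA is a single gap with non-NA neighbours, as in df):
i <- which(is.na(df$value))                        # positions of the NAs
df$value[i] <- (df$value[i - 1] + df$value[i + 1]) / 2
df$value
# [1] 30 20 10 20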
