Replace NA-s with mean values - r

I have a data frame with missing values. How can I replace each NA with the mean of the values in the previous and next rows? In the example below, the NA would become (30+10)/2 = 20.
id value
1 30
2 NA
3 10
4 20

Try
library(zoo)
na.approx(df$value)
#[1] 30 20 10 20
If the data has NA in the first or last row, or runs of consecutive NAs (the post does not say), pass na.rm=FALSE to keep the leading and trailing NAs in place; interior runs are filled by linear interpolation:
na.approx(df1$value, na.rm=FALSE)
#[1] NA 20 25 24 23 22 27 28 29 NA
data
df <- structure(list(id = 1:4, value = c(30L, NA, 10L, 20L)), .Names = c("id",
"value"), class = "data.frame", row.names = c(NA, -4L))
df1 <- data.frame(id=1:10, value=c(NA, 20,25, NA, NA, 22, 27, 28, 29, NA))
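
For readers without zoo, the question's rule (mean of the previous and next row) can be sketched in base R by shifting the vector one position in each direction. This is only an illustration, not part of the answer above; the names prev, nxt and filled are made up here.

```r
value <- c(30, NA, 10, 20)

# Neighbouring values: shift the vector down and up by one position
prev <- c(NA, head(value, -1))   # value from the previous row
nxt  <- c(tail(value, -1), NA)   # value from the next row

# Replace only the NA entries with the mean of their two neighbours
filled <- ifelse(is.na(value), (prev + nxt) / 2, value)
filled
#[1] 30 20 10 20
```

Unlike na.approx, this leaves an NA in place whenever either neighbour is itself missing, so runs of consecutive NAs and NAs in the first or last row stay NA.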

Related

remove rows containing NA based on condition

df <- data.frame(x = 1:7, y = c(NA, NA, 5, 10, NA, 20, 30))
From df I want to remove rows containing NA in y based on the condition that the x value in that row is smaller than the x value in the row with the minimum y value to obtain this data frame.
data.frame(x = 3:7, y = c(5, 10, NA, 20, 30))
dplyr solutions preferable!
We can use which.min to get the index of the minimum 'y' value, use it to extract the corresponding 'x', compare every 'x' against that value, combine the comparison with is.na(y), and negate (!) the result:
subset(df, !(x< x[which.min(y)] & is.na(y)))
-output
x y
3 3 5
4 4 10
5 5 NA
6 6 20
7 7 30
Or the same logic can be applied with dplyr::filter
library(dplyr)
df %>%
filter(!(x< x[which.min(y)] & is.na(y)))
-output
x y
1 3 5
2 4 10
3 5 NA
4 6 20
5 7 30
data
df <- structure(list(x = 1:7, y = c(NA, NA, 5, 10, NA, 20, 30)),
class = "data.frame", row.names = c(NA,
-7L))
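
To see why the condition works, it can help to evaluate its pieces one at a time. The sketch below uses the question's data; the intermediate names min_row, x_at_min and drop are introduced only for illustration.

```r
df <- data.frame(x = 1:7, y = c(NA, NA, 5, 10, NA, 20, 30))

min_row  <- which.min(df$y)   # which.min() ignores NA, so this is row 3
x_at_min <- df$x[min_row]     # x value in that row: 3

# TRUE only for rows that are NA in y AND lie before the row with the minimum
drop <- df$x < x_at_min & is.na(df$y)
drop
#[1]  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE

kept <- df[!drop, ]   # keeps rows 3 to 7, including the NA in row 5
kept
```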
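Use a logical index for each of the conditions, combine them with logical AND, &, and negate the result:
df <- data.frame(x = 1:7, y = c(NA, NA, 5, 10, NA, 20, 30))
i <- is.na(df$y)                    # rows where y is NA
j <- df$x < df$x[which.min(df$y)]   # rows whose x is below the x at the minimum y
df[!(i & j), ]
# x y
#3 3 5
#4 4 10
#5 5 NA
#6 6 20
#7 7 30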

How to remove rows if a list of columns is completely full of NAs (not removing the rows with at least one value other than NA)

I've got a dataset similar to the one below (but with millions of rows), and I want to remove the rows where the revenue columns (rev_01 to rev_05) are ALL NA (in the dataset below, the rows for companies c and e).
I saw a similar post (R - Remove rows which have all NAs in certain columns), but the answer there used the positions of the columns (I would rather use their names) and I didn't understand what they meant by "!=5".
You can get a reproducible dataset with the code:
dat <- data.frame(Company = c("a","b","c","d","e","f"), survey_year = c(2014, 2010, 2006, 2014, 2006, 2010), rev_01 = c(NA, 20, NA, NA, NA, 10),
rev_02 = c(10, 50, NA, 30, NA, 20), rev_03 = c(20, NA, NA, NA, NA, 30), rev_04 = c(NA, NA, NA, 50, NA, 50),
rev_05 = c(NA, 30, NA, NA, NA, 60), variable = c("U", "P", "X", "E", "T","T"))
Thank you!
You can use grep to find columns with rev and use all with apply to find rows where all are NA.
dat[!apply(is.na(dat[,grep("^rev", colnames(dat))]), 1, all),]
# Company survey_year rev_01 rev_02 rev_03 rev_04 rev_05 variable
#1 a 2014 NA 10 20 NA NA U
#2 b 2010 20 50 NA NA 30 P
#4 d 2014 NA 30 NA 50 NA E
#6 f 2010 10 20 30 50 60 T
Or you can use rowSums like:
dat[rowSums(!is.na(dat[,grep("^rev", colnames(dat))])) > 0,]
You can use is.na() + rowSums() + subset() to get the desired output
subset(dat,rowSums(is.na(dat[grep("rev",names(dat))]))!=5)
such that
> subset(dat,rowSums(is.na(dat[grep("rev",names(dat))]))!=5)
Company survey_year rev_01 rev_02 rev_03 rev_04 rev_05 variable
1 a 2014 NA 10 20 NA NA U
2 b 2010 20 50 NA NA 30 P
4 d 2014 NA 30 NA 50 NA E
6 f 2010 10 20 30 50 60 T
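
The 5 in !=5 is the number of rev columns: rowSums(is.na(...)) counts the NAs in each row, and a row is dropped only when that count equals the number of columns checked. A sketch that avoids the hard-coded number, using the question's dat (rev_cols, na_count and kept are names chosen here for illustration):

```r
dat <- data.frame(
  Company = c("a", "b", "c", "d", "e", "f"),
  survey_year = c(2014, 2010, 2006, 2014, 2006, 2010),
  rev_01 = c(NA, 20, NA, NA, NA, 10),
  rev_02 = c(10, 50, NA, 30, NA, 20),
  rev_03 = c(20, NA, NA, NA, NA, 30),
  rev_04 = c(NA, NA, NA, 50, NA, 50),
  rev_05 = c(NA, 30, NA, NA, NA, 60),
  variable = c("U", "P", "X", "E", "T", "T")
)

rev_cols <- grep("^rev", names(dat), value = TRUE)   # select columns by name
na_count <- rowSums(is.na(dat[rev_cols]))            # NAs per row among them
unname(na_count)
#[1] 3 2 5 3 5 0

# Keep a row unless every rev column is NA
kept <- dat[na_count != length(rev_cols), ]
```

Rows c and e are the only ones whose count equals length(rev_cols), so they are the only ones removed.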
Another approach would be selecting the columns with dplyr::vars and dplyr::starts_with and remove the rows with dplyr::filter_at and dplyr::any_vars. This is just a small adjustment of https://stackoverflow.com/a/51600309/10754831
library(tidyverse)
dat %>%
filter_at(vars(starts_with("rev")), any_vars(!is.na(.)))
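
In current dplyr the scoped verbs such as filter_at are superseded, and the same filter can be written with if_any. A sketch on the question's dat, assuming a dplyr version that provides if_any (1.0.4 or later); res is a name chosen here for illustration:

```r
library(dplyr)

dat <- data.frame(
  Company = c("a", "b", "c", "d", "e", "f"),
  survey_year = c(2014, 2010, 2006, 2014, 2006, 2010),
  rev_01 = c(NA, 20, NA, NA, NA, 10),
  rev_02 = c(10, 50, NA, 30, NA, 20),
  rev_03 = c(20, NA, NA, NA, NA, 30),
  rev_04 = c(NA, NA, NA, 50, NA, 50),
  rev_05 = c(NA, 30, NA, NA, NA, 60),
  variable = c("U", "P", "X", "E", "T", "T")
)

# Keep rows where at least one column starting with "rev" is not NA
res <- dat %>%
  filter(if_any(starts_with("rev"), ~ !is.na(.x)))
res$Company   # companies a, b, d and f remain
```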

Transform a dataframe to use first column values as column names

I have a dataframe with 2 columns:
.id vals
1 A 10
2 B 20
3 C 30
4 A 100
5 B 200
6 C 300
dput(tst_df)
structure(list(.id = structure(c(1L, 2L, 3L, 1L, 2L, 3L), .Label = c("A",
"B", "C"), class = "factor"), vals = c(10, 20, 30, 100, 200,
300)), .Names = c(".id", "vals"), row.names = c(NA, -6L), class = "data.frame")
Now I want the .id column to become my column names, with the vals spread over 2 rows.
Like this:
A B C
10 20 30
100 200 300
Basically, .id is my grouping variable and I want to have all values belonging to 1 group as a row. I expected something simple like melt and transform, but after many tries I still have not succeeded. Is anyone familiar with a function that will accomplish this?
You can do this in base R with unstack:
unstack(df, form=vals~.id)
A B C
1 10 20 30
2 100 200 300
The first argument is the data.frame (df here stands for the question's tst_df) and the second is a formula of the form values ~ grouping, which determines the unstacked structure.
You can also use tapply,
do.call(cbind, tapply(df$vals, df$.id, I))
# A B C
#[1,] 10 20 30
#[2,] 100 200 300
or wrap it in a data frame, i.e.
as.data.frame(do.call(cbind, tapply(df$vals, df$.id, I)))
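
Another base R sketch of the same reshaping uses split() to collect the values of each group and as.data.frame() to bind them as columns. It rebuilds the question's tst_df (with .id as character rather than factor) and assumes every group has the same number of values; the name wide is chosen here for illustration.

```r
tst_df <- data.frame(.id  = c("A", "B", "C", "A", "B", "C"),
                     vals = c(10, 20, 30, 100, 200, 300))

# split() returns a named list with one vector of vals per .id;
# as.data.frame() turns each list element into a column
wide <- as.data.frame(split(tst_df$vals, tst_df$.id))
wide
#    A   B   C
#1  10  20  30
#2 100 200 300
```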

Select last value in a row & place it in another column

I have a data table like this:
Col1 | Col2 | Colx
12 | 13 | 19
34 | NA | NA
13 | 33 | NA
To determine the last value in each row I used Andrie's suggestion from a previous question on the same subject, but I'd like the output to be in a separate column. The expected output for the above example:
Column
19
34
33
The original question linked above didn't solve my problem, as the output does not come in a new column.
We can do
apply(df, 1, function(x) tail(x[!is.na(x)], 1))
If you want the result in a new column, you can do:
df$newColumn <- apply(df, 1, function(x) tail(x[!is.na(x)], 1))
Another option is
i1 <- which(!is.na(df1), arr.ind=TRUE)
unname(tapply(df1[i1], i1[,1], FUN=tail,1))
#[1] 19 34 33
data
df1 <- structure(list(Col1 = c(12, 34, 13), Col2 = c(13,
NA, 33), Colx = c(19,
NA, NA)), .Names = c("Col1", "Col2", "Colx"),
row.names = c(NA, -3L), class = "data.frame")
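
A vectorised variant of the same idea can be sketched with max.col(): with ties.method = "last" it returns, for each row, the position of the last TRUE in the !is.na() matrix, which is the column holding the last non-NA value. The names m, last_col and newColumn are illustrative, and the sketch assumes each row has at least one non-NA value.

```r
df1 <- data.frame(Col1 = c(12, 34, 13),
                  Col2 = c(13, NA, 33),
                  Colx = c(19, NA, NA))

m <- as.matrix(df1)

# Column index of the last non-NA entry in each row
last_col <- max.col(!is.na(m), ties.method = "last")

# Pick one element per row with a (row, column) index matrix
df1$newColumn <- m[cbind(seq_len(nrow(m)), last_col)]
df1$newColumn
#[1] 19 34 33
```

This avoids calling a function once per row, which may matter on large tables.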

Complete.obs of cor() function

I am establishing a correlation matrix for my data, which looks like this
df <- structure(list(V1 = c(56, 123, 546, 26, 62, 6, NA, NA, NA, 15
), V2 = c(21, 231, 5, 5, 32, NA, 1, 231, 5, 200), V3 = c(NA,
NA, 24, 51, 53, 231, NA, 153, 6, 700), V4 = c(2, 10, NA, 20,
56, 1, 1, 53, 40, 5000)), .Names = c("V1", "V2", "V3", "V4"), row.names = c(NA,
10L), class = "data.frame")
This gives the following data frame:
V1 V2 V3 V4
1 56 21 NA 2
2 123 231 NA 10
3 546 5 24 NA
4 26 5 51 20
5 62 32 53 56
6 6 NA 231 1
7 NA 1 NA 1
8 NA 231 153 53
9 NA 5 6 40
10 15 200 700 5000
I normally use a complete.obs command to establish my correlation matrix using this command
crm <- cor(df, use="complete.obs", method="pearson")
My question is: how does complete.obs treat the data? Does it omit every row containing an NA, build an NA-free table, and compute the whole correlation matrix from it at once, like this?
df2 <- structure(list(V1 = c(26, 62, 15), V2 = c(5, 32, 200), V3 = c(51,
53, 700), V4 = c(20, 56, 5000)), .Names = c("V1", "V2", "V3",
"V4"), row.names = c(NA, 3L), class = "data.frame")
Or does it omit NA values in a pairwise fashion? For example, when calculating the correlation between V1 and V2, do rows that contain an NA only in V3 (such as rows 1 and 2 in my example) get omitted too?
If they do, I am looking for a command that keeps as much of the data as possible by omitting NA values in a pairwise fashion.
Many thanks,
Look at the help file for cor, i.e. ?cor. In particular,
If ‘use’ is ‘"everything"’, ‘NA’s will propagate conceptually, i.e., a
resulting value will be ‘NA’ whenever one of its contributing
observations is ‘NA’.
If ‘use’ is ‘"all.obs"’, then the presence of missing observations
will produce an error. If ‘use’ is ‘"complete.obs"’ then missing
values are handled by casewise deletion (and if there are no complete
cases, that gives an error).
To get a better feel for what is going on, create an (even) simpler example:
df1 = df[1:5,1:3]
cor(df1, use="pairwise.complete.obs", method="pearson")
cor(df1, use="complete.obs", method="pearson")
cor(df1[3:5,], method="pearson")
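So, when we use complete.obs, we discard the entire row if an NA is present in any column. In my example, this means we discard rows 1 and 2 (V3 is NA there), which is why the second and third calls give the same result. However, pairwise.complete.obs uses every row in which both variables of a pair are non-NA, so the correlation between V1 and V2 is computed from all five rows.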
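
The difference can also be checked on the question's full data by counting how many rows each option actually uses. This is a sketch; obs, n_pairwise and same are names introduced here for illustration.

```r
df <- data.frame(
  V1 = c(56, 123, 546, 26, 62, 6, NA, NA, NA, 15),
  V2 = c(21, 231, 5, 5, 32, NA, 1, 231, 5, 200),
  V3 = c(NA, NA, 24, 51, 53, 231, NA, 153, 6, 700),
  V4 = c(2, 10, NA, 20, 56, 1, 1, 53, 40, 5000)
)

# complete.obs: only rows with no NA in any column are used
sum(complete.cases(df))
#[1] 3

# ... so it matches cor() on the NA-free subset
same <- all.equal(cor(df, use = "complete.obs"),
                  cor(df[complete.cases(df), ]))

# pairwise.complete.obs: each pair uses every row where both columns are non-NA.
# obs is a 0/1 indicator matrix; t(obs) %*% obs counts shared non-NA rows per pair
obs <- 1 * (!is.na(as.matrix(df)))
n_pairwise <- t(obs) %*% obs
n_pairwise["V1", "V2"]
#[1] 6

cor(df, use = "pairwise.complete.obs", method = "pearson")
```

Here the V1/V2 correlation is based on 6 rows with pairwise deletion but only 3 with casewise deletion, which is the trade-off the question asks about.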
