Unable to apply functions to column names in dplyr - r

I have a data frame df and I wish to create a new column b that holds, for each row, the smaller of a and 10 - a. Where a is NA, b should also be NA in the corresponding row, so column b should be c(1, 3, 1, NA). I tried the following code, but every row of b is 1. I am looking for a tidyverse solution.
library(tidyverse)
df <- data.frame(a = c(1, 3, 9, NA))
df2 <- df %>% mutate(b = min(a, 10 - a, na.rm = T))
I guess the issue arises because of how I am applying the min function, which is complicated by the presence of NA, but I cannot figure out how to solve it.
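A minimal sketch of one possible fix, assuming the goal is a row-wise comparison: min() collapses all of its arguments into a single number (here 1, the smallest value anywhere in a and 10 - a), which mutate() then recycles to every row. Base R's pmin() compares its arguments element by element instead and, with its default na.rm = FALSE, returns NA wherever an input is NA.

```r
library(dplyr)

df <- data.frame(a = c(1, 3, 9, NA))

# pmin() takes the parallel (element-wise) minimum of a and 10 - a.
# na.rm is left at its default (FALSE) so that NA in a stays NA in b.
df2 <- df %>% mutate(b = pmin(a, 10 - a))

df2$b
# [1]  1  3  1 NA
```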

Related

R: data.table condition and delete columns "NA"

I'm trying to identify lines with a missing date between two dates.
Initial data.table and desired output: (images not available)
I want to delete the columns with only "NA" in them (dt_7 and dt_8).
Perhaps you are looking for something like
df <- data.frame(dt_1 = 1:10, dt_2 = c(1, NA, 2, 3, NA, 6:10), dt_3 = rep(NA, 10))
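# keep the columns that are not entirely NA; unlike -which(...), this is also safe when no column is all NA
df[, colSums(is.na(df)) != nrow(df), drop = FALSE]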
or
df %>% select_if(colSums(is.na(.))!=dim(df)[1])
The first option doesn't work for data.tables, but the second one should solve your problem.
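For a data.table specifically, one possible sketch is to build the vector of column names to keep and pass it with with = FALSE, so that data.table treats it as column names rather than as an expression to evaluate. The names dt, keep and dt_clean are illustrative, and the data simply mirrors the example above.

```r
library(data.table)

dt <- data.table(dt_1 = 1:10,
                 dt_2 = c(1, NA, 2, 3, NA, 6:10),
                 dt_3 = rep(NA, 10))

# Names of the columns that contain at least one non-missing value
keep <- names(dt)[colSums(is.na(dt)) < nrow(dt)]

# with = FALSE makes data.table interpret `keep` as a vector of column names
dt_clean <- dt[, keep, with = FALSE]

names(dt_clean)
# [1] "dt_1" "dt_2"
```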

Fast way to add rows to dataframe based on number of columns

I have the following data frame:
df1 = structure(c(3, 5, 8, 6), .Dim = c(2L, 2L))
I would like to add 3 rows:
(0,0)
(5,0)
(0,8)
i.e. for each column, I add a row that is the max of this column, the rest are zeros, and an all zero line. Any fast way to do this?
One option is colMaxs from matrixStats, get the diag and rbind with the original matrix along with a 0 padded row
library(matrixStats)
rbind(df1, 0, diag(colMaxs(df1)))
If it is a data.frame (based on the title)
library(dplyr)
df2 %>%
summarise_all(max) %>%
diag %>%
rbind(df2, 0, .)
data
df2 <- as.data.frame(df1)
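If adding a dependency is not desirable, the same idea can be sketched in base R alone. This assumes the input is a numeric matrix like df1; col_max and result are illustrative names. Passing nrow to diag() guards against the single-column case, where diag(5) would otherwise build a 5 x 5 identity matrix.

```r
df1 <- structure(c(3, 5, 8, 6), .Dim = c(2L, 2L))

# Column-wise maxima using base R only
col_max <- apply(df1, 2, max)

# Original rows, then one all-zero row, then one row per column holding its max
result <- rbind(df1, 0, diag(col_max, nrow = length(col_max)))
result
```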

How to get ONLY columns with NA values and the amount of NAs

I have a dataset and some of the columns have NA values. I need to display only the column names that have NA values as well as the total number of NA values in each of those columns.
I've been able to get different pieces of the problem working but not both things at once.
This gives me only the column names of the columns containing NA values. But I want the NA totals to show under each column name.
nacol<- colnames(df)[colSums(is.na(df)) > 0]
This gives me exactly what I want but it also displays the zero totals of the other columns in the dataframe and I don't want those to be displayed.
df %>% summarise_all(funs(sum(is.na(.))))
I'm obviously a complete beginner. I realize this is an extremely easy problem to fix but I've been trying for hours and I'm just getting frustrated. Please help. Thank you!
We can use Filter with colSums to remove 0 values
Filter(function(x) x > 0, colSums(is.na(df)))
#a c
#2 1
Or select_if in dplyr
library(dplyr)
df %>%
summarise_all(~(sum(is.na(.)))) %>%
select_if(. > 0)
We can also first select the columns with any NA values and then count them.
df %>%
select_if(~any(is.na(.))) %>%
summarise_all(~(sum(is.na(.))))
data
df <- data.frame(a = c(2, 3, NA, NA, 1), b = 1:5, c = c(1, 3, 4, NA, 1))
A possible alternative using purrr, with dplyr for the pipe (using airquality for reproducibility):
library(dplyr)
library(purrr)
airquality %>%
keep(~anyNA(.x)) %>%
map_dbl(~sum(is.na(.x)))
Ozone Solar.R
37 7
Using data from @Ronak Shah's answer:
df %>%
keep(~anyNA(.x)) %>%
map_dbl(~sum(is.na(.x)))
a c
2 1
Using data.table (there might be a way to make it more compact):
library(data.table)
setDT(df)
df[,Filter(anyNA,.SD)][,lapply(.SD, function(x) sum(is.na(x)))]
a c
1: 2 1
Data:
df <- structure(list(a = c(2, 3, NA, NA, 1), b = 1:5, c = c(1, 3, 4,
NA, 1)), class = "data.frame", row.names = c(NA, -5L))
airquality is a built-in dataset.
We can do
na.omit(na_if(colSums(is.na(df)), 0))
# a c
# 2 1
Or using summarise_if
library(dplyr)
df %>%
summarise_if(~ any(is.na(.)), ~sum(is.na(.)))
# a c
#1 2 1
data
df <- data.frame(a = c(2, 3, NA, NA, 1), b = 1:5, c = c(1, 3, 4, NA, 1))
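The scoped verbs used in the answers above (summarise_all, summarise_if, select_if) and funs() have been superseded or deprecated in more recent dplyr releases. A sketch of the same computation with across() and where(), assuming dplyr 1.0.0 or later, might look like this; na_counts is an illustrative name.

```r
library(dplyr)

df <- data.frame(a = c(2, 3, NA, NA, 1), b = 1:5, c = c(1, 3, 4, NA, 1))

# where(anyNA) selects only the columns containing at least one NA;
# the lambda then counts the NAs in each selected column.
na_counts <- df %>%
  summarise(across(where(anyNA), ~ sum(is.na(.x))))

na_counts
#   a c
# 1 2 1
```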

Lagged difference between rows in R: a different take

My question is similar to a few that have been asked before, but I hope different enough to warrant a separate question.
See here and here. I'll reuse some of the example data from those questions. For context: I am looking at how my observed catch rate (sea creatures) changed over multiple days of sampling the same area.
I want to calculate the difference between the first sample day at a given site (first Letter in data below), and the subsequent sample days (next rows of same letter).
#Example data
df <- data.frame(
id = c("A", "A", "A", "A", "B", "B", "B"),
num = c(1, 8, 6, 3, 7, 7 , 9),
What_I_Want = c(NA, 7, 5, 2, NA, 0, 2))
The first solution that I found calculates a lagged difference between each row. I also wanted this calculation, so it was helpful to find:
#Calculate lagged differences
df_new <- df %>%
# group by condition
group_by(id) %>%
# find difference
mutate(diff = num - lag(num))
Here the difference is between A.1 and A.2; then A.2 and A.3 etc...
What I would like to do now is calculate the difference with respect to the first value of each group. So for letter A, I would like to calculate 8 - 1, then 6 - 1, and finally 3 - 1 (as in What_I_Want above). Any suggestions?
One clunky solution (linked above) is to create two (or more) columns, one for each lag distance, and somehow merge the results that I want:
df_clunky = df %>%
group_by(id) %>%
mutate(
deltaLag1 = num - lag(num, 1),
deltaLag2 = num - lag(num, 2))
Here is a base R method with replace and ave
ave(df$num , df$id, FUN=function(x) replace(x - x[1], 1, NA))
[1] NA 7 5 2 NA 0 2
ave applies the function to num within each id. The function subtracts the first element from the whole vector, and replace then sets the first element of the result to NA.
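Since the question already uses dplyr, the same result can be sketched with first() inside a grouped mutate. This assumes the rows are already in sampling-day order within each id; df_first and diff_first are illustrative names.

```r
library(dplyr)

df <- data.frame(
  id  = c("A", "A", "A", "A", "B", "B", "B"),
  num = c(1, 8, 6, 3, 7, 7, 9))

df_first <- df %>%
  group_by(id) %>%
  # Difference from the first value in each group; the first row itself gets NA
  mutate(diff_first = if_else(row_number() == 1, NA_real_, num - first(num))) %>%
  ungroup()

df_first$diff_first
# [1] NA  7  5  2 NA  0  2
```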

Combining frequency tables in R

I have a vector containing the frequencies of molecules within their respective molecular class for all molecules measured. I also have a vector that contains the per class frequency of significant molecules identified by variable selection. How can I merge these 2 vectors into a data frame and fill in empty frequencies with 0's (in R)?
Here is a workable example:
full = rep(letters[1:4], 4:7)
fullTable = table(full)
sub = rep(letters[1:2], c(2, 4))
subTable = table(sub)
I would like the table to look like:
print(data.frame(Letter=letters[1:4], fullFreq=c(4, 5, 6, 7), subFreq=c(2, 4, 0, 0)))
Try this (I assume you meant subTable=table(sub) in your last line):
res<-merge(as.data.frame(fullTable),as.data.frame(subTable),by.x=1,by.y=1,all=TRUE)
colnames(res)<-c("Letter","fullFreq","subFreq")
res[is.na(res)]<-0
With the library dplyr
library(dplyr)
full=rep(letters[1:4], 4:7)
sub=rep(letters[1:2], c(2,4))
df <- data.frame(Letter=unique(c(full, sub)))
df <- df %>%
left_join(as.data.frame(table(full)), by=c("Letter"="full")) %>%
left_join(as.data.frame(table(sub)), by=c("Letter"="sub"))
df[is.na(df)] <- 0
df
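Another option, sketched here in base R, avoids the merge and the NA replacement by tabulating sub over the full set of classes, so that absent classes are counted as 0 directly. It assumes every class in sub also appears in full (any other class would be silently dropped); res is an illustrative name.

```r
full <- rep(letters[1:4], 4:7)
sub  <- rep(letters[1:2], c(2, 4))

fullTable <- table(full)

# Re-tabulate `sub` with the levels of `full`, so missing classes get a count of 0
subTable <- table(factor(sub, levels = names(fullTable)))

res <- data.frame(Letter   = names(fullTable),
                  fullFreq = as.vector(fullTable),
                  subFreq  = as.vector(subTable))
res
```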
