Add a column with count of NAs and Mean - r

I have a data frame and I need to add another column to it which shows the count of NAs in all the other columns for that row and also the mean of the non-NA values.
I think it can be done in dplyr.
> df1 <- data.frame(a = 1:5, b = c(1,2,NA,4,NA), c = c(NA,2,3,NA,NA))
> df1
  a  b  c
1 1  1 NA
2 2  2  2
3 3 NA  3
4 4  4 NA
5 5 NA NA
I want to mutate another column which counts the number of NAs in that row and another column which shows the mean of all the NON-NA values in that row.

library(dplyr)
count_na <- function(x) sum(is.na(x))
df1 %>%
  mutate(means = rowMeans(., na.rm = TRUE),
         count_na = apply(., 1, count_na))
#### ANSWER FOR RADEK ####
selected_cols <- c('b', 'c')
df1 %>%
  mutate(means = rowMeans(.[selected_cols], na.rm = TRUE),
         count_na = apply(.[selected_cols], 1, count_na))

As mentioned here https://stackoverflow.com/a/37732069/2292993
df1 <- data.frame(a = 1:5, b = c(1,2,NA,4,NA), c = c(NA,2,3,NA,NA))
df1 %>%
  mutate(means = rowMeans(., na.rm = TRUE),
         count_na = rowSums(is.na(.)))
To count NAs over selected columns only (here a and c; note that means is still computed over all columns):
df1 %>%
  mutate(means = rowMeans(., na.rm = TRUE),
         count_na = rowSums(is.na(select(., one_of(c('a', 'c'))))))

You can try this:
#Find the row mean and add it to a new column in the dataframe
df1$Mean <- rowMeans(df1, na.rm = TRUE)
#Find the count of NA and add it to a new column in the dataframe
df1$CountNa <- rowSums(is.na(df1))
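A minimal base R sketch of the same idea, assuming you want both statistics computed over a fixed set of columns. The vector cols is a hypothetical name introduced here; fixing it up front keeps newly added columns (such as the mean) out of the later NA count.

```r
df1 <- data.frame(a = 1:5, b = c(1, 2, NA, 4, NA), c = c(NA, 2, 3, NA, NA))

# Hypothetical selection: the columns to summarise
cols <- c("a", "c")

# Row-wise mean of the non-NA values in the selected columns
df1$means <- rowMeans(df1[cols], na.rm = TRUE)

# Row-wise count of NAs in the selected columns
df1$count_na <- rowSums(is.na(df1[cols]))

df1
```

Because both lines index df1[cols], the order in which the two new columns are added does not matter.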

I recently faced a variation on this question where I needed to compute the percent of complete values, but for specific variables (not all variables). Here is an approach that worked for me.
df1 %>%
  # create dummy variables representing if the observation is missing ----
  # can modify here for specific variables ----
  mutate_all(list(dummy = is.na)) %>%
  # compute a row wise sum of missing ----
  rowwise() %>%
  mutate(
    # number of missing observations ----
    n_miss = sum(c_across(matches("_dummy"))),
    # percent of observations that are complete (non-missing) ----
    pct_complete = 1 - mean(c_across(matches("_dummy")))
  ) %>%
  # remove grouping from rowwise ----
  ungroup() %>%
  # remove dummy variables ----
  dplyr::select(-matches("dummy"))
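For comparison, here is a small base R sketch of the same calculation restricted to specific variables, without creating intermediate dummy columns. The vector vars is a hypothetical name, and the choice of b and c is only an assumption for illustration.

```r
df1 <- data.frame(a = 1:5, b = c(1, 2, NA, 4, NA), c = c(NA, 2, 3, NA, NA))

# Hypothetical subset of variables to check for missingness
vars <- c("b", "c")

# Logical matrix: TRUE where a value is missing
miss <- is.na(df1[vars])

# Number of missing values per row
df1$n_miss <- rowSums(miss)

# Proportion of the selected variables that are complete per row
df1$pct_complete <- rowMeans(!miss)

df1
```

Note that pct_complete is a proportion between 0 and 1; multiply by 100 if a percentage is wanted.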

Related

How to separate integers from string in a data frame cell that are separated by commas?

I currently have a file that has a variety of responses to some questions. Each cell has anywhere from 1 to 4 numbers, followed by the word "Finished". For example, df[1,1] could equal "-5","2","1","Finished". I need to get rid of the word "Finished" and keep just the integers, so that I can add them together to get one number for that cell. How can I do this?
Another option, using the base R apply function:
df <- data.frame(X = c('-5,-2,1,Finished','1,2,7,Finished','-3,-2,4,Finished'))
new_df <- apply(df, c(1, 2), FUN = function(x){
  values <- trimws(unlist(strsplit(x, split = ","))) # Convert cell values to a vector
  values <- values[tolower(values) != "finished"] # Remove Finished
  return(sum(as.numeric(values), na.rm = TRUE)) # Add remaining integer values
})
new_df
X
[1,] -6
[2,] 10
[3,] -1
The above iterates through every cell in the data frame. For each cell, it converts the cell's value to a vector by splitting on commas, removes the 'finished' entry, and sums the remaining numeric values. new_df is a matrix with the same dimensions as df.
Maybe you can try the code below
df <- within(df,
             Y <- sapply(regmatches(X, gregexpr("[+-]?\\d+", X)),
                         function(v) sum(as.integer(v))))
such that
> df
X Y
1 -5,-2,1,Finished -6
2 1,2,7,Finished 10
3 -3,-2,4,Finished -1
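To see what the regular expression in this answer is doing, here is a short sketch that applies it to a single string. The names x, m and total are introduced here only for illustration.

```r
x <- "-5,-2,1,Finished"

# gregexpr() locates every optionally signed run of digits;
# regmatches() extracts the matched substrings as a list (one element per input string)
m <- regmatches(x, gregexpr("[+-]?\\d+", x))

m[[1]]
# [1] "-5" "-2" "1"

# Convert the matches to integers and add them up
total <- sum(as.integer(m[[1]]))
total
# [1] -6
```

The word "Finished" contains no digits, so it is never matched and does not need to be removed explicitly.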
Dummy Data
df <- data.frame(X = c('-5,-2,1,Finished','1,2,7,Finished','-3,-2,4,Finished'))
One option, after reading the file with read.csv/read.table, is to remove 'Finished', expand the rows with separate_rows (using convert = TRUE), and then take the sum for each original row.
library(dplyr)
library(tidyr)
library(stringr)
df1 %>%
mutate(rn = row_number(), col2 = str_remove(col2, ",\\s*[Ff]inished")) %>%
separate_rows(col2, sep= ",", convert = TRUE) %>%
group_by(rn) %>%
summarise(col3 = sum(col2, na.rm = TRUE)) %>%
select(-rn) %>%
bind_cols(df1, .)
# A tibble: 3 x 3
# col1 col2 col3
# <int> <chr> <int>
#1 1 -5,-2,1,Finished -6
#2 2 -3,-2,5,Finished 0
#3 3 3,4,2,Finished 9
Or using base R
df1$col3 <- sapply(sub(",[Ff]inished", "", df1$col2), function(str1)
sum(scan(text = str1, what = numeric(), sep=",", quiet = TRUE)))
data
df1 <- read.csv('yourfile.csv', stringsAsFactors = FALSE)
df1 <- data.frame(col1 = 1:3, col2 = c('-5,-2,1,Finished',
'-3,-2,5,Finished', '3,4,2,Finished'), stringsAsFactors = FALSE)

adding values using rowSums and tidyverse

I am having some issues trying to sum a bunch of columns in R. I am analyzing a huge dataset, so I am reproducing a sample of fake data.
Here's what the data looks like (I have 800 columns).
library(data.table)
dataset <- data.table(name = c("A", "B", "C", "D"), a1 = 1:4, a2 = c(1,2,NaN,5), a3 = 1:4, a4 = 1:4, a5 = c(1,2,NA,5), a6 = 1:4, a8 = 1:4)
dataset
What I want to do is sum the columns in buckets of 100 columns: for example, all the values in the first row between the first column and column 100, all the values in the first row between column 1 and column 200, all the values in the second row between the first column and column 100, etc.
Using the sample data, I've come up with this solution using rowSums.
dataset %>%
mutate_if(~!is.numeric(.x), as.numeric) %>%
mutate_all(funs(replace_na(., 0))) %>%
mutate(sum = rowSums(.[,paste("a", 1:3, sep="")])) %>%
mutate(sum1 = rowSums(.[,paste("a", 4:5, sep="")])) %>%
mutate(sum2 = rowSums(.[,paste("a", 6:8, sep="")]))
but I am getting the following error:
Error in `[.data.frame`(., , paste("a", 6:8, sep = "")) : undefined columns selected
as the data does not include column a7.
The original data is missing a bunch of columns between a1 and a800 so solving this would be key to make it work.
What would be the best way to approach and solve this error?
Also, I have a few more questions regarding the code I've written:
Is there a smarter way to select columns a1 to a100 than this approach, .[,paste("a", 1:3, sep="")]? I am interested in selecting the columns by name. I do not want to select them by position, because a100 is not necessarily the 100th column.
Also, I am converting the NAs and the NaNs to 0 in order to be able to sum the rows. I am doing it with mutate_all(funs(replace_na(., 0))), which loses my first column, the one that contains the names. What would be the best way to replace NA and NaN without turning the string values of that column into 0?
The type of the columns I am adding is integer as I converted them beforehand mutate_if(~!is.numeric(.x), as.numeric) . Should I follow the same approach in case I have dbl?
Thank you!
Here is one way to do this. After transforming the data to a longer format, we create groups of n rows within each name and take the sum of each group.
library(dplyr)
library(tidyr)
n <- 2 #No of columns to bucket. Change this to 100 for your case.
dataset %>%
  pivot_longer(cols = -name, names_to = 'col') %>%
  group_by(name) %>%
  # .add requires dplyr >= 1.0.0; older versions use add = TRUE
  group_by(grp = rep(seq_len(n()), each = n, length.out = n()), .add = TRUE) %>%
  summarise(value = sum(value, na.rm = TRUE)) %>%
  # If needed in wider format again
  pivot_wider(names_from = grp, values_from = value, names_prefix = 'col')
# name col1 col2 col3 col4
# <chr> <dbl> <dbl> <dbl> <dbl>
#1 A 2 2 2 1
#2 B 4 4 4 2
#3 C 3 6 3 3
#4 D 9 8 9 4
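If the buckets should follow the column names (a1 to a3, a4 to a5, and so on) rather than column positions, a base R sketch along the following lines may help. The helper sum_bucket is hypothetical, not part of any package; it keeps only the requested names that actually exist, so a missing column such as a7 is skipped instead of raising an error. A plain data.frame is assumed here in place of the data.table from the question.

```r
dataset <- data.frame(name = c("A", "B", "C", "D"),
                      a1 = 1:4, a2 = c(1, 2, NaN, 5), a3 = 1:4, a4 = 1:4,
                      a5 = c(1, 2, NA, 5), a6 = 1:4, a8 = 1:4)

# Hypothetical helper: sum the columns a<idx> that are present in df.
# intersect() silently drops names that do not exist (e.g. "a7");
# na.rm = TRUE ignores both NA and NaN, so no replacement by 0 is needed
# and the character column 'name' is never touched.
sum_bucket <- function(df, idx) {
  cols <- intersect(paste0("a", idx), names(df))
  rowSums(df[cols], na.rm = TRUE)
}

dataset$sum  <- sum_bucket(dataset, 1:3)
dataset$sum1 <- sum_bucket(dataset, 4:5)
dataset$sum2 <- sum_bucket(dataset, 6:8)

dataset
```

For the real data, the index ranges would become 1:100, 101:200, and so on; those boundaries are an assumption about how the buckets are meant to be defined.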

R - maximum value of variables when compared between levels of variable1 grouped by variable2

Consider the following data
set.seed(123)
example.df <- data.frame(
gene = sample(c("A", "B", "C", "D"), 100, replace = TRUE),
treated = sample(c("Yes", "No"), 100, replace = TRUE),
resp=rnorm(100, 10,5), effect = rnorm (100, 25, 5))
I am trying to get the maximum value for all variables when they are compared by the levels of gene and grouped by treated. I can create the gene combinations like so,
combn(sort(unique(example.df$gene)), 2, simplify = TRUE)
#     [,1] [,2] [,3] [,4] [,5] [,6]
#[1,] A    A    A    B    B    C
#[2,] B    C    D    C    D    D
#Levels: A B C D
Edit: The output I am looking for is a dataframe like this
comparison group max.resp max.effect
A-B no value1 value2
....
C-D no valueX valueY
A-B yes value3 value4
....
C-D yes valueXX valueYY
While I am able to get the max values for each individual gene level grouped by treated...
max.df <- example.df %>%
group_by(treated, gene) %>%
nest() %>%
mutate(mod = map(data, ~summarise_if(.x, is.numeric, max, na.rm = TRUE))) %>%
select(treated, gene, mod) %>%
unnest(mod) %>%
arrange(treated, gene)
Despite trying to tackle the issue for more than a day, I cannot figure out how to get the max for each numeric variable for each 2 level gene comparison (A vs B, A vs C, A vs D, B vs C, B vs D, and C vs D) grouped by treated.
Any help is appreciated. Thanks.
I found a solution. It might be a little messy, and I may tidy it up later, but it runs almost instantly.
library(tidyverse)
First I generate a data frame with two columns, Gen1 and Gen2, covering all possible comparisons. This is very similar to your use of combn, but it creates a data.frame.
GeneComp <- expand.grid(Gen1 = unique(example.df$gene), Gen2 = unique(example.df$gene)) %>% filter(Gen1 != Gen2) %>% arrange(Gen1)
Then I loop through its rows, filtering to the two genes and grouping by treated.
Comps <- list()
for(i in 1:nrow(GeneComp)){
  Comps[[i]] <- example.df %>% filter(gene == GeneComp[i,]$Gen1 | gene == GeneComp[i,]$Gen2) %>% # keep only the rows whose gene is in the ith comparison
    group_by(treated) %>% # then group by treated
    summarise_if(is.numeric, max) %>% # then summarise the max of each numeric column
    mutate(Comparison = paste(GeneComp[i,]$Gen1, GeneComp[i,]$Gen2, sep = "-")) # and generate the comparison variable
}
Comps <- bind_rows(Comps) # and finally join into a data frame
Let me know if it does everything you want.
Addendum: to get each comparison only once
It is important here that your genes are strings and not factors so you might have to do this
options(stringsAsFactors = FALSE)
example.df <- data.frame(
gene = c(sample(c("A", "B", "C", "D"), 100, replace = TRUE)),
treated = sample(c("Yes", "No"), 100, replace = TRUE),
resp=rnorm(100, 10,5), effect = rnorm (100, 25, 5))
Then again in expand.grid add the stringsAsFactors = F argument
GeneComp <- expand.grid(Gen1 = unique(example.df$gene), Gen2 = unique(example.df$gene), stringsAsFactors = F) %>% filter(Gen1 != Gen2) %>% arrange(Gen1)
That allows you to sort both inputs when pasting the Comparison variable in the loop. Each comparison will then appear twice (e.g. A-B from both the A,B row and the B,A row), but using the distinct function at the end removes the duplicates and gives the data the way you want it.
Comps <- list()
for(i in 1:nrow(GeneComp)){
  Comps[[i]] <- example.df %>% filter(gene == GeneComp[i,]$Gen1 | gene == GeneComp[i,]$Gen2) %>% # keep only the rows whose gene is in the ith comparison
    group_by(treated) %>% # then group by treated
    summarise_if(is.numeric, max) %>% # then summarise the max of each numeric column
    mutate(Comparison = paste(sort(c(GeneComp[i,]$Gen1, GeneComp[i,]$Gen2)), collapse = "-")) # and generate the comparison variable from the sorted pair
}
Comps <- bind_rows(Comps) %>% distinct() # and finally join into a data frame, dropping duplicates
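As a possible alternative to building both orderings and then de-duplicating, the unordered pairs can be generated once with combn. The following base R sketch assumes the example data from the question; pairs and comps are names introduced here, and aggregate is used in place of the dplyr pipeline.

```r
set.seed(123)
example.df <- data.frame(
  gene = sample(c("A", "B", "C", "D"), 100, replace = TRUE),
  treated = sample(c("Yes", "No"), 100, replace = TRUE),
  resp = rnorm(100, 10, 5), effect = rnorm(100, 25, 5),
  stringsAsFactors = FALSE)

# Each unordered pair of genes appears exactly once
pairs <- combn(sort(unique(example.df$gene)), 2, simplify = FALSE)

comps <- do.call(rbind, lapply(pairs, function(p) {
  # Keep only the rows belonging to the two genes being compared
  sub <- example.df[example.df$gene %in% p, ]
  # Maximum of each numeric variable within each treated group
  out <- aggregate(cbind(resp, effect) ~ treated, data = sub, FUN = max)
  out$comparison <- paste(p, collapse = "-")
  out
}))

comps
```

Since every pair is visited once, no distinct() step is needed afterwards.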

Computing average over different columns/rows in a list of data.frames

I have a list of 140 elements of type data.frame ('my.list'). I would like to compute 350 averages, each over a certain range of rows in a certain column of a certain data.frame (this is a bit cryptic); so, 350 different averages like:
Of data.frame #1, the average of column 'Measure1', row 1:5;
Of data.frame #2, the average of column 'Measure3', row 1:4, etc. etc.
I have another data.frame ('my.dfAverage') which indicates for which data.frame, column and rows it needs the average. I want to write the 350 different averages and standard deviations to this data.frame (so with the columns: 'average_id', 'dataframe_number', 'column_name', 'row_numbers', 'average' and 'st_dev'). Some value ranges have NA's, these values can be dropped for computing the average.
What is the best way to automatically compute the 350 averages and standard deviations from the list of data.frames based on the info in this data.frame? I thought of creating a for-loop (or maybe the lapply function?), but I'm quite new to these functions, so I'm not sure what the way to go is here.
Small reproducible example of my list of data.frames:
my.df1 <- data.frame(ID = c(1:5),
Measure1 = c(2247,2247,1970,1964,1971),
Measure2 = c(2247,2247,NA,1964,1971))
my.df2 <- data.frame(ID = c(1:4),
Measure3 = c(2247,NA,1970,1964),
Measure5 = c(2247,2247,NA,1964))
my.df3 <- data.frame(ID = c(1:4),
Measure6 = c(2247,600,1970,1964),
Measure8 = c(2247,2247,NA,1964))
my.list <- list(list1 = my.df1, list2 = my.df2, list3 = my.df3)
Desired output table for the averages and standard deviation:
my.dfAverage <- data.frame(average_id = c(1:3),
dataframe_number = c(1,2,3),
column_name = c('Measure1','Measure3','Measure6'),
row_numbers = c('1:3','1:4','1:2'),
average = (NA),
st_dev = (NA))
This is a different approach from the tidyverse answer: it uses only base R functions. Note that the data must be created with stringsAsFactors = FALSE.
Write a function that indexes my.list correctly and then applies a summary function f to the selection, i.e. f(..., na.rm = TRUE). Using mapply:
fun1 <- function(f){
  with(my.dfAverage,
       mapply(function(x, y, z)
         f(x[eval(parse(text = y)), z], na.rm = TRUE), my.list, row_numbers, column_name))
}
transform(my.dfAverage, average = fun1(mean), st_dev = fun1(sd))
average_id dataframe_number column_name row_numbers average st_dev
1 1 1 Measure1 1:3 2154.667 159.9260
2 2 2 Measure3 1:4 2060.333 161.6859
3 3 3 Measure6 1:2 1423.500 1164.6049
Data Used:
my.dfAverage <- data.frame(average_id = c(1:3),
dataframe_number = c(1,2,3),
column_name = c('Measure1','Measure3','Measure6'),
row_numbers = c('1:3','1:4','1:2'),
average = (NA),
st_dev = (NA),stringsAsFactors = F)
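A variation on this base R idea, sketched under two assumptions: row_numbers always has the form "start:end", and dataframe_number is the position of the data frame in my.list. The helpers parse_rows and stat_for are hypothetical names. Splitting the string avoids eval(parse()), and looking up my.list[[d]] means the rows of my.dfAverage do not have to be in the same order as the list.

```r
my.df1 <- data.frame(ID = 1:5,
                     Measure1 = c(2247, 2247, 1970, 1964, 1971),
                     Measure2 = c(2247, 2247, NA, 1964, 1971))
my.df2 <- data.frame(ID = 1:4,
                     Measure3 = c(2247, NA, 1970, 1964),
                     Measure5 = c(2247, 2247, NA, 1964))
my.df3 <- data.frame(ID = 1:4,
                     Measure6 = c(2247, 600, 1970, 1964),
                     Measure8 = c(2247, 2247, NA, 1964))
my.list <- list(list1 = my.df1, list2 = my.df2, list3 = my.df3)

my.dfAverage <- data.frame(average_id = 1:3,
                           dataframe_number = c(1, 2, 3),
                           column_name = c("Measure1", "Measure3", "Measure6"),
                           row_numbers = c("1:3", "1:4", "1:2"),
                           stringsAsFactors = FALSE)

# Hypothetical helper: turn "1:3" into the integer vector 1, 2, 3
parse_rows <- function(s) {
  bounds <- as.integer(strsplit(s, ":", fixed = TRUE)[[1]])
  seq(bounds[1], bounds[2])
}

# Hypothetical helper: apply summary function f to every requested range
stat_for <- function(f) {
  mapply(function(d, col, rows) {
    f(my.list[[d]][parse_rows(rows), col], na.rm = TRUE)
  },
  my.dfAverage$dataframe_number,
  my.dfAverage$column_name,
  my.dfAverage$row_numbers)
}

my.dfAverage$average <- stat_for(mean)
my.dfAverage$st_dev  <- stat_for(sd)

my.dfAverage
```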
A solution using tidyverse.
First, expand the my.dfAverage based on row_numbers.
library(tidyverse)
my.dfAverage2 <- my.dfAverage %>%
  separate(row_numbers, into = c("start", "end")) %>%
  mutate(row_numbers = map2(start, end, `:`)) %>%
  unnest(row_numbers) %>%
  select(-start, -end) %>%
  mutate(row_numbers = as.integer(row_numbers),
         dataframe_number = as.integer(dataframe_number))
Second, transform all data frames in my.list and combine them to a single data frame.
my.list.df <- my.list %>%
setNames(1:length(.)) %>%
map_dfr(function(x){
x2 <- x %>%
gather(column_name, value, -ID)
return(x2)
},.id = "dataframe_number") %>%
mutate(ID = as.integer(ID), dataframe_number = as.integer(dataframe_number)) %>%
rename(row_numbers = ID)
Third, merge my.dfAverage2 and my.list.df and calculate the mean and standard deviation. my.dfAverage3 is the final output.
my.dfAverage3 <- my.dfAverage2 %>%
left_join(my.list.df, by = c("dataframe_number", "column_name", "row_numbers")) %>%
group_by(average_id, dataframe_number, column_name) %>%
summarise(row_numbers = paste(min(row_numbers), max(row_numbers), sep = ":"),
average = mean(value, na.rm = TRUE),
st_dev = sd(value, na.rm = TRUE)) %>%
ungroup()
my.dfAverage3
# A tibble: 3 x 6
# average_id dataframe_number column_name row_numbers average st_dev
# <int> <int> <chr> <chr> <dbl> <dbl>
# 1 1 1 Measure1 1:3 2155 160
# 2 2 2 Measure3 1:4 2060 162
# 3 3 3 Measure6 1:2 1424 1165
DATA
my.list is the same as OP's my.list.
my.dfAverage <- data.frame(average_id = c(1:3),
dataframe_number = c(1,2,3),
column_name = c('Measure1','Measure3','Measure6'),
row_numbers = c('1:3','1:4','1:2'))

issues calculating rowwise maximum

Suppose I have the tibble dat below. I would like to calculate the row-wise maximum of (x 2, x 3) and then subtract x 1, where x can be either a or b. In my real data I have more than 3 columns, so something like 2:n (e.g., 2:3) would be great. I have tried many things, but none worked the way I wanted; I am still struggling with the string vs. column name distinction.
dat <- tibble(`a 1` = c(0, 0, 0), `a 2` = 1:3, `a 3` = 3:1,
`b 1` = rep(1, 3), `b 2` = 4:6, `b 3` = 6:4)
foo <- function(x = 'a')
{
???
}
end result:
if x == `a`
c(3, 2, 3)
if x == `b`
c(5, 4, 5)
Solution 1
This solution uses only base R. The idea is to define a function (max_minus_first) to calculate the answer. The max_minus_first function has two arguments. The first argument, dat, is a data frame for analysis with the same format as the OP provided. group is the name of the group for analysis. The end product is a vector with the answer.
max_minus_first <- function(dat, group){
# Get all column names with starting string "group"
col_names <- colnames(dat)
dat2 <- dat[, col_names[grepl(paste0("^", group), col_names)]]
# Get the maximum values from all columns except the first column
max_value <- apply(dat2[, -1], 1, max, na.rm = TRUE)
# Calculate max_value minus the values from the first column
final_value <- max_value - unlist(dat2[, 1], use.names = FALSE)
return(final_value)
}
max_minus_first(dat, "a")
# [1] 3 2 3
max_minus_first(dat, "b")
# [1] 5 4 5
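A compact variation on Solution 1, sketched with pmax instead of apply. It assumes the column names always have the form "<group> <number>" with a single space, and it uses a plain data.frame with check.names = FALSE in place of the tibble; the function name foo comes from the question, while the argument order and the internal names are assumptions.

```r
dat <- data.frame(`a 1` = c(0, 0, 0), `a 2` = 1:3, `a 3` = 3:1,
                  `b 1` = rep(1, 3), `b 2` = 4:6, `b 3` = 6:4,
                  check.names = FALSE)

foo <- function(dat, x = "a") {
  # All columns of the group, matched on "<x> " so that group "a"
  # does not pick up a group named e.g. "ab"
  cols  <- names(dat)[startsWith(names(dat), paste0(x, " "))]
  first <- paste(x, 1)
  rest  <- setdiff(cols, first)   # columns 2:n of the group
  # pmax() takes the element-wise (here: row-wise) maximum across vectors
  row_max <- do.call(pmax, c(unname(as.list(dat[rest])), na.rm = TRUE))
  row_max - dat[[first]]
}

foo(dat, "a")
# [1] 3 2 3
foo(dat, "b")
# [1] 5 4 5
```

Because rest is derived from the names, the same function works unchanged when a group has more than three columns.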
Solution 2
A solution using the tidyverse. The end product (dat2) is a tibble with the output from each group (a, b, ...)
library(tidyverse)
dat2 <- dat %>%
rowid_to_column() %>%
gather(Column, Value, -rowid, -ends_with(" 1")) %>%
separate(Column, into = c("Group", "Column_Number")) %>%
gather(Column_1, Value_1, ends_with(" 1")) %>%
separate(Column_1, into = c("Group_1", "Column_Number_1")) %>%
filter(Group == Group_1) %>%
group_by(rowid, Group, Value_1) %>%
summarise(Value = max(Value, na.rm = TRUE)) %>%
mutate(Final = Value - Value_1) %>%
ungroup() %>%
select(-starts_with("Value")) %>%
spread(Group, Final)
dat2
# # A tibble: 3 x 3
# rowid a b
# * <int> <dbl> <dbl>
# 1 1 3 5
# 2 2 2 4
# 3 3 3 5
Explanation
rowid_to_column() is from the tibble package, a way to create a new column based on row ID.
gather is from the tidyr package and converts the data frame from wide format to long format. I used gather twice because the first column of each group is treated differently from the other columns in the same group. ends_with(" 1") is a select helper from dplyr that selects the columns whose names end in " 1". Notice that the space in " 1" is important, because "1" alone may select other columns like a 11 if such columns exist.
separate is from the tidyr package to separate a column into two columns. I used it to separate the Group name and column numbers in each Group.
filter(Group == Group_1) is to filter rows with Group == Group_1.
group_by(rowid, Group, Value_1) and then summarise(Value = max(Value, na.rm = TRUE)) make sure the maximum from each Group is calculated.
mutate(Final = Value - Value_1) is to calculate the difference between maximum from each Group and the value from the first column. The results are stored in the Final column.
select(-starts_with("Value")) removes any columns with a name beginning with "Value".
spread from the tidyr package converts the data frame from long format to wide format.
Solution 3
Another tidyverse solution, which is similar to Solution 2. It uses do to perform the operation on each Group, which makes the code more concise.
dat2 <- dat %>%
rowid_to_column() %>%
gather(Column, Value, -rowid) %>%
separate(Column, into = c("Group", "Column_Number")) %>%
group_by(rowid, Group) %>%
do(tibble(Max = max(.$Value[.$Column_Number != 1]),
First = .$Value[.$Column_Number == 1])) %>%
mutate(Final = Max - First) %>%
select(-Max, -First) %>%
spread(Group, Final) %>%
ungroup()
dat2
# # A tibble: 3 x 3
# rowid a b
# * <int> <dbl> <dbl>
# 1 1 3 5
# 2 2 2 4
# 3 3 3 5
