adding values using rowSums and tidyverse - r

I am having some issues trying to sum a bunch of columns in R. I am analyzing a huge dataset so I am reproducing a sample. of fake data.
Here's how the data looks like (I have 800 columns).
library(data.table)
dataset <- data.table(name = c("A", "B", "C", "D"), a1 = 1:4, a2 = c(1,2,NaN,5), a3 = 1:4, a4 = 1:4, a5 = c(1,2,NA,5), a6 = 1:4, a8 = 1:4)
dataset
What I want to do is sum the columns in buckets of 100 columns so, for example, all the values in the first row between the first column and the column 100, all the values in the first row between the column 1 and the column 200, all the values in the second row between the first column and the column 100, etc.
Using the sample data I've come with this solution using rowSums.
dataset %>%
mutate_if(~!is.numeric(.x), as.numeric) %>%
mutate_all(funs(replace_na(., 0))) %>%
mutate(sum = rowSums(.[,paste("a", 1:3, sep="")])) %>%
mutate(sum1 = rowSums(.[,paste("a", 4:5, sep="")])) %>%
mutate(sum2 = rowSums(.[,paste("a", 6:8, sep="")]))
but I am getting the following error:
Error in `[.data.frame`(., , paste("a", 6:8, sep = "")) : undefined columns selected
as the data does not include column a7.
The original data is missing a bunch of columns between a1 and a800 so solving this would be key to make it work.
What would it be the best way to approach and solve this error?
Also, I have a few more questions regarding the code I've written:
Is there a smarter way to select the column a1 and a100 instead of using this approach .[,paste("a", 1:3, sep="")]? I am interested in selected the column by name. I do not want to select it by the position of the column because sometimes a100 does not mean that is the column 100.
Also, I am converting the NAs and the NaNs to 0 in order to be able to sum the rows. I am doing it this way mutate_all(funs(replace_na(., 0))), losing my first row than contains the names of the values. What would it be the best way to replace NA and NaN without mutating the string values of the first row to 0?
The type of the columns I am adding is integer as I converted them beforehand mutate_if(~!is.numeric(.x), as.numeric) . Should I follow the same approach in case I have dbl?
Thank you!

Here is one way to do this after transforming data to longer format, for each name, we create a group of n rows and take the sum.
library(dplyr)
library(tidyr)
n <- 2 #No of columns to bucket. Change this to 100 for your case.
dataset %>%
pivot_longer(cols = -name, names_to = 'col') %>%
group_by(name) %>%
group_by(grp = rep(seq_len(n()), each = n, length.out = n()), add = TRUE) %>%
summarise(value = sum(value, na.rm = TRUE)) %>%
#If needed in wider format again
pivot_wider(names_from = grp, values_from = value, names_prefix = 'col')
# name col1 col2 col3 col4
# <chr> <dbl> <dbl> <dbl> <dbl>
#1 A 2 2 2 1
#2 B 4 4 4 2
#3 C 3 6 3 3
#4 D 9 8 9 4

Related

dplyr: User defined Function in summarise() involving two input vectors

I have a data frame of say 20 columns. Column 1 is group, column 2 is weights (not normalized to 1 or 100) and columns 3 to 20 contain data to be aggregated. There are some 250 rows but just 15 groups. So on an average for each group there are around 16-17 rows for each group.
For each of columns 3 to 20, I need to get group-wise weighted mean, weights being column 2.
As such this is easy by multiplying all columns by column 2 and then running
group_by(df, column1)%>%
summarise_all(sum_na)
Here sum_na is the usual function sum with na.rm=T
And then dividing column 3 to 20 by column 2.
The problem is that there are NAs scattered in the data frame. Say for example, 150th row (belonging to group 5, say) in column 12 has NA. While calculating weighted mean for Group 5 and column 12, the denominator should exclude the weight in row 150 of column 2.
How to do this? Sorry for the long post. Unable to provide sample data as unfortunately stack overflow is inaccessible in office (posting from mobile).
Would something like this work ?
library(dplyr)
df %>%
group_by(group) %>%
summarise_at(vars(col1:col18), ~weighted.mean(., wt, na.rm = TRUE))
You can select range of columns in vars. This removes NA values from the columns col1 to col18 with weight column as wt.
Tried this on this example :
df <- data.frame(group = rep(1:3, each = 3), wt = 1:9,
col1 = c(2:5, NA, 6:9), col2 = c(NA, 3:6, NA, 2:4))
df %>%
group_by(group) %>%
summarise_at(vars(col1:col2), ~weighted.mean(., wt, na.rm = TRUE))
# group col1 col2
# <int> <dbl> <dbl>
#1 1 3.33 3.6
#2 2 5.6 5.56
#3 3 8.08 3.08
We can use data.table methods
library(data.table)
setDT(df)[, lapply(.SD, function(x) weighted.mean(x, wt, na.rm = TRUE)),
by = group, .SDcols = col1:col18]

Merge 3 columns based on unique values?

I am trying to do a merge on 3 columns to a single one. The column values are separated by ";" and the new column need to unzip all the 3 column values and put the unique values. I know how to perform the merge column. But I am struggling to do unzipping the row value in 3 columns and finding unique value and putting in another column.
Here is the dummy data
n = c(2, 3, 5,10)
s = c("aa;bb;cc", "bb;dd;aa", "NA","xx;nn")
b = c("aa;bb;cc", "bb;dd;cc", "zz;bb;yy","NA")
t = c("aa;bb;cc", "bb;dd", "kk","NA")
df = data.frame(n, s, b,t)
> df
n s b t
1 2 aa;bb;cc aa;bb;cc aa;bb;cc
2 3 bb;dd;aa bb;dd;cc bb;dd
3 5 NA zz;bb;yy kk
4 10 xx;nn NA NA
The expected output is
> df
n finalcol
1 2 aa;bb;cc
2 3 bb;dd;aa;cc
3 5 zz;bb;yy;kk
4 10 xx;nn
What I have to perform a simple merge
dff = df %>% unite(finalcol, c(s,b,t), sep = ";", remove = TRUE)
Since you mentioned unite, I want to show a solution using separate, the complement of unite.
This solution keeps it within the tidyverse, which makes it easy to understand what's going on step-by-step. #d.b's answer in the comment works perfectly, is compact, and probably runs faster, but has a steeper learning curve to understand what's going on. With a piped tidyverse solution, you can run each line and see what's going on.
This solution first separates the terms, then converts the data from wide to long data format with gather, so that we can do operations such as check for and handle NAs and "NA"s, drop_na, and then distinct, to get unique values only (per group with the same "id" i.e. items from the same original line). Then, it uses summarise and paste to go back to the original format, but could also use spread then unite. (Note that na.rm=TRUE is an upcoming feature of unite https://github.com/tidyverse/tidyr/issues/203)
Sources: I used these handy dplyr and tidyr reference sheets:
https://github.com/rstudio/cheatsheets/raw/master/data-transformation.pdf
https://github.com/rstudio/cheatsheets/raw/master/data-import.pdf and I also worked out the solution based on the comments, questions, and answers here: How do I remove NAs with the tidyr::unite function?
# Load packages and data
library(tidyverse)
df = data.frame(n = c(2, 3, 5,10),
s = c("aa;bb;cc", "bb;dd;aa", "NA","xx;nn"),
b = c("aa;bb;cc", "bb;dd;cc", "zz;bb;yy","NA"),
t = c("aa;bb;cc", "bb;dd", "kk", NA))
# Solution
dff <- df %>%
separate(col = "s", into = c("s1", "s2", "s3")) %>%
separate(col = "b", into = c("b1", "b2", "b3")) %>%
separate(col = "t", into = c("t1", "t2", "t3")) %>% # Solution here could be enhanced to take in n columns and put them into however many columns as needed, using map or apply.
rowid_to_column('id') %>%
gather(key, value, -(id:n)) %>%
mutate_at(vars(value), na_if, "NA") %>%
drop_na(value) %>%
group_by(id) %>%
distinct(value, .keep_all = TRUE) %>%
summarise(n = first(n), finalcol = paste(value, collapse = ';')) %>%
ungroup() %>%
select(-id)
#> Warning: Expected 3 pieces. Missing pieces filled with `NA` in 2 rows [3,
#> 4].
#> Warning: Expected 3 pieces. Missing pieces filled with `NA` in 1 rows [4].
#> Warning: Expected 3 pieces. Missing pieces filled with `NA` in 2 rows [2,
#> 3].
dff
#> # A tibble: 4 x 2
#> n finalcol
#> <dbl> <chr>
#> 1 2 aa;bb;cc
#> 2 3 bb;dd;aa;cc
#> 3 5 zz;bb;yy;kk
#> 4 10 xx;nn
Created on 2019-03-26 by the reprex package (v0.2.1)

Return column names based on condition

I've a dataset with 18 columns from which I need to return the column names with the highest value(s) for each observation, simple example below. I came across this answer, and it almost does what I need, but in some cases I need to combine the names (like abin maxcolbelow). How should I do this?
Any suggestions would be greatly appreciated! If it's possible it would be easier for me to understand a tidyverse based solution as I'm more familiar with that than base.
Edit: I forgot to mention that some of the columns in my data have NAs.
library(dplyr, warn.conflicts = FALSE)
#turn this
Df <- tibble(a = 4:2, b = 4:6, c = 3:5)
#into this
Df <- tibble(a = 4:2, b = 4:6, c = 3:5, maxol = c("ab", "b", "b"))
Created on 2018-10-30 by the reprex package (v0.2.1)
Continuing from the answer in the linked post, we can do
Df$maxcol <- apply(Df, 1, function(x) paste0(names(Df)[x == max(x)], collapse = ""))
Df
# a b c maxcol
# <int> <int> <int> <chr>
#1 4 4 3 ab
#2 3 5 4 b
#3 2 6 5 b
For every row, we check which position has max values and paste the names at that position together.
If you prefer the tidyverse approach
library(tidyverse)
Df %>%
mutate(row = row_number()) %>%
gather(values, key, -row) %>%
group_by(row) %>%
mutate(maxcol = paste0(values[key == max(key)], collapse = "")) %>%
spread(values, key) %>%
ungroup() %>%
select(-row)
# maxcol a b c
# <chr> <int> <int> <int>
#1 ab 4 4 3
#2 b 3 5 4
#3 b 2 6 5
We first convert dataframe from wide to long using gather, then group_by each row we paste column names for max key and then spread the long dataframe to wide again.
Here's a solution I found that loops through column names in case you find it hard to wrap your head around spread/gather (pivot_wider/longer)
out_df <- Df %>%
# calculate rowwise maximum
rowwise() %>%
mutate(rowmax = max(across())) %>%
# create empty maxcol column
mutate(maxcol = "")
# loop through column names
for (colname in colnames(Df)) {
out_df <- out_df %>%
# if the value at the specified column name is the maximum, paste it to the maxcol
mutate(maxcol = ifelse(.data[[colname]] == rowmax, paste0(maxcol, colname), maxcol))
}
# remove rowmax column if no longer needed
out_df <- out_df %>%
select(-rowmax)

Computing average over different columns/rows in a list of data.frames

I've a list of 140 elements of type data.frame ('my.list'). I would like to compute 350 averages of certain values ranges in a certain column for a certain set of rows in a certain data.frame (this is a bit cryptic); so, 350 different averages like:
Of data.frame #1, the average of column 'Measure1', row 1:5;
Of data.frame #2, the average of column 'Measure3', row 1:4, etc. etc.
I have another data.frame ('my.dfAverage') which indicates for which data.frame, column and rows it needs the average. I want to write the 350 different averages and standard deviations to this data.frame (so with the columns: 'average_id', 'dataframe_number', 'column_name', 'row_numbers', 'average' and 'st_dev'). Some value ranges have NA's, these values can be dropped for computing the average.
What is the best way to automatically compute the 350 averages and standard deviations from the list of data.frames based on the info in this data.frame? I thought of creating a for-loop (or maybe the lapply function?), but I'm quite new to these functions, so I'm not sure what the way to go is here.
Small reproducible example of my list of data.frames:
my.df1 <- data.frame(ID = c(1:5),
Measure1 = c(2247,2247,1970,1964,1971),
Measure2 = c(2247,2247,NA,1964,1971))
my.df2 <- data.frame(ID = c(1:4),
Measure3 = c(2247,NA,1970,1964),
Measure5 = c(2247,2247,NA,1964))
my.df3 <- data.frame(ID = c(1:4),
Measure6 = c(2247,600,1970,1964),
Measure8 = c(2247,2247,NA,1964))
my.list <- list(list1 = my.df1, list2 = my.df2, list3 = my.df3)
Desired output table for the averages and standard deviation:
my.dfAverage <- data.frame(average_id = c(1:3),
dataframe_number = c(1,2,3),
column_name = c('Measure1','Measure3','Measure6'),
row_numbers = c('1:3','1:4','1:2'),
average = (NA),
st_dev = (NA))
This is a different approach than the one given above: I will use only base r functions: Point to note, ensure the data has stringsAsFactors=FALSE
write a function but ensure you index mylist correctly. then compute the function on this i e f(...,na.rm=T). to write a function using apply:
fun1=function(f){with(my.dfAverage,
mapply(function(x,y,z)
f(x[eval(parse(text=y)),z],na.rm=T),my.list,row_numbers,column_name))}
transform(my.dfAverage,average=fun1(mean),st_dev=fun1(sd))
average_id dataframe_number column_name row_numbers average st_dev
1 1 1 Measure1 1:3 2154.667 159.9260
2 2 2 Measure3 1:4 2060.333 161.6859
3 3 3 Measure6 1:2 1423.500 1164.6049
Data Used:
my.dfAverage <- data.frame(average_id = c(1:3),
dataframe_number = c(1,2,3),
column_name = c('Measure1','Measure3','Measure6'),
row_numbers = c('1:3','1:4','1:2'),
average = (NA),
st_dev = (NA),stringsAsFactors = F)
A solution using tidyverse.
First, expand the my.dfAverage based on row_numbers.
library(tidyverse)
my.dfAverage2 <- my.dfAverage %>%
separate(row_numbers, into = c("start", "end")) %>%
mutate(row_numbers = map2(start, end, `:`)) %>%
unnest() %>%
select(-start, -end) %>%
mutate(row_numbers = as.integer(row_numbers),
dataframe_number = as.integer(dataframe_number))
Second, transform all data frames in my.list and combine them to a single data frame.
my.list.df <- my.list %>%
setNames(1:length(.)) %>%
map_dfr(function(x){
x2 <- x %>%
gather(column_name, value, -ID)
return(x2)
},.id = "dataframe_number") %>%
mutate(ID = as.integer(ID), dataframe_number = as.integer(dataframe_number)) %>%
rename(row_numbers = ID)
Third, merge my.dfAverage2 and my.list.df and calculate the mean and standard deviation. my.dfAverage3 is the final output.
my.dfAverage3 <- my.dfAverage2 %>%
left_join(my.list.df, by = c("dataframe_number", "column_name", "row_numbers")) %>%
group_by(average_id, dataframe_number, column_name) %>%
summarise(row_numbers = paste(min(row_numbers), max(row_numbers), sep = ":"),
average = mean(value, na.rm = TRUE),
st_dev = sd(value, na.rm = TRUE)) %>%
ungroup()
my.dfAverage3
# A tibble: 3 x 6
# average_id dataframe_number column_name row_numbers average st_dev
# <int> <int> <chr> <chr> <dbl> <dbl>
# 1 1 1 Measure1 1:3 2155 160
# 2 2 2 Measure3 1:4 2060 162
# 3 3 3 Measure6 1:2 1424 1165
DATA
my.list is the same as OP's my.list.
my.dfAverage <- data.frame(average_id = c(1:3),
dataframe_number = c(1,2,3),
column_name = c('Measure1','Measure3','Measure6'),
row_numbers = c('1:3','1:4','1:2'))

Add a column with count of NAs and Mean

I have a data frame and I need to add another column to it which shows the count of NAs in all the other columns for that row and also the mean of the non-NA values.
I think it can be done in dplyr.
> df1 <- data.frame(a = 1:5, b = c(1,2,NA,4,NA), c = c(NA,2,3,NA,NA))
> df1
a b c
1 1 1 NA
2 2 2 2
3 3 NA 3
4 4 4 NA
5 5 NA NA
I want to mutate another column which counts the number of NAs in that row and another column which shows the mean of all the NON-NA values in that row.
library(dplyr)
count_na <- function(x) sum(is.na(x))
df1 %>%
mutate(means = rowMeans(., na.rm = T),
count_na = apply(., 1, count_na))
#### ANSWER FOR RADEK ####
elected_cols <- c('b', 'c')
df1 %>%
mutate(means = rowMeans(.[elected_cols], na.rm = T),
count_na = apply(.[elected_cols], 1, count_na))
As mentioned here https://stackoverflow.com/a/37732069/2292993
df1 <- data.frame(a = 1:5, b = c(1,2,NA,4,NA), c = c(NA,2,3,NA,NA))
df1 %>%
mutate(means = rowMeans(., na.rm = T),
count_na = rowSums(is.na(.)))
to work on selected cols (the example here is for col a and col c):
df1 %>%
mutate(means = rowMeans(., na.rm = T),
count_na = rowSums(is.na(select(.,one_of(c('a','c'))))))
You can try this:
#Find the row mean and add it to a new column in the dataframe
df1$Mean <- rowMeans(df1, na.rm = TRUE)
#Find the count of NA and add it to a new column in the dataframe
df1$CountNa <- rowSums(apply(is.na(df1), 2, as.numeric))
I recently faced a variation on this question where I needed to compute the percent of complete values, but for specific variables (not all variables). Here is an approach that worked for me.
df1 %>%
# create dummy variables representing if the observation is missing ----
# can modify here for specific variables ----
mutate_all(list(dummy = is.na)) %>%
# compute a row wise sum of missing ----
rowwise() %>%
mutate(
# number of missing observations ----
n_miss = sum(c_across(matches("_dummy"))),
# percent of observations that are complete (non-missing) ----
pct_complete = 1 - mean(c_across(matches("_dummy")))
) %>%
# remove grouping from rowwise ----
ungroup() %>%
# remove dummy variables ----
dplyr::select(-matches("dummy"))

Resources