R Aggregate multiple rows

My question seems to be a very common one, but the solutions I found on the internet don't work.
I would like to aggregate rows in a data frame in R.
Here is the structure of my data frame (df), which is a table of citations:
Autors Lannoy_2016 Ramadier_2014 Lord_2009 Ortar_2008
Burgess E 1 NA NA NA
Burgess E 1 NA NA NA
Burgess E 1 NA NA NA
Burgess E 1 NA NA NA
Kaufmann V NA 1 NA NA
Kaufmann V NA NA 1 NA
Kaufmann V NA NA NA 1
Orfeuil P 1 NA NA NA
Orfeuil P NA 1 NA NA
Sorokin P NA NA NA 1
I would like to get:
Autors Lannoy_2016 Ramadier_2014 Lord_2009 Ortar_2008
Burgess E 4 NA NA NA
Kaufmann V NA 1 1 1
Orfeuil P 1 1 NA NA
Sorokin P NA NA NA 1
I have tried the following solutions, but they don't work:
ddply(df,"Autors", numcolwise(sum))
and
df %>% group_by(Autors) %>% summarize_all(sum)
Both aggregate the rows, but the values (the sums of the 1 values) are not correct, and I don't understand why.
Do you have any idea?
Thank you very much!
Joël

You can also do the summing with rowsum(), although it (perhaps misleadingly) gives sums of 0 rather than NA for output cells whose inputs were all NA.
rowsum(df[, 2:5], df$Autors, na.rm = TRUE)
Gives:
Lannoy_2016 Ramadier_2014 Lord_2009 Ortar_2008
Burgess E 4 0 0 0
Kaufmann V 0 1 1 1
Orfeuil P 1 1 0 0
Sorokin P 0 0 0 1
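If the zeros are unwanted, one possible workaround (a sketch, not part of the answer above) is to count the non-missing inputs per group with a second rowsum() call and reset the cells whose count is zero to NA. The names sums and counts are illustrative, and df is rebuilt here from the question's data so the snippet runs on its own.

```r
# Data from the question (1 = citation, NA = no citation)
df <- data.frame(
  Autors = rep(c("Burgess E", "Kaufmann V", "Orfeuil P", "Sorokin P"),
               times = c(4, 3, 2, 1)),
  Lannoy_2016   = c(1L, 1L, 1L, 1L, NA, NA, NA, 1L, NA, NA),
  Ramadier_2014 = c(NA, NA, NA, NA, 1L, NA, NA, NA, 1L, NA),
  Lord_2009     = c(NA, NA, NA, NA, NA, 1L, NA, NA, NA, NA),
  Ortar_2008    = c(NA, NA, NA, NA, NA, NA, 1L, NA, NA, 1L)
)

# Per-group sums, ignoring NAs (all-NA cells come back as 0)
sums <- rowsum(df[, 2:5], df$Autors, na.rm = TRUE)

# Per-group counts of non-missing inputs, in the same row order
counts <- rowsum(+!is.na(df[, 2:5]), df$Autors)

# Cells that had no non-missing input become NA again
sums[counts == 0] <- NA
sums
```

Both rowsum() calls sort the groups the same way, so the two results line up cell by cell.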

It could be because na.rm is not used: sum() returns NA as soon as a group contains an NA.
library(dplyr)
df %>%
group_by(Autors) %>%
summarize_all(sum, na.rm = TRUE)
If both plyr and dplyr are loaded, summarise can get masked, although that is unlikely to affect summarise_all, which is a dplyr function.
Regarding the expected output: with na.rm = TRUE, sum() drops all NAs, so a group that contains only NAs returns 0. To avoid that, we can add a condition:
df %>%
group_by(Autors) %>%
summarize_all(funs(if(all(is.na(.))) NA else sum(., na.rm = TRUE)))
# A tibble: 4 x 5
# Autors Lannoy_2016 Ramadier_2014 Lord_2009 Ortar_2008
# <chr> <int> <int> <int> <int>
#1 Burgess E 4 NA NA NA
#2 Kaufmann V NA 1 1 1
#3 Orfeuil P 1 1 NA NA
#4 Sorokin P NA NA NA 1
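In current dplyr releases, funs() is deprecated and summarize_all() is superseded by across(). The same conditional sum might be written as below. This is a sketch assuming dplyr 1.0.0 or later; sum_or_na is a hypothetical helper name, and df is rebuilt from the question's data so the snippet is self-contained.

```r
library(dplyr)

# Data from the question (1 = citation, NA = no citation)
df <- data.frame(
  Autors = rep(c("Burgess E", "Kaufmann V", "Orfeuil P", "Sorokin P"),
               times = c(4, 3, 2, 1)),
  Lannoy_2016   = c(1L, 1L, 1L, 1L, NA, NA, NA, 1L, NA, NA),
  Ramadier_2014 = c(NA, NA, NA, NA, 1L, NA, NA, NA, 1L, NA),
  Lord_2009     = c(NA, NA, NA, NA, NA, 1L, NA, NA, NA, NA),
  Ortar_2008    = c(NA, NA, NA, NA, NA, NA, 1L, NA, NA, 1L)
)

# Hypothetical helper: NA for an all-NA group, otherwise the NA-ignoring sum
sum_or_na <- function(x) {
  if (all(is.na(x))) NA_integer_ else sum(x, na.rm = TRUE)
}

res <- df %>%
  group_by(Autors) %>%
  summarise(across(everything(), sum_or_na))
res
```

Inside a grouped summarise(), everything() skips the grouping column, so only the four citation columns are summed.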
data
df <- structure(list(Autors = c("Burgess E", "Burgess E", "Burgess E",
"Burgess E", "Kaufmann V", "Kaufmann V", "Kaufmann V", "Orfeuil P",
"Orfeuil P", "Sorokin P"), Lannoy_2016 = c(1L, 1L, 1L, 1L, NA,
NA, NA, 1L, NA, NA), Ramadier_2014 = c(NA, NA, NA, NA, 1L, NA,
NA, NA, 1L, NA), Lord_2009 = c(NA, NA, NA, NA, NA, 1L, NA, NA,
NA, NA), Ortar_2008 = c(NA, NA, NA, NA, NA, NA, 1L, NA, NA, 1L
)), .Names = c("Autors", "Lannoy_2016", "Ramadier_2014", "Lord_2009",
"Ortar_2008"), class = "data.frame", row.names = c(NA, -10L))

Related

Merge survey columns across variables in R

I am analyzing a very large survey in which I want to combine four parts of the survey, through several combinations of 4 questions. Below I have created a small example. A little background: each respondent answered only one of q2, q5, q8, or q9, because they filled in just 1 of the 4 parts of the survey depending on their answer to q1 (not shown here). Therefore, only one of the four columns contains an answer (1 or 2), while the others contain NAs. q2, q5, q8, and q9 are similar questions with the same answer options, which is why I want to combine them to make my dataset less wide and easier to analyze further.
q2_1 <- c(NA, NA, NA, NA, NA, NA, rep(c(1:2), 1))
q5_1 <- c(NA, NA, NA, NA, rep(c(1:2), 1), NA, NA)
q8_1 <- c(NA, NA, rep(c(1:2), 1), NA, NA, NA, NA)
q9_1 <- c(rep(c(1:2), 1), NA, NA, NA, NA, NA, NA)
q2_2 <- c(NA, NA, NA, NA, NA, NA, rep(c(1:2), 1))
q5_2 <- c(NA, NA, NA, NA, rep(c(1:2), 1), NA, NA)
q8_2 <- c(NA, NA, rep(c(1:2), 1), NA, NA, NA, NA)
q9_2 <- c(rep(c(1:2), 1), NA, NA, NA, NA, NA, NA)
df <- data.frame(q2_1, q5_1, q8_1, q9_1, q2_2, q5_2, q8_2, q9_2)
df
# running df shows:
q2_1 q5_1 q8_1 q9_1 q2_2 q5_2 q8_2 q9_2
1 NA NA NA 1 NA NA NA 1
2 NA NA NA 2 NA NA NA 2
3 NA NA 1 NA NA NA 1 NA
4 NA NA 2 NA NA NA 2 NA
5 NA 1 NA NA NA 1 NA NA
6 NA 2 NA NA NA 2 NA NA
7 1 NA NA NA 1 NA NA NA
8 2 NA NA NA 2 NA NA NA
My desired end result would be a data frame with only the columns for questions starting with q2_ (in the example that would be q2_1 and q2_2; in reality there are about 20 for this question), but with the NAs replaced by the answers from the corresponding q5_, q8_, and q9_ columns.
# desired end result
q2_1 q2_2
1 1 1
2 2 2
3 1 1
4 2 2
5 1 1
6 2 2
7 1 1
8 2 2
For single questions, I've done this using the code below, but this is very manual, and because q2, q5, q8, and q9 all go up to _20, I'm looking for a way to automate it.
# example single question
library(tidyverse)
df <- df %>%
mutate(q2_1 = case_when(!is.na(q2_1) ~ q2_1,
!is.na(q5_1) ~ q5_1,
!is.na(q8_1) ~ q8_1,
!is.na(q9_1) ~ q9_1))
I hope I explained myself well enough, and I'm looking forward to some directions!
Here's one way, using coalesce:
df %>%
mutate(q2_1 = do.call(coalesce, across(ends_with('_1'))),
q2_2 = do.call(coalesce, across(ends_with('_2')))) %>%
select(q2_1, q2_2)
#> q2_1 q2_2
#> 1 1 1
#> 2 2 2
#> 3 1 1
#> 4 2 2
#> 5 1 1
#> 6 2 2
#> 7 1 1
#> 8 2 2
q2_1 <- c(NA, NA, NA, NA, NA, NA, rep(c(1:2), 1))
q5_1 <- c(NA, NA, NA, NA, rep(c(1:2), 1), NA, NA)
q8_1 <- c(NA, NA, rep(c(1:2), 1), NA, NA, NA, NA)
q9_1 <- c(rep(c(1:2), 1), NA, NA, NA, NA, NA, NA)
q2_2 <- c(NA, NA, NA, NA, NA, NA, rep(c(1:2), 1))
q5_2 <- c(NA, NA, NA, NA, rep(c(1:2), 1), NA, NA)
q8_2 <- c(NA, NA, rep(c(1:2), 1), NA, NA, NA, NA)
q9_2 <- c(rep(c(1:2), 1), NA, NA, NA, NA, NA, NA)
df <- data.frame(q2_1, q5_1, q8_1, q9_1, q2_2, q5_2, q8_2, q9_2)
df
#> q2_1 q5_1 q8_1 q9_1 q2_2 q5_2 q8_2 q9_2
#> 1 NA NA NA 1 NA NA NA 1
#> 2 NA NA NA 2 NA NA NA 2
#> 3 NA NA 1 NA NA NA 1 NA
#> 4 NA NA 2 NA NA NA 2 NA
#> 5 NA 1 NA NA NA 1 NA NA
#> 6 NA 2 NA NA NA 2 NA NA
#> 7 1 NA NA NA 1 NA NA NA
#> 8 2 NA NA NA 2 NA NA NA
library(tidyverse)
suffix <- str_c("_", 1:2)
map_dfc(.x = suffix,
        .f = ~ transmute(df, !!str_c("q2", .x) := rowSums(across(ends_with(.x)), na.rm = TRUE)))
#> q2_1 q2_2
#> 1 1 1
#> 2 2 2
#> 3 1 1
#> 4 2 2
#> 5 1 1
#> 6 2 2
#> 7 1 1
#> 8 2 2
Created on 2022-04-04 by the reprex package (v2.0.1)
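For the full survey with suffixes up to _20, the same idea can be looped over every suffix without naming each column. The following base R sketch is one possible way to do that (it swaps coalesce for a Reduce over ifelse, which is an assumption of this example rather than code from the answers); merged and suffixes are illustrative names.

```r
# Example data from the question
df <- data.frame(
  q2_1 = c(NA, NA, NA, NA, NA, NA, 1L, 2L),
  q5_1 = c(NA, NA, NA, NA, 1L, 2L, NA, NA),
  q8_1 = c(NA, NA, 1L, 2L, NA, NA, NA, NA),
  q9_1 = c(1L, 2L, NA, NA, NA, NA, NA, NA),
  q2_2 = c(NA, NA, NA, NA, NA, NA, 1L, 2L),
  q5_2 = c(NA, NA, NA, NA, 1L, 2L, NA, NA),
  q8_2 = c(NA, NA, 1L, 2L, NA, NA, NA, NA),
  q9_2 = c(1L, 2L, NA, NA, NA, NA, NA, NA)
)

# Suffixes are derived from the q2_ columns: "_1", "_2", ...
suffixes <- sub("^q2", "", grep("^q2_", names(df), value = TRUE))

# For each suffix, take the first non-NA value across the matching columns
merged <- lapply(suffixes, function(s) {
  cols <- as.list(df[endsWith(names(df), s)])
  Reduce(function(a, b) ifelse(is.na(a), b, a), cols)
})
names(merged) <- paste0("q2", suffixes)
merged <- as.data.frame(merged)
merged
```

Because the suffix includes the underscore, "_1" does not match columns such as q2_11.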

Find max value within a data frame interval

I have a data frame that has x/y values every 5 seconds and a depth value every second (time column). Rows that have an x/y value have no depth.
x <- c("1430934", NA, NA, NA, NA, "1430939")
y <- c("4943206", NA, NA, NA, NA, "4943210")
time <- c(1:6)
depth <- c(NA, 10, 19, 84, 65, NA)
data <- data.frame(x, y, time, depth)
data
x y time depth
1 1430934 4943206 1 NA
2 NA NA 2 10
3 NA NA 3 19
4 NA NA 4 84
5 NA NA 5 65
6 1430939 4943210 6 NA
I would like to calculate the maximum depth between consecutive non-NA x/y values and add it to a new column, in the row of the starting x/y value. Here that is the max depth of rows 2-5. An example of the desired output:
x y time depth newvar
1 1430934 4943206 1 NA 84
2 NA NA 2 10 NA
3 NA NA 3 19 NA
4 NA NA 4 84 NA
5 NA NA 5 65 NA
6 1430939 4943210 6 NA NA
This should repeat whenever a new x/y value is present.
You can use ave, with cumsum and !is.na building the groups:
data$newvar <- ave(data$depth, cumsum(!is.na(data$x)),
  FUN = function(x) if(all(is.na(x))) NA else {
    c(max(x, na.rm=TRUE), rep(NA, length(x)-1))})
data
# x y time depth newvar
#1 1430934 4943206 1 NA 84
#2 <NA> <NA> 2 10 NA
#3 <NA> <NA> 3 19 NA
#4 <NA> <NA> 4 84 NA
#5 <NA> <NA> 5 65 NA
#6 1430939 4943210 6 NA NA
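To see why cumsum(!is.na(x)) works as a grouping key, it can help to look at the group labels on their own. The sketch below uses the question's vectors; max_or_na and interval_max are illustrative names, not part of the answer.

```r
x     <- c("1430934", NA, NA, NA, NA, "1430939")
depth <- c(NA, 10, 19, 84, 65, NA)

# A new interval starts at every non-NA x, so the running count of
# non-NA values labels each interval
grp <- cumsum(!is.na(x))
grp
# [1] 1 1 1 1 1 2

# Max per interval; an all-NA interval yields NA rather than -Inf
max_or_na <- function(d) {
  if (all(is.na(d))) NA_real_ else max(d, na.rm = TRUE)
}
interval_max <- tapply(depth, grp, max_or_na)
interval_max
```

The all-NA guard matters because max(d, na.rm = TRUE) on a vector with no non-missing values returns -Inf with a warning.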
Using dplyr, we can create groups of 5 rows and set the first row of each group to the group's max value, ignoring NA values:
library(dplyr)
df %>%
group_by(grp = ceiling(time/5)) %>%
mutate(depth = ifelse(row_number() == 1, max(depth, na.rm = TRUE), NA))
In base R, we can use tapply :
inds <- seq(1, nrow(df), 5)
df$depth[inds] <- tapply(df$depth, ceiling(df$time/5), max, na.rm = TRUE)
df$depth[-inds] <- NA
You could also try ave, as below:
df <- within(df,
newvar <- ave(depth,
ceiling(time/5),
FUN = function(x) ifelse(length(x)>1&is.na(x),max(na.omit(x)),NA)))
such that
> df
x y time depth newvar
1 1430934 4943206 1 NA 84
2 NA NA 2 10 NA
3 NA NA 3 19 NA
4 NA NA 4 84 NA
5 NA NA 5 65 NA
6 1430939 4943210 6 NA NA
DATA
df <- structure(list(x = c(1430934L, NA, NA, NA, NA, 1430939L), y = c(4943206L,
NA, NA, NA, NA, 4943210L), time = 1:6, depth = c(NA, 10L, 19L,
84L, 65L, NA)), class = "data.frame", row.names = c("1", "2",
"3", "4", "5", "6"))
Here is another option using data.table:
library(data.table)
setDT(data)[, newvar := replace(frollapply(depth, 5L, max, na.rm=TRUE, align="left"),
seq(.N) %% 5L != 1L, NA_integer_)]

How to combine rowSums and ifelse with mutate

I need to combine rowSums and ifelse in order to create a new variable. My data looks like this:
boss var1 var2 var3 newvar
1 NA NA 3 NA
1 2 3 3 8
2 NA NA NA 0
2 NA NA NA 0
2 NA NA NA 0
1 1 NA 2 3
If boss==1 and there is more than one missing value in var1 to var3, newvar should be NA; otherwise it should be the result of var1+var2+var3 (ignoring a single missing value).
If boss==2, newvar should automatically be 0.
So far, I have been able to solve parts of the problem using dplyr:
mutate(newvar=rowSums(.[,2:4],na.rm=TRUE) +
ifelse(rowSums(is.na(.[,2:4]))>1 & boss==2,NA,0))
mutate(newvar=ifelse(boss==2,0,NA))
However, I'm struggling to combine the two. Any help is much appreciated.
Here is one option with case_when where we create an index ('i1') which computes the number of NA elements in the row. The index is used in the case_when to create logical conditions to assign the values
df %>%
mutate(i1 = rowSums(is.na(.[-1]))) %>%
mutate(newvar = case_when(i1 > 1 & boss==1 ~ NA_integer_,
boss==2 ~ 0L,
i1 <=1 & boss != 2~ as.integer(rowSums(.[2:4], na.rm = TRUE)))) %>%
select(-i1)
# boss var1 var2 var3 newvar
#1 1 NA NA 3 NA
#2 1 2 3 3 8
#3 2 NA NA NA 0
#4 2 NA NA NA 0
#5 2 NA NA NA 0
#6 1 1 NA 2 3
In base R, this can be done by creating an index, without using ifelse:
i1 <- df$boss != 2
tmp <- i1 * df[-1]
df$newvar <- NA^(rowSums(is.na(tmp)) > 1 & i1) * rowSums(tmp, na.rm = TRUE)
df$newvar
#[1] NA 8 0 0 0 3
data
df <- structure(list(boss = c(1L, 1L, 2L, 2L, 2L, 1L), var1 = c(NA,
2L, NA, NA, NA, 1L), var2 = c(NA, 3L, NA, NA, NA, NA), var3 = c(3L,
3L, NA, NA, NA, 2L)), .Names = c("boss", "var1", "var2", "var3"
), row.names = c(NA, -6L), class = "data.frame")
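The base R version relies on the fact that NA^0 is 1 while NA^1 is NA, so raising NA to a logical vector gives a multiplier that keeps some values and blanks out others. A small illustration, with made-up vectors mask and totals:

```r
# Anything raised to the power 0 is 1 in R, including NA
NA^0
# [1] 1
NA^1
# [1] NA

# A logical mask therefore becomes a vector of NAs (blank out) and 1s (keep)
mask   <- c(TRUE, FALSE, FALSE)
totals <- c(3, 8, 0)
masked <- NA^mask * totals
masked
# [1] NA  8  0
```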
A base R solution using apply:
df$newvar <- apply(df,1, function(x){
if(x["boss"]==2){
0
} else if(sum(is.na(x[-1])) > 1){
NA
} else{
sum(x[-1], na.rm = TRUE)
}
})
# boss var1 var2 var3 newvar
# 1 1 NA NA 3 NA
# 2 1 2 3 3 8
# 3 2 NA NA NA 0
# 4 2 NA NA NA 0
# 5 2 NA NA NA 0
# 6 1 1 NA 2 3
Data:
df <- read.table(text =
"boss var1 var2 var3
1 NA NA 3
1 2 3 3
2 NA NA NA
2 NA NA NA
2 NA NA NA
1 1 NA 2",
header = TRUE, stringsAsFactors = FALSE)
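Since the question asks specifically about combining rowSums and ifelse, here is a vectorized sketch of that combination in base R. It is one possible formulation under the rules stated in the question; vars, n_missing, and total are illustrative names.

```r
df <- data.frame(
  boss = c(1L, 1L, 2L, 2L, 2L, 1L),
  var1 = c(NA, 2L, NA, NA, NA, 1L),
  var2 = c(NA, 3L, NA, NA, NA, NA),
  var3 = c(3L, 3L, NA, NA, NA, 2L)
)

vars <- c("var1", "var2", "var3")

# Number of missing values and NA-ignoring sum for each row
n_missing <- rowSums(is.na(df[vars]))
total     <- rowSums(df[vars], na.rm = TRUE)

# boss == 2 always gives 0; otherwise NA when more than one value is missing
df$newvar <- ifelse(df$boss == 2, 0,
                    ifelse(n_missing > 1, NA, total))
df$newvar
# [1] NA  8  0  0  0  3
```

Computing both row summaries once up front avoids looping over rows with apply.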

Conditional Column Formatting

I have a data frame that looks like this:
cat df1 df2 df3
1 1 NA 1 NA
2 1 NA 2 NA
3 1 NA 3 NA
4 2 1 NA NA
5 2 2 NA NA
6 2 3 NA NA
I want to populate df3 so that when cat = 1, df3 = df2 and when cat = 2, df3 = df1. However I am getting a few different error messages.
My current code looks like this:
df$df3[df$cat == 1] <- df$df2
df$df3[df$cat == 2] <- df$df1
Try this code, which uses the same row filter on both sides of each assignment:
df[df$cat==1,"df3"]<-df[df$cat==1,"df2"]
df[df$cat==2,"df3"]<-df[df$cat==2,"df1"]
The output:
df
cat df1 df2 df3
1 1 NA 1 1
2 1 NA 2 2
3 1 NA 3 3
4 2 1 NA 1
5 2 2 NA 2
6 2 3 NA 3
You can try
ifelse(df$cat == 1, df$df2, df$df1)
[1] 1 2 3 1 2 3
# saving
df$df3 <- ifelse(df$cat == 1, df$df2, df$df1)
# if there are other values than 1 and 2 you can try a nested ifelse
# that is setting other values to NA
ifelse(df$cat == 1, df$df2, ifelse(df$cat == 2, df$df1, NA))
# or you can try a tidyverse solution.
library(tidyverse)
df %>%
mutate(df3=case_when(cat == 1 ~ df2,
cat == 2 ~ df1))
cat df1 df2 df3
1 1 NA 1 1
2 1 NA 2 2
3 1 NA 3 3
4 2 1 NA 1
5 2 2 NA 2
6 2 3 NA 3
# data
df <- structure(list(cat = c(1L, 1L, 1L, 2L, 2L, 2L), df1 = c(NA, NA,
NA, 1L, 2L, 3L), df2 = c(1L, 2L, 3L, NA, NA, NA), df3 = c(NA,
NA, NA, NA, NA, NA)), .Names = c("cat", "df1", "df2", "df3"), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6"))

Required order of columns names which does not have NULL values

My data is a bit unusual, and I need to find the order of fields/columns that follows a certain pattern.
For instance, one field (say Sub3) has values for some rows, followed by NULL values; then another field (like Sub1) continues with some values, again followed by NULL values.
In some cases, multiple fields have values in the same rows (like Sub2 and Sub4).
In the case below, the solution is the vector of field names that follows the pattern: c(Sub3,Sub1,c(Sub2,Sub4),Sub5)
Here is the data in reproducible format:
structure(list(RollNo = 1:10, Sub1 = c(NA, NA, NA, NA, 3L, 2L,
NA, NA, NA, NA), Sub2 = c(NA, NA, NA, NA, NA, NA, "A", "B", NA,
NA), Sub3 = c(4L, 3L, 5L, 6L, NA, NA, NA, NA, NA, NA), Sub4 = c(NA,
NA, NA, NA, NA, NA, 2L, 5L, NA, NA), Sub5 = c(NA, NA, NA, NA,
NA, NA, NA, NA, 7L, NA)), .Names = c("RollNo", "Sub1", "Sub2",
"Sub3", "Sub4", "Sub5"), row.names = c(NA, -10L), class = c("data.table",
"data.frame"))
It sounds like you want to sort the columns by the position of their first non-NA value. If df is your data:
sapply(df, function(x) min(Inf, head(which(!is.na(x)),n=1)))
# RollNo Sub1 Sub2 Sub3 Sub4 Sub5
# 1 5 7 1 7 9
This gives the first non-NA row for each column. Sorting on these values with order() is stable, meaning tied columns keep their original relative order.
There are a couple of ways to use this, one such:
order(sapply(df, function(x) min(Inf, head(which(!is.na(x)),n=1))))
# [1] 1 4 2 3 5 6
df[,order(sapply(df, function(x) min(Inf, head(which(!is.na(x)),n=1))))]
# RollNo Sub3 Sub1 Sub2 Sub4 Sub5
# 1 1 4 NA <NA> NA NA
# 2 2 3 NA <NA> NA NA
# 3 3 5 NA <NA> NA NA
# 4 4 6 NA <NA> NA NA
# 5 5 NA 3 <NA> NA NA
# 6 6 NA 2 <NA> NA NA
# 7 7 NA NA A 2 NA
# 8 8 NA NA B 5 NA
# 9 9 NA NA <NA> NA 7
# 10 10 NA NA <NA> NA NA
I'm inferring from the column names that RollNo should always be first, so:
df[,c(1, 1 + order(sapply(df[-1], function(x) min(Inf, head(which(!is.na(x)),n=1)))))]
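To get the nested pattern from the question, c(Sub3,Sub1,c(Sub2,Sub4),Sub5), the first non-NA positions can also be used as a grouping key. This is a sketch that builds on the idea above and assumes a plain data.frame; first_row and groups are illustrative names.

```r
df <- data.frame(
  RollNo = 1:10,
  Sub1 = c(NA, NA, NA, NA, 3L, 2L, NA, NA, NA, NA),
  Sub2 = c(NA, NA, NA, NA, NA, NA, "A", "B", NA, NA),
  Sub3 = c(4L, 3L, 5L, 6L, NA, NA, NA, NA, NA, NA),
  Sub4 = c(NA, NA, NA, NA, NA, NA, 2L, 5L, NA, NA),
  Sub5 = c(NA, NA, NA, NA, NA, NA, NA, NA, 7L, NA)
)

# Row index of the first non-NA value in each Sub column
first_row <- sapply(df[-1], function(x) which(!is.na(x))[1])

# Columns that start in the same row end up in the same group,
# and the groups come out ordered by starting row
groups <- split(names(first_row), first_row)
unname(groups)
```

Columns that start in the same row, such as Sub2 and Sub4, land in one list element.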
Using:
DT[, nms := paste0(names(.SD)[!is.na(.SD)], collapse = ','), 1:nrow(DT), .SDcols = 2:6]
will get you:
> DT
RollNo Sub1 Sub2 Sub3 Sub4 Sub5 nms
1: 1 NA NA 4 NA NA Sub3
2: 2 NA NA 3 NA NA Sub3
3: 3 NA NA 5 NA NA Sub3
4: 4 NA NA 6 NA NA Sub3
5: 5 3 NA NA NA NA Sub1
6: 6 2 NA NA NA NA Sub1
7: 7 NA A NA 2 NA Sub2,Sub4
8: 8 NA B NA 5 NA Sub2,Sub4
9: 9 NA NA NA NA 7 Sub5
10: 10 NA NA NA NA NA
If you just want the specified vector:
unique(DT[, paste0(names(.SD)[!is.na(.SD)], collapse = ','), 1:nrow(DT), .SDcols = 2:6][V1!='']$V1)
which returns:
[1] "Sub3" "Sub1" "Sub2,Sub4" "Sub5"
As @Frank pointed out in the comments, you can also use:
melt(DT, id=1, na.rm = TRUE)[, toString(unique(variable)), by = RollNo][order(RollNo)]
