Combine rows by group with differing NAs in each row - r

I can't find an exact answer to this problem, so I hope I'm not duplicating a question.
I have a dataframe as follows
groupid col1 col2 col3 col4
1 0 n NA 2
1 NA NA 2 2
What I'm trying to convey with this is that there are duplicate IDs where the total information is spread across both rows and I want to combine these rows to get all the information into one row. How do I go about this?
I've tried playing around with group_by and paste, but that makes the data messier (I get 22 instead of 2 in col4, for example), and sum() does not work because some columns are strings and the rest are categorical variables, so summing them would change the information.
Is there something I can do to collapse the rows and leave consistent data unchanged while filling in NAs?
EDIT:
Sorry desired output is as follows:
groupid col1 col2 col3 col4
1 0 n 2 2

Is this what you want? It uses zoo together with dplyr.
library(dplyr)
library(zoo)
df %>%
  group_by(groupid) %>%
  mutate_all(funs(na.locf(., na.rm = FALSE, fromLast = FALSE))) %>%
  filter(row_number() == n())
# A tibble: 1 x 5
# Groups: groupid [1]
groupid col1 col2 col3 col4
<int> <int> <chr> <int> <int>
1 1 0 n 2 2
EDIT1
Without the filter, it will give back the whole dataframe.
df %>%
  group_by(groupid) %>%
  mutate_all(funs(na.locf(., na.rm = FALSE, fromLast = FALSE)))
# A tibble: 2 x 5
# Groups: groupid [1]
groupid col1 col2 col3 col4
<int> <int> <chr> <int> <int>
1 1 0 n NA 2
2 1 0 n 2 2
The filter here just slices the last row of each group; na.locf carries forward the previous non-NA value, which means the last row in your group is the one you want.
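For illustration, this is roughly how na.locf behaves on a single vector (a minimal sketch, assuming zoo is loaded):
library(zoo)
na.locf(c(NA, 1, NA, 2), na.rm = FALSE)
# [1] NA  1  1  2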
Also, based on what @thelatemail recommended, you can do the following, which gives back the same answer.
df %>% group_by(groupid) %>% summarise_all(funs(.[!is.na(.)][1]))
EDIT2
Assuming you have conflicts and you want to show them all:
df <- read.table(text="groupid col1 col2 col3 col4
1 0 n NA 2
1 1 NA 2 2",
header=TRUE,stringsAsFactors=FALSE)
df
  groupid col1 col2 col3 col4
1       1    0    n   NA    2
2       1    1 <NA>    2    2
(note that col1 now has conflicting values across the two rows, and col4 holds duplicates)
df %>%
  group_by(groupid) %>%
  summarise_all(funs(toString(unique(na.omit(.)))))  # unique() collapses duplicated values such as in col4
groupid col1 col2 col3 col4
<int> <chr> <chr> <chr> <chr>
1 1 0, 1 n 2 2

Another option with just dplyr is to take the first non-NA value when available. You can do
library(dplyr)
dd <- read.table(text="groupid col1 col2 col3 col4
1 0 n NA 2
1 NA NA 2 2", header=TRUE)
dd %>%
  group_by(groupid) %>%
  summarise_all(~ first(na.omit(.)))

Would you be able to show the desired output in this case? Converting the data.frame into another type with as.vector() or as.matrix(), and grouping/factoring, might help.
UPDATE:
Find the unique element in each column and omit NAs.
df <- data.frame(groupid=c(1,1), col1=c(0,NA), col2=c('n', NA), col3=c(NA,2), col4=c(2,2))  # your input
out <- data.frame(df[1, ])  # output container: keep one row per duplicated groupid
for (i in 1:ncol(df)) out[, i] <- na.omit(unique(df[, i]))  # one non-NA value per column
print(out)

Related

New variable conditional on whether a df1 column value equals any value included in a specific df2 column

I am trying to make a new variable using mutate(). In df1, I have ranges of values in col1, col2, col3, and col4. I would like to create a new binary variable in df1 that is "1" if any of the col1-col4 values are found in a specific df2 column (let's say col10).
Thanks!
This is what I have tried so far, but I don't think it is returning a value of "1" for all matching values, only some of them.
df1 %>%
  mutate(newvar = case_when(
    col1 == df2$col10 | col2 == df2$col10 | col3 == df2$col10 | col4 == df2$col10 ~ 1
  ))
We could use if_any here. If the number of rows is the same, use == for elementwise comparison instead of %in% (a sketch of that variant follows the code below).
library(dplyr)
df1 %>%
  mutate(newvar = +(if_any(col1:col4, ~ .x %in% df2$col10)))
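As a sketch of the elementwise variant mentioned above (this assumes df1 and df2 have the same number of rows, so each row of df1 is compared only against the corresponding row of df2):
df1 %>%
  mutate(newvar = +(if_any(col1:col4, ~ .x == df2$col10)))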
First, let's make some dummy data. df1 has 4 columns and df2 has one column named col10. In the dummy data, rows 1,2,3 and 5 have matches in df2$col10.
library(dplyr)
df1 <- data.frame(col1 = 1:5, col2=3:7, col3=5:9, col4=10:14)
df2 <- data.frame(col10 = c(1,2,3,14))
We can use rowwise() to do computations within each row and then c_across() to pick out the variables of interest. The code checks whether any of the values in the four columns are in df2$col10 and returns a logical value. The as.numeric() turns that logical value into 0 (FALSE) or 1 (TRUE).
df1 %>%
  rowwise() %>%
  mutate(newvar = as.numeric(any(c_across(col1:col4) %in% df2$col10)))
#> # A tibble: 5 × 5
#> # Rowwise:
#> col1 col2 col3 col4 newvar
#> <int> <int> <int> <int> <dbl>
#> 1 1 3 5 10 1
#> 2 2 4 6 11 1
#> 3 3 5 7 12 1
#> 4 4 6 8 13 0
#> 5 5 7 9 14 1
Created on 2023-02-09 by the reprex package (v2.0.1)

Select specific/all columns in rowwise

I have the following table:
col1 col2 col3 col4
   1    2    1    4
   5    6    6    3
My goal is to find the max value per each row, and then find how many times it was repeated in the same row.
The resulting table should look like this:
col1 col2 col3 col4 max_val repetition
   1    2    1    4       4          1
   5    6    6    3       6          2
Now to achieve this, I am doing the following for Max:
df %>%
  rowwise() %>%
  mutate(max = max(col1:col4))
However, I am struggling to find the repetition count. My idea is to use this pseudocode inside mutate:
sum("select the current row, entirely or only some columns" == max). But I don't know how to select the entire row (or only some of its columns) and compare its contents against the max. How can we do this in dplyr?
A dplyr approach:
library(dplyr)
df %>%
  rowwise() %>%
  mutate(max_val = max(across(everything())),
         repetition = sum(across(col1:col4) == max_val))
# A tibble: 2 × 6
# Rowwise:
col1 col2 col3 col4 max_val repetition
<int> <int> <int> <int> <int> <int>
1 1 2 1 4 4 1
2 5 6 6 3 6 2
An R base approach:
df$max_val <- apply(df, 1, max)
df$repetition <- rowSums(df[, 1:4] == df[, 5])  # df[, 5] is the max_val just added; each column is compared against it row by row
For other (non-tidyverse) readers, a base R approach could be:
df$max_val <- apply(df, 1, max)
df$repetition <- apply(df, 1, function(x) sum(x[1:4] == x[5]))
Output:
# col1 col2 col3 col4 max_val repetition
# 1 1 2 1 4 4 1
# 2 5 6 6 3 6 2
Although dplyr has added many tools for working across rows of data, it remains, in my mind at least, much easier to adhere to tidy principles and always convert the data to "long" format for these kinds of operations.
Thus, here is a tidy approach:
library(tidyr)
df %>%
  mutate(row = row_number()) %>%
  pivot_longer(cols = -row) %>%
  group_by(row) %>%
  mutate(max_val = max(value), repetitions = sum(value == max(value))) %>%
  pivot_wider(id_cols = c(row, max_val, repetitions)) %>%
  select(col1:col4, max_val, repetitions)
The last select() is just to get the columns in the order you want.

Extract rows with levels of a dataframe

I've got a dataframe such as this:
df = data.frame(col1 = c(1,1,1,2,2,2,3,3,3),
                col2 = as.factor(c('a','b','b','a','a','a','b','a','b')))
Then I extract all the categories (levels) related to each column:
levels_df = expand.grid(unique(df$col1), unique(df$col2))
colnames(levels_df)=c('col1','col2')
My objective now is to apply a function to the rows belonging to each pair of levels. How can I do that?
sapply(levels, FUN, dataset=df)
Any other strategy to perform the same task is welcome. The function could be whatever you like, for example a counting function (how many rows belong to each pair of levels), in which case the output would be a table of counts per pair.
In conclusion, I want to subset rows from a dataframe using each pair of levels, so that I can manipulate those rows with a function (such as nrow()).
You can skip the levels part, and just use dplyr to group by col1 and col2, then count the rows. Finally, we use complete to add in any combinations that don't appear in our dataset:
library(tidyverse)
df %>%
  group_by(col1, col2) %>%                   # group df by col1 and col2
  summarise(n = n()) %>%                     # make a new column, n, which is the count
  complete(col1, col2, fill = list(n = 0))   # fill in missing pairs with 0
The output matches what you expected:
# A tibble: 6 x 3
# Groups: col1 [3]
col1 col2 n
<dbl> <fct> <dbl>
1 1 a 1
2 1 b 2
3 2 a 3
4 2 b 0
5 3 a 1
6 3 b 2
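If you do want to keep the explicit levels_df pairs from the question, a minimal base-R sketch along these lines could work (apply_per_pair is a made-up helper name, not from any of the answers): it subsets the rows matching each pair and applies an arbitrary function, here nrow().
apply_per_pair <- function(pairs, dataset, FUN = nrow) {
  sapply(seq_len(nrow(pairs)), function(i) {
    # rows of dataset whose col1/col2 match the i-th pair
    rows <- dataset[dataset$col1 == pairs$col1[i] & dataset$col2 == pairs$col2[i], ]
    FUN(rows)
  })
}
levels_df$n <- apply_per_pair(levels_df, df)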
I'm not sure if this specific count example will help you, but here's what you could do in the tidyverse:
library(tidyverse)
df %>%
  group_by(col1, col2) %>%
  count() %>%
  ungroup() %>%
  complete(col1, col2, fill = list(n = 0))
which gives:
# A tibble: 6 x 3
col1 col2 n
<dbl> <fct> <dbl>
1 1 a 1
2 1 b 2
3 2 a 3
4 2 b 0
5 3 a 1
6 3 b 2

I get n/a in columns when using the pivot_wider function

When I execute the following code:
data_ikea_wider <- data_ikea_longer %>%
  pivot_wider(id_cols = c(Record_no, Geography, City, Country, City.Country, Year),
              names_from = Category, values_from = Value)
The columns just contain NAs, as shown in the attached screenshot.
What am I doing wrong? Thanks!
We could use dcast from data.table
library(data.table)
dcast(setDT(dat), col1 ~ col2, value.var = 'val')
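As a rough illustration of what that dcast call returns (using the same toy data as the pivot_wider example below), the missing col1/col2 combination comes back as NA:
library(data.table)
dat <- data.table(col1 = c(1, 1, 2), col2 = c('a', 'b', 'a'), val = 1:3)
dcast(dat, col1 ~ col2, value.var = 'val')
#    col1 a  b
# 1:    1 1  2
# 2:    2 3 NA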
Getting NAs from a pivot is not unexpected: it means that not all of your id rows have a value for every new column.
For example,
dat <- data.frame(col1 = c(1,1,2), col2 = c('a', 'b', 'a'), val = 1:3)
dat
# col1 col2 val
# 1 1 a 1
# 2 1 b 2
# 3 2 a 3
If we want to pivot keeping col1 as an id, and turning col2 values into new columns, then it should be apparent that we'll end up with two rows (ids 1 and 2) and two new columns (a and b) to replace col2 and val. Unfortunately, since we only have three rows of data, the 2 rows × 2 columns = 4 cells cannot all be filled with 3 values, so one will be NA:
pivot_wider(dat, id_cols = col1, names_from = col2, values_from = val)
# # A tibble: 2 x 3
# col1 a b
# <dbl> <int> <int>
# 1 1 1 2
# 2 2 3 NA
If you see this and are surprised, thinking that you actually have the data ... then you should check your data importing and filtering to make sure you did not inadvertently remove it (or it was not provided initially).

New Column from String in dplyr [duplicate]

This question already has answers here:
cbind a dynamic column name from a string in R
(4 answers)
Closed 3 years ago.
I have a dataframe:
library(tidyverse)
df <- tribble(~col1, ~col2, 1, 2)
Now I want to create a column. I have the name of the new column in a string. It does work like this:
df %>%
mutate("col3" = 3)
# A tibble: 1 x 3
col1 col2 col3
<dbl> <dbl> <dbl>
1 1 2 3
But it does not work like this:
newColumnName <- "col3"
df %>%
mutate(newColumnName = 3)
# A tibble: 1 x 3
col1 col2 newColumnName
<dbl> <dbl> <dbl>
1 1 2 3
How do I create a new column that gets its name from a string in an object?
Use !! with the definition operator := (as described in the linked duplicate) to set a variable name as the column name.
:= supports unquoting on both the LHS and the RHS
library(dplyr)
newColumnName <- "col3"
df %>% mutate(!!newColumnName := 3)
# A tibble: 1 x 3
col1 col2 col3
<dbl> <dbl> <dbl>
1 1 2 3
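As a follow-up, with a reasonably recent dplyr/rlang you can also inject the name via the glue string syntax; this is a sketch of that alternative, not part of the original answer:
library(dplyr)
newColumnName <- "col3"
df %>% mutate("{newColumnName}" := 3)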
