R: Merging non-unique columns to consolidate a data frame

I'm having issues figuring out how to merge non-unique columns that look like this:
2_2 2_3 2_4 2_2 3_2
  1   2   3  NA  NA
  2   3  -1  NA  NA
 NA  NA  NA   3  -2
 NA  NA  NA  -2   4
To make them look like this:
2_2 2_3 2_4 3_2
  1   2   3  NA
  2   3  -1  NA
  3  NA  NA  -2
 -2  NA  NA   4
Essentially reshaping any non-unique columns. I have a large data set to work with so this is becoming an issue!

Note that data.frame() does not keep duplicate column names by default: with check.names = TRUE they are made unique (and syntactic). Even if you create them, they may be modified when other functions are applied, because make.unique is applied automatically. Assuming the data.frame was created with duplicate names, one option is to use split.default to split the data into a list of data frames by column name, then loop over the list with map and combine each group's columns with coalesce:
library(dplyr)
library(purrr)
map_dfc(split.default(df1, names(df1)),~ invoke(coalesce, .x))
Output:
# A tibble: 4 × 4
`2_2` `2_3` `2_4` `3_2`
<int> <int> <int> <int>
1 1 2 3 NA
2 2 3 -1 NA
3 3 NA NA -2
4 -2 NA NA 4
data
df1 <- structure(list(`2_2` = c(1L, 2L, NA, NA), `2_3` = c(2L, 3L, NA,
NA), `2_4` = c(3L, -1L, NA, NA), `2_2` = c(NA, NA, 3L, -2L),
`3_2` = c(NA, NA, -2L, 4L)), class = "data.frame", row.names = c(NA,
-4L))
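The same split-then-coalesce idea can be sketched in base R without dplyr or purrr. This is only an illustration, not the answer's code: the helper name coalesce_two is hypothetical, and df1 is rebuilt here so the block runs on its own.

```r
# Rebuild the example data with a duplicated column name.
df1 <- data.frame(c(1L, 2L, NA, NA), c(2L, 3L, NA, NA), c(3L, -1L, NA, NA),
                  c(NA, NA, 3L, -2L), c(NA, NA, -2L, 4L))
names(df1) <- c("2_2", "2_3", "2_4", "2_2", "3_2")

# Hypothetical helper: take the first vector, fill its NAs from the second.
coalesce_two <- function(a, b) ifelse(is.na(a), b, a)

# Split the columns by name, then fold each group into a single column.
groups <- split.default(df1, names(df1))
merged <- lapply(groups, function(g) Reduce(coalesce_two, g))

# check.names = FALSE keeps the non-syntactic names such as "2_2".
result <- data.frame(merged, check.names = FALSE)
result
```

Reduce handles groups with any number of duplicated columns, so the sketch is not limited to pairs.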

Also using coalesce:
Your column names are non-syntactic. R is strict about names (see https://adv-r.hadley.nz/names-values.html and the explanation by @akrun above). Assuming the names have been repaired to X2_2 and X2_2.1, the two columns can be merged directly:
library(dplyr)
df %>%
mutate(X2_2 = coalesce(X2_2, X2_2.1), .keep="unused")
X2_2 X2_3 X2_4 X3_2
1 1 2 3 NA
2 2 3 -1 NA
3 3 NA NA -2
4 -2 NA NA 4
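To see where names such as X2_2 and X2_2.1 come from, here is a small sketch of how data.frame() treats duplicated, non-syntactic names with and without check.names. The two-row vectors are made up for illustration.

```r
# Default behaviour: names are made syntactic and unique.
repaired <- data.frame(`2_2` = c(1L, NA), `2_2` = c(NA, 3L))
names(repaired)

# With check.names = FALSE the original (duplicated) names are kept.
kept <- data.frame(`2_2` = c(1L, NA), `2_2` = c(NA, 3L), check.names = FALSE)
names(kept)
```

The first call yields the names "X2_2" and "X2_2.1", which is why the mutate() call above refers to those columns.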

Related

Sum many rows with some of them have NA in all needed columns

I am trying to do rowSums but I got zero for the last row and I need it to be "NA".
My df is
a b c sum
1 1 4 7 12
2 2 NA 8 10
3 3 5 NA 8
4 NA NA NA NA
I used this code, based on the question "Sum of two Columns of Data Frame with NA Values":
df$sum<-rowSums(df[,c("a", "b", "c")], na.rm=T)
Any advice will be greatly appreciated
For each row check if it is all NA and if so return NA; otherwise, apply sum. We have selected columns a, b and c even though that is all the columns because the poster indicated that there might be additional ones.
sum_or_na <- function(x) if (all(is.na(x))) NA else sum(x, na.rm = TRUE)
transform(df, sum = apply(df[c("a", "b", "c")], 1, sum_or_na))
giving:
a b c sum
1 1 4 7 12
2 2 NA 8 10
3 3 5 NA 8
4 NA NA NA NA
Note
df in reproducible form is assumed to be:
df <- structure(list(a = c(1L, 2L, 3L, NA), b = c(4L, NA, 5L, NA),
c = c(7L, 8L, NA, NA)),
row.names = c("1", "2", "3", "4"), class = "data.frame")
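For larger data, a vectorized variant that avoids apply may be worth trying. The following is a sketch under the same assumed df; the names cols and all_na are introduced here for illustration only.

```r
df <- data.frame(a = c(1L, 2L, 3L, NA),
                 b = c(4L, NA, 5L, NA),
                 c = c(7L, 8L, NA, NA))
cols <- c("a", "b", "c")

# TRUE for rows in which every selected column is NA.
all_na <- rowSums(!is.na(df[cols])) == 0

# Keep NA for all-NA rows; otherwise sum while ignoring the NAs.
df$sum <- ifelse(all_na, NA, rowSums(df[cols], na.rm = TRUE))
df
```

Both rowSums calls work on the whole selection at once, so no per-row function call is needed.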

How to combine rowSums and ifelse with mutate

I need to combine rowSums and ifelse in order to create a new variable. My data looks like this:
boss var1 var2 var3 newvar
1 NA NA 3 NA
1 2 3 3 8
2 NA NA NA 0
2 NA NA NA 0
2 NA NA NA 0
1 1 NA 2 3
If boss==1 and there is more than one missing value in var1 to var3, newvar should be NA; otherwise, it should be the result of var1+var2+var3 (ignoring a single missing value).
If boss==2, newvar should be automatically 0.
So far, I have been able to solve parts of the problem using dplyr:
mutate(newvar=rowSums(.[,2:4],na.rm=TRUE) +
ifelse(rowSums(is.na(.[,2:4]))>1 & boss==2,NA,0))
mutate(newvar=ifelse(boss==2,0,NA)
However, I'm struggling to combine the two. Any help is much appreciated.
Here is one option with case_when where we create an index ('i1') which computes the number of NA elements in the row. The index is used in the case_when to create logical conditions to assign the values
df %>%
mutate(i1 = rowSums(is.na(.[-1]))) %>%
mutate(newvar = case_when(i1 > 1 & boss==1 ~ NA_integer_,
boss==2 ~ 0L,
i1 <=1 & boss != 2~ as.integer(rowSums(.[2:4], na.rm = TRUE)))) %>%
select(-i1)
# boss var1 var2 var3 newvar
#1 1 NA NA 3 NA
#2 1 2 3 3 8
#3 2 NA NA NA 0
#4 2 NA NA NA 0
#5 2 NA NA NA 0
#6 1 1 NA 2 3
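In base R, this can be done by creating a logical index and without using any ifelse. The trick is that NA^TRUE is NA while NA^FALSE is 1, so multiplying by NA^(condition) turns the selected rows into NA and leaves the others unchanged: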
i1 <- df$boss != 2
tmp <- i1 * df[-1]
df$newvar <- NA^(rowSums(is.na(tmp)) > 1 & i1) * rowSums(tmp, na.rm = TRUE)
df$newvar
#[1] NA 8 0 0 0 3
data
df <- structure(list(boss = c(1L, 1L, 2L, 2L, 2L, 1L), var1 = c(NA,
2L, NA, NA, NA, 1L), var2 = c(NA, 3L, NA, NA, NA, NA), var3 = c(3L,
3L, NA, NA, NA, 2L)), .Names = c("boss", "var1", "var2", "var3"
), row.names = c(NA, -6L), class = "data.frame")
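The NA^(condition) idiom used in the base R solution can be looked at in isolation. A minimal sketch with made-up vectors (flag and totals are illustrative names, not part of the answer):

```r
# Rows flagged TRUE should become NA; the others keep their value.
flag   <- c(TRUE, FALSE, FALSE)
totals <- c(3, 8, 0)

# NA^TRUE is NA^1 (NA); NA^FALSE is NA^0, which R defines as 1.
mask <- NA^flag

masked <- mask * totals
masked
```

Multiplying by the mask therefore acts like a vectorized "set to NA where flagged" without ifelse.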
A base R solution using apply, which evaluates the conditions row by row:
df$newvar <- apply(df,1, function(x){
if(x["boss"]==2){
0
} else if(sum(is.na(x[-1])) > 1){
NA
} else{
sum(x[-1], na.rm = TRUE)
}
})
# boss var1 var2 var3 newvar
# 1 1 NA NA 3 NA
# 2 1 2 3 3 8
# 3 2 NA NA NA 0
# 4 2 NA NA NA 0
# 5 2 NA NA NA 0
# 6 1 1 NA 2 3
Data:
df <- read.table(text =
"boss var1 var2 var3
1 NA NA 3
1 2 3 3
2 NA NA NA
2 NA NA NA
2 NA NA NA
1 1 NA 2",
header = TRUE, stringsAsFactors = FALSE)

Conditional Column Formatting

I have a data frame that looks like this:
cat df1 df2 df3
1 1 NA 1 NA
2 1 NA 2 NA
3 1 NA 3 NA
4 2 1 NA NA
5 2 2 NA NA
6 2 3 NA NA
I want to populate df3 so that when cat = 1, df3 = df2 and when cat = 2, df3 = df1. However I am getting a few different error messages.
My current code looks like this:
df$df3[df$cat == 1] <- df$df2
df$df3[df$cat == 2] <- df$df1
Try this code, which indexes both sides with the same condition so that the lengths match:
df[df$cat==1,"df3"]<-df[df$cat==1,"df2"]
df[df$cat==2,"df3"]<-df[df$cat==2,"df1"]
The output:
df
cat df1 df2 df3
1 1 NA 1 1
2 1 NA 2 2
3 1 NA 3 3
4 2 1 NA 1
5 2 2 NA 2
6 2 3 NA 3
You can try
ifelse(df$cat == 1, df$df2, df$df1)
[1] 1 2 3 1 2 3
# saving
df$df3 <- ifelse(df$cat == 1, df$df2, df$df1)
# if there are other values than 1 and 2 you can try a nested ifelse
# that is setting other values to NA
ifelse(df$cat == 1, df$df2, ifelse(df$cat == 2, df$df1, NA))
# or you can try a tidyverse solution.
library(tidyverse)
df %>%
mutate(df3=case_when(cat == 1 ~ df2,
cat == 2 ~ df1))
cat df1 df2 df3
1 1 NA 1 1
2 1 NA 2 2
3 1 NA 3 3
4 2 1 NA 1
5 2 2 NA 2
6 2 3 NA 3
# data
df <- structure(list(cat = c(1L, 1L, 1L, 2L, 2L, 2L), df1 = c(NA, NA,
NA, 1L, 2L, 3L), df2 = c(1L, 2L, 3L, NA, NA, NA), df3 = c(NA,
NA, NA, NA, NA, NA)), .Names = c("cat", "df1", "df2", "df3"), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6"))

Subset lagged values in R

For the data table below, I want to keep only the Unique_ID groups that have a Difference value of at least 2, without deleting the NA rows of those groups.
My_data_table <- structure(list(Unique_ID = structure(c(1L, 1L, 2L, 2L, 3L,
3L, 3L, 4L, 4L, 4L), .Label = c("1AA", "3AA", "5AA", "6AA"),
class = "factor"), Distance.km. = c(1, 2.05, 2, 4, 2, 4, 7,
8, 9, 10), Difference = c(NA, 1.05, NA, 2, NA, 2, 3, NA, 1, 1)),
.Names = c("Unique_ID", "Distance.km.", "Difference"),
class = "data.frame", row.names = c(NA, -10L))
My_data_table
Unique_ID Distance(km) Difference
1AA 1 NA
1AA 2.05 1.05
3AA 2 NA
3AA 4 2
5AA 2 NA
5AA 4 2
5AA 7 3
6AA 8 NA
6AA 9 1
6AA 10 1
Here is the result I'm looking for:
My_data_table
Unique_ID Distance(km) Difference
3AA 2 NA
3AA 4 2
5AA 2 NA
5AA 4 2
5AA 7 3
After converting to a data.table with setDT (the data is called df1 here), group by 'Unique_ID'. If the group contains at least one 'Difference' value of 2 or more, that is, if the sum of the logical vector (Difference >= 2) is greater than 0, return the subset of the data.table (.SD) where 'Difference' is either NA or greater than or equal to 2:
library(data.table)
setDT(df1)[, if(sum(Difference >=2, na.rm = TRUE)>0)
.SD[is.na(Difference)|Difference>=2], by = Unique_ID]
# Unique_ID Distance.km. Difference
#1: 3AA 2 NA
#2: 3AA 4 2
#3: 5AA 2 NA
#4: 5AA 4 2
#5: 5AA 7 3
A dplyr solution:
library(dplyr)
df %>%
group_by(Unique_ID) %>%
filter(any(Difference >= 2 & !is.na(Difference)))
# # A tibble: 5 x 3
# # Groups: Unique_ID [2]
# Unique_ID Distance.km. Difference
# <fctr> <dbl> <dbl>
# 1 3AA 2 NA
# 2 3AA 4 2
# 3 5AA 2 NA
# 4 5AA 4 2
# 5 5AA 7 3
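The same group-level filter can also be sketched in base R with ave, which broadcasts a per-group result back to every row. This is an illustrative alternative rather than one of the posted answers; has_big is a name chosen here, and the data frame is rebuilt so the block is self-contained.

```r
My_data_table <- data.frame(
  Unique_ID = factor(c("1AA", "1AA", "3AA", "3AA", "5AA",
                       "5AA", "5AA", "6AA", "6AA", "6AA")),
  Distance.km. = c(1, 2.05, 2, 4, 2, 4, 7, 8, 9, 10),
  Difference = c(NA, 1.05, NA, 2, NA, 2, 3, NA, 1, 1)
)

# For each group: 1 if any Difference is >= 2, else 0 (NAs ignored).
has_big <- ave(My_data_table$Difference, My_data_table$Unique_ID,
               FUN = function(d) any(d >= 2, na.rm = TRUE))

# Keep whole groups, including their NA rows.
result <- My_data_table[has_big == 1, ]
result
```

Like the dplyr version, this keeps every row of a qualifying group, so it only matches the data.table answer when such groups contain no Difference values below 2.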

Restructuring data using apply family of functions

I have inherited a data set with 23 attributes measured for each of 13 names (between-subjects: each participant rated only one name on all of these attributes). Right now it's structured such that the attributes are the fastest-moving factor, followed by the name. So the data look like this:
Sub# N1-item1 N1-item2 N1-item3 […] N2-item1 N2-item2 N2-item3
1 3 5 3 NA NA NA
2 NA NA NA 1 5 3
3 3 5 3 NA NA NA
4 NA NA NA 2 2 1
It needs to be restructured such that it's collapsed over name, and all of the item1 entries are in the same column (subjects don't matter for this purpose), as below (bearing in mind that there are 23 items, not 3, and 13 names, not 2):
Name item1 item2 item3
N1 3 5 3
N2 1 5 3
I can do this with loops, but I'd rather do it in a manner more natural to R, which I'm guessing would be one of the apply family of functions. I can't quite wrap my head around it, though. What is the smart way to do this?
Here's an answer using dplyr and tidyr:
library(dplyr)#loads libraries
library(tidyr)
dat %>% #name of your dataframe
gather(key, val, -Sub) %>% #gathers to long data, with id as Sub
filter(!is.na(val)) %>% #removes rows with NA for the value
separate(key, c("Name", "item")) %>% #split the column key into Name and item
spread(item, val) #spreads the data into wide format, with item as the columns
Sub Name item1 item2 item3
1 1 N1 3 5 3
2 2 N2 1 5 3
3 3 N1 3 5 3
4 4 N2 2 2 1
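Since tidyr 1.0.0, gather and spread are superseded by pivot_longer and pivot_wider. A sketch of the same reshaping with pivot_longer follows, assuming the columns are named like N1-item1 as in the question; the data frame is rebuilt here only for illustration.

```r
library(tidyr)

# Example data in the question's layout; check.names = FALSE keeps the hyphens.
dat <- data.frame(
  Sub = 1:4,
  `N1-item1` = c(3L, NA, 3L, NA), `N1-item2` = c(5L, NA, 5L, NA),
  `N1-item3` = c(3L, NA, 3L, NA), `N2-item1` = c(NA, 1L, NA, 2L),
  `N2-item2` = c(NA, 5L, NA, 2L), `N2-item3` = c(NA, 3L, NA, 1L),
  check.names = FALSE
)

# The part before "-" becomes the Name column; ".value" tells tidyr that
# the part after "-" names the output columns (item1, item2, item3).
long <- pivot_longer(dat, cols = -Sub,
                     names_to = c("Name", ".value"),
                     names_sep = "-",
                     values_drop_na = TRUE)
long
```

Dropping the Sub column afterwards (for example long[-1]) gives the Name/item layout the question asks for.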
Spin the column names around to be itemX-NY and then let reshape sort it out:
names(dat)[-1] <- gsub("(^.+?)-(.+?$)", "\\2-\\1", names(dat)[-1])
na.omit(reshape(dat, direction="long", idvar="Sub", varying=-1, sep="-"))
# Sub time item1 item2 item3
#1.N1 1 N1 3 5 3
#3.N1 3 N1 3 5 3
#2.N2 2 N2 1 5 3
#4.N2 4 N2 2 2 1
Where the data (shown here with the names already in itemX-NY form) was:
dat <- structure(list(Sub = 1:4, `item1-N1` = c(3L, NA, 3L, NA), `item2-N1` = c(5L,
NA, 5L, NA), `item3-N1` = c(3L, NA, 3L, NA), `item1-N2` = c(NA,
1L, NA, 2L), `item2-N2` = c(NA, 5L, NA, 2L), `item3-N2` = c(NA,
3L, NA, 1L)), .Names = c("Sub", "item1-N1", "item2-N1", "item3-N1",
"item1-N2", "item2-N2", "item3-N2"), row.names = c(NA, -4L), class = "data.frame")
