R Dataframe - Extract unique rows from columns

I have a dataframe:
source= c("A", "A", "B")
target = c("B", "C", "C")
source_A = c(5, 5, 6)
target_A = c(6, 7, 7)
source_B = c(10, 10, 11)
target_B = c(11, 12, 12)
c = c(0.5, 0.6, 0.7)
df = data.frame(source, target, source_A, target_A, source_B, target_B, c)
> df
  source target source_A target_A source_B target_B   c
1      A      B        5        6       10       11 0.5
2      A      C        5        7       10       12 0.6
3      B      C        6        7       11       12 0.7
How can I reduce this dataframe so that it returns only the values associated with each unique source/target id, ignoring column c?
For the ids [A, B, C] the result would be:
  id A  B
1  A 5 10
2  B 6 11
3  C 7 12
At the moment I do something like this:
df1 <- df[,c("source","source_A", "source_B")]
df2 <- df[,c("target","target_A", "target_B")]
names(df1)[names(df1) == 'source'] <- 'id'
names(df1)[names(df1) == 'source_A'] <- 'A'
names(df1)[names(df1) == 'source_B'] <- 'B'
names(df2)[names(df2) == 'target'] <- 'id'
names(df2)[names(df2) == 'target_A'] <- 'A'
names(df2)[names(df2) == 'target_B'] <- 'B'
df3 <- rbind(df1,df2)
df3[!duplicated(df3$id),]
id A B
1 A 5 10
3 B 6 11
5 C 7 12
In reality, I have tens of columns so this is non-viable long term.
How can I do this more succinctly (and, ideally, in a way that generalises to more columns)?

library(dplyr)
library(magrittr)
df1 <- subset(df, select = grep("source", names(df)))  # columns whose names contain "source"
df2 <- subset(df, select = grep("target", names(df)))  # columns whose names contain "target"
names(df1) <- names(df2)
df <- bind_rows(df1, df2)
df %<>% group_by(target, target_A, target_B) %>% slice(1)
This should do it, but I do not quite know how you want to generalize it.
I don't think this is the most elegant solution in the world, but it serves the purpose. Hopefully the columns that you intend to use can be targeted by the column name string pattern!

Here's a more general method with dplyr functions. You basically need to gather everything into a long format, where you can rename the variable accordingly, then spread them back into id, A, B:
library(dplyr)
library(tidyr)
df %>%
  select(-c) %>%
  mutate(index = row_number()) %>%
  gather(key, value, -index) %>%
  separate(key, c("type", "name"), fill = "right") %>%
  mutate(name = ifelse(is.na(name), "id", name)) %>%
  spread(key = name, value = value) %>%
  select(id, matches("[A-Z]", ignore.case = FALSE)) %>%
  distinct
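For completeness, here is a sketch (not from the original answers; it needs dplyr 1.0 or later for rename_with) that generalises the manual rename-and-rbind idea to any number of value columns: strip the "source"/"target" prefix from each block of columns, stack the two blocks, and keep the first row per id. The stack_prefix helper is just an illustrative name.
library(dplyr)
# Assumes every column in a block is named either "<prefix>" or "<prefix>_<suffix>".
stack_prefix <- function(data, prefix) {
  data %>%
    select(starts_with(prefix)) %>%
    rename_with(~ sub(paste0("^", prefix), "id", .x)) %>%  # source -> id, source_A -> id_A
    rename_with(~ sub("^id_", "", .x), -id)                # id_A -> A, id_B -> B
}
bind_rows(stack_prefix(df, "source"), stack_prefix(df, "target")) %>%
  distinct(id, .keep_all = TRUE)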


Add 1 to column names that end in an integer

My problem is quite straightforward: I have a dataframe with many columns, some of which start with q03b_, like this:
ID ... q03b_0 q03b_1 q03b_2 ... q03b_14
1 ... a b c m
But I need to change the column names to q03b_other_1, q03b_other_2, q03b_other_3, etc (counting from 1 instead of 0). I managed to select the columns with rename_at and add the "other" to the column names, like this:
library(dplyr)
library(stringr)
df %>%
  rename_at(vars(matches('q03b_')), list(~ str_replace(., "b_(\\d+)", "_other_\\1")))
Which brings a dataframe like this:
ID ... q03_other_0 q03_other_1 q03_other_2 ... q03_other_14
1 ... a b c m
But I'm struggling to get to the final stage, which would be this:
ID ... q03_other_1 q03_other_2 q03_other_3 ... q03_other_15
1 ... a b c m
I guess I need to use a combination of as.numeric and as.character, but because of tidy evaluation I'm struggling to find a way to make this work. Any ideas?
Thanks !
With gsubfn:
library(dplyr)
library(readr)
library(gsubfn)
df %>%
  rename_at(vars(matches('q03b_')),
            list(~ gsubfn("b_\\d+$",
                          ~ paste0("_other_", parse_number(x) + 1),
                          .)))
Output
q03_other_1 q03_other_2 q03_other_3
1 a b c
I am not sure whether you need to take the number from the original column names and add 1 to it to create the new ones. This works without doing that -
library(dplyr)
df %>%
rename_with(~paste0('q03_other_', seq_along(.)), starts_with('q03b_'))
# ID q03_other_1 q03_other_2 q03_other_3
#1 1 a b c
data
df <- data.frame(ID = 1, q03b_0 = 'a', q03b_1 = 'b', q03b_2 = 'c')
Here is an alternative way using sprintf:
library(dplyr)
library(stringr)
df %>%
  select(-ID) %>%
  rename_with(~str_replace(., "[0-9]+$", sprintf("%.0f", seq_along(.)))) %>%  # renumber from 1
  rename_with(~str_replace(., "b", "")) %>%
  bind_cols(ID = df$ID)
q03_other_1 q03_other_2 q03_other_3 ID
1 a b c 1
We can also use
library(dplyr)
library(stringr)
df %>%
  rename_with(~ str_replace(., "b_\\d+$", function(x)
    str_c('_other_', readr::parse_number(x) + 1)), starts_with('q03b_'))
ID q03_other_1 q03_other_2 q03_other_3
1 1 a b c
data
df <- structure(list(ID = 1L, q03b_0 = "a", q03b_1 = "b", q03b_2 = "c"), class = "data.frame", row.names = c(NA,
-1L))
Try the following:
library(tidyverse)
df <- data.frame(
stringsAsFactors = FALSE,
ID = c(1L),
q03b_0 = c("a"),
q03b_1 = c("b"),
q03b_2 = c("c")
)
names(df)[-1] <- names(df)[-1] %>%
str_remove("_.*") %>%
paste0("_other_",1:length(.))
df
#> ID q03b_other_1 q03b_other_2 q03b_other_3
#> 1 1 a b c
EDIT: A more general solution:
library(tidyverse)
library(magrittr)  # for the %<>% assignment pipe used below
df <- data.frame(
stringsAsFactors = FALSE,
ID = c(1L),
q03b_0 = c("a"),
q03b_1 = c("b"),
q03b_2 = c("c")
)
names(df)[str_detect(names(df), "^q03b_")] %<>%
  str_split("_") %>%
  map_chr(~ paste0(.x[1], "_other_", 1 + as.numeric(.x[2])))
df
#> ID q03b_other_1 q03b_other_2 q03b_other_3
#> 1 1 a b c
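If you would rather not pull in extra packages, here is a base-R sketch of the same rename (not from the answers above; it uses the q03_other_ prefix shown in the question's target output): capture the trailing number, add 1, and paste the new name back together.
idx  <- grep("^q03b_", names(df))
nums <- as.integer(sub("^q03b_(\\d+)$", "\\1", names(df)[idx]))  # trailing number
names(df)[idx] <- paste0("q03_other_", nums + 1)                 # renumber from 1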

R group by problem: summing over multiple, overlapping ID combinations

I'm using the group_by function on a dataset in R, but some IDs have to be counted in more than one group. Here is a sample dataset:
df <- data.frame(ID = c ("A","A","B","C","C","D"),
Var1 = c(1,3,2,3,1,2))
ID Var1
A 1
A 3
B 2
C 3
C 1
D 2 
I have to group ID as A+B, B+C and D (say F = A+B and G = B+C), giving the target result dataset below:
ID Var1
F 6
G 6
D 2
I used the following code to solve it:
library(dplyr)
library(tidyr)
df <- df %>% mutate(F = ifelse(ID %in% c("A", "B"), 1, 0),
                    G = ifelse(ID %in% c("B", "C"), 1, 0),
                    D = ifelse(ID == "D", 1, 0))
df %>%
  gather(var, val, F:D) %>%
  filter(val == 1) %>%
  group_by(var) %>%
  summarise(Var1 = sum(Var1))
But this approach fails because of the memory limit (the real dataset is large).
Is there another way to solve it?
Any suggestions would be greatly appreciated.
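One possible direction, sketched here rather than taken from the thread: keep the group membership in a small lookup table and join it, so only rows that belong to two groups are duplicated, instead of building one indicator column per group. The groups table below is an illustrative assumption about the intended membership.
library(dplyr)
groups <- data.frame(ID    = c("A", "B", "B", "C", "D"),
                     group = c("F", "F", "G", "G", "D"))  # B sits in both F and G
df %>%
  inner_join(groups, by = "ID") %>%
  group_by(group) %>%
  summarise(Var1 = sum(Var1))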

Generating multiple columns at once with dplyr

I often have to dynamically generate multiple columns based on values in existing columns. Is there a dplyr equivalent of the following?:
cols <- c("x", "y")
foo <- c("a", "b")
df <- data.frame(a = 1, b = 2)
df[cols] <- df[foo] * 5
> df
a b x y
1 1 2 5 10
Not the most elegant:
library(tidyverse)
df %>%
  mutate_at(vars(foo), function(x) x * 5) %>%
  set_names(., nm = cols) %>%
  cbind(df, .)
a b x y
1 1 2 5 10
This can be made more elegant, as suggested by @akrun:
df %>%
  mutate_at(vars(foo), list(new = ~ . * 5)) %>%
  rename_at(vars(matches('new')), ~ c('x', 'y'))
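On dplyr 1.0 or later, across() is the usual replacement for mutate_at; here is a sketch along the same lines (not part of the original answers):
library(dplyr)
cols <- c("x", "y")
foo  <- c("a", "b")
df   <- data.frame(a = 1, b = 2)
df %>%
  mutate(across(all_of(foo), ~ .x * 5, .names = "new_{.col}")) %>%  # creates new_a, new_b
  rename_with(~ cols, starts_with("new_"))                          # rename them to x, y
#  a b x y
#1 1 2 5 10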

Apply function over data frame rows

I'm trying to apply a function over the rows of a data frame and return a value based on the value of each element in a column. I'd prefer to pass the whole dataframe instead of naming each variable as the actual code has many variables - this is a simple example.
I've tried purrr map_dbl and rowwise but can't get either to work. Any suggestions please?
#sample df
df <- data.frame(Y=c("A","B","B","A","B"),
X=c(1,5,8,23,31))
#required result
Res <- data.frame(Y=c("A","B","B","A","B"),
X=c(1,5,8,23,31),
NewVal=c(10,500,800,230,3100)
)
#use mutate and map or rowwise etc
Res <- df %>%
mutate(NewVal=map_dbl(.x=.,.f=FnAdd(.)))
Res <- df %>%
rowwise() %>%
mutate(NewVal=FnAdd(.))
#sample fn
FnAdd <- function(Data){
  if(Data$Y == "A"){
    X = Data$X * 10
  }
  if(Data$Y == "B"){
    X = Data$X * 100
  }
  return(X)
}
If there are multiple values, it is better to have a key/value dataset, join, and then do the multiplication:
keyVal <- data.frame(Y = c("A", "B"), NewVal = c(10, 100))
df %>%
  left_join(keyVal) %>%
  mutate(NewVal = X * NewVal)
# Y X NewVal
#1 A 1 10
#2 B 5 500
#3 B 8 800
#4 A 23 230
#5 B 31 3100
It is not clear how many unique values there are in the 'Y' column of the actual dataset. If there are only a few, then case_when can be used:
FnAdd <- function(Data){
  Data %>%
    mutate(NewVal = case_when(Y == "A" ~ X * 10,
                              Y == "B" ~ X * 100,
                              TRUE ~ X))
}
FnAdd(df)
# Y X NewVal
#1 A 1 10
#2 B 5 500
#3 B 8 800
#4 A 23 230
#5 B 31 3100
You were originally looking for a solution using dplyr's rowwise() function, so here is that solution. The nice thing about this approach is that you don't need to create a separate function.
Here's a version using nested ifelse():
df %>%
  rowwise() %>%
  mutate(NewVal = ifelse(Y == "A", X * 10,
                         ifelse(Y == "B", X * 100, X)))  # fall back to X for any other Y
and here's the version using case_when:
df %>%
  rowwise() %>%
  mutate(NewVal = case_when(Y == "A" ~ X * 10,
                            Y == "B" ~ X * 100))
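Since the question also mentioned purrr, here is a sketch of a pmap-based version (FnAddRow is an illustrative helper, not from the thread): pmap_dbl() passes each row's columns to the function by name, so the individual variables never have to be spelled out inside mutate().
library(dplyr)
library(purrr)
FnAddRow <- function(Y, X, ...) {  # ... absorbs any columns the rule ignores
  if (Y == "A") X * 10 else if (Y == "B") X * 100 else X
}
df %>% mutate(NewVal = pmap_dbl(., FnAddRow))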

Avoiding missing row after summarise

I'm using RStudio version 0.98.1028 on Windows. When summarising a multi-level data frame with dplyr and sum(), I lose a row whose sum should be 0. In other words, if my original data frame is something like
group <- as.factor(rep(c('X', 'Y'), each = 1, times = 6))
type <- as.factor(rep(c('a', 'b'), each = 2, times = 3))
day <- as.factor(rep(1:3, each = 4))
df = data.frame(type = type, day = day, value = abs(rnorm(12)))
df = df[day != 1 | type != 'a',]
and I summarise it
df1 = df %>%
  group_by(day, type) %>%
  summarise(sum = sum(value))
then I get one missing row, which is the interaction between day = 1 and type = a, which I would like to have (even if it's 0...)
Thanks in advance!
EB
You could try left_join
library(dplyr)
left_join(expand.grid(type = unique(df$type), day = unique(df$day)), df1) %>%
  group_by(day, type) %>%
  summarise(sum = sum(value, na.rm = TRUE))
# day type sum
#1 1 a 0.0000000
#2 1 b 0.5132914
#3 2 a 1.2482210
#4 2 b 0.9232343
#5 3 a 2.0381779
#6 3 b 0.7558351
where df1 is
df1 <- df[day != 1 | type != 'a',]
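With a newer dplyr (1.0 or later, so more recent than the version implied by the question), another sketch is to keep empty factor combinations directly, which brings the day = 1 / type = a group back with a sum of 0:
library(dplyr)
df %>%
  group_by(day, type, .drop = FALSE) %>%  # keep empty factor combinations
  summarise(sum = sum(value), .groups = "drop")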
