Modify multiple columns at the same time in R

I'm not sure how to phrase this clearly, which may be why I did not find an answer: I want to edit the values of two different columns at the same time, where those columns are also the ones that identify the rows to change.
For example, this is the data:
> data = data.frame(name1 = c("John","Jake","John","Paul"),
name2 = c("Paul", "Paul","John","John"),
value1 = c(0,0,1,0),
value2 = c(1,0,1,0))
> data
name1 name2 value1 value2
1 John Paul 0 1
2 Jake Paul 0 0
3 John John 1 1
4 Paul John 0 0
I would like to edit the first row so that it becomes Jake & John instead of John & Paul, so I would like to combine these two lines of code so that both changes happen at the same time:
data$name1[(data$name1 == "John" & data$name2 == "Paul")] <- "Jake"
data$name2[(data$name1 == "John" & data$name2 == "Paul")] <- "John"
It should be a simple trick, but I don't have it!
Also, I need to do this on larger datasets where each modification can apply to multiple rows, and I can't know in advance which rows will be affected.

How about this? (Shown on a small example data frame with columns col1 and col2.)
data[data$col1 == "A" & data$col2 == "B", ] <- list("B", "D")
data
# col1 col2
#1 B D
#2 A C
#3 B A
#4 B B
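The two lines in the question fail because the first assignment changes name1, so the condition no longer matches when the second line runs. A minimal base R sketch of the same idea applied to the question's data, assuming the columns are character (not factor) and using a helper name, hit, that is not in the original:

```r
data <- data.frame(name1 = c("John", "Jake", "John", "Paul"),
                   name2 = c("Paul", "Paul", "John", "John"),
                   value1 = c(0, 0, 1, 0),
                   value2 = c(1, 0, 1, 0),
                   stringsAsFactors = FALSE)

# Evaluate the condition once, before either column is modified
hit <- data$name1 == "John" & data$name2 == "Paul"

# Both assignments reuse the stored index, so the order no longer matters
data$name1[hit] <- "Jake"
data$name2[hit] <- "John"
```

Because hit is a logical vector over all rows, this also covers the case where the pair appears on several rows, or on none.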

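library(tidyverse)
data %>%
  mutate(
    # mutate() evaluates its arguments in order, so test the condition once,
    # before name1 is overwritten; otherwise name2 would never match
    hit = name1 == "John" & name2 == "Paul",
    name1 = case_when(hit ~ "Jake", TRUE ~ name1),
    name2 = case_when(hit ~ "John", TRUE ~ name2),
    # drop the helper column
    hit = NULL)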

Related

Find Groups by multiple cascadingly related conditions

Problem
I have some data. I would like to flag the same instance (e.g. a person, company, machine, whatever) in my data with a unique ID. The data actually has some IDs, but they are either not always present or one instance has several different IDs.
What I am trying to achieve is to use these IDs along with individual information to find the same instance and assign a unique ID to it.
I found a solution, but it is highly inefficient. I would appreciate either tips to improve the performance of my code or, probably more promising, another approach.
Code
Example Data
dt1 <- data.table(id1 = c(1, 1, 2, 3, 4),
id2 = c("A", "B", "A", "C", "D"),
surname = "Smith",
firstname = c("John", "John", "Joe", "Joe", "Jack"))
dt1
#> id1 id2 surname firstname
#> 1: 1 A Smith John
#> 2: 1 B Smith John
#> 3: 2 A Smith Joe
#> 4: 3 C Smith Joe
#> 5: 4 D Smith Jack
Current Solution
find_grp <- function(dt,
by) {
# keep necessary variables only
dtx <- copy(dt)[, .SD, .SDcols = c(unique(unlist(by)))]
# unique data.table to improve performance
dtx <- unique(dtx)
# assign a row id column
dtx[, ID := .I]
# for every row and every by group, find all rows that match each row
# on at least one condition
res <- lapply(X = dtx$ID,
FUN = function(i){
unique(unlist(lapply(X = by,
FUN = function(by_sub) {
merge(dtx[ID == i, ..by_sub],
dtx,
by = by_sub,
all = FALSE)$ID
}
)))
})
res
print("merge done")
# keep all unique matching rows
l <- unique(res)
# combine matching rows together, if there is at least one overlap between
# two groups.
# repeat until all row-groups are completely disjoint from one another
repeat{
l1 <- l
iterator <- seq_len(length(l1))
for (i in iterator) {
for (ii in iterator[-i]) {
# is there any overlap between both row-groups
if (length(intersect(l1[[i]], l1[[ii]])) > 0) {
l1[[i]] <- sort(union(l1[[i]], l1[[ii]]))
}
}
}
if (isTRUE(all.equal(l1, l))) {
break
} else {
l <- unique(l1)
}
}
print("repeat done")
# use result to assign a groupId to the helper data.table
Map(l,
seq_along(l),
f = function(ll, grp) dtx[ID %in% ll, ID_GRP := grp])
# remove helper Id
dtx[, ID := NULL]
# assign the groupId to the original data.table
dt_out <- copy(dt)[dtx,
on = unique(unlist(by)),
ID_GRP := ID_GRP]
return(dt_out[])
}
Result
find_grp(dt1, by = list("id1",
"id2"
, c("surname", "firstname"))
)
#> [1] "merge done"
#> [1] "repeat done"
#> id1 id2 surname firstname ID_GRP
#> 1: 1 A Smith John 1
#> 2: 1 B Smith John 1
#> 3: 2 A Smith Joe 1
#> 4: 3 C Smith Joe 1
#> 5: 4 D Smith Jack 2
As you can see, ID_GRP is identified as follows:
the first two rows share id1;
since the id2 values for id1 = 1 contain A, row 3 with id2 = A belongs to the same group;
finally, all Joe Smith rows belong to the same group as well, because that is the name in row 3;
and so on and so forth;
only row 5 is completely unrelated.
{data.table} solutions are preferred
This might help you, although I'm not sure I've completely understood your question. I've written a function, gen_grp(), that takes a data table d and a vector of variables v. It steps through each unique id1 and replaces id1 values if matches of certain types are found.
gen_grp <- function(d,v) {
for(id in unique(d$id1)) {
d[id2 %in% d[id1==id,id2], id1:=id]
k=unique(d[id1==id, ..v])[,t:=id]
d = k[d, on=v][!is.na(t), id1:=t][, t:=NULL]
}
d[, grp:=rleid(id1)]
return(d[])
}
Usage:
gen_grp(dt1,c("surname","firstname"))
Output:
surname firstname id1 id2 grp
<char> <char> <num> <char> <int>
1: Smith John 1 A 1
2: Smith John 1 B 1
3: Smith Joe 1 A 1
4: Smith Joe 1 C 1
5: Smith Jack 4 D 2
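The grouping rule described in the question (rows belong together if they are linked through any chain of shared keys) can also be sketched as label propagation. The version below is a minimal base R illustration, not a data.table solution, and it assumes there are no missing key values and that pasting surname and firstname with a space does not create collisions. Every row starts with its own label; each pass pulls all rows sharing a key down to the smallest label among them, until nothing changes:

```r
dt1 <- data.frame(id1 = c(1, 1, 2, 3, 4),
                  id2 = c("A", "B", "A", "C", "D"),
                  surname = "Smith",
                  firstname = c("John", "John", "Joe", "Joe", "Jack"),
                  stringsAsFactors = FALSE)

# One grouping key per matching condition
keys <- list(dt1$id1, dt1$id2, paste(dt1$surname, dt1$firstname))

# Start with one label per row
grp <- seq_len(nrow(dt1))
repeat {
  old <- grp
  # Within each key value, give every row the smallest label seen so far
  for (k in keys) grp <- ave(grp, k, FUN = min)
  if (identical(grp, old)) break
}

# Renumber the surviving labels as 1, 2, ...
dt1$ID_GRP <- match(grp, unique(grp))
```

On the example data this assigns group 1 to the first four rows and group 2 to the last one, which agrees with the result shown in the question.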

How to program conditional Values across two dataframes in R?

I have a couple of data frames, and I am attempting to use the values from one of them to populate a column in the other.
They are as follows:
df1 <- data.frame(A = c("Doug", "Michele", "Steve", "John", "Pete", "David"))
df1$B <- 0
df2 <- data.frame(A = c("Doug", "Steve", "John"), B = c(1,1,0))
And the result that I am looking for is:
df1 <- data.frame(A = c("Doug", "Michele", "Steve", "John", "Pete", "David"), B = c(1,0,1,0,0,0))
I tried the following approach, but only Doug has a 1 value while the others are 0.
df1$B[(df1$A == df2$A & df2$B == 1)] <- 1
When attempting an approach with %in%, Doug has a 1 value but John does as well when Steve should be the one to receive the 1.
df1$B[(df1$A %in% df2$A & df2$B == 1)] <- 1
Am I missing something here that would resolve this issue?
Thanks in advance
An option with data.table would be to join on the 'A' column and assign 'B' from the second dataset (i.B) to 'B' in the first dataset:
library(data.table)
setDT(df1)[df2, B := i.B, on = .(A)]
Output:
df1
# A B
#1: Doug 1
#2: Michele 0
#3: Steve 1
#4: John 0
#5: Pete 0
#6: David 0
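The same lookup can be sketched in base R with match(), which returns, for each name in df1, its position in df2 (or NA if it is absent). This is only an illustration; the helper names idx and found are not from the original post. It also shows why the attempts in the question misfire: == and & compare the vectors position by position (recycling the shorter one) rather than by name.

```r
df1 <- data.frame(A = c("Doug", "Michele", "Steve", "John", "Pete", "David"),
                  B = 0)
df2 <- data.frame(A = c("Doug", "Steve", "John"), B = c(1, 1, 0))

# Position of each df1 name within df2$A, NA where there is no match
idx <- match(df1$A, df2$A)
found <- !is.na(idx)

# Copy B from the matching df2 row; unmatched rows keep their 0
df1$B[found] <- df2$B[idx[found]]
```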

Applying a condition based on a list and creating a new column based on the outcome in R

I have a list as follows:
c1 <- c("apple", "tree", "husband")
and this data:
df <-data.frame(
ID = c("b","b","b","a","a","c"),
col = c("husband", "apple", "juice", "happy", "husband", "white")
)
and I want to have this output:
df <-data.frame(
ID = c("b","b","b","a","a","c"),
col = c("husband", "apple", "juice", "happy", "husband", "white"),
c1 = c("1","1","0","0","1","0")
)
by applying the list (c1) as a condition, without having to write out every value like this:
mutate(c1 = ifelse(col == "apple" | col == "tree" | col == "husband", 1, 0))
Thank you
You can use %in% to check c1 values in col
transform(df, c1 = as.integer(col %in% c1))
#Even shorter
#transform(df, c1 = +(col %in% c1))
# ID col c1
#1 b husband 1
#2 b apple 1
#3 b juice 0
#4 a happy 0
#5 a husband 1
#6 c white 0
Using as.integer on the logical values is faster than using ifelse:
transform(df, c1 = ifelse(col %in% c1, 1, 0))
You can play a trick via factor, e.g.,
within(df, out <- +!is.na(factor(col,levels = c1)))
or via %in%
within(df, out <- +(col %in%c1))
or via match
within(df,out <- 1-is.na(match(col,c1)))
such that
ID col out
1 b husband 1
2 b apple 1
3 b juice 0
4 a happy 0
5 a husband 1
6 c white 0
You can also use grepl() to check any of the values in c1 and assign directly to a new variable:
#Data 1
c1 <- c("apple", "tree", "husband")
#Data 2
df <-data.frame(
ID = c("b","b","b","a","a","c"),
col = c("husband", "apple", "juice", "happy", "husband", "white"),stringsAsFactors = F)
#Match and create new variable
df$NewVar <- as.numeric(grepl(paste0(c1,collapse = '|'),df$col))
Output:
ID col NewVar
1 b husband 1
2 b apple 1
3 b juice 0
4 a happy 0
5 a husband 1
6 c white 0
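One caveat worth illustrating: grepl() with a pattern built this way does substring matching, so it can flag values that merely contain one of the words, whereas %in% only flags exact matches. The small vector x below is made up for the illustration and is not part of the question's data:

```r
c1 <- c("apple", "tree", "husband")
x <- c("apple", "pineapple", "street", "juice")  # hypothetical test values

# Substring match: "pineapple" contains "apple", "street" contains "tree"
substring_hit <- grepl(paste0(c1, collapse = "|"), x)

# Exact match on the whole value
exact_hit <- x %in% c1

# Anchoring the pattern makes grepl() behave like the exact match here
anchored_hit <- grepl(paste0("^(", paste0(c1, collapse = "|"), ")$"), x)
```

On the data in the question both approaches give the same result, since no value there contains another value as a substring.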
An option with case_when
library(dplyr)
df %>%
mutate(c1 = case_when(col %in% c1 ~ 1, TRUE ~ 0))
Or another option is
df %>%
mutate(c1 = +(col %in% c1))

Filter rows of a dataframe with an equal column value

Let's suppose we have a dataframe like this:
df <- data.frame(six=c("aa", "aa", "b", "cc", "cc"), seven=c("yes", "yes", "no", "yes", "no"))
> df
six seven
1 aa yes
2 aa yes
3 b no
4 cc yes
5 cc no
I want to filter and then store in a new dataframe the rows that match 2 criteria: the same "six" column value and a specific "seven" column value. For example, let's suppose we want the rows with "yes" in the "seven" column:
> df
six seven
1 aa yes
2 aa yes
How can I do this? I've tried with:
df_new <- filter(df, ...)
But I'm not sure how to impose both conditions.
and:
require(plyr)
ans = ddply(df, .(seven == "yes"), mutate, count = length(unique(six)))
Which gives:
> ans
seven == "yes" six seven count
1 FALSE b no 2
2 FALSE cc no 2
3 FALSE cc no 2
4 TRUE aa yes 1
5 TRUE aa yes 1
But this doesn't filter the dataframe.
EDIT: To clarify, if I have more rows in the dataframe, like this:
df <- data.frame(v1=c("aa", "aa", "b", "cc", "cc","aa","aa"), v2=c("yes", "yes", "no", "yes", "no","no","yes"))
> df
v1 v2
1 aa yes
2 aa yes
3 b no
4 cc yes
5 cc no
6 aa no
7 aa yes
The code has to give this:
df
six seven
1 aa yes
2 aa yes
7 aa yes
OK, I finally got it. I'll leave the solution here for those who want to know:
types <- unique(df$six)
tmp = list()
require(dplyr)
for (k in 1:length(types)) {
tmp[[k]] <- df %>% filter(six == types[k] & seven == "yes")
}
ls <- Filter(function(x) nrow(x) > 1, tmp)
A bit tricky, maybe, but it works. Of course, you have to extract a dataframe from the list in the end. If someone has a better idea, please post it. If you're wondering why I'm using a list: working with only dataframes gave me some problems.
Here is an idea via dplyr. First group by v1 and add the 2 criteria in filter. The group needs to have at least 2 rows, so that we can infer that the v1 values are the same, and of course v2 == 'yes' is self-explanatory:
library(tidyverse)
df %>%
group_by(v1) %>%
filter(n() >= 2 & all(v2 == 'yes'))
which gives,
# A tibble: 2 x 2
# Groups: v1 [1]
v1 v2
<fct> <fct>
1 aa yes
2 aa yes
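For the extended data in the EDIT, where the aa group also contains a "no" row, a group-wide all(v2 == 'yes') test would reject the whole group. A minimal base R sketch of one alternative, assuming the goal is to keep the "yes" rows whose v1 value occurs at least twice among the "yes" rows (the helper names yes_rows, counts and res are not from the original):

```r
df <- data.frame(v1 = c("aa", "aa", "b", "cc", "cc", "aa", "aa"),
                 v2 = c("yes", "yes", "no", "yes", "no", "no", "yes"))

# Keep only the rows with the wanted v2 value first
yes_rows <- df[df$v2 == "yes", ]

# Count how often each v1 value occurs among those rows
counts <- table(yes_rows$v1)

# Keep the v1 values that occur at least twice
res <- yes_rows[yes_rows$v1 %in% names(counts[counts >= 2]), ]
```

Filtering on v2 before counting is the design choice that matters here: the single cc "yes" row is dropped, while the three aa "yes" rows are kept even though aa also has a "no" row.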

Joining 2 data frames

I have 2 data frames and want to put a match column on one of them
library(plyr)
d1<-data.frame(date=c("2015-01-01","2015-02-05"),s= c("b","s"),name=c("bob","frank"),number=c(10,10.44), MatchorNoMatch= as.character(c("","")))
d1
d2<-data.frame(date2=c("2015-01-01","2015-02-06"),s2= c("b","b"),name2=c("bob","george"),number2=c(10,114))
d2
d1[d1$date %in% d2$date2 & d1$s %in% d2$s2 & d1$name %in% d2$name2 & d1$number %in% d2$number2,"MatchorNoMatch"] <- "match"
d1
here is what I get when I run that:
> library(plyr)
> d1<-data.frame(date=c("2015-01-01","2015-02-05"),s= c("b","s"),name=c("bob","frank"),number=c(10,10.44), MatchorNoMatch= as.character(c("","")))
> d1
date s name number MatchorNoMatch
1 2015-01-01 b bob 10.00
2 2015-02-05 s frank 10.44
> d2<-data.frame(date2=c("2015-01-01","2015-02-06"),s2= c("b","b"),name2=c("bob","george"),number2=c(10,114))
> d2
date2 s2 name2 number2
1 2015-01-01 b bob 10
2 2015-02-06 b george 114
>
> d1[d1$date %in% d2$date2 & d1$s %in% d2$s2 & d1$name %in% d2$name2 & d1$number %in% d2$number2,"MatchorNoMatch"] <- "match"
Warning message:
In `[<-.factor`(`*tmp*`, iseq, value = "match") :
invalid factor level, NA generated
> d1
date s name number MatchorNoMatch
1 2015-01-01 b bob 10.00 <NA>
2 2015-02-05 s frank 10.44
I am getting an NA in the MatchorNoMatch column. Any idea?
Update: actually, I just needed to set stringsAsFactors = FALSE.
Here is why using %in% won't work: Bob should not be a match.
library(plyr)
d1<-data.frame(date=c("2015-01-01","2015-02-05","2015-01-01"),s= c("b","s","s"),name=c("bob","frank","g"),number=c(10,10.44,66), match= as.character(c("","","")),stringsAsFactors= FALSE)
d1
class(d1$match)
d2<-data.frame(date2=c("2015-01-15","2015-02-05","2015-01-01"),s2= c("b","s","s"),name2=c("bob","frank","g"),number2=c(10,10.44,55),stringsAsFactors= FALSE)
d2
d1[d1$date %in% d2$date2 & d1$s %in% d2$s2 & d1$name %in% d2$name2 & d1$number %in% d2$number2,"match"] <- d2[d1$date %in% d2$date2 & d1$s %in% d2$s2 & d1$name %in% d2$name2 & d1$number %in% d2$number2, "name2"]
d1
This is really easy to do with just the base merge command from R.
d2$name2<-d2$name
merge(d1,d2,all.x=TRUE)
date s name number name2
1 2015-01-01 b bob 10.00 bob
2 2015-02-05 s frank 10.44 <NA>
merge(d1,d2,by=c("date","s","name","number"),all.x=TRUE)
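Edited to include the specific column names you wanted to match by. Note that this form of by requires both data frames to have columns named date, s, name and number.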
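If the columns of d2 keep their original names (date2, s2, name2, number2), merge() can still be used by naming the key columns on each side with by.x and by.y. The sketch below uses the second pair of data frames from the question; the helper column matched and the label "no match" are made up for the illustration:

```r
d1 <- data.frame(date = c("2015-01-01", "2015-02-05", "2015-01-01"),
                 s = c("b", "s", "s"),
                 name = c("bob", "frank", "g"),
                 number = c(10, 10.44, 66),
                 stringsAsFactors = FALSE)
d2 <- data.frame(date2 = c("2015-01-15", "2015-02-05", "2015-01-01"),
                 s2 = c("b", "s", "s"),
                 name2 = c("bob", "frank", "g"),
                 number2 = c(10, 10.44, 55),
                 stringsAsFactors = FALSE)

# Flag every d2 row, so the flag survives the join only where all keys agree
d2$matched <- "match"

# Left join: a d1 row matches only if all four keys match in the same d2 row
out <- merge(d1, d2,
             by.x = c("date", "s", "name", "number"),
             by.y = c("date2", "s2", "name2", "number2"),
             all.x = TRUE)
out$matched[is.na(out$matched)] <- "no match"
```

Unlike chained %in% tests, which check each column independently, the join requires the whole row to agree, so bob (different date) and g (different number) are not matched.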
