Replace values in one column by taking values from another column - r

After asking one question this morning, I would now like to ask about another way to do the replacement, since I am waiting for my teacher to confirm the species names.
I have a dataframe like this (the real df was produced by removing duplicated rows):
df <- data.frame(name1 = c("a", "b", "c", "a"),
                 name2 = c("x", NA, NA, NA),
                 name3 = c(NA, "b1", "c1", NA),
                 name4 = c("x", "b1", "c1", "a"))
name1 name2 name3 name4
1 a x <NA> x
2 b <NA> b1 b1
3 c <NA> c1 c1
4 a <NA> <NA> a
Can we replace a with x by checking whether the value in the name4 column matches the one in the name1 column?
I do not want to assign "x" directly here, since my real data has many cases like this. Any suggestions? (Using base R is also fine for me, since I would love to learn more.)
Desired output
name1 name2 name3 name4
1 a x <NA> x
2 b <NA> b1 b1
3 c <NA> c1 c1
4 a <NA> <NA> x
My explanation for the table and my expectation:
I have 3 columns, name1, name2 and name3 (after removing duplicated rows). The name4 column is the final column and should contain the value I want from the 3 previous columns. The value in name2 is my first priority, then the value in name3.
In my fourth row, since an NA appears in the name2 column, I took the "a" from the name1 column. I am wondering whether I can replace that a with x without assigning x explicitly, i.e. if the value ("a") in name4 equals the value ("a") in name1, then the a in name4 is replaced with the x taken from name2 (or name4) of the matching row.

Your criteria for defining name4, as I understand them, are:
Use name2 from the same row if available
Use name3 from the same row if available
Leave it missing (for now)
Fill missing name4 values with name4 values from previous rows that share the same name1 value.
If you want a tidyverse-based solution:
library(dplyr)
library(tidyr)
df <- data.frame(name1 = c("a", "b", "c", "a"),
                 name2 = c("x", NA, NA, NA),
                 name3 = c(NA, "b1", "c1", NA))

result <- df %>%
  mutate(name4 = case_when(
    # !is.na(name4) ~ name4, # when name4 is not missing, use it? If you like...
    !is.na(name2) ~ name2,   # when name2 is not missing, use it
    !is.na(name3) ~ name3,   # when name3 is not missing, use it
    TRUE ~ NA_character_     # leave a NA for now otherwise
  )) %>%
  group_by(name1) %>%
  fill(name4, .direction = "down") %>% # Fill each group looking at the previous non-missing row.
  ungroup()
Returns:
# A tibble: 4 × 4
name1 name2 name3 name4
<chr> <chr> <chr> <chr>
1 a x NA x
2 b NA b1 b1
3 c NA c1 c1
4 a NA NA x
Note that fill() can fill in several directions; you could use "downup" if you want to first fill from top to bottom and then from bottom to top.
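For instance, a minimal sketch of the same pipeline with "downup" (with dplyr and tidyr loaded as above, and coalesce(name2, name3) standing in for the case_when); the direction only matters when the first row of a group is the one missing name4:
df %>%
  mutate(name4 = coalesce(name2, name3)) %>%  # same effect as the case_when above
  group_by(name1) %>%
  fill(name4, .direction = "downup") %>%      # fill down within each group, then up
  ungroup()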

You can group by name1 and, when name1 and name4 are equal, replace the name4 value with the first non-NA value available in the group.
library(dplyr)
df %>%
  group_by(name1) %>%
  # note: cur_data() is deprecated in dplyr >= 1.1; pick(everything()) is the newer equivalent
  mutate(name4 = ifelse(name1 == name4, na.omit(unlist(cur_data()))[1], name4)) %>%
  ungroup()
# name1 name2 name3 name4
# <chr> <chr> <chr> <chr>
#1 a x NA x
#2 b NA b1 b1
#3 c NA c1 c1
#4 a NA NA x

You can do it like this:
df[which(df$name1==df$name4), "name4"] <- "x"
Basically this means subsetting your dataframe, selecting the rows in which name1 == name4 and the name4 column, and then changing those values to "x".
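The question asked to avoid hardcoding "x"; if that matters, a hedged base R sketch (my own variation, not part of the answer above) that looks the replacement up from name2, falling back to name3, of the first row sharing the same name1:
rows   <- which(df$name1 == df$name4)
lookup <- match(df$name1[rows], df$name1)   # first row with the same name1
df$name4[rows] <- ifelse(is.na(df$name2[lookup]), df$name3[lookup], df$name2[lookup])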

Base R ifelse solution:
df$name4 <- ifelse(df$name1 == df$name4, "x", df$name4)
Based on your update, using dplyr's first:
library(dplyr)
df$name4 <- ifelse(df$name1 == df$name4, first(df$name4), df$name4)
This does the following:
Checks whether name1 is equal to name4
If they are equal, it replaces the value of name4 with the first value occurring in name4.
Result:
name1 name2 name3 name4
1 a x <NA> x
2 b <NA> b1 b1
3 c <NA> c1 c1
4 a <NA> <NA> x
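If more than one name1 value can need replacing, the ungrouped first() above may pick a value from the wrong row; a grouped sketch of the same idea (my variation, taking the first non-missing name2/name3 within each name1 group):
library(dplyr)

df %>%
  group_by(name1) %>%
  mutate(name4 = ifelse(name1 == name4,
                        na.omit(c(name2, name3))[1],  # first replacement available within the group
                        name4)) %>%
  ungroup()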

Related

How to filter values in a list within a dataframe in R?

I have a dataframe, df:
df <- structure(list(id = c("id1", "id2", "id3", "id4"),
                     type = c("blue", "blue", "brown", "blue"),
                     value = list(value1 = "cat", value2 = character(0),
                                  value3 = "dog", value4 = "fish")),
                row.names = 1:4, class = "data.frame")
> df
id type value
1 id1 blue cat
2 id2 blue
3 id3 brown dog
4 id4 blue fish
The third column, value, is a list. I want to be able to filter out any rows of the dataframe where the entry in that column doesn't have any characters (i.e. the second row).
I've tried this:
df <- filter(df, value != "")
and this
df <- filter(df, nchar(value) != 0)
But neither has any effect on the data frame. What is the correct way to do this so my data frame looks like this:
> df
id type value
1 id1 blue cat
3 id3 brown dog
4 id4 blue fish
The lengths() function is perfect here - it gives the length of each element of a list. You want all the rows where value has non-zero length:
df[lengths(df$value) > 0, ]
# id type value
# 1 id1 blue cat
# 3 id3 brown dog
# 4 id4 blue fish
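The same test also drops straight into a dplyr pipeline, if that is where the rest of your code lives (a small sketch):
library(dplyr)
df %>% filter(lengths(value) > 0)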
Here is my approach:
idx <- lapply(df$value, length)
filter(df, idx > 0)
id type value
1 id1 blue cat
2 id3 brown dog
3 id4 blue fish
An option with tidyverse
library(dplyr)
library(purrr)
df %>%
filter(map_int(value, length) > 0)
# id type value
#1 id1 blue cat
#2 id3 brown dog
#3 id4 blue fish
Try this:
df <- filter(df, !sapply(df$value,function(x) identical(x,character(0))) )

Remove rows found in more than 3 groups

I have a dataframe and I am trying to remove the rows whose b value is present in >= 3 groups. In the example below, bike is the common value across 3 groups and I need to remove it. Please help me to achieve this.
df <- data.frame(a = c("name1", "name1", "name1", "name2", "name2", "name2", "name3"),
                 b = c("car", "bike", "bus", "train", "bike", "tour", "bike"))
df
a b
name1 car
name1 bike
name1 bus
name2 train
name2 bike
name2 tour
name3 bike
Expected Output:
a b
name1 car
name1 bus
name2 train
name2 tour
You can use dplyr::n_distinct:
n_gr <- 3
cn <- df %>%
  group_by(b) %>%
  summarise(na = n_distinct(a)) %>%
  filter(na >= n_gr) %>%
  pull(b)
df <- df %>% filter(!(b %in% cn))
Output
a b
1 name1 car
2 name1 bus
3 name2 train
4 name2 tour
In base R you could do this...
df[ave(as.numeric(as.factor(df$a)), # convert a to numbers (factor levels) (required by ave)
       df$b,                        # group by b
       FUN = length) < 3, ]         # TRUE where the number of a's per b is less than 3
a b
1 name1 car
3 name1 bus
4 name2 train
6 name2 tour
Using data.table:
library(data.table)
setDT(df)[, count := .N, by = b] ## convert df to data.table & create a column to count groups
df <- df[!(count >= 3), ] ## delete rows that have count equal to 3 or more than 3
df[, count := NULL] ## delete the column created
df
a b
1: name1 car
2: name1 bus
3: name2 train
4: name2 tour
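If you prefer to avoid the helper column, a grouped filter sketch should also work (my variation; uniqueN() counts distinct groups, and the columns come back in b, a order, so they are reselected at the end). It starts again from the original df, since the code above already modified it:
library(data.table)

df <- data.frame(a = c("name1", "name1", "name1", "name2", "name2", "name2", "name3"),
                 b = c("car", "bike", "bus", "train", "bike", "tour", "bike"))
setDT(df)[, if (uniqueN(a) < 3) .SD, by = b][, .(a, b)]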
Using Base R:
df <- data.frame(a = c("name1","name1","name1","name2","name2","name2","name3"), b=c("car","bike","bus","train","bike","tour","bike"))
df
lst <- table(df$b)
df[!df$b %in% names(lst)[lst >= 3], ]  # %in% also handles several b values occurring in 3+ groups
# a b
# 1 name1 car
# 3 name1 bus
# 4 name2 train
# 6 name2 tour
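A hedged variant of the same idea, in case the same a-b pair could ever repeat (so that row counts and distinct-group counts differ): count distinct groups per b, as n_distinct() does in the dplyr answer above.
tab <- tapply(df$a, df$b, function(x) length(unique(x)))  # distinct a groups per b value
df[!df$b %in% names(tab)[tab >= 3], ]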

Identifying missing observations in groups

I have some difficulties with my code, and I hope some of you could help.
The dataset looks something like this:
df <- data.frame("group" = c("A", "A", "A","A_1", "A_1", "B","B","B_1"),
"id" = c("id1", "id2", "id3", "id2", "id3", "id5","id1","id1"),
"time" = c(1,1,1,3,3,2,2,5),
"Val" = c(10,10,10,10,10,12,12,12))
"group" indicate the group the individual "id" is in. "A_1" indicate that a subject has left the group.
For instance, one subject "id1" leaves the "group A" that becomes group "A_1", where only "id2" and "id3" are members. Similarly "id5" leaves group B that becomes "B_1" with only id1 as a member.
What I would like to have in the final dataset is an opposite type of groups identification, that should look something like this:
final <- data.frame("group"   = c("A", "A", "A", "A_1", "B", "B", "B_1"),
                    "id"      = c("id1", "id2", "id3", "id1", "id5", "id1", "id5"),
                    "time"    = c(1, 1, 1, 3, 2, 2, 5),
                    "Val"     = c(10, 10, 10, 10, 12, 12, 12),
                    "groupid" = c("A", "A", "A", "A", "B", "B", "B"))
Whereby "A_1" and "B_1" only indicate the subjects, "id1" and "id5" respectively, that have left the original group, rather than identifying remaining subjects.
Does anyone have suggestions on how I could systematically do this?
I thank you in advance for your help.
Follow up:
My data is a little more complex than the above example, as there are multiple "exits" from treatments; moreover, the group identifiers can have different character lengths (here, for instance, AAA and B). The data looks more like the following:
df2 <- data.frame("group" = c("AAA", "AAA", "AAA", "AAA", "AAA_1", "AAA_1", "AAA_1", "AAA_2", "AAA_2", "B", "B", "B_1"),
                  "id"    = c("id1", "id2", "id3", "id4", "id2", "id3", "id4", "id2", "id3", "id5", "id1", "id1"),
                  "time"  = c(1, 1, 1, 1, 3, 3, 3, 6, 6, 2, 2, 5),
                  "Val"   = c(10, 10, 10, 10, 10, 10, 10, 10, 10, 12, 12, 12))
At time 3, id1 leaves group AAA, which becomes group AAA_1, while at time 6 id4 also leaves, and the group becomes AAA_2. As discussed previously, I would like the groups with "_" to identify the ids that left the group rather than the remaining ones. Hence the final dataset should look something like this:
final2 <- data.frame("group" = c("A", "A", "A", "A", "A_1", "A_2", "B", "B", "B_1"),
                     "id"    = c("id1", "id2", "id3", "id4", "id1", "id4", "id5", "id1", "id5"),
                     "time"  = c(1, 1, 1, 1, 3, 6, 2, 2, 5),
                     "Val"   = c(10, 10, 10, 10, 10, 10, 12, 12, 12))
Thanks for helping me with this.
OK, you can try it with dplyr this way; maybe it's not elegant, but you get the result. The idea is to first fetch the ids that are in group ... but not in the corresponding ..._1, change their group, then fetch the others and rbind them together:
library(dplyr)

# first you could find the ones that are missing from the ..._1 groups
# and change their group to ..._1
dups <- df %>%
  group_by(id, groupid = substr(group, 1, 1)) %>%
  filter(n() == 1) %>%
  mutate(group = paste0(group, '_1')) %>%
  left_join(df %>%
              select(group, time, Val) %>%
              distinct(),
            by = 'group') %>%
  select(group, id, time = time.y, Val = Val.y) %>%
  ungroup()
dups
# A tibble: 2 x 5
groupid group id time Val
<chr> <chr> <fct> <dbl> <dbl>
1 A A_1 id1 3 10
2 B B_1 id5 5 12
# now you can select the ones that are in both groups:
dups2 <- df %>%
  filter(nchar(as.character(group)) == 1) %>%
  mutate(groupid = substr(group, 1, 1))
dups2
group id time Val groupid
1 A id1 1 10 A
2 A id2 1 10 A
3 A id3 1 10 A
4 B id5 2 12 B
5 B id1 2 12 B
Last, rbind() them, arrange() them, and reorder the columns with select():
rbind(dups, dups2) %>%
  arrange(group) %>%
  select(group, id, time, Val, groupid)
# A tibble: 7 x 5
group id time Val groupid
<chr> <fct> <dbl> <dbl> <chr>
1 A id1 1 10 A
2 A id2 1 10 A
3 A id3 1 10 A
4 A_1 id1 3 10 A
5 B id5 2 12 B
6 B id1 2 12 B
7 B_1 id5 5 12 B
Hope it helps!
EDIT:
You can generalize it with some work; here is my attempt, hope it helps:
library(dplyr)

df3 <- df2
# you have to set a couple of fields first: normalise every group label to the form X_n
df3$group <- ifelse(
  substr(df2$group, nchar(as.character(df2$group)), nchar(as.character(df2$group))) %in% c(0:9),
  paste0(substr(df2$group, 1, 1), "_", substr(df2$group, nchar(as.character(df2$group)), nchar(as.character(df2$group)))),
  paste0(substr(df2$group, 1, 1), "_0")
)
df3$util <- as.numeric(substr(df3$group, 3, 3)) + 1
# two empty lists to populate with a nested loop:
changed <- list()
final_changed <- list()
Now first we find who changes, then the others; the idea is the same as in the previous part:
for (j in c("A","B")) {
df3_ <- df3[substr(df3$group,1,1)==j,]
for (i in unique(df3_$util)[1:length(unique(df3_$util))-1]) {
temp1 <- df3_[df3_$util == i,]
temp2 <- df3_[df3_$util == i+1,]
changes <- temp1[!temp1$id %in% temp2$id,]
changes$group <- paste0(j,'_',i )
changes <- changes %>% left_join(temp2, by = 'group') %>%
select(group , id = id.x, time = time.y, Val = Val.y)
changed[[i]] <- changes
}
final_changed[[j]] <- changed
}
change <- do.call(rbind,(do.call(Map, c(f = rbind, final_changed)))) %>% distinct()
change
group id time Val
1 A_1 id1 3 10
2 B_1 id5 5 12
3 A_2 id4 6 10
Then take the remaining rows and put everything together:
remain <- df3 %>%
  mutate(group = gsub("_0", "", .$group)) %>%
  filter(nchar(as.character(group)) == 1) %>%
  select(-util)

rbind(change, remain) %>%
  mutate(groupid = substr(group, 1, 1)) %>%
  arrange(group) %>%
  select(group, id, time, Val, groupid)
group id time Val groupid
1 A id1 1 10 A
2 A id2 1 10 A
3 A id3 1 10 A
4 A id4 1 10 A
5 A_1 id1 3 10 A
6 A_2 id4 6 10 A
7 B id5 2 12 B
8 B id1 2 12 B
9 B_1 id5 5 12 B
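For reference, a more compact sketch of the generalized case (my own variation, not the answer above). It assumes the group labels always follow the base_version pattern and keeps the full base label (e.g. "AAA") rather than shortening it to "A":
library(dplyr)
library(tidyr)

# split "AAA_2" into a base label and a version number; unversioned groups get version 0
long <- df2 %>%
  separate(group, into = c("base", "ver"), sep = "_", fill = "right", remove = FALSE) %>%
  mutate(ver = coalesce(as.integer(ver), 0L))

# time/Val attached to each base/version combination that actually exists
ver_info <- long %>% distinct(base, ver, time, Val)

# an id "leaves" at version v if it belongs to version v - 1 of the same base group but
# not to version v: shift every membership forward one version and anti-join against
# the real memberships to find exactly those ids
leavers <- long %>%
  mutate(ver = ver + 1L) %>%
  anti_join(long, by = c("base", "ver", "id")) %>%
  select(base, ver, id) %>%
  inner_join(ver_info, by = c("base", "ver")) %>%   # also drops "next versions" that never happened
  transmute(group = paste0(base, "_", ver), id, time, Val, groupid = base)

originals <- long %>%
  filter(ver == 0) %>%
  transmute(group = base, id, time, Val, groupid = base)

bind_rows(originals, leavers) %>%
  arrange(groupid, group)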

Find duplicates by id and restructure the dataset in R [duplicate]

This question already has answers here:
Collapse / concatenate / aggregate a column to a single comma separated string within each group
(6 answers)
Closed 5 years ago.
I have a dataset with columns: id, names. There can be one id but multiple names, so I am getting duplicate id-rows at times:
id names
id1 name1
id1 name2
id1 name3
id2 name4
id2 name5
I need to restructure such a data.frame in R so that all rows have unique ids, and if there are multiple names, they should all be written into the names column as comma-separated values, like this:
id names
id1 name1, name2, name3
id2 name4, name5
I tried grouped <- table %>% group_by(names) but it did not work.
How could I achieve that in R?
Using data.table:
df <- read.table(header=T, text="id names
id1 name1
id1 name2
id1 name3
id2 name4
id2 name5")
library(data.table)
setDT(df)
df[, names := as.character(names)]
df[, names := paste0(names, collapse = ", "), by = id]
df <- unique(df)
Output:
df
id names
1: id1 name1, name2, name3
2: id2 name4, name5
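Since the question started from a group_by() attempt, here is the dplyr equivalent for reference (a minimal sketch; it starts again from the df as read in above, before the data.table steps):
library(dplyr)

df <- read.table(header = TRUE, text = "id names
id1 name1
id1 name2
id1 name3
id2 name4
id2 name5")

df %>%
  group_by(id) %>%
  summarise(names = paste(names, collapse = ", "))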

Create a new column in dplyr by appending values to a list from other columns?

I would like to make a new column by appending to a list conditional on the values of other columns. If possible, I would like to do so in dplyr. Sample input and desired output are below.
Suppose a dataframe newdata:
col1 col2 col3 col4
dog cat NA NA
NA cat foo bar
dog NA NA NA
NA cat NA NA
Here is my desired output, with the new column newCol:
col1 col2 col3 col4 newCol
dog cat NA NA (dog, cat)
NA cat foo bar (cat, foo, bar)
dog NA NA NA (dog)
NA cat NA bar (cat, bar)
I have tried using ifelse within mutate and case_when within mutate, but both will not allow concatenation to a list. Here is my (unsuccessful) attempt with case_when:
newdata = newdata %>% mutate(
  newCol = case_when(
    col1 == "dog" ~ c("dog"),
    col2 == "cat" ~ c(newCol, "cat"),
    col3 == "foo" ~ c(newCol, "foo"),
    col4 == "bar" ~ c(newcol, "dog")
  )
)
I tried a similar approach with an ifelse statement for each column but also could not append to the list.
In the Note at the end we show the input data used here. It is as in the question except we have added a row of NAs at the end to show that all solutions work in that case too.
We show both list and character column solutions. The question specifically refers to list so this is the assumed desired output but if it was intended that newCol be a character vector then we show that as well.
This is so easy to do using base functions that we show that first; however, we do redo it in tidyverse although it involves significantly more code.
1) base We can use apply like this:
reduce <- function(x) unname(x[!is.na(x)])
DF$newCol <- apply(DF, 1, reduce)
giving the following where newCol is a list whose first component is c("dog", "cat"), etc.
col1 col2 col3 col4 newCol
1 dog cat <NA> <NA> dog, cat
2 <NA> cat foo bar cat, foo, bar
3 dog <NA> <NA> <NA> dog
4 <NA> cat <NA> <NA> cat
5 <NA> <NA> <NA> <NA>
The last line of code could alternately be:
DF$newCol <- lapply(split(DF, 1:nrow(DF)), reduce)
The question refers to concatenating to a list so I assume that a list is wanted for newCol but if a string is wanted then use this for reduce instead:
reduce_ch <- function(x) sprintf("(%s)", toString(x[!is.na(x)]))
apply(DF, 1, reduce_ch)
2) tidyverse Using dplyr/tidyr/tibble, we gather it to long form, remove the NAs, nest it, sort it back into the original order and left-join it back onto DF.
library(dplyr)
library(tibble)
library(tidyr)

DF %>%
  rownames_to_column %>%
  gather(colName, Value, -rowname) %>%
  na.omit %>%
  select(-colName) %>%
  nest(Value, .key = newCol) %>%
  arrange(rowname) %>%
  left_join(cbind(DF %>% rownames_to_column), .) %>%
  select(-rowname)
giving:
col1 col2 col3 col4 newCol
1 dog cat <NA> <NA> dog, cat
2 <NA> cat foo bar cat, foo, bar
3 dog <NA> <NA> <NA> dog
4 <NA> cat <NA> <NA> cat
5 <NA> <NA> <NA> <NA> NULL
If character output is wanted then use this instead:
DF %>%
  rownames_to_column %>%
  gather(colName, Value, -rowname) %>%
  select(-colName) %>%
  group_by(rowname) %>%
  summarize(newCol = sprintf("(%s)", toString(na.omit(Value)))) %>%
  ungroup %>%
  { cbind(DF, .) } %>%
  select(-rowname)
giving:
col1 col2 col3 col4 newCol
1 dog cat <NA> <NA> (dog, cat)
2 <NA> cat foo bar (cat, foo, bar)
3 dog <NA> <NA> <NA> (dog)
4 <NA> cat <NA> <NA> (cat)
5 <NA> <NA> <NA> <NA> ()
Note
The input DF in reproducible form:
Lines <- "col1 col2 col3 col4
dog cat NA NA
NA cat foo bar
dog NA NA NA
NA cat NA NA
NA NA NA NA"
DF <- read.table(text = Lines, header = TRUE, as.is = TRUE)
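For what it's worth, more recent dplyr versions (1.0 or later) can also build the list column row by row with rowwise() and c_across(); a minimal sketch under that assumption, using the DF from the Note above:
library(dplyr)

DF %>%
  rowwise() %>%
  mutate(newCol = list(as.vector(na.omit(c_across(col1:col4))))) %>%  # drop NAs, keep a plain character vector per row
  ungroup()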
Solution using na.omit() and paste() with collapse argument:
apply(newdata, 1,
function(x) paste0("(", paste(na.omit(x), collapse = ", "), ")"))
[1] "(dog, cat)" "(cat, foo, bar)" "(dog)" "(cat)"
This looks like a use case for tidyr::unite. You'll still need to do some dplyr cleanup at the end, but this should work for now.
library(tibble)
library(dplyr)
library(tidyr)
df <- tribble(~col1, ~col2, ~col3, ~col4,
              "dog", "cat",  NA,    NA,
              NA,    "cat", "foo", "bar",
              "dog",  NA,    NA,    NA,
              NA,    "cat",  NA,    NA)

df %>%
  unite(newCol, col1, col2, col3, col4,
        remove = FALSE,
        sep = ', ') %>%
  # Replace NAs and "NA, "s with ''
  mutate(newCol = gsub('NA[, ]*', '', newCol)) %>%
  # Replace ', ' with '' if it is at the end of the line
  mutate(newCol = gsub(', $', '', newCol)) %>%
  # Add the parentheses on either side
  mutate(newCol = paste0('(', newCol, ')'))
#> # A tibble: 4 x 5
#> newCol col1 col2 col3 col4
#> <chr> <chr> <chr> <chr> <chr>
#> 1 (dog, cat) dog cat <NA> <NA>
#> 2 (cat, foo, bar) <NA> cat foo bar
#> 3 (dog) dog <NA> <NA> <NA>
#> 4 (cat) <NA> cat <NA> <NA>
Also for what it's worth, other people are discussing this problem!
