Replace NA values when they are in two adjacent columns - r

This is an example similar to the data frame I am working with. I have an experiment with 10 samples and two replicates:
df <- data.frame("ID" = c(1,2,3,4,5,6,7,8,9,10),
"Rep1" = c(6,5,3,"Na","Na",9,4,"Na","Na",2),
"Rep2" = c(8,4,4,"Na",3,"Na",6,"Na",2,1))
There are several Na values, but I only want to replace them with zeros in samples 4 and 8, because those are the only samples with NA in both replicates. The other samples should keep their "NA".

You can also use the following solution. It iterates over each row and finds the positions whose value equals "Na"; if there is more than one such position, those values are replaced with "0", otherwise the row is left as it is:
library(dplyr)
library(purrr)
df %>%
pmap_df(., ~ {ind <- which(c(...) == "Na");
if(length(ind) > 1) {
replace(c(...), ind, "0")
} else {
c(...)
}
}
) %>%
mutate(across(ID, as.integer))
# A tibble: 10 x 3
ID Rep1 Rep2
<int> <chr> <chr>
1 1 6 8
2 2 5 4
3 3 3 4
4 4 0 0
5 5 Na 3
6 6 9 Na
7 7 4 6
8 8 0 0
9 9 Na 2
10 10 2 1
P.S. I almost went crazy wondering why I could not get it to work, only to realize that your NAs are in fact the string "Na".

We create an index where the 'Rep' columns are both "Na" with rowSums on a logical matrix. Use the row, column index/names to subset the data and assign the values to 0
nm1 <- grep("Rep", names(df), value = TRUE)
i1 <- rowSums(df[nm1] == "Na") == length(nm1)
df[i1, nm1] <- 0
-output
df
ID Rep1 Rep2
1 1 6 8
2 2 5 4
3 3 3 4
4 4 0 0
5 5 Na 3
6 6 9 Na
7 7 4 6
8 8 0 0
9 9 Na 2
10 10 2 1
Because the OP entered the missing values as the string "Na", the Rep columns are character, not numeric. We can convert them to numeric with
df[-1] <- lapply(df[-1], as.numeric)
which turns each remaining "Na" into a real NA (with a coercion warning).
-output
df
ID Rep1 Rep2
1 1 6 8
2 2 5 4
3 3 3 4
4 4 0 0
5 5 NA 3
6 6 9 NA
7 7 4 6
8 8 0 0
9 9 NA 2
10 10 2 1
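If the data already hold real missing values rather than the string "Na", the same row-wise test can be written with is.na(). This is only a minimal base R sketch under that assumption; the names rep_cols and all_missing are illustrative and not part of the answer above.

```r
# Same data as in the question, but with real NA values instead of "Na" (assumption)
df <- data.frame(ID = 1:10,
                 Rep1 = c(6, 5, 3, NA, NA, 9, 4, NA, NA, 2),
                 Rep2 = c(8, 4, 4, NA, 3, NA, 6, NA, 2, 1))

# Columns to check: everything whose name starts with "Rep"
rep_cols <- grep("^Rep", names(df), value = TRUE)

# TRUE for rows in which every replicate is missing
all_missing <- rowSums(is.na(df[rep_cols])) == length(rep_cols)

# Zero out only those rows; partially missing rows keep their NA
df[all_missing, rep_cols] <- 0
df
```

Because the columns are numeric from the start, no type conversion is needed afterwards.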

With dplyr we could:
library(dplyr)
df %>%
mutate(across(starts_with("Rep"), ~case_when(. == "Na" & ID %in% c(4, 8) ~ "0",
TRUE ~ .)))
Output:
ID Rep1 Rep2
1 1 6 8
2 2 5 4
3 3 3 4
4 4 0 0
5 5 Na 3
6 6 9 Na
7 7 4 6
8 8 0 0
9 9 Na 2
10 10 2 1

Although this has already been marked as solved, I would like to propose a simple answer:
df <- data.frame("ID" = c(1,2,3,4,5,6,7,8,9,10),
"Rep1" = c(6,5,3,"Na","Na",9,4,"Na","Na",2),
"Rep2" = c(8,4,4,"Na",3,"Na",6,"Na",2,1))
library(dplyr)
df %>% group_by(ID) %>%
mutate(replace(cur_data(), all(cur_data() == 'Na'), '0'))
#> # A tibble: 10 x 3
#> # Groups: ID [10]
#> ID Rep1 Rep2
#> <dbl> <chr> <chr>
#> 1 1 6 8
#> 2 2 5 4
#> 3 3 3 4
#> 4 4 0 0
#> 5 5 Na 3
#> 6 6 9 Na
#> 7 7 4 6
#> 8 8 0 0
#> 9 9 Na 2
#> 10 10 2 1
OR
df %>% rowwise() %>%
mutate(replace(cur_data()[-1], all(cur_data()[-1] == 'Na'), '0'))

Related

Is there a way to group values in a column between data gaps in R?

I want to group my data in different chunks when the data is continuous. Trying to get the group column from dummy data like this:
a b group
<dbl> <dbl> <dbl>
1 1 1 1
2 2 2 1
3 3 3 1
4 4 NA NA
5 5 NA NA
6 6 NA NA
7 7 12 2
8 8 15 2
9 9 NA NA
10 10 25 3
I tried using
test %>% mutate(test = complete.cases(.)) %>%
group_by(group = cumsum(test == TRUE)) %>%
select(group, everything())
But it doesn't work as expected:
group a b test
<int> <dbl> <dbl> <lgl>
1 1 1 1 TRUE
2 2 2 2 TRUE
3 3 3 3 TRUE
4 3 4 NA FALSE
5 3 5 NA FALSE
6 3 6 NA FALSE
7 4 7 12 TRUE
8 5 8 15 TRUE
9 5 9 NA FALSE
10 6 10 25 TRUE
Any advice?
Using rle in base R -
transform(df, group1 = with(rle(!is.na(b)), rep(cumsum(values), lengths))) |>
transform(group1 = replace(group1, is.na(b), NA))
# a b group group1
#1 1 1 1 1
#2 2 2 1 1
#3 3 3 1 1
#4 4 NA NA NA
#5 5 NA NA NA
#6 6 NA NA NA
#7 7 12 2 2
#8 8 15 2 2
#9 9 NA NA NA
#10 10 25 3 3
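To see why the rle approach works, it may help to look at the intermediate pieces. A small sketch on the b column alone (the vector is copied from the question; runs and group are illustrative names):

```r
b <- c(1, 2, 3, NA, NA, NA, 12, 15, NA, 25)

# Run-length encoding of the "is present" indicator:
# values  = TRUE FALSE TRUE FALSE TRUE
# lengths = 3    3     2    1     1
runs <- rle(!is.na(b))

# cumsum() only increases on TRUE runs, so each block of present values
# gets the next integer; rep() stretches it back to one value per row
group <- rep(cumsum(runs$values), runs$lengths)

# Rows inside a gap inherit the previous counter, so blank them out
group[is.na(b)] <- NA
group
```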
A couple of approaches to consider if you wish to use dplyr for this.
First, you could look at transition from non-complete cases (using lag) to complete cases.
library(dplyr)
test %>%
mutate(test = complete.cases(.)) %>%
group_by(group = cumsum(test & !lag(test, default = FALSE))) %>%
mutate(group = replace(group, !test, NA))
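The transition test used in this pipeline can also be looked at in isolation. A base R sketch on the b column only, where prev_ok plays the role of lag(test, default = FALSE) (all names here are illustrative):

```r
b  <- c(1, 2, 3, NA, NA, NA, 12, 15, NA, 25)
ok <- !is.na(b)

# Shift the indicator down by one row; the first row has no predecessor
prev_ok <- c(FALSE, head(ok, -1))

# A chunk starts where the current row is present and the previous one is not
starts <- ok & !prev_ok

# Counting the starts numbers the chunks; rows in a gap are set to NA
group <- cumsum(starts)
group[!ok] <- NA
group
```

Counting every complete row, as cumsum(test == TRUE) does in the question, increments the counter inside a chunk as well, which is why the original attempt produced a new group per row.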
Alternatively, you could add row numbers to your data.frame. Then, you could filter to include only complete cases, and group_by enumerating with cumsum based on gaps in row numbers. Then, join back to original data.
test$rn <- seq.int(nrow(test))
test %>%
filter(complete.cases(.)) %>%
group_by(group = c(0, cumsum(diff(rn) > 1)) + 1) %>%
right_join(test) %>%
arrange(rn) %>%
dplyr::select(-rn)
Output
a b group
<int> <int> <dbl>
1 1 1 1
2 2 2 1
3 3 3 1
4 4 NA NA
5 5 NA NA
6 6 NA NA
7 7 12 2
8 8 15 2
9 9 NA NA
10 10 25 3
Using data.table, get rleid then remove group IDs for NAs, then fix the sequence with factor to integer conversion:
library(data.table)
setDT(test)[, group1 := {
x <- complete.cases(test)
grp <- rleid(x)
grp[ !x ] <- NA
as.integer(factor(grp))
}]
# a b group group1
# 1: 1 1 1 1
# 2: 2 2 1 1
# 3: 3 3 1 1
# 4: 4 NA NA NA
# 5: 5 NA NA NA
# 6: 6 NA NA NA
# 7: 7 12 2 2
# 8: 8 15 2 2
# 9: 9 NA NA NA
# 10: 10 25 3 3

Insert dots in column names in wide data using R

The following data set is in the wide format and has repeated measures of "ql", "st" and "xy" prefixed by "a", "b" and "c";
df<-data.frame(id=c(1,2,3,4),
ex=c(1,0,0,1),
aql=c(5,4,NA,6),
bql=c(5,7,NA,9),
cql=c(5,7,NA,9),
bst=c(3,7,8,9),
cst=c(8,7,5,3),
axy=c(1,9,4,4),
cxy=c(5,3,1,4))
I'm looking for a way to insert dots after the prefixed letters "a", "b" and "c", while keeping other columns (i.e. id, ex) unchanged. I've been working around this using gsub function, e.g.
names(df) <- gsub("", "\\.", names(df))
but got undesired results. The expected output would look like
id ex a.ql b.ql c.ql b.st c.st a.xy c.xy
1 1 1 5 5 5 3 8 1 5
2 2 0 4 7 7 7 7 9 3
3 3 0 NA NA NA 8 5 4 1
4 4 1 6 9 9 9 3 4 4
Try
sub("(^[a-c])(.+)", "\\1.\\2", names(df))
# [1] "id" "ex" "a.ql" "b.ql" "c.ql" "b.st" "c.st" "a.xy" "c.xy"
or
sub("(?<=^[a-c])", ".", names(df), perl = TRUE)
# [1] "id" "ex" "a.ql" "b.ql" "c.ql" "b.st" "c.st" "a.xy" "c.xy"
You can do
setNames(df, sub("(ql$)|(st$)|(xy$)", "\\.\\1\\2\\3", names(df)))
#> id ex a.ql b.ql c.ql b.st c.st a.xy c.xy
#> 1 1 1 5 5 5 3 8 1 5
#> 2 2 0 4 7 7 7 7 9 3
#> 3 3 0 NA NA NA 8 5 4 1
#> 4 4 1 6 9 9 9 3 4 4
Another way you can try
library(dplyr)
library(stringr) # provides str_replace()
df %>%
rename_at(vars(aql:cxy), ~ str_replace(., "(?<=\\w{1})", "\\."))
# id ex a.ql b.ql c.ql b.st c.st a.xy c.xy
# 1 1 1 5 5 5 3 8 1 5
# 2 2 0 4 7 7 7 7 9 3
# 3 3 0 NA NA NA 8 5 4 1
# 4 4 1 6 9 9 9 3 4 4
You can also try a tidyverse approach reshaping your data like this:
library(tidyverse)
#Data
df<-data.frame(id=c(1,2,3,4),
ex=c(1,0,0,1),
aql=c(5,4,NA,6),
bql=c(5,7,NA,9),
cql=c(5,7,NA,9),
bst=c(3,7,8,9),
cst=c(8,7,5,3),
axy=c(1,9,4,4),
cxy=c(5,3,1,4))
#Reshape
df %>% pivot_longer(-c(1,2)) %>%
mutate(name=paste0(substring(name,1,1),'.',substring(name,2,nchar(name)))) %>%
pivot_wider(names_from = name,values_from=value)
Output:
# A tibble: 4 x 9
id ex a.ql b.ql c.ql b.st c.st a.xy c.xy
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 5 5 5 3 8 1 5
2 2 0 4 7 7 7 7 9 3
3 3 0 NA NA NA 8 5 4 1
4 4 1 6 9 9 9 3 4 4
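One thing to watch with the prefix-only patterns is that any other column starting with a, b or c would also get a dot. A hedged sketch that anchors on both the prefix and the known suffixes; the extra column name "age" is hypothetical and only there to show the difference:

```r
# Column names from the question plus one hypothetical name, "age"
nms <- c("id", "ex", "aql", "bql", "cql", "bst", "cst", "axy", "cxy", "age")

# Prefix-only pattern: also rewrites "age"
loose <- sub("(^[a-c])(.+)", "\\1.\\2", nms)

# Prefix and suffix both required: "age" is left alone
strict <- sub("^([a-c])(ql|st|xy)$", "\\1.\\2", nms)

loose
strict
```

If the data only ever contain the nine columns shown in the question, both patterns give the same result.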

Add a new row for each id in dataframe for ALL variables

I want to add a new row after each id. I found a solution on a Stack Overflow page (Inserting a new row to data frame for each group id),
but there is one thing I want to change and I don't know how. I want the new row to cover all variables without having to write each variable out by hand, as the Stack Overflow example does. The numbers in the new row don't matter; I will change them later. If possible, I would like the new row to have "base" in trt. The code should work for many ids and variables, since my real data has a lot of both. Many thanks to anyone who can help!
The example code:
set.seed(1)
id <- rep(1:3,each=4)
trt <- rep(c("A","OA", "B", "OB"),3)
pointA <- sample(1:10,12, replace=TRUE)
pointB<- sample(1:10,12, replace=TRUE)
pointC<- sample(1:10,12, replace=TRUE)
df <- data.frame(id,trt,pointA, pointB,pointC)
df
id trt pointA pointB pointC
1 1 A 3 7 3
2 1 OA 4 4 4
3 1 B 6 8 1
4 1 OB 10 5 4
5 2 A 3 8 9
6 2 OA 9 10 4
7 2 B 10 4 5
8 2 OB 7 8 6
9 3 A 7 10 5
10 3 OA 1 3 2
11 3 B 3 7 9
12 3 OB 2 2 7
I want it to look like:
df <- rbind(df[1:4,], df1, df[5:8,], df2, df[9:12,],df3)
> df
id trt pointA pointB pointC
1 1 A 3 7 3
2 1 OA 4 4 4
3 1 B 6 8 1
4 1 OB 10 5 4
5 1 base
51 2 A 3 8 9
6 2 OA 9 10 4
7 2 B 10 4 5
8 2 OB 7 8 6
13 2 base
9 3 A 7 10 5
10 3 OA 1 3 2
11 3 B 3 7 9
12 3 OB 2 2 7
14 3 base
>
I'm trying this code:
df %>%
group_by(id) %>%
summarise(week = "base") %>%
mutate_all() %>% # want to mutate all variables
bind_rows(df, .) %>%
arrange(id)
You could bind_rows directly, it will add NAs to all other columns by default.
library(dplyr)
df %>% group_by(id) %>% summarise(trt = 'base') %>% bind_rows(df) %>% arrange(id)
# id trt pointA pointB pointC
# <int> <chr> <int> <int> <int>
# 1 1 base NA NA NA
# 2 1 A 3 7 3
# 3 1 OA 4 4 4
# 4 1 B 6 8 1
# 5 1 OB 10 5 4
# 6 2 base NA NA NA
# 7 2 A 3 8 9
# 8 2 OA 9 10 4
# 9 2 B 10 4 5
#10 2 OB 7 8 6
#11 3 base NA NA NA
#12 3 A 7 10 5
#13 3 OA 1 3 2
#14 3 B 3 7 9
#15 3 OB 2 2 7
If you want empty strings instead of NA, we can give a range of columns in mutate_at and replace NA values with empty string.
df %>%
group_by(id) %>%
summarise(trt = 'base') %>%
bind_rows(df) %>%
mutate_at(vars(pointA:pointC), ~replace(., is.na(.) , '')) %>%
arrange(id)
library(dplyr)
library(purrr)
df %>% mutate_if(is.factor, as.character) %>%
group_split(id) %>%
map_dfr(~bind_rows(.x, data.frame(id=.x$id[1], trt="base", stringsAsFactors = FALSE)))
#Note that group_modify is Experimental
df %>% mutate_if(is.factor, as.character) %>%
group_by(id) %>%
group_modify(~bind_rows(.x, data.frame(trt="base", stringsAsFactors = FALSE)))
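For comparison, the same idea can be sketched in base R without naming any of the value columns. This assumes only that the id and trt columns exist; the point values below are placeholders, and base_rows and out are illustrative names.

```r
# Placeholder data with the same shape as in the question
df <- data.frame(id = rep(1:3, each = 4),
                 trt = rep(c("A", "OA", "B", "OB"), 3),
                 pointA = 1:12, pointB = 12:1, pointC = rep(5L, 12))

# One "base" row per id
base_rows <- data.frame(id = unique(df$id), trt = "base")

# Add every other column of df, whatever it is called, filled with NA
base_rows[setdiff(names(df), names(base_rows))] <- NA

# Append, then sort by id; ties keep their original order,
# so each "base" row ends up after the rows of its id
out <- rbind(df, base_rows[names(df)])
out <- out[order(out$id), ]
rownames(out) <- NULL
out
```

Unlike the summarise() version above, this places the "base" row last within each id, which matches the layout asked for in the question.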

Count number of new and lost friends between two data frames in R

I have two data frames of the same respondents, one from Time 1 and the next from Time 2. In each wave they nominated their friends, and I want to know:
1) how many friends are nominated in Time 2 but not in Time 1 (new friends)
2) how many friends are nominated in Time 1 but not in Time 2 (lost friends)
Sample data:
Time 1 DF
ID friend_1 friend_2 friend_3
1 4 12 7
2 8 6 7
3 9 NA NA
4 15 7 2
5 2 20 7
6 19 13 9
7 12 20 8
8 3 17 10
9 1 15 19
10 2 16 11
Time 2 DF
ID friend_1 friend_2 friend_3
1 4 12 3
2 8 6 14
3 9 NA NA
4 15 7 2
5 1 17 9
6 9 19 NA
7 NA NA NA
8 7 1 16
9 NA 10 12
10 7 11 9
So the desired DF would include these columns (EDIT filled in columns):
ID num_newfriends num_lostfriends
1 1 1
2 1 1
3 0 0
4 0 0
5 3 3
6 0 1
7 0 3
8 3 3
9 2 3
10 2 2
EDIT2:
I've tried doing an anti join
df3 <- anti_join(df1, df2)
But this method doesn't account for friend IDs that appear in a different column at Time 2. For example, respondent #6 nominates friends 9 and 19 at both T1 and T2, but in different columns each time.
Another option:
library(tidyverse)
left_join(
gather(df1, key, x, -ID),
gather(df2, key, y, -ID),
by = c("ID", "key")
) %>%
group_by(ID) %>%
summarise(
num_newfriends = sum(!y[!is.na(y)] %in% x[!is.na(x)]),
num_lostfriends = sum(!x[!is.na(x)] %in% y[!is.na(y)])
)
Output:
# A tibble: 10 x 3
ID num_newfriends num_lostfriends
<int> <int> <int>
1 1 1 1
2 2 1 1
3 3 0 0
4 4 0 0
5 5 3 3
6 6 0 1
7 7 0 3
8 8 3 3
9 9 2 3
10 10 2 2
Simple comparisons would be an option
library(tidyverse)
na_sums_old <- rowSums(is.na(time1))
na_sums_new <- rowSums(is.na(time2))
kept_friends <- map_dbl(seq(nrow(time1)), ~ sum(time1[.x, -1] %in% time2[.x, -1]))
kept_friends <- kept_friends - na_sums_old * (na_sums_new >= 1)
new_friends <- 3 - na_sums_new - kept_friends
lost_friends <- 3 - na_sums_old - kept_friends
tibble(ID = time1$ID, new_friends = new_friends, lost_friends = lost_friends)
# A tibble: 10 x 3
ID new_friends lost_friends
<int> <dbl> <dbl>
1 1 1 1
2 2 1 1
3 3 0 0
4 4 0 0
5 5 3 3
6 6 0 1
7 7 0 3
8 8 3 3
9 9 2 3
10 10 2 2
You can make anti_join work by first pivoting to a "long" data frame.
df1 <- df1 %>%
pivot_longer(starts_with("friend_"), values_to = "friend") %>%
drop_na()
df2 <- df2 %>%
pivot_longer(starts_with("friend_"), values_to = "friend") %>%
drop_na()
head(df1)
#> # A tibble: 6 x 3
#> ID name friend
#> <int> <chr> <int>
#> 1 1 friend_1 4
#> 2 1 friend_2 12
#> 3 1 friend_3 7
#> 4 2 friend_1 8
#> 5 2 friend_2 6
#> 6 2 friend_3 7
lost_friends <- anti_join(df1, df2, by = c("ID", "friend"))
new_friends <- anti_join(df2, df1, by = c("ID", "friend"))
respondents <- distinct(df1, ID)
respondents %>%
full_join(
count(lost_friends, ID, name = "num_lost_friends")
) %>%
full_join(
count(new_friends, ID, name = "num_new_friends")
) %>%
mutate_at(vars(starts_with("num_")), replace_na, 0)
#> Joining, by = "ID"
#> Joining, by = "ID"
#> # A tibble: 10 x 3
#> ID num_lost_friends num_new_friends
#> <int> <dbl> <dbl>
#> 1 1 1 1
#> 2 2 1 1
#> 3 3 0 0
#> 4 4 0 0
#> 5 5 3 3
#> 6 6 1 0
#> 7 7 3 0
#> 8 8 3 3
#> 9 9 3 2
#> 10 10 2 2
Created on 2019-11-01 by the reprex package (v0.3.0)
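The counts can also be cross-checked in base R with setdiff(), which ignores column position by construction. A sketch using the sample data from the question; it assumes both data frames list the respondents in the same row order, and the helper name friends_of is hypothetical.

```r
time1 <- data.frame(ID = 1:10,
                    friend_1 = c(4, 8, 9, 15, 2, 19, 12, 3, 1, 2),
                    friend_2 = c(12, 6, NA, 7, 20, 13, 20, 17, 15, 16),
                    friend_3 = c(7, 7, NA, 2, 7, 9, 8, 10, 19, 11))
time2 <- data.frame(ID = 1:10,
                    friend_1 = c(4, 8, 9, 15, 1, 9, NA, 7, NA, 7),
                    friend_2 = c(12, 6, NA, 7, 17, 19, NA, 1, 10, 11),
                    friend_3 = c(3, 14, NA, 2, 9, NA, NA, 16, 12, 9))

# Non-missing friend IDs nominated in row i (first column is the respondent ID)
friends_of <- function(d, i) {
  x <- unlist(d[i, -1], use.names = FALSE)
  x[!is.na(x)]
}

rows <- seq_len(nrow(time1))
result <- data.frame(
  ID = time1$ID,
  # in Time 2 but not in Time 1
  num_newfriends = sapply(rows, function(i)
    length(setdiff(friends_of(time2, i), friends_of(time1, i)))),
  # in Time 1 but not in Time 2
  num_lostfriends = sapply(rows, function(i)
    length(setdiff(friends_of(time1, i), friends_of(time2, i))))
)
result
```

Note that setdiff() drops duplicates, so a friend nominated twice by the same respondent in one wave would be counted once.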

How to replace 0 or missing value with NA in R [duplicate]

This question already has answers here:
Replace all 0 values to NA
(11 answers)
Closed 4 years ago.
This is what I have done so far (data is of numeric type):
if (is.na(data) || attribute==0){replace(data,NA)}
It gives me this error message:
Error in replace(attribute, NA) : argument "values" is missing, with no default
With mutate_all:
library(dplyr)
df %>%
mutate_all(~replace(., . == 0, NA))
or with mutate_if to be safe:
df %>%
mutate_if(is.numeric, ~replace(., . == 0, NA))
Note that there is no need to check for NA's, because we are replacing with NA anyway.
Output:
> df %>%
+ mutate_all(~replace(., . == 0, NA))
X Y Z
1 1 5 <NA>
2 4 4 2
3 2 3 2
4 5 5 2
5 5 3 <NA>
6 NA 4 <NA>
7 3 3 1
8 5 3 2
9 3 1 1
10 2 NA 5
11 5 5 <NA>
12 2 5 2
13 4 4 4
14 3 4 <NA>
15 NA NA 3
16 5 2 1
17 1 4 <NA>
18 NA 1 4
19 1 1 5
20 5 1 2
> df %>%
+ mutate_if(is.numeric, ~replace(., . == 0, NA))
X Y Z
1 1 5 0
2 4 4 2
3 2 3 2
4 5 5 2
5 5 3 0
6 NA 4 0
7 3 3 1
8 5 3 2
9 3 1 1
10 2 NA 5
11 5 5 0
12 2 5 2
13 4 4 4
14 3 4 0
15 NA NA 3
16 5 2 1
17 1 4 0
18 NA 1 4
19 1 1 5
20 5 1 2
Data:
set.seed(123)
df <- data.frame(X = sample(0:5, 20, replace = TRUE),
Y = sample(0:5, 20, replace = TRUE),
Z = as.character(sample(0:5, 20, replace = TRUE)))
You could just use replace without any additional function / package:
data <- replace(data, data == 0, NA)
This is now assuming that data is your data frame.
Otherwise you can simply insert the column name, e.g. if your data frame is df and column name data:
df$data <- replace(df$data, df$data == 0, NA)
Assuming that data is a data frame, you could use sapply to update the values column by column. Note that replace needs the replacement value as its third argument:
new.data = as.data.frame(sapply(data,FUN= function(x) replace(x,is.na(x) | x == 0, NA)))
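If the data frame mixes numeric and non-numeric columns, a base R variant that touches only the numeric ones may be safer. This is a sketch on made-up data; zero_to_na is a hypothetical helper name.

```r
# Small made-up example: two numeric columns and one character column
df <- data.frame(X = c(1, 0, 3, NA),
                 Y = c(0L, 2L, 0L, 4L),
                 Z = c("0", "a", "b", "0"))

# Replace zeros with NA in numeric vectors; return anything else unchanged
zero_to_na <- function(x) {
  if (is.numeric(x)) x[which(x == 0)] <- NA
  x
}

# df[] keeps the data frame structure while replacing each column
df[] <- lapply(df, zero_to_na)
df
```

Using lapply() with df[] keeps each column's type, whereas going through a matrix would coerce all columns to a common type.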

Resources