I would like to ask the R community for help with my data: any run of consecutive rows that each contain more than one NA value should be combined and placed in a new column.
For example:
df <- data.frame(A = c(1,2,3,4,5,6), B = c(2, NA, NA, 5, NA, NA), C = c(1,2,NA,4,5,NA), D = c(3,NA,5,NA,NA,NA))
A B C D
1 1 2 1 3
2 2 NA 2 NA
3 3 NA NA 5
4 4 5 4 NA
5 5 NA 5 NA
6 6 NA NA NA
Must be transformed to this:
A B C D E
1 1 2 1 3 2 NA 2 NA 3 NA NA 5
2 4 5 4 NA 5 NA 5 NA 6 NA NA NA
I would like to do the following:
Identify consecutive rows that have more than 1 NA value -> combine entries from those consecutive rows into a single combined entry
Place the above combined entry in new column "E" on the prior row
This is quite complex (for me!) and I am wondering if anyone can offer any help with this. I have searched for some similar problems, but have been unable to find one that produces a similar desired output.
Thank you very much for your thoughts--
Using tidyr and dplyr:
Concatenate values for each row.
Keep the concatenated values only for rows with more than one NA.
Group each “good” row with all following “bad” rows.
Use a grouped summarize() to concatenate “bad” row values to a single string.
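library(dplyr)
library(tidyr)
df %>%
  # paste all columns of each row into one string
  unite("E", everything(), remove = FALSE, sep = " ") %>%
  mutate(
    # keep the string only for rows with more than one NA
    E = if_else(
      rowSums(across(!E, is.na)) > 1,
      E,
      ""
    ),
    # every "good" row starts a new group
    new_row = cumsum(E == "")
  ) %>%
  group_by(new_row) %>%
  summarize(
    across(A:D, first),
    E = trimws(paste(E, collapse = " "))
  ) %>%
  select(!new_row)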
# A tibble: 2 × 5
A B C D E
<dbl> <dbl> <dbl> <dbl> <chr>
1 1 2 1 3 2 NA 2 NA 3 NA NA 5
2 4 5 4 NA 5 NA 5 NA 6 NA NA NA
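For comparison, here is a minimal base R sketch of the same steps. It assumes the columns hold real NA values (not the string "NA") and that the first row is a "good" row; the names bad, grp, row_txt and keep are only illustrative.

```r
df <- data.frame(A = c(1, 2, 3, 4, 5, 6),
                 B = c(2, NA, NA, 5, NA, NA),
                 C = c(1, 2, NA, 4, 5, NA),
                 D = c(3, NA, 5, NA, NA, NA))

# rows with more than one NA are the "bad" rows
bad <- rowSums(is.na(df)) > 1
# each "good" row starts a new group; bad rows inherit the group of the prior good row
grp <- cumsum(!bad)
# one space-separated string per row
row_txt <- apply(df, 1, function(r) paste(r, collapse = " "))
# concatenate the strings of the bad rows within each group
E <- tapply(row_txt[bad], grp[bad], paste, collapse = " ")

# keep the good rows and attach the combined string (NA if no bad rows follow)
keep <- df[!bad, ]
keep$E <- as.vector(E[as.character(grp[!bad])])
rownames(keep) <- NULL
keep
```

Bad rows that come before the first good row fall into group 0 and are dropped by this sketch, so that case would need separate handling.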
I have seen several Stack Overflow questions on this topic but never found a satisfactory answer. This is a follow-up to the question "Blend of na.omit and na.pass using aggregate?"
> test <- data.frame(name = rep(c("A", "B", "C"), each = 4),
var1 = rep(c(1:3, NA), 3),
var2 = 1:12,
var3 = c(rep(NA, 4), 1:8))
> test
name var1 var2 var3
1 A 1 1 NA
2 A 2 2 NA
3 A 3 3 NA
4 A NA 4 NA
5 B 1 5 1
6 B 2 6 2
7 B 3 7 3
8 B NA 8 4
9 C 1 9 5
10 C 2 10 6
11 C 3 11 7
12 C NA 12 8
When I try the solution given there, but with sum instead of mean:
aggregate(. ~ name, test, FUN = sum, na.action=na.pass, na.rm=TRUE)
it does not behave as I expected. The all-NA group is summed to 0, so the result shows 0 instead of NaN.
Why doesn't this work for FUN = sum, and how can I make it work?
Use an anonymous function with a condition that returns NaN when all elements are NA:
aggregate(. ~ name, test, FUN = function(x) if(all(is.na(x))) NaN
else sum(x, na.rm = TRUE), na.action=na.pass)
Output:
name var1 var2 var3
1 A 6 10 NaN
2 B 6 26 10
3 C 6 42 26
This is the expected behavior of sum with na.rm = TRUE. According to ?sum:
the sum of an empty set is zero, by definition.
> sum(c(NA, NA), na.rm = TRUE)
[1] 0
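The difference from mean can be checked directly: once the NAs are removed, the all-NA group is an empty vector, and the two functions treat an empty vector differently. The sketch below rebuilds the question's test data and wraps the condition in a helper; the name sum_or_nan is only illustrative.

```r
# after na.rm = TRUE, an all-NA group is an empty vector
sum(numeric(0))    # 0
mean(numeric(0))   # NaN, which is why the mean version showed NaN

# illustrative helper: NaN for an all-NA group, otherwise the NA-free sum
sum_or_nan <- function(x) if (all(is.na(x))) NaN else sum(x, na.rm = TRUE)

test <- data.frame(name = rep(c("A", "B", "C"), each = 4),
                   var1 = rep(c(1:3, NA), 3),
                   var2 = 1:12,
                   var3 = c(rep(NA, 4), 1:8))

res <- aggregate(. ~ name, test, FUN = sum_or_nan, na.action = na.pass)
res
```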
I have a data.frame with a single column, a vector of strings.
These strings have duplicate values.
I want to find the character strings that have duplicates in this vector and write their index of position in a new column.
So for example consider I have:
DT <- data.frame(string = c("A","B","C","D","E","F","A","C","F","Z","A"))
I want to get:
string match1 match2 match3 matchx....
A 1 7 11
B 2 NA NA
C 3 8 NA
D 4 NA NA
E 5 NA NA
F 6 9 NA
A 1 7 11
C 3 8 NA
F 6 9 NA
Z 10 NA NA
A 1 7 11
The real vector is much longer than in this example, and I do not know the maximum number of columns I will need.
What would be the most efficient way to do this?
I know that there is the duplicated function, but I am not sure how to use it to get the result I want here.
Many thanks!
Here's one way of doing this. I'm sure a data.table one liner follows.
DT<- data.frame(string=c("A","B","C","D","E","F","A","C","F","Z","A"))
# find matches
rbf <- sapply(DT$string, FUN = function(x, DT) which(DT %in% x), DT = DT$string)
# fill in NAs to have a pretty matrix
out <- sapply(rbf, FUN = function(x, mx) c(x, rep(NA, length.out = mx - length(x))), max(sapply(rbf, length)))
# bind it to the original data
cbind(DT, t(out))
string 1 2 3
1 A 1 7 11
2 B 2 NA NA
3 C 3 8 NA
4 D 4 NA NA
5 E 5 NA NA
6 F 6 9 NA
7 A 1 7 11
8 C 3 8 NA
9 F 6 9 NA
10 Z 10 NA NA
11 A 1 7 11
Here is one option with data.table. After grouping by 'string', get the sequence (seq_len(.N)) and row index (.I), then dcast to 'wide' format and join with the original dataset on the 'string'
library(data.table)
dcast(setDT(DT)[, .(seq_len(.N),.I), string],string ~ paste0("match", V1))[DT, on = "string"]
# string match1 match2 match3
# 1: A 1 7 11
# 2: B 2 NA NA
# 3: C 3 8 NA
# 4: D 4 NA NA
# 5: E 5 NA NA
# 6: F 6 9 NA
# 7: A 1 7 11
# 8: C 3 8 NA
# 9: F 6 9 NA
#10: Z 10 NA NA
#11: A 1 7 11
Another option is to split the row indices by 'string', pad the shorter list elements with NA up to the maximum length, and merge with the original dataset (using base R methods). Note that merge() sorts the result by 'string' by default.
lst <- split(seq_len(nrow(DT)), DT$string)
merge(DT, do.call(rbind, lapply(lst, `length<-`, max(lengths(lst)))),
by.x = "string", by.y = "row.names")
data
DT<- data.frame(string=c("A","B","C","D","E","F","A","C",
"F","Z","A"), stringsAsFactors=FALSE)
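If the original row order matters, a variant of the split-and-pad idea can index the padded matrix by the string values instead of calling merge(). This is only a sketch under that assumption; the names padded, idx and res and the "match" column prefix are illustrative.

```r
DT <- data.frame(string = c("A","B","C","D","E","F","A","C",
                            "F","Z","A"), stringsAsFactors = FALSE)

# row positions of each distinct string
lst <- split(seq_len(nrow(DT)), DT$string)
# pad every element with NA to the longest length, one matrix row per string
padded <- do.call(rbind, lapply(lst, `length<-`, max(lengths(lst))))
colnames(padded) <- paste0("match", seq_len(ncol(padded)))

# look up the padded row for every original row, keeping the input order
idx <- padded[DT$string, , drop = FALSE]
rownames(idx) <- NULL
res <- cbind(DT, idx)
res
```

The number of match columns follows from the data, so it does not need to be known in advance.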
And here's one that uses tidyverse tools ( not quite a one-liner ;) ):
library( tidyverse )
DT %>% group_by( string ) %>%
do( idx = which(DT$string == unique(.$string)) ) %>%
ungroup %>% unnest %>% group_by( string ) %>%
mutate( m = stringr::str_c( "match", 1:n() ) ) %>%
spread( m, idx )
Reproducible example:
Label<-c(0,0,1,1,1,2,2,3,3,3,4,5,5,5,6,6)
Value<-c(NA,NA,1,2,3,1,2,3,2,1,"NC",1,3,2,1,NA)
dat1<-as.data.frame(cbind(Label, Value))
The output I am after is a new column "test" that holds the maximum of the column "Value" for each value of the column "Label" when there are 3 consecutive labels that are the same, and otherwise just reports the value of the column "Value".
I do not mind about the missing values at the beginning and at the end; they can stay.
Expected result of the column test: NA, NA, 3,3,3,1,2,3,3,3,NC,3,3,3,NA,NA
In Excel it was very easy, and I coded it successfully as follows:
=IF(AND(BN6=BN5,BN6=BN4),X4,Y6)
but in R I cannot.
I tried several methods, the closest to a result is the following:
test <-c(NA,NA)
test_tot <-NULL
for(i in 3:length(dat1$Label)){
test_tot<-c(test_tot, test)
if( dat1$Label[i]==dat1$Label[i+1]&& dat1$Label[i]==dat1$Label[i+2] ){
test<-max(as.numeric(c(dat1$Value[i],dat1$Value[i+1],dat1$Value[i+2])))
}
if(dat1$Label[i]==dat1$Label[i-1]&& dat1$Label[i]==dat1$Label[i+1]){
test<-max(as.numeric(c(dat1$Value[i],dat1$Value[i-1],dat1$Value[i+1])))
}
if(dat1$Label[i]==dat1$Label[i-1]&& dat1$Label[i]==dat1$Label[i-2]){
test<-max(as.numeric(c(dat1$Value[i],dat1$Value[i-1],dat1$Value[i-2])))
}
else {test<-dat1$Value[i]}
}
test_tot<-c(test_tot,NA,NA)
dat1$test<-test_tot
EDIT:
The difficulty apparently is that the column "Value" has character based values. Any solution able to deal with it is greatly appreciated.
Edit: The OP has pointed out that column Value may contain character-based values, which are important for identifying a specific behaviour that happened at a specific time.
Consequently, the whole vector or column is of type character in R (or factor). The code below has been amended to handle this by extracting numeric values to a separate column, computing the maximum values per group, coercing the result back to character, and copying the character-based values into the result.
The data.table solution is shown below:
Label<-c(0,0,1,1,1,2,2,3,3,3,4,5,5,5,6,6)
Value<-c(NA,NA,1,2,3,1,2,3,2,1,"NC",1,3,2,1,NA)
Expected <- c(NA, NA, 3,3,3,1,2,3,3,3,"NC",3,3,3,NA,NA)
dat1<-data.frame(Label, Value, Expected)
library(data.table) # CRAN version 1.10.4 used
# coerce to data.table
setDT(dat1)[
# create temporary column with only numeric values
, Value_num := as.numeric(as.character(Value))][
# create temp cols for group id and group size
, `:=`(grp = .GRP, N = .N), by = rleid(Label)][
# for sufficiently large groups compute max values and coerce to char
N >= 3, new := as.character(max(Value_num)), by = grp][
# copy missing values
is.na(new), new := as.character(Value)][
# clean up
, c("grp", "N", "Value_num") := NULL][]
returns the expected result
Label Value Expected new
1: 0 NA NA NA
2: 0 NA NA NA
3: 1 1 3 3
4: 1 2 3 3
5: 1 3 3 3
6: 2 1 1 1
7: 2 2 2 2
8: 3 3 3 3
9: 3 2 3 3
10: 3 1 3 3
11: 4 NC NC NC
12: 5 1 3 3
13: 5 3 3 3
14: 5 2 3 3
15: 6 1 NA 1
16: 6 NA NA NA
except for row 15, where I believe the expected result should be 1 if we follow the OP's wording: otherwise just report the values of the column "Value".
The warning message:
In eval(jsub, SDenv, parent.frame()) : NAs introduced by coercion
can be ignored, as converting non-numbers to NA is intended here.
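The same run-based logic can also be sketched in base R with rle() and ave(). This is not the answer's method, only an illustration under the same reading of the question (runs of at least 3 identical labels get the run maximum, everything else keeps its value); the names runs, run_id, run_len, num and run_max are illustrative.

```r
Label <- c(0,0,1,1,1,2,2,3,3,3,4,5,5,5,6,6)
Value <- c(NA,NA,1,2,3,1,2,3,2,1,"NC",1,3,2,1,NA)   # character because of "NC"

# identify runs of consecutive identical labels
runs <- rle(Label)
run_id  <- rep(seq_along(runs$lengths), runs$lengths)
run_len <- rep(runs$lengths, runs$lengths)

# numeric copy of Value; non-numbers such as "NC" become NA on purpose
num <- suppressWarnings(as.numeric(Value))
# maximum per run, repeated for every row of the run
run_max <- ave(num, run_id, FUN = max)

# runs of length >= 3 get the run maximum, all other rows keep their value
test <- ifelse(run_len >= 3, as.character(run_max), Value)
test
```

As in the data.table result, row 15 comes out as "1" rather than NA.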
Here is a dplyr solution. NOTE: NC was changed to NA
Label<-c(0,0,1,1,1,2,2,3,3,3,4,5,5,5,6,6)
Value<-c(NA,NA,1,2,3,1,2,3,2,1,NA,1,3,2,1,NA)
dat1<-as.data.frame(cbind(Label, Value))
library(dplyr)
dat1 %>%
filter(!is.na(Value)) %>%
group_by(Label) %>%
summarize(n = n(), max_Value = max(Value)) %>%
mutate(test = if_else(n>=3, max_Value, as.numeric(NA))) %>%
right_join(dat1, by = "Label") %>%
mutate(test = if_else(is.na(test), Value, test)) %>%
select(Label, Value, test)
# # A tibble: 16 × 3
# Label Value test
# <dbl> <dbl> <dbl>
# 1 0 NA NA
# 2 0 NA NA
# 3 1 1 3
# 4 1 2 3
# 5 1 3 3
# 6 2 1 1
# 7 2 2 2
# 8 3 3 3
# 9 3 2 3
# 10 3 1 3
# 11 4 NA NA
# 12 5 1 3
# 13 5 3 3
# 14 5 2 3
# 15 6 1 1
# 16 6 NA NA
I would like to update values of var3 in an R data.frame mydata according to a simple criterion.
var1 var2 var3
1 1 4 5
2 3 58 800
3 8 232 8
I would think that the following should do:
mydata$var3[mydata$var3 > 500,] <- NA
However, this replaces the entire row of every matching record with NA (all cells of the row), instead of just the var3 value (cell):
var1 var2 var3
1 1 4 5
2 NA NA NA
3 8 232 8
How can I ensure that just the value for the selected variable is replaced? mydata should then look like
var1 var2 var3
1 1 4 5
2 3 58 NA
3 8 232 8
Use which and arr.ind=TRUE
> mydata[which(mydata[,3]>500, arr.ind=TRUE), 3] <- NA
> mydata
var1 var2 var3
1 1 4 5
2 3 58 NA
3 8 232 8
Or just modify your previous attempt...
mydata[mydata$var3 > 500, 3] <- NA
This also works
mydata$var3[mydata$var3 > 500 ] <- NA # note no comma is inside [ ]
Your attempt didn't work because mydata$var3 is a vector, and [mydata$var3 > 500,] indexes it as if it were a matrix, so a dimension error is thrown. You almost had it: all you have to do is remove the comma from your code (see my last alternative).
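A small self-contained check of both points, using the three rows from the question; the names v and err are only illustrative. It first shows that the comma version fails on a plain vector, then applies the comma-free assignment.

```r
mydata <- data.frame(var1 = c(1, 3, 8),
                     var2 = c(4, 58, 232),
                     var3 = c(5, 800, 8))

# a column extracted with $ is a plain vector, so a two-dimensional index fails
v <- mydata$var3
err <- tryCatch({
  v[v > 500, ] <- NA
  NULL
}, error = function(e) conditionMessage(e))
err   # an error message about the number of subscripts

# without the comma only the matching cells of var3 are replaced
mydata$var3[mydata$var3 > 500] <- NA
mydata
```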