How to call column names from an object in dplyr? - r

I am trying to replace all zeros in multiple columns with NA using dplyr.
However, since I have many variables, I do not want to name them one by one, but rather store them in an object that I can refer to afterwards.
This is a minimal example of what I did:
library(dplyr)
Data <- data.frame(var1=c(1:10), var2=rep(c(0,4),5), var3 = rep(c(2,0,3,4,5),2), var4 = rep(c(7,0),5))
col <- Data[,c(2:4)]
Data <- Data %>%
mutate(across(col , na_if, 0))
However, if I do this, I get the following error message:
Error: Problem with 'mutate()' input '..1'.
x Must subset columns with a valid subscript vector.
x Subscript has the wrong type 'data.frame<
var2: double
var3: double
var4: double>'.
i It must be numeric or character.
i Input '..1' is '(function (.cols = everything(), .fns = NULL, ..., .names = NULL) ...'.
I have tried to change the format of col to a tibble, but that did not help.
Could anyone tell me how to make this work?

In case you wanted to target numeric columns only, then try helper functions like where(), which will select any variable where the function returns TRUE. I suppose the only benefit here is targeting a specific type of variable.
library(dplyr)
# where(is.numeric) selects all four columns: integers count as numeric,
# so var1 is included too, but it contains no zeros and is left unchanged
# Use where(is.double) instead if you want to skip integer columns such as var1
Data <- data.frame(
var1 = c(1:10),
var2 = rep(c(0, 4),5),
var3 = rep(c(2, 0 ,3, 4, 5), 2),
var4 = rep(c(7, 0), 5)
)
Data %>%
mutate(across(where(is.numeric), ~na_if(., 0)))
Here is the output:
var1 var2 var3 var4
1 1 NA 2 7
2 2 4 NA NA
3 3 NA 3 7
4 4 4 4 NA
5 5 NA 5 7
6 6 4 2 NA
7 7 NA NA 7
8 8 4 3 NA
9 9 NA 4 7
10 10 4 5 NA
The other answer you'll find here is great and allows you to select any arbitrary number of columns.
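A quick way to see which columns where() picks up is to call the predicate on a column directly. The sketch below assumes a cut-down version of the same Data (only var1 and var2) and uses where(is.double) to restrict the replacement to double columns; the object names are only for illustration.

```r
library(dplyr)

# Cut-down version of the example data (assumption: two columns are enough here)
Data <- data.frame(
  var1 = 1:10,              # integer column
  var2 = rep(c(0, 4), 5)    # double column
)

# Integer columns are numeric but not double
int_is_numeric <- is.numeric(Data$var1)
int_is_double  <- is.double(Data$var1)

# Replace zeros with NA in double columns only; var1 is not touched
out <- Data %>%
  mutate(across(where(is.double), ~ na_if(.x, 0)))
```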

Here, col should hold the column names of Data rather than a subset of the data frame. Since there is already a function named col, we can give the object a different name, wrap it in all_of(), and replace 0 with NA inside across():
library(dplyr)
col1 <- names(Data)[2:4]
Data <- Data %>%
mutate(across(all_of(col1), ~ na_if(.x, 0)))
-output
Data
# var1 var2 var3 var4
#1 1 NA 2 7
#2 2 4 NA NA
#3 3 NA 3 7
#4 4 4 4 NA
#5 5 NA 5 7
#6 6 4 2 NA
#7 7 NA NA 7
#8 8 4 3 NA
#9 9 NA 4 7
#10 10 4 5 NA
NOTE: Here the OP asked about looping based on either the index or the column names
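For comparison, here is a minimal sketch of the same idea side by side in dplyr and base R, assuming the Data from the question. The name target_cols is hypothetical and simply holds the column names to clean.

```r
library(dplyr)

Data <- data.frame(
  var1 = 1:10,
  var2 = rep(c(0, 4), 5),
  var3 = rep(c(2, 0, 3, 4, 5), 2),
  var4 = rep(c(7, 0), 5)
)

# Hypothetical name: a character vector of the columns to clean
target_cols <- names(Data)[2:4]

# dplyr: all_of() turns the stored names into a column selection
res_dplyr <- Data %>%
  mutate(across(all_of(target_cols), ~ na_if(.x, 0)))

# base R: index the selected columns with a logical matrix
res_base <- Data
res_base[target_cols][res_base[target_cols] == 0] <- NA
```

Both versions leave var1 alone because it is not listed in target_cols.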

Related

R - Merging rows with numerous NA values to another column

I would like to ask the R community for help with finding a solution for my data, where any consecutive row with numerous NA values is combined and put into a new column.
For example:
df <- data.frame(A = c(1,2,3,4,5,6), B = c(2, NA, NA, 5, NA, NA), C = c(1,2,NA,4,5,NA), D = c(3,NA,5,NA,NA,NA))
A B C D
1 1 2 1 3
2 2 NA 2 NA
3 3 NA NA 5
4 4 5 4 NA
5 5 NA 5 NA
6 6 NA NA NA
Must be transformed to this:
A B C D E
1 1 2 1 3 2 NA 2 NA 3 NA NA 5
2 4 5 4 NA 5 NA 5 NA 6 NA NA NA
I would like to do the following:
Identify consecutive rows that have more than 1 NA value -> combine entries from those consecutive rows into a single combined entry
Place the above combined entry in new column "E" on the prior row
This is quite complex (for me!) and I am wondering if anyone can offer any help with this. I have searched for some similar problems, but have been unable to find one that produces a similar desired output.
Thank you very much for your thoughts--
Using tidyr and dplyr:
Concatenate values for each row.
Keep the concatenated values only for rows with more than one NA.
Group each “good” row with all following “bad” rows.
Use a grouped summarize() to concatenate “bad” row values to a single string.
library(dplyr)
library(tidyr)
df %>%
unite("E", everything(), remove = FALSE, sep = " ") %>%
mutate(
E = if_else(
rowSums(across(!E, is.na)) > 1,
E,
""
),
new_row = cumsum(E == "")
) %>%
group_by(new_row) %>%
summarize(
across(A:D, first),
E = trimws(paste(E, collapse = " "))
) %>%
select(!new_row)
# A tibble: 2 × 5
A B C D E
<dbl> <dbl> <dbl> <dbl> <chr>
1 1 2 1 3 2 NA 2 NA 3 NA NA 5
2 4 5 4 NA 5 NA 5 NA 6 NA NA NA

aggregate function in R, sum of NAs are 0

I have seen several questions on Stack Overflow about the following, but never found a satisfactory answer. This is a follow-up to the question "Blend of na.omit and na.pass using aggregate?"
> test <- data.frame(name = rep(c("A", "B", "C"), each = 4),
var1 = rep(c(1:3, NA), 3),
var2 = 1:12,
var3 = c(rep(NA, 4), 1:8))
> test
name var1 var2 var3
1 A 1 1 NA
2 A 2 2 NA
3 A 3 3 NA
4 A NA 4 NA
5 B 1 5 1
6 B 2 6 2
7 B 3 7 3
8 B NA 8 4
9 C 1 9 5
10 C 2 10 6
11 C 3 11 7
12 C NA 12 8
When I try the solution given there, but with sum instead of mean,
aggregate(. ~ name, test, FUN = sum, na.action=na.pass, na.rm=TRUE)
the solution doesn't work as expected. It treats the NAs as 0, so the sum of a group that contains only NAs is displayed as 0 instead of NaN.
Why doesn't this work for FUN = sum, and how can I make it work?
Create a lambda function with a condition to return NaN when all elements are NA
aggregate(. ~ name, test, FUN = function(x) if(all(is.na(x))) NaN
else sum(x, na.rm = TRUE), na.action=na.pass)
-output
name var1 var2 var3
1 A 6 10 NaN
2 B 6 26 10
3 C 6 42 26
This is expected behavior for sum with na.rm = TRUE. According to ?sum:
the sum of an empty set is zero, by definition.
> sum(c(NA, NA), na.rm = TRUE)
[1] 0
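The difference from mean can be checked directly: once na.rm = TRUE has dropped every element, both functions are applied to an empty vector. A minimal sketch (the variable names are only for illustration):

```r
# An all-NA vector becomes empty after na.rm = TRUE removes the NAs
all_na <- c(NA_real_, NA_real_)

s_empty <- sum(numeric(0))             # sum of an empty set is zero
m_empty <- mean(numeric(0))            # mean of an empty set is 0/0, i.e. NaN

s_na <- sum(all_na, na.rm = TRUE)      # behaves like sum(numeric(0))
m_na <- mean(all_na, na.rm = TRUE)     # behaves like mean(numeric(0))
```

This is why the same aggregate() call shows NaN with FUN = mean but 0 with FUN = sum, and why the explicit all(is.na(x)) check is needed for sum.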

Find all indices of duplicates and write them in new columns

I have a data.frame with a single column, a vector of strings.
These strings have duplicate values.
I want to find the character strings that have duplicates in this vector and write their index of position in a new column.
So for example consider I have:
DT<- data.frame(string=c("A","B","C","D","E","F","A","C","F","Z","A"))
I want to get:
string match1 match2 match3 matchx....
A 1 7 11
B 2 NA NA
C 3 8 NA
D 4 NA NA
E 5 NA NA
F 6 9 NA
A 1 7 11
C 3 8 NA
F 6 9 NA
Z 10 NA NA
A 1 7 11
The vector is much longer than in this example, and I do not know the maximum number of columns I will need.
What would be the most efficient way to do this?
I know that there is the duplicated function, but I am not sure how to use it to get the result I want here.
Many thanks!
Here's one way of doing this. I'm sure a data.table one liner follows.
DT<- data.frame(string=c("A","B","C","D","E","F","A","C","F","Z","A"))
# find matches
rbf <- sapply(DT$string, FUN = function(x, DT) which(DT %in% x), DT = DT$string)
# fill in NAs to have a pretty matrix
out <- sapply(rbf, FUN = function(x, mx) c(x, rep(NA, length.out = mx - length(x))), max(sapply(rbf, length)))
# bind it to the original data
cbind(DT, t(out))
string 1 2 3
1 A 1 7 11
2 B 2 NA NA
3 C 3 8 NA
4 D 4 NA NA
5 E 5 NA NA
6 F 6 9 NA
7 A 1 7 11
8 C 3 8 NA
9 F 6 9 NA
10 Z 10 NA NA
11 A 1 7 11
Here is one option with data.table. After grouping by 'string', get the sequence (seq_len(.N)) and row index (.I), then dcast to 'wide' format and join with the original dataset on the 'string'
library(data.table)
dcast(setDT(DT)[, .(seq_len(.N),.I), string],string ~ paste0("match", V1))[DT, on = "string"]
# string match1 match2 match3
# 1: A 1 7 11
# 2: B 2 NA NA
# 3: C 3 8 NA
# 4: D 4 NA NA
# 5: E 5 NA NA
# 6: F 6 9 NA
# 7: A 1 7 11
# 8: C 3 8 NA
# 9: F 6 9 NA
#10: Z 10 NA NA
#11: A 1 7 11
Another option is to split the sequence of row numbers by 'string', pad the shorter list elements with NA up to the maximum length, and merge with the original dataset (using base R methods)
lst <- split(seq_len(nrow(DT)), DT$string)
merge(DT, do.call(rbind, lapply(lst, `length<-`, max(lengths(lst)))),
by.x = "string", by.y = "row.names")
data
DT<- data.frame(string=c("A","B","C","D","E","F","A","C",
"F","Z","A"), stringsAsFactors=FALSE)
And here's one that uses tidyverse tools ( not quite a one-liner ;) ):
library( tidyverse )
DT %>% group_by( string ) %>%
do( idx = which(DT$string == unique(.$string)) ) %>%
ungroup %>% unnest %>% group_by( string ) %>%
mutate( m = stringr::str_c( "match", 1:n() ) ) %>%
spread( m, idx )
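Since do() and spread() are superseded in current tidyverse releases, the same idea can be sketched with pivot_wider() and a join back onto the original rows. This assumes the DT defined above; wide and res are illustrative names.

```r
library(dplyr)
library(tidyr)

DT <- data.frame(
  string = c("A", "B", "C", "D", "E", "F", "A", "C", "F", "Z", "A"),
  stringsAsFactors = FALSE
)

# One row per distinct string, one column per occurrence
wide <- DT %>%
  mutate(idx = row_number()) %>%                 # position in the original data
  group_by(string) %>%
  mutate(m = paste0("match", row_number())) %>%  # match1, match2, ... within each string
  ungroup() %>%
  pivot_wider(names_from = m, values_from = idx)

# Repeat the positions on every original row
res <- left_join(DT, wide, by = "string")
```

The number of match columns follows from the data, so the maximum number of duplicates does not need to be known in advance.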

if condition is true find max in 3 consecutive rows and report it in a new column - r

Reproducible example:
Label<-c(0,0,1,1,1,2,2,3,3,3,4,5,5,5,6,6)
Value<-c(NA,NA,1,2,3,1,2,3,2,1,"NC",1,3,2,1,NA)
dat1<-as.data.frame(cbind(Label, Value))
The output I am after is a new column "test" that holds the maximum of the column "Value" for each value of the column "Label" when there are 3 consecutive rows with the same label, and otherwise just reports the value of the column "Value".
I do not mind the missing values at the beginning and at the end; they can stay.
Expected result of the column test: NA, NA, 3,3,3,1,2,3,3,3,NC,3,3,3,NA,NA
In Excel it was very easy, and I coded it successfully as follows:
=IF(AND(BN6=BN5,BN6=BN4),X4,Y6)
but in R I cannot.
I tried several methods, the closest to a result is the following:
test <-c(NA,NA)
test_tot <-NULL
for(i in 3:length(dat1$Label)){
test_tot<-c(test_tot, test)
if( dat1$Label[i]==dat1$Label[i+1]&& dat1$Label[i]==dat1$Label[i+2] ){
test<-max(as.numeric(c(dat1$Value[i],dat1$Value[i+1],dat1$Value[i+2])))
}
if(dat1$Label[i]==dat1$Label[i-1]&& dat1$Label[i]==dat1$Label[i+1]){
test<-max(as.numeric(c(dat1$Value[i],dat1$Value[i-1],dat1$Value[i+1])))
}
if(dat1$Label[i]==dat1$Label[i-1]&& dat1$Label[i]==dat1$Label[i-2]){
test<-max(as.numeric(c(dat1$Value[i],dat1$Value[i-1],dat1$Value[i-2])))
}
else {test<-dat1$Value[i]}
}
test_tot<-c(test_tot,NA,NA)
dat1$test<-test_tot
EDIT:
The difficulty apparently is that the column "Value" has character based values. Any solution able to deal with it is greatly appreciated.
Edit: The OP has pointed out that the column Value may contain character-based values, which are important for identifying a specific behaviour that happened at a specific time.
Consequently, the whole vector or column is of type character in R (or factor). The code below has been amended to handle this by extracting the numeric values to a separate column, computing the maximum values per group, coercing the result back to character, and copying the character-based values into the result.
The data.table solution below
Label<-c(0,0,1,1,1,2,2,3,3,3,4,5,5,5,6,6)
Value<-c(NA,NA,1,2,3,1,2,3,2,1,"NC",1,3,2,1,NA)
Expected <- c(NA, NA, 3,3,3,1,2,3,3,3,"NC",3,3,3,NA,NA)
dat1<-data.frame(Label, Value, Expected)
library(data.table) # CRAN version 1.10.4 used
# coerce to data.table
setDT(dat1)[
# create temporary column with only numeric values
, Value_num := as.numeric(as.character(Value))][
# create temp cols for group id and group size
, `:=`(grp = .GRP, N = .N), by = rleid(Label)][
# for sufficiently large groups compute max values and coerce to char
N >= 3, new := as.character(max(Value_num)), by = grp][
# copy missing values
is.na(new), new := as.character(Value)][
# clean up
, c("grp", "N", "Value_num") := NULL][]
returns the expected result
Label Value Expected new
1: 0 NA NA NA
2: 0 NA NA NA
3: 1 1 3 3
4: 1 2 3 3
5: 1 3 3 3
6: 2 1 1 1
7: 2 2 2 2
8: 3 3 3 3
9: 3 2 3 3
10: 3 1 3 3
11: 4 NC NC NC
12: 5 1 3 3
13: 5 3 3 3
14: 5 2 3 3
15: 6 1 NA 1
16: 6 NA NA NA
except for row 15, where I believe the expected result should be 1 if we follow the OP's wording: otherwise just report the values of the column "Value"
The warning message:
In eval(jsub, SDenv, parent.frame()) : NAs introduced by coercion
can be ignored as it's intended to convert non-numbers to NA, here.
Here is a dplyr solution. NOTE: NC was changed to NA
Label<-c(0,0,1,1,1,2,2,3,3,3,4,5,5,5,6,6)
Value<-c(NA,NA,1,2,3,1,2,3,2,1,NA,1,3,2,1,NA)
dat1<-as.data.frame(cbind(Label, Value))
library(dplyr)
dat1 %>%
filter(!is.na(Value)) %>%
group_by(Label) %>%
summarize(n = n(), max_Value = max(Value)) %>%
mutate(test = if_else(n>=3, max_Value, as.numeric(NA))) %>%
right_join(dat1, by = "Label") %>%
mutate(test = if_else(is.na(test), Value, test)) %>%
select(Label, Value, test)
# # A tibble: 16 × 3
# Label Value test
# <dbl> <dbl> <dbl>
# 1 0 NA NA
# 2 0 NA NA
# 3 1 1 3
# 4 1 2 3
# 5 1 3 3
# 6 2 1 1
# 7 2 2 2
# 8 3 3 3
# 9 3 2 3
# 10 3 1 3
# 11 4 NA NA
# 12 5 1 3
# 13 5 3 3
# 14 5 2 3
# 15 6 1 1
# 16 6 NA NA
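For readers without data.table, the run-based logic can also be sketched in base R with rle(), keeping the "NC" marker. This is only an illustration under the assumption that a group means a run of consecutive identical labels; run_id, run_len and run_max are illustrative names.

```r
Label <- c(0,0,1,1,1,2,2,3,3,3,4,5,5,5,6,6)
Value <- c(NA,NA,1,2,3,1,2,3,2,1,"NC",1,3,2,1,NA)  # character vector because of "NC"

# Identify runs of consecutive identical labels and their lengths
runs    <- rle(Label)
run_id  <- rep(seq_along(runs$lengths), runs$lengths)
run_len <- rep(runs$lengths, runs$lengths)

# Numeric copy of Value; non-numbers such as "NC" become NA on purpose
num <- suppressWarnings(as.numeric(Value))

# Maximum per run, repeated for every row of the run
run_max <- ave(num, run_id, FUN = max)

# Runs of 3 or more rows report the maximum, all other rows keep the original value
test <- ifelse(run_len >= 3, as.character(run_max), Value)
```

As in the data.table answer, row 15 keeps its own value because its run has only two rows.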

Updating individual values (not rows) in an R data.frame

I would like to update values of var3 in an R data.frame mydata according to a simple criterion.
var1 var2 var3
1 1 4 5
2 3 58 800
3 8 232 8
I would think that the following should do:
mydata$var3[mydata$var3 > 500,] <- NA
However, this replaces the entire row of every matching record with NA (all cells of the row), instead of just the var3 value (cell):
var1 var2 var3
1 1 4 5
2 NA NA NA
3 8 232 8
How can I ensure that just the value for the selected variable is replaced? mydata should then look like
var1 var2 var3
1 1 4 5
2 3 58 NA
3 8 232 8
Use which and arr.ind=TRUE
> mydata[which(mydata[,3]>500, arr.ind=TRUE), 3] <- NA
> mydata
var1 var2 var3
1 1 4 5
2 3 58 NA
3 8 232 8
Or just modify your previous attempt...
mydata[mydata$var3 > 500, 3] <- NA
This also works
mydata$var3[mydata$var3 > 500 ] <- NA # note no comma is inside [ ]
Your attempt didn't work because mydata$var3 gives a vector, and you are indexing it as if it were a matrix by using [mydata$var3 > 500,], so a dimension error is thrown. You almost got it: all you have to do is remove the comma in your code (see my last alternative).
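A small sketch of the two working forms side by side. The fourth row with a missing var3 is an assumption added here (it is not in the question) to show that which() keeps only the positions where the condition is TRUE, so a missing value in the condition is skipped.

```r
# Example data from the question plus one extra row with a missing var3 (assumption)
mydata <- data.frame(
  var1 = c(1, 3, 8, 2),
  var2 = c(4, 58, 232, 7),
  var3 = c(5, 800, 8, NA)
)

# Vector indexing: no comma, because mydata$var3 is a plain vector
by_vector <- mydata
by_vector$var3[which(by_vector$var3 > 500)] <- NA

# Data frame indexing: row selector, comma, then the column
by_frame <- mydata
by_frame[which(by_frame$var3 > 500), "var3"] <- NA
```

In both cases only the var3 cell of the matching row changes; var1 and var2 keep their values.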
