find rows that are the same as a vector in R - r

I want to search row by row and, if a row matches a predefined vector, assign a value to a variable in that row. I would prefer a dplyr solution so that I can stay in the pipeline.
For a simplified example:
a=c(1,2,NA)
b=c(1,NA,NA)
c=c(1,2,3)
d=c(1,2,NA)
D= data.frame(a,b,c,d)
My attempt is:
D %>% mutate(
i= case_when(
identical(c(a,b,c),c(1,1,1)) ~ 1,
identical(c(a,b,c),c(NA,NA,3)) ~ 2
)
)
I hope it gives me:
a b c d i
1 1 1 1 1 1
2 2 NA 2 2 NA
3 NA NA 3 NA 2
but my code doesn't work. I guess that is because it isn't comparing a row to a vector.
I do not want to simply type a==1 & b==1 & c==1 ~ 1 within the case_when, because there will be too many variables to type in my dataset.
Thank you for your advice.

For this example, the following code would work:
library(dplyr)
a=c(1,2,NA)
b=c(1,NA,NA)
c=c(1,2,3)
d=c(1,2,NA)
D= data.frame(a,b,c,d)
# build a comma-separated key per row and compare it with the key of the target vector
D %>% mutate(
  i= case_when(
    paste(a,b,c, sep=',') == paste(1,1,1, sep=",") ~ 1,
    paste(a,b,c, sep=',') == paste(NA,NA,3, sep=",") ~ 2
  )
)
a b c d i
1 1 1 1 1 1
2 2 NA 2 2 NA
3 NA NA 3 NA 2

If we have multiple conditions, create a key/value dataset and then do a join
library(dplyr)
keydat <- data.frame(a =c(1, NA), b = c(1, NA), c = c(1, 3), i = c(1, 2))
# join on the shared columns; dplyr joins match NA to NA by default
left_join(D, keydat, by = c("a", "b", "c"))
# a b c d i
#1 1 1 1 1 1
#2 2 NA 2 2 NA
#3 NA NA 3 NA 2
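The attempt in the question fails because, inside mutate, c(a,b,c) concatenates the three whole columns into one vector of length 9, so identical() never sees a single row. A minimal base R sketch of a true row-against-vector comparison is shown below; row_is is a hypothetical helper name, and the sketch assumes the compared columns are all numeric.

```r
D <- data.frame(a = c(1, 2, NA), b = c(1, NA, NA), c = c(1, 2, 3), d = c(1, 2, NA))

# Hypothetical helper: TRUE for every row whose values in `cols` equal `target`.
# identical() treats NA as equal to NA, which is what the question asks for.
# Assumes all compared columns are numeric.
row_is <- function(df, cols, target) {
  m <- as.matrix(df[cols])
  apply(m, 1, function(r) identical(unname(r), as.numeric(target)))
}

cols <- c("a", "b", "c")
i <- rep(NA_real_, nrow(D))
i[row_is(D, cols, c(1, 1, 1))]   <- 1
i[row_is(D, cols, c(NA, NA, 3))] <- 2
D$i <- i
D
```

Because the column names are passed as a character vector, the helper does not need each variable typed out, which was the concern with a long case_when condition.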

Related

adding two variables which has NA present

lets say data is 'ab':
a <- c(1,2,3,NA,5,NA)
b <- c(5,NA,4,NA,NA,6)
ab <-c(a,b)
I would like to have new variable which is sum of the two but keeping NA's as follows:
desired output:
ab$c <-(6,2,7,NA,5,6)
So a number plus NA should equal the number.
I tried the following, but it does not work as desired:
ab$c <- a+b
gives me : 6 NA 7 NA NA NA
I also don't know how to include na.rm=TRUE, which is something I was trying.
I would also like to create a third, categorical variable based on a cutoff: if <=4 then event 1, otherwise 0:
desired output:
ab$d <-(1,1,1,NA,0,0)
I tried:
ab$d =ifelse(ab$a<=4|ab$b<=4,1,0)
print(ab$d)
gives me logical(0)
Thanks!
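a <- c(1,2,3,NA,5,NA)
b <- c(5,NA,4,NA,NA,6)
dfd <- data.frame(a,b)
# row-wise sum of a and b, treating NA as 0
dfd$c <- rowSums(dfd[c("a", "b")], na.rm = TRUE)
# restore NA where both inputs are NA
dfd$c <- ifelse(is.na(dfd$a) & is.na(dfd$b), NA_real_, dfd$c)
# flag rows whose sum c is at least 4 (the cutoff is applied to c, not to a or b)
dfd$d <- ifelse(dfd$c >= 4, 1, 0)
dfd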
a b c d
1 1 5 6 1
2 2 NA 2 0
3 3 4 7 1
4 NA NA NA NA
5 5 NA 5 1
6 NA 6 6 1
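The d column above applies the cutoff to the sum c, so it does not reproduce the (1,1,1,NA,0,0) requested in the question. One reading that does reproduce it is "1 if the smaller of the available values is <= 4, 0 otherwise, NA if both are missing". A minimal sketch under that assumed reading, using pmin with na.rm = TRUE:

```r
a <- c(1, 2, 3, NA, 5, NA)
b <- c(5, NA, 4, NA, NA, 6)
dfd <- data.frame(a, b)

# Sum of a and b ignoring NA; NA only when both inputs are NA
both_na <- is.na(dfd$a) & is.na(dfd$b)
dfd$c <- ifelse(both_na, NA_real_, rowSums(dfd[c("a", "b")], na.rm = TRUE))

# Assumed rule: 1 if the smaller available value is <= 4, else 0.
# pmin(..., na.rm = TRUE) returns NA only when both values are NA.
dfd$d <- as.numeric(pmin(dfd$a, dfd$b, na.rm = TRUE) <= 4)
dfd
```

If the intended rule is different (for example, a cutoff on c), only the last assignment needs to change.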

Removing columns based on a vector of names in R

I have a data.frame called DATA. Using BASE R, I was wondering how I could remove any variables in DATA that are named any of the following: ar = c("out", "Name", "mdif" , "stder" , "mpre")?
Currently, I use DATA[ , !names(DATA) %in% ar], but while this removes the unwanted variables, it also creates some new nuisance variables suffixed .1.
After extraction, is it possible to remove just the suffixes?
Note1: We have NO ACCESS to r, the only input is DATA.
Note2: This is toy data, a functional solution is appreciated.
r <- list(
data.frame(Name = rep("Jacob", 6),
X = c(2,2,1,1,NA, NA),
Y = c(1,1,1,2,1,NA),
Z = rep(3, 6),
out = rep(1, 6)),
data.frame(Name = rep("Jon", 6),
X = c(1,NA,3,1,NA,NA),
Y = c(1,1,1,2,NA,NA),
Z = rep(2, 6),
out = rep(1, 6)))
DATA <- do.call(cbind, r) ## DATA
ar = c("out", "Name", "mdif" , "stder" , "mpre") # The names for exclusion
DATA[ , !names(DATA) %in% ar] ## Current solution
#>
# X Y Z X.1 Y.1 Z.1 ## X.1 Y.1 Z.1 are created automatically but are not needed
# 1 2 1 3 1 1 2
# 2 2 1 3 NA 1 2
# 3 1 1 3 3 1 2
# 4 1 2 3 1 2 2
# 5 NA 1 3 NA NA 2
# 6 NA NA 3 NA NA 2
Ideally, column names should be unique, but if you want to keep duplicated column names, we can remove the suffixes using sub after extraction (the $ anchors the pattern to the end of the name)
DATA1 <- DATA[ , !names(DATA) %in% ar]
names(DATA1) <- sub("\\.\\d+$", "", names(DATA1))
DATA1
# X Y Z X Y Z
#1 2 1 3 1 1 2
#2 2 1 3 NA 1 2
#3 1 1 3 3 1 2
#4 1 2 3 1 2 2
#5 NA 1 3 NA NA 2
#6 NA NA 3 NA NA 2
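Whether the suffix pattern is anchored with $ matters when a column name contains a dot followed by digits somewhere other than at the end. A minimal sketch with hypothetical column names (they are not part of DATA):

```r
# Hypothetical names: the last one has ".5" in the middle and a ".1" suffix
nms <- c("X", "X.1", "v1.5_score.1")

# Unanchored: sub() removes the first ".digits" run it finds
unanchored <- sub("\\.\\d+", "", nms)

# Anchored: only a ".digits" run at the very end of the name is removed
anchored <- sub("\\.\\d+$", "", nms)

unanchored
anchored
```

For the toy DATA both patterns give the same result; the difference only shows up with names like the third one.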
In base R, if we create an object with the index, we can reuse it later instead of doing additional manipulations on the column name
i1 <- !names(DATA) %in% ar
DATA1 <- setNames(DATA[i1], names(DATA)[i1])
DATA1
# X Y Z X Y Z
#1 2 1 3 1 1 2
#2 2 1 3 NA 1 2
#3 1 1 3 3 1 2
#4 1 2 3 1 2 2
#5 NA 1 3 NA NA 2
#6 NA NA 3 NA NA 2
For reusability, we can create a function
f1 <- function(dat, vec) {
i1 <- !names(dat) %in% vec
setNames(dat[i1], names(dat)[i1])
}
f1(DATA, ar)
If the datasets are stored in a list, use lapply to loop over the list and apply f1
lst1 <- list(DATA, DATA)
lapply(lst1, f1, vec = ar)
If the 'ar' elements also differ between list elements, pass a list of exclusion vectors to Map
ar1 <- c("out", "Name")
ar2 <- c("mdif" , "stder" , "mpre")
arLst <- list(ar1, ar2)
Map(f1, lst1, vec = arLst)
Here is another option using the tidyverse
library(dplyr)
library(stringr)
library(purrr) # provides set_names()
DATA %>%
  set_names(make.unique(names(.))) %>%
  select(-matches(str_c(ar, collapse="|"))) %>%
  set_names(str_remove(names(.), "\\.\\d+$"))
# X Y Z X Y Z
#1 2 1 3 1 1 2
#2 2 1 3 NA 1 2
#3 1 1 3 3 1 2
#4 1 2 3 1 2 2
#5 NA 1 3 NA NA 2
#6 NA NA 3 NA NA 2
NOTE: It is not recommended to have duplicate column names

merge columns that have the same name r

I am working in R with a dataset that is created from mongodb with the use of mongolite.
I am getting a list that looks like so:
_id A B A B A B NA NA
1 a 1 b 2 e 5 NA NA
2 k 4 l 3 c 3 d 4
I would like to merge the dataset to look like this:
_id A B
1 a 1
2 k 4
1 b 2
2 l 3
1 e 5
2 c 3
1 NA NA
2 d 4
The NAs in the last columns are there because the columns are named from the first entry, and if a later entry has more columns than that, they don't get names assigned to them (if I get help with this as well it would be awesome, but it's not the reason I am here).
Also, the number of columns might differ for different subsets of the dataset.
I have tried melt(), but since it is a list and not a dataframe it doesn't work as expected. I have also tried stack(), but it didn't work because the columns have the same name and some of them don't even have a name.
I know this is a very weird situation and appreciate any help.
Thank you.
Using library(magrittr) for the pipe (plus library(data.table) for fread and setDF)
Data:
df <- fread("
_id A B A B A B NA NA
1 a 1 b 2 e 5 NA NA
2 k 4 l 3 c 3 d 4 ",header=T)
setDF(df)
Code:
df2 <- df[,-1]
odds<- df2 %>% ncol %>% {(1:.)%%2} %>% as.logical
even<- df2 %>% ncol %>% {!(1:.)%%2}
cbind(df[,1,drop=F],
A=unlist(df2[,odds]),
B=unlist(df2[,even]),
row.names=NULL)
result:
# _id A B
# 1 1 a 1
# 2 2 k 4
# 3 1 b 2
# 4 2 l 3
# 5 1 e 5
# 6 2 c 3
# 7 1 <NA> NA
# 8 2 d 4
We can use data.table, assuming A and B always follow each other. I created an example with two sets of NAs in the header. With grep we can find the columns that fread has named V8 etc. Thanks to R's vector recycling, multiple headers can be renamed in one go. If they are named differently in your case, change the pattern in the grep command. Then we reshape the data with melt
library(data.table)
df <- fread("
_id A B A B A B NA NA NA NA
1 a 1 b 2 e 5 NA NA NA NA
2 k 4 l 3 c 3 d 4 e 5",
header = TRUE)
# assuming A and B always follow each other. Can be done in one statement.
cols <- names(df)
cols[grep(pattern = "^V", x = cols)] <- c("A", "B")
names(df) <- cols
df
_id A B A B A B A B A B
1: 1 a 1 b 2 e 5 <NA> NA <NA> NA
2: 2 k 4 l 3 c 3 d 4 e 5
# melt the data (if df is a data.frame, replace df with setDT(df))
df_melted <- melt(df, id.vars = 1,
measure.vars = patterns(c('A', 'B')),
value.name=c('A', 'B'))
df_melted
_id variable A B
1: 1 1 a 1
2: 2 1 k 4
3: 1 2 b 2
4: 2 2 l 3
5: 1 3 e 5
6: 2 3 c 3
7: 1 4 <NA> NA
8: 2 4 d 4
9: 1 5 <NA> NA
10: 2 5 e 5
Thank you for your help; the answers were great inspiration.
Even though @Andre Elrico's solution worked better on the reproducible example, @phiver's solution worked better on my overall problem.
Combining both, I came up with the following.
library(data.table)
# The data were in a list of lists called list for this example
# m is the number of lists in list
temp <- as.data.table(matrix(t(sapply(list, '[', seq(max(sapply(list, length))))),
                             nrow = m))
cols <- names(temp)
cols[grep(pattern = "^V", x = cols)] <- c("B", "A")
#They need to be the opposite way because the first column is going to be substituted with id, and this way they fall on the correct column after that
cols[1] <- "id"
names(temp) <- cols
l <- melt.data.table(temp, id.vars = 1,
measure.vars = patterns(c("A", "B")),
value.name = c("A", "B"))
That way I can use this also if I have more than 2 columns that I need to manipulate like that.
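The same idea can be sketched in base R without relying on column names at all: cut the columns after the id into blocks of equal width, rename each block, and stack the blocks. This is a minimal sketch on toy data mirroring the question; wide, new_names and width are illustrative names, and it assumes the number of non-id columns is a multiple of the block width.

```r
# Toy input: an id column followed by repeating (A, B) pairs.
# Column names are deliberately generic because the reshaping is positional.
wide <- data.frame(
  V1 = c(1, 2),
  V2 = c("a", "k"), V3 = c(1, 4),
  V4 = c("b", "l"), V5 = c(2, 3),
  V6 = c("e", "c"), V7 = c(5, 3),
  V8 = c(NA, "d"),  V9 = c(NA, 4),
  stringsAsFactors = FALSE
)

new_names <- c("A", "B")          # names for one block; extend for wider blocks
width <- length(new_names)
starts <- seq(2, ncol(wide), by = width)

# Take the id column plus one block at a time, rename, then stack the blocks
blocks <- lapply(starts, function(s) {
  block <- wide[, c(1, s + seq_len(width) - 1)]
  setNames(block, c("id", new_names))
})
long <- do.call(rbind, blocks)
rownames(long) <- NULL
long
```

Handling three or more columns per block only requires a longer new_names vector.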

Replacing the values from another data frame based on the information in the first column in R

I'm trying to merge information from two different data frames, but the problem is that they have different dimensions and that I want to match on the information in a column rather than on the column index. The merge function in R and the joins in dplyr don't work with my data.
I have two data frames (one is a subset of the other, with updated info in the last column):
df1=data.frame(Name = print(LETTERS[1:9]), val = seq(1:3), Case = c("NA","1","NA","NA","1","NA","1","NA","NA"))
Name val Case
1 A 1 NA
2 B 2 1
3 C 3 NA
4 D 1 NA
5 E 2 1
6 F 3 NA
7 G 1 1
8 H 2 NA
9 I 3 NA
Some rows in the Case column in df1 have to be changed with the info in the df2 below:
df2 = data.frame(Name = c("A","D","H"), val = seq(1:3), Case = "1")
Name val Case
1 A 1 1
2 D 2 1
3 H 3 1
So there's nothing important in the val column, however I added it into the examples since I want to indicate that I have more columns than two and also my real data is way bigger than the examples.
Basically, I want to change specific rows by checking the information in the first columns (in this case, they're unique letters) and in the end I still want to have df1 as a final data frame.
for a better explanation, I want to see something like this:
Name val Case
1 A 1 1
2 B 2 1
3 C 3 NA
4 D 1 1
5 E 2 1
6 F 3 NA
7 G 1 1
8 H 2 1
9 I 3 NA
Note changed information for A,D and H.
Thanks.
%in% from base R comes to the rescue.
df1=data.frame(Name = LETTERS[1:9], val = seq(1:3), Case = c("NA","1","NA","NA","1","NA","1","NA","NA"), stringsAsFactors = FALSE)
df2 = data.frame(Name = c("A","D","H"), val = seq(1:3), Case = "1", stringsAsFactors = FALSE)
# Note: df2$Case[...] is recycled across the rows of df1; this works here because every Case in df2 is "1"
df1$Case <- ifelse(df1$Name %in% df2$Name, df2$Case[df2$Name %in% df1$Name], df1$Case)
df1
Output:
> df1
Name val Case
1 A 1 1
2 B 2 1
3 C 3 NA
4 D 1 1
5 E 2 1
6 F 3 NA
7 G 1 1
8 H 2 1
9 I 3 NA
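When the replacement values in df2 are not all the same, each row of df1 has to be paired with its own row of df2. A minimal base R sketch using match() is shown below; the distinct Case values in df2 are hypothetical and are only there to make the pairing visible.

```r
df1 <- data.frame(Name = LETTERS[1:9], val = rep(1:3, 3),
                  Case = c("NA", "1", "NA", "NA", "1", "NA", "1", "NA", "NA"),
                  stringsAsFactors = FALSE)
# Hypothetical df2 with distinct Case values (the question uses "1" throughout)
df2 <- data.frame(Name = c("A", "D", "H"), val = 1:3,
                  Case = c("1", "2", "3"), stringsAsFactors = FALSE)

# For each df1 row, the position of its Name in df2 (NA when absent)
idx <- match(df1$Name, df2$Name)
hit <- !is.na(idx)

# Overwrite Case only for the rows that have a counterpart in df2
df1$Case[hit] <- df2$Case[idx[hit]]
df1
```

All other columns of df1, including val, are left untouched.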
Here is what I would do using dplyr:
library(dplyr)
df1 %>%
  left_join(df2, by = "Name") %>%
  mutate(val = val.x, # keep val from df1; only Case is updated
         Case = if_else(is.na(Case.y), Case.x, Case.y)) %>%
  select(Name, val, Case)

in R find duplicates by column 1 and filter by not NA column 3

I have a dataframe:
a <- c(rep("A", 3), rep("B", 3), rep("C",2))
b <- c(1,1,2,4,1,1,2,2)
c <- c(1,NA,2,4,NA,1,2,2)
df <-data.frame(a,b,c)
I have a dataframe with some duplicate values in column 1, but when I use the duplicated() function, it seems to choose arbitrarily which row to keep after de-duping:
dedup_df = df[!duplicated(df$a), ]
How can I ensure that the output returns me the row that does not contain an NA on column c ?
I tried to use the dplyr package but the output prints only a result
library(dplyr)
options(dplyr.print_max = Inf )
df %>% ## source dataframe
group_by(a) %>% ## grouped by variable
filter(!is.na(c) ) %>% ## filter by Gross value
as.data.frame(dedup_df)
Your use of the duplicated function to remove duplicate observations (rows) from a data frame, using one column as the key, is correct.
But it seems you are worried that it may keep a row that contains NA in another column and drop a row that contains a non-NA value.
I'll use your example, but with a slight modification
a <- c(rep("A", 3), rep("B", 3), rep("C",2))
b <- c(1,1,2,4,1,1,2,2)
c <- c(NA,1,2,4,NA,1,2,2)
df <-data.frame(a,b,c)
> df
a b c
1 A 1 NA
2 A 1 1
3 A 2 2
4 B 4 4
5 B 1 NA
6 B 1 1
7 C 2 2
8 C 2 2
In this case, your dedup_df contains an NA for the first value.
> dedup_df = df[!duplicated(df$a), ]
> dedup_df
a b c
1 A 1 NA
4 B 4 4
7 C 2 2
Solution:
Reorder df by column c first and then use the same command. order() places NAs last by default, so this sends all rows with an NA in c to the end of the data frame. duplicated() then flags those rows as TRUE whenever an earlier row with the same value of a exists.
df = df[order(df$c),]
dedup_df = df[!duplicated(df$a), ]
> dedup_df
a b c
2 A 1 1
6 B 1 1
7 C 2 2
You can also reorder in descending order (NAs are still placed last)
df = df[order(df$c,decreasing = TRUE),]
dedup_df = df[!duplicated(df$a), ]
> dedup_df
a b c
4 B 4 4
3 A 2 2
7 C 2 2
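Sorting on c itself keeps the row with the smallest (or largest) c in each group. If the goal is instead to keep the first non-NA row of each group in its original order, one possible variation is to sort on a and on is.na(c) only. A minimal sketch using the modified example data:

```r
df <- data.frame(
  a = c(rep("A", 3), rep("B", 3), rep("C", 2)),
  b = c(1, 1, 2, 4, 1, 1, 2, 2),
  c = c(NA, 1, 2, 4, NA, 1, 2, 2)
)

# Within each value of a, rows with a non-NA c come first;
# ties keep their original order, so the values of c are not reshuffled
ord <- order(df$a, is.na(df$c))
sorted <- df[ord, ]

# Keep the first row per value of a
dedup_df <- sorted[!duplicated(sorted$a), ]
dedup_df
```

A group whose rows all have NA in c still keeps one row, which matches the behaviour of the approach above.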
