I have a data frame. I'm trying to remove rows whose values in one column match the values of other rows that were conditionally removed. Let me provide a simple example to explain.
I tried using this previous post as a starting point:
Remove Rows From Data Frame where a Row match a String
>dat
A,B,C
4,3,Foo
2,3,Bar
1,2,Bar
7,5,Zap
First remove rows with "Foo" in column C:
dat[!grepl("Foo", dat$C),]
Now I want to remove any additional rows whose values in column B match the values in the removed "Foo" rows. So in this example, any row with B = 3 should also be removed, because row 1, which contained Foo and was removed, has B = 3.
>dat.new
1,2,Bar
7,5,Zap
Any ideas on how to do this would be appreciated.
We subset the 'B' values where 'C' is 'Foo', create a logical vector by checking which 'B' values appear in that subset, negate it (!), and add a second condition that 'C' is not "Foo":
library(dplyr)
dat.new <- dat %>%
  filter(!B %in% B[C == 'Foo'], C != 'Foo')
dat.new
# A B C
#1 1 2 Bar
#2 7 5 Zap
Or in base R with subset
subset(dat, !B %in% B[C == 'Foo'] & C != "Foo")
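To see the pieces, here is a step-by-step sketch of the same logic, using the 'dat' defined in the data section below:
B_foo <- dat$B[dat$C == "Foo"]        # the 'B' values in the "Foo" rows: 3
dat$B %in% B_foo                      # TRUE TRUE FALSE FALSE
!(dat$B %in% B_foo) & dat$C != "Foo"  # rows to keep: FALSE FALSE TRUE TRUE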
data
dat <- structure(list(A = c(4L, 2L, 1L, 7L), B = c(3L, 3L, 2L, 5L),
C = c("Foo", "Bar", "Bar", "Zap")), row.names = c(NA, -4L
), class = "data.frame")
Related
I'm quite new to R and using lapply. I have a large dataframe and I'm attempting to use lapply to output the sum of some subsets of this dataframe.
group_a  group_b  n_variants_a  n_variants_b
      1       NA             1             2
     NA        2             5             4
      1        2             2             0
I want to look at subsets based on multiple different groups (group_a, group_b) and sum each column of n_variants.
Running this over just one group and n_variant set works:
sum(subset(df, !is.na(group_a))$n_variants_a)
However I want to sum every n_variant column based on every grouping. My lapply script for this outputs values of 0 for each sum.
summed_variants <- lapply(list_of_groups, function(g) {
  lapply(list_of_variants, function(v) {
    sum(subset(df, !is.na(g))$v)
  })
})
I was wondering if I need to use paste0 to paste the list of variants in, but I couldn't get this to work.
Thanks for your help!
We may use Map/mapply for this: loop over the group names and their corresponding 'n_variants' names (assuming they are in order), extract the columns based on the names, apply the condition (!is.na), subset the 'n_variants', and get the sum.
mapply(function(x, y) sum(df1[[y]][!is.na(df1[[x]])]),
names(df1)[1:2], names(df1)[3:4])
group_a group_b
3 4
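The same logic with Map returns a list instead of a named vector:
Map(function(x, y) sum(df1[[y]][!is.na(df1[[x]])]),
    names(df1)[1:2], names(df1)[3:4])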
Or another option can be done using the tidyverse. Loop across the 'n_variants' columns, get the column name (cur_column()), replace the substring with 'group' to find the matching group column, get its values, create the condition to subset the column, and take the sum.
library(stringr)
library(dplyr)
df1 %>%
summarise(across(contains('variants'),
~ sum(.x[!is.na(get(str_replace(cur_column(), 'n_variants', 'group')))])))
-output
n_variants_a n_variants_b
1 3 4
data
df1 <- structure(list(group_a = c(1L, NA, 1L), group_b = c(NA, 2L, 2L
), n_variants_a = c(1L, 5L, 2L), n_variants_b = c(2L, 4L, 0L)),
class = "data.frame", row.names = c(NA,
-3L))
I have two data.frame tables in R. Both have IDs for users who took particular actions. The users in the second table should all have done the actions in the first table, but I want to confirm. What would be the best way to determine whether all the IDs in table B are represented in table A, and if not, which IDs aren't?
Table A
Unique ID   Count
abc123      1
zyx456      15
888aaaa     4
Table B
Unique ID   Count
abc123      1
zyx456      1
zzzzz123    2
I'm trying to get a response that abc123 and zyx456 in Table B are in Table A and that zzzzz123 is not represented in Table A but is in B (which would be an error, since all B should be in A).
This is an efficient one-liner in base R:
setdiff(TableB$ID, TableA$ID)
It will return an empty result if everything in TableB is in TableA, and return the missing IDs if there are any.
Other answers may be better choices with broader context, but this is a simple solution for a simple problem.
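A minimal sketch with the sample IDs (column name ID assumed, as above):
TableA <- data.frame(ID = c("abc123", "zyx456", "888aaaa"))
TableB <- data.frame(ID = c("abc123", "zyx456", "zzzzz123"))
setdiff(TableB$ID, TableA$ID)
#[1] "zzzzz123"  # in B but not in A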
We can do this easily with a join in the tidyverse:
library(tidyverse)
JoinedTable = full_join(
  x = TableA %>% mutate(in.A = TRUE),
  y = TableB %>% mutate(in.B = TRUE),
  by = "UniqueID",
  suffix = c(".A", ".B")
)
### Use whichever of the following is applicable
## Is in both
JoinedTable %>%
filter(in.A, in.B)
## In A only
JoinedTable %>%
filter(in.A, !in.B)
## In B only
JoinedTable %>%
filter(!in.A, in.B)
Use a full_join to combine the tables; set "by" to your ID column and add a suffix to differentiate the other columns that aren't unique to a particular table. I've added mutates to make the filtering code clearer, but you could also just look for NAs in the respective Count columns (i.e. filter(!is.na(Count.A), is.na(Count.B)) to find ones in A but not B).
If you just want a vector of the ones that meet each condition, just tack on %>% pull(UniqueID) to grab that.
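For example, combining those two notes, a sketch that pulls the IDs in B but not in A via the NA check:
JoinedTable %>%
  filter(is.na(Count.A), !is.na(Count.B)) %>%
  pull(UniqueID)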
You can add another column to table B showing whether it is also in table A. Here is code that does it (assuming dfA and dfB denote tables A and B):
dfB <- within(dfB, in_dfA <- UniqueID %in% dfA$UniqueID)
gives
> dfB
UniqueID Count in_dfA
1 abc123 1 TRUE
2 zyx456 1 TRUE
3 zzzzz123 2 FALSE
DATA
dfA <- structure(list(UniqueID = structure(c(2L, 3L, 1L), .Label = c("888aaaa",
"abc123", "zyx456"), class = "factor"), Count = c(1L, 15L, 4L
)), class = "data.frame", row.names = c(NA, -3L))
dfB <- structure(list(UniqueID = structure(1:3, .Label = c("abc123",
"zyx456", "zzzzz123"), class = "factor"), Count = c(1L, 1L, 2L
), in_dfA = c(TRUE, TRUE, FALSE)), row.names = c(NA, -3L), class = "data.frame")
How about using the %in% operator to see which are in both versus those that are not:
library(tibble)
library(tidyverse)
df1 <- tribble(~ID, ~Count,
'abc', 1,
'zyx', 15,
'other', 3)
df2 <- tribble(~ID, ~Count,
'abc', 2,
'zyx', 33,
'another', 334)
match <- df2[which(df2$ID %in% df1$ID),'ID']
notmatch <- df2[which(!(df2$ID %in% df1$ID)),'ID']
This outputs two comparisons that you can use to check for values in a function and pass errors if need be:
match
# A tibble: 2 x 1
ID
<chr>
1 abc
2 zyx
notmatch
# A tibble: 1 x 1
ID
<chr>
1 another
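To pass an error, as mentioned, a sketch that fails fast when notmatch is non-empty:
if (nrow(notmatch) > 0) {
  stop("IDs missing from table A: ", paste(notmatch$ID, collapse = ", "))
}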
You could do an update join to see which IDs are/aren't in the first table
tblb[tbla, on = 'UniqueID', in_tbla := i.UniqueID
][, in_tbla := !is.na(in_tbla)]
tblb
# UniqueID Count in_tbla
# 1: abc123 1 TRUE
# 2: zyx456 1 TRUE
# 3: zzzzz123 2 FALSE
Not sure if that's any better than @Onyambu's suggestion though (same output)
tblb[, in_tbla := UniqueID %in% tbla$UniqueID]
Data used:
library(data.table)
tbla <- fread('
UniqueID Count
abc123 1
zyx456 15
888aaaa 4
')
tblb <- fread('
UniqueID Count
abc123 1
zyx456 1
zzzzz123 2
')
I am working on a dataset in R where WO can have the values "K" and "B". I want the WO values returned where the frequency per WO does not match between the "K" and "B" records. For example, take the following table:
df <- structure(list(WO = c(917595L, 917595L, 1011033L, 1011033L),
Invoice = c("B", "K", "B", "K"), freq = c(3L, 6L, 2L, 2L)),
.Names = c("WO", "Invoice", "freq"),
class = "data.frame", row.names = c(NA, -4L)
)
I want 917595 returned because 3 does not equal 6. However, 1011033 should not be returned because its frequencies match.
Reshaping the data lets you do a comparison on the frequency values.
library(dplyr)
library(reshape2)
dframe <-
"WO,Invoice,freq
917595,B,3
917595,K,6
1011033,B,2
1011033,K,2" %>%
read.csv(text = .,
stringsAsFactors = FALSE)
dcast(dframe,
WO ~ Invoice,
value.var = "freq") %>%
filter(B != K)
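For what it's worth, a tidyr sketch of the same reshape (assuming tidyr >= 1.0 for pivot_wider):
library(tidyr)
dframe %>%
  pivot_wider(names_from = Invoice, values_from = freq) %>%
  filter(B != K)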
We could do it with base R using duplicated
df1[!(duplicated(df1[c(1, 3)])|duplicated(df1[c(1,3)], fromLast = TRUE)),]
# WO Invoice freq
#1 917595 B 3
#2 917595 K 6
Or another option is to group by 'WO' and check if the number of unique elements in 'freq' is greater than 1
library(data.table)
setDT(df1)[, if(uniqueN(freq)>1) .SD, WO]
# WO Invoice freq
#1: 917595 B 3
#2: 917595 K 6
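A dplyr sketch of the same grouped check, if you prefer that syntax:
library(dplyr)
df1 %>%
  group_by(WO) %>%
  filter(n_distinct(freq) > 1)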
I have two data frames, data and meta. Some, but not all, columns in data are logical values, but they are coded in many different ways. The rows in meta describe the columns in data, indicate whether they are to be interpreted as logicals, and if so, what single value codes TRUE and what single value codes FALSE.
I need a procedure that replaces all data values in conceptually logical columns with the appropriate logical values from the codes in the corresponding meta row. Any data values in a conceptually logical column that do not match a value in the corresponding meta row should become NA.
Small toy example for meta:
name type false true
-----------------------------------------
a.char.var char NA NA
a.logical.var logical NA 7
another.logical.var logical 1 0
another.char.var char NA NA
Small toy example for data:
a.char.var a.logical.var another.logical.var another.char.var
----------------------------------------------------------------
aa 7 0 ba
ab NA 1 bb
ac 7 NA bc
ad 4 3 bd
Small toy example output:
a.char.var a.logical.var another.logical.var another.char.var
----------------------------------------------------------------
aa TRUE TRUE ba
ab FALSE FALSE bb
ac TRUE NA bc
ad NA NA bd
I cannot, for the life of me, find a way to do this in idiomatic R that handles all the corner cases. The data sets are large, so an idiomatic solution would be ideal if possible. I inherited this absolutely insane data management mess and will be grateful to anybody who can help fix it. I am by no means an R guru, but this seems like a deceptively difficult problem.
First we set up the data
meta <- data.frame(name=c('a.char.var', 'a.logical.var', 'another.logical.var', 'another.char.var'),
type=c('char', 'logical', 'logical', 'char'),
false=c(NA, NA, 1, NA),
true=c(NA, 7, 0, NA), stringsAsFactors = F)
data <- data.frame(a.char.var=c('aa', 'ab', 'ac', 'ad'),
a.logical.var=c(7, NA, 7, 4),
another.logical.var=c(0,1,NA,3),
another.char.var=c('ba', 'bb', 'bc', 'bd'), stringsAsFactors = F)
Then we subset out just the logical columns. We will iterate through these, using the name column to select the relevant column in data, and change values in data_out from an initialized NA to either T or F according to matching values in data.
Note that data[,logical_meta$name[1]] is equivalent to data[,'a.logical.var'] or data$a.logical.var, if logical_meta$name is a character. If it's a factor (eg if we didn't specify stringsAsFactors=F) we need to convert to character at which point we might as well give it a name - colname below.
Having NAs to contend with means using which is advantageous: c(0, 1, NA, 3) == 0 returns T, F, NA, F, but which then ignores the NA and returns just the position 1. Subsetting by a logical vector that includes NAs yields NA rows or columns; using which eliminates this.
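A quick illustration of that point:
c(0, 1, NA, 3) == 0         # TRUE FALSE NA FALSE
which(c(0, 1, NA, 3) == 0)  # 1 -- the NA is dropped
Now the procedure itself: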
logical_meta <- meta[meta$type=='logical',]
data_out <- data #initialize
for(i in 1:nrow(logical_meta)) {
colname <- as.character(logical_meta$name[i]) #only need as.character if factor
data_out[,colname] <- NA
#false column first
if(is.na(logical_meta$false[i])) {
data_out[is.na(data[,colname]),colname] <- FALSE
} else {
data_out[which(data[,colname]==logical_meta$false[i]),
colname] <- FALSE
}
#true column next
if(is.na(logical_meta$true[i])) {
data_out[is.na(data[,colname]),colname] <- TRUE
} else {
data_out[which(data[,colname]==logical_meta$true[i]),
colname] <- TRUE
}
}
data_out
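which prints the expected output from the question:
#  a.char.var a.logical.var another.logical.var another.char.var
#1         aa          TRUE                TRUE               ba
#2         ab         FALSE               FALSE               bb
#3         ac          TRUE                  NA               bc
#4         ad            NA                  NA               bd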
I've written a function that takes in the column index of data and tries to perform the operation you described.
The function first selects x as the column we are interested in. We then match the name of the column in data to the entries in the first column of meta, this gives our row of interest.
We then check if the column type is logical, if it isn't we just return x, nothing needed to be changed. If the column type is logical we then check whether its values match the true or false columns in meta.
convert_data <- function(colindex, dat, meta){
x <- dat[,colindex] #select our data vector
#match the column name to the first column in meta
find_in_meta <- match(names(dat)[colindex],
meta[,1])
#what type of column is it
type_col <- meta[find_in_meta,2]
if(type_col != 'logical'){
return(x)
}else{
#fix if logical is NA
true_val <- ifelse(is.na(meta[find_in_meta,4]),'NA_val',
meta[find_in_meta,4])
#fix if logical is NA
false_val <- ifelse(is.na(meta[find_in_meta,3]), 'NA_val',
meta[find_in_meta, 3])
#fix if logical is NA
x <- ifelse(is.na(x), 'NA_val', x)
x <- ifelse(x == true_val, TRUE,
ifelse(x == false_val, FALSE, NA))
return(x)
}
}
We can then use lapply and a little data manipulation to get it into an acceptable form:
res <- lapply(1:ncol(df1), function(ind)
convert_data(colindex = ind, dat = df1, meta = meta))
setNames(do.call('cbind.data.frame', res), names(df1))
a.char.var a.logical.var another.logical.var another.char.var
1 aa TRUE TRUE ba
2 ab FALSE FALSE bb
3 ac TRUE NA bc
4 ad NA NA bd
data
meta <- structure(list(name = c("a.char.var", "a.logical.var", "another.logical.var",
"another.char.var"), type = c("char", "logical", "logical", "char"
), false = c(NA, NA, 1L, NA), true = c(NA, 7L, 0L, NA)), .Names = c("name",
"type", "false", "true"), class = "data.frame", row.names = c(NA,
-4L))
df1 <- structure(list(a.char.var = c("aa", "ab", "ac", "ad"), a.logical.var = c(7L,
NA, 7L, 4L), another.logical.var = c(0L, 1L, NA, 3L), another.char.var = c("ba",
"bb", "bc", "bd")), .Names = c("a.char.var", "a.logical.var",
"another.logical.var", "another.char.var"), class = "data.frame", row.names = c(NA,
-4L))
My objective is to get a count of how many duplicates there are in a column. I have a column of 3516 obs. of 1 variable; they are all dates, with about 144 duplicates each, from 1/4/16 to 7/3/16. Example (I put 1 duplicate each for example's sake):
1/4/16
1/4/16
31/3/16
31/3/16
30/3/16
30/3/16
29/3/16
29/3/16
28/3/16
28/3/16
So I used the function date = count(date), where date is my df date. But once I execute it, my date sequence is not in order anymore. Hope someone can solve my problem.
If we need to count the total number of duplicates
sum(table(df1$date)-1)
#[1] 5
Suppose we need the count of each date; one option would be to group by 'date' and get the number of rows. This can be done with data.table.
library(data.table)
setDT(df1)[, .N, date]
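If preserving the original date order is the concern (as in the question), a dplyr sketch (column name 'date' assumed, as above): add_count tags each row with its group size without reordering, and distinct then reduces to one row per date:
library(dplyr)
df1 %>%
  add_count(date) %>%
  distinct(date, n)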
If you want the count of number of duplicates in your column , you can use duplicated
sum(duplicated(df$V1))
#[1] 5
Assuming V1 as your column name.
EDIT
As per the update, if you want the count of each date, you can use the table function, which will give you exactly that:
table(df$V1)
#1/4/16 28/3/16 29/3/16 30/3/16 31/3/16
# 2 2 2 2 2
library(dplyr)
library(janitor)
df %>% get_dupes(Variable) %>% tally()
You can add group_by in the pipe too if you want.
One way is to create a data frame with the unique values of your initial data, which will preserve the order, and then use left_join from the dplyr package to join the two data frames. Note that the column name should be the same in both.
Initial_data <- structure(list(V1 = structure(c(1L, 1L, 5L, 5L, 4L, 4L, 3L, 3L,
2L, 2L, 2L), .Label = c("1/4/16", "28/3/16", "29/3/16", "30/3/16",
"31/3/16"), class = "factor")), .Names = "V1", class = "data.frame", row.names = c(NA,
-11L))
library(dplyr)
df1 <- unique(Initial_data)
count1 <- plyr::count(Initial_data)  # freq of each V1 in the original data
left_join(df1, count1, by = 'V1')
# V1 freq
#1 1/4/16 2
#2 31/3/16 2
#3 30/3/16 2
#4 29/3/16 2
#5 28/3/16 3
If you want to count the number of duplicated records, use:
sum(duplicated(df))
and when you want to calculate the proportion of duplicated records, use:
mean(duplicated(df))
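A quick illustration on a small vector:
x <- c(1, 1, 2, 3, 3)
sum(duplicated(x))   # 2 duplicate entries
mean(duplicated(x))  # 0.4, i.e. a proportion of 40%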