How can I select rows from a data frame that do not match? - r

I'm trying to identify the values in a data frame that do not match, but can't figure out how to do this.
# make data frame
a <- data.frame( x = c(1,2,3,4))
b <- data.frame( y = c(1,2,3,4,5,6))
# select only values from b that are not in 'a'
# attempt 1:
results1 <- b$y[ !a$x ]
# attempt 2:
results2 <- b[b$y != a$x,]
If a$x were c(1,2,3) this would appear to work, because the length of b$y is a multiple of the length of a$x and the comparison recycles. However, I'm trying to select all the values from b that are not in a, and don't understand what function to use.

If I understand correctly, you need the negation of the %in% operator. Something like this should work:
subset(b, !(y %in% a$x))
> subset(b, !(y %in% a$x))
y
5 5
6 6

Try the set difference function setdiff. So you would have
results1 = setdiff(a$x, b$y) # elements in a$x NOT in b$y
results2 = setdiff(b$y, a$x) # elements in b$y NOT in a$x
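With the example data above, the second call returns the values the question is after (a quick check; output shown as comments):
setdiff(b$y, a$x)
# [1] 5 6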

You could also use dplyr for this task. To find what is in b but not a:
library(dplyr)
anti_join(b, a, by = c("y" = "x"))
# y
# 1 5
# 2 6

Related

How to find if a value exists in a range and print "FOUND" or "MISSING" in a new column

I am trying to perform a function similar to the Excel function found below:
IF(COUNTIF(RANGE, CRITERIA), "FOUND", "MISSING")
I want to write FOUND or MISSING into a new column in my data frame. I understand that in R I can use %in%, for example:
A$C %in% C$B
to find whether the values in column C of the A data frame exist in column B of the C data frame. However, I do not know how to use that result in a conditional to write FOUND or MISSING into a new column in the correct row.
Here is an example of the dataframes:
A <- data.frame("C" = c(3,5,9,21,25), "D" = 1:5)
C <- data.frame("B" = c(3,6,21,22,8) , "F" = 10:14)
A$C %in% C$B
A[A$C %in% C$B,]
Based on the limited information:
lookup_list <- c(1:3)
x <- c('a','b','c')
y <- c(10, 3, 5)
df <- data.frame(x,y)
x y
1 a 10
2 b 3
3 c 5
library(dplyr)
df <- df %>%
  mutate(status = case_when(
    y %in% lookup_list ~ 'FOUND',
    !y %in% lookup_list ~ 'MISSING'
  ))
x y status
1 a 10 MISSING
2 b 3 FOUND
3 c 5 MISSING
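Applied to the A and C data frames from the original question, the same %in% idea works directly; here is a base R sketch using ifelse (the column name status is just an example):
A <- data.frame(C = c(3, 5, 9, 21, 25), D = 1:5)
C <- data.frame(B = c(3, 6, 21, 22, 8), F = 10:14)
# FOUND if the value of A$C appears anywhere in C$B, otherwise MISSING
A$status <- ifelse(A$C %in% C$B, "FOUND", "MISSING")
A
#    C D  status
# 1  3 1   FOUND
# 2  5 2 MISSING
# 3  9 3 MISSING
# 4 21 4   FOUND
# 5 25 5 MISSING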

Create and assign multiple new dataframe columns in ifelse statement?

So I have a data frame my_df as follows
my_df <- data.frame(H = c("0600", "0602", "0603"))
Now I need to write an ifelse statement which calculates 3 more variables and appends them to a new data frame.
I cannot figure out how to put multiple executable statements inside the loop and append the calculated variables to a new data frame.
Below is my code for ifelse statement.
with(my_df, ifelse(my_df$H == "0600",{d$D <- 1+1 & d$c <- "0600"},
ifelse(my_df$H == "0602",{d$D <- 2+1 & d$c <- "0602"},
{ d$D <- 3+1 & d$c <- "0603"}
)))
I am able to append values to the new data frame with only one executable statement inside the ifelse, i.e. if I have only {d$D <- 1+1} it works perfectly, but it fails when I have multiple statements to execute.
My output data frame should be as shown below:
D C
2 0600
3 0602
4 0603
Your syntax for ifelse is off, but I would recommend using case_when from the dplyr library here:
library(dplyr)
# 'd' is assumed to be an existing data frame (e.g. d <- my_df)
d$D <- case_when(
  my_df$H == "0600" ~ 1+1,
  my_df$H == "0602" ~ 2+1,
  TRUE ~ 3+1
)
d$c <- case_when(
  my_df$H == "0600" ~ "0600",
  my_df$H == "0602" ~ "0602",
  TRUE ~ "0603"
)
You could also use ifelse, but you would need nested calls to ifelse and it probably would not look good, or be very maintainable.
Using a list:
# My data frame
my_df <- data.frame(H = c("0600", "0602", "0603"))
# My list to be used as a lookup
my_list <- list("0600" = c(D = 2, C = "0600"),
                "0602" = c(D = 3, C = "0602"),
                "0603" = c(D = 4, C = "0603"))
# Find corresponding values for 'H'
# Then bind into a data frame
do.call(dplyr::bind_rows, my_list[my_df$H])
Result:
# A tibble: 3 x 2
# D C
# <chr> <chr>
# 1 2 0600
# 2 3 0602
# 3 4 0603
Using Base R
my_df <- data.frame("C" = c("0600", "0602", "0603"))
my_df$D <- ifelse(my_df$C=="0600",2,ifelse(my_df$C=="0602",3,ifelse(my_df$C=="0603",4,NA)))
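Another option, not shown above, is to keep the mapping in a small lookup data frame and merge; a sketch, where the lookup object is hypothetical:
# Hypothetical lookup table mapping each H value to the two new columns
lookup <- data.frame(H = c("0600", "0602", "0603"),
                     D = c(2, 3, 4),
                     C = c("0600", "0602", "0603"))
my_df <- data.frame(H = c("0600", "0602", "0603"))
merge(my_df, lookup, by = "H")[, c("D", "C")]
#   D    C
# 1 2 0600
# 2 3 0602
# 3 4 0603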

Counting function in R

I have a dataset like this
id <- 1:12
b <- c(0,0,1,2,0,1,1,2,2,0,2,2)
c <- rep(NA,3)
d <- rep(NA,3)
df <-data.frame(id,b)
newdf <- data.frame(c,d)
I want to do simple math: if x==1 or x==2, count them and record how many 1s and 2s there are. But I don't want to count over the whole dataset; I want my function to count them four rows at a time.
I want a result like this:
> newdf
one two
1 1 1
2 2 1
3 0 3
I tried this with lots of variations but I couldn't get it to work.
afonk <- function(x) {
ifelse(x==1 | x==2, x, newdf <- (x[1]+x[2]))
}
afonk(newdf$one)
lapply(newdf, afonk)
Thanks in advance!
ismail
Fun with base R:
# counting function
countnum <- function(x,num){
sum(x == num)
}
# make list of groups of 4
df$group <- rep(1:ceiling(nrow(df)/4),each = 4)[1:nrow(df)]
dfl <- split(df$b,f = df$group)
# make data frame of counts
newdf <- data.frame(one = sapply(dfl,countnum,1),
two = sapply(dfl,countnum,2))
Edit based on comment:
# make list of groups of 4
df$group <- rep(1:ceiling(nrow(df)/4),each = 4)[1:nrow(df)]
table(subset(df, b != 0L)[c("group", "b")])
Which you prefer depends on what type of result you need. A table will work for a small visual count, and you can likely pull the data out of the table, but if it is as simple as your example, you might opt for the data.frame.
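If you end up wanting a data frame rather than a table, as.data.frame.matrix converts the table result; a quick sketch, assuming the group column created above:
tab <- table(subset(df, b != 0L)[c("group", "b")])
as.data.frame.matrix(tab)
#   1 2
# 1 1 1
# 2 2 1
# 3 0 3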
We could use dcast from data.table. Create a grouping variable using %/% and then dcast from 'long' to 'wide' format.
library(data.table)
dcast(setDT(df)[,.N ,.(grp=(id-1)%/%4+1L, b)],
      grp~b, value.var='N', fill=0)[,c(3,4), with=FALSE]
Or a slightly more compact version would be using fun.aggregate as length.
res <- dcast(setDT(df)[,list((id-1)%/%4+1L, b)][b!=0],
V1~b, length)[,V1:=NULL][]
res
# 1 2
#1: 1 1
#2: 2 1
#3: 0 3
If we need the column names to be 'one' and 'two':
library(english)
names(res) <- as.character(english(as.numeric(names(res))))
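A quick check of the renamed result (assuming the english package is installed):
names(res)
# [1] "one" "two"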

Drop columns when splitting data frame in R

I am trying to split a data table by a column; however, once I get the list of data tables, they still contain the column that the data table was split by. How would I drop this column once the split is complete? Or, preferably, is there a way to drop multiple columns?
This is my code:
library(data.table)
x <- rnorm(10, mean = 5, sd = 2)
y <- rnorm(10, mean = 5, sd = 2)
z <- sample(5, 10, replace = TRUE)
dt <- data.table(x, y, z)
split(dt, dt$z)
The resulting data table subsets look like this:
$`1`
x y z
1: 6.179790 5.776683 1
2: 5.725441 4.896294 1
3: 8.690388 5.394973 1
$`2`
x y z
1: 5.768285 3.951733 2
2: 4.572454 5.487236 2
$`3`
x y z
1: 5.183101 8.328322 3
2: 2.830511 3.526044 3
$`4`
x y z
1: 5.043010 5.566391 4
2: 5.744546 2.780889 4
$`5`
x y z
1: 6.771102 0.09301977 5
Thanks
Splitting a data.table is really not worthwhile unless you have some fancy parallelization step to follow. And even then, you might be better off sticking with a single table.
That said, I think you want
split( dt[, !"z"], dt$z )
# or more generally
mysplitDT <- function(x, bycols)
split( x[, !..bycols], x[, ..bycols] )
mysplitDT(dt, "z")
You would run into the same problem if you had a data.frame:
df = data.frame(dt)
split( df[-which(names(df)=="z")], df$z )
The first thing that came to mind was to iterate through the list and drop the z column.
lapply(split(dt, dt$z), function(d) { d$z <- NULL; d })
And I just noticed that you use the data.table package, so there is probably a better, data.table way of achieving your desired result.
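For what it's worth, recent versions of data.table also provide a split method that does both steps at once; a sketch, assuming data.table 1.9.8 or later:
# split by the 'z' column and drop it from each piece in one step
split(dt, by = "z", keep.by = FALSE)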

Matching vector values by records in a data frame in R

I have a vector of values r as follows:
r<-c(1,3,4,6,7)
and a data frame df with two columns, id and freq:
id<-c(1,2,3,4,5,6,7,8,9,10,11,12,13,1,4,15,16,17,18,19,20)
freq<-c(1,3,2,4,5,6,6,7,8,3,3,1,6,9,9,1,1,4,3,7,7)
df<-data.frame(id,freq)
Using the r vector I need to extract a sample of records (in the form of a new data frame) from df such that the freq values of the sampled records equal the values in my r vector. Needless to say, if it finds multiple records with the same freq value it should randomly pick one of them. For instance, one possible outcome could be:
id frequency
12 1
10 3
4 4
7 6
8 7
I would be thankful if anyone could help me with this.
You could try data.table
library(data.table)
setDT(df)[freq %in% r,sample(id,1L) , freq]
Or using base R
aggregate(id~freq, df, subset=freq %in% r, FUN= sample, 1L)
Update
If you have a vector 'r' with duplicate values and want to sample as many rows from 'df' as each value occurs in 'r':
r <-c(1,3,3,4,6,7)
res <- do.call(rbind,lapply(split(r, r), function(x) {
x1 <- df[df$freq %in% x,]
x1[sample(1:nrow(x1),length(x), replace=FALSE),]}))
row.names(res) <- NULL
You can use filter and sample_n from "dplyr":
library(dplyr)
set.seed(1)
df %>%
filter(freq %in% r) %>%
group_by(freq) %>%
sample_n(1)
# Source: local data frame [5 x 2]
# Groups: freq
#
# id freq
# 1 12 1
# 2 10 3
# 3 17 4
# 4 13 6
# 5 8 7
Have you tried using the match() function or %in%? This might not be a fast/clean solution, but uses only base R functions:
rUnique <- unique(r)
df2 <- df[df$freq %in% rUnique,]
x <- data.frame(id = NA, freq = rUnique)
for (i in 1:length(rUnique)) {
x[i,1] <- sample(df2[df2[, 2] == rUnique[i], 1], 1)
}
print(x)
