Anti Merging Large DataSets with Multiple Conditions - r

Suppose I have two data frames:
A <- data.frame(X1=c(1,2,3,4,5), X2=c(3,3,4,4,6), X3=c(3,2,14,5,4))
B <- data.frame(X1=c(1,3,5), X2=c(3,4,6))
I want to filter A so that when the X1 and X2 values of a row in A do not appear together in any row of B, that entire row (with all columns) is returned from A. I have tried anti_join and merge, but the results were not what I expected, and merge cannot handle larger data frames. I have also tried a few things with the data.table package.
I would like the below dataframe to be returned or saved to a new object.
C <- data.frame(X1=c(2,4), X2=c(3,4), X3=c(2,5))

Wouldn't you just do A %>% anti_join(B, by = c("X1", "X2"))? With by set to both X1 and X2, you get every row of A whose (X1, X2) pair has no match in B.
> library(dplyr)
> A <- data.frame(X1=c(1,2,3,4,5), X2=c(3,3,4,4,6), X3=c(3,2,14,5,4))
> B <- data.frame(X1=c(1,3,5), X2=c(3,4,6))
> A %>% inner_join(B, by = c("X1", "X2"))
  X1 X2 X3
1  1  3  3
2  3  4 14
3  5  6  4
> A %>% anti_join(B, by = c("X1", "X2"))
  X1 X2 X3
1  2  3  2
2  4  4  5
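Since the question mentions that merge struggles with larger data frames, the same anti-join can also be sketched with data.table's "not-join" syntax. This is only an illustration on the example data, assuming the data.table package is installed; the names A_dt, B_dt, and C_dt are made up for this sketch.

```r
library(data.table)

A_dt <- data.table(X1 = c(1, 2, 3, 4, 5), X2 = c(3, 3, 4, 4, 6), X3 = c(3, 2, 14, 5, 4))
B_dt <- data.table(X1 = c(1, 3, 5), X2 = c(3, 4, 6))

# Not-join: keep the rows of A_dt whose (X1, X2) pair has no match in B_dt.
C_dt <- A_dt[!B_dt, on = c("X1", "X2")]
print(C_dt)
```

The result keeps all columns of A_dt, in the original row order of A_dt.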

Related

How to partition to multiple .csv from df based on whitespace row?

I'm working with a database that has a timestamp, 3 numeric vectors, and a character vector.
Each "set" of data is delimited by a separator row in which every column is empty (x = \t\r\n). I need each run of rows between separators to be saved as its own .csv file. There are about 370 such sets in my dataset.
For example,
library(dplyr)
data <- data.frame(x1 = 1:4,
x2 = 4:1,
x3 = 3,
x4 = c("text", "no text", "example", "hello"))
new_row <- c("\t\r\n", "\t\r\n", "\t\r\n", "\t\r\n")
data1 <- rbind(data, new_row)
data2 <- data.frame(x1 = 1:4,
x2 = 4:1,
x3 = 4,
x4 = c("text", "no text", "example", "hello"))
data2 <- rbind(data2, new_row)
data3 <- rbind(data1, data2)
View(data3)
This is what my data set looks like (without the timestamp). I need every set of consecutive rows after a row full of \t\r\n to be exported as an individual .csv.
I'm doing text analysis. Each group of rows, with highly variable group size, represents a thread on a different subject. I need to analyze these individual threads.
What is the best way to go about doing this? I haven't had this problem before.
ind <- grepl("\t", data3$x4)
ind <- replace(cumsum(ind), ind, -1)
ind
# [1] 0 0 0 0 -1 1 1 1 1 -1
data4 <- split(data3, ind)
data4
# $`-1`
#        x1     x2     x3     x4
# 5  \t\r\n \t\r\n \t\r\n \t\r\n
# 10 \t\r\n \t\r\n \t\r\n \t\r\n
# $`0`
#   x1 x2 x3      x4
# 1  1  4  3    text
# 2  2  3  3 no text
# 3  3  2  3 example
# 4  4  1  3   hello
# $`1`
#   x1 x2 x3      x4
# 6  1  4  4    text
# 7  2  3  4 no text
# 8  3  2  4 example
# 9  4  1  4   hello
The use of -1 was solely to keep the "\t\r\n" rows from being included in each of their respective groups, and we know that cumsum(ind) should start at 0. You can obviously drop the first frame :-)
From here, you can export with
data4 <- data4[-1]
ign <- Map(write.csv, data4, sprintf("file_%03d.csv", seq_along(data4)))
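As a variation on the approach above, the separator rows can be dropped before splitting, which avoids the -1 placeholder group altogether. The sketch below is an assumption-laden illustration: it rebuilds a small version of the example, writes to tempdir() rather than a real output folder, and the names thread, threads, and files are hypothetical.

```r
# Rebuild a small version of the example: two threads, each followed by a separator row.
thread <- data.frame(x1 = 1:4, x2 = 4:1,
                     x4 = c("text", "no text", "example", "hello"),
                     stringsAsFactors = FALSE)
sep_row <- c("\t\r\n", "\t\r\n", "\t\r\n")
block <- rbind(thread, sep_row)
data3 <- rbind(block, block)

# Flag separator rows, then count how many separators precede each row.
is_sep <- grepl("\t", data3$x4, fixed = TRUE)
grp <- cumsum(is_sep)

# Split only the non-separator rows; each list element is one thread.
threads <- split(data3[!is_sep, ], grp[!is_sep])

# Write one file per thread (into a temporary directory for this sketch).
files <- file.path(tempdir(), sprintf("thread_%03d.csv", seq_along(threads)))
invisible(Map(function(d, f) write.csv(d, f, row.names = FALSE), threads, files))
```

Passing row.names = FALSE keeps the original row numbers out of the exported files; drop that argument if the row numbers are needed.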

Conditional merging based on full join

I would like to conditionally merge two datasets such that the values in dataframe2 replace the values in dataframe1, unless dataframe2 contains missing values. This should be performed as a full join, so that rows from both data frames are preserved.
This question is inspired by Conditional merge/replacement in R (which seems to work only for an inner join).
df1 <- data.frame(x1=1:4,x2=letters[1:4],stringsAsFactors=FALSE)
df2 <- data.frame(x1=2:5,x2=c("zz","qq", NA, "qy"),stringsAsFactors=FALSE)
I would like the following result:
  x1 x2
1  1  a
2  2 zz
3  3 qq
4  4  d
5  5 qy
I tried the following code, but it returns NA in the 4th row. I would like the original value to be preserved there, since df2 contains a missing value for x1 = 4.
library(dplyr)
df3 <- anti_join(df1, df2, by = "x1")
rbind(df3, df2)
  x1   x2
1  1    a
2  2   zz
3  3   qq
4  4 <NA>
5  5   qy
It can be done with dplyr.
library(dplyr)
full_join(df1, df2, by = c("x1" = "x1")) %>%
  transmute(x1 = x1, x2 = coalesce(x2.y, x2.x))
  x1 x2
1  1  a
2  2 zz
3  3 qq
4  4  d
5  5 qy
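For comparison, the same "take df2 unless it is missing" rule can be sketched in base R with merge and ifelse. This is a minimal illustration rather than a replacement for the dplyr answer; the suffixes .old and .new and the names m and res are choices made for this example.

```r
df1 <- data.frame(x1 = 1:4, x2 = letters[1:4], stringsAsFactors = FALSE)
df2 <- data.frame(x1 = 2:5, x2 = c("zz", "qq", NA, "qy"), stringsAsFactors = FALSE)

# all = TRUE makes this a full join; the two x2 columns get distinct suffixes.
m <- merge(df1, df2, by = "x1", all = TRUE, suffixes = c(".old", ".new"))

# Prefer the df2 value, falling back to the df1 value where df2 is NA.
m$x2 <- ifelse(is.na(m$x2.new), m$x2.old, m$x2.new)
res <- m[, c("x1", "x2")]
print(res)
```

Note that merge sorts the result by the key column, which happens to match the desired row order here.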

order() not doing its job

This is driving me nuts. I am trying to sort a data frame by the first row in ascending order using the order function. Below is a minimal example:
values <- c(11,10,9,8,7,6,5,4,3,2,1)
labels <- c("A","B","C","D","E","F","G","H","I","J","K")
df <- data.frame(rbind(values,labels))
newdf <- df[,with(df,order(df[1,]))]
print(newdf)
I have also tried this with
newdf <- df[,order(df[1,])]
Here is the output I'm getting
       X11 X2 X1 X10 X9 X8 X7 X6 X5 X4 X3
values   1 10 11   2  3  4  5  6  7  8  9
labels   K  B  A   J  I  H  G  F  E  D  C
Which is clearly wrong! So what is going on here?
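This is an odd way to structure your data in R, so it will cause headaches, but you can make it work. The underlying problem is that rbind(values, labels) builds a character matrix, so the numbers are no longer numeric and get ordered as text ("1", "10", "11", "2", ...). See @thelatemail's comment re: columns vs rows. To make this work in your case, do: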
values <- c(11,10,9,8,7,6,5,4,3,2,1)
labels <- c("A","B","C","D","E","F","G","H","I","J","K")
df <- data.frame(rbind(values,labels), stringsAsFactors = FALSE)
newdf <- df[order(as.numeric(df["values",]))]
newdf
#        X11 X10 X9 X8 X7 X6 X5 X4 X3 X2 X1
# values   1   2  3  4  5  6  7  8  9 10 11
# labels   K   J  I  H  G  F  E  D  C  B  A
Note, in particular, stringsAsFactors = FALSE when you create the data frame.
Remember, data.frames are lists, and each element of the list is a vector (possibly a list, but typically an atomic vector, especially if constructed in a standard way) of the same length. The individual elements of the data frame are columns. Rows are just the nested elements with the same index value. This makes it much easier to work with a data frame like this:
df <- data.frame(values = values, labels = labels)
df[order(df$values),]
#    values labels
# 11      1      K
# 10      2      J
# 9       3      I
# 8       4      H
# 7       5      G
# 6       6      F
# 5       7      E
# 4       8      D
# 3       9      C
# 2      10      B
# 1      11      A
Here you don't have to worry at all about whether your numbers are going to be coerced to characters and/or factors when you line them up with another vector that's character. In this example, whether or not labels was a factor had no impact on values.
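The coercion behind the unexpected ordering in the question can be seen in a small sketch. It uses a shortened version of the example vectors, and the names m, as_text, and as_number are made up here; method = "radix" is used only so that the text ordering does not depend on the session locale.

```r
values <- c(11, 10, 9, 2, 1)
labels <- c("A", "B", "C", "J", "K")

# A matrix holds a single atomic type, so binding numbers with strings
# turns the numbers into strings.
m <- rbind(values, labels)
print(is.character(m))

# Sorting the "values" row as text versus as numbers.
as_text   <- sort(m["values", ], method = "radix")
as_number <- sort(as.numeric(m["values", ]))
print(as_text)
print(as_number)
```

The text ordering puts "10" and "11" before "2", which is the pattern seen in the question's output.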

Randomly sample per group, make a new dataframe, repeat until all entities within a group are sampled

I want to take one random Site for every Region, create a new data frame, and repeat this process until all Sites are sampled. So, no two data frames will contain the same Site from the same Region.
A few Regions in my real data frame have more Sites (Region C has 4 Sites) than the other Regions. I want to remove those extra rows (perhaps I should do this before making multiple data frames).
Here is an example data frame (real one has >100 Regions and >10 Sites per Region):
mydf <- read.table(header = TRUE, text = 'V1 V2 Region Site
5 1 A X1
5 6 A X2
8 9 A X3
2 3 B X1
3 1 B X2
7 8 B X3
1 2 C X1
9 4 C X2
4 5 C X3
6 7 C X4')
Repeating the following code three times produces data frames that contain the same Sites for a given Region (the second and third tables both have Site X2 for Region A).
do.call(rbind, lapply(split(mydf, mydf$Region), function(x) x[sample(nrow(x), 1), ]))
  V1 V2 Region Site
A  8  9      A   X3
B  2  3      B   X1
C  6  7      C   X4
  V1 V2 Region Site
A  5  6      A   X2
B  7  8      B   X3
C  9  4      C   X2
  V1 V2 Region Site
A  5  6      A   X2
B  3  1      B   X2
C  6  7      C   X4
Could you please help me create multiple data frames so that every data frame contains all Regions, but each Region-Site combination appears in only one data frame?
EDIT: Here is the expected output. To produce it, in the first sampling, draw one Site (row) randomly from every Region and make a data frame. In the second sampling, repeat the same process, but a Site that was already drawn for a given Region cannot be drawn again. What I want is independent data frames that contain unique combinations of Region-Site.
V1 V2 Region Site
5  1  A      X1
7  8  B      X3
1  2  C      X1
V1 V2 Region Site
5  6  A      X2
3  1  B      X2
4  5  C      X3
V1 V2 Region Site
8  9  A      X3
2  3  B      X1
9  4  C      X2
The great data.table package actually makes this very easy:
# Turn mydf into a data.table (by reference) and refer to it as dt
library(data.table)
dt <- setDT(mydf)
# Shuffle the rows of the table
dt <- dt[sample(.N)]
# In case there are multiple rows for a given Region <-> Site pair,
# eliminate duplicates.
dt <- unique(dt, by = c('Region', 'Site'))
# Get the first sample from each region group
# Note: .SD refers to the sub-tables after grouping by Region
dt[, .SD[1], by=Region]
# Get the second and third sample from each region group
dt[, .SD[2], by=Region]
dt[, .SD[3], by=Region]
In fact, you could combine into a one-liner as Frank suggested
library(data.table)
dt <- setDT(mydf)
dt <- unique(dt, by = c('Region', 'Site'))
dt[sample(.N), .SD[1:3], by = Region]
It worked! I don't see a check mark for accepting the answer, so I am accepting it here.
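The same shuffle-then-index idea can also be sketched in base R, with the extra step of keeping only as many draws as the smallest Region allows (so Region C's surplus Site is dropped, as the question asks). This is an illustrative sketch, not the answerer's code; the draw column and the names shuffled, n_draws, and samples are assumptions made here.

```r
mydf <- read.table(header = TRUE, text = "V1 V2 Region Site
5 1 A X1
5 6 A X2
8 9 A X3
2 3 B X1
3 1 B X2
7 8 B X3
1 2 C X1
9 4 C X2
4 5 C X3
6 7 C X4")

set.seed(42)  # only to make this sketch reproducible

# Shuffle all rows once, then number the rows within each Region in shuffled order.
shuffled <- mydf[sample(nrow(mydf)), ]
shuffled$draw <- ave(seq_len(nrow(shuffled)), shuffled$Region, FUN = seq_along)

# Use as many draws as the smallest Region allows; surplus Sites are never used.
n_draws <- min(table(mydf$Region))

# One data frame per draw, each holding exactly one Site per Region.
samples <- lapply(seq_len(n_draws), function(k) {
  s <- shuffled[shuffled$draw == k, c("V1", "V2", "Region", "Site")]
  s[order(s$Region), ]
})
```

Because every row receives a distinct draw number within its Region, no Region-Site pair can appear in more than one element of samples.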

Summarise whether a value is contained in multiple other columns

I am investigating a large dataset with 100+ columns. One set of columns contain integers where the integers are not repeated across columns. For example, the number 6 may or may not appear in a row, but it will only appear once across the columns.
An example mock-up (bearing in mind that there are hundreds of other, non-related columns surrounding these):
> x1 <- c(1,6,4,5)
> x2 <- c(6,0,11,3)
> x3 <- c(5,0,9,6)
> df <- data.frame(cbind(x1, x2, x3))
> df
  x1 x2 x3
1  1  6  5
2  6  0  0
3  4 11  9
4  5  3  6
Ideally using dplyr (since I am trying to become more "fluent"), how would I most cleanly create a new column to indicate whether or not a 6 was contained in the other columns? I am hesitant to use a function like reshape2's melt given the 100s of other columns in the dataset.
My current, messy, solution:
> library(dplyr)
> df <- mutate(df, Contains6 = (x1 == 6) + (x2 == 6) + (x3 == 6),
+              Contains6 = plyr::revalue(as.factor(as.character(Contains6)),
+                                        c("0"="No","1"="Yes")))
> df
  x1 x2 x3 Contains6
1  1  6  5       Yes
2  6  0  0       Yes
3  4 11  9        No
4  5  3  6       Yes
Possible extension to this: would there be a clean, programmatic way of creating similar columns for all values contained in x1:x3, e.g. Contains1, Contains4, etc?
We can use apply with MARGIN=1
df$Contains6 <- c("no", "yes")[(apply(df==6, 1, any))+1L]
df$Contains6
#[1] "yes" "yes" "no" "yes"
If we need to create multiple "Contains" columns, we can loop with lapply
v1 <- c(1,4,6)
df[paste0("Contains", v1)] <- lapply(v1, function(i)
  c('no', 'yes')[(apply(df==i, 1, any))+1L])
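Since the question asks for dplyr and the real data has hundreds of unrelated columns, one option is to restrict the check to the relevant columns with if_any. This is a sketch under assumptions: it needs a dplyr version that provides if_any (1.0.4 or later, as far as I know), and the other column is a made-up stand-in for the unrelated columns.

```r
library(dplyr)

df <- data.frame(x1 = c(1, 6, 4, 5), x2 = c(6, 0, 11, 3), x3 = c(5, 0, 9, 6),
                 other = c("a", "b", "c", "d"))  # 'other' stands in for unrelated columns

# if_any() is TRUE for a row when the condition holds in at least one selected column.
res <- df %>%
  mutate(Contains6 = if_else(if_any(x1:x3, ~ .x == 6), "Yes", "No"))
print(res)
```

Only x1 through x3 are inspected, so values in the unrelated columns cannot trigger a false match.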
