anti-join not working - giving 0 rows, why?

I am trying to use anti_join exactly as I have done many times before, to establish which rows across two datasets do not have matches on two specific columns. For some reason I keep getting 0 rows in the result and I can't understand why.
Below are two dummy dfs containing the two columns I am trying to compare. You will see one is missing an entry (df1 has no SITE 2, PLOT 8 row), so when I use anti_join to compare the two dfs, this entry should be returned, but I am just getting a result of 0 rows.
a <- 1:3                              # site numbers
SITE <- rep(a, times = c(16, 15, 1))  # df1: site 2 has only 15 plots
PLOT <- c(1:16, 1:7, 9:16, 1)         # plot 8 is missing for site 2
df1 <- data.frame(SITE, PLOT)
SITE <- rep(a, times = c(16, 16, 1))  # df2: all 16 plots for sites 1 and 2
PLOT <- c(rep(1:16, 2), 1)
df2 <- data.frame(SITE, PLOT)
df1            df2
SITE PLOT      SITE PLOT
1    1         1    1
1    2         1    2
1    3         1    3
1    4         1    4
1    5         1    5
1    6         1    6
1    7         1    7
1    8         1    8
1    9         1    9
1    10        1    10
1    11        1    11
1    12        1    12
1    13        1    13
1    14        1    14
1    15        1    15
1    16        1    16
2    1         2    1
2    2         2    2
2    3         2    3
2    4         2    4
2    5         2    5
2    6         2    6
2    7         2    7
2    9         2    8
2    10        2    9
2    11        2    10
2    12        2    11
2    13        2    12
2    14        2    13
2    15        2    14
2    16        2    15
3    1         2    16
               3    1
library(dplyr)
a <- anti_join(df1, df2, by = c('SITE', 'PLOT'))
a
[1] SITE PLOT
<0 rows> (or 0-length row.names)
I'm sure the answer is obvious but I can't see it.

The answer can be found in the help file:
anti_join() returns all rows from x without a match in y.
Every row of df1 does have a match in df2; it is df2 that holds the unmatched row, so it needs to be the first (x) argument.
Reversing the inputs for df1 and df2 will give you what you expect.
anti_join(df2, df1, by=c('SITE', 'PLOT'))
# SITE PLOT
# 1 2 8
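A handy sanity check in situations like this is to run the anti-join in both directions, since each call only reports rows that are missing from its second argument. A minimal sketch, assuming dplyr is loaded:
anti_join(df1, df2, by = c('SITE', 'PLOT'))  # rows only in df1: none here
anti_join(df2, df1, by = c('SITE', 'PLOT'))  # rows only in df2: SITE 2, PLOT 8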

Related

Create dataframe with repeating string from scratch in R

I would like to create a dataframe with a Period column and an ID column,
repeating the period from 1 to 10 for each of 42,574 IDs,
so that I would end up with a 425,740 row dataframe.
I tried to create a dataframe using the following code
periodstring <- as.numeric(gl(10, 42574))
periods <- as.data.frame(periodstring)
but that produces the periods in sorted blocks (all the 1s, then all the 2s, and so on), and other approaches did not quite work. Is there a simple way to do this?
Thanks in advance.
Another option using rep:
data.frame(Period = rep(1:10, times = 42574),
           ID = rep(1:42574, each = 10))
Output sample:
Period ID
1 1 1
2 2 1
3 3 1
4 4 1
5 5 1
6 6 1
7 7 1
8 8 1
9 9 1
10 10 1
11 1 2
12 2 2
13 3 2
14 4 2
15 5 2
16 6 2
17 7 2
18 8 2
19 9 2
20 10 2
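As an aside, the gl() attempt in the question can be repaired rather than replaced: with k = 1 and an explicit length, gl() cycles through the levels instead of laying them out in sorted blocks. A small sketch of that fix, reusing the question's variable names:
# k = 1 makes the 10 levels cycle 1, 2, ..., 10, 1, 2, ... up to length
periodstring <- as.numeric(gl(10, 1, length = 10 * 42574))
periods <- as.data.frame(periodstring)
head(periods$periodstring, 12)
# [1] 1 2 3 4 5 6 7 8 9 10 1 2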

R: Return values in a column when the value in another column becomes negative for the first time

For each ID, I want to return the value in the 'distance' column where the value becomes negative for the first time. If the value does not become negative at all, return the value 99 (or some other random number) for that ID. A sample data frame is given below.
df <- data.frame(ID = rep(1:5, each = 4),
                 distance = rep(1:4, 5),
                 value = c(1, 4, 3, -1, 2, 1, -4, 1, 3, 2,
                           -1, 1, -4, 3, 2, 1, 2, 3, 4, 5))
> df
ID distance value
1 1 1 1
2 1 2 4
3 1 3 3
4 1 4 -1
5 2 1 2
6 2 2 1
7 2 3 -4
8 2 4 1
9 3 1 3
10 3 2 2
11 3 3 -1
12 3 4 1
13 4 1 -4
14 4 2 3
15 4 3 2
16 4 4 1
17 5 1 2
18 5 2 3
19 5 3 4
20 5 4 5
The desired output is as follows
> df2
ID first_negative_distance
1 1 4
2 2 3
3 3 3
4 4 1
5 5 99
I tried but couldn't figure out how to do it through dplyr. Any help would be much appreciated. The actual data I'm working on has thousands of ID's with 30 different distance levels for each. Bear in mind that for any ID, there could be multiple instances of negative values. I just need the first one.
Edit:
Tried the solution proposed by AntonoisK:
> df %>% group_by(ID) %>% summarise(first_neg_dist = first(distance[value < 0]))
  first_neg_dist
1              4
This is the result I am getting, with no ID column and only one row. It does not match what AntonoisK got. Not sure why.
library(dplyr)
df %>%
  group_by(ID) %>%
  summarise(first_neg_dist = first(distance[value < 0]))
# # A tibble: 5 x 2
# ID first_neg_dist
# <dbl> <int>
# 1 1 4
# 2 2 3
# 3 3 3
# 4 4 1
# 5 5 NA
If you really prefer 99 instead of NA you can use
summarise(first_neg_dist = coalesce(first(distance[value < 0]), 99L))
instead.
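On the edit in the question: a grouped summarise() that collapses everything to a single row is the classic symptom of another attached package (most often plyr, when it is loaded after dplyr) masking dplyr's summarise. If that is what happened, calling the verbs with an explicit namespace sidesteps the masking; a sketch, assuming this was indeed the cause:
# explicit dplyr:: prefixes guard against plyr::summarise taking over
df %>%
  dplyr::group_by(ID) %>%
  dplyr::summarise(first_neg_dist = dplyr::first(distance[value < 0]))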

Assigning test / control group vector using split-apply-combine strategy [duplicate]

This question already has answers here:
Stratified random sampling from data frame
(6 answers)
Closed 6 years ago.
This should be simple, but it's got me pulling my hair out!
Here is some data:
Clicks <- c(1,2,3,4,5,6,5,4,3,2)
Cost <- c(10,11,12,13,14,15,14,13,12,11)
Cluster <- c(1,1,1,2,2,1,1,1,1,1)
df <- data.frame(Clicks,Cost,Cluster)
I want to filter my df by cluster, create a new column that assigns "Test" and "Control" groups at random, then recombine with the original data frame.
Step 1: Filter (by cluster 1)
Clicks Cost Cluster
1 1 10 1
2 2 11 1
3 3 12 1
4 6 15 1
5 5 14 1
6 4 13 1
7 3 12 1
8 2 11 1
Step 2: Assign test and control group at random
Clicks Cost Cluster group
1 1 10 1 Test
2 2 11 1 Control
3 3 12 1 Control
4 6 15 1 Test
5 5 14 1 Control
6 4 13 1 Control
7 3 12 1 Test
8 2 11 1 Control
Step 3: Get back to the original data frame
Clicks Cost Cluster group
1 1 10 1 Test
2 2 11 1 Control
3 3 12 1 Control
4 4 13 2 NULL
5 5 14 2 NULL
6 6 15 1 Test
7 5 14 1 Control
8 4 13 1 Control
9 3 12 1 Test
10 2 11 1 Control
Step 4: do the same for cluster 2
Thanks :)
How about
df$Group <- 'NULL'  # a placeholder string, not the NULL object
df1 <- df
df1[df1$Cluster == 1, ]$Group <- ifelse(runif(sum(df1$Cluster == 1)) > 0.5, 'Control', 'Test')
df1
Clicks Cost Cluster Group
1 1 10 1 Test
2 2 11 1 Test
3 3 12 1 Test
4 4 13 2 NULL
5 5 14 2 NULL
6 6 15 1 Control
7 5 14 1 Test
8 4 13 1 Test
9 3 12 1 Control
10 2 11 1 Control
df2 <- df
df2[df2$Cluster == 2, ]$Group <- ifelse(runif(sum(df2$Cluster == 2)) > 0.5, 'Control', 'Test')
df2
Clicks Cost Cluster Group
1 1 10 1 NULL
2 2 11 1 NULL
3 3 12 1 NULL
4 4 13 2 Test
5 5 14 2 Control
6 6 15 1 NULL
7 5 14 1 NULL
8 4 13 1 NULL
9 3 12 1 NULL
10 2 11 1 NULL
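If the end goal is simply steps 1 to 4 combined, i.e. every cluster labelled, a grouped mutate can randomise all clusters in one pass rather than cluster by cluster. A sketch assuming dplyr and the same 50/50 assignment as above:
library(dplyr)
set.seed(42)  # only to make the random assignment reproducible
df %>%
  group_by(Cluster) %>%
  mutate(Group = sample(c('Test', 'Control'), n(), replace = TRUE)) %>%
  ungroup()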

Combinations of Variables in R

I'm trying to create a fake data frame to examine the effects from a multinomial logit model in R. I have code that does precisely what I want to do, which is to create a row representing every combination of levels of different variables.
var1 <- seq(1, 10, 1)
var2 <- seq(1, 20, 5)
FakeData <- as.data.frame(matrix(NA, nrow = length(var1) * length(var2),
                                 ncol = 2))
row <- 1
for (i in 1:length(var1)) {
  for (j in 1:length(var2)) {
    FakeData[row, 1] <- var1[i]
    FakeData[row, 2] <- var2[j]
    row <- row + 1
  }
}
> head(FakeData)
V1 V2
1 1 1
2 1 6
3 1 11
4 1 16
5 2 1
6 2 6
My problem is that this code is very inefficient when applied to my problem with four variables of around ten levels each. Any tips on functions that might make it quicker?
You may be looking for expand.grid? Note that it varies the first variable fastest, so the row order differs from your nested loop, but the same combinations are produced.
R> expand.grid(var1, var2)
Var1 Var2
1 1 1
2 2 1
3 3 1
4 4 1
5 5 1
6 6 1
7 7 1
8 8 1
9 9 1
10 10 1
11 1 6
12 2 6
13 3 6
14 4 6
15 5 6
16 6 6
17 7 6
18 8 6
19 9 6
20 10 6
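For the real problem with four variables, expand.grid accepts any number of vectors; var3 and var4 below are hypothetical stand-ins for the remaining variables:
var3 <- 1:10  # hypothetical third variable
var4 <- 1:10  # hypothetical fourth variable
FakeData <- expand.grid(Var1 = var1, Var2 = var2, Var3 = var3, Var4 = var4)
nrow(FakeData)  # 10 * 4 * 10 * 10 = 4000 rows, one per combination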

Summing two dataframes based on common value

I have a dataframe that looks like
day.of.week count
1 0 3
2 3 1
3 4 1
4 5 1
5 6 3
and another like
day.of.week count
1 0 17
2 1 6
3 2 1
4 3 1
5 4 5
6 5 1
7 6 13
I want to add the values from df1 to df2 based on day.of.week. I was trying to use ddply
total <- ddply(merge(total, subtotal, all.x = TRUE, all.y = TRUE),
               .(day.of.week), summarize, count = sum(count))
which almost works, but merge collapses rows that are identical in both data frames. For instance, in the example above day.of.week = 5 appears with count 1 in both inputs; rather than keeping two records each with count one, merge keeps a single record with count one, so instead of a total count of two I get a total count of one.
day.of.week count
1 0 3
2 0 17
3 1 6
4 2 1
5 3 1
6 4 1
7 4 5
8 5 1
9 6 3
10 6 13
There is no need to merge. You can simply do
ddply(rbind(d1, d2), .(day.of.week), summarize, sum_count = sum(count))
I have assumed that both data frames have identical column names, day.of.week and count.
In addition to the suggestion Ben gave you about using merge, you could also do this simply using subsetting:
d1 <- read.table(textConnection(" day.of.week count
1 0 3
2 3 1
3 4 1
4 5 1
5 6 3"),sep="",header = TRUE)
d2 <- read.table(textConnection(" day.of.week count1
1 0 17
2 1 6
3 2 1
4 3 1
5 4 5
6 5 1
7 6 13"),sep = "",header = TRUE)
idx <- match(d1[, 1], d2[, 1])      # row in d2 for each day present in d1
d2[idx, 2] <- d2[idx, 2] + d1[, 2]  # add d1's counts to d2's in place
> d2
day.of.week count1
1 0 20
2 1 6
3 2 1
4 3 2
5 4 6
6 5 2
7 6 16
This assumes no repeated day.of.week rows, since match will return only the first match.
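If repeated day.of.week values are possible, the rbind-and-summarise idea from the first answer is the safer route. A base R sketch of the same thing, assuming both frames share the column names day.of.week and count:
# aggregate() sums counts within each day.of.week, so duplicate keys
# on either side are handled correctly
combined <- rbind(d1, d2)
aggregate(count ~ day.of.week, data = combined, FUN = sum)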
