Removing rows with respect to specific columns [duplicate]

This question already has answers here:
Remove all duplicates except last instance
(4 answers)
Closed 6 years ago.
I want to remove all rows where columns a and b have the same values, keeping the row where column c contains the latest date. My idea was to sort the dataframe with respect to column c and then remove all duplicates with respect to a and b. It is my understanding that the function "duplicated" processes rows in a specific order.
For example:
> a <- c(rep("A", 3), rep("B", 3), rep("C", 2))
> b <- c(1,1,2,4,1,1,2,2)
> c <- c("2016-10-01", "2016-10-02", "2016-10-03", "2016-10-04", "2016-10-04", "2016-10-05", "2016-10-06", "2016-10-07")
> df <-data.frame(a,b,c)
> df
a b c
1 A 1 2016-10-01
2 A 1 2016-10-02
3 A 2 2016-10-03
4 B 4 2016-10-04
5 B 1 2016-10-04
6 B 1 2016-10-05
7 C 2 2016-10-06
8 C 2 2016-10-07
I want to get the following dataframe as a result:
a b c
1 A 1 2016-10-02
2 A 2 2016-10-03
3 B 4 2016-10-04
4 B 1 2016-10-05
5 C 2 2016-10-07

Yes, duplicated processes rows in a specific order, from the top by default. To keep the last occurrence in each group instead, use fromLast=TRUE.
> df[!duplicated( df[,1:2], fromLast=TRUE ), ]
a b c
2 A 1 2016-10-02
3 A 2 2016-10-03
4 B 4 2016-10-04
6 B 1 2016-10-05
8 C 2 2016-10-07
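Note that this works here because, within each a/b group, the rows already appear in date order. If that is not guaranteed, sorting by column c first, as you proposed, makes it robust. A minimal sketch, assuming c is stored as "YYYY-MM-DD" text so that alphabetical order equals chronological order:
df_sorted <- df[order(df$a, df$b, df$c), ]  # latest date comes last within each a/b group
df_sorted[!duplicated(df_sorted[, c("a", "b")], fromLast = TRUE), ]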


Compare two columns of two different data frames with different length of rows return a third row

I have two different data frames which share the same columns: "O" for place and "date" for time.
Df 1 gives different information for a certain place (O) and time (date) in one row, while df 2 has many rows of information for the same year and place. Now I want to take a value from the first df and apply it to all rows of the second df where the values for "O" and "date" are equal.
To make it more clear:
I have one line in df 1: krnqm=250 for O=1002 and date=1885. Now I want a new column "krnqm" in df 2 where df2$krnqm = 250 for all rows where df2$O == 1002 and df2$date == 1885.
Unfortunately I have no idea how to put that condition into a line of code and would be grateful for your help.
You can do this quite easily in base R using the merge function. Here's an example.
Simulate some data from your description:
df1 <- expand.grid(O = letters[c(2:4,7)], date = c(1,3))
df2 <- data.frame(O = rep(letters[1:6], c(2,3,3,6,2,2)), date = rep(1:3, c(3,2,4)))
df1$krnqm <- sample(1:1000, size = nrow(df1), replace=T)
> df1
O date krnqm
1 b 1 833
2 c 1 219
3 d 1 773
4 g 1 514
5 b 3 118
6 c 3 969
7 d 3 704
8 g 3 914
> df2
O date
1 a 1
2 a 1
3 b 1
4 b 2
5 b 2
6 c 3
7 c 3
8 c 3
9 d 3
10 d 1
11 d 1
12 d 1
13 d 2
14 d 2
15 e 3
16 e 3
17 f 3
18 f 3
Now let's combine the two data frames in the manner you describe.
df2 <- merge(df2, df1, all.x=T)
> df2
O date krnqm
1 a 1 NA
2 a 1 NA
3 b 1 833
4 b 2 NA
5 b 2 NA
6 c 3 969
7 c 3 969
8 c 3 969
9 d 1 773
10 d 1 773
11 d 1 773
12 d 2 NA
13 d 2 NA
14 d 3 704
15 e 3 NA
16 e 3 NA
17 f 3 NA
18 f 3 NA
As you can see, the krnqm column in the resulting data frame contains NA for any combination of 'O' and 'date' that does not appear in df1, the data frame the krnqm values were taken from. If your df1 has other columns that you do not want included in the merge, just change the merge call slightly to use only the columns you want: df2 <- merge(df2, df1[,c("O", "date", "krnqm")], all.x=T).
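If you prefer dplyr, a left join is equivalent; a minimal sketch of the same operation (note that unlike merge, left_join keeps the original row order of df2):
library(dplyr)
df2 <- left_join(df2, df1, by = c("O", "date"))  # unmatched O/date combinations get NA in krnqm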
Good luck!

Dropping Columns by name in R [duplicate]

This question already has answers here:
Drop data frame columns by name
(25 answers)
Closed 5 years ago.
So I have a data-frame structured as:
> head(peakQ)
STATION_NUMBER DATA_TYPE YEAR PEAK_CODE PRECISION_CODE MONTH DAY HOUR MINUTE TIME_ZONE PEAK SYMBOL
1 05EE006 Q 1983 H NA 6 29 5 18 MST 1.980
2 05EE006 Q 1985 H NA 4 2 0 0 MST 1.380 B
3 05EE006 Q 1986 H NA 3 30 13 37 MST 2.640
4 05EE006 Q 1987 H NA 4 5 21 2 MST 1.590 B
5 05EE006 Q 1989 H NA 10 22 2 45 MST 0.473
6 05EE006 Q 1990 H NA 4 2 4 2 MST 1.470
I want to drop the columns; STATION_NUMBER, DATA_TYPE, PEAK_CODE, PRECISION_CODE
But, I want to assume that I know only the column names and not their index.
I already know that it is trivial to use indexes, such as:
> head(peakQ[, -c(1, 2, 4, 5)])
YEAR MONTH DAY HOUR MINUTE TIME_ZONE PEAK SYMBOL
1 1983 6 29 5 18 MST 1.980
2 1985 4 2 0 0 MST 1.380 B
3 1986 3 30 13 37 MST 2.640
4 1987 4 5 21 2 MST 1.590 B
5 1989 10 22 2 45 MST 0.473
6 1990 4 2 4 2 MST 1.470
But why do I get an error when using column names, and what is the workaround?
> head(peakQ[, -c("STATION_NUMBER", "DATA_TYPE", "PEAK_CODE", "PRECISION_CODE")])
Error in -c("STATION_NUMBER", "DATA_TYPE", "PEAK_CODE", "PRECISION_CODE") :
invalid argument to unary operator
I am especially confused because the opposite operation works just fine.
> head(peakQ[, c("STATION_NUMBER", "DATA_TYPE", "PEAK_CODE", "PRECISION_CODE")])
STATION_NUMBER DATA_TYPE PEAK_CODE PRECISION_CODE
1 05EE006 Q H NA
2 05EE006 Q H NA
3 05EE006 Q H NA
4 05EE006 Q H NA
5 05EE006 Q H NA
6 05EE006 Q H NA
Any help and/or a deeper explanation is appreciated.
There is no minus operator on character vectors; however, subset tries to simulate this using a vector of unevaluated names. Ditto for dplyr select. We could also use setdiff which avoids the need for a minus operator.
1) subset Try subset with the select= argument:
subset(peakQ, select = -c(STATION_NUMBER, DATA_TYPE, PEAK_CODE, PRECISION_CODE))
2) setdiff Another possibility is:
peakQ[setdiff(names(peakQ), c("STATION_NUMBER", "DATA_TYPE", "PEAK_CODE", "PRECISION_CODE"))]
3) dplyr The dplyr package's select could also be used:
library(dplyr)
peakQ %>%
  select(-c(STATION_NUMBER, DATA_TYPE, PEAK_CODE, PRECISION_CODE))
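4) match Since unary minus works on integer positions but not on names, another workaround is to convert the names to positions first (a sketch; drop_cols is just a helper variable):
drop_cols <- c("STATION_NUMBER", "DATA_TYPE", "PEAK_CODE", "PRECISION_CODE")
peakQ[, -match(drop_cols, names(peakQ))]  # match() returns integer indices, so unary minus works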
It seems that the "exclude" operator (unary minus) only works with indices and not with column names. A remedy is to build a logical selection of the column names with the %in% and ! operators:
> cols <- letters[1:5]
> cols
[1] "a" "b" "c" "d" "e"
> df1 <- as.data.frame(do.call(cbind, rep(list(1:5), 5)))
> names(df1) <- cols
> df1
a b c d e
1 1 1 1 1 1
2 2 2 2 2 2
3 3 3 3 3 3
4 4 4 4 4 4
5 5 5 5 5 5
> df1[,-c("a","b")]
Error in -c("a", "b") : invalid argument to unary operator
> df1[,!names(df1) %in% c("a","b")]
c d e
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 4
5 5 5 5
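Applied to your data, that becomes:
peakQ[, !names(peakQ) %in% c("STATION_NUMBER", "DATA_TYPE", "PEAK_CODE", "PRECISION_CODE")]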

Find co-occurrence of values in large data set

I have a large data set with month, customer ID and store ID. There is one record per customer, per location, per month summarizing their activity at that location.
Month Customer ID Store
Jan 1 A
Jan 4 A
Jan 2 A
Jan 3 A
Feb 7 B
Feb 2 B
Feb 1 B
Feb 12 B
Mar 1 C
Mar 11 C
Mar 3 C
Mar 12 C
I'm interested in creating a matrix that shows the number of customers that each location shares with another. Like this:
A B C
A 4 2 2
B 2 4 2
C 2 2 4
For example, customer 1 visited Store A in January and Store B in February, so they would be added to the A/B tally. I'm interested in the number of shared customers, not the number of visits.
I tried the sparse matrix approach in this thread (Creating co-occurrence matrix), but the numbers returned don't match up for a reason I cannot understand.
Any ideas would be greatly appreciated!
Update:
The original solution that I posted worked for your data. But your data has the unusual property that no customer ever visited the same store in two different months. If that could happen in the full data set, a modification is needed.
What we need is a matrix of stores by customers that has 1 if the customer ever
visited the store and zero otherwise. The original solution used
M = as.matrix(table(Dat$ID_Store, Dat$Customer))
which gives how many different months the store was visited by each customer. With
different data, these numbers might be more than one. We can fix that by using
M = as.matrix(table(Dat$ID_Store, Dat$Customer) > 0)
If you look at this matrix, it will say TRUE and FALSE, but since TRUE=1 and FALSE=0
that will work just fine. So the full corrected solution is:
M = as.matrix(table(Dat$ID_Store, Dat$Customer) > 0)
M %*% t(M)
A B C
A 4 2 2
B 2 4 2
C 2 2 4
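For reference, here is a minimal setup that reproduces the matrix above from the sample data (the names Dat, Customer, and ID_Store are just assumptions matching the code):
Dat <- data.frame(
  Month    = rep(c("Jan", "Feb", "Mar"), each = 4),
  Customer = c(1, 4, 2, 3, 7, 2, 1, 12, 1, 11, 3, 12),
  ID_Store = rep(c("A", "B", "C"), each = 4)
)
M = as.matrix(table(Dat$ID_Store, Dat$Customer) > 0)
M %*% t(M)  # diagonal: customers per store; off-diagonal: shared customers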
We can try this too:
library(reshape2)
df <- dcast(df,CustomerID~Store, length, value.var='Store')
# CustomerID A B C
#1 1 1 1 1
#2 2 1 1 0 # Customer 2 went to stores A,B but not to C
#3 3 1 0 1
#4 4 1 0 0
#5 7 0 1 0
#6 11 0 0 1
#7 12 0 1 1
crossprod(as.matrix(df[-1]))
# A B C
#A 4 2 2
#B 2 4 2
#C 2 2 4
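Here crossprod(X) is just t(X) %*% X, so each off-diagonal entry counts the customers whose rows have a 1 in both store columns. One caveat: dcast with length counts visits, so if a customer could visit the same store more than once, cap the matrix at 1 first (e.g. crossprod(pmin(as.matrix(df[-1]), 1))) to count shared customers rather than visits.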
With the arules library:
library(arules)
write(' Jan 1 A
Jan 4 A
Jan 2 A
Jan 3 A
Feb 7 B
Feb 2 B
Feb 1 B
Feb 12 B
Mar 1 C
Mar 11 C
Mar 3 C
Mar 12 C', 'basket_single')
tr <- read.transactions("basket_single", format = "single", cols = c(2,3))
inspect(tr)
# items transactionID
#[1] {A,B,C} 1
#[2] {C} 11
#[3] {B,C} 12
#[4] {A,B} 2
#[5] {A,C} 3
#[6] {A} 4
#[7] {B} 7
image(tr)
crossTable(tr, sort=TRUE)
# A B C
#A 4 2 2
#B 2 4 2
#C 2 2 4

R: Aggregate from two data frames on conditions

I have a data frame called "e" that contains posts from a platform, with unique entry_id and member_id:
row. member_id entry_id timestamp
1 1 a 2008-06-09 12:41:00
2 1 b 2008-07-14 18:41:00
3 1 c 2010-07-17 15:40:00
4 2 d 2008-06-09 12:41:00
5 2 e 2008-09-18 10:22:00
6 3 f 2008-10-03 13:36:00
I have another data frame called "c", that contains comments:
row. member_id comment_id timestamp
1 1 I 2007-06-09 12:41:00
2 1 II 2007-07-14 18:41:00
3 1 III 2009-07-17 15:40:00
4 2 IV 2007-06-09 12:41:00
5 2 V 2009-09-18 10:22:00
6 3 VI 2010-10-03 13:36:00
I want to count all the comments a member wrote before posting each entry, so data frame "e" should end up looking like this. (Only the years matter when reading the example, but the solution should cover minutes too.)
row. member_id entry_id prev_comment_count timestamp
1 1 a 2 2008-06-09 12:41:00
2 1 b 2 2008-07-14 18:41:00
3 1 c 3 2010-07-17 15:40:00
4 2 d 1 2008-06-09 12:41:00
5 2 e 1 2008-09-18 10:22:00
6 3 f 0 2008-10-03 13:36:00
I already tried the following function:
functionPrevComments <- function(givE)
  nrow(subset(c, (as.character(givE["member_id"]) == c["member_id"]) &
                 (c["timestamp"] <= givE["timestamp"])))
But when I try to sapply it, I get the error
"Incompatible methods ("Ops.data.frame", "Ops.factor") for "<=""
I used the "$" operator for referencing the columns I need before, but then I got
"$ operator is invalid for atomic vectors "
How do I apply my function correctly, or is there another, better solution to my problem?
Best Regards,
Nikolas
Here's a slightly different option. Make sure you have both "timestamp" columns converted to POSIXct-class before running the code.
e$prev_comment_count <- sapply(seq_len(nrow(e)), function(i) {
  nrow(c[c$member_id == e$member_id[i] & c$timestamp < e$timestamp[i], ])
})
e
# row. member_id entry_id timestamp prev_comment_count
#1 1 1 a 2008-06-09 12:41:00 2
#2 2 1 b 2008-07-14 18:41:00 2
#3 3 1 c 2010-07-17 15:40:00 3
#4 4 2 d 2008-06-09 12:41:00 1
#5 5 2 e 2008-09-18 10:22:00 1
#6 6 3 f 2008-10-03 13:36:00 0
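Another option is to stack entries and comments into one data frame, sort by member and time, and take a cumulative count of comments per member: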
e$type <- "entry"
c$type <- "comment"
names(e) <- c("row", "member_id", "action_id", "timestamp", "type")
names(c) <- c("row", "member_id", "action_id", "timestamp", "type")
DF <- rbind(e,c)
DF$timestamp <- as.POSIXct(DF$timestamp,
                           format = "%Y-%m-%d %H:%M:%S", tz = "GMT")
DF <- DF[order(DF$member_id, DF$timestamp), ]
DF$count <- as.integer(ave(DF$type,
                           DF$member_id,
                           FUN = function(x) cumsum(x == "comment")))
DF[DF$type == "entry",]
# row member_id action_id timestamp type count
#1 1 1 a 2008-06-09 12:41:00 entry 2
#2 2 1 b 2008-07-14 18:41:00 entry 2
#3 3 1 c 2010-07-17 15:40:00 entry 3
#4 4 2 d 2008-06-09 12:41:00 entry 1
#5 5 2 e 2008-09-18 10:22:00 entry 1
#6 6 3 f 2008-10-03 13:36:00 entry 0
If this is not fast enough, it can be improved with data.table or dplyr.
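For instance, a data.table version of the stacking approach above might look like this (a sketch, assuming e and c as given with POSIXct timestamps; not benchmarked):
library(data.table)
# stack both tables with a type flag, keeping only the needed columns
DT <- rbind(
  data.table(member_id = e$member_id, timestamp = e$timestamp, type = "entry"),
  data.table(member_id = c$member_id, timestamp = c$timestamp, type = "comment")
)
setorder(DT, member_id, timestamp)  # stable sort: on timestamp ties, entries stay before comments, so only strictly earlier comments count
DT[, count := cumsum(type == "comment"), by = member_id]  # running comment count per member
DT[type == "entry"]  # entry rows now carry the count of prior comments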

Append sequence number to data frame based on grouping field and date field

I am attempting to append a sequence number to a data frame grouped by individuals and date. For example, to turn this:
x y
1 A 2012-01-02
2 A 2012-02-03
3 A 2012-02-25
4 A 2012-03-04
5 B 2012-01-02
6 B 2012-02-03
7 C 2013-01-02
8 C 2012-02-03
9 C 2012-03-04
10 C 2012-04-05
in to this:
x y v
1 A 2012-01-02 1
2 A 2012-02-03 2
3 A 2012-02-25 3
4 A 2012-03-04 4
5 B 2012-01-02 1
6 B 2012-02-03 2
7 C 2013-01-02 1
8 C 2012-02-03 2
9 C 2012-03-04 3
10 C 2012-04-05 4
where "x" is the individual, "y" is the date, and "v" is the appended sequence number
I have had success on a small data frame using a for loop in this code:
x=c("A","A","A","A","B","B","C","C","C","C")
y=as.Date(c("1/2/2012","2/3/2012","2/25/2012","3/4/2012","1/2/2012","2/3/2012",
"1/2/2013","2/3/2012","3/4/2012","4/5/2012"),"%m/%d/%Y")
x
y
z=data.frame(x,y)
z$v=rep(1,nrow(z))
for (i in 2:nrow(z)) {
  if (z$x[i] == z$x[i-1]) {
    z$v[i] <- z$v[i-1] + 1
  } else {
    z$v[i] <- 1
  }
}
but when I expand this to a much larger data frame (250K+ rows) the process takes forever.
Any thoughts on how I can make this more efficient?
This seems to work, using d for the data frame z from the question. It may be overkill though.
## code needed revision - this is old code
## > d$v <- unlist(sapply(sapply(split(d, d$x), nrow), seq))
EDIT
I can't believe I got away with that ugly mess for so long. Here's a revision. Much simpler.
## revised 04/24/2014
> d$v <- unlist(sapply(table(d$x), seq))
> d
## x y v
## 1 A 2012-01-02 1
## 2 A 2012-02-03 2
## 3 A 2012-02-25 3
## 4 A 2012-03-04 4
## 5 B 2012-01-02 1
## 6 B 2012-02-03 2
## 7 C 2013-01-02 1
## 8 C 2012-02-03 2
## 9 C 2012-03-04 3
## 10 C 2012-04-05 4
Also, an interesting one is stack. Take a look.
> stack(sapply(table(d$x), seq))
## values ind
## 1 1 A
## 2 2 A
## 3 3 A
## 4 4 A
## 5 1 B
## 6 2 B
## 7 1 C
## 8 2 C
## 9 3 C
## 10 4 C
I'm removing my previous post and replacing it with this solution. Extremely efficient for my purposes.
library(data.table)
# order data by individual and date
z <- z[order(z$x, z$y), ]
# convert to data.table
dt.z <- data.table(z)
# obtain vector of sequence numbers within each individual
z$seq <- dt.z[, 1:.N, by = "x"]$V1
The above can be accomplished in fewer steps but I wanted to illustrate what I did. This is appending sequence numbers to my data sets of over 250k records in under a second. Thanks again to Henrik and Richard.
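For completeness, a base R one-liner in the same spirit (a sketch; it numbers rows within each group in their current order, so applied to the unsorted z it reproduces the for-loop output exactly):
z$v <- ave(seq_along(z$x), z$x, FUN = seq_along)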
