I have a data frame called "e" that contains posts from a platform, with a unique entry_id and a member_id:
row. member_id entry_id timestamp
1 1 a 2008-06-09 12:41:00
2 1 b 2008-07-14 18:41:00
3 1 c 2010-07-17 15:40:00
4 2 d 2008-06-09 12:41:00
5 2 e 2008-09-18 10:22:00
6 3 f 2008-10-03 13:36:00
I have another data frame called "c", that contains comments:
row. member_id comment_id timestamp
1 1 I 2007-06-09 12:41:00
2 1 II 2007-07-14 18:41:00
3 1 III 2009-07-17 15:40:00
4 2 IV 2007-06-09 12:41:00
5 2 V 2009-09-18 10:22:00
6 3 VI 2010-10-03 13:36:00
I want to count all the comments a member wrote before he posted an entry. So the data frame "e" should look like the following. Only mind the years when reading the example; the solution, however, should cover minutes too:
row. member_id entry_id prev_comment_count timestamp
1 1 a 2 2008-06-09 12:41:00
2 1 b 2 2008-07-14 18:41:00
3 1 c 3 2010-07-17 15:40:00
4 2 d 1 2008-06-09 12:41:00
5 2 e 1 2008-09-18 10:22:00
6 3 f 0 2008-10-03 13:36:00
I already tried with the following function:
functionPrevComments <- function(givE) {
  nrow(subset(c, (as.character(givE["member_id"]) == c["member_id"]) &
                 (c["timestamp"] <= givE["timestamp"])))
}
But when I try to sapply it, I get the error:
Incompatible methods ("Ops.data.frame", "Ops.factor") for "<="
I used the "$" operator for referencing the columns I need before, but then I got:
$ operator is invalid for atomic vectors
How do I apply my function correctly, or is there another, better solution to my problem?
Best Regards,
Nikolas
The error comes from your comparison mixing a one-column data frame (c["timestamp"] returns a data frame, not a vector) with a factor, which is what the "Ops.data.frame"/"Ops.factor" message is complaining about. Here's a slightly different option. Make sure you have both "timestamp" columns converted to POSIXct class before running the code.
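For example, the conversion might look like this (a quick sketch; the format string matches the timestamps shown above, and the time zone is an assumption):
e$timestamp <- as.POSIXct(e$timestamp, format = "%Y-%m-%d %H:%M:%S", tz = "GMT")
c$timestamp <- as.POSIXct(c$timestamp, format = "%Y-%m-%d %H:%M:%S", tz = "GMT")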
e$prev_comment_count <- sapply(seq_len(nrow(e)), function(i) {
  nrow(c[c$member_id == e$member_id[i] & c$timestamp < e$timestamp[i], ])
})
e
# row. member_id entry_id timestamp prev_comment_count
#1 1 1 a 2008-06-09 12:41:00 2
#2 2 1 b 2008-07-14 18:41:00 2
#3 3 1 c 2010-07-17 15:40:00 3
#4 4 2 d 2008-06-09 12:41:00 1
#5 5 2 e 2008-09-18 10:22:00 1
#6 6 3 f 2008-10-03 13:36:00 0
e$type <- "entry"
c$type <- "comment"
names(e) <- c("row", "member_id", "action_id", "timestamp", "type")
names(c) <- c("row", "member_id", "action_id", "timestamp", "type")
DF <- rbind(e,c)
DF$timestamp <- as.POSIXct(DF$timestamp,
                           format = "%Y-%m-%d %H:%M:%S", tz = "GMT")
DF <- DF[order(DF$member_id, DF$timestamp),]
# ave() operates on DF$type, a character vector, so cumsum's numbers come
# back as character and need as.integer(); cumsum counts comments seen so far
DF$count <- as.integer(ave(DF$type,
                           DF$member_id,
                           FUN = function(x) cumsum(x == "comment")))
DF[DF$type == "entry",]
# row member_id action_id timestamp type count
#1 1 1 a 2008-06-09 12:41:00 entry 2
#2 2 1 b 2008-07-14 18:41:00 entry 2
#3 3 1 c 2010-07-17 15:40:00 entry 3
#4 4 2 d 2008-06-09 12:41:00 entry 1
#5 5 2 e 2008-09-18 10:22:00 entry 1
#6 6 3 f 2008-10-03 13:36:00 entry 0
If this is not fast enough, it can be improved with data.table or dplyr.
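For example, the same cumulative count could be expressed with dplyr roughly like this (a sketch assuming the combined DF from above, with timestamps already converted):
library(dplyr)
DF %>%
  arrange(member_id, timestamp) %>%              # same ordering as above
  group_by(member_id) %>%
  mutate(count = cumsum(type == "comment")) %>%  # running comment count per member
  ungroup() %>%
  filter(type == "entry")                        # keep only the entry rows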
I have a huge table where each line holds information for 2 professionals, like this:
df1 <- data.frame("Date" = c(1,2,3,4), "prof1" = c(25,59,10,5), "prof2" = c(5,7,8,25))
# Date prof1 prof2
#1 1 25 5
#2 2 59 7
#3 3 10 8
#4 4 5 25
... ... ...
I want to delete line 4 because it's the same as line 1, just with the values swapped.
So I created a copy of that table with the values of the columns prof1 and prof2 switched, like this:
df2 <- data.frame("Date" = c(1,2,3,4), "prof2" = c(5,7,8,25), "prof1" = c(25,59,10,5))
# Date prof2 prof1
#1 1 5 25
#2 2 7 59
#3 3 8 10
#4 4 25 5
... ... ...
And executed the code:
df1<- df1[!do.call(paste, df1[2:3]) %in% do.call(paste, df2[2:3]), ]
But it ended up deleting line 1 as well, giving me this table:
# Date prof2 prof1
#2 2 7 59
#3 3 8 10
... ... ...
when what I wanted was this:
# Date prof2 prof1
#1 1 5 25
#2 2 7 59
#3 3 8 10
... ... ...
How can I delete only one of the lines that are similar to another?
If you don't care about which one of the duplicates you keep, you can just make sure that prof2 >= prof1 in every row and then remove duplicates.
SWAP <- which(df2$prof2 < df2$prof1)  # rows where the pair is out of order
temp <- df2$prof2
df2$prof2[SWAP] <- df2$prof1[SWAP]    # put the larger value in prof2 ...
df2$prof1[SWAP] <- temp[SWAP]         # ... and the smaller one in prof1
df2 <- df2[!duplicated(df2[, 2:3]), ] # swapped pairs now compare equal
df2
Date prof2 prof1
1 1 25 5
2 2 59 7
3 3 10 8
We can do this with apply to loop over the rows of the dataset, sort them, take the transpose, apply duplicated on it to get a logical vector, and subset:
df1[!duplicated(t(apply(df1[-1], 1, sort))),]
# Date prof1 prof2
#1 1 25 5
#2 2 59 7
#3 3 10 8
Or another option is pmin/pmax, which builds an order-independent (min, max) key for each row so that swapped pairs compare as duplicates:
subset(df1, !duplicated(cbind(pmin(prof1, prof2), pmax(prof1, prof2))))
# Date prof1 prof2
#1 1 25 5
#2 2 59 7
#3 3 10 8
Or using filter from dplyr
library(dplyr)
df1 %>%
  filter(!duplicated(cbind(pmin(prof1, prof2), pmax(prof1, prof2))))
I have two different dfs which have the same columns: "O" for place and "date" for time.
Df 1 gives different information for a certain place (O) and time (date) in a single row, and df 2 has many entries for the same year and place in many different rows. Now I want to take one value from the first df and apply it to all the rows of the second df where the values for "O" and "date" are equal.
To make it more clear:
I have one line in df 1: krnqm = 250 for O = 1002 and date = 1885. Now I want a new column "krnqm" in df 2 where df2$krnqm = 250 for all rows where df2$O == 1002 and df2$date == 1885.
Unfortunately I have no idea how to put that condition into a code line and would be grateful for your help.
You can do this quite easily in base R using the merge function. Here's an example.
Simulate some data from your description:
df1 <- expand.grid(O = letters[c(2:4,7)], date = c(1,3))
df2 <- data.frame(O = rep(letters[1:6], c(2,3,3,6,2,2)), date = rep(1:3, c(3,2,4)))
df1$krnqm <- sample(1:1000, size = nrow(df1), replace = TRUE)
# (krnqm is random; the values below come from one draw, so use set.seed() to reproduce)
> df1
O date krnqm
1 b 1 833
2 c 1 219
3 d 1 773
4 g 1 514
5 b 3 118
6 c 3 969
7 d 3 704
8 g 3 914
> df2
O date
1 a 1
2 a 1
3 b 1
4 b 2
5 b 2
6 c 3
7 c 3
8 c 3
9 d 3
10 d 1
11 d 1
12 d 1
13 d 2
14 d 2
15 e 3
16 e 3
17 f 3
18 f 3
Now let's combine the two data frames in the manner you describe.
df2 <- merge(df2, df1, all.x=T)
> df2
O date krnqm
1 a 1 NA
2 a 1 NA
3 b 1 833
4 b 2 NA
5 b 2 NA
6 c 3 969
7 c 3 969
8 c 3 969
9 d 1 773
10 d 1 773
11 d 1 773
12 d 2 NA
13 d 2 NA
14 d 3 704
15 e 3 NA
16 e 3 NA
17 f 3 NA
18 f 3 NA
So you can see, the krnqm column in the resulting data frame contains NAs for any combinations of 'O' and 'date' that were not found in the data frame the krnqm values came from. If your df1 has other columns that you do not want included in the merge, just change the merge call slightly to use only the columns you need: df2 <- merge(df2, df1[,c("O", "date", "krnqm")], all.x=T).
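For completeness, the same left join can also be written with dplyr, if you prefer that idiom (a minimal sketch assuming the df1 and df2 from above):
library(dplyr)
df2 <- left_join(df2, df1, by = c("O", "date"))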
Good luck!
This question already has answers here:
Drop data frame columns by name
(25 answers)
Closed 5 years ago.
So I have a data frame structured as:
> head(peakQ)
STATION_NUMBER DATA_TYPE YEAR PEAK_CODE PRECISION_CODE MONTH DAY HOUR MINUTE TIME_ZONE PEAK SYMBOL
1 05EE006 Q 1983 H NA 6 29 5 18 MST 1.980
2 05EE006 Q 1985 H NA 4 2 0 0 MST 1.380 B
3 05EE006 Q 1986 H NA 3 30 13 37 MST 2.640
4 05EE006 Q 1987 H NA 4 5 21 2 MST 1.590 B
5 05EE006 Q 1989 H NA 10 22 2 45 MST 0.473
6 05EE006 Q 1990 H NA 4 2 4 2 MST 1.470
I want to drop the columns STATION_NUMBER, DATA_TYPE, PEAK_CODE, and PRECISION_CODE.
But I want to assume that I know only the column names, not their indices.
I already know that it is trivial to use indexes, such as:
> head(peakQ[, -c(1, 2, 4, 5)])
YEAR MONTH DAY HOUR MINUTE TIME_ZONE PEAK SYMBOL
1 1983 6 29 5 18 MST 1.980
2 1985 4 2 0 0 MST 1.380 B
3 1986 3 30 13 37 MST 2.640
4 1987 4 5 21 2 MST 1.590 B
5 1989 10 22 2 45 MST 0.473
6 1990 4 2 4 2 MST 1.470
But why do I get an error using column names, and what is the workaround?
> head(peakQ[, -c("STATION_NUMBER", "DATA_TYPE", "PEAK_CODE", "PRECISION_CODE")])
Error in -c("STATION_NUMBER", "DATA_TYPE", "PEAK_CODE", "PRECISION_CODE") :
invalid argument to unary operator
I am especially confused because the opposite operation works just fine.
> head(peakQ[, c("STATION_NUMBER", "DATA_TYPE", "PEAK_CODE", "PRECISION_CODE")])
STATION_NUMBER DATA_TYPE PEAK_CODE PRECISION_CODE
1 05EE006 Q H NA
2 05EE006 Q H NA
3 05EE006 Q H NA
4 05EE006 Q H NA
5 05EE006 Q H NA
6 05EE006 Q H NA
Any help and/or a deeper explanation is appreciated.
There is no minus operator on character vectors; however, subset tries to simulate this using a vector of unevaluated names. Ditto for dplyr select. We could also use setdiff which avoids the need for a minus operator.
1) subset Try subset with the select= argument:
subset(peakQ, select = - c(STATION_NUMBER, DATA_TYPE, PEAK_CODE, PRECISION_CODE))
2) setdiff Another possibility is:
peakQ[setdiff(names(peakQ), c("STATION_NUMBER","DATA_TYPE","PEAK_CODE","PRECISION_CODE"))]
3) dplyr The dplyr package's select could also be used:
library(dplyr)
peakQ %>%
  select(-c(STATION_NUMBER, DATA_TYPE, PEAK_CODE, PRECISION_CODE))
It seems that the "exclude" operator only works with indices and not column names. A remedy to overcome this problem might be to subset the column names with the %in% and ! operators:
> cols <- letters[1:5]
> cols
[1] "a" "b" "c" "d" "e"
> df1 <- as.data.frame(do.call(cbind, rep(list(1:5), 5)))
> names(df1) <- cols
> df1
a b c d e
1 1 1 1 1 1
2 2 2 2 2 2
3 3 3 3 3 3
4 4 4 4 4 4
5 5 5 5 5 5
> df1[,-c("a","b")]
Error in -c("a", "b") : invalid argument to unary operator
> df1[,!names(df1) %in% c("a","b")]
c d e
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 4
5 5 5 5
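If you happen to use data.table, columns can also be dropped by name by reference (a sketch, not covered by the answers above; note that it modifies peakQ in place):
library(data.table)
setDT(peakQ)  # convert the data frame to a data.table by reference
peakQ[, c("STATION_NUMBER", "DATA_TYPE", "PEAK_CODE", "PRECISION_CODE") := NULL]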
This question already has answers here:
Remove all duplicates except last instance
(4 answers)
Closed 6 years ago.
I want to remove all rows where columns a and b have the same values; furthermore, column c should contain the latest date when a and b are the same. I was thinking about sorting the data frame with respect to column c and then removing all duplicates (on a and b). It is my understanding that the function "duplicated" processes in a specific order.
For example:
a <- c(rep("A", 3), rep("B", 3), rep("C",2))
> b <- c(1,1,2,4,1,1,2,2)
> c <- c("2016-10-01", "2016-10-02", "2016-10-03", "2016-10-04", "2016-10-04", "2016-10-05", "2016-10-06", "2016-10-07")
> df <-data.frame(a,b,c)
> df
a b c
1 A 1 2016-10-01
2 A 1 2016-10-02
3 A 2 2016-10-03
4 B 4 2016-10-04
5 B 1 2016-10-04
6 B 1 2016-10-05
7 C 2 2016-10-06
8 C 2 2016-10-07
I want to get the following dataframe as a result:
a b c
1 A 1 2016-10-02
2 A 2 2016-10-03
3 B 4 2016-10-04
4 B 1 2016-10-05
5 C 2 2016-10-07
Yes, duplicated processes in a specific order. To start from the bottom, use fromLast=TRUE.
> df[!duplicated( df[,1:2], fromLast=TRUE ), ]
a b c
2 A 1 2016-10-02
3 A 2 2016-10-03
4 B 4 2016-10-04
6 B 1 2016-10-05
8 C 2 2016-10-07
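For reference, a dplyr equivalent might look like this (a sketch; slice_tail() needs dplyr >= 1.0, and arrange() is only required if the rows are not already sorted by date within each a/b group):
library(dplyr)
df %>%
  arrange(a, b, c) %>%    # latest date last within each group
  group_by(a, b) %>%
  slice_tail(n = 1) %>%   # keep the last row of each (a, b) group
  ungroup()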
How can I calculate whether an ID appears consecutively for fewer than 5 days, and also calculate the day difference between records with the same ID?
I really cannot work out the logic for this problem and do not know where to start.
(The sample data given below is just a sample; my actual data is huge, hence optimization is needed.)
sample data :
sample <- data.frame(
  id = c("A","B","C","D","A","C","D","A","C","D","A","D","A","C"),
  date = c("1/3/2013","1/3/2013","1/3/2013","1/3/2013","2/3/2013","2/3/2013",
           "2/3/2013","3/3/2013","3/3/2013","3/3/2013","4/3/2013","4/3/2013",
           "5/3/2013","5/3/2013")
)
Expected Output:
output <- data.frame(
  id = c("A","A","A","A","A","B","C","C","C","C","D","D","D","D","D","D","D"),
  date = c("1/3/2013","2/3/2013","3/3/2013","4/3/2013","5/3/2013","1/3/2013",
           "1/3/2013","2/3/2013","3/3/2013","5/3/2013","1/3/2013","2/3/2013",
           "3/3/2013","4/3/2013","5/3/2013","6/3/2013","7/3/2013"),
  num = c(0,1,2,3,4,0,0,1,2,4,0,1,2,3,4,5,6)
)
Calculation logic:
Do the calculation on the date difference within the same ID. For example, 1/3 to 2/3 is a 1-day difference, so the row for 2/3 gets idu = 1. 2/3 to 3/3 is a 1-day difference, so add 1: the row for 3/3 gets idu = 2. 3/3 to 5/3 is a 2-day difference, so add 2: the row for 5/3 gets idu = 4.
Date | idu
1/3 | 0
2/3 | 1
3/3 | 2
5/3 | 4
Thanks in advance.
sample <- data.frame(
  id = c("A","B","C","D","A","C","D","A","C","D","A","D","A","C"),
  date = c("1/3/2013","1/3/2013","1/3/2013","1/3/2013","2/3/2013","2/3/2013",
           "2/3/2013","3/3/2013","3/3/2013","3/3/2013","4/3/2013","4/3/2013",
           "5/3/2013","5/3/2013"),
  stringsAsFactors = FALSE
)
library(lubridate)
sample$date <- dmy(sample$date)                    # parse "d/m/Y" strings into Dates
sample1 <- sample[order(sample$id, sample$date), ] # sort by id, then date
# number the rows of each id run 0, 1, 2, ...
sample1$idu <- unlist(sapply(rle(sample1$id)$lengths, seq_len)) - 1
id date idu
1 A 2013-03-01 0
5 A 2013-03-02 1
8 A 2013-03-03 2
11 A 2013-03-04 3
13 A 2013-03-05 4
2 B 2013-03-01 0
3 C 2013-03-01 0
6 C 2013-03-02 1
9 C 2013-03-03 2
14 C 2013-03-05 3
4 D 2013-03-01 0
7 D 2013-03-02 1
10 D 2013-03-03 2
12 D 2013-03-04 3
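Since you mention large data, the same 0-based counter can also be computed with data.table, which scales better (a sketch; requires the data.table package):
library(data.table)
setDT(sample1)[, idu := seq_len(.N) - 1L, by = id]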
In order to add a time lag column, several options are available. I'd simply do
sample1$diff <- c(0, int_diff(sample1$date)/days(1))
# Remainder cannot be expressed as fraction of a period.
# Performing %/%.
> sample1
id date idu diff
1 A 2013-03-01 0 0
5 A 2013-03-02 1 1
8 A 2013-03-03 2 1
11 A 2013-03-04 3 1
13 A 2013-03-05 4 1
2 B 2013-03-01 0 -4
3 C 2013-03-01 0 0
6 C 2013-03-02 1 1
9 C 2013-03-03 2 1
14 C 2013-03-05 3 2
4 D 2013-03-01 0 -4
7 D 2013-03-02 1 1
10 D 2013-03-03 2 1
12 D 2013-03-04 3 1
And make further changes as needed, e.g. replacing all negative values with 0.
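Alternatively, computing the gaps within each id avoids the negative boundary values altogether, and a cumulative sum of those gaps reproduces the num column from your expected output (a base-R sketch, assuming sample1 from above with date already parsed):
# day gap to the previous record of the same id (0 for the first record)
sample1$diff <- ave(as.numeric(sample1$date), sample1$id,
                    FUN = function(x) c(0, diff(x)))
# cumulative day offset from each id's first record
sample1$num <- ave(sample1$diff, sample1$id, FUN = cumsum)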