I am trying to filter a dataframe in which I have three columns:
date (format: "day/month/year")
client name
client spending on a specific product
I want to filter this df so that I get only the most recent purchase for each client.
Is there any way I could do this?
Let me first create a dummy data frame
library(dplyr)
names <- c("A", "B", "C", "D")
client <- sample(names, size=20, replace=TRUE)
dates <- sample(seq(as.Date('1999/01/01'), as.Date('2000/01/01'), by="day"), 20)
amount <- sample(0:1000, size=20)
df <- data.frame(dates, client, amount)
So the data frame looks like this
dates client amount
1 1999-08-21 A 632
2 1999-08-06 B 449
3 1999-03-20 B 402
4 1999-05-15 B 557
5 1999-04-29 D 960
6 1999-03-07 A 977
7 1999-12-02 D 106
8 1999-12-08 D 891
9 1999-12-06 B 375
10 1999-03-28 C 509
11 1999-07-27 C 722
12 1999-02-01 D 923
13 1999-02-20 B 517
14 1999-12-17 B 487
15 1999-11-27 C 486
16 1999-05-26 B 873
17 1999-01-11 A 493
18 1999-08-16 A 620
19 1999-03-17 B 899
20 1999-03-01 C 297
You can then filter the data
result <- df %>%
group_by(client) %>%
filter(dates == max(dates))
result
which will give you the following result.
dates client amount
<date> <fct> <int>
1 1999-08-21 A 632
2 1999-12-08 D 891
3 1999-12-17 B 487
4 1999-11-27 C 486
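If the dates are still stored as day/month/year text, as described in the question, they need converting before max() will compare them chronologically. A minimal base R sketch of that step, using made-up sample rows (the `purchases` data frame and its values are hypothetical):

```r
# Hypothetical sample data: dates stored as day/month/year text
purchases <- data.frame(
  date   = c("21/08/1999", "06/03/1999", "17/12/1999", "02/12/1999"),
  client = c("A", "A", "B", "B"),
  amount = c(632, 977, 487, 106),
  stringsAsFactors = FALSE
)

# Parse the text into Date objects so that comparisons are chronological
purchases$date <- as.Date(purchases$date, format = "%d/%m/%Y")

# For every row, compute the latest date within that row's client group
latest_date <- ave(as.numeric(purchases$date), purchases$client, FUN = max)

# Keep only the rows whose date equals the client's latest date
latest <- purchases[as.numeric(purchases$date) == latest_date, ]
latest
```

Without the as.Date() call the comparison would be done on strings, so "21/08/1999" would sort after "17/12/1999".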
This is almost an extension of a previous question I asked, but I've run into a new problem I haven't found a solution for.
Here is the original question and answer: Find matching intervals in data frame by range of two column values
(this found overlapping intervals that were common among different names within same data frame)
I now want to find a way to exclude rows in DF1 when there are overlapping intervals with a new data frame, DF2.
Using the same DF1 :
Name Event Order Sequence start_event end_event duration Group
JOHN 1 A 0 19 19 ID1
JOHN 2 A 60 112 52 ID1
JOHN 3 A 392 429 37 ID1
JOHN 4 B 282 329 47 ID1
JOHN 5 C 147 226 79 ID1
JOHN 6 C 566 611 45 ID1
ADAM 1 A 19 75 56 ID1
ADAM 2 A 384 407 23 ID1
ADAM 3 B 0 79 79 ID1
ADAM 4 B 505 586 81 ID1
ADAM 5 C 140 205 65 ID1
ADAM 6 C 522 599 77 ID1
This continues for 18 different names and two ID groups.
I now have a second data frame with intervals that I wish to exclude from the above data frame.
Here is an example of DF2:
Name Event Order Sequence start_event end_event duration Group
GAP1 1 A 55 121 66 ID1
GAP2 2 A 394 419 25 ID1
GAP3 3 C 502 635 133 ID1
That is, for each "Name" in DF1 I am hoping to find any interval that is in the same "Sequence" as an interval in DF2 and overlaps it at any point (any portion, whether it begins before the start event, or begins midway and ends after the end event), so that I can exclude it. I would like to iterate through each distinct "Name" in DF1. The sequence matters, so I only want to compare sequence A with sequence A, sequence B with sequence B, and sequence C with sequence C.
Desired Result (showing just the first two names):
Name Event Order Sequence start_event end_event duration Group
JOHN 1 A 0 19 19 ID1
JOHN 4 B 282 329 47 ID1
JOHN 5 C 147 226 79 ID1
ADAM 3 B 0 79 79 ID1
ADAM 4 B 505 586 81 ID1
ADAM 5 C 140 205 65 ID1
Last time the problem was solved in part with foverlaps, but I am still not familiar enough with it to solve this problem, assuming that's the best way to approach it.
Thanks!
This piece of code should work for you
library(data.table)
Dt1 <- data.table(a = 1:1000,b=1:1000 + 100)
Dt2 <- data.table(a = 100:200,b=100:200+10)
#identify the positions that are not allowed
badSeq <- unique(unlist(lapply(1:nrow(Dt2),function(i) Dt2[i,a:b,])))
#select for the rows outside of the range
correctPos <- sapply(1:nrow(Dt1),
function(i)
all(!Dt1[i,a:b %in% badSeq]))
Dt1[correctPos,]
I have done it with data.tables rather than data.frames because I like them better and they can be faster, but you can apply the same ideas to a data.frame.
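The snippet above does not take the Sequence column into account. As a rough sketch of how the same exclusion could be restricted to matching sequences, here is a plain base R version (not foverlaps) run on the JOHN and ADAM rows from the question. It assumes the "Event Order" column is named Event_Order and that interval endpoints are inclusive, and it omits the duration and Group columns for brevity:

```r
# JOHN and ADAM rows from the question (Event_Order is an assumed column name)
DF1 <- data.frame(
  Name        = rep(c("JOHN", "ADAM"), each = 6),
  Event_Order = rep(1:6, times = 2),
  Sequence    = c("A", "A", "A", "B", "C", "C", "A", "A", "B", "B", "C", "C"),
  start_event = c(0, 60, 392, 282, 147, 566, 19, 384, 0, 505, 140, 522),
  end_event   = c(19, 112, 429, 329, 226, 611, 75, 407, 79, 586, 205, 599),
  stringsAsFactors = FALSE
)

# Intervals to exclude
DF2 <- data.frame(
  Sequence    = c("A", "A", "C"),
  start_event = c(55, 394, 502),
  end_event   = c(121, 419, 635),
  stringsAsFactors = FALSE
)

# A DF1 row is dropped if any DF2 interval in the same Sequence overlaps it.
# Two closed intervals overlap when each one starts before the other ends.
hits_gap <- vapply(seq_len(nrow(DF1)), function(i) {
  any(DF2$Sequence == DF1$Sequence[i] &
      DF2$start_event <= DF1$end_event[i] &
      DF2$end_event >= DF1$start_event[i])
}, logical(1))

kept <- DF1[!hits_gap, ]
kept
```

Because the test compares interval endpoints rather than enumerating every position, it does not require the start and end values to be integers.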
I have data as follows:
ID age sugarlevel
123 15 8
456 13 10
789 25 5
...
Does anyone know how to use R to split the data according to sugar level (>=7, <7)? That is, the data should be split into two groups:
group 1:
ID age sugarlevel
123 15 8
456 13 10
...
group 2:
ID age sugarlevel
789 25 5
...
Thanks in advance.
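We can split the dataset by the logical grouping variable df1$sugarlevel >= 7 (from @nicola's comments). Note that split() orders the pieces FALSE then TRUE, so group1 below holds the rows with sugarlevel < 7 and group2 the rows with sugarlevel >= 7.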
lst <- setNames(split(df1, df1$sugarlevel >=7), paste0('group',1:2))
lst
#$group1
# ID age sugarlevel
#3 789 25 5
#$group2
# ID age sugarlevel
#1 123 15 8
#2 456 13 10
It is better to work with the dataset in the 'list', but if we need to have two separate objects in the global environment,
list2env(lst, envir=.GlobalEnv)
group1
# ID age sugarlevel
#3 789 25 5
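If the groups should be numbered as in the question (group1 for sugarlevel >= 7, group2 for sugarlevel < 7), one option is to split on an explicitly labelled factor. A small sketch, rebuilding the three example rows by hand:

```r
# The three example rows from the question
df1 <- data.frame(ID = c(123, 456, 789),
                  age = c(15, 13, 25),
                  sugarlevel = c(8, 10, 5))

# Label each row, fixing the level order so that group1 comes first
grp <- factor(ifelse(df1$sugarlevel >= 7, "group1", "group2"),
              levels = c("group1", "group2"))

# split() names the list elements after the factor levels
lst <- split(df1, grp)
lst
```

Here the list names come straight from the factor levels, so no setNames() call is needed.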
I've got the following three dataframes:
df1 <- data.frame(name=c("John", "Anne", "Christine", "Andy"),
age=c(31, 26, 54, 48),
height=c(180, 175, 160, 168),
group=c("Student",3,5,"Employer"), stringsAsFactors=FALSE)
df2 <- data.frame(name=c("Anne", "Christine"),
age=c(26, 54),
height=c(175, 160),
group=c(3,5),
group2=c("Teacher",6), stringsAsFactors=FALSE)
df3 <- data.frame(name=c("Christine"),
age=c(54),
height=c(160),
group=c(5),
group2=c(6),
group3=c("Scientist"), stringsAsFactors=FALSE)
I'd like to combine them so that I get the following result:
df.all <- data.frame(name=c("John", "Anne", "Christine", "Andy"),
age=c(31, 26, 54, 48),
height=c(180, 175, 160, 168),
group=c("Student", "Teacher", "Scientist", "Employer"))
At the moment I'm doing it this way:
df.all <- merge(merge(df1[,c(1,4)], df2[,c(1,5)], all=TRUE, by="name"),
df3[,c(1,6)], all=TRUE, by="name")
row.ind <- which(df.all$group %in% c(6,5))
df.all[row.ind, c("group")] <- df.all[row.ind, c("group2")]
row.ind2 <- which(df.all$group2 %in% c(6))
df.all[row.ind2, c("group")] <- df.all[row.ind2, c("group3")]
This isn't generalisable and it is really messy. Maybe there would be a way to use merge_all or merge_recurse for the merging step (especially as there might be more than two dataframes to be merged), but I haven't figured out how. These two don't produce the right result:
df.all <- merge_all(list(df1, df2, df3))
df.all <- merge_recurse(list(df1, df2, df3), by=c("name"))
Is there a more general and elegant way to solve this problem?
Here is another possible approach, if I understand what you're ultimately after. (It is not clear what the numeric values in the "group" columns are, so I'm not sure this is exactly what you're looking for.)
Use Reduce() to merge your multiple data.frames.
temp <- Reduce(function(x, y) merge(x, y, all=TRUE), list(df1, df2, df3))
names(temp)[4] <- "group1" # Rename "group" to "group1" for reshaping
temp
# name age height group1 group2 group3
# 1 Andy 48 168 Employer <NA> <NA>
# 2 Anne 26 175 3 Teacher <NA>
# 3 Christine 54 160 5 6 Scientist
# 4 John 31 180 Student <NA> <NA>
Use reshape() to reshape your data from wide to long.
df.all <- reshape(temp, direction = "long", idvar="name", varying=4:6, sep="")
df.all
# name age height time group
# Andy.1 Andy 48 168 1 Employer
# Anne.1 Anne 26 175 1 3
# Christine.1 Christine 54 160 1 5
# John.1 John 31 180 1 Student
# Andy.2 Andy 48 168 2 <NA>
# Anne.2 Anne 26 175 2 Teacher
# Christine.2 Christine 54 160 2 6
# John.2 John 31 180 2 <NA>
# Andy.3 Andy 48 168 3 <NA>
# Anne.3 Anne 26 175 3 <NA>
# Christine.3 Christine 54 160 3 Scientist
# John.3 John 31 180 3 <NA>
Take advantage of the fact that as.numeric() will coerce non-numeric character strings to NA (with a warning), and use na.omit() to remove all of the rows with NA values.
na.omit(df.all[is.na(as.numeric(df.all$group)), ])
# name age height time group
# Andy.1 Andy 48 168 1 Employer
# John.1 John 31 180 1 Student
# Anne.2 Anne 26 175 2 Teacher
# Christine.3 Christine 54 160 3 Scientist
Again, this might be over-generalizing your problem--there might be NA values in other columns, for example--but it might help direct you towards a solution to your problem.
First step is to use merge_recurse with all.x = TRUE:
library(reshape)
merge.all <- merge_recurse(list(df1, df2, df3), all.x = TRUE)
# name age height group group2 group3
# 1 Anne 26 175 3 Teacher <NA>
# 2 Christine 54 160 5 6 Scientist
# 3 John 31 180 Student <NA> <NA>
# 4 Andy 48 168 Employer <NA> <NA>
Then you can use apply to get the last non-NA group from all the "group" columns:
group.cols <- grep("group", colnames(merge.all))
merge.all <- data.frame(merge.all[-group.cols],
group = apply(merge.all[group.cols], 1,
function(x)tail(na.omit(x), 1)))
# name age height group
# 1 Anne 26 175 Teacher
# 2 Christine 54 160 Scientist
# 3 John 31 180 Student
# 4 Andy 48 168 Employer
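For reference, the same last-non-NA idea seems to work without the reshape package by combining it with the Reduce() merge shown earlier. This is only a sketch and assumes, as in the question, that the most specific label always sits in the right-most non-missing group column:

```r
df1 <- data.frame(name = c("John", "Anne", "Christine", "Andy"),
                  age = c(31, 26, 54, 48),
                  height = c(180, 175, 160, 168),
                  group = c("Student", 3, 5, "Employer"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(name = c("Anne", "Christine"),
                  age = c(26, 54),
                  height = c(175, 160),
                  group = c(3, 5),
                  group2 = c("Teacher", 6),
                  stringsAsFactors = FALSE)
df3 <- data.frame(name = "Christine", age = 54, height = 160,
                  group = 5, group2 = 6, group3 = "Scientist",
                  stringsAsFactors = FALSE)

# Full outer merge of all data frames on their shared columns
merged <- Reduce(function(x, y) merge(x, y, all = TRUE), list(df1, df2, df3))

# Positions of group, group2, group3, ...
group_cols <- grep("^group", names(merged))

# For each row, take the last non-missing value across the group columns
last_label <- apply(merged[group_cols], 1, function(x) tail(na.omit(x), 1))

result <- data.frame(merged[-group_cols],
                     group = unname(last_label),
                     stringsAsFactors = FALSE)
result
```

Adding a fourth data frame with a group4 column should only require appending it to the list passed to Reduce().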
Suppose that we have a data frame that looks like
set.seed(7302012)
county <- rep(letters[1:4], each=2)
state <- rep(LETTERS[1], times=8)
industry <- rep(c("construction", "manufacturing"), 4)
employment <- round(rnorm(8, 100, 50), 0)
establishments <- round(rnorm(8, 20, 5), 0)
data <- data.frame(state, county, industry, employment, establishments)
state county industry employment establishments
1 A a construction 146 19
2 A a manufacturing 110 20
3 A b construction 121 10
4 A b manufacturing 90 27
5 A c construction 197 18
6 A c manufacturing 73 29
7 A d construction 98 30
8 A d manufacturing 102 19
We'd like to reshape this so that each row represents a (state and) county, rather than a county-industry, with columns construction.employment, construction.establishments, and analogous versions for manufacturing. What is an efficient way to do this?
One way is to subset
construction <- data[data$industry == "construction", ]
names(construction)[4:5] <- c("construction.employment", "construction.establishments")
And similarly for manufacturing, then do a merge. This isn't so bad if there are only two industries, but imagine that there are 14; this process would become tedious (though made less so by using a for loop over the levels of industry).
Any other ideas?
This can be done in base R reshape, if I understand your question correctly:
reshape(data, direction="wide", idvar=c("state", "county"), timevar="industry")
# state county employment.construction establishments.construction
# 1 A a 146 19
# 3 A b 121 10
# 5 A c 197 18
# 7 A d 98 30
# employment.manufacturing establishments.manufacturing
# 1 110 20
# 3 90 27
# 5 73 29
# 7 102 19
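The question asked for names of the form construction.employment, whereas reshape() puts the variable first (employment.construction). If that ordering matters, the two parts can be swapped afterwards. A small sketch, rebuilding the data from the printed values above rather than from rnorm(), and assuming no other column names contain a dot:

```r
# Data rebuilt from the table printed in the question
data <- data.frame(
  state          = "A",
  county         = rep(letters[1:4], each = 2),
  industry       = rep(c("construction", "manufacturing"), 4),
  employment     = c(146, 110, 121, 90, 197, 73, 98, 102),
  establishments = c(19, 20, 10, 27, 18, 29, 30, 19)
)

wide <- reshape(data, direction = "wide",
                idvar = c("state", "county"), timevar = "industry")

# Swap "variable.industry" into "industry.variable" for the measure columns only
measure_cols <- !(names(wide) %in% c("state", "county"))
names(wide)[measure_cols] <- sub("^([^.]+)\\.([^.]+)$", "\\2.\\1",
                                 names(wide)[measure_cols])
names(wide)
```

The same renaming applies unchanged with 14 industries, since it works on the column names rather than on a fixed list of industries.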
Also using the reshape package:
library(reshape)
m <- reshape::melt(data)
cast(m, state + county~...)
Yielding:
> cast(m, state + county~...)
state county construction_employment construction_establishments manufacturing_employment manufacturing_establishments
1 A a 146 19 110 20
2 A b 121 10 90 27
3 A c 197 18 73 29
4 A d 98 30 102 19
I personally use base reshape(), so I forgot there was a reshape2 package (Wickham) and probably should have shown this using it instead. It is slightly different:
library(reshape2)
m <- reshape2::melt(data)
dcast(m, state + county~...)