I currently have the below data.table with Name and Id recycling per day.
Date Name Id Widgets
2016-12-31 Bob Jones 0052A00001 5
2016-12-31 James Smith 0052A00002 25
2016-12-31 Tom Wilson 0052A00003 29
...
2016-01-31 Bob Jones 0052A00001 8
2016-01-31 James Smith 0052A00002 18
2016-01-31 Tom Wilson 0052A00003 20
Is it possible to apply the zoo function apply.weekly to this since there are not unique values per date? If not, what is the easiest way to aggregate this by a weekly value (or period of another length- say 4 days) and create groupings according to that?
You can create a grouping first before you match in the week. You can play around with cut to get your desired grouping.
grpWeek <- data.table(Date=seq.Date(as.Date("2016-01-01"), as.Date("2016-12-31"), by="1 day"))[,
list(Date,
DT_Week=week(Date),
Week_Num=format(Date, "%W"),
User_Week=cut(Date, breaks=52, labels=paste0("Week",1:52)))]
dt <- fread("Date,Name,Id,Widgets
2016-12-31,Bob Jones,0052A00001,5
2016-12-31,James Smith,0052A00002,25
2016-12-31,Tom Wilson,0052A00003,29
2016-01-31,Bob Jones,0052A00001,8
2016-01-31,James Smith,0052A00002,18
2016-01-31,Tom Wilson,0052A00003,20")
dt[,Date:=as.Date(Date)]
grpWeek[dt, on="Date"]
Related
This question already has answers here:
Calculate difference between values in consecutive rows by group
(4 answers)
Closed 2 years ago.
I have a dataframe that looks like this:
Name Date
David 2019-12-23
David 2020-1-10
David 2020-2-13
Kevin 2019-2-12
Kevin 2019-3-19
Kevin 2019-5-1
Kevin 2019-7-23
Basically, I'm trying to calculate the date difference between each instance, specific to each person. I am currently using the following for-loop:
df$daysbetween <- with(df, ave(as.numeric(date) , name,
FUN=function(x) { z=c(NA,NA);
for( i in seq_along(x)[-(1:2)] ){
z <- c(z, (x[i]-x[i-1]))}
return(z) }) )
Currently, it calculates the difference between the second and third, and any following instance, perfectly fine. However, it doesn't calculate the difference between the first and second date and I need it to. Where is the error in my code coming from? Would appreciate any help.
transform(df, diff = ave(Date, Name, FUN = function(x)c(NA,diff(as.Date(x)))))
Name Date diff
1 David 2019-12-23 <NA>
2 David 2020-1-10 18
3 David 2020-2-13 34
4 Kevin 2019-2-12 <NA>
5 Kevin 2019-3-19 35
6 Kevin 2019-5-1 43
7 Kevin 2019-7-23 83
Just use lag from the dplyr package:
Description:
Find the "previous" (lag()) or "next" (lead()) values in a vector. Useful for comparing values behind of or ahead of the current values.
df %>%
group_by(name) %>%
mutate(diff = date - lag(date))
Output:
name date diff
<chr> <date> <drtn>
1 David 2019-12-23 NA days
2 David 2020-01-10 18 days
3 David 2020-02-13 34 days
4 Kevin 2019-02-12 NA days
5 Kevin 2019-03-19 35 days
6 Kevin 2019-05-01 43 days
7 Kevin 2019-07-23 83 days
I have a dataframe with two columns as shown below,
Name Indicator
DeAngelo Williams 1
Marcus Brown 1
Elaine Nelson 2
Steve Olson 3
Jennifer Carter 1
Michael Johnson 2
Angela Brawley 3
Dax Shepard 4
What I am trying to do is combine all the names where the Indicator Column values is 1 until the next value 1 is encountered, the final output should looks like this below.
Name
-------
DeAngelo Williams
Marcus Brown,Elaine Nelson,Steve Olson
Jennifer Carter, Michael Johnson, Angela Brawley, Dax Shepard
I am unable to think of a solution for this issue so any assistance on accomplishing this is much appreciated.
We can use aggregate from base R to do this. As #thelatemail mentioned in the comments, create a group by doing the cumulative sum of the logical vector Indicator==1, using the formula method, we paste the elements in 'Name' together.
aggregate(Name~cbind(Group=cumsum(Indicator==1)), df1, FUN=toString)[2]
I have a df where each row represents an individual and each column a characteristic of these individuals. One of the columns is TeamName, which is the name of the Team that individual belongs to. Multiple individuals belong to a Team.
I'd like a function in R that creates a new column with the number of team members for each Team.
So, for example I have:
df
Name Surname TeamName
John Smith Champions
Mary Osborne Socceroos
Mark Johnson Champions
Rory Bradon Champions
Jane Bryant Socceroos
Bruce Harper
I'd like to have
df1
Name Surname TeamName TeamNo
John Smith Champions 3
Mary Osborne Socceroos 2
Mark Johnson Champions 3
Rory Bradon Champions 3
Jane Bryant Socceroos 2
Bruce Harper 0
So as you can see the counting includes that individual too, and if someone (e.g. Bruce Harper) has no Team name, then he gets a 0.
How can I do that? Thanks!
This is a solution based on using data.table which perhaps is too much for what you need, but here it goes:
library(data.table)
dt=data.table(df)
# First, let's convert the factors of TeamName, to characters
dt[,TeamName:=as.character(TeamName)]
# Now, let find all the team numbers
dt[,TeamNo:=.N, by='TeamName']
# Let's exclude the special cases
dt[is.na(TeamName),TeamNo:=NA]
dt[TeamName=="",TeamNo:=NA]
It is clearly not the best solution, but I hope this helps
If you need to know the number of unique members in the first two columns based on the 'TeamName' column, one option is n_distinct from dplyr
library(dplyr)
library(tidyr)
df %>%
unite(Var, Name, Surname) %>% #paste the columns together
group_by(TeamName) %>% #group by TeamName
mutate(TeamNo= n_distinct(Var)) %>% #create the TeamNo column
separate(Var, into=c('Name', 'Surname')) #split the 'Var' column
Or if it just the number of rows per 'TeamName', we can group by 'TeamName', get the number of rows per group with n(), create the 'TeamNo' column with mutate based on that n(), and if needed an ifelse condition can be used to give NA for 'TeamName' that are '' or NA.
df %>%
group_by(TeamName) %>%
mutate(TeamNo = ifelse(is.na(TeamName)|TeamName=='', NA_integer_, n()))
# Name Surname TeamName TeamNo
#1 John Smith Champions 3
#2 Mary Osborne Socceroos 2
#3 Mark Johnson Champions 3
#4 Rory Bradon Champions 3
#5 Jane Bryant Socceroos 2
#6 Bruce Harper NA
Or you can use ave from base R. Suppose if there are '' and NA, I would first convert the '' to NA and then use ave to get the length of 'TeamNo' grouped by that column. It will give NA for `NA' values. For example.
v1 <- c(df$TeamName, NA)# appending an NA with the example to show the case
is.na(v1) <- v1=='' #convert the `'' to `NA`
as.numeric(ave(v1, v1, FUN=length))
#[1] 3 2 3 3 2 NA NA
Using sqldf:
library(sqldf)
sqldf("SELECT Name, Surname, TeamName, n
FROM df
LEFT JOIN
(SELECT TeamName, COUNT(Name) AS n
FROM df
WHERE NOT TeamName IS '' GROUP BY TeamName)
USING (TeamName)")
Output:
Name Surname TeamName n
1 John Smith Champions 3
2 Mary Osborne Socceroos 2
3 Mark Johnson Champions 3
4 Rory Bradon Champions 3
5 Jane Bryant Socceroos 2
6 Bruce Harper NA
I have a set of data (10 columns, 1000 rows) that is indexed by an ID number that one or more of these rows can share. To give a small example to illustrate my point, consider this table:
ID Name Location
5014 John
5014 Kate California
5014 Jim
5014 Ryan California
5018 Pete
5018 Pat Indiana
5019 Jeff Arizona
5020 Chris Kentucky
5020 Mike
5021 Will Indiana
I need for all entries to have something in the Location field and I'm having a hell of a time trying to do it.
Things to note:
Every unique ID number has at least one row with the location field populated.
If two rows have the same ID number, they have the same location.
Two different ID numbers can have the same location.
ID numbers are not necessarily consecutive, nor are they necessarily completely numeric. The arrangement of them isn't of importance to me, since any rows that are related share the same ID number.
Any ideas for a solution? I'm currently using R with the data.table package, but I'm relatively new to it.
We can convert the 'data.frame' to 'data.table' (setDT(df1)), Grouped by 'ID', get the elements of Location that are not '' (Location[Location!=''][1L]). Suppose, if there are more than one element per group that are not '', the [1L], selects the first non-blank element, and assign (:=) the output to Location
library(data.table)
setDT(df1)[, Location := Location[Location != ''][1L], by = ID][]
# ID Name Location
# 1: 5014 John California
# 2: 5014 Kate California
# 3: 5014 Jim California
# 4: 5014 Ryan California
# 5: 5018 Pete Indiana
# 6: 5018 Pat Indiana
# 7: 5019 Jeff Arizona
# 8: 5020 Chris Kentucky
# 9: 5020 Mike Kentucky
#10: 5021 Will Indiana
Or we can use setdiff as suggested by #Frank
setDT(df1)[, Location:= setdiff(Location,'')[1L], by = ID][]
I have a data.frame with 1,000 rows and 3 columns. It contains a large number of duplicates and I've used plyr to combine the duplicate rows and add a count for each combination as explained in this thread.
Here's an example of what I have now (I still also have the original data.frame with all of the duplicates if I need to start from there):
name1 name2 name3 total
1 Bob Fred Sam 30
2 Bob Joe Frank 20
3 Frank Sam Tom 25
4 Sam Tom Frank 10
5 Fred Bob Sam 15
However, column order doesn't matter. I just want to know how many rows have the same three entries, in any order. How can I combine the rows that contain the same entries, ignoring order? In this example I would want to combine rows 1 and 5, and rows 3 and 4.
Define another column that's a "sorted paste" of the names, which would have the same value of "Bob~Fred~Sam" for rows 1 and 5. Then aggregate based on that.
Brief code snippet (assumes original data frame is dd): it's all really intuitive. We create a lookup column (take a look and should be self explanatory), get the sums of the total column for each combination, and then filter down to the unique combinations...
dd$lookup=apply(dd[,c("name1","name2","name3")],1,
function(x){paste(sort(x),collapse="~")})
tab1=tapply(dd$total,dd$lookup,sum)
ee=dd[match(unique(dd$lookup),dd$lookup),]
ee$newtotal=as.numeric(tab1)[match(ee$lookup,names(tab1))]
You now have in ee a set of unique rows and their corresponding total counts. Easy - and no external packages needed. And crucially, you can see at every stage of the process what is going on!
(Minor update to help OP:) And if you want a cleaned-up version of the final answer:
outdf = with(ee,data.frame(name1,name2,name3,
total=newtotal,stringsAsFactors=FALSE))
This gives you a neat data frame with the three all-important name columns, and with the aggregated totals in a column called total rather than newtotal.
Sort the index columns, then use ddply to aggregate and sum:
Define the data:
dat <- " name1 name2 name3 total
1 Bob Fred Sam 30
2 Bob Joe Frank 20
3 Frank Sam Tom 25
4 Sam Tom Frank 10
5 Fred Bob Sam 15"
x <- read.table(text=dat, header=TRUE)
Create a copy:
xx <- x
Use apply to sort the columns, then aggregate:
xx[, -4] <- t(apply(xx[, -4], 1, sort))
library(plyr)
ddply(xx, .(name1, name2, name3), numcolwise(sum))
name1 name2 name3 total
1 Bob Frank Joe 20
2 Bob Fred Sam 45
3 Frank Sam Tom 35