Fill in Blank Fields With a Value From Same Key Index - r

I have a set of data (10 columns, 1000 rows) that is indexed by an ID number that one or more of these rows can share. To give a small example to illustrate my point, consider this table:
ID Name Location
5014 John
5014 Kate California
5014 Jim
5014 Ryan California
5018 Pete
5018 Pat Indiana
5019 Jeff Arizona
5020 Chris Kentucky
5020 Mike
5021 Will Indiana
I need all entries to have something in the Location field and I'm having a hell of a time trying to do it.
Things to note:
Every unique ID number has at least one row with the location field populated.
If two rows have the same ID number, they have the same location.
Two different ID numbers can have the same location.
ID numbers are not necessarily consecutive, nor are they necessarily completely numeric. The arrangement of them isn't of importance to me, since any rows that are related share the same ID number.
Any ideas for a solution? I'm currently using R with the data.table package, but I'm relatively new to it.

We can convert the 'data.frame' to a 'data.table' with setDT(df1), group by 'ID', and take the elements of Location that are not '' (Location[Location != '']). If a group has more than one non-blank element, [1L] selects the first one. The result is then assigned (:=) back to Location.
library(data.table)
setDT(df1)[, Location := Location[Location != ''][1L], by = ID][]
# ID Name Location
# 1: 5014 John California
# 2: 5014 Kate California
# 3: 5014 Jim California
# 4: 5014 Ryan California
# 5: 5018 Pete Indiana
# 6: 5018 Pat Indiana
# 7: 5019 Jeff Arizona
# 8: 5020 Chris Kentucky
# 9: 5020 Mike Kentucky
#10: 5021 Will Indiana
Or we can use setdiff, as suggested by @Frank:
setDT(df1)[, Location:= setdiff(Location,'')[1L], by = ID][]
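For anyone who wants to try this end to end, here is a minimal reproducible sketch. The construction of df1 is an assumption (the question only shows the printed table), and the extra !is.na() guard is my addition, for the case where the blanks are stored as NA rather than ''.

```r
library(data.table)

# Assumed reconstruction of the question's table; blanks are empty strings
df1 <- data.frame(
  ID = c("5014", "5014", "5014", "5014", "5018",
         "5018", "5019", "5020", "5020", "5021"),
  Name = c("John", "Kate", "Jim", "Ryan", "Pete",
           "Pat", "Jeff", "Chris", "Mike", "Will"),
  Location = c("", "California", "", "California", "",
               "Indiana", "Arizona", "Kentucky", "", "Indiana"),
  stringsAsFactors = FALSE
)

setDT(df1)  # convert to data.table by reference

# Per ID, take the first value that is neither NA nor '' and
# recycle it over every row of that group
df1[, Location := Location[!is.na(Location) & Location != ""][1L], by = ID]
df1[]
```

If an ID had no populated Location at all, the subset would be empty and [1L] would return NA for that group, so such cases stay visible instead of failing.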

Related

R Identifying Dataframe Change Patterns by Groups

I have a data frame that looks like this:
person year location salary
Harry 2002 Los Angeles $2000
Harry 2006 Boston $3000
Harry 2007 Los Angeles $2500
Peter 2001 New York $2000
Peter 2002 New York $2300
Lily 2007 New York $7000
Lily 2008 Boston $2300
Lily 2011 New York $4000
Lily 2013 Boston $3300
I want to identify a pattern at the person level: who moved out of a location and came back later. For example, Harry moved out of Los Angeles and came back later. Lily moved out of New York and came back later; she also moved out of Boston and came back later. I am only interested in who shows this pattern, not in the number of moves back and forth. Ideally, the output would look like this:
person move_back (yes/no)
Harry 1
Peter 0
Lily 1
With the help of data.table's rleid you can do this:
library(dplyr)
df %>%
  arrange(person, year) %>%
  group_by(person) %>%
  mutate(val = data.table::rleid(location)) %>%
  arrange(person, location) %>%
  group_by(location, .add = TRUE) %>%
  summarise(move_back = any(val != lag(val, default = first(val)))) %>%
  summarise(move_back = as.integer(any(move_back)))
# person move_back
# <chr> <int>
#1 Harry 1
#2 Lily 1
#3 Peter 0
You could use rle to identify situations where there are one or more instances of repeats. (I think your item Lily had two repeats.)
lapply( split(dat, dat$person), function(x) duplicated( rle(x$location)$values))
$Harry
[1] FALSE FALSE TRUE
$Lily
[1] FALSE FALSE TRUE TRUE
$Peter
[1] FALSE
You could use sapply with sum or any to determine the number of move-backs or whether any move-backs occurred. If you only want to know if there's a move-back to the first site then the logic would be different.
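As a rough sketch of the sapply/any idea just described: the data frame below is my reconstruction of the question's example (salary and year dropped), and it assumes the rows are already in year order within each person, since rle only looks at consecutive values.

```r
# Assumed reconstruction of the example, already sorted by year within person
dat <- data.frame(
  person = c("Harry", "Harry", "Harry", "Peter", "Peter",
             "Lily", "Lily", "Lily", "Lily"),
  location = c("Los Angeles", "Boston", "Los Angeles", "New York", "New York",
               "New York", "Boston", "New York", "Boston"),
  stringsAsFactors = FALSE
)

# rle() collapses consecutive repeats; a duplicated run value means the
# person returned to a location after having left it
move_back <- sapply(
  split(dat, dat$person),
  function(x) as.integer(any(duplicated(rle(x$location)$values)))
)
move_back
```

Replacing any with sum would count the returns instead of just flagging them.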
A slightly different data.table method, based on joins and row number (.I).
Basically I'm flagging all the times that a location for a person matches a row that is not the next row, then aggregating.
library(data.table)
setDT(dat)
dat[, rn := .I]
dat[, rnp1 := .I + 1]
dat[dat, on=.(person, location, rn > rnp1), back := TRUE]
dat[, .(move_back = any(back, na.rm=TRUE)), by=person]
# person move_back
#1: Harry TRUE
#2: Peter FALSE
#3: Lily TRUE
Where dat was:
dat <- read.csv(text="person,year,location,salary
Harry,2002,Los Angeles,$2000
Harry,2006,Boston,$3000
Harry,2007,Los Angeles,$2500
Peter,2001,New York,$2000
Peter,2002,New York,$2300
Lily,2007,New York,$7000
Lily,2008,Boston,$2300
Lily,2011,New York,$4000
Lily,2013,Boston,$3300", header=TRUE)

How to Implement a Complex For-Loop + If Statement

I have two data sets, each containing five-digit ZIPs.
One data set looks like this:
From To Territory
7501 10000 Unassigned
10001 10463 Agent 1
10464 10464 Unassigned
10465 11769 Agent 2
And a second data set that looks like this:
zip5 address
1 10009 424 E 9TH ST APT 12, NEW YORK
2 10010 15 E 26TH ST APT 10C, NEW YORK
3 10013 310 GREENWICH ST, NEW YORK
4 10019 457 W 57TH ST, NEW YORK
I would like to write a for-loop in R that loops through the zip5 column in the second data set, then loops through both the From and the To columns from dataset 1, checking if the zip5 falls within the From and To range, and once it finds a match, assigns the Territory value from the first dataset into a new column in second dataset.
I started to try to think through the logic but quickly became overwhelmed and thought I would turn to the StackOverflow community for guidance.
Here was my initial attempt:
for (i in nrow(df1)){
for(j in nrow(df2)){
if(df1[1, "zip5"] > df2[1, "From"] & df1[1, "zip5"] <= df2[1, "To"])
df1$newColumn = df2[j, "Territory"]
}
}
You can use data.table::foverlaps for this:
library(data.table)
dat1 <- fread(text = '
From To Territory
7501 10000 Unassigned
10001 10463 "Agent 1"
10464 10464 Unassigned
10465 11769 "Agent 2"')
dat2 <- fread(text = '
zip5 address
10009 "424 E 9TH ST APT 12, NEW YORK"
10010 "15 E 26TH ST APT 10C, NEW YORK"
10013 "310 GREENWICH ST, NEW YORK"
10019 "457 W 57TH ST, NEW YORK"')
# if you use your own data and it is not a data.table, then do this:
setDT(dat1)
setDT(dat2)
Requirements to use foverlaps:
Both frames must have two fields, a "from" and a "to". While it might seem inane since we want to determine if "zip5" is within "From" to "To", the premise of the function is to find overlaps in two ranges. Instead of putting in special-case code to allow a single column in one frame, they chose (I'm inferring) to keep it general. This means we need to copy zip5 to another column.
Both tables need to have their ranges as "keys". If there are other columns that are keys, then the range columns must be the last two. (And in order.)
# req't 1, need a range in the second frame
dat2[, zip5copy := zip5 ]
# set keys for both
setkey(dat1, From, To)
setkey(dat2, zip5, zip5copy)
And the code:
foverlaps(dat1, dat2)
# zip5 address zip5copy From To Territory
# 1: NA <NA> NA 7501 10000 Unassigned
# 2: 10009 424 E 9TH ST APT 12, NEW YORK 10009 10001 10463 Agent 1
# 3: 10010 15 E 26TH ST APT 10C, NEW YORK 10010 10001 10463 Agent 1
# 4: 10013 310 GREENWICH ST, NEW YORK 10013 10001 10463 Agent 1
# 5: 10019 457 W 57TH ST, NEW YORK 10019 10001 10463 Agent 1
# 6: NA <NA> NA 10464 10464 Unassigned
# 7: NA <NA> NA 10465 11769 Agent 2
The default when there is no match is nomatch=NA, meaning that rows of the first argument (dat1 here) with no overlap are kept and their columns from the other table are filled with NA, as above. This behaves like an outer join that keeps every row of dat1 (one ref for joins: https://stackoverflow.com/a/6188334). If you want just the matching rows, then foverlaps(..., nomatch=NULL) will give you just 4 rows. (You can also reverse the order of dat1 and dat2, but you might still need nomatch=NULL if your actual data requires it.)
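To make the reversed-order variant concrete, here is a small sketch with dat2 as the first argument and nomatch=NULL, which gives one row per address with its Territory. The tables are rebuilt with data.table() instead of fread, which is my assumption about the data types (integer ZIPs); the column selection at the end is only for readability.

```r
library(data.table)

# Assumed reconstruction of the two tables from the question
dat1 <- data.table(
  From = c(7501L, 10001L, 10464L, 10465L),
  To = c(10000L, 10463L, 10464L, 11769L),
  Territory = c("Unassigned", "Agent 1", "Unassigned", "Agent 2")
)
dat2 <- data.table(
  zip5 = c(10009L, 10010L, 10013L, 10019L),
  address = c("424 E 9TH ST APT 12, NEW YORK",
              "15 E 26TH ST APT 10C, NEW YORK",
              "310 GREENWICH ST, NEW YORK",
              "457 W 57TH ST, NEW YORK")
)

# foverlaps needs a range on both sides: use [zip5, zip5]
dat2[, zip5copy := zip5]
setkey(dat1, From, To)
setkey(dat2, zip5, zip5copy)

# dat2 first: rows follow the addresses; nomatch = NULL drops
# any ZIP that falls in no territory range
res <- foverlaps(dat2, dat1, nomatch = NULL)
res[, .(zip5, address, Territory)]
```

With the sample data every ZIP falls in the 10001 to 10463 range, so all four addresses come back as "Agent 1".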

Apply.weekly for non unique date column?

I currently have the below data.table with Name and Id recycling per day.
Date Name Id Widgets
2016-12-31 Bob Jones 0052A00001 5
2016-12-31 James Smith 0052A00002 25
2016-12-31 Tom Wilson 0052A00003 29
...
2016-01-31 Bob Jones 0052A00001 8
2016-01-31 James Smith 0052A00002 18
2016-01-31 Tom Wilson 0052A00003 20
Is it possible to apply the zoo function apply.weekly to this since there are not unique values per date? If not, what is the easiest way to aggregate this by a weekly value (or period of another length- say 4 days) and create groupings according to that?
You can build a lookup table of week groupings first, then join your data to it by date. You can play around with cut to get your desired grouping.
library(data.table)
grpWeek <- data.table(Date=seq.Date(as.Date("2016-01-01"), as.Date("2016-12-31"), by="1 day"))[,
  list(Date,
       DT_Week=week(Date),
       Week_Num=format(Date, "%W"),
       User_Week=cut(Date, breaks=52, labels=paste0("Week",1:52)))]
dt <- fread("Date,Name,Id,Widgets
2016-12-31,Bob Jones,0052A00001,5
2016-12-31,James Smith,0052A00002,25
2016-12-31,Tom Wilson,0052A00003,29
2016-01-31,Bob Jones,0052A00001,8
2016-01-31,James Smith,0052A00002,18
2016-01-31,Tom Wilson,0052A00003,20")
dt[,Date:=as.Date(Date)]
grpWeek[dt, on="Date"]
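Once a grouping column exists, the aggregation itself is an ordinary grouped sum. The following is only a sketch with made-up rows (the dates and widget counts are hypothetical, not from the question), showing both a calendar-week grouping via format(Date, "%W") and a fixed 4-day period computed from the earliest date.

```r
library(data.table)

# Hypothetical sample: several rows share a date, as in the question
dt <- data.table(
  Date = as.Date(c("2016-01-04", "2016-01-05", "2016-01-11",
                   "2016-01-04", "2016-01-12")),
  Id = c("A", "A", "A", "B", "B"),
  Widgets = c(5L, 3L, 2L, 10L, 1L)
)

# Calendar week (weeks start on Monday) and 4-day buckets since the first date
dt[, Week := format(Date, "%W")]
dt[, Period4 := as.integer(Date - min(Date)) %/% 4L]

# Non-unique dates are no problem: just sum within Id and period
weekly <- dt[, .(Widgets = sum(Widgets)), by = .(Id, Week)]
every4 <- dt[, .(Widgets = sum(Widgets)), by = .(Id, Period4)]
weekly
```

Grouping by Id as well as the period keeps each person's totals separate; drop Id from by to get one total per period.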

R count number of Team members based on Team name

I have a df where each row represents an individual and each column a characteristic of these individuals. One of the columns is TeamName, which is the name of the Team that individual belongs to. Multiple individuals belong to a Team.
I'd like a function in R that creates a new column with the number of team members for each Team.
So, for example I have:
df
Name Surname TeamName
John Smith Champions
Mary Osborne Socceroos
Mark Johnson Champions
Rory Bradon Champions
Jane Bryant Socceroos
Bruce Harper
I'd like to have
df1
Name Surname TeamName TeamNo
John Smith Champions 3
Mary Osborne Socceroos 2
Mark Johnson Champions 3
Rory Bradon Champions 3
Jane Bryant Socceroos 2
Bruce Harper 0
So as you can see the counting includes that individual too, and if someone (e.g. Bruce Harper) has no Team name, then he gets a 0.
How can I do that? Thanks!
This is a solution based on using data.table which perhaps is too much for what you need, but here it goes:
library(data.table)
dt=data.table(df)
# First, convert TeamName from factor to character
dt[,TeamName:=as.character(TeamName)]
# Now, count the members of each team
dt[,TeamNo:=.N, by='TeamName']
# Exclude the special cases (missing or blank team name)
dt[is.na(TeamName),TeamNo:=NA]
dt[TeamName=="",TeamNo:=NA]
It is clearly not the best solution, but I hope this helps
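Since the question asks for 0 rather than NA when there is no team, a small variation of the same idea could look like this. It is a sketch only: the data frame is my reconstruction of the example, with Bruce's missing team assumed to be an empty string.

```r
library(data.table)

# Assumed reconstruction of the example; blank team stored as ""
df <- data.frame(
  Name = c("John", "Mary", "Mark", "Rory", "Jane", "Bruce"),
  Surname = c("Smith", "Osborne", "Johnson", "Bradon", "Bryant", "Harper"),
  TeamName = c("Champions", "Socceroos", "Champions",
               "Champions", "Socceroos", ""),
  stringsAsFactors = FALSE
)

dt <- as.data.table(df)

# Count rows per team, then overwrite the "no team" rows with 0
dt[, TeamNo := .N, by = TeamName]
dt[is.na(TeamName) | TeamName == "", TeamNo := 0L]
dt
```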
If you need the number of unique members (based on the first two columns) for each 'TeamName', one option is n_distinct from dplyr:
library(dplyr)
library(tidyr)
df %>%
  unite(Var, Name, Surname) %>% #paste the columns together
  group_by(TeamName) %>% #group by TeamName
  mutate(TeamNo= n_distinct(Var)) %>% #create the TeamNo column
  separate(Var, into=c('Name', 'Surname')) #split the 'Var' column
Or, if it is just the number of rows per 'TeamName', we can group by 'TeamName' and create the 'TeamNo' column with mutate from the group size n(). If needed, an ifelse condition can give NA where 'TeamName' is '' or NA.
df %>%
  group_by(TeamName) %>%
  mutate(TeamNo = ifelse(is.na(TeamName)|TeamName=='', NA_integer_, n()))
# Name Surname TeamName TeamNo
#1 John Smith Champions 3
#2 Mary Osborne Socceroos 2
#3 Mark Johnson Champions 3
#4 Rory Bradon Champions 3
#5 Jane Bryant Socceroos 2
#6 Bruce Harper NA
Or you can use ave from base R. If there are both '' and NA values, I would first convert the '' to NA and then use ave to get the number of elements in each 'TeamName' group. It will give NA for the NA values. For example:
v1 <- c(df$TeamName, NA)# appending an NA with the example to show the case
is.na(v1) <- v1=='' # convert '' to NA
as.numeric(ave(v1, v1, FUN=length))
#[1] 3 2 3 3 2 NA NA
Using sqldf:
library(sqldf)
sqldf("SELECT Name, Surname, TeamName, n
FROM df
LEFT JOIN
(SELECT TeamName, COUNT(Name) AS n
FROM df
WHERE NOT TeamName IS '' GROUP BY TeamName)
USING (TeamName)")
Output:
Name Surname TeamName n
1 John Smith Champions 3
2 Mary Osborne Socceroos 2
3 Mark Johnson Champions 3
4 Rory Bradon Champions 3
5 Jane Bryant Socceroos 2
6 Bruce Harper NA

in R, customize names of columns created by dcast.data.table

I am new to reshape2 and data.table and trying to learn the syntax.
I have a data.table that I want to cast from multiple rows per grouping variable(s) to one row per grouping variable(s). For simplicity, let's make it a table of customers, some of whom share addresses.
library(data.table)
# Input table:
cust <- data.table(name=c("Betty","Joe","Frank","Wendy","Sally"),
address=c(rep("123 Sunny Rd",2),
rep("456 Cloudy Ln",2),
"789 Windy Dr"))
I want the output to have the following format:
# Desired output looks like this:
(out <- data.table(address=c("123 Sunny Rd","456 Cloudy Ln","789 Windy Dr"),
cust_1=c("Betty","Frank","Sally"),
cust_2=c("Joe","Wendy",NA)) )
# address cust_1 cust_2
# 1: 123 Sunny Rd Betty Joe
# 2: 456 Cloudy Ln Frank Wendy
# 3: 789 Windy Dr Sally NA
I would like columns for cust_1...cust_n where n is the max customers per address. I don't really care about the order--whether Joe is cust_1 and Betty is cust_2 or vice versa.
Just pushed a commit to data.table v1.9.5. dcast now:
allows casting on multiple value.var columns and multiple fun.aggregate functions
understands undefined variables/expressions in formula
With this, we can do:
dcast(cust, address ~ paste0("cust", cust[, seq_len(.N),
by=address]$V1), value.var="name")
# address cust1 cust2
# 1: 123 Sunny Rd Betty Joe
# 2: 456 Cloudy Ln Frank Wendy
# 3: 789 Windy Dr Sally NA
# My attempt:
setkey(cust,address)
x <- cust[,list(name, addr_cust_num=rank(name,ties.method="random")), by=address]
x[,addr_cust_num:=paste0("cust_",addr_cust_num)]
y <- dcast.data.table(x, address ~ addr_cust_num, value.var="name")
y
Note that I had to paste0 the "cust_" prefix. Before I added that step, I was using setnames(y, names(y), sub("(\\d+)","cust_\\1",names(y)) ) which seemed a clunkier (but probably faster) solution.
Wondering if there is a better way to do the prefixing.
Alternatively, you could just add the column directly to cust by reference:
# no need to set key
cust[, cust := paste("cust", seq_len(.N), sep="_"), by=address]
dcast.data.table(cust, address ~ cust, value.var="name")
# address cust_1 cust_2
# 1: 123 Sunny Rd Betty Joe
# 2: 456 Cloudy Ln Frank Wendy
# 3: 789 Windy Dr Sally NA
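In more recent data.table versions, rowid() can produce the within-group counter directly inside the formula, so no helper column is needed. This is a sketch that assumes a data.table version where rowid() is available (it was added after the v1.9.5 mentioned above); the input table is the one from the question.

```r
library(data.table)

cust <- data.table(
  name = c("Betty", "Joe", "Frank", "Wendy", "Sally"),
  address = c(rep("123 Sunny Rd", 2),
              rep("456 Cloudy Ln", 2),
              "789 Windy Dr")
)

# rowid(address) numbers the rows within each address (1, 2, ...);
# paste0 adds the prefix so the new columns are named cust_1, cust_2, ...
wide <- dcast(cust, address ~ paste0("cust_", rowid(address)),
              value.var = "name")
wide
```

Unlike the by-reference approach above, this leaves cust itself unchanged.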
