I am having a little problem "filling in gaps". It's not really a missing-data question, it's more of a merging problem, and my merge isn't working well.
So, my data looks like this:
id  name             region  Company
1   John Smith       West    Walmart
1   John Smith       West    Amazon
1   John Smith
1   John Smith       West    P&G
2   Jane Smith       South   Apple
2   Jane Smith
3   Richard Burkett
3   Richard Burkett  West    Walmart
And so on.
What I want to do is fill in those gaps in the region variable by id. So for id 1, John Smith, the third row should have West in the region column, and Jane Smith's region should be filled in with "South" where it is missing.
I've tried creating a separate dataset and then merging it back in by id, but that creates duplicate rows and increases the N by something like 14 times (no idea why).
region1 <- subset(df1, df1$region=="DC" | df1$region=="Midwest" | df1$region=="Northeast" | df1$region=="South" | df1$region=="West")
region <- region1[, c("id", "region")]
df2 <- merge(df1, region, by="id")
I've checked the structure of the variables: the id variable is an integer and region is a factor. I think there should be a super simple way to do this but I'm just not getting it. Any ideas?
Thank you in advance.
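(As an aside on the merge attempt above: the row count most likely explodes because region1 still contains one row per company, so every id matches several lookup rows. Deduplicating the lookup first avoids that; a base R sketch, assuming each id maps to a single region:)
region1 <- subset(df1, region %in% c("DC", "Midwest", "Northeast", "South", "West"))
region  <- unique(region1[, c("id", "region")])              # one row per id
df2     <- merge(df1[, c("id", "name", "Company")], region,
                 by = "id", all.x = TRUE)                    # all.x keeps ids with no region at all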
Here's a base R solution. Suppose your data.frame is df:
regions <- sapply(split(df$region, df$id), function(x) {
  ind <- is.na(x)          # positions of the missing regions
  x[ind] <- x[!ind][1]     # fill them with the group's first non-missing value
  x
})
df$region <- unlist(regions)  # note: unlist() follows the split (id) order, so this assumes rows are sorted by id
df
id name region Company
1 1 John Smith West Walmart
2 1 John Smith West Amazon
3 1 John Smith West <NA>
4 1 John Smith West P&G
5 2 Jane Smith South Apple
6 2 Jane Smith South <NA>
7 3 Richard Burkett West Walmart
8 3 Richard Burkett West <NA>
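A closely related base R variant uses ave(), which writes the filled values back into the original row positions, so it does not rely on the rows being sorted by id (a minimal sketch, assuming region is stored as character, as in the dput data further down):
# For each id, replace NAs with the group's first non-missing region
df$region <- ave(df$region, df$id,
                 FUN = function(x) replace(x, is.na(x), x[!is.na(x)][1]))
df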
I would use dplyr::arrange followed by tidyr::fill
library(dplyr)
library(tidyr)
data.frame(id = c(1, 1, 1, 1, 2, 2, 3, 3),
           name = c(rep("John Smith", 4), rep("Jane Smith", 2), rep("Richard Burkett", 2)),
           region = c("West", "West", NA, "West", "South", NA, "West", NA),
           Company = c("Walmart", "Amazon", NA, "P&G", "Apple", NA, "Walmart", NA)) %>%
  arrange(id, name) %>%
  fill(region)
Results in:
id name region Company
1 1 John Smith West Walmart
2 1 John Smith West Amazon
3 1 John Smith West NA
4 1 John Smith West P&G
5 2 Jane Smith South Apple
6 2 Jane Smith South NA
7 3 Richard Burkett West Walmart
8 3 Richard Burkett West NA
The solution that should work is group_by on id followed by fill. Ideally, to cover the OP's situation, it should fill in both directions.
library(tidyverse)
df %>%
  group_by(id) %>%
  fill(region) %>%
  fill(region, .direction = "up")
# id name region Company
# <int> <chr> <chr> <chr>
#1 1 John Smith West Walmart
#2 1 John Smith West Amazon
#3 1 John Smith West <NA>
#4 1 John Smith West P&G
#5 2 Jane Smith South Apple
#6 2 Jane Smith South <NA>
#7 3 Richard Burkett West Walmart
#8 3 Richard Burkett West <NA>
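Note that tidyr (>= 1.0.0) also lets you do both passes in one call via the .direction argument, so the two fill() steps above could be collapsed to:
df %>%
  group_by(id) %>%
  fill(region, .direction = "downup") %>%
  ungroup()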
Data
structure(list(id = c(1L, 1L, 1L, 1L, 2L, 2L, 3L, 3L), name = c("John Smith",
"John Smith", "John Smith", "John Smith", "Jane Smith", "Jane Smith",
"Richard Burkett", "Richard Burkett"), region = c("West", "West",
NA, "West", "South", NA, "West", NA), Company = c("Walmart",
"Amazon", NA, "P&G", "Apple", NA, "Walmart", NA)), .Names = c("id",
"name", "region", "Company"), class = "data.frame", row.names = c(NA,
-8L))
I would like to do exact joins on the year and state columns, but a fuzzy join between the "name" and "versus" columns:
year <- c("2002", "2002", "1999", "1999", "1997", "2002")
state <- c("TN", "TN", "AL", "AL", "CA", "TN")
name <- c("George", "Sally", "David", "Laura", "John", "Kate")
df1 <- data.frame(year, state, name)
year <- c("2002", "1999")
state <- c("TN", "AL")
versus <- c("# george v. SALLY", "#laura v. dAvid")
df2 <- data.frame(year, state, versus)
My preferred output would be the following:
year <- c("2002", "2002", "1999", "1999", "1997", "2002")
state <- c("TN", "TN", "AL", "AL", "CA", "TN")
name <- c("George", "Sally", "David", "Laura", "John", "Kate")
versus <- c("# george v. SALLY", "# george v. SALLY", "#laura v. dAvid", "#laura v. dAvid", NA, NA)
df3 <- data.frame(year, state, name, versus)
I've tried variations of the following:
library(fuzzyjoin)
stringdist_left_join(df1, df2, by = c("year", "state", "name" = "versus"), method = "hamming")
stringdist_left_join(df1, df2, by = c("year", "state"), method = "hamming")
And they don't seem to get close to what I want.
I'm wondering if I'll need to split up the "versus" column (remove all special characters and delimit the names) or if there's a way to accomplish this with something within fuzzyjoin. Any guidance would be appreciated.
A simple approach, which depends somewhat on the structure of df2$versus, would be this:
library(dplyr)
left_join(df1, df2, by = c("year", "state")) %>%
  rowwise() %>%
  mutate(versus = if_else(grepl(name, versus, ignore.case = TRUE), versus, NA_character_))
Output:
year state name versus
<chr> <chr> <chr> <chr>
1 2002 TN George # george v. SALLY
2 2002 TN Sally # george v. SALLY
3 1999 AL David #laura v. dAvid
4 1999 AL Laura #laura v. dAvid
5 1997 CA John NA
6 2002 TN Kate NA
Update (Jul 14, 2022):
If name has a more complicated pattern than a single word (say Molly Homes, Jane Doe), we need a way to retrieve its individual whole words and check whether any of them appear (case-insensitively) within the versus column. Here is one simple way to do this:
Create a function f(n, v) that takes strings n and v, extracts the whole words (wrds) from n, and counts how many of them are found in v; it returns TRUE if this count exceeds 0.
f <- function(n, v) {
  wrds = stringr::str_extract_all(n, "\\b\\w*\\b")[[1]]
  sum(sapply(wrds[which(nchar(wrds) > 1)], grepl, x = v, ignore.case = TRUE)) > 0
}
Left join the original frames, and apply f() by row
left_join(df1, df2, by = c("year", "state")) %>%
  rowwise() %>%
  mutate(versus = if_else(f(name, versus), versus, NA_character_))
Output:
  year  state name                  versus
1 2002  TN    Molly Homes, Jane Doe Homes (v. Vista)
2 2002  TN    Sally                 NA
3 1999  AL    David                 #laura v. dAvid
4 1999  AL    Laura                 #laura v. dAvid
5 1997  CA    John                  NA
6 2002  TN    Kate                  NA
Input:
df1 = structure(list(year = c("2002", "2002", "1999", "1999", "1997",
"2002"), state = c("TN", "TN", "AL", "AL", "CA", "TN"), name = c("Molly Homes, Jane Doe",
"Sally", "David", "Laura", "John", "Kate")), class = "data.frame", row.names = c(NA,
-6L))
df2 = structure(list(year = c("2002", "1999"), state = c("TN", "AL"
), versus = c("Homes (v. Vista)", "#laura v. dAvid")), class = "data.frame", row.names = c(NA,
-2L))
Update (Jul 15, 2022):
See the comment. In that case one would want to check versus for a match against each individual name in name. This could be done like this (using #langtang's 'new' data):
df1 |>
  left_join(df2, by = c("year", "state")) |>
  rowwise() |>
  mutate(versus = if_else(str_detect(tolower(versus),
                                     paste0(unlist(str_extract_all(tolower(name), "\\w+")), collapse = "|")),
                          versus, NA_character_)) |>
  ungroup()
Output:
# A tibble: 6 × 4
year state name versus
<chr> <chr> <chr> <chr>
1 2002 TN Molly Homes, Jane Doe Homes (v. Vista)
2 2002 TN Sally NA
3 1999 AL David #laura v. dAvid
4 1999 AL Laura #laura v. dAvid
5 1997 CA John NA
6 2002 TN Kate NA
Old answer:
An approach could be:
library(tidyverse)
df1 |>
  left_join(df2) |>
  group_by(state) |>
  mutate(versus = if_else(str_detect(tolower(versus), tolower(name)), versus, NA_character_)) |>
  ungroup()
Output:
# A tibble: 6 × 4
year state name versus
<chr> <chr> <chr> <chr>
1 2002 TN George # george v. SALLY
2 2002 TN Sally # george v. SALLY
3 1999 AL David #laura v. dAvid
4 1999 AL Laura #laura v. dAvid
5 1997 CA John NA
6 2002 TN Kate NA
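As the question itself suggests, another route is to split versus into individual lower-cased words and then join exactly. A minimal sketch for the original single-word names in df1/df2 (name_key is just a temporary helper column introduced here):
library(dplyr)
library(tidyr)
library(stringr)

# One row per word extracted from 'versus', lower-cased
df2_long <- df2 %>%
  mutate(name_key = str_extract_all(str_to_lower(versus), "[a-z]+")) %>%
  unnest(name_key)

# Exact join on year, state and the normalised name
df1 %>%
  mutate(name_key = str_to_lower(name)) %>%
  left_join(df2_long, by = c("year", "state", "name_key")) %>%
  select(-name_key)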
I am trying to create a rolling average calculation in R, but I only have data for certain dates within a range. Instead of dates with no data being omitted from the rolling average calculation entirely, I would like those dates to count as 0 in it.
I have created an example data frame to demonstrate, but in my actual data set I obviously have many more dates and names. Additionally, I am trying to figure out how to have every name appear on each of the added zero-hour days.
Here is the code for an example DF
Date <- as.Date(c('2021-01-01','2021-01-01','2021-01-01', '2021-01-03','2021-01-03','2021-01-03',
                  '2021-01-04','2021-01-04','2021-01-04', '2021-02-02','2021-02-02','2021-02-02',
                  '2021-02-03','2021-02-03','2021-02-03', '2021-03-01','2021-03-01','2021-03-01'))
Name <- rep(c("John Smith", "Jane Peters", "Jim Clark"), 6)
Hours <- floor(runif(18, min = 0, max = 14))
data.frame(Date, Name, Hours)
With the above dataframe looking like this
Date Name Hours
1 2021-01-01 John Smith 11
2 2021-01-01 Jane Peters 9
3 2021-01-01 Jim Clark 6
4 2021-01-03 John Smith 7
5 2021-01-03 Jane Peters 1
6 2021-01-03 Jim Clark 9
7 2021-01-04 John Smith 2
8 2021-01-04 Jane Peters 4
9 2021-01-04 Jim Clark 10
10 2021-02-02 John Smith 2
11 2021-02-02 Jane Peters 1
12 2021-02-02 Jim Clark 3
13 2021-02-03 John Smith 8
14 2021-02-03 Jane Peters 7
15 2021-02-03 Jim Clark 0
16 2021-03-01 John Smith 11
17 2021-03-01 Jane Peters 6
18 2021-03-01 Jim Clark 8
To illustrate, I would like to add a row with Hours = 0 for each person on every day where there is no data.
The end result would look something like this for the first few days of January, extending through the end date at the beginning of March.
1  2021-01-01 John Smith  11
2  2021-01-01 Jane Peters  9
3  2021-01-01 Jim Clark    6
4  2021-01-02 John Smith   0
5  2021-01-02 Jane Peters  0
6  2021-01-02 Jim Clark    0
7  2021-01-03 John Smith   7
8  2021-01-03 Jane Peters  1
9  2021-01-03 Jim Clark    9
10 2021-01-04 John Smith   2
11 2021-01-04 Jane Peters  4
12 2021-01-04 Jim Clark   10
13 2021-01-05 John Smith   0
14 2021-01-05 Jane Peters  0
15 2021-01-05 Jim Clark    0
This way, I will be able to add the days where there is no data as 0's into my rolling average calculation.
It sounds like you are looking for the tidyr::complete function. That would look like this
library(tidyr)
complete(mydata, Date = seq(min(Date), max(Date), by = "1 day"),
         Name, fill = list(Hours = 0))
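With the gaps filled in, the rolling average itself could then be computed per person, for example with zoo::rollmean. A sketch assuming a 7-day right-aligned window (the window size is an assumption) and the example data stored as mydata:
library(dplyr)
library(tidyr)
library(zoo)

filled <- complete(mydata, Date = seq(min(Date), max(Date), by = "1 day"),
                   Name, fill = list(Hours = 0))

filled %>%
  group_by(Name) %>%
  arrange(Date, .by_group = TRUE) %>%
  mutate(Roll7 = rollmean(Hours, k = 7, fill = NA, align = "right")) %>%  # NA for the first 6 days
  ungroup()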
With just base R, you could merge your data with an expand.grid:
res <- merge(dat,
             expand.grid(Date = seq(as.Date('2021-01-01'), as.Date('2021-03-01'), 'days'),
                         Name = unique(dat$Name)),
             all = TRUE)
res <- transform(res, Hours = replace(Hours, is.na(Hours), 0))
head(res)
# Date Name Hours
# 1 2021-01-01 Jane Peters 9
# 2 2021-01-01 Jim Clark 6
# 3 2021-01-01 John Smith 11
# 4 2021-01-02 Jane Peters 0
# 5 2021-01-02 Jim Clark 0
# 6 2021-01-02 John Smith 0
Data
dat <- structure(list(Date = structure(c(18628, 18628, 18628, 18630,
18630, 18630, 18631, 18631, 18631, 18660, 18660, 18660, 18661,
18661, 18661, 18687, 18687, 18687), class = "Date"), Name = c("John Smith",
"Jane Peters", "Jim Clark", "John Smith", "Jane Peters", "Jim Clark",
"John Smith", "Jane Peters", "Jim Clark", "John Smith", "Jane Peters",
"Jim Clark", "John Smith", "Jane Peters", "Jim Clark", "John Smith",
"Jane Peters", "Jim Clark"), Hours = c(11L, 9L, 6L, 7L, 1L, 9L,
2L, 4L, 10L, 2L, 1L, 3L, 8L, 7L, 0L, 11L, 6L, 8L)), row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13",
"14", "15", "16", "17", "18"), class = "data.frame")
If I have something like this:
id date type location
1 1/2/2021 student Georgia
1 Maine
1 New York
2 2/20/2021 teacher Florida
2 Louisiana
3 2/15/2021 student New York
3 Maine
3 California
3 Florida
I want to copy the values in the "date" and "type" column based on the value for "id":
id date type location
1 1/2/2021 student Georgia
1 1/2/2021 student Maine
1 1/2/2021 student New York
2 2/20/2021 teacher Florida
2 2/20/2021 teacher Louisiana
3 2/15/2021 student New York
3 2/15/2021 student Maine
3 2/15/2021 student California
3 2/15/2021 student Florida
I know this has probably been asked before, but I couldn't figure out how to phrase the question. Thanks for your help!
We can convert the blanks "" to NA and use fill
library(dplyr)
library(tidyr)
df1 %>%
  mutate(across(c(date, type), ~ na_if(.x, ""))) %>%
  group_by(id) %>%
  fill(date, type)
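If the row that carries date/type is not guaranteed to come first within an id, a hedged variant is to fill in both directions (tidyr >= 1.0.0):
library(dplyr)
library(tidyr)

df1 %>%
  mutate(across(c(date, type), ~ na_if(.x, ""))) %>%  # blanks to NA, as above
  group_by(id) %>%
  fill(date, type, .direction = "downup")             # fill down, then up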
Or, another option is to replace the values with the first element per 'id':
df %>%
  group_by(id) %>%
  mutate(across(c(date, type), first))
Here is a data.table option:
library(data.table)
setDT(df)[, lapply(.SD, function(x) replace(x, is.na(x), na.omit(x))), id]
id date type location
1: 1 1/2/2021 student Georgia
2: 1 1/2/2021 student Maine
3: 1 1/2/2021 student New York
4: 2 2/20/2021 teacher Florida
5: 2 2/20/2021 teacher Louisiana
6: 3 2/15/2021 student New York
7: 3 2/15/2021 student Maine
8: 3 2/15/2021 student California
9: 3 2/15/2021 student Florida
Data
dput(df)
structure(list(id = c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L, 3L), date = c("1/2/2021",
NA, NA, "2/20/2021", NA, "2/15/2021", NA, NA, NA), type = c("student",
NA, NA, "teacher", NA, "student", NA, NA, NA), location = c("Georgia",
"Maine", "New York", "Florida", "Louisiana", "New York", "Maine",
"California", "Florida")), class = "data.frame", row.names = c(NA,
-9L))
I am trying to group my dataset by multiple variables and build a frequency table of the number of times a character variable appears. Here is an example data set:
Location  State     County  Job       Pet
          Ohio      Miami   Data      Dog
Urban     Ohio      Miami   Business  Dog, Cat
Urban     Ohio      Miami   Data      Cat
Rural     Kentucky  Clark   Data      Cat, Fish
City      Indiana   Shelby  Business  Dog
Rural     Kentucky  Clark   Data      Dog, Fish
          Ohio      Miami   Data      Dog, Cat
Urban     Ohio      Miami   Business  Dog, Cat
Rural     Kentucky  Clark   Data      Fish
City      Indiana   Shelby  Business  Cat
I want my output to look like this:
Location  State     County  Job       Frequency  Pet:Cat  Pet:Dog  Pet:Fish
          Ohio      Miami   Data      2          1        2        0
Urban     Ohio      Miami   Business  2          2        2        0
Urban     Ohio      Miami   Data      1          1        0        0
Rural     Kentucky  Clark   Data      3          1        1        3
City      Indiana   Shelby  Business  2          1        1        0
I have tried different iterations of the following code, and I get close, but not quite right:
Output <- df %>%
  group_by(Location, State, County, Job) %>%
  dplyr::summarise(
    Frequency = dplyr::n(),
    `Pet:Cat` = count(str_match(Pet, "Cat")),
    `Pet:Dog` = count(str_match(Pet, "Dog")),
    `Pet:Fish` = count(str_match(Pet, "Fish"))
  )
Any help would be appreciated! Thank you in advance
Try this:
library(dplyr)
library(tidyr)
#Code
new <- df %>%
  separate_rows(Pet, sep = ',') %>%                       # one row per pet
  mutate(Pet = trimws(Pet)) %>%
  group_by(Location, State, County, Job, Pet) %>%
  summarise(N = n()) %>%                                  # count of each pet within a group
  mutate(Pet = paste0('Pet:', Pet)) %>%
  group_by(Location, State, County, Job, .drop = F) %>%
  mutate(Freq = n()) %>%                                  # number of pet-type rows per group
  pivot_wider(names_from = Pet, values_from = N, values_fill = 0)
Output:
# A tibble: 5 x 8
# Groups: Location, State, County, Job [5]
Location State County Job Freq `Pet:Cat` `Pet:Dog` `Pet:Fish`
<chr> <chr> <chr> <chr> <int> <int> <int> <int>
1 "" Ohio Miami Data 2 1 2 0
2 "City" Indiana Shelby Business 2 1 1 0
3 "Rural" Kentucky Clark Data 3 1 1 3
4 "Urban" Ohio Miami Business 2 2 2 0
5 "Urban" Ohio Miami Data 1 1 0 0
Some data used:
#Data
df <- structure(list(Location = c("", "Urban", "Urban", "Rural", "City",
"Rural", "", "Urban", "Rural", "City"), State = c("Ohio", "Ohio",
"Ohio", "Kentucky", "Indiana", "Kentucky", "Ohio", "Ohio", "Kentucky",
"Indiana"), County = c("Miami", "Miami", "Miami", "Clark", "Shelby",
"Clark", "Miami", "Miami", "Clark", "Shelby"), Job = c("Data",
"Business", "Data", "Data", "Business", "Data", "Data", "Business",
"Data", "Business"), Pet = c("Dog", "Dog, Cat", "Cat", "Cat, Fish",
"Dog", "Dog, Fish", "Dog, Cat", "Dog, Cat", "Fish", "Cat")), row.names = c(NA,
-10L), class = "data.frame")
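A slightly different pipeline (a sketch against the same df) computes Frequency from the original rows before splitting Pet, so it is the row count per group rather than the number of pet types; in this example the two happen to coincide:
library(dplyr)
library(tidyr)

df %>%
  group_by(Location, State, County, Job) %>%
  mutate(Frequency = n()) %>%                 # row count per group, taken before splitting Pet
  ungroup() %>%
  separate_rows(Pet, sep = ",\\s*") %>%
  count(Location, State, County, Job, Frequency, Pet) %>%
  mutate(Pet = paste0("Pet:", Pet)) %>%
  pivot_wider(names_from = Pet, values_from = n, values_fill = 0)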
I have this dataframe:
First.Name Last.Name Country Unit Hospital
John Mars UK Sales South
John Mars UK Sales South
John Mars UK Sales South
Lisa Smith USA HHRR North
Lisa Smith USA HHRR North
and this other:
First.Name Last.Name ID
John Valjean 1254
Peter Smith 1255
Frank Mars 1256
Marie Valjean 1257
Lisa Smith 1258
John Mars 1259
and I would like to merge them or paste them together so that df1 gets the matching ID attached:
First.Name Last.Name Country Unit Hospital ID
John Mars UK Sales South 1259
John Mars UK Sales South 1259
John Mars UK Sales South 1259
Lisa Smith USA HHRR North 1258
Lisa Smith USA HHRR North 1258
I tried x = merge(df1, df2, by.y=c('Last.Name','First.Name')), but it doesn't seem to work. I also tried x = df1[c(df1$Last.Name, df1$First.Name) %in% c(df2$Last.Name, df2$First.Name),], and that doesn't work either.
When using merge, you have to be careful with its arguments, especially by, by.x, by.y, all, all.x and all.y. Each of these arguments is described in the documentation (?merge).
Based on this, try out:
merge(df1, df2, by = c('First.Name', 'Last.Name')) # see #Sotos's comment
# output
First.Name Last.Name Country Unit Hospital ID
1 John Mars UK Sales South 1259
2 John Mars UK Sales South 1259
3 John Mars UK Sales South 1259
4 Lisa Smith USA HHRR North 1258
5 Lisa Smith USA HHRR North 1258
merge(df1, df2, by.x = c('Last.Name', 'First.Name'),
      by.y = c('Last.Name', 'First.Name')) # in your code, you set by.y but not by.x
# output
Last.Name First.Name Country Unit Hospital ID
1 Mars John UK Sales South 1259
2 Mars John UK Sales South 1259
3 Mars John UK Sales South 1259
4 Smith Lisa USA HHRR North 1258
5 Smith Lisa USA HHRR North 1258
# by in dplyr::left_join() works like by in merge()
dplyr::left_join(df1, df2, by = c('First.Name', 'Last.Name')) # see #tmfmnk's comment
# output
First.Name Last.Name Country Unit Hospital ID
1 John Mars UK Sales South 1259
2 John Mars UK Sales South 1259
3 John Mars UK Sales South 1259
4 Lisa Smith USA HHRR North 1258
5 Lisa Smith USA HHRR North 1258
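One more caveat: merge() keeps only matching rows by default (an inner join). If df1 could contain people with no counterpart in df2 and those rows should survive, pass all.x = TRUE; dplyr::left_join() already behaves that way:
merge(df1, df2, by = c('First.Name', 'Last.Name'), all.x = TRUE)  # keep every row of df1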
Data
df1 <- structure(list(First.Name = c("John", "John", "John", "Lisa",
"Lisa"), Last.Name = c("Mars", "Mars", "Mars", "Smith", "Smith"
), Country = c("UK", "UK", "UK", "USA", "USA"), Unit = c("Sales",
"Sales", "Sales", "HHRR", "HHRR"), Hospital = c("South", "South",
"South", "North", "North")), .Names = c("First.Name", "Last.Name",
"Country", "Unit", "Hospital"), class = "data.frame", row.names = c(NA,
-5L))
df2 <- structure(list(First.Name = c("John", "Peter", "Frank", "Marie",
"Lisa", "John"), Last.Name = c("Valjean", "Smith", "Mars", "Valjean",
"Smith", "Mars"), ID = 1254:1259), .Names = c("First.Name", "Last.Name",
"ID"), class = "data.frame", row.names = c(NA, -6L))