Adding Dates and Names Together in R DataFrame

I am trying to create a rolling average calculation in R, but I only have data for certain dates within a range. However, instead of having dates where there is no data be totally omitted from the rolling average calculation, I would like the dates where there is no data to be counted as 0 in the rolling average calculation.
I have created an example data frame to demonstrate, but my actual data set has many more dates and names. I am also trying to find a way to add a row for every name on each day that has no data, with Hours set to 0.
Here is the code for an example DF
Date <- as.Date(c('2021-01-01','2021-01-01','2021-01-01', '2021-01-03','2021-01-03','2021-01-03', '2021-01-04','2021-01-04','2021-01-04', '2021-02-02','2021-02-02','2021-02-02', '2021-02-03','2021-02-03','2021-02-03', '2021-03-01','2021-03-01','2021-03-01'))
Name <- c("John Smith", "Jane Peters", "Jim Clark","John Smith", "Jane Peters", "Jim Clark","John Smith", "Jane Peters", "Jim Clark","John Smith", "Jane Peters", "Jim Clark","John Smith", "Jane Peters", "Jim Clark","John Smith", "Jane Peters", "Jim Clark")
Hours <- c(floor(runif(18, min=0, max = 14)))
mydata <- data.frame(Date, Name, Hours)
With the above dataframe looking like this
Date Name Hours
1 2021-01-01 John Smith 11
2 2021-01-01 Jane Peters 9
3 2021-01-01 Jim Clark 6
4 2021-01-03 John Smith 7
5 2021-01-03 Jane Peters 1
6 2021-01-03 Jim Clark 9
7 2021-01-04 John Smith 2
8 2021-01-04 Jane Peters 4
9 2021-01-04 Jim Clark 10
10 2021-02-02 John Smith 2
11 2021-02-02 Jane Peters 1
12 2021-02-02 Jim Clark 3
13 2021-02-03 John Smith 8
14 2021-02-03 Jane Peters 7
15 2021-02-03 Jim Clark 0
16 2021-03-01 John Smith 11
17 2021-03-01 Jane Peters 6
18 2021-03-01 Jim Clark 8
To illustrate, I would like to add a row with "Hours" equal to 0 for each person on each day where there is no data.
The end result would look something like this for the first few days in January, but extending until the end date at the beginning of March.
1 2021-01-01 John Smith 11
2 2021-01-01 Jane Peters 9
3 2021-01-01 Jim Clark 6
4 2021-01-02 John Smith 0
5 2021-01-02 Jane Peters 0
6 2021-01-02 Jim Clark 0
7 2021-01-03 John Smith 7
8 2021-01-03 Jane Peters 1
9 2021-01-03 Jim Clark 9
10 2021-01-04 John Smith 2
11 2021-01-04 Jane Peters 4
12 2021-01-04 Jim Clark 10
13 2021-01-05 John Smith 0
14 2021-01-05 Jane Peters 0
15 2021-01-05 Jim Clark 0
This way, I will be able to add the days where there is no data as 0's into my rolling average calculation.

It sounds like you are looking for the tidyr::complete function. That would look like this
library(tidyr)
complete(mydata, Date=seq(min(Date), max(Date), by="1 day"), Name, fill=list(Hours=0))

Just with base R you could merge your data to an expand.grid.
res <- merge(dat,
             expand.grid(
               Date=seq(as.Date('2021-01-01'), as.Date('2021-03-01'), 'days'),
               Name=unique(dat$Name)),
             all=TRUE)
res <- transform(res, Hours=replace(Hours, is.na(Hours), 0))
head(res)
# Date Name Hours
# 1 2021-01-01 Jane Peters 9
# 2 2021-01-01 Jim Clark 6
# 3 2021-01-01 John Smith 11
# 4 2021-01-02 Jane Peters 0
# 5 2021-01-02 Jim Clark 0
# 6 2021-01-02 John Smith 0
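With the zero rows in place, the rolling average from the question can be computed per person. The following is only a minimal base R sketch: the small data set, the window length k and the column name RollMean are assumptions made for illustration, and stats::filter is just one of several ways to get a trailing mean.

```r
# Small illustrative data set (values are made up for this sketch)
dat <- data.frame(
  Date  = as.Date(c("2021-01-01", "2021-01-01", "2021-01-03", "2021-01-03")),
  Name  = c("John Smith", "Jane Peters", "John Smith", "Jane Peters"),
  Hours = c(11, 9, 7, 1),
  stringsAsFactors = FALSE
)

# Every Date x Name combination in the observed date range
grid <- expand.grid(
  Date = seq(min(dat$Date), max(dat$Date), by = "day"),
  Name = unique(dat$Name),
  stringsAsFactors = FALSE
)

# Keep every grid row, then treat days without data as 0 hours
res <- merge(grid, dat, all.x = TRUE)
res$Hours[is.na(res$Hours)] <- 0
res <- res[order(res$Name, res$Date), ]

# Trailing k-day mean per person (assumed window length).
# The first k - 1 days of each person are NA, and each person
# needs at least k rows for stats::filter to run.
k <- 2
res$RollMean <- ave(res$Hours, res$Name, FUN = function(x) {
  as.numeric(stats::filter(x, rep(1 / k, k), sides = 1))
})
res
```

Because the zeros are real rows, they pull the average down instead of being skipped. zoo::rollmean is a common alternative if an extra package is acceptable.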
Data
dat <- structure(list(Date = structure(c(18628, 18628, 18628, 18630,
18630, 18630, 18631, 18631, 18631, 18660, 18660, 18660, 18661,
18661, 18661, 18687, 18687, 18687), class = "Date"), Name = c("John Smith",
"Jane Peters", "Jim Clark", "John Smith", "Jane Peters", "Jim Clark",
"John Smith", "Jane Peters", "Jim Clark", "John Smith", "Jane Peters",
"Jim Clark", "John Smith", "Jane Peters", "Jim Clark", "John Smith",
"Jane Peters", "Jim Clark"), Hours = c(11L, 9L, 6L, 7L, 1L, 9L,
2L, 4L, 10L, 2L, 1L, 3L, 8L, 7L, 0L, 11L, 6L, 8L)), row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13",
"14", "15", "16", "17", "18"), class = "data.frame")

Related

Running a string against multiple match dataframes

I have a dataset of text strings that look something like this:
strings <- structure(list(string = c("Jennifer Rae Hancock Brown", "Lisa Smith Houston Blogger",
"Tina Fay Las Cruces", "\t\nJamie Tucker Style Expert", "Jessica Wright Htx Satx",
"Julie Green Lifestyle Blogger", "Mike S Thomas Football Player",
"Tiny Fitness Houston Studio")), class = "data.frame", row.names = c(NA,
-8L))
I am trying to evaluate matches in those strings against two different datasets called firstname and lastname that look as such:
firstname <- structure(list(firstnames = c("Jennifer", "Lisa", "Tina", "Jamie",
"Jessica", "Julie", "Mike", "George")), class = "data.frame", row.names = c(NA,
-8L))
lastname <- structure(list(lastnames = c("Hancock", "Smith", "Houston", "Fay",
"Tucker", "Wright", "Green", "Thomas")), class = "data.frame", row.names = c(NA,
-8L))
The first thing I would like to do is remove everything after the first three words in each string, so "Jennifer Rae Hancock Brown" would become "Jennifer Rae Hancock" and "Lisa Smith Houston Blogger" would become "Lisa Smith Houston".
After that, I want to check whether the first word of each string matches anything in the firstname dataframe. If it does, the match is stored in a new column of the final table called firstname. If it doesn't, the result is simply "N/A".
Then I'd like to evaluate the remaining words against the lastname dataframe. There can be multiple matches (as in the "Lisa Smith Houston" example), and in that case both results should be stored in the final dataframe, one per row.
The final dataframe should look like this:
final <- structure(list(string = c("Jennifer Rae Hancock Brown", "Lisa Smith Houston Blogger",
"Lisa Smith Houston Blogger", "Tina Fay Las Cruces", "\t\nJamie Tucker Style Expert",
"Jessica Wright Htx Satx", "Julie Green Lifestyle Blogger", "Mike S Thomas Football Player",
"Tiny George Fitness Houston Studio"), firstname = c("Jennifer",
"Lisa", "Lisa", "Tina", "Jamie", "Jessica", "Julie", "Mike",
"N/A"), lastname = c("Hancock", "Smith", "Houston", "Fay", "Tucker",
"Wright", "Green", "Thomas", "N/A")), class = "data.frame", row.names = c(NA,
-9L))
What would be the most effective way to go about doing this?
We can first extract the first three words into a helper column 'string2', then use str_extract_all on 'string2' with a pattern built by collapsing the firstnames (and, separately, the lastnames) vector into a single string with | (regex OR) as delimiter. This returns list columns, which unnest then expands into one row per match.
library(dplyr)
library(stringr)
library(tidyr)
strings %>%
  mutate(string2 = str_extract(trimws(string), "^\\S+\\s+\\S+\\s+\\S+"),
         firstname = str_extract_all(string2,
                                     str_c(firstname$firstnames, collapse = "|")),
         lastname = str_extract_all(string2,
                                    str_c(lastname$lastnames, collapse = "|"))) %>%
  unnest(where(is.list), keep_empty = TRUE) %>%
  select(-string2) %>%
  mutate(lastname = case_when(complete.cases(firstname) ~ lastname))
-output
# A tibble: 9 × 3
string firstname lastname
<chr> <chr> <chr>
1 "Jennifer Rae Hancock Brown" Jennifer Hancock
2 "Lisa Smith Houston Blogger" Lisa Smith
3 "Lisa Smith Houston Blogger" Lisa Houston
4 "Tina Fay Las Cruces" Tina Fay
5 "\t\nJamie Tucker Style Expert" Jamie Tucker
6 "Jessica Wright Htx Satx" Jessica Wright
7 "Julie Green Lifestyle Blogger" Julie Green
8 "Mike S Thomas Football Player" Mike Thomas
9 "Tiny Fitness Houston Studio" <NA> <NA>
OP's expected
> final
string firstname lastname
1 Jennifer Rae Hancock Brown Jennifer Hancock
2 Lisa Smith Houston Blogger Lisa Smith
3 Lisa Smith Houston Blogger Lisa Houston
4 Tina Fay Las Cruces Tina Fay
5 \t\nJamie Tucker Style Expert Jamie Tucker
6 Jessica Wright Htx Satx Jessica Wright
7 Julie Green Lifestyle Blogger Julie Green
8 Mike S Thomas Football Player Mike Thomas
9 Tiny George Fitness Houston Studio N/A N/A
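For comparison, the same three steps (keep the first three words, match the first word, match the remaining words) can be sketched in base R. This is a hedged illustration rather than the answer's method: it uses shortened, made-up copies of the inputs, and it matches whole words with %in% instead of a regular expression, so a name that only appears as part of a longer word would not count as a match.

```r
# Shortened illustrative inputs (subset of the question's data)
strings    <- c("Jennifer Rae Hancock Brown", "Lisa Smith Houston Blogger",
                "\t\nJamie Tucker Style Expert", "Tiny Fitness Houston Studio")
firstnames <- c("Jennifer", "Lisa", "Jamie")
lastnames  <- c("Hancock", "Smith", "Houston", "Tucker")

# Step 1: strip leading whitespace, split on whitespace, keep three words
first3 <- lapply(strsplit(trimws(strings), "\\s+"), head, 3)

# Step 2: the first word must be a known first name, otherwise NA
first <- vapply(first3, function(w) {
  if (w[1] %in% firstnames) w[1] else NA_character_
}, character(1))

# Step 3: any of the remaining words that is a known last name
last <- lapply(first3, function(w) {
  m <- w[-1][w[-1] %in% lastnames]
  if (length(m)) m else NA_character_
})

# As in the answer above, drop last names when no first name matched
last[is.na(first)] <- list(NA_character_)

# One output row per last-name match
final <- data.frame(
  string    = rep(strings, lengths(last)),
  firstname = rep(first, lengths(last)),
  lastname  = unlist(last),
  stringsAsFactors = FALSE
)
final
```

Replacing the NA values with the literal string "N/A" afterwards would reproduce the format the question asks for.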

R- How to edit repeated records?

I'm reading data from a csv. This is what my data looks like. There are some records that have the same label/question on different days. I want to add numbers to the repeated questions.
UserID Full Name DOB EncounterID Date Type label responses
1 John Smith 1-1-90 13 1-1-21 Intro Check Were you given any info? (null)
1 John Smith 1-1-90 13 1-2-21 Intro Check Were you given any info? no
1 John Smith 1-1-90 13 1-3-21 Intro Check Were you given any info? yes
2 Jane Doe 2-2-80 14 1-6-21 Intro Check Were you given any info? no
2 Jane Doe 2-2-80 14 1-6-21 Care Check By using this service.. no
2 Jane Doe 2-2-80 14 1-6-21 Out Check How satisfied are you? unsat
Desired output (I would like to add numbers to the repeated questions as you can see below):
UserID Full Name DOB EncounterID Date Type label responses
1 John Smith 1-1-90 13 1-1-21 Intro Check Were you given any info?1 (null)
1 John Smith 1-1-90 13 1-2-21 Intro Check Were you given any info?2 no
1 John Smith 1-1-90 13 1-3-21 Intro Check Were you given any info?3 yes
2 Jane Doe 2-2-80 14 1-6-21 Intro Check Were you given any info? no
2 Jane Doe 2-2-80 14 1-6-21 Care Check By using this service.. no
2 Jane Doe 2-2-80 14 1-6-21 Out Check How satisfied are you? unsat
Here is a dplyr solution:
library(dplyr)
df %>%
group_by(UserID, label) %>%
mutate(newcol = row_number(),
label = if(sum(newcol)> 1) paste0(label,newcol) else label) %>%
ungroup() %>%
select(-newcol)
Or, more directly, as suggested by r2evans (many thanks!):
library(dplyr)
df %>%
group_by(UserID, label) %>%
mutate(label=if (n() > 1) paste0(label,row_number()) else label)
UserID Full.Name DOB EncounterID Date Type label responses
<int> <chr> <chr> <int> <chr> <chr> <chr> <chr>
1 1 John Smith 1-1-90 13 1-1-21 Intro Check Were you given any info?1 (null)
2 1 John Smith 1-1-90 13 1-2-21 Intro Check Were you given any info?2 no
3 1 John Smith 1-1-90 13 1-3-21 Intro Check Were you given any info?3 yes
4 2 Jane Doe 2-2-80 14 1-6-21 Intro Check Were you given any info? no
5 2 Jane Doe 2-2-80 14 1-6-21 Care Check By using this service.. no
6 2 Jane Doe 2-2-80 14 1-6-21 Out Check How satisfied are you? unsat
data:
df <- structure(list(UserID = c(1L, 1L, 1L, 2L, 2L, 2L), Full.Name = c("John Smith",
"John Smith", "John Smith", "Jane Doe", "Jane Doe", "Jane Doe"
), DOB = c("1-1-90", "1-1-90", "1-1-90", "2-2-80", "2-2-80",
"2-2-80"), EncounterID = c(13L, 13L, 13L, 14L, 14L, 14L), Date = c("1-1-21",
"1-2-21", "1-3-21", "1-6-21", "1-6-21", "1-6-21"), Type = c("Intro",
"Intro", "Intro", "Intro", "Care", "Out"), label = c("Check Were you given any info?",
"Check Were you given any info?", "Check Were you given any info?",
"Check Were you given any info?", "Check By using this service..",
"Check How satisfied are you?"), responses = c("(null)", "no",
"yes", "no", "no", "unsat")), class = "data.frame", row.names = c(NA,
-6L))
Try this:
ave(dat$label, dat[c("UserID", "label")],
FUN = function(z) if (length(z) > 1) seq_along(z) else "")
# [1] "1" "2" "3" "" "" ""
which can be used as
dat$label <- paste0(dat$label,
ave(dat$label, dat[c("UserID", "label")],
FUN = function(z) if (length(z) > 1) seq_along(z) else "")
)
# UserID Full.Name DOB EncounterID Date Type label responses
# 1 1 John Smith 1-1-90 13 1-1-21 Intro Check Were you given any info?1 (null)
# 2 1 John Smith 1-1-90 13 1-2-21 Intro Check Were you given any info?2 no
# 3 1 John Smith 1-1-90 13 1-3-21 Intro Check Were you given any info?3 yes
# 4 2 Jane Doe 2-2-80 14 1-6-21 Intro Check Were you given any info? no
# 5 2 Jane Doe 2-2-80 14 1-6-21 Care Check By using this service.. no
# 6 2 Jane Doe 2-2-80 14 1-6-21 Out Check How satisfied are you? unsat
Data
dat <- structure(list(UserID = c(1, 1, 1, 2, 2, 2), Full.Name = c("John Smith", "John Smith", "John Smith", "Jane Doe", "Jane Doe", "Jane Doe"), DOB = c("1-1-90", "1-1-90", "1-1-90", "2-2-80", "2-2-80", "2-2-80"), EncounterID = c(13, 13, 13, 14, 14, 14), Date = c("1-1-21", "1-2-21", "1-3-21", "1-6-21", "1-6-21", "1-6-21"), Type = c("Intro", "Intro", "Intro", "Intro", "Care", "Out"), label = c("Check Were you given any info?", "Check Were you given any info?", "Check Were you given any info?", "Check Were you given any info?", "Check By using this service..", "Check How satisfied are you?"), responses = c("(null)", "no", "yes", "no", "no", "unsat")), row.names = c(NA, -6L), class = "data.frame")

Obtain a unique ID from a different dataframe

I have this dataframe:
First.Name Last.Name Country Unit Hospital
John Mars UK Sales South
John Mars UK Sales South
John Mars UK Sales South
Lisa Smith USA HHRR North
Lisa Smith USA HHRR North
and this other:
First.Name Last.Name ID
John Valjean 1254
Peter Smith 1255
Frank Mars 1256
Marie Valjean 1257
Lisa Smith 1258
John Mars 1259
and I would like to merge them or paste them together to have:
I tried x = merge(df1, df2, by.y=c('Last.Name','First.Name')), but it doesn't seem to work. I also tried x = df1[c(df1$Last.Name, df1$First.Name) %in% c(df2$Last.Name, df2$First.Name),], and that doesn't work either.
When using merge, you have to be careful with its arguments, especially by, by.x, by.y, all, all.x and all.y. Each of these arguments is described in the function's help page (?merge).
Based on this, try out:
merge(df1, df2, by = c('First.Name', 'Last.Name')) # see @Sotos's comment
# output
First.Name Last.Name Country Unit Hospital ID
1 John Mars UK Sales South 1259
2 John Mars UK Sales South 1259
3 John Mars UK Sales South 1259
4 Lisa Smith USA HHRR North 1258
5 Lisa Smith USA HHRR North 1258
merge(df1, df2, by.x = c('Last.Name','First.Name'),
by.y = c('Last.Name','First.Name')) # in your code, you set by.y but not by.x
# output
Last.Name First.Name Country Unit Hospital ID
1 Mars John UK Sales South 1259
2 Mars John UK Sales South 1259
3 Mars John UK Sales South 1259
4 Smith Lisa USA HHRR North 1258
5 Smith Lisa USA HHRR North 1258
# by in dplyr::left_join() works like by in merge()
dplyr::left_join(df1, df2, by = c('First.Name', 'Last.Name')) # see @tmfmnk's comment
# output
First.Name Last.Name Country Unit Hospital ID
1 John Mars UK Sales South 1259
2 John Mars UK Sales South 1259
3 John Mars UK Sales South 1259
4 Lisa Smith USA HHRR North 1258
5 Lisa Smith USA HHRR North 1258
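The all.x argument mentioned above decides what happens to rows of df1 that have no partner in df2. A small hedged sketch, using made-up data frames in which the person "Anna Lee" is a hypothetical addition with no ID:

```r
# Illustrative data; "Anna Lee" is hypothetical and has no entry in df2
df1 <- data.frame(
  First.Name = c("John", "Lisa", "Anna"),
  Last.Name  = c("Mars", "Smith", "Lee"),
  Country    = c("UK", "USA", "UK"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  First.Name = c("John", "Lisa", "John"),
  Last.Name  = c("Mars", "Smith", "Valjean"),
  ID         = c(1259L, 1258L, 1254L),
  stringsAsFactors = FALSE
)

# Default (inner join): only people found in both data frames
inner <- merge(df1, df2, by = c("First.Name", "Last.Name"))

# all.x = TRUE (left join): every row of df1 is kept, ID is NA if unmatched
left <- merge(df1, df2, by = c("First.Name", "Last.Name"), all.x = TRUE)

inner
left
```

Matching on both name columns matters here: joining on First.Name alone would pair "John Mars" with the ID of "John Valjean" as well.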
Data
df1 <- structure(list(First.Name = c("John", "John", "John", "Lisa",
"Lisa"), Last.Name = c("Mars", "Mars", "Mars", "Smith", "Smith"
), Country = c("UK", "UK", "UK", "USA", "USA"), Unit = c("Sales",
"Sales", "Sales", "HHRR", "HHRR"), Hospital = c("South", "South",
"South", "North", "North")), .Names = c("First.Name", "Last.Name",
"Country", "Unit", "Hospital"), class = "data.frame", row.names = c(NA,
-5L))
df2 <- structure(list(First.Name = c("John", "Peter", "Frank", "Marie",
"Lisa", "John"), Last.Name = c("Valjean", "Smith", "Mars", "Valjean",
"Smith", "Mars"), ID = 1254:1259), .Names = c("First.Name", "Last.Name",
"ID"), class = "data.frame", row.names = c(NA, -6L))

Filling in NAs in data in R by id

I am having a little problem "filling in gaps". It's not a missing data question, it's more about merging but it's not working great.
So, my data looks like this
id name region Company
1 John Smith West Walmart
1 John Smith West Amazon
1 John Smith
1 John Smith West P&G
2 Jane Smith South Apple
2 Jane Smith
3 Richard Burkett
3 Richard Burkett West Walmart
And so on.
What I want to do is fill in those gaps in the region variable by their id. So, id 1, John Smith, on the third row, should have West in the third column. Jane Smith's region should be filled in "South" where it is missing.
I've tried creating a separate dataset and then merging it based on id but it creates duplicate rows and basically increases the N by something like 14 times (no idea why).
region1<-subset(df1, df1$region=="DC"| df1$region=="Midwest"|df1$region=="Northeast"|df1$region=="South"|df1$region=="West")
region<-region1[,c("id","region")]
df2<-merge(df1, region, by="id")
I've checked the structure of the variables. Id variable is interval and region is a factor. I think there should be a super simple way to do this but I'm just not getting it. Any ideas?
Thank you in advance.
Here's a base R solution. Suppose your data.frame is df and is sorted by id:
regions <- sapply(split(df$region, df$id), function(x) {
ind <- is.na(x);
x[ind] <- x[!ind][1];
x
})
df$region <- unlist(regions)
df
id name region Company
1 1 John Smith West Walmart
2 1 John Smith West Amazon
3 1 John Smith West <NA>
4 1 John Smith West P&G
5 2 Jane Smith South Apple
6 2 Jane Smith South <NA>
7 3 Richard Burkett West Walmart
8 3 Richard Burkett West <NA>
I would use dplyr::arrange followed by tidyr::fill
library(dplyr)
library(tidyr)
data.frame(id=c(1,1,1,1,2,2,3,3),
name=c(rep("John Smith",4), rep("Jane Smith", 2), rep("Richard Burkett", 2)),
region=c("West", "West", NA, "West", "South",NA, "West", NA),
Company=c("Walmart","Amazon",NA,"P&G","Apple",NA,"Walmart",NA)) %>%
arrange(id, name) %>%
fill(region)
Results in:
id name region Company
1 1 John Smith West Walmart
2 1 John Smith West Amazon
3 1 John Smith West NA
4 1 John Smith West P&G
5 2 Jane Smith South Apple
6 2 Jane Smith South NA
7 3 Richard Burkett West Walmart
8 3 Richard Burkett West NA
A more robust approach is to group_by id and then fill. Because the missing value can come before the first known region within a group (as for id 3 in the question), the fill should be applied in both directions.
library(tidyverse)
df %>% group_by(id) %>%
fill(region) %>%
fill(region, .direction = "up")
# id name region Company
# <int> <chr> <chr> <chr>
#1 1 John Smith West Walmart
#2 1 John Smith West Amazon
#3 1 John Smith West <NA>
#4 1 John Smith West P&G
#5 2 Jane Smith South Apple
#6 2 Jane Smith South <NA>
#7 3 Richard Burkett West Walmart
#8 3 Richard Burkett West <NA>
Data
structure(list(id = c(1L, 1L, 1L, 1L, 2L, 2L, 3L, 3L), name = c("John Smith",
"John Smith", "John Smith", "John Smith", "Jane Smith", "Jane Smith",
"Richard Burkett", "Richard Burkett"), region = c("West", "West",
NA, "West", "South", NA, "West", NA), Company = c("Walmart",
"Amazon", NA, "P&G", "Apple", NA, "Walmart", NA)), .Names = c("id",
"name", "region", "Company"), class = "data.frame", row.names = c(NA,
-8L))
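If the rows are not sorted by id, a base R alternative is ave, which writes each group's result back into the original row positions. This is a minimal sketch on made-up data, under the assumption that each id has at most one distinct region:

```r
# Illustrative data; note that id 3 has its missing value first
df <- data.frame(
  id     = c(1, 1, 1, 2, 2, 3, 3),
  region = c("West", NA, "West", "South", NA, NA, "West"),
  stringsAsFactors = FALSE
)

# Within each id, replace NA with the first known region of that id.
# Groups without any known region are left as NA.
df$region <- ave(df$region, df$id, FUN = function(x) {
  known <- x[!is.na(x)]
  if (length(known) > 0) x[is.na(x)] <- known[1]
  x
})
df
```

Since the replacement value is looked up per group rather than carried forward, the direction of the gap does not matter.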

R reshape cast error

I am trying to follow this example from the reshape package but am getting an error:
smithsm <- melt(smiths)
smithsm
subject variable value
1 John Smith time 1.00
2 Mary Smith time 1.00
3 John Smith age 33.00
4 Mary Smith age NA
5 John Smith weight 90.00
6 Mary Smith weight NA
7 John Smith height 1.87
8 Mary Smith height 1.54
cast(smithsm, time + subject ~ variable)
This gives the error "Error: Casting formula contains variables not found in molten data: time". Does anyone know what is causing this error? The above is taken word for word from an example
Thanks!
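The smithsm dataset doesn't have a time column: after melting, time is one of the values in the variable column, so it cannot appear on the left-hand side of the casting formula. It is not clear what the expected wide form is, but perhaps this helps: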
library(reshape2)
dcast(smithsm, subject~variable, value.var='value')
# subject age height time weight
#1 John Smith 33 1.87 1 90
#2 Mary Smith NA 1.54 1 NA
data
smithsm <- structure(list(subject = c("John Smith", "Mary Smith", "John Smith",
"Mary Smith", "John Smith", "Mary Smith", "John Smith", "Mary Smith"
), variable = c("time", "time", "age", "age", "weight", "weight",
"height", "height"), value = c(1, 1, 33, NA, 90, NA, 1.87, 1.54
)), .Names = c("subject", "variable", "value"), class = "data.frame",
row.names = c("1", "2", "3", "4", "5", "6", "7", "8"))
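The same long-to-wide step can also be sketched without extra packages using stats::reshape. This is offered only as an illustration of the idea; the column names it produces (value.time, value.age and so on) differ from the dcast output above.

```r
# The molten data from the question
smithsm <- data.frame(
  subject  = rep(c("John Smith", "Mary Smith"), times = 4),
  variable = rep(c("time", "age", "weight", "height"), each = 2),
  value    = c(1, 1, 33, NA, 90, NA, 1.87, 1.54),
  stringsAsFactors = FALSE
)

# One row per subject, one "value.<variable>" column per measured variable
wide <- reshape(smithsm, idvar = "subject", timevar = "variable",
                direction = "wide")
wide
```

Here idvar plays the role of the left-hand side of the casting formula and timevar the right-hand side, which is why time cannot be used as an id: it only exists as an entry of variable.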
