I have two tibbles, each containing a column of strings. I need to compare the strings in one tibble with the strings in the other and, depending on the comparison, create a new column.
Small example below:
## Tibble 1 - the 'master'
structure(list(terms = c("This", "is", "a", "stri", "of", "areas",
"times", "two", "to", "see", "what", "will", "be", "in", "the",
"magic", "will", "rally", "for", "a", "cry", "from", "the", "deepest",
"part", "of", "the", "ocean", "com", "en", "au", "us"), rank = c("A",
"B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L", "M", "N",
"O", "P", "A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K",
"L", "M", "N", "O", "P"), id = 1:32), row.names = c(NA, -32L), class = c("tbl_df",
"tbl", "data.frame"))
## Tibble 2 - the 'comparison'
structure(list(conds = c("this.com", "two.org", "magic.edu",
"cry/en/org", "magic.com"), ind = structure(c(2L, 1L, 5L, 3L,
4L), .Label = c("bad", "good", "Indifferent", "Maybe", "Ugly"
), class = "factor")), row.names = c(NA, -5L), class = c("tbl_df",
"tbl", "data.frame"))
Ideally the output would be a mutated 'master' tibble with the matching ind value inserted wherever a term is found in one of the conds strings.
Attempt so far:
terms <- terms %>% mutate(
test = ifelse(
sapply(lapply(terms, grepl, condition_str$conds), any) == TRUE,
condition_str$ind,
'NA'))
terms
result
# A tibble: 32 x 4
terms rank id test
<chr> <chr> <int> <chr>
1 This A 1 NA
2 is B 2 1
3 a C 3 5
4 stri D 4 NA
5 of E 5 NA
6 areas F 6 NA
7 times G 7 NA
8 two H 8 5
9 to I 9 NA
10 see J 10 NA
It gives me a result: the factor's integer codes are carried across, but the level labels are not. It also fails on a larger data set I am working on.
Questions:
Is there a purrr solution that uses stringr or stringi? My problem might be in my string matching.
Is there a way to incorporate fixed = TRUE into the grepl function?
Is there a way to get the classification levels into the mutated column?
Thanks for any assistance.
James
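One possible approach to the three questions above, as a minimal sketch rather than a definitive answer. It assumes the tibbles are named master and comparison (hypothetical names), that matching should be literal and case-sensitive, and that when several conds contain a term the first one wins. The helper first_match is also hypothetical. stringr::fixed() plays the role of fixed = TRUE, and as.character() keeps the level labels instead of the integer codes.

```r
library(dplyr)
library(purrr)
library(stringr)

# Cut-down versions of the two tibbles from the question (names are assumptions)
master <- tibble(terms = c("This", "is", "magic", "cry", "ocean"))
comparison <- tibble(
  conds = c("this.com", "two.org", "magic.edu", "cry/en/org", "magic.com"),
  ind   = factor(c("good", "bad", "Ugly", "Indifferent", "Maybe"))
)

# Return the label of the first condition that contains `term` literally,
# or NA if no condition contains it.
first_match <- function(term, conds, ind) {
  hit <- str_detect(conds, fixed(term))   # literal match, like fixed = TRUE
  if (any(hit)) as.character(ind[hit][1]) else NA_character_
}

result <- master %>%
  mutate(test = map_chr(terms, first_match,
                        conds = comparison$conds,
                        ind   = comparison$ind))
```

Because the match is case-sensitive, "This" does not match "this.com" here; wrapping both sides in tolower() would be one way to change that if it is not the intended behaviour.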
I have a data frame that is the result of combining multiple sheets from Excel. The columns did not align properly. I need to check whether a subset of columns in a row is all NA. If it is, I need to check whether the adjacent, equally sized subset has content, and if it does, copy those values over to replace the NAs.
This is what the data looks like from my dput:
structure(list(id = 1:20, A = c(NA, NA, NA, NA, NA, "c", "d",
"q", "p", "m", NA, NA, NA, NA, NA, "k", "o", "i", "a", "b"),
B = c(NA, NA, NA, NA, NA, "h", "a", "f", "b", "e", NA, NA,
NA, NA, NA, "m", "c", "s", "g", "p"), C = c(NA, NA, NA, NA,
NA, "a", "f", "j", "s", "g", NA, NA, NA, NA, NA, "l", "m",
"o", "k", "t"), D = c(NA, NA, NA, NA, NA, "n", "r", "l",
"h", "g", NA, NA, NA, NA, NA, "j", "p", "f", "d", "q"), E = c("j",
"p", "n", "i", "g", NA, NA, NA, NA, NA, "k", "e", "s", "m",
"l", NA, NA, NA, NA, NA), F = c("o", "d", "r", "q", "a",
NA, NA, NA, NA, NA, "h", "s", "f", "j", "k", NA, NA, NA,
NA, NA), G = c("f", "c", "a", "l", "m", NA, NA, NA, NA, NA,
"n", "t", "s", "e", "r", NA, NA, NA, NA, NA), H = c("r",
"c", "h", "i", "j", NA, NA, NA, NA, NA, "f", "e", "b", "l",
"n", NA, NA, NA, NA, NA)), row.names = c(NA, -20L), class = "data.frame")
If every row has the same number of non-missing values, as in the shared example, you can simply drop the NA values in each row.
df1 <- as.data.frame(t(apply(df, 1, na.omit)))
# V1 V2 V3 V4 V5
#1 1 j o f r
#2 2 p d c c
#3 3 n r a h
#4 4 i q l i
#5 5 g a m j
#6 6 c h a n
#7 7 d a f r
#8 8 q f j l
#9 9 p b s h
#10 10 m e g g
#11 11 k h n f
#12 12 e s t e
#13 13 s f s b
#14 14 m j e l
#15 15 l k r n
#16 16 k m l j
#17 17 o c m p
#18 18 i s o f
#19 19 a g k d
#20 20 b p t q
To check whether the first half of each row is all NA and, if so, take the second half instead, we can do:
cbind(df[1], t(apply(df[-1], 1, function(x) {
x1 <- (length(x)/2)
if(all(is.na(x[1:x1]))) x[(x1+1):length(x)]
else x[1:x1]
})))
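As an alternative sketch, and only under the assumption that the columns pair up one-to-one (A with E, B with F, and so on), dplyr::coalesce() can take the first non-missing value from each pair. Note that this works element by element, so it is not identical to the "whole first half is NA" test above; the small data frame below is made up for illustration.

```r
library(dplyr)

# Toy data with the same shape as the question: A/B are empty where E/F are filled
df <- data.frame(
  id = 1:4,
  A  = c(NA, NA, "c", "d"),
  B  = c(NA, NA, "h", "a"),
  E  = c("j", "p", NA, NA),
  F  = c("o", "d", NA, NA),
  stringsAsFactors = FALSE
)

# For each pair of columns keep the first non-NA value
out <- df %>%
  transmute(id,
            A = coalesce(A, E),
            B = coalesce(B, F))
```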
I have sparse data with a score taken at periodic intervals and a measurement taken at more regular intervals for multiple subjects, along with the corresponding dates. I would like to generate date ranges based on the score dates for each subject ID, i.e. starting at the score date and ending at the next score date (or starting/ending at the first/last subject observation if the score doesn't fall on those dates).
I would then like to average the measurement variable within these date ranges. The averaging step should be straightforward but I am stuck on generating the date ranges.
Below is a sample of the data and an example of how I would envision the resulting data
sample data:
structure(list(ID = c("A", "A", "A", "A", "A", "A", "A", "A",
"A", "A", "A", "A", "A", "A", "A", "A", "A", "B", "B", "B", "B",
"B", "B", "C", "C", "C", "D", "D", "D", "D", "D", "D", "D", "D",
"D", "D", "D", "D", "D", "D", "D"), date = c("1/21/2020", "1/27/2020",
"2/1/2020", "2/3/2020", "2/5/2020", "2/6/2020", "2/8/2020", "2/9/2020",
"2/11/2020", "2/12/2020", "2/13/2020", "2/15/2020", "2/18/2020",
"2/20/2020", "2/21/2020", "2/22/2020", "2/25/2020", "2/1/2020",
"2/5/2020", "2/7/2020", "2/8/2020", "2/11/2020", "2/12/2020",
"1/30/2020", "2/10/2020", "2/11/2020", "2/6/2020", "2/7/2020",
"2/8/2020", "2/9/2020", "2/11/2020", "2/13/2020", "2/14/2020",
"2/16/2020", "2/17/2020", "2/20/2020", "2/23/2020", "2/26/2020",
"3/1/2020", "3/3/2020", "3/5/2020"), score = c(0.5, 2, NA, NA,
3, NA, NA, NA, NA, NA, 2.5, NA, NA, 1.5, NA, NA, NA, 3, NA, NA,
2.5, NA, 1, 0.5, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 14,
NA, NA, 11.5, NA, 9.5, NA), measure = c(0.394160734, 0.722462998,
0.82984815, 0.738432745, 0.321792398, 0.167492308, 0.218020898,
0.929210786, 0.686818585, 0.939678073, 0.708172942, 0.299863884,
0.48216267, 0.290307369, 0.801947902, 0.579418467, 0.78101844,
0.219494852, 0.875129822, 0.517971003, 0.475625007, 0.723003744,
0.257473477, 0.629818537, 0.817369151, 0.628573413, 0.364660834,
0.5971024, 0.002274261, 0.318937617, 0.983917106, 0.685933928,
0.487922831, 0.151769304, 0.392413694, 0.012429414, 0.149627658,
0.011724992, 0.536998203, 0.798399999, 0.763353822)), class = "data.frame", row.names = c(NA,
-41L))
answer data:
structure(list(ID = c("A", "A", "A"), startDate = c("1/21/2020",
"1/27/2020", "2/5/2020"), endDate = c("1/27/2020", "2/5/2020",
"2/13/2020"), score = c(0.5, 2, 3), measure = c(0.394160734,
0.763581298, 0.543835508)), class = "data.frame", row.names = c(NA,
-3L))
Here's a way with dplyr :
library(dplyr)
df %>%
group_by(ID, grp = cumsum(!is.na(score))) %>%
summarise(start_date = first(date),
score = first(score),
measure = mean(measure)) %>%
mutate(end_date = lead(start_date, default = last(start_date))) %>%
select(-grp)
# ID start_date score measure end_date
# <chr> <chr> <dbl> <dbl> <chr>
# 1 A 1/21/2020 0.5 0.394 1/27/2020
# 2 A 1/27/2020 2 0.764 2/5/2020
# 3 A 2/5/2020 3 0.544 2/13/2020
# 4 A 2/13/2020 2.5 0.497 2/20/2020
# 5 A 2/20/2020 1.5 0.613 2/20/2020
# 6 B 2/1/2020 3 0.538 2/8/2020
# 7 B 2/8/2020 2.5 0.599 2/12/2020
# 8 B 2/12/2020 1 0.257 2/12/2020
# 9 C 1/30/2020 0.5 0.692 1/30/2020
#10 D 2/6/2020 NA 0.449 2/17/2020
#11 D 2/17/2020 14 0.185 2/26/2020
#12 D 2/26/2020 11.5 0.274 3/3/2020
#13 D 3/3/2020 9.5 0.781 3/3/2020
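The dplyr pipeline above relies on the rows already being in date order, and the dates are stored as character strings. A hedged variation, assuming the dates are in month/day/year format as in the sample, converts them to Date and sorts first. The toy data below is made up and deliberately out of order.

```r
library(dplyr)

# Small made-up example for one subject, rows deliberately unsorted
df <- data.frame(
  ID      = c("A", "A", "A", "A"),
  date    = c("2/1/2020", "1/27/2020", "1/21/2020", "2/5/2020"),
  score   = c(NA, 2, 0.5, 3),
  measure = c(0.8, 0.7, 0.4, 0.3),
  stringsAsFactors = FALSE
)

out <- df %>%
  mutate(date = as.Date(date, format = "%m/%d/%Y")) %>%  # parse m/d/Y strings
  arrange(ID, date) %>%                                  # make row order explicit
  group_by(ID, grp = cumsum(!is.na(score))) %>%          # new group at each score
  summarise(start_date = first(date),
            score      = first(score),
            measure    = mean(measure),
            .groups    = "drop_last") %>%                # stay grouped by ID
  mutate(end_date = lead(start_date, default = last(start_date))) %>%
  ungroup() %>%
  select(-grp)
```

With real Date values, later steps such as computing the length of each range become simple subtraction (end_date - start_date).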
Using data.table
library(data.table)
setDT(df)[, .(start_date = first(date),
              score = first(score),
              measure = mean(measure)),
          by = .(ID, grp = cumsum(!is.na(score)))
  # shift within each ID so the end date never comes from the next subject
  ][, end_date := shift(start_date, type = 'lead', fill = last(start_date)), by = ID
  ][, grp := NULL][]
Is there an elegant/tidy way to fill in the data if there are non-null values to the right? I have a wonky work-around but wanted to know if there was a nice dplyr way to do this.
actual <-
tibble(
a = c("A", NA, NA, NA, NA, NA, NA, "B", NA, NA, NA),
b = c(NA, "A", NA, NA, NA, "C", NA, NA, "E", NA, NA),
c = c(NA, NA, "B", NA, NA, NA, "D", NA, NA, "F", "G"),
d = c(NA, NA, NA, "C", "D", NA, NA, NA, NA, NA, NA)
)
desired <-
tibble(
w = c("A", "A", "A", "A", "A", "A", "A", "B", "B", "B", "B"),
x = c(NA, "A", "A", "A", "A", "C", "C", NA, "E", "E", "E"),
y = c(NA, NA, "B", "B", "B", NA, "D", NA, NA, "F", "G"),
z = c(NA, NA, NA, "C", "D", NA, NA, NA, NA, NA, NA)
)
We can use fill from tidyr together with dplyr like the following.
library(dplyr)
library(tidyr)
dat <- actual %>%
fill(a) %>%
group_by(a) %>%
fill(b) %>%
group_by(b) %>%
fill(c) %>%
group_by(c) %>%
fill(d) %>%
ungroup()
print(dat)
# # A tibble: 11 x 4
# a b c d
# <chr> <chr> <chr> <chr>
# 1 A NA NA NA
# 2 A A NA NA
# 3 A A B NA
# 4 A A B C
# 5 A A B D
# 6 A C NA NA
# 7 A C D NA
# 8 B NA NA NA
# 9 B E NA NA
# 10 B E F NA
# 11 B E G NA
I have the following data frame which I have obtained from a count. I have used dput to make the data frame available and then edited the data frame so there is a duplicate of A.
df <- structure(list(Procedure = structure(c(4L, 1L, 2L, 3L), .Label = c("A", "A", "C", "D", "-1"),
class = "factor"), n = c(10717L, 4412L, 2058L, 1480L)),
class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -4L), .Names = c("Procedure", "n"))
print(df)
# A tibble: 4 x 2
Procedure n
<fct> <int>
1 D 10717
2 A 4412
3 A 2058
4 C 1480
Now I would like to take distinct on Procedure and only keep the first A.
df %>%
distinct(Procedure, .keep_all=TRUE)
# A tibble: 4 x 2
Procedure n
<fct> <int>
1 D 10717
2 A 4412
3 A 2058
4 C 1480
It does not work. Strange...
If we print the Procedure column, we can see that there are duplicated levels for A, which is problematic for the distinct function.
df$Procedure
[1] D A A C
Levels: A A C D -1
Warning message:
In print.factor(x) : duplicated level [2] in factor
One way to fix this is to rebuild the factor with the factor function, which merges the duplicated levels and drops the unused ones. Another way is to convert the Procedure column to character.
df <- structure(list(Procedure = structure(c(4L, 1L, 2L, 3L), .Label = c("A", "A", "C", "D", "-1"),
class = "factor"), n = c(10717L, 4412L, 2058L, 1480L)),
class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -4L), .Names = c("Procedure", "n"))
library(tidyverse)
df %>%
mutate(Procedure = factor(Procedure)) %>%
distinct(Procedure, .keep_all=TRUE)
# # A tibble: 3 x 2
# Procedure n
# <fct> <int>
# 1 D 10717
# 2 A 4412
# 3 C 1480
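The character conversion mentioned above could look like the following sketch. It reuses the structure() call from the question to reproduce the duplicated level; as.character() maps each code to its label, so both "A" rows become the same string before distinct() runs.

```r
library(dplyr)

# Same malformed factor as in the question: level "A" appears twice
df <- structure(list(Procedure = structure(c(4L, 1L, 2L, 3L),
                                           .Label = c("A", "A", "C", "D", "-1"),
                                           class = "factor"),
                     n = c(10717L, 4412L, 2058L, 1480L)),
                class = c("tbl_df", "tbl", "data.frame"),
                row.names = c(NA, -4L), .Names = c("Procedure", "n"))

out <- df %>%
  mutate(Procedure = as.character(Procedure)) %>%  # labels instead of codes
  distinct(Procedure, .keep_all = TRUE)            # keeps the first row per label
```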
You have a duplicated value in the label parameter: .Label = c("A", "A", "C", "D", "-1"). That is the issue. By the way, your way of initializing the tibble seems very strange (I do not know exactly what your goal is, but still).
Why not use:
df <- tibble(
Procedure = c("D", "A", "A", "C"),
n = c(10717L, 4412L, 2058L, 1480L)
)