I have data that looks like this:
The sample data can be built with the following code:
df<-structure(list(ID = c(1, 1, 1, 2, 2, 3, 3, 3), Date = c("Day 1",
"Day 7", "Day 29", "Day_8", "Day9", "Day7", "Day.1", "Day 21"
), Score = c("A", "B", "E", "D", "F", "G", "A", "B"), Pass = c("Y",
"Y", "N", "Y", "N", "N", "Y", "Y")), row.names = c(NA, -8L), class = c("tbl_df",
"tbl", "data.frame"))
How can I write a pipeline that filters and selects in one go? The filter should keep the earliest date for each ID, and the selection should keep ID, Date, and Score. If possible, I would also like to clean the data a little, as the Date values are all over the place right now. The final data might look like this:
Could anyone give me some guidance on this, if possible in both base R and the tidyverse?
My thought is:
df1 <- df %>% filter() %>% select(-Pass)
Part II:
If Date is an actual calendar date, how should I get max(Date)?
The new data set can be built using:
df2<- structure(list(Subject = c("39-903", "39-903", "39-903",
"39-903", "39-903", "39-903", "39-903", "39-903", "39-905",
"39-906", "39-907", "304-902", "301-902", "301-903", "301-904"
), DT = c("30 Apr 2019", "25 Jun 2019", "23 Jul 2019", "24 Oct 2019",
"19 Dec 2019", "27 Jan 2020", "05 Apr 2020", "29 Apr 2020", "",
"03 Dec 2018", "12 Jul 2019", "29 Apr 2020", "30 Dec 2019", "13 Jan 2020",
"8 Jun 2020")), class = "data.frame", row.names = c(NA, -15L))
I tried
df2<- df2%>% group_by(Subject)%>% mutate(Date=dmy("DT")) %>% filter (Date==max(Date)
and got the warning "Warning messages: 1: All formats failed to parse. No formats found." My attempt did not work.
You can also try:
library(dplyr)
library(readr)  # parse_number() comes from readr, not tidyr
#Code
newdf <- df %>% group_by(ID) %>%
mutate(Date=gsub('\\.','_',Date),
Val=parse_number(Date)) %>%
filter(Val==min(Val)) %>%
select(c(ID,Date,Score))
Output:
# A tibble: 3 x 3
# Groups: ID [3]
ID Date Score
<dbl> <chr> <chr>
1 1 Day 1 A
2 2 Day_8 D
3 3 Day_1 A
Update: More versatile solution:
#Code 2
newdf <- df %>% group_by(ID) %>%
mutate(Val=as.numeric(gsub("[^0-9-]", "", Date))) %>%
filter(Val==min(Val)) %>%
select(c(ID,Date,Score))
Output:
# A tibble: 3 x 3
# Groups: ID [3]
ID Date Score
<dbl> <chr> <chr>
1 1 Day 1 A
2 2 Day_8 D
3 3 Day.1 A
You can try with this:
library(dplyr)
df %>%
mutate(Date = as.numeric(stringr::str_extract(Date, "\\d+"))) %>%
group_by(ID) %>%
slice_min(Date) %>%
ungroup() %>%
select(-Pass)
#> # A tibble: 3 x 3
#> ID Date Score
#> <dbl> <dbl> <chr>
#> 1 1 1 A
#> 2 2 8 D
#> 3 3 1 A
For Date I believe it's better if you keep just the number instead of Day X.
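If a uniform label is preferred over a bare number, one possible sketch is to extract the day number once, filter on it, and rebuild Date as "Day N". The helper column Day and the result name clean are assumptions introduced here for illustration; the data is the df from the question, rebuilt as a plain data frame.

```r
library(dplyr)

# Sample data from the question
df <- data.frame(
  ID = c(1, 1, 1, 2, 2, 3, 3, 3),
  Date = c("Day 1", "Day 7", "Day 29", "Day_8", "Day9", "Day7", "Day.1", "Day 21"),
  Score = c("A", "B", "E", "D", "F", "G", "A", "B"),
  Pass = c("Y", "Y", "N", "Y", "N", "N", "Y", "Y")
)

clean <- df %>%
  mutate(Day = as.numeric(gsub("\\D+", "", Date)),  # keep the digits only
         Date = paste("Day", Day)) %>%               # rebuild a uniform label
  group_by(ID) %>%
  filter(Day == min(Day)) %>%                       # earliest day within each ID
  ungroup() %>%
  select(ID, Date, Score)

clean
```

This keeps Date as text, so any later sorting should still use a numeric day column rather than the label.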
We can use base R
df1 <- transform(df, Date = as.numeric(gsub("\\D+", "", Date)))
df1[with(df1, ave(Date, ID, FUN = min) == Date),]
# ID Date Score Pass
#1 1 1 A Y
#4 2 8 D Y
#7 3 1 A Y
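For Part II, dmy("DT") tries to parse the literal string "DT" rather than the column, which is why all formats fail to parse; pass the column unquoted as dmy(DT). If we need to keep the NA elements as well, we can use an OR (|) condition: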
library(dplyr)
library(lubridate)
df2%>%
group_by(Subject)%>%
mutate(Date=dmy(DT)) %>%
filter (Date==max(Date) |is.na(Date))
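Since the question also asked about base R, here is a rough base R sketch of the same idea on a shortened version of df2. It assumes an English locale, because %b matches month abbreviations in the current locale; the names latest and res are illustrative only.

```r
# Shortened version of df2 from the question (one subject has an empty date)
df2 <- data.frame(
  Subject = c("39-903", "39-903", "39-905", "39-906"),
  DT = c("30 Apr 2019", "29 Apr 2020", "", "03 Dec 2018")
)

# Empty strings become NA; %b assumes English month abbreviations
df2$Date <- as.Date(df2$DT, format = "%d %b %Y")

# Latest date per Subject, as a number of days; all-NA groups stay NA
latest <- ave(as.numeric(df2$Date), df2$Subject,
              FUN = function(x) if (all(is.na(x))) NA_real_ else max(x, na.rm = TRUE))

# Keep the latest row per Subject, plus rows whose date is missing
res <- df2[is.na(df2$Date) | as.numeric(df2$Date) == latest, ]
res
```

Dropping the is.na() part of the condition would remove subjects that have no usable date at all.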
Related
I have a subset of my data, shown below. I would like to manipulate it in the following steps:
reshape it to long format so that for each "id" I see the "m" values.
for each id, keep only one of the repeated values, e.g. id 101 has "15 ag" twice and I would like to keep only one, however many repetitions there are.
map the values in column "m" to scores: "15 ag" to 0, "12 cer" to 1, "18 di" to 6, "11 dem" to 2, "25 dia" to 0.
then sum all the scores into a new column called "sum".
In the end, I will have the "sum" for each id.
Any help with this is appreciated.
d <- structure(list(id = c(101, 101, 101, 101, 102, 102, 102, 103,
103, 103, 103, 104), m = c("15 ag", "15 ag", NA, "12 cer", NA,
"18 di", "12 cer", "11 dem", "11 dem", NA, "12 cer", "25 dia"
)), class = c("tbl_df", "tbl", "data.frame"),row.names =c(NA,-12L))
library(tidyverse)
d %>%
group_by(id) %>%
distinct() %>%
drop_na() %>%
mutate(m = case_when(
m %in% c("15 ag", "25 dia") ~ 0,
m == "12 cer" ~ 1,
m == "18 di" ~ 6,
m == "11 dem" ~ 2,
TRUE ~ NA_real_
)) %>%
summarise(sum = sum(m))
# A tibble: 4 × 2
id sum
<dbl> <dbl>
1 101 1
2 102 7
3 103 3
4 104 0
New column without summarise
d %>%
group_by(id) %>%
distinct() %>%
drop_na() %>%
mutate(m = case_when(
m %in% c("15 ag", "25 dia") ~ 0,
m == "12 cer" ~ 1,
m == "18 di" ~ 6,
m == "11 dem" ~ 2,
TRUE ~ NA_real_
)) %>%
mutate(sum = sum(m))
# A tibble: 7 × 3
# Groups: id [4]
id m sum
<dbl> <dbl> <dbl>
1 101 0 1
2 101 1 1
3 102 6 7
4 102 1 7
5 103 2 3
6 103 1 3
7 104 0 0
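The recoding above can also be expressed as a named lookup vector, which may be easier to maintain when there are many labels. This is a base R sketch under that assumption; the names scores, u and res are illustrative, and any label missing from scores would come out as NA.

```r
# Data from the question, as a plain data frame
d <- data.frame(
  id = c(101, 101, 101, 101, 102, 102, 102, 103, 103, 103, 103, 104),
  m = c("15 ag", "15 ag", NA, "12 cer", NA, "18 di", "12 cer",
        "11 dem", "11 dem", NA, "12 cer", "25 dia")
)

# Lookup table: label -> score
scores <- c("15 ag" = 0, "12 cer" = 1, "18 di" = 6, "11 dem" = 2, "25 dia" = 0)

u <- unique(d[!is.na(d$m), ])       # drop NAs, then repeated id/m pairs
u$score <- unname(scores[u$m])      # look up each label by name
res <- aggregate(score ~ id, data = u, FUN = sum)
res
```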
> dput(df1)
structure(list(X = c("cc_China", "bb_China", "dd_China", "cc_Egypt",
"bb_Egypt", "dd_Egypt"), Country = c("China", "China", "China",
"Egypt", "Egypt", "Egypt"), May = c(2, 3, 8, 2, 4, 1), Jun = c(2,
2, 5, 5, 5, 5), Jul = c(3, NA, NA, 3, 2, NA), Aug = c(4, 6, 3,
2, 3, NA)), row.names = c(NA, -6L), class = c("tbl_df", "tbl",
"data.frame"))
I have a dataset like the one above, where I have extracted the country from the X column into the Country column. For each country, I wish to get the latest month, and its values, in which all 3 rows (cc, bb and dd) are not NA. For China, the latest such month is Aug, where cc, bb and dd all have values. For Egypt, it is Jun, the latest month in which all 3 values are available. Thanks.
> df1
X Country May Jun Jul Aug
cc_China China 2 2 3 4
bb_China China 3 2 NA 6
dd_China China 8 5 NA 3
cc_Egypt Egypt 2 5 3 2
bb_Egypt Egypt 4 5 2 3
dd_Egypt Egypt 1 5 NA NA
I wish to get this
Month X Value
Aug cc_China 4
Aug bb_China 6
Aug dd_China 3
Jun cc_Egypt 5
Jun bb_Egypt 5
Jun dd_Egypt 5
Get the data in long format and for each Country keep only those rows which have all non-NA values in a month. For each Country you can then keep only the max month.
Since we cannot compare character month names directly, I have converted them to numbers using inbuilt vector month.abb.
library(dplyr)
library(tidyr)
df1 %>%
pivot_longer(cols = -c(X, Country), names_to = 'Month') %>%
mutate(month_num = match(Month, month.abb)) %>%
group_by(Country, Month) %>%
filter(all(!is.na(value))) %>%
group_by(Country) %>%
filter(month_num == max(month_num)) %>%
ungroup %>% select(-month_num, -Country)
# X Month value
# <chr> <chr> <dbl>
#1 cc_China Aug 4
#2 bb_China Aug 6
#3 dd_China Aug 3
#4 cc_Egypt Jun 5
#5 bb_Egypt Jun 5
#6 dd_Egypt Jun 5
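For comparison, the same logic can be sketched in base R without reshaping: split by Country, find the month columns with no NA, and take the latest one. The column list months and the result name res are assumptions for illustration, and the sketch assumes every country has at least one complete month.

```r
# Data from the question, as a plain data frame
df1 <- data.frame(
  X = c("cc_China", "bb_China", "dd_China", "cc_Egypt", "bb_Egypt", "dd_Egypt"),
  Country = c("China", "China", "China", "Egypt", "Egypt", "Egypt"),
  May = c(2, 3, 8, 2, 4, 1),
  Jun = c(2, 2, 5, 5, 5, 5),
  Jul = c(3, NA, NA, 3, 2, NA),
  Aug = c(4, 6, 3, 2, 3, NA)
)

months <- c("May", "Jun", "Jul", "Aug")

res <- do.call(rbind, lapply(split(df1, df1$Country), function(g) {
  # months in which no row of this country is NA
  complete <- months[colSums(is.na(g[months])) == 0]
  # latest of those, ordered by calendar position
  last <- complete[which.max(match(complete, month.abb))]
  data.frame(Month = last, X = g$X, Value = g[[last]])
}))
res
```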
Suppose I'm given the following input dataframe:
ID Date
1 20th May, 2020
1 21st May, 2020
1 28th May, 2020
1 29th May, 2020
2 20th May, 2020
2 1st June, 2020
I want to generate the following dataframe:
ID Date Delta
1 20th May, 2020 0
1 21st May, 2020 1
1 28th May, 2020 7
1 29th May, 2020 1
2 20th May, 2020 0
2 1st June, 2020 12
The idea is to group by ID first. Then, within each ID, I iterate over the dates and subtract the previous date from the current one, except for the first date, whose delta is 0.
I have been using dplyr, but I am uncertain how to achieve this for groups and how to do it iteratively.
My goal is to filter the deltas and retain 0 and anything larger than 7, but the deltas must follow the 'preceding date' logic within a specific ID.
library(dplyr)
dat %>%
mutate(Date = as.Date(gsub("[a-z]{2} ", " ", Date), format = "%d %b, %Y")) %>%
group_by(ID) %>%
mutate(Delta = c(0, diff(Date))) %>%
ungroup()
# # A tibble: 6 x 3
# ID Date Delta
# <dbl> <date> <dbl>
# 1 1 2020-05-20 0
# 2 1 2020-05-21 1
# 3 1 2020-05-28 7
# 4 1 2020-05-29 1
# 5 2 2020-05-20 0
# 6 2 2020-06-01 12
Steps:
remove the ordinal from numbers, so that we can
convert them to proper Date-class objects, then
diff them within ID groups.
Data
dat <- structure(list(ID = c(1, 1, 1, 1, 2, 2), Date = c(" 20th May, 2020", " 21st May, 2020", " 28th May, 2020", " 29th May, 2020", " 20th May, 2020", " 1st June, 2020")), class = "data.frame", row.names = c(NA, -6L))
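The question's final goal, keeping deltas of 0 and anything larger than 7, could be added as one more filter step. The sketch below is one way to do it, assuming an English locale for the month names (%B) and reading "larger than 7" as strictly greater, so a delta of exactly 7 is dropped; res is an illustrative name.

```r
library(dplyr)

dat <- data.frame(
  ID = c(1, 1, 1, 1, 2, 2),
  Date = c("20th May, 2020", "21st May, 2020", "28th May, 2020",
           "29th May, 2020", "20th May, 2020", "1st June, 2020")
)

res <- dat %>%
  # strip the ordinal suffix, then parse; %B assumes English month names
  mutate(Date = as.Date(sub("(\\d+)(st|nd|rd|th)", "\\1", Date),
                        format = "%d %B, %Y")) %>%
  group_by(ID) %>%
  # days since the previous date within the same ID; first row gets 0
  mutate(Delta = as.numeric(Date - lag(Date, default = first(Date)))) %>%
  ungroup() %>%
  filter(Delta == 0 | Delta > 7)

res
```

The deltas are computed before filtering, so each one still refers to the preceding date in the original sequence, not to the preceding retained row.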
Similar logic to @r2evans's answer, but with different functions.
library(dplyr)
library(lubridate)
df %>%
mutate(Date = dmy(Date)) %>%
group_by(ID) %>%
mutate(Delta = as.integer(Date - lag(Date, default = first(Date)))) %>%
ungroup
# ID Date Delta
# <int> <date> <int>
#1 1 2020-05-20 0
#2 1 2020-05-21 1
#3 1 2020-05-28 7
#4 1 2020-05-29 1
#5 2 2020-05-20 0
#6 2 2020-06-01 12
data
df <- structure(list(ID = c(1L, 1L, 1L, 1L, 2L, 2L), Date = c("20th May, 2020",
"21st May, 2020", "28th May, 2020", "29th May, 2020", "20th May, 2020",
"1st June, 2020")), class = "data.frame", row.names = c(NA, -6L))
I have the below times in a column and would like to change the standalone mm:ss values to 00:mm:ss. The final output must be numeric.
I am not sure where to start here. I think that gsub() may be an appropriate solution, but I am not sure of the syntax to only add to the standalone mm:ss values.
This is what I see in excel
Name Section Role Last Activity Total Activity
A biology Student Feb 8 at 1:08pm 18:16
B biology Student Feb 8 at 1:37pm 3:22:10
C biology Student Feb 8 at 10:37pm 9:51
D biology Student Feb 8 at 11:50am 5:32:31
E biology Student Feb 9 at 12:08pm 7:17:49
F biology Student Feb 10 at 1:33am 12:25:41
This is what I see when I import this into R
structure(list(Name = c("A", "B", "C", "D", "E", "F"), Section = c("biology",
"biology", "biology", "biology", "biology", "biology"), Role = c("Student",
"Student", "Student", "Student", "Student", "Student"), `Last Activity` = c("Feb 8 at 1:08pm",
"Feb 8 at 1:37pm", "Feb 8 at 10:37pm", "Feb 8 at 11:50am", "Feb 9 at 12:08pm",
"Feb 10 at 1:33am"), `Total Activity` = structure(c(-2209009440,
-2209063070, -2209039740, -2209055249, -2209048931, -2209030459
), class = c("POSIXct", "POSIXt"), tzone = "UTC")), row.names = c(NA,
-6L), class = c("tbl_df", "tbl", "data.frame"))
The following is a preliminary solution to your problem. Note that these values are not numeric, since there is no such number as 25:12. If you want to calculate with them, you may convert them to date-time objects, e.g. the POSIX classes.
Data
ex <- c("20:28",
"18:53",
"25:01:00",
"17:55",
"27:04:00",
"24:43:00")
Code
ex[stringr::str_count(ex, ":") == 1] <- gsub("^",
"00:",
ex[stringr::str_count(ex, ":") == 1])
Output
> ex
[1] "00:20:28" "00:18:53" "25:01:00"
[4] "00:17:55" "27:04:00" "24:43:00"
You can count the number of characters in each value and prepend '00:' if it is less than 8.
df <- data.frame(time = c("20:28", "18:53", "25:01:00", "17:55", "27:04:00", "24:43:00"))
df <- transform(df, time = ifelse(nchar(time) < 8, paste0('00:', time), time))
df
# time
#1 00:20:28
#2 00:18:53
#3 25:01:00
#4 00:17:55
#5 27:04:00
#6 24:43:00
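The question asks for numeric output, and a padded "hh:mm:ss" string is still text. One hedged reading of "numeric" is the total number of seconds; the helper to_seconds below is hypothetical and handles both "mm:ss" and "hh:mm:ss" by left-padding the missing parts with zero.

```r
# Hypothetical helper: convert "mm:ss" or "hh:mm:ss" strings to total seconds
to_seconds <- function(x) {
  vapply(strsplit(x, ":", fixed = TRUE), function(p) {
    p <- as.numeric(p)
    p <- c(rep(0, 3 - length(p)), p)  # pad missing hours with 0
    sum(p * c(3600, 60, 1))           # hours, minutes, seconds
  }, numeric(1))
}

secs <- to_seconds(c("20:28", "18:53", "25:01:00"))
secs
# 1228  1133 90060
```

Unlike a clock time, this representation has no trouble with durations above 24 hours.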
For the updated dataset, the column Total Activity is already imported as a date-time (POSIXct), so we can use format on it directly. Here df is the imported tibble shown in the question.
library(dplyr)
df <- df %>%
mutate(`Total Activity` = format(`Total Activity`, '%T'))
df
# Name Section Role `Last Activity` `Total Activity`
# <chr> <chr> <chr> <chr> <chr>
#1 A biology Student Feb 8 at 1:08pm 18:16:00
#2 B biology Student Feb 8 at 1:37pm 03:22:10
#3 C biology Student Feb 8 at 10:37pm 09:51:00
#4 D biology Student Feb 8 at 11:50am 05:32:31
#5 E biology Student Feb 9 at 12:08pm 07:17:49
#6 F biology Student Feb 10 at 1:33am 12:25:41
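If a number is needed rather than a formatted string, one option is the number of seconds since midnight, taken from the imported POSIXct values with a modulo. This is only a sketch: it assumes the values are in UTC, as in the question's dput output, and it cannot represent durations of 24 hours or more, because those would wrap around.

```r
# First two Total Activity values from the question's dput output
total <- structure(c(-2209009440, -2209063070),
                   class = c("POSIXct", "POSIXt"), tzone = "UTC")

# Seconds elapsed since midnight (UTC) of the stored day
secs <- as.numeric(total) %% 86400
secs
# 65760 12130   (i.e. 18:16:00 and 3:22:10)
```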
I have a dataframe like this:
data <- data.frame(Time = rep(c("Jan 1999", "Feb 1999", "Mar 1999"), each = 3), Country = rep(c("Australia", "Brazil", "Canada"), 3), Group = rep(c("A", "B", "A"), 3), Intercept = NA)
and another dataframe with coefficients from a regression where A and B are the Intercepts for the different groups.
coeffs <- data.frame(Time = c("Jan 1999", "Feb 1999", "Mar 1999"), A = c(1,2,3), B = c(3,2,1))
Now I want to put the intercepts from the coeffs dataframe into the Intercept column of data. I did this the following way:
l <- length(unique(data[,"Country"]))
data[,"Intercept"] <- ifelse(data[,"Group"] == "A", rep(coeffs[,"A"], each = l), rep(coeffs[,"B"], each = l))
This seems to work well for the 2 groups, but now I need to do the same thing for 7 groups and I don't see how I could generalize the approach above. I guess I could use a 7 level nested ifelse statement or a for loop, but there has to be a more elegant way.
Thanks for your help!
Get coeffs in long format and join with data :
library(dplyr)
coeffs %>%
tidyr::pivot_longer(cols = -Time, names_to = 'Group',
values_to = 'Intercept') %>%
right_join(data, by = c('Time', 'Group'))
# A tibble: 9 x 4
# Time Group Intercept Country
# <chr> <chr> <dbl> <chr>
#1 Jan 1999 A 1 Australia
#2 Jan 1999 A 1 Canada
#3 Jan 1999 B 3 Brazil
#4 Feb 1999 A 2 Australia
#5 Feb 1999 A 2 Canada
#6 Feb 1999 B 2 Brazil
#7 Mar 1999 A 3 Australia
#8 Mar 1999 A 3 Canada
#9 Mar 1999 B 1 Brazil
I used this dataframe for data, without the empty Intercept column:
data <- data.frame(Time = rep(c("Jan 1999", "Feb 1999", "Mar 1999"), each = 3),
Country = rep(c("Australia", "Brazil", "Canada"), 3),
Group = rep(c("A", "B", "A"), 3))
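A base R alternative that also scales to 7 groups is to treat coeffs as a lookup matrix and index it by (Time, Group) pairs. This is a sketch under the assumption that every Group value in data has a matching column in coeffs; the matrix name m is illustrative.

```r
data <- data.frame(Time = rep(c("Jan 1999", "Feb 1999", "Mar 1999"), each = 3),
                   Country = rep(c("Australia", "Brazil", "Canada"), 3),
                   Group = rep(c("A", "B", "A"), 3))
coeffs <- data.frame(Time = c("Jan 1999", "Feb 1999", "Mar 1999"),
                     A = c(1, 2, 3), B = c(3, 2, 1))

# Numeric matrix with Time as row names and group names as column names
m <- as.matrix(coeffs[, -1])
rownames(m) <- coeffs$Time

# A two-column character matrix indexes m by (row name, column name) pairs
data$Intercept <- m[cbind(as.character(data$Time), as.character(data$Group))]
data
```

Adding more groups only requires more columns in coeffs; the indexing line stays the same.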