Split a dataframe by repeated rows in R? [duplicate]

This question already has answers here:
How to reshape data from long to wide format
(14 answers)
Closed last year.
I am looking for a way to transform a data.frame with thousands of rows like this:
code date value uname tcode
<chr> <date> <dbl> <ord> <int>
1 CODE1 1968-02-01 14.1 "" NA
2 CODE1 1968-03-01 9.50 "" NA
3 CODE1 1968-04-01 22.1 "" NA
4 CODE2 1968-02-01 15.1 "" NA
5 CODE2 1968-03-01 13.50 "" NA
6 CODE2 1968-04-01 23.1 "" NA
7 CODE3 1968-02-01 16.1 "" NA
8 CODE3 1968-03-01 15.50 "" NA
9 CODE3 1968-04-01 13.1 "" NA
Into something like:
date CODE1 CODE2 CODE3
<date> <dbl> <dbl> <dbl>
1 1968-02-01 14.1 15.1 16.1
2 1968-03-01 9.50 13.50 15.50
3 1968-04-01 22.1 23.1 13.1
This seems straightforward, but I am having difficulty accomplishing it. Thanks!

With the tidyverse you can use pivot_wider():
library(dplyr)
library(tidyr)
df %>%
  select(-c(uname, tcode)) %>%
  pivot_wider(names_from = "code", values_from = "value")
# A tibble: 3 x 4
date CODE1 CODE2 CODE3
<chr> <dbl> <dbl> <dbl>
1 1968-02-01 14.1 15.1 16.1
2 1968-03-01 9.5 13.5 15.5
3 1968-04-01 22.1 23.1 13.1
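Note that date stays character in the result above because it is stored as character in the sample data. A small variation on the same answer parses it first so the wide table carries a true Date column (self-contained sketch rebuilding the sample data):

```r
library(dplyr)
library(tidyr)

# Sample data as in the question ("date" stored as character)
df <- tibble::tibble(
  code  = rep(c("CODE1", "CODE2", "CODE3"), each = 3),
  date  = rep(c("1968-02-01", "1968-03-01", "1968-04-01"), times = 3),
  value = c(14.1, 9.5, 22.1, 15.1, 13.5, 23.1, 16.1, 15.5, 13.1),
  uname = NA, tcode = NA
)

# Parse the date column before reshaping so the result has a real
# Date column instead of character
wide <- df %>%
  mutate(date = as.Date(date)) %>%
  select(-uname, -tcode) %>%
  pivot_wider(names_from = code, values_from = value)
```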
Data
df <- structure(list(code = c("CODE1", "CODE1", "CODE1", "CODE2", "CODE2",
"CODE2", "CODE3", "CODE3", "CODE3"), date = c("1968-02-01", "1968-03-01",
"1968-04-01", "1968-02-01", "1968-03-01", "1968-04-01", "1968-02-01",
"1968-03-01", "1968-04-01"), value = c(14.1, 9.5, 22.1, 15.1,
13.5, 23.1, 16.1, 15.5, 13.1), uname = c(NA, NA, NA, NA, NA,
NA, NA, NA, NA), tcode = c(NA, NA, NA, NA, NA, NA, NA, NA, NA
)), row.names = c(NA, -9L), class = c("tbl_df", "tbl", "data.frame"
))
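If the tidyverse is not available, data.table::dcast performs the same long-to-wide reshape; a sketch on the same sample data (only the three columns that matter here):

```r
library(data.table)

# Same sample data, without the all-NA uname/tcode columns
df <- data.frame(
  code  = rep(c("CODE1", "CODE2", "CODE3"), each = 3),
  date  = rep(c("1968-02-01", "1968-03-01", "1968-04-01"), times = 3),
  value = c(14.1, 9.5, 22.1, 15.1, 13.5, 23.1, 16.1, 15.5, 13.1)
)

# One row per date, one column per code, cells filled from "value"
wide <- dcast(as.data.table(df), date ~ code, value.var = "value")
```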

Related

Specify which column(s) a specific date appears in R

I have a subset of my data in a dataframe (dput codeblock below) containing dates in which a storm occurred ("Date_AR"). I'd like to know if a storm occurred in the north, south or both, by determining whether the same date occurred in the "Date_N" and/or "Date_S" column/s.
For example, the first date is Jan 17, 1989 in the "Date_AR" column. In the location column, I would like "S" to be printed, since this date is found in the "Date_S" column. If Apr 5, 1989 occurs in both "Date_N" and "Date_S", then I would like a "B" (for both) to be printed in the location column.
Thanks in advance for the help! Apologies if this type of question is already out there. I may not know the keywords to search.
structure(list(Date_S = structure(c(6956, 6957, 6970, 7008, 7034,
7035, 7036, 7172, 7223, 7224, 7233, 7247, 7253, 7254, 7255, 7262, 7263, 7266, 7275,
7276), class = "Date"),
Date_N = structure(c(6968, 6969, 7035, 7049, 7103, 7172, 7221, 7223, 7230, 7246, 7247,
7251, 7252, 7253, 7262, 7266, 7275, 7276, 7277, 7280), class = "Date"),
Date_AR = structure(c(6956, 6957, 6968, 6969, 6970, 7008,
7034, 7035, 7036, 7049, 7103, 7172, 7221, 7223, 7224, 7230,
7233, 7246, 7247, 7251), class = "Date"), Precip = c(23.6,
15.4, 3, 16.8, 0.2, 3.6, 22, 13.4, 0, 30.8, 4.6, 27.1, 0,
19, 2.8, 11.4, 2, 57.6, 9.4, 39), Location = c(NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA)), row.names = c(NA, 20L), class = "data.frame")
Using dplyr::case_when you could do:
library(dplyr)
dat |>
  mutate(Location = case_when(
    Date_AR %in% Date_S & Date_AR %in% Date_N ~ "B",
    Date_AR %in% Date_S ~ "S",
    Date_AR %in% Date_N ~ "N"
  ))
#> Date_S Date_N Date_AR Precip Location
#> 1 1989-01-17 1989-01-29 1989-01-17 23.6 S
#> 2 1989-01-18 1989-01-30 1989-01-18 15.4 S
#> 3 1989-01-31 1989-04-06 1989-01-29 3.0 N
#> 4 1989-03-10 1989-04-20 1989-01-30 16.8 N
#> 5 1989-04-05 1989-06-13 1989-01-31 0.2 S
#> 6 1989-04-06 1989-08-21 1989-03-10 3.6 S
#> 7 1989-04-07 1989-10-09 1989-04-05 22.0 S
#> 8 1989-08-21 1989-10-11 1989-04-06 13.4 B
#> 9 1989-10-11 1989-10-18 1989-04-07 0.0 S
#> 10 1989-10-12 1989-11-03 1989-04-20 30.8 N
#> 11 1989-10-21 1989-11-04 1989-06-13 4.6 N
#> 12 1989-11-04 1989-11-08 1989-08-21 27.1 B
#> 13 1989-11-10 1989-11-09 1989-10-09 0.0 N
#> 14 1989-11-11 1989-11-10 1989-10-11 19.0 B
#> 15 1989-11-12 1989-11-19 1989-10-12 2.8 S
#> 16 1989-11-19 1989-11-23 1989-10-18 11.4 N
#> 17 1989-11-20 1989-12-02 1989-10-21 2.0 S
#> 18 1989-11-23 1989-12-03 1989-11-03 57.6 N
#> 19 1989-12-02 1989-12-04 1989-11-04 9.4 B
#> 20 1989-12-03 1989-12-07 1989-11-08 39.0 N
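The same logic also has a base R spelling with nested ifelse(), shown here on a two-row toy version of the data (one date only in Date_S, one date in both columns):

```r
# Toy data: first Date_AR occurs only in Date_S, second occurs in both
dat <- data.frame(
  Date_S  = as.Date(c("1989-01-17", "1989-04-05")),
  Date_N  = as.Date(c("1989-04-05", "1989-01-29")),
  Date_AR = as.Date(c("1989-01-17", "1989-04-05"))
)

# "B" if the date occurs in both columns, else "S" or "N", else NA
dat$Location <- ifelse(dat$Date_AR %in% dat$Date_S & dat$Date_AR %in% dat$Date_N, "B",
                ifelse(dat$Date_AR %in% dat$Date_S, "S",
                ifelse(dat$Date_AR %in% dat$Date_N, "N", NA)))
```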

How to set some columns, and their corresponding columns, to missing in a data frame in R

I have longitudinal data with three follow-ups; columns 2, 3, and 4 (v_9, v_01, and v_03) hold the follow-up values.
I want to set the value 99 in the columns v_9, v_01, and v_03 to NA, and I also want to set their corresponding columns ("d_9", "d_01", "d_03" and "a_9", "a_01", "a_03") to NA. As an example, for ID 101 this looks as below:
How can I do this for all the individuals and my whole data set in R? thanks in advance for the help.
"id" "v_9" "v_01" "v_03" "d_9" "d_01" "d_03" "a_9" "a_01" "a_03"
101 12 NA 10 2015-03-23 NA 2003-06-19 40.50650 NA 44.1065
structure(list(id = c(101, 102, 103, 104), v_9 = c(12, 99, 16,
25), v_01 = c(99, 12, 16, NA), v_03 = c(10, NA, 99, NA), d_9 = structure(c(16517,
17613, 16769, 10667), class = "Date"), d_01 = structure(c(13291,
NA, 13566, NA), class = "Date"), d_03 = structure(c(12222, NA,
12119, NA), class = "Date"), a_9 = c(40.5065, 40.5065, 30.19713,
51.40862), a_01 = c(42.5065, 41.5112, 32.42847, NA), a_03 = c(44.1065,
NA, 35.46543, NA)), row.names = c(NA, -4L), class = c("tbl_df",
"tbl", "data.frame"))
Try this function:
fn <- function(df){
  for(s in c("_9", "_01", "_03")){
    i <- which(df[[paste0("v", s)]] == 99)
    df[i, paste0("v", s)] <- NA
    df[i, paste0("d", s)] <- NA
    df[i, paste0("a", s)] <- NA
  }
  df
}
df <- fn(df)
Output
# A tibble: 4 × 10
id v_9 v_01 v_03 d_9 d_01 d_03 a_9 a_01 a_03
<dbl> <dbl> <dbl> <dbl> <date> <date> <date> <dbl> <dbl> <dbl>
1 101 12 NA 10 2015-03-23 NA 2003-06-19 40.5 NA 44.1
2 102 NA 12 NA NA NA NA NA 41.5 NA
3 103 16 16 NA 2015-11-30 2007-02-22 NA 30.2 32.4 NA
4 104 25 NA NA 1999-03-17 NA NA 51.4 NA NA
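A closely related variant (hypothetical name fn2) makes the NA handling explicit, since which(... == 99) in the answer above silently drops NA comparisons, and writes all three columns in one assignment; sketched on a two-row toy data frame with the same column layout:

```r
fn2 <- function(df){
  for (s in c("_9", "_01", "_03")) {
    v <- paste0("v", s)
    # Flag rows where the v_ column is exactly 99 (NA-safe)
    bad <- !is.na(df[[v]]) & df[[v]] == 99
    # Blank out the v_, d_ and a_ columns for flagged rows in one go
    df[bad, c(v, paste0("d", s), paste0("a", s))] <- NA
  }
  df
}

# Toy data with the same column layout as the question
df <- data.frame(
  id   = c(101, 102),
  v_9  = c(12, 99), v_01 = c(99, 12), v_03 = c(10, NA),
  d_9  = as.Date(c("2015-03-23", "2018-03-23")),
  d_01 = as.Date(c("2006-05-23", NA)),
  d_03 = as.Date(c("2003-06-19", NA)),
  a_9  = c(40.5, 40.5), a_01 = c(42.5, 41.5), a_03 = c(44.1, NA)
)
res <- fn2(df)
```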

Slice based on multiple date ranges and multiple columns to format a new dataframe with R

Let's say I have a sample data as follow:
df <- structure(list(date = structure(c(18912, 18913, 18914, 18915,
18916, 18917, 18918, 18919, 18920, 18921, 18922, 18923), class = "Date"),
value1 = c(1.015, NA, NA, 1.015, 1.015, 1.015, 1.015, 1.015,
1.015, 1.015, 1.015, 1.015), value2 = c(1.115668, 1.104622,
1.093685, 1.082857, 1.072135, 1.06152, 1.05101, NA, NA, 1.0201,
1.01, 1), value3 = c(1.015, 1.030225, NA, NA, 1.077284, 1.093443,
1.109845, 1.126493, 1.14339, 1.160541, 1.177949, 1.195618
)), row.names = c(NA, -12L), class = "data.frame")
and three date intervals:
date_range1 <- (date>='2021-10-12' & date<='2021-10-15')
date_range2 <- (date>='2021-10-16' & date<='2021-10-18')
date_range3 <- (date>='2021-10-21' & date<='2021-10-23')
I need to slice data from value1, value2 and value3 using date_range1, date_range2 and date_range3 respectively, and finally concatenate them to one column as follows:
Please note type 1, 2 and 3 are numbers to indicate date ranges: date_range1, date_range2 and date_range3.
How could I achieve that with R's packages? Thanks.
EDIT:
str(real_data)
Out:
tibble [1,537 x 5] (S3: tbl_df/tbl/data.frame)
$ date : chr [1:1537] "2008-01-31" "2008-02-29" "2008-03-31" "2008-04-30" ...
$ value1: num [1:1537] 11.3 11.4 11.4 11.3 11.2 ...
$ value2 : num [1:1537] 11.4 11.4 11.3 11.3 11.1 ...
$ value3: num [1:1537] NA NA NA NA NA NA NA NA NA NA ...
$ value4 : chr [1:1537] "11.60" "10.20" "12.55" "10.37" ...
You may use dplyr::case_when:
library(dplyr)
df %>%
  mutate(type = case_when(
    date >= '2021-10-12' & date <= '2021-10-15' ~ 1,
    date >= '2021-10-16' & date <= '2021-10-18' ~ 2,
    date >= '2021-10-21' & date <= '2021-10-23' ~ 3,
    TRUE ~ NA_real_
  ),
  value = case_when(
    type == 1 ~ value1,
    type == 2 ~ value2,
    type == 3 ~ value3,
    TRUE ~ NA_real_
  )) %>%
  select(date, value, type) %>%
  filter(!is.na(type))
date value type
1 2021-10-12 1.015000 1
2 2021-10-13 NA 1
3 2021-10-14 NA 1
4 2021-10-15 1.015000 1
5 2021-10-16 1.072135 2
6 2021-10-17 1.061520 2
7 2021-10-18 1.051010 2
8 2021-10-21 1.160541 3
9 2021-10-22 1.177949 3
10 2021-10-23 1.195618 3
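With many date ranges, hard-coding one case_when() branch per range becomes unwieldy. A more scalable sketch (the ranges lookup table and its column names are assumptions mirroring the question) stores each range's bounds and source column in a table, slices each range out, and stacks the pieces:

```r
# Sample data as in the question (dates 2021-10-12 .. 2021-10-23)
df <- data.frame(
  date   = seq(as.Date("2021-10-12"), as.Date("2021-10-23"), by = "day"),
  value1 = c(1.015, NA, NA, rep(1.015, 9)),
  value2 = c(1.115668, 1.104622, 1.093685, 1.082857, 1.072135, 1.06152,
             1.05101, NA, NA, 1.0201, 1.01, 1),
  value3 = c(1.015, 1.030225, NA, NA, 1.077284, 1.093443, 1.109845,
             1.126493, 1.14339, 1.160541, 1.177949, 1.195618)
)

# One row per range: its type label, bounds, and which column to take
ranges <- data.frame(
  type  = 1:3,
  start = as.Date(c("2021-10-12", "2021-10-16", "2021-10-21")),
  end   = as.Date(c("2021-10-15", "2021-10-18", "2021-10-23")),
  col   = c("value1", "value2", "value3")
)

# Slice each range out of its column, then stack the pieces
pieces <- lapply(seq_len(nrow(ranges)), function(i) {
  keep <- df$date >= ranges$start[i] & df$date <= ranges$end[i]
  data.frame(date  = df$date[keep],
             value = df[[ranges$col[i]]][keep],
             type  = ranges$type[i])
})
result <- do.call(rbind, pieces)
```

Adding a fourth range is then one new row in ranges rather than two new case_when() branches.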

Merge two dataframes: specifically merge a selection of columns based on two conditions?

I have two datasets on the same 2 patients. With the second dataset I want to add new information to the first, but I can't seem to get the code right.
My first (incomplete) dataset has a patient ID, measurement time (either T0 or FU1), year of birth, date of the CT scan, and two outcomes (legs_mass and total_mass):
library(tidyverse)
library(dplyr)
library(magrittr)
library(lubridate)
df1 <- structure(list(ID = c(115, 115, 370, 370), time = structure(c(1L,
6L, 1L, 6L), .Label = c("T0", "T1M0", "T1M6", "T1M12", "T2M0",
"FU1"), class = "factor"), year_of_birth = c(1970, 1970, 1961,
1961), date_ct = structure(c(16651, 17842, 16651, 18535), class = "Date"),
legs_mass = c(9.1, NA, NA, NA), total_mass = c(14.5, NA,
NA, NA)), row.names = c(NA, -4L), class = c("tbl_df", "tbl",
"data.frame"))
# Which gives the following dataframe
df1
# A tibble: 4 x 6
ID time year_of_birth date_ct legs_mass total_mass
<dbl> <fct> <dbl> <date> <dbl> <dbl>
1 115 T0 1970 2015-08-04 9.1 14.5
2 115 FU1 1970 2018-11-07 NA NA
3 370 T0 1961 2015-08-04 NA NA
4 370 FU1 1961 2020-09-30 NA NA
The second dataset adds to the legs_mass and total_mass columns:
df2 <- structure(list(ID = c(115, 370), date_ct = structure(c(17842,
18535), class = "Date"), ctscan_label = c("PXE115_CT_20181107_xxxxx-3.tif",
"PXE370_CT_20200930_xxxxx-403.tif"), legs_mass = c(956.1, 21.3
), total_mass = c(1015.9, 21.3)), row.names = c(NA, -2L), class = c("tbl_df",
"tbl", "data.frame"))
# Which gives the following dataframe:
df2
# A tibble: 2 x 5
ID date_ct ctscan_label legs_mass total_mass
<dbl> <date> <chr> <dbl> <dbl>
1 115 2018-11-07 PXE115_CT_20181107_xxxxx-3.tif 956. 1016.
2 370 2020-09-30 PXE370_CT_20200930_xxxxx-403.tif 21.3 21.3
What I am trying to do is:
Add the legs_mass and total_mass column values from df2 to df1, based on ID number and date_ct.
Add the new columns of df2 (the one that is not in df1; ctscan_label) to df1, also based on the date of the ct and patient ID.
So that the final dataset df3 looks as follows:
df3 <- structure(list(ID = c(115, 115, 370, 370), time = structure(c(1L,
6L, 1L, 6L), .Label = c("T0", "T1M0", "T1M6", "T1M12", "T2M0",
"FU1"), class = "factor"), year_of_birth = c(1970, 1970, 1961,
1961), date_ct = structure(c(16651, 17842, 16651, 18535), class = "Date"),
legs_mass = c(9.1, 956.1, NA, 21.3), total_mass = c(14.5,
1015.9, NA, 21.3)), row.names = c(NA, -4L), class = c("tbl_df",
"tbl", "data.frame"))
# Corresponding to the following tibble:
# A tibble: 4 x 6
ID time year_of_birth date_ct legs_mass total_mass
<dbl> <fct> <dbl> <date> <dbl> <dbl>
1 115 T0 1970 2015-08-04 9.1 14.5
2 115 FU1 1970 2018-11-07 956. 1016.
3 370 T0 1961 2015-08-04 NA NA
4 370 FU1 1961 2020-09-30 21.3 21.3
I have tried the merge function and rbind from baseR, and bind_rows from dplyr but can't seem to get it right.
Any help?
You can join the two datasets and use coalesce to keep one non-NA value from the two datasets.
library(dplyr)
left_join(df1, df2, by = c("ID", "date_ct")) %>%
  mutate(legs_mass = coalesce(legs_mass.x, legs_mass.y),
         total_mass = coalesce(total_mass.x, total_mass.y)) %>%
  select(-matches('\\.x|\\.y'), -ctscan_label)
# ID time year_of_birth date_ct legs_mass total_mass
# <dbl> <fct> <dbl> <date> <dbl> <dbl>
#1 115 T0 1970 2015-08-04 9.1 14.5
#2 115 FU1 1970 2018-11-07 956. 1016.
#3 370 T0 1961 2015-08-04 NA NA
#4 370 FU1 1961 2020-09-30 21.3 21.3
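If your dplyr is >= 1.0, rows_patch() does this NA-filling join directly: it overwrites only NA cells in the target with values from the patch table, matched on the key columns. A sketch (note the extra ctscan_label column must be dropped first, since every patch column has to exist in the target):

```r
library(dplyr)

df1 <- tibble::tibble(
  ID = c(115, 115, 370, 370),
  time = c("T0", "FU1", "T0", "FU1"),
  date_ct = as.Date(c("2015-08-04", "2018-11-07", "2015-08-04", "2020-09-30")),
  legs_mass = c(9.1, NA, NA, NA),
  total_mass = c(14.5, NA, NA, NA)
)
df2 <- tibble::tibble(
  ID = c(115, 370),
  date_ct = as.Date(c("2018-11-07", "2020-09-30")),
  ctscan_label = c("PXE115_CT_20181107_xxxxx-3.tif",
                   "PXE370_CT_20200930_xxxxx-403.tif"),
  legs_mass = c(956.1, 21.3),
  total_mass = c(1015.9, 21.3)
)

# Fill NA legs_mass/total_mass in df1 from df2, matching on ID + date_ct
df3 <- rows_patch(df1, select(df2, -ctscan_label), by = c("ID", "date_ct"))
```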
We can use data.table methods
library(data.table)
setDT(df1)[setDT(df2), c("legs_mass", "total_mass") :=
.(fcoalesce(legs_mass, i.legs_mass),
fcoalesce(total_mass, i.total_mass)), on = .(ID, date_ct)]
Output:
df1
ID time year_of_birth date_ct legs_mass total_mass
1: 115 T0 1970 2015-08-04 9.1 14.5
2: 115 FU1 1970 2018-11-07 956.1 1015.9
3: 370 T0 1961 2015-08-04 NA NA
4: 370 FU1 1961 2020-09-30 21.3 21.3

Unique/distinct in R that doesn't keep any of the non-unique rows [duplicate]

This question already has an answer here:
R, Removing duplicate along with the original value [duplicate]
(1 answer)
Closed 2 years ago.
I have a tbl_df that looks like this:
Genes Cell AC FC
<chr> <chr> <dbl> <dbl>
1 abts-1 MSx1 94.9 6.81
2 acp-2 Ea 301. 32.4
3 acp-2 Ep 188. 20.6
4 acs-13 MSx1 69.1 8.20
5 acs-22 Ea 176. 19.4
6 acs-22 Ep 64.3 7.70
7 acs-3 Ea 156. 17.2
8 acs-3 Ep 75.5 8.87
9 add-2 Ea 123. 6.62
10 add-2 Ep 125. 6.69
I would like to remove all rows that are non-unique based on "Genes", i.e. not keep any copy of the duplicated rows. So it should look like:
Genes Cell AC FC
<chr> <chr> <dbl> <dbl>
1 abts-1 MSx1 94.9 6.81
2 acs-13 MSx1 69.1 8.20
where none of the repeated genes are kept and the rest of the column data are maintained. I have tried unique(), distinct(), !duplicated, etc., but none of these remove all the non-unique rows.
Try this:
library(dplyr)
#Code
new <- df %>%
  group_by(Genes) %>%
  filter(n() == 1)
Output:
# A tibble: 2 x 4
# Groups: Genes [2]
Genes Cell AC FC
<chr> <chr> <dbl> <dbl>
1 abts-1 MSx1 94.9 6.81
2 acs-13 MSx1 69.1 8.2
Some data used:
#Data
df <- structure(list(Genes = c("abts-1", "acp-2", "acp-2", "acs-13",
"acs-22", "acs-22", "acs-3", "acs-3", "add-2", "add-2"), Cell = c("MSx1",
"Ea", "Ep", "MSx1", "Ea", "Ep", "Ea", "Ep", "Ea", "Ep"), AC = c(94.9,
301, 188, 69.1, 176, 64.3, 156, 75.5, 123, 125), FC = c(6.81,
32.4, 20.6, 8.2, 19.4, 7.7, 17.2, 8.87, 6.62, 6.69)), row.names = c(NA,
-10L), class = "data.frame")
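The same filter has a base R idiom: flag every row whose Genes value appears more than once by combining duplicated() from both directions, since a forward scan alone misses the first copy of each duplicate. Shown here on a subset of the data:

```r
df <- data.frame(
  Genes = c("abts-1", "acp-2", "acp-2", "acs-13", "acs-22", "acs-22"),
  Cell  = c("MSx1", "Ea", "Ep", "MSx1", "Ea", "Ep"),
  AC    = c(94.9, 301, 188, 69.1, 176, 64.3),
  FC    = c(6.81, 32.4, 20.6, 8.2, 19.4, 7.7)
)

# duplicated() scanned forwards misses the first copy of each
# duplicate, so also scan from the end and OR the two flags
dup <- duplicated(df$Genes) | duplicated(df$Genes, fromLast = TRUE)
unique_rows <- df[!dup, ]
```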