Lookup value from multiple sets of columns - r

This is a kind of VLOOKUP problem from Excel. I have a data set like the following.
dat1 <- read.table(header=TRUE, text="
ID Name1 Name2
1384 Rem_Ps Tel_Nm
1442 Teq_Ls Sel_Nm
1340 Fem_Bs Tem_Mn
1419 Few_Bn Ten_Gf
1359 Fem_Bs Tem_Mn
1237 Qwl_Po Mnt_Pj
1288 Tem_na Tem_Rt
1261 Sem_Na Tel_Tr
1382 Rem_Ps Tel_Nm
1316 Fem_Bs Tem_Mn
1279 Sem_Na Yem_Rt
1366 Sel_Ve Mkl_Po
1269 Rem_Ps Tel_Nm
")
dat1
ID Name1 Name2
1 1384 Rem_Ps Tel_Nm
2 1442 Teq_Ls Sel_Nm
3 1340 Fem_Bs Tem_Mn
4 1419 Few_Bn Ten_Gf
5 1359 Fem_Bs Tem_Mn
6 1237 Qwl_Po Mnt_Pj
7 1288 Tem_na Tem_Rt
8 1261 Sem_Na Tel_Tr
9 1382 Rem_Ps Tel_Nm
10 1316 Fem_Bs Tem_Mn
11 1279 Sem_Na Yem_Rt
12 1366 Sel_Ve Mkl_Po
13 1269 Rem_Ps Tel_Nm
The above dataset looks up values in the following data set. Both lookup values, Name1 and Name2, are searched for in the seven columns QC1 to NC3 of dat2. To clarify: a row of dat1 is considered a valid match only if Name1 and Name2 are both found among those seven columns of the same dat2 row. For example, the second row has the two values Teq_Ls and Sel_Nm; as Teq_Ls is not found in the seven columns, we toss this row.
dat2 <- read.table(header=TRUE, text="
ID1 REQ REM QC1 QC2 QC3 QC4 NC1 NC2 NC3
AB1 1123 44ed Fem_Bs Ten_Gf NA NA Tem_Mn Tem_Mn NA
AB2 123 331s Tem_Rt Qwl_Po NA Ten_Gf NA Tem_Mn Mnt_Pj
AB3 123 334q Ten_Gf Tem_Mn Sem_Na Tem-Mn Tel_Tr NA NA
AB4 1234 33ey Sem_Na NA NA NA Tem_Rt NA Yem_Rt
AB5 13243 ed43 Rem_Ps NA NA Tem_Mn NA Tel_Nm NA
AB6 123 34rt NA Ten_Gf NA Sel_Ve Mkl_Po Tem_Rt NA
")
dat2
ID1 REQ REM QC1 QC2 QC3 QC4 NC1 NC2 NC3
1 AB1 1123 44ed Fem_Bs Ten_Gf <NA> <NA> Tem_Mn Tem_Mn <NA>
2 AB2 123 331s Tem_Rt Qwl_Po <NA> Ten_Gf <NA> Tem_Mn Mnt_Pj
3 AB3 123 334q Ten_Gf Tem_Mn Sem_Na Tem-Mn Tel_Tr <NA> <NA>
4 AB4 1234 33ey Sem_Na <NA> <NA> <NA> Tem_Rt <NA> Yem_Rt
5 AB5 13243 ed43 Rem_Ps <NA> <NA> Tem_Mn <NA> Tel_Nm <NA>
6 AB6 123 34rt <NA> Ten_Gf <NA> Sel_Ve Mkl_Po Tem_Rt <NA>
The result would be like this.
ID Name1 Name2 ID1 REQ REM
1384 Rem_Ps Tel_Nm AB5 13243 ed43
1340 Fem_Bs Tem_Mn AB1 1123 44ed
1359 Fem_Bs Tem_Mn AB1 1123 44ed
1237 Qwl_Po Mnt_Pj AB2 123 331s
1261 Sem_Na Tel_Tr AB3 123 334q
1382 Rem_Ps Tel_Nm AB5 13243 ed43
1316 Fem_Bs Tem_Mn AB1 1123 44ed
1279 Sem_Na Yem_Rt AB4 1234 33ey
1366 Sel_Ve Mkl_Po AB6 123 34rt
1269 Rem_Ps Tel_Nm AB5 13243 ed43

Let's do it in base R:
# For each dat1 row, test every dat2 row for both Name1 and Name2; arr.ind = TRUE returns the matching (dat2 row, dat1 row) index pairs
z <- which(apply(dat1, 1, function(x) apply(dat2, 1, function(z) x[[2]] %in% z & x[[3]] %in% z)), arr.ind = TRUE)
cbind(dat1[z[,2],], dat2[z[,1],])
ID Name1 Name2 ID1 REQ REM QC1 QC2 QC3 QC4 NC1 NC2 NC3
1 1384 Rem_Ps Tel_Nm AB5 13243 ed43 Rem_Ps <NA> <NA> Tem_Mn <NA> Tel_Nm <NA>
3 1340 Fem_Bs Tem_Mn AB1 1123 44ed Fem_Bs Ten_Gf <NA> <NA> Tem_Mn Tem_Mn <NA>
5 1359 Fem_Bs Tem_Mn AB1 1123 44ed Fem_Bs Ten_Gf <NA> <NA> Tem_Mn Tem_Mn <NA>
6 1237 Qwl_Po Mnt_Pj AB2 123 331s Tem_Rt Qwl_Po <NA> Ten_Gf <NA> Tem_Mn Mnt_Pj
8 1261 Sem_Na Tel_Tr AB3 123 334q Ten_Gf Tem_Mn Sem_Na Tem-Mn Tel_Tr <NA> <NA>
9 1382 Rem_Ps Tel_Nm AB5 13243 ed43 Rem_Ps <NA> <NA> Tem_Mn <NA> Tel_Nm <NA>
10 1316 Fem_Bs Tem_Mn AB1 1123 44ed Fem_Bs Ten_Gf <NA> <NA> Tem_Mn Tem_Mn <NA>
11 1279 Sem_Na Yem_Rt AB4 1234 33ey Sem_Na <NA> <NA> <NA> Tem_Rt <NA> Yem_Rt
12 1366 Sel_Ve Mkl_Po AB6 123 34rt <NA> Ten_Gf <NA> Sel_Ve Mkl_Po Tem_Rt <NA>
13 1269 Rem_Ps Tel_Nm AB5 13243 ed43 Rem_Ps <NA> <NA> Tem_Mn <NA> Tel_Nm <NA>
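For comparison, a tidyverse sketch of the same lookup (an alternative not in the original answer; it needs dplyr and tidyr and assumes the name columns were read as character): reshape dat2 to long form and join twice, so that a dat1 row is kept only when Name1 and Name2 both occur in the same dat2 row.
library(dplyr)
library(tidyr)

lookup_cols <- c("QC1", "QC2", "QC3", "QC4", "NC1", "NC2", "NC3")

# One row per (dat2 row, lookup value), with NAs dropped
dat2_long <- dat2 %>%
  pivot_longer(all_of(lookup_cols), values_to = "Name", values_drop_na = TRUE) %>%
  distinct(ID1, REQ, REM, Name)

dat1 %>%
  inner_join(dat2_long, by = c(Name1 = "Name")) %>%              # Name1 must match some dat2 row
  semi_join(dat2_long, by = c(Name2 = "Name", ID1 = "ID1")) %>%  # Name2 must match the same dat2 row
  select(ID, Name1, Name2, ID1, REQ, REM)
This returns only the columns shown in the desired result, whereas the base solution above also carries along the QC/NC columns.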

Related

Unnest or move rows to columns?

This is just one of those things that I can't figure out how to word in order to search for a solution to my problem. I have some election data for Democratic and Republican candidates. The data is contained in 2 rows per county with one of those rows corresponding to one of the two candidates.
I need a data frame with one row per county, which means creating new columns out of the second row for each county. I've tried to unnest the data frame, but that doesn't work. I've seen something about using unnest and mutate together, but I can't figure that out. Transposing the data frame didn't help either, and I've also tried ungroup without success.
# Load Michigan 2020 by-county election data
# Data: https://mielections.us/election/results/DATA/2020GEN_MI_CENR_BY_COUNTY.xls
election <- read.csv("2020GEN_MI_CENR_BY_COUNTY.txt", sep = "\t", header = TRUE)
# Remove unnecessary columns
election <- within(election, rm('ElectionDate','OfficeCode.Text.','DistrictCode.Text.','StatusCode','CountyCode','OfficeDescription','PartyOrder','PartyName','CandidateID','CandidateFirstName','CandidateMiddleName','CandidateFormerName','WriteIn.W..Uncommitted.Z.','Recount...','Nominated.N..Elected.E.'))
# Remove offices other than POTUS
election <- election[-c(167:2186),]
# Keep only DEM and REP parties
election <- election %>%
  filter(PartyDescription == "Democratic" |
         PartyDescription == "Republican")
I'd like it to look like this:
dplyr
library(dplyr)
library(tidyr) # pivot_wider
election %>%
  select(CountyName, PartyDescription, CandidateLastName, CandidateVotes) %>%
  slice(-(167:2186)) %>%
  filter(PartyDescription %in% c("Democratic", "Republican")) %>%
  pivot_wider(CountyName, names_from = CandidateLastName, values_from = CandidateVotes)
# # A tibble: 83 x 25
# CountyName Biden Trump Richer LaFave Cambensy Wagner Metsa Markkanen Lipton Strayhorn Carlone Frederick Bernstein Diggs Hubbard Meyers Mosallam Vassar `O'Keefe` Schuitmaker Dewaelsche Stancato Gates Land
# <chr> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
# 1 ALCONA 2142 4848 NA NA NA NA NA NA 1812 1748 4186 4209 1818 1738 4332 4114 1696 1770 4273 4187 1682 1733 4163 4223
# 2 ALGER 2053 3014 NA NA 2321 2634 NA NA 1857 1773 2438 2470 1795 1767 2558 2414 1757 1769 2538 2444 1755 1757 2458 2481
# 3 ALLEGAN 24449 41392 NA NA NA NA NA NA 20831 19627 37681 38036 20043 19640 38805 37375 18820 19486 37877 39052 19081 19039 37322 38883
# 4 ALPENA 6000 10686 NA NA NA NA NA NA 5146 4882 8845 8995 5151 4873 9369 8744 4865 4935 9212 8948 4816 4923 9069 9154
# 5 ANTRIM 5960 9748 NA NA NA NA NA NA 5042 4798 8828 8886 4901 4797 9108 8737 4686 4810 9079 8867 4679 4781 8868 9080
# 6 ARENAC 2774 5928 NA NA NA NA NA NA 2374 2320 4626 4768 2396 2224 4833 4584 2215 2243 5025 4638 2185 2276 4713 4829
# 7 BARAGA 1478 2512 NA NA NA NA 1413 2517 1267 1212 2057 2078 1269 1233 2122 2003 1219 1243 2090 2056 1226 1228 2072 2074
# 8 BARRY 11797 23471 NA NA NA NA NA NA 9794 9280 20254 20570 9466 9215 20885 20265 9060 9324 21016 20901 8967 9121 20346 21064
# 9 BAY 26151 33125 NA NA NA NA NA NA 23209 22385 26021 26418 23497 22050 27283 25593 21757 22225 27422 25795 21808 21999 26167 26741
# 10 BENZIE 5480 6601 NA NA NA NA NA NA 4704 4482 5741 5822 4584 4479 6017 5681 4379 4449 5979 5756 4392 4353 5704 5870
# # ... with 73 more rows
@r2evans had the right idea, but slicing the data before filtering lost a lot of the voting data. I hadn't realized that before.
# Load Michigan 2020 by-county election data
# Data: https://mielections.us/election/results/DATA/2020GEN_MI_CENR_BY_COUNTY.xls
election <- read.csv("2020GEN_MI_CENR_BY_COUNTY.txt", sep = "\t", header = TRUE)
# That's an ugly dataset...let's make it better
election <- election[-c(1:5,7:9,11,13:15,17:19)]
library(dplyr)
library(tidyr) # pivot_wider

election <- election %>%
  filter(CandidateLastName %in% c("Biden", "Trump")) %>%
  select(CountyName, PartyDescription, CandidateLastName, CandidateVotes) %>%
  pivot_wider(CountyName, names_from = CandidateLastName, values_from = CandidateVotes)
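One extra check that can be worth running right after the column clean-up, before the pivot (a sketch, not from the original post): each county should contribute exactly one row per candidate, otherwise pivot_wider() warns and produces list-columns.
library(dplyr)

election %>%
  filter(CandidateLastName %in% c("Biden", "Trump")) %>%
  count(CountyName, CandidateLastName) %>%
  filter(n != 1)   # expect zero rows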

Add dates between existing dates in a data frame and fill the other columns of the newly created rows with NA in R

This is how my data looks like:
> dput(head(h01_NDVI_specveg_data_spectra,6))
structure(list(ID = c("h01", "h01", "h01", "h01", "h01", "h01"
), collection_date = structure(c(15076, 15092, 15125, 15139,
15159, 15170), class = "Date"), NDVI = c(0.581769436997319, 0.539445628997868,
0.338541666666667, 0.302713987473904, 0.305882352941176, 0.269439421338155
)), row.names = c(NA, -6L), class = c("tbl_df", "tbl", "data.frame"))
My dates are not consecutive, as you can see in the example (e.g. 2011-04-12, 2011-04-28, 2011-05-31, ...). What I want is to insert the missing dates between the dates that I have, creating new rows in which the other columns, such as NDVI, are NA.
Check this example of the desired output:
ID   collection_date  NDVI
h01  2011-04-12       0.5817694
h01  2011-04-13       NA
h01  2011-04-14       NA
h01  2011-04-15       NA
h01  2011-04-16       NA
h01  2011-04-17       NA
h01  2011-04-18       NA
h01  2011-04-19       NA
h01  2011-04-20       NA
h01  2011-04-21       NA
h01  2011-04-22       NA
h01  2011-04-23       NA
h01  2011-04-24       NA
h01  2011-04-25       NA
h01  2011-04-26       NA
h01  2011-04-27       NA
h01  2011-04-28       0.5394456
h01  2011-04-29       NA
h01  2011-04-30       NA
...  ...              ...
Any help will be much appreciated.
df1 <- structure(list(ID = c("h01", "h01", "h01", "h01", "h01", "h01"),
collection_date = structure(c(15076, 15092, 15125, 15139,
15159, 15170), class = "Date"),
NDVI = c(0.581769436997319, 0.539445628997868, 0.338541666666667, 0.302713987473904, 0.305882352941176, 0.269439421338155)),
row.names = c(NA, -6L), class = c("data.frame"))
We create a data.frame containing all dates and dplyr::left_join() it with the existing (incomplete) data. The NAs are created automatically.
library(dplyr)
library(tidyr)
data.frame(collection_date = seq.Date(min(df1$collection_date), max(df1$collection_date), "days")) %>%
  left_join(df1) %>%
  arrange(collection_date) %>%
  select(ID, collection_date, everything())
Returns:
ID collection_date NDVI
1 h01 2011-04-12 0.5817694
2 <NA> 2011-04-13 NA
3 <NA> 2011-04-14 NA
4 <NA> 2011-04-15 NA
5 <NA> 2011-04-16 NA
6 <NA> 2011-04-17 NA
7 <NA> 2011-04-18 NA
8 <NA> 2011-04-19 NA
9 <NA> 2011-04-20 NA
10 <NA> 2011-04-21 NA
11 <NA> 2011-04-22 NA
12 <NA> 2011-04-23 NA
13 <NA> 2011-04-24 NA
14 <NA> 2011-04-25 NA
15 <NA> 2011-04-26 NA
16 <NA> 2011-04-27 NA
17 h01 2011-04-28 0.5394456
18 <NA> 2011-04-29 NA
19 <NA> 2011-04-30 NA
20 <NA> 2011-05-01 NA
21 <NA> 2011-05-02 NA
22 <NA> 2011-05-03 NA
23 <NA> 2011-05-04 NA
24 <NA> 2011-05-05 NA
25 <NA> 2011-05-06 NA
26 <NA> 2011-05-07 NA
27 <NA> 2011-05-08 NA
28 <NA> 2011-05-09 NA
29 <NA> 2011-05-10 NA
30 <NA> 2011-05-11 NA
31 <NA> 2011-05-12 NA
32 <NA> 2011-05-13 NA
33 <NA> 2011-05-14 NA
34 <NA> 2011-05-15 NA
35 <NA> 2011-05-16 NA
36 <NA> 2011-05-17 NA
37 <NA> 2011-05-18 NA
38 <NA> 2011-05-19 NA
39 <NA> 2011-05-20 NA
40 <NA> 2011-05-21 NA
41 <NA> 2011-05-22 NA
42 <NA> 2011-05-23 NA
43 <NA> 2011-05-24 NA
44 <NA> 2011-05-25 NA
45 <NA> 2011-05-26 NA
46 <NA> 2011-05-27 NA
47 <NA> 2011-05-28 NA
48 <NA> 2011-05-29 NA
49 <NA> 2011-05-30 NA
50 h01 2011-05-31 0.3385417
51 <NA> 2011-06-01 NA
52 <NA> 2011-06-02 NA
53 <NA> 2011-06-03 NA
54 <NA> 2011-06-04 NA
55 <NA> 2011-06-05 NA
56 <NA> 2011-06-06 NA
57 <NA> 2011-06-07 NA
58 <NA> 2011-06-08 NA
59 <NA> 2011-06-09 NA
60 <NA> 2011-06-10 NA
61 <NA> 2011-06-11 NA
62 <NA> 2011-06-12 NA
63 <NA> 2011-06-13 NA
64 h01 2011-06-14 0.3027140
65 <NA> 2011-06-15 NA
66 <NA> 2011-06-16 NA
67 <NA> 2011-06-17 NA
68 <NA> 2011-06-18 NA
69 <NA> 2011-06-19 NA
70 <NA> 2011-06-20 NA
71 <NA> 2011-06-21 NA
72 <NA> 2011-06-22 NA
73 <NA> 2011-06-23 NA
74 <NA> 2011-06-24 NA
75 <NA> 2011-06-25 NA
76 <NA> 2011-06-26 NA
77 <NA> 2011-06-27 NA
78 <NA> 2011-06-28 NA
79 <NA> 2011-06-29 NA
80 <NA> 2011-06-30 NA
81 <NA> 2011-07-01 NA
82 <NA> 2011-07-02 NA
83 <NA> 2011-07-03 NA
84 h01 2011-07-04 0.3058824
85 <NA> 2011-07-05 NA
86 <NA> 2011-07-06 NA
87 <NA> 2011-07-07 NA
88 <NA> 2011-07-08 NA
89 <NA> 2011-07-09 NA
90 <NA> 2011-07-10 NA
91 <NA> 2011-07-11 NA
92 <NA> 2011-07-12 NA
93 <NA> 2011-07-13 NA
94 <NA> 2011-07-14 NA
95 h01 2011-07-15 0.2694394
Edit:
In order to have ID = "h01" everywhere we just add it to the constructed data.frame. I.e.:
library(dplyr)
library(tidyr)
data.frame(collection_date = seq.Date(min(df1$collection_date), max(df1$collection_date), "days"),
           ID = "h01") %>%
  left_join(df1) %>%
  arrange(collection_date) %>%
  select(ID, collection_date, everything())
library(tidyverse)
library(lubridate)
df = structure(list(ID = c("h01", "h01", "h01", "h01", "h01", "h01"
), collection_date = structure(c(15076, 15092, 15125, 15139,
15159, 15170), class = "Date"), NDVI = c(0.581769436997319, 0.539445628997868,
0.338541666666667, 0.302713987473904, 0.305882352941176, 0.269439421338155
)), row.names = c(NA, -6L), class = c("tbl_df", "tbl", "data.frame"))
df2 = tibble(
  ID = "h01",
  collection_date = seq(ymd("2011-04-10"), ymd("2011-07-16"), 1)
) %>%
  left_join(df, by = c("ID", "collection_date"))
df2 %>% head(10)
output
# A tibble: 98 x 3
ID collection_date NDVI
<chr> <date> <dbl>
1 h01 2011-04-10 NA
2 h01 2011-04-11 NA
3 h01 2011-04-12 0.582
4 h01 2011-04-13 NA
5 h01 2011-04-14 NA
6 h01 2011-04-15 NA
7 h01 2011-04-16 NA
8 h01 2011-04-17 NA
9 h01 2011-04-18 NA
10 h01 2011-04-19 NA
# ... with 88 more rows
output of df2 %>% tail(10)
# A tibble: 10 x 3
ID collection_date NDVI
<chr> <date> <dbl>
1 h01 2011-07-07 NA
2 h01 2011-07-08 NA
3 h01 2011-07-09 NA
4 h01 2011-07-10 NA
5 h01 2011-07-11 NA
6 h01 2011-07-12 NA
7 h01 2011-07-13 NA
8 h01 2011-07-14 NA
9 h01 2011-07-15 0.269
10 h01 2011-07-16 NA
You may use tidyr::complete -
library(dplyr)
library(tidyr)
df %>%
  group_by(ID) %>%
  complete(collection_date = seq(min(collection_date),
                                 max(collection_date), by = 'days')) %>%
  ungroup()
# ID collection_date NDVI
# <chr> <date> <dbl>
# 1 h01 2011-04-12 0.582
# 2 h01 2011-04-13 NA
# 3 h01 2011-04-14 NA
# 4 h01 2011-04-15 NA
# 5 h01 2011-04-16 NA
# 6 h01 2011-04-17 NA
# 7 h01 2011-04-18 NA
# 8 h01 2011-04-19 NA
# 9 h01 2011-04-20 NA
#10 h01 2011-04-21 NA
#11 h01 2011-04-22 NA
#12 h01 2011-04-23 NA
#13 h01 2011-04-24 NA
#14 h01 2011-04-25 NA
#15 h01 2011-04-26 NA
#16 h01 2011-04-27 NA
#17 h01 2011-04-28 0.539
#18 h01 2011-04-29 NA
#19 h01 2011-04-30 NA
#20 h01 2011-05-01 NA
#...
#...
The benefit of this approach would be that it would create missing dates based on min and max for each ID.
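To illustrate that last point, here is a small sketch with a hypothetical second ID (h02 and its values are made up, not part of the question): each ID is completed only between its own first and last collection date.
library(dplyr)
library(tidyr)

df_multi <- tibble(
  ID = c("h01", "h01", "h02", "h02"),
  collection_date = as.Date(c("2011-04-12", "2011-04-15",
                              "2011-06-01", "2011-06-03")),
  NDVI = c(0.58, 0.54, 0.31, 0.27)
)

df_multi %>%
  group_by(ID) %>%
  complete(collection_date = seq(min(collection_date),
                                 max(collection_date), by = "days")) %>%
  ungroup()
# h01 is filled for 2011-04-12 .. 2011-04-15, h02 only for 2011-06-01 .. 2011-06-03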

How to mark episodes in which the patient will be readmitted within 30 days?

I have a dataset of patient episodes.
Every patient has their own patientPersonalNumber.
Each inpatient episode has an admission and a discharge date.
I need to mark in a new variable (with TRUE, or with 1) all episodes after which the patient will be readmitted within 30 days.
install.packages("lubridate")
library(lubridate)
admission <- c("06/23/2013", "06/30/2013", "07/12/2013","06/24/2013","06/28/2013","06/29/2013","06/23/2013","06/24/2013","06/24/2013","07/02/2013","07/09/2013","06/24/2013","09/08/2013","07/22/2014")
discharge<- c("06/25/2013", "07/03/2014", "07/17/2014","06/30/2013","06/30/2013","07/02/2013","06/29/2013","06/29/2013","06/27/2013","07/05/2013","07/12/2013","06/28/2013","10/12/2013","08/01/2014")
admission.date <- mdy(admission)
discharge.date <- mdy(discharge)
patientPersonalNumber<-c("001","002","004","005","006","007","008","009","010", "005","005","011","005", "004")
df<-data.frame(patientPersonalNumber,admission.date,discharge.date)
df
patientPersonalNumber admission.date discharge.date
1 001 2013-06-23 2013-06-25
2 002 2013-06-30 2014-07-03
3 004 2014-07-12 2014-07-17
4 005 2013-06-24 2013-06-30
5 006 2013-06-28 2013-06-30
6 007 2013-06-29 2013-07-02
7 008 2013-06-23 2013-06-29
8 009 2013-06-24 2013-06-29
9 010 2013-06-24 2013-06-27
10 005 2013-07-02 2013-07-05
11 005 2013-07-09 2013-07-12
12 011 2013-06-24 2013-06-28
13 005 2013-09-08 2013-10-12
14 004 2014-07-22 2014-08-01
So I have to mark rows 3, 4 and 10 as TRUE.
#4 Patient 005 discharged 2013-06-30 was readmitted 2013-07-02
#10 Patient 005 discharged 2013-07-05 was readmitted 2013-07-09
#3 Patient 004 discharged 2014-07-17 was readmitted 2014-07-22
I appreciate any help.
# original data in the question were edited
I would go with something like this:
require(tidyverse)
df %>%
  arrange(patientPersonalNumber, admission.date) %>%
  group_by(patientPersonalNumber) %>%
  mutate(re.admin = (lag(discharge.date) + 30) >= admission.date) %>%
  mutate(re.admin = ifelse(is.na(re.admin), FALSE, re.admin))
# A tibble: 14 x 4
# Groups: patientPersonalNumber [10]
patientPersonalNumber admission.date discharge.date re.admin
<chr> <date> <date> <lgl>
1 001 2013-06-23 2013-06-25 FALSE
2 002 2013-06-30 2014-07-03 FALSE
3 004 2013-07-22 2014-08-01 FALSE
4 004 2014-07-12 2014-07-17 TRUE
5 005 2013-06-24 2013-06-30 FALSE
6 005 2013-07-02 2013-07-05 TRUE
7 005 2013-07-09 2013-07-12 TRUE
8 005 2013-09-08 2013-10-12 FALSE
9 006 2013-06-28 2013-06-30 FALSE
10 007 2013-06-29 2013-07-02 FALSE
11 008 2013-06-23 2013-06-29 FALSE
12 009 2013-06-24 2013-06-29 FALSE
13 010 2013-06-24 2013-06-27 FALSE
14 011 2013-06-24 2013-06-28 FALSE
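Note that the code above flags the later episode (the readmission itself). If, as the question asks, you want to flag the earlier episode, the one after which the patient comes back within 30 days, a minimal variation using lead() instead of lag() could look like this (a sketch on the same df; readmit_30d is just an illustrative column name):
library(dplyr)

df %>%
  arrange(patientPersonalNumber, admission.date) %>%
  group_by(patientPersonalNumber) %>%
  # TRUE if the next admission for this patient starts within 30 days of this discharge
  mutate(readmit_30d = !is.na(lead(admission.date)) &
           as.numeric(lead(admission.date) - discharge.date) <= 30) %>%
  ungroup()
With the question's data this flags the episodes in original rows 3, 4 and 10, matching the expected result.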

How to create a new table given X and Y from a data frame

I'm trying to create a new dataset that takes the codes from the S columns and builds a new table whose cells are calculated from the T column.
Cust T S1 S2 S3 S4
1009 150 1007 1006 1001 1000
1010 50 1007 1006 1001 1000
1011 50 1007 1006 1001 1000
1013 10000 1007 1006 1001 1000
1931 60 1008 1007 1006 1005
1141 1000 1014 1013 1007 1006
I need to make a new table that looks like this:
Cust 1014 1013 1008 1007 1006 1001 1000
1009 NA NA NA T *.1 T *.1 T*.05 T * .025
1010 NA NA NA T *.1 T *.1 T*.05 T * .025
1011 NA NA NA T *.1 T *.1 T*.05 T * .025
1013 NA NA NA T *.1 T *.1 T*.05 T * .025
1931 NA NA T*.1 T *.1 T*.05 T * .025 NA
1141 T*.1 T *.1 NA T*.05 T * .025 NA NA
I just can't seem to figure it out and I'm not even sure if it is possible.
A tidyverse solution:
library(tidyverse)
df %>%
  gather(select = -c(Cust, T)) %>%
  select(-key) %>%
  spread(value, T) %>%
  map2_dfc(c(1, .025, .05, rep(.1, 6)), ~ .x * .y)
# Cust `1000` `1001` `1005` `1006` `1007` `1008` `1013` `1014`
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1009 3.75 7.5 NA 15 15 NA NA NA
# 2 1010 1.25 2.5 NA 5 5 NA NA NA
# 3 1011 1.25 2.5 NA 5 5 NA NA NA
# 4 1013 250 500 NA 1000 1000 NA NA NA
# 5 1141 NA NA NA 100 100 NA 100 100
# 6 1931 NA NA 6 6 6 6 NA NA
library(dplyr)
library(tidyr)
library(data.table)
df %>%
  gather(key = k, value = val, -c('Cust', 'T')) %>%
  # Change 'T*.1' (etc.) to T*.1 to get the actual values instead of the labels
  mutate(val_upd = ifelse(k == 'S1' | k == 'S2', 'T*.1',
                          ifelse(k == 'S3', 'T*.05', 'T*.025'))) %>%
  select(-T, -k) %>%
  dcast(Cust ~ val, value.var = 'val_upd')
Cust 1000 1001 1005 1006 1007 1008 1013 1014
1 1009 T*.025 T*.05 <NA> T*.1 T*.1 <NA> <NA> <NA>
2 1010 T*.025 T*.05 <NA> T*.1 T*.1 <NA> <NA> <NA>
3 1011 T*.025 T*.05 <NA> T*.1 T*.1 <NA> <NA> <NA>
4 1013 T*.025 T*.05 <NA> T*.1 T*.1 <NA> <NA> <NA>
5 1141 <NA> <NA> <NA> T*.025 T*.05 <NA> T*.1 T*.1
6 1931 <NA> <NA> T*.025 T*.05 T*.1 T*.1 <NA> <NA>
Data
df <- read.table(text = "
Cust T S1 S2 S3 S4
1009 150 1007 1006 1001 1000
1010 50 1007 1006 1001 1000
1011 50 1007 1006 1001 1000
1013 10000 1007 1006 1001 1000
1931 60 1008 1007 1006 1005
1141 1000 1014 1013 1007 1006
", header=TRUE)
This is one way using a combination of reshape2::melt, dplyr::select, tidyr::spread and dplyr::mutate. May not be the best way, but it should do what you want:
# Read the data (if you don't already have it loaded)
df <- read.table(text="Cust T S1 S2 S3 S4
1009 150 1007 1006 1001 1000
1010 50 1007 1006 1001 1000
1011 50 1007 1006 1001 1000
1013 10000 1007 1006 1001 1000", header=T)
# Manipulate your data.frame. Replace df with the name of your data.frame
reshape2::melt(df, c("Cust", "T"), c("S1", "S2", "S3", "S4")) %>%
  dplyr::select(-variable) %>%
  tidyr::spread(value, T) %>%
  dplyr::mutate(`1007` = `1007` * 0.1,
                `1006` = `1006` * 0.1,
                `1001` = `1001` * 0.05,
                `1000` = `1000` * 0.025)
# Cust 1000 1001 1006 1007
#1 1009 3.75 7.5 15 15
#2 1010 1.25 2.5 5 5
#3 1011 1.25 2.5 5 5
#4 1013 250.00 500.0 1000 1000
You'll need the backticks as R doesn't handle having numeric colnames very well.
Let me know if I've misunderstood anything/something doesn't make sense
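For newer tidyr (>= 1.0.0), here is a sketch (not from the original answers) that maps the multiplier by the S column (10% for S1/S2, 5% for S3, 2.5% for S4, as in the question's desired table) rather than by column position, using the full df defined above:
library(dplyr)
library(tidyr)

# Multiplier per source column, taken from the desired output in the question
multipliers <- c(S1 = 0.10, S2 = 0.10, S3 = 0.05, S4 = 0.025)

df %>%
  pivot_longer(S1:S4, names_to = "slot", values_to = "code") %>%
  mutate(value = T * multipliers[slot]) %>%  # T is the column here, not TRUE
  select(Cust, code, value) %>%
  pivot_wider(names_from = code, values_from = value)
Because the multiplier follows the S column rather than the code's position, the 1931 and 1141 rows come out as in the question's desired table, which differs slightly from the first answer's output above.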

How to convert data into a Time Series using R?

I have an intraday dataset of stock-related quotes. How do I convert it into a time series?
Time Size Ask Bid Trade
11-1-2016 9:00:12 100 <NA> 901 <NA>
11-1-2016 9:00:21 5 <NA> <NA> 950
11-1-2016 9:00:21 5 <NA> 950 <NA>
11-1-2016 9:00:21 10 905 <NA> <NA>
11-1-2016 9:00:24 500 <NA> 921 <NA>
11-1-2016 9:00:28 2 <NA> 879 <NA>
11-1-2016 9:00:31 6 1040 <NA> <NA>
11-1-2016 9:00:39 5 <NA> <NA> 950
11-1-2016 9:00:39 5 <NA> 950 <NA>
11-1-2016 9:00:39 10 905 <NA> <NA>
11-1-2016 9:00:39 5 <NA> <NA> 950
11-1-2016 9:00:44 2 <NA> 879 <NA>
11-1-2016 9:00:44 6 1040 <NA> <NA>
11-1-2016 9:00:45 1 1005 <NA> <NA>
11-1-2016 9:00:46 1 1000 <NA> <NA>
11-1-2016 9:00:47 1 <NA> 900 <NA>
11-1-2016 9:00:47 5 <NA> <NA> 950
11-1-2016 9:00:47 5 <NA> 950 <NA>
11-1-2016 9:00:47 10 905 <NA> <NA>
11-1-2016 9:00:48 1 <NA> 900 <NA>
11-1-2016 9:00:48 1 1000 <NA> <NA>
11-1-2016 9:00:52 5 <NA> <NA> 950
11-1-2016 9:00:52 5 <NA> 950 <NA>
11-1-2016 9:00:52 10 905 <NA> <NA>
11-1-2016 9:00:53 10 <NA> <NA> 939
11-1-2016 9:00:55 1 <NA> 900 <NA>
11-1-2016 9:00:55 1 1000 <NA> <NA>
11-1-2016 9:00:55 10 <NA> <NA> 939
11-1-2016 9:00:55 5 <NA> 950 <NA>
11-1-2016 9:00:55 10 905 <NA> <NA>
11-1-2016 9:00:59 10 <NA> <NA> 939
11-1-2016 9:01:04 10 <NA> <NA> 950
11-1-2016 9:01:04 25 <NA> 950 <NA>
11-1-2016 9:01:06 1 <NA> 900 <NA>
11-1-2016 9:01:06 1 1000 <NA> <NA>
11-1-2016 9:01:14 19 <NA> <NA> 972
11-1-2016 9:01:14 20 <NA> 972 <NA>
11-1-2016 9:01:14 10 905 <NA> <NA>
11-1-2016 9:01:17 19 <NA> <NA> 972
11-1-2016 9:01:17 1 <NA> 912 <NA>
The structure of the dataset is
'data.frame': 35797 obs. of 5 variables:
$ Time : POSIXct, format: "2016-11-01 09:00:12" "2016-11-01 09:00:21" ..
$ Size : chr "100" "5" "5" "10" ...
$ ASk : chr NA NA NA "905" ...
$ Bid : chr "901" NA "950" NA ...
$ Trade: chr NA "950" NA NA ...
Once the data is converted into a time series object, how do I aggregate the Ask, Bid and Trade columns into 5-minute intervals?
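A common approach (a sketch, not an answer from the original thread; it assumes the data frame is called quotes, Time is already POSIXct as in str() above, and the columns are named Size, Ask, Bid and Trade as in the sample table) is to coerce the character columns to numeric, build an xts object indexed by Time, and aggregate over 5-minute endpoints:
library(xts)

# Coerce the character price/size columns to numeric
num_cols <- c("Size", "Ask", "Bid", "Trade")
quotes[num_cols] <- lapply(quotes[num_cols], as.numeric)

# xts object indexed by the quote timestamp
quotes_xts <- xts(quotes[num_cols], order.by = quotes$Time)

# Mean of the non-missing values in each 5-minute bucket
ep <- endpoints(quotes_xts, on = "minutes", k = 5)
agg_5min <- period.apply(quotes_xts, ep, function(x) colMeans(x, na.rm = TRUE))
period.apply() accepts any summary function (for example the last non-missing value instead of the mean), and xts also offers to.period() for OHLC-style aggregation of a single series.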
