Specify number of columns to read when first row is missing values - r

I have data from a logger that inserts timestamps as rows within the comma separated data. I've sorted out a way to wrangle those timestamps into a tidy data frame (thanks to the responses to this question).
The issue I'm having now is that the timestamp lines don't have the same number of comma-separated values as the data rows (3 vs 6), and readr defaults to reading in only 3 columns, despite me manually specifying column types and names for 6. Last summer (when I last used the logger) readr read the data in correctly, but to my dismay the current version (2.1.1) throws a warning and lumps columns 3:6 all together. I'm hoping that there's some option for "correcting" back to the old behaviour, or some workaround I haven't thought of (editing the logger files is not an option).
Example code:
library(tidyverse)
# example data
txt1 <- "
,,Logger Start 12:34
-112,53,N=1,9,15,.25
-112,53,N=2,12,17,.17
"
# example without timestamp header
txt2 <- "
-112,53,N=1,9,15,.25
-112,53,N=2,12,17,.17
"
# throws warning and reads 3 columns
read_csv(
  txt1,
  col_names = c("lon", "lat", "n", "red", "nir", "NDVI"),
  col_types = "ddcddc"
)
# works correctly
read_csv(
  txt2,
  col_names = c("lon", "lat", "n", "red", "nir", "NDVI"),
  col_types = "ddcddc"
)
# this is the table that older readr versions would create
# and that I'm hoping to get back to
tribble(
  ~lon, ~lat, ~n,                   ~red, ~nir, ~NDVI,
  NA,   NA,   "Logger Start 12:34", NA,   NA,   NA,
  -112, 53,   "N=1",                9,    15,   ".25",
  -112, 53,   "N=2",                12,   17,   ".17"
)

Use base read.csv, then convert to a tibble if need be:
read.csv(text = txt1, header = FALSE,
         col.names = c("lon", "lat", "n", "red", "nir", "NDVI"))
lon lat n red nir NDVI
1 NA NA Logger Start 12:34 NA NA NA
2 -112 53 N=1 9 15 0.25
3 -112 53 N=2 12 17 0.17
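Alternatively, since the question asks for the old behaviour back: readr 2.x keeps the first-edition parsing engine around, and wrapping the call in with_edition() may restore it (a sketch; I haven't verified this against 2.1.1 specifically):
library(readr)
# run the same call under readr's first-edition parser
with_edition(1, read_csv(
  txt1,
  col_names = c("lon", "lat", "n", "red", "nir", "NDVI"),
  col_types = "ddcddc"
))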

I think I would use read_lines and write_lines to convert the "bad CSV" into "good CSV", and then read in the converted data.
Assuming you have a file test.csv like this:
,,Logger Start 12:34
-112,53,N=1,9,15,.25
-112,53,N=2,12,17,.17
Try something like this:
library(readr)
library(dplyr)
library(tidyr)

read_lines("test.csv") %>%
  # assumes all timestamp lines are the same format
  gsub(",,Logger Start (.*?)$", "\\1,,,,,,", ., perl = TRUE) %>%
  # assumes that NDVI (last column) is always present and ends with a digit
  # you'll need to alter the regex if not the case
  gsub("^(.*?\\d)$", ",\\1", ., perl = TRUE) %>%
  write_lines("test_out.csv")
test_out.csv now looks like this:
12:34,,,,,,
,-112,53,N=1,9,15,.25
,-112,53,N=2,12,17,.17
So we now have 7 columns; the first is the timestamp.
This code reads the new file, fills in the missing timestamp values and removes rows where n is NA. You may not want to do that; I've assumed that n is only missing because of the original row with the timestamp.
mydata <- read_csv("test_out.csv",
                   col_names = c("ts", "lon", "lat", "n", "red", "nir", "NDVI")) %>%
  fill(ts) %>%
  filter(!is.na(n))
The final mydata:
# A tibble: 2 x 7
ts lon lat n red nir NDVI
<time> <dbl> <dbl> <chr> <dbl> <dbl> <dbl>
1 12:34 -112 53 N=1 9 15 0.25
2 12:34 -112 53 N=2 12 17 0.17
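If you'd rather skip the intermediate file, readr can read the repaired lines directly; wrapping the collapsed string in I() marks it as literal data (a sketch; assumes readr >= 2.0):
library(readr)
library(dplyr)
library(tidyr)

lines <- read_lines("test.csv")
lines <- gsub(",,Logger Start (.*?)$", "\\1,,,,,,", lines, perl = TRUE)
lines <- gsub("^(.*?\\d)$", ",\\1", lines, perl = TRUE)

mydata <- read_csv(I(paste(lines, collapse = "\n")),
                   col_names = c("ts", "lon", "lat", "n", "red", "nir", "NDVI")) %>%
  fill(ts) %>%
  filter(!is.na(n))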

R: Replace values in one dataframe with value from another dataframe if conditions are met, otherwise append skipped data

From my instrumentation, I receive two different .tsv files containing my data. The first file contains, among other things, the name of the sample, its position in a 12x8 grid, and its output data. The second file contains average data from replicate sets based off the first text file. I've re-created an example of the two files in these data frames -- I actually read them using the read.table() function.
#re-creation of first .tsv file
Data <- data.frame(Name = c("100", "100", "200", "250", "1E5", "1E5", "Negative", "Negative"),
                   Pos = c("A3", "A4", "B3", "B4", "C3", "C4", "D3", "D4"),
                   Output = c("20.00", "20.10", "21.67", "23.24", "21.97", "22.03", "38.99", "38.99"))
Data
Name Pos Output
1 100 A3 20.00
2 100 A4 20.10
3 200 B3 21.67
4 250 B4 23.24
5 1E5 C3 21.97
6 1E5 C4 22.03
7 Negative D3 38.99
8 Negative D4 38.99
#re-creation of second .tsv file
Replicates <- data.frame(Replicates = c("A3, A4", "C3, C4", "D3, D4"),
                         Mean.Cq = c(20.05, 22.00, 38.99),
                         STD.Cq = c(0.05, 0.03, 0.00))
Replicates
Replicates Mean.Cq STD.Cq
1 A3, A4 20.05 0.05
2 C3, C4 22.00 0.03
3 D3, D4 38.99 0.00
This is what I'm trying to create:
#Rename values in Replicates$Name with value in Data$Name if replicate is present; append with non-replicate data
Name Mean.Cq STD.Cq
1 100 20.05 0.05
2 1E5 22.00 NA
3 Negative 38.99 NA
4 200 21.67 0.03
5 250 23.24 0.00
I can do this manually by creating a dataframe using stringr and rbind.fill from slices of the "Data" dataframe such that I keep the first instance of each name of the replicates, then remove the $Replicates column from the "Replicates" dataframe and replace it with the $Name column of the newly-created sliced dataframe. I can then append the rows of non-replicate samples to the "Replicates" dataframe. However, not all of my files have the exact same pattern of replicates, or number of samples.
I have been trying in vain to mimic this example such that I can do this process for each file set regardless of the order or number of replicates, instead of going through each one and cleaning by hand. How do I transform this manual process into a for loop to keep from having to make a bunch of sliced dataframes?
The first part of my problem, I think, has been the ability to detect only part of the Replicates$Replicates pattern in Data$Name, not just the individual characters.
For example, detect either A3 or A4 from Replicates$Replicates[1] in Data$Name, then replace the value of Replicates$Replicates with the value of the first match found in Data$Name. I'm stuck at this step.
> str_replace(Replicates$Replicates, (str_detect(Data$Name, "[Replicates$Replicates]")))
[1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
Any insight would be super helpful! Still new to programming, bioinformatics, and data science, and I'm trying to figure it out as I go on my data.
EDIT
Thank you, @Skaqqs, for helping to answer the question. I made edits from his answer to fit this into the tidyverse, which I have been finding a bit easier to adapt to than base R. Splitting the replicates into two columns, then sorting and joining did the trick (and that's where I was getting stuck).
require(tidyverse)
Samples <- tibble(Name = c("100", "100", "200", "250", "1E5", "1E5",
                           "Negative", "Negative"),
                  Pos = c("A3", "A4", "B3", "B4", "C3", "C4", "D3", "D4"),
                  Output = c("20.00", "20.10", "21.67", "23.24", "21.97",
                             "22.03", "38.99", "38.99"))
Replicates <- tibble(Replicates = c("A3, A4", "C3, C4", "D3, D4"),
                     Mean.Cq = c(20.05, 22.00, 38.99),
                     STD.Cq = c(0.05, 0.03, 0.00))
Samples %>%
  .[str_order(.$Pos, numeric = TRUE), ]
Replicates %>%
  mutate("R1" = gsub(x = Replicates, pattern = "^(.*),.*", replacement = "\\1")) %>%
  mutate("R2" = gsub(x = Replicates, pattern = ".*,\\s(.*)", replacement = "\\1")) %>%
  pivot_longer(cols = c("R1", "R2"), names_to = "Well Pairs", values_to = "Wells") %>%
  select("Mean.Cq", "STD.Cq", "Wells") %>%
  relocate("Wells", 1) %>%
  right_join(Samples, by = c("Wells" = "Pos")) %>%
  .[str_order(.$Wells, numeric = TRUE), ] %>%
  select("Name", "Mean.Cq", "STD.Cq") %>%
  distinct(Name, .keep_all = TRUE)
# A tibble: 5 x 3
Name Mean.Cq STD.Cq
<chr> <dbl> <dbl>
1 100 20.0 0.05
2 200 NA NA
3 250 NA NA
4 1E5 22 0.03
5 Negative 39.0 0
This sounds like a join/merge question to me. My suggestion is to split Replicates$Replicates into two fields and essentially treat their data separately too. Then after joining your two Replicates tables with Data, use unique() to drop duplicates in your summary table.
library(dplyr)
# Split `Replicates$Replicates` into two fields
# This assumes your `Replicates` field has two values, separated by a comma and whitespace
Replicates$R1 <- gsub(x = Replicates$Replicates, pattern = "^(.*),.*", replacement = "\\1")
Replicates$R2 <- gsub(x = Replicates$Replicates, pattern = ".*,\\s(.*)", replacement = "\\1")
# Inner-join `Data` and `Replicates` by `R1` and `R2`
df <- merge(Data, Replicates, by.x = "Pos", by.y = "R1", all.x = FALSE)
df2 <- merge(Data, Replicates, by.x = "Pos", by.y = "R2", all.x = FALSE)
df3 <- dplyr::bind_rows(df, df2)
unique(df3[,c("Name", "Mean.Cq", "STD.Cq")])
#> Name Mean.Cq STD.Cq
#> 1 100 20.05 0.05
#> 2 1E5 22.00 0.03
#> 3 Negative 38.99 0.00
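The same join idea can be written more compactly with tidyr::separate_rows, which does the splitting in one step (a sketch of an alternative, only tested mentally against this example; right_join keeps the non-replicate wells too):
library(dplyr)
library(tidyr)

Replicates %>%
  separate_rows(Replicates, sep = ",\\s*") %>%         # one row per well
  right_join(Data, by = c("Replicates" = "Pos")) %>%   # keep non-replicate wells as NA
  distinct(Name, .keep_all = TRUE) %>%                 # first well per sample name
  select(Name, Mean.Cq, STD.Cq)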

Pre-processing data in R: filtering and replacing using wildcards

Good day!
I have a dataset in which I have values like "Invalid", "Invalid(N/A)", "Invalid(1.23456)", lots of them in different columns and they are different from file to file.
Goal is to make script file to process different CSVs.
I tried read.csv and read_csv, but faced errors with data types or no errors, but no action either.
All columns are col_character except one - col_double.
Tried this:
is.na(df) <- startsWith(as.character(df, "Inval")
no luck
Tried this:
is.na(df) <- startsWith(df, "Inval")
no luck, some error about non char object
Tried this:
df %>%
mutate(across(everything(), .fns = ~str_replace(., "invalid", NA_character_)))
no luck
And other google stuff - no luck, again, errors with data types or no errors, but no action either.
So R is incapable of simple find and replace in data frame, huh?
Data frame example:
Output of dput(dtype_Result[1:20, 1:4])
structure(list(Location = c("1(1,A1)", "2(1,B1)", "3(1,C1)",
"4(1,D1)", "5(1,E1)", "6(1,F1)", "7(1,G1)", "8(1,H1)", "9(1,A2)",
"10(1,B2)", "11(1,C2)", "12(1,D2)", "13(1,E2)", "14(1,F2)", "15(1,G2)",
"16(1,H2)", "17(1,A3)", "18(1,B3)", "19(1,C3)", "20(1,D3)"),
Sample = c("Background0", "Background0", "Standard1", "Standard1",
"Standard2", "Standard2", "Standard3", "Standard3", "Standard4",
"Standard4", "Standard5", "Standard5", "Standard6", "Standard6",
"Control1", "Control1", "Control2", "Control2", "Unknown1",
"Unknown1"), EGF = c(NA, NA, "6.71743640129069", "2.66183193679533",
"16.1289784536322", "16.1289784536322", "78.2706654825781",
"78.6376213069722", "382.004087907716", "447.193928257862",
"Invalid(N/A)", "1920.90297258996", "7574.57784103579", "29864.0308009592",
"167.830723655146", "109.746615928611", "868.821939675054",
"971.158518683179", "9.59119569511596", "4.95543581398464"
), `FGF-2` = c(NA, NA, "25.5436745776637", NA, "44.3280630362038",
NA, "91.991708192168", "81.9459159768959", "363.563899234418",
"425.754478700876", "Invalid(2002.97340881547)", "2027.71958119836",
"9159.40221389147", "11138.8722428849", "215.58494072476",
"70.9775438699825", "759.798876479002", "830.582605561901",
"58.7007261370257", "70.9775438699825")), row.names = c(NA,
-20L), class = c("tbl_df", "tbl", "data.frame"))
The error is in the use of startsWith. The following grepl solution is simpler and works.
is.na(df) <- sapply(df, function(x) grepl("^Invalid", x))
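For instance, on a toy two-column frame (hypothetical values), every cell starting with "Invalid" becomes NA:
df <- data.frame(a = c("1.2", "Invalid(N/A)"),
                 b = c("Invalid(2002.97)", "5.6"))
# assigning a logical matrix to is.na() blanks exactly those cells
is.na(df) <- sapply(df, function(x) grepl("^Invalid", x))
df
#>      a    b
#> 1  1.2 <NA>
#> 2 <NA>  5.6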
The str_replace function will attempt to edit the content of a character string, inserting a partial replacement, rather than replacing it entirely. Also, the across function targets all of the columns, including the numeric id.
To fix it, use where to identify the columns of interest, then use if_else to overwrite the data with NA values when there is a partial string match, using str_detect to spot the target text. The following code works, building on the tidyverse example you provided.
Example data
library(tidyverse)
df <- tibble(
  id = 1:3,
  x = c("a", "invalid", "c"),
  y = c("d", "e", "Invalid/NA")
)
df
# A tibble: 3 x 3
id x y
<int> <chr> <chr>
1 1 a d
2 2 invalid e
3 3 c Invalid/NA
Solution
df <- df %>%
  mutate(
    across(where(is.character),
           .fns = ~if_else(str_detect(tolower(.x), "invalid"), NA_character_, .x))
  )
print(df)
Result
# A tibble: 3 x 3
id x y
<int> <chr> <chr>
1 1 a d
2 2 NA e
3 3 c NA
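Once the Invalid entries are gone, the affected columns in your real data (like EGF) are still character; readr::type_convert() should re-parse them to numeric in one pass (a sketch):
library(readr)
# re-guess column types now that only numeric text and NAs remain
dtype_Result <- type_convert(dtype_Result)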

Deidentifying data and creating crosswalk using duawranglr in R

I am trying to deidentify data using the duawranglr package in R presented in this example: https://cran.r-project.org/web/packages/duawranglr/vignettes/securing_data.html.
As an example, I created a data frame:
data <- data.frame(
Name = c("Kate", "Jane", "Rod", "Jan", "Martin"),
V1 = c(16, 20, 34, 25, 26),
V2 = c(3, 7, 5, 3, 2)
)
I am trying to create unique, hexadecimal strings without a crosswalk that correspond to the Name column, using the deid_dua function.
data <- deid_dua(data, id_col = "Name", new_id_name = "DID",
                 write_crosswalk = TRUE, id_length = 12)
The error that I keep getting is:
Error in data.frame(old = old_ids, new = new_ids, stringsAsFactors = FALSE) :
arguments imply differing number of rows: 5, 0
At first I thought the issue was with the name column being a factor. However, I receive the same error after converting it to character using the stringsAsFactors = FALSE statement in data.frame. I'm also not sure based on the CRAN example if I need these statements:
admin_file <- system.file('extdata', 'admin_data.csv', package = 'duawranglr')
df <- read_dua_file(admin_file)
df
Do they apply if you're not importing the data? The example doesn't explain very well what they are for.
Here's a much simpler solution:
# create a custom 8-digit random identifier string called ID:
library(stringi)
data$ID <- stri_rand_strings(nrow(data), 8)
# remove the name column to create a de-identified dataset
data_deidentified <- data[,-1]
Your data_deidentified dataframe will look something like this:
V1 V2 ID
1 16 3 V2Hziep8
2 20 7 vFeQW1OQ
3 34 5 E5vcWYfm
4 25 3 VLbHzU3H
5 26 2 acCbXiO1
And obviously retain the original data dataframe as your crosswalk. You can make the ID variable longer by changing the '8' value in that call.
Now if you have duplicate names in your data, you will need to do a few extra steps:
# note that I've modified the original dataframe to include two "Martin" values:
data <- data.frame(Name = c("Kate", "Jane", "Rod", "Jan", "Martin", "Martin"),
V1 = c(16, 20, 34, 25, 26, 28),
V2 = c(3, 7, 5, 3, 2, 5))
# get list of unique names and convert to dataframe
names <- data.frame('Name' = unique(data$Name))
# assign ID string to each unique name
names$ID <- stri_rand_strings(nrow(names), 8)
# now merge back into original df
data <- merge(data, names)
Your result is:
Name V1 V2 ID
1 Jan 25 3 e8da7lO4
2 Jane 20 7 pGeeklL1
3 Kate 16 3 5yYAtO9B
4 Martin 26 2 BwC6jPBh
5 Martin 28 5 BwC6jPBh
6 Rod 34 5 f3xvGbu2
I get an error if I don't set a crosswalk first, but this is fairly trivial:
library(duawranglr)
df <- data.frame(Name = c("Kate", "Jane", "Rod", "Jan", "Martin"),
V1 = c(16, 20, 34, 25, 26),
V2 = c(3, 7, 5, 3, 2))
# You only have a single column to obscure, so you only need a one-cell data frame to set up
set_dua_cw(data.frame(secure = "Name"))
#> -- duawranglr note -------------------------------------------------------------------
#> DUA crosswalk has been set!
# Simultaneously secure the data and write the crosswalk
df <- deid_dua(df,
id_col = "Name",
new_id_name = "ID",
write_crosswalk = T,
id_length = 12,
crosswalk_filename = "cw.csv")
print(df)
#> ID V1 V2
#> 1 950dce035280 16 3
#> 2 6b95d061b59f 20 7
#> 3 00a5d8ab2a4c 34 5
#> 4 ea03e704d806 25 3
#> 5 3eba984ebcba 26 2
And you can see the contents of the crosswalk by reading the csv file:
read.csv("cw.csv")
#> Name ID
#> 1 Kate 950dce035280
#> 2 Jane 6b95d061b59f
#> 3 Rod 00a5d8ab2a4c
#> 4 Jan ea03e704d806
#> 5 Martin 3eba984ebcba
And if you want to get the names back in the future, you can do:
cw <- read.csv("cw.csv")
df$Name <- cw$Name[match(df$ID, cw$ID)]
I'm a little late, but as the package author, I'll try to clear up some confusion.
tl;dr
The answer @Allan Cameron gave worked for me, but if all you want to do is hash your IDs, then @mh765's solution is probably the best.
Longer explanation of duawranglr purpose
duawranglr assumes you have a restricted data frame and that you want to do two things so that you can share it:
1. Drop columns which contain restricted data elements (like DOB or other identifying information)
2. Convert unique identifiers into another unique ID that can't be used to recover the original IDs (in case the original IDs are also restricted, like SSNs)
Since you aren't trying to do #1, it makes sense to have a DUA crosswalk that only has one column with one element: the name of your ID column (per @Allan Cameron).
But let's say you have two potential levels of security and in the second, you can't include V1. Then your DUA crosswalk might look like this:
library(duawranglr)
## your data frame
df <- data.frame(Name = c("Kate", "Jane", "Rod", "Jan", "Martin"),
V1 = c(16, 20, 34, 25, 26),
V2 = c(3, 7, 5, 3, 2))
## create dua crosswalk
dua_cw <- data.frame(secure_level_i = c("Name", ""),
                     secure_level_ii = c("Name", "V1"))
## show cw (level_i won't allow name; level_ii won't allow name or V1)
dua_cw
  secure_level_i secure_level_ii
1           Name            Name
2                             V1
## set the dua cw
set_dua_cw(dua_cw)
-- duawranglr note -------------------------------------------------------------
DUA crosswalk has been set!
Now you can set the level of security. Let's say you set it at secure_level_i, meaning it's okay to keep V1 in the final data frame you share:
## set DUA level
set_dua_level("secure_level_i", deidentify_required = TRUE, id_column = "Name")
-- duawranglr note -------------------------------------------------------------
Unique IDs in [ Name ] must be deidentified; use -deid_dua()-.
Now you can use deid_dua() as you wanted to hash your IDs, in this case, names.
## deidentify data (don't need to set id_col since we set it in set_dua_level)
df <- deid_dua(df,
               new_id_name = "DID",
               write_crosswalk = TRUE,
               id_length = 12,
               crosswalk_filename = "cw.csv")
## show result
df
DID V1 V2
1 d164bb624da2 16 3
2 a8b33e3b0230 20 7
3 a1d287cbdde7 34 5
4 1c00ba576e1a 25 3
5 a870564b3365 26 2
## show crosswalk
read.csv("cw.csv")
Name DID
1 Kate d164bb624da2
2 Jane a8b33e3b0230
3 Rod a1d287cbdde7
4 Jan 1c00ba576e1a
5 Martin a870564b3365
## check restrictions to see if you can save data
check_dua_restrictions(df)
-- duawranglr note -------------------------------------------------------------
Data set has passed check and may be saved.
If, however, you set_dua_level() to "secure_level_ii", then you won't pass the last check since you'll still have V1 in your data.
## set new more secure level
set_dua_level("secure_level_ii", deidentify_required = TRUE, id_column = "Name")
-- duawranglr note -------------------------------------------------------------
Unique IDs in [ Name ] must be deidentified; use -deid_dua()-.
## check again
check_dua_restrictions(df)
-- duawranglr note -------------------------------------------------------------
The following variables are not allowed at the current data usage level
restriction [ secure_level_ii ] and MUST BE REMOVED before saving:
- V1
To pass under the new level, you'll need to drop V1 from your data frame.
## drop
df$V1 <- NULL
## check again
check_dua_restrictions(df)
-- duawranglr note -------------------------------------------------------------
Data set has passed check and may be saved.
As a final note, your id_col must contain unique IDs. The names work in the toy example because they are unique, but as others have noted, repeated names for different observations won't work with duawranglr.
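If your IDs do repeat, one possible workaround (not a package feature, just a sketch) is to disambiguate them before deidentifying:
# suffix repeated names ("Martin", "Martin.1", ...) so each ID is unique;
# note the crosswalk will then carry the suffixed names
df$Name <- make.unique(as.character(df$Name))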

Replacing only NA values in xts object column wise using specific formula

I want to replace NA values in my xts object with the formula Beta * Exposure * Index return.
Suppose my xts object is Position_SimPnl, created below:
library(xts)
df1 <- data.frame(Google = c(NA, NA, NA, NA, 500, 600, 700, 800),
                  Apple = c(10, 20, 30, 40, 50, 60, 70, 80),
                  Audi = c(1, 2, 3, 4, 5, 6, 7, 8),
                  BMW = c(NA, NA, NA, NA, NA, 6000, 7000, 8000),
                  AENA = c(50, 51, 52, 53, 54, 55, 56, 57))
Position_SimPnl <- xts(df1, order.by = Sys.Date() - 1:8)
For Beta there is a specific dataframe:
Beta_table <- data.frame(AENA = c(0.3, 0.5, 0.6), Apple = c(0.2, 0.5, 0.8),
                         Google = c(0.1, 0.3, 0.5), Audi = c(0.4, 0.6, 0.7),
                         AXP = c(0.5, 0.7, 0.9), BMW = c(0.3, 0.4, 0.5))
rownames(Beta_table) <- c(".SPX", ".FTSE", ".STOXX")
For exposure there is another dataframe:
Base <- data.frame(RIC = c("AENA", "BMW", "Apple", "Audi", "Google"),
                   Exposure = c(100, 200, 300, 400, 500))
For Index return there is a xts object (Index_FX_Returns):
df2 <- data.frame(.SPX = c(0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08),
                  .FTSE = c(0.5, 0.4, 0.3, 0.2, 0.3, 0.4, 0.3, 0.4),
                  .STOXX = c(0.15, 0.25, 0.35, 0.3, 0.45, 0.55, 0.65, 0.5))
Index_FX_Returns <- xts(df2, order.by = Sys.Date() - 1:8)
Also there is a dataframe which links RIC with Index:
RIC_Curr_Ind <- data.frame(RIC = c("AENA", "Apple", "Google", "Audi", "BMW"),
                           Currency = c("EUR.", "USD.", "USD.", "EUR.", "EUR."),
                           Index = c(".STOXX", ".SPX", ".SPX", ".FTSE", ".FTSE"))
What I want is, for a particular NA position in Position_SimPnl: look at the column name, get the corresponding index name from the RIC_Curr_Ind dataframe, and then look up the beta value in Beta_table by matching the column name (the column name of the NA) and row name (the index name derived from the column name of the NA).
Then, again by matching the column name from Position_SimPnl with the RIC column of the Base dataframe, extract the corresponding exposure value.
Then, by matching the column name from Position_SimPnl with the RIC column of the RIC_Curr_Ind dataframe, get the corresponding index name, and from that index name look into the column names of the xts object Index_FX_Returns and get the corresponding return value for the same date as the NA value.
After getting the Beta, Exposure and Index return values, I want the NA value to be replaced by the formula Beta * Exposure * Index return. Also, I want only the NA values in Position_SimPnl to be replaced; the other values should remain as they were previously. I used the following code to replace the NA values:
do.call(merge, lapply(Position_SimPnl, function(y) {
  if (is.na(y)) {
    y = (Beta_table[match(RIC_Curr_Ind$Index[match(colnames(y), RIC_Curr_Ind$RIC)], rownames(Beta_table)),
                    match(colnames(y), colnames(Beta_table))]) *
      (Base$Exposure[match(colnames(y), Base$RIC)]) *
      (Index_FX_Returns[, RIC_Curr_Ind$Index[match(colnames(y), RIC_Curr_Ind$RIC)]])
  } else {
    y
  }
}))
However, in the output, if a particular column contains an NA it replaces all the values in the column (including those which were not NA previously). Also, I am getting multiple warning messages like
"In if (is.na(y)) { ... :
the condition has length > 1 and only the first element will be used".
I think because of this all values of a column are getting transformed, including the non-NA ones. Can anyone suggest how to effectively replace these NA values with the formula mentioned above, keeping the other values the same? Any help would be appreciated.
Because you need to combine all data sets to achieve your formula Beta * Exposure * Index, consider building a master data frame comprised of all needed components. However, you face two challenges:
different data types (xts objects and data frame)
different data formats (wide and long formats)
For proper merging and calculating, consider converting all data components into data frames and reshaping to long format (i.e., all but Base and RIC_Curr_Ind). Then, merge and calculate with ifelse to fill NA values. Of course, at the end, you will have to reshape back to wide and convert back to XTS.
Reshape
# USER-DEFINED METHOD GIVEN THE MULTIPLE CALLS
proc_transpose <- function(df, col_pick, val_col, time_col) {
  reshape(df,
          varying = names(df)[col_pick],
          times = names(df)[col_pick], ids = NULL,
          v.names = val_col, timevar = time_col,
          new.row.names = 1:1E4, direction = "long")
}
# POSITIONS
Position_SimPnl_wide_df <- data.frame(date = index(Position_SimPnl),
                                      coredata(Position_SimPnl))
Position_SimPnl_long_df <- proc_transpose(Position_SimPnl_wide_df, col_pick = -1,
                                          val_col = "Position", time_col = "RIC")
# BETA
Beta_table_long_df <- proc_transpose(transform(Beta_table, Index = row.names(Beta_table)),
                                     col_pick = 1:ncol(Beta_table),
                                     val_col = "Beta", time_col = "RIC")
# INDEX
Index_FX_Returns_wide_df <- data.frame(date = index(Index_FX_Returns),
                                       coredata(Index_FX_Returns))
Index_FX_Returns_long_df <- proc_transpose(Index_FX_Returns_wide_df, col_pick = -1,
                                           val_col = "Index_value", time_col = "Index")
Merge
# CHAIN MERGE
master_df <- Reduce(function(...) merge(..., by="RIC"),
list(Position_SimPnl_long_df,
Beta_table_long_df,
Base)
)
# ADDITIONAL MERGES (NOT INCLUDED IN ABOVE CHAIN DUE TO DIFFERENT by)
master_df <- merge(master_df,
Index_FX_Returns_long_df, by=c("Index", "date"))
master_df <- merge(master_df,
RIC_Curr_Ind, by=c("Index", "RIC"))
Calculation
# FORMULA: Beta * Exposure * Index
master_df$Position <- with(master_df, ifelse(is.na(Position),
Beta * Exposure * Index_value,
Position))
Final Preparation
# RE-ORDER ROWS AND SUBSET COLS
master_df <- data.frame(with(master_df, master_df[order(RIC, date),
c("date", "RIC", "Position")]),
row.names = NULL)
# RESHAPE WIDE (REVERSE OF ABOVE)
Position_SimPnl_new <- setNames(reshape(master_df, idvar = "date",
v.names = "Position", timevar = "RIC",
direction = "wide"),
c("date", unique(master_df$RIC)))
# CONVERT TO XTS
Position_SimPnl_new <- xts(transform(Position_SimPnl_new, date = NULL),
order.by = Position_SimPnl_new$date)
Position_SimPnl_new
# AENA Apple Audi BMW Google
# 2019-11-27 58 80 8 8000 800.0
# 2019-11-28 57 70 7 7000 700.0
# 2019-11-29 56 60 6 6000 600.0
# 2019-11-30 55 50 5 24 500.0
# 2019-12-01 54 40 4 16 2.0
# 2019-12-02 53 30 3 24 1.5
# 2019-12-03 52 20 2 32 1.0
# 2019-12-04 51 10 1 40 0.5
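For comparison, the same long-join-fill idea in tidyverse verbs might look like this (a sketch; assumes R >= 4.0 so the data.frame character columns stay character, and reuses the *_wide_df frames built above):
library(dplyr)
library(tidyr)
library(tibble)

# long versions of each component
pos_long <- Position_SimPnl_wide_df %>%
  pivot_longer(-date, names_to = "RIC", values_to = "Position")
beta_long <- Beta_table %>%
  rownames_to_column("Index") %>%
  pivot_longer(-Index, names_to = "RIC", values_to = "Beta")
idx_long <- Index_FX_Returns_wide_df %>%
  pivot_longer(-date, names_to = "Index", values_to = "Index_value")

# join all components, then fill NAs with Beta * Exposure * Index return
pos_filled <- pos_long %>%
  left_join(RIC_Curr_Ind, by = "RIC") %>%
  left_join(beta_long, by = c("RIC", "Index")) %>%
  left_join(Base, by = "RIC") %>%
  left_join(idx_long, by = c("Index", "date")) %>%
  mutate(Position = coalesce(Position, Beta * Exposure * Index_value)) %>%
  select(date, RIC, Position)

# back to wide, then to xts as in the base answer
Position_SimPnl_tv <- pos_filled %>%
  pivot_wider(names_from = RIC, values_from = Position)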

Data wrangling - data spread over three rows - dplyr

I have a very untidy data set something like this
A tibble: 200000 x 2
ChatData
<chr>
1 Sep 30, 2018 7:12pm
2 Person A
3 Hello
4 Sep 30, 2018 7:11pm
5 Person B
6 Hello there
7 Sep 30, 2018 7:10pm
8 Person A
...
As you can see it goes date, person name, comment, and repeats.
I am working on the problem and have a very complex method that adds a score column depending on the names etc....
I would like to transform this into something like this
Person A , Person B
Hello NA
NA Hello there
how's you, NA
...
(The date as a row name or third column would be great but not essential to the question)
Optimally I am looking for a dplyr/tidyverse solution
I am working with lots of data so no slow for loops etc..
Raw data to work with:
structure(list(ChatData = c("Sep 30, 2018 7:12pm", "Person A", "Hello", "Sep 30, 2018 7:11pm", "Person B", "Hello there")), row.names = c(NA, -6L), class = c("tbl_df", "tbl", "data.frame"))
If anyone is wondering I am analysing facebook messenger data, and this is the form it comes in when you download it.
Thank you.
In this case, your starting data set has only one column (aka feature), but three types of data are encoded about each message: a timestamp, the label of the person, and a message. It will be more useful to transform these into a table where each message is in its own row, and each column represents a different aspect of each observation, i.e. in long, or "tidy", format: https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html
In the approach below, the user first defines what features are repeated in the data set. I call them "headers" here, since I'm working toward a table where these are the column headers. Then the script adds that information to the data and converts the single-column data into a tidy format with one row per message, and one aspect of each message in each column.
Your requested output is a minor variation of this, addressed in the last line below: %>% spread(person, msg), which separates out the Person A and Person b data into separate columns.
library(tidyverse)
header_names <- c("timestamp", "person", "msg")
rows_per <- length(header_names)
data_length <- length(data$ChatData) / rows_per
data2 <- data %>%
  mutate(msg_number = rep(1:(nrow(data)/rows_per), each = rows_per),
         # This line repeats the header_names sequence for each msg
         header = rep(header_names, data_length)) %>%
  spread(header, ChatData) %>%
  mutate(timestamp = lubridate::mdy_hm(timestamp)) %>%
  spread(person, msg)
head(data2)
# A tibble: 2 x 4
msg_number timestamp `Person A` `Person B`
<int> <dttm> <chr> <chr>
1 1 2018-09-30 19:12:00 Hello NA
2 2 2018-09-30 19:11:00 NA Hello there
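Note that spread() still works but has since been superseded by pivot_wider(); the same pipeline with the newer verbs might look like this (a sketch, reusing the helper variables defined above):
data2 <- data %>%
  mutate(msg_number = rep(seq_len(nrow(data) / rows_per), each = rows_per),
         header = rep(header_names, data_length)) %>%
  pivot_wider(names_from = header, values_from = ChatData) %>%
  mutate(timestamp = lubridate::mdy_hm(timestamp)) %>%
  pivot_wider(names_from = person, values_from = msg)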
As you basically just have a character vector that you would like to convert into a 3-column data.frame, one other option is to simply use matrix and specify ncol=3 and byrow=TRUE:
# your sample data
d <- structure(list(ChatData = c("Sep 30, 2018 7:12pm", "Person A", "Hello", "Sep 30, 2018 7:11pm", "Person B", "Hello there")), row.names = c(NA, -6L), class = c("tbl_df", "tbl", "data.frame"))
matrix(d$ChatData, ncol = 3, byrow = TRUE,
       dimnames = list(NULL, c("date_time", "person", "message")))
Result is a character matrix:
date_time person message
[1,] "Sep 30, 2018 7:12pm" "Person A" "Hello"
[2,] "Sep 30, 2018 7:11pm" "Person B" "Hello there"
But you can wrap that in as.data.frame() to convert to a data.frame and continue working from there with dplyr if that's what you want.
Put it together for a whole solution:
It becomes a nice short, readable bit of code IMO:
library(dplyr)
library(lubridate)
result_df <-
  matrix(d$ChatData, ncol = 3, byrow = TRUE,
         dimnames = list(NULL, c("date_time", "person", "message"))) %>%
  as.data.frame() %>%
  mutate(date_time = lubridate::mdy_hm(date_time))
Here is one approach:
data %>%
  group_by(msg_number = rep(1:(nrow(data)/3), each = 3)) %>%
  summarize(msg_data = list(ChatData)) %>%
  as.data.frame
msg_number msg_data
1 1 Sep 30, 2018 7:12pm, Person A, Hello
2 2 Sep 30, 2018 7:11pm, Person B, Hello there
This numbers each message and puts the data into a column list.
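A hypothetical follow-up, if you want that list column expanded back into named fields, is tidyr::unnest_wider() (a sketch; names_sep is required because the list elements are unnamed):
library(dplyr)
library(tidyr)

data %>%
  group_by(msg_number = rep(1:(nrow(data)/3), each = 3)) %>%
  summarize(msg_data = list(ChatData)) %>%
  unnest_wider(msg_data, names_sep = "_") %>%
  rename(timestamp = msg_data_1, person = msg_data_2, msg = msg_data_3)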
