How to fill in missing cells of a data frame in R?

I have a dataset like this.
df = data.frame( name= c("Tommy", "John", "Dan"), age = c(20, NA, NA) )
I tried to set the age of John and Dan to 15.
df[ ( df$age != 20) , ]$age = 15
But I got an error as follows,
Error in [<-.data.frame(tmp, (df$age != 20), , value = list(name = c(NA_integer_, : missing values are not allowed in subscripted assignments of data frames
What is a nice way to set new values to these missing cells?

If you want to modify all cells that are not 20, including other valid values for age, I would do the following:
# Creating a data frame with another valid age
df = data.frame( name= c("Tommy", "John", "Dan","Bob"), age = c(20, NA, NA,12) )
# Substitute values different than 20 for 15
df[df$age!=20 | is.na(df$age),"age"] <- 15
name age
1 Tommy 20
2 John 15
3 Dan 15
4 Bob 15

We can use is.na
library(data.table)
setDT(df)[is.na(age), age:= 15]

Try this:
df$age[is.na(df$age)] <- 15
or using your style of syntax:
df[is.na(df$age), ]$age = 15
The error you get arises because df$age != 20 produces the following:
[1] FALSE NA NA
Any comparison with NA returns NA rather than TRUE or FALSE, so the logical index contains missing values, and data frames do not allow those in a subscripted assignment.
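To make the NA behaviour concrete, here is a minimal sketch using the data frame from the question. The variable names `not_twenty` and `missing_age` are only illustrative. It shows that `which()` keeps the TRUE positions and drops the NA ones, which is one way to get an index that is safe for assignment.

```r
df <- data.frame(name = c("Tommy", "John", "Dan"), age = c(20, NA, NA))

# Comparing with NA gives NA, not TRUE or FALSE
print(df$age != 20)

# which() keeps only the positions that are TRUE and drops the NA entries
not_twenty  <- which(df$age != 20)   # empty here: no known age differs from 20
missing_age <- which(is.na(df$age))  # positions of the missing ages

# An integer index without NA is always safe for assignment
df$age[missing_age] <- 15
print(df)
```

Whether rows with an unknown age should count as "not 20" is a modelling decision, so the sketch keeps the two indexes separate.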

Related

Deidentifying data and creating crosswalk using duawranglr in R

I am trying to deidentify data using the duawranglr package in R presented in this example: https://cran.r-project.org/web/packages/duawranglr/vignettes/securing_data.html.
As an example, I created a data frame:
data <- data.frame(
Name = c("Kate", "Jane", "Rod", "Jan", "Martin"),
V1 = c(16, 20, 34, 25, 26),
V2 = c(3, 7, 5, 3, 2)
)
I am trying to create unique, hexadecimal strings without a crosswalk that correspond to the Name column, using the deid_dua function.
data <- deid_dua(data, id_col = "Name", new_id_name = "DID", write_crosswalk = TRUE, id_length = 12)
The error that I keep getting is:
Error in data.frame(old = old_ids, new = new_ids, stringsAsFactors = FALSE) :
arguments imply differing number of rows: 5, 0
At first I thought the issue was with the name column being a factor. However, I receive the same error after converting it to character using the stringsAsFactors = FALSE statement in data.frame. I'm also not sure based on the CRAN example if I need these statements:
admin_file <- system.file('extdata', 'admin_data.csv', package = 'duawranglr')
df <- read_dua_file(admin_file)
df
Do they apply if you're not importing the data? The example doesn't explain very well what they are for.
Here's a much simpler solution:
# create a custom 8-digit random identifier string called ID:
library(stringi)
data$ID <- stri_rand_strings(nrow(data), 8)
# remove the name column to create a de-identified dataset
data_deidentified <- data[,-1]
Your data_deidentified dataframe will look something like this:
V1 V2 ID
1 16 3 V2Hziep8
2 20 7 vFeQW1OQ
3 34 5 E5vcWYfm
4 25 3 VLbHzU3H
5 26 2 acCbXiO1
And obviously retain the original data dataframe as your crosswalk. You can make the ID variable longer by changing the '8' value in that call.
Now if you have duplicate names in your data, you will need to do a few extra steps:
# note that I've modified the original dataframe to include two "Martin" values:
data <- data.frame(Name = c("Kate", "Jane", "Rod", "Jan", "Martin", "Martin"),
V1 = c(16, 20, 34, 25, 26, 28),
V2 = c(3, 7, 5, 3, 2, 5))
# get list of unique names and convert to dataframe
names <- data.frame('Name' = unique(data$Name))
# assign ID string to each unique name
names$ID <- stri_rand_strings(nrow(names), 8)
# now merge back into original df
data <- merge(data, names)
Your result is:
Name V1 V2 ID
1 Jan 25 3 e8da7lO4
2 Jane 20 7 pGeeklL1
3 Kate 16 3 5yYAtO9B
4 Martin 26 2 BwC6jPBh
5 Martin 28 5 BwC6jPBh
6 Rod 34 5 f3xvGbu2
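Note that merge() sorts the result by the key column, as the output above shows. If the original row order matters, a lookup with match() may be an alternative. The following is only a sketch in base R: `rand_id` is a hypothetical helper written for this example (not part of stringi or duawranglr), and it does not check for collisions between the generated strings.

```r
set.seed(1)  # only to make the sketch reproducible

data <- data.frame(Name = c("Kate", "Jane", "Rod", "Jan", "Martin", "Martin"),
                   V1 = c(16, 20, 34, 25, 26, 28),
                   V2 = c(3, 7, 5, 3, 2, 5),
                   stringsAsFactors = FALSE)

# Hypothetical helper: one random alphanumeric string of n_chars characters
rand_id <- function(n_chars) {
  paste(sample(c(letters, LETTERS, 0:9), n_chars, replace = TRUE), collapse = "")
}

# One ID per unique name
uniq <- unique(data$Name)
ids  <- vapply(uniq, function(nm) rand_id(8), character(1))

# Look up each row's ID by name; the row order of data is untouched
data$ID <- unname(ids[match(data$Name, uniq)])
print(data)
```

Both "Martin" rows receive the same ID, and the rows stay in their original order.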
I get an error if I don't set a crosswalk first, but this is fairly trivial:
library(duawranglr)
df <- data.frame(Name = c("Kate", "Jane", "Rod", "Jan", "Martin"),
V1 = c(16, 20, 34, 25, 26),
V2 = c(3, 7, 5, 3, 2))
# You only have a single column to obscure, so you only need a one-cell data frame to set up
set_dua_cw(data.frame(secure = "Name"))
#> -- duawranglr note -------------------------------------------------------------------
#> DUA crosswalk has been set!
# Simultaneously secure the data and write the crosswalk
df <- deid_dua(df,
id_col = "Name",
new_id_name = "ID",
write_crosswalk = TRUE,
id_length = 12,
crosswalk_filename = "cw.csv")
print(df)
#> ID V1 V2
#> 1 950dce035280 16 3
#> 2 6b95d061b59f 20 7
#> 3 00a5d8ab2a4c 34 5
#> 4 ea03e704d806 25 3
#> 5 3eba984ebcba 26 2
And you can see the contents of the crosswalk by reading the csv file's contents
read.csv("cw.csv")
#> Name ID
#> 1 Kate 950dce035280
#> 2 Jane 6b95d061b59f
#> 3 Rod 00a5d8ab2a4c
#> 4 Jan ea03e704d806
#> 5 Martin 3eba984ebcba
And if you want to get the names back in the future, you can do:
cw <- read.csv("cw.csv")
df$Name <- cw$Name[match(df$ID, cw$ID)]
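The argument order of match() matters once the shared data is no longer in the same row order as the crosswalk: the values to look up go first, the table to search goes second. A small sketch with made-up IDs (the data frames `cw` and `shared` below are illustrative, not output from duawranglr):

```r
# Illustrative crosswalk and a de-identified table whose rows were reordered
cw <- data.frame(Name = c("Kate", "Jane", "Rod"),
                 ID = c("a1", "b2", "c3"),
                 stringsAsFactors = FALSE)
shared <- data.frame(ID = c("c3", "a1", "b2"),
                     V1 = c(34, 16, 20),
                     stringsAsFactors = FALSE)

# For each ID in shared, find its position in the crosswalk, then take that name
shared$Name <- cw$Name[match(shared$ID, cw$ID)]
print(shared)
```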
I'm a little late, but as the package author, I'll try to clear up some confusion.
tl;dr
The answer @Allan Cameron gave worked for me, but if all you want to do is hash your IDs, then @mh765's solution is probably the best.
Longer explanation of duawranglr purpose
duawranglr assumes you have a restricted data frame and that you want to do two things so that you can share it:
1. Drop columns that contain restricted data elements (like DOB or other identifying information)
2. Convert unique identifiers into another unique ID that can't be used to back into the original IDs (in case the original IDs are also restricted, like SSNs)
Since you aren't trying to do #1, it makes sense to have a DUA crosswalk that has only one column with one element: the name of your ID column (per @Allan Cameron).
But let's say you have two potential levels of security and in the second, you can't include V1. Then your DUA crosswalk might look like this:
library(duawranglr)
## your data frame
df <- data.frame(Name = c("Kate", "Jane", "Rod", "Jan", "Martin"),
V1 = c(16, 20, 34, 25, 26),
V2 = c(3, 7, 5, 3, 2))
## create dua crosswalk
dua_cw <- data.frame(secure_level_i = c("Name",""),
secure_level_ii = c("Name", "V1"))
## show cw (level_i won't allow name; level_ii won't allow name or V1)
dua_cw
secure_level_i secure_level_ii
1 Name Name
2 V1
## set the dua cw
set_dua_cw(dua_cw)
-- duawranglr note -------------------------------------------------------------
DUA crosswalk has been set!
Now you can set the level of security. Let's say you set it at secure_level_i, meaning it's okay to keep V1 in the final data frame you share:
## set DUA level
set_dua_level("secure_level_i", deidentify_required = TRUE, id_column = "Name")
-- duawranglr note -------------------------------------------------------------
Unique IDs in [ Name ] must be deidentified; use -deid_dua()-.
Now you can use deid_dua() as you wanted to hash your IDs, in this case, names.
## deidentify data (don't need to set id_col since we set it in set_dua_level)
df <- deid_dua(df,
new_id_name = "DID",
write_crosswalk = TRUE,
id_length = 12,
crosswalk_filename = "cw.csv")
## show result
df
DID V1 V2
1 d164bb624da2 16 3
2 a8b33e3b0230 20 7
3 a1d287cbdde7 34 5
4 1c00ba576e1a 25 3
5 a870564b3365 26 2
## show crosswalk
read.csv("cw.csv")
Name DID
1 Kate d164bb624da2
2 Jane a8b33e3b0230
3 Rod a1d287cbdde7
4 Jan 1c00ba576e1a
5 Martin a870564b3365
## check restrictions to see if you can save data
check_dua_restrictions(df)
-- duawranglr note -------------------------------------------------------------
Data set has passed check and may be saved.
If, however, you set_dua_level() to "secure_level_ii", then you won't pass the last check since you'll still have V1 in your data.
## set new more secure level
set_dua_level("secure_level_ii", deidentify_required = TRUE, id_column = "Name")
-- duawranglr note -------------------------------------------------------------
Unique IDs in [ Name ] must be deidentified; use -deid_dua()-.
## check again
check_dua_restrictions(df)
-- duawranglr note -------------------------------------------------------------
The following variables are not allowed at the current data usage level
restriction [ secure_level_ii ] and MUST BE REMOVED before saving:
- V1
To pass under the new level, you'll need to drop V1 from your data frame.
## drop
df$V1 <- NULL
## check again
check_dua_restrictions(df)
-- duawranglr note -------------------------------------------------------------
Data set has passed check and may be saved.
As a final note, your id_col must contain unique IDs. The names work in the toy example because they are unique, but as others have noted, repeated names for different observations won't work with duawranglr.

Replacing only NA values in xts object column wise using specific formula

I want to replace NA values in my xts object with formula Beta * Exposure * Index return.
My xts object, Position_SimPnl, is created below:
library(xts)
df1 <- data.frame(Google = c(NA, NA, NA, NA, 500, 600, 700, 800),
Apple = c(10, 20,30,40,50,60,70,80),
Audi = c(1,2,3,4,5,6,7,8),
BMW = c(NA, NA, NA, NA, NA, 6000,7000,8000),
AENA = c(50,51,52,53,54,55,56,57))
Position_SimPnl <- xts(df1, order.by = Sys.Date() - 1:8)
For Beta there is a specific dataframe:
Beta_table <- data.frame (AENA = c(0.3,0.5,0.6), Apple = c(0.2,0.5,0.8), Google = c(0.1,0.3,0.5), Audi = c(0.4,0.6,0.7), AXP = c(0.5,0.7, 0.9), BMW = c(0.3,0.4, 0.5))
rownames(Beta_table) <- c(".SPX", ".FTSE", ".STOXX")
For exposure there is another dataframe:
Base <- data.frame (RIC = c("AENA","BMW","Apple","Audi","Google"), Exposure = c(100,200,300,400,500))
For Index return there is a xts object (Index_FX_Returns):
df2 <- data.frame(.SPX = c(0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08),
.FTSE = c(0.5, 0.4,0.3,0.2,0.3,0.4,0.3,0.4),
.STOXX = c(0.15,0.25,0.35,0.3,0.45,0.55,0.65,0.5))
Index_FX_Returns <- xts(df2,order.by = Sys.Date() - 1:8)
Also there is a dataframe which links RIC with Index:
RIC_Curr_Ind <- data.frame(RIC = c("AENA", "Apple", "Google", "Audi", "BMW"), Currency = c("EUR.","USD.","USD.","EUR.","EUR."), Index = c(".STOXX",".SPX",".SPX",".FTSE",".FTSE"))
What I want is for a particular position of NA value in Position_SimPnl it should look into the column name and get the corresponding index name from RIC_Curr_Ind dataframe and then look for the beta value from Beta_table by matching column name (column name of NA) and row name (index name derived from column name of NA).
Then again by matching the column name from Position_SimPnl with the RIC column from 'Base' dataframe it would extract the corresponding exposure value.
Then by matching column name from Position_SimPnl with RIC column from RIC_Curr_Ind dataframe, it would get the corresponding index name and from that index name it would look into the column name for xts object Index_FX_Returns and get the corresponding return value for the same date as of the NA value.
After getting the Beta, Exposure and Index return values, I want the NA value to be replaced by the formula Beta * Exposure * Index return. Only the NA values in Position_SimPnl should be replaced; the other values should remain as they were. I used the following code to replace the NA values:
do.call(merge, lapply(Position_SimPnl, function(y) {if(is.na(y)){y = (Beta_table[match(RIC_Curr_Ind$Index[match(colnames(y),RIC_Curr_Ind$RIC)],rownames(Beta_table)), match(colnames(y),colnames(Beta_table))]) * (Base$Exposure[match(colnames(y), Base$RIC)]) * (Index_FX_Returns[,RIC_Curr_Ind$Index[match(colnames(y),RIC_Curr_Ind$RIC)]])} else{y}}))
However in the output, if a particular column contains NA it is replacing all the values in the column (including which were not NA previously). Also I am getting multiple warning messages like
"In if (is.na(y)) { ... :
the condition has length > 1 and only the first element will be used".
I think this is why all values in the column are transformed, including the non-NA ones. Can anyone suggest how to replace only these NA values with the formula above while keeping the other values the same? Any help would be appreciated.
Because you need to combine all data sets to achieve your formula Beta * Exposure * Index, consider building a master data frame comprised of all needed components. However, you face two challenges:
different data types (xts objects and data frame)
different data formats (wide and long formats)
For proper merging and calculating, consider converting all data components into data frames and reshaping them to long format (i.e., all but Base and RIC_Curr_Ind). Then, merge and calculate with ifelse to fill NA values. Of course, at the end, you will have to reshape back to wide and convert back to xts.
Reshape
# USER-DEFINED METHOD GIVEN THE MULTIPLE CALLS
proc_transpose <- function(df, col_pick, val_col, time_col) {
reshape(df,
varying = names(df)[col_pick],
times = names(df)[col_pick], ids = NULL,
v.names = val_col, timevar = time_col,
new.row.names = 1:1E4, direction = "long")
}
# POSITIONS
Position_SimPnl_wide_df <- data.frame(date = index(Position_SimPnl),
coredata(Position_SimPnl))
Position_SimPnl_long_df <- proc_transpose(Position_SimPnl_wide_df, col_pick = -1,
val_col = "Position", time_col = "RIC")
# BETA
Beta_table_long_df <- proc_transpose(transform(Beta_table, Index = row.names(Beta_table)),
col_pick = 1:ncol(Beta_table),
val_col = "Beta", time_col = "RIC")
# INDEX
Index_FX_Returns_wide_df <- data.frame(date = index(Index_FX_Returns),
coredata(Index_FX_Returns))
Index_FX_Returns_long_df <- proc_transpose(Index_FX_Returns_wide_df, col_pick = -1,
val_col = "Index_value", time_col = "Index")
Merge
# CHAIN MERGE
master_df <- Reduce(function(...) merge(..., by="RIC"),
list(Position_SimPnl_long_df,
Beta_table_long_df,
Base)
)
# ADDITIONAL MERGES (NOT INCLUDED IN ABOVE CHAIN DUE TO DIFFERENT by)
master_df <- merge(master_df,
Index_FX_Returns_long_df, by=c("Index", "date"))
master_df <- merge(master_df,
RIC_Curr_Ind, by=c("Index", "RIC"))
Calculation
# FORMULA: Beta * Exposure * Index
master_df$Position <- with(master_df, ifelse(is.na(Position),
Beta * Exposure * Index_value,
Position))
Final Preparation
# RE-ORDER ROWS AND SUBSET COLS
master_df <- data.frame(with(master_df, master_df[order(RIC, date),
c("date", "RIC", "Position")]),
row.names = NULL)
# RESHAPE WIDE (REVERSE OF ABOVE)
Position_SimPnl_new <- setNames(reshape(master_df, idvar = "date",
v.names = "Position", timevar = "RIC",
direction = "wide"),
c("date", unique(master_df$RIC)))
# CONVERT TO XTS
Position_SimPnl_new <- xts(transform(Position_SimPnl_new, date = NULL),
order.by = Position_SimPnl_new$date)
Position_SimPnl_new
# AENA Apple Audi BMW Google
# 2019-11-27 58 80 8 8000 800.0
# 2019-11-28 57 70 7 7000 700.0
# 2019-11-29 56 60 6 6000 600.0
# 2019-11-30 55 50 5 24 500.0
# 2019-12-01 54 40 4 16 2.0
# 2019-12-02 53 30 3 24 1.5
# 2019-12-03 52 20 2 32 1.0
# 2019-12-04 51 10 1 40 0.5
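For comparison, the same "replace only the NA cells" idea can be sketched directly on matrices, without reshaping. This is a simplified illustration, not the method above: it assumes plain numeric matrices (such as coredata() would return for the xts objects), a reduced toy data set, and that the beta, exposure and index name for each column have already been looked up into the named vectors `beta`, `exposure` and `index_of` (names chosen for this example).

```r
# Toy inputs: rows are dates, columns are instruments or indices
pos     <- cbind(Google = c(NA, NA, 500), BMW = c(NA, 6000, 7000))
idx_ret <- cbind(.SPX = c(0.01, 0.02, 0.03), .FTSE = c(0.5, 0.4, 0.3))

# Per-instrument lookups (assumed to be prepared beforehand)
beta     <- c(Google = 0.1, BMW = 0.4)          # beta against the instrument's own index
exposure <- c(Google = 500, BMW = 200)
index_of <- c(Google = ".SPX", BMW = ".FTSE")

# Candidate value for every cell: Beta * Exposure * Index return
ret_for_pos <- idx_ret[, index_of[colnames(pos)], drop = FALSE]
fill <- sweep(ret_for_pos, 2, beta[colnames(pos)] * exposure[colnames(pos)], `*`)

# Overwrite only the cells that are NA; everything else stays as it was
na_cells <- is.na(pos)
pos[na_cells] <- fill[na_cells]
print(pos)
```

The key difference from `if (is.na(y))` is that the logical matrix `na_cells` selects individual cells, whereas `if` only looks at a single condition.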

Creating new column based on values in preceding column

I would like to add a new column to a data.frame that converts from the numeric value in the first column to the corresponding string (if any) from a subsequent matching column i.e. the column name partially matches this value in the first column.
In this example, I wish to add a value for 'Highest_Earner', which depends on the value in the Earner_Number column:
> df1 <- data.frame("Earner_Number" = c(1, 2, 1, 5),
"Earner5" = c("Max", "Alex", "Ben", "Mark"),
"Earner1" = c("John", "Dora", "Micelle", "Josh"))
> df1
Earner_Number Earner5 Earner1
1 1 Max John
2 2 Alex Dora
3 1 Ben Micelle
4 5 Mark Josh
The result should be:
> df1
Earner_Number Earner5 Earner1 Highest_Earner
1 1 Max John John
2 2 Alex Dora Neither
3 1 Ben Micelle Micelle
4 5 Mark Josh Mark
I have tried cutting the data.frame into various smaller pieces, but was wondering if someone had a somewhat cleaner method?
# Convert the columns to character so the nested ifelse() returns names rather than factor codes.
df1$Earner5 <- as.character(df1$Earner5)
df1$Earner1 <- as.character(df1$Earner1)
# Use a nested ifelse() to build the new column.
df1$Highest_Earner <- ifelse(df1$Earner_Number == 5, df1$Earner5,
ifelse(df1$Earner_Number == 1, df1$Earner1, "Neither"))
dplyr approach
library(tidyverse)
df <- tibble("Earner_Number" = c(1,2,1,5), "Earner5" = c('Max', 'Alex','Ben','Mark'), "Earner1" = c("John","Dora","Micelle",'Josh'))
df %>%
mutate(Highest_Earner = case_when(Earner_Number == 1 ~ Earner1,
Earner_Number == 5 ~ Earner5,
TRUE ~ 'Neither'))
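If there are many EarnerN columns, writing one condition per column gets tedious. A possible generalisation, sketched here in base R under the assumption that the columns are always named "Earner" followed by the number, is to build the column name from Earner_Number and pick the cell with matrix indexing. The names `target`, `col_pos` and `picked` are only illustrative.

```r
df1 <- data.frame(Earner_Number = c(1, 2, 1, 5),
                  Earner5 = c("Max", "Alex", "Ben", "Mark"),
                  Earner1 = c("John", "Dora", "Micelle", "Josh"),
                  stringsAsFactors = FALSE)

# Column each row should read from, e.g. "Earner1"; NA if no such column exists
target  <- paste0("Earner", df1$Earner_Number)
col_pos <- match(target, names(df1))

# Pick one cell per row with (row, column) index pairs; NA columns give NA
picked <- as.matrix(df1)[cbind(seq_len(nrow(df1)), col_pos)]

df1$Highest_Earner <- ifelse(is.na(col_pos), "Neither", picked)
print(df1)
```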

Calculated column in R with lookup from another data frame

I am pretty new to R. I have already written a few functions, but I'm lost here: I need to compute the values of a "Result" column. I currently have a data frame with the columns "Seasons" and "Total", and I need to add another column, "Result". To get it, I need to look up the "Multiplier" value in another data frame.
Seasons Total Result
Winter 200 100
Fall 50 25
Spring 10 5
Summer 120 12
I have other data frame with column and row values of
Multiplier Value
Win1 0.5
Win2 0.1
Win1 should only be multiplied to "Total" when Seasons are Winter, Fall and Spring while Win2 must only be multiplied to "Total" when season is Summer. This should be the value of "Result" column.
Thank you
This works
data1 = data.frame(Seasons = c("Winter","Fall","Spring","Summer"),
Total = c(200,50,10,120), stringsAsFactors = FALSE)
data2 = data.frame(Multiplier = c("Win1","Win2"), Value = c(0.5,0.1), stringsAsFactors = FALSE)
# Store the product in a new Result column instead of overwriting Total
data1$Result = ifelse(data1$Seasons != "Summer",
data1$Total * data2[data2$Multiplier %in% "Win1", 2],
data1$Total * data2[data2$Multiplier %in% "Win2", 2])
Another option, using dplyr:
library(dplyr)
df1 %>% mutate(Result = ifelse(Seasons %in% c("Winter", "Fall", "Spring"),
Total*df2[df2$Multiplier=="Win1",]$Value,
Total*df2[df2$Multiplier=="Win2",]$Value))
# Seasons Total Result
#1 Winter 200 100
#2 Fall 50 25
#3 Spring 10 5
#4 Summer 120 12
# OR 2nd Option is using with
df1$Result <- with(df1, ifelse(Seasons %in% c("Winter", "Fall", "Spring"),
Total*df2[df2$Multiplier=="Win1",]$Value,
Total*df2[df2$Multiplier=="Win2",]$Value) )
Data
df1 <- read.table(text = "Seasons Total
Winter 200
Fall 50
Spring 10
Summer 120", header = TRUE, stringsAsFactors = FALSE)
df2 <- read.table(text = "Multiplier Value
Win1 0.5
Win2 0.1", header = TRUE, stringsAsFactors = FALSE)
You can use ifelse()
df1$Result <- ifelse(df1$Seasons=="Summer", df2$Value[2]*df1$Total, df2$Value[1]*df1$Total)
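The ifelse() answers hard-code which season uses which multiplier. If the mapping may grow, one option is to keep it as data and do the lookup with match(). This is only a sketch: the named vector `season_to_multiplier` is an assumption introduced here, since the question describes the mapping in words rather than providing it as a table.

```r
df1 <- data.frame(Seasons = c("Winter", "Fall", "Spring", "Summer"),
                  Total = c(200, 50, 10, 120),
                  stringsAsFactors = FALSE)
df2 <- data.frame(Multiplier = c("Win1", "Win2"),
                  Value = c(0.5, 0.1),
                  stringsAsFactors = FALSE)

# Assumed mapping from season to multiplier name, taken from the question's description
season_to_multiplier <- c(Winter = "Win1", Fall = "Win1", Spring = "Win1", Summer = "Win2")

# Season -> multiplier name -> multiplier value
key <- season_to_multiplier[df1$Seasons]
df1$Result <- df1$Total * df2$Value[match(key, df2$Multiplier)]
print(df1)
```

Unlike `df2$Value[1]` and `df2$Value[2]`, this does not depend on the row order of df2.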

Q-How to fill a new column in data.frame based on row values by two conditions in R

I am trying to figure out how to generate a new column in R that records whether a politician "i" remains in the same party or defects in a given legislature "l". Politicians and parties are identified by indexes. Here is an example of what my data originally looks like:
## example of data
names <- c("Jesus Martinez", "Anrita blabla", "Paco Pico", "Reiner Steingress", "Jesus Martinez Porras")
Parti.affiliation <- c("Winner","Winner","Winner", "Loser", NA)#NA, "New party", "Loser", "Winner", NA
Legislature <- c(rep(1, 5), rep(2,5), rep(3,5), rep(4,5), rep(5,5), rep(6,5))
selection <- c(rep("majority", 15), rep("PR", 15))
sex<- c("Male", "Female", "Male", "Female", "Male")
Election<- c(rep(1955, 5), rep(1960, 5), rep(1965, 5), rep(1970,5), rep(1975,5), rep(1980,5))
d<- data.frame(names =factor(rep(names, 6)), party.affiliation = c(rep(Parti.affiliation,5), NA, "New party", "Loser", "Winner", NA), legislature = Legislature, selection = selection, gender =rep(sex, 6), Election.date = Election)
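## generating ids for politician and party.affiliation
library(dplyr) # provides arrange()
d$id_pers<- paste(d$names, sep="")
d <- arrange(d, id_pers)
d <- transform(d, id_pers = as.numeric(factor(id_pers)))
## go through factor() so the codes are numeric even when the column is stored as character (the default since R 4.0)
d$party.affiliation1<- as.numeric(factor(d$party.affiliation))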
The expected outcome is as follows: if a politician (identified by the column "id_pers") has changed value in the column "party.affiliation1", a 1 is assigned in a new column called "switch", otherwise 0. The same procedure should be applied to every politician in the dataset, so the expected outcome should look like this:
d["switch"]<- c(1, rep(0,4), NA, rep(0,6), rep(NA, 6),1, rep(0,5), rep (0,5),1) # 0= remains in the same party / 1= switch party affiliation.
For example, you can see in this data.frame that the first politician, "Anrita blabla", was a candidate of party '3' from the 1st to the 5th legislature. In the 6th legislature, however, "Anrita" changes her party affiliation and becomes a candidate for party '2'. The new column "switch" should therefore contain a '1' to reflect Anrita's change of party affiliation, and '0' for the first 5 legislatures, in which she did not change party.
I have tried several approaches to do that (e.g. loops). I have found this strategy the simplest one, but it does not work :(
## add a new column based on raw values
ind <- c(FALSE, party.affiliation1[-1L]!= party.affiliation1[-length(party.affiliation1)] & party.affiliation1!= 'Null')
d <- d %>% group_by(id_pers) %>% mutate(this = ifelse(ind, 1, 0))
I hope you find this explanation clear. Thanks in advance!!!
I think you could do:
library(tidyverse)
d%>%
group_by(id_pers)%>%
mutate(switch=as.numeric((party.affiliation1-lag(party.affiliation1)!=0)))
The first entry will be NA as we don't have information on whether their previous, if any, party affiliation was different.
Edit: We use the default= parameter of lag() with ifelse() nested to differentiate the first values.
df=d%>%
group_by(id_pers)%>%
mutate(switch=ifelse((party.affiliation1-lag(party.affiliation1,default=-99))>90,99,ifelse(party.affiliation1-lag(party.affiliation1)!=0,1,0)))
Another approach, using data.table:
library(data.table)
# Convert to data.table
d <- as.data.table(d)
# Order by election date
d <- d[order(Election.date)]
# Get the previous affiliation, for each id_pers
d[, previous_party_affiliation := shift(party.affiliation), by = id_pers]
# If the current affiliation is different from the previous one, set to 1
d[, switch := ifelse(party.affiliation != previous_party_affiliation, 1, 0)]
# Remove the column
d[, previous_party_affiliation := NULL]
As Haboryme has pointed out, the first entry of each person will be NA, due to the lack of information on previous elections. And the result would give this:
names party.affiliation legislature selection gender Election.date id_pers party.affiliation1 switch
1: Anrita blabla Winner 1 majority Female 1955 1 NA NA
2: Anrita blabla Winner 2 majority Female 1960 1 NA 0
3: Anrita blabla Winner 3 majority Female 1965 1 NA 0
4: Anrita blabla Winner 4 PR Female 1970 1 NA 0
5: Anrita blabla Winner 5 PR Female 1975 1 NA 0
6: Anrita blabla New party 6 PR Female 1980 1 NA 1
(...)
EDITED
In order to identify the first entry of the political affiliation and assign the value 99 to them, you can use this modified version:
# Note the "fill" parameter passed to the function shift
d[, previous_party_affiliation := shift(party.affiliation, fill = "First"), by = id_pers]
# Set 99 to the first occurrence
d[, switch := ifelse(party.affiliation != previous_party_affiliation, ifelse(previous_party_affiliation == "First", 99, 1), 0)]
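The same "compare with the previous row within each politician" idea can also be sketched in base R with ave(), comparing the party labels directly so no numeric recoding is needed. The toy data frame below is made up for illustration (it is not the question's full data set), and `prev` is just an illustrative name.

```r
# Toy data: two politicians over three legislatures
d <- data.frame(id_pers = c(1, 1, 1, 2, 2, 2),
                legislature = c(1, 2, 3, 1, 2, 3),
                party = c("Winner", "Winner", "New party", "Loser", "Loser", "Loser"),
                stringsAsFactors = FALSE)

# The lag only makes sense if rows are in chronological order within each politician
d <- d[order(d$id_pers, d$legislature), ]

# Previous party within each politician; NA for the first record of each
prev <- ave(d$party, d$id_pers, FUN = function(p) c(NA, head(p, -1)))

# 1 = switched party, 0 = stayed, NA = no earlier record to compare with
d$switch <- as.integer(d$party != prev)
print(d)
```

Rows where the party itself is NA would also come out as NA here, so how to treat missing affiliations remains a separate decision.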
