Calculated column in R with lookup from another data frame - r

I am pretty new to R and have already written a few functions, but I'm lost here: I need to compute the values of a "Result" column. I currently have a data frame with the columns "Seasons" and "Total", and I need to add a third column, "Result". To get it, I need to look up a "multiplier" value from another data frame. The desired output is:
Seasons Total Result
Winter 200 100
Fall 50 25
Spring 10 5
Summer 120 12
I have another data frame with these columns and rows:
Multiplier Value
Win1 0.5
Win2 0.1
Win1 should be multiplied by "Total" only when the season is Winter, Fall or Spring, while Win2 should be multiplied by "Total" only when the season is Summer. The product is the value of the "Result" column.
Thank you

This works
data1 = data.frame(Seasons = c("Winter","Fall","Spring","Summer"),
                   Total = c(200,50,10,120), stringsAsFactors = F)
data2 = data.frame(Multiplier = c("Win1","Win2"), Value = c(0.5,0.1), stringsAsFactors = F)
# pass full-length vectors to both branches; ifelse() picks element-wise,
# and the result goes into a new Result column instead of overwriting Total
data1$Result = ifelse(data1$Seasons != "Summer",
                      data1$Total * data2[data2$Multiplier %in% "Win1", 2],
                      data1$Total * data2[data2$Multiplier %in% "Win2", 2])

Another option uses dplyr:
library(dplyr)
df1 %>% mutate(Result = ifelse(Seasons %in% c("Winter", "Fall", "Spring"),
                               Total*df2[df2$Multiplier=="Win1",]$Value,
                               Total*df2[df2$Multiplier=="Win2",]$Value))
# Seasons Total Result
#1 Winter 200 100
#2 Fall 50 25
#3 Spring 10 5
#4 Summer 120 12
# OR 2nd Option is using with
df1$Result <- with(df1, ifelse(Seasons %in% c("Winter", "Fall", "Spring"),
Total*df2[df2$Multiplier=="Win1",]$Value,
Total*df2[df2$Multiplier=="Win2",]$Value) )
Data
df1 <- read.table(text = "Seasons Total
Winter 200
Fall 50
Spring 10
Summer 120", header = T, stringsAsFactors = F)
df2 <- read.table(text = "Multiplier Value
Win1 0.5
Win2 0.1", header = T, stringsAsFactors = F)

You can use ifelse():
# relies on the row order of df2: row 1 is Win1, row 2 is Win2
df1$Result <- ifelse(df1$Seasons=="Summer", df2$Value[2]*df1$Total, df2$Value[1]*df1$Total)
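The answers above hard-code which seasons go with which multiplier. If the rule may change, one possible variation is to keep the season-to-multiplier mapping in a named vector and look the value up with match(). This is only a sketch: season_key is a hypothetical helper that is not part of the question.

```r
df1 <- data.frame(Seasons = c("Winter", "Fall", "Spring", "Summer"),
                  Total = c(200, 50, 10, 120), stringsAsFactors = FALSE)
df2 <- data.frame(Multiplier = c("Win1", "Win2"),
                  Value = c(0.5, 0.1), stringsAsFactors = FALSE)

# hypothetical mapping from season to multiplier name (an assumption)
season_key <- c(Winter = "Win1", Fall = "Win1", Spring = "Win1", Summer = "Win2")

# translate each season to its multiplier name, then find that name in df2
rows <- match(season_key[df1$Seasons], df2$Multiplier)
df1$Result <- df1$Total * df2$Value[rows]
df1
```

Because the lookup goes through the Multiplier column, it does not depend on the row order of df2.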

Related

Specify number of columns to read when first row is missing values

I have data from a logger that inserts timestamps as rows within the comma separated data. I've sorted out a way to wrangle those timestamps into a tidy data frame (thanks to the responses to this question).
The issue I'm having now is that the timestamp lines don't have the same number of comma-separated values as the data rows (3 vs 6), and readr defaults to reading in only 3 columns, despite me manually specifying column types and names for 6. Last summer (when I last used the logger) readr read the data in correctly, but to my dismay the current version (2.1.1) throws a warning and lumps columns 3:6 together. I'm hoping that there's some option for "correcting" back to the old behaviour, or some work-around I haven't thought of (editing the logger files is not an option).
Example code:
library(tidyverse)
# example data
txt1 <- "
,,Logger Start 12:34
-112,53,N=1,9,15,.25
-112,53,N=2,12,17,.17
"
# example without timestamp header
txt2 <- "
-112,53,N=1,9,15,.25
-112,53,N=2,12,17,.17
"
# throws warning and reads 3 columns
read_csv(
txt1,
col_names = c("lon", "lat", "n", "red", "nir", "NDVI"),
col_types = "ddcddc"
)
# works correctly
read_csv(
txt2,
col_names = c("lon", "lat", "n", "red", "nir", "NDVI"),
col_types = "ddcddc"
)
# this is the table that older readr versions would create
# and that I'm hoping to get back to
tribble(
~lon, ~lat, ~n, ~red, ~nir, ~NDVI,
NA, NA, "Logger Start 12:34", NA, NA, NA,
-112, 53, "N=1", 9, 15, ".25",
-112, 53, "N=2",12, 17, ".17"
)
Use the base read.csv, then convert to a tibble if need be:
read.csv(text=txt1, header = FALSE,
col.names = c("lon", "lat", "n", "red", "nir", "NDVI"))
lon lat n red nir NDVI
1 NA NA Logger Start 12:34 NA NA NA
2 -112 53 N=1 9 15 0.25
3 -112 53 N=2 12 17 0.17
I think I would use read_lines and write_lines to convert the "bad CSV" into "good CSV", and then read in the converted data.
Assuming you have a file test.csv like this:
,,Logger Start 12:34
-112,53,N=1,9,15,.25
-112,53,N=2,12,17,.17
Try something like this:
library(readr)  # read_lines(), write_lines(), read_csv()
library(dplyr)
library(tidyr)
read_lines("test.csv") %>%
# assumes all timestamp lines are the same format
gsub(",,Logger Start (.*?)$", "\\1,,,,,,", ., perl = TRUE) %>%
# assumes that NDVI (last column) is always present and ends with a digit
# you'll need to alter the regex if not the case
gsub("^(.*?\\d)$", ",\\1", ., perl = TRUE) %>%
write_lines("test_out.csv")
test_out.csv now looks like this:
12:34,,,,,,
,-112,53,N=1,9,15,.25
,-112,53,N=2,12,17,.17
So we now have 7 columns, the first of which is the timestamp.
This code reads the new file, fills in the missing timestamp values and removes rows where n is NA. You may not want to do that; I've assumed that n is only missing on the original timestamp rows.
mydata <- read_csv("test_out.csv",
col_names = c("ts", "lon", "lat", "n", "red", "nir", "NDVI")) %>%
fill(ts) %>%
filter(!is.na(n))
The final mydata:
# A tibble: 2 x 7
ts lon lat n red nir NDVI
<time> <dbl> <dbl> <chr> <dbl> <dbl> <dbl>
1 12:34 -112 53 N=1 9 15 0.25
2 12:34 -112 53 N=2 12 17 0.17
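If writing an intermediate file is not wanted, the same idea can be sketched in base R on the lines in memory. The sketch below assumes every timestamp line contains the literal text "Logger Start"; the names is_ts, grp and dat are made up for illustration.

```r
# lines as they would come back from readLines("test.csv")
txt <- c(",,Logger Start 12:34",
         "-112,53,N=1,9,15,.25",
         "-112,53,N=2,12,17,.17")

# flag timestamp rows and pull out the time part
is_ts <- grepl("Logger Start", txt, fixed = TRUE)
ts <- sub("^.*Logger Start ", "", txt[is_ts])

# running count of timestamp rows seen so far; 0 means "none yet"
grp <- cumsum(is_ts)

# parse only the data rows, which all have 6 fields
dat <- read.csv(text = txt[!is_ts], header = FALSE,
                col.names = c("lon", "lat", "n", "red", "nir", "NDVI"))

# attach the most recent timestamp; rows before any timestamp get NA
dat$ts <- c(NA, ts)[grp[!is_ts] + 1]
dat
```

Each data row picks up the timestamp that most recently preceded it, which is the same effect as fill(ts) in the answer above.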

Replacing only NA values in xts object column wise using specific formula

I want to replace NA values in my xts object with formula Beta * Exposure * Index return.
My xts object, Position_SimPnl, is created below:
library(xts)
df1 <- data.frame(Google = c(NA, NA, NA, NA, 500, 600, 700, 800),
Apple = c(10, 20,30,40,50,60,70,80),
Audi = c(1,2,3,4,5,6,7,8),
BMW = c(NA, NA, NA, NA, NA, 6000,7000,8000),
AENA = c(50,51,52,53,54,55,56,57))
Position_SimPnl <- xts(df1, order.by = Sys.Date() - 1:8)
For Beta there is a specific dataframe:
Beta_table <- data.frame (AENA = c(0.3,0.5,0.6), Apple = c(0.2,0.5,0.8), Google = c(0.1,0.3,0.5), Audi = c(0.4,0.6,0.7), AXP = c(0.5,0.7, 0.9), BMW = c(0.3,0.4, 0.5))
rownames(Beta_table) <- c(".SPX", ".FTSE", ".STOXX")
For exposure there is another dataframe:
Base <- data.frame (RIC = c("AENA","BMW","Apple","Audi","Google"), Exposure = c(100,200,300,400,500))
For Index return there is a xts object (Index_FX_Returns):
df2 <- data.frame(.SPX = c(0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08),
.FTSE = c(0.5, 0.4,0.3,0.2,0.3,0.4,0.3,0.4),
.STOXX = c(0.15,0.25,0.35,0.3,0.45,0.55,0.65,0.5))
Index_FX_Returns <- xts(df2,order.by = Sys.Date() - 1:8)
Also there is a dataframe which links RIC with Index:
RIC_Curr_Ind <- data.frame(RIC = c("AENA", "Apple", "Google", "Audi", "BMW"), Currency = c("EUR.","USD.","USD.","EUR.","EUR."), Index = c(".STOXX",".SPX",".SPX",".FTSE",".FTSE"))
What I want is for a particular position of NA value in Position_SimPnl it should look into the column name and get the corresponding index name from RIC_Curr_Ind dataframe and then look for the beta value from Beta_table by matching column name (column name of NA) and row name (index name derived from column name of NA).
Then again by matching the column name from Position_SimPnl with the RIC column from 'Base' dataframe it would extract the corresponding exposure value.
Then by matching column name from Position_SimPnl with RIC column from RIC_Curr_Ind dataframe, it would get the corresponding index name and from that index name it would look into the column name for xts object Index_FX_Returns and get the corresponding return value for the same date as of the NA value.
After getting the Beta, Exposure and Index return values, I want the NA value to be replaced using the formula Beta * Exposure * Index return. Only the NA values in Position_SimPnl should be replaced; the other values should remain as they were. I used the following code to replace the NA values:
do.call(merge, lapply(Position_SimPnl, function(y) {if(is.na(y)){y = (Beta_table[match(RIC_Curr_Ind$Index[match(colnames(y),RIC_Curr_Ind$RIC)],rownames(Beta_table)), match(colnames(y),colnames(Beta_table))]) * (Base$Exposure[match(colnames(y), Base$RIC)]) * (Index_FX_Returns[,RIC_Curr_Ind$Index[match(colnames(y),RIC_Curr_Ind$RIC)]])} else{y}}))
However in the output, if a particular column contains NA it is replacing all the values in the column (including which were not NA previously). Also I am getting multiple warning messages like
"In if (is.na(y)) { ... :
the condition has length > 1 and only the first element will be used".
I think because of this all values of column are getting transformed including non-NA ones. Can anyone suggest how to effectively replace these NA values by the formula mentioned above, keeping the other values same. Any help would be appreciated
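On the warning itself: if() looks at a single TRUE/FALSE, so when is.na(y) returns one value per row only the first is used and the whole column is either replaced or kept (from R 4.2.0 onward a condition of length greater than one is an error instead of a warning). A minimal sketch of the element-wise alternative on a plain vector follows; the numbers are taken from the Google column and the 0.1 beta, 500 exposure and .SPX returns in the question, shortened to four rows.

```r
# one column with NAs in the first two positions
y <- c(NA, NA, 500, 600)

# candidate replacement for every row: Beta * Exposure * Index return
replacement <- 0.1 * 500 * c(0.01, 0.02, 0.03, 0.04)

# logical index, TRUE only where y is missing
na_pos <- is.na(y)

# overwrite the NA positions and leave everything else untouched
y[na_pos] <- replacement[na_pos]
y
```

ifelse(is.na(y), replacement, y) gives the same values on a plain vector.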
Because you need to combine all data sets to achieve your formula Beta * Exposure * Index, consider building a master data frame comprised of all needed components. However, you face two challenges:
different data types (xts objects and data frame)
different data formats (wide and long formats)
For proper merging and calculating, consider converting all data components into data frames and reshaping to long format (i.e., all but Base and RIC_Curr_Ind). Then, merge and calculate with ifelse to fill NA values. Of course, at the end, you will have to reshape back to wide and convert back to XTS.
Reshape
# USER-DEFINED METHOD GIVEN THE MULTIPLE CALLS
proc_transpose <- function(df, col_pick, val_col, time_col) {
reshape(df,
varying = names(df)[col_pick],
times = names(df)[col_pick], ids = NULL,
v.names = val_col, timevar = time_col,
new.row.names = 1:1E4, direction = "long")
}
# POSITIONS
Position_SimPnl_wide_df <- data.frame(date = index(Position_SimPnl),
coredata(Position_SimPnl))
Position_SimPnl_long_df <- proc_transpose(Position_SimPnl_wide_df, col_pick = -1,
val_col = "Position", time_col = "RIC")
# BETA
Beta_table_long_df <- proc_transpose(transform(Beta_table, Index = row.names(Beta_table)),
col_pick = 1:ncol(Beta_table),
val_col = "Beta", time_col = "RIC")
# INDEX
Index_FX_Returns_wide_df <- data.frame(date = index(Index_FX_Returns),
coredata(Index_FX_Returns))
Index_FX_Returns_long_df <- proc_transpose(Index_FX_Returns_wide_df, col_pick = -1,
val_col = "Index_value", time_col = "Index")
Merge
# CHAIN MERGE
master_df <- Reduce(function(...) merge(..., by="RIC"),
list(Position_SimPnl_long_df,
Beta_table_long_df,
Base)
)
# ADDITIONAL MERGES (NOT INCLUDED IN ABOVE CHAIN DUE TO DIFFERENT by)
master_df <- merge(master_df,
Index_FX_Returns_long_df, by=c("Index", "date"))
master_df <- merge(master_df,
RIC_Curr_Ind, by=c("Index", "RIC"))
Calculation
# FORMULA: Beta * Exposure * Index
master_df$Position <- with(master_df, ifelse(is.na(Position),
Beta * Exposure * Index_value,
Position))
Final Preparation
# RE-ORDER ROWS AND SUBSET COLS
master_df <- data.frame(with(master_df, master_df[order(RIC, date),
c("date", "RIC", "Position")]),
row.names = NULL)
# RESHAPE WIDE (REVERSE OF ABOVE)
Position_SimPnl_new <- setNames(reshape(master_df, idvar = "date",
v.names = "Position", timevar = "RIC",
direction = "wide"),
c("date", unique(master_df$RIC)))
# CONVERT TO XTS
Position_SimPnl_new <- xts(transform(Position_SimPnl_new, date = NULL),
order.by = Position_SimPnl_new$date)
Position_SimPnl_new
# AENA Apple Audi BMW Google
# 2019-11-27 57 80 8 8000 800.0
# 2019-11-28 56 70 7 7000 700.0
# 2019-11-29 55 60 6 6000 600.0
# 2019-11-30 54 50 5 24 500.0
# 2019-12-01 53 40 4 16 2.0
# 2019-12-02 52 30 3 24 1.5
# 2019-12-03 51 20 2 32 1.0
# 2019-12-04 50 10 1 40 0.5
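For comparison, the same fill can be sketched without reshaping by looping over the columns and overwriting only the NA positions. The example below is an assumption-laden miniature: it uses plain data frames with three rows and two instruments instead of the xts objects, and the short names pos, beta, base, link and ret stand in for Position_SimPnl, Beta_table, Base, RIC_Curr_Ind and Index_FX_Returns. Rows are assumed to be aligned by date already.

```r
pos  <- data.frame(Google = c(NA, NA, 500), BMW = c(NA, 6000, 7000))
beta <- data.frame(Google = c(0.1, 0.3), BMW = c(0.3, 0.4),
                   row.names = c(".SPX", ".FTSE"))
base <- data.frame(RIC = c("BMW", "Google"), Exposure = c(200, 500),
                   stringsAsFactors = FALSE)
link <- data.frame(RIC = c("Google", "BMW"), Index = c(".SPX", ".FTSE"),
                   stringsAsFactors = FALSE)
ret  <- data.frame(.SPX = c(0.01, 0.02, 0.03), .FTSE = c(0.5, 0.4, 0.3))

for (ric in names(pos)) {
  # index that belongs to this instrument
  idx <- link$Index[match(ric, link$RIC)]
  # Beta * Exposure * Index return, one value per row
  fill <- beta[idx, ric] * base$Exposure[match(ric, base$RIC)] * ret[[idx]]
  # replace only the missing positions
  na_pos <- is.na(pos[[ric]])
  pos[[ric]][na_pos] <- fill[na_pos]
}
pos
```

With xts input, the same loop could presumably be run on coredata() of each object, provided the two series share the same dates.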

Frequency tables by groups with weighted data in R

I wish to calculate two kinds of frequency tables by groups with weighted data.
You can generate reproducible data with the following code :
Data <- data.frame(
country = sample(c("France", "USA", "UK"), 100, replace = TRUE),
migrant = sample(c("Native", "Foreign-born"), 100, replace = TRUE),
gender = sample (c("men", "women"), 100, replace = TRUE),
wgt = sample(100),
year = sample(2006:2007)
)
First, I want to calculate a frequency table of migrant status (Native vs. Foreign-born) by country and year. I wrote the following code using the packages questionr and plyr:
db2006 <- subset (Data, year == 2006)
db2007 <- subset (Data, year == 2007)
result2006 <- as.data.frame(cprop(wtd.table(db2006$migrant, db2006$country, weights=db2006$wgt),total=FALSE))
result2007 <- as.data.frame(cprop(wtd.table(db2007$migrant, db2007$country, weights=db2007$wgt),total=FALSE))
result2006<-rename (result2006, c(Freq = "y2006"))
result2007<-rename (result2007, c(Freq = "y2007"))
result <- merge(result2006, result2007, by = c("Var1","Var2"))
In my real database I have 10 years, so repeating this code for every year takes a long time. Does anyone know a faster way to do it?
I also wish to calculate the share of women and men among migrant status by country and year. I am looking for something like :
Var1 Var2 Var3 y2006 y2007
Foreign born France men 52 55
Foreign born France women 48 45
Native France men 51 52
Native France women 49 48
Foreign born UK men 60 65
Foreign born UK women 40 35
Native UK men 48 50
Native UK women 52 50
Does anyone have an idea of how I can get these results?
You could do this by: making a function with the code you've already written; using lapply to iterate that function over all years in your data; then using Reduce and merge to collapse the resulting list into one data frame. Like this:
library(dplyr)      # filter() and %>%
library(questionr)  # wtd.table() and cprop()
# let's make your code into a function called 'tallyho'
tallyho <- function(yr, data) {
DF <- filter(data, year == yr)
result <- with(DF, as.data.frame(cprop(wtd.table(migrant, country, weights = wgt), total = FALSE)))
# rename the last column by year (use the function argument yr)
names(result)[length(names(result))] <- sprintf("y%s", yr)
return(result)
}
# now iterate that function over all years in your original data set, then
# use Reduce and merge to collapse the resulting list into a data frame
NewData <- lapply(unique(Data$year), function(x) tallyho(x, Data)) %>%
Reduce(function(...) merge(..., all = TRUE), .)
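A base R route that handles all years in one call is xtabs() with the weight on the left-hand side, followed by prop.table() over the grouping margins. The sketch below uses a small hand-made data set (not the random data from the question) so the percentages are easy to check by eye.

```r
# tiny deterministic stand-in for the survey data
Data <- data.frame(
  country = rep(c("France", "UK"), each = 4),
  migrant = rep(c("Native", "Foreign-born"), times = 4),
  year    = rep(c(2006, 2006, 2007, 2007), times = 2),
  wgt     = c(3, 1, 2, 2, 1, 1, 4, 6)
)

# weighted counts: the weights are summed within each cell
tab <- xtabs(wgt ~ migrant + country + year, data = Data)

# percentages of migrant status within each country-year combination
pct <- prop.table(tab, margin = c(2, 3)) * 100

# long data frame with columns migrant, country, year, Freq
res <- as.data.frame(pct)
res
```

For the second table in the question, the same pattern should apply with gender added to the formula, e.g. xtabs(wgt ~ gender + migrant + country + year) and margin = c(2, 3, 4), so that the shares of men and women sum to 100 within each migrant status, country and year.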

How to fulfill missing cells of a data frame in R?

I have a dataset like this.
df = data.frame( name= c("Tommy", "John", "Dan"), age = c(20, NA, NA) )
I tried to set 15 y.o. to John and Dan.
df[ ( df$age != 20) , ]$age = 15
But I got an error as follows,
Error in [<-.data.frame(tmp, (df$age != 20), , value = list(name = c(NA_integer_, : missing values are not allowed in subscripted assignments of data frames
What is a nice way to set new values to these missing cells?
If you want to modify all cells that are not 20, including other valid values for age, I would do the following:
# Creating a data frame with another valid age
df = data.frame( name= c("Tommy", "John", "Dan","Bob"), age = c(20, NA, NA,12) )
# Substitute values different than 20 for 15
df[df$age!=20 | is.na(df$age),"age"] <- 15
name age
1 Tommy 20
2 John 15
3 Dan 15
4 Bob 15
We can use is.na
library(data.table)
setDT(df)[is.na(age), age:= 15]
Try this:
df$age[is.na(df$age)] <- 15
or using your style of syntax:
df[is.na(df$age), ]$age = 15
The error you get arises because df$age != 20 produces the following:
[1] FALSE NA NA
Comparing NA with any value yields NA rather than TRUE or FALSE, so the NA ages are not treated as "not equal to 20", and a logical index that contains NA is not allowed when assigning into a data frame.
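The behaviour described above is easy to reproduce in a short session; the sketch below rebuilds the data frame from the question and shows the comparison result before applying the is.na() fix.

```r
df <- data.frame(name = c("Tommy", "John", "Dan"), age = c(20, NA, NA))

# comparison with NA propagates NA instead of returning TRUE
cmp <- df$age != 20
cmp

# which() drops the NAs, so nothing is selected by the original condition
idx <- which(cmp)

# test for missingness explicitly and assign
df$age[is.na(df$age)] <- 15
df
```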

Performing Operations on a Subset Using Data Table

I have a survey data set in wide form. For a particular question, a set of variables was created in the raw data to represent the fact that the survey question was asked in a particular month.
I wish to create a new set of variables that have month-invariant names; the value of these variables will correspond to the value of a month-variant question for the month observed.
Please see an example / fictitious data set:
require(data.table)
data <- data.table(month = rep(c('may', 'jun', 'jul'), each = 5),
may.q1 = rep(c('yes', 'no', 'yes'), each = 5),
jun.q1 = rep(c('breakfast', 'lunch', 'dinner'), each = 5),
jul.q1 = rep(c('oranges', 'apples', 'oranges'), each = 5),
may.q2 = rep(c('econ', 'math', 'science'), each = 5),
jun.q2 = rep(c('sunny', 'foggy', 'cloudy'), each = 5),
jul.q2 = rep(c('no rain', 'light mist', 'heavy rain'), each = 5))
In this survey, there are really only two questions: "q1" and "q2". Each of these questions is repeatedly asked for several months. However, the observation contains a valid response only if the month observed in the data matches up with the survey question for a particular month.
For example: "may.q1" is observed as "yes" for any observation in "May". I would like a new "Q1" variable to represent "may.q1", "jun.q1", and "jul.q1". The value of "Q1" will take on the value of "may.q1" when the month is "may", and the value of "Q1" will take on the value of "jun.q1" when the month is "jun".
If I were to try and do this by hand using data table, I would want something like:
mdata <- data[month == 'may', c('month', 'may.q1', 'may.q2'), with = F]
setnames(mdata, names(mdata), gsub('may\\.', '', names(mdata)))
I would want this repeated "by = month".
If I were to use the "plyr" package for a data frame, I would solve using the following approach:
require(plyr)
data <- data.frame(data)
mdata <- ddply(data, .(month), function(dfmo) {
dfmo <- dfmo[, c(1, grep(dfmo$month[1], names(dfmo)))]
names(dfmo) <- gsub(paste0(dfmo$month[1], '\\.'), '', names(dfmo))
return(dfmo)
})
Any help using a data.table method would be greatly appreciated, as my data are large. Thank you.
A different way to illustrate :
data[, .SD[,paste0(month,c(".q1",".q2")), with=FALSE], by=month]
month may.q1 may.q2
1: may yes econ
2: may yes econ
3: may yes econ
4: may yes econ
5: may yes econ
6: jun lunch foggy
7: jun lunch foggy
8: jun lunch foggy
9: jun lunch foggy
10: jun lunch foggy
11: jul oranges heavy rain
12: jul oranges heavy rain
13: jul oranges heavy rain
14: jul oranges heavy rain
15: jul oranges heavy rain
But note the column names come from the first group (can rename afterwards using setnames). And it may not be the most efficient if there are a great number of columns with only a few needed. In that case Arun's solution melting to long format should be faster.
Edit: Seems very inefficient on bigger data. Check out @MatthewDowle's answer for a really fast and neat solution.
Here's a solution using data.table.
dd <- melt.dt(data, id.var=c("month"))[month == gsub("\\..*$", "", ind)][,
ind := gsub("^.*\\.", "", ind)][, split(values, ind), by=list(month)]
The function melt.dt is a small function (still more improvements to be made) I wrote to melt a data.table, similar to the melt function in reshape2 (copy/paste the function shown below before trying out the code above).
melt.dt <- function(DT, id.var) {
stopifnot(inherits(DT, "data.table"))
measure.var <- setdiff(names(DT), id.var)
ind <- rep.int(measure.var, rep.int(nrow(DT), length(measure.var)))
m1 <- lapply(c("list", id.var), as.name)
m2 <- as.call(lapply(c("factor", "ind"), as.name))
m3 <- as.call(lapply(c("c", measure.var), as.name))
quoted <- as.call(c(m1, ind = m2, values = m3))
DT[, eval(quoted)]
}
The idea: first melt the data.table with id.var = "month". All the melted column names are then of the form month.question. By removing ".question" from the melted column "ind" and comparing what is left with the month column, we can drop all the unnecessary entries. Once that is done, we no longer need the "month." prefix in "ind", so we use gsub to remove it and keep just q1, q2, etc. After this, we have to reshape (or cast) the result. This is done by grouping by month and splitting the values column by ind (which holds either q1 or q2), so you get two columns for every month, which are then stitched together to give your desired output.
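Current data.table releases ship their own melt() and dcast() methods, so the same melt, filter, cast idea can be sketched without a hand-written helper. The example below is a sketch on a two-row stand-in for the survey data; the id column and the names long, q_month and question are introduced here for illustration only.

```r
library(data.table)

# small stand-in for the survey data in the question
data <- data.table(month  = c("may", "jun"),
                   may.q1 = c("yes", "no"),
                   jun.q1 = c("breakfast", "lunch"),
                   may.q2 = c("econ", "math"),
                   jun.q2 = c("sunny", "foggy"))

# row id so that each observation can be cast back to a single row
data[, id := .I]

# wide to long: one row per observation and month.question column
long <- melt(data, id.vars = c("id", "month"), variable.factor = FALSE)

# split "may.q1" into the month part and the question part
long[, c("q_month", "question") := tstrsplit(variable, ".", fixed = TRUE)]

# keep only the answers for the observed month, then cast questions to columns
res <- dcast(long[month == q_month], id + month ~ question, value.var = "value")
res
```

The result has one row per original observation with month-invariant columns q1 and q2.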
What about something like this
data <- data.table(
may.q1 = rep(c('yes', 'no', 'yes'), each = 5),
jun.q1 = rep(c('breakfast', 'lunch', 'dinner'), each = 5),
jul.q1 = rep(c('oranges', 'apples', 'oranges'), each = 5),
may.q2 = rep(c('econ', 'math', 'science'), each = 5),
jun.q2 = rep(c('sunny', 'foggy', 'cloudy'), each = 5),
jul.q2 = rep(c('no rain', 'light mist', 'heavy rain'), each = 5)
)
tmp <- reshape(data, direction = "long", varying = 1:6, sep = ".", timevar = "question")
str(tmp)
## Classes ‘data.table’ and 'data.frame': 30 obs. of 5 variables:
## $ question: chr "q1" "q1" "q1" "q1" ...
## $ may : chr "yes" "yes" "yes" "yes" ...
## $ jun : chr "breakfast" "breakfast" "breakfast" "breakfast" ...
## $ jul : chr "oranges" "oranges" "oranges" "oranges" ...
## $ id : int 1 2 3 4 5 6 7 8 9 10 ...
If you want to go further and melt this data again, you can use the reshape2 package:
require(reshape2)
## remove the id column if you want (id is the last col so ncol(tmp))
res <- melt(tmp[,-ncol(tmp), with = FALSE], measure.vars = c("may", "jun", "jul"), value.name = "response", variable.name = "month")
str(res)
## 'data.frame': 90 obs. of 3 variables:
## $ question: chr "q1" "q1" "q1" "q1" ...
## $ month : Factor w/ 3 levels "may","jun","jul": 1 1 1 1 1 1 1 1 1 1 ...
## $ response: chr "yes" "yes" "yes" "yes" ...
