I'm having an issue with the sdTrim function, which had previously run perfectly.
I have a dataframe (= new_data) containing the variables participant, condition, rt, and accuracy.
There are 8 different conditions: FA_1, HIT_1, ..., FA_4, HIT_4
I wanted to trim the reaction times and calculate a mean per participant and per condition. I used the following code:
trimmedData <- sdTrim(new_data, minRT = 150, sd = 2, pptVar = "participant", condVar = "condition", rtVar = "rt", accVar = "accuracy", perParticipant = TRUE, returnType = "mean")
This used to work fine, but suddenly my condition variable is no longer recognized as such: instead of 8 conditions, everything is collapsed into one.
What seems to be the issue here?
I tried different combinations of perCondition = TRUE / FALSE etc., which did not change anything.
The participant and condition variables are characters; rt is numeric.
As far as I can tell, the problem is with your data, not with your code. The example data you posted only has one row per participant/condition at most; there isn't a FA_3 or FA_4 for participant 988. If your real data doesn't have enough data for each combination of participant and conditions, then it looks like sdTrim just averages by participant.
I'm unfamiliar with reaction time data, but you might be able to accomplish what you're looking for using group_by and summarize from dplyr.
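For instance, here is a minimal dplyr sketch of that idea (an assumption on my part about the trimming you want: drop RTs under 150 ms, then drop RTs more than 2 SDs above each cell's mean before averaging):
library(dplyr)
new_data %>%
  filter(rt > 150) %>%                   # absolute minimum cutoff
  group_by(participant, condition) %>%
  filter(rt < mean(rt) + 2 * sd(rt)) %>% # per-cell SD cutoff; needs several RTs per cell for sd() to be defined
  summarise(mean_rt = mean(rt), .groups = "drop")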
Below is an example with a larger dataset based on your example data.
library(trimr)
set.seed(123)
participant <- c(rep("1", 100), rep("2", 100), rep("3", 100))
accuracy <- sample(x = c("1", "0"), size = 300, replace = TRUE, prob = c(.9, .1))
condition <- sample(x = c("hit_1", "FA_1", "hit_2", "FA_2", "hit_3", "FA_3", "FA_4", "hit_4", "hit_1", "FA_1", "hit_2", "FA_2", "hit_3", "hit_4"), size = 300, replace = TRUE)
rt <- sample(x = 250:625, size = 300)
new_data <- data.frame(participant, accuracy, condition, rt)
trimmedData <- sdTrim(data = new_data,
minRT = 150,
sd = 2,
pptVar = "participant",
condVar = "condition",
rtVar = "rt",
accVar = "accuracy",
perParticipant = TRUE,
returnType = "mean")
print(trimmedData)
participant FA_1 hit_1 hit_3 hit_2 FA_4 FA_2 FA_3 hit_4
1 1 439.800 477.250 433.85 440.375 426.286 439.500 508.8 457.429
2 2 477.067 489.933 466.50 360.000 405.000 387.533 427.2 428.364
3 3 398.333 446.500 438.00 362.077 445.000 432.333 419.2 497.125
Update (1/23/23)
In both your original and your updated datasets, you simply don't have enough values per condition to properly use sdTrim() with both perParticipant = TRUE and perCondition = TRUE (perCondition is automatically set to TRUE if you don't specify it).
Here is a link to the sdTrim() function on GitHub. Start looking at line 545, which describes what happens when both perParticipant and perCondition are TRUE.
Part of this function involves taking the standard deviation of the data for each combination of participant and condition. If you only have one value for each combination of participant and condition, your standard deviation value will be NA. See the below example of just using participant 988 and condition hit_4. Once your standard deviation is NA, NA's just follow after that.
You either need a larger dataset with more values for each combination of participant and condition or you need to set perParticipant and perCondition to both be FALSE. If you do the second option, you will have two NaN values because those values fall under the minRT threshold that you set. However, you can avoid that by also doing returnType = "raw".
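For reference, a sketch of that second option, with the argument names from your original call:
trimmedData <- sdTrim(new_data, minRT = 150, sd = 2,
                      pptVar = "participant", condVar = "condition",
                      rtVar = "rt", accVar = "accuracy",
                      perCondition = FALSE, perParticipant = FALSE,
                      returnType = "raw")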
new_data <- structure(list(participant = c("986", "986", "986", "986", "986", "986", "986", "986", "988", "988", "988", "988", "988", "988", "988", "988"), accuracy = c("1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1"), condition = c("hit_1", "FA_1", "hit_2", "FA_2", "hit_3", "FA_3", "FA_4", "hit_4", "hit_1", "FA_1", "hit_2", "FA_2", "hit_3", "hit_4", "FA_3", "FA_4"), rt = c(638, 286, 348, 310, 404, 301, 216, 534, 348, 276, 256, 293, 495, 438, 73, 73)), row.names = c(NA, -16L), class = "data.frame")
stDev <- 2
minRT <- 150
# get the list of participant numbers
participant <- unique(new_data$participant)
# get the list of experimental conditions
conditionList <- unique(new_data$condition)
# trim the data
trimmedData <- new_data[new_data$rt > minRT, ]
# ready the final data set
finalData <- as.data.frame(matrix(0, nrow = length(participant), ncol = length(conditionList)))
# give the columns the condition names
colnames(finalData) <- conditionList
# add the participant column
finalData <- cbind(participant, finalData)
# convert to data frame
finalData <- data.frame(finalData)
# initialise looping variables for subjects and conditions
i <- 1
j <- 2
# take apart the loop
# focus on participant 988, condition hit_4
currSub <- "988"
currCond <- "hit_4"
# get relevant data
tempData <- trimmedData[trimmedData$participant == currSub & trimmedData$condition == currCond, ]
# find the cutoff
curMean <- mean(tempData$rt)
print(curMean)
[1] 438
curSD <- sd(tempData$rt)
print(curSD) # <- here is where the NA values start
[1] NA
curCutoff <- curMean + (stDev * curSD)
# trim the data
curData <- tempData[tempData$rt < curCutoff, ]
# find the average, and add to the data frame
finalData[i, j] <- round(mean(curData$rt))
head(finalData)
participant hit_1 FA_1 hit_2 FA_2 hit_3 FA_3 FA_4 hit_4
1 986 NA 0 0 0 0 0 0 0
2 988 0 0 0 0 0 0 0 0
I have a data frame called "ref" that contains information for mapping gene entrez IDs to each gene's start and end positions. I have another data frame "ori_data" where each row contains a unique mutation from a sample, which gives a genomic position. I am trying to map each position given in "ori_data" to the information in "ref" in order to assign an entrez ID to each mutation.
I have tried a for loop that matches on chromosome and then selects positions in "ori_data" that fall between the coordinates in "ref", but I have not been successful. The "ori_data" dataset is over 1 million rows, so I'm not sure a for loop is an efficient solution. Note that many positions will map to the same entrez ID in my real dataset. "Final" is what I want to happen: it would just add a column for entrezID according to chromosome/position. TYIA!
ref = data.frame("EntrezID" = c(1, 10, 100, 1000), "Chromosome" = c("19", "8", "20", "18"), "txStarts" = c("58345182", "18391281", "44619518", "27950965"), "txEnds" = c("58353492", "18401215", "44651758", "28177130"))
ori_data = data.frame("Chromosome" = c("19", "8", "20", "18"), "Pos" = c("58345186", "18401213", "44619519", "27950966"),
"Sample" = c("HCC1", "HCC2", "HCC1", "HCC3"))
final = data.frame("Chromosome" = c("19", "8", "20", "18"), "Pos" = c("58345186", "18401213", "44619519", "27950966"),
"Sample" = c("HCC1", "HCC2", "HCC1", "HCC3"), "EntrezID" = c(1,10,100,1000))
I have tried this line of code and I'm unsure as to why it does not work.
for (i in 1:dim(ori_data)[1]) {
  for (j in 1:dim(ref)[1]) {
    ID = which(ori_data[i, "Chromosome"] == ref[j, "Chromosome"])
    if (length(ID) > 0) {
      Pos = ori_data[ID, "POS"]
      IDj = which(Pos >= ref[j, "txStarts"] & Pos <= ref[j, "txEnds"])
      print(IDj)
      if (length(IDj) > 0) {
        ori_data = cbind("Entrez" = ref[IDj, "EntrezID"], ori_data)
      }
    }
  }
}
In base R, apply could be used to find matches per row on Chromosome and test whether Pos falls between txStarts and txEnds.
ori_data$EntrezID <- apply(ori_data[c("Chromosome", "Pos")], 1, \(x)
ref$EntrezID[ref$Chromosome == x["Chromosome"] &
x["Pos"] >= ref$txStarts & x["Pos"] <= ref$txEnds][1])
ori_data
# Chromosome Pos Sample EntrezID
#1 19 58345186 HCC1 1
#2 8 18401213 HCC2 10
#3 20 44619519 HCC1 100
#4 18 27950966 HCC3 1000
A version which could be faster:
lup <- list2env(split(ref[c("EntrezID", "txStarts", "txEnds")], ref$Chromosome))
ori_data$EntrezID <- Map(\(x, y) {
. <- get(x, envir=lup)
.$EntrezID[y >= .$txStarts & y <= .$txEnds][1]
}, ori_data$Chromosome, ori_data$Pos)
Or another way but not keeping the original order. (If original order is important, have a look at unsplit.)
#Assuming you have many rows with the same Chromosome
x <- split(ori_data, ori_data$Chromosome)
#Assuming ref also has many rows with the same Chromosome
lup <- split(ref[c("EntrezID", "txStarts", "txEnds")], ref$Chromosome)
#Now sort this by the names of x - try which method is faster
#Method 1:
lup <- lup[names(x)]
#Method 2:
lup <- mget(names(x), list2env(lup))
res <- do.call(rbind, Map(\(a, b) {
cbind(a, b[1][a$Pos >= b[[2]] & a$Pos <= b[[3]]][1])
}, x, lup))
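For completeness, a hedged sketch of the unsplit route mentioned above, writing the ID back into each piece so the original row order can be restored (this assumes Pos/txStarts/txEnds have been converted to numeric, as in the data section at the end):
x <- split(ori_data, ori_data$Chromosome)
lup <- split(ref[c("EntrezID", "txStarts", "txEnds")], ref$Chromosome)[names(x)]
pieces <- Map(\(a, b) {
  # one lookup per row of the piece; NA if no range matches
  a$EntrezID <- vapply(a$Pos,
                       \(p) b$EntrezID[p >= b$txStarts & p <= b$txEnds][1],
                       numeric(1))
  a
}, x, lup)
res <- unsplit(pieces, ori_data$Chromosome)  # back in the original order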
One option would be to use sqldf, which should also be efficient for a large dataframe (dna here is the question's ori_data after type conversion; see the data section at the end).
library(tibble)
library(sqldf)
as_tibble(sqldf("select dna.*, ref.EntrezID from dna
join ref on dna.Pos > ref.'txStarts' and
dna.Pos < ref.'txEnds'"))
Another option using fuzzy_join:
library(dplyr)
library(fuzzyjoin)
dna %>%
fuzzy_join(ref %>% select(-Chromosome), by = c("Pos" = "txStarts", "Pos" = "txEnds"),
match_fun = list(`>`, `<`)) %>%
select(names(dna), EntrezID)
Output
Chromosome Pos Sample EntrezID
1 19 58345186 HCC1 1
2 8 18401213 HCC2 10
3 20 44619519 HCC1 100
4 18 27950966 HCC3 1000
If the 'Pos', 'txStarts', 'txEnds' are numeric, then we can use non-equi join
library(data.table)
setDT(dna)[ref, EntrezID := i.EntrezID,
on = .(Chromosome, Pos > txStarts, Pos <txEnds)]
Output
> dna
Chromosome Pos Sample EntrezID
<char> <num> <char> <num>
1: 19 58345186 HCC1 1
2: 8 18401213 HCC2 10
3: 20 44619519 HCC1 100
4: 18 27950966 HCC3 1000
data
dna <- type.convert(dna, as.is = TRUE)
ref <- type.convert(ref, as.is = TRUE)
I don't understand spatial data at all. I have been studying, but I'm missing something.
What I have: data.frame enterprises with the columns: id, parent_subsidiary, city_cod.
What I need: the mean and the max distance from the parent's city to the subsidiary cities.
Ex:
id | mean_dist | max_dist
1111 | 25km | 50km
232 | 110km | 180km
333 | 0km | 0km
What I did :
library("tidyverse")
library("sf")
# library("brazilmaps") not working anymore
library("geobr")
parent <- enterprises %>% filter(parent_subsidiary==1)
subsidiary <- enterprises %>% filter(parent_subsidiary==2)
# Cities - polygons
m_city_br <- read_municipality(code_muni="all", year=2019)
# or shp_city<- st_read("/BR_Municipios_2019.shp")
# data.frame with the column geom
map_parent <- left_join(parent, m_city_br, by=c("city_cod"="code_muni"))
map_subsidiary <- left_join(subsidiary, m_city_br, by=c("city_cod"="code_muni"))
st_distance(map_parent$geom[1],map_subsidiary$geom[2]) %>% units::set_units(km)
# it took a long time and the result is different from google.maps
# is it ok?!
# To do it by ID -- I'm also stuck here
distance_p_s <- data.frame(id=as.numeric(), subsidiary=as.numeric(), mean_dist=as.numeric(), max_dist=as.numeric())
id_v <- as.vector(parent$id)
for (i in 1:length(id_v)){
test_p <- map_parent %>% filter(id==id_v[i])
test_s <- map_subsidiary %>% filter(id==id_v[i])
total <- 0
value <- 0
max <- 0
l <- 0
l <- nrow(test_s)
for (j in 1:l){
value <- as.numeric(round(st_distance(test_p$geom[1],test_s$geom[j]) %>% units::set_units(km),2))
total <- total + value
ifelse(value>max,max<-value,NA)
}
mean_dist <- total/l
done <- data.frame(id=id_v[i], subsidiary=l, mean_dist=round(mean_dist,2), max_dist=max)
distance_p_s <- rbind(distance_p_s,done)
rm(done)
}
Is it right?
Can I calculate the centroid of the cities and then calculate the distance?
I realized that the distance from code_muni==4111407 to code_muni==4110102 is 0, but they are different cities (Imbituva, PR, Brazil and Ivaí, PR, Brazil). Why?
Data example: structure(list(id = c("1111", "1111", "1111", "1111", "232", "232", "232", "232", "3123", "3123", "4455", "4455", "686", "333", "333", "14112", "14112", "14112", "3633", "3633"), parent_subsidiary = c("1","2", "2", "2", "1", "2", "2", "2", "1", "2", "1", "2", "1", "2", "1", "1", "2", "2", "1", "2"), city_cod = c(4305801L,4202404L, 4314803L, 4314902L, 4318705L, 1303403L, 4304507L, 4314100L, 2408102L, 3144409L, 5208707L, 4205407L, 5210000L, 3203908L, 3518800L, 3118601L, 4217303L, 3118601L, 5003702L, 5205109L)), row.names = c(NA, 20L), class = "data.frame")
PS: these are Brazilian cities:
https://github.com/ipeaGIT/geobr/tree/master/r-package
Great problem. I looked at it for a little while, then came back and looked some more after thinking about it. The mean was not calculated; only the distances were determined from each parent to its subsidiaries.
The cities data and the data frame data were combined, and the new df was mutated to add the centroid data for each point on the surface.
The df was split by id, which resulted in a list of 8 df's, each containing a separate parent with its related subsidiaries (1:4, 1:3, 1:4, 1:2, ...).
A loop with a function cleaned up the 8 df's and calculated the distance from each parent to each subsidiary.
I checked the distances of the first df in the list against values from a website; the distances of df1 were nearly identical.
The output is shown at [link]
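A rough sketch of that pipeline (my own reconstruction; it assumes the question's column names, one parent row per id, and the m_city_br object from geobr::read_municipality):
library(sf)
# join the city polygons to the enterprises and reduce each city to a centroid
ent <- merge(enterprises, m_city_br, by.x = "city_cod", by.y = "code_muni")
ent <- st_centroid(st_as_sf(ent))
# per id: distances from the single parent to all of its subsidiaries
res <- do.call(rbind, lapply(split(ent, ent$id), function(g) {
  p <- g[g$parent_subsidiary == 1, ]
  s <- g[g$parent_subsidiary == 2, ]
  d <- as.numeric(units::set_units(st_distance(p, s), km))
  data.frame(id = g$id[1], subsidiaries = nrow(s),
             mean_dist = round(mean(d), 2), max_dist = round(max(d), 2))
}))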
I did something like this:
distance_p_s <- data.frame(id=as.character(),
qtd_subsidiary=as.numeric(),
dist_min=as.numeric(),
dist_media=as.numeric(),
dist_max=as.numeric())
id <- as.vector(mparentid$id)
for (i in 1:length(id)){
print(sprintf("Filtering id: %s (%d of %d)", id[i], i, length(id)))
teste_m <- mparentid %>% filter(id==id[i]) %>% st_as_sf()
teste_f <- msubsidiaryid %>% filter(id==id[i]) %>% st_as_sf()
teste_f <- st_centroid(teste_f)
teste_m <- st_centroid(teste_m)
teste_f = st_transform(teste_f, 4674)
teste_m = st_transform(teste_m, 4674)
total <- 0
value <- 0
min <- 0
max <- 0
l <- 0
l <- nrow(teste_f)
for (j in 1:l){
print(sprintf("Processing id: %s (%d of %d), subsidiary: %d of %d", id[i], i, length(id), j, l))
value <- as.numeric(round(st_distance(teste_m$geom[1],teste_f$geom[j]) %>% units::set_units(km),2))
total <- total + value
ifelse(value>max,max<-value,NA)
if(j==1){
min<-value
} else {
ifelse(value<min,min<-value,NA)}
}
dist_med <- total/l
done <- data.frame(id=id[i],qtd_subsidiary=l,dist_min=min,dist_media=round(dist_med,2),dist_max=max)
distance_p_s <- rbind(distance_p_s,done)
print(sprintf("Completed id: %s (%d of %d), subsidiary: %d of %d", id[i], i, length(id), j, l))
rm(done)
}
Probably this is not the best way, but it solved my problem for now.
I want to replace NA values in my xts object with formula Beta * Exposure * Index return.
Suppose my xts object is Position_SimPnl, created below:
library(xts)
df1 <- data.frame(Google = c(NA, NA, NA, NA, 500, 600, 700, 800),
Apple = c(10, 20,30,40,50,60,70,80),
Audi = c(1,2,3,4,5,6,7,8),
BMW = c(NA, NA, NA, NA, NA, 6000,7000,8000),
AENA = c(50,51,52,53,54,55,56,57))
Position_SimPnl <- xts(df1, order.by = Sys.Date() - 1:8)
For Beta there is a specific dataframe:
Beta_table <- data.frame (AENA = c(0.3,0.5,0.6), Apple = c(0.2,0.5,0.8), Google = c(0.1,0.3,0.5), Audi = c(0.4,0.6,0.7), AXP = c(0.5,0.7, 0.9), BMW = c(0.3,0.4, 0.5))
rownames(Beta_table) <- c(".SPX", ".FTSE", ".STOXX")
For exposure there is another dataframe:
Base <- data.frame (RIC = c("AENA","BMW","Apple","Audi","Google"), Exposure = c(100,200,300,400,500))
For Index return there is a xts object (Index_FX_Returns):
df2 <- data.frame(.SPX = c(0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08),
.FTSE = c(0.5, 0.4,0.3,0.2,0.3,0.4,0.3,0.4),
.STOXX = c(0.15,0.25,0.35,0.3,0.45,0.55,0.65,0.5))
Index_FX_Returns <- xts(df2,order.by = Sys.Date() - 1:8)
Also there is a dataframe which links RIC with Index:
RIC_Curr_Ind <- data.frame(RIC = c("AENA", "Apple", "Google", "Audi", "BMW"), Currency = c("EUR.","USD.","USD.","EUR.","EUR."), Index = c(".STOXX",".SPX",".SPX",".FTSE",".FTSE"))
What I want: for a particular NA value in Position_SimPnl, look at its column name, get the corresponding index name from the RIC_Curr_Ind dataframe, and then look up the beta value in Beta_table by matching the column name (the column of the NA) against Beta_table's columns and the derived index name against its row names.
Then, by matching the column name from Position_SimPnl with the RIC column of the 'Base' dataframe, extract the corresponding exposure value.
Then, by matching the column name from Position_SimPnl with the RIC column of RIC_Curr_Ind, get the corresponding index name, look up that column in the xts object Index_FX_Returns, and take the return value for the same date as the NA.
After getting the Beta, Exposure and Index return values, I want the NA to be replaced by the formula Beta * Exposure * Index return. I also want only the NA values in Position_SimPnl to be replaced; the other values should stay as they were. I used the following code to replace the NA values:
do.call(merge, lapply(Position_SimPnl, function(y) {
  if (is.na(y)) {
    y = (Beta_table[match(RIC_Curr_Ind$Index[match(colnames(y), RIC_Curr_Ind$RIC)], rownames(Beta_table)),
                    match(colnames(y), colnames(Beta_table))]) *
        (Base$Exposure[match(colnames(y), Base$RIC)]) *
        (Index_FX_Returns[, RIC_Curr_Ind$Index[match(colnames(y), RIC_Curr_Ind$RIC)]])
  } else {
    y
  }
}))
However, in the output, if a particular column contains an NA, all the values in that column are replaced (including those which were not NA previously). I am also getting multiple warning messages like:
In if (is.na(y)) { ... : the condition has length > 1 and only the first element will be used
I think this is why all values of a column get transformed, including the non-NA ones. Can anyone suggest how to replace these NA values with the formula above while keeping the other values the same? Any help would be appreciated.
Because you need to combine all data sets to achieve your formula Beta * Exposure * Index, consider building a master data frame comprised of all needed components. However, you face two challenges:
different data types (xts objects and data frame)
different data formats (wide and long formats)
For proper merging and calculating, consider converting all data components into data frames and reshaping to long format (i.e., all but Base and RIC_Curr_Ind). Then, merge and calculate with ifelse to fill NA values. Of course, at the end, you will have to reshape back to wide and convert back to xts.
Reshape
# USER-DEFINED METHOD GIVEN THE MULTIPLE CALLS
proc_transpose <- function(df, col_pick, val_col, time_col) {
reshape(df,
varying = names(df)[col_pick],
times = names(df)[col_pick], ids = NULL,
v.names = val_col, timevar = time_col,
new.row.names = 1:1E4, direction = "long")
}
# POSITIONS
Position_SimPnl_wide_df <- data.frame(date = index(Position_SimPnl),
coredata(Position_SimPnl))
Position_SimPnl_long_df <- proc_transpose(Position_SimPnl_wide_df, col_pick = -1,
val_col = "Position", time_col = "RIC")
# BETA
Beta_table_long_df <- proc_transpose(transform(Beta_table, Index = row.names(Beta_table)),
col_pick = 1:ncol(Beta_table),
val_col = "Beta", time_col = "RIC")
# INDEX
Index_FX_Returns_wide_df <- data.frame(date = index(Index_FX_Returns),
coredata(Index_FX_Returns))
Index_FX_Returns_long_df <- proc_transpose(Index_FX_Returns_wide_df, col_pick = -1,
val_col = "Index_value", time_col = "Index")
Merge
# CHAIN MERGE
master_df <- Reduce(function(...) merge(..., by="RIC"),
list(Position_SimPnl_long_df,
Beta_table_long_df,
Base)
)
# ADDITIONAL MERGES (NOT INCLUDED IN ABOVE CHAIN DUE TO DIFFERENT by)
master_df <- merge(master_df,
Index_FX_Returns_long_df, by=c("Index", "date"))
master_df <- merge(master_df,
RIC_Curr_Ind, by=c("Index", "RIC"))
Calculation
# FORMULA: Beta * Exposure * Index
master_df$Position <- with(master_df, ifelse(is.na(Position),
Beta * Exposure * Index_value,
Position))
Final Preparation
# RE-ORDER ROWS AND SUBSET COLS
master_df <- data.frame(with(master_df, master_df[order(RIC, date),
c("date", "RIC", "Position")]),
row.names = NULL)
# RESHAPE WIDE (REVERSE OF ABOVE)
Position_SimPnl_new <- setNames(reshape(master_df, idvar = "date",
v.names = "Position", timevar = "RIC",
direction = "wide"),
c("date", unique(master_df$RIC)))
# CONVERT TO XTS
Position_SimPnl_new <- xts(transform(Position_SimPnl_new, date = NULL),
order.by = Position_SimPnl_new$date)
Position_SimPnl_new
# AENA Apple Audi BMW Google
# 2019-11-27 58 80 8 8000 800.0
# 2019-11-28 57 70 7 7000 700.0
# 2019-11-29 56 60 6 6000 600.0
# 2019-11-30 55 50 5 24 500.0
# 2019-12-01 54 40 4 16 2.0
# 2019-12-02 53 30 3 24 1.5
# 2019-12-03 52 20 2 32 1.0
# 2019-12-04 51 10 1 40 0.5
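For comparison, a hedged column-wise sketch of the same fill logic that stays in xts (it assumes both xts objects cover the same dates in the same order, as they do in the question's construction):
for (ric in colnames(Position_SimPnl)) {
  idx  <- as.character(RIC_Curr_Ind$Index[match(ric, RIC_Curr_Ind$RIC)])  # e.g. ".SPX"
  beta <- Beta_table[idx, ric]
  expo <- Base$Exposure[match(ric, Base$RIC)]
  nas  <- which(is.na(coredata(Position_SimPnl[, ric])))
  if (length(nas))
    Position_SimPnl[nas, ric] <- beta * expo * coredata(Index_FX_Returns[nas, idx])
}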
Based on the following data-frame, I would like to compute the rolling correlations (with a window size of 12):
library(rugarch)
library(rmgarch)
data(dji30retw)
Dat = dji30retw[, 1:8, drop = FALSE]
> dput(head(Dat))
structure(list(AA = c(-0.00595239852729524, 0.00595239852729524,
-0.0149479614358734, 0.0470675108579858, 0.0170944333593002,
0.0251059211310762), AXP = c(-0.00794285351393668, -0.0258495814613253,
-0.0265355536259657, -0.0359320092260634, -0.0555200763856309,
0.0254559199933486), BA = c(-0.00886642920564158, -0.0102302682508148,
-0.0142397228111357, -0.0237478178500363, -0.046456440823212,
-0.0590524317817008), BAC = c(-0.0311983708558615, 0, -0.0358461317731357,
-0.0258794479878207, -0.0304205967007118, -0.0116506172199752
), C = c(-0.0258635105899192, -0.0176216013498196, -0.0134230203321406,
-0.0944096844710748, -0.0352681388374579, 0.0203052661607457),
CAT = c(0.0158733491562901, 0.0411369055604894, -0.046400075604764,
-0.00794706169253204, -0.0106952891167477, 0.0369435151916841
), CVX = c(-0.0220481372217624, 0.0632438936600297, -0.0165791288029112,
-0.0340063679851951, -0.0287101058824313, 0.0112631922787107
), DD = c(0.00638979809877117, 0.0354573118367292, -0.0354573118367292,
0.00529381860971498, -0.031101702565588, -0.0198026272961791
)), .Names = c("AA", "AXP", "BA", "BAC", "C", "CAT", "CVX",
"DD"), row.names = c("1987-03-27", "1987-04-03", "1987-04-10",
"1987-04-17", "1987-04-24", "1987-05-01"), class = "data.frame")
And then after computing the rolling correlations, I would like to create a data-frame consisting of one column with the average correlation coefficient per time period T (in this case: per week).
Is there anyone out there that could help me out? I would really appreciate that!
Thanks in advance!
There are methods in R that are more tailored to time series analysis than the one I'm about to show. Here's a link.
This is a very inelegant solution. I've created my own data for the example:
library(dplyr)
#set seed
set.seed(123)
#initialize data frame
roll_corr <- data.frame(matrix(nrow = 365,ncol = 5))
names(roll_corr) <- c("date","week","sales1","sales2","corr")
#generate sequence of dates
roll_corr$date <- seq(as.Date("2000/01/01"), as.Date("2000/12/30"), by="day")
# calculate week number
roll_corr$week <- as.numeric(roll_corr$date - roll_corr$date[1]) %/% 7
#generate random variates for sales
roll_corr$sales1 <- rnorm(365,500,1000)
roll_corr$sales2 <- runif(365,1000,80000)
#calculate expanding-window (cumulative) correlation using a for loop
for(i in 1:365) {
roll_corr$corr[i] <- cor(roll_corr$sales1[1:i],roll_corr$sales2[1:i])
}
#use dplyr to group data by week and calculate average correlation
weekly_roll_corr <- roll_corr %>%
group_by(week) %>%
summarize(average = mean(corr,na.rm = TRUE))
head(weekly_roll_corr)
week average
1 0 0.1480184
2 1 -0.1008872
3 2 0.1265146
4 3 0.2481083
5 4 0.2518001
6 5 0.1892407
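For the actual window-of-12 request on the dji30retw data, here is a hedged sketch with zoo::rollapplyr, averaging the pairwise correlations in each window (the data is already weekly, so each value is the average pairwise correlation over the trailing 12 weeks):
library(zoo)
z <- zoo(Dat, order.by = as.Date(rownames(Dat)))
avg_roll_corr <- rollapplyr(z, width = 12, by.column = FALSE,
                            FUN = function(w) {
                              cm <- cor(w)             # 8 x 8 correlation matrix of the window
                              mean(cm[lower.tri(cm)])  # average pairwise correlation
                            })
head(avg_roll_corr)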
I am working on a problem. Here is an idea of what the original 60k row data frame looks like.
dataOne <- data.frame(
marketVal = c(NA, 543534, NA, 115435, NA),
bathrooms = c(3,3,2,3,5),
garageSqFt = c(400, 385, 454, 534, 210),
totalSqFT = c(NA, NA, 1231, 2232, 4564),
units = c(1, 1, 1, 1, 1),
subDivId = c("112", "111", "111", "111", "112"),
ID = c(4,56,67,94,130) )
Some of the NA's for market value have been retrieved and stored in a new
data frame that looks like so:
dataTwo <- data.frame(
marketVal = c(123123,234234),
IDTwo = c(4,67) )
str(dataTwo)
dataOne$marketVal <- dataTwo$marketVal[match(dataTwo$ID, dataOne$ID)]
Comparing IDs from both data frames, I am attempting to replace the NAs in the first data frame with the market values from the second. I've tried the match function as follows:
dataOne$marketValue <- dataTwo$marketValue[match(dataOne$ID, dataTwo$ID)]
but receive the error "replacement has 2 rows, data has 5". I figured the fact that these two data frames are not the same size wouldn't matter, as we are only comparing the IDs found in either. How can I accomplish this efficiently, considering around 4500 NAs need to be updated?
Your method isn't working because it is producing a vector with 5 values (1 NA 2 NA NA), which is longer than your dataTwo dataframe. Drop the NA values and your method would work.
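For instance, a small sketch of that fix, keeping the match() approach (using dataTwo's actual IDTwo column):
m <- match(dataOne$ID, dataTwo$IDTwo)
dataOne$marketVal[!is.na(m)] <- dataTwo$marketVal[m[!is.na(m)]]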
This is how I would do it:
rowMatch <- which(dataOne$ID %in% dataTwo$ID)
dataOne[rowMatch, ]$marketVal <- dataTwo$marketVal
(Please note your ID variables were actually IDOne and IDTwo respectively in the example you'd provided.)
You can use merge:
library(dplyr) # for coalesce()
new <- merge(dataOne, dataTwo, by.x = 'ID', by.y = 'IDTwo', all.x = TRUE)
new$marketVal <- with(new, coalesce(marketVal.x, marketVal.y))
We could use safe_left_join from my package safejoin, and "patch"
the matches from the rhs into the lhs when columns conflict.
# devtools::install_github("moodymudskipper/safejoin")
library(safejoin)
library(dplyr)
dataOne <- data.frame(
marketVal = c(NA, 543534, NA, 115435, NA),
bathrooms = c(3,3,2,3,5),
garageSqFt = c(400, 385, 454, 534, 210),
totalSqFT = c(NA, NA, 1231, 2232, 4564),
units = c(1, 1, 1, 1, 1),
subDivId = c("112", "111", "111", "111", "112"),
ID = c(4,56,67,94,130) )
dataTwo <- data.frame(
marketVal = c(123123,234234),
IDTwo = c(4,67) )
safe_left_join(dataOne, dataTwo, by=c(ID= "IDTwo"), conflict = "patch")
# marketVal bathrooms garageSqFt totalSqFT units subDivId ID
# 1 123123 3 400 NA 1 112 4
# 2 543534 3 385 NA 1 111 56
# 3 234234 2 454 1231 1 111 67
# 4 115435 3 534 2232 1 111 94
# 5 NA 5 210 4564 1 112 130
Or, for the same effect in this case, we can use dplyr::coalesce:
library(dplyr)
safe_left_join(dataOne, dataTwo, by=c(ID= "IDTwo"), conflict = coalesce)