This is the code I am trying to run, and it's taking a while.
districts is a data frame of 39299 rows and 16 columns, and lm_data is a data frame of 59804 rows and 16 variables. I want to set up a new variable in lm_data called tentativeStartDate, which takes on the value of districts$firstDay[j] if a couple of conditions are met. Is there a more efficient way to do this?
for (i in 1:nrow(lm_data)) {
  for (j in 1:nrow(districts)) {
    if (lm_data$DISTORGID[i] == districts$DISTORGID[j] & lm_data$gradeCode[i] == districts$gradeCode[j]) {
      lm_data$tentativeStartDate[i] <- districts$firstDay[j]
    }
  }
}
I can't test this, but a vectorised approach should be much faster. Note that a plain which(lm_data$DISTORGID == districts$DISTORGID & ...) will not work here: the two data frames have different numbers of rows, so element-wise == would recycle rather than compare every pair. Matching each lm_data row to a district on a composite key handles this:
# build a composite key for each data frame, then look up each lm_data row in districts
lm_key <- paste(lm_data$DISTORGID, lm_data$gradeCode)
district_key <- paste(districts$DISTORGID, districts$gradeCode)
lm_data$tentativeStartDate <- districts$firstDay[match(lm_key, district_key)]
Rows with no matching district get NA.
I am new to R, and I have a question about an assignment: loop through all the combinations of unique days and unique individuals in the activity_budget dataset; for each iteration of the inner loop, subset on the current values of day and individual, calculate the mean time value for this subset, and store it in a vector called my_vector.
I wrote a bunch of code but received an error. Thank you in advance.
setwd("C:/ /")
activity_budget <- read.csv("activity_budget.csv")
getwd()
str(activity_budget)
head(activity_budget)
my_vector <- NULL
for (i in unique(activity_budget$day)) {
  for (j in unique(activity_budget$individual)) {
    subset_data <- subset(activity_budget, activity_budget$day == i & activity_budget$individual == j)
    # this is the line that errors: the inner subset() call is malformed,
    # and my_vector is overwritten on every iteration
    my_vector <- mean(activity_budget$time[subset(activity_budget, activity_budget$day & activity_budget$individual)], na.rm = TRUE)
  }
}
To debug, I reset my_vector and ran each piece on its own:
my_vector <- NULL
unique(activity_budget$day)
unique(activity_budget$individual)
unique(activity_budget$time)
mean(activity_budget$time)
activity_budget$day == i & activity_budget$individual == j
The following version works: it builds a logical index for the subset and uses a counter to store each mean in the next position of my_vector.
my_vector <- NULL
index <- 0
for (i in unique(activity_budget$day)) {
  for (j in unique(activity_budget$individual)) {
    subset_data <- activity_budget$day == i & activity_budget$individual == j
    index <- index + 1
    my_vector[index] <- mean(activity_budget$time[subset_data], na.rm = TRUE)
  }
}
my_vector
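For what it's worth, the same per-day, per-individual means can also be computed without explicit loops. A hedged alternative, assuming the columns are named day, individual, and time as above; note it only returns combinations that actually occur in the data, whereas the loop produces NaN for empty combinations:
# one row per observed day/individual combination, with the mean time
aggregate(time ~ day + individual, data = activity_budget, FUN = mean, na.rm = TRUE)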
I have a data frame (call it 'ModelOutput') with three columns (Trial, DurationRet, DiscountRate) and another (call it 'drdata') with three columns (Scenario, variable, value).
I want to find the rows where drdata$Scenario == ModelOutput$Trial & drdata$variable == ModelOutput$DurationRet and pull the matching drdata$value into the ModelOutput$DiscountRate column. Is there a way to do this efficiently?
Here are my two attempts, the first of which fails and the second of which is entirely too slow.
ModelOutput$Trial <- drdata[drdata$Scenario == ModelOutput$Trial & drdata$variable == ModelOutput$DurationRet, "value"]

foreach(row = 1:nrow(ModelOutput)) %do% {
  ModelOutput[row, "DiscountRate"] <- drdata[drdata$Scenario == ModelOutput[row, "Trial"] &
                                             drdata$variable == as.factor(ModelOutput[row, "DurationRet"] + 1), "value"]
}
It took me a minute, but I realized joins could do the job I was looking for.
Here is my final code:
library(dplyr)
ModelOutput <- ModelOutput %>% full_join(drdata, by = c(Trial = "Scenario", DurationRet = "variable"))
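If the joined values should end up in the DiscountRate column itself, a hedged follow-up (it assumes the joined column keeps the name value after the join):
ModelOutput <- ModelOutput %>%
  mutate(DiscountRate = value) %>%  # copy the matched values into DiscountRate
  select(-value)                    # drop the helper column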
I've got an R assignment in which I have to add a column to my matrix. It's about dates (time zones), and I use the dplyr and lubridate libraries.
Starting from the table below, I want to add an OlsonName column based on the State column (e.g. NSW -> Australia/NSW):
  Event.ID Database        Date.Time Nearest.town State       *OlsonName*
1    20812     Wind 23/11/1975 07:00       SYDNEY   NSW   *Australia/NSW*
2    20813  Tornado 02/12/1975 14:00       BARHAM   NSW   *Australia/NSW*
I implement that with a function and a loop:
#function
addOlsonNames <- function(aussieState, aussieTown) {
  if (aussieState == "NSW") {
    if (aussieTown == "BROKEN HILL") {
      value <- "Australia/Broken_Hill"
    } else {
      value <- "Australia/NSW"
    }
  } else if (aussieState == "QLD") {
    value <- "Australia/Queensland"
  } else if (aussieState == "NT") {
    value <- "Australia/North"
  } else if (aussieState == "SA") {
    value <- "Australia/South"
  } else if (aussieState == "TAS") {
    value <- "Australia/Tasmania"
  } else if (aussieState == "VIC") {
    value <- "Australia/Victoria"
  } else if (aussieState == "WA") {
    value <- "Australia/West"
  } else if (aussieState == "ACT") {
    value <- "Australia/ACT"
  } else {
    value <- "NAN"
  }
  return(value)
}
#loop
for (i in 1:nrow(aussieStorms)) {
  aussieStorms$OlsonName[i] <- addOlsonNames(State[i], Nearest.town[i])
}
Most of the instances are classified correctly, as in my table above, but some of the instances are misclassified (e.g. State TAS -> OlsonName Australia/West, although I also have some State TAS -> OlsonName Australia/Tasmania).
Seems strange to me. What might be the issue?
Update:
I also tried mutate(), and here's what I got:
aus1 <- mutate(aussieStorms,OlsonXYZ = addOlsonNames(State,Nearest.town))
Warning messages:
1: In if (aussieState == "NSW") { :
the condition has length > 1 and only the first element will be used
2: In if (aussieTown == "BROKEN HILL") { :
the condition has length > 1 and only the first element will be used
If Ben Bolker's comment is right, then the problem is here:
for (i in 1:nrow(aussieStorms)) {
  aussieStorms$OlsonName[i] <- addOlsonNames(State[i], Nearest.town[i])
}
in that the values passed to addOlsonNames are not coming from rows of the aussieStorms data frame. If R isn't giving an error, then it must be getting State[i] from another object called State in your R workspace. Similarly for Nearest.town. If those objects aren't the same as the ones in your aussieStorms data frame, that would explain the apparent misclassification.
[It's also possible that you've used attach on a data frame at some point, and State is being taken from that. But attaching data frames is a bad idea, as you can see here...]
Ben's solution, i.e. making them aussieStorms$State and aussieStorms$Nearest.town, looks good to me.
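For completeness, a hedged sketch of a loop-free alternative: build a named lookup vector from state to Olson name, then handle the Broken Hill special case with a vectorised replacement (column names as in the question):
olson <- c(NSW = "Australia/NSW", QLD = "Australia/Queensland", NT = "Australia/North",
           SA = "Australia/South", TAS = "Australia/Tasmania", VIC = "Australia/Victoria",
           WA = "Australia/West", ACT = "Australia/ACT")
# index the lookup vector by state; as.character() guards against factor columns
aussieStorms$OlsonName <- unname(olson[as.character(aussieStorms$State)])
aussieStorms$OlsonName[aussieStorms$State == "NSW" &
                       aussieStorms$Nearest.town == "BROKEN HILL"] <- "Australia/Broken_Hill"
aussieStorms$OlsonName[is.na(aussieStorms$OlsonName)] <- "NAN" # unmatched states, as in the original function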
I'm attempting to read in a few hundred thousand JSON files and eventually get them into a dplyr object. The JSON files are not a simple key-value parse, and they require a lot of pre-processing. The pre-processing is coded and performs fairly well. The challenge I am having is loading each record into a single object (a data.table or dplyr object) efficiently.
This is very sparse data: I'll have over 2000 variables that will mostly be missing. Each record will have maybe a hundred variables set. The variables will be a mix of character, logical, and numeric, and I do know the mode of each variable.
I thought the best way to avoid R copying the object on every update (or adding one row at a time) would be to create an empty data frame and then update the specific fields after they are pulled from the JSON file. But doing this with a data frame is extremely slow; moving to a data.table is much better, but I'm still hoping to reduce the run time to minutes instead of hours. See my example below:
library(data.table)

timeMe <- function() {
  set.seed(1)
  names <- paste0("A", 1:1200)

  # try with a data frame
  # outdf <- data.frame(matrix(NA, nrow = 100, ncol = 1200, dimnames = list(NULL, names)))
  # try with a data table
  outdf <- data.table(matrix(NA, nrow = 100, ncol = 1200, dimnames = list(NULL, names)))

  for (i in seq(100)) {
    # generate 100 columns (real data is in json)
    sparse.cols <- sample(1200, 100)
    # each record comes in as a list;
    # each column is either character, logical, or numeric
    sparse.val <- lapply(sparse.cols, function(i) {
      if (i < 401) {          # logical
        sample(c(TRUE, FALSE), 1)
      } else if (i < 801) {   # numeric
        sample(seq(10), 1)
      } else {                # character
        sample(LETTERS, 1)
      }
    }) # now we have a list with values to populate
    names(sparse.val) <- paste0("A", sparse.cols)

    # and here is the challenge and what takes a long time:
    # assign the ith row and the named column with each value
    for (x in names(sparse.val)) {
      val <- sparse.val[[x]]
      # this is where the bottleneck is.
      # for data frame
      # outdf[i, x] <- val
      # for data table
      outdf[i, x := val]
    }
  }
  outdf
}
I thought the mode of each column might be getting set and reset with each update, but I have also tried pre-setting each column type and it didn't help.
For me, running this example with a data.frame (commented out above) takes around 22 seconds; converting to a data.table brings it down to 5 seconds. I was hoping someone knew what was going on under the covers and could provide a faster way to populate the data.table here.
I can follow your code except for the part where you construct sparse.val. There are minor errors in the way you assign columns. Don't forget to check that the answer is right when trying to optimise :).
First, the creation of the data.table:
Since you say that you already know the types of the columns, it's important to generate the correct types up front. Otherwise, when you do DT[, LHS := RHS] and the type of RHS is not equal to that of LHS, RHS will be coerced to the type of LHS. In your case, all your numeric and character values would be converted to logical, as all the columns are of logical type. This is not what you want.
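A quick illustration of that pitfall (a hedged sketch; the exact behaviour varies across data.table versions, older releases coerce the RHS with a warning, while newer ones may promote the column type instead):
library(data.table)
DT <- data.table(a = rep(NA, 3)) # NA without a suffix is logical, so 'a' is a logical column
DT[2, a := 3.14]                 # sub-assigning a double into a logical column
DT$a                             # on older versions: NA TRUE NA, the 3.14 was coerced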
Creating a matrix therefore won't help (all columns would be of the same type), and it's also slow. Instead, I'd do it like this:
rows <- 100L
cols <- 1200L
outdf <- setDT(lapply(seq_len(cols), function(i) {
  if (i < 401L) rep(NA, rows)
  else if (i >= 402L & i < 801L) rep(NA_real_, rows)
  else rep(NA_character_, rows)
}))
Now we have the right types set. Next, I think it should be i >= 402L & i < 801L. Otherwise, you're assigning the first 401 columns as logical and then the first 801 columns as numeric, which, given that you know the types of the columns up front, doesn't make much sense, right?
Second, doing names(.) <-:
The line:
names(sparse.val) <- paste0("A", sparse.cols)
will create a copy and is not really necessary. Therefore we'll delete this line.
Third, the time-consuming for-loop:
for (x in names(sparse.val)) {
  val <- sparse.val[[x]]
  outdf[i, x := val]
}
is not actually doing what you think it's doing. It's not assigning the values from val to the column whose name is stored in x; instead it's (over)writing (each time) a column literally named x. Check your output.
This isn't part of the optimisation; it's just to let you know what you're actually wanting to do here:
for (x in names(sparse.val)) {
  val <- sparse.val[[x]]
  outdf[i, (x) := val]
}
Note the parentheses around x. Now x will be evaluated, and the value it contains will be the column to which val is assigned. It's a bit subtle, I understand, but it's necessary because it leaves open the possibility of creating a column literally named x with DT[, x := val], when that's what you actually want.
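A tiny illustration of the difference (a hedged sketch):
library(data.table)
dt <- data.table(a = 1:2)
x <- "b"
dt[, x := 10L]   # creates a column literally named "x"
dt[, (x) := 20L] # evaluates x, so this creates/assigns the column "b"
names(dt)        # "a" "x" "b"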
Coming back to the optimisation: the good news is that your entire time-consuming for-loop is simply:
set(outdf, i = i, j = paste0("A", sparse.cols), value = sparse.val)
This is where data.table's sub-assign by reference feature comes in handy!
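For reference, set(x, i, j, value) assigns value into rows i and columns j of x by reference. A minimal standalone example (a hedged sketch):
library(data.table)
dt <- data.table(A1 = rep(NA, 3), A2 = rep(NA_real_, 3))
set(dt, i = 1L, j = c("A1", "A2"), value = list(TRUE, 3.14))
dt # row 1 now holds TRUE and 3.14; no copy of dt was made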
Putting it all together:
Your final function looks like this:
timeMe2 <- function() {
  set.seed(1L)
  rows <- 100L
  cols <- 1200L
  outdf <- as.data.table(lapply(seq_len(cols), function(i) {
    if (i < 401L) rep(NA, rows)
    else if (i >= 402L & i < 801L) rep(NA_real_, rows)
    else rep(NA_character_, rows)
  }))
  setnames(outdf, paste0("A", seq_len(cols)))

  for (i in seq(100)) {
    sparse.cols <- sample(1200L, 100L)
    sparse.val <- lapply(sparse.cols, function(i) {
      if (i < 401L) sample(c(TRUE, FALSE), 1)
      else if (i >= 402L & i < 801L) sample(seq(10), 1)
      else sample(LETTERS, 1)
    })
    set(outdf, i = i, j = paste0("A", sparse.cols), value = sparse.val)
  }
  outdf
}
To compare: your solution takes 9.84 seconds on my system, whereas the function above takes 0.34 seconds, a ~29x improvement. I think this is the result you're looking for. Please verify it.
HTH
FYI, I'm new to using R, so my code is likely quite clunky. I've done my homework on this but haven't been able to find an "Except" logical operator for R, and I really need something like that in my code. My input data is a .csv containing integers and null values, with 12 columns and 1440 rows.
oneDayData <- read.csv("data.csv") # loading data
oneDayMatrix <- data.matrix(oneDayData, rownames.force = NA) # turning the data frame into a matrix
rowBefore <- data.frame(oneDayData[i-1, 10], stringsAsFactors = FALSE) # variable used in the if statement; represents the cell before the cell in the loop
ctr <- 0 # creating a counter and zeroing it
for (i in 1:nrow(oneDayMatrix)) {
  if ((oneDayMatrix[i, 10] == -180) & (oneDayMatrix[i, 4] == 0)) { # makes sure that missing data is matched with a zero in activityIn
    impute1 <- replace(oneDayMatrix[, 10], oneDayMatrix[i, 10], rowBefore)
    ctr <- (ctr + 1) # populating the counter with how many rows get changed
  } else {
    print("No data fit this criteria.")
  }
}
print(paste(ctr, "rows have been changed.")) # printing the number of rows that got changed
I would like to add some kind of EXCEPT condition to my if statement, or equivalent, that says something like: apply the two previous conditions (see the if statement in the code) EXCEPT when oneDayMatrix[i-1, 4] > 0. I would really appreciate any help with this, and thank you in advance!
"Except" is equivalent to "if not". The "not" operator in R is !. So to add that oneDayMatrix[i-1, 4] > 0 exception, you just need to modify your if statement as follows:
if ((oneDayMatrix[i, 10] == -180) &
(oneDayMatrix[i, 4] == 0) &
!(oneDayMatrix[i-1, 4] > 0)) { ... }
or equivalently:
if ((oneDayMatrix[i, 10] == -180) &
(oneDayMatrix[i, 4] == 0) &
(oneDayMatrix[i-1, 4] <= 0)) { ... }
This goes on top of a couple of fixes that need to be made to your code:
as I pointed out, rowBefore is not defined properly: it is written in terms of i, which is not defined yet at that point. Inside your for loop, just replace rowBefore with oneDayMatrix[i-1, 10]
as @noah pointed out, you need to start your loop at the second index: for (i in 2:nrow(oneDayMatrix)). Putting it all together, see the sketch below.
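A hedged sketch of the full loop with both fixes and the exception applied (it assumes, as in your code, that column 10 holds the value to impute and column 4 is activityIn, and it imputes directly rather than via replace()):
ctr <- 0
for (i in 2:nrow(oneDayMatrix)) {
  if ((oneDayMatrix[i, 10] == -180) &
      (oneDayMatrix[i, 4] == 0) &
      (oneDayMatrix[i-1, 4] <= 0)) {
    oneDayMatrix[i, 10] <- oneDayMatrix[i-1, 10] # impute from the previous row
    ctr <- ctr + 1
  }
}
print(paste(ctr, "rows have been changed."))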