After reading about benchmarks and speed comparisons of R methods, I am in the process of converting to the speedy data.table package for data manipulation on my large data sets.
I am having trouble with a particular task:
For a certain observed variable, I want to check, for each station, if the absolute lagged difference (with lag 1) is greater than a certain threshold. If it is, I want to replace it with NA, else do nothing.
I can do this for the entire data.table using the set command, but I need to do this operation by station.
Example:
# Example data. Assume the columns are ordered by date.
library(data.table)
set.seed(1)
DT <- data.table(station = sample.int(n = 3, size = 1e6, replace = TRUE),
                 wind    = rgamma(n = 1e6, shape = 1.5, rate = 1/10),
                 other   = rnorm(n = 1e6),
                 key     = "station")
# My attempt
max_rate <- 35
set(DT, i=which(c(NA, abs(diff(DT[['wind']]))) > max_rate),
j=which(names(DT)=='wind'), value=NA)
# The results
summary(DT)
The trouble with my implementation is that I need to do this by station, and I do not want to get the lagged difference between the last reading in station 1 and the first reading of station 2.
I tried to use the by=station operator within the [ ], but I am not sure how to do this.
One way is to get the row numbers you have to replace using the special variable .I, and then assign NA to those rows by reference using the := operator (or set).
# get the row numbers
idx = DT[, .I[which(c(NA, abs(diff(wind))) > 35)], by = station][, V1]
# then assign by reference
DT[idx, wind := NA_real_]
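To sanity-check the result, you can count how many wind readings were blanked per station (a quick check, not part of the original answer):
# number of NA wind values per station after the replacement
DT[, .(n_na = sum(is.na(wind))), by = station]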
This FR #2793, filed by @eddi, will provide a much more natural way to accomplish this task when/if implemented: you put the expression resulting in the corresponding indices on the LHS and the value to replace with on the RHS. That is, in the future, we should be able to do:
# in the future - a more natural way of doing the same operation shown above.
DT[, wind[which(c(NA, abs(diff(wind))) > 35)] := NA_real_, by = station]
After reading some XML files, I want to create a data.table with specific column names, e.g. Name, Score, Medal, etc. However, I am confused about how I should separate the single column (see the code and results) into many columns according to the given criteria.
In my opinion, we either need a loop with a step, or a special function, but I do not know which function exactly :/
library(xml2)
stage1 <- read_html("1973.html")
stage2 <- xml_find_all(stage1, ".//tr")
xml_text(stage2)
stage3 <- xml_text(xml_find_all(stage2, ".//td"))
stage3
DT <- data.table(stage3, keep.rownames = TRUE, check.names = TRUE, key = NULL,
                 stringsAsFactors = TRUE)
for (i in seq(from = 1, to = 1375, by = 11)) {
  if (is.numeric(DT[i, stage3]) == FALSE) {
    DT$Name <- DT[i, stage3]
  }
}
(Screenshot of the table, showing the first 20 of 1375 rows: https://pp.userapi.com/c845220/v845220632/1678a5/IRykEniYiiA.jpg)
Here is how the data.table looks now. What I need is to separate these results into columns: "Name" (e.g. Sergei Konyagin), "Country" (e.g. USSR), scores for problems 1-8 (8 columns, respectively), and the medal. The loop I have written is, I think, supposed to extract values from the existing column with a step of 11 (since every name, country, etc. repeats every 11 rows) and transfer them into the new columns. Unfortunately, it doesn't work :/
Thanks in advance for your help!
Give this a shot.
First, load the required packages:
library(data.table)
library(stringr) # this is just for the piping operator %>%
You would read in your own data table here, I am creating one as an example:
dat <- c("Sergey", "USSR", 1, 2, 3, 4, 5, 6, 7, 8, "silver") %>% rep(125) %>% data.table
setnames(dat, "stage3")
As a quick note, I would not read your strings in as factors as you do in your own code, because that can mess up the conversion to numeric.
The metadata column below will repeat itself to fill out the table. This only works if your table doesn't skip values. Also, it is not advisable to have column names that are numbers; it is better to give them proper names like "test1", "test2", etc.:
dat[, metadata := c("name", "country", 1:8, "medal")] # whatever you want to name your future 11 columns
dat[, participant := 1:(.N / 11) %>% rep(each = 11)] # same idea, can't have missing rows
Now, reshape and convert from strings to numeric where possible:
new.dat <-
  dcast(dat, participant ~ metadata, value.var = "stage3")[, lapply(.SD, type.convert)]
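If you want to check the result, str() should show one row per participant, with the eight problem columns converted to numbers (a quick sketch):
str(new.dat) # one row per participant; score columns should now be integer/numeric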
I am trying to construct a script in R to force it to ignore objects it can't find.
A simplified version of my script is as follows:
Trial<-sum(a,b,c,d,e)
a-e are numeric vectors generated by calculating the sum of a column in a data frame.
My problem is that I want to use the same script over multiple different conditions (and I have far more objects than just a-e). For some of these conditions, some of the objects a-e may not exist, so R returns the error: object 'd' not found.
To avoid having to write a unique script for each condition, I would like to force R to ignore any missing objects.
I would be grateful for any help!
Welcome to SO! As mentioned in the comments, in the future try to include a working example in your question. The preferred solution to your problem is to avoid assigning values to individual variables in the first place. Try to restructure your code so that your column sums get assigned to, for example, a single vector or list. In the example below, I create some sample data, assign the column sums to a vector, and compute the sum of that vector, without creating a new variable for each column.
# Create sample data
rData <- as.data.frame(matrix(c(1:6), nrow=6, ncol=5, byrow = TRUE))
print(rData)
# Compute column sum
sumVec <- apply(rData, 2, sum)
print(sumVec)
# Compute sum of column sums
total <- sum(sumVec)
print(total)
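As an aside, base R's colSums() computes the same vector more directly, so the apply() call above can be replaced with:
# colSums() is the idiomatic equivalent of apply(rData, 2, sum)
sumVec <- colSums(rData)
print(sumVec)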
If you have to use individual variables, before adding them up, you could check if the variable exists, and if not, create it and assign NA. You can then compute the sum of your variables after excluding NA.
# Sample variables
a <- 15
b <- 20
c <- 50
# Assign NA if it doesn't exist (one variable at a time)
if(!exists("d")) { d <- NA }
# Assign NA using sapply (preferred); inherits=FALSE restricts the check to
# the global environment, so base functions such as c() don't count as existing
sapply(c("a","b","c","d","e"), function(x)
  if(!exists(x, envir=.GlobalEnv, inherits=FALSE)) { assign(x, NA, envir=.GlobalEnv) }
)
# Compute sum after excluding NA
altTotal <- sum(na.omit(c(a,b,c,d,e)))
print(altTotal)
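Another base-R option worth knowing: mget() can look up several variables at once and takes an ifnotfound fallback for the missing ones, which avoids the assign() step entirely (a sketch using the same hypothetical variables a-e):
# fetch a-e from the global environment; missing ones become NA
vals <- mget(c("a", "b", "c", "d", "e"), envir = .GlobalEnv, ifnotfound = NA)
altTotal2 <- sum(unlist(vals), na.rm = TRUE)
print(altTotal2)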
Hopefully this will get you closer to the solution!
I have a big data table called "dt", and I want to produce a data table of the same dimensions which gives the deviation from the row mean of each entry in dt.
This code works but it seems very slow to me. I hope there's a way to do it faster? Maybe I'm building my table wrong so I'm not taking advantage of the by-reference assignment. Or maybe this is as good as it gets?
(I'm an R novice, so any other tips are appreciated!)
Here is my code:
library(data.table)
r <- 100 # number of rows
c <- 100 # number of columns
# build a data table with random cols
# (maybe not the best way to build, but this isn't important)
dt <- data.table(rnorm(r))
for (i in c(1:(c-1))) {
  dt <- cbind(dt, rnorm(r))
}
colnames(dt) <- as.character(c(1:c))
devs <- copy(dt)
means <- rowMeans(dt)
for (i in c(1:nrow(devs))) {
  devs[i, colnames(devs) := abs(dt[i,] - means[[i]])]
}
If you subtract a vector from a data.frame (or data.table), that vector will be subtracted from every column of the data.frame (assuming they're all numeric). Numeric functions like abs also work on all-numeric data.frames. So, you can compute devs with
devs <- abs(dt - rowMeans(dt))
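If you want to convince yourself that this matches the loop version from the question, keep the loop result under a different name before overwriting it and compare (a quick sketch):
devs_loop <- copy(devs)        # devs as built by the original loop
devs <- abs(dt - rowMeans(dt)) # vectorized replacement
all.equal(devs_loop, devs)     # should be TRUE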
You don't need a loop to create dt either; you can use replicate, which evaluates its second argument the number of times specified by the first argument and arranges the results in a matrix (unless simplify = FALSE is given as an argument). Note the first argument should be the number of columns, c:
dt <- as.data.table(replicate(c, rnorm(r)))
Not sure if it's what you are looking for, but the sweep function helps when applying operations that combine matrices and vectors (like your row means).
table <- matrix(rnorm(r*c), nrow=r, ncol=c) # generate random matrix
means <- apply(table, 1, mean) # compute row means
devs <- abs(sweep(table, 1, means, "-")) # compute by row the deviation from the row mean
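Since R recycles vectors column-wise, the sweep() call gives the same result here as subtracting the vector directly; a quick check:
# both subtract the row means from every column
all.equal(abs(sweep(table, 1, means, "-")), abs(table - means))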
This is similar to Update values in data.table with values from another data.table and R data.table replacing an index of values from another data.table, except in my situation the number of variables is very large so I do not want to list them explicitly.
What I have is a large data.table (let's call it dt_original) and a smaller data.table (let's call it dt_newdata) whose IDs are a subset of the first and it has only some of the variables of the first. I would like to update the values in dt_original with the values from dt_newdata. For an added twist, I only want to update the values conditionally - in this case, only if the values in dt_newdata are larger than the corresponding values in dt_original.
For a reproducible example, here are the data. In the real world the tables are much larger:
library(data.table)
set.seed(0)
## This data.table with 20 rows and many variables is the existing data set
dt_original <- data.table(id = 1:20)
setkey(dt_original, id)
for(i in 2015:2017) {
  varA <- paste0('varA_', i)
  varB <- paste0('varB_', i)
  varC <- paste0('varC_', i)
  dt_original[, (varA) := rnorm(20)]
  dt_original[, (varB) := rnorm(20)]
  dt_original[, (varC) := rnorm(20)]
}
## This table with a strict subset of IDs from dt_original and only a part of
## the variables is our potential replacement data
dt_newdata <- data.table(id = sample(1:20, 3))
setkey(dt_newdata, id)
newdata_vars <- sample(names(dt_original)[-1], 4)
for(var in newdata_vars) {
  dt_newdata[, (var) := rnorm(3)]
}
Here is a way of doing it using a loop and pmax, but there has to be a better way, right?
for(var in newdata_vars) {
  k <- pmax(dt_newdata[, (var), with = FALSE],
            dt_original[id %in% dt_newdata$id, (var), with = FALSE])
  dt_original[id %in% dt_newdata$id, (var) := k]
}
It seems like there should be a way using join syntax, and maybe the prefix i. and/or .SD or something like that, but nothing I've tried comes close enough to warrant repeating here.
This code should work in the current format given your criteria.
dt_original[dt_newdata, names(dt_newdata) := Map(pmax, mget(names(dt_newdata)), dt_newdata)]
It joins on the IDs that match between the data.tables and then performs an assignment using :=. Because we want to return a list, I use Map to run pmax through the columns of the two data.tables, matched by the names of dt_newdata. Note that it is necessary that all names of dt_newdata are present in dt_original.
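To see what the Map()/pmax() combination does, here is a toy illustration on two lists of made-up columns:
Map(pmax, list(a = c(1, 5), b = c(2, 2)), list(c(3, 4), c(0, 9)))
# $a is 3 5 and $b is 2 9: the elementwise maxima, column by column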
Following Frank's comment, you can remove the first element of the Map list items and of the column names using [-1], because the IDs don't need to be computed. Removing the first column from the Map call avoids one pass of pmax and also preserves the key on id. Thanks to @brian-stamper for pointing out the key preservation in the comments.
dt_original[dt_newdata,
names(dt_newdata)[-1] := Map(pmax,
mget(names(dt_newdata)[-1]),
dt_newdata[, .SD, .SDcols=-1])]
Note that the use of [-1] assumes that the ID variable is located in the first position of dt_newdata. If it is elsewhere, you could change the index manually or use grep.
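For example, a sketch that selects the non-id columns by name rather than by position (this assumes the key column is literally named id):
vars <- setdiff(names(dt_newdata), "id")
dt_original[dt_newdata,
            (vars) := Map(pmax, mget(vars), dt_newdata[, .SD, .SDcols = vars])]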
I have a data set in R called data, and in this data set I have more than 600 variables. Among these variables I have 94 called data$sleep1, data$sleep2, ..., data$sleep94, and another 94 called data$wakeup1, data$wakeup2, ..., data$wakeup94.
I want to create new variables, data$total1 through data$total94, each of which is the sum of sleep and wakeup for the same day.
For example, data$total64 <- data$sleep64 + data$wakeup64 and data$total94 <- data$sleep94 + data$wakeup94.
Without a loop, I would need to write this code 94 times. I hope someone can give me some tips on this. It doesn't have to be a loop, just an easier way to do this.
FYI, every variable is numeric and has about 30% missing values. The missing values are random and could be anywhere; a missing value is a blank, not 0.
I recommend storing your data in long form. To do this, use melt. I'll use data.table.
Sample data:
library(data.table)
set.seed(102943)
x <- setnames(as.data.table(matrix(runif(1880), nrow = 10)),
              paste0(c("sleep", "wakeup"), rep(1:94, each = 2)))[ , id := 1:.N]
Melt:
long_data <-
melt(x, id.vars = "id",
measure.vars = list(paste0("sleep", 1:94),
paste0("wakeup", 1:94)))
#rename the output to look more familiar
#**note: this syntax only works in the development version;
# to install, follow instructions
# here: https://github.com/jtilly/install_github
# to install from https://github.com/Rdatatable/data.table
# (or, read ?setnames and figure out how to make the old version work)
setnames(long_data, -1L, c("day", "sleep", "wakeup"))
I hope you'll find it's much easier to work with the data in this form.
For example, your problem is now simple:
long_data[ , total := sleep + wakeup]
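And if you ever need the wide layout back, dcast() reverses the melt (a sketch; with multiple value.var the resulting columns are named sleep_1, wakeup_1, total_1, and so on):
wide_again <- dcast(long_data, id ~ day,
                    value.var = c("sleep", "wakeup", "total"))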
We could do this without a loop. Assuming that the columns are arranged in the sequence mentioned, we subset the 'sleep' columns and the 'wakeup' columns separately using grep and then add the two subsets together.
sleepDat <- data[grep('sleep', names(data))]
wakeDat <- data[grep('wakeup', names(data))]
nm1 <- paste0('total', 1:94)
data[nm1] <- sleepDat+wakeDat
If there are missing values and they are NA, we can replace the NA values with 0 and then add as before. (Note that this treats a missing reading as 0, so a day's total equals whichever value is present.)
data[nm1] <- replace(sleepDat, is.na(sleepDat), 0) +
replace(wakeDat, is.na(wakeDat), 0)
If the missing value is '', then the columns would be either factor or character class (it is not clear from the OP's post). In that case, we may need to convert the columns to numeric class so that the '' values are automatically converted to NA:
sleepDat[] <- lapply(sleepDat, function(x)
as.numeric(as.character(x)))
wakeDat[] <- lapply(wakeDat, function(x)
as.numeric(as.character(x)))
and then proceed as before.
NOTE: If the columns are character, just omit the as.character step and use only as.numeric.