Removing Values in a Data Frame Based on a Condition in R

I have a data frame of 4 columns (reduced for this example). Most columns contain outliers that are significantly larger than the other values in the data frame. For example, a column with a maximum value of 99 has outliers 96, 97, 98, 99. These outliers essentially signify "no response". This obviously heavily skews the data, so they must be removed. I want to remove the outliers, but each column has a different maximum value (and a different set of outliers), and some have decimals.
96, 97, 98, 99 must be removed ONLY from the columns that have those as reserve values, so the function must know which columns carry each specific classification of reserve values. More below.
The issue is that I do not want to remove the reserve values from all columns, because a value that is reserved in one column may be meaningful in another. For example, 996 could be a reserve value in one column but a value of real significance in another, such as hourly wage.
It gets tricky because some columns have decimals, like hours worked/week. For example, a column whose values look like 37.5 (hours worked per week) would have reserve values 999.6, 999.7, 999.8, 999.9.
This length would be classified as 5.1 (5 characters total, 1 decimal).
I need to remove these reserve values from the data frame, but they must first be matched to the corresponding reserve-value length. Since each column has a different set of reserve values, the column names of the data frame should be linked to a specific reserve-value length.
df <- data.frame("children#" = c(1,5,0,2,10),
"annual income" = c(700000.00,50000.65,30000.45,1000000.59,9999999.96),
"hour wage"= c(25.65,9999999.99,50.23,1000.72,65.16),
"hours worked/week" = c(148.5,77.0,64.2,25.9,999.7))
Max length of children# is 2.
Max length of annual income is 10.2 (10 characters total, 2 decimal).
Max length of hour wage is 10.2.
Max length of hours worked/week is 5.1 (5 characters total, 1 decimal).
There will ALWAYS be 4 reserve values:
If max length = 2, remove reserve values 96, 97, 98, 99.
If max length = 3, remove reserve values 996, 997, 998, 999... and so forth for whole numbers.
With decimals:
If max length = 5.1, remove reserve values 999.6, 999.7, 999.8, 999.9.
If max length = 10.2, remove reserve values 9999999.96, 9999999.97, 9999999.98, 9999999.99.
Thus, I would like to figure out how to write a function that will:
find the max lengths,
connect the corresponding max lengths with the correct reserve values,
remove the reserve values from the data frame based on the max length of each column.
So far I have the max lengths of each column, with the decimal points.
I just need some help connecting them to the reserve values and getting those reserve values removed from the data frame.
If more info is required please comment and I will elaborate further.
Code sample: for the reserve values I was thinking of creating a separate data frame and using that to remove the values. Other suggestions are welcome.
Find.Max.Length <- function(data){
  # Check max character length of each column
  tmp <- data.frame(lapply(data, function(x) max(nchar(x, keepNA = FALSE))))
  tmp <- data.frame(t(tmp))
  return(tmp)
}
max.length <- Find.Max.Length(df)
Check.Decimal.Places <- function(x){
  # keep only the values that actually have decimals
  x <- x[!is.na(x) & x %% 1 != 0]
  if (length(x) == 0) return(0)
  # count digits after the decimal point (trailing zeros stripped)
  max(nchar(sapply(strsplit(sub('0+$', '', as.character(x)), ".", fixed = TRUE),
                   `[[`, 2)))
}
decimal <- data.frame(Check.Decimal.Places(df[[1]])) # <-- used to initialize the variable before the loop
for(i in seq_along(df)){
  decimal[i] <- Check.Decimal.Places(df[[i]])
}
decimal <- data.frame(t(decimal))
rownames(decimal) <- names(df)
length.df <- cbind(max.length, decimal)
names(length.df) <- c("Max Length", "Decimal Place")
length.df$NewVariableLength <- paste(length.df$`Max Length`,
                                     length.df$`Decimal Place`, sep = ".")
NOTE: the row names of the length.df data frame match the column names of the original data frame. That could be a way to link the two together.
There is probably a faster way to do all of this; all suggestions are welcome.

edit: Now I understand what you mean by "reserve values" - answers from a survey that should not be counted (e.g., "I don't want to answer this question").
You have essentially three easy methods here, without having to resort to "integer length" checks or other overengineering:
Max values (i.e., remove the four highest values),
Manual thresholds (i.e., remove all values above X),
If-else logic (i.e., if answer == X, remove it).
Building the dataset
Your data did not correspond to your specification ("always 4 outliers"), so I took the liberty of extending it.
df <- data.frame(
  "children" = c(1, 0, 96, 2, 10, 99, 98, 99),
  "annual_income" = c(700000.00, 50000.65, 30000.45, 1000000.59, 9999999.96, 9999999.97, 9999999.98, 9999999.99),
  "hour_wage" = c(25.65, 9999999.99, 50.23, 9999999.98, 9999999.99, 9999999.98, 1000.72, 65.16),
  "hours_worked_week" = c(148.5, 999.6, 77.0, 64.2, 999.9, 999.8, 25.9, 999.7)
)
df
children annual_income hour_wage hours_worked_week
1 1 700000.00 25.65 148.5
2 0 50000.65 9999999.99 999.6
3 96 30000.45 50.23 77.0
4 2 1000000.59 9999999.98 64.2
5 10 9999999.96 9999999.99 999.9
6 99 9999999.97 9999999.98 999.8
7 98 9999999.98 1000.72 25.9
8 99 9999999.99 65.16 999.7
1. Maximum-Values-Approach (obsolete after clarification)
Load libraries
library(dplyr)
library(magrittr)
Get the four outliers
children_out <- tail(sort(df$children), 4)
Replace outliers with NA
df[df$children %in% children_out, ] %<>% mutate(children = NA)
Check dataset
df
children annual_income hour_wage hours_worked_week
1 1 700000.00 25.65 148.5
2 0 50000.65 9999999.99 999.6
3 NA 30000.45 50.23 77.0
4 2 1000000.59 9999999.98 64.2
5 10 9999999.96 9999999.99 999.9
6 NA 9999999.97 9999999.98 999.8
7 NA 9999999.98 1000.72 25.9
8 NA 9999999.99 65.16 999.7
Caveat: This approach will work only if you always have four outliers for each column.
2. Manual thresholds
Load libraries
library(dplyr)
library(magrittr)
Exclude existing NA and replace anything that is 96 or above with NA
df[!is.na(df$children) & df$children >= 96, ] %<>% mutate(children = NA)
Check dataset
df
children annual_income hour_wage hours_worked_week
1 1 700000.00 25.65 148.5
2 0 50000.65 9999999.99 999.6
3 NA 30000.45 50.23 77.0
4 2 1000000.59 9999999.98 64.2
5 10 9999999.96 9999999.99 999.9
6 NA 9999999.97 9999999.98 999.8
7 NA 9999999.98 1000.72 25.9
8 NA 9999999.99 65.16 999.7
3. If-else logic
Load libraries
library(dplyr)
library(magrittr)
Save "reserved answers"
children_res <- c(96, 97, 98, 99)
Replace anything that is a reserved answer with NA (excluding existing NA is not needed here)
df[df$children %in% children_res, ] %<>% mutate(children = NA)
Check dataset
df
children annual_income hour_wage hours_worked_week
1 1 700000.00 25.65 148.5
2 0 50000.65 9999999.99 999.6
3 NA 30000.45 50.23 77.0
4 2 1000000.59 9999999.98 64.2
5 10 9999999.96 9999999.99 999.9
6 NA 9999999.97 9999999.98 999.8
7 NA 9999999.98 1000.72 25.9
8 NA 9999999.99 65.16 999.7
4. edit: Combined approach 1&3
Load libraries
library(dplyr)
library(magrittr)
Get "reserved answers"
children_res <- tail(sort(unique(df$children)), 4)
Replace anything that is a reserved answer with NA (excluding existing NA is not needed here)
df[df$children %in% children_res, ] %<>% mutate(children = NA)
Caveat: This approach will work only if ALL reserved answers (e.g., 96, 97, 98, and 99) are present in each column. It will NOT work if, by chance, nobody answered "97".
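For the automated, length-based mapping the question actually asked for, here is a sketch of my own (not one of the methods above): it derives the four reserve values from each column's maximum character length and decimal count, reusing the length.df table built in the question. The helper name and the loop are my assumptions, and the values are built as strings to avoid floating-point mismatches.
# Hypothetical helper: derive the 4 reserve values from a column's max
# character length (n_chars, counting the ".") and its decimal places (n_dec)
make_reserve_values <- function(n_chars, n_dec){
  digits <- n_chars - ifelse(n_dec > 0, 1, 0)   # character count minus the "."
  vals <- paste0(strrep("9", digits - 1), 6:9)  # all 9s, last digit runs 6..9
  if (n_dec > 0){
    # re-insert the decimal point before the last n_dec digits
    vals <- paste0(substr(vals, 1, digits - n_dec), ".",
                   substr(vals, digits - n_dec + 1, digits))
  }
  as.numeric(vals)
}
make_reserve_values(2, 0)  # 96 97 98 99
make_reserve_values(5, 1)  # 999.6 999.7 999.8 999.9
# Replace reserve values with NA, column by column
# (length.df rows are assumed to be named after the columns of df)
for (col in names(df)){
  res <- make_reserve_values(length.df[col, "Max Length"],
                             length.df[col, "Decimal Place"])
  df[[col]][df[[col]] %in% res] <- NA
}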

Related

How to use mice for multiple imputation of missing values in longitudinal data?

I have a dataset with a repeatedly measured continuous outcome and some covariates of different classes, like in the example below.
Id y Date Soda Team
1 -0.4521 1999-02-07 Coke Eagles
1 0.2863 1999-04-15 Pepsi Raiders
2 0.7956 1999-07-07 Coke Raiders
2 -0.8248 1999-07-26 NA Raiders
3 0.8830 1999-05-29 Pepsi Eagles
4 0.1303 2005-03-04 NA Cowboys
5 0.1375 2013-11-02 Coke Cowboys
5 0.2851 2015-06-23 Coke Eagles
5 -0.3538 2015-07-29 Pepsi NA
6 0.3349 2002-10-11 NA NA
7 -0.1756 2005-01-11 Pepsi Eagles
7 0.5507 2007-10-16 Pepsi Cowboys
7 0.5132 2012-07-13 NA Cowboys
7 -0.5776 2017-11-25 Coke Cowboys
8 0.5486 2009-02-08 Coke Cowboys
I am trying to multiply impute the missing values in Soda and Team using the mice package. As I understand it, because MI is not a causal model, there is no concept of dependent and independent variables. I am not sure how to set up this MI process using mice. I would like some suggestions or advice from others who have encountered missing data in a repeated-measures setting like this, and how they used mice to tackle the problem. Thanks in advance.
Edit
This is what I have tried so far, but this does not capture the repeated measure part of the dataset.
library(mice)
init = mice(dat, maxit = 0)
methd = init$method
predM = init$predictorMatrix
methd[c("Soda")] = "logreg"
methd[c("Team")] = "logreg"
imputed = mice(dat, method = methd, predictorMatrix = predM, m = 5)
There are several options to accomplish what you are asking for. I have decided to impute missing values in covariates in the so-called 'wide' format. I will illustrate this with the following worked example, which you can easily apply to your own data.
Let's first make a reprex. Here, I use the longitudinal Mayo Clinic Primary Biliary Cirrhosis Data (pbc2), which comes with the JM package. This data is organized in the so-called 'long' format, meaning that each patient i has multiple rows and each row contains a measurement of variable x measured on time j. Your dataset is also in the long format. In this example, I assume that pbc2$serBilir is our outcome variable.
# install.packages('JM')
library(JM)
# note: use function(x) instead of \(x) if you use a version of R <4.1.0
# missing values per column
miss_abs <- \(x) sum(is.na(x))
miss_perc <- \(x) round(sum(is.na(x)) / length(x) * 100, 1L)
miss <- cbind('Number' = apply(pbc2, 2, miss_abs), '%' = apply(pbc2, 2, miss_perc))
# --------------------------------
> miss[which(miss[, 'Number'] > 0),]
Number %
ascites 60 3.1
hepatomegaly 61 3.1
spiders 58 3.0
serChol 821 42.2
alkaline 60 3.1
platelets 73 3.8
According to this output, 6 variables in pbc2 contain at least one missing value. Let's pick alkaline from these. We also need patient id and the time variable years.
# subset
pbc_long <- subset(pbc2, select = c('id', 'years', 'alkaline', 'serBilir'))
# sort ascending based on id and, within each id, years
pbc_long <- with(pbc_long, pbc_long[order(id, years), ])
# ------------------------------------------------------
> head(pbc_long, 5)
id years alkaline serBilir
1 1 1.09517 1718 14.5
2 1 1.09517 1612 21.3
3 2 14.15234 7395 1.1
4 2 14.15234 2107 0.8
5 2 14.15234 1711 1.0
Just by quickly eyeballing, we observe that years do not seem to differ within subjects, even though variables were repeatedly measured. For the sake of this example, let's add a little bit of time to all rows of years but the first measurement.
set.seed(1)
# add little bit of time to each row of 'years' but the first row
new_years <- lapply(split(pbc_long, pbc_long$id), \(x) {
  add_time <- 1:(length(x$years) - 1L) + rnorm(length(x$years) - 1L, sd = 0.25)
  c(x$years[1L], x$years[-1L] + add_time)
})
# replace the original 'years' variable
pbc_long$years <- unlist(new_years)
# integer time variable needed to store repeated measurements as separate columns
pbc_long$measurement_number <- unlist(sapply(split(pbc_long, pbc_long$id), \(x) 1:nrow(x)))
# only keep the first 4 repeated measurements per patient
pbc_long <- subset(pbc_long, measurement_number %in% 1:4)
Since we will perform our multiple imputation in wide format (meaning that each participant i has one row and repeated measurements on x are stored in j different columns, so xj columns in total), we have to convert the data from long to wide. Now that we have prepared our data, we can use reshape to do this for us.
# convert long format into wide format
v_names <- c('years', 'alkaline', 'serBilir')
pbc_wide <- reshape(pbc_long,
                    idvar = 'id',
                    timevar = "measurement_number",
                    v.names = v_names, direction = "wide")
# -----------------------------------------------------------------
> head(pbc_wide, 4)[, 1:9]
id years.1 alkaline.1 serBilir.1 years.2 alkaline.2 serBilir.2 years.3 alkaline.3
1 1 1.095170 1718 14.5 1.938557 1612 21.3 NA NA
3 2 14.152338 7395 1.1 15.198249 2107 0.8 15.943431 1711
12 3 2.770781 516 1.4 3.694434 353 1.1 5.148726 218
16 4 5.270507 6122 1.8 6.115197 1175 1.6 6.716832 1157
Now let's multiply impute the missing values in our covariates.
library(mice)
# Setup-run
ini <- mice(pbc_wide, maxit = 0)
meth <- ini$method
pred <- ini$predictorMatrix
visSeq <- ini$visitSequence
# avoid collinearity issues by letting only variables measured
# at the same point in time predict each other
pred[grep("1", rownames(pred), value = TRUE),
     grep("2|3|4", colnames(pred), value = TRUE)] <- 0
pred[grep("2", rownames(pred), value = TRUE),
     grep("1|3|4", colnames(pred), value = TRUE)] <- 0
pred[grep("3", rownames(pred), value = TRUE),
     grep("1|2|4", colnames(pred), value = TRUE)] <- 0
pred[grep("4", rownames(pred), value = TRUE),
     grep("1|2|3", colnames(pred), value = TRUE)] <- 0
# variables that should not be imputed
pred[c("id", grep('^year', names(pbc_wide), value = TRUE)), ] <- 0
# variables that should not serve as predictors
pred[, c("id", grep('^year', names(pbc_wide), value = TRUE))] <- 0
# multiply imputed missing values ------------------------------
imp <- mice(pbc_wide, pred = pred, m = 10, maxit = 20, seed = 1)
# Time difference of 2.899244 secs
As can be seen in three example traceplots (which can be obtained with plot(imp)), the algorithm has converged nicely. Refer to this section of Stef van Buuren's book for more info on convergence.
Now we need to convert back the multiply imputed data (which is in wide format) to long format, so that we can use it for analyses. We also need to make sure that we exclude all rows that had missing values for our outcome variable serBilir, because we do not want to use imputed values of the outcome.
# need unlisted data
implong <- complete(imp, 'long', include = FALSE)
# 'smart' way of getting all the names of the repeated variables in a usable format
v_names <- as.data.frame(matrix(apply(
  expand.grid(grep('ye|alk|ser', names(implong), value = TRUE)),
  1, paste0, collapse = ''), nrow = 4, byrow = TRUE), stringsAsFactors = FALSE)
names(v_names) <- names(pbc_long)[2:4]
# convert back to long format
longlist <- lapply(split(implong, implong$.imp),
                   reshape, direction = 'long',
                   varying = as.list(v_names),
                   v.names = names(v_names),
                   idvar = 'id', times = 1:4)
# logical that is TRUE if our outcome was not observed
# which should be based on the original, unimputed data
orig_data <- reshape(imp$data, direction = 'long',
                     varying = as.list(v_names),
                     v.names = names(v_names),
                     idvar = 'id', times = 1:4)
orig_data$logical <- is.na(orig_data$serBilir)
# merge into the list of imputed long-format datasets:
longlist <- lapply(longlist, merge, y = subset(orig_data, select = c(id, time, logical)))
# exclude rows for which logical == TRUE
longlist <- lapply(longlist, \(x) subset(x, !logical))
Finally, convert longlist back into a mids using datalist2mids from the miceadds package.
imp <- miceadds::datalist2mids(longlist)
# ----------------
> imp$loggedEvents
NULL
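From here, the mids object can be analyzed and pooled in the usual way. A minimal sketch (not part of the answer above; the model formula is only an illustration):
# fit the same model on each imputed dataset and pool with Rubin's rules
fit <- with(imp, lm(serBilir ~ alkaline + time))
summary(mice::pool(fit))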

How to apply a function to multiple columns to create multiple new columns in R?

I have this list of sequences, aqi_range, and a dataframe df:
aqi_range = list(0:50,51:100,101:250)
df
PM10_mean PM10_min PM10_max PM2.5_mean PM2.5_min PM2.5_max
1 85.6 3 264 75.7 3 240
2 105. 6 243 76.4 3 191
3 95.8 19 287 48.4 8 134
4 85.5 50 166 64.8 32 103
5 55.9 24 117 46.7 19 77
6 37.5 6 116 31.3 3 87
7 26 5 69 15.5 3 49
8 82.3 34 169 49.6 25 120
9 170 68 272 133 67 201
10 254 189 323 226 173 269
Now I've created these two pretty simple functions that I want to apply to this dataframe to calculate the AQI (Air Quality Index) for each pollutant.
# a = column from the dataframe (PM10_mean, PM2.5_mean)
# b = the list of sequences defined above
min_max_diff <- function(a, b){
  for (i in b){
    if (a %in% i){
      min_val = min(i)
      max_val = max(i)
      return(max_val - min_val)
    }
  }
}
# a = column from the dataframe (PM10_mean, PM2.5_mean)
# b = the list of sequences defined above
c_low <- function(a, b){
  for (i in b){
    if (a %in% i){
      min_val = min(i)
      return(min_val)
    }
  }
}
Basically the first function, min_max_diff, takes a value of column df$PM10_mean / df$PM2.5_mean, finds the sequence in aqi_range that contains it, and returns the difference between the max and min values of that sequence. Similarly, the second function, c_low, just returns the minimum value of the sequence.
I want to apply this kind of manipulation (formula defined below) to the PM10_mean column to create the new column PM10_AQI:
df$PM10_AQI = min_max_diff(df$PM10_mean,aqi_range) / (df$PM10_max - df$PM10_min) / * (df$PM10_mean - df$PM10_min) + c_low(df$PM10_mean,aqi_range)
I hope that explains it properly.
If your problem is just how to apply the given transformation to several columns of a data frame, you can write a for loop, construct the name of each variable involved in the transformation using string functions (in this case sub() is useful), and refer to the columns in the data frame using the [ notation (as opposed to the $ notation, since the [ notation accepts strings to specify columns).
Below I show an example of such code with a small sample data set with 3 observations.
(Note that I modified the definition of the AQI range values (I now just define the breaks where the range changes, assuming they are all integers), and that your functions min_max_diff() and c_low() are collapsed into one single function returning the min and max values of the AQI range where the values are found; again, this assumes that the AQI values are integer values.)
# Definition of the AQI ranges (which are assumed to be based on integer values)
# Note that if the number of AQI ranges is k, the number of breaks is k+1
# Each break value defines the minimum of the range
# The maximum of each range is computed as the "minimum of the NEXT range" - 1
# (again this assumes integer values in AQI ranges)
# The values (e.g. PM10_mean) whose AQI range is searched for are assumed
# to NOT be larger than or equal to the largest break value.
aqi_range_breaks = c(0, 51, 101, 251)
# Example data (top 3 rows of the data frame you provided)
df = data.frame(PM10_mean = c(85.6, 105.0, 95.8),
                PM10_min = c(3, 6, 19),
                PM10_max = c(264, 243, 287),
                PM2.5_mean = c(75.7, 76.4, 48.4),
                PM2.5_min = c(3, 3, 8),
                PM2.5_max = c(240, 191, 134))
# Function that returns the minimum and maximum AQI values
# of the AQI range where the given values are found
# `values`: array of values that are searched for in the AQI ranges
# defined by the second parameter.
# `aqi_range_breaks`: breaks defining the minimum values of each AQI range
# plus one last value defining a value never attained by `values`.
# (all values in this parameter defining the AQI ranges are assumed integer values)
find_aqi_range_min_max <- function(values, aqi_range_breaks){
  aqi_range_groups = findInterval(values, aqi_range_breaks)
  return(list(min = aqi_range_breaks[aqi_range_groups],
              max = aqi_range_breaks[aqi_range_groups + 1] - 1))
}
# Run the variable transformation on the selected `_mean` columns
vars_mean = c("PM10_mean", "PM2.5_mean")
for (vmean in vars_mean) {
  vmin = sub("_mean$", "_min", vmean)
  vmax = sub("_mean$", "_max", vmean)
  vaqi = sub("_mean$", "_AQI", vmean)
  aqi_range_min_max = find_aqi_range_min_max(df[, vmean], aqi_range_breaks)
  df[, vaqi] = (aqi_range_min_max$max - aqi_range_min_max$min) /
    (df[, vmax] - df[, vmin]) / (df[, vmean] - df[, vmin]) +
    aqi_range_min_max$min
}
Note how the findInterval() function has been used to find the range where an array of values fall. That was the key to make your transformation work for a data frame column.
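For intuition, a quick check of what findInterval() returns for the three sample means, given the breaks defined above:
findInterval(c(85.6, 105.0, 95.8), aqi_range_breaks)
# [1] 2 3 2   -- 85.6 and 95.8 fall in [51, 101), 105.0 falls in [101, 251)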
The expected output of this process is:
PM10_mean PM10_min PM10_max PM2.5_mean PM2.5_min PM2.5_max PM10_AQI PM2.5_AQI
1 85.6 3 264 75.7 3 240 51.00227 51.002843893
2 105.0 6 243 76.4 3 191 101.00635 51.003550930
3 95.8 19 287 48.4 8 134 51.00238 0.009822411
Please check the formula that computes AQI because you had a syntax error in it (look for / *, which I have replaced with / in the formula in my code).
Note that the $ in the regular expression used in sub() ensures that the string "_mean" is replaced only when it occurs at the end of the variable name.
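For example, the anchored pattern would leave a hypothetical column name like "PM10_mean_old" untouched:
sub("_mean$", "_AQI", c("PM10_mean", "PM2.5_mean", "PM10_mean_old"))
# [1] "PM10_AQI"      "PM2.5_AQI"     "PM10_mean_old"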

Dataframe operation within and between dataframes

How can I perform operations within and between data frames in R?
For example, here is a dataframe on stock returns.
stocks <- data.frame(
  time = as.Date('2009-01-01') + 0:9,
  X = rnorm(10, 0, 1),
  Y = rnorm(10, 0, 2),
  Z = rnorm(10, 0, 4)
)
time X Y Z
1 2009-01-01 -0.31758501 -1.2718424 -2.9979292
2 2009-01-02 -1.06440187 0.4202969 -5.7925412
3 2009-01-03 0.26475736 -2.3955779 -2.2638179
4 2009-01-04 -0.83653746 0.4161053 -10.1011995
5 2009-01-05 -0.12214392 0.7143456 3.6851497
6 2009-01-06 -0.01186287 -2.1322029 -0.1577852
7 2009-01-07 0.27729415 0.1323237 -4.4237673
8 2009-01-08 -1.74389562 0.4962045 0.4192498
9 2009-01-09 0.83150240 -0.9241747 -1.6752324
10 2009-01-10 -0.52863956 0.1044531 -1.2083588
Q1) I'd like to create a dataframe with the previous day's values.
For example, the final result I want could be expressed as lag(stocks, 1).
What is the most simple and elegant way to achieve this?
Is there any simple way to use dplyr?
Q2) How can I apply basic arithmetic operations to this dataframe?
For example, I'd like to create dataframes with:
stocks1 = stocks + 1
stocks2 = stocks * 3
stocks3 = stocks2 / stocks1 (operation between two dataframes)
stocks4 = stocks3 / lag(stocks1)
Something like this.
What would be the most simple and elegant way?
To address the first problem, this might be of help to you. You don't necessarily need dplyr in this instance; the head() function is sufficient if all you wish to do is lag the variables.
stocks <- data.frame(
  time = as.Date('2009-01-01') + 0:9,
  X = rnorm(10, 0, 1),
  Y = rnorm(10, 0, 2),
  Z = rnorm(10, 0, 4)
)
previous <- head(stocks, 9)
df <- data.frame(stocks$time[2:10], stocks$X[2:10], stocks$Y[2:10], stocks$Z[2:10],
                 previous$X, previous$Y, previous$Z)
col_headings <- c("time", "X", "Y", "Z", "previousX", "previousY", "previousZ")
names(df) <- col_headings
Here, the dates from 2nd January to 10th January are displayed, with the lags for X, Y, and Z also being included in the data frame.
> df
time X Y Z previousX previousY
1 2009-01-02 0.7878110 -2.1394047 0.68775794 -0.0759606 1.2863089
2 2009-01-03 -0.2767296 -2.3453356 -1.56313888 0.7878110 -2.1394047
3 2009-01-04 -0.2122021 0.1589629 -1.13926020 -0.2767296 -2.3453356
4 2009-01-05 0.1195826 3.2320352 -0.32020803 -0.2122021 0.1589629
5 2009-01-06 0.7642622 -0.7621168 1.66614679 0.1195826 3.2320352
6 2009-01-07 -0.3073972 -2.9475654 5.63945611 0.7642622 -0.7621168
7 2009-01-08 0.3597369 0.5011861 5.95424269 -0.3073972 -2.9475654
8 2009-01-09 -1.8701881 0.4417496 1.34273218 0.3597369 0.5011861
9 2009-01-10 -1.1172033 -0.5566736 0.05432339 -1.8701881 0.4417496
previousZ
1 3.2188050
2 0.6877579
3 -1.5631389
4 -1.1392602
5 -0.3202080
6 1.6661468
7 5.6394561
8 5.9542427
9 1.3427322
As regards calculations, it depends on what you are trying to do.
e.g. do you want to add 1 to each row in Z?
> df$Z+1
[1] 1.6877579 -0.5631389 -0.1392602 0.6797920 2.6661468 6.6394561
[7] 6.9542427 2.3427322 1.0543234
You could divide two stock returns by each other as you've specified as well. Note that we have combined them in the one dataframe, so we are not necessarily conducting an "operation between two dataframes" per se.
> df$Y/df$Z
[1] -3.11069421 1.50040132 -0.13953168 -10.09354826 -0.45741275
[6] -0.52266839 0.08417294 0.32899307 -10.24740160
By specifying the dataframe (in this case, df), along with the associated variable (as indicated after the $ symbol), then you should be able to carry out a wide range of calculations across the dataframe.
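Since the question also asked about dplyr: a minimal sketch of the same lagging with dplyr::lag (my addition, not part of the answer above; assumes dplyr >= 1.0 for across()):
library(dplyr)
lagged <- stocks %>%
  mutate(across(c(X, Y, Z), lag, .names = "previous{.col}")) %>%
  slice(-1)  # drop the first day, which has no previous value
As for Q2, element-wise arithmetic also works directly on the numeric part of a data frame, e.g. stocks[-1] + 1 or stocks2[-1] / stocks1[-1], where [-1] drops the time column.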

How to column bind and row bind a large number of data frames in R?

I have a large data set of vehicles. They were recorded every 0.1 seconds, so their IDs repeat in the Vehicle ID column. In total there are 2169 vehicles. I filtered the Vehicle velocity column for every vehicle (using a for loop), which resulted in a new column with the first and last 30 values removed (per vehicle). In order to bind it with the original data frame, I removed the first and last 30 rows of the table too and then combined them using cbind(). This works only for the last vehicle. I want this smoothing and column binding for all vehicles, and finally I want to combine all the per-vehicle data frames into one single table, i.e. row-binding in sequence of vehicle IDs. This is what I wrote so far:
traj1 <- read.csv('trajectories-0750am-0805am.txt', sep=' ', header=F)
head(traj1)
names (traj1)<-c('Vehicle ID', 'Frame ID','Total Frames', 'Global Time','Local X', 'Local Y', 'Global X','Global Y','Vehicle Length','Vehicle width','Vehicle class','Vehicle velocity','Vehicle acceleration','Lane','Preceding Vehicle ID','Following Vehicle ID','Spacing','Headway')
# TIME COLUMN
Time <- sapply(traj1$'Frame ID', function(x) x/10)
traj1$'Time' <- Time
# SMOOTHING VELOCITY
smooth <- function(x, D, delta){
  z <- exp(-abs(-D:D/delta))
  r <- convolve(x, z, type = 'filter') / convolve(rep(1, length(x)), z, type = 'filter')
  r
}
for (i in unique(traj1$'Vehicle ID')){
  veh <- subset(traj1, traj1$'Vehicle ID' == i)
  svel <- smooth(veh$'Vehicle velocity', 30, 10)
  svel <- data.frame(svel)
  veh <- head(tail(veh, -30), -30)
  fta <- cbind(veh, svel)
}
'fta' now only contains the data frame for the last vehicle, but I want all data frames (for all vehicles i) combined by row. Maybe a for loop is not the right way to do it, but I don't know how I can use tapply (or any other apply function) to do so many things at the same time.
EDIT
I can't reproduce my dataset here, but the Orange dataset in R provides a good analogy. Using the same smoothing function, the for loop would look like this (if the age column is smoothed and the Tree column is equivalent to my Vehicle ID column):
for (i in unique(Orange$Tree)){
  tre <- subset(Orange, Orange$'Tree' == i)
  age2 <- round(smooth(tre$age, 2, 0.67), digits = 2)
  age2 <- data.frame(age2)
  tre <- head(tail(tre, -2), -2)
  comb <- cbind(tre, age2)
}
Umair, I am not sure I understood what you want.
If I understood right, you want to combine all the results by row. To do that you could save all the results in a list and then do.call an rbind:
comb <- list() ### create a list to save the results
length(comb) <- length(unique(Orange$Tree))
## Your loop for smoothing:
for (i in 1:length(unique(Orange$Tree))){
  tre <- subset(Orange, Tree == unique(Orange$Tree)[i])
  age2 <- round(smooth(tre$age, 2, 0.67), digits = 2)
  age2 <- data.frame(age2)
  tre <- head(tail(tre, -2), -2)
  comb[[i]] <- cbind(tre, age2) ### save results in the list
}
final.data <- do.call("rbind", comb) ### combine all results by row
This will give you:
Tree age circumference age2
3 1 664 87 687.88
4 1 1004 115 982.66
5 1 1231 120 1211.49
10 2 664 111 687.88
11 2 1004 156 982.66
12 2 1231 172 1211.49
17 3 664 75 687.88
18 3 1004 108 982.66
19 3 1231 115 1211.49
24 4 664 112 687.88
25 4 1004 167 982.66
26 4 1231 179 1211.49
31 5 664 81 687.88
32 5 1004 125 982.66
33 5 1231 142 1211.49
Just for fun, a different way to do it using plyr::ddply and sapply with split:
library(plyr)
data <- ddply(Orange, .(Tree), tail, n = -2)
data <- ddply(data, .(Tree), head, n = -2)
data <- cbind(data,
              age2 = matrix(sapply(split(Orange$age, Orange$Tree), smooth, D = 2, delta = 0.67),
                            ncol = 1, byrow = FALSE))
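Applied back to the vehicle data from the question, the same list-then-rbind pattern might look like the following sketch (my assumption, relying on the traj1 columns and the smooth() function defined in the question):
# split by vehicle, smooth and trim each piece, then row-bind everything
comb <- lapply(split(traj1, traj1$`Vehicle ID`), function(veh) {
  svel <- smooth(veh$`Vehicle velocity`, 30, 10)  # smoothed velocity
  cbind(head(tail(veh, -30), -30), svel = svel)   # drop 30 rows at each end
})
fta <- do.call(rbind, comb)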

R - Turn many time series in 1D format into 3D array, with each time series tagged with two labels

I have a one-dimensional CSV which contains data for many time series; each time series is tagged with two labels.
After reading the file into R, what key function(s) should I use to quickly turn the data into a 3D array?
The data are in this format:
Date, Price, Stock Ticker, Country
1/1/2012, 98, ABC.US, US
1/2/2012, 100, ABC.US, US
.
.
.
1/1/2012, 36, XYZ.US, US
1/2/2012, 34, XYZ.US, US
.
.
.
.
1/1/2012, 78, MNO.LN, UK
1/2/2012, 75, MNO.LN, UK
.
.
I want to turn this table into 3D array with dimensions of date, stock ticker and country:
3DTable[Date,Ticker,Country]
Assuming that @sebastian-c interpreted your question correctly, this is a one-liner in base R that gets you there (x[, -2] drops the Price column, so the remaining columns Date, Stock.Ticker, and Country define the three dimensions):
tapply(x$Price, x[, -2], c)
# , , Country = UK
#
# Stock.Ticker
# Date ABC.US MNO.LN XYZ.US
# 1/1/2012 NA 78 NA
# 1/2/2012 NA 75 NA
#
# , , Country = US
#
# Stock.Ticker
# Date ABC.US MNO.LN XYZ.US
# 1/1/2012 98 NA 36
# 1/2/2012 100 NA 34
I think I have an answer for what you want.
Create data frame
x <- data.frame(Date = rep(c("1/1/2012", "1/2/2012"), 3),
                Price = c(98, 100, 36, 34, 78, 75),
                "Stock Ticker" = rep(c("ABC.US", "XYZ.US", "MNO.LN"), each = 2),
                Country = rep(c("US", "US", "UK"), each = 2),
                stringsAsFactors = TRUE) # needed in R >= 4.0 so the levels() calls below work
Create set of all possible options
all.opts <- expand.grid(Date = levels(x$Date),
                        Stock.Ticker = levels(x$Stock.Ticker),
                        Country = levels(x$Country))
Join this with the data (there may be a way in base R, but I don't know it)
library(plyr)
x2 <- join(all.opts, x)
Make the array
x.arr2 <- array(x2$Price, dim = c(2, 3, 2),
                dimnames = list(levels(x2$Date), levels(x2$Stock.Ticker), levels(x2$Country)))
Admire handiwork:
x.arr2
#, , UK
#
# ABC.US MNO.LN XYZ.US
#1/1/2012 NA 78 NA
#1/2/2012 NA 75 NA
#
#, , US
#
# ABC.US MNO.LN XYZ.US
#1/1/2012 98 NA 36
#1/2/2012 100 NA 34
Try this:
library(plyr)
daply(x, c("Date", "Stock Ticker", "Country"), function(y) y$Price)
The first argument to daply is your data frame, the second is the variables you want to use as dimensions (perhaps you need to change the space to a dot, depending on how you read your data), and the third is the function which computes the array value from the dataframe rows.
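For completeness, reshape2 offers a similar one-liner (my addition, not part of the answers above; it assumes the dotted column name Stock.Ticker produced by default name checking):
library(reshape2)
acast(x, Date ~ Stock.Ticker ~ Country, value.var = "Price")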
