I have a data.frame with three columns: date, rate, and price. I want to add columns that come from a matrix, after rate and before price.
library(tibble)
df = tibble('date' = c('01/01/2000', '02/01/2000', '03/01/2000'),
'rate' = c(7.50, 6.50, 5.54),
'price' = c(92, 94, 96))
I computed the lags of rate using a function that outputs a matrix:
rate_Lags = matrix(data = c(NA, 7.50, 5.54, NA, NA, 7.50), ncol = 2, dimnames = list(c(), c('rate_tMinus1', 'rate_tMinus2')))
I want to insert those lags after rate (and before price) using names indexing rather than column order.
The add_column function from the tibble package (Adding a column between two columns in a data.frame) does not work because it only accepts atomic vectors (hence, if I have 10 lags, I would have to call add_column 10 times). I could use apply over my rate_Lags matrix, but then I lose the dimnames of rate_Lags.
Using number indexing (subsetting) (https://stat.ethz.ch/pipermail/r-help/2011-August/285534.html) could work if I knew the position of a specific column name (is there any function that retrieves the position of a column name?).
Is there any simple way of inserting a bunch of columns in a specific position in a data frame/tibble object?
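(On the parenthetical question above: match() or which() over names() returns a column's position by name; a minimal illustration with the df defined above:)
match('rate', names(df))        # 2
which(names(df) == 'rate')      # 2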
You may be overlooking the following:
library(dplyr)
I <- which(names(df) == "rate")   # position of the "rate" column
if (I == ncol(df)) {
  # "rate" is the last column: just append the lag columns at the end
  cbind(df, rate_Lags)
} else {
  # otherwise splice the lag columns in between "rate" and the remaining columns
  cbind(select(df, 1:I), rate_Lags, select(df, (I+1):ncol(df)))
}
#         date rate rate_tMinus1 rate_tMinus2 price
# 1 01/01/2000 7.50           NA           NA    92
# 2 02/01/2000 6.50         7.50           NA    94
# 3 03/01/2000 5.54         5.54          7.5    96
Maybe this is not very elegant, but you only call the function once, and I believe it's more or less general-purpose.
fun <- function(DF, M){
  nms_DF <- colnames(DF)
  nms_M <- colnames(M)
  # position of the DF column whose name occurs inside the column names of M
  # (assumes exactly one such column: here "rate" in "rate_tMinus1"/"rate_tMinus2")
  inx <- which(sapply(nms_DF, function(x) length(grep(x, nms_M)) > 0))
  # columns up to and including that position, then M, then the remaining columns
  cbind(DF[seq_len(inx)], M, DF[seq_along(nms_DF)[-seq_len(inx)]])
}
fun(df, rate_Lags)
# date rate rate_tMinus1 rate_tMinus2 price
#1 01/01/2000 7.50 NA NA 92
#2 02/01/2000 6.50 7.50 NA 94
#3 03/01/2000 5.54 5.54 7.5 96
We could unclass the dataset to a list, use append to insert 'rate_Lags' (as a data.frame) right after 'rate', and then convert the list back to a data.frame:
i1 <- match('rate', names(df))
data.frame(append(unclass(df), as.data.frame(rate_Lags), after = i1))
# date rate rate_tMinus1 rate_tMinus2 price
#1 01/01/2000 7.50 NA NA 92
#2 02/01/2000 6.50 7.50 NA 94
#3 03/01/2000 5.54 5.54 7.5 96
Or with the tidyverse:
library(tidyverse)
rate_Lags %>%
  as_tibble() %>%
  append(unclass(df), ., after = i1) %>%
  bind_cols()
# A tibble: 3 x 5
# date rate rate_tMinus1 rate_tMinus2 price
# <chr> <dbl> <dbl> <dbl> <dbl>
#1 01/01/2000 7.5 NA NA 92
#2 02/01/2000 6.5 7.5 NA 94
#3 03/01/2000 5.54 5.54 7.5 96
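As an aside, if you have a recent dplyr (1.0.0 or later — an assumption; this is not part of the answers above), bind_cols() plus relocate() achieves the same insertion by name:
library(dplyr)
# bind the lag columns at the end, then move them right after 'rate'
bind_cols(df, as_tibble(rate_Lags)) %>%
  relocate(starts_with("rate_tMinus"), .after = rate)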
I have a final dataset of roughly 150,000 rows by 40 columns that covers all my potential samples from 1932 to 2016, and I need to make a random selection of 53 samples per year, for a total of ~5,000.
The selection itself is really straightforward using the sample() function to get a subset; however, I need to display the selection in the original dataframe to be able to check various things. My issue is the following:
If I edit one of the fields in my random subset and merge it back with the main one, it creates duplicates that I can't remove, because one field changed and R therefore no longer considers the two rows duplicates. If I don't edit anything, I can't tell which rows were selected.
My solution for now was to merge everything in Excel instead of R, apply color codes to highlight the selected rows and delete the duplicates manually. However, it's time-consuming, prone to mistakes and not practicable, as the dataset seems to be too big and my PC quickly runs out of memory when I try...
UPDATE:
Here's a reproducible example:
dat <- data.frame(
X = sample(2000:2016, 50, replace=TRUE),
Y = sample(c("yes", "no"), 50, replace = TRUE),
Z = sample(c("french","german","english"), 50, replace=TRUE)
)
dat2 <- subset(dat, X == 2000)  # samples of year 2000
sc <- dat2[sample(nrow(dat2), 1), ] #Random selection of 1
What I would like to do is select directly in the dataset (dat), for example by randomly assigning the value "1" in a column called "selection". Or, if that is not possible, how can I merge the sampled rows (here called "sc") back into the main dataset with something indicating that they have been sampled?
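For what it's worth, a minimal sketch of the first idea, assuming the row names of sc are still the original ones from dat:
dat$selection <- 0                    # default: not selected
dat[rownames(sc), "selection"] <- 1   # flag the sampled rows via their original row names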
Note:
I've been using R sporadically for the last 2 years and I'm a fairly inexperienced user, so I apologize if this is a silly question. I've been roaming Google and SO for the last 3 days and couldn't find any relevant answer yet.
I recently got into a PhD program in biology that requires me to handle a lot of data from an archive.
EDIT: updated based on comments.
You could add a column that indicates if a row is part of your sample. So maybe try the following:
df = data.frame(year= c(1,1,1,1,1,1,2,2,2,2,2,2), id=c(1,2,3,4,5,6,7,8,9,10,11,12),age=c(7,7,7,12,12,12,7,7,7,12,12,12))
library(dplyr)
n_per_year_low_age = 2   # how many low-age ids to sample per year
n_per_year_high_age = 1  # how many high-age ids to sample per year
df <- df %>% group_by(year) %>%
  # flag n_per_year_low_age randomly chosen ids with age < 8
  mutate(in_sample1 = as.numeric(id %in% sample(id[age < 8], n_per_year_low_age))) %>%
  # flag n_per_year_high_age randomly chosen ids with age > 8
  mutate(in_sample2 = as.numeric(id %in% sample(id[age > 8], n_per_year_high_age))) %>%
  # combine both flags into a single indicator column
  mutate(in_sample = in_sample1 + in_sample2) %>%
  select(-in_sample1, -in_sample2)
Output:
# A tibble: 12 x 4
# Groups: year [2]
year id age in_sample
<dbl> <dbl> <dbl> <dbl>
1 1.00 1.00 7.00 1.00
2 1.00 2.00 7.00 1.00
3 1.00 3.00 7.00 0
4 1.00 4.00 12.0 1.00
5 1.00 5.00 12.0 0
6 1.00 6.00 12.0 0
7 2.00 7.00 7.00 1.00
8 2.00 8.00 7.00 0
9 2.00 9.00 7.00 1.00
10 2.00 10.0 12.0 0
11 2.00 11.0 12.0 0
12 2.00 12.0 12.0 1.00
Further operations are then trivial:
# extracting your sample
df %>% filter(in_sample==1)
# comparing statistics of your sample against the rest of the population
df %>% group_by(year,in_sample) %>% summarize(mean(id))
I plotted a graph in R:
OBD=read.csv("OBD.CSV",header = TRUE,stringsAsFactors=FALSE)
x1 <- OBD$Time1
x2 <- OBD$Time2
y1<-OBD$Vehicle_speed
y2 <-OBD$Engine_speed
par(mar = c(5, 4, 4, 5) + .1)
# vehicle speed against the left axis
plot(x1, y1, type = "l", col = "yellow", xlab = "Time", ylab = "Vehicle speed")
par(new = TRUE)
# engine speed overlaid, with its axis drawn on the right
plot(x2, y2, type = "l", col = "blue4", xaxt = "n", yaxt = "n", xlab = "", ylab = "")
axis(4)
mtext("Engine speed", side = 4, line = 3)
legend("topleft", col = c("yellow", "blue4"), lty = 1,
       legend = c("Vehicle speed", "Engine speed"))
Sample data, CSV format:
Vehicle_speed,Time1,Engine_speed,Time2,Engine_torq,Time3,Acc_pedal,Time4,Eng_fuel_rate,Time5
4.98,0,650,0,11,0,0,0,1.15,0
4.98,0,650,0,11,0,0,0,1.2,0.002
4.96,0,650,0.001,11,0.001,0,0.001,1.2,0.003
4.96,0,651,0.001,11,0.001,0,0.001,1.2,0.005
4.94,0.001,651,0.001,11,0.001,0,0.001,1.2,0.007
4.94,0.001,651,0.001,11,0.001,0,0.002,1.2,0.008
4.91,0.001,650.5,0.001,11,0.001,0,0.002,1.2,0.01
4.91,0.001,650.5,0.001,11,0.001,0,0.002,1.2,0.012
4.89,0.001,650.5,0.002,11,0.002,0,0.003,1.15,0.013
4.89,0.001,650.5,0.002,11,0.002,0,0.003,1.15,0.015
4.87,0.002,649.5,0.002,11,0.002,0,0.003,1.15,0.017
4.87,0.002,649.5,0.002,11,0.002,0,0.004,1.15,0.018
4.85,0.002,650,0.002,11,0.002,0,0.004,1.15,0.02
4.85,0.002,650,0.002,11,0.002,0,0.004,1.15,0.022
4.82,0.002,650,0.003,11,0.003,0,0.005,1.2,0.023
From this table, I just want to find the most frequently occurring engine speed and vehicle speed, or the most frequently occurring range.
To find the most common (mode) vehicle speed, you can pull this from table():
mySpeeds <- table(OBD$Vehicle_speed)
modeSpeed <- as.numeric(names(mySpeeds)[which.max(mySpeeds)])
modeSpeed
[1] 4.85
To get such a value for a range of speeds, you should use cut():
# get range categories
OBD$speedRange <- cut(OBD$Vehicle_speed, breaks = c(-Inf, 4.85, 4.90, 4.95, Inf))
mySpeedsRange <- table(OBD$speedRange)
modeSpeedRange <- names(mySpeedsRange)[which.max(mySpeedsRange)]
modeSpeedRange
[1] "(4.85,4.9]"
cut takes a numeric variable and returns a factor based on the second (breaks) argument. You can supply breaks either as a single number indicating how many intervals to create, or as a vector of the cut points themselves. I included -Inf and Inf to ensure full coverage.
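For instance, a quick sketch of the single-number form (the choice of three bins here is arbitrary):
# three equal-width bins over the observed range of Vehicle_speed
table(cut(OBD$Vehicle_speed, breaks = 3))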
OBD <- read.csv(text = "Vehicle_speed,Time1,Engine_speed,Time2,Engine_torq,Time3,Acc_pedal,Time4,Eng_fuel_rate,Time5
4.98,0,650,0,11,0,0,0,1.15,0
4.98,0,650,0,11,0,0,0,1.2,0.002
4.96,0,650,0.001,11,0.001,0,0.001,1.2,0.003
4.96,0,651,0.001,11,0.001,0,0.001,1.2,0.005
4.94,0.001,651,0.001,11,0.001,0,0.001,1.2,0.007
4.94,0.001,651,0.001,11,0.001,0,0.002,1.2,0.008
4.91,0.001,650.5,0.001,11,0.001,0,0.002,1.2,0.01
4.91,0.001,650.5,0.001,11,0.001,0,0.002,1.2,0.012
4.89,0.001,650.5,0.002,11,0.002,0,0.003,1.15,0.013
4.89,0.001,650.5,0.002,11,0.002,0,0.003,1.15,0.015
4.87,0.002,649.5,0.002,11,0.002,0,0.003,1.15,0.017
4.87,0.002,649.5,0.002,11,0.002,0,0.004,1.15,0.018
4.85,0.002,650,0.002,11,0.002,0,0.004,1.15,0.02
4.85,0.002,650,0.002,11,0.002,0,0.004,1.15,0.022
4.82,0.002,650,0.003,11,0.003,0,0.005,1.2,0.023")
> table(OBD$Engine_speed)
649.5 650 650.5 651
2 6 4 3
Or for a couple of columns:
tables <- apply(OBD[ ,c(1,3,5)], 2, table)
> tables
$Vehicle_speed
4.82 4.85 4.87 4.89 4.91 4.94 4.96 4.98
1 2 2 2 2 2 2 2
$Engine_speed
649.5 650 650.5 651
2 6 4 3
$Engine_torq
11
15
To get only the most occurring:
> lapply(tables, which.max)
$Vehicle_speed
4.85
2
$Engine_speed
650
2
$Engine_torq
11
1
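If you want the value itself rather than its position within each table, a small variation on the above (a sketch):
sapply(tables, function(tt) names(tt)[which.max(tt)])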
Does this solve the problem?
I have a question about finding the two largest values of column C (weight) for each unique ID in column A, then calculating the mean of column B (layer). A sample of my data is here:
ID layer weight
1 0.6843629 0.35
1 0.6360772 0.70
1 0.6392318 0.14
2 0.3848640 0.05
2 0.3882660 0.30
2 0.3877026 0.10
2 0.3964194 0.60
2 0.4273218 0.02
2 0.3869507 0.12
3 0.4748541 0.07
3 0.5853659 0.42
3 0.5383678 0.10
3 0.6060287 0.60
4 0.4859274 0.08
4 0.4720740 0.48
4 0.5126481 0.08
4 0.5280899 0.48
5 0.7492097 0.07
5 0.7220433 0.35
5 0.8750000 0.10
5 0.8302752 0.50
6 0.4306283 0.10
6 0.4890895 0.25
6 0.3790714 0.20
6 0.5139686 0.50
6 0.3885678 0.02
6 0.4706815 0.05
For each ID, I want to calculate the mean value of layer, using only the rows with the two highest weights.
I can do this with the following code in R:
library(plyr)
library(data.table)
ind.max1 <- ddply(index1, "ID", function(x) x[which.max(x$weight),])
dt1 <- data.table(index1, key=c("layer"))
dt2 <- data.table(ind.max1, key=c("layer"))
index2 <- dt1[!dt2]
ind.max2 <- ddply(index2, "ID", function(x) x[which.max(x$weight),])
ind.max.all <- merge(ind.max1, ind.max2, all=TRUE)
ind.ndvi.mean <- as.data.frame(tapply(ind.max.all$layer, list(ind.max.all$ID), mean))
This uses ddply to select the row with the highest weight per ID and put it into a dataframe together with layer. Then those highest-weight rows are removed from the original dataframe using data.table. I then repeat the ddply max selection, merge the two max-weight dataframes into one, and finally compute the mean with tapply.
There must be a more efficient way to do this. Does anyone have any insight? Cheers.
You could use data.table:
library(data.table)
setDT(dat)[, .(Meanlayer = mean(layer[order(-weight)[1:2]])), by = ID]
# ID Meanlayer
#1: 1 0.6602200
#2: 2 0.3923427
#3: 3 0.5956973
#4: 4 0.5000819
#5: 5 0.7761593
#6: 6 0.5015291
Order the weight column in descending order (-weight).
Select the first two positions from that ordering ([1:2]), within each group defined by ID.
Subset the corresponding layer rows using that index (layer[order(-weight)[1:2]]).
Take the mean (see the dplyr sketch below for comparison).
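For comparison, a rough dplyr equivalent of the same steps (a sketch, assuming dplyr 1.0.0+ for slice_max(); not part of the original answer):
library(dplyr)
dat %>%
  group_by(ID) %>%
  slice_max(weight, n = 2, with_ties = FALSE) %>%  # keep the two highest-weight rows per ID
  summarise(Meanlayer = mean(layer))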
Alternatively, in 1.9.3 (the current development version), or from the next version on, a function setorder is exported for reordering data.tables in any order, by reference:
require(data.table) ## 1.9.3+
setorder(setDT(dat), ID, -weight) ## dat is now reordered as we require
dat[, mean(layer[1:min(.N, 2L)]), by=ID]
By ordering first, we avoid the call to order() for each group (unique value in ID). This'll be more advantageous with more groups. And setorder() is much more efficient than order() as it doesn't need to create a copy of your data.
This actually is a question for StackOverflow... anyway!
Don't know if the version below is efficient enough for you...
# order of the weights within each ID, largest first
s.ind <- tapply(df$weight, df$ID, function(x) order(x, decreasing = TRUE))
# layer values split by ID
val <- tapply(df$layer, df$ID, function(x) x)
# keep the layer values sitting at the two highest-weight positions
foo <- function(x, y) list(x[y][1:2])
lapply(mapply(foo, val, s.ind), mean)
I think this will do it. Assuming the data is called dat,
> sapply(split(dat, dat$ID), function(x) {
with(x, {
mean(layer[ weight %in% rev(sort(weight))[1:2] ])
})
})
# 1 2 3 4 5 6
# 0.6602200 0.3923427 0.5956973 0.5000819 0.7761593 0.5015291
You'll likely want to include na.rm = TRUE as the second argument to mean to account for any rows that contain NA values.
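For example, the same call with na.rm = TRUE added (a sketch):
sapply(split(dat, dat$ID), function(x) {
  with(x, mean(layer[weight %in% rev(sort(weight))[1:2]], na.rm = TRUE))
})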
Alternatively, mapply is probably faster, and uses exactly the same code, just in a different order:
mapply(function(x) {
with(x, {
mean(layer[ weight %in% rev(sort(weight))[1:2] ])
})
}, split(dat, dat$ID))