Generating equations from factors in R [duplicate]

This question already has answers here:
Google Docs exports spreadsheet values with commas. read.csv() in R treats these as factors instead of numeric
(2 answers)
Closed 8 years ago.
I am fairly new to R, and I am trying to create a new column, which is one column minus another column. For example:
price <- c("$10.00", "$7.15", "$8.75", "12.00", "9.20")
quantity <- c(5, 6, 7, 8, 9)
price <- as.factor(price)
quantity <- as.factor(quantity)
df <- data.frame(price, quantity)
In my actual data set, all the columns imported as factors. When I try to create the new column I get this:
diff <- price - quantity
Warning message:
In Ops.factor(price, quantity) : '-' not meaningful for factors
I have tried to coerce the data to numeric using as.numeric(df), as.numeric(levels(df)), and as.numeric(levels(df))[df], and I have tried setting stringsAsFactors to FALSE, but the data gets converted to NAs. data.matrix() changes the values. Is there another way to get the above calculation to work? Thanks!

You should avoid quotation marks and the $ sign in the price column, and avoid converting the columns to factors, if you want to do math operations on them (the $ is also why your as.numeric() attempts produce NAs: "$10.00" cannot be parsed as a number):
price <- c(10.00, 7.15, 8.75, 12.00, 9.20)
quantity <- c(5, 6, 7, 8, 9)
df <- data.frame(price, quantity)
df$diff <- price - quantity
df
price quantity diff
1 10.00 5 5.00
2 7.15 6 1.15
3 8.75 7 1.75
4 12.00 8 4.00
5 9.20 9 0.20

Try:
as.numeric(gsub("^\\$","", price))-as.numeric(as.character(quantity))
#[1] 5.00 1.15 1.75 4.00 0.20
Or, starting from df:
df$diff <- Reduce(`-`,lapply(df, function(x) as.numeric(gsub("^\\$","",x))))
df$diff
#[1] 5.00 1.15 1.75 4.00 0.20

If you're stuck with factor columns, you could add a new diff column with within() and some type coercion
> within(df, {
+     diff <- as.numeric(gsub("[$]", "", price)) -
+         as.numeric(as.character(quantity))
+ })
# price quantity diff
# 1 $10.00 5 5.00
# 2 $7.15 6 1.15
# 3 $8.75 7 1.75
# 4 12.00 8 4.00
# 5 9.20 9 0.20
You may also consider going back and re-reading the data into R. It's simple, and will make things a little easier. Here's how you could do it and get the desired result that way.
Create a data file: This won't be necessary for you, since you can just read the original file again.
> write.table(df, "df.txt")
Read the data into R, remove the $ sign, and calculate the difference:
> df2 <- read.table("df.txt", stringsAsFactors = FALSE)
> df2$price <- as.numeric(gsub("[$]", "", df2$price))
> with(df2, { price - quantity })
# [1] 5.00 1.15 1.75 4.00 0.20
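Since the real data set has every column imported as a factor, a minimal one-pass sketch (my own, assuming every column is meant to be numeric and allowing for a leading "$" or thousands commas) would be:
df[] <- lapply(df, function(x) as.numeric(gsub("[$,]", "", as.character(x))))
df$diff <- df$price - df$quantity
Going through the character representation first avoids the factor-levels-as-integer-codes trap mentioned in the question.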

Related

Elegant way of adding columns on a specific position in a data frame

I have a data.frame with 3 cols: date, rate, price. I want to add columns that come from a matrix, after rate and before price.
library(tibble)
df = tibble('date' = c('01/01/2000', '02/01/2000', '03/01/2000'),
'rate' = c(7.50, 6.50, 5.54),
'price' = c(92, 94, 96))
I computed the lags of rate using a function that outputs a matrix:
rate_Lags = matrix(data = c(NA, 7.50, 5.54, NA, NA, 7.50), ncol = 2, dimnames = list(c(), c('rate_tMinus1', 'rate_tMinus2')))
I want to insert those lags after rate (and before price) using names indexing rather than column order.
The add_column function from the tibble package (see "Adding a column between two columns in a data.frame") does not work because it only accepts atomic vectors (so if I have 10 lags I would have to call add_column 10 times). I could use apply on my rate_Lags matrix; then, however, I lose the dimnames of rate_Lags.
Using number indexing (subsetting) (https://stat.ethz.ch/pipermail/r-help/2011-August/285534.html) could work if I knew the position of a specific column name (any function that retrieves the position of a column name?).
Is there any simple way of inserting a bunch of columns in a specific position in a data frame/tibble object?
You may be overlooking the following
library(dplyr)
I <- which(names(df) == "rate")
if (I == ncol(df)) {
cbind(df, rate_Lags)
} else {
cbind(select(df, 1:I), rate_Lags, select(df, (I+1):ncol(df)))
}
# date rate rate_tMinus1 rate_tMinus2 price
# 1 01/01/2000 7.50 NA NA 92
# 2 02/01/2000 6.50 7.50 NA 94
# 3 03/01/2000 5.54 5.54 7.5 96
Maybe this is not very elegant, but you only call the function once and I believe it's more or less general purpose.
fun <- function(DF, M){
nms_DF <- colnames(DF)
nms_M <- colnames(M)
inx <- which(sapply(nms_DF, function(x) length(grep(x, nms_M)) > 0))
cbind(DF[seq_len(inx)], M, DF[ seq_along(nms_DF)[-seq_len(inx)] ])
}
fun(df, rate_Lags)
# date rate rate_tMinus1 rate_tMinus2 price
#1 01/01/2000 7.50 NA NA 92
#2 02/01/2000 6.50 7.50 NA 94
#3 03/01/2000 5.54 5.54 7.5 96
We could unclass the dataset to a list, use append to insert 'rate_Lags' at the desired position, and then convert the list back to a data.frame
i1 <- match('rate', names(df))
data.frame(append(unclass(df), as.data.frame(rate_Lags), after = i1))
# date rate rate_tMinus1 rate_tMinus2 price
#1 01/01/2000 7.50 NA NA 92
#2 02/01/2000 6.50 7.50 NA 94
#3 03/01/2000 5.54 5.54 7.5 96
Or with tidyverse
library(tidyverse)
rate_Lags %>%
as_tibble %>%
append(unclass(df), ., after = i1) %>%
bind_cols
# A tibble: 3 x 5
# date rate rate_tMinus1 rate_tMinus2 price
# <chr> <dbl> <dbl> <dbl> <dbl>
#1 01/01/2000 7.5 NA NA 92
#2 02/01/2000 6.5 7.5 NA 94
#3 03/01/2000 5.54 5.54 7.5 96

Display a random sample without subsetting from the main dataframe

I have a final dataset of roughly 150,000 rows by 40 columns that covers all my potential samples from 1932 to 2016, and I need to make a random selection of 53 samples per year, for a total of ~5,000.
The selection itself is really straightforward using the sample() function to get a subset; however, I need to mark the selection in the original dataframe to be able to check various things. My issue is the following:
If I edit one of the fields in my random subset and merge it back with the main one, it creates duplicates that I can't remove, because one field changed and R therefore no longer considers the two rows duplicates. If I don't edit anything, I can't find which rows were selected.
My solution for now was to merge everything in Excel instead of R, apply color codes to highlight the selected rows and delete the duplicates manually. However, it's time consuming, prone to mistakes and not practicable, as the dataset seems to be too big and my PC quickly runs out of memory when I try...
UPDATE:
Here's a reproducible example:
dat <- data.frame(
X = sample(2000:2016, 50, replace=TRUE),
Y = sample(c("yes", "no"), 50, replace = TRUE),
Z = sample(c("french","german","english"), 50, replace=TRUE)
)
dat2 <- subset(dat, dat$X==2000) #samples of year 2000
sc <- dat2[sample(nrow(dat2), 1), ] #Random selection of 1
What I would like to do is make the selection directly in the main dataset (dat), for example by randomly assigning the value "1" in a column called "selection". Or, if that is not possible, how can I merge the sampled rows (here called "sc") back into the main dataset with something indicating they have been sampled?
Note:
I've been using R sporadically for the last 2 years and I'm a fairly inexperienced user, so I apologize if this is a silly question. I've been roaming Google and SO for the last 3 days and couldn't find any relevant answer yet.
I recently got in a PhD program in biology that requires me to handle a lot of data from an archive.
EDIT: updated based on comments.
You could add a column that indicates if a row is part of your sample. So maybe try the following:
df = data.frame(year= c(1,1,1,1,1,1,2,2,2,2,2,2), id=c(1,2,3,4,5,6,7,8,9,10,11,12),age=c(7,7,7,12,12,12,7,7,7,12,12,12))
library(dplyr)
n_per_year_low_age = 2
n_per_year_high_age = 1
df <- df %>% group_by(year) %>%
mutate(in_sample1 = as.numeric(id %in% sample(id[age<8],n_per_year_low_age))) %>%
mutate(in_sample2 = as.numeric(id %in% sample(id[age>8],n_per_year_high_age))) %>%
mutate(in_sample = in_sample1+in_sample2) %>%
select(-in_sample1,-in_sample2)
Output:
# A tibble: 12 x 4
# Groups: year [2]
year id age in_sample
<dbl> <dbl> <dbl> <dbl>
1 1.00 1.00 7.00 1.00
2 1.00 2.00 7.00 1.00
3 1.00 3.00 7.00 0
4 1.00 4.00 12.0 1.00
5 1.00 5.00 12.0 0
6 1.00 6.00 12.0 0
7 2.00 7.00 7.00 1.00
8 2.00 8.00 7.00 0
9 2.00 9.00 7.00 1.00
10 2.00 10.0 12.0 0
11 2.00 11.0 12.0 0
12 2.00 12.0 12.0 1.00
Further operations are then trivial:
# extracting your sample
df %>% filter(in_sample==1)
# comparing statistics of your sample against the rest of the population
df %>% group_by(year,in_sample) %>% summarize(mean(id))
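For the original problem (53 samples per year from ~150,000 rows), a base-R sketch along the same lines might look like this, assuming the year column is called X as in the reproducible example; using sample.int() on the index length avoids the surprising behaviour of sample() when a group happens to contain a single row:
dat$selection <- 0
picked <- unlist(lapply(split(seq_len(nrow(dat)), dat$X),
                        function(idx) idx[sample.int(length(idx), min(length(idx), 53))]))
dat$selection[picked] <- 1
table(dat$X, dat$selection)  # how many rows were flagged per year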

How to find out the most occurring range in a list

I plotted a graph in R:
OBD=read.csv("OBD.CSV",header = TRUE,stringsAsFactors=FALSE)
x1 <- OBD$Time1
x2 <- OBD$Time2
y1<-OBD$Vehicle_speed
y2 <-OBD$Engine_speed
par(mar=c(5,4,4,5)+.1)
plot(x1,y1,type="l",col="yellow",ylab = "Vehicle speed")
par(new=TRUE)
plot(x2,y2,type="l",col="blue4",xaxt="n",yaxt="n",xlab="Time",ylab="")
axis(4)
mtext("Engine speed",side=4,line=3)
legend("topleft",col=c("blue4","yellow"),lty=1,legend=c("y1","y2"))
Sample data, CSV format:
Vehicle_speed,Time1,Engine_speed,Time2,Engine_torq,Time3,Acc_pedal,Time4,Eng_fuel_rate,Time5
4.98,0,650,0,11,0,0,0,1.15,0
4.98,0,650,0,11,0,0,0,1.2,0.002
4.96,0,650,0.001,11,0.001,0,0.001,1.2,0.003
4.96,0,651,0.001,11,0.001,0,0.001,1.2,0.005
4.94,0.001,651,0.001,11,0.001,0,0.001,1.2,0.007
4.94,0.001,651,0.001,11,0.001,0,0.002,1.2,0.008
4.91,0.001,650.5,0.001,11,0.001,0,0.002,1.2,0.01
4.91,0.001,650.5,0.001,11,0.001,0,0.002,1.2,0.012
4.89,0.001,650.5,0.002,11,0.002,0,0.003,1.15,0.013
4.89,0.001,650.5,0.002,11,0.002,0,0.003,1.15,0.015
4.87,0.002,649.5,0.002,11,0.002,0,0.003,1.15,0.017
4.87,0.002,649.5,0.002,11,0.002,0,0.004,1.15,0.018
4.85,0.002,650,0.002,11,0.002,0,0.004,1.15,0.02
4.85,0.002,650,0.002,11,0.002,0,0.004,1.15,0.022
4.82,0.002,650,0.003,11,0.003,0,0.005,1.2,0.023
From this table, I just want to find the most occurring engine speed and vehicle speed, or the most occurring range.
To find the most common (modal) vehicle speed, you can pull this from table() (here df is the data frame read from OBD.CSV):
mySpeeds <- table(df$Vehicle_speed)
modeSpeed <- as.numeric(names(mySpeeds)[which.max(mySpeeds)])
modeSpeed
[1] 4.85
To get such a value for a range of speeds, you should use cut:
# get range categories
df$speedRange <- cut(df$Vehicle_speed, breaks=c(-Inf, 4.85, 4.90, 4.95, Inf))
mySpeedsRange <- table(df$speedRange)
modeSpeedRange <- names(mySpeedsRange)[which.max(mySpeedsRange)]
modeSpeedRange
[1] "(4.85,4.9]"
cut takes a numeric variable and returns a factor variable based on the second (breaks) argument. You can supply breaks with a single number indicating the number of breaks, or a vector, indicating the unique cut points. I included -Inf and Inf to ensure full coverage.
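For example, with a single number you let cut() choose equal-width bins itself (illustrative, three bins):
table(cut(df$Vehicle_speed, breaks = 3))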
OBD <- read.csv(text = "Vehicle_speed,Time1,Engine_speed,Time2,Engine_torq,Time3,Acc_pedal,Time4,Eng_fuel_rate,Time5
4.98,0,650,0,11,0,0,0,1.15,0
4.98,0,650,0,11,0,0,0,1.2,0.002
4.96,0,650,0.001,11,0.001,0,0.001,1.2,0.003
4.96,0,651,0.001,11,0.001,0,0.001,1.2,0.005
4.94,0.001,651,0.001,11,0.001,0,0.001,1.2,0.007
4.94,0.001,651,0.001,11,0.001,0,0.002,1.2,0.008
4.91,0.001,650.5,0.001,11,0.001,0,0.002,1.2,0.01
4.91,0.001,650.5,0.001,11,0.001,0,0.002,1.2,0.012
4.89,0.001,650.5,0.002,11,0.002,0,0.003,1.15,0.013
4.89,0.001,650.5,0.002,11,0.002,0,0.003,1.15,0.015
4.87,0.002,649.5,0.002,11,0.002,0,0.003,1.15,0.017
4.87,0.002,649.5,0.002,11,0.002,0,0.004,1.15,0.018
4.85,0.002,650,0.002,11,0.002,0,0.004,1.15,0.02
4.85,0.002,650,0.002,11,0.002,0,0.004,1.15,0.022
4.82,0.002,650,0.003,11,0.003,0,0.005,1.2,0.023")
> table(OBD$Engine_speed)
649.5 650 650.5 651
2 6 4 3
Or for a couple of columns:
tables <- apply(OBD[ ,c(1,3,5)], 2, table)
> tables
$Vehicle_speed
4.82 4.85 4.87 4.89 4.91 4.94 4.96 4.98
1 2 2 2 2 2 2 2
$Engine_speed
649.5 650 650.5 651
2 6 4 3
$Engine_torq
11
15
To get only the most occurring:
> lapply(tables, which.max)
$Vehicle_speed
4.85
2
$Engine_speed
650
2
$Engine_torq
11
1
Does this solve the problem?
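If you want the most occurring range for several columns at once, a small helper (hypothetical, assuming equal-width bins are acceptable) can combine cut() and table():
modal_range <- function(x, n_bins = 5) {
  bins <- cut(x, breaks = n_bins)   # bin the values into n_bins equal-width ranges
  names(which.max(table(bins)))     # label of the most frequent bin
}
sapply(OBD[c("Vehicle_speed", "Engine_speed")], modal_range)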

Adding NA's to a vector

Let's say I have a vector of prices:
foo <- c(102.25,102.87,102.25,100.87,103.44,103.87,103.00)
I want to get the percent change from x periods ago and, say, store it into another vector that I'll call log_returns. I can't bind vectors foo and log_returns into a data.frame because the vectors are not the same length. So I want to be able to append NA's to log_returns so I can put them in a data.frame. I figured out one way to append an NA at the end of the vector:
log_returns <- append((diff(log(foo), lag = 1)),NA,after=length(foo))
But that only helps if I'm looking at percent change 1 period before. I'm looking for a way to fill in NA's no matter how many lags I throw in so that the percent change vector is equal in length to the foo vector
Any help would be much appreciated!
You could use your own modification of diff:
mydiff <- function(data, diff){
c(diff(data, lag = diff), rep(NA, diff))
}
mydiff(foo, 1)
[1] 0.62 -0.62 -1.38 2.57 0.43 -0.87 NA
data.frame(foo = foo, diff = mydiff(foo, 3))
foo diff
1 102.25 -1.38
2 102.87 0.57
3 102.25 1.62
4 100.87 2.13
5 103.44 NA
6 103.87 NA
7 103.00 NA
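Since the question was specifically about log returns, a variant of the same idea (my own sketch, padding with leading NAs so the return in row i is measured against row i - k) would be:
log_returns <- function(x, k = 1) {
  c(rep(NA, k), diff(log(x), lag = k))  # pad the first k positions with NA
}
data.frame(foo = foo, ret1 = log_returns(foo, 1), ret3 = log_returns(foo, 3))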
Let's say you have the numbers 1 to 10 arranged in matrix form, as a 5-row, 2-column matrix, and you want the whole 2nd column assigned NA. First make the 5*2 matrix from the elements 1:10:
Array_test <- array(c(1:10), dim = c(5, 2, 1))
Array_test
Array_test[ , 2, ] <- NA  # set the whole 2nd column to NA
Array_test
# Similarly, to make only one element of the matrix NA,
# say the 4th row, 2nd column:
Array_test[4, 2, ] <- NA

Find the 2 max values for each factor in R

I have a question about finding, for each unique ID, the two largest values in the weight column and then calculating the mean of the layer column over those two rows. A sample of my data is here:
ID layer weight
1 0.6843629 0.35
1 0.6360772 0.70
1 0.6392318 0.14
2 0.3848640 0.05
2 0.3882660 0.30
2 0.3877026 0.10
2 0.3964194 0.60
2 0.4273218 0.02
2 0.3869507 0.12
3 0.4748541 0.07
3 0.5853659 0.42
3 0.5383678 0.10
3 0.6060287 0.60
4 0.4859274 0.08
4 0.4720740 0.48
4 0.5126481 0.08
4 0.5280899 0.48
5 0.7492097 0.07
5 0.7220433 0.35
5 0.8750000 0.10
5 0.8302752 0.50
6 0.4306283 0.10
6 0.4890895 0.25
6 0.3790714 0.20
6 0.5139686 0.50
6 0.3885678 0.02
6 0.4706815 0.05
For each ID, I want to calculate the mean value of layer, using only the rows with the two highest weights.
I can do this with the following code in R:
library(plyr)
library(data.table)
ind.max1 <- ddply(index1, "ID", function(x) x[which.max(x$weight),])
dt1 <- data.table(index1, key=c("layer"))
dt2 <- data.table(ind.max1, key=c("layer"))
index2 <- dt1[!dt2]
ind.max2 <- ddply(index2, "ID", function(x) x[which.max(x$weight),])
ind.max.all <- merge(ind.max1, ind.max2, all=TRUE)
ind.ndvi.mean <- as.data.frame(tapply(ind.max.all$layer, list(ind.max.all$ID), mean))
This uses ddply to select the row with the highest weight per ID and puts it into a data frame along with layer. These rows are then removed from the original data frame using data.table, the ddply max-selection is repeated to get the second-highest weight per ID, and the two data frames are merged into one. Finally, the mean is computed with tapply.
There must be a more efficient way to do this. Does anyone have any insight? Cheers.
You could use data.table
library(data.table)
setDT(dat)[, list(Meanlayer = mean(layer[order(-weight)[1:2]])), by = ID]
# ID Meanlayer
#1: 1 0.6602200
#2: 2 0.3923427
#3: 3 0.5956973
#4: 4 0.5000819
#5: 5 0.7761593
#6: 6 0.5015291
Order the weight column in descending order (-weight).
Within each ID group, take the first two positions of that ordering ([1:2]).
Subset the corresponding layer values with layer[order(-weight)[1:2]].
Take the mean.
Alternatively, in data.table 1.9.3 (the development version at the time) and later versions, a function setorder is exported for reordering data.tables in any order, by reference:
require(data.table) ## 1.9.3+
setorder(setDT(dat), ID, -weight) ## dat is now reordered as we require
dat[, mean(layer[1:min(.N, 2L)]), by=ID]
By ordering first, we avoid the call to order() for each group (unique value in ID). This'll be more advantageous with more groups. And setorder() is much more efficient than order() as it doesn't need to create a copy of your data.
This actually is a question for StackOverflow... anyway!
Don't know if the version below is efficient enough for you...
s.ind<-tapply(df$weight,df$ID,function(x) order(x,decreasing=T))
val<-tapply(df$layer,df$ID,function(x) x)
foo<-function(x,y) list(x[y][1:2])
lapply(mapply(foo,val,s.ind),mean)
I think this will do it. Assuming the data is called dat,
> sapply(split(dat, dat$ID), function(x) {
with(x, {
mean(layer[ weight %in% rev(sort(weight))[1:2] ])
})
})
# 1 2 3 4 5 6
# 0.6602200 0.3923427 0.5956973 0.5000819 0.7761593 0.5015291
You'll likely want to include na.rm = TRUE as the second argument to mean to account for any rows that contain NA values.
Alternatively, mapply is probably faster, and has the exact same code just in a different order,
mapply(function(x) {
with(x, {
mean(layer[ weight %in% rev(sort(weight))[1:2] ])
})
}, split(dat, dat$ID))
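For completeness, a dplyr sketch (assuming dplyr >= 1.0.0 for slice_max()) should reproduce the same per-ID means:
library(dplyr)
dat %>%
  group_by(ID) %>%
  slice_max(weight, n = 2, with_ties = FALSE) %>%  # keep the two heaviest rows per ID
  summarise(mean_layer = mean(layer))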
