The numbers in this data.frame are rounded to 3 decimal places:
habitats_df <- data.frame(
  habitat = c("beach", "grassland", "freshwater"),
  v1 = c(0.000, 0.670, 0.032),
  v2 = c(0.005, 0.824, 0.012)
)
habitat v1 v2
1 beach 0.000 0.005
2 grassland 0.670 0.824
3 freshwater 0.032 0.012
I need them rounded to 2 decimal places. I tried to use plyr::l_ply like this:
library(plyr)
l_ply(habitats_df[,2:3], function(x) round(x, 2))
But it didn't work. How can I use plyr::l_ply to round the numbers in habitats_df?
You don't really need plyr for this, since a simple lapply combined with round does the trick; l_ply is called purely for its side effects and discards its return value, which is why your attempt appeared to do nothing. I provide a solution in base R as well as in plyr.
Try this in base R:
roundIfNumeric <- function(x, n = 1) if (is.numeric(x)) round(x, n) else x
as.data.frame(
  lapply(habitats_df, roundIfNumeric, 2)
)
habitat v1 v2
1 beach 0.00 0.00
2 grassland 0.67 0.82
3 freshwater 0.03 0.01
And the same with plyr:
library(plyr)
quickdf(llply(habitats_df, roundIfNumeric, 2))
habitat v1 v2
1 beach 0.00 0.00
2 grassland 0.67 0.82
3 freshwater 0.03 0.01
# plyr alternative
library(plyr)
data.frame(habitat = habitats_df$habitat,
           numcolwise(.fun = function(x) round(x, 2))(habitats_df))
# habitat v1 v2
# 1 beach 0.00 0.00
# 2 grassland 0.67 0.82
# 3 freshwater 0.03 0.01
# base alternative
data.frame(habitat = habitats_df$habitat,
           lapply(habitats_df[ , -1], function(x) round(x, 2)))
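For completeness, current dplyr can do the same with across(); a sketch, assuming dplyr >= 1.0 (not used elsewhere in these answers):
library(dplyr)
# round every numeric column to 2 decimal places, leaving habitat untouched
habitats_df %>% mutate(across(where(is.numeric), ~ round(.x, 2)))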
New to R, so sorry if this is a bit broad, but I'm not really even sure where to start with an approach to this problem.
I have two dataframes, df1 containing demographic data from certain Census tracts.
AfricanAmerican AsianAmerican Hispanic White
Tract1 0.25 0.25 0.25 0.25
Tract2 0.50 0.10 0.20 0.10
Tract3 0.05 0.10 0.35 0.50
And df2 contains observation polygons, with the percentage of each census tract that makes up each polygon's area.
Poly1 Poly2 Poly3
Tract1 0.33 0.25 0.00
Tract2 0.33 0.25 0.10
Tract3 0.34 0.50 0.90
What I want to do is get the weighted averages of the demographic data in each observation polygon:
AfricanAmerican AsianAmerican Hispanic White
Poly1 0.26 0.15 0.27 0.29
Poly2 0.21 0.14 0.29 0.34
Poly3 0.10 0.10 0.34 0.46
So far I'm thinking I could do something like
sum(df1$AfricanAmerican * df2$Poly1)
Then I could use a for loop to iterate over all demographic variables for one polygon, and nest that in another for loop to iterate over all polygons. But given that I have hundreds of Census tracts and polygons in my working dataset, is there a better approach?
Use colSums of the products in mapply.
t(mapply(function(...) colSums(`*`(...)), list(df1), df2))
# AfricanAmerican AsianAmerican Hispanic White
# [1,] 0.2645 0.1495 0.2675 0.2855
# [2,] 0.2125 0.1375 0.2875 0.3375
# [3,] 0.0950 0.1000 0.3350 0.4600
If you want to round to two digits, just wrap round(..., 2) around it.
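Equivalently, since this weighted average is just a matrix product, crossprod gives the same result and may be easier to read; a sketch, assuming df1 and df2 share the same tract rows in the same order:
# t(df2) %*% df1: one row per polygon, one column per demographic group
crossprod(as.matrix(df2), as.matrix(df1))
#       AfricanAmerican AsianAmerican Hispanic  White
# Poly1          0.2645        0.1495   0.2675 0.2855
# Poly2          0.2125        0.1375   0.2875 0.3375
# Poly3          0.0950        0.1000   0.3350 0.4600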
Data:
df1 <- read.table(header=T, text='
AfricanAmerican AsianAmerican Hispanic White
Tract1 0.25 0.25 0.25 0.25
Tract2 0.50 0.10 0.20 0.10
Tract3 0.05 0.10 0.35 0.50
')
df2 <- read.table(header=T, text='
Poly1 Poly2 Poly3
Tract1 0.33 0.25 0.00
Tract2 0.33 0.25 0.10
Tract3 0.34 0.50 0.90
')
Libraries
library(tidyverse)
Sample Data
df1 <-
  tibble(
    Tract = paste0("Tract", 1:3),
    AfricanAmerican = c(.25, .5, .05),
    AsianAmerican = c(.25, .1, .1),
    Hispanic = c(.25, .2, .35)
  )
df2 <-
  tibble(
    Tract = paste0("Tract", 1:3),
    Poly1 = c(.33, .33, .34),
    Poly2 = c(.25, .25, .5),
    Poly3 = c(0, .1, .9)
  ) %>%
  # Pivot df2, making a single column for all Poly values
  pivot_longer(cols = -Tract, names_to = "Poly")
Code
df1 %>%
  # Join df1 and df2 by Tract
  left_join(df2) %>%
  # Group by Poly
  group_by(Poly) %>%
  # Weighted sum across the demographic columns AfricanAmerican to Hispanic
  summarise(across(AfricanAmerican:Hispanic, function(x) sum(x * value)))
Output
Joining, by = "Tract"
# A tibble: 3 x 4
Poly AfricanAmerican AsianAmerican Hispanic
<chr> <dbl> <dbl> <dbl>
1 Poly1 0.264 0.150 0.268
2 Poly2 0.212 0.138 0.288
3 Poly3 0.095 0.1 0.335
I have a large set of financial data that has hundreds of columns. I have cleaned and sorted the data based on date. Here is a simplified example:
df1 <- data.frame(matrix(vector(),ncol=5, nrow = 4))
colnames(df1) <- c("Date","0.4","0.3","0.2","0.1")
df1[1,] <- c("2000-01-31","0","0","0.05","0.07")
df1[2,] <- c("2000-02-29","0","0.13","0.17","0.09")
df1[3,] <- c("2000-03-31","0.03","0.09","0.21","0.01")
df1[4,] <- c("2004-04-30","0.05","0.03","0.19","0.03")
df1
Date 0.4 0.3 0.2 0.1
1 2000-01-31 0 0 0.05 0.07
2 2000-02-29 0 0.13 0.17 0.09
3 2000-03-31 0.03 0.09 0.21 0.01
4 2000-04-30 0.05 0.03 0.19 0.03
I assigned individual weights (based on market value from the raw data) as column headers, because I don’t care about the company names and I need the weights for calculating the result.
My ultimate goal is to get: 1. Sum of the weighted returns; and 2. Sum of the weights when returns are non-zero. With that being said, below is the result I want to get:
Date SWeightedR SWeights
1 2000-01-31 0.017 0.3
2 2000-02-29 0.082 0.6
3 2000-03-31 0.082 1
4 2000-04-30 0.07 1
For instance, the SWeightedR for 2000-01-31 = 0.4×0 + 0.3×0 + 0.2×0.05 + 0.1×0.07 = 0.017, and SWeights = 0.2 + 0.1 = 0.3.
My initial idea was to assign the weights to each column, like WCol2 <- 0.4, then use cbind to create new columns and use c(as.matrix() %*% ) to get the sums. I soon realized that this is impractical, as there are hundreds of columns. Any advice or suggestion is appreciated!
Here's a simple solution using matrix multiplications (as you were suggesting yourself).
First of all, your data seem to be of character type. I'm not sure whether that's the case with the real data, but I would first convert them to an appropriate type:
df1[-1] <- lapply(df1[-1], type.convert, as.is = TRUE)
Next, we convert the column names (all but Date) to numeric too:
vec <- as.numeric(names(df1)[-1])
Finally, we can create the new columns in two simple steps. This does carry a to-matrix conversion overhead, but perhaps you should be working with matrices in the first place. Either way, this is fully vectorized:
df1["SWeightedR"] <- as.matrix(df1[, -1]) %*% vec
df1["SWeights"] <- (df1[, -c(1, ncol(df1))] > 0) %*% vec
df1
# Date 0.4 0.3 0.2 0.1 SWeightedR SWeights
# 1 2000-01-31 0.00 0.00 0.05 0.07 0.017 0.3
# 2 2000-02-29 0.00 0.13 0.17 0.09 0.082 0.6
# 3 2000-03-31 0.03 0.09 0.21 0.01 0.082 1.0
# 4 2000-04-30 0.05 0.03 0.19 0.03 0.070 1.0
(The SWeights line works because df1[, -c(1, ncol(df1))] > 0 yields a logical matrix, and logicals count as 0/1 in a matrix product, so exactly the weights of the non-zero returns get summed.) Alternatively, you could convert to a long format first (here's a data.table example), though I believe it will be less efficient, as these are basically per-group operations:
library(data.table)
res <- melt(setDT(df1), id = 1L, variable.factor = FALSE)[
  , c("value", "variable") := .(as.numeric(value), as.numeric(variable))]
res[, .(SWeightedR = sum(variable * value),
SWeights = sum(variable * (value > 0))), by = Date]
# Date SWeightedR SWeights
# 1: 2000-01-31 0.017 0.3
# 2: 2000-02-29 0.082 0.6
# 3: 2000-03-31 0.082 1.0
# 4: 2000-04-30      0.070      1.0
I am new to R.
I have hundreds of data frames like this:
ID NAME Ratio_A Ratio_B Ratio_C Ratio_D
AA ABCD 0.09 0.67 0.10 0.14
AB ABCE 0.04 0.85 0.04 0.06
AC ABCG 0.43 0.21 0.54 0.14
AD ABCF 0.16 0.62 0.25 0.97
AF ABCJ 0.59 0.37 0.66 0.07
This is just an example. The number and names of the Ratio_ columns differ between data frames, but all of them start with Ratio_. I want to apply a function (for example, log(x)) to the Ratio_ columns without specifying the column numbers or the full names.
I know how to do it data frame by data frame; for the one in the example:
A <- function(x) log(x)
df_log <- data.frame(df[1:2], lapply(df[3:6], A))
but I have a lot of them, and as I said the number of columns is different in each.
Any suggestion?
Thanks
Place the datasets in a list and then loop over the list elements:
lapply(lst, function(x) {
  i1 <- grep("^Ratio_", names(x))
  x[i1] <- lapply(x[i1], A)
  x
})
NOTE: No external packages are used.
data
lst <- mget(paste0("df", 1:100))
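If you need the transformed data frames back as individual objects afterwards (keeping them in the list is usually the better workflow), base R's list2env can write the results back; a sketch, reusing the loop above:
res <- lapply(lst, function(x) {
  i1 <- grep("^Ratio_", names(x))
  x[i1] <- lapply(x[i1], A)
  x
})
# recreate df1, ..., df100 in the global environment
list2env(res, envir = .GlobalEnv)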
This type of problem is very easily dealt with using the dplyr package. For example,
df <- read.table(text = 'ID NAME Ratio_A Ratio_B Ratio_C Ratio_D
AA ABCD 0.09 0.67 0.10 0.14
AB ABCE 0.04 0.85 0.04 0.06
AC ABCG 0.43 0.21 0.54 0.14
AD ABCF 0.16 0.62 0.25 0.97
AF ABCJ 0.59 0.37 0.66 0.07',
header = TRUE)
library(dplyr)
df_transformed <- mutate_each(df, funs(log(.)), starts_with("Ratio_"))
df_transformed
# > df_transformed
# ID NAME Ratio_A Ratio_B Ratio_C Ratio_D
# 1 AA ABCD -2.4079456 -0.4004776 -2.3025851 -1.96611286
# 2 AB ABCE -3.2188758 -0.1625189 -3.2188758 -2.81341072
# 3 AC ABCG -0.8439701 -1.5606477 -0.6161861 -1.96611286
# 4 AD ABCF -1.8325815 -0.4780358 -1.3862944 -0.03045921
# 5 AF ABCJ -0.5276327 -0.9942523 -0.4155154 -2.65926004
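Note that mutate_each() (and funs()) have since been deprecated in dplyr; a sketch of the current equivalent, assuming dplyr >= 1.0:
library(dplyr)
df_transformed <- df %>%
  mutate(across(starts_with("Ratio_"), log))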
I'm trying to find the mean forward return (column fwd_rtn) of each quartile for each column (i.e. quartiles of PB, PE, and PS) for each date group (1/1/2016...1/4/2016):
head(df)
Date Stock Price PB PE PS fwd_rtn
1 1/1/2016 A 11.90 0.4 0.10 0.57 -0.015
2 1/1/2016 B 3.56 0.8 0.09 0.26 -0.036
3 1/1/2016 C 1.29 1.2 0.18 1.60 0.10
......
4 1/4/2016 A 12.80 0.39 0.13 0.53 -0.01
5 1/4/2016 B 4.03 0.76 0.08 0.23 0.02
6 1/4/2016 C 1.83 0.87 0.14 1.16 0.03
So far I have been able to find the mean return for one column for one date using this code:
df$qPB <- cut(df$PB, breaks = quantile(df$PB, c(0,.25,.5,.75,1)),include.lowest = TRUE)
aggregate(df$fwd_rtn,list(qPB = df$qPB),FUN=mean)
which gave me the right answers. But I'm struggling to do it for the multiple columns. I think I'm supposed to use dplyr and the gather() function, but I don't know how.
To get quartiles of a single variable by date you can use the ave function:
df$qPB <- ave(df$PB, df$Date, FUN = function(i)
  cut(i, breaks = quantile(i, c(0, .25, .5, .75, 1)), include.lowest = TRUE))
# a minor addition to aggregate
aggregate(df$fwd_rtn, list("qPB"=df$qPB, "date"=df$Date), FUN=mean)
You should take a look at using lapply or sapply to move through multiple columns.
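For example, a sketch of that lapply idea, assuming the ratio columns are named PB, PE, and PS as in the question:
ratio_cols <- c("PB", "PE", "PS")
results <- lapply(ratio_cols, function(col) {
  # quartile bins within each date; the cut factor comes back as integer codes 1-4
  q <- ave(df[[col]], df$Date, FUN = function(i)
    cut(i, breaks = quantile(i, c(0, .25, .5, .75, 1)), include.lowest = TRUE))
  aggregate(df$fwd_rtn, list(quartile = q, date = df$Date), FUN = mean)
})
names(results) <- ratio_cols  # one mean-return table per ratio column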
I have a large data frame with a factor column that I need to divide into three factor columns by splitting up the factor names by a delimiter. Here is my current approach, which is very slow with a large data frame (sometimes several million rows):
data <- readRDS("data.rds")
data.df <- reshape2:::melt.array(data)
head(data.df)
## Time Location Class Replicate Population
##1 1 1 LIDE.1.S 1 0.03859605
##2 2 1 LIDE.1.S 1 0.03852957
##3 3 1 LIDE.1.S 1 0.03846853
##4 4 1 LIDE.1.S 1 0.03841260
##5 5 1 LIDE.1.S 1 0.03836147
##6 6 1 LIDE.1.S 1 0.03831485
Rprof("str.out")
cl <- which(names(data.df)=="Class")
Classes <- do.call(rbind, strsplit(as.character(data.df$Class), "\\."))
colnames(Classes) <- c("Species", "SizeClass", "Infected")
data.df <- cbind(data.df[,1:(cl-1)],Classes,data.df[(cl+1):(ncol(data.df))])
Rprof(NULL)
head(data.df)
## Time Location Species SizeClass Infected Replicate Population
##1 1 1 LIDE 1 S 1 0.03859605
##2 2 1 LIDE 1 S 1 0.03852957
##3 3 1 LIDE 1 S 1 0.03846853
##4 4 1 LIDE 1 S 1 0.03841260
##5 5 1 LIDE 1 S 1 0.03836147
##6 6 1 LIDE 1 S 1 0.03831485
summaryRprof("str.out")
$by.self
self.time self.pct total.time total.pct
"strsplit" 1.34 50.00 1.34 50.00
"<Anonymous>" 1.16 43.28 1.16 43.28
"do.call" 0.04 1.49 2.54 94.78
"unique.default" 0.04 1.49 0.04 1.49
"data.frame" 0.02 0.75 0.12 4.48
"is.factor" 0.02 0.75 0.02 0.75
"match" 0.02 0.75 0.02 0.75
"structure" 0.02 0.75 0.02 0.75
"unlist" 0.02 0.75 0.02 0.75
$by.total
total.time total.pct self.time self.pct
"do.call" 2.54 94.78 0.04 1.49
"strsplit" 1.34 50.00 1.34 50.00
"<Anonymous>" 1.16 43.28 1.16 43.28
"cbind" 0.14 5.22 0.00 0.00
"data.frame" 0.12 4.48 0.02 0.75
"as.data.frame.matrix" 0.08 2.99 0.00 0.00
"as.data.frame" 0.08 2.99 0.00 0.00
"as.factor" 0.08 2.99 0.00 0.00
"factor" 0.06 2.24 0.00 0.00
"unique.default" 0.04 1.49 0.04 1.49
"unique" 0.04 1.49 0.00 0.00
"is.factor" 0.02 0.75 0.02 0.75
"match" 0.02 0.75 0.02 0.75
"structure" 0.02 0.75 0.02 0.75
"unlist" 0.02 0.75 0.02 0.75
"[.data.frame" 0.02 0.75 0.00 0.00
"[" 0.02 0.75 0.00 0.00
$sample.interval
[1] 0.02
$sampling.time
[1] 2.68
Is there any way to speed up this operation? I note that there are a small (<5) number of each of the categories "Species", "SizeClass", and "Infected", and I know what these are in advance.
Notes:
stringr::str_split_fixed performs this task, but not any faster
The data frame is actually initially generated by calling reshape::melt on an array in which Class and its associated levels are a dimension. If there's a faster way to get from there to here, great.
data.rds at http://dl.getdropbox.com/u/3356641/data.rds
This should probably offer quite an increase:
library(data.table)
DT <- data.table(data.df)
DT[, c("Species", "SizeClass", "Infected")
:= as.list(strsplit(Class, "\\.")[[1]]), by=Class ]
The reasons for the increase:
data.table pre-allocates memory for columns
every column assignment on a data.frame copies the entire data (data.table, in contrast, modifies by reference)
the by statement lets you run the strsplit task just once per unique value of Class
Here is a nice quick method for the whole process.
# Save the new col names as a character vector
newCols <- c("Species", "SizeClass", "Infected")
# split the string, then convert the new cols to columns
DT[, c(newCols) := as.list(strsplit(as.character(Class), "\\.")[[1]]), by=Class ]
DT[, c(newCols) := lapply(.SD, factor), .SDcols=newCols]
# remove the old column. This is instantaneous.
DT[, Class := NULL]
## Have a look:
DT[, lapply(.SD, class)]
# Time Location Replicate Population Species SizeClass Infected
# 1: integer integer integer numeric factor factor factor
DT
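On newer versions of data.table (1.9.6+), tstrsplit does the split-and-transpose in one step; a sketch of the equivalent:
DT[, c("Species", "SizeClass", "Infected")
   := tstrsplit(as.character(Class), ".", fixed = TRUE)]
This runs strsplit over all rows rather than once per level, but it is usually fast enough in practice.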
You could get a decent increase in speed by just extracting the parts of the string you need using gsub instead of splitting everything up and trying to put it back together:
data <- readRDS("~/Downloads/data.rds")
data.df <- reshape2:::melt.array(data)
# using `strsplit`
system.time({
cl <- which(names(data.df)=="Class")
Classes <- do.call(rbind, strsplit(as.character(data.df$Class), "\\."))
colnames(Classes) <- c("Species", "SizeClass", "Infected")
data.df <- cbind(data.df[,1:(cl-1)],Classes,data.df[(cl+1):(ncol(data.df))])
})
user system elapsed
3.349 0.062 3.411
#using `gsub`
system.time({
data.df$Class <- as.character(data.df$Class)
data.df$SizeClass <- gsub("(\\w+)\\.(\\d+)\\.(\\w+)", "\\2", data.df$Class,
perl = TRUE)
data.df$Infected <- gsub("(\\w+)\\.(\\d+)\\.(\\w+)", "\\3", data.df$Class,
perl = TRUE)
data.df$Class <- gsub("(\\w+)\\.(\\d+)\\.(\\w+)", "\\1", data.df$Class,
perl = TRUE)
})
user system elapsed
0.812 0.037 0.848
Looks like you have a factor, so work on the levels and then map back. Use fixed = TRUE in strsplit, adjusting the split argument to ".":
Classes <- do.call(rbind, strsplit(levels(data.df$Class), ".", fixed = TRUE))
colnames(Classes) <- c("Species", "SizeClass", "Infected")
df0 <- as.data.frame(Classes[data.df$Class, ], row.names = NA)
cbind(data.df, df0)
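This is fast because strsplit runs only once per factor level, and Classes[data.df$Class, ] then indexes the matrix by the factor's underlying integer codes, recycling each level's split pieces across every matching row.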