Apply a function over several columns - r

I am trying to use values from a lookup table to multiply corresponding values in a main table.
Here is some example data.
The lookup table:
lu = structure(list(year = 0:12, val = c(1.6422, 1.6087, 1.5909, 1.4456,
1.4739, 1.4629, 1.467, 1.4619, 1.2588, 1.1233, 1.1664, 1.1527,
1.2337)), .Names = c("year", "val"), class = "data.frame", row.names = c(NA,
-13L))
Main data
dt = structure(list(year = c(3L, 4L, 6L, 10L, 3L, 9L, 10L, 7L, 7L,
1L), x = 1:10, y = 1:10), .Names = c("year", "x", "y"), row.names = c(NA,
-10L), class = c("data.table", "data.frame"))
I can produce the results I want by merging and then multiplying one column at a time
library(data.table)
dt = merge(dt, lu, by = "year")
dt[, xnew := x*val][, ynew := y*val]
However, I have many variables to apply this over. There have been many questions on this, but I cannot get it to work.
Using ideas from "How to apply same function to every specified column in a data.table" and "R Datatable, apply a function to a subset of columns", I tried
dt[, (c("xnew", "ynew")):=lapply(.SD, function(i) i* val), .SDcols=c("x", "y")]
Error in FUN(X[[i]], ...) : object 'val' not found
for (j in c("x", "y")) set(dt, j = j, value = val * dt[[j]])
Error in set(dt, j = j, value = val * dt[[j]]) : object 'val' not found
And just trying the multiplication without assigning (from "Data table - apply the same function on several columns to create new data table columns") also didn't work.
dt[, lapply(.SD, function(i) i* val), .SDcols=c("x", "y")]
Error in FUN(X[[i]], ...) : object 'val' not found
Please could you point out my error? Thanks.
I'm using data.table version 1.9.6.

We can do this with a join, specifying the target columns via .SDcols. The `object 'val' not found` error means `val` was not a column of `dt` when `j` was evaluated; joining `lu` in the same call makes it available:
dt[lu, on = .(year), nomatch =0
][, c("x_new", "y_new") := lapply(.SD, `*`, val), .SDcols = x:y][]
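For many columns, the new names can be generated from the vector of source columns; a minimal sketch of the same idea (the `_new` suffix and the toy `lu`/`dt` below are illustrative, not the question's full data):

```r
library(data.table)

# toy lookup and main tables (smaller than the question's data)
lu <- data.table(year = 1:3, val = c(2, 3, 4))
dt <- data.table(year = c(1L, 3L), x = 1:2, y = c(10L, 20L))

cols <- c("x", "y")                 # extend this vector for many variables

# join first so val is a real column, then scale every column in cols
res <- dt[lu, on = "year", nomatch = 0]
res[, (paste0(cols, "_new")) := lapply(.SD, `*`, val), .SDcols = cols]
res
```

Because `cols` drives both the `.SDcols` selection and the new names, adding a variable only means extending that one vector.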

Related

xts dropping column names

I have a data.frame
res0 = structure(list(year = "2017", il = 11200000), .Names = c("year",
"il"), row.names = c(NA, -1L), class = "data.frame")
However, when I try to make this an xts object, I lose the column names.
as.xts(x = res0[,2:ncol(res0)], order.by = as.POSIXct(paste0(res0$year,"-01-01")), name = NULL)
This returns:
[,1]
2017-01-01 11200000
instead of
il
2017-01-01 11200000
Subscripting in R drops dimensions by default. Use drop = FALSE to prevent this.
res0[, 2:ncol(res0), drop = FALSE]
Also note that this works to create an n x 1 zoo series with year as the index.
library(zoo)
z <- read.zoo(res0, FUN = c, drop = FALSE)
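Putting it together, a minimal sketch (assuming the xts package is installed) that keeps the il column name:

```r
library(xts)

res0 <- data.frame(year = "2017", il = 11200000)

# drop = FALSE keeps the subset a one-column data.frame, so the name survives
x <- as.xts(res0[, 2:ncol(res0), drop = FALSE],
            order.by = as.POSIXct(paste0(res0$year, "-01-01")))
colnames(x)
```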

R Creating Dynamic variables from group aggregated set of DataFrames

My problem statement: I have a list of dataframes, df1, df2, df3. The data look like
df1
a,b,c,d
1,2,3,4
1,2,3,4
df2
a,b,c,d
1,2,3,4
1,2,3,4
Now, for these two dataframes I want to create new values by aggregating a column of each dataframe; for that I am using the code below
for(i in 1:2){
assign(paste(final_val,i,sep=''),sum(assign(paste(df,i,sep='')))$d*100)}
I am getting the error:
Error in assign(paste(hvp_route_dsct_clust, i, sep = "")) :
argument "value" is missing, with no default
My output should look like
final_val1 <- 800
final_val2 <- 800
And for those values final_val1,final_val2 I should be creating dataframe dynamicaly
Can anybody please help me on this?
If we need to use assign, get the object names from the global environment with ls, specifying the pattern 'df' followed by one or more digits (\\d+). Create another vector of 'final_val' names ('nm2'). Then loop through the sequence of 'nm1', assigning to each name in 'nm2' the sum of column 'd' of the corresponding 'df' multiplied by 100.
nm1 <- ls(pattern = "df\\d+")
nm2 <- paste0("final_val", seq_along(nm1))
for(i in seq_along(nm1)){
assign(nm2[i], sum(get(nm1[i])$d*100))
}
final_val1
#[1] 800
final_val2
#[1] 800
Otherwise, we place the datasets in a list, extract the 'd' column, multiply by 100 and take the column sums
unname(colSums(sapply(mget(nm1), `[[`, 'd') * 100))
#800 800
data
df1 <- structure(list(a = c(1L, 1L), b = c(2L, 2L), c = c(3L, 3L), d = c(4L,
4L)), .Names = c("a", "b", "c", "d"), class = "data.frame", row.names = c(NA,
-2L))
df2 <- structure(list(a = c(1L, 1L), b = c(2L, 2L), c = c(3L, 3L), d = c(4L,
4L)), .Names = c("a", "b", "c", "d"), class = "data.frame", row.names = c(NA,
-2L))
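As an alternative to assign, keeping the results in a named vector (or list) is usually easier to work with downstream; a minimal sketch using the question's data (`final_vals` is an illustrative name):

```r
df1 <- data.frame(a = c(1L, 1L), b = c(2L, 2L), c = c(3L, 3L), d = c(4L, 4L))
df2 <- df1  # identical in the example

dfs <- list(df1 = df1, df2 = df2)

# one value per dataframe: sum of column d times 100
final_vals <- vapply(dfs, function(d) sum(d$d * 100), numeric(1))
final_vals
```

Indexing by name (`final_vals["df1"]`) then replaces the dynamically created `final_val1`, `final_val2`, ... variables.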

r apply multiple conditions to multiple columns (vectors of function argument values)

I am trying to apply multiple conditions to multiple columns of a data.frame, where condition i should be applied to column i, i.e. the condition applied depends on the column I am in. I have a working solution, but it has two major drawbacks: it is potentially slow on large data as it uses a for loop, and it requires the two input vectors ("columns the condition is applied to" and "condition to be applied") to be in the same order. I envisaged a solution that utilizes fast data-wrangling packages, e.g. dplyr or data.table, and is more flexible with respect to the order of the argument vector elements. An example should make it clear (here the condition is only a threshold test, but in the bigger problem it may be a more complex boolean expression involving variables of the data set).
t <- structure(list(a = c(2L, 10L, 10L, 10L, 3L),
b = c(5L, 10L, 20L, 20L, 20L),
c = c(100L, 100L, 100L, 100L, 100L)),
.Names = c("a", "b", "c"),
class = "data.frame",
row.names = c(NA, -5L))
foo_threshold <-
function(data, cols, thresholds, condition_name){
df <- data.frame(matrix(ncol = length(cols), nrow = nrow(data)))
colnames(df) <- paste0(cols, "_", condition_name)
for (i in 1:length(cols)){
df[,i] <- data[, cols[i]] > thresholds[i]
}
return(df)
}
foo_threshold(data = t, cols = c("a", "b"), thresholds = c(5, 18),
condition_name = "bigger_threshold")
I have tried to solve it in a dplyr chain, but I fail to pass the argument vectors correctly, i.e. how to make it clear that condition i should be applied to column i. Below is an illustration of where I was going; it's not working and it misses some points, but I think it illustrates what I am trying to achieve. Note that here the conditions are assumed to be in a data.frame where the variable column holds the column names and the threshold is extracted via a lookup (dplyr filter + select chain).
foo_threshold <- function(data, cols, thresholds, cond_name) {
require(dplyr)
# fun to evaluate boolean condition
foo <- function(x) {
threshold <- thresholds %>% filter(variable==x) %>% select(threshold)
temp <- ifelse(x > threshold, T, F)
return(temp)
}
vars <- setNames(cols, paste0(cols,"_",cond_name))
df_out <-
data %>%
select_(.dots = cols) %>%
mutate_(funs(foo(.)), vars) %>%
select_(.dots = names(vars))
return(df_out)
}
# create threshold table
temp <-
data.frame(variable = c("a", "b"),
threshold = c(5, 18),
stringsAsFactors = F)
# call function (doesn't work)
foo_threshold(data = t, thresholds = temp, cond_name = "bigger_threshold")
Edit: @thepule, the data.frame of conditions may look like below, where x stands for the column; each condition is evaluated for each row of its corresponding column.
conditions <-
data.frame(variable = c("a", "b"),
condition = c("x > 5 and x < 10", "!x %in% c('o', 'p')"),
stringsAsFactors = F)
Made another attempt using sweep instead of mapply. I left the previous answer as it was, as I feel it adds value by showing how inefficient mapply is.
This new answer appears to run a little over twice as fast as the OP's. I think it is a tiny bit slower than the current best-rated answer, but the code is slightly more concise.
It runs even faster if you are willing to accept the result as a matrix instead of a data.frame.
library(dplyr)
temp <- structure(list(a = c(2L, 10L, 10L, 10L, 3L),
b = c(5L, 10L, 20L, 20L, 20L),
c = c(100L, 100L, 100L, 100L, 100L)),
.Names = c("a", "b", "c"),
class = "data.frame",
row.names = c(NA, -5L))
foo_threshold <- function(data, cols, thresholds, condition_name) {
dat <- sweep(data[cols], 2, thresholds, ">") %>% as.data.frame()
names(dat) <- paste0(names(dat), "_", condition_name)
return(dat)
}
foo_threshold(data = temp, cols = c("a", "b"), thresholds = c(5, 18),
condition_name = "bigger_threshold")
Final attempt at an answer. I tried to make the code more generic so that it can accept arbitrary functions. Nicely, it also appears to run significantly faster than any of my previous answers. I am also quite tired atm, so apologies if I have made a silly mistake.
temp <- structure(list(a = c(2L, 10L, 10L, 10L, 3L),
b = c(5L, 10L, 20L, 20L, 20L),
c = c(100L, 100L, 100L, 100L, 100L)),
.Names = c("a", "b", "c"),
class = "data.frame",
row.names = c(NA, -5L))
condition <- list(function(x) x > 5,
function(x) x > 18)
foo_threshold <- function(data, cols, conditions, condition_name) {
dat <- data[0]
for (i in seq_along(conditions)) dat[cols[i]] <- conditions[[i]](data[[cols[i]]])
names(dat) <- paste0(cols, "_", condition_name)
return(dat)
}
foo_threshold(data = temp, cols = c("a", "b"), conditions = condition,
condition_name = "bigger_threshold")
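Along the same lines, base R's Map pairs each column with its condition directly; a minimal sketch (the `conds` list and the `_bigger_threshold` suffix are illustrative names):

```r
temp <- data.frame(a = c(2L, 10L, 10L, 10L, 3L),
                   b = c(5L, 10L, 20L, 20L, 20L))

# one condition per column, keyed by column name
conds <- list(a = function(x) x > 5,
              b = function(x) x > 18)

# Map pairs column i with condition i by position
res <- as.data.frame(Map(function(col, f) f(col), temp[names(conds)], conds))
names(res) <- paste0(names(conds), "_bigger_threshold")
res
```

Selecting `temp[names(conds)]` by name also removes the requirement that the columns and conditions be supplied in the same order as the data.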
Try this:
library(dplyr)
foo_threshold <-
function(data, cols, thresholds, condition_name){
temp <- rbind(data[,cols], thresholds) %>%
lapply(function(x) x[-length(x)] > last(x)) %>% data.frame()
colnames(temp) <- paste0(cols, "_", condition_name)
return(temp)
}
foo_threshold(data = t, cols = c("a", "b"), thresholds = c(5, 18),
condition_name = "bigger_threshold")
In order to test which is faster:
test <- data.frame(a = runif(10000000), b = runif(10000000), stringsAsFactors = F)
lapply(list(foo_threshold_original, foo_threshold),
function(x) system.time(x(data = test, cols = c("a", "b"), thresholds = c(0.5, 0.8),
condition_name = "bigger_threshold")))
where foo_threshold_original is your initial version.
The result is:
[[1]]
user system elapsed
3.95 0.64 4.58
[[2]]
user system elapsed
1.73 0.24 1.96
So the new version is actually faster on bigger data frames.
How about this? It doesn't use dplyr (I loaded it anyway to use the pipes, though).
library(dplyr)
foo_threshold <- function(data, cols, thresholds, condition_name){
dat <- mapply(function(x, val) x > val, data[cols], thresholds) %>% as.data.frame()
names(dat) <- paste0(names(dat), "_", condition_name)
return(dat)
}
edit: simplified

How to save the column names and their corresponding type in R into excel?

I have an R data set with >200 columns. I need to get the class of each column and get that into Excel, with the column name and its corresponding class as two columns.
1. Using lapply/sapply with stack/melt
You could do this using lapply/sapply to get the class of each column, and then use stack from base R or melt from reshape2 to get the two-column data.frame.
res <- stack(lapply(df, class))
#or
library(reshape2)
res1<- melt(lapply(df, class))
Then use write.csv, or any of the specialized libraries for writing Excel data, e.g. XLConnect, WriteXLS, etc.
write.csv(res, file="file1.csv", row.names=FALSE, quote=FALSE)
.csv files can be opened in Excel.
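A related base-R sketch: the `class(x)[1]` guard keeps one row per column even when a column has multiple classes (e.g. POSIXct), and the file name here is illustrative:

```r
df <- data.frame(a = 1:3, b = c("x", "y", "z"), stringsAsFactors = FALSE)

# one row per column: its name and (first) class
res <- data.frame(column = names(df),
                  class  = vapply(df, function(x) class(x)[1], character(1)),
                  row.names = NULL, stringsAsFactors = FALSE)

write.csv(res, file.path(tempdir(), "column_classes.csv"), row.names = FALSE)
```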
2. From the output of str
Or you could use capture.output and regex to extract the required info from the output of str, and convert it to a data.frame using read.table
v1 <- capture.output(str(df))
v2 <- grep("\\$", v1, value=TRUE)
res2 <- read.table(text=gsub(" +\\$ +(.*)\\: +([A-Za-z]+) +.*", "\\1 \\2", v2),
sep="",header=FALSE,stringsAsFactors=FALSE)
head(res2,2)
# V1 V2
#1 t02.clase Factor
#2 Std_A_CLI_monto_sucursal_1 chr
data
df <-structure(list(t02.clase = structure(c(1L, 1L, 1L), .Label = "AK",
class = "factor"),Std_A_CLI_monto_sucursal_1 = c("0", "0", "0"),
Std_A_CLI_monto_sucursal_2 = c(0, 0.01303586, 0), Std_A_CLI_monto_sucursal_3 =
c(0.051311597, 0.003442244, 0.017347593), Std_A_CLI_monto_sucursal_4 = c(0L,
0L, 0L), Std_A_CLI_promociones = c(0.4736842, 0.5, 0), Std_A_CLI_dias_cliente =
c(0.57061341, 0.55492154, 0.05991441), Std_A_CLI_sucursales = c(0.05555556,
0.05555556, 0.05555556)), .Names = c("t02.clase", "Std_A_CLI_monto_sucursal_1",
"Std_A_CLI_monto_sucursal_2", "Std_A_CLI_monto_sucursal_3",
"Std_A_CLI_monto_sucursal_4", "Std_A_CLI_promociones", "Std_A_CLI_dias_cliente",
"Std_A_CLI_sucursales"), row.names = c("1", "2", "3"), class = "data.frame")

Creating new dataframe using weighted averages from dataframes within list

I have many dataframes stored in a list, and I want to create weighted averages from these and store the results in a new dataframe. For example, with the list:
dfs <- structure(list(df1 = structure(list(A = 4:5, B = c(8L, 4L), Weight = c(TRUE, TRUE), Site = c("X", "X")),
.Names = c("A", "B", "Weight", "Site"), row.names = c(NA, -2L), class = "data.frame"),
df2 = structure(list(A = c(6L, 8L), B = c(9L, 4L), Weight = c(FALSE, TRUE), Site = c("Y", "Y")),
.Names = c("A", "B", "Weight", "Site"), row.names = c(NA, -2L), class = "data.frame")),
.Names = c("df1", "df2"))
In this example, I want to use columns A, B, and Weight for the weighted averages. I also want to move over related data such as Site, and want to sum the number of TRUE and FALSE. My desired result would look something like:
result <- structure(list(Site = structure(1:2, .Label = c("X", "Y"), class = "factor"),
A.Weight = c(4.5, 8), B.Weight = c(6L, 4L), Sum.Weight = c(2L,
1L)), .Names = c("Site", "A.Weight", "B.Weight", "Sum.Weight"
), class = "data.frame", row.names = c(NA, -2L))
Site A.Weight B.Weight Sum.Weight
1 X 4.5 6 2
2 Y 8.0 4 1
The above is just a very simple example, but my real data have many dataframes in the list, and many more columns than just A and B for which I want to calculate weighted averages. I also have several columns similar to Site that are constant in each dataframe and that I want to move to the result.
I'm able to manually calculate weighted averages using something like
weighted.mean(dfs$df1$A, dfs$df1$Weight)
weighted.mean(dfs$df1$B, dfs$df1$Weight)
weighted.mean(dfs$df2$A, dfs$df2$Weight)
weighted.mean(dfs$df2$B, dfs$df2$Weight)
but I'm not sure how I can do this in a shorter, less "manual" way. Does anyone have any recommendations? I've recently learned how to lapply across dataframes in a list, but my attempts have not been so great so far.
The trick is to create a function that works for a single data.frame, then use lapply to iterate across your list. Since lapply returns a list, we'll then use do.call to rbind the resulting objects together:
foo <- function(data, meanCols = LETTERS[1:2], weightCol = "Weight", otherCols = "Site") {
means <- t(sapply(data[, meanCols], weighted.mean, w = data[, weightCol]))
sumWeight <- sum(data[, weightCol])
others <- data[1, otherCols, drop = FALSE] #You said all the other data was constant, so we can just grab first row
out <- data.frame(others, means, sumWeight)
return(out)
}
In action:
do.call(rbind, lapply(dfs, foo))
Site A B sumWeight
df1 X 4.5 6 2
df2 Y 8.0 4 1
Since you said this was a minimal example, here's one approach to expanding this to other columns. We'll use grepl() and use regular expressions to identify the right columns. Alternatively, you could write them all out in a vector. Something like this:
do.call(rbind, lapply(dfs, foo,
meanCols = grepl("A|B", names(dfs[[1]])),
otherCols = grepl("Site", names(dfs[[1]]))
))
Using dplyr:
library(dplyr)
library('devtools')
install_github('hadley/tidyr')
library(tidyr)
unnest(dfs) %>%
group_by(Site) %>%
filter(Weight) %>%
mutate(Sum=n()) %>%
select(-Weight) %>%
summarise_each(funs(mean=mean(., na.rm=TRUE)))
gives the result
# Site A B Sum
#1 X 4.5 6 2
#2 Y 8.0 4 1
Or using data.table
library(data.table)
DT <- rbindlist(dfs)
DT[(Weight)][, c(lapply(.SD, mean, na.rm = TRUE),
Sum=.N), by = Site, .SDcols = c("A", "B")]
# Site A B Sum
#1: X 4.5 6 2
#2: Y 8.0 4 1
Update
In response to @jazzuro's comment, using dplyr 0.3, I am getting:
unnest(dfs) %>%
group_by(Site) %>%
summarise_each(funs(weighted.mean=stats::weighted.mean(., Weight),
Sum.Weight=sum(Weight)), -starts_with("Weight")) %>%
select(Site:B_weighted.mean, Sum.Weight=A_Sum.Weight)
# Site A_weighted.mean B_weighted.mean Sum.Weight
#1 X 4.5 6 2
#2 Y 8.0 4 1
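For comparison, a plain base-R sketch of the same aggregation (using the dfs list from the question; only A and B are spelled out here, so the hardcoded columns would need extending for the real data):

```r
dfs <- list(
  df1 = data.frame(A = 4:5, B = c(8L, 4L), Weight = c(TRUE, TRUE), Site = c("X", "X")),
  df2 = data.frame(A = c(6L, 8L), B = c(9L, 4L), Weight = c(FALSE, TRUE), Site = c("Y", "Y"))
)

res <- do.call(rbind, lapply(dfs, function(d) data.frame(
  Site       = d$Site[1],                      # constant within each dataframe
  A.Weight   = weighted.mean(d$A, d$Weight),
  B.Weight   = weighted.mean(d$B, d$Weight),
  Sum.Weight = sum(d$Weight)
)))
res
```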
