I am trying to get away from loops in R and was looking to both vectorize and speed up a section of my code.
I am looking to convert a For loop using lapply, but am getting an error:
Reproducible example:
library(dplyr)
# This works using a For loop -----------------------------------
# create sample data frame
df <- data.frame(Date = rep(c("Jan1", "Jan2", "Jan3"), 3),
Item = c(rep("A", 3), rep("B", 3), rep("C", 3)),
Value = 10:18)
diff <- numeric() # initialize
# Loop through each item and take difference of latest value from earlier values
for (myitem in unique(df$Item)) {
y = df[df$Date == last(df$Date) & df$Item == myitem, "Value"] # Latest value for an item
x = df[df$Item == myitem, "Value"] # Every value for an item
diff <- c(diff, y-x)
}
df_final <- mutate(df, Difference = diff)
df_final
I found related questions here (lapply), here (lapply), and here ($ operator) but none really helped me with my question.
Here is how I tried to vectorize using lapply:
# Same thing using vectorized approach ----------------------------------
mylist <- list(unique(df$Item))
myfunction <- function(df = df, diff = numeric()) {
y = df[df$Date == last(df$Date) & df$Item == mylist, "Value"] # Latest value for an item
x = df[df$Item == mylist, "Value"] # Every value for an item
diff <- c(diff, y-x)
}
# throws error
diff_vector <- unlist(lapply(mylist, myfunction))
df_final2 <- mutate(df, Difference = diff_vector)
df_final2
My real data set has hundreds of thousand of rows. If someone could point me in the right direction on how to vectorize this to get the same output as the For loop I would appreciate it.
Thanks!
So lapply isn't being used quite right here, that's all!
lapply applies a function to each element of a list. To be explicit, it takes each element of a list, and applies the function to that element.
So if you want it to apply a function to several subsets of a data frame, you need to get it a list which is several subsets of a data frame. So let's create that list first.
We can do this using the split function, it splits your data frame into several data frames based on a column and stores these as a list. A list of subsets of a data frame. Perfect!
So let's replace the line where you create mylist with this line instead.
mylist <- split(df,df[,c("Item")])
Now we just need to make some changes tomyfunction. Remember we're now passing through our data already subsetted, so we can remove the conditions about the Item matching with what we'd expect. Remember this function will get applied to each of these data frames in their entirety.
myfunction <- function(df = df, diff = numeric()) {
y = df[df$Date == last(df$Date), "Value"] # Latest value for an item
x = df[, "Value"] # Every value for an item
diff <- c(diff, y-x)
}
And the rest my friend, is exactly as you have it :)
I'm not sure lapply is the right approach to take. I'd stick with mutate - which you already seem to be using:
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
df <- data.frame(Date = rep(c("Jan1", "Jan2", "Jan3"), 3),
Item = c(rep("A", 3), rep("B", 3), rep("C", 3)),
Value = 10:18)
df <- df %>%
group_by(Item) %>%
mutate(diff = last(Value) - Value)
df
#> # A tibble: 9 x 4
#> # Groups: Item [3]
#> Date Item Value diff
#> <fct> <fct> <int> <int>
#> 1 Jan1 A 10 2
#> 2 Jan2 A 11 1
#> 3 Jan3 A 12 0
#> 4 Jan1 B 13 2
#> 5 Jan2 B 14 1
#> 6 Jan3 B 15 0
#> 7 Jan1 C 16 2
#> 8 Jan2 C 17 1
#> 9 Jan3 C 18 0
Created on 2018-06-27 by the reprex package (v0.2.0).
This does assume that the observations (at least within the "Item" group) are arranged in order. If not, add arrange(Date) %>% as a step after group_by
you could create a table with the latest value, join with the original table and get the difference or use data.table to create an additional column with latest value
library(data.table)
df <- data.frame(Date = rep(c("Jan1", "Jan2", "Jan3"), 3),
Item = c(rep("A", 3), rep("B", 3), rep("C", 3)),
Value = 10:18)
setDT(df)
df[,latestVal:=last(Value),by=.(Item)][,diff:=latestVal-Value][,.(Date,Item,Value,diff)]
Related
I want to create a dataframe with a column whose value depends on another object's value.
Here's an example, I want my column to be called "conditional_colname":
x = "conditional_colname"
df <- data.frame(x = c(1, 2, 3))
df
> x
1 1
2 2
3 3
I could try the following indirection syntax in tidy evaluation, but it returns an error:
data.frame({{x}} := c(1, 2, 3))
> Error in `:=`({ : could not find function ":="
I can sort out the problem through the use of the rename function and indirection in tidy evaluation syntax, as in:
df %>% rename({{x}} := x)
> conditional_colname
1 1
2 2
3 3
but that involves creating the dataframe with a wrong name and then renaming it, is there any option to do it from the creation of the dataset?
{{..}} can be used with tibbles -
library(tibble)
library(rlang)
df <- tibble({{x}} := c(1, 2, 3))
df
# A tibble: 3 × 1
# conditional_colname
# <dbl>
#1 1
#2 2
#3 3
A solution with data.frame would be with setNames.
df <- setNames(data.frame(c(1, 2, 3)), x)
I have two dataframes df1 and df2 which I have merged together into another dataframe df3
df1 <- data.frame(
Name = c("A", "B", "C"),
Value = c(1, 2, 3),
Method = c("Indirect"))
df2 <- data.frame(
Name = c("A", "B"),
Value = c(4, 5),
Method = c("Direct"))
df3 <- rbind(df1, df2)
So df3 looks something like this
Now I need to identify all the unique entries in the Name column (which is C in this case) and for each of the unique entries, a row is to be added which would have the same "Name" but "Value" would be 0 and the "Method" would be the opposite one. The output should look like this.
Finally the rows with similar "Name" are to be arranged one below the other.
I have a huge dataframe and I need to achieve the above mentioned outcome in the most efficient way in R. How do I proceed?
One way
tmp=df3[!(df3$Name %in% df3$Name[duplicated(df3$Name)]),]
tmp$Value=0
tmp$Method=ifelse(tmp$Method=="Direct","Indirect","Direct")
Name Value Method
3 C 0 Direct
you can now rbind this to your original data (and sort it).
Please find another solution using data.table
Reprex
Code
library(data.table)
library(magrittr) # for the pipe!
setDT(df3)
df3 <- rbindlist(list(df3,
df3[!(df3$Name %in% df3[duplicated(Name)]$Name)
][, `:=` (Value = 0, Method = fifelse(Method == "Indirect", "Direct", "Indirect"))])) %>%
setorder(., Name)
Output
df3
#> Name Value Method
#> 1: A 1 Indirect
#> 2: A 4 Direct
#> 3: B 2 Indirect
#> 4: B 5 Direct
#> 5: C 3 Indirect
#> 6: C 0 Direct
Created on 2021-12-15 by the reprex package (v2.0.1)
I think that with 10,000 rows you will barely notice it:
library(dplyr)
df3 |>
add_count(Name) |>
filter(n == 1) |>
mutate(
Value = 0,
Method = c(Indirect = 'Direct', Direct = 'Indirect')[Method],
n = NULL
) |>
bind_rows(df3) |>
arrange(Name, Value, Method)
# Name Value Method
# 1 A 1 Indirect
# 2 A 4 Direct
# 3 B 2 Indirect
# 4 B 5 Direct
# 5 C 0 Direct
# 6 C 3 Indirect
I am new to R and just learning the ropes so thanks in advance for any assistance you can provide.
I have a dataset that I am cleaning as a class project.
I have several sets of categorical data that I want to turn into specific numeric values.
I am repeating the same code format for different columns that I think would make a good function.
I would like to turn this:
# plyr using revalue
df$Area <- revalue(x = df$Area,
replace = c("rural" = 1,
"suburban" = 2,
"urban" = 3))
df$Area <- as.numeric(df$Area)
into this:
reval_3 <- function(data, columnX,
value1, num_val1,
value2, num_val2,
value3, num_val3) {
# plyr using revalue
data$columnX <- revalue(x = data$columnX,
replace = c(value1 = num_val1,
value2 = num_val2,
value3 = num_val3))
# set as numeric
data$columnX <- as.numeric(data$columnX)
# return dataset
return(data)
}
I get the following error:
The following `from` values were not present in `x`: value1, value2, value3
Error: Assigned data `as.numeric(data$columnX)` must be compatible with existing data.
x Existing data has 10000 rows.
x Assigned data has 0 rows.
ℹ Only vectors of size 1 are recycled.
Run `rlang::last_error()` to see where the error occurred.
In addition: Warning messages:
1: Unknown or uninitialised column: `columnX`.
I've tried it with a single value1 where value1 <- c("rural" = 1, "suburban" = 2, "urban" = 3)
I know I can just:
df$Area <- as.numeric(as.factor(df$Area))
the data but I want specific values for each choice rather than R choosing.
Any assistance appreciated.
As already mentioned by #MartinGal in his comment, plyr is retired and the package authors themselves recommend using dplyr instead. See https://github.com/hadley/plyr.
Hence, one option to achieve your desired result would be to make use of dplyr::recode. Additionally if you want to write your function I would suggest to pass the values to recode and the replacements as vectors instead of passing each value and replacement as separate arguments:
library(dplyr)
set.seed(42)
df <- data.frame(
Area = sample(c("rural", "suburban", "urban"), 10, replace = TRUE)
)
recode_table <- c("rural" = 1, "suburban" = 2, "urban" = 3)
recode(df$Area, !!!recode_table)
#> [1] 1 1 1 1 2 2 2 1 3 3
reval_3 <- function(data, x, values, replacements) {
recode_table <- setNames(replacements, values)
data[[x]] <- recode(data[[x]], !!!recode_table)
data
}
df <- reval_3(df, "Area", c("rural", "suburban", "urban"), 1:3)
df
#> Area
#> 1 1
#> 2 1
#> 3 1
#> 4 1
#> 5 2
#> 6 2
#> 7 2
#> 8 1
#> 9 3
#> 10 3
You can use case_when with across.
If the columns that you want to change are called col1, col2 you can do -
library(dplyr)
df <- df %>%
mutate(across(c(col1, col2), ~case_when(. == 'rural' ~ 1,
. == 'suburban' ~ 2,
. == 'urban' ~ 3)))
Based on your actual column names you can also pass starts_with, ends_with, range of columns A:Z in across.
I have a list of functions, for example:
myFunctions = list(
calculateMean = function(x) {mean(x)},
calculateMedian = function(x) {median(x)}
)
I need to call stored functions in myFunctions based on some criteria for example, I have a table (myTable) with prices and I need to calculate means and medians (I also need to do more things like standardize names, join a specific value with a table with codes, etc).
If a value in a column in myTable is == "a" I want to use function calculateMean, if == "b" I want to use function calculateMedian, if == "c" use function calculateMean.
What is the best way to do this? I am saving functions as a list as I will have a lot of functions. And how can I call a function in the myFunctions based on a specific criteria?
Thanks!
Maybe the following does what the question asks for.
Depending on ID, function priceStat determines which function from myFunctions to apply to column price.
priceStat <- function(x, funlist) {
type <- unique(as.character(x[["ID"]]))
f <- switch(type,
pear = funlist[[1]],
orange = funlist[[2]])
f(x[["price"]])
}
myFunctions = list(
calculateMean = function(x) {mean(x)},
calculateMedian = function(x) {median(x)}
)
set.seed(1234)
df1 <- data.frame(ID = sample(c("pear", "orange"), 20, TRUE),
price = runif(20),
stringsAsFactors = FALSE)
sapply(split(df1, df1$ID), priceStat, myFunctions)
# orange pear
#0.3036828 0.5427695
Here is something that I think does what you are suggesting.
library(dplyr)
Create some data.
set.seed(1234)
data <- tibble(id = rep(letters[1:2], each = 3), price = rnorm(6, 100, 5))
data
# # A tibble: 6 x 2
# id price
# <chr> <dbl>
# 1 a 94.0
# 2 a 101.
# 3 a 105.
# 4 b 88.3
# 5 b 102.
# 6 b 103.
Create a list of functions. Note we named the list item for the id we want to apply it to.
myFunctions <- list(
a = mean,
b = median
)
Group the data on the id. Then iterate over each list item, calling summarize(). For each list (which is the subset of the data for that given id) call the function from the myFunctions list.
data %>%
group_by(id) %>%
group_modify(~ summarize(.x, calc = myFunctions[[pull(.y[1])]](.x$price)))
# # A tibble: 2 x 2
# id calc
# <chr> <dbl>
# 1 a 100.
# 2 b 102.
Testing it out.
> mean(data$price[data$id == "a"])
[1] 100.258
> median(data$price[data$id == "b"])
[1] 102.1456
I would like to filter a data.table and then select the last row of that filtered data.table in a single command. The desired result from this command is a number as I am using the last row of that filtered expression in a formula.
I would like to avoid defining a new variable where I filter the data.table and then select .N in another line.
Is this possible?
Thank you!
library(data.table)
library(magrittr)
# Sample data
dt <- data.table(style = c(rep("A", times = 10),
rep("B", times = 10)),
id = 1:20)
# Want to select last row when dt is filtered for style == "A"
5 * dt[style == "A" & .N, id]
#> [1] 5 10 15 20 25 30 35 40 45 50
# Desired output is 50
# Want to avoid defining previously
a <- dt[style == "A"] %>%
.[.N, id]
5 * a
#> [1] 50
Created on 2019-03-15 by the reprex package (v0.2.1)
You can use j in x[i, j] to select the last element matching the filter in i:
dt[style == "A", id[.N]]
# or
dt[style == "A", last(id)]
Using dplyr
dt %>%
group_by(style) %>%
summarise(id = last(id) * 5)
# A tibble: 2 x 2
style id
<chr> <dbl>
1 A 50
2 B 100