R: writing a function to avoid a for loop

Hi, I am trying to learn ways to avoid loops in my code.
I have some example data here:
options(warn = -1) # turning warnings off here
Company <- c("A", "C", "B", "B", "A", "C", "C", "A", "B", "C", "B", "A")
CityID <- as.character(c(1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4))
Value <- c(120.5, 123, 125, 122.5, 122.1, 121.7, 123.2, 123.7, 120.7, 122.3, 120.1, 122)
Sales <- c(1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0)
df <- data.frame(Company, CityID, Sales, Value)
df$new_value <- 0
I also created a custom function (simple example only for testing purposes) as below.
funcCity12 <- function(data) {
  data_new <- data[which(data$CityID == '1' | data$CityID == '2'), ]
  for (i in 1:nrow(data_new)) {
    # same company, but outside cities 1 and 2
    data_company <- data[data$Company == data_new[i, 'Company'] & !data$CityID == 1 & !data$CityID == 2, ]
    data_new[i, 'new_value'] <- max(data_company[data_company$Sales == 1, ]$Value) # note we take the maximum value here
  }
  data_new
}
df2 <- funcCity12(data = df) # obtaining the result here
Now I am trying to write a function to avoid the for loop in the previous function.
funcCity12_no_loop <- function(x, df) {
  data_company <- df[df$Company == x[, 'Company'] & !df$CityID == 1 & !df$CityID == 2, ]
  x[, 'new_value'] <- max(data_company[data_company$Sales == 1, ]$Value) # note we take the maximum value here
  x
}
funcCity12_no_loop(x = df[1, ], df = df) # output for the first row of df
This seems to work when I input the rows individually. What I am stuck on is how to run this function for all rows of the data frame. I am not sure whether the second function requires more changes for this purpose. Any help is appreciated. Thanks in advance.
P.S. For the second function, my initial reaction was to create a for loop and loop through the observations, but that defeats the whole purpose.
EDIT
This is based on @eonurk's answer:
zz <- apply(df, 1, function(x) {
  data_company <- df[df$Company == x[1] & !df$CityID == 1 & !df$CityID == 2, ]
  x[5] <- max(data_company[data_company$Sales == 1, ]$Value) # note we take the maximum value here
  x
})

You can use the apply function to reach each individual observation of your data frame.
For instance, you can multiply the Value and Sales columns (for no reason at all) with the following:
apply(df,1, function(x){ as.numeric(x["Sales"])*as.numeric(x["Value"])})
Edit:
Now you just need the %>% pipe from the magrittr package (re-exported by dplyr):
zz <- apply(df, 1, function(x) {
  data_company <- df[df$Company == x[1] & !df$CityID == 1 & !df$CityID == 2, ]
  x[5] <- max(data_company[data_company$Sales == 1, ]$Value) # note we take the maximum value here
  x
}) %>% as.data.frame %>% t
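One caveat worth flagging (my note, not from the answer): apply coerces each row of a mixed-type data frame to a character vector, which is why the earlier one-liner needed as.numeric and why the result here has to be transposed back into shape. A minimal sketch that sidesteps the coercion by iterating over row indices with vapply, assuming df as defined in the question:
# Fill new_value for the city-1/2 rows without an explicit for loop,
# keeping numeric types intact (df as in the question).
idx <- which(df$CityID %in% c("1", "2"))
df$new_value[idx] <- vapply(idx, function(i) {
  other <- df[df$Company == df$Company[i] & !df$CityID %in% c("1", "2"), ]
  max(other$Value[other$Sales == 1]) # -Inf (plus a warning) if no qualifying sale
}, numeric(1))
df[idx, ]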

Here is one way without a loop. First we filter based on your criteria, then we group by company and calculate the max, then we join the dataframe to the original dataset (also filtered based on your criteria). I didn't make it a function, but the building blocks are all there.
library(tidyverse)
list(
  df %>%
    filter(CityID %in% 1:2) %>%
    select(-new_value),
  df %>%
    filter(!CityID %in% 1:2 & Sales == 1) %>%
    group_by(Company) %>%
    summarise(new_value = max(Value))
) %>%
  reduce(full_join, by = "Company")
#> Company CityID Sales Value new_value
#> 1 A 1 1 120.5 NA
#> 2 C 1 1 123.0 123.2
#> 3 B 1 0 125.0 120.7
#> 4 B 2 0 122.5 120.7
#> 5 A 2 0 122.1 NA
#> 6 C 2 1 121.7 123.2
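For completeness, here is a sketch of how those building blocks could be wrapped into a function; the name fill_new_value and the cities argument are my own, not from the answer above:
library(tidyverse)
# Same pipeline as above, parameterised by which CityIDs get filled in.
fill_new_value <- function(df, cities = c("1", "2")) {
  list(
    df %>%
      filter(CityID %in% cities) %>%
      select(-new_value),
    df %>%
      filter(!CityID %in% cities & Sales == 1) %>%
      group_by(Company) %>%
      summarise(new_value = max(Value))
  ) %>%
    reduce(full_join, by = "Company")
}
fill_new_value(df) # same result as the output above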

Related

How to create new column (using dplyr's mutate) based on conditions applied on the entire piped dataframe

I am looking for a way to create a new column (using dplyr's mutate) based on certain "conditions".
library(tidyverse)
qq <- 5
df <- data.frame(rn = 1:qq,
                 a = rnorm(qq, 0, 1),
                 b = rnorm(qq, 10, 5))
myf <- function(dataframe, value) {
  result <- dataframe %>%
    filter(rn <= value) %>%
    nrow
  return(result)
}
The above example is a rather simplified version: I am trying to filter the piped data frame (df) and obtain a new column (foo) whose values depict how many rows have rn less than or equal to the current row's rn (coming from the piped df). Below you can see the output I am getting vs. the one I expect to obtain:
df %>%
  mutate(
    foo_i_am_getting = myf(., rn),
    foo_expected = 1:qq)
rn a b foo_i_am_getting foo_expected
1 1 -0.5403937 -4.945643 5 1
2 2 0.7169147 2.516924 5 2
3 3 -0.2610024 -7.003944 5 3
4 4 -0.9991419 -1.663043 5 4
5 5 1.4002610 15.501411 5 5
The actual calculation I am trying to perform is more cumbersome, however, if I solve the above simplified version, I believe I can handle the rest of the manipulation/calculations inside the custom function.
BONUS QUESTION: Currently the name of the column I want to filter on (i.e. rn) is hardcoded in the custom function (filter(rn <= value)). It would be great if this were an argument of the custom function, passed 'tidyverse' style - i.e. without quotation marks - e.g. myf <- function(dataframe, rn, value).
Disclaimer: I've done my best to describe the problem at hand; however, if there are still unclear spots, please let me know so I can elaborate further.
Thanks in advance for your support!
You need to do it step by step, because right now you are passing the whole vector to filter instead of only one value each time:
df %>%
  mutate(
    foo_i_am_getting = map_dbl(.$rn, function(x) nrow(filter(., rn <= x))),
    foo_expected = 1:qq)
Now we pass 1 to the filter on the rn column (and the function returns the number of rows), then 2, and so on.
The function could be:
myf <- function(vec_filter, dataframe, vec_rn) {
  map_dbl(vec_filter, ~ nrow(filter(dataframe, {{ vec_rn }} <= .x)))
}
df %>%
  mutate(
    foo_i_am_getting = map_dbl(.$rn, function(x) nrow(filter(., rn <= x))),
    foo_expected = 1:qq,
    foo_function = myf(rn, ., rn))
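On the bonus question: the {{ }} ("curly-curly") operator from rlang, re-exported by dplyr, is what lets a column name be passed without quotation marks; it defuses the argument and evaluates it inside filter's data mask. A minimal standalone sketch (the helper name count_upto and its arguments are mine, for illustration):
library(dplyr)
library(purrr)
# Hypothetical helper: for each value in vec, count how many rows of
# dataframe have `col` (passed unquoted) at or below that value.
count_upto <- function(dataframe, col, vec) {
  map_dbl(vec, ~ nrow(filter(dataframe, {{ col }} <= .x)))
}
df %>% mutate(foo = count_upto(df, rn, rn))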

How to address specific cell in lapply?

I want to "upgrade" my code by replacing mass-import using for loop with lapply function. After using lapply(list.files(), read.csv) I've got a list of dataframes. The problem is, the data is a bit messy and some things (like participant's sex) are mentioned only once, in one specific cell. It wasn't a problem when I used a for loop, as I could just refer to a specific cell. When I used:
for (x in list.files()) {
  temp <- read.csv(x)
  temp <- temp %>%
    slice(4:11) %>%
    select(form_2.index, form_2.response) %>%
    mutate(sex = temp[1, 4])
  # temp[1, 4] is the one cell where the participant's sex is mentioned
  database <- rbind(database, temp)
}
each temp variable looked like this:
form_2.index form_2.response sex$form.response
<dbl> <chr> <chr>
1 1 yes male
2 2 no male
3 3 no male
4 4 yes male
5 5 yes male
6 6 yes male
7 7 no male
8 8 no male
That's what I want. But how can I refer to a certain cell when using lapply? The following code doesn't work, as the temp variable is now a list:
temp <- lapply(list.files(), read_csv)
temp %>%
  lapply(slice, 4:11) %>%
  lapply(select, form_2.index, form_2.response) %>%
  lapply(mutate, plec = temp[1, 4])
The slice and select functions work all right; the problem lies in the mutate part. Given that temp is a list, I need to point to a certain element of the list, not only a column and row, and I want it done for each element. How can I do that? Any ideas?
You can do:
library(dplyr)
temp <- lapply(list.files(), function(x) {
  tmp <- readr::read_csv(x)
  tmp %>%
    slice(4:11) %>%
    select(form_2.index, form_2.response) %>%
    mutate(sex = tmp[1, 4])
})
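The list that lapply returns can then be combined into a single data frame, mirroring the rbind step from the original loop; a short sketch:
# Stack the per-file data frames into one, as the original for loop
# did incrementally with rbind():
database <- dplyr::bind_rows(temp)
# base R equivalent: database <- do.call(rbind, temp)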

How do I filter data in data frame and change column's cell values based on it using a loop?

Currently working with a larger data frame with various participant IDs that looks like this:
#ASC_new Data Frame
Pcp Choice Target ASC Product choice_consis
2393 zwyn27soc B A 1 USB drive 0
2394 zwyn27soc B A 1 job 0
2395 zwyn27soc B B 1 USB drive 0
2397 zwyn27soc B A 1 printer 0
2399 zwyn27soc B B 1 walking shoes 0
2400 zwyn27soc B A 1 printer 0
I would like to loop through each participant (Pcp) and look at their choices in the "Choice" column. For example, under both rows for the product "USB drive", the participant chose "B", so under "choice_consis" I want a 1 to replace the 0, because the choices are consistent (equal). However, my for loop over the participants and product names isn't working:
#Examples/snippets of my values
pcp_list <- list("ybg606k3l", "yk83d2asc", "yl55v0zhm", "zwyn27soc")
product_list <- list("USB drive", "printer", "walking shoes", "job")
#for loop that isn't working
for (i in pcp_list) { # iterating through participant codes
  for (j in product_list) { # iterating through product names
    # filtering participant data and products into a new data frame
    comparison <- filter(ASC_new, Pcp == i & Product == j)
    choice_1 <- ASC_new$Choice[1] # creating labels for choice 1 and 2
    choice_2 <- ASC_new$Choice[2]
    # comparing choice 1 and choice 2 and adding a value of 1 to the
    # choice_consis column if they are equal
    if (isTRUE(choice_1 == choice_2)) {
      ASC_new$choice_consis[1] <- 1
      ASC_new$choice_consis[2] <- 1
    }
  }
}
In the end I would like a data frame where each participant's choice_consis is labeled with a 1 or 0 expressing if they chose the same item (A,B,D) both times that each product appeared.
This is something that's pretty natural to do using dplyr, if you don't care about collapsing across different choices. I'll illustrate on a toy data frame:
IDs <- 1:2
choices <- c('A', 'B')
products <- c('USB', 'Printer')
df <- data.frame(Pcp = rep(IDs, each = 4),
                 Choice = c(rep(choices, each = 2),
                            rep(choices, each = 2)),
                 Product = c(rep(products, times = 2),
                             rep(products, each = 2)))
df %>%
  dplyr::group_by(Pcp, Product) %>%
  dplyr::summarize(choice_consis = as.numeric(length(unique(Choice)) == 1))
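For reference, a hand-computed expectation for this toy data (my note, not output from a live session): participant 1 alternates A/B within each product, while participant 2 repeats the same choice, so the summarised result should be:
#   Pcp Product choice_consis
#     1 Printer             0
#     1 USB                 0
#     2 Printer             1
#     2 USB                 1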
This does (in essence) the same thing you're trying to do with your for loop: look at each combination of participants and products (that's what the group_by does) and then analyze that combination (that's what the summarize does). It's a little more succinct and readable than a double for loop. I'd check out Chapter 5 of Hadley's book on R for Data Science to learn more about these sorts of things.
As far as what's wrong with your for loop, the issue is that even though you create your comparison data frame, all the subsequent operations are on ASC_new. So if you wanted to use a for loop and maintain the structure of your original data, you could do something like:
for (i in pcp_list) {
  for (j in product_list) {
    compare <- (ASC_new$Pcp == i) & (ASC_new$Product == j)
    choices <- ASC_new$Choice[compare]
    if (length(unique(choices)) == 1) {
      ASC_new$choice_consis[compare] <- 1
    }
  }
}
Creating a new data frame as you did makes it a little harder to substitute values in the original (because we don't know "where" the filtered data frame came from), so I just get the indices of the original data frame corresponding to the participant-product combination. Note also that I eliminated the hard-coding of the fact that there are only two choices, as well as the isTRUE within the if statement (== will evaluate to TRUE or FALSE, as desired).
Hope this helps!
You can count the unique values of Choice for each Pcp and Product and assign 1 if the count is 1, or 0 otherwise.
This can be done in base R:
df$choice_consis <- +(with(df, ave(Choice, Pcp, Product,
                                   FUN = function(x) length(unique(x)))) == 1)
with dplyr:
library(dplyr)
df %>%
  group_by(Pcp, Product) %>%
  mutate(choice_consis = +(n_distinct(Choice) == 1))
and with data.table:
library(data.table)
setDT(df)[, choice_consis := as.integer(uniqueN(Choice) == 1), .(Pcp, Product)]

R selecting rows by conditions given in an external table

Given the following data
data_min <- data.frame("cond"=c("a","b","c"),"min"=c(1,3,1))
data <- data.frame("cond"=c("a","b","b","a","c"),"val"=c(0,2,4,7,0))
I would like to select all rows from data for which the value in val is bigger than the minimum value specified in data_min for that condition. Thus, in the given example, I expect to end up with the table
cond val
b 4
a 7
So far, I have tried
datanew <- data[which(data$cond==data_min$cond & data$val > data_min$min),]
which gives me a 7 but not b 4. I have two questions: (1) why do I get the result I get, and (2) how do I get the desired result?
You need to use match, because the data frames have different numbers of rows, so the elementwise comparison does not line up the conditions:
data[data_min[match(data$cond, data_min$cond),]$min <= data$val,]
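To see question (1) concretely (a quick check of my own, not part of the original answer), the elementwise == recycles data_min$cond from length 3 to length 5, with a warning, so rows are matched by position rather than by condition:
data$cond == data_min$cond
# TRUE TRUE FALSE TRUE FALSE -- a,b,b,a,c against the recycled a,b,c,a,b;
# combined with the recycled val > min test, only row 4 ("a", 7) survives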
You could just merge the two data frames together to make things easier:
> m=merge(data,data_min,by='cond')
> m[which(m$val > m$min), c('cond','val')]
cond val
2 a 7
4 b 4
A solution using dplyr. We can perform a join first and then filter on the condition between the val and min columns.
library(dplyr)
data2 <- data %>%
  left_join(data_min, by = "cond") %>%
  filter(val > min) %>%
  select(-min)
data2
cond val
1 b 4
2 a 7

Nested subsetting with "["

I recently discovered that, after subsetting an object (e.g. a data frame) with "[", the resulting object can be subset with "[" again on the same line of code (I should have realized it earlier!). Here is an example:
# Create a data frame
df1 <- as.data.frame(matrix(1:9, nrow = 3))
# Take a look at the data frame
df1
V1 V2 V3
1 1 4 7
2 2 5 8
3 3 6 9
# If I want the value which is on the 3rd row and 2nd column
df1[3,2]
[1] 6
# But I could also
df1[,2][3]
[1] 6
A few words on the second alternative: df1[,2] returns an atomic vector, which is then subset with df1[,2][3].
The following data frame will help illustrate my issue. It is a simple data frame containing the names of 26 students, their respective departments, and a numeric value. A seed is set for reproducibility.
set.seed(123)
df2 <- data.frame(name = letters,
                  dept = sample(c("econ", "stat", "math"), 26, replace = TRUE),
                  value = runif(26, 0, 100))
head(df2)
name dept value
1 a econ 54.40660
2 b math 59.41420
3 c stat 28.91597
4 d math 14.71136
5 e math 96.30242
6 f econ 90.22990
I would like to know who has the lowest value in the econ department. The first thing I tried was:
df2[df2$dept == "econ" & df2$value == min(df2$value),]
[1] name dept value
<0 rows> (or 0-length row.names)
It took me a while to understand what I was doing wrong, but I finally realized that the problem was that my code assumed that the person who had the lowest value overall was also from the econ department, which is not the case (and that's the answer that R gave me). Actually, the person with the lowest value overall is from the stat department.
i <- which(df2$value == min(df2$value))
df2[i, ]
name dept value
9 i stat 2.461368
Of course, I can easily find the answer to my question with:
df_econ <- df2[df2$dept == "econ",]
df_econ
name dept value
1 a econ 54.40660
6 f econ 90.22990
15 o econ 14.28000
17 q econ 41.37243
18 r econ 36.88455
19 s econ 15.24447
df_econ[df_econ$value == min(df_econ$value),]
name dept value
15 o econ 14.28
But I would like to know if I can get the same result using "nested" subsetting with the [ operator. What I mean is code like this:
df2[df2$dept == "econ",][... ,]
I do not know how to refer to the value column at this point since the resulting data frame of the first subsetting operation df2[df2$dept == "econ",] is a data frame different from df2. I also know that the value column is the 3rd column, but I do not know how to set subsetting conditions using column indexes rather than their names.
Thank you for your help.
Here are some options:
library(dplyr)
# also in #bramtayl's answer:
df2 %>% filter(dept == "econ") %>% filter(value==min(value))
# or
df2 %>% filter(dept == "econ") %>% slice(which.min(value))
# or...
library(data.table)
setDT(df2)[dept == "econ"][value==min(value)]
# or
setDT(df2)[dept == "econ"][which.min(value)]
These packages offer convenient ways of chaining not available in base R except awkwardly, like
subset(subset(df2, dept=="econ"), value == min(value))
There may be other packages, but these two are widely used lately.
Comment. If you're just browsing data, I'd recommend aggregating at the dept level:
# dplyr:
df2 %>% group_by(dept) %>% slice(which.min(value))
# data.table:
df2[, .SD[which.min(value)], by=dept]
dept name value
1: econ o 14.280002
2: math t 13.880606
3: stat i 2.461368
Agreed that chaining is necessary:
library(magrittr)
df2 %>%
  `[`(.$dept == "econ", ) %>%
  `[`(.$value == min(.$value), )
Why not stick with dplyr though?
library(dplyr)
df2 %>%
  filter(dept == "econ") %>%
  filter(value == min(value))
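For completeness, a single-expression base R version is possible by wrapping the second subset in an anonymous function (my own addition, using the R >= 4.1 lambda syntax; it assumes df2 is still a plain data frame):
# Subset to econ, then pick the minimum-value row; `d` names the
# intermediate result so its value column can be referred to.
(\(d) d[d$value == min(d$value), ])(df2[df2$dept == "econ", ])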
