I am trying to wrap my head around R, and I'm sure I'm doing something silly.
I have a dataframe that includes 30 brands (whose names I have separately in a list called "brands") and a list of new names that I wish to insert into the dataframe (called "known_brands").
I am trying to store the results of an if statement in new columns of the dataframe (named after the entries in "known_brands"), but this keeps generating an error message (unexpected '{' in "{").
I'm not sure where I'm going wrong - here's my code:
for(i in 1:length(brands)){
plot1a_df <- plot1a_df %>% mutate(known_brands[i] = ifelse(brands[i] >1, 1, 0))
}
To illustrate with data (assume 3 rows x 2 columns):
plot1a_df = data.frame(brands = c(1,0,2), Misc = c(0,0,0))
The idea is to end up with a third column ("known_brands") with c(0,0,1)
To add a logical column with dplyr:
library(dplyr)
plot1a_df %>% mutate(is_brand_known = brands %in% brand_list)
Another example, with the iris dataset:
species_list = c('setosa')
iris %>% mutate(is_setosa = Species %in% species_list)
for (i in 1:30){
plot1a_df[, known_cols[i]] <- ifelse(plot1a_df[,brands[i]] >1, 1, 0)
print(plot1a_df[, known_cols[i]])
}
I found that my solution could be achieved without mutate, although I still wonder whether it is possible to combine for loops with dplyr. (I realise there's a lot of commentary on here, but nothing at a high enough level for this simpleton to understand, at least!)
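On the open question of using a for loop together with mutate: a minimal sketch, assuming a reasonably recent dplyr. The column names brandA and brandB and the "known_" prefix are made up for illustration. The new column name is injected with !! and :=, and the existing column is looked up by its string name through the .data pronoun.

```r
library(dplyr)

# Hypothetical toy data: two brand columns plus an unrelated column
plot1a_df <- data.frame(brandA = c(1, 0, 2), brandB = c(3, 1, 0), Misc = c(0, 0, 0))
brands <- c("brandA", "brandB")
known_cols <- paste0("known_", brands)  # assumed naming scheme

for (i in seq_along(brands)) {
  old_col <- brands[i]
  new_col <- known_cols[i]
  plot1a_df <- plot1a_df %>%
    # `!!new_col :=` uses the string in new_col as the column name;
    # .data[[old_col]] fetches the column whose name is stored in old_col
    mutate(!!new_col := ifelse(.data[[old_col]] > 1, 1, 0))
}

print(plot1a_df)
```

A plain `known_cols[i] = ...` inside mutate fails because the left-hand side of = in a function call must be a literal argument name, not an expression.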
I used the list to create 4 datasets. Now I want to list all potential ID variables in each dataset. My criteria are: 1) the variable has over 80% unique observations; 2) the variable does not have more than 30% missing values.
To get those statistics, I first use the skim() function from skimr to get a tibble containing all the information, then I use filter to pick out the variables I am looking for based on the two criteria above. Here is my code:
dfa<- dflist[[1]]%>%
mutate_if(is.numeric,as.character)%>%
skim()%>%
as_tibble()%>%
filter(character.n_unique >=nrow(dflist[[1]])*0.01)%>%
filter(n_missing<=nrow(dflist[[1]])*0.30)
This code works fine and returns the expected variables for dataset 1. However, I have 4 datasets of different sizes, so I would like to put it into a loop. Here is my attempt:
First, I create a dfid list to hold the new results, since I do not want dflist to be modified. Then I changed the 1 in dflist[[1]] in the previous code to "i". But this code does not work; R reports "Error in filter(., dflist[[i]][, character.n_unique] >= nrow(dflist[[1]]) * :
Caused by error in [.data.frame:
! undefined columns selected".
Here is my code:
dfid<-list()
for (i in 1:4){
dfid[[i]]<-dflist[[i]]%>%
mutate_if(is.numeric,as.character)%>%
skim()%>%
as_tibble()%>%
filter(dflist[[i]][,character.n_unique] >=nrow(dflist[[i]])*0.01)%>%
filter(dflist[[i]][,n_missing]<=nrow(dflist[[i]])*0.30)
}
So my questions are:
How to fix this error to make the goal possible?
Once dfid[[i]] holds the desired variables from the 4 datasets, what code should I add to the loop to combine the 4 results, remove duplicate variable names, and finally get a vector of variable names from the combined list or dataset?
Thanks a lot for your help in advance~~!
The column names should be quoted when we use [, unless the name is stored in an object. It may be easier to loop with map/lapply:
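library(purrr)
library(dplyr)
library(skimr)  # provides skim()
dfid <- map(dflist, ~ .x %>%
mutate(across(where(is.numeric), as.character))%>%
skim()%>%
as_tibble()%>%
# nrow(.x) is the size of the original dataset; n() would count rows of the skim summary
filter(character.n_unique >= nrow(.x)*0.01)%>%
filter(n_missing <= nrow(.x)*0.30))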
We don't need the [ when we use the chain. The same thing as a for loop:
dfid <- vector('list', length(dflist))
for (i in seq_along(dflist)){
tmp <- dflist[[i]]
dfid[[i]] <- tmp %>%
mutate_if(is.numeric,as.character)%>%
skim()%>%
as_tibble()%>%
filter(character.n_unique >= nrow(tmp)*0.01)%>%  # nrow(tmp): rows of the original dataset
filter(n_missing <= nrow(tmp)*0.30)
}
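The second part of the question (combining the per-dataset results into one vector of distinct variable names) is not covered above. Here is a minimal base R sketch that computes the two criteria directly, without skimr. The toy dflist and the helper id_candidates are made up for illustration, and the thresholds follow the 80% and 30% stated in the question rather than the 0.01 used in the code.

```r
# Hypothetical stand-in for dflist: two small data frames
dflist <- list(
  data.frame(id = 1:5, grp = c("a", "a", "b", "b", "b")),
  data.frame(key = c("k1", "k2", "k3", "k4"), flag = c(NA, NA, 1, 1))
)

# Hypothetical helper: names of the columns that look like ID variables
id_candidates <- function(d, min_unique = 0.8, max_missing = 0.3) {
  n <- nrow(d)
  # distinct non-missing values per column
  n_unique  <- sapply(d, function(col) length(unique(col[!is.na(col)])))
  # missing values per column
  n_missing <- sapply(d, function(col) sum(is.na(col)))
  names(d)[n_unique >= n * min_unique & n_missing <= n * max_missing]
}

dfid <- lapply(dflist, id_candidates)  # one character vector per dataset
id_vars <- unique(unlist(dfid))        # combined, duplicates removed
print(id_vars)
```

With the skimr-based results from the answer, the same last step would presumably be applied to the column holding the variable names in each tibble.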
I have code that creates a new dataframe, df2 which is a copy of an existing dataframe, df but with four new columns a,b,c,d. The values of these columns are given by their own functions.
The code below works as intended but it seems repetitive. Is there a more succinct form that you would recommend?
df2 <- df %>% mutate(a = lapply(df[,c("value")], f_a),
b = lapply(df[,c("value")], f_b),
c = lapply(df[,c("value")], f_c),
d = lapply(df[,c("value")], f_d)
)
Example of cell contents in "value" column "-0.57(-0.88 to -0.26)".
I am applying a function to extract first number:
f_a <- function(x){
substring(x, 1, regexpr("\\(", x)[1] - 1)
}
This works fine when applied to a single string (-0.57 from the example). In the data frame I found that lapply gives correct values based on input from any cell in the "value" column. The code seems a bit repetitive but works.
We can use map
library(tidyverse)
df[c('a', 'b', 'c', 'd')] <- map(list(f_a, f_b, f_c, f_d), ~ lapply(df$value, .x))
Note: Without the functions or an example, it is not clear whether this is the optimal solution. Also, as noted in the comments, many of the functions can be applied directly to the column instead of looping through each element.
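To illustrate applying a function directly to the column: a minimal sketch of a vectorised version of f_a, using the example cell format from the question. The name f_a_vec and the second sample value are made up. It uses sub() to drop everything from the first opening parenthesis onward, so no lapply is needed.

```r
# Sample values in the format shown in the question (second one is invented)
df <- data.frame(value = c("-0.57(-0.88 to -0.26)", "1.20(0.90 to 1.50)"),
                 stringsAsFactors = FALSE)

# Hypothetical vectorised counterpart of f_a: keep the text before the first "("
f_a_vec <- function(x) sub("\\(.*$", "", x)

df$a <- f_a_vec(df$value)  # works on the whole column at once
print(df$a)
```

The original f_a takes regexpr(...)[1], so on a whole vector it would reuse the position found in the first element, which is why it needed lapply.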
I have been using Stata, where loops are easy to write. However, in R I have run into errors when looping over variables. I tried some of the code posted here and it does not work. Basically, I am trying to clean the data by logging the values. I had to convert negative values to positive first before logging them.
I intend to loop over multiple firm statistics on the dataframe but I faced errors in doing so.
varlist <- c("revenue", "profit", "cost")
for (v in varlist) {
data$log_v <- log(abs(ifelse(data$v>1, data$v, NA)))
data$log_v <- ifelse(data$v<0, data$log_v*-1,data$log_v)
}
Error in $<-.data.frame(tmp,"log_v", value = numeric(0)) : replacement has 0 rows, data has 9
It looks like you might be assuming that data$log_v is getting read as data$log_profit, but R is going to take it literally and read it as "log_v" all 3 times. This example might not be quite everything you're trying to do, but it might help you. It takes a list of variables and references them via their string names.
df <- data.frame(x = rnorm(15), y = rnorm(15))
vars <- c("x", "y")
for (v in vars) {
df[paste0("log_", v)] <- log(abs(df[v]))
}
Here's roughly the same thing in data.table.
library(data.table)
dt <- data.table(x = rnorm(15), y = rnorm(15))
dt[, `:=`(log_x = log(abs(x)), log_y = log(abs(y)))]
Here is an explanation of the source of your confusion:
A data.frame is a special type of list; its elements are vectors of the same length (the columns). Normally, you access an element of a list using the [[ function, for example df[["revenue"]]. Instead of "revenue", you can also use a variable, such as df[[varlist[1]]]. So far, so good.
However, lists have a convenience operator, $, which allows you to access the elements with less typing: df$revenue. Unfortunately, you cannot use variables this way: this is by design. Since you don't have to use quotes with $, the operator cannot know whether you mean revenue as the literal name of the element or revenue as the variable that holds the literal name of the element.
Therefore, if you want to use variables, you need to use the [[ function, and not the $. Since programmers hate typing and want to make code as terse as possible, various ways around it have been invented, such as data.tables and tidyverse (I am exaggerating a bit here).
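The difference between $ and [[ can be seen in a short base R sketch. The toy values are invented, and the signed log (sign(x) * log(abs(x))) is only one reading of what the original loop was trying to compute.

```r
# Hypothetical toy data with positive and negative values
data <- data.frame(revenue = c(10, -100, 1000), profit = c(-10, 100, 1))
varlist <- c("revenue", "profit")

v <- "revenue"
print(is.null(data$v))  # TRUE: `$` looks for a column literally named "v"
print(data[[v]])        # the revenue column: `[[` uses the value stored in v

for (v in varlist) {
  x <- data[[v]]
  # build the new column name from the string and assign with [[
  data[[paste0("log_", v)]] <- sign(x) * log(abs(x))
}
print(names(data))
```

Because data$v is NULL, the expression in the original loop ends up with length zero, which is what the "replacement has 0 rows" error is reporting.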
Also, here is a tidyverse solution.
library(tidyverse)
varlist <- c("revenue", "profit", "cost")
df <- data.frame(revenue=rnorm(100), profit=rnorm(100), cost=rnorm(100))
df <- df %>% mutate_at(varlist, list(log10 = ~ log10(abs(.))))
Explanation:
mutate_at applies log10(abs(.)) to every column named in varlist. The dot . is a temporary variable that holds the column values for each of the columns.
by default, mutate_at will replace the existing variables. However, if instead of providing a function (~ log10(abs(.))) you provide a named list (list(log10 = ~ log10(abs(.)))), it will add new columns using log10 as a suffix in the column name.
this method makes it easy to apply several functions to your columns, not only one.
See? No (obvious) loops at all!
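In dplyr 1.0 and later, mutate_at is superseded by across(). A minimal sketch of the equivalent call, assuming such a version is installed; the .names pattern is chosen to reproduce the same "_log10" suffix, and the small random data frame is only for illustration.

```r
library(dplyr)

set.seed(1)  # arbitrary seed so the toy data is reproducible
varlist <- c("revenue", "profit", "cost")
df <- data.frame(revenue = rnorm(5), profit = rnorm(5), cost = rnorm(5))

df2 <- df %>%
  # all_of() selects the columns named in varlist; .names controls the new names
  mutate(across(all_of(varlist), ~ log10(abs(.x)), .names = "{.col}_log10"))

print(names(df2))
```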
Essentially it's about using bitmask/binary columns and row-oriented operations against a data table/frame. Firstly, to construct a logical vector from a combination of selected columns that can be used to mask a character vector to represent 'what' columns are flagged. Secondly, row expansion: given a count in one column, produce a data table that contains the original row data replicated that number of times.
For summarising the flags using a row-wise bitmask, which uses purrr::reduce to concatenate the row-represented flags, I cannot find a succinct method to do this in a %>% chain rather than a separate for loop. I suspect a purrr::map is required but I cannot get it/the syntax right.
For the row expansion, the nested for loop has appalling performance and I cannot find a way for dplyr/purrr to, row-wise, replicate that row a given number of times per row. A map and other functions would need to produce and append multiple rows, which I don't think map is capable of.
The following code produces the required output - but, apart from performance issues (especially regarding row expansion), I'd like to be able to do this as vectorised operations.
library(tidyverse)
library(data.table)
dt <- data.table(C1=c(0,0,1,0,1,0),
C2=c(1,0,0,0,0,1),
C3=c(0,1,0,0,1,0),
C4=c(0,1,1,0,0,0),
C5=c(0,0,0,0,1,1),
N=c(5,2,6,8,1,3),
Spurious = '')
flags <- c("Scratching Head","Screaming",
"Breaking Keyboard","Coffee Break",
"Giving up")
# Summarise states
flagSummary <- function(dt){
interim <- dt %>%
dplyr::mutate_at(vars(C1:C5),.funs=as.logical) %>%
dplyr::mutate(States=c(""))
for(i in 1:nrow(interim)){
interim$States[i] <-
flags[as.logical(interim[i,1:5])] %>%
purrr::reduce(~ paste(.x, .y, sep = ","),.init="") %>%
stringr::str_replace("^[,]","") }
dplyr::select(interim,States,N) }
summary <- flagSummary(dt)
View(summary)
# Expand states
expandStates <- function(dt){
interim <- dt %>%
dplyr::mutate_at(vars(C1:C5), .funs=as.logical) %>%
dplyr::select_at(vars(C1:C5,N)) %>%
data.table::setnames(.,append(flags,"Count"))
expansion <- interim[0,1:5]
for(i in 1:nrow(interim)){
for(j in 1:interim$Count[i]){
expansion <- bind_rows(expansion, interim[i,1:5]) } }
expansion }
expansion <- expandStates(dt)
View(expansion)
As stated, the code produces the expected result. I'd 'like' to see the same without resorting to for loops and whilst being able to chain the functions into the initial mutate/selects.
As for the row expansion of the expandStates function, the answer is proffered by A5C1D2H2I1M1N2O1R2T1 in "Replicate each row of data.frame and specify the number of replications for each row?".
Essentially, the nested for loop is simply replaced by
interim[rep(rownames(interim[,1:5]),interim$Count),][1:5]
On my 'actual' data, this reduces user systime from 28.64 seconds to 0.06 seconds to produce some 26000 rows.
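Both steps can also be written without explicit loops in base R. This is a minimal sketch using the sample data from the question as a plain data.frame; it is not the code from the linked answer. The flag summary pastes the flag names selected by each row's logical mask, and the expansion repeats row indices with rep().

```r
df <- data.frame(C1 = c(0, 0, 1, 0, 1, 0),
                 C2 = c(1, 0, 0, 0, 0, 1),
                 C3 = c(0, 1, 0, 0, 1, 0),
                 C4 = c(0, 1, 1, 0, 0, 0),
                 C5 = c(0, 0, 0, 0, 1, 1),
                 N  = c(5, 2, 6, 8, 1, 3))
flags <- c("Scratching Head", "Screaming",
           "Breaking Keyboard", "Coffee Break",
           "Giving up")

# Logical mask: one row per observation, one column per flag
mask <- as.matrix(df[paste0("C", 1:5)]) == 1

# Flag summary: comma-separated names of the flags set in each row
states <- apply(mask, 1, function(row) paste(flags[row], collapse = ","))
summary_df <- data.frame(States = states, N = df$N)

# Row expansion: repeat each row index N times, then index once
expanded <- df[rep(seq_len(nrow(df)), times = df$N), paste0("C", 1:5)]

print(summary_df)
print(nrow(expanded))
```

A row with no flags set yields an empty string, matching the behaviour of the original loop after the leading comma is stripped.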
I want to extract some values out of a vector, modify them and put them back at the original position.
I have been searching a lot and tried different approaches to this problem. I'm afraid this might be really simple but I'm not seeing it yet.
I create a vector and convert it to a dataframe. I also create an empty dataframe for the results.
hight <- c(5,6,1,3)
hight_df <- data.frame("ID"=1:length(hight), "hight"=hight)
hight_min_df <- data.frame()
For every pair of adjacent values, extract the smaller value with its corresponding ID.
for(i in 1:(length(hight_df[,2])-1))
{
hight_min_df[i,1] <- which(grepl(min(hight_df[,2][i:(i+1)]), hight_df[,2]))
hight_min_df[i,2] <- min(hight_df[,2][i:(i+1)])
}
Modify the extracted values and aggregate identical IDs by taking the higher value. At the end, write the modified values back.
hight_min_df[,2] <- hight_min_df[,2]+20
adj_hight <- aggregate(x=hight_min_df[,2],by=list(hight_min_df[,1]), FUN=max)
hight[adj_hight[,1]] <- adj_hight[,2]
This works perfectly as long as I have only unique values in hight.
How can I run this script with a vector like this: hight <- c(5,6,1,3,5)?
Alright, there's a lot to unpack here. Instead of looping, I would suggest piping functions with dplyr. Read the dplyr vignette; it is an outstanding resource and an excellent approach to data manipulation in R.
So using dplyr we can rewrite your code like this:
library(dplyr)
hight <- c(5,6,1,3,5) #skip straight to the test case
hight_df <- data.frame("ID"=1:length(hight), "hight"=hight)
adj_hight <- hight_df %>%
#logic pseudocode: going from the first row to the last,
# if the previous hight (using the lag() function)
# is greater than the current row's hight, take the current row's value; else
# take the previous row's value
mutate(subst.id = ifelse(lag(hight) > hight, ID, lag(ID)),
subst.val = ifelse(lag(hight) > hight, hight, lag(hight)) + 20) %>%
filter(!is.na(subst.val)) %>% #remove extra rows
select(subst.id, subst.val) %>% #take just the columns we want
#grouping - rewrite of your use of aggregate
group_by(subst.id) %>%
summarise(subst.val = max(subst.val)) %>%
data.frame(.)
#tying back in
hight[adj_hight[,1]] <- adj_hight[,2]
print(hight)
Giving:
[1] 25 6 21 23 5
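For comparison, the same logic can be sketched in base R without dplyr. This is an alternative under the same tie rule as the code above (on a tie the earlier element of the pair is chosen), not the answerer's method. It compares each element with its right-hand neighbour by position, so duplicated values no longer matter; the original loop presumably breaks on duplicates because grepl() matches the minimum against every element and can return several positions.

```r
hight <- c(5, 6, 1, 3, 5)  # test case with a duplicated value
n <- length(hight)

left  <- hight[-n]  # first element of each adjacent pair
right <- hight[-1]  # second element of each adjacent pair

# Position of the smaller element in each pair (ties go to the left one)
min_id  <- ifelse(right < left, seq_len(n)[-1], seq_len(n)[-n])
# Smaller value of each pair, modified by adding 20
min_val <- pmin(left, right) + 20

# For positions picked more than once, keep the larger modified value
adj <- tapply(min_val, min_id, max)

# Write the modified values back to their original positions
hight[as.integer(names(adj))] <- as.vector(adj)
print(hight)
```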