I used a list to create 4 datasets. Now I want to list all potential ID variables in each dataset. My criteria are: 1) the variable has over 80% unique observations; 2) the variable has no more than 30% missing values.
To get these statistics, I first use the skim() function from the skimr package in R to get a tibble containing all the information, then I use filter() to keep the variables that meet the two criteria above. Here is my code:
dfa<- dflist[[1]]%>%
mutate_if(is.numeric,as.character)%>%
skim()%>%
as_tibble()%>%
filter(character.n_unique >=nrow(dflist[[1]])*0.01)%>%
filter(n_missing<=nrow(dflist[[1]])*0.30)
This code works fine and returns the expected variables for dataset 1. However, I have 4 datasets of different sizes, so I would like to put it into a loop. Here is my attempt:
First, I create a list, dfid, to hold the new results, since I do not want dflist to be modified. Then I changed the 1 in dflist[[1]] from the previous code to i. But this code does not work; R reports: "Error in filter(., dflist[[i]][, character.n_unique] >= nrow(dflist[[1]]) * :
Caused by error in [.data.frame:
! undefined columns selected".
Here is my code:
dfid<-list()
for (i in 1:4){
dfid[[i]]<-dflist[[i]]%>%
mutate_if(is.numeric,as.character)%>%
skim()%>%
as_tibble()%>%
filter(dflist[[i]][,character.n_unique] >=nrow(dflist[[i]])*0.01)%>%
filter(dflist[[i]][,n_missing]<=nrow(dflist[[i]])*0.30)
}
So my questions are:
How can I fix this error so that the loop works?
Once dfid[[i]] holds the desired variables from the 4 datasets, what code should I add to the loop to combine the 4 results, remove duplicate variable names, and finally get a vector of variable names from the combined list or dataset?
Thanks a lot for your help in advance~~!
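Column names must be quoted when indexing with [, unless the name is an object holding the column name. Here, though, skim() returns a new tibble, so character.n_unique and n_missing are columns of that tibble, not of dflist[[i]]. It may be easier to loop with map/lapply: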
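library(purrr)
library(dplyr)
library(skimr)
dfid <- map(dflist, ~ .x %>%
    mutate(across(where(is.numeric), as.character)) %>%
    skim() %>%
    as_tibble() %>%
    # nrow(.x) is the number of observations; n() here would count skim rows (one per variable)
    filter(character.n_unique >= nrow(.x) * 0.01) %>%
    filter(n_missing <= nrow(.x) * 0.30))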
Inside the chain we don't need [ at all. The same approach as a for loop:
dfid <- vector('list', length(dflist))
for (i in seq_along(dflist)) {
  tmp <- dflist[[i]]
  dfid[[i]] <- tmp %>%
    mutate_if(is.numeric, as.character) %>%
    skim() %>%
    as_tibble() %>%
    # use nrow(tmp): n() would count rows of the skim summary, not observations
    filter(character.n_unique >= nrow(tmp) * 0.01) %>%
    filter(n_missing <= nrow(tmp) * 0.30)
}
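The second question (combining the results into one vector of distinct variable names) is not covered above. A minimal base R sketch of that step follows, assuming toy data in place of the real dflist and a hypothetical helper called candidate_ids; it uses the 80% / 30% thresholds from the question's text rather than the 0.01 used in the code. With the skimr-based dfid list, the names should sit in the skim_variable column, so unique(unlist(lapply(dfid, function(d) d$skim_variable))) would be the analogous step.

```r
# Toy stand-in for dflist (hypothetical data, not from the question)
dflist <- list(
  data.frame(id = 1:10, grp = rep(c("a", "b"), 5), x = c(NA, NA, NA, NA, 5:10)),
  data.frame(key = letters[1:5], id = 1:5, flag = rep(TRUE, 5))
)

# Hypothetical helper: names of columns that satisfy both criteria
candidate_ids <- function(df, min_unique = 0.80, max_missing = 0.30) {
  n <- nrow(df)
  # count distinct non-missing values and missing values per column
  n_unique  <- vapply(df, function(col) length(unique(col[!is.na(col)])), integer(1))
  n_missing <- vapply(df, function(col) sum(is.na(col)), integer(1))
  names(df)[n_unique >= n * min_unique & n_missing <= n * max_missing]
}

# One character vector per dataset, then flatten and drop repeated names
id_vars <- unique(unlist(lapply(dflist, candidate_ids)))
id_vars
```

Because the helper takes the thresholds as arguments, the same call can be rerun with different cut-offs without touching the loop.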
Related
I have two datasets -- one of covariates and one of word lists. I want to subset the word list dataset by covariate values using their common participant ID variable pid. I wrote the following code to accomplish this:
x <- covariates %>% filter(percdemhomophily == "A lot")
y <- demlists %>% filter(pid %in% x$pid)
What this does is filter the covariates dataset by a specific variable value, and then use the filtered covariates to filter the word list dataset by pid. It seems to me that this could be written as a function so that I don't have to hardcode these lines over and over. I tried the function below:
covarsubset <- function(listdata, surveydata, survvar, varvalue){
subsetid <- surveydata %>% filter(survvar == varvalue)
subsetfluency <- listdata %>% filter(pid %in% subsetid$pid)
subsetfluency
}
When I call the function as shown below, I get an "object not found" error for the covariate variable I'm trying to filter by. But the code is essentially identical to the code that works, just more generalized.
subset <- covarsubset(demlists, covariates, percdemhomophily, "A lot")
I don't understand why this function won't work. What am I missing?
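A likely explanation, offered as a sketch rather than a definitive diagnosis: inside the function, filter(survvar == varvalue) does not find a column named survvar, so R evaluates the argument itself, and percdemhomophily does not exist as an object outside the data frame. One way around this is to pass the column name as a string and index with [[ ]]; the base R version below assumes small made-up covariates and demlists tables. Within dplyr, the embrace operator, filter({{ survvar }} == varvalue), is the usual alternative when the name is passed unquoted.

```r
# Made-up stand-ins for the two datasets (assumed structure: both share pid)
covariates <- data.frame(
  pid = 1:4,
  percdemhomophily = c("A lot", "Some", "A lot", "None")
)
demlists <- data.frame(
  pid  = c(1, 1, 2, 3, 4),
  word = c("tax", "vote", "party", "union", "law")
)

# survvar is a character string naming the column to filter on
covarsubset <- function(listdata, surveydata, survvar, varvalue) {
  keep <- surveydata[[survvar]] == varvalue
  subsetid <- surveydata[which(keep), , drop = FALSE]   # which() drops NA matches
  listdata[listdata$pid %in% subsetid$pid, , drop = FALSE]
}

result <- covarsubset(demlists, covariates, "percdemhomophily", "A lot")
result
```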
I have collected data from a survey in order to perform a choice based conjoint analysis.
I have preprocessed and cleaned the data with Python in order to use them in R.
However, when I apply the function dfidx to the dataset I get the following error: the two indexes don't define unique observations.
I really do not understand why. Before creating the .csv file I checked for duplicates with the pandas call final_df.duplicated().sum(), and its output was 0, meaning that there were no duplicates.
Can someone please help me understand what I am doing wrong?
Here is the code:
df <- read.csv('.../survey_results.csv')
df <- df[,-c(1)]
df$Platform <- as.factor(df$Platform)
df$Deposit <- as.factor(df$Deposit)
df$Fees <- as.factor(df$Fees)
df$Financial_Instrument <- as.factor(df$Financial_Instrument)
df$Leverage <- as.factor(df$Leverage)
df$Social_Trading <- as.factor(df$Social_Trading)
df.mlogit <- dfidx(df, idx = list(c("resp.id","ques"), "position"), shape='long')
Here is the link to the dataset that I am using https://github.com/AlbertoDeBenedittis/conjoint-survey-shiny/blob/main/survey_results.csv
Thank you in advance for your time.
The function dfidx() is built for data frames "for which observations are defined by two (potentialy nested) indexes" (ref).
I don't think this function is built for more than two indexes. In your df, rows are unique only when all three of the columns you mention above (resp.id, ques and position) are considered together, which is consistent with the whole-row duplicate check finding nothing.
One solution to this problem is to combine the two columns resp.id and ques into a single column (called, for example, resp.id.ques) with paste(...).
df$resp.id.ques <- paste(df$resp.id, df$ques, sep="_")
Then you can write the following line which should work just fine:
df.mlogit <- dfidx(df, idx = list("resp.id.ques", "position"))
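To see why the combined key helps, it can be useful to check which sets of columns actually identify rows uniquely. The sketch below uses a small invented data frame with the same three index columns (the values are assumptions, not taken from the linked survey file) and base R's anyDuplicated(), which returns 0 when no combination repeats.

```r
# Invented example: 2 respondents, up to 2 questions, 2 positions per question
df <- data.frame(
  resp.id  = c(1, 1, 1, 1, 2, 2),
  ques     = c(1, 1, 2, 2, 1, 1),
  position = c(1, 2, 1, 2, 1, 2)
)

# Two of the three columns are not enough: (resp.id, position) repeats
two_cols_repeat <- anyDuplicated(df[c("resp.id", "position")]) > 0

# Combine respondent and question into one choice-situation key
df$resp.id.ques <- paste(df$resp.id, df$ques, sep = "_")

# The combined key plus position identifies every row
combined_is_unique <- anyDuplicated(df[c("resp.id.ques", "position")]) == 0

two_cols_repeat
combined_is_unique
```

Running the same two checks on the real data before calling dfidx() should show whether the two indexes passed to it define unique observations.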
I have a data set of plant demographics from 5 years across 10 sites with a total of 37 transects within the sites. Below is a link to a GoogleDoc with some of the data:
https://docs.google.com/spreadsheets/d/1VT-dDrTwG8wHBNx7eW4BtXH5wqesnIDwKTdK61xsD0U/edit?usp=sharing
In total, I have 101 unique combinations.
I need to subset each unique set of data, so that I can run each through some code. This code will give me one column of output that I need to add back to the original data frame so that I can run LMs on the entire data set. I had hoped to write a for-loop where I could subset each unique combination, run the code on each, and then append the output for each model back onto the original dataset. My attempts at writing a subset loop have all failed to produce even a simple output.
I created a column, "SiteTY", with unique Site, Transect, Year combinations. So "PWR 832015" is site PWR Transect 83 Year 2015. I tried to use that to loop through and fill an empty matrix, as proof of concept.
transect=unique(dat$SiteTY)
ntrans=length(transect)
tmpout=matrix(NA, nrow=ntrans, ncol=2)
for (i in 1:ntrans) {
df=subset(dat, SiteTY==i)
tmpout[i,]=(unique(df$SiteTY))
}
When I do this, I notice that df has no observations. If I replace "i" with a known value (like PWR 832015) and run each line of the for-loop individually, it populates correctly. If I use is.factor() for i or PWR 832015, both return FALSE.
This particular code also gives me the error:
Error in `[<-`(`*tmp*`, , i, value = mean(df$Year)) : subscript out of bounds
I can only assume this happens because the data frame is empty.
I've read enough SO posts to know that for-loops are tricky, but I've tried more iterations than I can remember to try to make this work in the last 3 years to no avail.
Any tips on loops or ways to avoid them while getting the output I need would be appreciated.
Since you need to subset each unique set of data, run a function on it, and use the output to calculate a new value, consider two routes:
Using ave if your function expects and returns a single numeric column.
Using by if your function expects a data frame and returns anything.
ave
Returns a grouped inline aggregate column with repeated value for every member of group. Below, with is used as context manager to avoid repeated dat$ references.
# BY SITE GROUPING
dat$New_Column <- with(dat, ave(Numeric_Column, Site, FUN=myfunction))
# BY SITE AND TRANSECT GROUPINGS
dat$New_Column <- with(dat, ave(Numeric_Column, Site, Transect, FUN=myfunction))
# BY SITE AND TRANSECT AND YEAR GROUPINGS
dat$New_Column <- with(dat, ave(Numeric_Column, Site, Transect, Year, FUN=myfunction))
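As a concrete illustration of the ave calls above, here is a small runnable sketch. The data frame, the Height column, and the use of mean as the function are all assumptions made for the example; any function that returns one value (or one value per row) for the group would fit in the same place.

```r
# Invented data: two sites, three site/transect combinations
dat <- data.frame(
  Site     = c("A", "A", "B", "B", "B"),
  Transect = c(1, 1, 2, 2, 3),
  Height   = c(2, 4, 6, 8, 10)
)

# Group mean by site, repeated for every row of the group
dat$Site_Mean <- with(dat, ave(Height, Site, FUN = mean))

# Group mean by site and transect
dat$Site_Transect_Mean <- with(dat, ave(Height, Site, Transect, FUN = mean))

dat
```

The result has the same number of rows as the input, which is what makes ave convenient when the new column has to go straight back onto the original data.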
by
Returns a named list of objects (whatever your function returns), one for each possible grouping. With more than one grouping variable, tryCatch guards against combinations (possibly empty ones) for which myfunction raises an error; those return NULL.
# BY SITE GROUPING
obj_list <- by(dat, dat$Site, function(sub) {
myfunction(sub) # RUN ANY OPERATION ON sub DATA FRAME
})
# BY SITE AND TRANSECT GROUPINGS
obj_list <- by(dat, dat[c("Site", "Transect")], function(sub) {
tryCatch(myfunction(sub),
error = function(e) NULL)
})
# BY SITE AND TRANSECT AND YEAR GROUPINGS
obj_list <- by(dat, dat[c("Site", "Transect", "Year")], function(sub) {
tryCatch(myfunction(sub),
error = function(e) NULL)
})
# FILTERS OUT ALL NULLs (I.E., NO LENGTH)
obj_list <- Filter(length, obj_list)
# BUILDS SINGLE OUTPUT IF MATRIX OR DATA FRAME
final_obj <- do.call(rbind, obj_list)
Here's another approach using the dplyr library, in which I'm creating a data.frame of summary statistics for each group and then just joining it back on:
library(dplyr)
# Group by species (site, transect, etc) and summarise
species_summary <- iris %>%
group_by(Species) %>%
summarise(mean.Sepal.Length = mean(Sepal.Length),
mean.Sepal.Width = mean(Sepal.Width))
# A data.frame with one row per species, one column per statistic
species_summary
# Join the summary stats back onto the original data
iris_plus <- iris %>% left_join(species_summary, by = "Species")
head(iris_plus)
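As for why df came out empty in the original loop, one likely cause is that SiteTY==i compares the labels with the loop counter (1, 2, 3, ...) instead of with the i-th label, transect[i]. A minimal sketch of the loop with that one change follows; the toy data, the Height column, and the two summary columns are assumptions for illustration only.

```r
# Invented data with the same kind of SiteTY label as in the question
dat <- data.frame(
  SiteTY = c("PWR 832015", "PWR 832015", "PWR 842015", "ABC 12016"),
  Height = c(10, 12, 7, 9)
)

transect <- unique(dat$SiteTY)

# Pre-allocate one output row per unique label
out <- data.frame(SiteTY = transect, n_rows = NA_integer_, mean_height = NA_real_)

for (i in seq_along(transect)) {
  # compare against the i-th label, not against the counter i
  sub <- subset(dat, SiteTY == transect[i])
  out$n_rows[i] <- nrow(sub)
  out$mean_height[i] <- mean(sub$Height)
}

out
```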
Essentially it's about using bitmask/binary columns and row-oriented operations against a data table/frame. Firstly, to construct a logical vector from a combination of selected columns that can be used to mask a character vector to represent which columns are flagged. Secondly, row expansion: given a count in one column, produce a data table that contains the original row data replicated that number of times.
For summarising the flags using a row-wise bitmask, which uses purrr::reduce to concatenate the row-represented flags, I cannot find a succinct way to do this in a %>% chain rather than a separate for loop. I suspect purrr::map is required but I cannot get the syntax right.
For the row expansion, the nested for loop has appalling performance, and I cannot find a way for dplyr/purrr to replicate each row a given number of times. A map or similar function would need to produce and append multiple rows, which I don't think map is capable of.
The following code produces the required output - but, apart from performance issues (especially regarding row expansion), I'd like to be able to do this as vectorised operations.
library(tidyverse)
library(data.table)
dt <- data.table(C1=c(0,0,1,0,1,0),
C2=c(1,0,0,0,0,1),
C3=c(0,1,0,0,1,0),
C4=c(0,1,1,0,0,0),
C5=c(0,0,0,0,1,1),
N=c(5,2,6,8,1,3),
Spurious = '')
flags <- c("Scratching Head","Screaming",
"Breaking Keyboard","Coffee Break",
"Giving up")
# Summarise states
flagSummary <- function(dt){
interim <- dt %>%
dplyr::mutate_at(vars(C1:C5),.funs=as.logical) %>%
dplyr::mutate(States=c(""))
for(i in 1:nrow(interim)){
interim$States[i] <-
flags[as.logical(interim[i,1:5])] %>%
purrr::reduce(~ paste(.x, .y, sep = ","),.init="") %>%
stringr::str_replace("^[,]","") }
dplyr::select(interim,States,N) }
summary <- flagSummary(dt)
View(summary)
# Expand states
expandStates <- function(dt){
interim <- dt %>%
dplyr::mutate_at(vars(C1:C5), .funs=as.logical) %>%
dplyr::select_at(vars(C1:C5,N)) %>%
data.table::setnames(.,append(flags,"Count"))
expansion <- interim[0,1:5]
for(i in 1:nrow(interim)){
for(j in 1:interim$Count[i]){
expansion <- bind_rows(expansion, interim[i,1:5]) } }
expansion }
expansion <- expandStates(dt)
View(expansion)
As stated, the code produces the expected result. I'd 'like' to see the same without resorting to for loops and whilst being able to chain the functions into the initial mutate/selects.
As for the row expansion in the expandStates function, the answer is given in "Replicate each row of data.frame and specify the number of replications for each row?" by A5C1D2H2I1M1N2O1R2T1.
Essentially, the nested for loop is simply replaced by
interim[rep(rownames(interim[,1:5]),interim$Count),][1:5]
On my 'actual' data, this reduces the user time reported by system.time() from 28.64 seconds to 0.06 seconds to produce some 26000 rows.
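The flag summary can be handled in a similar loop-free way. The sketch below is a base R version working on a plain data.frame copy of the example data (an assumption; the original uses a data.table inside a dplyr chain): it builds a logical mask from C1:C5, pastes the matching flag names per row, and repeats row indices N times for the expansion.

```r
flags <- c("Scratching Head", "Screaming",
           "Breaking Keyboard", "Coffee Break",
           "Giving up")

# Same values as the question's dt, held in a plain data.frame
df <- data.frame(C1 = c(0, 0, 1, 0, 1, 0),
                 C2 = c(1, 0, 0, 0, 0, 1),
                 C3 = c(0, 1, 0, 0, 1, 0),
                 C4 = c(0, 1, 1, 0, 0, 0),
                 C5 = c(0, 0, 0, 0, 1, 1),
                 N  = c(5, 2, 6, 8, 1, 3))

flag_cols <- paste0("C", 1:5)

# Logical matrix: TRUE where a flag is set
mask <- as.matrix(df[flag_cols]) == 1

# One comma-separated string per row; rows with no flags give ""
states <- apply(mask, 1, function(row) paste(flags[row], collapse = ","))

# Row expansion: repeat each row index N times, then index once
expanded <- df[rep(seq_len(nrow(df)), times = df$N), flag_cols]

data.frame(States = states, N = df$N)
```

Using paste(..., collapse = ",") avoids the leading comma that the reduce-based version has to strip afterwards.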
I am trying to wrap my head around R, and I'm sure I'm doing something silly.
I have a dataframe that includes 30 brands (whose names I have separately in a list called "brands") and a list of new names that I wish to insert into the dataframe (called "known_brands").
I am trying to store the results of an if statement in new columns of the dataframe (using the names within "known_brands"), but this keeps generating an error message (unexpected '{' in "{").
I'm not sure where I'm going wrong - here's my code:
for(i in 1:length(brands)){
plot1a_df <- plot1a_df %>% mutate(known_brands[i] = ifelse(brands[i] >1, 1, 0))
}
To illustrate with data (assume 3 rows x 2 columns):
plot1a_df = data.frame(brands = c(1,0,2), Misc = c(0,0,0))
The idea is to end up with a third column ("known_brands") with c(0,0,1)
To add a logical column with dplyr:
library(dplyr)
plot1a_df %>% mutate(is_brand_known = brands %in% brand_list)
Another example with iris dataset.
species_list = c('setosa')
iris %>% mutate(is_setosa = Species %in% species_list)
for (i in 1:30){
plot1a_df[, known_cols[i]] <- ifelse(plot1a_df[,brands[i]] >1, 1, 0)
print(plot1a_df[, known_cols[i]])
}
I found that my solution could be achieved without mutate, although I still wonder whether it is possible to combine for loops with dplyr (I realise there's a lot of commentary on here, but nothing at a high enough level for this simpleton to understand, at least!).
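On the remaining question: the original mutate call most likely fails because the left-hand side of = inside a function call has to be a plain name, so known_brands[i] = ... is a syntax error before dplyr ever runs. The loop can also be dropped altogether. Below is a base R sketch that fills all the new columns in one assignment; the column names and the known_ prefix are invented for the example. In dplyr, mutate(across(all_of(brands), ...)) with the .names argument would be the corresponding idiom.

```r
# Invented data: two brand columns plus an unrelated column
plot1a_df <- data.frame(brand_a = c(1, 0, 2),
                        brand_b = c(3, 1, 0),
                        Misc    = c(0, 0, 0))

brands     <- c("brand_a", "brand_b")      # existing columns (assumed names)
known_cols <- paste0("known_", brands)     # new columns (assumed names)

# Apply the same rule to every brand column and assign all results at once
plot1a_df[known_cols] <- lapply(plot1a_df[brands],
                                function(x) ifelse(x > 1, 1, 0))

plot1a_df
```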