I have a very large dataframe which is built as follows:
Originaldf
I want to perform a pairwise t test within item A, comparing the measured value within the condition groups. So I would like to see if for all observations pertaining to item A, there is a difference between the measured values of the control group, test group, and placebo group (Condition).
The first thing I did was to split the dataframe into a list using dplyr's filter function.
Listdf <- split(originaldf, Item)
This worked and I got a list containing 82 elements with one dataframe corresponding to each item in the original dataframe.
I now am trying to perform the pairwise.t.test function on each element of the list. I am relatively new to R and think that writing a loop for this process, though inefficient, would help me understand what is going on the background. I know there is also the option to use the lapply function. I tried this on the Listdf with the following code, which I know is most likely much too simple but was worth a try.
lapply(Listdf, pairwise.t.test(Value, Condition))
However, I get the error Error in factor(g) : object 'Condition' not found. Not sure if there is a way to more specifically reference Condition so that it can be found. I've performed an individual pairwise.t.test on one of the items which worked with the following code.
pairwise.t.test(List$ItemA$Value, List$ItemA$Condition, p.adjust.method = "none")
However, I assume this would not work within the lapply function because I want it to perform the t.test for ItemA, ItemB, ItemC etc...
The loop I have tried so far is as follows:
for (i in Listdf) {
pairwise.t.test(List$i$logAddedConstant, List$i$Condition, p.adjust = "none")
}
For this I get the error "Error in split.default(X, group) : first argument must be a vector"
I believe this error corresponds to the original splitting of the original dataframe. However I don't quite understand why this error would show up this late in the code because the splitting of the dataframe worked without a problem.
I know I am probably missing something fundamental, but I am quite stumped and have tried multiple options to no avail. If anyone has another idea or suggestion I would be very grateful for the help. Please let me know if I should add some more information.
I made a very short example of a data.frame which is likewise structured as your originaldf
df <- data.frame(Item = c("A", "B", "C", "A", "B", "C"),
Value=runif(6),
Condition=c("Control","Control","Control", "Test", "Test", "Test"))
Listdf <- split(df, df$Item)
Using a simple for-loop
p <-list()
for (i in 1:length(Listdf)) {
p[[i]] <- pairwise.t.test(Listdf[[i]]$Value, Listdf[[i]]$Condition, p.adjust = "none")
}
Using lapply
p <- lapply(1:length(Listdf), function(x) {pairwise.t.test(Listdf[[x]]$Value, Listdf[[x]]$Condition, p.adjust = "none")})
Related
I have collected data from a survey in order to perform a choice based conjoint analysis.
I have preprocessed and clean data with python in order to use them in R.
However, when I apply the function dfidx on the dataset I get the following error: the two indexes don't define unique observations.
I really do not understand why. Before creating the .csv file I checked if there were duplicates through the pandas function final_df.duplicated().sum() and its out put was 0 meaning that there were no duplicates.
Can please some one help me to understand what I am doing wrong ?
Here is the code:
df <- read.csv('.../survey_results.csv')
df <- df[,-c(1)]
df$Platform <- as.factor(df$Platform)
df$Deposit <- as.factor(df$Deposit)
df$Fees <- as.factor(df$Fees)
df$Financial_Instrument <- as.factor(df$Financial_Instrument)
df$Leverage <- as.factor(df$Leverage)
df$Social_Trading <- as.factor(df$Social_Trading)
df.mlogit <- dfidx(df, idx = list(c("resp.id","ques"), "position"), shape='long')
Here is the link to the dataset that I am using https://github.com/AlbertoDeBenedittis/conjoint-survey-shiny/blob/main/survey_results.csv
Thank you in advance for you time
The function dfidx() is build for data frames "for which observations are defined by two (potentialy nested) indexes" (ref).
I don't think this function is build for more than two idxs. Especially that, in your df, there aren't any duplicates ONLY when considering the combinations of the three columns you mention above (resp.id, ques and position).
One solution to this problem is to "combine" the two columns resp.id and ques into one (called for example resp.id.ques) with paste(...).
df$resp.id.ques <- paste(df$resp.id, df$ques, sep="_")
Then you can write the following line which should work just fine:
df.mlogit <- dfidx(df, idx = list("resp.id.ques", "position"))
I have been using Stata and the loops are easily executed there. However, in R I have faced some errors in looping over variables. I tried some of the codes over here and it does not work. Basically, I am trying to clean the data by logging the values. I had to convert negative values to positive first before logging them.
I intend to loop over multiple firm statistics on the dataframe but I faced errors in doing so.
varlist <- c("revenue", "profit", "cost")`
for (v in varlist) {
data$log_v <- log(abs(ifelse(data$v>1, data$v, NA)))
data$log_v <- ifelse(data$v<0, data$log_v*-1,data$log_v)
}
Error in $<-.data.frame(tmp,"log_v", value = numeric(0)) : replacement has 0 rows, data has 9
It looks like you might be assuming that data$log_v is getting read as data$log_profit, but R's going to take it own it's own and read it as "log_v" all 3 times. This example might not be quite everything you're trying to do but it might help you. It's taking a list of variables and referencing them via their string names.
df <- data.frame(x = rnorm(15), y = rnorm(15))
vars <- c("x", "y")
for (v in vars) {
df[paste0("log_", v)] <- log(abs(df[v]))
}
Here's roughly the same thing in data.table.
library(data.table)
dt <- data.table(x = rnorm(15), y = rnorm(15))
dt[, `:=`(log_x = log(abs(x)), log_y = log(abs(y)))]
Here is an explanation to the source of your confusion:
A data.frame is a special type of list, it's elements are vectors of the same length – columns. Normally, you access an element of a list using the [[ function, for example df[["revenue"]]. Instead of "revenue", you can also use a variable, such as df[[varlist[1]]]. So far, so good.
However, lists have a convenience operator, $, which allows you to access the elements with less typing: df$revenue. Unfortunately, you cannot use variables this way: this by design. Since you don't have to use quotes with $, the operator cannot know whether you mean revenue as the literal name of the element or revenue as the variable that holds the literal name of the element.
Therefore, if you want to use variables, you need to use the [[ function, and not the $. Since programmers hate typing and want to make code as terse as possible, various ways around it have been invented, such as data.tables and tidyverse (I am exaggerating a bit here).
Also, here is a tidyverse solution.
library(tidyverse)
varlist <- c("revenue", "profit", "cost")
df <- data.frame(revenue=rnorm(100), profit=rnorm(100), cost=rnorm(100))
df <- df %>% mutate_at(varlist, list(log10 = ~ log10(abs(.))))
Explanation:
mutate_all applies log10(abs(.)) to every column. The dot . is a temporary variable that hold the column values for each of the columns.
by default, mutate_all will replace the existing variables. However, if instead of providing a function (~ log10(abs(.))) you provide a named list (list(log10 = ~ log10(abs(.)))), it will add new columns using log10 as a suffix in column name.
this method makes it easy to apply several functions to your columns, not only the one.
See? No (obvious) loops at all!
I am converting my for-loops in R for a model that has multiple input datasets. In the for-loop I use the current loop value to retrieve values from other datasets. I am looking to replicate this using an apply function (over columns in a dataset) however I'm struggling to establish index of the apply function in order to retrieve the appropriate variables from other data
The apply function references the column by the variable in the function which is fine and I've tried to use both colname (after having named my various columns by number) but have not had any joy. Below is an example dataset and for loop with what I'd like to achieve (simplified somewhat). The length of the vectors and the number of columns in the tabular dataset will always be equal.
iteration<-1:3
df <- data.frame("column1" = 6:10, "column2" = 12:16, "column3" = 31:35)
variable1<-rnorm(3,mean = 25)
variable2<-rnorm(3, mean = 0.21)
outcome<-numeric()
for (i in iteration) {
intermediate<-(mean(df[,i])*variable1[i])^variable2[i]
outcome<-c(outcome,intermediate)
}
outcome
The expected results are outcome above...trying this in apply
What I imagine it to be is this:
apply(df, 2, function(x) (mean(x)*variable1[colnumber(x)])^variable2[colnumber(x)]
or perhaps
apply(df, 2, function(x) (mean(x)*variable1[x])^variable2[x])
but these two obviously do not work.
first time user so apologies for any etiquette issues but found the answer to my own problem using the purrr package, but maybe this helps someone else
pmap(list(df, variable1, variable2), function(df, variable1, variable2) (mean(df)*variable1)^variable2)
I have a large data set I am attempting to sample rows from. Each row has a family ID, and there may be one or multiple rows for each family ID. I want to parse the data set by randomly sampling one row for each family ID. I have attempted to accomplish this by using both tapply() and split() + lapply() functions, but to no avail. Below is code that reproduces my issue - the size and scope of the factor levels and data entries mirror the data set I am working with.
set.seed(63)
f1 <- factor(c(rep(30000:32000, times=1),
rep(30500:31700, times = 2),
rep(30900:31900, times = 3)))
f2 <- factor(rep(sample(1:7, replace = TRUE), times = length(f1)/7))
x1 <- round(matrix(rnorm(length(f1)*300), nrow = length(f1), ncol = 300),3)
df <- data.frame(f1, f2, x1)
Next, I used tapply to sample one row per factor from f1, and then check for repeats. (f2 is a secondary factor that indexes another aspect of the observations, but is [hopefully] irrelevant here; I only include it for full disclosure of the structure of my data set).
s1 <- tapply(1:nrow(df), df$f1, sample, size=1)
any(duplicated(s1))
The output for the second line of code using duplicated is TRUE, which means there are repeats. Stumped, I tried split to see if that was the problem.
df.split <- split(1:nrow(df), df$f1)
any(duplicated(df.split))
The output here for duplicated is FALSE, so the problem is not split. I then used the output df.split with lapply and sample to see if the problem was with tapply.
df.unique <- unlist(lapply(df.split, sample, size = 1, replace = FALSE,
prob = NULL))
any(duplicated(df.unique))
In the first line, I sampled one value from each element of df.split which outputs a list, then I used unlist to convert into a vector. The output for duplicated here is also TRUE.
Somewhere within sample and lapply there is funky stuff going on (since tapply merely calls lapply). I'm not sure how to fix the issue (I searched SO and Google and found nothing related to my issue), so any help would be greatly appreciated!
EDIT: I'm hoping someone could tell me why the above code using tapply and lapply is not working as intended. Arthur has provided a nice answer, and I have coded a loop for sample as well. I'm wondering why the above code is misbehaving.
I would do that:
library(data.table)
data.table(df)[,.SD[sample(.N,1)],by='f1']
... but actually your original approach with tapply is faster if you just want an index and not the actual subset table ; however, you must notice that sample(n) actually samples in 1:n when length(n)==1. See ?sample. This version is error-proof:
s1 <- tapply(1:nrow(df), list(df$f1), function(v) v[sample(1:length(v), 1)])` is error prooff
Goal
If the following description is hard follow, please see the example "before" and "after" to see a straightforward example.
I have bartering data, with unique trade ids, and two sides of the trade. Side1 and Side2 are baskets, lists of item ids that represent both sides of the barter transaction.
I'd like to count the frequency each ITEM appears in TRADES. E.g, if item "001" appeared in 3 trades, I'd have a count of 3 (ignoring how many times the item appeared in each trade).
Further, I'd like to do this with the plyr ddply function.
(If you're interested as to my motivation, I working over many hundreds of thousands of transactions and am already using a ddply to calculate several other summary statistics. I'd like to add this to the ddply I'm already using, rather than calculate it after, and merge it into the ddply output.... sorry if that was difficult to follow.)
In terms of pseudo code I'm working off of:
merge each row of Side1 and Side2
by row, get unique() appearances of each item id
apply table() function
transpose and relabel output from table
Example of the structure of my data, and the output I desire.
Data Example (before):
df <- data.frame(TradeID = c("01","02","03","04"))
df$Side1 = list(c("001","001","002"),
c("002","002","003"),
c("001","004"),
c("001","002","003","004"))
df$Side2 = list(c("001"),c("007"),c("009"),c())
Desired Output (after):
df.ItemRelFreq_byTradeID <- data.frame(ItemID = c("001","002","003","004","007","009"),
RelFreq_byTrade = c(3,3,2,2,1,1))
One method to do this without ddply
I've worked out one way to do this below. My problem is that I can't quite seem to get ddply to do this for me.
temp <- table(unlist(sapply(mapply(c,df$Side1,df$Side2), unique)))
df.ItemRelFreq_byTradeID <- data.frame(ItemID = names(temp),
RelFreq_byTrade = temp[])
Thanks for any help you can offer!
Curtis
I believe this will do what you're asking for. It uses ddply. Twice!
res <- ddply(df, .(TradeID), function(df) data.frame(ItemID = c(df$Side1[[1]],df$Side2[[1]]), TradeID = df$TradeID))
ddply(res, .(ItemID), summarise, RelFreq_byTrade = length(unique(TradeID)))
Note that the ItemsIDs are slightly out of order.