Accessing multiple data sources with [[ ]] indexing in R - r

Here is my code:
file.number <- c(1:29)
data <- setNames(lapply(paste0(file.number, ".csv"), read.csv), paste0(file.number, ".data"))
n <- c(1:3,10:15,21:26)
sw <- na.omit(data[[n]]$RT[data[[n]]$rep.sw=="sw"])
rep <-na.omit(data[[n]]$RT[data[[n]]$rep.sw=="rep"])
The problem is in the lines that define sw and rep: if n = 1, they work, but if I include multiple numbers I get an error, "recursive indexing fail." Is there a way I can access multiple indices at once?
Thanks R Community! Any advice would be much appreciated!

Too long for a comment.
It looks like data is a list of data frames. The list elements are named, e.g. 1.data, 2.data, etc. and each data frame has, among other things, columns named RT and rep.sw. So, like this:
## representative example???
df <- data.frame(RT=1:100,rep.sw=sample(c("sw","rep"),100,replace=TRUE))
data <- setNames(lapply(1:29,function(i)df),paste0(1:29,".data"))
You seem to want to remove NA's from the RT column of each data frame for rows where rep.sw=="sw" (or "rep").
If that is correct, then something like this should work:
sw <- lapply(data[n],function(df) with(df,na.omit(RT[rep.sw=="sw"])))
rep <- lapply(data[n],function(df) with(df,na.omit(RT[rep.sw=="rep"])))
This code will pass the data frames identified in n to the function one at a time, and for each of those return the values of column RT in the rows where rep.sw=="sw", with NA's omitted. The result will be a list of vectors, one per data frame.
I notice that most of the columns are imported as factors, which is probably a bad idea. You might want to import using:
data <- setNames(lapply(paste0(file.number, ".csv"), read.csv, stringsAsFactors=FALSE),
paste0(file.number, ".data"))
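The difference between [[ ]] and [ ] on a list can be sketched with toy data. This is only an illustration: the data frame below is made up, and the error is caught with try() rather than quoted, since its exact wording may vary.

```r
# Toy stand-in for the 29 imported files (made-up values, one NA in RT).
set.seed(1)
df <- data.frame(RT = c(NA, 2:100),
                 rep.sw = sample(c("sw", "rep"), 100, replace = TRUE),
                 stringsAsFactors = FALSE)
data <- setNames(lapply(1:29, function(i) df), paste0(1:29, ".data"))
n <- c(1:3, 10:15, 21:26)

# [[ ]] with a vector indexes recursively (data[[1]][[2]][[3]]...), so it fails here.
attempt <- try(data[[n]], silent = TRUE)
failed <- inherits(attempt, "try-error")

# [ ] with a vector returns a sub-list holding the selected data frames.
subset_list <- data[n]

# One vector of "sw" reaction times per selected data frame, NA's dropped.
sw <- lapply(subset_list, function(d) as.vector(na.omit(d$RT[d$rep.sw == "sw"])))

# Optionally pool everything into a single vector.
sw_all <- unlist(sw, use.names = FALSE)
```

If a single pooled vector is what you need, unlist() as in the last line is one way to get it; otherwise keep the list and continue with lapply() or sapply().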

Related

error with dfidx: the two indexes don't define unique observations

I have collected data from a survey in order to perform a choice based conjoint analysis.
I preprocessed and cleaned the data with Python in order to use them in R.
However, when I apply the function dfidx to the dataset I get the following error: the two indexes don't define unique observations.
I really do not understand why. Before creating the .csv file I checked for duplicates with the pandas call final_df.duplicated().sum(), and its output was 0, meaning that there were no duplicates.
Can someone please help me understand what I am doing wrong?
Here is the code:
df <- read.csv('.../survey_results.csv')
df <- df[,-c(1)]
df$Platform <- as.factor(df$Platform)
df$Deposit <- as.factor(df$Deposit)
df$Fees <- as.factor(df$Fees)
df$Financial_Instrument <- as.factor(df$Financial_Instrument)
df$Leverage <- as.factor(df$Leverage)
df$Social_Trading <- as.factor(df$Social_Trading)
df.mlogit <- dfidx(df, idx = list(c("resp.id","ques"), "position"), shape='long')
Here is the link to the dataset that I am using https://github.com/AlbertoDeBenedittis/conjoint-survey-shiny/blob/main/survey_results.csv
Thank you in advance for your time.
The function dfidx() is built for data frames "for which observations are defined by two (potentially nested) indexes" (ref).
I don't think this function is built for more than two indexes. In your df, the rows are unique ONLY when all three columns you mention above (resp.id, ques and position) are considered together.
One solution to this problem is to "combine" the two columns resp.id and ques into one (called for example resp.id.ques) with paste(...).
df$resp.id.ques <- paste(df$resp.id, df$ques, sep="_")
Then you can write the following line which should work just fine:
df.mlogit <- dfidx(df, idx = list("resp.id.ques", "position"))
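To see why a combined key helps, you can count duplicated index pairs in base R before calling dfidx(). The toy data frame below is an assumption made up for illustration (two respondents, two questions, two positions); only the column names come from the question.

```r
# Hypothetical miniature of the survey layout.
df <- data.frame(
  resp.id  = rep(1:2, each = 4),
  ques     = rep(rep(1:2, each = 2), times = 2),
  position = rep(1:2, times = 4)
)

# Two columns alone repeat: anyDuplicated() returns a row index greater than 0.
dup_two <- anyDuplicated(df[, c("resp.id", "position")])

# Combining resp.id and ques gives one key per choice situation.
df$resp.id.ques <- paste(df$resp.id, df$ques, sep = "_")

# Now every (key, position) pair is unique: anyDuplicated() returns 0.
dup_combined <- anyDuplicated(df[, c("resp.id.ques", "position")])
```

Running the same anyDuplicated() check on your real data is a quick way to confirm that the two columns you hand to idx identify each row exactly once.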

Combine imputed data by group in r using mice

my question is a follow-up to this question on imputation by group using "mice":
multiple imputation and multigroup SEM in R
The code in the answer works fine as far as the imputation part goes. But afterwards I am left with a list of data sets that each look complete, rather than a single data set. The sample looks as follows:
# Set up data frame
df.g1<-data.frame(ID=rep("A",5),x1=floor(runif(5,0,2)),x2=floor(runif(5,10,20)),x3=floor(runif(5,100,150)))
df.g2<-data.frame(ID=rep("B",5),x1=floor(runif(5,0,2)),x2=floor(runif(5,25,50)),x3=floor(runif(5,200,250)))
df.g3<-data.frame(ID=rep("C",5),x1=floor(runif(5,4,5)),x2=floor(runif(5,75,99)),x3=floor(runif(5,500,550)))
df<-rbind(df.g1,df.g2,df.g3)
# Introduce NAs
df$x1[rbinom(15,1,0.1)==1]<-NA
df$x2[rbinom(15,1,0.1)==1]<-NA
df$x3[rbinom(15,1,0.1)==1]<-NA
df
# Impute values by group:
df.clean<-lapply(split(df,df$ID), function(x) mice::complete(mice(df,m=5)))
df.clean
As you can see, df.clean is a list of 3, one element per group. But each element already contains a complete data set of the kind I am looking for.
The original answer suggests using rbind() on the data in df.clean, which leaves me with a new data set of 45 observations (3x the original size).
Here is the original code for the last step:
imputed.both <- do.call(args = df.clean, what = rbind)
Which data is the "right" one? And why the last step?
Thanks a bunch!
There's a bug in the code. I have an edited version below that works:
#Set up data frame
set.seed(12345)
df.g1<-data.frame(ID=rep("A",5),x1=floor(runif(5,0,2)),x2=floor(runif(5,10,20)),x3=floor(runif(5,100,150)))
df.g2<-data.frame(ID=rep("B",5),x1=floor(runif(5,0,2)),x2=floor(runif(5,25,50)),x3=floor(runif(5,200,250)))
df.g3<-data.frame(ID=rep("C",5),x1=floor(runif(5,4,5)),x2=floor(runif(5,75,99)),x3=floor(runif(5,500,550)))
df<-rbind(df.g1,df.g2,df.g3)
#Introduce NAs
df$x1[rbinom(15,1,0.1)==1]<-NA
df$x2[rbinom(15,1,0.1)==1]<-NA
df$x3[rbinom(15,1,0.1)==1]<-NA
# check NAs
colSums(is.na(df))
#Impute values by group:
# here's the fix: pass the group subset x to mice(), not the full df
df.clean<-lapply(split(df,df$ID), function(x) mice::complete(mice(x,m=5)))
imputed.both <- do.call(args = df.clean, what = rbind)
dim(imputed.both)
# returns 15 4
In the code in the question, you have
df.clean<-lapply(split(df,df$ID), function(x) mice::complete(mice(df,m=5)))
dim(do.call(rbind,df.clean))
# this returns 45 4
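The anonymous function takes the argument "x", but its body passes df, which R finds in the global environment. Hence every group is imputed on the complete df, and rbind() stacks three full copies (3 x 15 = 45 rows).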
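The scoping problem can be reproduced without mice at all. In this sketch, nrow() stands in for the imputation step (an assumption purely for illustration), so you can see how many rows each call actually receives.

```r
# 15 rows in three groups of 5, mirroring the shape of the example data.
df <- data.frame(ID = rep(c("A", "B", "C"), each = 5), x1 = 1:15)

# Bug pattern: the body ignores x and reaches for the global df.
rows_bug <- sapply(split(df, df$ID), function(x) nrow(df))

# Fixed pattern: the body works on the group subset x.
rows_fix <- sapply(split(df, df$ID), function(x) nrow(x))

sum(rows_bug)  # 45: three copies of the full data
sum(rows_fix)  # 15: each row handled once
```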
So to answer your question, if you do this step:
split(df,df$ID)
You split your data frame into a list of data frames, one each for the rows with ID A, B and C. If you then lapply() over this list, you get
df.clean<-lapply(split(df,df$ID), function(x) mice::complete(mice(x,m=5)))
names(df.clean)
lapply(df.clean,dim)
Each item of the list df.clean contains the imputed subset of the original df for one ID (A, B or C). You then combine this list back into a single data.frame using:
imputed.both <- do.call(rbind,df.clean)

How to find similarities between 2 datasets and generate a new dataframe consisting of these rows which coincide?

I have the results of radiosonde observations for more than 1000 stations in one file, and a list of the stations (81) that actually interest me. I need to make a new data frame that includes only the first file's rows for those stations.
So, I have two datasets imported from .txt files into R. The first is a 6694668x6 data frame and the second is 81x1, where the second dataset's rows coincide with some of the values in the first dataset's 1st column (the values look like this: ACM00078861).
d = data.frame(matrix(ncol = 6, nrow = 0))
for(i in 1:81){
for (j in 1:6694668) {
if(stations[i,1] == ghgt_00z.mly[j,1]){
rbind(d,ghgt_00z.mly[j,] )
j + 1
} else {j+1}
}
}
I wanted to generate a new data frame which would look like "ghgt_00z.mly", but containing only the rows for the stations which are listed in "stations".
Of course, the code ran for a couple of days and all I received was a warning message.
Please help me!
There are a lot of ways to do this. I personally use the classic merge():
res <- merge(x=stations, y=ghgt_00z.mly, by='common_column_name', all.x = TRUE)
Where common_column_name is a column name present in both data frames. The result combines the two data frames and contains all columns present in both; you can remove the columns you don't need afterwards.
Second useful option is:
library(dplyr)
inp <- stations$column_in_stations
res <- filter(ghgt_00z.mly, grepl(paste(inp, collapse="|"), column_of_interest))
Where inp holds the station IDs from stations, and column_of_interest is the column of ghgt_00z.mly that contains the same kind of values.
Since I don't have the datasets, I can't check these solutions, so I can't guarantee that they work.
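For exact ID matches, a vectorized %in% filter in base R is another option worth trying; it avoids both the nested loop and the regular expression. This is a sketch on made-up data: the column name id and every value except ACM00078861 are assumptions.

```r
# Hypothetical stand-ins for the two imported tables.
stations <- data.frame(id = c("ACM00078861", "XXM00000002"),
                       stringsAsFactors = FALSE)
obs <- data.frame(id = c("ACM00078861", "ZZM00000009",
                         "XXM00000002", "ACM00078861"),
                  value = c(1.5, 2.0, 3.2, 4.1),
                  stringsAsFactors = FALSE)

# Keep only the observation rows whose station ID is in the list of interest.
res <- obs[obs$id %in% stations$id, ]
```

Because the comparison is done on whole columns at once, this should scale to millions of rows far better than growing a data frame row by row inside a loop.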

Saving rows into variables in R

I have a 18-by-48 matrix.
Is there a way to save each of the 18 rows automatically in a separate variable (e.g., from r1 to r18) ?
I'd definitely advise against splitting a data.frame or matrix into its constituent rows. If I absolutely had to split the rows up, I'd put them in a list and then operate from there.
If you desperately had to split it up, you could do something like this:
toy <- matrix(1:(18*48),18,48)
variables <- list()
for(i in 1:nrow(toy)){
variables[[paste0("variable", i)]] <- toy[i,]
}
list2env(variables, envir = .GlobalEnv)
I'd be inclined to stop after the for loop and avoid the list2env. But I think this should give you your result.
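Stopping at the list might look like the sketch below. The names r1 to r18 follow the question; the row sums are just a made-up example of operating on every row without creating 18 global variables.

```r
toy <- matrix(1:(18 * 48), nrow = 18, ncol = 48)

# One list element per row, named r1 ... r18.
rows <- lapply(seq_len(nrow(toy)), function(i) toy[i, ])
names(rows) <- paste0("r", seq_len(nrow(toy)))

# Access a single row by name, or apply a function to all of them.
first_row <- rows$r1
row_sums  <- sapply(rows, sum)
```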
I believe you can select a row r from your data frame d by indexing without a column specified:
var <- d[r,]
Thus you can extract all of the rows into a variable by using
var <- d[1:nrow(d),]
Where var[1,] is the first row, var[2,] the second, etc. Not sure if this is exactly what you are looking for. Why would you want 18 different variables, one for each row?
Assuming your matrix is mat:
result <- data.frame(t(mat))
colnames(result) <- paste("r", 1:18, sep="")
attach(result)

repeat the assigning of data frame in R [duplicate]

I am new to R and Stack Overflow so this will probably have a very simple solution.
I have a set of data from 20 different subjects. In the future I will have to perform a lot of different actions on this data and will have to repeat each action for every individual set, analyzing them separately and then recombining them.
My question is how can I automate this process:
P4 <- read.delim("P4Rtest.txt")
P7 <- read.delim("P7Rtest.txt")
P13 <- read.delim("P13Rtest.txt")
etc etc etc.
I have tried a for loop but seem to get stuck on creating a new data.frame with a unique name every time.
Thank you for your help
The R way to do this would be to keep all the data sets together in a named list. For that you can use the following, where n is the number of files.
nm <- paste0("P", 1:n) ## create the names P1, P2, ..., Pn
dfList <- setNames(lapply(paste0(nm, "Rtest.txt"), read.delim), nm)
Now dfList will contain all the data sets. You can access them individually with dfList$P1 for P1, dfList$P2 for P2, and so on.
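Here is a self-contained sketch of the named-list approach for file numbers that are not consecutive (4, 7 and 13, as in the question). The temporary directory, the file contents and the column names trial and rt are all made up so that the example runs anywhere.

```r
# Write three small tab-delimited files to a temporary directory (made-up data).
tmp <- tempfile("rtest")
dir.create(tmp)
ids <- c(4, 7, 13)
for (id in ids) {
  write.table(data.frame(trial = 1:3, rt = id * 100 + 1:3),
              file.path(tmp, paste0("P", id, "Rtest.txt")),
              sep = "\t", row.names = FALSE, quote = FALSE)
}

# Read them all into one named list: P4, P7, P13.
nm <- paste0("P", ids)
dfList <- setNames(lapply(file.path(tmp, paste0(nm, "Rtest.txt")), read.delim), nm)

# Analyze each data set separately...
mean_rt <- sapply(dfList, function(d) mean(d$rt))

# ...or recombine them, tagging each row with its participant.
combined <- do.call(rbind, Map(function(d, id) cbind(participant = id, d),
                               dfList, names(dfList)))
```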
There are a bunch of different ways of doing stuff like this. You could combine all the data into one data frame using rbind. The first answer here has a good way of doing that: Replace rbind in for-loop with lapply? (2nd circle of hell)
If you combine everything into one data frame, you'll need to add a column that identifies the participant. So instead of
P4 <- read.delim("P4Rtest.txt")
...
You would have something like
my.list <- vector("list", number.of.subjects)
for(participant.number in 1:number.of.subjects){
# load individual participant data
participant.filename <- paste("P", participant.number, "Rtest.txt", sep="")
participant.df <- read.delim(participant.filename)
# add a column identifying the participant:
participant.df$participant.number <- participant.number
my.list[[participant.number]] <- participant.df
}
# stack all participants into one data frame
solution <- do.call(rbind, my.list)
If you want to keep them separate data frames for some reason, you can keep them in a list (leave off the last rbind line) and use lapply(my.list, function(participant.df) { stuff you want to do }) whenever you want to do stuff to the data frames.
You can use assign. Assuming all your files have a similar format as you have shown, this will work for you:
# Define how many files there are (with the numbers).
numFiles <- 10
# Run through that sequence.
for (i in 1:numFiles) {
fileName <- paste0("P", i, "Rtest.txt") # Creating the name to pull from.
file <- read.delim(fileName) # Reading in the file.
dName <- paste0("P", i) # Creating the name to assign the file to in R.
assign(dName, file) # Assigning the data frame to that name in R.
}
There are other methods that are faster and more compact, but I find this to be more readable, especially for someone who is new to R.
Additionally, if your numbers aren't a complete sequence like I've used here, you can define a vector of the numbers that are used and loop over it directly with for (i in numFiles) instead of for (i in 1:numFiles):
numFiles <- c(1, 4, 10, 25)

Resources