bind_rows - but some columns are missing - R

I read this data set and I want to join the data for the training set and the test set (I should mention that this is part of a Coursera course exercise).
I have read both data sets and given all columns names; the training data has 7352 rows and 562 columns and the test set has 2947 rows and 562 columns.
The names of the columns of both data sets are the same.
When I try to join the data with bind_rows I get a data set with 10299 rows but with 478 columns, not 562.
When I use rbind I get the correct result, but I need to cast it again using tbl_df, so I would prefer to do it with bind_rows.
The following is the script I wrote; running it from a folder containing the unzipped data from the above link (e.g. the folder "UCI HAR Dataset") reproduces the problem.
## Setting the script folder to be current directory
CurrentScriptDirectory <- dirname(sys.frame(1)$ofile)
setwd(CurrentScriptDirectory)
library(dplyr)
# Reading the data
train_x <- tbl_df(read.table("./UCI HAR Dataset/train/X_train.txt"))
train_y <- tbl_df(read.table("./UCI HAR Dataset/train/y_train.txt"))
test_x <- tbl_df(read.table("./UCI HAR Dataset/test/X_test.txt"))
test_y <- tbl_df(read.table("./UCI HAR Dataset/test/y_test.txt"))
#Giving the y's proper names
colnames(train_y) <- c("Activity Name")
colnames(test_y) <- c("Activity Name")
# Reading the feature names
featureNames <- read.table("./UCI HAR Dataset/features.txt")
featureNames <- featureNames[,2]
# Giving the training and test data proper names
colnames(train_x) <- featureNames
colnames(test_x) <- featureNames
labeledTrainingSet <- bind_cols(train_x,train_y)
labeledTestSet <- bind_cols(test_x,test_y)
labeledDataSet <- bind_rows(labeledTrainingSet, labeledTestSet)
Can someone help me understand what I'm doing wrong?

I've worked with that dataset and ran into the same issue. As others mentioned, there are duplicate features.
Rename the duplicate columns so that they are unique (and syntactically valid). You can use:
make.names(X, unique = TRUE, allow_ = TRUE)
where X is a character vector of column names. With unique = TRUE the function appends suffixes to the duplicated names, so you don't lose the original nomenclature. See http://www.inside-r.org/r-doc/base/make.names for more details.
After all of your column names are unique, dplyr::bind_rows() will work!
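Applied to the script in the question, that might look like the sketch below (variable names are taken from the question's code, so treat it as a suggestion rather than a tested fix):
# De-duplicate the feature names before assigning them, so bind_rows() keeps every column
featureNames <- make.names(featureNames, unique = TRUE)
colnames(train_x) <- featureNames
colnames(test_x) <- featureNames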

Just checked it out. You have duplicated names in your featureNames set. These are dropped by bind_rows.
test1 <- data.frame(c(1,2,3), c(1,NA,3), c(1,2,NA))
names(test1) <- c("A","B","B")
test2 <- data.frame(c(1,2,3), c(1,NA,3), c(1,2,NA))
names(test2) <- c("A","B","B")
test3 <- bind_rows(test1, test2)
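You can confirm the duplication directly on the feature-name vector from the script above; a quick check might look like:
# How many feature names are repeated?
sum(duplicated(featureNames))
# Which names occur more than once?
unique(featureNames[duplicated(featureNames)])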

Related

error with dfidx: the two indexes don't define unique observations

I have collected data from a survey in order to perform a choice based conjoint analysis.
I have preprocessed and cleaned the data with Python in order to use it in R.
However, when I apply the function dfidx on the dataset I get the following error: the two indexes don't define unique observations.
I really do not understand why. Before creating the .csv file I checked for duplicates with the pandas call final_df.duplicated().sum(), and its output was 0, meaning that there were no duplicates.
Can someone please help me understand what I am doing wrong?
Here is the code:
df <- read.csv('.../survey_results.csv')
df <- df[,-c(1)]
df$Platform <- as.factor(df$Platform)
df$Deposit <- as.factor(df$Deposit)
df$Fees <- as.factor(df$Fees)
df$Financial_Instrument <- as.factor(df$Financial_Instrument)
df$Leverage <- as.factor(df$Leverage)
df$Social_Trading <- as.factor(df$Social_Trading)
df.mlogit <- dfidx(df, idx = list(c("resp.id","ques"), "position"), shape='long')
Here is the link to the dataset that I am using https://github.com/AlbertoDeBenedittis/conjoint-survey-shiny/blob/main/survey_results.csv
Thank you in advance for your time.
The function dfidx() is built for data frames "for which observations are defined by two (potentially nested) indexes" (ref).
I don't think this function is built for more than two indexes. In particular, in your df there are no duplicates ONLY when you consider the combination of all three columns you mention (resp.id, ques and position).
One solution to this problem is to "combine" the two columns resp.id and ques into one (called for example resp.id.ques) with paste(...).
df$resp.id.ques <- paste(df$resp.id, df$ques, sep="_")
Then you can write the following line which should work just fine:
df.mlogit <- dfidx(df, idx = list("resp.id.ques", "position"))
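As a quick sanity check before calling dfidx(), you can verify that the combined index really is unique (a sketch using the column names from the question):
# Returns 0 if every (resp.id.ques, position) pair occurs exactly once
anyDuplicated(df[, c("resp.id.ques", "position")])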

How to create a table in R populated with 1s and 0s to show presence of values from another table?

I'm working with data regarding people and what class of medicine they were prescribed. It looks something like this (the actual data is read in via txt file):
library(data.table)
test <- matrix(c(1,"a",1,"a",1,"b",2,"a",2,"c"), ncol=2, byrow=TRUE)
colnames(test) <- c("id","med")
test <- as.data.table(test)
test <- unique(test[, 1:2])
test
The table has about 5 million rows, 45k unique patients, and 49 unique medicines. Some patients have multiples of the same medicines, which I remove. Not all patients have every medicine. I want to make each of the 49 unique medicines into separate columns, and have each unique patient be a row, and populate the table with 1s and 0s to show if the patient has the medicine or not.
I was trying to use spread or dcast, but there's no value column. I tried to amend this by adding a column of 1s
test$true <- rep(1, nrow(test))
And then using tidyr
library(tidyr)
test_wide <- spread(test, med, true, fill = 0)
My original data produced this error but I'm not sure why the new data isn't reproducing it...
Error: `var` must evaluate to a single number or a column name, not a list
Please let me know what I can do to make this a better reproducible example; sorry, I'm really new to this.
It looks like you are trying to do one-hot encoding here. For this, please refer to the "onehot" package and its documentation.
Code for reference:
library(onehot)
test <- matrix(c(1,"a",1,"a",1,"b",2,"a",2,"c"), ncol=2, byrow=TRUE)
colnames(test) <- c("id","med")
test <- as.data.frame(test, stringsAsFactors=TRUE)  # med must be a factor for onehot()
str(test)
test$id <- as.numeric(as.character(test$id))  # keep id as a plain numeric column
str(test)
encoder <- onehot(test)  # build the encoder from the factor columns
finaldata <- predict(encoder, test)  # expand med into one 0/1 column per level
finaldata
Make sure that all the columns that you want to be encoded are of the type factor. Also, I have taken the liberty of changing data.table to data.frame.
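If you prefer to stay in base R, a rough alternative sketch (using the question's test data) is to cross-tabulate id against med and collapse the counts to 0/1 presence flags:
# Counts of each med per id
tab <- table(test$id, test$med)
# One row per id, one column per med; convert counts to presence flags
presence <- as.data.frame.matrix(tab)
presence[] <- as.integer(presence > 0)
presence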

Merge in R only shows Header

I have 3 large Excel databases converted to CSV. I wish to combine these into one by using R.
I have tagged the 3 files as dat1,dat2,dat3 respectively. I tried to merge dat1 and dat2 with the name myfulldata, and then merge myfulldata with dat3, saved as myfulldata2.
When I did this, though, only the headers remained in the combination; essentially none of the contents of the databases were visible. The number of "obs" in each myfulldata is shown as 0, despite the respective obs counts for each individual component being very large. Can anyone advise how to resolve this?
Code:
dat1 <- read.csv("PS 2014.csv", header=T)
dat2 <- read.csv("PS 2015.csv", header=T)
dat3 <- read.csv("PS 2016.csv", header=T)
myfulldata = merge(dat1, dat2)
myfulldata2 = merge(myfulldata, dat3)
save(myfulldata2, file = "Palisis.RData")
Doing a merge in R is analogous to doing a join between two tables in a database. I suspect what you want to do is to aggregate your three CSV files row-wise (i.e. union them). In this case, you can try using rbind instead:
myfulldata <- rbind(dat1, dat2)
myfulldata <- rbind(myfulldata, dat3)
save(myfulldata, file = "Palisis.RData")
Note that this assumes that the number, names, and ideally the types of the columns in each data frame are the same (cf. doing a UNION in SQL).
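Before binding, it can be worth checking that the three data frames really do share the same column names, since rbind is strict about this; a small sketch using the objects above:
# TRUE if the column names match exactly (same names, same order)
identical(names(dat1), names(dat2))
identical(names(dat1), names(dat3))
# Any columns present in dat1 but missing from dat2
setdiff(names(dat1), names(dat2))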

Adding Factor Scores to the Data Set in R using cbind

I am having difficulties adding factor scores to the original data set. It is not a difficult procedure at all, as is described here. However, in my case, I receive the following error to the below code:
fa <- factanal(data, factors=2, rotation="promax", scores="regression")
data <- cbind(data, fa$scores)
Error in data.frame(..., check.names = FALSE) :
arguments imply differing number of rows: 889, 851
It would be no surprise to receive this error if the row numbers really differed, but when I type "fa$scores" and hit enter, R displays all of the 889 rows. The dim function still returns 851, though:
dim(fa$scores)
[1] 851 2
Can you please clarify for me why I am receiving this error, and if possible, what I can do to add the factor scores to the data successfully?
Thanks!
fa$scores returns a matrix with rownames that you can use to join/merge the data together.
First, make sure data has rownames. If not, give it dummy names like:
rownames(data) <- 1:nrow(data)
Then run fa <- factanal(...), and convert fa$scores to a data frame of factor scores. E.g.,
fs <- data.frame(fa$scores)
Then, add a rowname column to both your original data and fs:
data$rowname <- rownames(data)
fs$rowname <- rownames(fs)
Then left join fs to data (using the dplyr package):
library(dplyr)
left_join(data, fs, by = "rowname")
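A base-R equivalent, if you would rather avoid dplyr, is merge() on the row names (a sketch using the same objects as above):
# all.x = TRUE keeps the rows of data that have no factor score; their score columns become NA
merged <- merge(data, fs, by = "row.names", all.x = TRUE)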

Accessing multiple data sources with [[ ]] indexing in R

Here is my code:
file.number <- c(1:29)
data <- setNames(lapply(paste0(file.number, ".csv"), read.csv), paste0(file.number, ".data"))
n <- c(1:3,10:15,21:26)
sw <- na.omit(data[[n]]$RT[data[[n]]$rep.sw=="sw"])
rep <-na.omit(data[[n]]$RT[data[[n]]$rep.sw=="rep"])
The problem is the 3rd line - if n = 1 it works, but if I include multiple numbers I get the error "recursive indexing failed". Is there a way I can access multiple indexes at once?
Thanks R Community! Any advice would be much appreciated!
Too long for a comment.
It looks like data is a list of data frames. The list elements are named, e.g. 1.data, 2.data, etc. and each data frame has, among other things, columns named RT and rep.sw. So, like this:
## representative example???
df <- data.frame(RT=1:100,rep.sw=sample(c("sw","rep"),100,replace=TRUE))
data <- setNames(lapply(1:29,function(i)df),paste0(1:29,".data"))
You seem to want to remove NA's from the RT column of each data frame, for rows where rep.sw=="sw" (or "rep").
If that is correct, then something like this should work:
sw <- lapply(data[n],function(df) with(df,na.omit(RT[rep.sw=="sw"])))
rep <- lapply(data[n],function(df) with(df,na.omit(RT[rep.sw=="rep"])))
This code will pass the data frames identified in n to the function one at a time, and for each of those return the rows of column RT for which rep.sw="sw", with NA's omitted. The result will be a list of vectors.
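From there you can, for example, pull out the result for a single file or summarise across files (a sketch using the objects above; the list names follow the "1.data" pattern from the setNames call):
# Reaction times for the first selected file
sw[["1.data"]]
# Mean "sw" reaction time per selected file
sapply(sw, mean)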
I notice that most of the columns are imported as factors, which is probably a bad idea. You might want to import using:
data <- setNames(lapply(paste0(file.number, ".csv"), read.csv, stringsAsFactors=FALSE),
paste0(file.number, ".data"))
