I have three large Excel databases that I've converted to CSV, and I want to combine them into one using R.
I have read the three files in as dat1, dat2, and dat3 respectively. I tried to merge dat1 and dat2 into myfulldata, and then merge myfulldata with dat3, saved as myfulldata2.
When I did this, though, only the headers remained in the combined data; essentially none of the contents of the databases were visible. The number of obs. for each myfulldata is shown as 0, despite the obs. counts for the individual components being very large. Can anyone advise how to resolve this?
Code:
dat1 <- read.csv("PS 2014.csv", header = TRUE)
dat2 <- read.csv("PS 2015.csv", header = TRUE)
dat3 <- read.csv("PS 2016.csv", header = TRUE)
myfulldata <- merge(dat1, dat2)
myfulldata2 <- merge(myfulldata, dat3)
save(myfulldata2, file = "Palisis.RData")
Doing a merge in R is analogous to doing a join between two tables in a database. By default, merge() joins on every column name the two data frames share, so if no rows have identical values in all of those shared columns the result is empty, which is why you see 0 obs. I suspect what you actually want is to stack your three CSV files row-wise (i.e. union them). In that case, try rbind instead:
myfulldata <- rbind(dat1, dat2)
myfulldata <- rbind(myfulldata, dat3)
save(myfulldata, file = "Palisis.RData")
Note that this assumes the number, names, and ideally the types of the columns in each data frame are the same (cf. doing a UNION in SQL).
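If the yearly files don't all have exactly the same columns, dplyr::bind_rows() is a more forgiving alternative to rbind; a minimal sketch, reusing dat1, dat2, and dat3 from above:
library(dplyr)
# bind_rows() accepts any number of data frames, matches columns by name,
# and fills columns missing from a given file with NA
myfulldata <- bind_rows(dat1, dat2, dat3)
# sanity check: total rows should equal the sum of the parts
nrow(myfulldata) == nrow(dat1) + nrow(dat2) + nrow(dat3)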
I am new to programming and bad at using loops in R. I'm facing a situation where, if I don't use a loop, I think I'm going to spend a lot of time achieving my goal.
I have a big CSV file in my working directory that contains data on 64 animal species (this CSV is represented by the object "df2", created below). In the same directory, I have 64 smaller CSV files, each one related to an animal species that is also present in the bigger CSV. These 64 smaller files have the same number of columns (6) but different numbers of rows. I'll create some toy data to illustrate, and divide my question into four parts to make it as clear as I can.
library(tidyverse)
# Creating a df just to split it
df <- data.frame(animal = c(c("dog", "DOG"),
                            rep("cat", 4),
                            "frog",
                            rep("bird", 7),
                            rep("snake", 5),
                            rep("lizard", 3),
                            c("cow", "cOW", "COW", "coww"),
                            rep("worm", 6),
                            "lion",
                            rep("shark", 9)),
                 var1 = rnorm(42),
                 var2 = rnorm(42),
                 var3 = rnorm(42),
                 var4 = rnorm(42),
                 var5 = rnorm(42))
# The following steps are just to make a reproducible example: I'm filtering the toy data to save it as CSV files and re-import them.
da1 <- df %>%
filter(animal=="dog" | animal=="DOG")
da2 <- df %>%
filter(animal=="cat")
da3 <- df %>%
filter(animal=="frog")
da4 <- df %>%
filter(animal=="bird")
da5 <- df %>%
filter(animal=="snake")
da6 <- df %>%
filter(animal=="lizard")
da7 <- df %>%
filter(animal=="cow" | animal=="cOW"|
animal=="COW" | animal=="coww")
da8 <- df %>%
filter(animal=="worm")
da9 <- df %>%
filter(animal=="lion")
da10 <- df %>%
filter(animal=="shark")
readr::write_csv(da1, "da1.csv")
readr::write_csv(da2, "da2.csv")
readr::write_csv(da3, "da3.csv")
readr::write_csv(da4, "da4.csv")
readr::write_csv(da5, "da5.csv")
readr::write_csv(da6, "da6.csv")
readr::write_csv(da7, "da7.csv")
readr::write_csv(da8, "da8.csv")
readr::write_csv(da9, "da9.csv")
readr::write_csv(da10, "da10.csv")
# These 10 CSV files stand in for the 64 that I have in my directory
Part 1
As you can see, I had to filter one species at a time. So, my first question is: how can I put those filters and the readr::write_csv() calls inside a loop, so that I can do it all at once instead of doing it individually? Note that some species, such as "dog" and "cow", have several spellings. That's a problem I have to deal with, since I downloaded my actual data from online databases and the files have such issues.
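Roughly, I imagine something like this untested sketch, where the case variants are collapsed with tolower() first (the manual fix for typos like "coww" is an assumption; my real data would need a proper lookup):
# normalize case so "DOG"/"dog" and "cOW"/"COW"/"cow" collapse together
df$animal <- tolower(df$animal)
df$animal[df$animal == "coww"] <- "cow"  # assumed manual typo fix
# write one filtered CSV per species
species <- unique(df$animal)
for (i in seq_along(species)) {
  readr::write_csv(dplyr::filter(df, animal == species[i]),
                   paste0("da", i, ".csv"))
}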
To load the small csv files I do the following:
library(rio)
data <- import_list(dir("path_to_directory", pattern = ".csv"), rbind = FALSE)
Part 2
Once I've imported them as above, they are stored in the object "data". This changes their order so that they are listed as da1, da10, da2, da3, da4, and so on, instead of sequentially as da1, da2, da3, da4, da5... What I want to do now is to reorder them from 1 to 10. After that, I would like to select the same three columns (animal, var1, var2) from each of the datasets. I was able to do that for each of the datasets individually:
ba1 <- data$da1 %>%
dplyr::select(animal, var1, var2)
ba2 <- data$da2 %>%
dplyr::select(animal, var1, var2)
.
.
.
Again, I would like to do it all at once using a loop or something like that.
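In sketch form, something like this is what I'm after (assuming import_list names the list elements "da1" through "da10" after the files):
# reorder the list numerically (import_list sorts it as da1, da10, da2, ...)
data <- data[paste0("da", 1:10)]
# select the same three columns from every data frame in the list
ba_list <- lapply(data, function(d) dplyr::select(d, animal, var1, var2))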
Part 3
Once I've selected the columns and saved them in objects, I want to bind the resulting objects with subsets of the big csv file I cited above. Here are some toy data for it:
df2 <- data.frame(animal = c(rep("dog2", 2),
                             rep("cat2", 4),
                             "frog2",
                             rep("bird2", 7),
                             rep("snake2", 5),
                             rep("lizard2", 3),
                             rep("cow2", 4),
                             rep("worm2", 6),
                             "lion2",
                             rep("shark2", 9)),
                  var1 = rnorm(42),
                  var2 = rnorm(42))
# This time all animals have the same spelling, since I tabulated these data manually.
The subsets that I refer to are made by filtering this data frame by animal species. I was able to do that using dplyr::filter:
ca1 <- df2 %>%
filter(animal=="dog2")
ca2 <- df2 %>%
filter(animal=="cat2")
.
.
.
And so on until I've done it for all the animals. As my actual data contain 64 animal species, filtering df2 this way takes a lot of time, so I would like a faster approach. I think a for loop could be useful, but I'm bad at this kind of programming and did not manage to write the code for it. Could anyone provide the code for it, please?
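From what I've read, split() might do this in one step, though I'm not sure I'm using it right (untested sketch):
# split() returns a named list with one data frame per species,
# replacing the long chain of individual filter() calls
ca_list <- split(df2, df2$animal)
ca_list$dog2  # the same subset as ca1 above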
Part 4
Finally, once the species in the df2 are filtered, I want to use a loop to bind (rbind) the objects that refer to the same species, such as ba1 and ca1 in this example, and then save the objects as new csv files:
readr::write_csv(rbind(ba1, ca1), "ga1.csv")
readr::write_csv(rbind(ba2, ca2), "ga2.csv")
.
.
.
By doing that, I should have 64 new CSV files, each combining the data of one of the 64 old files with the matching part of my big CSV file. Could anyone help me? I would really appreciate it if you could answer my question stepwise.
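In case it clarifies what I mean, here is the pairing I imagine (untested sketch, assuming the ba and ca objects live in two lists ordered the same way):
# pair the two lists element-wise and write one combined CSV per species
ga_list <- Map(rbind, ba_list, ca_list)
invisible(Map(readr::write_csv, ga_list,
              paste0("ga", seq_along(ga_list), ".csv")))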
I appreciate your time and your attention in reading all of this. Thanks so much in advance!
This is a bit confusing. You refer to "da#", "ba#", "ca#", and "ga#" objects, but only some of them are actually defined in your code. Here is a start on what you seem to be trying to do. First, creating the files you are using as an example:
animals <- split(df2, df2$animal)
fnames <- paste0("da", formatC(1:10, digits=2, width=2, flag="0"), ".csv")
invisible(lapply(1:10, function(x) write_csv(animals[[x]], fnames[x])))
dir(pattern=".csv")
# [1] "da01.csv" "da02.csv" "da03.csv" "da04.csv" "da05.csv" "da06.csv" "da07.csv" "da08.csv" "da09.csv" "da10.csv"
First we split df2 into the different kinds of animals, then use lapply to create 10 .csv files, labeling them so they appear in the correct numeric order.
Since splitting a data frame is easy, why not combine all of the files into a single data frame (alldata <- do.call(rbind, animals)), extract the columns you want, and then use split to separate them by animal type? You can then keep the list and extract the parts you want (usually the simpler approach if you plan to run similar analyses on all of them), or extract them as separate objects.
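A minimal sketch of that list-based pipeline, assuming the data list imported with rio::import_list above and the toy column names (with real data, mismatched species labels such as "dog" vs. "dog2" would first need a lookup table):
library(dplyr)
library(readr)
# stack every imported file and keep only the shared columns of interest
alldata <- do.call(rbind, data) %>%
  select(animal, var1, var2)
# one list element per species, named after the animal column
ba_list <- split(alldata, alldata$animal)
ca_list <- split(df2, df2$animal)
# bind each species' rows to the matching subset of df2 and write it out
for (sp in intersect(names(ba_list), names(ca_list))) {
  write_csv(rbind(ba_list[[sp]], ca_list[[sp]]),
            paste0("ga_", sp, ".csv"))
}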
I am currently working on combining my data from multiple Excel files into one df. The problem is that the number of columns differs across the files (due to different experiment versions), so I need to bind only certain columns/variables from each file (they have the same names).
I tried doing this "manually" at first, using:
library(openxlsx)
PWI <- read.xlsx("/Users/myname/Desktop/PrelimPWI/PWI_1_V1A.xlsx", colNames = TRUE, startRow = 2)
Slim_1 <- data.frame(PWI$Subject, PWI$Block, PWI$Category, PWI$Trial,PWI$prompt1.RT)
# read in and pull out variables of interest for one subject
mergedFullData <- merge(mergedDataA, mergedDataB)
# add two together, then add the third to the merged file, add the 4th to that merged file, etc.
Obviously, it seems like there's a simpler way to combine the files. I've been working on using:
library(openxlsx)
path <- "/Users/myname/Desktop/PrelimPWI"
merge_file_name <- "/Users/myname/Desktop/PrelimPWI/merge_file_name.xlsx"
filenames_list <- list.files(path= path, full.names=TRUE)
All <- lapply(filenames_list, function(merge_file_name) {
  print(paste("Merging", merge_file_name, sep = " "))
  read.xlsx(merge_file_name, colNames = TRUE, startRow = 2)
})
PWI <- do.call(rbind.data.frame, All)
write.xlsx(PWI,merge_file_name)
However, I keep getting the error that the number of columns doesn't match, and I'm not sure where to pull out the specific variables I need (the ones listed in the earlier code). Any other tweaks I've tried have resulted in only the first file being written into the xlsx, or a completely blank df. Any help would be greatly appreciated!
library(tidyverse)
df1 <- tibble(
  a = c(1, 2, 3),
  x = c(4, 5, 6)
)
df2 <- tibble(
  x = c(7, 8, 9),
  y = c("d", "e", "f")
)
bind_rows(df1, df2)
The bind functions from dplyr should be able to help you. They can bind dataframes together by row or column, and are flexible about differing column names: anything missing from one data frame is filled with NA.
You can then select the actual columns you want to keep.
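Putting that together with the reading code from the question, a sketch (the column names Subject, Block, Category, Trial, and prompt1.RT come from the question; the rest reuses path and merge_file_name from there):
library(openxlsx)
library(dplyr)
filenames_list <- list.files(path = path, pattern = "\\.xlsx$", full.names = TRUE)
# read each file, keep only the shared variables of interest, then stack;
# bind_rows() tolerates files whose remaining columns differ
All <- lapply(filenames_list, function(f) {
  read.xlsx(f, colNames = TRUE, startRow = 2) %>%
    select(Subject, Block, Category, Trial, prompt1.RT)
})
PWI <- bind_rows(All)
write.xlsx(PWI, merge_file_name)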
The below is driving me a little crazy, and I'm sure there's an easy solution.
I currently use R to perform some calculations from a bunch of excel files, where the files are monthly observations of financial data. The files all have the exact same column headers. Each file gets imported, gets some calcs done on it and the output is saved to a list. The next file is imported and the process is repeated. I use the following code for this:
filelist <- list.files(pattern = "\\.xls")
universe_list <- list()
count <- 1
for (file in filelist) {
  df <- read.xlsx(file, 1, startRow = 2, header = TRUE)
  # ... perform calcs ...
  universe_list[[count]] <- df
  count <- count + 1
}
I now have a problem where some of the new operations I want to perform involve data from two or more Excel files. So, for example, I would need to import the Jan-16 and the Jan-15 Excel files, perform whatever needs to be done, and then move on to the next pair (Feb-16 and Feb-15). The paired files will always be a fixed interval apart (one year, etc.).
I can't seem to figure out the code for this. From a process perspective, I'm thinking I need to: 1) design a loop to import both sets of files at the same time, 2) create two dataframes from the imported data, 3) rename the columns of one of the dataframes (so the columns can be distinguished), 4) merge both dataframes together, and 5) perform the calcs. I can't work out the code for steps 1-5!
Many thanks for helping out
Consider mapply() to handle both data frame pairs together. Your current loop is reminiscent of for-loop operations in other languages, whereas R has many vectorized approaches for iterating over lists. The code below assumes the 15 and 16 file lists are the same length with corresponding months in each, and that the year abbreviation comes right before the file extension (i.e., -15.xls, -16.xls):
files15list <- list.files(path, pattern = "15\\.xls$", full.names = TRUE)
files16list <- list.files(path, pattern = "16\\.xls$", full.names = TRUE)
dfprocess <- function(x, y){
df1 <- read.xlsx(x, 1, startRow=2, header=TRUE)
names(df1) <- paste0(names(df1), "1") # SUFFIX COLS WITH 1
df2 <- read.xlsx(y, 1, startRow=2, header=TRUE)
names(df2) <- paste0(names(df2), "2") # SUFFIX COLS WITH 2
df <- cbind(df1, df2) # CBIND DFs
# ... perform calcs ...
return(df)
}
# SIMPLIFY = FALSE keeps the result as a list of data frames (one per
# month pair) instead of mapply's default matrix simplification
df_list <- mapply(dfprocess, files15list, files16list, SIMPLIFY = FALSE)
First sort your filelist such that the two files on which you want to do your calculations are consecutive to each other. After that try this:
for (count in seq(1, length(filelist), 2)) {
  df <- read.xlsx(filelist[count], 1, startRow = 2, header = TRUE)
  df1 <- read.xlsx(filelist[count + 1], 1, startRow = 2, header = TRUE)
  # change column names, then merge or append depending on the requirement
  # perform calcs
  # save
}
I read this data set and I want to join the training set and the test set (I should mention that this is part of a Coursera course exercise).
I have read both data sets and given all the columns names; the training data have 7352 rows and 562 columns, and the test set has 2947 rows and 562 columns.
The names of the columns in both data sets are the same.
When I try to join the data with bind_rows I get a data set with 10299 rows but with 478 columns, not 562.
When I use rbind I get the correct result, but I then need to cast it again using tbl_df, so I would prefer to do it with bind_rows.
The following is the script I wrote; running it from a folder containing the unzipped data from the above link (i.e. the folder "UCI HAR Dataset") reproduces the problem.
## Setting the script folder to be current directory
CurrentScriptDirectory <- dirname(sys.frame(1)$ofile)
setwd(CurrentScriptDirectory)
library(dplyr)
# Reading the data
train_x <- tbl_df(read.table("./UCI HAR Dataset/train/X_train.txt"))
train_y <- tbl_df(read.table("./UCI HAR Dataset/train/y_train.txt"))
test_x <- tbl_df(read.table("./UCI HAR Dataset/test/X_test.txt"))
test_y <- tbl_df(read.table("./UCI HAR Dataset/test/y_test.txt"))
# Giving the y's proper names
colnames(train_y) <- c("Activity Name")
colnames(test_y) <- c("Activity Name")
# Reading feature names
featureNames <- read.table("./UCI HAR Dataset/features.txt")
featureNames <- featureNames[, 2]
# Giving the training and test data proper names
colnames(train_x) <- featureNames
colnames(test_x) <- featureNames
labeledTrainingSet <- bind_cols(train_x, train_y)
labeledTestSet <- bind_cols(test_x, test_y)
labeledDataSet <- bind_rows(labeledTrainingSet, labeledTestSet)
Can someone help me understand what I'm doing wrong?
I've worked with that dataset and ran into the same issue. As others mentioned, there are duplicate features.
Rename the duplicate columns and make them syntactically valid. You can use:
make.names(X, unique = TRUE, allow_ = TRUE)
where X is a character vector. The function appends suffixes to the existing column names rather than replacing them, so you don't lose the original nomenclature. See http://www.inside-r.org/r-doc/base/make.names for more details.
After all of your column names are unique dplyr::bind_rows() will work!
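Applied to the script in the question, the fix might look like this (a sketch; featureNames is the vector read from features.txt above):
# make.names() replaces characters that are illegal in R names with "."
# and, with unique = TRUE, appends ".1", ".2", ... to any duplicates
featureNames <- make.names(featureNames, unique = TRUE)
colnames(train_x) <- featureNames
colnames(test_x) <- featureNames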
Just checked it out. You have duplicated names in your featureNames set. These are dropped by bind_rows.
test1 <- data.frame(c(1, 2, 3), c(1, NA, 3), c(1, 2, NA))
names(test1) <- c("A", "B", "B")
test2 <- data.frame(c(1, 2, 3), c(1, NA, 3), c(1, 2, NA))
names(test2) <- c("A", "B", "B")
test3 <- bind_rows(test1, test2)
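Renaming with make.names(), as suggested in the previous answer, lets the same toy example bind cleanly:
names(test1) <- make.names(names(test1), unique = TRUE)  # "A" "B" "B.1"
names(test2) <- make.names(names(test2), unique = TRUE)
test3 <- bind_rows(test1, test2)  # all three columns are now kept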
Here is my code:
file.number <- c(1:29)
data <- setNames(lapply(paste0(file.number, ".csv"), read.csv), paste0(file.number, ".data"))
n <- c(1:3,10:15,21:26)
sw <- na.omit(data[[n]]$RT[data[[n]]$rep.sw=="sw"])
rep <-na.omit(data[[n]]$RT[data[[n]]$rep.sw=="rep"])
The problem is the line that defines sw: if n = 1 it works, but if I include multiple numbers I get the error "recursive indexing failed." Is there a way I can access multiple indexes at once?
Thanks R Community! Any advice would be much appreciated!
Too long for a comment.
It looks like data is a list of data frames. The list elements are named, e.g. 1.data, 2.data, etc. and each data frame has, among other things, columns named RT and rep.sw. So, like this:
## representative example???
df <- data.frame(RT = 1:100, rep.sw = sample(c("sw", "rep"), 100, replace = TRUE))
data <- setNames(lapply(1:29, function(i) df), paste0(1:29, ".data"))
You seem to want to remove NA's from the RT column of each data frame, for rows where rep.sw=="sw" (or "rep").
If that is correct, then something like this should work:
sw <- lapply(data[n],function(df) with(df,na.omit(RT[rep.sw=="sw"])))
rep <- lapply(data[n],function(df) with(df,na.omit(RT[rep.sw=="rep"])))
This code will pass the data frames identified in n to the function one at a time, and for each of those return the rows of column RT for which rep.sw="sw", with NA's omitted. The result will be a list of vectors.
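The individual results can then be pulled out by name or by position, since the list names follow the "N.data" convention set up above:
sw[["1.data"]]  # NA-free "sw" reaction times from the first file
sw[[1]]         # the same vector, selected by position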
I notice that most of the columns are imported as factors, which is probably a bad idea. You might want to import using:
data <- setNames(lapply(paste0(file.number, ".csv"), read.csv, stringsAsFactors=FALSE),
paste0(file.number, ".data"))