After merging multiple data frames into one, I would like to know how to change the column headers in the master data frame to represent the original files that they came from. I merged a large number of data frames into one using the code below:
library(plyr)
dflist = list.files(path=dir, pattern="csv$", full.names=TRUE, recursive=FALSE)
import.list = llply(dflist, read.csv)
Master = Reduce(function(x, y) merge(x, y, by="Hours"), import.list)
I would like the columns that belonged to each original data frame to be named after the unique ID in the name of the csv file they came from (e.g. aa, ab, ac). The unique ID in each filename comes immediately before an underscore ("_"), so I can isolate it using the code below. However, I am having trouble applying this to the column headers. Any help would be much appreciated.
filename = dflist[1]
unqID = strsplit(filename,"_")[[1]][1]
You could define a function in your llply call that reads each file with read.csv and assigns the names,
or just rename them after reading them in and before merging, as joran suggested:
#First get the names
filenames = dflist
#I am unsure about the line below, as I
unqID = sapply(filenames, function(x) strsplit(basename(x), "_")[[1]][1], USE.NAMES = FALSE)
names(import.list) <- unqID #renaming the list items
And then merge using your code
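Naming the list items does not by itself change the column headers, so one way to finish the job is to prefix each data frame's columns with its ID before merging. The following is a minimal sketch, not a definitive implementation: the file names and the Temp and Flow columns are made up for illustration, and only Hours, dflist, import.list and Master come from the question.

```r
# Hypothetical stand-ins for dflist and import.list; real code would build
# them with list.files() and read.csv() as in the question.
dflist <- c("data/aa_run.csv", "data/ab_run.csv", "data/ac_run.csv")
import.list <- list(
  data.frame(Hours = 1:3, Temp = c(10, 11, 12), Flow = c(1, 2, 3)),
  data.frame(Hours = 1:3, Temp = c(20, 21, 22), Flow = c(4, 5, 6)),
  data.frame(Hours = 1:3, Temp = c(30, 31, 32), Flow = c(7, 8, 9))
)

# basename() drops the directory part, which matters with full.names = TRUE
unqID <- sapply(strsplit(basename(dflist), "_"), `[`, 1)

# Prefix every column except the merge key with the ID of its file
renamed <- Map(function(df, id) {
  is_key <- names(df) == "Hours"
  names(df)[!is_key] <- paste(id, names(df)[!is_key], sep = ".")
  df
}, import.list, unqID)

Master <- Reduce(function(x, y) merge(x, y, by = "Hours"), renamed)
names(Master)
```

Because the columns are renamed before the merge, merge never has to invent .x and .y suffixes for clashing names.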
This is a 3rd edit to the question (leaving below thread just in case):
The following code makes some sample data frames, selects those with "_areaX" in the title and makes a list of them. The goal is then to combine the data frames in the list into 1 data frame. It almost works...
Area1 <- 100
Area2 <- 200
Area3 <- 300
Zone <- 3
a1_areaX <- data.frame(Area1)
a2_areaX <- data.frame(Area2)
a3_areaX <- data.frame(Area3)
a_zoneX <- data.frame(Zone)
library(dplyr)
pattern = "_areaX"
df_list <- mget(ls(envir = globalenv(), pattern = pattern))
big_data = bind_rows(df_list, .id = "FileName")
The problem is the newly created data frame looks like this:
And I need it to look like this:
File Name    Area measurement
a1_areaX     100
a2_areaX     200
a3_areaX     300
Below are my earlier attempts at asking this question. Edited from first version:
I have csv files imported into R Global Env that look like this (I'd share the actual file(s) but there doesn't seem to be a way to do this here):
They all have a name, the above one is called "s6_section_area". There are many of them (with different names) and I've put them all together into a list using this code:
pattern = "section_area"
section_area_list <- list(mget(grep(pattern,ls(globalenv()), value = TRUE), globalenv()))
Now I want a new data frame that looks like this, put together from the data frames in the above made list.
File Name          Area measurement
a1_section_area    a number
a2_section_area    another number
many more          more numbers
So, the first column should list the name of the original file and the second column the measurement provided in that file.
Hope this is clearer - Not sure how else to provide reproducible example without sharing the actual files (which doesn't seem to be an option).
addition to edit: Using this code
section_area_data <- bind_rows(section_area_list, .id = "FileName")
I get (it goes on and on to the right)
I'm after a table that looks like the sample above, left column is file name with a list of file names going down. Right column is the measurement for that file name (taken from original file).
Note that in your list of dataframes (df_list) all the columns have different names (Area1, Area2, Area3) whereas in your output dataframe they all have been combined into one single column. So for that you need to change the different column names to the same one and bind the dataframes together.
library(dplyr)
library(purrr)
result <- map_df(df_list,
                 ~ .x %>% rename_with(~"Area", contains('Area')),
                 .id = 'FileName')
result
# FileName Area
#1 a1_areaX 100
#2 a2_areaX 200
#3 a3_areaX 300
Thanks everyone for your suggestions. In the end, I was able to combine the suggestions and some more thinking and came up with this, which works perfectly.
library("dplyr")
pattern = "section_area"
section_area_list <- mget(ls(envir = globalenv(), pattern = pattern))
section_area_data <- bind_rows(section_area_list, .id = "FileName") %>%
select(-V1)
So, a bunch of csv files were imported into the R global environment. A list of all data frames with a name ending in "section_area" was made. Those data frames were then bound into one big data frame, with the file names in one column and the value (the area measurement in this case) in the other. The original csv files also contained a pointless column called "V1", which I deleted.
This is what one of the many csv files looks like: [image: sample csv file]
And this is the layout of the final data frame (it goes on for about 150 rows): [image: final data frame]
I'm trying to read in two data files: one is the actual data, the other is a file of column names, one per row. I then need to assign the column names to the actual data. Below is what I have, but it's not assigning them properly.
#read in the data
glass_data = read.csv('/all_datasets/glass/glass.txt', header=FALSE)
glass_headers = read.csv('/all_datasets/glass/header.txt')
#add the names
names(glass_data) = c(glass_headers)
Would this work:
colnames(glass_data) <- glass_headers[, 1]
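A minimal sketch of why this can work, assuming the header file holds one column name per line and no header row of its own. The file contents below are invented and inlined with text= so the example runs on its own. Note that header=FALSE is also used when reading the names, since otherwise read.csv would treat the first name as a header and drop it from the data.

```r
# Hypothetical file contents, inlined with text= so the sketch is runnable;
# the real inputs are glass.txt and header.txt from the question.
glass_data <- read.csv(text = "1,1.52,13.6\n2,1.51,13.9", header = FALSE)
glass_headers <- read.csv(text = "Id\nRI\nNa", header = FALSE,
                          stringsAsFactors = FALSE)

# There must be exactly one name per data column
stopifnot(nrow(glass_headers) == ncol(glass_data))

# Column 1 of glass_headers is a plain character vector of names
colnames(glass_data) <- glass_headers[, 1]
names(glass_data)
```

The original attempt passed the whole data frame to names(), whereas indexing with [, 1] extracts the vector of names that names() and colnames() expect.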
I have a set of excel files each containing one sheet of data, all of similar structure (mostly -- see below), that I want to ultimately combine into one large data frame (with each sub-set indexed by original file source).
I am able to create a list of multiple dataframes, and then merge these into one dataframe, pretty easily with the following code:
files <- grep(".xlsx", dir(), value=TRUE) # vector of file names
IDnos <- substr(files,20,24) #vector with key 5-digit ID info of each file
library("XLConnect")
library("data.table")
datalist <- lapply(files, readWorksheetFromFile, sheet = "Data")
names(datalist) <- IDnos
bigdatatable <- rbindlist(datalist, idcol = "IDNo")
One data column, "Value", is usually of class numeric, but I found that in several files an "ND" had been entered in one row, making the column class character, so in the final data frame the column is character.
Although I can fix this with some simple cleaning, I was left wondering if there is a way to identify, at the "list of dataframes" stage, which files (or dataframe components of the list I created) have class character for column "Value". For example, I can't run sapply(datalist,class) or other variations. I am hoping to avoid a for-loop.
Is there any way to use lapply or sapply to drill down into dataframes within a list?
Here's how I would use lapply to find the class of column a in a list of 2 data frames, named x and y.
datalist <- list(x = data.frame(a = letters),
y = data.frame(a = 1:26))
lapply(datalist, function(x) class(x$a))
$x
[1] "factor"
$y
[1] "integer"
I have a list of filenames that were found by searching the working directory. I want to either make one data frame with multiple elements that can be selected from or multiple data frames. To select either parts of one data frame or pick from multiple data frames, I want to name them using a part of the associated filename.
Currently, I set filenames using list.files and set up the data frame using lapply with read.csv
filenames = list.files(recursive=TRUE,pattern="*dat.csv",full.names=FALSE)
data = lapply(filenames,function(i){
read.csv(i,stringsAsFactors=FALSE)
})
Can someone explain to me the best way to go about this data import and name assignment?
A good way to store this would be as a single, combined data frame with a column describing the original file, let's say type:
data_frames = lapply(filenames,function(i){
ret <- read.csv(i,stringsAsFactors=FALSE)
ret$type <- gsub("dat.csv$", "", i)
ret
})
data = do.call(rbind, data_frames)
Or shorter, with plyr:
library(plyr)
names(filenames) <- filenames # ldply only adds the .id column when the input is named
data = ldply(filenames, read.csv, stringsAsFactors = FALSE, .id = "type")
data$type <- gsub("dat.csv$", "", data$type)
That way you could extract whatever subset you wanted with:
# to get all lines from, say, the AAAdat.csv file
subset(data, type == "AAA")
You could store each dataset as an individual variable with a name like AAA, but you shouldn't, because it's a bad idea to use your variable names to store information.
(Note that this assumes your datasets share most, or at least some, columns. If they have entirely different structures, this is not an appropriate approach).
I have several txt files, each of which contains 3 columns (A, B, C).
Column A is common to all the txt files. I want to combine the files so that column A appears only once, followed by the other columns (B and C) of each file. I used cbind, but it creates a data frame that repeats column A, which I don't want. Here is the R code I tried:
data <- read.delim(file.choose(),header=T)
data2 <- read.delim(file.choose(),header=T)
data3 <- cbind(data,data2)
write.table(data3,file="sample.txt",sep="\t",col.names=NA)
Unless your files are all sorted precisely the same, you'll need to use merge:
dat <- merge(data,data2,by="A")
dat <- merge(dat,data3,by="A")
This should automatically prevent you from having multiple A's, since merge knows they're all a key/index column. You'll likely want to rename the duplicate B's and C's before merging.
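One possible way to do the renaming and merging in a single pass is sketched below. The three data frames and the f1, f2, f3 tags are hypothetical stand-ins for files read with read.delim, and here data3 is assumed to be a third file rather than the cbind result from the question.

```r
# Hypothetical stand-ins for three files read with read.delim()
data  <- data.frame(A = c("g1", "g2", "g3"), B = 1:3, C = 4:6)
data2 <- data.frame(A = c("g3", "g1", "g2"), B = 7:9, C = 10:12)
data3 <- data.frame(A = c("g2", "g3", "g1"), B = 13:15, C = 16:18)

files <- list(f1 = data, f2 = data2, f3 = data3)

# Tag B and C with the list name so the merged columns stay distinguishable
tagged <- Map(function(df, tag) {
  not_key <- names(df) != "A"
  names(df)[not_key] <- paste(names(df)[not_key], tag, sep = "_")
  df
}, files, names(files))

# Merge on A one pair at a time; row order in the inputs does not matter
dat <- Reduce(function(x, y) merge(x, y, by = "A"), tagged)
names(dat)
```

Unlike cbind, this matches rows by the value of A, so files that are sorted differently still line up correctly.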