R - combining lines from multiple CSV files into a data frame

I have a folder with hundreds of CSV files each containing data for a particular postal code.
Each CSV file contains two columns and thousands of rows. Descriptors are in Column A, values are in Column B.
I need to extract two pieces of information from each file and create a new table or dataframe using the values in [Column A, Row 2] (which is the postal code) and [Column B, Row 1585] (which is the median income).
The end result should be a table/dataframe with two columns: one for postal code, the other for median income.
Any help or advice would be appreciated.

Disclaimer: this question is pretty vague. Next time, be sure to add a reproducible example that we can run on our machines. It will help you, the people answering your questions, and future users.
You might try something like:
files = list.files("~/Directory")
my_df = data.frame()
for(i in 1:length(files)){
  # read enough rows to reach row 2 (postal code) and row 1585 (median income)
  dat = read.csv(file.path("~/Directory", files[i]), header = FALSE, nrows = 1585)
  my_df = rbind(my_df, data.frame(A = dat[2, 1], B = dat[1585, 2]))
}
my_df = my_df[,c("A","B")]
# Note on interpreting the indexing syntax in the last line:
# read it as "my_df is now (=) my_df such that ([) the columns (,) are only A and B (c("A", "B"))"

You can use the list.files function to get the paths of all your files and then use read.csv and rbind in a for loop to create one data.frame.
Something like this:
direct <- list.files("directory_to_your_files", full.names = TRUE)
df <- NULL
for(i in seq_along(direct)){
  df <- rbind(df, read.csv(direct[i]))
}

So here is the code which does what I want it to do. If there are more elegant solutions, please feel free to point them out.
# set the working directory to where the data files are stored
setwd("/foo")
# list the files
files = list.files("/foo")
# create an empty dataframe and name the columns
dataMatrix = data.frame(matrix(NA, nrow = length(files), ncol = 2))
colnames(dataMatrix) = c("Postal Code", "Median Income")
# create a for loop to get the information in R2/C1 and R1585/C2 of each data file
# Data in R2/C1 is a string, but is interpreted as a number unless specifically declared a string
for(i in 1:length(files)) {
  getData = read.csv(files[i], header = FALSE)
  dataMatrix[i, 1] = toString(getData[2, 1])
  dataMatrix[i, 2] = getData[1585, 2]
}
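A more compact variant of the same loop (just a sketch, untested beyond the layout described above, and assuming every file has at least 1585 rows) reads each file once and stacks one-row data frames:
# read each file once, pull the two cells, then stack the one-row data frames
rows <- lapply(files, function(f) {
  d <- read.csv(f, header = FALSE)
  data.frame(`Postal Code`   = toString(d[2, 1]),
             `Median Income` = d[1585, 2],
             check.names = FALSE)
})
dataMatrix <- do.call(rbind, rows)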
Thank you to all those who helped me figure this out, especially Nancy.

Related

R - Export large dataframe into CSV

Beginner here: I have a list (see screenshot) called Coins_list from which I want to export the second dataframe stored in it called data into a csv. When I use the code
write.csv(Coins_list$data, file = "Coins_list_full_data.csv")
I get a huge CSV with a bunch of numbers from the column named price, which apparently contains more data frames, if I read the output correctly, or at least that is how the data in the price column is displayed. How can I export this data frame to CSV correctly? See screenshot for more details.
EDIT: I was able to get the first four rows into a CSV by using df2 <- Coins_list$data; write.csv(df2[1:4,], file="BTC_row.csv"), however it now looks like R puts the price of all four rows within a list c( ) and repeats it in each row. Any idea how to change that?
(I would post this as a comment, but I don't have enough reputation.)
Hey, for starters you could try flattening the JSON by going one level deeper than the response list's $content, i.e. looking at what's inside the content with another $.
Otherwise you could try getting data$price and seeing what pops up from there.
Something like this:
# sketch: build one row per symbol and stack them
symbols = data$symbol
df = data.frame()
for (i in seq_along(symbols)) {
  x = data.frame(price = data$price[i], symbol = symbols[i])
  df = rbind(df, x)
}
to get a dataframe with price and symbol. I don't know how the data is nested so I'm just guessing.
It would be helpful to know where you got the data from, for reproducibility.
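If price really is a list-column holding nested data frames, one possible approach (a sketch, assuming the tidyr and dplyr packages are available and that the structure in the screenshot matches) is to unnest it, or to collapse every list cell to text, before calling write.csv:
library(tidyr)   # for unnest()
library(dplyr)   # for mutate()/across()
# Option 1: expand the nested price data frames into ordinary rows/columns
flat <- Coins_list$data %>% unnest(cols = price)
# Option 2: keep one row per coin, but collapse each list cell to a text value
flat <- Coins_list$data %>%
  mutate(across(where(is.list), ~ sapply(.x, toString)))
write.csv(flat, file = "Coins_list_full_data.csv", row.names = FALSE)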

Importing data in R from Excel with information contained in header

As title says, I am trying to import data from Excel to R, where part of the information is contained in the header.
In a very simplified way, the Excel file I have looks like this:
GROUP;1234
MONTH;"Jan"
PERSON;SEX;AGE;INCOME
John;m;26;20000
Michael;m;24;40000
Phillip;m;25;15000
Laura;f;27;72000
Total;;;147000
After reading it into R, it should be a "clean" dataset that looks like this.
GROUP;MONTH;PERSON;SEX;AGE;INCOME
1234;Jan;John;m;26;20000
1234;Jan;Michael;m;24;40000
1234;Jan;Phillip;m;25;15000
1234;Jan;Laura;f;27;72000
I have several files that look like this. The number of persons however varies in each file. The last line contains a summary that should be skipped. There might be empty lines between the list and summary line.
Any help is highly appreciated. Thank you very much.
Excel files can be read using readxl::read_excel()
One of its parameters is skip, which lets you skip a given number of rows.
For your data, you need to skip the first two lines that contain GROUP and MONTH.
You will get the data in the following format.
PERSON;SEX;AGE;INCOME
John;m;26;20000
Michael;m;24;40000
Phillip;m;25;15000
Laura;f;27;72000
After this, you can manually add the columns GROUP and MONTH
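A minimal sketch of that idea (the file name TEST1.xlsx is taken from the follow-up below, and the constant GROUP/MONTH values are the ones from the example):
library(readxl)
# skip the GROUP and MONTH rows so the PERSON line becomes the header
persons <- read_excel("TEST1.xlsx", skip = 2)
# add the two header values by hand
persons$GROUP <- 1234
persons$MONTH <- "Jan"
# drop the summary line (and any empty lines) by requiring a non-missing AGE
persons <- persons[!is.na(persons$AGE), ]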
Thank you very much for your help. The hint from @Aurèle brought the missing puzzle piece. The solution I have now come up with is as follows:
group <- read_excel("TEST1.xlsx", col_names = c("C1", "GROUP"), n_max = 1)
group <- group[, 2]
month <- read_excel("TEST1.xlsx", col_names = c("C1", "MONTH"), skip = 1, n_max = 1)
month <- month[, 2]
data <- read_excel("TEST1.xlsx", col_names = c("NAME", "SEX", "AGE", "INCOME"), skip = 4)
data <- data[!is.na(data$AGE), ]
data <- cbind(data, group, month)
data

Loop through CSV files--issue completing task for each individual file

I'm attempting to loop through multiple CSV files and complete the same task for each file to save myself time. First, I ran list.files to list all files in the folder (e.g., GPS_Collar33800_13.csv, GPS_Collar33801_13.CSV, etc.). I then developed a loop, but I'm struggling with how to structure the other parts of the code to work through each individual file. My end goal is to have 24 files that all look the same structurally, which I then need to merge into a master file. Another issue is that I need to add a unique ID for each file (a column for collar ID, e.g., 33800, 33801, 33802, etc.), but I don't know how to do this easily without adding each ID by hand (if I knew that file GPS_Collar33800_13.csv was read in first, I could set the AnimalID column value to 33800, then do the same for GPS_Collar33801_13.csv with AnimalID = 33801, and so on). The unique IDs are based on the file names. Any suggestions would be much appreciated!
## List CSV files in folder
files <- list.files()
## Run a for loop to complete the same tasks for each
for (i in 1:length(files)){
## Read table
tmp<-read.table(files[i],header=FALSE,sep=" ")
## Keep certain columns
tmp1 <- tmp[c(2:5,9,10,12,13)]
#Name the remaining columns
names(tmp1) <-
c("GMT_Date","GMT_Time","LMT_Date","LMT_Time","Latitude","Longitude","PDOP","2D_3D")
#Add column for collar ID
tmp1$AnimalID<-33800
#Cleanup dataframe by removing records with NAs
tmp1[tmp1 == "N/A"] <- NA
tmp2<-na.omit(tmp1)
You can give this a try:
library(stringr)
## List CSV files in folder
files<-list.files()
big.df <- vector('list',length(files))
## Run a for loop to complete the same tasks for each
for (i in 1:length(files)){
  ## Read table
  tmp <- read.table(files[i], header = FALSE, sep = " ")
  ## Keep certain columns
  tmp1 <- tmp[c(2:5, 9, 10, 12, 13)]
  # Name the remaining columns
  names(tmp1) <-
    c("GMT_Date", "GMT_Time", "LMT_Date", "LMT_Time", "Latitude", "Longitude", "PDOP", "2D_3D")
  # Add column for collar ID
  tmp1$AnimalID <- str_match(files[i], 'Collar(\\d+)_')[,2]
  # Cleanup dataframe by removing records with NAs
  tmp1[tmp1 == "N/A"] <- NA
  tmp2 <- na.omit(tmp1)
  big.df[[i]] <- tmp2
}
final.df <- do.call('rbind', big.df)
It will require the stringr package and assumes your filenames all look like 'GPS_Collar33801_13.csv', etc. It then reads in each file, stores it in a large list, moves to the next file... and when it's done, it mashes them all together in a data.frame called final.df.
Edit: Just fixed the str_match argument.
So let me make sure before I begin that I understand the ask:
For each file in the folder:
1. Import the file as a data frame
2. Drop some columns
3. Rename the remaining columns
4. Set a column in the data frame to a value obtained from the file name
5. Remove cases containing the string "N/A" anywhere
Then, combine each of the resulting data frames into one data frame by UNION-ing them (that is, adding the rows together, because the columns should be the same).
It's critically important that you provide your data with any such question. If you can't provide your specific data, create some fake data that still demonstrates the problem at hand. Then, provide an example of what it should look like once the operations are complete. This reduces guesswork by the people answering your question.
So with all that said, let's get cracking.
Let's abstract away the sub-parts of task #1 by pretending that we have a function called process_a_file that will do steps 1-5 of each individual file and return a data frame. I can explain how that function works later.
For the "for each file" part, you need lapply. lapply runs a given function on each element of a list you provide, and returns a list of what the function returns:
results_list <- lapply(files, process_a_file)
This will return a list, where each element of the list is a data frame returned by process_a_file. Then you need a function to combine them - I recommend bind_rows from the package dplyr:
results_df <- dplyr::bind_rows(results_list)
And that's all you need to do!
So, now, what do we put in process_a_file? This is pretty easy - your code is mostly complete for doing this, but there are some different ways to do it that I prefer :)
process_a_file <- function(filename) {
#???????
}
Step 1 is to import the file as a data frame. For this I recommend read_delim from the readr package - it's much faster than the default R methods, has nice defaults, and lets us tackle Step 5 at the same time by specifying that "N/A" means NA:
df <- readr::read_delim(filename, delim = " ", col_names = FALSE, na = "N/A")
For step 2, your way works, but I also recommend the select function from dplyr:
dplyr::select(df, 2:5, 9, 10, 12, 13)
You can also index columns with unquoted names, and drop columns with -5 or -column_name too - and you can do step 3 at the same time!
df <- dplyr::select(
  df,
  GMT_Date = 2,
  GMT_Time = 3,
  LMT_Date = 4,
  LMT_Time = 5,
  Latitude = 9,
  Longitude = 10,
  PDOP = 12,
  `2D_3D` = 13
)
Your way of renaming the columns is fine, too. By the way, if you start a column name with a number, you have to use this `backtick` syntax everywhere, so it's quite inconvenient and you should probably avoid it if you can.
Then finally, I recommend getting the ID from the file name using regular expressions. I'll assume you can write that regular expression since that's really out of scope - so you can use basename(tools::file_path_sans_ext(filename)) to return the filename without the path or extension, and use stringr::str_extract to pop out the ID, which you then add to a column using dplyr::mutate:
dplyr::mutate(df, animal_id = stringr::str_extract(basename(tools::file_path_sans_ext(filename)), "THE REGEX GOES HERE"))
So now, putting this all together - using dplyr's piping syntax %>% to make it look nice:
process_a_file <- function(filename) {
  readr::read_delim(filename,
                    delim = " ",
                    col_names = FALSE,
                    na = "N/A") %>%
    dplyr::select(
      GMT_Date = 2,
      GMT_Time = 3,
      LMT_Date = 4,
      LMT_Time = 5,
      Latitude = 9,
      Longitude = 10,
      PDOP = 12,
      `2D_3D` = 13
    ) %>%
    dplyr::mutate(animal_id = stringr::str_extract(basename(tools::file_path_sans_ext(filename)), "THE REGEX GOES HERE"))
}
results_list <- lapply(files, process_a_file)
results_df <- dplyr::bind_rows(results_list)
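As a sketch of one regular expression that could fill the placeholder above (assuming filenames like GPS_Collar33800_13.csv from the question, where the collar ID is the run of digits right after "Collar"):
stringr::str_extract("GPS_Collar33800_13", "(?<=Collar)\\d+")
#> [1] "33800"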

Select multiple rows in multiple DFs with loop in R

I have read multiple questionnaire files into DFs in R. Now I want to create new DFs based on them, but with only specific rows in them, by looping over all of them. The loop appears to work fine. However, the selection of the rows does not seem to work. When I try selecting with simple square brackets, I get the error "incorrect number of dimensions". I tried it with subset(), but I don't seem to be able to set the subset correctly.
Here is what i have so far:
for (i in 1:length(subjectlist)) {
p[i] <- paste("path",subjectlist[i],sep="")
files <- list.files(path=p,full.names = T,include.dirs = T)
assign(paste("subject_",i,sep=""),read.csv(paste("path",subjectlist[i],".csv",sep=""),header=T,stringsAsFactors = T,row.names=NULL))
assign(paste("subject_",i,"_t",sep=""),sapply(paste("subject_",i,sep=""),[c((3:22),(44:63),(93:112),(140:159),(180:199),(227:246)),]))
}
Here's some code that tries to abstract away the details and do what it seems like you're trying to do. If you just want to read in a bunch of files and then select certain rows, I think you can avoid the assign functions and just use sapply to read all the data frames into a list. Let me know if this helps:
# Get the names of files we want to read in
files = list.files([arguments])
df.list = sapply(files, function(file) {
  # Read in a csv file from the files vector
  df = read.csv(file, header=TRUE, stringsAsFactors=FALSE)
  # Add a column telling us the name of the csv file that the data came from
  df$SourceFile = file
  # Select only the rows we want
  df = df[c(3:22, 44:63, 93:112, 140:159, 180:199, 227:246), ]
}, simplify=FALSE)
If you now want to combine all the data frames into a single data frame, you can do the following (the SourceFile column tells you which file each row originally came from):
# Combine all the files into a single data frame
allDFs = do.call(rbind, df.list)

R for loop index issue

I am new to R and I am practicing writing R functions. I have 100 separate csv data files stored in my directory, and each is labeled by its id, e.g. "1" to "100".
I would like to write a function that reads some selected files into R, calculates the number of complete cases in each data file, and arranges the results into a data frame.
Below is the function that I wrote. First I list all the files in "dat". Then, using the rbind function, I read the selected files I want into a data.frame. Lastly, I compute the number of complete cases using sum(complete.cases()). This seems straightforward, but the function does not work. I suspect there is something wrong with the index but have not figured out why. I searched through various topics but could not find a useful answer. Many thanks!
complete = function(directory, id) {
  dat = list.files(directory, full.name = T)
  dat.em = data.frame()
  for (i in id) {
    dat.ful = rbind(dat.em, read.csv(dat[i]))
    obs = numeric()
    obs[i] = sum(complete.cases(dat.ful[dat.ful$ID == i, ]))
  }
  data.frame(ID = id, count = obs)
}
complete("envi", c(1, 3, 5))
I get an error and a warning message:
Error in data.frame(ID = id, count = obs) : arguments imply differing number of rows: 3, 5
One problem with your code is that you reset obs to numeric() each time you go through the loop, so obs ends up with only one value (the number of complete cases in the last file in dat).
Another issue is that the line dat.ful = rbind(dat.em, read.csv(dat[i])) resets dat.ful to contain just the data frame being read in that iteration of the loop. This won't cause an error, but you don't actually need to store the previous data frames, since you're just checking the number of complete cases for each data frame you read in.
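For completeness, a minimal repair that keeps the original loop structure might look like this (a sketch, untested, and assuming each file contains only rows for its own ID):
complete <- function(directory, id) {
  dat <- list.files(directory, full.names = TRUE)
  obs <- numeric(length(id))        # initialise once, outside the loop
  for (j in seq_along(id)) {
    d <- read.csv(dat[id[j]])       # read only the file for this id
    obs[j] <- sum(complete.cases(d))
  }
  data.frame(ID = id, count = obs)
}
complete("envi", c(1, 3, 5))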
Here's a different approach using lapply instead of a loop. Note that instead of giving the function a vector of indices, this function takes a vector of file names. In your example, you use the index instead of the file name as the file "id". It's better to use the file names directly, because even if the file names are numbers, using the index will give an incorrect result if, for some reason, your vector of file names is not sorted in ascending numeric order, or if the file names don't use consecutive numbers.
# Read files and return data frame with the number of complete cases in each csv file
complete = function(directory, files) {
# Read each csv file in turn and store its name and number of complete cases
# in a list
obs.list = lapply(files, function(x) {
dat = read.csv(paste0(directory,"/", x))
data.frame(fileName=x, count=sum(complete.cases(dat)))
})
# Return a data frame with the number of complete cases for each file
return(do.call(rbind, obs.list))
}
Then, to run the function, you need to give it a directory and a list of file names. For example, to read all csv files in the current working directory, you can do this:
filesToRead = list.files(pattern=".csv")
complete(getwd(), filesToRead)
