Split datafile into multiple - r

I am currently working with a Toolbox, so panel data must be contained in CSV files for each country. I have a 60 country panel for 1980-2014 quarterly data in a single CSV file. Rather than copying it manually I would like to use a looping routine.
This is what I tried to do:
mydata<-read.csv("regression.csv")
value<-split(mydata, mydata$country, drop=FALSE)
As far I understand I need to use lapply to export the data into 60 CSV files.
Can anyone help me with this please?

We loop through the names of the list elements using lapply, get the first 7 characters of the name with substr and use that to create the file name in the write.csv
lapply(names(value), function(x) write.csv(value[[x]],
file=paste0(substr(x, 1,7), '.csv'), quote=FALSE, row.names=FALSE))

Related

Picking last row data only from 2000 csv in the same directory and make single dataframe through R

Using R, I want to pick last row data only from over 2000 csv in the same directory
and make single dataframe.
Directory = "C:\data
File name, for example '123456_p' (6 number digit)
Each csv has different number of rows, but has the same number of columns (10 columns)
I know the tail and list function, but over 2000 dataframes, inputting manually is time wasting.
Is there any way to do this with loop through R?
As always, I really appreciate your help and support
There are four things you need to do here:
Get all the filenames we want to read in
Read each in and get the last row
Loop through them
Bind them all together
There are many options for each of these steps, but let's use purrr for the looping and binding, and base-R for the rest.
Get all the filenames we want to read in
You can do this with the list.files() function.
filelist = list.files(pattern = '.csv')
will generate a vector of filenames for all CSV files in the working directory. Edit as appropriate to specify the pattern further or target a different directory.
Read each in and get the last row
The read.csv() function can read in each file (if you want it to go faster, use data.table::fread() instead), and as you mentioned tail() can get the last row. If you build a function out of this it will be easier to loop over, or change the process if it turns out you need another step of cleaning.
read_one_file = function(x) {
tail(read.csv(x), 1)
}
Loop through them
Bind them all together
You can do both of these steps at once with map_df() in the purrr package.
library(purrr)
final_data = map_df(filelist, read_one_file)

how do i get my variables to be recognized?

I am writing a dataframe using a csv file. I am making a data frame. However, when I go to run it, it's not recognizing the objects in the file. It will recognize some of them, but not all.
smallsample <- data.frame(read.csv("SmallSample.csv",header = TRUE),smallsample$age,smallsample$income,smallsample$gender,smallsample$marital,smallsample$numkids,smallsample$risk)
smallsample
It wont recognize marital or numkids, despite the fact that those are the column names in the table in the .csv file.
When you use read.csv the output is already in a dataframe.
You can simple use smallsample <- read.csv("SmallSample.csv")
Result using a dummy csv file
<table><tbody><tr><th> </th><th>age</th><th>income</th><th>gender</th><th>marital</th><th>numkids</th><th>risk</th></tr><tr><td>1</td><td>32</td><td>34932</td><td>Female</td><td>Single</td><td>1</td><td>0.9611315</td></tr><tr><td>2</td><td>22</td><td>50535</td><td>Male</td><td>Single</td><td>0</td><td>0.7257541</td></tr><tr><td>3</td><td>40</td><td>42358</td><td>Male</td><td>Single</td><td>1</td><td>0.6879534</td></tr><tr><td>4</td><td>40</td><td>54648</td><td>Male</td><td>Single</td><td>3</td><td>0.568068</td></tr></tbody></table>

R: Using the "names" function on a dataset created within a loop

I am using a for loop to read in multiple csv files and naming the datasets import1, import2, etc. For example:
assign(paste("import",i,sep=""), read.csv(files[i], header=FALSE))
However, I now want to rename the variables in each dataset. I have tried the following:
names(as.name(paste("import",i,sep=""))) <- c("xxxx", "yyyy")
But get the error "target of assignment expands to non-language object". (I need to change the name of variables in each dataset within the loop as the variable names need to be different in each dataset).
Any suggestions on how to do this would be much appreciated.
Thanks.
While I do agree it would be much better to keep your data.frames in a list rather than creating a bunch of variables in your global environment, you can also set names when you read the files in
assign(paste("import",i,sep=""),
read.csv(files[i], header=FALSE, col.names=c("xxxx", "yyyy")))
Using assign() isn't very "R-like".
A better approach would be to read the files into a list of data.frames, instead of one data.frame object per file. Assuming files is the vector of file names (as you imply above):
import <- lapply(files, read.csv, header=FALSE)
Then if you want to operate on each data.frame in the list using a loop, you easily can:
for (i in seq_along(import)) names(import[[i]]) <- c('xxx', 'yyy')

Join two dataframes before exporting as .csv files

I am working on a large questionnaire - and I produce summary frequency tables for different questions (e.g. df1 and df2).
a<-c(1:5)
b<-c(4,3,2,1,1)
Percent<-c(40,30,20,10,10)
df1<-data.frame(a,b,Percent)
c<-c(1,1,5,2,1)
Percent<-c(10,10,50,20,10)
df2<-data.frame(a,c,Percent)
rm(a,b,c,Percent)
I normally export the dataframes as csv files using the following command:
write.csv(df1 ,file="df2.csv")
However, as my questionnaire has many questions and therefore dataframes, I was wondering if there is a way in R to combine different dataframes (say with a line separating them), and export these to a csv (and then ultimately open them in Excel)? When I open Excel, I therefore will have just one file with all my question dataframes in, one below the other. This one csv file would be so much easier than having individual files which I have to open in turn to view the results.
Many thanks in advance.
If your end goal is an Excel spreadsheet, I'd look into some of the tools available in R for directly writing an xls file. Personally, I use the XLConnect package, but there is also xlsx and also several write.xls functions floating around in various packages.
I happen to like XLConnect because it allows for some handy vectorization in situations just like this:
require(XLConnect)
#Put your data frames in a single list
# I added two more copies for illustration
dfs <- list(df1,df2,df1,df2)
#Create the xls file and a sheet
# Note that XLConnect doesn't seem to do tilde expansion!
wb <- loadWorkbook("/Users/jorane/Desktop/so.xls",create = TRUE)
createSheet(wb,"Survey")
#Starting row for each data frame
# Note the +1 to get a gap between each
n <- length(dfs)
rows <- cumsum(c(1,sapply(dfs[1:(n-1)],nrow) + 1))
#Write the file
writeWorksheet(wb,dfs,"Survey",startRow = rows,startCol = 1,header = FALSE)
#If you don't call saveWorkbook, nothing will happen
saveWorkbook(wb)
I specified header = FALSE since otherwise it will write the column header for each data frame. But adding a single row at the top in the xls file at the end isn't much additional work.
As James commented, you could use
merge(df1, df2, by="a")
but that would combine the data horizontally. If you want to combine them vertically you could use rbind:
rbind(df1, df2, df3,...)
(Note: the column names need to match for rbind to work).

Select columns for heatmap in R

I need your help again :)
I wrote an R script, that generates a heatmap out of a given tab-seperated txt or xls file. At the moment, I delete all columns I don't want to have in the heatmap by hand in the xls file.
Now I want to automatize it, but I don't know how :(
The interesting columns all start the same in all xls files, followed by an individual name:
xls-file 1: L1_tpm_xxxx L2_tpm_xxxx L3_tpm_xxxx
xls-file 2: L1_tpm_xxxx L2_tpm_xxxx L3_tpm_xxxx L4_tpm_xxxx L5_tpm_xxxx
Any ideas how to select those columns?
Thanking you in anticipation, Philipp
You could use (if you have read your data in a data.frame df):
df <- df[,grep("^L[[:digit:]]+_tpm.*",colnames(df))]
or you can explicitly write the columns that you want:
df <- df[,c("L1_tpm_xxxx","L2_tpm_xxxx","L3_tpm_xxxx")]
etc...
The following link is quite useful;-)
If you think the column positions are going to be fixed across excel sheets, the simplest solution here is to just use column indices. For example, if you use read.table to import a tab-delimited text file as a data.frame, and then decide you'd prefer to only keep the first two columns, you might do something like this:
data <- read.table("path_to_file.txt", header=T, sep="\t")
data <- data[,1:2]

Resources