Creating functions in R with iterative code

I work with surveys and would like to export a large number of tables (drawn from data frames) into an .xlsx or .csv file. I use the xlsx package to do this. This package requires me to specify which column in the Excel file is the first column of the table. Because I want to paste multiple tables into the same file, I need to be able to specify that the first column for table n comes after the end of table (n-1) plus x blank columns. To do this I planned on creating values like the following.
Each dt# is made by converting a table into a data frame:
table1 <- table(df$y, df$x)
dt1 <- as.data.frame.matrix(table1)
Here I make the values for the number of the starting column
startcol1 = 1
startcol2 = NCOL(dt1) + 3
startcol3 = NCOL(dt2) + startcol2 + 3
startcol4 = NCOL(dt3) + startcol3 + 3
And so on. I will probably need to produce somewhere between 50-100 tables. Is there a way in R to make this an iterative process so I can create the 50 values of starting columns without having to write 50+ lines of code with each one building on the previous?
I found material on Stack Overflow and other blogs about writing for loops or using apply-type functions in R, but it all seemed to deal with manipulating a vector as opposed to adding values to the workspace. Thanks

You can use a structure similar to this:
Your list of files to read:
file_list = list.files("~/test/", pattern = "\\.csv$", full.names = TRUE)
For each file, read and process the data frame and capture how many columns there are in the frame you are reading/processing:
columnsInEachFile = sapply(file_list,
  function(x)
  {
    df = read.csv(x,...) # with your appropriate arguments
    # do any necessary processing you require per file
    return(ncol(df))
  }
)
The cumulative sum of the column counts, plus 1, gives the start column of every data frame after the first when the processed data are placed side by side with no gap (the first data frame starts at column 1):
columnsToStartDataFrames = cumsum(columnsInEachFile) + 1
columnsToStartDataFrames = columnsToStartDataFrames[-length(columnsToStartDataFrames)] # the last value is one past the end of the final data frame, not a start column

Assuming tab.lst is a list containing tables, then you can do:
cumsum(c(1, sapply(head(tab.lst, -1), ncol)))
Basically, this loops through all the tables but the last one (the last table's width is not needed, since its start column is determined by the tables before it) and gets each table's width with ncol. The cumulative sum over that vector, starting from 1, gives all the start positions.
And here is how I created the tables (tables based on all possible combinations of columns in df):
df <- replicate(5, sample(1:10), simplify = FALSE) # list of 5 vectors, used like the columns of a data frame
names(df) <- tail(letters, 5) # name the cols
name.combs <- combn(names(df), 2) # get all 2 col combinations
tab.lst <- lapply( # make tables for each 2 col combination
  split(name.combs, col(name.combs)), # loop through every column in name.combs
  function(x) table(df[[x[[1]]]], df[[x[[2]]]]) # ... and make a table
)
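The question also adds a fixed gap of 3 to each start column, which neither snippet above includes. A minimal sketch of that variant follows; the three small data frames and the names tables, gap, widths and startcols are made up for illustration, and the formula simply reproduces the question's pattern (startcol2 = NCOL(dt1) + 3, startcol3 = NCOL(dt2) + startcol2 + 3, and so on).

```r
# Hypothetical stand-ins for dt1, dt2, dt3 (not the asker's real data).
tables <- list(
  dt1 = data.frame(a = 1:2, b = 3:4, c = 5:6, d = 7:8),  # 4 columns
  dt2 = data.frame(a = 1:2, b = 3:4),                    # 2 columns
  dt3 = data.frame(a = 1:2, b = 3:4, c = 5:6)            # 3 columns
)
gap <- 3  # the constant the question adds to each start column

# Width of every table.
widths <- vapply(tables, ncol, integer(1))

# The first table is pinned to column 1. Every later start column is the
# running total of (width + gap) over all earlier tables.
startcols <- c(1, cumsum(head(widths, -1) + gap))
names(startcols) <- names(tables)

print(startcols)
```

Each element (for example startcols[["dt2"]]) can then be passed as the start column to whichever writing function is used, so no startcolN variables need to be created in the workspace.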

Related

Dividing one dataframe into many with names in R

I have some large data frames that are big enough to push the limits of R on my machine; e.g., the one on which I'm currently working is 2 columns by 70 million rows. The contents aren't important, but just in case, column 1 is a string and column 2 is an integer.
What I would like to do is split that data frame into n parts (say, 20, but preferably something that could change on a case-by-case basis) so that I can work on each of the smaller data frames one at a time. That means that (a) the result has to produce things that are named (e.g., "newdf_1", "newdf_2", ... "newdf_20" or something), and (b) each line in the original data frame needs to be in one (and only one) of the new "sub" data frames. The order does not matter, but doing it sequentially by rows makes sense to me.
Once I do the work, I will start to recombine them (using rbind()) one pair at a time.
I've looked at split(), but from what I can tell, it is designed to work with factors (which I don't have).
Any ideas?
You can create a new column and split the data frame based on that column. The column does not need to be a factor, but it needs to be a data type that the split function can convert to a factor.
# Number of groups
N <- 20
dat$group <- 1:nrow(dat) %% N
# Add 1 to group
dat$group <- dat$group + 1
# Split the dat by group
dat_list <- split(dat, f = ~group)
# Set the name of the list
names(dat_list) <- paste0("newdf_", 1:N)
Data
set.seed(123)
# Create example data frame
dat <- data.frame(
A = sample(letters, size = 70000000, replace = TRUE),
B = rpois(70000000, lambda = 1)
)
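The modulo approach above deals rows out to the groups in rotation. If consecutive blocks of rows are preferred, as the question suggests, a grouping vector can be built from the row index instead. This is a minimal sketch on a small made-up data frame; the names small_dat, group, parts and recombined are assumptions for illustration.

```r
# Small hypothetical data frame standing in for the 70-million-row one.
small_dat <- data.frame(A = letters[1:10], B = 1:10)
N <- 3  # number of pieces wanted

# Map row i to block ceiling(i * N / nrow): rows 1-3 -> 1, 4-6 -> 2, 7-10 -> 3.
group <- ceiling(seq_len(nrow(small_dat)) * N / nrow(small_dat))

# split() accepts the plain numeric vector and converts it to a factor itself.
parts <- split(small_dat, group)
names(parts) <- paste0("newdf_", seq_along(parts))

# After working on each piece, stack them back together in one call.
recombined <- do.call(rbind, parts)
rownames(recombined) <- NULL
```

Keeping the pieces in a named list (parts$newdf_1, parts$newdf_2, ...) avoids creating N separate objects, and do.call(rbind, parts) replaces recombining one pair at a time.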
Here's a tidyverse-based solution. Try using read_csv_chunked().
library(dplyr)
library(readr)
# practice data
tibble(string = sample(letters, 1e6, replace = TRUE),
       value = rnorm(1e6)) %>%
  write_csv("test.csv")
# here's the solution
partial_data <- read_csv_chunked("test.csv",
  DataFrameCallback$new(function(x, pos) filter(x, string == "a")),
  chunk_size = 1000)
You can wrap the call to read_csv_chunked in a function where you change the string that you subset on.
This is more or less a repeat of this question:
How to read only lines that fulfil a condition from a csv into R?

R- for loop to select two columns in a data frame, with only the second column changing

I'm having issues trying to write a for loop in R. I have a data frame of 16 columns and 94 rows and I want to loop through it, selecting column 1 plus column 2 into one data frame, then column 1 plus column 3, and so on, so that I end up with a set of two-column data frames, each written to its own .csv file.
TwoB<- read.csv("data.csv", header=F)
list<- lapply(1:nX, function(x) NULL)
nX <- ncol(TwoB)
for(i in 1:ncol(TwoB)){
list[[i]]<-subset(TwoB,
select=c(1, i+1))
}
Which produces an error:
Error in `[.data.frame`(x, r, vars, drop = drop):
undefined columns selected
I'm not really sure how to code this and clearly haven't quite grasped loops yet so any help would be appreciated!
The error is easily explained: you loop over all 16 columns, so on the last iteration you try to select column 16+1, an index that does not exist.
You could probably loop over nX-1 instead, but I think what you are trying to achieve can be done more elegantly.
TwoB<- read.csv("data.csv", header=F)
library("data.table")
setDT(TwoB)
nX <- ncol(TwoB)
# example to directly write your files
lapply(2:nX, function(col_index) {
  fwrite(TwoB[, c(1, ..col_index)], file = paste0("col1_col", col_index, ".csv"))
})
# example to store the new data.tables in a list
list_of_two_column_tables <- lapply(2:nX, function(col_index) {
  TwoB[, c(1, ..col_index)]
})
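For readers who would rather stay in base R, the same idea can be sketched without data.table. The three-column TwoB below is a made-up stand-in for the 16-column file, and out_dir, pairs and the file names are assumptions for illustration.

```r
# Hypothetical stand-in for the data read from data.csv.
TwoB <- data.frame(V1 = 1:3, V2 = c(10, 20, 30), V3 = c("a", "b", "c"))

# Temporary output folder so the sketch does not write into the working directory.
out_dir <- tempfile("two_col_")
dir.create(out_dir)

# Pair column 1 with each of the remaining columns; start at 2 so that the
# index never runs past the last column.
pairs <- lapply(2:ncol(TwoB), function(i) TwoB[, c(1, i)])
names(pairs) <- paste0("col1_col", 2:ncol(TwoB))

# Write each two-column data frame to its own CSV file.
for (nm in names(pairs)) {
  write.csv(pairs[[nm]], file.path(out_dir, paste0(nm, ".csv")), row.names = FALSE)
}
```

With k columns this produces k - 1 files, since column 1 is not paired with itself.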

How to find similarities between 2 datasets and generate a new dataframe consisting of these rows which coincide?

I have the results of radiosonde observations for more than 1000 stations in one file and a list of the stations (81) that actually interest me. I need to make a new data frame that contains only the first file's rows for those stations.
So, I have two datasets imported from .txt files into R. The first is a 6694668x6 data frame and the second is 81x1, where the second dataset's rows coincide with some of the values in the first dataset's first column (the values look like this: ACM00078861).
d = data.frame(matrix(ncol = 6, nrow = 0))
for(i in 1:81){
for (j in 1:6694668) {
if(stations[i,1] == ghgt_00z.mly[j,1]){
rbind(d,ghgt_00z.mly[j,] )
j + 1
} else {j+1}
}
}
I wanted to generate a new data frame that looks like "ghgt_00z.mly" but contains only the rows for the stations listed in "stations".
Of course, the code ran for a couple of days and all I received was a warning message.
Please help me!
There are a lot of ways to do this. I personally use the classic merge():
res <- merge(x=stations, y=ghgt_00z.mly, by='common_column_name', all.x = TRUE)
Here common_column_name is the name of a column that is present under the same name in both data frames. The result combines the two data frames with all columns from both datasets; you can remove the ones you don't need.
A second useful option is:
library(dplyr)
inp <- stations$column_in_stations
res <- filter(ghgt_00z.mly, grepl(paste(inp, collapse="|"), column_of_interest))
Here inp and column_of_interest should contain some of the same values.
Since I don't have the datasets, I can't check these solutions, so I can't guarantee that they work.
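Because the station codes are matched exactly, a plain membership test with %in% is another option that needs no loop and no regular expression. The sketch below uses two tiny made-up data frames; the column name id and the codes other than ACM00078861 are assumptions, not the real data.

```r
# Hypothetical list of stations of interest (81 rows in the real data).
stations <- data.frame(id = c("ACM00078861", "XYZ00000001"))

# Hypothetical observations table (about 6.7 million rows in the real data).
obs <- data.frame(
  id    = c("ACM00078861", "AAA00000002", "XYZ00000001", "ACM00078861"),
  value = c(1.5, 2.5, 3.5, 4.5)
)

# Keep only the observation rows whose station code appears in the list.
# %in% is vectorised, so every row is tested in a single call.
res <- obs[obs$id %in% stations$id, ]
```

The result has the same columns as the observations table and only the rows for the listed stations, which is the shape the question asks for.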

Apply series of changes to multiple similar datasets in R

I have 20 csv files of data that are formatted exactly the same, about 40 columns of different numbers, but with different values in each column. I want to apply a series of changes to each data frame in order to extract specific information from every one of them.
Specifically I want to extract four columns from each data frame, find the maximum value of each column in each data frame and then add all of these maximum values together, so I get one final number for each data frame. Something like this:
str(data)
Extract<-data[c(1,2,3,4)]
Max<-apply(Extract,2,max)
Add<-Max[1] + Max[2] + Max[3] + Max[4]
I have the code written above to do all these steps for every data frame individually, but is it possible to apply this code to all of them at once?
If you put all 20 filenames into a vector called files, you can loop over them:
Maxes <- numeric(length(files))
i <- 1
for (file in files) {
  data <- read.csv(file)
  str(data)
  Extract<-data[c(1,2,3,4)]
  Max<-apply(Extract,2,max)
  Add<-Max[1] + Max[2] + Max[3] + Max[4]
  Maxes[i] <- Add
  i <- i+1
}
Though that str(data) will just cause a lot of stuff to print to the terminal 20 times. I'm not sure the value of that, but it was in your question so I included.
Put all your files into a common folder such as /path/temp/
csvs <- list.files("/path/temp", pattern = "\\.csv$", full.names = TRUE) # vector of csv file paths
Use a custom function for colMax
colMax <- function(data) sapply(data, max, na.rm = TRUE)
Using foreach, dplyr, and readr
library(foreach)
library(dplyr)
library(readr)
foreach(i = seq_along(csvs), .combine = "c") %do% {
  read_csv(csvs[i]) %>%
    select(1:4) %>%
    colMax(.) %>%
    sum(.)
} # returns a vector
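The same per-file calculation can also be wrapped in a small function and applied with vapply, using base R only. In this minimal sketch the two CSV files are generated in a temporary folder so that it runs on its own; sum_of_maxes, tmp_dir and the file contents are made up for illustration.

```r
# Create two small hypothetical CSV files in a temporary folder.
tmp_dir <- tempfile("csv_demo_")
dir.create(tmp_dir)
write.csv(data.frame(a = 1:3, b = 4:6, c = 7:9, d = 10:12, e = 0),
          file.path(tmp_dir, "f1.csv"), row.names = FALSE)
write.csv(data.frame(a = c(2, 9), b = c(1, 1), c = c(0, 3), d = c(5, 5), e = c(1, 1)),
          file.path(tmp_dir, "f2.csv"), row.names = FALSE)

# Full paths are needed so read.csv can find the files from any working directory.
files <- list.files(tmp_dir, pattern = "\\.csv$", full.names = TRUE)

# Read one file, take the maximum of each of its first four columns, add them up.
sum_of_maxes <- function(path) {
  data <- read.csv(path)
  sum(sapply(data[1:4], max, na.rm = TRUE))
}

# One number per file; vapply checks that each result is a single numeric value.
totals <- vapply(files, sum_of_maxes, numeric(1))
names(totals) <- basename(files)
print(totals)
```

Naming the result by file makes it easy to see which total belongs to which dataset.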

repeat the assigning of data frame in R [duplicate]

I am new to R and Stack Overflow, so this will probably have a very simple solution.
I have a set of data from 20 different subjects. In the future I will have to perform a lot of different actions on this data and will have to repeat each action for every individual set, analyzing them separately and then recombining them.
My question is how can I automate this process:
P4 <- read.delim("P4Rtest.txt")
P7 <- read.delim("P7Rtest.txt")
P13 <- read.delim("P13Rtest.txt")
etc etc etc.
I have tried using a for loop but seem to get stuck on creating a new data.frame with a unique name every time.
Thank you for your help
The R way to do this would be to keep all the data sets together in a named list. For that you can use the following, where n is the number of files.
nm <- paste0("P", 1:n) ## create the names P1, P2, ..., Pn
dfList <- setNames(lapply(paste0(nm, "Rtest.txt"), read.delim), nm)
Now dfList will contain all the data sets. You can access them individually with dfList$P1 for P1, dfList$P2 for P2, and so on.
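The file numbers in the question (4, 7, 13) are not a complete sequence, so the names can be built from a vector of the numbers actually used. The following sketch writes three small tab-delimited files to a temporary folder so that it is self-contained; the vector ids, the score column and tmp_dir are assumptions for illustration.

```r
# Participant numbers actually present (hypothetical subset from the question).
ids <- c(4, 7, 13)

# Write small tab-delimited demo files: P4Rtest.txt, P7Rtest.txt, P13Rtest.txt.
tmp_dir <- tempfile("subjects_")
dir.create(tmp_dir)
for (id in ids) {
  write.table(data.frame(score = id + 0:2),
              file.path(tmp_dir, paste0("P", id, "Rtest.txt")),
              sep = "\t", row.names = FALSE, quote = FALSE)
}

# Read every file into one named list: dfList$P4, dfList$P7, dfList$P13.
nm <- paste0("P", ids)
dfList <- setNames(lapply(file.path(tmp_dir, paste0(nm, "Rtest.txt")), read.delim), nm)

# Apply the same analysis to every participant in one call.
means <- sapply(dfList, function(d) mean(d$score))
print(means)
```

Any later step that has to be repeated per participant can be written once as a function and passed to lapply or sapply in the same way.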
There are a bunch of different ways of doing stuff like this. You could combine all the data into one data frame using rbind. The first answer here has a good way of doing that: Replace rbind in for-loop with lapply? (2nd circle of hell)
If you combine everything into one data frame, you'll need to add a column that identifies the participant. So instead of
P4 <- read.delim("P4Rtest.txt")
...
You would have something like
my.list <- vector("list", number.of.subjects)
for(participant.number in 1:number.of.subjects){
  # load individual participant data
  participant.filename = paste("P", participant.number, "Rtest.txt", sep="")
  participant.df <- read.delim(participant.filename)
  # add a column:
  participant.df$participant.number = participant.number
  my.list[[participant.number]] <- participant.df
}
solution <- do.call(rbind, my.list)
If you want to keep them separate data frames for some reason, you can keep them in a list (leave off the last rbind line) and use lapply(my.list, function(participant.df) { stuff you want to do }) whenever you want to do stuff to the data frames.
You can use assign. Assuming all your files have a similar format as you have shown, this will work for you:
# Define how many files there are (with the numbers).
numFiles <- 10
# Run through that sequence.
for (i in 1:numFiles) {
  fileName <- paste0("P", i, "Rtest.txt") # Creating the name to pull from.
  file <- read.delim(fileName) # Reading in the file.
  dName <- paste0("P", i) # Creating the name to assign the data frame to in R.
  assign(dName, file) # Creating the object in the workspace.
}
There are other methods that are faster and more compact, but I find this to be more readable, especially for someone who is new to R.
Additionally, if your numbers aren't a complete sequence like the one I've used here, you can define a vector of the numbers that are used and loop over that vector directly, i.e. for (i in fileNums) instead of for (i in 1:numFiles):
fileNums <- c(1, 4, 10, 25)
