Pull subset of large data by multiple specific characters - r

I have a large database that is divided into many files for ease of analysis and storage, and I am trying to extract rows matching multiple specific character values in a single column, in order to take a chunk of the overall data for further analysis.
Within these files I am interested in pulling ALL rows in which the column "Cat" equals any of a set of character values (different for each pull and each file).
Files are set up (for example) as:
2001_x.sas
2002_x.sas
....
2018_x.sas
Currently, I am doing the following:
# Create a list of files - fill out the pattern to choose specific files with similar names
x <- list.files(pattern = "_x.sas")

# Read and subset files where Cat is C21, C98, or D27, etc.
z <- lapply(x, function(x) {
  a <- read.sas(x)
  subset(a, Cat == "C21" | Cat == "C98" | Cat == "D27")
})

# Bind the data frames into a master data frame
y <- bind_rows(z)
y is then a really nice pull from multiple files at once. Since the total dataset is several terabytes, the advantage of this approach is that it works within the individual files and doesn't overwhelm the memory on my desktop.
The problem is that I can't always compare Cat against just three values. Sometimes I need to supply hundreds of values, which is very tedious to write out. I have tried replacing the comparison with lists or vectors, without success.
Ideally, I'd like the code to look more like this, though I know this doesn't work:

b <- # a list or vector with the character values of interest
z <- lapply(x, function(x) {
  a <- read.sas(x)
  subset(a, Cat == any(b))
})
y <- bind_rows(z)

The idea is that any row whose Cat value appears in b would be included in the subset. However, I've only been able to get this to work by chaining Cat == comparisons with the | operator.
Thanks!
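One way to express this is with R's %in% operator, which tests whether each value of Cat appears anywhere in a vector, so the chain of == and | comparisons collapses to a single condition. A minimal sketch, keeping the question's read.sas call and assuming b is a character vector of the codes of interest:

# A sketch using %in%; read.sas is kept from the question's code and is
# assumed to return a data frame with a character column Cat.
library(dplyr)

x <- list.files(pattern = "_x.sas")   # the files to pull from
b <- c("C21", "C98", "D27")           # values of interest; could hold hundreds of codes

z <- lapply(x, function(f) {
  a <- read.sas(f)
  subset(a, Cat %in% b)               # TRUE wherever Cat matches any value in b
})
y <- bind_rows(z)

The %in% test scales to however many codes b contains, so nothing else in the pull changes as the list grows.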

Related

Remove multiple rows from a list of names in R (a list of 187 names to remove)?

I have a data frame in R containing over 29,000 rows. I need to remove multiple rows using only a list of names (187 names).
My dataset is about airlines, and I need to remove specific airlines from my data set that contains over 200 types of airlines. My first column contains all airline names, and I need to remove the entire row for those specific airlines.
I singled out all airline names that I want removed with this code: transmute(a_name_remove, airline_name). This gave me a table of all the airline names I want removed; now I have to remove that list of names from my original dataset named airlines.
I know there is a way to do this manually, which is: mydata[-c("a", "b"), ], for example. But writing out each name would be hectic.
Can you please help me find a way to use the list that I have to directly remove those rows from my dataset?
I cannot write out each name on its own.
I also tried this: airlines[!(row.names(airlines) %in% c(remove)), ], where I turned my list "remove" into a data frame and then into a vector, and used that code to remove those rows from my original dataset "airlines"; it still did not work.
Thank you!
You can create a function that negates %in%, e.g.
'%not_in%' <- Negate('%in%')
so per your code, it should look like this
airlines[row.names(airlines) %not_in% remove, ]
Additionally, I do not recommend using remove as a variable name, since it is a base function in R. If possible, rename the variable, e.g. discard_airlines:
airlines[row.names(airlines) %not_in% discard_airlines, ]
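A self-contained sketch of the same idea applied to a column of airline names rather than to the row names, since the question says the names live in the first column (the column name airline_name and the toy values are illustrative only):

'%not_in%' <- Negate('%in%')

# toy data standing in for the real airlines data frame; airline_name is a
# hypothetical column holding the names from the first column of the data
airlines <- data.frame(airline_name = c("AirA", "AirB", "AirC", "AirD"),
                       flights      = c(120, 45, 300, 80))

discard_airlines <- c("AirB", "AirD")   # the list of 187 names would go here

kept <- airlines[airlines$airline_name %not_in% discard_airlines, ]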

How to fill dataframe rows for progressive files in a for loop in R

I'm trying to analyze some data acquired from experimental tests with several variables being recorded. I've imported a dataframe into R and I want to obtain some statistical information by processing these data.
In particular, I want to fill in an empty dataframe with the same variable names of the imported dataframe but with statistical features like mean, median, mode, max, min and quantiles as rows for each variable.
The input dataframes are something like 60 columns x 250k rows each.
I've already managed to do this using apply as in the following lines of code for a single input file.
df[1,] <- apply(mydata,2,mean,na.rm=T)
df[2,] <- apply(mydata,2,sd,na.rm=T)
...
Now I need to do this in a for loop for a number of input files mydata_1, mydata_2, mydata_3, ... in order to build several summary statistics dataframes, one for each input file.
I have tried several different approaches with apply and assign, but I can't manage to fill the rows of interest in the output data frames while cycling over the several input files.
I would like to do something like the code below (I know that this code does not work; it's just to give an idea of what I want to do).
The output df dataframes are already defined and empty.
for (xx in 1:number_of_mydata_files) {
df_xx[1,]<-apply(mydata_xx,2,mean,na.rm=T)
df_xx[2,]<-apply(mydata_xx,2,sd,na.rm=T)
...
}
I can't recall the exact error message this code gives; the point is that it does not run at all as written.
I'm quite a beginner in R, so I don't have much experience with the language. Is there a way to do this? Are there other functions that could be used instead of apply and assign?
EDIT:
I'm adding a simple description of the input data frames here; sorry for the poor data visualization. The input data frames are imported .csv files that look like tables: the first row is the column description, i.e. the name of the measured variable, and the following rows are the acquired data. I have 250,000 acquisitions for each variable in each file, and something like 5-8 files like this as input.
Current [A] | Force [N] | Elongation [%] | ...
----------- | --------- | -------------- | ---
Value_a_1   | Value_b_1 | Value_c_1      | ...
I want to obtain a data frame like this as output, with the same variable names, but with statistical values as rows. For example, the first row, instead of holding the first acquired values for each variable, would hold the mean of the 250k acquisitions for each variable; the second row would be the standard deviation, the third the variance, and so on.
I’ve managed to build empty dataframes for the output summary statistics, with just the columns and no rows yet. I just want to fill them and do this iteratively in a for loop.
Not sure what your data looks like, but you can do the following, where lst represents your list of data frames:
lst <- list(iris[, -5], mtcars, airquality)

lapply(seq_along(lst),
       function(i) sapply(lst[[i]], function(x)
         data.frame(Mean = mean(x, na.rm = TRUE),
                    sd   = sd(x, na.rm = TRUE))))
Or, as suggested by @G. Grothendieck, simply:
lapply(lst, sapply, function(x)
  data.frame(Mean = mean(x, na.rm = TRUE), sd = sd(x, na.rm = TRUE)))
If all your files are in the same directory, set the working directory to that folder and use list.files() to walk along your input files.
If they share the same column names, you can rbind the result into a single data set.
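A hedged end-to-end sketch of that workflow; the folder path, the file pattern, and the particular statistics are illustrative choices, not taken from the question:

path  <- "C:/temp/experiment_csv/"                  # hypothetical folder holding the input .csv files
files <- list.files(path, pattern = "\\.csv$", full.names = TRUE)

summarise_file <- function(f) {
  mydata <- read.csv(f)
  stats  <- sapply(mydata, function(x)
    c(mean = mean(x, na.rm = TRUE),
      sd   = sd(x, na.rm = TRUE),
      min  = min(x, na.rm = TRUE),
      max  = max(x, na.rm = TRUE)))
  out <- as.data.frame(stats)                       # one row per statistic, one column per variable
  out$file <- basename(f)                           # record which input file the summary came from
  out
}

summaries <- lapply(files, summarise_file)          # one summary data frame per input file
all_stats <- do.call(rbind, summaries)              # single combined table (columns match across files)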

Rbind Excel files from a list of vectors of files in R

I have web-scraped ~1000 Excel files into a specific folder on my computer.
I then read in the file names, which returned a character vector, chr [1:1049].
I then grouped these files by similar names, with every 6 consecutive files belonging to one group.
This returned a list of 175 elements, each holding a group of 6 file names.
I am confused about how to run a loop that would read and merge/rbind the 6 files in each group from that list. I also need to remove the first row of each file, but I know how to do that part with read.xlsx.
My code so far is
setwd("C:\\Users\\ewarren\\OneDrive\\Documents\\Reservoir Storage")
files <- list.files()
file_groups <- split(files, ceiling(seq_along(files)/6))
with
for (i in file_groups) {
print(i)
}
returning each group of file names
The files, for example, are a set of similarly named .xlsx files (screenshot of the file names not reproduced here).
Each file is comprised of two columns, date and amount.
I need to add a third column to each that holds the reservoir name.
That way, when all the rows from all the files are combined, there is a date, an amount, and a reservoir. If I combined them all at once without the reservoir, I wouldn't know which rows belong to which reservoir.
You can use startRow = 2 in read.xlsx so that the first row is not read in.
For merging the groups of files: if the file names in each group share an identifier, e.g. "x", that does not appear in the names of the other groups, you can collect that group with group1 <- list.files(pattern = "x") and then, after reading each file in, stack them with do.call(rbind, ...).
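A hedged sketch of the full loop the question asks for, assuming openxlsx::read.xlsx and that the reservoir name can be recovered from the file name (both assumptions, not stated in the thread); file_groups is the list built in the question's code:

library(openxlsx)

all_reservoirs <- vector("list", length(file_groups))

for (grp in seq_along(file_groups)) {
  group_files <- file_groups[[grp]]

  group_data <- lapply(group_files, function(f) {
    dat <- read.xlsx(f, startRow = 2)                   # skip the first row, as noted above
    dat$reservoir <- sub("\\.xlsx$", "", basename(f))   # third column: reservoir name taken from the file name
    dat
  })

  all_reservoirs[[grp]] <- do.call(rbind, group_data)   # one combined data frame per group of 6 files
}

combined <- do.call(rbind, all_reservoirs)              # every date, amount, and reservoir in one table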

How do I select columns by name while ignoring certain characters?

I'm trying to pull data from a file, but only pull certain columns based on the column name.
I have this bit of code:
filepath <- ([my filepath])
files <- list.files(filepath, full.names=T)
newData <- fread(file,select=c(selectCols))
selectCols contains a list of column names (as strings). But in the data I'm pulling, there may be underscores placed differently in each file for the same data.
Here's an example:
PERIOD_ID
PERIOD_ID_
_PERIOD_ID_
And so on. I know I can use gsub to change the column names once the data is already pulled:
colnames(newData) <- gsub("_", "", colnames(newData))
Then I can select by column name, but given that it's a lot of data I'm not sure this is the most efficient idea.
Is there a way to ignore underscores or other characters within the fread function?
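A hedged sketch of one workaround: read only the header first, strip the underscores for matching, and then hand fread the original names of the columns you actually want (the helper name read_selected is made up; filepath and selectCols are the objects described in the question):

library(data.table)

read_selected <- function(f, selectCols) {
  header     <- names(fread(f, nrows = 0))          # read only the column names
  normalised <- gsub("_", "", header)               # drop underscores for matching
  wanted     <- header[normalised %in% gsub("_", "", selectCols)]
  fread(f, select = wanted)                         # select still uses the file's original names
}

files   <- list.files(filepath, full.names = TRUE)
newData <- lapply(files, read_selected, selectCols = selectCols)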

Altering dataframes stored within a list

I am trying to write some kind of loop function that will allow me to apply the same set of code to dozens of data frames that are stored in one list. Each data frame has the same number of columns and identical headers for each column, though the number of rows varies across data frames.
This data comes from an egocentric social network study where I collected ego-network data in edgelist format from dozens of different respondents. The data collection software that I use stores the data from each interview in its own .csv file. Here is an image of the raw data for a specific data frame (image of raw data).
For my purposes, I only need to use data from the fourth, sixth, and seventh columns. Furthermore, I only need rows of data where the last column has values of 4, at which point the final column can be deleted entirely. The end result is a two-column data frame that represents relationships among pairs of people.
After reading in the data and storing it as an object, I ran the following code:
x100291 = `100291AlterPair.csv` #new object based on raw data
foc.altername = x100291$Alter.1.Name
altername = x100291$Alter.2.Name
tievalue = x100291$AlterPair_B
tie = tievalue
tie[(tie<4)] = NA
egonet.name = data.frame(foc.altername, altername, tievalue)
depleted.name = cbind(tie,egonet.name)
depleted.name = depleted.name[is.na(depleted.name[,1]) == F,]
dep.ego.name = data.frame(depleted.name$foc.altername, depleted.name$altername)
This produced the following data frame (image of final data). This is ultimately what I want.
Now I know that I could cut-and-paste this same set of code 100+ times and manually alter the file names, but I would prefer not to do that. Instead, I have stored all of my raw .csv files as data frames in a single list. I suspect that I can apply the same code across all of the data frames by using one of the apply commands, but I cannot figure it out.
Does anyone have any suggestions for how I might apply this basic code to a list of data frames so that I end up with a new list containing cleaned and reduced versions of the data?
Many thanks!
The logic can be simplified. Try creating a custom function and applying it over all the data frames.
cleanDF <- function(mydf) {
  if (!all(c('AlterPair_B', 'Alter.1.Name', 'Alter.2.Name') %in%
           names(mydf))) stop("Check data frame names")
  condition <- mydf[, 'AlterPair_B'] >= 4
  mydf[condition, c("Alter.1.Name", "Alter.2.Name")]
}
big_list <- lapply(all_my_files, read.csv) #read in all data frames
result <- do.call('rbind', lapply(big_list, cleanDF))
The custom function cleanDF first checks that all the relevant column names are there. Then it defines the condition of an 'AlterPair_B' value of 4 or more. Lastly, it subsets the two target columns by that condition. I used a list called 'big_list' to represent all of the data frames.
You haven't provided a reproducible example, so it's hard to solve your problem exactly; however, I don't want your question to remain unanswered. It is true that lapply would be a fast solution, usually preferable to a loop, but since you mentioned being a beginner, here's how to do it with a loop, which is easier to understand.
You need to put all your csv files in a single folder with nothing else. Then, you read the filenames and put them in a list. You initialize an empty result object with NULL. You then read all your files in a loop, do calculations and rbind the results in the result object.
path <-"C:/temp/csv/"
list_of_csv_files <- list.files(path)
result <- NULL
for (filenames in list_of_csv_files) {
input <- read.csv(paste0(path,filenames), header=TRUE, stringsAsFactors=FALSE)
#Do your calculations
input_with_calculations <- input
result <- rbind(result,input_with_calculations)
}
result
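For concreteness, a hedged variant of the loop above that fills in the "#Do your calculations" step with the cleaning logic from the question's own code and keeps one cleaned data frame per respondent in a named list (the folder path is the placeholder used above; the >= 4 threshold mirrors the first answer):

path <- "C:/temp/csv/"
list_of_csv_files <- list.files(path, pattern = "\\.csv$")

cleaned_list <- list()
for (filename in list_of_csv_files) {
  input <- read.csv(paste0(path, filename), header = TRUE, stringsAsFactors = FALSE)

  # the question's cleaning: keep ties of 4 or more, then only the two name columns
  keep <- !is.na(input$AlterPair_B) & input$AlterPair_B >= 4
  cleaned_list[[filename]] <- input[keep, c("Alter.1.Name", "Alter.2.Name")]
}

# cleaned_list now holds one reduced data frame per respondent,
# named after the original .csv file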
