Reading subset of large data - r

I have a LARGE dataset with over 100 million rows. I only want to read the part of the data that corresponds to one particular level of a factor, say column1 == A. How do I accomplish this in R using read.csv?
Thank you

You can't filter rows using read.csv. You might try sqldf::read.csv.sql as outlined in answers to this question.
But I think most people would process the file using another tool first. For example, csvkit allows filtering by rows.
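If installing another tool is not an option, one base-R workaround is to read the file in chunks through a connection and keep only the matching rows from each chunk. This is a minimal sketch, not the sqldf approach mentioned above: the helper name read_csv_subset and the chunk_size parameter are hypothetical, and it assumes a plain comma-separated file with an unquoted header and no fields that span multiple lines.

```r
# Hypothetical helper: stream a CSV in chunks and keep rows where `column` == `value`.
# Assumes an unquoted header line and no embedded newlines inside fields.
read_csv_subset <- function(path, column, value, chunk_size = 10000L) {
  con <- file(path, open = "r")
  on.exit(close(con))
  # Read the header once so every chunk gets the same column names
  header <- strsplit(readLines(con, n = 1L), ",", fixed = TRUE)[[1]]
  kept <- list()
  repeat {
    lines <- readLines(con, n = chunk_size)
    if (length(lines) == 0L) break
    chunk <- read.csv(text = lines, header = FALSE, col.names = header,
                      stringsAsFactors = FALSE)
    # which() drops rows where the comparison is NA
    kept[[length(kept) + 1L]] <- chunk[which(chunk[[column]] == value), , drop = FALSE]
  }
  if (length(kept) == 0L) return(data.frame())
  result <- do.call(rbind, kept)
  rownames(result) <- NULL
  result
}

# Small demonstration on a temporary file
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(column1 = c("A", "B", "A", "C"), value = 1:4),
          tmp, row.names = FALSE, quote = FALSE)
subset_a <- read_csv_subset(tmp, "column1", "A", chunk_size = 2L)
```

Only one chunk is held in memory at a time, plus the rows kept so far, so peak memory depends on chunk_size and on how many rows match rather than on the size of the file.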

Related

copying data from one data frame to other using variable in R

I am trying to transfer data from one data frame to another. I want to copy 8 columns from a huge data frame to a smaller one and name the columns n1, n2, etc.
First, I find the number of the first column I need to copy:
x=as.numeric(which(colnames(old_df)=='N1_data'))
Then I am pasting it in new data frame this way
new_df[paste('N',1:8,'new',sep='')]=old_df[x:x+7]
However, when I run this, all 8 new columns contain exactly the same data. If I use the value of x directly instead, I get what I want:
new_df[paste('N',1:8,'new',sep='')]=old_df[10:17]
So my questions are
Why am I not able to use the variable x? I added as.numeric just to make sure it is a number and not a list, but that does not seem to help.
Is there any better or more efficient way to achieve this?
If I'm understanding your question correctly, you may be overthinking the problem.
library(dplyr)
new_df <- select(old_df, N1_data, N2_data, N3_data, N4_data,
                 N5_data, N6_data, N7_data, N8_data)
# Rename N1_data ... N8_data to n1 ... n8
colnames(new_df) <- sub("N(\\d)_data", "n\\1", colnames(new_df))
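As for why the variable x did not work in the question: in R the : operator binds more tightly than +, so x:x+7 is evaluated as (x:x)+7, a single index, and that one column is recycled into all 8 targets. Parenthesising the upper bound gives the intended range. A small sketch with a made-up old_df (the 20-column layout here is an assumption for illustration only):

```r
# Made-up data frame with N1_data ... N8_data in columns 10 to 17
old_df <- as.data.frame(matrix(1:40, nrow = 2, ncol = 20))
colnames(old_df)[10:17] <- paste0("N", 1:8, "_data")

x <- which(colnames(old_df) == "N1_data")

wrong_idx <- x:x + 7     # parsed as (x:x) + 7, a single column index
right_idx <- x:(x + 7)   # the eight consecutive columns starting at x

# Copy the eight columns and rename them n1 ... n8
new_df <- setNames(old_df[right_idx], paste0("n", 1:8))
```

The as.numeric() wrapper in the question is not needed, since which() already returns a plain integer vector.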

How to read and transpose big data set into R

I am trying to read a data set with more than 18,000 rows and 90 columns into R and transpose it. (The data set actually contains 18,000 variables and 90 samples.) I tried read.transpose, but it does not work. Any suggestions? Many thanks.
Actually, that is a pretty average/small data set. Just read it in like you would any data frame. The function you are looking for is t(); see ?t for more information.
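A minimal sketch of that approach, using a tiny made-up table in place of the real file (the variable and sample names are placeholders). Note that t() returns a matrix, so it is wrapped in as.data.frame() here in case a data frame is needed afterwards.

```r
# Made-up stand-in for the real data: variables in rows, samples in columns
raw <- data.frame(sample1 = c(1, 2, 3),
                  sample2 = c(4, 5, 6),
                  row.names = c("var1", "var2", "var3"))

# t() transposes to a matrix; convert back to a data frame if required
transposed <- as.data.frame(t(raw))
```

After the transpose, each row is a sample and each column is a variable.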

How do I match single ID's in one data frame to multiples of the IDs in another data frame in R?

For a project at work, I need to generate a table from a list of proposal ids, and a table with more data about some of those proposals (called "awards"). I'm having trouble with the match() function; the data in the "awards" table often has several rows that use the same ID, while the proposals frame has only one copy of each ID. From what I've tried, R ignores multiple rows and only returns the first match, when I need all of them. I haven't been able to find anything in documentation or through searches that helps me, though I have been having difficulty phrasing the right question.
Here's what I have so far:
#R CODE to add awards data on proposals to new data spreadsheet
#read tab delimited files
Awards=read.delim("O:/testing.txt",as.is=T)
Proposals=read.delim("O:/test.txt",as.is=T)
#match IDs from both spreadsheets
Proposals$TotalAwarded=Awards$TotalAwarded[match(Proposals$IDs,Awards$IDs)]
write.table(Proposals,"O:/tested.txt",quote=F,row.names=F,sep="\t")
This does exactly what I want, except that only the first match is encapsulated.
What's the best way to go forward? How do I make R utilize all of the matches available?
Thanks
See help on merge: ?merge
merge(Proposals, Awards, by = "IDs", all.y = TRUE)
But I cannot believe this hasn't been asked on SO before.
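To illustrate the difference from match(), here is a small sketch with made-up data using the column names from the question. It uses all.x = TRUE rather than all.y = TRUE, which is an assumption about what is wanted: every proposal is kept, including those with no award.

```r
# Made-up data: proposal 1 has two awards, proposal 2 has none
Proposals <- data.frame(IDs = c(1, 2, 3))
Awards <- data.frame(IDs = c(1, 1, 3), TotalAwarded = c(100, 250, 75))

# match() finds only the first matching row per ID
first_only <- Awards$TotalAwarded[match(Proposals$IDs, Awards$IDs)]

# merge() returns one row per matching pair; all.x = TRUE keeps unmatched proposals
matched <- merge(Proposals, Awards, by = "IDs", all.x = TRUE)
```

Here matched has two rows for ID 1, one row with NA for ID 2, and one row for ID 3, whereas first_only holds a single value per proposal.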

Select Rows and Columns At the Same Time in SPSS

I have a dataset in SPSS that has 100K+ rows and over 100 columns. I want to filter both the rows and columns at the same time into a new SPSS dataset.
I can accomplish this very easily using the subset command in R. For example:
new_data = subset(old_data, subset = ColumnA > 10, select = c(ColumnA, ColumnC, ColumnZZ))
Even easier would be:
new_data = old_data[old_data$ColumnA > 10, c(1, 4, 89)]
where I am passing the column indices instead.
What is the equivalent in SPSS?
I love R, but the read/write and data management speed of SPSS is significantly better.
I am not sure what exactly you are referring to when you write that "the read/write and data management speed of SPSS is significantly better" than R's. Your question itself demonstrates how flexible R is at data management! And a dataset of 100k rows and 100 columns is by no means a large one.
But, to answer your question, perhaps you are looking for something like this. I'm providing a "programmatic" solution, rather than the GUI one, because you're asking the question on Stack Overflow, where the focus is more on the programming side of things. I'm using a sample data file that can be found here: http://www.ats.ucla.edu/stat/spss/examples/chp/p004.sav
Save that file to your SPSS working directory, open up your SPSS syntax editor, and type the following:
GET FILE='p004.sav'.
SELECT IF (lactatio <= 3).
SAVE OUTFILE= 'mynewdatafile.sav'
/KEEP currentm previous lactatio.
GET FILE='mynewdatafile.sav'.
More likely, though, you'll have to go through something like this:
FILE HANDLE directoryPath /NAME='C:\path\to\working\directory\' .
FILE HANDLE myFile /NAME='directoryPath/p004.sav' .
GET FILE='myFile'.
SELECT IF (lactatio <= 3).
SAVE OUTFILE= 'directoryPath/mynewdatafile.sav'
/KEEP currentm previous lactatio.
FILE HANDLE myFile /NAME='directoryPath/mynewdatafile.sav'.
GET FILE='myFile'.
You should now have a new file created that has just three columns, and where no value in the "lactatio" column is greater than 3.
So, the basic steps are:
Load the data you want to work with.
Subset for all columns from all the cases you're interested in.
Save a new file with only the variables you're interested in.
Load that new file before you proceed.
With R, the basic steps are:
Load the data you want to work with.
Create an object with your subset of rows and columns (which you know how to do).
Hmm.... I don't know about you, but I know which method I prefer ;)
If you're using the right tools with R, you can also directly read in the specific subset you are interested in without first loading the whole dataset if speed really is an issue.
In SPSS you can't combine the two actions in one command, but it's easy enough to do it in two:
dataset copy old_data. /* delete this if you don't need to keep both old and new data.
select if ColumnA>10.
add files /file=* /keep=ColumnA ColumnC ColumnZZ.

Appending data in R

I am producing a script where I have done many manipulations to a bunch of data, and I do these same manipulations to another dataset. Both data sets have the same rows, columns, and headers. I would like to be able to join the two data sets together, placing dataset A above dataset B. I wouldn't need the headers for dataset B and would instead just clump all of the data together as if they were never really separated in the first place. Is there a simple way to do this?
Yes. Use the rbind() function.
combineddataset = rbind(dataset1, dataset2)
Hope that helps.
And for completeness, you could also use the rbind.fill function found in the plyr package.
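A minimal sketch of the rbind() approach with two made-up data frames that share the same columns (the column names are placeholders):

```r
# Two made-up datasets with identical column names and types
dataset1 <- data.frame(id = 1:2, score = c(90, 85))
dataset2 <- data.frame(id = 3:4, score = c(70, 95))

# Stack dataset2 below dataset1; the header appears only once
combineddataset <- rbind(dataset1, dataset2)
```

rbind() requires both data frames to have the same column names. plyr's rbind.fill relaxes this by filling columns missing from one input with NA.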
