R equivalent of SAS OBS= option

SAS has an OBS= option to limit the number of observations to read. When set as a system option, it applies to every dataset the program reads, which makes it useful for testing a program before running it on the large full dataset.
Is there a similar option or function in R, or do we have to specify the number of observations for each input data frame separately?

Expanding the comments into an answer, at the top of your script you can define
OBS = 100 # however many rows you want to start with
When reading in data with read.csv, read.table, etc.,
... = read.table(..., nrows = OBS)
As described in ?read.table, if you set nrows (hence OBS) to a negative number (such as the default, -1), it will be ignored.
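Putting that together, a minimal sketch (the file name mydata.csv is just a placeholder):
OBS <- 100  # number of rows for the test run; set to -1 to read the whole file
# nrows is honoured by read.csv/read.table, and a negative value is ignored,
# so the same script works unchanged for the test run and the full run
dat <- read.csv("mydata.csv", nrows = OBS)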

If the data is already loaded, head() is the safe way to take the first 100 rows; it works even when the data frame has fewer than 100 rows:
head(my_dataframe, 100)
Plain row indexing also works, but only if the data frame has at least 100 rows (otherwise the result is padded with NA rows):
my_dataframe[1:100,]

Be aware that in SAS, OBS= is effectively the last-observation number; the companion option is FIRSTOBS=.
E.g. to read rows 1--5 (FIRSTOBS=1 is the default):
set sashelp.class(obs=5);
E.g. to read rows 5--10:
set sashelp.class(firstobs=5,obs=10);
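An approximate R analogue of the FIRSTOBS=/OBS= pair combines skip and nrows (a sketch; note that SAS OBS= is the last observation number while nrows is a row count, and mydata.csv is a placeholder):
header <- names(read.csv("mydata.csv", nrows = 1))   # keep the original column names
# skip = 5 jumps over the header line plus data rows 1-4; nrows = 6 then reads rows 5-10
rows_5_to_10 <- read.csv("mydata.csv", skip = 5, nrows = 6,
                         header = FALSE, col.names = header)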

Related

How to load N lines of data and assign them to variables using a for loop in R

I have a huge dataset called df (approx. 16 GB of data).
I want to read it 100 rows at a time, and each time 100 rows are read, assign them to a variable.
This means that the first variable should hold rows 1-100, the second variable rows 101-200, and so on.
The code to load the first 100 rows of data into 10 variables should be this:
reportlen <- seq(10,100,10)
for (i in length(reportlen)){
file <- fread(paste0("C:/Users/Documents/data.csv,", "nrows =",reportlen[i]))
assign(paste0("f", i),file)
}
However, I ran into an error and it gave me a null value.
If you really wanted to go with your current approach, you would probably have to use the skip argument of fread to offset each read by however many rows you have already read.
But, given that you are planning to bring the entire file into memory anyway, I recommend just reading the entire file in one go:
df <- read.csv(file="C:/Users/Documents/data.csv")
parts <- split(df, (as.numeric(rownames(df)) - 1) %/% 100)
str(parts)  # inspect the resulting list
The variable parts should be a list, containing multiple data frames, each of which is 100 rows in length (except possibly for the last data frame, which might have some other count).
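For completeness, a sketch of the skip-based alternative mentioned above, assuming the file path and chunk size from the question (fread is from the data.table package):
library(data.table)
path <- "C:/Users/Documents/data.csv"
chunk_size <- 100
header <- names(fread(path, nrows = 0))   # nrows = 0 returns only the column names
for (i in 1:10) {
  chunk <- fread(path,
                 skip = (i - 1) * chunk_size + 1,   # + 1 skips the header line
                 nrows = chunk_size,
                 col.names = header)
  assign(paste0("f", i), chunk)   # f1 holds rows 1-100, f2 rows 101-200, ...
}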

Skipping last column in R with read.csv

I was on the post "read.csv and skip last column in R" but did not find my answer, and tried to ask directly in an answer ... but that's not the right way (thanks mjuarez for taking the time to get me back on track).
The original question was:
I have read several other posts about how to import csv files with read.csv but skipping specific columns. However, all the examples I have found had very few columns, and so it was easy to do something like:
columnHeaders <- c("column1", "column2", "column_to_skip")
columnClasses <- c("numeric", "numeric", "NULL")
data <- read.csv(fileCSV, header = FALSE, sep = ",",
                 col.names = columnHeaders, colClasses = columnClasses)
All the answers were good, but they do not work for what I intended to do. So I asked myself and others:
Could data <- read_csv(fileCSV)[,(ncol(data)-1)] work in one call?
I've tried, in one line of R, to read into data the first 5 of the 6 columns, i.e. everything but the last one. To do so, I would like to use "-" with the column number. Do you think that's possible? How can I do that?
Thanks!
In base R it has to be a two-step operation. Example:
> data <- read.csv("test12.csv")
> data
# 3 columns are returned
a b c
1 1/02/2015 1 3
2 2/03/2015 2 4
# last column is excluded
> data[,-ncol(data)]
a b
1 1/02/2015 1
2 2/03/2015 2
One cannot write data <- read.csv("test12.csv")[,-ncol(data)] in base R.
But if you know the number of columns in your csv (say 3 in my case), then one can write:
df <- read.csv("test12.csv")[,-3]
df
a b
1 1/02/2015 1
2 2/03/2015 2
The right-hand side of an assignment is processed first, so this line from the question:
data <- read.csv(fileCSV)[,(ncol(data)-1)]
is trying to use data before it is defined. Also note that the above takes only the second-to-last field. To get all but the last field:
data <- read.csv(fileCSV)
data <- data[-ncol(data)]
If you know the name of the last field, say lastField, then the following works. Unlike the code above, it does not read the whole file and then remove the last field; it reads in only the fields other than the last. It is also only one line of code.
read.csv(fileCSV, colClasses = c(lastField = "NULL"))
If you don't know the name of the last field but you do know how many fields there are, say n, then either of these would work:
read.csv(fileCSV)[-n]
read.csv(fileCSV, colClasses = replace(rep(NA, n), n, "NULL"))
Another way to do it without reading in the last field is to first read the header and first line to determine the number of fields (assuming all records have the same number), and then re-read the file using that:
n <- ncol(read.csv(fileCSV, nrows = 1))
followed by one of the two prior statements involving n.
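Putting the two steps together, a minimal sketch (fileCSV is assumed to hold the path to the file):
# count the fields from the header and first line, then re-read the file
# with the last field suppressed via colClasses = "NULL"
n <- ncol(read.csv(fileCSV, nrows = 1))
data <- read.csv(fileCSV, colClasses = replace(rep(NA, n), n, "NULL"))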
It's not possible in one line because the data variable is not yet initialized when you refer to it, so the call ncol(data) will trigger an error.
You need two lines of code: first load your data into the data variable, then remove the last column with either data[,-ncol(data)] or data[,1:(ncol(data)-1)].
Not a single function, but at least a single line, using dplyr (disclaimer: I never use dplyr or magrittr, so a more idiomatic solution probably exists in those libraries):
library(dplyr)
dat <- read.csv(fileCSV) %>% select(., which(names(.) != names(.)[ncol(.)]))
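For reference, recent dplyr/tidyselect versions also provide last_col(), which makes the same idea a bit more direct (a sketch, again assuming fileCSV holds the path):
library(dplyr)
dat <- read.csv(fileCSV) %>% select(-last_col())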

Is there a way to run a Wilcoxon test for variables with different lengths?

I am trying to run a wilcox.test() on two subsets of data from a data frame. They are not of equal length (48 vs. 260). I want to see if there is a difference between the dbh (diameter at breast height) of live oak trees and water oak trees.
Pine_stand <- read.csv("Pine_stand.csv")
live_oaks <- subset(Pine_stand,Species=="live oak",select=c("dbh"));live_oaks
water_oaks <- subset(Pine_stand,Species=="water oak",select=c("dbh"));water_oaks
wilcox.test(live_oaks~water_oaks,conf.int=T,correct=F)
Error in model.frame.default(formula = live_oaks ~ water_oaks) :
invalid type (list) for variable 'live_oaks'
That was my first attempt; then I tried this:
Pine_stand <- read.csv("Pine_stand.csv")
live_dbh <- subset(Pine_stand,Species=="live oak",select=c("dbh"));live_oaks
water_dbh <- subset(Pine_stand,Species=="water oak",select=c("dbh"));water_oaks
oaks<-c(live_dbh,water_dbh)
wilcox.test(dbh~Species,data=oaks)
Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, :
arguments imply differing number of rows: 48, 260
and received that error. I have tried vectorizing the two groups and appending and tapply ... I know there is a simple answer I am overlooking, I just can't get it to work. All of the examples I am reading are comparing two vectors with the same length. I know I can do the Wilcoxon test by hand when there are different numbers, so there should be a way. Any advice is welcome.
Yes, you can run a wilcox.test for variables of different lengths. As stated in http://www.r-tutor.com/elementary-statistics/non-parametric-methods/mann-whitney-wilcoxon-test
“Using the Mann-Whitney-Wilcoxon Test, we can decide whether the
population distributions are identical without assuming them to follow
the normal distribution.”
Therefore it is a non-parametric equivalent of the t-test that we can use when the assumptions of the t-test are not met (for example, the distribution is not normal, or the variances in the two samples are not equal).
The problem in your code is that with these two statements:
live_dbh <- subset(Pine_stand,Species=="live oak",select=c("dbh"))
water_dbh <- subset(Pine_stand,Species=="water oak",select=c("dbh"))
you are creating two vectors that contain only the dbh values, but you lose the information about the labels (Species). Therefore you should write:
live_dbh <- subset(Pine_stand,Species=="live oak",select=c("dbh","Species"))
water_dbh <- subset(Pine_stand,Species=="water oak",select=c("dbh","Species"))
Secondly, when you try to merge the two sets with this code:
oaks<-c(live_dbh,water_dbh)
instead of creating a data frame you create a list. Why does that happen? First, as we can read in the documentation for c(), its name stands for "Combine Values into a Vector or List". You have probably already used it to merge two vectors into one. However, the subset function actually returns a one-column data frame, not a vector. Therefore our live_dbh and water_dbh sets are data frames (and now, with the label, they even have two columns).
For one-column data frames you can still use the c() function with the recursive parameter set to TRUE to merge them:
total <- c(one_column_df1, one_column_df2, recursive=TRUE)
However, it is usually safer to use the rbind function (and it is also the only one that works when merging data frames with more than one column); rbind stands for "row bind".
oaks<-rbind(live_dbh,water_dbh)
Now you should be able to run a wilcox.test:
wilcox.test(dbh~Species,data=oaks)
How about
wilcox.test(dbh~Species, data=Pine_stand,
            subset=(Species %in% c("live oak", "water oak")))
? (If these are the only two species in your data set, you don't need the subset argument.)
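A minimal self-contained illustration of the formula interface with unequal group sizes (simulated data standing in for Pine_stand.csv, using the 48 vs. 260 counts from the question; the means and sds are made up):
set.seed(1)
oaks <- data.frame(
  Species = rep(c("live oak", "water oak"), times = c(48, 260)),
  dbh     = c(rnorm(48, mean = 40, sd = 8), rnorm(260, mean = 35, sd = 8))
)
wilcox.test(dbh ~ Species, data = oaks)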

Easiest way to apply series of calculations to similar data frames in R

The following is an example of how I want to treat my data sets. It might be a bit difficult to understand how my data frame is structured, but I hope it makes sense:
First, density must be calculated for columns A, B, and C using raw data from columns ADry, AEthanol, BDry, ... (Since these were earlier defined as vectors too, I used the vectors instead of the data frame columns as it was shorter - ADry_1_0 instead of Sample_1_0$ADry_1_0.)
Sample_1_0$ADensi_1_0=(ADry_1_0/(ADry_1_0-AEthanol_1_0))*(peth-pair)+pair
Sample_1_0$BDensi_1_0=(BDry_1_0/(BDry_1_0-BEthanol_1_0))*(peth-pair)+pair
Sample_1_0$CDensi_1_0=(CDry_1_0/(CDry_1_0-CEthanol_1_0))*(peth-pair)+pair
This yields 10 densities for each of A, B, and C. What's interesting is the mean density:
Mean_1_0=apply(Sample_1_0[7:9],2,mean)
Next, standard deviations are found. We are mainly interested in the standard deviations of our raw data columns (ADry and AEthanol), as error propagation calculations are afterwards carried out to find out how the deviations add up when calculating the densities:
StdAfv_1_0=apply(Sample_1_0,2,sd)
Error propagation (same for B and C)
ASd_1_0 = (sqrt((sd(Sample_1_0$ADry_1_0)/mean(Sample_1_0$ADry_1_0))^2 +
                (sqrt((sd(Sample_1_0$ADry_1_0)^2 + sd(Sample_1_0$AEthanol_1_0)^2))/
                 (mean(Sample_1_0$ADry_1_0) - mean(Sample_1_0$AEthanol_1_0)))^2)) *
          mean(Sample_1_0$ADensi_1_0)
In the end we semi-manually gathered the final information (mean density and its deviation) in a plottable data frame. Some of the code might be a tad long and maybe we could have achieved equal results with shorter code, but bear with us, we are rookies.
So now to the actual problem:
This was for A_1_0, B_1_0, and C_1_0. We would like to apply the same series of commands to 15 other data frames. The dimensions are the same, and they will be named A_1_1, A_1_2, A_2_0 and so on.
Is it possible to use some kind of loop function or make a loadable script containing x and y placeholders, where we can easily insert A_1_1, for instance?
Thanks in advance. I tried to keep the amount of confusion to a minimum, although it's tough!
If, instead of individual vectors, you combine the raw data into data frames (or even better, data.tables) and then store all the data frames for all runs in a list as @Gregor suggested, you can use the function below together with lapply.
my_func <- function(dataset, peth, pair){
  require(data.table)
  names <- names(dataset)
  # densities for A, B and C, computed from the dry and ethanol columns
  # looked up by position via get(names[i])
  setDT(dataset)[, `:=` (ADens = (get(names[1])/(get(names[1])-get(names[4])))*(peth-pair)+pair,
                         BDens = (get(names[2])/(get(names[2])-get(names[5])))*(peth-pair)+pair,
                         CDens = (get(names[3])/(get(names[3])-get(names[6])))*(peth-pair)+pair)
  # per-dataset summary: mean, sd and propagated error for each density
  ][, .(ADens_mean = mean(ADens),
        ADens_sd = sd(ADens),
        AErr = sqrt((sd(get(names[1]))/mean(get(names[1])))^2 +
                    (sqrt(sd(get(names[1]))^2 + sd(get(names[4]))^2)/
                     (mean(get(names[1])) - mean(get(names[4]))))^2) * mean(ADens),
        BDens_mean = mean(BDens),
        BDens_sd = sd(BDens),
        BErr = sqrt((sd(get(names[2]))/mean(get(names[2])))^2 +
                    (sqrt(sd(get(names[2]))^2 + sd(get(names[5]))^2)/
                     (mean(get(names[2])) - mean(get(names[5]))))^2) * mean(BDens),
        CDens_mean = mean(CDens),
        CDens_sd = sd(CDens),
        CErr = sqrt((sd(get(names[3]))/mean(get(names[3])))^2 +
                    (sqrt(sd(get(names[3]))^2 + sd(get(names[6]))^2)/
                     (mean(get(names[3])) - mean(get(names[6]))))^2) * mean(CDens))
  ]
}
rbindlist(lapply(list_datasets, my_func, peth = 2, pair = 1))
Now, this assumes that you put your raw vectors into data frames with the columns in the order in which they appeared in your example (and that they are the only columns in the data set). If this is not the case, you may just have to edit the indices in the names[x] calls. If you wanted a little more flexibility, you could also define a list of lists with the column names for each of your individual raw data sets, add that as an argument to my_func, and then replace the instances of names[x] with look-ups into that list.
This function should output a data.table with the results for each of your data sets (1-16) in individual rows, with nine columns (ADens_mean, ADens_sd, AErr, ..., CErr).
NOTE: since there was no actual data to work with, I can't say for sure that this function does exactly what you want, but I think it will be close. This will also require you to install the data.table package.
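For completeness, a sketch of how list_datasets might be assembled, assuming the per-run data frames follow the Sample_1_0 naming from the question (mget is base R):
run_names <- c("Sample_1_0", "Sample_1_1", "Sample_1_2")   # extend to all 16 runs
list_datasets <- mget(run_names, envir = .GlobalEnv)
results <- rbindlist(lapply(list_datasets, my_func, peth = 2, pair = 1))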

How to aggregate on IQR in SPSS?

I have to aggregate (of course with a categorical break variable) a quite big data table containing some continuous variables, producing the mean, median, standard deviation and interquartile range (IQR) of the required variables.
The first three are easy with the SPSS AGGREGATE command, but I have no idea how to compute the IQR when aggregating the data table.
I know I could compute the IQR by using Descriptives (by quartiles), but as I need the calculations inside an aggregation, this is not an option. Unfortunately using R also fails, thanks to some odd circumstances (I was not able to load the huge comma-separated file in R with base read.table, nor with the sqldf, bigmemory or ff packages).
Any idea is welcomed! And of course: thank you in advance.
P.S.: I thought about estimating the IQR by multiplying the standard deviation by 1.5, but that method would not work as the distributions are skewed, so assuming normality does not stand.
P.S.: do you think using R within SPSS would avoid the memory problems encountered when opening the dataset in pure R?
This syntax should do the trick. There is no need to migrate back and forth between SPSS and R solely for this task.
*making fake data, 4 million records and 150 variables.
input program.
loop i = 1 to 4000000.
end case.
end loop.
end file.
end input program.
dataset name Temp.
execute.
vector X(150).
do repeat X = X1 to X150.
compute X = RV.NORMAL(0,1).
end repeat.
*This is the command you are interested in, puts the stats table into a new dataset.
Dataset declare IQR.
OMS
/SELECT TABLES
/IF SUBTYPES = 'Statistics'
/DESTINATION FORMAT = SAV outfile = 'IQR' VIEWER=NO.
freq var = X1
/format = notable
/ntiles = 4.
OMSEND.
This still takes a long time with such a large dataset, but that's to be expected. Just search the SPSS help files for "OMS" to find example syntax showing how OMS works.
Given the further constraint that you want to calculate the IQR for many groups, there are a few different ways to proceed. One would be to just use the SPLIT FILE command and run the above frequencies command again.
split file by group.
freq var = X1 X2
/format = notable
/ntiles = 4.
split file end.
You could also get specific percentiles within CTABLES (and can do whatever grouping/nesting you want for that). Potentially a more useful solution at this point, though, is to make a program that saves separate files (or reduces the full dataset to the specific group while still loaded), does the calculation on each separate file, and dumps the results into a dataset. Working with the dataset that has the 4 million records is a pain, and it does not appear to be necessary if you are just splitting the file up anyway. This could be accomplished via macro commands.
OMS can capture any pivot table as a dataset, so any statistical results displayed that way can be used as a dataset. Another approach, however, in this case would be to use the RANK command. RANK allows for grouping variables, so you could get rank within group, and it can compute the quartiles and percentiles within group. For example,
RANK VARIABLES=salary (A) BY jobcat minority
  /RANK /NTILES(4) /PERCENT.
Then aggregating with FIRST and the group variables as breaks would give you a dataset of the quartiles by group from which to compute the IQR.
Many ways to skin a cat.
-Jon Peck
