How to subset a set of variables and use Aggregate function in R - r

I'm a beginner in R and I'm working on an automation task. I have a list of variables in a separate file, and the values in the master dataset need to be aggregated based on that list. The master data structure is attached (Master Dataset),
and the referral dataset contains the variables to aggregate by (Referral dataset).
Of the 6 variables, I need to sum C grouped by the variables D, E, and F (as listed in the referral dataset).
The code below does what I need manually:
X<-aggregate(C,by=list(D,E,F),FUN=sum)
But I need code that does the same thing automatically. I tried writing loops, but the problem I face is that the two datasets don't have the same data.frame size. Can someone help me with this?

So, it seems like you want to do a few things:
1) Read in the master and referent datasets
2) Subset the master according to the values in the referent
3) Compute column sums on the master?
Also, is there a specific reason you want to use aggregate()? There are probably lots of ways to do this. In any case, here is what I would do:
# assuming master is a dataframe or matrix, referent is a vector
# just simulating them here because not clear how you are reading them in
master = matrix(rnorm(36),6)
colnames(master) = c('A','B','C','D','E','F')
referent = c('D','E','F')
colSums(master[,referent])
So is that doing what you want to do? I like colSums because it's a handy built-in. I am not an R superstar, though, so it is possible that other ways are better for some reason.
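If the goal really is the grouped sum from the question (sum of C for each combination of D, E, and F) rather than column totals, one possible sketch is to pass the grouping columns to aggregate() by name. The data below are made up, and `referent` stands in for whatever is read from the separate file; both are assumptions, since the real datasets are not shown.

```r
# Simulated stand-ins for the master data and the referral list (assumptions).
master <- data.frame(
  A = 1:6,
  B = 6:1,
  C = c(1, 2, 3, 4, 5, 6),
  D = c("x", "x", "y", "y", "x", "y"),
  E = c("p", "p", "q", "q", "p", "q"),
  F = c(1, 1, 2, 2, 1, 2)
)
referent <- c("D", "E", "F")  # would normally be read from the separate file

# master[referent] is a data frame, i.e. a named list of grouping columns,
# which is the form the `by` argument of aggregate() expects.
X <- aggregate(master["C"], by = master[referent], FUN = sum)
print(X)
```

Because the grouping columns are selected by a character vector, the two datasets do not need to have the same size; only the names in `referent` have to exist in `master`.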

Related

function to remove all observations that contain a "prohibited" value - R

I have a large dataset looking like:
There are overall 43 different values for PID. I have identified PIDs that need to be removed and summarized them in a vector:
I want to remove all observations (rows) from my data set that contain one of the PIDs from the vector NullNK. I have tried writing a function for it, but I get an error (I have never written functions before):
for (i in length(NullNK)){
SR_DynUeber_einfam <- SR_DynUeber_einfam [-which(SR_DynUeber_einfam$PID == NullNK(i)),]
}
How can I efficiently remove the observations that contain PIDs from the NullNK vector from my original data set?
What is wrong with my function?
Thanks!
For basic operations like this, for loops are often not needed. This does what you are looking for:
SR_DynUeber_einfam[!SR_DynUeber_einfam$PID %in% NullNK,]
One mistake in your function is NullNK(i). You should subset from a vector with NullNK[i] in R.
Hope this helps!
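For comparison, here is a small sketch of both approaches on made-up data (the real data set is not shown, so the values below are assumptions). Besides `NullNK(i)`, the original loop also iterates over `length(NullNK)`, which is a single number, so it would run only once; `seq_along(NullNK)` visits every element.

```r
# Hypothetical stand-in data; only the column name PID comes from the question.
SR_DynUeber_einfam <- data.frame(
  PID   = c(1, 2, 3, 2, 4),
  value = c(10, 20, 30, 40, 50)
)
NullNK <- c(2, 4)

# Vectorised filter: keep the rows whose PID is not in NullNK.
kept <- SR_DynUeber_einfam[!SR_DynUeber_einfam$PID %in% NullNK, ]

# Loop version with the indexing and the loop range corrected.
looped <- SR_DynUeber_einfam
for (i in seq_along(NullNK)) {
  looped <- looped[looped$PID != NullNK[i], ]
}

print(kept)
```

The loop here uses a logical condition instead of `-which(...)`, because negative indexing with an empty `which()` result would drop every row when a PID has no match.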

Subset variables by name in R

I know that there are many threads called this but either the advice within hasn't worked or I haven't understood it.
I have read what was an SPSS file into R.
I cleaned some variables and added new ones.
By this point the file size is 1,000 MB.
I wanted to write it into a CSV to look at it more easily but it just stops responding - file too big I guess.
So instead I want to create a subset of only the variables I need. I tried a couple of things
(besb <- bes[, c(1, 7, 8)])
data1 <- bes[,1:8]
I also tried referring to variables by name:
nf <- c(bes$approveGov, bes$politmoney)
All these attempts return errors with number of dimensions.
Therefore could somebody please explain to me how to create a reduced subset of variables preferably using variable names?
An easy way to subset variables from a data.frame is with the dplyr package. You can select variables with their bare names. For example:
library(dplyr)
nf <- select(bes, approveGov, politmoney)
It's fast for large data frames too.
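A base R sketch of the same selection, without extra packages, might look like the following. The `bes` object here is a made-up stand-in. The "number of dimensions" error could mean that `bes` is a plain list rather than a data frame (for example, foreign::read.spss returns a list unless to.data.frame = TRUE is set), but that is a guess, since the import call is not shown.

```r
# Hypothetical stand-in for `bes`; the real data come from an SPSS file.
bes <- data.frame(
  id         = 1:3,
  approveGov = c(2, 4, 1),
  politmoney = c(3, 3, 5),
  region     = c("N", "S", "E")
)

# Select columns by name; drop = FALSE keeps the result a data frame.
keep <- c("approveGov", "politmoney")
nf <- bes[, keep, drop = FALSE]

# If the object is a plain list, convert it to a data frame first.
bes_list <- as.list(bes)
nf_from_list <- as.data.frame(bes_list)[, keep, drop = FALSE]

print(nf)
```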

copying data from one data frame to other using variable in R

I am trying to transfer data from one data frame to other. I want to copy all 8 columns from a huge data frame to a smaller one and name the columns n1, n2, etc..
First, I am trying to find the number of the column from which I need to start copying, using this:
x=as.numeric(which(colnames(old_df)=='N1_data'))
Then I am pasting it in new data frame this way
new_df[paste('N',1:8,'new',sep='')]=old_df[x:x+7]
However, when I run this, all the new 8 columns have exactly same data. However, instead if I directly use the value of x, then I get what I want like
new_df[paste('N',1:8,'new',sep='')]=old_df[10:17]
So my questions are
Why am I not able to use the variable x? I added as.numeric just to make sure it is a number, not a list. However, that does not seem to help.
Is there any better or more efficient way to achieve this?
If I'm understanding your question correctly, you may be overthinking the problem.
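library(dplyr)
# Keep only the eight source columns
new_df <- select(old_df, N1_data, N2_data, N3_data, N4_data,
                 N5_data, N6_data, N7_data, N8_data)
# Rename N1_data ... N8_data to n1 ... n8; "\\1" refers to the captured digit
colnames(new_df) <- sub("N(\\d)_data", "n\\1", colnames(new_df))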
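On the first question, the likely cause is operator precedence: in R, `:` binds more tightly than `+`, so `x:x+7` is evaluated as `(x:x)+7`, which is a single column index. That one column is then recycled across all eight new columns. A small base R sketch with a made-up `old_df` (the column layout is an assumption):

```r
# Hypothetical old_df with 20 columns; columns 10 to 17 hold N1_data ... N8_data.
old_df <- as.data.frame(matrix(1:40, nrow = 2))
colnames(old_df)[10:17] <- paste0("N", 1:8, "_data")

x <- which(colnames(old_df) == "N1_data")

wrong <- x:x + 7    # parsed as (x:x) + 7: one index, not a range
right <- x:(x + 7)  # the intended range of eight consecutive columns

# Copy the eight columns and rename them n1 ... n8.
new_df <- old_df[right]
names(new_df) <- paste0("n", 1:8)

print(wrong)
print(right)
```

Wrapping the upper bound in parentheses is the only change needed to the original approach; `which()` already returns a number, so `as.numeric()` is not required.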

Cluster PAM in R - How to ignore a Column/variable but still keep it

I would like to use the Cluster PAM algorithm in R to cluster a dataset of around 6000 rows.
I want the PAM algorithm to ignore a column called "ID" (not use it in the clustering), but I do not want to delete that column. I want to use that column later on to combine my clustered data with the original dataset.
Basically, what I want is to add a cluster column to the original dataset.
I want to use PAM as a data compression/variable reduction method. I have 220 variables, and I would like to cluster some of the variables and reduce the dimensionality of my dataset so I can apply a classification algorithm (most likely a tree) to the problem that I am trying to solve.
If anyone knows a way around this or a better approach, please let me know.
Thank you
Import the data:
library(cluster)
data <- read.table("sampleiris.txt")
Run the clustering:
result <- pam(data[2:4], 3, FALSE, "euclidean")
The subset [2:4] assumes that ID is the first column. The code below should fetch the cluster values from PAM; you can then add them as a column to your data:
result$silinfo[[1]][1:nrow(result$silinfo[[1]])]
There is a small problem in the above code.
You should not use the silhouette information, because it reorders the rows in preparation for the plot.
If you want to extract the cluster assignment while preserving the original dataset order and adding just a column of cluster assignments, you should use $cluster. I tried it and it works like a charm.
This is the code:
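library(cluster)
# Cluster on three columns of the built-in swiss data
data <- swiss[4:6]
result <- pam(data, 3)
summary(result)
# Cluster assignments, in the original row order
export <- result$cluster
swiss[, "Clus"] <- export
View(export)
View(swiss)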
Cheers
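To tie this back to the ID column in the question, here is a minimal sketch that leaves ID in the data frame and only excludes it from the columns passed to pam(). The ID values are invented for illustration, and the built-in swiss data stand in for the real dataset. The full name of the component is `clustering`; `$cluster` works because `$` partially matches list names.

```r
library(cluster)

# Hypothetical data with an ID column that must not influence the clustering.
df <- data.frame(ID = seq_len(nrow(swiss)), swiss[4:6])

# Cluster on every column except ID; the row order is unchanged.
features <- setdiff(names(df), "ID")
fit <- pam(df[, features], k = 3)

# Attach the assignments to the original data frame, next to ID.
df$cluster <- fit$clustering

head(df)
```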

Select Rows and Columns At the Same Time in SPSS

I have a dataset in SPSS that has 100K+ rows and over 100 columns. I want to filter both the rows and columns at the same time into a new SPSS dataset.
I can accomplish this very easily using the subset command in R. For example:
new_data = subset(old_data, subset = ColumnA > 10, select = c(ColumnA, ColumnC, ColumnZZ))
Even easier would be:
new_data = old_data[old_data$ColumnA > 10, c(1, 4, 89)]
where I am passing the column indices instead.
What is the equivalent in SPSS?
I love R, but the read/write and data management speed of SPSS is significantly better.
I am not sure what exactly you are referring to when you write that "the read/write and data management speed of SPSS is significantly better" than that of R. Your question itself demonstrates how flexible R is at data management! And a dataset of 100k rows and 100 columns is by no means a large one.
But, to answer your question, perhaps you are looking for something like this. I'm providing a "programmatic" solution, rather than the GUI one, because you're asking the question on Stack Overflow, where the focus is more on the programming side of things. I'm using a sample data file that can be found here: http://www.ats.ucla.edu/stat/spss/examples/chp/p004.sav
Save that file to your SPSS working directory, open up your SPSS syntax editor, and type the following:
GET FILE='p004.sav'.
SELECT IF (lactatio <= 3).
SAVE OUTFILE= 'mynewdatafile.sav'
/KEEP currentm previous lactatio.
GET FILE='mynewdatafile.sav'.
More likely, though, you'll have to go through something like this:
FILE HANDLE directoryPath /NAME='C:\path\to\working\directory\' .
FILE HANDLE myFile /NAME='directoryPath/p004.sav' .
GET FILE='myFile'.
SELECT IF (lactatio <= 3).
SAVE OUTFILE= 'directoryPath/mynewdatafile.sav'
/KEEP currentm previous lactatio.
FILE HANDLE myFile /NAME='directoryPath/mynewdatafile.sav'.
GET FILE='myFile'.
You should now have a new file created that has just three columns, and where no value in the "lactatio" column is greater than 3.
So, the basic steps are:
Load the data you want to work with.
Subset for all columns from all the cases you're interested in.
Save a new file with only the variables you're interested in.
Load that new file before you proceed.
With R, the basic steps are:
Load the data you want to work with.
Create an object with your subset of rows and columns (which you know how to do).
Hmm.... I don't know about you, but I know which method I prefer ;)
If you're using the right tools with R, you can also directly read in the specific subset you are interested in without first loading the whole dataset if speed really is an issue.
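As one illustration of reading only part of a file, base R's read.csv() can skip columns at read time by setting their class to "NULL". This sketch assumes the data are available as CSV rather than as an SPSS file, and the file contents and column names are made up; the row filter is still applied after reading.

```r
# Write a small made-up CSV to a temporary file (stand-in for the real export).
tmp <- tempfile(fileext = ".csv")
write.csv(
  data.frame(
    ColumnA = c(5, 12, 20),
    ColumnB = c("x", "y", "z"),
    ColumnC = c(1.5, 2.5, 3.5)
  ),
  tmp,
  row.names = FALSE
)

# A class of "NULL" drops that column while reading; columns not named here
# are read with their type guessed as usual.
new_data <- read.csv(tmp, colClasses = c(ColumnB = "NULL"))

# Row filter applied to the already reduced data frame.
new_data <- new_data[new_data$ColumnA > 10, ]
print(new_data)
```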
In SPSS you can't combine the two actions in one command, but it's easy enough to do it in two:
dataset copy old_data. /* delete this if you don't need to keep both old and new data.
select if ColumnA>10.
add files /file=* /keep=ColumnA ColumnC ColumnZZ.
