Compute with the values in the table() function - R

I am new to R and stuck on computing the proportions of two values.
I got to this point using the table() function:
table(data$subscriptions, data$pickup)
The subscriptions data is divided into casual and registered users per station. Basically, I want to compute the proportion of casual users per station.
Should I be using tapply() to solve this?
Thankful for any help!

There is a function prop.table() that is called on the table to turn counts into proportions. So in your case try something like this:
tab <- table(data$subscriptions, data$pickup)
prop.table(tab, 2)
where 2 is the margin over which the proportions are calculated; 2 means columns, i.e. one set of proportions per station in your case.
Also see help(prop.table).
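For illustration, here is a small made-up example (the column names and values are assumed from your description, not your real data):
data <- data.frame(
  subscriptions = c("casual", "registered", "casual", "casual", "registered"),
  pickup        = c("A", "A", "A", "B", "B")
)
tab <- table(data$subscriptions, data$pickup)
prop.table(tab, 2)               # proportions within each station; each column sums to 1
prop.table(tab, 2)["casual", ]   # the share of casual users per station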

Related

Incrementing variables from R queries

Hi, I am new to R and not quite sure what I'm looking for. I want to measure the probability of each frequency of a dust concentration, so I need to divide each frequency by the total of all dust concentration frequencies. From there I can go on to find the CDF and PMF of the dust concentration.
I have dust probability data with two columns (Dust Concentration and its Frequencies).
My first thought was that I have to increment i in this line of R code:
dustProb[i, "Frekuensi"]
which should take the frequency in row i, so that I can sum all the frequencies retrieved this way with a for loop like this:
# the dataset is called dustData here
# dustFrequencies = dustData[i, "Frekuensi"]
for (i in dustFrequencies) {
  print(dustFrequencies)
}
The print() part is supposed to be where I sum all the values obtained through those incremented queries.
My questions are:
Can I increment the i inside that R code?
Is my way too complicated, or is there another way to measure probability in R?
Sorry for the confusion, inefficiency, and holes; I hope I was clear enough here.
Using loops in R isn't very tidy-friendly. You can do:
library(dplyr)
dustData <- dustData %>%
  mutate(probabilities = Frekuensi / sum(Frekuensi))
The new column is the frequency divided by the sum of all frequencies, for each dust concentration.
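If you prefer to stay in base R, the same calculation is a one-liner (assuming the column is named Frekuensi as in your question):
dustData$probabilities <- dustData$Frekuensi / sum(dustData$Frekuensi)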

Clustering in R

I used hclust to cluster my data and cutree to specify the number of clusters to be 3. Is there any way that I can examine each of the clusters? By examine I mean list out the cases/observations that are in, e.g., the first cluster. I tried all the basic functions that I know, such as summary() and list(), but none of them seem relevant. Is there a function that can do this?
If not, the cutree function returns a vector of the group/cluster that each of my observations belongs to, something like this:
1,3,1,2,3,3,1
which indicates my first observation belongs to group 1, my second to group 3, and so on.
I am thinking about how to extract the positions in that vector where, e.g., group = 1, so it would return 1, 3 and 7, since observations 1, 3 and 7 belong to group 1.
Or do I need to use a loop to collect all the observations that belong to, e.g., group 1 from that vector?
Is my question clear?
Does this help you get started?
nclust <- 10
cutreeout <- cutree(hclustOutput, nclust)
Add them as a new column to your data frame:
mydata$cluster <- cutreeout
How many observations are in each cluster?
table(mydata$cluster)
Then you can do more stuff to interpret your clusters, and/or study subsets of your data.
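For example, to pull out the observations that fall into the first cluster (a small sketch building on the mydata$cluster column created above):
which(mydata$cluster == 1)      # row numbers of the observations in cluster 1
mydata[mydata$cluster == 1, ]   # the observations themselves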
This is a hint, not the answer. Here is an example of hierarchical clustering in R. You can try using the functions table() and ggplot() to see the observations per cluster.

Method to compare previous day to current day values

I am looking for a better way to compare a value from a day (day X) to the previous day (day X-1). Here I am using the airquality dataset. Suppose I am interested in comparing the wind from one day to the wind from the previous day. Right now I am using merge() to bring together two dataframes - one current day dataframe and one from the previous day. I am also just subtracting 1 from the Day column to get the PrevDay column:
airquality$PrevDay <- airquality$Day - 1
airquality.comp <- merge(
  airquality[, c("Wind", "Day")],
  airquality[, c("Temp", "PrevDay")],
  by.x = c("Day"), by.y = c("PrevDay"))
My issue here is that I'd need to create another dataframe if I wanted to look back 2 days or if I wanted to switch Wind and Temp and look at them the other way. This just seems clunky. Can anyone recommend a better way of doing this?
IMO data.table may be harder to get used to compared to dplyr, but it will save your tail later when you need robust analysis:
setDT(airquality)[, shift(Wind, n=2L, type="lag") < Wind]
In base R, you can prepend an NA value and drop the last value for the comparison:
with(airquality, c(NA, head(Wind, -1)) < Wind)
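As a usage sketch (my addition, not part of the answer above), you can store the result as a new column so it lines up with each row; the column name here is just a placeholder:
airquality$WindUpFromPrevDay <- c(NA, head(airquality$Wind, -1)) < airquality$Wind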
What kind of comparison do you need?
For example, to check whether each value is greater than the previous one, you could use:
library(dplyr)
with(airquality, lag(Wind) < Wind)
Or with two lags:
with(airquality, lag(Wind, 2) < Wind)
It depends on what questions you are trying to answer, but I would look into Autocorrelation (the correlation of a time series with its own lagged values). You may want to look into the acf() function to compare the time series to itself since this will help you highlight which lags are significantly correlated.
Or if you want to compare 2 different metrics (such as Wind and Temp), then you may want to try the ccf() function since it allows you to input 2 different vectors and it will compute the cross correlation with lags. For example:
ccf(airquality$Wind,airquality$Temp)
If you are interested in autocorrelation or cross-correlation in particular, then you might also consider something like mutual information, which will work for non-Gaussian data as well. Both the infotheo and entropy packages for R have built-in functions to do so.
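For example, a minimal mutual-information sketch with the infotheo package (my illustration, not code from the answer; infotheo expects discretized values):
library(infotheo)
aq <- na.omit(airquality[, c("Wind", "Temp")])
# discretize the continuous series, then estimate their mutual information
mutinformation(discretize(aq$Wind), discretize(aq$Temp))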

Using the acf function in R for time series data

I am new to time-series analysis and have a data set with a daily time step at 5 factor levels. My goal is to use the acf function in R to determine whether there is significant autocorrelation across the response variable of interest so that I can justify whether or not a time-series model is necessary.
I have sorted the dataset by Day, and am using the following code:
acf(DE_vec, lag.max=7)
The dataset has not been converted to a time-series object…it is a vector sorted by Day.
My first question is whether the dataframe should be converted to a time-series object, or if it is also correct to sort the vector by Day?
Second, if I have a variable repeated over the 5 levels for each Day, then should I construct 5 different acf plots for each level, or would it be ok to pool over stations as was done with the code above?
Thanks in advance,
Yes, acf() will work on a data.frame class, and yes, you should compute the ACF for each of the 5 levels separately. If you pass the entire df to acf(), it will return the ACF for each of the levels.
If you are curious about the relationship across levels, then you need to use ccf() or some mutual information metric like those in the entropy or infotheo packages.
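As a sketch of the per-level approach (the data frame and column names df, Day, Level and DE are placeholders, not from the question):
df <- df[order(df$Day), ]                              # keep each series in day order
by(df$DE, df$Level, function(x) acf(x, lag.max = 7))   # one ACF plot per level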

How to group data to minimize the variance while preserving the order of the data in R

I have a data frame (760 rows) with two columns, named Price and Size. I would like to put the data into 4 or 5 groups based on Price in a way that minimizes the variance within each group while preserving the order of Size (which is in ascending order). The Jenks natural breaks optimization would be an ideal function; however, it does not take the order of Size into consideration.
Basically, I have data similar to the following (with more data):
Price <- c(90, 100, 125, 100, 130, 182, 125, 250, 300, 95)
Size <- c(10, 10, 10.5, 11, 11, 11, 12, 12, 12, 12.5)
mydata <- data.frame(Size, Price)
I would like to group the data to minimize the variance of Price in each group while respecting 1) the Size value: for example, the first two prices, 90 and 100, cannot be in different groups since they have the same Size, and 2) the order of Size: for example, if Group One includes observations 1-2 and Group Two includes observations 3-9, observation 10 can only go into Group Two or Group Three.
Can someone please give me some advice? Maybe there is already some such function that I can’t find?
Is this what you are looking for? With the dplyr package, grouping is quite easy. The %>% can be read as "then do", so you can combine multiple actions if you like.
See http://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html for further information.
library("dplyr")
Price <– c(90,100,125,100,130,182,125,250,300,95)
Size <- c(10,10,10.5,11,11,11,12,12,12,12.5)
mydata <- data.frame(Size,Price) %>% # "then"
group_by(Size) # group data by Size column
mydata_mean_sd <- mydata %>% # "then"
summarise(mean = mean(Price), sd = sd(Price)) # calculate grouped
#mean and sd for illustration
I had a similar problem with optimally splitting a day into 4 "load blocks". Adjacent time periods must stick together, of course.
Not an elegant solution, but I wrote my own function that first splits up a sorted series at specified break points, then calculates the sum(SDCM) using those break points (using the algorithm underlying the Jenks approach described on Wikipedia).
Then I just iterated through all valid combinations of break points and selected the set of points that produced the minimum sum(SDCM).
This would quickly become unmanageable as the number of possible break-point combinations increases, but it worked for my data set.
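As a rough sketch of that brute-force idea (my own illustration under the constraints from the question, not the answerer's actual function): break points are only allowed where Size changes, every combination of break points is scored by the sum of squared deviations from the group means (SDCM), and the combination with the lowest score wins.
Price <- c(90, 100, 125, 100, 130, 182, 125, 250, 300, 95)
Size <- c(10, 10, 10.5, 11, 11, 11, 12, 12, 12, 12.5)
mydata <- data.frame(Size, Price)

# candidate break positions: only between rows where Size changes,
# so observations with the same Size always stay in the same group
candidates <- which(diff(mydata$Size) != 0)

# sum of squared deviations from the group means for a given grouping
sdcm <- function(groups) {
  sum(tapply(mydata$Price, groups, function(p) sum((p - mean(p))^2)))
}

n_groups <- 4
best <- NULL
for (combo in combn(candidates, n_groups - 1, simplify = FALSE)) {
  groups <- findInterval(seq_len(nrow(mydata)), c(1, combo + 1))  # group id per row
  score <- sdcm(groups)
  if (is.null(best) || score < best$score) best <- list(breaks = combo, score = score, groups = groups)
}
best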
