no rows to aggregate in R - r

I have a column in my data that is split into the categories Poor, Average and Good. I'm looking to find the max and the min of a different column for each of those categories.
This is what I have for the code:
aggregate(as.numeric(data_gt$Category), list(exptCount=data_gt$Employment_Rate),max)
Is there an easier way to do this, even if it takes more commands to individually get the max min for each category?
Thanks for your help, I'm new to R so still learning the basics.

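You can get this error message if you specify an incorrect variable name (for example, data_gt$Category instead of data_gt$category). Selecting a column that does not exist with $ returns NULL, so aggregate() is left with no rows to work on.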
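As a rough sketch of the per-category max and min the question asks for: the data below is made up, and it assumes Category holds the Poor/Average/Good labels while Employment_Rate is the numeric column (the question's call appears to have them the other way round).

```r
# Hypothetical data; the column names mirror the question, the values are invented.
data_gt <- data.frame(
  Category = c("Poor", "Average", "Good", "Poor", "Good", "Average"),
  Employment_Rate = c(41.2, 55.0, 71.3, 38.9, 68.4, 59.7)
)

# A misspelled column name returns NULL, which is what leaves aggregate() with no rows.
missing_col <- is.null(data_gt$category)

# Min and max of Employment_Rate within each Category, in one call.
rate_range <- aggregate(Employment_Rate ~ Category, data = data_gt,
                        FUN = function(x) c(min = min(x), max = max(x)))
print(rate_range)

# The same statistics one at a time with tapply().
max_by_cat <- tapply(data_gt$Employment_Rate, data_gt$Category, max)
min_by_cat <- tapply(data_gt$Employment_Rate, data_gt$Category, min)
```

The formula interface keeps the numeric column on the left and the grouping column on the right, which makes it harder to swap the two by accident.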

Related

Edit maximum number of characters in R dfSummary (R)

I am wondering if it is possible to edit the "Stats/Values" column in the dfSummary command of the R package "summarytools". I need to adjust the number of characters displayed in the values (I do not mean the number of factor levels but literally the number of characters), as the default cut-off point does not suit my survey data. I have posted a screenshot as an example.
Thanks a lot for your help!
dfSummary_screenshot_example
There is a parameter for exactly this: max.string.width. By default, its value is 25, but you can make it as large as you want :)
dfSummary(dat, max.string.width = 500)

How to print the number of data entries inside a variable in R? [duplicate]

I know this is probably a very simple question but I can't seem to find the answer anywhere online. I am trying to print just the number of data points inside of a variable that I created but I can't figure out how.
I tried using summary() or num() or n() but I am really just making stuff up here and cannot seem to figure it out at all.
For my specific example I have a data set on people's heights, age, weight, gender, stuff like that. I used
one_sd_weight <- cdc$weight[abs(cdc$weight - mean(cdc$weight)) <= sd(cdc$weight)]
to determine how many of the weights fall within one standard deviation of the mean. After I do this, I can see that on the right side it created a new variable called one_sd_weight that contains 14152 out of the original 20000 entries. How do I print the number 14152 as a variable? For the work I am doing I need to create a new variable that just contains one number, 14152 or whatever number is produced when I run the code above. For example, I need to create
n_one_sd <- 14152
without typing in 14152, instead typing some function that grabs the number of entries in one_sd_weight.
I have tried things like summary() and n() but only receive error messages in return. Any help is greatly appreciated!!
n_one_sd <- length(one_sd_weight)
You're looking for length (in the case of a vector) or nrow (in the case of a matrix or data.frame).
Or you can use NROW(), which works for both.
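A small sketch of the three functions side by side. The cdc data set is not reproduced here, so the weight vector is simulated and the resulting count will differ from the 14152 in the question.

```r
# Simulated stand-in for cdc$weight (an assumption; not the real data).
set.seed(1)
weight <- rnorm(20000, mean = 170, sd = 40)

one_sd_weight <- weight[abs(weight - mean(weight)) <= sd(weight)]

n_one_sd <- length(one_sd_weight)                # number of elements in a vector
n_rows   <- nrow(data.frame(w = one_sd_weight))  # number of rows in a data frame
n_either <- NROW(one_sd_weight)                  # accepts vectors and data frames alike

# The count can also be taken directly from the logical condition,
# without building the subset first.
n_direct <- sum(abs(weight - mean(weight)) <= sd(weight))
```

All four variables hold the same single number, so any of them can be stored and reused later.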

Find max count for a combination of several fields

Say I have a table with three fields message, environment and function.
I want to count up the records by message, environment and function, and then select the highest scoring row for any combination.
Getting the counts is easy
Table
| summarize count() by message, environment, function
...but how do I get just one row with the top count? My solution so far is to create a new table that tallies the counts, then tally max() by environment, function and then do a join, but this seems like an expensive and complicated workaround.
If I understand your original question correctly, you may want to look into summarize arg_max() as well: https://learn.microsoft.com/en-us/azure/kusto/query/arg-max-aggfunction
Alternatively, modify the solution in "Add column of totals pr. field value" to use max instead of sum.

How to use apply with a function that required 2 parameters

I looked at the existing posts but could not get a clear answer... I have a data frame and I would like to modify each value with a calculation that takes into account the min and max of its row.
I would like to use apply associated to a function:
sc=function(x,seg) {(x-seg[2])*100/(seg[1]-seg[2])}
or
sc=function(x,a,b) {(x-b)*100/(a-b)}
where x is a row of the data frame and seg=c(a,b) is calculated as follows
d=dim(data) ## data is my dataframe
for (i in (1:d[1])) ## the calculation has to be done for each line, according
## the min and max of the specific line
{
seg=c(max(data[i,]),min(data[i,]))
data[i,]=apply(data[i,],1,sc)
return(data)
}
This does not work, obviously, because I do not know how to tell apply that it needs to take into account more than one parameter...
There is probably a R function that does this specific calculation, but since I am a R beginner, I would really appreciate to understand how to create such coding.
Thanks for the help!
Stéphane
Update:
Here is what I found for a solution, but it does not sound completely logical to me...
for (i in (1:d[1])) {
t=apply(data,2,sc,seg=range(data[i,]))
data[i,]=t[i,] }
The third argument you pass to apply should be a function. Also, there is no reason to loop when you use apply. With data being your data frame,
apply(data,1,function(x) c(min(x), max(x)))
will return a 2-row matrix with the min and max values for each row. There is also a built-in function, range, that gets the min and max:
apply(data,1,range)
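To go one step further and do the rescaling itself, here is a minimal sketch with made-up data. It reuses the question's three-argument sc and shows two ways of supplying the extra parameters: computing them per row inside an anonymous function, or passing fixed values after the function name.

```r
# Hypothetical numeric data frame standing in for the question's `data`.
data <- data.frame(a = c(1, 10, 5), b = c(3, 20, 5.5), c = c(5, 40, 7))

# The question's function: a is the max, b is the min.
sc <- function(x, a, b) (x - b) * 100 / (a - b)

# Per-row min and max: wrap sc in an anonymous function.
# apply() over rows returns one column per input row, so t() restores the layout.
# A row whose values are all equal would divide by zero and give NaN.
scaled <- t(apply(data, 1, function(x) sc(x, a = max(x), b = min(x))))

# Fixed extra arguments: anything after the function is passed through to it.
fixed <- t(apply(data, 1, sc, a = 100, b = 0))
```

Here scaled maps each row onto 0 to 100 using that row's own range, while fixed uses the same a and b for every row.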

Column means over a finite range of rows

I am working with climate data in New Mexico and I am an R novice. I am trying to replace NA with means, but there are 37 different sites in my df. I want the means of the column for each unique DF$STATION.NAME (in column 1). I can't be using data from one location to find the mean of another... obviously. So really I should have a mean for each month, for each station.
My data is organized by station.name vertically in column 1 and readings for months jan-dec in columns following, including a total column at the end (right). readings or observations are for each station for each month, over several years (station name listed in new row for each new year.)
I need to replace the NAs with the sums of the CLDD for the given month within the given station.name, how do I do this?
Try asking that question on https://stats.stackexchange.com/ (as suggested by the statistics tag), there are probably more R users there than on the general programming site. I also added the r tag to your question.
There is nothing wrong with splitting your data into station-month subsets, filling the missing values there, then reassembling them into one big matrix!
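The split, fill and reassemble idea can be sketched in base R with ave(), which does the grouping and puts the values back in their original rows. The data frame, its values and the two month columns below are hypothetical; only STATION.NAME comes from the question.

```r
# Hypothetical data: two stations, three years each, two month columns.
df <- data.frame(
  STATION.NAME = c("A", "A", "A", "B", "B", "B"),
  JAN = c(10, NA, 14, 100, 110, NA),
  FEB = c(NA, 20, 22, 200, NA, 220)
)

month_cols <- c("JAN", "FEB")

# Replace the NAs in a vector with the mean of its non-missing values.
# If every value in a group is NA, the result is NaN rather than a number.
fill_mean <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}

# ave() applies fill_mean within each station and keeps the original row order.
df[month_cols] <- lapply(df[month_cols], function(col) {
  ave(col, df$STATION.NAME, FUN = fill_mean)
})
```

Because each month column is processed separately and grouped by station, no value from one station is used to fill a gap at another.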
See also:
Replace mean or mode for missing values in R
Note that filling missing values with means, medians or modes is common practice, but it may dilute your results because it reduces variance. Unless you have a strong physical argument for why and how the missing values can be interpolated, it would be more elegant to find a method that can deal with missing values directly.
