My aim is to use a pivot-table-like summary to see whether there is a link between the presence of one particular shop and the population density of the areas where these shops are found. For that, I have a CSV file with 600 examples of areas where the shop is or is not present. The file has 600 lines and two columns: 1/ a number representing the population density of the area, and 2/ the number of these shops in the area (0, 1 or 2).
In order to build the table, I need to group the densities into 10 groups of 60 lines each (the first group holding the 60 highest densities, down to the last group with the 60 lowest). Then I'll easily be able to see how many shops are built where density is low or high. Is that understandable (I hope)? :)
Nothing really difficult, I suppose. But there are so many ways (and packages) that could work for this... that I'm a little lost.
My main issue: what is the simplest way to group my variable into ten groups of 60 lines each? I've tried cut()/cut2() and hist() without success. I've heard about bin_var() and reshape(), but I don't understand how they would help in this case.
For example (as Justin asked), with cut():
data <- read.csv("data.csv", sep = ";")
groups <- cut(as.numeric(data$densit_pop2), breaks=10)
summary(groups)
(0.492,51.4] (51.4,102] (102,153] (153,204] (204,255] (255,306]
53 53 52 52 52 54
(306,357] (357,408] (408,459] (459,510]
52 59 53 54
OK, good: 'groups' does contain 10 groups with almost the same number of lines. But some of the values shown in the intervals make no sense to me. Here are the first lines of the density column (sorted in increasing order):
> head(data$densit_pop2)
[1] 14,9 16,7 17,3 18,3 20,2 20,5
509 Levels: 100 1013,2 102,4 102,6 10328 103,6 10375 10396,8 104,2 ... 99,9
I mean, look at the first group. Why 0.492 when 14.9 is my smallest value? And if I count manually how many lines there are between the first one and the value 51.4, I find 76. Why does it say 53 lines? Note that the data frame is correctly sorted from lowest to highest.
I'm certainly missing something... but what?
I think you'll be happy with cut2 once you have a numeric variable to work with. When using commas as your decimal separator, use read.csv2 or use the argument dec = "," when reading in a dataset.
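For instance, a minimal sketch of the import (the file name and the semicolon separator are taken from the question):
# either call treats ";" as the field separator and "," as the decimal mark
data <- read.csv2("data.csv")
data <- read.csv("data.csv", sep = ";", dec = ",")
With a numeric variable in hand, cut2 gives near-equal groups: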
library(Hmisc)
y <- runif(600, 14.9, 10396.8)  # simulated densities spanning the question's range
summary(cut2(y, m = 60))        # m = 60: at least 60 observations per group
You can do the same thing with cut, but you would need to set your breaks at the appropriate quantiles to get equal groups, which takes a bit more work.
summary(cut(y, breaks = quantile(y, probs = seq(0, 1, 1/10)), include.lowest = TRUE))
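From there, the cross-tabulation the question asks for is just a table of density group against shop count. A minimal sketch, assuming a numeric density column dens and a shop-count column shops (adjust the names to your data):
library(Hmisc)
data$dens_group <- cut2(data$dens, g = 10)  # g = 10: ten equal-sized quantile groups
table(data$dens_group, data$shops)          # shops built per density decile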
Responding to your data: you first need to fix the values that were read in with commas as the decimal mark:
# factor -> character, comma -> period, then numeric
data$densit_pop3 <- as.numeric(sub(",", ".", as.character(data$densit_pop2)))
Then, something along these lines (assuming this is not really a question about loading data from text files):
with(dfrm, by(dens, factor(shops), summary) )
As an example of the output one might get:
with(BNP, by( proBNP.A, Sex, summary))
Sex: Female
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
5.0 55.7 103.6 167.9 193.6 5488.0 3094899
---------------------------------------------------------------------
Sex: Male
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
5 30 63 133 129 5651 4013760
If you are trying to plot this to look at the density of densities (which in this case seems like a reasonable request), then try this:
require(lattice)
densityplot( ~dens|shops, data=dfrm)
(And please do stop calling these "pivot tables". That is an aggregation strategy from Excel and one should really learn to describe the desired output in standard statistical or mathematical jargon.)
Very new to R here, also very new to the idea of coding and computer stuff.
It's the second week of class and I need to find some summary statistics from a set of data my professor provided. I downloaded the chart of data and tried to follow along with his verbal instructions during class, but I am one of the only people without a computer science background in my degree program (I am an RN going for a degree in Health Informatics), so he went way too fast for me.
I was hoping for some input on just where to start with his list of tasks. I downloaded his data into an Excel file and then loaded it into R, where it is now a matrix. However, everything I try for getting the mean and standard deviation of the columns he wants comes up with an error. I understand that I need to convert these columns into some sort of vector, but every website online tells me to do these tasks differently. I don't even know where to start with this assignment.
Any help on how to get myself started would be greatly appreciated. I've included a screenshot of his instructions and of my matrix. And please excuse my ignorance/lack of familiarity compared to most of you here... this is my second week into my master's. I am hoping I begin to pick this up soon; I am just not there yet.
The instructions include:
# * Import the dataset
# * Summarize the dataset; compute the mean and standard deviation for the three variables (columns): age, height, weight
# * Tabulate smokers and age.level data with the variable and its frequency. How many smokers are in each age category?
# * Subset the dataset to the mothers that smoke and weigh less than 100 kg. How many mothers meet these requirements?
# * Compute the mean and standard deviation for the three variables (columns): age, height, weight
# * Plot a histogram
Stack Overflow is not a place for homework, but I feel your pain. Let's take it piece by piece.
First let's use a package that helps us do those tasks:
library(data.table) # if not installed, install it with install.packages("data.table")
Then, let's load the data:
library(readxl) #again, install it if not installed
dt = setDT(read_excel("path/to/your/file/here.xlsx"))
Now to the calculations:
1. Summarize the dataset. Here you'll see the ranges, means, medians, and other interesting facts about your table.
summary(dt)
1A. Mean and standard deviation of age (replace age with height or weight to get those; a sketch for computing all three at once follows the next line of code):
dt[, .(meanValue = mean(age, na.rm = TRUE), stdDev = sd(age, na.rm = TRUE))]
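If you want all three columns in one call, a minimal sketch using .SDcols (assuming the columns really are named age, height, and weight):
dt[, .(variable  = c("age", "height", "weight"),
       meanValue = sapply(.SD, mean, na.rm = TRUE),   # column means
       stdDev    = sapply(.SD, sd,   na.rm = TRUE)),  # column standard deviations
   .SDcols = c("age", "height", "weight")]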
2. Tabulate smokers and age.level; get the counts for each combination:
dt[, .N, by = .(smoke, age.level)]
3. Subset smoking mothers with weight < 100 (I'm assuming non-pregnant mothers have NA in the gestation field; adjust as necessary):
dt[smoke == 1 & weight < 100 & !is.na(gestation), .N]
4. Is the same as 1A.
5. Plot a histogram (you don't specify of which variable, so let's say it's age):
hist(dt$age)
Keep on studying R, it's not that difficult. The book recommended in the comments is a very good start.
I would like to use the median function to create a new variable. If an observation (a polled person) in the sample has an education level greater than the median of the sample, they will be labeled "high_educ". If their level of education is lower than the median, they will be labeled "low_educ".
There are 1247 observations (n) in my sample, so I should end up with something like 623 low_educ and 624 high_educ.
In the code below I didn't include the median function, as I do not know how to include it. Instead, I am manually setting a threshold (here: 13.5). But of course, done like this, my sample is not divided into two equal halves, as it would be if I were using the median function.
For your information, the "educ" values are integers, and my query seems to work only when I wrap educ in as.numeric() inside the ifelse call.
educ12 <- gss %>%
  filter(year == 2012, !is.na(educ), !is.na(abany)) %>%
  mutate(education = ifelse(as.numeric(educ) >= 13.5, "high_educ", "low_educ"))
Could you please help me to understand how to use the median function in my code?
Thanks a lot
Michael
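A minimal sketch of one way to do it: compute the median inside mutate(), so it is taken over the filtered sample (note that observations exactly equal to the median all land on one side, so the split need not be exactly 623/624):
library(dplyr)
educ12 <- gss %>%
  filter(year == 2012, !is.na(educ), !is.na(abany)) %>%
  mutate(educ_num  = as.numeric(educ),
         education = ifelse(educ_num > median(educ_num), "high_educ", "low_educ"))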
I've found the function univariateTable extremely helpful for handling larger data with a nice, clean table output. But there are a couple of things that I still need to do manually after the table is exported to csv, and I would rather do them in R to automate the process and avoid human error.
Here is example code producing the table output that I then export as csv:
value<-cbind(c(rnorm(103.251,503.24,90),rnorm(103.251,823.24,120)))
genotype<-cbind(c(rep("A",100),rep("B",100)))
gender<-rep(c("M","F","F","F"),50)
df<-cbind(value,genotype,gender)
df<-as.data.frame(df)
colnames(df)<-c("value","genotype","gender")
df$value<-as.numeric(as.character(df$value))
library(Publish)
summary(univariateTable(gender ~ Q(value) + genotype, data=df))
The two problems I have are these:
Is there a way to round the numbers in the table in a way similar to this: round(99.73)
Is there a way to substitute "," with "-" in the interquartile-range output, similar to gsub(", ", "-", "[503.7, 793.3]"), and have it print median [IQR] instead of median [iqr]?
Again, I do these manually after exporting the tables, but for larger tables it is much more convenient to automate the process.
univariateTable has a digits argument that you can use for rounding. To modify the formatting, you can inspect the list returned by univariateTable to figure out where to find the values that need to be changed.
Your example data threw an error, so I've modified it to make it run and also cleaned up the code a bit.
# devtools::install_github("tagteam/Publish")
library(Publish)
value <- c(rnorm(90, 103.251,503.24),rnorm(110, 103.251,823.24))
genotype <- rep(c("A","B"), each=100)
gender <- rep(c("M","F","F","F"),50)
df <- data.frame(value,genotype,gender)
The digits argument to univariateTable can be used for rounding (see ?univariateTable for the help information on the function).
tab = univariateTable(gender ~ Q(value) + genotype, data=df, digits=0)
To change the commas to hyphens, we need to see where those values are stored in the list returned by univariateTable. Run str(tab), which shows you the structure of the list. Note that the heading values in the table look like they're stored in tab$summary.groups$value and tab$summary.totals$value, so we'll edit those:
tab$summary.groups$value = gsub(", ", " - ", tab$summary.groups$value)
tab$summary.totals$value = gsub(", ", " - ", tab$summary.totals$value)
tab
Variable Level gender = F (n=150) gender = M (n=50) Total (n=200) p-value
1 value median [iqr] -6 [-481 - 424] 203 [-167 - 544] 80 [-433 - 458] 0.118
2 genotype A 75 (50) 25 (50) 100 (50)
3 B 75 (50) 25 (50) 100 (50) 1.000
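The question also asked for median [IQR] instead of median [iqr]. That label is not in the value fields edited above. A hedged sketch, assuming summary() returns an ordinary data frame with a Level column, as the printed output suggests (check with str(summary(tab))):
stab <- summary(tab)
stab$Level <- sub("median [iqr]", "median [IQR]", stab$Level, fixed = TRUE)
stab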
I would like to ask you to clarify the following question, which is of extreme importance to me, since a major part of my master's thesis relies on getting this calculation right.
I have a list of financial time series, which look like this (AUDUSD example):
Open High Low Last
1992-05-18 0.7571 0.7600 0.7565 0.7598
1992-05-19 0.7594 0.7595 0.7570 0.7573
1992-05-20 0.7569 0.7570 0.7548 0.7562
1992-05-21 0.7558 0.7590 0.7540 0.7570
1992-05-22 0.7574 0.7585 0.7555 0.7576
1992-05-25 0.7575 0.7598 0.7568 0.7582
From this data I calculate log returns for the column Last, obtaining something like this:
Last
1992-05-19 -0.0032957646
1992-05-20 -0.0014535847
1992-05-21 0.0010573620
1992-05-22 0.0007922884
Now I want to calculate the drawdowns in the time series presented above, which I do with (from the package PerformanceAnalytics):
ddStats <- drawdownsStats(timeSeries(AUDUSDLgRetLast[,1], rownames(AUDUSDLgRetLast)))
which results in the following output (only the first 5 lines are shown, but it returns every single drawdown, including one-day ones):
From Trough To Depth Length ToTrough Recovery
1 1996-12-03 2001-04-02 2007-07-13 -0.4298531511 2766 1127 1639
2 2008-07-16 2008-10-27 2011-04-08 -0.4003839141 713 74 639
3 2011-07-28 2014-01-24 2014-05-13 -0.2254426369 730 652 NA
4 1992-06-09 1993-10-04 1994-12-06 -0.1609854215 650 344 306
5 2007-07-26 2007-08-16 2007-09-28 -0.1037999707 47 16 31
Now, the problem is the following: the depth of the worst drawdown (according to the output above) is -0.4298, whereas if I do the calculation "by hand" I obtain
(AUDUSD[as.character(ddStats[1,1]),4]-AUDUSD[as.character(ddStats[1,2]),4])/(AUDUSD[as.character(ddStats[1,1]),4])
[1] 0.399373
To make things clearer, these are the two lines from the AUDUSD data frame for the From and Trough dates:
AUDUSD[as.character(ddStats[1,1]),]
Open High Low Last
1996-12-03 0.8161 0.8167 0.7845 0.7975
AUDUSD[as.character(ddStats[1,2]),]
Open High Low Last
2001-04-02 0.4858 0.4887 0.4773 0.479
Also, the other drawdown depths do not agree with the "by hand" calculations. What am I missing? How can these two numbers, which should be the same, differ by such a substantial amount?
I have tried replicating the drawdown via:
cumsum(rets) -cummax(cumsum(rets))
where rets is the vector of your log returns.
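For reference, a runnable version of that replication on the question's toy prices, converting the log-space drawdown back to a simple return (the prices are taken from the question's Last column; the cumulative sum of log returns is the log of the cumulative price ratio, so exp() - 1 recovers a simple-return drawdown):
last  <- c(0.7598, 0.7573, 0.7562, 0.7570, 0.7576, 0.7582)  # toy sample
rets  <- diff(log(last))                      # log returns
logDD <- cumsum(rets) - cummax(cumsum(rets))  # drawdowns in log space
simpleDD <- exp(logDD) - 1                    # drawdowns as simple returns
min(simpleDD)                                 # deepest drawdown in the sample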
For some reason, when I calculate drawdowns of, say, less than 20%, I get the same results as table.Drawdowns() and drawdownsStats(), but when there is a large drawdown, say over 35%, the max drawdowns begin to diverge between calculations. More specifically, table.Drawdowns() and drawdownsStats() are overstated (at least from what I noticed). I do not know why this is so, but perhaps it might help to put a confidence interval around large drawdowns (those over 35%) using the standard error of the drawdown. I would use 0.4298531511/sqrt(1127), i.e. the max drawdown divided by sqrt(days to trough). This yields a +/- of 0.01280437, or a drawdown between 0.4169956 and 0.4426044; the lower bound of 0.4169956 is much closer to your "by hand" calculation of 0.399373. Hope it helps.
I am thinking of writing a data dictionary function in R which, taking a data frame as an argument, will do the following:
1) Create a text file which:
a. Summarises the data frame by listing the number of variables by class, the number of observations, the number of complete observations, etc.
b. For each variable, summarises the key facts: mean, min, max, mode, number of missing observations, etc.
2) Creates a pdf containing a histogram for each numeric or integer variable and a bar chart for each categorical (factor) variable.
The basic idea is to create a data dictionary of a data frame with one function.
My question is: is there a package which already does this? And if not, do people think this would be a useful function?
Thanks
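For what it's worth, a minimal sketch of such a function in base R (make_dictionary is a hypothetical name; a real version would add the mode, more detailed missing-data counts, and so on):
make_dictionary <- function(df, txt = "dictionary.txt", pdf_file = "dictionary.pdf") {
  # 1) text summary of the whole data frame and each variable
  sink(txt)
  cat("Observations:", nrow(df),
      "- complete observations:", sum(complete.cases(df)), "\n")
  print(table(vapply(df, function(x) class(x)[1], character(1))))  # variables by class
  print(summary(df))  # per-variable key facts (min, max, mean, NAs, ...)
  sink()
  # 2) one plot per variable: histogram if numeric, bar chart otherwise
  pdf(pdf_file)
  for (nm in names(df)) {
    x <- df[[nm]]
    if (is.numeric(x)) hist(x, main = nm, xlab = nm)
    else barplot(table(x), main = nm)
  }
  dev.off()
}
make_dictionary(iris)  # example usage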
There are a variety of describe functions in various packages. The one I am most familiar with is Hmisc::describe. Here's its description from its help page:
" This function determines whether the variable is character, factor, category, binary, discrete numeric, and continuous numeric, and prints a concise statistical summary according to each. A numeric variable is deemed discrete if it has <= 10 unique values. In this case, quantiles are not printed. A frequency table is printed for any non-binary variable if it has no more than 20 unique values. For any variable with at least 20 unique values, the 5 lowest and highest values are printed."
And an example of the output:
Hmisc::describe(work2[, c("CHOLEST","HDL")])
work2[, c("CHOLEST", "HDL")]
2 Variables 5325006 Observations
----------------------------------------------------------------------------------
CHOLEST
n missing unique Mean .05 .10 .25 .50 .75 .90
4410307 914699 689 199.4 141 152 172 196 223 250
.95
268
lowest : 0 10 19 20 31, highest: 1102 1204 1213 1219 1234
----------------------------------------------------------------------------------
HDL
n missing unique Mean .05 .10 .25 .50 .75 .90
4410298 914708 258 54.2 32 36 43 52 63 75
.95
83
lowest : -11.0 0.0 0.2 1.0 2.0, highest: 241.0 243.0 248.0 272.0 275.0
----------------------------------------------------------------------------------
Furthermore, on your point about getting histograms, the Hmisc::latex method for a describe object will produce histograms interleaved in the output illustrated above. (You do need a functioning LaTeX installation to take advantage of this.) I'm pretty sure you can find an illustration of the output either on Harrell's website or in the Amazon "Look Inside" preview of his book "Regression Modeling Strategies". The book has a ton of useful material regarding data analysis.
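For example, a minimal sketch of rendering that output to a LaTeX file (object and column names taken from the example above; compiling the result requires a working LaTeX toolchain):
library(Hmisc)
d <- describe(work2[, c("CHOLEST", "HDL")])
latex(d, file = "describe.tex")  # writes LaTeX markup, including the small histograms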