Plotting a histogram from a data frame - R

I am a new R user; for the last 7 days I have been using the language with the mixdist package for modal analysis of finite mixture distributions. I work on nanoparticles, so I am using R to analyse particle size distributions recorded by the particle analyser I use in my experiments.
My problem is illustrated below:
First I collect my raw data from Excel:
Diameter dN/dlog(dp) frequencies
4.87 1825.078136 0.001541926
5.62 2363.940947 0.001997187
6.49 2022.259831 0.001708516
7.5 1136.653264 0.000960307
8.66 363.4570006 0.000307068
10 255.6702845 0.000216004
11.55 241.6525906 0.000204161
13.34 410.3425535 0.00034668
15.4 886.929307 0.000749327
17.78 936.4632499 0.000791176
20.54 579.7940281 0.000489842
23.71 11.915522 0.00001
27.38 0 0
31.62 0 0
36.52 5172.088 0.004369665
42.17 19455.13684 0.01643677
48.7 42857.20502 0.036208126
56.23 68085.64903 0.057522504
64.94 87135.1959 0.07361661
74.99 96708.55662 0.081704712
86.6 97982.18946 0.082780747
100 95617.46266 0.080782896
115.48 93732.08861 0.079190028
133.35 93718.2981 0.079178377
153.99 92982.3002 0.078556565
177.83 88545.18227 0.074807844
205.35 78231.4116 0.066094203
237.14 63261.43349 0.053446741
273.84 46759.77702 0.039505233
316.23 32196.42834 0.027201315
365.17 21586.84472 0.018237755
421.7 14703.9162 0.012422678
486.97 10539.84662 0.008904643
562.34 7986.233881 0.00674721
649.38 6133.971913 0.005182317
749.89 4500.351801 0.003802145
865.96 2960.469207 0.002501167
1000 1649.858041 0.001393891
Inf 0 0
using the function
pikraw<-read.table(file="clipboard", sep="\t", header=T)
After importing the data into R, I select the 1st and the 3rd columns of the table above:
diameter<- pikraw$Diameter
frequencies<-pikraw[3]
Then I group my data using the functions:
pikgrp <- data.frame(length =diameter, freq =frequencies)
class(pikgrp) <- c("mixdata", "data.frame")
With all of this done, I plot the histogram of the data:
plot(pikgrp,log="x")
and then something strange happens: the horizontal axis and its values look fine, but on the y axis the low frequency values appear as they are, while the high values appear with their decimals cut off, which lowers the plot.
Do you have any explanation for what is happening? The answer is probably very simple, but after exhausting myself and losing a whole weekend over it, I feel I have every right to ask.

It looks to me like you are reading your data wrong. Try this:
pikraw <- read.table(file="clipboard", sep="", header=T)
That is, change the sep argument to sep="". Everything worked fine from there.
Also, note that using the clipboard as the file argument only works if you currently have your data on the clipboard. I recommend creating a .txt (or .csv) file with your data; that way you don't have to put the data on the clipboard every time you want to read it.
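For example, a minimal sketch, assuming you have saved the table above as a file named pikraw.txt in your working directory:
# read the data from a whitespace-delimited text file instead of the clipboard
pikraw <- read.table("pikraw.txt", sep = "", header = TRUE)
head(pikraw)  # check that all three columns came through correctly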

Related

Convert "survfit()" output to a matrix or dataframe

Example data below.
My basic problem is that running survfit by itself prints a nice column with the median lifespan for each category, which is the thing I want to extract from my survfit output. Ideally I'd like to export this survfit output as a dataframe/table and ultimately save it to .csv, but I get errors however I try.
Thanks for help/advice!
Example data:
df <- data.frame(Gtype = as.factor(c("A","A","A","A","A","A","B","B","B","B","B","B","C","C","C","C","C","C")),
                 Time = as.numeric(c("5","6","7","7","7","7","2","3","3","4","5","7","2","2","2","3","3","4")),
                 Status = as.numeric(c("1","1","1","1","0","0","1","1","1","1","1","1","1","1","1","1","1","1")))
library(survival)
exsurv<-survfit(Surv(df$Time,df$Status)~strata(df$Gtype))
exsurv
and the "survfit" output I want to get as a dataframe:
> exsurv<-survfit(Surv(df$Time,df$Status)~strata(df$Gtype))
> exsurv
Call: survfit(formula = Surv(df$Time, df$Status) ~ strata(df$Gtype))
                   n events median 0.95LCL 0.95UCL
strata(df$Gtype)=A 6      4    7.0       6      NA
strata(df$Gtype)=B 6      6    3.5       3      NA
strata(df$Gtype)=C 6      6    2.5       2      NA
Edit:
An earlier version of this question wrapped the call in print() superfluously; print(survfit(...)) and survfit(...) give the same result.
Yes, the broom::tidy() function works.
Here mykm1 is the object returned by running survfit on my raw survival data set.
I tried this and it worked well:
results <- broom::tidy(mykm1)
write.csv(results, "./Desktop/Rout/mykm1.csv")
## the output csv file is created in my folder Rout inside my Desktop folder
The csv file can then be imported easily into any word processor or spreadsheet.
As usual, I was making it way more complicated than necessary by not understanding the basic survival functions. summary(exsurv) does give you the median lifespan, mean lifespan, confidence intervals, and so on from the survfit result.
exsurv <- survfit(Surv(Time, Status) ~ strata(Gtype), data = df)
You can put the summary(exsurv) data into a table using the code below
survoutput<-summary(exsurv)$table
And then just save it to .csv as an output:
write.csv(survoutput, file="exsurvoutput.csv")
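If you only need the median column, something like this should work (a minimal sketch; summary(exsurv)$table is a matrix whose columns include one named "median"):
# extract the median survival time per stratum by column name
medians <- summary(exsurv)$table[, "median"]
medians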

Memory management in R ComplexUpset Package

I'm trying to plot a stacked barplot inside an upset plot using the ComplexUpset package. The plot I'd like to get looks something like this (where mpaa would be component in my example):
I have a dataframe of size 57244 by 21, where one column is ID, another is the type of recording, and the other 19 columns are components 1 to 19:
ID  component1  component2  ...  component19  type
 1           1           0  ...            1     a
 2           0           0  ...            1     b
 3           1           1  ...            0     b
Ones and zeros indicate affiliation with a certain component. As shown in the example in the docs, I first convert these ones and zeros to logical, and then try to plot the basic upset plot. Here's the code:
df <- df %>% mutate(across(where(is.numeric), as.logical))
components <- colnames(df)[2:20]
upset(df, components, name='protein', width_ratio = 0.1)
But unfortunately, after processing the last line for a while, it spits out an error message like this:
Error: cannot allocate vector of size 176.2 Mb
Though I know I'm on a machine with 32 GB of RAM, I'm sure I couldn't have flooded the memory so badly that 176.2 MB can't be allocated, so my guess is that I'm somehow managing memory in R wrong. Could you please explain what's faulty in my code, if possible?
I also know that the UpSetR package plots the same data, but as far as I know it provides no way to do the stacked barplotting.
Somehow, it works if you:
Tweak the min_size parameter so that the plot is not overloaded and makes a better impression
Make the first argument of upset() a sample of your data; this also helps, even if your sample is the whole dataset (see the sketch after this list).
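A minimal sketch of both ideas (min_size is a real ComplexUpset parameter; the cutoff of 10 and the sample size of 10000 are arbitrary values for illustration):
library(ComplexUpset)
# hide intersections with fewer than 10 members to keep the plot lighter
upset(df, components, name = 'protein', width_ratio = 0.1, min_size = 10)
# or plot a random sample of rows, which also reduces the memory footprint
upset(df[sample(nrow(df), 10000), ], components, name = 'protein', width_ratio = 0.1)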

Error in correlation matrix of a data frame

R version 3.5.1.
I want to create a correlation matrix from a data frame that goes like this:
BGI WINDSPEED SLOPE
4277.2 4.23 7.54
4139.8 5.25 8.63
4652.9 3.59 6.54
3942.6 4.42 10.05
I put that as an example but I have more than 20 columns and over 40,000 entries.
I tried using the following code:
corrplot(cor(Site.2),method = "color",outline="white")
Matrix_Site=cor(Site.2,method=c("spearman"))
but every time the same error appears:
Error in cor(Site.2) : 'x' must be numeric
I would like to correlate every variable of the data frame with each other and create a graph and a table with it, similar to this.
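The error means cor() was handed non-numeric columns. A minimal sketch of one way around it (assuming the non-numeric columns can simply be left out of the correlation):
# keep only the numeric columns before computing correlations
Site.2.num <- Site.2[sapply(Site.2, is.numeric)]
Matrix_Site <- cor(Site.2.num, method = "spearman")
corrplot(Matrix_Site, method = "color", outline = "white")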

Power BI: Make a line chart continuous when source contains null values (handle missing values)

This question spun off from a question I posted earlier:
Custom x-axis values in Power BI
Suppose the following data set:
Focus on the second and third rows. How can I make the line in the corresponding graph below continuous, rather than stopping in the middle?
In Excel I used to solve this problem by applying NA() in a formula generating data for the graph. Is there a similar solution using DAX perhaps?
The short version:
What you should do:
Unleash Python and, after following the steps there, insert this script (don't worry, I've added some details here as well):
import pandas as pd
# retrieve index column
index = dataset[['YearWeek_txt']]
# subset input data for further calculations
dataset2 = dataset[['1', '2']]
# handle missing values
dataset2 = dataset2.fillna(method='ffill')
Then you will be able to set it up like this:
Why you should do it:
For built-in options, these are your choices as far as I know:
1. Categorical YearMonth and categorical x-axis does not work
2. Categorical YearMonth and continuous x-axis does not work
3. Numerical YearMonth and categorical x-axis does not work
4. Numerical YearMonth and continuous x-axis does not work
The details, starting with why built-in approaches fail:
1. Categorical YearMonth and categorical x-axis
I've used the following dataset that resembles the screenshot of your table:
YearWeek 1 2
201603 2.37 2.83
201606 2.55
201607 2.98
201611 2.33 2.47
201615 2.14 2.97
201619 2.15 2.02
201623 2.33 2.8
201627 3.04 2.57
201631 2.95 2.98
201635 3.08 2.16
201639 2.50 2.42
201643 3.07 3.02
201647 2.19 2.37
201651 2.38 2.65
201703 2.50 3.02
201711 2.10 2
201715 2.76 3.04
And NO, I didn't bother manually copying ALL your data. Just your YearWeek series. The rest are random numbers between 2 and 3.
Then I set up 1 and 2 as numerical, and YearWeek as type text, in the Power Query Editor:
So this is the original setup with a table and chart like yours:
The data is sorted descending by YearWeek_txt:
And the x-axis is set up as Categorical:
Conclusion: Fail
2. Categorical YearMonth and continuous x-axis
With the same setup as above, you can try to change the x-axis type to Continuous:
But as you'll see, it just flips right back to 'Categorical', presumably because the type of YearWeek is text.
Conclusion: Fail
3. Numerical YearMonth and categorical x-axis
I've duplicated the original setup so that I've got two tables Categorical and Numerical where the type of YearWeek are text and integer, respectively:
So numerical YearMonth and categorical x-axis still gives you this:
Conclusion: Fail
4. Numerical YearMonth and continuous x-axis
But now, with the same setup as above, you are able to change the x-axis type to Continuous:
And you'll end up with this:
Conclusion: LOL
And now, Python:
In the Power Query Editor, activate the Categorical table, select Transform > Run Python Script and insert the following snippet in the Run Python Script Editor:
# 'dataset' holds the input data for this script
import pandas as pd
# retrieve index column
index = dataset[['YearWeek_txt']]
# subset input data for further calculations
dataset2 = dataset[['1', '2']]
# handle missing values
dataset2 = dataset2.fillna(method='ffill')
Click OK and click Table next to dataset2 here:
And you'll get this (make sure that the column data types are correct):
As you can see, no more missing values. dataset2 = dataset2.fillna(method='ffill') has replaced all missing values with the preceding value in both columns.
Click Close & Apply to get back to the Desktop, and enjoy your table and chart with no more missing values:
Conclusion: Python is cool
End note:
There are a lot of details that can go wrong here with decimal points, data types and so on. Let me know how things work out for you and I'll have a look at it again if it doesn't work on your end.

Mean Y for individual X values

I have a data set in .dta format with height and weight of baseball players. I want to calculate the mean height for each individual weight value.
From what I've been able to find, I could use dplyr and "group_by", but my R script does not recognize the command, despite having installed and called the package.
Thanks!
Here is an example coded in base R using baseball player height and weight data obtained from the UCLA SOCR MLB HeightsWeights data set.
After cleaning the data (weight is missing for one player), I posted it to GitHub to make it accessible without having to clean it again.
theCSVFile <- "https://raw.githubusercontent.com/lgreski/datasciencedepot/gh-pages/data/baseballPlayers.csv"
if (!dir.exists("./data")) dir.create("./data") # download.file() needs the target folder to exist
download.file(theCSVFile, "./data/baseballPlayers.csv", method = "curl")
theData <- read.csv("./data/baseballPlayers.csv", header = TRUE, stringsAsFactors = FALSE)
aggData <- aggregate(HeightInInches ~ WeightInPounds, data = theData, FUN = mean)
head(aggData)
...and the output is:
> head(aggData)
  WeightInPounds HeightInInches
1            150       70.75000
2            155       69.33333
3            156       75.00000
4            160       71.46667
5            163       70.00000
6            164       73.00000
>
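Since you mentioned dplyr, here is the equivalent with group_by (a minimal sketch, assuming dplyr is installed and loads cleanly in your session):
library(dplyr)
aggData2 <- theData %>%
  group_by(WeightInPounds) %>%
  summarise(HeightInInches = mean(HeightInInches))
head(aggData2)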
regards,
Len
