I loaded and prepared the dataset:
df <- read.csv('https://raw.githubusercontent.com/ulklc/covid19-timeseries/master/countryReport/raw/rawReport.csv')
df$countryName = as.character(df$countryName)
Considering the cases reported, can we find what percentage of the total cases the top 3 countries account for?
Here is one way to do this with base R. Since the statistics are cumulative for each country by day, we subset to the most recent day's data with the [ form of the extract operator, sort by descending confirmed cases, then calculate and sum the percentages for the first 3 rows.
df <- read.csv('https://raw.githubusercontent.com/ulklc/covid19-timeseries/master/countryReport/raw/rawReport.csv')
df$countryName <- as.character(df$countryName)
# subset to the most recent day
today <- df[df$day == max(df$day), ]
# sort by descending confirmed cases
today <- today[order(today$confirmed, decreasing = TRUE), ]
# each country's share of the total confirmed cases
today$pct <- today$confirmed / sum(today$confirmed)
paste("top 3 countries percentage as of", today$day[1], "is:",
      sprintf("%3.2f%%", sum(today$pct[1:3] * 100)))
...and the output:
> paste("top 3 countries percentage as of",today$day[1],"is:",
+ sprintf("%3.2f%%",sum(today$pct[1:3]*100)))
[1] "top 3 countries percentage as of 2020/05/30 is: 44.09%"
We can print selected data for the top 3 countries as follows.
colList <- c("countryName", "day", "confirmed", "pct")
today[1:3, colList]
countryName day confirmed pct
26000 United States 2020/05/30 1816117 0.29531640
3640 Brazil 2020/05/30 498440 0.08105067
21710 Russia 2020/05/30 396575 0.06448654
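For comparison, a dplyr version of the same calculation might look like this; a minimal sketch, assuming the same df as above:

library(dplyr)

df %>%
  filter(day == max(day)) %>%                  # most recent day only
  mutate(pct = confirmed / sum(confirmed)) %>% # share of total confirmed
  arrange(desc(pct)) %>%
  slice(1:3) %>%                               # top 3 countries
  summarise(top3_pct = sum(pct) * 100)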
How to calculate the mean, max, and min number of reviews per group in a dataframe in R? I have a dataset that looks like the one below:
Firm Review
A Nice work
B Ok
C yes
A ok
B yes
A like
In this case, firm A has 3 reviews, B has 2, and C has 1. So max = 3, min = 1, and the average number of reviews per firm = 6/3 = 2.
The actual dataset has 2.1 million reviews across 50000 firms. What would be an effective way to group by firm and then calculate these statistics?
I generally use dplyr for this kind of thing, if I understand you correctly.
library(dplyr)

# assumes a data frame firm_reviews with a column named firm
firm_reviews %>%
  group_by(firm) %>%
  count() %>%
  ungroup() %>%
  summarise(mean = mean(n), min = min(n), max = max(n))
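If dplyr proves slow at 2.1 million rows, a data.table version may help; a minimal sketch, assuming the same hypothetical firm_reviews data frame with a firm column:

library(data.table)

# count reviews per firm, then summarise the counts
setDT(firm_reviews)[, .N, by = firm][, .(mean = mean(N), min = min(N), max = max(N))]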
I am trying to assess the accuracy of predicted timeframes for hospital discharge.
For example, suppose I think Mr. Smith will be discharged within 3-7 days, meaning any discharge date from 11/9-11/13 would count as correct. If he is discharged in 2 days, I would say I was 1 day off, and if he is discharged in 10 days, I was 3 days off...
Is there any good method to do this using dplyr, base R, and lubridate? TIA. Sample data is at the link:
Sample data
A possible solution is to express this logic in a case_when.
library(dplyr)

df %>%
  mutate(DIF = case_when(
    # as.numeric() keeps every branch the same type, which case_when requires
    discharge_calender_date < discharge_prediction_lower_bound ~
      as.numeric(discharge_calender_date - discharge_prediction_lower_bound),
    discharge_calender_date <= discharge_prediction_upper_bound ~ 0,
    TRUE ~ as.numeric(discharge_calender_date - discharge_prediction_upper_bound)
  ))
This way you get a negative value if the patient left before the lower bound, zero if they left within the predicted window, and a positive value if they left after it.
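Since the linked sample data is not shown here, a small hypothetical example illustrates the idea; the column names follow the answer above, the dates are made up:

library(dplyr)

# hypothetical example data: predicted window is 11/9 through 11/13
df <- data.frame(
  discharge_calender_date          = as.Date(c("2020-11-07", "2020-11-10", "2020-11-16")),
  discharge_prediction_lower_bound = as.Date("2020-11-09"),
  discharge_prediction_upper_bound = as.Date("2020-11-13")
)

# running the mutate() above on this data yields DIF = -2, 0, 3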
I'm very new to R and struggling with using it for basic data analysis.
If I load a table, how can I find the top 10 values for every column, along with each value's frequency and count? In addition, I'd also like to find out the frequency of blanks.
Using "Forbes2000" from the "HSAUR" package...
data("Forbes2000", package = "HSAUR")
head(Forbes2000)
The data contains 8 columns, some of which ("rank", "name", "sales", etc.) are unique per row. However, some columns ("country", "category") are not unique.
So, for each column, I'd like to find the top 10 unique values with their % frequency and counts. In addition, if the column contains at least one blank/NULL, an additional row should show the same info for the blanks. If every row is unique, limit the results to 10 rows.
So, something like... (numbers below made up)
country percentage rank
United States 85.35% 1
United Kingdom 6.31% 2
Canada 3.12% 3
category percentage rank
Banking 55.28% 1
Conglomerates 20.75% 2
Insurance 12.23% 3
NULL 3.32% 4
Oil & gas operations 2.11% 5
...(etc)...
sales percentage rank
1234.56 0.05% 1
987.65 0.05% 1
986.32 0.05% 1
822.12 0.05% 1
...(etc)...
I've looked around StackOverflow for a while and found a few ranking questions, but they were 2D in nature (How to return 5 topmost values from vector in R?) or for a single column (how to find the top N values by group or within category (groupwise) in an R data.frame). I'm looking for a solution that covers every column, as iterating over
names(Forbes2000)
doesn't seem to work to loop through all the columns.
Something like this?
library("HSAUR")
f<-function(x){
Freq<-(head(sort(table(x),decreasing=TRUE)*100/length(x),10))
rank<-1:10
rank<-rank-cumsum(duplicated(Freq))
data.frame(perc=paste(Freq,"%",sep=""),rank)
}
lapply(Forbes2000,f)
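To inspect the result for a single column, index the returned list, for example:

res <- lapply(Forbes2000, f)
res$country   # top-10 percentage table for the country column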
I've just started learning R. I wanted to know how I can find the lowest value in one column for each unique value in another column. For example, in this case I want to know the lowest average price per year.
I have a data frame with about 7 columns, two of them being average price and year. The year values recur, ranging from 2000 to 2009. The data also has various NAs in different columns.
I have very little idea about how to run a loop or anything similar for this.
Thank you :)
My data set looks something like this:
avgprice year
332 2002
NA 2009
5353 2004
1234      NA
...and so on.
To break down my problem: find the five lowest values for each year from 2000-2004.
s <- subset(tx.house.sales, na.rm=TRUE, select=c(avgprice, year))
s2 <- subset(s, year==2000)
s3 <- arrange(s2)
tail(s2, 5)
I know the code fails miserably. I wanted to first subset my data frame on the basis of year and avgprice, then sort it for each year through 2000-2004, arrange it, and print the lowest five using tail(). However, I also want to ignore the NAs.
You could try
aggregate(averageprice~year, df1, FUN=min)
Update
If you need to get the 5 lowest "averageprice" values per "year":
library(dplyr)
df1 %>%
  group_by(year) %>%
  arrange(averageprice) %>%
  slice(1:5)
Or you could use rank in place of arrange
df1 %>%
  group_by(year) %>%
  filter(rank(averageprice, ties.method='min') %in% 1:5)
This could also be done with aggregate, but the second column will be a list:
aggregate(averageprice~year, df1, FUN=function(x)
  head(sort(x), 5), na.action=na.pass)
data
set.seed(24)
df1 <- data.frame(year=sample(2002:2008, 50, replace=TRUE),
                  averageprice=sample(c(NA, 80:160), 50, replace=TRUE))
I have a dataset of around 1.5 lakh (150,000) observations and 2 variables: name and amount. The same name can appear again and again; for example, a name ABC can appear 50 times in the dataset.
I want a new data frame with two variables, name and total amount, where each name appears once and total amount is the sum of all of that name's amounts in the original dataset. For example, if ABC appears three times with amounts 1, 2 and 3 respectively, then in the new dataset ABC will appear once with total amount 6.
You can use data.table for big datasets:
library(data.table)
res <- setDT(df)[, list(Total_Amount = sum(amount)), by = name]
Or use dplyr
library(dplyr)
df %>%
  group_by(name) %>%
  summarise(Total_Amount = sum(amount))
Or, as suggested by @hrbrmstr:
count(df, name, wt=amount)
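For completeness, a base R sketch with aggregate gives the same totals (note the formula interface drops NA rows by default):

aggregate(amount ~ name, df, FUN = sum)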
data
set.seed(24)
df <- data.frame(name=sample(LETTERS[1:5], 25, replace=TRUE),
                 amount=sample(150, 25, replace=TRUE))