Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 8 years ago.
I've just started learning R. How can I find the lowest value in one column for each unique value in another column? For example, in this case I want to know the lowest average price per year.
I have a data frame with about 7 columns, 2 of them being average price and year. The year recurs and ranges from 2000 to 2009. The data also has NAs in various columns.
I have very little idea how to write a loop or anything similar for this.
Thank you :)
My data set looks something like this:
avgprice year
332 2002
NA 2009
5353 2004
1234 NA
and so on.
To break down my problem: find the five lowest values for the years 2000-2004.
s<-subset(tx.house.sales,na.rm=TRUE,select=c(avgprice,year)
s2<-subset(s,year==2000)
s3<-arrange(s2)
tail(s2,5)
I know the code fails miserably. I wanted to first subset my data frame to the year and avgprice columns, then sort it for each year from 2000 to 2004, and use tail() to print the lowest five. I also want to ignore the NAs.
You could try
aggregate(averageprice~year, df1, FUN=min)
Update
If you need to get the 5 lowest "averageprice" values per "year"
library(dplyr)
df1 %>%
group_by(year) %>%
arrange(averageprice) %>%
slice(1:5)
Or you could use rank in place of arrange
df1 %>%
group_by(year) %>%
filter(rank(averageprice, ties.method='min') %in% 1:5)
This could also be done with aggregate, but the 2nd column will be a list (or a matrix when every year has at least five values)
aggregate(averageprice~year, df1, FUN=function(x)
head(sort(x),5), na.action=na.pass)
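For readers who prefer to stay in base R, the same "five lowest per year" idea can be sketched without dplyr. This is only one possible approach, not the answer's method: it drops missing prices, sorts by year and price, and numbers the rows within each year. The names clean, pos and lowest5 are made up for this illustration; df1 is the sample data from this answer.

```r
# Sample data as defined in the answer's "data" section
set.seed(24)
df1 <- data.frame(year = sample(2002:2008, 50, replace = TRUE),
                  averageprice = sample(c(NA, 80:160), 50, replace = TRUE))

# Drop rows with a missing price, then sort by year and by price within year
clean <- df1[!is.na(df1$averageprice), ]
clean <- clean[order(clean$year, clean$averageprice), ]

# Number the rows within each year (1 = lowest price in that year)
clean$pos <- ave(clean$averageprice, clean$year, FUN = seq_along)

# Keep at most the five lowest prices per year
lowest5 <- clean[clean$pos <= 5, c("year", "averageprice")]
lowest5
```

Unlike the aggregate version, this returns an ordinary long data frame with one row per retained price.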
data
set.seed(24)
df1 <- data.frame(year=sample(2002:2008, 50, replace=TRUE),
averageprice=sample(c(NA, 80:160), 50, replace=TRUE))
Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 1 year ago.
I am currently trying to decrease the values in a column randomly according to a given sum.
For example, if the main data is like this;
ID Value
1 4
2 10
3 16
After running the code, the sum of Value should be 10, and this needs to be done randomly (the decrease for each ID should be chosen randomly):
ID Value
1 1
2 8
3 1
I tried several commands and libraries but could not manage it. I am still a novice, and any help would be appreciated!
Thanks
Edit: Sorry, I was not clear enough. I would like to randomly assign each observation a new value smaller than the original, so that the new sum of Value equals 10.
Using the sample data
dd <- read.table(text="ID Value
1 4
2 10
3 16", header=TRUE)
and the dplyr and tidyr libraries, you can do
library(dplyr)
library(tidyr)
dd %>%
mutate(ID=factor(ID)) %>%
uncount(Value) %>%
sample_n(10) %>%
count(ID, name = "Value", .drop=FALSE)
Here we repeat the row once for each Value, then we randomly sample 10 rows, then we count them back up. We turn ID to a factor to make sure IDs with 0 observations are preserved.
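The same expand, sample and count idea can be sketched in base R, in case the packages are not available. This is an illustration under the same assumptions as above, not a replacement for the answer: units, kept and NewValue are names invented here, and the seed is fixed only to make the run repeatable.

```r
dd <- read.table(text = "ID Value
1 4
2 10
3 16", header = TRUE)

set.seed(1)  # only for reproducibility of this sketch

# One entry per unit of Value: ID 1 four times, ID 2 ten times, ID 3 sixteen times
units <- rep(dd$ID, times = dd$Value)

# Keep 10 units at random, without replacement
kept <- sample(units, 10)

# Count how many kept units belong to each ID (IDs with none get 0)
dd$NewValue <- tabulate(match(kept, dd$ID), nbins = nrow(dd))
dd
```

With either version, a new value can never exceed the original, but it can equal it (for example, all 4 units of ID 1 might be kept), so an extra check is needed if every value must strictly decrease.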
Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 2 years ago.
I processed the dataset.
df <- read.csv('https://raw.githubusercontent.com/ulklc/covid19-timeseries/master/countryReport/raw/rawReport.csv')
df$countryName = as.character(df$countryName)
Considering the cases announced, what percentage of the total cases do the three countries with the most cases account for? Can we find it?
Here is one way to do this with Base R. Since the statistics are cumulative for each country by day, we subset to the most recent day's data with the [ form of the extract operator, sort by descending confirmed cases, calculate and sum the percentages for the first 3 rows.
df <- read.csv('https://raw.githubusercontent.com/ulklc/covid19-timeseries/master/countryReport/raw/rawReport.csv')
df$countryName = as.character(df$countryName)
# subset to max(day)
today <- df[df$day == max(df$day),]
today <- today[order(today$confirmed,decreasing=TRUE),]
today$pct <- today$confirmed / sum(today$confirmed)
paste("top 3 countries percentage as of",today$day[1],"is:",
sprintf("%3.2f%%",sum(today$pct[1:3]*100)))
...and the output:
> paste("top 3 countries percentage as of",today$day[1],"is:",
+ sprintf("%3.2f%%",sum(today$pct[1:3]*100)))
[1] "top 3 countries percentage as of 2020/05/30 is: 44.09%"
We can print selected columns for the top 3 countries as follows.
colList <- c("countryName", "day", "confirmed", "pct")
today[1:3,colList]
        countryName        day confirmed        pct
26000 United States 2020/05/30   1816117 0.29531640
3640         Brazil 2020/05/30    498440 0.08105067
21710        Russia 2020/05/30    396575 0.06448654
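Since the remote CSV changes over time, the calculation can also be checked offline on a tiny data frame. The sketch below repeats the same steps (take the latest day, sort by confirmed cases, sum the share of the first three rows) on made-up numbers. The toy data and the names latest, top3 and share are assumptions for illustration, not values from the real dataset.

```r
# Made-up cumulative counts, for illustration only
toy <- data.frame(
  countryName = c("A", "B", "C", "D", "A", "B", "C", "D"),
  day = c(rep("2020/05/29", 4), rep("2020/05/30", 4)),
  confirmed = c(90, 40, 30, 20, 100, 50, 30, 20),
  stringsAsFactors = FALSE
)

# Counts are cumulative, so only the most recent day is needed
latest <- toy[toy$day == max(toy$day), ]

# Three rows with the most confirmed cases
top3 <- head(latest[order(latest$confirmed, decreasing = TRUE), ], 3)

# Their combined share of the total, in percent
share <- 100 * sum(top3$confirmed) / sum(latest$confirmed)
sprintf("%.2f%%", share)
```

Comparing day values with max() works here because the yyyy/mm/dd strings sort in date order.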
Closed. This question needs debugging details. It is not currently accepting answers.
Closed 4 years ago.
I need to find the names of the customers who made the highest and second-highest purchases.
Sample Data
Name Sales
pavan 400
kumar 200
mahesh 750
rajesh 550
vasu 900
There should be two queries, one for the highest and one for the second highest.
I want only the names, not the whole row.
Updated Answer
R Base Solution
Make sure the Name is Character type
For Max
df[which.max(df$Sales),]$Name
#[1] "vasu"
For Min
df[which.min(df$Sales),]$Name
#[1] "kumar"
Note that which.max and which.min return an index: which.max returns the position of the maximum Sales value and which.min the position of the minimum. That index is then used inside the subsetting brackets.
Second Highest
library(dplyr)
df <- df %>% arrange(desc(Sales))
df$Name[2]
#[1] "mahesh"
You can keep changing the index to get the 3rd and 4th.
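If dplyr is not loaded, the same ranking can be sketched in base R with order(). The data frame below is rebuilt from the sample data in the question, and ranked is a name made up for this illustration.

```r
# Sample data from the question
df <- data.frame(
  Name = c("pavan", "kumar", "mahesh", "rajesh", "vasu"),
  Sales = c(400, 200, 750, 550, 900),
  stringsAsFactors = FALSE
)

# Names ordered from highest to lowest Sales
ranked <- df$Name[order(df$Sales, decreasing = TRUE)]

ranked[1]  # highest purchase
ranked[2]  # second-highest purchase
```

Because ranked holds only the names, indexing it returns a name rather than a whole row.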
You should first search existing posts for your problem.
Second, try this:
library(dplyr)
# keep the row(s) with the largest Sales, then return only the Name column
test %>%
  filter(Sales == max(Sales, na.rm=TRUE)) %>%
  select(Name)
output is:
# A tibble: 1 x 1
Name
<chr>
1 vasu
You can replace max with min to get the customer with the lowest sale, or drop select(Name) to keep both columns.
Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 8 years ago.
I have a dataset of around 1.5 lakh (150,000) observations and 2 variables: name and amount. The same name can appear again and again; for example, the name ABC can appear 50 times in the dataset.
I want a new data frame with two variables, name and total amount, where each name appears only once and total amount is the sum of all amounts for that name in the previous dataset. For example, if ABC appears three times with amount == 1, 2 and 3 respectively in the previous dataset, then in the new dataset ABC will appear only once with total amount == 6.
You can use data.table for big datasets:
library(data.table)
res<- setDT(df)[, list(Total_Amount=sum(amount)), by=name]
Or use dplyr
library(dplyr)
df %>%
group_by(name) %>%
summarise(Total_Amount=sum(amount))
Or as suggested by @hrbrmstr,
count(df, name, wt=amount)
data
set.seed(24)
df <- data.frame(name=sample(LETTERS[1:5], 25, replace=TRUE),
amount=sample(150,25, replace=TRUE))
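A base R version of the same grouped sum is sketched below for cases where neither package is installed. It uses the sample data above. The Total_Amount column name mirrors the other snippets, and res is simply a name chosen for the result.

```r
# Sample data as defined above
set.seed(24)
df <- data.frame(name = sample(LETTERS[1:5], 25, replace = TRUE),
                 amount = sample(150, 25, replace = TRUE))

# One row per name, with the amounts summed
res <- aggregate(amount ~ name, data = df, FUN = sum)
names(res)[2] <- "Total_Amount"
res
```

For very large data the data.table version is likely to be faster, but aggregate is usually fine at the size described in the question.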
Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 8 years ago.
I am fairly new to R/Rstudio and I am still learning how to do certain operations.
I have the following data set. The columns are Operating Region, type of element (CA, OBU), sub-element and Net Revenue (NR).
The data is quite big (50 000 rows), and I want to get a summary of NR by operating region, element and sub-element.
Example
Operating Region   Element   Sub-Element   NR
Asia               CA        CA123         50 000
America            OBU       EFK456        35 000
Could someone please guide me on how to accomplish this?
Any relevant readings/examples would be much appreciated.
Using the data below to return the data frame object "data," you can use the dplyr package to organize results in many different ways. Here is one example:
data <- data.frame("OperatingRegion" = c("Asia", "America"), "Region" = c("CA", "OBU"), "Element" = c("CA123", "EFK456"), "SubElement" = c(50000, 35000))
library(dplyr)
results <- data %>%
  group_by(OperatingRegion) %>%
  summarise(SubE = sum(SubElement, na.rm = TRUE))
Source: local data frame [2 x 2]
OperatingRegion SubE
1 America 35000
2 Asia 50000
After loading the package, you pass dplyr the data frame and then, using the pipe operator %>% (the older %.% operator has since been removed from dplyr), group_by whatever single or multiple variables you want. Then call summarise to create sums, medians, averages or whatever computation you want.
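Grouping by several variables at once, as the question asks, can also be sketched in base R with aggregate. The column names below follow the layout in the question (Operating Region, Element, Sub-Element, NR) rather than the names used in the snippet above, and the rows are made up for illustration.

```r
# Column names follow the question's layout; the rows are made up
sales <- data.frame(
  OperatingRegion = c("Asia", "Asia", "America", "America"),
  Element = c("CA", "CA", "OBU", "OBU"),
  SubElement = c("CA123", "CA123", "EFK456", "EFK789"),
  NR = c(50000, 10000, 35000, 5000),
  stringsAsFactors = FALSE
)

# Net revenue summed for every region / element / sub-element combination
by_all <- aggregate(NR ~ OperatingRegion + Element + SubElement,
                    data = sales, FUN = sum)

# Net revenue summed per region only
by_region <- aggregate(NR ~ OperatingRegion, data = sales, FUN = sum)

by_all
by_region
```

Adding or removing terms on the right-hand side of the formula changes the level of the summary.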