Get 30 occurrences in dataframe [closed] - r

Good morning everyone,
I have a small issue regarding a dataframe:
I have 165 different countries, some with more than 30 occurrences. I would like to keep only 30 occurrences for each country and then apply the mean function to the related variables.
Do you have any idea how I can achieve this?
Here is the dataframe :
Thanks for your answer,
Rémi

Assuming you want to draw 30 rows for each group, we can do the following. Unfortunately, dplyr's sample_n cannot handle an input data frame that has fewer rows than you want to sample (unless you sample with replacement).
Where df is your data.frame:
Solution 1:
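library(dplyr)
df %>%
  group_by(Nationality) %>%
  sample_n(30, replace = TRUE) %>%
  # remove repeated rows where nationalities have fewer than 30 rows
  # (note: larger groups can also lose repeated draws here)
  distinct() %>%
  # funs() is deprecated in current dplyr, so pass the function directly
  summarise_at(vars(Age, Overall, Passing), mean)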
Solution 2:
df %>%
  split(.$Nationality) %>%
  lapply(function(x) {
    # keep small groups whole; sample 30 rows from the rest
    if (nrow(x) < 30)
      return(x)
    x %>% sample_n(30, replace = FALSE)
  }) %>%
  do.call(what = bind_rows) %>%
  group_by(Nationality) %>%
  summarise_at(vars(Age, Overall, Passing), mean)
Untested, of course, since you did not supply a reproducible example.
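For comparison, here is a minimal base R sketch of the same idea: keep at most 30 randomly chosen rows per country, then average. The toy data frame and the helper name take_up_to are made up for illustration and are not from the question.

```r
# Toy data (hypothetical): country A has 40 rows, country B only 10.
set.seed(1)
df <- data.frame(
  Nationality = rep(c("A", "B"), times = c(40, 10)),
  Age = sample(18:40, 50, replace = TRUE)
)

# Hypothetical helper: keep at most n randomly chosen rows of x.
take_up_to <- function(x, n = 30) {
  x[sample.int(nrow(x), min(n, nrow(x))), , drop = FALSE]
}

# Apply the helper to each country and stack the pieces again.
sampled <- do.call(rbind, lapply(split(df, df$Nationality), take_up_to))

# Rows kept per country, then the mean of Age within each country.
rows_kept <- table(sampled$Nationality)
mean_age <- aggregate(Age ~ Nationality, data = sampled, FUN = mean)
```

Because the sample size is capped at the group size, small countries are kept whole and no sampling with replacement is needed.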

Related

How to create new column that compares 2 other columns [closed]

I have a dataset in R with samples (ID) observed over 2 years for one variable (Majorclade). I want to see how the major clade has changed over the 2 years for each sample. I would like to create a column that compares the two: 0 if the clade is the same, 1 if it is different. I imagine some kind of mutate would do it, but I can't figure it out. Any ideas?
Table example:
We can use
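library(dplyr)
df1 %>%
  group_by(ID) %>%
  # 1 if the ID has more than one distinct Majorclade, else 0;
  # the unary + converts the logical to an integer
  mutate(new = +(n_distinct(Majorclade) > 1)) %>%
  ungroup()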
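The same flag can be sketched in base R with ave(), which may help to see what the grouped mutate computes. The toy data (IDs, years, clades) is invented for illustration.

```r
# Toy data (hypothetical): sample s1 keeps its clade, s2 changes.
df1 <- data.frame(
  ID = c("s1", "s1", "s2", "s2"),
  Year = c(2019, 2020, 2019, 2020),
  Majorclade = c("A", "A", "A", "B")
)

# For each ID, count the distinct clades among that ID's rows.
n_clades <- ave(seq_along(df1$ID), df1$ID,
                FUN = function(i) length(unique(df1$Majorclade[i])))

# More than one distinct clade means the sample changed.
df1$new <- as.integer(n_clades > 1)
```

Every row of an ID gets the same value, so both rows of s2 are flagged with 1.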

calculating hourly average based on condition of the other column in r [closed]

I have a data frame with 3 columns:
first: date_time, one observation every 10 minutes,
second: temp,
third: quality check (observations that are not acceptable are NA)
I want to calculate the hourly average, with one condition: for every hour that has more than 2 NAs in the quality check column (out of the six observations in that hour), the corresponding average should be NA. How can I do that? I wrote this code, but I don't know how to take the quality column into account:
df %>%
  mutate(date = date(date_time), hour = hour(date_time)) %>%
  group_by(date, hour) %>%
  summarise(m = mean(temp))
We can use an if/else condition
library(dplyr)
library(lubridate)
df %>%
  mutate(date = as.Date(date_time), hour = hour(date_time)) %>%
  group_by(date, hour) %>%
  summarise(m = if (sum(is.na(quality)) > 2) NA_real_
                else mean(temp, na.rm = TRUE))
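To check the rule on something small, here is a base R sketch with made-up data: two hours of readings, where the first hour has one NA quality flag and the second has three. The values and the hour_id name are assumptions for illustration.

```r
# Toy data (hypothetical): two hours of 10-minute readings.
date_time <- seq(as.POSIXct("2021-01-01 00:00:00", tz = "UTC"),
                 by = "10 min", length.out = 12)
df <- data.frame(
  date_time = date_time,
  temp = c(1, 2, 3, 4, 5, 6, 10, 20, 30, 40, 50, 60),
  # Hour 00 has one NA in quality, hour 01 has three.
  quality = c(1, NA, 1, 1, 1, 1, NA, NA, NA, 1, 1, 1)
)

# Label each row with its date and hour, e.g. "2021-01-01 00".
hour_id <- format(df$date_time, "%Y-%m-%d %H")

# Per hour: NA if more than two quality flags are missing, else the mean.
hourly <- sapply(split(df, hour_id), function(h) {
  if (sum(is.na(h$quality)) > 2) NA_real_ else mean(h$temp, na.rm = TRUE)
})
```

Like the answer above, this averages every temp value in an accepted hour, including readings whose quality flag is NA; filter those rows first if they should be excluded.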

How to find a conditional mean using R [closed]

I would like to find the mean of wages for males and females. How do I compute the mean of the wages in the female column and of the wages in the male column?
Please always ask your question in a reproducible fashion.
See here: https://stackoverflow.com/help/minimal-reproducible-example
Regarding your question, I'm using the mtcars dataset as an example. Assuming you have all female values in one column and all male values in another (like mpg and cyl here), you could use the tidyverse:
library(tidyverse)
data(mtcars)
mtcars %>%
  summarise(across(.cols = c(mpg, cyl), .fns = mean))
or for your question, assuming that your dataset is called df and your columns are called female and male:
df %>%
  summarise(across(.cols = c(female, male), .fns = mean))
If, however, your data was organised differently and you had gender in one separate column and e.g. the value you want to take the mean for in a column called value, then you should do:
mtcars %>%
  group_by(vs) %>%
  summarise(resulting_mean_mpg = mean(mpg))
Where we have calculated the mean mpg by vs.
In your case this might be
df %>%
  group_by(gender) %>%
  summarise(resulting_mean_value = mean(value))
Hope this helps!
We can use colMeans in base R after selecting the columns of interest
colMeans(df1[c('male', 'female')], na.rm = TRUE)
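The two layouts discussed in these answers can be compared side by side in base R. The wage values below are invented; only the column names female, male, gender and value come from the answers.

```r
# Long layout (hypothetical data): one wage per row plus a gender column.
df <- data.frame(
  gender = c("female", "male", "female", "male"),
  value = c(10, 20, 30, 40)
)
means_long <- tapply(df$value, df$gender, mean)

# Wide layout (same hypothetical wages): one column per gender.
wide <- data.frame(female = c(10, 30), male = c(20, 40))
means_wide <- colMeans(wide, na.rm = TRUE)
```

Both calls return a named numeric vector with one mean per gender, so the choice mostly depends on how the data is already organised.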

R programming- find lowest value [closed]

I've just started learning R. I want to know how I can find the lowest value in one column for each unique value in another column. For example, in this case I want to know the lowest average price per year.
I have a data frame with about 7 columns, 2 of them being average price and year. The year recurs and ranges from 2000 to 2009. The data also has various NAs in different columns.
I have very little idea how to run a loop or anything similar for this.
Thank you :)
My data set looks something like this:
avgprice  year
332       2002
NA        2009
5353      2004
1234      NA
and so on.
To break my problem down: I want to find the five lowest values for each year from 2000 to 2004.
s<-subset(tx.house.sales,na.rm=TRUE,select=c(avgprice,year)
s2<-subset(s,year==2000)
s3<-arrange(s2)
tail(s2,5)
I know the code fails miserably. I wanted to first subset my data frame to year and avgprice, then sort it for each year from 2000 to 2004, and use tail() to print the lowest five. However, I also want to ignore the NAs.
You could try
aggregate(averageprice~year, df1, FUN=min)
Update
If you need to get 5 lowest "averageprice" per "year"
library(dplyr)
df1 %>%
  group_by(year) %>%
  arrange(averageprice) %>%
  slice(1:5)
Or you could use rank in place of arrange
df1 %>%
  group_by(year) %>%
  filter(rank(averageprice, ties.method = 'min') %in% 1:5)
This could be also done with aggregate, but the 2nd column will be a list
aggregate(averageprice ~ year, df1, FUN = function(x)
  head(sort(x), 5), na.action = na.pass)
data
set.seed(24)
df1 <- data.frame(year = sample(2002:2008, 50, replace = TRUE),
                  averageprice = sample(c(NA, 80:160), 50, replace = TRUE))
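A base R variant of the "five lowest per year" step, as a sketch on the same simulated data. It drops the NA prices explicitly before sorting, which is one possible reading of "ignore the NAs"; the names ok and lowest5 are made up here.

```r
# Same simulated data as in the answer above.
set.seed(24)
df1 <- data.frame(year = sample(2002:2008, 50, replace = TRUE),
                  averageprice = sample(c(NA, 80:160), 50, replace = TRUE))

# Drop rows with a missing price, then sort by year and price.
ok <- df1[!is.na(df1$averageprice), ]
ok <- ok[order(ok$year, ok$averageprice), ]

# Keep the first five rows (the five lowest prices) of each year.
lowest5 <- do.call(rbind, lapply(split(ok, ok$year), head, n = 5))
```

Years with fewer than five non-missing prices simply return fewer rows.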

How to separate data based on different variable values [closed]

I have a dataset of around 1.5 lakh (150,000) observations and 2 variables: name and amount. The same name can appear again and again; for example, the name ABC can appear 50 times in the dataset.
I want a new data frame with two variables: name and total amount, where each name appears only once and total amount is the sum of all its amounts in the original dataset. For example, if ABC appears three times with amount == 1, 2 and 3 respectively in the original dataset, then in the new dataset ABC will appear only once with total amount == 6.
You can use data.table for big datasets:
library(data.table)
res <- setDT(df)[, list(Total_Amount = sum(amount)), by = name]
Or use dplyr
library(dplyr)
df %>%
  group_by(name) %>%
  summarise(Total_Amount = sum(amount))
Or as suggested by @hrbrmstr,
count(df, name, wt = amount)
data
set.seed(24)
df <- data.frame(name = sample(LETTERS[1:5], 25, replace = TRUE),
                 amount = sample(150, 25, replace = TRUE))
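If neither package is available, base R's aggregate() gives the same totals. This is a small sketch using the ABC example from the question; the XYZ row is made up for illustration.

```r
# Toy data: ABC appears three times with amounts 1, 2 and 3 (as in the
# question); XYZ is a hypothetical second name.
df <- data.frame(name = c("ABC", "XYZ", "ABC", "ABC"),
                 amount = c(1, 5, 2, 3))

# Sum amount within each name; one row per unique name.
totals <- aggregate(amount ~ name, data = df, FUN = sum)
names(totals)[2] <- "Total_Amount"
```

Here ABC ends up with a Total_Amount of 6, matching the example in the question.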
