How to find the top highest numbers in R

I'm new to R coding and I'm looking for code to answer this question: display the city name and the total attendance of the five top-attendance stadiums. I have a dataframe called worldcupmatches. If anyone can help me out, please do.

Since you have not provided a subset of your data (which is strongly recommended), I will create a tiny dataset with city names and attendance figures like so:
df <- data.frame(city = c("London", "Liverpool", "Manchester", "Birmingham"),
                 attendance = c(2390, 1290, 8734, 5433))
Then your problem is easily solved. For example, one base R approach is:
df[order(df$attendance, decreasing = TRUE), ]
You could also use dplyr, which makes things look a little tidier:
library(dplyr)
df %>% arrange(desc(attendance))
The output of both methods is your original data, ordered from the highest to the lowest attendance:
city attendance
3 Manchester 8734
4 Birmingham 5433
1 London 2390
2 Liverpool 1290
If you specifically want to display a certain number of cities (or stadiums) with the highest attendance, you could do:
df[order(df$attendance, decreasing = TRUE), ][1:3, ] # 1:3 takes the top 3 stadiums
city attendance
3 Manchester 8734
4 Birmingham 5433
1 London 2390
Again, the dplyr approach looks tidier:
df %>% slice_max(n = 3, order_by = attendance)
city attendance
1 Manchester 8734
2 Birmingham 5433
3 London 2390
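To connect this back to the original worldcupmatches question: the question does not show that dataframe's structure, so the column names `city` and `attendance` below are assumptions you may need to adjust. A base R sketch of "top five stadiums" would then be:

```r
# Stand-in for the asker's dataframe; column names are assumed, not confirmed.
worldcupmatches <- data.frame(
  city = c("London", "Liverpool", "Manchester", "Birmingham", "Leeds", "Bristol"),
  attendance = c(2390, 1290, 8734, 5433, 7120, 3001)
)

# Order by attendance (highest first), keep only the two columns of interest,
# then take the first five rows.
top5 <- head(worldcupmatches[order(worldcupmatches$attendance, decreasing = TRUE),
                             c("city", "attendance")], 5)
top5
```

With dplyr the equivalent would be `worldcupmatches %>% slice_max(n = 5, order_by = attendance) %>% select(city, attendance)`.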

Related

Combine two rows in R that are separated

I am trying to clean the dataset so that all data is in its appropriate cell by combining rows that are oddly separated. One nuance of the dataset is that some rows are coded correctly and some are not.
Here is an example of the data:
Rank  Store  City           Address                      Transactions  Avg. Value  Dollar Sales Amt.
40    1404   State College  Hamilton Square Shop Center
                            230 W Hamilton Ave           155548        52.86       8263499
41    2310   Springfield    149 Baltimore Pike           300258        27.24       8211137
42    2514   Erie           Yorktown Centre
                            2501 West 12th Street        190305        41.17       7862624
Here is an example of how I want the data:
Rank  Store  City           Address                                          Transactions  Avg. Value  Dollar Sales Amt.
40    1404   State College  Hamilton Square Shop Center, 230 W Hamilton Ave  155548        52.86       8263499
41    2310   Springfield    149 Baltimore Pike                               300258        27.24       8211137
42    2514   Erie           Yorktown Centre, 2501 West 12th Street           190305        41.17       7862624
Is there an Excel or R function to fix this, or does anyone know how to write an R function to correct this?
I read up on the CONCATENATE function in Excel and realized it was not going to accomplish anything, so I figured an R function would be the only way to fix this.
CONCATENATE can work here; alternatively, in Excel you can select the columns and use the merge/concatenation formulas under the Formulas tab to complete the task.
I recommend checking how the file is being parsed. From the example data you provided, it looks like the address column is being split on ", " and spilling onto the next line. Based on this assumption alone, below is a potential solution using the tidyverse:
library(tidyverse)

original_data <- tibble(Rank = c(40, NA, 41, 42, NA),
                        Store = c(1404, NA, 2310, 2514, NA),
                        City = c("State College", NA, "Springfield", "Erie", NA),
                        Address = c("Hamilton Square Shop Center",
                                    "230 W Hamilton Ave", "149 Baltimore Pike",
                                    "Yorktown Centre", "2501 West 12th Street"),
                        Transactions = c(NA, 155548, 300258, NA, 190305),
                        `Avg. Value` = c(NA, 52.86, 27.24, NA, 41.17),
                        `Dollar Sales Amt.` = c(NA, 8263499, 8211137, NA, 7862624))

new_data <- original_data %>%
  fill(Rank:City) %>%
  group_by_at(vars(Rank:City)) %>%
  mutate(Address1 = lag(Address)) %>%
  slice(n()) %>%
  ungroup() %>%
  mutate(Address = if_else(is.na(Address1), Address,
                           str_c(Address1, Address, sep = ", "))) %>%
  select(Rank:`Dollar Sales Amt.`)

new_data
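For completeness, the same fill-and-paste idea can be sketched in base R without the tidyverse. This uses a simplified version of the frame above (Store and the money columns omitted for brevity); a new record is assumed to start wherever Rank is non-NA:

```r
# Simplified stand-in for the example data above.
df <- data.frame(
  Rank = c(40, NA, 41, 42, NA),
  City = c("State College", NA, "Springfield", "Erie", NA),
  Address = c("Hamilton Square Shop Center", "230 W Hamilton Ave",
              "149 Baltimore Pike", "Yorktown Centre", "2501 West 12th Street"),
  Transactions = c(NA, 155548, 300258, NA, 190305)
)

grp <- cumsum(!is.na(df$Rank))  # group id: increments at each new record

merged <- do.call(rbind, lapply(split(df, grp), function(g) {
  out <- g[nrow(g), ]                               # last row holds the numeric fields
  out$Rank <- g$Rank[1]                             # first row holds the identifiers
  out$City <- g$City[1]
  out$Address <- paste(g$Address, collapse = ", ")  # join the address pieces
  out
}))
merged
```

The `cumsum(!is.na(...))` trick is doing the same job as `fill()` in the tidyverse answer: it assigns every continuation line to the record that started above it.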

Combine every two rows of data in R

I have a csv file that I have read in, and I now need to combine every two rows, reducing the 2000 rows to 1000. Each pair of rows has the same account number in one column and the address split across the two rows in another, so each observation takes up two rows. For example, rows 1 and 2 are Acct# 1234 with "123 Hollywood Blvd" and "LA California 90028" on their own lines, and I want to combine the two address rows into one.
Using the tidyverse, you can group_by the Acct number and summarise with str_c:
library(tidyverse)
df %>%
  group_by(Acct) %>%
  summarise(Address = str_c(Address, collapse = " "))
# A tibble: 2 × 2
Acct Address
<dbl> <chr>
1 1234 123 Hollywood Blvd LA California 90028
2 4321 55 Park Avenue NY New York State 6666
Data:
df <- data.frame(
  Acct = c(1234, 1234, 4321, 4321),
  Address = c("123 Hollywood Blvd", "LA California 90028",
              "55 Park Avenue", "NY New York State 6666")
)
It can be fairly simple with the data.table package:
# assuming `dataset` is the name of your dataset, the account-number column is 'actN' and the address column is 'adr'
library(data.table)
dataset2 <- data.table(dataset)[, .(whole = paste0(adr, collapse = ", ")), by = .(actN)]
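If you'd rather avoid extra packages, base R's aggregate() can do the same collapse. This sketch reuses the df from the dplyr answer above:

```r
df <- data.frame(
  Acct = c(1234, 1234, 4321, 4321),
  Address = c("123 Hollywood Blvd", "LA California 90028",
              "55 Park Avenue", "NY New York State 6666")
)

# One row per account, with the two address lines pasted together.
combined <- aggregate(Address ~ Acct, data = df, FUN = paste, collapse = " ")
combined
```

The `collapse = " "` argument is forwarded to paste() through aggregate()'s `...`, so each group's address lines come back as a single string.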

R using melt() and dcast() with categorical and numerical variables at the same time

I am a newbie in programming with R, and this is my first question ever here on Stackoverflow.
Let's say that I have a data frame with 4 columns:
(1) Individual ID (numeric);
(2) Morality of the individual (factor);
(3) The city (factor);
(4) Numbers of books possessed (numeric).
Person_ID <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
Morality <- c("Bad guy", "Bad guy", "Bad guy", "Bad guy", "Bad guy",
              "Good guy", "Good guy", "Good guy", "Good guy", "Good guy")
City <- c("NiceCity", "UglyCity", "NiceCity", "UglyCity", "NiceCity",
          "UglyCity", "NiceCity", "UglyCity", "NiceCity", "UglyCity")
Books <- c(0, 3, 6, 9, 12, 15, 18, 21, 24, 27)
mydf <- data.frame(Person_ID, City, Morality, Books)
I am using this code (with reshape2 and dplyr loaded) in order to get the counts of each category of Morality in each city:
library(reshape2)
library(dplyr)
mycounts <- melt(mydf,
                 id.vars = c("City"),
                 measure.vars = c("Morality")) %>%
  dcast(City ~ variable + value,
        value.var = "value", fill = 0, fun.aggregate = length)
After cleaning up the column names, the code gives this kind of table with the counts:
names(mycounts)<-gsub("Morality_","",names(mycounts))
mycounts
City Bad guy Good guy
1 NiceCity 3 2
2 UglyCity 2 3
I wonder if there is a similar way to use dcast() for numerical variables (inside the same script), e.g. to get the sum of the Books possessed by all individuals living in each city:
#> City Bad guy Good guy Books
#>1 NiceCity 3 2 [Total number of books in NiceCity]
#>2 UglyCity 2 3 [Total number of books in UglyCity]
Do you mean something like this:
mydf %>%
  melt(id.vars = c("City"),
       measure.vars = c("Morality")) %>%
  dcast(City ~ variable + value,
        value.var = "Books",
        fill = 0,
        fun.aggregate = sum)
#> City Morality_Bad guy Morality_Good guy
#> 1 NiceCity 18 42
#> 2 UglyCity 12 63
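If what you are after is the count table plus a single Books total per city, as in the desired output sketched in the question, one way is to build the counts and the sums separately and bind them. This is only a sketch, written in base R to keep it package-free:

```r
Person_ID <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
Morality <- c("Bad guy", "Bad guy", "Bad guy", "Bad guy", "Bad guy",
              "Good guy", "Good guy", "Good guy", "Good guy", "Good guy")
City <- c("NiceCity", "UglyCity", "NiceCity", "UglyCity", "NiceCity",
          "UglyCity", "NiceCity", "UglyCity", "NiceCity", "UglyCity")
Books <- c(0, 3, 6, 9, 12, 15, 18, 21, 24, 27)
mydf <- data.frame(Person_ID, City, Morality, Books)

# Counts of each Morality level per city, as a data frame.
counts <- as.data.frame.matrix(table(mydf$City, mydf$Morality))
# Total books per city, aligned to the same row order as the counts.
counts$Books <- tapply(mydf$Books, mydf$City, sum)[rownames(counts)]
counts
```

Indexing the tapply() result by `rownames(counts)` guards against the two tables coming back in different city orders.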

How to identify observations with multiple matching patterns and create another variable in R?

I am trying to create a broad industry category from detailed categories in my data, and I am wondering where I am going wrong in creating it with grepl in R.
My example data is as follows:
df <- data.frame(county = c(01001, 01002, 02003, 04004, 08005, 01002, 02003, 04004),
                 ind = c("0700", "0701", "0780", "0980", "1000", "1429", "0840", "1500"))
I am trying to create a variable called industry with 2 levels (e.g., agri, manufacturing) with the help of grepl or str_replace commands in R.
I have tried this:
newdf$industry <- ""
newdf[df$ind %>% grepl(c("^07|^08|^09", levels(df$ind), value = TRUE)), "industry"] <- "Agri"
But this gives me the following warning:
argument 'pattern' has length > 1 and only the first element will be used
I want to get the following dataframe as my result:
newdf <- data.frame(county = c(01001, 01002, 02003, 04004, 08005, 01002, 02003, 04004),
                    ind = c("0700", "0701", "0780", "0980", "1000", "1429", "0840", "1500"),
                    industry = c("Agri", "Agri", "Agri", "Agri", "Manufacturing",
                                 "Manufacturing", "Agri", "Manufacturing"))
So my question is: how do I specify that if the variable 'ind' starts with 07, 08 or 09, the industry variable takes the value 'Agri', and if 'ind' starts with 10, 14 or 15, it takes 'Manufacturing'? Needless to say, there is a huge list of industry codes that I am trying to crunch into 10 categories, so I am looking for a solution based on pattern recognition.
Any help is appreciated! Thanks!
Try this:
library(dplyr)
library(stringr)

newdf <- df %>%
  mutate(industry = ifelse(str_detect(string = ind,
                                      pattern = '^07|^08|^09'),
                           'Agri',
                           'Manufacturing'))
This works, using ifelse() to add the desired column to the df data.frame:
df$industry <- ifelse(grepl(paste0("^", c('07','08','09'), collapse = "|"), df$ind), "Agri", "Manufacturing")
> df
county ind industry
1 1001 0700 Agri
2 1002 0701 Agri
3 2003 0780 Agri
4 4004 0980 Agri
5 8005 1000 Manufacturing
6 1002 1429 Manufacturing
7 2003 0840 Agri
8 4004 1500 Manufacturing
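Since you mention crunching a huge code list into about 10 categories: a named lookup vector keyed on the two-digit prefix scales better than nested ifelse() calls or long regex alternations. This is only a sketch using the two categories from the question; extend the lookup with your remaining prefixes:

```r
df <- data.frame(county = c(01001, 01002, 02003, 04004, 08005, 01002, 02003, 04004),
                 ind = c("0700", "0701", "0780", "0980", "1000", "1429", "0840", "1500"))

# Map each two-digit prefix to its broad industry; add more prefixes as needed.
lookup <- c("07" = "Agri", "08" = "Agri", "09" = "Agri",
            "10" = "Manufacturing", "14" = "Manufacturing", "15" = "Manufacturing")

df$industry <- unname(lookup[substr(df$ind, 1, 2)])
df
```

Prefixes missing from the lookup come back as NA, which makes uncategorized codes easy to spot.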

Identify change in categorical data across datapoints in R

I need help with some data manipulation in R. My dataset looks something like this:
Name, country, age
Smith, Canada, 27
Avin, India, 25
Smith, India, 27
Robin, France, 28
Now I want to identify the number of changes that "Smith" has gone through (two), based on the combination of Name and country only.
Basically, I want to compare each datapoint with the others and count how many changes there have been in the entire dataset for each combination of Name and Country.
You can do this by comparing each combination with its lagged value, by group, using dplyr:
library(dplyr)
df %>%
  group_by(Name) %>%
  mutate(combination = paste(country, age),
         lag_combination = lag(combination, default = 0, order_by = Name),
         Changes = cumsum(combination != lag_combination)) %>%
  slice(n()) %>%
  select(Name, Changes)
Result:
# A tibble: 3 x 2
# Groups: Name [3]
Name Changes
<fctr> <int>
1 Avin 2
2 Robin 1
3 Smith 3
Notes:
dplyr::lag does not respect group_by(Name), so you need to add order_by = Name to lag within each Name.
I'm setting default = 0 so that the first entry of each group is not NA.
Data:
df = read.table(text="Name, country, age
Smith, Canada, 27
Avin, India, 25
Smith, India, 27
Robin, France, 28
Smith, Canada, 27
Robin, France, 28
Avin, France, 26", header = TRUE, sep = ',')
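Note that the approach above counts the first appearance of each person as a change (hence Smith 3 for the seven-row data). On one reading of the question, which asks for changes based on Name and country only and expects two for Smith, you would count only the transitions between consecutive country values per person. A base R sketch of that reading, over the same df:

```r
df <- read.table(text = "Name, country, age
Smith, Canada, 27
Avin, India, 25
Smith, India, 27
Robin, France, 28
Smith, Canada, 27
Robin, France, 28
Avin, France, 26", header = TRUE, sep = ',')

# For each Name, compare every country with the previous one and count mismatches.
changes <- tapply(df$country, df$Name, function(x) sum(head(x, -1) != tail(x, -1)))
changes
```

Smith (Canada, India, Canada) gives 2 transitions, Avin (India, France) gives 1, and Robin (France, France) gives 0.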
