How to fill in blanks based off another column's values in R

Here is the code and resulting dataframe
category <- c("East","BLANK","NorthEast","BLANK","BLANK")
subcat <- c("East","North","SW","NE","SE")
data1 <- as.data.frame(category)
data1$subcat <- subcat
**Category** **Subcat**
East East
BLANK North
NorthEast SW
BLANK NE
BLANK SE
The East category contains subcat East and North. The NorthEast category contains subcat SW,NE,SE.
As you can see, there are blanks in the category column. How would I make it so that the 2nd value in Category is East and the 4th and 5th rows are NorthEast? I have many more rows in the actual data, so a general way to do this would be helpful.
The result should be
**Category** **Subcat**
East East
*East* North
NorthEast SW
*NorthEast* NE
*NorthEast* SE

Here is a dplyr-only solution:
First we start a new group at every string that is not BLANK,
then replace all group members with the group's first value:
library(dplyr)
data1 %>%
group_by(x = cumsum(category != "BLANK")) %>%
mutate(category = first(category)) %>%
ungroup() %>%
select(-x)
category subcat
<chr> <chr>
1 East East
2 East North
3 NorthEast SW
4 NorthEast NE
5 NorthEast SE
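
To see why grouping on cumsum works, the sketch below reproduces the same fill in base R. It assumes the first row is never BLANK (otherwise the run index would start at 0 and the lookup would come back too short). The names is_filled, run_id and filled are illustrative only and do not appear in the answer above.

```r
# Rebuild the example data from the question
category <- c("East", "BLANK", "NorthEast", "BLANK", "BLANK")
subcat <- c("East", "North", "SW", "NE", "SE")
data1 <- data.frame(category, subcat, stringsAsFactors = FALSE)

# TRUE wherever a real category label appears
is_filled <- data1$category != "BLANK"

# cumsum() increases by one at each real label, so every row
# gets the index of the most recent label at or above it
run_id <- cumsum(is_filled)

# Look up each row's label among the non-blank values
filled <- data1$category[is_filled][run_id]
print(filled)
```

The run_id vector plays the same role as the temporary x column in the dplyr pipeline: rows that share an index belong to the same category.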

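We could convert the 'BLANK' values to NA and use fill
library(tidyr)
library(dplyr)
data1 %>%
# na_if() is applied to the column here; recent dplyr versions reject a whole data frame
mutate(category = na_if(category, "BLANK")) %>%
fill(category)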
-output
category subcat
1 East East
2 East North
3 NorthEast SW
4 NorthEast NE
5 NorthEast SE

Related

Finding rows that have the minimum of a specific factor group

I am attempting to find the minimum incomes in the state.x77 dataset, grouped by the state.region variable.
df1 <- data.frame(state.region,state.x77,row.names = state.name)
tapply(state.x77,state.region,min)
I am trying to get it to output which state has the lowest income in each region, e.g. for the South it should name the one state with the lowest income. I'm trying to use tapply, but I keep getting an error saying
Error in tapply(state.x77, state.region, min) :
arguments must have same length
What is the issue?
Here is a solution. First extract the vector of incomes and turn it into a named vector. Then use tapply to get the names of the states with the minimum incomes.
state <- setNames(state.x77[, "Income"], rownames(state.x77))
tapply(state, state.region, function(x) names(x)[which.min(x)])
# Northeast South North Central West
# "Maine" "Mississippi" "South Dakota" "New Mexico"
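
As for why the original call fails, here is a small sketch, assuming the built-in datasets are attached as usual. tapply needs one value per state, but state.x77 is a whole matrix, so its length is the number of cells rather than the number of states. Selecting the Income column first gives two vectors of matching length. The names income and min_income are illustrative only.

```r
# state.x77 is a 50 x 8 matrix, so it has 400 elements in total
print(length(state.x77))

# state.region has one entry per state
print(length(state.region))

# Picking a single column yields one value per state,
# which tapply can then split by region
income <- state.x77[, "Income"]
min_income <- tapply(income, state.region, min)
print(min_income)
```

This returns the minimum income per region; the named-vector version above goes one step further and returns the state names instead.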
The following, more complicated, code will output state names, regions and incomes.
df1 <- data.frame(
State = rownames(state.x77),
Income = state.x77[, "Income"],
Region = state.region
)
merge(aggregate(Income ~ Region, df1, min), df1)[c(3, 1, 2)]
# State Region Income
#1 South Dakota North Central 4167
#2 Maine Northeast 3694
#3 Mississippi South 3098
#4 New Mexico West 3601
And another solution with aggregate but avoiding merge.
agg <- aggregate(Income ~ Region, df1, min)
i <- match(agg$Income, df1$Income)
data.frame(
State = df1$State[i],
Region = df1$Region[i],
Income = df1$Income[i]
)
# State Region Income
#1 Maine Northeast 3694
#2 Mississippi South 3098
#3 South Dakota North Central 4167
#4 New Mexico West 3601
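You can also use this solution:
library(dplyr)
library(tibble)
# state2 is not defined in the original answer; assumed here to be state.x77 as a data frame
state2 <- as.data.frame(state.x77)
state2 %>%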
rownames_to_column() %>%
bind_cols(state.region) %>%
rename(State = rowname,
Region = ...10) %>%
group_by(Region, State) %>%
summarise(Income = sum(Income)) %>%
arrange(desc(Income)) %>%
slice_tail(n = 1)
# A tibble: 4 x 3
# Groups: Region [4]
Region State Income
<fct> <chr> <dbl>
1 Northeast Maine 3694
2 South Mississippi 3098
3 North Central South Dakota 4167
4 West New Mexico 3601

Recode by comparing a value to numbers in a vector

I want to code the values in a column into fewer values in another column.
For example,
if the value in zipcode column is one of the following c(90272,90049,90077,90210,90046,90069,90024,90025,90048,90036,90038,90028),
code it as "west" in district column.
How can I do it in R?
You can use the ifelse() function.
Set up the data in a dataframe:
df <- data.frame(zipcode = c(90272,90049,90077,90210,90046,90069,90024,90025,90048,90036,90038,90028))
Then use ifelse() to code a new value based on the values of zipcode.
df$district <- ifelse(df$zipcode %in% c(90272,90049,90077,90210,90046,90069,90024,90025,90048,90036,90038,90028),
"west",
NA)
> df
zipcode district
1 90272 west
2 90049 west
3 90077 west
4 90210 west
5 90046 west
6 90069 west
7 90024 west
8 90025 west
9 90048 west
10 90036 west
11 90038 west
12 90028 west
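
A short sketch of what happens to a zipcode that is not in the list. The value 10001 is made up for illustration and is not part of the question's data, and west_zips is just a convenience name for the vector from the question.

```r
# Zipcodes that should be coded as "west" (from the question)
west_zips <- c(90272, 90049, 90077, 90210, 90046, 90069,
               90024, 90025, 90048, 90036, 90038, 90028)

# Two listed zipcodes plus one hypothetical non-matching value
df <- data.frame(zipcode = c(90272, 90210, 10001))

# %in% returns TRUE/FALSE per row; rows that do not match get NA
df$district <- ifelse(df$zipcode %in% west_zips, "west", NA)
print(df)
```

Storing the vector under a name avoids repeating the long list, and further districts could be handled by nesting another ifelse() in place of the NA.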

Adding overall mean when using group_by

I am using the dplyr package to generate some tables, and I'm making use of the adorn_totals("row") function from the janitor package.
This works fine when I want to sum values within the groups, however in some cases I want an overall mean instead of a sum. Is there an adorn_means function?
Sample code:
Regions2 <- Data %>%
filter(!is.na(REGION))%>%
group_by(REGION) %>%
summarise(Numberofpeople=length(Names))%>%
adorn_totals("row")
Here my "total" row is simply the sum of all people within the regions. This gives me
REGION NumberofPeople
East Midlands 578,943
East of England 682,917
London 1,247,540
North East 245,830
North West 742,886
South East 963,040
South West 623,684
West Midlands 653,335
Yorkshire 553,853
TOTAL 6,292,028
My next piece of code generates an average salary for each region, but I want to add an overall average for the total
Regions3 <- Data %>%
filter(!is.na(REGION))%>%
filter(!is.na(AVGSalary))%>%
group_by(REGION) %>%
summarise(AverageSalary=mean(AVGSalary))
If I use adorn_totals("row") as before, I simply get the sum of the averages, not the overall average for the dataset.
How do I get the overall average?
UPDATE with some noddy data:
Data
people region salary
person1 London 1000
person2 South West 1050
person3 South East 900
person4 London 800
person5 Scotland 1020
person6 South West 750
person7 East 600
person8 London 1200
person9 South West 1150
The group averages are therefore:
London 1000
South West 983.33
South East 900
Scotland 1020
East 600
I want to add the overall total to the bottom
Total 941.11
1) The overall average is the weighted average of the group averages (not the plain average of the averages), i.e. it is 941 and not 901, so we maintain an n column in order to compute the overall average correctly at the end. Although the data shown does not have any NAs, we use drop_na so that the code also works with data that does. This removes any row containing an NA.
library(dplyr)
library(tidyr)
Region %>%
drop_na %>%
group_by(region) %>%
summarize(avg = mean(salary), n = n()) %>%
ungroup %>%
bind_rows(summarize(., region = "Overall Avg",
avg = sum(avg * n) / sum(n),
n = sum(n))) %>%
select(-n)
giving:
# A tibble: 6 x 2
region avg
<chr> <dbl>
1 East 600
2 London 1000
3 Scotland 1020
4 South East 900
5 South West 983.
6 Overall Avg 941.
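
The difference between the weighted and the plain average can be checked with a few lines of base R. This is only a sketch on the nine rows from the question; the names grp_mean, grp_n, weighted and plain are illustrative.

```r
# Salaries and regions from the question's sample data
salary <- c(1000, 1050, 900, 800, 1020, 750, 600, 1200, 1150)
region <- c("London", "South West", "South East", "London", "Scotland",
            "South West", "East", "London", "South West")

# Per-region mean and per-region row count
grp_mean <- tapply(salary, region, mean)
grp_n <- tapply(salary, region, length)

# Weighting each group mean by its size recovers the overall mean
weighted <- sum(grp_mean * grp_n) / sum(grp_n)

# The plain mean of the group means treats every region equally
plain <- mean(grp_mean)

print(c(weighted = weighted, plain = plain, overall = mean(salary)))
```

The weighted value agrees with mean(salary), while the plain value is lower because the single low East salary counts as much as the three London salaries.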
2) Another approach would be to construct the Overall Avg line by going back to the original data:
Region %>%
drop_na %>%
group_by(region) %>%
summarize(avg = mean(salary)) %>%
ungroup %>%
bind_rows(summarize(Region %>% drop_na, region = "Overall Avg", avg = mean(salary)))
giving:
# A tibble: 6 x 2
region avg
<chr> <dbl>
1 East 600
2 London 1000
3 Scotland 1020
4 South East 900
5 South West 983.
6 Overall Avg 941.
2a) If you object to referring to Region twice then try this.
Region_ <- Region %>%
drop_na
Region_ %>%
group_by(region) %>%
summarize(avg = mean(salary)) %>%
ungroup %>%
bind_rows(summarize(Region_, region = "Overall Avg", avg = mean(salary)))
2b) or as a single pipeline where now Region_ is local to the pipeline and will automatically be removed after the pipeline completes:
Region %>%
drop_na %>%
{ Region_ <- .
Region_ %>%
group_by(region) %>%
summarize(avg = mean(salary)) %>%
ungroup %>%
bind_rows(summarize(Region_, region = "Overall Avg", avg = mean(salary)))
}
Note
We used this as the input:
Lines <- "people region salary
person1 London 1000
person2 South West 1050
person3 South East 900
person4 London 800
person5 Scotland 1020
person6 South West 750
person7 East 600
person8 London 1200
person9 South West 1150"
library(gsubfn)
Region <- read.pattern(text = Lines, pattern = "^(\\S+) +(.*) (\\d+)$",
as.is = TRUE, skip = 1, strip.white = TRUE,
col.names = read.table(text = Lines, nrow = 1, as.is = TRUE))
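One option is to add a row with bind_rows. Note that this takes the plain mean of the group means, not the mean of all salaries.
library(dplyr)
Data %>%
group_by(region) %>%
summarise(Avgsalary = mean(salary)) %>%
# tibble() replaces the deprecated data_frame()
bind_rows(tibble(region = 'Total',
Avgsalary = mean(.$Avgsalary, na.rm = TRUE)))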
Or another option is add_row from tibble
Data %>%
group_by(region) %>%
summarise(Avgsalary = mean(salary)) %>%
add_row(region = 'Total', Avgsalary = mean(.$Avgsalary))
If the total should instead be the overall mean of the raw salaries, rather than the mean of the group means, then we need to calculate it before summarising
Data %>%
mutate(Total = mean(salary)) %>%
group_by(region) %>%
summarise(Avgsummary = mean(salary), Total = first(Total)) %>%
add_row(region = 'Total', Avgsummary = .$Total[1]) %>%
select(-Total)

R column mapping

How do I map a column of one CSV file to a column of another CSV file in R, if both have the same data type?
For example, the first column of data frame A consists of some text with a country name in it, while a column of the second data frame B contains a standard list of all countries. Now I have to map all rows of the first data frame to the standard country column.
For example, the column (location) of data frame A consists of 10000 rows of data like this
Sydney, Australia
Aarhus C, Central Region, Denmark
Auckland, New Zealand
Mumbai Area, India
Singapore
df1 <- data.frame(col1 = 1:5, col2=c("Sydney, Australia", "Aarhus C, Central Region, Denmark", "Auckland, New Zealand", "Mumbai Area, India", "Singapore"))
Now I have another column (country) of data frame B as
India
USA
New Zealand
UK
Singapore
Denmark
China
df2 <- data.frame(col1=1:7, col2=c("India", "USA", "New Zealand", "UK", "Singapore", "Denmark", "China"))
If the location column matches the country column, I want to replace that location with the country name; otherwise it should remain as it is. Sample output:
Sydney, Australia
Denmark
New Zealand
India
Singapore
Initially, it looked like a trivial question, but it's not. This approach works like this:
1. We convert the location string into a vector using unlist and strsplit.
2. Then we check whether any string in the vector is present in the country column. If it is, we store the country name in res; if not, we store notfound.
3. Finally, we check whether res contains a country name or not.
df1 <- data.frame(location = c('Sydney, Australia',
'Aarhus C, Central Region, Denmark',
'Auckland, New Zealand',
'Mumbai Area, India',
'Singapore'),stringsAsFactors = F)
df2 <- data.frame(country = c('India',
'USA',
'New Zealand',
'UK',
'Singapore',
'Denmark',
'China'),stringsAsFactors = F)
library(stringr) # provides str_trim
get_values <- function(i)
{
val <- unlist(strsplit(i, split = ','))
val <- sapply(val, str_trim)
res <- c()
for(j in val)
{
if(j %in% df2$country) res <- append(res, j)
else res <- append(res, 'notfound')
}
if(all(res == 'notfound')) return (i)
else return (res[res!='notfound'])
}
df1$location2 <- sapply(df1$location, get_values)
location location2
1 Sydney, Australia Sydney, Australia
2 Aarhus C, Central Region, Denmark Denmark
3 Auckland, New Zealand New Zealand
4 Mumbai Area, India India
5 Singapore Singapore
A solution using tidyverse. First, please convert your col2 to character by setting stringsAsFactors = FALSE, because that is easier to work with.
We can use str_extract to extract the matched country name, and then create a new col2 with mutate and ifelse.
library(dplyr)
library(stringr)
df3 <- df1 %>%
mutate(Country = str_extract(col2, paste0(df2$col2, collapse = "|")),
col2 = ifelse(is.na(Country), col2, Country)) %>%
select(-Country)
df3
# col1 col2
# 1 1 Sydney, Australia
# 2 2 Denmark
# 3 3 New Zealand
# 4 4 India
# 5 5 Singapore
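
For readers without the tidyverse installed, the same extract-then-fall-back idea can be sketched in base R with regexpr and regmatches. This is an illustrative alternative rather than part of the answer above; the names pattern, hit and country are assumptions introduced here.

```r
# Example data from the question
df1 <- data.frame(col1 = 1:5,
                  col2 = c("Sydney, Australia",
                           "Aarhus C, Central Region, Denmark",
                           "Auckland, New Zealand",
                           "Mumbai Area, India",
                           "Singapore"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(col1 = 1:7,
                  col2 = c("India", "USA", "New Zealand", "UK",
                           "Singapore", "Denmark", "China"),
                  stringsAsFactors = FALSE)

# One alternation pattern covering every known country name
pattern <- paste0(df2$col2, collapse = "|")

# regexpr() gives the position of the first match, or -1 for none
hit <- regexpr(pattern, df1$col2)

# regmatches() returns only the matched pieces, so they are
# placed back into the rows that actually had a match
country <- rep(NA_character_, nrow(df1))
country[hit > 0] <- regmatches(df1$col2, hit)

# Keep the original text wherever no country was found
df1$col3 <- ifelse(is.na(country), df1$col2, country)
print(df1)
```

Like the str_extract version, this matches substrings anywhere in the text, so a location such as "Indiana" would be picked up as "India". Anchoring the pattern or matching on the comma-separated pieces avoids that if it matters for the real data.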
We can also start with df1, use separate_rows to separate the country name. After that, use semi_join to check if the country names are in df2. Finally, we can combine the data frame with the original df1 by rows, and then filter the first one for each id in col1. df3 is the final output.
library(tidyverse)
df3 <- df1 %>%
separate_rows(col2, sep = ", ") %>%
semi_join(df2, by = "col2") %>%
bind_rows(df1) %>%
group_by(col1) %>%
slice(1) %>%
ungroup() %>%
arrange(col1)
df3
# # A tibble: 5 x 2
# col1 col2
# <int> <chr>
# 1 1 Sydney, Australia
# 2 2 Denmark
# 3 3 New Zealand
# 4 4 India
# 5 5 Singapore
DATA
df1 <- data.frame(col1 = 1:5,
col2=c("Sydney, Australia", "Aarhus C, Central Region, Denmark", "Auckland, New Zealand", "Mumbai Area, India", "Singapore"),
stringsAsFactors = FALSE)
df2 <- data.frame(col1=1:7,
col2=c("India", "USA", "New Zealand", "UK", "Singapore", "Denmark", "China"),
stringsAsFactors = FALSE)
If you are looking for the countries, and they come after the cities, then you can do something like this.
transform(df1,col3= sub(paste0(".*,\\s*(",paste0(df2$col2,collapse="|"),")"),"\\1",col2))
col1 col2 col3
1 1 Sydney, Australia Sydney, Australia
2 2 Aarhus C, Central Region, Denmark Denmark
3 3 Auckland, New Zealand New Zealand
4 4 Mumbai Area, India India
5 5 Singapore Singapore
Breakdown:
> A=sub(".*,\\s(.*)","\\1",df1$col2)
> B=sapply(A,grep,df2$col2,value=T)
> transform(df1,col3=replace(A,!lengths(B),col2[!lengths(B)]))
col1 col2 col3
1 1 Sydney, Australia Sydney, Australia
2 2 Aarhus C, Central Region, Denmark Denmark
3 3 Auckland, New Zealand New Zealand
4 4 Mumbai Area, India India
5 5 Singapore Singapore

R make new data frame from current one

I'm trying to calculate the best goal differentials in the group stage of the 2014 World Cup.
football <- read.csv(
file="http://pastebin.com/raw.php?i=iTXdPvGf",
header = TRUE,
strip.white = TRUE
)
football <- head(football,n=48L)
football[which(max(abs(football$home_score - football$away_score)) == abs(football$home_score - football$away_score)),]
Results in
home home_continent home_score away away_continent away_score result
4 Cameroon Africa 0 Croatia Europe 4 l
7 Spain Europe 1 Netherlands Europe 5 l
37 Germany
So those are the games with the highest goal differential, but now I need to make a new data frame that has a team name and abs(football$home_score-football$away_score)
football$score_diff <- abs(football$home_score - football$away_score)
football$winner <- ifelse(football$home_score > football$away_score, as.character(football$home),
ifelse(football$result == "d", NA, as.character(football$away)))
You could save some typing this way. You first get the score differences and winners. When result is w, the home team is the winner, so you do not have to look at the scores at all to find the winner. Once you have added the score difference and winner, you can subset your data to the rows where the score difference equals its maximum, found with max().
mydf <- read.csv(file="http://pastebin.com/raw.php?i=iTXdPvGf",
header = TRUE, strip.white = TRUE)
mydf <- head(mydf,n = 48L)
library(dplyr)
mutate(mydf, scorediff = abs(home_score - away_score),
winner = ifelse(result == "w", as.character(home),
ifelse(result == "l", as.character(away), "draw"))) %>%
filter(scorediff == max(scorediff))
# home home_continent home_score away away_continent away_score result scorediff winner
#1 Cameroon Africa 0 Croatia Europe 4 l 4 Croatia
#2 Spain Europe 1 Netherlands Europe 5 l 4 Netherlands
#3 Germany Europe 4 Portugal Europe 0 w 4 Germany
Here is another option without using ifelse for creating the "winner" column. This is based on row/column indexes. The numeric column index is created by matching the result column with its unique elements (match(football$result,..), and the row index is just 1:nrow(football). Subset the "football" dataset with columns 'home', 'away' and cbind it with an additional column 'draw' with NAs so that the 'd' elements in "result" change to NA.
football$score_diff <- abs(football$home_score - football$away_score)
football$winner <- cbind(football[c('home', 'away')],draw=NA)[
cbind(1:nrow(football), match(football$result, c('w', 'l', 'd')))]
football[with(football, score_diff==max(score_diff)),]
# home home_continent home_score away away_continent away_score result
#60 Brazil South America 1 Germany Europe 7 l
# score_diff winner
#60 6 Germany
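
The row/column indexing trick is easier to follow on a tiny example. The three matches below are made up purely for illustration (they are not from the World Cup data), and toy, choices and idx are hypothetical names.

```r
# Hypothetical three-match data set, invented for illustration only
toy <- data.frame(home = c("A", "C", "E"),
                  away = c("B", "D", "F"),
                  result = c("w", "l", "d"),
                  stringsAsFactors = FALSE)

# Candidate winners per row: column 1 = home, 2 = away, 3 = NA for a draw
choices <- cbind(toy[c("home", "away")], draw = NA)

# Two-column index matrix: row number, and which column to pick
# (1 for "w", 2 for "l", 3 for "d")
idx <- cbind(seq_len(nrow(toy)), match(toy$result, c("w", "l", "d")))

# Indexing with a matrix picks one cell per row
toy$winner <- choices[idx]
print(toy)
```

Each row of idx names exactly one cell, so a home win selects the home team, an away win selects the away team, and a draw selects the NA column.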
If the dataset is very big, you could speed up the match by using chmatch from library(data.table)
library(data.table)
chmatch(as.character(football$result), c('w', 'l', 'd'))
NOTE: I used the full dataset in the link
