reshaping with an embedded column and year name - r

Hi, I have data where the year is embedded in the column names, as shown below, and I would like to reshape it to long format.
state<- c('MN', 'PA', 'NY')
city<- c('Minessota', 'Pittsburgh','Newyork')
POPEST2010<- c(2899, 344,4555)
POPEST2011<- c(4444, 348,8999)
POPEST2012<- c(555, 55,77665)
df<- data.frame(state,city, POPEST2010, POPEST2011, POPEST2012)
Any suggestions on how I can reshape it to long format so that the data looks like this:
state city year POPEST
MN Minessota 2010 2899
MN Minessota 2011 4444
MN Minessota 2012 555
and similarly for the other states. Any ideas? Thanks so much!

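A solution using rename and gather:
library(dplyr)
library(tidyr)
df %>%
  # strip the 'POPEST' prefix; the formula shorthand replaces the deprecated funs()
  rename_all(~ gsub('POPEST', '', .)) %>%
  gather(year, POPEST, -state, -city)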

Similar, gathering the matching columns first and then stripping the prefix:
library(dplyr)
df %>%
  tidyr::gather(year, POPEST, matches("POPEST")) %>%
  mutate(year = sub("[^0-9]+", "", year))
# state city year POPEST
#1 MN Minessota 2010 2899
#2 PA Pittsburgh 2010 344
#3 NY Newyork 2010 4555
#4 MN Minessota 2011 4444
#5 PA Pittsburgh 2011 348
#6 NY Newyork 2011 8999
#7 MN Minessota 2012 555
#8 PA Pittsburgh 2012 55
#9 NY Newyork 2012 77665
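For reference, recent versions of tidyr mark gather() as superseded in favour of pivot_longer(). The following is a minimal sketch of the same reshape, assuming tidyr 1.0 or later is installed; the result name `long` is only illustrative and is not part of the original answers.

```r
library(tidyr)

# Same example data as in the question
df <- data.frame(
  state = c("MN", "PA", "NY"),
  city = c("Minessota", "Pittsburgh", "Newyork"),
  POPEST2010 = c(2899, 344, 4555),
  POPEST2011 = c(4444, 348, 8999),
  POPEST2012 = c(555, 55, 77665)
)

# names_prefix drops "POPEST" from the old column names,
# so only the year part ends up in the new 'year' column
long <- pivot_longer(
  df,
  cols = starts_with("POPEST"),
  names_to = "year",
  names_prefix = "POPEST",
  values_to = "POPEST"
)

# The year arrives as text; convert it if a number is needed
long$year <- as.integer(long$year)
```

Unlike the gather() output above, pivot_longer() keeps the rows of each state together (MN 2010, MN 2011, MN 2012, then PA, and so on).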

Related

Replace NA with minimum Group Value R

I'm struggling with transforming my data and would appreciate some help.
year name start
2010 Emma 1998
2011 Emma 1998
2012 Emma 1998
2009 John na
2010 John na
2012 John na
2007 Louis na
2012 Louis na
The aim is to replace all NAs with the minimum value of year within each name group, so that the data looks like this:
year name start
2010 Emma 1998
2011 Emma 1998
2012 Emma 1998
2009 John 2009
2010 John 2009
2012 John 2009
2007 Louis 2007
2012 Louis 2007
Note: within a name group, either all start values are NA or none are.
I tried to use
mydf %>% group_by(name) %>% mutate(start = ifelse(is.na(start), min(year, na.rm = TRUE), start))
but got this error:
x `start` must return compatible vectors across groups
There are a lot of similar questions here.
Some people used the ave function or worked with data.table, neither of which seems to fit my problem.
My base function must be something like
df$A <- ifelse(is.na(df$A), df$B, df$A)
but I can't seem to combine it properly with min() and group_by().
Thank you for any help.
I changed the colname to 'Year' because it was colliding to
dat %>%
dplyr::group_by(name) %>%
dplyr::mutate(start = dplyr::if_else(start == "na", min(Year), start))
# A tibble: 8 x 3
# Groups: name [3]
Year name start
<chr> <chr> <chr>
1 2010 Emma 1998
2 2011 Emma 1998
3 2012 Emma 1998
4 2009 John 2009
5 2010 John 2009
6 2012 John 2009
7 2007 Louis 2007
8 2012 Louis 2007
We can use na.aggregate
library(dplyr)
library(zoo)
dat %>%
group_by(name) %>%
mutate(start = na.aggregate(na_if(start, "na"), FUN = min))
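A base R variant is also possible. The sketch below assumes that year and start are stored as numbers and that missing values are real NA rather than the text "na"; the post does not show the column types, so that is an assumption. If the types differ between groups (for example text in one group and numbers in another), that could be one explanation for the "compatible vectors" error. The helper name `group_min` is illustrative only.

```r
# Example data with numeric columns and real NA values (assumed types)
mydf <- data.frame(
  year  = c(2010, 2011, 2012, 2009, 2010, 2012, 2007, 2012),
  name  = c("Emma", "Emma", "Emma", "John", "John", "John", "Louis", "Louis"),
  start = c(1998, 1998, 1998, NA, NA, NA, NA, NA)
)

# ave() returns, for every row, the minimum year of that row's name group
group_min <- ave(mydf$year, mydf$name, FUN = min)

# Keep existing start values and fill the gaps with the group minimum
mydf$start <- ifelse(is.na(mydf$start), group_min, mydf$start)
```

Because both branches of ifelse() are numeric here, the result has a single type for every group.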

R: How do I avoid getting an error when merging two data frames (group by/summarise)?

I have a big data frame of 80,000 rows. It was created by combining individual data frames from different years. The origin variable indicates the year of the entry's original data frame.
Here is an example of the first few of the big data frame rows that show how data frames from 2003 and 2011 were combined.
df_1:
ID City State origin
1 NY NY 2003
2 NY NY 2003
3 SF CA 2003
1 NY NY 2011
3 SF CA 2011
2 NY NY 2011
4 LA CA 2011
5 SD CA 2011
Now I want to create a new variable called first_appearance that takes the min of the origin variable for each ID:
final_df:
ID City State origin first_appearance
1 NY NY 2003 2003
2 NY NY 2003 2003
3 SF CA 2003 2003
1 NY NY 2011 2003
3 SF CA 2011 2003
2 NY NY 2011 2003
4 LA CA 2011 2011
5 SD CA 2011 2011
So far, I've tried using:
prestep_final <- df_1 %>% group_by(ID) %>% summarise(first_appearance = min(origin))
final_df <- merge(prestep_final, df_1, by = "ID")
prestep_final works and produces a data frame with the ID and first_appearance.
Unfortunately, the merge step doesn't work and yields a data frame with NA entries only.
How can I improve my code so that it produces a table like final_df above? I'd appreciate any suggestions and have no package preferences.
If you change summarise to mutate you get your desired result without merging:
library(tidyverse)
df <- tibble::tribble(
~ID, ~City, ~State, ~origin,
1, 'NY', 'NY', 2003,
2, 'NY', 'NY', 2003,
3, 'SF', 'CA', 2003,
1, 'NY', 'NY', 2011,
3, 'SF', 'CA', 2011,
2, 'NY', 'NY', 2011,
4, 'LA', 'CA', 2011,
5, 'SD', 'CA', 2011
)
df %>% group_by(ID) %>%
mutate(first_appearance = min(origin))
#> # A tibble: 8 x 5
#> # Groups: ID [5]
#> ID City State origin first_appearance
#> <dbl> <chr> <chr> <dbl> <dbl>
#> 1 1 NY NY 2003 2003
#> 2 2 NY NY 2003 2003
#> 3 3 SF CA 2003 2003
#> 4 1 NY NY 2011 2003
#> 5 3 SF CA 2011 2003
#> 6 2 NY NY 2011 2003
#> 7 4 LA CA 2011 2011
#> 8 5 SD CA 2011 2011
Created on 2020-06-10 by the reprex package (v0.3.0)
An option with data.table
library(data.table)
setDT(df)[, first_appearance := min(origin), ID]
Or in base R
df$first_appearance <- with(df, ave(origin, ID, FUN = min))
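If the two-step route (summarise, then merge) is preferred, it can also be made to work. The original failure cannot be reproduced from the post, so the base R sketch below only shows that the approach gives the expected result on the example data; the name `prestep` is illustrative.

```r
# Example data from the question
df_1 <- data.frame(
  ID     = c(1, 2, 3, 1, 3, 2, 4, 5),
  City   = c("NY", "NY", "SF", "NY", "SF", "NY", "LA", "SD"),
  State  = c("NY", "NY", "CA", "NY", "CA", "NY", "CA", "CA"),
  origin = c(2003, 2003, 2003, 2011, 2011, 2011, 2011, 2011)
)

# Step 1: one row per ID holding the earliest origin
prestep <- aggregate(origin ~ ID, data = df_1, FUN = min)
names(prestep)[names(prestep) == "origin"] <- "first_appearance"

# Step 2: attach that value to every row of the full table
final_df <- merge(df_1, prestep, by = "ID")
```

Note that merge() sorts the result by ID, so the row order differs from df_1. If a merge returns only NA values on real data, it may be worth checking that the key column has the same name and type in both tables.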

Dataframe does not correctly reshape

I have the following dataframe:
Variables Varcode Country Ccode 2000 2001
1 Power P France FR 1213 1234
2 Happiness H France FR 1872 2345
3 Power P UK UK 1726 6433
4 Happiness H UK UK 2234 9082
I would like to reshape this dataframe as follows:
Year Country Ccode P(label=Power) H(label=Happiness)
1 2000 France FR 1213 1872
2 2001 France FR 1234 2345
3 2000 UK UK 1726 2234
4 2001 UK UK 6433 9082
The original code was as follows:
library(tidyverse)
df %>%
gather(Year, val, -Variables, -Country) %>%
spread(Variables, val)
I tried to expand the code because Ccode and the indicator code ended up as rows in the result, and I decided that I wanted to use the codes as variable names and the variable names as labels (please note that, because of that, I swapped -Variables and Variables for -Varcode and Varcode respectively):
library(tidyverse)
library(Hmisc)
List <- df$Variables
df<-df %>%
gather(Year, val, -Varcode, -Country) %>%
spread(Varcode, val)
for(i in List){
label(df[,i]) <- List[i]
}
Please note: I am using a list because of memory limitations.
I ran into two problems:
The transformation does not go smoothly, because two additional columns from df (among them Variables) end up where values should be.
The label function gives an error.
Can anyone help me figure out what goes wrong?
I think you went wrong with your selection of columns to gather.
Data:
library(tidyverse)
# Ccode is included here because the code below unites and separates it
df <- read.table(text = "Variables Varcode Country Ccode 2000 2001
1 Power P France FR 1213 1234
2 Happiness H France FR 1872 2345
3 Power P UK UK 1726 6433
4 Happiness H UK UK 2234 9082", header = TRUE, stringsAsFactors = FALSE) %>%
rename(`2000` = X2000, `2001` = X2001)
df %>%
select(-Varcode) %>%
gather(Year, val,`2000`:`2001`) %>%
unite(Country_Ccode, Country, Ccode, sep = "_") %>%
spread(Variables, val) %>%
separate(Country_Ccode, c("Country", "Ccode"), sep = "_")
Output
Country Ccode Year Happiness Power
1 France FR 2000 1872 1213
2 France FR 2001 2345 1234
3 UK UK 2000 2234 1726
4 UK UK 2001 9082 6433
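The same reshape can be sketched with pivot_longer() and pivot_wider(), which avoids the unite/separate step and uses the codes (P, H) as column names, as the question asked. This assumes tidyr 1.0 or later. Storing the long names in a "label" attribute is only one possible convention and is an assumption here; it is not the Hmisc call from the question. The names `long`, `wide` and `labels` are illustrative.

```r
library(tidyr)

# Example data from the question; check.names = FALSE keeps the year names as-is
df <- data.frame(
  Variables = c("Power", "Happiness", "Power", "Happiness"),
  Varcode   = c("P", "H", "P", "H"),
  Country   = c("France", "France", "UK", "UK"),
  Ccode     = c("FR", "FR", "UK", "UK"),
  `2000`    = c(1213, 1872, 1726, 2234),
  `2001`    = c(1234, 2345, 6433, 9082),
  check.names = FALSE
)

# Only the year columns are stacked; everything else stays as an identifier
long <- pivot_longer(df, cols = c(`2000`, `2001`),
                     names_to = "Year", values_to = "val")

# One column per code; Variables is left out of id_cols so it is dropped
wide <- pivot_wider(long, id_cols = c(Year, Country, Ccode),
                    names_from = Varcode, values_from = val)

# Attach the long names as a plain "label" attribute (assumed convention)
labels <- unique(df[c("Varcode", "Variables")])
for (i in seq_len(nrow(labels))) {
  attr(wide[[labels$Varcode[i]]], "label") <- labels$Variables[i]
}
```

Looping over row positions of the lookup table, rather than over the names themselves, keeps each code paired with its own label.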

html_table doubles value of columns

I'm trying to scrape a wiki table with this code:
library(tidyverse)
library(rvest)
my_url <- "https://en.wikipedia.org/wiki/List_of_Australian_Open_men%27s_singles_champions"
mytable <- read_html(my_url) %>% html_nodes("table") %>% .[[4]]
mytable <- mytable %>% html_table()
The problem is that, in the returned table, the values in both name columns (Champion and Runner-up) are doubled. Well, not exactly doubled: each cell seems to contain the name twice, once as "surname, name" and once as "name surname". The original wiki page shows only "name surname". Why does this happen, and how can I get rid of it? I need those columns to contain "name surname" only.
head(mytable)
Year[f] Country Champion Country Runner-up Score in the final[4][14]
1 1969 AUS Laver, RodRod Laver[b] ESP Gimeno, AndrésAndrés Gimeno 6–3, 6–4, 7–5
2 1970 USA Ashe, ArthurArthur Ashe AUS Crealy, DickDick Crealy 6–4, 9–7, 6–2
3 1971 AUS Rosewall, KenKen Rosewall USA Ashe, ArthurArthur Ashe 6–1, 7–5, 6–3
4 1972 AUS Rosewall, KenKen Rosewall AUS Anderson, MalcolmMalcolm Anderson 7–6(7–2), 6–3, 7–5
5 1973 AUS Newcombe, JohnJohn Newcombe NZL Parun, OnnyOnny Parun 6–3, 6–7, 7–5, 6–1
6 1974 USA Connors, JimmyJimmy Connors AUS Dent, PhilPhil Dent 7–6(9–7), 6–4, 4–6, 6–3
htmltab could be used to scrape these Wiki tables.
library(htmltab)
#data cleaning steps
bFun <- function(node) {
x <- XML::xmlValue(node)
gsub("\\s[<†‡].*$", "", iconv(x, from = 'UTF-8', to = "Windows-1252", sub="byte"))
}
df1 <- htmltab(doc = "https://en.wikipedia.org/wiki/List_of_Australian_Open_men%27s_singles_champions",
which = 4,
rm_superscript = F,
bodyFun = bFun) #this function is not required if you are executing the code from Mac
head(df1)
which gives
# Year[f] Country Champion Country Runner-up Score in the final[4][14]
#2 1969 AUS Rod Laver[b] ESP Andrés Gimeno 6–3, 6–4, 7–5
#3 1970 USA Arthur Ashe AUS Dick Crealy 6–4, 9–7, 6–2
#4 1971 AUS Ken Rosewall USA Arthur Ashe 6–1, 7–5, 6–3
#5 1972 AUS Ken Rosewall AUS Malcolm Anderson 7–6(7–2), 6–3, 7–5
#6 1973 AUS John Newcombe NZL Onny Parun 6–3, 6–7, 7–5, 6–1
#7 1974 USA Jimmy Connors AUS Phil Dent 7–6(9–7), 6–4, 4–6, 6–3
and
df2 <- htmltab(doc = "https://en.wikipedia.org/wiki/List_of_Wimbledon_gentlemen%27s_singles_champions",
which = 3,
rm_superscript = F,
bodyFun = bFun) #this function is not required if you are executing the code from Mac
head(df2)
gives
# Year[d] Country Champion Country Runner-up Score in the final[4]
#2 1877 BRI[e] Spencer Gore BRI William Marshall 6–1, 6–2, 6–4
#3 1878 BRI Frank Hadow BRI Spencer Gore 7–5, 6–1, 9–7
#4 1879 BRI John Hartley BRI Vere St. Leger Goold 6–2, 6–4, 6–2
#5 1880 BRI John Hartley BRI Herbert Lawford 6–3, 6–2, 2–6, 6–3
#6 1881 BRI William Renshaw BRI John Hartley 6–0, 6–1, 6–1
#7 1882 BRI William Renshaw BRI Ernest Renshaw 6–1, 2–6, 4–6, 6–2, 6–2
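A likely explanation for the doubled names, stated here as an assumption, is that sortable wiki tables carry a hidden sort key ("surname, name") in a span that browsers do not display, while html_table() reads all text in the cell. Under that assumption, staying with rvest is possible by removing the hidden nodes before converting the table. The sketch below uses a small inline HTML string instead of the live page, and the XPath is a guess at the markup; inspect the real page source before relying on it.

```r
library(rvest)
library(xml2)

# Tiny stand-in for one wiki table cell with a hidden sort key (hypothetical markup)
html <- paste0(
  "<table><tr><th>Champion</th></tr>",
  "<tr><td><span style=\"display:none\">Laver, Rod</span>",
  "<a>Rod Laver</a></td></tr></table>"
)
page <- read_html(html)

# Find the hidden spans and drop them from the parsed document
hidden <- html_nodes(page, xpath = "//span[contains(@style, 'display:none')]")
xml_remove(hidden)

# The table now contains only the visible text
tab <- html_table(html_node(page, "table"))
```

On the real page the same two lines (html_nodes plus xml_remove) would go between read_html() and html_table().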

Merging data by 2 variables in R

I am attempting to merge two data sets. In the past I have used merge() with by equal to the variable I want to merge by. However, now I would like to do so with two variables. My first data set looks something like this:
Year Winning_Tm Losing_Tm
2011 Texas Washington
2012 Alabama South Carolina
2013 Tennessee Texas
Then I have another data set with a rank of each team (this is very simplified) for each year. Like this:
Year Team Rank
2011 Texas 32
2011 Washington 34
2012 South Carolina 45
2012 Alabama 12
2013 Texas 6
2013 Tennessee 51
I would like to merge them so I have a data set that looks like this:
Year Winning_Tm Winning_TM_rank Losing_Tm Losing_Tm_rank
2011 Texas 32 Washington 34
2012 Alabama 12 South Carolina 45
2013 Tennessee 51 Texas 6
My hope is that there is a simple way to do this but it may be more complicated. Thanks!
I reproduced your data (try to include a dput of it next time):
A <- data.frame(
Year = c(2011, 2012, 2013),
Winning_Tm = c("Texas","Alabama","Tennessee"),
Losing_Tm = c("Washington","South Carolina", "Texas"),
stringsAsFactors = FALSE
)
B <- data.frame(
Year = c("2011","2011","2012","2012","2013","2013"),
Team = c("Texas","Washington","South Carolina","Alabama","Texas","Tennessee"),
Rank = c(32,34,45,12,6,51),
stringsAsFactors = FALSE
)
You can melt the first dataframe using the reshape2 package:
library(reshape2)
A <- melt(A, id.vars = "Year")
names(A)[3] <- "Team"
Now it looks like this:
> A
Year variable Team
1 2011 Winning_Tm Texas
2 2012 Winning_Tm Alabama
3 2013 Winning_Tm Tennessee
4 2011 Losing_Tm Washington
5 2012 Losing_Tm South Carolina
6 2013 Losing_Tm Texas
You can then merge the datasets together by the two columns of interest:
AB <- merge(A, B, by=c("Year","Team"))
Which looks like this:
> AB
Year Team variable Rank
1 2011 Texas Winning_Tm 32
2 2011 Washington Losing_Tm 34
3 2012 Alabama Winning_Tm 12
4 2012 South Carolina Losing_Tm 45
5 2013 Tennessee Winning_Tm 51
6 2013 Texas Losing_Tm 6
Then using the reshape command from base R you can change AB to a wide format:
reshape(AB, idvar = "Year", timevar = "variable", direction = "wide")
The result:
Year Team.Winning_Tm Rank.Winning_Tm Team.Losing_Tm Rank.Losing_Tm
1 2011 Texas 32 Washington 34
3 2012 Alabama 12 South Carolina 45
5 2013 Tennessee 51 Texas 6
Two separate merges. You would need to wrap your list of by variables in c(), and since the variables have different names, you need by.x and by.y. Afterward you could rename the rank variables.
I'll call your data winlose and teamrank, respectively. Then you'd need:
first_merge <- merge(winlose, teamrank, by.x = c('Year', 'Winning_Tm'), by.y = c('Year', 'Team'))
second_merge <- merge(first_merge, teamrank, by.x = c('Year', 'Losing_Tm'), by.y = c('Year', 'Team'))
Renaming the variables:
names(second_merge)[names(second_merge) == 'Rank.x'] <- 'Winning_Tm_rank'
names(second_merge)[names(second_merge) == 'Rank.y'] <- 'Losing_Tm_rank'
If you are familiar with SQL, a rather complicated but fast way to do this all in one step would be:
library(sqldf)
res <- sqldf("SELECT l.*,
max(case when l.Winning_Tm = r.Team then r.Rank else 0 end) as Winning_Tm_rank,
max(case when l.Losing_Tm = r.Team then r.Rank else 0 end) as Losing_Tm_rank
FROM df1 as l
inner join df2 as r
on (l.Winning_Tm = r.Team
OR l.Losing_Tm = r.Team)
AND l.Year = r.Year
group by l.Year, l.Winning_Tm, l.Losing_Tm")
res
Year Winning_Tm Losing_Tm Winning_Tm_rank Losing_Tm_rank
1 2011 Texas Washington 32 34
2 2012 Alabama South_Carolina 12 45
3 2013 Tennessee Texas 51 6
Data:
df1 <- read.table(header=T,text="Year Winning_Tm Losing_Tm
2011 Texas Washington
2012 Alabama South_Carolina
2013 Tennessee Texas")
df2<- read.table(header=T,text="Year Team Rank
2011 Texas 32
2011 Washington 34
2012 South_Carolina 45
2012 Alabama 12
2013 Texas 6
2013 Tennessee 51")
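For a plain lookup like this, a match on a combined key is another option in base R that needs neither reshaping nor a second merge. This is a minimal sketch on the example data; the helper `rank_of` is a hypothetical name introduced for illustration.

```r
A <- data.frame(
  Year = c(2011, 2012, 2013),
  Winning_Tm = c("Texas", "Alabama", "Tennessee"),
  Losing_Tm = c("Washington", "South Carolina", "Texas"),
  stringsAsFactors = FALSE
)
B <- data.frame(
  Year = c(2011, 2011, 2012, 2012, 2013, 2013),
  Team = c("Texas", "Washington", "South Carolina", "Alabama", "Texas", "Tennessee"),
  Rank = c(32, 34, 45, 12, 6, 51),
  stringsAsFactors = FALSE
)

# Look up a rank by building a "year|team" key on both sides
rank_of <- function(year, team) {
  B$Rank[match(paste(year, team, sep = "|"), paste(B$Year, B$Team, sep = "|"))]
}

A$Winning_Tm_rank <- rank_of(A$Year, A$Winning_Tm)
A$Losing_Tm_rank  <- rank_of(A$Year, A$Losing_Tm)
```

Teams without an entry in B for that year would get NA, and the row order of A is left unchanged.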
