Reshaping by ID number into wide format [duplicate] - r

Posting a second question because my first was marked as a duplicate. I apologize in advance if there already is a question that addresses this specific issue.
I started out with a dataframe as follows:
dat<-data.frame(
ID=c(100,101,101,101,102,103),
DEGREE=c("BA","BA","MS","PHD","BA","BA"),
YEAR=c(1980,1990, 1992, 1996, 2000, 2004))
> dat
ID DEGREE YEAR
100 BA 1980
101 BA 1990
101 MS 1992
101 PHD 1996
102 BA 2000
103 BA 2004
ID 101 earned a BA in 1990, an MS in 1992, and a PHD in 1996.
I want to reshape this dataframe into a wide format that ultimately looks like this:
ID DEGREE_1 DEGREE_2 DEGREE_3 YEAR_DEGREE_1 YEAR_DEGREE_2 YEAR_DEGREE_3
100 BA 1980
101 BA MS PHD 1990 1992 1996
102 BA 2000
103 BA 2004
With help from an answer to my original question, I attempted to create my new data frame using the following code:
dat$DEGREE<-as.character(dat$DEGREE)
dat %>% group_by(ID) %>%
mutate(DegreeNum = paste("Degree", row_number(), sep = "_"))%>%
mutate(DegreeYear = paste("YearDegree", row_number(), sep = "_"))%>%
spread(DegreeNum, DEGREE, fill = "")%>%
spread(DegreeYear,YEAR,fill="")%>%
as.data.frame()
ID Degree_1 Degree_2 Degree_3 YearDegree_1 YearDegree_2 YearDegree_3
100 BA 1980
101 PHD 1996
101 MS 1992
101 BA 1990
102 BA 2000
103 BA 2004
This is as far as I was able to get, but I cannot figure out how to reshape it so that everything for ID 101 ends up in one row. Any help would be appreciated.

Not so hard with tidyverse (dplyr for the row numbering, base reshape() for the widening)...
library(dplyr)
df<-data.frame(ID=c(100,101,101,101,102,103),
DEGREE=c("BA","BA","MS","PHD","BA","BA"),
YEAR=c(1980,1990, 1992, 1996, 2000, 2004),
stringsAsFactors=FALSE)
df1 <- df %>% select(-3) %>% group_by(ID) %>% mutate(i=row_number()) %>%
as.data.frame() %>%
reshape(direction="wide",idvar="ID",v.names="DEGREE",timevar="i",sep="_")
df1[is.na(df1)] <- ""
df2 <- df %>% select(-2) %>% group_by(ID) %>% mutate(i=row_number()) %>%
as.data.frame() %>%
reshape(direction="wide",idvar="ID",v.names="YEAR",timevar="i",sep="_")
df2[is.na(df2)] <- ""
inner_join(df1,df2,"ID")
# ID DEGREE_1 DEGREE_2 DEGREE_3 YEAR_1 YEAR_2 YEAR_3
#1 100 BA 1980
#2 101 BA MS PHD 1990 1992 1996
#3 102 BA 2000
#4 103 BA 2004
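For what it's worth, the attempt in the question keeps three rows for ID 101 because after the first spread the YEAR column still differs between those rows, so they cannot be collapsed. A minimal sketch that widens both columns in one step, assuming tidyr 1.0.0 or later is available (pivot_wider is not part of the original answer; the object name wide is only illustrative):

```r
library(dplyr)
library(tidyr)

dat <- data.frame(
  ID     = c(100, 101, 101, 101, 102, 103),
  DEGREE = c("BA", "BA", "MS", "PHD", "BA", "BA"),
  YEAR   = c(1980, 1990, 1992, 1996, 2000, 2004),
  stringsAsFactors = FALSE
)

wide <- dat %>%
  group_by(ID) %>%
  mutate(i = row_number()) %>%   # 1, 2, 3 ... within each ID
  ungroup() %>%
  # widen DEGREE and YEAR together; columns are named DEGREE_1, YEAR_1, ...
  pivot_wider(names_from = i, values_from = c(DEGREE, YEAR))

wide
```

Missing degrees and years come out as NA here. If empty strings are needed for display, the YEAR columns would have to be converted to character first.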


Calculate ratio of values within one column

I have created a simple data frame with simulated GDP data for Costa Rica and the US, using the following code
gdp_test <- read.table(text = "Country, Year, GDP
costa_rica 1979 200
costa_rica 1980 210
costa_rica 1981 250
usa 1979 350
usa 1980 375
usa 1981 421", header=T)
gdp_test <- as.data.frame(gdp_test)
The output is as follows
Country. Year. GDP
1 costa_rica 1979 200
2 costa_rica 1980 210
3 costa_rica 1981 250
4 usa 1979 350
5 usa 1980 375
6 usa 1981 421
What I would like to do is create a new variable holding the ratio of each country's GDP, for each year, to the USA GDP for that same year (obviously the ratio would be 1 for the USA every year).
Any ideas on how to do it? It is an easy task in Excel, but I have found no way of doing it within R.
I have not been able to write any code that would do the task
This might do the trick, using tidyverse. If you do not want NA in the ratio column for the USA rows, remove the last line of the pipe.
library(dplyr)
gdp_test %>%filter(Country.=="usa") %>% group_by(Year.) %>% select(-Country.) %>%
left_join(gdp_test,by="Year.") %>%
rename(GDPus=GDP.x,GDP=GDP.y) %>%
mutate(ratio=GDP/GDPus) %>% ungroup() %>%
mutate(ratio=ifelse(ratio==1,NA,ratio))
Here is a very clumsy way of getting the job done. I am sure there are much better ways of doing it. Help would be enormously appreciated.
gdp_test <- read.table(text = "Country, Year, GDP
costa_rica 1979 200
costa_rica 1980 210
costa_rica 1981 250
usa 1979 350
usa 1980 375
usa 1981 421", header=T)
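library(dplyr)
gdp_test <- as.data.frame(gdp_test) %>%
mutate(ID=row_number())
gdp_usa <- gdp_test$GDP[4:6] # the three USA rows
# name the column explicitly, so that mutate() below finds a 6-row gdp_usa column
# instead of the length-3 vector in the global environment
usa <- data.frame(gdp_usa=c(gdp_usa,gdp_usa)) %>%
mutate(ID=row_number())
gdp <-full_join(gdp_test,usa, by = "ID")
gdp <- gdp %>% mutate(ratio = GDP/gdp_usa)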
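A shorter route, as a sketch rather than a definitive solution: group by year and divide each GDP by the USA value inside that group. This assumes clean column names (Country, Year, GDP, without the trailing dots produced by the read.table call above) and exactly one "usa" row per year; gdp_ratio is just an illustrative name.

```r
library(dplyr)

gdp_test <- data.frame(
  Country = rep(c("costa_rica", "usa"), each = 3),
  Year    = rep(1979:1981, times = 2),
  GDP     = c(200, 210, 250, 350, 375, 421)
)

gdp_ratio <- gdp_test %>%
  group_by(Year) %>%
  # within each year, GDP[Country == "usa"] is a single value that is recycled
  mutate(ratio = GDP / GDP[Country == "usa"]) %>%
  ungroup()

gdp_ratio
```

If a year has no USA row, or more than one, the mutate step fails or recycles incorrectly, so the join-based answer is the safer choice for messy data.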

How to use a loop to create panel data by subsetting and merging a lot of different data frames in R?

I've looked around but I can't find an answer to this!
I've imported a large number of datasets to R.
Each dataset contains information for a single year (ex. df_2012, df_2013, df_2014 etc).
All the datasets have the same variables/columns (ex. varA_2012 in df_2012 corresponds to varA_2013 in df_2013).
I want to create a df with my id variable and varA_2012, varB_2012, varA_2013, varB_2013, varA_2014, varB_2014 etc
I'm trying to create a loop that helps me extract the few columns that I'm interested in (varA_XXXX, varB_XXXX) in each data frame and then do a full join based on my id var.
I haven't used R in a very long time...
So far, I've tried this:
id <- c("France", "Belgium", "Spain")
varA_2012 <- c(1,2,3)
varB_2012 <- c(7,2,9)
varC_2012 <- c(1,56,0)
varD_2012 <- c(13,55,8)
varA_2013 <- c(34,3,56)
varB_2013 <- c(2,53,5)
varC_2013 <- c(24,3,45)
varD_2013 <- c(27,13,8)
varA_2014 <- c(9,10,5)
varB_2014 <- c(95,30,75)
varC_2014 <- c(99,0,51)
varD_2014 <- c(9,40,1)
df_2012 <-data.frame(id, varA_2012, varB_2012, varC_2012, varD_2012)
df_2013 <-data.frame(id, varA_2013, varB_2013, varC_2013, varD_2013)
df_2014 <-data.frame(id, varA_2014, varB_2014, varC_2014, varD_2014)
year = c(2012:2014)
for(i in 1:length(year)) {
df_[i] <- df_[I][df_[i]$id, df_[i]$varA_[i], df_[i]$varB_[i], ]
list2env(df_[i], .GlobalEnv)
}
panel_df <- Reduce(function(x, y) merge(x, y, by="id"), list(df_2012, df_2013, df_2014))
I know that there are probably loads of errors in here.
Here are a couple of options; however, it's unclear what you want the expected output to look like.
If you want a wide format, then we can use tidyverse to do:
library(tidyverse)
results <-
map(list(df_2012, df_2013, df_2014), function(x)
x %>% dplyr::select(id, starts_with("varA"), starts_with("varB"))) %>%
reduce(., function(x, y)
full_join(x, y, by = "id")) # all = TRUE is a merge() argument; the dplyr equivalent is full_join()
Output
id varA_2012 varB_2012 varA_2013 varB_2013 varA_2014 varB_2014
1 France 1 7 34 2 9 95
2 Belgium 2 2 3 53 10 30
3 Spain 3 9 56 5 5 75
However, if you need it in a long format, then we could pivot the data:
results %>%
pivot_longer(-id, names_to = c("variable", "year"), names_sep = "_")
Output
id variable year value
<chr> <chr> <chr> <dbl>
1 France varA 2012 1
2 France varB 2012 7
3 France varA 2013 34
4 France varB 2013 2
5 France varA 2014 9
6 France varB 2014 95
7 Belgium varA 2012 2
8 Belgium varB 2012 2
9 Belgium varA 2013 3
10 Belgium varB 2013 53
11 Belgium varA 2014 10
12 Belgium varB 2014 30
13 Spain varA 2012 3
14 Spain varB 2012 9
15 Spain varA 2013 56
16 Spain varB 2013 5
17 Spain varA 2014 5
18 Spain varB 2014 75
Or if using base R for the wide format, then we can do:
results <-
lapply(list(df_2012, df_2013, df_2014), function(x)
subset(x, select = c("id", names(x)[startsWith(names(x), "varA")], names(x)[startsWith(names(x), "varB")])))
results <-
Reduce(function(x, y)
merge(x, y, all = TRUE, by = "id"), results)
From your initial for loop attempt, it seems the code below may help
> (df <- Reduce(merge, list(df_2012, df_2013, df_2014)))[grepl("^(id|var(A|B))",names(df))]
id varA_2012 varB_2012 varA_2013 varB_2013 varA_2014 varB_2014
1 Belgium 2 2 3 53 10 30
2 France 1 7 34 2 9 95
3 Spain 3 9 56 5 5 75
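If there are too many yearly data frames to list by hand, one possible sketch is to fetch them by name with mget() and then subset and merge in one pass. This assumes the objects really are called df_<year> and live in the current environment; years, dfs and keep_cols are hypothetical helper names, not part of the answers above.

```r
# Toy yearly data frames, with the values from the question
id <- c("France", "Belgium", "Spain")
df_2012 <- data.frame(id, varA_2012 = c(1, 2, 3), varB_2012 = c(7, 2, 9),
                      varC_2012 = c(1, 56, 0), varD_2012 = c(13, 55, 8))
df_2013 <- data.frame(id, varA_2013 = c(34, 3, 56), varB_2013 = c(2, 53, 5),
                      varC_2013 = c(24, 3, 45), varD_2013 = c(27, 13, 8))
df_2014 <- data.frame(id, varA_2014 = c(9, 10, 5), varB_2014 = c(95, 30, 75),
                      varC_2014 = c(99, 0, 51), varD_2014 = c(9, 40, 1))

years <- 2012:2014

# Collect df_2012, df_2013, ... into a named list without typing each name
dfs <- mget(paste0("df_", years))

# Keep only id plus the varA_* and varB_* columns of one data frame
keep_cols <- function(d) d[, grepl("^(id$|var[AB]_)", names(d)), drop = FALSE]

# Full outer join of all subsets on id
panel_df <- Reduce(function(x, y) merge(x, y, by = "id", all = TRUE),
                   lapply(dfs, keep_cols))

panel_df
```

Only the years vector needs to change when more data frames are added.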

Tidy rows in one data frame based on a condition

I have a question in R programming.
I have a data frame in R with the following data:
Country Year Population Bikes Revenue
Austria 1970 85 NA NA
Austria 1973 86 NA NA
AUSTRIA 1970 NA 56 4567
AUSTRIA 1973 NA 54 4390
I want to summarise this data in order to have the following new data:
Country Year Population Bikes Revenue
Austria 1970 85 56 4567
Austria 1973 86 54 4390
Thus, I need to exclude the repeated years per country and join the Bikes and Revenue columns to the specific year and country.
I would highly appreciate it if you could help me with this issue.
Thank you.
One dplyr possibility could be:
df %>%
group_by(Country = toupper(Country), Year) %>%
summarise_all(list(~ sum(.[!is.na(.)])))
Country Year Population Bikes Revenue
<chr> <int> <int> <int> <int>
1 AUSTRIA 1970 85 56 4567
2 AUSTRIA 1973 86 54 4390
Or a combination of dplyr and tidyr:
df %>%
group_by(Country = toupper(Country), Year) %>%
fill(everything(), .direction = "up") %>%
fill(everything(), .direction = "down") %>%
distinct()
Or, if for some reason you need country names that start with an uppercase letter:
df %>%
mutate(Country = tolower(Country),
Country = paste0(toupper(substr(Country, 1, 1)), substr(Country, 2, nchar(Country)))) %>%
group_by(Country, Year) %>%
summarise_all(list(~ sum(.[!is.na(.)])))
Country Year Population Bikes Revenue
<chr> <int> <int> <int> <int>
1 Austria 1970 85 56 4567
2 Austria 1973 86 54 4390
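Note that the sum() variants add values together if a country and year happen to have the same column filled in on two rows. A sketch that takes the first non-missing value per group instead, assuming dplyr 1.0.0 or later for across(); the input is rebuilt from the table in the question and tidy is only an illustrative name:

```r
library(dplyr)

df <- data.frame(
  Country    = c("Austria", "Austria", "AUSTRIA", "AUSTRIA"),
  Year       = c(1970, 1973, 1970, 1973),
  Population = c(85, 86, NA, NA),
  Bikes      = c(NA, NA, 56, 54),
  Revenue    = c(NA, NA, 4567, 4390)
)

tidy <- df %>%
  # normalise the spelling: first letter upper case, the rest lower case
  mutate(Country = paste0(toupper(substr(Country, 1, 1)),
                          tolower(substr(Country, 2, nchar(Country))))) %>%
  group_by(Country, Year) %>%
  # first non-NA value of every remaining column (NA if all are missing)
  summarise(across(everything(), ~ .x[!is.na(.x)][1]), .groups = "drop")

tidy
```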

R: How to spread, group_by, summarise and mutate at the same time

I want to spread this data below (first 12 rows shown here only) by the column 'Year', returning the sum of 'Orders' grouped by 'CountryName'. Then calculate the % change in 'Orders' for each 'CountryName' from 2014 to 2015.
CountryName Days pCountry Revenue Orders Year
United Kingdom 0-1 days India 2604.799 13 2014
Norway 8-14 days Australia 5631.123 9 2015
US 31-45 days UAE 970.8324 2 2014
United Kingdom 4-7 days Austria 94.3814 1 2015
Norway 8-14 days Slovenia 939.8392 3 2014
South Korea 46-60 days Germany 1959.4199 15 2014
UK 8-14 days Poland 1394.9096 6 2015
UK 61-90 days Lithuania -170.8035 -1 2015
US 8-14 days Belize 1687.68 5 2014
Australia 46-60 days Chile 888.72 2 2014
US 15-30 days Turkey 2320.7355 8 2014
Australia 0-1 days Hong Kong 672.1099 2 2015
I can make this work with a smaller test dataframe, but can only seem to return endless errors like 'sum not meaningful for factors' or 'duplicate identifiers for rows' with the full data. After hours of reading the dplyr docs and trying things I've given up. Can anyone help with this code...
data %>%
spread(Year, Orders) %>%
group_by(CountryName) %>%
summarise_all(.funs=c(Sum='sum'), na.rm=TRUE) %>%
mutate(percent_inc=100*((`2014_Sum`-`2015_Sum`)/`2014_Sum`))
The expected output would be a table similar to below. (Note: these numbers are for illustrative purposes, they are not hand calculated.)
CountryName percent_inc
UK 34.2
US 28.2
Norway 36.1
... ...
Edit
I had to make a few edits to the variable names, please note.
Sum first, while your data are still in long format, then spread. Here's an example with fake data:
library(dplyr)
library(tidyr)
set.seed(2)
dat = data.frame(Country=sample(LETTERS[1:5], 500, replace=TRUE),
Year = sample(2014:2015, 500, replace=TRUE),
Orders = sample(-1:20, 500, replace=TRUE))
dat %>% group_by(Country, Year) %>%
summarise(sum_orders = sum(Orders, na.rm=TRUE)) %>%
spread(Year, sum_orders) %>%
mutate(Pct = (`2014` - `2015`)/`2014` * 100)
Country `2014` `2015` Pct
1 A 575 599 -4.173913
2 B 457 486 -6.345733
3 C 481 319 33.679834
4 D 423 481 -13.711584
5 E 528 551 -4.356061
If you have multiple years, it's probably easier to just keep it in long format until you're ready to make a nice output table:
set.seed(2)
dat = data.frame(Country=sample(LETTERS[1:5], 500, replace=TRUE),
Year = sample(2010:2015, 500, replace=TRUE),
Orders = sample(-1:20, 500, replace=TRUE))
dat %>% group_by(Country, Year) %>%
summarise(sum_orders = sum(Orders, na.rm=TRUE)) %>%
group_by(Country) %>%
arrange(Country, Year) %>%
mutate(Pct = c(NA, -diff(sum_orders))/lag(sum_orders) * 100)
Country Year sum_orders Pct
<fctr> <int> <int> <dbl>
1 A 2010 205 NA
2 A 2011 144 29.756098
3 A 2012 226 -56.944444
4 A 2013 119 47.345133
5 A 2014 177 -48.739496
6 A 2015 303 -71.186441
7 B 2010 146 NA
8 B 2011 159 -8.904110
9 B 2012 152 4.402516
10 B 2013 180 -18.421053
# ... with 20 more rows
This is not an answer because you haven't really asked a reproducible question, but just to help out.
Error 1: You're likely getting the error 'duplicate identifiers for rows' because of spread. spread wants to make N columns out of your N unique key values, but it needs to know which unique row each value belongs in. If you have duplicate value combinations, for instance:
CountryName Days pCountry Revenue
United Kingdom 0-1 days India 2604.799
United Kingdom 0-1 days India 2604.799
shows up twice, then spread cannot tell which row it should place the data in. The quick fix is to add a unique row identifier before spreading: data %>% mutate(row=row_number()) %>% spread...
Error 2: You're likely getting the error 'sum not meaningful for factors' because of summarise_all. summarise_all operates on every non-grouping column, but some of those columns contain strings (or factors). What does United Kingdom + United Kingdom equal? Try instead summarise(`2014_Sum` = sum(`2014`, na.rm=TRUE), `2015_Sum` = sum(`2015`, na.rm=TRUE)). The backticks are required because the column names start with a digit.
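Putting both points together, a hedged sketch of the whole pipeline might look like the following. The rows are made up for illustration (they are not the asker's data), Orders is deliberately stored as a factor to mimic the 'sum not meaningful for factors' situation, and pivot_wider assumes tidyr 1.0.0 or later.

```r
library(dplyr)
library(tidyr)

# Made-up rows in the shape of the question's data
orders <- data.frame(
  CountryName = c("UK", "UK", "UK", "US", "US", "Norway", "Norway"),
  Year        = c(2014, 2015, 2015, 2014, 2015, 2014, 2015),
  Orders      = factor(c("13", "6", "1", "5", "8", "3", "9"))
)

result <- orders %>%
  # a factor must go through character to recover the original numbers
  mutate(Orders = as.numeric(as.character(Orders))) %>%
  # sum while the data are still long, so each (country, year) pair is unique
  group_by(CountryName, Year) %>%
  summarise(Sum = sum(Orders, na.rm = TRUE), .groups = "drop") %>%
  pivot_wider(names_from = Year, values_from = Sum, names_prefix = "Sum_") %>%
  # same formula as in the question
  mutate(percent_inc = 100 * (Sum_2014 - Sum_2015) / Sum_2014)

result
```

Because the sum is taken before widening, there are no duplicate identifiers left for the reshaping step to complain about.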

Looking up values without loop in R

I need to look up a value in a data frame based on multiple criteria in another data frame. Example
A=
Country Year Number
USA 1994 455
Canada 1997 342
Canada 1998 987
needs a new column named "Rate", with values taken from
B=
Year USA Canada
1993 21 654
1994 41 321
1995 56 789
1996 85 123
1997 65 456
1998 1 999
So that the final data frame is
C=
Country Year Number Rate
USA 1994 455 41
Canada 1997 342 456
Canada 1998 987 999
In other words: Look up year and country from A in B and result is C. I would like to do this without a loop. I would like a general approach, such that I would be able to look up based on more than two criteria.
Here's another way using data.table that doesn't require converting the 2nd data table to long form:
require(data.table) # 1.9.6+
A[B, Rate := get(Country), by=.EACHI, on="Year"]
# Country Year Number Rate
# 1: USA 1994 455 41
# 2: Canada 1997 342 456
# 3: Canada 1998 987 999
where A and B are data.tables, and Country is of character type.
We can melt the second dataset from 'wide' to 'long' format, merge with the first dataset to get the expected output.
library(reshape2)
res <- merge(A, melt(B, id.var='Year'),
by.x=c('Country', 'Year'), by.y=c('variable', 'Year'))
names(res)[4] <- 'Rate'
res
# Country Year Number Rate
#1 Canada 1997 342 456
#2 Canada 1998 987 999
#3 USA 1994 455 41
Or we can use gather from tidyr and right_join to get this done.
library(dplyr)
library(tidyr)
gather(B, Country,Rate, -Year) %>%
right_join(., A)
# Year Country Rate Number
#1 1994 USA 41 455
#2 1997 Canada 456 342
#3 1998 Canada 999 987
Or, as @DavidArenburg mentioned in the comments, this can also be done with data.table. We convert the 'data.frame' to 'data.table' (setDT(A)), melt the second dataset and join on 'Year' and 'Country'.
library(data.table)#v1.9.6+
setDT(A)[melt(setDT(B), 1L, variable = "Country", value = "Rate"),
on = c("Country", "Year"),
nomatch = 0L]
# Country Year Number Rate
# 1: USA 1994 455 41
# 2: Canada 1997 342 456
# 3: Canada 1998 987 999
Or a shorter version (if we are not too picky about variable names)
setDT(A)[melt(B, 1L), on = c(Country = "variable", Year = "Year"), nomatch = 0L]
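The same melt-then-join idea can also be sketched with the newer tidyr verb pivot_longer (tidyr 1.0.0 or later), which is not used in the answers above. A and B are rebuilt from the tables in the question; B_long and C are illustrative names. Looking up on more than two criteria then only means adding more names to the by vector.

```r
library(dplyr)
library(tidyr)

A <- data.frame(
  Country = c("USA", "Canada", "Canada"),
  Year    = c(1994L, 1997L, 1998L),
  Number  = c(455, 342, 987)
)
B <- data.frame(
  Year   = 1993:1998,
  USA    = c(21, 41, 56, 85, 65, 1),
  Canada = c(654, 321, 789, 123, 456, 999)
)

# One row per (Year, Country) pair, so both lookup keys become ordinary columns
B_long <- pivot_longer(B, -Year, names_to = "Country", values_to = "Rate")

# left_join keeps every row of A, in A's order, and adds the matching Rate
C <- left_join(A, B_long, by = c("Country", "Year"))

C
```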
