Transforming multiple columns structure using Dplyr in R


I have a dataset, df,
State Year 0 1 2 3 4 5
Georgia 2001 10,000 200 300 400 500 800
Georgia 2002 20,000 500 500 1,000 2,000 2,500
Georgia 2003 2,000 5,000 1,000 400 300 8,000
Washington 2001 1,000 10,000 6,000 8,000 9,900 10,000
Washington 2006 5,000 300 200 900 1,000 8,000
I would like my desired output to look like this:
State Year Age Population
Georgia 2001 0 10,000
Georgia 2002 0 20,000
Georgia 2003 0 2,000
Georgia 2001 1 200
Georgia 2002 1 500
Georgia 2003 1 5000
Georgia 2001 2 300
Georgia 2002 2 500
Georgia 2003 2 1000
Georgia 2001 3 400
Georgia 2002 3 1000
Georgia 2003 3 400
Georgia 2001 4 500
Georgia 2002 4 2000
Georgia 2003 4 300
Georgia 2001 5 800
Georgia 2002 5 2500
Georgia 2003 5 8000
Washington 2001 0 1000
Washington 2006 0 5000
Washington 2001 1 10000
Washington 2006 1 300
Washington 2001 2 6000
Washington 2006 2 200
Washington 2001 3 8000
Washington 2006 3 900
Washington 2001 4 9900
Washington 2006 4 1000
Washington 2001 5 10000
Washington 2006 5 8200
Here is my dput
structure(list(state = structure(c(1L, 1L, 1L, 2L, 2L), .Label = c("georgia",
"washington"), class = "factor"), year = c(2001L, 2002L, 2003L,
2001L, 2006L), X0 = structure(c(1L, 3L, 4L, 2L, 5L), .Label = c("10,000",
"1000", "20,000", "2000", "5000"), class = "factor"), X1 = structure(c(2L,
4L, 5L, 1L, 3L), .Label = c("10,000", "200", "300", "500", "5000"
), class = "factor"), X2 = c(300L, 500L, 1000L, 6000L, 200L),
X3 = c(400L, 1000L, 400L, 8000L, 900L), X4 = c(500L, 2000L,
300L, 99000L, 1000L), X5 = structure(c(3L, 2L, 4L, 1L, 4L
), .Label = c("10,000", "2500", "800", "8000"), class = "factor")), class = "data.frame", row.names = c(NA, -5L))
This is what I have tried:
I know that I must group by state and year and perform some kind of pivot, possibly using the gather() function:
library(tidyr)
library(dplyr)
df1 <- gather(df, 0, 1, 2, 3, 4, 5 factor_key=TRUE)
df %>% groupby(State, Year) %>%
mutate('Age', 'Population')

We can first convert the count columns to numeric by parsing the number out of each value (which drops the thousands separators) and then reshape to long format:
library(dplyr)
library(tidyr)
df %>%
mutate_at(vars(matches('\\d+$')), ~readr::parse_number(as.character(.))) %>%
pivot_longer(cols = -c(state, year), names_to = "Age", values_to = "Population")
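
The same idea can be sketched with across(), which supersedes mutate_at() in dplyr 1.0 and later. The small data frame below is a hypothetical stand-in for the question's data, not the full dput. The names_prefix and names_transform arguments (tidyr 1.1 or later) strip the leading "X" so that Age comes out as an integer rather than as "X0", "X1", and so on.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for the question's data: counts stored as text with commas
df <- data.frame(
  state = c("georgia", "georgia", "washington"),
  year  = c(2001L, 2002L, 2001L),
  X0    = c("10,000", "20,000", "1000"),
  X1    = c("200", "500", "10,000"),
  stringsAsFactors = FALSE
)

long <- df %>%
  # parse "10,000" as 10000 in every column whose name ends in a digit
  mutate(across(matches("\\d+$"), ~ readr::parse_number(as.character(.x)))) %>%
  pivot_longer(
    cols = -c(state, year),
    names_to = "Age",
    names_prefix = "X",                        # drop the "X" from X0, X1, ...
    names_transform = list(Age = as.integer),  # store Age as an integer
    values_to = "Population"
  ) %>%
  # order rows as in the desired output: by state, then age, then year
  arrange(state, Age, year)

long
```

The final arrange() is only there to match the row order shown in the question; the reshape itself does not need it.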

Related

data.table join by subset NAs

This is a query that comes from an earlier thread I chanced upon, two tables DT1 and DT2
DT1
Country State City Start End
1 IN Telangana Hyderabad 100 200
2 IN Maharashtra Pune 300 400
3 IN Haryana Gurgaon 500 600
4 IN Maharashtra Pune 700 800
5 IN Gujarat Ahmedabad 900 1000
DT2 with 7 rows
ID No
1 157
2 346
3 389
4 453
5 562
6 9874
7 98745
When they are joined using this code,
DT2[DT1, on=.(No>Start,No<End), ]
produces this output, with 6 rows
ID No No.1 Country State City
1: 1 100 200 IN Telangana Hyderabad
2: 2 300 400 IN Maharashtra Pune
3: 3 300 400 IN Maharashtra Pune
4: 5 500 600 IN Haryana Gurgaon
5: NA 700 800 IN Maharashtra Pune
6: NA 900 1000 IN Gujarat Ahmedabad
I can understand the NAs corresponding to IDs 6 and 7 (rows 5 and 6), but why is there no NA row for ID 4?
ID 4 has No = 453, which falls in none of the ranges in DT1, so shouldn't it also have produced an NA?
EDIT1: Providing Code to create the datasets
DT1<-
structure(list(Country = structure(c(1L, 1L, 1L, 1L, 1L), .Label = "IN", class = "factor"),
State = structure(c(4L, 3L, 2L, 3L, 1L), .Label = c("Gujarat",
"Haryana", "Maharashtra", "Telangana"), class = "factor"),
City = structure(c(3L, 4L, 2L, 4L, 1L), .Label = c("Ahmedabad",
"Gurgaon", "Hyderabad", "Pune"), class = "factor"), Start = c(100L,
300L, 500L, 700L, 900L), End = c(200L, 400L, 600L, 800L,
1000L)), .Names = c("Country", "State", "City", "Start",
"End"), class = c("data.table", "data.frame"))
DT2<-
structure(list(ID = 1:7, No = c(157L, 346L, 389L, 453L, 562L,
9874L, 98745L)), .Names = c("ID", "No"), class = c("data.table",
"data.frame"))

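A likely explanation, offered as a sketch rather than a definitive answer: in data.table, X[Y, on = ...] returns one result for every row of Y (here DT1) and fills in NA where X (here DT2) has no match. So the two NA rows in the output are not IDs 6 and 7; they are the DT1 ranges 700-800 and 900-1000, for which no DT2 row matched. DT2 rows without a matching range (IDs 4, 6 and 7) are simply not part of the result. Swapping the roles of the two tables, as below, keeps every DT2 row instead. The tables are rebuilt with data.table() and only the City, Start and End columns for brevity.

```r
library(data.table)

# Reduced copies of the question's tables (Country and State omitted)
DT1 <- data.table(
  City  = c("Hyderabad", "Pune", "Gurgaon", "Pune", "Ahmedabad"),
  Start = c(100L, 300L, 500L, 700L, 900L),
  End   = c(200L, 400L, 600L, 800L, 1000L)
)
DT2 <- data.table(ID = 1:7, No = c(157L, 346L, 389L, 453L, 562L, 9874L, 98745L))

# One result per DT1 range; ranges with no DT2 match get ID = NA
by_range <- DT2[DT1, on = .(No > Start, No < End)]

# One result per DT2 row; rows that fall in no range get City = NA
by_id <- DT1[DT2, on = .(Start < No, End > No)]

# IDs that matched no range at all
unmatched_ids <- by_id[is.na(City), ID]
unmatched_ids
```

In the second join, ID 4 shows up with NA in the DT1 columns, which is the row the question was expecting to see.
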
Sum variables using R by categories condition

I have a data frame that shows the number of publications by year and type. I am only interested in Conference and Journal publications, so I would like to sum all other categories into a single Other type.
Examples of data frame:
year type n
1994 Conference 2
1994 Journal 3
1995 Conference 10
1995 Editorship 3
1996 Conference 20
1996 Editorship 2
1996 Books and Thesis 3
And the result would be:
year type n
1994 Conference 2
1994 Journal 3
1995 Conference 10
1995 Other 3
1996 Conference 20
1996 Other 5

With dplyr, we can replace every type other than "Journal" or "Conference" with "Other" and then sum n by year and type.
library(dplyr)
df %>%
mutate(type = sub("^((?!Journal|Conference).)*$", "Other", type, perl = TRUE)) %>%
group_by(year, type) %>%
summarise(n = sum(n))
# year type n
# <int> <chr> <int>
#1 1994 Conference 2
#2 1994 Journal 3
#3 1995 Conference 10
#4 1995 Other 3
#5 1996 Conference 20
#6 1996 Other 5

We can also use data.table, recoding every type other than Journal or Conference as "Other" inside the grouping expression:
library(data.table)
setDT(df1)[, .(n = sum(n)), .(year, type = ifelse(type %in% c("Journal", "Conference"),
as.character(type), "Other"))]
# year type n
#1: 1994 Conference 2
#2: 1994 Journal 3
#3: 1995 Conference 10
#4: 1995 Other 3
#5: 1996 Conference 20
#6: 1996 Other 5

levels(df$type)[levels(df$type) %in% c("Editorship", "Books_and_Thesis")] <- "Other"
aggregate(n ~ type + year, data=df, sum)
# type year n
# 1 Conference 1994 2
# 2 Journal 1994 3
# 3 Other 1995 3
# 4 Conference 1995 10
# 5 Other 1996 5
# 6 Conference 1996 20
Input data:
df <- structure(list(year = c(1994L, 1994L, 1995L, 1995L, 1996L, 1996L,
1996L), type = structure(c(2L, 3L, 2L, 1L, 2L, 1L, 1L), .Label = c("Other",
"Conference", "Journal"), class = "factor"), n = c(2L, 3L, 10L,
3L, 20L, 2L, 3L)), .Names = c("year", "type", "n"), row.names = c(NA, -7L), class = "data.frame")

Adding a new variable by comparing existing variables in a data frame in r

I have a dataset with the 2016 primary election results. The dataset contains 8 columns: state, state_abbreviation, county, fips (the combined state and county ID number), party, candidate, votes, and fraction_votes. I want to create a new column called "result" that indicates a "win" or "loss" in each county for each candidate. I filtered the data using dplyr to the 2 Democratic candidates, then used this code to add the column:
Democrat$result <- ifelse(Democrat$fraction_votes > .5, "Win","Loss")
This is obviously not an accurate method, because the winner didn't always get 50% of the vote. How can I get R to compare the vote_fraction or vote totals for each county, and return a "win" or "loss?" Would the apply() family, for loop, or writing a function be the best way to create the new column?
state state_abbreviation county fips party candidate
1 Alabama AL Autauga 1001 Democrat Bernie Sanders
2 Alabama AL Autauga 1001 Democrat Hillary Clinton
3 Alabama AL Baldwin 1003 Democrat Bernie Sanders
4 Alabama AL Baldwin 1003 Democrat Hillary Clinton
5 Alabama AL Barbour 1005 Democrat Bernie Sanders
6 Alabama AL Barbour 1005 Democrat Hillary Clinton
votes fraction_votes
1 544 0.182
2 2387 0.800
3 2694 0.329
4 5290 0.647
5 222 0.078
6 2567 0.906

I would first use summarise function from dplyr package to find the maximum number of votes any candidate received in a given county, then add the column with county maximum to the original dataset, then calculate the result.
# create a sample dataset akin to the question setup
df <- data.frame(abrev = rep("AL", 6), county = c("Autuga", "Autuga", "Baldwin", "Baldwin",
"Barbour", "Barbour"),
party = rep("Democrat", 6),
candidate = rep(c("Bernie", "Hillary"), 3),
fraction_votes = c(0.18, 0.8, 0.32, 0.64, 0.07, 0.9))
# load a dplyr library
library(dplyr)
# calculate the maximum share of votes any candidate received in a given county
# take a df dataset
winners <- df %>%
# group it by a county
group_by(county) %>%
# for each county, calculate maximum of votes
summarise(score = max(fraction_votes))
# join the original dataset and the dataset with county maximums
# join them by county column
df <- left_join(df, winners, by = c("county"))
# calculate the result column
df$result <- ifelse(df$fraction_votes == df$score, "Win", "Loss")
If different states contain counties with the same name, you would have to adjust the grouping and joining (for example by grouping on state and county, or on fips), but the logic stays the same.
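
The summarise-then-join steps can also be collapsed into one grouped mutate(), since max() inside a grouped mutate() is evaluated per group. This is a sketch on a hypothetical sample (votes_df) modelled on the one above, not on the full election dataset.

```r
library(dplyr)

# Hypothetical sample modelled on the question's data
votes_df <- data.frame(
  county = c("Autauga", "Autauga", "Baldwin", "Baldwin", "Barbour", "Barbour"),
  candidate = rep(c("Bernie", "Hillary"), 3),
  fraction_votes = c(0.18, 0.80, 0.32, 0.64, 0.07, 0.90),
  stringsAsFactors = FALSE
)

res <- votes_df %>%
  group_by(county) %>%
  # within each county, the row holding the largest share is the winner
  mutate(result = if_else(fraction_votes == max(fraction_votes), "Win", "Loss")) %>%
  ungroup()

res
```

A tie within a county would mark both candidates as "Win"; whether that is acceptable depends on the data.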

In base R, you can calculate a binary vector with ave:
Democrat$winner <- ave(Democrat$fraction_votes, Democrat$fips, FUN=function(i) i == max(i))
which returns
Democrat
state state_abbreviation county fips party candidate votes fraction_votes winner
1 Alabama AL Autauga 1001 Democrat Bernie 544 0.182 0
2 Alabama AL Autauga 1001 Democrat Hillary 2387 0.800 1
3 Alabama AL Baldwin 1003 Democrat Bernie 2694 0.329 0
4 Alabama AL Baldwin 1003 Democrat Hillary 5290 0.647 1
5 Alabama AL Barbour 1005 Democrat Bernie 222 0.078 0
6 Alabama AL Barbour 1005 Democrat Hillary 2567 0.906 1
which could be converted to logical by wrapping the ave in as.logical if desired.
This is also quite straightforward in data.table. Assuming that fips is the unique state-county ID:
library(data.table)
# convert to data.table
setDT(Democrat)
# get logical vector that proclaims winner if vote fraction is maximum
Democrat[, winner := fraction_votes == max(fraction_votes), by=fips]
which returns
Democrat
state state_abbreviation county fips party candidate votes fraction_votes winner
1: Alabama AL Autauga 1001 Democrat Bernie 544 0.182 FALSE
2: Alabama AL Autauga 1001 Democrat Hillary 2387 0.800 TRUE
3: Alabama AL Baldwin 1003 Democrat Bernie 2694 0.329 FALSE
4: Alabama AL Baldwin 1003 Democrat Hillary 5290 0.647 TRUE
5: Alabama AL Barbour 1005 Democrat Bernie 222 0.078 FALSE
6: Alabama AL Barbour 1005 Democrat Hillary 2567 0.906 TRUE
data
Democrat <-
structure(list(state = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = "Alabama", class = "factor"),
state_abbreviation = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = "AL", class = "factor"),
county = structure(c(1L, 1L, 2L, 2L, 3L, 3L), .Label = c("Autauga",
"Baldwin", "Barbour"), class = "factor"), fips = c(1001L,
1001L, 1003L, 1003L, 1005L, 1005L), party = structure(c(1L,
1L, 1L, 1L, 1L, 1L), .Label = "Democrat", class = "factor"),
candidate = structure(c(1L, 2L, 1L, 2L, 1L, 2L), .Label = c("Bernie",
"Hillary"), class = "factor"), votes = c(544L, 2387L, 2694L,
5290L, 222L, 2567L), fraction_votes = c(0.182, 0.8, 0.329,
0.647, 0.078, 0.906)), .Names = c("state", "state_abbreviation",
"county", "fips", "party", "candidate", "votes", "fraction_votes"
), row.names = c("1", "2", "3", "4", "5", "6"), class = "data.frame")

conditional cumulative sum using dplyr

My dataframe looks like this and I want two separate cumulative columns, one for fund A and the other for fund B
Name Event SalesAmount Fund Cum-A(desired) Cum-B(desired)
John Webinar NA NA NA NA
John Sale 1000 A 1000 NA
John Sale 2000 B 1000 2000
John Sale 3000 A 4000 2000
John Email NA NA 4000 2000
Tom Webinar NA NA NA NA
Tom Sale 1000 A 1000 NA
Tom Sale 2000 B 1000 2000
Tom Sale 3000 A 4000 2000
Tom Email NA NA 4000 2000
I have tried:
df<-
df %>%
group_by(Name)%>%
mutate(Cum-A = as.numeric(ifelse(Fund=="A",cumsum(SalesAmount),0)))%>%
mutate(Cum-B = as.numeric(ifelse(Fund=="B",cumsum(SalesAmount),0)))
but that is not what I want: it shows the running total of both funds combined, albeit only on the rows where the fund matches.
Kindly help.

How about:
library(dplyr)
result <- d %>%
group_by(Name) %>%
mutate(cA=cumsum(ifelse(!is.na(Fund) & Fund=="A",SalesAmount,0))) %>%
mutate(cB=cumsum(ifelse(!is.na(Fund) & Fund=="B",SalesAmount,0)))
The output:
Source: local data frame [10 x 8]
Groups: Name
Name Event SalesAmount Fund Cum.A.desired. Cum.B.desired. cA cB
1 John Webinar NA NA NA NA 0 0
2 John Sale 1000 A 1000 NA 1000 0
3 John Sale 2000 B 1000 2000 1000 2000
4 John Sale 3000 A 4000 2000 4000 2000
5 John Email NA NA 4000 2000 4000 2000
6 Tom Webinar NA NA NA NA 0 0
7 Tom Sale 1000 A 1000 NA 1000 0
8 Tom Sale 2000 B 1000 2000 1000 2000
9 Tom Sale 3000 A 4000 2000 4000 2000
10 Tom Email NA NA 4000 2000 4000 2000
Zeroes in the resulting columns can be replaced by NA afterwards if needed:
result$cA[result$cA==0] <- NA
result$cB[result$cB==0] <- NA
Your input data set:
d <- structure(list(Name = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L), .Label = c("John", "Tom"), class = "factor"), Event = structure(c(3L, 2L, 2L, 2L, 1L, 3L, 2L, 2L, 2L, 1L), .Label = c("Email", "Sale", "Webinar"), class = "factor"), SalesAmount = c(NA, 1000L, 2000L, 3000L, NA, NA, 1000L, 2000L, 3000L, NA), Fund = structure(c(NA, 1L, 2L, 1L, NA, NA, 1L, 2L, 1L, NA), .Label = c("A", "B"), class = "factor"), Cum.A.desired. = c(NA, 1000L, 1000L, 4000L, 4000L, NA, 1000L, 1000L, 4000L, 4000L), Cum.B.desired. = c(NA, NA, 2000L, 2000L, 2000L, NA, NA, 2000L, 2000L, 2000L)), .Names = c("Name", "Event", "SalesAmount", "Fund", "Cum.A.desired.", "Cum.B.desired." ), class = "data.frame", row.names = c(NA, -10L))

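To see why the masking has to happen inside cumsum(), here is a sketch on plain vectors for a single person (the values are John's rows from the question). Amounts that belong to another fund, or to no fund, are replaced by 0 before accumulating, so they leave the running total unchanged.

```r
# One person's rows from the question, as plain vectors
sales <- c(NA, 1000, 2000, 3000, NA)
fund  <- c(NA, "A", "B", "A", NA)

# Keep the amount only where the row belongs to the fund; use 0 elsewhere
only_a <- ifelse(!is.na(fund) & fund == "A", sales, 0)
only_b <- ifelse(!is.na(fund) & fund == "B", sales, 0)

# Running totals per fund
cum_a <- cumsum(only_a)
cum_b <- cumsum(only_b)

cum_a
cum_b
```

The attempt in the question does it the other way round: cumsum(SalesAmount) accumulates every sale regardless of fund, and the ifelse() only decides on which rows that combined total is displayed.
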
Here's an approach generalizing to more funds, using zoo and data.table:
# prep
require(data.table)
require(zoo)
setDT(d)
d[,Fund:=as.character(Fund)] # because factors are the worst
uf <- unique(d[Event=="Sale"]$Fund) # collect set of funds
First, assign cumulative sales on the relevant subset of observations:
for (f in uf) d[(Event=="Sale"&Fund==f),paste0('c',f):=cumsum(SalesAmount),by=Name]
Then, carry the last observation forward:
d[,paste0('c',uf):=lapply(.SD,na.locf,na.rm=FALSE),.SDcols=paste0('c',uf),by=Name]

You can shorten @Marat's answer slightly by rolling it all into a single mutate:
df %>%
group_by(Name) %>%
mutate(
cA = cumsum(ifelse(!is.na(Fund) & Fund == "A", SalesAmount, 0)),
cB = cumsum(ifelse(!is.na(Fund) & Fund == "B", SalesAmount, 0)),
cA = ifelse(cA == 0, NA, cA),
cB = ifelse(cB == 0, NA, cB)
)

Merging cases into one in R

I have a very newbie question. I'm using the Aid Worker Security Database, which records episodes of violence against aid workers, with incident reports from 1997 through the present. The events are marked independently in the dataset. I would like to merge all events that happened in a single country in a given year, sum the values of the other variables and create a simple time series with the same number of years for all countries (1997-2013). Any idea how to do it?
df
# year country totalnationals internationalskilled
# 1 1997 Rwanda 0 3
# 2 1997 Cambodia 1 0
# 3 1997 Somalia 0 1
# 4 1997 Rwanda 1 0
# 5 1997 DR Congo 10 0
# 6 1997 Somalia 1 0
# 7 1997 Rwanda 1 0
# 8 1998 Angola 5 0
Where "df" is defined as:
df <- structure(list(year = c(1997L, 1997L, 1997L, 1997L, 1997L, 1997L,
1997L, 1998L), country = c("Rwanda", "Cambodia", "Somalia", "Rwanda",
"DR Congo", "Somalia", "Rwanda", "Angola"), totalnationals = c(0L,
1L, 0L, 1L, 10L, 1L, 1L, 5L), internationalskilled = c(3L, 0L,
1L, 0L, 0L, 0L, 0L, 0L)), .Names = c("year", "country", "totalnationals",
"internationalskilled"), class = "data.frame", row.names = c(NA, -8L))
I would like to have something like that:
# year country totalnationals internationalskilled
# 1 1997 Rwanda 2 3
# 2 1997 Cambodia 1 0
# 3 1997 Somalia 1 1
# 4 1997 DR Congo 10 0
# 5 1997 Angola 0 0
# 6 1998 Rwanda 0 0
# 7 1998 Cambodia 0 0
# 8 1998 Somalia 0 0
# 9 1998 DR Congo 0 0
# 10 1998 Angola 5 0
Sorry for the very, very newbie question... but so far I couldn't figure out how to do it. Thanks! :-)

Updated after OP's comments -
df <- subset(df, year <= 2013 & year >= 1997)
df$totalnationals <- as.integer(df$totalnationals)
df$internationalskilled <- as.integer(df$internationalskilled)
df2 <- aggregate(data = df,cbind(totalnationals,internationalskilled)~year+country, sum)
To add 0s for years without a record -
df3 <- expand.grid(unique(df$year),unique(df$country))
df3 <- merge(df3,df2, all.x = TRUE, by = 1:2)
df3[is.na(df3)] <- 0
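
Note that unique(df$year) only covers years that appear in the data. If every country should get the full 1997-2013 range, as the question asks, the grid can be built from an explicit sequence of years instead. A minimal base R sketch using the sample df from the question (the names totals, grid and panel are illustrative):

```r
# Sample data from the question
df <- data.frame(
  year = c(1997L, 1997L, 1997L, 1997L, 1997L, 1997L, 1997L, 1998L),
  country = c("Rwanda", "Cambodia", "Somalia", "Rwanda",
              "DR Congo", "Somalia", "Rwanda", "Angola"),
  totalnationals = c(0L, 1L, 0L, 1L, 10L, 1L, 1L, 5L),
  internationalskilled = c(3L, 0L, 1L, 0L, 0L, 0L, 0L, 0L),
  stringsAsFactors = FALSE
)

# Sum both count columns per year and country
totals <- aggregate(cbind(totalnationals, internationalskilled) ~ year + country,
                    data = df, FUN = sum)

# Every country paired with every year from 1997 to 2013
grid <- expand.grid(year = 1997:2013, country = unique(df$country),
                    stringsAsFactors = FALSE)

# Left join the totals onto the full grid; missing combinations become NA
panel <- merge(grid, totals, by = c("year", "country"), all.x = TRUE)

# Replace the NA counts with 0, touching only the count columns
count_cols <- c("totalnationals", "internationalskilled")
panel[count_cols][is.na(panel[count_cols])] <- 0

# Sort into one time series per country
panel <- panel[order(panel$country, panel$year), ]
head(panel)
```

Naming the grid columns year and country lets merge() join by name rather than by column position.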

Same thing with data tables (can be faster on large datasets).
library(data.table)
dt <- data.table(df,key="year,country")
smry <- dt[,list(totalnationals =sum(totalnationals),
internationalskilled=sum(internationalskilled)),
by="year,country"]
countries <- unique(dt$country)
template <- data.table(year=rep(1997:2013,each=length(countries)),
country=countries,
key="year,country")
time.series <- smry[template]
time.series[is.na(time.series)]=0
