In R, how do I add rows together to get totals for a specific set of variables [duplicate] - r

This question already has answers here:
How to sum a variable by group
(18 answers)
Closed 1 year ago.
My goal is to have a list of how much FDI China sent to each country per year. At the moment I have a list of individual projects that looks like this
Year Country Amount
2001 Angola 6000000
2001 Angola 8000000
2001 Angola 5.0E7
I want to sum it so it looks like this.
Year Country Amount
2001 Angola 6.4E7
How do I merge the rows and add the totals to get nice country-year data? I can't find an R command that does this precise thing.

library(tidyverse)
I copied the data table and read your dataframe into R using:
df <- clipr::read_clip_tbl(clipr::read_clip())
I like using dplyr to solve this question:
df2 <- as.data.frame(df %>% group_by(Country,Year) %>% summarize(Amount=sum(Amount)))
# A tibble: 1 x 3
# Groups: Country [1]
Country Year Amount
<chr> <int> <dbl>
1 Angola 2001 64000000
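For a check that does not depend on the clipboard, the same aggregation can be sketched on a hand-typed copy of the three sample rows. The data frame name `fdi` is made up for illustration, `.groups = "drop"` assumes dplyr 1.0.0 or later, and the base R `aggregate()` call is an alternative technique rather than part of the answer above.

```r
library(dplyr)

# Hand-typed copy of the sample rows from the question (hypothetical name `fdi`)
fdi <- data.frame(
  Year    = c(2001, 2001, 2001),
  Country = c("Angola", "Angola", "Angola"),
  Amount  = c(6000000, 8000000, 5.0e7),
  stringsAsFactors = FALSE
)

# One row per country-year, with the project amounts summed
totals <- fdi %>%
  group_by(Country, Year) %>%
  summarise(Amount = sum(Amount), .groups = "drop")

# Base R alternative that produces the same totals
totals_base <- aggregate(Amount ~ Country + Year, data = fdi, FUN = sum)
```

Both `totals` and `totals_base` should contain a single Angola/2001 row whose Amount is the sum of the three projects.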

Related

How to exclude observation that does not appear at least once every year - R

I have a database where companies are identified by an ID (cnpjcei) from 2009 to 2018, where we can have 1 or more observations of a given company in a given year or no observations of a given company in a given year.
Here is a sample of the database:
> df
cnpjcei year
<chr> <dbl>
1 4774 2009
2 4774 2010
3 28959 2009
4 29688 2009
5 43591 2010
6 43591 2010
7 65803 2011
8 105104 2011
9 113980 2012
10 220043 2013
I would like to keep in that df only the companies that appear at least once in every year.
What would be the easiest way to do this?
Using the data.table library:
library(data.table)
df <- data.table(df)
# keep only companies observed in all 10 years (2009 to 2018)
df <- df[, unique_years := length(unique(year)), by = cnpjcei][unique_years == 10]
We can use dplyr, group_by id and filter only the cases in which all the elements in 2009:2018 can be found %in% the year column.
Please mind that, for this code to work with the sample database as in the question, the range would have to be replaced with 2009:2013
library(dplyr)
df %>% group_by(cnpjcei) %>% filter(all(2009:2018 %in% year))
You can keep the ids (cnpjcei) that have all the unique years available in the data.
library(dplyr)
result <- df %>%
group_by(cnpjcei) %>%
filter(n_distinct(year) == n_distinct(.$year)) %>%
ungroup
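To see the filtering behave on something small, here is a minimal sketch with invented data. The names `toy`, `all_years` and `kept` are hypothetical, and the year range is shortened to 2009:2011 so the example stays readable; for the real data it would be 2009:2018.

```r
library(dplyr)

# Toy data (invented): company "A" appears in every year, "B" misses 2011
toy <- data.frame(
  cnpjcei = c("A", "A", "A", "A", "B", "B"),
  year    = c(2009, 2010, 2010, 2011, 2009, 2010),
  stringsAsFactors = FALSE
)

# Years every company must cover
all_years <- 2009:2011

# Keep all rows of the companies that have at least one row in each year
kept <- toy %>%
  group_by(cnpjcei) %>%
  filter(all(all_years %in% year)) %>%
  ungroup()
```

Here `kept` should retain the four rows of company "A", including its repeated 2010 observation, and drop company "B" entirely.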

Scatter plot with variables that have multiple different years

I'm currently trying to make a scatter plot of child mortality rate against child labor. My problem is that I don't have much data: some countries only have values for some years and other countries only for other years, so I can't plot all the data together, and no single year has enough observations to restrict the plot to that year. I was wondering if there is a function that takes the last value available in the dataset for a given variable. For instance, if my last data point for child labor in Germany is from 2015 and my last one for Italy is from 2014, and so on for the rest of the countries, is there a way I can plot the last value for each country?
Code goes like this:
head(data2)
# A tibble: 6 x 5
Entity Code Year mortality labor
<chr> <chr> <dbl> <dbl> <dbl>
1 Afghanistan AFG 1962 34.5 NA
2 Afghanistan AFG 1963 33.9 NA
3 Afghanistan AFG 1964 33.3 NA
4 Afghanistan AFG 1965 32.8 NA
5 Afghanistan AFG 1966 32.2 NA
6 Afghanistan AFG 1967 31.7 NA
Never mind about those NA's. Labor data just doesn't go back there. But I do have it in the dataset, for more recent years. Child mortality data, on the other hand, is actually pretty complete.
Thanks.
I cannot tell which variables you want to plot, but the following code selects only the last year for each country.
data2 %>%
group_by(Entity) %>%
filter(Year == max(Year)) %>%
ungroup
The result looks like this:
Entity Code Year mortality labor
<chr> <chr> <dbl> <dbl> <lgl>
1 Afghanistan AFG 1967 31.7 NA
Now you can plot the variables of interest.
You might want to define what you mean by 'last' value per group - as in most recent, last occurrence in the data or something else?
dplyr::last picks out the last occurrence in the data, so you could use it along with arrange to order your data. In this example we sort the data by Year (ascending order by default), so the last observation will be the most recent. Assuming you don't want to include NA values, we also use filter to remove them from the data.
data2 %>%
# first remove NAs from the data
filter(
!is.na(labor)
) %>%
# then sort the data by Year
arrange(Year) %>%
# then extract the last observation per country
group_by(Entity) %>%
summarise(
last_record = last(labor)
)
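Since a scatter plot needs both variables from the same row, one possible variation is to keep, for each country, the most recent year in which both mortality and labor are present. This is only a sketch on invented numbers: the tibble `toy` stands in for data2, and `slice_max()` assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# Toy stand-in for data2 (all values are invented for illustration)
toy <- tibble(
  Entity    = c("Germany", "Germany", "Germany", "Italy", "Italy"),
  Year      = c(2013, 2014, 2015, 2013, 2014),
  mortality = c(4.1, 4.0, 3.9, 3.8, 3.7),
  labor     = c(NA, 1.2, 1.1, 2.0, NA)
)

latest <- toy %>%
  # keep only rows where both plotted variables are available
  filter(!is.na(labor), !is.na(mortality)) %>%
  group_by(Entity) %>%
  # most recent remaining year for each country
  slice_max(Year, n = 1) %>%
  ungroup()
```

The resulting `latest` has one row per country and can be handed to a plotting call such as `plot(latest$labor, latest$mortality)`.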

R - read Excel file and switch variables into observations

I am struggling with some data transformation. I have some huge xlsx files with insurance data. The data is structured a little bit like a "pyramid". The first line represents the quarter in which the survey took place. The next line is a breakdown by age categories. There are 4 categories: in total values, up to 17, 18-64 and 65+. One sheet contains 4 quarters, so basically 48 unique variables plus the column with country names. One Excel file contains 3 sheets (2016, 2017 and 2018). The screenshot (INPUT DATA) comes from an Excel file named "sick-leave blue collar workers". I also have two other files: "sick-leave worker" and "sick-leave self-employment". The goal is to combine all three files and create a file with a structure like the one in RESULT DATA. Could you please help me?
INPUT DATA:
RESULT DATA:
Here is a solution that uses the readxl and tidyr packages from the Tidyverse. To make the script reproducible, I created an Excel version of the OP screen capture and saved it to my stackoverflowAnswers github repository. The script downloads the Excel file, reads it, and converts it to Tidy Data format.
# download Excel file from github repository
sourceFile <- "https://raw.githubusercontent.com/lgreski/stackoverflowanswers/master/data/soQuestion53446800.xlsx"
destinationFile <- "./soQuestion53446800.xlsx"
download.file(sourceFile,destinationFile,mode="wb")
library(readxl)
library(tidyr)
# set constants
typeOfLeave <- "sick"
group <- "self employed"
# read date and extract the value
theDate <- read_excel(destinationFile,range="A2:A2",col_names=FALSE)[[1]]
# setup column names using underscore so we can separate key column into Sex and Age columns
theCols <- c("Country","both_all","women_all","men_all","both_up to 17","women_up to 17","men_up to 17")
theData <- read_excel(destinationFile,range="A5:G9",col_names=theCols)
# use tidyr / dplyr to transform the data
theData %>% gather(.,key="key",value="Amount",2:7) %>% separate(.,key,into=c("Sex","Age"),sep="_") -> tidyData
# assign constants
tidyData$typeOfLeave <- typeOfLeave
tidyData$group <- group
tidyData$date <- theDate
tidyData
...and the output:
> tidyData
# A tibble: 30 x 7
Country Sex Age Amount typeOfLeave group date
<chr> <chr> <chr> <dbl> <chr> <chr> <dttm>
1 Total both all 151708 sick self employed 2016-03-31 00:00:00
2 Afganistan both all 269 sick self employed 2016-03-31 00:00:00
3 Albania both all 129 sick self employed 2016-03-31 00:00:00
4 Algeria both all 308 sick self employed 2016-03-31 00:00:00
5 Andora both all 815 sick self employed 2016-03-31 00:00:00
6 Total women all 49919 sick self employed 2016-03-31 00:00:00
7 Afganistan women all 104 sick self employed 2016-03-31 00:00:00
8 Albania women all 30 sick self employed 2016-03-31 00:00:00
9 Algeria women all 18 sick self employed 2016-03-31 00:00:00
10 Andora women all 197 sick self employed 2016-03-31 00:00:00
# ... with 20 more rows
Key elements in the solution
Microsoft Excel is frequently used as a data entry and reporting tool, which leads people to structure their spreadsheets in hierarchical table formats like the one illustrated in the OP. This format makes the data difficult to use in R, because the column names represent combinations of information that is rendered hierarchically in table headers within the spreadsheet.
In this section we'll explain some of the key design elements in the solution to the problem posed in the OP, including:
1. Reading Excel files via exact cell references with readxl::read_excel()
2. Reading a single cell into a constant
3. Setting column names for ease of use with tidyr::separate()
4. Restructuring to narrow format Tidy Data
5. Assigning constants
1. Reading exact cell references
The OP question notes that there is a heading row containing a date for all the cells in a particular table. To simulate this in the sample spreadsheet I used to replicate the screen shot in the OP, I assigned the date of March 31, 2016 to cell A2 of Sheet 1 in an Excel workbook.
readxl::read_excel() enables reading of exact cell references with the range= argument.
2. Reading a constant from one cell
If we set the range= argument to a single cell and extract the cell with the [[ form of the extract operator, the resulting object is a single element vector instead of a data frame. This makes it possible to use vector recycling to assign this value to the tidy data frame later in the R script. Since everything in R is an object, we can use the [[ extract operator on the result of read_excel() to assign the result to theDate.
theDate <- read_excel(destinationFile,range="A2:A2",col_names=FALSE)[[1]]
3. Setting column names for ease of use with tidyr::separate()
One of the characteristics that makes the original spreadsheet messy as opposed to Tidy Data is the fact that each column of data represents a combination of Sex and Age values.
The desired output data frame includes columns for both Sex and Age, and therefore we need a way to extract this information from the column names. The tidyr package provides a function to support this technique, the separate() function.
To facilitate use of this function, we assign column names with an underscore separator to distinguish the Sex and Age components in the column names.
theCols <- c("Country","both_all","women_all","men_all","both_up to 17","women_up to 17","men_up to 17")
4. Restructuring the data to narrow format Tidy Data
The key step in the script is a sequence of Tidyverse functions that takes the data frame read with read_excel(), uses tidyr::gather() on columns 2 - 7 to create one row per unique combination of Country, Sex, and Age, and then splits the resulting key column into the Sex and Age columns.
theData %>% gather(.,key="key",value="Amount",2:7) %>% separate(.,key,into=c("Sex","Age"),sep="_") -> tidyData
Data left of the underscore is assigned to the Sex column, and data right of the underscore is assigned to Age. Note that the OP doesn't specify how the totals should be handled in the output. Since total doesn't make sense as a value for Sex, I used both in its place. Similarly, for Age I labeled the total as all.
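In tidyr 1.0.0 and later, gather() is superseded and the two steps (reshape, then split the key) can be done in a single `pivot_longer()` call with `names_sep`. The sketch below swaps in that technique on a small invented data frame; `wide` and `long` are hypothetical names and the numbers are not taken from the workbook.

```r
library(tidyr)

# Small stand-in for theData (numbers are invented)
wide <- data.frame(
  Country         = c("Albania", "Algeria"),
  both_all        = c(10, 20),
  women_all       = c(4, 8),
  `both_up to 17` = c(1, 2),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Reshape to one row per Country / Sex / Age, splitting names at the underscore
long <- pivot_longer(
  wide,
  cols      = -Country,
  names_to  = c("Sex", "Age"),
  names_sep = "_",
  values_to = "Amount"
)
```

One visible difference from gather() is row order: `pivot_longer()` keeps all rows for one country together, whereas the gather() output above is ordered column by column.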
5. Assigning constants
The OP does not explain where the constants sick and group are sourced, so I assigned them as constants at the start of the program. If these are included in the hierarchical part of the spreadsheet, they can easily be read using the technique I used to extract the date from the spreadsheet.
Once the data is in tidy format, we add the remaining constants via the assignment operator, taking advantage of vector recycling in R.
tidyData$typeOfLeave <- typeOfLeave
tidyData$group <- group
tidyData$date <- theDate
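The two mechanics used here, extracting a single value with `[[` and letting R recycle it on assignment, can be tried without an Excel file. In this sketch `one_cell` is a hypothetical one-cell data frame standing in for the result of a single-cell read, and `tidy` is a toy data frame built from three of the amounts shown in the output above.

```r
# One-cell data frame standing in for a single-cell read (hypothetical)
one_cell <- data.frame(X1 = as.Date("2016-03-31"))

# [[1]] pulls out the first column as a plain vector of length 1
the_date <- one_cell[[1]]

# Toy tidy data; length-1 values are recycled to every row on assignment
tidy <- data.frame(
  Country = c("Albania", "Algeria", "Andora"),
  Amount  = c(129, 308, 815),
  stringsAsFactors = FALSE
)
tidy$date  <- the_date
tidy$group <- "self employed"
```

After the two assignments every row of `tidy` carries the same date and group value.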
Additional considerations
If the total values are not required in the output data frame, they can easily be eliminated by using the extract operator on the tidy data, or dropping columns from the messy data frame prior to using gather().
Note that I chose to leave the totals in the output data frame because almost all of the data in the screen capture represented totals of one form or another (i.e. only 2 of the 30 cells of data in the OP screen capture were not totals), and eliminating this data would make it difficult to confirm that the script worked correctly.
The solution can be extended to cover age categories referenced in the OP but not illustrated in the spreadsheet by adding appropriate column names to theCols vector, and by changing the range= argument in the read_excel() function that reads the bulk of the spreadsheet.
UPDATE: reading multiple quarters from a specific worksheet
On November 29th the original poster modified the question to explain that there were multiple worksheets in the Excel file, one for each year. This is easily handled with the following modifications.
Specify a worksheet with the sheet= parameter
Add _Q1 to distinguish each quarter's read, and save the quarter as a key variable
Set worksheet names to years
The resulting tidy data will have year and quarter columns. Note that I updated my Excel workbook with dummy data so worksheets representing different years have different data so the results are distinguishable.
# download file from github to make script completely reproducible
sourceFile <- "https://raw.githubusercontent.com/lgreski/stackoverflowanswers/master/data/soQuestion53446800.xlsx"
destinationFile <- "./soQuestion53446800.xlsx"
download.file(sourceFile,destinationFile,mode="wb")
# set constants
typeOfLeave <- "sick"
group <- "self employed"
year <- "2018"
# setup column names using underscore so we can separate key column into Sex, Age, and Quarter columns
# after using rep() to build data with required repeating patterns, avoiding manual typing of all the column names
sex <- rep(c("both","women","men"),16)
age <- rep(c(rep("all",3),rep("up to 17",3),rep("18 to 64",3),rep("65 and over",3)),4)
quarter <- c(rep("Q1",12),rep("Q2",12),rep("Q3",12),rep("Q4",12))
data.frame(sex,age,quarter) %>% unite(excelColNames) -> columnsData
theCols <- unlist(c("Country",columnsData["excelColNames"]))
theData <- read_excel(destinationFile,sheet=year,range="A5:AW9",col_names=theCols)
# use tidyr / dplyr to transform the data
theData %>% gather(.,key="key",value="Amount",2:49) %>% separate(.,key,into=c("Sex","Age","Quarter"),sep="_") -> tidyData
# assign constants
tidyData$typeOfLeave <- typeOfLeave
tidyData$group <- group
tidyData$year <- year
tidyData
...and the output, reading from the 2018 sheet in the workbook.
> tidyData
# A tibble: 240 x 8
Country Sex Age Quarter Amount typeOfLeave group year
<chr> <chr> <chr> <chr> <dbl> <chr> <chr> <chr>
1 Total both all Q1 2100 sick self employed 2018
2 Afganistan both all Q1 2100 sick self employed 2018
3 Albania both all Q1 2100 sick self employed 2018
4 Algeria both all Q1 2100 sick self employed 2018
5 Andora both all Q1 2100 sick self employed 2018
6 Total women all Q1 900 sick self employed 2018
7 Afganistan women all Q1 900 sick self employed 2018
8 Albania women all Q1 900 sick self employed 2018
9 Algeria women all Q1 900 sick self employed 2018
10 Andora women all Q1 900 sick self employed 2018
# ... with 230 more rows
>
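The column-name construction used in the script above can be checked in isolation. This is a sketch of an equivalent base R approach using `rep()` with `each=`/`times=` and `paste()` instead of `data.frame()` plus `unite()`; the names `value_cols` and `the_cols` are made up for illustration.

```r
# Sex varies fastest, then age group, then quarter
sex     <- rep(c("both", "women", "men"), times = 16)
age     <- rep(rep(c("all", "up to 17", "18 to 64", "65 and over"), each = 3), times = 4)
quarter <- rep(c("Q1", "Q2", "Q3", "Q4"), each = 12)

# 48 value columns, joined with the underscore that separate() later splits on
value_cols <- paste(sex, age, quarter, sep = "_")
the_cols   <- c("Country", value_cols)
```

Inspecting `head(the_cols)` before passing the vector to `col_names=` is a cheap way to confirm that the names line up with the spreadsheet's column order.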
If we change the configuration parameters we can read the 2017 data from the workbook I posted to GitHub.
# read second worksheet to illustrate multiple reads
# set constants
typeOfLeave <- "sick"
group <- "self employed"
year <- "2017"
theData <- read_excel(destinationFile,sheet=year,range="A5:AW9",col_names=theCols)
# use tidyr / dplyr to transform the data
theData %>% gather(.,key="key",value="Amount",2:49) %>% separate(.,key,into=c("Sex","Age","Quarter"),sep="_") -> tidyData
# assign constants
tidyData$typeOfLeave <- typeOfLeave
tidyData$group <- group
tidyData$year <- year
tidyData
...and the output:
> tidyData
# A tibble: 240 x 8
Country Sex Age Quarter Amount typeOfLeave group year
<chr> <chr> <chr> <chr> <dbl> <chr> <chr> <chr>
1 Total both all Q1 33000 sick self employed 2017
2 Afganistan both all Q1 33000 sick self employed 2017
3 Albania both all Q1 33000 sick self employed 2017
4 Algeria both all Q1 33000 sick self employed 2017
5 Andora both all Q1 33000 sick self employed 2017
6 Total women all Q1 15000 sick self employed 2017
7 Afganistan women all Q1 15000 sick self employed 2017
8 Albania women all Q1 15000 sick self employed 2017
9 Algeria women all Q1 15000 sick self employed 2017
10 Andora women all Q1 15000 sick self employed 2017
# ... with 230 more rows
>
Pulling it all together...
At this point we have built the basic ideas into a script that completely reads one worksheet. If we modify the code slightly and incorporate a function such as lapply(), we can start with a vector of worksheet names, read each worksheet, convert it to tidy data format, and combine the results into a single tidy data set with do.call() and rbind().
## version that combines multiple years into a single narrow format tidy data file
# download file from github to make script completely reproducible
sourceFile <- "https://raw.githubusercontent.com/lgreski/stackoverflowanswers/master/data/soQuestion53446800.xlsx"
destinationFile <- "./soQuestion53446800.xlsx"
download.file(sourceFile,destinationFile,mode="wb")
library(readxl)
library(tidyr)
# set constants
years <- c("2017","2018")
typeOfLeave <- "sick"
group <- "self employed"
# setup column names using underscore so we can separate key column into Sex, Age, and Quarter columns
# after using rep() to build data with required repeating patterns, avoiding manual typing of all the column names
sex <- rep(c("both","women","men"),16)
age <- rep(c(rep("all",3),rep("up to 17",3),rep("18 to 64",3),rep("65 and over",3)),4)
quarter <- c(rep("Q1",12),rep("Q2",12),rep("Q3",12),rep("Q4",12))
data.frame(sex,age,quarter) %>% unite(excelColNames) -> columnsData
theCols <- unlist(c("Country",columnsData["excelColNames"]))
lapply(years,function(x){
theData <- read_excel(destinationFile,sheet=x,range="A5:AW9",col_names=theCols)
# use tidyr / dplyr to transform the data
theData %>% gather(.,key="key",value="Amount",2:49) %>% separate(.,key,into=c("Sex","Age","Quarter"),sep="_") -> tidyData
# assign constants
tidyData$typeOfLeave <- typeOfLeave
tidyData$group <- group
tidyData$year <- x
tidyData
}) %>% do.call(rbind,.) -> combinedData
...and the output, demonstrating that the combinedData data frame includes data from both 2017 and 2018 worksheets.
> head(combinedData)
# A tibble: 6 x 8
Country Sex Age Quarter Amount typeOfLeave group year
<chr> <chr> <chr> <chr> <dbl> <chr> <chr> <chr>
1 Total both all Q1 33000 sick self employed 2017
2 Afganistan both all Q1 33000 sick self employed 2017
3 Albania both all Q1 33000 sick self employed 2017
4 Algeria both all Q1 33000 sick self employed 2017
5 Andora both all Q1 33000 sick self employed 2017
6 Total women all Q1 15000 sick self employed 2017
> tail(combinedData)
# A tibble: 6 x 8
Country Sex Age Quarter Amount typeOfLeave group year
<chr> <chr> <chr> <chr> <dbl> <chr> <chr> <chr>
1 Andora women 65 and over Q4 2300 sick self employed 2018
2 Total men 65 and over Q4 2400 sick self employed 2018
3 Afganistan men 65 and over Q4 2400 sick self employed 2018
4 Albania men 65 and over Q4 2400 sick self employed 2018
5 Algeria men 65 and over Q4 2400 sick self employed 2018
6 Andora men 65 and over Q4 2400 sick self employed 2018
>
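The lapply() / do.call(rbind, ...) pattern can be tried on its own, without the workbook. In this sketch `read_year()` is a hypothetical stand-in for the read-and-tidy step and returns invented numbers.

```r
# Hypothetical stand-in for reading and tidying one worksheet
read_year <- function(year) {
  data.frame(
    Country = c("Albania", "Algeria"),
    Amount  = c(1, 2) * as.numeric(year),  # invented numbers
    year    = year,
    stringsAsFactors = FALSE
  )
}

years <- c("2017", "2018")

# One data frame per year, then stack them row-wise into a single data frame
per_year <- lapply(years, read_year)
combined <- do.call(rbind, per_year)
```

Because every element of `per_year` has the same columns, `rbind()` can stack them directly; `dplyr::bind_rows(per_year)` would be another way to do the same stacking.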

Correlation of multiple values across years and strata

I have a dataframe that looks like this:
Year Strata Value1 Value2
1999 1 44268 0.8725
1999 2 46009 1.4550
1999 3 27715 3.1100
2000 1 24015 1.5800
2000 2 55601 1.5400
2000 3 15765 3.3200
I'm looking to find if value1 is correlated with value2, across years and strata. The real dataframe has many more years than this.
The repeated measure needs to be year, and it needs to be blocked by strata.
How is this done using R? Do you need to use aov()?

Panel Data from Long to wide reshape or cast

Hi, I have panel data and would like to reshape (cast) my Indicator name column from long to wide format. Currently all the columns are in long format: Year (1960-2011), Country Name (all the countries in the world), Indicator name (varying by indicator) and Value (the individual values corresponding to year, indicator name and country name). How can I do this? I would like each indicator to become its own column with the corresponding values below it, alongside the year and country name columns. Please help.
Indicator.Name Year Country
GDP 1960 USA
GDP 1960 UK
Country Name Year GDP PPP HHH
USA 1960 7 9 10
Uk 1960 9 10 NA
World 1960 7 5 3
Africa 1960 3 7 NA
Try using dcast from reshape2 like below:
library(reshape2)
indicator <- c('PPP','PPP','GDP','GDP')
country.name <- c('USA','UK','USA','UK')
year <- c(1960,1961,1960,1961)
value <- c(5,7,8,9)
d <- data.frame(indicator, country.name, year, value)
# value.var names the column whose values fill the new indicator columns
d1 <- dcast(d, country.name + year ~ indicator, value.var = "value")
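The same reshape can also be sketched with `tidyr::pivot_wider()` (tidyr 1.0.0 or later), which is a different function from the dcast() call above but produces the same layout. The data frame is the same toy example; `wide` is a made-up name.

```r
library(tidyr)

# Same toy data as in the dcast() example
d <- data.frame(
  indicator    = c("PPP", "PPP", "GDP", "GDP"),
  country.name = c("USA", "UK", "USA", "UK"),
  year         = c(1960, 1961, 1960, 1961),
  value        = c(5, 7, 8, 9),
  stringsAsFactors = FALSE
)

# One column per indicator; country.name and year identify each row
wide <- pivot_wider(d, names_from = indicator, values_from = value)
```

Country-year combinations that lack a given indicator would show up as NA in that indicator's column, which matches the NA entries in the desired output in the question.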
