I have problems rearranging my data frame so that it is suitable for panel analysis.
The raw data looks like this (it covers all countries and 50 years; this is just the head):
head(suicide_data_panel)
country variable 1970 1971
Afghanistan suicide NA NA
Afghanistan unempl NA NA
Afghanistan hci NA NA
Afghanistan gini NA NA
Afghanistan inflation NA NA
Afghanistan cpi NA NA
I would like it to be:
country year suicide unempl
Afghanistan 1970 NA NA
Afghanistan 1971 NA NA
Afghanistan 1972 NA NA
Afghanistan 1973 NA NA
Afghanistan 1974 NA NA
Afghanistan 1975 NA NA
So that I can run a panel regression. I've tried to use dcast, but I don't know how to make it account for the different years:
suicide <- dcast(suicide_data_panel, country~variable, sum)
This command ends up taking only the last year into account:
head(suicide)
country account alcohol
1 Afghanistan -18.874843 NA
2 Albania -6.689212 NA
3 Algeria NA NA
4 American Samoa NA NA
5 Andorra NA NA
6 Angola 7.000035 NA
It sorts variables alphabetically. Please help.
You could try the tidyverse package:
library(tidyverse)
suicide_data_panel %>%
gather(year, dummy, -country, -variable) %>%
spread(variable, dummy)
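On current tidyr versions (1.0+), gather() and spread() are superseded by pivot_longer() and pivot_wider(). A sketch of the same reshape, using a toy two-year version of the data (column names assumed from the question):

```r
library(tidyr)

# toy version of the questioner's wide data: one row per country/variable pair
suicide_data_panel <- data.frame(
  country  = c("Afghanistan", "Afghanistan"),
  variable = c("suicide", "unempl"),
  `1970`   = c(1.1, 2.2),
  `1971`   = c(1.5, 2.5),
  check.names = FALSE
)

panel <- suicide_data_panel %>%
  # year columns become a long "year"/"value" pair
  pivot_longer(-c(country, variable), names_to = "year") %>%
  # each variable becomes its own column
  pivot_wider(names_from = variable, values_from = value)
panel
#   country     year  suicide unempl
# 1 Afghanistan 1970      1.1    2.2
# 2 Afghanistan 1971      1.5    2.5
```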
You can do this in two steps: first, use the melt function with "country" and "variable" as your ID variables; second, use the dcast function to turn "variable" into individual columns.
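A minimal sketch of that two-step route with the reshape2 package (the toy data and column names are assumed from the head() shown in the question):

```r
library(reshape2)

# toy version of the wide input: one row per country/variable pair
suicide_data_panel <- data.frame(
  country  = "Afghanistan",
  variable = c("suicide", "unempl"),
  `1970`   = c(1.1, 2.2),
  `1971`   = c(1.5, 2.5),
  check.names = FALSE
)

# step 1: melt the year columns into long format
long <- melt(suicide_data_panel, id.vars = c("country", "variable"),
             variable.name = "year")

# step 2: dcast so each level of "variable" becomes its own column
panel <- dcast(long, country + year ~ variable, value.var = "value")
panel
#       country year suicide unempl
# 1 Afghanistan 1970     1.1    2.2
# 2 Afghanistan 1971     1.5    2.5
```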
Here is a base R approach using reshape.
tm <- by(dat, dat$variable, reshape, varying=3:4, idvar="country",
direction="long", sep="", timevar="year", drop=2)
res <- cbind(tm[[1]][1:2], data.frame(mapply(`[[`, tm, 3)))
res
# country year hci suicide unempl
# DE.1970 DE 1970 1.51152200 1.3709584 0.6328626
# AT.1970 AT 1970 -0.09465904 -0.5646982 0.4042683
# CH.1970 CH 1970 2.01842371 0.3631284 -0.1061245
# DE.1971 DE 1971 0.63595040 -0.0627141 -1.3888607
# AT.1971 AT 1971 -0.28425292 1.3048697 -0.2787888
# CH.1971 CH 1971 -2.65645542 2.2866454 -0.1333213
Data
set.seed(42)
dat <- cbind(expand.grid(country=c("DE", "AT", "CH"),
variable=c("suicide", "unempl", "hci"),
stringsAsFactors=F), x1970=rnorm(9), x1971=rnorm(9))
I have some cumulative data on COVID-19 cases for countries and I am trying to calculate the difference in a new column called Diff. I can't remove the NA values because that would hide the dates on which no tests were carried out. So I have made it so that if there is an NA value, the Diff value is set to 0 to indicate there was no difference, hence no tests conducted that day.
I am also trying to add a condition saying that if the lagged difference is NA, indicating that there were no tests conducted the day before, then the difference should be set to the confirmed-cases value for that day.
As you can see from my results at the bottom, I am almost there, but I am creating a new column called ifelse. I tried to fix this but I think there is a simple error I am making somewhere. If anyone could point it out to me I would really appreciate it, thank you.
Edit: I realised I made a logical error in my thinking about setting the daily cases to the confirmed-cases value when the lag calculation is NA, because this gives a misleading answer.
I used the code below on the large dataset to fill down and repeat the previous values where NAs appear. I grouped by country so as not to simply propagate values forward across countries.
I then calculated the lag and used Ronak Shah's code to get the daily values.
data <- data %>%
group_by(CountryName) %>%
fill(ConfirmedCases, .direction = "down")
data <- data %>%
mutate(lag1 = ConfirmedCases - lag(ConfirmedCases))
data <- data %>% mutate(DailyCases = replace_na(coalesce(lag1, ConfirmedCases), 0))
library(tidyverse)
data <- data.frame(
stringsAsFactors = FALSE,
CountryName = rep("Afghanistan", 31),
ConfirmedCases = c(NA,7L,NA,NA,NA,10L,16L,21L,
22L,22L,22L,24L,24L,34L,40L,42L,
75L,75L,91L,106L,114L,141L,166L,
192L,235L,235L,270L,299L,337L,367L,
423L),
Diff = c(NA,NA,NA,NA,NA,NA,6L,5L,1L,
0L,0L,2L,0L,10L,6L,2L,33L,0L,16L,
15L,8L,27L,25L,26L,43L,0L,35L,
29L,38L,30L,56L)
)
data2 <- data %>%
mutate(Diff = ifelse(is.na(ConfirmedCases) == TRUE, 0, ConfirmedCases - lag(ConfirmedCases)),
ifelse(is.na((ConfirmedCases - lag(ConfirmedCases))) == TRUE, ConfirmedCases, ConfirmedCases - lag(ConfirmedCases)))
head(data2, 10)
#> CountryName ConfirmedCases Diff ifelse(...)
#> 1 Afghanistan NA 0 NA
#> 2 Afghanistan 7 NA 7
#> 3 Afghanistan NA 0 NA
#> 4 Afghanistan NA 0 NA
#> 5 Afghanistan NA 0 NA
#> 6 Afghanistan 10 NA 10
#> 7 Afghanistan 16 6 6
#> 8 Afghanistan 21 5 5
#> 9 Afghanistan 22 1 1
#> 10 Afghanistan 22 0 0
Created on 2020-08-15 by the reprex package (v0.3.0)
Maybe this can help by creating a duplicate of your target column:
library(tidyverse)
data %>%
  mutate(D = ConfirmedCases, D = ifelse(is.na(D), 0, D),
         Diff2 = c(0, diff(D)), Diff2 = ifelse(Diff2 < 0, 0, Diff2)) %>%
  select(-D)
Output:
CountryName ConfirmedCases Diff Diff2
1 Afghanistan NA NA 0
2 Afghanistan 7 NA 7
3 Afghanistan NA NA 0
4 Afghanistan NA NA 0
5 Afghanistan NA NA 0
6 Afghanistan 10 NA 10
7 Afghanistan 16 6 6
8 Afghanistan 21 5 5
9 Afghanistan 22 1 1
10 Afghanistan 22 0 0
11 Afghanistan 22 0 0
12 Afghanistan 24 2 2
13 Afghanistan 24 0 0
14 Afghanistan 34 10 10
15 Afghanistan 40 6 6
16 Afghanistan 42 2 2
17 Afghanistan 75 33 33
18 Afghanistan 75 0 0
19 Afghanistan 91 16 16
20 Afghanistan 106 15 15
21 Afghanistan 114 8 8
22 Afghanistan 141 27 27
23 Afghanistan 166 25 25
24 Afghanistan 192 26 26
25 Afghanistan 235 43 43
26 Afghanistan 235 0 0
27 Afghanistan 270 35 35
28 Afghanistan 299 29 29
29 Afghanistan 337 38 38
30 Afghanistan 367 30 30
31 Afghanistan 423 56 56
I think you can use coalesce to get the first non-NA value from Diff and ConfirmedCases, and if both of them are NA, replace the result with 0.
library(dplyr)
data %>%
mutate(Diff2 = tidyr::replace_na(coalesce(Diff, ConfirmedCases), 0))
# CountryName ConfirmedCases Diff Diff2
#1 Afghanistan NA NA 0
#2 Afghanistan 7 NA 7
#3 Afghanistan NA NA 0
#4 Afghanistan NA NA 0
#5 Afghanistan NA NA 0
#6 Afghanistan 10 NA 10
#7 Afghanistan 16 6 6
#8 Afghanistan 21 5 5
#9 Afghanistan 22 1 1
#10 Afghanistan 22 0 0
#11 Afghanistan 22 0 0
#12 Afghanistan 24 2 2
#...
#...
I have a data set containing information about academic degrees per year, like this:
Year1 Deg_Year1 Year2 Deg_Year2 Year3 Deg_Year3 Year4 Deg_Year4 Year5 Deg_Year5
2001 College 2004 Master NA NA NA NA NA NA
2004 College 2004 Master 2010 PHD NA NA NA NA
2006 Master 2006 College NA NA NA NA NA NA
2016 Master NA NA NA NA NA NA NA NA
2002 Master 2003 Master 2004 College 2004 Master NA NA
2014 Master 2017 PHD NA NA NA NA NA NA
I want to obtain a data frame that contains the year and the highest academic degree obtained just before 2015, like this:
YearX Highest_Degree
2004 Master
2010 PHD
2006 Master
NA NA
2004 Master
2014 Master
Ugh, what a terrible data format. We add an ID column, clean it up, and then we can get what you want in a few lines.
library(tidyr)
library(dplyr)
library(stringr)
# create ID column (assigned first so we can join back on it at the end)
dd <- mutate(dd, id = 1:n())
dd %>%
# convert degree and year columns to long format
gather(key = "degkey", value = "degree", starts_with("Deg")) %>%
gather(key = "yearkey", value = "year", starts_with("Year")) %>%
# pull the numbers into an index
mutate(yr_index = str_extract(yearkey, "[0-9]+"),
deg_index = str_extract(degkey, "[0-9]+")) %>%
# get rid of junk and filter to the years you want
filter(yr_index == deg_index, year < 2015) %>%
# order by descending index
arrange(desc(yr_index)) %>%
# keep relevant columns
select(id, degree, year) %>%
# for each ID, keep the top row
group_by(id) %>%
slice(1) %>%
# join back to the original to complete any lost IDs
right_join(select(dd, id))
# Joining, by = "id"
# # A tibble: 6 x 3
# # Groups: id [?]
# id degree year
# <int> <chr> <int>
# 1 1 Master 2004
# 2 2 PHD 2010
# 3 3 College 2006
# 4 4 <NA> NA
# 5 5 Master 2004
# 6 6 Master 2014
# Warning message:
# attributes are not identical across measure variables; they will be dropped
Using this data:
dd = read.table(text = "Year1 Deg_Year1 Year2 Deg_Year2 Year3 Deg_Year3 Year4 Deg_Year4 Year5 Deg_Year5
2001 College 2004 Master NA NA NA NA NA NA
2004 College 2004 Master 2010 PHD NA NA NA NA
2006 Master 2006 College NA NA NA NA NA NA
2016 Master NA NA NA NA NA NA NA NA
2002 Master 2003 Master 2004 College 2004 Master NA NA
2014 Master 2017 PHD NA NA NA NA NA NA",
header = T)
My data frame currently looks like this:
country_txt Year nkill_yr Countrycode Population deathsPer100k
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Afghanistan 1973 0 4 12028 0.000000e+00
2 Afghanistan 1979 53 4 13307 3.982866e-05
3 Afghanistan 1987 0 4 11503 0.000000e+00
4 Afghanistan 1988 128 4 11541 1.109089e-04
5 Afghanistan 1989 10 4 11778 8.490406e-06
6 Afghanistan 1990 12 4 12249 9.796718e-06
It contains a list of all countries and the terrorist deaths per 100,000 population.
Ideally I would like a data frame in wide format with the structure:
country_txt 1970 1971 1972 1973 1974 1975
Afghanistan 3.98 1.1 0 4.3 0.8 0.09
Albania 0 0.4 0.5 0 0 0
Algeria 0 0 0 0.1 0.2 0
Angola 0 0.3 0 0 0 0
Except my spread call currently repeats rows like this:
YearCountryRatio<- spread(data = YearCountryRatio, Year, deathsPer100k )
country_txt 1970 1971 1972 1973
Afghanistan 3.98 NA NA NA
Afghanistan NA 1.1 NA NA
Afghanistan NA NA 0 NA
Afghanistan NA NA NA 4.3
And similarly for other countries.
Is there any way to either:
collapse all of the NA values to show only one row per country, or
put it directly into wide format?
I've assumed you want each country_txt value reduced to a single row and are happy to drop the unused variables. (Note: I added a dummy country_txt value of "XYZ" to the sample data to show how multiple countries spread)
library(dplyr)
library(tidyr)
df <- read.table(text = "country_txt Year nkill_yr Countrycode Population deathsPer100k
1 Afghanistan 1973 0 4 12028 0.000000e+00
2 Afghanistan 1979 53 4 13307 3.982866e-05
3 Afghanistan 1987 0 4 11503 0.000000e+00
4 XYZ 1988 128 4 11541 1.109089e-04
5 XYZ 1989 10 4 11778 8.490406e-06
6 XYZ 1990 12 4 12249 9.796718e-06", header = TRUE)
df <- mutate(df, deathsPer100k = round(deathsPer100k*100000, 2))
select(df, country_txt, Year, deathsPer100k) %>% spread(Year, deathsPer100k, fill = 0)
#> country_txt 1973 1979 1987 1988 1989 1990
#> 1 Afghanistan 0 3.98 0 0.00 0.00 0.00
#> 2 XYZ 0 0.00 0 11.09 0.85 0.98
So here's my problem: I have about 40 datasets, all CSV files that contain only two columns, (a) Date and (b) Price (in each dataset the price column is named after its country). I used the merge function as follows to consolidate all the data into a single dataset with one date column and several price columns. This was the function I used:
merged <- Reduce(function(x, y) merge(x, y, by="Date", all=TRUE), list(a,b,c,d,e,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z,aa,ab,ac,ad,ae,af,ag,ah,ai,aj,ak,al,am,an))
What has happened is that I have, for instance, 3 rows for the same date, but the corresponding country values are split across them, e.g.:
# Date India China South Korea
# 01-Jan-2000 5445 NA 4445 NA
# 01-Jan-2000 NA 1234 NA NA
# 01-Jan-2000 NA NA NA 5678
I actually want
# 01-Jan-2000 5445 1234 4445 5678
I don't know how to get this, as the other questions related to this topic ask for summation of values, which I clearly do not need. This is a simple example; unfortunately I have daily data from Jan 2000 to November 2016 for about 43 countries, all messed up. Any help to solve this would be appreciated.
I would append all the data frames with rbind and reshape the result with spread(), since the result of merge depends on which data frame you start with.
Reproducible example:
library(dplyr)
library(tidyr)
a <- data.frame(date = Sys.Date()-1:10, cntry = "China", price=round(rnorm(10,20,5),2))
b <- data.frame(date = Sys.Date()-6:15, cntry = "Netherlands", price=round(rnorm(10,50,10),2))
c <- data.frame(date = Sys.Date()-11:20, cntry = "USA", price=round(rnorm(10,70,25),2))
all <- do.call(rbind, list(a,b,c))
all %>% group_by(date) %>% spread(cntry, price)
results in:
date China Netherlands USA
* <date> <dbl> <dbl> <dbl>
1 2016-11-29 NA NA 78.75
2 2016-11-30 NA NA 66.22
3 2016-12-01 NA NA 86.04
4 2016-12-02 NA NA 17.07
5 2016-12-03 NA NA 75.72
6 2016-12-04 NA 46.90 39.57
7 2016-12-05 NA 51.80 65.11
8 2016-12-06 NA 57.50 96.36
9 2016-12-07 NA 46.42 46.93
10 2016-12-08 NA 45.71 57.63
11 2016-12-09 15.41 60.09 NA
12 2016-12-10 16.66 60.07 NA
13 2016-12-11 23.72 66.21 NA
14 2016-12-12 19.82 45.46 NA
15 2016-12-13 14.22 45.07 NA
16 2016-12-14 27.26 NA NA
17 2016-12-15 20.08 NA NA
18 2016-12-16 15.79 NA NA
19 2016-12-17 17.66 NA NA
20 2016-12-18 26.77 NA NA
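If you would rather keep the already-merged data frame, another option (a sketch, assuming at most one non-NA price per country per date, and needing dplyr 1.0+ for across()) is to collapse the duplicated date rows by taking the first non-NA value in each column:

```r
library(dplyr)

# illustrative version of the split merge result from the question
merged <- data.frame(
  Date  = rep("01-Jan-2000", 3),
  India = c(5445, NA, NA),
  China = c(NA, 1234, NA),
  Korea = c(NA, NA, 5678)
)

collapsed <- merged %>%
  group_by(Date) %>%
  # keep the first non-NA value per column within each date
  summarise(across(everything(), ~ first(na.omit(.x))), .groups = "drop")
collapsed
# one row per date: 01-Jan-2000  5445  1234  5678
```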
I have a data frame called df with 10 variables inside it.
df contains a list of countries together with their GDP, unemployment level, whether they have been colonised (as TRUE), etc.
For each of the variables gdp, unemployment level and colonised, I know there are a number of NAs.
Is there a command that lists the names of the countries that have NAs? E.g. if the UK has NA for gdp but has unemployment and colonised, and France has gdp and unemployment but NA for colonised,
is there a command that will return a list containing the UK and France because they have NAs?
My data:
destination origin sum gdp.diff unemployment.diff
1 Albania Azerbaijan 2 27 8.467610
2 Albania Congo 1 -21 NA
3 Albania Dem. Rep. of the Congo 1 -80 13.437610
4 Albania Eritrea 21 -66 NA
5 Albania Iran (Islamic Rep. of) 279 5 2.997610
6 Albania Mali 1 -68 6.137609
So I need Albania to appear in the list because it has an NA for unemployment.diff.
Using complete.cases:
#dummy data
df <- data.frame(country = letters[1:3],
gdp = c(1,NA,2),
unemployment = c(1,2,3),
colonised = c(T,F,NA))
df
# country gdp unemployment colonised
# 1 a 1 1 TRUE
# 2 b NA 2 FALSE
# 3 c 2 3 NA
df[ !complete.cases(df), ]
# country gdp unemployment colonised
# 2 b NA 2 FALSE
# 3 c 2 3 NA
# check for NAs on one column
df[ is.na(df$gdp), ]
# country gdp unemployment colonised
# 2 b NA 2 FALSE
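For completeness, a dplyr sketch of the same check (if_any() needs dplyr 1.0 or later), returning just the names of the rows that have at least one NA:

```r
library(dplyr)

# same dummy data as above
df <- data.frame(country = c("a", "b", "c"),
                 gdp = c(1, NA, 2),
                 unemployment = c(1, 2, 3),
                 colonised = c(TRUE, FALSE, NA))

na_countries <- df %>%
  # keep rows where any column other than country is NA
  filter(if_any(-country, is.na)) %>%
  pull(country)
na_countries
# [1] "b" "c"
```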