Interpolating missing data in a dataframe with R

I have a dataframe which is similar to the one below:
Country Ccode Year Happiness Power
1 France FR 2000 1000 1000
2 France FR 2001 NA NA
3 France FR 2002 NA NA
4 France FR 2003 1600 2200
5 France FR 2004 NA NA
6 UK UK 2000 1000 1000
7 UK UK 2001 NA NA
8 UK UK 2002 1000 1000
9 UK UK 2003 1000 1000
10 UK UK 2004 1000 1000
I have previously used the following code to get the differences:
df <- df %>%
arrange(country, year) %>% #sort data
group_by(country) %>%
mutate_if(is.numeric, funs(d = . - lag(.)))
I would like to expand on this code: calculate the difference between consecutive known data points of Happiness and Power, divide it by the number of years between those data points, and use the result to compute the values that replace the NAs, giving the following output.
Country Ccode Year Happiness Power
1 France FR 2000 1000 1000
2 France FR 2001 1200 1400
3 France FR 2002 1400 1800
4 France FR 2003 1600 2200
5 France FR 2004 NA NA
6 UK UK 2000 1000 1000
7 UK UK 2001 0 0
8 UK UK 2002 1000 1000
9 UK UK 2003 1000 1000
10 UK UK 2004 1000 1000
What would be an efficient way of carrying out this task?
EDIT: Please note that also France 2004 is NA. The extend function does seem to properly deal with such a situation.
EDIT 2: Adding group_by(country) seems to break things. The code appears to be trying to convert a character column to numeric, although I do not understand why. When I convert the column to character myself, the error becomes an evaluation error. Any suggestions?
> TRcomplete<-TRcomplete%>%
+ group_by(country) %>%
+ mutate_at(70:73,~na.fill(.x,"extend"))
Error in mutate_impl(.data, dots) :
Column `F116.s` can't be converted from character to numeric
> TRcomplete$F116.s <- as.numeric(TRcomplete$F116.s)
> TRcomplete<-TRcomplete%>%
+ group_by(country) %>%
+ mutate_at(70:73,~na.fill(.x,"extend"))
Error in mutate_impl(.data, dots) :
Column `F116.s` can't be converted from character to numeric
> TRcomplete$F116.s <- as.numeric(as.character(TRcomplete$F116.s))
> TRcomplete<-TRcomplete%>%
+ group_by(country) %>%
+ mutate_at(70:73,~na.fill(.x,"extend"))
Error in mutate_impl(.data, dots) :
Column `F116.s` can't be converted from character to numeric
> TRcomplete$F116.s <- as.character(TRcomplete$F116.s))
Error: unexpected ')' in "TRcomplete$F116.s <- as.character(TRcomplete$F116.s))"
> TRcomplete$F116.s <- as.character(TRcomplete$F116.s)
> str(TRcomplete$F116.s)
chr [1:6984] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA ...
> TRcomplete<-TRcomplete%>%
+ group_by(country) %>%
+ mutate_at(70:73,~na.fill(.x,"extend"))
Error in mutate_impl(.data, dots) :
Evaluation error: need at least two non-NA values to interpolate.

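You can use na.fill with fill = "extend" from the zoo package, which linearly interpolates interior NAs and repeats the nearest non-NA value for leading or trailing NAs: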
rapply(df, zoo::na.fill,"integer",fill="extend",how="replace")
Country Ccode Year Happiness Power
1 France FR 2000 1000 1000
2 France FR 2001 1200 1400
3 France FR 2003 1400 1800
4 France FR 2004 1600 2200
5 UK UK 2000 1000 1000
6 UK UK 2001 1000 1000
7 UK UK 2003 1000 1000
8 UK UK 2004 1000 1000
EDIT:
library(tidyverse)
library(zoo)
df%>%
group_by(Country)%>%
mutate_at(4:5,~na.fill(.x,"extend"))
Country Ccode Year Happiness Power
1 France FR 2000 1000 1000
2 France FR 2001 1200 1400
3 France FR 2003 1400 1800
4 France FR 2004 1600 2200
5 UK UK 2000 1000 1000
6 UK UK 2001 1000 1000
7 UK UK 2003 1000 1000
8 UK UK 2004 1000 1000
If all the elements in a group may be NA, guard against that case:
df%>%
group_by(Country)%>%
mutate_if(is.numeric,~if(all(is.na(.x))) NA else na.fill(.x,"extend"))
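As a rough base R sketch of the interpolation described in the question (difference between known points divided by the gap in years), assuming trailing NAs such as France 2004 should stay NA rather than be extended. Here approx() is swapped in for zoo::na.fill, the helper name interp_by_year is hypothetical, and the data frame is rebuilt from the table in the question:

```r
# Data rebuilt from the question's table
df <- data.frame(
  Country   = rep(c("France", "UK"), each = 5),
  Ccode     = rep(c("FR", "UK"), each = 5),
  Year      = rep(2000:2004, times = 2),
  Happiness = c(1000, NA, NA, 1600, NA, 1000, NA, 1000, 1000, 1000),
  Power     = c(1000, NA, NA, 2200, NA, 1000, NA, 1000, 1000, 1000)
)

# Hypothetical helper: linear interpolation of y over x, interior gaps only
interp_by_year <- function(y, x) {
  ok <- !is.na(y)
  if (sum(ok) < 2) return(y)  # fewer than two known points: leave as is
  # rule = 1 returns NA for xout values outside the range of known points
  approx(x[ok], y[ok], xout = x, rule = 1)$y
}

# Apply the helper per country to each column that needs filling
for (col in c("Happiness", "Power")) {
  pieces <- lapply(split(df, df$Country),
                   function(g) interp_by_year(g[[col]], g$Year))
  df[[col]] <- unsplit(pieces, df$Country)
}
df
```

With this data, France 2001 and 2002 are filled on the straight line between 2000 and 2003, and France 2004 stays NA because no later known value exists. Groups with fewer than two known values are returned unchanged, which also avoids the "need at least two non-NA values" error.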

Related

How to order the rows information of a data set with two criteria

I have a data set containing information about academic degrees per year, like this:
Year1 Deg_Year1 Year2 Deg_Year2 Year3 Deg_Year3 Year4 Deg_Year4 Year5 Deg_Year5
2001 College 2004 Master NA NA NA NA NA NA
2004 College 2004 Master 2010 PHD NA NA NA NA
2006 Master 2006 College NA NA NA NA NA NA
2016 Master NA NA NA NA NA NA NA NA
2002 Master 2003 Master 2004 College 2004 Master NA NA
2014 Master 2017 PHD NA NA NA NA NA NA
I want to obtain a data frame that contains the year and the highest academic degree obtained just before 2015, like this:
YearX Highest_Degree
2004 Master
2010 PHD
2006 Master
NA NA
2004 Master
2014 Master
Ugh, what a terrible data format. We add an ID column, clean it up, and then we can get what you want in a few lines.
library(tidyr)
library(dplyr)
library(stringr)
# create ID column and store it, so the join at the end can find it
dd <- mutate(dd, id = 1:n())
dd %>%
# convert degree and year columns to long format
gather(key = "degkey", value = "degree", starts_with("Deg")) %>%
gather(key = "yearkey", value = "year", starts_with("Year")) %>%
# pull the numbers into an index
mutate(yr_index = str_extract(yearkey, "[0-9]+"),
deg_index = str_extract(degkey, "[0-9]+")) %>%
# get rid of junk and filter to the years you want
filter(yr_index == deg_index, year < 2015) %>%
# order by descending index
arrange(desc(yr_index)) %>%
# keep relevant columns
select(id, degree, year) %>%
# for each ID, keep the top row
group_by(id) %>%
slice(1) %>%
# join back to the original to complete any lost IDs
right_join(select(dd, id))
# Joining, by = "id"
# # A tibble: 6 x 3
# # Groups: id [?]
# id degree year
# <int> <chr> <int>
# 1 1 Master 2004
# 2 2 PHD 2010
# 3 3 College 2006
# 4 4 <NA> NA
# 5 5 Master 2004
# 6 6 Master 2014
# Warning message:
# attributes are not identical across measure variables; they will be dropped
Using this data:
dd = read.table(text = "Year1 Deg_Year1 Year2 Deg_Year2 Year3 Deg_Year3 Year4 Deg_Year4 Year5 Deg_Year5
2001 College 2004 Master NA NA NA NA NA NA
2004 College 2004 Master 2010 PHD NA NA NA NA
2006 Master 2006 College NA NA NA NA NA NA
2016 Master NA NA NA NA NA NA NA NA
2002 Master 2003 Master 2004 College 2004 Master NA NA
2014 Master 2017 PHD NA NA NA NA NA NA",
header = T)
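Note that the output above keeps the last degree listed before 2015, so row 3 comes out as College rather than Master. A minimal base R sketch that picks the highest degree instead, assuming the ranking College < Master < PHD and breaking ties by the latest year (neither is stated in the question; degree_rank and pick_highest are hypothetical names):

```r
dd <- read.table(text = "Year1 Deg_Year1 Year2 Deg_Year2 Year3 Deg_Year3 Year4 Deg_Year4 Year5 Deg_Year5
2001 College 2004 Master NA NA NA NA NA NA
2004 College 2004 Master 2010 PHD NA NA NA NA
2006 Master 2006 College NA NA NA NA NA NA
2016 Master NA NA NA NA NA NA NA NA
2002 Master 2003 Master 2004 College 2004 Master NA NA
2014 Master 2017 PHD NA NA NA NA NA NA",
header = TRUE, stringsAsFactors = FALSE)

# Assumed ranking of degrees (not given in the question)
degree_rank <- c(College = 1, Master = 2, PHD = 3)

years <- as.matrix(dd[paste0("Year", 1:5)])
degs  <- as.matrix(dd[paste0("Deg_Year", 1:5)])

# For one row: highest-ranked degree obtained before 2015, latest year on ties
pick_highest <- function(i) {
  yr <- as.integer(years[i, ])
  dg <- as.character(degs[i, ])
  keep <- !is.na(yr) & !is.na(dg) & yr < 2015
  if (!any(keep)) {
    return(data.frame(YearX = NA_integer_, Highest_Degree = NA_character_,
                      stringsAsFactors = FALSE))
  }
  yr <- yr[keep]
  dg <- dg[keep]
  best <- order(degree_rank[dg], yr, decreasing = TRUE)[1]
  data.frame(YearX = yr[best], Highest_Degree = dg[best],
             stringsAsFactors = FALSE)
}

result <- do.call(rbind, lapply(seq_len(nrow(dd)), pick_highest))
result
```

Under these assumptions the result matches the table requested in the question, including Master 2006 for the third row and NA for the fourth.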

Combining rows of data into one with an uncommon aspect in R

I have this data frame that goes something similar to the following.
Code Title Year Number Allocation
1000 Jack 2001 NA 6
1000 Jack 2002 NA NA
1000 Jack 2003 NA NA
1000 Jack 2004 113 NA
1000 Jack 2005 NA NA
1001 Dave 2001 NA 19
1001 Dave 2002 NA NA
1001 Dave 2003 NA NA
1001 Dave 2004 101 NA
1001 Dave 2005 NA NA
and so on.
The data frame repeats like this for different titles, with a value in 'Number' only in 2004 and a value in 'Allocation' only in 2001.
How would I go about changing the data so that each title becomes a single row of the data frame, like this:
Code Title Number Allocation
1000 Jack 113 6
1001 Dave 101 19
This also works:
library(dplyr)
df %>%
select(-Year) %>%
group_by(Code, Title) %>%
mutate_all(funs(sort(.))) %>%
distinct()
or:
df %>%
group_by(Code, Title) %>%
mutate_all(funs(sort(.))) %>%
distinct(Code, Title, Number, Allocation)
Result:
# A tibble: 2 x 4
# Groups: Code, Title [2]
Code Title Number Allocation
<int> <fctr> <int> <int>
1 1000 Jack 113 6
2 1001 Dave 101 19
Data:
df = read.table(text=" Code Title Year Number Allocation
1000 Jack 2001 NA 6
1000 Jack 2002 NA NA
1000 Jack 2003 NA NA
1000 Jack 2004 113 NA
1000 Jack 2005 NA NA
1001 Dave 2001 NA 19
1001 Dave 2002 NA NA
1001 Dave 2003 NA NA
1001 Dave 2004 101 NA
1001 Dave 2005 NA NA", header = TRUE)
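For comparison, a base R sketch of the same collapse without dplyr, assuming each group has at most one non-NA value per column (as in the example). The helper first_non_na is a hypothetical name:

```r
df <- read.table(text = " Code Title Year Number Allocation
1000 Jack 2001 NA 6
1000 Jack 2002 NA NA
1000 Jack 2003 NA NA
1000 Jack 2004 113 NA
1000 Jack 2005 NA NA
1001 Dave 2001 NA 19
1001 Dave 2002 NA NA
1001 Dave 2003 NA NA
1001 Dave 2004 101 NA
1001 Dave 2005 NA NA", header = TRUE)

# Hypothetical helper: first non-missing value of a vector (NA if none)
first_non_na <- function(x) x[!is.na(x)][1]

# na.action = na.pass keeps the rows with NAs so the helper can skip them itself
out <- aggregate(cbind(Number, Allocation) ~ Code + Title,
                 data = df, FUN = first_non_na, na.action = na.pass)
out
```

The default na.action in aggregate's formula interface would drop every row containing an NA, which here is all of them, so na.pass is the important part of this sketch.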

Summarizing a dataframe by date and group

I am trying to summarize a data set by a few different factors. Below is an example of my data:
household<-c("household1","household1","household1","household2","household2","household2","household3","household3","household3")
date<-c(sample(seq(as.Date('1999/01/01'), as.Date('2000/01/01'), by="day"), 9))
value<-c(1:9)
type<-c("income","water","energy","income","water","energy","income","water","energy")
df<-data.frame(household,date,value,type)
household date value type
1 household1 1999-05-10 100 income
2 household1 1999-05-25 200 water
3 household1 1999-10-12 300 energy
4 household2 1999-02-02 400 income
5 household2 1999-08-20 500 water
6 household2 1999-02-19 600 energy
7 household3 1999-07-01 700 income
8 household3 1999-10-13 800 water
9 household3 1999-01-01 900 energy
I want to summarize the data by month. Ideally the resulting data set would have 12 rows per household (one for each month) and a column for each category of expenditure (water, energy, income) that is a sum of that month's total.
I tried starting by adding a column with a short date, and then I was going to filter for each type and create a separate data frame for the summed data per transaction type. I was then going to merge those data frames together to have the summarized df. I attempted to summarize it using ddply, but it aggregated too much, and I can't keep the household level info.
ddply(df,.(shortdate),summarize,mean_value=mean(value))
shortdate mean_value
1 14/07 15.88235
2 14/09 5.00000
3 14/10 5.00000
4 14/11 21.81818
5 14/12 20.00000
6 15/01 10.00000
7 15/02 12.50000
8 15/04 5.00000
Any help would be much appreciated!
It sounds like what you are looking for is a pivot table. I like to use reshape::cast for these types of tables. If there is more than one value for a given expenditure type and household/year/month combination, the sum argument adds those values up; if there is only one value, it is returned unchanged. The sum argument is only there to handle such duplicates, so if your data is clean (at most one value per cell) you shouldn't need it.
hh <- c("hh1", "hh1", "hh1", "hh2", "hh2", "hh2", "hh3", "hh3", "hh3")
date <- c(sample(seq(as.Date('1999/01/01'), as.Date('2000/01/01'), by="day"), 9))
value <- c(1:9)
type <- c("income", "water", "energy", "income", "water", "energy", "income", "water", "energy")
df <- data.frame(hh, date, value, type)
# Load lubridate library, add month and year columns
library(lubridate)
df$month <- month(df$date)
df$year <- year(df$date)
# Load reshape library, run cast from reshape, creates pivot table
library(reshape)
dfNew <- cast(df, hh+year+month~type, value = "value", sum)
> dfNew
hh year month energy income water
1 hh1 1999 4 3 0 0
2 hh1 1999 10 0 1 0
3 hh1 1999 11 0 0 2
4 hh2 1999 2 0 4 0
5 hh2 1999 3 6 0 0
6 hh2 1999 6 0 0 5
7 hh3 1999 1 9 0 0
8 hh3 1999 4 0 7 0
9 hh3 1999 8 0 0 8
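A base R sketch of the same pivot that also produces all 12 months per household, as the question asks. It assumes fixed dates copied from the table printed in the question (instead of sample()) and values of 100 to 900; months with no transactions come out as 0 rather than NA:

```r
# Data rebuilt from the table printed in the question (fixed dates, not sampled)
df <- data.frame(
  household = rep(c("household1", "household2", "household3"), each = 3),
  date = as.Date(c("1999-05-10", "1999-05-25", "1999-10-12",
                   "1999-02-02", "1999-08-20", "1999-02-19",
                   "1999-07-01", "1999-10-13", "1999-01-01")),
  value = seq(100, 900, 100),
  type = rep(c("income", "water", "energy"), times = 3)
)

# Month as a factor with all 12 levels, so empty months are kept
df$month <- factor(as.integer(format(df$date, "%m")), levels = 1:12)

# Sum value for every household x month x type cell
tab <- xtabs(value ~ household + month + type, data = df)

# Long table of sums, then one column per expenditure type
long <- as.data.frame(tab, responseName = "total")
wide <- reshape(long, idvar = c("household", "month"),
                timevar = "type", direction = "wide")
head(wide)
```

The result has 36 rows (3 households times 12 months) and the columns total.energy, total.income and total.water.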
Try this:
df$ym<-zoo::as.yearmon(as.Date(df$date), "%y/%m")
library(dplyr)
df %>% group_by(ym,type) %>%
summarise(mean_value=mean(value))
Source: local data frame [9 x 3]
Groups: ym [?]
ym type mean_value
<S3: yearmon> <fctr> <dbl>
1 jan 1999 income 1
2 jun 1999 energy 3
3 jul 1999 energy 6
4 jul 1999 water 2
5 ago 1999 income 4
6 set 1999 energy 9
7 set 1999 income 7
8 nov 1999 water 5
9 dez 1999 water 8
Edit: the wide format (dfr is assumed to hold the summarised result from the previous step):
reshape2::dcast(dfr, ym ~ type)
ym energy income water
1 jan 1999 NA 1 NA
2 jun 1999 3 NA NA
3 jul 1999 6 NA 2
4 ago 1999 NA 4 NA
5 set 1999 9 7 NA
6 nov 1999 NA NA 5
7 dez 1999 NA NA 8
If I understood your requirement correctly (from the description in the question), this is what you are looking for:
library(dplyr)
library(tidyr)
df %>% mutate(date = lubridate::month(date)) %>%
complete(household, date = 1:12) %>%
spread(type, value) %>% group_by(household, date) %>%
mutate(Total = sum(energy, income, water, na.rm = T)) %>%
select(household, Month = date, energy:water, Total)
#Source: local data frame [36 x 6]
#Groups: household, Month [36]
#
# household Month energy income water Total
# <fctr> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 household1 1 NA NA NA 0
#2 household1 2 NA NA NA 0
#3 household1 3 NA NA 200 200
#4 household1 4 NA NA NA 0
#5 household1 5 NA NA NA 0
#6 household1 6 NA NA NA 0
#7 household1 7 NA NA NA 0
#8 household1 8 NA NA NA 0
#9 household1 9 300 NA NA 300
#10 household1 10 NA NA NA 0
# ... with 26 more rows
Note: I used the same df you provided in the question. The only change I made was the value column. Instead of 1:9, I used seq(100, 900, 100)
If I got it wrong, please let me know and I will delete my answer. I will add an explanation of what's going on if this is correct.

Add lines with NA values

I have a data frame like this:
indx country year death value
1 1 Italy 2000 hiv 1
2 1 Italy 2001 hiv 2
3 1 Italy 2005 hiv 3
4 1 Italy 2000 cancer 4
5 1 Italy 2001 cancer 5
6 1 Italy 2002 cancer 6
7 1 Italy 2003 cancer 7
8 1 Italy 2004 cancer 8
9 1 Italy 2005 cancer 9
10 4 France 2000 hiv 10
11 4 France 2004 hiv 11
12 4 France 2005 hiv 12
13 4 France 2001 cancer 13
14 4 France 2002 cancer 14
15 4 France 2003 cancer 15
16 4 France 2004 cancer 16
17 2 Spain 2000 hiv 17
18 2 Spain 2001 hiv 18
19 2 Spain 2002 hiv 19
20 2 Spain 2003 hiv 20
21 2 Spain 2004 hiv 21
22 2 Spain 2005 hiv 22
23 2 Spain ... ... ...
indx is a value linked to the country (same country = same indx).
In this example I used only 3 countries (country) and 2 diseases (death); the original data frame has many more.
I would like to have one row for each country for each disease from 2000 to 2005.
What I would like to get is:
indx country year death value
1 1 Italy 2000 hiv 1
2 1 Italy 2001 hiv 2
3 1 Italy 2002 hiv NA
4 1 Italy 2003 hiv NA
5 1 Italy 2004 hiv NA
6 1 Italy 2005 hiv 3
7 1 Italy 2000 cancer 4
8 1 Italy 2001 cancer 5
9 1 Italy 2002 cancer 6
10 1 Italy 2003 cancer 7
11 1 Italy 2004 cancer 8
12 1 Italy 2005 cancer 9
13 4 France 2000 hiv 10
14 4 France 2001 hiv NA
15 4 France 2002 hiv NA
16 4 France 2003 hiv NA
17 4 France 2004 hiv 11
18 4 France 2005 hiv 12
19 4 France 2000 cancer NA
20 4 France 2001 cancer 13
21 4 France 2002 cancer 14
22 4 France 2003 cancer 15
23 4 France 2004 cancer 16
24 4 France 2005 cancer NA
25 2 Spain 2000 hiv 17
26 2 Spain 2001 hiv 18
27 2 Spain 2002 hiv 19
28 2 Spain 2003 hiv 20
29 2 Spain 2004 hiv 21
30 2 Spain 2005 hiv 22
31 2 Spain ... ... ...
I.e. I would like to add lines with value = NA for the missing years, for each country and each disease.
For example, HIV data for Italy is missing from 2002 to 2004, so I would add those lines with value = NA.
How can I do that?
For a reproducible example:
indx <- c(rep(1, times=9), rep(4, times=7), rep(2, times=6))
country <- c(rep("Italy", times=9), rep("France", times=7), rep("Spain", times=6))
year <- c(2000, 2001, 2005, 2000:2005, 2000, 2004, 2005, 2001:2004, 2000:2005)
death <- c(rep("hiv", times=3), rep("cancer", times=6), rep("hiv", times=3), rep("cancer", times=4), rep("hiv", times=6))
value <- c(1:22)
dfl <- data.frame(indx, country, year, death, value)
Using base R, you could do:
# setDF(dfl) # run this first if you have a data.table
merge(expand.grid(lapply(dfl[c("country", "death", "year")], unique)), dfl, all.x = TRUE)
This first creates all combinations of the unique values in country, death, and year, then merges that grid with the original data to bring in the values. Combinations that were not in the original data get NA in the remaining columns.
In the tidyr package, there's a function that does this for you with a single command:
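Because the merge above is keyed on country, death, and year only, the added rows also get NA in indx. A small base R sketch that fills indx back in from a country lookup, using the reproducible example from the question (the names grid, key and full are illustrative):

```r
# Reproducible example from the question
indx <- c(rep(1, times = 9), rep(4, times = 7), rep(2, times = 6))
country <- c(rep("Italy", times = 9), rep("France", times = 7), rep("Spain", times = 6))
year <- c(2000, 2001, 2005, 2000:2005, 2000, 2004, 2005, 2001:2004, 2000:2005)
death <- c(rep("hiv", times = 3), rep("cancer", times = 6), rep("hiv", times = 3),
           rep("cancer", times = 4), rep("hiv", times = 6))
value <- c(1:22)
dfl <- data.frame(indx, country, year, death, value)

# Every combination of country, death and year
grid <- expand.grid(country = unique(dfl$country),
                    death = unique(dfl$death),
                    year = sort(unique(dfl$year)))

# Left join onto the grid; missing combinations get value = NA
full <- merge(grid, dfl[c("country", "death", "year", "value")], all.x = TRUE)

# Restore indx from a one-row-per-country lookup
key <- unique(dfl[c("indx", "country")])
full$indx <- key$indx[match(full$country, key$country)]

full <- full[order(full$indx, full$death, full$year),
             c("indx", "country", "year", "death", "value")]
full
```

This gives 36 rows (3 countries, 2 diseases, 6 years), including an all-NA block for cancer in Spain, which never appears in the example data.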
library(tidyr)
complete(dfl, country, year, death)
Here is a longer base R method (df here is the data frame from the question, called dfl there). You create two new data.frames: one that contains all combinations of country, year, and death, and a second that contains an index key mapping each country to its indx.
# get data.frame with every combination of country, year, and death
dfNew <- with(df, expand.grid("country"=unique(country), "year"=unique(year),
"death"=unique(death)))
# get index key
indexKey <- unique(df[, c("indx", "country")])
# merge these together
dfNew <- merge(indexKey, dfNew, by="country")
# merge onto original data set
dfNew <- merge(df, dfNew, by=c("indx", "country", "year", "death"), all=TRUE)
This returns
dfNew
indx country year death value
1 1 Italy 2000 cancer 4
2 1 Italy 2000 hiv 1
3 1 Italy 2001 cancer 5
4 1 Italy 2001 hiv 2
5 1 Italy 2002 cancer 6
6 1 Italy 2002 hiv NA
7 1 Italy 2003 cancer 7
8 1 Italy 2003 hiv NA
9 1 Italy 2004 cancer 8
10 1 Italy 2004 hiv NA
11 1 Italy 2005 cancer 9
12 1 Italy 2005 hiv 3
13 2 Spain 2000 cancer NA
14 2 Spain 2000 hiv 17
15 2 Spain 2001 cancer NA
...
If df is a data.table, here are the corresponding lines of code:
# CJ is a cross-join
setkey(df, country, year, death)
dfNew <- df[CJ(country, year, death, unique=TRUE),
.(country, year, death, value)]
indexKey <- unique(df[, .(indx, country)])
dfNew <- merge(indexKey, dfNew, by="country")
dfNew <- merge(df, dfNew, by=c("indx", "country", "year", "death"), all=TRUE)
Note that rather than using CJ, it is also possible to use expand.grid as in the data.frame version:
dfNew <- df[, expand.grid("country"=unique(country), "year"=unique(year),
"death"=unique(death))]
tidyr::complete creates all combinations of the variables you pass it, but if two columns carry the same information (here indx and country map one-to-one), it will over-expand them or leave NAs where you don't want them. As a workaround you can use dplyr grouping (df %>% group_by(indx, country) %>% complete(death, year)) or just merge the two columns into one temporarily:
library(tidyr)
# merge indx and country into a single column so they won't over-expand
df %>% unite(indx_country, indx, country) %>%
# fill in missing combinations of new column, death, and year
complete(indx_country, death, year) %>%
# separate indx and country back to how they were
separate(indx_country, c('indx', 'country'))
# Source: local data frame [36 x 5]
#
# indx country death year value
# (chr) (chr) (fctr) (int) (int)
# 1 1 Italy cancer 2000 4
# 2 1 Italy cancer 2001 5
# 3 1 Italy cancer 2002 6
# 4 1 Italy cancer 2003 7
# 5 1 Italy cancer 2004 8
# 6 1 Italy cancer 2005 9
# 7 1 Italy hiv 2000 1
# 8 1 Italy hiv 2001 2
# 9 1 Italy hiv 2002 NA
# 10 1 Italy hiv 2003 NA
# .. ... ... ... ... ...
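Another option, sketched here on the assumption that a reasonably recent tidyr is installed, is the nesting() helper, which tells complete() to use only the indx/country pairs that actually occur in the data. The data is the reproducible example from the question:

```r
library(tidyr)

# Reproducible example from the question
indx <- c(rep(1, times = 9), rep(4, times = 7), rep(2, times = 6))
country <- c(rep("Italy", times = 9), rep("France", times = 7), rep("Spain", times = 6))
year <- c(2000, 2001, 2005, 2000:2005, 2000, 2004, 2005, 2001:2004, 2000:2005)
death <- c(rep("hiv", times = 3), rep("cancer", times = 6), rep("hiv", times = 3),
           rep("cancer", times = 4), rep("hiv", times = 6))
value <- c(1:22)
dfl <- data.frame(indx, country, year, death, value)

# nesting() keeps indx and country together; death and year are fully crossed
filled <- complete(dfl, nesting(indx, country), death, year)
filled
```

This avoids the unite/separate round trip, so indx keeps its original numeric type.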

R issues with merge/rbind/concatenate two data frames

I am a beginner with R, so I apologise in advance if the question was asked elsewhere. Here is my issue:
I have two data frames, df1 and df2, with different numbers of rows and columns. The two frames have only one variable (column) in common, called "customer_no". I want the merged frame to match records based on "customer_no" and keep only the customers that appear in df2. Both data frames have multiple rows for each customer_no.
I tried the following:
merged.df <- merge(df1, df2, by="customer_no", all.y=TRUE)
The problem is that this assigns values of df1 to df2 where instead it should be empty. My questions are:
1) How can I tell the command to leave the unmatched columns empty?
2) How can I see from the merged file which row came from which df? I guess if I resolve the above question this should be easy to see by the empty columns.
I am missing something in my command but don't know what. If the question has been answered somewhere else, would you be still kind enough to rephrase it in English here for an R beginner?
Thanks!
Data example:
df1:
customer_no country year
10 UK 2001
10 UK 2002
10 UK 2003
20 US 2007
30 AU 2006
df2:
customer_no income
10 700
10 800
10 900
30 1000
Merged file should look like this:
merged.df:
customer_no  income  country  year
10                   UK       2001
10                   UK       2002
10                   UK       2003
10           700
10           800
10           900
30                   AU       2006
30           1000
So:
It puts the columns all together, it adds the values of df2 right after the last one of df1 based on same customer_no and matches only customer_no from df2 (merged.df does not have customer_no 20). Also, it leaves empty all the other cells.
In STATA I use append but not sure in R...perhaps join?
THANKS!!
Try:
df1$id <- paste(df1$customer_no, 1, sep="_")
df2$id <- paste(df2$customer_no, 2, sep="_")
res <- merge(df1, df2, by=c('id', 'customer_no'),all=TRUE)[,-1]
res1 <- res[res$customer_no %in% df2$customer_no,]
res1
# customer_no country year income
#1 10 UK 2001 NA
#2 10 UK 2002 NA
#3 10 UK 2003 NA
#4 10 <NA> NA 700
#5 10 <NA> NA 800
#6 10 <NA> NA 900
#8 30 AU 2006 NA
#9 30 <NA> NA 1000
If you want to change NA to '',
res1[is.na(res1)] <- '' #But, I would leave it as `NA` as there are `numeric` columns.
Or, use rbindlist from data.table (Using the original datasets)
library(data.table)
indx <- df1$customer_no %in% df2$customer_no
rbindlist(list(df1[indx,], df2),fill=TRUE)[order(customer_no)]
# customer_no country year income
#1: 10 UK 2001 NA
#2: 10 UK 2002 NA
#3: 10 UK 2003 NA
#4: 10 NA NA 700
#5: 10 NA NA 800
#6: 10 NA NA 900
#7: 30 AU 2006 NA
#8: 30 NA NA 1000
You could also use the smartbind function from the gtools package.
require(gtools)
res <- smartbind(df1[df1$customer_no %in% df2$customer_no, ], df2)
res[order(res$customer_no), ]
# customer_no country year income
# 1:1 10 UK 2001 NA
# 1:2 10 UK 2002 NA
# 1:3 10 UK 2003 NA
# 2:1 10 <NA> NA 700
# 2:2 10 <NA> NA 800
# 2:3 10 <NA> NA 900
# 1:4 30 AU 2006 NA
# 2:4 30 <NA> NA 1000
Try:
df1$income = df2$country = df2$year = NA
rbind(df1, df2)
customer_no country year income
1 10 UK 2001 NA
2 10 UK 2002 NA
3 10 UK 2003 NA
4 20 US 2007 NA
5 30 AU 2006 NA
6 10 <NA> NA 700
7 10 <NA> NA 800
8 10 <NA> NA 900
9 30 <NA> NA 1000
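To address the second part of the question (seeing which data frame each row came from), here is a base R sketch that stacks the two frames with an explicit source column and keeps only customers present in df2. The column name source and the intermediate names are illustrative, and the data is rebuilt from the example in the question:

```r
# Example data from the question
df1 <- data.frame(customer_no = c(10, 10, 10, 20, 30),
                  country = c("UK", "UK", "UK", "US", "AU"),
                  year = c(2001, 2002, 2003, 2007, 2006),
                  stringsAsFactors = FALSE)
df2 <- data.frame(customer_no = c(10, 10, 10, 30),
                  income = c(700, 800, 900, 1000))

# Keep only df1 rows whose customer also appears in df2
df1_keep <- df1[df1$customer_no %in% df2$customer_no, ]
df1_keep$income <- NA
df1_keep$source <- "df1"

# Give df2 the columns it lacks, plus its own source label
df2_fill <- df2
df2_fill$country <- NA
df2_fill$year <- NA
df2_fill$source <- "df2"

# Stack with matching column order, then group rows by customer
stacked <- rbind(df1_keep, df2_fill[names(df1_keep)])
stacked <- stacked[order(stacked$customer_no, stacked$source), ]
stacked
```

Unlike the plain rbind above, this drops customer_no 20, and the source column makes the origin of each row explicit instead of relying on which cells are NA.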
