R issues with merge/rbind/concatenate two data frames

I am a beginner with R, so I apologise in advance if this has been asked elsewhere. Here is my issue:
I have two data frames, df1 and df2, with different numbers of rows and columns. The two frames have only one variable (column) in common, called "customer_no". I want the merged frame to match records on "customer_no", keeping only the customers that appear in df2. Both data frames have multiple rows for each customer_no.
I tried the following:
merged.df <- merge(df1, df2, by="customer_no", all.y=TRUE)
The problem is that this copies values of df1 onto the df2 rows, where instead those cells should be empty. My questions are:
1) How can I tell the command to leave the unmatched columns empty?
2) How can I see from the merged file which row came from which df? I guess once the first question is resolved, this will be easy to see from the empty columns.
I am missing something in my command but don't know what. If the question has been answered somewhere else, would you be kind enough to rephrase the answer in plain English here for an R beginner?
Thanks!
Data example:
df1:
customer_no country year
         10      UK 2001
         10      UK 2002
         10      UK 2003
         20      US 2007
         30      AU 2006
df2:
customer_no income
         10    700
         10    800
         10    900
         30   1000
Merged file should look like this:
merged.df:
customer_no income country year
         10             UK 2001
         10             UK 2002
         10             UK 2003
         10    700
         10    800
         10    900
         30             AU 2006
         30   1000
So:
It puts all the columns together, appends the df2 values right after the last df1 row with the same customer_no, and keeps only the customer_no values that appear in df2 (merged.df has no customer_no 20). All the other cells are left empty.
In Stata I would use append, but I'm not sure what the equivalent is in R... perhaps join?
Thanks!!

Try:
# Tag each row with its source (1 = df1, 2 = df2) so the keys of the
# two frames never overlap; the full outer merge then simply stacks
# the rows, filling the other frame's columns with NA
df1$id <- paste(df1$customer_no, 1, sep="_")
df2$id <- paste(df2$customer_no, 2, sep="_")
res <- merge(df1, df2, by=c('id', 'customer_no'), all=TRUE)[,-1]  # drop the helper id
# Keep only the customers that appear in df2
res1 <- res[res$customer_no %in% df2$customer_no,]
res1
# customer_no country year income
#1 10 UK 2001 NA
#2 10 UK 2002 NA
#3 10 UK 2003 NA
#4 10 <NA> NA 700
#5 10 <NA> NA 800
#6 10 <NA> NA 900
#8 30 AU 2006 NA
#9 30 <NA> NA 1000
If you want to change NA to '',
res1[is.na(res1)] <- '' #But, I would leave it as `NA` as there are `numeric` columns.
Or, use rbindlist from data.table (using the original datasets):
library(data.table)
# Keep only df1 rows whose customer_no also appears in df2
indx <- df1$customer_no %in% df2$customer_no
# Stack the two frames; fill=TRUE pads the missing columns with NA
rbindlist(list(df1[indx,], df2), fill=TRUE)[order(customer_no)]
# customer_no country year income
#1: 10 UK 2001 NA
#2: 10 UK 2002 NA
#3: 10 UK 2003 NA
#4: 10 NA NA 700
#5: 10 NA NA 800
#6: 10 NA NA 900
#7: 30 AU 2006 NA
#8: 30 NA NA 1000

You could also use the smartbind function from the gtools package.
require(gtools)
res <- smartbind(df1[df1$customer_no %in% df2$customer_no, ], df2)
res[order(res$customer_no), ]
# customer_no country year income
# 1:1 10 UK 2001 NA
# 1:2 10 UK 2002 NA
# 1:3 10 UK 2003 NA
# 2:1 10 <NA> NA 700
# 2:2 10 <NA> NA 800
# 2:3 10 <NA> NA 900
# 1:4 30 AU 2006 NA
# 2:4 30 <NA> NA 1000
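For what it's worth, a sketch of the same stack-and-fill idea with dplyr; bind_rows() is probably the closest analogue of Stata's append, since it pads the columns either frame lacks with NA:
library(dplyr)
# Keep only df1's customers that also appear in df2, then stack;
# bind_rows() fills the columns each frame is missing with NA
res <- df1 %>%
  filter(customer_no %in% df2$customer_no) %>%
  bind_rows(df2) %>%
  arrange(customer_no)
res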

Try:
# Give each frame the columns it is missing, then stack them
df1$income <- df2$country <- df2$year <- NA
rbind(df1, df2)
customer_no country year income
1 10 UK 2001 NA
2 10 UK 2002 NA
3 10 UK 2003 NA
4 20 US 2007 NA
5 30 AU 2006 NA
6 10 <NA> NA 700
7 10 <NA> NA 800
8 10 <NA> NA 900
9 30 <NA> NA 1000
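Note that this keeps customer 20, which your desired merged.df drops. A small extra step (a sketch under the same setup) subsets df1 first and then sorts:
# Drop customers that never appear in df2, then order by customer_no
keep <- df1$customer_no %in% df2$customer_no
merged.df <- rbind(df1[keep, ], df2)
merged.df[order(merged.df$customer_no), ]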

Related

Interpolating missing data in a dataframe with R

I have a dataframe which is similar to the one below:
Country Ccode Year Happiness Power
1 France FR 2000 1000 1000
2 France FR 2001 NA NA
3 France FR 2002 NA NA
4 France FR 2003 1600 2200
5 France FR 2004 NA NA
6 UK UK 2000 1000 1000
7 UK UK 2001 NA NA
8 UK UK 2002 1000 1000
9 UK UK 2003 1000 1000
10 UK UK 2004 1000 1000
I have previously used the following code to get the differences:
df <- df %>%
  arrange(Country, Year) %>% # sort data
  group_by(Country) %>%
  mutate_if(is.numeric, funs(d = . - lag(.)))
I would like to expand on this code: take the difference between consecutive non-missing Happiness and Power values, divide it by the difference in years between those data points, and use the result to fill in the NAs, producing the following output.
Country Ccode Year Happiness Power
1 France FR 2000 1000 1000
2 France FR 2001 1200 1400
3 France FR 2002 1400 1800
4 France FR 2003 1600 2200
5 France FR 2004 NA NA
6 UK UK 2000 1000 1000
7 UK UK 2001 0 0
8 UK UK 2002 1000 1000
9 UK UK 2003 1000 1000
10 UK UK 2004 1000 1000
What would be an efficient way of carrying out this task?
EDIT: Please note that France 2004 is also NA. The extend option does seem to deal with such a situation properly.
EDIT 2: Adding the group_by(country) seems to mess things up for unknown reasons: the code appears to be trying to convert a character to a numeric, although I do not really understand why. When I convert the column to character, the error becomes an evaluation error. Any suggestions?
> TRcomplete<-TRcomplete%>%
+ group_by(country) %>%
+ mutate_at(70:73,~na.fill(.x,"extend"))
Error in mutate_impl(.data, dots) :
Column `F116.s` can't be converted from character to numeric
> TRcomplete$F116.s <- as.numeric(TRcomplete$F116.s)
> TRcomplete<-TRcomplete%>%
+ group_by(country) %>%
+ mutate_at(70:73,~na.fill(.x,"extend"))
Error in mutate_impl(.data, dots) :
Column `F116.s` can't be converted from character to numeric
> TRcomplete$F116.s <- as.numeric(as.character(TRcomplete$F116.s))
> TRcomplete<-TRcomplete%>%
+ group_by(country) %>%
+ mutate_at(70:73,~na.fill(.x,"extend"))
Error in mutate_impl(.data, dots) :
Column `F116.s` can't be converted from character to numeric
> TRcomplete$F116.s <- as.character(TRcomplete$F116.s))
Error: unexpected ')' in "TRcomplete$F116.s <- as.character(TRcomplete$F116.s))"
> TRcomplete$F116.s <- as.character(TRcomplete$F116.s)
> str(TRcomplete$F116.s)
chr [1:6984] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA ...
> TRcomplete<-TRcomplete%>%
+ group_by(country) %>%
+ mutate_at(70:73,~na.fill(.x,"extend"))
Error in mutate_impl(.data, dots) :
Evaluation error: need at least two non-NA values to interpolate.
You can use na.fill with fill="extend" from the zoo library:
# Apply na.fill to every integer column in place, interpolating the
# interior NAs and extending the values at the ends
rapply(df, zoo::na.fill, "integer", fill="extend", how="replace")
Country Ccode Year Happiness Power
1 France FR 2000 1000 1000
2 France FR 2001 1200 1400
3 France FR 2003 1400 1800
4 France FR 2004 1600 2200
5 UK UK 2000 1000 1000
6 UK UK 2001 1000 1000
7 UK UK 2003 1000 1000
8 UK UK 2004 1000 1000
EDIT:
library(tidyverse)
library(zoo)
df %>%
  group_by(Country) %>%
  mutate_at(4:5, ~na.fill(.x, "extend"))
Country Ccode Year Happiness Power
1 France FR 2000 1000 1000
2 France FR 2001 1200 1400
3 France FR 2003 1400 1800
4 France FR 2004 1600 2200
5 UK UK 2000 1000 1000
6 UK UK 2001 1000 1000
7 UK UK 2003 1000 1000
8 UK UK 2004 1000 1000
If all the elements in the group are NA then:
df %>%
  group_by(Country) %>%
  mutate_if(is.numeric, ~if(all(is.na(.x))) NA else na.fill(.x, "extend"))
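If you instead want trailing gaps such as France 2004 to stay NA rather than be extended, a sketch with zoo::na.approx (interior interpolation only, weighted by Year; assumes dplyr >= 1.0 for across()):
library(dplyr)
library(zoo)
df %>%
  group_by(Country) %>%
  # na.approx() interpolates linearly against Year; na.rm = FALSE
  # leaves leading/trailing NAs untouched instead of dropping them
  mutate(across(c(Happiness, Power),
                ~ na.approx(.x, x = Year, na.rm = FALSE))) %>%
  ungroup()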

How to order the rows information of a data set with two criteria

I have a data set containing information about academic degrees per year, like this:
Year1 Deg_Year1 Year2 Deg_Year2 Year3 Deg_Year3 Year4 Deg_Year4 Year5 Deg_Year5
2001 College 2004 Master NA NA NA NA NA NA
2004 College 2004 Master 2010 PHD NA NA NA NA
2006 Master 2006 College NA NA NA NA NA NA
2016 Master NA NA NA NA NA NA NA NA
2002 Master 2003 Master 2004 College 2004 Master NA NA
2014 Master 2017 PHD NA NA NA NA NA NA
I want to obtain a data frame that contains the year and the highest academic degree obtained just before 2015, like this:
YearX Highest_Degree
2004 Master
2010 PHD
2006 Master
NA NA
2004 Master
2014 Master
Ugh, what a terrible data format. We add an ID column, clean it up, and then we can get what you want in a few lines.
library(tidyr)
library(dplyr)
library(stringr)
# create ID column (saved back into dd so we can join on it later)
dd <- mutate(dd, id = 1:n())
dd %>%
  # convert degree and year columns to long format
  gather(key = "degkey", value = "degree", starts_with("Deg")) %>%
  gather(key = "yearkey", value = "year", starts_with("Year")) %>%
  # pull the numbers into an index
  mutate(yr_index = str_extract(yearkey, "[0-9]+"),
         deg_index = str_extract(degkey, "[0-9]+")) %>%
  # get rid of junk and filter to the years you want
  filter(yr_index == deg_index, year < 2015) %>%
  # order by descending index
  arrange(desc(yr_index)) %>%
  # keep relevant columns
  select(id, degree, year) %>%
  # for each ID, keep the top row
  group_by(id) %>%
  slice(1) %>%
  # join back to the original to restore any lost IDs
  right_join(select(dd, id))
# Joining, by = "id"
# # A tibble: 6 x 3
# # Groups: id [?]
# id degree year
# <int> <chr> <int>
# 1 1 Master 2004
# 2 2 PHD 2010
# 3 3 College 2006
# 4 4 <NA> NA
# 5 5 Master 2004
# 6 6 Master 2014
# Warning message:
# attributes are not identical across measure variables; they will be dropped
Using this data:
dd = read.table(text = "Year1 Deg_Year1 Year2 Deg_Year2 Year3 Deg_Year3 Year4 Deg_Year4 Year5 Deg_Year5
2001 College 2004 Master NA NA NA NA NA NA
2004 College 2004 Master 2010 PHD NA NA NA NA
2006 Master 2006 College NA NA NA NA NA NA
2016 Master NA NA NA NA NA NA NA NA
2002 Master 2003 Master 2004 College 2004 Master NA NA
2014 Master 2017 PHD NA NA NA NA NA NA",
header = T)
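For what it's worth, a sketch of the same idea with the newer tidyr interface (assumes tidyr >= 1.0 for pivot_longer() and dplyr >= 1.0 for slice_max()):
library(dplyr)
library(tidyr)
dd %>%
  mutate(id = row_number()) %>%
  # ".value" splits each name into two output columns, Year and Deg_Year
  pivot_longer(-id,
               names_to = c(".value", "index"),
               names_pattern = "^(Deg_Year|Year)([0-9]+)$") %>%
  filter(Year < 2015) %>%  # also drops the all-NA pairs
  group_by(id) %>%
  slice_max(as.integer(index), with_ties = FALSE) %>%
  ungroup() %>%
  # join back so rows with no qualifying degree reappear as NA
  right_join(tibble(id = seq_len(nrow(dd))), by = "id") %>%
  arrange(id)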

Combining rows of data into one with an uncommon aspect in R

I have this data frame that goes something similar to the following.
Code Title Year Number Allocation
1000 Jack 2001 NA 6
1000 Jack 2002 NA NA
1000 Jack 2003 NA NA
1000 Jack 2004 113 NA
1000 Jack 2005 NA NA
1001 Dave 2001 NA 19
1001 Dave 2002 NA NA
1001 Dave 2003 NA NA
1001 Dave 2004 101 NA
1001 Dave 2005 NA NA
and so on.
The pattern repeats with different titles: each Code/Title block has one number in 'Number' (in 2004) and one in 'Allocation' (in 2001).
How would I go about collapsing the data so that each Code/Title pair becomes a single row of the data frame, like this:
Code Title Number Allocation
1000 Jack 113 6
1001 Dave 101 19
This also works:
library(dplyr)
df %>%
  select(-Year) %>%
  group_by(Code, Title) %>%
  # sort() drops NAs; the single remaining value is recycled per group
  mutate_all(funs(sort(.))) %>%
  distinct()
or:
df %>%
  group_by(Code, Title) %>%
  mutate_all(funs(sort(.))) %>%
  distinct(Code, Title, Number, Allocation)
Result:
# A tibble: 2 x 4
# Groups: Code, Title [2]
Code Title Number Allocation
<int> <fctr> <int> <int>
1 1000 Jack 113 6
2 1001 Dave 101 19
Data:
df = read.table(text=" Code Title Year Number Allocation
1000 Jack 2001 NA 6
1000 Jack 2002 NA NA
1000 Jack 2003 NA NA
1000 Jack 2004 113 NA
1000 Jack 2005 NA NA
1001 Dave 2001 NA 19
1001 Dave 2002 NA NA
1001 Dave 2003 NA NA
1001 Dave 2004 101 NA
1001 Dave 2005 NA NA", header = TRUE)
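A more explicit sketch of the same collapse, assuming each group holds exactly one non-NA value per column:
library(dplyr)
df %>%
  group_by(Code, Title) %>%
  # na.omit() strips the NAs; first() takes the single surviving value
  summarise(Number = first(na.omit(Number)),
            Allocation = first(na.omit(Allocation))) %>%
  ungroup()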

Duplicate rows while using the merge function in R - but I don't want the sum

So here's my problem: I have about 40 datasets, all csv files that contain only two columns, (a) Date and (b) Price (in each dataset the price column is named after its country). I used the merge function as follows to consolidate all the data into a single dataset with one Date column and several price columns:
merged <- Reduce(function(x, y) merge(x, y, by="Date", all=TRUE), list(a,b,c,d,e,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z,aa,ab,ac,ad,ae,af,ag,ah,ai,aj,ak,al,am,an))
What has happened is that, for instance, the Date column now has 3 rows for the same date, with the corresponding country values split across them, e.g.:
# Date India China South Korea
# 01-Jan-2000 5445 NA 4445 NA
# 01-Jan-2000 NA 1234 NA NA
# 01-Jan-2000 NA NA NA 5678
I actually want
# 01-Jan-2000 5445 1234 4445 5678
I don't know how to get this; the other questions related to this topic ask for summation of values, which I clearly do not need. This is a simple example; unfortunately I have daily data from Jan 2000 to November 2016 for about 43 countries, all messed up. Any help to solve this would be appreciated.
I would append all data frames using rbind and reshape the result with spread(), since the result of merging depends on which data frame you start with.
Reproducible example:
library(dplyr)
library(tidyr)  # spread() lives in tidyr, not dplyr
a <- data.frame(date = Sys.Date()-1:10, cntry = "China", price=round(rnorm(10,20,5),2))
b <- data.frame(date = Sys.Date()-6:15, cntry = "Netherlands", price=round(rnorm(10,50,10),2))
c <- data.frame(date = Sys.Date()-11:20, cntry = "USA", price=round(rnorm(10,70,25),2))
all <- do.call(rbind, list(a,b,c))
all %>% group_by(date) %>% spread(cntry, price)
results in:
date China Netherlands USA
* <date> <dbl> <dbl> <dbl>
1 2016-11-29 NA NA 78.75
2 2016-11-30 NA NA 66.22
3 2016-12-01 NA NA 86.04
4 2016-12-02 NA NA 17.07
5 2016-12-03 NA NA 75.72
6 2016-12-04 NA 46.90 39.57
7 2016-12-05 NA 51.80 65.11
8 2016-12-06 NA 57.50 96.36
9 2016-12-07 NA 46.42 46.93
10 2016-12-08 NA 45.71 57.63
11 2016-12-09 15.41 60.09 NA
12 2016-12-10 16.66 60.07 NA
13 2016-12-11 23.72 66.21 NA
14 2016-12-12 19.82 45.46 NA
15 2016-12-13 14.22 45.07 NA
16 2016-12-14 27.26 NA NA
17 2016-12-15 20.08 NA NA
18 2016-12-16 15.79 NA NA
19 2016-12-17 17.66 NA NA
20 2016-12-18 26.77 NA NA
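If you already have the wide merged frame with the split rows, a sketch that collapses the duplicates per date (assumes at most one non-NA price per Date/country cell; merged is the frame from your Reduce() call):
library(dplyr)
merged %>%
  group_by(Date) %>%
  # keep the single non-NA value in each country column (NA if none)
  summarise_all(~ first(na.omit(.x)))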

Removing rows of data frame if number of NA in a column is larger than 3

I have a data frame (panel data). The Ctry column gives the country names. If, in any column (for example, Carx), the number of NAs for a country is larger than 3, I want to drop that country from my data frame. For example:
Country A has 2 NAs
Country B has 4 NAs
Country C has 3 NAs
I want to drop country B from my data frame. My data looks like this (this is for illustration; my actual data frame is very large):
Ctry year Carx
A 2000 23
A 2001 18
A 2002 20
A 2003 NA
A 2004 24
A 2005 18
B 2000 NA
B 2001 NA
B 2002 NA
B 2003 NA
B 2004 18
B 2005 16
C 2000 NA
C 2001 NA
C 2002 24
C 2003 21
C 2004 NA
C 2005 24
I want to create a data frame like this:
Ctry year Carx
A 2000 23
A 2001 18
A 2002 20
A 2003 NA
A 2004 24
A 2005 18
C 2000 NA
C 2001 NA
C 2002 24
C 2003 21
C 2004 NA
C 2005 24
A fairly straightforward way in base R is to use sum(is.na(.)) along with ave, to do the counting, like this:
with(mydf, ave(Carx, Ctry, FUN = function(x) sum(is.na(x))))
# [1] 1 1 1 1 1 1 4 4 4 4 4 4 3 3 3 3 3 3
Once you have that, subsetting is easy:
mydf[with(mydf, ave(Carx, Ctry, FUN = function(x) sum(is.na(x)))) <= 3, ]
# Ctry year Carx
# 1 A 2000 23
# 2 A 2001 18
# 3 A 2002 20
# 4 A 2003 NA
# 5 A 2004 24
# 6 A 2005 18
# 13 C 2000 NA
# 14 C 2001 NA
# 15 C 2002 24
# 16 C 2003 21
# 17 C 2004 NA
# 18 C 2005 24
You can use the by() function to group by Ctry and count each group's NAs:
DF <- read.csv(
text='Ctry,year,Carx
A,2000,23
A,2001,18
A,2002,20
A,2003,NA
A,2004,24
A,2005,18
B,2000,NA
B,2001,NA
B,2002,NA
B,2003,NA
B,2004,18
B,2005,16
C,2000,NA
C,2001,NA
C,2002,24
C,2003,21
C,2004,NA
C,2005,24',
stringsAsFactors=F)
res <- by(data=DF$Carx, INDICES=DF$Ctry, FUN=function(x) sum(is.na(x)))
validCtry <- names(res)[res <= 3]
DF[DF$Ctry %in% validCtry, ]
# Ctry year Carx
#1 A 2000 23
#2 A 2001 18
#3 A 2002 20
#4 A 2003 NA
#5 A 2004 24
#6 A 2005 18
#13 C 2000 NA
#14 C 2001 NA
#15 C 2002 24
#16 C 2003 21
#17 C 2004 NA
#18 C 2005 24
EDIT:
If you have more columns to check, you could adapt the previous code as follows:
res <- by(data=DF, INDICES=DF$Ctry,
          FUN=function(x){
            return(sum(is.na(x$Carx)) <= 3 &&
                   sum(is.na(x$Barx)) <= 3 &&
                   sum(is.na(x$Tarx)) <= 3)
          })
validCtry <- names(res)[res]
DF[DF$Ctry %in% validCtry, ]
where, of course, you may change the condition in FUN according to your needs.
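A sketch that checks every column at once instead of listing each by name (assumes every column should obey the same threshold):
# TRUE only if no column in the group has more than 3 NAs
res <- by(data=DF, INDICES=DF$Ctry,
          FUN=function(x) all(colSums(is.na(x)) <= 3))
validCtry <- names(res)[res]
DF[DF$Ctry %in% validCtry, ]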
Since you mention that your data is "very huge" (whatever that means exactly), you could try a solution with dplyr and see if it's perhaps faster than the solutions in base R. If those are fast enough, just ignore this one.
require(dplyr)
newdf <- df %>% group_by(Ctry) %>% filter(sum(is.na(Carx)) <= 3)
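To apply the same rule to many columns at once, a sketch assuming dplyr >= 1.0.4 for if_all():
library(dplyr)
newdf <- df %>%
  group_by(Ctry) %>%
  # keep a country only if every numeric column has at most 3 NAs
  filter(if_all(where(is.numeric), ~ sum(is.na(.x)) <= 3)) %>%
  ungroup()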
