R Transform Data Frame and Remove NAs

I've converted a data set in R from LONG to WIDE format and now have one measurement per row. What would be the best way to consolidate the rows based on the "Date" column and remove the NAs?
Here is a sample of what I have:
Date M1 M2 M3 M4
1 2013 NA NA NA 2
2 2013 6 NA NA NA
3 2013 NA 19 NA NA
4 2013 NA NA 10 NA
5 2014 NA NA NA 1
6 2014 NA NA 231 NA
7 2014 NA 215 NA NA
8 2014 16 NA NA NA
This is what I'd like to create:
Date M1 M2 M3 M4
1 2013 6 19 10 2
2 2014 16 215 231 1
Any suggestions or help would be appreciated!

Without knowing more about your dataset, you can try something like this:
library(data.table)
as.data.table(mydf)[, lapply(.SD, sum, na.rm = TRUE), by = Date]
# Date M1 M2 M3 M4
# 1: 2013 6 19 10 2
# 2: 2014 16 215 231 1
It doesn't have to be "data.table" (though that will be one of your fastest options); any of your favorite aggregation functions will do.
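For a fully reproducible run, the sample data can be rebuilt with read.table (a sketch; I'm assuming mydf matches the posted table exactly):
mydf <- read.table(text = "Date M1 M2 M3 M4
2013 NA NA NA 2
2013 6 NA NA NA
2013 NA 19 NA NA
2013 NA NA 10 NA
2014 NA NA NA 1
2014 NA NA 231 NA
2014 NA 215 NA NA
2014 16 NA NA NA", header = TRUE)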

If you have one measurement per row:
result <- aggregate(cbind(M1 = data$M1, M2 = data$M2, M3 = data$M3, M4 = data$M4),
                    by = list(Date = data$Date), FUN = sum, na.rm = TRUE)
Edit
As Ananda mentioned in the comments, this is better (na.action = "na.pass" keeps rows containing NAs, which the formula interface would otherwise drop before aggregating):
aggregate(. ~ Date, mydf, sum, na.rm = TRUE, na.action = "na.pass")

Using dplyr
library(dplyr)
df1 %>%
  group_by(Date) %>%
  summarise_each(funs(sum(., na.rm = TRUE)))
# Date M1 M2 M3 M4
#1 2013 6 19 10 2
#2 2014 16 215 231 1
If there is only one non-NA observation per column per 'Date', you could replace the summarise_each step with summarise_each(funs(na.omit(.))).
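summarise_each() and funs() were deprecated in later dplyr releases; the same aggregation can be written with across(). A sketch, assuming dplyr >= 1.0:
library(dplyr)
df1 %>%
  group_by(Date) %>%
  summarise(across(M1:M4, ~ sum(.x, na.rm = TRUE)))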

Related

Counting columns with NAs after group_by

I want to count the number of columns that have an NA value after using group_by.
Similar questions have been asked, but they count total NAs rather than columns containing NA (group by counting non NA).
Data:
Spes <- "Year Spec.1 Spec.2 Spec.3 Spec.4
1 2016 5 NA NA 5
2 2016 1 NA NA 6
3 2016 6 NA NA 4
4 2018 NA 5 5 9
5 2018 NA 4 7 3
6 2018 NA 5 2 1
7 2019 6 NA NA NA
8 2019 4 NA NA NA
9 2019 3 NA NA NA"
Data <- read.table(text = Spes, header = TRUE)
Data$Year <- as.factor(Data$Year)
The desired output:
2016 2
2018 1
2019 3
I have tried a few things; this is my current best attempt. I would be keen on a dplyr solution.
> Data %>%
group_by(Year) %>%
summarise_each(colSums(is.na(Data, [2:5])))
Error: Can't create call to non-callable object
I have tried variations without much luck. Many thanks
One option could be to group_by Year, check whether each column contains any NA values, and count how many columns do for each Year.
library(dplyr)
Data %>%
group_by(Year) %>%
summarise_all(~any(is.na(.))) %>%
mutate(output = rowSums(.[-1])) %>%
select(Year, output)
# A tibble: 3 x 2
# Year output
# <fct> <dbl>
#1 2016 2
#2 2018 1
#3 2019 3
Base R translation using aggregate
rowSums(aggregate(. ~ Year, Data, function(x) any(is.na(x)),
                  na.action = "na.pass")[-1], na.rm = TRUE)
#[1] 2 1 3
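The same idea in current dplyr syntax (summarise_all() is superseded by across()); a sketch, assuming dplyr >= 1.0 and that every species column name starts with "Spec":
library(dplyr)
Data %>%
  group_by(Year) %>%
  summarise(output = sum(colSums(is.na(across(starts_with("Spec")))) > 0))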

Duplicate rows while using Merge function in R - but I don't want the sum

So here's my problem: I have about 40 datasets, all csv files that contain only two columns, (a) Date and (b) Price (in each dataset the price column is named after its country). I used the merge function as follows to consolidate all the data into a single dataset with one date column and several price columns:
merged <- Reduce(function(x, y) merge(x, y, by="Date", all=TRUE), list(a,b,c,d,e,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z,aa,ab,ac,ad,ae,af,ag,ah,ai,aj,ak,al,am,an))
What has happened is I have for instance in date column, 3 values for same date but the corresponding country values are split. e.g.:
# Date India China South Korea
# 01-Jan-2000 5445 NA 4445 NA
# 01-Jan-2000 NA 1234 NA NA
# 01-Jan-2000 NA NA NA 5678
I actually want
# 01-Jan-2000 5445 1234 4445 5678
I don't know how to get this, as the other questions related to this topic ask for summation of values, which I clearly do not need. This is a simple example; unfortunately I have daily data from Jan 2000 to November 2016 for about 43 countries, all messed up. Any help to solve this would be appreciated.
I would append all data frames using rbind and reshape the result with spread(), since the outcome of repeated merging depends on which data frame you start with.
Reproducible example:
library(dplyr)
library(tidyr) # spread() comes from tidyr, not dplyr
a <- data.frame(date = Sys.Date()-1:10, cntry = "China", price=round(rnorm(10,20,5),2))
b <- data.frame(date = Sys.Date()-6:15, cntry = "Netherlands", price=round(rnorm(10,50,10),2))
c <- data.frame(date = Sys.Date()-11:20, cntry = "USA", price=round(rnorm(10,70,25),2))
all <- do.call(rbind, list(a,b,c))
all %>% group_by(date) %>% spread(cntry, price)
results in:
date China Netherlands USA
* <date> <dbl> <dbl> <dbl>
1 2016-11-29 NA NA 78.75
2 2016-11-30 NA NA 66.22
3 2016-12-01 NA NA 86.04
4 2016-12-02 NA NA 17.07
5 2016-12-03 NA NA 75.72
6 2016-12-04 NA 46.90 39.57
7 2016-12-05 NA 51.80 65.11
8 2016-12-06 NA 57.50 96.36
9 2016-12-07 NA 46.42 46.93
10 2016-12-08 NA 45.71 57.63
11 2016-12-09 15.41 60.09 NA
12 2016-12-10 16.66 60.07 NA
13 2016-12-11 23.72 66.21 NA
14 2016-12-12 19.82 45.46 NA
15 2016-12-13 14.22 45.07 NA
16 2016-12-14 27.26 NA NA
17 2016-12-15 20.08 NA NA
18 2016-12-16 15.79 NA NA
19 2016-12-17 17.66 NA NA
20 2016-12-18 26.77 NA NA
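If you prefer data.table, its dcast() does the stack-and-reshape in one step; a sketch under the same setup as above:
library(data.table)
dcast(rbindlist(list(a, b, c)), date ~ cntry, value.var = "price")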

How to create a panel based on a long table in R

I have a data set like the following:
wk name score
3 - Davide - 3.070000
6 - Davide - 3.460000
7 - Davide - 3.480000
48 -Cringe- 2.773333
79 -Fabynsane- 2.330000
69 -PiDjO- 2.070000
61 -sjb- 2.310000
I want to use this information to construct a panel like the following:
name1 name2 name3 ...
wk1
wk2
wk3
...
I have tried dcast from reshape2:
panel.num = dcast(data, name + wk ~ score)
but it gives me a panel like the following and this is apparently not the one I want:
Authorname wk.list 1 2 3 4 5 6 7 8 9 10 11 12 13
2 - Davide - 3 1 NA NA NA NA NA NA NA NA NA NA NA NA NA
3 - Davide - 6 1 NA NA NA NA NA NA NA NA NA NA NA NA NA
I am wondering what went wrong and how I could fix this issue. Thanks~
Try doing wk ~ name, i.e.
dat <- data.frame(wk = sample(1:100, 10),
                  name = sample(c("Davide", "Cringe", "Fabynsane"), 10, rep = TRUE),
                  score = runif(10, 2, 3))
library(reshape2)
dcast(dat, wk ~ name)
# wk Cringe Davide Fabynsane
# 1 8 NA 2.225543 NA
# 2 12 NA NA 2.958040
# 3 46 NA 2.659209 NA
# 4 47 NA 2.086529 NA
# 5 59 NA NA 2.287232
Other options include
library(tidyr)
spread(dat, name, score)
Or reshape from base R
reshape(dat, idvar='wk', timevar='name', direction='wide')
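In more recent tidyr (1.0+), spread() is superseded by pivot_wider(), which does the same reshape:
library(tidyr)
pivot_wider(dat, names_from = name, values_from = score)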

R issues with merge/rbind/concatenate two data frames

I am a beginner with R, so I apologise in advance if this question was asked elsewhere. Here is my issue:
I have two data frames, df1 and df2, with different numbers of rows and columns. The two frames have only one variable (column) in common, called "customer_no". I want the merged frame to match records based on "customer_no", keeping rows from df2 only. Both data frames have multiple rows for each customer_no.
I tried the following:
merged.df <- merge(df1, df2, by = "customer_no", all.y = TRUE)
The problem is that this assigns values of df1 to df2 where instead it should be empty. My questions are:
1) How can I tell the command to leave the unmatched columns empty?
2) How can I see from the merged file which row came from which df? I guess if I resolve the above question this should be easy to see by the empty columns.
I am missing something in my command but don't know what. If the question has been answered somewhere else, would you still be kind enough to rephrase it in English here for an R beginner?
Thanks!
Data example:
df1:
customer_no country year
10 UK 2001
10 UK 2002
10 UK 2003
20 US 2007
30 AU 2006
df2:
customer_no income
10 700
10 800
10 900
30 1000
Merged file should look like this:
merged.df:
customer_no income country year
10 UK 2001
10 UK 2002
10 UK 2003
10 700
10 800
10 900
30 AU 2006
30 1000
So:
It puts all the columns together, adds the values of df2 right after the last row of df1 for the same customer_no, and keeps only the customer_no values that appear in df2 (merged.df does not have customer_no 20). Also, it leaves all the other cells empty.
In Stata I would use append, but I'm not sure what to use in R... perhaps join?
THANKS!!
Try:
df1$id <- paste(df1$customer_no, 1, sep="_")
df2$id <- paste(df2$customer_no, 2, sep="_")
res <- merge(df1, df2, by=c('id', 'customer_no'),all=TRUE)[,-1]
res1 <- res[res$customer_no %in% df2$customer_no,]
res1
# customer_no country year income
#1 10 UK 2001 NA
#2 10 UK 2002 NA
#3 10 UK 2003 NA
#4 10 <NA> NA 700
#5 10 <NA> NA 800
#6 10 <NA> NA 900
#8 30 AU 2006 NA
#9 30 <NA> NA 1000
If you want to change NA to '',
res1[is.na(res1)] <- '' #But, I would leave it as `NA` as there are `numeric` columns.
Or, use rbindlist from data.table (Using the original datasets)
library(data.table)
indx <- df1$customer_no %in% df2$customer_no
rbindlist(list(df1[indx,], df2),fill=TRUE)[order(customer_no)]
# customer_no country year income
#1: 10 UK 2001 NA
#2: 10 UK 2002 NA
#3: 10 UK 2003 NA
#4: 10 NA NA 700
#5: 10 NA NA 800
#6: 10 NA NA 900
#7: 30 AU 2006 NA
#8: 30 NA NA 1000
You could also use the smartbind function from the gtools package.
require(gtools)
res <- smartbind(df1[df1$customer_no %in% df2$customer_no, ], df2)
res[order(res$customer_no), ]
# customer_no country year income
# 1:1 10 UK 2001 NA
# 1:2 10 UK 2002 NA
# 1:3 10 UK 2003 NA
# 2:1 10 <NA> NA 700
# 2:2 10 <NA> NA 800
# 2:3 10 <NA> NA 900
# 1:4 30 AU 2006 NA
# 2:4 30 <NA> NA 1000
Try:
df1$income <- NA
df2$country <- df2$year <- NA
rbind(df1, df2)
customer_no country year income
1 10 UK 2001 NA
2 10 UK 2002 NA
3 10 UK 2003 NA
4 20 US 2007 NA
5 30 AU 2006 NA
6 10 <NA> NA 700
7 10 <NA> NA 800
8 10 <NA> NA 900
9 30 <NA> NA 1000
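dplyr's bind_rows() gives the same column-filling behaviour as rbindlist(fill = TRUE) and smartbind(); a sketch, again keeping only the customers present in df2:
library(dplyr)
res <- bind_rows(df1[df1$customer_no %in% df2$customer_no, ], df2)
res[order(res$customer_no), ]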

R: Function “diff” over various groups

While searching for a solution to my problem I found this thread: Function "diff" over various groups in R. I've got a very similar question so I'll just work with the example there.
This is what my desired output should look like:
name class year diff
1 a c1 2009 NA
2 a c1 2010 67
3 b c1 2009 NA
4 b c1 2010 20
I have two variables which form subgroups - class and name. So I want to compare only the values which have the same name and class. I also want to have the differences from 2009 to 2010. If there is no 2008, diff 2009 should return NA (since it can't calculate a difference).
I'm sure it works very similarly to the other thread but I just can't make it work. I used this code too (and simply solved the ascending year by sorting the data differently), but somehow R still manages to calculate a difference and does not return NA.
ddply(df, .(class, name), summarize, year=head(year, -1), value=diff(value))
Using the data set from the other post, I would do something like
library(data.table)
df <- df[df$year != 2008, ]
setkey(setDT(df), class, name, year)
df[, diff := lapply(.SD, function(x) c(NA, diff(x))),
.SDcols = "value", by = list(class, name)]
Which returns
df
# name class year value diff
# 1: a c1 2009 33 NA
# 2: a c1 2010 100 67
# 3: b c1 2009 80 NA
# 4: b c1 2010 90 10
# 5: a c2 2009 80 NA
# 6: a c2 2010 90 10
# 7: b c2 2009 90 NA
# 8: b c2 2010 100 10
# 9: a c3 2009 90 NA
#10: a c3 2010 100 10
#11: b c3 2009 80 NA
#12: b c3 2010 99 19
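Newer data.table versions (1.9.6+) also have shift(), which expresses the same lagged difference without lapply over .SD; a sketch, assuming df is still sorted by year within groups as after the setkey above:
df[, diff := value - shift(value), by = .(class, name)]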
Using dplyr
library(dplyr)
df %>%
  filter(year != 2008) %>%
  arrange(name, class, year) %>%
  group_by(class, name) %>%
  mutate(diff = c(NA, diff(value)))
# Source: local data frame [12 x 5]
# Groups: class, name
# name class year value diff
# 1 a c1 2009 33 NA
# 2 a c1 2010 100 67
# 3 a c2 2009 80 NA
# 4 a c2 2010 90 10
# 5 a c3 2009 90 NA
# 6 a c3 2010 100 10
# 7 b c1 2009 80 NA
# 8 b c1 2010 90 10
# 9 b c2 2009 90 NA
# 10 b c2 2010 100 10
# 11 b c3 2009 80 NA
# 12 b c3 2010 99 19
Update:
With the relative difference, using lag() to get the previous value (the original value[row_number() - 1] indexing only lines up when each group has exactly two rows, because the zero index is dropped and the vector shortens):
df %>%
  filter(year != 2008) %>%
  arrange(name, class, year) %>%
  group_by(class, name) %>%
  mutate(diff1 = c(NA, diff(value)),
         rel_diff = round(diff1/lag(value), 2))
