Filling in a ton of NA data in R by indices? - r

I have Price data indexed according to three things:
State, Date, and UPC (that is the Product Code).
I have a bunch of prices that are NA.
I am trying to fill in the NAs in the following way: for a given missing Price with index (S, D, UPC), fill in the average Price of all the data points with the same S and UPC, i.e., take the average over Date.
There must be an incredibly easy way to do this because this is very simple. I have been using for loops, but I now realize that that is incredibly inefficient and I would like to use a function, such as one in plyr or dplyr, that will do it all in as few steps as possible.
upc=c(1153801013,1153801013,1153801013,1153801013,1153801013,1153801013,2105900750,2105900750,2105900750,2105900750,2105900750,2173300001,2173300001,2173300001,2173300001)
date=c(200601,200602,200603,200604,200601,200602,200601,200602,200603,200601,200602,200603,200604,200605,200606)
price=c(26,28,NA,NA,23,24,85,84,NA,81,78,24,19,98,NA)
state=c(1,1,1,1,2,2,1,1,2,2,2,1,1,1,1)
# This is what I have:
data <- data.frame(upc,date,state,price)
# This is what I want:
price=c(26,28,27,27,23,24,85,84,79.5,81,78,24,19,98,47)
data2 <- data.frame(upc,date,state,price)
Any advice? Thanks.

Use ave with multiple grouping variables, and then replace NA values with the means:
with(data,
ave(price, list(upc,state), FUN=function(x) replace(x,is.na(x),mean(x,na.rm=TRUE) ) )
)
# [1] 26.0 28.0 27.0 27.0 23.0 24.0 85.0 84.0 79.5 81.0 78.0 24.0 19.0 98.0 47.0

You can construct a matrix of means by upc and state (the data frame is called dat here):
meanmtx <- tapply(dat$price, dat[c('upc','state')], mean, na.rm=TRUE)
That matrix has character dimnames that can be matched to the values in upc and state. So use a two-column character matrix as the index to put the means into the empty "slots":
dat$price[is.na(dat$price)] <-
meanmtx[ cbind( as.character(dat[ is.na(dat$price), 'upc']),
as.character(dat[ is.na(dat$price),'state']) ) ]
> dat
upc date state price
1 1153801013 200601 1 26.0
2 1153801013 200602 1 28.0
3 1153801013 200603 1 27.0
4 1153801013 200604 1 27.0
5 1153801013 200601 2 23.0
6 1153801013 200602 2 24.0
7 2105900750 200601 1 85.0
8 2105900750 200602 1 84.0
9 2105900750 200603 2 79.5
10 2105900750 200601 2 81.0
11 2105900750 200602 2 78.0
12 2173300001 200603 1 24.0
13 2173300001 200604 1 19.0
14 2173300001 200605 1 98.0
15 2173300001 200606 1 47.0
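
The two-column character indexing used above may be easier to see on a tiny example. This is only a sketch: the matrix m and its dimnames are made up for illustration and are not part of the original data.

```r
# Hypothetical 2 x 2 matrix with named rows and columns.
m <- matrix(1:4, nrow = 2, dimnames = list(c("a", "b"), c("x", "y")))

# Each row of the index matrix is one (row name, column name) pair.
idx <- cbind(c("a", "b"), c("y", "x"))

# Picks m["a", "y"] and m["b", "x"] in one step.
picked <- m[idx]
picked
```

In the answer above, the row names are the upc values and the column names are the state values, so each NA row looks up its own group mean.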

Here is another compact option using na.aggregate (from zoo) and data.table. By default, na.aggregate replaces the NA values with the mean of the column of interest. It also has a FUN argument in case we want to replace the NA with the median, min, max, or whatever we wish. The group-by operation can be done with dplyr, data.table, or base R methods. With data.table, we convert the 'data.frame' to a 'data.table' (setDT(data)) and, grouped by 'upc' and 'state', assign (:=) 'price' as the na.aggregate of 'price'.
library(data.table)
library(zoo)
setDT(data)[, price:= na.aggregate(price) , .(upc, state)]
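
Since the question asks about dplyr, the same group-wise fill could be sketched as below. This is not one of the original answers; it assumes dplyr is installed, and the name filled is made up for the result.

```r
library(dplyr)

upc   <- c(1153801013, 1153801013, 1153801013, 1153801013, 1153801013, 1153801013,
           2105900750, 2105900750, 2105900750, 2105900750, 2105900750,
           2173300001, 2173300001, 2173300001, 2173300001)
date  <- c(200601, 200602, 200603, 200604, 200601, 200602, 200601, 200602,
           200603, 200601, 200602, 200603, 200604, 200605, 200606)
price <- c(26, 28, NA, NA, 23, 24, 85, 84, NA, 81, 78, 24, 19, 98, NA)
state <- c(1, 1, 1, 1, 2, 2, 1, 1, 2, 2, 2, 1, 1, 1, 1)
data  <- data.frame(upc, date, state, price)

# Within each (upc, state) group, replace NA prices with the group mean.
filled <- data %>%
  group_by(upc, state) %>%
  mutate(price = ifelse(is.na(price), mean(price, na.rm = TRUE), price)) %>%
  ungroup()

filled$price
```

If a group has no observed prices at all, mean(price, na.rm = TRUE) returns NaN, so such rows would need separate handling.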

Related

How to calculate means from data frame in R for a variable with specific amount of NAs or not NAs at all?

So I have lots of data in the form (4 values for each day)
date var1 var2
1 2003-10-28 1.2 970
2 2003-10-28 NA 510
3 2003-10-28 NA 640
4 2003-10-28 NA 730
5 2003-10-30 2.0 570
6 2003-10-30 NA 480
7 2003-10-30 1.2 580
8 2003-10-30 1.2 297
9 2002-05-07 3.0 830
10 2002-05-07 4.8 507
11 2002-05-07 4.8 253
12 2002-05-07 NA 798
and I need to calculate sums for var1 for every day, IF there is for example less than 2 NA values (or none) for that specific date and otherwise that date should be ignored. At the same time I should calculate means of var2 for the same dates, IF the sum for var1 was also calculated. Then I should save those means, sums and dates to another data frame so that those ignored dates aren't there.
I have tried all kinds of loop structures, but I get confused by the fact that mean and sum have to be calculated for the dates where there are not NAs at all. Also saving the dates, means and sums gets me into difficulties because I have no idea how to do the indexing properly.
so expected output from this sample data should look like
date sum(var1) mean(var2)
1 2003-10-30 4.4 481.75
2 2002-05-07 12.6 597.00
Here is an option with data.table. Convert the 'data.frame' to a 'data.table' (setDT(df1)) and group by 'date'; if the number of NA values in 'var1' for a date is less than 3, get the sum of 'var1' and the mean of 'var2'.
library(data.table)
setDT(df1)[,if(sum(is.na(var1)) < 3) .(Sum = sum(var1, na.rm=TRUE),
Mean = mean(var2, na.rm=TRUE)) , by = date]
# date Sum Mean
#1: 2003-10-30 4.4 481.75
#2: 2002-05-07 12.6 597.00
Using dplyr, assuming your original dataset is df:
library(dplyr)
df %>%
  group_by(date) %>%
  filter(sum(is.na(var1)) <= 2) %>%
  summarise(Sum = sum(var1, na.rm = TRUE), Mean = mean(var2, na.rm = TRUE))
Data
df <- read.table(text = " date var1 var2
1 2003-10-28 1.2 970
2 2003-10-28 NA 510
3 2003-10-28 NA 640
4 2003-10-28 NA 730
5 2003-10-30 2.0 570
6 2003-10-30 NA 480
7 2003-10-30 1.2 580
8 2003-10-30 1.2 297
9 2002-05-07 3.0 830
10 2002-05-07 4.8 507
11 2002-05-07 4.8 253
12 2002-05-07 NA 798",header =TRUE)
Output
Source: local data frame [2 x 3]
date Sum Mean
(date) (dbl) (dbl)
1 2002-05-07 12.6 597.00
2 2003-10-30 4.4 481.75
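
For comparison, the same rule can be sketched in base R with tapply. This is an illustration rather than one of the original answers; max_na, na_count, keep, kept, and result are names made up here, and the threshold of at most 2 NAs mirrors the answers above.

```r
df <- data.frame(
  date = rep(c("2003-10-28", "2003-10-30", "2002-05-07"), each = 4),
  var1 = c(1.2, NA, NA, NA, 2.0, NA, 1.2, 1.2, 3.0, 4.8, 4.8, NA),
  var2 = c(970, 510, 640, 730, 570, 480, 580, 297, 830, 507, 253, 798)
)

max_na <- 2  # assumed threshold: keep dates with at most 2 NAs in var1

# Count NAs per date, then keep only the dates that pass the threshold.
na_count <- tapply(is.na(df$var1), df$date, sum)
keep <- names(na_count)[na_count <= max_na]
kept <- df[df$date %in% keep, ]

# Summarise the surviving dates into a new data frame.
result <- data.frame(
  date = keep,
  Sum  = as.vector(tapply(kept$var1, kept$date, sum, na.rm = TRUE)[keep]),
  Mean = as.vector(tapply(kept$var2, kept$date, mean)[keep])
)
result
```

The dates come out in alphabetical order because tapply sorts its groups.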

Match two data frames by two columns and extract values from third column

I apologize if this is a basic or duplicate question, but I am a beginner R user.
I am attempting to match every row in Dataframe A by Sex and Age to the two corresponding columns in Dataframe B. I know there will be a match for sure, so I want to extract values from the matching rows of two different columns in Dataframe B and store them in Dataframe C.
Dataframe A Dataframe B
ID Sex Age Weight Row Sex Age X1 X2
1 1 24 36 1 1 24 18.2 12.3
2 1 34 56 2 2 87 15.4 16.5
3 2 87 12 3 1 64 16.3 11.2
4 2 21 08 4 2 21 15.6 14.7
5 1 64 33 5 1 34 17.7 18.9
...
Dataframe C
ID Age Sex Weight Y1 Y2
1 1 24 36 18.2 12.3
2 1 34 56 17.7 18.9
3 2 87 12 15.4 16.5
4 2 21 08 15.6 14.7
5 1 64 33 16.3 11.2
There are 9000 IDs in my dataframe. I've looked at similar questions like this one
Fill column values by matching values in each row in two dataframe
But I don't think I am applying this code correctly. Will a for loop be useful here?
for(i in 1:nrow(ID){
dfC[i,Y1] <-df2[match(paste(dfA$Sex,dfa$Age),paste(dfB$Sex,dfB$Age)),"X1"]
dfC[i,Y2] <-df2[match(paste(dfA$Sex,dfa$Age),paste(dfB$Sex,dfB$Age)),"X2"]
}
I know the merge function was also suggested, but these two variables are not actually named the same way in my data set.
Thanks!
Try Reduce() with merge() for this kind of operation. merge() joins on the columns the data frames have in common:
# Toy data frames of equal length that share key columns
list.of.data.frames <- list(
  data.frame(id = 1:10, sex = 1:10, age = 1:10, weight = 1:10),
  data.frame(row = 5:14, sex = 1:10, age = 1:10, x1 = 1:10, x2 = 1:10),
  data.frame(id = 1:10, sex = 1:10, age = 1:10, weight = 1:10, y1 = 1:10, y2 = 1:10))
merged.data.frame <- Reduce(function(...) merge(..., all = TRUE), list.of.data.frames)
tail(merged.data.frame)
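
The question mentions that the key columns are not named the same way in the two data sets. merge() can handle that through its by.x and by.y arguments. The sketch below uses the sample values from the question; the column names Gender and Years in dfB are made up to stand in for the differently named columns.

```r
dfA <- data.frame(ID = 1:5,
                  Sex = c(1, 1, 2, 2, 1),
                  Age = c(24, 34, 87, 21, 64),
                  Weight = c(36, 56, 12, 8, 33))

# Hypothetical names "Gender" and "Years" for the matching columns in dfB.
dfB <- data.frame(Row = 1:5,
                  Gender = c(1, 2, 1, 2, 1),
                  Years = c(24, 87, 64, 21, 34),
                  X1 = c(18.2, 15.4, 16.3, 15.6, 17.7),
                  X2 = c(12.3, 16.5, 11.2, 14.7, 18.9))

# Left join dfA to dfB on (Sex, Age) = (Gender, Years); drop dfB's Row column first.
dfC <- merge(dfA, dfB[c("Gender", "Years", "X1", "X2")],
             by.x = c("Sex", "Age"), by.y = c("Gender", "Years"),
             all.x = TRUE)

# Restore the original row order and rename the extracted columns.
dfC <- dfC[order(dfC$ID), ]
names(dfC)[names(dfC) %in% c("X1", "X2")] <- c("Y1", "Y2")
dfC <- dfC[c("ID", "Age", "Sex", "Weight", "Y1", "Y2")]
dfC
```

No loop is needed: the join is vectorised, so it should scale to the 9000 IDs mentioned in the question.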

Replace NaNs in dataframe with values from another dataframe based on two criteria

Hi this is my first post to stackoverflow. I have been trying to solve this problem, but have not been able to figure out the answer alone nor find other posts that answer this question.
I need to replace missing values in my dataset with values from another dataframe; the tricky part is that the replacement values belong to a different level of another factor (site) but must match on date.
Here is a simplified version of the first dataframe:
> df1
date site Value
1991-07-08 A 22.5
1991-07-09 A NaN
1992-07-13 B 23.1
1992-07-14 A NaN
1993-07-07 B 27.3
Here is a simplified version of the second dataframe:
> df2
date site value
1991-07-08 A 22.5
1991-07-09 A NaN
1992-07-14 A NaN
1991-07-08 B 10.6
1992-07-09 B 23
1992-07-14 B NaN
1992-07-09 C 11.3
1992-07-14 C 12.4
What I want to do is this: when there is a missing value for A, replace it with the value from B (with the same date), and if there is no value for B, use the value of C (with the same date). Thus, the resulting dataframe would look like this:
> dfFIN
date site Value
1991-07-08 A 22.5
1991-07-09 A 23
1992-07-13 B 23.1
1992-07-14 A 12.4
1993-07-07 B 27.3
This is what I have come up with so far:
dfFIN<-replace(df1[which(df1$site=="A"),],
df1$value[which(df$value=="NaN")],
df2$value[which(df2$site=="B" &
df2$date==df1$date[which(df1$value=="NaN" & df1$site=="A")])])
However, I get the following error message:
Error in [<-.data.frame(*tmp*, list, value = numeric(0)) :
missing values are not allowed in subscripted assignments of data frames
And I have not incorporated site C yet. I am not quite sure what to do and would appreciate any help.
Welcome to SO! First of all, your problem seems a bit underdefined, so I went ahead and made several alterations. I'm starting with two data frames:
df1 <- read.table(text = "
date site Value
1991-07-08 A 22.5
1991-07-09 A NaN
1992-07-13 B 23.1
1992-07-14 A NaN
1993-07-07 B 27.3
", header = TRUE)
df2 <- read.table(text = "
date site Value
1991-07-08 A 22.5
1991-07-09 A NaN
1992-07-14 A NaN
1991-07-08 B 10.6
1991-07-09 B 23
1992-07-14 B NaN
1992-07-09 C 11.3
1992-07-14 C 12.4
", header = TRUE)
Replacing NaN with a more traditional NA:
df1$Value[is.nan(df1$Value)] <- NA
df2$Value[is.nan(df2$Value)] <- NA
Merging (left joining) data frames that are cast from long to wide format (reshape2), so that date serves as a key:
library(reshape2)
dd1 <- dcast(df1, date ~ site, value.var = "Value")
dd2 <- dcast(df2, date ~ site, value.var = "Value")
dm <- merge(dd1, dd2, by = "date", all.x = TRUE, suffixes = c("", ".y"))
dm looks like so:
date A B A.y B.y C
1 1991-07-08 22.5 NA 22.5 10.6 NA
2 1991-07-09 NA NA NA 23.0 NA
3 1992-07-13 NA 23.1 NA NA NA
4 1992-07-14 NA NA NA NA 12.4
5 1993-07-07 NA 27.3 NA NA NA
Now it is super easy to replace NA with anything you want without the need to bother with dates. I'm using the following rule: if A is missing, use B.y, if B.y is also missing, use C.
dm$A <- ifelse(is.na(dm$A),
ifelse(is.na(dm$B.y),
dm$C, dm$B.y),
dm$A)
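
If more fallback sites are added later, the nested ifelse can get hard to read. A possible alternative, sketched here as an assumption and not as part of the original answer, is to walk through the fallback columns in priority order. The data frame below re-creates the relevant columns of dm by hand, and filled and gap are illustrative names.

```r
# Relevant columns of dm from above, re-created by hand for a self-contained example.
dm <- data.frame(
  date = c("1991-07-08", "1991-07-09", "1992-07-13", "1992-07-14", "1993-07-07"),
  A    = c(22.5, NA, NA, NA, NA),
  B.y  = c(10.6, 23.0, NA, NA, NA),
  C    = c(NA, NA, NA, 12.4, NA)
)

# Fill the remaining gaps in A from each fallback column, in priority order.
filled <- dm$A
for (col in c("B.y", "C")) {
  gap <- is.na(filled)
  filled[gap] <- dm[[col]][gap]
}
dm$A <- filled
dm$A
```

Rows where every fallback is also missing stay NA, which matches the behaviour of the nested ifelse.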
Now restore the original format:
dfFin <- na.omit(melt(dm[, c("date", "A", "B")], id = "date", variable.name = "site"))
date site value
1 1991-07-08 A 22.5
2 1991-07-09 A 23.0
4 1992-07-14 A 12.4
8 1992-07-13 B 23.1
10 1993-07-07 B 27.3

R Programming Calculate Rows Average

How can I use R to calculate row means?
Sample data:
f<- data.frame(
name=c("apple","orange","banana"),
day1sales=c(2,5,4),
day1sales=c(2,8,6),
day1sales=c(2,15,24),
day1sales=c(22,51,13),
day1sales=c(5,8,7)
)
Expected results: an AverageSales value for each row.
The table will gain more columns later. The expected result here only covers the columns up to day1sales.4, but as more data arrives the table will extend to day1sales.6 and beyond. So how can I compute the average across all of the sales columns for every row?
Use rowMeans:
> rowMeans(f[-1])
## [1] 6.6 17.4 10.8
You can also add the means to the data set as another column:
> f$AvgSales <- rowMeans(f[-1])
> f
## name day1sales day1sales.1 day1sales.2 day1sales.3 day1sales.4 AvgSales
## 1 apple 2 2 2 22 5 6.6
## 2 orange 5 8 15 51 8 17.4
## 3 banana 4 6 24 13 7 10.8
rowMeans is the simplest way. Also the function apply will apply a function along the rows or columns of a data frame. In this case you want to apply the mean function to the rows:
f$AverageSales <- apply(f[, 2:length(f)], 1, mean)
This adds an AverageSales column to the data frame f with the value that you want (2:length(f) is used instead of 2:6 since you say you may add more columns).
> f
## name day1sales day1sales.1 day1sales.2 day1sales.3 day1sales.4 AverageSales
##1 apple 2 2 2 22 5 6.6
##2 orange 5 8 15 51 8 17.4
##3 banana 4 6 24 13 7 10.8
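
One caveat with f[-1] and 2:length(f): once an average column has been added, rerunning either line would include that column in the next average. A hedged sketch of one way around this is to select the sales columns by name. The pattern "^day1sales" and the variable sales_cols are assumptions based on the sample column names.

```r
f <- data.frame(
  name        = c("apple", "orange", "banana"),
  day1sales   = c(2, 5, 4),
  day1sales.1 = c(2, 8, 6),
  day1sales.2 = c(2, 15, 24),
  day1sales.3 = c(22, 51, 13),
  day1sales.4 = c(5, 8, 7)
)

# Select only the sales columns, however many there are.
sales_cols <- grepl("^day1sales", names(f))

# na.rm = TRUE skips missing sales figures instead of returning NA for the row.
f$AvgSales <- rowMeans(f[sales_cols], na.rm = TRUE)
f$AvgSales
```

Because sales_cols is computed from the column names, the same two lines keep working after day1sales.5, day1sales.6, and so on are added.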

R data.table reshape chunks of columns at once

Let's say I have a data.table with these columns
nodeID
hour1aaa
hour1bbb
hour1ccc
hour2aaa
hour2bbb
hour2ccc
...
hour24aaa
hour24bbb
hour24ccc
for a total of 72 columns. Let's call it rawtable
I want to reshape it so I have
nodeID
hour
aaa
bbb
ccc
for a total of just these 5 columns
where the hour column will contain whichever hour from the original 72 that it should be.
Let's call it newshape
The way I'm doing it now is to use rbindlist with 24 items where each item is the proper subset of the bigger data.table. Like this (except I'm leaving out most of the hours in my example)
newshape<-rbindlist(list(
rawtable[,list(nodeID, Hour=1, aaa=hour1aaa, bbb=hour1bbb, ccc=hour1ccc)],
rawtable[,list(nodeID, Hour=2, aaa=hour2aaa, bbb=hour2bbb, ccc=hour2ccc)],
rawtable[,list(nodeID, Hour=24, aaa=hour24aaa, bbb=hour24bbb, ccc=hour24ccc)]))
Here is some sample data to play with
rawtable<-data.table(nodeID=c(1,2),hour1aaa=c(12.4,32),hour1bbb=c(61.1,65.33),hour1ccc=c(-4.2,54),hour2aaa=c(12.2,1.2),hour2bbb=c(12.2,5.7),hour2ccc=c(5.6,101.9),hour24aaa=c(45.2,8.5),hour24bbb=c(23,7.9),hour24ccc=c(98,32.3))
Using my rbindlist approach gives the desired result but, as with most things I do with R, there is probably a better way. By better I mean more memory efficient, faster, and/or using fewer lines of code. Does anyone have a better way to achieve this?
This is a classic reshape problem if you get your names in the standard convention it expects, though I'm not sure this really harnesses the efficiency of the data.table structure:
reshape(
setNames(rawtable, gsub("(\\D+)(\\d+)(\\D+)", "\\3.\\2", names(rawtable))),
idvar="nodeID", direction="long", varying=-1, timevar="hour"
)
Result:
nodeID hour aaa bbb ccc
1: 1 1 12.4 61.10 -4.2
2: 2 1 32.0 65.33 54.0
3: 1 2 12.2 12.20 5.6
4: 2 2 1.2 5.70 101.9
5: 1 24 45.2 23.00 98.0
6: 2 24 8.5 7.90 32.3
Arun's answer over here: https://stackoverflow.com/a/15510828/496803 may also be useful if you can adapt it to your current data.
One option is to use merged.stack from my package "splitstackshape". This function stacks groups of columns and then merges the output together. Because of how the function creates the "time" variable, you can specify whatever you want to strip out of the column names. In this case, we want to strip out "hour", "aaa", "bbb", and "ccc" and have just the numbers remaining.
library(splitstackshape)
## Make sure you're using at least 1.2.0
packageVersion("splitstackshape")
# [1] ‘1.2.0’
merged.stack(rawtable, id.vars="nodeID",
var.stubs=c("aaa", "bbb", "ccc"),
sep="hour|aaa|bbb|ccc")
# nodeID .time_1 aaa bbb ccc
# 1: 1 1 12.4 61.10 -4.2
# 2: 1 2 12.2 12.20 5.6
# 3: 1 24 45.2 23.00 98.0
# 4: 2 1 32.0 65.33 54.0
# 5: 2 2 1.2 5.70 101.9
# 6: 2 24 8.5 7.90 32.3
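
Another possibility, sketched here as an assumption and not taken from the answers above, is data.table's own melt with patterns() in measure.vars. melt numbers the column groups 1, 2, 3, so the real hour values are recovered from the column names afterwards. The names long, idx, and hrs are illustrative.

```r
library(data.table)

rawtable <- data.table(nodeID = c(1, 2),
                       hour1aaa = c(12.4, 32), hour1bbb = c(61.1, 65.33), hour1ccc = c(-4.2, 54),
                       hour2aaa = c(12.2, 1.2), hour2bbb = c(12.2, 5.7), hour2ccc = c(5.6, 101.9),
                       hour24aaa = c(45.2, 8.5), hour24bbb = c(23, 7.9), hour24ccc = c(98, 32.3))

# Melt the three column families at once; idx is the position within each family.
long <- melt(rawtable, id.vars = "nodeID",
             measure.vars = patterns("aaa$", "bbb$", "ccc$"),
             value.name = c("aaa", "bbb", "ccc"),
             variable.name = "idx")

# Recover the actual hour numbers from the aaa column names, in column order.
hrs <- as.integer(sub("hour(\\d+)aaa", "\\1", grep("aaa$", names(rawtable), value = TRUE)))
long[, hour := hrs[as.integer(idx)]]
long[, idx := NULL]
setcolorder(long, c("nodeID", "hour", "aaa", "bbb", "ccc"))
long
```

This relies on the aaa, bbb, and ccc columns appearing in the same hour order in rawtable, which holds for the sample data.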
