How to combine datasets without looping where one has multiple values? - r

Given the basic tools I know now (which, order, if, %in%, etc.), I frequently run into one problem that I call "the uniqueness problem".
The problem basically looks like this...
I have a matrix A I want filled out from another raw matrix, B.
A:
[upc] [day1] [day2] ... day52
[1] 123 NA NA NA
[2] 456 NA NA NA
[3] 789 NA NA NA
B is mega huge row wise, so looping is out of the question.
[upc] [quantity] [day]
[1] 123 11 1
[2] 123 2 1
[3] 789 5 1
[4] 456 10 1
[5] 789 6 1
I want to fill up day1 for each UPC in matrix A with the quantities in matrix B. The problem is that there are multiple instances of each UPC in B, and I can't loop over them to get the total quantity to put next to each upc.
So what I WANT is this.. (which would be filled out TOTALLY, i.e. days 2-52 ..by looping over the other days, which is small and thus manageable)
A:
[upc] [day1] [day2] ... day52
[1] 123 13 NA NA
[2] 456 10 NA NA
[3] 789 11 NA NA
Do you know any functions that can accomplish this without looping?

If you convert your original matrices to data.frames, you can use aggregate, merge and reshape to get there:
Make some data including multiple days for the added id of 999:
A <- data.frame(upc=c(123,456,789,999))
B <- data.frame(
upc=c(123,123,789,456,789,999,999,999),
quantity=c(11,2,5,10,6,10,3,3),
day=c(1,1,1,1,1,1,2,2)
)
Aggregate the quantities by id and day, then merge and reshape:
mrgd <- merge(A,aggregate(quantity ~ upc + day ,data=B, sum),by="upc")
final <- reshape(mrgd,idvar="upc",timevar="day",direction="wide",sep="")
names(final) <- gsub("quantity","day",names(final))
Which gives:
final
# upc day1 day2
#1 123 13 NA
#2 456 10 NA
#3 789 11 NA
#4 999 10 6

You can create a matrix A using the tapply function:
> B <- data.frame(
+ upc=c(123,123,789,456,789,999,999,999),
+ quantity=c(11,2,5,10,6,10,3,3),
+ day=c(1,1,1,1,1,1,2,2)
+ )
> tapply( B$quantity, B[,c('upc','day')], FUN=sum )
day
upc 1 2
123 13 NA
456 10 NA
789 11 NA
999 10 6
>
If the B matrix is really huge then you might consider saving it as an ff object (ff package) then using ffrowapply to do it in chunks.
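If the goal is to fill the preallocated matrix A itself rather than build a new table, one possible sketch is to aggregate first and then assign with a two-column index matrix. The names upcs, ndays and tot are illustrative assumptions, not part of the original question; the toy data is the same as above.

```r
# Toy data from the answers above
B <- data.frame(
  upc      = c(123, 123, 789, 456, 789, 999, 999, 999),
  quantity = c(11, 2, 5, 10, 6, 10, 3, 3),
  day      = c(1, 1, 1, 1, 1, 1, 2, 2)
)
upcs  <- c(123, 456, 789, 999)   # assumed: the UPCs wanted as rows of A
ndays <- 52                      # number of day columns, as in the question

# Preallocate A with NA: one row per UPC, one column per day
A <- matrix(NA_real_, nrow = length(upcs), ncol = ndays,
            dimnames = list(as.character(upcs), paste0("day", seq_len(ndays))))

# Total quantity for every (upc, day) pair that occurs in B
tot <- aggregate(quantity ~ upc + day, data = B, FUN = sum)

# Each row of the two-column index matrix addresses one (row, column) cell of A
A[cbind(match(tot$upc, upcs), tot$day)] <- tot$quantity

A[, 1:2]
```

This fills all days in a single assignment, so no loop over days is needed. It assumes every UPC in B also appears in upcs and that day is an integer between 1 and ndays.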

Related

R - enter basic formula

I am new to R and struggling to understand its quirks. I'm trying to do something which should be really simple, but is turning out to be apparently very complicated.
I am used to Excel, SQL and Minitab, where you can enter a value in one column which includes references to other columns and parameters. However, R doesn't seem to be allowing me to do this.
I have a table with (currently) four columns:
Date Pallets Lt Tt
1 28/12/2011 491 NA NA
2 29/12/2011 385 NA 0.787890411
3 30/12/2011 662 NA NA
4 31/12/2011 28 NA NA
5 01/01/2012 46 NA NA
6 02/01/2012 403 NA NA
7 03/01/2012 282 NA NA
8 04/01/2012 315 NA NA
9 05/01/2012 327 NA NA
10 06/01/2012 458 NA NA
and have a parameter "beta", with a value which I have assigned as 0.0002.
All I want to do is assign a formula to rows 3:10 which is:
beta*(Pallets t - Pallets t-1)+(1-beta)*Tt t-1.
I thought that the appropriate code might be:
Table[3:10,4]<-beta*(Table[3:10,"Pallets"]-Table[2:9,"Pallets"])+(1-beta)*Table[2:9,"Tt"]
However, this doesn't work. The first time I enter this formula, it generates:
Date Pallets Lt Tt
1 28/12/2011 491 NA NA
2 29/12/2011 385 NA 0.7878904
3 30/12/2011 662 NA 0.8431328
4 31/12/2011 28 NA NA
5 01/01/2012 46 NA NA
6 02/01/2012 403 NA NA
7 03/01/2012 282 NA NA
8 04/01/2012 315 NA NA
9 05/01/2012 327 NA NA
10 06/01/2012 458 NA NA
So it's generated the correct answer for the second item in the series, but not for any of the subsequent values.
It seems as though R doesn't automatically update each row, and the relationship to each other row, when you enter a formula, as Excel does. Having said that, Excel actually would require me to enter the formula in cell [4,Tt], and then drag this down to all of the other cells. Perhaps R is the same, and there is an equivalent to "dragging down" which I need to do?
Finally, I also noticed that when I change the value of the beta parameter, through, e.g. beta<-0.5, and then print the Table values again, they are unchanged - so the table hasn't updated even though I have changed the value of the parameter.
Appreciate that these are basic questions, but I am very new to R.
In R, the computations are not made "cell by cell", but are vectorised - in your example, R takes the vectors Table[3:10,"Pallets"], Table[2:9,"Pallets"] and Table[2:9,"Tt"] as they are at the moment, computes the resulting vector, and finally assigns it to Table[3:10,4].
If you want to make some computations "cell by cell", you have to use the for loop:
beta <- 0.5
df <- data.frame(v1 = 1:12, v2 = 0)
for (i in 3:10) {
df[i, "v2"] <- beta * (df[i, "v1"] - df[i-1, "v1"]) + (1 - beta) * df[i-1, "v2"]
}
df
v1 v2
1 1 0.0000000
2 2 0.0000000
3 3 0.5000000
4 4 0.7500000
5 5 0.8750000
6 6 0.9375000
7 7 0.9687500
8 8 0.9843750
9 9 0.9921875
10 10 0.9960938
11 11 0.0000000
12 12 0.0000000
As for your second question, R will never update any values on its own (imagine having set manual calculation in Excel). So you need to repeat the computations after changing beta.
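One way to make "repeat the computations" convenient is to wrap the loop in a function and call it again whenever beta changes. This is only a sketch: update_tt is a hypothetical helper name, and the Date and Lt columns from the question are left out for brevity.

```r
# Pallets and the one known Tt value (row 2) from the question
Table <- data.frame(
  Pallets = c(491, 385, 662, 28, 46, 403, 282, 315, 327, 458),
  Tt      = c(NA, 0.787890411, rep(NA, 8))
)

# Hypothetical helper: recompute Tt for rows 3..n from the value in row 2
update_tt <- function(tab, beta) {
  for (i in 3:nrow(tab)) {
    tab$Tt[i] <- beta * (tab$Pallets[i] - tab$Pallets[i - 1]) +
      (1 - beta) * tab$Tt[i - 1]
  }
  tab
}

t_small <- update_tt(Table, beta = 0.0002)  # beta as in the question
t_half  <- update_tt(Table, beta = 0.5)     # re-run after changing beta
t_small$Tt[3]                               # about 0.8431328, as in the question
```

Because the function returns a new data frame, the original Table is left untouched and results for different beta values can be compared side by side.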
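Although it is generally bad design, you can iterate over rows in a loop:
Table$temp <- c(0, diff(Table$Pallets, 1))  # change in Pallets from the previous row
prevTt <- Table$Tt[2]                        # start from the known Tt value in row 2
for (i in 3:10)
{
  # use only the i-th difference, not the whole temp column
  Table$Tt[i] <- Table$temp[i] * beta + (1 - beta) * prevTt
  prevTt <- Table$Tt[i]
}
Table$temp <- NULL                           # drop the helper column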

chaining together sequential observations with only current and immediately prior ID values in R

Say I have some data on traits of individuals measured over time, that looks like this:
present <- c(1:4)
pre.1 <- c(5:8)
pre.2 <- c(9:12)
present2 <- c(13:16)
id <- c(present,pre.1,pre.2,present2)
prev.id <- c(pre.1,pre.2,rep(NA,8))
trait <- rnorm(16,10,3)
d <- data.frame(id,prev.id,trait)
print d:
id prev.id trait
1 1 5 10.693266
2 2 6 12.059654
3 3 7 3.594182
4 4 8 14.411477
5 5 9 10.840814
6 6 10 13.712924
7 7 11 11.258689
8 8 12 10.920899
9 9 NA 14.663039
10 10 NA 5.117289
11 11 NA 8.866973
12 12 NA 15.508879
13 13 NA 14.307738
14 14 NA 15.616640
15 15 NA 10.275843
16 16 NA 12.443139
Every observation has a unique value of id. However, some individuals have been observed in the past, and so I also have an observation of prev.id. This allows me to connect an individual with its current and past values of trait. However, some individuals have been remeasured multiple times. Observations 1-4 have previous IDs of 5-8, and observations 5-8 have previous IDs of 9-12. Observations 9-12 have no previous ID because this is the first time these were measured. Furthermore, observations 13-16 have never been measured before. So, observations 1-4 are unique individuals, observations 5-12 are prior observations of individuals 1-4, and observations 13-16 are another set of unique individuals, distinct from 1-4. I would like to write code to generate a table that has every unique individual, as well as every past observation of that individual's trait. The final output would look like:
id <- c(1:4,13:16)
prev.id <- c(5:8, rep(NA,4))
trait <- d$trait[c(1:4,13:16)]
prev.trait.1 <- d$trait[c(5:8 ,rep(NA,4))]
prev.trait.2 <- d$trait[c(9:12,rep(NA,4))]
output<- data.frame(id,prev.id,trait,prev.trait.1,prev.trait.2)
> output
id prev.id trait prev.trait.1 prev.trait.2
1 1 5 10.693266 10.84081 14.663039
2 2 6 12.059654 13.71292 5.117289
3 3 7 3.594182 11.25869 8.866973
4 4 8 14.411477 10.92090 15.508879
5 13 NA 14.307738 NA NA
6 14 NA 15.616640 NA NA
7 15 NA 10.275843 NA NA
8 16 NA 12.443139 NA NA
I can accomplish this in a straightforward manner, but it requires me coding an additional pairing for each previous observation, such that the number of code groups I need to write is the number of times any individual has been recorded. This is a pain, as in the data set I am applying this problem to, there may be anywhere from 0-100 previous observations of an individual.
#first pairing
d.prev <- data.frame(d$id,d$trait,d$prev.id)
colnames(d.prev) <- c('prev.id','prev.trait.1','prev.id.2')
d <- merge(d,d.prev, by = 'prev.id',all.x=T)
#second pairing
d.prev2 <- data.frame(d$id,d$trait,d$prev.id)
colnames(d.prev2) <- c('prev.id.2','prev.trait.2','prev.id.3')
d<- merge(d,d.prev2,by='prev.id.2',all.x=T)
#remove observations that are another individuals previous observation
d <- d[!(d$id %in% d$prev.id),]
How can I go about doing this in fewer lines, so I don't need 100 code chunks to cover individuals that have been remeasured 100 times?
What you have is a forest of linear lists. We'll start at the terminal ends
roots<-d$id[is.na(d$prev.id)]
And determine the paths backwards
path <- function(node) {
a <- integer(nrow(d))
i <- 0
while(!is.na(node)) {
i <- i+1
a[i] <- node
node <- d$id[match(node,d$prev.id)]
}
return(rev(a[1:i]))
}
Then we can get a 'stacked' representation of your desired output with
x<-do.call(rbind,lapply(roots,
function(r) {p<-path(r); data.frame(id=p[[1]],seq=seq_along(p),traits=d$trait[p])}))
And then use reshape2::dcast to get it in the desired shape
library(reshape2)
dcast(x,id~seq,fill=NA,value.var='traits')
id 1 2 3
1 1 10.693266 10.84081 14.663039
2 2 12.059654 13.71292 5.117289
3 3 3.594182 11.25869 8.866973
4 4 14.411477 10.92090 15.508879
5 13 14.307738 NA NA
6 14 15.616640 NA NA
7 15 10.275843 NA NA
8 16 12.443139 NA NA
I leave it to you to adapt column names.
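As a base-R alternative to the path and dcast steps above, the chain can also be followed for all individuals at once with match, adding one prev.trait column per step back in time. This is a sketch under two assumptions that the question does not state: ids are unique, and the prev.id links contain no cycles. The names out, cur, k and row are illustrative.

```r
set.seed(1)                       # arbitrary seed so the example is repeatable
id      <- 1:16
prev.id <- c(5:12, rep(NA, 8))
trait   <- rnorm(16, 10, 3)
d <- data.frame(id, prev.id, trait)

# Current individuals: ids that are nobody's previous observation
out <- d[!(d$id %in% d$prev.id), c("id", "prev.id", "trait")]

cur <- out$prev.id                # id of the next-older observation, NA if none
k <- 0
while (any(!is.na(cur))) {
  k <- k + 1
  row <- match(cur, d$id)         # row of the older observation (NA if none)
  out[[paste0("prev.trait.", k)]] <- d$trait[row]
  cur <- d$prev.id[row]           # step one observation further back
}
out
```

The loop runs once per level of history (here twice), not once per individual, so 100 previous observations would mean 100 iterations without any extra code.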

Replace values in one data frame from values in another data frame

I need to change individual identifiers that are currently alphabetical to numerical. I have created a data frame where each alphabetical identifier is associated with a number
individuals num.individuals (g4)
1 ZYO 64
2 KAO 24
3 MKU 32
4 SAG 42
What I need is to replace ZYO with the number 64 in my main data frame (g3), and likewise for all the other codes.
My main data frame (g3) looks like this
SAG YOG GOG BES ATR ALI COC CEL DUN EVA END GAR HAR HUX ISH INO JUL
1 2
2 2 EVA
3 SAG 2 EVA
4 2
5 SAG 2
6 2
Now on a small scale I can write a code to change it like I did with ATR
g3$ATR <- as.character(g3$ATR)
g3[g3$target == "ATR" | g3$ATR == "ATR","ATR"] <- 2
But this is time-consuming and increases the chance of human error.
I know there are ways to do this on a broad scale with NAs
I think maybe we could do a for loop for this, but I am not good enough to write one myself.
I have also been trying to use this function which I feel like may work but I am not sure how to logically build this argument, it was posted on the questions board here
Fast replacing values in dataframe in R
df <- as.data.frame(lapply(df, function(x){replace(x, x <0,0)}))
I have tried to work my data into this by
df <- as.data.frame(lapply(g4, function(g3){replace(x, x <0,0)}))
Here is one approach using the data.table package:
First, create a reproducible example similar to your data:
require(data.table)
ref <- data.table(individuals=1:4,num.individuals=c("ZYO","KAO","MKU","SAG"),g4=c(64,24,32,42))
g3 <- data.table(SAG=c("","SAG","","SAG"),KAO=c("KAO","KAO","",""))
Here is the ref table:
individuals num.individuals g4
1: 1 ZYO 64
2: 2 KAO 24
3: 3 MKU 32
4: 4 SAG 42
And here is your g3 table:
SAG KAO
1: KAO
2: SAG KAO
3:
4: SAG
And now we do our find and replacing:
g3[ , lapply(.SD,function(x) ref$g4[chmatch(x,ref$num.individuals)])]
And the final result:
SAG KAO
1: NA 24
2: 42 24
3: NA NA
4: 42 NA
And if you need more speed, the fastmatch package might help with their fmatch function:
require(fastmatch)
g3[ , lapply(.SD,function(x) ref$g4[fmatch(x,ref$num.individuals)])]
SAG KAO
1: NA 24
2: 42 24
3: NA NA
4: 42 NA
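The same lookup can be sketched in base R with match, which may be closer to the lapply attempt in the question. The layout of g4 below (a code column named individuals and a number column named num.individuals) is my reading of the table in the question and should be treated as an assumption; g3 is the small stand-in used in the answer above.

```r
# Lookup table: alphabetical code -> numeric identifier (assumed layout of g4)
g4 <- data.frame(
  individuals     = c("ZYO", "KAO", "MKU", "SAG"),
  num.individuals = c(64, 24, 32, 42),
  stringsAsFactors = FALSE
)

# Small stand-in for the main data frame
g3 <- data.frame(
  SAG = c("", "SAG", "", "SAG"),
  KAO = c("KAO", "KAO", "", ""),
  stringsAsFactors = FALSE
)

# For every column, look each code up in g4 and return its number;
# g3[] keeps g3 a data frame with the same column names
g3[] <- lapply(g3, function(x) g4$num.individuals[match(x, g4$individuals)])
g3
```

Values with no entry in the lookup table, including the empty strings, become NA, as in the data.table result above.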

Adding a column to a ragged array matrix by grouping variables

In R:
I am not sure what the proper title for this question is, so maybe someone can help me out. It would be greatly appreciated. I'm sorry if this is called something easily searchable.
So I have a ragged array matrix (multiple UPCS)
[upc] [quantity] [day] [daysum]
[1] 123 11 1 NA
[2] 123 2 1 NA
[3] 789 5 1 NA
[4] 456 10 2 NA
[5] 789 6 2 NA
I want the matrix to be summed by UPC for each day, for example:
[upc] [quantity] [day] [daysum]
[1] 123 11 1 13
[2] 123 2 1 13
[3] 789 5 1 5
[4] 456 10 2 10
[5] 789 6 2 6
Thank you for your time and help.
You have not described what is supposed to happen with the "clean matrix" but the code to create a "column" from your larger matrix suitable for binding to it on a row-aligned basis is quite simple:
B <- cbind(B, daysum=ave(B[, 'quantity'], # analysis variable
B[, 'upc'], B[ , 'day'], # strata variables
FUN=sum) ) # function applied in strata
This of course assumes that B really has the column names as indicated. Should also work if it is actually a dataframe, although the output does not suggest that you actually have R objects yet. The ave function will replicate the sums for all the rows with the same stratification variables.
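For reference, a minimal self-contained run on the data from the question, stored here as a data frame (an assumption, since the question does not show how B was created):

```r
# Data from the question; the empty daysum column is omitted
B <- data.frame(
  upc      = c(123, 123, 789, 456, 789),
  quantity = c(11, 2, 5, 10, 6),
  day      = c(1, 1, 1, 2, 2)
)

# ave() returns one value per row: the sum within that row's (upc, day) group
B$daysum <- ave(B$quantity, B$upc, B$day, FUN = sum)
B
```

The daysum column comes out as 13, 13, 5, 10, 6, matching the desired table in the question.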

recoding using R

I have a data set with dam, sire, plus other variables, but I need to recode my dam and sire IDs. The dam column is sorted and each animal appears only once. On the other hand, the sire column is unsorted and some animals appear more than once.
I would like to start my numbering of dams from 50,000 such that the first animal will get 50001, second animal 50002 and so on. I have this script that numbers each dam from 1 to N and wondering if it can be modified to begin from 50,000.
mydf$dam2 <- as.numeric(factor(paste(mydf$dam,sep="")))
*EDITED
my data set is similar to this but more variables
dam <- c("1M521","1M584","1M790","1M871","1M888","1M933")
sire <- c("1X057","1T456","1W865","1W209","1W209","1W648")
wt <- c(369,300,332,351,303,314)
p2 <- c(NA,16,18,NA,NA,15)
mydf <- data.frame(dam,sire,wt,p2)
For the sire column, I would like to start numbering from 10,000.
Any help would be very much appreciated.
Baz
At the moment, those sire and dam columns are factor variables (data.frame() converts strings to factors by default in R versions before 4.0.0), and in this case that means you can just add the as.numeric() results to your base number:
> mydf$dam_n <- 50000 +as.numeric(mydf$dam)
> mydf$sire_n <- 10000 +as.numeric(mydf$sire)
> mydf
dam sire wt p2 dam_n sire_n
1 1M521 1X057 369 NA 50001 10005
2 1M584 1T456 300 16 50002 10001
3 1M790 1W865 332 18 50003 10004
4 1M871 1W209 351 NA 50004 10002
5 1M888 1W209 303 NA 50005 10002
6 1M933 1W648 314 15 50006 10003
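From R 4.0.0 onwards, data.frame() keeps strings as character columns by default, so as.numeric(mydf$dam) would return NA there. A sketch that should behave the same in either case is to convert explicitly with factor() first; only the dam and sire columns from the question are used here.

```r
dam  <- c("1M521", "1M584", "1M790", "1M871", "1M888", "1M933")
sire <- c("1X057", "1T456", "1W865", "1W209", "1W209", "1W648")
mydf <- data.frame(dam, sire)

# factor() assigns integer codes in sorted order of the unique IDs,
# so repeated sires share one code
mydf$dam_n  <- 50000 + as.integer(factor(mydf$dam))
mydf$sire_n <- 10000 + as.integer(factor(mydf$sire))
mydf
```

Note that the codes follow the sorted order of the IDs, not their order of appearance, which is why the first sire (1X057) gets 10005.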
Why not use:
names(mydf$dam2) <- 50000:whatEverYourLengthIs
I am not sure if I understood your data structures completely, but usually the names function is used to set names.
EDIT:
You can use dimnames to name columns and rows.
Like:
[,1] [,2]
a 1 2
b 4 5
c 7 8
and
dimnames(mymatrix) <- list(c("Jan", "Feb", "Mar"), c("2005", "2006"))
yields
2005 2006
Jan 1 2
Feb 4 5
Mar 7 8
