Adding a column to a ragged array matrix by grouping variables - r

In R:
I am not sure what the proper title for this question is, so maybe someone can help me out. It would be greatly appreciated. I'm sorry if this is called something easily searchable.
So I have a ragged array matrix (multiple UPCS)
[upc] [quantity] [day] [daysum]
[1] 123 11 1 NA
[2] 123 2 1 NA
[3] 789 5 1 NA
[4] 456 10 2 NA
[5] 789 6 2 NA
I want the matrix to be summed by UPC for each day, for example:
[upc] [quantity] [day] [daysum]
[1] 123 11 1 13
[2] 123 2 1 13
[3] 789 5 1 5
[4] 456 10 2 10
[5] 789 6 2 6
Thank you for your time and help.

You have not described what is supposed to happen with the "clean matrix" but the code to create a "column" from your larger matrix suitable for binding to it on a row-aligned basis is quite simple:
B <- cbind(B, daysum=ave(B[, 'quantity'], # analysis variable
B[, 'upc'], B[ , 'day'], # strata variables
FUN=sum) ) # function applied in strata
This of course assumes that B really has the column names as indicated. Should also work if it is actually a dataframe, although the output does not suggest that you actually have R objects yet. The ave function will replicate the sums for all the rows with the same stratification variables.

Related

Is there an R function to redefine a variable so I can use the spread function?

I'm new with R and I have the following problem. Maybe it's a really easy question, but I don't know the terms to search for an answer.
My problem:
I have several persons, each person is assigned a studynumber (SN). And each SN has one or more tests being performed, the test can have multiple results.
My data is long at the moment, but I need it to be wide (one row for each SN).
For example:
What I have:
SN testnumbers result
1 1 1234 6
2 1 1234 9
3 2 4567 6
4 3 5678 9
5 3 8790 9
What I want:
SN test1result1 test1result2 test2result1
1 1 6 6 NA
2 2 6 NA NA
3 3 9 NA 9
So I need to renumber the testnumbers into test 1 etc for each SN, in order to use the spread function, I think. But I don't know how.
I did manage to renumber testnumber into a list of 1 till the last unique testnumber, but still the wide dataframe looks awful.

Factors created when subsetting data frame

When using subset on a data frame, my resulting data frame has some odd behavior.
df is the subset of a larger data frame
>df
buy_sell_count trt sector
1 1 0.023957 Apartment
2 1 0.026739 Strip Center
3 1 0.0705979999999999 Mall
4 1 0.0595650000000001 Office
5 1 0.0290539999999999 Industrial
I've tried the various drop-level practices shown in this question, but none have worked.
When i do mean(df$trt) I get a argument is not numeric or logical: returning NA
When i do as.numeric(df$trt) I get
[1] 8 9 12 11 10 1 4 6 3 5 7 2
I think it has to do with the levels:
df$trt produces
[1] 0.023957 0.026739 0.0705979999999999 0.0595650000000001 0.0290539999999999
[6] -0.01607 -0.188538 0.00279700000000016 -0.022502 0.00178300000000009
[11] 0.00770099999999996 -0.0191330000000001
12 Levels: -0.01607 -0.0191330000000001 -0.022502 -0.188538 0.00178300000000009 ... 0.0705979999999999

Printing only certain panels in R lattice

I am plotting a quantile-quantile plot for a certain data that I have. I would like to print only certain panels that satisfy a condition that I put in for panel.qq(x,y,...).
Let me give you an example. The following is my code,
qq(y ~ x|cond,data=test.df,panel=function(x,y,subscripts,...){
if(length(unique(test.df[subscripts,2])) > 3 ){panel.qq(x,y,subscripts,...})})
Here y is the factor and x is the variable that will be plotted on X and y axis. Cond is the conditioning variable. What I would like is, only those panels be printed that pass the condition in the panel function, which is
if(length(unique(test.df[subscripts,2])) > 3).
I hope this information helps. Thanks in advance.
Added Sample data,
y x cond
1 1 6 125
2 2 5 125
3 1 5 125
4 2 6 125
5 1 3 125
6 2 8 125
7 1 8 125
8 2 3 125
9 1 5 125
10 2 6 125
11 1 5 124
12 2 6 124
13 1 6 124
14 2 5 124
15 1 5 124
16 2 6 124
17 1 4 124
18 2 7 124
19 1 0 123
20 2 11 123
21 1 0 123
22 2 11 123
23 1 0 123
24 2 11 123
25 1 0 123
26 2 11 123
27 1 0 123
28 2 2 123
So this is the sample data. What I would like is to not have a panel for 123 as the number of unique values for 123 is 3, while for others its 4. Thanks again.
Yeah, I think it is a subset problem, not a lattice one. You don't include an example, but it looks like you want to keep only rows where there are more than 3 rows for each value of whatever is in column 2 of your data frame. If so, here is a data.table solution.
library(data.table)
test.dt <- as.data.table(test.df)
test.dt.subset <- test.dt[,N:=.N,by=c2][N>3]
Where c2 is that variable in the second column. The last line of code first adds a variable, N, for the count of rows (.N) for each value of c2, then subsets for N>3.
UPDATE: And since a data table is also a data frame, you can use test.dt.subset directly as the data source in the call to qq (or other lattice function).
UPDATE 2: Here is one way to do the same thing without data.table:
d <- data.frame(x=1:15,y=1:15%%2, # example data frame
c2=c(1,2,2,3,3,3,4,4,4,4,5,5,5,5,5))
d$N <- 1 # create a column for count
split(d$N,d$c2) <- lapply(split(d$x,d$c2),length) # populate with count
d
d[d$N>3,] # subset
I did something very similar to DaveTurek.
My sample dataframe above is test.df
test.df.list <- split(test.df,test.df$cond,drop=F)
final.test.df <- do.call("rbind",lapply(test.df.list,function(r){
if(length(unique(r$x)) > 3){r}})
So, here I am breaking the test.df as a list of data.frames by the conditioning variable. Next, in the lapply I am checking the number of unique values in each of subset dataframe. If this number is greater than 3 then the dataframe is given /taken back if not it is ignored. Next, a do.call to bind all the dfs back to one big df to run the quantile quantile plot on it.
In case anyone wants to know the qq function call after getting the specific data. then it is,
trellis.device(postscript,file="test.ps",color=F,horizontal=T,paper='legal')
qq(y ~ x|cond,data=final.test.df,layout=c(1,1),pch=".",cex=3)
dev.off()
Hope this helps.

Replace values in one data frame from values in another data frame

I need to change individual identifiers that are currently alphabetical to numerical. I have created a data frame where each alphabetical identifier is associated with a number
individuals num.individuals (g4)
1 ZYO 64
2 KAO 24
3 MKU 32
4 SAG 42
What I need to replace ZYO with the number 64 in my main data frame (g3) and like wise for all the other codes.
My main data frame (g3) looks like this
SAG YOG GOG BES ATR ALI COC CEL DUN EVA END GAR HAR HUX ISH INO JUL
1 2
2 2 EVA
3 SAG 2 EVA
4 2
5 SAG 2
6 2
Now on a small scale I can write a code to change it like I did with ATR
g3$ATR <- as.character(g3$ATR)
g3[g3$target == "ATR" | g3$ATR == "ATR","ATR"] <- 2
But this is time consuming and increased chance of human error.
I know there are ways to do this on a broad scale with NAs
I think maybe we could do a for loop for this, but I am not good enough to write one myself.
I have also been trying to use this function which I feel like may work but I am not sure how to logically build this argument, it was posted on the questions board here
Fast replacing values in dataframe in R
df <- as.data.frame(lapply(df, function(x){replace(x, x <0,0)})
I have tried to work my data into this by
df <- as.data.frame(lapply(g4, function(g3){replace(x, x <0,0)})
Here is one approach using the data.table package:
First, create a reproducible example similar to your data:
require(data.table)
ref <- data.table(individuals=1:4,num.individuals=c("ZYO","KAO","MKU","SAG"),g4=c(64,24,32,42))
g3 <- data.table(SAG=c("","SAG","","SAG"),KAO=c("KAO","KAO","",""))
Here is the ref table:
individuals num.individuals g4
1: 1 ZYO 64
2: 2 KAO 24
3: 3 MKU 32
4: 4 SAG 42
And here is your g3 table:
SAG KAO
1: KAO
2: SAG KAO
3:
4: SAG
And now we do our find and replacing:
g3[ , lapply(.SD,function(x) ref$g4[chmatch(x,ref$num.individuals)])]
And the final result:
SAG KAO
1: NA 24
2: 42 24
3: NA NA
4: 42 NA
And if you need more speed, the fastmatch package might help with their fmatch function:
require(fastmatch)
g3[ , lapply(.SD,function(x) ref$g4[fmatch(x,ref$num.individuals)])]
SAG KAO
1: NA 24
2: 42 24
3: NA NA
4: 42 NA

How to combine datasets without looping where one has multiple values?

Given the basic tools I know now (which, order, if, %in%, order, etc..), I am running frequently into one problem I call "the uniqueness problem".
The problem basically looks like this...
I have a matrix A I want filled out from another raw matrix, B.
A:
[upc] [day1] [day2] ... day52
[1] 123 NA NA NA
[2] 456 NA NA NA
[3] 789 NA NA NA
B is mega huge row wise, so looping is out of the question.
[upc] [quantity] [day]
[1] 123 11 1
[2] 123 2 1
[3] 789 5 1
[4] 456 10 1
[5] 789 6 1
I want to fill up day1 for each UPC in matrix A with the quantities in matrix B. The problem is that there are multiple instances of each UPC in B, and I can't loop over them to get the total quantity to put next to each upc.
So what I WANT is this.. (which would be filled out TOTALLY, i.e. days 2-52 ..by looping over the other days, which is small and thus manageable)
A:
[upc] [day1] [day2] ... day52
[1] 123 13 NA NA
[2] 456 10 NA NA
[3] 789 11 NA NA
Do you know any functions that can accomplish this without looping?
If you convert your original matrices to data.frames, you can employ aggregate,merge and reshape to get there:
Make some data including multiple days for the added id of 999:
A <- data.frame(upc=c(123,456,789,999))
B <- data.frame(
upc=c(123,123,789,456,789,999,999,999),
quantity=c(11,2,5,10,6,10,3,3),
day=c(1,1,1,1,1,1,2,2)
)
Aggregate the quantities by id and day, then merge and reshape:
mrgd <- merge(A,aggregate(quantity ~ upc + day ,data=B, sum),by="upc")
final <- reshape(mrgd,idvar="upc",timevar="day",direction="wide",sep="")
names(final) <- gsub("quantity","day",names(final))
Which gives:
final
# upc day1 day2
#1 123 13 NA
#2 456 10 NA
#3 789 11 NA
#4 999 10 6
You can create a matrix A using the tapply function:
> B <- data.frame(
+ upc=c(123,123,789,456,789,999,999,999),
+ quantity=c(11,2,5,10,6,10,3,3),
+ day=c(1,1,1,1,1,1,2,2)
+ )
> tapply( B$quantity, B[,c('upc','day')], FUN=sum )
day
upc 1 2
123 13 NA
456 10 NA
789 11 NA
999 10 6
>
If the B matrix is really huge then you might consider saving it as an ff object (ff package) then using ffrowapply to do it in chunks.

Resources