Create additional rows and transpose in R - r

Im handling some stock order data and am having a problem with what I suspect needs a transpose. The data frame lists the qty for each supply location across the row for each customer for each item but I need it to have a separate row for each supply location
What I have looks like this - Each of the numbered columns is a supply location
1. Customer Cust.location Product 116 117 41 25 81 Total.Order
2. ABC Tap 123 5 3 0 2 1 11
3. ABC Tap 456 0 1 4 0 2 7
4. DEF Kar 123 1 0 0 3 4 8
What I need is
1. Customer Cust.Location Product Source Total
2. ABC Tap 123 116 5
3. ABC Tap 123 117 3
4. ABC Tap 123 25 2
5. ABC Tap 123 81 1
6. ABC Tap 456 117 1
7. ABC Tap 456 41 4
8. ABC Tap 456 81 2
9. DEF Kar 123 116 1
10.DEF Kar 123 25 3
11.DEF Kar 123 81 4
Sorry abou the poor layout - first time post here.
Not worried too much about handling 0 qty lines so if you have a solution that retains them it doesn't matter

This is classic reshaping from wide to long format. The melt function from the reshape2 package is how I prefer to do it, though you can use the reshape function in base R. If your data.frame is dat:
library(reshape2)
dat.m <- melt(dat[,-9], id.vars= c("Customer", "Cust.location", "Product"),
variable.name="Source", value.name="Total")
I've removed the Total.Order column (dat[,-9]) because it seems you don't need that.
# Customer Cust.location Product Source Total
# 1 ABC Tap 123 116 5
# 2 ABC Tap 456 116 0
# 3 DEF Kar 123 116 1
# 4 ABC Tap 123 117 3
# 5 ABC Tap 456 117 1
# 6 DEF Kar 123 117 0
# 7 ABC Tap 123 41 0
# 8 ABC Tap 456 41 4
# 9 DEF Kar 123 41 0
# 10 ABC Tap 123 25 2
# 11 ABC Tap 456 25 0
# 12 DEF Kar 123 25 3
# 13 ABC Tap 123 81 1
# 14 ABC Tap 456 81 2
# 15 DEF Kar 123 81 4

Base reshape method that #alexwhan alludes to is very similar:
dat <- read.table(text="Customer Cust.location Product 116 117 41 25 81 Total.Order
ABC Tap 123 5 3 0 2 1 11
ABC Tap 456 0 1 4 0 2 7
DEF Kar 123 1 0 0 3 4 8",header=TRUE)
reshape(
dat[,-9],
idvar=c("Customer","Cust.location", "Product"),
varying=4:8,
v.names="Total",
timevar="Source",
times=names(dat[4:8]),
direction="long"
)

Related

R Lag value of previous calculated function

I am trying to use a lag value of a previous row, which needs to be calculated from the previous row (unless its first entry).
I was trying something similar to:
test<-data.frame(account_id=c(123,123,123,123,444,444,444,444),entry=c(1,2,3,4,1,2,3,4),beginning_balance=c(100,0,0,0,200,0,0,0),
deposit=c(10,20,5,8,10,12,20,4),running_balance=c(0,0,0,0,0,0,0,0))
test2<-test %>%
group_by(account_id) %>%
mutate(running_balance = if_else(entry==1, beginning_balance+deposit,
lag(running_balance)+deposit))
print(test2)
the running balance should be 110,130,135,143,210,222,242,246
For each account_id you can add first beginning_balance with cumulative sum of deposit.
library(dplyr)
test %>%
group_by(account_id) %>%
mutate(running_balance = first(beginning_balance) + cumsum(deposit))
# account_id entry beginning_balance deposit running_balance
# <dbl> <dbl> <dbl> <dbl> <dbl>
#1 123 1 100 10 110
#2 123 2 0 20 130
#3 123 3 0 5 135
#4 123 4 0 8 143
#5 444 1 200 10 210
#6 444 2 0 12 222
#7 444 3 0 20 242
#8 444 4 0 4 246
Same thing using data.table :
library(data.table)
setDT(test)[, running_balance := first(beginning_balance) + cumsum(deposit), account_id]
Using for-loops for each unique account_id and adding cumulative sum for each id.
for ( i in unique (test$account_id)) {
test$running_balance [test$account_id == i] <- cumsum(test$beginning_balance[test$account_id == i]+test$deposit[test$account_id == i])
}
print (test)
account_id entry beginning_balance deposit running_balance
1 123 1 100 10 110
2 123 2 0 20 130
3 123 3 0 5 135
4 123 4 0 8 143
5 444 1 200 10 210
6 444 2 0 12 222
7 444 3 0 20 242
8 444 4 0 4 246

Create new table based on two existing tables with specific criteria

this is probably stupid, but i have the following problem:
I have two tables:
1)Table with therapies on a specific patient with beginning and ending date:
therapyID patientID startoftherapy endoftherapy
1 1 233 5.5.10 6.6.11
2 2 233 7.7.11 8.8.11
3 3 344 1.1.09 3.2.10
4 4 344 3.3.10 10.10.11
5 5 544 2.1.09 3.2.10
6 6 544 4.3.12 4.3.14
7 7 113 1.1.12 1.1.15
8 8 123 2.1.13 1.1.15
9 9 543 2.1.09 3.2.10
10 10 533 7.7.11 8.8.14
2)Table with many diagnoses, the specific patient and date and description:
diagnosisID dateofdiagnosis patientID diagnosis
1 11 8.8.10 233 xxx
2 22 5.10.11 233 yyy
3 33 8.9.11 233 xxx
4 44 2.2.09 344 zzz
5 55 3.3.09 344 yyy
6 666 2.2.12 123 zzz
7 777 3.3.12 123 yyy
8 555 3.2.10 543 xxx
9 203 8.8.12 533 zzz
I want to create a new table, with the diagnoses of the patieents in the time of their therapy, i.e. with the matching criteria: patientID, date between startoftherapy and endoftherapy. Something like this:
therapyID diagnosisID patientID dateofdiagnosis diagnosis
1 1 11 233 08.08.10 xxx
2 2 22 233 05.10.11 yyy
3 2 33 233 08.09.11 xxx
I´m way to unexperienced to do this, can anyone help me with this or point me in the right direction?
We can do it with `plyr:
# We recreate your data.frames
df1 <- read.table(text="
therapyID patientID startoftherapy endoftherapy
1 1 233 5.5.10 6.6.11
2 2 233 7.7.11 8.8.11
3 3 344 1.1.09 3.2.10
4 4 344 3.3.10 10.10.11
5 5 544 2.1.09 3.2.10
6 6 544 4.3.12 4.3.14
7 7 113 1.1.12 1.1.15
8 8 123 2.1.13 1.1.15
9 9 543 2.1.09 3.2.10
10 10 533 7.7.11 8.8.14", h=T)
df2 <- read.table(text="
diagnosisID dateofdiagnosis patientID diagnosis
1 11 8.8.10 233 xxx
2 22 5.10.11 233 yyy
3 33 8.9.11 233 xxx
4 44 2.2.09 344 zzz
5 55 3.3.09 344 yyy
6 666 2.2.12 123 zzz
7 777 3.3.12 123 yyy
8 555 3.2.10 543 xxx
9 203 8.8.12 533 zzz", h=T)
We load dplyr ; install.packages("dplyr") if you don't have it.
library(dplyr)
Then we left_join by patientID. A graphical definition (and more) can be found here. Then we just rearrange column order.
# we first left_join
left_join(df1, df2, "patientID") %>%
select(therapyID,diagnosisID,patientID, dateofdiagnosis, diagnosis) %>%
arrange(therapyID)
We obtain:
therapyID diagnosisID patientID dateofdiagnosis diagnosis
1 1 11 233 8.8.10 xxx
2 1 22 233 5.10.11 yyy
3 1 33 233 8.9.11 xxx
4 2 11 233 8.8.10 xxx
The output may be different from the one you provided because of row order. It can be changed with arrange. Is this what you want?
EDIT
I want to sort out cases where date of diagnosis did not happened during the therapy
Then you first need to properly convert time column to date format. This function does the job for your format:
ch2date <- function(x) as.Date(x, format="%d.%m.%y")
We can include it to the pipe and then use these columns for filtering:
left_join(df1, df2, "patientID") %>%
mutate(startoftherapy = ch2date(startoftherapy),
endoftherapy = ch2date(endoftherapy),
dateofdiagnosis = ch2date(dateofdiagnosis)) %>%
filter(startoftherapy < dateofdiagnosis, dateofdiagnosis < endoftherapy) %>%
select(therapyID, diagnosisID, patientID, dateofdiagnosis, diagnosis) %>%
arrange(therapyID)
Does it solve your problem?

Indexing subgroups by sorted positions in R dataframe

I have a dataframe which contains information about several categories, and some associated variables. It is of the form:
ID category sales score
227 A 109 21
131 A 410 24
131 A 509 1
123 B 2 61
545 B 19 5
234 C 439 328
654 C 765 41
What I would like to do is be able to introduce two new columns, salesRank and scoreRank, where I find the item index per category, had they been ordered by sales and score, respectively. I can solve the general case like this:
dF <- dF[order(-dF$sales),]
dF$salesRank<-seq.int(nrow(dF))
but this doesn't account for the categories and so far I've only solved this by breaking up the dataframe. What I want would result in the following:
ID category sales score salesRank scoreRank
227 A 109 21 3 2
131 A 410 24 2 1
131 A 509 1 1 3
123 B 2 61 2 1
545 B 19 5 1 2
234 C 439 328 2 1
654 C 765 41 1 2
Many thanks!
Try:
library(dplyr)
df %>%
group_by(category) %>%
mutate(salesRank = row_number(desc(sales)),
scoreRank = row_number(desc(score)))
Which gives:
#Source: local data frame [7 x 6]
#Groups: category
#
# ID category sales score salesRank scoreRank
#1 227 A 109 21 3 2
#2 131 A 410 24 2 1
#3 131 A 509 1 1 3
#4 123 B 2 61 2 1
#5 545 B 19 5 1 2
#6 234 C 439 328 2 1
#7 654 C 765 41 1 2
From the help:
row_number(): equivalent to rank(ties.method = "first")
min_rank(): equivalent to rank(ties.method = "min")
desc(): transform a vector into a format that will be sorted in descending
order.
As #thelatemail pointed out, for this particular dataset you might want to use min_rank() instead of row_number() which will account for ties in sales/score more appropriately:
> row_number(c(1,2,2,4))
#[1] 1 2 3 4
> min_rank(c(1,2,2,4))
#[1] 1 2 2 4
Use ave in base R with rank (the - is to reverse the rankings from low-to-high to high-to-low):
dF$salesRank <- with(dF, ave(-sales, category, FUN=rank) )
#[1] 3 2 1 2 1 2 1
dF$scoreRank <- with(dF, ave(-score, category, FUN=rank) )
#[1] 2 1 3 1 2 1 2
I have just a base R solution with tapply.
salesRank <- tapply(dat$sales, dat$category, order, decreasing = T)
scoreRank <- tapply(dat$score, dat$category, order, decreasing = T)
cbind(dat, salesRank = unlist(salesRank), scoreRank= unlist(scoreRank))
ID category sales score salesRank scoreRank
A1 227 A 109 21 3 2
A2 131 A 410 24 2 1
A3 131 A 509 1 1 3
B1 123 B 2 61 2 1
B2 545 B 19 5 1 2
C1 234 C 439 328 2 1
C2 654 C 765 41 1 2

Rank function to rank multiple variables in R

I am trying to rank multiple numeric variables ( around 700+ variables) in the data and am not sure exactly how to do this as I am still pretty new to using R.
I do not want to overwrite the ranked values in the same variable and hence need to create a new rank variable for each of these numeric variables.
From reading the posts, I believe assign and transform function along with rank maybe able to solve this. I tried implementing as below ( sample data and code) and am struggling to get it to work.
The output dataset in addition to variables xcount, xvisit, ysales need to be populated
With variables xcount_rank, xvisit_rank, ysales_rank containing the ranked values.
input <- read.table(header=F, text="101 2 5 6
102 3 4 7
103 9 12 15")
colnames(input) <- c("id","xcount","xvisit","ysales")
input1 <- input[,2:4] #need to rank the numeric variables besides id
for (i in 1:3)
{
transform(input1,
assign(paste(input1[,i],"rank",sep="_")) =
FUN = rank(-input1[,i], ties.method = "first"))
}
input[paste(names(input)[2:4], "rank", sep = "_")] <-
lapply(input[2:4], cut, breaks = 10)
The problem with this approach is that it's creating the rank values as (101, 230] , (230, 450] etc whereas I would like to see the values in the rank variable to be populated as 1, 2 etc up to 10 categories as per the splits I did. Is there any way to achieve this? input[5:7] <- lapply(input[5:7], rank, ties.method = "first")
The approach I tried from the solutions provided below is:
input <- read.table(header=F, text="101 20 5 6
102 2 4 7
103 9 12 15
104 100 8 7
105 450 12 65
109 25 28 145
112 854 56 93")
colnames(input) <- c("id","xcount","xvisit","ysales")
input[paste(names(input)[2:4], "rank", sep = "_")] <-
lapply(input[2:4], cut, breaks = 3)
Current output I get is:
id xcount xvisit ysales xcount_rank xvisit_rank ysales_rank
1 101 20 5 6 (1.15,286] (3.95,21.3] (5.86,52.3]
2 102 2 4 7 (1.15,286] (3.95,21.3] (5.86,52.3]
3 103 9 12 15 (1.15,286] (3.95,21.3] (5.86,52.3]
4 104 100 8 7 (1.15,286] (3.95,21.3] (5.86,52.3]
5 105 450 12 65 (286,570] (3.95,21.3] (52.3,98.7]
6 109 25 28 145 (1.15,286] (21.3,38.7] (98.7,145]
7 112 854 56 93 (570,855] (38.7,56.1] (52.3,98.7]
Desired output:
id xcount xvisit ysales xcount_rank xvisit_rank ysales_rank
1 101 20 5 6 1 1 1
2 102 2 4 7 1 1 1
3 103 9 12 15 1 1 1
4 104 100 8 7 1 1 1
5 105 450 12 65 2 1 2
6 109 25 28 145 1 2 3
Would like to see the records in the group they would fall under if I try to rank the interval values.
Using dplyr
library(dplyr)
nm1 <- paste("rank", names(input)[2:4], sep="_")
input[nm1] <- mutate_each(input[2:4],funs(rank(., ties.method="first")))
input
# id xcount xvisit ysales rank_xcount rank_xvisit rank_ysales
#1 101 2 5 6 1 2 1
#2 102 3 4 7 2 1 2
#3 103 9 12 15 3 3 3
Update
Based on the new input and using cut
input[nm1] <- mutate_each(input[2:4], funs(cut(., breaks=3, labels=FALSE)))
input
# id xcount xvisit ysales rank_xcount rank_xvisit rank_ysales
#1 101 20 5 6 1 1 1
#2 102 2 4 7 1 1 1
#3 103 9 12 15 1 1 1
#4 104 100 8 7 1 1 1
#5 105 450 12 65 2 1 2
#6 109 25 28 145 1 2 3
#7 112 854 56 93 3 3 2

Allow a maximum number of entries when certain conditions apply

I have a dataset with a lot of entries. Each of these entries belongs to a certain ID (belongID), the entries are unique (with uniqID), but multiple entries can come from the same source (sourceID). It is also possible that multiple entries from the same source have a the same belongID. For the purposes of the research I need to do on the dataset I have to get rid of the entries of a single sourceID that occur more than 5 times for 1 belongID. The maximum of 5 entries that need to be kept are the ones with the highest 'Time' value.
To illustrate this I have the following example dataset:
belongID sourceID uniqID Time
1 1001 101 5
1 1002 102 5
1 1001 103 4
1 1001 104 3
1 1001 105 3
1 1005 106 2
1 1001 107 2
1 1001 108 2
2 1005 109 5
2 1006 110 5
2 1005 111 5
2 1006 112 5
2 1005 113 5
2 1006 114 4
2 1005 115 4
2 1006 116 3
2 1005 117 3
2 1006 118 3
2 1005 119 2
2 1006 120 2
2 1005 121 1
2 1007 122 1
3 1010 123 5
3 1480 124 2
The example in the end should look like this:
belongID sourceID uniqID Time
1 1001 101 5
1 1002 102 5
1 1001 103 4
1 1001 104 3
1 1001 105 3
1 1005 106 2
1 1001 107 2
2 1005 109 5
2 1006 110 5
2 1005 111 5
2 1006 112 5
2 1005 113 5
2 1006 114 4
2 1005 115 4
2 1006 116 3
2 1005 117 3
2 1006 118 3
2 1007 122 1
3 1010 123 5
3 1480 124 2
There are a lot more columns with data entries in the file, but the selection has to be purely based on time. As shown in the example it can also occur that the 5th and 6th entry of a sourceID with the same belongID have the same time. In this case only 1 has to be chosen, because max=5.
The dataset here is nicely ordered on belongID and time for illustrative purposes, but in the real dataset this is not the case. Any idea how to tackle this problem? I have not come across something similar yet..
if dat is your dataframe:
do.call(rbind,
by(dat, INDICES=list(dat$belongID, dat$sourceID),
FUN=function(x) head(x[order(x$Time, decreasing=TRUE), ], 5)))
Say your data is in df. The ordered (by uniqID) output is obtained after this:
tab <- tapply(df$Time, list(df$belongID, df$sourceID), length)
bIDs <- rownames(tab)
sIDs <- colnames(tab)
for(i in bIDs)
{
if(all(is.na(tab[bIDs == i, ])))next
ids <- na.omit(sIDs[tab[i, sIDs] > 5])
for(j in ids)
{
cond <- df$belongID == i & df$sourceID == j
old <- df[cond,]
id5 <- order(old$Time, decreasing = TRUE)[1:5]
new <- old[id5,]
df <- df[!cond,]
df <- rbind(df, new)
}
}
df[order(df$uniqID), ]
A solution in two lines using the plyr package:
library(plyr)
x <- ddply(dat, .(belongID, sourceID), function(x)tail(x[order(x$Time), ], 5))
xx <- x[order(x$belongID, x$uniqID), ]
The results:
belongID sourceID uniqID Time
5 1 1001 101 5
6 1 1002 102 5
4 1 1001 103 4
2 1 1001 104 3
3 1 1001 105 3
7 1 1005 106 2
1 1 1001 108 2
10 2 1005 109 5
16 2 1006 110 5
11 2 1005 111 5
17 2 1006 112 5
12 2 1005 113 5
15 2 1006 114 4
9 2 1005 115 4
13 2 1006 116 3
8 2 1005 117 3
14 2 1006 118 3
18 2 1007 122 1
19 3 1010 123 5
20 3 1480 124 2
The dataset on which this method is going to be used has 170.000+ entries and almost 30 columns
Benchmarking each of the three provided solutions by danas.zuokas, mplourde and Andrie with the use of my dataset, provided the following outcomes:
danas.zuokas' solution:
User System Elapsed
2829.569 0 2827.86
mplourde's solution:
User System Elapsed
765.628 0.000 763.908
Aurdie's solution:
User System Elapsed
984.989 0.000 984.010
Therefore I will use mplourde's solution. Thank you all!
This should be faster, using data.table :
DT = as.data.table(dat)
DT[, .SD[tail(order(Time),5)], by=list(belongID, sourceID)]
Aside : suggest to count the number of times the same variable name is repeated in the various answers to this question. Do you ever have a lot of long or similar object names?

Resources