This is probably stupid, but I have the following problem:
I have two tables:
1) Table with therapies on a specific patient, with beginning and ending date:
therapyID patientID startoftherapy endoftherapy
1 1 233 5.5.10 6.6.11
2 2 233 7.7.11 8.8.11
3 3 344 1.1.09 3.2.10
4 4 344 3.3.10 10.10.11
5 5 544 2.1.09 3.2.10
6 6 544 4.3.12 4.3.14
7 7 113 1.1.12 1.1.15
8 8 123 2.1.13 1.1.15
9 9 543 2.1.09 3.2.10
10 10 533 7.7.11 8.8.14
2) Table with many diagnoses, each with the specific patient, date and description:
diagnosisID dateofdiagnosis patientID diagnosis
1 11 8.8.10 233 xxx
2 22 5.10.11 233 yyy
3 33 8.9.11 233 xxx
4 44 2.2.09 344 zzz
5 55 3.3.09 344 yyy
6 666 2.2.12 123 zzz
7 777 3.3.12 123 yyy
8 555 3.2.10 543 xxx
9 203 8.8.12 533 zzz
I want to create a new table with the diagnoses the patients received during their therapy, i.e. with the matching criteria: same patientID, and dateofdiagnosis between startoftherapy and endoftherapy. Something like this:
therapyID diagnosisID patientID dateofdiagnosis diagnosis
1 1 11 233 08.08.10 xxx
2 2 22 233 05.10.11 yyy
3 2 33 233 08.09.11 xxx
I'm way too inexperienced to do this. Can anyone help me with this or point me in the right direction?
We can do it with dplyr:
# We recreate your data.frames
df1 <- read.table(text="
therapyID patientID startoftherapy endoftherapy
1 1 233 5.5.10 6.6.11
2 2 233 7.7.11 8.8.11
3 3 344 1.1.09 3.2.10
4 4 344 3.3.10 10.10.11
5 5 544 2.1.09 3.2.10
6 6 544 4.3.12 4.3.14
7 7 113 1.1.12 1.1.15
8 8 123 2.1.13 1.1.15
9 9 543 2.1.09 3.2.10
10 10 533 7.7.11 8.8.14", h=T)
df2 <- read.table(text="
diagnosisID dateofdiagnosis patientID diagnosis
1 11 8.8.10 233 xxx
2 22 5.10.11 233 yyy
3 33 8.9.11 233 xxx
4 44 2.2.09 344 zzz
5 55 3.3.09 344 yyy
6 666 2.2.12 123 zzz
7 777 3.3.12 123 yyy
8 555 3.2.10 543 xxx
9 203 8.8.12 533 zzz", h=T)
We load dplyr; run install.packages("dplyr") first if you don't have it.
library(dplyr)
Then we left_join by patientID, which keeps every row of df1 and attaches each matching row of df2. Then we just rearrange the column order.
# we first left_join
left_join(df1, df2, "patientID") %>%
select(therapyID,diagnosisID,patientID, dateofdiagnosis, diagnosis) %>%
arrange(therapyID)
We obtain:
therapyID diagnosisID patientID dateofdiagnosis diagnosis
1 1 11 233 8.8.10 xxx
2 1 22 233 5.10.11 yyy
3 1 33 233 8.9.11 xxx
4 2 11 233 8.8.10 xxx
The output may be different from the one you provided because of row order. It can be changed with arrange. Is this what you want?
EDIT
I want to sort out cases where the date of diagnosis did not happen during the therapy
Then you first need to properly convert the date columns to Date format. This function does the job for your format:
ch2date <- function(x) as.Date(x, format="%d.%m.%y")
We can include it to the pipe and then use these columns for filtering:
left_join(df1, df2, "patientID") %>%
mutate(startoftherapy = ch2date(startoftherapy),
endoftherapy = ch2date(endoftherapy),
dateofdiagnosis = ch2date(dateofdiagnosis)) %>%
filter(startoftherapy < dateofdiagnosis, dateofdiagnosis < endoftherapy) %>%
select(therapyID, diagnosisID, patientID, dateofdiagnosis, diagnosis) %>%
arrange(therapyID)
Does it solve your problem?
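The pipeline above uses strict inequalities, so a diagnosis made exactly on the first or last day of a therapy is dropped. If "between" is meant to include both end dates, the comparison can use >= and <= instead. The following is a minimal base R sketch of that variant, assuming a small subset of the question's tables; merge() is swapped in for left_join(), and the names therapies, diagnoses, to_date and result are made up for illustration.

```r
# Subset of the question's data: patients 233 and 543 only
therapies <- data.frame(
  therapyID = c(1, 2, 9),
  patientID = c(233, 233, 543),
  startoftherapy = c("5.5.10", "7.7.11", "2.1.09"),
  endoftherapy = c("6.6.11", "8.8.11", "3.2.10"),
  stringsAsFactors = FALSE
)
diagnoses <- data.frame(
  diagnosisID = c(11, 22, 33, 555),
  dateofdiagnosis = c("8.8.10", "5.10.11", "8.9.11", "3.2.10"),
  patientID = c(233, 233, 233, 543),
  diagnosis = c("xxx", "yyy", "xxx", "xxx"),
  stringsAsFactors = FALSE
)

# Same conversion as ch2date: day.month.two-digit-year
to_date <- function(x) as.Date(x, format = "%d.%m.%y")

# Inner join on patientID: every therapy paired with every diagnosis of that patient
joined <- merge(therapies, diagnoses, by = "patientID")
joined$startoftherapy <- to_date(joined$startoftherapy)
joined$endoftherapy <- to_date(joined$endoftherapy)
joined$dateofdiagnosis <- to_date(joined$dateofdiagnosis)

# Inclusive bounds: a diagnosis on the first or last therapy day counts
inside <- joined$dateofdiagnosis >= joined$startoftherapy &
  joined$dateofdiagnosis <= joined$endoftherapy
result <- joined[inside, c("therapyID", "diagnosisID", "patientID",
                           "dateofdiagnosis", "diagnosis")]
print(result)
```

With this subset, diagnosis 555 (made on 3.2.10, the last day of therapy 9) is kept, whereas the strict version would drop it.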
I have a data frame of ordinal survey answers. I want to turn each factor level into its own column, so that I get the frequencies of each level for a specific event.
I have tried lapply and dplyr to get the frequencies, but failed:
as.data.frame(apply(mtfinal, 2, table))
and
mtfinalf<-mtfinal %>%
group_by(q28) %>%
summarise(freq=n())
Expected results, in the form of a data.frame: a frequency table with respect to q28's factors
q28 sex1 sex2 race1 race2 race3 race4 race5 race6 race7 age1 age2
2 0
3 0
4 23
5 21
Actual Results
$age
1 2 3 4 5 6 7
6 2 184 520 507 393 170
$sex
1 2
1239 543
$grade
1 2 3 4
561 519 425 277
$race7
1 2 3 4 5 6
179 21 27 140 17 1307
7
91
$q8
1 2 3 4 5
127 259 356 501 539
$q9
1 2 3 4 5
993 224 279 86 200
$q28
2 3 4 5
1034 533 94 121
This will give you a count of each unique combination. What you are asking for is impossible, since there would be overlaps between the levels of sex, race and age.
# race7 is the name of the race column in the output shown above
mtfinalf<-mtfinal %>%
group_by(q28,age,race7,sex) %>%
tally()
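If separate counts per variable are acceptable, so that each respondent is counted once under sex and once again under age, one possible reading of the expected layout is a set of two-way tables against q28 placed side by side. The sketch below assumes a tiny made-up data frame (toy) standing in for mtfinal, which is not shown in the question; the column values are invented.

```r
# Hypothetical stand-in for mtfinal with three of its columns
toy <- data.frame(
  q28 = c(2, 2, 3, 4, 5, 2),
  sex = c(1, 2, 1, 1, 2, 2),
  age = c(3, 4, 4, 5, 3, 3)
)

vars <- c("sex", "age")

# One q28-by-variable contingency table per variable,
# with columns renamed to <variable><level>, e.g. sex1, sex2
tabs <- lapply(vars, function(v) {
  tab <- as.data.frame.matrix(table(toy$q28, toy[[v]]))
  names(tab) <- paste0(v, names(tab))
  tab
})

# All tables share the same q28 rows, so they can be bound column-wise
wide <- cbind(q28 = rownames(tabs[[1]]), do.call(cbind, tabs))
print(wide)
```

Each row of wide then sums to the q28 count once per variable, which is exactly the overlap the answer warns about, so the columns should not be added across variables.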
Suppose I have the following data frame, and what I want to do is identify and remove certain observations.
The idea is to delete those observations in which the same digit is repeated 4 or more times.
df<-data.frame(col1=c(12,34,233,3333,3333333,333333,555555,543,456,87,4,111111,1111111111,22,222,2222,22222,9111111,912,8688888888))
col1
1 12
2 34
3 233
4 3333
5 3333333
6 333333
7 555555
8 543
9 456
10 87
11 4
12 111111
13 1111111111
14 22
15 222
16 2222
17 22222
18 9111111
19 912
20 8688888888
So the final output should be:
col1
1 12
2 34
3 233
4 543
5 456
6 87
7 4
8 22
9 222
10 912
Another way of removing the desired values would be to directly filter out 1111, 2222, etc., using grepl() after converting the numbers to characters. Negating a logical match is safer than negative indexing with grep(), which returns an empty vector when nothing matches.
df$col1[!grepl(paste(1111*(1:9), collapse="|"), as.character(df$col1))]
# [1] 12 34 233 543 456 87 4 22 222 912
Not the most efficient method, but it seems to return the desired result. Convert the vector into a string, split each individual character, use rle to look for repeating sequences, take the maximum and return TRUE if that max is less than 4.
df[sapply(strsplit(as.character(df$col1), ""),
function(x) max(rle(x)$lengths) < 4), , drop=FALSE]
col1
1 12
2 34
3 233
8 543
9 456
10 87
11 4
14 22
15 222
19 912
This method will include values like 155155 but exclude values like 555511 or 155551.
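The "same digit four or more times in a row" rule can also be written as a single regular expression with a backreference. This is a minimal sketch of that alternative, not either answer's method; it assumes whole-number values, and the names digits, has_run and kept are made up here. sprintf("%.0f", ...) is used so that large values are never rendered in scientific notation.

```r
col1 <- c(12, 34, 233, 3333, 3333333, 333333, 555555, 543, 456, 87, 4,
          111111, 1111111111, 22, 222, 2222, 22222, 9111111, 912, 8688888888)

# Render each number as plain digits, without scientific notation
digits <- sprintf("%.0f", col1)

# ([0-9]) captures one digit; \\1{3} requires that same digit three more times
has_run <- grepl("([0-9])\\1{3}", digits, perl = TRUE)

kept <- col1[!has_run]
print(kept)
```

Like the rle approach, this keeps 155155 and drops 155551, and it also covers runs of zeros, which the 1111*(1:9) pattern does not.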
I have two similar tables recording someone's spending over 3 months each.
From months 4-6 a new variable has been added.
df1 = data.frame(Month=c(1,2,3),Rent=c(132,123,234),Food=c(34,13,45))
df2 = data.frame(Month=c(4,5,6),Rent=c(111,212,231),Food=c(33,11,41),Fun=c(4,6,5))
> df1
Month Rent Food
1 1 132 34
2 2 123 13
3 3 234 45
> df2
Month Rent Food Fun
1 4 111 33 4
2 5 212 11 6
3 6 231 41 5
How can I combine/merge the two tables to look like this:
Month Rent Food Fun
1 1 132 34 NA
2 2 123 13 NA
3 3 234 45 NA
4 4 111 33 4
5 5 212 11 6
6 6 231 41 5
You can use the join family of functions in the dplyr package for such tasks, as follows:
library(dplyr)
full_join(df1, df2)
Joining by: c("Month", "Rent", "Food")
Month Rent Food Fun
1 1 132 34 NA
2 2 123 13 NA
3 3 234 45 NA
4 4 111 33 4
5 5 212 11 6
6 6 231 41 5
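full_join() works here because it matches on all shared columns and no row of df1 coincides with a row of df2, so every row is simply carried over. Since the task is really stacking rows, dplyr's bind_rows(df1, df2) is another option. The same idea can be sketched in base R by first adding the missing column as NA; the names missing_cols and combined are made up for this example.

```r
df1 <- data.frame(Month = c(1, 2, 3), Rent = c(132, 123, 234), Food = c(34, 13, 45))
df2 <- data.frame(Month = c(4, 5, 6), Rent = c(111, 212, 231), Food = c(33, 11, 41),
                  Fun = c(4, 6, 5))

# Columns present in df2 but absent from df1 (here: "Fun")
missing_cols <- setdiff(names(df2), names(df1))

# Add them to df1, filled with NA
df1[missing_cols] <- NA

# Stack the rows, with df2's columns put in the same order as df1's
combined <- rbind(df1, df2[names(df1)])
print(combined)
```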
I have a dataframe which contains information about several categories, and some associated variables. It is of the form:
ID category sales score
227 A 109 21
131 A 410 24
131 A 509 1
123 B 2 61
545 B 19 5
234 C 439 328
654 C 765 41
What I would like to do is be able to introduce two new columns, salesRank and scoreRank, where I find the item index per category, had they been ordered by sales and score, respectively. I can solve the general case like this:
dF <- dF[order(-dF$sales),]
dF$salesRank<-seq.int(nrow(dF))
but this doesn't account for the categories and so far I've only solved this by breaking up the dataframe. What I want would result in the following:
ID category sales score salesRank scoreRank
227 A 109 21 3 2
131 A 410 24 2 1
131 A 509 1 1 3
123 B 2 61 2 1
545 B 19 5 1 2
234 C 439 328 2 1
654 C 765 41 1 2
Many thanks!
Try:
library(dplyr)
dF %>%
group_by(category) %>%
mutate(salesRank = row_number(desc(sales)),
scoreRank = row_number(desc(score)))
Which gives:
#Source: local data frame [7 x 6]
#Groups: category
#
# ID category sales score salesRank scoreRank
#1 227 A 109 21 3 2
#2 131 A 410 24 2 1
#3 131 A 509 1 1 3
#4 123 B 2 61 2 1
#5 545 B 19 5 1 2
#6 234 C 439 328 2 1
#7 654 C 765 41 1 2
From the help:
row_number(): equivalent to rank(ties.method = "first")
min_rank(): equivalent to rank(ties.method = "min")
desc(): transform a vector into a format that will be sorted in descending order.
As @thelatemail pointed out, for this particular dataset you might want to use min_rank() instead of row_number(), which will account for ties in sales/score more appropriately:
> row_number(c(1,2,2,4))
#[1] 1 2 3 4
> min_rank(c(1,2,2,4))
#[1] 1 2 2 4
Use ave in base R with rank (the - is to reverse the rankings from low-to-high to high-to-low):
dF$salesRank <- with(dF, ave(-sales, category, FUN=rank) )
#[1] 3 2 1 2 1 2 1
dF$scoreRank <- with(dF, ave(-score, category, FUN=rank) )
#[1] 2 1 3 1 2 1 2
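With the default settings, rank() gives tied values the average of their positions, so ave(..., FUN=rank) can return fractional ranks. A small sketch with made-up sales figures containing a tie shows the difference, and how passing ties.method = "min" mirrors min_rank(); the column names avgRank and minRank are invented for the example.

```r
# Made-up data: two rows in category A share the same sales value
dF <- data.frame(
  category = c("A", "A", "A", "B", "B"),
  sales = c(410, 410, 109, 19, 2)
)

# Default rank(): tied rows share the average of their positions
dF$avgRank <- ave(-dF$sales, dF$category, FUN = rank)

# ties.method = "min": tied rows both get the better (smaller) rank
dF$minRank <- ave(-dF$sales, dF$category,
                  FUN = function(x) rank(x, ties.method = "min"))
print(dF)
```

Here the two tied rows get 1.5 under the default and 1 under "min"; in both cases the next row is ranked 3.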
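Here is a base R solution with tapply. Note that rank, not order, gives each row's position within its category, and that unlist() lines the results up only when dat is already sorted by category, as it is here.
salesRank <- tapply(dat$sales, dat$category, function(x) rank(-x))
scoreRank <- tapply(dat$score, dat$category, function(x) rank(-x))
cbind(dat, salesRank = unlist(salesRank), scoreRank= unlist(scoreRank))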
ID category sales score salesRank scoreRank
A1 227 A 109 21 3 2
A2 131 A 410 24 2 1
A3 131 A 509 1 1 3
B1 123 B 2 61 2 1
B2 545 B 19 5 1 2
C1 234 C 439 328 2 1
C2 654 C 765 41 1 2
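A caveat for tapply-based ranking: order() and rank() answer different questions. order() returns the positions that would sort the vector, while rank() returns each element's place in the sorted result. The two coincide on the small groups in this example, but not in general, as this sketch with made-up numbers shows.

```r
sales <- c(10, 30, 20)   # made-up values

# Positions of the elements from largest to smallest
pos <- order(sales, decreasing = TRUE)
print(pos)   # 2 3 1

# Rank of each element, with 1 for the largest
rnk <- rank(-sales)
print(rnk)   # 3 1 2

# Applying order() twice recovers the ranks when there are no ties
print(order(pos))   # 3 1 2
```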
I'm handling some stock order data and am having a problem with what I suspect needs a transpose. The data frame lists the qty for each supply location across the row, for each customer and each item, but I need it to have a separate row for each supply location.
What I have looks like this. Each of the numbered columns is a supply location:
Customer Cust.location Product 116 117 41 25 81 Total.Order
ABC Tap 123 5 3 0 2 1 11
ABC Tap 456 0 1 4 0 2 7
DEF Kar 123 1 0 0 3 4 8
What I need is
Customer Cust.location Product Source Total
ABC Tap 123 116 5
ABC Tap 123 117 3
ABC Tap 123 25 2
ABC Tap 123 81 1
ABC Tap 456 117 1
ABC Tap 456 41 4
ABC Tap 456 81 2
DEF Kar 123 116 1
DEF Kar 123 25 3
DEF Kar 123 81 4
Sorry about the poor layout, this is my first post here.
Not worried too much about handling 0 qty lines so if you have a solution that retains them it doesn't matter
This is classic reshaping from wide to long format. The melt function from the reshape2 package is how I prefer to do it, though you can use the reshape function in base R. If your data.frame is dat:
library(reshape2)
dat.m <- melt(dat[,-9], id.vars= c("Customer", "Cust.location", "Product"),
variable.name="Source", value.name="Total")
I've removed the Total.Order column (dat[,-9]) because it seems you don't need that.
# Customer Cust.location Product Source Total
# 1 ABC Tap 123 116 5
# 2 ABC Tap 456 116 0
# 3 DEF Kar 123 116 1
# 4 ABC Tap 123 117 3
# 5 ABC Tap 456 117 1
# 6 DEF Kar 123 117 0
# 7 ABC Tap 123 41 0
# 8 ABC Tap 456 41 4
# 9 DEF Kar 123 41 0
# 10 ABC Tap 123 25 2
# 11 ABC Tap 456 25 0
# 12 DEF Kar 123 25 3
# 13 ABC Tap 123 81 1
# 14 ABC Tap 456 81 2
# 15 DEF Kar 123 81 4
The base reshape method that @alexwhan alludes to is very similar:
dat <- read.table(text="Customer Cust.location Product 116 117 41 25 81 Total.Order
ABC Tap 123 5 3 0 2 1 11
ABC Tap 456 0 1 4 0 2 7
DEF Kar 123 1 0 0 3 4 8",header=TRUE,check.names=FALSE)  # keep "116" rather than "X116" as column names
reshape(
dat[,-9],
idvar=c("Customer","Cust.location", "Product"),
varying=4:8,
v.names="Total",
timevar="Source",
times=names(dat[4:8]),
direction="long"
)
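The desired output in the question drops the zero-quantity lines and groups the rows by customer and product. Below is a minimal base R sketch of that last step, built by hand rather than with melt or reshape. It assumes the data are read with check.names = FALSE so the location columns keep names like "116"; the names id_cols, loc_cols and long are made up for illustration.

```r
dat <- read.table(text = "Customer Cust.location Product 116 117 41 25 81 Total.Order
ABC Tap 123 5 3 0 2 1 11
ABC Tap 456 0 1 4 0 2 7
DEF Kar 123 1 0 0 3 4 8", header = TRUE, check.names = FALSE)

id_cols <- c("Customer", "Cust.location", "Product")
loc_cols <- c("116", "117", "41", "25", "81")

# Repeat the id columns once per location and stack the quantities column by column
long <- data.frame(
  dat[rep(seq_len(nrow(dat)), times = length(loc_cols)), id_cols],
  Source = rep(loc_cols, each = nrow(dat)),
  Total = unlist(dat[loc_cols], use.names = FALSE)
)

# Drop zero-quantity lines, then group rows by customer and product
long <- long[long$Total > 0, ]
long <- long[order(long$Customer, long$Product), ]
rownames(long) <- NULL
print(long)
```

Because order() leaves ties in their original order, the locations within each customer and product stay in the column order of the wide table.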