How to read a matrix-formatted text file into R using data.table

I am having trouble reading a text file that has data formatted in a matrix format as follows:
Location Product Day1 Day2 Day3 Day4 ... Day1 Day2 Day3
Jan Jan Jan ... Feb Feb Feb
123 23 8 9 3
234 25 2 4 9
254 87 3
213 56 7 5
It is essentially a time series that has quantities of products by location by day. I want to eventually convert this into a "sql" table format.
When I skip row 2 and import the rest of the data with fill = TRUE, I don't get the desired result: the counts get shifted to the right and no longer align with the first "header" row. I want to combine rows 1 and 2 to make a date field starting from Day1 in row 1, and leave empty fields as null or NA. Eventually I want to pivot the data into the following format:
Location Product Period Count
123 23 Jan 1
234 25 Jan 1 5
234 25 Feb 3 9
How can I accomplish this?

This demonstrates fwf_empty, the automatic column-position guessing function of the readr package. I could not get read_fwf to accept a text connection as its file argument, so I saved the text to a file in a slightly edited form (without the "..." placeholders) that looks like:
Location Product Day1 Day2 Day3 Day4 Day1 Day2 Day3
Jan Jan Jan Jan Feb Feb Feb
123 23 8 9 3
234 25 2 4 9
254 87 3
213 56 7 5
The R code:
require(readr)
fwf_empty(file="~/Untitled 4 copy.txt")
$begin
[1] 0 9 17 22 27 32 40
$end
[1] 8 16 21 26 31 36 55
$col_names
[1] "X1" "X2" "X3" "X4" "X5" "X6" "X7"
> (inp <- read_fwf(file="~/Untitled 4 copy.txt", fwf_empty(file="~/Untitled 4 copy.txt")))
Warning: 8 parsing failures.
row col expected actual
2 X9 4 chars 3
3 X8 4 chars 2
3 -- 9 columns 8 columns
4 X9 4 chars 3
5 X3 4 chars 2
... ... ......... .........
.See problems(...) for more details.
X1 X2 X3 X4 X5 X6 X7 X8 X9
1 Location Product Day1 Day2 Day3 Day4 Day1 Day2 Day3
2 <NA> <NA> Jan Jan Jan Jan Feb Feb Feb
3 123 23 <NA> <NA> 8 <NA> 9 3 <NA>
4 234 25 2 4 <NA> <NA> <NA> <NA> 9
5 254 87 3 <NA> <NA> <NA> <NA> <NA> <NA>
6 213 56 <NA> 7 <NA> <NA> 5 <NA> <NA>
Then rename the columns and remove the first two lines:
> colnm <- paste0( inp[1,], inp[2,])
> colnm
[1] "LocationNA" "ProductNA" "Day1Jan" "Day2Jan" "Day3Jan"
[6] "Day4Jan" "Day1Feb" "Day2Feb" "Day3Feb"
> colnames(inp) <- colnm
> inp[-(1:2), ]
LocationNA ProductNA Day1Jan Day2Jan Day3Jan Day4Jan Day1Feb Day2Feb
3 123 23 <NA> <NA> 8 <NA> 9 3
4 234 25 2 4 <NA> <NA> <NA> <NA>
5 254 87 3 <NA> <NA> <NA> <NA> <NA>
6 213 56 <NA> 7 <NA> <NA> 5 <NA>
Day3Feb
3 <NA>
4 9
5 <NA>
6 <NA>
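The answer above stops at the wide table, while the question also asks for a long Location / Product / Period / Count layout. A minimal base R sketch of that last step follows. It assumes the counts have already been converted to numeric and the columns are named like Day1Jan; the data frame `wide` is a hypothetical cut-down stand-in, not the real file.

```r
# Hypothetical wide table shaped like the renamed result above
wide <- data.frame(
  Location = c(123, 234),
  Product  = c(23, 25),
  Day1Jan  = c(NA, 2),
  Day2Jan  = c(NA, 4),
  Day1Feb  = c(9, NA)
)

# Every column except the two identifiers holds a count
day_cols <- setdiff(names(wide), c("Location", "Product"))

# Stack the count columns one below the other, repeating the identifiers
long <- data.frame(
  Location = rep(wide$Location, times = length(day_cols)),
  Product  = rep(wide$Product, times = length(day_cols)),
  Period   = rep(day_cols, each = nrow(wide)),
  Count    = unlist(wide[day_cols], use.names = FALSE),
  stringsAsFactors = FALSE
)

# Drop the empty cells and turn "Day1Jan" into "Jan 1"
long <- long[!is.na(long$Count), ]
long$Period <- sub("^Day([0-9]+)(.*)$", "\\2 \\1", long$Period)
rownames(long) <- NULL
long
```

With data.table loaded, melt() should be able to do the stacking step in one call; the base version is shown only because it has no dependencies.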

Related

Repeating annual values multiple times to form a monthly dataframe

I have an annual dataset as below:
year <- c(2016,2017,2018)
xxx <- c(1,2,3)
yyy <- c(4,5,6)
df <- data.frame(year,xxx,yyy)
print(df)
year xxx yyy
1 2016 1 4
2 2017 2 5
3 2018 3 6
Where the values in column xxx and yyy correspond to values for that year.
I would like to expand this dataframe (or create a new dataframe), which retains the same column names, but repeats each value 12 times (corresponding to the month of that year) and repeat the yearly value 12 times in the first column.
As mocked up by the code below:
year <- rep(2016:2018,each=12)
xxx <- rep(1:3,each=12)
yyy <- rep(4:6,each=12)
df2 <- data.frame(year,xxx,yyy)
print(df2)
year xxx yyy
1 2016 1 4
2 2016 1 4
3 2016 1 4
4 2016 1 4
5 2016 1 4
6 2016 1 4
7 2016 1 4
8 2016 1 4
9 2016 1 4
10 2016 1 4
11 2016 1 4
12 2016 1 4
13 2017 2 5
14 2017 2 5
15 2017 2 5
16 2017 2 5
17 2017 2 5
18 2017 2 5
19 2017 2 5
20 2017 2 5
21 2017 2 5
22 2017 2 5
23 2017 2 5
24 2017 2 5
25 2018 3 6
26 2018 3 6
27 2018 3 6
28 2018 3 6
29 2018 3 6
30 2018 3 6
31 2018 3 6
32 2018 3 6
33 2018 3 6
34 2018 3 6
35 2018 3 6
36 2018 3 6
Any help would be greatly appreciated!
I'm new to R and I can see how I would do this with a loop statement but was wondering if there was an easier solution.
Convert df to a matrix, take the Kronecker product with a vector of 12 ones, and then convert back to a data.frame. The as.data.frame can be omitted if a matrix result is OK. The Kronecker product drops the column names, so setNames restores them.
setNames(as.data.frame(as.matrix(df) %x% rep(1, 12)), names(df))
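An alternative sketch that avoids the matrix round trip is to repeat the row indices. This is not the answerer's method, just a common base R idiom; it keeps the column names and column types as they are.

```r
df <- data.frame(year = c(2016, 2017, 2018),
                 xxx  = c(1, 2, 3),
                 yyy  = c(4, 5, 6))

# Repeat each row index 12 times, then subset the rows with it
df2 <- df[rep(seq_len(nrow(df)), each = 12), ]
rownames(df2) <- NULL  # reset the row names to 1..36
head(df2, 13)
```

Because the rows are only indexed, this also works when some columns are not numeric, which the matrix-based approach would coerce to character.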

Extracting a numeric information align with ID from unstructured dataset in R

I am trying to extract score information for each ID and for each itemID. Here is what my sample dataset looks like.
df <- data.frame(Text_1 = c("Scoring", "1 = Incorrect","Text1","Text2","Text3","Text4", "Demo 1: Color Naming","Amarillo","Azul","Verde","Azul",
"Demo 1: Errors","Item 1: Color naming","Amarillo","Azul","Verde","Azul",
"Item 1: Time in seconds","Item 1: Errors",
"Item 2: Shape Naming","Cuadrado/Cuadro","Cuadrado/Cuadro","Círculo","Estrella","Círculo","Triángulo",
"Item 2: Time in seconds","Item 2: Errors"),
School.2 = c("Teacher:","DC Name:","Date (mm/dd/yyyy):","Child Grade:","Student Study ID:",NA, NA,NA,NA,NA,NA,
0,"1 = Incorrect responses",0,1,NA,NA,NA,0,"1 = Incorrect responses",0,NA,NA,1,1,0,NA,0),
X_Elementary_School..3 = c("Bill:","X District","10/7/21","K","123-2222-2:",NA, NA,NA,NA,NA,NA,
NA,"Child response",NA,NA,NA,NA,NA,NA,"Child response",NA,NA,NA,NA,NA,NA,NA,NA),
School.4 = c("Teacher:","DC Name:","Date (mm/dd/yyyy):","Child Grade:","Student Study ID:",NA, 0,NA,1,NA,NA,0,"1 = Incorrect responses",0,1,NA,NA,120,0,"1 = Incorrect responses",NA,1,0,1,NA,1,110,0),
Y_Elementary_School..2 = c("John:","X District","11/7/21","K","112-1111-3:",NA, NA,NA,NA,NA,NA,NA,"Child response",NA,NA,NA,NA,NA,NA,"Child response",NA,NA,NA,NA,NA,NA, NA,NA))
> df
Text_1 School.2 X_Elementary_School..3 School.4 Y_Elementary_School..2
1 Scoring Teacher: Bill: Teacher: John:
2 1 = Incorrect DC Name: X District DC Name: X District
3 Text1 Date (mm/dd/yyyy): 10/7/21 Date (mm/dd/yyyy): 11/7/21
4 Text2 Child Grade: K Child Grade: K
5 Text3 Student Study ID: 123-2222-2: Student Study ID: 112-1111-3:
6 Text4 <NA> <NA> <NA> <NA>
7 Demo 1: Color Naming <NA> <NA> 0 <NA>
8 Amarillo <NA> <NA> <NA> <NA>
9 Azul <NA> <NA> 1 <NA>
10 Verde <NA> <NA> <NA> <NA>
11 Azul <NA> <NA> <NA> <NA>
12 Demo 1: Errors 0 <NA> 0 <NA>
13 Item 1: Color naming 1 = Incorrect responses Child response 1 = Incorrect responses Child response
14 Amarillo 0 <NA> 0 <NA>
15 Azul 1 <NA> 1 <NA>
16 Verde <NA> <NA> <NA> <NA>
17 Azul <NA> <NA> <NA> <NA>
18 Item 1: Time in seconds <NA> <NA> 120 <NA>
19 Item 1: Errors 0 <NA> 0 <NA>
20 Item 2: Shape Naming 1 = Incorrect responses Child response 1 = Incorrect responses Child response
21 Cuadrado/Cuadro 0 <NA> <NA> <NA>
22 Cuadrado/Cuadro <NA> <NA> 1 <NA>
23 Círculo <NA> <NA> 0 <NA>
24 Estrella 1 <NA> 1 <NA>
25 Círculo 1 <NA> <NA> <NA>
26 Triángulo 0 <NA> 1 <NA>
27 Item 2: Time in seconds <NA> <NA> 110 <NA>
28 Item 2: Errors 0 <NA> 0 <NA>
This sample dataset is limited to two schools, two teachers and two students.
In this step, I need to extract student responses for each item.
Wherever the first column has "Item", I need to grab the rows from there. In particular, I need to compute the row and column indices rather than hard-coding row and column numbers, since this will be run on multiple data files and each file has different information. There is no need to grab the "...: Errors" part.
################################################################################
# ## 2-extract the score information here
# ## 1-grab item information from where "Item 1:.." starts
Here, how can I automate this part rather than hard-coding the row numbers?
score<-df[c(7:11,13:17,20:26),c(seq(2,dim(df)[2],2))] # need to automate row and columns index here
score<-as.data.frame(t(score))
rownames(score)<-seq(1,nrow(score),1)
colnames(score)<-paste0('i',seq(1,ncol(score),1)) # assign col names for items
score<-apply(score,2,as.numeric) # only keep numeric columns
score<-as.data.frame(score)
score$total<-rowSums(score,na.rm=T); score # create a total score
> score
i1 i2 i3 i4 i5 i6 i7 i8 i9 i10 i11 i12 i13 i14 i15 i16 i17 total
1 NA NA NA NA NA NA 0 1 NA NA NA 0 NA NA 1 1 0 3
2 0 NA 1 NA NA NA 0 1 NA NA NA NA 1 0 1 NA 1 5
Additionally, I need to add an ID column, which I could not manage here.
My desired output would be:
> score
ID i1 i2 i3 i4 i5 i6 i7 i8 i9 i10 i11 i12 i13 i14 i15 i16 i17 total
1 123-2222-2 NA NA NA NA NA NA 0 1 NA NA NA 0 NA NA 1 1 0 3
2 112-1111-3 0 NA 1 NA NA NA 0 1 NA NA NA NA 1 0 1 NA 1 5
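One possible way to find the rows and the IDs by pattern instead of by position is sketched below. It assumes block headers always look like "Demo N: ..." or "Item N: ...", that the summary headers end in "Errors" or "Time in seconds", and that the ID values sit in the column to the right of the "Student Study ID:" label. The helper item_rows() and the toy objects first_col and meta are hypothetical, cut down from the question's data.

```r
# Toy first column, shortened from Text_1 in the question
first_col <- c("Scoring", "Text1",
               "Demo 1: Color Naming", "Amarillo", "Azul",
               "Demo 1: Errors",
               "Item 1: Color naming", "Amarillo", "Azul",
               "Item 1: Time in seconds", "Item 1: Errors")

# Hypothetical helper: row numbers of every Demo/Item block,
# skipping the "Errors" and "Time in seconds" summary rows
item_rows <- function(x) {
  is_header  <- grepl("^(Demo|Item) [0-9]+:", x)
  is_summary <- is_header & grepl("(Errors|Time in seconds)$", x)
  block  <- cumsum(is_header)               # 0 before the first header
  wanted <- which(!is_summary[is_header])   # block numbers to keep
  which(block %in% wanted)
}
rows <- item_rows(first_col)

# Toy metadata rows: label columns alternate with value columns
meta <- data.frame(
  Text_1   = c("Text2", "Text3"),
  School.2 = c("Child Grade:", "Student Study ID:"),
  X_School = c("K", "123-2222-2:"),
  School.4 = c("Child Grade:", "Student Study ID:"),
  Y_School = c("K", "112-1111-3:"),
  stringsAsFactors = FALSE
)
id_row     <- which(meta[[2]] == "Student Study ID:")
value_cols <- seq(3, ncol(meta), by = 2)
ids <- sub(":$", "", unlist(meta[id_row, value_cols], use.names = FALSE))
```

On the full data frame from the question, item_rows(df$Text_1) should reproduce c(7:11, 13:17, 20:26), and ids could then be bound to the score table with cbind(ID = ids, score).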

Cleaning a data.frame in a semi-reshape/semi-aggregate fashion

First time posting something here, forgive any missteps in my question.
In my example below I've got a data.frame where the unique identifier is the tripID with the name of the vessel, the species code, and a catch metric.
> testFrame1 <- data.frame('tripID' = c(1,1,2,2,3,4,5),
'name' = c('SS Anne','SS Anne', 'HMS Endurance', 'HMS Endurance','Salty Hippo', 'Seagallop', 'Borealis'),
'SPP' = c(101,201,101,201,102,102,103),
'kept' = c(12, 22, 14, 24, 16, 18, 10))
> testFrame1
tripID name SPP kept
1 1 SS Anne 101 12
2 1 SS Anne 201 22
3 2 HMS Endurance 101 14
4 2 HMS Endurance 201 24
5 3 Salty Hippo 102 16
6 4 Seagallop 102 18
7 5 Borealis 103 10
I need a way to condense the data.frame so that there is only one row per tripID, as shown below.
> testFrame1
tripID name SPP kept SPP.1 kept.1
1 1 SS Anne 101 12 201 22
2 2 HMS Endurance 101 14 201 24
3 3 Salty Hippo 102 16 NA NA
4 4 Seagallop 102 18 NA NA
5 5 Borealis 103 10 NA NA
I've looked into tidyr and reshape, but neither of them can deliver quite what I'm asking for. Is there anything out there that does this quasi-reshaping?
Here are two alternatives using base::reshape and data.table::dcast:
1) base R
# number the rows within each trip, then spread them into columns
reshape(transform(testFrame1,
                  timevar = ave(tripID, tripID, FUN = seq_along)),
        idvar = c("tripID", "name"),
        timevar = "timevar",
        direction = "wide")
# tripID name SPP.1 kept.1 SPP.2 kept.2
#1 1 SS Anne 101 12 201 22
#3 2 HMS Endurance 101 14 201 24
#5 3 Salty Hippo 102 16 NA NA
#6 4 Seagallop 102 18 NA NA
#7 5 Borealis 103 10 NA NA
2) data.table
library(data.table)
setDT(testFrame1)
dcast(testFrame1, tripID + name ~ rowid(tripID), value.var = c("SPP", "kept"))
# tripID name SPP_1 SPP_2 kept_1 kept_2
#1: 1 SS Anne 101 201 12 22
#2: 2 HMS Endurance 101 201 14 24
#3: 3 Salty Hippo 102 NA 16 NA
#4: 4 Seagallop 102 NA 18 NA
#5: 5 Borealis 103 NA 10 NA
Great reproducible post considering it's your first. Here's a way to do it with dplyr and tidyr -
library(dplyr)
library(tidyr)
testFrame1 %>%
  group_by(tripID, name) %>%
  summarise(
    SPP = toString(SPP),
    kept = toString(kept)
  ) %>%
  ungroup() %>%
  separate("SPP", into = c("SPP", "SPP.1"), sep = ", ", extra = "drop", fill = "right") %>%
  separate("kept", into = c("kept", "kept.1"), sep = ", ", extra = "drop", fill = "right")
# A tibble: 5 x 6
tripID name SPP SPP.1 kept kept.1
<dbl> <chr> <chr> <chr> <chr> <chr>
1 1.00 SS Anne 101 201 12 22
2 2.00 HMS Endurance 101 201 14 24
3 3.00 Salty Hippo 102 <NA> 16 <NA>
4 4.00 Seagallop 102 <NA> 18 <NA>
5 5.00 Borealis 103 <NA> 10 <NA>

Transfer pivottable to another table in R

In my research I have a dataset of cancer patients with some clinical information like cancer stage and treatment etc. Each patient has one row in a table with this clinical information. In addition, each patient has, at one or several timepoints during the treatment, taken blood samples, depending on how long the patient has been followed at the clinic. The first sample is from the first visit and the second sample is from the second visit at the clinic, and so on.
In the table, there is a variable (ie. column) that is named Sample_Time_1, which is the time for the first sample. Sample_Time_2 has the time (date) for the second sample and so on.
However, the samples were analysed at the lab and I got the result in a pivot table, which means I have a table where each sample has one row, and therefore the results from one patient are displayed on several rows.
For example, create two tables:
x <- c(1,2,2,3,3,3,3,4,5,6,6,6,6,7,8,9,9,10)
y <- as.Date(c("2011-05-17","2012-06-30","2012-08-11","2011-10-15","2011-11-25","2012-01-07","2012-02-15","2011-08-13","2012-02-03","2011-11-08","2011-12-21","2012-02-01","2012-03-12","2012-01-03","2012-04-20","2012-03-31","2012-05-10","2011-12-15"), format="%Y-%m-%d", origin="1960-01-01")
z <- c(123,185,153,153,125,148,168,187,194,115,165,167,143,151,129,130,151,134)
Sheet_1 <- matrix(c(x,y,z), ncol=3, byrow=FALSE)
colnames(Sheet_1) <- c("ID","Sample_Time", "Sample_Value")
Sheet_1 <- as.data.frame(Sheet_1)
Sheet_1$Sample_Time <- y
x1 <- c(1,2,3,4,5,6,7,8,9,10)
x2 <- c(3,3,2,3,2,2,4,2,3,3)
x3 <- c(1,2,2,3,3,1,3,1,1,2)
x4 <- as.Date(c("2011-05-17","2012-06-30","2011-10-15","2011-08-13","2012-02-03","2011-11-08","2012-01-03","2012-04-20","2012-03-31","2011-12-15"), format="%Y-%m-%d", origin="1960-01-01")
x5 <- as.Date(c(NA,"2012-08-11","2011-11-25",NA,NA,"2011-12-21",NA,NA,"2012-05-10",NA), format="%Y-%m-%d", origin="1960-01-01")
x6 <- as.Date(c(NA,NA,"2012-01-07",NA,NA,"2012-02-01",NA,NA,NA,NA), format="%Y-%m-%d", origin="1960-01-01")
x7 <- as.Date(c(NA,NA,"2012-02-15",NA,NA,"2012-03-12",NA,NA,NA,NA), format="%Y-%m-%d", origin="1960-01-01")
Sheet_2 <- as.data.frame(c(1:10))
colnames(Sheet_2) <- "ID"
Sheet_2$Stage <- x2
Sheet_2$Treatment <- x3
Sheet_2$Sample_Time_1 <- x4
Sheet_2$Sample_Time_2 <- x5
Sheet_2$Sample_Time_3 <- x6
Sheet_2$Sample_Time_4 <- x7
Sheet_2$Sample_Value_1 <- NA
Sheet_2$Sample_Value_2 <- NA
Sheet_2$Sample_Value_3 <- NA
Sheet_2$Sample_Value_4 <- NA
I would like to transfer the Sample_Value for the first date a sample was taken from a patient from Sheet_1 to Sheet_2$Sample_Value_1 and if there are more samples, I would like to transfer them to column "Sample_Value_2" and so on.
I have tried with a double for-loop. For each patient (=ID) in Sheet_1 I have run through Sheet_2, and if there is a match on ID, I use another for-loop to see if there is a match on a Sample_Time and insert (using if) the Sample_Value. However, I cannot get it to work, and I have a strong feeling there must be a better way.
Any suggestions?
Is this what you want?
Prepare Sheet_1 for reshaping from long to wide by introducing an extra column with a unique ID for each blood sample per patient:
Sheet_1$uniqid <- with(Sheet_1, ave(as.character(ID), ID, FUN = seq_along))
And with this, do the re-shaping
S_1 <- reshape( Sheet_1, idvar = "ID", timevar = "uniqid", direction = "wide")
which gives you
> S_1
ID Sample_Time.1 Sample_Value.1 Sample_Time.2 Sample_Value.2 Sample_Time.3
1 1 2011-05-17 123 <NA> NA <NA>
2 2 2012-06-30 185 2012-08-11 153 <NA>
4 3 2011-10-15 153 2011-11-25 125 2012-01-07
8 4 2011-08-13 187 <NA> NA <NA>
9 5 2012-02-03 194 <NA> NA <NA>
10 6 2011-11-08 115 2011-12-21 165 2012-02-01
14 7 2012-01-03 151 <NA> NA <NA>
15 8 2012-04-20 129 <NA> NA <NA>
16 9 2012-03-31 130 2012-05-10 151 <NA>
18 10 2011-12-15 134 <NA> NA <NA>
Sample_Value.3 Sample_Time.4 Sample_Value.4
1 NA <NA> NA
2 NA <NA> NA
4 148 2012-02-15 168
8 NA <NA> NA
9 NA <NA> NA
10 167 2012-03-12 143
14 NA <NA> NA
15 NA <NA> NA
16 NA <NA> NA
18 NA <NA> NA
The number after the dot in the colnames is the uniqid.
Now you can merge the relevant columns from Sheet_2
S_2 <- merge( Sheet_2[ 1:3 ], S_1, by = "ID" )
and the result should be what you are looking for:
> S_2
ID Stage Treatment Sample_Time.1 Sample_Value.1 Sample_Time.2 Sample_Value.2
1 1 3 1 2011-05-17 123 <NA> NA
2 2 3 2 2012-06-30 185 2012-08-11 153
3 3 2 2 2011-10-15 153 2011-11-25 125
4 4 3 3 2011-08-13 187 <NA> NA
5 5 2 3 2012-02-03 194 <NA> NA
6 6 2 1 2011-11-08 115 2011-12-21 165
7 7 4 3 2012-01-03 151 <NA> NA
8 8 2 1 2012-04-20 129 <NA> NA
9 9 3 1 2012-03-31 130 2012-05-10 151
10 10 3 2 2011-12-15 134 <NA> NA
Sample_Time.3 Sample_Value.3 Sample_Time.4 Sample_Value.4
1 <NA> NA <NA> NA
2 <NA> NA <NA> NA
3 2012-01-07 148 2012-02-15 168
4 <NA> NA <NA> NA
5 <NA> NA <NA> NA
6 2012-02-01 167 2012-03-12 143
7 <NA> NA <NA> NA
8 <NA> NA <NA> NA
9 <NA> NA <NA> NA
10 <NA> NA <NA> NA
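The per-patient counter is the step that makes the reshape work, and it numbers samples in the order the rows appear. A minimal sketch of that step on its own, assuming the rows may arrive unsorted and therefore ordering them by ID and date first (the toy data frame `samples` is hypothetical):

```r
# Hypothetical, deliberately unsorted lab results
samples <- data.frame(
  ID           = c(2, 1, 2),
  Sample_Time  = as.Date(c("2012-08-11", "2011-05-17", "2012-06-30")),
  Sample_Value = c(153, 123, 185)
)

# Sort so that the counter follows the visit dates within each patient
samples <- samples[order(samples$ID, samples$Sample_Time), ]

# ave() applies seq_along within each ID, giving 1, 2, ... per patient
samples$uniqid <- ave(seq_along(samples$ID), samples$ID, FUN = seq_along)
samples
```

In the example data of the question the rows are already in date order, so the sort changes nothing there; it only matters for real exports where that is not guaranteed.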

copy values of a column into another column based on a condition using a loop

I need to write a fairly complicated "for" loop, but after reading some examples I still don't know how to write it in a proper R way, so I'm not sure whether it will work. I'm still an R beginner :(
I have a dataset in long format with different occasions. However, some occasions are not truly new, since the start date is the same; they only carry a different offence, which I need to copy into a new column called "offence2". After this I need to drop the false new occasions, in order to keep only rows that represent new occasions. My real data have up to 8 different offences for a single date, but I made a simpler example.
This is an example of what my data look like:
id<-c(1,1,1,2,2,3,3,3,4,4,4,4,5,5,5)
dstart<-c("25/11/2006", "13/12/2006","13/12/2006","07/02/2006","07/02/2006",
"15/01/2006", "22/03/2006","18/09/2006", "04/03/2006","04/03/2006",
"22/08/2006","22/08/2006","11/04/2006", "11/04/2006", "19/10/2006")
dstart1<-as.Date(dstart, "%d/%m/%Y")
offence<-c("a","b","c","b","d","a","a","e","b","a","c","a","a","b","a")
cod_offence<-c(25, 26,27,26,28,25,25,29,26,25,27,25,25,26,25)
mydata<-data.frame(id, dstart1, offence, cod_offence)
Data
id dstart1 offence cod_offence
1 1 2006-11-25 a 25
2 1 2006-12-13 b 26
3 1 2006-12-13 c 27
4 2 2006-02-07 b 26
5 2 2006-02-07 d 28
6 3 2006-01-15 a 25
7 3 2006-03-22 a 25
8 3 2006-09-18 e 29
9 4 2006-03-04 b 26
10 4 2006-03-04 a 25
11 4 2006-08-22 c 27
12 4 2006-08-22 a 25
13 5 2006-04-11 a 25
14 5 2006-04-11 b 26
15 5 2006-10-19 a 25
I need something like this:
id dstart1 offence cod_offence offence2
1 1 2006-11-25 a 25 NA
2 1 2006-12-13 b 26 c
3 1 2006-12-13 c 27 NA
4 2 2006-02-07 b 26 d
5 2 2006-02-07 d 28 NA
6 3 2006-01-15 a 25 NA
7 3 2006-03-22 a 25 NA
8 3 2006-09-18 e 29 NA
9 4 2006-03-04 b 26 a
10 4 2006-03-04 a 25 NA
11 4 2006-08-22 c 27 a
12 4 2006-08-22 a 25 NA
13 5 2006-04-11 a 25 b
14 5 2006-04-11 b 26 NA
15 5 2006-10-19 a 25 NA
I think that I need to do something like this:
given i=individual
j=observation within individual
for each individual I need to check whether mydata$dstart1(j) = mydata$dstart1(j+1)
if this is true, then copy mydata$offence2(j)=mydata$offence(j+1), otherwise keep the same value
This has to stop if id(j) != id(j+1) and re-start with the new id.
My problem is that I don't know how to put this in a loop.
Thank you!!
Update
Yes, it works fine with the example, but not yet with my real data, since they are a bit more complex.
What happens if, instead of two repeated dates, I have three or more, each with a different offence? Following @CathG's solution, I need to create more variables according to the number of offences (8 in my case). I guess I would need a new vector that identifies the position of the observation within id, and a new "instruction" that tells R that, depending on the position of mydata$dstart1, the value needs to be copied into a different column. But then again, I don't know how to do it.
id dstart1 offence cod_offence offence2 offence3 offence4
1 1 2006-11-25 a 25 NA NA NA
2 1 2006-12-13 b 26 c NA NA
3 1 2006-12-13 c 27 NA NA NA
4 2 2006-02-07 b 26 d NA NA
5 2 2006-02-07 d 28 NA NA NA
6 2 2006-04-12 b 26 d c a
7 2 2006-04-12 d 28 NA NA NA
8 2 2006-04-12 c 27 NA NA NA
9 2 2006-04-12 a 25 NA NA NA
Thanks again!!!
With split and a loop:
# data with repeated dates /offences
id<-c(1,1,1,2,2,3,3,3,4,4,4,4,5,5,5,5,5,5)
dstart<-c("25/11/2006", "13/12/2006","13/12/2006","07/02/2006","07/02/2006",
"15/01/2006", "22/03/2006","18/09/2006", "04/03/2006","04/03/2006",
"22/08/2006","22/08/2006","11/04/2006", "11/04/2006", "19/10/2006","19/10/2006","19/10/2006","19/10/2006")
dstart1<-as.Date(dstart, "%d/%m/%Y")
offence<-c("a","b","c","b","d","a","a","e","b","a","c","a","a","b","a","c","b","a")
cod_offence<-c(25, 26,27,26,28,25,25,29,26,25,27,25,25,26,25,27,25,25)
mydata<-data.frame(id, dstart1, offence, cod_offence)
# see the max offences there are for same id and date
maxoff<-max(table(mydata$id,mydata$dstart1))
mydata[,paste("offence",2:maxoff,sep="")]<-NA
# split your data according to id
splitmydata<-split(mydata,mydata$id)
# for each per-id dataset, apply a function that looks for repeated dates and
# fills the "offence" variables in the row holding the first occurrence of each date
splitmydata2 <- lapply(splitmydata,
  function(tab) {
    for (datestart in unique(tab[, "dstart1"])) {
      ind_date <- sort(which(tab[, "dstart1"] == datestart))
      if (length(ind_date[-1])) {
        tab[ind_date[1], grep("^offence", colnames(tab), value = TRUE)[2:(length(ind_date))]] <- as.character(tab[ind_date[-1], "offence"])
      }
    }
    return(tab)
  }
)
mydata2 <- unsplit(splitmydata2, mydata$id) # finally, unsplit your data
> mydata2
id dstart1 offence cod_offence offence2 offence3 offence4
1 1 2006-11-25 a 25 <NA> <NA> <NA>
2 1 2006-12-13 b 26 c <NA> <NA>
3 1 2006-12-13 c 27 <NA> <NA> <NA>
4 2 2006-02-07 b 26 d <NA> <NA>
5 2 2006-02-07 d 28 <NA> <NA> <NA>
6 3 2006-01-15 a 25 <NA> <NA> <NA>
7 3 2006-03-22 a 25 <NA> <NA> <NA>
8 3 2006-09-18 e 29 <NA> <NA> <NA>
9 4 2006-03-04 b 26 a <NA> <NA>
10 4 2006-03-04 a 25 <NA> <NA> <NA>
11 4 2006-08-22 c 27 a <NA> <NA>
12 4 2006-08-22 a 25 <NA> <NA> <NA>
13 5 2006-04-11 a 25 b <NA> <NA>
14 5 2006-04-11 b 26 <NA> <NA> <NA>
15 5 2006-10-19 a 25 c b a
16 5 2006-10-19 c 27 <NA> <NA> <NA>
17 5 2006-10-19 b 25 <NA> <NA> <NA>
18 5 2006-10-19 a 25 <NA> <NA> <NA>
You can use base R
indx <- with(mydata, ave(as.numeric(dstart1), id,
FUN=function(x) c(x[-1]==x[-length(x)], FALSE)))
transform(mydata, offence2=ifelse(!!indx,
c(as.character(offence[-1]), NA), NA))
Or using dplyr
library(dplyr)
mydata %>%
group_by(id) %>%
mutate(offence2= dstart1==lead(dstart1),
offence2= ifelse(!is.na(offence2)&offence2,
as.character(lead(offence)), NA_character_))
# id dstart1 offence cod_offence offence2
#1 1 2006-11-25 a 25 NA
#2 1 2006-12-13 b 26 c
#3 1 2006-12-13 c 27 NA
#4 2 2006-02-07 b 26 d
#5 2 2006-02-07 d 28 NA
#6 3 2006-01-15 a 25 NA
#7 3 2006-03-22 a 25 NA
#8 3 2006-09-18 e 29 NA
#9 4 2006-03-04 b 26 a
#10 4 2006-03-04 a 25 NA
#11 4 2006-08-22 c 27 a
#12 4 2006-08-22 a 25 NA
#13 5 2006-04-11 a 25 b
#14 5 2006-04-11 b 26 NA
#15 5 2006-10-19 a 25 NA
or using data.table
library(data.table)
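# shift(type = "lead") pairs each row with the offence on the next row of the same id
setDT(mydata)[, indx := c(dstart1[-1] == dstart1[-.N], FALSE), by = id][,
  offence2 := ifelse(indx, shift(as.character(offence), type = "lead"),
                     NA_character_), by = id][, indx := NULL]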
mydata
# id dstart1 offence cod_offence offence2
#1: 1 2006-11-25 a 25 NA
#2: 1 2006-12-13 b 26 c
#3: 1 2006-12-13 c 27 NA
#4: 2 2006-02-07 b 26 d
#5: 2 2006-02-07 d 28 NA
#6: 3 2006-01-15 a 25 NA
#7: 3 2006-03-22 a 25 NA
#8: 3 2006-09-18 e 29 NA
#9: 4 2006-03-04 b 26 a
#10: 4 2006-03-04 a 25 NA
#11: 4 2006-08-22 c 27 a
#12: 4 2006-08-22 a 25 NA
#13: 5 2006-04-11 a 25 b
#14: 5 2006-04-11 b 26 NA
#15: 5 2006-10-19 a 25 NA
Update
Applying the first method to the new dataset mydata2 gives d1:
indx <- with(mydata2, ave(as.numeric(dstart1), id,
FUN=function(x) c(x[-1]==x[-length(x)], FALSE)))
d1 <- transform(mydata2, offence2=ifelse(!!indx,
c(as.character(offence[-1]), NA), NA))
From d1, we can create an indx column and then use dcast to convert the offence2 column from long to wide form. If there are columns with all NAs, we can remove them using colSums(is.na(d2)). Rename the columns, then use mutate_each from dplyr to sort the columns, and finally cbind the result with mydata2.
d1$indx <- with(d1, ave(seq_along(id), id, dstart1, FUN=seq_along))
library(reshape2)
d2 <- dcast(d1, id + dstart1+indx~indx, value.var='offence2')
d2New <- d2[,colSums(is.na(d2))!=nrow(d2)]
nm1 <- grep("^\\d",colnames(d2New))
colnames(d2New)[nm1] <- paste0('offence', 2:(length(nm1)+1))
d3 <- d2New[,-3] %>%
group_by(id, dstart1) %>%
mutate_each(funs(.[order(.)])) %>%
ungroup()
cbind(mydata2, d3[,-c(1:2)])
# id dstart1 offence cod_offence offence2 offence3 offence4
#1 1 2006-11-25 a 25 <NA> <NA> <NA>
#2 1 2006-12-13 b 26 c <NA> <NA>
#3 1 2006-12-13 c 27 <NA> <NA> <NA>
#4 2 2006-02-07 b 26 d <NA> <NA>
#5 2 2006-02-07 d 28 <NA> <NA> <NA>
#6 2 2006-04-12 b 26 d c a
#7 2 2006-04-12 d 28 <NA> <NA> <NA>
#8 2 2006-04-12 c 27 <NA> <NA> <NA>
#9 2 2006-04-12 a 25 <NA> <NA> <NA>
data
mydata <- structure(list(id = c(1, 1, 1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5,
5, 5), dstart1 = structure(c(13477, 13495, 13495, 13186, 13186,
13163, 13229, 13409, 13211, 13211, 13382, 13382, 13249, 13249,
13440), class = "Date"), offence = structure(c(1L, 2L, 3L, 2L,
4L, 1L, 1L, 5L, 2L, 1L, 3L, 1L, 1L, 2L, 1L), .Label = c("a",
"b", "c", "d", "e"), class = "factor"), cod_offence = c(25, 26,
27, 26, 28, 25, 25, 29, 26, 25, 27, 25, 25, 26, 25)), .Names = c("id",
"dstart1", "offence", "cod_offence"), row.names = c(NA, -15L),
class = "data.frame")
mydata2 <- structure(list(id = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L),
dstart1 = structure(c(13477, 13495, 13495, 13186, 13186, 13250, 13250,
13250, 13250), class = "Date"), offence = c("a", "b", "c", "b", "d", "b",
"d", "c", "a"), cod_offence = c(25L, 26L, 27L, 26L, 28L, 26L, 28L, 27L, 25L
)), .Names = c("id", "dstart1", "offence", "cod_offence"), row.names =
c("1","2", "3", "4", "5", "6", "7", "8", "9"), class = "data.frame")
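The question also asks to drop the "false" new occasions once the extra offences have been copied across. A minimal sketch of that last step, assuming the first row of each id/date pair is the one to keep (the toy data frame `occasions` is a hypothetical stand-in for any of the results above):

```r
# Hypothetical result after offence2 has been filled in
occasions <- data.frame(
  id       = c(1, 1, 1, 2, 2),
  dstart1  = as.Date(c("2006-11-25", "2006-12-13", "2006-12-13",
                       "2006-02-07", "2006-02-07")),
  offence  = c("a", "b", "c", "b", "d"),
  offence2 = c(NA, "c", NA, "d", NA),
  stringsAsFactors = FALSE
)

# duplicated() flags every repeat of an id/date pair after the first one
first_of_occasion <- !duplicated(occasions[c("id", "dstart1")])
true_occasions <- occasions[first_of_occasion, ]
rownames(true_occasions) <- NULL
true_occasions
```

This relies on the extra offences having been written to the first row of each date, which is what the approaches above do.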
