I have "Y maze" sequence data containing the characters A, B, and C. I am trying to quantify the number of times those three values are found together. The data looks like this:
Animal=c(1,2,3,4,5)
VisitedZones=c(1,2,3,4,5)
data=data.frame(Animal, VisitedZones)
data[1,2]=("A,C,B,A,C,A,B,A,C,A,C,A,C,B,B,C,A,C,C,C")
data[2,2]=("A,C,B,A,C,A,B,A,C,A,C,A,C,B")
data[3,2]=("A,C,B,A,C,A,B,A,C,A")
data[4,2]=("A,C,B,A,C,A,A,A,B,A,C,A,C,A,C,B")
data[5,2]=("A,C,B,A,C,A,A,A,B,")
The tricky part is that I also have to consider the reading frame so that I can find all instances of ABC combinations. There are three reading frames, for example:
Here is the working example I have so far.
library(data.table)  # for data.table()
library(dplyr)       # for mutate() and lead()
Split <- strsplit(data$VisitedZones, ",", fixed = TRUE)
## How long is each list element?
Ncol <- vapply(Split, length, 1L)
## Create an empty character matrix to store the results
M <- matrix(NA_character_, nrow = nrow(data),ncol = max(Ncol),
dimnames = list(NULL, paste0("V", sequence(max(Ncol)))))
## Use matrix indexing to figure out where to put the results
M[cbind(rep(1:nrow(data), Ncol),sequence(Ncol))] <- unlist(Split,
use.names = FALSE)
# Bind the values back together, here as a "data.table" (faster)
v2=data.table(Animal = data$Animal, M)
# I get error here
df=mutate(as.data.frame(v2),trio=paste0(v2,lead(v2),lead(v2,2)))
table(df$trio[1:(length(v2)-2)])
It would be great if I could get something like this:
Animal VisitedZones ABC ACB BCA BAC CAB CBA
1 A,B,C,A,B.C... 2 0 1 0 1 0
2 A,B,C,C... 1 0 0 0 0 0
3 A,C,B,A... 0 1 0 0 0 1
df<-mutate(as.data.frame(v2),trio=paste0(v2,lead(v2),lead(v2,2)))
table(df$trio[1:(length(v2)-2)])
Using dplyr, I generate for every letter in your vector the three-letter combination that starts from it, then create a table of frequencies of all found combinations (minus the last two, which are incomplete).
Result:
AAB ABC BCA CAA CAB
1 6 5 1 4
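The same sliding-window idea can be sketched in base R, without dplyr. This is only an illustration: `count_trios()` is a hypothetical helper name, and the input string is animal 3 from the question.

```r
# Count overlapping three-letter windows in a comma-separated zone string.
# count_trios() is a hypothetical helper, not part of the original answer.
count_trios <- function(zones) {
  x <- strsplit(zones, ",", fixed = TRUE)[[1]]
  x <- x[nzchar(x)]                      # drop empty pieces, if any
  n <- length(x)
  if (n < 3) return(table(character(0)))
  # window k is made of letters k, k+1 and k+2
  trios <- paste0(x[1:(n - 2)], x[2:(n - 1)], x[3:n])
  table(trios)
}

res <- count_trios("A,C,B,A,C,A,B,A,C,A")
print(res)
```

Because the windows overlap by two letters, every reading frame is covered in a single pass.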
Your revised question is basically completely different, so I'll answer it here.
First, I would say your data structure doesn't make much sense to me, so I'll start out by reshaping it into something I can work with:
v2<-as.data.frame(t(v2))
Flip it over so the letters are in columns, not rows;
v2<-tidyr::gather(v2,"v","letter",na.rm=TRUE)
Melt the table so it's long data (so that I'll be able to use lead etc).
v2<-group_by(v2,v)
df=mutate(v2,trio=paste0(letter,lead(letter),lead(letter,2)))
This brings us back basically to where we were at the end of the last question, only the data is grouped by the "animal" variable (here called "v" and represented by V1 thru V5).
df<-df[!grepl("NA",df$trio),]
Even though we removed the unnecessary NA's, we still end up having those pesky ABNA and ANANA etc at the end of each group, so this line will remove anything with an NA in it.
tt<-table(df$v,df$trio)
And finally, we create the table but also break it by "v". The result is this:
AAA AAB ABA ACA ACB ACC BAC BBC BCA CAA CAB CAC CBA CBB CCC
V1 0 0 1 3 2 1 2 1 1 0 1 3 1 1 1
V2 0 0 1 3 2 0 2 0 0 0 1 2 1 0 0
V3 0 0 1 2 1 0 2 0 0 0 1 0 1 0 0
V4 1 1 1 3 2 0 2 0 0 1 0 2 1 0 0
V5 1 1 0 1 1 0 1 0 0 1 0 0 1 0 0
You can now cbind it to your original data to get something like what you described, but it requires just an additional step, because of the way table saves its results:
data<-cbind(data,tidyr::spread(as.data.frame(tt),Var2,Freq))[,-3]
Which ends up looking like this:
Animal VisitedZones AAA AAB ABA ACA ACB ACC BAC BBC BCA CAA CAB CAC CBA CBB CCC
1 1 A,C,B,A,C,A,B,A,C,A,C,A,C,B,B,C,A,C,C,C 0 0 1 3 2 1 2 1 1 0 1 3 1 1 1
2 2 A,C,B,A,C,A,B,A,C,A,C,A,C,B 0 0 1 3 2 0 2 0 0 0 1 2 1 0 0
3 3 A,C,B,A,C,A,B,A,C,A 0 0 1 2 1 0 2 0 0 0 1 0 1 0 0
4 4 A,C,B,A,C,A,A,A,B,A,C,A,C,A,C,B 1 1 1 3 2 0 2 0 0 1 0 2 1 0 0
5 5 A,C,B,A,C,A,A,A,B, 1 1 0 1 1 0 1 0 0 1 0 0 1 0 0
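If only the six orderings of A, B and C are of interest, as in the desired output of the question, one possible sketch is to count the trios against a fixed set of factor levels so that every animal gets the same six columns. The names `perms` and `count_perms()` are made up for this example, and only the first three animals are used.

```r
zones <- c("A,C,B,A,C,A,B,A,C,A,C,A,C,B,B,C,A,C,C,C",
           "A,C,B,A,C,A,B,A,C,A,C,A,C,B",
           "A,C,B,A,C,A,B,A,C,A")

# The six permutations of A, B, C (hypothetical helper vector)
perms <- c("ABC", "ACB", "BAC", "BCA", "CAB", "CBA")

# Count overlapping trios, keeping only the permutations listed in perms;
# trios outside perms become NA in the factor and are not counted by table()
count_perms <- function(z) {
  x <- strsplit(z, ",", fixed = TRUE)[[1]]
  n <- length(x)
  trios <- if (n >= 3) paste0(x[1:(n - 2)], x[2:(n - 1)], x[3:n]) else character(0)
  table(factor(trios, levels = perms))
}

# One row per animal, one column per permutation
counts <- t(sapply(zones, count_perms, USE.NAMES = FALSE))
print(counts)
```

Using `factor(..., levels = perms)` keeps zero-count columns, which a plain `table()` would drop.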
I need to create several matrices based on two criteria: ideo and time.
Here is the part of my code which does not work. I think the problem is in how I use a list object to store the numbers from the new subsets, but I don't know how to index i and j simultaneously in the list. Should I make a list(list())? What other ways are there to code this?
ideo.list<-list(f1,f2,f3,f4,f5,f6,f7,f8,f9)
time.list<-list(t1,t2,t3,t4)
dattime.list<-list()
for (j in 1:length(time.list)){
for (i in 1:length(ideo.list)){
dat.sub<-subset(dat,iyear %in% time.list[[j]] & Ideo %in% ideo.list[[i]])
dattime.list[[i*j]]<-apply(dat.sub[,5:13],2,sum)
}}
nn<-matrix(unlist(dattime.list), byrow=TRUE, ncol=9,nrow=length(dattime.list) )
The head of input data is below:
iyear Ideo Armed.Assault Assassination
1 1982 Separatist / New Regime Nationalist / Ethnic Nationalist 0 0
2 1994 Separatist / New Regime Nationalist / Ethnic Nationalist 0 0
3 1995 Left Wing Terrorist Groups (Anarchist) 0 0
4 2010 Racist Terrorist Groups 1 0
5 2013 Left Wing Terrorist Groups (Anarchist) 0 0
6 2014 Cell Strategy and Terrorist Groups 0 0
Bombing.Explosion Facility.Infrastructure.Attack Hijacking Hostage.Taking..Barricade.Incident.
1 1 0 0 0
2 0 1 0 0
3 0 1 0 0
4 0 0 0 0
5 0 1 0 0
6 0 1 0 0
Hostage.Taking..Kidnapping. Unarmed.Assault Unknown
1 0 0 0
2 0 0 0
3 0 0 0
4 0 0 0
5 0 0 0
6 0 0 0
Thank you for help!
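One possible way around the indexing problem, sketched on toy data: the product i*j is not unique (i = 1, j = 2 and i = 2, j = 1 both give 2), so results overwrite each other. A unique slot for each (i, j) pair can be computed as (j - 1) * length(ideo.list) + i. The toy `dat`, the columns `x1` and `x2`, and the list contents below are assumptions made for illustration, not the real data.

```r
# Toy stand-ins for the objects in the question (hypothetical values)
dat <- data.frame(iyear = c(1982, 1994, 1995, 2010),
                  Ideo  = c("a", "b", "a", "b"),
                  x1    = c(1, 0, 1, 1),
                  x2    = c(0, 1, 1, 0))
ideo.list <- list("a", "b")
time.list <- list(1980:1999, 2000:2019)

n.ideo <- length(ideo.list)
res <- vector("list", n.ideo * length(time.list))

for (j in seq_along(time.list)) {
  for (i in seq_along(ideo.list)) {
    dat.sub <- subset(dat, iyear %in% time.list[[j]] & Ideo %in% ideo.list[[i]])
    # (j - 1) * n.ideo + i is unique for every (i, j) pair
    res[[(j - 1) * n.ideo + i]] <- colSums(dat.sub[, c("x1", "x2")])
  }
}

# One row per (time, ideo) combination
nn <- do.call(rbind, res)
print(nn)
```

A nested list (`res[[j]][[i]]`) would also work; the flat index just keeps the later `rbind` step simple.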
I have a data table of the form
ID REGION INCOME_BAND RESIDENCY_YEARS
1 SW Under 5,000 10-15
2 Wales Over 70,000 1-5
3 Center 15,000-19,999 6-9
4 SE 15,000-19,999 15-19
5 North 15,000-19,999 10-15
6 North 15,000-19,999 6-9
created by
exp = data.table(
ID = c(1,2,3,4,5,6),
REGION=c("SW", "Wales", "Center", "SE", "North", "North"),
INCOME_BAND = c("Under ?5,000", "Over ?70,000", "?15,000-?19,999", "?15,000-?19,999", "?15,000-?19,999","?15,000-?19,999"),
RESIDENCY_YEARS = c("10-15","1-5","6-9","15-19","10-15", "6-9"))
I would like to transform this to
I've managed to perform the majority of the work with dcast:
exp.dcast = dcast(exp,ID~REGION+INCOME_BAND+RESIDENCY_YEARS, fun=length,
value.var=c('REGION', 'INCOME_BAND', 'RESIDENCY_YEARS'))
However I need some help creating sensible column headings.
Currently I have
["ID"
"REGION.1_Center_?15,000-?19,999_6-9"
"REGION.1_North_?15,000-?19,999_10-15"
"REGION.1_North_?15,000-?19,999_6-9"
"REGION.1_SE_?15,000-?19,999_15-19"
"REGION.1_SW_Under ?5,000_10-15"
"REGION.1_Wales_Over ?70,000_1-5"
"INCOME_BAND.1_Center_?15,000-?19,999_6-9"
"INCOME_BAND.1_North_?15,000-?19,999_10-15"
"INCOME_BAND.1_North_?15,000-?19,999_6-9"
"INCOME_BAND.1_SE_?15,000-?19,999_15-19"
"INCOME_BAND.1_SW_Under ?5,000_10-15"
"INCOME_BAND.1_Wales_Over ?70,000_1-5"
"RESIDENCY_YEARS.1_Center_?15,000-?19,999_6-9"
"RESIDENCY_YEARS.1_North_?15,000-?19,999_10-15"
"RESIDENCY_YEARS.1_North_?15,000-?19,999_6-9"
"RESIDENCY_YEARS.1_SE_?15,000-?19,999_15-19"
"RESIDENCY_YEARS.1_SW_Under ?5,000_10-15"
"RESIDENCY_YEARS.1_Wales_Over ?70,000_1-5"
And I would like the column headings to be
ID SW Wales Center SE North Under 5,000 Over 70,000 15,000-19,999 1-5 6-9 10-15 15-19
Could anybody advise?
This apparently simple question is not easy to answer, so we will go forward step by step.
First, the OP has tried to reshape multiple value columns simultaneously, which creates an unwanted cross product of all available combinations.
In order to treat all values in the same way, we need to melt() all value columns first before reshaping:
melt(exp, id.vars = "ID")[, dcast(.SD, ID ~ value, length)]
ID 1-5 10-15 15-19 6-9 ?15,000-?19,999 Center North Over ?70,000 SE SW Under ?5,000 Wales
1: 1 0 1 0 0 0 0 0 0 0 1 1 0
2: 2 1 0 0 0 0 0 0 1 0 0 0 1
3: 3 0 0 0 1 1 1 0 0 0 0 0 0
4: 4 0 0 1 0 1 0 0 0 1 0 0 0
5: 5 0 1 0 0 1 0 1 0 0 0 0 0
6: 6 0 0 0 1 1 0 1 0 0 0 0 0
Now, the result has 13 columns instead of 19 and the columns are named by the respective value as requested.
Unfortunately, the columns appear in the wrong order because they are ordered alphabetically. There are two approaches to change the order:
Change order of columns after reshaping
The setcolorder() function reorders the columns of a data.table in place, i.e. without copying:
# define column order = order of values
col_order <- c("North", "Wales", "Center", "SW", "SE", "Under ?5,000", "?15,000-?19,999", "Over ?70,000", "1-5", "6-9", "10-15", "15-19")
melt(exp, id.vars = "ID")[, dcast(.SD, ID ~ value, length)][
# reorder columns
, setcolorder(.SD, c("ID", col_order))]
ID North Wales Center SW SE Under ?5,000 ?15,000-?19,999 Over ?70,000 1-5 6-9 10-15 15-19
1: 1 0 0 0 1 0 1 0 0 0 0 1 0
2: 2 0 1 0 0 0 0 0 1 1 0 0 0
3: 3 0 0 1 0 0 0 1 0 0 1 0 0
4: 4 0 0 0 0 1 0 1 0 0 0 0 1
5: 5 1 0 0 0 0 0 1 0 0 0 1 0
6: 6 1 0 0 0 0 0 1 0 0 1 0 0
Now, all REGION columns appear first, followed by INCOME_BAND and RESIDENCY_YEARS columns in the specified order.
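To see the in-place behaviour of setcolorder() in isolation, here is a minimal sketch on a made-up three-column table (`dt` and its columns are hypothetical):

```r
library(data.table)

# Small illustrative table, not the OP's data
dt <- data.table(b = 1:2, a = 3:4, ID = 5:6)

# setcolorder() modifies dt by reference, so no assignment is needed
setcolorder(dt, c("ID", "a", "b"))
print(names(dt))
```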
Set factor levels before reshaping
If value is turned into a factor with appropriately ordered factor levels, dcast() will use the factor levels for ordering the columns:
melt(exp, id.vars = "ID")[, value := factor(value, col_order)][
, dcast(.SD, ID ~ value, length)]
ID North Wales Center SW SE Under ?5,000 ?15,000-?19,999 Over ?70,000 1-5 6-9 10-15 15-19
1: 1 0 0 0 1 0 1 0 0 0 0 1 0
2: 2 0 1 0 0 0 0 0 1 1 0 0 0
3: 3 0 0 1 0 0 0 1 0 0 1 0 0
4: 4 0 0 0 0 1 0 1 0 0 0 0 1
5: 5 1 0 0 0 0 0 1 0 0 0 1 0
6: 6 1 0 0 0 0 0 1 0 0 1 0 0
Set factor levels before reshaping - lazy version
If it is sufficient to have the columns grouped by REGION, INCOME_BAND, and RESIDENCY_YEARS then we can use a short cut to avoid specifying each value in col_order. The fct_inorder() function from the forcats package reorders factor levels by their first appearance in a vector:
melt(exp, id.vars = "ID")[, value := forcats::fct_inorder(value)][
, dcast(.SD, ID ~ value, length)]
ID SW Wales Center SE North Under ?5,000 Over ?70,000 ?15,000-?19,999 10-15 1-5 6-9 15-19
1: 1 1 0 0 0 0 1 0 0 1 0 0 0
2: 2 0 1 0 0 0 0 1 0 0 1 0 0
3: 3 0 0 1 0 0 0 0 1 0 0 1 0
4: 4 0 0 0 1 0 0 0 1 0 0 0 1
5: 5 0 0 0 0 1 0 0 1 1 0 0 0
6: 6 0 0 0 0 1 0 0 1 0 0 1 0
This works because the output of melt() is ordered by variable:
melt(exp, id.vars = "ID")
ID variable value
1: 1 REGION SW
2: 2 REGION Wales
3: 3 REGION Center
4: 4 REGION SE
5: 5 REGION North
6: 6 REGION North
7: 1 INCOME_BAND Under ?5,000
8: 2 INCOME_BAND Over ?70,000
9: 3 INCOME_BAND ?15,000-?19,999
10: 4 INCOME_BAND ?15,000-?19,999
11: 5 INCOME_BAND ?15,000-?19,999
12: 6 INCOME_BAND ?15,000-?19,999
13: 1 RESIDENCY_YEARS 10-15
14: 2 RESIDENCY_YEARS 1-5
15: 3 RESIDENCY_YEARS 6-9
16: 4 RESIDENCY_YEARS 15-19
17: 5 RESIDENCY_YEARS 10-15
18: 6 RESIDENCY_YEARS 6-9
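As a sketch of the same idea without the forcats dependency: `factor(value, levels = unique(value))` also orders the levels by first appearance. The small table `dt` below is made up for illustration and is not the OP's data.

```r
library(data.table)

# Small illustrative table (hypothetical values)
dt <- data.table(ID = 1:3,
                 REGION = c("SW", "Wales", "SW"),
                 RESIDENCY_YEARS = c("10-15", "1-5", "6-9"))

long <- melt(dt, id.vars = "ID")

# Levels in order of first appearance in the melted data
long[, value := factor(value, levels = unique(value))]

wide <- dcast(long, ID ~ value, fun.aggregate = length, value.var = "value")
print(names(wide))
```

Since melt() stacks the columns one after another, all REGION values come before all RESIDENCY_YEARS values in the level order.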
R newbie here. I have a simple data table (DT) that holds the household size (NumHH) for households in several US states (Residence):
NumHH Residence
6 AK
4 AL
7 AR
6 AZ
1 CA
2 CO
2 CT
1 AK
4 AL
6 AR
3 AZ
1 CA
6 CO
3 CT
5 AL
By using with(),
with(DT, table(NumHH, Residence))
I can get a table that's close to what I want:
Residence
NumHH AK AL AR AZ CA CO CT
1 1 0 0 0 2 0 0
2 0 0 0 0 0 1 1
3 0 0 0 1 0 0 1
4 0 2 0 0 0 0 0
5 0 1 0 0 0 0 0
6 1 0 1 1 0 1 0
7 0 0 1 0 0 0 0
but I need a table that provides the frequency of several ranges per residence. The frequencies are calculated this way:
##Frequency of ranges per State
One <- DT$NumHH <=1 ##Only 1 person/household
Two_Four <- ((DT$NumHH <=4) - (DT$NumHH <=1)) ##2 to 4 people in Household
OverFour <- DT$NumHH >4 ##More than 4 people in HH
Ideally, the result would look like this:
Residence
NumHH AK AL AR AZ CA CO CT
One 1 0 0 0 2 0 0
Two_Four 0 2 0 1 0 1 2
OverFour 1 1 2 1 0 1 0
I've tried:
with() - I am only able to do one range at a time with "with()", such as:
with(DT, table (One, Residence)) - and that gives me a FALSE row and a TRUE row by state.
data.frames asks me to name each state ("AK", "AL", "AR", etc.), but with() already knows.
I have also tried ddply, but got a list of each calculation's results (150 unlabeled rows in 4 columns, not the desired 3 labeled rows in 50 columns, one per state), so I'm obviously not doing it right.
Any assistance is greatly appreciated.
Use ?cut to establish your groups before using table:
with(DT, table( NumHH=cut(NumHH, c(0,1,4,Inf), labels=c("1","2-4",">4")), Residence))
# Residence
#NumHH AK AL AR AZ CA CO CT
# 1 1 0 0 0 2 0 0
# 2-4 0 2 0 1 0 1 2
# >4 1 1 2 1 0 1 0
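For a self-contained check, the sample data from the question can be rebuilt and binned as follows. This is a sketch using a plain data.frame, and it uses the question's row labels (One, Two_Four, OverFour) instead of the labels in the answer above.

```r
# Sample data copied from the question
DT <- data.frame(
  NumHH = c(6, 4, 7, 6, 1, 2, 2, 1, 4, 6, 3, 1, 6, 3, 5),
  Residence = c("AK", "AL", "AR", "AZ", "CA", "CO", "CT",
                "AK", "AL", "AR", "AZ", "CA", "CO", "CT", "AL")
)

# cut() uses right-closed intervals by default: (0,1], (1,4], (4,Inf]
DT$Band <- cut(DT$NumHH, breaks = c(0, 1, 4, Inf),
               labels = c("One", "Two_Four", "OverFour"))

tab <- table(NumHH = DT$Band, Residence = DT$Residence)
print(tab)
```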
I am working with a large dataset of a fishing fleet and I need to format it for a Poisson regression and other count models. See below for a subset of the data. The count variable is 'days'. p1:p3 are indicator variables for port group and f1:f4 are indicator variables for other fishing activity.
yr week id days rev p1 p2 p3 f1 f2 f3 f4
2016 3 1 1 5568.3 0 1 0 0 0 0 0
2016 4 1 3 8869.53 0 1 0 0 0 0 0
2016 5 1 2 12025.8 0 1 0 0 0 0 0
2016 6 1 2 9126.6 0 1 0 0 0 0 0
2016 7 1 3 4415.4 0 1 0 0 0 0 0
2016 8 1 2 11586.6 0 1 0 0 0 0 0
2016 10 1 1 2144.4 0 1 0 0 0 0 0
2016 11 1 1 2183.25 0 1 0 0 0 0 0
2016 14 1 2 4998 0 1 0 0 0 0 0
2016 15 1 3 117 0 1 0 0 0 0 0
2016 1 2 4 12743.3 0 0 1 1 1 0 0
2016 2 2 2 7473.48 0 0 1 1 0 0 0
2016 5 2 2 8885.52 0 0 1 1 0 0 0
2016 7 2 1 15330.6 0 0 1 1 1 0 0
2016 8 2 2 3763.8 0 0 1 1 1 0 0
2016 9 2 1 2274.05 0 0 1 1 1 0 0
These rows only represent active weeks but I need to incorporate each vessel's inactive weeks. For example, for id=1, in year (yr) 2016 I need to add rows that start at week=1, and then rows for weeks 9,12, and 13. These rows will need to maintain the same information in the dummy categories (these don't change by yr), and have zeros in the 'days' column. I don't need to add rows after the last value of 'week' for that year and vessel.
This is where things get really complicated:
In the revenue (rev) column for these newly created rows I need to add the average revenue for that week and year for all vessels that share the same port group (p1:p3).
Finally, I need to add a new column of lagged revenues. For each row, the value for lagged revenue should be the value in the 'rev' column for the previous week for that vessel in that year.
The value for week 1 for each vessel should be the average of the first 2 weeks of revenue for that vessel in that year.
This task blows my data manipulation skills to smithereens and banging my head against the wall is starting to hurt. Any suggestions would be well appreciated! Thanks.
Thanks to https://stackoverflow.com/users/3001626/david-arenburg, and https://stackoverflow.com/users/2802241/user2802241, the issue has been solved. You can see a post on the adding rows part at:
Adding rows to a data.table according to column values
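library(dplyr)       # group_by(), mutate_each(), %>%
library(tidyr)       # complete(), replace_na()
library(data.table)  # data.table(), setkey()
test<-data.frame(DT %>%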
group_by(yr, id) %>%
complete(week = 1:max(week)) %>%
replace_na(list(days = 0)) %>%
group_by(yr, id) %>%
mutate_each(funs(replace(., is.na(.), mean(., na.rm = T))), p1:f4))
poisson<-data.table(test)
setkey(poisson,yr,id,week)
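avrev<-poisson[,.(avrev = mean(rev,na.rm=TRUE)),by=.(p1,p2,p3,week,yr)]
avrev<-transform(avrev,xyz=interaction(p1,p2,p3,week,yr,sep=''))
# build the same key (p1:p3, week, yr) on the main table so it can be matched against avrev
poisson<-transform(poisson,xyz=interaction(p1,p2,p3,week,yr,sep=''))
# one id per vessel and year, used to lag revenue within each vessel-year
poisson<-transform(poisson,uniqueid=interaction(id,yr,sep=''))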
poisson$rev[is.na(poisson$rev)]<- avrev$avrev[match(poisson$xyz[is.na(poisson$rev)],avrev$xyz)]
poisson[, lagrev:=c(rev[1], rev[-.N]), by=uniqueid]
I'm sure there is a much nicer and neater way to accomplish the task but this works. David Arenburg also posted an answer in the comments section that utilizes data.table to create the new rows- see the other post.
To get the average of revenue by week, year, p1, p2, and p3 just use the aggregate function:
average_rev <- aggregate(rev~week+yr+p1+p2+p3, data=your_dataframe, FUN=mean)
To add a new column of lagged revenues:
your_dataframe$lagged_rev <- c(NA, your_dataframe$rev[1:(nrow(your_dataframe)-1)])
To get average rev for the last two weeks:
your_dataframe$avg_rev <- rowMeans(your_dataframe[,c('rev','lagged_rev')])
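The lag above runs across the whole data frame, so the first row of each vessel would pick up the previous vessel's revenue. A possible sketch of a per-vessel, per-year lag in base R uses ave(). It fills the first week with the mean of that vessel's first two weeks of revenue, as the question asks. The toy `df` and its values are made up for illustration.

```r
# Toy data: two vessels, one year (hypothetical values)
df <- data.frame(id   = c(1, 1, 1, 2, 2),
                 yr   = 2016,
                 week = c(1, 2, 3, 1, 2),
                 rev  = c(10, 20, 30, 5, 15))

# Rows must be in week order within each vessel and year
df <- df[order(df$id, df$yr, df$week), ]

# Previous week's revenue within each (id, yr) group; NA for the first week
df$lagrev <- ave(df$rev, df$id, df$yr,
                 FUN = function(x) c(NA, head(x, -1)))

# Mean of the first two weeks of revenue within each (id, yr) group
first2 <- ave(df$rev, df$id, df$yr,
              FUN = function(x) mean(head(x, 2)))

# Use that mean wherever no previous week exists
miss <- is.na(df$lagrev)
df$lagrev[miss] <- first2[miss]
print(df)
```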