I have a data.table in which I have records belonging to multiple groupings. I want to count the number of records that fall into the same group for two variables, where the grouping variables may include some NAs.
Example data below:
library(data.table)
mydt <- data.table(id = c(1,2,3,4,5,6),
travel = c("no travel", "morocco", "algeria",
"morocco", "morocco", NA),
cluster = c(1,1,1,2,2,2))
> mydt
id travel cluster
1: 1 no travel 1
2: 2 morocco 1
3: 3 algeria 1
4: 4 morocco 2
5: 5 morocco 2
6: 6 <NA> 2
In the above example I want to calculate how many people travelled to each destination by cluster.
Initially I was doing this using the .N notation, as below:
mydt[, ndest1 := as.double(.N), by = c("cluster", "travel")]
> mydt
id travel cluster ndest1
1: 1 no travel 1 1
2: 2 morocco 1 1
3: 3 algeria 1 1
4: 4 morocco 2 2
5: 5 morocco 2 2
6: 6 <NA> 2 1
However, NAs are counted as a value. This doesn't work well for my purposes, since I later want to use max(...) to identify which destination within each cluster the most people travelled to (morocco in cluster 2 above). If there are a lot of NAs in a given cluster, 'NA' will incorrectly be flagged as the most popular destination.
I then tried using sum() instead, as this is intuitive and also allows me to exclude NAs:
mydt[, ndest2 := sum(!is.na(travel)), by = c("cluster", "travel")]
> mydt
id travel cluster ndest1 ndest2
1: 1 no travel 1 1 1
2: 2 morocco 1 1 1
3: 3 algeria 1 1 1
4: 4 morocco 2 2 1
5: 5 morocco 2 2 1
6: 6 <NA> 2 1 0
This gives incorrect results - after a bit of further testing, it seems to be because I have used the same variable for the logic statement within sum(...) as one of the grouping variables in the by statement.
When I use a different variable I get the desired result except that I am not able to exclude NAs this way:
mydt[, ndest3 := sum(!is.na(id)), by = c("cluster", "travel")]
> mydt
id travel cluster ndest1 ndest2 ndest3
1: 1 no travel 1 1 1 1
2: 2 morocco 1 1 1 1
3: 3 algeria 1 1 1 1
4: 4 morocco 2 2 1 2
5: 5 morocco 2 2 1 2
6: 6 <NA> 2 1 0 1
This leads me to two questions:
In a data.table conditional count, how do I exclude NAs?
Why can't the same variable be used in the sum logic statement and as a grouping variable after by?
Any insights would be much appreciated.
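You can exclude the NAs in i, so that only rows with a non-missing travel value are grouped and counted; the excluded rows keep NA in the new column: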
mydt[!is.na(travel), ndest1 := .N, by = .(travel, cluster)][]
# id travel cluster ndest1
#1: 1 no travel 1 1
#2: 2 morocco 1 1
#3: 3 algeria 1 1
#4: 4 morocco 2 2
#5: 5 morocco 2 2
#6: 6 <NA> 2 NA
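On the second question: as far as I understand data.table's grouping, each column named in by is available inside j as a single value per group, not one value per row. sum(!is.na(travel)) therefore sees a length-1 vector and can only return 0 or 1. The sketch below illustrates this (the column names len_travel, rows and ndest are made up for the example) and shows one way to get a count that is 0 rather than NA for the missing-destination rows:

```r
library(data.table)

mydt <- data.table(id = 1:6,
                   travel = c("no travel", "morocco", "algeria",
                              "morocco", "morocco", NA),
                   cluster = c(1, 1, 1, 2, 2, 2))

# Inside j, the grouping column `travel` has length 1 in every group,
# while .N still reports the number of rows in that group
len <- mydt[, .(len_travel = length(travel), rows = .N),
            by = .(cluster, travel)]

# Count rows with .N and zero the count for the NA group
mydt[, ndest := .N * (!is.na(travel)), by = .(cluster, travel)]
```

Whether you want NA (as with the filter in i) or 0 for those rows depends on how you use the column afterwards; with max(...) either choice keeps the NA group from winning.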
Say I have the following data.table:
dt <- data.table("x1"=c(1:10), "x2"=c(1:10),"y1"=c(10:1),"y2"=c(10:1), desc = c("a","a","a","b","b","b","b","b","c","c"))
I want to sum the columns starting with an 'x', and sum the columns starting with a 'y', by desc. At the moment I do this by:
dt[,.(Sumx=sum(x1,x2), Sumy=sum(y1,y2)), by=desc]
which works, but I would like to refer to all columns with "x" or "y" by their column names, e.g. using grepl().
Please could you advise me how to do so? I think I need to use with=FALSE, but cannot get it to work in combination with by=desc.
One-liner:
melt(dt, id="desc", measure.vars=patterns("^x", "^y"), value.name=c("x","y"))[,
lapply(.SD, sum), by=desc, .SDcols=x:y]
Long version (by @Frank):
First, you probably don't want to store your data like that. Instead...
m = melt(dt, id="desc", measure.vars=patterns("^x", "^y"), value.name=c("x","y"))
desc variable x y
1: a 1 1 10
2: a 1 2 9
3: a 1 3 8
4: b 1 4 7
5: b 1 5 6
6: b 1 6 5
7: b 1 7 4
8: b 1 8 3
9: c 1 9 2
10: c 1 10 1
11: a 2 1 10
12: a 2 2 9
13: a 2 3 8
14: b 2 4 7
15: b 2 5 6
16: b 2 6 5
17: b 2 7 4
18: b 2 8 3
19: c 2 9 2
20: c 2 10 1
Then you can do...
setnames(m[, lapply(.SD, sum), by=desc, .SDcols=x:y], 2:3, paste0("Sum", c("x", "y")))[]
# desc Sumx Sumy
#1: a 12 54
#2: b 60 50
#3: c 38 6
For more on improving the data structure you're working with, read about tidying data.
Using mget with grep is another option: grep("^x", ..., value = TRUE) returns the names of the columns starting with x, mget fetches those columns within each group, and unlist flattens them into one vector that you can sum:
dt[,.(Sumx=sum(unlist(mget(grep("^x", names(dt), value = TRUE)))),
Sumy=sum(unlist(mget(grep("^y", names(dt), value = TRUE))))), by=desc]
# desc Sumx Sumy
#1: a 12 54
#2: b 60 50
#3: c 38 6
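Since the question mentions with=FALSE, here is a minimal two-step sketch of that route, offered as one possible alternative rather than the definitive way: first sum every column by desc, then add up the matching columns selected by name. The helper names xcols, ycols, sums and res are only for illustration.

```r
library(data.table)

dt <- data.table(x1 = 1:10, x2 = 1:10, y1 = 10:1, y2 = 10:1,
                 desc = c("a", "a", "a", "b", "b", "b", "b", "b", "c", "c"))

# Column names matched by pattern
xcols <- grep("^x", names(dt), value = TRUE)
ycols <- grep("^y", names(dt), value = TRUE)

# Step 1: per-group sum of every non-grouping column
sums <- dt[, lapply(.SD, sum), by = desc]

# Step 2: select the columns by name (with = FALSE) and add them row-wise
res <- data.table(desc = sums$desc,
                  Sumx = rowSums(sums[, xcols, with = FALSE]),
                  Sumy = rowSums(sums[, ycols, with = FALSE]))
```

This avoids mixing with=FALSE and by in the same call, which is where the original attempt got stuck.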
I have a data frame with columns for "Count","Transect Number","Data", and "Year". My goal is to split up the data frame by Transect, then again by Year, and create a new data frame with a column for "Transect", and then the appropriate data per Year in the following columns.
To build a dummy data frame:
Count1<-1:27
Count2<-1:30
Count3<-1:25
T1<-c(1,1,1,2,2,2,3,3,3,1,1,1,2,2,2,3,3,3,1,1,1,2,2,2,3,3,3)
T2<-c(1,1,1,2,2,2,3,3,3,1,1,1,2,2,2,3,3,3,1,1,1,1,2,2,2,2,3,3,3,3)
T3<-c(1,1,1,2,2,2,3,3,3,1,1,1,2,2,2,3,3,3,1,1,1,2,2,2,3)
Data1<-c(1,2,3,2,1,2,3,4,3,2,1,2,3,4,3,2,1,2,3,4,5,4,3,2,3,3,2)
Data2<-c(1,2,3,2,1,4,3,2,1,2,4,3,2,3,4,3,2,3,4,5,6,4,3,2,1,4,5,4,3,2)
Data3<-c(1,2,3,4,5,4,3,3,3,4,5,4,3,3,2,3,4,5,4,3,4,3,2,3,4)
Year1<-c(2014,2014,2014,2014,2014,2014,2014,2014,2014,2015,2015,2015,2015,2015,2015,2015,2015,2015,2016,2016,2016,2016,2016,2016,2016,2016,2016)
Year2<-c(2014,2014,2014,2014,2014,2014,2014,2014,2014,2015,2015,2015,2015,2015,2015,2015,2015,2015,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016)
Year3<-c(2014,2014,2014,2014,2014,2014,2014,2014,2014,2015,2015,2015,2015,2015,2015,2015,2015,2015,2016,2016,2016,2016,2016,2016,2016)
DF1<-data.frame(Count1,T1,Data1,Year1)
colnames(DF1)<-c("Count","Transect","Data","Year")
DF2<-data.frame(Count2,T2,Data2,Year2)
colnames(DF2)<-c("Count","Transect","Data","Year")
DF3<-data.frame(Count3,T3,Data3,Year3)
colnames(DF3)<-c("Count","Transect","Data","Year")
All<-rbind(DF1,DF2,DF3)
Once I have the data frame, my thought was to split up the data by transect since this will be a permanent aspect of my ongoing data set.
#Step 1-Break down by T
Trans1<-All[All$Transect==1,]
Trans2<-All[All$Transect==2,]
Trans3<-All[All$Transect==3,]
Trans4<-All[All$Transect==4,]
Trans5<-All[All$Transect==5,]
But I'm a little less clear on the next step. I need to pull out data from the "Data" column organized by year. Something along the lines of further breaking down the data like so:
Trans1_Year1<-Trans1[Trans1$Year==2014,]
Trans2_Year1<-Trans2[Trans2$Year==2014,]
Trans3_Year1<-Trans3[Trans3$Year==2014,]
Trans4_Year1<-Trans4[Trans4$Year==2014,]
Trans5_Year1<-Trans5[Trans5$Year==2014,]
or even using split
ByYear1<-split(Trans1,Trans1$Year)
But I would prefer to avoid writing out the code as above as I hope to add new data every year as this data set progresses. And I'd like the code to be able to accommodate new "Year" data as it is added, as opposed to writing out new lines of code every year.
Once I have the data set up like so, I'd like to create a second data frame with columns for each year. One problem is that each year contains a different number of rows, which has been an issue for me. But my final result would have columns:
"Transect", "Data 2014", "Data 2015", "Data 2016"
Since each year can have a different number of rows within a transect, I'd like to leave NAs at the end of each Transect section when the number of rows per individual transect differs between years.
It sounds like you are basically trying to convert your data into a semi-wide format, with columns for years, rather than keeping it in the "long" format.
If this is the case, you're better off adding a secondary index column that shows the repeated combination of "Transect" and "Year".
This can easily be done with getanID from my "splitstackshape" package. "splitstackshape" also loads "data.table", from which you could then use dcast.data.table to get a wide format.
library(splitstackshape)
dcast.data.table(getanID(All, c("Transect", "Year")),
Transect + .id ~ Year, value.var = "Data")
# Transect .id 2014 2015 2016
# 1: 1 1 1 2 3
# 2: 1 2 2 1 4
# 3: 1 3 3 2 5
# 4: 1 4 1 2 4
# 5: 1 5 2 4 5
# 6: 1 6 3 3 6
# 7: 1 7 1 4 4
# 8: 1 8 2 5 4
# 9: 1 9 3 4 3
# 10: 1 10 NA NA 4
# 11: 2 1 2 3 4
# 12: 2 2 1 4 3
# 13: 2 3 2 3 2
# 14: 2 4 2 2 3
# 15: 2 5 1 3 2
# 16: 2 6 4 4 1
# 17: 2 7 4 3 4
# 18: 2 8 5 3 3
# 19: 2 9 4 2 2
# 20: 2 10 NA NA 3
# 21: 3 1 3 2 3
# 22: 3 2 4 1 3
# 23: 3 3 3 2 2
# 24: 3 4 3 3 5
# 25: 3 5 2 2 4
# 26: 3 6 1 3 3
# 27: 3 7 3 3 2
# 28: 3 8 3 4 4
# 29: 3 9 3 5 NA
# Transect .id 2014 2015 2016
Then, if you really want to split on the "Transect" column you can go ahead and use split, but since you now have a "data.table" it would be better to stick with that and take advantage of its many convenient features, including those related to subsetting and aggregation.
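If you would rather not depend on "splitstackshape", the same idea can be sketched with "data.table" alone: number the rows within each Transect/Year combination and cast on that index. This is a minimal sketch on a small made-up subset (the values and the column name idx are assumptions for illustration, not the data above):

```r
library(data.table)

# Small hypothetical example with uneven numbers of rows per year
All <- data.table(Transect = c(1, 1, 1, 1, 1, 2, 2, 2),
                  Year     = c(2014, 2014, 2015, 2016, 2016, 2014, 2015, 2015),
                  Data     = c(1, 2, 3, 4, 5, 6, 7, 8))

# Secondary index: position of each row within its Transect/Year group
All[, idx := seq_len(.N), by = .(Transect, Year)]

# One column per year; combinations with no data are filled with NA
wide <- dcast(All, Transect + idx ~ Year, value.var = "Data")
```

Because the year columns come from the values present in Year, data for a new year should show up as a new column without any change to the code.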
I think you are forcing your data into a format it does not have naturally. There are a lot of processing advantages to leaving it in "long" format. Have a look at this article if you have not seen it yet; it is a classic.
http://www.jstatsoft.org/v21/i12