Apply function across multiple columns - R

Please find here a very small subset of a long data.table I am working with
dput(dt)
structure(list(id = 1:15, pnum = c(4298390L, 4298390L, 4298390L,
4298558L, 4298558L, 4298559L, 4298559L, 4299026L, 4299026L, 4299026L,
4299026L, 4300436L, 4300436L, 4303566L, 4303566L), invid = c(15L,
101L, 102L, 103L, 104L, 103L, 104L, 106L, 107L, 108L, 109L, 87L,
111L, 2L, 60L), fid = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 2L,
4L, 4L, 4L, 4L, 3L, 3L, 2L, 2L), .Label = c("CORN", "DowCor",
"KIM", "Texas"), class = "factor"), dom_kn = c(1L, 0L, 0L, 0L,
1L, 0L, 1L, 0L, 0L, 0L, 0L, 1L, 0L, 1L, 1L), prim_kn = c(1L,
0L, 0L, 0L, 1L, 0L, 1L, 0L, 0L, 0L, 0L, 1L, 0L, 1L, 0L), pat_kn = c(1L,
0L, 0L, 0L, 1L, 0L, 1L, 0L, 0L, 0L, 0L, 1L, 0L, 1L, 0L), net_kn = c(1L,
0L, 0L, 1L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 1L, 0L, 1L, 1L), age_kn = c(1L,
0L, 0L, 1L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 1L, 0L, 1L, 0L), legclaims = c(5L,
0L, 0L, 2L, 5L, 2L, 5L, 0L, 0L, 0L, 0L, 5L, 0L, 5L, 2L), n_inv = c(3L,
3L, 3L, 2L, 2L, 2L, 2L, 4L, 4L, 4L, 4L, 2L, 2L, 2L, 2L)), .Names = c("id",
"pnum", "invid", "fid", "dom_kn", "prim_kn", "pat_kn", "net_kn",
"age_kn", "legclaims", "n_inv"), class = "data.frame", row.names = c(NA,
-15L))
I am looking to apply a tweaked greater-than comparison across 5 different columns.
Within each pnum (patent), there are multiple invid (inventors). I want to compare the values of the columns dom_kn, prim_kn, pat_kn, net_kn, and age_kn per row, to the values in the other rows with the same pnum. The comparison is simply > and if the value is indeed bigger than the other, one "point" should be attributed.
So for the first row, pnum == 4298390 and invid == 15, you can see the values in the five columns are all 1, while the values for invid == 101 | 102 are all zero. This means that if we individually compare each value in the first row to each cell in the second and third rows (is it greater?), the total would be 10 points: in every single comparison the value in the first row is bigger, and there are 10 comparisons.
The number of comparisons is by design 5 * (n_inv - 1).
The result I am looking for, for row 1, should then be 10 / 10 = 1.
For pnum == 4298558 the columns net_kn and age_kn both have values 1 in the two rows (for invid 103 and 104), so each should get 0.5 points (if there were three inventors with value 1, each should get 0.33 points). The same goes for pnum == 4298559.
For the next pnum == 4299026 all values are zero so every comparison should result in 0 points.
Thus, note the difference: there are three kinds of dyadic comparisons
1 > 0 --> assign 1
1 = 1 --> assign 1 / number of positive values in column subset
0 = 0 --> assign 0
Desired result
An extra column result in the data.table with values 1 0 0 0.2 0.8 0.2 0.8 0 0 0 0 1 0 0.8 0.2
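For reference, a literal (and deliberately slow) base R sketch of the rules above, useful for sanity-checking a faster solution; it applies rule 2 once per pairwise comparison and assumes the rows are ordered by pnum, as in the dput above:
vars <- grep("_kn", names(dt), value = TRUE)
dt$result <- unname(unlist(lapply(split(dt, dt$pnum), function(g) {
  sapply(seq_len(nrow(g)), function(i) {
    pts <- 0
    for (v in vars) {
      x <- g[[v]]
      k <- sum(x == 1)  # number of positive values in the column subset
      for (j in seq_len(nrow(g))[-i]) {
        if (x[i] > x[j]) pts <- pts + 1                      # 1 > 0
        else if (x[i] == 1 && x[j] == 1) pts <- pts + 1 / k  # 1 = 1
        # 0 = 0 contributes nothing
      }
    }
    pts / (length(vars) * (nrow(g) - 1))  # 5 * (n_inv - 1) comparisons
  })
})))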
Any suggestions on how to compute this efficiently?
Thanks!

library(data.table)
setDT(dt)  # the dput() above creates a data.frame

vars = grep('_kn', names(dt), value = T)
# all you need to do is simply assign the correct weight and sum the numbers up:
# grouping by c('pnum', var) makes .N the count of rows sharing that value,
# so each 1 contributes 1/k (k = number of ones) and each 0 contributes 0
dt[, res := 0]
for (var in vars)
  dt[, res := res + get(var) / .N, by = c('pnum', var)]
# normalize
dt[, res := res/sum(res), by = pnum]
# id pnum invid fid dom_kn prim_kn pat_kn net_kn age_kn legclaims n_inv res
# 1: 1 4298390 15 CORN 1 1 1 1 1 5 3 1.0
# 2: 2 4298390 101 CORN 0 0 0 0 0 0 3 0.0
# 3: 3 4298390 102 CORN 0 0 0 0 0 0 3 0.0
# 4: 4 4298558 103 DowCor 0 0 0 1 1 2 2 0.2
# 5: 5 4298558 104 DowCor 1 1 1 1 1 5 2 0.8
# 6: 6 4298559 103 DowCor 0 0 0 1 1 2 2 0.2
# 7: 7 4298559 104 DowCor 1 1 1 1 1 5 2 0.8
# 8: 8 4299026 106 Texas 0 0 0 0 0 0 4 NaN
# 9: 9 4299026 107 Texas 0 0 0 0 0 0 4 NaN
#10: 10 4299026 108 Texas 0 0 0 0 0 0 4 NaN
#11: 11 4299026 109 Texas 0 0 0 0 0 0 4 NaN
#12: 12 4300436 87 KIM 1 1 1 1 1 5 2 1.0
#13: 13 4300436 111 KIM 0 0 0 0 0 0 2 0.0
#14: 14 4303566 2 DowCor 1 1 1 1 1 5 2 0.8
#15: 15 4303566 60 DowCor 1 0 0 1 0 2 2 0.2
Dealing with the above NaN case (arguably the correct answer) is left to the reader.
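If zeros are the preferred convention (the desired result above uses 0 for those rows), a minimal sketch:
dt[is.nan(res), res := 0]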

Here's a fastish solution using dplyr:
library(dplyr)
dt %>%
  group_by(pnum) %>% # group by pnum
  mutate_each(funs(. == max(.) & max(.) != 0), ends_with('kn')) %>%
  # give a 1 if the value is the max and not 0, only for the columns ending in 'kn'
  mutate_each(funs(. / sum(.)), ends_with('kn')) %>%
  # correct for multiple maximums
  select(ends_with('kn')) %>%
  # remove all non-'kn' columns
  do(data.frame(x = rowSums(.[-1]), y = sum(.[-1]))) %>%
  # make a new data frame with x = the row sums for each individual
  # and y = the column sums
  mutate(out = x/y)
  # divide by y (we could just use /5 if we always have five columns)
giving your desired output in the column out:
Source: local data frame [15 x 4]
Groups: pnum [6]
pnum x y out
(int) (dbl) (dbl) (dbl)
1 4298390 5 5 1.0
2 4298390 0 5 0.0
3 4298390 0 5 0.0
4 4298558 1 5 0.2
5 4298558 4 5 0.8
6 4298559 1 5 0.2
7 4298559 4 5 0.8
8 4299026 NaN NaN NaN
9 4299026 NaN NaN NaN
10 4299026 NaN NaN NaN
11 4299026 NaN NaN NaN
12 4300436 5 5 1.0
13 4300436 0 5 0.0
14 4303566 4 5 0.8
15 4303566 1 5 0.2
The NaNs come from the groups with no winners; convert them back using, e.g.:
x[is.na(x)] <- 0
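Or, staying in the pipe (a sketch; result is a hypothetical name for the output of the chain above, and coalesce() treats NaN as missing since is.na(NaN) is TRUE):
result %>% mutate(out = coalesce(out, 0))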

Related

Cross tabulation of two variables with averages in R?

I have the following social network dataset where participants (ego) were asked who provided social, work, and care support in their lives. Those who provided support (alter) were classified according to their relationship with ego (circle) resulting in the following dataset:
ego alter circle social work care
3400 3403 1 0 0 1
3400 3402 1 0 1 0
3400 3401 1 1 0 0
3500 3504 1 0 0 0
3500 3503 1 0 0 0
3500 3502 1 0 1 1
3500 3501 2 1 0 0
3600 3604 1 0 0 0
3600 3603 3 0 0 1
3600 3602 3 0 1 0
3600 3601 2 1 0 0
3700 3702 1 0 1 1
3700 3703 1 0 0 1
3700 3701 2 1 0 0
…
So, for example, in row 1, alter 3403 of social circle 1, did not provide social or work support but provided care support for ego 3400.
My question for you all is: how can I cross tabulate the variable circle with each of the support variables (social, work, and care) and then calculate the averages with ego?
Below is the resulting cross tabulation with totals and percentages, but I need the averages taking into account each ego.
[Crosstab result image]
First, reproducible data using dput():
social <- structure(list(ego = c(3400L, 3400L, 3400L, 3500L, 3500L, 3500L,
3500L, 3600L, 3600L, 3600L, 3600L, 3700L, 3700L, 3700L), alter = c(3403L,
3402L, 3401L, 3504L, 3503L, 3502L, 3501L, 3604L, 3603L, 3602L,
3601L, 3702L, 3703L, 3701L), circle = c(1L, 1L, 1L, 1L, 1L, 1L,
2L, 1L, 3L, 3L, 2L, 1L, 1L, 2L), social = c(0L, 0L, 1L, 0L, 0L,
0L, 1L, 0L, 0L, 0L, 1L, 0L, 0L, 1L), work = c(0L, 1L, 0L, 0L,
0L, 1L, 0L, 0L, 0L, 1L, 0L, 1L, 0L, 0L), care = c(1L, 0L, 0L,
0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 1L, 0L)), class = "data.frame", row.names = c(NA,
-14L))
Now, counts,
(tbl.count <- aggregate(cbind(social, work, care)~circle, social, sum))
# circle social work care
# 1 1 1 3 4
# 2 2 3 0 0
# 3 3 0 1 1
and means,
(tbl.mean <- aggregate(cbind(social, work, care)~circle, social, mean))
# circle social work care
# 1 1 0.1111111 0.3333333 0.4444444
# 2 2 1.0000000 0.0000000 0.0000000
# 3 3 0.0000000 0.5000000 0.5000000
and percentages,
(tbl.pct <- aggregate(cbind(social, work, care)~circle, social, function(x) mean(x)*100))
# circle social work care
# 1 1 11.11111 33.33333 44.44444
# 2 2 100.00000 0.00000 0.00000
# 3 3 0.00000 50.00000 50.00000
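If "averages with ego" means averaging within each ego first and then across egos, a sketch on the same data (assuming that reading of the question):
per.ego <- aggregate(cbind(social, work, care) ~ ego + circle, social, mean)
aggregate(cbind(social, work, care) ~ circle, per.ego, mean)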

R: data.table compare sets of rows

I am working in R with data.tables. I have the following data.table encoding a set of points with coordinates A, B, C, D, and a set column encoding which set each point belongs to.
library(data.table)
A B C D set
1: 0 0 0 0 1
2: 1 0 1 0 2
3: 1 1 1 0 2
4: 0 1 0 0 2
5: 1 0 1 1 2
6: 0 1 0 0 3
7: 1 1 0 0 3
8: 0 0 1 0 4
9: 1 0 1 0 4
10: 0 1 0 1 4
11: 0 0 0 0 5
12: 1 0 0 0 5
13: 1 1 1 0 5
14: 1 1 1 1 5
dt = setDT(structure(list(A = c(0L, 1L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L,
0L, 1L, 1L, 1L), B = c(0L, 0L, 1L, 1L, 0L, 1L, 1L, 0L, 0L, 1L,
0L, 0L, 1L, 1L), C = c(0L, 1L, 1L, 0L, 1L, 0L, 0L, 1L, 1L, 0L,
0L, 0L, 1L, 1L), D = c(0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 1L,
0L, 0L, 0L, 1L), set = c(1L, 2L, 2L, 2L, 2L, 3L, 3L, 4L, 4L,
4L, 5L, 5L, 5L, 5L)), .Names = c("A", "B", "C", "D", "set"), row.names = c(NA,
-14L), class = "data.frame"))
I have another table encoding, e.g., the probability (mass) of each set.
set mass
1: 1 0.27809187
2: 2 0.02614841
3: 3 0.36890459
4: 4 0.28975265
5: 5 0.03710247
wt = setDT(structure(list(set = 1:5, mass = c(0.27809187, 0.02614841, 0.36890459,
0.28975265, 0.03710247)), .Names = c("set", "mass"), row.names = c(NA,
-5L), class = "data.frame"))
I would like to have a procedure to create a projection onto a subspace, e.g. C,D (note that the original points 1, 4, 6, 7, 11, and 12 coincide in this case; sets 1 and 3 are the same in this subspace, as are sets 2 and 5):
unique(dt[,c("C","D", "set")])
> C D set
1: 0 0 1
2: 1 0 2
3: 0 0 2
4: 1 1 2
5: 0 0 3
6: 1 0 4
7: 0 1 4
8: 0 0 5
9: 1 0 5
10: 1 1 5
and then to identify identical sets, keep just the unique ones, and sum the corresponding masses, i.e. in this case:
> C D set
1: 0 0 1
2: 1 0 2
3: 0 0 2
4: 1 1 2
5: 1 0 4
6: 0 1 4
set mass
1: 1 0.6469965 % set 1 + set 3
2: 2 0.06325088 % set 2 + set 5
3: 4 0.36890459
Thank you for your ideas.
Similar in concept to Frank's, we can map the binary values of each set to a decimal with x * 2 ^ ((length(x) - 1):0). Subsetting for "C" and "D" as well, we get:
coords = c("C", "D")
d = data.frame(set = dt$set,
               val = Reduce("+", Map("*", list(dt$C, dt$D),
                                     2 ^ ((length(coords) - 1):0))))
d
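As a hypothetical generalization, the value list can be built from coords instead of hardcoding dt$C and dt$D (a sketch):
d = data.frame(set = dt$set,
               val = Reduce("+", Map("*", as.list(dt)[coords],
                                     2 ^ ((length(coords) - 1):0))))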
Then, we can group identical sets following the same idea:
tab = table(d$val, d$set) > 0L ## `table(d) > 0` to ignore the duplicates
gr = colSums(tab * (2 ^ ((nrow(tab) - 1):0)))
gr
# 1 2 3 4 5
# 8 11 8 6 11
## another (pre-edit) alternative with unnecessary overhead
#gr = cutree(hclust(dist(table(d) > 0L)), h = 0)
#gr
#1 2 3 4 5
#1 2 1 3 2
and aggregate based on this group:
rowsum(wt$mass[match(names(gr), wt$set)], gr, reorder = FALSE)
# [,1]
#8 0.64699646
#11 0.06325088
#6 0.28975265
A somewhat clunky option: make a unique string for each set, and then group on that.
coords = c("C", "D")
gDT = setorder(unique(dt[, c(coords, "set"), with = FALSE]))[,
    .(s = paste(do.call(paste, c(.SD, .(sep = "_"))), collapse = "."))
    , by = set, .SDcols = coords][,
    g := .GRP
    , by = s][]
# set s g
# 1: 1 0_0 1
# 2: 2 0_0.1_0.1_1 2
# 3: 3 0_0 1
# 4: 5 0_0.1_0.1_1 2
# 5: 4 0_1.1_0 3
gDT[wt, on=.(set), mass := i.mass ]
gDT[, .(set = first(set), mass = sum(mass)), by=g]
# g set mass
# 1: 1 1 0.64699646
# 2: 2 2 0.06325088
# 3: 3 4 0.28975265
Comments
You can get rid of the g column by chaining [, g := NULL][] onto the last line.
setorder is just sorting the data so that the unique string turns out the same for sets that are the same.
Grouped first and sum operations are optimized, as you can see if you add verbose = TRUE to the final line, like gDT[, .(set = first(set), mass = sum(mass)), by=g, verbose=TRUE].

Calculate the rate under the same id using R

I have a question about calculating the rate within the same id numbers.
Here is the sample dataset d:
id answer
1 1
1 0
1 0
1 1
1 1
1 1
1 0
2 0
2 0
2 0
3 1
3 0
The ideal output is
id rate freq
1 4/7 (=0.5714) 7
2 0 3
3 1/2 (=0.5) 2
Thanks.
Just for fun, you can use aggregate
> aggregate(answer~id, function(x) c(rate=mean(x), freq=length(x)), data=df1)
id answer.rate answer.freq
1 1 0.5714286 7.0000000
2 2 0.0000000 3.0000000
3 3 0.5000000 2.0000000
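Note that answer here is a single matrix column; a sketch to flatten it into ordinary columns (same df1):
agg <- aggregate(answer ~ id, data = df1,
                 function(x) c(rate = mean(x), freq = length(x)))
do.call(data.frame, agg)  # splits the matrix column into answer.rate / answer.freq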
Try
library(data.table)
setDT(df1)[, list(rate = mean(answer), freq = .N), id]
# id rate freq
#1: 1 0.5714286 7
#2: 2 0.0000000 3
#3: 3 0.5000000 2
Or
library(dplyr)
df1 %>%
  group_by(id) %>%
  summarise(rate = mean(answer), freq = n())
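If the literal "x/y" formatting from the ideal output is wanted, a base R sketch (same df1):
aggregate(answer ~ id, df1,
          function(x) sprintf("%d/%d (=%.4f)", sum(x), length(x), mean(x)))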
data
df1 <- structure(list(id = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L,
3L, 3L), answer = c(1L, 0L, 0L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 1L,
0L)), .Names = c("id", "answer"), class = "data.frame",
row.names = c(NA, -12L))

1 factor column and 15 value columns: group by the factor column, then sum each of the 15 individually

I don't know the vocabulary; otherwise I am sure I would be able to search for this effectively. So far I have not found anything, and I am running out of time.
So I have 16 columns of information: 1 of them is a factor column (we'll assume dates), and the other 15 are hour columns (6 am - 8 pm, representing hour only) holding either a 1 or a 0, representing an active or inactive state. What I want to do is:
1. Group the data by the factor column (Dates).
2. After everything is grouped, individually sum each of the 15 columns per grouping.
3. Display a 2-dimensional table with the dates running vertically and the hour sums running horizontally.
If you can help, please use the proper vocabulary so I can not only learn it myself, but also look up documentation and teach it to others.
An example would be
Date Hour1 Hour2 Hour3 Hour4 Hour5 ... Hour15
9-15 0 0 0 1 1 ... 0
9-15 0 1 1 1 1 ... 0
9-16 0 1 1 1 0 ... 0
9-16 0 0 0 0 0 ... 1
9-16 1 1 0 0 0 ... 1
9-18 0 1 0 1 1 ... 0
.
.
.
11-7 0 1 1 1 0 ... 0
What I want is
Hour1 Hour2 Hour3 Hour4 Hour5 ... Hour15
9-15 5 10 15 25 45 ... 20
9-16 5 6 25 28 15 ... 11
9-17 3 45 42 6 17 ... 32
9-18 5 10 15 25 45 ... 20
.
.
.
11-7 12 36 84 9 7 ... 21
where each entry is the sum over the column variable rather than a 1-or-0 frequency count.
You can do that quite easily with dplyr - first group by column "Date", then summarise each of the other columns with sum:
require(dplyr)
df %>%
  group_by(Date) %>%
  summarise_each(funs(sum))
#Source: local data frame [4 x 7]
#
# Date Hour1 Hour2 Hour3 Hour4 Hour5 Hour15
#1 11-7 0 1 1 1 0 0
#2 9-15 0 1 1 2 2 0
#3 9-16 1 2 1 1 0 2
#4 9-18 0 1 0 1 1 0
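In current dplyr, summarise_each() and funs() are superseded; an equivalent sketch with across():
df %>%
  group_by(Date) %>%
  summarise(across(everything(), sum))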
data
df <- structure(list(Date = structure(c(2L, 2L, 3L, 3L, 3L, 4L, 1L), .Label = c("11-7",
"9-15", "9-16", "9-18"), class = "factor"), Hour1 = c(0L, 0L,
0L, 0L, 1L, 0L, 0L), Hour2 = c(0L, 1L, 1L, 0L, 1L, 1L, 1L), Hour3 = c(0L,
1L, 1L, 0L, 0L, 0L, 1L), Hour4 = c(1L, 1L, 1L, 0L, 0L, 1L, 1L
), Hour5 = c(1L, 1L, 0L, 0L, 0L, 1L, 0L), Hour15 = c(0L, 0L,
0L, 1L, 1L, 0L, 0L)), .Names = c("Date", "Hour1", "Hour2", "Hour3",
"Hour4", "Hour5", "Hour15"), class = "data.frame", row.names = c(NA,
-7L))

R: use a row as a grouping vector for row sums

If I have a data set laid out like:
Cohort Food1 Food2 Food3 Food4
--------------------------------
Group 1 1 2 3
A 1 1 0 1
B 0 0 1 0
C 1 1 0 1
D 0 0 0 1
I want to sum within each row, with the food columns grouped into categories; I would like to use the Group row as the defining vector.
That would mean food1 and food2 are in group 1, food3 is in group 2, and food4 is in group 3.
Ideal output something like:
Cohort Group1 Group2 Group3
A 2 0 1
B 0 1 0
C 2 0 1
D 0 0 1
I tried using rowsum()-based functions but had no luck; do I need to use ddply() instead?
Example data from comment:
dat <-
structure(list(species = c("group", "princeps", "bougainvillei",
"hombroni", "lindsayi", "concretus", "galatea", "ellioti", "carolinae",
"hydrocharis"), locust = c(1L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L,
0L), grasshopper = c(1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 1L, 0L),
snake = c(2L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L), fish = c(2L,
1L, 0L, 1L, 1L, 0L, 1L, 0L, 1L, 0L), frog = c(2L, 0L, 0L,
0L, 0L, 0L, 0L, 1L, 0L, 0L), toad = c(2L, 0L, 0L, 0L, 0L,
1L, 0L, 0L, 0L, 0L), fruit = c(3L, 0L, 0L, 0L, 0L, 1L, 1L,
0L, 0L, 0L), seed = c(3L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L,
0L)), .Names = c("species", "locust", "grasshopper", "snake",
"fish", "frog", "toad", "fruit", "seed"), class = "data.frame", row.names = c(NA,
-10L))
There are most likely more direct approaches, but here is one you can try:
First, create a copy of your data minus the second header row.
dat2 <- dat[-1, ]
melt() and dcast() and so on from the "reshape2" package don't work nicely with duplicated column names, so let's make the column names more "reshape2 appropriate".
Seq <- ave(as.vector(unlist(dat[1, -1])),
           as.vector(unlist(dat[1, -1])),
           FUN = seq_along)
names(dat2)[-1] <- paste("group", dat[1, 2:ncol(dat)],
                         ".", Seq, sep = "")
# names(dat2)[-1] is now "group1.1" "group1.2" "group2.1" ... "group3.2"
melt() the dataset
library(reshape2)
m.dat2 <- melt(dat2, id.vars = "species")
Use the colsplit() function to split the columns correctly.
m.dat2 <- cbind(m.dat2[-2],
                colsplit(m.dat2$variable, "\\.",
                         c("group", "time")))
head(m.dat2)
# species value group time
# 1 princeps 0 group1 1
# 2 bougainvillei 0 group1 1
# 3 hombroni 1 group1 1
# 4 lindsayi 0 group1 1
# 5 concretus 0 group1 1
# 6 galatea 0 group1 1
Proceed with dcast() as usual
dcast(m.dat2, species ~ group, sum)
# species group1 group2 group3
# 1 bougainvillei 0 0 0
# 2 carolinae 1 1 0
# 3 concretus 0 2 2
# 4 ellioti 0 1 0
# 5 galatea 1 1 1
# 6 hombroni 2 1 0
# 7 hydrocharis 0 0 0
# 8 lindsayi 0 1 0
# 9 princeps 0 1 0
Note: Edited because original answer was incorrect.
Update: An easier way in base R
This problem is much more easily solved if you start by transposing your data.
dat3 <- t(dat[-1, -1])
dat3 <- as.data.frame(dat3)
names(dat3) <- dat[[1]][-1]
t(do.call(rbind, lapply(split(dat3, as.numeric(dat[1, -1])), colSums)))
# 1 2 3
# princeps 0 1 0
# bougainvillei 0 0 0
# hombroni 2 1 0
# lindsayi 0 1 0
# concretus 0 2 2
# galatea 1 1 1
# ellioti 0 1 0
# carolinae 1 1 0
# hydrocharis 0 0 0
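Hypothetically, the same grouping can be done without the transpose, using split.default() to split the food columns by the group row (a sketch on the same dat):
grp <- as.numeric(dat[1, -1])
res <- sapply(split.default(dat[-1, -1], grp), rowSums)  # one column per group
rownames(res) <- dat$species[-1]
res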
You can do this using base R fairly easily. Here's an example.
First, figure out which animals belong in which group:
groupings <- as.data.frame(table(as.numeric(dat[1,2:9]),names(dat)[2:9]))
attach(groupings)
grp1 <- groupings[Freq==1 & Var1==1,2]
grp2 <- groupings[Freq==1 & Var1==2,2]
grp3 <- groupings[Freq==1 & Var1==3,2]
detach(groupings)
Then, use the groups to do a rowSums() on the correct columns.
dat <- cbind(dat,rowSums(dat[as.character(grp1)]))
dat <- cbind(dat,rowSums(dat[as.character(grp2)]))
dat <- cbind(dat,rowSums(dat[as.character(grp3)]))
Delete the initial row and the intermediate columns:
dat <- dat[-1,-c(2:9)]
Then, just rename things correctly:
row.names(dat) <- NULL   # reset the row names (the original `rm()` also returns NULL)
names(dat) <- c("species","group_1","group_2","group_3")
dat <- dat[order(dat$species), ]   # sort alphabetically to match the output below
And you ultimately get:
species group_1 group_2 group_3
bougainvillei 0 0 0
carolinae 1 1 0
concretus 0 2 2
ellioti 0 1 0
galatea 1 1 1
hombroni 2 1 0
hydrocharis 0 0 0
lindsayi 0 1 0
princeps 0 1 0
EDITED: Changed sort order to alphabetical, like other answer.
