dcast in R with duplicates and no aggregation

I have looked through many similar questions here but can't find an answer that addresses this situation. My data frame is like this:
SET SP T1 T2 T3
A dog 1 0 0
A cat 0 NA 4
A bird 5 0 NA
B cat 2 0 0
B bird NA 3 0
C dog 1 0 0
C cat 0 0 6
C bird 0 0 0
D dog NA 22 1
Where SET is purposefully duplicated many times, with each record including a single SP and values for multiple TRIALS (T1-3).
What I desire is a wide dataframe like the following. There is to be NO summation/averaging/mathematical operation of any kind:
142 1 0 0 0 NA 4 5 0 NA
255 NA NA NA 2 0 0 NA 3 0
336 1 0 0 0 0 6 0 0 0
66 NA 22 1 NA NA NA NA NA NA
I have tried the following, but dcast reports that the aggregation function is missing and defaults to length. This turns the SET variable into different numbers and only fills in 0s and 1s for the value.
df %>%
group_by(SET, SP) %>%
melt(id.vars = c('SET','SP')) %>%
data.table::dcast(SP + variable ~ SET, fun.aggregate = NULL, value.var = 'value')
This works when I DON'T have any duplicate SETs, but fails as soon as I include the full dataset.
A note: my real data frame is about 2.5 million rows, so speed is of concern.

This is a situation where dcast.data.table excels. It allows for multiple arguments as 'value.var', allowing for very concise syntax:
library(data.table)
setDT(df)  # multiple value.var columns require a data.table
dcast(df, SET ~ SP, value.var=c("T1", "T2", "T3"))
# SET T1_bird T1_cat T1_dog T2_bird T2_cat T2_dog T3_bird T3_cat T3_dog
#1: A 5 0 1 0 NA 0 NA 4 0
#2: B NA 2 NA 3 0 NA 0 0 NA
#3: C 0 0 1 0 0 0 0 6 0
#4: D NA NA NA NA NA 22 NA NA 1
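The call above relies on each SET/SP pair occurring only once. When a pair repeats, dcast falls back to length with a message. One way around that, sketched here on made-up rows, is to number the repeats and put that counter in the formula. The column name index and the sample rows are assumptions for illustration only.

```r
library(data.table)

# Hypothetical sample: the A/dog pair appears twice
DT <- data.table(
  SET = c("A", "A", "A", "B"),
  SP  = c("dog", "dog", "cat", "cat"),
  T1  = c(1, 2, 0, 2),
  T2  = c(0, 0, NA, 0),
  T3  = c(0, 0, 4, 0)
)

# Number the repeats within each SET/SP pair (1, 2, ...)
DT[, index := seq_len(.N), by = .(SET, SP)]

# With index on the left-hand side every cell maps to at most one row,
# so no aggregation function is needed
wide <- dcast(DT, SET + index ~ SP, value.var = c("T1", "T2", "T3"))
print(wide)
```

Each repeat ends up on its own row, and combinations that do not occur are filled with NA.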

As per @lukeA's answer, but add the fun.aggregate = identity or fun.aggregate = list argument to the dcast() function call.

You could try
library(dplyr)
library(tidyr)
df <- read.table(header=T, text="
SET SP T1 T2 T3
A dog 1 0 0
A cat 0 NA 4
A bird 5 0 NA
B cat 2 0 0
B bird NA 3 0
C dog 1 0 0
C cat 0 0 6
C bird 0 0 0
D dog NA 22 1")
df %>%
gather(var, val, -(1:2)) %>%
unite("SP", SP, var) %>%
spread(SP, val)
# SET bird_T1 bird_T2 bird_T3 cat_T1 cat_T2 cat_T3 dog_T1 dog_T2 dog_T3
# 1 A 5 0 NA 0 NA 4 1 0 0
# 2 B NA 3 0 2 0 0 NA NA NA
# 3 C 0 0 0 0 0 6 1 0 0
# 4 D NA NA NA NA NA NA NA 22 1
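For newer versions of tidyr, the same reshape can probably be written with pivot_wider instead of gather/unite/spread. This is a swapped-in technique rather than the answer's own method; it is a minimal sketch that assumes the columns are named SET, SP, T1, T2 and T3 as in the question.

```r
library(tidyr)

df <- read.table(header = TRUE, text = "
SET SP T1 T2 T3
A dog 1 0 0
A cat 0 NA 4
A bird 5 0 NA
B cat 2 0 0
B bird NA 3 0
C dog 1 0 0
C cat 0 0 6
C bird 0 0 0
D dog NA 22 1")

# One row per SET; one column per trial/species combination, e.g. T1_dog
wide <- pivot_wider(df, names_from = SP, values_from = c(T1, T2, T3))
print(wide)
```

Note that the column names come out as trial_species (T1_dog) rather than species_trial (dog_T1).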

Since my current reputation doesn't allow adding a comment to @lukeA's answer above, I am posting this as a new answer, although it is more of a suggestion:
using the data.table function setcolorder one could have the columns reordered as "bird_T1, cat_T1, dog_T1, bird_T2, cat_T2 etc." by using a custom function similar to
newOrder <- function() {
, function(i) grep(
sprintf('%s', i)
, names(DT), value = TRUE)
where index is the index created in order to allow dcast.data.table for duplicated keys such as the SET variable in the initial table above:
DT[, index := 1:.N, by = SET]
finally the new order is achieved in regular manner:
setcolorder(dcast(DT), neworder = newOrder())


R - adding a count table in a dataframe

I would like to get a count table from several values in my dataframe, which are spread over several columns.
In my case, several of my subjects have one or more types of medication. Now they are mentioned in several columns. I would instead like to have each medication name in a separate column, and a 1 or 0 (a count) per subject if they have it or not.
For example, my dataframe now looks like:
Pp X X2 X3 X4
2 trimipramin NA NA NA
3 Quetiapin venlafaxin escitalopram NA
4 duloxetin amitriptylin NA NA
5 quetiapin citalopram escitalopram Lithium
7 duloxetin escitalopram venlafaxin NA
I would like to it look like:
Pp trimipramin escitalopram quetiapin venlafaxin duloxetin …
1 0 0 0 0 0 …
2 1 0 0 0 0 …
3 0 1 1 1 0 …
4 0 0 0 0 1 …
5 0 1 1 0 0 …
6 0 0 0 0 0 …
7 0 1 0 1 1 …
Would it be possible to get such a count table added to my dataframe and not separately?
I assume dplyr's spread() might be useful in this case, but I couldn't get it to work :-(
Any help?
The table function in base is a simple method for creating this type of binary indicator matrix.
D %>%
gather(key, drug, -Pp) %>%
select(-key) %>%
table()
#>Pp amitriptylin citalopram duloxetin escitalopram Lithium ...
#> 1 0 0 0 0 0 ...
#> 2 0 0 0 0 0 ...
#> 3 0 0 0 1 0 ...
#> 4 1 0 1 0 0 ...
#> 5 0 1 0 1 1 ...
#> 6 0 0 0 0 0 ...
#> 7 0 0 1 1 0 ...
You can also do it using spread instead, but this solution will remove patients without any associated drugs:
D %>%
gather(key, value, -Pp) %>%
filter(!is.na(value)) %>%
mutate(key=1) %>%
spread(value, key, fill=0, drop = FALSE)
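The table-based idea can also be sketched in base R without any packages, including the step of attaching the indicators to the original data frame. Two things here are assumptions: subject ids run from 1 to 7 (taken from the desired output in the question), and drug names are lower-cased so that "Quetiapin" and "quetiapin" count as the same medication.

```r
# Sample data from the question
D <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
Pp X X2 X3 X4
2 trimipramin NA NA NA
3 Quetiapin venlafaxin escitalopram NA
4 duloxetin amitriptylin NA NA
5 quetiapin citalopram escitalopram Lithium
7 duloxetin escitalopram venlafaxin NA")

# Stack the medication columns into long form; unlist() goes column by
# column, so the subject ids are repeated once per medication column
long <- data.frame(
  Pp   = factor(rep(D$Pp, times = ncol(D) - 1), levels = 1:7),
  drug = tolower(unlist(D[-1], use.names = FALSE))  # assumption: ignore case
)

# table() drops NA drugs and keeps every factor level of Pp as a row
ind <- table(long)

# Attach the indicator columns to the original rows by subject id
wide   <- as.data.frame.matrix(ind)
merged <- cbind(D, wide[as.character(D$Pp), ])
print(merged)
```

Here ind has a row for every subject id, including those without medication, while merged keeps only the subjects present in the original data frame.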
library(dplyr)
library(tidyr)
library(reshape2)  # assumed source of dcast() here
data <- read.table(text="
Pp X X2 X3 X4
2 trimipramin NA NA NA
3 Quetiapin venlafaxin escitalopram NA
4 duloxetin amitriptylin NA NA
5 quetiapin citalopram escitalopram Lithium
7 duloxetin escitalopram venlafaxin NA",header=T, stringsAsFactors = F)
data %>% gather(key = "Med", value = "Value", -Pp) %>% select(-Med) %>%
dcast(Pp~Value,fun.aggregate = function(x){as.integer(length(x) > 0)})
I do not know if it is the easiest solution, but it works. First you have to get all medications in a vector. I do that with unlist and unique. Then you have to loop through them and get the number of cases for each. I do that with the apply function and the sum of matches. As you want it in the same data.frame, you can store it with data[[name]]. This solution works without any packages.
data = read.table(text = "X X2 X3 X4
trimipramin NA NA NA
Quetiapin venlafaxin escitalopram NA
duloxetin amitriptylin NA NA
quetiapin citalopram escitalopram Lithium
duloxetin escitalopram venlafaxin NA", header = T, stringsAsFactors = F)
names = as.character(na.omit(unique(unlist(data))))
for(name in names){
  data[[name]] = apply(data, FUN = function(x) { sum(x == name, na.rm = T) }, 1)
}

Generate a Preference Matrix in R?

I'm using R to analyze an undirected network of individuals with ethnicities as attributes. I want to create a tie accounts table, or "preference matrix": a square matrix where values of ethnicity are arrayed on both dimensions, and each cell tells you how many ties correspond to that type of relationship. (From this you can calculate the probability of one group throwing ties to another group, but I just want to use it as an argument in igraph's preference.game function.) Here's what I tried:
# I create a variable for ethnicity by assigning the names of my vertices to their corresponding ethnicities
eth <- atts$Ethnicity[match(V(mahmudNet)$name,atts$Actor)]
# I create an adjacency matrix from my network data
mat <- as.matrix(get.adjacency(mahmudNet))
# I create the dimensions for my preference matrix from the Ethnicity values
eth.value <- unique(sort(eth))
# I create an empty matrix using these dimensions
eth.mat <- array(NA,dim=c(length(eth.value),length(eth.value)))
# I create a function that will populate the empty cells of the matrix
for (i in eth.value){
  for (j in eth.value){
    eth.mat[i,j] <- sum(mat[eth==i,eth==j])
  }
}
My problem is at the end, I think. I need to figure out an expression that tells R how to populate the cells. The expression I used doesn't seem to work, but I want it so that potentially I could write
a <- sum(mat[eth=="White", eth=="Black"])
And then "a" would return the sum of all the cells in the adjacency matrix that correspond to a White-Black relationship.
Here's a sample of my data:
# data frame with Ethnicity attributes:
Actor Ethnicity
1 Sultan Mahmud of Siak 2
2 Daeng Kemboja 1
3 Raja Kecik of Trengganu 1
4 Raja Alam 2
5 Tun Dalam 2
6 Raja Haji 1
7 The Suliwatang 1
8 Punggawa Miskin 1
9 Tengku Selangor 1
10 Tengku Raja Said 1
11 Datuk Bendahara 2
12 VOC 3
13 King of Selangor 1
14 Dutch at Batavia 3
15 Punggawa Tua 2
16 Raja Tua Encik Andak 1
17 Raja Indera Bungsu 2
18 Sultan of Jambi 2
19 David Boelen 3
20 Datuk Temenggong 2
21 Punggawa Opu Nasti 1
# adjacency matrix with relations
Daeng Kemboja Punggawa Opu Nasti Raja Haji Daeng Cellak
Daeng Kemboja 0 1 1 1
Punggawa Opu Nasti 1 0 1 0
Raja Haji 1 1 0 0
Daeng Cellak 1 0 0 0
Daeng Kecik 1 0 0 0
Daeng Kecik
Daeng Kemboja 1
Punggawa Opu Nasti 0
Raja Haji 0
Daeng Cellak 0
Daeng Kecik 0
This is a simple job for table, once you have your data in the right shape.
First a sample dataset:
# fake ethnicity data by actor
actor_eth <- data.frame(actor = letters[1:10],
eth = sample(1:3, 10, replace=T))
# fake adjacency matrix
adj_mat <- matrix(rbinom(100, 1, .5), ncol=10)
dimnames(adj_mat) <- list(letters[1:10], letters[1:10])
# blank out lower triangle & diagonal,
# so random data is not asymmetric & no self-ties
adj_mat[lower.tri(adj_mat)] <- NA
diag(adj_mat) <- NA
Here's our fake adjacency matrix:
a b c d e f g h i j
a NA 1 1 1 0 0 1 1 0 1
b NA NA 0 1 0 1 0 0 1 0
c NA NA NA 1 1 0 0 1 0 0
d NA NA NA NA 1 0 0 1 1 0
e NA NA NA NA NA 0 0 1 0 1
f NA NA NA NA NA NA 1 1 0 1
g NA NA NA NA NA NA NA 1 1 0
Here's our fake eth table:
actor eth
1 a 3
2 b 3
3 c 3
4 d 2
5 e 1
6 f 3
7 g 3
8 h 3
9 i 1
10 j 2
So what you want to do is 1) put this in long format, so you have a bunch of rows with a source actor and a target actor, each representing a tie. Then 2) replace the actor name with ethnicity, so you have ties with source/target ethnicity. Then 3) you can just use table to make a cross tab.
library(reshape2)  # provides melt() for matrices
# use `melt` to put this in long form, omitting rows showing "non connections"
actor_ties <- subset(melt(adj_mat), value==1)
# now replace the actor names with their ethnicities to create a data.frame
# of ties by ethnicity
eth_ties <-
data.frame(source_eth = with(actor_eth, eth[match(actor_ties$Var1, actor)]),
target_eth = with(actor_eth, eth[match(actor_ties$Var2, actor)]))
# now here's your cross tab
table(eth_ties)
source_eth 1 2 3
1 0 2 1
2 2 0 1
3 3 5 9
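The three steps above can also be sketched in base R on a tiny deterministic example, which makes the result easy to check by hand. The actors, ethnicity codes and ties below are hypothetical, and which() with arr.ind = TRUE stands in for melt as the way to get the long form.

```r
# Hypothetical actors with an ethnicity code each
eth <- c(a = 1, b = 1, c = 2, d = 3)

# Hypothetical ties, stored once in the upper triangle
adj <- matrix(0, 4, 4, dimnames = list(names(eth), names(eth)))
adj["a", "b"] <- 1
adj["a", "c"] <- 1
adj["b", "d"] <- 1
adj["c", "d"] <- 1

# 1) long form: one row per tie, as row/column positions
ties <- which(adj == 1, arr.ind = TRUE)

# 2) swap actor positions for ethnicities (factors keep empty levels)
levs <- sort(unique(eth))
source_eth <- factor(eth[ties[, "row"]], levels = levs)
target_eth <- factor(eth[ties[, "col"]], levels = levs)

# 3) cross-tabulate
pref <- table(source_eth, target_eth)
print(pref)
```

Because each tie is stored only once, the resulting table is not symmetric; whether that is what a downstream function expects is worth checking separately.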

Generate crosstabulations from dataframe of categorical variables in survey

I've got some survey results and am trying to do some basic cross tabulations. Each column is a type of chemical, and the numbers 0:5 rate how useful each one was.
I'm trying to produce a table that presents the frequency and percent. Using table or xtabs, I'm able to get individual results for each column, but I'd like to figure out a way to create a single table, including all the chemicals, that I can output to LaTeX.
Thanks for any help you can provide.
Data Frame :
df <- read.table(text = "
V1 V2 V3 V4 V5 V6 V7
2 0 0 0 0 0 0 0
3 0 0 0 0 0 0 NA
5 0 0 0 0 0 2 0
7 0 0 0 0 0 0 0
9 NA 2 NA NA NA 3 NA
11 0 0 0 0 0 0 0
12 0 0 0 0 0 0 0
13 0 0 0 0 0 0 0
14 NA NA NA NA NA 2 3
15 NA 3 NA 3 NA NA NA
17 0 0 0 0 0 0 0
18 NA 5 NA 5 NA NA NA
19 0 0 0 0 0 0 0
20 NA 1 NA NA NA NA NA", header = T)
Desired Output (Accurate numbers for V1 and V2) :
V1 V2 etc....
Freq Percent Freq Percent
No 9 100 9 56.2
Poor 0 0 1 6.2
Somewhat effective 0 0 1 6.2
Good 0 0 1 6.2
Very Good 0 0 3 18.75
NA 0 0 1 6.2
Here, we get the frequency for each column using lapply and table. lapply passes each column of the data.frame to a function that converts it to a factor with levels specified as 0:5 and then calls table. prop.table gives the proportions; we cbind Freq and Percent, combine the resulting list into one matrix with do.call(cbind, ...), and finally rename the columns.
res <- do.call(cbind,lapply(df, function(x) {
x1 <- table(factor(x, levels=0:5,
labels=c('No', 'Poor', 'Somewhat Effective',
'Good', 'Very Good', 'NA') ))
cbind(Freq=x1, Percent=round(100*prop.table(x1),2))}))
colnames(res) <- paste(rep(paste0('V',1:7),each=2),
colnames(res), sep=".")
res
# V1.Freq V1.Percent V2.Freq V2.Percent V3.Freq V3.Percent V4.Freq
#No 9 100 9 56.25 9 100 9
#Poor 0 0 1 6.25 0 0 0
# V4.Percent V5.Freq V5.Percent V6.Freq V6.Percent V7.Freq V7.Percent
#No 81.82 9 100 8 66.67 8 80
#Poor 0.00 0 0 0 0.00 0 0
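To see what the inner function does for a single column, here is a minimal sketch on a hypothetical rating vector (not taken from the survey data). It shows that fixing the factor levels keeps unused ratings in the table, and that a real NA is simply left out of both the counts and the percentages.

```r
levs <- c("No", "Poor", "Somewhat Effective", "Good", "Very Good", "NA")

# Hypothetical ratings for a single chemical; the real NA is a non-response
x <- c(0, 0, 1, 5, NA, 0)

# Fixing the levels keeps unused ratings in the table with a count of 0
tab <- table(factor(x, levels = 0:5, labels = levs))

res <- cbind(Freq = tab, Percent = round(100 * prop.table(tab), 2))
print(res)
```

Note that the row labelled "NA" is the rating 5, following the labels used in the answer, not a count of missing values.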
I'm not a regular "dplyr" or "tidyr" user, so I'm not sure if this is the best approach using those tools (but it seems to work):
df %>%
gather(var, val, V1:V7) %>% ## Make the data long
na.omit() %>% ## We don't need the NAs
## Factor the "value" column
mutate(val = factor(val, 0:5, c("No", "Poor", "Somewhat Effective",
"Good", "Very Good", "NA"))) %>%
group_by(val, var) %>% ## Group by val and var
summarise(Freq = n()) %>% ## Get the count
group_by(var) %>% ## Group just by var now
mutate(Pct = Freq/sum(Freq) * 100) %>% ## Calculate the percent
gather(R1, R2, Freq:Pct) %>% ## Go long again....
unite(Var, var, R1) %>% ## Combine the var and R1 cols
spread(Var, R2, fill = 0) ## Go wide....
# Source: local data frame [6 x 15]
# val V1_Freq V1_Pct V2_Freq V2_Pct V3_Freq V3_Pct V4_Freq
# 1 No 9 100 9 56.25 9 100 9
# 2 Poor 0 0 1 6.25 0 0 0
# 3 Somewhat Effective 0 0 1 6.25 0 0 0
# 4 Good 0 0 1 6.25 0 0 1
# 5 Very Good 0 0 3 18.75 0 0 0
# 6 NA 0 0 1 6.25 0 0 1
# Variables not shown: V4_Pct (dbl), V5_Freq (dbl), V5_Pct (dbl), V6_Freq
# (dbl), V6_Pct (dbl), V7_Freq (dbl), V7_Pct (dbl)
The "data.table" approach is similar in terms of the series of steps you'll have to go through.
levs <- c("No", "Poor", "Somewhat Effective", "Good", "Very Good", "NA")
DT <- melt(as.data.table(df, keep.rownames = TRUE),
id.vars = "rn", na.rm = TRUE)
DT <- DT[, value := factor(value, 0:5, levs)
][, list(Freq = .N), by = list(variable, value)
][, Pct := Freq/sum(Freq) * 100, by = list(variable)]
dcast.data.table(melt(DT, id.vars = c("variable", "value")),
value ~ variable + variable.1,
value.var = "value.1", fill = 0)
OK, one more... (a variant on @akrun's answer)
library(gdata) ## For "interleave"
levs <- c("No", "Poor", "Somewhat Effective", "Good", "Very Good", "NA")
x1 <- sapply(lapply(df, factor, 0:5, levs), table)
t(interleave(t(x1), t(prop.table(x1, 2))))
### Or, skipping the transposing....
## library(SOfun) ## For "Riffle" which is like "interleave"
## Riffle(x1, prop.table(x1, 2) * 100)

using function in lapply in data.table in r

Suppose there is a sample data set as below.
> tmp <- data.table(x=c(1:10),y=(5:14))
> tmp
x y
1: 1 5
2: 2 6
3: 3 7
4: 4 8
5: 5 9
6: 6 10
7: 7 11
8: 8 12
9: 9 13
10: 10 14
I want to keep the two lowest numbers in each column and change all the other numbers to 0.
x y
1: 1 5
2: 2 6
3: 0 0
4: 0 0
5: 0 0
6: 0 0
7: 0 0
8: 0 0
9: 0 0
10: 0 0
I think the coding is
tmp[, c("x","y"):=lapply(.SD, function(x) {x[which(!x %in% sort(x)[1:2])] = 0}), .SDcols=c("x","y")]
but it changes every value to 0.
How can I solve this problem?
To expand on my comment, I'd do something like this:
for (j in names(tmp)) {
  col = tmp[[j]]
  min_2 = sort.int(unique(col), partial=2L)[2L] # 2nd lowest value
  set(tmp, i = which(col > min_2), j = j, value = 0L)
}
This loops over all the columns in tmp, and gets the 2nd minimum value for each column using sort.int with partial argument, which is slightly more efficient than using sort (as we don't have to sort the entire data set to find the 2nd minimum value).
Then we use set() to replace those rows where the column value is greater than the 2nd minimum value, for that column, with the value 0.
Maybe you can try
tmp[, lapply(.SD, function(x) replace(x,
!rank(x, ties.method='first') %in% 1:2, 0))]
# x y
#1: 1 5
#2: 2 6
#3: 0 0
#4: 0 0
#5: 0 0
#6: 0 0
#7: 0 0
#8: 0 0
#9: 0 0
#10:0 0
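A likely reason the attempt in the question zeroes everything, assuming it wrapped the assignment in an anonymous function, is that a function whose last expression is an assignment returns the assigned value (here 0) rather than the modified vector. The two helper functions below are hypothetical names used only to show the difference in base R.

```r
# Returns the value of the assignment (0), not the modified vector
f_bad <- function(x) {
  x[!x %in% sort(x)[1:2]] <- 0
}

# Returns the modified vector
f_good <- function(x) {
  x[!x %in% sort(x)[1:2]] <- 0
  x
}

bad  <- f_bad(1:10)
good <- f_good(1:10)
print(bad)
print(good)
```

Inside lapply(.SD, ...), the single 0 returned by the first form would be recycled down the whole column, which matches the behaviour described in the question.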

How to create dummy variables?

I have a variable that is a factor :
$ year : Factor w/ 8 levels "2003","2004",..: 4 6 4 2 4 1 3 3 7 2 ...
I would like to create 8 dummy variables, named "2003", "2004" etc that take the value 0 or 1 depending on the value that the variable "year" takes. The nearest I could come up with is
dt1 <- cbind (dt1, model.matrix(~dt1$year - 1) )
But this has two unfortunate consequences:
The dummy variables are named dt1$year2003, not just "2003", "2004", etc.
NA rows seem to be omitted altogether by model.matrix (so the above command fails due to different lengths when NA is present in the year variable).
Of course I can get around these problems with more code, but I like my code to be as concise as possible (within reason) so if anyone can suggest better ways to make the dummy variables I would be obliged.
This is as concise as I could get. The na.action option takes care of the NA values (I would rather do this with an argument than with a global options setting, but I can't see how). The naming of columns is pretty deeply hard-coded, don't see any way to override it within model.matrix ...
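options(na.action="na.pass")  # global setting: keep NA rows in model.matrix
dt1 <- data.frame(year=factor(c(NA,2003:2005)))
dt2 <- setNames(cbind(dt1,model.matrix(~year-1,data=dt1)),
c("year",levels(dt1$year)))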
As pointed out above, you may run into trouble in some contexts with column names that are not legal R variable names.
year 2003 2004 2005
2 2003 1 0 0
3 2004 0 1 0
4 2005 0 0 1
You could use ifelse() which won't omit na rows (but I guess you might not count it as being "as concise as possible"):
dt1 <- data.frame(year=factor(rep(2003:2010, 10))) # example data
dt1 <- within(dt1, yr2003<-ifelse(year=="2003", 1, 0))
dt1 <- within(dt1, yr2004<-ifelse(year=="2004", 1, 0))
dt1 <- within(dt1, yr2005<-ifelse(year=="2005", 1, 0))
# ...
# year yr2003 yr2004 yr2005
# 1 2003 1 0 0
# 2 2004 0 1 0
# 3 2005 0 0 1
# 4 2006 0 0 0
# 5 2007 0 0 0
# 6 2008 0 0 0
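The repeated ifelse() lines can be generalised by looping over the factor levels. The following is a minimal base R sketch, not part of the answer above; it names the columns after the levels themselves, as requested in the question, and leaves NA years as NA in every dummy column. The sample data is made up.

```r
# Hypothetical sample with one missing year
dt1 <- data.frame(year = factor(c(NA, 2003, 2004, 2005, 2003)))

# One 0/1 column per level; sapply() names the columns after the levels
dummies <- sapply(levels(dt1$year), function(lv) as.integer(dt1$year == lv))

# cbind() on a data.frame keeps the non-syntactic names "2003", "2004", ...
dt2 <- cbind(dt1, dummies)
print(dt2)
```

Columns named like "2003" have to be accessed with dt2[["2003"]] or backticks, since they are not legal R variable names.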
The caret package provides a very simple function (dummyVars) to create dummy variables, especially when you have more than one factor variable. But you have to make sure the target variables are factors; e.g. if Sales$year is numeric, you have to convert it to a factor with as.factor(Sales$year).
Suppose we have the original dataset 'Sales' as follows:
year Sales Region
1 2010 3695.543 North
2 2010 9873.037 West
3 2008 3579.458 West
4 2005 2788.857 North
5 2005 2952.183 North
6 2008 7255.337 West
7 2005 5237.081 West
8 2010 8987.096 North
9 2008 5545.343 North
10 2008 1809.446 West
Now we can create two dummy variables simultaneously:
>Salesdummy <- dummyVars(~., data = Sales, levelsOnly = TRUE)
>Sdummy <- predict(Salesdummy, Sales)
The outcome will be:
2005 2008 2010 Sales RegionNorth RegionWest
1 0 0 1 3695.543 1 0
2 0 0 1 9873.037 0 1
3 0 1 0 3579.458 0 1
4 1 0 0 2788.857 1 0
5 1 0 0 2952.183 1 0
6 0 1 0 7255.337 0 1
7 1 0 0 5237.081 0 1
8 0 0 1 8987.096 1 0
9 0 1 0 5545.343 1 0
10 0 1 0 1809.446 0 1
