dcast in R - creating pivot table

Here is my example:
library(reshape2)
Student <- c('A', 'B', 'B')
Assessor <- c('C', 'D', 'D')
Score <- c(1, 5, 7)
df <- data.frame(Student, Assessor, Score)
df <- dcast(df, Student ~ Assessor, fun.aggregate = (function(x) x), value = 'Score')
print(df)
The output:
Using Score as value column: use value.var to override.
Error in .fun(.value[0], ...) : unused argument (value = "Score")
While I want to get something like
C D
A 1 NaN
B NaN 5
B NaN 7
What am I missing?
In addition, if I replace Score with
Score <- c('foo', 'bar','bar')
The output will be:
Using Score as value column: use value.var to override.
Error in .fun(.value[0], ...) : unused argument (value = "Score")
Any thoughts?

Since dcast creates one row per unique combination of the variables on the left side of the formula, I think you can achieve your goal with a (not so elegant) hack, but I bet there are other ways to do it, maybe with table.
library(reshape2)
dcast(df, Student + Score ~ ...)[-2]
Using Score as value column: use value.var to override.
Student C D
1 A 1 NA
2 B NA 5
3 B NA 7
The hack keeps Student and Score on the left side of the formula (so each unique Student/Score pair gets its own row), spreads the remaining variable (in this case Assessor) into columns, and then removes the Score column with [-2] to get the desired output (unless your first column is actually meant to be row names, which is impossible in base R with duplicated values; in that case you would need a data.table solution).
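As an aside, the error in the question comes from the argument name: dcast takes value.var, not value, so value = 'Score' is passed on to fun.aggregate, which triggers the unused-argument error. A minimal sketch of the same row-preserving idea with value.var spelled out, using a hypothetical helper column obs to disambiguate the duplicated Student/Assessor pairs:
library(reshape2)
# obs numbers repeated Student/Assessor pairs 1, 2, ... so dcast keeps them on separate rows
df$obs <- ave(seq_len(nrow(df)), df$Student, df$Assessor, FUN = seq_along)
dcast(df, Student + obs ~ Assessor, value.var = 'Score')[-2]
This also works unchanged when Score is character, e.g. c('foo', 'bar', 'bar').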

Using the dev version of tidyr (0.3.0); get it from GitHub.
First we complete the combinations of Student/Assessor, then nest it all into a list, spread, and finally unnest the list into new rows.
library(dplyr)
library(tidyr)
df %>% complete(Student, Assessor) %>%
  nest(Score) %>%
  spread(Assessor, Score) %>%
  unnest(C) %>%
  unnest(D)
Student C D
1 A 1 NA
2 B NA 5
3 B NA 7
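With current tidyr (>= 1.0), where nest/spread have been superseded, the same result can be sketched with pivot_wider plus an explicit row number to disambiguate the duplicated Student/Assessor pairs:
library(dplyr)
library(tidyr)
df %>%
  group_by(Student, Assessor) %>%
  mutate(row = row_number()) %>% # number duplicate pairs so pivot_wider keeps them on separate rows
  ungroup() %>%
  pivot_wider(names_from = Assessor, values_from = Score) %>%
  select(-row)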

Related

Using tidyverse in R how do I arrange my column values in a fixed way?

ID score
a 1
a 2
b 2
b 4
c 4
c 5
I want to reorder the rows so that the IDs cycle in "a, b, c" order, like this:
ID score
a 1
b 2
c 4
a 2
b 4
c 5
What I tried
> data <- read_csv(data)
> data <- factor(data$id, levels = c('a', 'b', 'c'))
This works for tables, so I tried it here, but it didn't work. Does anybody know if there is a way?
Instead of assigning the 'id' column to data (which would replace the whole data set with just the 'id' values), it should be used for ordering. In base R, this can be done with
data1 <- data[order(duplicated(data$ID)), ]
row.names(data1) <- NULL
(Note that the duplicated() trick only separates first occurrences from later ones, so it fully interleaves the IDs only when each ID appears at most twice, as here.)
Or with dplyr (rowid() here comes from data.table):
library(dplyr)
library(data.table)
data %>%
  arrange(rowid(ID))
library(dplyr)
d %>%
  group_by(ID) %>%
  mutate(r = row_number()) %>%
  ungroup() %>%
  arrange(r, ID, score) %>%
  select(-r)
Or in base R:
with(d, d[order(ave(seq(NROW(d)), d$ID, FUN = seq_along), ID, score), ])
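If, as in the original attempt, you also want to control the ID order explicitly through factor levels, the factor() call belongs inside the pipeline rather than overwriting the data; a sketch, assuming the d from the answer above:
library(dplyr)
d %>%
  mutate(ID = factor(ID, levels = c('a', 'b', 'c'))) %>% # explicit ordering, as in the OP's attempt
  group_by(ID) %>%
  mutate(r = row_number()) %>%
  ungroup() %>%
  arrange(r, ID) %>%
  select(-r)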

finding the minimum value of multiple variables by group

I would like to find the minimum value of a variable (time) at which several other variables are equal to 1 (or any other value). Basically, my application is finding the first year that x == 1, for several x. I know how to find this for one x but would like to avoid generating multiple reduced data frames of minima and then merging them together. Is there an efficient way to do this? Here are my example data and a solution for one variable.
library(plyr)
d <- data.frame(cat = c(rep("A", 10), rep("B", 10)),
                time = c(1:10),
                var1 = c(0,0,0,1,1,1,1,1,1,1,0,0,0,0,0,0,1,1,1,1),
                var2 = c(0,0,0,0,1,1,1,1,1,1,0,0,0,0,0,0,0,1,1,1))
ddply(d[d$var1 == 1, ], .(cat), summarise,
      start = min(time))
How about this, using dplyr:
library(dplyr)
d %>%
  group_by(cat) %>%
  summarise_at(vars(contains("var")), funs(time[which(. == 1)[1]]))
Which gives
# A tibble: 2 x 3
# cat var1 var2
# <fct> <int> <int>
# 1 A 4 5
# 2 B 7 8
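With dplyr >= 1.0, where funs() is deprecated, the same idea can be sketched with across():
library(dplyr)
d %>%
  group_by(cat) %>%
  summarise(across(starts_with("var"), ~ time[which(.x == 1)[1]]))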
We can use base R to get the minimum 'time' among all the 'var' columns, grouped by 'cat':
sapply(split(d[-1], d$cat), function(x)
  x$time[min(which(x[-1] == 1, arr.ind = TRUE)[, 1])])
#A B
#4 7
Is this something you are expecting?
library(dplyr)
df <- d %>%
  group_by(cat, var1, var2) %>%
  summarise(start = min(time)) %>%
  filter()
I have left the filter argument blank so that you can specify any filter condition you want (say var1 == 1 or cat == "A").
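For example, filling in the blank filter to keep only the groups where var1 is 1:
library(dplyr)
d %>%
  group_by(cat, var1, var2) %>%
  summarise(start = min(time)) %>%
  filter(var1 == 1)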

Keeping IDs conditional on repeating variable

I have data that looks like this:
Is there a way I can very efficiently (without much R code) retain only 'ID' cases where instances of 'X' are equal to zero? For example, in this case only ID number 3 should be retained in my data set.
Edit: this issue is closed; there are multiple strong answers below.
Using the data.table package, I was able to quickly pull this together:
library(data.table)
df <- data.table(ID = c(1,1,1,2,2,2,3,3,3), y = c(5,6,4,6,3,1,9,5,5), x = c(1,0,0,0,1,1,0,0,0))
df <- df[, .(ident = all(x == 0), y, x), by = ID][ident == TRUE] # flag each ID where all x are 0, keep only those
df[, ident := NULL] # get rid of the redundant identifier column
df <- data.frame(ID=c(1,1,1,2,2,2,3,3,3), y=c(5,6,4,6,3,1,9,5,5), x=c(1,0,0,0,1,1,0,0,0))
subset(df, !ID %in% subset(df, x!=0)$ID)
That is, first find the IDs where x is not zero (subset(df, x != 0)$ID), and then exclude cases with those IDs (!ID %in% subset(df, x != 0)$ID).
Try this: first get all IDs for which any row has a non-zero x value, then use that to subset:
df <- data.frame(ID=c(1,1,1,2,2,2,3,3,3), y=c(5,6,4,6,3,1,9,5,5), x=c(1,0,0,0,1,1,0,0,0))
exclude <- subset(df, x!=0)$ID
new_df <- subset(df, ! ID %in% exclude)
A base R option using ave, where we select the rows whose ID has all x values equal to 0. ave() returns a logical vector as long as the data (TRUE for every row of an all-zero ID), so it can index the rows directly.
df[ave(df$x == 0, df$ID, FUN = all), ]
# ID y x
#7 3 9 0
#8 3 5 0
#9 3 5 0
An equivalent dplyr solution would be
library(dplyr)
df %>%
  group_by(ID) %>%
  filter(all(x == 0)) %>%
  ungroup()
# A tibble: 3 x 3
# ID y x
# <dbl> <dbl> <dbl>
#1 3. 9. 0.
#2 3. 5. 0.
#3 3. 5. 0.
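For completeness, the same all(x == 0) condition can be sketched as a single data.table step, using the df from above:
library(data.table)
setDT(df)[, if (all(x == 0)) .SD, by = ID] # groups that fail the condition return NULL and are dropped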

dplyr lag of different group

I am trying to use dplyr to mutate both a column containing the same-group lag of a variable and a column containing the lag of (one of) the other group(s).
Edit: Sorry, in the first version I messed up the order a bit by rearranging by date at the last second.
This is what my desired result would look like:
Here is a minimal code example:
library(tidyverse)
set.seed(2)
df <-
  data.frame(
    x = sample(seq(as.Date('2000/01/01'), as.Date('2015/01/01'), by = "day"), 10),
    group = sample(c("A", "B"), 10, replace = TRUE),
    value = sample(1:10, size = 10)
  ) %>% arrange(x)
df <- df %>%
  group_by(group) %>%
  mutate(own_lag = lag(value))
df %>% data.frame(other_lag = c(NA, 1, 2, 7, 7, 9, 10, 10, 8, 6))
Thank you very much!
A solution with data.table:
library(data.table)
# to create the own-group lag:
setDT(df)[, own_lag := c(NA, head(value, -1)), by = group]
# to create the other-group lag (the function actually works outside of data.table, in base R, see N.B. below):
df[, other_lag := sapply(1:.N, function(ind) {
  gp_cur <- group[ind]
  if (any(group[1:ind] != gp_cur)) tail(value[1:ind][group[1:ind] != gp_cur], 1) else NA
})]
df
# x group value own_lag other_lag
#1: 2001-12-08 B 1 NA NA
#2: 2002-07-09 A 2 NA 1
#3: 2002-10-10 B 7 1 2
#4: 2007-01-04 A 5 2 7
#5: 2008-03-27 A 9 5 7
#6: 2008-08-06 B 10 7 9
#7: 2010-07-15 A 4 9 10
#8: 2012-06-27 A 8 4 10
#9: 2014-02-21 B 6 10 8
#10: 2014-02-24 A 3 8 6
Explanation of the other_lag determination: the idea is, for each observation, to look at all group values up to and including the current one; if any of them differs from the current group, take the last value belonging to that other group, else put NA.
N.B.: other_lag can be created without the need for data.table:
df$other_lag <- with(df, sapply(1:nrow(df), function(ind) {
  gp_cur <- group[ind]
  if (any(group[1:ind] != gp_cur)) tail(value[1:ind][group[1:ind] != gp_cur], 1) else NA
}))
Another data.table approach similar to #Cath's:
library(data.table)
DT = data.table(df)
DT[, vlag := shift(value), by=group]
DT[, volag := .SD[.(chartr("AB", "BA", group), x - 1), on=.(group, x), roll=TRUE, x.value]]
This assumes that A and B are the only groups. If there are more...
DT[, volag := DT[!.BY, on=.(group)][.(.SD$x - 1), on=.(x), roll=TRUE, x.value], by=group]
How it works:
:= creates a new column
DT[, col := ..., by=] does each assignment separately per by= group, essentially as a loop.
The grouping values for the current iteration of the loop are in the named list .BY.
The subset of data used by the current iteration of the loop is the data.table .SD.
x[!i, on=] is an anti-join, looking up rows of i in x and returning x with the matched rows dropped.
x[i, on=, roll=TRUE, x.v] ...
looks up each row of i in x using the on= condition
when no exact on= match is found, it "rolls" to the nearest previous value of the final on= column
it returns v from the x table
For more details and intuition, review the startup messages shown when you type library(data.table).
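A tiny standalone illustration of the roll = TRUE lookup, with a hypothetical toy table:
library(data.table)
prices <- data.table(day = c(1L, 5L, 9L), p = c(10, 20, 30))
# no row has day == 7, so roll = TRUE rolls to the nearest previous day (5) and returns its p
prices[.(7L), on = .(day), roll = TRUE, x.p]
# [1] 20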
I am not entirely sure whether I got your question correctly, but if "own" and "other" refer to groups A and B, then this might do the trick. I strongly assume there are more elegant ways to do this:
df.x <- df %>%
  dplyr::group_by(group) %>%
  mutate(value.lag = lag(value)) %>%
  mutate(index = seq_along(group)) %>%
  arrange(group)
df.a <- df.x %>%
  filter(group == "A") %>%
  rename(value.lag.a = value.lag)
df.b <- df.x %>%
  filter(group == "B") %>%
  rename(value.lag.b = value.lag)
df.a.b <- left_join(df.a, df.b[, c("index", "value.lag.b")], by = c("index"))
df.b.a <- left_join(df.b, df.a[, c("index", "value.lag.a")], by = c("index"))
df.x <- bind_rows(df.a.b, df.b.a)
Try this (pipe-only approach):
library(dplyr)
library(zoo)
df %>%
  mutate(groupLag = lag(group),
         dupLag = group == groupLag) %>%
  group_by(dupLag) %>%
  mutate(valueLagHelp = lag(value)) %>%
  ungroup() %>%
  mutate(helper = ifelse(dupLag == TRUE, NA, valueLagHelp)) %>%
  mutate(helper = case_when(is.na(helper) ~ na.locf(helper, na.rm = FALSE),
                            TRUE ~ helper)) %>%
  mutate(valAfterLag = lag(dupLag)) %>%
  mutate(otherLag = ifelse(is.na(lag(valueLagHelp)), lag(value), helper)) %>%
  mutate(otherLag = ifelse((valAfterLag | is.na(valAfterLag)) & !dupLag,
                           lag(value), otherLag)) %>%
  select(c(x, group, value, own_lag, otherLag)) # own_lag comes from the question's df
Sorry for the mess.
What it does is first create a group lag and a helper variable for the case when the group is equal to its lag (i.e. when two "A"s are consecutive). Then it groups by this helper variable and assigns the correct value to all rows where dupLag == FALSE. Now we need to take care of the ones with dupLag == TRUE.
So, ungroup. We need a new lagged-value helper that assigns NA to all dupLag == TRUE rows, because they are not correctly assigned yet.
Next, we assign to every NA in our helper the last non-NA value.
This is not all, because we still need to take care of some dupLag == FALSE data points (you see that when you look at the complete tibble). First, we basically just fix the second data point with the first mutate(otherLag = ...) operation. The next operation finalizes everything, and then we select the variables we'd like to have in the end.

R How to apply function to rows of grouped dataframe?

Suppose I have a dataframe generated like this:
dataframe <- data.frame(name = rep(c('A', 'B', 'C', 'D'), 25),
                        probe = rep(1:25, each = 4), # 'number' was undefined in the original; 1:25 matches the head() output below
                        a = rnorm(100),
                        b = rnorm(100) + 1,
                        c = rnorm(100) + 5)
> head(dataframe)
name probe a b c
1 A 1 0.03394554 2.97384424 4.173368
2 B 1 1.64304498 2.67977648 5.027671
3 C 1 0.35266588 1.62455820 5.664635
4 D 1 -1.24197302 0.29907974 5.243112
5 A 2 -0.20330593 0.45405930 6.603498
6 B 2 -1.06909795 -0.02575508 4.318659
The samples are in the columns. Variables are in the rows.
I need to calculate the ratio (A+B)/(C+D) for every set of samples using the same probe, such as when probe == 1 or probe == 2.
I can groupby by probe.
But it seems functions can only be applied to the columns; how do I apply a function to the rows of a grouped data frame?
Thanks for the help!
I'd reshape (using df for the question's dataframe):
library(dplyr)
library(tidyr)
df %>%
  gather(variable, value, -name, -probe) %>%
  spread(name, value) %>%
  mutate(ratio = (A + B) / (C + D))
Or we could use recast from reshape2. It is a convenient wrapper for melt/dcast. We add the new column 'ratio' after the reshape:
library(reshape2)
transform(recast(df, measure.var = c('a', 'b', 'c'),
                 probe + variable ~ name, value.var = 'value'),
          ratio = (A + B) / (C + D))
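With current tidyr the gather/spread pair can be sketched with pivot_longer/pivot_wider instead (assuming tidyr >= 1.0):
library(dplyr)
library(tidyr)
df %>%
  pivot_longer(c(a, b, c), names_to = "variable") %>% # defaults to values_to = "value"
  pivot_wider(names_from = name, values_from = value) %>%
  mutate(ratio = (A + B) / (C + D))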
