R Reshape - Sum and Combine - r

I am sure this is really simple, but I cannot get it to work.
I need to sum two values while the other columns remain constant, using reshape/melt.
The data looks like this:
ID Value
1 2850508 1010.58828
2 2850508 94.37286
Desired Output:
ID Variable Value
1 2850508 Cost 1104.96114
Current Output:
ID Variable Value
1 2850508 Cost 1010.58828
2 2850508 Cost 94.37286
Current Code:
Sum <- melt(Data, id="ID", measured="Cost")
Any help would be greatly appreciated!

You can also just use the aggregate function.
# the formula is passed unnamed: its argument name changed from formula to x in R 4.2.0
aggregate(. ~ ID,
data = Data,
FUN = sum)
## ID Value
## 1 2850508 1104.961
And to get your desired output, you have to cbind and rearrange:
# the formula is passed unnamed so the call works before and after R 4.2.0
cbind(aggregate(. ~ ID,
data = Data,
FUN = sum),
Variable = "Cost")[, c("ID", "Variable", "Value")]
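For reference, a minimal self-contained sketch of the aggregate approach, using only the two rows shown in the question. The data frame is rebuilt by hand here, and the intermediate name res is just for illustration.

```r
# Rebuild the two rows from the question
Data <- data.frame(ID = c(2850508, 2850508),
                   Value = c(1010.58828, 94.37286))

# Sum Value within each ID
res <- aggregate(Value ~ ID, data = Data, FUN = sum)

# Add the constant Variable column and put the columns in the desired order
res$Variable <- "Cost"
res <- res[, c("ID", "Variable", "Value")]
print(res)
```

The formula Value ~ ID names the column explicitly; . ~ ID does the same thing when Value is the only other column.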

Using dplyr: (I added two more IDs so there'd be more data):
Here d is your data:
library(dplyr)
d %>%
group_by(ID) %>%
summarise(Value=sum(Value)) %>%
mutate(Variable="Cost") %>%
select(ID,Variable,Value)
ID Variable Value
1 2850508 Cost 1104.961
2 2850509 Cost 1164.961
3 2850510 Cost 1047.961

It is also very simple with data.table
library(data.table)
setDT(df)[, .(Variable = "Cost", Value = sum(Value)) , ID]
# ID Variable Value
# 1: 2850508 Cost 1104.961

Related

How do I turn non-sequential data in R into sequential data while grouping on an ID

I have the following dataframe.
df <- data.frame(Person = c("Eric","Eric","Eric","Joe","Joe","Joe"), Order = c(2,7,4,2,5,1),
Value = c("A","A","B","C","A","B"))
The order column is currently in a random order. Every person has 3 order values which are random integers between 1 and 8. Order is always a value between 1 and 8, and there are no repeats for a person. How do I transform the Order column so that it reflects the order of the values, grouped by the person? Thus, the order column would always be between 1 and 3. The desired output would look like this.
df <- data.frame(Person = c("Eric","Eric","Eric","Joe","Joe","Joe"), Order = c(1,3,2,2,3,1),
Value = c("A","A","B","C","A","B"))
Perhaps, we need to rank the 'Order' grouped by 'Person'
library(dplyr)
df %>%
group_by(Person) %>%
mutate(Order = rank(Order))
Some base R options
Using rank
transform(
df,
Order = ave(Order, Person, FUN = rank)
)
Using match + sort
transform(
df,
Order = ave(Order, Person, FUN = function(x) match(x,sort(x)))
)
Using data.table :
library(data.table)
setDT(df)[, Order := frank(Order), Person]
df
# Person Order Value
#1: Eric 1 A
#2: Eric 3 A
#3: Eric 2 B
#4: Joe 2 C
#5: Joe 3 A
#6: Joe 1 B
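As a quick check, here is a self-contained base R sketch that applies the ave + rank idea to the data frame from the question and compares the result with the desired Order column. It assumes, as the question states, that there are no repeated Order values within a person; with ties, rank would return averaged ranks by default.

```r
# Data from the question
df <- data.frame(Person = c("Eric", "Eric", "Eric", "Joe", "Joe", "Joe"),
                 Order  = c(2, 7, 4, 2, 5, 1),
                 Value  = c("A", "A", "B", "C", "A", "B"))

# Replace Order by its rank within each Person
df$Order <- ave(df$Order, df$Person, FUN = rank)
print(df)
```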

Select rows from grouped dataframe based on duplicate values

I have a dataframe with 3 columns: the id of each individual, the number of the group they belong to (gr), and location codes (loc). What I am trying to do is identify which individuals visit 2 locations in the following sequence: Location 1 -> Location 2 -> Location 1.
Dummy dataset:
id <- c(1,1,1,1,1,1,1,2,2,2,2,2,4,4,4,4,4,4,4,4)
gr<-c(1,1,1,1,1,1,1,1,1,1,1,1,1,4,4,4,4,4,4,4)
loc <- c(5,5,4,4,5,5,5,3,3,3,3,2,2,2,2,3,3,2,2,2)
df<- data.frame(id,gr, loc)
I have tried using a diff function, to identify differences between the locations:
dif<- diff(as.numeric(df$loc))
But I can't find any other way to move forward. In addition this approach doesn't account for the groups of each individual (and the ids repeat between groups). I was thinking maybe using a lag function but not sure how or if it helps at all. Any recommendations? Many thanks in advance, I'm still pretty new in R.
Desired output:
id<- c(1,4)
gr<- c(1,4)
out<- data.frame(cbind(id, gr))
A possible data.table option
unique(
setDT(df)[
,
q := rleid(loc), .(id, gr)
][
,
.SD[uniqueN(q) == 3 & first(loc) == last(loc)], .(id, gr)
][
,
.(id, gr)
]
)
gives
id gr
1: 1 1
2: 4 4
Maybe this works
library(dplyr)
library(data.table)
df %>%
group_by(id) %>%
filter(n_distinct(rleid(loc)) >2) %>%
slice_tail(n = 1) %>%
select(-loc) %>%
ungroup
# A tibble: 2 x 2
# id gr
# <dbl> <dbl>
#1 1 1
#2 4 4
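The same run-length idea can be sketched in base R without extra packages. This is only one possible reading of the requirement: a group qualifies when its locations collapse to exactly three runs and the first and last run share the same location. The helper name is_aba is made up for this example.

```r
# Dummy data from the question
id  <- c(1,1,1,1,1,1,1,2,2,2,2,2,4,4,4,4,4,4,4,4)
gr  <- c(1,1,1,1,1,1,1,1,1,1,1,1,1,4,4,4,4,4,4,4)
loc <- c(5,5,4,4,5,5,5,3,3,3,3,2,2,2,2,3,3,2,2,2)
df  <- data.frame(id, gr, loc)

# TRUE when the sequence collapses to Location 1 -> Location 2 -> Location 1
is_aba <- function(x) {
  runs <- rle(x)$values            # consecutive duplicates collapsed
  length(runs) == 3 && runs[1] == runs[3]
}

# Evaluate the pattern separately for every id/gr combination
flags <- aggregate(loc ~ id + gr, data = df, FUN = is_aba)
out <- flags[flags$loc, c("id", "gr")]
print(out)
```

Grouping by both id and gr matters here because ids repeat between groups, as noted in the question.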

Rearrange dataframe to fit longitudinal model in R

I have a dataframe where each entry relates to a job posting in the NHS specifying the week the job was posted, and what NHS Trust (and region) the job is in.
At the moment my dataframe looks something like this:
set.seed(1)
df1 <- data.frame(
NHS_Trust = sample(1:30,20,T),
Week = sample(1:10,20,T),
Region = sample(1:15,20,T))
And I would like to count the number of jobs for each week across each NHS Trust and assign that value to a new column 'jobs' so my dataframe looks like this:
set.seed(1)
df2 <- data.frame(
NHS_Trust = rep(1:30, each=10),
Week = rep(seq(1,10),30),
Region = rep(as.integer(runif(30,1,15)),1,each = 10),
Jobs = rpois(10*30, lambda = 2))
The dataframe may then be used to create a Poisson longitudinal multilevel model where I may model the number of jobs.
Using the data.table package you can group by, count and assign to a new column in a single expression. The syntax for data.tables is dt[i, j, by]. Here i selects or orders the rows ("with" these rows); it is empty in this case, so all rows are used in their original order. The j tells what is to be done: here, counting the number of occurrences using .N, which is then assigned to the new variable count using the assignment operator :=. The by takes a list of variables, and the j operation is performed on each group they define.
library(data.table)
setDT(df1)
df1[, count := .N, by = .(NHS_Trust, Week, Region)]
A tidyverse approach would be
library(tidyverse)
df1 <- df1 %>%
group_by(NHS_Trust, Week, Region) %>%
count()
You can use count to count number of jobs across each Region, NHS_Trust and Week and use complete to fill in missing combinations.
library(dplyr)
df1 %>%
count(Region, NHS_Trust, Week, name = 'Jobs') %>%
tidyr::complete(Region, Week = 1:10, fill = list(Jobs = 0))
I guess I'm moving my comment to an answer:
df2 <- df1 %>% group_by(Region, NHS_Trust, Week) %>% count(); colnames(df2)[4] <- "Jobs"
df2$combo <- paste0(df2$Region, "_", df2$NHS_Trust, "_", df2$Week)
for (i in 1:length(unique(df2$Region))){
for (j in 1:length(unique(df2$NHS_Trust))){
for (k in 1:length(unique(df2$Week))){
curr_combo <- paste0(unique(df2$Region)[i], "_",
unique(df2$NHS_Trust)[j], "_",
unique(df2$Week)[k])
if(!curr_combo %in% df2$combo){
curdat <- data.frame(unique(df2$Region)[i],
unique(df2$NHS_Trust)[j],
unique(df2$Week)[k],
0,
curr_combo,
stringsAsFactors = FALSE)
#cat(curdat)
names(curdat) <- names(df2)
df2 <- rbind(as.data.frame(df2), curdat)
}
}
}
}
tail(df2)
# Region NHS_Trust Week Jobs combo
# 4495 15 1 4 0 15_1_4
# 4496 15 1 5 0 15_1_5
# 4497 15 1 8 0 15_1_8
# 4498 15 1 3 0 15_1_3
# 4499 15 1 6 0 15_1_6
# 4500 15 1 9 0 15_1_9
The for loop here checks which Region-NHS_Trust-Week combinations are missing from df2 and appends those to df2 with a corresponding Jobs value of 0. The checking is done with the help of the new variable combo, which is just a concatenation of the values in the fields mentioned earlier, separated by underscores.
Edit: I am pretty sure the people here can come up with something more elegant than this.
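A loop-free sketch of the same zero-filling idea, in base R only. It counts postings per NHS_Trust and Week and fixes the Week levels at 1:10 so that weeks without postings show up with Jobs = 0. Region is left out on purpose: this assumes it would be attached afterwards from a trust-to-region lookup, which the question's df1 does not provide.

```r
# Example data as generated in the question
set.seed(1)
df1 <- data.frame(NHS_Trust = sample(1:30, 20, TRUE),
                  Week      = sample(1:10, 20, TRUE),
                  Region    = sample(1:15, 20, TRUE))

# Cross-tabulate trust by week; fixing the Week levels keeps empty weeks as 0
counts <- as.data.frame(
  table(NHS_Trust = factor(df1$NHS_Trust),
        Week      = factor(df1$Week, levels = 1:10)),
  responseName = "Jobs"
)
head(counts)
```

Only trusts that occur in df1 get rows here; passing levels = 1:30 to the NHS_Trust factor would add all-zero trusts as well.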

dplyr lag of different group

I am trying to use dplyr to mutate both a column containing the same-group lag of a variable and a column containing the lag of (one of) the other group(s).
Edit: Sorry, in the first edition, I messed up the order a bit by rearranging by date at the last second.
This is what my desired result would look like:
Here is a minimal code example:
library(tidyverse)
set.seed(2)
df <-
data.frame(
x = sample(seq(as.Date('2000/01/01'), as.Date('2015/01/01'), by="day"), 10),
group = sample(c("A","B"),10,replace = T),
value = sample(1:10,size=10)
) %>% arrange(x)
df <- df %>%
group_by(group) %>%
mutate(own_lag = lag(value))
df %>% data.frame(other_lag = c(NA,1,2,7,7,9,10,10,8,6))
Thank you very much!
A solution with data.table:
library(data.table)
# to create own lag:
setDT(df)[, own_lag:=c(NA, head(value, -1)), by=group]
# to create other group lag: (the function works actually outside of data.table, in base R, see N.B. below)
df[, other_lag:=sapply(1:.N,
function(ind) {
gp_cur <- group[ind]
if(any(group[1:ind]!=gp_cur)) tail(value[1:ind][group[1:ind]!=gp_cur], 1) else NA
})]
df
# x group value own_lag other_lag
#1: 2001-12-08 B 1 NA NA
#2: 2002-07-09 A 2 NA 1
#3: 2002-10-10 B 7 1 2
#4: 2007-01-04 A 5 2 7
#5: 2008-03-27 A 9 5 7
#6: 2008-08-06 B 10 7 9
#7: 2010-07-15 A 4 9 10
#8: 2012-06-27 A 8 4 10
#9: 2014-02-21 B 6 10 8
#10: 2014-02-24 A 3 8 6
Explanation of other_lag determination: The idea is, for each observation, to look at the group value, if there is any group value different from current one, previous to current one, then take the last value, else, put NA.
N.B.: other_lag can be created without the need of data.table:
df$other_lag <- with(df, sapply(1:nrow(df),
function(ind) {
gp_cur <- group[ind]
if(any(group[1:ind]!=gp_cur)) tail(value[1:ind][group[1:ind]!=gp_cur], 1) else NA
}))
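To make the rule easy to check, here is a small self-contained sketch of the same logic. The group and value columns are copied by hand from the printed output above rather than regenerated with set.seed, since sample() may give different draws on other R versions. The name prev is only illustrative.

```r
# group and value as they appear in the printed result above
df <- data.frame(group = c("B", "A", "B", "A", "A", "B", "A", "A", "B", "A"),
                 value = c(1, 2, 7, 5, 9, 10, 4, 8, 6, 3))

# For each row, take the value of the latest earlier row from another group
other_lag <- vapply(seq_len(nrow(df)), function(i) {
  prev <- which(df$group[seq_len(i)] != df$group[i])
  if (length(prev) > 0) df$value[max(prev)] else NA_real_
}, numeric(1))

df$other_lag <- other_lag
print(df)
```

This scans all earlier rows for every row, so it is quadratic in the number of rows; fine for small data, slow for large tables.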
Another data.table approach similar to @Cath's:
library(data.table)
DT = data.table(df)
DT[, vlag := shift(value), by=group]
DT[, volag := .SD[.(chartr("AB", "BA", group), x - 1), on=.(group, x), roll=TRUE, x.value]]
This assumes that A and B are the only groups. If there are more...
DT[, volag := DT[!.BY, on=.(group)][.(.SD$x - 1), on=.(x), roll=TRUE, x.value], by=group]
How it works:
:= creates a new column
DT[, col := ..., by=] does each assignment separately per by= group, essentially as a loop.
The grouping values for the current iteration of the loop are in the named list .BY.
The subset of data used by the current iteration of the loop is the data.table .SD.
x[!i, on=] is an anti-join, looking up rows of i in x and returning x with the matched rows dropped.
x[i, on=, roll=TRUE, x.v] ...
looks up each row of i in x using the on= condition
when no exact on= match is found, it "rolls" to the nearest previous value of the final on= column
it returns v from the x table
For more details and intuition, review the startup messages shown when you type library(data.table).
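The rolling lookup described above can be tried on a toy example. The tables x and i and their columns t and v are made up for this sketch and are not part of the question's data; it assumes the data.table package is installed.

```r
library(data.table)

# Lookup table: a value v recorded at times 1, 5 and 10
x <- data.table(t = c(1, 5, 10), v = c("a", "b", "c"))

# Times we want to look up
i <- data.table(t = c(0, 5, 7, 12))

# For each t in i, take v from the exact match in x or, failing that,
# from the nearest earlier t in x; NA when nothing precedes it
res <- x[i, on = .(t), roll = TRUE, x.v]
print(res)
```

Here 0 has no earlier row in x and gives NA, 5 matches exactly, 7 rolls back to 5, and 12 rolls back to 10.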
I am not entirely sure whether I got your question correctly, but if "own" and "other" refers to group A and B, then this might do the trick. I strongly assume there are more elegant ways to do this:
df.x <- df %>%
dplyr::group_by(group) %>%
mutate(value.lag=lag(value)) %>%
mutate(index=seq_along(group)) %>%
arrange(group)
df.a <- df.x %>%
filter(group=="A") %>%
rename(value.lag.a=value.lag)
df.b <- df.x %>%
filter(group=="B") %>%
rename(value.lag.b = value.lag)
df.a.b <- left_join(df.a, df.b[,c("index", "value.lag.b")], by=c("index"))
df.b.a <- left_join(df.b, df.a[,c("index", "value.lag.a")], by=c("index"))
df.x <- bind_rows(df.a.b, df.b.a)
Try this: (Pipe-Only approach)
library(zoo)
df %>%
mutate(groupLag = lag(group),
dupLag = group == groupLag) %>%
group_by(dupLag) %>%
mutate(valueLagHelp = lag(value)) %>%
ungroup() %>%
mutate(helper = ifelse(dupLag == T, NA, valueLagHelp)) %>%
mutate(helper = case_when(is.na(helper) ~ na.locf(helper, na.rm=F),
TRUE ~ helper)) %>%
mutate(valAfterLag = lag(dupLag)) %>%
mutate(otherLag = ifelse(is.na(lag(valueLagHelp)), lag(value), helper)) %>%
mutate(otherLag = ifelse((valAfterLag | is.na(valAfterLag)) & !dupLag,
lag(value), otherLag)) %>%
select(c(x, group, value, ownLag, otherLag))
Sorry for the mess.
What it does is that it first creates a group lag and creates a helper variable for the case when the group is equal to its lag (i.e. when two "A"s are consecutive). Then it groups by this helper variable and assigns the correct value to all rows with dupLag == F. Now we need to take care of the ones with dupLag == T.
So, ungroup. We need a new lagged-value helper that assigns all dupLag == T an NA, because they are not correctly assigned yet.
What's next is that we assign all NAs in our helper the last non-NA value.
This is not all because we still need to take care of some dupLag == F data points (you get that when you look at the complete tibble). First, we basically just change the second data point with the first mutate(otherLag==... operation. The next operation finalizes everything and then we select the variables which we'd like to have in the end.

Grouping with numeric variables

I have a dataframe like this:
name, value
stockA,Google
stockA,Yahoo
stockB,NA
stockC,Google
I would like to turn the values in the second column into columns of their own, keeping the first column, with each new column holding 1 if the value exists for that name and 0 if it does not. Here is an example of the expected output:
name,Google,Yahoo
stockA,1,1
stockB,0,0
stockC,1,0
I tried this:
library(reshape2)
df2 <- dcast(melt(df, 1:2, na.rm = TRUE), df + name ~ value, length)
and the error it gives me is this:
Using value as value column: use value.var to override.
Error in `[.data.frame`(x, i) : undefined columns selected
Any idea for the error?
An example in which the previous code works.
Data (df):
name,nam2,value
stockA,sth1,Yahoo
stockA,sth2,NA
stockB,sth3,Google
and this works:
df2 <- dcast(melt(df, 1:2, na.rm = TRUE), name + nam2 ~ value, length)
The OP has asked to get an explanation for the error caused by
dcast(melt(df, 1:2, na.rm = TRUE), df + name ~ value, length)
(I'm quite astonished that no one so far has tried to improve the OP's reshape2 approach to return exactly the expected answer).
There are several issues with OP's code:
df appears in the dcast() formula.
The second parameter to melt() is 1:2 which means that all columns are used as id.vars. It should read 1.
But the most crucial point is that the data.frame df already is in long format and doesn't need to be reshaped.
So, df can be used directly in dcast():
library(reshape2)
dcast(df[!is.na(df$value), ], name ~ value, length, drop = FALSE)
# name Google Yahoo
#1 stockA 1 1
#2 stockB 0 0
#3 stockC 1 0
In order to avoid a third NA column appearing in the result, the NA rows have to be filtered out of df before reshaping. On the other hand, drop = FALSE is required to ensure stockB is included in the result.
Data
df <- data.frame(name = c("stockA", "stockA", "stockB", "stockC"),
value = c("Google", "Yahoo", NA, "Google"))
df
# name value
#1 stockA Google
#2 stockA Yahoo
#3 stockB <NA>
#4 stockC Google
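For comparison, the same wide 0/1 layout can be sketched in base R with table(). Turning name into a factor before counting is what keeps stockB as an all-zero row, and table() ignores the NA in value by default. The names tab and wide are just for this example.

```r
df <- data.frame(name  = c("stockA", "stockA", "stockB", "stockC"),
                 value = c("Google", "Yahoo", NA, "Google"))

# Count name/value pairs; the NA value is dropped, the stockB level is kept
tab <- table(name = factor(df$name), value = df$value)

# Turn the contingency table into a plain data frame with a name column
wide <- as.data.frame.matrix(tab)
wide <- cbind(name = rownames(wide), wide)
rownames(wide) <- NULL
print(wide)
```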
You can do that with spread from the tidyr package.
library(dplyr)
library(tidyr)
df <- data.frame(name = c("stockA", "stockA", "stockB", "stockC"),
value = c("Google", "Yahoo", NA, "Google"))
df$row <- 1
df %>%
spread(value, row, fill = 0) %>%
select(-`<NA>`)
Try df2 <- dcast(melt(df, 1:2, na.rm = TRUE), name ~ value, length)
Just remove df + from the equation.
Though this will give you an extra column for NA values, which makes me think the na.rm argument isn't working properly in your formulation.
You can do it also with base R:
df <- read.table(header=TRUE, sep=',', text=
'name, value
stockA,Google
stockA,Yahoo
stockB,NA
stockC,Google')
xtabs(~., data=df)
# value
#name Google Yahoo
# stockA 1 1
# stockB 0 0
# stockC 1 0
