I have a data frame like this:
name, value
stockA,Google
stockA,Yahoo
stockB,NA
stockC,Google
I would like to convert the values in the second column into columns, keep the first column, and fill the cells with a numeric 0 or 1 depending on whether the value exists. Here is an example of the expected output:
name,Google,Yahoo
stockA,1,1
stockB,0,0
stockC,1,0
I tried this:
library(reshape2)
df2 <- dcast(melt(df, 1:2, na.rm = TRUE), df + name ~ value, length)
and the error it gives me is this:
Using value as value column: use value.var to override.
Error in `[.data.frame`(x, i) : undefined columns selected
Any idea for the error?
An example in which the previous code works.
Data (df):
name,nam2,value
stockA,sth1,Yahoo
stockA,sth2,NA
stockB,sth3,Google
and this works:
df2 <- dcast(melt(df, 1:2, na.rm = TRUE), name + nam2 ~ value, length)
The OP has asked to get an explanation for the error caused by
dcast(melt(df, 1:2, na.rm = TRUE), df + name ~ value, length)
(I'm quite astonished that no one so far has tried to improve the OP's reshape2 approach to return exactly the expected answer).
There are several issues with the OP's code:
df appears in the dcast() formula.
The second parameter to melt() is 1:2 which means that all columns are used as id.vars. It should read 1.
But the most crucial point is that the data.frame df already is in long format and doesn't need to be reshaped.
So, df can be used directly in dcast():
library(reshape2)
dcast(df[!is.na(df$value), ], name ~ value, length, drop = FALSE)
# name Google Yahoo
#1 stockA 1 1
#2 stockB 0 0
#3 stockC 1 0
In order to avoid a third NA column appearing in the result, the NA rows have to be filtered out of df before reshaping. On the other hand, drop = FALSE is required to ensure stockB is included in the result.
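For comparison, a quick sketch (using the df from the Data block below) of what happens when either adjustment is left out:
library(reshape2)
# without removing the NA rows first, an extra "NA" column appears in the result:
dcast(df, name ~ value, length)
# without drop = FALSE, stockB (whose only row has value NA) disappears after filtering:
dcast(df[!is.na(df$value), ], name ~ value, length)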
Data
df <- data.frame(name = c("stockA", "stockA", "stockB", "stockC"),
value = c("Google", "Yahoo", NA, "Google"))
df
# name value
#1 stockA Google
#2 stockA Yahoo
#3 stockB <NA>
#4 stockC Google
You can do that with spread from the tidyr package.
library(dplyr)
library(tidyr)
df <- data.frame(name = c("stockA", "stockA", "stockB", "stockC"),
value = c("Google", "Yahoo", NA, "Google"))
df$row <- 1
df %>%
spread(value, row, fill = 0) %>%
select(-`<NA>`)
Try df2 <- dcast(melt(df, 1:2, na.rm = TRUE), name ~ value, length)
Just remove df + from the formula.
Though this will give you an extra column for NA values, which makes me think the na.rm argument isn't working properly in your formulation.
You can do it also with base R:
df <- read.table(header=TRUE, sep=',', text=
'name,value
stockA,Google
stockA,Yahoo
stockB,NA
stockC,Google')
xtabs(~., data=df)
# value
#name Google Yahoo
# stockA 1 1
# stockB 0 0
# stockC 1 0
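If a plain data.frame (rather than a contingency table) is needed afterwards, one possible sketch:
tab <- xtabs(~ ., data = df)
out <- data.frame(name = rownames(tab), as.data.frame.matrix(tab), row.names = NULL)
out   # columns name, Google, Yahoo with one row per stock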
A quick question that I was looking to understand better.
Data:
df1 <- data.frame(COLUMN_1 = letters[1:3], COLUMN_2 = 1:3)
> df1
COLUMN_1 COLUMN_2
1 a 1
2 b 2
3 c 3
Why does this work in setting data frame names to lower case:
df2 <- df1 %>%
set_names(., tolower(names(.)))
> df2
column_1 column_2
1 a 1
2 b 2
3 c 3
But this does not?
df2 <- df1 %>%
mutate( colnames(.) <- tolower(colnames(.)) )
Error: Column `colnames(.) <- tolower(colnames(.))` must be length 3 (the number of rows) or one, not 2
The solution is rename_all; written with the arguments spelled out explicitly:
df1 %>% rename_all(tolower)
# the same as:
rename_all(.tbl = df1, .funs = tolower)
mutate operates on the data itself, not the column names, so that's why we're using rename. We use rename_all so you don't have to type out each rename by hand (column_1 = COLUMN_1, column_2 = COLUMN_2, ...).
What you suggested, df2 <- df1 %>% rename_all(tolower(.)) doesn't work because then you would be trying to feed the whole df1 into the tolower function, which is not what you want.
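For reference, a small sketch of equivalent ways to pass the function (assuming dplyr >= 0.8 is loaded):
library(dplyr)
df1 %>% rename_all(tolower)          # pass the function itself
df1 %>% rename_all(~ tolower(.x))    # purrr-style lambda, equivalent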
Another solution would be this: names(df1) <- tolower(names(df1))
I'm transforming my data from long to wide. Part of the data are dates.
My problem is that I would like to have other colnames.
They are formed like, e.g., variable_1-1 and I want 1-1_variable instead.
df:
SN specimen_isolate_no isolaat materiaal_lokatie alarmniveau afnamedatum
1: 2 1-1 STAPEP Bloedkweek Bloed 0 2017-04-30
2: 3 1-1 KLEBOX Bloedkweek 0 2018-12-30
3: 3 2-1 KLEBOX Bloedkweek 0 2018-12-31
I tried dcast from data.table:
setDT(df)
df.wide <- dcast(df, SN ~ specimen_isolate_no, value.var = c("materiaal_lokatie","afnamedatum", "isolaat", "alarmniveau" ))
Which gives me the following result:
colnames:
[1] "SN" "materiaal_lokatie_1-1" "materiaal_lokatie_2-1"
"afnamedatum_1-1" "afnamedatum_2-1" "isolaat_1-1"
"isolaat_2-1" "alarmniveau_1-1" "alarmniveau_2-1"
This result is OK, but I would rather have the colnames formed like specimen_isolate_no_variable, e.g. 1-1_alarmniveau.
In order to achieve this, I tried
molten <- melt(df, id.vars = c("SN", "specimen_isolate_no"))
dfmolten <- dcast(molten, SN ~ specimen_isolate_no + variable)
#and
df %>%
gather(key, value, -SN, -specimen_isolate_no) %>%
unite(new.col, c(specimen_isolate_no,key )) %>%
spread(new.col, value)
But both options mess up my dates and I don't know how to fix that.
#colnames:
[1] "SN" "1-1_isolaat" "1-1_materiaal_lokatie" "1-1_alarmniveau" "1-1_afnamedatum" "2-1_isolaat" "2-1_materiaal_lokatie" "2-1_alarmniveau" "2-1_afnamedatum"
dfmolten$`1-1_afnamedatum`
[1] "17286" "17895"
So my question: does anyone know how to change the forming of colnames using dcast?
As Frank mentioned, there's an outstanding feature request for this... side note: please add reactions to FRs you'd like, we use this to some extent to steer development time:
https://github.com/Rdatatable/data.table/issues/3189
In the meantime, you can just use setnames and some regexing to do this:
old = grep('SN', names(df.wide), value = TRUE, invert = TRUE, fixed = TRUE)
# move the trailing specimen_isolate_no part (e.g. "1-1") to the front; splitting only on
# the last underscore keeps variable names like "materiaal_lokatie" in one piece
new = sub('^(.*)_([^_]+)$', '\\2_\\1', old)
setnames(df.wide, old, new)
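A quick sanity check of that substitution on two of the names from the output above:
sub('^(.*)_([^_]+)$', '\\2_\\1', c("materiaal_lokatie_1-1", "afnamedatum_2-1"))
# [1] "1-1_materiaal_lokatie" "2-1_afnamedatum"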
I am trying to use dplyr to mutate both a column containing the same-group lag of a variable as well as the lag of (one of) the other group(s).
Edit: Sorry, in the first edition, I messed up the order a bit by rearranging by date at the last second.
This is what my desired result would look like:
Here is a minimal code example:
library(tidyverse)
set.seed(2)
df <-
data.frame(
x = sample(seq(as.Date('2000/01/01'), as.Date('2015/01/01'), by="day"), 10),
group = sample(c("A","B"),10,replace = T),
value = sample(1:10,size=10)
) %>% arrange(x)
df <- df %>%
group_by(group) %>%
mutate(own_lag = lag(value))
df %>% data.frame(other_lag = c(NA,1,2,7,7,9,10,10,8,6))
Thank you very much!
A solution with data.table:
library(data.table)
# to create own lag:
setDT(df)[, own_lag:=c(NA, head(value, -1)), by=group]
# to create the other-group lag (the function actually works outside of data.table too, in base R; see N.B. below):
df[, other_lag:=sapply(1:.N,
function(ind) {
gp_cur <- group[ind]
if(any(group[1:ind]!=gp_cur)) tail(value[1:ind][group[1:ind]!=gp_cur], 1) else NA
})]
df
# x group value own_lag other_lag
#1: 2001-12-08 B 1 NA NA
#2: 2002-07-09 A 2 NA 1
#3: 2002-10-10 B 7 1 2
#4: 2007-01-04 A 5 2 7
#5: 2008-03-27 A 9 5 7
#6: 2008-08-06 B 10 7 9
#7: 2010-07-15 A 4 9 10
#8: 2012-06-27 A 8 4 10
#9: 2014-02-21 B 6 10 8
#10: 2014-02-24 A 3 8 6
Explanation of the other_lag determination: the idea is, for each observation, to look at the group values up to and including the current row; if any of them differs from the current group, take the last value belonging to that other group, otherwise put NA.
N.B.: other_lag can be created without the need for data.table:
df$other_lag <- with(df, sapply(1:nrow(df),
function(ind) {
gp_cur <- group[ind]
if(any(group[1:ind]!=gp_cur)) tail(value[1:ind][group[1:ind]!=gp_cur], 1) else NA
}))
Another data.table approach similar to #Cath's:
library(data.table)
DT = data.table(df)
DT[, vlag := shift(value), by=group]
DT[, volag := .SD[.(chartr("AB", "BA", group), x - 1), on=.(group, x), roll=TRUE, x.value]]
This assumes that A and B are the only groups. If there are more...
DT[, volag := DT[!.BY, on=.(group)][.(.SD$x - 1), on=.(x), roll=TRUE, x.value], by=group]
How it works:
:= creates a new column
DT[, col := ..., by=] does each assignment separately per by= group, essentially as a loop.
The grouping values for the current iteration of the loop are in the named list .BY.
The subset of data used by the current iteration of the loop is the data.table .SD.
x[!i, on=] is an anti-join, looking up rows of i in x and returning x with the matched rows dropped.
x[i, on=, roll=TRUE, x.v] ...
looks up each row of i in x using the on= condition
when no exact on= match is found, it "rolls" to the nearest previous value of the final on= column
it returns v from the x table
For more details and intuition, review the startup messages shown when you type library(data.table).
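As a tiny, self-contained illustration of the rolling-join lookup used above (the table and values here are made up purely for demonstration):
library(data.table)
lookup <- data.table(g = c("A", "A", "B"), x = c(1L, 5L, 3L), v = c(10, 50, 30))
query  <- data.table(g = "A", x = 4L)
# there is no exact match for (A, 4); roll = TRUE falls back to the previous x within g,
# i.e. the row (A, 1), and x.v returns the v column from the lookup table
lookup[query, on = .(g, x), roll = TRUE, x.v]
# [1] 10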
I am not entirely sure whether I got your question correctly, but if "own" and "other" refers to group A and B, then this might do the trick. I strongly assume there are more elegant ways to do this:
df.x <- df %>%
dplyr::group_by(group) %>%
mutate(value.lag=lag(value)) %>%
mutate(index=seq_along(group)) %>%
arrange(group)
df.a <- df.x %>%
filter(group=="A") %>%
rename(value.lag.a=value.lag)
df.b <- df.x %>%
filter(group=="B") %>%
rename(value.lag.b = value.lag)
df.a.b <- left_join(df.a, df.b[,c("index", "value.lag.b")], by=c("index"))
df.b.a <- left_join(df.b, df.a[,c("index", "value.lag.a")], by=c("index"))
df.x <- bind_rows(df.a.b, df.b.a)
Try this: (Pipe-Only approach)
library(zoo)
df %>%
mutate(groupLag = lag(group),
dupLag = group == groupLag) %>%
group_by(dupLag) %>%
mutate(valueLagHelp = lag(value)) %>%
ungroup() %>%
mutate(helper = ifelse(dupLag == T, NA, valueLagHelp)) %>%
mutate(helper = case_when(is.na(helper) ~ na.locf(helper, na.rm=F),
TRUE ~ helper)) %>%
mutate(valAfterLag = lag(dupLag)) %>%
mutate(otherLag = ifelse(is.na(lag(valueLagHelp)), lag(value), helper)) %>%
mutate(otherLag = ifelse((valAfterLag | is.na(valAfterLag)) & !dupLag,
lag(value), otherLag)) %>%
select(c(x, group, value, own_lag, otherLag))
Sorry for the mess.
What it does is first create a group lag and a helper variable for the case when the group is equal to its lag (i.e. when two "A"s follow each other). Then it groups by this helper variable and assigns the correct value to all rows where dupLag == FALSE. Now we need to take care of the ones with dupLag == TRUE.
So, ungroup. We need a new lagged-value helper that assigns NA to all dupLag == TRUE rows, because they are not correctly assigned yet.
Next, we assign to all NAs in our helper the last non-NA value.
This is not all, because we still need to take care of some dupLag == FALSE data points (you see that when you look at the complete tibble). First, we basically just adjust the second data point with the first mutate(otherLag = ...) operation. The next operation finalizes everything, and then we select the variables we'd like to have in the end.
Here is my example
Student <- c('A', 'B', 'B')
Assessor <- c('C', 'D', 'D')
Score <- c(1, 5, 7)
df <- data.frame(Student, Assessor, Score)
df <- dcast(df, Student ~ Assessor,fun.aggregate=(function (x) x), value = 'Score')
print(df)
The output:
Using Score as value column: use value.var to override.
Error in .fun(.value[0], ...) : unused argument (value = "Score")
While I want to get something like
C D
A 1 NaN
B NaN 5
B NaN 7
What am I missing?
In addition, if I replace Score with
Score <- c('foo', 'bar','bar')
The output will be:
Using Score as value column: use value.var to override.
Error in .fun(.value[0], ...) : unused argument (value = "Score")
Any thoughts?
Since dcast spreads among unique values of the left-hand side of the formula, I think you can achieve your goal with a (not so elegant) hack, but I bet there are other ways to do that, with table maybe.
library(reshape2)
dcast(df, Student + Score ~ ...)[-2]
Using Score as value column: use value.var to override.
Student C D
1 A 1 NA
2 B NA 5
3 B NA 7
The hack is to keep Student and Score on the left-hand side of the formula, spread the remaining variable (in this case Assessor), and then drop the Score column with [-2] in order to get the desired output (unless your first column is actually meant to be row names, which is impossible in base R here because the Student values are duplicated; in that case you would need a data.table solution).
Using the dev version of tidyr (0.3.0); get it from GitHub.
First we complete the combinations of Student/Assessor, then we nest it all into a list, spread and then unnest the list into new rows.
library(dplyr)
library(tidyr)
df %>% complete(Student, Assessor) %>%
nest(Score) %>%
spread(Assessor, Score) %>%
unnest(C) %>%
unnest(D)
Student C D
1 A 1 NA
2 B NA 5
3 B NA 7
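spread() and this form of nest() have since been superseded; with current tidyr (>= 1.0.0) a roughly equivalent sketch is the following (the .id helper column is only there to keep the two duplicate B/D rows apart):
library(dplyr)
library(tidyr)
df %>%
group_by(Student, Assessor) %>%
mutate(.id = row_number()) %>%
ungroup() %>%
pivot_wider(names_from = Assessor, values_from = Score) %>%
select(-.id)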
I am sure this is really simple, but I cannot get it to work.
I need to sum two values while the other columns remain constant using reshape/melt:
Data looks like this:
ID Value
1 2850508 1010.58828
2 2850508 94.37286
Desired Output:
ID Variable Value
1 2850508 Cost 1104.96114
Current Output:
ID Variable Value
1 2850508 Cost 1010.58828
2 2850508 Cost 94.37286
Current Code:
Sum <- melt(Data, id="ID", measured="Cost")
Any help would be greatly appreciated!
You can also just use the aggregate function.
aggregate(formula = . ~ ID,
data = Data ,
FUN = sum)
## ID Value
## 1 2850508 1104.961
And to get your desired output, you have to cbind and rearrange:
cbind(aggregate(formula = . ~ ID,
data = Data ,
FUN = sum),
Variable = "Cost")[, c("ID", "Variable", "Value")]
Using dplyr: (I added two more IDs so there'd be more data):
where d is your data:
d %>%
group_by(ID) %>%
summarise(Value=sum(Value)) %>%
mutate(Variable="Cost") %>%
select(ID,Variable,Value)
ID Variable Value
1 2850508 Cost 1104.961
2 2850509 Cost 1164.961
3 2850510 Cost 1047.961
It is also very simple with data.table:
library(data.table)
setDT(df)[, .(Variable = "Cost", Value = sum(Value)) , ID]
# ID Variable Value
# 1: 2850508 Cost 1104.961