Reshape data in R using rows with NA to identify a new column

I have a dataset in R that looks like this:
DF <- data.frame(name = c("A","b","c","d","B","e","f"),
                 x = c(NA,1,2,3,NA,4,5))
I would like to reshape it into:
rDF <- data.frame(name = c("b","c","d","e","f"),
                  x = c(1,2,3,4,5),
                  head = c("A","A","A","B","B"))
where each row with an NA in x marks the start of a new group: its name becomes the head value for the following rows, until the next NA row starts a new group.
I have tried both spread and melt, but neither gives me what I want.
library(tidyr)
DF %>% spread(name,x)
library(reshape2)
melt(DF, id=c('name'))
Any suggestions?

Here's a possible solution combining the data.table and zoo packages:
library(data.table); library(zoo)
setDT(DF)[is.na(x), head := name]          # mark each header row with its own name
na.omit(DF[, head := na.locf(head)], "x")  # carry the header forward, then drop the NA rows
# name x head
# 1: b 1 A
# 2: c 2 A
# 3: d 3 A
# 4: e 4 B
# 5: f 5 B
Or, as suggested by @Arun, using just data.table:
na.omit(setDT(DF)[, head := name[is.na(x)], by=cumsum(is.na(x))])
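To see why this works: cumsum(is.na(x)) builds a group id that increments at each NA row, so each header row starts its own group:
cumsum(is.na(DF$x))
# [1] 1 1 1 1 2 2 2
Within each group, name[is.na(x)] picks out the header's name, and na.omit then drops the header rows themselves.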

You can try:
library(data.table)
library(magrittr)
split(DF, cumsum(is.na(DF$x))) %>%                           # split at each NA "header" row
  lapply(function(u) transform(u[-1, ], head = u[1, 1])) %>% # drop the header row, keep its name
  rbindlist
# name x head
#1: b 1 A
#2: c 2 A
#3: d 3 A
#4: e 4 B
#5: f 5 B

Here's an approach using only base R functions:
idx <- is.na(DF$x)               # rows that mark a new group header
x <- rle(cumsum(idx))$lengths    # size of each group, header included
DF$head <- rep(DF$name[idx], x)  # repeat each header name over its group
DF[!idx, ]                       # drop the header rows themselves
# name x head
#2 b 1 A
#3 c 2 A
#4 d 3 A
#6 e 4 B
#7 f 5 B
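For completeness, a tidyverse sketch is also possible; this uses tidyr::fill, which is my own addition rather than part of the answers above:
library(dplyr)
library(tidyr)
DF %>%
  mutate(head = ifelse(is.na(x), as.character(name), NA_character_)) %>%  # tag header rows with their name
  fill(head) %>%                                                          # carry the last header downward
  filter(!is.na(x))                                                       # drop the header rows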

Related

Add a new column with scaled values to all dataframes in a list

I have a list of dataframes, all of which contain a user column and another column called 'VD'. I want to add a new column 'VD_z' to every dataframe in the list, holding the scaled values of the VD column.
df1 <- data.frame(VD = 1:3, user=letters[1:3])
df2 <- data.frame(VD = 4:6, user=letters[4:6])
filelist <- list(df1,df2)
I read several similar questions, finally trying:
filelist <- mapply(cbind(filelist, VD_z= lapply(filelist, function(df) scale(df$VD))))
What I expect is that all dataframes in the list now have the new VD_z column with the scaled values, like this:
df1 <- data.frame(VD = 1:3, user=letters[1:3], VD_z=c(-1,0,1))
df2 <- data.frame(VD = 4:6, user=letters[4:6], VD_z=c(-1,0,1))
What I get is an error message:
Error in array(x, c(length(x), 1L), if (!is.null(names(x))) list(names(x), :
  'data' must be of a vector type, was 'NULL'
Thanks for your help!
We can use map from purrr to loop through the list and mutate to create 'VD_z':
library(tidyverse)
filelist %>%
  map(~ .x %>%
        mutate(VD_z = scale(VD)))
Or using base R with lapply/transform:
filelist1 <- lapply(filelist, transform, VD_z = scale(VD))
filelist1
#[[1]]
# VD user VD_z
#1 1 a -1
#2 2 b 0
#3 3 c 1
#[[2]]
# VD user VD_z
#1 4 d -1
#2 5 e 0
#3 6 f 1
If we use the logic from the OP's post, assign the scale output to the new column 'VD_z' and then return 'df':
filelist1 <- lapply(filelist, function(df) {df$VD_z <- scale(df$VD); df})
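Note that scale() returns a one-column matrix, so VD_z becomes a matrix column; if you prefer a plain numeric vector, a minimal variant is:
filelist2 <- lapply(filelist, function(df) {
  df$VD_z <- as.numeric(scale(df$VD))  # drop the matrix structure that scale() returns
  df
})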
A data.table approach could be:
library(data.table)
dd <- rbindlist(filelist, idcol = 'id')[, VD_z := scale(VD), by = id]
# id VD user VD_z
#1: 1 1 a -1
#2: 1 2 b 0
#3: 1 3 c 1
#4: 2 4 d -1
#5: 2 5 e 0
#6: 2 6 f 1
You can then use split() to split the data frame to a list, i.e.
split(dd, dd$id)
which gives,
$`1`
id VD user VD_z
1: 1 1 a -1
2: 1 2 b 0
3: 1 3 c 1
$`2`
id VD user VD_z
1: 2 4 d -1
2: 2 5 e 0
3: 2 6 f 1

Writing an R function with aggregation using data.table

I'm writing an R function with aggregations using the data.table package. My table looks like:
Name1 Name2 Price
A F 6
A D 5
A E 2
B F 4
B D 7
C F 4
C E 2
My function looks like:
MyFun <- function(Master_Table, Desired_Column, Group_By){
  Master_Table <- as.data.table(Master_Table)
  Master_Table_New <- Master_Table[, (Master_Table$Desired_Column), by=.(Desired_Column$Group_By)]
  return(Master_Table_New)
}
I want to calculate df[, .(Group_Median = median(Price)), by=.(Name1, Name2)]
But when I apply it inside my own function, it keeps giving me errors like:
Error in `[.data.table`(Master_Table, , .(Med_Group = mean(Master_Table$Desired_Column)), :
  column or expression 1 of 'by' or 'keyby' is type NULL. Do not quote column names. Usage: DT[,sum(colC),by=list(colA,month(colB))]
This would be the very first step of my whole work. If anyone knows anything about this, please let me know; any help would be appreciated!
The function should be written as:
MyFun <- function(Master_Table, Desired_Column, Group_By){
  Master_Table[, sapply(.SD, mean), .SDcols = Desired_Column, by = Group_By]
}
# Note how Group_By is passed as a single comma-separated string so it can supply multiple grouping columns.
MyFun(DT, "Price", "Name1,Name2")
# Name1 Name2 V1
# 1: A F 6
# 2: A D 5
# 3: A E 2
# 4: B F 4
# 5: B D 7
# 6: C F 4
# 7: C E 2
Data
DT <- read.table(text =
"Name1 Name2 Price
A F 6
A D 5
A E 2
B F 4
B D 7
C F 4
C E 2",
header = TRUE, stringsAsFactors = FALSE)
setDT(DT)
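Since the question's target computation was a median, a minimal variant of the same pattern (my sketch, not part of the answer above) would be:
MyFunMedian <- function(Master_Table, Desired_Column, Group_By){
  Master_Table[, lapply(.SD, median), .SDcols = Desired_Column, by = Group_By]
}
MyFunMedian(DT, "Price", "Name1,Name2")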

Dynamic column rename based on a separate data frame in R

Generate df1 and df2 like this:
pro <- c("Hide-Away", "Hide-Away")
sourceName <- c("New Rate2", "FST")
standardName <- c("New Rate", "SFT")
df1 <- data.frame(pro, sourceName, standardName, stringsAsFactors = F)
A <- 1; B <- 2; C <- 3; D <- 4; G <- 5; H <- 6; E <- 7; FST <- 8; Z <- 8
df2<- data.frame(A,B,C,D,G,H,E,FST)
colnames(df2)[1]<- "New Rate2"
Then run this code:
df1 <- df1[,c(2,3)]
index<-which(colnames(df2) %in% df1[,1])
index2<-which(df1[,1] %in% colnames(df2) )
colnames(df2)[index] <- df1[index2,2]
The input df2 looks like:
New Rate2 B C D G H E FST
1 2 3 4 5 6 7 8
The output df2 looks like:
New Rate B C D G H E SFT
1 2 3 4 5 6 7 8
So clearly the code worked and swapped the names correctly. But now create df2 with the code below instead, and make sure to regenerate df1 to what it was before.
df2<- data.frame(FST,B,C,D,G,H,E,Z)
colnames(df2)[8]<- "New Rate2"
and then run
df1 <- df1[,c(2,3)]
index<-which(colnames(df2) %in% df1[,1])
index2<-which(df1[,1] %in% colnames(df2) )
colnames(df2)[index] <- df1[index2,2]
The input df2 looks like:
FST B C D G H E New Rate2
8 2 3 4 5 6 7 8
The output df2 looks like:
New Rate B C D G H E SFT
8 2 3 4 5 6 7 8
So the order of the columns has not been preserved. I know this is because of the %in% code, but I am not sure of an easy fix to make the column swapping more dynamic.
I am not totally sure about the question, as it seems a little vague. I'll try my best though. The best way I know to dynamically set column names is setnames from the data.table package. So let's say I have a set of source names and a set of standard names, and I want to swap the source for the standard (which I take to be the question).
Given the data above, I have a data.frame structured like so:
> df2
A B C D G H E FST
1 1 2 3 4 5 6 7 8
as well as two vectors, sourceName and standardName.
sourceName <- c("A", "FST")
standardName <- c("New A", "FST 2: Electric Boogaloo")
I want to dynamically swap sourceName for standardName, and I can do this with setnames like so:
df3 <- as.data.table(df2)
setnames(df3, sourceName, standardName)
> df3
New A B C D G H E FST 2: Electric Boogaloo
1: 1 2 3 4 5 6 7 8
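If some source names might be absent from the table, newer data.table versions also accept skip_absent = TRUE (check your installed version; this argument is not in very old releases):
setnames(df3, sourceName, standardName, skip_absent = TRUE)  # silently skips names not found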
Trying to follow your example, in your second pass I get an empty index,
> df2
New Rate B C D G H E SFT
1 8 2 3 4 5 6 7 8
> df1
sourceName standardName
1 New Rate2 New Rate
2 FST SFT
> index<-which(colnames(df2) %in% df1[,1])
> index
integer(0)
which would account for the ordering you observed on assignment to the column names.
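A base R alternative that preserves df2's column order is match(), which aligns each column name against df1's source names directly (a sketch, assuming df1 still holds its sourceName and standardName columns):
m <- match(colnames(df2), df1$sourceName)                   # position of each name in df1, NA if absent
colnames(df2)[!is.na(m)] <- df1$standardName[m[!is.na(m)]]  # swap in standard names, order untouched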

R: Check whether a data.table is a subset of another data.table

How can I check whether one data.table is a subset of another, regardless of row and column order? For instance, imagine someone rbinded DT_x and DT_y, removed the duplicates, and created DT_z. Now I want to compare DT_x and DT_z and get a result showing that DT_x is a subset of DT_z.
As a very simple example:
DT1 <- data.table(a= LETTERS[1:10], v=1:10)
DT2 <- data.table(a= LETTERS[1:6], v=1:6)
DT1
a v
1: A 1
2: B 2
3: C 3
4: D 4
5: E 5
6: F 6
7: G 7
8: H 8
9: I 9
10: J 10
DT2
a v
1: A 1
2: B 2
3: C 3
4: D 4
5: E 5
6: F 6
I am sure all.equal(DT1, DT2) will not answer my question.
I think you can use data.table's fintersect() and fsetequal():
is_df1_subset_of_df2 <- function(df1, df2) {
intersection <- data.table::fintersect(df1, df2)
data.table::fsetequal(df1, intersection)
}
The first line picks the elements of df1 that exist in df2.
The second line checks if that set is all of df1.
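Applied to the example above, a quick check (assuming the function and data.tables are defined as shown):
is_df1_subset_of_df2(DT2, DT1)  # TRUE: every row of DT2 appears in DT1
is_df1_subset_of_df2(DT1, DT2)  # FALSE: DT1 has rows that DT2 lacks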

Group a data.table using a column which is a list

My problem is really big, and looping through the data.table to do what I want is too slow, so I am trying to get around looping. Let's assume I have a data.table as follows:
a <- data.table(i = c(1,2,3), j = c(2,2,6), k = list(c("a","b"),c("a","c"),c("b")))
> a
i j k
1: 1 2 a,b
2: 2 2 a,c
3: 3 6 b
And I want to group based on the values in k. So something like this:
a[, sum(j), by = k]
Right now I am getting the following error:
Error in `[.data.table`(a, , sum(i), by = k) :
The items in the 'by' or 'keyby' list are length (2,2,1). Each must be same length as rows in x or number of rows returned by i (3).
The answer I am looking for is to group first all the rows having "a" in column k and calculate sum(j), then all the rows having "b", and so on. So the desired answer would be:
k V1
a 4
b 8
c 2
Any hint how to do it efficiently? I can't melt column k by repeating the rows, since the resulting data.table would be too big for my case.
I think this might work:
a[, .(k = unlist(k)), by=.(i,j)][,sum(j),by=k]
k V1
1: a 4
2: b 8
3: c 2
If we are using tidyr, a compact option would be
library(tidyr)
unnest(a, k)[, sum(j), k]
# k V1
#1: a 4
#2: b 8
#3: c 2
Or using the dplyr/tidyr pipes
unnest(a, k) %>%
  group_by(k) %>%
  summarise(V1 = sum(j))
# k V1
# <chr> <dbl>
#1 a 4
#2 b 8
#3 c 2
Since by-group operations can be slow, I'd consider...
dat = a[rep(1:.N, lengths(k)), c(.SD, .(k = unlist(a$k))), .SDcols=setdiff(names(a), "k")]
i j k
1: 1 2 a
2: 1 2 b
3: 2 2 a
4: 2 2 c
5: 3 6 b
We're repeating the rows of columns i:j to match the unlisted k. The data should probably be kept in this long format instead of using a list column. From there, as in @MikeyMike's answer, we can compute dat[, sum(j), by=k].
In data.table 1.9.7+, we can similarly do
dat = a[, c(.SD[rep(.I, lengths(k))], .(k = unlist(k))), .SDcols=i:j]
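Either way, the flattened dat then aggregates directly, with no list-column issues:
dat[, .(V1 = sum(j)), by = k]
#    k V1
# 1: a  4
# 2: b  8
# 3: c  2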
