Generate df1 and df2 like this
pro <- c("Hide-Away", "Hide-Away")
sourceName <- c("New Rate2", "FST")
standardName <- c("New Rate", "SFT")
df1 <- data.frame(pro, sourceName, standardName, stringsAsFactors = F)
A <- 1; B <- 2; C <-3; D <- 4; G <- 5; H <- 6; E <-7; FST <-8; Z <-8
df2<- data.frame(A,B,C,D,G,H,E,FST)
colnames(df2)[1]<- "New Rate2"
Then run this code.
df1 <- df1[,c(2,3)]
index<-which(colnames(df2) %in% df1[,1])
index2<-which(df1[,1] %in% colnames(df2) )
colnames(df2)[index] <- df1[index2,2]
The input of DF2 will be like
New Rate2 B C D G H E FST
1 2 3 4 5 6 7 8
The output of DF2 will be like
New Rate B C D G H E SFT
1 2 3 4 5 6 7 8
So clearly the code worked and swapped the names correctly. But now create df2 with the code below instead, and make sure to regenerate df1 as it was before.
df2<- data.frame(FST,B,C,D,G,H,E,Z)
colnames(df2)[8]<- "New Rate2"
and then run
df1 <- df1[,c(2,3)]
index<-which(colnames(df2) %in% df1[,1])
index2<-which(df1[,1] %in% colnames(df2) )
colnames(df2)[index] <- df1[index2,2]
The input of df2 will be
FST B C D G H E New Rate2
8 2 3 4 5 6 7 8
The output of df2 will be
New Rate B C D G H E SFT
8 2 3 4 5 6 7 8
So the order of the columns has not been preserved. I know this is because of the %in% code, but I am not sure of an easy fix to make the column swapping more dynamic.
I am not totally sure about the question, as it seems a little vague. I'll try my best though--the best way I know to dynamically set column names is setnames from the data.table package. So let's say that I have a set of source names and a set of standard names, and I want to swap the source for the standard (which I take to be the question).
Given the data above, I have a data.frame structured like so:
> df2
A B C D G H E FST
1 1 2 3 4 5 6 7 8
as well as two vectors, sourceName and standardName.
sourceName <- c("A", "FST")
standardName <- c("New A", "FST 2: Electric Boogaloo")
I want to dynamically swap sourceName for standardName, and I can do this with setnames like so:
library(data.table)
df3 <- as.data.table(df2)
setnames(df3, sourceName, standardName)
> df3
New A B C D G H E FST 2: Electric Boogaloo
1: 1 2 3 4 5 6 7 8
Trying to follow your example, in your second pass I get an index value of 0,
> df2
New Rate B C D G H E SFT
1 8 2 3 4 5 6 7 8
> df1
sourceName standardName
1 New Rate2 New Rate
2 FST SFT
> index<-which(colnames(df2) %in% df1[,1])
> index
integer(0)
which would account for your expected ordering on assignment to column names.
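For the original question, the mismatch arises because which(colnames(df2) %in% df1[,1]) returns positions in df2's column order, while which(df1[,1] %in% colnames(df2)) returns positions in df1's row order, so the two index vectors only pair up correctly when both orders happen to agree. One way around this is to look each column name up with match(). The following is a minimal base R sketch under that assumption; pos and hit are illustrative names, not part of the original code.

```r
# Lookup table: source name -> standard name
df1 <- data.frame(sourceName   = c("New Rate2", "FST"),
                  standardName = c("New Rate", "SFT"),
                  stringsAsFactors = FALSE)

# Second layout from the question: FST first, "New Rate2" last
df2 <- data.frame(FST = 8, B = 2, C = 3, D = 4, G = 5, H = 6, E = 7, Z = 8)
colnames(df2)[8] <- "New Rate2"

# For every column of df2, find its row in df1 (NA when there is no mapping)
pos <- match(colnames(df2), df1$sourceName)
hit <- !is.na(pos)

# Rename only the matched columns, each with its own standard name
colnames(df2)[hit] <- df1$standardName[pos[hit]]
colnames(df2)
```

Because match() is evaluated per column of df2, the result does not depend on the order of the rows in df1 or of the columns in df2.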
I am trying to sort the names in each row and paste them into a comma-separated string that is stored in a new column.
This is my sample data.frame:
df=data.frame(A=c("A","K","B","D","F"),B =c("E","C","D","A","K"))
A B
1 A E
2 K C
3 B D
4 D A
5 F K
The Output I am trying to get would be like this
A B C
1 A E A , E
2 K C C , K
3 B D B , D
4 D A A , D
5 F K F , K
So far I have tried this :
lapply(df,FUN=paste(sort(df$A,df$B),collapse=" , "))
mapply(FUN= function(x,y)paste(sort(x,y),collapse=" , "),df$A,df$B)
Here I am trying to sort column values and paste them using ',' to create a unique pair name.
Any help is appreciated.
If you only have 2 columns, you can use pmax and pmin to avoid any costly looping code. E.g.:
with(lapply(df, as.character), paste(pmin(A,B),pmax(A,B),sep=",") )
#[1] "A,E" "C,K" "B,D" "A,D" "F,K"
You can do it with mapply, but since your data are factors, you need to coerce to character so they sort properly:
df$C <- mapply(function(x, y){paste(sort(c(as.character(x), as.character(y))),
collapse = ',')}, df$A, df$B)
df
# A B C
# 1 A E A,E
# 2 K C C,K
# 3 B D B,D
# 4 D A A,D
# 5 F K F,K
To simplify a bit, you can just use apply to iterate over the rows:
apply(df, 1, function(x){paste(sort(x), collapse = ',')})
Since it treats df as a matrix, it converts everything to character, which happens to be what you want for the sample data.
Also see tidyr::unite for pasting two columns together, though it can't easily sort.
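One caveat with the apply approach: if df already contains the new column C, apply(df, 1, ...) will sort and paste that column as well. A minimal sketch that restricts the row-wise step to columns A and B is shown below; the intermediate name m is illustrative only.

```r
df <- data.frame(A = c("A", "K", "B", "D", "F"),
                 B = c("E", "C", "D", "A", "K"),
                 stringsAsFactors = TRUE)

# as.matrix() turns the factor columns into a character matrix
m <- as.matrix(df[, c("A", "B")])

# Sort each row, then paste the two values into one string
df$C <- apply(m, 1, function(x) paste(sort(x), collapse = ","))
df
```

Selecting the columns by name also makes the call safe to rerun after C has been added.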
Try this
> for( i in 1:nrow(df)){
+ df$C[i]<-paste0(as.character(unlist(sort(df[i,1:2]))),collapse=" , ")
+ }
> df
A B C
1 A E A , E
2 K C C , K
3 B D B , D
4 D A A , D
5 F K F , K
I have a dataset in R that looks like this:
DF <- data.frame(name=c("A","b","c","d","B","e","f"),
x=c(NA,1,2,3,NA,4,5))
I would like to reshape it into:
rDF <- data.frame(name=c("b","c","d","e","f"),
x=c(1,2,3,4,5),
head=c("A","A","A","B","B"))
where the first row with an NA identifies a new column, and takes that "row value" until the next row with an NA, and then changes "row value".
I have tried both spread and melt, but it does not give me what I want.
library(tidyr)
DF %>% spread(name,x)
library(reshape2)
melt(DF, id=c('name'))
Any suggestions?
Here's a possible data.table/zoo packages combination solution
library(data.table) ; library(zoo)
setDT(DF)[is.na(x), head := name]
na.omit(DF[, head := na.locf(head)], "x")
# name x head
# 1: b 1 A
# 2: c 2 A
# 3: d 3 A
# 4: e 4 B
# 5: f 5 B
Or as suggested by @Arun, just using data.table
na.omit(setDT(DF)[, head := name[is.na(x)], by=cumsum(is.na(x))])
You can try:
library(data.table)
library(magrittr)
split(DF, cumsum(is.na(DF$x))) %>%
lapply(function(u) transform(u[-1,], head=u[1,1])) %>%
rbindlist
# name x head
#1: b 1 A
#2: c 2 A
#3: d 3 A
#4: e 4 B
#5: f 5 B
Here's an approach using only base R functions:
idx <- is.na(DF$x)
x <- rle(cumsum(idx))$lengths
DF$head <- rep(DF$name[idx], x)
DF[!idx,]
# name x head
#2 b 1 A
#3 c 2 A
#4 d 3 A
#6 e 4 B
#7 f 5 B
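A slightly shorter variant of the same base R idea indexes the header names directly with the running group number instead of going through rle(). This is only a sketch, and it assumes the first row of DF is always a header row (x is NA); grp is an illustrative name.

```r
DF <- data.frame(name = c("A", "b", "c", "d", "B", "e", "f"),
                 x    = c(NA, 1, 2, 3, NA, 4, 5),
                 stringsAsFactors = FALSE)

idx <- is.na(DF$x)   # TRUE on the rows that act as headers
grp <- cumsum(idx)   # running group number: 1 1 1 1 2 2 2

# Pick the header name belonging to each row's group
DF$head <- DF$name[idx][grp]

# Drop the header rows themselves
rDF <- DF[!idx, ]
rDF
```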
For dummy dataset
require(data.table)
require(reshape2)
teamid <- c(1,2,3)
member <- c("a,b","","c,g,h")
leader <- c("c", "d,e", "")
dt <- data.table(teamid, member, leader)
Now the dataset looks like this:
teamid member leader
1: 1 a,b c
2: 2 d,e
3: 3 c,g,h
3 Columns. For each team, they have team members, and team leaders in different column. Teams may have only members without leaders, and vice versa.
The following is my ALMOST desired output:
teamid value leader
1: 1 a FALSE
2: 1 b FALSE
3: 1 c TRUE
4: 1 c TRUE
5: 2 d TRUE
6: 2 e TRUE
7: 3 c FALSE
8: 3 g FALSE
9: 3 h FALSE
I want to have the two columns merged into one, and add a tag if one is a team leader.
I have an ugly solution for this,
dt1 <- dt[, strsplit(member, ","), by = teamid]
dt2 <- dt[, strsplit(leader, ","), by = teamid]
setkey(dt1,teamid)
setkey(dt2,teamid)
dt3 <- merge(dt1,dt2, all = TRUE)
dt4 <- melt(dt3, id = 1, measure = c("V1.x", "V1.y"))
dt5 <- dt4[value!="NA_real"]
dt6 <- dt5[, leader := (variable == "V1.y")][, variable := NULL]
setkey(dt6, teamid)
setnames(dt6,value,member)
Issues:
This solution is not efficient, I think: it merges first and then melts. Are there other ways to do this?
There are duplicated rows (rows 3 and 4).
When I tried to change column name, an error came up
setnames(dt6,value,member)
Error in setnames(dt6, value, member) : object 'value' not found
Maybe the most important thing,
When I tried to test on my real dataset, which has more than 1 million rows and 3 columns, the following error occurred:
merge(df1,df2, all = TRUE)
Error in vecseq(f__, len__, if (allow.cartesian) NULL else as.integer(max(nrow(x), :
Join results in 238797 rows; more than 142095 = max(nrow(x),nrow(i)). Check for duplicate key values in i, each of which join to the same group in x over and over again. If that's ok, try including j and dropping by (by-without-by) so that j runs for each group to avoid the large allocation. If you are sure you wish to proceed, rerun with allow.cartesian=TRUE. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and datatable-help for advice.
Any suggestion? Thanks a lot!
Melt first.
result <- melt(dt,id="teamid", variable.name="status", value.name="member")
result <- result[nchar(member)>0,strsplit(member,","),by=list(teamid,status)]
setnames(result,"V1","member")
setkey(result,teamid,status)
result
# teamid status member
# 1: 1 member a
# 2: 1 member b
# 3: 1 leader c
# 4: 2 leader d
# 5: 2 leader e
# 6: 3 member c
# 7: 3 member g
# 8: 3 member h
If you want to get rid of the status column and add a "tag" to the member column, you can do it this way:
result[status=="leader",member:=paste0(member,"*")]
result[,status:=NULL]
result
# teamid member
# 1: 1 a
# 2: 1 b
# 3: 1 c*
# 4: 2 d*
# 5: 2 e*
# 6: 3 c
# 7: 3 g
# 8: 3 h
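If a logical leader column (as in the question's desired output) is preferred over a "*" tag, the same melt-first idea can be adapted. The sketch below is one possible way to do it; the names long, people and value are illustrative. It also shows why setnames(dt6, value, member) failed in the question: setnames expects the old and new names as character strings, so bare symbols are looked up as R objects and not found.

```r
library(data.table)

dt <- data.table(teamid = c(1, 2, 3),
                 member = c("a,b", "", "c,g,h"),
                 leader = c("c", "d,e", ""))

# Melt first so members and leaders share one column
long <- melt(dt, id.vars = "teamid",
             variable.name = "status", value.name = "people")

# Split the comma-separated strings, skipping empty entries
long <- long[nchar(people) > 0,
             .(value = unlist(strsplit(people, ","))),
             by = .(teamid, status)]

# Turn the status label into a TRUE/FALSE flag
long[, leader := status == "leader"][, status := NULL]

# Column names are passed as quoted strings
setnames(long, "value", "member")
setkey(long, teamid)
long
```

Since no merge is involved, this produces no duplicated rows and avoids the cartesian join error from the question.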
A slightly simpler approach may be
crew <- dt[, .(strsplit(member, ","))]
crew <- unlist(crew)
leads <- dt[, .(strsplit(leader, ","))]
leads <- unlist(leads)
dt_long <- data.table(people=c(crew, leads),
status = rep(c("crew", "leader"), c(length(crew), length(leads))))
It gives me
people status
1: a crew
2: b crew
3: c crew
4: g crew
5: h crew
6: c leader
7: d leader
8: e leader
You can try a tidyverse solution now
library(dplyr)
library(tidyr)
dt %>%
separate_rows(member) %>%
separate_rows(leader) %>%
gather(status, member, -teamid) %>%
distinct() %>%
filter(member != "") %>%
mutate(member=ifelse(status == "leader", paste0(member, "*"), member)) %>%
select(-status)
teamid member
1 1 a
2 1 b
3 3 c
4 3 g
5 3 h
6 1 c*
7 2 d*
8 2 e*
I'm trying to implement a data.table for my relatively large datasets and I can't figure out how to operate a function over multiple columns in the same row. Specifically, I want to create a new column that contains a specifically-formatted tally of the values (i.e., a histogram) in a subset of columns. It is kind of like table() but that also includes 0 entries and is sorted--so, if you know of a better/faster method I'd appreciate that too!
Simplified test case:
DF<-data.frame("A"=c("a","d","a"),"B"=c("b","a","a"),"C"=c("c","a","a"),"D"=c("a","b","c"),"E"=c("a","a","c"))
DT<-as.data.table(DF)
> DT
A B C D E
1: a b c a a
2: d a a b a
3: a a a c c
my klunky histogram function:
histo<-function(vec){
foo<-c("a"=0,"b"=0,"c"=0,"d"=0)
for(i in vec){foo[i]=foo[i]+1}
return(foo)}
>histo(unname(unlist(DF[1,])))
a b c d
3 1 1 0
>histo(unname(unlist(DF[2,])))
a b c d
3 1 0 1
>histo(unname(unlist(DF[3,])))
a b c d
3 0 2 0
pseudocode of desired function and output
>DT[,his:=some_func_with_histo(A:E)]
>DT
A B C D E his
1: a b c a a (3,1,1,0)
2: d a a b a (3,1,0,1)
3: a a a c c (3,0,2,0)
df <- data.table(DF)
df$hist <- unlist(apply(df, 1, function(x) {
list(
sapply(letters[1:4], function(d) {
b <- sum(!is.na(grep(d,x)))
assign(d, b)
}))
}), recursive=FALSE)
Your df$hist column is a list, with each value named:
> df
A B C D E hist
1: a b c a a 3,1,1,0
2: d a a b a 3,1,0,1
3: a a a c c 3,0,2,0
> df$hist
[[1]]
a b c d
3 1 1 0
[[2]]
a b c d
3 1 0 1
[[3]]
a b c d
3 0 2 0
NOTE: Answer has been updated in response to the OP's request and mnel's comment
OK, how do you like that solution:
library(data.table)
DT <- data.table(A=c("a","d","a"),
B=c("b","a","a"),
C=c("c","a","a"),
D=c("a","b","c"),
E=c("a","a","c"))
fun <- function(vec, char) {
sum(vec==char)
}
DT[, Vec_Nr:= paste(Vectorize(fun, 'char')(.SD, letters[1:4]), collapse=","),
by=1:nrow(DT),
.SDcols=LETTERS[1:5]]
A B C D E Vec_Nr
1: a b c a a 3,1,1,0
2: d a a b a 3,1,0,1
3: a a a c c 3,0,2,0
I basically split up your problem into several steps:
First, I define a function fun that gives me the number of occurrences of one character. To see how that function works, just call
fun(c("a", "a", "b"), "b")
[1] 1
Next, I vectorize this function because you want the count not just for one character such as "b", but for many. To pass a vector of arguments to a function, use Vectorize. To see how that works, just type
Vectorize(fun, "char")(c("a", "a", "b"), c("a", "b"))
a b
2 1
Next, I collapse the results into one string and save that as a new column. Note that I deliberately used letters and LETTERS here to show you how to make this more dynamic.
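The same three steps (count one character, repeat for several characters, collapse to a string) can also be sketched in base R without grouping by row. This is an alternative under the assumption that the columns are character rather than factor; chars and counts are illustrative names.

```r
DF <- data.frame(A = c("a", "d", "a"), B = c("b", "a", "a"),
                 C = c("c", "a", "a"), D = c("a", "b", "c"),
                 E = c("a", "a", "c"), stringsAsFactors = FALSE)

chars <- letters[1:4]

# For each character, count per row how many cells equal it;
# the result is a matrix with one row per row of DF and one column per character
counts <- sapply(chars, function(ch) rowSums(DF == ch))

# Collapse each row of counts into a single string
DF$Vec_Nr <- apply(counts, 1, paste, collapse = ",")
DF
```

Here the loop runs over the four characters rather than over the rows, which may matter for larger tables.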
EDIT (also see below): Provided you first convert column classes to character, e.g., with DT <- DT[,lapply(.SD,as.character)]...
By using factor, you can convert vec and pass the values (a,b,c,d) in one step:
histo2 <- function(x) table(factor(x,levels=letters[1:4]))
Then you can iterate over rows by passing by=1:nrow(DT).
DT[,as.list(histo2(.SD)),by=1:nrow(DT)]
This gives...
nrow a b c d
1: 1 3 1 1 0
2: 2 3 1 0 1
3: 3 3 0 2 0
Also, this iterates over columns. This works because .SD is a special variable holding the subset of data associated with the call to by. In this case, that subset is the data.table consisting of one of the rows. histo2(DT[1]) works the same way.
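To see the factor trick in isolation, here is a small base R sketch that applies the same table(factor(...)) idea to each row of a plain data.frame. Fixing the levels is what keeps zero counts and a stable a, b, c, d order. The names lv, counts and his are illustrative, and the columns are assumed to be character.

```r
DF <- data.frame(A = c("a", "d", "a"), B = c("b", "a", "a"),
                 C = c("c", "a", "a"), D = c("a", "b", "c"),
                 E = c("a", "a", "c"), stringsAsFactors = FALSE)

lv <- letters[1:4]

# table() on a factor with fixed levels reports 0 for levels that do not occur
counts <- t(apply(as.matrix(DF), 1,
                  function(r) table(factor(r, levels = lv))))

# One "a,b,c,d" count string per row
his <- apply(counts, 1, paste, collapse = ",")
his
```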
EDIT (responding to OP's comment): Oh, sorry, I instinctively replaced your first line with
DF<-data.frame("A"=c("a","d","a"),"B"=c("b","a","a"),"C"=c("c","a","a"),"D"=c("a","b","c"),"E"=c("a","a","c")
,stringsAsFactors=FALSE)
since I dislike using factors except when making tables. If you do not want to convert your factor columns to character columns in this way, this will work:
histo3 <- function(x) table(factor(sapply(x,as.character),levels=letters[1:4]))
To put the output into a single column, you use := as you suggested...
DT[,hist:=list(list(histo3(.SD))),by=1:nrow(DT)]
The list(list()) part is key; I always figure this out by trial-and-error. Now DT looks like this:
A B C D E hist
1: a b c a a 3,1,1,0
2: d a a b a 3,1,0,1
3: a a a c c 3,0,2,0
You might find that it's a pain to access the information directly from your new column. For example, to access the "a" column of the "histogram", I think the fastest route is...
DT[,hist[[1]][["a"]],by=1:nrow(DT)]
My initial suggestion created an auxiliary data.table with just the counts. I think it's cleaner to do whatever you want to do with the counts in that data.table and then cbind it back. If you choose to store it in a column, you can always create the auxiliary data.table later with
DT[,as.list(hist[[1]]),by=1:nrow(DT)]
You are correct about using .SDcols. For your example, ...
cols = c("A","C")
histname = paste(c("hist",cols),collapse="")
DT[,(histname):=list(list(histo3(.SD))),by=1:nrow(DT),.SDcols=cols]
This gives
A B C D E hist histAC
1: a b c a a 3,1,1,0 1,0,1,0
2: d a a b a 3,1,0,1 1,0,0,1
3: a a a c c 3,0,2,0 2,0,0,0