I do this with some steps that are not at all elegant and safe, but I'm sure that there is a way more easily and fast.
I need a help to know what is a quick way to go from dataframe_1 to dataframe_2.
#from this
a<-c("A","A","B","B","C","C")
b<-c(1,2,12,2,4,5)
dataframe_1<-cbind.data.frame(a,b)
a b
1 A 1
2 A 2
3 B 12
4 B 2
5 C 4
6 C 5
#to this
a<-c(1,2)
b<-c(12,2)
c<-c(4,5)
dataframe_2<-cbind.data.frame(A=a,B=b,C=c)
A B C
1 1 12 4
2 2 2 5
Try unstack
> unstack(rev(dataframe_1))
A B C
1 1 12 4
2 2 2 5
One option IF the number of elements in each group is constant.
data.frame(do.call(cbind, split(dataframe_1$b, dataframe_1$a)))
A B C
1 1 12 4
2 2 2 5
This can be also be done with dcast and rowid from data.table:
dcast(as.data.table(dataframe_1), rowid(a) ~ a, value.var = 'b')[, -1]
A B C
1: 1 12 4
2: 2 2 5
Here, [, -1] removes the first column (which is rowid(a)).
Given the following df:
a=c('a','b','c')
b=c(1,2,5)
c=c(2,3,4)
d=c(2,1,6)
df=data.frame(a,b,c,d)
a b c d
1 a 1 2 2
2 b 2 3 1
3 c 5 4 6
I'd like to apply a function that normally takes a vector (and returns a vector) like cummax row by row to the columns in position b to d.
Then, I'd like to have the output back in the df, either as a vector in a new column of the df, or replacing the original data.
I'd like to avoid writing it as a for loop that would iterate every row, pull out the content of the cells into a vector, do its thing and put it back.
Is there a more efficient way? I've given the apply family functions a go, but I'm struggling to first get a good way to vectorise content of columns by row and get the right output.
the final output could look something like that (imagining I've applied a cummax() function).
a b c d
1 a 1 2 2
2 b 2 3 3
3 c 5 5 6
or
a b c d output
1 a 1 2 2 (1,2,2)
2 b 2 3 1 (2,3,3)
3 c 5 4 6 (5,5,6)
where output is a vector.
Seems this would just be a simple apply problem that you want to cbind to df:
> cbind(df, apply(df[ , 4:2] # work with columns in reverse order
, 1, # do it row-by-row
cummax) )
a b c d 1 2 3
d a 1 2 2 2 1 6
c b 2 3 1 2 3 6
b c 5 4 6 2 3 6
Ouch. Bitten by failing to notice that this would be returned in a column oriented matrix and need to transpose that result; Such a newbie mistake. But it does show the value of having a question with a reproducible dataset I suppose.
> cbind(df, t(apply(df[ , 4:2] , 1, cummax) ) )
a b c d d c b
1 a 1 2 2 2 2 2
2 b 2 3 1 1 3 3
3 c 5 4 6 6 6 6
To destructively assign the result to df you would just use:
df <- # .... that code.
This does the concatenation with commas (and as a result no longer needs to be transposed:
> cbind(df, output=apply(df[ , 4:2] , 1, function(x) paste( cummax(x), collapse=",") ) )
a b c d output
1 a 1 2 2 2,2,2
2 b 2 3 1 1,3,3
3 c 5 4 6 6,6,6
I have a data frame in R which is similar to the follows. Actually my real ’df’ dataframe is much bigger than this one here but I really do not want to confuse anybody so that is why I try to simplify things as much as possible.
So here’s the data frame.
id <-c(1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3)
a <-c(3,1,3,3,1,3,3,3,3,1,3,2,1,2,1,3,3,2,1,1,1,3,1,3,3,3,2,1,1,3)
b <-c(3,2,1,1,1,1,1,1,1,1,1,2,1,3,2,1,1,1,2,1,3,1,2,2,1,3,3,2,3,2)
c <-c(1,3,2,3,2,1,2,3,3,2,2,3,1,2,3,3,3,1,1,2,3,3,1,2,2,3,2,2,3,2)
d <-c(3,3,3,1,3,2,2,1,2,3,2,2,2,1,3,1,2,2,3,2,3,2,3,2,1,1,1,1,1,2)
e <-c(2,3,1,2,1,2,3,3,1,1,2,1,1,3,3,2,1,1,3,3,2,2,3,3,3,2,3,2,1,3)
df <-data.frame(id,a,b,c,d,e)
df
Basically what I would like to do is to get the occurrences of numbers for each column (a,b,c,d,e) and for each id group (1,2,3) (for this latter grouping see my column ’id’).
So, for column ’a’ and for id number ’1’ (for the latter see column ’id’) the code would be something like this:
as.numeric(table(df[1:10,2]))
##The results are:
[1] 3 7
Just to briefly explain my results: in column ’a’ (and regarding only those records which have number ’1’ in column ’id’) we can say that number '1' occured 3 times and number '3' occured 7 times.
Again, just to show you another example. For column ’a’ and for id number ’2’ (for the latter grouping see again column ’id’):
as.numeric(table(df[11:20,2]))
##After running the codes the results are:
[1] 4 3 3
Let me explain a little again: in column ’a’ and regarding only those observations which have number ’2’ in column ’id’) we can say that number '1' occured 4 times, number '2' occured 3 times and number '3' occured 3 times.
So this is what I would like to do. Calculating the occurrences of numbers for each custom-defined subsets (and then collecting these values into a data frame). I know it is not a difficult task but the PROBLEM is that I’m gonna have to change the input ’df’ dataframe on a regular basis and hence both the overall number of rows and columns might change over time…
What I have done so far is that I have separated the ’df’ dataframe by columns, like this:
for (z in (2:ncol(df))) assign(paste("df",z,sep="."),df[,z])
So df.2 will refer to df$a, df.3 will equal df$b, df.4 will equal df$c etc. But I’m really stuck now and I don’t know how to move forward…
Is there a proper, ”automatic” way to solve this problem?
How about -
> library(reshape)
> dftab <- table(melt(df,'id'))
> dftab
, , value = 1
variable
id a b c d e
1 3 8 2 2 4
2 4 6 3 2 4
3 4 2 1 5 1
, , value = 2
variable
id a b c d e
1 0 1 4 3 3
2 3 3 3 6 2
3 1 4 5 3 4
, , value = 3
variable
id a b c d e
1 7 1 4 5 3
2 3 1 4 2 4
3 5 4 4 2 5
So to get the number of '3's in column 'a' and group '1'
you could just do
> dftab[3,'a',1]
[1] 4
A combination of tapply and apply can create the data you want:
tapply(df$id,df$id,function(x) apply(df[id==x,-1],2,table))
However, when a grouping doesn't have all the elements in it, as in 1a, the result will be a list for that id group rather than a nice table (matrix).
$`1`
$`1`$a
1 3
3 7
$`1`$b
1 2 3
8 1 1
$`1`$c
1 2 3
2 4 4
$`1`$d
1 2 3
2 3 5
$`1`$e
1 2 3
4 3 3
$`2`
a b c d e
1 4 6 3 2 4
2 3 3 3 6 2
3 3 1 4 2 4
$`3`
a b c d e
1 4 2 1 5 1
2 1 4 5 3 4
3 5 4 4 2 5
I'm sure someone will have a more elegant solution than this, but you can cobble it together with a simple function and dlply from the plyr package.
ColTables <- function(df) {
counts <- list()
for(a in names(df)[names(df) != "id"]) {
counts[[a]] <- table(df[a])
}
return(counts)
}
results <- dlply(df, "id", ColTables)
This gets you back a list - the first "layer" of the list will be the id variable; the second the table results for each column for that id variable. For example:
> results[['2']]['a']
$a
1 2 3
4 3 3
For id variable = 2, column = a, per your above example.
A way to do it is using the aggregate function, but you have to add a column to your dataframe
> df$freq <- 0
> aggregate(freq~a+id,df,length)
a id freq
1 1 1 3
2 3 1 7
3 1 2 4
4 2 2 3
5 3 2 3
6 1 3 4
7 2 3 1
8 3 3 5
Of course you can write a function to do it, so it's easier to do it frequently, and you don't have to add a column to your actual data frame
> frequency <- function(df,groups) {
+ relevant <- df[,groups]
+ relevant$freq <- 0
+ aggregate(freq~.,relevant,length)
+ }
> frequency(df,c("b","id"))
b id freq
1 1 1 8
2 2 1 1
3 3 1 1
4 1 2 6
5 2 2 3
6 3 2 1
7 1 3 2
8 2 3 4
9 3 3 4
You didn't say how you'd like the data. The by function might give you the output you like.
by(df, df$id, function(x) lapply(x[,-1], table))