I was looking at this question: Find how many times duplicated rows repeat in R data frame, which provides the following code:
library(plyr)
ddply(df,.(a,b),nrow)
However, I have a dataset with many variables, so I can't type them out like a,b in this case. I've tried using names(data) with the paste function, but it doesn't seem to work. I tried this:
var_names=paste(names(data),collapse=",")
ddply(data,.(paste(a)),nrow)
It instead gives this output:
However, if I manually type them out, I get the proper output:
What do I need to do differently here?
Instead of pasting the names together and evaluating the result, use count from dplyr, which accepts multiple columns through across and the select helper everything():
library(dplyr)
df %>%
count(across(everything()))
A reproducible example with the mtcars dataset:
data(mtcars)
df <- mtcars %>%
select(vs:carb)
count(df, across(everything()))
vs am gear carb n
1 0 0 3 2 4
2 0 0 3 3 3
3 0 0 3 4 5
4 0 1 4 4 2
5 0 1 5 2 1
6 0 1 5 4 1
7 0 1 5 6 1
8 0 1 5 8 1
9 1 0 3 1 3
10 1 0 4 2 2
11 1 0 4 4 2
12 1 1 4 1 4
13 1 1 4 2 2
14 1 1 5 2 1
Also, ddply accepts a character vector of column names directly, so there is no need to build a single string:
library(plyr)
ddply(df, names(df), nrow)
vs am gear carb V1
1 0 0 3 2 4
2 0 0 3 3 3
3 0 0 3 4 5
4 0 1 4 4 2
5 0 1 5 2 1
6 0 1 5 4 1
7 0 1 5 6 1
8 0 1 5 8 1
9 1 0 3 1 3
10 1 0 4 2 2
11 1 0 4 4 2
12 1 1 4 1 4
13 1 1 4 2 2
14 1 1 5 2 1
Or, if we do create a single string from the names, paste the whole expression together and then evaluate it (not recommended, as there are standard ways of dealing with this):
eval(parse(text = paste('ddply(df, .(', toString(names(df)), '), nrow)')))
vs am gear carb V1
1 0 0 3 2 4
2 0 0 3 3 3
3 0 0 3 4 5
4 0 1 4 4 2
5 0 1 5 2 1
6 0 1 5 4 1
7 0 1 5 6 1
8 0 1 5 8 1
9 1 0 3 1 3
10 1 0 4 2 2
11 1 0 4 4 2
12 1 1 4 1 4
13 1 1 4 2 2
14 1 1 5 2 1
You can use aggregate, grouping by all the columns and counting the length of each group.
aggregate(1:nrow(df)~., df, length)
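A variant of the same base R idea, shown on toy data: add a helper column of ones and sum it within each combination of the remaining columns. The data frame and the column name n are assumptions made for illustration; this is only a sketch, not the code from the answer above.

```r
# Toy data with repeated rows (illustrative only)
df <- data.frame(a = c(1, 1, 2, 2, 2),
                 b = c("x", "x", "y", "y", "z"))

# Add a helper column of ones (hypothetical name `n`), then sum it
# for every distinct combination of the other columns (the `.`)
counts <- aggregate(n ~ ., data = transform(df, n = 1L), FUN = sum)
print(counts)
```

Each row of counts is one distinct row of df and n says how often it occurs, so the values of n add up to nrow(df).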
I have long data looking like this for example:
ID time condition
1 1 0
1 2 0
1 3 0
1 4 1
2 1 0
2 2 1
2 3 1
2 4 0
3 1 1
3 2 1
3 3 0
3 4 0
4 1 0
4 2 1
4 3 NA
4 4 NA
I want to keep only the rows up to and including the first time the condition is met, so I want:
ID time condition
1 1 0
1 2 0
1 3 0
1 4 1
2 1 0
2 2 1
3 1 1
4 1 0
4 2 1
I tried a loop, but a) I read that looping is not good coding style in R and b) it didn't work.
Sidenote: just if you are wondering, it does make sense that IDs have condition and then lose it again in my example, but I am only interested in when they first had it.
Thank you.
Here's an easy way with dplyr:
library(dplyr)
df %>% group_by(ID) %>%
filter(row_number() <= which.max(condition) | sum(condition) == 0)
# # A tibble: 7 x 3
# # Groups: ID [3]
# ID time condition
# <int> <int> <int>
# 1 1 1 0
# 2 1 2 0
# 3 1 3 0
# 4 1 4 1
# 5 2 1 0
# 6 2 2 1
# 7 3 1 1
It relies on which.max, which returns the index of the first maximum value in a vector. The | sum(condition) == 0 part keeps censored cases (groups where condition is always 0), for which which.max would otherwise return 1 and drop every row after the first.
Using this data:
df <- read.table(header = TRUE, text = 'ID time condition
1 1 0
1 2 0
1 3 0
1 4 1
2 1 0
2 2 1
2 3 1
2 4 0
3 1 1
3 2 1
3 3 0
3 4 0')
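To make the which.max logic concrete, here is a small base R sketch of the same rule applied to a single group's condition vector. The helper name keep_until_first is hypothetical, and unlike the filter call above it handles the all-zero case with an explicit branch and ignores NA when summing.

```r
# Hypothetical helper: TRUE for rows up to and including the first 1
keep_until_first <- function(condition) {
  # No 1 anywhere (ignoring NA): keep the whole group
  if (sum(condition, na.rm = TRUE) == 0) {
    return(rep(TRUE, length(condition)))
  }
  # which.max() gives the position of the first maximum, i.e. the first 1
  seq_along(condition) <= which.max(condition)
}

keep_until_first(c(0, 0, 0, 1))   # TRUE TRUE TRUE TRUE
keep_until_first(c(1, 1, 0, 0))   # TRUE FALSE FALSE FALSE
keep_until_first(c(0, 1, NA, NA)) # TRUE TRUE FALSE FALSE
keep_until_first(c(0, 0, 0))      # TRUE TRUE TRUE
```

The last call shows why the extra check is needed: which.max(c(0, 0, 0)) is 1, so without it only the first row of an all-zero group would survive.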
Hi, I am having difficulties converting my data into longitudinal (long-format) data using the Reshape package. I would be grateful if anyone could help me, thank you!
Data is as follows:
m <- matrix(sample(c(0, 0:3), 100, replace = TRUE), 10)  # counts 0-3, as in the printed data below
ID<-c(1:10)
dim(ID)=c(10,1)
m<- cbind(ID,m)
d <- as.data.frame(m)
names(d)<-c('ID', 'litter1', 'litter2', 'litter3', 'litter4', 'litter5', 'litter6', 'litter7', 'litter8', 'litter9', 'litter10')
print(d)
ID litter1 litter2 litter3 litter4 litter5 litter6 litter7 litter8 litter9 litter10
1 0 0 0 3 1 0 2 0 0 3
2 0 2 1 2 0 0 0 2 0 0
3 1 0 1 2 0 3 3 3 2 0
4 2 1 2 3 0 2 3 3 1 0
5 0 1 2 0 0 0 3 3 1 0
6 2 1 2 0 3 3 0 0 0 0
7 0 1 0 3 0 0 1 2 2 0
8 0 1 3 3 2 1 3 2 3 0
9 0 2 0 2 2 3 2 0 0 3
10 2 2 2 2 1 3 0 3 0 0
I wish to convert the above data into long format with the columns 'ID', 'litter category' (the category of the litter, i.e. 1-10) and 'litter number' (the number of pieces for each litter category):
ID littercategory litternumber
1 4 3
1 5 1
1 7 2
1 10 3
2 2 2
2 3 1
2 4 2
2 8 2
and so on.
Would really appreciate your help thank you!
You could do that as follows:
library(reshape2)
d = melt(d, id.vars=c("ID"))
colnames(d) = c('ID','littercategory','litternumber')
# remove the text in the littercategory column, keep only the number.
d$littercategory = gsub('litter','',d$littercategory)
# keep only the rows with a nonzero count (note the comma: this subsets rows, not columns)
d = d[d$litternumber != 0, ]
Output:
ID littercategory litternumber
1 1 4
2 1 8
3 1 6
4 1 4
7 1 6
8 1 5
10 1 10
1 2 6
2 2 9
As you can see, only the ordering differs from the output you requested, but I'm sure you can fix that yourself. (If not, there are plenty of resources on how to do that.)
Hope this helps!
To get the desired output, you have to melt your data and keep only the rows with nonzero values.
library(data.table)
# convert to a data.table first so that data.table's own melt method is used
result <- melt(setDT(d), id.vars = "ID")[value != 0][order(ID)]
# To get exact structure modify result
result[, .(ID,
littercategory = sub("litter", "", variable),
litternumber = value)]
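For comparison, the same two steps (stack the litter columns, then drop the zero counts) can be sketched in base R without extra packages. The two-row data frame and the name long are made up for illustration, and the final order() call is one possible way to get the ID-then-category ordering asked for in the question.

```r
# Small made-up example with the same column layout as the question
d <- data.frame(ID = 1:2,
                litter1 = c(0, 2),
                litter2 = c(3, 0),
                litter3 = c(1, 1))

# Stack the litter columns: unlist() goes column by column, so the ID
# vector is repeated once per column and each category once per row
long <- data.frame(
  ID = rep(d$ID, times = ncol(d) - 1),
  littercategory = rep(sub("litter", "", names(d)[-1]), each = nrow(d)),
  litternumber = unlist(d[-1], use.names = FALSE)
)

# Drop the zero counts, then sort by ID and numeric category
long <- long[long$litternumber != 0, ]
long <- long[order(long$ID, as.integer(long$littercategory)), ]
print(long)
```

Converting littercategory with as.integer inside order() matters once there are ten or more categories, because the character value "10" would otherwise sort before "2".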
I want to accumulate the values of a column till the end of the group, though starting the addition when a specific value occurs in another column. I am only interested in the first instance of the specific value within a group. So if that value occurs again within the group, the addition column should continue to add the values. I know this sounds like a rather strange problem, so hopefully the example table makes sense.
The following data frame is what I have now:
> df = data.frame(group = c(1,1,1,1,2,2,2,2,2,3,3,3,4,4,4),numToAdd = c(1,1,3,2,4,2,1,3,2,1,2,1,2,3,2),occurs = c(0,0,1,0,0,1,0,0,0,0,1,1,0,0,0))
> df
group numToAdd occurs
1 1 1 0
2 1 1 0
3 1 3 1
4 1 2 0
5 2 4 0
6 2 2 1
7 2 1 0
8 2 3 0
9 2 2 0
10 3 1 0
11 3 2 1
12 3 1 1
13 4 2 0
14 4 3 0
15 4 2 0
Thus, whenever a 1 occurs within a group, I want a cumulative sum of the values from the column numToAdd, until a new group starts. This would look like the following:
> finalDF = data.frame(group = c(1,1,1,1,2,2,2,2,2,3,3,3,4,4,4),numToAdd = c(1,1,3,2,4,2,1,3,2,1,2,1,2,3,2),occurs = c(0,0,1,0,0,1,0,0,0,0,1,1,0,0,0),added = c(0,0,3,5,0,2,3,6,8,0,2,3,0,0,0))
> finalDF
group numToAdd occurs added
1 1 1 0 0
2 1 1 0 0
3 1 3 1 3
4 1 2 0 5
5 2 4 0 0
6 2 2 1 2
7 2 1 0 3
8 2 3 0 6
9 2 2 0 8
10 3 1 0 0
11 3 2 1 2
12 3 1 1 3
13 4 2 0 0
14 4 3 0 0
15 4 2 0 0
Thus, the added column is 0 until a 1 occurs within the group, then accumulates the values from numToAdd until it moves to a new group, turning the added column back to 0. In group three, a value of 1 is found a second time, yet the cumulated sum continues. Additionally, in group 4, a value of 1 is never found, thus the value within the added column remains 0.
I've played around with dplyr, but can't get it to work. The following solution only outputs the total sum, and not the increasing cumulated number at each row.
library(dplyr)
df =
df %>%
mutate(added=ifelse(occurs == 1,cumsum(numToAdd),0)) %>%
group_by(group)
Try
df %>%
group_by(group) %>%
mutate(added= cumsum(numToAdd*cummax(occurs)))
# group numToAdd occurs added
# 1 1 1 0 0
# 2 1 1 0 0
# 3 1 3 1 3
# 4 1 2 0 5
# 5 2 4 0 0
# 6 2 2 1 2
# 7 2 1 0 3
# 8 2 3 0 6
# 9 2 2 0 8
# 10 3 1 0 0
# 11 3 2 1 2
# 12 3 1 1 3
# 13 4 2 0 0
# 14 4 3 0 0
# 15 4 2 0 0
Or using data.table
library(data.table)#v1.9.5+
i1 <-setDT(df)[, .I[(rleid(occurs) + (occurs>0))>1], group]$V1
df[, added:=0][i1, added:=cumsum(numToAdd), by = group]
Or a similar option as in dplyr
setDT(df)[,added := cumsum(numToAdd * cummax(occurs)) , by = group]
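To see why cumsum(numToAdd * cummax(occurs)) gives the wanted result, it may help to look at the intermediate vectors for a single group. The sketch below uses the values of group 2 from the question; the variable names gate and gated are only illustrative.

```r
# Values of group 2 from the question
numToAdd <- c(4, 2, 1, 3, 2)
occurs   <- c(0, 1, 0, 0, 0)

# cummax() switches from 0 to 1 at the first occurrence and stays at 1
gate <- cummax(occurs)
print(gate)    # 0 1 1 1 1

# Multiplying zeroes out everything before the first occurrence
gated <- numToAdd * gate
print(gated)   # 0 2 1 3 2

# A plain cumulative sum of the gated values is the wanted column
added <- cumsum(gated)
print(added)   # 0 2 3 6 8
```

Because cummax never goes back down, a second 1 later in the group (as in group 3) changes nothing, and a group without any 1 (group 4) keeps a gate of all zeros.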
You can use split-apply-combine in base R with something like:
df$added <- unlist(lapply(split(df, df$group), function(x) {
y <- rep(0, nrow(x))
pos <- cumsum(x$occurs) > 0
y[pos] <- cumsum(x$numToAdd[pos])
y
}))
df
# group numToAdd occurs added
# 1 1 1 0 0
# 2 1 1 0 0
# 3 1 3 1 3
# 4 1 2 0 5
# 5 2 4 0 0
# 6 2 2 1 2
# 7 2 1 0 3
# 8 2 3 0 6
# 9 2 2 0 8
# 10 3 1 0 0
# 11 3 2 1 2
# 12 3 1 1 3
# 13 4 2 0 0
# 14 4 3 0 0
# 15 4 2 0 0
To add another base R approach:
df$added <- unlist(lapply(split(df, df$group), function(x) {
c(x[,'occurs'][cumsum(x[,'occurs']) == 0L],
cumsum(x[,'numToAdd'][cumsum(x[,'occurs']) != 0L]))
}))
# group numToAdd occurs added
# 1 1 1 0 0
# 2 1 1 0 0
# 3 1 3 1 3
# 4 1 2 0 5
# 5 2 4 0 0
# 6 2 2 1 2
# 7 2 1 0 3
# 8 2 3 0 6
# 9 2 2 0 8
# 10 3 1 0 0
# 11 3 2 1 2
# 12 3 1 1 3
# 13 4 2 0 0
# 14 4 3 0 0
# 15 4 2 0 0
Another base R:
df$added <- unlist(lapply(split(df,df$group),function(x){
cumsum((cumsum(x$occurs) > 0) * x$numToAdd)
}))
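One caveat for the unlist(lapply(split(...))) versions: they return the values in group order, so they line up with the rows only when the data frame is already sorted by group, as it is in this example. A possible alternative, sketched here with base R's ave(), writes each group's result back to that group's original row positions. The helper name add_after_first and the mixed data frame are assumptions for illustration; passing row indices as the first argument is just one way to reach both columns inside FUN.

```r
df <- data.frame(group    = c(1,1,1,1,2,2,2,2,2,3,3,3,4,4,4),
                 numToAdd = c(1,1,3,2,4,2,1,3,2,1,2,1,2,3,2),
                 occurs   = c(0,0,1,0,0,1,0,0,0,0,1,1,0,0,0))

# Hypothetical helper: ave() splits the row indices by group, applies FUN
# to each piece and puts the results back in the original row order
add_after_first <- function(d) {
  ave(seq_len(nrow(d)), d$group, FUN = function(i) {
    cumsum(d$numToAdd[i] * cummax(d$occurs[i]))
  })
}

df$added <- add_after_first(df)

# Interleaved groups: the results still land on the right rows
mixed <- data.frame(group    = c(2, 1, 2, 1),
                    numToAdd = c(1, 2, 3, 4),
                    occurs   = c(1, 0, 0, 1))
mixed$added <- add_after_first(mixed)
print(mixed)   # added is 1 0 4 4
```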
Starting with some sample two-way frequency table:
a <- c(1,2,3,4,4,3,4,2,2,2)
b <- c(1,2,3,4,1,2,4,3,2,2)
tab <- table(a,b)
> tab
b
a 1 2 3 4
1 1 0 0 0
2 0 3 1 0
3 0 1 1 0
4 1 0 0 2
I need to transform the table into the following format:
goal <- data.frame(a=c(1,2,3,4),b=c(1,2,3,4),count=c(1,3,1,2))
> goal
a b count
1 1 1 1
2 2 2 3
3 3 3 1
4 4 4 2
. . . .
How can I form all pairwise combinations from the two-way table and add the frequency counts in the third column?
Intuition tells me there should be a simple kind of 'reverse' function for table, but I could not find anything on SO or Google.
Naturally, after posting the question I found the right search query for Google...
> as.data.frame(tab)
a b Freq
1 1 1 1
2 2 1 0
3 3 1 0
4 4 1 1
5 1 2 0
6 2 2 3
7 3 2 1
8 4 2 0
9 1 3 0
10 2 3 1
11 3 3 1
12 4 3 0
13 1 4 0
14 2 4 0
15 3 4 0
16 4 4 2
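If the zero cells are not wanted, or the columns should come back as numbers rather than factors, something along these lines may help. responseName and stringsAsFactors are arguments of the as.data.frame method for tables; the names long and nonzero are made up for this sketch.

```r
a <- c(1, 2, 3, 4, 4, 3, 4, 2, 2, 2)
b <- c(1, 2, 3, 4, 1, 2, 4, 3, 2, 2)
tab <- table(a, b)

# Name the count column directly and keep a and b as character vectors
long <- as.data.frame(tab, responseName = "count", stringsAsFactors = FALSE)

# The table labels are text, so convert them back to numbers
long$a <- as.numeric(long$a)
long$b <- as.numeric(long$b)

# Keep only the combinations that actually occur
nonzero <- long[long$count > 0, ]
print(nonzero)
```

Here long has all 16 combinations of a and b, while nonzero keeps the 7 that occur at least once.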