Distributing from frequency table to original data? - r

I have a data_aa as below
ID x.cal tx.cal
1 0 0
2 0 0
3 0 0
4 0 0
5 1 1
6 10 1
7 10 1
8 11 1
9 11 1
From data_aa above, I made a frequency table as below and computed a result_A variable.
x.cal tx.cal frequency result_A
0 0 3223 0.05268579
1 1 35 0.05048418
1 10 2 0.89475308
1 11 1 0.98251303
1 12 1 1.06831347
1 13 1 1.15179768
I want to add the result_A value from the frequency table to each individual (ID) in my original data_aa.
How can I redistribute result_A from the frequency table back to data_aa?
My desired table as below
ID x.cal tx.cal result_A
1 0 0 0.05268579
2 0 0 0.05268579
3 0 0 0.05268579
4 0 0 0.05268579
5 1 1 0.05048418
6 10 1 0.89475308
7 10 1 0.89475308
8 11 1 0.98251303
9 11 1 0.98251303

Use David's approach to merge the data frames on x.cal and tx.cal, then select the columns you want.
df <- merge(data_aa, freq_aa, by = c("x.cal", "tx.cal"), all.x = TRUE)
# merge() sorts by the key columns, so restore the original ID order
result <- df[order(df$ID), c("ID", "x.cal", "tx.cal", "result_A")]
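If you would rather not reorder rows at all, a lookup with match() is another option. This is only a sketch: the name freq_aa comes from the answer above, the composite key built with paste() is my own choice, and the small frequency table is typed in by hand with x.cal and tx.cal in the same orientation as data_aa.

```r
# Toy versions of the two tables (values copied from the question)
data_aa <- data.frame(ID     = 1:9,
                      x.cal  = c(0, 0, 0, 0, 1, 10, 10, 11, 11),
                      tx.cal = c(0, 0, 0, 0, 1, 1, 1, 1, 1))
freq_aa <- data.frame(x.cal    = c(0, 1, 10, 11),
                      tx.cal   = c(0, 1, 1, 1),
                      result_A = c(0.05268579, 0.05048418,
                                   0.89475308, 0.98251303))

# Build one key per row from the two columns, then look each data row up
key_data <- paste(data_aa$x.cal, data_aa$tx.cal, sep = "_")
key_freq <- paste(freq_aa$x.cal, freq_aa$tx.cal, sep = "_")
data_aa$result_A <- freq_aa$result_A[match(key_data, key_freq)]

data_aa
```

Rows of data_aa with no matching combination in freq_aa get NA, which is the same behaviour as all.x = TRUE in merge().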

Related

Converting data to longitudinal data

Hi, I am having difficulty converting my data into longitudinal (long-format) data using the reshape package. I would be grateful if anyone could help me, thank you!
Data is as follows:
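# the upper bound 3 is assumed from the printed values (0 to 3); the original expression was truncated
m <- matrix(sample(c(0, 0:3), 100, replace = TRUE), 10)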
ID<-c(1:10)
dim(ID)=c(10,1)
m<- cbind(ID,m)
d <- as.data.frame(m)
names(d)<-c('ID', 'litter1', 'litter2', 'litter3', 'litter4', 'litter5', 'litter6', 'litter7', 'litter8', 'litter9', 'litter10')
print(d)
ID litter1 litter2 litter3 litter4 litter5 litter6 litter7 litter8 litter9 litter10
1 0 0 0 3 1 0 2 0 0 3
2 0 2 1 2 0 0 0 2 0 0
3 1 0 1 2 0 3 3 3 2 0
4 2 1 2 3 0 2 3 3 1 0
5 0 1 2 0 0 0 3 3 1 0
6 2 1 2 0 3 3 0 0 0 0
7 0 1 0 3 0 0 1 2 2 0
8 0 1 3 3 2 1 3 2 3 0
9 0 2 0 2 2 3 2 0 0 3
10 2 2 2 2 1 3 0 3 0 0
I wish to convert the above data into longitudinal data with the columns 'ID', 'littercategory' (the category of the litter, i.e. 1-10) and 'litternumber' (the number of pieces in that litter category):
ID littercategory litternumber
1 4 3
1 5 1
1 7 2
1 10 3
2 2 2
2 3 1
2 4 2
2 8 2
and so on.
Would really appreciate your help thank you!
You could do that as follows:
library(reshape2)
d = melt(d, id.vars=c("ID"))
colnames(d) = c('ID','littercategory','litternumber')
# remove the text in the littercategory column, keep only the number.
d$littercategory = gsub('litter','',d$littercategory)
# drop the rows with zero pieces (note the comma: this subsets rows, not columns)
d = d[d$litternumber != 0, ]
Output:
ID littercategory litternumber
1 1 4
2 1 8
3 1 6
4 1 4
7 1 6
8 1 5
10 1 10
1 2 6
2 2 9
As you can see, only the row ordering differs from the output you requested, but I'm sure you can fix that yourself. (If not, there are plenty of resources on how to do that.)
Hope this helps!
To get the desired output, melt your data and keep only the nonzero values.
library(data.table)
# convert to a data.table first so that data.table's melt method is used
result <- melt(as.data.table(d), id.vars = "ID")[value != 0][order(ID)]
# To get exact structure modify result
result[, .(ID,
littercategory = sub("litter", "", variable),
litternumber = value)]
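For comparison, the same reshaping can be sketched in base R without any package. The three-column toy data frame below is made up for illustration, and the sketch assumes every column other than ID is named "litter" followed by its category number.

```r
# Small made-up example in the same shape as the question's data
d <- data.frame(ID = 1:3,
                litter1 = c(0, 2, 1),
                litter2 = c(3, 0, 0),
                litter3 = c(0, 0, 4))

# Category numbers taken from the column names ("litter2" -> 2)
categories <- as.integer(sub("litter", "", names(d)[-1]))

# Stack the litter columns on top of each other
long <- data.frame(
  ID             = rep(d$ID, times = length(categories)),
  littercategory = rep(categories, each = nrow(d)),
  litternumber   = unlist(d[-1], use.names = FALSE)
)

# Keep nonzero counts and sort by ID, then category
long <- long[long$litternumber != 0, ]
long <- long[order(long$ID, long$littercategory), ]
rownames(long) <- NULL
long
```

Sorting by ID and then by category gives the row order asked for in the question.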

Sum rows in a group, starting when a specific value occurs

I want to accumulate the values of a column till the end of the group, though starting the addition when a specific value occurs in another column. I am only interested in the first instance of the specific value within a group. So if that value occurs again within the group, the addition column should continue to add the values. I know this sounds like a rather strange problem, so hopefully the example table makes sense.
The following data frame is what I have now:
> df = data.frame(group = c(1,1,1,1,2,2,2,2,2,3,3,3,4,4,4),numToAdd = c(1,1,3,2,4,2,1,3,2,1,2,1,2,3,2),occurs = c(0,0,1,0,0,1,0,0,0,0,1,1,0,0,0))
> df
group numToAdd occurs
1 1 1 0
2 1 1 0
3 1 3 1
4 1 2 0
5 2 4 0
6 2 2 1
7 2 1 0
8 2 3 0
9 2 2 0
10 3 1 0
11 3 2 1
12 3 1 1
13 4 2 0
14 4 3 0
15 4 2 0
Thus, whenever a 1 occurs within a group, I want a cumulative sum of the values from the column numToAdd, until a new group starts. This would look like the following:
> finalDF = data.frame(group = c(1,1,1,1,2,2,2,2,2,3,3,3,4,4,4),numToAdd = c(1,1,3,2,4,2,1,3,2,1,2,1,2,3,2),occurs = c(0,0,1,0,0,1,0,0,0,0,1,1,0,0,0),added = c(0,0,3,5,0,2,3,6,8,0,2,3,0,0,0))
> finalDF
group numToAdd occurs added
1 1 1 0 0
2 1 1 0 0
3 1 3 1 3
4 1 2 0 5
5 2 4 0 0
6 2 2 1 2
7 2 1 0 3
8 2 3 0 6
9 2 2 0 8
10 3 1 0 0
11 3 2 1 2
12 3 1 1 3
13 4 2 0 0
14 4 3 0 0
15 4 2 0 0
Thus, the added column is 0 until a 1 occurs within the group, then accumulates the values from numToAdd until it moves to a new group, turning the added column back to 0. In group three, a value of 1 is found a second time, yet the cumulated sum continues. Additionally, in group 4, a value of 1 is never found, thus the value within the added column remains 0.
I've played around with dplyr, but can't get it to work. The following solution only outputs the total sum, and not the increasing cumulated number at each row.
library(dplyr)
df =
df %>%
mutate(added=ifelse(occurs == 1,cumsum(numToAdd),0)) %>%
group_by(group)
Try
df %>%
group_by(group) %>%
mutate(added= cumsum(numToAdd*cummax(occurs)))
# group numToAdd occurs added
# 1 1 1 0 0
# 2 1 1 0 0
# 3 1 3 1 3
# 4 1 2 0 5
# 5 2 4 0 0
# 6 2 2 1 2
# 7 2 1 0 3
# 8 2 3 0 6
# 9 2 2 0 8
# 10 3 1 0 0
# 11 3 2 1 2
# 12 3 1 1 3
# 13 4 2 0 0
# 14 4 3 0 0
# 15 4 2 0 0
Or using data.table
library(data.table)  # v1.9.5+ (needed for rleid)
# row indices from the first occurs == 1 onward within each group
i1 <- setDT(df)[, .I[(rleid(occurs) + (occurs > 0)) > 1], group]$V1
df[, added := 0][i1, added := cumsum(numToAdd), by = group]
Or a similar option as in dplyr
setDT(df)[,added := cumsum(numToAdd * cummax(occurs)) , by = group]
You can use split-apply-combine in base R with something like:
df$added <- unlist(lapply(split(df, df$group), function(x) {
y <- rep(0, nrow(x))
pos <- cumsum(x$occurs) > 0
y[pos] <- cumsum(x$numToAdd[pos])
y
}))
df
# group numToAdd occurs added
# 1 1 1 0 0
# 2 1 1 0 0
# 3 1 3 1 3
# 4 1 2 0 5
# 5 2 4 0 0
# 6 2 2 1 2
# 7 2 1 0 3
# 8 2 3 0 6
# 9 2 2 0 8
# 10 3 1 0 0
# 11 3 2 1 2
# 12 3 1 1 3
# 13 4 2 0 0
# 14 4 3 0 0
# 15 4 2 0 0
To add another base R approach:
df$added <- unlist(lapply(split(df, df$group), function(x) {
c(x[,'occurs'][cumsum(x[,'occurs']) == 0L],
cumsum(x[,'numToAdd'][cumsum(x[,'occurs']) != 0L]))
}))
# group numToAdd occurs added
# 1 1 1 0 0
# 2 1 1 0 0
# 3 1 3 1 3
# 4 1 2 0 5
# 5 2 4 0 0
# 6 2 2 1 2
# 7 2 1 0 3
# 8 2 3 0 6
# 9 2 2 0 8
# 10 3 1 0 0
# 11 3 2 1 2
# 12 3 1 1 3
# 13 4 2 0 0
# 14 4 3 0 0
# 15 4 2 0 0
Another base R:
df$added <- unlist(lapply(split(df,df$group),function(x){
cumsum((cumsum(x$occurs) > 0) * x$numToAdd)
}))
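The cummax-based answers above all rest on the same idea: within a group, cummax(occurs) is 0 until the first 1 and stays at 1 afterwards, so multiplying by it zeroes out everything before the trigger. Here is a sketch of that idea using ave(), which is my own substitution for the split/lapply/unlist pattern; the intermediate name started is made up for illustration.

```r
df <- data.frame(group    = c(1,1,1,1,2,2,2,2,2,3,3,3,4,4,4),
                 numToAdd = c(1,1,3,2,4,2,1,3,2,1,2,1,2,3,2),
                 occurs   = c(0,0,1,0,0,1,0,0,0,0,1,1,0,0,0))

# 0 before the first occurs == 1 in each group, 1 from then on
started <- ave(df$occurs, df$group, FUN = cummax)

# Values before the trigger contribute 0 to the running sum
df$added <- ave(df$numToAdd * started, df$group, FUN = cumsum)
df
```

One practical difference: ave() writes results back in the original row positions, so it does not rely on the data frame already being sorted by group, whereas unlist(lapply(split(...))) returns values in group order.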

how to merge or join data frame and keep the row names as well?

I have a few data frames, each with one column of values plus the corresponding row names.
When I merge them, I want to keep the row names as well,
for example
df1<- data.frame(replicate(1,sample(0:1,10,rep=TRUE)))
df2<- data.frame(replicate(1,sample(0:1,10,rep=TRUE)))
df3<- data.frame(replicate(1,sample(0:1,10,rep=TRUE)))
I expect to have an output like
row.names1 variable row.names2 variable row.names3 variable
1 1 1 1 1 0
2 1 2 0 2 1
3 0 3 0 3 1
4 0 4 1 4 1
5 0 5 0 5 0
6 0 6 0 6 0
7 0 7 1 7 0
8 0 8 1 8 0
9 0 9 0 9 0
10 1 10 1 10 1
I guess you want to cbind the datasets keeping the rownames. An option using data.table is
library(data.table) #data.table_1.9.5
dt <- do.call(cbind,lapply(mget(paste0("df",1:3)),
as.data.table, keep.rownames=TRUE))
setnames(dt, seq(2,ncol(dt),by=2), rep('variable',3))
setnames(dt, seq(1,ncol(dt), by=2), paste0('row.names', 1:(ncol(dt)/2)))
head(dt,2)
# row.names1 variable row.names2 variable row.names3 variable
#1: 1 0 1 1 1 1
#2: 2 0 2 1 2 0
do.call(cbind,mget(paste0("df",1:3)))
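The same layout can also be sketched in base R. Here the column name variable and the helper list dfs are my own choices for illustration, and set.seed() is only there to make the toy data reproducible.

```r
set.seed(1)
df1 <- data.frame(variable = sample(0:1, 10, replace = TRUE))
df2 <- data.frame(variable = sample(0:1, 10, replace = TRUE))
df3 <- data.frame(variable = sample(0:1, 10, replace = TRUE))
dfs <- list(df1, df2, df3)

# Turn each data frame into two columns: its row names and its values
parts <- lapply(seq_along(dfs), function(i) {
  out <- data.frame(rownames(dfs[[i]]), dfs[[i]][[1]],
                    stringsAsFactors = FALSE)
  names(out) <- c(paste0("row.names", i), "variable")
  out
})

# cbind() on data frames keeps the repeated "variable" name as it is
combined <- do.call(cbind, parts)
head(combined, 2)
```

Note that the repeated column name means combined$variable only reaches the first of the three; index by position (for example combined[[4]]) to get the others.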

cumulative counter in dataframe R

I have a dataframe with many rows, but the structure looks like this:
year factor
1 0
2 0
3 0
4 0
5 0
6 0
7 0
8 0
9 1
10 0
11 0
12 0
13 0
14 0
15 0
16 0
17 1
18 0
19 0
20 0
I need to add a counter as a third column. It should count the consecutive rows that contain zero, and reset to zero each time the value 1 is encountered. The result should look like this:
year factor count
1 0 0
2 0 1
3 0 2
4 0 3
5 0 4
6 0 5
7 0 6
8 0 7
9 1 0
10 0 1
11 0 2
12 0 3
13 0 4
14 0 5
15 0 6
16 0 7
17 1 0
18 0 1
19 0 2
20 0 3
I would be glad to do it in a quick way, avoiding loops, since I have to do the operations for hundreds of files.
You can recreate my dataframe by pasting the table in place of "..." here:
dt <- read.table(text = "...", header = TRUE)
Perhaps a solution like this with ave would work for you:
A <- cumsum(dt$factor)
ave(A, A, FUN = seq_along) - 1
# [1] 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 0 1 2 3
Original answer:
(Missed that the first value was supposed to be "0". Oops.)
x <- rle(dt$factor == 1)
y <- sequence(x$lengths)
y[dt$factor == 1] <- 0
y
# [1] 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 0 1 2 3
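To see why the ave() version works: cumsum(dt$factor) increases by one at every 1, so it labels each stretch of rows with a block number, and the counter is just the position inside the block minus one. Below is a self-contained sketch with the question's data typed in; the name block is mine.

```r
dt <- data.frame(year   = 1:20,
                 factor = c(rep(0, 8), 1, rep(0, 7), 1, rep(0, 3)))

# Block label: 0 for rows 1-8, 1 for rows 9-16, 2 for rows 17-20
block <- cumsum(dt$factor)

# Position within each block, shifted so that every block starts at 0
dt$count <- ave(block, block, FUN = seq_along) - 1

# Equivalent without ave(): the blocks are consecutive runs of equal labels
count_rle <- sequence(rle(block)$lengths) - 1

dt
```

Both variants are vectorised, so they should be cheap to apply across hundreds of files.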

Calculate Run Length Sequence and Maximum by Subject ID

We have time series data in which repeated observations were measured for several subjects. I would like to calculate the number of occasions in which the variable positive == 1 occurs for each subject (variable id).
A second aim is to identify the maximum length of these runs of consecutive observations in which positive == 1. For each subject there are likely to be multiple runs within the study period. Rather than calculating the maximum number of consecutive positive observations per subject, I would like to calculate the maximum run length within an individual run.
Here is a toy data set that illustrates the problem:
library(plyr)  # count() on a vector, as used below, comes from plyr
set.seed(1234)
test <- data.frame(id = rep(1:3, each = 10), positive = round(runif(30,0,1)))
test$run <- sequence(rle(test$positive)$lengths)
test$run_positive <- ifelse(test$positive == '0', '0', test$run)
test$episode <- ifelse(test$run_positive == '1', '1', '0')
count(test$episode)
x freq
1 0 25
2 1 5
The code above gets close to answering my first question in which I am attempting to count the number of positive episodes, however it is not conditioned by subject. This has the unfortunate effect of counting the last observation of Subject #1 and the first observation of Subject #2 in the same run. Can anyone help me develop code to condition this run length encoding by subject?
Secondly, how can one extract only the maximum run length for each run in which positive == 1? I would like to add a column that records the run length only at the observation where each run reaches its maximum, and 0 elsewhere. For Subject #1, this would look like:
id positive run run_positive episode max_run
1 1 0 1 0 0 0
2 1 1 1 1 1 0
3 1 1 2 2 0 0
4 1 1 3 3 0 0
5 1 1 4 4 0 0
6 1 1 5 5 0 5
7 1 0 1 0 0 0
8 1 0 2 0 0 0
9 1 1 1 1 1 0
10 1 1 2 2 0 2
If anyone can come up with a method to do this I would be extremely grateful.
I think this answers your first question:
aggregate(positive ~ id, data = test, FUN = sum)
id positive
1 1 7
2 2 4
3 3 4
This might answer your second question, but I would need to see the desired result for each id to check:
set.seed(1234)
test <- data.frame(id = rep(1:3, each = 10), positive = round(runif(30,0,1)))
test$run <- sequence(rle(test$positive)$lengths)
test$run_positive <- ifelse(test$positive == '0', '0', test$run)
test$episode <- ifelse(test$run_positive == '1', '1', '0')
test$group <- paste(test$id*10, test$positive, sep='')
my.seq <- data.frame(rle(test$group)$lengths)
test$first <- unlist(apply(my.seq, 1, function(x) seq(1,x)))
test$last <- unlist(apply(my.seq, 1, function(x) seq(x,1,-1)))
test$max <- ifelse(test$last == 1 & test$positive==1, test$run, 0)
test
id positive run run_positive episode group first last max
1 1 0 1 0 0 100 1 1 0
2 1 1 1 1 1 101 1 5 0
3 1 1 2 2 0 101 2 4 0
4 1 1 3 3 0 101 3 3 0
5 1 1 4 4 0 101 4 2 0
6 1 1 5 5 0 101 5 1 5
7 1 0 1 0 0 100 1 2 0
8 1 0 2 0 0 100 2 1 0
9 1 1 1 1 1 101 1 2 0
10 1 1 2 2 0 101 2 1 2
11 2 1 3 3 0 201 1 2 0
12 2 1 4 4 0 201 2 1 4
13 2 0 1 0 0 200 1 1 0
14 2 1 1 1 1 201 1 1 1
15 2 0 1 0 0 200 1 1 0
16 2 1 1 1 1 201 1 1 1
17 2 0 1 0 0 200 1 4 0
18 2 0 2 0 0 200 2 3 0
19 2 0 3 0 0 200 3 2 0
20 2 0 4 0 0 200 4 1 0
21 3 0 5 0 0 300 1 5 0
22 3 0 6 0 0 300 2 4 0
23 3 0 7 0 0 300 3 3 0
24 3 0 8 0 0 300 4 2 0
25 3 0 9 0 0 300 5 1 0
26 3 1 1 1 1 301 1 4 0
27 3 1 2 2 0 301 2 3 0
28 3 1 3 3 0 301 3 2 0
29 3 1 4 4 0 301 4 1 4
30 3 0 1 0 0 300 1 1 0
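In the printed result above, the run column still carries over from Subject #1 into Subject #2 (rows 11 and 12 show run 3 and 4). One way to condition the run-length encoding on the subject is to apply rle() within each id. This is a sketch only: the positive values are typed in from the table above rather than regenerated with set.seed(), and the names runs_by_id, episodes, longest and max_run are my own.

```r
test <- data.frame(
  id = rep(1:3, each = 10),
  positive = c(0,1,1,1,1,1,0,0,1,1,
               1,1,0,1,0,1,0,0,0,0,
               0,0,0,0,0,1,1,1,1,0)
)

# Run-length encode each subject separately
runs_by_id <- lapply(split(test$positive, test$id), rle)

# Number of positive episodes and longest positive run per subject
episodes <- sapply(runs_by_id, function(r) sum(r$values == 1))
longest  <- sapply(runs_by_id, function(r) max(c(0, r$lengths[r$values == 1])))

# Per-row column: run length on the last row of each positive run, else 0
test$max_run <- ave(test$positive, test$id, FUN = function(p) {
  r <- rle(p)
  ends <- cumsum(r$lengths)          # index of the last row of each run
  out <- numeric(length(p))
  out[ends[r$values == 1]] <- r$lengths[r$values == 1]
  out
})

episodes
longest
test[test$id == 1, ]
```

For Subject #1 the max_run column reproduces the desired table from the question (5 at row 6, 2 at row 10), and because each subject is encoded on its own, a run can no longer span two subjects.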
