Building a contingency table

I have data like this:
A B
1 10
1 20
1 30
2 10
2 30
2 40
3 20
3 10
3 30
4 20
4 10
5 10
5 10
and I want to build a contingency table like this:
   10 20 30 40
10  1  3  2  0
20  3  0  2  0
30  2  2  0  0
40  0  0  0  0
Meaning: for each pair of values of column B that share the same value of column A, add 1 to the corresponding cell of the contingency table.
Can you help me do this?

Here is a very ugly answer, using the data from the image, because I already spent too much time on your problem. In general, it's not practical to have your result depend on the order of variables.
A <- rep(1:4, c(3, 2, 3, 3))
B <- c(10, 10, 30, 10, 20, 30, 20, 10, 10, 20, 30)
data <- data.frame(A, B)
# split by A and pair each value of B with the next one in the same group
library(plyr)
data2 <- ddply(data, .(A), function(x) {
  combined_pairs <- cbind(x$B[-nrow(x)],
                          x$B[-1])
  # return data where first is always lowest
  smallest <- apply(combined_pairs, MARGIN = 1, FUN = min)
  largest <- apply(combined_pairs, MARGIN = 1, FUN = max)
  return(data.frame(small = smallest, large = largest))
})
# count how often each (small, large) pair occurs
library(reshape2)
result <- dcast(small ~ large, data = data2,
                fun.aggregate = length)
> result
  small 10 20 30
1    10  1  3  1
2    20  0  0  2
I think you can add the empty rows yourself if you still need them.
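
If you do need the empty rows and columns, one possible base R sketch (not the code above, and assuming, as the answer does, that only consecutive values of B within each group of A are paired) is to turn both sides of each pair into factors with the same levels, so that table() keeps the combinations that never occur. The names pairs, lev, small, large and tab are only for illustration.

```r
A <- rep(1:4, c(3, 2, 3, 3))
B <- c(10, 10, 30, 10, 20, 30, 20, 10, 10, 20, 30)

# Consecutive pairs of B within each group of A (assumption: same pairing as the answer)
pairs <- do.call(rbind, lapply(split(B, A), function(b) {
  if (length(b) < 2) return(NULL)   # a single value cannot form a pair
  cbind(head(b, -1), tail(b, -1))
}))

# Shared factor levels make table() keep rows/columns with zero counts
lev   <- sort(unique(B))
small <- factor(pmin(pairs[, 1], pairs[, 2]), levels = lev)
large <- factor(pmax(pairs[, 1], pairs[, 2]), levels = lev)
tab   <- table(small, large)
tab
```

Here tab is a square table over all observed values of B, including the all-zero row for 30.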

Related

R: finding absolute difference with dplyr and group_by

I have the following example data:
Treat <- c(1,0,0,1,0,0,1,0)
PairID <- c(1,1,1,2,2,2,3,3)
Age <- c(30,60,31,20,20,40,50,52)
D <- data.frame(Treat,PairID,Age)
D
I want to create a new column with the absolute difference in Age compared to the Treat == 1 row in the same PairID.
I have tried using dplyr with:
D %>%
  group_by(PairID) %>%
  abs(Age - Age[Treat == 1])

In base R:
D$absD <- unlist(lapply(split(D,D$PairID), function(x) abs(x$Age - x$Age[x$Treat==1])))
> D
  Treat PairID Age absD
1     1      1  30    0
2     0      1  60   30
3     0      1  31    1
4     1      2  20    0
5     0      2  20    0
6     0      2  40   20
7     1      3  50    0
8     0      3  52    2
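
The split/unlist approach puts the results back in group order, so it relies on the rows already being sorted by PairID. A hedged alternative sketch, still in base R, uses ave() to look up the treated Age for every row in place. It assumes exactly one Treat == 1 row per PairID; the shuffled data and the name ref_age are made up for illustration.

```r
# Same values as the example, but with the rows deliberately not sorted by PairID
D <- data.frame(
  Treat  = c(0, 1, 0, 1, 0, 0, 0, 1),
  PairID = c(1, 2, 3, 1, 2, 1, 2, 3),
  Age    = c(60, 20, 52, 30, 20, 31, 40, 50)
)

# Age of the treated row, repeated for every row of its pair
# (assumption: exactly one Treat == 1 row per PairID)
ref_age <- ave(ifelse(D$Treat == 1, D$Age, NA), D$PairID,
               FUN = function(a) a[!is.na(a)][1])

D$absD <- abs(D$Age - ref_age)
D
```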

Linear interpolation by multiple groupings in R

I have the following data set:
  District Type DaysBtwn Start_Day End_Day Start_Vol End_Vol
1        A    0        3         0      31        28      23
2        A    1        3         0      31        24       0
3        B    0        3         0      31     17700   10526
4        B    1        3         0      31     44000   35800
5        C    0        3         0      31      5700       0
6        C    1        3         0      31     35000     500
For each of the group combinations District & Type, I want to do a simple linear interpolation: for a x=Days (Start_Day and End_Day) and y=Volumes (Start_Vol and End_Vol), I want the estimated volume returned for xout=DaysBtwn.
I have tried many things, and I think I am having issues because of the way my data is set up. Can someone point me in the right direction for how to use the approx function in R to get the desired output? I don't mind rearranging my data set to get the correct format for approx.
Example of desired output:
District Type EstimatedVol
1 0 25
2 1 15
3 0 13000
4 1 39000
5 0 2500
6 1 25000
dt <- data.table(input)
interpolation <- dt[, approx(x,y,xout=z), by=list(input$District,input$Type)]

Why not simply calculate it directly?
dt$EstimatedVol <- with(dt, (End_Vol - Start_Vol) / (End_Day - Start_Day) * (DaysBtwn - Start_Day) + Start_Vol)
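
If you would rather go through approx(), a minimal sketch is to call it once per row with the two known points. This assumes the data frame is called input (as in the question's attempt) and that Start_Day and End_Day differ on every row; the direct formula is computed alongside only to show that both routes agree.

```r
input <- data.frame(
  District  = c("A", "A", "B", "B", "C", "C"),
  Type      = c(0, 1, 0, 1, 0, 1),
  DaysBtwn  = 3,
  Start_Day = 0,
  End_Day   = 31,
  Start_Vol = c(28, 24, 17700, 44000, 5700, 35000),
  End_Vol   = c(23, 0, 10526, 35800, 0, 500)
)

# One approx() call per row: interpolate between (Start_Day, Start_Vol)
# and (End_Day, End_Vol) at x = DaysBtwn
input$EstimatedVol <- mapply(
  function(x0, x1, y0, y1, xout) approx(c(x0, x1), c(y0, y1), xout = xout)$y,
  input$Start_Day, input$End_Day, input$Start_Vol, input$End_Vol, input$DaysBtwn
)

# The direct formula, for comparison
direct <- with(input, (End_Vol - Start_Vol) / (End_Day - Start_Day) *
                        (DaysBtwn - Start_Day) + Start_Vol)
all.equal(input$EstimatedVol, direct)
```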

lag and summarize time series data

I have spent a significant amount of time searching for an answer with little luck. I have some time series data and need to collapse it by taking a rolling mean over every nth row. It looks like this is possible in zoo, maybe Hmisc, and I am sure other packages. I need to average rows 1,2,3, then 3,4,5, then 5,6,7, and so on. My data looks like this and has thousands of observations:
id time x.1 x.2  y.1  y.2
10    1  22  19    0  -.5
10    2  27  44   -1    0
10    3  19  13    0 -1.5
10    4   7  22   .5    1
10    5 -15   5  .33    2
10    6   3  17    1  .33
10    7   6  -2    0    0
10    8  44  25    0    0
10    9  27  12    1  -.5
10   10   2  11    2    1
I would like it to look like this when complete:
id time   x.1   x.2  y.1  y.2
10    1 22.66 25.33 -.33 -.66
10    2  3.66 13.33  .27  .50
The time value 1 would actually be times 1, 2, 3 averaged and 2 would be times 3, 4, 5 averaged, but at this point the time variable is not important to keep. I need to group by id, as it does change eventually. The only way I could figure out how to do this was to use Lag() to make one new column shifted by 1 and another shifted by 2, then take the average across columns. After that you have to delete every other row:
1 NA NA
2 1 NA
3 2 1
4 3 2
5 4 3
That is, use rows 1,2,3 and 3,4,5 and remove 2,3,4, and so on. Doing this for each variable would be outrageous, especially as I gather new data.
Any ideas? Help would be much appreciated.

Something like this, maybe?
library(zoo)  # provides rollmean()
# sample data
id <- c(10,10,10,10,10,10)
time <- c(1,2,3,4,5,6)
x1 <- c(22,27,19,7,-15,3)
x2 <- c(19,44,13,22,5,17)
df <- data.frame(id,time,x1,x2)
# rolling mean over windows of 3 rows, for every column except time
means <- data.frame(rollmean(df[,c(1,3:NCOL(df))], 3))
# keep every other window: rows 1-3, 3-5, ...
means <- means[c(TRUE,FALSE),]
means$time <- seq_len(NROW(means))
row.names(means) <- 1:NROW(means)
> means
  id        x1       x2 time
1 10 22.666667 25.33333    1
2 10  3.666667 13.33333    2
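
The code above treats the whole data frame as one series. Since the question says the id eventually changes, here is a rough base R sketch (no zoo) that computes the same 1-3, 3-5, ... window means separately for each id. The helper window_means, the second id 11 and its values are invented for illustration.

```r
# Mean of elements 1-3, 3-5, 5-7, ... of a numeric vector (window 3, step 2)
window_means <- function(v, width = 3, step = 2) {
  if (length(v) < width) return(numeric(0))   # not enough rows for one window
  starts <- seq(1, length(v) - width + 1, by = step)
  vapply(starts, function(s) mean(v[s:(s + width - 1)]), numeric(1))
}

# Sample data; id 11 is made up to show the grouping
df <- data.frame(
  id = rep(c(10, 11), each = 5),
  x1 = c(22, 27, 19, 7, -15, 1, 2, 3, 4, 5),
  x2 = c(19, 44, 13, 22, 5, 5, 4, 3, 2, 1)
)
value_cols <- c("x1", "x2")

# Apply the window means to every value column within each id
result <- do.call(rbind, lapply(split(df, df$id), function(g) {
  m <- lapply(g[value_cols], window_means)
  data.frame(id = rep(g$id[1], length(m[[1]])), m)
}))
rownames(result) <- NULL
result
```

Windows never cross from one id into the next, which the ungrouped rollmean() call cannot guarantee.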

Data Cleaning for Survival Analysis

I’m in the process of cleaning some data for a survival analysis and I am trying to make it so that an individual only has a single, sustained, transition from symptom present (ss=1) to symptom remitted (ss=0). An individual must have a complete sustained remission in order for it to count as a remission. Statistical problems/issues aside, I’m wondering how I can go about addressing the issues detailed below.
I’ve been trying to break the problem into smaller, more manageable operations and objects. However, the solutions I keep coming to force me to use conditional logic based on the rows immediately above and below a missing value and, quite frankly, I’m at a bit of a loss as to how to do this. I would love a little guidance if you know of a good technique I can use or experiment with, or of any good search terms I can use when looking up a solution.
The details are below:
#Fake dataset creation
id <- c(1,1,1,1,1,1,1,2,2,2,2,2,2,2,3,3,3,3,3,3,3,4,4,4,4,4,4,4)
time <-c(0,1,2,3,4,5,6,0,1,2,3,4,5,6,0,1,2,3,4,5,6,0,1,2,3,4,5,6)
ss <- c(1,1,1,1,NA,0,0,1,1,0,NA,0,0,0,1,1,1,1,1,1,NA,1,1,0,NA,NA,0,0)
mydat <- data.frame(id, time, ss)
*Bold and underlined characters represent changes from the dataset above
The goal here is to find a way to get the values of ss to look like this:
ID #1: 1,1,1,1,1,0,0
ID #2: 1,1,0,0,0,0,0
ID #3: 1,1,1,1,1,1,NA (no change because the row with NA will be deleted eventually)
ID #4: 1,1,1,1,1,0,0 (this one requires multiple changes and I expect it is the most challenging to tackle)

I don't think you have considered all the edge cases: what should happen with two NAs in a row at the end of a period, or with 4 or 5 NAs in a row? However, this will give you the requested solution for your tiny test case, using the na.locf function:
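require(zoo)
fillNA <- function(vec) {
  # leave the group untouched if its last value is NA
  if (is.na(tail(vec, 1))) {
    vec
  } else {
    # carry the last observation forward; na.rm = FALSE keeps the length unchanged
    na.locf(vec, na.rm = FALSE)
  }
}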
> mydat$locf <- with(mydat, ave(ss, id, FUN=fillNA))
> mydat
   id time ss locf
1   1    0  1    1
2   1    1  1    1
3   1    2  1    1
4   1    3  1    1
5   1    4 NA    1
6   1    5  0    0
7   1    6  0    0
8   2    0  1    1
9   2    1  1    1
10  2    2  0    0
11  2    3 NA    0
12  2    4  0    0
13  2    5  0    0
14  2    6  0    0
15  3    0  1    1
16  3    1  1    1
17  3    2  1    1
18  3    3  1    1
19  3    4  1    1
20  3    5  1    1
21  3    6 NA   NA
22  4    0  1    1
23  4    1  1    1
24  4    2  0    0
25  4    3 NA    0
26  4    4 NA    0
27  4    5  0    0
28  4    6  0    0
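
If you want to avoid the zoo dependency, the same "last observation carried forward" idea can be sketched in base R with cumsum(). This is only an illustration of the technique; the helper names locf and fill_group are made up, and it keeps the rule from the answer of leaving a group untouched when its last value is NA.

```r
# Last observation carried forward; leading NAs stay NA
locf <- function(v) {
  seen <- cumsum(!is.na(v))        # how many non-NA values seen so far
  c(NA, v[!is.na(v)])[seen + 1]    # position 1 (NA) covers leading NAs
}

# Same rule as the zoo-based answer: skip groups that end in NA
fill_group <- function(v) {
  if (is.na(v[length(v)])) v else locf(v)
}

id   <- rep(1:4, each = 7)
time <- rep(0:6, times = 4)
ss   <- c(1,1,1,1,NA,0,0, 1,1,0,NA,0,0,0, 1,1,1,1,1,1,NA, 1,1,0,NA,NA,0,0)
mydat <- data.frame(id, time, ss)

mydat$locf <- ave(mydat$ss, mydat$id, FUN = fill_group)
mydat
```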

R - Conditional replacement of column values in a data frame

I have a data frame with two columns, A and B. I want to replace the values of column B so that a value >= 5 becomes 1 and any other value becomes 0.
Note - There are 2 conditions to be checked.
X=read.csv("Y:/impdat.csv")
A B
3 16
12 3
1 2
12 9
4 4
5 6
21 1
4 14
3 10
12 1
So after replacing, the data should be
A B
3 1
12 0
1 0
12 1
4 0
5 1
21 0
4 1
3 1
12 0
Sounds simple. But I am unable to implement it.
I tried
ifelse(X$B>=5,1,0)
This only prints the new values, but the original data remains the same.

X$B <- as.integer(X$B >= 5)
will do the trick.

transform(X, B=ifelse(B>=5,1,0))

Got it.
Just had to assign the object.
X$B=ifelse(X$B>=5,1,0)
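
To make the point about assignment concrete, here is a small sketch using the sample values from the question (typed in by hand instead of read from the CSV; the names flags and unchanged are just for illustration). Calling ifelse() on its own builds a new vector and leaves X alone; the data frame only changes once the result is assigned back.

```r
X <- data.frame(
  A = c(3, 12, 1, 12, 4, 5, 21, 4, 3, 12),
  B = c(16, 3, 2, 9, 4, 6, 1, 14, 10, 1)
)

# ifelse() returns a new vector; X$B is not modified by this call
flags <- ifelse(X$B >= 5, 1, 0)
unchanged <- identical(X$B, c(16, 3, 2, 9, 4, 6, 1, 14, 10, 1))

# Assigning the result back is what updates the column
X$B <- as.integer(X$B >= 5)
X
```

The only difference between the two answers is the storage type: as.integer() gives integers, while ifelse(..., 1, 0) gives doubles.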
