Calculate agreement for a specific row in a table in R

Hello! I am new to R. I have this table and I want to find the correlation: how much agreement is there in the 3rd row between all three tosses? How can I calculate this for more than two values in a specific row? (Heads is 1, tails is 2.) Can I also check for agreement between one column and all the rest?
library(readxl)
COIN_TOSS <- read_excel("C:/Users/user/Desktop/COIN TOSS.xlsx")
   TOSS #1 TOSS #2 TOSS #3
1        2       2       2
2        2       2       1
3        1       1       2
4        2       1       1
5        2       1       1
6        2       1       2
7        1       1       2
8        1       1       2
9        1       1       1
10       2       1       1
Also, I want to print a plot with the sums of values. I have the top 3 values of each column (10 columns in total) from this (for AM, the most frequent values are shown below):
am <- excel__data$AM
oneam <- sort(table(am), decreasing = TRUE)[1:3]
> oneam
am
 3  2  4
31 26 24
For the plot I used the code below, but the y-axis keeps its original range, with a maximum of about 30, so not all the stacked-up values are visible. How can I make it go up to 200? Can I use something other than plot and points?
plot(oneam, pch = 10, col = "red")
points(onecm, pch = 10, col = "blue")
points(onefm, pch = 10, col = "green")
points(onekk, pch = 10, col = "yellow")
points(onekm, pch = 10, col = "black")
points(onels, pch = 10, col = "orange")
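A side note on the axis: plot() fixes the y-axis range from its first argument, so later points() calls can fall outside the visible area. A minimal sketch of the fix (assuming the other one* vectors are built the same way as oneam) is to set ylim explicitly:
# set the y-axis range up front so all later points() calls stay visible
plot(oneam, pch = 10, col = "red", ylim = c(0, 200))
points(onecm, pch = 10, col = "blue")  # onecm etc. assumed defined like oneam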

"Agreement" and "Correlation" are very different things.
If you simply want to look at "agreement", you could calculate a row-wise mean and standard deviation. A low standard deviation indicates that all tosses were fairly close; you could even standardize by dividing SD/mean, giving the coefficient of variation, a percentage metric.
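A minimal sketch of the row-wise version (assuming the column names shown in the question, which need backticks because of the spaces and #):
library(dplyr)
COIN_TOSS %>%
  rowwise() %>%
  mutate(
    Mean = mean(c(`TOSS #1`, `TOSS #2`, `TOSS #3`)),
    SD   = sd(c(`TOSS #1`, `TOSS #2`, `TOSS #3`)),
    CV   = SD / Mean * 100  # coefficient of variation in %
  ) %>%
  ungroup()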
You can be even more specific and calculate a "distance measure" from one specific toss to the other two, e.g.:
library(dplyr)
# a Toss3_Delta near 0 means toss 3 is close to the mean of the other two
COIN_TOSS %>%
  mutate(Toss3_Delta = ((`TOSS #1` + `TOSS #2`) / 2 - `TOSS #3`) /
                       ((`TOSS #1` + `TOSS #2`) / 2))
Now, if we are talking about correlation, in your example this only works column-wise, because three values per row are not enough to calculate a correlation. This works:
library(magrittr)
COIN_TOSS %>% cor()
(Note the basic pipe %>% here rather than the exposition pipe %$%: cor() takes the whole data frame as its argument.)
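cor() on the whole data frame returns the matrix of pairwise column correlations, which also answers the one-column-versus-the-rest question; index the matrix by column name (a sketch, reusing the question's column names):
cor(COIN_TOSS)[, "TOSS #3"]  # correlation of TOSS #3 with every column (itself = 1)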

Related

sequential counting with input from more than one variable in r

I want to create a column with sequential values, but it gets its input from two other columns in the df. I want the value to increase sequentially whenever either Team changes (between 1 and 2) or Event = x. Any help would be appreciated! See the example below:
   Team Event Value
1     1     a     1
2     1     a     1
3     2     a     2
4     2     x     3
5     2     a     3
6     1     a     4
7     1     x     5
8     1     a     5
9     2     x     6
10    2     a     6
This will do it...
df$Value <- cumsum(df$Event=="x" | c(1, diff(df$Team))!=0)
It takes the cumulative sum (i.e. of TRUE values) of those elements where either Event=="x" or the difference in successive values of Team is non-zero. An extra element is added at the start of the diff term to keep it the same length as the original.
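A self-contained sketch reproducing the example (the data frame name df is taken from the answer):
# rebuild the example data from the question
df <- data.frame(
  Team  = c(1, 1, 2, 2, 2, 1, 1, 1, 2, 2),
  Event = c("a", "a", "a", "x", "a", "a", "x", "a", "x", "a")
)
# count up whenever Event is "x" or Team changes from the previous row
df$Value <- cumsum(df$Event == "x" | c(1, diff(df$Team)) != 0)
df$Value
# [1] 1 1 2 3 3 4 5 5 6 6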

Data Summary in R: Using count() and finding an average numeric value [duplicate]

This question already has answers here:
Apply several summary functions (sum, mean, etc.) on several variables by group in one call (7 answers)
I am working on a directed graph and need some advice on generating a particular edge attribute.
I need to use both the count of interactions as well as another quality of the interaction (the average length of text used within interactions between the same unique from/to pair) in my visualization.
I am struggling to figure out how to create this output in a clean, scalable way. Below is my current input, solution, and output. I have also included an ideal output along with some things I have tried.
Input
x <- read.table(text = "
Actor Receiver Length
1 1 4
1 2 20
1 3 9
1 3 100
1 3 15
2 3 38
3 1 25
3 1 17",
sep = "", header = TRUE)
I am currently using dplyr to get a count of how many times each pair appears to achieve the output below.
I use the following command:
EDGE <- dplyr::count(x, Actor, Receiver)
names(EDGE) <- c("from", "to", "count")
To achieve my current output:
From To Count
1 1 1
1 2 1
1 3 3
2 3 1
3 1 2
Ideally, however, I'd like to know the average length for each pair as well, and end up with something like this:
From To Count AverageLength
1 1 1 4
1 2 1 20
1 3 3 41
2 3 1 38
3 1 2 21
Is there any way I can do this without creating a host of new data frames and then grafting them back onto the output? I am mostly having issues trying to summarize and count at the same time. My stupid solution has been to simply add Length as an argument to the count function, but this does not produce anything useful. I can also see that it may be useful to combine actor-receiver pairs and then use a summary function to create something to graft onto the frame that results from the count. In the interest of scaling, however, I would like to figure out if there is a simple and clear way of doing this.
Thank you very much for any assistance with this issue.
A naive solution would be to use cbind() to connect these two outputs together. Here is example code:
Actor <- c(rep(1, 5), 2, 3, 3)
Receiver <- c(1, 2, rep(3, 4), 1, 1)
Length <- c(4, 20, 9, 100, 15, 38, 25, 17)
x <- data.frame("Actor" = Actor,
                "Receiver" = Receiver,
                "Length" = Length)
library(plyr)
EDGE <- cbind(ddply(x, .(Actor, Receiver), nrow),  # this part replaces dplyr::count
              ddply(x, .(Actor, Receiver), summarize, mean(Length))[, 3])  # this is the summarize
names(EDGE) <- c("From", "To", "Count", "AverageLength")
EDGE  # gives the expected result
From To Count AverageLength
1 1 1 1 4.00000
2 1 2 1 20.00000
3 1 3 3 41.33333
4 2 3 1 38.00000
5 3 1 2 21.00000
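For reference, the linked duplicate covers this pattern more generally; a one-call dplyr sketch of the same result (assuming dplyr >= 1.0 for the .groups argument) would be:
library(dplyr)
EDGE <- x %>%
  group_by(Actor, Receiver) %>%
  summarise(Count = n(),
            AverageLength = mean(Length),
            .groups = "drop") %>%
  rename(From = Actor, To = Receiver)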

Find n smallest values from data?

How can I get the 3 smallest values from the data automatically?
Data:
data <- c(4,3,5,2,2,1,1,5,6,7,8,9)
[1] 4 3 5 2 2 1 1 5 6 7 8 9
The min() function returns just one value, and I want the 3 smallest values from the data.
min(data)
[1] 1
Can I get something like this from the data?
[1] 1 1 2
Simply take the first three values of a sorted vector
> sort(data)[1:3]
[1] 1 1 2
Another alternative is the head function, which shows the first n values of an R object, so for the three smallest numbers you take the head of a sorted vector:
> head(sort(data), 3)
[1] 1 1 2
...but you could take the head of virtually any other R object.
If you are interested in the value that marks the upper boundary of the k percent lowest values, use the quantile function:
> quantile(data, 0.1)
10%
1.1
data <- c(4,3,5,2,2,1,1,5,6,7,8,9)
sort(data, decreasing = FALSE)[1:3]
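If the vector is long, a full sort is unnecessary; base R's partial sorting can do less work (a sketch on the same data):
# partial = 1:3 guarantees positions 1-3 match a full sort,
# without fully ordering the rest of the vector
sort(data, partial = 1:3)[1:3]
# [1] 1 1 2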

working with data in tables in R

I'm a newbie at working with R. I've got some data with multiple observations (i.e., rows) per subject. Each subject has a unique identifier (ID) and has another variable of interest (X) which is constant across each observation. The number of observations per subject differs.
The data might look like this:
ID Observation X
1 1 3
1 2 3
1 3 3
1 4 3
2 1 4
2 2 4
3 1 8
3 2 8
3 3 8
I'd like to find some code that would:
a) Identify the number of observations per subject
b) Identify subjects with greater than a certain number of observations (e.g., >= 15 observations)
c) For subjects with greater than a certain number of observations, I'd like to manipulate the X value for each observation (e.g., I might want to subtract 1 from their X value, so I'd modify X for each observation to be X-1)
I might want to identify subjects with at least three observations and reduce their X value by 1. In the above, individuals #1 and #3 (ID) have at least three observations, and their X values--which are constant across all observations--are 3 and 8, respectively. I want to find code that would identify individuals #1 and #3 and then let me recode all of their X values into a different variable. Maybe I just want to subtract 1 from each X value. In that case, the code would then give me X values of (3-1=)2 for #1 and 7 for #3, but #2 would remain at X = 4.
Any suggestions appreciated, thanks!
You can use the aggregate function to do this.
a) Say your table is named temp. You can find the number of observations for each ID and X combination by using length as the aggregation function in aggregate (sum would add up the Observation indices rather than count them):
tot <- aggregate(Observation ~ ID + X, temp, FUN = length)
The output will look like this:
  ID X Observation
1  1 3           4
2  2 4           2
3  3 8           3
b) To find the IDs with at least a certain number of observations, subset the table tot:
vals <- tot$ID[tot$Observation >= 3]
Output is:
[1] 1 3
c) To change the X values for the IDs found in (b), subset by ID membership and update those rows (indexing with vals directly, as in tot$X[vals], only works by accident while the IDs happen to equal the row numbers). Following the question's example of subtracting 1:
tot$X[tot$ID %in% vals] <- tot$X[tot$ID %in% vals] - 1
The final output for the table is
  ID X Observation
1  1 2           4
2  2 4           2
3  3 7           3
To change the original table, subset it by the IDs you found:
temp[temp$ID %in% vals, ]$X <- temp[temp$ID %in% vals, ]$X - 1
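For comparison, the same steps can be written as one dplyr pipeline (a sketch, assuming the table is named temp and the threshold is 3 observations):
library(dplyr)
temp <- temp %>%
  group_by(ID) %>%
  mutate(n_obs = n()) %>%                    # (a) observations per subject
  ungroup() %>%
  mutate(X = ifelse(n_obs >= 3, X - 1, X))   # (b)+(c) adjust X where n_obs >= 3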
a) To identify the number of observations per subject, you can also use summary() on the ID column as a factor:
summary(as.factor(temp$ID))

Discretization and aliasing in R

So, I have an array of values from 1 to 100, and I need to make it discrete while applying an alias to each discrete value. For example:
A
10
15
55
15
70
Now, let's say I want to make it discrete over 2 bins (so that 0-50 is one bin and 51-100 is the other one) and alias these bins with 1 and 2. It should result in:
A
1
1
2
1
2
Please, notice this is different from the discretize function (contained in entropy or infotheo). That function only counts the number of values for each bin.
My question is also different from this one (with a similar title).
Now, I can get this result using a series of ifs, but I was wondering whether a simpler way exists.
You're looking for the function cut:
x <- cut(sample(1:100, 10), c(0, 50, 100))
str(x)
# Factor w/ 2 levels "(0,50]","(50,100]": 1 2 1 2 1 1 2 1 1 1
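Applied to the example vector from the question, labels = FALSE makes cut() return the integer bin codes directly, i.e. the aliases 1 and 2:
A <- c(10, 15, 55, 15, 70)
cut(A, breaks = c(0, 50, 100), labels = FALSE)
# [1] 1 1 2 1 2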
