Discretization and aliasing in R

So, I have an array of values from 1 to 100, and I need to make it discrete while applying an alias to each discrete value. For example:
A
10
15
55
15
70
Now, let's say I want to make it discrete over 2 bins (so that 0-50 is one bin and 51-100 is the other one) and alias these bins with 1 and 2. It should result in:
A
1
1
2
1
2
Please notice that this is different from the discretize function (in the entropy or infotheo packages), which only counts the number of values in each bin.
My question is also different from this one (with a similar title).
Now, I can get this result using a series of ifs, but I was wondering whether there is a simpler way to do it.

You're looking for the function cut:
x <- cut(sample(1:100, 10), c(0, 50, 100))
str(x)
# Factor w/ 2 levels "(0,50]","(50,100]": 1 2 1 2 1 1 2 1 1 1
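If you want the bins reported directly as the aliases 1 and 2 rather than as interval names, cut's labels argument (or labels = FALSE, which returns bare integer codes) handles that. A minimal sketch using the data from the question:

```r
A <- c(10, 15, 55, 15, 70)

# labels = c(1, 2) replaces the default "(0,50]"-style interval names
x <- cut(A, breaks = c(0, 50, 100), labels = c(1, 2))
as.integer(x)  # 1 1 2 1 2

# labels = FALSE skips the factor entirely and returns the bin codes
cut(A, breaks = c(0, 50, 100), labels = FALSE)  # 1 1 2 1 2
```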


getting table(column) to return 0 for non-represented values

I'm working with a data set where my outcome of interest is coded across multiple columns and takes on values of 1, 2 and 3. Running table() across any one of these columns sometimes gives me results of the following (desired) form:
1 2 3
8 87 500
But it also sometimes gives me results that look like this, when there are no 2's in a column:
1 3
5 200
This is a problem when I try to combine all of these tables using rbind, which I do with this code:
tables = sapply(.GlobalEnv, is.table)
allquestions <- do.call(rbind, mget(names(tables)[tables]))
When this code comes across tables of the latter form, it seems to treat values in the '3' column as though they were in the '2' column, because '3' is in the second position. It then seems to take the value for the '3' position from the '1' position, as shown below:
1 2 3
8 87 500
5 200 5
What I want it to look like is this:
1 2 3
8 87 500
5 0 200
Is there any way to make table() look for values that might not be represented in a column? Ideally, I would want it to print out the following for the second table example I gave.
1 2 3
5 0 200
Alternatively, is there a way to make the way I use rbind function pay attention to column names and merge them appropriately?
You can convert the values to a factor, using levels to specify all the values they can take.
x <- c(1, 2, 3, 1, 2)
table(x)
#x
#1 2 3
#2 2 1
x <- c(1, 3, 3)
table(x)
#x
#1 3
#1 2
table(factor(x, 1:3))
#1 2 3
#1 0 2
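Applied to the rbind step from the question (the table names here are hypothetical), fixing the levels up front makes every table the same width, so the rows line up by column name:

```r
x1 <- c(1, 2, 3, 1, 2)
x2 <- c(1, 3, 3)  # no 2's

# The same explicit levels for every column before tabulating
t1 <- table(factor(x1, levels = 1:3))
t2 <- table(factor(x2, levels = 1:3))

combined <- rbind(t1, t2)
combined
#    1 2 3
# t1 2 2 1
# t2 1 0 2
```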

Calculate agreement for a specific row in a table in R

Hello! I am new to R, and I have this table for which I want to find the correlation: how much agreement is there in the 3rd row between all three tosses? How can I calculate this for more than two values in a specific row? (Heads is 1, tails is 2.) Can I check for agreement between one column and all the rest?
library(readxl)
COIN_TOSS <- read_excel("C:/Users/user/Desktop/COIN TOSS.xlsx")
TOSS #1 TOSS #2 TOSS #3
1 2 2 2
2 2 2 1
3 1 1 2
4 2 1 1
5 2 1 1
6 2 1 2
7 1 1 2
8 1 1 2
9 1 1 1
10 2 1 1
Also, I want to print a plot with the sum of the values. I get the top 3 values of each column (10 columns in total) with the code below (the most frequent values of am are shown):
am <- excel__data$AM
oneam <- sort(table(am),decreasing=TRUE)[1:3]
#am
# 3  2  4
#31 26 24
For the plot I used the code below, but the y-axis keeps its maximum value of 30, so not all of the (stacked-up) values are visible. How can I change it to go up to 200? Can I use something other than plot and points?
plot(oneam, pch=10, col='red')
points(onecm, pch=10,col='blue')
points(onefm, pch=10,col='green')
points(onekk, pch=10,col='yellow')
points(onekm, pch=10,col='black')
points(onels, pch=10,col='orange')
"Agreement" and "Correlation" are very different things.
If you simply want to look at "agreement", you could calculate a row-wise mean and standard deviation. A low standard deviation indicates that all tosses were fairly close; if you like, you can standardize by dividing the SD by the mean to get the coefficient of variation, a percentage metric.
You can be even more specific and calculate a "distance measure" from one specific toss to the other two, e.g.:
library(dplyr)
# assumes the columns have been renamed to the syntactic names TOSS1, TOSS2, TOSS3
COIN_TOSS %>%
  mutate(Toss3_Delta = ((TOSS1 + TOSS2) / 2 - TOSS3) / ((TOSS1 + TOSS2) / 2))
Now, if we are talking about correlation, in your example this only works column-wise, because three cases per row are not enough to calculate a correlation.
This works:
cor(COIN_TOSS)  # the full matrix of pairwise correlations between the toss columns
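The row-wise "agreement" idea above can be sketched in base R. The data frame below rebuilds the tosses from the question, under the assumption that the columns have been renamed TOSS1..TOSS3:

```r
COIN_TOSS <- data.frame(
  TOSS1 = c(2, 2, 1, 2, 2, 2, 1, 1, 1, 2),
  TOSS2 = c(2, 2, 1, 1, 1, 1, 1, 1, 1, 1),
  TOSS3 = c(2, 1, 2, 1, 1, 2, 2, 2, 1, 1)
)

# A row-wise standard deviation of 0 means all three tosses agree
row_sd <- apply(COIN_TOSS, 1, sd)
which(row_sd == 0)  # rows 1 and 9: perfect agreement
```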

How to get levels for each factor variable in R

I understand R assigns values to a factor vector alphabetically. In this following example:
x <- as.factor(c("A","B","C","A","A","A","A","A","A","B","C","B","C","B","C","B","C"))
str(x)
This prints
Factor w/ 3 levels "A","B","C": 1 2 3 1 1 1 1 1 1 2 ...
Since I have only three levels it is easier to understand the level - value association i.e., A = 1, B = 2, so on and so forth.
In a scenario where I have hundreds of factors, is there an easier way to get them printed as a table that displays all the levels along with their values, like this:
Levels Values
A 1
B 2
C 3
Why do you want to know the underlying numeric values that R assigns to each factor level? I ask because this generally wouldn't be an important thing to keep track of. Can you say more about what you're trying to accomplish? We may be able to provide additional advice if we know more about the underlying problem you're trying to solve. For now, below are examples of how to do what you ask that also show why the results might not be what you expect.
Do all the columns in your data frame have different combinations of the same underlying categories? If not, what you're asking for could give unexpected and undesirable results. Below are a couple of examples, based on a fake data frame with 3 factor columns, two of which are upper case letters and one of which is lower case letters.
# Fake data
set.seed(2)
x = c("C","A","B","C","A","A","A","A","A","A","B","C","B","C","B","C","B","C")
dat = data.frame(x = x,
                 y = sample(LETTERS[1:5], length(x), replace = TRUE),
                 z = sample(letters[1:3], length(x), replace = TRUE),
                 w = rnorm(length(x)),
                 stringsAsFactors = TRUE)  # needed in R >= 4.0, where strings no longer default to factors
Note that the numeric codes assigned to each factor level are not unique across columns. The lower case letters and the upper case letters can both have factor codes 1 through 3.
# Return a list with factor levels and numeric codes for each factor column
lapply(dat[ , sapply(dat, is.factor)], function(v) {
  data.frame(Levels = levels(unique(sort(v))),
             Values = as.numeric(unique(sort(v))))
})
$x
Levels Values
1 A 1
2 B 2
3 C 3
$y
Levels Values
1 A 1
2 B 2
3 C 3
4 D 4
5 E 5
$z
Levels Values
1 a 1
2 b 2
3 c 3
Another potential complication is whether the order of the factor levels is the same across columns. As an example, let's change the factor order for one of the upper case columns. This creates a new issue: the same letter can have a different code in different columns, and the same code can be assigned to different letters. For example, A has code 1 in column x and code 5 in column y. Furthermore, code 1 is assigned to E in column y, rather than to A.
dat$y = factor(dat$y, levels = LETTERS[5:1])
# Return a list with factor levels and numeric codes for each factor column
lapply(dat[ , sapply(dat, is.factor)], function(v) {
  data.frame(Levels = levels(unique(sort(v))),
             Values = as.numeric(unique(sort(v))))
})
$x
Levels Values
1 A 1
2 B 2
3 C 3
$y
Levels Values
1 E 1
2 D 2
3 C 3
4 B 4
5 A 5
$z
Levels Values
1 a 1
2 b 2
3 c 3
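If all you need is the level-to-code table for a single factor, the codes are simply the positions within levels(), so a one-liner suffices (this sidesteps the cross-column caveats above):

```r
x <- factor(c("A", "B", "C", "A", "B"))

# The codes are just the positions of the levels
lookup <- data.frame(Levels = levels(x), Values = seq_along(levels(x)))
lookup
#   Levels Values
# 1      A      1
# 2      B      2
# 3      C      3
```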

Data Summary in R: Using count() and finding an average numeric value [duplicate]

This question already has answers here:
Apply several summary functions (sum, mean, etc.) on several variables by group in one call
(7 answers)
Closed 6 years ago.
I am working on a directed graph and need some advice on generating a particular edge attribute.
I need to use both the count of interactions as well as another quality of the interaction (the average length of text used within interactions between the same unique from/to pair) in my visualization.
I am struggling to figure out how to create this output in a clean, scalable way. Below is my current input, solution, and output. I have also included an ideal output along with some things I have tried.
Input
x = read.table(text = "
Actor Receiver Length
1 1 4
1 2 20
1 3 9
1 3 100
1 3 15
2 3 38
3 1 25
3 1 17",
sep = "", header = TRUE)
I am currently using dplyr to get a count of how many times each pair appears to achieve the output below.
I use the following command:
EDGE <- dplyr::count(x, Actor, Receiver)
names(EDGE) <- c("from","to","count")
To achieve my current output:
From To Count
1 1 1
1 2 1
1 3 3
2 3 1
3 1 2
Ideally, however, I would also like to know the average length for each pair, ending up with something like this:
From To Count AverageLength
1 1 1 4
1 2 1 20
1 3 3 41
2 3 1 38
3 1 2 21
Is there any way I can do this without creating a host of new data frames and then grafting them back onto the output? I am mostly having issues trying to summarize and count at the same time. My naive solution has been to simply add "Length" as an argument to the count function, but this does not produce anything useful. I can also see that it may be useful to combine actor-receiver and then use the summary function to create something to graft onto the frame as a result of the count. In the interest of scaling, however, I would like to figure out whether there is a simple and clear way of doing this.
Thank you very much for any assistance with this issue.
A naive solution would be to use cbind() to join these two outputs together. Here is example code:
Actor <- c(rep(1, 5), 2, 3, 3)
Receiver <- c(1, 2, rep(3, 4), 1, 1)
Length <- c(4, 20, 9, 100, 15, 38, 25, 17)
x <- data.frame("Actor" = Actor,
                "Receiver" = Receiver,
                "Length" = Length)
library(plyr)
EDGE <- cbind(ddply(x, .(Actor, Receiver), nrow),  # this part replaces dplyr::count
              ddply(x, .(Actor, Receiver), summarize, mean(Length))[ , 3])  # this is the summarize
names(EDGE) <- c("From", "To", "Count", "AverageLength")
EDGE # Gives the expected results
From To Count AverageLength
1 1 1 1 4.00000
2 1 2 1 20.00000
3 1 3 3 41.33333
4 2 3 1 38.00000
5 3 1 2 21.00000
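If you'd rather avoid plyr, the same count-plus-mean can be computed in one pass with base R's aggregate. A self-contained sketch on the same data:

```r
x <- data.frame(
  Actor    = c(1, 1, 1, 1, 1, 2, 3, 3),
  Receiver = c(1, 2, 3, 3, 3, 3, 1, 1),
  Length   = c(4, 20, 9, 100, 15, 38, 25, 17)
)

# One aggregate() call computes both statistics per Actor/Receiver pair
EDGE <- aggregate(Length ~ Actor + Receiver, data = x,
                  FUN = function(v) c(Count = length(v), AverageLength = mean(v)))
EDGE <- do.call(data.frame, EDGE)  # flatten the embedded matrix column
names(EDGE) <- c("From", "To", "Count", "AverageLength")
```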

Reformatting data in order to plot 2D continuous heatmap

I have data stored in a data.frame that I would like to plot as a continuous heat map. I have tried using the interp function from the akima package, but the data can be very large (2 million rows), so I would like to avoid this if possible, as it takes a very long time. Here is the format of my data:
l1 <- c(1,2,3)
grid1 <- expand.grid(l1, l1)
lprobdens <- c(0,2,4,2,8,10,4,8,2)
df <- cbind(grid1, lprobdens)
colnames(df) <- c("age1", "age2", "probdens")
age1 age2 probdens
1 1 0
2 1 2
3 1 4
1 2 2
2 2 8
3 2 10
1 3 4
2 3 8
3 3 2
I would like to format it as a length(unique(df$age1)) x length(unique(df$age2)) matrix. I gather that once it is formatted in this manner, I would be able to use basic functions such as image to plot a continuous 2D heat map similar to that created using the akima package. Here is how I think the transformed data should look. Please correct me if I am wrong:
1 2 3
1 0 2 4
2 2 8 8
3 4 10 2
It seems as though ldply might help, but I can't sort out how it works.
I forgot to mention: the $age information is always continuous and regular, such that age1 and age2 take the same set of values. I guess this means the data may be classed as continuous as it stands and doesn't require the interp function.
Ok, I think I get what you want. It is just a matter of reshaping the data with reshape2's dcast function. The value.var argument merely avoids the warning message that R guessed which value column to use; the result does not change if you omit it.
library(reshape2)
as.matrix(dcast(df, age1 ~ age2, value.var = "probdens")[-1])
1 2 3
[1,] 0 2 4
[2,] 2 8 8
[3,] 4 10 2
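Since the grid is regular, you can also skip reshape2 entirely: matrix() fills column-wise, which matches the expand.grid ordering, so the values drop straight into place, and the result can go directly into image(). A sketch using the example data:

```r
df <- data.frame(age1 = rep(1:3, times = 3), age2 = rep(1:3, each = 3),
                 probdens = c(0, 2, 4, 2, 8, 10, 4, 8, 2))

# expand.grid varies age1 fastest, so column-wise filling puts age1 on the rows
m <- matrix(df$probdens, nrow = 3,
            dimnames = list(unique(df$age1), unique(df$age2)))
m
#   1  2 3
# 1 0  2 4
# 2 2  8 8
# 3 4 10 2

image(x = 1:3, y = 1:3, z = m, xlab = "age1", ylab = "age2")
```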
