Data Summary in R: Using count() and finding an average numeric value [duplicate] - r

I am working on a directed graph and need some advice on generating a particular edge attribute.
I need to use both the count of interactions as well as another quality of the interaction (the average length of text used within interactions between the same unique from/to pair) in my visualization.
I am struggling to figure out how to create this output in a clean, scalable way. Below is my current input, solution, and output. I have also included an ideal output along with some things I have tried.
Input
network <- read.table(text = "
Actor Receiver Length
1 1 4
1 2 20
1 3 9
1 3 100
1 3 15
2 3 38
3 1 25
3 1 17",
header = TRUE)
I am currently using dplyr to get a count of how many times each pair appears to achieve the output below.
I use the following command:
EDGE <- dplyr::count(network, Actor, Receiver )
names(EDGE) <- c("from","to","count")
To achieve my current output:
From To Count
1 1 1
1 2 1
1 3 3
2 3 1
3 1 2
Ideally, however, I like to know the average lengths for each pair as well, or end up with something like this:
From To Count AverageLength
1 1 1 4
1 2 1 20
1 3 3 41
2 3 1 38
3 1 2 21
Is there any way I can do this without creating a host of new data frames and then grafting them back onto the output? I am mostly having issues trying to summarize and count at the same time. My stupid solution has been to simply add "Length" as an argument to the count function, but this does not produce anything useful. I could also see that it may be useful to combine Actor and Receiver into a single key and then use a summary function to create something to graft onto the result of the count. In the interest of scaling, however, I would like to figure out if there is a simple and clear way of doing this.
Thank you very much for any assistance with this issue.

A naive solution would be to use cbind() to join the two outputs together. Here is some example code:
Actor <- c(rep(1, 5), 2, 3, 3)
Receiver <- c(1, 2, rep(3, 4), 1, 1)
Length <- c(4, 20, 9, 100, 15, 38, 25, 17)
x <- data.frame("Actor" = Actor,
"Receiver" = Receiver,
"Length" = Length)
library(plyr)
EDGE <- cbind(ddply(x, .(Actor, Receiver), nrow), # This part replaces dplyr::count
ddply(x, .(Actor, Receiver), summarize, mean(Length))[ , 3]) # This part computes the mean Length
names(EDGE) <- c("From", "To", "Count", "AverageLength")
EDGE # Gives the expected results
From To Count AverageLength
1 1 1 1 4.00000
2 1 2 1 20.00000
3 1 3 3 41.33333
4 2 3 1 38.00000
5 3 1 2 21.00000
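Since the question already uses dplyr, the count and the mean can also be computed in a single grouped summary. This is a minimal sketch, assuming dplyr 1.0.0 or later (for the .groups argument); the column names Count and AverageLength are chosen here only to match the desired output.

```r
library(dplyr)

# Same data as in the question
network <- data.frame(
  Actor    = c(1, 1, 1, 1, 1, 2, 3, 3),
  Receiver = c(1, 2, 3, 3, 3, 3, 1, 1),
  Length   = c(4, 20, 9, 100, 15, 38, 25, 17)
)

# One summary per Actor/Receiver pair: n() counts rows, mean() averages Length
EDGE <- network %>%
  group_by(Actor, Receiver) %>%
  summarise(Count = n(), AverageLength = mean(Length), .groups = "drop") %>%
  rename(From = Actor, To = Receiver)

EDGE
```

Further per-pair statistics (for example a median) could be added as extra arguments to summarise() without creating intermediate data frames.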

Related

getting table(column) to return 0 for non-represented values

I'm working with a data set where my outcome of interest is coded across multiple columns and takes on values of 1, 2 and 3. Running table() across any one of these columns sometimes gives me results of the following (desired) form:
1 2 3
8 87 500
But also, for example, sometimes gives me results that look like this, when there are no 2's in a column
1 3
5 200
This is a problem as I try to combine all of these tables using rbind, which I do using this code.
tables = sapply(.GlobalEnv, is.table)
allquestions <- do.call(rbind, mget(names(tables)[tables]))
When this code comes across tables of the latter form, it seems to treat values in the '3' column as though they were in the '2' column, because '3' is in the second position. It then seems to take the value for the '3' position from the '1' position, as shown below:
1 2 3
8 87 500
5 200 5
What I want it to look like is this:
1 2 3
8 87 500
5 0 200
Is there any way to make table() look for values that might not be represented in a column? Ideally, I would want it to print out the following for the second table example I gave.
1 2 3
5 0 200
Alternatively, is there a way to make the way I use rbind function pay attention to column names and merge them appropriately?
You can convert the values to a factor whose levels list all the values the column can take.
x <- c(1, 2, 3, 1, 2)
table(x)
#x
#1 2 3
#2 2 1
x <- c(1, 3, 3)
table(x)
#x
#1 3
#1 2
table(factor(x, 1:3))
#1 2 3
#1 0 2
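Once every table is built from a factor with the same levels, the tables all have the same cells and rbind lines them up correctly. A small sketch, where q1 and q2 are made-up answer columns standing in for the real data:

```r
# Two hypothetical answer columns; the second contains no 2s
q1 <- c(1, 2, 2, 3, 3, 3)
q2 <- c(1, 3, 3)

# Fixing the levels gives every table the same three cells
lv <- 1:3
allquestions <- rbind(
  q1 = table(factor(q1, levels = lv)),
  q2 = table(factor(q2, levels = lv))
)
allquestions
```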

what is this function doing? replication [closed]

rep_sample_n <- function(tbl, size, replace = FALSE, reps = 1)
{
  rep_tbl = replicate(reps, tbl[sample(1:nrow(tbl), size, replace = replace), ],
                      simplify = FALSE) %>%
    bind_rows() %>%
    mutate(replicate = rep(1:reps, each = size)) %>%
    select(replicate, everything()) %>%
    group_by(replicate)
  return(rep_tbl)
}
Hey, can anyone help me there? What is this function doing? Is the first line setting the variables of the function? And then what is this "replicate" doing? Thanks!
This function replicates your data. Let's say we have a dataset of 10 observations. To come up with additional datasets like your current one, you can resample it at random.
You can check out the Wikipedia page on statistical replication if you're more curious.
Let's take a simple dataframe:
df <- data.frame(x = 1:10, y = 1:10)
df
x y
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5
6 6 6
7 7 7
8 8 8
9 9 9
10 10 10
If we want to take a random sample of this, we can use the function rep_sample_n, which takes 2 required arguments, tbl and size, and 2 optional arguments, replace = FALSE and reps = 1.
Here is an example of us taking just 4 randomly selected rows from our data.
rep_sample_n(df, 4)
# A tibble: 4 x 3
# Groups: replicate [1]
replicate x y
<int> <int> <int>
1 1 1 1
2 1 3 3
3 1 4 4
4 1 10 10
Now if we want to randomly sample 15 observations from a 10-observation dataset, it will throw an error. With the default replace = FALSE that is not possible, because each time a row is chosen it is removed from the pool for the next draw. In the example above, once the 1st observation had been chosen, only observations 2 through 10 were left for the next draw, and the same applied after the 3rd, the 4th and the 10th were chosen. If we set replace = TRUE, each draw is made from the full dataset.
Notice how in this example the 5th observation was chosen twice. That wouldn't happen with replace = FALSE.
rep_sample_n(df, 4, replace = TRUE)
# A tibble: 4 x 3
# Groups: replicate [1]
replicate x y
<int> <int> <int>
1 1 5 5
2 1 3 3
3 1 2 2
4 1 5 5
Lastly and most importantly, we have the reps argument, which is really the basis for this function. It allows you to randomly sample your dataset multiple times and then combine all those samples together.
Below, we have sampled our original dataset of 10 observations by selecting 4 of them, and we have repeated that 5 times. This gives 5 different sample dataframes of 4 observations each, combined into one 20-observation dataframe in which each of the 5 samples is tagged with a replicate number. The replicate column shows which 4 observations go with which replicated dataframe.
rep_sample_n(df, 4, reps = 5)
# A tibble: 20 x 3
# Groups: replicate [5]
replicate x y
<int> <int> <int>
1 1 8 8
2 1 4 4
3 1 3 3
4 1 1 1
5 2 4 4
6 2 5 5
7 2 8 8
8 2 3 3
9 3 6 6
10 3 1 1
11 3 3 3
12 3 2 2
13 4 5 5
14 4 7 7
15 4 10 10
16 4 3 3
17 5 7 7
18 5 10 10
19 5 3 3
20 5 9 9
I hope this provided some clarity
This function takes a data frame as input (and several input preferences). It takes a random sample of size rows from the table, with or without replacement as set by the replace input. It repeats that random sampling reps times.
Then, it binds all the samples together into a single data frame, adding a new column called "replicate" indicating which repetition of the sampling produced each row.
Finally, it "groups" the resulting table, preparing it for future group-wise operations with dplyr.
For general questions about specific functions, like "What is this "replicate" doing?", you should look at the function's help page: type ?replicate or help("replicate") to get there. It includes a description of the function and examples of how to use it. If you read the description, run the examples, and are still confused, feel free to come back with a specific question and example illustrating what you are confused by.
Similarly, for "Is the first line setting the variables of the function?", the arguments to function() are the inputs to the function. If you have basic questions about R like "How do functions work", have a look at An Introduction to R, or one of the other sources in the R Tag Wiki.
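The steps described above can also be traced in base R without the pipe. This is only an illustrative sketch, not the original function: it leaves out the dplyr grouping step, and the seed and the values of size and reps are arbitrary.

```r
set.seed(1)  # only to make the sketch repeatable
df <- data.frame(x = 1:10, y = 1:10)
size <- 4
reps <- 3

# replicate() evaluates the sampling expression `reps` times and,
# with simplify = FALSE, returns the results as a list of data frames
samples <- replicate(reps, df[sample(nrow(df), size), ], simplify = FALSE)

# Stack the samples and tag each row with the repetition it came from
stacked <- do.call(rbind, samples)
stacked$replicate <- rep(seq_len(reps), each = size)
stacked
```

Inspecting samples before it is stacked shows that replicate() simply returns a list with one sampled data frame per repetition.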

Creating a new variable in a data frame and changing its values in one step [duplicate]

I have a column which is part of a data frame, df. It is full of integers. Let's say it is the number of houses sold in a day by a realty company. Let's call it df$houses. I want to make a second column called df$quant where the number of houses is categorized, with 0 being 0-2 houses sold in a day, 1 being 3-5 houses, 2 being 6-9 houses and 3 being more than 10 houses. I could do this in two steps.
1) Create the new column df$quant from df$houses:
df$quant <- df$houses
2) Change the values of df$quant:
df$quant[which(df$quant <= 2)] <- 0
etc.
I would like to do this in one step though, making the new variable and filling it with the proper values. Mostly, so I don't have to worry about getting the order of the lines of code in the second step right. It would be more robust.
Could this be done with an if statement?
Thanks a lot.
I would do something like this: (using cut)
x <- 1:11
df <- data.frame(x)
myFunction <- function(x) as.integer(cut(x, c(-1, 2, 5, 9, Inf))) - 1 # Inf keeps the breaks valid even when no value exceeds 9
df$new <- myFunction(df$x)
df
x new
1 1 0
2 2 0
3 3 1
4 4 1
5 5 1
6 6 2
7 7 2
8 8 2
9 9 2
10 10 3
11 11 3
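An alternative sketch uses findInterval from base R, which returns the 0-based codes directly. The houses vector is made up, and treating exactly 10 houses as category 3 is an assumption, since the question leaves that case open.

```r
houses <- c(0, 2, 3, 5, 6, 9, 10, 25)  # made-up daily sales

# findInterval() counts how many break points are <= each value,
# so breaks at 3, 6 and 10 map directly to the codes 0, 1, 2, 3
quant <- findInterval(houses, c(3, 6, 10))

data.frame(houses, quant)
```

This creates and fills the new column in one step, so there is no sequence of overwriting assignments to get in the wrong order.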

How to find the final value from repeated measures in R?

I have data arranged like this in R:
indv time mass
1 10 7
2 5 3
1 5 1
2 4 4
2 14 14
1 15 15
where indv is individual in a population. I want to add columns for initial mass (mass_i) and final mass (mass_f). I learned yesterday that I can add a column for initial mass using ddply in plyr:
sorted <- ddply(test, .(indv, time), sort)
sorted2 <- ddply(sorted, .(indv), transform, mass_i = mass[1])
which gives a table like:
indv mass time mass_i
1 1 1 5 1
2 1 7 10 1
3 1 10 15 1
4 2 4 4 4
5 2 3 5 4
6 2 8 14 4
7 2 9 20 4
However, this same method will not work for finding the final mass (mass_f), as I have a different number of observations for each individual. Can anyone suggest a method for finding the final mass, when the number of observations may vary?
You can simply use length(mass) as the index of the last element:
sorted2 <- ddply(sorted, .(indv), transform,
mass_i = mass[1], mass_f = mass[length(mass)])
As suggested by mb3041023 and discussed in the comments below, you can achieve similar results without sorting your data frame:
ddply(test, .(indv), transform,
mass_i = mass[which.min(time)], mass_f = mass[which.max(time)])
Except for the order of rows, this is the same as sorted2.
You can use tail(mass, 1) in place of mass[length(mass)], and head(mass, 1) in place of mass[1].
sorted2 <- ddply(sorted, .(indv), transform, mass_i = head(mass, 1), mass_f=tail(mass, 1))
Once you have this table, it's pretty simple:
t <- tapply(test$mass, test$indv, max)
This will give you an array with indv as the names and the maximum mass as the values. Note that this equals mass_f only if mass never decreases over time.
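For readers without plyr, the same two columns can be sketched in base R with ave. The data frame below reuses the six rows shown in the question; everything else is an illustrative assumption rather than part of the original answers.

```r
test <- data.frame(
  indv = c(1, 2, 1, 2, 2, 1),
  time = c(10, 5, 5, 4, 14, 15),
  mass = c(7, 3, 1, 4, 14, 15)
)

# Sort so that the first/last row of each individual is the earliest/latest time
test <- test[order(test$indv, test$time), ]

# ave() applies the function within each indv and recycles the single result
test$mass_i <- ave(test$mass, test$indv, FUN = function(m) m[1])
test$mass_f <- ave(test$mass, test$indv, FUN = function(m) m[length(m)])
test
```

Because the index is length(m) within each group, the number of observations per individual does not need to be the same.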

How to bin ordered data by percentile for each id in R dataframe [r]

I have a dataframe that contains 70-80 rows of ordered response time (RT) data for each of 228 people, each with a unique id# (not everyone has the same number of rows). I want to bin each person's RTs into 5 bins. I want the 1st bin to be their fastest 20 percent of RTs, the 2nd bin to be their next fastest 20 percent of RTs, etc. Each bin should have the same number of trials in it (unless the total # of trials is odd).
My current dataframe looks like this:
id RT
7000 225
7000 250
7000 253
7001 189
7001 201
7001 225
I'd like my new dataframe to look like this:
id RT Bin
7000 225 1
7000 250 1
After getting my data to look like this, I will aggregate by id and bin
The only way I can think of to do this is to split the data into a list (using the split command), loop through each person, use the quantile command to get break points for the different bins, assign a bin value (1-5) to every response time. This feels very convoluted (and would be difficult for me). I'm in a bit of a jam and I would greatly appreciate any help in how to streamline this process. Thanks.
The answer @Chase gave splits the range into 5 groups of equal length (difference of endpoints). What you seem to want is quintiles (5 groups with an equal number of values in each group). For that, you can use the cut2 function in Hmisc:
library("plyr")
library("Hmisc")
dat <- data.frame(id = rep(1:10, each = 10), value = rnorm(100))
tmp <- ddply(dat, "id", transform, hists = as.numeric(cut2(value, g = 5)))
tmp now has what you want
> tmp
id value hists
1 1 0.19016791 3
2 1 0.27795226 4
3 1 0.74350982 5
4 1 0.43459571 4
5 1 -2.72263322 1
....
95 10 -0.10111905 3
96 10 -0.28251991 2
97 10 -0.19308950 2
98 10 0.32827137 4
99 10 -0.01993215 4
100 10 -1.04100991 1
With the same number in each hists for each id
> table(tmp$id, tmp$hists)
1 2 3 4 5
1 2 2 2 2 2
2 2 2 2 2 2
3 2 2 2 2 2
4 2 2 2 2 2
5 2 2 2 2 2
6 2 2 2 2 2
7 2 2 2 2 2
8 2 2 2 2 2
9 2 2 2 2 2
10 2 2 2 2 2
Here's a reproducible example using package plyr and the cut function:
dat <- data.frame(id = rep(1:10, each = 10), value = rnorm(100))
ddply(dat, "id", transform, hists = cut(value, breaks = 5))
id value hists
1 1 -1.82080027 (-1.94,-1.41]
2 1 0.11035796 (-0.36,0.166]
3 1 -0.57487134 (-0.886,-0.36]
4 1 -0.99455189 (-1.41,-0.886]
....
96 10 -0.03376074 (-0.233,0.386]
97 10 -0.71879488 (-0.853,-0.233]
98 10 -0.17533570 (-0.233,0.386]
99 10 -1.07668282 (-1.47,-0.853]
100 10 -1.45170078 (-1.47,-0.853]
Pass in labels = FALSE to cut if you want simple integer values returned instead of the bins.
Here's an answer in plain old R.
#make up some data
df <- data.frame(rt = rnorm(60), id = rep(letters[1:3], rep(20)) )
#and this is all there is to it
df <- df[order(df$id, df$rt),]
df$bin <- rep( unlist( tapply( df$rt, df$id, quantile )), each = 4)
You'll note that the quantile command used can be set to use any quantiles. The defaults return five values (the 0%, 25%, 50%, 75% and 100% quantiles), but if you want deciles then use
quantile(x, seq(0, 1, 0.1))
in the function above.
The answer above is a bit fragile. It requires equal numbers of RTs per id, and I didn't tell you how to get to the magic number 4 (20 rows per id divided by the 5 quantile values). But it will also run very fast on a large dataset. If you want a more robust solution, you can use cut2 from Hmisc:
library('Hmisc')
df <- df[order(df$id),]
df$bin <- unlist(lapply( unique(df$id), function(x) cut2(df$rt[df$id==x], g = 5) ))
This is much more robust than the first solution but it isn't as fast. For small datasets you won't notice.
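A further base R sketch that copes with unequal numbers of trials per id ranks the RTs within each person and maps the ranks onto five bins. The data here are made up, and the formula ceiling(5 * rank / n) is one possible way to define the bins, not something taken from the answers above.

```r
set.seed(42)  # only to make the sketch repeatable
# Made-up data: three ids with different numbers of trials
df <- data.frame(
  id = rep(c("a", "b", "c"), times = c(20, 15, 10)),
  rt = rnorm(45, mean = 300, sd = 50)
)

# Within each id, rank the RTs (fastest = 1) and map rank r of n onto bins 1..5
df$bin <- ave(df$rt, df$id, FUN = function(v) {
  ceiling(5 * rank(v, ties.method = "first") / length(v))
})

table(df$id, df$bin)
```

When the number of trials for an id is not a multiple of 5, the bins for that id differ in size by at most one trial.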
