I have two columns, one with age and another with a percentage. I need to draw a graph that shows the distribution of the sum of percentages for each 5-year interval.
df$group <- cut(df$age, breaks = seq(0,120,by=5), right = TRUE)
I used the above code to group the ages into 5-year intervals and then used group_by() and summarize(sum = sum(percentage)) to sum the percentages within each interval. However, I'm not able to do that, as group_by doesn't seem to work on the categorical variable. Do you know a better way?
if the df is:
df <- data.frame(age=c(2,4,6,8), percentage=c(2,3,6,7))
and grouping and summing should transform it to:
age group (0-5, 5-10), sum of percentage (5, 13)
but I need the groups labelled by their upper bounds instead:
age (5, 10), sum of percentage (5, 13)
You can create a new variable with the groups and then use aggregate to sum the percentage values within each group:
df <- data.frame(age = c(2, 4, 6, 8), percentage = c(2, 3, 6, 7))
df$age.group <- cut(df$age, seq(0, 120, 5))
sums <- aggregate(percentage ~ age.group, FUN = sum, data = df)
The result will be:
> df
  age percentage age.group
1   2          2     (0,5]
2   4          3     (0,5]
3   6          6    (5,10]
4   8          7    (5,10]
> sums
  age.group percentage
1     (0,5]          5
2    (5,10]         13
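If you want the groups labelled by their upper bound (5, 10, ...) rather than the default (0,5] notation, cut() takes a labels argument; here's a dplyr sketch, assuming the example df above:

```r
library(dplyr)

df <- data.frame(age = c(2, 4, 6, 8), percentage = c(2, 3, 6, 7))

sums <- df |>
  # label each 5-year interval by its upper bound: 5, 10, ..., 120
  mutate(age.group = cut(age, seq(0, 120, 5), labels = seq(5, 120, 5))) |>
  group_by(age.group) |>
  summarise(percentage = sum(percentage))
```

Unused interval levels are dropped by group_by() by default, so only the populated groups appear in sums.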
I have a dataframe where the values in the 'Age' columns need to be summed within each whole-number size, i.e. the data frame currently looks like this:
Size Age 1 Age 2 Age 3
[1] 8 2 8 5
[2] 8.5 4 7 9
[3] 9 1 11 45
[4] 9.5 3 2 0
But I want this:
Size Age 1 Age 2 Age 3
[1+2] 8 6 15 16
[3+4] 9 4 13 45
Which function is best to use in R?
I thought about using rowwise() together with mutate(), but I haven't tried it, and I don't know how to set the grouping criteria.
Thank you in advance for the help :)
You can do this quite easily with the dplyr library. (You may need to install.packages("dplyr") if you haven't already.)
Using dplyr functions, we can group by a new grouping column, Size, replacing the existing Size column with values rounded down to the nearest whole number. Then we just summarise across all the columns whose names start with "Age" and sum the values.
library(dplyr)

my_df |>
  group_by(Size = floor(Size)) |>
  summarise(
    across(starts_with("Age"), sum)
  )
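If you'd rather stay in base R, the same result comes from aggregate; a sketch, assuming data shaped like the table in the question (column names written Age.1 etc. here, since base data.frame converts the spaces):

```r
# example data matching the table in the question
my_df <- data.frame(Size  = c(8, 8.5, 9, 9.5),
                    Age.1 = c(2, 4, 1, 3),
                    Age.2 = c(8, 7, 11, 2),
                    Age.3 = c(5, 9, 45, 0))

# group rows by the whole-number part of Size and sum each Age column
sums <- aggregate(cbind(Age.1, Age.2, Age.3) ~ floor(Size),
                  data = my_df, FUN = sum)
```

The grouping column in the result is named `floor(Size)`, with one row per whole-number size (8 and 9 here).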
Age <- c(90,56,51,64,67,59,51,55,48,50,43,57,44,55,60,39,62,66,49,61,58,55,45,47,54,56,52,54,50,62,48,52,50,65,59,68,55,78,62,56)
Tenure <- c(2,2,3,4,3,3,2,2,2,3,3,2,4,3,2,4,1,3,4,2,2,4,3,4,1,2,2,3,3,1,3,4,3,2,2,2,2,3,1,1)
df <- data.frame(Age, Tenure)
I'm trying to count the unique values of Tenure, so I've used the table() function to look at the frequencies:
table(df$Tenure)
1 2 3 4
5 15 13 7
However, I'm curious to know what the aggregate() function is showing:
aggregate(Age ~ Tenure, df, function(x) length(unique(x)))
Tenure Age
1 1 3
2 2 13
3 3 11
4 4 7
What's the difference between these two outputs?
The reason for the difference is your inclusion of unique in the aggregate call. You are counting the number of distinct Age values per Tenure, not the number of Age values per Tenure. To get output analogous to table(), try
aggregate(Age ~ Tenure, df, length)
Tenure Age
1 1 5
2 2 15
3 3 13
4 4 7
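To see the two agree once unique is dropped, you can also reshape the table() output into the same data-frame form with as.data.frame():

```r
Age <- c(90,56,51,64,67,59,51,55,48,50,43,57,44,55,60,39,62,66,49,61,
         58,55,45,47,54,56,52,54,50,62,48,52,50,65,59,68,55,78,62,56)
Tenure <- c(2,2,3,4,3,3,2,2,2,3,3,2,4,3,2,4,1,3,4,2,
            2,4,3,4,1,2,2,3,3,1,3,4,3,2,2,2,2,3,1,1)
df <- data.frame(Age, Tenure)

counts <- as.data.frame(table(Tenure = df$Tenure))  # counts from table()
agg    <- aggregate(Age ~ Tenure, df, length)       # same counts via aggregate
```

counts$Freq and agg$Age are now identical: 5, 15, 13, 7.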
I'm a newbie at working with R. I've got some data with multiple observations (i.e., rows) per subject. Each subject has a unique identifier (ID) and has another variable of interest (X) which is constant across each observation. The number of observations per subject differs.
The data might look like this:
ID Observation X
1 1 3
1 2 3
1 3 3
1 4 3
2 1 4
2 2 4
3 1 8
3 2 8
3 3 8
I'd like to find some code that would:
a) Identify the number of observations per subject
b) Identify subjects with greater than a certain number of observations (e.g., >= 15 observations)
c) For subjects with greater than a certain number of observations, I'd like to manipulate the X value for each observation (e.g., I might want to subtract 1 from their X value, so I'd like to modify X for each observation to be X-1)
I might want to identify subjects with at least three observations and reduce their X value by 1. In the above, individuals #1 and #3 (ID) have at least three observations, and their X values--which are constant across all observations--are 3 and 8, respectively. I want to find code that would identify individuals #1 and #3 and then let me recode all of their X values into a different variable. Maybe I just want to subtract 1 from each X value. In that case, the code would then give me X values of (3-1=)2 for #1 and 7 for #3, but #2 would remain at X = 4.
Any suggestions appreciated, thanks!
You can use the aggregate function to do this.
a) Say your table is named temp; you can find the number of observations for each ID (carrying along the constant X value) by using length as the aggregation function:
tot <- aggregate(Observation ~ ID + X, temp, FUN = length)
The output will look like this:
  ID X Observation
1  1 3           4
2  2 4           2
3  3 8           3
b) To see the IDs with at least a certain number of observations (here >= 3), subset the table tot:
vals <- tot$ID[tot$Observation >= 3]
Output is:
[1] 1 3
c) To change the X values for the IDs found in (b), subset by ID membership (row numbers are not guaranteed to match IDs, so don't index with vals directly) and subtract 1:
tot$X[tot$ID %in% vals] <- tot$X[tot$ID %in% vals] - 1
The final output for the table is
  ID X Observation
1  1 2           4
2  2 4           2
3  3 7           3
To change the original table instead, subset it by the IDs you found:
temp$X[temp$ID %in% vals] <- temp$X[temp$ID %in% vals] - 1
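The same three steps can also be done in one pass with dplyr, without building a separate summary table; a sketch, assuming the example data from the question and a threshold of 3 observations:

```r
library(dplyr)

temp <- data.frame(ID          = c(1, 1, 1, 1, 2, 2, 3, 3, 3),
                   Observation = c(1, 2, 3, 4, 1, 2, 1, 2, 3),
                   X           = c(3, 3, 3, 3, 4, 4, 8, 8, 8))

result <- temp |>
  group_by(ID) |>
  mutate(n_obs = n(),                               # (a) observations per subject
         X_new = if_else(n_obs >= 3, X - 1, X)) |>  # (b)+(c) adjust X for qualifying subjects
  ungroup()
```

Subjects 1 and 3 (with 4 and 3 observations) get X_new = 2 and 7, while subject 2 keeps X_new = 4.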
a) Identify the number of observations per subject
You can use table(temp$ID), or summary(temp$ID) if ID is stored as a factor.
I need to group and label every x observations (rows) in a dataset in R.
I also need to know if the last group of rows in the dataset has fewer than x observations.
For example:
If I use a dataset with 10 observations and 2 variables and I want to group by every 3 rows.
I want to add a new column so that the dataset looks like this:
speed dist newcol
4 2 1
4 10 1
7 4 1
7 22 2
8 16 2
9 10 2
10 18 3
10 26 3
10 34 3
11 17 4
df$group <- rep(1:(nrow(df)/3), each = 3)
This works if the number of rows is an exact multiple of 3. Every three rows will get tagged in serial numbers.
A quick and dirty way to tackle the problem of not knowing how incomplete the final group is: check the remainder when nrow is divided by the group size: nrow(df) %% 3 # change the divisor to your group size
Assuming your data is in df, you can do:
df$newcol = rep(1:ceiling(nrow(df)/3), each = 3)[1:nrow(df)]
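Putting the two answers together on the 10-row example (the speed/dist values shown are the first 10 rows of the built-in cars data):

```r
df <- head(cars, 10)

# label every 3 rows; the trailing partial group gets the next label
df$newcol <- rep(1:ceiling(nrow(df) / 3), each = 3)[1:nrow(df)]

# size of the (possibly incomplete) final group: non-zero means incomplete
leftover <- nrow(df) %% 3
```

Here newcol runs 1,1,1,2,2,2,3,3,3,4 and leftover is 1, showing the last group has only one row.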
I have data arranged like this in R:
indv time mass
1 10 7
2 5 3
1 5 1
2 4 4
2 14 14
1 15 15
where indv is individual in a population. I want to add columns for initial mass (mass_i) and final mass (mass_f). I learned yesterday that I can add a column for initial mass using ddply in plyr:
sorted <- ddply(test, .(indv, time), sort)
sorted2 <- ddply(sorted, .(indv), transform, mass_i = mass[1])
which gives a table like:
indv mass time mass_i
1 1 1 5 1
2 1 7 10 1
3 1 10 15 1
4 2 4 4 4
5 2 3 5 4
6 2 8 14 4
7 2 9 20 4
However, this same method will not work for finding the final mass (mass_f), as I have a different number of observations for each individual. Can anyone suggest a method for finding the final mass, when the number of observations may vary?
You can simply use length(mass) as the index of the last element:
sorted2 <- ddply(sorted, .(indv), transform,
mass_i = mass[1], mass_f = mass[length(mass)])
As suggested by mb3041023 and discussed in the comments below, you can achieve similar results without sorting your data frame:
ddply(test, .(indv), transform,
mass_i = mass[which.min(time)], mass_f = mass[which.max(time)])
Except for the order of rows, this is the same as sorted2.
You can use tail(mass, 1) in place of mass[length(mass)] (and, analogously, head(mass, 1) in place of mass[1]).
sorted2 <- ddply(sorted, .(indv), transform, mass_i = head(mass, 1), mass_f=tail(mass, 1))
Once you have this table, it's pretty simple:
t <- tapply(test$mass, test$indv, max)
This will give you a named vector with indv as the names and the maximum mass as the values, which matches mass_f only if each individual's mass is greatest at its final time point.
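Since plyr is retired, the which.min/which.max idea from the accepted answer translates directly to dplyr; a sketch, assuming the test data from the question:

```r
library(dplyr)

test <- data.frame(indv = c(1, 2, 1, 2, 2, 1),
                   time = c(10, 5, 5, 4, 14, 15),
                   mass = c(7, 3, 1, 4, 14, 15))

result <- test |>
  group_by(indv) |>
  mutate(mass_i = mass[which.min(time)],    # mass at the earliest time
         mass_f = mass[which.max(time)]) |> # mass at the latest time
  ungroup()
```

No sorting is needed, and the row order of the original data is preserved.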