Creating an R dataframe column using row values and aggregate value [duplicate]

I have some fish catch data. Each row contains a species name, a catch value (cpue), and some other unrelated identifying fields (year, location, depth, etc). This code will produce a dataset with the correct structure:
# a sample dataset
set.seed(1337)
fish = rbind(
  data.frame(
    spp = "Flounder",
    cpue = rnorm(5, 5, 2)
  ),
  data.frame(
    spp = "Bass",
    cpue = rnorm(5, 15, 1)
  ),
  data.frame(
    spp = "Cod",
    cpue = rnorm(5, 2, 4)
  )
)
I'm trying to create a normalized cpue column cpue_norm. To do this, I apply the following function to each cpue value:
cpue_norm = (cpue - cpue_mean)/cpue_std
Where cpue_mean and cpue_std are, respectively, the mean and standard deviation of cpue. The caveat is that I need to do this per species, i.e. when I calculate cpue_norm for a particular row, I need to calculate cpue_mean and cpue_std using the cpue values from only that species.
The trouble is that all of the species are in the same dataset. So for each row, I need to calculate the mean and standard deviation of cpue for that species and then use those values to calculate cpue_norm.
I've been able to make some headway with tapply:
calc_cpue_norm = function(l) {
  return((l - mean(l)) / sd(l))
}
tapply(fish$cpue, fish$spp, calc_cpue_norm)
but I end up with a list per species, when what I really need is to add these values back onto the corresponding rows of the dataframe.
Does anyone who knows R better than me have some wisdom to share?
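One way to keep this group-wise logic but get a vector that is already aligned with the rows is base R's ave(); a minimal sketch, assuming the fish sample above (a dplyr group_by()/mutate() version is included as comments):
# ave() applies the function within each species and returns a vector
# ordered to match the rows of `fish`
fish$cpue_norm <- ave(fish$cpue, fish$spp,
                      FUN = function(l) (l - mean(l)) / sd(l))

# equivalent dplyr sketch
# library(dplyr)
# fish <- fish %>%
#   group_by(spp) %>%
#   mutate(cpue_norm = (cpue - mean(cpue)) / sd(cpue)) %>%
#   ungroup()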

Related

Transform a country|date|value dataframe into a dataframe with countries in the columns [duplicate]

I have a data frame, in which data for different countries are listed vertically with this pattern:
country | time | value
I want to transform it into a data frame in which each row is a specific time period and each column holds the values for one country. Data are monthly.
time | countryA-value | countryB-value | countryC-value
Moreover, not all periods are present: when data is missing, the row is simply absent rather than filled with NA or similar. I have thought of two possible solutions, but they seem too complicated and inefficient, so I have not written the code for them.
If the value in a cell of the "time" column is more than one month after the cell above it, while the cells to its left are the same (i.e. the data pertain to the same country), then there is a gap. I would fill the gap and repeat this until all missing dates are included.
At that point every country has the same number of observations, and I can simply copy blocks of cells equal in length to the number of observations.
Drawback: it does not seem very efficient.
I could create a list of time periods using the command
allDates <- seq.Date(from = as.Date('2020-02-01'), to = as.Date('2021-01-01'), by = 'month') - 1
Then, for each country's subset of the table, I would look up each period of allDates: if a value exists, copy it; if not, fill with NA.
Drawback: I have no idea which function I could use for this purpose (a possible sketch follows the sample data below).
Below is the code to create a small table with two missing rows, namely data2:
data <- data.frame(matrix(NA, 24, 3))
colnames(data) <- c("date", "country", "value")
data["date"] <- rep((seq.Date(from = as.Date('2020-02-01'), to = as.Date('2021-01-01'), by = 'month')-1), 2)
data["country"] <- rep(c("US", "CA"), each = 12)
data["value"] <- round(runif(24, 0, 1), 2)
data2 <- data[c(-4,-5),]
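For reference, the lookup-and-fill idea from the second approach could be sketched against data2 roughly as follows; this is a hypothetical implementation (not from the original post) that builds a complete country-by-date grid and left-merges the data onto it:
# Hypothetical sketch of the "complete grid" idea: every country/date
# combination is created first, so the missing periods come back as NA rows
allDates <- seq.Date(from = as.Date('2020-02-01'), to = as.Date('2021-01-01'), by = 'month') - 1
grid <- expand.grid(date = allDates, country = unique(data2$country))
filled <- merge(grid, data2, by = c("date", "country"), all.x = TRUE)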
I solved the problem following the suggestion of r2evans: I looked into the dcast function and obtained exactly what I wanted.
I used the code
reshape2::dcast(dataFrame, yearMonth ~ country, fill = NA)
where dataFrame is the name of the data frame, yearMonth is the name of the column containing the date, and country is the name of the column containing the country.
The option fill = NA fills all gaps in the data with NA.
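Applied to the data2 sample above, with its column names and with value.var made explicit, the call would look roughly like this:
library(reshape2)
# One row per date, one column per country; country/date combinations
# missing from data2 are filled with NA
wide <- dcast(data2, date ~ country, value.var = "value", fill = NA)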

How to sum data based on weekdays in R? [duplicate]

I am new to R and have the following problem:
I have data in two columns: one column represents the earnings on a certain day and the other column contains the weekday (encoded modulo 7, so for example Saturday is 0, Monday is 1, etc.). An example would be
1. 12,000|6
2. 56,000|0
3. 25,000|1
4. 35,000|2
5. 18,000|3
6. 60,000|4
7. 90,000|5
8. 45,000|6
9. 70,000|0
Are there any R commands that I can use to find out how much is earned on all Mondays, on all Tuesdays, and so on?
The general sequence here is grouping and then summarizing. In your case, you want to group by weekday and then sum all the earnings for each group. Here is an implementation of that using dplyr.
library(dplyr)
sample_data <- tibble(
  earnings = sample(seq(1, 10000, by = 1), size = 100, replace = TRUE),
  weekday = sample(seq(0, 6, by = 1), size = 100, replace = TRUE)
)
sample_data %>%
  group_by(weekday) %>%
  summarise(total_earnings = sum(earnings))
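For completeness, the same group-and-sum can be done in base R with aggregate(); a small sketch against the sample_data tibble above (an alternative, not part of the original answer):
# Base R equivalent: sum earnings within each weekday group
aggregate(earnings ~ weekday, data = sample_data, FUN = sum)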

Set up loop to find unique values in dataframe, sum them and add them to a new row in dataframe [duplicate]

I have a data frame df structured as follows:
df$gene_name contains gene names; each row is a unique identifier.
df$gene_type indicates the gene type; there are several types within this vector.
df$Dim.1 etc. contain an expression value for each gene in various biological samples.
I am trying to set up a loop (or otherwise) which adds a new row to df summing up df$Dim.1, df$Dim.2, etc. for every unique value in df$gene_type. For example, the 10th row would look like
df[10, 1:3] <- c("total", "antisense", "sum of Dim.1 values")
Then the code would find the next unique value in gene_type and sum Dim.1, Dim.2, etc. for that gene_type.
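A minimal sketch of one way to do this without an explicit loop; it assumes dplyr >= 1.0 is available and that the expression columns share a "Dim" prefix (adjust the names to match the real data frame):
library(dplyr)
# Sum every Dim column within each gene_type, then label the summary rows
totals <- df %>%
  group_by(gene_type) %>%
  summarise(across(starts_with("Dim"), sum)) %>%
  mutate(gene_name = "total")
# Append the per-type summary rows to the original data frame
df_with_totals <- bind_rows(df, totals)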

How to create quintiles of a variable in R [duplicate]

Is it possible to bin a variable into quintiles (fifths) using R, and select only the values that fall in the 5th bin?
For now I am using the closest option, which is the quartile (0.75), as there does not seem to be a ready-made function for quintiles.
Any suggestions please.
Not completely sure what you mean, but this divides a dataset into 5 equal groups based on value and subsequently selects the fifth group:
obs = rnorm(100)
qq = quantile(obs, probs = seq(0, 1, .2))
obs[obs >= qq[5]]
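If the goal is to first label every observation with its quintile and then filter, a possible alternative (not from the original answer) is cut() with the same quantile breakpoints:
# Assign each value to a quintile bin (1-5) using the breakpoints computed above,
# then keep only the values in the top bin
bins <- cut(obs, breaks = qq, labels = 1:5, include.lowest = TRUE)
obs[bins == "5"]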

how to create mean ses by school - tapply function error?

I have a dataframe that lists studentnumber <- c(1, 2, 3, ..., n) and schoolnumber <- c(1, 1, 2, 3, 4, 4), so pupil 1 is in school 1, pupil 2 is in school 1, pupil 3 is in school 2, and so on.
I have a socio-economic status (SES) for each pupil and I want to calculate a new column in which each value is the pupil's actual SES minus the mean SES of that pupil's school. The function for this is apparently:
mydata$meansocialeconomicstatus <- with(mydata, tapply(ses, schoolnumber, mean))
But I receive an error because tapply returns only one mean per school instead of repeating each mean for every row of that school, so the length of the new column does not match the number of rows in the dataframe.
My question is, what could I add to make the mean SES repeat in the new column depending on the school number?
You can use the dplyr package.
library(dplyr)

# Calculate the mean socialeconomicstatus per schoolnumber.
mydata2 <- mydata %>%
  group_by(schoolnumber) %>%
  summarise(meansocialeconomicstatus = mean(ses))

# Join the mean socialeconomicstatus back to the original dataset based on schoolnumber.
left_join(mydata, mydata2, by = "schoolnumber")
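To finish the calculation the question asks for (pupil SES minus the school mean), one possible follow-up after the join, plus a base R alternative with ave(), is sketched here as an assumption rather than part of the original answer:
# Subtract each school's mean SES from every pupil's SES
mydata_joined <- left_join(mydata, mydata2, by = "schoolnumber")
mydata_joined$ses_centered <- mydata_joined$ses - mydata_joined$meansocialeconomicstatus

# Base R alternative: ave() repeats each school's mean for every matching row,
# so the result already lines up with the rows of mydata
# mydata$meansocialeconomicstatus <- ave(mydata$ses, mydata$schoolnumber, FUN = mean)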
