I have the following data table in R, which I need to collapse for streamlined data processing. I can do this manually, but I am looking for the most efficient way possible. The data frame looks like this:
and so on. Each age group has 4 observations, 2 male and 2 female (1 of each type). And region consists of city1, city2, city3, etc. which are all ordered the same as the example above. After all age groups are exhausted, the next cityX begins.
I need to combine gender into a total, summing males and females (within type). I also need to combine all age groups to give a population total (the sum over all age groups). I need to keep type separate for now, and later combine the types as an additional column. Each row of the final output should be a region, with the population total for each year column. So the final output would be like this:
I know this could be done manually by splitting the data frame repeatedly, but what would be the most efficient way to do this?
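One possible approach, sketched in base R with `aggregate()`. The original table is not shown, so the column names (`region`, `age`, `gender`, `type`, `y2010`, `y2011`) and the values below are assumptions made up for illustration; substitute the real names.

```r
# Hypothetical data: column names and values are assumptions, not the real table.
df <- data.frame(
  region = rep(c("city1", "city2"), each = 8),
  age    = rep(rep(c("0-4", "5-9"), each = 4), times = 2),
  gender = rep(c("male", "male", "female", "female"), times = 4),
  type   = rep(c("A", "B"), times = 8),
  y2010  = 1:16,
  y2011  = 101:116
)

# Sum every year column over age group and gender, keeping region and type separate.
by_type <- aggregate(cbind(y2010, y2011) ~ region + type, data = df, FUN = sum)

# Collapse type as well, giving one row per region.
by_region <- aggregate(cbind(y2010, y2011) ~ region, data = df, FUN = sum)

print(by_type)
print(by_region)
```

With many year columns, listing them inside `cbind()` gets tedious; `aggregate(. ~ region + type, data = df[, c("region", "type", year_cols)], FUN = sum)` is one alternative, where `year_cols` is a hypothetical character vector of the year column names.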
I have a data set with a variable sex, which has two levels (male and female), and another categorical variable with 6 levels. I want to find which level of the second variable is the most frequent for males and which is the most frequent for females.
Thank you.
Assuming it's ok to infer this from a table, a simple frequency table will do it:
table1 <- table(dataset$sex, dataset$var2)
table1
Obviously substitute in your dataset's name and whatever you've called your second variable. The output will be a frequency table and you can easily read along each row to see the most frequent category for each sex.
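If reading the table by eye is not enough, the most frequent level per sex can also be pulled out programmatically. A minimal sketch, assuming a small made-up `dataset` with columns `sex` and `var2`:

```r
# Made-up example data; replace with the real data set.
dataset <- data.frame(
  sex  = c("male", "male", "male", "female", "female", "female"),
  var2 = c("a", "a", "b", "c", "c", "b")
)

table1 <- table(dataset$sex, dataset$var2)

# For each row (sex), take the column name holding the largest count.
# which.max() returns the first maximum, so ties go to the earlier level.
most_frequent <- apply(table1, 1, function(counts) colnames(table1)[which.max(counts)])

print(most_frequent)
```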
I have a dataframe with three columns: the name of the data point, the group number assigned to that data point, and the species (the data is animal related, and each data point belongs to one of two species).
Any given row looks like this
Name | Group Number | Species
Data Point A | 3 | 1
I would like to split a group only if 90% or more of its rows belong to a single species, e.g. if group 3 is 10 rows long and 9 of those rows belong to species 1 (or 9 belong to species 2), then it satisfies my requirement and should be split.
I have looked into using the split function as well as the filter functions from dplyr, but I can't figure out how to get R to split groups with this percentage-based requirement. Any help would be useful, thank you!
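One way this could be done in base R is to compute, per group, the share of the most common species and then split only the groups that pass the threshold. The column names (`name`, `group`, `species`) and the data are assumptions, and "split" is taken here to mean separating a qualifying group by species.

```r
# Made-up example: group 3 is 9/10 species 1, group 4 is evenly mixed.
df <- data.frame(
  name    = paste("Data Point", LETTERS[1:14]),
  group   = c(rep(3, 10), rep(4, 4)),
  species = c(rep(1, 9), 2, 1, 1, 2, 2)
)

# Share of the most common species within each group.
dominant_share <- tapply(df$species, df$group,
                         function(s) max(table(s)) / length(s))

# Groups meeting the threshold (>= 0.9 here; use > for strictly above 90%).
qualifying <- names(dominant_share)[dominant_share >= 0.9]

# Split only the qualifying groups, by group and species.
to_split <- df[df$group %in% qualifying, ]
pieces <- split(to_split, list(to_split$group, to_split$species), drop = TRUE)

print(names(pieces))
```

The same filter can be written with dplyr as `group_by(group)` followed by `filter(max(table(species)) / n() >= 0.9)`, if that style is preferred.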
I have two data frames. One data frame is called Measurements and has 500 rows. The columns are PatientID, Value and M_Date. The other data frame is called Patients and has 80 rows and the columns are PatientID, P_Date.
Each patient ID in Patients is unique. For each row in Patients, I want to look at the set of measurements in Measurements with the same PatientID (there are maybe 6-7 per patient).
From this set of measurements, I want to identify the one with M_Date closest to P_Date. I want to append this value to Patients in a new column. How do I do this? I tried using ddplyr but can't figure out how to access two data frames at once within this function.
You probably want to install the survival package with install.packages("survival") and use the neardate function within it to solve your problem.
It has a good example in the documentation.
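As far as I know, neardate looks for the closest date on one side only (its `best` argument takes "after" or "prior"), so for the closest date in either direction a plain base R loop may be simpler. A rough sketch with made-up data; the new column name `ClosestValue` is an assumption:

```r
# Made-up example data using the column names from the question.
Patients <- data.frame(
  PatientID = c(1, 2),
  P_Date    = as.Date(c("2020-01-10", "2020-03-01"))
)
Measurements <- data.frame(
  PatientID = c(1, 1, 1, 2, 2),
  Value     = c(5.1, 6.2, 7.3, 8.4, 9.5),
  M_Date    = as.Date(c("2020-01-01", "2020-01-12", "2020-02-01",
                        "2020-02-20", "2020-03-15"))
)

# For each patient, pick the measurement whose date is nearest to P_Date.
Patients$ClosestValue <- sapply(seq_len(nrow(Patients)), function(i) {
  m <- Measurements[Measurements$PatientID == Patients$PatientID[i], ]
  if (nrow(m) == 0) return(NA_real_)          # patient has no measurements
  gap <- abs(as.numeric(m$M_Date - Patients$P_Date[i]))
  m$Value[which.min(gap)]                     # ties go to the first row
})

print(Patients)
```

With 80 patients and 500 measurements this is fast enough; for much larger data a join-based approach would scale better.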
I have a complex dataframe (orig_df). Of the 25 columns, 5 are descriptions and characteristics that I wish to use as grouping criteria. The remainder are time series. There are tens of thousands of rows.
I noted in the initial analysis and numerical summary that there are significant issues with outlier observations within some of the specific grouping criteria. I used "group by" and looked at the quintile results within those groups. I would like to eliminate the low and high (individual observation) outliers relative to the (group-by based) quintiles to improve the decision tree and clustering analytics. I also want to keep the outliers to analyze them separately for the root cause.
How do I manipulate the dataframe such that the individual observations are compared to the group-based quintile results and the resulting split is saved (orig_df becomes ideal_df and outlier_df)?
After identifying the outliers using the link Nikos Tavoularis shared above, you can use ifelse to create a new variable that marks which records are outliers and which are not. This way you keep all the data in place, and you can use the new variable to separate the two sets whenever you want.
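A minimal sketch of that idea in base R. The column names `grp` and `value`, the data, and the 1.5 * IQR rule are all assumptions; swap in whatever group-wise cutoffs were actually chosen (for quintiles, `quantile(x, 0.2)` and `quantile(x, 0.8)`).

```r
# Made-up example: one grouping column and one numeric column.
orig_df <- data.frame(
  grp   = rep(c("a", "b"), each = 6),
  value = c(10, 11, 12, 11, 10, 100, 50, 51, 49, 50, 52, 1)
)

# Group-wise lower and upper bounds, repeated on every row of the group.
# The 1.5 * IQR rule is an assumed outlier definition.
lower <- ave(orig_df$value, orig_df$grp,
             FUN = function(x) quantile(x, 0.25) - 1.5 * IQR(x))
upper <- ave(orig_df$value, orig_df$grp,
             FUN = function(x) quantile(x, 0.75) + 1.5 * IQR(x))

# Flag each observation against the bounds of its own group.
orig_df$outlier <- ifelse(orig_df$value < lower | orig_df$value > upper,
                          "outlier", "ok")

# Save the two parts separately.
ideal_df   <- orig_df[orig_df$outlier == "ok", ]
outlier_df <- orig_df[orig_df$outlier == "outlier", ]

print(outlier_df)
```

With five grouping columns, `ave()` accepts them all as further arguments, e.g. `ave(x, g1, g2, g3, g4, g5, FUN = ...)`.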
I have observed the leaf number (column "No. of fully expanded leaves") of different plants (Plant id) on different days (Measuring date). I want to group plants with the same maximum No. of fully expanded leaves (with all columns included) and put them in the same spreadsheet, which means plants with different maximum leaf counts will be put into separate files. Here is what my data look like:
And here is what the data of a single plant looks like:
How can I do this in R?
Many thanks,
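One way this might be done in base R: compute each plant's maximum with `ave()`, split on that maximum, and write one file per group. The data and the short column names (`plant_id`, `date`, `leaves`) are assumptions, since the real table is not shown, and CSV output to a temporary directory is used here in place of a true spreadsheet format.

```r
# Made-up example: P1 and P2 both reach 5 leaves, P3 reaches 7.
plants <- data.frame(
  plant_id = rep(c("P1", "P2", "P3"), each = 2),
  date     = rep(as.Date(c("2021-05-01", "2021-05-08")), times = 3),
  leaves   = c(3, 5, 4, 5, 2, 7)
)

# Maximum leaf count per plant, repeated on every row of that plant.
plants$max_leaves <- ave(plants$leaves, plants$plant_id, FUN = max)

# One data frame per distinct maximum, holding all rows of the matching plants.
by_max <- split(plants, plants$max_leaves)

# Write each group to its own CSV file (temporary directory assumed here).
out_dir <- tempdir()
for (m in names(by_max)) {
  write.csv(by_max[[m]],
            file.path(out_dir, paste0("max_leaves_", m, ".csv")),
            row.names = FALSE)
}

print(names(by_max))
```

For real Excel files, a package such as writexl or openxlsx would be needed in place of `write.csv()`.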