I'm currently working with CSVs containing data about test participants on a diet program.
One CSV has the participants' information: 'subject id', 'chosen group', 'extra calories', etc.
The second set of CSVs (of which there are 10) is about the different diet groups: meals, calories, etc.
My task was to find the total calories of each group, then find the total calories of each participant's chosen group.
To get the total calories of each group, I just made a variable that summed the calories of that group:
one <- sum(groupOne$calories)
Then I cleaned the data up a bit in the participants file by removing the 'g' in the row name.
I would ideally like to get some output that has each participant's subjectID and the total calories of their group. Something like below:
| SubjectID | Group | Group's Total Calories |
| 1 | G3 | 100cal |
| 2 | G6 | 200cal |
After that I'm kind of stuck. I don't quite know how to join the two together and produce output that matches the participants to the groups, giving a clean display of each participant's subjectID, their group, and the total calories of that group.
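One possible way to approach this, sketched with made-up data: keep the per-group totals in a small lookup data frame rather than in separate variables, then use merge() to attach each total to the participants by group label. The names `groups`, `participants`, `group_totals`, `group` and `total_calories` are assumptions for illustration; only `calories` and the subject ID come from the question.

```r
# Hypothetical stand-ins for the diet-group CSVs (three of the ten shown)
groups <- list(
  G1 = data.frame(meal = c("breakfast", "lunch"), calories = c(300, 500)),
  G2 = data.frame(meal = c("breakfast", "dinner"), calories = c(250, 650)),
  G3 = data.frame(meal = c("lunch", "dinner"), calories = c(400, 700))
)

# Hypothetical participants table
participants <- data.frame(
  subjectID = 1:4,
  group = c("G3", "G1", "G3", "G2"),
  stringsAsFactors = FALSE
)

# One total per group, stored as a lookup table
group_totals <- data.frame(
  group = names(groups),
  total_calories = sapply(groups, function(g) sum(g$calories)),
  stringsAsFactors = FALSE
)

# Attach each participant's group total by matching on the group label
result <- merge(participants, group_totals, by = "group")
result <- result[order(result$subjectID), c("subjectID", "group", "total_calories")]
print(result, row.names = FALSE)
```

The group labels have to be spelled identically in both tables for the match to work, so if the 'g' has been stripped from the participants file, the lookup table would need the same treatment.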
Related
I have the following data table in R, which I need to collapse for streamlined data processing. I can do this manually, but I am looking for the most efficient way possible. The data frame looks like this:
and so on. Each age group has 4 observations, 2 male and 2 female (1 of each type). And region consists of city1, city2, city3, etc. which are all ordered the same as the example above. After all age groups are exhausted, the next cityX begins.
I need to combine gender into the total, summing males and females (within type). I also need to combine all age groups to give a population total (sum all age groups). I need to keep type separate, and then later combine them as an additional column. I want the final rows output to be the region. I need the population totals for each year column. So the final output would be like this:
I know this could be done manually by splitting the data frame repeatedly, but what would be the most efficient way to do this?
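Since the example table is not shown here, the following is only a sketch under assumed column names (`region`, `age`, `gender`, `type`, and year columns `y2010` and `y2011`). It uses base R's aggregate() with a formula to sum over gender and age group in one step, keeping type separate, and then again to get the combined total per region.

```r
# Hypothetical data: 2 regions x 2 age groups x 2 types x 2 genders
pop <- expand.grid(
  gender = c("male", "female"),
  type = c("type1", "type2"),
  age = c("0-4", "5-9"),
  region = c("city1", "city2"),
  stringsAsFactors = FALSE
)
pop$y2010 <- seq_len(nrow(pop))       # made-up counts
pop$y2011 <- seq_len(nrow(pop)) * 10  # made-up counts

# Sum over gender and age group, keeping type separate within each region
by_type <- aggregate(cbind(y2010, y2011) ~ region + type, data = pop, FUN = sum)

# Sum over everything except region to get the population totals
by_region <- aggregate(cbind(y2010, y2011) ~ region, data = pop, FUN = sum)

print(by_type)
print(by_region)
```

With many year columns, listing them inside cbind() gets tedious; passing `. ~ region + type` after dropping the unwanted grouping columns is one alternative.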
I have a dataset with multiple individuals and two variables from different points of time, for example:
| Company | Employees at t=1 (before a specific event) | Revenue at t=2 (after a specific event) |
| Company A | 100 | 10,000 USD |
| Company B | 23 | 4,000 USD |
| Company C | 150 | 90,000 USD |
My question is: what would you call the structure of this data set?
On the one hand, it could be a panel, as there are multiple individuals at multiple points in time. On the other hand, it could be a cross-sectional data set, as there is only one entry per individual.
Thanks for your help! :)
I have data in R that has over 6000 observations and 96 variables.
The data relates to groups of individuals and their activities etc. If a group returned, the Group ID number was recorded again and a new observation was made. I need to merge the rows by ID so that the number of individuals takes the highest number recorded, but the activities etc. are a combination of both observations.
The data contains the number of individuals, activities, impacts, time of arrival, etc. The issue is that some of the observations were split across 2 lines, so there may be activities which were recorded for the same group in another line. The Group ID for both observations is the same, but one may have the number of individuals recorded and some activity records or impacts, while the second may be incomplete and only have the Group ID and then impacts (which are additional to those in the first record). The number of individuals in the group never changes, so I need some way to combine them so that activities are additive, the number of visitors takes the highest value, time of arrival is the earliest recorded, and time of departure is the later of the 2 observations.
Does anyone know how to merge observations based on Group ID but vary the merging protocol based on the variable?
I'm not sure if this is actually what you want, but to combine rows of a data frame based on multiple conditions you can use the dplyr package and its summarise() function. I generated some data to use in R directly; you would have to modify the code according to your needs.
# load dplyr for %>%, group_by() and summarise()
library(dplyr)

# generate data: 20 group IDs, each recorded twice
ID <- rep(1:20, 2)
visitors <- sample(1:50, 40, replace = TRUE)
impact <- sample(rep(c("a", "b", "c", "d", "e"), 8))
arrival <- sample(rep(8:15, 5))
departure <- sample(rep(16:23, 5))
df <- data.frame(ID, visitors, impact, arrival, departure)
# make sure impact is character (not factor) before pasting
df$impact <- as.character(df$impact)

# summarise rows with identical ID, using a different rule per column
df_summary <- df %>%
  group_by(ID) %>%
  summarise(visitors = max(visitors), arrival = min(arrival),
            departure = max(departure), impact = paste0(impact, collapse = ", "))
Hope this helps!
I have recorded the leaf number (column 'No. of fully expanded leaves') of different plants (Plant id) on different days (Measuring date). I want to group plants with the same maximum No. of fully expanded leaves (with all columns included) and put them in the same spreadsheet, which means plants with different max leaves will be put into separate files. Here is what my data look like:
And here is what the data of a single plant looks like:
How can I do this in R?
Many thanks,
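As the data itself is not shown here, this is only a rough sketch with assumed column names (`plant_id`, `measuring_date`, `leaves`). It uses ave() to attach each plant's maximum leaf count to all of its rows, split() to separate the plants by that maximum, and write.csv() to save one file per group. The output directory (a temporary folder) and the file names are placeholders.

```r
# Hypothetical data: three plants measured on three dates
plants <- data.frame(
  plant_id = rep(c("P1", "P2", "P3"), each = 3),
  measuring_date = rep(as.Date("2020-01-01") + c(0, 7, 14), times = 3),
  leaves = c(2, 4, 6, 1, 3, 5, 2, 5, 6)
)

# Maximum leaf count per plant, repeated on every row of that plant
plants$max_leaves <- ave(plants$leaves, plants$plant_id, FUN = max)

# One data frame per distinct maximum, keeping all columns
by_max <- split(plants, plants$max_leaves)

# Write each group to its own CSV file (temporary folder used as a placeholder)
out_dir <- tempdir()
for (m in names(by_max)) {
  out_file <- file.path(out_dir, paste0("max_leaves_", m, ".csv"))
  write.csv(by_max[[m]], out_file, row.names = FALSE)
}
```

Here plants P1 and P3 both reach 6 leaves and land in the same file, while P2 (maximum 5) gets its own.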
I have a data frame with 21 variables and 1200 observations. The first column is the ID name for each species and column 21 is the total count of all the times each species was seen across multiple sites.
example columns: ID, RM1, RM2, RM10, Total
each row is an ID name and counts per river mile and total count
All I want is a list of the top 20 (or 100 for that matter) most abundant species and their total count. How do I do this?
This is driving me crazy and I don't want to do it in Excel; there must be a way in R.
Sort your data frame (let's call it df) by Total in decreasing order, and take the top 100 rows:
head(df[order(df$Total,decreasing = TRUE), ], 100)