Group splitting with specific conditions - r

I have a dataframe with three columns: the name of the data point, the group number assigned to that data point, and the species (the data is animal related, and each data point belongs to one of two species).
Any given row looks like this
Name | Group Number | Species
Data Point A | 3 | 1
I would like to split a group only if at least 90% of its rows belong to a single species, e.g. if group 3 is 10 rows long and 9 of those rows belong to species 1 (or 9 belong to species 2), then it satisfies my requirement and should be split.
I have looked into using the split function as well as the filter functions from dplyr, but I can't seem to figure out how to get R to split groups with this percentage-based requirement. Any help would be useful, thank you!
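One possible approach, as a minimal base R sketch: compute the share of the most common species in each group, then split only the groups that meet the threshold. The column names Name, Group and Species and the toy data are assumptions made for illustration, and "split" is read here as "return one data frame per qualifying group".

```r
# Toy data; the column names Name, Group and Species are assumptions.
df <- data.frame(
  Name    = paste("Data Point", LETTERS[1:14]),
  Group   = c(rep(3, 10), rep(4, 4)),
  Species = c(rep(1, 9), 2, 1, 1, 2, 2)
)

# Share of the most common species within each group.
dominant_share <- tapply(df$Species, df$Group,
                         function(s) max(table(s)) / length(s))

# Groups where one species makes up at least 90% of the rows.
keep <- names(dominant_share)[dominant_share >= 0.9]

# Split only the qualifying groups into a list of data frames.
qualifying <- df[df$Group %in% keep, ]
split_groups <- split(qualifying, qualifying$Group)
```

Here group 3 (9 of 10 rows from species 1) qualifies and group 4 (2 of 4) does not. Use > instead of >= if the threshold should be strictly above 90%.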

Related

Collapse and Sum Data along multiple groupings in R

I have the following data table in R, which I need to collapse for streamlined data processing. I can do this manually, but I am looking for the most efficient way possible. The data frame looks like this:
and so on. Each age group has 4 observations, 2 male and 2 female (1 of each type). And region consists of city1, city2, city3, etc. which are all ordered the same as the example above. After all age groups are exhausted, the next cityX begins.
I need to combine gender into the total, summing males and females (within type). I also need to combine all age groups to give a population total (sum all age groups). I need to keep type separate, and then later combine them as an additional column. I want the final rows output to be the region. I need the population totals for each year column. So the final output would be like this:
I know this could be done manually by splitting the data frame repeatedly, but what would be the most efficient way to do this?

Recoding variables in vector based on values in different vector

Complete R novice here.
I have a wide-form data frame which includes a vector/variable for participant_number, with each participant providing two responses (score), with a within-subjects manipulation (code).
However, I have three separate sets of values which corresponded to the participant numbers in three different (between subjects) experimental groups (e.g. control, active_1, active_2).
How can I use these sets of values to create a variable in my main data frame which indicates what experimental group the participant belongs to?
Any help, much appreciated.
The package "dplyr" is quite useful for this kind of thing. Let's consider a small working example:
df <- data.frame(ID=c(1:7))
ListActive1 <- c(1,3)
ListActive2 <- c(2,5)
ListControl <- c(4,7,6)
df is the main data frame containing the ID of the participant (and of course it may have further columns, e.g. the score etc.) The three vectors contain for each group the IDs of the participants belonging to this particular group, e.g. the participants with ID 2 and 5 belong to the group "Active2".
Now we create a new column in the main data frame using the command mutate which comes with the dplyr package (make sure to install and load it).
library(dplyr)  # provides mutate() and case_when()
df <- mutate(df, group = case_when(
  ID %in% ListActive1 ~ "Active1",
  ID %in% ListActive2 ~ "Active2",
  ID %in% ListControl ~ "Control"))
The command case_when checks for each participant in which of the lists the ID appears and then puts the corresponding label in the new column group.
  ID   group
1  1 Active1
2  2 Active2
3  3 Active1
4  4 Control
5  5 Active2
6  6 Control
7  7 Control

Comparing multiple data frames based on unique values in one column and finding overlapping values in second column in multiple data frames in R

I wanted to ask for advice on a problem I am having in trying to identify intersecting values in multiple data frames. In my mind this is a bit complex, and I can't figure out how to do it using the normal intersect function.
I have several data frames (up to 12) with multiple columns that are showing gene changes over time (for example 5 time points) and how other genes correlate with this change (i.e, other genes that also go down, or up in a manner that correlates other genes in the data). The analysis takes each gene one at a time, uses that gene as a reference and tests every single gene against it to see if the pattern of change over time of those genes correlate with the first reference gene. This is repeated for every single gene. So taking one data frame as an example, the results would appear as follows.
Column 1 contains the genes that serve as the reference gene. A value can occur multiple times if other genes correlate with changes over time in this gene. For example, if genes b, c and d correlate with gene a, the first two columns look as follows:
a b
a c
a d
The same goes for gene b, and so on and so forth, 20,000 times (the number of genes). I hope this makes sense.
b a
b c
b d
The analysis above is carried out in multiple different samples, so I will get up to 12 data frames, one per sample, each with results as detailed above.
Objective (and apologies in advance that I do not have code as I am not entirely sure where to start!) as I am thinking this might best be served by creating a function for this: For gene 'x' in column number 1, in every single data frame, I would like to see if column 2 has overlapping values.
Taking the example above, multiple data frames may look like this:
df1
a b
a c
a d
df2
a d
a c
a e
df3
a d
a e
a f
So, comparing the data frames, the function would identify that for gene a there is one column 2 value shared by all data frames: gene d, as it is common to all data frames for gene a.
Similarly, the function would carry out this overlap analysis for every single gene (gene a, b, c, etc.).
The output would be, for every gene in column 1, the column 2 values that occur for that gene in all of the data frames.
I am pasting head(analysis)
Feature1 Feature2 delay pBefore pAfter corBefore
1 ENSMUSG00000001525 ENSMUSG00000026211 0 0.1093914984 0.1093914984 0.7161907
2 ENSMUSG00000001525 ENSMUSG00000055653 -1 0.0916478944 0.1047749696 0.7414240
3 ENSMUSG00000001525 ENSMUSG00000003038 0 0.0006810160 0.0006810160 0.9786161
plus many many more genes in feature 1, each with genes in feature 2 associated with genes in feature 1
this data frame would be one sample and I would have a separate result for the other samples
I would really appreciate any hints as to how to create code to achieve this goal. In addition, it would be nice to be able to specify that I would like to see the overlap only for genes that meet a condition, for example pBefore >= 0.8, or a similar filter on the delay column.
Many thanks for taking the time to read this!
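If I understand correctly, you can concatenate all 12 dataframes (the snippets here use Python's pandas, not R):
import pandas as pd
df_final = pd.concat([df1, df2, ..., df12])  # list all 12 dataframes here
Then count how often each pair of genes occurs, where 'A' and 'B' stand for the reference-gene and correlated-gene columns:
df_n = df_final.groupby(['A','B']).size().reset_index(name='count')
As there are 12 dataframes, and assuming each pair appears at most once per dataframe,
df_n[df_n['count'] == 12]
will give you the pairs of genes present in all 12 dataframes.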
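Since the question is about R, the same idea can be sketched in base R: collect, per reference gene, the partner genes from each sample and intersect them. This is only an illustration under assumptions: the toy data frames mirror the df1 to df3 example from the question (plus a gene b row to show an empty overlap), the column names Feature1 and Feature2 come from the head(analysis) output, and overlap_for_gene is a hypothetical helper name.

```r
# Toy samples using the Feature1/Feature2 column names from head(analysis).
df1 <- data.frame(Feature1 = c("a", "a", "a", "b"),
                  Feature2 = c("b", "c", "d", "a"))
df2 <- data.frame(Feature1 = c("a", "a", "a", "b"),
                  Feature2 = c("d", "c", "e", "a"))
df3 <- data.frame(Feature1 = c("a", "a", "a"),
                  Feature2 = c("d", "e", "f"))
samples <- list(df1, df2, df3)

# Hypothetical helper: for one reference gene, intersect its
# Feature2 partners across all samples.
overlap_for_gene <- function(gene, dfs) {
  partners <- lapply(dfs, function(d) unique(d$Feature2[d$Feature1 == gene]))
  Reduce(intersect, partners)
}

# Run it for every reference gene seen in any sample.
genes <- unique(unlist(lapply(samples, function(d) as.character(d$Feature1))))
overlaps <- setNames(lapply(genes, overlap_for_gene, dfs = samples), genes)
```

For the toy data, overlaps$a contains only "d", and overlaps$b is empty because gene b does not appear in df3. A filter such as pBefore >= 0.8 could be applied to each data frame before building the list, for example with subset().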

Count of columns with filters

I have a dataframe with multiple columns and I want to apply different functions on each column.
An example of my dataset -
I want to calculate the count of column pq110a for each country mentioned in the qcountry2 column (me = Mexico, br = Brazil, ar = Argentina). The problem I face is that I have to filter on these columns. For example, for sample patients I want:
Count of pq110 when the values are 1 and 2 (for some patients)
Count of pq110 when the value is 3 (for another patients)
Similarly when the value is 6.
For total patient I want-total count of pq110.
The output I am expecting is:
Similarly, I want this output for each country.
Please suggest how I can do this for other columns as well, country-wise.
Thanks !!
I guess what you want to do is count how many times each value of 'pq110' occurs within each 'qcountry2'.
So I'll use 'tapply' to divide the data into one subset per country and then use 'table' to count the occurrences of each value within each subset.
tapply(my_data[,"pq110"], INDEX = as.factor(my_data[,"qcountry2"]), function(x)table(x))
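As a rough sketch of the filtered counts the question asks about, a two-way table gives all value counts per country at once, and a logical condition summed per country gives the count for a chosen set of values. The toy values in my_data are invented for illustration; only the column names come from the question.

```r
# Toy data; the values are invented for illustration.
my_data <- data.frame(
  qcountry2 = c("me", "me", "br", "br", "br", "ar"),
  pq110     = c(1, 2, 3, 1, 6, 3)
)

# One row per country, one column per pq110 value.
counts <- table(my_data$qcountry2, my_data$pq110)

# Rows per country where pq110 is 1 or 2.
n_1_or_2 <- tapply(my_data$pq110 %in% c(1, 2), my_data$qcountry2, sum)

# Total rows per country.
n_total <- rowSums(counts)
```

The same pattern should carry over to other columns by swapping the column name and the set of values in the %in% condition.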

R: Extract top nth values with ID/name

I have a data frame with 21 variables and 1200 observations. The first column is the ID name for each species and column 21 is the total count of all the times each species was seen across multiple sites.
example columns: ID, RM1, RM2, RM10, Total
each row is an ID name and counts per river mile and total count
All I want is a list of the top 20 (or 100 for that matter) most abundant species and their total count. How do I do this?
This is driving me crazy and I don't want to do it in excel - there must be a way in R.
Sort your data frame (let's call it df) by Total in decreasing order, and take the top 100 rows:
head(df[order(df$Total,decreasing = TRUE), ], 100)
