Summarising a data frame in R by 2 elements

I have a large dataset that includes information for multiple sequences, detailing their sequence ID, country of origin, clade, host and many other things. Each country has multiple different sequences and some countries contain sequences from multiple different clades. Is there a way to know the number of different clades for each different country, without having to test each country one by one (there are too many to realistically enter by hand)?

Without example data, I will rephrase your question as:
I have a large dataset that includes information for multiple starwars
characters, detailing their eye color, homeworld, name, and many other
things. Each homeworld has multiple different eye colors. Is there a
way to know the number of different eye colors for each different
homeworld, without having to test each homeworld one by one (there are
too many to realistically enter by hand)?
Here, we can count how many times different combinations of homeworld and eye color exist in the data. For instance, we have three brown eyed characters from Alderaan.
library(dplyr)
starwars %>% count(homeworld, eye_color)
# A tibble: 66 x 3
homeworld eye_color n
<chr> <chr> <int>
1 Alderaan brown 3
2 Aleen Minor unknown 1
3 Bespin blue 1
4 Bestine IV blue 1
5 Cato Neimoidia red 1
6 Cerea yellow 1
7 Champala blue 1
8 Chandrila blue 1
9 Concord Dawn brown 1
10 Corellia brown 1
# … with 56 more rows
We could add another step to count how many eye colors appear on each homeworld, by counting the number of rows for each homeworld from the step before. This tells us there is only one eye color found on Alderaan (brown).
starwars %>% count(homeworld, eye_color) %>% count(homeworld)
# A tibble: 49 x 2
homeworld n
<chr> <int>
1 Alderaan 1
2 Aleen Minor 1
3 Bespin 1
4 Bestine IV 1
5 Cato Neimoidia 1
6 Cerea 1
7 Champala 1
8 Chandrila 1
9 Concord Dawn 1
10 Corellia 2

Assuming your data frame df has columns named 'country' and 'clade', you can run:
aggregate(data=df, clade ~ country, FUN=function(x) length(unique(x)))
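To make the aggregate() call concrete, here is a minimal sketch on made-up data. The data frame, its values, and the n_clades column name are hypothetical; only the column names country and clade come from the question.

```r
# Hypothetical toy data; only the column names country and clade
# follow the question, the values are invented for illustration.
df <- data.frame(
  seq_id  = paste0("s", 1:6),
  country = c("Kenya", "Kenya", "Kenya", "Peru", "Peru", "Chile"),
  clade   = c("A", "B", "A", "C", "C", "A"),
  stringsAsFactors = FALSE
)

# For each country, count how many distinct clades occur.
clades_per_country <- aggregate(clade ~ country, data = df,
                                FUN = function(x) length(unique(x)))

# Rename the result column so it is clear it holds a count.
names(clades_per_country)[2] <- "n_clades"
print(clades_per_country)
```

Each country appears once in the result, so no country has to be checked by hand.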

Related

How to add rows to dataframe R with rbind

I know this is a classic question and there are also similar ones in the archive, but I feel like the answers did not really apply to this case. Basically I want to take one dataframe (covid cases in Berlin per district), calculate the sum of the columns and create a new dataframe with a column representing the name of the district and another one representing the total number. So I wrote
covid_bln <- read.csv('https://www.berlin.de/lageso/gesundheit/infektionsepidemiologie-infektionsschutz/corona/tabelle-bezirke-gesamtuebersicht/index.php/index/all.csv?q=', sep=';')
c_tot<-data.frame('district'=c(), 'number'=c())
for (n in colnames(covid_bln[3:14])){
x<-data.frame('district'=c(n), 'number'=c(sum(covid_bln$n)))
c_tot<-rbind(c_tot, x)
next
}
print(c_tot)
This works properly for the names, but the number column contains the number of cases for the 8th district only, repeated for every district. If you have any suggestions, even ones involving other functions, that would be great. Thank you
Here's a base R solution:
number <- colSums(covid_bln[3:14])
district <- names(covid_bln[3:14])
c_tot <- cbind.data.frame(district, number)
# If you don't want rownames:
rownames(c_tot) <- NULL
This gives us:
district number
1 mitte 16030
2 friedrichshain_kreuzberg 10679
3 pankow 10849
4 charlottenburg_wilmersdorf 10664
5 spandau 9450
6 steglitz_zehlendorf 9218
7 tempelhof_schoeneberg 12624
8 neukoelln 14922
9 treptow_koepenick 6760
10 marzahn_hellersdorf 6960
11 lichtenberg 7601
12 reinickendorf 9752
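The loop from the question can also be repaired directly. A likely cause of the repeated value is that covid_bln$n does not use the loop variable n. It looks for a column called "n", and because $ on a data frame allows partial matching, it finds the only column whose name starts with "n" (neukoelln, the 8th district). Indexing with [[n]] uses the value stored in n instead. The sketch below uses a small hypothetical stand-in (covid_toy) rather than the real download:

```r
# Hypothetical stand-in for covid_bln with three district columns.
covid_toy <- data.frame(
  id        = 1:3,
  mitte     = c(1, 2, 3),
  neukoelln = c(10, 20, 30),
  pankow    = c(5, 5, 5)
)

rows <- list()
for (n in colnames(covid_toy)[2:4]) {
  # covid_toy$n ignores the loop variable and partially matches "neukoelln";
  # covid_toy[[n]] selects the column whose name is stored in n.
  rows[[n]] <- data.frame(district = n,
                          number = sum(covid_toy[[n]]),
                          stringsAsFactors = FALSE)
}

# Bind the one-row data frames together in a single step.
c_tot_loop <- do.call(rbind, rows)
rownames(c_tot_loop) <- NULL
print(c_tot_loop)
```

Collecting the rows in a list and binding once also avoids growing a data frame inside the loop.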
I want to provide a solution using tidyverse.
The final result is ordered alphabetically by districts
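library(dplyr)  # select(), group_by(), summarise(), %>%
library(tidyr)  # gather(); pivot_longer() is its newer replacement
c_tot <- covid_bln %>%
select(mitte:reinickendorf) %>%
gather(district, number, mitte:reinickendorf) %>%
group_by(district) %>%
summarise(number = sum(number))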
The result is
# A tibble: 12 x 2
district number
* <chr> <int>
1 charlottenburg_wilmersdorf 10736
2 friedrichshain_kreuzberg 10698
3 lichtenberg 7644
4 marzahn_hellersdorf 7000
5 mitte 16064
6 neukoelln 14982
7 pankow 10885
8 reinickendorf 9784
9 spandau 9486
10 steglitz_zehlendorf 9236
11 tempelhof_schoeneberg 12656
12 treptow_koepenick 6788

two datasets to satisfy one condition in R [duplicate]

This question already has answers here:
How to join (merge) data frames (inner, outer, left, right)
(13 answers)
Closed 2 years ago.
I need help with working with two different datasets for my research project.
I have two different data frames, they have different number of columns and rows. I need to gather values from one dataset that satisfy a specific condition that involves both datasets.
The condition to be satisfied: that the combination of two values (in the same row but different columns) are the same.
For example, in my dataset, the values in 'data$Regular_response' should be the same as 'swow$response', and data$Test_word should be the same as swow$cue.
data$Regular_response = swow$response
data$Test_word = swow$cue
In other words, I am looking for equal word pairs in both datasets.
When this condition is satisfied, I need the value of swow$R123.Strength to be printed in a new column in data$strength
How do I do that??
> head(swow)
# A tibble: 6 x 5
cue response R123 N R123.Strength
<chr> <chr> <chr> <chr> <chr>
1 a one 31 257 0.120622568093385
2 a the 26 257 0.101167315175097
3 a an 17 257 0.066147859922179
4 a b 14 257 0.0544747081712062
5 a single 9 257 0.0350194552529183
6 a article 6 257 0.0233463035019455
> head(data)
Regular_response Test_word Pyramids_and_Palms_Test
1: princess queen 92
2: shoes slippers 92
3: flowerpot vase 92
4: horse zebra 92
5: cup bowl 85
6: nun church 85
> filter(data, Test_word == 'queen', Regular_response == 'princess')
Regular_response Test_word Pyramids_and_Palms_Test
1 princess queen 92
2 princess queen 87
> filter(swow, cue == 'queen', response == 'princess')
# A tibble: 1 x 5
cue response R123 N R123.Strength
<chr> <chr> <chr> <chr> <chr>
1 queen princess 3 292 0.0102739726027397
I appreciate those who can help me with this code!
Try this solution as I told you earlier:
Merged <- merge(data, swow[, c("cue", "response", "R123.Strength")],
by.x = c("Test_word", "Regular_response"),
by.y = c("cue", "response"), all.x = TRUE)
Sounds like a job for a join. So something like:
data <- data %>%
left_join(swow, by = c("Regular_response" = "response", "Test_word" = "cue")) %>%
mutate(strength = R123.Strength)
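As a rough illustration of what a two-key left join does here, the following base R sketch uses small hypothetical tables (data_toy and swow_toy) modelled on the rows shown in the question. Because R123.Strength is stored as character in swow, the sketch also converts it to numeric, which is an assumption about what is wanted.

```r
# Hypothetical miniature versions of the two tables in the question.
swow_toy <- data.frame(
  cue           = c("queen", "a"),
  response      = c("princess", "one"),
  R123.Strength = c("0.0102739726027397", "0.120622568093385"),
  stringsAsFactors = FALSE
)
data_toy <- data.frame(
  Regular_response        = c("princess", "shoes", "princess"),
  Test_word               = c("queen", "slippers", "queen"),
  Pyramids_and_Palms_Test = c(92, 92, 87),
  stringsAsFactors = FALSE
)

# Left join: keep every row of data_toy and look up the matching
# (cue, response) pair in swow_toy; unmatched pairs get NA.
merged <- merge(data_toy, swow_toy,
                by.x = c("Test_word", "Regular_response"),
                by.y = c("cue", "response"),
                all.x = TRUE)

# Store the strength as a number under the requested column name.
merged$strength <- as.numeric(merged$R123.Strength)
merged$R123.Strength <- NULL
print(merged)
```

Both queen/princess rows receive the same strength, and the slippers/shoes row, which has no counterpart in swow_toy, gets NA.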

How to view all rows of an output that's not in table form

Problem
I started with an ungrouped data set, which I then grouped. However, the printed output of the grouping does not show all 427 rows. I need the full output in order to put that data into a table.
So initially the data was ungrouped and appears as follows:
Occupation Education Age Died
1 household Secondary 39 no
2 farming primary 83 yes
3 farming primary 60 yes
4 farming primary 73 yes
5 farming Secondary 51 no
6 farming iliterate 62 yes
The data is then grouped as follows:
occu %>% group_by(Occupation, Died, Age) %>% count() ## use this to group on the occupation of the suicide victims
which gives the following output:
Occupation Died Age n
<fct> <fct> <int> <int>
1 business/service no 20 2
2 business/service no 30 1
3 business/service no 31 2
4 business/service no 34 1
5 business/service no 36 2
6 business/service no 41 1
7 business/service no 44 1
8 business/service no 46 1
9 business/service no 84 1
10 business/service yes 21 1
# ... with 417 more rows
The problem is that I need all the rows in order to input the grouped data into a table using:
dt <- read.table(text="full output from above")
If any more code would be useful to solving this let me know.
It is not really clear what you want but try the following code :
# Store the grouped counts, then convert that tibble to a plain data frame
counts <- occu %>% group_by(Occupation, Died, Age) %>% count()
dt <- as.data.frame(counts)
It seems you simply want to convert the tibble to a data frame. There is no need to print all the output and then copy-paste it into read.table().
Also if you need you can save your output with write.table(dt,"filename.txt"), it will create a .txt file with your data.
If what you really want is to print the whole tibble in the console, you can use the following, as suggested by Akrun's link:
options(dplyr.print_max = 1e9)
This lets R print the full tibble in the console, although I don't think that is an efficient way to do what you are asking.
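For comparison, the same kind of count table can be built as an ordinary data frame in base R, which prints every row and needs no copy-pasting. This is a sketch of an alternative, not the approach used above, and occu_toy is a hypothetical miniature of the real data:

```r
# Hypothetical miniature of occu; the values are invented.
occu_toy <- data.frame(
  Occupation = c("household", "farming", "farming", "farming", "farming"),
  Age        = c(39, 83, 60, 83, 51),
  Died       = c("no", "yes", "yes", "yes", "no"),
  stringsAsFactors = FALSE
)

# Give every row a weight of 1 and sum the weights within each
# Occupation / Died / Age combination that occurs in the data.
occu_toy$n <- 1
dt <- aggregate(n ~ Occupation + Died + Age, data = occu_toy, FUN = sum)

# dt is a plain data frame, so printing it shows all rows.
print(dt)
```

The result can be passed straight to write.table() or used as the input table.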

Find the favorite and analyse sequence questions in R

We have a daily meeting when participants nominate each other to speak. The first person is chosen randomly.
I have a dataframe that consists of names and the order of speech every day.
I have a day1, a day2 ,a day3 , etc. in the columns.
The data in the rows are numbers, meaning the order of speech on that particular day.
NA means that the person did not participate on that day.
Name day1 day2 day3 day4 ...
Albert 1 3 1 ...
Josh 2 2 NA
Veronica 3 5 3
Tim 4 1 2
Stew 5 4 4
...
I want to create two analyses. First, I want to create a dataframe showing who has chosen whom the most times. (I know that the result depends on whether a participant was nominated earlier and therefore cannot be nominated again that day; I will handle that later, but for now this is enough.)
It should look like this:
Name Favorite
Albert Stew
Josh Veronica
Veronica Tim
Tim Stew
...
My questions (feel free to answer only one if you can):
1. What code shall I use for it without having to manually put the names in a different dataframe?
2. How shall I handle a tie, for example Josh chose Veronica and Tim first the same number of times? Later I want to visualise it and I have no idea how to handle ties.
I also would like to analyse the results to visualise strong connections.
Like to show that there are people who usually chose each other, etc.
Is there a good package that is specialised for these? Or how should I get to it?
I do not need DNA sequences, only simple ones like this, but I have not found a suitable package yet.
Thanks for your help!
If I am not misunderstanding your problem, here is some code to get the number of occurrences of who chose whom as the next speaker. I added a fourth day so that some counts are greater than 1. There are ties in the result; taking the first pair in each group by speaker ('who') may be a solution:
library(dplyr)  # provides %>%, lead(), filter(), group_by(), summarise(), arrange()
df <- read.table(textConnection(
"Name,day1,day2,day3,day4
Albert,1,3,1,3
Josh,2,2,,2
Veronica,3,5,3,1
Tim,4,1,2,4
Stew,5,4,4,5"),header=TRUE,sep=",",stringsAsFactors=FALSE)
purrr::map(colnames(df)[-1],
function (x) {
who <- df$Name[order(df[x],na.last=NA)]
data.frame(who,lead(who),stringsAsFactors=FALSE)
}
) %>%
replyr::replyr_bind_rows() %>%
filter(!is.na(lead.who.)) %>%
group_by(who,lead.who.) %>% summarise(n=n()) %>%
arrange(who,desc(n))
Input:
Name day1 day2 day3 day4
1 Albert 1 3 1 3
2 Josh 2 2 NA 2
3 Veronica 3 5 3 1
4 Tim 4 1 2 4
5 Stew 5 4 4 5
Result:
# A tibble: 12 x 3
# Groups: who [5]
who lead.who. n
<chr> <chr> <int>
1 Albert Tim 2
2 Albert Josh 1
3 Albert Stew 1
4 Josh Albert 2
5 Josh Veronica 1
6 Stew Veronica 1
7 Tim Stew 2
8 Tim Josh 1
9 Tim Veronica 1
10 Veronica Josh 1
11 Veronica Stew 1
12 Veronica Tim 1
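One possible way to turn these counts into a "favorite" per speaker while keeping ties visible is to retain every pair whose count equals that speaker's maximum. The sketch below redoes the counting in base R on the same four-day example; the names pairs, counts, and favorites are chosen for illustration, and keeping all tied rows is just one assumption about how ties should be handled.

```r
# Same four-day example as above, written out as a data frame.
df <- data.frame(
  Name = c("Albert", "Josh", "Veronica", "Tim", "Stew"),
  day1 = c(1, 2, 3, 4, 5),
  day2 = c(3, 2, 5, 1, 4),
  day3 = c(1, NA, 3, 2, 4),
  day4 = c(3, 2, 1, 4, 5),
  stringsAsFactors = FALSE
)

# For each day, order the speakers (dropping absentees) and pair
# each speaker with the person who spoke next.
pairs <- do.call(rbind, lapply(names(df)[-1], function(day) {
  who <- df$Name[order(df[[day]], na.last = NA)]
  data.frame(who = head(who, -1), chosen = tail(who, -1),
             stringsAsFactors = FALSE)
}))

# Count how often each (who, chosen) pair occurred.
counts <- as.data.frame(table(who = pairs$who, chosen = pairs$chosen),
                        stringsAsFactors = FALSE)
counts <- counts[counts$Freq > 0, ]

# Keep every pair that reaches the speaker's maximum count, so ties
# show up as several rows for the same speaker.
is_top <- counts$Freq == ave(counts$Freq, counts$who, FUN = max)
favorites <- counts[is_top, ]
favorites <- favorites[order(favorites$who, favorites$chosen), ]
print(favorites)
```

A speaker with more than one row in favorites has a tie, which can then be resolved or displayed however the visualisation requires.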

Parsing data from an Excel cell that has more than one data point in it in R

I have an Excel sheet of patient information. The heading for one of the columns is "Discharge diagnosis" The problem is that some patients were discharged with more than one diagnosis and so more than one diagnosis is in some of the cells, separated by a "/".
I am using R to analyze the data. I am trying to find the frequency of any given discharge diagnosis.
How can I get R to look for a diagnosis no matter how it is presented in a cell?
For example, I want to know the frequency of the discharge diagnosis "flu". Some patients have a diagnosis of "flu" while others have a diagnosis of "flu/pneumonia". How can I get R to recognize both of these as containing "flu"?
You didn't provide a sample dataset, so I've made one up. I assume you're OK with getting the data from Excel since you didn't specifically ask about that.
library(tidyverse)
library(stringr)
pts <- tribble(~Pt, ~Diag,
"Bob", "Flu/Pneumonia",
"Cathy", "Flu/Explosive Diarrhea",
"Carol", "Pneumonia/Syphilis")
What I can do next is split the Diag column by the / character, and then use unnest to make a data frame in which each patient gets a record for each diagnosis.
pts <- pts %>%
mutate(Diags = str_split(Diag, "/")) %>%
unnest(Diags)  # name the list column explicitly; a bare unnest() is deprecated
# A tibble: 6 x 3
Pt Diag Diags
<chr> <chr> <chr>
1 Bob Flu/Pneumonia Flu
2 Bob Flu/Pneumonia Pneumonia
3 Cathy Flu/Explosive Diarrhea Flu
4 Cathy Flu/Explosive Diarrhea Explosive Diarrhea
5 Carol Pneumonia/Syphilis Pneumonia
6 Carol Pneumonia/Syphilis Syphilis
Here is a frequency table of diagnoses:
pts %>% count(Diags)
# A tibble: 4 x 2
Diags n
<chr> <int>
1 Explosive Diarrhea 1
2 Flu 2
3 Pneumonia 2
4 Syphilis 1
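If only a quick count is needed, the same idea can be sketched in base R. The vector diag below is hypothetical, and lower-casing the labels is an assumption made so that "Flu" and "flu" are counted together.

```r
# Hypothetical discharge diagnoses, some cells holding two diagnoses.
diag <- c("Flu/Pneumonia", "Flu/Explosive Diarrhea",
          "Pneumonia/Syphilis", "flu")

# Split every cell on "/" and tabulate the individual diagnoses,
# ignoring case and surrounding whitespace.
parts <- trimws(unlist(strsplit(diag, "/", fixed = TRUE)))
freq <- table(tolower(parts))
print(freq)

# Alternatively, count the cells that mention "flu" anywhere.
# Note that this is a substring match, so a label such as
# "influenza" would be counted as well.
n_flu <- sum(grepl("flu", diag, ignore.case = TRUE))
print(n_flu)
```

The table counts each diagnosis once per occurrence, while the grepl() count is per patient cell; for these data both give the same number for flu.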
