Parsing data from an Excel cell that has more than one data point in it in R

I have an Excel sheet of patient information. The heading for one of the columns is "Discharge diagnosis". The problem is that some patients were discharged with more than one diagnosis, so some cells contain several diagnoses separated by a "/".
I am using R to analyze the data. I am trying to find the frequency of any given discharge diagnosis.
How can I get R to look for a diagnosis no matter how it is presented in a cell?
For example, I want to know the frequency of the discharge diagnosis "flu". Some patients have a diagnosis of "flu" while others have a diagnosis of "flu/pneumonia". How can I get R to recognize both of these as containing "flu"?

You didn't provide a sample dataset, so I've made one up. I assume you're OK with getting the data from Excel since you didn't specifically ask about that.
library(tidyverse)
library(stringr)
pts <- tribble(~Pt, ~Diag,
"Bob", "Flu/Pneumonia",
"Cathy", "Flu/Explosive Diarrhea",
"Carol", "Pneumonia/Syphilis")
What I can do next is split the Diag column by the / character, and then use unnest to make a data frame in which each patient gets a record for each diagnosis.
pts <- pts %>%
  mutate(Diags = str_split(Diag, "/")) %>%
  unnest(Diags)
# A tibble: 6 x 3
Pt Diag Diags
<chr> <chr> <chr>
1 Bob Flu/Pneumonia Flu
2 Bob Flu/Pneumonia Pneumonia
3 Cathy Flu/Explosive Diarrhea Flu
4 Cathy Flu/Explosive Diarrhea Explosive Diarrhea
5 Carol Pneumonia/Syphilis Pneumonia
6 Carol Pneumonia/Syphilis Syphilis
Here is a frequency table of diagnoses:
pts %>% count(Diags)
# A tibble: 4 x 2
Diags n
<chr> <int>
1 Explosive Diarrhea 1
2 Flu 2
3 Pneumonia 2
4 Syphilis 1
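If only one diagnosis matters, such as "flu" in the question, the same split-and-match idea also works without reshaping the whole table. The following is a minimal base-R sketch, not part of the answer above: the vector `diag` and the helper `has_diagnosis` are made-up names, and the comparison is case-insensitive on the assumption that "flu" and "Flu" should count as the same diagnosis.

```r
# Hypothetical sample data mirroring the made-up patients above
diag <- c("Flu/Pneumonia", "Flu/Explosive Diarrhea", "Pneumonia/Syphilis")

# TRUE for each cell that lists `target` as one of its "/"-separated diagnoses
has_diagnosis <- function(cells, target) {
  parts <- strsplit(cells, "/", fixed = TRUE)
  vapply(parts,
         function(p) tolower(target) %in% tolower(trimws(p)),
         logical(1))
}

# Number of patients whose cell contains "flu", alone or combined
flu_count <- sum(has_diagnosis(diag, "flu"))
flu_count
```

Comparing whole pieces rather than searching for a substring means that an entry such as "Influenza B" would not be counted as "flu", which may or may not be what you want for your data.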

Related

Removing matching observations where their adjacent column does not equal to 100

I have ~4000 observations in my data frame, test_11, and have pasted part of the data frame below:
[Image: data frame snippet]
The k_hidp column identifies matching households, the k_fihhmnnet1_dv column is their reported household income, and the percentage_income_rounded column reports each participant's contribution to the total household income.
I want to filter my data to remove all k_hidp observations whose collective income in percentage_income_rounded does not equal 100.
For example, the first household, 68632420, reported a total contribution of 83% (65+18) instead of the 100% that the other households report.
Is there any way to remove these household observations so I am only left with households with a collective income of 100%?
Thank you!
Try this:
## Creating the dataframe
df=data.frame(k_hidp = c(68632420,68632420,68632420,68632420,68632420,68632420,68632422,68632422,68632422,68632422,68632428,68632428),
percentage_income_rounded = c(65,18,86,14,49,51,25,25,25,25,50,50))
## Loading the libraries
library(dplyr)
## Aggregating and determining which household collective income is 100%
df1 = df %>%
  group_by(k_hidp) %>%
  mutate(TotalPercentage = sum(percentage_income_rounded)) %>%
  filter(TotalPercentage == 100)
Output
> df1
# A tibble: 6 x 3
# Groups: k_hidp [2]
k_hidp percentage_income_rounded TotalPercentage
<dbl> <dbl> <dbl>
1 68632422 25 100
2 68632422 25 100
3 68632422 25 100
4 68632422 25 100
5 68632428 50 100
6 68632428 50 100
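Because the column holds rounded percentages, a complete household can in principle sum to 99 or 101 (for example 33 + 33 + 33). The sketch below is a base-R variant of the same group-sum-and-filter idea with an optional tolerance. The function name `filter_households`, the `tol` argument, and the toy household IDs are all assumptions for illustration, not part of the original answer.

```r
# Toy data: household 1 sums to 83, 2 and 3 sum to 100, 4 sums to 99
df <- data.frame(
  k_hidp = c(1, 1, 2, 2, 2, 2, 3, 3, 4, 4, 4),
  percentage_income_rounded = c(65, 18, 25, 25, 25, 25, 50, 50, 33, 33, 33)
)

# Keep rows of households whose total is within `tol` points of 100
filter_households <- function(d, tol = 0) {
  # ave() returns the per-household sum, repeated on every row of the household
  total <- ave(d$percentage_income_rounded, d$k_hidp, FUN = sum)
  d[abs(total - 100) <= tol, ]
}

strict  <- filter_households(df)           # exact match only
lenient <- filter_households(df, tol = 1)  # also accepts 99 and 101
unique(strict$k_hidp)
unique(lenient$k_hidp)
```

With `tol = 0` this behaves like the `filter(TotalPercentage == 100)` step above. Whether a tolerance is appropriate depends on how the percentages were rounded in your data.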

Keep specific rows of a data frame based on word sequence in R

I have a dataframe (df) like this. What I want to do is to go through the values for each ID and if there are two strings starting with the same word, I want to compare them to keep distinct values.
df <- data.frame(id = c(1,1,2,3,3,4,4,4,4,5),
value = c('australia', 'australia sydney', 'brazil',
'australia', 'usa', 'australia sydney', 'australia sydney randwick', 'australia', 'australia sydney circular quay', 'australia sydney'))
I want to get the first words to compare them and if they are different keep both but if they are the same go to the second words to compare them and so on...
so like for ID 1 I want to keep the row with the value 'australia sydney' and for Id 4 I want to keep both 'australia sydney circular quay', 'australia sydney randwick'.
For this example I need to get rows 2:5, 7, 9,10
Based on your edit, you can check within groups if any entry matches the start of any other entry and remove entries that do:
library(tidyverse)
df %>%
  group_by(id) %>%
  filter(!map_lgl(
    seq_along(value),
    ~ any(if (length(value) == 1) FALSE else str_detect(value[-.x], paste0("^", value[.x])))
  ))
# A tibble: 7 x 2
# Groups: id [5]
id value
<dbl> <chr>
1 1 australia sydney
2 2 brazil
3 3 australia
4 3 usa
5 4 australia sydney randwick
6 4 australia sydney circular quay
7 5 australia sydney
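One thing to be aware of is that `paste0("^", value[.x])` treats each value as a regular expression and matches any prefix, so "australia" would also be treated as the start of a value like "australian capital territory". The following base-R sketch is a variation that compares whole words by requiring a space after the prefix. The helper name `is_most_specific` is made up for this example, and unlike the code above it keeps exact duplicates within an id rather than dropping them.

```r
df <- data.frame(
  id = c(1, 1, 2, 3, 3, 4, 4, 4, 4, 5),
  value = c("australia", "australia sydney", "brazil",
            "australia", "usa", "australia sydney",
            "australia sydney randwick", "australia",
            "australia sydney circular quay", "australia sydney"),
  stringsAsFactors = FALSE
)

# TRUE when no other value in the group extends this one by further words
is_most_specific <- function(values) {
  vapply(seq_along(values), function(i) {
    !any(startsWith(values[-i], paste0(values[i], " ")))
  }, logical(1))
}

# Apply the check within each id and put the results back in row order
keep <- unsplit(lapply(split(df$value, df$id), is_most_specific), df$id)
result <- df[keep, ]
which(keep)
```

`startsWith()` does a literal comparison, so values containing regex metacharacters such as "." or "(" need no escaping.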

Summarising a data frame in R by 2 elements

I have a large dataset that includes information for multiple sequences, detailing their sequence ID, country of origin, clade, host and many other things. Each country has multiple different sequences and some countries contain sequences from multiple different clades. Is there a way to know the number of different clades for each different country, without having to test each country one by one (there are too many to realistically enter by hand)?
Without example data, I will rephrase your question as:
I have a large dataset that includes information for multiple starwars
characters, detailing their eye color, homeworld, name, and many other
things. Each homeworld has multiple different eye colors. Is there a
way to know the number of different eye colors for each different
homeworld, without having to test each homeworld one by one (there are
too many to realistically enter by hand)?
Here, we can count how many times each combination of homeworld and eye color occurs in the data. For instance, we have three brown-eyed characters from Alderaan.
library(dplyr)
starwars %>% count(homeworld, eye_color)
# A tibble: 66 x 3
homeworld eye_color n
<chr> <chr> <int>
1 Alderaan brown 3
2 Aleen Minor unknown 1
3 Bespin blue 1
4 Bestine IV blue 1
5 Cato Neimoidia red 1
6 Cerea yellow 1
7 Champala blue 1
8 Chandrila blue 1
9 Concord Dawn brown 1
10 Corellia brown 1
# … with 56 more rows
We could add another step to count how many eye colors appear on each homeworld, by counting the number of rows for each homeworld from the step before. This tells us there is only one eye color found on Alderaan (brown).
starwars %>% count(homeworld, eye_color) %>% count(homeworld)
# A tibble: 49 x 2
homeworld n
<chr> <int>
1 Alderaan 1
2 Aleen Minor 1
3 Bespin 1
4 Bestine IV 1
5 Cato Neimoidia 1
6 Cerea 1
7 Champala 1
8 Chandrila 1
9 Concord Dawn 1
10 Corellia 2
Assuming your data frame df has columns named 'country' and 'clade', you can run:
aggregate(data=df, clade ~ country, FUN=function(x) length(unique(x)))

How to relate two different dataframes to make calculations

I know how to work with a single dataframe and compute math/statistics on it. But what happens when I have to deal with two? For example:
> df1
supervisor salesperson
1 Supervisor1 Matt
2 Supervisor2 Amelia
3 Supervisor2 Philip
> df2
month channel Matt Amelia Philip
1 Jan Internet 10 50 20
2 Jan Cellphone 20 60 30
3 Feb Internet 40 40 30
4 Feb Cellphone 30 120 40
How can I compute the sales by supervisor, grouped by channel, in an efficient and generalizable way? Is there any methodology or criterion to follow when you need to relate two or more dataframes in order to compute the data you need?
PS: The numbers are the sales made by each salesperson.
Here is the idea of converting to long and merging using tidyverse,
library(tidyverse)
df2 %>%
  gather(salesperson, val, -c(1:2)) %>%
  left_join(., df1, by = 'salesperson') %>%
  spread(salesperson, val, fill = 0) %>%
  group_by(channel, supervisor) %>%
  summarise_at(vars(names(.)[4:6]), sum)
which gives,
# A tibble: 4 x 5
# Groups: channel [?]
channel supervisor Amelia Matt Philip
<fct> <fct> <dbl> <dbl> <dbl>
1 Cellphone Supervisor1 0. 50. 0.
2 Cellphone Supervisor2 180. 0. 70.
3 Internet Supervisor1 0. 50. 0.
4 Internet Supervisor2 90. 0. 50.
NOTE: You can also add month in the group_by
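The general pattern is: reshape the wide table so that each row is one (month, channel, salesperson) observation, join the lookup table on the shared key, then aggregate. Below is a base-R sketch of those three steps that produces a single total per channel and supervisor instead of one column per salesperson. The intermediate names `long`, `joined`, and `totals` are made up for this example.

```r
df1 <- data.frame(
  supervisor  = c("Supervisor1", "Supervisor2", "Supervisor2"),
  salesperson = c("Matt", "Amelia", "Philip"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  month   = c("Jan", "Jan", "Feb", "Feb"),
  channel = c("Internet", "Cellphone", "Internet", "Cellphone"),
  Matt    = c(10, 20, 40, 30),
  Amelia  = c(50, 60, 40, 120),
  Philip  = c(20, 30, 30, 40),
  stringsAsFactors = FALSE
)

# 1. Wide to long: stack the salesperson columns on top of each other
people <- df1$salesperson
long <- data.frame(
  month       = rep(df2$month, times = length(people)),
  channel     = rep(df2$channel, times = length(people)),
  salesperson = rep(people, each = nrow(df2)),
  sales       = unlist(df2[people], use.names = FALSE),
  stringsAsFactors = FALSE
)

# 2. Join the supervisor lookup on the shared key
joined <- merge(long, df1, by = "salesperson")

# 3. Aggregate sales per channel and supervisor
totals <- aggregate(sales ~ channel + supervisor, data = joined, FUN = sum)
totals
```

Adding `month` to the right-hand side of the formula in step 3 would give the per-month breakdown mentioned in the note above.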

Calculate Percentage Column for List of Dataframes When Total Value is Hidden Within the Rows

library(tidyverse)
I feel like there is a simple solution for this but I'm stuck. The code below creates a simple list of two dataframes (they are the same for simplicity of the example, but the real data has different values)
Loc<-c("Montreal","Toronto","Vancouver","Quebec","Ottawa","Hamilton","Total")
Count<-c("2344","2322","122","45","4544","44","9421")
Data<-data_frame(Loc,Count)
Data2<-data_frame(Loc,Count)
Data3<-list(Data,Data2)
Each dataframe has "Total" within the "Loc" column with the corresponding overall total of the "Count" column. I would like to calculate percentages for each dataframe by dividing each value in the "Count" column by the total, which is the last number in the "Count" column.
I would like the percentages to be added as new columns for each dataframe.
For this example, the total is the last number in the column, but in reality, it may be mixed anywhere in the column and can be found by the corresponding "Total" value in the "Loc" column.
I would like to use purrr and Tidyverse:
Below is an example of the code, but I'm stuck on the percentage...
Data3%>%map(~mutate(.x,paste0(round(100* (MISSING PERCENTAGE),2),"%"))
This solution uses only base-R:
for (i in seq_along(Data3)) {
  Data3[[i]]$Count <- as.numeric(Data3[[i]]$Count)
  n <- nrow(Data3[[i]])
  Data3[[i]]$perc <- Data3[[i]]$Count / Data3[[i]]$Count[n]
}
> Data3
[[1]]
# A tibble: 7 x 3
Loc Count perc
<chr> <dbl> <dbl>
1 Montreal 2344 0.248805859
2 Toronto 2322 0.246470651
3 Vancouver 122 0.012949793
4 Quebec 45 0.004776563
5 Ottawa 4544 0.482326717
6 Hamilton 44 0.004670417
7 Total 9421 1.000000000
[[2]]
# A tibble: 7 x 3
Loc Count perc
<chr> <dbl> <dbl>
1 Montreal 2344 0.248805859
2 Toronto 2322 0.246470651
3 Vancouver 122 0.012949793
4 Quebec 45 0.004776563
5 Ottawa 4544 0.482326717
6 Hamilton 44 0.004670417
7 Total 9421 1.000000000
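The loop above relies on the total being in the last row. For the case described in the question, where the "Total" row may sit anywhere, the total can be looked up by its label instead. The following is a hedged base-R sketch: `add_perc` is a made-up helper name, and the example deliberately moves the "Total" row into the middle of the table.

```r
# Same values as in the question, but with "Total" no longer in the last row
Loc   <- c("Montreal", "Toronto", "Total", "Vancouver", "Quebec", "Ottawa", "Hamilton")
Count <- c("2344", "2322", "9421", "122", "45", "4544", "44")
Data  <- data.frame(Loc, Count, stringsAsFactors = FALSE)
Data3 <- list(Data, Data)

# Add a formatted percentage column, locating the total via the Loc label
add_perc <- function(d) {
  d$Count <- as.numeric(d$Count)
  total <- d$Count[d$Loc == "Total"]
  stopifnot(length(total) == 1)  # assumes exactly one "Total" row per table
  d$perc <- paste0(round(100 * d$Count / total, 2), "%")
  d
}

Data3 <- lapply(Data3, add_perc)
Data3[[1]]
```

Since `add_perc` takes one data frame and returns one, `purrr::map(Data3, add_perc)` should work as a drop-in replacement for the `lapply()` call if you prefer to stay within purrr.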
