I'm trying to count the occurrences of 'Moderna' in R.
The problem is that the original Excel file has the value Moderna mixed with other vaccines: as you can see, my R data has entries containing 'Moderna' mixed with 'Oxford/AstraZeneca'.
This is my attempt at summing the 'Moderna' counts. The code is below:
Number_Of_Countries_Using_Moderna <- Number_of_Vaccines_used %>%
  group_by(vaccines) %>%
  summarize(Moderna_Countries = sum(n))
My idea was to group_by vaccines to isolate Moderna, then sum the Moderna counts (making a new column in the process). The problem is that group_by(vaccines) isn't the right tool here.
Do you guys have any suggestions? Thank you for your time :)
Problem was solved with either of the two solutions below, thank you.
If I understood correctly, you are trying to get the sum of n whenever Moderna is mentioned in the vaccines column? If that's the case, here is a solution: you need to filter, not group_by:
Number_of_Vaccines_used %>%
  filter(grepl("Moderna", vaccines)) %>%
  summarize(Moderna_Countries = sum(n))
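If you prefer to avoid dplyr, a base-R equivalent of the same filter-then-sum can be sketched in one line (assuming the same column names):

```r
# sum n over the rows whose vaccines string contains "Moderna"
sum(Number_of_Vaccines_used$n[grepl("Moderna", Number_of_Vaccines_used$vaccines)])
```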
Not exactly what you asked for, but if you are looking for a complete list of vaccines and their counts, you could use:
library(dplyr)
library(tidyr)
Number_of_Vaccines_used %>%
  mutate(vaccines = strsplit(vaccines, ", ")) %>%
  unnest(vaccines) %>%
  group_by(vaccines) %>%
  summarise(n = sum(n))
This results in something like
# A tibble: 10 x 2
vaccines n
<chr> <int>
1 Covaxin 1
2 EpiVacCorona 1
3 Johnson&Johnson 2
4 Moderna 35
5 Oxford/AstraZeneca 105
6 Pfizer/BioNTech 82
7 Sinopharm/Beijing 24
8 Sinopharm/Wuhan 2
9 Sinovac 18
10 Sputnik V 20
Data
structure(list(vaccines = c("Covaxin, Oxford/AstraZeneca", "EpiVacCorona, Sputnik V", "Johnson&Johnson", "Johnson&Johnson, Moderna, Pfizer/BioNTech", "Moderna", "Moderna, Oxford/AstraZeneca"), n = c(1L, 1L, 1L, 1L, 1L, 1L)), row.names = c(NA, -6L), class = c("tbl_df", "tbl", "data.frame"))
Related
This is an example of data:
exp_data <- structure(list(Seq = c("AAAARVDS", "AAAARVDSSSAL",
"AAAARVDSRASDQ"), Change = structure(c(19L, 20L, 13L), .Label = c("",
"C[+58]", "C[+58], F[+1152]", "C[+58], F[+1152], L[+12], M[+12]",
"C[+58], L[+2909]", "L[+12]", "L[+370]", "L[+504]", "M[+12]",
"M[+1283]", "M[+1457]", "M[+1491]", "M[+16]", "M[+16], Y[+1013]",
"M[+16], Y[+1152]", "M[+16], Y[+762]", "M[+371]", "M[+386], Y[+12]",
"M[+486], W[+12]", "Y[+12]", "Y[+1240]", "Y[+1502]", "Y[+1988]",
"Y[+2918]"), class = "factor"), `Mass` = c(1869.943,
1048.459, 707.346), Size = structure(c(2L, 2L, 2L), .Label = c("Matt",
"Greg",
"Kieran"
), class = "factor"), `Number` = c(2L, 2L, 2L)), row.names = c(244L,
392L, 396L), class = "data.frame")
I would like to bring your attention to the column named Change, as this is the one I would like to use for filtering. We have three rows here, and I would like to keep only the first one, because it contains a change bigger than +100 for a specific letter. In general, I would like to keep all rows that contain a letter change greater than +100. There may be up to 4-5 letters in the Change column, but if at least one has a modification of at least +100, I would like to keep that row.
Do you have a simple solution for that?
Expected output:
Seq Change Mass Size Number
244 AAAARVDS M[+486], W[+12] 1869.943 Greg 2
Not entirely sure I understood your problem statement correctly, but perhaps something like this
library(dplyr)
library(stringr)
exp_data %>% filter(str_detect(Change, "\\d{3}"))
# Seq Change Mass Size Number
#1 AAAARVDS M[+486], W[+12] 1869.943 Greg 2
Or the same in base R
exp_data[grep("\\d{3}", exp_data$Change), ]
# Seq Change Mass Size Number
#1 AAAARVDS M[+486], W[+12] 1869.943 Greg 2
The idea is to use a regular expression to keep only those rows where Change contains at least one three-digit expression.
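Since every shift of +100 or more prints as at least three consecutive digits, `\d{3}` works on this data. If you want to be explicit that the digits belong to a modification, a slightly stricter pattern anchored to the `+` sign can be sketched (assuming the `X[+nnn]` format shown in the data):

```r
library(dplyr)
library(stringr)

# keep rows where some modification is +100 or greater,
# i.e. a "+" followed by three or more digits
exp_data %>% filter(str_detect(Change, "\\+\\d{3,}"))
```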
You can use str_extract_all from the stringr package
library(stringr)
data.table solution
library(data.table)
setDT(exp_data)
exp_data[, max := max(as.numeric(str_extract_all(Change, "[[:digit:]]+")[[1]])), by = Seq]
exp_data[max > 100, ]
Seq Change Mass Size Number max
1: AAAARVDS M[+486], W[+12] 1869.9 Greg 2 486
dplyr solution
library(dplyr)
exp_data %>%
group_by(Seq) %>%
filter(max(as.numeric(str_extract_all(Change, "[[:digit:]]+")[[1]])) > 100)
# A tibble: 1 x 5
# Groups: Seq [1]
Seq Change Mass Size Number
<chr> <fct> <dbl> <fct> <int>
1 AAAARVDS M[+486], W[+12] 1870. Greg 2
I am trying to calculate a ratio using this formula: log2(_5p/_3p).
I have a dataframe in R where entries share the same name except for the last part, which is either _3p or _5p. I want to compute log2(_5p/_3p) for each base name.
For instance, for the first two rows the result will be:
LQNS02277998.1_30988 log2(40/148) = -1.887525
Ideally I want to create a new data frame with the results where only the common part of the name is kept.
LQNS02277998.1_30988 -1.887525
How can I do this in R?
> head(dup_res_LC1_b_2)
# A tibble: 6 x 2
microRNAs n
<chr> <int>
1 LQNS02277998.1_30988_3p 148
2 LQNS02277998.1_30988_5p 40
3 Dpu-Mir-279-o6_LQNS02278070.1_31942_3p 4
4 Dpu-Mir-279-o6_LQNS02278070.1_31942_5p 4
5 LQNS02000138.1_777_3p 73
6 LQNS02000138.1_777_5p 12
structure(list(microRNAs = c("LQNS02277998.1_30988_3p",
"LQNS02277998.1_30988_5p", "Dpu-Mir-279-o6_LQNS02278070.1_31942_3p",
"Dpu-Mir-279-o6_LQNS02278070.1_31942_5p", "LQNS02000138.1_777_3p",
"LQNS02000138.1_777_5p"), n = c(148L, 40L, 4L, 4L, 73L, 12L)), row.names = c(NA,
-6L), class = c("tbl_df", "tbl", "data.frame"))
We can do a grouped operation by removing the trailing substring (_3p or _5p) with str_remove, then take the log2 of the ratio of the paired 'n' values:
library(dplyr)
library(stringr)
df1 %>%
  group_by(grp = str_remove(microRNAs, "_[^_]+$")) %>%
  mutate(new = log2(last(n)/first(n)))
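If you would rather end up with one row per base name, as in the expected output, a reshaping sketch with tidyr's pivot_wider also works (assuming every base name has exactly one _3p and one _5p row):

```r
library(dplyr)
library(tidyr)
library(stringr)

df1 %>%
  mutate(grp = str_remove(microRNAs, "_[^_]+$"),   # common part of the name
         arm = str_extract(microRNAs, "[^_]+$")) %>%  # "3p" or "5p"
  select(grp, arm, n) %>%
  pivot_wider(names_from = arm, values_from = n) %>%
  mutate(log2_5p_3p = log2(`5p` / `3p`))
```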
I would like to sum a single column of data that was output from an sqldf function in R.
I have a csv. file that contains groupings of sites with a uniqueID and their associated areas. For example:
occurrenceID sarea
{0255531B-904F-4E2D-B81D-797A21165A2F} 0.30626786
{0255531B-904F-4E2D-B81D-797A21165A2F} 0.49235953
{0255531B-904F-4E2D-B81D-797A21165A2F} 0.03490536
{0255531B-904F-4E2D-B81D-797A21165A2F} 0.00001389
{175A4B1C-CA8C-49F6-9CD6-CED9187579DC} 0.0302389
{175A4B1C-CA8C-49F6-9CD6-CED9187579DC} 0.01360811
{1EC60400-0AD0-4DB5-B815-221C4123AE7F} 0.08412911
{1EC60400-0AD0-4DB5-B815-221C4123AE7F} 0.01852466
I used the code below in R to pull out the largest area from each grouping of unique ID's.
> MyData <- read.csv(file="sacandaga2.csv", header=TRUE, sep=",")
> sqldf("select max(sarea),occurrenceID from MyData group by occurrenceID")
This produced the following output:
max(sarea) occurrenceID
1 0.49235953 {0255531B-904F-4E2D-B81D-797A21165A2F}
2 0.03023890 {175A4B1C-CA8C-49F6-9CD6-CED9187579DC}
3 0.08412911 {1EC60400-0AD0-4DB5-B815-221C4123AE7F}
4 0.00548259 {2412E244-2E9A-4477-ACC6-1EB02503BE75}
5 0.00295924 {40450574-ABEB-48E3-9BE5-09B5AB65B465}
6 0.01403846 {473FB631-D398-46B7-8E85-E63540BDFF92}
7 0.00257519 {4BABDE22-E8E0-435E-B60D-0BB9A84E1489}
8 0.02158115 {5F616A33-B028-46B1-AD92-89EAC1660C41}
9 0.00191211 {70067496-25B6-4337-8C70-782143909EF9}
10 0.03049355 {7F858EBB-132E-483F-BA36-80CE889373F5}
11 0.03947298 {9A579565-57EC-4E46-95ED-79724FA6F2AB}
12 0.02464722 {A9010BA3-0FE1-40B1-96A7-21122261A003}
13 0.00136672 {AAD710BF-1539-4235-87F1-34B66CF90781}
14 0.01139146 {AB1286C3-DBE3-467B-99E1-AEEF88A1B5B2}
15 0.07954269 {BED0433A-7167-4184-A25F-B9DBD358AFFB}
16 0.08401067 {C4EF0F45-5BF7-4F7C-BED8-D6B2DB718CB2}
17 0.04289261 {C58AC2C6-BDBE-4FE5-BD51-D70BBDFB4DB5}
18 0.03151558 {D4230F9C-80E4-454A-9D5D-0E373C6DCD9A}
19 0.00403585 {DD76A03A-CFBF-41E9-A571-03DA707BEBDA}
20 0.00007336 {E20DE254-8A0F-40BE-90D2-D6B71880E2A8}
21 9.81847859 {F382D5A6-F385-426B-A543-F5DE13F94564}
22 0.00815881 {F9032905-074A-468F-B60E-26371CF480BB}
23 0.24717113 {F9E5DC3C-4602-4C80-B00B-2AF1D605A265}
Now I would like to sum all the values in the max(sarea) column. What is the best way to accomplish this?
You can do this either in sqldf or in R. For example, assign your existing result and sum it in R:
# assign your original
grouped_sum = sqldf("select max(sarea),occurrenceID from MyData group by occurrenceID")
# and sum in R
sum(grouped_sum$`max(sarea)`)
# you might prefer to use a standard column name so you don't need backticks
grouped_sum = sqldf(
"select max(sarea) as max_sarea, occurrenceID
from MyData
group by occurrenceID"
)
sum(grouped_sum$max_sarea)
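For comparison, the whole max-then-sum pipeline can also be sketched entirely in dplyr, without sqldf:

```r
library(dplyr)

MyData %>%
  group_by(occurrenceID) %>%
  summarise(max_sarea = max(sarea)) %>%   # largest area per unique ID
  summarise(sum_of_max_sarea = sum(max_sarea))  # total over all IDs
```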
If the intention is to do this in a single sqldf call, use a CTE (the with clause):
library(sqldf)
sqldf("with tmpdat AS (
select max(sarea) as mxarea, occurrenceID
from MyData group by occurrenceID
) select sum(mxarea)
as smxarea from tmpdat")
# smxarea
#1 0.6067275
data
MyData <-
structure(list(occurrenceID = c("{0255531B-904F-4E2D-B81D-797A21165A2F}",
"{0255531B-904F-4E2D-B81D-797A21165A2F}", "{0255531B-904F-4E2D-B81D-797A21165A2F}",
"{0255531B-904F-4E2D-B81D-797A21165A2F}", "{175A4B1C-CA8C-49F6-9CD6-CED9187579DC}",
"{175A4B1C-CA8C-49F6-9CD6-CED9187579DC}", "{1EC60400-0AD0-4DB5-B815-221C4123AE7F}",
"{1EC60400-0AD0-4DB5-B815-221C4123AE7F}"), sarea = c(0.30626786,
0.49235953, 0.03490536, 1.389e-05, 0.0302389, 0.01360811, 0.08412911,
0.01852466)), class = "data.frame", row.names = c(NA, -8L))
You can do it by getting the sum of maximum values:
sqldf("select sum(max_sarea) as sum_of_max_sarea
from (select max(sarea) as max_sarea,
occurrenceID from Mydata group by occurrenceID)")
# sum_of_max_sarea
# 1 0.6067275
Data:
Mydata <- structure(list(occurrenceID = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 3L, 3L),
.Label = c("0255531B-904F-4E2D-B81D-797A21165A2F", "175A4B1C-CA8C-49F6-9CD6-CED9187579DC",
"1EC60400-0AD0-4DB5-B815-221C4123AE7F"), class = "factor"),
sarea = c(0.30626786, 0.49235953, 0.03490536, 1.389e-05, 0.0302389,
0.01360811, 0.08412911, 0.01852466)), class = "data.frame",
row.names = c(NA, -8L))
If DF is the last data frame shown in the question this sums the numeric column:
sqldf("select sum([max(sarea)]) as sum from DF")
## sum
## 1 11.07853
Note
We assume this data frame shown in reproducible form:
Lines <- "max(sarea) occurrenceID
1 0.49235953 {0255531B-904F-4E2D-B81D-797A21165A2F}
2 0.03023890 {175A4B1C-CA8C-49F6-9CD6-CED9187579DC}
3 0.08412911 {1EC60400-0AD0-4DB5-B815-221C4123AE7F}
4 0.00548259 {2412E244-2E9A-4477-ACC6-1EB02503BE75}
5 0.00295924 {40450574-ABEB-48E3-9BE5-09B5AB65B465}
6 0.01403846 {473FB631-D398-46B7-8E85-E63540BDFF92}
7 0.00257519 {4BABDE22-E8E0-435E-B60D-0BB9A84E1489}
8 0.02158115 {5F616A33-B028-46B1-AD92-89EAC1660C41}
9 0.00191211 {70067496-25B6-4337-8C70-782143909EF9}
10 0.03049355 {7F858EBB-132E-483F-BA36-80CE889373F5}
11 0.03947298 {9A579565-57EC-4E46-95ED-79724FA6F2AB}
12 0.02464722 {A9010BA3-0FE1-40B1-96A7-21122261A003}
13 0.00136672 {AAD710BF-1539-4235-87F1-34B66CF90781}
14 0.01139146 {AB1286C3-DBE3-467B-99E1-AEEF88A1B5B2}
15 0.07954269 {BED0433A-7167-4184-A25F-B9DBD358AFFB}
16 0.08401067 {C4EF0F45-5BF7-4F7C-BED8-D6B2DB718CB2}
17 0.04289261 {C58AC2C6-BDBE-4FE5-BD51-D70BBDFB4DB5}
18 0.03151558 {D4230F9C-80E4-454A-9D5D-0E373C6DCD9A}
19 0.00403585 {DD76A03A-CFBF-41E9-A571-03DA707BEBDA}
20 0.00007336 {E20DE254-8A0F-40BE-90D2-D6B71880E2A8}
21 9.81847859 {F382D5A6-F385-426B-A543-F5DE13F94564}
22 0.00815881 {F9032905-074A-468F-B60E-26371CF480BB}
23 0.24717113 {F9E5DC3C-4602-4C80-B00B-2AF1D605A265}"
DF <- read.table(text = Lines, check.names = FALSE)
I have collected a dataframe that models the duration of events in a group problem-solving session in which the members communicate (Discourse Code) and construct models (Modeling Code). Each minute at which an event occurs is captured in the Time_Processed column. Technically these events occur simultaneously. I would like to know how long the students spend constructing each type of model, i.e. the total duration of that model, or the time elapsed before the model changes.
I have the following dataset:
`Modeling Code` `Discourse Code` Time_Processed
<fct> <fct> <dbl>
1 OFF OFF 10.0
2 MA Q 11.0
3 MA AG 16.0
4 V S 18.0
5 V Q 20.0
6 MA C 21.0
7 MA C 23.0
8 MA C 25.0
9 V J 26.0
10 P S 28.0
# My explicit dataframe.
df <- structure(list(`Modeling Code` = structure(c(3L, 2L, 2L, 6L,
6L, 2L, 2L, 2L, 6L, 4L), .Label = c("A", "MA", "OFF", "P", "SM",
"V"), class = "factor"), `Discourse Code` = structure(c(7L, 8L,
1L, 9L, 8L, 2L, 2L, 2L, 6L, 9L), .Label = c("AG", "C", "D", "DA",
"G", "J", "OFF", "Q", "S"), class = "factor"), Time_Processed = c(10,
11, 16, 18, 20, 21, 23, 25, 26, 28)), row.names = c(NA, -10L), .Names = c("Modeling Code",
"Discourse Code", "Time_Processed"), class = c("tbl_df", "tbl",
"data.frame"))
For this dataframe I can find how often the students were constructing each type of model logically like this.
With Respect to the Modeling Code and Time_Processed columns,
At 10 minutes they are using the OFF model method, then at 11 minutes, they change the model so the duration of the OFF model is (11 - 10) minutes = 1 minute. There are no other occurrences of the "OFF" method so the duration of OFF = 1 min.
Likewise, for Modeling Code method "MA", the model is used from 11 to 16 minutes (duration = 5 minutes) and then from 16 to 18 minutes before the model changes to V (duration = 2 minutes); the model is used again at 21 minutes and ends at 26 minutes (duration = 5 minutes). So the total duration of "MA" is (5 + 2 + 5) minutes = 12 minutes.
Likewise the duration of Modeling Code method "V" starts at 18 minutes, ends at 21 minutes (duration = 3 minutes), resumes at 26 minutes, ends at 28 minutes (duration = 2) minutes. So total duration of "V" is 3 + 2 = 5 minutes.
Then the duration of Modeling Code P, starts at 28 minutes and there is no continuity so total duration of P is 0 minutes.
So the total duration (minutes) table of the Modeling Codes is this:
Modeling Code Total_Duration
OFF 1
MA 12
V 5
P 0
This could be shown as a barchart of total duration per Modeling Code.
How can the total duration of these modeling methods be constructed?
It would also be nice to know the duration of the combinations: the only visible combination in this small subset is Modeling Code "MA" paired with Discourse Code "C", which occurs for 26 - 21 = 5 minutes.
Thank you.
UPDATED SOLUTION
library(dplyr)
library(tidyr)  # for replace_na

df %>%
  mutate(dur = lead(Time_Processed) - Time_Processed) %>%
  replace_na(list(dur = 0)) %>%
  group_by(`Modeling Code`) %>%
  summarise(tot_time = sum(dur))
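The same lead-based approach answers the follow-up about combinations: grouping by both codes gives the duration of each Modeling/Discourse pair (a sketch reusing the same dur column):

```r
library(dplyr)
library(tidyr)

# duration until the next event, attributed to the current pair of codes
df %>%
  mutate(dur = lead(Time_Processed) - Time_Processed) %>%
  replace_na(list(dur = 0)) %>%
  group_by(`Modeling Code`, `Discourse Code`) %>%
  summarise(tot_time = sum(dur), .groups = "drop")
```

With the sample data, the MA/C pair sums to 5 minutes, matching the expectation in the question.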
(^ Thanks to Nick DiQuattro)
PREVIOUS SOLUTION
Here's one solution that creates a new variable, mcode_grp, which keeps track of discrete groupings of the same Modeling Code. It's not particularly pretty - it requires looping over each row in df - but it works.
First, rename columns for ease of reference:
df <- df %>%
  rename(m_code = `Modeling Code`,
         d_code = `Discourse Code`)
We'll update df with a few extra variables.
- lead_time_proc gives us the Time_Processed value for the next row in df, which we'll need when computing the total amount of time for each m_code batch
- row_n for keeping track of row number in our iteration
- mcode_grp is the unique label for each m_code batch
df <- df %>%
  mutate(lead_time_proc = lead(Time_Processed),
         row_n = row_number(),
         mcode_grp = "")
Next, we need a way to keep track of when we've hit a new batch of a given m_code value. One way is to keep a counter for each m_code, and increment it whenever a new batch is reached. Then we can label all the rows for that m_code batch as belonging to the same time window.
mcode_ct <- df %>%
  group_by(m_code) %>%
  summarise(ct = 0) %>%
  mutate(m_code = as.character(m_code))
This is the ugliest part. We loop over every row in df, and check to see if we've reached a new m_code. If so, we update accordingly, and register a value for mcode_grp for each row.
mc <- ""
for (i in 1:nrow(df)) {
  current_mc <- df$m_code[i]
  if (current_mc != mc) {
    mc <- current_mc
    mcode_ct <- mcode_ct %>% mutate(ct = ifelse(m_code == mc, ct + 1, ct))
    current_grp <- mcode_ct %>% filter(m_code == mc) %>% select(ct) %>% pull()
  }
  df <- df %>% mutate(mcode_grp = ifelse(row_n == i, current_grp, mcode_grp))
}
Finally, group_by m_code and mcode_grp, compute the duration for each batch, and then sum over m_code values.
df %>%
  group_by(m_code, mcode_grp) %>%
  summarise(start_time = min(Time_Processed),
            end_time = max(lead_time_proc)) %>%
  mutate(total_time = end_time - start_time) %>%
  group_by(m_code) %>%
  summarise(total_time = sum(total_time)) %>%
  replace_na(list(total_time = 0))
Output:
# A tibble: 4 x 2
m_code total_time
<fct> <dbl>
1 MA 12.
2 OFF 1.
3 P 0.
4 V 5.
For any dplyr/tidyverse experts out there, I'd love tips on how to accomplish more of this without resorting to loops and counters!
I am trying to figure out how to get the time between consecutive events when events are stored as a column of dates in a dataframe.
sampledf=structure(list(cust = c(1L, 1L, 1L, 1L), date = structure(c(9862,
9879, 10075, 10207), class = "Date")), .Names = c("cust", "date"
), row.names = c(NA, -4L), class = "data.frame")
I can get an answer with
as.numeric(rev(rev(difftime(c(sampledf$date[-1],0),sampledf$date))[-1]))
# [1] 17 196 132
but it is really ugly. Among other things, I only know how to exclude the first item in a vector, not the last, so I have to rev() twice to drop the last value.
Is there a better way?
By the way, I will use ddply to do this to a larger set of data for each cust id, so the solution would need to work with ddply.
library(plyr)
ddply(sampledf,
      c("cust"),
      summarize,
      daysBetween = as.numeric(rev(rev(difftime(c(date[-1], 0), date))[-1])))
Thank you!
Are you looking for this?
as.numeric(diff(sampledf$date))
# [1] 17 196 132
To remove the last element, use head:
head(as.numeric(diff(sampledf$date)), -1)
# [1] 17 196
require(plyr)
ddply(sampledf, .(cust), summarise, daysBetween = as.numeric(diff(date)))
# cust daysBetween
# 1 1 17
# 2 1 196
# 3 1 132
You can just use diff.
as.numeric(diff(sampledf$date))
To leave off the last element, you can index with the negative of the vector's length:
vec[-length(vec)]  # where `vec` is your vector
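For example:

```r
vec <- c(17, 196, 132)
vec[-length(vec)]  # drops the final element
# [1] 17 196
```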
In this case I don't think you need to leave anything off though, because diff is already one element shorter:
test <- ddply(sampledf,
              c("cust"),
              summarize,
              daysBetween = as.numeric(diff(date)))  # use the bare column name so each group is diffed separately
test
test
# cust daysBetween
#1 1 17
#2 1 196
#3 1 132