Calculate pairwise difference within groups and still keep the grouping factor - r

I want to pick, within each group, the 3 values that are closest to one another. How do I fix my code?
This data has 11 rows (the full data is much larger, but I just want a reproducible example). The number of values in the Total column that share the same JobID ranges from 2 to 9. I group the values by their JobID and then calculate the within-group pairwise differences, called pdiff. Once I have all the pairwise differences, I will sort the pdiff column in ascending order and then use head() to pick my top 3 values.
The dat1 lines gave me only the list of pairwise differences. The dat2 lines gave me an error:
"Error in `[[<-.data.frame`(`*tmp*`, col, value = c(46, 99891, 99577, 99746, :
replacement has 55 rows, data has 11"
dput(dat)
structure(list(JobID = c("11 W 29", "11 W 29", "11 W 50", "11 W 50",
"11 W 50", "23 E 27", "23 E 27", "23 E 27", "11 W 50", "11 W 50",
"11 W 50"), Total = c(145501, 145547, 45610, 45924, 45755, 56425,
56383, 56262, 756185, 750110, 751467)), row.names = c(NA, -11L
), class = c("tbl_df", "tbl", "data.frame"))
dat%>%
group_by(JobID)%>%
summarize(dist(Total))
dat1 <- dat%>%
group_by(JobID)%>%
summarize(dist(Total))
dat2<-dat%>%
group_by(JobID)%>%
mutate(pdiff=dist(Total))
The expected result should be a data frame of 55 rows and 3 columns (JobID, Total and a new column (called "pdiff" for example)).
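One possible way to get every within-group pair while keeping JobID is sketched below in base R. It is only a sketch: it assumes every JobID has at least two rows (as described above), and the helper name pairwise_by_group and the columns Total1/Total2 are made up for illustration. Because pairs are formed only inside a JobID, the sample data yields 1 + 15 + 3 = 19 rows.

```r
dat <- data.frame(
  JobID = c("11 W 29", "11 W 29", "11 W 50", "11 W 50", "11 W 50",
            "23 E 27", "23 E 27", "23 E 27", "11 W 50", "11 W 50", "11 W 50"),
  Total = c(145501, 145547, 45610, 45924, 45755, 56425,
            56383, 56262, 756185, 750110, 751467)
)

# Hypothetical helper: one row per within-group pair of Total values
pairwise_by_group <- function(d) {
  pieces <- lapply(split(d$Total, d$JobID), function(v) {
    idx <- combn(length(v), 2)  # each column is one pair of positions
    data.frame(Total1 = v[idx[1, ]],
               Total2 = v[idx[2, ]],
               pdiff  = abs(v[idx[1, ]] - v[idx[2, ]]))
  })
  out <- do.call(rbind, pieces)
  # Repeat each group name once per pair so the grouping factor is kept
  out$JobID <- rep(names(pieces), vapply(pieces, nrow, integer(1)))
  rownames(out) <- NULL
  out[, c("JobID", "Total1", "Total2", "pdiff")]
}

res <- pairwise_by_group(dat)
```

From there, ordering res by JobID and pdiff and taking head() within each group would give the closest pairs.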

Related

how to split a long string into multiple variables

I have a df that contains long strings. If I want to separate each string into different variables, how should I do that?
sample data is here:
df <- structure(list(tx = c(" [1] Timepoint EGTMPT Categorical select one (nominal) 51 Screening",
" [2] N/A : O ff-Study EGTNA Categorical yes/no (dichotomous) 3",
" [3] Check if Not Done EGTMPTND Categorical yes/no (dichotomous) 3",
" [4] Date Performed ECGDT Date 11",
" [5] Time (24-hour format) ECGTM Time 5",
" [6] O verall ECG Interpretation ECGRES Categorical select one (nominal) 37 Normal"
)), row.names = c(NA, 6L), class = "data.frame")
It seems that the variables occupy a fixed space, so to find those spaces we do the following:
Manually separate one line:
vars = c(" [1] ", "Timepoint ", "EGTMPT ",
"Categorical select one (nominal) ", "51 ", "Screening")
Count the number of characters in each variable:
sizes = nchar(vars)  # nchar is vectorized, so no loop is needed
Cumulatively sum those values and add a 1 (starting point) at the beginning:
sizes = c(1, cumsum(sizes))
The result is:
> sizes
[1] 1 14 62 74 107 118 127
So the first variable goes from the 1st to the 14th position, etc. Now we just need to cut each line in those places:
df2 = character()
for(i in 2:length(sizes)){
df2 = cbind(df2, apply(df, 1, function(x){substr(x, sizes[i-1], sizes[i])}))}
And lastly remove the extra spaces:
df2 = gsub(" ", "", df2)
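The same idea can be sketched with substring() and trimws(), under the assumption that the field widths are already known. The two lines and the widths below are hypothetical (built with sprintf so the padding is explicit); they are not the original data.

```r
# Hypothetical fixed-width lines: every field is padded to a known width
lines <- c(
  sprintf("%-5s%-12s%-10s%-6s", "[1]", "Timepoint", "EGTMPT", "51"),
  sprintf("%-5s%-12s%-10s%-6s", "[2]", "Date Done", "ECGDT", "11")
)

widths <- c(5, 12, 10, 6)      # assumed field widths
ends   <- cumsum(widths)       # last character of each field
starts <- ends - widths + 1    # first character of each field

# Cut every line at the same positions, then strip only the outer padding
fields <- t(sapply(lines,
                   function(x) trimws(substring(x, starts, ends)),
                   USE.NAMES = FALSE))
```

Using trimws() removes only leading and trailing whitespace, so a space inside a field (such as "Date Done") is kept.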

How to get the MIN and MAX from a dataframe SUBSET which consists of factors

I scraped a table and need to retrieve the Minimum and the Maximum from a subset column in the dataframe.
The table looks like this:
    Date Year     Title       Budget          Gross
2 Jun 22 2018 Project 5 $170,000,000 $1,308,334,005
3 Jun 12 2015 Project 4 $215,000,000 $1,669,943,967
4 Jul 18 2001 Project 3  $93,000,000   $365,900,000
5 May 22 1997 Project 2  $75,000,000   $618,638,999
6 Jun 11 1993 Project 1  $63,000,000 $1,045,573,035
I need to find the minimum and maximum in column Gross.
This doesn't currently work because the column is stored as factors, not numbers. But when I use gsub to replace the commas, the values get messed up.
I don't understand
(1) how to change the vectors to real figures
(2) find the MIN and MAX in this subset
(if this way of thought is correct?)
Thanks for any leads
A tidyverse solution:
library(tidyverse)
df %>%
mutate(across(Gross, parse_number)) %>%
summarise(across(Gross, list(min = min, max = max)))
Gross_min Gross_max
1 365900000 1669943967
We need to remove both the , and the $. Because $ is a regex metacharacter that denotes the end of the string, we can either escape it or place both characters in square brackets ([$,]+ matches one or more characters that are either a $ or a ,) and replace them with an empty string (""). Then we convert the column to numeric (as.numeric).
df1$Gross <- as.numeric(gsub("[$,]+", "", df1$Gross))
Now, we can get the min and max
min(df1$Gross, na.rm = TRUE)
#[1] 365900000
max(df1$Gross, na.rm = TRUE)
#[1] 1669943967
Or use the range function
range(df1$Gross, na.rm = TRUE)
#[1] 365900000 1669943967
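As a small illustration of why the $ needs special handling, here is a sketch on a made-up vector x (not the scraped column). An unescaped "$" is treated as the end-of-string anchor, so nothing visible is removed.

```r
x <- c("$1,308,334,005", "$365,900,000")

# Unescaped "$" only matches the (empty) end of the string: x comes back as is
unchanged <- gsub("$", "", x)

# Three ways to match a literal dollar sign
escaped <- gsub("\\$", "", x)              # backslash escape
literal <- gsub("$", "", x, fixed = TRUE)  # no regex interpretation at all
cleaned <- as.numeric(gsub("[$,]", "", x)) # character class, also drops commas
```

If the column is a factor, wrapping it in as.character() before gsub() makes the conversion explicit.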
data
df1 <- structure(list(Date = c("Jun 22", "Jun 12", "Jul 18", "May 22",
"Jun 11"), Year = c(2018L, 2015L, 2001L, 1997L, 1993L), Title = c("Project 5",
"Project 4", "Project 3", "Project 2", "Project 1"), Budget = c("$170,000,000",
"$215,000,000", "$93,000,000", "$75,000,000", "$63,000,000"),
Gross = c("$1,308,334,005", "$1,669,943,967", "$365,900,000",
"$618,638,999", "$1,045,573,035")), class = "data.frame",
row.names = c("2",
"3", "4", "5", "6"))

How to replace specific values in a dataset with randomized numbers?

I have a data column that contains a bunch of ranges as strings (e.g. "2 to 4", "5 to 6", "7 to 8" etc.). I'm trying to create a new column that converts each of these values to a random number within the given range. How can I leverage conditional logic within my function to solve this problem?
I think the function should be something along the lines of:
df<-mutate(df, c2=ifelse(df$c=="2 to 4", sample(2:4, 1, replace=TRUE), "NA"))
This should produce a new column in my dataset that replaces all the values of "2 to 4" with a random integer between 2 and 4. However, it is not working: it replaces every value with "NA".
Ideally, I am trying to do something where the dataset:
df<-c("2 to 4","2 to 4","5 to 6")
Would add a new column:
df<-c2("3","2","5")
Does anyone have any idea how to do this?
We can split the string on "to", convert the two numbers to integers, create the range between them, and then use sample to select one number from that range.
df$c2 <- sapply(strsplit(df$c1, "\\s+to\\s+"), function(x) {
vals <- as.integer(x)
sample(vals[1]:vals[2], 1)
})
df
# c1 c2
#1 2 to 4 2
#2 2 to 4 3
#3 5 to 6 5
data
df<- data.frame(c1 = c("2 to 4","2 to 4","5 to 6"), stringsAsFactors = FALSE)
We can do this easily with sub. Replace " to " with ":", evaluate the result to get the sequence, then sample 1 value from it.
df$c2 <- sapply(sub(" to ", ":", df$c1), function(x)
sample(eval(parse(text = x)), 1))
df
# c1 c2
#1 2 to 4 4
#2 2 to 4 3
#3 5 to 6 5
Or with gsubfn
library(gsubfn)
as.numeric(gsubfn("(\\d+) to (\\d+)", ~ sample(seq(as.numeric(x),
as.numeric(y), by = 1), 1), df$c1))
Or with read.csv/Map from base R
sapply(do.call(Map, c(f = `:`, read.csv(text = sub(" to ", ",", df$c1),
header = FALSE))), sample, 1)
data
df <- structure(list(c1 = c("2 to 4", "2 to 4", "5 to 6")),
class = "data.frame", row.names = c(NA, -3L))
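A side note on the ifelse() attempt in the question, as a hedged sketch: sample(2:4, 1) is evaluated once, so ifelse() recycles that single draw to every matching row instead of drawing per row. The vector c1 and the seed below are made up for illustration.

```r
set.seed(1)  # arbitrary seed, only so the run is repeatable
c1 <- c("2 to 4", "2 to 4", "2 to 4", "5 to 6")

# One draw, recycled: every "2 to 4" row receives the same number
once <- ifelse(c1 == "2 to 4", sample(2:4, 1), NA)

# One draw per row, without eval(parse())
per_row <- vapply(c1, function(s) {
  b <- as.integer(strsplit(s, " to ", fixed = TRUE)[[1]])
  # sample(n, 1) on a single number samples from 1:n, so guard equal bounds
  if (b[1] == b[2]) b[1] else sample(b[1]:b[2], 1L)
}, integer(1), USE.NAMES = FALSE)
```

The guard for equal bounds matters only if a range such as "5 to 5" can occur in the data.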

R: how to find value in first column and sum value of the third column

I have a file like this:
  Age.Range Average Probability
1   0 to 04     400     0.00400
2  05 to 09     221     0.00221
3  10 to 14     216     0.00216
4  15 to 19     409     0.00409
X [age of an individual; integer between 0 and 80 years]
Y [the duration of monitoring of an individual; integer between 1 and
50 years or “for life”]
I need to calculate the probability that a person of age X (e.g. 3) will develop cancer during the interval starting today and lasting Y years (e.g. 7). In R, I need to find the rows in the first column that contain X and X+Y, and sum all the values in the third column between those two rows:
X= 3
x+y=10
probability= 0.004 + 0.00221 + 0.00216
The following function does what you want. It gets the starts of the age ranges and then uses findInterval to find the indices into the probabilities column. Then it is a matter of adding those probabilities.
sumProbs <- function(DF, X, Y){
DF[["Age.Range"]] <- as.character(DF[["Age.Range"]])
Age.Start <- strsplit(DF[["Age.Range"]], " to ")
Age.Start <- as.integer(sapply(Age.Start, '[[', 1))
i <- findInterval(c(X, X + Y), Age.Start)
p <- DF[["Probability"]][i[1]:i[2]]
sum(p)
}
sumProbs(df1, 3, 7)
#[1] 0.00837
Data in dput format.
df1 <-
structure(list(Age.Range = c("0 to 04", "05 to 09",
"10 to 14", "15 to 19"), Average = c(400L, 221L,
216L, 409L), Probability = c(0.004, 0.00221, 0.00216,
0.00409)), row.names = c("1", "2", "3", "4"),
class = "data.frame")
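To make the findInterval step easier to follow, here is the same calculation unrolled for X = 3 and Y = 7. The vectors age_start and prob are just the range starts and probabilities from the table, typed in by hand.

```r
age_start <- c(0, 5, 10, 15)                     # lower bound of each age range
prob      <- c(0.004, 0.00221, 0.00216, 0.00409)

# Which range holds age 3, and which holds age 3 + 7 = 10?
i <- findInterval(c(3, 3 + 7), age_start)        # rows 1 and 3

# Sum the probabilities of rows 1 through 3
total <- sum(prob[i[1]:i[2]])
```

Because findInterval treats each range as closed on the left, age 10 falls in the "10 to 14" row, which is why three rows are summed.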

R how to extract part of text based on presence of specific word(s)

The 'size' column of my data set contains text like:
row_1 = "Small size From 3 mm long when unfed to 9 mm when fully engorged"
row_2 = "Tiny some microscopic Red mite only 0 4 mm diameter Worldwide many different"
row_3 = "Small spiders body length about 10 mm"
size = c(row_1, row_2, row_3)
How can I extract the data into a new column, say 'new_size', as follows?
size_1 = '3mm, 9mm'
size_2 = '4mm'
size_3 = '10mm'
new_size = c(size_1, size_2, size_3)
I have seen the substring methods but am unable to figure out the way to pick up the size from varying text in each row.
Try this:
Numb_Extract <- function(string){
unlist(regmatches(string,gregexpr("[[:digit:]]+\\.*[[:digit:]]*",string)))
}
row_1 = "Small size From 3 mm long when unfed to 9 mm when fully engorged"
p<-as.numeric(Numb_Extract (row_1))
print(p)
Use regmatches/gregexpr.
regmatches(size, gregexpr("[[:digit:]]+[[:space:]]mm", size))
#[[1]]
#[1] "3 mm" "9 mm"
#
#[[2]]
#[1] "4 mm"
#
#[[3]]
#[1] "10 mm"
If you want a vector, unlist the result.
size_n <- regmatches(size, gregexpr("[[:digit:]]+[[:space:]]mm", size))
unlist(size_n)
#[1] "3 mm" "9 mm" "4 mm" "10 mm"
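To get one string per row in the format asked for in the question ('3mm, 9mm'), the matches can be collapsed row by row. This is a sketch that builds on the regmatches/gregexpr call above; it assumes every size is written as digits, one whitespace character, then "mm".

```r
size <- c(
  "Small size From 3 mm long when unfed to 9 mm when fully engorged",
  "Tiny some microscopic Red mite only 0 4 mm diameter Worldwide many different",
  "Small spiders body length about 10 mm"
)

# List with one character vector of matches per row
m <- regmatches(size, gregexpr("[[:digit:]]+[[:space:]]mm", size))

# Drop the inner space and join the matches of each row with ", "
new_size <- vapply(
  m,
  function(x) paste(gsub("[[:space:]]", "", x), collapse = ", "),
  character(1)
)
new_size
#[1] "3mm, 9mm" "4mm"      "10mm"
```

A row with no match would give an empty string here, so those rows may need separate handling.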
