Add numbers spaced apart in a single cell in R

I would like to add together the numbers in col3 in the following code.
I have tried using gsub to insert a + between the numbers and evaluate the result in R.
I have also tried using separate to split the values and sum across the resulting columns.
train <- data.table(col1=c(rep('a0001',4),rep('b0002',4)), col2=c(seq(1,4,1),seq(1,4,1)), col3=c("12 43 543 1232 43 543", "","","","15 24 85 64 85 25 46","","658 1568 12 584 15684",""))
I would like the result to be the sum of the numbers in col3 by row, as in col4:
result<-data.frame(col1=c("a0001","b0002"), col3=c("12 43 543 1232 43 543", "","","","15 24 85 64 85 25 46","","658 1568 12 584 15684",""),col4=c("2416",'18850'))

Grouped by 'col1', we can split by the space, unlist, convert to numeric, get the sum, and assign (:=) to create a new column
train[, col4 := sum(as.numeric(unlist(strsplit(col3, ' '))), na.rm = TRUE), col1]
Or another option is scan
train[, col4 := sum(scan(text = col3, what = numeric(), quiet = TRUE)), col1]
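If the per-row totals are wanted as well (the question mentions sums "by row"), the same split-and-sum idea can be sketched in base R. This is only an illustration: it assumes a plain data.frame instead of a data.table, and the column names row_sum and group_sum are made up for the example.

```r
# Same data as in the question, as a plain data.frame (assumption: data.table not needed here)
train <- data.frame(
  col1 = c(rep("a0001", 4), rep("b0002", 4)),
  col2 = c(1:4, 1:4),
  col3 = c("12 43 543 1232 43 543", "", "", "",
           "15 24 85 64 85 25 46", "", "658 1568 12 584 15684", ""),
  stringsAsFactors = FALSE
)

# Split each cell on spaces and sum its numbers; an empty cell sums to 0
train$row_sum <- vapply(
  strsplit(train$col3, " ", fixed = TRUE),
  function(x) sum(as.numeric(x), na.rm = TRUE),
  numeric(1)
)

# Total per col1 group, repeated on every row of that group
train$group_sum <- ave(train$row_sum, train$col1, FUN = sum)
```

The group totals here should agree with the col4 values produced by the data.table versions above.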


How to use extract on multiple columns and name output columns based on input column names

I have a data frame of blood pressure data of the following form:
bpdata <- data.frame(bp1 = c("120/89", "110/70", "121/78"), bp2 = c("130/69", "120/90", "125/72"), bp3 = c("115/90", "112/71", "135/80"))
I would like to use the following extract command, but globally, i.e. on all bp\d columns
extract(bp1, c("systolic_1","diastolic_1"),"(\\d+)/(\\d+)")
How can I capture the digit in the column selection and use it in the column output names? I can hack around this by creating a list of column names and then using one of the apply family, but it seems to me there ought to be a more elegant way to do this.
Any suggestions?
We could use read.csv on multiple columns in a loop (Map) with sep = "/" and cbind the list elements at the end with do.call
do.call(cbind, Map(function(x, y) read.csv(text= x, sep="/", header = FALSE,
col.names = paste0(c('systolic', 'diastolic'), y)),
unname(bpdata), seq_along(bpdata)))
# systolic1 diastolic1 systolic2 diastolic2 systolic3 diastolic3
#1 120 89 130 69 115 90
#2 110 70 120 90 112 71
#3 121 78 125 72 135 80
Or without a loop, paste the columns to a single string for each row and then use read.csv/read.table
read.csv(text = do.call(paste, c(bpdata, sep="/")),
sep="/", header = FALSE,
col.names = paste0(c('systolic', 'diastolic'),
rep(seq_along(bpdata), each = 2)))
# systolic1 diastolic1 systolic2 diastolic2 systolic3 diastolic3
#1 120 89 130 69 115 90
#2 110 70 120 90 112 71
#3 121 78 125 72 135 80
Or, using the tidyverse, a similar option is to unite the columns into a single one with / as the separator, then use either extract or separate to split it into multiple columns
library(dplyr)
library(tidyr)
library(stringr)
bpdata %>%
unite(bpcols, everything(), sep="/") %>%
separate(bpcols, into = str_c(c('systolic', 'diastolic'),
rep(seq_along(bpdata), each = 2)), convert = TRUE)
# systolic1 diastolic1 systolic2 diastolic2 systolic3 diastolic3
#1 120 89 130 69 115 90
#2 110 70 120 90 112 71
#3 121 78 125 72 135 80
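For the part of the question about reusing the digit from each column name, a possible base R sketch is shown here. It assumes every column is named bp followed by digits, and the helper split_bp is a hypothetical name introduced only for this example; strcapture comes from the utils package.

```r
bpdata <- data.frame(
  bp1 = c("120/89", "110/70", "121/78"),
  bp2 = c("130/69", "120/90", "125/72"),
  bp3 = c("115/90", "112/71", "135/80"),
  stringsAsFactors = FALSE
)

# For one bp<digit> column, reuse the digit from its name in the output names
split_bp <- function(nm) {
  idx <- sub("^bp", "", nm)                   # "bp2" -> "2"
  proto <- data.frame(integer(), integer())   # prototype: two integer columns
  names(proto) <- paste0(c("systolic_", "diastolic_"), idx)
  strcapture("(\\d+)/(\\d+)", bpdata[[nm]], proto)
}

# Apply to every column matching bp<digits> and bind the pieces side by side
bp_cols <- grep("^bp\\d+$", names(bpdata), value = TRUE)
out <- do.call(cbind, lapply(bp_cols, split_bp))
```

This keeps the naming tied to the input column names instead of relying on column position, which may matter if the columns are not in order.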

Select a range of rows from every n rows from a data frame

I have 2880 observations in my data.frame. I have to create a new data.frame containing rows 25 to 77 from every block of 96 rows.
df.new = df[seq(25, nrow(df), 77), ] # extract from 25 to 77
The above code does not do that (it takes every 77th row starting at row 25); I want every row from 25 to 77 within each block of 96 rows.
One option is to create a vector of indices with which to subset the data frame.
idx <- rep(25:77, times = nrow(df)/96) + 96*rep(0:29, each = 77-25+1)
df[idx, ]
You can use logical-index recycling to extract these rows:
from = 25
to = 77
n = 96
df.new <- df[rep(c(FALSE, TRUE, FALSE), c(from - 1, to - from + 1, n - to)), ]
To explain, for this example the logical vector has one entry per row of a 96-row block:
length(rep(c(FALSE, TRUE, FALSE), c(24, 53, 19))) #returns
#[1] 96
Of these 96 values, positions 25 to 77 are TRUE and the rest are FALSE, which we can verify with:
which(rep(c(FALSE, TRUE, FALSE), c(24, 53, 19)))
# [1] 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46
#[23] 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68
#[45] 69 70 71 72 73 74 75 76 77
Now this vector is recycled for all the remaining rows in the dataframe.
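A small self-contained check of the recycling behaviour, as a sketch: the data frame and its id column are stand-ins invented for the example, and it assumes the number of rows is an exact multiple of 96.

```r
df <- data.frame(id = 1:2880)   # stand-in data; 'id' is a hypothetical column
from <- 25
to <- 77
n <- 96

# One logical value per row of a 96-row block: TRUE only for rows 25..77
keep <- rep(c(FALSE, TRUE, FALSE), c(from - 1, to - from + 1, n - to))

# 'keep' has length 96, so R recycles it across all 2880 rows
df.new <- df[keep, , drop = FALSE]

# Explicit row numbers for comparison with the recycled logical index
blocks <- nrow(df) / n
idx <- rep(from:to, times = blocks) + n * rep(seq_len(blocks) - 1, each = to - from + 1)
```

Both approaches should select the same 30 * 53 rows; if the row count were not a multiple of 96, the last partial block would need separate thought.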
First, define a Group variable, with values 1 to 30, each value repeating 96 times. Then define RowWithinGroup and filter as required. Finally, undo the changes introduced to do the filtering.
library(dplyr)
library(tibble)
df <- tibble(X=rnorm(2880)) %>%
add_column(Group=rep(1:30, each=96)) %>% # 30 groups of 96 consecutive rows
group_by(Group) %>%
mutate(RowWithinGroup=row_number()) %>%
filter(RowWithinGroup >= 25 & RowWithinGroup <= 77) %>%
ungroup() %>% # ungroup first so that Group can be dropped
select(-Group, -RowWithinGroup)
Welcome to SO. This question may not have been asked in this exact form before, but the principles required have been referenced in many, many questions and answers.
A one-liner base solution.
lapply(split(df, cut(1:nrow(df), nrow(df)/96, F)), `[`, 25:77, )
Note: Nothing after the last comma
The code above returns a list. To combine all data together, just pass the result above into
do.call(rbind, ...)
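Put together, the split, subset and rbind steps might look like the sketch below. It swaps cut for an explicit block index built with rep, which is an assumption made for readability; the data frame and its id column are again made up for the example.

```r
df <- data.frame(id = 1:2880)   # stand-in data; 'id' is a hypothetical column

# Block number for every row: 1 for rows 1..96, 2 for rows 97..192, and so on
block <- rep(seq_len(nrow(df) / 96), each = 96)

# Take rows 25..77 of each block, then stack the pieces back into one data frame
picked <- lapply(split(df, block), function(b) b[25:77, , drop = FALSE])
df.new <- do.call(rbind, picked)
rownames(df.new) <- NULL        # drop the block-prefixed row names added by rbind
```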

How to remove just the set of numbers with / in between among other strings?

I need to extract the blood pressure values from a text note. They are typically reported as a larger number, then "/", then a smaller number, with the units mm Hg (it is not a fraction, only written like one). In the 4 examples below, I want to extract only 114/46, 135/67, 109/50 and 188/98, without spaces before or after, and place the top number in a column called SBP and the bottom number in a column called DBP.
Thank you in advance for your assistance.
bb <- c("PATIENT/TEST INFORMATION (m2): 1.61 m2\n BP (mm Hg): 114/46 HR 60 (bpm)", "PATIENT/TEST INFORMATION:\n 63\n Weight (lb): 100\nBSA (m2): 1.44 m2\nBP (mm Hg): 135/67 HR 75 (bpm)", "PATIENT/TEST INFORMATION:\nIndication: Coronary artery disease. Hypertension. Myocardial infarction.\nWeight (lb): 146\nBP (mm Hg): 109/50 HR (bpm)", "PATIENT/TEST INFORMATION:\nIndication: Aortic stenosis. Congestive heart failure. Shortness of breath.\nHeight: (in) 64\nWeight (lb): 165\nBSA (m2): 1.80 m2\nBP (mm Hg): 188/98 HR 140 (bpm) ")
BP <- head(bb,4)
dput(bb)
Base R solution:
setNames(data.frame(do.call("rbind", strsplit(trimws(gsub("[[:alpha:]]|[[:punct:]][^0-9]+", "",
gsub("HR.*", "", paste0("BP", lapply(strsplit(bb, "BP"), '[', 2)))), "both"), "/"))),
c("SBP", "DBP"))
We can use regmatches/regexpr from base R to extract the required values, and then with read.table, create a two column data.frame
read.table(text = regmatches(bb, regexpr('\\d+/\\d+', bb)),
sep="/", header = FALSE, stringsAsFactors = FALSE)
# V1 V2
#1 114 46
#2 135 67
#3 109 50
#4 188 98
Or using strcapture from base R
strcapture( "(\\d+)\\/(\\d+)", bb, data.frame(X1 = integer(), X2 = integer()))
# X1 X2
#1 114 46
#2 135 67
#3 109 50
#4 188 98
To add these as new columns in the original data.frame, use either cbind to bind the output with the original dataset
cbind(data, read.table(text = ...))
Or
data[c("V1", "V2")] <- read.table(text = ...)
Or using extract from tidyr
library(dplyr)
library(tidyr)
tibble(bb) %>%
extract(bb, into = c("X1", "X2"), ".*\\b(\\d+)/(\\d+).*", convert = TRUE)
# A tibble: 4 x 2
# X1 X2
# <int> <int>
#1 114 46
#2 135 67
#3 109 50
#4 188 98
If we don't want to remove the original column, use remove = FALSE in extract
You could use str_match and select the numbers that have / in between
as.data.frame(stringr::str_match(bb, "(\\d+)/(\\d+)")[, 2:3])
# X1 X2
#1 114 46
#2 135 67
#3 109 50
#4 188 98
In base R, we can extract the numbers that follow the pattern a/b, split them on '/' and form two columns.
as.data.frame(do.call(rbind, strsplit(sub(".*?(\\d+/\\d+).*", "\\1", bb), "/")))
You can give them the column names as per your choice using setNames or any other method.
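Since the question asks for columns named SBP and DBP, here is a short sketch that names them directly through the strcapture prototype. The two notes are shortened, made-up stand-ins for the bb vector, and it assumes the first digits/digits pair in each note is the blood pressure.

```r
# Shortened stand-in notes (hypothetical; the real input is the bb vector)
notes <- c(
  "BP (mm Hg): 114/46 HR 60 (bpm)",
  "Weight (lb): 100\nBP (mm Hg): 135/67 HR 75 (bpm)"
)

# The prototype data.frame supplies both the column names and the column types
bp <- strcapture("(\\d+)/(\\d+)", notes,
                 data.frame(SBP = integer(), DBP = integer()))

# Keep the original note next to the extracted values
result <- data.frame(note = notes, bp, stringsAsFactors = FALSE)
```

A note containing an earlier digits/digits pair, such as a date, would need a stricter pattern, for example one anchored on the "BP (mm Hg):" label.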

Sum and place it elsewhere in R

I have one column with 950 numbers. I want to sum rows 1:40 and place the result in a new column on row 50, then sum rows 2:41 and place it on row 51 in the new column, and so on. How do I do this?
You can use the function RcppRoll::roll_sum()
Hope this helps:
r <- 50
df1 <- data.frame(c1 = 1:951)
v1 <- RcppRoll::roll_sum(df1$c1, n=40)
df1$c2 <- c(rep(NA, r - 1), v1[1:(nrow(df1) - r + 1)]) # v1[1], the sum of rows 1:40, lands on row r
View(df1) # in RStudio
You decide what happens with the sum from row 911 onwards (I've ignored them)
You can use RcppRoll::roll_sum() and dplyr::lag()...
df <- data.frame(v = 1:950)
library(dplyr)
library(RcppRoll)
range <- 40 # how many values to sum, i.e. window size
offset <- 10 # e.g sum(1:40) goes to row 50
df <- mutate(df, roll_sum = RcppRoll::roll_sum(lag(v, n = offset),
n = range, fill = NA, align = "right"))
df[(range+offset):(range+offset+5), ]
# v roll_sum
# 50 50 820
# 51 51 860
# 52 52 900
# 53 53 940
# 54 54 980
# 55 55 1020
sum(1:range); sum(2:(range+1))
# [1] 820
# [1] 860
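If installing RcppRoll is not an option, the same offset rolling sum can be sketched in base R with cumsum. This swaps in a different technique from the answers above; the names window, target, roll and out are chosen only for this example.

```r
v <- 1:950
window <- 40   # how many values to sum
target <- 50   # row that receives the sum of rows 1:window

# Rolling sums via differences of the cumulative sum:
# roll[k] is the sum of v[k:(k + window - 1)]
cs <- cumsum(as.numeric(v))
n_win <- length(v) - window + 1
roll <- cs[window:length(v)] - c(0, cs[seq_len(n_win - 1)])

# Place roll[1] on row 'target', roll[2] on the next row, and so on;
# rows before 'target' stay NA and windows that no longer fit are dropped
out <- rep(NA_real_, length(v))
n_fit <- length(v) - target + 1
out[target:length(v)] <- roll[seq_len(n_fit)]

df <- data.frame(v = v, roll_sum = out)
```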

Recode multiple columns to create new columns of df

I am trying to write a function that would mutate multiple columns in a df and produce a new column for each recoded variable. In this case the mutation I am running is to subtract each element in the column from 15. I was able to write the following code for three columns, which worked, but in the future I want to run something like this over 20+ columns and writing out each new column name (as you do in mutate) seems burdensome.
I can't seem to get lapply to work with a recode or mutate function to produce new columns.
df2 <- mutate(df1, new_col1 = 15-old_col1,
new_col2 = 15 - old_col2, new_col3 = 15 - old_col3)
A data.table solution, assuming you want to mutate all of the columns* (see below for a more flexible version).
*as @sb0709 mentions in the comments, mutate_all would do this as well.
library( data.table )
df <- data.table( old_col_1 = 20:24,
old_col_2 = 55:59,
old_col_3 = rnorm( 5, 100, 30 ) )
df[ , sub( "old", "new", names( df ) ) := lapply( .SD, function(x) 15-x ) ]
Which gives:
R> df
old_col_1 old_col_2 old_col_3 new_col_1 new_col_2 new_col_3
1: 20 55 86.29104 -5 -40 -71.29104
2: 21 56 144.21564 -6 -41 -129.21564
3: 22 57 104.84574 -7 -42 -89.84574
4: 23 58 93.18084 -8 -43 -78.18084
5: 24 59 104.96188 -9 -44 -89.96188
If you want to select only some of the columns, you just need to subset the names vector and the .SD list. For example, to run your mutation on only columns 2 and 3:
df[ , sub( "old", "new", names( df )[2:3] ) := lapply( .SD[,2:3], function(x) 15-x ) ]
Which instead gives:
R> df
old_col_1 old_col_2 old_col_3 new_col_2 new_col_3
1: 20 55 138.28667 -40 -123.28667
2: 21 56 69.03836 -41 -54.03836
3: 22 57 147.39790 -42 -132.39790
4: 23 58 88.15505 -43 -73.15505
5: 24 59 28.96437 -44 -13.96437
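The same pattern can be sketched in base R without data.table. It assumes the columns to recode share the prefix old_, as in the question; the example values are invented.

```r
# Made-up example data using the column names from the question
df1 <- data.frame(
  old_col1 = 20:24,
  old_col2 = 55:59,
  old_col3 = c(1.5, 2.5, 3.5, 4.5, 5.5)
)

# Pick the columns to recode and derive the new names from the old ones
old <- grep("^old_", names(df1), value = TRUE)
new <- sub("^old_", "new_", old)

# Apply the recode to every selected column and add the results as new columns
df2 <- df1
df2[new] <- lapply(df1[old], function(x) 15 - x)
```

Because the new names are derived from the old ones, this should scale to 20+ columns without writing each name out by hand.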
