Change values in data frame in a specific row using dplyr - r

Is it possible to restrict a data frame to a specific row and then change some values in one of the columns?
Let's say I calculate GROWTH as (SIZE_{t+1} - SIZE_t) / SIZE_t and then see some strange values for GROWTH (e.g. 1000). The reason is a corrupt value of the corresponding SIZE variable. Now I'd like to find and replace the corrupt value of SIZE.
If I type:
data <- mutate(filter(data, lead(GROWTH)==1000), SIZE = 2600)
then only the corrupt row is stored in data and the rest of my data frame is lost.
What I'd like to do instead is filter "data" on the left hand side to the corresponding row of the corrupt value and then mutate the incorrect variable (on the right hand side):
filter(data, lead(GROWTH)==1000) <- mutate(filter(data, lead(GROWTH)==1000), SIZE = 2600)
but that doesn't seem to work. Is there a way to handle this using dplyr? Many thanks in advance

You can use ifelse inside mutate. Say you have a data frame where a corrupt SIZE value in row 3 produces a very large GROWTH value in row 4, and you want to replace the SIZE in row 3. Here I replace it with 0.3 (rather than your 2600) to stay consistent with my example values. Adjust the GROWTH > 1000 condition as needed.
data
SIZE GROWTH
1 -1.49578498 NA
2 -0.38731784 -0.7410605
3 0.00010000 -1.0002582
4 0.53842217 5383.2216758
5 -0.65813674 -2.2223433
6 0.29830698 -1.4532599
7 0.04712019 -0.8420413
8 -0.07312482 -2.5518788
9 1.64310713 -23.4698959
10 1.44927727 -0.1179654
library(dplyr)
# flag the row just before a GROWTH value above 1000 and overwrite its SIZE
data %>% mutate(SIZE = ifelse(lead(GROWTH > 1000, default = FALSE), 0.3, SIZE))
SIZE GROWTH
1 -1.49578498 NA
2 -0.38731784 -0.7410605
3 0.30000000 -1.0002582
4 0.53842217 5383.2216758
5 -0.65813674 -2.2223433
6 0.29830698 -1.4532599
7 0.04712019 -0.8420413
8 -0.07312482 -2.5518788
9 1.64310713 -23.4698959
10 1.44927727 -0.1179654
Data:
structure(list(SIZE = c(-1.49578498093657, -0.387317841955887,
1e-04, 0.538422167582116, -0.658136741561064, 0.298306980856383,
0.0471201873908915, -0.0731248216938637, 1.64310713116132, 1.44927727104653
), GROWTH = c(NA, -0.741060482026387, -1.00025818588551, 5383.22167582116,
-2.22234332311492, -1.45325988053609, -0.842041284935343, -2.55187883883499,
-23.4698958999199, -0.117965442690154)), class = "data.frame", .Names = c("SIZE",
"GROWTH"), row.names = c(NA, -10L))

Related

How to generate random data based on some criteria in R

I want to generate 300 random values based on the following criteria:
Class value
0 1-8
1 9-11
2 12-14
3 15-16
4 17-20
Logic: when Class = 0, I want random values between 1 and 8; when Class = 1, between 9 and 11; and so on.
This gives me the following hypothetical table as an example:
Class Value
0 7
0 4
1 10
1 9
1 11
. .
. .
I want to have equal and unequal mixtures in each class
You could do:
df <- data.frame(Class = sample(0:4, 300, TRUE))
df$Value <- sapply(list(1:8, 9:11, 12:14, 15:16, 17:20)[df$Class + 1],
sample, size = 1)
This gives you a data frame with 300 rows and appropriate numbers for each class:
head(df)
#> Class Value
#> 1 0 3
#> 2 1 10
#> 3 4 19
#> 4 2 12
#> 5 4 19
#> 6 1 10
Created on 2022-12-30 with reprex v2.0.2
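If "equal mixtures" means exactly the same number of rows per class (rather than equal sampling probability), one possible variation is to fix the class counts up front and shuffle them. This is only a sketch: the counts vector and the seed are assumptions, and an unequal mixture can be obtained by changing counts to any values that sum to 300.

```r
set.seed(123)  # arbitrary seed, only for reproducibility

# Value ranges for classes 0 to 4, as in the question
ranges <- list(1:8, 9:11, 12:14, 15:16, 17:20)

# Hypothetical per-class counts; use e.g. c(30, 45, 75, 90, 60) for an unequal mix
counts <- c(60, 60, 60, 60, 60)

# Repeat each class label the requested number of times, then shuffle the rows
df <- data.frame(Class = sample(rep(0:4, times = counts)))

# Draw one value per row from the range that belongs to its class.
# Indexing with sample.int avoids the sample(x, 1) surprise for length-one x.
df$Value <- vapply(ranges[df$Class + 1],
                   function(r) r[sample.int(length(r), 1)],
                   integer(1))

table(df$Class)
```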
The following version adds some flexibility: different probabilities can be used in the sampling, and as few values as possible are hard-coded:
# load data.table
library(data.table)
# this is the original data
a = structure(list(Class = 0:4, value = c("1-8", "9-11", "12-14",
"15-16", "17-20")), row.names = c(NA, -5L), class = c("data.table",
"data.frame"))
# this is to replace "-" by ":", we will use that in a second
a[, value := gsub("\\-", ":", value)]
# this is a vector of EQUAL probabilities
probs = rep(1/a[, uniqueN(Class)], a[, uniqueN(Class)])
# This is a vector of UNEQUAL Probabilities. If wanted, it should be
# uncommented and adjusted manually
# probs = c(0.05, 0.1, 0.2, 0.4, 0.25)
# This is the number of Class samples wanted
numberOfSamples = 300
# This is the workhorse
a[sample(.N, numberOfSamples, TRUE, prob = probs), ][,
smpl := apply(.SD,
1,
function(x) sample(eval(parse(text = x)), 1)),
.SDcols = "value"][,
.(Class, smpl)]
What is good about this code?
If you change your classes, or the value ranges, the only change you need to be concerned about is the original data frame (a, as I called it)
If you want to use uneven probabilities for your sampling, you can set them and the code still runs.
If you want to take a smaller or larger sample, you don't have to edit your code, you only change the value of a variable.
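The same idea can also be sketched in base R without eval(parse(text = ...)), by splitting each "lo-hi" string into numeric bounds. This is an alternative under assumptions, not the answer's method: the column names lo and hi, the object names spec and out, and the seed are made up for illustration.

```r
set.seed(1)  # arbitrary seed, only for reproducibility

# Class definitions as in the question; ranges are "lo-hi" strings
spec <- data.frame(Class = 0:4,
                   value = c("1-8", "9-11", "12-14", "15-16", "17-20"),
                   stringsAsFactors = FALSE)

# Split each range on "-" and store the integer bounds (hypothetical columns)
bounds <- do.call(rbind, lapply(strsplit(spec$value, "-", fixed = TRUE), as.integer))
spec$lo <- bounds[, 1]
spec$hi <- bounds[, 2]

numberOfSamples <- 300
probs <- rep(1 / nrow(spec), nrow(spec))  # equal probabilities; edit for unequal ones

# Pick a class row for every sample, then a uniform integer inside its range
idx <- sample(nrow(spec), numberOfSamples, replace = TRUE, prob = probs)
width <- spec$hi[idx] - spec$lo[idx] + 1
out <- data.frame(Class = spec$Class[idx],
                  smpl  = spec$lo[idx] + floor(runif(numberOfSamples) * width))

head(out)
```

Avoiding eval(parse()) means the range strings are never executed as code, which may matter if they come from an external file.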

Generate buckets based on a column data then create another column storing values assigned to corresponding buckets

I have a dataframe which includes 2 columns below
|Systolic blood pressure |Urea Nitrogen|
|------------------------|-------------|
|155.86667|50.000000|
|140.00000| 20.33333|
|135.33333| 33.857143|
|126.40000|15.285714|
|...|...|
I want to create 2 more columns called Sys_points and BUN_points based on the bucket criteria in the attached image. They should store the values from the Points column in the image, which are not equally spaced. I have tried findInterval and cut, but I can't find a way to assign values that are not in sequence to the buckets.
#findInterval
BUN_int <- seq(0,150,by=10)
data3$BUN <- findInterval(data3$`Urea Nitrogen`,BUN_int)
#cut
cut(data3$`Urea Nitrogen`,breaks = BUN_int, right=FALSE, dig.lab=c(0,2,4,6,8,9,11,13,15,17,19,21,23,25,27,28))
Is there any function that can help me with this?
Here’s how to do it using cut(). Note the use of -Inf and Inf to include <x and >=x bins.
bun_data$Sys_points <- cut(
bun_data$`Systolic blood pressure`,
breaks = c(5:20 * 10, Inf),
labels = c(28,26,24,23,21,19,17,15,13,11,9,8,6,4,2,0),
right = FALSE
)
bun_data$BUN_points <- cut(
bun_data$`Urea Nitrogen`,
breaks = c(-Inf, 1:15 * 10, Inf),
labels = c(0,2,4,6,8,9,11,13,15,17,19,21,23,25,27,28),
right = FALSE
)
Result:
Systolic blood pressure Urea Nitrogen Sys_points BUN_points
1 155.8667 50.00000 9 9
2 140.0000 20.33333 11 4
3 135.3333 33.85714 13 6
4 126.4000 15.28571 15 2
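One thing to be aware of is that cut() returns a factor, so the points print as numbers but are stored as labels. If the points need to be summed later, they can be converted, or looked up directly with findInterval(). The sketch below rebuilds the four example rows; the object names sys_breaks, sys_points, bun_breaks and bun_points are assumptions for illustration.

```r
# The four example rows from the question
bun_data <- data.frame(
  `Systolic blood pressure` = c(155.86667, 140, 135.33333, 126.4),
  `Urea Nitrogen`           = c(50, 20.33333, 33.857143, 15.285714),
  check.names = FALSE
)

sys_breaks <- c(5:20 * 10, Inf)
sys_points <- c(28, 26, 24, 23, 21, 19, 17, 15, 13, 11, 9, 8, 6, 4, 2, 0)
bun_breaks <- c(-Inf, 1:15 * 10, Inf)
bun_points <- c(0, 2, 4, 6, 8, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 28)

# Option 1: cut() gives a factor; go through character to recover the numbers
sys_factor <- cut(bun_data$`Systolic blood pressure`,
                  breaks = sys_breaks, labels = sys_points, right = FALSE)
bun_data$Sys_points <- as.numeric(as.character(sys_factor))

# Option 2: findInterval() returns the bin index (left-closed by default),
# which can index the points vector directly
bun_data$BUN_points <- bun_points[findInterval(bun_data$`Urea Nitrogen`, bun_breaks)]

bun_data
```

Calling as.numeric() directly on the factor would return the level positions rather than the points, which is why the conversion goes through as.character() first.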

How to add a column with sequential values that expands a data frame in R [duplicate]

I have a column time_bin that is based on cumulative radiocarbon dates. However, I need to fill the gaps in the time_bin sequence. In the example data below, this means I need 2700 and 3100 added in. This will be applied to many different data sets with different gaps, so it needs to be automated. It will have to expand the size of the data frame; it's fine if the values in the other columns are just NA for now, as I think I know how to populate them once they're created.
The time_bin column is created using mutate along with ceiling, as shown below, so maybe it can be changed at this point rather than later.
I can create the column I need, called seq below, but I'm not sure how to force it into a data frame.
If this can be done with a tidyverse approach rather than the vector-based one I have used, that would be great too.
So far I have:
data<- structure(list(cumulative.time = c(2458.09948930625, 2580.22242330625,
2707.31373980624, 2839.71214840625, 2977.77505230625, 3121.87854830625
)), row.names = c(NA, -6L), class = c("tbl_df", "tbl", "data.frame"
))
data <- data%>% mutate(time_bin=ceiling(cumulative.time/100)*100)
max <- max(data$time_bin, na.rm = TRUE)
min <- min(data$time_bin, na.rm = TRUE)
seq <- seq(from = min, to = max, by = 100)
Thanks people!
We can use complete from tidyr to fill in the sequence from the minimum to the maximum of time_bin in steps of 100.
tidyr::complete(data, time_bin = seq(min(time_bin), max(time_bin), by = 100))
# time_bin cumulative.time
# <dbl> <dbl>
#1 2500 2458.
#2 2600 2580.
#3 2700 NA
#4 2800 2707.
#5 2900 2840.
#6 3000 2978.
#7 3100 NA
#8 3200 3122.
This calls for a join. If we make your seq variable into a data.frame, we can do the appropriate join with data.
library(dplyr)
seq <- data.frame(time_bin = seq(from = min, to = max, by = 100))
data %>% right_join(seq) %>% arrange(time_bin)
Joining, by = "time_bin"
# A tibble: 8 x 2
cumulative.time time_bin
<dbl> <dbl>
1 2458. 2500
2 2580. 2600
3 NA 2700
4 2707. 2800
5 2840. 2900
6 2978. 3000
7 NA 3100
8 3122. 3200
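Since this has to run on many data sets, the binning and gap filling could be wrapped in one function. The following is a sketch under assumptions: fill_time_bins is a hypothetical helper name, the column is assumed to be called cumulative.time in every data set, and the input values are rounded versions of the question's data. It uses full_seq from tidyr to build the regular sequence.

```r
library(dplyr)
library(tidyr)

# Hypothetical helper: bin cumulative.time and add rows for missing bins
fill_time_bins <- function(df, width = 100) {
  df %>%
    mutate(time_bin = ceiling(cumulative.time / width) * width) %>%
    # full_seq() spans min to max of time_bin in steps of `width`;
    # bins without data get NA in the other columns
    complete(time_bin = full_seq(time_bin, width))
}

# Rounded version of the example data from the question
data <- data.frame(cumulative.time = c(2458.1, 2580.2, 2707.3, 2839.7, 2977.8, 3121.9))

filled <- fill_time_bins(data)
filled
```

With the example data this yields eight rows, with NA in cumulative.time for the 2700 and 3100 bins.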

Extracting value and finding the minimum without merging

I'm trying to extract a subtext and get the minimum value from a list of a list in R. My initial tsv looks like this (this is a smaller version):
cases counts
"S35718:10.63,S35585:6.75,S35708:7.28,S36617:12.23" "6.75,7.28,10.63,12.23,6.17,4.09,3.95,5.00"
"S35718:10.63" "10.63"
And I am trying to extract the numbers after the colon and find the minimum, then I wanted to see how many in the counts column are greater than the minimum.
For instance my ideal output would be:
min: 6.75
greater than 6.75 in counts column: 4
Within this .tsv, there are approximately 100,000 lines. I've tried using gsub, but it ends up merging all the numbers such as the example below:
test <- gsub(".*:", "",outlier$cases)
[1]"10.63" "6.75" "7.28" "12.23" "10.63" ... all the other subsequent values
I would appreciate any help on this. I'm a bit of a beginner with R but would love to improve further. Thank you so much!
One option is to extract the numbers after each :, convert them to numeric, and take the min. The count is then the sum of a logical comparison of the counts values against that minimum (the code uses >=, which gives the expected count of 4).
library(stringr)
library(dplyr)
library(purrr)
library(tidyr)
outlier %>%
transmute(caselist = str_extract_all(cases, "(?<=:)\\d+\\.\\d+"),
countlist = str_extract_all(counts, "[0-9.]+")) %>%
transmute(out = map2(caselist, countlist,
~tibble(min = min(as.numeric(.x)),
greater_than_min = sum(as.numeric(.y) >= min)))) %>%
unnest_wider(c(out))
# A tibble: 2 x 2
# min greater_than_min
# <dbl> <int>
#1 6.75 4
#2 10.6 1
data
outlier <- structure(list(cases = c("S35718:10.63,S35585:6.75,S35708:7.28,S36617:12.23",
"S35718:10.63"), counts = c("6.75,7.28,10.63,12.23,6.17,4.09,3.95,5.00",
"10.63")), class = "data.frame", row.names = c(NA, -2L))
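For comparison, the same per-row logic can be sketched in base R. The key step is splitting each row on commas before stripping the text up to the colon, so that values from different rows are never mixed. The names num_after_colon, case_min, count_vals and result are assumptions for illustration, and the comparison uses >= to match the expected output.

```r
outlier <- data.frame(
  cases  = c("S35718:10.63,S35585:6.75,S35708:7.28,S36617:12.23", "S35718:10.63"),
  counts = c("6.75,7.28,10.63,12.23,6.17,4.09,3.95,5.00", "10.63"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: split one cases string on "," and keep the number after ":"
num_after_colon <- function(x) {
  as.numeric(sub(".*:", "", strsplit(x, ",", fixed = TRUE)[[1]]))
}

# Minimum case value for each row
case_min <- vapply(outlier$cases, function(x) min(num_after_colon(x)),
                   numeric(1), USE.NAMES = FALSE)

# Numeric counts for each row, kept as a list so rows stay separate
count_vals <- lapply(strsplit(outlier$counts, ",", fixed = TRUE), as.numeric)

# Number of counts at or above the row's minimum
result <- data.frame(
  min = case_min,
  greater_than_min = mapply(function(v, m) sum(v >= m), count_vals, case_min)
)

result
```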

How to create formulas to compute new variables in R considering names in a data frame and their column value

Hi everybody, I am trying to compute new variables in R. The dput version of my data frame is:
structure(list(sexo = c(-22.84754, -30.95001, -37.36658, -45.64382,
-54.9466, -0.1732915), cli_edad = c(5.972972, 11.67697, 16.46362,
26.57938, 47.19307, 0.1037254), edad2 = c(-0.0637181, -0.1199798,
-0.1600652, -0.2424397, -0.4273092, -0.001068), veces_mora_ago12 = c(-100.6952,
-166.6598, -391.6087, -710.2349, -1098.773, -0.3525356), veces_mora_sep12 = c(20.06456,
162.5816, 388.1126, 738.8196, 1181.483, 0.2907068), veces_mora_oct12 = c(79.44273,
-15.99917, -13.33856, 9.459844, 103.7592, -0.0863719)), .Names = c("sexo",
"cli_edad", "edad2", "veces_mora_ago12", "veces_mora_sep12",
"veces_mora_oct12"), class = "data.frame", row.names = c(NA,
6L))
The rows of that data frame (call it z) are the coefficients of different models. When I load a data frame in R, I need to compute a new variable for each model. For example, the loaded data frame DF will have the same column names as z, and I have to compute 6 additional variables for DF. Each variable is defined by a different formula; for example, this one combines the names of z with the first row of z:
DF$I1=-22.8475400*DF$sexo+5.9729720*DF$cli_edad-0.0637181*DF$edad2-100.6952000*DF$veces_mora_ago12+20.0645600*DF$veces_mora_sep12+79.4427300*DF$veces_mora_oct12
As with the formula above, I have to write 5 more formulas, combining the names with the second through sixth rows of z.
I don't know whether it is possible to build these formulas in R and then apply them to a loaded data frame to compute the new variables. Thanks for your help.
Your example:
> -22.8475400*DF$sexo+5.9729720*DF$cli_edad-0.0637181*DF$edad2-100.6952000*DF$veces_mora_ago12+20.0645600*DF$veces_mora_sep12+79.4427300*DF$veces_mora_oct12
[1] 17410.94776 19549.83787 47112.85467 88294.47367 144127.32240 39.04883
As @JJLagrange says, I think you want this:
DM=as.matrix(DF)
DM%*% t(DM)
1 2 3 4 5 6
1 17410.94776 19549.8379 47112.8547 88294.4737 144127.3224 39.0488287
2 19549.83787 55558.5082 129927.5614 240057.8018 375800.3450 113.9736682
3 47112.85467 129927.5614 305834.0191 566896.3661 890283.7105 260.2182340
4 88294.47367 240057.8018 566896.3661 1053167.3837 1658033.7139 475.0128054
5 144127.32240 375800.3450 890283.7105 1658033.7139 2619216.6537 736.2772169
6 39.04883 113.9737 260.2182 475.0128 736.2772 0.2570419
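In the output above DF appears to be the coefficient table itself. When the coefficients (z) and the loaded data (DF) are separate objects, the same matrix product can be sketched as below. The two-variable toy data and the column names a and b are assumptions for illustration; the I1, I2, ... naming follows the question.

```r
# Hypothetical coefficients: one row per model, one column per variable
z <- data.frame(a = c(1, 2), b = c(3, 4))

# Hypothetical loaded data with the same column names as z
DF <- data.frame(a = c(1, 0, 2), b = c(0, 1, 1))

# Select DF's columns in the order of z, then multiply:
# rows of DF times transposed coefficients gives one column per model
scores <- as.matrix(DF[, names(z)]) %*% t(as.matrix(z))
colnames(scores) <- paste0("I", seq_len(nrow(z)))

# Append the new variables I1, I2, ... to the data
DF <- cbind(DF, scores)
DF
```

Indexing DF by names(z) keeps the variables aligned with their coefficients even if the loaded data has its columns in a different order.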
