I have a dataframe that looks something like this:
class <- c(3,0,3,0,0)
value <- c(50,50,70,30,100)
days <- c(3,3,2,2,1)
mydata <- data.frame(class, value, days)
What I need is for each day to have both classes represented - so if there is no class 3 on a given day (in this example, day 1) I'd like to add a row where class = 3 and value = 0 and day = 1. My real data is more complicated, because there are varying numbers of rows for each day (and many more days than 3), and many other columns (but for which it would be fine to enter NA). This doesn't seem like too complicated a problem, but I'm having trouble wrapping my head around the code. Thanks so much!
Using tidyverse you can use complete:
library(tidyverse)
mydata %>%
complete(days, class, fill = list(value = 0))
Output
# A tibble: 6 x 3
days class value
<dbl> <dbl> <dbl>
1 1 0 100
2 1 3 0
3 2 0 30
4 2 3 70
5 3 0 50
6 3 3 50
Data
mydata <- structure(list(class = c(3, 0, 3, 0, 0), value = c(50, 50, 70,
30, 100), days = c(3, 3, 2, 2, 1)), class = "data.frame", row.names = c(NA,
-5L))
With base R, we can do
out <- merge(expand.grid(lapply(mydata[c('class', 'days')],
unique)), mydata, all.x = TRUE)
out$value[is.na(out$value)] <- 0
out
# class days value
#1 0 1 100
#2 0 2 30
#3 0 3 50
#4 3 1 0
#5 3 2 70
#6 3 3 50
NOTE: No packages used
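The question also mentions other columns that may simply be NA in the added rows. A minimal base R sketch of that case, assuming a hypothetical extra column `note` that is not in the original data:

```r
# Toy data from the question plus a hypothetical extra column `note`
mydata <- data.frame(
  class = c(3, 0, 3, 0, 0),
  value = c(50, 50, 70, 30, 100),
  days  = c(3, 3, 2, 2, 1),
  note  = c("a", "b", "c", "d", "e")
)

# Every class/day combination that should exist
grid <- expand.grid(class = unique(mydata$class), days = unique(mydata$days))

# Left join: combinations missing from mydata get NA in all other columns
out <- merge(grid, mydata, by = c("class", "days"), all.x = TRUE)

# Only `value` is set to 0; `note` keeps its NA in the added row
out$value[is.na(out$value)] <- 0
out
```

Naming the key columns in `by` keeps the join explicit, which seems safer once the real data has many more columns.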
Or with data.table
library(data.table)
setDT(mydata)[CJ(class, days, unique = TRUE),
on = .(class, days)][is.na(value), value := 0][]
# class value days
#1: 0 100 1
#2: 0 30 2
#3: 0 50 3
#4: 3 0 1
#5: 3 70 2
#6: 3 50 3
Or using crossing/left_join from tidyverse
library(dplyr)
library(tidyr)
tidyr::crossing(class = unique(mydata$class),
days = unique(mydata$days)) %>%
left_join(mydata) %>%
mutate(value = replace_na(value, 0))
# A tibble: 6 x 3
# class days value
# <dbl> <dbl> <dbl>
#1 0 1 100
#2 0 2 30
#3 0 3 50
#4 3 1 0
#5 3 2 70
#6 3 3 50
data
mydata <- structure(list(class = c(3, 0, 3, 0, 0), value = c(50, 50, 70,
30, 100), days = c(3, 3, 2, 2, 1)), class = "data.frame", row.names = c(NA,
-5L))
The file "Aspirin" contains a 2 × 2 × 2 contingency table with columns defined as follows.
Column 1: V1=Observation number. [Observations 1 to 8.]
Column 2: V2=Count. [Nonnegative integer count for each cell in the Table.]
Column 3: V3=Case/Control Factor. [Factor Level 1 (Controls) and Level 2 (Cases).]
Column 4: V4=Ulcer Type Factor. [Factor Level 1 (Gastric) and Level 2 (Duodenal).]
Column 5: V5=Aspirin Use Factor. [Factor Level 1 (Non-User) and Level 2 (User).]
> aspirin
V1 V2 V3 V4 V5
1 1 62 1 1 1
2 2 39 2 1 1
3 3 53 1 2 1
4 4 49 2 2 1
5 5 6 1 1 2
6 6 25 2 1 2
7 7 8 1 2 2
8 8 8 2 2 2
I want to construct a 2x2x2 contingency table like the image above in R, so I typed the following code:
case_control=factor(aspirin$V3)
ulcer=factor(aspirin$V4)
use=factor(aspirin$V5)
table(case_control,ulcer,use)
But I get something like this:
, , use = 1
ulcer
case_control 1 2
1 1 1
2 1 1
, , use = 2
ulcer
case_control 1 2
1 1 1
2 1 1
I want a contingency table with counts, so obviously the result above is not what I'm desiring. Is there a way to fix this?
In your case, just use
ftable(case_control,ulcer,use)
which returns a "flat" table
use 1 2
case_control ulcer
1 1 1 1
2 1 1
2 1 1 1
2 1 1
The main problem here is that you are discarding your count column, so table() only counts rows, and each combination of factor levels appears in exactly one row. A better approach, in my opinion, is to let the counts drive the table:
You could use xtabs() together with ftable() (here used in a dplyr pipe):
library(dplyr)
df %>%
transmute(ID = V1,
Count = V2,
Case_Control = factor(V3,
labels = c("Control", "Case")),
Ulcer_Type = factor(V4,
labels = c("Gastric", "Duodenal")),
Aspirin_Use = factor(V5,
labels = c("Non-User", "User"))) %>%
xtabs(Count ~ Ulcer_Type + Case_Control + Aspirin_Use, data = .) %>%
ftable()
This returns
Aspirin_Use Non-User User
Ulcer_Type Case_Control
Gastric Control 62 6
Case 39 25
Duodenal Control 53 8
Case 49 8
Data
df <- structure(list(V1 = c(1, 2, 3, 4, 5, 6, 7, 8), V2 = c(62, 39,
53, 49, 6, 25, 8, 8), V3 = c(1, 2, 1, 2, 1, 2, 1, 2), V4 = c(1,
1, 2, 2, 1, 1, 2, 2), V5 = c(1, 1, 1, 1, 2, 2, 2, 2)), row.names = c(NA,
-8L), class = c("tbl_df", "tbl", "data.frame"))
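The same table can probably be built without dplyr. This is a sketch under the assumption that the data frame is called `aspirin` as in the question; `tab`, `long` and `tab2` are illustrative names. The second half cross-checks the result by repeating each row as many times as its count, after which plain table() gives the same numbers.

```r
aspirin <- data.frame(
  V1 = 1:8,
  V2 = c(62, 39, 53, 49, 6, 25, 8, 8),
  V3 = c(1, 2, 1, 2, 1, 2, 1, 2),
  V4 = c(1, 1, 2, 2, 1, 1, 2, 2),
  V5 = c(1, 1, 1, 1, 2, 2, 2, 2)
)
aspirin$Case_Control <- factor(aspirin$V3, labels = c("Control", "Case"))
aspirin$Ulcer_Type   <- factor(aspirin$V4, labels = c("Gastric", "Duodenal"))
aspirin$Aspirin_Use  <- factor(aspirin$V5, labels = c("Non-User", "User"))

# V2 on the left-hand side makes xtabs() sum the counts instead of counting rows
tab <- xtabs(V2 ~ Ulcer_Type + Case_Control + Aspirin_Use, data = aspirin)
ftable(tab)

# Cross-check: one row per observation, then an ordinary table()
long <- aspirin[rep(seq_len(nrow(aspirin)), aspirin$V2), ]
tab2 <- table(long$Ulcer_Type, long$Case_Control, long$Aspirin_Use)
```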
I have a quite big dataframe and I'm trying to add a new variable which is the sum of the three previous rows on a running basis, also it should be grouped by ID. The first three rows per ID should be 0. Here's what it should look like.
ID Var1 VarNew
1 2 0
1 2 0
1 3 0
1 0 7
1 4 5
1 1 7
Here's an example dataframe
ID <- c(1, 1, 1, 1, 1, 1)
Var1 <- c(2, 2, 3, 0, 4, 1)
df <- data.frame(ID, Var1)
You can use any package that has a rolling-sum function: compute the sum over a window of size 3, then lag the result by one row. For example, with zoo::rollsumr:
library(dplyr)
df %>%
group_by(ID) %>%
mutate(VarNew = lag(zoo::rollsumr(Var1, 3, fill = 0), default = 0)) %>%
ungroup
# ID Var1 VarNew
# <dbl> <dbl> <dbl>
#1 1 2 0
#2 1 2 0
#3 1 3 0
#4 1 0 7
#5 1 4 5
#6 1 1 7
You can use filter in ave.
df$VarNew <- ave(df$Var1, df$ID, FUN=function(x) c(0, 0, 0,
stats::filter(head(x, -1), c(1,1,1), sides=1)[-1:-2]))
df
# ID Var1 VarNew
#1 1 2 0
#2 1 2 0
#3 1 3 0
#4 1 0 7
#5 1 4 5
#6 1 1 7
or using cumsum in combination with head and tail.
df$VarNew <- ave(df$Var1, df$ID, FUN=function(x) {y <- cumsum(c(0, head(x, -1)))
c(0, 0, 0, tail(y, -3) - head(y, -3))})
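The cumulative-sum idea rests on one identity: the sum of the k values before row i equals the cumulative sum at row i - 1 minus the cumulative sum at row i - k - 1. A sketch that wraps this in a helper, where `prev_k_sum` is a hypothetical name and the second ID is added only to show a group with fewer than four rows:

```r
# Sum of the k previous values; the first k positions are set to 0
prev_k_sum <- function(x, k = 3) {
  n <- length(x)
  if (n <= k) return(rep(0, n))      # group too short: nothing to sum yet
  y <- cumsum(x)
  upper <- y[k:(n - 1)]              # cumulative sum up to row i - 1
  lower <- c(0, y)[seq_len(n - k)]   # cumulative sum up to row i - k - 1
  c(rep(0, k), upper - lower)
}

df <- data.frame(ID   = c(1, 1, 1, 1, 1, 1, 2, 2),
                 Var1 = c(2, 2, 3, 0, 4, 1, 5, 6))
df$VarNew <- ave(df$Var1, df$ID, FUN = prev_k_sum)
df
```

The early return is a design choice for short groups; whether 0 or NA is the right filler there depends on the analysis.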
Library runner also helps
library(runner)
df %>% mutate(var_new = sum_run(Var1, k =3, na_pad = T, lag = 1))
ID Var1 var_new
1 1 2 NA
2 1 2 NA
3 1 3 NA
4 1 0 7
5 1 4 5
6 1 1 7
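The NAs can easily be replaced with 0 if desired. With more than one ID, add group_by(ID) before the mutate() so the window does not run across IDs.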
I have a dataset like this.
> dataset
id    a  b   c    d
 1   10  1  30   50
 2    5  0   5   60
 3   20  1  18   90
 4  103  0  20   80
 5   16  1  56  100
And so on up to 'n' number of columns in relation to the requirement.
My colleagues in the research area carry out certain analyses and give me a data frame as input, in which the row names correspond to the variables of interest for a new dataset.
Something like this.
> rownames(Features)
a
b
d
a:d
b:d
b:c
a:c
Where the colon (:) represents the "product of". So in order to continue working I need to include those products in the original dataset.
I have manually created a function that reproduces the multiplications in the following way (where x represents my original dataset):
Products<- function(x){x<- x %>% mutate(Product1=x$a*x$d)
x<- x %>% mutate(Product2=x$b*x$d)
x<- x %>% mutate(Product3=x$b*x$c)
x<- x %>% mutate(Product4=x$a*x$c)
return(x)}
However, given that the number of products to create is variable every time, I want to find a way to automate the creation of these column multiplications with the names that they give me as input. I'm sure my approach isn't the best, so any help is very welcome.
This looks like modelling work, and using : to stand for a product matches R's formula notation for interaction terms. In base R, you could do:
model.matrix(reformulate(c(-1,rownames(Features))), dataset)
a b d a:d b:d b:c a:c
1 10 1 50 500 50 30 300
2 5 0 60 300 0 0 25
3 20 1 90 1800 90 18 360
4 103 0 80 8240 0 0 2060
5 16 1 100 1600 100 56 896
I added -1 to remove the intercept; leave it out if you want to keep the intercept column. Also note that the result is a matrix, which you can then convert to a data frame.
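A sketch of that last step, assuming the wanted terms are held in a character vector `terms_wanted` (a stand-in for rownames(Features)) and that the `id` column should be carried along. It uses reformulate()'s `intercept = FALSE` argument as an alternative spelling of the -1 trick:

```r
dataset <- data.frame(id = 1:5,
                      a = c(10, 5, 20, 103, 16),
                      b = c(1, 0, 1, 0, 1),
                      c = c(30, 5, 18, 20, 56),
                      d = c(50, 60, 90, 80, 100))

# Stand-in for rownames(Features)
terms_wanted <- c("a", "b", "d", "a:d", "b:d", "b:c", "a:c")

# Model matrix without intercept: one column per term
mm <- model.matrix(reformulate(terms_wanted, intercept = FALSE), dataset)

# check.names = FALSE keeps names such as "a:d" unchanged
res <- data.frame(id = dataset$id, mm, check.names = FALSE)
res
```

Columns with a colon in the name have to be accessed as res[["a:d"]] or with backticks.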
I am not sure whether the following code works for you; it uses eval + gsub to produce the product columns
with(
dataset,
list2DF(
setNames(
lapply(
gsub(":", "*", rownames(Features)),
function(x) eval(str2lang(x))
),
rownames(Features)
)
)
)
which gives
a b d a:d b:d b:c a:c
1 10 1 50 500 50 30 300
2 5 0 60 300 0 0 25
3 20 1 90 1800 90 18 360
4 103 0 80 8240 0 0 2060
5 16 1 100 1600 100 56 896
Data
> dput(dataset)
structure(list(id = 1:5, a = c(10, 5, 20, 103, 16), b = c(1,
0, 1, 0, 1), c = c(30, 5, 18, 20, 56), d = c(50, 60, 90, 80,
100)), class = "data.frame", row.names = c(NA, -5L))
> dput(Features)
structure(list(Features = 1:7), class = "data.frame", row.names = c("a",
"b", "d", "a:d", "b:d", "b:c", "a:c"))
We could use strsplit to split the names that contain :, select the columns of 'dataset' named by the split parts, Reduce with * to do elementwise multiplication, and assign the resulting 'Product' columns to the original 'dataset'
nm1 <- grep(':', rownames(Features), value = TRUE)
lst1 <- lapply(strsplit(nm1, ":", fixed = TRUE),
function(x) Reduce(`*`, dataset[x]))
dataset[paste0("Product", seq_along(lst1))] <- lst1
-output
dataset
# id a b c d Product1 Product2 Product3 Product4
#1 1 10 1 30 50 500 50 30 300
#2 2 5 0 5 60 300 0 0 25
#3 3 20 1 18 90 1800 90 18 360
#4 4 103 0 20 80 8240 0 0 2060
#5 5 16 1 56 100 1600 100 56 896
data
dataset <- structure(list(id = 1:5, a = c(10, 5, 20, 103, 16), b = c(1,
0, 1, 0, 1), c = c(30, 5, 18, 20, 56), d = c(50, 60, 90, 80,
100)), class = "data.frame", row.names = c(NA, -5L))
Features <- structure(1:7, .Dim = c(7L, 1L), .Dimnames = list(c("a", "b",
"d", "a:d", "b:d", "b:c", "a:c"), NULL))
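Because Reduce() folds over however many columns a term names, the same pattern should extend to products of more than two variables. A sketch under that assumption, where the three-way term "a:b:d" is hypothetical and the new columns are named after the terms rather than Product1, Product2, ...:

```r
dataset <- data.frame(id = 1:5,
                      a = c(10, 5, 20, 103, 16),
                      b = c(1, 0, 1, 0, 1),
                      c = c(30, 5, 18, 20, 56),
                      d = c(50, 60, 90, 80, 100))

# "a:b:d" is a made-up three-way term for illustration
terms_wanted <- c("a", "a:d", "b:c", "a:b:d")

# Keep only the terms that describe a product
inter <- grep(":", terms_wanted, value = TRUE, fixed = TRUE)

# Multiply the named columns elementwise, however many there are
prods <- lapply(strsplit(inter, ":", fixed = TRUE),
                function(cols) Reduce(`*`, dataset[cols]))

# Name the new columns after the terms themselves
dataset[inter] <- prods
dataset
```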
This is similar to many of the solutions above. One thing about R is that you can achieve the same result in many different ways, though the underlying principle is still the same.
library(dplyr)
library(purrr) # map() comes from purrr
dataset <- tibble(id = c(1, 2, 3, 4, 5),
a = c(10, 5, 20, 103, 16),
b = c(1, 0, 1, 0, 1),
c = c(30, 5, 18, 20, 56),
d = c(50, 60, 90, 80, 100))
features = c("a", "b", "d", "a:d", "b:d", "b:c", "a:c")
# Keep the original columns and add one column per product term
final <- bind_cols(
dataset,
map(grep(":", features, value = TRUE),
function(x) {
dataset %>%
mutate(!!x := eval(parse(text=gsub(":", "*", x)))) %>%
select(!!x)
}
))
Final dataset.
# A tibble: 5 x 9
id a b c d `a:d` `b:d` `b:c` `a:c`
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 10 1 30 50 500 50 30 300
2 2 5 0 5 60 300 0 0 25
3 3 20 1 18 90 1800 90 18 360
4 4 103 0 20 80 8240 0 0 2060
5 5 16 1 56 100 1600 100 56 896
I have a bit of code that I used in an excel spreadsheet that used min and max that I'm trying to transfer over to R.
I have two columns, "mini" and "maxi" which represent a range of possible values. The third column I'm trying to populate is the proportion of that range that falls between 5 and 19. Looking at the first row in the example, if "mini" was 10 and "maxi" was 15, the value of the 5-19 column should be 1, since the range falls completely in that span. In row 9, the "mini" is 1 and the "maxi" is 3, meaning it falls completely outside of the 5-19 range, and should therefore be 0. Row 3 however, straddles this range, and only 25% falls in the range of 5-19, so the output value should be 0.25.
Edit I have updated R and although several solutions worked before, I am now getting the error:
Error in mutate_impl(.data, dots, caller_env()) :
attempt to bind a variable to R_UnboundValue
Here's an example of how the DF looks:
ID mini maxi
1 10 15
2 17 20
3 2 5
4 40 59
5 40 59
6 21 39
7 21 39
8 17 20
9 1 3
10 4 6
The code that I used previously was something like this:
=MAX((MIN(maxi,19)-MAX(mini,5)+1),0)/(maxi-mini+1)
I was initially trying to use something like
percentoutput <- mutate(DF, output = MAX((MIN(maxi,19) - MAX(mini,5) + 1),0)/(maxi-mini + 1))
This resulted in the output column being full of NAs.
I wasn't sure if this is a situation where I'd need to run an apply function, but I'm not sure how to go about setting it up. Any guidance is appreciated!
Here is an example DF:
structure(list(ID = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10), min = c(10,
17, 2, 40, 40, 21, 21, 17, 1, 4), max = c(15, 20, 5, 59, 59,
39, 39, 20, 3, 6)), class = c("spec_tbl_df", "tbl_df", "tbl",
"data.frame"), row.names = c(NA, -10L), spec = structure(list(
cols = list(ID = structure(list(), class = c("collector_double",
"collector")), mini = structure(list(), class = c("collector_double",
"collector")), maxi = structure(list(), class = c("collector_double",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), skip = 1), class = "col_spec"))
For each row, we can calculate the share of the integers from min to max that fall within 5:19, using rowwise.
library(dplyr)
df %>% rowwise() %>% mutate(ratio = mean(min:max %in% 5:19))
# ID min max ratio
# <dbl> <dbl> <dbl> <dbl>
# 1 1 10 15 1
# 2 2 17 20 0.75
# 3 3 2 5 0.25
# 4 4 40 59 0
# 5 5 40 59 0
# 6 6 21 39 0
# 7 7 21 39 0
# 8 8 17 20 0.75
# 9 9 1 3 0
#10 10 4 6 0.667
and similarly in base R using apply :
df$ratio <- apply(df[-1], 1, function(x) mean(x[1]:x[2] %in% 5:19))
Here is a vectorized version using data.table:
DT[, portion := {
mn <- pmax(mini, lb)
mx <- pmin(maxi, ub)
fifelse(mn <= mx, (mx - mn + 1L) / (maxi - mini + 1L), 0)
}]
Or equivalently in base R:
DF$mn <- pmax(DF$mini, lb)
DF$mx <- pmin(DF$maxi, ub)
DF$portion <- ifelse(DF$mn <= DF$mx, (DF$mx - DF$mn + 1L) / (DF$maxi - DF$mini + 1L), 0)
output:
ID mini maxi portion
1: 1 10 15 1.0000000
2: 2 17 20 0.7500000
3: 3 2 5 0.2500000
4: 4 40 59 0.0000000
5: 5 40 59 0.0000000
6: 6 21 39 0.0000000
7: 7 21 39 0.0000000
8: 8 17 20 0.7500000
9: 9 1 3 0.0000000
10: 10 4 6 0.6666667
data:
library(data.table)
DT <- fread("ID mini maxi
1 10 15
2 17 20
3 2 5
4 40 59
5 40 59
6 21 39
7 21 39
8 17 20
9 1 3
10 4 6")
lb <- 5L
ub <- 19L
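The Excel formula from the question can also be translated almost literally. R is case-sensitive, so MAX and MIN do not exist in base R, and lowercase max() and min() collapse a whole column to a single number; pmax() and pmin() are the elementwise counterparts. A sketch, assuming the column names `mini` and `maxi` from the question:

```r
DF <- data.frame(ID   = 1:10,
                 mini = c(10, 17, 2, 40, 40, 21, 21, 17, 1, 4),
                 maxi = c(15, 20, 5, 59, 59, 39, 39, 20, 3, 6))

# =MAX((MIN(maxi,19)-MAX(mini,5)+1),0)/(maxi-mini+1), applied row by row
DF$output <- pmax(pmin(DF$maxi, 19) - pmax(DF$mini, 5) + 1, 0) /
  (DF$maxi - DF$mini + 1)
DF
```

The outer pmax(..., 0) plays the role of the ifelse() branch: a negative overlap becomes 0.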
We can use map2
library(dplyr)
library(purrr)
df %>%
mutate(ratio = map2_dbl(min, max, ~ mean(.x:.y %in% 5:19)))
I would like to transform a messy dataset in R.
However, I am having trouble figuring out how to do so. Here is an example dataset and the result that I need to achieve:
dataset <- tribble(
~ID, ~DESC,
1, "3+1Â 81Â mÂ",
2, "2+1Â 90Â mÂ",
3, "3+KK 28Â mÂ",
4, "3+1 120 m (Mezone)")
dataset
dataset_tranformed <- tribble(
~ID, ~Rooms, ~Meters, ~Mezone, ~KK,
1, 4, 81,0, 0,
2, 3, 90,0,0,
3, 3, 28,0,1,
4, 4, 120,1, 0)
dataset_tranformed
The columns first need to be separated; however, using dataset %>% separate(DESC, c("size", "meters_squared", "Mezone"), sep = " ") does not work because (Mezone) is thrown away.
We can do this by extracting each component individually and evaluating the room expression (e.g. "3+1") as R code
library(dplyr)
library(purrr) # map_dbl() comes from purrr
library(stringr)
library(tidyr)
dataset %>%
mutate(Rooms = map_dbl(DESC, ~
str_extract(.x, "^\\d+\\+\\d*") %>%
str_replace("\\+$", "+0") %>%
rlang::parse_expr(.) %>%
eval ),
Meters = str_extract(DESC, "(?<=\\s)\\d+(?=Â)"),
Mezone = +(str_detect(DESC, "Mezone")),
KK = +(str_detect(DESC, "KK"))) %>%
select(-DESC)
# A tibble: 4 x 5
# ID Rooms Meters Mezone KK
# <dbl> <dbl> <chr> <int> <int>
#1 1 4 81 0 0
#2 2 3 90 0 0
#3 3 3 28 0 1
#4 4 4 120 1 0
Or another option is extract and then make use of str_detect
dataset %>%
extract(DESC, into = c("Rooms1", "Rooms2", "Meters"),
"^(\\d+)\\+(\\d*)[^0-9]+(\\d+)", convert = TRUE, remove = FALSE) %>%
transmute(ID, Mezone = +(str_detect(DESC, "Mezone")),
KK = +(is.na(Rooms2)), Rooms = Rooms1 + replace_na(Rooms2, 0), Meters )
# A tibble: 4 x 5
# ID Mezone KK Rooms Meters
# <dbl> <int> <int> <dbl> <int>
#1 1 0 0 4 81
#2 2 0 0 3 90
#3 3 0 1 3 28
#4 4 1 0 4 120
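The same capture-group idea can be sketched in base R with regexec() and regmatches(). The sample strings below are simplified assumptions: plain ASCII, with ordinary spaces and without the stray characters shown in the question, so the pattern may need adjusting for the real data.

```r
# Cleaned-up sample strings (assumed; the real data contains extra characters)
desc <- c("3+1 81 m", "2+1 90 m", "3+KK 28 m", "3+1 120 m (Mezone)")

# Groups: rooms before "+", optional rooms after "+", first number after that
m <- regmatches(desc, regexec("^([0-9]+)\\+([0-9]*)[^0-9]+([0-9]+)", desc))
parts <- do.call(rbind, m)   # one row per string: full match plus 3 groups

rooms2 <- as.numeric(parts[, 3])
rooms2[is.na(rooms2)] <- 0   # "3+KK" has no digits after the "+"

result <- data.frame(
  ID     = seq_along(desc),
  Rooms  = as.numeric(parts[, 2]) + rooms2,
  Meters = as.numeric(parts[, 4]),
  Mezone = as.integer(grepl("Mezone", desc, fixed = TRUE)),
  KK     = as.integer(grepl("KK", desc, fixed = TRUE))
)
result
```

Note that do.call(rbind, m) assumes every string matches the pattern; a string that does not match would need separate handling.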