I have a dataset called PimaDiabetes.
PimaDiabetes <- read.csv("PimaDiabetes.csv")
PimaDiabetes[2:8][PimaDiabetes[2:8]==0] <- NA
mean_1 = 40.5
mean_0 = 30.7
p.tib <- PimaDiabetes %>%
as_tibble()
Here is a snapshot of the data:
And the dataset can be pulled from here.
I'm trying to group the dataset by Outcome (0 and 1) and impute a different value (the median of the respective group) into a column depending on the outcome.
So for instance, in the fifth column, Insulin, there are some NA values where the Outcome is 1, and some where the Outcome is 0. I would like to place mean_1 (40.5) into it when the value in a row is NA and the Outcome is 1, and mean_0 (30.7) when the value is NA and the Outcome is 0.
I've gotten advice prior to this and tried:
p.tib %>%
mutate(
p.tib$Insulin = case_when((p.tib$Outcome == 0) & (is.na(p.tib$Insulin)) ~ IN_0,
(p.tib$Outcome == 1) & (is.na(p.tib$Insulin) ~ IN_1,
TRUE ~ p.tib$Insulin))
However it constantly yields the following error:
Error: unexpected '=' in "p.tib %>% mutate(p.tib$Insulin ="
Can I know where things are going wrong, please?
Setup
It appears this dataset is also available in the pdp package in R, where it is called pima. The only major difference between the package data and yours is that the Outcome variable is called "diabetes" and is labeled "pos" and "neg" instead of 1/0. I have loaded that package and the tidyverse to help.
#### Load Libraries ####
library(pdp)
library(tidyverse)
First I transformed the data into a tibble so it was easier for me to read.
#### Reformat Data ####
p.tib <- pima %>%
as_tibble()
Printing p.tib, we can see that the insulin variable has a lot of NA values in the first rows, which will be quicker to visualize later than some of the other variables that have missing data. Therefore, I used that instead of glucose, but the idea is the same.
# A tibble: 768 × 9
pregnant glucose press…¹ triceps insulin mass pedig…² age diabe…³
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct>
1 6 148 72 35 NA 33.6 0.627 50 pos
2 1 85 66 29 NA 26.6 0.351 31 neg
3 8 183 64 NA NA 23.3 0.672 32 pos
4 1 89 66 23 94 28.1 0.167 21 neg
5 0 137 40 35 168 43.1 2.29 33 pos
6 5 116 74 NA NA 25.6 0.201 30 neg
7 3 78 50 32 88 31 0.248 26 pos
8 10 115 NA NA NA 35.3 0.134 29 neg
9 2 197 70 45 543 30.5 0.158 53 pos
10 8 125 96 NA NA NA 0.232 54 pos
# … with 758 more rows, and abbreviated variable names ¹pressure,
# ²pedigree, ³diabetes
# ℹ Use `print(n = ...)` to see more rows
Finding the Mean
After glimpsing the data, I checked the mean for each group who did and didn't have diabetes by first grouping by diabetes with group_by, then collapsing the data frame into a summary of each group's mean, thus creating the mean_insulin variable (which you can see removes NA values to derive the mean):
#### Check Mean by Group ####
p.tib %>%
group_by(diabetes) %>%
summarise(mean_insulin = mean(insulin,
na.rm = TRUE))
The values we should be imputing seem to be below. Here the groups are labeled as "neg" or 0 in your data, and "pos", or 1 in your data. You can convert these groups into those numbers if you want, but I left it as is so it was easier to read:
# A tibble: 2 × 2
diabetes mean_insulin
<fct> <dbl>
1 neg 130.
2 pos 207.
Mean Imputation
From there, we will use case_when as a vectorized ifelse statement. First, we use mutate to transform insulin. Then we use case_when with three tests. First, if the group is negative and the value is NA, we replace it with that group's mean, rounded to 130. If the group is positive and the value is NA, we use 207. For all other values (the TRUE part), we keep the existing value of insulin. The & operator here says "this transformation can only take place if both of these tests are true". What follows the ~ is the value to assign.
#### Impute Mean ####
p.tib %>%
mutate(
insulin = case_when(
(diabetes == "neg") & (is.na(insulin)) ~ 130,
(diabetes == "pos") & (is.na(insulin)) ~ 207,
TRUE ~ insulin
)
)
You will now notice that the first rows of insulin data are replaced with the mutation and the rest are left alone:
# A tibble: 768 × 9
pregnant glucose press…¹ triceps insulin mass pedig…² age diabe…³
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct>
1 6 148 72 35 207 33.6 0.627 50 pos
2 1 85 66 29 130 26.6 0.351 31 neg
3 8 183 64 NA 207 23.3 0.672 32 pos
4 1 89 66 23 94 28.1 0.167 21 neg
5 0 137 40 35 168 43.1 2.29 33 pos
6 5 116 74 NA 130 25.6 0.201 30 neg
7 3 78 50 32 88 31 0.248 26 pos
8 10 115 NA NA 130 35.3 0.134 29 neg
9 2 197 70 45 543 30.5 0.158 53 pos
10 8 125 96 NA 207 NA 0.232 54 pos
# … with 758 more rows, and abbreviated variable names ¹pressure,
# ²pedigree, ³diabetes
# ℹ Use `print(n = ...)` to see more rows
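As for the original error: inside mutate() the name on the left of = has to be a bare column name (Insulin = ...), not p.tib$Insulin = ..., and the second condition in the question is also missing a closing parenthesis. The following is only a sketch of a variant that computes each group's mean on the fly instead of typing it in by hand. It assumes the question's column names (Outcome, Insulin); the toy data frame and its values are made up for illustration.

```r
library(dplyr)

# Hypothetical toy data using the column names from the question
toy <- tibble(
  Outcome = c(1, 0, 1, 0, 1, 0),
  Insulin = c(NA, NA, 100, 50, 200, 70)
)

imputed <- toy %>%
  group_by(Outcome) %>%
  # Within each Outcome group, replace NA with that group's mean
  mutate(Insulin = if_else(is.na(Insulin),
                           mean(Insulin, na.rm = TRUE),
                           Insulin)) %>%
  ungroup()

imputed
```

Because the mean is computed per group, nothing needs to be updated by hand if the data change; swapping mean() for median() would impute the group median instead.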
Related
I need to divide columns despesatotal and despesamonetaria by the row named Total:
Let's suppose your data set is df.
# 1) Delete the last row
df <- df[-nrow(df),]
# 2) Build the desired data.frame (combining the CNAE names and the proportion columns)
new.df <- cbind(grup_CNAE = df$grup_CNAE,
100*prop.table(df[,-1],margin = 2))
Finally, rename your columns. Be careful with the matrix or data.frame formats, because mathematical operations can sometimes pose a problem. If you use the dput function to give us a reproducible example, the answer will be more accurate.
Here is a way to get it done. This is not the best way, but I think it is very readable.
Suppose this is your data frame:
mydf = structure(list(grup_CNAE = c("A", "B", "C", "D", "E", "Total"
), despesatotal = c(71, 93, 81, 27, 39, 311), despesamonetaria = c(7,
72, 36, 22, 73, 210)), row.names = c(NA, -6L), class = "data.frame")
mydf
# grup_CNAE despesatotal despesamonetaria
#1 A 71 7
#2 B 93 72
#3 C 81 36
#4 D 27 22
#5 E 39 73
#6 Total 311 210
To divide the despesatotal values by their total, you need to use the total value (311 in this example) as the denominator. Note that the total value is located in the last row. You can pick it out by indexing the despesatotal column with nrow() as the index value.
library(dplyr)
mydf |> mutate(percentage1 = despesatotal / despesatotal[nrow(mydf)],
percentage2 = despesamonetaria / despesamonetaria[nrow(mydf)])
# grup_CNAE despesatotal despesamonetaria percentage1 percentage2
#1 A 71 7 0.22829582 0.03333333
#2 B 93 72 0.29903537 0.34285714
#3 C 81 36 0.26045016 0.17142857
#4 D 27 22 0.08681672 0.10476190
#5 E 39 73 0.12540193 0.34761905
#6 Total 311 210 1.00000000 1.00000000
library(tidyverse)
Sample data
# A tibble: 11 x 3
group despesatotal despesamonetaria
<chr> <int> <int>
1 1 198 586
2 2 186 525
3 3 202 563
4 4 300 562
5 5 126 545
6 6 215 529
7 7 183 524
8 8 163 597
9 9 213 592
10 10 175 530
11 Total 1961 5553
df %>%
mutate(percentage_total = despesatotal / last(despesatotal),
percentage_monetaria = despesamonetaria/ last(despesamonetaria)) %>%
slice(-nrow(.))
# A tibble: 10 x 5
group despesatotal despesamonetaria percentage_total percentage_monetaria
<chr> <int> <int> <dbl> <dbl>
1 1 198 586 0.101 0.106
2 2 186 525 0.0948 0.0945
3 3 202 563 0.103 0.101
4 4 300 562 0.153 0.101
5 5 126 545 0.0643 0.0981
6 6 215 529 0.110 0.0953
7 7 183 524 0.0933 0.0944
8 8 163 597 0.0831 0.108
9 9 213 592 0.109 0.107
10 10 175 530 0.0892 0.0954
This is a good place to use dplyr::mutate(across()) to divide all relevant columns by the Total row. Note this is not sensitive to the order of the rows and will apply the manipulation to all numeric columns. You can supply any tidyselect semantics to across() instead if needed in your case.
library(tidyverse)
# make sample data
d <- tibble(grup_CNAE = paste0("Group", 1:12),
despesatotal = sample(1e6:5e7, 12),
despesamonetaria = sample(1e6:5e7, 12)) %>%
add_row(grup_CNAE = "Total", summarize(., across(where(is.numeric), sum)))
# divide numeric columns by value in "Total" row
d %>%
mutate(across(where(is.numeric), ~./.[grup_CNAE == "Total"]))
#> # A tibble: 13 × 3
#> grup_CNAE despesatotal despesamonetaria
#> <chr> <dbl> <dbl>
#> 1 Group1 0.117 0.0204
#> 2 Group2 0.170 0.103
#> 3 Group3 0.0451 0.0837
#> 4 Group4 0.0823 0.114
#> 5 Group5 0.0170 0.0838
#> 6 Group6 0.0174 0.0612
#> 7 Group7 0.163 0.155
#> 8 Group8 0.0352 0.0816
#> 9 Group9 0.0874 0.135
#> 10 Group10 0.113 0.0877
#> 11 Group11 0.0499 0.0495
#> 12 Group12 0.104 0.0251
#> 13 Total 1 1
Created on 2022-11-08 with reprex v2.0.2
Below is the sample data and code. I have two issues. First, I need the indtotal column to be the sum by the twodigit code and have it stay constant as shown below. The reason is so that I can do a simple calculation of one column divided by the other to arrive at the smbshare number. When I try the following,
second <- first %>%
group_by(twodigit,smb) %>%
summarize(indtotal = sum(employment))
it breaks it down by twodigit and smb.
The second issue is having it produce a 0 if the value does not exist. The best example is the twodigit code of 51 and smb = 4. When there are not 4 distinct smb values for a given twodigit code, I am looking for it to produce a 0.
Note: smb is short for small business
naicstest <- c (512131,512141,521921,522654,512131,536978,541214,531214,621112,541213,551212,574121,569887,541211,523141,551122,512312,521114,522112)
employment <- c(11,130,315,17,190,21,22,231,15,121,19,21,350,110,515,165,12,110,111)
smb <- c(1,2,3,1,3,1,1,3,1,2,1,1,4,2,4,3,1,2,2)
first <- data.frame(naicstest,employment,smb)
first<-first %>% mutate(twodigit = substr(naicstest,1,2))
second <- first %>% group_by(twodigit) %>% summarize(indtotal = sum(employment))
Desired result is below
twodigit indtotal smb smbtotal smbshare
51 343 1 23 (11+12) 23/343
51 343 2 130 130/343
51 343 3 190 190/343
51 343 4 0 0/343
52 1068 1 17 17/1068
52 1068 2 221 (110+111) 221/1068
52 1068 3 315 315/1068
52 1068 4 515 515/1068
This gives you all the columns you need, but in a slightly different order. You could use select or relocate to get them in the order you want I suppose:
library(dplyr)
library(tidyr) # for complete()
first %>%
group_by(twodigit, smb) %>%
summarize(smbtotal = sum(employment)) %>%
ungroup() %>%
complete(twodigit, smb, fill = list('smbtotal' = 0)) %>%
group_by(twodigit) %>%
mutate(
indtotal = sum(smbtotal),
smbshare = smbtotal / indtotal
)
`summarise()` has grouped output by 'twodigit'. You can override using the `.groups` argument.
# A tibble: 32 × 5
# Groups: twodigit [8]
twodigit smb smbtotal indtotal smbshare
<chr> <dbl> <dbl> <dbl> <dbl>
1 51 1 23 343 0.0671
2 51 2 130 343 0.379
3 51 3 190 343 0.554
4 51 4 0 343 0
5 52 1 17 1068 0.0159
6 52 2 221 1068 0.207
7 52 3 315 1068 0.295
8 52 4 515 1068 0.482
9 53 1 21 252 0.0833
10 53 2 0 252 0
# … with 22 more rows
I have a dataframe in long format which is organised in this way:
help<- read.table(text="
ID Sodium H
1 140 31.9
1 138 29.6
1 136 30.6
2 145 35.9
2 137 33.3
3 148 27.9
4 139 30.0
4 128 32.4
4 143 35.3
4 133 NA", header = TRUE)
I need the worst value in each subject (ID) for Sodium and H. The worst value for H is defined as the value furthest away from the range 41-49, while the worst value for Sodium is defined as the value furthest away from the range 134-154.
The end result should therefore become something like this:
help<- read.table(text="
ID Sodium H
1 136 29.6
2 137 33.3
3 148 27.9
4 128 30.0 ", header=TRUE)
What is the easiest way to do this? Using aggregate function or dplyr? Or something else? Thank you in advance!
Here's a tidy version:
library(dplyr)
help %>%
group_by(ID) %>%
slice(which.max(abs(H - 45))) %>%
ungroup()
# # A tibble: 4 x 4
# ID DateTime Sodium H
# <int> <chr> <int> <dbl>
# 1 1 2020-07-27T11:00 138 29.6
# 2 2 2020-07-25T10:00 137 33.3
# 3 3 2020-07-27T14:00 148 27.9
# 4 4 2020-07-26T10:00 139 30
If it's possible that an ID may not have something out of limits, then the "worst" might return something within limits. If this is not desired, you can always add a filter to prevent within-limits:
help %>%
group_by(ID) %>%
slice(which.max(abs(H - 45))) %>%
ungroup() %>%
filter(!between(H, 41, 49))
The premise for Sodium is the same, using abs and the difference between its value and the midpoint of the desired range (144 for 134-154, just as 45 is the midpoint of 41-49 for H):
help %>%
group_by(ID) %>%
slice(which.max(abs(Sodium - 144))) %>%
ungroup()
# # A tibble: 4 x 4
# ID DateTime Sodium H
# <int> <chr> <int> <dbl>
# 1 1 2020-07-27T18:00 136 30.6
# 2 2 2020-07-25T10:00 137 33.3
# 3 3 2020-07-27T14:00 148 27.9
# 4 4 2020-07-26T12:00 128 32.4
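The two calls above each keep one whole row per ID, so the worst Sodium and the worst H may come from different rows. If the goal is a single row per ID holding the worst value of each column independently, as in the desired output in the question, one possible sketch uses summarise(). It reuses the midpoints 144 and 45 from above and the sample data from the question; the name worst is only illustrative.

```r
library(dplyr)

# Sample data copied from the question
help <- read.table(text = "
ID Sodium H
1 140 31.9
1 138 29.6
1 136 30.6
2 145 35.9
2 137 33.3
3 148 27.9
4 139 30.0
4 128 32.4
4 143 35.3
4 133 NA", header = TRUE)

worst <- help %>%
  group_by(ID) %>%
  summarise(
    # which.max() skips NA, so the missing H for ID 4 is ignored
    Sodium = Sodium[which.max(abs(Sodium - 144))],
    H      = H[which.max(abs(H - 45))],
    .groups = "drop"
  )

worst
```

Note that this assumes every ID has at least one non-missing value in each column; a group that is entirely NA would need separate handling.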
I need to calculate summary statistics for observations of bird breeding activity for each of 150 species. The data frame has the species (sCodef), the type of observation (codef) (e.g. nest building), and the ordinal date (jdate, days since 1 January, since the data were collected over multiple years). Using dplyr I get exactly the result I want.
library(dplyr)
library(tidyr)
phenology %>% group_by(sCodef, codef) %>%
summarize(N=n(), Min=min(jdate), Max=max(jdate), Median=median(jdate))
# A tibble: 552 x 6
# Groups: sCodef [?]
sCodef codef N Min Max Median
<fct> <fct> <int> <dbl> <dbl> <dbl>
1 ABDU AY 3 172 184 181
2 ABDU FL 12 135 225 188
3 ACFL AY 18 165 222 195
4 ACFL CN 4 142 156 152.
5 ACFL FL 10 166 197 192.
6 ACFL NB 6 139 184 150.
7 ACFL NY 6 166 207 182
8 AMCO FL 1 220 220 220
9 AMCR AY 53 89 198 161
10 AMCR FL 78 133 225 166.
# ... with 542 more rows
How do I get these summary statistics into some sort of data object so that I can export them to use ultimately in a Word document? I have tried this and gotten an error. All of the many explanations of summarize I have reviewed just show the summary data on screen. Thanks
out3 <- summarize(N=n(), Min=min(jdate), Max=max(jdate), median=median(jdate))
Error: This function should not be called directly
Assign this to a variable, then write to a csv like so:
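summarydf <- phenology %>% group_by(sCodef, codef) %>%
summarize(N=n(), Min=min(jdate), Max=max(jdate), Median=median(jdate))
# the argument is called file, not filename
write.csv(summarydf, file="yourfilenamehere.csv")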
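The error in the question comes from calling summarize() without passing a data frame, so n() has no data to count. As a rough, self-contained sketch of the assign-then-export idea: the small phenology data frame below is invented purely for illustration, and the file goes to a temporary path rather than a real file name.

```r
library(dplyr)

# Hypothetical stand-in for the phenology data frame
phenology <- data.frame(
  sCodef = c("ABDU", "ABDU", "ACFL", "ACFL", "ACFL"),
  codef  = c("AY", "AY", "FL", "FL", "FL"),
  jdate  = c(172, 184, 166, 192, 197)
)

# Store the summary in an object instead of only printing it
summarydf <- phenology %>%
  group_by(sCodef, codef) %>%
  summarize(N = n(), Min = min(jdate), Max = max(jdate),
            Median = median(jdate), .groups = "drop")

# Write it out, then read it back to confirm the round trip
out_file <- tempfile(fileext = ".csv")
write.csv(summarydf, file = out_file, row.names = FALSE)
check <- read.csv(out_file)
check
```

The resulting CSV can be opened in a spreadsheet program and pasted into Word as a table.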
Below is the code to get the top 10 most frequent values in one of the variables in my data frame.
#Remove NAs
dataL[dataL == "NA"] <- NA
dataS <- na.omit(dataL)
#getting the Top10 frequent values
Y <- dataS$Variable
X <- sort(table(Y), decreasing=TRUE)[1:10]
Z <- data.frame(X)
colnames(Z)= c("Value", "Frequency")
And this is the output of it
Value Frequency
1 1 635
2 0 296
3 1,000,000 115
4 10,000,000 110
5 20,000,000 104
6 5,000,000 101
7 50,000,000 86
8 25,000,000 85
9 30,000,000 80
10 40,000,000 77
I want to add a new column showing each frequency as a % of the total. I also want to add a row for the frequency of missing values and a row for the combined frequency of all other values that are not in the top 10. So the output should look like below.
Value Frequency % of Total
0 Missing 67 0.50%
1 1 635 4.60%
2 0 296 2.10%
3 1,000,000 115 0.80%
4 10,000,000 110 0.80%
5 20,000,000 104 0.70%
6 5,000,000 101 0.70%
7 50,000,000 86 0.60%
8 25,000,000 85 0.60%
9 30,000,000 80 0.60%
10 40,000,000 77 0.60%
11 All other 12,136 87.40%
I believe this does what you want.
First, make up some data. Note argument useNA = "ifany" in the call to table and that I do not subset X, I use the entire table.
set.seed(5787) # Make the results reproducible
p <- runif(100)
Y <- sample(100, 1e4, TRUE, prob = p/sum(p))
Y[sample(100, 10)] <- NA
X <- sort(table(Y, useNA = "ifany"), decreasing=TRUE)
Z <- data.frame(X)
colnames(Z)= c("Value", "Frequency")
Z$Value <- as.character(Z$Value)
Now, just compute the parts and put the pieces together.
Z[['% of Total']] <- 100*Z[["Frequency"]]/sum(Z[["Frequency"]])
Other <- c("All Other", colSums(Z[-c(1:10, which(is.na(Z$Value))), 2:3]))
Z <- rbind(Z[is.na(Z$Value), ], head(Z, n = 10), Other)
Z$Value[is.na(Z$Value)] <- "Missing"
row.names(Z) <- NULL
Z
# Value Frequency % of Total
#1 Missing 10 0.1
#2 61 202 2.02
#3 13 200 2
#4 23 197 1.97
#5 55 197 1.97
#6 16 196 1.96
#7 25 189 1.89
#8 48 189 1.89
#9 58 185 1.85
#10 79 183 1.83
#11 54 181 1.81
#12 All Other 8081 80.81
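Because c("All Other", colSums(...)) mixes text with numbers, the Frequency and % of Total columns in the result above end up as character. If numeric columns are preferred, the same table can be assembled piece by piece. This is only a sketch on made-up data; the names y, counts, top and out are illustrative and not from the question.

```r
set.seed(1)  # reproducible made-up data
y <- sample(c(1:20, NA), 500, replace = TRUE)

# table() drops NA by default, so counts holds non-missing values only
counts <- sort(table(y), decreasing = TRUE)
top <- counts[1:10]

out <- data.frame(
  Value = c("Missing", names(top), "All other"),
  Frequency = c(sum(is.na(y)),               # missing values
                as.integer(top),             # top 10 values
                sum(counts) - sum(top))      # everything else
)
out[["% of Total"]] <- round(100 * out$Frequency / sum(out$Frequency), 2)
out
```

Keeping the columns numeric makes later sorting or formatting (for example with sprintf or paste0 to append a % sign) straightforward.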