Conditional structures in R

I am a new user of R.
I have a data frame like:

Month     eqqBio  eqqLi  ...
January       20  20000
February     100    500
...

The two columns eqqBio and eqqLi are percentages relative to the previous year.
My aim: if the percentage relative to the previous year is 1000 or more, flag it or print a message indicating an outlier.
It's just that I don't understand the conditional structure, and I don't know how to proceed.
Thanks in advance!

df <- data.frame(Month = c("January", "February"),
                 eqqBio = c(20, 100),
                 eqqLi = c(20000, 500))
df$alert <- df$eqqBio >= 1000 | df$eqqLi >= 1000
That gives you a new column: if both eqqBio and eqqLi are below 1000, the value in the corresponding row is FALSE; otherwise it is TRUE.
If you have more than 2 columns you can do it like this:
# sample data
df <- data.frame(Month = c("January", "February", "March"),
                 eqqBio = c(20, 100, 0),
                 eqqLi = c(20000, 500, 0),
                 dummy1 = c(0, 0, 1001),
                 dummy2 = c(0, 0, 0))
# Check if any value in the row is >= 1000, but only check columns 2:5.
df$alert <- apply(df[, 2:5], 1, function(x) any(x >= 1000))
Adjust the columns you wish to include by changing 2:5 in df[, 2:5].
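If the percentage columns are easier to pick out by name than by position, the same idea works with name-based indexing. A minimal sketch, assuming Month (and the alert column just created) are the only non-percentage columns; pct_cols is a name introduced here:

# every column except Month and alert, so newly added percentage
# columns are picked up automatically
pct_cols <- setdiff(names(df), c("Month", "alert"))
df$alert <- apply(df[, pct_cols], 1, function(x) any(x >= 1000))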

Related

Select value from data frame within a period

I have the following problem - there are two data frames:
observations: 100 observations (= individuals), each with a specific "observation date" - this date is the baseline. There are further time points after 6 months (and maybe after 12 months).
visits: observations (individuals) visit a place several times during a period of 12 months, e.g. observation 1 has 24 visits, observation 2 has 32 visits, etc. During every visit, data are collected - among others the "important value" (numeric). Each value is tied to a visit date in the same row, so the "important value" occurs several times per observation (e.g. 24 times for observation 1, 32 times for observation 2, etc.).
Now I would like to read, from the visits data frame, the "important value" that fits best to approximately 6 months (+/- 15 days) after the observation date, and put this value in a new column of the observations data frame. In other words, out of the many visits per observation, I want the important value measured closest to 6 months after baseline. The +/- 15 days window exists because the important value is not collected exactly 6 months later - it can be a little earlier or later.
I created a dummy dataset:
library(dplyr)
library(lubridate)
library(data.table)
set.seed(123)
observations <- data.frame(observation_id = seq(1, 100, 1),
                           observation_date = sample(seq(as.Date('1980/01/01'), as.Date('2010/12/31'), by = "day"), 100),
                           age = round(runif(100, min = 1, max = 80)),
                           bmi = round(runif(100, min = 19, max = 30)),
                           amount = round(runif(100, min = 10, max = 10000)),
                           stringsAsFactors = FALSE)
observations$observation_date_6months_later <- observations$observation_date + days(180)
observations$observation_date_6months_later_start <- observations$observation_date_6months_later - days(10)
observations$observation_date_6months_later_end <- observations$observation_date_6months_later + days(10)
visit_data <- data.frame(visit_observation_id = round(runif(10000, min = 1, max = 100)),
                         visit_id = seq(1, 10000, 1),
                         visit_date = sample(seq(as.Date('1980/01/01'), as.Date('2020/12/31'), by = "day"), 10000),
                         important_value = round(runif(10000, min = 0.01, max = 50), 2),
                         var2 = round(runif(10000, min = 1, max = 1000)),
                         var3 = round(runif(10000, min = 1, max = 9000)),
                         stringsAsFactors = FALSE)
observations$observation_date_6months_later_important_value <- NA # I would like to read the best fitting value from data frame "visit_data"
data_joined <-
  dplyr::inner_join(visit_data, observations,
                    by = c('visit_observation_id' = 'observation_id')) %>%
  arrange(visit_date)
I would be glad if someone could help me with this problem!
Thank you!
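One way to finish from data_joined - a sketch under the assumption that "fits best" means the visit closest to the 6-month date within the window (slice_min needs dplyr >= 1.0; best_fit is a name introduced here):

# Keep only visits inside each observation's window, then take the
# visit whose date is closest to the 6-month target.
best_fit <- data_joined %>%
  filter(visit_date >= observation_date_6months_later_start,
         visit_date <= observation_date_6months_later_end) %>%
  group_by(visit_observation_id) %>%
  slice_min(abs(as.numeric(visit_date - observation_date_6months_later)),
            n = 1, with_ties = FALSE) %>%
  ungroup()
# Write the value back; observations with no visit in the window stay NA.
observations$observation_date_6months_later_important_value <-
  best_fit$important_value[match(observations$observation_id,
                                 best_fit$visit_observation_id)]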

Remove columns with uniform values

I have a data frame with 100+ columns. I want to filter those columns based on their non-uniformity.
E.g., if there are columns with more than 90% (or 95%, or 99%) NAs (or zeros, or -999, or whatever value), remove them from the data frame.
I can handle NAs or zeros, but the problem is that I don't know in advance which value it will be.
Example of removing columns with more than 90% NAs: df[, which(colMeans(!is.na(df)) > 0.9)]
I would simply use table to count the number of occurrences of each value; when the maximum of these counts exceeds the chosen threshold, you can discard the column.
In the following toy example, x, y and z are (nearly) "constant". For x, 96% of the values are NA; for y, 99% are 0; and for z, 97% are -1 (but any value would work).
set.seed(26012023)
df <- data.frame(w = rnorm(100),
                 x = c(rep(NA, 96), rnorm(4)),
                 y = c(rep(0, 99), rnorm(1)),
                 z = c(rep(-1, 97), rnorm(3)))
apply(df, 2, function(x, cutoff = .95) {
  tab <- table(x, useNA = "ifany")
  max_val <- max(tab)
  max_val >= cutoff * length(x)
})
# w x y z
# FALSE TRUE TRUE TRUE
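To actually drop the flagged columns rather than just report them, the same test can be negated and used as a column index. A short sketch continuing from the toy df above (keep is a name introduced here):

# TRUE for columns whose most frequent value stays under the cutoff
keep <- !apply(df, 2, function(x, cutoff = .95) {
  tab <- table(x, useNA = "ifany")
  max(tab) >= cutoff * length(x)
})
df_filtered <- df[, keep, drop = FALSE]  # only w survives here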
We can create a toy example by defining the following data.frame named df:
# Seed to make it reproducible
set.seed(12345)
df <- data.frame(Var1 = c(rep(10, 19), 1),
                 Var2 = sample(letters[1:5], 20, prob = c(0.8, 0.1, 0.5, 0.25, 0.25), replace = TRUE),
                 Var3 = sample(c("Yes", "No"), 20, prob = c(.95, .05), replace = TRUE),
                 Var4 = sample(1:3, 20, replace = TRUE),
                 Var5 = c(rep(NA, 15), rep(1, 5)))
Then we compute the maximum relative frequency of a single value for each column, and finally we delete those columns that exceed the required threshold:
# Calculate the maximum relative frequency of a single value for each column
aux <- apply(df, 2, function(x) max(prop.table(table(x, useNA = "ifany"))))
# Define new.df as df without the columns where one value appears more than 90% of the time
# (logical indexing avoids the edge case where which() matches nothing and -which() drops every column)
new.df <- df[, aux <= 0.9, drop = FALSE]

How to write a function that collects a specific list of observations from a time series data frame

In the data set created below, assume I randomly picked up 20 flat rocks. Each of these rocks was assigned a unique ID number. I measured the concentration of 7 substances (Copper, Iron, Carbon, Lead, Mg, CaCO, and Zinc) along the surface of each rock's longest axis. Distance is recorded in mm and is therefore a function of each rock's length. Note that not all rocks are the same length. Location is a grouping variable that describes where the rock was picked up.
ID <- data.frame(ID = rep(c(12, 122, 242, 329, 595, 130, 145, 245, 654, 878), each = 200))
ID2 <- data.frame(ID = rep(c(863, 425, 24, 92, 75, 3, 200, 300, 40, 500), each = 300))
RockID <- data.frame(RockID = c(unlist(ID), unlist(ID2)))
Location <- rep(c("Alpha", "Beta", "Charlie", "Delta", "Echo"), each = 1000)
a <- rep(c(1:200), times = 10)
b <- rep(c(1:300), times = 10)
Time <- data.frame(Time = c(unlist(a), unlist(b)))
set.seed(1)
Copper <- rnorm(5000, mean = 0, sd = 5)
Iron <- rnorm(5000, mean = 0, sd = 10)
Carbon <- rnorm(5000, mean = 0, sd = 1)
Lead <- rnorm(5000, mean = 0, sd = 4)
Mg <- rnorm(5000, mean = 0, sd = 6)
CaCO <- rnorm(5000, mean = 0, sd = 2)
Zinc <- rnorm(5000, mean = 0, sd = 3)
data <- cbind(RockID, Location, Time, Copper, Iron, Carbon, Lead, Mg, CaCO, Zinc)
data$ID <- as.factor(data$RockID)
I want to create a new data frame that contains the following information:
1. The first observation and the last observation for each individual
2. The average of the first 3 observations and last 3 observations for each individual
3. The same as step 2. for the first and last 5, 7, and 10 observations
I want the new data frame to be set up like this:
ID   FirstPt  First3  First5  First7  First10  LastPt  Last3  Last5  Last7  Last10
12   …        …       …       …       …        …       …      …      …      …
122  …        …       …       …       …        …       …      …      …      …
(one row per rock: 242, 329, 595, 130, 145, 245, 654, 878, 863, 425, etc.)
How would I write a function to accomplish this?
We can create helper functions that calculate the average of the first and last n values. Use pivot_longer to get the data in long format, group_by each RockID and substance, and calculate the means.
library(dplyr)
average_of_first_n_values <- function(value, x) mean(head(value, x))
average_of_last_n_values <- function(value, x) mean(tail(value, x))
data %>%
  tidyr::pivot_longer(cols = Copper:Zinc) %>%
  group_by(RockID, name) %>%
  summarise(first_obs = first(value),
            last_obs = last(value),
            first_3_avg = average_of_first_n_values(value, 3),
            first_5_avg = average_of_first_n_values(value, 5),
            first_7_avg = average_of_first_n_values(value, 7),
            first_10_avg = average_of_first_n_values(value, 10),
            last_3_avg = average_of_last_n_values(value, 3),
            last_5_avg = average_of_last_n_values(value, 5),
            last_7_avg = average_of_last_n_values(value, 7),
            last_10_avg = average_of_last_n_values(value, 10))
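The result above is one row per RockID and substance combination. If the single wide table from the question is wanted (one row per rock, with the substance baked into the column names), a pivot_wider can be chained on. A sketch, assuming the pipeline above was saved as summary_long (a name introduced here) and that per-substance column names are acceptable:

# spread the substances into columns, one row per rock
summary_wide <- summary_long %>%
  ungroup() %>%
  tidyr::pivot_wider(names_from = name,
                     values_from = first_obs:last_10_avg)
# yields columns like first_obs_Copper, first_3_avg_Zinc, ...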

Average Cells of Two or More DataFrames

So I currently have 3 data frames that I need to average cell by cell, and I am at a loss for how to do this... Essentially, I need the mean of the first observation in column 1 across df1, df2 and df3, and likewise for every other cell.
Here is a reproducible sample data.
set.seed(789)
df1 <- data.frame(a = runif(100, 0, 100),
                  b = runif(100, 0, 100),
                  c = runif(100, 0, 100),
                  d = runif(100, 0, 100))
df2 <- data.frame(a = runif(100, 0, 100),
                  b = runif(100, 0, 100),
                  c = runif(100, 0, 100),
                  d = runif(100, 0, 100))
df3 <- data.frame(a = runif(100, 0, 100),
                  b = runif(100, 0, 100),
                  c = runif(100, 0, 100),
                  d = runif(100, 0, 100))
I need to create a fourth data frame of dimensions 100 by 4 that is the result of averaging each cell across the first three dataframes. Any ideas are highly appreciated!
We can do this with Reduce and +, dividing by the number of datasets. This has the flexibility of handling any number of datasets kept in a list:
dfAvg <- Reduce(`+`, mget(paste0("df", 1:3)))/3
Another option is to convert to an array and then use apply, which also has the option of removing missing values (na.rm = TRUE):
apply(array(unlist(mget(paste0("df", 1:3))), c(dim(df1), 3)), 2, rowMeans, na.rm = TRUE)
As #user20650 mentioned, rowMeans can be applied directly on the array by setting dims:
rowMeans(array(unlist(mget(paste0("df", 1:3))), c(dim(df1), 3)), dims=2)
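A quick sanity check, comparing one cell by hand and the two routes against each other (a sketch; dfAvg2 is a name introduced here):

dfAvg2 <- rowMeans(array(unlist(mget(paste0("df", 1:3))), c(dim(df1), 3)), dims = 2)
# the first cell should equal the hand-computed mean of the three inputs
all.equal(dfAvg2[1, 1], mean(c(df1[1, 1], df2[1, 1], df3[1, 1])))
# and both routes should agree everywhere
all.equal(as.matrix(dfAvg), dfAvg2, check.attributes = FALSE)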

Sampling out of tables depending on other variables (R)

I am a physician who just started working in R, and I would appreciate any help with this question:
I have 2 tables (A, B) with the variables age (continuous), sex (binary) and test_value (binary). Each table has a different age and sex distribution.
set.seed(10)
AgeA <- round(rnorm(100, mean = 40, sd = 15))
SexA <- sample(c("M","F"), 100, replace = TRUE, prob = c(0.5, 0.5))
Test_ValueA <- rbinom(100, 1, 0.3)
set.seed(20)
AgeB <- round(rnorm(1000, mean = 50, sd = 15))
SexB <- sample(c("M","F"), 1000, replace = TRUE, prob = c(0.5, 0.5))
Test_ValueB <- rbinom(1000, 1, 0.4)
A <- data.frame(Age = AgeA, Sex = SexA, Test = Test_ValueA)
B <- data.frame(Age = AgeB, Sex = SexB, Test = Test_ValueB)
genderA <- prop.table(table(A[,2]))
TestA <- prop.table(table(A[,3]))
paste("median age in group A is", median(A[,1]),
      "percentage female in group A is", genderA[1],
      "percentage of test positive in A is", TestA[2])
genderB <- prop.table(table(B[,2]))
TestB <- prop.table(table(B[,3]))
paste("median age in group B is", median(B[,1]),
      "percentage female in group B is", genderB[1],
      "percentage of test positive in B is", TestB[2])
The difference in test proportion is now confounded by age and sex.
Now I would like to match the patients from table A with table B to adjust for age and sex. Because B is the smaller cohort, I would prefer to sample out of A and match to B. Is the Match package an option? Any other ideas?
Hopefully I was able to explain my problem.
Any hints as to which functions this may point to?
Hello, I have a possible answer. I will build two populations of 100 people with the characteristics you described:
set.seed(10)
AgeA <- rnorm(100, mean = 30, sd = 10)
# population A is 50 percent male
SexA <- sample(c("M","F"), 100, replace = TRUE, prob = c(0.5, 0.5))
Test_ValueA <- rbinom(100, 1, 0.5)
set.seed(20)
AgeB <- rnorm(100, mean = 30, sd = 10)
# population B is 80 percent male
SexB <- sample(c("M","F"), 100, replace = TRUE, prob = c(0.8, 0.2))
Test_ValueB <- rbinom(100, 1, 0.3)
A <- data.frame(Age = AgeA, Sex = SexA, Test = Test_ValueA)
B <- data.frame(Age = AgeB, Sex = SexB, Test = Test_ValueB)
Then using dplyr you can summarise population B parameters:
library(dplyr)
Bsummary <- group_by(B,Sex)
Bsummary <- summarise(Bsummary,
                      PercenteagePositive = sum(Test == 1) / length(Test == 1),
                      PercenteageSex = n() / 100)
Bsummary
If you look at the results, B is 76% male and 24% female; so if you sampled 20 people from A, you would have to sample 15 males and 5 females. First you separate population A into males and females:
Amale <- filter(A, Sex == "M")
Afemale <- filter(A, Sex == "F")
And from that you sample 15 males and 5 females:
SampleAMale <- Amale[sample(nrow(Amale), 15), ]
SampleAFemale <-Afemale[sample(nrow(Afemale), 5), ]
Then join them and you can summarise them:
sampleA <- rbind(SampleAMale, SampleAFemale)
ASampleSummary <- group_by(sampleA,Sex)
ASampleSummary <- summarise(ASampleSummary,
                            PercenteagePositive = sum(Test == 1) / length(Test == 1),
                            PercenteageSex = n() / 100)
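The 15/5 split above is hard-coded; it can instead be derived from B's sex distribution, so the same code works for any target sample size. A sketch (n_sample and sizes are names introduced here):

n_sample <- 20
sizes <- round(n_sample * prop.table(table(B$Sex)))  # named counts for F and M
SampleAMale <- Amale[sample(nrow(Amale), sizes["M"]), ]
SampleAFemale <- Afemale[sample(nrow(Afemale), sizes["F"]), ]
# note: round() can make the sizes sum to one off n_sample for some splits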
OK Fank, I think you will like this answer a little better. The first part is the same, except that the age is rounded:
set.seed(10)
AgeA <- round(rnorm(100, mean = 30, sd = 2))
# population A is 50 percent male
SexA <- sample(c("M","F"), 100, replace = TRUE, prob = c(0.5, 0.5))
Test_ValueA <- rbinom(100, 1, 0.5)
set.seed(20)
AgeB <- round(rnorm(100, mean = 30, sd = 2))
# population B is 80 percent male
SexB <- sample(c("M","F"), 100, replace = TRUE, prob = c(0.8, 0.2))
Test_ValueB <- rbinom(100, 1, 0.3)
A <- data.frame(Age = AgeA, Sex = SexA, Test = Test_ValueA)
B <- data.frame(Age = AgeB, Sex = SexB, Test = Test_ValueB)
Now you just use prop.table to get the proportions of your population. Let's say you want to sample 1000 individuals from B in the same proportions as A in terms of age and sex; you do this:
1000*(prop.table(table(A[,1:2])))
Then by applying filters you can sample within groups. For example, if you want to get only the 30-year-old males in group B:
BMale30 <- filter(B, Sex == "M" & Age == 30)
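Back to the original question about a matching package: MatchIt is one widely used option for this kind of age-and-sex matching (possibly what was meant by "the match package"). A sketch only, not tested on the data above; the group indicator and the matching options are assumptions:

library(MatchIt)
# stack the two cohorts with an indicator for cohort membership
AB <- rbind(transform(A, group = 1), transform(B, group = 0))
# nearest-neighbour matching on Age, exact matching on Sex
m <- matchit(group ~ Age + Sex, data = AB, method = "nearest", exact = ~ Sex)
matched <- match.data(m)  # the matched subset, ready for comparison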
