How to fill a column based on a condition taking other columns into account? - r

To fill an empty column of a data frame based on a condition that takes another column into account, I have found the following solution. It works fine, but it is a little bit ugly. Does anybody know a more elegant way to solve this?
base::set.seed(123)
test_df <- base::data.frame(vec1 = base::sample(base::seq(1, 100, 1), 50),
                            vec2 = base::seq(1, 50, 1),
                            vec3 = NA)
for (a in 1:base::nrow(test_df)) {
  spc_test_df <- test_df[a, ]
  # select the specific row of the data frame
  if (spc_test_df$vec1 <= 25 | spc_test_df$vec1 >= 75) {
    # evaluate whether the value is below/above the thresholds
    spc_test_df$vec3 <- 1
    # if so, write 1
  } else {
    spc_test_df$vec3 <- 0
    # if not, write 0
  }
  test_df[a, ] <- spc_test_df
  # write the specific row back to the data frame
}

There is no need for a for-loop, as you can use a vectorised solution in this case. Here are three options:
# option 1
test_df$vec3 <- +(test_df$vec1 <= 25 | test_df$vec1 >= 75)
# option 2
test_df$vec3 <- as.integer(test_df$vec1 <= 25 | test_df$vec1 >= 75)
# option 3
test_df$vec3 <- ifelse(test_df$vec1 <= 25 | test_df$vec1 >= 75, 1, 0)
which in all cases gives:
   vec1 vec2 vec3
1     5    1    1
2     6    2    1
3    61    3    0
4    20    4    1
....
47    3   47    1
48   55   48    0
49   44   49    0
50   97   50    1
(only the first and last four rows presented)
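For completeness, the same condition can also be written in dplyr style. This variant is my addition and assumes the dplyr package is available:
library(dplyr)
# recode vec3 as 1 when vec1 is at most 25 or at least 75, otherwise 0
test_df <- test_df %>%
  mutate(vec3 = as.integer(vec1 <= 25 | vec1 >= 75))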

Related

How to create area around value?

Sorry, my question is probably not very clear because I struggle to formulate it, so I will explain by example.
I have two data frames, df and df1:
df <- data.frame(a = c(25,15,35,45,2))
df1 <- data.frame(b = c(28,25,24,43,10))
I want to merge the two data frames on the condition that the values are within +-5 of each other, and create a column distance. For example, the first element in column a is 25; I want to compare 25 with all elements in column b and keep only those within +-5 of 25. The output should look like:
 a  b distance
25 28        3
   24        1
   25        0
15 10        5
45 43        2
Values which are not within +-5 of any match, such as 2 and 35, should be excluded.
We may use outer to create a logical matrix and get the row/column indices with which and arr.ind = TRUE. Use the indices to subset the 'a' and 'b' columns from the corresponding data frames and compute the difference:
i1 <- which(outer(df$a, df1$b, FUN = function(x, y) abs(x - y) <= 5),
            arr.ind = TRUE)
transform(data.frame(a = df$a[i1[,1]], b = df1$b[i1[,2]]), distance = abs(a - b))
Output:
   a  b distance
1 25 28        3
2 25 25        0
3 25 24        1
4 45 43        2
5 15 10        5
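An alternative base R sketch (my addition, not part of the approach above) enumerates all pairs with expand.grid and then filters:
# build all a/b combinations, compute the absolute distance, keep pairs within +/- 5
pairs <- expand.grid(a = df$a, b = df1$b)
pairs$distance <- abs(pairs$a - pairs$b)
subset(pairs, distance <= 5)
This keeps the same five pairs as above.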

Looping over the first 100 values then looping over the next 100 values

I have a vector of True and False values. The length of the vector is 1000.
vect <- [T T F T F F..... x1000]
I want to loop over the first 100 values (i.e. 1:100), count the TRUE and FALSE values, and store the result in some variables (e.g. True <- 51, False <- 49). Then loop over the next 100 values (101:200) and do the same computation, and so on until I reach 1000.
The code below is pretty standard but, instead of slicing the vector, it calculates sums for the entire vector.
count_True <- 0
count_False <- 0
for (i in vect) {
  if (i) {   # vect is logical, so test the value directly rather than comparing to the string 'T'
    count_True <- count_True + 1
  } else {
    count_False <- count_False + 1
  }
}
I am aware that you can split the vector with
vect_splt <- split(vect,10)
but is there a way to combine these to do what I wanted, or any other way?
Does something like this work:
library(dplyr)

set.seed(42)
vect <- sample(rep(c(T, F), 500))
vect <- tibble(vect)
vect %>%
  mutate(seq = row_number() %/% 100) %>%   # integer division puts row 100 into group 1, row 200 into group 2, ..., hence 11 groups below
  group_by(seq) %>%
  summarise(n_TRUE = sum(vect),
            n_FALSE = sum(!vect))
# A tibble: 11 x 3
     seq n_TRUE n_FALSE
   <dbl>  <int>   <int>
 1     0     42      57
 2     1     56      44
 3     2     50      50
 4     3     55      45
 5     4     43      57
 6     5     48      52
 7     6     48      52
 8     7     54      46
 9     8     51      49
10     9     53      47
11    10      0       1
We can use split together with table. With a grouping index created by gl, split the vector into a list of vectors, get the counts with table, and store them in a list:
out <- lapply(split(vect, as.integer(gl(length(vect), 100, length(vect)))), table)
This can be converted to a single dataset by rbinding:
out1 <- do.call(rbind, out)
data
set.seed(24)
vect <- sample(c(TRUE, FALSE), 1000, replace = TRUE)
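For reference, a one-call variant (my addition) cross-tabulates a block index against the values directly; it assumes the vector length is an exact multiple of 100:
# gl(10, 100) builds a factor 1, 1, ..., 2, 2, ... with each level repeated 100 times
table(block = gl(10, 100), value = vect)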

Re-bin a data frame in R

I have a data frame which holds activity (A) data across time (T) for a number of subjects (S) in different groups (G). The activity data were sampled every 10 minutes. What I would like to do is to re-bin the data into, say, 30-minute bins (either adding or averaging values) keeping the subject Id and group information.
Example. I have something like this:
S G T A
1 A 30 25
1 A 40 20
1 A 50 15
1 A 60 20
1 A 70 5
1 A 80 20
2 B 30 10
2 B 40 10
2 B 50 10
2 B 60 20
2 B 70 20
2 B 80 20
And I'd like something like this:
S G T A
1 A 40 20
1 A 70 15
2 B 40 10
2 B 70 20
Whether time is the average time (as in the example) or the first/last time point and whether the activity is averaged (again, as in the example) or summed is not important for now.
I will appreciate any help you can provide on this. I was thinking about creating a script in Python to re-bin this particular dataframe, but I thought that there may be a way of doing it in R in a way that may be applied to any dataframe with differing numbers of columns, etc.
There are several ways to arrive at the desired data frame.
I have reproduced your dataframe:
df <- data.frame(S = c(rep(1, 6), rep(2, 6)),
                 G = c(rep("A", 6), rep("B", 6)),
                 T = rep(seq(30, 80, 10), 2),
                 A = c(25, 20, 15, 20, 5, 20, 10, 10, 10, 20, 20, 20))
The classical way could be:
df[df$T == 40 | df$T == 70,]
The more modern tidyverse way is
library(tidyverse)
df %>% filter(T == 40 | T ==70)
If you want to get the average of each group of G filtered for T==40 and 70:
df %>%
  filter(T == 40 | T == 70) %>%
  group_by(G) %>%
  mutate(A = mean(A))
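If you do want genuine 30-minute bins (three consecutive 10-minute samples) instead of picking out fixed time points, a minimal dplyr sketch along these lines might help; labelling each bin with its mean time is my assumption:
df %>%
  mutate(bin = (T - min(T)) %/% 30) %>%   # 0 for T in 30-50, 1 for T in 60-80, ...
  group_by(S, G, bin) %>%
  summarise(T = mean(T), A = mean(A), .groups = "drop")
With the example data this reproduces the desired output (T = 40 and 70, with A averaged per subject).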

Loop within a function in R

I want to create a function that holds an ifelse statement like follows:
ArrearsL1M <- function(input1, input2, input3, input4){
  output <- 0
  output <- ifelse(df[[input1]] %in% c(1, 2, 3, 4, 5), 1,
            ifelse(df[[input2]] %in% c(1, 2, 3, 4, 5), 1,
            ifelse(df[[input3]] %in% c(1, 2, 3, 4, 5), 1,
            ifelse(df[[input4]] %in% c(1, 2, 3, 4, 5), 1, 0))))
  return(output)
}
Then I'd have this:
df$Arrears_L1M<-ArrearsL1M("col_201823","col_201822","col_201821","col_201820")
Here is an example of the data:
col_201823 col_201822 col_201821 col_201820 col_201819 col_201818 col_201817 col_201816 col_201815
1 99 5 4 2 99 99 99 99 99
2 3 0 3 2 3 3 3 3 3
3 2 2 2 2 2 2 2 2 2
4 0 0 0 1 0 0 0 0 0
5 99 99 5 99 99 99 99 99 99
6 2 1 4 99 2 2 2 2 2
7 1 1 99 99 1 1 1 1 1
So the code will check the previous 4 weeks of data starting with the most recent (i.e. 2018 week 23, week 22, week 21 and week 20)
The starting week can change, and I want to make this work so that I enter the first week and it runs the function for the past 4 weeks. I only want to enter the first week, so only one input. So if I enter col_201820 I will get the answer for weeks col_201820, col_201819, col_201818 and col_201817.
I'll want to run this for 52 weeks of data (i.e. a year ) at some point so I'm trying to make it easier to change if the starting week changes. It also needs to go to 201752, 201751, 201750,if the starting week is 201801.
I'm not sure where to start with this so can't show you anything I've already tried.
Code for a reproducible example:
col_201823<-c(99,3,2,0,99,2,1)
col_201822<-c(5,0,2,0,99,1,1)
col_201821<-c(4,3,2,0,5,4,99)
col_201820<-c(2,2,2,1,99,99,99)
col_201819<-c(99,3,2,0,99,2,1)
col_201818<-c(99,3,2,0,99,2,1)
col_201817<-c(99,3,2,0,99,2,1)
col_201816<-c(99,3,2,0,99,2,1)
col_201815<-c(99,3,2,0,99,2,1)
test<-as.data.frame(cbind(col_201823,col_201822,col_201821,col_201820,col_201819,col_201818,col_201817,col_201816,col_201815))
I guess you want to figure out how to create a vector of weeks from a starting week. For instance
weeks_from_start <- function(x) {
  week <- as.integer(substring(x, nchar(x) - 1))
  rest <- substring(x, 1, nchar(x) - 2)
  paste0(rest, seq(week, by = -1, length.out = 4))
}
so
> weeks_from_start("col_201823")
[1] "col_201823" "col_201822" "col_201821" "col_201820"
Use this at the top of your ArrearsL1M() function. I would implement this as
ArrearsL1M <- function(df, last_week) {
  weeks <- weeks_from_start(last_week)
  m <- as.matrix(df[, weeks])
  m[] <- m %in% 1:5   # test all elements in one call; keep the matrix shape
  rowSums(m) != 0
}
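With the reproducible test data from the question, a call would look like this (a usage sketch; the logical result is converted to 0/1 to match the question):
test$Arrears_L1M <- as.integer(ArrearsL1M(test, "col_201823"))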
For more complicated parsing, revise weeks_from_start() as
weeks_from_start <- function(x) {
  week0 <- as.integer(substring(x, nchar(x) - 1))
  year0 <- as.integer(substring(x, 5, 8))
  week0 <- seq(week0, by = -1, length.out = 4)
  week <- (week0 - 1) %% 52 + 1
  year <- year0 - cumsum(week0 == 0)
  sprintf("col_%4d%.2d", year, week)
}
Probably this is approaching a 'hack', e.g., do all years have 52 weeks? For a year beginning on, say, Tuesday, is week 1 Tues - Sunday, week 52 of the previous year just Monday? Time to rethink how this data is represented...
ArrearsL1M <- function(input1, input2, input3, input4){
  cols <- c(input1, input2, input3, input4)
  output <- as.numeric(apply(apply(df[, cols], 2, function(x) x %in% 1:5), 1, any))
  return(output)
}
With one input:
ArrearsL1M <- function(cols){
  output <- as.numeric(apply(apply(df[, cols], 2, function(x) x %in% 1:5), 1, any))
  return(output)
}
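To get the single-input behaviour asked for, this one-argument version can be combined with weeks_from_start() from the other answer (a usage sketch; note that the function still expects the data frame to be named df):
df <- test
df$Arrears_L1M <- ArrearsL1M(weeks_from_start("col_201823"))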

How to get a data.frame with cases from a contingency table in r?

I would like to reproduce some calculations from a book (logit regression). The book gives a contingency table and the results.
Here is the Table:
example <- matrix(c(21,22,6,51), nrow = 2, byrow = TRUE)
#Labels:
rownames(example) <- c("Present","Absent")
colnames(example) <- c(">= 55", "<55")
It gives me this:
>= 55 <55
Present 21 22
Absent 6 51
But to use the glm() function the data has to be structured in the following way:
(two columns, one with "Age" and one with "Present", filled with 0/1)
age <- c(rep(c(0),27), rep(c(1),73))
present <- c(rep(c(0),21), rep(c(1),6), rep(c(0),22), rep(c(1),51))
data <- data.frame(present, age)
> data
present age
1 0 0
2 0 0
3 0 0
. . .
. . .
. . .
100 1 1
Is there a simple way to get this structure from the table/matrix?
reshape2::melt(example)
This will give you,
Var1 Var2 value
1 Present >= 55 21
2 Absent >= 55 6
3 Present <55 22
4 Absent <55 51
which you can easily use for glm.
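As a possible follow-up (my addition), the melted counts can be passed to glm via the weights argument; check the level order of Var1 first, because glm treats the first level of a factor response as the failure category:
m <- reshape2::melt(example)
glm(Var1 ~ Var2, data = m, weights = value, family = binomial)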
You could perhaps use the countsToCases function as defined here.
countsToCases(as.data.frame(as.table(example)))
# Var1 Var2
#1 Present >= 55
#1.1 Present >= 55
#1.2 Present >= 55
#1.3 Present >= 55
#1.4 Present >= 55
#1.5 Present >= 55
# ...
You can always recode the variables to numeric afterwards, if you prefer.
I would go for:
library(data.table)
tab <- data.table(AGED = c(1, 1, 0, 0),
                  CHD = c(1, 0, 1, 0),
                  Count = c(21, 6, 22, 51))
tabExp <- tab[rep(1:.N, Count), .(AGED, CHD)]
Edit: Quick explanation, as it took me some time to figure it out:
In data.table objects .N stores the number of rows of a group (if grouped with by) or just the number of rows of the whole data.table, so in this example:
tab[rep(1:.N, Count)]
and
tab[rep(1:4, Count)]
and finally
tab[rep(1:4, c(21, 6, 22, 51))]
are equivalent.
Same with base R:
tab2 <- data.frame(AGED = c(1, 1, 0, 0),
                   CHD = c(1, 0, 1, 0),
                   Count = c(21, 6, 22, 51))
tabExp2 <- tab2[rep(1:nrow(tab2), tab2$Count), c("AGED", "CHD")]
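As a quick check (my addition), the expanded data frame can go straight into glm:
glm(CHD ~ AGED, data = tabExp2, family = binomial)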
The code below might look long, but only the group_by() and do() instructions deal with expanding the data. All the rest is about reshaping the data to long format and encoding the character variables as 0 and 1. I tried to start from the exact matrix you gave in your question.
Load data manipulation packages
library(tidyr)
library(dplyr)
Create a data frame
Create a matrix as in your example, but avoid ">" signs in column names
example <- matrix(c(21,22,6,51), nrow = 2, byrow = TRUE)
rownames(example) <- c("Present","Absent")
colnames(example) <- c("above55", "below55")
Convert the matrix to a data frame
example <- data.frame(example) %>%
  add_rownames("chd")
Or simply create a data frame directly
data.frame(chd = c("Present", "Absent"),
           above55 = c(21, 6),
           below55 = c(22, 51))
Reshape data
data2 <- example %>%
  gather(age, nrow, -chd) %>%
  # Encode chd and age as 0 or 1
  mutate(chd = ifelse(chd == "Present", 1, 0),
         age = ifelse(age == "above55", 1, 0)) %>%
  group_by(chd, age) %>%
  # Expand each variable by nrow
  do(data.frame(chd = rep(.$chd, .$nrow),
                age = rep(.$age, .$nrow)))
head(data2)
# Source: local data frame [6 x 2]
# Groups: chd, age [1]
#
# chd age
# (dbl) (dbl)
# 1 0 0
# 2 0 0
# 3 0 0
# 4 0 0
# 5 0 0
# 6 0 0
tail(data2)
# Source: local data frame [6 x 2]
# Groups: chd, age [1]
#
# chd age
# (dbl) (dbl)
# 1 1 1
# 2 1 1
# 3 1 1
# 4 1 1
# 5 1 1
# 6 1 1
table(data2)
# age
# chd 0 1
# 0 51 6
# 1 22 21
Same as your example except for the age encoding
issue mentioned in my comment above.
So, glm is not quite that inflexible. In part ?glm reads
For ‘binomial’ and ‘quasibinomial’ families the response can also
be specified as a ‘factor’ (when the first level denotes failure
and all others success) or as a two-column matrix with the columns
giving the numbers of successes and failures.
I'll assume you want to test the effect of age on Present/Absent.
The key is to specify the response as (in pseudo-code) c(success, failure).
So you need data like data.frame(Age = ..., Present = ..., Absent = ...). The easiest way to get this from your example is to transpose, coerce to a data frame, and add a column:
example_t <- as.data.frame(t(example))
example_df <- data.frame(example_t, Age=factor(row.names(example_t)))
which gives you
Present Absent Age
>= 55 21 6 >= 55
<55 22 51 <55
Then, you can run the glm:
glm(cbind(Present, Absent) ~ Age, example_df, family = 'binomial')
to get
Call: glm(formula = cbind(Present, Absent) ~ Age, family = "binomial",
    data = example_df)
Coefficients:
(Intercept) Age<55
1.253 -2.094
Degrees of Freedom: 1 Total (i.e. Null); 0 Residual
Null Deviance: 18.7
Residual Deviance: -1.332e-15 AIC: 11.99
Addendum
You could also get here via the answer by @therimalaya, but it's just the first step
as.data.frame(as.table(example))
(only gets you part way there)
Var1 Var2 Freq
1 Present >= 55 21
2 Absent >= 55 6
3 Present <55 22
4 Absent <55 51
but to actually have a column of successes and failures, you need to do something more. You could use tidyr to get there
as.data.frame(as.table(example)) %>% tidyr::spread(Var1, Freq)
is similar to my example_df above
Var2 Present Absent
1 >= 55 21 6
2 <55 22 51
