How to organize based on specific data values - r

I have this data frame:
Data <- structure(list(ID = c(101, 102, 103, 104, 105, 106
), `1Var` = c(1, 3, 3, 1, 1, 1), `2Var` = c(1, 1,
1, 1, 1, 1), `3Var` = c(3, 1, 1, 1, 1, 1), `4Var` = c(1,
1, 1, 1, 1, 1)), row.names = c(NA, 6L), class = "data.frame")
I have been trying to subset based on values of 1 and 0. In this data frame there are no 0 values, but my full data has them.
I toyed around with this method:
Prime <- grep('$Var', names(Data))
DataPrime <- Data[rowSums(Data[Prime] <= 1),]
I am getting duplicated observations, though. Another issue with this method is that it keeps every row that contains a 1 or 0, rather than only the rows in which all values are 1 or 0. So a row that has a 3 in one variable and 1 in all the others is still kept in my data.
I think my method will work but I'm not sure what else I need to specify in the argument. I tried a simple subset too but that removed everything from the data:
DataPrime <- subset(Data, '1Var' <=1, '2Var' <=1, '3Var' <=1, '4Var' <=1)
I essentially want my data to look something like this:
ID 1Var 2Var 3Var 4Var
4 104 1 1 1 1
5 105 1 1 1 1
6 106 1 1 1 1

We can use Reduce with & to create a logical vector for subsetting the rows
subset(Data, Reduce(`&`, lapply(Data[-1], `<=`, 1)))
-output
# ID 1Var 2Var 3Var 4Var
#4 104 1 1 1 1
#5 105 1 1 1 1
#6 106 1 1 1 1
Or another option is rowSums
subset(Data, !rowSums(Data[-1] > 1))

I think you're looking for something like:
Prime <- grep('\\dVar', names(Data))
Data[apply(Data[Prime], 1, function(x) !any(x > 1)),]
#> ID 1Var 2Var 3Var 4Var
#> 4 104 1 1 1 1
#> 5 105 1 1 1 1
#> 6 106 1 1 1 1
A few things to note are:
Your regex inside grep was wrong. The "$" symbol represents the end of a string, not a number. For a digit you can use \\d. Your Prime variable is therefore empty in the example.
It's best not to have column names (or any variable names) that start with a number. These are not syntactically valid names in R. You can get around this by surrounding them with backticks, but this is easy to overlook and is a source of bugs.
rowSums adds up all the values in each row. Applied to the raw values, the lowest sum for any row would be 4; applied to a logical matrix, as in rowSums(Data[Prime] <= 1), it gives the number of entries in each row that are one or less, here c(3, 3, 3, 4, 4, 4). Subsetting Data by this vector treats the counts as row indices, so you get three copies of row 3 followed by three copies of row 4, which clearly isn't what you want.
In subset, you need the logical conjunction of all your var <= 1 terms, so you should split these with &, not with commas.
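To make the notes above concrete, here is a minimal sketch that rebuilds the example data and applies each fix in turn. The intermediate names counts, keep and DataPrime2 are only illustrative, and anchoring the pattern with a trailing $ is my own addition rather than something the question requires.

```r
# Rebuild the example data; backticks are needed because the names start with a digit.
Data <- data.frame(
  ID = c(101, 102, 103, 104, 105, 106),
  `1Var` = c(1, 3, 3, 1, 1, 1),
  `2Var` = c(1, 1, 1, 1, 1, 1),
  `3Var` = c(3, 1, 1, 1, 1, 1),
  `4Var` = c(1, 1, 1, 1, 1, 1),
  check.names = FALSE  # keep the non-syntactic names exactly as given
)

# Corrected pattern: a digit followed by "Var" at the end of the name.
Prime <- grep("\\dVar$", names(Data))

# Per-row count of entries that are <= 1 (these counts are what the
# original attempt ended up using as row indices).
counts <- rowSums(Data[Prime] <= 1)

# Keep a row only when every selected column is <= 1.
keep <- counts == length(Prime)
DataPrime <- Data[keep, ]

# The same filter with subset(): backticks around names, & instead of commas.
DataPrime2 <- subset(Data, `1Var` <= 1 & `2Var` <= 1 & `3Var` <= 1 & `4Var` <= 1)
```

Both DataPrime and DataPrime2 should contain only the rows with ID 104, 105 and 106.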

Related

Calculating share of data.frame row pairs that match

I have a dataset with an id variable and several other variables, similar to this:
mydata <- tibble::tribble(
~idvar, ~age,
1, 18,
1, 18,
2, 27,
3, 89,
4, 89,
5, 12,
1, 17,
2, 27,
2, 28,
3, 41
)
For each value of idvar, I want to calculate the rate at which, given idvar is the same between a pair of rows, age is also the same. In other words, I want to know:
PR(age match | id match)
For example, there are three rows with idvar == 1, which form three pairs of rows. For one of those pairs, age also matches. So we would return .333 for idvar == 1.
Desired output:
1 .333
2 .333
3 0
4 NA
5 NA
You could use table from base R. From the manual for ?base::table:
table uses the cross-classifying factors to build a contingency table of the counts at each combination of factor levels.
In other words, we can use it to count the number of entries for each unique value of age. Where the count is more than 1, we know we have a match (or repeated value) somewhere in age.
table(mydata$age)
12 17 18 27 28 41 89
1 1 2 2 1 1 2
For your given example, we will not do this for all of age at once. Instead, we will need to group by idvar first.
Additionally, we need to use the binomial coefficient on each element of table(age) to determine how many pairs are possible, and then sum them all up to get the total number of pairs in the numerator. In R, the choose(n,k) function is the binomial coefficient. The denominator is just choose(.N, 2) (in data.table, .N is the number of rows in the current group), which is the number of all possible pairs for the group.
Putting it all together:
library(data.table)
setDT(mydata)
# Helper function
count_pairs <- function(x) {
  if (length(x) > 1) { # if more than 1 row
    if (length(table(x)[table(x) > 1]) > 0) { # if there is at least 1 match
      sum(sapply(table(x)[table(x) > 1], function(z) choose(z, 2)))
    } else {
      0 # no matches
    }
  } else {
    NA_real_ # only 1 row
  }
}
mydata[, count_pairs(age) / choose(.N, 2), by = idvar]
idvar V1
1: 1 0.3333333
2: 2 0.3333333
3: 3 0.0000000
4: 4 NA
5: 5 NA
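As a side note, the helper above can probably be condensed: choose(1, 2) is 0, so values that occur only once add nothing to the sum and the inner "at least one match" check is not strictly needed. Below is a base R sketch of that idea, without data.table. The function name match_share is hypothetical, and the data is re-entered as a plain data frame so the snippet runs on its own.

```r
# Example data from the question, as a plain data frame.
mydata <- data.frame(
  idvar = c(1, 1, 2, 3, 4, 5, 1, 2, 2, 3),
  age   = c(18, 18, 27, 89, 89, 12, 17, 27, 28, 41)
)

# Hypothetical helper: share of within-group row pairs whose values match.
match_share <- function(x) {
  n <- length(x)
  if (n < 2) return(NA_real_)  # a single row forms no pairs
  # choose(k, 2) is 0 when k == 1, so values seen once contribute nothing
  sum(choose(table(x), 2)) / choose(n, 2)
}

# Apply the helper to age within each idvar group.
shares <- tapply(mydata$age, mydata$idvar, match_share)
shares
```

The result is indexed by idvar and should agree with the data.table output: one third for groups 1 and 2, zero for group 3, and NA for the single-row groups 4 and 5.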

Subsetting of data.frames with variable name vs. column number

I am fairly new to R and I have run into a problem with subsetting data frames a number of times. I have found a fix but would just like to understand what I am missing.
Here is an exemplary bit of code, where I don't understand the functional difference.
Example data frame:
df <- data.frame(V1 = c(1:10), V2 = c(rep(1, times = 10)))
this produces an "undefined columns selected" error:
df1 <- df[df$V1 < 5, df$V2]
but this works:
df2 <- df[df$V1 < 5, 2]
I don't understand why referring to the column by its name via $V2 does not give the same result as referring to the same column by its number.
This is a really basic question, I am aware, but I would just like to get my head around it.
Thanks and also sorry if formatting is off or anything (first time posting..),
Christoph
df[df$V1 < 5, df$V2] doesn't give an "undefined columns selected" error.
df[df$V1 < 5, df$V2]
# V1 V1.1 V1.2 V1.3 V1.4 V1.5 V1.6 V1.7 V1.8 V1.9
#1 1 1 1 1 1 1 1 1 1 1
#2 2 2 2 2 2 2 2 2 2 2
#3 3 3 3 3 3 3 3 3 3 3
#4 4 4 4 4 4 4 4 4 4 4
df$V2 contains only the value 1, and column 1 exists in your data frame, so the expression selects the first column length(df$V2) times. Because duplicate column names are discouraged, R appends the suffixes .1, .2, and so on to make them unique.
This is the same as doing
df[df$V1 < 5, c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1)]
You would get an "undefined columns selected" error if you selected columns that are not present in the data.
df[df$V1 < 5, c(1, 3)]
Error in `[.data.frame`(df, df$V1 < 5, c(1, 3)) :
undefined columns selected
There are different ways to access the data.
By column name:
df[df$V1 < 5, "V2"]
#[1] 1 1 1 1
Or
df$V2[df$V1 < 5]
and by column position.
df[df$V1 < 5, 2]
#[1] 1 1 1 1
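A quick way to convince yourself of the difference is to run the variants side by side. This is only a sketch; the object names by_values, by_name, by_position and by_dollar are made up for the comparison.

```r
df <- data.frame(V1 = 1:10, V2 = rep(1, times = 10))

# df$V2 is the vector of values 1, 1, ..., 1, so as a column index it
# asks for column 1 ten times rather than for the column named V2.
by_values <- df[df$V1 < 5, df$V2]

by_name     <- df[df$V1 < 5, "V2"]  # select by column name
by_position <- df[df$V1 < 5, 2]     # select by column position
by_dollar   <- df$V2[df$V1 < 5]     # extract the column, then filter

ncol(by_values)                     # ten copies of V1
identical(by_name, by_position)     # same values either way
identical(by_name, by_dollar)
```

The three "real" selections return the same vector, while the df$V2 version returns a four-row data frame with ten columns.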

Assigning values in a column to deciles when breaks are not unique

Assume that I have a vector with 1000 numbers in it. I want to obtain the deciles of this vector and then find the mean of each decile. However, there are 215+ zeros in this vector, which means that the first and second breaks will both be zero, so cut() fails with a "'breaks' are not unique" error. What I want is to assign 100 zeros to the first decile, another 100 to the second decile, and the last 15 zeros to the third decile, so that the means of the first and second deciles are zero. Here is a smaller reproducible example with the same problem:
v=c(0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 5, 6, 3, 7)
cut_q10 <- quantile(v, probs = seq(0, 1, 0.1))
v_q10 =cut(v, breaks = cut_q10,labels = FALSE)
#Error in cut.default(v, breaks = cut_q10, labels = FALSE) :
# 'breaks' are not unique
What I would like to obtain is:
v_q10 = c(1,1,2,2,3,3,4,4,5,5,6,6,7,7,8,8,9,10,9,10)
or
v_q10 = c(2,2,1,1,3,4,4,3,5,5,6,6,7,7,8,8,9,10,9,10)
etc...
All of them are acceptable as long as there are two 0's in the first decile, two 0's in the second, two 1's in the third, two 1's in the fourth, and so on, such that regardless of which v_q10 is obtained, when I find the means of each decile I obtain this:
library(dplyr)
merged = as.data.frame(cbind(v, v_q10))
merged = merged %>% group_by(v_q10) %>% summarise(means = mean(v))
v_q10 means
# <dbl> <dbl>
# 1 1 0
# 2 2 0
# 3 3 1
# 4 4 1
# 5 5 1
# 6 6 2
# 7 7 2
# 8 8 3
# 9 9 4
#10 10 6.5
I know that it is possible to achieve this by writing a long code but I was wondering if there is a function or a code of a few lines that can achieve this.
Thanks in advance.
Try this:
cut(rank(v, ties.method = "first"), breaks = 10, labels = FALSE)
## [1] 1 1 2 2 3 3 4 4 5 5 6 6 7 7 8 8 9 10 9 10
Alternatives include using ties.method = "last" or ties.method = "random", or using order(order(v)) in place of rank(...).
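To see why this works on the small example, it may help to look at the ranks and the resulting decile means. This is a sketch on the 20-element vector from the question only; I have not checked how evenly cut() splits the ranks for lengths that are not a multiple of 10.

```r
v <- c(0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 5, 6, 3, 7)

# Breaking ties by position turns v into the ranks 1..20 with no
# duplicates, so the breaks computed by cut() are always unique.
r <- rank(v, ties.method = "first")
v_q10 <- cut(r, breaks = 10, labels = FALSE)

# order(order(v)) yields the same ranks as ties.method = "first".
same_ranks <- all(order(order(v)) == r)

# Mean of v within each decile, using base R only.
decile_means <- tapply(v, v_q10, mean)
decile_means
```

For this vector each decile receives exactly two values, and decile_means reproduces the means requested in the question.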

Alternating between values with rep() in R

I am looking for an elegant way of repeating two values according to a given vector in an alternating fashion. It is better stated by example. Take the following code for instance:
vals_to_rep <- c(1, 2)
tms_to_rep <- c(5, 4, 15)
res <- c(rep(1, 5), rep(2, 4), rep(1, 15))
res
In this example, I wish to repeat the values 1 and 2 according to the vector tms_to_rep, starting with 1 (since it comes first in vals_to_rep), then alternating to 2, back to 1, and so on.
I wish to continue this process for the length of tms_to_rep, in this case three times. The result would look like this:
1 1 1 1 1 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
If it helps, you can assume vals_to_rep is binary, but no assumptions on length of tms_to_rep.
Thanks!
You can expand vals_to_rep out to the length of tms_to_rep. Then rep() works fine:
vals_to_rep_expanded = rep(vals_to_rep, length.out = length(tms_to_rep))
rep(vals_to_rep_expanded, times = tms_to_rep)
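If you need this in more than one place, the two steps can be wrapped in a small function. The name rep_alternating is made up for this sketch, and the check simply compares against the hand-built vector from the question.

```r
# Hypothetical helper wrapping the two steps above.
rep_alternating <- function(vals, times) {
  # Recycle vals so there is exactly one value per run length in times.
  expanded <- rep(vals, length.out = length(times))
  rep(expanded, times = times)
}

res <- rep_alternating(c(1, 2), c(5, 4, 15))

# Reference result built by hand in the question.
expected <- c(rep(1, 5), rep(2, 4), rep(1, 15))
identical(res, expected)
```

Because length.out recycles vals, nothing in the function depends on vals having exactly two elements or on the length of times.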

R efficient search in data.frame

I want to use R to look up a value as efficiently as possible in tables stored as a data.frame, like this:
x01 x02 x03 x04 x05 x06 x07
1 NA 100 200 300 400 500 600
2 10 1 4 3 6 7 1
3 20 2 5 2 5 8 2
4 30 3 6 1 4 9 8
The values in the first row and first column are in increasing order. For example, I need to find the value at the intersection of the column containing 300 in the first row and the row containing 20 in the first column, which is 2. Code for this:
coefficient_table_1 <- data.frame(
x01=c(NA, 10, 20, 30),
x02=c(100, 1, 2, 3),
x03=c(200, 4, 5, 6),
x04=c(300, 3, 2, 1),
x05=c(400, 6, 5, 4),
x06=c(500, 7, 8, 9),
x07=c(600, 1, 2, 8)
)
col_value <- 300
row_value <- 20
col <- 0
for (i in 2:ncol(coefficient_table_1)) {
  if (coefficient_table_1[1, i] == col_value) {
    col <- i
  }
}
row <- which(coefficient_table_1$x01==row_value)
value <- coefficient_table_1[row, col]
The table can be large, and the search may itself be run inside a loop. What is the most efficient way to search a data.frame?
Your data is all numeric, so your best course of action is probably to use a matrix (a two-dimensional array) rather than a data frame.
Since arrays contain data of only a single class (e.g. numeric), many operations are much faster when your data is in array format.
Try this:
x <- as.matrix(coefficient_table_1)
x[which(x[, 1]==row_value), which(x[1, ]==col_value)]
x04
2
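If the same table is queried many times, one option is to split it once into lookup keys and a plain value matrix, so that each query is just two match() calls. This is only a sketch: the names row_keys, col_keys, values and lookup are made up here, I have not benchmarked it against the which() version, and exact matching on numeric keys assumes the keys are stored without floating-point noise.

```r
coefficient_table_1 <- data.frame(
  x01 = c(NA, 10, 20, 30),
  x02 = c(100, 1, 2, 3),
  x03 = c(200, 4, 5, 6),
  x04 = c(300, 3, 2, 1),
  x05 = c(400, 6, 5, 4),
  x06 = c(500, 7, 8, 9),
  x07 = c(600, 1, 2, 8)
)

# Split the table once into lookup keys and a numeric matrix of values.
row_keys <- coefficient_table_1$x01[-1]         # 10, 20, 30
col_keys <- unlist(coefficient_table_1[1, -1])  # 100, 200, ..., 600
values   <- as.matrix(coefficient_table_1[-1, -1])

# Hypothetical helper: match() returns the position of the first exact match.
lookup <- function(row_value, col_value) {
  values[match(row_value, row_keys), match(col_value, col_keys)]
}

lookup(20, 300)  # value at row key 20, column key 300
```

For the example in the question, lookup(20, 300) returns the value 2. A key that is not present makes match() return NA, so the lookup yields NA rather than an error.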

Resources