Count unique instances in rows between two columns given by index - r

Hi, I have an example data frame as follows. I would like to count the number of instances of a given value (for example, 1) that occur between the columns given by the indices ind1 and ind2. The output would be a vector with one number per row: the number of instances in that row.
COL1 <- c(1,1,1,NA,1,1)
COL2 <- c(1,NA,NA,1,1,1)
COL3 <- c(1,1,1,1,1,1)
ind1 <- c(1,2,1,2,1,2)
ind2 <- c(3,3,2,3,3,3)
Data <- data.frame (COL1, COL2, COL3, ind1, ind2)
Data
COL1 COL2 COL3 ind1 ind2
1 1 1 1 3
1 NA 1 2 3
1 NA 1 1 2
NA 1 1 2 3
1 1 1 1 3
1 1 1 2 3
so example output should look like
3, 1, 1, 2, 3, 2
My actual data set has many rows, so I want to avoid loops as much as possible to save time. I was thinking an apply function with a sum(which(x==1)) may work, but I'm not sure how to get the column values from the given indices.

An option would be to loop over the rows, extract the values based on the sequence index from 'ind1' to 'ind2' and get the count with table
apply(Data, 1, function(x) table(x[x['ind1']:x['ind2']]))
#[1] 3 1 1 2 3 2
Or using sum
apply(Data, 1, function(x) sum(x[x['ind1']:x['ind2']] == 1, na.rm = TRUE))
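Or build a logical matrix marking the cells between ind1 and ind2, turn the cells outside that range into NA, and use rowSums (this sums the values, which equals the count here only because every non-missing value is 1)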
rowSums(Data[1:3] * NA^!((col(Data[1:3]) >= Data$ind1) &
(col(Data[1:3]) <= Data$ind2)), na.rm = TRUE)
#[1] 3 1 1 2 3 2
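If the columns can hold values other than 1, summing them no longer gives a count. A minimal sketch of a variant that counts matches of a target value directly, assuming the same Data as in the question (the names vals, in_range and counts are only illustrative):

```r
COL1 <- c(1, 1, 1, NA, 1, 1)
COL2 <- c(1, NA, NA, 1, 1, 1)
COL3 <- c(1, 1, 1, 1, 1, 1)
ind1 <- c(1, 2, 1, 2, 1, 2)
ind2 <- c(3, 3, 2, 3, 3, 3)
Data <- data.frame(COL1, COL2, COL3, ind1, ind2)

# value columns as a matrix so that col() gives the column position of each cell
vals <- as.matrix(Data[1:3])

# TRUE where the cell lies between ind1 and ind2 of its own row
# (ind1/ind2 are recycled down the columns, so row i is compared with ind1[i])
in_range <- col(vals) >= Data$ind1 & col(vals) <= Data$ind2

# count cells equal to the target value inside the range, ignoring NA
counts <- rowSums(vals == 1 & in_range, na.rm = TRUE)
counts
```

Changing the 1 in vals == 1 to another target value counts that value instead.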

Related

Impute missing values in partial rank data?

I have some rank data with missing values. The highest ranked item was assigned a value of '1'. 'NA' values occur when the item was not ranked.
# sample data
df <- data.frame(Item1 = c(1,2, NA, 2, 3), Item2 = c(3,1,NA, NA, 1), Item3 = c(2,NA, 1, 1, 2))
> df
Item1 Item2 Item3
1 1 3 2
2 2 1 NA
3 NA NA 1
4 2 NA 1
5 3 1 2
I would like to randomly impute the 'NA' values in each row with the appropriate unranked values. One solution that would meet my goal would be this:
> solution1
Item1 Item2 Item3
1 1 3 2
2 2 1 3
3 3 2 1
4 2 3 1
5 3 1 2
This code gives a list of possible replacement values for each row.
# set max possible rank in data
max_val <- 3
# calculate row max
df$row_max <- apply(df, 1, max, na.rm= T)
# calculate number of missing values in each row
df$num_na <- max_val - df$row_max
# set a sample vector
samp_vec <- 1:max_val # set a sample vector
# set an empty list
replacements <- vector(mode = "list", length = nrow(df))
# generate a list of replacements for each row
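for (i in seq_len(nrow(df))) {
  if (df$num_na[i] > 0) {
    candidates <- samp_vec[samp_vec > df$row_max[i]]
    # index with sample.int(): sample(3, 1) would draw from 1:3, not from c(3)
    replacements[[i]] <- candidates[sample.int(length(candidates), df$num_na[i])]
  }
  # rows without NAs keep the default NULL; assigning NULL would delete the element
}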
Now I am puzzling over how to assign the values in my list to the missing values in each row of my data.frame. (My actual data has thousands of rows.)
Is there a clean way to do this?
A base R option using apply -
set.seed(123)
df[] <- t(apply(df, 1, function(x) {
#Get values which are not present in the row
val <- setdiff(seq_along(x), x)
#If only 1 missing value replace with the one which is not missing
if(length(val) == 1) x[is.na(x)] <- val
#If more than 1 missing replace randomly
else if(length(val) > 1) x[is.na(x)] <- sample(val)
#If no missing replace the row as it is
x
}))
df
# Item1 Item2 Item3
#1 1 3 2
#2 2 1 3
#3 2 3 1
#4 2 3 1
#5 3 1 2
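A compact variant of the same idea, as a sketch only: it assumes the original three-column df from the question (without the row_max and num_na helper columns), and fill_ranks is a hypothetical helper name. Indexing with sample.int() removes the need for a separate single-value case.

```r
df <- data.frame(Item1 = c(1, 2, NA, 2, 3),
                 Item2 = c(3, 1, NA, NA, 1),
                 Item3 = c(2, NA, 1, 1, 2))

# hypothetical helper: fill the NAs of one row with the unused ranks in random order
fill_ranks <- function(x) {
  unused <- setdiff(seq_along(x), x)
  # sample.int() shuffles positions, so a single unused rank is returned as is
  x[is.na(x)] <- unused[sample.int(length(unused))]
  x
}

set.seed(1)
imputed <- df
imputed[] <- t(apply(df, 1, fill_ranks))

# every row should now be a permutation of 1..3
all(apply(imputed, 1, function(x) setequal(x, 1:3)))
```

The values that were already ranked stay untouched; only the NA cells are filled.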

Drop Multiple Columns in R

I have a dataset of 80k rows and 874 columns. Some of these columns are empty. I use sum(is.na) in a for loop to determine the indices of the empty columns. Since the first column is not empty, if sum(is.na) is equal to the number of rows of the first column, that column is empty.
for (i in 1:ncol(loans)){
if (sum(is.na(loans[i])) == nrow(loans[1])){
print(i)
}
}
Now that I know the indices of empty columns, I want to drop them from the data. I thought about storing those indices in an array and dropping them in a loop but I don't think it will work since columns with data will replace the empty columns. How can I drop them?
You should try to provide a toy dataset for your question.
loans <- data.frame(
a = c(NA, NA, NA),
b = c(1,2,3),
c = c(1,2,3),
d = c(1,2,3),
e = c(NA, NA, NA)
)
loans[!sapply(loans, function(col) all(is.na(col)))]
sapply loops over columns of loans and applies the anonymous function checking if all elements are NA. It then coerces the output to a vector, in this case logical.
The tidyverse option:
loans[!purrr::map_lgl(loans, ~all(is.na(.x)))]
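If the positions of the empty columns are also needed, as in the loop from the question, the logical vector from sapply can be kept and reused. A small sketch on the toy data above (is_empty, empty_idx and cleaned are illustrative names):

```r
loans <- data.frame(
  a = c(NA, NA, NA),
  b = c(1, 2, 3),
  c = c(1, 2, 3),
  d = c(1, 2, 3),
  e = c(NA, NA, NA)
)

# TRUE for every column that contains only NA
is_empty <- sapply(loans, function(col) all(is.na(col)))

# positions of the empty columns, as the loop in the question prints them
empty_idx <- which(is_empty)

# drop them in one step; the logical mask also works when no column is empty
cleaned <- loans[!is_empty]
names(cleaned)
```

Because all columns are dropped at once, there is no problem with positions shifting while columns are removed.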
Does this work:
df <- data.frame(col1 = rep(NA, 5),
col2 = 1:5,
col3 = rep(NA,5),
col4 = 6:10)
df
col1 col2 col3 col4
1 NA 1 NA 6
2 NA 2 NA 7
3 NA 3 NA 8
4 NA 4 NA 9
5 NA 5 NA 10
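# count the NAs per column; summing the values instead would also drop
# non-empty columns whose values happen to sum to 0
df[, which(colSums(is.na(df)) == nrow(df))] <- NULL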
df
col2 col4
1 1 6
2 2 7
3 3 8
4 4 9
5 5 10
Another approach:
df[!apply(df, 2, function(x) all(is.na(x)))]
col2 col4
1 1 6
2 2 7
3 3 8
4 4 9
5 5 10
A dplyr solution:
library(dplyr)
df %>%
  select(where(~ !all(is.na(.x))))
You can solve almost any problem with basic tools like if/else and for loops, although the drawback is that they will be slower.
# evaluate each column from last to first, so that removing one does not
# shift the positions of the columns still to be checked
for (i in rev(seq_along(loans))) {
  if (sum(is.na(loans[[i]])) == nrow(loans)) {
    loans[[i]] <- NULL
  }
}

Using ifelse with conditional statements based on vector indices to loop down rows

Essentially I have a data frame. From this I've taken 2 new indices to indicate to me a value for each row that will be used in changing this dataset. I also have a code to replace the values as I'd like them replaced (essentially up to the column indicated by the new index is changed to a 0). I'm just not sure how to put this all together.
This is the data frame I was originally working with; ind1 and ind2 were used to create new indices that I have as separate vectors.
COL1 <- c(1,1,1,NA,1,1)
COL2 <- c(1,NA,NA,1,1,1)
COL3 <- c(1,1,1,1,1,1)
ind1 <- c(1,2,1,2,1,2)
ind2 <- c(3,3,2,3,3,3)
Data <- data.frame (COL1, COL2, COL3, ind1, ind2)
Data
COL1 COL2 COL3 ind1 ind2
1 1 1 1 3
1 NA 1 2 3
1 NA 1 1 2
NA 1 1 2 3
1 1 1 1 3
1 1 1 2 3
The new index vectors look like this and are currently not in the data frame:
actual <- c(5,3,4,1,1,2)
prediction <- c(1,1,2,5,5,1)
Essentially what I would like to happen is for the function to evaluate actual > prediction for each row and if this is true then it runs the function below on that row
replace(Data, cbind(rep(1:NROW(Data), Data$ind1), sequence(Data$ind1)), 0)
and if actual > prediction is false then it runs the function below on that row
replace(Data, cbind(rep(1:NROW(Data), Data$ind2), sequence(Data$ind2)), 0)
For this example data frame, I would expect the output to be a new data frame:
Data2
COL1 COL2 COL3 ind1 ind2
1 1 1 1 3
1 1 1 2 3
1 1 1 1 2
0 0 0 2 3
0 0 0 1 3
0 0 1 2 3
What I've tried so far is...
Data2<- c()
for (i in 1:NROW (Data)) {if (actual < prediction) {
Data2[i]<- replace(Data, cbind(rep(1:NROW(Data), Data$ind1), sequence(Data$ind1)), 0)
} else {
Data2[i]<- replace(Data, cbind(rep(1:NROW(Data), Data$ind2), sequence(Data$ind2)), 0)
}
}
This gives me a list of lists output for Data2. But what I am looking for is a new dataframe.
After our back and forth, I believe this is the answer that provides your desired output.
COL1 <- c(1,1,1,NA,1,1)
COL2 <- c(1,NA,NA,1,1,1)
COL3 <- c(1,1,1,1,1,1)
ind1 <- c(1,2,1,2,1,2)
ind2 <- c(3,3,2,3,3,3)
Data <- data.frame (COL1, COL2, COL3, ind1, ind2)
actual <- c(5,3,4,1,1,2)
prediction <- c(1,1,2,5,5,1)
# the comparison already returns a logical vector, so ifelse() is not needed
logic <- actual > prediction
The logic vector's output is:
> logic
[1] TRUE TRUE TRUE FALSE FALSE TRUE
data2<-Data
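for (i in seq_len(NROW(Data))) {
  if (logic[i]) {
    data2[i, 1:Data$ind2[i]] <- 1
  } else {
    # use ind2 here as well: the question asks for ind2 when
    # actual > prediction is FALSE, and it gives the output shown below
    data2[i, 1:Data$ind2[i]] <- 0
  }
}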
The loop output is as follows.
> data2
COL1 COL2 COL3 ind1 ind2
1 1 1 1 1 3
2 1 1 1 2 3
3 1 1 1 1 2
4 0 0 0 2 3
5 0 0 0 1 3
6 1 1 1 2 3
It is not identical to your output because the logic vector is TRUE at the sixth position.
I hope this helps.
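A loop-free sketch that reproduces the data2 output shown above (columns 1 through ind2 set to 1 when actual > prediction, otherwise to 0). It assumes the value columns are the first three columns of Data; vals, upto and gt are illustrative names.

```r
COL1 <- c(1, 1, 1, NA, 1, 1)
COL2 <- c(1, NA, NA, 1, 1, 1)
COL3 <- c(1, 1, 1, 1, 1, 1)
ind1 <- c(1, 2, 1, 2, 1, 2)
ind2 <- c(3, 3, 2, 3, 3, 3)
Data <- data.frame(COL1, COL2, COL3, ind1, ind2)
actual <- c(5, 3, 4, 1, 1, 2)
prediction <- c(1, 1, 2, 5, 5, 1)

vals <- as.matrix(Data[1:3])

# TRUE for the cells in columns 1..ind2 of each row
upto <- col(vals) <= Data$ind2

# one TRUE/FALSE per row, recycled down the columns inside ifelse()
gt <- actual > prediction

data2 <- Data
# inside the range write 1 or 0 depending on the row's condition,
# outside the range keep the original value
data2[1:3] <- ifelse(upto, ifelse(gt, 1, 0), vals)
data2
```

The nested ifelse() works on the whole matrix at once, so no row loop is needed.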
I would solve the problem the following way. However, I don't fully get your replacement logic. I left it out for this reason.
The following separates the condition and the replacement logic into a function. Hence you can easily use it for other data frames by simply calling it again with apply.
COL1 <- c(1,1,1,NA,1,1)
COL2 <- c(1,NA,NA,1,1,1)
COL3 <- c(1,1,1,1,1,1)
ind1 <- c(1,2,1,2,1,2)
ind2 <- c(3,3,2,3,3,3)
actual <- c(5,3,4,1,1,2)
prediction <- c(1,1,2,5,5,1)
Data <- data.frame (
col1 = COL1, col2 = COL2, col3 = COL3, ind1 = ind1, ind2 = ind2,
actual = actual, prediction = prediction)
custom.replace <- function(row) {
  # watch out! row is an atomic vector.
  if (row["actual"] > row["prediction"]) {
    # one of your replacement rules goes here.
  } else {
    # the other goes here.
  }
  # return the (possibly modified) row so apply() can collect the results
  row
}
row.axis <- 1
apply(Data, row.axis, custom.replace)
I hope this helps!
PS: I know you want to use the ifelse function. However, I don't see why you have to use it. Further, you can easily extend this solution so that it receives multiple inputs.

Sum Values of Every Column in Data Frame with Conditional For Loop

So I want to go through a data set and sum the values from each column based on the condition of my first column. The data and my code so far looks like this:
x v1 v2 v3
1 0 1 5
2 4 2 10
3 5 3 15
4 1 4 20
for(i in colnames(data)){
if(data$x>2){
x1 <-sum(data[[i]])
}
else{
x2 <-sum(data[[i]])
}
}
My assumption was that the for loop would call each column by name from the data and then sum the values in each column based on whether they matched the condition of column x.
I want to sum half the values from each column and assign them to a value x1 and do the same for the remainder, assigning it to x2. I keep getting an error saying the following:
the condition has length > 1 and only the first element will be used
What am I doing wrong and is there a better way to go about this? Ideally I want a table that looks like this:
v1 v2 v3
x1 6 7 35
x2 4 3 15
Here's a dplyr solution. First, I define the data frame.
df <- read.table(text = "x v1 v2 v3
1 0 1 5
2 4 2 10
3 5 3 15
4 1 4 20", header = TRUE)
# x v1 v2 v3
# 1 1 0 1 5
# 2 2 4 2 10
# 3 3 5 3 15
# 4 4 1 4 20
Then, I create a label (x_check) to indicate which group each row belongs to based on your criterion (x > 2), group by this label, and summarise each column with a v in its name using sum.
# Load library
library(dplyr)
df %>%
  mutate(x_check = ifelse(x > 2, "x1", "x2")) %>%
  group_by(x_check) %>%
  # across() replaces summarise_at() and the deprecated funs() (dplyr >= 1.0.0)
  summarise(across(contains("v"), sum))
# # A tibble: 2 x 4
# x_check v1 v2 v3
# <chr> <int> <int> <int>
# 1 x1 6 7 35
# 2 x2 4 3 15
Not sure if I understood your intention correctly, but here is how you would reproduce your results with base R:
df <- data.frame(
x = c(1:4),
v1 = c(0, 4, 5, 1),
v2 = 1:4,
v3 = (1:4)*5
)
x1 <- colSums(df[df$x > 2, 2:4, drop = FALSE])
x2 <- colSums(df[df$x <= 2, 2:4, drop = FALSE])
Where
df[df$x > 2, 2:4, drop = FALSE] will create a subset of df where the rows satisfy df$x > 2 and the columns are 2:4 (meaning the second, third and fourth column), drop = FALSE is there mainly to prevent R from simplifying the results in some special cases
colSums does a by-column sum on the subsetted data.frame
If your x column was really a condition (e.g. a logical vector) you could just do
x1 <- colSums(df[df$x, 2:4, drop = FALSE])
x2 <- colSums(df[!df$x, 2:4, drop = FALSE])
Note that no loop is needed to get the results; in R you should use vectorized functions wherever possible.
More generally, you could do such aggregation with aggregate:
aggregate(df[, 2:4], by = list(condition = df$x <= 2), FUN = sum)
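To get the table layout asked for in the question, the two colSums results from this answer can be stacked with rbind. A minimal sketch, assuming the same df as above (result is an illustrative name):

```r
df <- data.frame(
  x = 1:4,
  v1 = c(0, 4, 5, 1),
  v2 = 1:4,
  v3 = (1:4) * 5
)

# df$x > 2 is a logical vector with one element per row; it is used to
# subset the rows, not as the condition of an if() statement
result <- rbind(
  x1 = colSums(df[df$x > 2, 2:4, drop = FALSE]),
  x2 = colSums(df[df$x <= 2, 2:4, drop = FALSE])
)
result
#    v1 v2 v3
# x1  6  7 35
# x2  4  3 15
```

The error in the question comes from passing that whole logical vector to if(), which expects a single TRUE or FALSE.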

Rank each row in a data frame in descending order

I want to apply rank() to each row in a data frame by apply(data.frame,1,rank). However, rank is by default ascending. So when I apply rank() to my first row with the values (2,1,3,5), I get
[1] 2 1 3 4
However, I want
[1] 3 4 2 1
How can I do this using apply(data.frame,1,rank)?
Try
apply(-data, 1, rank, ties.method='first')
and compare with
apply(data, 1, rank, ties.method='first')
For your specific example
v1 <- c(2,1,3,5)
rank(v1)
#[1] 2 1 3 4
rank(-v1)
#[1] 3 4 2 1
data
set.seed(24)
data <- as.data.frame(matrix(sample(1:20, 4*20, replace=TRUE), ncol=4))
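Note that apply over rows returns its results as columns, so the output needs a transpose to line up with the original rows. A small sketch with a made-up two-row data frame (the values and the name desc_rank are only for illustration):

```r
data <- data.frame(a = c(2, 10), b = c(1, 40), c = c(3, 30), d = c(5, 20))

# rank the negated values row by row, then transpose so that each row of
# the result corresponds to a row of the input
desc_rank <- t(apply(-as.matrix(data), 1, rank, ties.method = "first"))
desc_rank
#      a b c d
# [1,] 3 4 2 1
# [2,] 4 1 2 3
```

The first row reproduces the (2,1,3,5) example from the question.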

Resources