Impute missing values in partial rank data? - r

I have some rank data with missing values. The highest ranked item was assigned a value of '1'. 'NA' values occur when the item was not ranked.
# sample data
df <- data.frame(Item1 = c(1,2, NA, 2, 3), Item2 = c(3,1,NA, NA, 1), Item3 = c(2,NA, 1, 1, 2))
> df
Item1 Item2 Item3
1 1 3 2
2 2 1 NA
3 NA NA 1
4 2 NA 1
5 3 1 2
I would like to randomly impute the 'NA' values in each row with the appropriate unranked values. One solution that would meet my goal would be this:
> solution1
Item1 Item2 Item3
1 1 3 2
2 2 1 3
3 3 2 1
4 2 3 1
5 3 1 2
This code gives a list of possible replacement values for each row.
# set max possible rank in data
max_val <- 3
# calculate row max
df$row_max <- apply(df, 1, max, na.rm= T)
# calculate number of missing values in each row
df$num_na <- max_val - df$row_max
# set a sample vector
samp_vec <- 1:max_val # set a sample vector
# set an empty list
replacements <- vector(mode = "list", length = nrow(df))
# generate a list of replacements for each row
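for(i in seq_len(nrow(df))){
if(df$num_na[i] > 0){
# candidate ranks are those above the row maximum
cand <- samp_vec[samp_vec > df$row_max[i]]
# sample by index so a single candidate (e.g. 3) is not expanded to 1:3
replacements[[i]] <- cand[sample.int(length(cand), df$num_na[i])]
}
# rows without NA keep their NULL placeholder; assigning NULL would drop the element
}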
Now I am puzzling over how to assign the values in my list to the missing values in each row of my data.frame. (My actual data has thousands of rows.)
Is there a clean way to do this?

A base R option using apply -
set.seed(123)
df[] <- t(apply(df, 1, function(x) {
#Get values which are not present in the row
val <- setdiff(seq_along(x), x)
#If only 1 missing value replace with the one which is not missing
if(length(val) == 1) x[is.na(x)] <- val
#If more than 1 missing replace randomly
else if(length(val) > 1) x[is.na(x)] <- sample(val)
#If no missing replace the row as it is
x
}))
df
# Item1 Item2 Item3
#1 1 3 2
#2 2 1 3
#3 2 3 1
#4 2 3 1
#5 3 1 2
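The answer above takes the set of possible ranks from seq_along(x), so it assumes the maximum rank equals the number of items. A minimal sketch of the same idea with the maximum rank passed explicitly; impute_ranks and its max_val argument are hypothetical names, not part of the original answer:

```r
# Hypothetical helper: fill the NAs of one row with the ranks not yet used.
impute_ranks <- function(x, max_val = length(x)) {
  na_pos <- which(is.na(x))
  if (length(na_pos) == 0) return(x)             # nothing to impute
  missing_ranks <- setdiff(seq_len(max_val), x)  # ranks absent from the row
  # Sample by index so a single remaining rank is never expanded to 1:rank
  x[na_pos] <- missing_ranks[sample.int(length(missing_ranks), length(na_pos))]
  x
}

df <- data.frame(Item1 = c(1, 2, NA, 2, 3),
                 Item2 = c(3, 1, NA, NA, 1),
                 Item3 = c(2, NA, 1, 1, 2))

set.seed(123)
imputed <- df
imputed[] <- t(apply(df, 1, impute_ranks))
imputed
```

Sampling positions with sample.int avoids the behaviour of sample(), which treats a single number n as the range 1:n.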

Related

How to manage a list of data frames with different dimensions (keep/drop columns)?

I have a large list of data frames with different dimensions. My real data are too big to show, so I created a small example list (myList) as follows:
myList <- list( data.frame(ID = c("T-02", "T-04","T-06"),
Test = rnorm(3, mean=50, sd=10),Shape=c("C","S","r"),
Time=rnorm(3, mean=90, sd=10),Event=c(1,0,1),
KPS=c(90,100,70),Sex=c("F","M","F"),Race=c("W","B","W")),
data.frame(ID = c("T-02", "T-04","T-06"), Shape=c("C","S","r"),
Value = 1:3,Time=rnorm(3, mean=90, sd=10),
Event=c(1,0,1),
KPS=c(90,100,70),Sex=c("F","M","F"),Race=c("W","B","W")),
data.frame(ID = c("T-02", "T-04","T-06"),
Test = rnorm(3, mean=50, sd=10),
Value = 1:3, Time=rnorm(3, mean=90, sd=10),Event=c(1,0,1),
KPS=c(90,100,70),Sex=c("F","M","F"),Race=c("W","B","W")),
data.frame(ID = c("T-02", "T-04","T-06"),
Test = rnorm(3, mean=50, sd=10),
Value = 1:3,Time=rnorm(3, mean=90, sd=10),Event=c(1,0,1),
KPS=c(90,100,70),Sex=c("F","M","F"),Race=c("W","B","W")))
I am looking for a function with which I can drop all columns after the "Event" column, or select the columns from the first one up to the "Event" column.
I can do it easily for such small data with the code below:
new_drop<-lapply(myList, function(x){x[,!names(x)%in%c("KPS","Sex","Race")]})
But I want to remove more than 20 columns like this in my data. I wonder if there is a simpler way.
I also tried this code but it did not work properly!
new_drop1<-lapply(myList, function(x){x[,endsWith(colnames(x),"Event")]})
I appreciate any help.
You could use grep:
lapply(myList, function(x) x[seq(grep("Event", names(x)))])
[[1]]
ID Test Shape Time Event
1 T-02 65.11001 C 94.53361 1
2 T-04 70.25636 S 84.86061 0
3 T-06 44.56480 r 85.30492 1
[[2]]
ID Shape Value Time Event
1 T-02 C 1 93.40279 1
2 T-04 S 2 85.78726 0
3 T-06 r 3 97.02140 1
[[3]]
ID Test Value Time Event
1 T-02 39.89387 1 94.80438 1
2 T-04 48.28122 2 85.62445 0
3 T-06 49.47685 3 90.10609 1
[[4]]
ID Test Value Time Event
1 T-02 38.55385 1 78.33900 1
2 T-04 47.60908 2 77.63453 0
3 T-06 43.59754 3 92.25645 1
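Note that grep matches any column name containing "Event", so a name such as EventDate would match too. A minimal sketch using an exact name match instead; keep_through is a hypothetical helper, and the EventDate column exists only to illustrate the difference:

```r
# Hypothetical helper: keep columns from the first one through `col`.
keep_through <- function(x, col = "Event") {
  pos <- match(col, names(x))   # exact name match; NA if the column is absent
  if (is.na(pos)) return(x)     # assumption: leave such data frames untouched
  x[seq_len(pos)]
}

# Small made-up list, not the data from the question
example <- list(
  data.frame(ID = 1:2, Event = c(1, 0), EventDate = c("a", "b"), KPS = c(90, 70)),
  data.frame(ID = 1:2, Test = c(5, 6), Event = c(0, 1), Sex = c("F", "M"))
)

trimmed <- lapply(example, keep_through)
lapply(trimmed, names)
```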
The code below keeps only the columns up to and including "Event", using which. which is a base R function that returns the indices at which a condition is TRUE. Here the only condition is colnames(x)=="Event".
new_drop<-lapply(myList, function(x){
res <- NULL
if(is.data.frame(x)) {
eventcol <- which(colnames(x)=="Event")
res <- x[, 1:eventcol, drop = FALSE]
} else {
res <- x
}
return(res)
})
This would work even if not all your list elements are of class data.frame
> lapply(new_drop, head)
[[1]]
ID Test Shape Time Event
1 T-02 57.38475 C 76.05545 1
2 T-04 40.84934 S 85.98049 0
3 T-06 45.44281 r 85.18336 1
[[2]]
ID Shape Value Time Event
1 T-02 C 1 101.68492 1
2 T-04 S 2 100.13524 0
3 T-06 r 3 89.14877 1
[[3]]
ID Test Value Time Event
1 T-02 42.92581 1 82.37073 1
2 T-04 42.10800 2 90.51706 0
3 T-06 50.51329 3 96.52649 1
[[4]]
ID Test Value Time Event
1 T-02 49.13385 1 85.91036 1
2 T-04 52.72536 2 98.83747 0
3 T-06 68.96858 3 96.51575 1
For each data frame you can find the number of columns with ncol and then the position of the Event column with the which function. For example with mtcars assume the wt column is the "Event" column.
cars <- mtcars
ncols <- ncol(cars)
# 11
wt <- which(colnames(cars) == "wt")
# 6
# Drop columns 1 to wt-1 (cars1), or columns wt+1 to the end (cars2)
before <- wt - 1
cars1 <- cars[ , -(1:before)]
after <- wt + 1
cars2 <- cars[ , -(after:ncols)]
You can use the sapply function to find the dimensions of all your data frames and the positions of the Event column followed by an lapply to process the column deletions.
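A minimal sketch of that sapply and lapply combination, assuming every data frame in the list has exactly one column named "Event"; the list frames is made up for illustration:

```r
# Small made-up list of data frames with different dimensions
frames <- list(
  data.frame(ID = 1:3, Test = c(5, 6, 7), Event = c(1, 0, 1), KPS = c(90, 100, 70)),
  data.frame(ID = 1:3, Event = c(1, 0, 1), KPS = c(90, 100, 70), Sex = c("F", "M", "F"))
)

# Number of columns and position of "Event" in each data frame
n_cols    <- sapply(frames, ncol)
event_pos <- sapply(frames, function(x) which(colnames(x) == "Event"))

# Keep columns 1..event_pos for each data frame
trimmed <- lapply(seq_along(frames), function(i) {
  frames[[i]][, seq_len(event_pos[i]), drop = FALSE]
})

sapply(trimmed, ncol)
```

If a data frame had no "Event" column, which would return a zero-length result and sapply would no longer give a plain vector, so that case would need separate handling.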

Perform calculations on row depending on individual cells [duplicate]

I have a data frame in R that looks like
1 3 NULL,
2 NULL 5,
NULL NULL 9
I want to iterate through each row and add the two numbers that are present. If there aren't two numbers present, I want to throw an error. How do I refer to specific rows and cells in R? To iterate through the rows I have a for loop. Sorry, I'm not sure how to format the matrix above.
for(i in 1:nrow(df))
Data:
df <- data.frame(
v1 = c(1, 2, NA),
v2 = c(3, NA, NA),
v3 = c(NA, 5, 9)
)
Use rowSums:
df$sum <- rowSums(df, na.rm = T)
Result:
df
v1 v2 v3 sum
1 1 3 NA 4
2 2 NA 5 7
3 NA NA 9 9
If you do need a for loop:
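# sum only the value columns so an existing 'sum' column is not added in
vars <- c("v1", "v2", "v3")
df$sum <- NA_real_
for(i in seq_len(nrow(df))){
df$sum[i] <- rowSums(df[i, vars], na.rm = TRUE)
}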
If you have something with NULL you can make it a data.frame, but that will make the columns with NULL a character vector. You have to convert those to numeric, which will then introduce NA for NULL.
rowSums will then create the sum you want.
df <- read.table(text=
"
a b c
1 3 NULL
2 NULL 5
NULL NULL 9
", header =T)
# make columns numeric, this will change the NULL to NA
df <- data.frame(lapply(df, as.numeric))
cbind(df, sum=rowSums(df, na.rm = T))
# a b c sum
# 1 1 3 NA 4
# 2 2 NA 5 7
# 3 NA NA 9 9
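The rowSums approaches above silently sum whatever is present. If an error is wanted when a row does not contain exactly two numbers, as the question asks, one possible sketch is to count the non-missing cells first; sum_pairs is a hypothetical helper name:

```r
# Hypothetical helper: sum each row, but stop unless every row has exactly two numbers.
sum_pairs <- function(d) {
  n_present <- rowSums(!is.na(d))      # non-missing cells per row
  bad_rows  <- which(n_present != 2)
  if (length(bad_rows) > 0) {
    stop("rows without exactly two numbers: ", paste(bad_rows, collapse = ", "))
  }
  rowSums(d, na.rm = TRUE)
}

df <- data.frame(
  v1 = c(1, 2, NA),
  v2 = c(3, NA, NA),
  v3 = c(NA, 5, 9)
)

ok_sums <- sum_pairs(df[1:2, ])         # rows 1 and 2 each hold two numbers
failed  <- try(sum_pairs(df), silent = TRUE)  # row 3 holds only one number
```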

Count unique instances in rows between two columns given by index

Hi I have an example data frame as follows. What I would like to do is count the number of instances of a unique value (example 1) that occur between the columns given by the indices ind1 and ind2. Output would be a vector with a number for each row that is the number of instances for that row.
COL1 <- c(1,1,1,NA,1,1)
COL2 <- c(1,NA,NA,1,1,1)
COL3 <- c(1,1,1,1,1,1)
ind1 <- c(1,2,1,2,1,2)
ind2 <- c(3,3,2,3,3,3)
Data <- data.frame (COL1, COL2, COL3, ind1, ind2)
Data
COL1 COL2 COL3 ind1 ind2
1 1 1 1 3
1 NA 1 2 3
1 NA 1 1 2
NA 1 1 2 3
1 1 1 1 3
1 1 1 2 3
so example output should look like
3, 1, 1, 2, 3, 2
My actual data set has many rows, so I want to avoid loops as much as possible to save time. I was thinking an apply function with sum(which(x==1)) might work, but I'm not sure how to get the column values from the given indices.
An option would be to loop over the rows, extract the values based on the sequence index from 'ind1' to 'ind2' and get the count with table
apply(Data, 1, function(x) table(x[x['ind1']:x['ind2']]))
#[1] 3 1 1 2 3 2
Or using sum
apply(Data, 1, function(x) sum(x[x['ind1']:x['ind2']] == 1, na.rm = TRUE))
Or create a logical matrix and then use rowSums
rowSums(Data[1:3] * NA^!((col(Data[1:3]) >= Data$ind1) &
(col(Data[1:3]) <= Data$ind2)), na.rm = TRUE)
#[1] 3 1 1 2 3 2
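The rowSums version works here because the value being counted is 1, so summing the cells equals counting them. A sketch of the same masking idea for an arbitrary value; count_between and its target argument are hypothetical names:

```r
COL1 <- c(1, 1, 1, NA, 1, 1)
COL2 <- c(1, NA, NA, 1, 1, 1)
COL3 <- c(1, 1, 1, 1, 1, 1)
ind1 <- c(1, 2, 1, 2, 1, 2)
ind2 <- c(3, 3, 2, 3, 3, 3)
Data <- data.frame(COL1, COL2, COL3, ind1, ind2)

# Hypothetical helper: per row, count cells equal to `target` in columns from..to.
count_between <- function(values, from, to, target) {
  m <- as.matrix(values)
  # TRUE where the column index lies inside the row's [from, to] range;
  # `from` and `to` recycle down the columns, so they line up with the rows
  in_range <- col(m) >= from & col(m) <= to
  rowSums(m == target & in_range, na.rm = TRUE)
}

counts <- count_between(Data[1:3], Data$ind1, Data$ind2, target = 1)
```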

R: compare multiple columns pairs and place value on new corresponding variable

I am a basic R user.
I have 50 column pairs (example pair is: "pair_q1" and "pair_01_v_rde") per "id" in the same dataframe that I would like to collect data from and place it in a new corresponding variable e.g. "newvar_q1".
All the pair variable names have a pattern in their names that can be distilled to this ("pair_qX" and "pair_X_v_rde", where X = 1:50, and the final variables I would like to have are "newvar_qX", where X = 1:50)
Ideally only one member of the pair should contain data, but this is not the case.
Each of the variables can contain values from 1:5 or NA(missing).
Rules for collecting data from each pair based on "id" and what to place in their newly created corresponding variable are:
If one of the pairs has a value and the other is missing then place the value in their corresponding new variable. e.g. ("pair_q1" = 1 and "pair_01_v_rde" = NA then "newvar_q1" = 1)
If both pairs have the same value or both are missing then place that value/missing in their corresponding new variable e.g. ("pair_q50" = 1/NA and "pair_50_v_rde" = 1/NA then "newvar_q50" = 1/NA)
If both pairs have different values then ignore both values and assign their corresponding new variable 999 e.g. ("pair_q02" = 3 and "pair_02_v_rde" = 2 then "newvar_q02" = 999)
Can anyone show me how I can execute this in R please?
Thanks!
Nelly
# Create Toy dataset
id <- c(100, 101, 102)
pair_q1 <- c(1, NA, 1)
pair_01_v_rde <- c(NA, 2, 1)
pair_q2 <- c(1, 1, NA)
pair_02_v_rde <- c(2, NA, NA)
pair_q50 <- c(NA, 2, 4)
pair_50_v_rde <- c(4, 3, 1)
mydata <- data.frame(id, pair_q1, pair_01_v_rde, pair_q2, pair_02_v_rde, pair_q50, pair_50_v_rde)
# The dataset
> mydata
id pair_q1 pair_01_v_rde pair_q2 pair_02_v_rde pair_q50 pair_50_v_rde
1 100 1 NA 1 2 NA 4
2 101 NA 2 1 NA 2 3
3 102 1 1 NA NA 4 1
# Here I manually build what I would like to have in the dataset
newvar_q1 <- c(1, 2, 1)
newvar_q2 <- c(999, 1, NA)
newvar_q50 <- c(4, 999, 999)
mydata2 <- data.frame(id, pair_q1, pair_01_v_rde, pair_q2, pair_02_v_rde, pair_q50, pair_50_v_rde, newvar_q1, newvar_q2, newvar_q50)
> mydata2
id pair_q1 pair_01_v_rde pair_q2 pair_02_v_rde pair_q50 pair_50_v_rde newvar_q1 newvar_q2 newvar_q50
1 100 1 NA 1 2 NA 4 1 999 4
2 101 NA 2 1 NA 2 3 2 1 999
3 102 1 1 NA NA 4 1 1 NA 999
A possible solution using the 'tidyverse' (use 'inner_join(mydata,.,by="id")' to get the new columns in the order you give in your question):
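library(dplyr) # select, mutate, case_when, inner_join
library(tidyr) # gather, spread
library(stringr) # str_extract
mydata %>%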
select(id,matches("^pair_q")) %>% # keeps only left part of pairs
gather(k,v1,-id) %>% # transforms into tuples (id,variable name,variable value)
mutate(n=as.integer(str_extract(k,"\\d+"))) -> df1 # converts variable name into variable number
mydata %>%
select(id,matches("^pair_\\d")) %>% # same on right part of pairs
gather(k,v2,-id) %>%
mutate(n=as.integer(str_extract(k,"\\d+"))) -> df2
inner_join(df1,df2,by=c("id","n")) %>%
mutate(w=case_when(is.na(v1) ~ v2, # builds new variable value
is.na(v2) ~ v1, # from your rules
v1==v2 ~ v1,
TRUE ~999),
k=paste0("newvar_q",n)) %>% # builds new variable name from variable number
select(id,k,w) %>% # keeps only useful columns
spread(k,w) %>% # switches back from tuple view to wide view
inner_join(mydata,by="id") # and merges the new variables to the original data
# id newvar_q1 newvar_q2 newvar_q50 pair_q1 pair_01_v_rde pair_q2 pair_02_v_rde pair_q50 pair_50_v_rde
#1 100 1 999 4 1 NA 1 2 NA 4
#2 101 2 1 999 NA 2 1 NA 2 3
#3 102 1 NA 999 1 1 NA NA 4 1
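For comparison, a base R sketch of the same three rules, assuming the column names follow the pattern from the question (pair_qX and a zero-padded pair_XX_v_rde); combine_pair is a hypothetical helper:

```r
mydata <- data.frame(
  id = c(100, 101, 102),
  pair_q1  = c(1, NA, 1),  pair_01_v_rde = c(NA, 2, 1),
  pair_q2  = c(1, 1, NA),  pair_02_v_rde = c(2, NA, NA),
  pair_q50 = c(NA, 2, 4),  pair_50_v_rde = c(4, 3, 1)
)

# Hypothetical helper implementing the rules for one pair of columns
combine_pair <- function(a, b) {
  out <- ifelse(is.na(a), b, a)                       # take whichever value is present
  out[!is.na(a) & !is.na(b) & a != b] <- 999          # both present but different
  out
}

for (x in c(1L, 2L, 50L)) {                           # would be 1:50 on the full data
  left  <- sprintf("pair_q%d", x)
  right <- sprintf("pair_%02d_v_rde", x)
  mydata[[sprintf("newvar_q%d", x)]] <- combine_pair(mydata[[left]], mydata[[right]])
}

mydata[c("id", "newvar_q1", "newvar_q2", "newvar_q50")]
```

This appends the new variables after the original columns, which is the order shown in the question.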

For Loop in R - deleting all rows which match by one variable

I'm trying to completely delete rows in a dataset for cases with matching variables (case ID) with the help of this function I wrote:
del_row_func <- function(x){
for(i in 1:length(x$FALL_ID)){
for(j in 1:length(x$FALL_ID)){
if(x$FALL_ID[i] == x$FALL_ID[j] & i != j){
x[-i, ]
}
}
}
}
Anybody have an idea, why it doesn't work?
The reason your code didn't work was that you weren't modifying or returning x. However, there is a better way to remove all rows with a duplicated ID:
dat = data.frame(FALL_ID = c(1, 2, 2, 3), y = 1:4)
dat
# FALL_ID y
# 1 1 1
# 2 2 2
# 3 2 3
# 4 3 4
dat[!duplicated(dat$FALL_ID) & !duplicated(dat$FALL_ID, fromLast=T),]
# FALL_ID y
# 1 1 1
# 4 3 4
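To make the first point concrete: x[-i, ] inside the loop builds a new data frame and discards it, and the function has no return value. A sketch of a rewritten del_row_func that returns its result and reaches the same outcome as the duplicated() one-liner, here via %in%:

```r
del_row_func <- function(x) {
  # IDs that occur more than once
  dup_ids <- unique(x$FALL_ID[duplicated(x$FALL_ID)])
  # Return the rows whose ID is not among them
  x[!(x$FALL_ID %in% dup_ids), , drop = FALSE]
}

dat <- data.frame(FALL_ID = c(1, 2, 2, 3), y = 1:4)
cleaned <- del_row_func(dat)
cleaned
```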
