Fill rows depending on another row's values - r

The problem is as follows:
I have a database with 3 columns: ID / SCORE / ACTION. I need to identify the first score that is not NA and assign its value (and the action too) to the NAs before it. In this case, observations #1 and #2 should have the same score and action as observation #3. Likewise, observations #4, #5 and #6 should take the values of observation #7.
ID SCORE ACTION
1 NA NA
2 NA NA
3 BB+ T
4 NA NA
5 NA NA
6 NA NA
7 AAA S
8 NA NA
Any ideas? Thanks

You can look into na.locf from the "zoo" package. In this case, you would want to use the fromLast argument, which carries observations backward rather than forward, so each NA takes the next non-NA value:
library(zoo)
na.locf(mydf, fromLast=TRUE)
# ID SCORE ACTION
# 1 1 BB+ T
# 2 2 BB+ T
# 3 3 BB+ T
# 4 4 AAA S
# 5 5 AAA S
# 6 6 AAA S
# 7 7 AAA S
# 8 8 <NA> <NA>
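If you would rather see what fromLast=TRUE is doing without installing zoo, here is a minimal base R sketch of the same backward fill. The data frame is my reconstruction of the table in the question, and fill_backward is a hypothetical helper name, not a function from zoo or base R.

```r
# Reconstruction of the question's data (assumed to be character columns)
mydf <- data.frame(
  ID     = 1:8,
  SCORE  = c(NA, NA, "BB+", NA, NA, NA, "AAA", NA),
  ACTION = c(NA, NA, "T",   NA, NA, NA, "S",   NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: fill each NA with the next non-NA value below it
fill_backward <- function(x) {
  r <- rev(x)                                        # work on the reversed vector
  idx <- cummax(ifelse(is.na(r), 0L, seq_along(r)))  # last non-NA position seen so far
  idx[idx == 0L] <- NA                               # nothing to carry yet
  rev(r[idx])                                        # restore the original order
}

cols <- c("SCORE", "ACTION")
mydf[cols] <- lapply(mydf[cols], fill_backward)
mydf
```

As in the na.locf output above, observation #8 stays NA because there is no later non-NA value to take.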

Related

Store first non-missing value in a new column

Hi, I have several columns that represent scores. For each STUDENT I want to take the first non-NA score and store it in a new column called FIRST.
Here is a reproducible example. This is the data I have now:
df <- data.frame(STUDENT=c(1,2,3,4,5),
CLASS=c(90,91,92,93,95),
SCORE1=c(10,NA,NA,NA,NA),
SCORE2=c(2,NA,8,NA,NA),
SCORE3=c(9,6,6,NA,NA),
SCORE4=c(NA,7,5,1,9),
ROOM=c(01,02, 03, 04, 05))
This is the column I am aiming to add:
df$FIRST <- c(10,6,8,1,9)
This is my attempt:
df$FIRSTGUESS <- max.col(!is.na(df[3:6]), "first")
This is exactly what coalesce from package dplyr does. As described in its documentation:
Given a set of vectors, coalesce() finds the first non-missing value
at each position.
Therefore, you can simply do:
library(dplyr)
df$FIRST <- do.call(coalesce, df[grepl('SCORE', names(df))])
This is the result:
> df
STUDENT CLASS SCORE1 SCORE2 SCORE3 SCORE4 ROOM FIRST
1 1 90 10 2 9 NA 1 10
2 2 91 NA NA 6 7 2 6
3 3 92 NA 8 6 5 3 8
4 4 93 NA NA NA 1 4 1
5 5 95 NA NA NA 9 5 9
You can do this with apply and which.min(is.na(...))
df$FIRSTGUESS <- apply(df[, grep("^SCORE", names(df))], 1, function(x)
x[which.min(is.na(x))])
df
# STUDENT CLASS SCORE1 SCORE2 SCORE3 SCORE4 ROOM FIRSTGUESS
#1 1 90 10 2 9 NA 1 10
#2 2 91 NA NA 6 7 2 6
#3 3 92 NA 8 6 5 3 8
#4 4 93 NA NA NA 1 4 1
#5 5 95 NA NA NA 9 5 9
Note that we need is.na instead of !is.na because FALSE corresponds to 0 and we want to return the first (which.min) FALSE value.
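To make that note concrete, here is a small sketch on a single illustrative row of scores (the vectors are made up for the example). It also shows the edge case of a row with no scores at all, where which.min returns 1 and the result is simply NA.

```r
x <- c(NA, NA, 6, 7)               # one illustrative row of scores
is.na(x)                           # TRUE TRUE FALSE FALSE
first_pos <- which.min(is.na(x))   # position of the first FALSE, i.e. the first non-NA
first_val <- x[first_pos]          # the first non-NA score

# Edge case: every score is missing
all_na <- c(NA_real_, NA_real_)
all_na_val <- all_na[which.min(is.na(all_na))]  # which.min gives 1, so this is NA
```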
Unfortunately, max.col gives the indices of the max values and not the values themselves. However, we can extract the values from the original data frame with an mapply call.
# Select only the columns that have "SCORE" in their name
sub_df <- df[grepl("SCORE", names(df))]
# Get the column index of the first non-NA value in each row
inds <- max.col(!is.na(sub_df), ties.method = "first")
# Extract the value at that column index for each row
df$FIRSTGUESS <- mapply(function(x, y) sub_df[x, y], 1:nrow(sub_df), inds)
df
# STUDENT CLASS SCORE1 SCORE2 SCORE3 SCORE4 ROOM FIRST FIRSTGUESS
#1 1 90 10 2 9 NA 1 10 10
#2 2 91 NA NA 6 7 2 6 6
#3 3 92 NA 8 6 5 3 8 8
#4 4 93 NA NA NA 1 4 1 1
#5 5 95 NA NA NA 9 5 9 9
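As a possible alternative to the mapply step, the same lookup can be sketched with matrix indexing: a two-column matrix of (row, column) pairs picks one cell per row. The name first_score is only illustrative, and I have left out the columns that do not matter for the lookup.

```r
df <- data.frame(STUDENT = c(1, 2, 3, 4, 5),
                 SCORE1 = c(10, NA, NA, NA, NA),
                 SCORE2 = c(2, NA, 8, NA, NA),
                 SCORE3 = c(9, 6, 6, NA, NA),
                 SCORE4 = c(NA, 7, 5, 1, 9))

sub_df <- df[grepl("SCORE", names(df))]
# Column index of the first non-NA score in each row
inds <- max.col(!is.na(sub_df), ties.method = "first")
# Index the score matrix with (row, column) pairs, one pair per row
first_score <- as.matrix(sub_df)[cbind(seq_len(nrow(sub_df)), inds)]
df$FIRSTGUESS <- first_score
df
```

This avoids calling a function once per row, which may matter on larger data, though I have not benchmarked it.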
Using zoo::na.locf, borrowing the setup of sub_df from Ronak's answer:
df['New'] <- zoo::na.locf(t(sub_df), fromLast = TRUE)[1, ]
df
STUDENT CLASS SCORE1 SCORE2 SCORE3 SCORE4 ROOM New
1 1 90 10 2 9 NA 1 10
2 2 91 NA NA 6 7 2 6
3 3 92 NA 8 6 5 3 8
4 4 93 NA NA NA 1 4 1
5 5 95 NA NA NA 9 5 9

Remove groups which do not have non-consecutive NA values in R

I have the following data frame:
group <- c(2,2,2,2,4,4,4,4,5,5,5,5)
D <- c(NA,2,NA,NA,NA,2,3,NA,NA,NA,1,1)
df <- data.frame(group, D)
df
group D
1 2 NA
2 2 2
3 2 NA
4 2 NA
5 4 NA
6 4 2
7 4 3
8 4 NA
9 5 NA
10 5 NA
11 5 1
12 5 1
I would like to keep only the groups that contain non-consecutive NA values at least once. In this case, group 5 would be removed because it contains only consecutive NA values. Groups 2 and 4 remain because they do contain non-consecutive NA values (NA values separated by one or more rows with a non-NA value).
Therefore, the resulting data frame would look like this:
df2
group D
1 2 NA
2 2 2
3 2 NA
4 2 NA
5 4 NA
6 4 2
7 4 3
8 4 NA
Any ideas? :)
How about using the differences between the indices of the NA values within each group?
library(dplyr)
df %>% group_by(group) %>% filter(any(diff(which(is.na(D))) > 1))
## A tibble: 8 x 2
## Groups: group [2]
# group D
# <dbl> <dbl>
#1 2. NA
#2 2. 2.
#3 2. NA
#4 2. NA
#5 4. NA
#6 4. 2.
#7 4. 3.
#8 4. NA
I'm not sure this would catch all potential edge cases but it seems to work for the given example.
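To probe a few of those edge cases, here is a base R sketch that pulls the same test into a helper and tries it on small vectors. has_gap is a hypothetical name, and the tapply-based filter is just one way to reproduce the dplyr result without the package.

```r
# Hypothetical helper: TRUE if at least two NAs are separated by a non-NA value
has_gap <- function(d) any(diff(which(is.na(d))) > 1)

gap_mixed  <- has_gap(c(NA, 2, NA, NA))  # NAs at positions 1, 3, 4: a gap exists
gap_block  <- has_gap(c(NA, NA, 1, 1))   # NAs at positions 1, 2 only: no gap
gap_single <- has_gap(c(1, NA, 3))       # a single NA: diff() is empty, so FALSE
gap_none   <- has_gap(c(1, 2, 3))        # no NA at all: FALSE

group <- c(2, 2, 2, 2, 4, 4, 4, 4, 5, 5, 5, 5)
D <- c(NA, 2, NA, NA, NA, 2, 3, NA, NA, NA, 1, 1)
df <- data.frame(group, D)

# One flag per group, then expand it back to one flag per row
flag <- tapply(df$D, df$group, has_gap)
df2 <- df[as.vector(flag[as.character(df$group)]), ]
df2
```

Groups with zero or one NA are dropped, since any() of an empty comparison is FALSE. Whether that is the desired behaviour depends on how you read the question.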

Grouping in Embedded Group Structures in R data.table

I have a data.table object that looks like this:
FamilyID InterFamilyID MumInFamilyID Edu
1 1 NA 2
1 2 NA 5
1 3 2 3
2 1 NA 6
2 2 1 9
2 2 1 3
I want to perform a query like this one:
tbl1[, MumEdu:= Edu[InterFamilyID == MumInFamilyID], by=FamilyID]
to get something like this:
FamilyID InterFamilyID MumInFamilyID Edu MumEdu
1 1 NA 2 NA
1 2 NA 5 NA
1 3 2 3 5
2 1 NA 6 NA
2 2 1 9 6
2 2 1 3 6
In fact, I have a data.table grouped by one column (FamilyID), and within each of these groups the members are identified by another column (InterFamilyID). A further column holds a reference to the within-group ID of another group member. I want to use these references to access values from the referenced rows.
You can use match, which, per its documentation:
returns a vector of the positions of (first) matches of its first argument in its second.
You can then use the resulting positions to look up the corresponding element in the Edu column:
tbl1[, MumEdu := Edu[match(MumInFamilyID, InterFamilyID)], by = FamilyID]
tbl1
# FamilyID InterFamilyID MumInFamilyID Edu MumEdu
#1: 1 1 NA 2 NA
#2: 1 2 NA 5 NA
#3: 1 3 2 3 5
#4: 2 1 NA 6 NA
#5: 2 2 1 9 6
#6: 2 2 1 3 6
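To see what match contributes inside each group, here is a plain-vector sketch using the values of family 1 from the table above (no data.table needed). It also shows why the == attempt from the question does not do the lookup: it only compares each row with itself.

```r
# Values for FamilyID == 1, taken from the question's table
InterFamilyID <- c(1, 2, 3)
MumInFamilyID <- c(NA, NA, 2)
Edu           <- c(2, 5, 3)

# == is elementwise: row 3 asks "is 3 equal to 2?", not "which row has ID 2?"
same_row <- InterFamilyID == MumInFamilyID   # NA NA FALSE

# match looks each mother ID up among the member IDs of the family
pos <- match(MumInFamilyID, InterFamilyID)   # NA NA 2
mum_edu <- Edu[pos]                          # NA NA 5
```

Rows without a mother reference get NA from match, and indexing with NA gives NA, which is why MumEdu is NA for them.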

Getting the value of columns for which more than one value exists in a dataframe

If I have a dataframe in R like this,
1 2 abc bh abd NA NA
2 3 abc NA NA NA NA
3 4 NA NA ad yu ae
...................
I want to get the values in columns 1 and 2 for the rows that have more than one non-NA value in the remaining columns. For example, here, row 1 2 has 3 values and row 3 4 has 3 values as well, while row 2 3 has only one value and the rest are NA. So, I want 1 2 and 3 4. How can I do it in R?
Thanks!
x <- read.table(text="1 2 abc bh abd NA NA
2 3 abc NA NA NA NA
3 4 NA NA ad yu ae")
x[rowSums(!is.na(x[, -1:-2])) > 1, 1:2]
# V1 V2
#1 1 2
#3 3 4
!is.na(x[, -1:-2]) returns a matrix of TRUE/FALSE values. rowSums converts TRUE values to 1 and FALSE values to 0 and sums them by row. Subset to only include rows where that is greater than 1, and return columns 1:2.
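If it helps to see the intermediate steps, this sketch splits the one-liner into its parts on the same data. The names present and counts are only for illustration.

```r
x <- read.table(text = "1 2 abc bh abd NA NA
2 3 abc NA NA NA NA
3 4 NA NA ad yu ae")

present <- !is.na(x[, -1:-2])   # logical matrix: TRUE where a value exists
counts  <- rowSums(present)     # number of non-NA values per row
result  <- x[counts > 1, 1:2]   # keep the ID columns of rows with more than one value
result
```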

Retrieving subset of a data frame by finding entries with NA in specific columns

Suppose we had a data frame with NA values like so,
>data
A B C D
1 3 NA 4
2 1 3 4
NA 3 3 5
4 2 NA NA
2 NA 4 3
1 1 1 2
I wish to know a general method for retrieving the subset of data with NA values in C or A. So the output should be,
A B C D
1 3 NA 4
NA 3 3 5
4 2 NA NA
I tried using the subset command like so, subset(data, A==NA | C==NA), but it didn't work. Any ideas?
A very handy function for this sort of thing is complete.cases. It checks each row for NA and returns FALSE if the row contains any, or TRUE if it contains none.
So, you need to take just the two columns of interest, apply complete.cases(.) to them, negate the result, and use it to select those rows from your original data, as follows:
# assuming your data is in 'df'
df[!complete.cases(df[, c("A", "C")]), ]
# A B C D
# 1 1 3 NA 4
# 3 NA 3 3 5
# 4 4 2 NA NA
Here is one possibility:
# Read your data
data <- read.table(text="
A B C D
1 3 NA 4
2 1 3 4
NA 3 3 5
4 2 NA NA
2 NA 4 3
1 1 1 2", header=TRUE, sep="")
# Now subset your data
subset(data, is.na(C) | is.na(A))
A B C D
1 1 3 NA 4
3 NA 3 3 5
4 4 2 NA NA
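As for why the original attempt with A==NA returned nothing: any comparison with NA gives NA rather than TRUE or FALSE, and subset treats NA conditions as FALSE. A short sketch on the same data (the variable names cmp, wrong and right are just for illustration):

```r
data <- read.table(text = "
A B C D
1 3 NA 4
2 1 3 4
NA 3 3 5
4 2 NA NA
2 NA 4 3
1 1 1 2", header = TRUE)

cmp   <- data$A == NA                        # NA for every row, never TRUE
wrong <- subset(data, A == NA | C == NA)     # so no row is selected
right <- subset(data, is.na(A) | is.na(C))   # is.na() gives proper TRUE/FALSE
right
```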

Resources