Overwriting a row with a matched ID value in the same dataframe - r

I have an odd issue I just haven't been able to find a good solution to. Basically, our system outputs scores for people, usually only the first time they are in the system. However, sometimes it enters the scores the second time they are in the system, and sometimes it fills all rows like it should. Correcting the database would be ideal but that isn't going to happen (thanks management). We also can't just get rid of duplicate ID values as they are duplicated for a reason. What I need to do is copy the scores into the fields that have NA for all matching ID values. So, here is a data example:
ID VAR1 VAR2 VAR3 VAR4 VAR5
1 16 15 14 15 46
1 NA NA NA NA NA
2 15 12 11 14 12
3 14 12 12 15 22
3 14 12 12 15 22
4 NA NA NA NA NA
4 11 04 12 33 12
6 NA NA NA NA NA
The output would look like
ID VAR1 VAR2 VAR3 VAR4 VAR5
1 16 15 14 15 46
1 16 15 14 15 46
2 15 12 11 14 12
3 14 12 12 15 22
3 14 12 12 15 22
4 11 04 12 33 12
4 11 04 12 33 12
6 NA NA NA NA NA
I managed to get something working for this problem in order to move it off my desk, but this problem is going to be recurring and I want a better solution. My solution is:
library(dplyr)

df_2 <- list()
for (i in df$ID) {
  df_2[[i]] <- filter(df, ID == i) %>%
    mutate(VAR1 = mean(VAR1, na.rm = TRUE),
           VAR2 = mean(VAR2, na.rm = TRUE),
           VAR3 = mean(VAR3, na.rm = TRUE),
           VAR4 = mean(VAR4, na.rm = TRUE),
           VAR5 = mean(VAR5, na.rm = TRUE))
}
# Then we bind this together as a dataframe
df_replaced <- bind_rows(df_2)
# Remove the list object as it's huge
rm(df_2)
This works, but it takes about a thousand years and creates a temporary list of around 4 gigs (df_2), which is why I need to remove it as soon as possible: it pretty much brings my system to a complete halt. I feel like there is something with match here, but I'm not really sure how to intelligently select which data row to copy over each NA row.
EDIT: Fixed table formatting.

Here is a base R method using is.na and match to select the indices to use as fillers and to be filled.
df[is.na(df$VAR1), -1] <- df[match(df$ID[is.na(df$VAR1)],
                                   df$ID[ifelse(!is.na(df$VAR1), TRUE, NA)]), -1]
which returns
df
ID VAR1 VAR2 VAR3 VAR4 VAR5
1 1 16 15 14 15 46
2 1 16 15 14 15 46
3 2 15 12 11 14 12
4 3 14 12 12 15 22
5 3 14 12 12 15 22
6 4 11 4 12 33 12
7 4 11 4 12 33 12
8 6 NA NA NA NA NA
The trick here is to use ifelse to build the lookup table (the second argument to match) with the same length as the number of rows in the data.frame: positions where VAR1 is NA become NA and can never be matched, so match returns row numbers that line up with the original data.frame.
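For comparison, here is a sketch of the same fill-by-group idea in the tidyverse the asker is already using (assuming tidyr >= 1.0, whose fill() supports .direction = "downup"); it avoids both the match bookkeeping and the per-ID loop:
library(dplyr)
library(tidyr)

# Within each ID, copy the non-NA scores into the NA rows, filling both
# downwards and upwards; IDs with no scores at all (like 6) stay NA.
df_filled <- df %>%
  group_by(ID) %>%
  fill(VAR1:VAR5, .direction = "downup") %>%
  ungroup()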

Related

How to combine/concatenate two dataframes one after the other but not merging common columns in R?

Suppose there are two dataframes as follows with the same column names, and I want to combine/concatenate them one after the other without merging the common columns. There is a way of assigning it column-wise like df1[3] <- df2[1], but I would like to know if there is some other way.
df1<-data.frame(A=c(1:10), B=c(2:5, rep(NA,6)))
df2<-data.frame(A=c(12:20), B=c(32:40))
Expected Output:
A B A.1 B.1
1 2 12 32
2 3 13 33
3 4 14 34
4 5 15 35
5 NA 16 36
6 NA 17 37
7 NA 18 38
8 NA 19 39
9 NA 20 40
10 NA NA NA
I tend to work with multiple frames like this as a list of frames. Try this:
LOF <- list(df1, df2)
maxrows <- max(sapply(LOF, nrow))
out <- do.call(cbind, lapply(LOF, function(z) z[seq_len(maxrows),]))
names(out) <- make.names(names(out), unique = TRUE)
out
# A B A.1 B.1
# 1 1 2 12 32
# 2 2 3 13 33
# 3 3 4 14 34
# 4 4 5 15 35
# 5 5 NA 16 36
# 6 6 NA 17 37
# 7 7 NA 18 38
# 8 8 NA 19 39
# 9 9 NA 20 40
# 10 10 NA NA NA
One advantage of this is that it allows you to work with an arbitrary number of frames, not just two.
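For example, with a hypothetical third frame df3, the same four lines work unchanged:
df3 <- data.frame(A = 101:103, B = 201:203)   # hypothetical third frame
LOF <- list(df1, df2, df3)
maxrows <- max(sapply(LOF, nrow))
out <- do.call(cbind, lapply(LOF, function(z) z[seq_len(maxrows), ]))
names(out) <- make.names(names(out), unique = TRUE)
out   # columns A, B, A.1, B.1, A.2, B.2, with the shorter frames padded with NA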
One base R way could be
setNames(Reduce(cbind.data.frame,
                Map(`length<-`, c(df1, df2), max(nrow(df1), nrow(df2)))),
         paste0(names(df1), rep(c('', '.1'), each = 2)))
# A B A.1 B.1
# 1 1 2 12 32
# 2 2 3 13 33
# 3 3 4 14 34
# 4 4 5 15 35
# 5 5 NA 16 36
# 6 6 NA 17 37
# 7 7 NA 18 38
# 8 8 NA 19 39
# 9 9 NA 20 40
# 10 10 NA NA NA
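The `length<-` trick may be worth unpacking (a small illustration of mine, not part of the original answer): assigning a longer length pads a vector with NA, and c(df1, df2) concatenates the two frames into one plain list of four columns for Map to pad one by one:
x <- 1:3
length(x) <- 5   # assigning a longer length pads with NA
x
# [1]  1  2  3 NA NA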
Another option is to use the merge function. The documentation can be a bit cryptic, so here is a short explanation of the arguments:
by -- "the name "row.names" or the number 0 specifies the row names"
all = TRUE -- keeps all original rows from both dataframes
suffixes -- specify how you want the duplicated colnames to be distinguished
sort -- keep original sorting
merge(df1, df2, by = 0, all = TRUE, suffixes = c('', '.1'), sort = FALSE)
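One small follow-up (my note, not from the original answer): merging on row names adds a Row.names key column to the result, which you may want to drop afterwards:
out <- merge(df1, df2, by = 0, all = TRUE, suffixes = c('', '.1'), sort = FALSE)
out$Row.names <- NULL   # drop the row-name key once the frames are combined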
One way would be
cbind(
  df1,
  rbind(
    df2,
    rep(NA, nrow(df1) - nrow(df2))
  )
)
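Note that rep(NA, nrow(df1) - nrow(df2)) only happens to build a single padding row here, because the difference is exactly one row and the length-1 NA gets recycled across df2's columns. If the frames could differ by more than one row, a padding sketch along these lines may be safer (pad_to is a hypothetical helper; as in the first answer, indexing past nrow() yields all-NA rows):
# Hypothetical helper: pad a data.frame with NA rows up to n rows.
pad_to <- function(df, n) df[seq_len(max(n, nrow(df))), ]
cbind(df1, pad_to(df2, nrow(df1)))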

Create "row" from first non-NA value in an R data frame

I want to create a "row" containing the first non-NA value that appears in a data frame. So for example, given this test data frame:
test.df <- data.frame(a=c(11,12,13,14,15,16),b=c(NA,NA,23,24,25,26), c=c(31,32,33,34,35,36), d=c(NA,NA,NA,NA,45,46))
test.df
a b c d
1 11 NA 31 NA
2 12 NA 32 NA
3 13 23 33 NA
4 14 24 34 NA
5 15 25 35 45
6 16 26 36 46
I know that I can detect the first appearance of a non-NA like this:
first.appearance <- as.numeric(sapply(test.df, function(col) min(which(!is.na(col)))))
first.appearance
[1] 1 3 1 5
This tells me that the first element in column 1 is not NA, the third element in column 2 is not NA, the first element in column 3 is not NA, and the fifth element in column 4 is not NA. But when I put the pieces together, it yields this (which is logical, but not what I want):
> test.df[first.appearance,]
a b c d
1 11 NA 31 NA
3 13 23 33 NA
1.1 11 NA 31 NA
5 15 25 35 45
I would like the output to be the first non-NA in each column. What is a base or dplyr way to do this? I am not seeing it. Thanks in advance.
a b c d
1 11 23 31 45
We can use
library(dplyr)
test.df %>%
  slice(first.appearance) %>%
  summarise_all(~ first(.[!is.na(.)]))
# a b c d
#1 11 23 31 45
Or it can be
test.df %>%
  summarise_all(~ min(na.omit(.)))
# a b c d
#1 11 23 31 45
Or with colMins
library(matrixStats)
colMins(as.matrix(test.df), na.rm = TRUE)
#[1] 11 23 31 45
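A plain base R sketch along the same lines, reusing the asker's which(!is.na(col)) logic to subscript each column at its first non-NA position:
# For each column, take the element at the first non-NA index.
sapply(test.df, function(col) col[which(!is.na(col))[1]])
#  a  b  c  d
# 11 23 31 45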
You can use:
library(tidyverse)
test.df %>% fill(everything(), .direction = "up") %>% head(1)
a b c d
<dbl> <dbl> <dbl> <dbl>
1 11 23 31 45

Wide format: a function to calculate row means for specific batches of columns, then scale up for multiple batches

This is a follow-up question to a previous post of mine about building a function for calculating row means.
I want to use a function from the apply family to iterate over my dataset and each time compute the row mean (which is what the function does) for a group of columns I specify. Unfortunately, I'm missing something critical in the way I should tweak apply(), because I get an error that I can't troubleshoot.
Example Data
capital_cities_df <-
  data.frame("europe_paris" = 1:10,
             "europe_london" = 11:20,
             "europe_rome" = 21:30,
             "asia_bangkok" = 31:40,
             "asia_tokyo" = 41:50,
             "asia_kathmandu" = 51:60)
set.seed(123)
capital_cities_df <- as.data.frame(lapply(capital_cities_df,
                                          function(cc) cc[sample(c(TRUE, NA),
                                                                 prob = c(0.70, 0.30),
                                                                 size = length(cc),
                                                                 replace = TRUE)]))
> capital_cities_df
europe_paris europe_london europe_rome asia_bangkok asia_tokyo asia_kathmandu
1 1 NA NA NA 41 NA
2 NA 12 22 NA 42 52
3 3 NA 23 33 43 NA
4 NA 14 NA NA NA NA
5 NA 15 25 35 45 NA
6 6 NA NA 36 NA 56
7 NA 17 NA NA NA 57
8 NA 18 NA 38 48 NA
9 NA 19 NA 39 49 NA
10 10 NA 30 40 NA 60
Custom Function
library(dplyr)
library(rlang)
continent_mean <- function(df, continent) {
  df %>%
    select(starts_with(continent)) %>%
    dplyr::mutate(!!quo_name(continent) := rowMeans(., na.rm = TRUE))
}
## works for a single case:
continent_mean(capital_cities_df, "europe")
europe_paris europe_london europe_rome europe
1 1 NA 21 11
2 2 12 22 12
3 3 NA 23 13
4 4 14 NA 9
5 NA 15 25 20
6 6 16 26 16
7 NA 17 NA 17
8 NA 18 NA 18
9 NA 19 NA 19
10 10 20 30 20
Trying to apply the function over the data, unsuccessfully
apply(
  capital_cities_df,
  MARGIN = 2,
  FUN = continent_mean(capital_cities_df, continent = "europe")
)
Error in match.fun(FUN) :
'continent_mean(capital_cities_df, continent = "europe")' is not a function, character or symbol
Any other combination of the arguments in apply() didn't work either, nor did sapply. This unsuccessful attempt at using apply covers only one batch of columns I wish to get the mean for ("europe"). However, my ultimate goal is to be able to pass c("europe", "asia", etc.) with apply, so the custom function creates row-means columns for all the groups of columns I specify, in one hit.
What is wrong with my code?
Thanks!
EDIT 19-AUG-2019
I was trying the solution suggested by A. Suliman (see below). It did work for the example data I posted here, but not when I tried to scale it up to my real dataset, where I need to subset additional columns (rather than the "continent" batch only). More specifically, in my real data I have an ID column which I want output alongside the other data when I apply my custom-made function.
Example data including "ID" column
capital_cities_df <- data.frame(
  "europe_paris" = 1:10,
  "europe_london" = 11:20,
  "europe_rome" = 21:30,
  "asia_bangkok" = 31:40,
  "asia_tokyo" = 41:50,
  "asia_kathmandu" = 51:60)
set.seed(123)
capital_cities_df <- as.data.frame(lapply(capital_cities_df,
                                          function(cc) cc[sample(c(TRUE, NA),
                                                                 prob = c(0.70, 0.30),
                                                                 size = length(cc),
                                                                 replace = TRUE)]))
id <- 1:10
capital_cities_df <- cbind(id, capital_cities_df)
> capital_cities_df
id europe_paris europe_london europe_rome asia_bangkok asia_tokyo asia_kathmandu
1 1 1 NA NA NA 41 NA
2 2 NA 12 22 NA 42 52
3 3 3 NA 23 33 43 NA
4 4 NA 14 NA NA NA NA
5 5 NA 15 25 35 45 NA
6 6 6 NA NA 36 NA 56
7 7 NA 17 NA NA NA 57
8 8 NA 18 NA 38 48 NA
9 9 NA 19 NA 39 49 NA
10 10 10 NA 30 40 NA 60
My function (edited to select id as well)
continent_mean <- function(df, continent) {
  df %>%
    select(., id, starts_with(continent)) %>%
    dplyr::mutate(!!quo_name(continent) := rowMeans(., na.rm = TRUE))
}
> continent_mean(capital_cities_df, "europe") ## works in a single run
id europe_paris europe_london europe_rome europe
1 1 1 NA NA 1.000000
2 2 NA 12 22 12.000000
3 3 3 NA 23 9.666667
4 4 NA 14 NA 9.000000
5 5 NA 15 25 15.000000
6 6 6 NA NA 6.000000
7 7 NA 17 NA 12.000000
8 8 NA 18 NA 13.000000
9 9 NA 19 NA 14.000000
10 10 10 NA 30 16.666667
Trying to apply the function beyond the single use (based on A. Suliman's method) -- unsuccessfully
continents <- c("europe", "asia")
lst <- lapply(continents, function(x)
  continent_mean(df = capital_cities_df[, grep(x, names(capital_cities_df))],
                 continent = x))
## or:
purrr::map_dfc(continents,
               ~ continent_mean(df = capital_cities_df[, grep(.x, names(capital_cities_df))],
                                continent = .x))
In either case I get a variety of error messages:
Error in inds_combine(.vars, ind_list) : Position must be between 0 and n
At other times:
Error: invalid column index : NA for variable: 'NA' = 'NA'
All I wanted was a simple function to let me calculate row means per specification of which columns to run over, but this gets nasty for some reason. Even though I'm eager to figure out what's wrong with my code, if anybody has a better overarching solution for the entire process I'd be thankful too.
Thanks!
apply()'s FUN argument must be a function object, but continent_mean(capital_cities_df, continent = "europe") is a call that gets evaluated before apply() ever runs, which is what match.fun is complaining about. Instead, use lapply to loop through continents, then use grep to select the columns matching the current continent:
continents <- c("europe", "asia")
lst <- lapply(continents, function(x)
  continent_mean(df = capital_cities_df[, grep(x, names(capital_cities_df))],
                 continent = x))
# To a dataframe, not a list
do.call(cbind, lst)
Using map_dfc from purrr we can get the result in one step
purrr::map_dfc(continents,
               ~ continent_mean(df = capital_cities_df[, grep(.x, names(capital_cities_df))],
                                continent = .x))
Update:
#grep will return column positions when they match with "europe" or "asia", e.g
> grep("europe", names(capital_cities_df))
[1] 2 3 4
#If we need the column names then we add value=TRUE to grep
> grep("europe", names(capital_cities_df), value = TRUE)
[1] "europe_paris" "europe_london" "europe_rome"
So to add a new column we can just use the c() function and call the function as usual
# NOTE: Here I'm using the old function without select
lst <- lapply(continents, function(x)
  continent_mean(df = capital_cities_df[, c('id', grep(x, names(capital_cities_df), value = TRUE))],
                 continent = x))
do.call(cbind, lst)
id europe_paris europe_london europe_rome europe id asia_bangkok asia_tokyo asia_kathmandu asia
1 1 1 NA NA 1.00000 1 NA 41 51 31.00000
2 2 NA 12 22 12.00000 2 NA 42 52 32.00000
3 3 3 13 23 10.50000 3 33 43 NA 26.33333
4 4 NA 14 NA 9.00000 4 NA 44 54 34.00000
5 5 NA 15 25 15.00000 5 35 45 55 35.00000
6 6 6 NA NA 6.00000 6 36 46 56 36.00000
7 7 7 17 27 14.50000 7 NA 47 57 37.00000
8 8 NA 18 28 18.00000 8 38 48 NA 31.33333
9 9 9 19 29 16.50000 9 39 49 NA 32.33333
10 10 10 NA 30 16.66667 10 40 NA 60 36.66667
# We have one problem: the id column gets duplicated. map_dfc with select will solve this issue.
purrr::map_dfc(continents,
               ~ continent_mean(df = capital_cities_df[, c('id', grep(.x, names(capital_cities_df), value = TRUE))],
                                continent = .x)) %>%
  # Don't select any column name that ends with id followed by one digit
  select(-matches('id\\d'))
If you'd like to use the new function with select then just pass capital_cities_df without grep, e.g using map_dfc
purrr::map_dfc(continents, ~ continent_mean(df = capital_cities_df, continent = .x)) %>%
  select(-matches('id\\d'))
Correction: in continent_mean
continent_mean <- function(df, continent) {
  df %>%
    select(., id, starts_with(continent)) %>%
    # Exclude id from the rowMeans calculation
    dplyr::mutate(!!quo_name(continent) := rowMeans(.[grep(continent, names(.))], na.rm = TRUE))
}

R - Detect end of observations in groups and remove redundant rows

I have a data.frame consisting of about 300k rows, with 24 rows for each ID, each row representing an hourly observation of that ID. My problem is that for some IDs the observation ends before the 24 hours have gone by, yet they still have 24 rows, with the remaining rows having NA in their three observation variables.
A simplified table would look something like this:
ID HOUR OBS_1 OBS_2 OBS_3 MISC MISC_2
1 0 29 32 34 19 21
1 1 21 12 NA 19 21
1 2 NA 24 NA 19 21
1 3 NA NA NA 19 21
1 4 NA NA NA 19 21
2 0 41 16 21 13 24
2 1 NA NA NA 13 24
2 2 11 30 41 13 24
2 3 21 NA NA 13 24
2 4 24 35 21 13 24
2 5 NA NA NA 13 24
2 6 NA NA NA 13 24
3 0 NA NA NA 35 46
3 1 23 34 24 35 46
3 2 NA 26 NA 35 46
3 3 NA NA 24 35 46
3 4 12 29 42 35 46
3 5 NA NA NA 35 46
3 6 NA NA NA 35 46
In the table, each ID would represent a scenario that should be handled appropriately:
ID 1: Ordinary, with observations starting from hour 0 and observation ending at hour 3; the rows with hours 3 and 4 for that group should thus be removed.
ID 2: Has an hour (1) where all three observation variables are NA, but observation resumes and ends at hour 5; the row for hour 1 should thus be kept (a faulty registration, not the end of observation) and the rows with hours 5 and 6 should be removed.
ID 3: Starts out with a row with NA in all three observation variables, but observation begins the next hour and ends at hour 5. This is akin to the scenario for ID 2, but this time occurring at the very start (instead of in the middle) of the observations. It still represents a faulty registration, so that row should be kept, while the rows for hours 5 and 6 in this group should be removed.
Conceptually, I would think a possible solution would be to group_by ID and then have R go through the rows of each group in reverse (from the bottom up) until it encounters a row where OBS_1, OBS_2 and OBS_3 are not all NA, remove the rows examined before reaching that row, and then move on to the next group.
Any help would be greatly appreciated!
If your MISC and MISC_2 values are consistent for each ID, you could filter out all rows whose three observation values are NA, then rebuild the missing hours with complete and fill.
library(dplyr)
library(tidyr)
df %>%
  filter(!(is.na(OBS_1) & is.na(OBS_2) & is.na(OBS_3))) %>%
  group_by(ID) %>%
  complete(HOUR = 0:max(HOUR)) %>%
  fill(MISC, MISC_2) %>%
  fill(MISC, MISC_2, .direction = "up")
# A tibble: 13 x 7
# Groups: ID [3]
# ID HOUR OBS_1 OBS_2 OBS_3 MISC MISC_2
# <int> <int> <int> <int> <int> <int> <int>
# 1 1 0 29 32 34 19 21
# 2 1 1 21 12 NA 19 21
# 3 1 2 NA 24 NA 19 21
# 4 2 0 41 16 21 13 24
# 5 2 1 NA NA NA 13 24
# 6 2 2 11 30 41 13 24
# 7 2 3 21 NA NA 13 24
# 8 2 4 24 35 21 13 24
# 9 3 0 NA NA NA 35 46
# 10 3 1 23 34 24 35 46
# 11 3 2 NA 26 NA 35 46
# 12 3 3 NA NA 24 35 46
# 13 3 4 12 29 42 35 46
This second approach removes a row of missing values only if no observations for that day exist after it, and keeps all missing observations that do not indicate the end of the day's observations. It also allows your other variables to vary during the day, because rows are removed only once the end of observations is reached.
df %>%
  arrange(rev(as.numeric(rownames(.)))) %>%
  group_by(ID) %>%
  mutate(rowNum = 1:n(),
         naObs = cumsum(is.na(OBS_1) & is.na(OBS_2) & is.na(OBS_3)),
         missingBlock = naObs != rowNum) %>%
  slice(min(which(missingBlock)):n()) %>%
  ungroup() %>%
  arrange(rev(as.numeric(rownames(.)))) %>%
  select(-rowNum, -naObs, -missingBlock)
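A more direct sketch of the reverse-scan idea the asker described, keeping each group's rows only up to its last observed row (this assumes every ID has at least one row where the three OBS variables are not all NA):
library(dplyr)

df %>%
  group_by(ID) %>%
  # the last row where not all three OBS variables are NA marks the true end
  filter(row_number() <= max(which(!(is.na(OBS_1) & is.na(OBS_2) & is.na(OBS_3))))) %>%
  ungroup()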

Forcing unique values before casting (pivoting) in R

I have a data frame as follows
Identifier V1 Location V2
1 12 A 21
1 12 B 24
2 20 B 15
2 20 C 18
2 20 B 23
3 43 A 10
3 43 B 17
3 43 A 18
3 43 B 20
3 43 C 25
3 43 A 30
I’d like to re-cast it with a single row for each Identifier and one column for each value in the current Location column. I don’t care about the data in V1, but I need the data in V2; these will become the values in the new columns.
Note that for the Location column there are repeated values for Identifiers 2 and 3.
I ASSUME that the first task is to make the values in the Location column unique.
I used the following (the data frame is called “Test”)
L <- length(Test$Identifier)
for (i in 1:L) {
  temp <- Test$Location[Test$Identifier == i]
  temp1 <- make.unique(as.character(temp), sep = "-")
  levels(Test$Location) <- c(levels(Test$Location), temp1)
  Test$Location[Test$Identifier == i] <- temp1
}
This produces
Identifier V1 Location V2
1 12 A 21
1 12 B 24
2 20 B 15
2 20 C 18
2 20 B-1 23
3 43 A 10
3 43 B 17
3 43 A-1 18
3 43 B-1 20
3 43 C 25
3 43 A-2 30
Then using
cast(Test, Identifier ~ Location)
gives
Identifier A B C B-1 A-1 A-2
1 21 24 NA NA NA NA
2 NA 15 18 23 NA NA
3 10 17 25 20 18 30
And this is more or less what I want.
My questions are
Is this the right way to handle the problem?
I know R people don’t use the “for” construction, so is there a more R-elegant (relegant?) way to do this? I should mention that the real data set has over 160,000 rows and starts with over 50 unique values in the Location vector, and the function takes just over an hour to run. Anything quicker would be good. I should also mention that the cast function had to be run on 20-30k rows of the output at a time despite increasing the memory limit; all the cast outputs were then merged.
Is there a way to sort the columns in the output so that (here) they are A, A-1, A-2, B, B-1, C?
Please be gentle with your replies!
Usually your original format is much better than your desired result. However, you can do this easily using the split-apply-combine approach, e.g., with package plyr:
DF <- read.table(text="Identifier V1 Location V2
1 12 A 21
1 12 B 24
2 20 B 15
2 20 C 18
2 20 B 23
3 43 A 10
3 43 B 17
3 43 A 18
3 43 B 20
3 43 C 25
3 43 A 30", header=TRUE, stringsAsFactors=FALSE)
#note that I make sure that there are only characters and not factors
#use as.character if you have factors
library(plyr)
DF <- ddply(DF, .(Identifier), transform, Loc2 = make.unique(Location, sep="-"))
library(reshape2)
DFwide <- dcast(DF, Identifier ~ Loc2, value.var = "V2")
# Identifier A B B-1 C A-1 A-2
#1 1 21 24 NA NA NA NA
#2 2 NA 15 23 18 NA NA
#3 3 10 17 20 25 18 30
If column order is important to you (usually it isn't):
DFwide[, c(1, order(names(DFwide)[-1])+1)]
# Identifier A A-1 A-2 B B-1 C
#1 1 21 NA NA 24 NA NA
#2 2 NA NA NA 15 23 18
#3 3 10 18 30 17 20 25
For reference, here's the equivalent of @Roland's answer in base R.
Use ave to create the unique "Location" columns....
DF$Location <- with(DF, ave(Location, Identifier,
                            FUN = function(x) make.unique(x, sep = "-")))
... and reshape to change the structure of your data.
## If you want both V1 and V2 in your "wide" dataset
## "dcast" can't directly do this--you'll need `recast` if you
## wanted both columns, which first `melt`s and then `dcast`s....
reshape(DF, direction = "wide", idvar = "Identifier", timevar = "Location")
## If you only want V2, as you indicate in your question
reshape(DF, direction = "wide", idvar = "Identifier",
timevar = "Location", drop = "V1")
# Identifier V2.A V2.B V2.C V2.B-1 V2.A-1 V2.A-2
# 1 1 21 24 NA NA NA NA
# 3 2 NA 15 18 23 NA NA
# 6 3 10 17 25 20 18 30
Reordering the columns can be done the same way that @Roland suggested.
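For completeness, a sketch of the same split-apply-combine idea using dplyr with tidyr (assuming tidyr >= 1.0 for pivot_wider):
library(dplyr)
library(tidyr)

DF %>%
  group_by(Identifier) %>%
  mutate(Loc2 = make.unique(Location, sep = "-")) %>%   # de-duplicate within each Identifier
  ungroup() %>%
  select(-V1, -Location) %>%
  pivot_wider(names_from = Loc2, values_from = V2)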
