I would like to have a function to split data frames like this:
q1 q2 q3 q4
 1  4  0 33
 8  5 33 44
na na na na
na na na na
 3 33  2 66
 4  2  3 88
 6 44  5 99
We will get 2 dataframes:
d1
q1 q2 q3 q4
 1  4  0 33
 8  5 33 44
and
d2
q1 q2 q3 q4
 3 33  2 66
 4  2  3 88
 6 44  5 99
The number of observations in d1 and d2 is not fixed: we don't know in advance how many rows the data frame contains or how many of them are NA rows.
Suppose DF is the data frame. Since the splitting criterion wasn't specified precisely, let's assume that any row consisting entirely of NAs is a dividing row. If it's some other criterion, change the first line accordingly:
isNA <- apply(is.na(DF), 1, all)            # TRUE for rows that are entirely NA
split(DF[ !isNA, ], cumsum(isNA)[ !isNA ])  # group kept rows by how many NA rows precede them
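To see what the second line does, here are the intermediate values for the posted data (assuming the "na" entries have been read in as real NA values; rows 3-4 are the dividers):
isNA                 # FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE
cumsum(isNA)         # 0 0 1 2 2 2 2  -- each divider row bumps the group counter
cumsum(isNA)[!isNA]  # 0 0 2 2 2      -- group labels for the kept rows
split() then returns the two data frames as $`0` (rows 1-2) and $`2` (rows 5-7).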
First, read in your data so that "na" gets converted to actual NA values.
mydf <- read.table(
header = TRUE,
na.strings="na",
text = "q1 q2 q3 q4
1 4 0 33
8 5 33 44
na na na na
3 33 2 66
4 2 3 88
6 44 5 99")
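After parsing, mydf contains real NA values:
mydf
#   q1 q2 q3 q4
# 1  1  4  0 33
# 2  8  5 33 44
# 3 NA NA NA NA
# 4  3 33  2 66
# 5  4  2  3 88
# 6  6 44  5 99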
Second, figure out where to split your data.frame:
# Run lengths of the indicator "this row is entirely NA"
RLE <- rle(rowSums(is.na(mydf)) == ncol(mydf))$lengths
# Use that to create "groups" of rows
RLE2 <- rep(seq_along(RLE), RLE)
# Replace even-numbered groups with NA -- those are the runs of all-NA divider rows we don't want
RLE2[RLE2 %% 2 == 0] <- NA
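For the example data, the intermediate values are:
RLE   # 2 1 3          -- two data rows, one all-NA row, three data rows
RLE2  # 1 1 NA 3 3 3   -- per-row group labels, with the divider row blanked out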
Third, split your data.frame:
split(mydf, RLE2)
# $`1`
# q1 q2 q3 q4
# 1 1 4 0 33
# 2 8 5 33 44
#
# $`3`
# q1 q2 q3 q4
# 4 3 33 2 66
# 5 4 2 3 88
# 6 6 44 5 99
However, this is all somewhat guesswork, because your statement that "we do not know the obs in the dataframe and how many obs are NAs" is not really clear. Here, I've assumed that you want to split the data whenever you encounter a full row of NA values.
Suppose there are two data frames, as follows, with the same column names, and I want to concatenate them side by side without merging the common columns. One way is to assign column-wise, like df1[3] <- df2[1], but I would like to know if there is some other way.
df1<-data.frame(A=c(1:10), B=c(2:5, rep(NA,6)))
df2<-data.frame(A=c(12:20), B=c(32:40))
Expected Output:
 A  B A.1 B.1
 1  2  12  32
 2  3  13  33
 3  4  14  34
 4  5  15  35
 5 NA  16  36
 6 NA  17  37
 7 NA  18  38
 8 NA  19  39
 9 NA  20  40
10 NA  NA  NA
I tend to work with multiple frames like this as a list of frames. Try this:
LOF <- list(df1, df2)
maxrows <- max(sapply(LOF, nrow))
# Indexing past the end of a shorter frame with seq_len(maxrows) pads it with NA rows
out <- do.call(cbind, lapply(LOF, function(z) z[seq_len(maxrows), ]))
names(out) <- make.names(names(out), unique = TRUE)
out
# A B A.1 B.1
# 1 1 2 12 32
# 2 2 3 13 33
# 3 3 4 14 34
# 4 4 5 15 35
# 5 5 NA 16 36
# 6 6 NA 17 37
# 7 7 NA 18 38
# 8 8 NA 19 39
# 9 9 NA 20 40
# 10 10 NA NA NA
One advantage of this is that it allows you to work with an arbitrary number of frames, not just two.
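For example (with a hypothetical third frame df3), the same code runs unchanged:
df3 <- data.frame(A = 21:23, B = 41:43)
LOF <- list(df1, df2, df3)
maxrows <- max(sapply(LOF, nrow))  # 10
out <- do.call(cbind, lapply(LOF, function(z) z[seq_len(maxrows), ]))
names(out) <- make.names(names(out), unique = TRUE)  # A, B, A.1, B.1, A.2, B.2
Rows 4-10 of the df3 columns come back as NA, exactly as with df2.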
One base R way could be
setNames(Reduce(cbind.data.frame,
Map(`length<-`, c(df1, df2), max(nrow(df1), nrow(df2)))),
paste0(names(df1), rep(c('', '.1'), each=2)))
# A B A.1 B.1
# 1 1 2 12 32
# 2 2 3 13 33
# 3 3 4 14 34
# 4 4 5 15 35
# 5 5 NA 16 36
# 6 6 NA 17 37
# 7 7 NA 18 38
# 8 8 NA 19 39
# 9 9 NA 20 40
# 10 10 NA NA NA
Another option is to use the merge function. The documentation can be a bit cryptic, so here is a short explanation of the arguments:
by = 0 -- the name "row.names" or the number 0 specifies that matching is done on the row names
all = TRUE -- keeps all rows from both data frames
suffixes -- specifies how the duplicated column names should be distinguished
sort = FALSE -- don't sort the result on the by column
merge(df1, df2, by = 0, all = TRUE, suffixes = c('', '.1'), sort = FALSE)
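One caveat (not mentioned in the original answer): by = 0 keys the merge on the row names (here the defaults "1" through "10") and returns them in an extra Row.names column, which you may want to drop, e.g.:
out <- merge(df1, df2, by = 0, all = TRUE, suffixes = c('', '.1'), sort = FALSE)
out$Row.names <- NULL  # drop the key column merge() adds
out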
One way would be
cbind(
  df1,
  rbind(
    df2,
    # the NA vector is recycled into a single row of NAs, so this only
    # works because df1 has exactly one more row than df2
    rep(NA, nrow(df1) - nrow(df2))
  )
)
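A slightly more general version of the same idea (a sketch, using a hypothetical pad helper frame) adds however many NA rows are needed, instead of relying on recycling:
pad <- setNames(as.data.frame(matrix(NA, nrow(df1) - nrow(df2), ncol(df2))),
                names(df2))
cbind(df1, rbind(df2, pad))
As before, pass the result through make.names(..., unique = TRUE) if you need the A.1/B.1 style column names.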
This is a follow-up question to a previous post of mine about building a function for calculating row means.
I want to use a function from the apply family to iterate over my dataset and each time compute the row mean (which is what the function does) for a group of columns I specify. Unfortunately, I'm missing something critical in how apply() should be set up, because I get an error I can't troubleshoot.
Example Data
capital_cities_df <-
data.frame("europe_paris" = 1:10,
"europe_london" = 11:20,
"europe_rome" = 21:30,
"asia_bangkok" = 31:40,
"asia_tokyo" = 41:50,
"asia_kathmandu" = 51:60)
set.seed(123)
capital_cities_df <- as.data.frame(lapply(capital_cities_df,
function(cc) cc[ sample(c(TRUE, NA),
prob = c(0.70, 0.30),
size = length(cc),
replace = TRUE) ]))
> capital_cities_df
europe_paris europe_london europe_rome asia_bangkok asia_tokyo asia_kathmandu
1 1 NA NA NA 41 NA
2 NA 12 22 NA 42 52
3 3 NA 23 33 43 NA
4 NA 14 NA NA NA NA
5 NA 15 25 35 45 NA
6 6 NA NA 36 NA 56
7 NA 17 NA NA NA 57
8 NA 18 NA 38 48 NA
9 NA 19 NA 39 49 NA
10 10 NA 30 40 NA 60
Custom Function
library(dplyr)
library(rlang)
continent_mean <- function(df, continent) {
df %>%
select(starts_with(continent)) %>%
dplyr::mutate(!!quo_name(continent) := rowMeans(., na.rm = TRUE))
}
## works for a single case:
continent_mean(capital_cities_df, "europe")
europe_paris europe_london europe_rome europe
1 1 NA 21 11
2 2 12 22 12
3 3 NA 23 13
4 4 14 NA 9
5 NA 15 25 20
6 6 16 26 16
7 NA 17 NA 17
8 NA 18 NA 18
9 NA 19 NA 19
10 10 20 30 20
Trying to apply the function over the data, unsuccessfully
apply(
capital_cities_df,
MARGIN = 2,
FUN = continent_mean(capital_cities_df, continent = "europe")
)
Error in match.fun(FUN) :
'continent_mean(capital_cities_df, continent = "europe")' is not a function, character or symbol
No other combination of the arguments to apply() worked either, nor did sapply. This unsuccessful attempt uses apply for only one group of columns ("europe"). My ultimate goal, however, is to pass c("europe", "asia", etc.), so that the custom function creates row-mean columns for all the column groups I specify, in one go.
What is wrong with my code?
Thanks!
EDIT 19-AUG-2019
I tried the solution suggested by A. Suliman (see below). It worked for the example data I posted here, but not when scaling up to my real dataset, where I need to carry along additional columns (rather than just the "continent" batch). More specifically, my real data has an ID column that I want to appear in the output alongside the computed means.
Example data including "ID" column
capital_cities_df <- data.frame(
"europe_paris" = 1:10,
"europe_london" = 11:20,
"europe_rome" = 21:30,
"asia_bangkok" = 31:40,
"asia_tokyo" = 41:50,
"asia_kathmandu" = 51:60)
set.seed(123)
capital_cities_df <- as.data.frame(lapply(capital_cities_df, function(cc)
  cc[ sample(c(TRUE, NA),
             prob = c(0.70, 0.30),
             size = length(cc),
             replace = TRUE) ]))
id <- 1:10
capital_cities_df <- cbind(id, capital_cities_df)
> capital_cities_df
id europe_paris europe_london europe_rome asia_bangkok asia_tokyo asia_kathmandu
1 1 1 NA NA NA 41 NA
2 2 NA 12 22 NA 42 52
3 3 3 NA 23 33 43 NA
4 4 NA 14 NA NA NA NA
5 5 NA 15 25 35 45 NA
6 6 6 NA NA 36 NA 56
7 7 NA 17 NA NA NA 57
8 8 NA 18 NA 38 48 NA
9 9 NA 19 NA 39 49 NA
10 10 10 NA 30 40 NA 60
My function (edited to select id as well)
continent_mean <- function(df, continent) {
df %>%
select(., id, starts_with(continent)) %>%
dplyr::mutate(!!quo_name(continent) := rowMeans(., na.rm = TRUE))
}
> continent_mean(capital_cities_df, "europe") ## works in a single run
id europe_paris europe_london europe_rome europe
1 1 1 NA NA 1.000000
2 2 NA 12 22 12.000000
3 3 3 NA 23 9.666667
4 4 NA 14 NA 9.000000
5 5 NA 15 25 15.000000
6 6 6 NA NA 6.000000
7 7 NA 17 NA 12.000000
8 8 NA 18 NA 13.000000
9 9 NA 19 NA 14.000000
10 10 10 NA 30 16.666667
Trying to apply the function beyond the single use (based on A. Suliman's method) -- unsuccessfully
continents <- c("europe", "asia")
lst <- lapply(continents, function(x) continent_mean(df=capital_cities_df[, grep(x, names(capital_cities_df))], continent=x))
## or:
purrr::map_dfc(continents, ~continent_mean(df=capital_cities_df[, grep(.x, names(capital_cities_df))], continent=.x))
In either case I get a variety of error messages:
Error in inds_combine(.vars, ind_list) : Position must be between 0 and n
At other times:
Error: invalid column index : NA for variable: 'NA' = 'NA'
All I wanted was a simple function letting me calculate row means for whichever columns I specify, but this is getting messy for some reason. Even though I'm eager to figure out what's wrong with my code, I'd also be grateful for a better overarching solution for the entire process.
Thanks!
Use lapply to loop through continents, then use grep to select the columns matching the current continent:
continents <- c("europe", "asia")
lst <- lapply(continents, function(x) continent_mean(df=capital_cities_df[, grep(x, names(capital_cities_df))], continent=x))
# Collapse the list into a single data frame
do.call(cbind, lst)
Using map_dfc from purrr we can get the result in one step
purrr::map_dfc(continents, ~continent_mean(df=capital_cities_df[, grep(.x, names(capital_cities_df))], continent=.x))
Update:
# grep returns the positions of the columns whose names match "europe" or "asia", e.g.
> grep("europe", names(capital_cities_df))
[1] 2 3 4
#If we need the column names then we add value=TRUE to grep
> grep("europe", names(capital_cities_df), value = TRUE)
[1] "europe_paris" "europe_london" "europe_rome"
So to carry the id column along, we can just prepend 'id' using c() and call the function as usual:
#NOTE: Here I'm using the old function without select
lst <- lapply(continents, function(x) continent_mean(df=capital_cities_df[, c('id',grep(x, names(capital_cities_df), value = TRUE))], continent=x))
do.call(cbind, lst)
id europe_paris europe_london europe_rome europe id asia_bangkok asia_tokyo asia_kathmandu asia
1 1 1 NA NA 1.00000 1 NA 41 51 31.00000
2 2 NA 12 22 12.00000 2 NA 42 52 32.00000
3 3 3 13 23 10.50000 3 33 43 NA 26.33333
4 4 NA 14 NA 9.00000 4 NA 44 54 34.00000
5 5 NA 15 25 15.00000 5 35 45 55 35.00000
6 6 6 NA NA 6.00000 6 36 46 56 36.00000
7 7 7 17 27 14.50000 7 NA 47 57 37.00000
8 8 NA 18 28 18.00000 8 38 48 NA 31.33333
9 9 9 19 29 16.50000 9 39 49 NA 32.33333
10 10 10 NA 30 16.66667 10 40 NA 60 36.66667
# One problem: the id column gets duplicated; map_dfc together with select solves this
purrr::map_dfc(continents, ~continent_mean(df=capital_cities_df[, c('id',grep(.x, names(capital_cities_df), value = TRUE))], continent=.x)) %>%
# Drop any column whose name contains "id" followed by a digit (the duplicates map_dfc renames)
select(-matches('id\\d'))
If you'd like to use the new function with select then just pass capital_cities_df without grep, e.g using map_dfc
purrr::map_dfc(continents, ~continent_mean(df=capital_cities_df, continent=.x)) %>%
select(-matches('id\\d'))
Correction: in continent_mean
continent_mean <- function(df, continent) {
df %>%
select(., id, starts_with(continent)) %>%
#Exclude id from the rowMeans calculation
dplyr::mutate(!!quo_name(continent) := rowMeans(.[grep(continent, names(.))], na.rm = TRUE))
}
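With this correction, id no longer leaks into the means. For example, in row 2 (id = 2, europe values NA, 12, 22) the europe mean becomes mean(c(12, 22)) = 17 instead of the 12 shown earlier, where id was averaged in; something like:
continent_mean(capital_cities_df, "europe")[2, ]
#   id europe_paris europe_london europe_rome europe
# 2  2           NA            12          22     17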
I need to rank rows based on two variables and I just can't wrap my head around it.
Test data below:
df <- data.frame(A = c(12,35,55,7,6,NA,NA,NA,NA,NA), B = c(NA,12,25,53,12,2,66,45,69,43))
 A  B
12 NA
35 12
55 25
 7 53
 6 12
NA  2
NA 66
NA 45
NA 69
NA 43
I want to compute a third variable, C, that ranks the rows: C should follow A when A is not NA. When A is NA, C should be based on B, BUT a row with A == NA should never outrank a row where A is not NA.
In the data above, max(A) should take the top C rank, and max(B) can at best hold the sixth-highest C value, because A has five non-NA values. If A is NA and B would outrank a row with non-NA A, some transformation should ensure that the non-NA-A row always outranks the B row in the final C score.
I would like the result to look something like this:
A B C
55 25 1
35 12 2
12 NA 3
7 53 4
6 12 5
NA 69 6
NA 66 7
NA 45 8
NA 43 9
NA 2 10
So far the closest I can get is
df$C <- ifelse(is.na(df$A), min(df$A, na.rm=T)/df$B, df$A)
But that turns the ranking upside down when A==NA, so B==2 is ranked 6 instead of B==69
A B C
55 25 1
35 12 2
12 NA 3
7 53 4
6 12 5
NA 2 6
NA 43 7
NA 45 8
NA 66 9
NA 69 10
I'm not sure if I could use some kind of weights?
Any suggestions are greatly appreciated! Thanks!
You can try:
# order() returns an ordering permutation, not ranks; nesting it, order(order(...)), gives ranks
df$C <- order(order(-df$A))
# Rank the A == NA rows by B descending, shifted below all rows with non-NA A
df[is.na(df$A), "C"] <- order(order(-df[is.na(df$A), "B"])) + sum(!is.na(df$A))
and to view the rows ordered by C:
df[order(df$C),]
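For what it's worth, a more compact route to the same ranking (a sketch, assuming every row has at least one of A and B non-NA) is to order on two keys -- rows with non-NA A first, then by A or B descending -- and nest order() to turn the permutation into ranks:
df$C <- order(order(is.na(df$A), -ifelse(is.na(df$A), df$B, df$A)))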
So let's take the following data:
set.seed(123)
A <- 1:10
age<- sample(20:50,10)
height <- sample(100:210,10)
df1 <- data.frame(A, age, height)
B <- c(1,1,1,2,2,3,3,5,5,5,5,8,8,9,10,10)
injury <- sample(letters[1:5],16, replace=T)
df2 <- data.frame(B, injury)
Now, we can merge the data using the following code:
df3 <- merge(df1, df2, by.x = "A", by.y = "B", all=T)
head(df3)
# A age height injury
# 1 1 28 206 e
# 2 1 28 206 d
# 3 1 28 206 d
# 4 2 43 149 e
# 5 2 43 149 d
# 6 3 31 173 d
But what I want in the new data frame is each B's injuries spread into separate columns (a wide format).
So the desired output should have one injury column per observation index. In this simple example we know that the max number of injuries per unique df2$B is 4, so we need 4 new columns.
But my data has an unknown maximum, so code is needed to compute the correct number of columns, something like
length(unique(df2$injury[df2$B]))
but that is also not correct syntax, as the output should equal 4
I don't know where the letters are coming from in your sample output, because there are none in the variables in your sample input, but you can try something like:
library(splitstackshape)
dcast.data.table(getanID(df3, c("A", "age")),
                 A + age + height ~ .id, value.var = "injury")
## A age height 1 2 3 4
## 1: 1 28 206 4 3 3 NA
## 2: 2 43 149 4 3 NA NA
## 3: 3 31 173 3 3 NA NA
## 4: 4 44 161 NA NA NA NA
## 5: 5 45 111 3 2 1 4
## 6: 6 21 195 NA NA NA NA
## 7: 7 33 125 NA NA NA NA
## 8: 8 41 104 4 3 NA NA
## 9: 9 32 133 4 NA NA NA
## 10: 10 30 197 1 2 NA NA
This adds a secondary ID based on the first two columns and then spreads it to a wide format.
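As an aside, the "max number of injuries per unique df2$B" that the question tries to compute can be obtained directly with:
max(table(df2$B))
# [1] 4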
If you want to accomplish this using the tidyr package, I found it necessary to create an index variable:
df3 %>%
group_by(A) %>%
mutate(ind = row_number()) %>%
spread(ind, injury)
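Note that spread() has since been superseded in tidyr; with current tidyr the equivalent (a sketch) would be:
library(dplyr)
library(tidyr)
df3 %>%
  group_by(A) %>%
  mutate(ind = row_number()) %>%
  pivot_wider(names_from = ind, values_from = injury)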
I have a data.frame composed of observations and modelled predictions of data. A minimal example dataset could look like this:
myData <- data.frame(
  tree  = rep("A", 20),
  doy   = seq(75, 94),
  count = c(NA, NA, NA, NA, 0, NA, NA, NA, NA, 1, NA, NA, NA, NA, 2, NA, NA, NA, NA, NA),
  pred  = c(0, 0, 0, 0, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 6, 9, 12, 20, 44))
The count column represents when observations were made and predictions are modelled over a complete set of days, in effect interpolating the data to a day level (from every 5 days).
I would like to conditionally filter this dataset so that the predictions are truncated to the same range as the observations, keeping all predictions between where count starts and ends (i.e. removing the leading and trailing values of pred that correspond to an NA in the count column). For this example, the ideal outcome would be:
tree doy count pred
5 A 79 0 1
6 A 80 NA 1
7 A 81 NA 1
8 A 82 NA 2
9 A 83 NA 2
10 A 84 1 2
11 A 85 NA 2
12 A 86 NA 3
13 A 87 NA 3
14 A 88 NA 3
15 A 89 2 3
I have tried to solve this by combining filter with first and last, by using a conditional mutate to flag (probably via lag) whether the previous doy has an observation and filtering on that flag, and even by creating a second data.frame containing the proper doy range that could be joined to this data.
In my searches on StackOverflow I have come across the following questions that seemed close, but were not quite what I needed:
Select first observed data and utilize mutate
Conditional filtering based on the level of a factor R
My actual dataset is much larger with multiple trees over multiple years (with each tree/year having different period of observation depending on elevation of the sites, etc.). I am currently implementing the dplyr package across my code, so an answer within that framework would be great but would be happy with any solutions at all.
I think you're just looking to limit the rows to fall between the first and last non-NA count value:
myData[seq(min(which(!is.na(myData$count))), max(which(!is.na(myData$count)))),]
# tree doy count pred
# 5 A 79 0 1
# 6 A 80 NA 1
# 7 A 81 NA 1
# 8 A 82 NA 2
# 9 A 83 NA 2
# 10 A 84 1 2
# 11 A 85 NA 2
# 12 A 86 NA 3
# 13 A 87 NA 3
# 14 A 88 NA 3
# 15 A 89 2 3
In dplyr syntax, grouping by the tree variable:
library(dplyr)
myData %>%
group_by(tree) %>%
filter(seq_along(count) >= min(which(!is.na(count))) &
seq_along(count) <= max(which(!is.na(count))))
# Source: local data frame [11 x 4]
# Groups: tree
#
# tree doy count pred
# 1 A 79 0 1
# 2 A 80 NA 1
# 3 A 81 NA 1
# 4 A 82 NA 2
# 5 A 83 NA 2
# 6 A 84 1 2
# 7 A 85 NA 2
# 8 A 86 NA 3
# 9 A 87 NA 3
# 10 A 88 NA 3
# 11 A 89 2 3
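As a side note (not from the original answer), the same per-group trimming can be written with dplyr's cumany(), which reads fairly directly: keep a row once a non-NA count has been seen and while one still lies ahead:
library(dplyr)
myData %>%
  group_by(tree) %>%
  filter(cumany(!is.na(count)) & rev(cumany(rev(!is.na(count))))) %>%
  ungroup()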
Try
indx <- which(!is.na(myData$count))
myData[seq(indx[1], indx[length(indx)]),]
# tree doy count pred
#5 A 79 0 1
#6 A 80 NA 1
#7 A 81 NA 1
#8 A 82 NA 2
#9 A 83 NA 2
#10 A 84 1 2
#11 A 85 NA 2
#12 A 86 NA 3
#13 A 87 NA 3
#14 A 88 NA 3
#15 A 89 2 3
If this is to be done by group
# Per tree: TRUE from the first non-NA count onward (cumsum) and
# up through the last non-NA count (reversed cumsum)
ind <- with(myData, ave(!is.na(count), tree,
       FUN=function(x) cumsum(x)>0 & rev(cumsum(rev(x))>0)))
myData[ind,]
# tree doy count pred
#5 A 79 0 1
#6 A 80 NA 1
#7 A 81 NA 1
#8 A 82 NA 2
#9 A 83 NA 2
#10 A 84 1 2
#11 A 85 NA 2
#12 A 86 NA 3
#13 A 87 NA 3
#14 A 88 NA 3
#15 A 89 2 3
Or using na.trim from zoo (it trims leading/trailing rows containing NAs, which works here because count is the only column with NAs)
library(zoo)
do.call(rbind, by(myData, myData$tree, FUN=na.trim))
Or using data.table
library(data.table)
# For each tree, keep the rows from the first through the last non-NA count
setDT(myData)[, .SD[do.call(`:`, as.list(range(.I[!is.na(count)])))], tree]
# tree doy count pred
#1: A 79 0 1
#2: A 80 NA 1
#3: A 81 NA 1
#4: A 82 NA 2
#5: A 83 NA 2
#6: A 84 1 2
#7: A 85 NA 2
#8: A 86 NA 3
#9: A 87 NA 3
#10: A 88 NA 3
#11: A 89 2 3