Using apply in R to extract rows from a dataframe

Using R, I have to extract specific rows from a data frame depending on certain conditions. The data frame is large (5.5 million rows by 251 columns), but I have given the code below to create a sample data frame.
df <- data.frame("Name" = c("Name1", "Name1", "Name1", "Name1","Name1" ), "Value"=c("X", "X", "Y", "Y", "X"))
I need to step through the entire data frame row by row starting at the top, and whenever the value of the 'Value' column changes from X to Y or Y to X, I need to extract that row and the next row and append them to another data frame. For example, in the data frame above, the Value column of row 2 is X and that of row 3 is Y; since the value has changed from X to Y, I need to extract the entire row 2 and row 3 and add them to another data frame.
The result of the operations can be seen by running the code below
dfextract <- data.frame("Name" = c("Name1", "Name1"), "Value"=c("X", "Y"))
Currently I have used a 'for' loop to step from row to row and extract the rows when the values don't match, but it is very slow and inefficient. The code snippet is below
for (i in 1:(nrow(df) - 1)) {
  if (df[[i + 1, 2]] != df[[i, 2]]) {
    dfextract <- rbind(dfextract, df[i, ])
    dfextract <- rbind(dfextract, df[i + 1, ])
  }
}
I am looking for a better and faster solution to the above situation. Perhaps using the functions belonging to the family of 'apply()' or using 'by()'. Any help would be greatly appreciated.
Thanks in advance.

Maybe the following does it. Note that there are two lapply-based loops, in order to account for changes in the values of column Name.
diffstr <- function(x) x[-1] == x[-length(x)]

res <- lapply(split(df, df$Name), function(x) {
  inx <- which(c(FALSE, !diffstr(x$Value)))
  do.call(rbind, lapply(inx, function(i) x[(i - 1):i, ]))
})
res <- do.call(rbind, res)
row.names(res) <- NULL
res
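For the sample df in the question, this should print (a sketch of the expected result):
#    Name Value
# 1 Name1     X
# 2 Name1     Y
# 3 Name1     Y
# 4 Name1     X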
How it works.
First, I define a helper function diffstr. It compares all values of x but the first with all values of x but the last. Note that x[-1] is the vector x[2], x[3], ..., x[length(x)]; a negative index removes that element from the vector. The same goes for x[-length(x)], where the negative index removes the last element.
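To make the comparison concrete, a small demonstration on the Value column of the sample df:
v <- c("X", "X", "Y", "Y", "X")
v[-1]          # "X" "Y" "Y" "X"  (drops the first element)
v[-length(v)]  # "X" "X" "Y" "Y"  (drops the last element)
diffstr(v)     # TRUE FALSE TRUE FALSE, TRUE where a value equals its predecessor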
split(df, df$Name) splits the data frame into subsets each one of its own Name.
I then lapply an unnamed function to these subsets. This function's argument x will be each of the sub-data frames mentioned above.
That function starts by determining where in x$Value the changes are. This is done with the call to the helper function diffstr. I have to prepend a FALSE to the return value because the first row can never be a change.
The next line is a tricky one. Use lapply on the index of change points inx and, for each one, get a two-row segment of the data frame x. Then use do.call to rbind those two-row data frames back together.
Now res is a list, with one sub-data frame for each Name (produced by the split). So it needs to be put back together with another call to do.call(rbind, ...).
Final tidy-up. The whole process messed up the data frame's row names. Setting them to NULL is just a well-known trick that forces R to renumber the rows.
That's it. If you need more explanations, just say so.

We can use dplyr. lag shifts a column down by one row, so we can use Value != lag(Value) to check whether each value differs from the previous one. which(Value != lag(Value)) converts the result to row numbers. After that, sort(unique(unlist(lapply(which(Value != lag(Value)), function(x) c(x, x - 1))))) makes sure we also get the row numbers of those previous rows. Finally, slice subsets the data frame based on the row numbers provided.
library(dplyr)
df2 <- df %>%
  slice(sort(unique(unlist(lapply(which(Value != lag(Value)), function(x) c(x, x - 1))))))
df2
# A tibble: 4 x 2
    Name  Value
  <fctr> <fctr>
1  Name1      X
2  Name1      Y
3  Name1      Y
4  Name1      X
If the code is too long to read, you can also calculate the indices before using the slice function, as follows.
library(dplyr)
ind <- which(df$Value != lag(df$Value))
ind2 <- sort(unique(c(ind, ind - 1)))
df2 <- df %>% slice(ind2)
df2
# A tibble: 4 x 2
    Name  Value
  <fctr> <fctr>
1  Name1      X
2  Name1      Y
3  Name1      Y
4  Name1      X

Using base R, I would probably work with row indices obtained with diff:
df <- data.frame(colA = c(1, 1, 1, 2, 1, 1, 1, 3, 3, 3, 1, 1),
                 colB = 1:12)
keep <- which(diff(df$colA) != 0)
df[unique(c(keep, keep+1)), ]
   colA colB
3     1    3
4     2    4
7     1    7
10    3   10
5     1    5
8     3    8
11    1   11
There is probably a faster option though.
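If you want the rows back in their original order, a minimal variant just sorts the index first:
df[sort(unique(c(keep, keep + 1))), ]
which returns rows 3, 4, 5, 7, 8, 10 and 11 in order.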

When you have a large dataset, speed might be the bottleneck. In this case data.table might be the best option for you.
Using the data.table-library, I would solve it like so:
library(data.table)
dt <- data.table(Name = c("Name1", "Name1", "Name1", "Name1", "Name1"),
                 Value = c("X", "X", "Y", "Y", "X"))
# flag rows where Value differs from the previous row
dt[, idx := Value != shift(Value, 1, fill = dt$Value[1])]
# keep the flagged rows plus the row after each flag,
# then drop the helper variable idx
dt[idx | shift(idx, 1)][, .(Name, Value)]
#>     Name Value
#> 1: Name1     Y
#> 2: Name1     Y
#> 3: Name1     X
Why does it return an odd number of rows and not an even number?
Well, that is because in your example data the last row is itself a change point, so it gets selected, but there is no following row left to select as well.
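If you also want the row before each change, as the question asks, a sketch using a leading shift (type = "lead" looks one row ahead) to flag the predecessor rows too:
# idx as before: TRUE where Value differs from the previous row
dt[, idx := Value != shift(Value, 1, fill = dt$Value[1])]
dt[idx | shift(idx, 1, type = "lead", fill = FALSE)][, .(Name, Value)]
#>     Name Value
#> 1: Name1     X
#> 2: Name1     Y
#> 3: Name1     Y
#> 4: Name1     X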

Related

R: avoid turning one-row data frames into a vector when using apply functions

I often have the problem that R converts my one-column data frames into character vectors, which I solve by using the drop=FALSE option.
However, there are some instances where I do not know how to work around this kind of behavior in R, and this is one of them.
I have a data frame like the following:
mydf <- data.frame(ID=LETTERS[1:3], value1=paste(LETTERS[1:3], 1:3), value2=paste(rev(LETTERS)[1:3], 1:3))
that looks like:
> mydf
  ID value1 value2
1  A    A 1    Z 1
2  B    B 2    Y 2
3  C    C 3    X 3
The task I am doing here is to replace spaces with _ in every column except the first, and I want to use an apply-family function for this, sapply in this case.
I do the following:
new_df <- as.data.frame(sapply(mydf[,-1,drop=F], function(x) gsub("\\s+","_",x)))
new_df <- cbind(mydf[,1,drop=F], new_df)
The resulting data frame looks exactly how I want it:
> new_df
  ID value1 value2
1  A    A_1    Z_1
2  B    B_2    Y_2
3  C    C_3    X_3
My problem starts with some rare cases where my input can have one row of data only. For some reason I never understood, R has a completely different behavior in these cases, but no drop=FALSE option can save me here...
My input data frame now is:
mydf <- data.frame(ID=LETTERS[1], value1=paste(LETTERS[1], 1), value2=paste(rev(LETTERS)[1], 1))
which looks like:
> mydf
  ID value1 value2
1  A    A 1    Z 1
However, when I apply the same code, my resulting data frame looks hideous like this:
> new_df
       ID sapply(mydf[, -1, drop = F], function(x) gsub("\\\\s+", "_", x))
value1  A                                                              A_1
value2  A                                                              Z_1
How can I solve this issue so that the same line of code gives the same kind of result for input data frames with any number of rows?
A deeper question would be why on earth R does this. I keep going back to my code when I get some new weird input with one row or one column, because it breaks everything... Thanks!
You can solve your problem by using lapply instead of sapply, and then combining the result using do.call as follows:
new_df <- as.data.frame(lapply(mydf[,-1,drop=F], function(x) gsub("\\s+","_",x)))
new_df <- do.call(cbind, new_df)
new_df
#      value1 value2
# [1,] "A_1"  "Z_1"
new_df <- cbind(mydf[,1,drop=F], new_df)
new_df
#   ID value1 value2
# 1  A    A_1    Z_1
As for your question about the unpredictable behavior of sapply: the s in sapply stands for simplify, but the simplified result is not guaranteed to be a data frame. It can be a data frame, a matrix, or a vector.
According to the documentation of sapply:
sapply is a user-friendly version and wrapper of lapply by default
returning a vector, matrix or, if simplify = "array", an array if
appropriate, by applying simplify2array().
On the simplify argument:
logical or character string; should the result be simplified
to a vector, matrix or higher dimensional array if possible? For
sapply it must be named and not abbreviated. The default value, TRUE,
returns a vector or matrix if appropriate, whereas if simplify =
"array" the result may be an array of “rank” (=length(dim(.))) one
higher than the result of FUN(X[[i]]).
The Details section explains the behavior, which looks like what you experienced (emphasis mine):
Simplification in sapply is only attempted if X has length greater
than zero and if the return values from all elements of X are all of
the same (positive) length. If the common length is one the result is
a vector, and if greater than one is a matrix with a column
corresponding to each element of X.
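A quick way to see this rule in action, using the two versions of mydf from the question:
# 3-row mydf: every column's result has length 3, so sapply
# simplifies to a 3 x 2 character matrix
sapply(mydf[, -1, drop = FALSE], function(x) gsub("\\s+", "_", x))
# 1-row mydf: every result has length 1, so sapply simplifies to a named
# character vector, which as.data.frame() then turns into the one-column
# data frame shown above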
Hadley Wickham also recommends not using sapply:
I recommend that you avoid sapply() because it tries to simplify the
result, so it can return a list, a vector, or a matrix. This makes it
difficult to program with, and it should be avoided in non-interactive
settings
He also recommends not using apply with a data frame. See Advanced R for further explanation.
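Another way to sidestep the problem entirely, as a minimal sketch: replace the columns in place with lapply, which never simplifies, so the data frame keeps its shape for any number of rows:
mydf[-1] <- lapply(mydf[-1], function(x) gsub("\\s+", "_", x))
mydf
#   ID value1 value2
# 1  A    A_1    Z_1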
You can also use map_df function from purrr package, which applies a function on each element of an object and also returns a data frame:
library(dplyr)
library(purrr)
mydf %>%
  mutate(map_df(select(cur_data(), starts_with("value")), ~ gsub("\\s", "_", .x)))
  ID value1 value2
1  A    A_1    Z_1
And with the original data frame:
  ID value1 value2
1  A    A_1    Z_1
2  B    B_2    Y_2
3  C    C_3    X_3
Here's a solution that replaces the original data. Not sure if this plays into your workflow, though. Notice that I used apply, which processes data frames by rows or columns; the simplify argument used below requires R 4.1 or later.
mydf <- data.frame(ID=LETTERS[1], value1=paste(LETTERS[1], 1), value2=paste(rev(LETTERS)[1], 1))
xy <- apply(X = mydf[, -1, drop = FALSE],
            MARGIN = 2,
            FUN = function(x) gsub("\\s+", "_", x),
            simplify = FALSE)
xy <- do.call(cbind, xy)
xy <- as.data.frame(xy)
mydf[, -1] <- as.data.frame(xy)
mydf
  ID value1 value2
1  A    A_1    Z_1

R - count occurrences in long vectors

I have a dataframe that is 6249 rows long, filled with character-type data and will likely get a lot bigger.
I want to count the number of occurrences of each string. Normally I'd use table(df) or count(df), but they both seem to stop after 250 rows.
Is there a different function or a way to force count() or table() to continue for 6000+ results?
A simple way to do this with a data frame of any size is to add a count field to the data frame and then summarize the string field by count with the doBy package, like so:
require(doBy)
df$count <- 1
result <- summaryBy(count ~ string, data = df, FUN = sum, keep.names = TRUE)
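If you prefer to avoid the extra package, a sketch of the base R equivalent (assuming the same count column and a column named string, as above):
result <- aggregate(count ~ string, data = df, FUN = sum)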
As @Gregor noticed, it seems you interpreted the table output wrongly; it is actually doing the counting correctly. Anyway, here goes a solution using Reduce. Replace df with your data frame and string with the name of the column of your actual data frame in which you are counting.
# let's create some dataframe with three strings randomly distributed of length 1000
df <- data.frame(string = unlist(lapply(round(runif(1000, 1, 3)), function(i) c('hi', 'ok', 'my cat')[i])))
my.count <- function(word, df) {
  # count how many rows match `word`
  Reduce(function(acc, r) {
    # replace 'string' by the name of the column of your dataframe over which you want to count
    if (r$string == word)
      acc + 1
    else
      acc
  }, apply(df, 1, as.list), init = 0)
}
# count how many 'my cat' strings are in the df dataframe at column 'string', replace with yours
my.count('my cat', df)
# now let's try to find the frequency of all of them
uniq <- unique(df$string)
freq <- unlist(lapply(uniq, my.count, df))
names(freq) <- uniq
freq
# output
# ok my cat hi
# 490 261 249
# we can check indeed that the sum is 1000
sum(freq)
# [1] 1000
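For comparison, base table computes the same frequencies in a single call (the exact counts vary with the runif draw):
table(df$string)
sum(table(df$string))
# [1] 1000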
Well, this won't be popular, but in the end I achieved the desired result with a for loop, taking the number of rows in a subset.
y <- as.numeric(vector())
x <- as.numeric(vector())
for (i in test$token) {
  x <- as.numeric(nrow(df[df$token == i, ]))
  y <- c(y, x)
}
y then becomes a vector with the number of occurrences of each string.
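A vectorized sketch of the same idea, assuming the same df and test objects, which avoids growing y inside a loop:
y <- sapply(test$token, function(i) sum(df$token == i))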

Remove rows from data frame using row indices where row indices might be zero length vector

I want to drop some rows from a data frame using numeric row indices. But sometimes the vector of indices I am going to drop is a zero-length vector. In this case, I expect that nothing should be dropped from the original data frame, but instead of nothing, everything is dropped.
For example, here drop works as expected
df = data_frame( a = 10:12 )
drop = c(1,2)
df[ -drop, ]
# # A tibble: 1 × 1
#       a
#   <int>
# 1    12
But when drop is a zero-length vector, removing those rows doesn't work as I expect.
drop = integer()
df[ -drop, ]
# A tibble: 0 × 1
# ... with 1 variables: a <int>
I was expecting to get the whole df object without any modification.
How can I safely remove rows from a data frame using row indices when the index vector might be zero-length?
For this reason, it is better to use %in% and negate it with !:
df[!seq_len(nrow(df)) %in% drop, ]
As it is a data_frame, we can use tidyverse methods
df %>%
  filter(!row_number() %in% drop)
It seems the following simple code works fine!
df <- data.frame(a = 10:12)
drop <- c(1, 2)
`if`(length(drop) == 0, df, df[-drop, ])
drop <- integer()
`if`(length(drop) == 0, df, df[-drop, ])
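Another zero-length-safe sketch builds a positive index with setdiff instead of negating:
df[setdiff(seq_len(nrow(df)), drop), ]
With drop = integer() this keeps all rows; with drop = c(1, 2) it keeps only row 3.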

How to transpose data frame in R by group with column header

I am working with a data frame that looks like the following, which I need to transpose by group based on the common Id:
testDF = data.frame(c("Id", "1", "1", "2", "2"), c("Item", 'Milk','Eggs','Bacon', "Bread"))
testDF
#> Id  Item
#> 1   Milk
#> 1   Eggs
#> 2   Bacon
#> 2   Bread
newDT <- dcast(testDF, Id ~ Item, value.var = "Item")
View(newDT)
I need the output to look like the following (excluding the header row and the Id column altogether):
Milk, Eggs (Id 1)
Bacon, Bread (Id 2)
I am receiving the error 'Error: value.var (Item) not found in input'. Can you please tell me what I am doing wrong?
Thanks,
Matt
Some good answers above; however, I think this should be listed as an option as well:
library(dplyr)
library(stringr)
df %>%
  group_by(Id) %>%
  # Create a string listing all items in a given Id, separated by commas
  summarise(Items = str_c(Item, collapse = ', '))
Returns:
# A tibble: 2 × 2
      Id        Items
  <fctr>        <chr>
1      1   Milk, Eggs
2      2 Bacon, Bread
testDF = data.frame(Id = c("1", "1", "2", "2"),
                    Item = c('Milk', 'Eggs', 'Bacon', "Bread"))
testDF
z <- aggregate(list(Item = testDF$Item), list(ID = testDF$Id),
               function(x) paste(x, collapse = ','))
z
  ID        Item
1  1   Milk,Eggs
2  2 Bacon,Bread
With group_by and summarise functions from dplyr you have:
library("dplyr")
testDF %>%
  group_by(Id) %>%
  summarise(Items = paste0(Item, collapse = ","))
# Source: local data frame [2 x 2]
# Groups: Id [2]
#
#       Id     Items
#   (fctr)     (chr)
# 1      1 Milk,Eggs
# 2      2 Bacon,Bread
Since I didn't see a green checkmark, I figured I'd take a stab at it, because I wrote a function for this exact problem.
library(dplyr)
transp <- function(input, uniq_var, compare_var, transposed_column_names = 'measurement'){
  if(class(input[,uniq_var]) == "factor"){
    input[uniq_var] = sapply(input[uniq_var], as.character)
  }
  #' input is the dataframe/data.table that you want to perform the operation on, uniq_var is the variable that you are grouping by, compare_var is the variable that is being measured in each of the groups, and transposed_column_names is just an optional string for naming each of the output columns (it will be concatenated with an observation number, i.e. if you input 'distance', it will name the observations 'distance_1', 'distance_2', 'distance_3', etc.)
  list_df <- input %>% group_by(input[,uniq_var]) %>% do(newcol = t(.[compare_var]))
  # it gets us the aggregates we want, BUT all of our columns are stored in a list
  # instead of in separate columns.... so we need to create a new dataframe with the dimensions
  # rows = the number of unique values that we are "grouping" by, noted here by uniq_var, and the number of columns will be
  # the maximum number of observations that are assigned to one of those groups.
  # so first we will create the skeleton of the matrix, and then use a user-defined function
  # to fill it with the correct values
  new_df <- matrix(rep(NA, (max(count(input, input[,uniq_var])[,2]) * dim(list_df)[1])), nrow = dim(list_df)[1])
  new_df <- data.frame(new_df)
  new_df <- cbind(list_df[,1], new_df)
  # i am writing a function inside of a function because for loops can take a while
  # when doing operations on multiple columns of a dataframe
  func2 <- function(input, thing = new_df){
    # here, we have a slightly easier case when we have the maximum number of children
    # assigned to a household.
    # we subtract 1 from the number of columns because the first column holds the value of the
    # unique value we are looking at, so we don't count it
    if(length(input[2][[1]]) == dim(thing)[2] - 1){
      # we set the row corresponding to the specific unique value specified in our list_df of aggregated values
      # equal to the de-aggregated values, so that you have a column for each value like in PROC TRANSPOSE.
      thing[which(thing[,1] == input[1]), 2:ncol(thing)] = input[2][[1]]
      #new_df[which(new_df[,1]==input[1]),2:ncol(new_df)]= input[2][,1][[1]][[1]]
    } else {
      thing[which(thing[,1] == input[1]), 2:(1 + length(input[2][[1]]))] = input[2][[1]]
    }
    # if you're wondering why I have to use so many []'s it's because our list_df has 1 column
    # of unique identifiers and the other column is actually a column of dataframes,
    # each of which only has 1 row and 1 column, and that element is a list of the transposed values
    # that we want to add to our new dataframe
    # so essentially the first bracket
    return(thing[which(thing[,1] == input[1]), ])
  }
  quarter_final_output <- apply(list_df, 1, func2)
  semi_final_output <- data.frame(matrix(unlist(quarter_final_output), nrow = length(quarter_final_output), byrow = T))
  #return(apply(list_df,1,func2))
  # this essentially names the columns according to the column names that a user would typically specify
  # in a proc transpose.
  name_trans <- function(trans_var = transposed_column_names, uniq_var = uniq_var, df){
    #print(trans_var)
    colnames(df)[1] = colnames(input[uniq_var])
    colnames(df)[2:length(colnames(df))] = c(paste0(trans_var, seq(1, (length(colnames(df)) - 1), 1)))
    return(df)
  }
  final_output <- name_trans(transposed_column_names, uniq_var, semi_final_output)
  return(final_output)
}
In your case, you'd apply it like this:
transp(testDF,uniq_var = 'Id',compare_var = "Item")
If you want, you can also download it from my GitHub: https://github.com/seanpili/R_PROC_TRANSPOSE

R reshaping melted data.table with list column

I have a large (millions of rows) melted data.table with the usual melt-style unrolling in the variable and value columns. I need to cast the table in wide form (rolling the variables up). The problem is that the data table also has a list column called data, which I need to preserve. This makes it impossible to use reshape2 because dcast cannot deal with non-atomic columns. Therefore, I need to do the rolling up myself.
The answer from a previous question about working with melted data tables does not apply here because of the list column.
I am not satisfied with the solution I've come up with. I'm looking for suggestions for a simpler/faster implementation.
x <- LETTERS[1:3]
dt <- data.table(
  x = rep(x, each = 2),
  y = 'd',
  data = list(list(), list(), list(), list(), list(), list()),
  variable = rep(c('var.1', 'var.2'), 3),
  value = seq(1, 6)
)
# Column template set up
list_template <- Reduce(
  function(l, col) { l[[col]] <- col; l },
  unique(dt$variable),
  list())
# Expression set up
q <- substitute({
  l <- lapply(
    list_template,
    function(col) .SD[variable == as.character(col)]$value)
  l$data = .SD[1, ]$data
  l
}, list(list_template = list_template))
# Roll up
dt[, eval(q), by = list(x, y)]
   x y var.1 var.2   data
1: A d     1     2 <list>
2: B d     3     4 <list>
3: C d     5     6 <list>
This old question piqued my curiosity, as data.table has been improved significantly since 2013.
However, even with data.table version 1.11.4
dcast(dt, x + y + data ~ variable)
still returns an error
Columns specified in formula can not be of type list
The workaround follows the general outline of jonsedar's answer:
1. Reshape the non-list columns from long to wide format
2. Aggregate the list column data grouped by x and y
3. Join the two partial results on x and y
but uses the features of the actual data.table syntax, e.g., the on parameter:
dcast(dt, x + y ~ variable)[
  dt[, .(data = .(first(data))), by = .(x, y)], on = .(x, y)]
   x y var.1 var.2   data
1: A d     1     2 <list>
2: B d     3     4 <list>
3: C d     5     6 <list>
The list column data is aggregated by taking the first element. This is in line with the OP's code line
l$data = .SD[1,]$data
which also picks the first element.
I have a somewhat cheating method that might do the trick. Importantly, I assume that each x, y, list combination is unique! If not, please disregard.
I'm going to create two separate data.tables: the first dcasted without the data list objects, and the second with only the unique data list objects and a key. Then I just merge them together to get the desired result.
require(data.table)
require(stringr)
require(reshape2)
x <- LETTERS[1:3]
dt <- data.table(
  x = rep(x, each = 2),
  y = 'd',
  data = list(list("a", "b"), list("c", "d")),
  variable = rep(c('var.1', 'var.2'), 3),
  value = seq(1, 6)
)
# First create the dcasted datatable without the pesky list objects:
dt_nolist <- dt[, list(x, y, variable, value)]
dt_dcast <- data.table(dcast(dt_nolist, x + y ~ variable, value.var = "value"),
                       key = c("x", "y"))
# Second: create a datatable with only unique "groups" of x,y, list
dt_list <- dt[,list(x,y,data)]
# Rows are duplicated, so I'd like to use unique() to get rid of them, but
# unique() doesn't work when there are list objects in the data.table.
# Instead I cheat by applying a value to each row within an x,y "group"
# that is unique within EACH group, but present within EVERY group.
# Then simply subselect based on that unique value.
# I've chosen rank(), but no doubt there are other options.
dt_list <- dt_list[,rank:=rank(str_c(x,y),ties.method="first"),by=str_c(x,y)]
# now keep only one row per x,y "group"
dt_list <- dt_list[rank==1]
setkeyv(dt_list,c("x","y"))
# drop the rank since we no longer need it
dt_list[,rank:=NULL]
# Finally just merge back together
dt_final <- merge(dt_dcast,dt_list)
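For the example data this should give (a sketch; the data column holds the recycled list elements):
dt_final
#    x y var.1 var.2   data
# 1: A d     1     2 <list>
# 2: B d     3     4 <list>
# 3: C d     5     6 <list>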
