How to transpose data frame in R by group with column header

I am working with a data frame that looks like the following which I need to transpose by group based on the common Id:
testDF = data.frame(c("Id", "1", "1", "2", "2"), c("Item", 'Milk','Eggs','Bacon', "Bread"))
testDF
#>Id Item
#>1 Milk
#>1 Eggs
#>2 Bacon
#>2 Bread
newDT <- dcast(testDF, Id ~ Item, value.var = "Item")
View(newDT)
I need the output to look like the following (excluding the header row and the Id column altogether):
Milk, Eggs (Id 1)
Bacon, Bread (Id 2)
I am receiving the error 'Error: value.var (Item) not found in input'. Can you please tell me what I am doing wrong?
Thanks,
Matt
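The error most likely comes from the data.frame() call itself: it never names its columns, so "Id" and "Item" end up as data values rather than column names, which is why dcast() reports that value.var (Item) cannot be found. A minimal sketch of a corrected constructor (assuming reshape2 is loaded for dcast):
library(reshape2)

# Name the columns explicitly so 'Id' and 'Item' are headers, not data values
testDF <- data.frame(Id = c("1", "1", "2", "2"),
                     Item = c("Milk", "Eggs", "Bacon", "Bread"))

# dcast now finds the Item column, though it spreads the items into one
# column per item rather than producing the comma-separated lists requested;
# the answers below show ways to get those
newDT <- dcast(testDF, Id ~ Item, value.var = "Item")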

There are some good answers above; however, I think this should be listed as an option as well (note that str_c comes from stringr):
library(dplyr)
library(stringr)

testDF %>%
  group_by(Id) %>%
  # Create a string listing all items for a given Id, separated by commas
  summarise(Items = str_c(Item, collapse = ', '))
Returns:
# A tibble: 2 × 2
      Id        Items
  <fctr>        <chr>
1      1   Milk, Eggs
2      2 Bacon, Bread
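If you want just the comma-separated strings without the Id column (as in the desired output above), one option, assuming a dplyr version recent enough to provide pull(), is:
testDF %>%
  group_by(Id) %>%
  summarise(Items = str_c(Item, collapse = ', ')) %>%
  pull(Items)
#> [1] "Milk, Eggs"   "Bacon, Bread"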

testDF = data.frame(Id = c("1", "1", "2", "2"),
                    Item = c('Milk', 'Eggs', 'Bacon', 'Bread'))
testDF

z <- aggregate(list(Item = testDF$Item), list(ID = testDF$Id),
               function(x) paste(x, collapse = ','))
z
  ID        Item
1  1   Milk,Eggs
2  2 Bacon,Bread
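The same aggregation can also be written with aggregate's formula interface, a common idiom added here for comparison (note that toString collapses with ", " rather than ","):
aggregate(Item ~ Id, data = testDF, FUN = toString)
#>   Id         Item
#> 1  1   Milk, Eggs
#> 2  2 Bacon, Bread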

With the group_by and summarise functions from dplyr you have:
library("dplyr")

testDF %>%
  group_by(Id) %>%
  summarise(Items = paste0(Item, collapse = ","))
#Source: local data frame [2 x 2]
#Groups: Id [2]
#
#      Id     Items
#  (fctr)     (chr)
#1      1 Milk,Eggs
#2      2 Bacon,Bread

Since I didn't see a green checkmark, I figured I'd take a stab at it, because I wrote a function for this exact problem.
library(dplyr)
transp <- function(input, uniq_var, compare_var, transposed_column_names = 'measurement') {
  # input is the data frame/data.table that you want to perform the operation on,
  # uniq_var is the variable that you are grouping by, compare_var is the
  # variable that is being measured in each of the groups, and
  # transposed_column_names is an optional string used to name the output
  # columns (it will be concatenated with an observation number, i.e. if you
  # input 'distance', it will name the observations 'distance_1', 'distance_2',
  # 'distance_3', etc.)
  if (class(input[, uniq_var]) == "factor") {
    input[uniq_var] = sapply(input[uniq_var], as.character)
  }
  list_df <- input %>% group_by(input[, uniq_var]) %>% do(newcol = t(.[compare_var]))
  # This gets us the aggregates we want, BUT all of our columns are stored in a
  # list instead of in separate columns... so we need to create a new data frame
  # with rows = the number of unique values that we are "grouping" by (noted
  # here by uniq_var) and columns = the maximum number of observations assigned
  # to one of those groups. So first we create the skeleton of the matrix, and
  # then use a user-defined function to fill it with the correct values.
  new_df <- matrix(rep(NA, (max(count(input, input[, uniq_var])[, 2]) * dim(list_df)[1])),
                   nrow = dim(list_df)[1])
  new_df <- data.frame(new_df)
  new_df <- cbind(list_df[, 1], new_df)
  # I am writing a function inside of a function because for loops can take a
  # while when doing operations on multiple columns of a data frame.
  func2 <- function(input, thing = new_df) {
    # Here we have the slightly easier case, where the group has the maximum
    # number of observations. We subtract 1 from the number of columns because
    # the first column holds the value of the unique value we are looking at,
    # so we don't count it.
    if (length(input[2][[1]]) == dim(thing)[2] - 1) {
      # We set the row corresponding to the specific unique value in our list_df
      # of aggregated values equal to the de-aggregated values, so that you have
      # a column for each value, like in PROC TRANSPOSE.
      thing[which(thing[, 1] == input[1]), 2:ncol(thing)] = input[2][[1]]
    } else {
      thing[which(thing[, 1] == input[1]), 2:(1 + length(input[2][[1]]))] = input[2][[1]]
    }
    # If you're wondering why I have to use so many []'s, it's because our
    # list_df has one column of unique identifiers while the other column is
    # actually a column of data frames, each of which has 1 row and 1 column,
    # and that element is a list of the transposed values that we want to add
    # to our new data frame.
    return(thing[which(thing[, 1] == input[1]), ])
  }
  quarter_final_output <- apply(list_df, 1, func2)
  semi_final_output <- data.frame(matrix(unlist(quarter_final_output),
                                         nrow = length(quarter_final_output), byrow = T))
  # This names the columns according to the column names that a user would
  # typically specify in a PROC TRANSPOSE.
  name_trans <- function(trans_var = transposed_column_names, uniq_var = uniq_var, df) {
    colnames(df)[1] = colnames(input[uniq_var])
    colnames(df)[2:length(colnames(df))] = c(paste0(trans_var, seq(1, (length(colnames(df)) - 1), 1)))
    return(df)
  }
  final_output <- name_trans(transposed_column_names, uniq_var, semi_final_output)
  return(final_output)
}
In your case, you'd apply it like this:
transp(testDF, uniq_var = 'Id', compare_var = "Item")
If you want, you can also download it from my GitHub: https://github.com/seanpili/R_PROC_TRANSPOSE
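For reference, my reading of the code suggests the call above should return something shaped roughly like this (a sketch based on the default transposed_column_names prefix, not verified output from the original post):
#   Id measurement1 measurement2
# 1  1         Milk         Eggs
# 2  2        Bacon        Bread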

Related

Applying functions to each group in a dataframe in R

I have a dataframe like this:
df <- data.frame(info = c("Lucas sold $3.01", "Lucia bought 3.00", "Lucas bought $2.5",
                          "Lucas sold $3.01", "Lucia bought 3.00", "Lucas bought $2.5"),
                 number = c("1001", "1001", "1002", "1003", "1003", "1003"),
                 step = c("step 1", "step 2", "step 1", "step 1", "step 2", "step 3"),
                 status = c("ok", NA, NA, "ok", NA, NA))
I need to transform the information that I already have, using various functions, but I need to do it by grouping the information based on "number".
For example, I need to group by "number" and then replace the first NA in column "status" with an "ok", for each group.
Then "status" would be c("ok","ok","ok","ok","ok",NA)
which(is.na(df$status))[1] would do the trick if I could apply that to each group.
Another function that I need to apply would create a new column where I can place a "1" at the last occurrence of the word "bought" in the column "info".
Something like df[max(which(grepl("bought", df$info)))] <- "1" would do the trick if I could apply that to each group, but I am not sure about how to do it.
You could make great use of dplyr's group_by syntax here after creating some bespoke functions to do the required tasks:
# Replace the first NA element of a vector with 'ok'
replace_first_na <- function(x) {
  # Coerce to character so the assignment isn't blocked by missing factor levels
  x <- as.character(x)
  # Get the position of the first NA
  first_na <- which(is.na(x))[1]
  # Replace the element in that position with 'ok'
  x[first_na] <- "ok"
  x
}
# Flag the last element containing the word 'bought'
last_bought_flag <- function(x) {
  # Prepare the output
  out <- rep(0, length(x))
  # Get the position of the last string to contain 'bought'
  last_bought <- max(which(grepl("bought", x)))
  # Replace the element in that position with `1`
  out[last_bought] <- 1
  # Return the output
  out
}
df %>%
  as_tibble() %>%
  # Apply grouping by `number`
  group_by(number) %>%
  # Replace the first `NA` with 'ok' in the `status` column
  mutate(status = replace_first_na(status)) %>%
  # Get a flag column indicating the last 'bought' item for each group
  mutate(last_bought = last_bought_flag(info)) %>%
  # Remove grouping
  ungroup()
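Applied to the example data, the pipeline should return something like the following (a sketch worked out from the functions above, not output pasted from the original post; exact column types depend on your R version):
#> # A tibble: 6 x 5
#>   info              number step   status last_bought
#> 1 Lucas sold $3.01  1001   step 1 ok               0
#> 2 Lucia bought 3.00 1001   step 2 ok               1
#> 3 Lucas bought $2.5 1002   step 1 ok               1
#> 4 Lucas sold $3.01  1003   step 1 ok               0
#> 5 Lucia bought 3.00 1003   step 2 ok               0
#> 6 Lucas bought $2.5 1003   step 3 NA               1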

How do you add a column to data frames in a list?

I have a list of data frames. I want to add a new column to each data frame. For example, I have three data frames as follows:
a = data.frame("Name" = c("John","Dor"))
b = data.frame("Name" = c("John2","Dor2"))
c = data.frame("Name" = c("John3","Dor3"))
I then put them into a list:
dfs = list(a,b,c)
I then want to add a new column with a unique value to each data frame, e.g.:
dfs[1]$new_column <- 5
But I get the following error:
"number of items to replace is not a multiple of replacement length"
I have also tried using two brackets:
dfs[[1]]$new_column <- 5
This does not return an error but it does not add the column.
This would be in a 'for' loop and a different value would be added to each data frame.
Any help would be much appreciated. Thanks in advance!
Let's say you want to add a new column with a different value (5:7) for each data frame. We can use Map:
new_value <- 5:7
Map(cbind, dfs, new_column = new_value)
#[[1]]
#  Name new_column
#1 John          5
#2  Dor          5
#
#[[2]]
#   Name new_column
#1 John2          6
#2  Dor2          6
#
#[[3]]
#   Name new_column
#1 John3          7
#2  Dor3          7
With lapply you could do
lapply(seq_along(dfs), function(i) cbind(dfs[[i]], new_column = new_value[i]))
Or, as @camille mentioned, it works if you use [[ for indexing in the for loop:
for (i in seq_along(dfs)) {
  dfs[[i]]$new_column <- new_value[[i]]
}
The equivalent purrr version of this would be
library(purrr)
map2(dfs, new_value, cbind)
and
map(seq_along(dfs), ~cbind(dfs[[.]], new_column = new_value[.]))

R equivalent to SAS "merge" "by"

If you only use "merge" and "by" in SAS to merge datasets that contain several variables with equal names (besides the ID(s) that you merge by), SAS will combine these variables into one, using the value read last; it is described here: https://communities.sas.com/t5/SAS-Programming/Merge-step-overwriting-shared-vars/m-p/281542#M57117
Text from above link:
"There is a rule: whichever value was read last. But that rule is simple only when the merge is one-to-one. In that case, the value you get depends on the order in the MERGE statement:
merge a b;
by id;
The value of common variables (for a one-to-one merge) comes from data set B. SAS reads a value from data set A, then reads a value from data set B. The value from B is read last, and overwrites the value read from data set A.
If there is a mismatch, and an ID appears only in data set A but not in data set B, the value will be the one found in data set A."
How do I make R behave the same way, without having to combine the values afterwards based on certain conditions? (In SAS, values are not overwritten by NAs.)
library(tidyverse)

# create tibbles
df1 <- tibble(id = c(1:3), y = c("tt", "ff", "kk"))
df2 <- tibble(id = c(1, 2, 4), y = c(4, 3, 8))
df3 <- tibble(id = c(1:3), y = c(5, 7, NA))

# combine the tibbles
combined_df <- list(df1, df2, df3) %>%
  reduce(full_join, by = "id")

# desired output
combined_df_desired <- tibble(id = 1:4, y = c(5, 7, "kk", 8))
I don't know exactly what you mean by "certain conditions". There isn't a way to change the inner workings of full_join(), but you can do:
list(df1, df2, df3) %>%
  reduce(full_join, by = "id") %>%
  mutate_all(as.character) %>%
  mutate(y = coalesce(y, y.y, y.x)) %>%
  select(id, y)
# A tibble: 4 x 2
  id    y
  <chr> <chr>
1 1     5
2 2     7
3 3     kk
4 4     8
coalesce() takes a set of columns and returns the first non-NA value for each row. You can order the columns inside the function according to your priorities.
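A quick illustration of that behaviour (a made-up vector example, not from the original answer):
library(dplyr)
# elementwise: take the first non-NA value across the three vectors
coalesce(c(NA, "b", NA), c("x", "y", NA), c("p", "q", "r"))
#> [1] "x" "b" "r"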

Using apply in R to extract rows from a dataframe

Using R, I have to extract specific rows from a data frame depending on certain conditions. The data frame is large (5.5 million rows by 251 columns), but I have given the code below to create a sample data frame.
df <- data.frame("Name" = c("Name1", "Name1", "Name1", "Name1", "Name1"),
                 "Value" = c("X", "X", "Y", "Y", "X"))
I need to step through the entire data frame row by row, starting at the top, and whenever the value of the 'Value' column changes from X to Y or Y to X, I need to extract that row and the next row and append them to another data frame. For example, in the data frame above, the Value column of row 2 is X and that of row 3 is Y; since the value has changed from X to Y, I need to extract the entire row 2 and row 3 and add them to another data frame.
The result of the operations can be seen by running the code below
dfextract <- data.frame("Name" = c("Name1", "Name1"), "Value"=c("X", "Y"))
Currently I have used a 'for' loop to step from row to row and extract the rows when the values don't match. But it is very slow and inefficient. The code snippet is below:
for (i in 1:count) {
  if (df[[i + 1, 2]] != df[i, 2]) {
    dfextract <- rbind(dfextract, df[i, ])
    dfextract <- rbind(dfextract, df[i + 1, ])
  }
}
I am looking for a better and faster solution to the above situation. Perhaps using the functions belonging to the family of 'apply()' or using 'by()'. Any help would be greatly appreciated.
Thanks in advance.
Maybe the following does it. Note that there are two lapply-based loops, in order to account for changes in the values of column Name.
diffstr <- function(x) x[-1] == x[-length(x)]

res <- lapply(split(df, df$Name), function(x) {
  inx <- which(c(FALSE, !diffstr(x$Value)))
  do.call(rbind, lapply(inx, function(i) x[(i - 1):i, ]))
})
res <- do.call(rbind, res)
row.names(res) <- NULL
res
How it works.
First, I define a helper function diffstr. It compares all values of x but the first with all values of x but the last. Note that x[-1] is the vector x[2], x[3], ..., x[length(x)]; negative indices remove that element from the vector. The same goes for x[-length(x)], where the negative index removes the last element of x.
split(df, df$Name) splits the data frame into subsets, one for each Name.
I then lapply an unnamed function to these subsets. This function's argument x will be each of the sub-data frames mentioned above.
That function starts by determining where in x$Value the changes are. This is done with the call to the helper function diffstr. I have to prepend a FALSE to the return value because the first row cannot be a change.
The next line is a tricky one. Use lapply on the index of change points inx, and for each one get a two-row segment of the data frame x. Then use do.call to rbind those two-row data frames back together.
Now res is a list, with one sub-data frame for each Name (produced by the split). So it needs to be put back together with another call to do.call(rbind, ...).
Final tidy-up: the whole process messed up the data frame's row names. Setting them to NULL is a well-known trick that forces R to renumber the rows.
That's it. If you need more explanations, just say so.
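To make the helper concrete, here is a quick illustrative run on the example's Value column (added for clarity):
x <- c("X", "X", "Y", "Y", "X")
diffstr(x)
#> [1]  TRUE FALSE  TRUE FALSE
which(c(FALSE, !diffstr(x)))   # rows where a new value starts
#> [1] 3 5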
We can use dplyr. lag shifts a vector by 1, so Value != lag(Value) checks whether a value differs from the previous one, and which(Value != lag(Value)) converts the result to row numbers. After that, sort(unique(unlist(lapply(which(Value != lag(Value)), function(x) c(x, x - 1))))) makes sure we also get the row numbers of the preceding rows. Finally, slice subsets the data frame based on the row numbers provided.
library(dplyr)
df2 <- df %>%
  slice(sort(unique(unlist(lapply(which(Value != lag(Value)), function(x) c(x, x - 1))))))
df2
# A tibble: 4 x 2
    Name  Value
  <fctr> <fctr>
1  Name1      X
2  Name1      Y
3  Name1      Y
4  Name1      X
If the code is too long to read, you can also calculate the indices before using the slice function, as follows.
library(dplyr)
ind <- which(df$Value != lag(df$Value))
ind2 <- sort(unique(c(ind, ind - 1)))
df2 <- df %>% slice(ind2)
df2
# A tibble: 4 x 2
    Name  Value
  <fctr> <fctr>
1  Name1      X
2  Name1      Y
3  Name1      Y
4  Name1      X
Using base R, I would probably work with row indices and diff:
df <- data.frame(colA = c(1, 1, 1, 2, 1, 1, 1, 3, 3, 3, 1, 1),
                 colB = 1:12)
keep <- which(diff(df$colA) != 0)
df[unique(c(keep, keep + 1)), ]
   colA colB
3     1    3
4     2    4
7     1    7
10    3   10
5     1    5
8     3    8
11    1   11
There is probably a faster option though.
When you have a large dataset, speed might be the bottleneck. In this case data.table might be the best option for you.
Using the data.table library, I would solve it like so:
library(data.table)
dt <- data.table(Name = c("Name1", "Name1", "Name1", "Name1", "Name1"),
                 Value = c("X", "X", "Y", "Y", "X"))

# flag rows where Value differs from the previous row
dt[, idx := Value != shift(Value, 1, fill = dt$Value[1])]

# keep the rows where a change occurs, plus the row after each change,
# and deselect the helper variable idx
dt[idx | shift(idx, 1)][, .(Name, Value)]
#>     Name Value
#> 1: Name1     Y
#> 2: Name1     Y
#> 3: Name1     X
Why does it give an odd number of rows and not an even number? Because in your example data the value also changes at the last row, so there is no next row left to select alongside it.
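If you instead want the row before each change together with the change itself (the pairs shown in the question), one possible tweak, not part of the original answer, is to also look ahead with type = "lead":
# also flag the row immediately before each change
dt[idx | shift(idx, 1, type = "lead")][, .(Name, Value)]
#>     Name Value
#> 1: Name1     X
#> 2: Name1     Y
#> 3: Name1     Y
#> 4: Name1     X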

create unique values in each cells for a given range in excel or R

Apologies if the example is not formatted properly.
I have a data set with one sample per row, the data contain two columns with reference numbers of the start value and end value.
cell A1 = Sample #1
cell B1 = 101-263 (start value)
cell C1 = 101-266 (end value)
cell A2 = Sample #2
cell B2 = 162-186 (start value)
cell C2 = 162-187 (end value)
The range of values is a different length for each row of data, with a maximum range of 8 values. I need to fill in the values of the range, with each value in its own cell along the row.
So for sample #1 above I need to create the cell values: D1 = 101-264, and E1 = 101-265
While for sample #2 there will be no extra cells needed.
Is there a formula (using Vlookup and If perhaps?) that I can create and drag across all rows and over the 8 needed columns to fill in this data? (I don't mind if there are N/A in the shorter-range rows)
If there is an easier way using R, that is also fine with me.
Thanks for any advice
Please try in D1 copied across eight columns and then D1:K1 copied down to suit:
=IF(1*RIGHT($C1,3)>RIGHT($B1,3)+COLUMN()-3,LEFT($B1,4)&RIGHT($B1,3)+COLUMN()-3,"")
The condition (IF) checks whether or not to display a result, depending on whether the incremented value is still below the upper limit specified in ColumnC (once it equals or exceeds that limit, a 'blank' "" is shown for neater presentation).
There is some text manipulation (RIGHT and LEFT) to get at the part that is to be integer incremented or to add back the static part.
COLUMN() returns the column number (A→1, B→2, etc.), so it is useful as a kind of stepping function. In D1, COLUMN()-3 is 4-3, or 1, so 1 is added to the numeric part of the start of the range (the right-hand side of B1). When copied across to ColumnE, COLUMN()-3 becomes 5-3, so 2 is added to the start of the range.
The following code:
library(magrittr)
library(plyr)
library(reshape2)

# Create input example
dat = data.frame(
  sample = c("Sample #1", "Sample #2"),
  start = c("101-263", "162-186"),
  end = c("101-266", "162-187"),
  stringsAsFactors = FALSE
)

# Extract 'start' and 'end' values
dat$num1 = dat$start %>% strsplit("-") %>% sapply("[", 1)
dat$start2 = dat$start %>% strsplit("-") %>% sapply("[", 2) %>% as.numeric
dat$end2 = dat$end %>% strsplit("-") %>% sapply("[", 2) %>% as.numeric
dat$start = NULL
dat$end = NULL

# For each row
for (i in 1:nrow(dat)) {
  # Check if there is any need to add entries
  if ((dat$end2[i] - dat$start2[i]) > 1) {
    # For each entry
    for (j in seq(dat$start2[i], dat$end2[i] - 1)) {
      # Create entry
      new_entry = data.frame(
        sample = dat$sample[i],
        num1 = dat$num1[i],
        start2 = dat$start2[i],
        end2 = j,
        stringsAsFactors = FALSE
      )
      # Add to table
      dat = rbind(dat, new_entry)
    }
  }
}

# Calculate all values
dat$value = paste0(dat$num1, "-", dat$end2)
dat = dat[, c("sample", "value")]

# Create column labels
dat = ddply(
  dat,
  "sample",
  transform,
  var = paste0("col", rank(value))
)

# Reshape to required format
dat = dcast(dat, sample ~ var, value.var = "value")
Does what you asked on the provided example.
It transforms this table -
     sample   start     end
1 Sample #1 101-263 101-266
2 Sample #2 162-186 162-187
Into this one -
     sample    col1    col2    col3    col4
1 Sample #1 101-263 101-264 101-265 101-266
2 Sample #2 162-187    <NA>    <NA>    <NA>
If there is a larger example for testing, I will be happy to try it :)
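For comparison, here is a more compact sketch of the same reshaping in base R (my own illustration, not part of the original answer; note that, unlike the output above, it keeps both endpoints for Sample #2):
dat <- data.frame(sample = c("Sample #1", "Sample #2"),
                  start = c("101-263", "162-186"),
                  end = c("101-266", "162-187"),
                  stringsAsFactors = FALSE)

prefix <- sub("-.*", "", dat$start)              # "101", "162"
from <- as.numeric(sub(".*-", "", dat$start))    # 263, 186
to <- as.numeric(sub(".*-", "", dat$end))        # 266, 187

# Expand each range into a vector of "prefix-number" strings
vals <- Map(function(p, f, t) paste0(p, "-", f:t), prefix, from, to)

# Pad shorter rows with NA so every row has the same number of columns
width <- max(lengths(vals))
padded <- t(sapply(vals, function(v) c(v, rep(NA, width - length(v)))))
rownames(padded) <- NULL

out <- data.frame(sample = dat$sample, padded, stringsAsFactors = FALSE)
colnames(out)[-1] <- paste0("col", seq_len(width))
out
#>      sample    col1    col2    col3    col4
#> 1 Sample #1 101-263 101-264 101-265 101-266
#> 2 Sample #2 162-186 162-187    <NA>    <NA>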
