Conditionally split rows based on value in a specific column - r

I would like to convert the columns below into the format shown under "desired output". Each group starts at a row with sample type N and runs until the next N. For example, the first two rows below form one group, and 7397-DNA_A01 to 7399-DNA_A01 form another.
Sample Sample Type
7393.DNA_A01 N
7394-DNA_A01 T
7395-DNA_A01 N
7396-DNA_A01 T
7397-DNA_A01 N
7398-DNA_A01 T
7399-DNA_A01 LN
7400-DNA_A01 N
7401-DNA_A01 T
7402-DNA_A01 B
Desired output:
N            T            B            LN
7393.DNA_A01 7394-DNA_A01
7395-DNA_A01 7396-DNA_A01
7397-DNA_A01 7398-DNA_A01              7399-DNA_A01
7400-DNA_A01 7401-DNA_A01 7402-DNA_A01
I'm really not sure how to split the rows when N is encountered and then I suppose I would need to transpose somehow. Please help!

We need to create a grouping index ('indx') based on the occurrence of 'N'. Here, a logical vector is created (SampleType=='N') and passed to cumsum to create 'indx'. To control the order of the output columns, it may be useful to convert the 'SampleType' column to a factor with its levels in the order of the column names in the expected result. Then we can use dcast from either reshape2 or data.table.
library(data.table)#v1.9.5+
setDT(df1)[, indx:=cumsum(SampleType=='N')
][, SampleType:= factor(SampleType, levels=c('N', 'T', 'B', 'LN'))]
dcast(df1, indx~SampleType, value.var='Sample', fill='')[,-1,with=FALSE]
#              N            T            B           LN
#1: 7393.DNA_A01 7394-DNA_A01
#2: 7395-DNA_A01 7396-DNA_A01
#3: 7397-DNA_A01 7398-DNA_A01              7399-DNA_A01
#4: 7400-DNA_A01 7401-DNA_A01 7402-DNA_A01
If you are using dcast from reshape2, the 'indx' column can be created with base R. The 'SampleType' column can be converted to a factor with code similar to the above.
df1$indx <- cumsum(df1$SampleType=='N')
library(reshape2)
dcast(df1, indx~SampleType, value.var='Sample', fill='')
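To see what the grouping index looks like on its own, here is a minimal base R sketch. It only assumes the sample types appear in the order given in the question; the names sample_type, starts and indx are illustrative.

```r
# Sample types in the order given in the question
sample_type <- c("N", "T", "N", "T", "N", "T", "LN", "N", "T", "B")

# TRUE wherever a new group starts
starts <- sample_type == "N"

# The cumulative sum counts how many group starts have been seen so far,
# so every row between one 'N' and the next gets the same number
indx <- cumsum(starts)
print(indx)
# [1] 1 1 2 2 3 3 3 4 4 4

# Rows that share an index end up in the same output row
groups <- split(sample_type, indx)
print(groups)
```

Each element of groups corresponds to one row of the desired wide output.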
data
df1 <- structure(list(Sample = c("7393.DNA_A01", "7394-DNA_A01",
"7395-DNA_A01",
"7396-DNA_A01", "7397-DNA_A01", "7398-DNA_A01", "7399-DNA_A01",
"7400-DNA_A01", "7401-DNA_A01", "7402-DNA_A01"), SampleType = c("N",
"T", "N", "T", "N", "T", "LN", "N", "T", "B")), .Names = c("Sample",
"SampleType"), class = "data.frame", row.names = c(NA, -10L))

Related

When creating new data from a column, it has no name

dataset=structure(list(goods = structure(1:6, .Label = c("a", "b", "c",
"d", "e", "f"), class = "factor")), .Names = "goods", class = "data.frame", row.names = c(NA,
-6L))
goods
1 a
2 b
3 c
4 d
5 e
6 f
I want to create a new data object, so I simply do
df1=dataset$goods
but after that, df1 does not have the column name goods.
Why?
str(df1)
Factor w/ 6 levels "a","b","c","d",..: 1 2 3 4 5 6
As you can see, it does not have the name goods.
How can I make df1 keep the column name goods?
If this post is a duplicate, let me know and I will delete it.
You are assigning a column vector, not a data frame. To assign the whole data frame, simply do
df = dataset
If you want to preserve only some columns and not all, use column subsetting (documentation):
df = dataset[, "goods", drop = FALSE]
drop = FALSE is necessary here because the dataframe subset operator will otherwise return a vector instead of a data frame with a single column (this is arguably a bug, which is why tidyverse tibbles behave differently).
Using tidyverse operations (aka the “modern” R way), this would be written as
library(dplyr)
df = select(dataset, goods)
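A small base R sketch of the difference between these ways of extracting a column, assuming a data frame like the one in the question (the object names v, d1, d2 and d3 are illustrative):

```r
# Rebuild the example data: a data frame with one factor column
dataset <- data.frame(goods = factor(c("a", "b", "c", "d", "e", "f")))

v  <- dataset$goods                     # a plain factor, the column name is gone
d1 <- dataset[, "goods"]                # single column is dropped to a factor too
d2 <- dataset[, "goods", drop = FALSE]  # stays a one-column data frame
d3 <- dataset["goods"]                  # list-style indexing also keeps a data frame

print(class(v))    # "factor"
print(class(d1))   # "factor"
print(names(d2))   # "goods"
print(names(d3))   # "goods"
```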
df1 = data.frame(goods = dataset$goods, stringsAsFactors = FALSE) works perfectly well, or you can use the longer but (somewhat?) more explicit:
library(dplyr)  # for the %>% pipe
ds <- dataset[, c("goods")] %>% as.data.frame(stringsAsFactors = FALSE)
colnames(ds) <- "goods"
edit: Added the stringsAsFactors option, as it is useful for controlling whether or not factor conversion happens. c("goods") is equivalent to "goods", but I left it as a template in case you need to add more columns.

In R, compare vectors of different length to match and replace values

Thanks for your help.
I have two data frames. The data frames are of differing lengths. One is a data set that often includes mistakes. Another is a set of corrections. I'm trying to do two things at once with these two data sets. First, I would like to compare three columns of df1 with three columns in df2. This means reading the first row of data in df1 and seeing if those three variables match any of the rows in df2 for those three variables, then moving on to row 2, and so on. If a match is found in a row for all three variables, then replace the value in one of the columns in df1 with a replacement in df2. I have included an example below.
df1 <- data.frame("FIRM" = c("A", "A", "B", "B", "C", "C"), "LOCATION" = c("N", "S", "N", "S", "N", "S"), "NAME" = c("Apple", "Blooberry", "Cucumber", "Date", "Egplant", "Fig"))
df2 <- data.frame("FIRM" = c("A", "C"), "LOCATION" = c("S", "N"), "NAME" = c("Blooberry", "Egplant"), "NEW_NAME" = c("Blueberry", "Eggplant"))
df1[] <- lapply(df1, as.character)
df2[] <- lapply(df2, as.character)
If there is a row in df1 that matches against "FIRM", "LOCATION" and "NAME" in df2, then I would like to replace the "NAME" in df1 with "NEW_NAME" in df2, such that "Blooberry" and "Egplant" change to "Blueberry" and "Eggplant".
I can do the final replacements using*:
df1$NAME[match(df2$NAME, df1$NAME)] <- df2$NEW_NAME[match(df1$NAME[match(df2$NAME, df1$NAME)], df2$NAME)]
But this does not include the constraint of the three matches. Also, my code seems unnecessarily complex with the nested match functions. I think I could accomplish this task by subsetting df2 and using a for loop to match rows one by one but I would think that there is a better vectorized method out there.
*I'm aware that inside the brackets of df2$NEW_NAME[], the function calls both elements in that column, but I'm trying to generalize.
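Consider an all.x merge (i.e., a LEFT JOIN in SQL speak) followed by an ifelse that falls back to NAME wherever NEW_NAME is NA. With no by argument, merge joins on the columns the two data frames share, here FIRM, LOCATION and NAME.
Below, transform reassigns the NAME column in the same line, and the trailing [1:3] keeps only the first three columns.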
mdf <- transform(merge(df1,df2,all.x=TRUE),NAME=ifelse(is.na(NEW_NAME),NAME,NEW_NAME))[1:3]
mdf
# FIRM LOCATION NAME
# 1 A N Apple
# 2 A S Blueberry
# 3 B N Cucumber
# 4 B S Date
# 5 C N Eggplant
# 6 C S Fig
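If you would rather stay close to the match() approach from the question, one option is to match on a combined key built from all three columns. This is only a sketch: key_cols, key1, key2, idx and hit are illustrative names, and it assumes the separator "\r" never occurs in the data.

```r
df1 <- data.frame(FIRM = c("A", "A", "B", "B", "C", "C"),
                  LOCATION = c("N", "S", "N", "S", "N", "S"),
                  NAME = c("Apple", "Blooberry", "Cucumber", "Date", "Egplant", "Fig"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(FIRM = c("A", "C"),
                  LOCATION = c("S", "N"),
                  NAME = c("Blooberry", "Egplant"),
                  NEW_NAME = c("Blueberry", "Eggplant"),
                  stringsAsFactors = FALSE)

key_cols <- c("FIRM", "LOCATION", "NAME")

# Paste the three columns into one key per row.
# Assumption: "\r" does not appear in any value, so keys are unambiguous.
key1 <- do.call(paste, c(df1[key_cols], sep = "\r"))
key2 <- do.call(paste, c(df2[key_cols], sep = "\r"))

# For each row of df1, the position of its match in df2 (NA if none)
idx <- match(key1, key2)
hit <- !is.na(idx)

# Replace NAME only where all three columns matched
df1$NAME[hit] <- df2$NEW_NAME[idx[hit]]
print(df1)
```

Unlike merge, this keeps df1 in its original row order, and rows without a correction are left untouched.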

collapse multiple rows of a dataframe into single row based on unique id in r [duplicate]

I have a data set with millions of rows. The first column holds an ID, and IDs are repeated across rows (all IDs are grouped and ordered). The data set has multiple columns. I would like to transform the data so that there is one row per ID, with all the entries of each column for that ID placed in that single row, in order.
See example snippet of the before data
And example of what I would like the data to look like
Here is an example of a very similar problem, however in this problem the data only has two columns (one column for the ID), but my data has over 5 columns (and one column for ID): Collapse mutiple rows of a dataframe into one row - based on a unique key
I would like to do this in either R or Excel :)
In R, we can do this with dcast from data.table
library(data.table)
dcast(setDT(df1), ID ~ rowid(ID), value.var = c("V1", "V2"), fill = "")
#   ID V1_1 V1_2 V1_3 V2_1 V2_2 V2_3
#1:  1    a    b    c   aa   bb   cc
#2:  2    d    e        dd   ee
#3:  3    f             ff
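The rowid(ID) term in the formula numbers the rows within each ID, and those numbers become the _1, _2, _3 suffixes. A base R sketch of the same within-group counter, using ave (the name within_id is illustrative):

```r
ID <- c(1, 1, 1, 2, 2, 3)

# Count rows 1, 2, 3, ... separately inside each ID
within_id <- ave(seq_along(ID), ID, FUN = seq_along)
print(within_id)
# [1] 1 2 3 1 2 1
```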
data
df1 <- structure(list(ID = c(1, 1, 1, 2, 2, 3), V1 = c("a", "b", "c",
"d", "e", "f"), V2 = c("aa", "bb", "cc", "dd", "ee", "ff")), .Names = c("ID",
"V1", "V2"), row.names = c(NA, -6L), class = "data.frame")

R - reshape dataframe from duplicated column names but unique values

Hi I have a dataframe that looks like the following
I want to apply a function to it so that it reshapes it like this
How would I do that?
Here is one option that could work. We loop through the unique names of the dataset, create a logical index with ==, extract the matching columns, unlist them, and wrap the result in a data.frame. The pieces are then combined with cbind or simply with data.frame (the assumption is that every name is duplicated the same number of times).
data.frame(lapply(unique(names(df1)), function(x)
setNames(data.frame(unlist(df1[names(df1)==x], use.names = FALSE)), x)))
# type model make
#1 a b c
#2 d e f
data
df1 <- data.frame(type = "a", model = "b", make = "c", type = "d",
model = "e",
make = "f", check.names=FALSE, stringsAsFactors=FALSE)
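To make the individual steps easier to follow, here is what happens for a single column name, "type", using the same example data (is_type and type_values are illustrative names):

```r
df1 <- data.frame(type = "a", model = "b", make = "c", type = "d",
                  model = "e", make = "f",
                  check.names = FALSE, stringsAsFactors = FALSE)

# Logical index: which columns are called "type"?
is_type <- names(df1) == "type"
print(is_type)
# [1]  TRUE FALSE FALSE  TRUE FALSE FALSE

# Pull those columns out and flatten them into one vector
type_values <- unlist(df1[is_type], use.names = FALSE)
print(type_values)
# [1] "a" "d"
```

The lapply call in the answer repeats this for every unique name and binds the resulting vectors together as columns.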

sample data.table rows with different conditions

I have a data.table with multiple columns. One of these columns currently works as a 'key' (keyb in the example). Another column (let's say A) may or may not have data in it. I would like to supply a vector of keys and, for each key that appears in the vector, randomly sample two rows: one row that contains data in A and one that does not.
MRE:
#data.table
trys <- structure(list(keyb = c("x", "x", "x", "x", "x", "y", "y", "y",
"y", "y"), A = c("1", "", "1", "", "", "1", "", "", "1", "")), .Names = c("keyb",
"A"), row.names = c(NA, -10L), class = c("data.table", "data.frame"
))
setkey(trys,keyb)
#list with keys
list_try <- structure(list(a = "x", b = c("r", "y","x")), .Names = c("a", "b"))
I could, for instance subset the data.table based on the elements that appear in list_try:
trys[keyb %in% list_try[[2]]]
My original (and probably inefficient) idea was to chain a sample of two rows per key, once for rows where A has data and once for rows where it has none, and then merge the results. But it does not work:
#here I was trying to sample rows based on whether A has data or not
#here for rows where A has no data
trys[keyb %in% list_try[[2]]][nchar(A)==0][sample(.N, 2), ,by = keyb]
#here for rows where A has data
trys[keyb %in% list_try[[2]]][nchar(A)==1][sample(.N, 2), ,by = keyb]
In this case, my expected output would be two data.tables (one for a and one for b in list_try), of two rows per appearing element: So the data.table from a would have two rows (one with and without data in A), and the one from b, four rows (two with and two without data in A).
Please let me know if I can make this post any clearer
You could add A to the by statement as well, converting it to a logical vector with A != "". Combine this with a binary join on the key (adding nomatch = 0L to drop keys that have no match), sample from the row index .I within each keyb / A != "" group, and then use the sampled indices to subset the original data set.
For a single subset case
trys[trys[list_try[[2]], nomatch = 0L, sample(.I, 1L), by = .(keyb, A != "")]$V1]
# keyb A
# 1: y 1
# 2: y
# 3: x 1
# 4: x
For a more general case, when you want to create separate data sets according to a list of keys, you could easily embed this into lapply
lapply(list_try,
function(x) trys[trys[x, nomatch = 0L, sample(.I, 1L), by = .(keyb, A != "")]$V1])
# $a
# keyb A
# 1: x 1
# 2: x
#
# $b
# keyb A
# 1: y 1
# 2: y
# 3: x 1
# 4: x
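One thing to watch with sample(.I, 1L): when its first argument is a single number n, base R's sample draws from 1:n rather than returning n. In the example data every key / A != "" group has at least two rows, so this does not bite here, but it could for a group with a single row. A base R sketch of the pitfall and one possible workaround (safe_pick is a hypothetical helper, not part of data.table):

```r
# Pretend a group contains exactly one row, sitting at row index 7
rows <- 7L

set.seed(1)
# sample() treats a single number n as "draw from 1:n"
draws <- replicate(20, sample(rows, 1L))
print(range(draws))   # values spread over 1..7, not just 7

# Hypothetical helper: sample a position, then index into the vector
safe_pick <- function(x) x[sample.int(length(x), 1L)]
safe <- replicate(20, safe_pick(rows))
print(unique(safe))
# [1] 7
```

Under that assumption, replacing sample(.I, 1L) with .I[sample.int(.N, 1L)] inside the data.table call would be the equivalent guard.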
