How to split a column in multiple columns using data.table [duplicate] - r

This question already has answers here:
Split text string in a data.table columns
(5 answers)
How to speed up row level operation with dplyr
(1 answer)
Closed 6 months ago.
I have a quite simple question regarding data.table dt, but as always, this package brings me to the brink of despair :D
I have a column called name with contents such as "bin1_position1_ID1", and I want to split this information into separate columns:
name                     bins    positions    IDs
--------------------     --------------------------
"bin1_position1_ID1"  -> "bin1"  "position1"  "ID1"
"bin2_position2_ID2"     "bin2"  "position2"  "ID2"
I tried it with
dt <- dt[, bins := lapply(.SD, function(x) strsplit(x, "_")[[1]][1]), .SDcols="name"]
(and for the other new columns with [[1]][2] [[1]][3])
However, I end up with a new column bins (so far, so good), but it contains the value from row 1 in every row (i.e. bin1 everywhere) instead of the value taken from each row's own name.
Some entries also contain more pieces that I don't want to turn into columns, e.g. "bin5_position5_ID5_another5_more5".
Code for testing (see Maël's solution):
library(data.table)
name <- c("bin1_position1_ID1",
"bin2_position2_ID2",
"bin3_position3_ID3",
"bin4_position4_ID4",
"bin5_position5_ID5_another5_more5")
dt <- data.table(name)
dt[, c("bin", "position", "ID") := tstrsplit(name, "_", fixed = TRUE, keep = 1:3)]

Use tstrsplit with keep = 1:3 to keep only the first three columns:
dt[, c("bin", "position", "ID") := tstrsplit(name, "_", fixed = TRUE, keep = 1:3)]
name bin position ID
1: bin1_position1_ID1 bin1 position1 ID1
2: bin2_position2_ID2 bin2 position2 ID2
3: bin3_position3_ID3 bin3 position3 ID3
4: bin4_position4_ID4 bin4 position4 ID4
5: bin5_position5_ID5_another5_more5 bin5 position5 ID5
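For context on why the attempt in the question repeated the first row's value: strsplit() returns a list with one element per row, so [[1]][1] always picks the first piece of the first row, and := then recycles that single value. The following is only a minimal sketch of that behaviour; the vector nm is an illustrative stand-in for the name column and is not part of the original post.

```r
library(data.table)

nm <- c("bin1_position1_ID1", "bin2_position2_ID2")  # illustrative data (assumption)

# strsplit() returns a list with one character vector per input element
parts <- strsplit(nm, "_", fixed = TRUE)

# [[1]][1] only ever looks at the first row
first_row_only <- parts[[1]][1]

# taking the first piece of every list element gives one value per row
per_row <- vapply(parts, `[`, character(1), 1)

# tstrsplit() transposes the list, giving one vector per piece
cols <- tstrsplit(nm, "_", fixed = TRUE, keep = 1:3)
```

This is why tstrsplit() fits the := assignment directly: each element of its result has one value per row.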

Related

Slice a list for a new column in a data.table in R [duplicate]

This question already has answers here:
Split text string in a data.table columns
(5 answers)
Closed 3 months ago.
I have a data.table with a tab-separated string that I want to split into new columns. However, if I slice by index, I just get the first element of the first row for every field. How do I do this?
library(data.table)
a <- c("feature1\titem1\titem2")
dt1 <- data.table(a)
a <- c("feature2\titem3\titem4")
dt2 <- data.table(a)
dt <- rbindlist(list(dt1, dt2))
dt[, split := mapply(str_split, a, "\t", n = 2)]
# how to get a feature column from that?
One possible solution with transpose:
dt[, transpose(stringr::str_split(a,"\t"))]
V1 V2 V3
<char> <char> <char>
1: feature1 item1 item2
2: feature2 item3 item4
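As an alternative that stays within data.table, tstrsplit() can split on the tab and assign the pieces by reference in one step. This is a sketch under the assumption that every row has exactly three tab-separated fields; the new column names are made up for illustration.

```r
library(data.table)

dt <- data.table(a = c("feature1\titem1\titem2", "feature2\titem3\titem4"))

# split on a literal tab and add the pieces as new columns by reference;
# the column names are illustrative, not taken from the question
dt[, c("feature", "item_a", "item_b") := tstrsplit(a, "\t", fixed = TRUE)]
```

Unlike the transpose() call, this keeps the original column a next to the new ones.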

Combining rows if text is same before certain character [duplicate]

This question already has answers here:
How to sum a variable by group
(18 answers)
Closed 3 years ago.
I have a data frame similar to this. I want to sum up values for rows if the text in column "Name" is the same before the - sign.
Remove everything after the "-" using sub, and then apply one of the approaches from "How to sum a variable by group":
df$Name <- sub("-.*", "",df$Name)
aggregate(cbind(val1, val2)~Name, df, sum)
Below is a data.table solution.
Data:
df = data.table(
Name = c('IRON - A', 'IRON - B', 'SABBATH - A', 'SABBATH - B'),
val1 = c(1,2,3,4),
val2 = c(5,6,7,8)
)
Code:
df[, Name := sub("-.*", "", Name)]
mat = df[, .(sum(val1), sum(val2)), by = Name]
> mat
Name V1 V2
1: IRON 3 11
2: SABBATH 7 15
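Two small variations on the code above, offered as a sketch rather than a correction: the pattern "-.*" leaves the space in front of the dash in place (so the group is really "IRON "), and .(sum(val1), sum(val2)) produces the default names V1 and V2. Assuming the same toy data, the version below trims that space and keeps the original column names.

```r
library(data.table)

df <- data.table(
  Name = c("IRON - A", "IRON - B", "SABBATH - A", "SABBATH - B"),
  val1 = c(1, 2, 3, 4),
  val2 = c(5, 6, 7, 8)
)

# "\\s*-.*" also removes the whitespace before the dash: "IRON - A" -> "IRON"
df[, Name := sub("\\s*-.*", "", Name)]

# lapply(.SD, sum) keeps the names val1 and val2 instead of V1 and V2
res <- df[, lapply(.SD, sum), by = Name, .SDcols = c("val1", "val2")]
```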
You can rbind your two tables (top and bottom) into one data frame and then use dplyr or data.table. data.table is typically faster for large tables.
data_frame$Name <- sub("-.*", "", data_frame$Name)
library(dplyr)
data_frame %>%
group_by(Name) %>%
summarise_all(sum)
library(data.table)
dt <- as.data.table(data_frame)
dt[, lapply(.SD, sum, na.rm = TRUE), by = Name]

Paste name of new columns when summarizing data.table [duplicate]

This question already has answers here:
Dynamically add column names to data.table when aggregating
(2 answers)
Dynamic column names in data.table
(3 answers)
Closed 5 years ago.
How can I summarize a data.table while creating a new column whose name comes from a character string?
Reproducible example:
library(data.table)
dt <- data.table(x=rep(c("a","b"),20),y=factor(sample(letters,40,replace=T)), z=1:20)
i <- 15
new_var <- paste0("new_",i)
# my attempt
dt[, .( eval(new_var) = sum( z[which( z <= i)] )), by= x]
# expected result
dt[, .( new_15 = sum( z[which( z <= i)] )), by= x]
> x new_15
> 1: a 128
> 2: b 112
This approach using eval() works fine for creating a new column with := (see this SO question), but I don't know why it does not work when summarizing a data.table.
One option is setNames
dt[, setNames(.(sum( z[which( z <= i)])), new_var) , by= x]
# x new_15
#1: a 128
#2: b 112
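Another way to get the same result, sketched here under the assumption that a temporary column name is acceptable, is to aggregate first and rename afterwards with setnames(). The name tmp is a placeholder chosen for illustration, and z is built with rep() so that no recycling is needed.

```r
library(data.table)

dt <- data.table(x = rep(c("a", "b"), 20), z = rep(1:20, 2))
i <- 15
new_var <- paste0("new_", i)

# aggregate under a placeholder name, then rename the column by reference
res <- dt[, .(tmp = sum(z[z <= i])), by = x]
setnames(res, "tmp", new_var)
```

setnames() changes the name in place, so res does not need to be reassigned.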

Match rows between dataframe and replace with a value in another column of the second dataframe

I have two dataframes. The first contains a column containing IDs and various other columns while the other contains mapping information for these IDs (ID to Name).
I want to replace the ID in the first dataframe with the Name from the other dataframe.
I am able to do this
for(id in 1:nrow(df1)){
df1$X[df1$X %in% df2$ID[id]] <- df2$Name[id]
}
This works so long as I do not have repeating IDs in the mapping file such as this:
ID,Name
MSTRG.11187,gng7.S
MSTRG.11187,Novel
But this occurs quite a lot. I think my previous code will work if I can get rid of any rows from the mapping file which contain the word Novel in them. I am just struggling to do this. I have tried this :
data = data %>% group_by(GeneID) %>% filter(!("Novel" %in% Gene_Name))
But in the previous example of the repeating IDs with different names, it gets rid of the row with gng7.S as well as getting rid of the row with Novel. I'd like to do this but keep the row with gng7.S and only get rid of the row with Novel.
I'm thinking this might be something to do with the group_by part.
Thanks,
S
Edit: Here are some example dataframes
df1=data.frame(X=c("MSTRG.199","MSTRG.18989","MSTRG.8890","MSTRG.7767"))
df2=data.frame(ID=c("MSTRG.18989","MSTRG.18989","MSTRG.8890","MSTRG.7767", "MSTRG.199"),Name=c("gng7.S", "Novel", "Novel","cdc20", "Novel"))
The question is not fully clear on whether every appearance of "Novel" should be removed from df2 or only those where the ID is duplicated. The second case is trickier, so I propose a data.table solution, which I'm more fluent in (and the question isn't explicitly tagged with dplyr).
df1 <- data.frame(X = c("MSTRG.199", "MSTRG.18989", "MSTRG.8890", "MSTRG.7767"))
df2 <- data.frame(
ID = c("MSTRG.18989", "MSTRG.18989", "MSTRG.8890", "MSTRG.7767", "MSTRG.199"),
Name = c("gng7.S", "Novel", "Novel", "cdc20", "Novel"))
library(data.table)
DT1 <- data.table(df1)
DT2 <- data.table(df2)
# case 1
# remove all rows with Name == Novel before joining
DT2[!Name %in% c("Novel")][DT1, on = .(ID = X)]
ID Name
1: MSTRG.199 NA
2: MSTRG.18989 gng7.S
3: MSTRG.8890 NA
4: MSTRG.7767 cdc20
# case 2
# remove Novel in cases of duplicate appearances of ID
DT2[, N := .N, by = ID][!(N > 1L & Name %in% "Novel")][, N := NULL][DT1, on = .(ID = X)]
ID Name
1: MSTRG.199 Novel
2: MSTRG.18989 gng7.S
3: MSTRG.8890 Novel
4: MSTRG.7767 cdc20
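Two points may be worth sketching here. First, N := .N in case 2 adds the helper column to DT2 itself by reference; working on a copy() avoids that. Second, the question asked to bring the names into the first table, which an update join can do. The code below is a sketch of both ideas under the same example data; the object name map is made up for illustration.

```r
library(data.table)

DT1 <- data.table(X = c("MSTRG.199", "MSTRG.18989", "MSTRG.8890", "MSTRG.7767"))
DT2 <- data.table(
  ID   = c("MSTRG.18989", "MSTRG.18989", "MSTRG.8890", "MSTRG.7767", "MSTRG.199"),
  Name = c("gng7.S", "Novel", "Novel", "cdc20", "Novel")
)

# case 2 filter on a copy, so that DT2 does not keep the helper column N
map <- copy(DT2)[, N := .N, by = ID][!(N > 1L & Name == "Novel")][, N := NULL]

# update join: add Name to DT1 by reference, matching DT1$X against map$ID
DT1[map, on = .(X = ID), Name := i.Name]
```

If the IDs themselves should be overwritten, DT1$Name can then be copied over X in a separate step.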

Aggregating in R

I have a data frame with two columns. I want to add an additional two columns to the data set with counts based on aggregates.
df <- structure(list(ID = c(1045937900, 1045937900),
SMS.Type = c("DF1", "WCB14"),
SMS.Date = c("12/02/2015 19:51", "13/02/2015 08:38"),
Reply.Date = c("", "13/02/2015 09:52")
), row.names = 4286:4287, class = "data.frame")
I simply want to count the number of non-missing entries of SMS.Type and Reply.Date. So in the toy example above, this would give 2 for SMS.Type and 1 for Reply.Date.
I then want to add these to the data frame as total counts (I'm aware they will be repeated for every row of the original dataset, but that's OK).
I have been playing around with the aggregate and count functions, but to no avail:
mytempdf <-aggregate(cbind(testtrain$SMS.Type,testtrain$Response.option)~testtrain$ID,
train,
function(x) length(unique(which(!is.na(x)))))
mytempdf <- aggregate(testtrain$Reply.Date~testtrain$ID,
testtrain,
function(x) length(which(!is.na(x))))
Can anyone help?
Thank you for your time
Using data.table you could do the following (I've added a real NA to your original data).
I'm also not sure whether you are really looking for length(unique()) or just length.
library(data.table)
cols <- c("SMS.Type", "Reply.Date")
setDT(df)[, paste0(cols, ".count") :=
lapply(.SD, function(x) length(unique(na.omit(x)))),
.SDcols = cols,
by = ID]
# ID SMS.Type SMS.Date Reply.Date SMS.Type.count Reply.Date.count
# 1: 1045937900 DF1 12/02/2015 19:51 NA 2 1
# 2: 1045937900 WCB14 13/02/2015 08:38 13/02/2015 09:52 2 1
In the devel version (v >= 1.9.5) you could also use the uniqueN function.
Explanation
This is a general solution that works for any number of columns. All you need to do is put the column names into cols.
lapply(.SD, ...) calls a function on each of the columns specified in .SDcols = cols.
paste0(cols, ".count") creates the new column names by appending ".count" to the names in cols.
:= assigns by reference, that is, it adds the new columns in place using the output of lapply(.SD, ...).
The by argument specifies the grouping column(s).
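The uniqueN variant mentioned above could look roughly like this. It is a sketch that assumes a data.table version in which uniqueN() accepts na.rm, and it uses a cut-down copy of the example data with a real NA in Reply.Date.

```r
library(data.table)

df <- data.table(
  ID         = c(1045937900, 1045937900),
  SMS.Type   = c("DF1", "WCB14"),
  Reply.Date = c(NA, "13/02/2015 09:52")
)
cols <- c("SMS.Type", "Reply.Date")

# uniqueN(x, na.rm = TRUE) counts the distinct non-missing values
# of each column within each ID group
df[, paste0(cols, ".count") := lapply(.SD, uniqueN, na.rm = TRUE),
   .SDcols = cols, by = ID]
```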
After converting your empty strings to NAs:
library(dplyr)
mutate(df, SMS.Type.count = sum(!is.na(SMS.Type)),
Reply.Date.count = sum(!is.na(Reply.Date)))
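The conversion of empty strings to NA is not shown in the answer. One way to do it in base R, sketched on a cut-down version of the example data, is logical-matrix indexing:

```r
df <- data.frame(
  SMS.Type   = c("DF1", "WCB14"),
  Reply.Date = c("", "13/02/2015 09:52"),
  stringsAsFactors = FALSE
)

# df == "" is a logical matrix; assigning NA through it replaces every empty string
df[df == ""] <- NA

# the counts then only need !is.na()
n_type  <- sum(!is.na(df$SMS.Type))
n_reply <- sum(!is.na(df$Reply.Date))
```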
