I have a data frame similar to this. I want to sum up values for rows if the text in column "Name" is the same before the - sign.
Remove everything after "-" using sub, and then apply one of the approaches from "How to sum a variable by group":
df$Name <- sub("-.*", "",df$Name)
aggregate(cbind(val1, val2)~Name, df, sum)
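As a minimal sketch of these two steps on made-up data (the sample values are an assumption, borrowed from the data.table answer that follows; trimws is an optional extra that drops the space left in front of the hyphen):

```r
# Hypothetical sample data; the question's real data frame is not shown.
df <- data.frame(
  Name = c("IRON - A", "IRON - B", "SABBATH - A", "SABBATH - B"),
  val1 = c(1, 2, 3, 4),
  val2 = c(5, 6, 7, 8)
)

# Drop the "-" and everything after it, then trim the leftover whitespace.
df$Name <- trimws(sub("-.*", "", df$Name))

# Sum both value columns within each remaining Name.
res <- aggregate(cbind(val1, val2) ~ Name, df, sum)
print(res)
```

Without trimws the groups are still formed correctly here, because every name keeps the same trailing space; trimming only matters if the cleaned names are compared or joined elsewhere.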
Below is a data.table solution.
Data:
library(data.table)
df = data.table(
  Name = c('IRON - A', 'IRON - B', 'SABBATH - A', 'SABBATH - B'),
  val1 = c(1,2,3,4),
  val2 = c(5,6,7,8)
)
Code:
df[, Name := sub("-.*", "", Name)]
mat = df[, .(sum(val1), sum(val2)), by = Name]
> mat
Name V1 V2
1: IRON 3 11
2: SABBATH 7 15
You can rbind your two tables (top and bottom) into one data frame and then use dplyr or data.table. data.table is typically much faster for large tables.
data_frame$Name <- sub("-.*", "", data_frame$Name)
library(dplyr)
data_frame %>%
group_by(Name) %>%
summarise_all(sum)
library(data.table)
# use a separate name so the base function data.frame() is not shadowed
data_table <- as.data.table(data_frame)
data_table[, lapply(.SD, sum, na.rm = TRUE), by = Name]
I have a quite simple question regarding data.table dt, but as always, this package brings me to the brink of despair :D
I have a column called name with contents such as "bin1_position1_ID1", and I want to split this information into separate columns:
name bins positions IDs
-------------------- ------------------------
"bin1_position1_ID1" -> "bin1" "position1" "ID1"
"bin2_position2_ID2" "bin2" "position2" "ID2"
I tried it with
dt <- dt[, bins := lapply(.SD, function(x) strsplit(x, "_")[[1]][1]), .SDcols="name"]
(and for the other new columns with [[1]][2] [[1]][3])
However, I end up with a new column bins (so far, so good), but it holds the value from row 1 in every row (i.e. bin1 everywhere) instead of the value from its own row.
Some entries also contain more pieces of information that I don't want to turn into columns, e.g. one entry is "bin5_position5_ID5_another5_more5".
Code for testing (see Maël's solution):
library(data.table)
name <- c("bin1_position1_ID1",
"bin2_position2_ID2",
"bin3_position3_ID3",
"bin4_position4_ID4",
"bin5_position5_ID5_another5_more5")
dt <- data.table(name)
dt[, c("bin", "position", "ID") := tstrsplit(name, "_", fixed = TRUE, keep = 1:3)]
Use tstrsplit with keep = 1:3 to keep only the first three columns:
dt[, c("bin", "position", "ID") := tstrsplit(name, "_", fixed = TRUE, keep = 1:3)]
name bin position ID
1: bin1_position1_ID1 bin1 position1 ID1
2: bin2_position2_ID2 bin2 position2 ID2
3: bin3_position3_ID3 bin3 position3 ID3
4: bin4_position4_ID4 bin4 position4 ID4
5: bin5_position5_ID5_another5_more5 bin5 position5 ID5
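The attempt in the question returned bin1 in every row because strsplit(x, "_")[[1]] selects the pieces of the first string only, and that single value is then recycled down the column. A base R sketch of the row-wise version, without data.table (the names parts, bins, positions and ids are illustrative only):

```r
name <- c("bin1_position1_ID1",
          "bin2_position2_ID2",
          "bin5_position5_ID5_another5_more5")

# strsplit returns one character vector per input string.
parts <- strsplit(name, "_", fixed = TRUE)

# [[1]][1] looks only at the first string, hence "bin1" for every row.
first_only <- parts[[1]][1]

# Taking the i-th piece of every element gives one value per row;
# pieces beyond the third are simply never extracted.
bins      <- vapply(parts, function(p) p[1], character(1))
positions <- vapply(parts, function(p) p[2], character(1))
ids       <- vapply(parts, function(p) p[3], character(1))

print(data.frame(name, bins, positions, ids))
```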
I have a vector of values and a data frame.
I would like to filter out the rows of the data frame which contain (in specific column) any of the values in my vector.
I'm trying to figure out if a person in the survey has a child who was also questioned in the survey - if so I would like to remove them from my data frame.
I have a list of respondent IDs, and vectors of mother/father personal IDs. If the ID appears in the mother/father column I would like to remove it.
df <- data.frame(ID= c(101,102,103,104,105), Name = (Martin, Sammie, Reg, Seamus, Aine)
vec <- c(103,105,108,120,150)
Output should be a dataframe with three rows - Martin, Sammie, Seamus.
ID Name
1 101 Martin
2 102 Sammie
3 104 Seamus
df[!(df$ID %in% vec), ] # Or subset(df, !(ID %in% vec))
# ID Name
# 1 101 Martin
# 2 102 Sammie
# 4 104 Seamus
Data
df <- data.frame(ID= c(101,102,103,104,105), Name = c("Martin", "Sammie", "Reg", "Seamus", "Aine"))
vec <- c(103,105,108,120,150)
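To see why the negated %in% works, it can help to build the logical mask as a separate step. This is only an illustrative sketch using the corrected example data; the intermediate names in_vec, keep and result are not part of the original answer:

```r
df <- data.frame(ID = c(101, 102, 103, 104, 105),
                 Name = c("Martin", "Sammie", "Reg", "Seamus", "Aine"))
vec <- c(103, 105, 108, 120, 150)

# TRUE where the row's ID appears anywhere in vec.
in_vec <- df$ID %in% vec

# Negate the mask to keep only the rows whose ID is not in vec.
keep <- !in_vec
result <- df[keep, ]
print(result)
```

Values in vec that never occur in df$ID (108, 120, 150) have no effect, since %in% only reports membership for the elements on its left-hand side.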
You can do this with filter from dplyr
library(tidyverse)
df2 <- df%>%
filter(!ID %in% vec)
If you create this as a data.table (and load data.table package, and fix the errors in the example data):
library(data.table)
df <- data.table(ID= c(101,102,103,104,105), Name = c("Martin", "Sammie", "Reg", "Seamus", "Aine"))
vec <- c(103,105,108,120,150)
# solution, slightly different from base R
df[!(ID %in% vec)]
data.table is likely to run a bit quicker than base R, which makes it useful with large datasets. In a microbenchmark on a large dataset comparing base R, tidyverse and data.table, data.table was a bit quicker than tidyverse and a lot faster than base R.
library(tidyverse)
library(data.table)
library(microbenchmark)
n <- 10000000
df <- data.frame("ID" = c(1:n), "Name" = sample(LETTERS, size = n, replace = TRUE))
dt <- data.table(df)
vec <- sample(1:n, size = n/10, replace = FALSE)
microbenchmark(dt[!(ID %in% vec)], df[!(df$ID %in% vec),], df%>% filter(!ID %in% vec))
I have a data.frame with colnames: A01, A02, ..., A25, ..., Z01, ..., Z25 (altogether 26*25). For example:
set.seed(1)
df <- data.frame(matrix(rnorm(26*25),ncol=26*25,nrow=1))
cols <- c(paste("0",1:9,sep=""),10:25)
colnames(df) <- c(sapply(LETTERS,function(l) paste(l,cols,sep="")))
and I want to dcast it to a data.frame of 26x25 (rows will be A-Z and columns 01-25). Any idea what would be the formula for this dcast?
We can use tidyverse
library(tidyverse)
res <- gather(df) %>%
group_by(key = sub("\\D+", "", key)) %>%
mutate(n = row_number()) %>%
spread(key, value) %>%
select(-n)
dim(res)
#[1] 26 25
The column removal doesn't look nice (I'm still learning data.table); someone may be able to tidy that part up.
library(data.table)
# convert to data.table
df <- data.table(df)
# melt all the columns first
test <- melt(df, measure.vars = names(df))
# split the original column name by letter
# paste the numbers together
# then remove the other columns
test[ , c("ch1", "ch2", "ch3") := tstrsplit(variable, "")][ , "ch2" :=
paste(ch2, ch3, sep = "")][ , c("ch3", "variable") := NULL]
# dcast with the letters (ch1) as rows and numbers (ch2) as columns
dcastOut <- dcast(test, ch1 ~ ch2 , value.var = "value")
Then just remove the first column (ch1), which holds the row letters, if it is not needed?
The "formula" you're looking for can come from the patterns argument in the "data.table" implementation of melt. dcast is for going from a "long" form to a "wide" form, while melt is for going from a wide form to a long(er) form. melt() does not use a formula approach.
Essentially, you would need to do something like:
library(data.table)
setDT(df) ## convert to a data.table
cols <- sprintf("%02d", 1:25) ## Easier way for you to make cols in the future
melt(df, measure.vars = patterns(cols), variable.name = "ID")[, ID := LETTERS][]
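For comparison, here is a base R sketch that avoids reshaping functions altogether. It is not the approach of the answer above and rests on two assumptions taken from the question's example: df has a single row, and its columns are ordered letter by letter (A01..A25, B01..B25, and so on):

```r
set.seed(1)
cols <- sprintf("%02d", 1:25)
df <- data.frame(matrix(rnorm(26 * 25), ncol = 26 * 25, nrow = 1))
colnames(df) <- c(sapply(LETTERS, function(l) paste0(l, cols)))

# Flatten the single row and refill it row by row:
# each block of 25 consecutive values becomes one letter's row.
m <- matrix(unlist(df), nrow = 26, byrow = TRUE,
            dimnames = list(LETTERS, cols))

print(dim(m))
```

If the columns were in a different order, this would silently misplace values, which is one reason the name-based melt with patterns is the more robust choice.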
I have the following dataframe:
st <- data.frame(
se = rep(1:2, 5),
X = rnorm(10, 0, 1),
Y = rnorm(10, 0, 2))
st$xy <- paste(st$X,",",st$Y)
st <- st[c("se","xy")]
but I want it to be the following:
1 2 3 4 5
-1.53697673029089 , 2.10652020463275 -1.02183940974772 , 0.623009466458354 1.33614674072657 , 1.5694345481646 0.270466789820086 , -0.75670874554064 -0.280167896821629 , -1.33313822867893
0.26012874418111 , 2.87972571647846 -1.32317949800031 , -2.92675188421021 0.584199000313255 , 0.565499464846637 -0.555881716346136 , -1.14460518414649 -1.0871665543915 , -3.18687136890236
I mean when the value of se is the same, make a column bind.
Do you have any ideas how to accomplish this?
I had no luck with spread (tidyr), and I guess the solution involves sapply, cbind and an if statement. Performance matters because the real data has more than 35,000 rows.
It seems as though your eventual goal is to have a data file which has roughly 35000 columns. Are you sure about that? That doesn't sound very tidy.
To do what you want, you are going to need to have a row identifier. In the below, I've called it caseid, and then removed it once it was no longer required. I then transpose the result to get what you asked for.
library(tidyr)
library(dplyr)
st <- data.frame(
se = rep(1:2, 5),
X = rnorm(10, 0, 1),
Y = rnorm(10, 0, 2))
st$xy <- paste(st$X,",",st$Y)
st <- st[c("se","xy")]
st$caseid = rep(1:(nrow(st)/2), each = 2) # temporary
df = spread(st, se, xy) %>% select(-caseid) %>% t()
print(df)
If we need to split the 'xy' column elements into individual units, cSplit from splitstackshape can be used. Then rbind the alternating rows of 'st1' after unlisting.
library(splitstackshape)
st1 <- cSplit(st, 'xy', ', ', 'wide')
rbind(unlist(st1[c(TRUE,FALSE)][,-1, with=FALSE]),
unlist(st1[c(FALSE, TRUE)][,-1, with=FALSE]))
If we don't need to split the 'xy' column into individual elements, we can use dcast from data.table. It should be fast enough. Convert the 'data.frame' to 'data.table' (setDT(st)), create a sequence column ('N') by 'se', and then dcast from 'long' to 'wide'.
library(data.table)
dcast(setDT(st)[, N:= 1:.N, se], se~N, value.var= 'xy')
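The same long-to-wide idea can be sketched in base R. This is only an illustration with made-up placeholder strings instead of the random coordinates, and it assumes every value of se occurs the same number of times (as in the question), so the pieces stack into a rectangular matrix:

```r
# Placeholder data: "p1".."p10" stand in for the pasted "X , Y" strings.
st <- data.frame(se = rep(1:2, 5), xy = paste0("p", 1:10))

# Split xy by se (order within each group is preserved),
# then bind the groups as rows: one row per se value.
wide <- do.call(rbind, split(st$xy, st$se))

print(wide)
```

With unequal group sizes rbind would recycle values with a warning, in which case the dcast approach with an explicit sequence column is the safer route.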
I have a data frame, df2, containing observations grouped by an ID factor that I would like to subset. I have used another function to identify which row within each factor group I want to select. This is shown below in df:
df <- data.frame(ID = c("A","B","C"),
pos = c(1,3,2))
df2 <- data.frame(ID = c(rep("A",5), rep("B",5), rep("C",5)),
obs = c(1:15))
In df, pos corresponds to the index of the row that I want to select within the factor level given in ID, not within the whole data frame df2. I'm looking for a way to select the row for each ID according to that index (i.e. its row number within the corresponding factor level of df2).
So, in this example, I want to select the first value in df2 with ID == 'A', the third value in df2 with ID == 'B' and the second value in df2 with ID == 'C'.
This would then give me:
df3 <- data.frame(ID = c("A", "B", "C"),
obs = c(1, 8, 12))
dplyr
library(dplyr)
merge(df,df2) %>%
group_by(ID) %>%
filter(row_number() == pos) %>%
select(-pos)
# ID obs
# 1 A 1
# 2 B 8
# 3 C 12
base R
df2m <- merge(df,df2)
do.call(rbind,
by(df2m, df2m$ID, function(SD) SD[SD$pos[1], setdiff(names(SD),"pos")])
)
by splits the merged data frame df2m by df2m$ID and operates on each part; it returns the results in a list, so they must be rbinded together at the end. Each subset of the data (one per value of ID) is reduced to the row given by pos, and the "pos" column is dropped using normal data.frame syntax.
data.table suggested by @DavidArenburg in a comment
library(data.table)
setkey(setDT(df2),"ID")[df][,
.SD[pos[1L], !"pos", with=FALSE]
, by = ID]
The first part -- setkey(setDT(df2),"ID")[df] -- is the merge. After that, the resulting table is split by = ID, and each Subset of Data, .SD is operated on. pos[1L] is subsetting in the normal way, while !"pos", with=FALSE corresponds to dropping the pos column.
See @eddi's answer for a better data.table approach.
Here's the base R solution:
df2$pos <- ave(df2$obs, df2$ID, FUN=seq_along)
merge(df, df2)
ID pos obs
1 A 1 1
2 B 3 8
3 C 2 12
If df2 is sorted by ID, you can just do df2$pos <- sequence(table(df2$ID)) for the first line.
Using data.table version 1.9.5+:
setDT(df2)[df, .SD[pos], by = .EACHI, on = 'ID']
This merges on the ID column and then, for each row of df, selects row number pos within the matching group.
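The "pick row pos within each ID" logic can also be written out step by step in base R, which may make the data.table one-liner easier to follow. This is an illustrative sketch only; groups and picked are made-up names:

```r
df  <- data.frame(ID = c("A", "B", "C"), pos = c(1, 3, 2))
df2 <- data.frame(ID = rep(c("A", "B", "C"), each = 5), obs = 1:15)

# One vector of observations per ID, in their original order.
groups <- split(df2$obs, df2$ID)

# For each row of df, take element number pos from the matching group.
picked <- mapply(function(g, p) g[p],
                 groups[as.character(df$ID)], df$pos)

df3 <- data.frame(ID = as.character(df$ID), obs = unname(picked))
print(df3)
```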