In data.table in R, how can we create a sequenced indicator variable by the values of two columns? [duplicate]

This question already has answers here:
data.table "key indices" or "group counter"
(2 answers)
Create a new data frame column based on the values of two other columns
(2 answers)
Closed 4 years ago.
In the data.table package in R, for a given data table, I am wondering how an indicator index can be created for the values that are the same in two columns. For example, for the following data table,
> M <- data.table(matrix(c(2,2,2,2,2,2,2,5,2,5,3,3,3,6), ncol = 2, byrow = TRUE))
> M
V1 V2
1: 2 2
2: 2 2
3: 2 2
4: 2 5
5: 2 5
6: 3 3
7: 3 6
I would like to create a new column that numbers each distinct combination of values in the two columns, so that rows with the same pair of values share the same index:
> M
V1 V2 Index
1: 2 2 1
2: 2 2 1
3: 2 2 1
4: 2 5 2
5: 2 5 2
6: 3 3 3
7: 3 6 4
In effect, I want a per-group value repeated across the rows of each group, much like .N. Is there a nice way to do it?

We can use .GRP after grouping by 'V1' and 'V2':
M[, Index := .GRP, .(V1, V2)]
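As a minimal sketch of the same idea on the question's data: .GRP is a group counter, and groups are numbered in order of first appearance, so the result matches the desired Index column only because the rows are already arranged by (V1, V2).

```r
library(data.table)

# Rebuild the example table from the question
M <- data.table(matrix(c(2,2,2,2,2,2,2,5,2,5,3,3,3,6), ncol = 2, byrow = TRUE))

# .GRP holds the running group number; := adds it by reference
M[, Index := .GRP, by = .(V1, V2)]

print(M)
```

If the rows were not already arranged by V1 and V2, the indices would follow the order in which each pair first appears rather than the sorted order.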


Filling cell data with mean for each unique name [duplicate]

This question already has answers here:
replace NA with groups mean in a non specified number of columns [duplicate]
(2 answers)
Closed 3 years ago.
I have been using R for the past couple of days and I have a question that I am a little stumped on. I have a dataframe with bidder names and bids, where some of the bids are empty. I am having trouble implementing a dynamic way to take the average bid for each unique bidder and apply it to the empty cells. The line of code below takes the mean bid for each unique bidder. All I need to do is place the mean value from unique_bid in the empty cells that share the same bidder.
unique_bid <- aggregate(bid ~ bidder, auction[complete.cases(auction),], mean)
Here is a picture of what the dataframe looks like.
You could use ave.
Example:
df = data.frame(a = c(1,1,1,2,2,2), b=c(1,2,NA,4,5,NA),c= c(1,2,3,4,5,6))
> df
a b c
1 1 1 1
2 1 2 2
3 1 NA 3
4 2 4 4
5 2 5 5
6 2 NA 6
Do:
sel = is.na(df$b)
df$b[sel] = ave(df$b, df$a, FUN = function(x){mean(x, na.rm = TRUE)})[sel]
ave applies the function FUN to df$b within each group defined by df$a and returns a vector of the same length as df$b. The logical vector sel picks out the NA elements of df$b, which are replaced by the corresponding group means.
Result:
> df
a b c
1 1 1.0 1
2 1 2.0 2
3 1 1.5 3
4 2 4.0 4
5 2 5.0 5
6 2 4.5 6
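The same pattern can be sketched with the column names from the question. The auction data frame below is made up for illustration (the original data was only shown as a picture), so only the bidder and bid column names come from the question.

```r
# Hypothetical stand-in for the question's data: two bidders, one missing bid each
auction <- data.frame(
  bidder = c("x", "x", "x", "y", "y", "y"),
  bid    = c(1, 2, NA, 4, 5, NA)
)

# Per-bidder mean, ignoring NAs, returned at full length (one value per row)
group_mean <- ave(auction$bid, auction$bidder,
                  FUN = function(v) mean(v, na.rm = TRUE))

# Fill only the missing bids with their bidder's mean
missing <- is.na(auction$bid)
auction$bid[missing] <- group_mean[missing]

print(auction)
```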

Select column dynamically based on value from another column in R [duplicate]

This question already has answers here:
Select values from different columns based on a variable containing column names [duplicate]
(2 answers)
Closed 4 years ago.
How do I use a column in a data.table as a variable name to fetch values from the other columns it names?
library(data.table)
a = c(2,3,5)
b = c(5,7,7)
c = c(1,2,3)
x = c('a','b','c')
dt <- data.table(a,b,c,x)
> dt
a b c x
1: 2 5 1 a
2: 3 7 2 b
3: 5 7 3 c
The output I want is a column y whose value in each row is taken from the column named in x:
dt
a b c x y
1: 2 5 1 a 2
2: 3 7 2 b 7
3: 5 7 3 c 3
I tried
dt[,get(x)]
dt[,match(x,colnames(dt))]
Loop over the rows by grouping on seq_len(.N), extract the value with get, and assign the result to create 'y':
dt[, y := .SD[, get(x), seq_len(.N)]$V1]
dt
# a b c x y
#1: 2 5 1 a 2
#2: 3 7 2 b 7
#3: 5 7 3 c 3
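An alternative that avoids the per-row grouping is matrix indexing with a two-column (row, column) index. This is a sketch of a different technique from the answer above, and it assumes the candidate columns are all numeric; cols, m and idx are names introduced here for illustration.

```r
library(data.table)

dt <- data.table(a = c(2, 3, 5), b = c(5, 7, 7), c = c(1, 2, 3),
                 x = c("a", "b", "c"))

# Columns that x may refer to (assumed to share one type)
cols <- c("a", "b", "c")
m <- as.matrix(dt[, cols, with = FALSE])

# One (row, column) pair per row: row i, column named by x[i]
idx <- cbind(seq_len(nrow(dt)), match(dt$x, cols))

# Indexing a matrix by a two-column matrix picks one cell per row
dt[, y := m[idx]]

print(dt)
```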

join/merge data frames in R [duplicate]

This question already has answers here:
Merge dataframes of different sizes
(4 answers)
Left join using data.table
(3 answers)
Closed 5 years ago.
I would like to join similar data frames:
input:
x <- data_frame(a=c(1,2,3,4),b=c(4,5,6,7),c=c(1,NA,NA,NA))
y <- data_frame(a=c(2,3),b=c(5,6),c=c(1,2))
desired output:
z <- data_frame(a=c(1,2,3,4),b=c(4,5,6,7),c=c(1,1,2,NA))
I tried
x <- data_frame(a=c(1,2,3,4),b=c(4,5,6,7),c=c(1,NA,NA,NA))
y <- data_frame(a=c(2,3),b=c(5,6),c=c(1,2))
z <- merge(x,y, all=TRUE)
but it has one inconvenience:
a b c
1 1 4 1
2 2 5 1
3 2 5 NA
4 3 6 2
5 3 6 NA
6 4 7 NA
It doubles rows where there are similarities. Is there a way to get desired output without deleting unwanted rows?
EDIT
I cannot delete rows with NA: the x data frame contains rows with NA that are not in the y data frame. If I did that, I would delete the 4th row of x (4 7 NA).
Thanks for help
You can use an update join with the data.table package:
# load the package and convert the dataframes to data.tables
library(data.table)
setDT(x)
setDT(y)
# update join
x[y, on = .(a, b), c := i.c][]
which gives:
a b c
1: 1 4 1
2: 2 5 1
3: 3 6 2
4: 4 7 NA
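For a self-contained run of the update join, the sketch below builds x and y directly as data.tables instead of converting tibbles. Note that the join overwrites c in every row of x that matches y on (a, b), whatever value x held there; rows without a match, such as (4, 7, NA), are left untouched.

```r
library(data.table)

x <- data.table(a = c(1, 2, 3, 4), b = c(4, 5, 6, 7), c = c(1, NA, NA, NA))
y <- data.table(a = c(2, 3), b = c(5, 6), c = c(1, 2))

# For rows of x matching y on (a, b), set x's c to y's c (i.c), by reference
x[y, on = .(a, b), c := i.c]

print(x)
```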

R converting from short form to long form with counts in the short form [duplicate]

This question already has answers here:
Repeat each row of data.frame the number of times specified in a column
(10 answers)
Reshaping data.frame from wide to long format
(8 answers)
Closed 4 years ago.
I have a large table (~100M rows and 28 columns) in the format below:
ID A B C
1 2 0 1
2 0 1 0
3 0 1 2
4 1 0 0
The columns besides ID (which is unique) give the counts for each type (i.e. A, B, C). I would like to convert this to the long form below.
ID Type
1 A
1 A
1 C
2 B
3 B
3 C
3 C
4 A
I would also like to use a data.table (rather than a data frame) given the size of my data set. I looked at the reshape2 package for converting between long and short form, but it is not clear to me whether the melt function allows counts in the short form as above.
Any suggestions on how I can convert this in a fast and efficient way in R using reshape2 and/or data.table?
Update
You can try the following:
DT[, rep(names(.SD), .SD), by = ID]
# ID V1
# 1: 1 A
# 2: 1 A
# 3: 1 C
# 4: 2 B
# 5: 3 B
# 6: 3 C
# 7: 3 C
# 8: 4 A
Keeps the order you want too...
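A self-contained sketch of this approach on the question's four-row example follows. It names the output column Type and wraps .SD in unlist() so that rep() receives a plain vector of counts; both are small assumptions added here rather than part of the one-liner above.

```r
library(data.table)

DT <- data.table(ID = 1:4,
                 A = c(2, 0, 0, 1),
                 B = c(0, 1, 1, 0),
                 C = c(1, 0, 2, 0))

# Each ID is a one-row group; repeat each column name by its count
long <- DT[, .(Type = rep(names(.SD), unlist(.SD))), by = ID]

print(long)
```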
You can try the following. I've never used expandRows on what would become ~ 300 million rows, but it's basically rep, so it shouldn't be slow.
This uses melt + expandRows from my "splitstackshape" package. It works with data.frames or data.tables, so you might as well use data.table for the faster melting....
library(reshape2)
library(splitstackshape)
expandRows(melt(mydf, id.vars = "ID"), "value")
# The following rows have been dropped from the input:
#
# 2, 3, 5, 8, 10, 12
#
# ID variable
# 1 1 A
# 1.1 1 A
# 4 4 A
# 6 2 B
# 7 3 B
# 9 1 C
# 11 3 C
# 11.1 3 C
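The melt-then-expand idea can also be sketched with data.table alone, replacing expandRows with a plain rep() over row numbers. This is an illustrative variant rather than the package's implementation, and the names m, n and long are introduced here.

```r
library(data.table)

DT <- data.table(ID = 1:4,
                 A = c(2, 0, 0, 1),
                 B = c(0, 1, 1, 0),
                 C = c(1, 0, 2, 0))

# Long form with one row per (ID, Type) pair and its count n
m <- melt(DT, id.vars = "ID", variable.name = "Type", value.name = "n")

# Repeat each row n times (rows with n == 0 drop out), then sort by ID
long <- m[rep(seq_len(nrow(m)), m$n), .(ID, Type)][order(ID)]

print(long)
```

Here Type comes back as a factor, which is what melt produces for the variable column by default.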

split a dataframe with numbers separated by the add sign '+' into new rows [duplicate]

This question already has answers here:
Split comma-separated strings in a column into separate rows
(6 answers)
Closed 6 years ago.
Sorry for the naive question but I have a dataframe like this:
n sp cap
1 1 a 3
2 2 b 3+2+4
3 3 c 2
4 4 d 1+5
I need to split the numbers separated by the add sign ("+") into new rows in order to get a new dataframe like the one below:
n sp cap
1 1 a 3
2 2 b 3
3 2 b 2
4 2 b 4
5 3 c 2
6 4 d 1
7 4 d 5
How can I do that? strsplit?
thanks in advance
We could use cSplit from splitstackshape:
library(splitstackshape)
cSplit(df1, 'cap', sep="+", 'long')
# n sp cap
#1: 1 a 3
#2: 2 b 3
#3: 2 b 2
#4: 2 b 4
#5: 3 c 2
#6: 4 d 1
#7: 4 d 5
Or we could do this in base R. Use strsplit to split the "cap" column into substrings, which returns a list (lst). Then replicate the rows of the dataset by the length of each list element, convert the list elements to numeric, unlist them, and cbind the result with the expanded dataset.
lst <- strsplit(as.character(df1$cap), "[+]")
df2 <- cbind(df1[rep(1:nrow(df1), sapply(lst, length)),1:2],
cap= unlist(lapply(lst, as.numeric)))
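To try the base R route end to end, here is a sketch that first rebuilds df1 from the question (with cap stored as character, which is an assumption about the original data) and then expands it. It uses lengths() and fixed = TRUE as small conveniences and resets the row names.

```r
# Rebuild the question's data frame; cap is assumed to be character
df1 <- data.frame(n = 1:4,
                  sp = c("a", "b", "c", "d"),
                  cap = c("3", "3+2+4", "2", "1+5"),
                  stringsAsFactors = FALSE)

# Split each cap value on a literal "+"
lst <- strsplit(df1$cap, "+", fixed = TRUE)

# Repeat each row once per piece, then attach the pieces as numbers
df2 <- data.frame(df1[rep(seq_len(nrow(df1)), lengths(lst)), c("n", "sp")],
                  cap = as.numeric(unlist(lst)),
                  row.names = NULL,
                  stringsAsFactors = FALSE)

print(df2)
```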
