I have read data from multiple xlsx sheets with the xlsx package in R. Currently my data frame looks like this:
firstcol SecondCol
A abcd
B bds
A <NA>
A asd
C <NA>
B adfdf
? <NA>
C adfd
From the above data, I want to get the following output:
Firstcol FirstcolCount SecondCol
A 3 times 2 # we don't count NAs
B 2 times 2
C 2 times 1
other 1 times 0
Is there any direct method that can do this? It would be nice to have some suggestions.
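For reference, a minimal construction of the example data (a sketch, assuming the ? is a literal string and the missing values are genuine NA):
# sample data as shown in the question
df <- data.frame(
  firstcol = c("A", "B", "A", "A", "C", "B", "?", "C"),
  SecondCol = c("abcd", "bds", NA, "asd", NA, "adfdf", NA, "adfd"),
  stringsAsFactors = FALSE
)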
A data.table approach:
#load library
require(data.table)
# convert data.frame to data.table
setDT(df)
# make a new data.table with two columns: the first holds the row count for
# each level of firstcol; the second holds that count minus the number of NA cases in SecondCol
df[, .(FirstcolCount = .N,
       SecondCol = .N - sum(is.na(SecondCol))),
   by = firstcol]
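With the sample data above, this should return something like:
# firstcol FirstcolCount SecondCol
#1: A 3 2
#2: B 2 2
#3: C 2 1
#4: ? 1 0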
Though it's not quite clear what exactly you mean. Something like this?
library(dplyr)
df %>%
  group_by(firstcol) %>%
  summarise(FirstcolCount = n(), SecondCol = n() - sum(SecondCol == "<NA>"))
(If the missing values are genuine NA rather than the literal string "<NA>", use sum(is.na(SecondCol)) instead.)
Source: local data frame [4 x 3]
firstcol FirstcolCount SecondCol
1 ? 1 0
2 A 3 2
3 B 2 2
4 C 2 1
Related
I have the following dataset
#df
Factors Transactions
a,c 1
b 0
c 0
d,a 0
a 1
a 0
b 1
I'd like to know how many times we did not have a factor and we had a transaction. So, my desired output is as follows:
#desired output
Factors count
a 1
b 2
c 2
d 3
For instance, there is only one time we didn't have a and still had a transaction (i.e. only in the last row).
There are many ways to count how many times we had each factor together with a transaction. For instance, I tried this one:
library(data.table)
setDT(df)[, .(Factors = unlist(strsplit(as.character(Factors), ","))),
by = Transactions][,.(Transactions = sum(Transactions > 0)), by = Factors]
But I want to count how many times we didn't have a factor and we had a transaction.
Thanks in advance.
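For reference, a sketch of the example data (assuming Factors is stored as a comma-separated character column):
df <- data.frame(
  Factors = c("a,c", "b", "c", "d,a", "a", "a", "b"),
  Transactions = c(1, 0, 0, 0, 1, 0, 1)
)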
You can calculate the opposite, i.e., how many times each factor has a transaction; the difference between the total number of transactions and the transactions for each individual factor is then what you are looking for:
library(data.table)
total <- sum(df$Transactions > 0)
(setDT(df)[, .(Factors = unlist(strsplit(as.character(Factors), ","))), Transactions]
[, total - sum(Transactions > 0), Factors])
# Factors V1
#1: a 1
#2: c 2
#3: b 2
#4: d 3
We can also do this with cSplit
library(splitstackshape)
cSplit(df, "Factors", ',', 'long')[, sum(df$Transactions) - sum(Transactions>0), Factors]
# Factors V1
#1: a 1
#2: c 2
#3: b 2
#4: d 3
Or with dplyr/tidyr
library(dplyr)
library(tidyr)
separate_rows(df, Factors) %>%
  group_by(Factors) %>%
  summarise(count = sum(df$Transactions > 0) - sum(Transactions > 0))
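With the sample data, each of these variants should reproduce the desired counts (a: 1, b: 2, c: 2, d: 3).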
My data is really big and looping through the data.table to do what I want is too slow, so I am trying to avoid looping. Let's assume I have a data.table as follows:
a <- data.table(i = c(1,2,3), j = c(2,2,6), k = list(c("a","b"),c("a","c"),c("b")))
> a
i j k
1: 1 2 a,b
2: 2 2 a,c
3: 3 6 b
And I want to group based on the values in k. So something like this:
a[, sum(j), by = k]
Right now I am getting the following error:
Error in `[.data.table`(a, , sum(j), by = k) :
The items in the 'by' or 'keyby' list are length (2,2,1). Each must be same length as rows in x or number of rows returned by i (3).
The answer I am looking for is to group first all the rows having "a" in column k and calculate sum(j) and then all rows having "b" and so on. So the desired answer would be:
k V1
a 4
b 8
c 2
Any hint how to do it efficiently? I can't melt column k by repeating the rows, since the resulting data.table would be too big for my case.
I think this might work:
a[, .(k = unlist(k)), by = .(i, j)][, sum(j), by = k]
k V1
1: a 4
2: b 8
3: c 2
If we are using tidyr, a compact option would be
library(tidyr)
unnest(a, k)[, sum(j), k]
# k V1
#1: a 4
#2: b 8
#3: c 2
Or using the dplyr/tidyr pipes
unnest(a, k) %>%
group_by(k) %>%
summarise(V1 = sum(j))
# k V1
# <chr> <dbl>
#1 a 4
#2 b 8
#3 c 2
Since by-group operations can be slow, I'd consider...
dat = a[rep(1:.N, lengths(k)), c(.SD, .(k = unlist(a$k))), .SDcols=setdiff(names(a), "k")]
i j k
1: 1 2 a
2: 1 2 b
3: 2 2 a
4: 2 2 c
5: 3 6 b
We're repeating the rows of columns i:j to match the unlisted k. The data should probably be kept in this long format instead of using a list column. From there, as in #MikeyMike's answer, we can do dat[, sum(j), by=k].
In data.table 1.9.7+, we can similarly do
dat = a[, c(.SD[rep(.I, lengths(k))], .(k = unlist(k))), .SDcols=i:j]
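With either construction, grouping the flattened table should reproduce the earlier result:
dat[, sum(j), by = k]
# k V1
#1: a 4
#2: b 8
#3: c 2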
I know this may be a simple question but I can't seem to get it right.
I have two data tables, old_dt and new_dt. Both data tables have two similar columns. My goal is to get the rows from new_dt that are not in old_dt.
Here is an example: old_dt
v1 v2
1 a
2 b
3 c
4 d
Here is new_dt
v1 v2
3 c
4 d
5 e
What I want is to get just the 5 e row.
Using setdiff didn't work because my real data is more than 3 million rows. Using subset like this
sub.cti <- subset(new_dt, old_dt$v1 != new_dt$v1 & old_dt$v2 != new_dt$v2)
only resulted in new_dt itself. Using
sub.cti <- new_dt[, .(!old_dt$v1, !old_dt$v2)]
resulted in multiple rows of FALSEs.
Can somebody help me?
Thank you in advance
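For reference, a sketch of the example data as data.tables:
library(data.table)
old_dt <- data.table(v1 = 1:4, v2 = c("a", "b", "c", "d"))
new_dt <- data.table(v1 = 3:5, v2 = c("c", "d", "e"))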
We can do a join (data from #giraffehere's post)
df2[!df1, on = "a"]
# a b
#1: 6 14
#2: 7 15
To get rows in 'df1' that are not in 'df2' based on the 'a' column
df1[!df2, on = "a"]
# a b
#1: 4 9
#2: 5 10
In the OP's example we need to join on both columns
new_dt[!old_dt, on = c("v1", "v2")]
# v1 v2
#1: 5 e
NOTE: Here I assumed that 'new_dt' and 'old_dt' are data.tables.
Of course, dplyr is a good package. For this problem, a shorter anti_join can be used:
library(dplyr)
anti_join(new_dt, old_dt)
# v1 v2
# (int) (chr)
#1 5 e
or the setdiff from dplyr, which works on data.frame, data.table, tbl_df, etc.:
setdiff(new_dt, old_dt)
# v1 v2
#1: 5 e
However, the question is tagged as data.table.
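Since the question is tagged data.table: if you are on data.table 1.9.8 or later, fsetdiff does this natively across all columns and should give the same one-row result:
fsetdiff(new_dt, old_dt)
# v1 v2
#1: 5 e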
dplyr helps a lot when you deal with tabular data in R - I'd recommend learning more about dplyr here
library(dplyr)
library(magrittr) # this is just for shorter code with %<>%
# Create a sequence key that combines v1 & v2
old_dt %<>%
  mutate(sequence = paste0(v1, v2))
new_dt %<>%
  mutate(sequence = paste0(v1, v2))
# Filter new_dt to sequences that do not exist in old_dt
result <- new_dt %>%
  filter(!(sequence %in% old_dt$sequence)) %>%
  select(v1:v2)
v1 v2
5 e
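One caveat: paste0(v1, v2) can produce the same key for different rows (e.g. paste0(1, "23") and paste0(12, "3") both give "123"), so adding a separator is safer:
mutate(sequence = paste(v1, v2, sep = "_"))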
EDIT: I noticed the OP wanted to match on both columns, not just one. I'll keep the data initialization part of the solution here as it is referenced above by #akrun. However, use the top solution #akrun posted; it is more the "data.table way".
df1 <- data.table(a = 1:5, b = 6:10)
df2 <- data.table(a = c(1, 2, 3, 6, 7), b = 11:15)
head(df1)
a b
1: 1 6
2: 2 7
3: 3 8
4: 4 9
5: 5 10
head(df2)
a b
1: 1 11
2: 2 12
3: 3 13
4: 6 14
5: 7 15
If column a has repeats, you could try this base R hack:
id.var1 <- paste(df1$a, df1$b,sep="_")
id.var2 <- paste(df2$a, df2$b,sep="_")
dfKeep <- df2[!(id.var2 %in% id.var1), ]
I have an R data frame and I want to make another one from it, but only with the values that appear more than X times in a given column.
>DataFrame
Value Column
1 a
4 a
2 b
6 c
3 c
4 c
9 a
1 d
For example, I want a new data frame with only the values in Column that appear more than 2 times, to get something like this:
>NewDataFrame
Value Column
1 a
4 a
6 c
3 c
4 c
9 a
Thank you very much for your time.
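For reference, a sketch of the example data:
DataFrame <- data.frame(
  Value = c(1, 4, 2, 6, 3, 4, 9, 1),
  Column = c("a", "a", "b", "c", "c", "c", "a", "d"),
  stringsAsFactors = FALSE
)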
We can use table to get the count of each value in 'Column' and subset the dataset based on the names in 'tbl' that have a count greater than 'n':
n <- 2
tbl <- table(DataFrame$Column) > n
NewDataFrame <- subset(DataFrame, Column %in% names(tbl)[tbl])
# Value Column
#1 1 a
#2 4 a
#4 6 c
#5 3 c
#6 4 c
#7 9 a
Or using ave from base R
NewDataFrame <- DataFrame[with(DataFrame, ave(seq_along(Column), Column, FUN = length) > n), ]
Or using data.table
library(data.table)
NewDataFrame <- setDT(DataFrame)[, .SD[.N>n] , by = Column]
Or
NewDataFrame <- setDT(DataFrame)[, if(.N > n) .SD, by = Column]
Or dplyr
NewDataFrame <- DataFrame %>%
  group_by(Column) %>%
  filter(n() > 2)
Below is a subset of my data:
> head(dt)
name start end
1: 1 3195984 3197398
2: 1 3203519 3205713
3: 2 3204562 3207049
4: 2 3411782 3411982
5: 2 3660632 3661579
6: 3 3638391 3640590
dt <- data.frame(name = c(1, 1, 2, 2, 2, 3),
                 start = c(3195984, 3203519, 3204562, 3411782, 3660632, 3638391),
                 end = c(3197398, 3205713, 3207049, 3411982, 3661579, 3640590))
I want to calculate another value: the difference between the end coordinate of line n and the start coordinate of line n+1, but only if both lines share a name. To elaborate, this is what I want the resulting data frame to look like:
name start end dist
1: 1 3195984 3197398
2: 1 3203519 3205713 -6121
3: 2 3204562 3207049
4: 2 3411782 3411982 -204733
5: 2 3660632 3661579 -248650
6: 3 3638391 3640590
The reason I want to do this is that I'm looking for dist values that are positive. One way I've tried is to offset the start and end coordinates, but then I run into the problem of comparing things with different names.
How does one do this in R?
A data.table solution may be good here:
library(data.table)
dt <- as.data.table(dt)
dt[, dist := c(NA, end[-length(end)] - start[-1]), by = name]
dt
# name start end dist
#1: 1 3195984 3197398 NA
#2: 1 3203519 3205713 -6121
#3: 2 3204562 3207049 NA
#4: 2 3411782 3411982 -204733
#5: 2 3660632 3661579 -248650
#6: 3 3638391 3640590 NA
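In data.table 1.9.6+, the same lag can be written more directly with shift() (an equivalent sketch, assuming a recent version):
dt[, dist := shift(end) - start, by = name]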
Assuming your data is sorted, you can also do it with base R functions:
dt$dist <- unlist(
  by(dt, dt$name, function(x) c(NA, x$end[-length(x$end)] - x$start[-1]))
)
Using dplyr (with credit to #thelatemail for the calculation of dist):
library(dplyr)
dat.new <- dt %>%
  group_by(name) %>%
  mutate(dist = c(NA, end[-length(end)] - start[-1]))
Here is a different dplyr solution:
dt %>% group_by(name) %>% mutate(dist = lag(end) - start)
giving:
Source: local data frame [6 x 4]
Groups: name
name start end dist
1 1 3195984 3197398 NA
2 1 3203519 3205713 -6121
3 2 3204562 3207049 NA
4 2 3411782 3411982 -204733
5 2 3660632 3661579 -248650
6 3 3638391 3640590 NA