I am using data.table to do a one-to-many merge. Instead of matching with all the rows, the output is showing only the last matched row for each unique value of the key.
a <- data.table(x = 1:2L, y = letters[1:4])
b <- data.table(x = c(1L,3L))
setkey(a,x)
setkey(b,x)
I want to do a many to one (b to a) join based on column x.
c <- a[b,on=.(x)]
c
# x y
# 1: 1 a
# 2: 1 c
# 3: 3 NA
However, this approach creates a new data.table called c. To avoid making a new data.table, I use the following code to add the column y to b.
b[a,y:=i.y]
Now b looks like,
b
# x y
# 1: 1 c
# 2: 3 NA
The desired output is the one in the first method (c). Is there a way of using := and output all the rows instead of the last matched row alone?
PS: The reason I want to use method 2 using := is because my data is huge and I do not want to make copies. The example I showed reflects what happens in my data.
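A minimal sketch of why method 2 cannot return three rows, assuming the same a and b as above (with x written out in full rather than recycled). := updates b by reference and never changes its number of rows, so when several rows of a match one row of b, only the last match survives. Getting every match means building a new table, as in method 1.

```r
library(data.table)

a <- data.table(x = c(1L, 2L, 1L, 2L), y = letters[1:4])
b <- data.table(x = c(1L, 3L))

# Update join: b is modified in place and keeps its 2 rows.
# Two rows of a match x == 1; the later one ("c") overwrites the earlier one.
b[a, on = .(x), y := i.y]

# Ordinary join: a new table with one row per match (3 rows here).
res <- a[b, on = .(x)]
res
```

If memory is the concern, it may help to select only the needed columns of a inside the join, so the new table is no wider than necessary.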
I have a data frame like this:
x=data.frame(type = c('a','b','c','a','b','a','b','c'),
value=c(5,2,3,2,10,6,7,8))
Every item has attributes a, b and c, but some items may be missing records, i.e. have only a and b.
The desired output is
y=data.frame(item=c(1,2,3), a=c(5,2,6), b=c(2,10,7), c=c(3,NA,8))
How can I transform x to y? Thanks
We can use dcast
library(data.table)
out <- dcast(setDT(x), rowid(type) ~ type, value.var = 'value')
setnames(out, 'type', 'item')
out
# item a b c
#1: 1 5 2 3
#2: 2 2 10 8
#3: 3 6 7 NA
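Note that rowid(type) counts the occurrences of each type separately, so the c that is missing from the second item pulls the last c up a row. A possible variation, assuming every item starts with an a record, is to build the item id with cumsum before casting. The column name item is taken from the desired output; the rest is only a sketch.

```r
library(data.table)

x <- data.frame(type = c('a','b','c','a','b','a','b','c'),
                value = c(5,2,3,2,10,6,7,8))
dt <- as.data.table(x)

# Assumption: each 'a' marks the start of a new item
dt[, item := cumsum(type == "a")]

# One row per item, one column per type; missing combinations become NA
out <- dcast(dt, item ~ type, value.var = "value")
out
```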
Create a grouping vector g assuming each occurrence of a starts a new group, use tapply to create a table tab and coerce that to a data frame. No packages are used.
g <- cumsum(x$type == "a")
tab <- with(x, tapply(value, list(g, type), c))
as.data.frame(tab)
giving:
a b c
1 5 2 3
2 2 10 NA
3 6 7 8
An alternative definition of the grouping vector, slightly more complex but needed if some groups are missing type a, is the following. It assumes that x lists the type values in order of their levels within each group, so that a level lower than the prior level must start a new group. type is wrapped in factor() because data.frame() no longer converts strings to factors by default as of R 4.0.0.
g <- cumsum(c(-1, diff(as.numeric(factor(x$type)))) < 0)
Note that ultimately there must be some restriction on missingness; otherwise, the problem is ambiguous. For example, if one group can have b and c missing and the next group can have a missing, then it cannot be determined whether the b and c that follow form a second group or belong to the first.
In R, I have a data table and a character vector with a subset of the data table's column names. I need to compute the z-scores (i.e. number of standard deviations from the mean) of each column with a specified name, and put the averages of the z-scores in a new column. I found a solution with explicit for-loops (posted below), but this must be a common enough task that some library function could be made to do the work more elegantly. Is there a better way?
Here's my solution:
#!/usr/bin/env Rscript
library(data.table)
# Sample data table.
dt <- data.table(a=1:3, b=c(5, 6, 3), c=2:4)
# List of column names.
cols <- c('a', 'b')
# Convert columns to z-scores, and add each to a new list of vectors.
zscores <- list()
for (colIx in 1:length(cols)) {
  zscores[[colIx]] <- scale(dt[,get(cols[colIx])], center=TRUE, scale=TRUE)
}
# Average corresponding entries of each vector of z-scores.
avg <- numeric(nrow(dt))
for (rowIx in 1:nrow(dt)) {
  avg[rowIx] <- mean(sapply(1:length(cols),
                            function(colIx) {zscores[[colIx]][rowIx]}))
}
# Add new vector to the table, and print out the new table.
dt[,d:=avg]
print(dt)
This gives what you might expect.
a b c d
1: 1 5 2 -0.39089105
2: 2 6 3 0.43643578
3: 3 3 4 -0.04554473
scale can be applied to a matrix-like object, so you can get the desired output with
> set(dt, NULL, 'd', rowMeans(scale(dt[, cols, with = FALSE])))
> dt
a b c d
1: 1 5 2 -0.39089105
2: 2 6 3 0.43643578
3: 3 3 4 -0.04554473
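The same idea could also be written with := and .SDcols, which some readers may find easier to scan. This is only a sketch on the sample table from the question; it relies on scale() accepting the .SD subset because it coerces its argument to a matrix.

```r
library(data.table)

dt <- data.table(a = 1:3, b = c(5, 6, 3), c = 2:4)
cols <- c("a", "b")

# .SD holds only the columns named in .SDcols;
# scale() z-scores each column, rowMeans() averages across them
dt[, d := rowMeans(scale(.SD)), .SDcols = cols]
dt
```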
I have a huge data frame and I am stuck on a conditional sum. Let me first present a simple example and then lay out my problem:
z <- c(0,1,2,3,4,5)
y <- c(2,2,2,3,3,3)
a <- c(1,1,1,2,2,2)
x <- data.frame(z,y,a)
Problem: I want to sum the z values over the rows that share the same y and a, but only if the second row of each group has z equal to 1.
I am sorry, but I am quite new to R, so I am not able to present any reasonable code of my own.
Any help would be highly appreciated.
As mentioned, your problem isn't clearly stated.
Perhaps you are looking to do something like this:
x$new <- with(x, ave(z, y, a, FUN = function(k)
ifelse(k[2] == 1, sum(k), NA)))
x
# z y a new
# 1 0 2 1 3
# 2 1 2 1 3
# 3 2 2 1 3
# 4 3 3 2 NA
# 5 4 3 2 NA
# 6 5 3 2 NA
Here, I've created a new column "new" which sums the values of "z" grouped by "y" and "a", but only if the second value in the group is equal to 1.
Since you say your data frame is quite large, you might want to convert your data frame to a data.table object using the data.table package. You will likely find that the required operations are much faster if you have a great many rows. However, the construction of the code for your case is not straightforward with data.table.
If I understand what you want to do (which is not entirely clear to me) you could try the following:
library(data.table)
z <- c(0,1,2,3,4,5)
y <- c(2,2,2,3,3,3)
a <- c(1,1,1,2,2,2)
x <- data.frame(z,y,a)
xx <- as.data.table(x) # Make a data.table object
setkey(xx, z) # Make the z column a key
xx[.(1), sum(a)] # Sum all values in column a where the key z = 1
[1] 1
# Now try the other sum you mention
xx[, sum(z), by = list(z = y)] # A column sum over groups defined by z = y
z V1
1: 2 2
2: 3 3
sum(xx[, sum(z), by = list(z = y)][, V1]) # Summing over the sums for each group should do it
[1] 5
To create the sum over the column a where z = 1, I made the z column a key. The syntax xx[.(1), sum(a)] sums a where the key value (z value) is 1. A bare number in i, as in xx[1, sum(a)], would select row 1 rather than join on the key.
I can create groups in the data.table object with by, which is analogous to a SQL GROUP BY clause if you are familiar with SQL. Note that because the grouping column is named z here, z inside sum(z) refers to the group's value rather than to the original z column, which is why each V1 simply equals the group's y value; use by = y to sum the z column within each y group. This may be inefficient if you have a great many groups. The outer sum adds the values for each group in the sub-selected V1 column of the inner result.
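For comparison, the grouped conditional sum from the ave() answer might be written in data.table roughly as follows. This is a sketch under the same reading of the question (sum z within each y/a group only when the group's second z equals 1); the column name new is borrowed from that answer.

```r
library(data.table)

xx <- data.table(z = c(0, 1, 2, 3, 4, 5),
                 y = c(2, 2, 2, 3, 3, 3),
                 a = c(1, 1, 1, 2, 2, 2))

# Within each (y, a) group: sum z if the second z is 1, otherwise NA.
# isTRUE() guards against groups with fewer than two rows, where z[2] is NA.
xx[, new := if (isTRUE(z[2] == 1)) sum(z) else NA_real_, by = .(y, a)]
xx
```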
If you are going to use data.table in a serious way study the informative vignettes available for that package.
M Dowle, T Short, S Lianoglou, A Srinivasan with contributions from R Saporta and E Antonyan (2014). data.table: Extensions of data.frame. R package version 1.9.2. http://CRAN.R-project.org/package=data.table
I have read in a large data file into R using the following command
data <- as.data.set(spss.system.file(paste(path, file, sep = '/')))
The data set contains columns which should not belong, and contain only blanks. This issue has to do with R creating new variables based on the variable labels attached to the SPSS file (Source).
Unfortunately, I have not been able to determine the options necessary to resolve the problem. I have tried all of: foreign::read.spss, memisc::spss.system.file, and Hmisc::spss.get, with no luck.
Instead, I would like to read in the entire data set (with ghost columns) and remove unnecessary variables manually. Since the ghost columns contain only blank spaces, I would like to remove any variables from my data.table where the number of unique observations is equal to one.
My data are large, so they are stored in data.table format. I would like to determine an easy way to check the number of unique observations in each column, and drop columns which contain only one unique observation.
require(data.table)
### Create a data.table
dt <- data.table(a = 1:10,
b = letters[1:10],
c = rep(1, times = 10))
### Create a comparable data.frame
df <- data.frame(dt)
### Expected result
unique(dt$a)
### Expected result
length(unique(dt$a))
However, I wish to calculate the number of obs for a large data file, so referencing each column by name is not desired. I am not a fan of eval(parse()).
### I want to determine the number of unique obs in
# each variable, for a large list of vars
lapply(names(df), function(x) {
length(unique(df[, x]))
})
### Unexpected result
length(unique(dt[, 'a', with = F])) # Returns 1
It seems to me the problem is that
dt[, 'a', with = F]
returns an object of class "data.table". It makes sense that the length of this object is 1, since it is a data.table containing 1 variable. We know that data.frames are really just lists of variables, and so in this case the length of the list is just 1.
Here's pseudo code for how I would remedy the problem, using the data.frame way:
for (x in names(data)) {
  unique.obs <- length(unique(data[, x]))
  if (unique.obs == 1) {
    data[, x] <- NULL
  }
}
Any insight as to how I may more efficiently ask for the number of unique observations by column in a data.table would be much appreciated. Alternatively, a recommendation for how to drop columns that contain only one unique observation within a data.table would be even better.
Update: uniqueN
As of version 1.9.6, there is a built in (optimized) version of this solution, the uniqueN function. Now this is as simple as:
dt[ , lapply(.SD, uniqueN)]
If you want to find the number of unique values in each column, something like
dt[, lapply(.SD, function(x) length(unique(x)))]
## a b c
## 1: 10 10 1
To get your function to work you need to use with=FALSE within [.data.table (and count rows with nrow, since the result is still a data.table), or simply use [[ instead (read fortune(312) as well...)
lapply(names(dt), function(x) nrow(unique(dt[, x, with = FALSE])))
or
lapply(names(dt), function(x) length(unique(dt[[x]])))
will work
In one step
dt[,names(dt) := lapply(.SD, function(x) if(length(unique(x)) ==1) {return(NULL)} else{return(x)})]
# or to avoid calling `.SD`
dt[, Filter(names(dt), f = function(x) length(unique(dt[[x]]))==1) := NULL]
The approaches in the other answers are good. Another way to add to the mix, just for fun :
for (i in names(DT)) if (length(unique(DT[[i]]))==1) DT[,(i):=NULL]
or if there may be duplicate column names :
for (i in ncol(DT):1) if (length(unique(DT[[i]]))==1) DT[,(i):=NULL]
NB: (i) on the LHS of := is a trick to use the value of i rather than a column named "i".
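Putting the pieces from these answers together, one way to drop the constant columns with uniqueN (available from version 1.9.6, as noted above) might look like this. The helper name const_cols is made up for the illustration.

```r
library(data.table)

dt <- data.table(a = 1:10,
                 b = letters[1:10],
                 c = rep(1, times = 10))

# Names of the columns that hold a single distinct value
const_cols <- names(dt)[vapply(dt, uniqueN, integer(1)) == 1L]

# Remove them by reference; (const_cols) uses the value of the variable
if (length(const_cols)) dt[, (const_cols) := NULL]
names(dt)
```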
Here is a solution to your core problem (I hope I got it right).
require(data.table)
### Create a data.table
dt <- data.table(a = 1:10,
b = letters[1:10],
d1 = "",
c = rep(1, times = 10),
d2 = "")
dt
a b d1 c d2
1: 1 a 1
2: 2 b 1
3: 3 c 1
4: 4 d 1
5: 5 e 1
6: 6 f 1
7: 7 g 1
8: 8 h 1
9: 9 i 1
10: 10 j 1
First, I introduce two columns d1 and d2 that have no values whatsoever. Those you want to delete, right? If so, I just identify those columns and select all other columns in the dt.
only_space <- function(x) {
length(unique(x))==1 && x[1]==""
}
bolCols <- apply(dt, 2, only_space)
dt[, (1:ncol(dt))[!bolCols], with=FALSE]
Somehow, I have the feeling that you could further simplify it...
Output:
a b c
1: 1 a 1
2: 2 b 1
3: 3 c 1
4: 4 d 1
5: 5 e 1
6: 6 f 1
7: 7 g 1
8: 8 h 1
9: 9 i 1
10: 10 j 1
There is an easy way to do that using the dplyr library and its select function, as follows (first_variable and second_variable stand for the columns you want to keep):
library(dplyr)
newdata <- select(old_data, first_variable, second_variable)
Note that you can choose as many variables as you like.
Then you will get the data that you want.
Many thanks,
Fadhah
If I specify n columns as a key of a data.table, I'm aware that I can join to fewer columns than are defined in that key as long as I join to the head of key(DT). For example, for n=2 :
X = data.table(A=rep(1:5, each=2), B=rep(1:2, each=5), key=c('A','B'))
X
A B
1: 1 1
2: 1 1
3: 2 1
4: 2 1
5: 3 1
6: 3 2
7: 4 2
8: 4 2
9: 5 2
10: 5 2
X[J(3)]
A B
1: 3 1
2: 3 2
There I only joined to the first column of the 2-column key of X. I know I can join to both columns of the key like this :
X[J(3,1)]
A B
1: 3 1
But how do I subset using only the second column of the key (e.g. B==2), while still using binary search rather than a vector scan? I'm aware that's a duplicate of :
Subsetting data.table by 2nd column only of a 2 column key, using binary search not vector scan
so I'd like to generalise this question to n. My data set has about a million rows and the solution provided in the duplicate question linked above doesn't seem to be optimal.
Here is a simple function that will extract the correct unique values and return a data table to use as a key.
X <- data.table(A=rep(1:5, each=4), B=rep(1:4, each=5),
C = letters[1:20], key=c('A','B','C'))
make.key <- function(ddd, what){
# the names of the key columns
zzz <- key(ddd)
# the key columns you wish to keep all unique values
whichUnique <- setdiff(zzz, names(what))
## unique data.table (when keyed); .. means "look up one level"
ud <- lapply(ddd[, ..whichUnique], unique)
## append the `what` columns and a Cross Join of the new
## key columns
do.call(CJ, c(ud,what)[zzz])
}
X[make.key(X, what = list(C = c('a','b'))),nomatch=0]
## A B C
## 1: 1 1 a
## 2: 1 1 b
I'm not sure this will be any quicker than a couple of vector scans on a large data.table though.
Adding secondary keys is on the feature request list :
FR#1007 Build in secondary keys
In the meantime we are stuck with either vector scan, or the approach used in the answer to the n=2 case linked in the question (which @mnel generalises nicely in his answer).
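Later data.table releases (1.9.6 onwards, to the best of my knowledge) added the on= argument, which joins on any named column whether or not it is at the head of the key. A minimal sketch on the X from the question, assuming such a version is installed:

```r
library(data.table)

X <- data.table(A = rep(1:5, each = 2), B = rep(1:2, each = 5),
                key = c("A", "B"))

# Join on B alone, even though B is the second key column
res <- X[.(2L), on = "B"]
res
```

Whether this beats a plain vector scan on a million rows would need to be measured on the real data.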