I have a dataset as follows:
i,o,c
A,4,USA
B,3,CAN
A,5,USA
C,4,MEX
C,1,USA
A,3,CAN
I want to reform this dataset into a form as follows:
i,u,o,c
A,3,4,2
B,1,3,1
C,2,2.5,2
Here, u is the number of occurrences of each value of i in the dataset, o = (sum of o) / u, and c = the number of unique countries.
I can get u with the following statement and by using plyr:
count(df1,vars="i")
I can also get some of the other variables by using the insights learned from my previous question. I can laboriously achieve my intended result by saving to multiple data frames and then joining them together, but I wonder if there is a one-line solution, or simply a better way than my current long-winded approach.
Thanks!
I don't understand how this is different from your earlier question. The approach is the same:
library(plyr)
ddply(mydf, .(i), summarise,
      u = length(i),
      o = mean(o),
      c = length(unique(c)))
# i u o c
# 1 A 3 4.0 2
# 2 B 1 3.0 1
# 3 C 2 2.5 2
If you prefer a data.table solution:
> library(data.table)
> DT <- data.table(mydf)
> DT[, list(u = .N, o = mean(o), c = length(unique(c))), by = "i"]
i u o c
1: A 3 4.0 2
2: B 1 3.0 1
3: C 2 2.5 2
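For completeness, the same aggregation in dplyr reads almost identically; a sketch, assuming the data frame is named mydf as above:
library(dplyr)
mydf %>%
  group_by(i) %>%
  summarise(u = n(),            # rows per group
            o = mean(o),        # group mean of o
            c = n_distinct(c))  # number of unique countries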
Hi, I am wondering if anyone knows how to calculate a value in a column C based on the previous value(s) in column C and another column B, saving the calculated value as the new current value in column C.
For example, suppose I first initialize column C to 1s, and the calculation I want to implement is C(1) = 1 + B(1)*0.1*1 and C(2) = C(1) + B(2)*0.1*C(1); in general, C(n) = C(n-1) + B(n)*0.1*C(n-1) with C(0) = 1.
test=data.table(A=1:5,B=c(1,2,1,2,1),C=1)
test
A B C
1: 1 1 1
2: 2 2 1
3: 3 1 1
4: 4 2 1
5: 5 1 1
What I want is:
test
A B C
1: 1 1 1.1
2: 2 2 1.32
3: 3 1 1.452
4: 4 2 1.7424
5: 5 1 1.91664
I could achieve what I want with a for loop or apply(), but I really want to know if this is doable using just data.table, to get some speed-up.
Edit:
As pointed out by Frank in the comments below,
test[, C := cumprod(1 + .1*B)]
will do since multiplication is distributive. What if I want to supply a more complex custom function?
Many thanks in advance!
Using the formula as presented we have:
test[, C := Reduce(function(c, b) c + .1 * b * c, B, init = 1, accumulate = TRUE)[-1]]
Of course, as pointed out already it simplifies in this particular case since we can write the body of the function as c * ( 1 + .1 * b) which implies a cumulative product of the parenthesized portion.
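To illustrate the "more complex custom function" case: a recurrence that does not factor into a cumulative product is handled the same way. A sketch, using a hypothetical square-root term purely for illustration:
# Reduce threads the running value through an arbitrary two-argument function,
# here C(n) = C(n-1) + 0.1*B(n)*sqrt(C(n-1))
test[, C := Reduce(function(c, b) c + 0.1 * b * sqrt(c), B, init = 1, accumulate = TRUE)[-1]]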
It seems you need to apply the function cumulatively:
library(data.table)
library(zoo)
test=data.table(A=1:5,B=c(1,2,1,2,1),C=1)
z <- function(b){1+b*0.1}
test[,C:=cumprod(rollapply(B, width=1, FUN=z))]
But I agree that there's really no need to bring zoo here. Frank's solution is more elegant and concise.
test[,C:=cumprod(1 + .1*B)]
I don't believe there is a similar data.table function, but it seems like accumulate from purrr is what you want. Simple example below, but the input could be rows of a data.table also.
library(purrr)
accumulate(1:4, function(x, y){2*x + y})
# [1] 1 4 11 26
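Applied to the original test data, a sketch would be as follows; .init seeds the recurrence with 1, and [-1] drops the seed so the lengths match:
library(purrr)
test[, C := accumulate(B, function(c, b) c + 0.1 * b * c, .init = 1)[-1]]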
I did some linear regression and I want to forecast the moment of exceeding a certain value.
This means I have three columns:
a = slope
b = intercept
c = target value
On every row I want to calculate
solve(a,(c-b))
How do I do this in an efficient way, without using a loop (it is an extensive dataset)?
So you basically want to solve the equation
c = a*x + b
for x for each row? That has the pretty simple solution of
x = (c-b)/a
which is a vectorized operation in R. No loop is necessary:
dd <- data.frame(
  a = 1:5,
  b = -2:2,
  c = 10:14
)
transform(dd, solution=(c-b)/a)
# a b c solution
# 1 1 -2 10 12.0
# 2 2 -1 11 6.0
# 3 3 0 12 4.0
# 4 4 1 13 3.0
# 5 5 2 14 2.4
In addition to the aforementioned responses, you could also use the mutate function from the tidyverse, like so:
library(magrittr)
library(tidyverse)
dataframe %<>% mutate(prediction = (c - b) / a)
In this example we assume the columns 'a', 'b', and 'c' are in a table called 'dataframe'. The %<>% operator from the magrittr library pipes 'dataframe' into the expression that follows and assigns the result back to 'dataframe'. Note that solve() is not vectorized over rows, so the closed-form (c - b) / a is used inside mutate instead.
Here is a simple way using the Vectorize function:
solve_vec <- Vectorize(solve)
solve_vec(dd$a, dd$c - dd$b)
# [1] 12.0  6.0  4.0  3.0  2.4
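Note that Vectorize is a wrapper around mapply, so an equivalent direct call (using the dd data frame from above) would be:
mapply(solve, dd$a, dd$c - dd$b)
# [1] 12.0  6.0  4.0  3.0  2.4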
I am working in R with a data frame d:
ID <- c("A","A","A","B","B")
eventcounter <- c(1,2,3,1,2)
numberofevents <- c(3,3,3,2,2)
d <- data.frame(ID, eventcounter, numberofevents)
> d
ID eventcounter numberofevents
1 A 1 3
2 A 2 3
3 A 3 3
4 B 1 2
5 B 2 2
where numberofevents is the highest value in the eventcounter for each ID.
Currently, I am trying to create an additional vector z <- c(6,6,6,3,3).
If the numberofevents == 3, it is supposed to calculate sum(1:3), equally to 3 + 2 + 1 = 6.
If the numberofevents == 2, it is supposed to calculate sum(1:2) equally to 2 + 1 = 3.
Working with a large set of data, I thought it might be convenient to create this additional vector
by using the sum function in R: d$z <- sum(1:d$numberofevents), i.e.
sum(1:3) # for the rows 1-3
and
sum(1:2) # for the rows 4-5.
However, I always get this warning:
numerical expression has 5 elements: only the first used
You can try ave
d$z <- with(d, ave(eventcounter, ID, FUN=sum))
Or using data.table
library(data.table)
setDT(d)[,z:=sum(eventcounter), ID][]
Try using the apply family of functions in R, such as sapply or lapply:
sapply(numberofevents, function(x) sum(1:x))
It works for me.
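As a side note, sum(1:n) is the triangular number n*(n+1)/2, so the whole column can also be computed directly, with no grouping or looping at all (a sketch):
d$z <- with(d, numberofevents * (numberofevents + 1) / 2)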
dt <- data.table(L=1:5,A=letters[7:11],B=letters[12:16])
L A B
1: 1 g l
2: 2 h m
3: 3 i n
4: 4 j o
5: 5 k p
Now I want to paste columns "A" and "B" to get a new one, let's call it "new":
dt2
L A B new
1: 1 g l gl
2: 2 h m hm
3: 3 i n in
4: 4 j o jo
5: 5 k p kp
I had a similar issue but had many columns, and didn't want to type them each manually.
New version
(based on comment from #mnel)
dt[, new:=do.call(paste0,.SD), .SDcols=-1]
This is roughly twice as fast as the old version, and seems to sidestep the quirks. Note the use of .SDcols to identify the columns to use in paste0. The -1 uses all columns but the first, since the OP wanted to paste columns A and B but not L.
If you would like to use a different separator:
dt[ , new := do.call(paste, c(.SD, sep = ":"))]
Old version
You can use .SD and by to handle multiple columns:
dt[,new:=paste0(.SD,collapse=""),by=seq_along(L)]
I added seq_along in case L was not unique. (You can check this using dt<-data.table(L=c(1:4,4),A=letters[7:11],B=letters[12:16])).
Also, in my actual instance for some reason I had to use t(.SD) in the paste0 part. There may be other similar quirks.
Arun's comment answered this question:
dt[,new:=paste0(A,B)]
This should do it:
dt <- data.table(dt, new = paste(dt$A, dt$B, sep = ""))
If you want to paste strictly using column indexes (when you may not know the column names), e.g. to get a new column by pasting columns 6 and 4:
dt$new <- apply(dt[, c(6, 4)], 1, function(row) paste(row[1], row[2], sep = "/"))
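For completeness, tidyr's unite function performs the same kind of column pasting on data frames; a sketch (remove = FALSE keeps the source columns A and B):
library(tidyr)
dt2 <- unite(dt, new, A, B, sep = "", remove = FALSE)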
I have read in a large data file into R using the following command
data <- as.data.set(spss.system.file(paste(path, file, sep = '/')))
The data set contains columns which should not belong, and contain only blanks. This issue has to do with R creating new variables based on the variable labels attached to the SPSS file (Source).
Unfortunately, I have not been able to determine the options necessary to resolve the problem. I have tried all of foreign::read.spss, memisc::spss.system.file, and Hmisc::spss.get, with no luck.
Instead, I would like to read in the entire data set (with ghost columns) and remove unnecessary variables manually. Since the ghost columns contain only blank spaces, I would like to remove any variables from my data.table where the number of unique observations is equal to one.
My data are large, so they are stored in data.table format. I would like to determine an easy way to check the number of unique observations in each column, and drop columns which contain only one unique observation.
require(data.table)
### Create a data.table
dt <- data.table(a = 1:10,
                 b = letters[1:10],
                 c = rep(1, times = 10))
### Create a comparable data.frame
df <- data.frame(dt)
### Expected result
unique(dt$a)
### Expected result
length(unique(dt$a))
However, I wish to calculate the number of obs for a large data file, so referencing each column by name is not desired. I am not a fan of eval(parse()).
### I want to determine the number of unique obs in
# each variable, for a large list of vars
lapply(names(df), function(x) {
  length(unique(df[, x]))
})
### Unexpected result
length(unique(dt[, 'a', with = F])) # Returns 1
It seems to me the problem is that
dt[, 'a', with = F]
returns an object of class "data.table". It makes sense that the length of this object is 1, since it is a data.table containing 1 variable. We know that data.frames are really just lists of variables, and so in this case the length of the list is just 1.
Here's pseudo code for how I would remedy the solution, using the data.frame way:
for (x in names(data)) {
  unique.obs <- length(unique(data[, x]))
  if (unique.obs == 1) {
    data[, x] <- NULL
  }
}
Any insight as to how I may more efficiently ask for the number of unique observations by column in a data.table would be much appreciated. Alternatively, a recommendation on how to drop columns that contain only one unique value from a data.table would be even better.
Update: uniqueN
As of version 1.9.6, there is a built in (optimized) version of this solution, the uniqueN function. Now this is as simple as:
dt[ , lapply(.SD, uniqueN)]
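Combining uniqueN with data.table's column deletion gives a compact way to drop the single-valued columns directly (a sketch):
drop <- names(dt)[sapply(dt, uniqueN) == 1]
if (length(drop)) dt[, (drop) := NULL]  # the guard avoids assigning to zero columns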
If you want to find the number of unique values in each column, something like
dt[, lapply(.SD, function(x) length(unique(x)))]
## a b c
## 1: 10 10 1
To get your function to work you need to use with=FALSE within [.data.table and count unique rows with nrow, or simply use [[ instead (read fortune(312) as well...):
lapply(names(df), function(x) nrow(unique(dt[, x, with = FALSE])))
or
lapply(names(df), function(x) length(unique(dt[[x]])))
will work.
In one step
dt[, names(dt) := lapply(.SD, function(x) if (length(unique(x)) == 1) NULL else x)]
# or to avoid calling `.SD`
dt[, Filter(names(dt), f = function(x) length(unique(dt[[x]])) == 1) := NULL]
The approaches in the other answers are good. Another way to add to the mix, just for fun :
for (i in names(DT)) if (length(unique(DT[[i]]))==1) DT[,(i):=NULL]
or if there may be duplicate column names :
for (i in ncol(DT):1) if (length(unique(DT[[i]]))==1) DT[,(i):=NULL]
NB: (i) on the LHS of := is a trick to use the value of i rather than a column named "i".
Here is a solution to your core problem (I hope I got it right).
require(data.table)
### Create a data.table
dt <- data.table(a = 1:10,
                 b = letters[1:10],
                 d1 = "",
                 c = rep(1, times = 10),
                 d2 = "")
dt
a b d1 c d2
1: 1 a 1
2: 2 b 1
3: 3 c 1
4: 4 d 1
5: 5 e 1
6: 6 f 1
7: 7 g 1
8: 8 h 1
9: 9 i 1
10: 10 j 1
First, I introduce two columns, d1 and d2, that have no values whatsoever. Those are the ones you want to delete, right? If so, I just identify those columns and select all the other columns of the dt.
only_space <- function(x) {
  length(unique(x)) == 1 && x[1] == ""
}
bolCols <- apply(dt, 2, only_space)
dt[, (1:ncol(dt))[!bolCols], with=FALSE]
Somehow, I have the feeling that you could further simplify it...
Output:
a b c
1: 1 a 1
2: 2 b 1
3: 3 c 1
4: 4 d 1
5: 5 e 1
6: 6 f 1
7: 7 g 1
8: 8 h 1
9: 9 i 1
10: 10 j 1
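One possible simplification: sapply iterates over the columns of a data.table directly, so the apply call (which converts the table to a character matrix) can be avoided. A sketch reusing the only_space helper from above:
keep <- !sapply(dt, only_space)
dt[, names(dt)[keep], with = FALSE]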
There is an easy way to do that using the dplyr library and its select function, as follows:
library(dplyr)
newdata <- select(old_data, first_variable, second_variable)
Note that you can choose as many variables as you like; 'first_variable' and 'second_variable' are placeholders for your column names. Then you will get just the data that you want.
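Tying this back to the original goal of dropping single-valued columns without naming them: with dplyr 1.0.0 or later you can also select by predicate; a sketch:
library(dplyr)
newdata <- old_data %>% select(where(~ n_distinct(.x) > 1))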