I have the following data and I was wondering how to generate a frequency table for each response, via base R, plyr, or another package.
My data:
df = data.frame(id = c(1,2,3,4,5),
                Did_you_use_tv = c("tv","","","tv","tv"),
                Did_you_use_internet = c("","","","int","int"))
df
I can get the frequencies for any column (or pair of columns) using table():
table(df[,2])
table(df[,2], df[,3])
However, how can I set up the data so it looks like the output below?
df2 = data.frame(Did_you_use_tv=c(3),
Did_you_use_internet=c(2))
df2
It's just a summary of frequencies for each column.
I'm going to be creating cross tabs but given the structure of the data, I feel this may be a little more useful.
This is similar in concept to @Tyler's answer. Just take the sum of all values that are not equal to "":
colSums(!df[-1] == "")
# Did_you_use_tv Did_you_use_internet
# 3 2
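If you specifically want the one-row data frame shown as df2 in the question, one small extra step (not part of the original answer) is to transpose the colSums result:

as.data.frame(t(colSums(df[-1] != "")))
#   Did_you_use_tv Did_you_use_internet
# 1              3                    2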
Update
Fellow Stack Overflow user @juba has done some work on a function called multi.table which looks like this:
multi.table <- function(df, true.codes = NULL, weights = NULL) {
  true.codes <- c(as.list(true.codes), TRUE, 1)
  as.table(sapply(df, function(v) {
    sel <- as.numeric(v %in% true.codes)
    if (!is.null(weights)) sel <- sel * weights
    sum(sel)
  }))
}
The function is part of the questionr package.
Usage in your example would be:
library(questionr)
multi.table(df[-1], true.codes=list("tv", "int"))
# Did_you_use_tv Did_you_use_internet
# 3 2
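For what it's worth, the function defined above also takes a weights argument; with made-up row weights (one per respondent, purely for illustration) the call would be:

multi.table(df[-1], true.codes = list("tv", "int"), weights = c(2, 1, 1, 1, 1))
# Did_you_use_tv Did_you_use_internet
#              4                    2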
Here's one approach of many that came to mind:
FUN <- function(x) sum(x != "")
do.call(cbind, lapply(df[, -1], FUN))
## Did_you_use_tv Did_you_use_internet
## [1,] 3 2
Here's another approach:
> do.call(cbind, lapply(df[,-1], table))[-1, ]
Did_you_use_tv Did_you_use_internet
3 2
With plyr and reshape2:
library(reshape2)
t(dcast(subset(melt(df, id.var = "id"), value != ""), variable ~ .))
I have the following problem when trying to run an analysis on big data:
I have two data frames A and B with the same primary key (over multiple columns), but data frame A additionally has a date variable.
I now want to check, for every unique entity in data frame A, whether there is an entry in data frame B as well. I do this with the following function:
checkMissing <- function(A, B, primary_key) {
  A <- unique(A[, primary_key])
  B <- unique(B[, primary_key])
  return(A[!A %in% B, ])
}
As it turns out, the unique() step becomes terribly slow as A grows (for roughly 15 million rows it took about 30 seconds on my machine).
Is there a smarter way to check if entities are missing in B without using dplyr? (base R would be perfect, but data.table works as well)
Here is a reproducible example:
library(tictoc)
checkMissing <- function(A, B, primary_key) {
  tic("making data unique")
  A <- unique(A[, primary_key])
  B <- unique(B[, primary_key])
  toc()
  return(A[!A %in% B, ])
}
# creating the dummy key data
ID1 <- 250000
ID2 <- seq(1,ID1/100,1)
ID3 <- seq(1,ID1/10000,1)
tmp <- data.frame("ID1" = seq(1, ID1, 1),
                  "ID2" = sample(ID2, ID1, replace = TRUE),
                  "ID3" = sample(ID3, ID1, replace = TRUE))
#creating the date sequence
dates <- data.frame("date" = seq.Date(as.Date("2019-01-01"),as.Date("2019-02-28"),1))
#cross join to get data frame A
df.A <- merge(dates,tmp,by=NULL)
# create data frame B
df.B <- unique(df.A[,c("ID1","ID2","ID3")])
tic("overall time")
df.result <- checkMissing(df.A,df.B,c("ID1","ID2","ID3"))
toc()
Thanks!
Stephan
As joran pointed out in his comment, the anti-join implementation in data.table is way faster:
setDT(df.A)[!df.B, on = c("ID1","ID2","ID3")]
On my test data, execution time dropped from roughly 30-35 seconds to less than 2 seconds.
Although I'm still interested in a faster base R version, this is a correct answer.
Best
Stephan
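For the record, here is a base-R sketch along the same lines (untested at the 15-million-row scale; it collapses the key columns into one character key with paste(), so it assumes the key values never contain the separator):

checkMissingBase <- function(A, B, primary_key) {
  # build one character key per row from the key columns
  keyA <- do.call(paste, c(A[primary_key], sep = "\r"))
  keyB <- do.call(paste, c(B[primary_key], sep = "\r"))
  # unique key combinations of A that have no counterpart in B
  A[!duplicated(keyA) & !(keyA %in% keyB), primary_key]
}

checkMissingBase(df.A, df.B, c("ID1", "ID2", "ID3"))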
I asked a question before which was complicated and I did not get any help, so I tried to simplify the question and the input/output.
I have tried many ways but none worked. For example, here are some of them:
# 1
for(i in ncol(mydata)){
  corsA = grep(colnames(mydata)[i], colnames(mysecond))
  mydata[, corsA] %in% mysecond[, i]
}
# here, if I get TRUE it means they match
## 2
are.cols.identical <- function(col1, col2) identical(mydata[,col1], mysecond[,col2])
res <- outer(colnames(mydata), colnames(mysecond),FUN = Vectorize(are.cols.identical))
cut <- apply(res, 1, function(x)match(TRUE, x))
### 3
(mydata$Rad) %in% (mysecond$Ro5_P1_A5)
#### 4
which(mydata %in% mysecond)
#### 5
match(mydata$sus., mysecond$R5_P1_A5)
or
which(mydata$sus. %in% mysecond$RP1_A5)
matches <- sapply(mydata,function(x) sapply(mysecond,identical,x))
and a few others, but none led me to an answer.
Here is another solution using regex:
rows<-mapply(grep,mysecond,mydata)
The step above will return a list with the matched rows in each column:
rows
If you would like to see how many rows were matched in each column, you can do this:
lapply(rows,length)
Now we can go ahead and get the rows of interest in mydata. Since rows is a list, we need to unlist() it, and because some rows may have matched more than once we wrap the result in unique() so they don't appear twice in the output:
rows<-unique(unlist(rows))
mydata[rows,]
#View(mydata[rows,])
require(plyr)
# split the comma-separated UniProt IDs into one column per ID
dat <- strsplit(as.character(mydata$subunits..UniProt.IDs.), ',')
dat <- data.frame(mydata[, 1],
                  rbind.fill(lapply(dat, function(y) as.data.frame(t(y), stringsAsFactors = FALSE))))
# keep the rows of mydata whose IDs appear in the second column of mysecond
mydata[unlist(apply(dat, 2, function(x) which(x %in% mysecond[, 2]))), ]
I am trying to filter my data frame to leave the highest value of one variable (Question), for each combination of two others (Person and Test).
My data.frame looks something like this:
df <- data.frame(Person = c("Person1","Person1","Person1","Person2","Person2","Person2","Person3","Person3","Person3"),
                 Test = rep("Test1", 9),
                 Question = c("1","2","3","1","2","3","1","2","3"))
Except that there are multiple tests i.e. Test2, Test3 etc.
I want to filter down to show the last Question in each Test for every Person. The number of Questions differs between Tests.
Using the response to this question:
dplyr filter: Get rows with minimum of variable, but only the first if multiple minima, I managed to get some of the way with:
library(dplyr)
df.grouped <- group_by(df, Person, Test)
df.lastquestion <- filter(df.grouped, Question == max(Question))
Unfortunately, it leaves me with the highest Question number that each Person answered across all Tests, whereas I would like the highest Question number that each Person answered on each Test.
Thanks
Whilst there'll be a plethora of dplyr, plyr, and data.table options proffered, here's a good ol' fashioned base-R version, using a somewhat expanded (and vastly simplified) version of your example data:
df <- data.frame(Person = rep(paste0("Person", 1:3), each = 3, times = 2),
                 Test = rep(paste0("Test", 1:4), each = 9),
                 Question = as.character(rep(1:3, times = 3 * 2)))
You could do this inline, but an explicit wrapper allows me to focus on two aspects of this question
wrapper <- function(x) {
  with(x, x[Question == max(Question), ])
}
You could use which.max(Question) here, but that would select only the first of the maxima if more than one value of Question were tied at the maximum.
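A tiny illustration of the difference (toy vector, not part of the question's data):

q <- c(2, 3, 1, 3)
which.max(q)        # 2    -- only the first maximum
which(q == max(q))  # 2 4  -- every position tied at the maximum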
Now we want to split the data and then apply wrapper() to each element. The other packages mentioned above provide more consistent and in some cases faster implementations of this but base-R is often competitive:
ll <- lapply(with(df, split(df, list(Person, Test))), wrapper)
Now just bind everything together:
newdf <- do.call("rbind", c(ll, make.row.names = FALSE))
head(newdf)
Which returns:
> head(newdf)
Person Test Question
1 Person1 Test1 3
2 Person2 Test1 3
3 Person3 Test1 3
4 Person1 Test2 3
5 Person2 Test2 3
6 Person3 Test2 3
The whole thing would be:
wrapper <- function(x) {
  with(x, x[Question == max(Question), ])
}
ll <- lapply(with(df, split(df, list(Person, Test))), wrapper)
newdf <- do.call("rbind", c(ll, make.row.names = FALSE))
Use ave:
df[df$Question == ave(as.numeric(df$Question), list(df$Person, df$Test), FUN = max), ]
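Since the question started from dplyr, here is a hedged sketch of what the original attempt appears to have been aiming for (grouping by both Person and Test and comparing Question numerically; only checked against the example data above):

library(dplyr)
df %>%
  group_by(Person, Test) %>%
  filter(as.numeric(Question) == max(as.numeric(Question))) %>%
  ungroup()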
I have a very large data.frame that I want to apply a fairly complicated function to, calculating a new column. I want to do it in parallel. This is similar to the question posted on the R mailing list, but the first answer there is wrong and the second is unhelpful.
I've gotten everything figured out thanks to the parallel package, except how to put the output back onto the data frame. Here's a MWE that shows what I've got:
library(parallel)
# Example Data
data <- data.frame(a = rnorm(200), b = rnorm(200),
                   group = sample(letters, 200, replace = TRUE))
# Break into list
datagroup <- split(data, factor(data$group))
# execute on each element in parallel
options(mc.cores = detectCores())
output <- mclapply(datagroup, function(x) x$a*x$b)
The result in output is a list of numeric vectors. I need them in a column that I can append to data. I've been looking along the lines of do.call(cbind, ...), but I have two lists with the same names, not a single list that I'm joining. melt(output) gets me a single vector, but its rows are not in the same order as data.
Converting from comment to answer...
This seems to work:
data <- do.call(rbind, mclapply(
  split(data, data$group),
  function(x) {
    z <- x$a * x$b
    x <- as.data.frame(cbind(x, newcol = z))
    return(x)
  }))
rownames(data) <- seq_len(nrow(data))
head(data)
# a b group newcol
#1 -0.6482428 1.8136254 a -1.17566963
#2 0.4397603 1.3859759 a 0.60949714
#3 -0.6426944 1.5086339 a -0.96959055
#4 -1.2913493 -2.3984527 a 3.09724030
#5 0.2260140 0.1107935 a 0.02504087
#6 2.1555370 -0.7858066 a -1.69383520
Since you are working with a "very large" data.frame (how large roughly?), have you considered using either dplyr or data.table for what you do? For a large data set, performance may be even better with one of these than with mclapply. The equivalent would be:
library(dplyr)
data %>%
group_by(group) %>%
mutate(newcol = a * b)
library(data.table)
setDT(data)[, newcol := a*b, by=group]
A bit dated, but this might help.
rbind will kill you in terms of performance if you have many splits.
It's much faster to use the unsplit function.
results <- mclapply(split(data, data$group), function(x) x$a * x$b)
resultscombined <- unsplit(results, data$group)
data$newcol <- resultscombined
Yeah, there's a memory hit, so it depends on what you'd like.
Compute the mean by group using a multicore process:
library(dplyr)
x <- group_by(iris, Species)
indices <- attr(x,"indices")
labels <- attr(x,"labels")
require(parallel)
result <- mclapply(indices, function(indx) {
  data <- slice(iris, indx + 1)
  ## Do something...
  mean(data$Petal.Length)
}, mc.cores = 2)
out <- cbind(labels, mean = unlist(result))
out
## Species mean
## 1 setosa 1.462
## 2 versicolor 4.260
## 3 virginica 5.552
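Note that attr(x, "indices") and attr(x, "labels") rely on dplyr internals that have changed since this was written; in more recent dplyr versions the supported accessors are, as far as I know, group_rows() and group_keys() (group_rows() returns 1-based indices, so the + 1 adjustment is no longer needed):

library(dplyr)
library(parallel)

x <- group_by(iris, Species)
indices <- group_rows(x)   # list of 1-based row indices, one element per group
labels <- group_keys(x)    # one row of grouping values per group

result <- mclapply(indices, function(indx) {
  mean(slice(iris, indx)$Petal.Length)
}, mc.cores = 2)

cbind(labels, mean = unlist(result))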
I'm currently unable to install the parallel package on my computer, so here is a solution that works for my usual setup, using the snow package for parallel computation.
The solution simply orders the data.frame at the beginning, then merges the output list by calling c(). See below:
library(snow)
library(rlecuyer)
# Example data
data <- data.frame(a = rnorm(200), b = rnorm(200),
                   group = sample(letters, 200, replace = TRUE))
data <- data[order(data$group),]
# Cluster setup
clNode <- list(host="localhost")
localCl <- makeSOCKcluster(rep(clNode, 2))
clusterSetupRNG(localCl, type="RNGstream", seed=sample(0:9,6,replace=TRUE))
clusterExport(localCl, list=ls())
# Break into list
datagroup <- split(data, factor(data$group))
output <- clusterApply(localCl, datagroup, function(x){ x$a*x$b })
# Put back and check
data$output <- do.call(c, output)
data$check <- data$a*data$b
all(data$output==data$check)
# Stop cluster
stopCluster(localCl)
Inspired by @beginneR and our common love of dplyr, I did some more fiddling and think the best way to make this happen is:
rbind_all(mclapply(split(data, data$group), function(x) as.data.frame(x$a * x$b)))
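As an aside, rbind_all has since been deprecated in dplyr in favour of bind_rows, so the same idea in current dplyr would be (untested):

bind_rows(mclapply(split(data, data$group), function(x) as.data.frame(x$a * x$b)))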
I am trying to compile data from several files using for loops in R, and I would like to get all the data into one table. The following calculation is just an example.
library(reshape)
dat1 <- data.frame("Specimen" = paste("sp", 1:10, sep=""), "Density_1" = rnorm(10,4,2), "Density_2" = rnorm(10,4,2), "Density_3" = rnorm(10,4,2))
dat2 <- data.frame("Specimen" = paste("fg", 1:10, sep=""), "Density_1" = rnorm(10,4,2), "Density_2" = rnorm(10,4,2))
dat <- c("dat1", "dat2")
for(i in 1:length(dat)){
  data <- get(dat[i])
  melt.data <- melt(data, id = 1)
  assign(paste(dat[i], "tbl", sep = ""), cast(melt.data, ~ variable, mean))
}
rbind(dat1tbl, dat2tbl)
What is the smoothest way to add an extra column to dat2? I would like it to get the same column name ("Density_3" in this case) and be filled with zeros if it does not already exist. Assume that I have ~100 tables with the number of columns (Density_1, 2, 3, etc.) varying between 5 and 6.
I tried the following, but it didn't work:
if(names(data) %in% "Density_3" == FALSE){
  dat.all$Density_3 <- 0
} else {
  dat.all$Density_3 <- dat.all$Density3
}
Another one: is there a smooth way to rbind() the tables? It seems that rbind(get(dat)) does not work.
After staring at this question for a while, I think its intent may have been obscured by the unnecessary get and assign manipulations. And I think the answer is plyr::rbind.fill.
I would have constructed "dat" not as a character vector but as a list of two data frames, used aggregate(..., FUN = mean) (because I haven't gotten on the reshape2/plyr bus, except for melt and rbind.fill, that is), and then called do.call(rbind.fill, ...) on the resulting list. At any rate, this is what I think you want. I do not think it is a good idea to add in zeros for what are really missing values.
> rbind.fill(dat1tbl, dat2tbl)
value Density_1 Density_2 Density_3
1 (all) 5.006709 4.088988 2.958971
2 (all) 4.178586 3.812362 NA
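For reference, here is a minimal sketch of the list-based approach described above (it uses colMeans in place of aggregate for brevity and reuses the dat1/dat2 example objects; columns that a table lacks come back as NA rather than zero):

library(plyr)  # for rbind.fill

dat <- list(dat1 = dat1, dat2 = dat2)

# one row of column means per table (dropping the Specimen column first)
tbls <- lapply(dat, function(d) as.data.frame(as.list(colMeans(d[-1]))))

do.call(rbind.fill, tbls)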