Error Merging Two Columns in R

I'm trying to combine two columns, Previous$Col1 and Previous$Col3, into one data.frame.
This is what I tried to do:
x<-data.frame(Previous$Col1)
y<-data.frame(Previous$Col3)
z<-merge(x,y)
And for some reason I got this error on the console:
Error: cannot allocate vector of size 24.0 Gb
What's going wrong and what should I do to fix this?
How could a data frame with two columns with 80000ish rows take up 24 GB of memory?
Thanks!

You are creating a full cartesian product, which has 80000*80000 rows and two columns, i.e. about 1.28e10 elements (roughly 51 GB even at four bytes per element, and double that for doubles). What are you trying to accomplish with your merge?
> x<-data.frame(a= c("a","b"))
> y<-data.frame(b= c(1,2,3))
> z<-merge(x,y)
> x
a
1 a
2 b
> y
b
1 1
2 2
3 3
> z
a b
1 a 1
2 b 1
3 a 2
4 b 2
5 a 3
6 b 3
You could do data.frame(Col3 = Previous$Col3, Col1= Previous$Col1) to achieve what you want.
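The number in the error message lines up with that arithmetic. Exactly which internal vector merge() was allocating is an implementation detail, but R reports sizes in GiB, and a single integer vector with one element per row of the cross product comes to almost exactly 24 of them:

```r
rows <- 80000 * 80000   # rows in the cartesian product: 6.4e9

# One 4-byte integer vector of that length, in GiB
rows * 4 / 2^30         # ~23.84, i.e. the "24.0 Gb" R failed to allocate
```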

Try using bind_cols() from the dplyr package or cbind() from base R:
bind_cols(Previous$Col1, Previous$Col3)
or
cbind(Previous$Col1, Previous$Col3)
(Note that cbind() on two bare vectors returns a matrix rather than a data.frame; wrap the result in data.frame() if you need one.)
Additionally, since these columns come from the same original data.frame, select() from the dplyr package could be used:
select(Previous, Col1, Col3)
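For contrast with merge(), a minimal base-R sketch (toy data standing in for the real Previous) showing that data.frame() simply pairs the columns row by row, with no join and no row explosion:

```r
Previous <- data.frame(Col1 = c("a", "b", "c"), Col2 = 1:3, Col3 = c(10, 20, 30))

z <- data.frame(Col1 = Previous$Col1, Col3 = Previous$Col3)
z
#   Col1 Col3
# 1    a   10
# 2    b   20
# 3    c   30

nrow(z)  # 3 rows in, 3 rows out
```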

Related

Finding Index from Vector/Matrix or Dataframe in R

I have data in R as follow:
data <- c(1,12,22,0,8,1,0,0)
Is there any way to index the data to find the indices of the elements that are greater than 0? So the result will be:
1 2 3 5 6
I tried to use as.factor(data), but it would take several more steps to get the result I am aiming for. Thanks.
We can use which on a logical vector
which(data >0)
#[1] 1 2 3 5 6
Another option is using seq_along (but not as straightforward as the which method by @akrun)
> seq_along(data)[data>0]
[1] 1 2 3 5 6

How to replicate all rows of a dataframe for each ID of another dataframe in R?

I have one dataframe (df_features) consisting of 32 rows and six columns that relate to potential features of a study and a second dataframe (df_participants) containing 10,000 unique (non-numeric) IDs of my participants. There are no common columns across the two dataframes.
I want to create a dataset that contains each of the 32 rows from df_features for every ID in df_participants (so 320,000 rows and 7 columns in total).
How do I do this? I feel like it should be straightforward but I just can't find anything anywhere!
It sounds like you are looking to do a full outer join that will combine all features with all IDs. This can be done using several packages, and in base-R with the following:
features <- data.frame(f1 = c("blue", "green"), f2 = c("young", "old"))
participants <- data.frame(ID = 1:10)
merge(features, participants, all = TRUE)
You can do a full outer join: when the two dataframes share no common columns, merge() returns their cartesian product, which is what you're looking for. If the only arguments you pass to merge are the two dataframes, that is exactly what you get back.
Example:
df1 <- data.frame(y = 1:4)
df2 <- data.frame(z = 1:3)
df_merged <- merge(df1, df2)
print(df1)
# y
#1 1
#2 2
#3 3
#4 4
print(df2)
# z
#1 1
#2 2
#3 3
print(df_merged)
# y z
#1 1 1
#2 2 1
#3 3 1
#4 4 1
#5 1 2
#6 2 2
#7 3 2
#8 4 2
#9 1 3
#10 2 3
#11 3 3
#12 4 3
I found a fairly convoluted way around it in case anyone is looking to do something similar:
matching_1 <- expand.grid(df_participants$ID, df_features$feature_rownumber) %>%
  arrange(Var1) %>%
  rename(ID = Var1, feature_rownumber = Var2)
matching_2 <- left_join(df_participants, matching_1, by = "ID")
final_dataset <- left_join(matching_2, df_features, by = "feature_rownumber")
However I'm fairly sure there must be a more concise method out there!
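There is: the merge() approach from the earlier answers does it in one line. A minimal sketch with small stand-in data frames (names here are placeholders for the real df_features and df_participants):

```r
df_features     <- data.frame(f1 = c("blue", "green"), f2 = c("young", "old"))
df_participants <- data.frame(ID = c("p1", "p2", "p3"))

# No common columns, so merge() returns the cartesian product
final <- merge(df_features, df_participants, all = TRUE)
nrow(final)  # 2 feature rows x 3 IDs = 6 rows
ncol(final)  # 3 columns
```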

Return df with a columns values that occur more than once [duplicate]

This question already has answers here:
Subset data frame based on number of rows per group
(4 answers)
Closed 5 years ago.
I have a data frame df, and I am trying to subset all rows that have a value in column B occur more than once in the dataset.
I tried using table to do it, but am having trouble subsetting from the table:
t<-table(df$B)
Then I try subsetting it using:
subset(df, table(df$B)>1)
And I get the error
"Error in x[subset & !is.na(subset)] :
object of type 'closure' is not subsettable"
How can I subset my data frame using table counts?
Here is a dplyr solution (using MrFlick's data.frame)
library(dplyr)
newd <- dd %>% group_by(b) %>% filter(n() > 1)
newd
# a b
# 1 1 1
# 2 2 1
# 3 5 4
# 4 6 4
# 5 7 4
# 6 9 6
# 7 10 6
Or, using data.table
setDT(dd)[,if(.N >1) .SD,by=b]
Or using base R
dd[dd$b %in% unique(dd$b[duplicated(dd$b)]),]
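That base-R one-liner packs two steps together; spelled out on a small toy data set:

```r
dd <- data.frame(a = 1:10, b = c(1, 1, 2, 3, 4, 4, 4, 5, 6, 6))

# Step 1: duplicated() flags the 2nd and later occurrences of each value,
# so these are exactly the values of b that occur more than once
dups <- unique(dd$b[duplicated(dd$b)])
dups  # 1 4 6

# Step 2: keep every row whose b is one of those values
dd[dd$b %in% dups, ]
```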
May I suggest an alternative, faster way to do this with data.table?
require(data.table) ## 1.9.2
setDT(df)[, .N, by=B][N > 1L]$B
(or) you can couple .I (another special variable - see ?data.table) which gives the corresponding row number in df, along with .N as follows:
setDT(df)[df[, .I[.N > 1L], by=B]$V1]
(or) have a look at @mnel's answer for another variation (using yet another special variable, .SD).
Using table() isn't the best because then you have to rejoin it to the original rows of the data.frame. The ave function makes it easier to calculate row-level values for different groups. For example
dd<-data.frame(
a=1:10,
b=c(1,1,2,3,4,4,4,5,6, 6)
)
dd[with(dd, ave(b,b,FUN=length))>1, ]
#subset(dd, ave(b,b,FUN=length)>1) #same thing
a b
1 1 1
2 2 1
5 5 4
6 6 4
7 7 4
9 9 6
10 10 6
Here, for each level of b, it counts the length of b, which is really just the number of b's and returns that back to the appropriate row for each value. Then we use that to subset.

Finding sum in all rows with R

I have two vectors
x <- c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,4,4,4,4,4,5,5,6,6,6,6)
y <- c(1,1,2,3,4,2,2,4,4,4,3,3,1,4,2,3,1,4,4,4,2,2,2,3,3)
I found the number of values for each x (from 1 to 6) as
t=table(x,y)
and get the table with 6 rows and 4 columns. Then I calculate the sum in all rows as s=apply(t,1,sum) and get the error. Could anybody explain what I do wrong?
What is the error? I don't get one with apply(t, 1, sum). Try instead
rowSums(t)
##1 2 3 4 5 6
##5 5 4 5 2 4
Or, you could simply use table(x), which gives you exactly the same output.
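To see the equivalence on the question's own data:

```r
x <- c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,4,4,4,4,4,5,5,6,6,6,6)
y <- c(1,1,2,3,4,2,2,4,4,4,3,3,1,4,2,3,1,4,4,4,2,2,2,3,3)

# Summing each row of the two-way table just recovers the counts of x
all(rowSums(table(x, y)) == table(x))  # TRUE
```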

aggregate over several variables in r

I have a rather large dataset in a long format where I need to count the number of instances of an ID arising from two different variables, A & B. E.g. the same person can be represented in multiple rows due to either A or B. What I need to do is count the number of instances of each ID, which is not too hard, but also count the number of instances due to A and due to B separately, and return these as variables in the dataset.
Regards,
//Mi
The ddply() function from the package plyr lets you break data apart by identifier variables, perform a function on each chunk, and then assemble it all back together. So you need to break your data apart by identifier and A/B status, count how many times each of those combinations occur (using nrow()), and then put those counts back together nicely.
Using wkmor1's df:
library(plyr)
x <- ddply(.data = df, .variables = c("ID", "GRP"), .fun = nrow)
which returns:
ID GRP V1
1 1 a 2
2 1 b 2
3 2 a 2
4 2 b 2
And then merge that back on to the original data:
merge(x, df, by = c("ID", "GRP"))
OK, given the interpretations I see, the fastest and easiest solution is...
df$IDCount <- ave(df$ID, df$group, FUN = length)
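A self-contained sketch of the ave() idea on hypothetical toy data, grouping by ID to count instances per ID (the answer's df$group stands for whatever grouping column your data actually has; note ave() returns a numeric vector, so its first argument just needs to be a numeric column):

```r
df <- data.frame(ID = rep(c(1, 2), 4), GRP = rep(c("a", "a", "b", "b"), 2))

# ave() applies length() within each ID group and broadcasts the count
# back onto every row of that group
df$IDCount <- ave(df$ID, df$ID, FUN = length)
df$IDCount  # 4 4 4 4 4 4 4 4  (each ID appears four times)
```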
Here is one approach using 'table' to count rows meeting your criteria, and 'merge' to add the frequencies back to the data frame.
> df<-data.frame(ID=rep(c(1,2),4),GRP=rep(c("a","a","b","b"),2))
> id.frq <- as.data.frame(table(df$ID))
> colnames(id.frq) <- c('ID','ID.FREQ')
> df <- merge(df,id.frq)
> grp.frq <- as.data.frame(table(df$ID,df$GRP))
> colnames(grp.frq) <- c('ID','GRP','GRP.FREQ')
> df <- merge(df,grp.frq)
> df
ID GRP ID.FREQ GRP.FREQ
1 1 a 4 2
2 1 a 4 2
3 1 b 4 2
4 1 b 4 2
5 2 a 4 2
6 2 a 4 2
7 2 b 4 2
8 2 b 4 2
