Error when merging large data tables in R

I have two data tables.
Table 1: 1,349,445 rows × 21 columns
Table 2: 3,235 rows × 4 columns
Table 1:
YEAR STATE_NAME CROP .......
1990 Alabama Cotton
1990 Alabama Cotton
1990 Alabama Peanuts
.
.
.
Table 2:
STATE STATEFP COUNTYFP STATE_NAME
AK 2 13 Alaska
AK 2 16 Alaska
AK 2 20 Alaska
AK 2 50 Alaska
I want to merge the two tables by "STATE_NAME":
Table1 <- data.table(Table1)
Table2 <- data.table(Table2)
setkeyv(Table1, c("STATE_NAME"))
setkeyv(Table2, c("STATE_NAME"))
Hydra_merge <- merge(Table1, Table2, all.x = TRUE)
I am getting the error below. Can somebody help me figure out what I am doing wrong here?
Thanks in advance.
Error in vecseq(f__, len__, if (allow.cartesian || notjoin || !anyDuplicated(f__, :
Join results in 141691725 rows; more than 1352680 = nrow(x)+nrow(i). Check for duplicate key values in i each of which join to the same group in x over and over again. If that's ok, try by=.EACHI to run j for each group to avoid the large allocation. If you are sure you wish to proceed, rerun with allow.cartesian=TRUE. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and datatable-help for advice.

I am not sure why nobody has answered this yet, and it is probably too late to help the OP, but the explanation is quite straightforward!
As the error message states, you have plenty of rows in both tables with repeated keys. If you have two tables with, say, 5 and 6 rows, and the keys are unique, their join will have at least 5 and at most 11 rows (depending on whether all.x, all.y, or all is TRUE).
If, instead, all rows in both tables share the same key, joining them will produce a table with 30 (= 5 × 6) rather meaningless rows,
as in:
table_1:
key val1
k a
k b
k c
k d
k e
table_2:
key val2
k 1
k 2
k 3
k 4
k 5
k 6
merge(table_1, table_2)
key val1 val2
k a 1
k a 2
k a 3
k a 4
... ...
k c 2
k c 3
k c 4
k c 5
... ...
k e 3
k e 4
k e 5
k e 6
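For the record, the toy example above can be reproduced in a couple of lines (a minimal sketch; the 30-row result exceeds nrow(x) + nrow(i) = 11, which is exactly the condition in your error message):
library(data.table)
table_1 <- data.table(key = "k", val1 = letters[1:5])
table_2 <- data.table(key = "k", val2 = 1:6)
# merge(table_1, table_2, by = "key") # errors, just like your merge
nrow(merge(table_1, table_2, by = "key", allow.cartesian = TRUE))
# [1] 30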
data.table noticed this and is trying to help you. That is also why it states "If you are sure you wish to proceed, rerun with allow.cartesian=TRUE": you can force the merge and go home with your, likely wrong but who am I to tell, Cartesian product of the two tables.
Now, I am very tempted to try and guess the size of your two tables, given that the sum of their row counts is 1,352,680, the resulting mess of a table has 141,691,725 rows, and there are 50 states (but one of the tables skips Alaska), but maybe next time.
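In the meantime, a sketch of a likely fix (this assumes the intent is to attach state-level codes, not one row per county; Table2_states is a name I made up): collapse Table 2 to one row per STATE_NAME before merging, so the key is unique on that side.
Table2_states <- unique(Table2[, .(STATE_NAME, STATEFP)])
Hydra_merge <- merge(Table1, Table2_states, by = "STATE_NAME", all.x = TRUE)
# or, if the many-to-many blow-up really is intended:
# Hydra_merge <- merge(Table1, Table2, by = "STATE_NAME", all.x = TRUE, allow.cartesian = TRUE)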

Related

Join one table with multiple tables in R

I have around 800 data frames (a1, a2, a3, ..., a800) in R, all with the same number of columns and the same column names. I want to left join table a1 with the rest of the 799 tables and store the result in an object. Similarly, left join table a2 with the rest of them and store it in another object, and so on. I am unable to proceed with this! If anyone could help, that would be great.
Here is an example
Table a1:
Names ID Time
X 1 2
Y 2 6
Z 3 5
K 4 8
Table a2:
Names ID Time
P 11 8
Q 12 9
R 10 7
Y 2 6
and so on. I want to join by the ID column. And I have 800 tables!
You can use data.table::rbindlist:
dataframe_name_list = list(a1,a2,a3,...a800)
data.table::rbindlist(dataframe_name_list, use.names=TRUE)
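Note that rbindlist() stacks the tables on top of each other rather than joining them. If the goal really is a left join of a1 against each of the other tables, a sketch along these lines might be closer (assuming objects a1, ..., a800 exist in the workspace and all share the ID column):
# build the list programmatically instead of typing 800 names
dataframe_name_list <- mget(paste0("a", 1:800))
# left-join a1 with each of the other 799 tables, keeping the results in a list
joined_with_a1 <- lapply(dataframe_name_list[-1], function(tbl)
  merge(dataframe_name_list[[1]], tbl, by = "ID", all.x = TRUE))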

Plot network graph by rolling period using R

I was trying to construct a network graph and some statistics over rolling three-year periods, but I don't know how to set up the rolling function. Below is my code without the rolling part.
> library(igraph)
> em <- read.csv(file.choose(), header = TRUE )
> network <- graph.data.frame(em, directed = TRUE )
> em
sender receiver weights year
1 a d 2 2001
2 b a 3 2001
3 c a 1 2001
4 d h 1 2001
5 e d 3 2001
6 a b 1 2002
7 b c 2 2002
8 c d 1 2002
9 d a 1 2002
10 e h 2 2002
11 a d 1 2003
12 b d 1 2003
13 c a 2 2003
14 d h 2 2003
15 e a 1 2003
16 a d 1 2004
17 b c 1 2004
18 c d 2 2004
19 d c 1 2004
20 e a 2 2004
> E(network)$weight <- as.numeric(network[,3])
Warning message:
In eattrs[[name]][index] <- value :
number of items to replace is not a multiple of replacement length
> p <- plot (network,edge.width=E(network)$weight)
So in this example it would eventually produce one weighted network graph. I would like to produce graphs using the 2001-2003 and 2002-2004 subsamples, and then compute some SNA statistics. Some online resources suggest rollapply() or the functions in the tidyquant package could do this, but I didn't manage to figure out how to make them recognise the 4th column as the year, or how to set the rolling period. I would much appreciate it if anyone could help out, as I am a newbie to R.
Many thanks!!
Thanks @emilliman5 for the further questions.
My real dataset is an unbalanced panel with a 15-year time span. I intended to construct network graphs using parts of the full data. The subsetting rule is a 3-year rolling window (in fact with some other conditions, but I only asked about the rolling part here). So I planned to build the rolling subsamples first and then plot the graphs. I hope it is a bit clearer now.
The data above was just a mock sample. A 4-year range should generate two graphs (2001-2003, 2002-2004), and 15 years should produce 13 graphs. The real weighting variable is not called weight, but I agree the line "as.numeric(network[,3])" is redundant. (I realise now the example I made wasn't good... sorry...)
Someone has helped me with this now, so I'll just post some of the code. I hope it might help other people.
Method 1: build the sub-samples with a function. This saves me from constructing nested loops together with the graphing.
# Function: subsetting conditions; here the rolling 3-year window only
f <- function(j){
  subset(em, year >= j & year <= j + 2)
}
print(f(2001)) # test the function output
# Rolling by location and year, graph-plotting
for (j in 2001:2002){
  sdf <- f(j)
  nw <- graph.data.frame(sdf, directed = TRUE)
  # graph.data.frame stores the extra columns (here "weights") as edge attributes
  plot(nw, edge.width = E(nw)$weights)
}
Method 2: Use loops -- fine for one or two subsetting conditions, but a bit clumsy for more.
k <- 1 # list index for the subsets
sdf <- list() # empty list to hold the subsets
for (j in 2001:2002){
  df_temp <- subset(em, year >= j & year <= j + 2)
  print(nrow(df_temp)) # to check the subset; not necessary
  sdf[[k]] <- df_temp
  nw <- graph.data.frame(sdf[[k]], directed = TRUE)
  plot(nw, edge.width = E(nw)$weights) # as above, weights come from the graph object
  k <- k + 1
}
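For the real 15-year panel, a compact sketch of the same idea (my own generalisation, assuming em has a numeric year column; 15 years give 13 windows):
starts <- min(em$year):(max(em$year) - 2)
graphs <- lapply(starts, function(j)
  graph.data.frame(subset(em, year >= j & year <= j + 2), directed = TRUE))
for (g in graphs) plot(g, edge.width = E(g)$weights)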

Working with repeated values in rows

I am working with a df of 46,216 observations where the units are homes and people, and each home may have any number of members, like:
[screenshot of the data frame omitted]
and this for almost 18,000 homes.
What I need to do is get the mean years of education for every home, for which I guess I will need a variable with the number of people in each home.
What I tried is:
num_peopl = by(df$person_number, df$home, max)
i.e. for each home I take the highest person number as the total number of people who live there, but when I try to cbind this with the df I get:
"arguments imply differing number of rows: 46216, 17931"
It is as if it puts the number of persons in only one row per home and leaves the others empty.
How can I do this? Is there a function?
I think aggregate and a join may be what you're looking for. aggregate does the same thing you did, but puts the result into a data frame, which I'm more familiar with at least.
Then I used dplyr's left_join, joining on the home numbers:
library(tidyverse)
df <- data.frame(home_number = c(1, 1, 1, 2, 2, 3),
                 person_number = c(1, 2, 3, 1, 2, 1),
                 age = c(20, 21, 1, 54, 50, 30),
                 sex = c("m", "f", "f", "m", "f", "f"),
                 salary = c(1000, 890, NA, 900, 500, 1200),
                 years_education = c(12, 10, 0, 8, 7, 14))
df2 <- aggregate(df$person_number, by = list(df$home_number), max)
df_final <- df %>%
  left_join(df2, by = c("home_number" = "Group.1"))
home_number person_number age sex salary years_education x
1 1 1 20 m 1000 12 3
2 1 2 21 f 890 10 3
3 1 3 1 f NA 0 3
4 2 1 54 m 900 8 2
5 2 2 50 f 500 7 2
6 3 1 30 f 1200 14 1
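Since the original goal was the mean years of education per home, the same aggregate/left_join pattern covers that too (a sketch built on the example df above):
edu <- aggregate(df$years_education, by = list(df$home_number), mean)
names(edu) <- c("home_number", "mean_education")
df_final2 <- left_join(df, edu, by = "home_number")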

Performing simple lookup using 2 data frames in R

In R, I have two data frames A & B as follows:
Data-Frame A:
Name Age City Gender Income Company ...
JXX 21 Chicago M 20K XYZ ...
CXX 25 NewYork M 30K PQR ...
CXX 26 Chicago M NA ZZZ ...
Data-Frame B:
Age City Gender Avg Income Avg Height Avg Weight ...
21 Chicago M 30K ... ... ...
25 NewYork M 40K ... ... ...
26 Chicago M 50K ... ... ...
I want to fill the missing values in data frame A with values from data frame B.
For example, for the third row in data frame A I can substitute the average income from data frame B for the missing exact income. I don't want to merge these two data frames; instead, I want to perform a lookup-like operation using the Age, City and Gender columns.
library(data.table);
## generate data
set.seed(5L);
NK <- 6L; pA <- 0.8; pB <- 0.2;
keydf <- unique(data.frame(Age=sample(18:65,NK,T),City=sample(c('Chicago','NewYork'),NK,T),Gender=sample(c('M','F'),NK,T),stringsAsFactors=F));
NO <- nrow(keydf)-1L;
Af <- cbind(keydf[-1L,],Name=sample(paste0(LETTERS,LETTERS,LETTERS),NO,T),Income=sample(c(NA,paste0(seq(20L,90L,10L),'K')),NO,T,c(pA,rep((1-pA)/8,8L))),stringsAsFactors=F)[sample(seq_len(NO)),];
Bf <- cbind(keydf[-2L,],`Avg Income`=sample(c(NA,paste0(seq(20L,90L,10L),'K')),NO,T,c(pB,rep((1-pB)/8,8L))),stringsAsFactors=F)[sample(seq_len(NO)),];
At <- as.data.table(Af);
Bt <- as.data.table(Bf);
At;
## Age City Gender Name Income
## 1: 50 NewYork F OOO NA
## 2: 23 Chicago M SSS NA
## 3: 62 NewYork M VVV NA
## 4: 51 Chicago F FFF 90K
## 5: 31 Chicago M XXX NA
Bt;
## Age City Gender Avg Income
## 1: 62 NewYork M NA
## 2: 51 Chicago F 60K
## 3: 31 Chicago M 50K
## 4: 27 NewYork M NA
## 5: 23 Chicago M 60K
I generated some random test data for demonstration purposes. I'm quite happy with the result I got with seed 5, which covers many cases:
one row in A that doesn't join with B (50/NewYork/F).
one row in B that doesn't join with A (27/NewYork/M).
two rows that join and should result in a replacement of NA in A with a non-NA value from B (23/Chicago/M and 31/Chicago/M).
one row that joins but has NA in B, so shouldn't affect the NA in A (62/NewYork/M).
one row that could join, but has non-NA in A, so shouldn't take the value from B (I assumed you would want this behavior) (51/Chicago/F). The value in A (90K) differs from the value in B (60K), so we can verify this behavior.
And I intentionally scrambled the rows of A and B to ensure we join them correctly, regardless of incoming row order.
## data.table solution
keys <- c('Age','City','Gender');
At[is.na(Income),Income:=Bt[.SD,on=keys,`Avg Income`]];
## Age City Gender Name Income
## 1: 50 NewYork F OOO NA
## 2: 23 Chicago M SSS 60K
## 3: 62 NewYork M VVV NA
## 4: 51 Chicago F FFF 90K
## 5: 31 Chicago M XXX 50K
In the above I filter for NA values in A first, then do a join in the j argument on the key columns and assign in-place the source column to the target column using the data.table := syntax.
Note that in the data.table world X[Y] does a right join, so if you want a left join you need to reverse it to Y[X] (with "left" now referring to X, counter-intuitively). That's why I used Bt[.SD] instead of (the likely more natural expectation of) .SD[Bt]. We need a left join on .SD because the result of the join index expression will be assigned in-place to the target column, so the RHS of the assignment must be a full vector corresponding to the target column.
You can repeat the in-place assignment line for each column you want to replace.
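For instance, if Bt also carried an Avg Height column (hypothetical here; the generated test data only has Avg Income), the line would repeat as:
At[is.na(Height),Height:=Bt[.SD,on=keys,`Avg Height`]];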
## base R solution
keys <- c('Age','City','Gender');
m <- merge(cbind(Af[keys],Ai=seq_len(nrow(Af))),cbind(Bf[keys],Bi=seq_len(nrow(Bf))))[c('Ai','Bi')];
m;
## Ai Bi
## 1 2 5
## 2 5 3
## 3 4 2
## 4 3 1
mi <- which(is.na(Af$Income[m$Ai])); Af$Income[m$Ai[mi]] <- Bf$`Avg Income`[m$Bi[mi]];
Af;
## Age City Gender Name Income
## 2 50 NewYork F OOO <NA>
## 5 23 Chicago M SSS 60K
## 3 62 NewYork M VVV <NA>
## 6 51 Chicago F FFF 90K
## 4 31 Chicago M XXX 50K
I guess I was feeling a little bit creative here, so for a base R solution I did something that's probably a little unusual, and which I've never done before. I column-bound a synthesized row index column into the key-column subset of each of the A and B data.frames, then called merge() to join them (note that this is an inner join, since we don't need any kind of outer join here), and extracted just the row index columns that resulted from the join. This effectively precomputes the joined pairs of rows for all subsequent modification operations.
For the modification, I precompute the subset of the join pairs for which the row in A satisfies the replacement condition, e.g. that its Income value is NA for the Income replacement. We can then subset the join pair table for those rows, and do a direct assignment from B to A to carry out the replacement.
As before, you can repeat the assignment line for every column you want to replace.
So I think this works for Income. If there are only those three columns, you could substitute in the names of the other columns:
df1<-read.table(header = T, stringsAsFactors = F, text = "
Name Age City Gender Income Company
JXX 21 Chicago M 20K XYZ
CXX 25 NewYork M 30K PQR
CXX 26 Chicago M NA ZZZ")
df2<-read.table(header = T, stringsAsFactors = F, text = "
Age City Gender Avg_Income
21 Chicago M 30K
25 NewYork M 40K
26 Chicago M 50K ")
# note: this relies on the rows of df1 and df2 lining up one-to-one
df1[is.na(df1$Income), ]$Income <- df2[is.na(df1$Income), ]$Avg_Income
It wouldn't surprise me if one of the regulars has a better way that prevents you from having to re-type the names of the columns.
You can simply use the following to copy the average income for the city from B into the income column in A.
dataFrameA$Income = dataFrameB$`Avg Income`[match(dataFrameA$City, dataFrameB$City)]
You'll have to use backticks (`) if the column name has a space.
This is similar to doing a lookup with INDEX and MATCH in Excel; I'm assuming you're coming from Excel. The code will be more compact if you use data.table.
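Note that the one-liner above matches on City alone and overwrites every Income, not just the missing ones. A sketch of a variant that matches on all three key columns and only fills the NAs (paste() just builds a composite key here):
keyA <- paste(dataFrameA$Age, dataFrameA$City, dataFrameA$Gender)
keyB <- paste(dataFrameB$Age, dataFrameB$City, dataFrameB$Gender)
idx <- match(keyA, keyB)
fill <- is.na(dataFrameA$Income)
dataFrameA$Income[fill] <- dataFrameB$`Avg Income`[idx[fill]]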

Tabulating association frequency counts

I have data which is in this format:
User Item
1 A
1 B
1 C
1 D
2 A
2 C
2 E
What I want to get is a frequency count for each pair of items. Order is not important, so I don't want to count the inverse pairs separately. I want to end up with a result similar to this, where pairs are formed within each user:
Pair Frequency
AB 1
AC 2
AD 1
AE 1
BC 1
BD 1
BE 0
CD 1
CE 1
What tool can I use to formulate this kind of table? I'd prefer some open source solution if possible.
Edit: added an example for my comment below.
I'm reading in the data from a CSV file and removing the factors with these two lines of code:
xa<-read.csv("C:/Direcotry/MyData.csv")
xa<-data.frame(lapply(xa, as.character), stringsAsFactors=FALSE)
User Item
1 394324 Item A
2 124209 Item B
3 212457 Item C
4 427052 Item A
5 118281 Item D
6 156831 Item A
7 212442 Item E
8 156831 Item B
9 212442 Item A
10 177734 Item C
When I try running the suggested answer, I get an error with this result:
Error in combn(x, 2) : n < m
Well, R is open source.
Here's an example based on your tiny sample of data.
First I read your data in by copy-pasting it straight from your post:
> xa=read.table(stdin(),header=TRUE,as.is=TRUE)
0: User Item
1: 1 A
2: 1 B
3: 1 C
4: 1 D
5: 2 A
6: 2 C
7: 2 E
8:
So that's the data in. Then with a couple of lines of code:
> f=function(x) apply(combn(x,2),2,paste0,collapse="")
> table(unlist(tapply(xa$Item,xa$User,f)))
AB AC AD AE BC BD CD CE
1 2 1 1 1 1 1 1
If you need all the empty combinations explicitly as zeroes, it takes another line or two: you need to generate all the possible combinations as factor levels, rather than just the observed ones, and tell table to include the empty ones.
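A sketch of those extra lines, reusing the xa and f defined above:
all_pairs <- apply(combn(sort(unique(xa$Item)), 2), 2, paste0, collapse = "")
observed <- unlist(tapply(xa$Item, xa$User, f))
table(factor(observed, levels = all_pairs))
# AB AC AD AE BC BD BE CD CE DE
#  1  2  1  1  1  1  0  1  1  0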
After some research and suggestions by Glen, I came up with the following code, which gets me a 3-column CSV file with the pair combination plus frequency count. If anyone sees a better way, let me know! But this seems to work.
The errors I was referring to in my follow-up comments were caused by users having purchased at only one location.
library(reshape2)
library(data.table) # needed for data.table() and setkey() below
xa <- read.csv("C:/Input.csv", as.is = TRUE)
xa <- xa[!duplicated(xa), ] # drop duplicate rows
xa <- data.table(xa)
setkey(xa, ContactId, PurchaseLocation) # sort, so pairs come out order-independent
tab <- table(xa$ContactId)
xa <- xa[xa$ContactId %in% names(tab[tab > 1]), ] # keep users with 2+ rows (avoids the combn n < m error)
f <- function(x) apply(combn(x, 2), 2, paste0, collapse = "--")
xb <- as.data.frame(table(unlist(tapply(xa$PurchaseLocation, xa$ContactId, f))))
xc <- with(xb, cbind(Freq, colsplit(xb$Var1, pattern = "--", names = c('a', 'b'))))
xc <- subset(xc, a != b & a != "" & b != "" & Freq > 1)
write.csv(xc, file = "C:/Output.csv")
Edit: I made a small change to make it order-independent by sorting the data table on a key.
