How to plot subset of a dataframe - r

I have a data frame listing names, number of names in a specific year. When I subset this to find a specific name, say James, I cannot plot the subset. It is from a dataframe with one column listing names (thousands of them), one listing years, one listing gender (M or F), and one listing number. I split it by gender too. The main dataframe is called df1.
Here is the fist ten lines from the df1. No column is called years...
Name Gender Number Date
1 Mary F 7065 ob1880
2 Anna F 2604 ob1880
3 Emma F 2003 ob1880
4 Elizabeth F 1939 ob1880
5 Minnie F 1746 ob1880
6 Margaret F 1578 ob1880
7 Ida F 1472 ob1880
8 Alice F 1414 ob1880
9 Bertha F 1320 ob1880
10 Sarah F 1288 ob1880
df.james = subset(df1,df1 =="James")
df.split = split(df.james,df.james$Gender)
df.male = df.split$M
tbl = table(df.male) #this is the bit that doesn't work.
I get the following error:
Error in vector("integer", length) : vector size cannot be NA
In addition: Warning messages:
1: In pd * (as.integer(cat) - 1L) : NAs produced by integer overflow
2: In bin + pd * (as.integer(cat) - 1L) : NAs produced by integer overflow
3: In pd * nl : NAs produced by integer overflow
Also, when I try to tabulate two columns from that subset, it seems to include lots of values from the original data frame.

Related

Select a dataset based on different column value but in the same row

I have a dataset with around 80 columns and 1000 Rows, a sample of this dataset follow below:
ID gend.y gend.x Sire Dam Weight
1 M F Jim jud 220
2 F F josh linda 198
3 M NA Claude Bere 200
4 F M John Mary 350
5 F F Peter Lucy 298
And I need select all information that are different between gend.y and gend.x, like this:
ID gend.y gend.x Sire Dam Weight
1 M F Jim jud 220
3 M NA Claude Bere 200
4 F M John Mary 350
Remember, I need to select the another 76 columns too.
I tried this command:
library(dplyr)
new.file=my.file %>%
filter(gend.y != gend.x)
But don't worked. And this message appears:
Error in Ops.factor(gend.y, gend.x) : level sets of factors are different
As #divibisan said: "Still not a reproducible example, but the error gets you closer. These 2 variables are factors, The interpretation of a factor depends on both the codes and the "levels" attribute. Be careful only to compare factors with the same set of levels (in the same order). You probably want to convert them to character before comparing, or fix the levels to match."
So I did this (convert them to character):
my.file$new.gend.y=as.character(my.file$gend.y)
my.file$new.gend.x=as.character(my.file$gend.x)
And after I ran my previous command with the new variables (now converted to character):
library(dplyr)
new.file=my.file %>%
filter(new.gend.y != new.gend.x | is.na(new.gend.y != new.gend.x))
And now worked as I expected. Credits #divibisan

Error in merge large data tables in R

I have two data tables.
Table 1: 1349445 rows and 21 cols
Table 2: 3235 rows x 4 cols
Table 1:
YEAR STATE_NAME CROP .......
1990 Alabama Cotton
1990 Alabama Cotton
1990 Alabama Peanuts
.
.
.
Table 2:
STATE STATEFP COUNTYFP STATE_NAME
AK 2 13 Alaska
AK 2 16 Alaska
AK 2 20 Alaska
AK 2 50 Alaska
I want to merge the two tables by "STATE_NAME"
Table 1 <- data.table(Table 1)
Table 2 <- data.table(Table 2)
setkeyv(Table 1, c("STATE_NAME"))
setkeyv(Table 2, c("STATE_NAME"))
Hydra_merge <- merge(Table 1, Table 2, all.x = TRUE)
I am getting the below error. Can somebody help me to figure out what I am doing wrong here.
Thanks in advance.
Error in vecseq(f__, len__, if (allow.cartesian || notjoin || !anyDuplicated(f__, :
Join results in 141691725 rows; more than 1352680 = nrow(x)+nrow(i). Check for duplicate key values in i each of which join to the same group in x over and over again. If that's ok, try by=.EACHI to run j for each group to avoid the large allocation. If you are sure you wish to proceed, rerun with allow.cartesian=TRUE. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and datatable-help for advice.
I am not sure why nobody answered this yet, and probably this will be useless for OP, but this is quite straightforward!
As the error message states, you have plenty of rows in both tables with repeated keys. If you have two tables with, say, 5 and 6 rows, and the keys are unique, their join will have at least 5 and at most 11 rows (depending on whether all.x, all.y or all) is true.
If, instead, in both tables all rows have the same key, joining them will result in a table with 30 kinda meaningless rows.
as in:
table_1: table_2:
key val1 key val2
k a k 1
k b k 2
k c k 3
k d k 4
k e k 5
k 6
merge(table_1, table_2)
key val1 val2
k a 1
k a 2
k a 3
k a 4
... ...
k c 2
k c 3
k c 4
k c 5
... ...
k e 3
k e 4
k e 5
k e 6
data.table noticed and it's trying to help you. Which is also why it states If you are sure you wish to proceed, rerun with allow.cartesian=TRUE and go home with your, likely wrong but who am I to tell, cartesian product of the two tables.
Now, I am very tempted to try and guess the size of your two tables, given that the sum of their nrows is 1.352.680, the resulting mess of a table has 141.691.725 and the states are 50 (but one of the tables skips Alaska), but maybe next time.

Performing simple lookup using 2 data frames in R

In R, I have two data frames A & B as follows-
Data-Frame A:
Name Age City Gender Income Company ...
JXX 21 Chicago M 20K XYZ ...
CXX 25 NewYork M 30K PQR ...
CXX 26 Chicago M NA ZZZ ...
Data-Frame B:
Age City Gender Avg Income Avg Height Avg Weight ...
21 Chicago M 30K ... ... ...
25 NewYork M 40K ... ... ...
26 Chicago M 50K ... ... ...
I want to fill missing values in data frame A from data frame B.
For example, for third row in data frame A I can substitute avg income from data frame B instead of exact income. I don't want to merge these two data frames, instead want to perform look-up like operation using Age, City and Gender columns.
library(data.table);
## generate data
set.seed(5L);
NK <- 6L; pA <- 0.8; pB <- 0.2;
keydf <- unique(data.frame(Age=sample(18:65,NK,T),City=sample(c('Chicago','NewYork'),NK,T),Gender=sample(c('M','F'),NK,T),stringsAsFactors=F));
NO <- nrow(keydf)-1L;
Af <- cbind(keydf[-1L,],Name=sample(paste0(LETTERS,LETTERS,LETTERS),NO,T),Income=sample(c(NA,paste0(seq(20L,90L,10L),'K')),NO,T,c(pA,rep((1-pA)/8,8L))),stringsAsFactors=F)[sample(seq_len(NO)),];
Bf <- cbind(keydf[-2L,],`Avg Income`=sample(c(NA,paste0(seq(20L,90L,10L),'K')),NO,T,c(pB,rep((1-pB)/8,8L))),stringsAsFactors=F)[sample(seq_len(NO)),];
At <- as.data.table(Af);
Bt <- as.data.table(Bf);
At;
## Age City Gender Name Income
## 1: 50 NewYork F OOO NA
## 2: 23 Chicago M SSS NA
## 3: 62 NewYork M VVV NA
## 4: 51 Chicago F FFF 90K
## 5: 31 Chicago M XXX NA
Bt;
## Age City Gender Avg Income
## 1: 62 NewYork M NA
## 2: 51 Chicago F 60K
## 3: 31 Chicago M 50K
## 4: 27 NewYork M NA
## 5: 23 Chicago M 60K
I generated some random test data for demonstration purposes. I'm quite happy with the result I got with seed 5, which covers many cases:
one row in A that doesn't join with B (50/NewYork/F).
one row in B that doesn't join with A (27/NewYork/M).
two rows that join and should result in a replacement of NA in A with a non-NA value from B (23/Chicago/M and 31/Chicago/M).
one row that joins but has NA in B, so shouldn't affect the NA in A (62/NewYork/M).
one row that could join, but has non-NA in A, so shouldn't take the value from B (I assumed you would want this behavior) (51/Chicago/F). The value in A (90K) differs from the value in B (60K), so we can verify this behavior.
And I intentionally scrambled the rows of A and B to ensure we join them correctly, regardless of incoming row order.
## data.table solution
keys <- c('Age','City','Gender');
At[is.na(Income),Income:=Bt[.SD,on=keys,`Avg Income`]];
## Age City Gender Name Income
## 1: 50 NewYork F OOO NA
## 2: 23 Chicago M SSS 60K
## 3: 62 NewYork M VVV NA
## 4: 51 Chicago F FFF 90K
## 5: 31 Chicago M XXX 50K
In the above I filter for NA values in A first, then do a join in the j argument on the key columns and assign in-place the source column to the target column using the data.table := syntax.
Note that in the data.table world X[Y] does a right join, so if you want a left join you need to reverse it to Y[X] (with "left" now referring to X, counter-intuitively). That's why I used Bt[.SD] instead of (the likely more natural expectation of) .SD[Bt]. We need a left join on .SD because the result of the join index expression will be assigned in-place to the target column, and so the RHS of the assignment must be a full vector correspondent to the target column.
You can repeat the in-place assignment line for each column you want to replace.
## base R solution
keys <- c('Age','City','Gender');
m <- merge(cbind(Af[keys],Ai=seq_len(nrow(Af))),cbind(Bf[keys],Bi=seq_len(nrow(Bf))))[c('Ai','Bi')];
m;
## Ai Bi
## 1 2 5
## 2 5 3
## 3 4 2
## 4 3 1
mi <- which(is.na(Af$Income[m$Ai])); Af$Income[m$Ai[mi]] <- Bf$`Avg Income`[m$Bi[mi]];
Af;
## Age City Gender Name Income
## 2 50 NewYork F OOO <NA>
## 5 23 Chicago M SSS 60K
## 3 62 NewYork M VVV <NA>
## 6 51 Chicago F FFF 90K
## 4 31 Chicago M XXX 50K
I guess I was feeling a little bit creative here, so for a base R solution I did something that's probably a little unusual, and which I've never done before. I column-bound a synthesized row index column into the key-column subset of each of the A and B data.frames, then called merge() to join them (note that this is an inner join, since we don't need any kind of outer join here), and extracted just the row index columns that resulted from the join. This effectively precomputes the joined pairs of rows for all subsequent modification operations.
For the modification, I precompute the subset of the join pairs for which the row in A satisfies the replacement condition, e.g. that its Income value is NA for the Income replacement. We can then subset the join pair table for those rows, and do a direct assignment from B to A to carry out the replacement.
As before, you can repeat the assignment line for every column you want to replace.
So I think this works for Income. If there are only those 3 columns, you could substitute the names of the other columns in:
df1<-read.table(header = T, stringsAsFactors = F, text = "
Name Age City Gender Income Company
JXX 21 Chicago M 20K XYZ
CXX 25 NewYork M 30K PQR
CXX 26 Chicago M NA ZZZ")
df2<-read.table(header = T, stringsAsFactors = F, text = "
Age City Gender Avg_Income
21 Chicago M 30K
25 NewYork M 40K
26 Chicago M 50K ")
df1[is.na(df1$Income),]$Income<-df2[is.na(df1$Income),]$Avg_Income
It wouldn't surprise me if one of the regulars has a better way that prevents you from having to re-type the names of the columns.
You can simply use the following to update the average income of the city from B to the income in A.
dataFrameA$Income = dataFrameB$`Avg Income`[match(dataFrameA$City, dataFrameB$City)]
you'll have to use "`" if the column name has a space
this is similar to using a lookup using index and match in excel. I'm assuming you're coming from excel. The code will be more compact if you use data.table

Unable to set column names to a subset of a dataframe

I run the following code, p is the dataframe loaded.
a <- sort(table(p$Title))
a1 <- as.data.frame(a)
tail(a1, 7)
a
Maths 732
Science 737
Physics 737
Chemistry 776
Social Science 905
null 57374
88117
I want to do some manipulations on the above dataframe result. I want to add column names to the dataframe. I tried the colnames function.
colnames(a1) <- c("category", "count")
I get the below error:
Error in `colnames<-`(`*tmp*`, value = c("category", "count")) :
attempt to set 'colnames' on an object with less than two dimensions
Please suggest.
As I said in the comments to your question, the categories are rownames. A reproducible example:
# create dataframe p
x <- c("Maths","Science","Physics","Chemistry","Social Science","Languages","Economics","History")
set.seed(1)
p <- data.frame(title=sample(x, 100, replace=TRUE), y="some arbitrary value")
# create the data.frame as you did
a <- sort(table(p$title))
a1 <- as.data.frame(a)
The resulting dataframe:
> a1
a
Social Science 6
Maths 9
History 10
Science 11
Physics 12
Languages 15
Economics 17
Chemistry 20
Looking at the dimensions of dataframe a1, you get this:
> dim(a1)
[1] 8 1
which means that your dataframe has 8 rows and 1 column. Trying to assign two columnnames to the a1 dataframe will hence result in an error.
You can solve your problem in two ways:
1: assign just 1 columnname with colnames(a1) <- c("count")
2: convert the rownames to a category column and then assign the columnnames:
a1$category <- row.names(a1)
colnames(a1) <- c("count","category")
The resulting dataframe:
> a1
count category
Social Science 6 Social Science
Maths 9 Maths
History 10 History
Science 11 Science
Physics 12 Physics
Languages 15 Languages
Economics 17 Economics
Chemistry 20 Chemistry
You can remove the rownames with rownames(a1) <- NULL. This gives:
> a1
count category
1 6 Social Science
2 9 Maths
3 10 History
4 11 Science
5 12 Physics
6 15 Languages
7 17 Economics
8 20 Chemistry

Convert data frame of redundant frequencies

I have a data.frame like so:
category count
A 11
B 1
C 45
A 1003
D 20
B 207
E 634
E 40
A 42
A 7
B 44
B 12
Each row represents a specific element with a category type and a count of that element. I would like to produce a frequency distribution of counts per category, but the categories are at the moment redundant.
How do I retrieve a table of redundant category counts? i.e. I want a table that looks like:
category count
A 11234
B 4005
C 100023
D 65567
E 54654
... ...
I almost got there using lapply:
df.nrcounts <- lapply(unique(df.counts$category),
function(x) c(category=x, count=sum(subset(df.counts, category==x)$count)))
but I can't seem to coerce the output to a proper dataframe. I can't quite get my head around using the function.
aggregate(df.counts$count,by=list(df.counts$category),FUN=sum)
Or
library(data.table)
setDT(df.counts)[, list(count=sum(count)), by = category]

Resources