Performing a simple lookup using 2 data frames in R

In R, I have two data frames A & B as follows-
Data-Frame A:
Name Age City Gender Income Company ...
JXX 21 Chicago M 20K XYZ ...
CXX 25 NewYork M 30K PQR ...
CXX 26 Chicago M NA ZZZ ...
Data-Frame B:
Age City Gender Avg Income Avg Height Avg Weight ...
21 Chicago M 30K ... ... ...
25 NewYork M 40K ... ... ...
26 Chicago M 50K ... ... ...
I want to fill missing values in data frame A from data frame B.
For example, for the third row in data frame A, I can substitute the average income from data frame B for the missing exact income. I don't want to merge these two data frames; instead I want to perform a lookup-like operation using the Age, City and Gender columns.

library(data.table);
## generate data
set.seed(5L);
NK <- 6L; pA <- 0.8; pB <- 0.2;
keydf <- unique(data.frame(Age=sample(18:65,NK,T),City=sample(c('Chicago','NewYork'),NK,T),Gender=sample(c('M','F'),NK,T),stringsAsFactors=F));
NO <- nrow(keydf)-1L;
Af <- cbind(keydf[-1L,],Name=sample(paste0(LETTERS,LETTERS,LETTERS),NO,T),Income=sample(c(NA,paste0(seq(20L,90L,10L),'K')),NO,T,c(pA,rep((1-pA)/8,8L))),stringsAsFactors=F)[sample(seq_len(NO)),];
Bf <- cbind(keydf[-2L,],`Avg Income`=sample(c(NA,paste0(seq(20L,90L,10L),'K')),NO,T,c(pB,rep((1-pB)/8,8L))),stringsAsFactors=F)[sample(seq_len(NO)),];
At <- as.data.table(Af);
Bt <- as.data.table(Bf);
At;
## Age City Gender Name Income
## 1: 50 NewYork F OOO NA
## 2: 23 Chicago M SSS NA
## 3: 62 NewYork M VVV NA
## 4: 51 Chicago F FFF 90K
## 5: 31 Chicago M XXX NA
Bt;
## Age City Gender Avg Income
## 1: 62 NewYork M NA
## 2: 51 Chicago F 60K
## 3: 31 Chicago M 50K
## 4: 27 NewYork M NA
## 5: 23 Chicago M 60K
I generated some random test data for demonstration purposes. I'm quite happy with the result I got with seed 5, which covers many cases:
one row in A that doesn't join with B (50/NewYork/F).
one row in B that doesn't join with A (27/NewYork/M).
two rows that join and should result in a replacement of NA in A with a non-NA value from B (23/Chicago/M and 31/Chicago/M).
one row that joins but has NA in B, so shouldn't affect the NA in A (62/NewYork/M).
one row that could join, but has non-NA in A, so shouldn't take the value from B (I assumed you would want this behavior) (51/Chicago/F). The value in A (90K) differs from the value in B (60K), so we can verify this behavior.
And I intentionally scrambled the rows of A and B to ensure we join them correctly, regardless of incoming row order.
## data.table solution
keys <- c('Age','City','Gender');
At[is.na(Income),Income:=Bt[.SD,on=keys,`Avg Income`]];
## Age City Gender Name Income
## 1: 50 NewYork F OOO NA
## 2: 23 Chicago M SSS 60K
## 3: 62 NewYork M VVV NA
## 4: 51 Chicago F FFF 90K
## 5: 31 Chicago M XXX 50K
In the above I filter for NA values in A first, then do a join in the j argument on the key columns and assign in-place the source column to the target column using the data.table := syntax.
Note that in the data.table world X[Y] does a right join, so if you want a left join you need to reverse it to Y[X] (with "left" now referring to X, counter-intuitively). That's why I used Bt[.SD] instead of (the likely more natural expectation of) .SD[Bt]. We need a left join on .SD because the result of the join index expression will be assigned in-place to the target column, and so the RHS of the assignment must be a full vector corresponding row-for-row to the target column.
You can repeat the in-place assignment line for each column you want to replace.
## base R solution
keys <- c('Age','City','Gender');
m <- merge(cbind(Af[keys],Ai=seq_len(nrow(Af))),cbind(Bf[keys],Bi=seq_len(nrow(Bf))))[c('Ai','Bi')];
m;
## Ai Bi
## 1 2 5
## 2 5 3
## 3 4 2
## 4 3 1
mi <- which(is.na(Af$Income[m$Ai])); Af$Income[m$Ai[mi]] <- Bf$`Avg Income`[m$Bi[mi]];
Af;
## Age City Gender Name Income
## 2 50 NewYork F OOO <NA>
## 5 23 Chicago M SSS 60K
## 3 62 NewYork M VVV <NA>
## 6 51 Chicago F FFF 90K
## 4 31 Chicago M XXX 50K
I guess I was feeling a little bit creative here, so for a base R solution I did something that's probably a little unusual, and which I've never done before. I column-bound a synthesized row index column into the key-column subset of each of the A and B data.frames, then called merge() to join them (note that this is an inner join, since we don't need any kind of outer join here), and extracted just the row index columns that resulted from the join. This effectively precomputes the joined pairs of rows for all subsequent modification operations.
For the modification, I precompute the subset of the join pairs for which the row in A satisfies the replacement condition, e.g. that its Income value is NA for the Income replacement. We can then subset the join pair table for those rows, and do a direct assignment from B to A to carry out the replacement.
As before, you can repeat the assignment line for every column you want to replace.
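If there are many columns to fill, the repeated assignment line can be wrapped in a loop over target/source column pairs. The following is only a sketch of that idea on small made-up data: the Height and Avg Height columns and the pairs vector are assumptions for illustration, not part of the original data.

```r
# Toy stand-ins for A and B; Height / Avg Height are hypothetical columns
A <- data.frame(Age = c(21, 25, 26), City = c("Chicago", "NewYork", "Chicago"),
                Gender = "M", Income = c("20K", NA, NA), Height = c(NA, 180, NA),
                stringsAsFactors = FALSE)
B <- data.frame(Age = c(26, 21, 25), City = c("Chicago", "Chicago", "NewYork"),
                Gender = "M", `Avg Income` = c("50K", "30K", NA),
                `Avg Height` = c(175, 170, 178),
                check.names = FALSE, stringsAsFactors = FALSE)

keys <- c("Age", "City", "Gender")
# Precompute the joined row pairs once (inner join on the key columns)
m <- merge(cbind(A[keys], Ai = seq_len(nrow(A))),
           cbind(B[keys], Bi = seq_len(nrow(B))))[c("Ai", "Bi")]

# Hypothetical mapping: target column in A -> source column in B
pairs <- c(Income = "Avg Income", Height = "Avg Height")
for (target in names(pairs)) {
  # Joined pairs whose value in A is missing for this column
  mi <- which(is.na(A[[target]][m$Ai]))
  A[[target]][m$Ai[mi]] <- B[[pairs[[target]]]][m$Bi[mi]]
}
A
```

The join is computed once and reused for every column, so only the NA test and the assignment are repeated.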

So I think this works for Income, provided the rows of the two data frames line up one-to-one as they do in this example. If there are only those 3 columns, you could substitute the names of the other columns in:
df1<-read.table(header = T, stringsAsFactors = F, text = "
Name Age City Gender Income Company
JXX 21 Chicago M 20K XYZ
CXX 25 NewYork M 30K PQR
CXX 26 Chicago M NA ZZZ")
df2<-read.table(header = T, stringsAsFactors = F, text = "
Age City Gender Avg_Income
21 Chicago M 30K
25 NewYork M 40K
26 Chicago M 50K ")
df1[is.na(df1$Income),]$Income<-df2[is.na(df1$Income),]$Avg_Income
It wouldn't surprise me if one of the regulars has a better way that prevents you from having to re-type the names of the columns.

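You can use the following to copy the average income from B into the Income column of A, looking up each row of A by City. Note that as written it matches on City alone and overwrites every Income value, not only the missing ones.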
dataFrameA$Income = dataFrameB$`Avg Income`[match(dataFrameA$City, dataFrameB$City)]
You'll have to use backticks (`) if the column name has a space.
This is similar to a lookup using INDEX and MATCH in Excel (I'm assuming you're coming from Excel). The code will be more compact if you use data.table.
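The same match() idea can be extended to all three key columns and restricted to the missing values. This is a rough sketch, assuming the df1/df2 layout from the previous answer; the rows of df2 are deliberately shuffled here to show that row order does not matter, and the helper names k1, k2, idx and fill are made up for the example.

```r
df1 <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
Name Age City Gender Income Company
JXX 21 Chicago M 20K XYZ
CXX 25 NewYork M 30K PQR
CXX 26 Chicago M NA ZZZ")
df2 <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
Age City Gender Avg_Income
26 Chicago M 50K
21 Chicago M 30K
25 NewYork M 40K")

keys <- c("Age", "City", "Gender")
# Collapse the key columns into one string per row so match() can compare them
k1 <- do.call(paste, c(df1[keys], sep = "\r"))
k2 <- do.call(paste, c(df2[keys], sep = "\r"))
idx <- match(k1, k2)                      # row of df2 for each row of df1, NA if none
fill <- is.na(df1$Income) & !is.na(idx)   # only missing incomes that have a match
df1$Income[fill] <- df2$Avg_Income[idx[fill]]
df1
```

Existing Income values are left alone, and rows of df1 with no counterpart in df2 simply keep their NA.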

Related

R Calculation of sum from the value if the column is NA

I have a CSV with 10 thousand rows of data in 5 columns: Name, Age, Sex, Money and Sum.
I need to loop through the Age column and check whether each value is empty.
If it is empty (NA), I need to perform a calculation on the Money value and put the result in the Sum column.
For example, in the data below, the row for Tan has an Age of NA, so it meets the criterion.
For that row I need to perform the calculation on the number in the Money column and store the result in the Sum column.
Here are the top 5 rows of data.
Name Age Sex Money Sum
Alex 50 M 50 NA
James 20 M 30 NA
Tan NA F 40 NA
Andy 35 M 70 NA
David NA M 60 NA
R code
for (externalitem in df$Age)
{
  if (is.na(externalitem))
  {
    # perform the calculation on the Money column and store the result in the Sum column
  }
}
How can I achieve this? I want to loop through the Age column and check whether each value is empty. If it is, I want to take the value from the Money column, perform the calculation, and write the result to the Sum column.
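A loop is usually not needed for this kind of thing in R, because is.na() works on the whole column at once. The sketch below assumes the five sample rows shown above; since the question does not say what the calculation is, calc() is a hypothetical placeholder that just returns the Money value.

```r
df <- data.frame(Name = c("Alex", "James", "Tan", "Andy", "David"),
                 Age = c(50, 20, NA, 35, NA),
                 Sex = c("M", "M", "F", "M", "M"),
                 Money = c(50, 30, 40, 70, 60),
                 Sum = NA_real_,
                 stringsAsFactors = FALSE)

# Hypothetical placeholder for "the calculation"; swap in the real formula
calc <- function(money) money

missing_age <- is.na(df$Age)                        # TRUE where Age is NA
df$Sum[missing_age] <- calc(df$Money[missing_age])  # fill Sum only for those rows
df
```

Rows with a known Age keep NA in Sum; only the rows for Tan and David are filled.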

Select a dataset based on different column value but in the same row

I have a dataset with around 80 columns and 1000 rows; a sample follows below:
ID gend.y gend.x Sire Dam Weight
1 M F Jim jud 220
2 F F josh linda 198
3 M NA Claude Bere 200
4 F M John Mary 350
5 F F Peter Lucy 298
I need to select all rows where gend.y and gend.x differ, like this:
ID gend.y gend.x Sire Dam Weight
1 M F Jim jud 220
3 M NA Claude Bere 200
4 F M John Mary 350
Remember, I need to select the other 76 columns too.
I tried this command:
library(dplyr)
new.file=my.file %>%
filter(gend.y != gend.x)
But it didn't work, and this message appeared:
Error in Ops.factor(gend.y, gend.x) : level sets of factors are different
As @divibisan said: "Still not a reproducible example, but the error gets you closer. These 2 variables are factors, The interpretation of a factor depends on both the codes and the "levels" attribute. Be careful only to compare factors with the same set of levels (in the same order). You probably want to convert them to character before comparing, or fix the levels to match."
So I did this (converted them to character):
my.file$new.gend.y=as.character(my.file$gend.y)
my.file$new.gend.x=as.character(my.file$gend.x)
Then I ran my previous command with the new variables (now converted to character):
library(dplyr)
new.file=my.file %>%
filter(new.gend.y != new.gend.x | is.na(new.gend.y != new.gend.x))
And now it works as I expected. Credit to @divibisan.
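The is.na() part matters because comparing anything with NA gives NA rather than TRUE, so a row like ID 3 would otherwise be dropped. Here is a small base R sketch of the same logic on the five sample rows (only a few of the columns are reproduced; the variable names y, x and differs are made up for the example).

```r
my.file <- data.frame(ID = 1:5,
                      gend.y = factor(c("M", "F", "M", "F", "F")),
                      gend.x = factor(c("F", "F", NA, "M", "F")),
                      Weight = c(220, 198, 200, 350, 298))

# Compare as character so differing factor level sets cannot cause an error
y <- as.character(my.file$gend.y)
x <- as.character(my.file$gend.x)
differs <- y != x          # NA wherever either value is NA

# Keep rows that differ, plus rows where the comparison is NA; all columns are kept
new.file <- my.file[is.na(differs) | differs, ]
new.file$ID
```

Row indexing keeps every column, so the remaining columns of the real dataset come along without being listed.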

Unable to set column names to a subset of a dataframe

I run the following code, p is the dataframe loaded.
a <- sort(table(p$Title))
a1 <- as.data.frame(a)
tail(a1, 7)
a
Maths 732
Science 737
Physics 737
Chemistry 776
Social Science 905
null 57374
88117
I want to do some manipulations on the above dataframe result. I want to add column names to the dataframe. I tried the colnames function.
colnames(a1) <- c("category", "count")
I get the below error:
Error in `colnames<-`(`*tmp*`, value = c("category", "count")) :
attempt to set 'colnames' on an object with less than two dimensions
Please suggest.
As I said in the comments to your question, the categories are rownames. A reproducible example:
# create dataframe p
x <- c("Maths","Science","Physics","Chemistry","Social Science","Languages","Economics","History")
set.seed(1)
p <- data.frame(title=sample(x, 100, replace=TRUE), y="some arbitrary value")
# create the data.frame as you did
a <- sort(table(p$title))
a1 <- as.data.frame(a)
The resulting dataframe:
> a1
a
Social Science 6
Maths 9
History 10
Science 11
Physics 12
Languages 15
Economics 17
Chemistry 20
Looking at the dimensions of dataframe a1, you get this:
> dim(a1)
[1] 8 1
which means that your dataframe has 8 rows and 1 column. Trying to assign two column names to the a1 dataframe will hence result in an error.
You can solve your problem in two ways:
1: assign just 1 column name with colnames(a1) <- c("count")
2: convert the rownames to a category column and then assign the column names:
a1$category <- row.names(a1)
colnames(a1) <- c("count","category")
The resulting dataframe:
> a1
count category
Social Science 6 Social Science
Maths 9 Maths
History 10 History
Science 11 Science
Physics 12 Physics
Languages 15 Languages
Economics 17 Economics
Chemistry 20 Chemistry
You can remove the rownames with rownames(a1) <- NULL. This gives:
> a1
count category
1 6 Social Science
2 9 Maths
3 10 History
4 11 Science
5 12 Physics
6 15 Languages
7 17 Economics
8 20 Chemistry
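Another option, sketched here on a small made-up title column, is to convert the table itself (before sorting) with as.data.frame(), which returns one column for the categories and one for the counts. Naming the table dimension and passing responseName sets both column names in one step; the sample data below is an assumption for illustration.

```r
p <- data.frame(title = c("Maths", "Science", "Maths", "Physics",
                          "Chemistry", "Maths", "Science"),
                stringsAsFactors = FALSE)

# The dimension name becomes the first column, responseName the second
a1 <- as.data.frame(table(category = p$title), responseName = "count",
                    stringsAsFactors = FALSE)

a1 <- a1[order(a1$count), ]   # sort by count, ascending
rownames(a1) <- NULL          # renumber the rows after sorting
a1
```

Sorting the data frame afterwards avoids depending on how sort() treats a table object in a given R version.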

How to plot subset of a dataframe

I have a data frame listing names, number of names in a specific year. When I subset this to find a specific name, say James, I cannot plot the subset. It is from a dataframe with one column listing names (thousands of them), one listing years, one listing gender (M or F), and one listing number. I split it by gender too. The main dataframe is called df1.
Here are the first ten lines of df1. No column is called years...
Name Gender Number Date
1 Mary F 7065 ob1880
2 Anna F 2604 ob1880
3 Emma F 2003 ob1880
4 Elizabeth F 1939 ob1880
5 Minnie F 1746 ob1880
6 Margaret F 1578 ob1880
7 Ida F 1472 ob1880
8 Alice F 1414 ob1880
9 Bertha F 1320 ob1880
10 Sarah F 1288 ob1880
df.james = subset(df1,df1 =="James")
df.split = split(df.james,df.james$Gender)
df.male = df.split$M
tbl = table(df.male) #this is the bit that doesn't work.
I get the following error:
Error in vector("integer", length) : vector size cannot be NA
In addition: Warning messages:
1: In pd * (as.integer(cat) - 1L) : NAs produced by integer overflow
2: In bin + pd * (as.integer(cat) - 1L) : NAs produced by integer overflow
3: In pd * nl : NAs produced by integer overflow
Also, when I try to tabulate two columns from that subset, it seems to include lots of values from the original data frame.
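Two things in the code above look like likely culprits: df1 == "James" compares every cell of the data frame rather than just the Name column, and table() on a whole data frame cross-tabulates all of its columns. A sketch of one way around both follows, using made-up counts and assuming Date always has the form "ob" followed by a four-digit year.

```r
# Made-up sample in the same layout as df1
df1 <- data.frame(Name = c("Mary", "James", "James", "Mary", "James", "James"),
                  Gender = c("F", "M", "F", "F", "M", "F"),
                  Number = c(100, 90, 5, 95, 80, 4),
                  Date = c("ob1880", "ob1880", "ob1880", "ob1881", "ob1881", "ob1881"),
                  stringsAsFactors = FALSE)

# Filter on the relevant columns instead of comparing the whole data frame
df.male <- subset(df1, Name == "James" & Gender == "M")

# Assumption: Date is "ob" + year, so stripping the prefix gives a numeric year
df.male$Year <- as.integer(sub("^ob", "", df.male$Date))
df.male
```

With one row per year, something like plot(df.male$Year, df.male$Number, type = "b") should then plot the counts directly, with no need for table().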

recoding using R

I have a data set with dam, sire, plus other variables, but I need to recode my dam and sire IDs. The dam column is sorted and each animal appears only once. The sire column, on the other hand, is unsorted and some animals appear more than once.
I would like to start my numbering of dams from 50,000, such that the first animal gets 50001, the second 50002, and so on. I have this script that numbers each dam from 1 to N and am wondering if it can be modified to begin from 50,000.
mydf$dam2 <- as.numeric(factor(paste(mydf$dam,sep="")))
*EDITED
my data set is similar to this but more variables
dam <- c("1M521","1M584","1M790","1M871","1M888","1M933")
sire <- c("1X057","1T456","1W865","1W209","1W209","1W648")
wt <- c(369,300,332,351,303,314)
p2 <- c(NA,16,18,NA,NA,15)
mydf <- data.frame(dam,sire,wt,p2)
For the sire column, I would like to start numbering from 10,000.
Any help would be very much appreciated.
Baz
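Those sire and dam columns are factor variables if the data frame was built with stringsAsFactors = TRUE (the default before R 4.0.0), and in that case you can just add the as.numeric() results to your base number. Wrapping each column in factor() gives the same codes whether the columns are factors or character vectors:
> mydf$dam_n <- 50000 + as.numeric(factor(mydf$dam))
> mydf$sire_n <- 10000 + as.numeric(factor(mydf$sire))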
> mydf
dam sire wt p2 dam_n sire_n
1 1M521 1X057 369 NA 50001 10005
2 1M584 1T456 300 16 50002 10001
3 1M790 1W865 332 18 50003 10004
4 1M871 1W209 351 NA 50004 10002
5 1M888 1W209 303 NA 50005 10002
6 1M933 1W648 314 15 50006 10003
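Factor codes follow the sorted order of the IDs, which is why 1X057 gets 10005 above. If the sires should instead be numbered in order of first appearance, one possible variant (not what the answer above does) is match() against unique():

```r
sire <- c("1X057", "1T456", "1W865", "1W209", "1W209", "1W648")

# unique() keeps first-appearance order; match() returns each ID's position in it
sire_n <- 10000 + match(sire, unique(sire))
sire_n
```

Repeated animals such as 1W209 still share a single number.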
Why not use:
names(mydf$dam2) <- 50000:whatEverYourLengthIs
I am not sure if I understood your data structures completely, but usually the names function is used to set names.
EDIT:
You can use dimnames to name columns and rows.
For example, given a matrix mymatrix that looks like this:
[,1] [,2]
a 1 2
b 4 5
c 7 8
and
dimnames(mymatrix) <- list(c("Jan", "Feb", "Mar"), c("2005", "2006"))
yields
2005 2006
Jan 1 2
Feb 4 5
Mar 7 8
