R - joining/merging rows in one dataset - r

I would like to know how to use R to merge rows in one set of data.
Currently my data looks like this:
Text 1 Text 2 Text 3 Text 4
Bob Aba Abb Abc
Robert Aba Abb Abc
Fred Abd Abe Abf
Martin Abg Abh Abi
If Text 2 and Text 3 are both the same for two rows (as in rows 1 & 2), I would like to merge them into one row, with extra columns for the other data.
Text 1 Text 1a Text 2 Text 3 Text 4 Text 4a
Bob Robert Aba Abb Abc Abd
Fred NA Abd Abe Abf NA
Martin NA Abg Abh Abi NA
I did something similar with joining two separate sets of data and merging them using join
join=join(Data1, Data2, by = c('Text2'), type = "full", match = "all")
but I can't work out how to do it for duplicates within one set of data.
I think it might be possible to use aggregate, but I have not used it before; my attempt was:
MyDataAgg=aggregate(MyData, by=list(MyData$Text1), c)
but when I try I am getting an output that looks like this on summary:
1 -none- numeric
1 -none- numeric
2 -none- numeric
or this on structure:
$ Initials :List of 12505
..$ 1 : int 62
..$ 2 : int 310
..$ 3 : int 504
I would also like to be able to combine rows using matching elements of two variables.

I don't think you can use reshape or aggregate directly because:
You have duplicated rows that correspond to the same key.
You don't have the same number of values for each key: the gaps should be filled with missing values.
Here is a manual attempt using by to process the data by key, and rbind.fill to combine the resulting list. Each by step creates a one-row data.frame having (Text2, Text3) as its key.
do.call(plyr::rbind.fill, by(dat, list(dat$Text2, dat$Text3),
  function(d) {
    ## collapse all non-key columns into a one-row data.frame
    dd <- as.data.frame(as.list(rapply(d[, -c(2, 3)], as.character)))
    ## the tricky part: append 1 to a name like Text1 so it becomes Text11;
    ## this is important for joining the data.frames formed by by
    names(dd) <- gsub('(Text[0-9]$)', '\\11', names(dd))
    ## add the key back to the row
    cbind(unique(d[, 2:3]), dd)
  }))
Text2 Text3 Text11 Text12 Text41 Text42
1 Aba Abb Bob Robert Abc Abd
2 Abd Abe Fred <NA> Abf <NA>
3 Abg Abh Martin <NA> Abi <NA>
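If you'd rather stay in base R, a similar reshape can be sketched with ave() to number the rows within each (Text2, Text3) group and reshape() to spread them into columns. This is only a sketch on a reconstruction of the example data (columns assumed character, not factor):

```r
dat <- data.frame(Text1 = c("Bob", "Robert", "Fred", "Martin"),
                  Text2 = c("Aba", "Aba", "Abd", "Abg"),
                  Text3 = c("Abb", "Abb", "Abe", "Abh"),
                  Text4 = c("Abc", "Abc", "Abf", "Abi"),
                  stringsAsFactors = FALSE)
## number each row within its (Text2, Text3) group: 1, 2, ...
dat$idx <- ave(seq_len(nrow(dat)), dat$Text2, dat$Text3, FUN = seq_along)
## spread the duplicates into extra columns (Text1.1, Text1.2, Text4.1, ...)
wide <- reshape(dat, idvar = c("Text2", "Text3"), timevar = "idx",
                direction = "wide")
```

Groups with only one row get NA in the `.2` columns, which matches the filled-with-missing-values behavior above.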


How to add columns from another data frame where there are multiple matching rows

I'm new to R and I'm stuck.
The problem:
I have two data frames (Graduations and Occupations). I want to match the occupations to the graduations. The difficult part is that one person might be present multiple times in both data frames, and I want to keep all the data.
Example:
Graduations
One person may have finished many curriculums. The original DF has more columns, but they are not relevant for the example.
Person_ID  curriculum_ID  School_ID
    1           100           10
    2           100           10
    2           200           10
    3           300           12
    4           100           10
    4           200           12
Occupations
Not all graduates have jobs. Everyone in the DF should have only one main job (JOB_Type code "1") and can have 0-5 extra jobs (JOB_Type code "0"). The original DF has more columns, but they are not relevant currently.
Person_ID  JOB_ID  JOB_Type
    1       1223      1
    3       3334      1
    3       2120      0
    3       7843      0
    4       4522      0
    4       1240      1
End result:
New DF named "Result" containing the information of all graduations from the first DF(Graduations) and added columns from the second DF (Occupations).
Note that person "2" is not in the Occupations DF. Their data remains but added columns remain empty.
Note that person "3" has multiple jobs and thus extra duplicate rows are added.
Note that in case of person "4" has both multiple jobs and graduations so extra rows were added to fit in all the data.
New DF: "Result"
Person_ID  Curriculum_ID  School_ID  JOB_ID  JOB_Type
    1           100           10      1223       1
    2           100           10
    2           200           10
    3           300           12      3334       1
    3           300           12      2122       0
    3           300           12      7843       0
    4           100           10      4522       0
    4           100           10      1240       1
    4           200           12      4522       0
    4           200           12      1240       1
For me the most difficult part is how to make R add the extra duplicate rows. I looked around for an example or tutorial about something similar but could not find one; probably I did not use the right keywords.
I would be very grateful if you could give me examples of how to code this.
You can use merge like:
merge(Graduations, Occupations, all.x=TRUE)
# Person_ID curriculum_ID School_ID JOB_ID JOB_Type
#1 1 100 10 1223 1
#2 2 100 10 NA NA
#3 2 200 10 NA NA
#4 3 300 12 3334 1
#5 3 300 12 2122 0
#6 3 300 12 7843 0
#7 4 100 10 4522 0
#8 4 100 10 1240 1
#9 4 200 12 4522 0
#10 4 200 12 1240 1
Data:
Graduations <- read.table(header=TRUE, text="Person_ID curriculum_ID School_ID
1 100 10
2 100 10
2 200 10
3 300 12
4 100 10
4 200 12")
Occupations <- read.table(header=TRUE, text="Person_ID JOB_ID JOB_Type
1 1223 1
3 3334 1
3 2122 0
3 7843 0
4 4522 0
4 1240 1")
An option with left_join
library(dplyr)
left_join(Graduations, Occupations)
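As a sanity check on why the extra rows appear: merge() (like left_join()) produces one output row per matching pair of rows, so person 4's two graduations times two jobs give four rows, while person 2's rows are kept with NAs because of all.x = TRUE. A quick base-R sketch reusing the data above:

```r
Graduations <- read.table(header = TRUE, text = "Person_ID curriculum_ID School_ID
1 100 10
2 100 10
2 200 10
3 300 12
4 100 10
4 200 12")
Occupations <- read.table(header = TRUE, text = "Person_ID JOB_ID JOB_Type
1 1223 1
3 3334 1
3 2122 0
3 7843 0
4 4522 0
4 1240 1")
Result <- merge(Graduations, Occupations, by = "Person_ID", all.x = TRUE)
nrow(Result)                                      # 10: 1 + 2 + 3 + 4 rows per person
sum(Result$Person_ID == 4)                        # 4: 2 graduations x 2 jobs
all(is.na(Result$JOB_ID[Result$Person_ID == 2]))  # TRUE: person 2 kept, jobs empty
```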

Select a dataset based on different column value but in the same row

I have a dataset with around 80 columns and 1000 Rows, a sample of this dataset follow below:
ID gend.y gend.x Sire Dam Weight
1 M F Jim jud 220
2 F F josh linda 198
3 M NA Claude Bere 200
4 F M John Mary 350
5 F F Peter Lucy 298
And I need to select all rows where gend.y and gend.x differ, like this:
ID gend.y gend.x Sire Dam Weight
1 M F Jim jud 220
3 M NA Claude Bere 200
4 F M John Mary 350
Remember, I need to select the other 76 columns too.
I tried this command:
library(dplyr)
new.file=my.file %>%
filter(gend.y != gend.x)
But it didn't work, and this message appeared:
Error in Ops.factor(gend.y, gend.x) : level sets of factors are different
As #divibisan said: "Still not a reproducible example, but the error gets you closer. These 2 variables are factors, The interpretation of a factor depends on both the codes and the "levels" attribute. Be careful only to compare factors with the same set of levels (in the same order). You probably want to convert them to character before comparing, or fix the levels to match."
So I did this (convert them to character):
my.file$new.gend.y=as.character(my.file$gend.y)
my.file$new.gend.x=as.character(my.file$gend.x)
And after I ran my previous command with the new variables (now converted to character):
library(dplyr)
new.file=my.file %>%
filter(new.gend.y != new.gend.x | is.na(new.gend.y != new.gend.x))
And now it worked as I expected. Credit to #divibisan.
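For reference, the same filter can be sketched in base R without helper columns, treating two values as different when exactly one is NA, or when both are present but unequal. The data frame below is a made-up reconstruction of the example:

```r
my.file <- data.frame(ID = 1:5,
                      gend.y = factor(c("M", "F", "M", "F", "F")),
                      gend.x = factor(c("F", "F", NA, "M", "F")),
                      Weight = c(220, 198, 200, 350, 298))
## compare as character so differing factor levels don't matter
gy <- as.character(my.file$gend.y)
gx <- as.character(my.file$gend.x)
differs <- is.na(gy) != is.na(gx) |           # exactly one value is NA
  (!is.na(gy) & !is.na(gx) & gy != gx)        # both present but unequal
new.file <- my.file[differs, ]
new.file$ID                                   # 1 3 4, as in the expected output
```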

Performing simple lookup using 2 data frames in R

In R, I have two data frames A & B as follows-
Data-Frame A:
Name Age City Gender Income Company ...
JXX 21 Chicago M 20K XYZ ...
CXX 25 NewYork M 30K PQR ...
CXX 26 Chicago M NA ZZZ ...
Data-Frame B:
Age City Gender Avg Income Avg Height Avg Weight ...
21 Chicago M 30K ... ... ...
25 NewYork M 40K ... ... ...
26 Chicago M 50K ... ... ...
I want to fill missing values in data frame A from data frame B.
For example, for third row in data frame A I can substitute avg income from data frame B instead of exact income. I don't want to merge these two data frames, instead want to perform look-up like operation using Age, City and Gender columns.
library(data.table);
## generate data
set.seed(5L);
NK <- 6L; pA <- 0.8; pB <- 0.2;
keydf <- unique(data.frame(Age=sample(18:65,NK,T),City=sample(c('Chicago','NewYork'),NK,T),Gender=sample(c('M','F'),NK,T),stringsAsFactors=F));
NO <- nrow(keydf)-1L;
Af <- cbind(keydf[-1L,],Name=sample(paste0(LETTERS,LETTERS,LETTERS),NO,T),Income=sample(c(NA,paste0(seq(20L,90L,10L),'K')),NO,T,c(pA,rep((1-pA)/8,8L))),stringsAsFactors=F)[sample(seq_len(NO)),];
Bf <- cbind(keydf[-2L,],`Avg Income`=sample(c(NA,paste0(seq(20L,90L,10L),'K')),NO,T,c(pB,rep((1-pB)/8,8L))),stringsAsFactors=F)[sample(seq_len(NO)),];
At <- as.data.table(Af);
Bt <- as.data.table(Bf);
At;
## Age City Gender Name Income
## 1: 50 NewYork F OOO NA
## 2: 23 Chicago M SSS NA
## 3: 62 NewYork M VVV NA
## 4: 51 Chicago F FFF 90K
## 5: 31 Chicago M XXX NA
Bt;
## Age City Gender Avg Income
## 1: 62 NewYork M NA
## 2: 51 Chicago F 60K
## 3: 31 Chicago M 50K
## 4: 27 NewYork M NA
## 5: 23 Chicago M 60K
I generated some random test data for demonstration purposes. I'm quite happy with the result I got with seed 5, which covers many cases:
one row in A that doesn't join with B (50/NewYork/F).
one row in B that doesn't join with A (27/NewYork/M).
two rows that join and should result in a replacement of NA in A with a non-NA value from B (23/Chicago/M and 31/Chicago/M).
one row that joins but has NA in B, so shouldn't affect the NA in A (62/NewYork/M).
one row that could join, but has non-NA in A, so shouldn't take the value from B (I assumed you would want this behavior) (51/Chicago/F). The value in A (90K) differs from the value in B (60K), so we can verify this behavior.
And I intentionally scrambled the rows of A and B to ensure we join them correctly, regardless of incoming row order.
## data.table solution
keys <- c('Age','City','Gender');
At[is.na(Income),Income:=Bt[.SD,on=keys,`Avg Income`]];
## Age City Gender Name Income
## 1: 50 NewYork F OOO NA
## 2: 23 Chicago M SSS 60K
## 3: 62 NewYork M VVV NA
## 4: 51 Chicago F FFF 90K
## 5: 31 Chicago M XXX 50K
In the above I filter for NA values in A first, then do a join in the j argument on the key columns and assign in-place the source column to the target column using the data.table := syntax.
Note that in the data.table world X[Y] does a right join, so if you want a left join you need to reverse it to Y[X] (with "left" now referring to X, counter-intuitively). That's why I used Bt[.SD] instead of (the likely more natural expectation of) .SD[Bt]. We need a left join on .SD because the result of the join index expression will be assigned in-place to the target column, and so the RHS of the assignment must be a full vector correspondent to the target column.
You can repeat the in-place assignment line for each column you want to replace.
## base R solution
keys <- c('Age','City','Gender');
m <- merge(cbind(Af[keys],Ai=seq_len(nrow(Af))),cbind(Bf[keys],Bi=seq_len(nrow(Bf))))[c('Ai','Bi')];
m;
## Ai Bi
## 1 2 5
## 2 5 3
## 3 4 2
## 4 3 1
mi <- which(is.na(Af$Income[m$Ai])); Af$Income[m$Ai[mi]] <- Bf$`Avg Income`[m$Bi[mi]];
Af;
## Age City Gender Name Income
## 2 50 NewYork F OOO <NA>
## 5 23 Chicago M SSS 60K
## 3 62 NewYork M VVV <NA>
## 6 51 Chicago F FFF 90K
## 4 31 Chicago M XXX 50K
I guess I was feeling a little bit creative here, so for a base R solution I did something that's probably a little unusual, and which I've never done before. I column-bound a synthesized row index column into the key-column subset of each of the A and B data.frames, then called merge() to join them (note that this is an inner join, since we don't need any kind of outer join here), and extracted just the row index columns that resulted from the join. This effectively precomputes the joined pairs of rows for all subsequent modification operations.
For the modification, I precompute the subset of the join pairs for which the row in A satisfies the replacement condition, e.g. that its Income value is NA for the Income replacement. We can then subset the join pair table for those rows, and do a direct assignment from B to A to carry out the replacement.
As before, you can repeat the assignment line for every column you want to replace.
So I think this works for Income. If there are only those 3 columns, you could substitute the names of the other columns in:
df1<-read.table(header = T, stringsAsFactors = F, text = "
Name Age City Gender Income Company
JXX 21 Chicago M 20K XYZ
CXX 25 NewYork M 30K PQR
CXX 26 Chicago M NA ZZZ")
df2<-read.table(header = T, stringsAsFactors = F, text = "
Age City Gender Avg_Income
21 Chicago M 30K
25 NewYork M 40K
26 Chicago M 50K ")
df1[is.na(df1$Income),]$Income<-df2[is.na(df1$Income),]$Avg_Income
It wouldn't surprise me if one of the regulars has a better way that prevents you from having to re-type the names of the columns.
You can simply use the following to update the income in A with the average income of the city from B.
dataFrameA$Income = dataFrameB$`Avg Income`[match(dataFrameA$City, dataFrameB$City)]
You'll have to use backticks (`) if the column name has a space.
This is similar to an INDEX/MATCH lookup in Excel, in case you're coming from there. The code will be more compact if you use data.table.
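Note that the match() call above keys on City alone. If the lookup must respect all three key columns (Age, City, Gender), one base-R variant is to paste the key columns into a single composite key before matching. The frames A and B below are made-up stand-ins for the question's data:

```r
A <- data.frame(Age = c(21, 25, 26),
                City = c("Chicago", "NewYork", "Chicago"),
                Gender = c("M", "M", "M"),
                Income = c("20K", "30K", NA),
                stringsAsFactors = FALSE)
B <- data.frame(Age = c(21, 25, 26),
                City = c("Chicago", "NewYork", "Chicago"),
                Gender = c("M", "M", "M"),
                AvgIncome = c("30K", "40K", "50K"),
                stringsAsFactors = FALSE)
## build one composite key per row so match() compares all three columns at once
keyA <- paste(A$Age, A$City, A$Gender, sep = "|")
keyB <- paste(B$Age, B$City, B$Gender, sep = "|")
idx  <- match(keyA, keyB)
## fill only the missing incomes from the matched rows of B
miss <- is.na(A$Income)
A$Income[miss] <- B$AvgIncome[idx[miss]]
A$Income                                   # "20K" "30K" "50K"
```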

unexpected rbind.fill behavior when combining columns of different class

I tried to use the rbind.fill function from the plyr package to combine two dataframes with a column A, which contains only digits in the first dataframe, but (also) strings in the second dataframe. Reproducible example:
data1 <- data.frame(A=c(11111,22222,33333), b=c(4444,444,44444), c=c(5555,66666,7777))
data2 <- data.frame(A=c(1234,"ss150",123456), c=c(888,777,666))
rbind.fill(data1,data2)
This produced the output below with incorrect data in column A, row 4,5,6. It did not produce an error message.
A b c
1 107778 33434 6
2 1756756 4 7
3 2324234 5 8
4 2 NA 14562
5 3 NA 45613
6 1 NA 14
I had expected that the function would coerce the whole column into character class, or at least display NA or a warning. Instead, it inserted digits that I do not understand (in the actual file, these are two digit numbers that are not sorted). The documentation does not specify that columns must be of the same type in the to-be-combined data.frames.
How can I get this combination?
A b c
1 11111 4444 5555
2 22222 444 66666
3 33333 44444 7777
4 1234 NA 888
5 ss150 NA 777
6 123456 NA 666
Look at class(data2$A). It's a factor, which is stored as an integer vector plus a vector of level labels. Use stringsAsFactors=FALSE in your data.frame creation, or in read.csv and friends. This will force the variables to be either numeric or character vectors.
data1 <- data.frame(A=c(11111,22222,33333), b=c(4444,444,44444), c=c(5555,66666,7777))
data2 <- data.frame(A=c(1234,"ss150",123456), c=c(888,777,666), stringsAsFactors=FALSE)
rbind.fill(data1,data2)
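The mystery numbers are the factor's underlying integer codes leaking through a numeric coercion. A minimal base-R illustration (note that since R 4.0, data.frame() defaults to stringsAsFactors = FALSE, so recent R versions won't reproduce the original symptom):

```r
f <- factor(c("1234", "ss150", "123456"))
## levels sort as strings: "1234" < "123456" < "ss150"
as.integer(f)                  # 1 3 2 -- the underlying codes, not the digits
suppressWarnings(
  as.numeric(as.character(f))  # 1234 NA 123456 -- "ss150" becomes NA
)
```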

Parsing data in R, alternative to rbind() which can be put in "for" loop to write rows to new data table?

Let's say I have a data table called YC that looks like this:
Categories: colsums: tillTF:
ID: cat NA 0
MA NA 0
spayed NA 0
declawed NA 0
black NA 0
3 NA 0
no 57 1
claws NA 0
calico NA 0
4 NA 0
no 42 1
striped NA 0
0.5 NA 0
yes 84 1
not fixed NA 0
declawed NA 0
black NA 0
0.2 NA 0
yes 19 1
0.2 NA 0
yes 104 1
NH NA 0
spayed NA 0
claws NA 0
striped NA 0
12 NA 0
no 17 1
black NA 0
4 NA 0
yes 65 1
ID: DOG NA 0
MA NA 0
...
Only it's 1) not actually a pivot table, it's inconsistently formatted to look like one, and 2) the data is much more complicated and was entered inconsistently over the course of a few decades. The only assumption that can safely be made about the data is that there are 12 variables associated with each record, and they are always entered in the same order.
My goal is to parse this data so that each attribute and its associated numeric record end up in the appropriate columns of a single row, like this:
Cat MA spayed declawed black 3 no 57
Cat MA spayed claws calico 0.5 no 42
Cat MA not fixed declawed black 0.2 yes 19
Cat MA not fixed declawed black 0.2 yes 104
Cat NH spayed claws striped 12 no 17
Cat NH spayed claws black 4 yes 65
Dog MA ....
I've written a for loop which identifies a "record" and then re-writes values in an array by reading backwards up the column in the data table until another "record" is reached. I'm new to R, so I wrote out my ideal loop without knowing whether it was possible.
array <- rep(0, length(7))
for (i in 1:7) {
  if (YC$tillTF[i] == 1) {
    array[7] <- YC$colsums[i]
    array[6] <- YC$Categories[i]
    array[5] <- YC$Categories[i - 1]
    array[4] <- YC$Categories[i - 2]
    array[3] <- YC$Categories[i - 3]
    array[2] <- YC$Categories[i - 4]
    array[1] <- YC$Categories[i - 5]
  }
}
YC_NT <- rbind(array)
Once array is filled in, I want to loop through YC and create a new row in YC_NT for each unique record:
for (i in 8:length(YC$tillTF)) {
  if (YC$tillTF[i] == 1) {
    array[8] <- YC$colsums[i]
    array[7] <- YC$Categories[i]
    if (YC$tillTF[i - 1] == 0) {
      array[6] <- YC$Categories[i - 1]
    } else {
      rbind(array, YC_NT)
    }
    if (YC$tillTF[i - 2] == 0) {
      array[5] <- YC$Categories[i - 2]
    } else {
      rbind(array, YC_NT)
    }
    if (YC$tillTF[i - 3] == 0) {
      array[4] <- YC$Categories[i - 3]
    } else {
      rbind(array, YC_NT)
    }
    if (YC$tillTF[i - 4] == 0) {
      array[3] <- YC$Categories[i - 4]
    } else {
      rbind(array, YC_NT)
    }
    if (YC$tillTF[i - 5] == 0) {
      array[2] <- YC$Categories[i - 5]
    } else {
      rbind(array, YC_NT)
    }
    if (YC$tillTF[i - 6] == 0) {
      array[1] <- YC$Categories[i - 6]
    } else {
      rbind(array, YC_NT)
    }
  } else {
    array <- array
  }
}
When I run this loop within a function on my data, I get my YC_NT data table back containing a single row. After spending a few days searching, I don't know of an R function which could add the vector array to the last row of a data table without giving it a unique name every time. My questions:
1) Is there a function that would add a vector called array to the end of a data table without re-writing the previous row called array?
2) If no such function exists, how could I create a new name for array every time my for loop reaches a new numeric record?
Thanks for your help,
rbind or rbind.fill should do the trick. Alternatively, you can insert a row more efficiently with code such as:
df[nrow(df) + 1,] <- newrow
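A minimal illustration of that indexing trick on a toy data frame (not the question's data):

```r
df <- data.frame(x = 1:2, y = c("a", "b"), stringsAsFactors = FALSE)
newrow <- list(3L, "c")
df[nrow(df) + 1, ] <- newrow   # grows df by one row; no unique name needed
nrow(df)                       # 3
df$y[3]                        # "c"
```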
So I'm going to assume a new record begins every time tillTF = 1, and that the n variables specified for the next subject are just the last n variables, while the previous values all remain the same. I'm also assuming that all records are "complete", in that the last line has tillTF = 1. (To make that last statement true, I removed the last two lines from your sample.)
Here's how I might read the data in
dog <- read.fwf("dog.txt", widths = c(22, 11, 7), skip = 1, stringsAsFactors = FALSE)
dog$V1 <- gsub("\\s{2,}", "", dog$V1)
dog$V2 <- gsub("\\s", "", dog$V2)
dog$V3 <- as.numeric(gsub("\\s", "", dog$V3))
So I read in the data here and strip off the extra spaces. Now I will add an ID column giving each record a unique ID, incrementing that value every time tillTF = 1. Then I'll split the data on that ID value.
dog$ID <- c(0, cumsum(dog$V3[-nrow(dog)]))
dv <- lapply(split(dog, dog$ID), function(x) {
  c(x$V1, x$V2[nrow(x)])
})
Now I'll go through the list with Reduce and each time replace the last n variables with the n variables for a given ID.
trans <- Reduce(function(a, b) {
  a[(length(a) - length(b) + 1):length(a)] <- b
  a
}, dv, accumulate = TRUE)
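The Reduce step can be seen on a toy list of made-up records, where each later element overwrites only the trailing fields of the accumulated record:

```r
toy <- list(c("cat", "MA", "spayed"),  # full first record
            c("not fixed"),            # next record: only the last field changes
            c("NH", "spayed"))         # next record: the last two fields change
trans_toy <- Reduce(function(a, b) {
  a[(length(a) - length(b) + 1):length(a)] <- b  # overwrite trailing fields
  a
}, toy, accumulate = TRUE)
trans_toy[[2]]   # "cat" "MA" "not fixed"
trans_toy[[3]]   # "cat" "NH" "spayed"
```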
Now I'll put all the data together with tabs and then use read.table to process it, do all the proper data conversions, and create a data frame (note the input is trans, the result of the Reduce step):
dd <- read.table(text = sapply(trans, paste0, collapse = "\t"), sep = "\t")
That gives
# print(dd)
V1 V2 V3 V4 V5 V6 V7 V8
1 ID: cat MA spayed declawed black 3.0 no 57
2 ID: cat MA spayed claws calico 4.0 no 42
3 ID: cat MA spayed claws striped 0.5 yes 84
4 ID: cat MA not fixed declawed black 0.2 yes 19
5 ID: cat MA not fixed declawed black 0.2 yes 104
6 ID: cat NH spayed claws striped 12.0 no 17
7 ID: cat NH spayed claws black 4.0 yes 65
So as you can see, I left the "ID:" prefix on, but it should be easy enough to strip off. These commands do the basic reshaping for you. There are fewer arrays, if statements, and rbind calls in this solution, which is nice, but I encourage you to make sure you understand each line if you want to use it.
Also note that my output is slightly different from your expected output: you are missing the "84" value and have the calico row with "42" listed as "0.5" rather than "4.0". So let me know if I was wrong in how I interpreted the data, or perhaps correct the example output.
