How to add columns from another data frame where there are multiple matching rows - r

I'm new to R and I'm stuck.
NB! I'm sorry, I could not figure out how to add more than one space between numbers and headers in my example, so I used "_" instead.
The problem:
I have two data frames (Graduations and Occupations). I want to match the occupations to the graduations. The difficult part is that one person might be present multiple times in both data frames and I want to keep all the data.
Example:
Graduations
One person may have finished many curriculums. The original DF has more columns, but they are not relevant for the example.
Person_ID__curriculum_ID__School_ID
___1___________100__________10
___2___________100__________10
___2___________200__________10
___3___________300__________12
___4___________100__________10
___4___________200__________12
Occupations
Not all graduates have jobs. Everyone in the DF should have only one main job (JOB_Type code "1") and can have 0-5 extra jobs (JOB_Type code "0"). The original DF has more columns, but they are not relevant currently.
Person_ID___JOB_ID_____JOB_Type
___1_________1223________1
___3_________3334________1
___3_________2122________0
___3_________7843________0
___4_________4522________0
___4_________1240________1
End result:
New DF named "Result" containing the information of all graduations from the first DF(Graduations) and added columns from the second DF (Occupations).
Note that person "2" is not in the Occupations DF. Their data remains but added columns remain empty.
Note that person "3" has multiple jobs and thus extra duplicate rows are added.
Note that person "4" has both multiple jobs and multiple graduations, so extra rows were added to fit in all the data.
New DF: "Result"
Person_ID__Curriculum_ID__School_ID___JOB_ID____JOB_Type
___1___________100__________10_________1223________1
___2___________100__________10
___2___________200__________10
___3___________300__________12_________3334________1
___3___________300__________12_________2122________0
___3___________300__________12_________7843________0
___4___________100__________10_________4522________0
___4___________100__________10_________1240________1
___4___________200__________12_________4522________0
___4___________200__________12_________1240________1
For me the most difficult part is how to make R add the extra duplicate rows. I looked around for an example or tutorial about something similar but could not find one. Probably I did not use the right keywords.
I will be very grateful if you could give me examples of how to code it.

You can use merge like:
merge(Graduations, Occupations, all.x=TRUE)
# Person_ID curriculum_ID School_ID JOB_ID JOB_Type
#1 1 100 10 1223 1
#2 2 100 10 NA NA
#3 2 200 10 NA NA
#4 3 300 12 3334 1
#5 3 300 12 2122 0
#6 3 300 12 7843 0
#7 4 100 10 4522 0
#8 4 100 10 1240 1
#9 4 200 12 4522 0
#10 4 200 12 1240 1
Data:
Graduations <- read.table(header=TRUE, text="Person_ID curriculum_ID School_ID
1 100 10
2 100 10
2 200 10
3 300 12
4 100 10
4 200 12")
Occupations <- read.table(header=TRUE, text="Person_ID JOB_ID JOB_Type
1 1223 1
3 3334 1
3 2122 0
3 7843 0
4 4522 0
4 1240 1")

An option with left_join
library(dplyr)
left_join(Graduations, Occupations)
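Both calls above infer the join key from the column name the two data frames share. Here is a minimal base R sketch, assuming the example data from the first answer (rebuilt here with data.frame() so the block runs on its own), that names the key explicitly. That may matter for the real data frames, which the question says have more columns: any other column name they happen to share would otherwise become part of the key.

```r
# Example data from the question, rebuilt so the block is self-contained
Graduations <- data.frame(
  Person_ID     = c(1, 2, 2, 3, 4, 4),
  curriculum_ID = c(100, 100, 200, 300, 100, 200),
  School_ID     = c(10, 10, 10, 12, 10, 12)
)
Occupations <- data.frame(
  Person_ID = c(1, 3, 3, 3, 4, 4),
  JOB_ID    = c(1223, 3334, 2122, 7843, 4522, 1240),
  JOB_Type  = c(1, 1, 0, 0, 0, 1)
)

# Left join on Person_ID only: every graduation row is kept (all.x = TRUE),
# and it is repeated once for each matching job of that person
Result <- merge(Graduations, Occupations, by = "Person_ID", all.x = TRUE)
Result
```

The dplyr equivalent would be left_join(Graduations, Occupations, by = "Person_ID").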

Related

Delete rows when certain factor is present more than 200 times

I have a dataset with over 400,000 cows. These cows are (unevenly) spread over 2355 herds. Some herds are only present once in the data, while one herd is present 2033 times, meaning that 2033 cows belong to this herd. I want to delete herds from my data that occur fewer than 200 times.
Using plyr and subset, I can obtain a list of which herds occur fewer than 200 times; however, I cannot find out how to apply this selection to the full dataset.
For example, my current data looks a little like:
cow herd
1 1
2 1
3 1
4 2
5 3
6 4
7 4
8 4
With function count() I can obtain the following:
x freq
1 3
2 1
3 1
4 3
Say I want to delete the data belonging to herds that occur less than 3 times, I want my data to look like this eventually:
cow herd
1 1
2 1
3 1
6 4
7 4
8 4
I do know how to tell R to delete data herd by herd. However, since over 1000 herds occur fewer than 200 times in my real dataset, I would have to type every herd number into my script one by one. I am sure there is an easier and quicker way of asking R to delete data above or below a certain occurrence count.
I hope my explanation is clear and someone can help me, thanks in advance!
Use n() together with group_by():
library(dplyr)
your_data %>%
group_by(herd) %>%
filter(n() >= 3)
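If dplyr is not available, the same filter can be sketched in base R. This assumes the small example from the question (the name herds is made up for illustration) and uses ave() to attach each herd's size to every row:

```r
# Example data from the question
herds <- data.frame(cow = 1:8, herd = c(1, 1, 1, 2, 3, 4, 4, 4))

# For every row, the number of rows that share its herd value
herd_size <- ave(seq_along(herds$herd), herds$herd, FUN = length)

# Keep only the cows whose herd occurs at least 3 times
kept <- herds[herd_size >= 3, ]
kept
```

For the real data the threshold would be 200 instead of 3.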

Is there an R function to redefine a variable so I can use the spread function?

I'm new to R and I have the following problem. Maybe it's a really easy question, but I don't know the terms to search for an answer.
My problem:
I have several persons, each person is assigned a studynumber (SN). And each SN has one or more tests being performed, the test can have multiple results.
My data is long at the moment, but I need it to be wide (one row for each SN).
For example:
What I have:
SN testnumbers result
1 1 1234 6
2 1 1234 9
3 2 4567 6
4 3 5678 9
5 3 8790 9
What I want:
SN test1result1 test1result2 test2result1
1 1 6 9 NA
2 2 6 NA NA
3 3 9 NA 9
So I need to renumber the testnumbers into test 1 etc for each SN, in order to use the spread function, I think. But I don't know how.
I did manage to renumber testnumber into a list of 1 till the last unique testnumber, but still the wide dataframe looks awful.
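One possible way to do the renumbering, sketched in base R rather than with spread(). It assumes the sample rows above; the helper columns test, res and key are names made up for this illustration. The idea is to number the tests within each SN, number the results within each SN/test pair, and build the wide column name from both.

```r
# Sample data from the question
long <- data.frame(
  SN          = c(1, 1, 2, 3, 3),
  testnumbers = c(1234, 1234, 4567, 5678, 8790),
  result      = c(6, 9, 6, 9, 9)
)

# Number the distinct tests within each SN: 1 for the first test seen, 2 for the next, ...
long$test <- ave(long$testnumbers, long$SN,
                 FUN = function(x) match(x, unique(x)))

# Number the results within each SN/test pair
long$res <- ave(long$result, long$SN, long$test, FUN = seq_along)

# Column name for the wide layout, e.g. "test1result2"
long$key <- paste0("test", long$test, "result", long$res)

# Long to wide: one row per SN, one column per key, NA where nothing was measured
wide <- reshape(long[, c("SN", "key", "result")],
                idvar = "SN", timevar = "key", direction = "wide")
names(wide) <- sub("^result\\.", "", names(wide))
wide
```

With the sample rows this gives one row per SN and the columns test1result1, test1result2 and test2result1. The same key column could also be passed to spread() instead of reshape().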

making a table with multiple columns in r

I'm obviously a novice at writing R code.
I have tried multiple solutions to my problem from Stack Overflow, but I'm still stuck.
My dataset is carcinoid: patients with a small bowel cancer, with multiple variables.
I would like to know how the different variables are distributed:
carcinoid$met_any - with metastatic disease 1=yes, 2=no (computed variable)
carcinoid$liver_mets_y_n - liver metastases 1=yes, 2=no
carcinoid$regional_lymph_nodes_y_n - regional lymph nodes 1=yes, 2=no
carcinoid$peritoneal_carcinosis_y_n - peritoneal carcinosis 1=yes, 2=no
I have tried this solution, which is close to my wanted result:
ddply(carcinoid, .(carcinoid$met_any), summarize,
livermetastases=sum(carcinoid$liver_mets_y_n=="1"),
regionalmets=sum(carcinoid$regional_lymph_nodes_y_n=="1"),
pc=sum(carcinoid$peritoneal_carcinosis_y_n=="1"))
with the result being:
carcinoid$met_any livermetastases regionalmets pc
1 1 21 46 7
2 2 21 46 7
Now, I expected the row with 2 (= no metastases) to be empty. I would also like the rows in the column carcinoid$met_any to give the number of patients.
If someone could help me it would be very much appreciated!
John
Edit
My dataset, although in the real data the column numbers are 1, 43, 28, 31, 33
1 = yes, 2 = no
case_nr met_any liver_mets_y_n regional_lymph_nodes_y_n pc
1 1 1 1 2
2 1 2 1 2
3 2 2 2 2
4 1 2 1 1
5 1 2 1 1
Desired output: I want to count the number of 1s and 2s; if it works, all 1s should end up in the met_any=1 row.
nr liver_mets regional_lymph_nodes pc
met_any=1 4 1 4 2
met_any=2 1 4 1 3
EDIT
Although I was probably very unclear in my question, with your help I could make the table I needed!
setDT(carcinoid)[,lapply(.SD,table),.SDcols=c(43,28,31,33,17)]
gives
met_any lymph_nod liver_met paraortal extrahep
1: 50 46 21 6 15
2: 111 115 140 151 146
I am very grateful! @mtoto provided the solution.
John
Based on your example data, this data.table approach works:
library(data.table)
setDT(df)[,lapply(.SD,table),.SDcols=c(2:5)]
# met_any liver_mets_y_n regional_lymph_nodes_y_n pc
# 1: 4 1 4 2
# 2: 1 4 1 3
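For comparison, a base R sketch of the same per-column count, assuming the five example rows from the question's edit. Fixing the factor levels to 1 and 2 is an addition not in the original answer: it keeps both rows in the result even if some column never contains one of the two codes.

```r
# Example rows from the question's edit
df <- data.frame(
  case_nr                  = 1:5,
  met_any                  = c(1, 1, 2, 1, 1),
  liver_mets_y_n           = c(1, 2, 2, 2, 2),
  regional_lymph_nodes_y_n = c(1, 1, 2, 1, 1),
  pc                       = c(2, 2, 2, 1, 1)
)

# Count the 1s and 2s in each column; row "1" = yes, row "2" = no
counts <- sapply(df[, 2:5], function(col) table(factor(col, levels = c(1, 2))))
counts
```

The result is a matrix with one column per variable, matching the desired output in the question.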

Printing only certain panels in R lattice

I am plotting a quantile-quantile plot for a certain data that I have. I would like to print only certain panels that satisfy a condition that I put in for panel.qq(x,y,...).
Let me give you an example. The following is my code,
qq(y ~ x|cond,data=test.df,panel=function(x,y,subscripts,...){
if(length(unique(test.df[subscripts,2])) > 3 ){panel.qq(x,y,subscripts,...)}})
Here y is the factor and x is the variable that will be plotted on the x and y axes. cond is the conditioning variable. What I would like is that only those panels are printed that pass the condition in the panel function, which is
if(length(unique(test.df[subscripts,2])) > 3).
I hope this information helps. Thanks in advance.
Added Sample data,
y x cond
1 1 6 125
2 2 5 125
3 1 5 125
4 2 6 125
5 1 3 125
6 2 8 125
7 1 8 125
8 2 3 125
9 1 5 125
10 2 6 125
11 1 5 124
12 2 6 124
13 1 6 124
14 2 5 124
15 1 5 124
16 2 6 124
17 1 4 124
18 2 7 124
19 1 0 123
20 2 11 123
21 1 0 123
22 2 11 123
23 1 0 123
24 2 11 123
25 1 0 123
26 2 11 123
27 1 0 123
28 2 2 123
So this is the sample data. What I would like is to not have a panel for 123, as the number of unique values for 123 is 3, while for the others it's 4. Thanks again.
Yeah, I think it is a subset problem, not a lattice one. You don't include an example, but it looks like you want to keep only rows where there are more than 3 rows for each value of whatever is in column 2 of your data frame. If so, here is a data.table solution.
library(data.table)
test.dt <- as.data.table(test.df)
test.dt.subset <- test.dt[,N:=.N,by=c2][N>3]
Where c2 is that variable in the second column. The last line of code first adds a variable, N, for the count of rows (.N) for each value of c2, then subsets for N>3.
UPDATE: And since a data table is also a data frame, you can use test.dt.subset directly as the data source in the call to qq (or other lattice function).
UPDATE 2: Here is one way to do the same thing without data.table:
d <- data.frame(x=1:15,y=1:15%%2, # example data frame
c2=c(1,2,2,3,3,3,4,4,4,4,5,5,5,5,5))
d$N <- 1 # create a column for count
split(d$N,d$c2) <- lapply(split(d$x,d$c2),length) # populate with count
d
d[d$N>3,] # subset
I did something very similar to DaveTurek.
My sample dataframe above is test.df
test.df.list <- split(test.df,test.df$cond,drop=F)
final.test.df <- do.call("rbind",lapply(test.df.list,function(r){
if(length(unique(r$x)) > 3){r}}))
So, here I split test.df into a list of data frames by the conditioning variable. Next, in the lapply, I check the number of unique values in each subset data frame. If this number is greater than 3, the data frame is returned; if not, it is dropped (the function returns NULL, which rbind ignores). Finally, do.call binds all the data frames back into one big data frame to run the quantile-quantile plot on.
In case anyone wants to know the qq function call after getting the filtered data, it is:
trellis.device(postscript,file="test.ps",color=F,horizontal=T,paper='legal')
qq(y ~ x|cond,data=final.test.df,layout=c(1,1),pch=".",cex=3)
dev.off()
Hope this helps.
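The same filter can also be sketched without splitting and re-binding, assuming the sample data from the question (rebuilt here so the block runs on its own). ave() attaches the number of distinct x values per cond to every row, and the rows are then subset directly:

```r
# Sample data from the question: 28 rows, cond 125, 124 and 123
test.df <- data.frame(
  y    = rep(1:2, 14),
  x    = c(6, 5, 5, 6, 3, 8, 8, 3, 5, 6,
           5, 6, 6, 5, 5, 6, 4, 7,
           0, 11, 0, 11, 0, 11, 0, 11, 0, 2),
  cond = rep(c(125, 124, 123), times = c(10, 8, 10))
)

# For every row, the number of distinct x values within its cond group
n_unique <- ave(test.df$x, test.df$cond, FUN = function(v) length(unique(v)))

# Keep only the groups with more than 3 distinct x values
final.test.df <- test.df[n_unique > 3, ]
unique(final.test.df$cond)
```

The filtered data frame can then be passed to qq() as in the call above, so no panel is drawn for 123.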

R - create new vectors based on elements of existing vector

Thanks in advance for your help. I am very new to R and am having some trouble with code that, to me, looks like it should work but doesn't. I have a data frame like the one below:
studentID classNumber classRating
7 1 4
7 2 4
7 4 3
79 1 5
79 2 3
116 1 5
116 2 4
134 1 5
134 3 5
134 4 5
And I want it to read like this:
Student ID class1 class2 class3 class4
7 4 4 NA 3
79 5 3 NA NA
116 5 4 NA NA
134 5 NA 5 5
I've tried to piece together different things that I've come across and it seemed like the best approach was to create a new data frame and matrix and then populate it from the current data frame. I came up with the broken code below:
classRatings = data.frame(matrix(NA,4,5))
for(i in 1:nrow(classDB)){
#Find ratings by each student
rowsToReplace = classDB$studentID==classRatings$studentID[i]
#Make a row for each unique studentID in classRatings
classDB$studentID[rowsToReplace] = classRatings$studentID[i]
#for each studentID, find put the given rating for each unique class into
#it's own vector
for(j in classDB$classNumber){
if(classDB$classNumber==1){classRatings$class1==classDB$classRating}[j]
if(classDB$classNumber==2){classRatings$class2==classDB$classRating}[j]
if(classDB$classNumber==3){classRatings$class3==classDB$classRating}[j]
if(classDB$classNumber==4){classRatings$class4==classDB$classRating}[j]
if(classDB$classNumber==5){classRatings$class5==classDB$classRating}[j]
}
}
I'm getting an error that says:
the condition has length > 1 and only the first element will be used
and figuring it out is beyond my skill level. Any help is appreciated.
The tidyr package can spread this long table into a wider one:
library(tidyr)
spread(classDB,classNumber,classRating,fill=NA)
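The message in the question comes from if(), which expects a single TRUE or FALSE, while classDB$classNumber==1 is a whole vector of comparisons. If the original plan of filling an empty matrix is preferred, here is a minimal base R sketch of it without loops, assuming the sample data above (students, classes and ratings are names made up for this illustration):

```r
# Sample data from the question
classDB <- data.frame(
  studentID   = c(7, 7, 7, 79, 79, 116, 116, 134, 134, 134),
  classNumber = c(1, 2, 4, 1, 2, 1, 2, 1, 3, 4),
  classRating = c(4, 4, 3, 5, 3, 5, 4, 5, 5, 5)
)

students <- unique(classDB$studentID)
classes  <- sort(unique(classDB$classNumber))

# Empty (all NA) matrix: one row per student, one column per class
ratings <- matrix(NA_real_, nrow = length(students), ncol = length(classes),
                  dimnames = list(as.character(students), paste0("class", classes)))

# Fill every (student, class) cell in one step using a two-column index matrix
ratings[cbind(match(classDB$studentID, students),
              match(classDB$classNumber, classes))] <- classDB$classRating
ratings
```

Cells with no rating stay NA, as in the desired layout, and the columns are named class1 to class4 rather than 1 to 4.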
