R - create new vectors based on elements of existing vector - r

and thanks in advance for your help. I am very new to R and am having some trouble with code that, to me looks like it should work, but isn't. I have a data frame like the one below:
studentID classNumber classRating
7 1 4
7 2 4
7 4 3
79 1 5
79 2 3
116 1 5
116 2 4
134 1 5
134 3 5
134 4 5
And I want it to read like this:
Student ID class1 class2 class3 class4
7 4 4 NA 3
79 5 3 NA NA
116 5 4 NA NA
134 5 NA 5 5
I've tried to piece together different things that I've come across and it seemed like the best approach was to create a new data frame and matrix and then populate it from the current data frame. I came up with the broken code below:
classRatings = data.frame(matrix(NA,4,5))
for(i in 1:nrow(classDB)){
#Find ratings by each student
rowsToReplace = classDB$studentID==classRatings$studentID[i]
#Make a row for each unique studentID in classRatings
classDB$studentID[rowsToReplace] = classRatings$studentID[i]
#for each studentID, find put the given rating for each unique class into
#it's own vector
for(j in classDB$classNumber){
if(classDB$classNumber==1){classRatings$class1==classDB$classRating}[j]
if(classDB$classNumber==2){classRatings$class2==classDB$classRating}[j]
if(classDB$classNumber==3){classRatings$class3==classDB$classRating}[j]
if(classDB$classNumber==4){classRatings$class4==classDB$classRating}[j]
if(classDB$classNumber==5){classRatings$class5==classDB$classRating}[j]
}
}
I'm getting an error that says:
the condition has length > 1 and only the first element will be used
and I am beyond my skill level to figure it out. Any help is appreciated.

The tidyr package can spread this long table into a wider one:
library(tidyr)
spread(classDB,classNumber,classRating,fill=NA)

Related

How to add columns from another data frame where there are multible matching rows

I'm new to R and I'm stuck.
NB! I'm sorry I could not figure out how to add more than 1 space between numbers and headers in my example so i used "_" instead.
The problem:
I have two data frames (Graduations and Occupations). I want to match the occupations to the graduations. The difficult part is that one person might be present multiple times in both data frames and I want to keep all the data.
Example:
Graduations
One person may have finished many curriculums. Original DF has more columns but they are not relevant for the example.
Person_ID__curriculum_ID__School ID
___1___________100__________10
___2___________100__________10
___2___________200__________10
___3___________300__________12
___4___________100__________10
___4___________200__________12
Occupations
Not all graduates have jobs, everyone in the DF should have only one main job (JOB_Type code "1") and can have 0-5 extra jobs (JOB_Type code "0"). Original DF has more columns but the are not relevant currently.
Person_ID___JOB_ID_____JOB_Type
___1_________1223________1
___3_________3334________1
___3_________2120________0
___3_________7843________0
___4_________4522________0
___4_________1240________1
End result:
New DF named "Result" containing the information of all graduations from the first DF(Graduations) and added columns from the second DF (Occupations).
Note that person "2" is not in the Occupations DF. Their data remains but added columns remain empty.
Note that person "3" has multiple jobs and thus extra duplicate rows are added.
Note that in case of person "4" has both multiple jobs and graduations so extra rows were added to fit in all the data.
New DF: "Result"
Person_ID__Curriculum_ID__School_ID___JOB_ID____JOB_Type
___1___________100__________10_________1223________1
___2___________100__________10
___2___________200__________10
___3___________300__________12_________3334________1
___3___________300__________12_________2122________0
___3___________300__________12_________7843________0
___4___________100__________10_________4522________0
___4___________100__________10_________1240________1
___4___________200__________12_________4522________0
___4___________200__________12_________1240________1
For me the most difficult part is how to make R add extra duplicate rows. I looked around to find an example or tutorial about something similar but could. Probably I did not use the right key words.
I will be very grateful if you could give me examples of how to code it.
You can use merge like:
merge(Graduations, Occupations, all.x=TRUE)
# Person_ID curriculum_ID School_ID JOB_ID JOB_Type
#1 1 100 10 1223 1
#2 2 100 10 NA NA
#3 2 200 10 NA NA
#4 3 300 12 3334 1
#5 3 300 12 2122 0
#6 3 300 12 7843 0
#7 4 100 10 4522 0
#8 4 100 10 1240 1
#9 4 200 12 4522 0
#10 4 200 12 1240 1
Data:
Graduations <- read.table(header=TRUE, text="Person_ID curriculum_ID School_ID
1 100 10
2 100 10
2 200 10
3 300 12
4 100 10
4 200 12")
Occupations <- read.table(header=TRUE, text="Person_ID JOB_ID JOB_Type
1 1223 1
3 3334 1
3 2122 0
3 7843 0
4 4522 0
4 1240 1")
An option with left_join
library(dplyr)
left_join(Graduations, Occupations)

Is there an R function to redefine a variable so I can use the spread function?

I'm new with R and I have the following problem. Maybe it's a really easy question, but I don't know the terms to search for an answer.
My problem:
I have several persons, each person is assigned a studynumber (SN). And each SN has one or more tests being performed, the test can have multiple results.
My data is long at the moment, but I need it to be wide (one row for each SN).
For example:
What I have:
SN testnumbers result
1 1 1234 6
2 1 1234 9
3 2 4567 6
4 3 5678 9
5 3 8790 9
What I want:
SN test1result1 test1result2 test2result1
1 1 6 6 NA
2 2 6 NA NA
3 3 9 NA 9
So I need to renumber the testnumbers into test 1 etc for each SN, in order to use the spread function, I think. But I don't know how.
I did manage to renumber testnumber into a list of 1 till the last unique testnumber, but still the wide dataframe looks awful.

Frequency distribution using binCounts

I have a dataset of Ages for the customer and I wanted to make a frequency distribution by 9 years of a gap of age.
Ages=c(83,51,66,61,82,65,54,56,92,60,65,87,68,64,51,
70,75,66,74,68,44,55,78,69,98,67,82,77,79,62,38,88,76,99,
84,47,60,42,66,74,91,71,83,80,68,65,51,56,73,55)
My desired outcome would be similar to below-shared table, variable names can be differed(as you wish)
Could I use binCounts code into it ? if yes could you help me out using the code as not sure of bx and idxs in this code?
binCounts(x, idxs = NULL, bx, right = FALSE) ??
Age Count
38-46 3
47-55 7
56-64 7
65-73 14
74-82 10
83-91 6
92-100 3
Much Appreciated!
I don't know about the binCounts or even the package it is in but i have a bare r function:
data.frame(table(cut(Ages,0:7*9+37)))
Var1 Freq
1 (37,46] 3
2 (46,55] 7
3 (55,64] 7
4 (64,73] 14
5 (73,82] 10
6 (82,91] 6
7 (91,100] 3
To exactly duplicate your results:
lowerlimit=c(37,46,55,64,73,82,91,101)
Labels=paste(head(lowerlimit,-1)+1,lowerlimit[-1],sep="-")#I add one to have 38 47 etc
group=cut(Ages,lowerlimit,Labels)#Determine which group the ages belong to
tab=table(group)#Form a frequency table
as.data.frame(tab)# transform the table into a dataframe
group Freq
1 38-46 3
2 47-55 7
3 56-64 7
4 65-73 14
5 74-82 10
6 83-91 6
7 92-100 3
All this can be combined as:
data.frame(table(cut(Ages,s<-0:7*9+37,paste(head(s+1,-1),s[-1],sep="-"))))

Printing only certain panels in R lattice

I am plotting a quantile-quantile plot for a certain data that I have. I would like to print only certain panels that satisfy a condition that I put in for panel.qq(x,y,...).
Let me give you an example. The following is my code,
qq(y ~ x|cond,data=test.df,panel=function(x,y,subscripts,...){
if(length(unique(test.df[subscripts,2])) > 3 ){panel.qq(x,y,subscripts,...})})
Here y is the factor and x is the variable that will be plotted on X and y axis. Cond is the conditioning variable. What I would like is, only those panels be printed that pass the condition in the panel function, which is
if(length(unique(test.df[subscripts,2])) > 3).
I hope this information helps. Thanks in advance.
Added Sample data,
y x cond
1 1 6 125
2 2 5 125
3 1 5 125
4 2 6 125
5 1 3 125
6 2 8 125
7 1 8 125
8 2 3 125
9 1 5 125
10 2 6 125
11 1 5 124
12 2 6 124
13 1 6 124
14 2 5 124
15 1 5 124
16 2 6 124
17 1 4 124
18 2 7 124
19 1 0 123
20 2 11 123
21 1 0 123
22 2 11 123
23 1 0 123
24 2 11 123
25 1 0 123
26 2 11 123
27 1 0 123
28 2 2 123
So this is the sample data. What I would like is to not have a panel for 123 as the number of unique values for 123 is 3, while for others its 4. Thanks again.
Yeah, I think it is a subset problem, not a lattice one. You don't include an example, but it looks like you want to keep only rows where there are more than 3 rows for each value of whatever is in column 2 of your data frame. If so, here is a data.table solution.
library(data.table)
test.dt <- as.data.table(test.df)
test.dt.subset <- test.dt[,N:=.N,by=c2][N>3]
Where c2 is that variable in the second column. The last line of code first adds a variable, N, for the count of rows (.N) for each value of c2, then subsets for N>3.
UPDATE: And since a data table is also a data frame, you can use test.dt.subset directly as the data source in the call to qq (or other lattice function).
UPDATE 2: Here is one way to do the same thing without data.table:
d <- data.frame(x=1:15,y=1:15%%2, # example data frame
c2=c(1,2,2,3,3,3,4,4,4,4,5,5,5,5,5))
d$N <- 1 # create a column for count
split(d$N,d$c2) <- lapply(split(d$x,d$c2),length) # populate with count
d
d[d$N>3,] # subset
I did something very similar to DaveTurek.
My sample dataframe above is test.df
test.df.list <- split(test.df,test.df$cond,drop=F)
final.test.df <- do.call("rbind",lapply(test.df.list,function(r){
if(length(unique(r$x)) > 3){r}})
So, here I am breaking the test.df as a list of data.frames by the conditioning variable. Next, in the lapply I am checking the number of unique values in each of subset dataframe. If this number is greater than 3 then the dataframe is given /taken back if not it is ignored. Next, a do.call to bind all the dfs back to one big df to run the quantile quantile plot on it.
In case anyone wants to know the qq function call after getting the specific data. then it is,
trellis.device(postscript,file="test.ps",color=F,horizontal=T,paper='legal')
qq(y ~ x|cond,data=final.test.df,layout=c(1,1),pch=".",cex=3)
dev.off()
Hope this helps.

Short(er) notation of selecting a part of a data.frame or other objects in R

I always get angry at my R code when I have to process dataframes, i.e. filtering out certain rows. The code gets very illegible as I tend to choose meaningful, but long, names for my objects. An example:
all.mutations.extra.large.name <- read.delim(filename)
head(all.mutations.extra.large.name)
id gene pos aa consequence V
ENSG00000105732 ZN574_HUMAN 81 x/N missense_variant 3
ENSG00000125879 OTOR_HUMAN 7 V/3 missense_variant 2
ENSG00000129194 SOX15_HUMAN 20 N/T missense_variant 3
ENSG00000099204 ABLM1_HUMAN 33 H/R missense_variant 2
ENSG00000103335 PIEZ1_HUMAN 11 Q/R missense_variant 3
ENSG00000171533 MAP6_HUMAN 39 A/G missense_variant 3
all.mutations.extra.large.name <- all.mutations.extra.large.name[which(all.mutations.extra.large.name$gene == ZN574_HUMAN)]
So in order to kick out all other lines in which I am not interested I need to reference 3 times the object all.mutations.extra.large.name. And reating this kind of step for different columns makes the code really difficult to understand.
Therefore my question: Is there a way to filter out rows by a criterion without referencing the object 3 times. Something like this would be beautiful: myobj[,gene=="ZN574_HUMAN"]
You can use subset for that:
subset(all.mutations.extra.large.name, gene == "ZN574_HUMAN")
Several options:
all.mutations.extra.large.name <- data.frame(a=1:5, b=2:6)
within(all.mutations.extra.large.name, a[a < 3] <- 0)
a b
1 0 2
2 0 3
3 3 4
4 4 5
5 5 6
transform(all.mutations.extra.large.name, b = b^2)
a b
1 1 4
2 2 9
3 3 16
4 4 25
5 5 36
Also check ?attach if you would like to avoid repetitive typing like all.mutations.extra.large.name$foo.

Resources