Find unique length of one column by matching other columns - r

So I have this data frame in CSV format,
and I would like to know how to find the number of unique lecturer.id values matched by program.id and program.id.ime.
So my outcome should be a variable which would give me the number of unique lecturer.id values teaching English (in my case I can see from the data or picture that this is 10 lecturers), the number of unique lecturer.id values teaching History, and so on. So I would like to generate code that does:
If this lecturer.id matches this program.id, then paste the length of this program.id.ime (which is 10); otherwise paste a different length.
I am thinking in this direction (but it is not what I want):
length(unique(subset(df, lecturer.id==program.id)))
I was thinking of using aggregate, but I need this in a variable that produces different lengths according to program.id and program.id.ime.
A small part of my data frame looks like this:
lecturer.id<- c(111, 111,112,126,127,132,139,143,155)
program.id<- c(35,35,35,35,44,44,44,42,42)
program.id.ime<- c('English', 'English', 'English', 'English',
'History', 'History', 'History', 'Sociology', 'Sociology')
df <- data.frame(lecturer.id, program.id, program.id.ime)
So I know that the lecturer with id 111 is teaching on the program with id 35, and this program's name is English. My outcome should be the number of all lecturers that are teaching English, the number of all lecturers that are teaching History, and so on.
As I am combining R code with LaTeX (Hmisc), my output is a table (because of data confidentiality I deleted some variables):
I would like to generate the number in parentheses, which is an example of the OUTPUT I want. It is important to generate it automatically by matching columns.
The whole point is that I am producing PDF reports for separate lecturers, matching each lecturer with his lecturer.id inside a for-loop. So the output is a PDF report for one lecturer, and in the table in the second picture I need the number of all lecturers on a specific course.

Using the data in the link (changed the file name to 'Miha.csv')
library(data.table) # v1.9.5+
df1 <- read.csv('Miha.csv', sep=';')
Or
df1 <- fread('Miha.csv') #in this case, the object will be `data.table`
setDT(df1)[, list(n= uniqueN(lecturer.id)), .(program.id, program.id.ime)
][, program.id.ime:=sprintf('%s (%d)', program.id.ime, n)][, n:=NULL]
# program.id program.id.ime
#1: 35 English (9)
#2: 44 History (4)
#3: 43 Sociology (8)
#4: 34 Politology (21)
#5: 40 Antropology (62)
#6: 41 Music (65)
#7: 116 Music II (10)
In the dataset, each 'program.id.ime' has a single 'program.id', so
setDT(df1)[, list(program.id.ime=sprintf('%s (%d)',
program.id.ime[1L], uniqueN(lecturer.id))) , .(program.id)]
# program.id program.id.ime
# 1: 35 English (9)
# 2: 44 History (4)
# 3: 43 Sociology (8)
# 4: 34 Politology (21)
# 5: 40 Antropology (62)
# 6: 41 Music (65)
# 7: 116 Music II (10)
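For readers without data.table, the same counts can be produced in base R with aggregate(). This is a sketch using only the small example data from the question, so the counts come out as 3/3/2 rather than the full dataset's 9/4/8; the column name n and the label column are choices made here, not part of the original answer:

```r
lecturer.id    <- c(111, 111, 112, 126, 127, 132, 139, 143, 155)
program.id     <- c(35, 35, 35, 35, 44, 44, 44, 42, 42)
program.id.ime <- c('English', 'English', 'English', 'English',
                    'History', 'History', 'History', 'Sociology', 'Sociology')
df <- data.frame(lecturer.id, program.id, program.id.ime)

# One row per program with the count of distinct lecturers
res <- aggregate(lecturer.id ~ program.id + program.id.ime, data = df,
                 FUN = function(x) length(unique(x)))
names(res)[3] <- 'n'

# Paste the count into the program name, as in the data.table answer
res$label <- sprintf('%s (%d)', res$program.id.ime, res$n)
```

If instead you want the count attached to every row of the original data (handy inside the per-lecturer for-loop), `ave(df$lecturer.id, df$program.id, FUN = function(x) length(unique(x)))` gives the same numbers without collapsing the table.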

Replace id numbers in rows based on the match between two columns

I am dealing with data on club membership where each row represents a club's membership in one of 10 student clubs, and the number of non-empty columns represents the membership "size" of that club. Each non-empty cell of the data frame is filled with a "random number" denoting a student's membership in a club (random numbers were used to suppress their identities).
By default, each club has at least one member but not all students are registered as club members (some have no involvement in any clubs). The data looks like this (the data displayed at below contains only part of the data):
club_id mem1 mem2 mem3 mem4 mem5 mem6 mem7
1 339 520 58
2 700
3 80 434
4 516 811 471
5 20
6 211 80 439 516 305
I want to replace those random numbers with student ids (without revealing their real names) based on the match between the random numbers assigned to them and their student ids; however, only some of the students ids are matched to the random numbers assigned to those students.
I compiled them into a dataframe of 2 columns, which is available here and looks like
match <- read.csv("https://www.dropbox.com/s/nc98i784r91ugin/match.csv?dl=1")
head(match)
id rn
1 1 700
2 2 339
3 3 540
4 4 58
5 5 160
6 6 371
where column rn means random number.
So the tasks I am having trouble with are to
(1) match and replace the random numbers on the dataframe with their corresponding student ids
(2) set those unmatched random number as NA
It will be really appreciated if someone could enlighten me on this.
Not sure if I got the logic right. I replicated only a short version of your initial table and replaced the first number with 1000 (because that is a number that has no matching id).
club2 <- data.frame(club_id = 1:6, mem2 = c(1000, 700, 80, 516, 20, 211))
match <- read.csv("https://www.dropbox.com/s/nc98i784r91ugin/match.csv?dl=1")
Then, for the column mem2, I check if it exists in match$rn. If that is not the case, an NA is inserted. If that is the case, however, it inserts match$id - the one at the position where match$rn is equal to the number in mem2.
club2$mem2 <- ifelse(club2$mem2 %in% match$rn, match$id[match(club2$mem2, match$rn)], NA)
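To extend this to all mem* columns at once, note that match() already returns NA for values with no counterpart, so the ifelse() guard can be dropped entirely. A sketch with small made-up stand-ins for the linked data (the names club and match_df and their contents are hypothetical, since the Dropbox file is not reproduced here):

```r
# Hypothetical stand-ins for the linked data
club <- data.frame(club_id = 1:3,
                   mem1 = c(339, 700, 80),
                   mem2 = c(520, NA, 434))
match_df <- data.frame(id = 1:4, rn = c(700, 339, 540, 58))

# Replace every mem* column in one pass; unmatched random numbers
# become NA automatically, which handles task (2) for free
mem_cols <- grep('^mem', names(club))
club[mem_cols] <- lapply(club[mem_cols],
                         function(x) match_df$id[match(x, match_df$rn)])
```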

Identifying, reviewing, and deduplicating records in R

I'm looking to identify duplicate records in my data set based on multiple columns, review the records, and keep the ones with the most complete data in R. I would like to keep the row(s) associated with each name that have the maximum number of data points populated. In the case of date columns, I would also like to treat invalid dates as missing. My data looks like this:
df<-data.frame(Record=c(1,2,3,4,5),
First=c("Ed","Sue","Ed","Sue","Ed"),
Last=c("Bee","Cord","Bee","Cord","Bee"),
Address=c(123,NA,NA,456,789),
DOB=c("12/6/1995","0056/12/5",NA,"12/5/1956","10/4/1980"))
Record First Last Address DOB
1 Ed Bee 123 12/6/1995
2 Sue Cord 0056/12/5
3 Ed Bee
4 Sue Cord 456 12/5/1956
5 Ed Bee 789 10/4/1980
So in this case I would keep records 1, 4, and 5. There are approximately 85000 records and 130 variables, so if there is a way to do this systematically, I'd appreciate the help. Also, I'm a total R newbie (as if you couldn't tell), so any explanation is also appreciated. Thanks!
#Add a new column to the dataframe containing the number of NA values in each row.
df$nMissing <- apply(df, MARGIN = 1, FUN = function(x) sum(is.na(x)))
#Using ave, find the indices of the rows for each name with min nMissing
#value and use them to filter your data
deduped_df <-
df[which(df$nMissing==ave(df$nMissing,paste(df$First,df$Last),FUN=min)),]
#If you like, remove the nMissing column
df$nMissing<-deduped_df$nMissing<-NULL
deduped_df
Record First Last Address DOB
1 1 Ed Bee 123 12/6/1995
4 4 Sue Cord 456 12/5/1956
5 5 Ed Bee 789 10/4/1980
Edit: Per your comment, if you also want to filter on invalid DOBs, you can start by converting the column to date format, which will automatically treat invalid dates as NA (missing data).
df$DOB<-as.Date(df$DOB,format="%m/%d/%Y")
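With roughly 85,000 rows and 130 columns, rowSums(is.na(.)) is a faster, fully vectorized alternative to apply() for the per-row missing count, and converting DOB first folds the invalid-date rule into the same count. A sketch of the combined pipeline on the question's example data:

```r
df <- data.frame(Record = c(1, 2, 3, 4, 5),
                 First = c("Ed", "Sue", "Ed", "Sue", "Ed"),
                 Last = c("Bee", "Cord", "Bee", "Cord", "Bee"),
                 Address = c(123, NA, NA, 456, 789),
                 DOB = c("12/6/1995", "0056/12/5", NA, "12/5/1956", "10/4/1980"),
                 stringsAsFactors = FALSE)

# Invalid dates like "0056/12/5" fail to parse and become NA
df$DOB <- as.Date(df$DOB, format = "%m/%d/%Y")

# Vectorized missing-value count, then keep the most complete row(s) per name
df$nMissing <- rowSums(is.na(df))
keep <- df$nMissing == ave(df$nMissing, paste(df$First, df$Last), FUN = min)
deduped_df <- df[keep, ]
```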

Randomise 380 samples by covariates across four 96-well plates using OSAT package

I need to randomise 380 samples (by age, sex and group [grp]) across four 96 well plates (8 rows, 12 columns), with A01 reserved in each plate for a positive control.
I tried the r-pkg (OSAT) and the recommended script is below. The only piece that does not work is excluding well A01 from each of the four plates.
library(OSAT)
samples <- read.table("~/file.csv", sep=";", header=T)
head(samples)
grp sex age
1 A F 45
2 A M 56
3 A F 57
4 A M 67
5 A F 45
6 A M 65
sample.list <- setup.sample(samples, optimal = c("grp", "sex", "age"))
excludedWells <- data.frame("plates"= 1:4, chips=rep(1,4), wells=rep(1,4))
container <- setup.container(IlluminaBeadChip96Plate, 4, batch = 'plates')
exclude(container) <- excludedWells
setup <- create.optimized.setup(fun ="optimal.shuffle", sample.list, container)
out <- map.to.MSA(setup, MSA4.plate)
The corresponding R help doc states:
"If for any reason we need to reserve certain wells for other usage, we can exclude them from the sample assignment process. For this one can create a data frame to mark these excluded wells. Any wells in the container can be identified by its location identified by three variable "plates", "chips", "wells". Therefore the data frame for the excluded wells should have these three columns.
For example, if we will use the first well of the first chip on each plate to hold QC samples, these wells will not be available for sample placement. We have 6 plates in our example so the following will reserve the 6 wells from sample assignment:
excludedWells <- data.frame(plates=1:6, chips=rep(1,6), wells=rep(1,6))
Our program can let you exclude multiple wells at the same position of plate/chip. For example, the following data frame will exclude the first well on each chips regardless how many plates we have:
ex2 <- data.frame(wells=1)
I tried both of these and they do not work, as they simply exclude a well at any position (and not specifically well #1, A01).
*Update: I emailed the developer of the package, and he acknowledged the error and provided a workaround, incorporated here (exclude the wells after setting up the container).

Reshape specific rows into columns in R

My sample data frame would look like the following:
1 Number Type Code Reason
2 0123 06 09 010
3 Date Amount Damage Act
4 08/31/16 10,000 Y N
5 State City Zip Phone
6 WI GB 1234 Y
I want to make rows 1, 3, and 5 column names and have the data below each fall into each column, respectively. I was looking into the reshape function, but I only saw examples where an entire column of values needed to be individual columns. So I wasn't sure what to do in this scenario--apologies if it's obvious.
Here is the desired output:
1 Number Type Code Reason Date Amount Damage Act State City Zip Phone
2 0123 06 09 010 08/31/16 10,000 Y N WI GB 1234 Y
Thanks
As some people have commented, you could build a data frame out of the rows of your starting data frame, but I think it's a little easier to work on the lines of text.
If your starting file looks something like this
Number , Type , Code ,Reason
0123 , 06 , 09 , 010
Date , Amount , Damage , Act
08/31/16 , 10000 , Y , N
State , City , Zip , Phone
WI , GB , 1234, Y
we can read it in with each line as an element of a character vector:
lines <- readLines("start.csv")
make all the odd lines into a single line:
oddind <- seq(from=1, to= length(lines), by=2)
namelines <- paste(lines[oddind], collapse=",")
make all the even lines into a single line:
datlines <- paste(lines[oddind+1], collapse=",")
make those lines into a new CSV to read:
writeLines(text= c(namelines, datlines), con= "nice.csv")
print(read.csv("nice.csv"))
This gives
Number Type Code Reason Date Amount Damage Act State
1 123 6 9 10 08/31/16 10000 Y N WI
City Zip Phone
1 GB 1234 Y
So, it's all in one row of the data frame and all the variable names show up correctly in the data frame.
The benefits of this strategy are:
It will work for starting CSV files where the number of variables isn't a multiple of 4.
It will work for starting CSV files with any number of rows
There is no chance of weird dynamic casting happening with unlist() or as.character().
Start by creating a data frame roughly like the one shown (although it necessarily has column names). The columns are probably factors if you used one of the standard read.* functions without stringsAsFactors=FALSE, hence the need to convert with as.character:
dat=read.table(text="1 Number Type Code Reason
2 0123 06 09 010
3 Date Amount Damage Act
4 08/31/16 10,000 Y N
5 State City Zip Phone
6 WI GB 1234 Y")
Then you can set odd number rows as names of the values-vector of the even number rows with:
setNames( unlist( lapply( dat[!c(TRUE,FALSE), ] ,as.character)),
unlist( lapply( dat[c(TRUE,FALSE), ] ,as.character)) )
1 3 5 Number Date State Type
"2" "4" "6" "0123" "08/31/16" "WI" "06"
Amount City Code Damage Zip Reason Act
"10,000" "GB" "09" "Y" "1234" "010" "N"
Phone
"Y"
The !c(TRUE,FALSE) and its logical complement in the next extraction get recycled along all the rows. Obviously there would be better ways of doing this if you had posted a version of the text file rather than saying that the starting point was a dataframe; you would also need to remove what were probably rownames. If you want a "clean" solution, post either dput(.) from your dataframe or the raw text file.
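A third option, closer in spirit to the setNames() approach but yielding a proper one-row data frame: treat the table as a character matrix and bind the odd rows (names) to the even rows (values). This sketch assumes the data are read with colClasses = "character" so values like "0123" keep their leading zero; the names m, vals, and out are choices made here:

```r
dat <- read.table(text = "Number Type Code Reason
0123 06 09 010
Date Amount Damage Act
08/31/16 10,000 Y N
State City Zip Phone
WI GB 1234 Y", colClasses = "character")

m <- as.matrix(dat)

# c(t(x)) flattens a matrix row by row, so odd rows become the names
# and even rows become the values, in matching order
vals <- c(t(m[c(FALSE, TRUE), ]))
out <- as.data.frame(t(vals), stringsAsFactors = FALSE)
names(out) <- c(t(m[c(TRUE, FALSE), ]))
```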

Comparing multiple columns in different data sets to find values within range R

I have two datasets: one called domain (d), which has general information about a gene, and a table called mutation (m). Both tables have a similar column called Gene.name, which I'll use to look things up. The two datasets do not have the same number of columns or rows.
I want to go through all the data in the file mutation and check whether the value in column Gene.name also exists in the file domain. If it does, I want to check whether the value in column Mutation is between the columns Start and End (it can be equal to Start or End). If it is, I want to print it to a new table with the merged columns Gene.name, Mutation, and the domain information. If it doesn't exist, ignore it.
So this is what I have so far:
d<-read.table("domains.txt")
d
Gene.name Domain Start End
ABCF1 low_complexity_region 2 13
DKK1 low_complexity_region 25 39
ABCF1 AAA 328 532
F2 coiled_coil_region 499 558
m<-read.table("mutations.txt")
m
Gene.name Mutation
ABCF1 10
DKK1 21
ABCF1 335
xyz 15
F2 499
newfile<-m[, list(new=findInterval(d(c(d$Start,
d$End)),by'=Gene.Name']
My code isn't working and I'm reading a lot of different questions/answers and I'm much more confused. Any help would be great.
I'd like my final data to look like this:
Gene.name Mutation Domain
DKK1 21 low_complexity_region
ABCF1 335 AAA
F2 499 coiled_coil_region
A merge and subset should get you there (though I think your intended result doesn't match your description of what you want):
result <- merge(d,m,by="Gene.name")
result[with(result,Mutation >= Start & Mutation <= End),]
# Gene.name Domain Start End Mutation
#1 ABCF1 low_complexity_region 2 13 10
#4 ABCF1 AAA 328 532 335
#6 F2 coiled_coil_region 499 558 499
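Since a merge first builds every Gene.name pairing before filtering, a non-equi join does the range test during the join and can be lighter on large tables. A sketch using data.table (requires version 1.9.8 or later; the data below just re-create the question's example, and note that DKK1/21 drops out here too, consistent with the merge answer):

```r
library(data.table)

d <- data.table(Gene.name = c("ABCF1", "DKK1", "ABCF1", "F2"),
                Domain = c("low_complexity_region", "low_complexity_region",
                           "AAA", "coiled_coil_region"),
                Start = c(2, 25, 328, 499),
                End   = c(13, 39, 532, 558))
m <- data.table(Gene.name = c("ABCF1", "DKK1", "ABCF1", "xyz", "F2"),
                Mutation  = c(10, 21, 335, 15, 499))

# Join on gene name plus the Start/End range; unmatched rows are dropped.
# In the result, the Start column holds the matched Mutation value.
res <- d[m, on = .(Gene.name, Start <= Mutation, End >= Mutation), nomatch = 0]
out <- res[, .(Gene.name, Mutation = Start, Domain)]
```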
