grepl function in R - r

I have a data set and I want to find the rows which include a specific word "result". I used the following function but it seems it doesn't work correctly. Any suggestion?
data$new<-data.frame(grepl("result",col1))
data:
col1 col2
ABC result VDCbvdc home 22
fgc school 34
university result home exam 45
exam math stat 65

try data$new <- grepl("result",data$col1)
data$new should be assigned to a vector, but you're trying to feed it a data frame. also, col1 only exists inside data, so you'll need data$col1.

Related

R xml2 : How to query only corresponding xml nodes

I'm trying to read and transform many XML files into R data frames (or preferably Tibbles).
All R packages I've tried, unfortunately (XML, flatxml, xmlconvert) failed when I tried to convert the files using built-in functions (e.g. xmltodataframe from the XML Package and xml_to_df from the xmlconvert package), so I have to do it manually with XML2.
Here is my question with a small working example:
# Minimal Working Example
library(tidyverse)
library(xml2)
interimxml <- read_xml("<Subdivision>
<Name>Charles</Name>
<Salary>100</Salary>
<Name>Laura</Name>
<Name>Steve</Name>
<Salary>200</Salary>
</Subdivision>")
names <- xml_text(xml_find_all(interimxml ,"//Subdivision/Name"))
salary <- xml_text(xml_find_all(interimxml ,"//Subdivision/Salary"))
names
salary
# combine in to tibble (doesn't work because of inequal vector lengths)
result <- tibble(names=names,
salary = salary)
result
rbind(names, salary)
From the (made up) XML file you can see that Charles earns 100 dollars, Laura earns nothing ( because of the missing entry, here is the problem) and Steve earns 200 dollars.
What I want xml2 do to is, when querying names and salary nodes is to return an "NA" (or zero which would also be okay), when it finds a name but no corresponding salary entry, so that I would end up a nice table like this:
Name
Salary
Charles
100
Laura
NA
Steve
200
I know that I could modify the "xpath" to only pick up the last value (for Steve), which wouldn't help me, since (in the real data) it could also be the 100th or the 23rd person with missing salary information.
[ I'm aware that Salary Numbers are pulled as character values from the xml file. I would mutate(across(salary, as.double) over columns afterwards.]
Any help is highly appreciated. Thank you very much in advance.
You need to be a bit more careful to match up the names and salaries. Basically first find all the <Name> nodes, then check only if their next sibling is a <Salary> node. If not, then return NA.
nameNodes <- xml_find_all(interimxml ,"//Subdivision/Name")
names <- xml_text(nameNodes)
salary <- map_chr(nameNodes, ~xml_text(xml_find_first(., "./following-sibling::*[1][self::Salary]")))
tibble::tibble(names, salary)
# names salary
# <chr> <chr>
# 1 Charles 100
# 2 Laura NA
# 3 Steve 200

Create a new row to assign M/F to a column based on heading, referencing second table?

I am new to R (and coding in general) and am really stuck on how to approach this problem.
I have a very large data set; columns are sample ID# (~7000 samples) and rows are gene expression (~20,000 genes). Column headings are BIOPSY1-A, BIOPSY1-B, BIOPSY1-C, ..., BIOPSY200-Z. Each number (1-200) is a different patient, and each sample for that patient is a different letter (-A, -Z).
I would like to do some comparisons between samples that came from men and women. Gender is not included in this gene expression table. I have a separate file with patient numbers (BIOPSY1-200) and their gender M/F.
I would like to code something that will look at the column ID (ex: BIOPSY7-A), recognize that it includes "BIOPSY7" (but not == BIOPSY7 because there is BIOPSY7-A through BIOPSY7-Z), find "BIOPSY7" in the reference file, extrapolate M/F, and create a new row with M/F designation.
Honestly, I am so overwhelmed with coding this that I tried to open the file in Excel to manually input M/F, for the 7000 columns as it would probably be faster. However, the file is so large that Excel crashes when it opens.
Any input or resources that would put me on the right path would be extremely appreciated!!
I don't quite know how your data looks like, so I made mine based on your definitions. I'm sure you can modify this answer based on your needs and your dataset structure:
library(data.table)
genderfile <-data.frame("ID"=c("BIOPSY1", "BIOPSY2", "BIOPSY3", "BIOPSY4", "BIOPSY5"),"Gender"=c("F","M","M","F","M"))
#you can just read in your gender file to r with the line below
#genderfile <- read.csv("~/gender file.csv")
View(genderfile)
df<-matrix(rnorm(45, mean=10, sd=5),nrow=3)
colnames(df)<-c("BIOPSY1-A", "BIOPSY1-B", "BIOPSY1-C", "BIOPSY2-A", "BIOPSY2-B", "BIOPSY2-C","BIOPSY3-A", "BIOPSY3-B", "BIOPSY3-C","BIOPSY4-A", "BIOPSY4-B", "BIOPSY4-C","BIOPSY5-A", "BIOPSY5-B", "BIOPSY5-C")
df<-cbind(Gene=seq(1:3),df)
df<-as.data.frame(df)
#you can just read in your main df to r with the line below, fread prevents dashes to turn to period in r, you need data.table package installed and checked in
#df<-fread("~/first file.csv")
View(df)
Note that the following line of code removes the dash and letter from the column names of df (I removed the first column by df[,-c(1)] because it is the Gene id):
substr(x=names(df[,-c(1)]),start=1,stop=nchar(names(df[,-c(1)]))-2)
#[1] "BIOPSY1" "BIOPSY1" "BIOPSY1" "BIOPSY2" "BIOPSY2" "BIOPSY2" "BIOPSY3" "BIOPSY3" "BIOPSY3" "BIOPSY4" "BIOPSY4"
#[12] "BIOPSY4" "BIOPSY5" "BIOPSY5" "BIOPSY5"
Now, we are ready to match the columns of df with the ID in genderfile to get the Gender column:
Gender<-genderfile[, "Gender"][match(substr(x=names(df[,-c(1)]),start=1,stop=nchar(names(df[,-c(1)]))-2), genderfile[,"ID"])]
Gender
#[1] F F F M M M M M M F F F M M M
Last step is to add the Gender defined above as a row to the df:
df_withGender<-rbind(c("Gender", as.character(Gender)), df)
View(df_withGender)

R - Filter any rows and show all columns

I would like to an output that shows the column names that has rows containing a string value. Assume the following...
Animals Sex
I like Dogs Male
I like Cats Male
I like Dogs Female
I like Dogs Female
Data Missing Male
Data Missing Male
I found an SO tread here, David Arenburg provided answer which works very well but I was wondering if it is possible to get an output that doesn't show all the rows. So If I want to find a string "Data Missing" the output I would like to see is...
Animals
Data Missing
or
Animal
TRUE
instead of
Anmials Sex
Data Missing Male
Data Missing Male
I have also found using filters such as df$columnName works but I have big file and a number of large quantity of column names, typing column names would be tedious. Assume string "Data Missing" is also in other columns and there could be different type of strings. So that is why I like David Arenburg's answer, so bear in mind I don't have two columns, as sample given above.
Cheers
One thing you could do is grep for "Data Missing" like this:
x <- apply(data, 2, grep, pattern = "Data Missing")
lapply(x, length) > 1
This will give you the:
Animal
TRUE
result you're after. It's also good because it checks all columns, which you mentioned was something you wanted.
If we want only the first row where it matches, use match
data[match("Data Missing", data$Animals), "Animals", drop = FALSE]
# Animals
#5 Data Missing

Find if a specific choice is in a Data Frame R

I have a Data Frame object which contains a list of possible choices. For example, an analogy of this would be:
FirstName, SurName, Subject, Grade
Brian, Smith, History, 75
Jenny, Jackson, English, 60
How would I...
1) Check to see if a certain pupil-subject combination is in my Data Frame
2) And for those who are there, extract their grade (And potentially other relevant fields)
?
Thanks so much
The only solutions I've found so far include appending the values onto the end of the Data Frame and trying to see if it is unique or not? This seems a crude and ridiculous hack?
learn data subset (extraction) using base R.
To subset any data frame by its rows and column you use [ ]
Let df be your data frame.
FirstName SurName Subject Grade
1 Brian Smith History 75
2 Jenny Jackson English 60
3 Tom Brandon Physics 50
You can subset it by its rows and columns using
df[rows,columns]
Here rows and column can be :
1) Index (Number/Name)
Which means subset that give me that particular row and column like
df[2,3]
this will return second row and third column
[1] English
or
df[2,"Grade"]
returns
[1] 60
2) Range (Indices/List of Names)
Which means subset that give me these rows and columns like
df[1:2,2,drop=F]
Here drop=F to avoid flattening of result and output like a data.frame. It will give you this
SurName
1 Smith
2 Jackson
Range also supports all by leaving either rows or columns empty like
df[,3,drop=F]
this will return all rows for third column
Subject
1 History
2 English
3 Physics
or
df[1:2,c("Grade","Subject")]
Grade Subject
1 75 History
2 60 English
3) Logical
Which means you want to subset using a logical condition.
df[df$FirstName=="Brian",]
meaning give me rows where FirstName is Brian and all columns for it.
FirstName SurName Subject Grade
1 Brian Smith History 75
or
df[df$FirstName=="Brian",1:3]
give me rows where FirstName is Brian and give me only 1 to 3 columns.
or create complex logicals
df[df$FirstName=="Brian" & df$SurName==" Smith",1:3]
output
FirstName SurName Subject
1 Brian Smith History
or complex logical and extract column by name
df[df$FirstName=="Brian" & df$SurName==" Smith","Grade",drop=F]
Grade
1 75
or complex logical and extract multiple columns by name
df[df$FirstName=="Brian" & df$SurName==" Smith",c("Grade","Subject")]
Grade Subject
1 75 History
to use this in a function do
myfunc<-function(input_var1,input_var2,input_var3)
{
df[df$FirstName==input_var1 & df$SurName==input_var2 & df$Subject==input_var3,"Grade",drop=F]
}
run it like this
myfunc("Tom","Brandon","Physics")
I think you are looking for this:
result <- data[data$FirstName == "Brian" & data$Subject == "History", c("Grade") ]
Try subset:
con <- textConnection("FirstName,SurName,Subject,Grade\nBrian,Smith,History,75\nJenny,Jackson,English,60")
dat <- read.csv(con, stringsAsFactors=FALSE)
subset(dat, FirstName=="Brian" & SurName=="Smith" & Subject=="History", Grade)
Maybe aggregate can be helpful, too. The following code gives the mean of the grades for all pupil/subject combinations:
dat <- transform(dat, FullName=paste(FirstName, SurName), stringsAsFactors=FALSE)
aggregate(Grade ~ FullName+Subject, data=dat, FUN=mean)

R: iterating through unique values of a vector in for loop

I'm new to R and I am having some trouble iterating through the unique element of a vector. I have a dataframe "School" with 700 different teachers. Each teacher has around 40 students.
I want to be able to loop through each teacher, create a graphs for the mean score of his/her students' over time, save the graphs in a folder and automatically email that folder to that teacher.
I'm just getting started and am having trouble setting up the for-loop. In Stata, I know how to loop through each unique element in a list, but am having trouble doing that in R. Any help would be appreciated.
School$Teacher School$Student School$ScoreNovember School$ScoreDec School$TeacherEmail
A 1 35 45 A#school.org
A 2 43 65 A#school.org
B 1 66 54 B#school.org
A 3 97 99 A#school.org
C 1 23 45 C#school.org
Your question seems a bit vague and it looks like you want us to write your whole project. Could you share what you have done so far and where exactly you are struggling?
see ?subset
School=data.frame(Teacher=c("A","B"), ScoreNovember=10:11, ScoreDec=13:14)
for (teacher in unique(School$Teacher)) {
teacher_df=subset(School, Teacher==teacher)
MeanScoreNovember=mean(teacher_df$ScoreNovember)
MeanScoreDec =mean(teacher_df$ScoreDec)
# do your plot
# send your email
}
I think you have 3 questions, which will need separate questions, how do I:
Create graphs
Automatically email output
Compute a subset mean based on group
For the 3rd one, I like using the plyr package, other people will recommend data.table or dplyrpackages. You can also use aggregate from base. To get a teacher's mean:
library(plyr)
ddply(School,.(Teacher),summarise,Nov_m=mean(ScoreNovember))
If you want per student per teacher, etc. just add between the columns, like:
library(plyr)
ddply(School,.(Teacher,Student),summarise,Nov_m=mean(ScoreNovember))
You could do that for each score column (and then chart it) if your data was long rather than wide you could also add the date ('November', 'Dec') as a group in the brackets, or:
library(plyr)
ddply(School,.(Teacher,Student),summarise,Nov_m=mean(ScoreNovember),Dec_m=mean(ScoreDec))
See if that helps with the 3rd, but look at splitting your questions up too.

Resources