I import data from a single IMPORTXML and divide it into two columns as follows:
=ArrayFormula(IFERROR(HLOOKUP(1,{1;IMPORTXML(A1,"
//table[@class='table squad sortable']//td[@class='photo']/a/img/@src |
//table[@class='table squad sortable']//td[@class='name large-link']/a/@href")},
(ROW(A:A)+1)*2-TRANSPOSE(sort(ROW($A$1:$A$2)+0,1,0)))))
I would like to know if there is a way to avoid more than one call to IMPORTXML while wrapping only the first column in the IMAGE function, so I don't have to add it separately as I currently do.
link to Spreadsheet:
https://docs.google.com/spreadsheets/d/10_G3QjcLcE9dYz_0m8PjPxHDAAaKL7BGEehTfmprYTQ/edit?usp=sharing
The best you can do with one formula would be:
=ARRAYFORMULA({IFERROR(IMAGE(IMPORTXML(A1,
"//table[@class='table squad sortable']//td[@class='photo']/a/img/@src"))),
QUERY(IFERROR(HLOOKUP(1, {1; IMPORTXML(A1,
"//table[@class='table squad sortable']//td[@class='photo']/a/img/@src | //table[@class='table squad sortable']//td[@class='name large-link']/a/@href")},
(ROW(A:A)+1)*2-TRANSPOSE(SORT(ROW($A$1:$A$2)+0, 1, 0)))), "where Col1 !=''", 0)})
I am trying to create another list to include all of the trials from 2 out of the 3 variables shown in the picture.
I am trying to learn how to map this.
So far I have:
d2 <- map(d1, `[`, c("time_100L_1", "vertical_100L_1"))
but this only brings in the first trial. I need all 14 trials for time and vertical; force sits in the middle of the list.
Any suggestions? See the picture for the list structure.
map(d1, `[`, c(paste0("time_100L_", 1:14), paste0("vertical_100L_", 1:14)))
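For illustration, here is a minimal sketch with a made-up d1 (the element names and structure here are assumptions; your real list will differ) showing how `[` with a character vector of names pulls several trials from each list entry:
library(purrr)
# Hypothetical stand-in for d1: two subjects, each a named list of trial measurements
trials <- c(paste0("time_100L_", 1:2), "force_100L_1", paste0("vertical_100L_", 1:2))
d1 <- list(subj1 = as.list(setNames(rnorm(5), trials)),
           subj2 = as.list(setNames(rnorm(5), trials)))
# `[` subsets each subject's list by name; paste0() builds the full set of trial names
wanted <- c(paste0("time_100L_", 1:2), paste0("vertical_100L_", 1:2))
d2 <- map(d1, `[`, wanted)
str(d2)  # each entry now holds only the time_* and vertical_* trials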
I am new to R (and coding in general) and am really stuck on how to approach this problem.
I have a very large data set; columns are sample ID# (~7000 samples) and rows are gene expression (~20,000 genes). Column headings are BIOPSY1-A, BIOPSY1-B, BIOPSY1-C, ..., BIOPSY200-Z. Each number (1-200) is a different patient, and each sample for that patient is a different letter (-A, -Z).
I would like to do some comparisons between samples that came from men and women. Gender is not included in this gene expression table. I have a separate file with patient numbers (BIOPSY1-200) and their gender M/F.
I would like to code something that will look at the column ID (e.g. BIOPSY7-A), recognize that it contains "BIOPSY7" (but not match == "BIOPSY7" exactly, because there is BIOPSY7-A through BIOPSY7-Z), find "BIOPSY7" in the reference file, pull the M/F value, and create a new row with the M/F designation.
Honestly, I am so overwhelmed with coding this that I tried to open the file in Excel to manually input M/F for the 7000 columns, as that would probably be faster. However, the file is so large that Excel crashes when it opens.
Any input or resources that would put me on the right path would be extremely appreciated!!
I don't quite know what your data looks like, so I made up my own based on your description. I'm sure you can modify this answer to fit your needs and your dataset's structure:
library(data.table)
genderfile <-data.frame("ID"=c("BIOPSY1", "BIOPSY2", "BIOPSY3", "BIOPSY4", "BIOPSY5"),"Gender"=c("F","M","M","F","M"))
# You can read your gender file into R with the line below:
#genderfile <- read.csv("~/gender file.csv")
View(genderfile)
df<-matrix(rnorm(45, mean=10, sd=5),nrow=3)
colnames(df)<-c("BIOPSY1-A", "BIOPSY1-B", "BIOPSY1-C", "BIOPSY2-A", "BIOPSY2-B", "BIOPSY2-C","BIOPSY3-A", "BIOPSY3-B", "BIOPSY3-C","BIOPSY4-A", "BIOPSY4-B", "BIOPSY4-C","BIOPSY5-A", "BIOPSY5-B", "BIOPSY5-C")
df<-cbind(Gene=seq(1:3),df)
df<-as.data.frame(df)
# You can read your main df into R with the line below; fread() keeps the dashes in the column names from being converted to periods (you need the data.table package installed and loaded):
#df<-fread("~/first file.csv")
View(df)
Note that the following line of code removes the dash and letter from the column names of df (the first column is dropped with df[,-c(1)] because it holds the Gene id):
substr(x=names(df[,-c(1)]),start=1,stop=nchar(names(df[,-c(1)]))-2)
#[1] "BIOPSY1" "BIOPSY1" "BIOPSY1" "BIOPSY2" "BIOPSY2" "BIOPSY2" "BIOPSY3" "BIOPSY3" "BIOPSY3" "BIOPSY4" "BIOPSY4"
#[12] "BIOPSY4" "BIOPSY5" "BIOPSY5" "BIOPSY5"
Now, we are ready to match the columns of df with the ID in genderfile to get the Gender column:
Gender<-genderfile[, "Gender"][match(substr(x=names(df[,-c(1)]),start=1,stop=nchar(names(df[,-c(1)]))-2), genderfile[,"ID"])]
Gender
#[1] F F F M M M M M M F F F M M M
The last step is to add the Gender vector defined above as a row to df:
df_withGender<-rbind(c("Gender", as.character(Gender)), df)
View(df_withGender)
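As a side note, stripping the suffix with a regular expression is slightly more robust than counting characters, in case a suffix ever has more than one letter. A minimal sketch, assuming the same df and genderfile as above:
# Remove everything from the dash onward, e.g. "BIOPSY7-A" -> "BIOPSY7"
patient_ids <- sub("-.*$", "", names(df)[-1])
# Look up each patient's gender and prepend it as a row, as before
Gender <- genderfile$Gender[match(patient_ids, genderfile$ID)]
df_withGender <- rbind(c("Gender", as.character(Gender)), df)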
I have imported an Excel spreadsheet into RStudio and I need to write R commands for the data. I need a command to display how many times an item has been sold. The data looks a little something like this:
PRODUCT     UNITS
eye liner      10
lip gloss       5
eye liner      10
lip gloss       5
I do not know how to count how many units of lip gloss have been sold. The best I can do is display how many times lip gloss shows up in the data with the command:
nrow(mySales[mySales$Product=="lip gloss",])
This command doesn't count how many units of lip gloss were sold (which is 10); it only counts how many times lip gloss appears in the data (2). This is a beginner course and this is the first exercise, so I am assuming it is a simple problem, but I am completely lost.
You are almost there. If you look at your code:
nrow(mySales[mySales$Product=="lip gloss",])
this line here:
mySales[mySales$Product=="lip gloss",]
subsets the data down to the rows where the product is "lip gloss". Wrapping that in nrow() counts the number of rows in that subset, which is why you get 2 rather than the total units. To get the total units sold, replace nrow() with sum() applied to the UNITS column of the subset:
sum(mySales[mySales$Product=="lip gloss",]$UNITS)
Here's a step-by-step version:
lipGlossSales <- mySales[mySales$Product=="lip gloss",]
lipGlossUnits <- lipGlossSales$UNITS
totallipGloss <- sum(lipGlossUnits)
Happy R-ing
cheers,
This is called the split-apply-combine approach; it is well documented and very common in data analysis. In this case I would try the plyr library, which makes a nice summary of the data like this:
fakedata <- data.frame(Product=c('eye liner', 'lip gloss', 'eye liner', 'lip gloss'),
count=c(10,5,10,5))
library(plyr)
product.counts <- ddply(fakedata, "Product", function(x) data.frame(Productcount = sum(x$count)))
R> product.counts
Product Productcount
1 eye liner 20
2 lip gloss 10
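For reference, base R's aggregate() produces the same per-product summary without extra packages; a minimal sketch using the fakedata frame defined above:
# Sum the count column within each Product group
aggregate(count ~ Product, data = fakedata, FUN = sum)
#     Product count
# 1 eye liner    20
# 2 lip gloss    10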
I have a list of boroughs and a list of localities (like this one). Each locality lies in exactly one borough. What's the best way to store this kind of hierarchical structure in R, considering that I'd like a convenient and readable way of accessing it, and that I'd like to use this list to aggregate data from the locality level up to the borough level?
I've come up with the following:
localities <- list("Mitte" = c("Mitte", "Moabit", "Hansaviertel", "Tiergarten", "Wedding", "Gesundbrunnen"),
                   "Friedrichshain-Kreuzberg" = c("Friedrichshain", "Kreuzberg")
)
But I am not sure if this is the most elegant and accessible way.
If I wanted to attach additional information at the locality level, I could do that by replacing the c(...) with some other call, like rbind(c('0201', '0202'), c("Friedrichshain", "Kreuzberg")). But if I wanted to add additional information at the borough level (like an abbreviated name and a full name for each list), how would I do this?
Edit: For example, I'd like to condense a table like this into a borough-wise version.
Hard to know without a better view of how you intend to use this, but I would strongly recommend moving away from a nested list structure to a data frame structure:
library(reshape2)
loc.df <- melt(localities)
This is what the molten data looks like:
value L1
1 Mitte Mitte
2 Moabit Mitte
3 Hansaviertel Mitte
4 Tiergarten Mitte
5 Wedding Mitte
6 Gesundbrunnen Mitte
7 Friedrichshain Friedrichshain-Kreuzberg
8 Kreuzberg Friedrichshain-Kreuzberg
You can then use all the standard data frame and other computations:
loc.df$population <- sample(100:500, nrow(loc.df)) # make up population
tapply(loc.df$population, loc.df$L1, mean) # population by borough
gives mean population by Borough:
Friedrichshain-Kreuzberg Mitte
278.5000 383.8333
For more complex calculations you can use data.table or dplyr.
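For instance, a minimal dplyr sketch (assuming the loc.df with the made-up population column from above) that reproduces the per-borough mean and adds a total:
library(dplyr)
loc.df %>%
  group_by(L1) %>%                       # L1 holds the borough name after melt()
  summarise(mean_pop  = mean(population),
            total_pop = sum(population))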
You can extract all of this data directly into a data.frame using the XML library.
library(XML)
theurl <- "http://en.wikipedia.org/wiki/Boroughs_and_localities_of_Berlin#List_of_localities"
tables <- readHTMLTable(theurl)            # parse every HTML table on the page into a list
boroughs <- tables[[1]]$Borough            # the first table lists the boroughs
localities <- tables[c(3:14)]              # tables 3-14 hold the localities, one per borough
names(localities) <- as.character(boroughs)
all <- do.call("rbind", localities)        # stack into one data frame; borough names become row-name prefixes
@Roland, I think you will find data frames superior to lists for the reasons cited earlier, but also because there is other data on the web page you reference. Loading into a data frame will make it easy to go further if you wish. For example, making comparisons based on population density or other items provided "for free" on the page will be a snap from a data frame.
I have two sources of clinical procedure billing information that I have added together (with rbind). In each row there is a CPT field and a CPT.description field that supplies a brief explanation. However, the descriptions differ slightly between the two sources. I want to be able to combine them, so that if different words or abbreviations are used, I can just do a string search to find what I am looking for.
So let's make up a simplified representation of a data table that I was able to generate.
cpt <- c(23456,23456,10000,44555,44555)
description <- c("tonsillectomy","tonsillectomy in >12 year old","brain transplant","castration","orchidectomy")
cpt.desc <- data.frame(cpt,description)
And here is what I want to get to.
cpt.wanted <- c(23456,10000,44555)
description.wanted <- c("tonsillectomy; tonsillectomy in >12 year old","brain transplant","castration; orchidectomy")
cpt.desc.wanted <- data.frame(cpt.wanted,description.wanted)
I have tried functions such as unstack and then lapply(list, paste), but that does not paste the elements of each list together. I also tried reshape, but there was no categorical variable to distinguish the first and second (or in some cases a third) version of the description. The really annoying part is that I had a similar problem a few months or years ago, someone helped me either on Stack Overflow or on r-help, and for the life of me I cannot find it.
So the underlying problem is: imagine that I have a spreadsheet in front of me. I need to do a vertical merge (paste) of two or maybe even three description cells that have the same CPT code in the adjacent column.
What buzzwords should I have been using to search for a solution to this problem?
Thank you so much for your help.
# The inner sapply creates a set of index vectors, one per unique CPT code;
# the outer sapply then pastes together the matching items from the "description" vector.
sapply(sapply(unique(cpt), function(x) grep(x, cpt)),
       function(x) paste(description[x], collapse=";"))
[1] "tonsillectomy;tonsillectomy in >12 year old"
[2] "brain transplant"
[3] "castration;orchidectomy"
Here is an approach that uses plyr.
library("plyr")
cpt.desc.wanted <- ddply(cpt.desc, .(cpt), summarise,
description.wanted = paste(unique(description), collapse="; "))
which gives
> cpt.desc.wanted
cpt description.wanted
1 10000 brain transplant
2 23456 tonsillectomy; tonsillectomy in >12 year old
3 44555 castration; orchidectomy
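A base R equivalent of this summary, if you prefer to avoid the plyr dependency (a minimal sketch using the cpt.desc frame from the question):
aggregate(description ~ cpt, data = cpt.desc,
          FUN = function(x) paste(unique(x), collapse = "; "))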