I have a data frame I am applying a function to using mapply(), but my result doesn't keep certain info in the initial data frame that I would like to keep. The function I am applying across the data frame is not vectorized.
DATA FRAME
Dealer Cities Zip Radius in miles
A Rancho Cucamonga, CA 91730 40
A San Bernardino, CA 92401 40
B Chino, CA 91710 40
B Fontana, CA 92337 40
I am applying a function that gets all zip codes in the given mile radius of the initial zip.
remotes::install_github("EAVWing/ZipRadius")
results <- with(city_names, mapply(ZipRadius::zipRadius,
as.character(`Center Zip Code`),`Radius in miles`, SIMPLIFY =FALSE ))
RESULT
The result is a large list containing a data frame for each time the function zipRadius was called.
DESIRED RESULT
Each data frame contains the zip code used by the function and the data the function generates, but I would like it to also keep the corresponding "Dealer" column associated with each initial Zip.
For example, the above data frame is generated from the zip 91730, which has a Dealer value of A.
DEALER ZIP Other columns generated by the function...
A 90001
A 90002
A 90011
#Onyambu gave me a great solution in the comments (source):
dplyr::left_join(city_names, dplyr::bind_rows(results, .id ='zip1'),
by =c(Zip= 'zip1'))
Related
I am new to R (and coding in general) and am really stuck on how to approach this problem.
I have a very large data set; columns are sample ID# (~7000 samples) and rows are gene expression (~20,000 genes). Column headings are BIOPSY1-A, BIOPSY1-B, BIOPSY1-C, ..., BIOPSY200-Z. Each number (1-200) is a different patient, and each sample for that patient is a different letter (-A, -Z).
I would like to do some comparisons between samples that came from men and women. Gender is not included in this gene expression table. I have a separate file with patient numbers (BIOPSY1-200) and their gender M/F.
I would like to code something that will look at the column ID (ex: BIOPSY7-A), recognize that it includes "BIOPSY7" (but not == BIOPSY7 because there is BIOPSY7-A through BIOPSY7-Z), find "BIOPSY7" in the reference file, extrapolate M/F, and create a new row with M/F designation.
Honestly, I am so overwhelmed with coding this that I tried to open the file in Excel to manually input M/F, for the 7000 columns as it would probably be faster. However, the file is so large that Excel crashes when it opens.
Any input or resources that would put me on the right path would be extremely appreciated!!
I don't quite know how your data looks like, so I made mine based on your definitions. I'm sure you can modify this answer based on your needs and your dataset structure:
library(data.table)
genderfile <-data.frame("ID"=c("BIOPSY1", "BIOPSY2", "BIOPSY3", "BIOPSY4", "BIOPSY5"),"Gender"=c("F","M","M","F","M"))
#you can just read in your gender file to r with the line below
#genderfile <- read.csv("~/gender file.csv")
View(genderfile)
df<-matrix(rnorm(45, mean=10, sd=5),nrow=3)
colnames(df)<-c("BIOPSY1-A", "BIOPSY1-B", "BIOPSY1-C", "BIOPSY2-A", "BIOPSY2-B", "BIOPSY2-C","BIOPSY3-A", "BIOPSY3-B", "BIOPSY3-C","BIOPSY4-A", "BIOPSY4-B", "BIOPSY4-C","BIOPSY5-A", "BIOPSY5-B", "BIOPSY5-C")
df<-cbind(Gene=seq(1:3),df)
df<-as.data.frame(df)
#you can just read in your main df to r with the line below, fread prevents dashes to turn to period in r, you need data.table package installed and checked in
#df<-fread("~/first file.csv")
View(df)
Note that the following line of code removes the dash and letter from the column names of df (I removed the first column by df[,-c(1)] because it is the Gene id):
substr(x=names(df[,-c(1)]),start=1,stop=nchar(names(df[,-c(1)]))-2)
#[1] "BIOPSY1" "BIOPSY1" "BIOPSY1" "BIOPSY2" "BIOPSY2" "BIOPSY2" "BIOPSY3" "BIOPSY3" "BIOPSY3" "BIOPSY4" "BIOPSY4"
#[12] "BIOPSY4" "BIOPSY5" "BIOPSY5" "BIOPSY5"
Now, we are ready to match the columns of df with the ID in genderfile to get the Gender column:
Gender<-genderfile[, "Gender"][match(substr(x=names(df[,-c(1)]),start=1,stop=nchar(names(df[,-c(1)]))-2), genderfile[,"ID"])]
Gender
#[1] F F F M M M M M M F F F M M M
Last step is to add the Gender defined above as a row to the df:
df_withGender<-rbind(c("Gender", as.character(Gender)), df)
View(df_withGender)
I have been working on leaflet in R.
https://rstudio.github.io/leaflet/choropleths.html
The above us-Map contains density of a state.The Format of the data is Geo-Json. I want to remove the density variable and I want to pass my columnname with corresponding variable value. (For Example when you hover on the New Mexico I am getting density as 17.16 (density:17.16), instead I want to display as (mycolumnname:value) ).
This is a pretty common need in working with leaflet. There are a few ways to do this, but this is the simplest in my mind:
All of the information you would like to plot is stored in the section of the SpatialPolygonsDataFrame found at states#data, which you can see by looking at the head of this data frame section:
I made a data frame (traditional r data frame) using the state names from the original SpatialPolygonsDataFrame names states in your code above and created my_var.
a<-data.frame( States=states#data$name)
a$my_var <- round(runif(52, 15, 185),2)
This is the first few rows of my new data frame, which is like yours but has data OTHER than density in it.
head(a)
States my_var
1 Alabama 120.33
2 Alaska 179.41
3 Arizona 67.92
4 Arkansas 30.57
5 California 72.26
6 Colorado 56.33
Now that you have this data frame you can call up the library maptools and do a polygon cbind as follows:
states2<-spCbind(states,a$my_var)
Now looking at the head of states2 (which you could name states and replace the original states SpatialPolygonsDataFrame I kept both to compare before and after)
head(states2#data)
id name density data.my_var
0 01 Alabama 94.650 58.01
1 02 Alaska 1.264 99.01
2 04 Arizona 57.050 81.05
3 05 Arkansas 56.430 124.68
4 06 California 241.700 138.19
5 08 Colorado 49.330 103.78
this added the data.my_var variable into the spatial data frame. Now you can use find/replace, to go through and replace the references in your code where it says density with data.my_var and the new variables will be used.
Important things to consider
Your data has 50 state names, the spatial data frame has 52, you will need to add in the missing states to your data frame before cBinding them, they must be the same length AND in the same order.
If you grab the names like this:
a<-data.frame( States=states#data$name)
from the states object, you can then left merge on States, with your data and it will keep the order a and all the cells which are empty where the new regions have not data in your data set will remain empty.
Use merge to be sure that data lines up properly.
a<- merge(a, your_data ,by=c("States","name"))
Also, once they are merged and you have checked that states#data$name is in the same order as a$States, you can use any name you want as new heading in the SpatialPolygonDataFrame by extracting the data into a vector with the name you want prior to binding them:
my_var <- a$my_var
states2<-spCbind(states, my_var)
this will leave you with a data frame which looks like this:
id name density my_var
0 01 Alabama 94.650 58.01
1 02 Alaska 1.264 99.01
This is easier to address as a column name from inside leaflet without long strings.
I have a list of boroughs and a list of localities (like this one). Each locality lies in exactly one borough. What's the best way to store this kind of hierarchical structure in R, considerung that I'd like to have a convenient and readable way of accessing these, and using this list to accumulate data on the locality-level to the borough level.
I've come up with the following:
localities <- list("Mitte" = c("Mitte", "Moabit", "Hansaviertel", "Tiergarten", "Wedding", "Gesundbrunnen",
"Friedrichshain-Kreuzberg" = c("Friedrichshain", "Kreuzberg")
)
But I am not sure if this is the most elegant and accessible way.
If I wanted to assign additional information on the localitiy-level, I could do that by replacing the c(...) by some other call, like rbind(c('0201', '0202'), c("Friedrichshain", "Kreuzberg")) if I wanted to add additional information to the borough-level (like an abbreviated name and a full name for each list), how would I do this?
Edit: For example, I'd like to condense a table like this into a borough-wise version.
Hard to know without having a better view on how you intend to use this, but I would strongly recommend moving away from a nested list structure to a data frame structure:
library(reshape2)
loc.df <- melt(localities)
This is what the molten data looks like:
value L1
1 Mitte Mitte
2 Moabit Mitte
3 Hansaviertel Mitte
4 Tiergarten Mitte
5 Wedding Mitte
6 Gesundbrunnen Mitte
7 Friedrichshain Friedrichshain-Kreuzberg
8 Kreuzberg Friedrichshain-Kreuzberg
You can then use all the standard data frame and other computations:
loc.df$population <- sample(100:500, nrow(loc.df)) # make up population
tapply(loc.df$population, loc.df$L1, mean) # population by borough
gives mean population by Borough:
Friedrichshain-Kreuzberg Mitte
278.5000 383.8333
For more complex calculations you can use data.table and dplyr
You can extract all of this data directly into a data.frame using the XML library.
library(XML)
theurl <- "http://en.wikipedia.org/wiki/Boroughs_and_localities_of_Berlin#List_of_localities"
tables<-readHTMLTable(theurl)
boroughs<-tables[[1]]$Borough
localities<-tables[c(3:14)]
names(localities) <- as.character(boroughs)
all<-do.call("rbind", localities)
#Roland, I think you will find data frames superior to lists for the reasons cited earlier, but also because there is other data on the web page you reference. Loading to a data frame will make it easy to go further if you wish. For example, making comparisons based on population density or other items provided "for free" on the page will be a snap from a data frame.
It seems like I should either know how to do this or at least find the answer here or elsewhere. Unfortunately neither is working.
I have a data frame of customers where one column is their id and another column is their full address. I want to add 3 columns for each row with the lat, long and county code from a geocode lookup.
That data frame looks like
customer_id fulladdress
1 123 Main St., Anywhere, FL
2 321 Oak St., Thisplace, CA
I created a geocode function that takes the full address and returns a data frame with lat, long and county columns.
How can I apply my geocode function to each row of the data frame and append the results as 3 columns into the existing data frame so that it looks like this:
customer_id fulladdress lat long county
1 123 Main St., Anywhere, FL 33.2345 -92.3333 43754
2 321 Oak St., Thisplace, CA 25.3333 -120.333 32960
I've tried playing with apply and ddply, but I can't seem to figure out what either one is doing. I tried this with ddply but all it does is give me back the original data frame.
ddply(customers[1:3,], .(fulladdress), function(x) { geocode(x$fulladdress)})
Thanks for the help.
Thanks for putting me on the right track. Here is what finally worked:
cbind(customers, t(sapply(customers$fulladdress,geocode, USE.NAMES=F)))
I have a tricky problem with applying a function to a list of data frames. Ultimately I want to plot individual time series charts for large data set of drug usage figures.
My dataset comprises 30 different antibiotics with a usage rate that has been collected monthly over a 5 year period. It has 3 columns and 1692 rows.
So far I have made a list of individual data frames for each antibiotic class. (The name of the list is drug and drug.class is a character vector of drug names from the original data frame)
drugList <- list()
n<-length(drug.class)
for (i in 1:n){
drugList[[i]] <-AB[Drug==(drug.class[i]),]
}
For example, I have 30 data frames in a list with the following columns:
[[29]]
Drug Usage DateA
1353 Tobramycin 5.06 01-Jan-2006
1354 Tobramycin 4.21 01-Feb-2006
1355 Tobramycin 6.34 01-Mar-2006
.
.
.
Drug Usage DateA
678 Vancomycin 11.62 01-Jan-2006
679 Vancomycin 11.94 01-Feb-2006
680 Vancomycin 14.29 01-Mar-2006
Before each plot is made a logical test is performed to determine if the time series is autocorrelated. The data frmaes in the list are of verying lengths.
I have written a function to perform the test as follows:
acTest <- function(){
id<-ts(1:length(DateA))
a1<-ts(Usage)
a2<-lag(a1-1)
tg<-ts.union(a1,id,a2)
mg<-lm(a1~a2+bs(id,df=3), data=tg)
a2Pval <- summary(mg)$coefficients[2, 4]
if (a2Pval<=0.05) {
TRUE
} else {
FALSE
}
}
I have previously tested all my functions on individual data frames and they work as expected.
I am trying to work out how to apply the test to each data frame in the drug list. I believe if I can get help working this out I will be in a position to apply the time series functions in the same manner.
Thanks in advance for any assistance offered.
A few suggestions:
Change your acTest function so that it actually accepts a data.frame as a parameter. Otherwise you'll have lots of problems with the function looking for (and modifying) objects named DateA and Usage in the global environment.
acTest <- function(dat){
id<-ts(1:length(dat$DateA))
a1<-ts(dat$Usage)
a2<-lag(a1-1)
tg<-ts.union(a1,id,a2)
mg<-lm(a1~a2+bs(id,df=3), data=tg)
a2Pval <- summary(mg)$coefficients[2, 4]
if (a2Pval<=0.05) {
TRUE
} else {
FALSE
}
}
Applying a function to each element of a list is a common task in R. It is (most often) done using lapply.
lapply(drugList,FUN=acTest)
Finally, you can do tasks like this without storing each data frame as a separate list element by using tools like ddply (among others) that split a data frame using one variable, apply a function to each piece and then reassemble them into a single data frame again. In your case, that would look something like:
ddply(AB,.(Drug),.fun = acTest)