I have a list of plant names in a dataframe. Plant names come as a couplet with "genus" followed by "species". In my case the couplet is already split across columns (which should help). As a dummy example for three species (Helianthus annuus, Pinus radiatia, and Melaleuca leucadendra):
df <- data.frame(genus=c("Helianthus", "Pinus", "Melaleuca"), species=c("annuus","radiata", "leucadendra"))
I would like to use a function in the package "Taxize" to check these names against a database (IPNI).
There is no batch function for this, and annoyingly the format for querying a single name is:
checked <- ipni_search(genus='Helianthus', species='annuus')
What I need is a loop to feed each genus name and it's associated species name into that function.
I can do this for just genus:
list <- df$genus
checked <- lapply(list, function(z) ipni_search(genus=z))
but am tied up in all sorts of knots trying to pass the species with it.
Any help appreciated!
Cheers
Loop (or *apply) over the index, not the actual value:
checked = lapply(
1:nrow(df),
function(i) ipni_search(genus = df$genus[i], species = df$species[i])
)
Alternately, you can use Map which is made for iterating over multiple vectors/lists in parallel:
checked = Map(ipni_search, genus = df$genus, species = df$species)
Related
The issue I have is that I want to create dataframes for individual countries. I have a relatively large dataset, one of the columns is countries so I need to have every instance country x and then all other variables.
When I try to subset I Subset[name, country ="X"] it says that X isnt found. The other option would be name[1:2,], but the issue here is the dataset is so large it is hard to find the vectors needed.
The key thing is that I need all veriables for every instance of country X as they reoccur.
Any help would be greatly appreciated.
mydata <- data.frame(country = rep(c("NL", "BE"), 2), value = 9:12)
library(data.table)
setDT(mydata)
# split by country, create a named list
L <- split(mydata, by = "country")
# pass list's content to current environment, based on list's names
list2env(L, envir = environment())
I'd like to use a loop function to recognise names from a list/dataframe as an actual list/dataframe name in the R script (for data analysis or manipulation).
I will create some pseudo data to try to help show what i'm trying to do.
Here is code to create 3 lists
height <- sample(120:200,200,TRUE)
weight <- sample(40:140,200,TRUE)
income <- sample(20000:200000,200, TRUE)
This code creates a list containing those list names
vars <- c("height","weight","income")
The code below doesn't run, but I would like to use a loop code like this, where it takes the name from the list position and uses it in script as a list name. Thus it's using the name to calculate the mean, and it's using the name to create a new object.
for (i in 1:3)
{mean_**vars[i]** = mean(**vars[i]**) }
The result should be 3 objects "mean_height", "mean_weight", "mean_income" which contain the mean scores
I'm not so much interested in the calculating of mean scores, I'm interested in the ability to use the names from the list. I want to be able to expand this to other analyses that are repetitive.
Apologies if above hasn't been articulated too well, I'm quite new to R, so I hope it makes some sense.
Any help will be most useful, or if you can point me in the right direction that would be great.
This may be what you're looking for, where lapply applies the mean function to each of the items in vars (a list of dataframes). Note that you want to make the list of dataframes using the variable names.
height <- sample(120:200,200,TRUE)
weight <- sample(40:140,200,TRUE)
income <- sample(20000:200000,200, TRUE)
vars <- list(height, weight, income)
lapply(vars, function(x) mean(x))
Then create an output dataframe using that:
df1 <- data.frame(lapply(vars, function(x) mean(x)))
colnames(df1) <- c("mean_height", "mean_weight", "mean_income")
df1
From your additional comment, using vars <- list(height, weight, income) should allow you do this:
mean(height)
mean(vars[[1]])
[1] 160.48
[1] 160.48
This should work to output dynamically named variables:
vars <- list(height = height, weight = weight, income = income)
for (i in names(vars)){
assign(paste("mean_", i, sep = ""), mean(vars[[i]]))
}
mean_height
mean_weight
mean_income
[1] 163.28
[1] 90.465
[1] 109686.5
However, I'd suggest not programming that way since it can cause issues and it's not very scalable. E.g., you could end up with 10000 variables.
I guess what you want is something like below, which produces three objects into your global environment for the means of weight, height, and income from list list, i.e.,
list2env(setNames(Map(mean,lst),paste0("mean_",names(lst))),envir = .GlobalEnv)
DATA
height <- sample(120:200,200,TRUE)
weight <- sample(40:140,200,TRUE)
income <- sample(20000:200000,200, TRUE)
lst <- list(height,weight,income)
A more common approach in R is to use lists of data, rather than separate variables.
Like this:
# make this reproducible
set.seed(123)
# make an empty list for the data
raw_data <- list()
# then fill the list. The data can be of varying length in a list.
raw_data$height <- sample(120:200,200,TRUE)
raw_data$weight <- sample(40:140,200,TRUE)
raw_data$income <- sample(20000:200000,200, TRUE)
Then looping becomes a one-liner and your names are preserved, using the *apply family of functions:
mean_data <- lapply(raw_data, mean)
# print that
mean_data
$height
[1] 159.06
$weight
[1] 90.83
$income
[1] 114000.7
Note what we didn't have to do:
know the number of variables.
have variables all the same length.
build a loop and keep track of names.
All handled automagically. Nice.
I am converting my for-loops in R for a model that has multiple input datasets. In the for-loop I use the current loop value to retrieve values from other datasets. I am looking to replicate this using an apply function (over columns in a dataset) however I'm struggling to establish index of the apply function in order to retrieve the appropriate variables from other data
The apply function references the column by the variable in the function which is fine and I've tried to use both colname (after having named my various columns by number) but have not had any joy. Below is an example dataset and for loop with what I'd like to achieve (simplified somewhat). The length of the vectors and the number of columns in the tabular dataset will always be equal.
iteration<-1:3
df <- data.frame("column1" = 6:10, "column2" = 12:16, "column3" = 31:35)
variable1<-rnorm(3,mean = 25)
variable2<-rnorm(3, mean = 0.21)
outcome<-numeric()
for (i in iteration) {
intermediate<-(mean(df[,i])*variable1[i])^variable2[i]
outcome<-c(outcome,intermediate)
}
outcome
The expected results are outcome above...trying this in apply
What I imagine it to be is this:
apply(df, 2, function(x) (mean(x)*variable1[colnumber(x)])^variable2[colnumber(x)]
or perhaps
apply(df, 2, function(x) (mean(x)*variable1[x])^variable2[x])
but these two obviously do not work.
first time user so apologies for any etiquette issues but found the answer to my own problem using the purrr package, but maybe this helps someone else
pmap(list(df, variable1, variable2), function(df, variable1, variable2) (mean(df)*variable1)^variable2)
I have a function in R which returns a list with N columns each of M rows of multiple types - date, numeric and char. I am using sapply to create multiple copies of these lists, which then end up in a top level list. I would like to concatenate the underlying lists together to produce a single list of N columns and M * number of list rows.
I've been trying different combinations of do.call, sapply, rbind, c, etc, but I think I'm missing something pretty fundamental. Below is a simple script that mimics the problem and shows the desired outcome. I've used 3 variables here, but the number of variables is arbitrary.
# Set up test function
testfun <- function(varName)
{
currDate = seq(as.Date('2018-12-31'), as.Date('2019-01-10'), "days")
t1 = runif(11)
t2 = runif(11)
groupNum = c(rep(1,5), rep(2,6))
varName = rep(varName, 11)
dataout= data.frame(currDate, t1, t2, groupNum, varName)
}
# create 3 test variables and run the data
varNames = c('test1', 'test2', 'test3')
tmp = sapply(varNames, testfun)
# I would like it to look like the following, but for any given number of variables
desiredAnswer <- rbind(as.data.frame(tmp[,1]),as.data.frame(tmp[,2]),as.data.frame(tmp[,3]))
The final answer will later be used to create a data table and feed ggplot with the varName as facets.
I'm happy to use any method to get the desired results, there's no reason the function needs to produce a list instead of say a data.frame. I'm certain I'm doing something dumb, any help appreciated.
I am trying to loop through a large address data set(300,000+ lines) based on a common factor for each observation, ID2. This data set contains addresses from two different sources, and I am trying to find matches between them. To determine this match, I want to loop through each ID2 as a factor and search for a line from each of the two data sets (building and property data sets) Here is a picture of my desire output Picture of desired output
Here is a sample code of what I have tried
PROPERTYNAME=c("Vista 1","Vista 1","Vista 1","Chesnut Street","Apple
Street","Apple Street")
CITY=c("Pittsburgh","Pittsburgh","Pittsburgh","Boston","New York","New
York")
STATE= c("PA","PA","PA","MA","NY","NY")
ID2=c(1,1,1,2,3,3)
IsBuild=c(1,0,0,0,1,1)
IsProp=c(0,1,1,1,0,0)
df=data.frame(PROPERTYNAME,CITY,STATE,ID2,IsBuild,IsProp)
for(i in levels(as.factor(df$ID2))){
for(row in 1:nrow(df)){
df$Any_Build[row][i]<-ifelse(as.numeric(df$IsBuild[row][i])==1)
df$Any_Prop[row][i]<-ifelse(as.numeric(df$IsProp[row][i])==1)
}
}
I've tried nested for loops but have had no luck and am struggling with the apply functions of r. I would appreciate any help. Thank you!
If your main dataset is called D and the building data set is called B and the property dataset is called P, you can do the following:
D$inB <- D$ID2 %in% B$ID2
D$inP <- D$ID2 %in% P$ID2
If you want some data in B, like let's say an address, you can use merge:
D <- merge(D, B[c("ID2", "address")], by = "ID2", all.x = TRUE, all.y = FALSE)
If every row in B has an address, then the NAs in the new address column in D should coincide with the FALSEs in D$inB.
How does ID2 affect the output? If it doesn't have any effect, you can use the same logic you used in your example code without the loop. Ifelse is vectorized so you dont have to run it per row
Edited formatting:
LIHTCComp1$AnyBuild <- ifelse(LIHTCComp1$IsBuild ==1,TRUE,FALSE)
LIHTCComp1$AnyProp <- ifelse(LIHTCComp1$IsProp ==1,TRUE,FALSE)
Hope this helps.