For Loop in R to read in data - r

I have the following code and I wanna make a for loop out of it. I only have to change the year numbers on all lines (years 1996-2019). The following is my code:
# loading health data
health_data_1996 <- read.csv("1996-Annual.csv")
#delete data which is not needed
health_data_1996 <- health_data_1996[!(health_data_1996$Measure.Name != "Unemployment Rate, Annual" &
health_data_1996$Measure.Name != "High School Graduation"),]
health_data_1996 <- health_data_1996[,-c(1,2,5,7:11)]
#rename value column
colnames(health_data_1996)[3] <- "1996"
Can somebody tell me how I could make a for loop out of this?
Thank you very much for your help.

Since you just want to read the datasets and not combine them I suggest the following. I'm assuming here that all your CSV files have the same name structure.
# create a vector with all the years
years <- 1996:2019
# apply the desired function on every value in years consecutively
all_data <- lapply(years, function(y) {
  df <- read.csv(paste0(y, "-Annual.csv"))
  df <- df[df$Measure.Name == "Unemployment Rate, Annual" |
             df$Measure.Name == "High School Graduation", ]
  df <- df[, -c(1, 2, 5, 7:11)]
  colnames(df)[3] <- y
  df
})
# lapply does not name its result when given a numeric vector, so add the names
names(all_data) <- years
This will give you a named list where every element is the dataset for a given year. So for example if you want the data from 2019 you should be able to retrieve it with all_data[["2019"]].
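If you specifically want a for loop, a minimal sketch could look like the following. The files written to tempdir() are made up purely so the example runs on its own (11 columns, with Measure.Name in column 4); with your real data you would drop that setup block and point read.csv at your own folder.

```r
# Hypothetical demo setup: two tiny files named like "<year>-Annual.csv"
demo_dir <- tempdir()
years <- 1996:1997
for (y in years) {
  demo <- data.frame(matrix(1, nrow = 3, ncol = 11))
  names(demo)[4] <- "Measure.Name"
  demo$Measure.Name <- c("Unemployment Rate, Annual",
                         "High School Graduation", "Other")
  write.csv(demo, file.path(demo_dir, paste0(y, "-Annual.csv")),
            row.names = FALSE)
}

# The actual loop: one list element per year, named by the year
all_data <- vector("list", length(years))
names(all_data) <- years
for (y in years) {
  df <- read.csv(file.path(demo_dir, paste0(y, "-Annual.csv")))
  keep <- df$Measure.Name %in% c("Unemployment Rate, Annual",
                                 "High School Graduation")
  df <- df[keep, -c(1, 2, 5, 7:11)]    # keep wanted rows, drop unused columns
  colnames(df)[3] <- as.character(y)   # rename the value column to the year
  all_data[[as.character(y)]] <- df
}
```

Storing the results in a list avoids creating 24 separate variables such as health_data_1996; each year is then available as all_data[["1996"]] and so on.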

Related

Is there an easy way to pick specific observations from column out of a data frame for a function to use? Like a filter in excel pivot table?

I have a function below from the package gapminder to run an analysis. I need to pick two continents out of the five available.
library(gapminder)
part3 <- gapminder
continent1 <- subset(part3, continent == "Asia")
continent2 <- subset(part3, continent =="Africa")
#As I'm going to t-test I need two factors - picking two continents
part3c <- rbind(continent1, continent2)
Question: Is there a way for the user to pick continents for the analysis e.g., some code that says "pick two from the five available" that allows for the analysis to be run with different combinations?
Something like getting the output from filtering data in an excel pivot table or do I need to code in the continents each time - as above?
Do you want something like this?
Function combn generates the combinations of the elements of a vector, in the case below taken 2 at a time, and applies a function to each of them. The function test_fun first makes sure the groups are of the same size by randomly downsampling the larger one, then runs the t-test.
In the example call, I test equality of lifeExp by continent but any other column can be tested.
test_fun <- function(X, col){
  cols <- c(col, "continent")
  n <- min(nrow(X[[1]]), nrow(X[[2]]))
  Y <- lapply(X, \(y) {
    if(nrow(y) > n)
      y[sample(nrow(y), n), cols]
    else y[cols]
  })
  Y <- do.call(rbind, Y)
  t.test(get(col) ~ continent, Y)
}
sp_part3 <- split(part3, part3$continent)
combn(sp_part3, 2, test_fun, simplify = FALSE, col = "lifeExp")
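If you would rather have the user name the two continents than test every pair, a small wrapper is one option. This is only a sketch: compare_continents is a made-up helper name, and the data frame below is simulated so the example runs without the gapminder package (with the real data you would pass gapminder instead).

```r
# Simulated stand-in for gapminder (hypothetical values, not the real data)
set.seed(123)
part3 <- data.frame(
  continent = rep(c("Africa", "Americas", "Asia", "Europe", "Oceania"), each = 30),
  lifeExp   = rnorm(150, mean = rep(c(50, 65, 60, 72, 74), each = 30), sd = 8)
)

# Hypothetical helper: the caller picks any two continents to compare
compare_continents <- function(data, picks, col = "lifeExp") {
  stopifnot(length(picks) == 2, all(picks %in% data$continent))
  chosen <- data[data$continent %in% picks, ]
  # Build the formula <col> ~ continent and run a two-sample t-test
  t.test(reformulate("continent", response = col), data = chosen)
}

res <- compare_continents(part3, c("Asia", "Africa"))
res$p.value
```

In an interactive session the picks could come from utils::select.list(unique(part3$continent), multiple = TRUE), which is about as close as base R gets to a pivot-table filter.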

Different implementation of subsetting in a loop

I have a problem relating to subsetting.
Basically I have a dataset. This toy dataset is a good small example:
df<- data.frame(year = c(1980:2019), randnorm = rnorm(40, 0, 1), count1 = rpois(40, 18),
lograndnorm=(rlnorm(40, 3, 2)))
For each value of year between 2000 and 2019, I want to remove that year's observations and output a subset of the total df data excluding that year. I then want to take the year removed and enter it into a model, and use the remainder of the data to train the model.
For example, subset_ex2010 might be excluding 2010. Therefore, all data except for where year= 2010 goes into subset_ex2010 , and I can then use that data to predict 2010.
Once those parameters are entered into the model, the output is saved (after the model has run) and the loop does the next year, that is, removes 2009 from the full df dataframe and subsets the remainder.
I've tried:
for(i in 2000:2019){
subset_excl_[i] <- subset(df, year<i | year>i] )
subset_of_[i] <- subset(df, year==i] )
lmmod[i] <- lm(count1 ~ randnorm + lograndnorm, data=subset_excl_[i])
distPred[i] <- predict(lmmod[i], subset_of_[i])
}
and,
for(i in 2000:2019){
subset_excl_[i] <- [df$year-i]
subset_of_[i] <- subset(df, year==i] )
lmmod[i] <- lm(count1 ~ randnorm + lograndnorm, data=subset_excl_[i])
distPred[i] <- predict(lmmod[i], subset_of_[i])
}
but both fall over. Any assistance would be gratefully received.
I don't know linear programming. But. In both your blocks of code
lmmod[i] <- lm(count1 ~ randnorm + lograndnorm, data=subset_excl_[i])
distPred[i] <- predict(lmMod[i], subset_of_[i])
You're referring to both lmmod and lmMod. R is case-sensitive.
If that alone doesn't fix it - put a browser() call at the head of the loop and single step until you find where it's blowing up.
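Beyond the case mismatch, the posted loops also contain stray ] characters and try to build variable names with subset_excl_[i], which R does not support. A minimal leave-one-year-out sketch, assuming the toy df from the question and a plain lm fit, might look like this (dist_pred, train and test are names chosen here for illustration):

```r
set.seed(1)
df <- data.frame(year = 1980:2019, randnorm = rnorm(40, 0, 1),
                 count1 = rpois(40, 18), lograndnorm = rlnorm(40, 3, 2))

years <- 2000:2019
# One prediction per held-out year, named by the year
dist_pred <- setNames(numeric(length(years)), years)

for (i in years) {
  train <- df[df$year != i, ]   # everything except year i
  test  <- df[df$year == i, ]   # only year i
  fit <- lm(count1 ~ randnorm + lograndnorm, data = train)
  dist_pred[as.character(i)] <- predict(fit, newdata = test)
}

dist_pred
```

Indexing the result by as.character(i) rather than by i keeps the vector at 20 elements instead of growing it to length 2019.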

How to match and store results from a long nested for loop into an empty column in a data frame in R

I'm trying to store p values from a long nested for loop into an empty column in a data frame. I've tried looking up examples close to my code, but my code is so long (and maybe even incorrect) that I feel the things that work for other for loops can't be applied to mine.
The overview of what I'm trying to do is I'm trying to compare the relatedness of observed paired birds to the relatedness of all possible paired birds in a given year by finding a p value. To do this, I'm writing a for loop where I am selecting a range of years from a huge data set, and then I am applying a bunch of functions to those given years where I'm trying to narrow down the data for observed pairs and then I'm adding a column for relatedness and transferring those relatedness values for the pairs from another data set. I am then applying another for loop function within this in order to create a data frame with all possible paired birds in that given year and also adding and transferring a column of relatedness values for the pairs. From these two data frames of pairs and relatedness within each year, I want to apply the wilcox test to find the p value for each given year. I want to transfer over these p values into a separate data frame that I have created with a year column and a p value column.
Here is my (crazy looking) code:
year <- c(2000:2013)
pvalue <- c(NA)
results <- data.frame(year, pvalue)
for(j in c(2000:2013)) {
  allbr_demo_noEPP_year <- subset(allbr_demo_noEPP, Year == j)
  allbr_demo_noEPP_year_geno_obs <- allbr_demo_noEPP_year[allbr_demo_noEPP_year$Pairs %in% c(genome$pair1,genome$pair2),]
  allbr_demo_noEPP_year_geno_obs$relatedness <- laply(allbr_demo_noEPP_year_geno_obs$Pairs, function(x) genome[genome$pair1==x|genome$pair2==x,'PI_HAT'])
  allbr_demo_noEPP_year_geno <- allbr_demo_noEPP_year[c(allbr_demo_noEPP_year$MB_USFWS,allbr_demo_noEPP_year$FB_USFWS) %in% genotyped$V2,]
  breeder_list_males <- allbr_demo_noEPP_year_geno_obs[,8]
  breeder_list_females <- allbr_demo_noEPP_year_geno_obs[,10]
  unq_breeder_list_males <- unique(breeder_list_males)
  unq_breeder_list_females <- unique(breeder_list_females)
  all_poss_combo <-list()
  for(i in unq_breeder_list_males){
    print(i)
    all_poss_combo[[i]]<-paste0(i, ",", unq_breeder_list_females)}
  lapply(X = all_poss_combo, FUN= function(x) length(unique(x)))
  all_poss_df<-unlist(all_poss_combo, use.names = F)
  all_poss_df <- data.frame("combo"=all_poss_df, "M"=NA, "F"=NA)
  all_poss_df$M <- substr(all_poss_df$combo, start = 1, stop = 10)
  all_poss_df$F <- substr(all_poss_df$combo, start = 12, stop = 22)
  all_poss_df_geno <- all_poss_df[all_poss_df$combo %in% c(genome$pair1,genome$pair2),]
  all_poss_df_geno$relatedness <- laply(all_poss_df_geno$combo, function(x) genome[genome$pair1==x|genome$pair2==x,'PI_HAT'])
  wilcox.test(allbr_demo_noEPP_year_geno_obs$relatedness, all_poss_df_geno$relatedness, alternative='greater')}
To be honest, I'm not even sure if this for loop will work (it seems pretty complex to me, but I am a beginner), but I was told that doing a for loop for this situation should work. I understand there are probably easier or faster ways to do what I am trying to do, which I also welcome, but I would also like to see how I could fix this for loop so it would work and how I could store the results from it into a data frame.
Thank you so much for any help given!
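If you are simply looking to save the p value:
str(wilcox.test(rnorm(10), rnorm(10, 2))) # example adapted from ?wilcox.test; shows the structure of the result
wilcox.test(rnorm(10), rnorm(10, 2))$p.value # extracts just the p value
So with your dataset, perhaps put this at the bottom of your for loop. Match on the year rather than indexing with j directly, because j runs from 2000 to 2013 and pvalue[j] would write to positions 2000 and beyond instead of filling the results data frame:
results$pvalue[results$year == j] <- wilcox.test(allbr_demo_noEPP_year_geno_obs$relatedness,
all_poss_df_geno$relatedness, alternative='greater')$p.value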
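To show the storing step in isolation, here is a small self-contained sketch. The two relatedness vectors are simulated with rnorm (an assumption made only so the code runs); in the real loop they would come from allbr_demo_noEPP_year_geno_obs and all_poss_df_geno.

```r
set.seed(42)
results <- data.frame(year = 2000:2013, pvalue = NA_real_)

for (j in results$year) {
  # Stand-ins for the observed and all-possible-pairs relatedness values
  observed <- rnorm(10, mean = 0.5)
  possible <- rnorm(30)
  test <- wilcox.test(observed, possible, alternative = "greater")
  # Write the p value into the row whose year matches j
  results$pvalue[results$year == j] <- test$p.value
}

results
```

Starting the column as NA_real_ makes it easy to spot any year the loop skipped.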

Repetitive Action Over Ten Matrices in R

I have ten datasets, and each dataset contains "ratings" and "occupation" columns. From each of those ten datasets I want to find out the "average" of "ratings" per three occupation groups (i.e. artists, technician, marketing).
The code I have written is as follows:
Average.Rating.per.Interval <- data.frame(interval=as.numeric(),
occupation=as.character(),
average.rating=as.numeric(),
stringsAsFactors=FALSE)
##interval number refers to the dataset number (e.g. for 'e.1' it is 1, for 'e.2' it's 2)
Average.Rating.per.Interval <- as.matrix(Average.Rating.per.Interval)
e.1.artist <- e.1[which(e.1[,"occupation"]=='artist', arr.ind = TRUE),]
mean(e.1.artist$rating)
Average.Rating.per.Interval <- rbind(Average.Rating.per.Interval,
c(interval=1,occupation="artist",average.rating=mean(e.1.artist$rating)))
e.1.technician <- e.1[which(e.1[,"occupation"]=='technician', arr.ind = TRUE),]
mean(e.1.technician$rating)
Average.Rating.per.Interval <- rbind(Average.Rating.per.Interval,
c(1,"technician",mean(e.1.technician$rating)))
e.1.marketing <- e.1[which(e.1[,"occupation"]=='marketing', arr.ind = TRUE),]
mean(e.1.marketing$rating)
Average.Rating.per.Interval <- rbind(Average.Rating.per.Interval,
c(1,"marketing",mean(e.1.marketing$rating)))
This is clearly not efficient at all, because for ten datasets, I have to rewrite the same code 9 more times to get the average ratings for each of those occupations groups for all of my ten datasets. Is there a better way to do this? I cannot think of anything better! I found out that apply/lapply can be a way to do this, but I could not figure out how they can work for my case.
Two of my datasets (e1 and e2) can be found here. (I have only included 10% of the entire observations in each)
You can use the tidyverse package to summarize each of your data frames. First, you'll want to put them in a list. Then you can iterate over each of the data frames in the list, summarizing by occupation:
library(tidyverse)
# Create sample data
set.seed(2353)
sample_data <- rerun(10, tibble(
occupation = sample(c("Artist", "Technician", "Marketing"), 100, replace = TRUE),
ratings = sample(1:100, 100, replace = TRUE)
))
# Summarize by occupation
summarized_data <- sample_data %>%
map(~ .x %>% group_by(occupation) %>% summarize(avg_rating = mean(ratings)))
Another option, with base. First load the files into a list, then use lapply to calculate the means for each dataset
# Set the working directory to the folder that contains the files
files <- list.files()
# Load all the data at once into a single list
l <- lapply(files, dget)
names(l) <- substr(files, 1, 2) # gives meaningful names to list elements (datasets)
# Calculate the mean by group for each dataset
all_group_means <- lapply(l, function(x) tapply(x$rating, x$occupation, mean, na.rm = TRUE))
# Subset all the group means to just those you're interested in
sapply(all_group_means, function(x) x[c("artist", "technician", "marketing")])
                 d1       d2
artist     3.540984 3.612048
technician 3.519512 3.651106
marketing  3.147208 3.342569
Note that if your data are already all loaded, you could just put them into a list (rather than loading all the data directly into a list) and then use the lapply function and it should still work.
Edit
I just realized you only wanted the means for the three groups. I've edited the code above to subset all means to only the three groups.
I recommend the "plyr" package for this kind of manipulation; it is well worth the investment of an hour or so to learn. In your case, I loaded your first example dataset in "d1", and I can summarise it like so:
library(plyr)
ddply(d1, .(occupation), summarise, mean_rating=mean(rating))
This shows the results for all occupations, and you only wanted a specific three, so we can filter it to those:
ddply(subset(d1, occupation %in% c('artist','technician','marketing')), .(occupation), summarise, mean_rating=mean(rating))
Now we just need to generalize it to running over 10 datasets without cut and paste. Let's store our data frames inside a list:
dataset_list <- list(d1=d1) # you would put all of them here; I just have one
Now we can run the same code on all of them, with lapply, and get a list back out:
filtered_occupations <- c('artist','technician','marketing')
lapply(dataset_list, function(dataset) {
  ddply(subset(dataset, occupation %in% filtered_occupations),
        .(occupation), summarise, mean_rating=mean(rating))
})
Result:
$d1
  occupation mean_rating
1     artist    3.540984
2  marketing    3.147208
3 technician    3.519512
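If the goal is the single interval / occupation / average.rating table from the question, a base R sketch along these lines may be enough. The two data frames are simulated here (make_interval is a made-up helper and the ratings are random), standing in for e.1, e.2 and so on.

```r
set.seed(2353)
occupations <- c("artist", "technician", "marketing")

# Hypothetical generator for one dataset with "occupation" and "rating" columns
make_interval <- function() {
  data.frame(occupation = sample(c(occupations, "student"), 100, replace = TRUE),
             rating     = sample(1:5, 100, replace = TRUE))
}
datasets <- list(e.1 = make_interval(), e.2 = make_interval())

per_interval <- lapply(seq_along(datasets), function(k) {
  d <- datasets[[k]]
  d <- d[d$occupation %in% occupations, ]   # keep the three groups of interest
  means <- aggregate(rating ~ occupation, data = d, FUN = mean)
  data.frame(interval = k,
             occupation = means$occupation,
             average.rating = means$rating)
})

# Stack the per-dataset summaries into one data frame
Average.Rating.per.Interval <- do.call(rbind, per_interval)
Average.Rating.per.Interval
```

With all ten datasets in the list, the same code produces thirty rows without any copy and paste.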

How to make iterations over dates in R

This is a more general problem I get in R. Say I want to create a subset of a dataset data that contains only the first 10 days, i.e. days 1,...,10. For a single day I can easily make the subset like this
data_new <- subset(data, data$time == as.Date("2016-01-01"))
But say I want the first 10 days in January 2016. I try to make a loop like this
data_new <- matrix(ncol=2,nrow=1)
for(j in 1:10) {
data_new[,j]= subset(data, data$time==as.Date(as.character(2016-01-j)))
}
but this code can not run in R because of the term as.character(2016-01-j).
How can I create such a subset?
You could do
data_new = subset(data, data$time %in% as.Date(paste0("2016-01-", 1:10)))
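The loop fails because 2016-01-j without quotes is plain arithmetic (2016 minus 1 minus j), not a date. As a sketch of two loop-free alternatives, assuming data has a Date column called time (the data frame below is made up for the example):

```r
# Hypothetical data: one row per day around the turn of the year
data <- data.frame(
  time  = seq(as.Date("2015-12-25"), as.Date("2016-01-20"), by = "day"),
  value = seq_len(27)
)

# Option 1: build the ten dates with seq() and match against them
first_ten <- seq(as.Date("2016-01-01"), by = "day", length.out = 10)
data_new <- data[data$time %in% first_ten, ]

# Option 2: compare against the two end points of the range
data_new2 <- data[data$time >= as.Date("2016-01-01") &
                  data$time <= as.Date("2016-01-10"), ]
```

The range comparison also generalises to periods that cross a month boundary, where pasting day numbers onto a fixed prefix would not.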
