I need to run a script for each station (I was replacing the numbers 1 by 1 in the script) but there're more than 100 stations.
I thought maybe loop in script could save my time. Never done loop before, don't know if it's possible to do what I want. I've tried as the bellow but doesn't work.
Just a bit of my df8 data (txt):
RowNum,date,code,gauging_station,precp
1,01/01/2008 01:00,1586,315,0.4
2,01/01/2008 01:00,10990,16589,0.2
3,01/01/2008 01:00,17221,30523,0.6
4,01/01/2008 01:00,34592,17344,0
5,01/01/2008 01:00,38131,373,0
6,01/01/2008 01:00,44287,370,0
7,01/01/2008 01:00,53903,17314,0.4
8,01/01/2008 01:00,56005,16596,0
9,01/01/2008 01:00,56349,342,0
10,01/01/2008 01:00,57294,346,0
11,01/01/2008 01:00,64423,533,0
12,01/01/2008 01:00,75266,513,0
13,01/01/2008 01:00,96514,19187,0
Code:
station <- sample(50:150,53,replace=F)
for(i in station)
{
df08_1 <- filter(df08, V7==station [i])
colnames(df08_1) <- c("Date","gauging_station", "code", "precp")
df08_1 <- unique(df08_1)
final <- df08_1 %>%
group_by(Date=floor_date(Date, "1 hour"), gauging_station, code) %>%
summarize(precp=sum(precp))
write.csv(final,file="../station [i].csv", row.names = FALSE)
}
If you're not averse to using some tidyverse packages, I think you could simplify this a bit:
Updated with your new sample data - this runs ok on my computer:
Code:
library(dplyr)
dat %>%
select(-RowNum) %>%
distinct() %>%
group_by(date_hour = lubridate::floor_date(date, 'hour'), gauging_station, code) %>%
summarize(precp = sum(precp)) %>%
split(.$gauging_station) %>%
purrr::map(~write.csv(.x,
file = paste0('../',.x$gauging_station, '.csv'),
row.names = FALSE))
Data:
dat <- data.table::fread("RowNum,date,code,gauging_station,precp
1,01/01/2008 01:00,1586,315,0.4
2,01/01/2008 01:00,10990,16589,0.2
3,01/01/2008 01:00,17221,30523,0.6
4,01/01/2008 01:00,34592,17344,0
5,01/01/2008 01:00,38131,373,0
6,01/01/2008 01:00,44287,370,0
7,01/01/2008 01:00,53903,17314,0.4
8,01/01/2008 01:00,56005,16596,0
9,01/01/2008 01:00,56349,342,0
10,01/01/2008 01:00,57294,346,0
11,01/01/2008 01:00,64423,533,0
12,01/01/2008 01:00,75266,513,0
13,01/01/2008 01:00,96514,19187,0") %>%
mutate(date = as.POSIXct(date, format = '%m/%d/%Y %H:%M'))
Can't comment for a lack of reputation, but if the code works if you change station [i] for the number of the station, it sounds like each station is a part of and has to be extracted from the df08 object (dataframe).
If I understand you correctly, I would do this as follows:
stations <- c(1:100) #put your station IDs into a vector
for(i in stations) { #run the script for each entry in the list
#assuming that 'V7' is the name of the (unnamed) seventh column of df08, it could
#work like this:
df08_1 <- filter(df08, df08$V7==i) #if your station names are something like
#'station 1' as a string, use paste("station", 1, sep = "")
colnames(df08_1) <- c("Date","gauging_station", "code", "precp")
df08_1 <- unique(df08_1)
final <- df08_1 %>%
group_by(Date=floor_date(Date, "1 hour"), gauging_station, code) %>%
summarize(precp=sum(precp)) #floor_date here is probably your own function
write.csv(final,file=paste("../station", i, ".csv", sep=""), row.names = FALSE)
#automatically generate names. You can modify the string to whatever you want ofc.
}
If this and all of the other examples don't work, could you provide us with some dummy data to work with, just to see what the df08 dataframe looks like? And also what the floor_date() function does?
Related
I am a beginner user in R and have been compiling a code to create a custom function to execute a specific task on some data that I possess. The custom function is structured to identify missing data in a csv file and patch this using the mean value. Thereafter, I want to summarize the data by year and month and export this as a csv file. I have multiple csv files that are sitting in a folder and would like to perform this task on each of these files. Thus far, I am able to get the code to perform the task at hand but don't know how to write a unique output for each csv file that has been processed and save these to a new folder. I would also like to retain the original file name in the processed output but have the words "_processed" appended to it. Additionally, any suggestions on how this code can be improved are most welcome. Thanks in advance.
# Load all packages required by the script
library(tidyverse) # data science package
library(lubridate) # work with dates
library(dplyr) # data manipulation (filter, summarize, mutate)
library(ggplot2) # graphics
library(gridExtra) # tile several plots next to each other
library(scales)
# Set the working directory #
setwd("H:/Shaeden_Post_Doc/Genus_Exchange/GEE_Data/MODIS_Product_Data_Raw/Cold_Temperate_Moist")
#create a function to summarize data by year and month
#patch missing values using the average
summarize_by_month = function(df){
# counting unique, missing and mean values in the ET column
df %>% summarise(n = n_distinct(ET),
na = sum(is.na(ET)),
med = mean(ET, na.rm = TRUE))
# assign mean values to the missing data and modify the dataframe
df = df %>%
mutate(ET = replace(ET,is.na(ET),mean(ET, na.rm = TRUE)))
df
#separate data into year, month and day
df$date = as.Date(df$date,format="%Y/%m/%d")
#summarize by year and month
df %>%
mutate(year = format(date, "%Y"), month = format(date, "%m")) %>%
group_by(year, month) %>%
summarise(mean_monthly = mean(ET))
}
#import all files and execute custom function for each
file_list = list.files(pattern="AET", full.names=TRUE)
file_list
my_AET_files = lapply(file_list, read_csv)
monthly_AET = lapply(my_AET_files, summarize_by_month)
monthly_AET
A link to the sample datasets is provided below
https://drive.google.com/drive/folders/1pLHt-vT87lxzW2We-AS1PwVcne3ALP2d?usp=sharing
You can read, manipulate data and write the csv in the same function :
library(dplyr)
summarize_by_month = function(file) {
df <- readr::read_csv(file)
# assign mean values to the missing data and modify the dataframe
df = df %>% mutate(ET = replace(ET,is.na(ET),mean(ET, na.rm = TRUE)))
#separate data into year, month and day
df$date = as.Date(df$date,format="%Y/%m/%d")
#summarize by year and month
new_df <- df %>%
mutate(year = format(date, "%Y"), month = format(date, "%m")) %>%
group_by(year, month) %>%
summarise(mean_monthly = mean(ET))
write.csv(new_df, sprintf('output_folder/%s_processed.csv',
tools::file_path_sans_ext(basename(file))), row.names = FALSE)
}
monthly_AET = lapply(file_list, summarize_by_month)
path<-"your_peferred_path/" #set a path to were you want to save the files
x<-list.files(pattern= "your_pattern") # create a list of your file names
name<-str_sub(x, start=xL, end=yL) #x & y being the part of the name you want to keep
for (i in 1:length(monthly_AET)){
write_excel_csv(monthly_AET[i], paste0(path, name, "_processed.csv")) # paste0 allows to create custom names from variables and static strings
}
note: this is only an assumption and may have to be tweaked to suit your needs
With the next code I import data in JSON format from two different urls and then I combine them to get an unique dataframe:
library(jsonlite)
library(dplyr)
url1 <- 'https://c.stockcharts.com/j-sum/sum?cmd=perf&group=SECTOR_DJUSOI'
document1<-na.omit(select(fromJSON(url1),Ticker=sym,Name=name,SCTR=sctr,Capital=univ))
document1$SubSector="DJUSOI"
url2 <- 'https://c.stockcharts.com/j-sum/sum?cmd=perf&group=SECTOR_DJUSOL'
document2<-na.omit(select(fromJSON(url2),Ticker=sym,Name=name,SCTR=sctr,Capital=univ))
document2$SubSector="DJUSOL"
#Combined file
USA<-rbind.data.frame(document1,document2)
My problem is that I need to import data from more than 100 different urls, so I assume I should use a loop. The only thing that changes at each url is the sector name (after the underscore): DJUSOI, DJUSOL, etc.
Could somebody let me know how to do it?
Try this loop. You can add the sectors you want to sectors vector:
library(jsonlite)
library(dplyr)
sectors <- c('DJUSOI', 'DJUSOL')
documents <- data.frame()
for (sector in sectors){
url <- paste0('https://c.stockcharts.com/j-sum/sum?cmd=perf&group=SECTOR_', sector)
current <- fromJSON(url) %>%
select(Ticker=sym,Name=name,SCTR=sctr,Capital=univ) %>%
na.omit() %>%
mutate(SubSector = sector)
documents <- bind_rows(documents, current)
}
Since the only thing that really changes is the sector name, I think this should work for you:
library(jsonlite)
library(dplyr)
DownloadSubSec = function(sector) {
url = 'https://c.stockcharts.com/j-sum/sum?cmd=perf&group=SECTOR_'
url = paste0(url, sector)
doc = na.omit(select(fromJSON(url), Ticker = sym, Name = name,
SCTR = sctr, Capital = univ))
doc$SubSector = sector
return(doc)
}
sector_names = c('DJUSOI', 'DJUSOL')
usa = sector_names %>%
lapply(DownloadSubSec) %>%
bind_rows
So basically I am trying the following loop:
rawData = read.csv(file = "SampleData.csv")
companySplit = split(rawData, rawData$Company)
NameOfCompany <- numeric()
DateOfOrder <- character()
WhichProducts <- numeric()
for (i in 1:length(companySplit)){
company_DateSplit = split(companySplit[[i]], companySplit[[i]]$Date)
for (j in 1:length(company_DateSplit)){
WhichProducts[j] <- (paste0(company_DateSplit[[j]]$ID, collapse=","))
DateOfOrder[j] <- (paste0(company_DateSplit[[j]]$Date[1]))
NameOfCompany[j] <- (paste0(companySplit[[i]]$Company[[1]]))
}
}
df <- data.frame(NameOfCompany,DateOfOrder, WhichProducts)
write.csv(df, file = "basket.csv")
If you check basket.csv there is output for only company D. It is not writing because of nesting of for loops I guess. I am not able to get out of it.
I need exact output as basket.csv but for all companies.
Here are the CSVs:
Input Data: Link
Output of code basket.csv: Link
The output should look like this:
Company,Date, All IDs comma seperated.
e.g.
A,Jan-18,(1,2,4)
A,Feb-18,(1,4)
B,Jan-18,(2,3,4)
I'm able to get it from the above code. But Not able to save it in CSV for all A,B,C,D companies. It saves values for only company D which is the last value in looping. (check output file link)
The initial error is that you import your data without the parameter stringsAsFactors = FALSE which happens all the time. Also, looping in R is usually less efficient and harder to reason about than using a more functional approach. I think what you're trying to do can be done with the aggregate function
rawData <- read.csv(file = "SampleData.csv", stringsAsFactors = FALSE)
df <- aggregate(ID ~ Company + Date, data = rawData, FUN = paste, collapse = ",")
colnames(df) <- c("NameOfCompany", "DateOfOrder", "ID")
df = split(df, df$NameOfCompany)
Or using a tidy approach
df <- rawData %>% group_by(Company, Date) %>%
summarise(WhichProducts=paste(ID,collapse=',')) %>%
rename(DateOfOrder = Date) %>%
rename(NameOfCompany = Company) %>%
group_split()
allcsvs = list.files(pattern = "*.csv$", recursive = TRUE)
library(tidyverse)
##LOOP to redact the snow data csvs##
for(x in 1:length(allcsvs)) {
df = read.csv(allcsvs[x], check.names = FALSE)
newdf = df %>%
gather(COL_DATE, SNOW_DEPTH, -PT_ID, -DATE) %>%
mutate(
DATE = as.Date(DATE,format = "%m/%d/%Y"),
COL_DATE = as.Date(COL_DATE, format = "%Y.%m.%d")
) %>%
filter(DATE == COL_DATE) %>%
select(-COL_DATE)
####TURN DATES UNAMBIGUOUS HERE####
df$DATE = lubridate::mdy(df$DATE)
finaldf = merge(newdf, df, all.y = TRUE)
write.csv(finaldf, allcsvs[x])
df = read.csv(allcsvs[x])
newdf = df[, -grep("X20", colnames(df))]
write.csv(newdf, allcsvs[x])
}
I am using the code above to populate a new column row-by-row using values from different existing columns, using date as selection criteria. If I manually open each .csv in excel and delete the first column, this code works great. However, if I run it on the .csvs "as is"
I get the following message:
Error: Column 1 must be named
So far I've tried putting -rownames within the parenthesis of gather, I've tried putting remove_rownames %>% below newdf = df %>%, but nothing seems to work. I tried reading the csv without the first column [,-1] or deleting the first column in R df[,1]<-NULL but for some reason when I do that my code returns an empty table instead of what I want it to. In other words, I can delete the rownames in Excel and it works great, if I delete them in R something funky happens.
Here is some sample data: https://drive.google.com/file/d/1RiMrx4wOpUdJkN4il6IopciSF6pKeNLr/view?usp=sharing
You can consider to import them with readr::read_csv.
An easy solution with tidyverse:
allcsvs %>%
map(read_csv) %>%
reduce(bind_rows) %>%
gather(COL_DATE, SNOW_DEPTH, -PT_ID, -DATE) %>%
mutate(
DATE = as.Date(DATE,format = "%m/%d/%Y"),
COL_DATE = as.Date(COL_DATE, format = "%Y.%m.%d")
) %>%
filter(DATE == COL_DATE) %>%
select(-COL_DATE)
With utils::read.csv, you are importing strings are factors. as.Date(DATE,format = "%m/%d/%Y") evaluates NA.
Update
Above solution returns one single dataframe. To write the each data file separately with the for loop:
for(x in 1:length(allcsvs)) {
read_csv(allcsvs[x]) %>%
gather(COL_DATE, SNOW_DEPTH, -PT_ID, -DATE) %>%
mutate(
COL_DATE = as.Date(COL_DATE, format = "%Y.%m.%d")
) %>%
filter(DATE == COL_DATE) %>%
select(-COL_DATE) %>%
write_csv(paste('tidy', allcsvs[x], sep = '_'))
}
Comparison
purrr:map and purrr:reduce can be used instead of for loop in some cases. Those functions take another functions as arguments.
readr::read_csv is typically 10x faster than base R equivalents. (More info: http://r4ds.had.co.nz/data-import.html). Also it can handle CSV files better.
I'm new-ish to R and am having some trouble iterating through values.
For context: I have data on 60 people over time, and each person has his/her own dataset in a folder (I received the data with id #s 00:59). For each person, there are 2 values I need - time of response and picture response given (a number 1 - 16). I need to convert this data from wide to long format for each person, and then eventually append all of the datasets together.
My problem is that I'm having trouble writing a loop that will do this for each person (i.e. each dataset). Here's the code I have so far:
pam[x] <- fromJSON(file = "PAM_u[x].json")
pam[x]df <- as.data.frame(pam[x])
#Creating long dataframe for times
pam[x]_long_times <- gather(
select(pam[x]df, starts_with("resp")),
key = "time",
value = "resp_times"
)
#Creating long dataframe for pic_nums (affect response)
pam[x]_long_pics <- gather(
select(pam[x]df, starts_with("pic")),
key = "picture",
value = "pic_num"
)
#Combining the two long dataframes so that I have one df per person
pam[x]_long_fin <- bind_cols(pam[x]_long_times, pam[x]_long_pics) %>%
select(resp_times, pic_num) %>%
add_column(id = [x], .before = 1)
If you replace [x] in the above code with a person's id# (e.g. 00), the code will run and will give me the dataframe I want for that person. Any advice on how to do this so I can get all 60 people done?
Thanks!
EDIT
So, using library(jsonlite) rather than library(rjson) set up the files in the format I needed without having to do all of the manipulation. Thanks all for the responses, but the solution was apparently much easier than I'd thought.
I don't know the structure of your json files. If you are not in the same folder, like the json files, try that:
library(jsonlite)
# setup - read files
json_folder <- "U:/test/" #adjust you folder here
files <- list.files(path = paste0(json_folder), pattern = "\\.json$")
# import data
pam <- NULL
pam_df <- NULL
for (i in seq_along(files)) {
pam[[i]] <- fromJSON(file = files[i])
pam_df[[i]] <- as.data.frame(pam[[i]])
}
Here you generally read all json files in the folder and build a vector of a length of 60.
Than you sequence along that vector and read all files.
I assume at the end you can do bind_rowsor add you code in the for loop. But remember to set the data frames to NULL before the loop starts, e.g. pam_long_pics <- NULL
Hope that helped? Let me know.
Something along these lines could work:
#library("tidyverse")
#library("jsonlite")
file_list <- list.files(pattern = "*.json", full.names = TRUE)
Data_raw <- tibble(File_name = file_list) %>%
mutate(File_contents = map(File_name, fromJSON)) %>% # This should result in a nested tibble
mutate(File_contents = map(File_contents, as_tibble))
Data_raw %>%
mutate(Long_times = map(File_contents, ~ gather(key = "time", value = "resp_times", starts_with("resp"))),
Long_pics = map(File_contents, ~ gather(key = "picture", value = "pic_num", starts_with("pic")))) %>%
unnest(Long_times, Long_pics) %>%
select(File_name, resp_times, pic_num)
EDIT: you may or may not need not to include as_tibble() after reading in the JSON files, depending on how your data looks like.