I'm quite a novice at using R but I'm trying to self-teach and learn as I go along. I'm trying to create a loop to download and save multiple met data files individually as csv files using the worldmet package.
I have two variables, the met site code and the years of interest. I have included code to create a list of the years in question:
Startyear <- "2018"
Endyear <- "2020"
Yearlist <- seq(as.numeric(Startyear), as.numeric(Endyear))
I also have a .csv file listing all the required site codes, which I have read into R. Below is a simplified version of the dataframe; in total there are 204 rows. This dataframe is called 'siteinfo'.
code station ctry
037760-99999 GATWICK UK
037690-99999 CHARLWOOD UK
038760-99999 SHOREHAM UK
038820-99999 HERSTMONCEUX WEST END UK
037810-99999 BIGGIN HILL UK
An example of the code to import one year's worth of met data for one site is as follows:
importNOAA(code="037760-99999",year=2019,hourly=TRUE,precip=FALSE,PWC=FALSE,parallel=FALSE,quiet=FALSE)
I understand that I likely need a nested loop to change both variables, but I am unsure if I am going about this correctly. I also understand that I need quotation marks around the code value for it to be read correctly; is there a quick way to add these as part of the code rather than editing all 204 values in the csv?
Would I also need a separate loop for saving the files after downloading, or can this be included in one piece of code?
The current code I have (and I am sure there is a lot wrong with this, so I appreciate any guidance) is as follows:
for(i in 1:siteinfo$code) {
for(j in 1:Yearlist){
importNOAA(code=i,year=j,hourly = TRUE, precip= FALSE, PWC= FALSE, parallel = TRUE, quiet = FALSE)
}}
This currently isn't working, so if you could help me piece this together, and if possible provide any explanation of where I have gone wrong or how I can improve my coding, I would be extremely grateful!
You can avoid loops altogether (better for large data sets and files) with some functions in dplyr and purrr. I get an error for invalid parameters when I try to run your importNOAA code, so I am using a simpler call to that function.
library(dplyr)

met_data <- siteinfo %>%
  full_join(data.frame(year = Yearlist), by = character(0)) %>%  # cross join: every site paired with every year
  group_by(code, year) %>%
  mutate(dat = list(data.frame(code, year))) %>%
  mutate(met = purrr::map(dat, function(df) {
    importNOAA(code = df$code, year = df$year, hourly = TRUE, quiet = FALSE)
  })) %>%
  select(-dat)
This code returns a tbl_df where the last column is a list of data.frames, each containing the data for a year-code combination. You can use met_data %>% summarize(met) to expand the data into one big data.frame to save to a csv, or if you want to write them all to individual csvs, use lapply:
lapply(1:nrow(met_data), function(x) {
  write.csv(met_data$met[[x]],
            file = paste(met_data$station[x], "_", met_data$year[x], ".csv", sep = ""))
})
You can't use a for loop like for(i in 1:siteinfo$code){}.
Here is a short example:
for(i in 1:mtcars$mpg){
print(i)
}
output:
numerical expression has 32 elements: only the first used
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
[1] 7
[1] 8
[1] 9
[1] 10
[1] 11
[1] 12
[1] 13
[1] 14
[1] 15
[1] 16
[1] 17
[1] 18
[1] 19
[1] 20
[1] 21
So just use an index, like this:
for(i in 1:nrow(siteinfo)) {
  for(j in seq_along(Yearlist)) {
    importNOAA(code = siteinfo$code[i], year = Yearlist[j], hourly = TRUE,
               precip = FALSE, PWC = FALSE, parallel = TRUE, quiet = FALSE)
  }
}
Maybe that works.
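To cover the saving step as well, the write can live inside the same loop, so everything happens in one piece of code. A hedged sketch building on the corrected loop above: the importNOAA() arguments are the ones from the question, and the file-name pattern is just an example.

library(worldmet)

for (i in 1:nrow(siteinfo)) {
  for (j in seq_along(Yearlist)) {
    met <- importNOAA(code = siteinfo$code[i], year = Yearlist[j],
                      hourly = TRUE, precip = FALSE, PWC = FALSE,
                      parallel = FALSE, quiet = FALSE)
    # one csv per site/year combination (example naming scheme)
    write.csv(met, paste0(siteinfo$code[i], "_", Yearlist[j], ".csv"),
              row.names = FALSE)
  }
}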
I have a number of CSV files exported from our database, say site1_observations.csv, site2_observations.csv, site3_observations.csv etc. Each CSV looks like below (site1 for example):
# Team: all teams
# Observation type: xyz
Site ID          Reason    Quantity
a                xyz       1
b                abc       2
Total quantity             3
We need to skip the top 2 rows and the last 1 row from each CSV before combining them into one master dataset for further analysis. I know I can use the skip = argument to skip the first few lines of a CSV, but read_csv() doesn't seem to have a simple argument to skip the last lines, so I have been using n_max = as a workaround. The data import has so far been done manually. I want to shift the manual process to a programmatic one using purrr::map(), but I just couldn't work out how to efficiently skip the last few lines here.
library(tidyverse)
observations_skip_head <- 2
# Approach 1: manual ----
site1_rawdata <- read_csv("/data/site1_observations.csv",
skip = observations_skip_head,
n_max = nrow(read_csv("/data/site1_observations.csv",
skip = observations_skip_head))-1)
# site2_rawdata
# site3_rawdata
# [etc]
# all_sites_rawdata <- bind_rows(site1_rawdata, site2_rawdata, site3_rawdata, [etc])
I have tried to use purrr::map() and I believe I am almost there, except for the n_max = part, which I am not sure how to handle inside map() (or any other effective way to get rid of the last line in each CSV). How can I do this with purrr?
observations_csv_paths_chr <- paste0("data/site", 1:3,"_observations.csv")
# Approach 2: programmatically import csv files with purrr ----
all_sites_rawdata <- observations_csv_paths_chr %>%
map(~ read_csv(., skip = observations_skip_head,
n_max = nrow(read_csv("/data/site1_observations.csv",
skip = observations_skip_head))-1)) %>%
set_names(observations_csv_paths_chr)
I know this post uses a custom function and fread. But for my education I want to understand how to achieve this goal using the purrr approach (if it's doable).
You could try something like this?
library(tidyverse)
csv_files <- paste0("data/site", 1:3, "_observations.csv")
csv_files |>
map(
~ .x |>
read_lines() |>
tail(-3) |> # skip first 3
head(-2) |> # ..and last 2
paste(collapse = '\n') |>
read_csv()
)
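If the end goal is the combined master dataset described in the question (drop the 2 comment rows at the top and the "Total quantity" row at the bottom of each file), here is a hedged variation of the same idea that also records which file each row came from:

library(tidyverse)

observations_csv_paths_chr <- paste0("data/site", 1:3, "_observations.csv")

all_sites_rawdata <- observations_csv_paths_chr |>
  set_names() |>
  map(
    ~ .x |>
      read_lines() |>
      tail(-2) |>                # drop the two comment rows at the top
      head(-1) |>                # drop the "Total quantity" row at the bottom
      paste(collapse = "\n") |>
      read_csv()
  ) |>
  bind_rows(.id = "source_file")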
manual_csv <- function(x) {
  txt <- readLines(x)
  txt <- txt[-c(2, 3, length(txt))]  # indices of the rows you want to delete
  result <- read.csv(text = paste0(txt, collapse = "\n"))
  result
}

test <- manual_csv('D:/jaechang/pool/final.csv')
Good afternoon,
I have a folder with 231 .csv files and I would like to merge them in R. Each file is one spectrum with 2 columns (Wavenumber and Reflectance), but as they come from the spectrometer they don't have colnames. So they look like this when I import them:
C_Sycamore = read.csv("#C_SC_1_10 average.CSV", header = FALSE)
head(C_Sycamore)
V1 V2
1 399.1989 7.750676e+001
2 401.1274 7.779499e+001
3 403.0559 7.813432e+001
4 404.9844 7.837078e+001
5 406.9129 7.837600e+001
6 408.8414 7.822227e+001
The first column (Wavenumber) is identical in all 231 files and all spectra contain exactly 1869 rows. Therefore, it should be possible to merge the whole folder into one big dataframe, right? At least this would be very practical for me.
So here is what I tried. I set the working directory to the relevant folder, define an empty variable d, store all the file names in file.list, and loop through the names in file.list. First, I want to change the colnames of every file to "Wavenumber" and the according file name itself, so I use deparse(substitute(i)). Then, I want to read in the file and merge it with the others. I thought I could then do merge(d, read.csv(i, header = FALSE), by = "Wavenumber"), but I don't even get this far.
d = NULL
file.list = list.files()
for(i in file.list){
colnames(i) = c("Wavenumber", deparse(substitute(i)))
d = merge(d, read.csv(i, header = FALSE))
}
When I run this I get the error message
"Error in `colnames<-`(`*tmp*`, value = c("Wavenumber", deparse(substitute(i)))) :
So I tried running it without the colnames() line, which does not produce an error, but doesn't work either. Instead of my desired dataframe I get an empty dataframe with only two columns and the message:
"reread"#S_BE_1_10 average.CSV" "#S_P_1_10 average.CSV""
This kind of programming is new to me. So I am thankful for all useful suggestions. Also I am happy to share more data if it helps.
Thanks in advance.
Solution
library(tidyr)
library(purrr)
path <- "your/path/to/folder"
# in one pipeline:
C_Sycamore <- path %>%
# get csvs full paths. (?i) is for case insensitive
list.files(pattern = "(?i)\\.csv$", full.names = TRUE) %>%
# create a named vector: you need it to assign ids in the next step.
# and remove file extension to get clean colnames
set_names(tools::file_path_sans_ext(basename(.))) %>%
# read file one by one, bind them in one df and create id column
map_dfr(read.csv, col.names = c("Wavenumber", "V2"), .id = "colname") %>%
# pivot to create one column for each .id
pivot_wider(names_from = colname, values_from = V2)
Explanation
I would suggest not to change the working directory.
I think it's better if you read from that folder instead.
You can read each CSV file in a loop and bind them together by row. You can use map_dfr to loop over each item and then bind every dataframe by row (that's what the _dfr stands for).
Note that I've used .id = to create a new column called colname. It gets populated out of the names of the vector you're looping over. (That's why we added the names with set_names)
Then, to have one row for each Wavenumber, you need to reshape your data. You can use pivot_wider.
You will have at the end a dataframe with as many rows as Wavenumber and as many columns as the number of CSV plus 1 (the wavenumber column).
Reproducible example
To double check my results, you can use this reproducible example:
path <- tempdir()
csv <- "399.1989,7.750676e+001
401.1274,7.779499e+001
403.0559,7.813432e+001
404.9844,7.837078e+001
406.9129,7.837600e+001
408.8414,7.822227e+001"
write(csv, file.path(path, "file1.csv"))
write(csv, file.path(path, "file2.csv"))
You should expect this output:
C_Sycamore
#> # A tibble: 5 x 3
#> Wavenumber file1 file2
#> <dbl> <dbl> <dbl>
#> 1 401. 77.8 77.8
#> 2 403. 78.1 78.1
#> 3 405. 78.4 78.4
#> 4 407. 78.4 78.4
#> 5 409. 78.2 78.2
Thanks a lot to @Konrad Rudolph for the suggestions!!
No need for a loop here; simply use lapply.
First set your working directory to the file location.
library(dplyr)
files_to_upload <- list.files(pattern = "\\.csv$")
theData_list <- lapply(files_to_upload, read.csv, header = FALSE)
C_Sycamore <- bind_rows(theData_list)
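Note that bind_rows() stacks the 231 spectra on top of each other. If you specifically want the wide layout from the question (one Wavenumber column plus one column per file), here is a hedged base-R sketch along the lines the question was attempting; it assumes every file has two unnamed columns and an identical Wavenumber grid.

file.list <- list.files(pattern = "\\.csv$", ignore.case = TRUE)

spectra <- lapply(file.list, function(f) {
  x <- read.csv(f, header = FALSE)
  # name the second column after the file so the columns stay distinguishable
  names(x) <- c("Wavenumber", tools::file_path_sans_ext(basename(f)))
  x
})

# merge all spectra on the shared Wavenumber column
wide <- Reduce(function(a, b) merge(a, b, by = "Wavenumber"), spectra)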
ANSWERED: Thank you so much Bob, ffs the issue was not specifying comment='#'. Why this works, when 'skip' should've skipped the offending lines remains a mystery. Also see Gray's comment re: Excel's 'Text to Columns' feature for a non-R solution.
Hey folks,
this has been a demon on my back for ages.
The data I work with is always a collection of tab-delimited .txt files, so my analysis always begins with gathering the file paths to each, feeding those into read.csv(), and binding the results into a df.
dat <- list.files(
path = 'data',
pattern = '*.txt',
full.names = TRUE,
recursive = TRUE
) %>%
map_df( ~read.csv( ., sep='\t', skip=16) ) # actual data begins at line 16
This does exactly what I want, but I've been transitioning to tidyverse over the last few years.
I don't mind using utils::read.csv(); since my datasets are usually small, the speed benefit of readr wouldn't be felt. But for consistency's sake I'd rather use readr.
When I do the same, but sub readr::read_tsv(), i.e.,
dat <-
.... same call to list.files()
%>%
map_df( ~read_tsv( ., skip=16 ))
I always get an empty (0x0) table. But it seems to be 'reading' the data, because I get a warning print out of 'Parsed with column specification: cols()' for every column in my data.
Clearly I'm misunderstanding here, but I don't know what about it I don't understand, which has made my search for answers challenging & fruitless.
So... what am I doing wrong here?
Thanks in advance!
edit: an example snippet of (one of) my data files was requested, hope this formats well!
# KLIBS INFO
# > KLibs Commit: 11a7f8331ba14052bba91009694f06ae9e1cdd3d
#
# EXPERIMENT SETTINGS
# > Trials Per Block: 72
# > Blocks Per Experiment: 8
#
# SYSTEM INFO
# > Operating System: macOS 10.13.4
# > Python Version: 2.7.15
#
# DISPLAY INFO
# > Screen Size: 21.5" diagonal
# > Resolution: 1920x1080 @ 60Hz
# > View Distance: 57 cm
PID search_type stimulus_type present_absent response rt error
3 time COLOUR present absent 5457.863881 TRUE
3 time COLOUR absent absent 5357.009108 FALSE
3 time COLOUR present present 2870.76412 FALSE
3 time COLOUR absent absent 5391.404728 FALSE
3 time COLOUR present present 2686.6131 FALSE
3 time COLOUR absent absent 5306.652878 FALSE
edit: Using Jukob's suggestion
files <- list.files(
path = 'data',
pattern = '*.txt',
full.names = TRUE,
recursive = TRUE
)
for (i in 1:length(files)) {
print(read_tsv(files[i], skip=16))
}
prints:
Parsed with column specification:
cols()
# A tibble: 0 x 0
... for each file
If I print files, I do get the correct list of file paths. If I remove skip=16 I get:
Parsed with column specification:
cols(
`# KLIBS INFO` = col_character()
)
Warning: 617 parsing failures.
row col expected actual file
15 -- 1 columns 21 columns 'data/raw/2019/colour/p1.2019-02-28.txt'
16 -- 1 columns 21 columns 'data/raw/2019/colour/p1.2019-02-28.txt'
17 -- 1 columns 21 columns 'data/raw/2019/colour/p1.2019-02-28.txt'
18 -- 1 columns 21 columns 'data/raw/2019/colour/p1.2019-02-28.txt'
19 -- 1 columns 21 columns 'data/raw/2019/colour/p1.2019-02-28.txt'
... ... ......... .......... ........................................
See problems(...) for more details.
... for each file
FWIW I was able to solve the problem using your snippet by doing something along the following lines:
# Didn't work for me since when I copy and paste your snippet,
# the tabs become spaces, but I think in your original file
# the tabs are preserved so this should work for you
read_tsv("dat.tsv", comment = "#")
# This works for my case
read_table2("dat.tsv", comment = "#")
Didn't even need to specify the skip argument!
But also, no idea why using skip instead of comment fails... :(
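For completeness, here is a hedged sketch of the original list.files() pipeline with the comment = fix applied; it assumes every metadata line in the files starts with #, as in the sample shown above.

library(tidyverse)

dat <- list.files(
  path = 'data',
  pattern = '*.txt',
  full.names = TRUE,
  recursive = TRUE
) %>%
  # let readr drop the metadata block instead of counting lines to skip
  map_df(~ read_tsv(.x, comment = "#"))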
Could you try the following code? The value of i may give you some idea of which file has a problem.
files <- list.files(path = "path", full.names = T, pattern = ".csv")
for (i in 1:length(files)){
print(read_tsv(files[i], skip = 16))
}
I am trying to create a large list of file URLs by concatenating various pieces together. (Say, ~40 file URLs which represent multiple data types for each of the 50 states.) Eventually, I will download and then unzip/unrar these files. (I have working code for that portion of it.)
I'm very much an R noob, so please bear with me, here.
I have a set of data frames:
states - list of 50 state abbreviations
partial_url - a partial URL for the 50 states
url_parts - a list of each of the remaining URL pieces (5 file types to download)
year
filetype
I need a URL that looks like this:
http://partial_url/state_urlpart_2017_file.csv.gz
I was able to build the partial_url data frame with the following:
for (i in seq_along(states)) {
url_part1 <- as.data.frame(paste0(url,states[[i]],"/",dir,"/"))
}
I was hoping that some kind of nested loop might work to do the rest, like so:
for (i in 1:partial_url){
for (j in 1:url_parts){
for(k in 1:states){
url_part2 <- as.data.frame(paste0(partial_url[[i]],"/",url_parts[[j]],states[[k]],year,filetype))
}}}
Can anyone suggest how to proceed with the final step?
In my understanding, everything the OP needs can be handled by the paste0 function itself. paste0 is vectorised, so the for-loop shown by the OP is not needed. The data in my example is stored as vectors, but each piece could equally be a column of a data.frame.
For example:
states <- c("Alabama", "Colorado", "Georgia")
partial_url <- c("URL_1", "URL_2", "URL_3")
url_parts <- c("PART_1", "PART_2", "PART_3")
year <- 2017
fileType <- "xls"
#Now use paste0 will list out all the URLS
paste0(partial_url,"/",url_parts,states,year,fileType)
#[1] "URL_1/PART_1Alabama2017xls" "URL_2/PART_2Colorado2017xls"
#[3] "URL_3/PART_3Georgia2017xls"
EDIT: multiple fileType values, based on feedback from @Onyambu
We can use rep(fileType, each = length(states)) to support multiple file types.
The solution looks like this:
fileType <- c("xls", "doc")
paste0(partial_url,"/",url_parts,states,year,rep(fileType,each = length(states)))
[1] "URL_1/PART_1Alabama2017xls" "URL_2/PART_2Colorado2017xls" "URL_3/PART_3Georgia2017xls"
[4] "URL_1/PART_1Alabama2017doc" "URL_2/PART_2Colorado2017doc" "URL_3/PART_3Georgia2017doc"
Here is a tidyverse solution with some simple example data. The approach is to use complete to give yourself a data frame with all possible combinations of your variables. This works because if you make each variable a factor, complete will include all possible factor levels even if they don't appear. This makes it easy to combine your five url parts even though they appear to have different nrow (e.g. 50 states but only 5 file types). unite allows you to join together columns as strings, so we call it three times to include the right separators, and then finally add the http:// with mutate.
Re: your for loop, I find nested for loop logic hard to work through in the first place. But at least two issues with the loop as written are that you have 1:partial_url instead of 1:length(partial_url) (and similarly for the others), and that you simply overwrite the same object on every pass of the loop. I prefer to avoid loops unless it's a problem where they're absolutely necessary.
library(tidyverse)
states <- tibble(state = c("AK", "AZ", "AR", "CA", "NY"))
partial_url <- tibble(part = c("part1", "part2"))
url_parts <- tibble(urlpart = c("urlpart1", "urlpart2"))
year <- tibble(year = 2007:2010)
filetype <- tibble(filetype = c("csv", "txt", "tar"))
urls <- bind_cols(
states = states[[1]] %>% factor() %>% head(2),
partial_url = partial_url[[1]] %>% factor() %>% head(2),
url_parts = url_parts[[1]] %>% factor() %>% head(2),
year = year[[1]] %>% factor() %>% head(2),
filetype = filetype[[1]] %>% factor() %>% head(2)
) %>%
complete(states, partial_url, url_parts, year, filetype) %>%
unite("middle", states, url_parts, year, sep = "_") %>%
unite("end", middle, filetype, sep = ".") %>%
unite("url", partial_url, end, sep = "/") %>%
mutate(url = str_c("http://", url))
print(urls)
# A tibble: 160 x 1
url
<chr>
1 http://part1/AK_urlpart1_2007.csv
2 http://part1/AK_urlpart1_2007.txt
3 http://part1/AK_urlpart1_2008.csv
4 http://part1/AK_urlpart1_2008.txt
5 http://part1/AK_urlpart1_2009.csv
6 http://part1/AK_urlpart1_2009.txt
7 http://part1/AK_urlpart1_2010.csv
8 http://part1/AK_urlpart1_2010.txt
9 http://part1/AK_urlpart2_2007.csv
10 http://part1/AK_urlpart2_2007.txt
# ... with 150 more rows
Created on 2018-02-22 by the reprex package (v0.2.0).
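As a side note on the complete()/factor step: newer tidyr also has crossing() (and expand_grid()), which build the all-combinations frame directly without converting anything to factors. A hedged sketch producing the same set of URLs:

library(tidyverse)

urls <- crossing(
  partial_url = c("part1", "part2"),
  state       = c("AK", "AZ", "AR", "CA", "NY"),
  url_part    = c("urlpart1", "urlpart2"),
  year        = 2007:2010,
  filetype    = c("csv", "txt", "tar")
) %>%
  mutate(url = str_c("http://", partial_url, "/", state, "_", url_part, "_",
                     year, ".", filetype))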
I have the following data.table (data.frame) called output:
> head(output)
Id Title IsProhibited
1 10000074 Renault Logan, 2005 0
2 10000124 Ñêëàäñêîå ïîìåùåíèå, 345 ì<U+00B2> 0
3 10000175 Ñó-øåô 0
4 10000196 3-ê êâàðòèðà, 64 ì<U+00B2>, 3/5 ýò. 0
5 10000387 Samsung galaxy S4 mini GT-I9190 (÷¸ðíûé) 0
6 10000395 Êàðòèíà ""Êðûì. Ïîñåëîê Àðîìàò"" (õîëñò, ìàñëî) 0
I am trying to export it to a CSV like so:
> write.table(output, 'output.csv', sep = ',', row.names = FALSE, append = T)
However, when doing so I get the following error:
Error in .External2(C_writetable, x, file, nrow(x), p, rnames, sep, eol, :
unimplemented type 'list' in 'EncodeElement'
In addition: Warning message:
In write.table(output, "output.csv", sep = ",", row.names = FALSE, :
appending column names to file
I have tried converting the Title to a string so that it is no longer of type list like so:
toString(output$Title)
But, I get the same error. My types are:
> class(output)
[1] "data.frame"
> class(output$Id)
[1] "integer"
> class(output$Title)
[1] "list"
> class(output$IsProhibited)
[1] "factor"
Can anyone tell me how I can export my data.frame to CSV?
Another strange thing I've noticed is that if I write head(output) my text is not encoded properly (as shown above), whereas if I simply write output$Title[0:3] it will display the text correctly, like so:
> output$Title[0:3]
[[1]]
[1] "Renault Logan, 2005"
[[2]]
[1] "Складское помещение, 345 м²"
[[3]]
[1] "Су-шеф"
Any ideas regarding that? Is it relevant to my initial problem?
Edit: Here is my new output:
Id Title IsProhibited
10000074 Renault Logan, 2005 0
10000124 СкладÑкое помещение, 345 м<U+00B2> 0
10000175 Су-шеф 0
10000196 3-к квартира, 64 м<U+00B2>, 3/5 ÑÑ‚. 0
10000387 Samsung galaxy S4 mini GT-I9190 (чёрный) 0
10000395 Картина \\"Крым. ПоÑелок Ðромат\"\" (холÑÑ‚ маÑло)" 0
10000594 КальÑн 25 Ñм 0
10000612 1-к квартира, 45 м<U+00B2>, 6/17 ÑÑ‚. 0
10000816 Гараж, 18 м<U+00B2> 0
10000831 Платье 0
10000930 Карбюраторы К-22И, К-22Г от газ 21 и газ 51 0
Notice how line ID 10000395 is messed up? It seems to contain quotes of its own, which are messing up the CSV. How can I fix that?
Do this, irrespective of how many columns you have:
df <- apply(df,2,as.character)
Then do write.csv.
As mentioned in the comments, you should be able to do something like this (untested) to "flatten" your list into a character vector:
output$Title <- vapply(output$Title, paste, collapse = ", ", character(1L))
As also mentioned, if you wanted to try the unlist approach, you could "expand" each row by the individual values in output$Title, something like this:
x <- vapply(output$Title, length, 1L) ## How many items per list element
output <- output[rep(rownames(output), x), ] ## Expand the data frame
output$Title <- unlist(output$Title, use.names = FALSE) ## Replace with raw values
There is a new function (introduced in November 2016) in the data.table package that handles writing a data.table object to csv quite well, even when a column of the data.table is a list.
library(data.table)
fwrite(output, file = "myDT.csv")
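A minimal hedged usage note: fwrite()'s sep2 argument controls how the elements of a list column are joined inside each cell; the "|" separator below is fwrite's default and is just shown explicitly as an example.

library(data.table)

# output is the data.frame from the question, with the list column Title
fwrite(output, file = "output.csv", sep2 = c("", "|", ""))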
Another easy solution: maybe one or more columns are of type list, so we need to convert them to character or to a matrix first. There are two easy options.
Convert each column with as.character:
df$col1 <- as.character(df$col1)
df$col2 <- as.character(df$col2)
# ...and so on
Or, the best option, convert df to a matrix:
df <- as.matrix(df)
Now write df to csv. Works for me.
Those are all elegant solutions.
For the curious reader who would prefer some R code to ready-made packages, here's an R function that returns a non-list dataframe that can be exported and saved as a .csv.
output is the "troublesome" data frame in question.
df_unlist <- function(df) {
  df <- as.data.frame(df)
  nr <- nrow(df)
  c.names <- colnames(df)
  # indices of the columns that are lists
  lscols <- as.vector(which(apply(df, 2, is.list) == TRUE))
  if (length(lscols) != 0) {
    for (i in lscols) {
      temp <- as.vector(unlist(df[, i]))
      # pad with zeros if the unlisted column is shorter than the data frame
      if (length(temp) != nr) {
        adj <- nr - length(temp)
        temp <- c(rep(0, adj), temp)
      }
      df[, i] <- temp
    } # end for
    df <- as.data.frame(df)
    colnames(df) <- c.names
  }
  return(df)
}
Apply the function on dataframe "output" :
newDF<-df_unlist(output)
You can next confirm that the new data frame (newDF) has no list columns via apply(). This should return FALSE for every column.
apply(newDF, 2, is.list)  # 2 = apply over columns
Proceed to save the new dataframe, newDF as a .csv file to a path of your choice.
write.csv(newDF,"E:/Data/newDF.csv")
Assuming the path you want to save to is Path (i.e. path = Path) and df is the dataframe you want to save, follow these steps:
Save df as txt document:
write.table(df,"Path/df.txt",sep="|")
Read the text file into R:
Data = read.table("Path/df.txt",sep="|")
Now save as csv:
write.csv(Data, "Path/df.csv")
That's it.
# First coerce the data.frame to all-character
df = data.frame(lapply(output, as.character), stringsAsFactors=FALSE)
# write file
write.csv(df,"output.csv")