I am a student and have been given a project to study climate data from Giovanni (NASA). Our code is provided and we are left to 'find our way', so other answers don't seem to relate to the style of code I've been given. On top of this, I am a beginner in R, so changing the code is very difficult.
Basically, I'm trying to create a time-series plot from the following code:
library(lubridate) # provides parse_date_time()
library(ggplot2) # provides qplot()
## Function for loading Giovanni time series data
load_giovanni_time <- function(path){
file_data <- read.csv(path,
skip=6,
col.names = c("Date",
"Temperature",
"NA",
"Site",
"Bleached"))
file_data$Date <- parse_date_time(file_data$Date, orders="ymdHMS")
return(file_data)
}
## Create a list of files
file.list <- list.files("./Data/courseworktimeseries/")
file.list <- as.list(paste0("./Data/courseworktimeseries/", file.list))
# for(i in file.list){
# load_giovanni_time(i)
# }
#Load all the files
all_data <- lapply(X=file.list,
FUN=load_giovanni_time)
all_data <- as.data.frame(do.call(rbind, all_data))
## Inspect the data with a plot
p <- qplot(data=all_data,
x=Date,
y=Temperature,
colour=Site,
linetype=Bleached,
geom="line")
print(p)
Now the first problem is that when the data is merged into one dataset, all the dates change (the original date range is 2002-2015 and it becomes 2002-2030), which obviously ruins the plot. I found that I can stop the dates changing by deleting this line:
file_data$Date <- parse_date_time(file_data$Date, orders="ymdHMS")
However, when this is deleted, I get the following error:
geom_path: Each group consists of only one observation. Do you need to adjust the group aesthetic?
Could anyone help me get round this without editing the code too much? I feel like the problem is that this line formats the date incorrectly, so I imagine it's only a small problem. I'm just very much a beginner and have to implement the code within 1-2 days.
Thanks
For anyone who ever has this problem... I found the solution.
file_data$Date <- parse_date_time(file_data$Date, orders="ymdHMS")
This line of code parses the date information from my CSV in year-month-day order. In Excel my dates were the other way round (day first), so 30th December 2015 (30/12/2015) was read by R as 2030-12-20, screwing up the data.
In Excel, select all the dates, press CTRL+1, and change the format to match the order R is parsing (year-month-day).
All done =)
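An alternative that avoids touching the spreadsheet is to tell R that the dates are day-first. Below is a minimal sketch in base R, assuming the Date column holds strings such as 30/12/2015. The sample values are made up, and if the real export also carries a time part the format string would need extending.

```r
# Made-up sample values standing in for the Date column of the CSV
raw_dates <- c("30/12/2015", "01/07/2002")

# Parse as day/month/year instead of year/month/day
parsed <- as.Date(raw_dates, format = "%d/%m/%Y")

# Show the result in ISO order to confirm the year is now correct
iso_dates <- format(parsed, "%Y-%m-%d")
print(iso_dates)
```

With lubridate, passing a day-first order such as orders="dmy" to parse_date_time() should have the same effect. The geom_path message most likely appears because unparsed dates are treated as text, so every distinct date becomes its own group with a single observation.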
In all honesty, I'm not quite sure what the issue is but I've had a similar issue in R in the past. I've written code to extract the variables I want from .dat files (specifically the current panel survey). I have CSV files that contain the positions of each variable by year (positions change by year). For example, HRFS12M1 in 2010 is 1173-1174, and in 2019 1223-1224 (and this part of the code is not shown but works so I didn't include it). So I have two folders and two separate directories one with the positions and one with the .dat files. I first loop through the positions files and create dfs with positions for each year (2010-2019). After the position dfs are generated I run the code below to obtain what variables I want in a large merged df. Now the code works as intended when I select 4 or fewer variables in the varList. However, the moment I try to use more variables the df starts to produce values that aren't within those columns. Does anyone know why it's doing this? I've tried several different variables to confirm it's not a problem with the position files but a problem with the number of variables.
#Loop through list of .dat files (lst2 contains name of files example:2010dec.dat)
for(i in 1:length(lst2)){
#Import the data cps data set
temp_cps<-readLines(lst2[i])
#Get the positions of the relevant year
temp_pos<- get(paste("Year", i, sep = "."))
#List of Variables we are looking at (can't use more than 4)
varList=c("HRYEAR4","GESTFIPS","HESP1","HRFS12M1")
#Get positions only for the variables selected
temp_pos=temp_pos[grep(paste(varList, collapse="|"), temp_pos$Variable),]
#Create the dataframe
df<-NULL
for(j in 1:length(varList)){
df<-cbind(df,substr(temp_cps,temp_pos$Pos1[j],temp_pos$Pos2[j]))
}
df<-as.data.frame(df)
names(df)<-varList
assign(paste("CPS", i, sep = "."), df)
}
#AutoMate appending each year
for (k in 1:(length(lst2)-1)){
if(k==1){
CPS1 <- get(paste("CPS", k, sep = "."))
CPS2 <- get(paste("CPS", k+1, sep = "."))
#Append to keep only rows of second data set
merged_data=rbind(CPS1,CPS2)
}
else{
CPS_C <- get(paste("CPS", k+1, sep = "."))
merged_data=rbind(merged_data,CPS_C)
}
if(k==length(lst2)-1){
#Clear the workspace, keeping only merged_data
rm(list=setdiff(ls(), "merged_data"))
}
}
This is what it looks like before it breaks
This is what happens after adding more than 4 variables
I think I figured it out, though I need to run a few extra variables to confirm. The program currently won't work if my list of variables is not in the same order as their positions in the file. For example, if "HRYEAR4" is at 82-84 and "GESTFIPS" is at 93-94, then the program will fail if I put GESTFIPS before HRYEAR4 in varList. However, if HRYEAR4 comes first then the program will run as intended. Does anyone have a quick idea how to replace this line df<-cbind(df,substr(temp_cps,temp_pos$Pos1[j],temp_pos$Pos2[j])) to make it more dynamic and avoid this issue? If not, it's not a big deal for the moment; I'll just put them in order and see if I can find a better solution in the future. Thanks to anyone who tried to help.
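One possible way to make the extraction independent of the order of varList is to look up each variable's row by name with match() instead of relying on the loop index. The filtered temp_pos keeps the row order of the positions file, while names(df) follows varList, so the two can drift apart. This is a sketch only: the position table and the fixed-width lines below are made-up stand-ins for one year's temp_pos and temp_cps.

```r
# Hypothetical position table (stand-in for one year's temp_pos)
temp_pos <- data.frame(
  Variable = c("HRYEAR4", "GESTFIPS", "HESP1"),
  Pos1 = c(1, 5, 7),
  Pos2 = c(4, 6, 7),
  stringsAsFactors = FALSE
)
# Made-up fixed-width records (stand-in for readLines() output)
temp_cps <- c("2010011", "2019062")

# Deliberately NOT in file order
varList <- c("GESTFIPS", "HRYEAR4")

# Row of temp_pos for each requested variable, in varList order
idx <- match(varList, temp_pos$Variable)
stopifnot(!anyNA(idx)) # fail early if a variable is missing

# Cut each variable out of every record using its own positions
cols <- lapply(idx, function(k) substr(temp_cps, temp_pos$Pos1[k], temp_pos$Pos2[k]))
names(cols) <- varList
df <- as.data.frame(cols, stringsAsFactors = FALSE)
print(df)
```

Because match() compares whole names, it also avoids a side effect of the unanchored grep pattern, which would pick up any longer variable name that merely contains one of the requested names.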
I have a vcf file with 20 variants in chromosome 1 that I would like to visualise using vcfR.
What I am doing is the following:
#Read in my mouse genome and filter and rename chromosome 1
ref_genome <- ape::read.FASTA("mouse_genome/Mus_musculus.GRCm38.dna.primary_assembly.fa", type = "DNA")
ref_genome_chr1 <- ref_genome[ grep("GRCm38:1:", names(ref_genome))]
names(ref_genome_chr1) <- "1"
ref_genome_chr1 <- as.matrix(ref_genome_chr1)
#Read in my vcf file and also a mouse gff annotation file
vis_test_vcf <- read.vcfR("test_data/filter_chr1_test.recode.vcf", verbose = TRUE)
mouse_gff <- read.table("mouse_genome/Mus_musculus.GRCm38.102.gff3", sep="\t", quote="")
#Generate chromR object
chrom_test <- create.chromR(name="chr1_test", vcf=vis_test_vcf, seq=ref_genome_chr1, ann=mouse_gff, verbose=TRUE)
#Now try and plot this
chromoqc(chrom_test)
When I head() etc the various objects they look ok and I don't get any warnings about chromosome names not matching or anything. However, the plot is missing the "Variants per site" track, which is all I care about...I get this plot, whereby it's not showing the Variants per site. It's also not showing the DP and MQ but I'm not so worried about that at this stage...
Has anyone had a similar issue? I would be grateful for any pointers!
Kind regards
Cora
Ok so I just found the answer!
I needed to process my chromR object with proc.chromR() before plotting:
chrom_test <- proc.chromR(chrom_test, verbose = TRUE)
chromoqc(chrom_test)
Now it has plotted the variants.
Still need to work out the DP and MQ stuff.
I'm doing a little project where the goal is to retrieve data in text format from a website (http://regsho.finra.org/regsho-Index.html).
The website was nice enough to provide the data online, but it is split by day across different links.
I thought about looping through the dates and storing the data with the following code:
#Download the needed data
my_data <- c()
for (i in 01:13){
my_data <- read.delim(sprintf("http://regsho.finra.org/CNMSshvol202005%i.txt", i), header=TRUE, sep="|")
}
head(my_data)
There are two problems. The first is in this line:
for (i in 01:13){ # The dates on the website are 01, 02, 03, ... and the loop seems to omit the 0
I've used sprintf() so I can have a variable in a string.
The second is in this line, where my_data is overwritten on every iteration, so it only ever holds the last file downloaded:
my_data <- read.delim(sprintf("http://regsho.finra.org/CNMSshvol202005%i.txt", i), header=TRUE, sep="|")
Could somebody tell me whether I'm going in the right direction? I'm starting to doubt myself here.
Any help would be greatly appreciated!
Thanks in advance
This should give you a leading 0 without using an extra package:
sprintf("%02d", i)
i.e.
sprintf("http://regsho.finra.org/CNMSshvol202005%02d.txt", i)
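For the second problem (my_data being overwritten), one common pattern is to read each day into a list and bind the pieces together at the end. The sketch below assumes that some days have no file (weekends, for instance), so failed reads are skipped. read_day and read_all are hypothetical helper names, and the download itself is left commented out.

```r
# sprintf() is vectorised, so all URLs can be built in one call
days <- 1:13
urls <- sprintf("http://regsho.finra.org/CNMSshvol202005%02d.txt", days)

# Hypothetical helper: read one pipe-delimited file, NULL if it cannot be read
read_day <- function(path) {
  tryCatch(read.delim(path, header = TRUE, sep = "|"),
           error = function(e) NULL)
}

# Hypothetical helper: read every file and stack the results
read_all <- function(paths) {
  daily <- lapply(paths, read_day)        # one data frame (or NULL) per day
  daily <- Filter(Negate(is.null), daily) # drop the days that failed
  do.call(rbind, daily)                   # stack the rest row-wise
}

# my_data <- read_all(urls)
# head(my_data)
```

Keeping the pieces in a list and calling rbind once also avoids growing a data frame inside the loop.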
This question is pretty simple and maybe even dumb, but I can't find an answer on google. I'm trying to read a .txt file into R using this command:
data <- read.csv("perm2test.txt", sep="\t", header=FALSE, row.names=1, col.names=paste("V", seq_len(max(count.fields("perm2test.txt", sep="\t"))), sep=""), fill=TRUE)
The reason I have the col.names command is because every line in my .txt file has a different number of observations. I've tested this on a much smaller file and it works. However, when I run it on my actual dataset (which is only 48MB), I'm not sure if it is working... The reason I'm not sure is because I haven't received an error message, yet it has been "running" for over 24 hours at this point (just the read.csv command above). Is it possible that it has run out of memory and it just doesn't output a warning?
I've looked around and I know people say there are functions out there to reduce the size and remove lines that aren't needed, etc. but to be honest I don't think this file is THAT big, and unfortunately I do need every line in the file... (it's actually only 70 lines, but some lines contain as much as 100k entries, while others may only have say 100). Any ideas what is happening?
Obviously untested but should give you some code to modify:
datL <- readLines("perm2test.txt") # one line per group
# may want to exclude some lines but question is unclear
# header=FALSE so the first value on each line is not consumed as a column name
listL <- lapply(datL, function(L) read.delim(text=L, header=FALSE, colClasses="numeric") )
# This is a list of values by group
dfL <- data.frame( vals = unlist(listL),
# Now build a grouping vector that is associated with each bundle of values
# (numbered groups, since LETTERS runs out after 26 lines)
groups= factor( rep( seq_along(listL) ,
sapply(listL, length) ) ) )
# Might have been able to do that last maneuver with `stack`.
library(lattice)
bwplot( vals ~ groups, data=dfL)
I'm new to R, and new to this forum. I've searched but cannot easily find an answer to this question:
I have numbers of cases of a disease by week according to location, stored in a .csv file with variable names cases.wk24, cases.wk25, etc. I also have population for each location, and want to generate incidence rates (# cases/population) for each of the locations.
I would like to write a loop that generates incidence rates by location for each week, and stores these in new variables called "ir.wk24", "ir.wk25", etc
I am stuck at 2 points:
1. Is it possible to tell R to run a loop if it comes across a variable that looks like "cases.wk"? In some programmes, one would use a star - cases.wk*
2. How could I then generate the new variables with sequential naming and store these in the dataset?
I really appreciate any help on this - been stuck with internet searches all day!
thanks
x <- data.frame(case.wk24=c(1,3),case.wk25=c(3,2), pop=c(7,8))
weeks <- 24:25
varnames <- paste("case.wk", weeks, sep="")
ir <- sapply(varnames,FUN=function(.varname){
x[,.varname]/x[,"pop"]
})
ir <- as.data.frame(ir)
names(ir) <- paste("ir.wk", weeks, sep="")
x <- cbind(x,ir)
x
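For the wildcard part of the question, one option is to pick the columns by pattern with grep() rather than listing the week numbers by hand. This is a minimal sketch using the column naming from the question (cases.wk24, cases.wk25) and a made-up two-row dataset with a population column called pop.

```r
# Made-up example data using the naming from the question
x <- data.frame(cases.wk24 = c(1, 3), cases.wk25 = c(3, 2), pop = c(7, 8))

# All column names that start with "cases.wk" (the equivalent of cases.wk*)
case_cols <- grep("^cases\\.wk", names(x), value = TRUE)

# Matching output names: cases.wk24 -> ir.wk24, and so on
ir_cols <- sub("^cases", "ir", case_cols)

# Divide every case column by the population and store the results in x
x[ir_cols] <- x[case_cols] / x$pop
print(x)
```

New weeks added to the file are then picked up automatically, as long as they follow the same naming.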