I'm improving my R skills by rebuilding some of the amazing stuff they do on r-bloggers. Right now I'm trying to reproduce this:
http://wiekvoet.blogspot.nl/2015/06/deaths-in-netherlands-by-cause-and-age.html. The relevant dataset for this exercise can be found here:
http://statline.cbs.nl/Statweb/publication/?VW=D&DM=SLNL&PA=7052_95&D1=0-1%2c7%2c30-31%2c34%2c38%2c42%2c49%2c56%2c62-63%2c66%2c69-71%2c75%2c79%2c92&D2=0&D3=0&D4=0%2c10%2c20%2c30%2c40%2c50%2c60%2c63-64&HD=150710-0924&HDR=G1%2cG2%2cG3&STB=T
Diving into the code (to be found at the bottom of the first link), I'm running into this piece of code:
r1 <- read.csv(sep = ';', header = FALSE,
               col.names = c('Causes', 'Causes2', 'Age', 'year', 'aantal', 'count'),
               na.strings = '-', text = txtlines[3:length(txtlines)]) %>%
  select(., -aantal, -Causes2)
Could anybody help me separate the steps that are taken here?
Here is an explanation of what each line in the call to read.csv() is doing in your example. Note that the assignment of the last parameter, text, is complicated and depends on the script from the link you gave above. At a high level, the author first reads in all lines from the file "Overledenen__doodsoo_170615161506.csv" which contain the string "Centraal", using only the third through final lines from that filtered set. There is an additional step applied to these lines as well.
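For orientation, the txtlines object is created earlier in the linked script, roughly along these lines (a sketch only; the exact filtering on "Centraal" is in the original post and is not reproduced here):
# assumption: the raw CBS export is first read in as plain text lines
txtlines <- readLines("Overledenen__doodsoo_170615161506.csv")
# the linked script then filters these lines on the string "Centraal"
# (see the blog post for the exact expression) before handing them to read.csv()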
r1 <- read.csv(
  # columns are separated by semicolons
  sep = ';',
  # the first row is data (i.e. it is NOT a header)
  header = FALSE,
  # names of the six columns
  col.names = c('Causes', 'Causes2', 'Age', 'year', 'aantal', 'count'),
  # treat hyphens as NA
  na.strings = '-',
  # read from the third line to the final line of the original input
  # Overledenen__doodsoo_170615161506.csv, after some
  # filtering has been applied
  text = txtlines[3:length(txtlines)]) %>%
  select(., -aantal, -Causes2)
read.csv reads the csv content, splitting columns on the separator ";",
so that an input like a;b;c is separated into: first column = a, second = b, third = c.
header=FALSE specifies that the original file has no header row.
col.names assigns the listed names to the columns in R.
na.strings='-' tells read.csv to treat the string '-' as a missing value (NA).
text=txtlines[3:length(txtlines)] reads the lines from position 3 through the end of txtlines.
%>% select(.,-aantal,-Causes2) pipes the result into dplyr::select, which drops the aantal and Causes2 columns.
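As a small illustration of how the text argument works (with made-up values, not the real CBS data), read.csv can parse an in-memory character vector directly, and the result can then be piped into select:
library(dplyr)
# made-up lines in the same six-column, semicolon-separated shape
txt <- c("Cause A;Sub A;0-65;2010;-;120",
         "Cause B;Sub B;65+;2010;15;340")
read.csv(text = txt, sep = ";", header = FALSE,
         col.names = c("Causes", "Causes2", "Age", "year", "aantal", "count"),
         na.strings = "-") %>%
  select(-aantal, -Causes2)
#    Causes  Age year count
# 1 Cause A 0-65 2010   120
# 2 Cause B  65+ 2010   340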
I wrote an R script to run some scientometric analyses of Journal Citation Reports (JCR) data, which I have been using and updating over the past years.
Today Clarivate introduced some changes in its database, and the exported CSV file now contains a final empty column, which breaks my script. Because of this last empty column, read.csv automatically assumes that the first column contains the row names.
As before, there is also a useless first row, which my script removes automatically with skip = 1.
One simple solution to this "empty column situation" would be to manually remove this last column in Excel, and then proceed with my script as usual.
However, is there a way to add this removal to my script using base R?
The beginning of my script is:
jcreco = read.csv("data/jcr ecology 2020.csv",
na = "n/a", skip = 1, header = T)
The original CSV file downloaded from JCR is available in my Dropbox.
Could you please help me? Thank you!
The real problem is that the empty column doesn't have a header. If they had just put the extra comma at the end of the header line as well, this probably wouldn't be as messy. But you can do a bit of column shuffling with fill=TRUE. For example:
dd <- read.table("~/../Downloads/jcr ecology 2020.csv", sep=",",
skip=2, fill=T, header=T, row.names=NULL)
names(dd)[-ncol(dd)] <- names(dd)[-1]
dd <- dd[,-ncol(dd)]
This reads in the data but puts the row names into the data.frame and fills the last column with NA. You then shift all the column names one position to the left and drop the last column.
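To see just the renaming trick in isolation, here is a made-up data frame whose names are offset one position to the right of the data, roughly the situation described above (the values and the names "Journal"/"Rank" are hypothetical, not the real JCR headers):
dd <- data.frame(a = c("Journal A", "Journal B"),  # really the journal names
                 b = c(1.2, 3.4),                  # really the rank values
                 c = c(NA, NA))                    # the spurious empty column
names(dd) <- c("row.names", "Journal", "Rank")     # names offset by one, as read in

names(dd)[-ncol(dd)] <- names(dd)[-1]  # shift the names one step to the left
dd <- dd[, -ncol(dd)]                  # drop the trailing all-NA column
dd
#     Journal Rank
# 1 Journal A  1.2
# 2 Journal B  3.4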
Here is a way:
Read the data in as text lines;
Discard the first line;
Remove the trailing comma with sub;
Create a text connection;
And read the data in from the connection.
The variable fl holds the file name; on my disk I had to set the working directory first.
fl <- "jcr_ecology_2020.csv"
txt <- readLines(fl)
txt <- txt[-1]
txt <- sub(",$", "", txt)
con <- textConnection(txt)
df1 <- read.csv(con)
close(con)
head(df1)
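To see what the sub() step does to a single line (illustrative value only):
sub(",$", "", "1.23,4.56,")
# [1] "1.23,4.56"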
I have a data frame that I read in from a .csv (or .xlsx; I've tried both), and one of the variables in the data frame is a vector of dates.
Generate the data with this
Name <- rep("Date", 15)
num <- seq(1:15)
Name <- paste(Name, num, sep = "_")
data1 <- data.frame(
  Name,
  Due.Date = seq(as.Date("2020/09/24", origin = "1900-01-01"),
                 as.Date("2020/10/08", origin = "1900-01-01"), "days")
)
When I reference one of the cells specifically, like this: str(project_dates$Due.Date[241]) it reads the date as normal.
However, the exact position of the important dates varies from project to project, so I wrote a command that identifies where the important dates are in the sheet, like this: str(project_dates[str_detect(project_dates$Name, "Date_17"), "Due.Date"])
This code worked on a few projects, but on the current project it now returns a character vector of length 2. One of the values is the date and the other value is NA. To make matters worse, the position of the date and the NA is not fixed across dates: the date is the first value in some cells and the second in others (otherwise I would just reference, e.g., the first item in the vector).
What is going on here, but more importantly, how do I fix this?!
Clarification on the second command:
When I was originally reading from an Excel file, the command was project_dates[str_detect(project_dates$Name, "Date_17"), "Due.Date"]$Due.Date because it was returning a 1x1 tibble, and I needed the value in the tibble.
When I switched to reading in data as a csv, I had to remove the $Due.Date because the command was now reading the value as an atomic vector, so the $ operator was no longer valid.
Help me, Oh Blessed 1's (with) Knowledge! You're my only hope!
Edited to include an image of the data like the one that generates the error
I feel sheepish.
I was able to remove the NAs with
data1 <- data1[!is.na(data1$Due.Date), ].
I assumed that command would listwise delete the rows with any missing values, so if the cell contained the 2-length vector, then I would lose the whole row of data. Instead, it removed the NA from the cell, leaving only the date.
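A tiny illustration of that subsetting behaviour on a made-up two-row frame (not the real project data):
d <- data.frame(Name = c("Date_17", "Date_17"),
                Due.Date = as.Date(c("2020-10-05", NA)))
d[!is.na(d$Due.Date), ]   # keeps only the row whose Due.Date is not NA
#      Name   Due.Date
# 1 Date_17 2020-10-05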
Thank you to everyone who commented and offered help!
I have a dataframe with one column containing a bunch of sound file filenames (.wav). I would like to measure something within each sound file and list each measurement in a second, new column next to the corresponding filename. In my code, the measurements are not being put into the second column. If I take each line of the for loop and run it independently with the values 1, 2, 3, etc. substituted for i, the resulting dataframe has the output measurements correctly entered.
I have written a toy example below, which cannot be run without some wav files, but perhaps the problem can be spotted based on the code alone:
library(seewave); library(tuneR)
setwd("D:/wavs")
#make a dataframe containing a column of wav filenames
file_list <-data.frame(c("wav1.wav", "wav2.wav", "wav3.wav"))
colnames(file_list)[1] <-"filelist" #give it a sensible column name
file_list$filelist <-as.character(file_list$filelist) #convert from factor
file_list$mx <-NA #a new empty vector for the measurement results
str(file_list) #this is how it will look before adding measurements
#now read in each wav file, measure something,
#and put the outcome in the empty cell next to that corresponding filename.
for (i in length(file_list$filelist)){
a <-as.character(file_list$filelist[[i]]) #this seemed to be a requirement, that 'wav1.wav' etc be a character
temp <-readWave(a) #read the file using tuneR package
mx <- max(range(temp@left)) #take some measurement from the left channel
file_list$mx[[i]] <-mx #put it in a new column next to the original filename
rm(mx); rm(temp); rm(a) #kill unnecessary things before starting again, for just in case
}
Needless to say, I have scoured the web and Stackoverflow for guidance without success, and tried a bunch of things (e.g. using }next{). Perhaps I need something similar to: file_list2$mx[i,] <-mx. Maybe some easy points for someone? Thank you, I am always grateful for help on SO.
The only problem with your code is that you are not iterating over the whole vector in the for loop. As written, the loop runs only once, with i equal to 3 (the length of the vector). The corrected code is given below:
for (i in 1:length(file_list$filelist)){
a <-as.character(file_list$filelist[[i]]) #this seemed to be a requirement, that 'wav1.wav' etc be a character
temp <-readWave(a) #read the file using tuneR package
mx <- max(range(temp@left)) #take some measurement from the left channel
file_list$mx[[i]] <-mx #put it in a new column next to the original filename
rm(mx); rm(temp); rm(a) #kill unnecessary things before starting again, for just in case
}
Hope this helps!
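As a side note (my suggestion, not part of the original answer): seq_along() is a slightly safer way to write the loop index, because it also behaves correctly when the vector happens to be empty:
for (i in seq_along(file_list$filelist)) {
  # ... same loop body as above ...
}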
How about a tidyverse solution?
library(dplyr)
library(purrr)
library(tuneR)   # needed for readWave()

data.frame(filelist = c("wav1.wav", "wav2.wav", "wav3.wav"), stringsAsFactors = F) %>%
  mutate(mx = map_dbl(filelist, ~ readWave(.x)@left %>% max))
I have 40 CSV files with only 1 column each. I want to combine the data from all 40 files into 1 CSV file with 2 columns.
The data format is like this:
I want to split this column by space and combine all 40 CSV files into 1 file. I want to preserve the number format as well.
I tried the code below, but the number format is not preserved and an extra 3rd column is added for negative numbers. Not sure why.
My Code :
filenames <- list.files(path="C://R files", full.names=TRUE)
merged <- data.frame(do.call("rbind", lapply(filenames, read.csv, header = FALSE)))
data <- do.call("rbind", strsplit(as.character(trimws(merged$V1))," ",fixed=FALSE))
write.csv(data, "export1.csv", row.names=FALSE, na="NA")
The output I got is shown below. If you observe, the negative numbers are put into an extra column. I just want to split by space and put the values into 2 columns, keeping the exact number format of the input.
R Output:
The problem is that the source data is delimited by:
one space when the second number is negative, and
two spaces when the second number is positive (space for the absent minus sign).
The trick is to split the string on one or more spaces:
data <- do.call("rbind", strsplit(as.character(trimws(merged$V1))," +",fixed=FALSE))
I'm a bit OCD about charsets, unreliable files, etc., so I tend to use splitters such as "[[:space:]]+" instead, since they catch any whitespace variant, not just the literal space " " or tab "\t".
(In regex-speak, the + says "one or more". Other modifiers include ? as zero or one, and * as zero or more.)
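A quick illustration of the difference, using made-up values in the same shape as the source data:
x <- c("1.0856148  0.2654469",   # two spaces before a positive number
       "1.0845561 -0.2654469")   # one space before a negative number
do.call(rbind, strsplit(trimws(x), " +"))
#      [,1]        [,2]
# [1,] "1.0856148" "0.2654469"
# [2,] "1.0845561" "-0.2654469"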
I have a script that is working perfectly, except that after my R cbind operation there is an 'X' adjacent to the numerical value that I require in the first row.
Here is my script:
library(ncdf)
library(Kendall)
library(forecast)
library(zoo)
setwd("/home/cohara/RainfallData")
files=list.files(pattern="*.nc")
j=81
for (i in seq(1,9))
{
file<-open.ncdf(sprintf("/home/cohara/RainfallData/%s.nc",i))
year<-get.var.ncdf(file,"time")
data<-get.var.ncdf(file,"var61")
fit<-lm(data~year) #least squares regression
mean=rollmean(data,4,fill=NA)
kendall<-Kendall(data,year)
write.table(kendall[[2]],file="/home/cohara/RainfallAnalysis/Kendall_p-value_for_10%_increase_over_81_-_89_years.csv",append=TRUE,quote=FALSE,row.names=FALSE,col.names=FALSE)
write.table(kendall[[1]],file="/home/cohara/RainfallAnalysis/Kendall_tau_for_10%_increase_over_81_-_89_years.csv",append=TRUE,quote=FALSE,row.names=FALSE,col.names=FALSE)
png(sprintf("./10 percent increase over %s years.png",j))
par(family="serif",mar=c(4,6,4,1),oma=c(1,1,1,1))
plot(year,data,pch="*",col=4,ylab="Precipitation (mm)",main=(sprintf("10 percent increase over %s years",j)),cex.lab=1.5,cex.main=2,ylim=c(800,1400),abline(fit,col="red",lty=1.5))
par(new=T)
plot(year,mean,type="l",xlab="year",ylab="Precipitation (mm)",cex.lab=1.5,ylim=c(800,1400),lty=1.5)
legend("bottomright",legend=c("Kendall tau = ",kendall[[1]]))
legend("bottomleft",legend=c("Kendall 2-tailed p-value = ",kendall[[2]]))
legend(x="topright",c("4 year moving average","Simple linear trend"),lty=1.5,col=c("black","red"),cex=1.2)
legend("topleft",c("Annual total"),pch="*",col="blue",cex=1.2)
dev.off()
j=j+1
}
tmp<-read.csv("/home/cohara/RainfallAnalysis/Kendall_p-value_for_10%_increase_over_81_to_89_years.csv")
tmp2<-read.csv("/home/cohara/RainfallAnalysis/Kendall_p-value_for_10%_increase_over_81_-_89_years.csv")
tmp<-cbind(tmp,tmp2)
tmp3<-read.csv("/home/cohara/RainfallAnalysis/Kendall_tau_for_10%_increase_over_81_to_89_years.csv")
tmp4<-read.csv("/home/cohara/RainfallAnalysis/Kendall_tau_for_10%_increase_over_81_-_89_years.csv")
tmp3<-cbind(tmp3,tmp4)
write.table(tmp,"/home/cohara/RainfallAnalysis/Kendall_p-value_for_10%_increase_over_81_to_89_years.csv",sep="\t",row.names=FALSE)
write.table(tmp3,"/home/cohara/RainfallAnalysis/Kendall_tau_for_10%_increase_over_81_to_89_years.csv",sep="\t",row.names=FALSE)
The output looks like this, from the .csv files created:
X0.0190228056162596 X0.000701081415172666
0.0395622998 0.00531819
0.0126547674 0.0108218994
0.0077754743 0.0015568719
0.0001407317 0.002680057
0.0096391216 0.012719159
0.0107234037 0.0092436085
0.0503448173 0.0103918528
0.0167525802 0.0025036721
I want to be able to use Excel functions on the data, so, for simplicity, I don't want row names (I'll be running this loop maybe a hundred times), but I need column names because otherwise the first set of values is cut off.
Can anyone tell me where the 'X' is coming from and how to get rid of it?
Thanks in advance,
Ciara
Here is what I think is going on. Start by running these small examples:
df1 <- read.csv(text = "0.0190228056162596, 0.000701081415172666
0.0395622998, 0.00531819
0.0126547674, 0.0108218994")
df2 <- read.csv(text = "0.0190228056162596, 0.000701081415172666
0.0395622998, 0.00531819
0.0126547674, 0.0108218994", header = FALSE)
df1
df2
str(df1)
str(df2)
names(df1)
names(df2)
make.names(c(0.0190228056162596, 0.000701081415172666))
Please read ?read.csv, particularly the section about the header argument. As you will find, header = TRUE is the default in read.csv. Thus, if the csv file you read lacks a header, read.csv will still 'assume' that the file has one, and use the values in the first row as a header. Another argument in read.csv is check.names, which defaults to TRUE:
If TRUE then the names of the variables in the data frame are checked to ensure that they are syntactically valid variable names. If necessary they are adjusted (by make.names).
In your case, it seems that the data you read lack a header and that the first row is numbers only. read.csv will by default treat this row as a header. make.names then takes the values in the first row (here the numbers 0.0190228056162596 and 0.000701081415172666) and spits out the 'syntactically valid variable names' X0.0190228056162596 and X0.000701081415172666, which is not what you want.
Thus, you need to set header = FALSE explicitly to prevent read.csv from converting the first row to (valid) variable names.
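A minimal sketch of that fix, applied to one of the files read in the script above and assuming you then want to assign a column name of your own (the name used here is made up):
tmp2 <- read.csv("/home/cohara/RainfallAnalysis/Kendall_p-value_for_10%_increase_over_81_-_89_years.csv",
                 header = FALSE)
names(tmp2) <- "p_value"   # hypothetical name of your own choosing
head(tmp2)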
For next time, please provide a minimal, self-contained (reproducible) example; there are several good guides on general ideas and on how to do it in R.