I have a dataset where I have about 200 \N and I'd like to replace \N with the date/day value in the row above. Such as for row 641, I want to change the date to 10-Nov-14 and day to Mon:
If this can be done in R, does the format of date matter? As currently, these dates are shown as factors.
Easy in Excel. Replace all \N with nothing, select the two relevant columns, HOME > Editing, Find & Select, Go To Special..., Blanks, then
=
↑
Ctrl+Enter.
Then copy range again and HOME > Clipboard - Paste, Paste Special..., Values, OK over the top.
If that's an Excel file that you are importing into R then you need to understand how R works with the backslash characters (which is what is showing in the screenshot) and which is used to "escape" characters. See ?Quotes. Once the data is in R it will probably all be factor columns.
If the dataframe is named 'dat' then this should work to really make true missing values:
is.na( dat) <- dat == "\\N" # need to escape the escape character.
Then use na.locf from package zoo:
library(zoo) # lots of useful methods in zoo.
dat$date <- na.locf(dat$date)
dat$day_of_week <- na.locf(dat$day_of_week)
These methods should work with any class of column, and these would not be R Date-classed variables until you made the conversion.
It can be solved easily in R with the following code:
ListNa <- grep("\\N", a$date)
ListPrewRow <- ListNa-1
data[ListNa,c("date", "day")] <- data [ListPrewRow,c("date", "day")]
Where:
"data" is the data table
"date" and "day" is the columns to be replaced.
This can be done rapidly in Excel with a quick use of Find and FindNext. Assuming you want to replace all of the \N on the ActiveSheet, this code will terminate once it has replaced them all. Offset(-1) gets the value one row up.
Sub ReplaceWithValueAbove()
Dim rng_search As Range
Set rng_search = ActiveSheet.UsedRange.Find("\N")
While Not rng_search Is Nothing
'set to row above
rng_search = rng_search.Offset(-1)
'find the next one
Set rng_search = ActiveSheet.UsedRange.FindNext()
Wend
End Sub
Before and After pictures
Related
I am trying to output a dataframe in R to a .txt file. I want the .txt file to ultimately mirror the dataframe output, with columns and rows all aligned. I found this post on SO which mostly gave me the desired output with the following (now modified) code:
gene_names_only <- select(deseq2_hits_table_df, Gene, L2F)
colnames(gene_names_only) <- c()
capture.output(
print.data.frame(gene_names_only, row.names=F, col.names=F, print.gap=0, quote=F, right=F),
file="all_samples_comparison_gene_list.txt"
)
The resultant output, however, does not align negative and positive values. See:
I ultimately want both positive and negative values to be properly aligned with one another. This means that -0.00012 and 4.00046 would have the '-' character from the prior number aligned with the '4' of the next character. How could I accomplish this?
Two other questions:
The output file has a blank line at the beginning of the output. How can I change this?
The output file also seems to put far more spaces between the left column and the right column than I would want. Is there any way I can change this?
Maybe try a finer scale treatment of the printing using sprintf and a different format string for positive and negative numbers, e.g.:
> df = data.frame(x=c('PICALM','Luc','SEC22B'),y=c(-2.261085123,-2.235376098,2.227728912))
> sprintf('%15-s%.6f',df$x[1],df$y[1])
[1] "PICALM -2.261085"
> sprintf('%15-s%.6f',df$x[2],df$y[2])
[1] "Luc -2.235376"
> sprintf('%15-s%.7f',df$x[3],df$y[3])
[1] "SEC22B 2.2277289"
EDIT:
I don't think that write.table or similar functions accept custom format strings, so one option could be to create a data frame of formatted strings and the use write.table or writeLines to write to a file, e.g.
dfstr = data.frame(x=sprintf('%15-s', df$x),
y=sprintf(paste0('%.', 7-1*(df$y<0),'f'), df$y))
(The format string for y here is essentially what I previously proposed.) Next, write dfstr directly:
write.table(x=dfstr,file='filename.txt',
quote=F,row.names=F,col.names=F)
I have a data frame like this I read in from a .csv (or .xlsx, I've tried both), and one of the variables in the data frame is a vector of dates.
Generate the data with this
Name <- rep("Date", 15)
num <- seq(1:15)
Name <- paste(Name, num, sep = "_")
data1 <- data.frame(
Name,
Due.Date = seq(as.Date("2020/09/24", origin = "1900-01-01"),
as.Date("2020/10/08", origin = "1900-01-01"), "days")
)
When I reference one of the cells specifically, like this: str(project_dates$Due.Date[241]) it reads the date as normal.
However, the exact position of the important dates varies from project to project, so I wrote a command that identifies where the important dates are in the sheet, like this: str(project_dates[str_detect(project_dates$Name, "Date_17"), "Due.Date"])
This code worked on a few projects, but on the current project it now returns a character vector of length 2. One of the values is the date, and the other value is NA. And to make matters worse, the location of the date and the NA is not fixed across dates--the date is the first value in some cells and the second in others (otherwise I would just reference, e.g., the first item in the vector).
What is going on here, but more importantly, how do I fix this?!
Clarification on the second command:
When I was originally reading from an Excel file, the command was project_dates[str_detect(project_dates$Name, "Date_17"), "Due.Date"]$Due.Date because it was returning a 1x1 tibble, and I needed the value in the tibble.
When I switched to reading in data as a csv, I had to remove the $Due.Date because the command was now reading the value as an atomic vector, so the $ operator was no longer valid.
Help me, Oh Blessed 1's (with) Knowledge! You're my only hope!
Edited to include an image of the data like the one that generates the error
I feel sheepish.
I was able to remove the NAs with
data1<- data1[!is.na(data1$Due.Date), ].
I assumed that command would listwise delete the rows with any missing values, so if the cell contained the 2-length vector, then I would lose the whole row of data. Instead, it removed the NA from the cell, leaving only the date.
Thank you to everyone who commented and offered help!
There is a large dataset that I need to download over the web using R, but I would like to learn how to filter it at the same time while downloading to the Dates that I need. Right now, I have it setup to download and .unzip and then I create another data set with a filter. The file is a text ";" delimited file
There is a Date column with format 1/1/2009 and I need to only select two dates, 3/1/2009 and 3/2/2009, how to do that in R ?
When I import it, R set it as a factor, since I only need those two dates and there is no need to do a Between, I just select the two factors and call it a day.
Thanks!
I don't think you can filter while downloading. To select only these dates you can use the subset function:
# do not convert string to factors
d.all = read.csv(file, ..., stringsAsFactors = FALSE, sep = ';')
# Date column is called DATE:
d.filter = subset(d.all, DATE %in% c("1/1/2009", "3/1/2009"))
I have a script that is working perfectly except that in my R cbind operation, adjacent to the numerical value that I require in the first row, is an 'X'.
Here is my script:
library(ncdf)
library(Kendall)
library(forecast)
library(zoo)
setwd("/home/cohara/RainfallData")
files=list.files(pattern="*.nc")
j=81
for (i in seq(1,9))
{
file<-open.ncdf(sprintf("/home/cohara/RainfallData/%s.nc",i))
year<-get.var.ncdf(file,"time")
data<-get.var.ncdf(file,"var61")
fit<-lm(data~year) #least sqaures regression
mean=rollmean(data,4,fill=NA)
kendall<-Kendall(data,year)
write.table(kendall[[2]],file="/home/cohara/RainfallAnalysis/Kendall_p-value_for_10%_increase_over_81_-_89_years.csv",append=TRUE,quote=FALSE,row.names=FALSE,col.names=FALSE)
write.table(kendall[[1]],file="/home/cohara/RainfallAnalysis/Kendall_tau_for_10%_increase_over_81_-_89_years.csv",append=TRUE,quote=FALSE,row.names=FALSE,col.names=FALSE)
png(sprintf("./10 percent increase over %s years.png",j))
par(family="serif",mar=c(4,6,4,1),oma=c(1,1,1,1))
plot(year,data,pch="*",col=4,ylab="Precipitation (mm)",main=(sprintf("10 percent increase over %s years",j)),cex.lab=1.5,cex.main=2,ylim=c(800,1400),abline(fit,col="red",lty=1.5))
par(new=T)
plot(year,mean,type="l",xlab="year",ylab="Precipitation (mm)",cex.lab=1.5,ylim=c(800,1400),lty=1.5)
legend("bottomright",legend=c("Kendall tau = ",kendall[[1]]))
legend("bottomleft",legend=c("Kendall 2-tailed p-value = ",kendall[[2]]))
legend(x="topright",c("4 year moving average","Simple linear trend"),lty=1.5,col=c("black","red"),cex=1.2)
legend("topleft",c("Annual total"),pch="*",col="blue",cex=1.2)
dev.off()
j=j+1
}
tmp<-read.csv("/home/cohara/RainfallAnalysis/Kendall_p-value_for_10%_increase_over_81_to_89_years.csv")
tmp2<-read.csv("/home/cohara/RainfallAnalysis/Kendall_p-value_for_10%_increase_over_81_-_89_years.csv")
tmp<-cbind(tmp,tmp2)
tmp3<-read.csv("/home/cohara/RainfallAnalysis/Kendall_tau_for_10%_increase_over_81_to_89_years.csv")
tmp4<-read.csv("/home/cohara/RainfallAnalysis/Kendall_tau_for_10%_increase_over_81_-_89_years.csv")
tmp3<-cbind(tmp3,tmp4)
write.table(tmp,"/home/cohara/RainfallAnalysis/Kendall_p-value_for_10%_increase_over_81_to_89_years.csv",sep="\t",row.names=FALSE)
write.table(tmp3,"/home/cohara/RainfallAnalysis/Kendall_tau_for_10%_increase_over_81_to_89_years.csv",sep="\t",row.names=FALSE)
The output looks like this, from the .csv files created:
X0.0190228056162596 X0.000701081415172666
0.0395622998 0.00531819
0.0126547674 0.0108218994
0.0077754743 0.0015568719
0.0001407317 0.002680057
0.0096391216 0.012719159
0.0107234037 0.0092436085
0.0503448173 0.0103918528
0.0167525802 0.0025036721
I want to be able to use excel functions on the data, so, for simplicity, I don't want row names (I'll be running this loop maybe a hundred times), but I need column names because otherwise the first set of values is cut off.
Can anyone tell me where the 'X' is coming from and how to get rid of it?
Thanks in advance,
Ciara
Here is what I think is going on. Start by running these small examples:
df1 <- read.csv(text = "0.0190228056162596, 0.000701081415172666
0.0395622998, 0.00531819
0.0126547674, 0.0108218994")
df2 <- read.csv(text = "0.0190228056162596, 0.000701081415172666
0.0395622998, 0.00531819
0.0126547674, 0.0108218994", header = FALSE)
df1
df2
str(df1)
str(df2)
names(df1)
names(df2)
make.names(c(0.0190228056162596, 0.000701081415172666))
Please read ?read.csv and about the header argument. As you will find, header = TRUE is default in read.csv. Thus, if the csv file you read lacks header, read.csv will still 'assume' that the file has a header, and use the values in the first row as a header. Another argument in read.csv is check.names, which defaults to TRUE:
If TRUE then the names of the variables in the data frame are checked to ensure that they are syntactically valid variable names. If necessary they are adjusted (by make.names).
In your case, it seems that the data you read lack a header and that the first row is numbers only. read.csv will default treat this row as a header. make.names takes values in the first row (here numbers 0.0190228056162596, 0.000701081415172666), and spits out the 'syntactically valid variable names' X0.0190228056162596 and X0.000701081415172666. Which is not what you want.
Thus, you need to explicitly set header = FALSE to avoid that read.csvconvert the first row to (valid) variable names.
For next time, please provide a minimal, self contained example. Check these links for general ideas, and how to do it in R: here, here, here, and here
I am aware that there are similar questions on this site, however, none of them seem to answer my question sufficiently.
This is what I have done so far:
I have a csv file which I open in excel. I manipulate the columns algebraically to obtain a new column "A". I import the file into R using read.csv() and the entries in column A are stored as factors - I want them to be stored as numeric. I find this question on the topic:
Imported a csv-dataset to R but the values becomes factors
Following the advice, I include stringsAsFactors = FALSE as an argument in read.csv(), however, as Hong Ooi suggested in the page linked above, this doesn't cause the entries in column A to be stored as numeric values.
A possible solution is to use the advice given in the following page:
How to convert a factor to an integer\numeric without a loss of information?
however, I would like a cleaner solution i.e. a way to import the file so that the entries of column entries are stored as numeric values.
Cheers for any help!
Whatever algebra you are doing in Excel to create the new column could probably be done more effectively in R.
Please try the following: Read the raw file (before any excel manipulation) into R using read.csv(... stringsAsFactors=FALSE). [If that does not work, please take a look at ?read.table (which read.csv wraps), however there may be some other underlying issue].
For example:
delim = "," # or is it "\t" ?
dec = "." # or is it "," ?
myDataFrame <- read.csv("path/to/file.csv", header=TRUE, sep=delim, dec=dec, stringsAsFactors=FALSE)
Then, let's say your numeric columns is column 4
myDataFrame[, 4] <- as.numeric(myDataFrame[, 4]) # you can also refer to the column by "itsName"
Lastly, if you need any help with accomplishing in R the same tasks that you've done in Excel, there are plenty of folks here who would be happy to help you out
In read.table (and its relatives) it is the na.strings argument which specifies which strings are to be interpreted as missing values NA. The default value is na.strings = "NA"
If missing values in an otherwise numeric variable column are coded as something else than "NA", e.g. "." or "N/A", these rows will be interpreted as character, and then the whole column is converted to character.
Thus, if your missing values are some else than "NA", you need to specify them in na.strings.
If you're dealing with large datasets (i.e. datasets with a high number of columns), the solution noted above can be manually cumbersome, and requires you to know which columns are numeric a priori.
Try this instead.
char_data <- read.csv(input_filename, stringsAsFactors = F)
num_data <- data.frame(data.matrix(char_data))
numeric_columns <- sapply(num_data,function(x){mean(as.numeric(is.na(x)))<0.5})
final_data <- data.frame(num_data[,numeric_columns], char_data[,!numeric_columns])
The code does the following:
Imports your data as character columns.
Creates an instance of your data as numeric columns.
Identifies which columns from your data are numeric (assuming columns with less than 50% NAs upon converting your data to numeric are indeed numeric).
Merging the numeric and character columns into a final dataset.
This essentially automates the import of your .csv file by preserving the data types of the original columns (as character and numeric).
Including this in the read.csv command worked for me: strip.white = TRUE
(I found this solution here.)
version for data.table based on code from dmanuge :
convNumValues<-function(ds){
ds<-data.table(ds)
dsnum<-data.table(data.matrix(ds))
num_cols <- sapply(dsnum,function(x){mean(as.numeric(is.na(x)))<0.5})
nds <- data.table( dsnum[, .SD, .SDcols=attributes(num_cols)$names[which(num_cols)]]
,ds[, .SD, .SDcols=attributes(num_cols)$names[which(!num_cols)]] )
return(nds)
}
I had a similar problem. Based on Joshua's premise that excel was the problem I looked at it and found that the numbers were formatted with commas between every third digit. Reformatting without commas fixed the problem.
So, I had the similar situation here in my data file when I readin as a csv. All the numeric value were turned into char. But in my file there was a value with a word "Filtered" instead of NA. I converted "Filtered" to NA in vim editor of linux terminal with a command <%s/Filtered/NA/g> and saved this file and later used it and read it in R, all the values were num type and not char type any more.
Looks like character value "Filtered" was inducing all values to be char format.
Charu
Hello #Shawn Hemelstrand here are the steps in detail below:
example matrix file.csv having 'Filtered' word in it
I opened the file.csv in linux command terminal
vi file.csv
then press "Esc shift:"
and type the following command at the bottom
"%s/Filtered/NA/g"
press enter
then press "Esc shift:"
write "wq" at the bottom (this save the file and quit vim editor)
then in R script I read the file
data<- read.csv("file.csv", sep = ',', header = TRUE)
str(data)
All columns were num type which were earlier char type.
In case you need more help, it would be easier to share your txt or csv file.