Is it possible to pass column indices to read_csv?
I am passing many CSV files to read_csv with different header names so rather than specifying names I wish to use column indices.
Is this possible?
df.list <- lapply(myExcelCSV, read_csv, skip = headers2skip[i]-1)
Alternatively, you can use a compact string representation where each character represents one column: c = character, i = integer, n = number, d = double, l = logical, f = factor, D = date, T = date time, t = time, ? = guess, or '_'/'-' to skip the column.
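For example, to read a hypothetical four-column file, keeping columns 1, 2 and 4 as character, integer and double while skipping column 3:
library(readr)
df <- read_csv("data.csv", col_types = "ci_d")   # "data.csv" is a placeholder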
If you know the total number of columns in the file you could do it like this:
library(readr)

my_read <- function(..., tot_cols, skip_cols = numeric(0)) {
  csr <- rep("?", tot_cols)          # guess every column type by default
  csr[skip_cols] <- "_"              # mark the columns to skip
  csr <- paste(csr, collapse = "")
  read_csv(..., col_types = csr)
}
If you don't know the total number of columns in advance you could add code to this function to read just the first line of the file and count the number of columns returned ...
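A sketch of that idea, reusing my_read() from above (the helper name is mine; it peeks at the header row with n_max = 0 to count the columns):
my_read_auto <- function(file, skip_cols = numeric(0), ...) {
  hdr <- read_csv(file, n_max = 0, col_types = cols())   # header only, no data rows
  my_read(file, ..., tot_cols = ncol(hdr), skip_cols = skip_cols)
}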
FWIW the skip argument might not do what you think it does: it skips rows rather than selecting/deselecting columns. As I read ?readr::read_csv(), there doesn't seem to be any convenient way to skip and/or include particular columns (by name or by index) except by some ad hoc mechanism such as the one suggested above. This might be worth a feature request/discussion on the readr issues list (e.g. add cols_include and/or cols_exclude arguments that could be specified by name or position?).
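For what it's worth, readr 2.0 and later did grow a col_select argument that accepts column indices, names, or tidyselect expressions, which covers the "include" half of that request:
library(readr)   # needs readr >= 2.0
df <- read_csv("data.csv", col_select = c(1, 3, 5))   # keep columns by position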
I have a FASTA_16S.txt file containing paragraphs of different lengths with a unique code (e.g. 16S317) at the top. After transfer into R, I have a list with 413 members that looks like this:
[1]">16S317_V._rotiferianus_A\n
AAATTGAAGAGTTTGATCATGGCTCAG..."
[2]">16S318_Salmonella_bongori\n
AAATTGAAGAGTTTGATCATGGCTCAGATT..."
[3]">16S319_Escherichia_coli\n
TTGAAGAGTTTGATCATGGCTCAGATTG...
I need to substitute the existing codes with the new ones from a table Code_16S:
   Old     New
1. 16S317  16S001
2. 16S318  16S307
3. 16S319  16S211
4. ...     ...
Can anybody suggest a code that would identify an old code and substitute it with a new one?
Note that some codes appear in both the New and Old columns, so applying gsub or replace sequentially over the whole list did not work: after one substitution, two paragraphs can carry the same code, and a later step then changes both of them.
Below is my solution to the problem, but I don't consider it optimal.
Instead of using lapply, it may be easier with str_replace_all, which applies all the replacements from a named vector in a single pass (so an already-replaced code is not matched again by a later rule):
library(stringr)
library(tibble)
FASTA_16S <- str_replace_all(FASTA_16S, deframe(Code_16S))
-output
FASTA_16S
[1] ">16S001_V._rotiferianus_A\n\nAAATTGAAGAGTTTGATCATGGCTCAG..."
[2] ">16S307_Salmonella_bongori\n\nAAATTGAAGAGTTTGATCATGGCTCAGATT..."
data
FASTA_16S <- c(">16S317_V._rotiferianus_A\n\nAAATTGAAGAGTTTGATCATGGCTCAG...",
">16S318_Salmonella_bongori\n\nAAATTGAAGAGTTTGATCATGGCTCAGATT..."
)
Code_16S <- structure(list(Old = c("16S317", "16S318", "16S319"), New = c("16S001",
"16S307", "16S211")), class = "data.frame", row.names = c("1.",
"2.", "3."))
As long as the new codes are sorted according to the old ones, which matches the order of the paragraphs in the file, we can perform the substitution paragraph by paragraph.
(Initially the table was sorted by the column New.)
num <- seq_len(413)          # total number of paragraphs
new_codes <- Code_16S$New    # new codes, in the same order as the paragraphs
F_16S <- function(x) {
  row <- new_codes[x]
  gsub("^.{7}", paste(">", row, sep = ""), FASTA_16S[[x]])
}
N_16S <- lapply(num, F_16S)
With gsub("^[>].{7}", I tried to substitute first 6 characters (the code) except the first one (>) in each string, but it did not work, thus added paste function.
I am trying to import a SAS data set to R (I cannot share the data set). SAS sees columns as number or character. However, some of the number columns have coded character values. I've used the sas7bdat package to bring in the data set but those character values in number columns return NaN. I would like the actual character value. I have tried exporting the data set to csv and tab delimited files. However, I end up with observations that take 2 lines (a problem with SAS that I haven't been able to figure out). Since there are over 9000 observations I cannot go back and look for those observations that take 2 lines manually. Any ideas how I can fix this?
SAS does NOT store character values in numeric columns. But there are some ways that numeric values will be printed using characters.
First is if you are using the BEST format (which is the default for numeric variables). If the value cannot be represented exactly in the available number of characters then it will use scientific notation.
Second is special missing values. SAS has 28 missing values: regular missing is represented by a period, and the other 27 by a single letter (.A through .Z) or an underscore (._).
Third would be a custom format that displays the numbers using letters.
The first should not cause any trouble when importing into R. The last two can be handled by the haven package; see the "semantics" vignette in its documentation.
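On the R side, haven's read_sas() keeps SAS special missing values as "tagged" NAs, so the letter is recoverable; a minimal sketch (file name hypothetical):
library(haven)
df <- read_sas("mydata.sas7bdat")
# Special missings (.A-.Z, ._) arrive as tagged NAs: still missing to R,
# but the tag letter survives.
x <- c(1, tagged_na("a"), NA)
is.na(x)    #> FALSE  TRUE  TRUE
na_tag(x)   #> NA    "a"   NA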
As to your multiple-line CSV file, there are two possible issues. The first is simply that you did not tell SAS to use lines long enough for your data; make sure to use a longer LRECL setting on the file you are writing to.
filename csv 'myfile.csv' lrecl=1000000 ;
proc export data=mydata file=csv dbms=csv ; run;
The second possible issue is that some of your character variables include end-of-line characters. It is best to just remove or replace those characters; you could always add them back later if they are really wanted. For example, these steps will export the same file as above, first replacing the carriage returns and line feeds in the character variables with pipe characters:
data for_export;
  set mydata;
  array _c _character_;                  * every character variable in the data set;
  do over _c;
    _c = translate(_c, '||', '0A0D'x);   * swap LF (0A) and CR (0D) for pipes;
  end;
run;
proc export data=for_export file=csv dbms=csv ; run;
Partial answer for dealing with data spread across multiple rows:
library(data.table)

# First, read the whole lines into a single column, for example with:
DT <- data.table::fread(myfile, sep = "")

# Sample data for this example: a data.table with ten rows containing the numbers 1 to 10
DT <- data.table(1:10)

# Column-bind two subsets of the data, using a logical vector to select every first
# and every second row, then paste the columns together, collapsing with a
# comma separator (or whatever separator you like)
ans <- as.data.table(
  cbind(DT[rep(c(TRUE, FALSE), length = .N), 1],
        DT[rep(c(FALSE, TRUE), length = .N), 1])[, do.call(paste, c(.SD, sep = ","))]
)
# V1
# 1: 1,2
# 2: 3,4
# 3: 5,6
# 4: 7,8
# 5: 9,10
I prefer the read_sas() function from the haven package for reading SAS data:
library(haven)
data <- read_sas("data.sas7bdat")
I have a problem with the selection of columns in a dataframe using a for loop. I'm new to R, so it's very possible that I missed something obvious, but I did not find anything that works for me.
I have a file with 20 climatic variables measured over 60 years in 399 different places.
I have a line for each day, and my columns are the 20 climatic variables for each place (with a number at the end of the name to identify the place where the measure was taken).
It looks like that :
Temperature_1 Rain_1 .....Temperature_399 Rain_399
Date 1
Date 2
...
I want to select the 20 columns corresponding to one place, run some calculations on the variables, put the results in an empty 3D array I have created, then do the same for the next place until the last one.
My problem is that I don't know how to select the right columns automatically. I also have issues with writing the results into the array.
I tried to select the columns corresponding to one place using the numbers at the end of the variable names, but I don't think it is possible to update that condition automatically.
I also tried to use the positions of the columns, but I'm not doing it properly.
This is my code :
#creation of an empty array
Indice_clim <- array(
  NA, dim = c(60, 8, 399),
  dimnames = list(1959:2018,
                  c("Huglin", "CNI", "HD", "VHD", "SHS", "DoF", "FreqLF", "SLF"),
                  1:399)
)
#selection of the columns corresponding to the first place using ends_with()
maille <- select(donnees_SAFRAN, c(1:4), ends_with(".1", ignore.case = FALSE))
# another try using the column positions, which I know is really badly done
for (j in seq(from = 5, to = 7984, by = 20)) {
  paste0("maille", j - 4) = select(donnees_SAFRAN, c(1:4), c(j:j + 19))
}
#and the calculation on the selected columns, the "i loop" is working.
for (i in 1959:2018) {
  temp <- c(
    maille %>% filter(an == i, mois %in% 4:9) %>%
      summarise(sum(((T_moy.1 - 10) + (T_max.1 - 10)) / 2) * 1.03),
    maille %>% filter(an == i, mois == 9) %>% summarise(mean(T_min.1)),
    maille %>% filter(an == i) %>% summarise(sum(T_max.1 >= 30)),
    maille %>% filter(an == i) %>% summarise(sum(T_max.1 >= 35)),
    maille %>% filter(an == i, mois %in% 4:9, T_moy.1 >= 28) %>%
      summarise(sum(T_moy.1 - 28)),
    maille %>% filter(an == i) %>% summarise(sum(T_min.1 <= 0)),
    maille %>% filter(an == i, mois %in% 4:9) %>% summarise(sum(T_min.1 <= 0)),
    maille %>% filter(an == i, mois %in% 4:9, T_moy.1 < 2) %>%
      summarise(sum(abs(2 - T_moy.1)))
  )
  Indice_clim[i - 1958, , 1] <- as.numeric(temp)   # place 1 only so far
}
I would like to create a loop or something to do my calculation on each place and write the result in my array.
If you have any ideas, I would very much appreciate it!
You can use the grep() function to look for each of the locations 1, 2, ..., 399 in the column names. If your big dataframe containing all the data is called df, then you could do this:
for (i in 1:399) {
selected_indices <- grep(paste0('_', i, '$'), colnames(df))
# do calculations on the selected columns
df[, selected_indices]
}
The for loop will automatically run through each location i from 1 through 399. The paste0() function concatenates '_' with the variable i and the dollar sign $ to create strings like "_1$", "_2$", ..., "_399$", which are then searched for using the grep() function in the column names of df. The '$' is used to specify that you want the patterns _1, _2, ... to appear at the end of the column names (it is a regular expression special character).
The grep() function uses the above regular expressions to return the column indices required for each location. You can then extract the relevant portion of df and do whatever calculations you want.
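To also fill the 3D array from the question, each place supplies one 60 x 8 slice; a sketch assuming a hypothetical calc_indices() that computes that matrix for one place:
for (i in 1:399) {
  selected_indices <- grep(paste0('_', i, '$'), colnames(df))
  place_data <- df[, selected_indices]
  # calc_indices() is a placeholder for the eight climate-index calculations
  # Indice_clim[, , i] <- calc_indices(place_data)
}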
I have about 30 columns within a data frame of over 100 columns. The file I am reading in stores its numbers as characters; in other words, 1300 is "1,300" and R thinks it is a character.
I am trying to fix that issue by replacing the "," with nothing and turning the field into an integer. I do not want to use gsub on each column that has the issue. I would rather store the affected columns in a vector and handle them all with one function or loop.
I have tried using lapply, but am not sure what to put as the "x" variable.
Here is my function with the error below it
ItemStats_2014[intColList] <- lapply(ItemStats_2014[intColList],
as.integer(gsub(",", "", ItemStats_2014[intColList])) )
Error in `[.data.table`(ItemStats_2014, intColList) :
  When i is a data.table (or character vector), the columns to join by must be
  specified either using 'on=' argument (see ?data.table) or by keying x
  (i.e. sorted, and, marked as sorted, see ?setkey). Keyed joins might have
  further speed benefits on very large data due to x being sorted in RAM.
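For what it's worth, that error occurs because indexing a data.table with a character vector in the first position is treated as a join, not a column selection. A sketch of the usual by-reference fix, assuming intColList holds the affected column names:
library(data.table)
# strip the commas, then cast each affected column to integer, in place
ItemStats_2014[, (intColList) := lapply(.SD, function(x) as.integer(gsub(",", "", x))),
               .SDcols = intColList]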
The file I am reading in stores its numbers as characters [with commas as decimal separator]
Just directly read those columns in as decimal, not as string:
data.table::fread() understands decimal separators via its dec argument: pass dec=','.
You might need to play with fread(..., colClasses=(...) ) argument a bit to specify the integer columns:
myColClasses <- rep('character', 100)  # for example...
myColClasses[intColList] <- 'integer'
# ...any other colClass fixup as needed...
ItemStats_2014 <- fread('your.csv', colClasses=myColClasses)
This approach is simpler and faster and uses much less memory than reading as string, then converting later.
Try using dplyr::mutate_at() to select multiple columns and apply a transformation to them.
ItemStats_2014 <- ItemStats_2014 %>%
  mutate_at(intColList, ~ as.integer(gsub(',', '', .)))
mutate_at selects columns from a list or using a dplyr selector function (see ?select_helpers) then applies one or more functions to each column. The . in gsub refers to each selected column that mutate_at passes to it. You can think of it as the x in function(x) ....
I need to read a bunch of zip codes into R but they have to be in double type. I also need to keep the leading zeros for the ones that start with zero. I tried
for (i in 1:length(df$region)){
if (nchar(df$region[i])==4) {
df$region[i] <- paste0("0", df$region[i])
}
}
This converts the way I want to but it changes them all to character type and I can't read the region column into another function that requires numeric or double. If I convert to numeric or double it gets rid of the leading zeros again. Any ideas?
Why not store them as a numeric and just add the zeros when needed through formatC? For example,
tst <- 345
class(tst)
formatC(tst, width = 5, format = "d", flag = "0")
gives,
#[1] "numeric"
#[1] "00345"
For brevity, you could even write a wrapper:
zip <- function(z) formatC(z, width = 5, format = "d", flag = "0")
zip(tst)
#[1] "00345"
And this only adds leading zeroes when needed.
zip(12345)
#[1] "12345"
I would recommend keeping two columns, one in which the ZIP code appears as text, and the other as a double. You would have to first read in the ZIP codes as character data, then create the double column from that, e.g.
# given df$zip_code
df$zip_as_double <- as.double(df$zip_code)
Double variables don't normally maintain the number of leading zeroes, because those digits are not significant anyway. So I think storing your ZIP codes as character data is the only option here.