So basically I'm stuck: I tried this code but it isn't splitting the names and numbers. Please refer to the sample image to understand the desired outcome.
Code that I have tried:
You have a problem with quoting and the separator. To clean your dataframe, use the code below:
import pandas as pd

# quoting=1 is csv.QUOTE_ALL
pd.read_csv('names_tab2.csv', quoting=1, header=None)[0] \
    .str.split('\t', expand=True) \
    .to_csv('clean_names.csv', index=False, header=False)
Old answer
Use str.extract:
Suppose this dataframe:
df = pd.DataFrame({'ColA': ['CAIN TAN86092142', 'YEO KIAT JUN81901613']})
print(df)
# Output:
ColA
0 CAIN TAN86092142
1 YEO KIAT JUN81901613
Split on the first encountered digit:
out = df['ColA'].str.extract(r'([^\d]*)(\d+)') \
.rename(columns={0: 'Name', 1: 'Number'})
print(out)
# Output:
Name Number
0 CAIN TAN 86092142
1 YEO KIAT JUN 81901613
Update:
Is there a way to remove the Name and Number headers when it outputs to the csv?
out.to_csv('data.csv', index=False, header=None)
# content of data.csv:
CAIN TAN,86092142
YEO KIAT JUN,81901613
When opening a plain text file (or, in this case, a plain-text CSV file), you can use a for loop to go through the file line by line, like so (current as of Python 3):
name = []
id_num = []
with open('file.csv', 'r') as file:  # the file is closed automatically when the block ends
    for line in file:
        fields = line.strip().split(',')  # split the line on commas
        name.append(fields[0])            # append the name to its list
        id_num.append(fields[1])          # append the ID to its list
Now that you have the data in lists, you can print/store it how you want.
Related
Good afternoon,
I have a folder with 231 .csv files and I would like to merge them in R. Each file is one spectrum with 2 columns (Wavenumber and Reflectance), but as they come from the spectrometer they don't have column names. So they look like this when I import them:
C_Sycamore = read.csv("#C_SC_1_10 average.CSV", header = FALSE)
head(C_Sycamore)
V1 V2
1 399.1989 7.750676e+001
2 401.1274 7.779499e+001
3 403.0559 7.813432e+001
4 404.9844 7.837078e+001
5 406.9129 7.837600e+001
6 408.8414 7.822227e+001
The first column (Wavenumber) is identical in all 231 files and all spectra contain exactly 1869 rows. Therefore, it should be possible to merge the whole folder in one big dataframe, right? At least this would very practical for me.
So here is what I tried. I set the working directory to the corresponding folder, define an empty variable d, store all the file names in file.list, and then loop through the names in file.list. First, I want to change the column names of every file to "Wavenumber" and the corresponding file name itself, so I use deparse(substitute(i)). Then, I want to read in the file and merge it with the others. After that I could probably do merge(d, read.csv(i, header = FALSE), by = "Wavenumber"), but I don't even get this far.
d = NULL
file.list = list.files()
for (i in file.list) {
  colnames(i) = c("Wavenumber", deparse(substitute(i)))
  d = merge(d, read.csv(i, header = FALSE))
}
When I run this I get the error
"Error in colnames<-(*tmp*, value = c("Wavenumber", deparse(substitute(i)))) :
So I tried running it without the colnames() line, which does not produce an error, but doesn't work either. Instead of my desired dataframe I get an empty dataframe with only two columns and the message:
"reread"#S_BE_1_10 average.CSV" "#S_P_1_10 average.CSV""
This kind of programming is new to me. So I am thankful for all useful suggestions. Also I am happy to share more data if it helps.
Thanks in advance.
Solution
library(tidyr)
library(purrr)
path <- "your/path/to/folder"
# in one pipeline:
C_Sycamore <- path %>%
  # get the CSVs' full paths; (?i) makes the pattern case-insensitive
  list.files(pattern = "(?i)\\.csv$", full.names = TRUE) %>%
  # create a named vector: you need it to assign ids in the next step,
  # and remove the file extension to get clean column names
  set_names(tools::file_path_sans_ext(basename(.))) %>%
  # read the files one by one, bind them into one df and create an id column
  map_dfr(read.csv, col.names = c("Wavenumber", "V2"), .id = "colname") %>%
  # pivot to create one column for each id
  pivot_wider(names_from = colname, values_from = V2)
Explanation
I would suggest not to change the working directory.
I think it's better if you read from that folder instead.
You can read each CSV file in a loop and bind them together by row. You can use map_dfr to loop over each item and then bind every dataframe by row (that's what the _dfr stands for).
Note that I've used .id = to create a new column called colname. It gets populated from the names of the vector you're looping over (that's why we added the names with set_names).
Then, to have one row for each Wavenumber, you need to reshape your data. You can use pivot_wider.
You will have at the end a dataframe with as many rows as Wavenumber and as many columns as the number of CSV plus 1 (the wavenumber column).
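If the pipe-based version feels opaque, here is a rough base-R sketch of the same idea (the names files, spec and d are just for illustration), under the same assumptions: every file shares the Wavenumber column and has no header row. It reads each file, names its value column after the file, and merges everything on Wavenumber:
files <- list.files(path, pattern = "(?i)\\.csv$", full.names = TRUE)
d <- NULL
for (f in files) {
  # name the value column after the file, then merge on the shared Wavenumber column
  spec <- read.csv(f, header = FALSE,
                   col.names = c("Wavenumber", tools::file_path_sans_ext(basename(f))))
  d <- if (is.null(d)) spec else merge(d, spec, by = "Wavenumber")
}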
Reproducible example
To double check my results, you can use this reproducible example:
path <- tempdir()
csv <- "399.1989,7.750676e+001
401.1274,7.779499e+001
403.0559,7.813432e+001
404.9844,7.837078e+001
406.9129,7.837600e+001
408.8414,7.822227e+001"
write(csv, file.path(path, "file1.csv"))
write(csv, file.path(path, "file2.csv"))
You should expect this output:
C_Sycamore
#> # A tibble: 5 x 3
#> Wavenumber file1 file2
#> <dbl> <dbl> <dbl>
#> 1 401. 77.8 77.8
#> 2 403. 78.1 78.1
#> 3 405. 78.4 78.4
#> 4 407. 78.4 78.4
#> 5 409. 78.2 78.2
Thanks a lot to @Konrad Rudolph for the suggestions!!
No need for a loop here, simply use lapply.
First set your working directory to the file location.
library(dplyr)
files_to_upload <- list.files(pattern = "\\.csv$")
theData_list <- lapply(files_to_upload, read.csv)
C_Sycamore <- bind_rows(theData_list)
ANSWERED: Thank you so much Bob, ffs, the issue was not specifying comment='#'. Why this works when 'skip' should've skipped the offending lines remains a mystery. Also see Gray's comment re: Excel's 'Text to Columns' feature for a non-R solution.
Hey folks,
this has been a demon on my back for ages.
The data I work with is always a collection of tab-delimited .txt files, so my analysis always begins with gathering the file paths to each, feeding those into read.csv(), and binding the results into a df.
dat <- list.files(
path = 'data',
pattern = '*.txt',
full.names = TRUE,
recursive = TRUE
) %>%
map_df( ~read.csv( ., sep='\t', skip=16) ) # actual data begins at line 16
This does exactly what I want, but I've been transitioning to tidyverse over the last few years.
I don't mind using utils::read.csv(); since my datasets are usually small, the speed benefit of readr wouldn't be felt. But for consistency's sake I'd rather use readr.
When I do the same, but sub readr::read_tsv(), i.e.,
dat <-
.... same call to list.files()
%>%
map_df( ~read_tsv( ., skip=16 ))
I always get an empty (0x0) table. But it seems to be 'reading' the data, because I get a warning print out of 'Parsed with column specification: cols()' for every column in my data.
Clearly I'm misunderstanding here, but I don't know what about it I don't understand, which has made my search for answers challenging & fruitless.
So... what am I doing wrong here?
Thanks in advance!
edit: an example snippet of (one of) my data files was requested; hope this formats well!
# KLIBS INFO
# > KLibs Commit: 11a7f8331ba14052bba91009694f06ae9e1cdd3d
#
# EXPERIMENT SETTINGS
# > Trials Per Block: 72
# > Blocks Per Experiment: 8
#
# SYSTEM INFO
# > Operating System: macOS 10.13.4
# > Python Version: 2.7.15
#
# DISPLAY INFO
# > Screen Size: 21.5" diagonal
# > Resolution: 1920x1080 # 60Hz
# > View Distance: 57 cm
PID search_type stimulus_type present_absent response rt error
3 time COLOUR present absent 5457.863881 TRUE
3 time COLOUR absent absent 5357.009108 FALSE
3 time COLOUR present present 2870.76412 FALSE
3 time COLOUR absent absent 5391.404728 FALSE
3 time COLOUR present present 2686.6131 FALSE
3 time COLOUR absent absent 5306.652878 FALSE
edit: Using Jukob's suggestion
files <- list.files(
path = 'data',
pattern = '*.txt',
full.names = TRUE,
recursive = TRUE
)
for (i in 1:length(files)) {
print(read_tsv(files[i], skip=16))
}
prints:
Parsed with column specification:
cols()
# A tibble: 0 x 0
... for each file
If I print files, I do get the correct list of file paths. If I remove skip=16 I get:
Parsed with column specification:
cols(
`# KLIBS INFO` = col_character()
)
Warning: 617 parsing failures.
row col expected actual file
15 -- 1 columns 21 columns 'data/raw/2019/colour/p1.2019-02-28.txt'
16 -- 1 columns 21 columns 'data/raw/2019/colour/p1.2019-02-28.txt'
17 -- 1 columns 21 columns 'data/raw/2019/colour/p1.2019-02-28.txt'
18 -- 1 columns 21 columns 'data/raw/2019/colour/p1.2019-02-28.txt'
19 -- 1 columns 21 columns 'data/raw/2019/colour/p1.2019-02-28.txt'
... ... ......... .......... ........................................
See problems(...) for more details.
... for each file
FWIW I was able to solve the problem using your snippet by doing something along the following lines:
# Didn't work for me since when I copy and paste your snippet,
# the tabs become spaces, but I think in your original file
# the tabs are preserved so this should work for you
read_tsv("dat.tsv", comment = "#")
# This works for my case
read_table2("dat.tsv", comment = "#")
Didn't even need to specify skip argument!
But also, no idea why using skip and not comment will fail... :(
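For what it's worth, here is a sketch of what folding that comment = "#" fix back into the original pipeline might look like (hedged: the path, pattern, and column layout are taken from the question, and whether skip is still needed on the real files is untested here):
library(readr)
library(purrr)

dat <- list.files(
  path = 'data',
  pattern = '*.txt',
  full.names = TRUE,
  recursive = TRUE
) %>%
  map_df(~ read_tsv(.x, comment = "#"))  # comment = "#" drops the metadata block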
Could you try the following code? The value of i may give you some idea of which file is causing the problem.
files <- list.files(path = "path", full.names = T, pattern = ".csv")
for (i in 1:length(files)){
print(read_tsv(files[i], skip = 16))
}
How do I keep every fifth row (deleting all the others) in an Excel file? For example, I have a starting file like this:
07/12/1989 106,9
08/12/1989 106,05
12/12/1989 103,1
13/12/1989 106,5
14/12/1989 104,75
15/12/1989 105,6
18/12/1989 104,5
19/12/1989 106,2
20/12/1989 106,5
21/12/1989 107,5
22/12/1989 109,8
and I would like the result:
07/12/1989 106,9
15/12/1989 105,6
22/12/1989 109,8
Try this:
Step 1: Read the Excel file into R using read.xlsx
Step 2: Generate the sequence of row indexes and then retrieve rows based on it
indexes <- seq(1, nrow(df), 5)  # every fifth row index, starting at row 1
df[indexes, ]                   # retrieve only the indexed rows
Output:
V1 V2
1 07/12/1989 106,9
6 15/12/1989 105,6
11 22/12/1989 109,8
Step 3: Store this result in an Excel file using write.xlsx
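Putting the three steps together, a minimal sketch (assuming the openxlsx package and hypothetical file names input.xlsx and output.xlsx) could look like this:
library(openxlsx)
df <- read.xlsx("input.xlsx", colNames = FALSE)  # Step 1: read the Excel file
indexes <- seq(1, nrow(df), 5)                   # Step 2: every fifth row, starting at row 1
write.xlsx(df[indexes, ], "output.xlsx")         # Step 3: write the result back out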
Let's assume you have this dataset:
dt<-data.frame(ID=LETTERS, stringsAsFactors = F)
Then you can do:
as.data.frame(dt[1:nrow(dt) %% 5 == 0, ])
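Note that this condition keeps rows 5, 10, 15, and so on; to match the question's desired output (rows 1, 6, 11, ...), the condition would need a small shift, for example:
as.data.frame(dt[1:nrow(dt) %% 5 == 1, ])  # keeps the 1st, 6th, 11th, ... rows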
I want to ingest all files in the working directory and scan all rows for line breaks or carriage returns. Instead of eliminating them, I'd like to divert them into a new output file for manual review. Here's what I have so far:
library(plyr)
library(dplyr)
library(readxl)
filenames <- list.files(pattern = "Sara Lee.*\\.xlsx$", ignore.case = TRUE)
read_excel_filename <- function(filename){
  ret <- read_excel(filename, col_names = TRUE, skip = 5, trim_ws = FALSE)
  ret
}
import.list <- ldply(filenames, read_excel_filename)
returnornewline <- import.list[((import.list$"CUSTOMER SEGMENT")=="[\r\n]"|(import.list$"SECTOR NAME")=="[\r\n]"|
(import.list$"LOCATION NAME")=="[\r\n]"|(import.list$"LOCATION ID")=="[\r\n]"|
(import.list$"ADDRESS")=="[\r\n]"|(import.list$"CITY")=="[\r\n]"|
(import.list$"STATE")=="[\r\n]"|(import.list$"ZIP CODE")=="[\r\n]"|
(import.list$"DISTRIBUTOR NAME")=="[\r\n]"|(import.list$"REDISTRIBUTOR NAME")=="[\r\n]"|
(import.list$"TRANS DATE")=="[\r\n]"|(import.list$"DIST. INVOICE")=="[\r\n]"|
(import.list$"ITEM MIN")=="[\r\n]"|(import.list$"ITEM LABEL")=="[\r\n]"|
(import.list$"ITEM DESC")=="[\r\n]"|(import.list$"PACK SIZE")=="[\r\n]"|
(import.list$"REBATEABLE UOM")=="[\r\n]"|(import.list$"QUANTITY")=="[\r\n]"|
(import.list$"SALES VOLUME")=="[\r\n]"|(import.list$"X__1")=="[\r\n]"|
(import.list$"X__2")=="[\r\n]"|(import.list$"X__3")=="[\r\n]"|
(import.list$"VA PER")=="[\r\n]"|(import.list$"VA PER CODE")=="[\r\n]"|
(import.list$"TOTAL REBATE")=="[\r\n]"|(import.list$"TOTAL ADMIN FEE")=="[\r\n]"|
(import.list$"TOTAL INVOICED")=="[\r\n]"|(import.list$"STD VA PER")=="[\r\n]"|
(import.list$"STD VA PER CODE")=="[\r\n]"|(import.list$"EXC TYPE CODE")=="[\r\n]"|
(import.list$"EXC EXC VA PER")=="[\r\n]"|(import.list$"EXC VA PER CODE")=="[\r\n]"), ]
now <- Sys.time()
carriage_return_file_name <- paste(format(now,"%Y%m%d"),"ROWS with Carriage Returns or New Lines.csv",sep="_")
write.csv(returnornewline, carriage_return_file_name, row.names = FALSE)
Here's some sample data:
Customer Segment Address
BuyFood 123 Main St.\r
BigKetchup 679 Smith Dr.\r
DownUnderMeat 410 Crocodile Way
BuyFood 123 Main St.
I thought the trim_ws = FALSE condition would work, but it hasn't.
Apologies for the column spam, I've yet to figure out an easier way to scan all the columns without listing them. Any help on that issue is appreciated as well.
EDIT: Added some sample data. I don't know how to show a carriage return in the address other than with its regex; it doesn't look like that in the real sample data, that's just for our use here. Please let me know if that's not clear. The desired output would take the first 2 rows of data, where there's a carriage return, and output them to the csv file listed at the end of the code block.
EDIT 2: I used the code provided in the suggestion in place of the original long list of columns, as follows. However, this doesn't give me a new variable containing a dataframe of rows with new lines or carriage returns. When I look at my global environment in RStudio I see another variable under Data called "returnornewline", but it shows as a large list, unlike the import.list variable, which shows as a dataframe. This shouldn't be the case, because I've only added a carriage return in the first row of the first spreadsheet of the data, so that list should not be so large:
returnornewline <- lapply(import.list, function(x) lapply(x, function(s) grep("\r", s)))
# (original column-by-column filter, commented out; identical to the block above)
EDIT 3: I need to be able to take all rows in the newly created data frame import.list and scan them for any instances of carriage returns or new lines. The example above is rudimentary, but the concept stands: I'd expect the script to read the first two rows and say "hey, these rows have carriage returns; add them to the variable assigned on this line of code and, at the end of the script, output them to a csv." The remaining two rows in the sample data above don't need to be output because they contain no carriage returns.
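On the "scan all the columns without listing them" question, one hedged sketch (assuming import.list is the combined data frame built by ldply above) is to test every row for a carriage return or newline in any column and keep just those rows; the result can then feed the existing write.csv call:
# flag rows where any column contains a carriage return or newline, then subset to them
has_break <- apply(import.list, 1, function(row) any(grepl("[\r\n]", row)))
returnornewline <- import.list[has_break, ]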
I have several ASCII files I need to import into R, containing return data for different asset classes. The structure of the ASCII files is as follows (with 2 sample rows):
How can I import this? I wasn't successful with read.table, but I would like to have it in data.frame format.
<Security Name> <Ticker> <Per> <Date> <Close>
Test Description,Test,D,19700101,1.0000
Test Description,Test,D,19700102,1.5
If you really want to force the column names into R, you could use something like this:
# Data
dat <- read.csv("/path/to/data.dat", header = FALSE, skip = 1)
dat
V1 V2 V3 V4 V5
1 Test Description Test D 19700101 1.0
2 Test Description Test D 19700102 1.5
# Column names
dat.names <- readLines("/path/to/data.dat", n = 1)
names(dat) <- gsub("[<>]", "", unlist(strsplit(dat.names, "> <")))  # split on the "> <" separators, then strip the remaining brackets
dat
Security Name Ticker Per Date Close
1 Test Description Test D 19700101 1.0
2 Test Description Test D 19700102 1.5
Although I think there might be better solutions, e.g. manually adding the header...
You can easily read this data using read.csv. Since your column names are not comma-separated, you will need to use the header = FALSE argument and then add the names once the data is in R, or you can manually edit the data before reading it by omitting the <> characters and adding a comma between each column name.
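A minimal sketch of that approach (the file name data.dat is hypothetical):
dat <- read.csv("data.dat", header = FALSE, skip = 1)  # skip the non-CSV header line
names(dat) <- c("Security Name", "Ticker", "Per", "Date", "Close")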