R text connection

I need to create a table with the following data (I don't want to use a csv to import it).
toread<-"DomicLabProv cluster
BUENOS AIRES 1
CHUBUT 1
FORMOSA 1
LA PAMPA 1
SAN JUAN 1
CAPITAL FEDERAL 1
MISIONES 1
SAN LUIS 1
SANTA FE 1
ENTRE RIOS 2
JUJUY 2
LA RIOJA 2
SANTIAGO DEL ESTERO 2
CHACO 2
CORDOBA 2
CORRIENTES 2
SALTA 2
TIERRA DEL FUEGO 2
TUCUMAN 2
CATAMARCA 3
MENDOZA 3
NEUQUEN 3
RIO NEGRO 3
SANTA CRUZ 3
"
As you can see, the first and second fields are separated by tabs.
When I try:
read.table(textConnection(toread), header = TRUE)
I get the following error message:
Error in scan: line 2 did not have 2 elements
I think this is related to the fact that the names in DomicLabProv contain spaces, for example "BUENOS AIRES". Is there a way to overcome this issue? I mean the spaces inside the names are typed with the space bar, while the separators between the fields are tabs.
Thanks.

No need for a textConnection; pass the string to read.table via the text argument instead:
read.delim(text = toread)
(read.delim is the same as read.table but uses tab as the delimiter and defaults to header = TRUE.)
This works if your text is indeed delimited by the tab character '\t'. If that isn't the case, a bit more work is required: you need to split the columns manually, while taking care not to split fields like "LA PAMPA", which also contain a space. This is finicky and best avoided by having the data in the right format to start with. In your particular case, we can use the information that the second column is numeric (but we first need to remove the header, since that doesn't conform):
# Recover the column names by splitting the first line on runs of spaces:
header = strsplit(sub('\n.*', '', toread), ' +')[[1]]
# Drop the header line, then turn the spaces before the numeric column into tabs:
no_header = sub('^.*?\n', '', toread)
no_header = gsub(' +(?=\\d)', '\t', no_header, perl = TRUE)
data = read.delim(text = no_header, header = FALSE)
colnames(data) = header
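A quick structural check of the result:
names(data) # "DomicLabProv" "cluster"
nrow(data)  # 24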

With a little regex, we can convert it to CSV format:
read.csv(text = sub('\\s+(\\S+$)', ',\\1', readLines(textConnection(toread))))
# DomicLabProv cluster
# 1 BUENOS AIRES 1
# 2 CHUBUT 1
# 3 FORMOSA 1
# 4 LA PAMPA 1
# 5 SAN JUAN 1
# 6 CAPITAL FEDERAL 1
# 7 MISIONES 1
# 8 SAN LUIS 1
# 9 SANTA FE 1
# 10 ENTRE RIOS 2
# 11 JUJUY 2
# 12 LA RIOJA 2
# 13 SANTIAGO DEL ESTERO 2
# 14 CHACO 2
# 15 CORDOBA 2
# 16 CORRIENTES 2
# 17 SALTA 2
# 18 TIERRA DEL FUEGO 2
# 19 TUCUMAN 2
# 20 CATAMARCA 3
# 21 MENDOZA 3
# 22 NEUQUEN 3
# 23 RIO NEGRO 3
# 24 SANTA CRUZ 3
sub looks for whitespace characters (\\s+) followed by a group that it captures ((...)), consisting of non-whitespace characters (\\S+) followed by the end of the line ($). It replaces them with a comma (,) followed by the captured group (\\1).
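To see the substitution on a single line:
sub('\\s+(\\S+$)', ',\\1', 'SANTIAGO DEL ESTERO 2')
# [1] "SANTIAGO DEL ESTERO,2"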

Related

How to assign one dataframe column's value to be the same as another column's value in r?

I am trying to run the line of code below to copy the city.output column into pm.city wherever city.output is not NA (though in my sample data frame nothing is NA), because city.output contains the correct city spellings.
resultdf <- dplyr::mutate(df, pm.city = ifelse(is.na(city.output) == FALSE, city.output, pm.city))
df:
pm.uid pm.address pm.state pm.zip pm.city city.output
<int> <chr> <chr> <chr> <chr> <fct>
1 1 1809 MAIN ST OH 63312 NORWOOD NORWOOD
2 2 123 ELM DR CA NA BRYAN BRYAN
3 3 8970 WOOD ST UNIT 4 LA 33333 BATEN ROUGE BATON ROUGE
4 4 4444 OAK AVE OH 87481 CINCINATTI CINCINNATI
5 5 3333 HELPME DR MT 87482 HELENA HELENA
6 6 2342 SOMEWHERE RD LA 45103 BATON ROUGE BATON ROUGE
resultdf (pm.city should be the same as city.output, but it comes out as an integer):
pm.uid pm.address pm.state pm.zip pm.city city.output
<int> <chr> <chr> <chr> <int> <fct>
1 1 1809 MAIN ST OH 63312 7 NORWOOD
2 2 123 ELM DR CA NA 2 BRYAN
3 3 8970 WOOD ST UNIT 4 LA 33333 1 BATON ROUGE
4 4 4444 OAK AVE OH 87481 3 CINCINNATI
5 5 3333 HELPME DR MT 87482 4 HELENA
6 6 2342 SOMEWHERE RD LA 45103 1 BATON ROUGE
An integer is assigned to pm.city instead. The integer appears to be the position of each city when the cities are sorted alphabetically. Prior to this I used dplyr's left_join to attach the city.output column from another data frame, but there too I never supplied a row number explicitly.
This works on my computer in RStudio but not when I run it from a server. Maybe it has something to do with my version of dplyr, or with the factor data type of city.output? I am pretty new to R.
city.output is a factor, and ifelse returns its underlying integer storage codes rather than the labels. Instead, convert it to character with as.character:
dplyr::mutate(df, pm.city = ifelse(!is.na(city.output), as.character(city.output), pm.city))
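An equivalent spelling uses dplyr::coalesce, which picks the first non-NA value at each position (a minimal sketch, assuming the same df as above):
dplyr::mutate(df, pm.city = dplyr::coalesce(as.character(city.output), pm.city))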

Remove specific value in R or Linux

Hi, I have a tab-separated file in the terminal with several columns, as below. You can see the last column has a comma followed by one or more characters.
1 100 Japan Na pa,cd
2 120 India Ca pa,ces
5 110 Japan Ap pa,cres
1 540 China Sn pa,cd
1 111 Nepal Le pa,b
I want to keep the last-column values before the comma, so the file looks like:
1 100 Japan Na pa
2 120 India Ca pa
5 110 Japan Ap pa
1 540 China Sn pa
1 111 Nepal Le pa
I have looked at sed but I cannot find a way to strip them.
Regards
In R you can read the file with a tab separator and remove the values after the comma:
result <- transform(read.table('file1.txt', sep = '\t'), V5 = sub(',.*', '', V5))
V5 is used assuming the 5th column is the one whose values you want to change.
We can use
df1 <- read.table('file1.txt', sep = '\t')
df1$V5 <- sub("^([^,]+),.*", "\\1", df1$V5)
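A quick check of the pattern on a single value:
sub("^([^,]+),.*", "\\1", "pa,cd")
# [1] "pa"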

How to remove rows that contain duplicate characters in R

I want to remove the entire row if the two columns contain the same value. Any quick help in doing so in R (for a very large dataset) would be highly appreciated. For example:
mydf <- data.frame(p1=c('a','c','a','b','d','b','c','c','e'),
p2=c('b','c','d','c','d','b','d','e','e'),
value=c(10,20,10,11,12,13,14,15,16))
This gives:
mydf
p1 p2 value
1 a b 10
2 c c 20
3 a d 10
4 b c 11
5 d d 12
6 b b 13
7 c d 14
8 c e 15
9 e e 16
I want to get:
p1 p2 value
1 a b 10
2 a d 10
3 b c 11
4 c d 14
5 c e 15
Your note in the comments suggests your actual problem is more complex than this example. There is some preprocessing you could do to your strings before you compare p1 to p2. You will have the domain expertise to know which steps are appropriate, but here's a start: I remove all spaces and punctuation from p1 and p2, then convert both to uppercase before testing for equality. You can modify the clean_str function to include more or different cleaning operations.
Additionally, you may consider approximate matching to address typos and colloquial naming conventions. The stringdist package is a good place to start; see the sketch after the example below.
mydf <- data.frame(p1=c('New York','New York','New York','TokYo','LosAngeles','MEMPHIS','memphis','ChIcAGo','Cleveland'),
p2=c('new York','New.York','MEMPHIS','Chicago','knoxville','tokyo','LosAngeles','Chicago','CLEVELAND'),
value=c(10,20,10,11,12,13,14,15,16),
stringsAsFactors = FALSE)
mydf[mydf$p1 != mydf$p2,]
#> p1 p2 value
#> 1 New York new York 10
#> 2 New York New.York 20
#> 3 New York MEMPHIS 10
#> 4 TokYo Chicago 11
#> 5 LosAngeles knoxville 12
#> 6 MEMPHIS tokyo 13
#> 7 memphis LosAngeles 14
#> 8 ChIcAGo Chicago 15
#> 9 Cleveland CLEVELAND 16
clean_str <- function(col){
  # remove all punctuation and whitespace
  d <- gsub("[[:punct:][:blank:]]+", "", col)
  # fold to a single case before comparing
  d <- toupper(d)
  return(d)
}
mydf$p1 <- clean_str(mydf$p1)
mydf$p2 <- clean_str(mydf$p2)
mydf[mydf$p1 != mydf$p2,]
#> p1 p2 value
#> 3 NEWYORK MEMPHIS 10
#> 4 TOKYO CHICAGO 11
#> 5 LOSANGELES KNOXVILLE 12
#> 6 MEMPHIS TOKYO 13
#> 7 MEMPHIS LOSANGELES 14
Created on 2020-05-03 by the reprex package (v0.3.0)
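For approximate matching, here is a sketch with the stringdist package mentioned above (the 0.1 threshold is purely illustrative):
library(stringdist)
d <- stringdist(clean_str(mydf$p1), clean_str(mydf$p2), method = "jw")
mydf[d > 0.1, ] # keep rows whose cleaned names are not near-duplicates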
Several ways to do that, among them:
Base R
mydf[mydf$p1 != mydf$p2, ]
dplyr
library(dplyr)
mydf %>% filter(p1 != p2)
data.table
library(data.table)
setDT(mydf)
mydf[p1 != p2]
Here's a two-step solution based on @Chase's data:
First step (as suggested by @Chase): preprocess your data in p1 and p2 to make them comparable:
# set to lower-case:
mydf[,c("p1", "p2")] <- lapply(mydf[,c("p1", "p2")], tolower)
# remove anything that's not alphanumeric between words:
mydf[,c("p1", "p2")] <- lapply(mydf[,c("p1", "p2")], function(x) gsub("(\\w+)\\W(\\w+)", "\\1\\2", x))
Second step: (i) using apply, paste the rows together; (ii) use grepl and the backreference \\1 to look for immediately adjacent duplicates in these rows; and (iii) remove (-) those rows which contain such duplicates:
mydf[-which(grepl("\\b(\\w+)\\s+\\1\\b", apply(mydf, 1, paste0, collapse = " "))),]
p1 p2 value
3 newyork memphis 10
4 tokyo chicago 11
5 losangeles knoxville 12
6 memphis tokyo 13
7 memphis losangeles 14
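To see the backreference at work on individual pasted rows:
grepl("\\b(\\w+)\\s+\\1\\b", "newyork newyork 20")
# [1] TRUE
grepl("\\b(\\w+)\\s+\\1\\b", "newyork memphis 10")
# [1] FALSE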

Looping over a data frame and adding a new column in R with certain logic

I have a data frame which contains information about sales branches, customers and sales.
branch <- c("Chicago","Chicago","Chicago","Chicago","Chicago","Chicago","LA","LA","LA","LA","LA","LA","LA","Tampa","Tampa","Tampa","Tampa","Tampa","Tampa","Tampa","Tampa")
customer <- c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21)
sales <- c(33816,24534,47735,1467,39389,30659,21074,20195,45165,37606,38967,41681,47465,3061,23412,22993,34738,19408,11637,36234,23809)
data <- data.frame(branch, customer, sales)
What I need to accomplish is to iterate over each branch, take each customer in the branch, and divide that customer's sales by the branch total. I need this to find out how much each customer contributes to the total sales of the corresponding branch. E.g. for customer 1 I would divide 33816/177600 and store the value in a new column (177600 is the total of the Chicago branch).
I have tried to write a function to iterate over each row in a for loop but I am not sure how to do it at a branch level. Any guidance is appreciated.
Consider base R's ave for a new column of inline aggregates; dividing a per-customer sum by a per-branch sum also handles the same customer appearing in multiple records within the same branch:
data$customer_contribution <- ave(data$sales, data$customer, FUN=sum) /
ave(data$sales, data$branch, FUN=sum)
data
# branch customer sales customer_contribution
# 1 Chicago 1 33816 0.190405405
# 2 Chicago 2 24534 0.138141892
# 3 Chicago 3 47735 0.268778153
# 4 Chicago 4 1467 0.008260135
# 5 Chicago 5 39389 0.221784910
# 6 Chicago 6 30659 0.172629505
# 7 LA 7 21074 0.083576241
# 8 LA 8 20195 0.080090263
# 9 LA 9 45165 0.179117441
# 10 LA 10 37606 0.149139610
# 11 LA 11 38967 0.154537126
# 12 LA 12 41681 0.165300433
# 13 LA 13 47465 0.188238887
# 14 Tampa 14 3061 0.017462291
# 15 Tampa 15 23412 0.133560003
# 16 Tampa 16 22993 0.131169705
# 17 Tampa 17 34738 0.198172193
# 18 Tampa 18 19408 0.110718116
# 19 Tampa 19 11637 0.066386372
# 20 Tampa 20 36234 0.206706524
# 21 Tampa 21 23809 0.135824795
Or less wordy:
data$customer_contribution <- with(data, ave(sales, customer, FUN=sum) /
ave(sales, branch, FUN=sum))
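A quick sanity check: the contributions within each branch should sum to 1.
tapply(data$customer_contribution, data$branch, sum) # Chicago, LA, Tampa: all 1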
We can use dplyr::group_by and dplyr::mutate to calculate each customer's fraction of the branch total.
library(dplyr)
library(magrittr)
data %>%
group_by(branch) %>%
mutate(sales.norm = sales / sum(sales))
## A tibble: 21 x 4
## Groups: branch [3]
# branch customer sales sales.norm
# <fct> <dbl> <dbl> <dbl>
# 1 Chicago 1. 33816. 0.190
# 2 Chicago 2. 24534. 0.138
# 3 Chicago 3. 47735. 0.269
# 4 Chicago 4. 1467. 0.00826
# 5 Chicago 5. 39389. 0.222
# 6 Chicago 6. 30659. 0.173
# 7 LA 7. 21074. 0.0836
# 8 LA 8. 20195. 0.0801
# 9 LA 9. 45165. 0.179
#10 LA 10. 37606. 0.149
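For what it's worth, the same per-branch proportion can also be written in base R with prop.table inside ave (a minimal sketch):
data$sales.norm <- with(data, ave(sales, branch, FUN = prop.table))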

Reading a fixed length text file in R

I currently work in SAS, but I have used R for a very long time. I have some fixed-width text files to read. They read easily in SAS, but I am literally experiencing hell trying to do the same in R.
The file looks somewhat like this:
DP JAMES SILVA REY
2014
6
0
1723713652
2
0
DP ALEJANDRA NARVAEZ
2014
6
0
1723713456
6
0
DP NANYER PICHARDO
2014
6
0
1723713991
1
0
DP GABRIELA ANASI CASTILLO
2014
6
0
1723713240
3
0
The structure is not clear from the snippet above; I have attached the file, please take a look.
It reads easily in SAS using infile and input.
SAS Code:
infile "filename.txt" lrecl=32767 ;
input
#001 park_cd $5.
#006 Title $15.
#021 first_name $25.
#046 middle_name $25.
#071 last_name $25.
#096 suffix $15.
#111 ADDRESS_1 $60.
#171 ADDRESS_2 $60.
#231 ADDRESS_3 $60.
#261 CITY $30.
#291 STATE_PROVINCE $2.
#293 ZIP $9.
#302 Ticket_Year $11.
#314 product_id $12.
#327 UNIT_PRICE $13.
#340 PURCHASE_DT $26.
#366 PURCHASE_QTY $12.
#378 TOTAL_PURCHASE_AMT $14. ;
run;
Now, to do the same in R, I have been trying various things:
1) At first, read.fwf.
Code:
dat1=read.fwf("D:/Cedar_response/Cedar_Fair_DP_2014_haunt_results.txt",
widths=c(5,15,25,25,25,15,60,60,60,30,2,9,11,12,13,26,12,14),
header = FALSE, sep = "\t",fill = TRUE,
skip = 0, col.names=c("park_cd","Title","first_name","middle_name","last_name","suffix",
"ADDRESS_1 ","ADDRESS_2","ADDRESS_3","CITY","STATE_PROVINCE",
" ZIP","Ticket_Year","product_id","UNIT_PRICE","PURCHASE_DT",
"PURCHASE_QTY","TOTAL_PURCHASE_AMT "), fileEncoding = "ASCII")
But it returns NA for most of the fields, and the few values that do appear are in the wrong positions.
head(dat1) gives:
park_cd Title first_name middle_name
1 DP JAMES SILVA
2
3 <NA>
4 <NA> <NA> <NA>
5 <NA> <NA>
6 2014 <NA> <NA>
last_name suffix
1 REY
2 <NA> <NA>
3 <NA> <NA>
4 <NA> <NA>
5 <NA> <NA>
6 <NA> <NA>
ADDRESS_1.
1
2 <NA>
3 <NA>
4 <NA>
5 <NA>
6 <NA>
ADDRESS_2 ADDRESS_3 CITY
1 NA NA
2 <NA> NA NA
3 <NA> NA NA
4 <NA> NA NA
5 <NA> NA NA
6 <NA> NA NA
STATE_PROVINCE X.ZIP Ticket_Year product_id UNIT_PRICE PURCHASE_DT PURCHASE_QTY
1 NA NA NA NA NA NA NA
2 NA NA NA NA NA NA NA
3 NA NA NA NA NA NA NA
4 NA NA NA NA NA NA NA
5 NA NA NA NA NA NA NA
6 NA NA NA NA NA NA NA
TOTAL_PURCHASE_AMT.
1 NA
2 NA
3 NA
4 NA
5 NA
6 NA
2) Next I use the SAScii package to reuse the SAS code in R.
Code:
sas_imp <- "input
#001 park_cd $5.
#006 Title $15.
#021 first_name $25.
#046 middle_name $25.
#071 last_name $25.
#096 suffix $15.
#111 ADDRESS_1 $60.
#171 ADDRESS_2 $60.
#231 ADDRESS_3 $60.
#261 CITY $30.
#291 STATE_PROVINCE $2.
#293 ZIP $9.
#302 Ticket_Year $11.
#314 product_id $12.
#327 UNIT_PRICE $13.
#340 PURCHASE_DT $26.
#366 PURCHASE_QTY $12.
#378 TOTAL_PURCHASE_AMT $14. ;"
sas_imp.tf <- tempfile()
writeLines (sas_imp , con = sas_imp.tf )
parse.SAScii( sas_imp.tf )
read.SAScii( "filename.txt" , sas_imp.tf )
It too gives the same useless output as above.
3) Now I use the LaF package and the laf_open_fwf function:
library(LaF)
data <- laf_open_fwf(filename="D:/Cedar_response/Cedar_Fair_DP_2014_haunt_results.txt",
column_types=rep("character",18),
column_names=c("park_cd","Title","first_name","middle_name","last_name","suffix",
"ADDRESS_1 ","ADDRESS_2","ADDRESS_3","CITY","STATE_PROVINCE",
" ZIP","Ticket_Year","product_id","UNIT_PRICE","PURCHASE_DT",
"PURCHASE_QTY","TOTAL_PURCHASE_AMT "),
column_widths=c(5,15,25,25,25,15,60,60,60,30,2,9,11,12,13,26,12,14))
Then I convert it like this:
library(ffbase)
my.data <- laf_to_ffdf(data)
head(as.data.frame(my.data))
But it gives output:
park_cd Title first_name middle_name last_name
1 DP JAMES SILVA REY
2 \r\n \r\n
3 JANDR A NARVAEZ
4 \r\n \r \n \r\n \r\n 20
5 PICHARDO
6 \r\n \r\n \r\n \r\n 2014\r\n 6\r\n
suffix
1
2 \r\n \r\n
3
4 14\r\n
5
6 0\r\n
ADDRESS_1.
1
2 2014\r\n 6\r\n 0\r\n 172
3
4 6\r\n 0\r\n 1723713456\r\n 6\r\n
5
6 1723713991\r\n 1\r\n 0\r\nDP
ADDRESS_2 ADDRESS_3 CITY
1 \r *\003
2 3713652\r\n 2\r\n 0\r\nDP A L *\003
3 \r\n *\003
4 0\r\nDP NANYER *\003
5 \r\n *\003
6 GABRIELA ANASI *\003
STATE_PROVINCE X.ZIP Ticket_Year product_id UNIT_PRICE PURCHASE_DT PURCHASE_QTY
1 ÐÆ *\003 "ADDR ,"\001 *\003 \n <N
2 ÐÆ *\003 "ADDR ,"\001 *\003 \n <N
3 ÐÆ *\003 "ADDR ,"\001 *\003 \n <N
4 ÐÆ *\003 "ADDR ,"\001 *\003 \n <N
5 ÐÆ *\003 "ADDR ,"\001 *\003 \n <N
6 ÐÆ *\003 "ADDR ,"\001 *\003 \n <N
TOTAL_PURCHASE_AMT.
1 \001
2 \001
3 \001
4 \001
5 \001
6 \001
4) Lastly, read.table.ffdf:
library(ff)
library(stringr)
my.data1 <- read.table.ffdf(file="D:/Cedar_response/Cedar_Fair_DP_2014_haunt_results.txt",
FUN="read.fwf",
widths = c(5,15,25,25,25,15,60,60,60,30,2,9,11,12,13,26,12,14),
header=F, VERBOSE=TRUE,
col.names = c("park_cd","Title","first_name","middle_name","last_name","suffix",
"ADDRESS_1 ","ADDRESS_2","ADDRESS_3","CITY","STATE_PROVINCE",
" ZIP","Ticket_Year","product_id","UNIT_PRICE","PURCHASE_DT",
"PURCHASE_QTY","TOTAL_PURCHASE_AMT "),
fileEncoding = "UTF-8",
transFUN=function(x){
z <- sapply(x, function(y) {
y <- str_trim(y)
y[y==""] <- NA
factor(y)})
as.data.frame(z)
} )
But the result is the same.
I found the last solution on this page: http://r.789695.n4.nabble.com/read-table-ffdf-and-fixed-width-files-td4673220.html
What am I doing wrong? Am I specifying the widths incorrectly, or am I getting the idea wrong altogether?
I have worked with many things in R and just can't believe that something so easy in SAS is this tough in R. I must be missing something simple. If you have any ideas, please help, folks. Thanks in advance.
Update
Please see here for what I use nowadays for this problem:
Faster way to read fixed-width files
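A minimal sketch of that approach with readr::read_fwf, using the widths from the read.fwf attempt in the question (note the other answer's caveat that these widths may not match the file):
library(readr)
pos <- fwf_widths(c(5,15,25,25,25,15,60,60,60,30,2,9,11,12,13,26,12,14),
                  col_names = c("park_cd","Title","first_name","middle_name","last_name","suffix",
                                "ADDRESS_1","ADDRESS_2","ADDRESS_3","CITY","STATE_PROVINCE",
                                "ZIP","Ticket_Year","product_id","UNIT_PRICE","PURCHASE_DT",
                                "PURCHASE_QTY","TOTAL_PURCHASE_AMT"))
dat <- read_fwf("filename.txt", col_positions = pos,
                col_types = cols(.default = "c")) # read every field as character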
For posterity, the original answer is retained below as a how-to guide for bootstrapping solutions while desperate 😅
Here's the FW -> .csv converter I created in Python to destroy these awful files:
It also includes a checkLength function that can help diagnose what @RobertLong suggested below: that your underlying file might be faulty. If that's the case, you may be in trouble if the corruption is pervasive and unpredictable (i.e. there are no consistent mistake patterns in your file that you can ctrl+H to fix).
Please note dictfile must be formatted correctly (I wrote this for myself, not necessarily to be as robust as possible)
import os
import csv

# Set the correct working directory
os.chdir('/home/michael/...')  # match the format of your OS

def checkLength(ffile):
    """
    Check that all lines in the file have the same length
    (and so don't cause any issues below).
    """
    with open(ffile, 'r') as ff:
        firstrow = 1
        troubles = 0
        for rows in ff:
            if firstrow:
                length = len(rows)
                firstrow = 0
            elif len(rows) != length:
                print(rows)
                print(len(rows))
                troubles = 1
    return troubles

def fixed2csv(infile, outfile, dictfile):
    """
    Take the file name of a fixed-width dataset as input and
    convert it to .csv format according to slices and column names
    specified in dictfile.

    Parameters
    ==========
    infile: string of input file name from which fixed-width data is to be read
        e.g. 'fixed_width.dat'
    outfile: string of output file name to which comma-separated data is to be saved
        e.g. 'comma_separated.csv'
    dictfile: .csv-formatted dictionary file name from which to read the following:
        * widths: field widths
        * column names: names of columns to be written to the output .csv
        * types: object types (character, integer, etc.)
        column order must be: col_names,widths,types
    """
    with open(dictfile, 'r') as dictf:
        fieldnames = ("col_names", "widths", "types")  # types used in R later
        ddict = csv.DictReader(dictf, fieldnames)
        slices = []
        colNames = []
        wwidths = []
        for rows in ddict:
            wwidths.append(int(rows['widths']))  # field width as an integer
            colNames.append(rows['col_names'])
        offset = 0
        for w in wwidths:
            # Python slices are 0-based and end-exclusive
            slices.append(slice(offset, offset + w))
            offset += w
    with open(infile, 'r') as fixedf:
        with open(outfile, 'w') as csvf:
            csvfile = csv.writer(csvf)
            csvfile.writerow(colNames)
            for rows in fixedf:
                csvfile.writerow([rows[s] for s in slices])
Good luck, and curses be on whoever is proliferating these FW-format data files.
The file you uploaded is not a fixed-width file:
I am not a SAS user, but from looking at the SAS code in your post, the column widths in the code do not match up with those in the file.
Moreover, some lines are entirely blank.
It appears that there are many carriage returns / newlines which do not belong there; in particular, they seem to be used in places as a delimiter. There should be one CRLF at the end of each line, and that's it.
Since you say that SAS opens it, I suggest you save to CSV format from SAS and then open that in R. Alternatively, you could remove the superfluous CRLFs using a good text editor/processor, leaving a single CRLF at the end of each line. Since each "real" line appears to begin with "DP", you could replace -CRLF-DP with (say) -tab-, then delete all remaining -CRLF-s, then replace all -tab-s with -CRLF- (this relies on there being no -tab-s in the file already).
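A hedged R sketch of that editor procedure (assuming every true record starts with "DP " and the file contains no tabs):
path <- "filename.txt" # hypothetical file name
raw <- readChar(path, file.info(path)$size)
raw <- gsub("\r\n(DP )", "\t\\1", raw) # protect the genuine record starts
raw <- gsub("\r\n", "", raw)           # delete the stray CRLFs
raw <- gsub("\t(DP )", "\n\\1", raw)   # restore one newline per record
writeLines(raw, "filename_fixed.txt")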
