How to compare the differences between 2 txt files and output them to a new txt file - atom-editor

How can I compare two txt files and print the differences to the shell?
The working files are in this link.

Use drop_duplicates with Pandas:
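import pandas as pd

# Read each file as a single-column frame and drop repeated lines within it
df1 = pd.read_csv('members_1.txt', header=None).drop_duplicates()
df2 = pd.read_csv('members_2.txt', header=None).drop_duplicates()
# keep=False drops every line that occurs in both files
out = pd.concat([df1, df2]).drop_duplicates(keep=False)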
Output
>>> print(*out[0].to_list(), sep='\n')
LEE RI KE
LIM YONG
KOH CHEE KIAT
LEE YONG
KOH CHEW KIAT
LEE RI KHEE
OR
Use set in Python:
with open('members_1.txt') as fp1, open('members_2.txt') as fp2:
    data1 = set([l.strip() for l in fp1])
    data2 = set([l.strip() for l in fp2])
out = data1.symmetric_difference(data2)
Output:
>>> print(*out, sep='\n')
KOH CHEW KIAT
LEE RI KE
LEE YONG
KOH CHEE KIAT
LEE RI KHEE
LIM YONG
Update: export to file
with open('output.txt', 'w') as fp:
    print(*out, sep='\n', file=fp)
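A symmetric difference does not say which file each line came from, and a set loses the original order. A minimal sketch that keeps both pieces of information follows; the function name split_differences and the sample lists are assumptions for illustration, not the actual contents of the two files.

```python
def split_differences(lines1, lines2):
    """Return (only_in_first, only_in_second), preserving input order."""
    set1, set2 = set(lines1), set(lines2)
    only_1 = [line for line in lines1 if line not in set2]
    only_2 = [line for line in lines2 if line not in set1]
    return only_1, only_2

# Hypothetical stand-ins for the stripped lines of members_1.txt / members_2.txt.
members_1 = ["LEE RI KE", "LIM YONG", "TAN AH KOW"]
members_2 = ["TAN AH KOW", "LEE YONG", "LEE RI KHEE"]

only_1, only_2 = split_differences(members_1, members_2)
print("Only in file 1:", *only_1, sep="\n")
print("Only in file 2:", *only_2, sep="\n")
```

The two lists can be written to a file with print(..., file=fp) in the same way as the set result.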

Of course, you can also use the diff command-line tool,
or difflib.Differ from Python's standard library.
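For the difflib.Differ suggestion, a minimal sketch might look like this. The helper name changed_lines and the sample lists are hypothetical; with real files you would pass the stripped lines read from each file.

```python
import difflib

def changed_lines(lines1, lines2):
    """Keep only Differ output lines marked as removed ('- ') or added ('+ ')."""
    differ = difflib.Differ()
    return [d for d in differ.compare(lines1, lines2)
            if d.startswith(("- ", "+ "))]

# Hypothetical sample data standing in for the two member lists.
old_members = ["LEE RI KE", "LIM YONG", "TAN AH KOW"]
new_members = ["LEE RI KHEE", "LIM YONG", "TAN AH KOW"]

changes = changed_lines(old_members, new_members)
print(*changes, sep="\n")
```

Unlike the set approach, Differ is order-sensitive: a line that merely moved to a different position is reported as removed and added.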

Related

Why can I plot my stars_proxy object but not write it out?

I'm new to stars so hoping this is a simple answer and just me failing to understand the stars workflow properly.
R Version: 4.1.1
Stars Version: 0.5-5
library(stars)
library(starsdata) #install.packages("starsdata", repos = "http://gis-bigdata.uni-muenster.de", type = "source")
#Create the rasters to read in as proxy
granule = system.file("sentinel/S2A_MSIL1C_20180220T105051_N0206_R051_T32ULE_20180221T134037.zip", package = "starsdata")
s2 = paste0("SENTINEL2_L1C:/vsizip/", granule, "/S2A_MSIL1C_20180220T105051_N0206_R051_T32ULE_20180221T134037.SAFE/MTD_MSIL1C.xml:10m:EPSG_32632")
r1<-read_stars(s2,,RasterIO=list(bands=1),proxy=T)
r2<-read_stars(s2,,RasterIO=list(bands=2),proxy=T)
r3<-read_stars(s2,,RasterIO=list(bands=3),proxy=T)
write_stars(r1,dsn="r1.tif")
write_stars(r2,dsn="r2.tif")
write_stars(r3,dsn="r3.tif")
Then I clear the objects from my environment and restart the R session.
#I clear all the objects and restart my R session here.
library(stars)
foo<-read_stars(c("r1.tif","r2.tif","r3.tif"),proxy=T)
r1<- foo[1]*0
r1[foo[1] > 4000 & foo[2] < 3000] <- 1
r1[foo[1] > 4000 & foo[2] >= 3000 & foo[2] <= 8000] <- 2
r1[foo[1] > 4000 & foo[2] > 8000 & foo[3] < 2000] <- 4
r1[foo[1] > 4000 & foo[2] > 8000 & foo[3] >= 2000] <- 2
# plot(r1) #this works just fine if you run it
#why doesn't the below work?
write_stars(r1,dsn="out.tif")
Attempting to write out the file results in the following error:
Error in st_as_stars.list(mapply(fun, x, i, value = value, SIMPLIFY = FALSE), :
!is.null(dx) is not TRUE
If instead of writing out the file, I plot the raster, it works just fine/as expected.
Perhaps the issue is just my failure to understand that this answer applies to me too:
How to reassign cell/pixel values in R stars objects
First of all thank you for the effort you have made to make available a minimal reproducible example. Unfortunately the image you use is very heavy... and my pc is very old ! ;-) So I chose to use your example with another image (the test image of the stars library), easier to handle for my old computer.
So please find below a reprex which describes step by step the approach.
Reprex
STEP 1: Create three dummy stars proxy objects from the test image of the stars library
library(stars)
r <- system.file("tif/L7_ETMs.tif", package = "stars")
r1 <- read_stars(r, RasterIO = list(bands=1), proxy = TRUE)
r2 <- read_stars(r, RasterIO = list(bands=2), proxy = TRUE)
r3 <- read_stars(r, RasterIO = list(bands=3), proxy = TRUE)
STEP 2: Write every stars proxy object as .tif files on disk
write_stars(r1,dsn="r1.tif")
write_stars(r2,dsn="r2.tif")
write_stars(r3,dsn="r3.tif")
STEP 3: Merge r1, r2 and r3 in the stars proxy object foo
foo <- read_stars(c("r1.tif","r2.tif","r3.tif"), proxy = TRUE)
foo <- merge(foo)
STEP 4: Visualization of the foo stars proxy object
plot(foo)
If you want to display a specific band, proceed like this (here, showing band 3):
plot(foo[,,,3], main = st_dimensions(foo)["band"]$band$values[3])
STEP 5: Run your chunk of code
r1 <- foo[,,,3]*0 # create a proxy of 0s that we will replace using the rules below
r1[foo[,,,1] > 40 & foo[,,,2] < 30] <- 1
r1[foo[,,,1] > 40 & foo[,,,2] >= 30 & foo[,,,2] <= 70] <- 2
r1[foo[,,,1] > 40 & foo[,,,2] > 70 & foo[,,,3] < 7] <- 4
r1[foo[,,,1] > 40 & foo[,,,2] > 70 & foo[,,,3] >= 7] <- 2
STEP 6: Visualization of the output
plot(r1)
NB: I have deliberately not included the output raster here because at the end of the execution of your chunk of code, all pixels of the test image have the value 2. The output image is therefore a monochrome raster without any interest [this result is consistent with the pixel values of the original test image].
STEP 7: Save the output
write_stars(r1, dsn = "out.tif")
STEP 8: Checks that the file has been successfully written to the disk
file.exists("out.tif")
#> [1] TRUE
Created on 2021-12-10 by the reprex package (v2.0.1)

Download a bibtex from GitHub into R

I have a BibTex file stored in GitHub, here:
https://raw.github.com/zoometh/C14/master/neonet/references_france.bib
The file shows bibliographical references like that:
@article{Binder18,
title={Modelling the earliest north-western dispersal of Mediterranean Impressed Wares: new dates and Bayesian chronological model.},
author={Binder, Didier and Lanos, Philippe and Angeli, Lucia and Gomart, Louise and Guilaine, Jean and Manen, Claire and Maggi, Roberto and Muntoni, Italo M and Panelli, Chiara and Radi, Giovanna and others},
journal={Documenta praehistorica.},
volume={44},
pages={54--77},
year={2018},
publisher={University of Ljubljana Department of Archaeology}
}
@inproceedings{Briois09,
title={L'abri de Buholoup: de l'{\'E}pipal{\'e}olithique au N{\'e}olithique ancien dans le piedmont central des Pyr{\'e}n{\'e}es},
author={Briois, Fran{\c{c}}ois and Vaquer, Jean},
booktitle={De M{\'e}diterran{\'e}e et d'ailleurs...: m{\'e}langes offerts {\`a} Jean Guilaine},
pages={141--150},
year={2009}
}
...
I want to download it into R, but the following code is not working:
library(bibtex)
bib <- read.bib('https://raw.github.com/zoometh/C14/master/neonet/references_france.bib')
# Error: unable to open file to read
However, it works when I read it from a local folder.
How can I download a .bib file from GitHub into R/RStudio?
I don't think read.bib can read from a remote URL, so you could download the file first.
library(utils)
URL <- "https://raw.github.com/zoometh/C14/master/neonet/references_france.bib"
download.file(url = URL, destfile=basename(URL))
library(bibtex)
bib <- read.bib(basename(URL))
head(bib,1)
# Binder D, Lanos P, Angeli L, Gomart L, Guilaine J, Manen C, Maggi R, Muntoni IM, Panelli C, Radi
# G, others (2018). “Modelling the earliest north-western dispersal of Mediterranean Impressed
# Wares: new dates and Bayesian chronological model.” _Documenta praehistorica._, *44*, 54-77.

How to read a gz file in R and convert the output into a proper data frame

Hello, I am trying to read a dat.gz file, but the resulting data frame is a mess: it has one column and thousands of rows, and each row seems to hold all the information with no tags or separators. Does anybody have an idea how to produce a proper data frame from this file? It should have many columns with useful information. This is what I tried:
File_URL <- gzcon(url(paste('ftp://mirbase.org/pub/mirbase/CURRENT/miRNA.dat.gz')))
DATA_mir <- readLines(File_URL)
dat <- read.csv(textConnection(DATA_mir),sep="\t", header=T, comment.char="#",
na.strings=".", stringsAsFactors=FALSE,
quote="", fill=FALSE)
Many thanks.
The file is not in a clean tabular form, so you'll have to parse it in a smart way.
$> zcat miRNA.dat.gz | head -n 20
ID cel-let-7 standard; RNA; CEL; 99 BP.
XX
AC MI0000001;
XX
DE Caenorhabditis elegans let-7 stem-loop
XX
RN [1]
RX PUBMED; 11679671.
RA Lau NC, Lim LP, Weinstein EG, Bartel DP;
RT "An abundant class of tiny RNAs with probable regulatory roles in
RT Caenorhabditis elegans";
RL Science. 294:858-862(2001).
XX
RN [2]
RX PUBMED; 12672692.
RA Lim LP, Lau NC, Weinstein EG, Abdelhakim A, Yekta S, Rhoades MW, Burge CB,
RA Bartel DP;
RT "The microRNAs of Caenorhabditis elegans";
RL Genes Dev. 17:991-1008(2003).
XX
My guess is that this has already been done. Alternatively, you could try the mirbase.db package.
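To illustrate what "parsing it in a smart way" could mean, here is a rough sketch in Python rather than R. It assumes each line starts with a two-letter code (ID, AC, DE, ...) as in the listing above, that XX lines are only spacers, and that records end with a // line as in the EMBL flat-file convention; that terminator is not visible in the 20 lines shown, so treat it as an assumption. The function names are hypothetical.

```python
import gzip

def parse_records(lines):
    """Group two-letter-coded lines into one dict of lists per record."""
    record = {}
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("//"):        # assumed end-of-record marker
            if record:
                yield record
            record = {}
            continue
        code, _, value = line.partition(" ")
        if code == "XX" or not code:     # spacer line, or line with no code
            continue
        record.setdefault(code, []).append(value.strip())
    if record:                           # last record without a terminator
        yield record

def read_gz_records(path):
    """Parse a gzip-compressed flat file; the path is supplied by the caller."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return list(parse_records(handle))

# Small in-memory sample modelled on the listing above.
sample = [
    "ID   cel-let-7 standard; RNA; CEL; 99 BP.",
    "XX",
    "AC   MI0000001;",
    "XX",
    "DE   Caenorhabditis elegans let-7 stem-loop",
    "//",
]
rows = [
    {
        "id": rec["ID"][0].split()[0],
        "accession": rec["AC"][0].rstrip(";"),
        "description": " ".join(rec["DE"]),
    }
    for rec in parse_records(sample)
]
print(rows)
```

Each dict in rows corresponds to one row of the desired table; which codes become columns is a choice left to the reader.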

Importing CSV files with CRLF broken lines in R

I'm an urban planner migrating towards spatial data analysis. I am not oblivious to R and programming in general but since I don't have the proper training my skills are limited sometimes.
At the moment I am trying to analyse about 50 CSV files containing financial data concerning public auctions which are from 60000 to 300000 lines long with 39 fields. The files are exports from the Romanian national public auctioning system, which is a form-like platform.
The issue is that some of the lines are broken by CRLF line endings in the middle of the address fields. I suspect that when people entered their address in the form they copy/pasted it from other files where it was multiline.
The issue cannot be resolved by Find&Replace as this will also replace the correct CRLF at the end of the line.
As an example, the data is formatted something like this, with a CRLF after each line (they used ^ as the delimiter):
Castigator^CastigatorCUI^CastigatorTara^CastigatorLocalitate^CastigatorAdresa^Tip^TipContract^TipProcedura^AutoritateContractanta^AutoritateContractantaCUI^TipAC^TipActivitateAC^NumarAnuntAtribuire^DataAnuntAtribuire^TipIncheiereContract^TipCriteriiAtribuire^CuLicitatieElectronica^NumarOfertePrimite^Subcontractat^NumarContract^DataContract^TitluContract^Valoare^Moneda^ValoareRON^ValoareEUR^CPVCodeID^CPVCode^NumarAnuntParticipare^DataAnuntParticipare^ValoareEstimataParticipare^MonedaValoareEstimataParticipare^FonduriComunitare^TipFinantare^TipLegislatieID^FondEuropean^ContractPeriodic^DepoziteGarantii^ModalitatiFinantare
S.C. RCTHIA CO S.R.L.^65265644^Romania^Bucharest^DN1
Nr. 1, ^Anunt de atribuire la anunt de participare^Furnizare^Licitatie deschisa^COMPANIA NATIONALA DE TRANSPORT AL ENERGIEI ^R656556^^Electricitate^96594^2007-12-14^Un contract de achizitii publice^Pretul cel mai scazut^^1^^61^2007-11-08 00:00:00.000^Televizoare^304503.95^RON^304503.950000000001^89650.5^45937^323124100-1^344578^2007-10-02^49700.00^RON^^^^^^Nu este cazul;^Surse proprii;
ASOC : SC MNG SRLsi SC AquaiM SA ^56565575;656224^Romania^Ploiesti^Str. Independentei nr.15;
Str. Carol nr. 45^Anunt de atribuire la anunt de participare^Lucrari^Negociere fara anunt de participare^MUNICIPIUL RAMNICU VALCEA^6562655^Administratie publica locala (municipii, orase, comune), institutie publica in subordonarea/coordonarea administratiei publice locale^Servicii generale ale administratiilor publice^56566^2007-10-10^Un contract de achizitii publice^Pretul cel mai scazut^^1^^65656^2007-09-12^Proiectare si executie lucrari^5665560.00^RON^659966.0^5455222^7140^65689966-2^^^^^^^^^^^
In order to properly process the data I would need the CSV to be read like this, by removing only the CRLF that break lines - which Find&Replace cannot do:
Castigator^CastigatorCUI^CastigatorTara^CastigatorLocalitate^CastigatorAdresa^Tip^TipContract^TipProcedura^AutoritateContractanta^AutoritateContractantaCUI^TipAC^TipActivitateAC^NumarAnuntAtribuire^DataAnuntAtribuire^TipIncheiereContract^TipCriteriiAtribuire^CuLicitatieElectronica^NumarOfertePrimite^Subcontractat^NumarContract^DataContract^TitluContract^Valoare^Moneda^ValoareRON^ValoareEUR^CPVCodeID^CPVCode^NumarAnuntParticipare^DataAnuntParticipare^ValoareEstimataParticipare^MonedaValoareEstimataParticipare^FonduriComunitare^TipFinantare^TipLegislatieID^FondEuropean^ContractPeriodic^DepoziteGarantii^ModalitatiFinantare
S.C. RCTHIA CO S.R.L.^65265644^Romania^Bucharest^DN1 Nr. 1, ^Anunt de atribuire la anunt de participare^Furnizare^Licitatie deschisa^COMPANIA NATIONALA DE TRANSPORT AL ENERGIEI ^R656556^^Electricitate^96594^2007-12-14^Un contract de achizitii publice^Pretul cel mai scazut^^1^^61^2007-11-08 00:00:00.000^Televizoare^304503.95^RON^304503.950000000001^89650.5^45937^323124100-1^344578^2007-10-02^49700.00^RON^^^^^^Nu este cazul;^Surse proprii;
ASOC : SC MNG SRLsi SC AquaiM SA ^56565575;656224^Romania^Ploiesti^Str. Independentei nr.15; Str. Carol nr. 45^Anunt de atribuire la anunt de participare^Lucrari^Negociere fara anunt de participare^MUNICIPIUL RAMNICU VALCEA^6562655^Administratie publica locala (municipii, orase, comune), institutie publica in subordonarea/coordonarea administratiei publice locale^Servicii generale ale administratiilor publice^56566^2007-10-10^Un contract de achizitii publice^Pretul cel mai scazut^^1^^65656^2007-09-12^Proiectare si executie lucrari^5665560.00^RON^659966.0^5455222^7140^65689966-2^^^^^^^^^^^
I have found a possible solution (Is there a way in R to join broken lines of csv file?), but it required some tweaking to fit my needs. The end result is that the code below hangs and does not reach the end of the process, even on small sample files.
My alteration of the accepted solution code from the above mentioned post:
dat <- readLines("filename.csv") # read whatever is in there, one line at a time
varnames <- unlist(strsplit(dat[1], "^", fixed = TRUE)) # extract variable names
nvar <- length(varnames)
k <- 1 # setting up a counter
dat1 <- matrix(NA, ncol = nvar, dimnames = list(NULL, varnames))
while(k <= length(dat)){
k <- k + 1
if(dat[k] == "") {k <- k + 1
print(paste("data line", k, "is an empty string"))
if(k > length(dat)) {break}
}
temp <- dat[k]
# checks if there are enough commas or if the line was broken
while(length(gregexpr("^", temp)[[1]]) < nvar-1){
k <- k + 1
temp <- paste0(temp, dat[k])
}
temp <- unlist(strsplit(temp, "^"))
message(k)
dat1 <- rbind(dat1, temp)
}
dat1 = dat1[-1,] # delete the empty initial row
Counting fields between delimiters seems like a good solution but I am unable to find a good way to do this and my R programming skills are not enough apparently.
So is there any way to fix this type of broken CSV files in R?
Working files sample can be accessed here: http://data.gv.ro/dataset/4a4903c4-b1e3-46d1-82a5-238287f9496c/resource/c6abc0ef-3efb-4aef-bc0a-411f8cab2a28/download/contracte-2007.csv
Thanks for any help you can give!
The trouble seems to be that ^ is a special character in regular expressions (it anchors the start of the string). If you step through your code you will see that you have 627 variables instead of 39. It is making each character a variable. Try this:
dat <- readLines("filename.csv") # read whatever is in there, one line at a time
varnames <- unlist(strsplit(dat[1], "\\^")) # extract variable names
nvar <- length(varnames)
k <- 1 # setting up a counter
dat1 <- matrix(NA, ncol = nvar, dimnames = list(NULL, varnames))
while(k <= length(dat)){
  k <- k + 1
  #if(dat[k] == "") {k <- k + 1
  #print(paste("data line", k, "is an empty string"))
  if(k > length(dat)) {break}
  #}
  temp <- dat[k]
  # checks if there are enough ^ delimiters or if the line was broken
  while(length(gregexpr("\\^", temp)[[1]]) < nvar-1){
    k <- k + 1
    temp <- paste0(temp, dat[k])
  }
  temp <- unlist(strsplit(temp, "\\^"))
  message(k)
  dat1 <- rbind(dat1, temp)
}
dat1 = dat1[-1,] # delete the empty initial row
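Sorry, I missed that difference between your code and mine. You don't want fixed = TRUE together with the escaped pattern "\\^": either escape the caret and let it be treated as a regular expression, as above, or keep the plain "^" with fixed = TRUE. Changing it to the above gives you this: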
> varnames
[1] "Castigator" "CastigatorCUI" "CastigatorTara"
[4] "CastigatorLocalitate" "CastigatorAdresa" "Tip"
[7] "TipContract" "TipProcedura" "AutoritateContractanta"
[10] "AutoritateContractantaCUI" "TipAC" "TipActivitateAC"
[13] "NumarAnuntAtribuire" "DataAnuntAtribuire" "TipIncheiereContract"
[16] "TipCriteriiAtribuire" "CuLicitatieElectronica" "NumarOfertePrimite"
[19] "Subcontractat" "NumarContract" "DataContract"
[22] "TitluContract" "Valoare" "Moneda"
[25] "ValoareRON" "ValoareEUR" "CPVCodeID"
[28] "CPVCode" "NumarAnuntParticipare" "DataAnuntParticipare"
[31] "ValoareEstimataParticipare" "MonedaValoareEstimataParticipare" "FonduriComunitare"
[34] "TipFinantare" "TipLegislatieID" "FondEuropean"
[37] "ContractPeriodic" "DepoziteGarantii" "ModalitatiFinantare"
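The caret behaviour described in this answer is not specific to R. As a small illustration in Python (the exact way an unescaped pattern misbehaves differs between languages, so this only shows the general idea), an unescaped ^ is an anchor while an escaped one matches the literal delimiter:

```python
import re

# First three column names from the header above.
header = "Castigator^CastigatorCUI^CastigatorTara"

# Unescaped, "^" is an anchor: it matches the empty string at the start only.
anchor_matches = re.findall("^", header)

# Escaped, it matches the literal caret characters.
literal_matches = re.findall(r"\^", header)

# Splitting on the escaped pattern, or with plain str.split, gives the fields.
fields_regex = re.split(r"\^", header)
fields_plain = header.split("^")

print(len(anchor_matches), len(literal_matches), fields_regex)
```

Plain str.split plays the role of fixed = TRUE here: it treats the separator literally, so no escaping is needed.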
We can determine the last row of each record by checking whether it ends in a numeric field. Then using cumsum we can label the rows from the same record using 1, 2, 3, ... . Finally paste them together.
# test data
Lines <- "Name^FiscCode^Country^Adress^SomeData^
SomeCompany^235356^Romania^Adress1
Adress2^ 565863
SomeCompany^235356^Romania^Adress1^ 565863"
# for real problem use readLines("myfile")[-1]
L <- readLines(textConnection(Lines))[-1]
g <- rev(cumsum(rev(grepl("\\^ *\\d+$", L)))) ##
g <- max(g) - g + 1
L2 <- tapply(L, g, paste, collapse = " ")
read.table(text = L2, sep = "^")
The above works for the data shown in the question but if there are differences in the actual data to what you showed then some modifications may be needed depending on what those differences are.
Note: If there are always four ^ characters in each record try replacing the line marked ## with this:
cnt <- count.fields(textConnection(L), sep = "^") - 1
g <- rev(cumsum(rev(cumsum(cnt) %% 4 == 0)))
Update: The question has changed to provide new sample data. Note that the answer posted works with it, but of course you need to replace 4 with 38, since the new data has 38 delimiters per record whereas the old data had 4. Also, the old data had a header and the new data does not, so we have removed those occurrences of -1 used to drop the header. Here is a self-contained example that can be copied and pasted into R.
Lines <- "Castigator^CastigatorCUI^CastigatorTara^CastigatorLocalitate^CastigatorAdresa^Tip^TipContract^TipProcedura^AutoritateContractanta^AutoritateContractantaCUI^TipAC^TipActivitateAC^NumarAnuntAtribuire^DataAnuntAtribuire^TipIncheiereContract^TipCriteriiAtribuire^CuLicitatieElectronica^NumarOfertePrimite^Subcontractat^NumarContract^DataContract^TitluContract^Valoare^Moneda^ValoareRON^ValoareEUR^CPVCodeID^CPVCode^NumarAnuntParticipare^DataAnuntParticipare^ValoareEstimataParticipare^MonedaValoareEstimataParticipare^FonduriComunitare^TipFinantare^TipLegislatieID^FondEuropean^ContractPeriodic^DepoziteGarantii^ModalitatiFinantare
S.C. RCTHIA CO S.R.L.^65265644^Romania^Bucharest^DN1
Nr. 1, ^Anunt de atribuire la anunt de participare^Furnizare^Licitatie deschisa^COMPANIA NATIONALA DE TRANSPORT AL ENERGIEI ^R656556^^Electricitate^96594^2007-12-14^Un contract de achizitii publice^Pretul cel mai scazut^^1^^61^2007-11-08 00:00:00.000^Televizoare^304503.95^RON^304503.950000000001^89650.5^45937^323124100-1^344578^2007-10-02^49700.00^RON^^^^^^Nu este cazul;^Surse proprii;
ASOC : SC MNG SRLsi SC AquaiM SA ^56565575;656224^Romania^Ploiesti^Str. Independentei nr.15;
Str. Carol nr. 45^Anunt de atribuire la anunt de participare^Lucrari^Negociere fara anunt de participare^MUNICIPIUL RAMNICU VALCEA^6562655^Administratie publica locala (municipii, orase, comune), institutie publica in subordonarea/coordonarea administratiei publice locale^Servicii generale ale administratiilor publice^56566^2007-10-10^Un contract de achizitii publice^Pretul cel mai scazut^^1^^65656^2007-09-12^Proiectare si executie lucrari^5665560.00^RON^659966.0^5455222^7140^65689966-2^^^^^^^^^^^"
L <- readLines(textConnection(Lines))
cnt <- count.fields(textConnection(L), sep = "^") - 1 # 38 4 34 4 34
g <- rev(cumsum(rev(cumsum(cnt) %% 38 == 0)))
g <- max(g) - g + 1 # 1 2 2 3 3
L2 <- tapply(L, g, paste, collapse = " ")
DF <- read.table(text = L2, sep = "^")
dim(DF)
## [1] 3 39
The sample data does not contain comment characters (#) or single or double quotes, but if it did contain these as part of the data, then adding comment.char = "", quote = "" to the count.fields and read.table calls would be needed.
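The delimiter-counting idea can also be sketched outside R. The Python version below is a simplified variant, not a translation of the code above: it accumulates physical lines until the buffer holds the expected number of ^ delimiters. The function name join_broken_lines and the small sample are hypothetical, and it assumes the break never falls inside the last field of a record.

```python
def join_broken_lines(lines, n_delims, sep="^"):
    """Merge physical lines until each record holds n_delims separators."""
    records, buffer = [], ""
    for line in lines:
        line = line.rstrip("\r\n")
        # Join continuation lines with a space, as in the desired output.
        buffer = f"{buffer} {line}" if buffer else line
        if buffer.count(sep) >= n_delims:
            records.append(buffer.split(sep))
            buffer = ""
    if buffer:
        raise ValueError("incomplete trailing record: " + buffer[:40])
    return records

# Hypothetical sample with 4 delimiters per record; the second record is broken.
sample = [
    "Name^Code^Country^Address^Value",
    "SomeCompany^235356^Romania^Address1",
    "Address2^565863",
    "OtherCompany^111111^Romania^Address3^42",
]
rows = join_broken_lines(sample, n_delims=4)
for row in rows:
    print(row)
```

For the real files the call would use n_delims=38, matching the 39 fields in the header.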

API request with R

I am trying to geocode French addresses. I'd like to use the following website: http://adresse.data.gouv.fr/
There is an example on this website of how the API works, but I think it is a Linux command, and I'd like to translate it into R code. The aim is to submit a CSV file of addresses and get the geographic coordinates back.
Linux command (example given on the website):
http --timeout 600 -f POST http://api-adresse.data.gouv.fr/search/csv/ data@path/to/file.csv
I tried to "translate" this in R with the following code
library(httr)
library(RCurl)
queryResults=POST("http://api-adresse.data.gouv.fr/search/csv/",body=list(data=fileUpload("file.csv")))
result_geocodage=content(queryResults)
Unfortunately, I get a bad request error.
Does anybody know what I'm missing in the translation to R?
Thanks!
Here's an example. First, some example data plus the request:
library(httr)
df <- data.frame(c("13 Boulevard Chanzy", "Gloucester St"),
c("93100 Montreuil", "Jersey"))
write.csv2(df, tf <- tempfile(fileext = ".csv"))
res <- POST("http://api-adresse.data.gouv.fr/search/csv/",
timeout(600),
body = list(data = upload_file(tf)))
Then, the result:
content(res, sep = ";", row.names = 1)
# c..13.Boulevard.Chanzy....Gloucester.St.. c..93100.Montreuil....Jersey.. latitude longitude
# 1 13 Boulevard Chanzy 93100 Montreuil 48.85825 2.434462
# 2 Gloucester St Jersey 49.46712 1.145554
# result_label result_score result_type result_id result_housenumber
# 1 13 Boulevard Chanzy 93100 Montreuil 0.88 housenumber ADRNIVX_0000000268334929 13
# 2 2 Résidence le Jersey 76160 Saint-Martin-du-Vivier 0.24 housenumber ADRNIVX_0000000311480901 2
# result_name result_street result_postcode result_city result_context result_citycode
# 1 Boulevard Chanzy NA 93100 Montreuil 93, Seine-Saint-Denis, Île-de-France 93048
# 2 Résidence le Jersey NA 76160 Saint-Martin-du-Vivier 76, Seine-Maritime, Haute-Normandie 76617
Or, just the coordinates:
subset(content(res, sep = ";", row.names = 1, check.names = FALSE), select = c("latitude", "longitude"))
# latitude longitude
# 1 48.85825 2.434462
# 2 49.46712 1.145554
