R: convert email addresses into unique integers
I'm an R beginner with what seems to be a pretty simple problem:
I have a number of email logs that I have read into R in the format:
>log1
Date Time From To
1 2000-01-01 00:00:00 bob#mail.com test1#mail.com
2 2000-01-02 01:00:00 carolyn #mail.com test2#mail.com
3 2000-01-03 02:00:00 chris#mail.com test3#mail.com
4 2000-01-04 03:00:00 chris #mail.com test4#mail.com
5 2000-01-05 04:00:00 alan#mail.com test5#mail.com
6 2000-01-06 05:00:00 alan.#mail.com test6#mail.com
I need to change log1$From and log1$To to a global unique numeric identifier, such that when I read in other logs later any given email address will receive the same identifier as previous logs.
I have tried:
id <- as.numeric(as.character(log1[,3]))
id <- as.numeric(levels(log1[,3]))
id <- strtoi(charToRaw(log1[,4]), base = 16)
Would some kind soul please help me out – Thanks!
Apologies, I should probably have included this:
Date=c( "01/01/2000" ,"02/01/2000" ,"03/01/2000", "04/01/2000" ,"05/01/2000" ,"06/01/2000","07/01/2000","08/01/2000",
"09/01/2000","10/01/2000","11/01/2000", "12/01/2000" ,"13/01/2000", "14/01/2000", "15/01/2000","16/01/2000"
,"17/01/2000","18/01/2000","19/01/2000","20/01/2000","01/01/2000","02/01/2000")
Time=c("00:00:00","01:00:00","02:00:00", "03:00:00" ,"04:00:00" ,"05:00:00", "06:00:00" ,"07:00:00", "08:00:00", "09:00:00" ,"10:00:00",
"11:00:00", "12:00:00","13:00:00", "14:00:00","15:00:00","16:00:00","17:00:00","18:00:00","19:00:00","00:00:00" ,"00:00:00")
From=c("bob.shults#mail.com","carolyn.green#mail.com","chris.long#mail.com","christi.nicolay#mail.com","alan.aronowitz#mail.com","alan.comnes#mail.com",
"dab#sprintmail.com","ana.correa#mail.com","andrew.fastow#mail.com","elena.kapralova#mail.com","bob.shults#mail.com","carolyn.green#mail.com",
"chris.long#mail.com","christi.nicolay#mail.com","alan.aronowitz#mail.com","alan.comnes#mail.com","dab#sprintmail.com","ana.correa#mail.com",
"andrew.fastow#mail.com","elena.kapralova#mail.com","bob.shults#mail.com","bob.shults#mail.com")
To=c("ana.correa#mail.com","test2#mail.com","test3#mail.com","test4#mail.com","test5#mail.com","test6#mail.com","test7#mail.com",
"test8#mail.com","test9#mail.com","test10#mail.com","test11#mail.com","test12#mail.com","test13#mail.com","test14#mail.com",
"test15#mail.com","test16#mail.com","test17#mail.com","test18#mail.com","test19#mail.com","test20#mail.com","ana.correa#mail.com","ana.correa#mail.com")
log<-data.frame(Date=Date,Time=Time,From=From,To=To)
Attempt at using MD5 to generate globally unique identifiers. Note how the identifier for ana.correa#mail.com is consistent within ID_to, but does not match the identifier the same address gets in ID_from.
library(digest)  # needed for hmac()
ID_to <- data.frame()
ID_from <- data.frame()
for (i in 1:nrow(log)) {
  to <- as.numeric(paste('0x', substr(rep(hmac('secret', log[i, 4], algo = 'md5'), 2), c(1, 9, 17, 25), c(8, 16, 24, 32)), sep = ""))
  ID_to <- rbind(ID_to, to)
  from <- as.numeric(paste('0x', substr(rep(hmac('secret', log[i, 3], algo = 'md5'), 2), c(1, 9, 17, 25), c(8, 16, 24, 32)), sep = ""))
  ID_from <- rbind(ID_from, from)
}
ID_to[,3]<-paste(ID_to[,1],ID_to[,2], sep="")
ID_from[,3]<-paste(ID_from[,1],ID_from[,2], sep="")
edgelist<-data.frame(ID_from[,3],log[,3],ID_to[,3],log[,4],log[,1],log[,2])
print(edgelist)
ID_from...3. log...3. ID_to...3. log...4. log...1. log...2.
27488842661591306920 bob.shults#mail.com 18727221862165338513 ana.correa#mail.com 01/01/2000 00:00:00
38124472891255273775 carolyn.green#mail.com 1251903296725454474 test2#mail.com 02/01/2000 01:00:00
29070047663451376630 chris.long#mail.com 17074276751156451031 test3#mail.com 03/01/2000 02:00:00
8261398433828474582 christi.nicolay#mail.com 1563683670909194033 test4#mail.com 04/01/2000 03:00:00
18727221862165338513 alan.aronowitz#mail.com 26735368323826533112 test5#mail.com 05/01/2000 04:00:00
5680838251168988404 alan.comnes#mail.com 2923605896229594830 test6#mail.com 06/01/2000 05:00:00
2351312285811012730 dab#sprintmail.com 17171333544033270402 test7#mail.com 07/01/2000 06:00:00
328278708432069254 ana.correa#mail.com 33840664403556851587 test8#mail.com 08/01/2000 07:00:00
1127901879852039037 andrew.fastow#mail.com 1973548136161209824 test9#mail.com 09/01/2000 08:00:00
7349515121496417787 elena.kapralova#mail.com 5680838251168988404 test10#mail.com 10/01/2000 09:00:00
27488842661591306920 bob.shults#mail.com 328278708432069254 test11#mail.com 11/01/2000 10:00:00
38124472891255273775 carolyn.green#mail.com 1127901879852039037 test12#mail.com 12/01/2000 11:00:00
29070047663451376630 chris.long#mail.com 27488842661591306920 test13#mail.com 13/01/2000 12:00:00
8261398433828474582 christi.nicolay#mail.com 38124472891255273775 test14#mail.com 14/01/2000 13:00:00
18727221862165338513 alan.aronowitz#mail.com 29070047663451376630 test15#mail.com 15/01/2000 14:00:00
5680838251168988404 alan.comnes#mail.com 8261398433828474582 test16#mail.com 16/01/2000 15:00:00
2351312285811012730 dab#sprintmail.com 2351312285811012730 test17#mail.com 17/01/2000 16:00:00
328278708432069254 ana.correa#mail.com 7349515121496417787 test18#mail.com 18/01/2000 17:00:00
1127901879852039037 andrew.fastow#mail.com 41762759923562968495 test19#mail.com 19/01/2000 18:00:00
7349515121496417787 elena.kapralova#mail.com 24894056753582090007 test20#mail.com 20/01/2000 19:00:00
27488842661591306920 bob.shults#mail.com 18727221862165338513 ana.correa#mail.com 01/01/2000 00:00:00
27488842661591306920 bob.shults#mail.com 18727221862165338513 ana.correa#mail.com 02/01/2000 00:00:00
Attempt at the levels/factor method, which gives an error:
log <- union(levels(log[,3]), levels(log[,4]))
>Error in emails[, 3] : incorrect number of dimensions
You can use MD5 to generate globally unique identifiers since it has a very low probability of collisions, but since its output is 128-bit you need a few numbers to represent it (four integers in 32-bit R, two integers in 64-bit R). This should be easy to deal with using short numeric vectors, though.
Here is how you can generate such a vector of four integers for an email address (or any other string for that matter):
library(digest)
email <- 'test1#gmail'
as.numeric(paste('0x', substr(rep(hmac('secret56f8a7', email, algo='md5'), 4), c(1, 9, 17, 25), c(8, 16, 24, 32)), sep=''))
You could use algo='crc32' and obtain just one integer, but this isn't recommended since collisions are far more likely with CRC.
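As a sketch, the snippet above can be wrapped into a reusable helper (the function name email_to_id and the default key are illustrative choices of mine, not from the original answer):

```r
library(digest)  # provides hmac()

# Map any string to four integers derived from its HMAC-MD5 digest.
# The same input always yields the same four numbers, across sessions and machines.
email_to_id <- function(email, key = "secret56f8a7") {
  h <- hmac(key, email, algo = "md5")  # 32 hex characters
  as.numeric(paste0("0x", substring(h, c(1, 9, 17, 25), c(8, 16, 24, 32))))
}

email_to_id("test1#gmail")
```

Because the ID depends only on the string and the key, logs read in at different times will agree on every address without any shared lookup table.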
You need to create a unique ID for every email in your logs. One way would be to calculate the CRC checksum of every email and use that as an identifier, but it will be a very long number. Alternatively, you could implement a hashmap in R and make the email the key of the hashmap.
I think this will do what you want, and it's efficient, and you can do it using only base packages...
Procedure:
1. Convert both columns to factors.
2. Union the factor levels, in exactly the same way, so that each email has a unique ID in the factor levels.
3. Change the entries in each column to the number corresponding to their factor level. As a result, we can identify the times when "test1#gmail.com" sent and received emails by simply looking up "1" in both columns.
log1$From <- as.factor(log1$From)
log1$To <- as.factor(log1$To)
emails <- union(levels(log1$From), levels(log1$To))
# Re-level both columns against the same master list. Using factor(..., levels = emails)
# keeps each value attached to its own label; a plain `levels<-` assignment would rename
# levels positionally and scramble the To column.
log1$From <- factor(log1$From, levels = emails)
log1$To <- factor(log1$To, levels = emails)
log1$From <- as.numeric(log1$From)
log1$To <- as.numeric(log1$To)
It will probably be a good idea to keep a record of the original email addresses, as I have done here. Then if you were interested in, say, which emails test1#gmail.com sent:
log1[log1$From == which(emails == "test1#gmail.com"), ]
should do the trick! You can write a procedure to make that look much cleaner as well...
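To make the IDs global across several logs read in at different times, the same idea can be carried by a master lookup vector that you save and extend. A base-R sketch with toy data (the log1/log2 values here are illustrative, not from the question):

```r
# Toy logs; '#' stands in for '@' as in the question
log1 <- data.frame(From = c("bob#mail.com", "ana#mail.com"),
                   To   = c("ana#mail.com", "bob#mail.com"),
                   stringsAsFactors = FALSE)
log2 <- data.frame(From = "chris#mail.com",
                   To   = "ana#mail.com",
                   stringsAsFactors = FALSE)

emails <- union(log1$From, log1$To)   # master lookup; persist it with saveRDS()
log1$FromID <- match(log1$From, emails)
log1$ToID   <- match(log1$To, emails)

# Later, extend the SAME lookup before numbering a new log:
emails <- union(emails, union(log2$From, log2$To))
log2$FromID <- match(log2$From, emails)
log2$ToID   <- match(log2$To, emails)
```

Because union() only appends addresses it has not seen, existing IDs never change when new logs arrive.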
Related
convert date and time from factor to numerics in R
I have a dataframe which contains date and time columns. Let's name this dataframe date_time. Since the data type is factor, I would like to convert the whole column of date_time to numerics without changing anything, e.g. 2020-01-20 14:02:50 to 20200120140250. I have about 1000 rows of data. Does anyone know how to produce the output? I have tried as.numeric and gsub but they don't work. I think using POSIXct might work but I do not understand the reasoning behind it. Example of my data: 2020-07-08 21:40:26, 2020-07-08 16:48:57, 2020-07-01 15:54:10, 2020-07-13 20:27:06, 2020-07-27 16:08:12, and the list goes on.
You can try:
gsub("[[:punct:] ]", "", as.character(as.POSIXct("2020-01-20 14:02:50")))
The as.character keeps the visual output instead of working with the underlying numbers.
UPDATE:
date_time <- data.frame(time = as.POSIXct(c("2020-07-08 21:40:26", "2020-07-08 16:48:57",
  "2020-07-01 15:54:10", "2020-07-13 20:27:06", "2020-07-27 16:08:12", "2020-01-20 14:02:50")))
date_time$num_time <- gsub("[[:punct:] ]", "", as.character(date_time$time))
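As a further sketch (not one of the original answers), the punctuation stripping can be avoided entirely by re-formatting the POSIXct value directly:

```r
x <- as.POSIXct("2020-01-20 14:02:50")
format(x, "%Y%m%d%H%M%S")              # gives the digits-only string
as.numeric(format(x, "%Y%m%d%H%M%S"))  # the same value as a number
```

Values of this size (14 digits) are still below 2^53, so they are represented exactly as doubles.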
Solution with lubridate:
library(lubridate)
dt1 <- as.factor(c("2020-07-08 21:40:26", "2020-07-08 16:48:57", "2020-07-01 15:54:10", "2020-07-13 20:27:06", "2020-07-27 16:08:1"))
dt <- data.frame(date = ymd_hms(dt1))
dt
class(dt$date)
Result:
date
1 2020-07-08 21:40:26
2 2020-07-08 16:48:57
3 2020-07-01 15:54:10
4 2020-07-13 20:27:06
5 2020-07-27 16:08:01
> class(dt$date)
[1] "POSIXct" "POSIXt"
How to extract characters from strings of varying length based on their position?
I'm going bald trying to figure this out. I am trying to extract characters from strings of varying length based on their position. For example, let's say I have a data frame with 8 rows and 1 column titled "Time". The column has the following values in each row:
TIME
5
20
100
105
150
1050
2005
2010
I'm trying to figure out how to extract the characters in these strings into the following format based on position. The last two characters are placed 00:here:00, while all other characters are placed here:00:00. The result should be:
5 -> 00:05:00 (if there's only 1 character)
20 -> 00:20:00 (if there are 2 characters)
100 -> 1:00:00 (if there are 3 characters)
105 -> 1:05:00 (if there are 3 characters)
150 -> 1:50:00 (if there are 3 characters)
1050 -> 10:50:00 (if there are 4 characters)
2005 -> 20:05:00 (if there are 4 characters)
2010 -> 20:10:00 (if there are 4 characters)
Essentially the pseudocode would be something along the lines of: from dataframe$column1, extract the last two characters and place them 00:here:00; all remaining characters go here:00:00.
format(as.POSIXct(sprintf("%04.0f", TIME), format = "%H%M"), format = "%H:%M:%S")
[1] "00:05:00" "00:20:00" "01:00:00" "01:05:00"
[5] "01:50:00" "10:50:00" "20:05:00" "20:10:00"
The sprintf pads the character values of the time variable with leading 0's. The as.POSIXct reads the time in as the current date with that time, and then the outer format.POSIXt call removes the date characters.
What about this approach? time <- readLines(n = 8) 5 20 100 105 150 1050 2005 2010 sub("(\\d{2})(\\d{2})", "\\1:\\2:00", sprintf("%04d", as.numeric(time))) # [1] "00:05:00" "00:20:00" "01:00:00" "01:05:00" "01:50:00" "10:50:00" "20:05:00" "20:10:00"
How about this: library(stringr) TIME <- c('5','20','100','105','150','1050','2005','2010') TIMEpadded <- paste0(str_pad(TIME, width=4, pad=0), '00') sub('(.+)(.{2})(.{2})', '\\1:\\2:\\3', TIMEpadded) ## [1] "00:05:00" "00:20:00" "01:00:00" "01:05:00" "01:50:00" "10:50:00" "20:05:00" "20:10:00"
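The same padding can also be done without stringr, using only base R's formatC (a sketch, not one of the original answers):

```r
TIME <- c("5", "20", "100", "105", "150", "1050", "2005", "2010")
# Zero-pad to 4 digits, append the "00" seconds, then insert colons every two characters
padded <- paste0(formatC(as.numeric(TIME), width = 4, flag = "0"), "00")
sub("(..)(..)(..)", "\\1:\\2:\\3", padded)
# [1] "00:05:00" "00:20:00" "01:00:00" "01:05:00" "01:50:00" "10:50:00" "20:05:00" "20:10:00"
```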
A Java version of the same logic:
public static String transform(String in) {
    while (in.length() < 4) {
        in = "0" + in;
    }
    in = in.substring(0, 2) + ":" + in.substring(2, 4) + ":00";
    return in;
}
public static void main(String[] args) {
    for (String s : new String[]{"5", "20", "100", "105", "150", "1050", "2005", "2010"}) {
        System.out.println(s + " -> " + transform(s));
    }
}
The method transform 1) pads the time value in the string in with leading zeros until it is a 4-digit string, and 2) inserts ":" every two characters and appends a trailing "00". The method main runs the example you provided, with the following output:
5 -> 00:05:00
20 -> 00:20:00
100 -> 01:00:00
105 -> 01:05:00
150 -> 01:50:00
1050 -> 10:50:00
2005 -> 20:05:00
2010 -> 20:10:00
sqlSave: how to write data with a date column containing hyphens to SQL Developer
I have a dataframe data, which contains columns of integers and columns containing date and time, as shown:
> head(data, 2)
PRESSURE AMBIENT_TEMP OUTLET_PRESSURE COMP_STATUS DATE TIME predict
1 14 65 21 0 2014-01-09 12:45:00 0.6025863
2 17 65 22 0 2014-01-10 06:00:00 0.6657910
Now I am going to write this back to the SQL database with
sqlSave(channel, data, tablename = "ANL_ASSET_CO", append = T)
where channel is the connection name. But this gives the error
[RODBC] Failed exec in Update 22018 1722 [Oracle][ODBC][Ora]ORA-01722: invalid number
But when I try excluding the date column, it writes back without any error:
> sqlSave(channel, data[,c(1:4,7)], tablename = "ANL_ASSET_CO", append = T)
> sqlSave(channel, data[,c(1:4,6:7)], tablename = "ANL_ASSET_CO", append = T)
Because of the date column the data is not being written to Oracle SQL Developer; it could be a problem with the hyphens. How can I write it? Any help appreciated!
> class(data$DATE)
[1] "POSIXct" "POSIXt"
So I had to change the data type to character:
> data$DATE <- as.character(data$DATE)
> sqlSave(channel, data, tablename = "ANL_ASSET_CO", append = T)
This one worked!!
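A minimal sketch of the type change involved (toy one-column data, not the original table):

```r
d <- data.frame(DATE = as.POSIXct("2014-01-09 12:45:00"))
class(d$DATE)            # "POSIXct" "POSIXt"  -- the ODBC driver chokes on this
d$DATE <- as.character(d$DATE)
class(d$DATE)            # "character"         -- now sent to the database as text
```

The database then parses the text itself, which sidesteps the driver's numeric conversion of POSIXct values.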
Read a CSV file in R, and select each element
Sorry if the title is confusing. I can import a CSV file into R, but when I select one element by providing the row and column index, I get more than one element. All I want is to use this imported CSV as a data.frame, from which I can select any column, row, or single cell. Can anyone give me some suggestions? Here is the data:
SKU On Off Duration(hr) Sales
C010100100 2/13/2012 4/19/2012 17:00 1601 238
C010930200 5/3/2012 7/29/2012 0:00 2088 3
C011361100 2/13/2012 5/25/2012 22:29 2460 110
C012000204 8/13/2012 11/12/2012 11:00 2195 245
C012000205 8/13/2012 11/12/2012 0:00 2184 331
CODE:
Dat = read.table("Dat.csv", header = 1, sep = ',')
Dat[1,][1] # This is close to what I need but is not exactly the same
SKU
1 C010100100
Dat[1,1] # Ideally, I want to have results only with C010100100
[1] C010100100
3861 Levels: B013591100 B024481100 B028710300 B038110800 B038140800 B038170900 B038260200 B038300700 B040580700 B040590200 B040600400 B040970200 ... YB11624Q1100
Thanks!
You can convert to character to get the value as a string, and no longer as a factor: as.character(Dat[1,1]) You have just one element, but the factor contains all levels. Alternatively, pass the option stringsAsFactors=FALSE to read.table when you read the file, to prevent creation of factors for character values: Dat = read.table("Dat.csv",header=1,sep=',', stringsAsFactors=FALSE )
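The difference is easy to see on a toy factor (illustrative SKU values, not the original file):

```r
f <- factor(c("C010100100", "C010930200", "C011361100"))
f[1]                 # prints the value plus ALL the levels
as.character(f[1])   # just "C010100100"
as.numeric(f[1])     # 1 -- the internal level code, NOT the SKU digits
```

The last line is the classic factor trap: as.numeric on a factor returns level codes, which is why as.character (or stringsAsFactors = FALSE at read time) is the right tool here.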
Remove duplicate rows from xts object
I am having trouble deleting duplicated rows in an xts object. I have an R script that downloads tick financial data for a currency and converts it to an xts object in OHLC format. The script also pulls new data every 15 minutes. The new data is downloaded from the first trade of today to the last recorded trade of today. The previously downloaded data is stored in .Rdata format and loaded; the new data is then appended to it, and the combined result overwrites the old .Rdata file. Here is an example of what my data looks like: .Open .High .Low .Close .Volume .Adjusted 2012-01-07 00:00:11 6.69683 7.01556 6.38000 6.81000 48387.58 6.81000 2012-01-08 00:00:09 6.78660 7.20000 6.73357 7.11358 57193.53 7.11358 2012-01-09 00:00:57 7.08362 7.19100 5.81000 6.32570 148406.85 6.32570 2012-01-10 00:01:01 6.32687 6.89000 6.00100 6.36000 110210.25 6.36000 2012-01-11 00:00:07 6.44904 7.13800 6.41266 6.90000 99442.07 6.90000 2012-01-12 00:01:02 6.90000 6.99700 6.33700 6.79999 140116.52 6.79999 2012-01-13 00:02:01 6.78211 6.80400 6.40000 6.41000 60228.77 6.41000 2012-01-14 00:00:23 6.42000 6.50000 6.23150 6.31894 25392.98 6.31894 Now if I run the script again I will add the new data to the xts: .Open .High .Low .Close .Volume .Adjusted 2012-01-07 00:00:11 6.69683 7.01556 6.38000 6.81000 48387.58 6.81000 2012-01-08 00:00:09 6.78660 7.20000 6.73357 7.11358 57193.53 7.11358 2012-01-09 00:00:57 7.08362 7.19100 5.81000 6.32570 148406.85 6.32570 2012-01-10 00:01:01 6.32687 6.89000 6.00100 6.36000 110210.25 6.36000 2012-01-11 00:00:07 6.44904 7.13800 6.41266 6.90000 99442.07 6.90000 2012-01-12 00:01:02 6.90000 6.99700 6.33700 6.79999 140116.52 6.79999 2012-01-13 00:02:01 6.78211 6.80400 6.40000 6.41000 60228.77 6.41000 2012-01-14 00:00:23 6.42000 6.50000 6.23150 6.31894 25392.98 6.31894 2012-01-14 00:00:23 6.42000 6.75000 6.22010 6.57157 75952.01 6.57157 As you can see, the last line has the same timestamp as the second-to-last line.
I want to keep the last row for the last date and delete the second-to-last row. When I try the following code to delete duplicated rows, it does not work:
xx <- mt.xts[!duplicated(mt.xts$Index),]
xx
.Open .High .Low .Close .Volume .Adjusted
I do not get any result at all, just an empty object. How can I delete duplicate data entries in an xts object, using the index as the indicator of duplication?
Shouldn't it be index(mt.xts) rather than mt.xts$Index? The following seems to work.
# Sample data
library(xts)
x <- xts( 1:10, rep( seq.Date( Sys.Date(), by="day", length=5 ), each=2 ) )
# Remove rows with a duplicated timestamp
y <- x[ ! duplicated( index(x) ), ]
# Remove rows with a duplicated timestamp, but keep the latest one
z <- x[ ! duplicated( index(x), fromLast = TRUE ), ]
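The fromLast behaviour can be checked on a plain vector first, without xts (a base-R sketch):

```r
idx <- as.Date(c("2012-01-13", "2012-01-14", "2012-01-14"))
!duplicated(idx)                   # TRUE TRUE FALSE -> keeps the FIRST 01-14 row
!duplicated(idx, fromLast = TRUE)  # TRUE FALSE TRUE -> keeps the LAST 01-14 row
idx[!duplicated(idx, fromLast = TRUE)]
```

fromLast = TRUE is what the question needs: it marks the earlier of two identical timestamps as the duplicate, so the most recent observation survives.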
In my case, x <- x[!duplicated(index(x)), ] did not work as intended, because the system somehow makes the date-times unique in each row. x <- x[!duplicated(coredata(x)), ] may work if the previous solution did not help.