I wrote a function to anonymize names in a data frame given some key, and it slows to a crawl once it has to anonymize very many names. I don't understand why.
The data frame in question is a set of 4733 tweets collected through the Twitter API where each row is a tweet with 32 columns of data. The names are to be anonymized regardless of which column they show up in, so I'd like to not limit the function to looking at only a couple of those 32 columns.
The key is a data frame containing 211121 pairs of real and fake names, both real and fake being unique in the data frame. The function slows down immensely after about 100k names are anonymized.
The function looks like the following:
pseudonymize <- function(df, key) {
  for (name in key$realNames) {
    df <- as.data.frame(apply(df, 2, function(column) gsub(name, key[key$realNames == name, 2], column)))
  }
  df  # return the modified data frame; a for loop on its own returns NULL
}
Is there some obvious thing here that would cause the slowing? I'm not at all experienced with optimizing code for speed.
EDIT1:
Here are a few lines from the data frame to be anonymized.
"https://twitter.com/__jgil/statuses/825559753447313408","__jgil",0.000576911235261567,756,4,13,17,7,16,23,10,0.28166915052161,0.390123456790124,0.00271311644806025,0.474529795261862,0.00641025649383664,"#jadahung20 GIRL I am tooooooo salty tonight lolll","lolll","adjoint","anglais","indefini","anglais","anglais","non","iPhone, Twitter",4057,214,241,"Canada","Nouvelle-Ecosse","Middleton","indefini","Shari"
"https://twitter.com/__paigewhite/statuses/827988259573788673","__paigewhite",0,1917,0,8,8,0,9,9,16,0.143476044852192,0.162056634159209,0.000172947386274259,0,0,"#abbytutty_ i miss emily lololol _Ù÷â_Ù÷É","lololol","adjoint","anglais","indefini","anglais","anglais","non","iPhone, Twitter",8366,392,661,"Canada","Nouvelle-Ecosse","indefini","indefini","Shari"
"https://twitter.com/_brookehynes/statuses/821022926287884288","_brookehynes",0,1917,1,6,7,1,7,8,1,1,1,0.000196850793912616,0.00393656926735126,0.200000002980232,"#tdesj3 #belle lol yea doubt it.","lol","adjoint","indefini","anglais","anglais","anglais","non","iPhone, Twitter",1184,87,70,"Canada","Nouvelle-Ecosse","Halifax","indefini","Shari"
Here are a few lines from the key.
"","realNames","fakeNames"
"1","________","Tajid_Pinkley"
"2","____________aho","Monica_Yujiri"
"3","___________ass","Alexander_Garay-Grajeda"
EDIT2:
I've simplified the DF down to only the two columns that would need anonymizing, and this made things much faster, but it still putters out after doing about 155k names.
As requested in the comments, here's the dput() output for the first three lines of the DF that's to be anonymized.
structure(list(
utilisateur = c("___Yeliab", "__courtlezz", "__courtlezz"),
texte = c("#EmilyIsPro ik lol", "#NikkiErica21 there was a sighting in sunset ridge too. Keep Winnie and bob safe lol", "#NikkiErica21 lol yes _Ã\231։")
),
row.names = c(NA, 3L),
class = "data.frame")
And here's the dput() for the first three lines of the key.
structure(list(
realNames = c("________", "____________aho", "___________ass"),
fakeNames = c("Abhinav_Chang", "Caleb_Dunn-Sparks", "Taryn_Hunzicker")
),
row.names = c(NA, 3L),
class = "data.frame")
Acting on the data as a vector rather than a data.frame will be much more efficient. I ran into some encoding issues, so I converted the text to UTF-8 using iconv. If the names contain non-ASCII characters, this would need some handling.
key1 <- data.frame(
realNames = c("________", "____________aho", "___________ass",
"___Yeliab", "__courtlezz", "NikkiErica21", "EmilyIsPro", "aho"),
fakeNames = c("Abhinav_Chang", "Caleb_Dunn-Sparks", "Taryn_Hunzicker",
"A_A", "B_B", "C_C", "D_D", "E_E"),
stringsAsFactors = FALSE
)
pseudonymize1 <- function(df, key) {
  mat <- as.matrix(df)
  dims <- attr(mat, which = "dim")
  cnam <- colnames(df)
  vec <- iconv(unclass(mat), from = "latin1", to = "UTF-8")
  for (name in split(key, f = seq_len(nrow(key)))) {
    vec <- gsub(
      vec,
      pattern = name$realNames,
      replacement = name$fakeNames,
      fixed = TRUE)
  }
  mat <- vec
  attr(mat, which = "dim") <- dims
  df <- as.data.frame(mat, stringsAsFactors = FALSE)
  colnames(df) <- cnam
  df
}
pseudonymize1(df1, key1)
# utilisateur texte
# 1 A_A #D_D ik lol
# 2 B_B #C_C there was a sighting in sunset ridge too. Keep Winnie and bob safe lol
# 3 B_B #C_C lol yes _Ã\u0083\u0099Ã\u0083·Ã\u0083¢
library(microbenchmark)
microbenchmark(
pseudonymize(df1, key1),
pseudonymize1(df1, key1)
)
# Unit: microseconds
# expr min lq mean median uq max neval cld
# pseudonymize(df1, key1) 1842.554 1885.6750 2131.089 1994.755 2294.6850 3007.371 100 b
# pseudonymize1(df1, key1) 287.683 306.1905 333.678 314.950 339.8705 497.301 100 a
A concern I have with 155k names is that substring matching (whether as a regular expression or with fixed = TRUE) will find names contained in other names. This could be one true name within another true name (e.g. Emily within EmilyIsPro), or a true name within a previously inserted fake name. You will want to test for this, and consider using a random hash instead of a name-like fake name.
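One way to sidestep both kinds of collision is to replace whole handles in a single pass instead of running one gsub() per key entry. The sketch below is only an illustration under assumptions not stated in the question: handles consist solely of letters, digits and underscores, the text is validly encoded, and there are no missing values. The function name pseudonymize_tokens and the small example data are hypothetical.

```r
# Sketch: replace whole handles in one pass (assumes handles match [A-Za-z0-9_]+).
pseudonymize_tokens <- function(x, key) {
  # Locate every maximal run of handle characters in each string.
  m <- gregexpr("[A-Za-z0-9_]+", x)
  tokens <- regmatches(x, m)
  regmatches(x, m) <- lapply(tokens, function(tok) {
    idx <- match(tok, key$realNames)  # exact, whole-token lookup
    hit <- !is.na(idx)
    tok[hit] <- key$fakeNames[idx[hit]]
    tok
  })
  x
}

# Hypothetical example data
key_demo <- data.frame(
  realNames = c("EmilyIsPro", "Emily", "aho"),
  fakeNames = c("D_D", "F_F", "E_E"),
  stringsAsFactors = FALSE
)
texte <- c("#EmilyIsPro ik lol", "Emily said aho", "ahoy there")
result <- pseudonymize_tokens(texte, key_demo)
```

Because each token is looked up once and inserted fake names are never rescanned, "Emily" cannot match inside "EmilyIsPro" and a fake name cannot be replaced a second time. To cover a whole data frame of character columns, something like df[] <- lapply(df, pseudonymize_tokens, key = key) should work.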
I have a data.frame like this
z <- structure(list(ID = c("R-HSA-977606", "R-HSA-977443", "R-HSA-166658",
"R-HSA-166663", "R-HSA-1236394", "R-HSA-390522", "R-HSA-3232118",
"R-HSA-1630316", "R-HSA-112315", "R-HSA-112314"), GeneRatio = c("6/189",
"6/189", "6/189", "4/189", "5/189", "4/189", "3/189", "7/189",
"11/189", "9/189")), row.names = c("R-HSA-977606", "R-HSA-977443",
"R-HSA-166658", "R-HSA-166663", "R-HSA-1236394", "R-HSA-390522",
"R-HSA-3232118", "R-HSA-1630316", "R-HSA-112315", "R-HSA-112314"
), class = "data.frame")
Is it possible to add a 3rd column with the ratio from the 2nd column calculated? i.e. 6/189=0.0317. So in the third column I should have 0.0317.
As it is a string expression, we can use eval/parse
z$newColumn <- sapply(z$GeneRatio, function(x) eval(parse(text = x)))
-output
> z
ID GeneRatio newColumn
R-HSA-977606 R-HSA-977606 6/189 0.03174603
R-HSA-977443 R-HSA-977443 6/189 0.03174603
R-HSA-166658 R-HSA-166658 6/189 0.03174603
R-HSA-166663 R-HSA-166663 4/189 0.02116402
R-HSA-1236394 R-HSA-1236394 5/189 0.02645503
R-HSA-390522 R-HSA-390522 4/189 0.02116402
R-HSA-3232118 R-HSA-3232118 3/189 0.01587302
R-HSA-1630316 R-HSA-1630316 7/189 0.03703704
R-HSA-112315 R-HSA-112315 11/189 0.05820106
R-HSA-112314 R-HSA-112314 9/189 0.04761905
Or a faster option would be to split by / (or use read.table to create two columns) and then divide, assuming the expression includes only division:
z$newColumn <- Reduce(`/`, read.table(text = z$GeneRatio,
header = FALSE, sep = "/"))
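The "split by /" variant mentioned above could look roughly like the following. This is a minimal sketch that assumes every entry has exactly the form numerator/denominator; the three-row z here is a shortened stand-in for the data in the question.

```r
# Shortened stand-in for the question's data
z <- data.frame(GeneRatio = c("6/189", "4/189", "11/189"),
                stringsAsFactors = FALSE)

# Split each string on the literal "/" and divide the two parts
parts <- strsplit(z$GeneRatio, "/", fixed = TRUE)
z$newColumn <- vapply(
  parts,
  function(p) as.numeric(p[1]) / as.numeric(p[2]),
  numeric(1)
)
```

Unlike eval(parse()), this never executes the strings as code, which is safer if the column could contain anything other than simple ratios.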
This code could be refined but it will work with the eval function
# 1- Creating empty column
z$GeneRatioNum <- NA
# 2- Filling it with eval function
for (i in seq_len(nrow(z))) {
  z$GeneRatioNum[i] <- eval(parse(text = z$GeneRatio[i]))
}
I have downloaded some GDP data in .xls-format from the OECD website. However, to make this data workable in R, I need to reformat the data to a .csv file. More specifically, I need the year, day and month in the first column, and after the comma I need the GDP values (for example: 1990-01-01, 234590).
The column with GDP values can be easily copied and transposed, but how does one quickly add dates? Is there a fast way to do this, without having to add in the dates manually?
Thanks for the help!
Best,
Sean
PS. Link to (one of) the specific OECD files: https://ufile.io/8ogav or https://stats.oecd.org/index.aspx?queryid=350#
PPS. I have now changed the file to this:
Which I would like to transform into the same style as example 1.
Code that I use for reading in data:
gdp.start <- c(1970,1) # type "double"
gdp.end <- c(2018,1)
gdp.raw <- "rawData/germany_gdp.csv"
gdp.table <- read.table(gdp.raw, skip = 1, header = F, sep = ',', stringsAsFactors = F)
gdp.ger <- ts(gdp.table[,2], start = gdp.start, frequency = 4) # time-series representation
PPPS.
dput(head(gdp.table))
structure(list(V1 = c("Q2-1970;1.438.810 ", "Q3-1970;1.465.684 ",
"Q4-1970;1.478.108 ", "Q1-1971;1.449.712 ", "Q2-1971;1.480.136 ",
"Q3-1971;1.505.743 ")), row.names = c(NA, 6L), class = "data.frame")
Using your data:
z <- structure(list(V1 = c("Q2-1970;1.438.810 ", "Q3-1970;1.465.684 ",
"Q4-1970;1.478.108 ", "Q1-1971;1.449.712 ", "Q2-1971;1.480.136 ",
"Q3-1971;1.505.743 ")), row.names = c(NA, 6L), class = "data.frame")
dat <- read.csv2(text=paste(z$V1, collapse='\n'), stringsAsFactors=FALSE, header=FALSE)
dat
# V1 V2
# 1 Q2-1970 1.438.810
# 2 Q3-1970 1.465.684
# 3 Q4-1970 1.478.108
# 4 Q1-1971 1.449.712
# 5 Q2-1971 1.480.136
# 6 Q3-1971 1.505.743
and a simple function to replace quarters with the first date of each quarter
quarters <- function(s, format) {
  qs <- c("Q1","Q2","Q3","Q4")
  dts <- c("01-01", "04-01", "07-01", "10-01")
  for (i in seq_along(qs))
    s <- sub(qs[i], dts[i], s)
  if (! missing(format))
    s <- as.Date(s, format=format)
  s
}
We can change them into strings of dates, preserving the order:
str(quarters(dat$V1))
# chr [1:6] "04-01-1970" "07-01-1970" "10-01-1970" "01-01-1971" ...
or we can convert into Date objects by setting the format:
str( quarters(dat$V1, format='%m-%d-%Y') )
# Date[1:6], format: "1970-04-01" "1970-07-01" "1970-10-01" "1971-01-01" ...
so replacing the column with the actual Date object is simply dat$V1 <- quarters(dat$V1, format='%m-%d-%Y').
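To get all the way to the "1990-01-01, 234590" layout asked for, the value column also needs cleaning before the file is written. The sketch below assumes that the dots in values such as "1.438.810" are thousands separators (the question does not say so explicitly) and writes to a temporary file; the names qtr_month, out and csv_path are made up for the example.

```r
# Shortened stand-in for the parsed data shown above
dat <- data.frame(
  V1 = c("Q2-1970", "Q3-1970", "Q1-1971"),
  V2 = c("1.438.810 ", "1.465.684 ", "1.449.712 "),
  stringsAsFactors = FALSE
)

# First month of each quarter
qtr_month <- c(Q1 = "01", Q2 = "04", Q3 = "07", Q4 = "10")

out <- data.frame(
  # "Q2-1970" -> "1970-04-01"
  date = as.Date(paste(substr(dat$V1, 4, 7),
                       qtr_month[substr(dat$V1, 1, 2)],
                       "01", sep = "-")),
  # ASSUMPTION: dots are thousands separators, so strip them
  gdp = as.numeric(gsub(".", "", trimws(dat$V2), fixed = TRUE))
)

csv_path <- tempfile(fileext = ".csv")
write.csv(out, csv_path, row.names = FALSE)
```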
I have two data frames, data1 and data2, read from CSV files, of dimensions n1*n2 and m1*m2. I would like to create a new data frame consisting of differences: If (and only if)
data1[i,1] = data2[j,1] & data1[i,3] = data2[j,3]
then I want to consider
difference[i,z] <- abs(data1[i,x]-data2[j,y])
Is it possible to do this in a simple manner, for instance using for/if?
difference <- matrix(nrow = max(n1, m1), ncol = 3)
for (i in 1:n1) {
  for (j in 1:m1) {
    if (data1[i,1] == data2[j,1] && data1[i,3] == data2[j,3]) {
      difference[i,1] = data1[i,1]
      difference[i,2] = data1[i,3]
      difference[i,3] = data1[i,6]-data2[j,7]
    }
  }
}
This code is obviously far from being complete and I have several issues:
(1) I don't know if it is realizable using for loops/if conditional. If yes, being unfamiliar with R, I'm not sure if I need to put a 'print(something)' at the end of the loops.
(2) data1/2[i,1] is of type character. Hence I'm not sure if
data1[i,1] == data2[j,1] & data1[i,3] == data2[j,3]
is well-defined.
(3) The 'difference' matrix/frame should have as many rows as the number of i's and j's where
data1[i,1] = data2[j,1] & data1[i,3] = data2[j,3]
I do not know what this number is. Therefore I cannot really specify the size of 'difference'.
EDIT:
data1 = read.csv("path/to/data1.csv") ## Prices of 157 products each at
## 122 time points; (column1=Product, column3=date, column7=price)
data2 = read.csv("path/to/data2.csv") ## Prices of 118 products each at
## 122 time points; (column1=Product, column3=date, column6=price)
## the 122 time points are the same for both frames
## But: data1 contains some products data2 doesn't and vice versa
## I want to compare prices of the same products at the same time
So far, I've done it manually for product X1:
priceX1 = as.data.frame(data1[1:122,7])
priceX2 = as.data.frame(data2[5:126,6]) ## Product X2 starts at row 5
differenceX1 <- abs(priceX1 - priceX2)
The problem is I'd have to repeat this for all products contained in both data1 and data2.
RE-EDIT: dput(data1) returns
...), class = "factor"),
COMMENT = c(NA, ..., NA)), .Names = c("PRODUCT", "QUALIFIER_I",
"DATE", "QUALIFIER_II", "QUOTATION_DATE", "PROD_DATE", "PRICE",
"TYPE", "ID", "COMMENT"), row.names = c(NA, 14400L), class
= "data.frame")
"..." stands for me omitting a long list of products that couldn't fit here.
dput(data2) returns
..., NA, NA, NA)), .Names = c("PRODUCT", "QUALIFIER_II",
"DATE", "QUALIFIER_I", "Data2_source", "PRICE"), row.names = c(NA,
19161L), class = "data.frame")
"..." stand for me omitting a huge list of prices that couldn't fit in here.
You can find all pairs (i,j) which satisfy your condition by merging the two data.frames:
differences = merge(data1, data2, by=c('PRODUCT','DATE'))
This avoids for-loops entirely, and you can easily define the new column:
differences$Diff = abs(differences$PRICE.x - differences$PRICE.y)
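As a small illustration of what merge does here, consider the toy data below (the products, dates and prices are invented). Only rows whose PRODUCT and DATE appear in both frames survive, and the clashing PRICE columns are told apart by suffixes; the suffixes argument is optional and is set here only to make the origin of each column explicit.

```r
# Invented toy data with the same key columns as in the question
data1 <- data.frame(
  PRODUCT = c("A", "A", "B"),
  DATE    = c("2020-01-01", "2020-01-02", "2020-01-01"),
  PRICE   = c(10, 11, 20),
  stringsAsFactors = FALSE
)
data2 <- data.frame(
  PRODUCT = c("A", "A", "C"),
  DATE    = c("2020-01-01", "2020-01-02", "2020-01-01"),
  PRICE   = c(9.5, 12, 30),
  stringsAsFactors = FALSE
)

# Inner join: keeps only (PRODUCT, DATE) pairs present in both frames
differences <- merge(data1, data2, by = c("PRODUCT", "DATE"),
                     suffixes = c(".data1", ".data2"))
differences$Diff <- abs(differences$PRICE.data1 - differences$PRICE.data2)
```

Products B and C drop out because each exists in only one frame, so the result has exactly as many rows as there are matching pairs and its size never has to be specified in advance.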
I am attempting to implement the sports Simple Rating System in R and am having a lot of trouble. I can do it in Excel, which you can see in this video: http://www.screencast.com/users/Sports-Reference/folders/PFR/media/cea8d1e3-ed91-431f-a0df-1336c03b1268
To test this out I am using the same data (basically) as in the video, which you can access here: https://drive.google.com/file/d/0Bzr6jaapQdMRUFhpU0h4NDBNaTQ/view?usp=sharing
The first lines of the data frame could be built like this, but there are 206 records:
dat <- data.frame(
  Week = c(1,1,1),
  Team = c("Arizona Cardinals", "Arizona Cardinals", "Arizona Cardinals"),
  Location = c("HOME", "AWAY", "HOME"),
  Opponent = c("San Francisco 49ers", "Jacksonville Jaguars", "Indianapolis Colts"),
  Points.For = c(16, 31, 10),
  Points.Against = c(20, 17, 31),
  stringsAsFactors = FALSE
)
I set up the data with the following code:
library(dplyr)
## Load data
dat <- read.csv("data/footballsrstestdata.csv", stringsAsFactors = FALSE)
## Delete last row
dat <- dat[-207,]
## Create MOV and adjusted MOV variables
dat <- mutate(dat, mov = Points.For - Points.Against)
dat <- mutate(dat, adjMOV = mapply(function(x, y)
{if(y == "HOME") x - 1.3529 else x + 1.3529}, dat$mov, dat$Location))
## Create SRS df
srsTable <- data.frame(teams = unique(dat$Team), stringsAsFactors = FALSE)
## Create average MOV
srsTable <- mutate(srsTable,
avgAdjMOV = sapply(teams, function(x) {
mean(dat$adjMOV[dat$Team == x])
}, USE.NAMES = FALSE))
## Create placeholder variables
dat$oSRS <- NA
srsTable$SOS <- NA
srsTable$SRS <- NA
In Excel, you would have two sheets that reference each other, creating a circular reference that iterates until the values stabilize. Each line in the game data would hold the opponent's SRS, and on the team SRS sheet, the SRS would equal avgAdjMOV plus the average opponent SRS from the game data sheet. I am assuming that I need to create a function with a series of equations and use the optim function to solve them, but I can't figure out how to set this up so that it works over both the game data frame and the teams data frame simultaneously.
Does anyone have any ideas on how to proceed?
Thank you!
I suggest setting up everything as a system of equations:
Ratings = adjMOV+(1/nGames)*Sched*Ratings
where Ratings is a vector of length N with the rating of each team, adjMOV is a vector of length N with the average adjusted margin of victory of each team, nGames is the number of games each team played, and Sched is an NxN matrix counting how many times each pair of teams faced each other. The vectors and the rows and columns of the matrix must all use the same team order. Moving the Ratings terms to the left-hand side, the system of equations becomes:
(I-(1/nGames)*Sched)*Ratings = adjMOV
Because this system is linear in Ratings, any linear or least-squares solver can handle it. I used the lsei() function from the limSolve package, so your code should look like:
library(limSolve)
A = (diag(N)-(1/nGames)*Sched)
b = adjMOV
solution = lsei(A=A,B=b)
Ratings = solution$X
I know it has been more than a year, but I hope this helps.
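For completeness, here is a rough sketch of how Sched, nGames and adjMOV might be built from a long table of games and solved with base R only. The three-team round robin is invented for illustration. Since the matrix I - (1/nGames)*Sched is singular (adding a constant to every rating leaves the equations unchanged), the sketch appends the extra condition that the ratings sum to zero. That normalisation is an assumption of this example, not something stated in the answer above.

```r
# Invented round robin: one row per team per game, with that team's adjusted MOV
games <- data.frame(
  Team     = c("A", "B", "B", "C", "A", "C"),
  Opponent = c("B", "A", "C", "B", "C", "A"),
  adjMOV   = c(3, -3, 7, -7, 10, -10),
  stringsAsFactors = FALSE
)

teams <- sort(unique(games$Team))
N <- length(teams)
team_f <- factor(games$Team, levels = teams)
opp_f  <- factor(games$Opponent, levels = teams)

Sched  <- unclass(table(team_f, opp_f))   # N x N counts of meetings
nGames <- rowSums(Sched)                  # games played by each team
adjMOV <- as.numeric(tapply(games$adjMOV, team_f, mean))

# Row i of Sched is divided by nGames[i] (recycling runs down columns)
A <- diag(N) - Sched / nGames

# ASSUMPTION: pin down the free constant by requiring sum(Ratings) == 0,
# then solve the stacked system in the least-squares sense
Ratings <- qr.solve(rbind(A, rep(1, N)), c(adjMOV, 0))
names(Ratings) <- teams
```

With real data, the same construction should apply to dat$Team, dat$Opponent and dat$adjMOV from the question, provided every opponent also appears in the Team column.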
While cleaning up a data frame I found out that assignment into a subset fills by column and not by row. This is unfortunate for dataset cleanup, where you typically search for problem cases and then apply the same correction across multiple rows.
# example table
releves <- structure(list(cult2015 = c("bp", "bp"), prec2015 = c("?", "?"
)), .Names = c("cult2015", "prec2015"), row.names = c(478L, 492L
), class = "data.frame")
# assignment to a subset
iBad2 <- which(releves$cult2015 == "bp" & releves$prec2015 == "?")
releves[iBad2,c("cult2015","prec2015")] <- c("b","p")
I understand that the "filling" of the matrices is done by columns and hence, the repetition of the provided vector is done on each column but is there any option to get: "b", "p" on each line and not:
> releves
cult2015 prec2015
478 b b
492 p p
I wrote the following function that does the job, at least in the cases I faced:
# allows assignment of newVals to a subset spanning multiple rows
AssignToSubsetByRow <- function(dat,rows,cols,newVals){
  if(is.null(dim(newVals)) && length(rows)*length(cols) > length(newVals)){
    # repeat each value once per row so that the column-wise fill gives one value per column
    fullRep <- rep(newVals,each=length(rows))
  }else{
    fullRep <- newVals
  }
  dat[rows,cols] <- fullRep
  return(dat)
}
And it does the job fine:
releves <- AssignToSubsetByRow(releves,iBad2,c("cult2015","prec2015"),c("b","p"))
> releves
cult2015 prec2015
478 b p
492 b p
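An alternative that needs no helper function is to hand the replacement over as a matrix filled by row, so that the shape already matches the subset. This is just a sketch on the example table from above, and it assumes at least one row matched (iBad2 is not empty); the variable cols is introduced only for readability.

```r
# Example table from above
releves <- data.frame(
  cult2015 = c("bp", "bp"),
  prec2015 = c("?", "?"),
  row.names = c(478L, 492L),
  stringsAsFactors = FALSE
)

iBad2 <- which(releves$cult2015 == "bp" & releves$prec2015 == "?")
cols  <- c("cult2015", "prec2015")

# byrow = TRUE lays c("b", "p") out along each row: column 1 gets "b", column 2 gets "p"
releves[iBad2, cols] <- matrix(c("b", "p"),
                               nrow = length(iBad2),
                               ncol = length(cols),
                               byrow = TRUE)
```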