I wrote a function to anonymize names in a data frame given a key, but it slows to a crawl once it has anonymized a large number of names, and I don't understand why.
The data frame in question is a set of 4733 tweets collected through the Twitter API, where each row is a tweet with 32 columns of data. The names should be anonymized regardless of which row or column they show up in, so I'd rather not limit the function to looking at only a couple of those 32 columns.
The key is a data frame containing 211121 pairs of real and fake names, with both the real and the fake names unique within the key. The function slows down immensely after about 100k names have been anonymized.
The function looks like the following:
pseudonymize <- function(df, key) {
  for (name in key$realNames) {
    df <- as.data.frame(apply(df, 2, function(column) {
      gsub(name, key[key$realNames == name, 2], column)
    }))
  }
  df
}
Is there some obvious thing here that would cause the slowing? I'm not at all experienced with optimizing code for speed.
EDIT1:
Here are a few lines from the data frame to be anonymized.
"https://twitter.com/__jgil/statuses/825559753447313408","__jgil",0.000576911235261567,756,4,13,17,7,16,23,10,0.28166915052161,0.390123456790124,0.00271311644806025,0.474529795261862,0.00641025649383664,"#jadahung20 GIRL I am tooooooo salty tonight lolll","lolll","adjoint","anglais","indefini","anglais","anglais","non","iPhone, Twitter",4057,214,241,"Canada","Nouvelle-Ecosse","Middleton","indefini","Shari"
"https://twitter.com/__paigewhite/statuses/827988259573788673","__paigewhite",0,1917,0,8,8,0,9,9,16,0.143476044852192,0.162056634159209,0.000172947386274259,0,0,"#abbytutty_ i miss emily lololol _Ù÷â_Ù÷É","lololol","adjoint","anglais","indefini","anglais","anglais","non","iPhone, Twitter",8366,392,661,"Canada","Nouvelle-Ecosse","indefini","indefini","Shari"
"https://twitter.com/_brookehynes/statuses/821022926287884288","_brookehynes",0,1917,1,6,7,1,7,8,1,1,1,0.000196850793912616,0.00393656926735126,0.200000002980232,"#tdesj3 #belle lol yea doubt it.","lol","adjoint","indefini","anglais","anglais","anglais","non","iPhone, Twitter",1184,87,70,"Canada","Nouvelle-Ecosse","Halifax","indefini","Shari"
Here are a few lines from the key.
"","realNames","fakeNames"
"1","________","Tajid_Pinkley"
"2","____________aho","Monica_Yujiri"
"3","___________ass","Alexander_Garay-Grajeda"
EDIT2:
I've simplified the DF down to only the two columns that would need anonymizing, and this made things much faster, but it still peters out after doing about 155k names.
As requested in the comments, here's the dput() output for the first three lines of the DF that's to be anonymized.
structure(list(
utilisateur = c("___Yeliab", "__courtlezz", "__courtlezz"),
texte = c("#EmilyIsPro ik lol", "#NikkiErica21 there was a sighting in sunset ridge too. Keep Winnie and bob safe lol", "#NikkiErica21 lol yes _Ã\231։")
),
row.names = c(NA, 3L),
class = "data.frame")
And here's the dput() for the first three lines of the key.
structure(list(
realNames = c("________", "____________aho", "___________ass"),
fakeNames = c("Abhinav_Chang", "Caleb_Dunn-Sparks", "Taryn_Hunzicker")
),
row.names = c(NA, 3L),
class = "data.frame")
Acting on the data as a vector rather than a data.frame will be much more efficient. I ran into some encoding issues, so I converted the text to UTF-8 using iconv; if the names contain non-ASCII characters, this will need some additional handling.
key1 <- data.frame(
realNames = c("________", "____________aho", "___________ass",
"___Yeliab", "__courtlezz", "NikkiErica21", "EmilyIsPro", "aho"),
fakeNames = c("Abhinav_Chang", "Caleb_Dunn-Sparks", "Taryn_Hunzicker",
"A_A", "B_B", "C_C", "D_D", "E_E"),
stringsAsFactors = FALSE
)
pseudonymize1 <- function(df, key) {
  mat  <- as.matrix(df)
  dims <- attr(mat, which = "dim")
  cnam <- colnames(df)
  # Work on a single character vector instead of a data.frame
  vec <- iconv(unclass(mat), from = "latin1", to = "UTF-8")
  for (name in split(key, f = seq_len(nrow(key)))) {
    vec <- gsub(
      x = vec,
      pattern = name$realNames,
      replacement = name$fakeNames,
      fixed = TRUE)
  }
  # Restore the original dimensions and column names
  mat <- vec
  attr(mat, which = "dim") <- dims
  df <- as.data.frame(mat, stringsAsFactors = FALSE)
  colnames(df) <- cnam
  df
}
pseudonymize1(df1, key1)
# utilisateur texte
# 1 A_A #D_D ik lol
# 2 B_B #C_C there was a sighting in sunset ridge too. Keep Winnie and bob safe lol
# 3 B_B #C_C lol yes _Ã\u0083\u0099Ã\u0083·Ã\u0083¢
library(microbenchmark)
microbenchmark(
pseudonymize(df1, key1),
pseudonymize1(df1, key1)
)
# Unit: microseconds
# expr min lq mean median uq max neval cld
# pseudonymize(df1, key1) 1842.554 1885.6750 2131.089 1994.755 2294.6850 3007.371 100 b
# pseudonymize1(df1, key1) 287.683 306.1905 333.678 314.950 339.8705 497.301 100 a
A concern I have with 155k names is that when searching, you will match names that are contained in other names. This could be one real name contained within another real name (e.g. Emily within EmilyIsPro), or a real name contained within a previously inserted fake name. You will want to test for this, and consider using a random hash instead of a name-like fake name.
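If you do go the hashing route, here is a minimal sketch of the idea; the helper name make_fake_id, the salt, the 8-character truncation, and the use of the digest package are my own illustrative choices rather than anything from the code above:

library(digest)

# Deterministic, name-unlike pseudonym per real name (salted SHA-256, truncated)
make_fake_id <- function(real_name, salt = "change-this-salt") {
  paste0("user_", substr(digest(paste0(salt, real_name), algo = "sha256"), 1, 8))
}

make_fake_id("EmilyIsPro")
# e.g. "user_8c1f0a2d"  (the actual value depends on the salt)

Because such pseudonyms never look like real names, a real name can no longer match inside an already-inserted fake name; one real name nested inside another (Emily inside EmilyIsPro) still needs care, for example by replacing the longest names first.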
I'm looking for a suggestion: I'm trying to re-order/group a data frame by a variable's value.
For example, transforming a data frame VARS
into something like this:
So far I've tried for-loops with cbind/rbind (depending on how the data is organized), aggregate, apply, etc., but there's always some wrinkle that prevents those methods from working.
I appreciate any help!
First, I'd like to point out that reading up on how to give a useful example, and posting the raw data using dput, will go a long way toward getting feedback. That said:
For the dataset you showed:
A <- structure(list(Var_Typer = c("cnt", "Cont", "cnt", "cnt", "fact",
"fact", "Char", "Char", "Cont"), R_FIELD = c("Gender", "Age",
"Activation", "WakeUpStroke", "ArMode", "PreHospActiv", "EMTag",
"EMTdx", "EMTlams")), .Names = c("Var_Typer", "R_FIELD"), row.names = c(NA,
-9L), class = "data.frame")
> head(A)
Var_Typer R_FIELD
1 cnt Gender
2 Cont Age
3 cnt Activation
4 cnt WakeUpStroke
5 fact ArMode
6 fact PreHospActiv
library(reshape2)  # dcast()
library(jsonlite)  # rbind.pages()
library(magrittr)  # %>%

B <- apply(
  dcast(A, Var_Typer ~ R_FIELD, value.var = 'R_FIELD'), 1, function(i){
    # keep only the non-NA field names for this type, as a one-row data frame
    ndf <- as.data.frame(rbind(i[complete.cases(i)]))
    colnames(ndf) <- c('Class', 1:(length(ndf) - 1))
    ndf
  }) %>% rbind.pages %>% (function(x){
    # pad the shorter groups with "..."
    x[is.na(x)] <- "..."
    x
  })
B
Class 1 2 3
1 Char EMTag EMTdx ...
2 cnt Activation Gender WakeUpStroke
3 Cont Age EMTlams ...
4 fact ArMode PreHospActiv ...
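If the goal is just the grouping itself rather than a padded rectangular table, a base R alternative (my own sketch, not part of the answer above) is to split the field names by type:

# Returns a named list of field names per type, so short groups
# don't need to be padded with "..."
split(A$R_FIELD, A$Var_Typer)
# $Char
# [1] "EMTag" "EMTdx"
#
# $cnt
# [1] "Gender"       "Activation"   "WakeUpStroke"
#
# $Cont
# [1] "Age"     "EMTlams"
#
# $fact
# [1] "ArMode"       "PreHospActiv"
# (the printed order of the groups may differ depending on your locale)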
I pulled a large .csv file into R with columns such as "paid" and "description".
I am trying to figure out how to pull only the "paid" values for the rows where "description" is Bronchitis, or some other illness that appears in that column.
This would be like building a pivot table in Excel, filtering on a certain Description, and getting back all of the individual paid rows.
Paid    Description                              val
$500    Bronchitis                               1.5
$3,250  'Complication of Pregnancy/Childbirth'   2.2
$5,400  Burns                                    3.3
$20.50  Bronchitis                               4.4
$24     Ashtma                                   1.2
If your data is
paid <- c(300,200,150)
desc <- c("bronchitis","headache","broken.leg")
df <- data.frame(paid, desc)
Try
df[desc=="bronchitis",c("paid")]
# the argument ahead of the comma filters the row,
# the argument after the comma refers to the column
# > df[desc=="bronchitis",c("paid")]
# [1] 300
or
library(dplyr)
df %>% filter(desc=="bronchitis") %>% select(paid)
# filter refers to the row condition
# select filters the output column(s)
# > df %>% filter(desc=="bronchitis") %>% select(paid)
# paid
# 1 300
Using data.table
library(data.table)#v1.9.5+
setkey(setDT(df1), Description)[.('Bronchitis'),'Paid', with=FALSE]
# Paid
#1: $500
#2: $20.50
data
df1 <- structure(list(Paid = c("$500", "$3,250", "$5,400", "$20.50", "$24"),
  Description = c("Bronchitis", "Complication of Pregnancy/Childbirth",
  "Burns", "Bronchitis", "Ashtma"), val = c(1.5, 2.2, 3.3, 4.4, 1.2)),
  .Names = c("Paid", "Description", "val"), class = "data.frame",
  row.names = c(NA, -5L))
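If you also want to total the matching payments, the way an Excel pivot table would, the Paid column first needs converting from strings like "$3,250" to numbers. A minimal sketch using the df1 above (the cleaning step is my own addition, not part of the answer):

# Strip "$" and thousands separators, then convert to numeric
paid_num <- as.numeric(gsub("[$,]", "", df1$Paid))

# Total paid for Bronchitis rows
sum(paid_num[df1$Description == "Bronchitis"])
# [1] 520.5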
How can I plot a heatmap for the following data, with ID on the y-axis and the corresponding names on the x-axis?
ID Name1 Name2 Name3 Name4 Name5 Name6
Gp2 2,86148 7,86926 5,00778 3,6586 5,66554 2,00694
Cldn10 3,30779 8,03876 4,73097 4,4237 7,96975 3,54605
Cldn10 4,261 8,7293 4,4683 4,3483 9,03017 4,68187
read.table allows you to specify that the data contains a header and row names:
x = read.table('filename', header = TRUE, row.names = 1)
heatmap(as.matrix(x))
This assumes the data does not use ',' as a thousands separator. If, as in your sample, ',' is the decimal separator, just specify the appropriate option:
x = read.table('filename', header = TRUE, row.names = 1, dec = ',')
If your input data contains commas as thousands separators, you need to remove them first:
raw_data = readLines('filename')
raw_data = gsub('(\\d+),(\\d+)', '\\1\\2', raw_data)
x = read.table(text = raw_data, header = TRUE, row.names = 1)
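One wrinkle with the exact sample in the question is that the ID Cldn10 appears twice, which row.names = 1 will reject as duplicate row names. A minimal self-contained workaround (my own sketch, using the sample inline) is to make the IDs unique before plotting:

raw <- "ID Name1 Name2 Name3 Name4 Name5 Name6
Gp2 2,86148 7,86926 5,00778 3,6586 5,66554 2,00694
Cldn10 3,30779 8,03876 4,73097 4,4237 7,96975 3,54605
Cldn10 4,261 8,7293 4,4683 4,3483 9,03017 4,68187"

x <- read.table(text = raw, header = TRUE, dec = ",")
rownames(x) <- make.unique(as.character(x$ID))  # Cldn10, Cldn10.1, ...
heatmap(as.matrix(x[-1]))  # IDs on one axis, Name1..Name6 on the other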
It turns out the format I wanted is called "SVM-Light" and is described here http://svmlight.joachims.org/.
I have a data frame that I would like to convert to a text file with format as follows:
output featureIndex:featureValue ... featureIndex:featureValue
So for example:
t = structure(list(feature1 = c(3.28, 6.88), feature2 = c(0.61, 1.83
), output = c("1", "-1")), .Names = c("feature1", "feature2",
"output"), row.names = c(NA, -2L), class = "data.frame")
t
# feature1 feature2 output
# 1 3.28 0.61 1
# 2 6.88 1.83 -1
would become:
1 feature1:3.28 feature2:0.61
-1 feature1:6.88 feature2:1.83
My code so far:
nvars = 2
l = array("row", nrow(t))
for (i in 1:nrow(t))
{
  l[i] = t$output[i]
  for (n in 1:nvars)
  {
    thisFeatureString = paste(names(t)[n], t[[names(t)[n]]][i], sep = ":")
    l[i] = paste(l[i], thisFeatureString)
  }
}
but I am not sure how to complete and write the results to a text file.
Also the code is probably not efficient.
Is there a library function that does this? This kind of output format seems common, for Vowpal Wabbit for example.
I couldn't find a ready-made solution, although the SVM-Light data format seems to be widely used.
Here is a working solution (at least in my case):
############### CONVERT DATA TO SVM-LIGHT FORMAT ##################################
# data_frame MUST have a column 'target', placed as its last column
# target values are assumed to be -1 or 1
# all other columns are treated as features
###################################################################################
ConvertDataFrameTo_SVM_LIGHT_Format <- function(data_frame)
{
  nvars = ncol(data_frame) - 1        # number of feature columns (everything but 'target')
  l = array("row", nrow(data_frame))  # l for "lines"
  for (i in 1:nrow(data_frame))
  {
    # we start each line with the target value
    l[i] = data_frame$target[i]
    # then append to the line each feature index (which is n) and its
    # feature value (data_frame[[names(data_frame)[n]]][i])
    for (n in 1:nvars)
    {
      thisFeatureString = paste(n, data_frame[[names(data_frame)[n]]][i], sep = ":")
      l[i] = paste(l[i], thisFeatureString)
    }
  }
  return(l)
}
###################################################################################
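To actually get the result onto disk, writeLines pairs naturally with the character vector this function returns. Note that the example frame t above uses output rather than target, so it needs renaming first; the file name below is purely illustrative:

# Rename the label column to what the function expects ('target' as the last column)
d <- t
names(d)[names(d) == "output"] <- "target"

lines <- ConvertDataFrameTo_SVM_LIGHT_Format(d)
lines
# [1] "1 1:3.28 2:0.61"  "-1 1:6.88 2:1.83"

writeLines(lines, "train.svmlight")  # illustrative file name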
If you don't mind not having the column names in the output, I think you could use a simple apply to do that:
apply(t, 1, function(x) paste(x, collapse=" "))
#[1] "3.28 0.61 1" "6.88 1.83 -1"
And to match the label-first order of your function's output, you can reorder the columns:
apply(t[c(3, 1, 2)], 1, function(x) paste(x, collapse=" "))
#[1] "1 3.28 0.61" "-1 6.88 1.83"