Fixing Column Issue When Importing Data in R

I'm currently having an issue importing a data set of tweets so that every observation is in one column.
This is the data before import; there are three lines for each tweet and a blank line in between.
T 2009-06-11 00:00:03
U http://twitter.com/imdb
W No Post Title
T 2009-06-11 16:37:14
U http://twitter.com/ncruralhealth
W No Post Title
T 2009-06-11 16:56:23
U http://twitter.com/boydjones
W listening to "Big Lizard - The Dead Milkmen" ♫ http://blip.fm/~81kwz
library(tidyverse)
tweets1 <- read_csv("tweets.txt.gz", col_names = F,
                    skip_empty_rows = F)
This is the output:
Parsed with column specification:
cols(
X1 = col_character()
)
Warning message:
“71299 parsing failures.
row col expected actual file
35 -- 1 columns 2 columns 'tweets.txt.gz'
43 -- 1 columns 2 columns 'tweets.txt.gz'
59 -- 1 columns 2 columns 'tweets.txt.gz'
71 -- 1 columns 5 columns 'tweets.txt.gz'
107 -- 1 columns 3 columns 'tweets.txt.gz'
... ... ......... ......... ...............
See problems(...) for more details.
”
# A tibble: 1,220,233 x 1
X1
<chr>
1 "T\t2009-06-11 00:00:03"
2 "U\thttp://twitter.com/imdb"
3 "W\tNo Post Title"
4 NA
5 "T\t2009-06-11 16:37:14"
6 "U\thttp://twitter.com/ncruralhealth"
7 "W\tNo Post Title"
8 NA
9 "T\t2009-06-11 16:56:23"
10 "U\thttp://twitter.com/boydjones"
# … with 1,220,223 more rows
The only issue is the many parsing failures, where problems(tweets1) shows that R expected one column but got multiple. Any ideas on how to fix this? According to my professor, my output should have 1.4 million rows, so I am unsure whether this parsing issue is the key here. Any help is appreciated!

Maybe something like this will work for you.
data
data <- 'T 2009-06-11 00:00:03
U http://twitter.com/imdb
W No Post Title
T 2009-06-11 16:37:14
U http://twitter.com/ncruralhealth
W No Post Title
T 2009-06-11 16:56:23
U http://twitter.com/boydjones
W listening to "Big Lizard - The Dead Milkmen" ♫ http://blip.fm/~81kwz'
For a large file, fread() should be quick. The sep = NULL is saying basically just read in full lines. You will replace input = data with file = "tweets.txt.gz".
library(data.table)
read_rows <- fread(input = data, header = FALSE, sep = NULL, blank.lines.skip = TRUE)
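For the real file, the call would be essentially the same (a sketch; the file name and scale come from the question above):
read_rows <- fread(file = "tweets.txt.gz", header = FALSE, sep = NULL, blank.lines.skip = TRUE)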
processing
You could just stay with data.table, but I noticed you are in the tidyverse already.
library(dplyr)
library(stringr)
library(tidyr)
Basically I am grabbing the first character (T, U, W) and storing it into a variable called Column. I am adding another column called Content for the rest of the string, with white space trimmed on both ends. I also added an ID column so I know how to group the clusters of 3 rows.
Then you basically just pivot on the Column. I am not sure if you wanted this last step or not, so remove as needed.
read_rows %>%
  # one ID per cluster of 3 rows (T, U, W)
  mutate(ID = rep(1:(n() / 3), each = 3),
         Column = str_sub(V1, 1, 1),
         Content = str_trim(str_sub(V1, 2))) %>%
  select(-V1) %>%
  pivot_wider(names_from = Column, values_from = Content)
result
# A tibble: 3 x 4
ID T U W
<int> <chr> <chr> <chr>
1 1 2009-06-11 00:00:03 http://twitter.com/imdb No Post Title
2 2 2009-06-11 16:37:14 http://twitter.com/ncruralhealth No Post Title
3 3 2009-06-11 16:56:23 http://twitter.com/boydjones "listening to \"Big Lizard - The Dead Milkmen\" ♫ http://blip.fm/~81kwz"
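If you keep that last step, you will probably also want friendlier names and a real timestamp. A minimal sketch, assuming the pivoted result was saved as tweets_wide (a hypothetical name):
library(lubridate)
tweets_wide %>%
  rename(time = "T", user = "U", text = "W") %>%
  mutate(time = ymd_hms(time))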

Related

Separate character variable into two columns

I have scraped some data from a URL to analyse cycling results. Unfortunately the name column contains both the rider's name and the team name in one field, and I would like to separate them. Here's the code (the last part doesn't work):
#get url
library(rvest)
stradebianchi_2020 <- read_html("https://www.procyclingstats.com/race/strade-bianche/2020/result")
#scrape table
results_2020 <- stradebianchi_2020 %>%
  html_nodes("td") %>%
  html_text()
#transpose scraped data into dataframe
results_stradebianchi_2020 <- as.data.frame(t(matrix(results_2020, 8, byrow = F)))
#rename
names(results_stradebianchi_2020) <- c("rank", "#", "name", "age", "team", "UCI point", "PCS points", "time")
#split rider from team
separate(data = results_stradebianchi_2020, col = name, into = c("left", "right"), sep = " ")
I think the best option is to get the team variable name and use that name to remove it from the 'name' column.
All suggestions are welcome!
I think your request is formulated the wrong way around: what you actually want is to remove the team from name.
That's how you should do it in my opinion:
results_stradebianchi_2020 %>%
  mutate(name = stringr::str_remove(name, team))
Write this instead of your line with separate.
In this case separate is not an optimal solution for you because the separation character is not clearly defined.
Also, I would advise you to remove the initial blanks from name with stringr::str_trim(name).
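Putting both steps together, a sketch of the same idea (stringr::fixed() treats the team name as a literal string, guarding against regex metacharacters in team names):
results_stradebianchi_2020 %>%
  mutate(name = stringr::str_remove(name, stringr::fixed(team)),
         name = stringr::str_trim(name))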
You could do this in base R with gsub and replace in the name column the pattern of the team column with "", i.e. nothing. We use apply() with MARGIN=1 to go through the data frame row by row. Finally we use trimws to strip the whitespace (changing to whitespace="[\\h\\v]" so that unusual spaces are matched as well).
res <- transform(results_stradebianchi_2020,
                 name = trimws(apply(results_stradebianchi_2020, 1, function(x)
                   gsub(x["team"], "", x["name"])), whitespace = "[\\h\\v]"))
head(res)
# rank X. name age team UCI.point PCS.points time
# 1 1 201 van Aert Wout 25 Team Jumbo-Visma 300 200 4:58:564:58:56
# 2 2 234 Formolo Davide 27 UAE-Team Emirates 250 150 0:300:30
# 3 3 87 Schachmann Maximilian 26 BORA - hansgrohe 215 120 0:320:32
# 4 4 111 Bettiol Alberto 26 EF Pro Cycling 175 100 1:311:31
# 5 5 44 Fuglsang Jakob 35 Astana Pro Team 120 90 2:552:55
# 6 6 7 Štybar Zdenek 34 Deceuninck - Quick Step 115 80 3:593:59

R Beginner struggling with extremely messy XLSX

I got an XLSX with data from a questionnaire for my master thesis.
The questions and answers for an interviewee are in one row in the second column. The first column contains the date.
The data of the second column comes in a form like this:
"age":"52","height":"170","Gender":"Female",...and so on
I started with:
library(readxl)
test12 <- read_xlsx("Testdaten.xlsx")
library(splitstackshape)
test13 <- concat.split(data = test12, split.col = "age", sep = ",")
Then I got the questions and the answers as a column divided by a ":".
For e.g. column 1: "age":"52" and column2:"height":"170".
But the data is so messy that sometimes the column for the age question and answer contains a height question and answer, and for some questionnaires the questions and answers are duplicated.
I would need the questions as variables and the answers as observations, but I have no clue how to get there. I could clean the data in Excel first, but since the columns are not consistent (e.g. some height questions end up in the age column) and I will receive new data regularly in the same format, I see no way to do it by hand.
Here is an example of the data:
A tibble: 5 x 2
partner.createdAt partner.wphg.info
<chr> <chr>
1 2019-11-09T12:13:11.099Z "{\"age_years\":\"50\",\"job_des\":\"unemployed\",\"height_cm\":\"170\",\"Gender\":\"female\",\"born_in\":\"Italy\",\"Alcoholic\":\"false\",\"knowledge_selfass\":\"5\",\"total_wealth\":\"200000\""
2 2019-11-01T06:43:22.581Z "{\"age_years\":\"34\",\"job_des\":\"self-employed\",\"height_cm\":\"158\",\"Gender\":\"male\",\"born_in\":\"Germany\",\"Alcoholic\":\"true\",\"knowledge_selfass\":\"3\",\"total_wealth\":\"10000\""
3 2019-11-10T07:59:46.136Z "{\"age_years\":\"24\",\"height_cm\":\"187\",\"Gender\":\"male\",\"born_in\":\"England\",\"Alcoholic\":\"false\",\"knowledge_selfass\":\"3\",\"total_wealth\":\"150000\""
4 2019-11-11T13:01:48.488Z "{\"age_years\":\"59\",\"job_des\":\"employed\",\"height_cm\":\"167\",\"Gender\":\"female\",\"born_in\":\"United States\",\"Alcoholic\":\"false\",\"knowledge_selfass\":\"2\",\"total_wealth\":\"1000000~
5 2019-11-08T14:54:26.654Z "{\"age_years\":\"36\",\"height_cm\":\"180\",\"born_in\":\"Germany\",\"Alcoholic\":\"false\",\"knowledge_selfass\":\"5\",\"total_wealth\":\"170000\",\"job_des\":\"employed\",\"Gender\":\"male\""
Thank you so much for your time!
You can loop through each entry, splitting at , as you did. Then you can loop through them all again, splitting at :.
The result will be a bunch of variable/value pairings. This can be all done stacked. Then you just want to pivot back into columns.
data
Updated the data based on your edit.
data <- tribble(~partner.createdAt, ~partner.wphg.info,
'2019-11-09T12:13:11.099Z', '{\"age_years\":\"50\",\"job_des\":\"unemployed\",\"height_cm\":\"170\",\"Gender\":\"female\",\"born_in\":\"Italy\",\"Alcoholic\":\"false\",\"knowledge_selfass\":\"5\",\"total_wealth\":\"200000\"',
'2019-11-01T06:43:22.581Z', '{\"age_years\":\"34\",\"job_des\":\"self-employed\",\"height_cm\":\"158\",\"Gender\":\"male\",\"born_in\":\"Germany\",\"Alcoholic\":\"true\",\"knowledge_selfass\":\"3\",\"total_wealth\":\"10000\"',
'2019-11-10T07:59:46.136Z', '{\"age_years\":\"24\",\"height_cm\":\"187\",\"Gender\":\"male\",\"born_in\":\"England\",\"Alcoholic\":\"false\",\"knowledge_selfass\":\"3\",\"total_wealth\":\"150000\"',
'2019-11-11T13:01:48.488Z', '{\"age_years\":\"59\",\"job_des\":\"employed\",\"height_cm\":\"167\",\"Gender\":\"female\",\"born_in\":\"United States\",\"Alcoholic\":\"false\",\"knowledge_selfass\":\"2\",\"total_wealth\":\"1000000\"',
'2019-11-08T14:54:26.654Z', '{\"age_years\":\"36\",\"height_cm\":\"180\",\"born_in\":\"Germany\",\"Alcoholic\":\"false\",\"knowledge_selfass\":\"5\",\"total_wealth\":\"170000\",\"job_des\":\"employed\",\"Gender\":\"male\"')
libraries
We need a few here. Or you can just call tidyverse.
library(stringr)
library(purrr)
library(dplyr)
library(tibble)
library(tidyr)
function
This function will create a data frame (or tibble) for each question. The first column is the date, the second is the variable, the third is the value.
clean_record <- function(date, text) {
  clean_records <- str_split(text, pattern = ",", simplify = TRUE) %>%
    str_remove_all(pattern = "\\\"") %>%    # remove double quotes
    str_remove_all(pattern = "\\{|\\}") %>% # remove curly brackets
    str_split(pattern = ":", simplify = TRUE)
  tibble(date = as.Date(date), variable = clean_records[, 1], value = clean_records[, 2])
}
iteration
Now we use pmap_dfr from purrr to loop over the rows, outputting each row with an id variable named record.
This will stack the data as described in the function. The mutate() line converts all variable names to lowercase. The distinct() line will filter out rows that are exact duplicates.
What we do then is just pivot on the variable column. Of course, replace data with whatever you name your data frame.
data_clean <- pmap_dfr(data, ~ clean_record(..1, ..2), .id = "record") %>%
  mutate(variable = tolower(variable)) %>%
  distinct() %>%
  pivot_wider(names_from = variable, values_from = value)
result
The result is something like this. Note how the last record listed its fields in a different order, yet it still pivots correctly. You are probably not done just yet: all columns are now of type character, so you need to figure out the desired type for each and convert.
# A tibble: 5 x 10
record date age_years job_des height_cm gender born_in alcoholic knowledge_selfass total_wealth
<chr> <date> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 1 2019-11-09 50 unemployed 170 female Italy false 5 200000
2 2 2019-11-01 34 self-employed 158 male Germany true 3 10000
3 3 2019-11-10 24 NA 187 male England false 3 150000
4 4 2019-11-11 59 employed 167 female United States false 2 1000000
5 5 2019-11-08 36 employed 180 male Germany false 5 170000
For example, convert age_years to numeric.
data_clean %>%
  mutate(age_years = as.numeric(age_years))
I am sure you may run into other things, but this should be a start.
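If many columns need converting, one option (a sketch; utils::type.convert guesses an appropriate type for each column from its text) is:
data_clean %>%
  mutate(across(-c(record, date), ~ type.convert(.x, as.is = TRUE)))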

How to convert a string that has the headers and values with ID per string into a dataframe in R

I need to know how to convert the strings in a text file into a data frame for analysis.
Each line holds one record with an ID per customer; the column headings and values are embedded in the line and separated by semicolons ';'.
For example:
{ID=12345;TimeStamp=""2019-02-26 00:15:42"";Event=StatusEvent;Status=""WiLoMonitorStart"";Text=""mnew inactivity failure on cable"";}
The column headings are ID, TimeStamp, Event, Status, Text and any others that come before the equals sign "=".
The values under the column headings come after the equals sign "="; see the picture for the end result I want to achieve.
Statements {
"{ID=12345;TimeStamp=""2019-02-26 00:15:42"";Event=StatusEvent;Status=""WiLoMonitorStart"";Text=""mnew inactivity failure on cable"";}"
"{ID=12346;TimeStamp=""2019-02-26 00:15:43"";Event=StatusEvent;Status=""MetroCode"";Text=""AU"";}"
"{ID=12347;TimeStamp=""2019-02-26 00:15:43"";Event=StatusEvent;Status=""LoWiValidation"";Text=""Password validation 2.5GHz for AES: BigBong"";}"
"{ID=12349;TimeStamp=""2019-02-26 00:15:42"";Event=DomainEvent;MacAddress=""AB:23:34:EF:YN:OT"";LogTime=""2019-02-26 00:15:48"";Domain=""Willing ind"";SecondaryDomain=""No_Perl"";}"
"{ID=12351;TimeStamp=""2019-02-26 00:15:45"";Event=CollectionCallEvent;SerialNumber=""34121"";}"
"{ID=12352;TimeStamp=""2019-02-26 00:15:46"";Event=CollectionCallEvent;SerialNumber=""34151"";Url=""werlkdfa/vierjwerret/vre34f3/df343rsdf343+t45rf/dfgr3443"";}"
}
You can see that the semicolon ";" separates each variable. Can someone show how to split these up and make R identify which part is a column heading and which is a value to be placed under the respective heading, keyed by the customer ID (the primary key)?
Note that each line may not have the same column headings as the next one.
The image supplied is what I want to achieve in the end, but I am having great difficulty doing so in R. It is not a JSON file or XML format; it is a dump in text format from which I need to extract the information into a data frame before I can do any analysis.
Any suggestions? Would there be a better way than, say, regular expressions, e.g. the stringr package?
txt <- 'Statements {
"{ID=12345;TimeStamp=""2019-02-26 00:15:42"";Event=StatusEvent;Status=""WiLoMonitorStart"";Text=""mnew inactivity failure on cable"";}"
"{ID=12346;TimeStamp=""2019-02-26 00:15:43"";Event=StatusEvent;Status=""MetroCode"";Text=""AU"";}"
"{ID=12347;TimeStamp=""2019-02-26 00:15:43"";Event=StatusEvent;Status=""LoWiValidation"";Text=""Password validation 2.5GHz for AES: BigBong"";}"
"{ID=12349;TimeStamp=""2019-02-26 00:15:42"";Event=DomainEvent;MacAddress=""AB:23:34:EF:YN:OT"";LogTime=""2019-02-26 00:15:48"";Domain=""Willing ind"";SecondaryDomain=""No_Perl"";}"
"{ID=12351;TimeStamp=""2019-02-26 00:15:45"";Event=CollectionCallEvent;SerialNumber=""34121"";}"
"{ID=12352;TimeStamp=""2019-02-26 00:15:46"";Event=CollectionCallEvent;SerialNumber=""34151"";Url=""werlkdfa/vierjwerret/vre34f3/df343rsdf343+t45rf/dfgr3443"";}"
} ' # note need for single quotes
Then read it in with readLines, remove the leading and trailing lines, then remove the braces and double quotes, and finally read it with scan:
RL <- readLines(textConnection(txt))
rl <- RL[-1]
input <- scan(text=gsub('[{}"]',"", rl[1:6]), sep=';', what="")
input[1:12]
#------------------
[1] " ID=12345" "TimeStamp=2019-02-26 00:15:42"
[3] "Event=StatusEvent" "Status=WiLoMonitorStart"
[5] "Text=mnew inactivity failure on cable" ""
[7] " ID=12346" "TimeStamp=2019-02-26 00:15:43"
[9] "Event=StatusEvent" "Status=MetroCode"
[11] "Text=AU" ""
Then you can process it like any key-value pair input, with "ID" being the delimiter. Another way that would keep the original lines together in a list would be:
sapply( gsub('[{}"]',"", rl[1:6]), function(x) scan(text=x, sep=";", what=""))
#----------------
Read 6 items
Read 6 items
Read 6 items
Read 8 items
Read 5 items
Read 6 items
$` ID=12345;TimeStamp=2019-02-26 00:15:42;Event=StatusEvent;Status=WiLoMonitorStart;Text=mnew inactivity failure on cable;`
[1] " ID=12345" "TimeStamp=2019-02-26 00:15:42" "Event=StatusEvent"
[4] "Status=WiLoMonitorStart" "Text=mnew inactivity failure on cable" ""
# only printed the result from the first line
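Not part of the original answer, but to sketch the remaining key-value processing over the input vector built above (assuming no value itself contains an "="):
library(dplyr)
kv <- input[input != ""]            # drop the empty fields left by trailing ";"
keys <- trimws(sub("=.*", "", kv))  # text before the first "="
values <- sub("^[^=]*=", "", kv)    # text after the first "="
rec <- cumsum(keys == "ID")         # a new record starts at each "ID" key
bind_rows(lapply(split(seq_along(kv), rec),
                 function(i) setNames(as.list(values[i]), keys[i])))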
Here's a tidyverse solution:
library(tidyverse)
data.frame(txt) %>%
  # tidy strings:
  mutate(txt = trimws(gsub("Statements|\\s{2,}|[^\\w=; -]", "", txt, perl = TRUE))) %>%
  # separate into rows by splitting on ";":
  separate_rows(txt, sep = ";") %>%
  # separate into two columns by splitting on "=":
  separate(txt, into = c("header", "value"), sep = "=") %>%
  na.omit() %>%
  group_by(header) %>%
  # create grouped row ID:
  mutate(rowid = row_number()) %>%
  ungroup() %>%
  # cast wider:
  pivot_wider(rowid,
              names_from = "header",
              values_from = "value")
# A tibble: 6 × 12
rowid ID TimeStamp Event Status Text MacAd…¹ LogTime Domain Secon…² Seria…³ Url
<int> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 1 12345 2019-02-26 001542 StatusEvent WiLoM… mnew… AB2334… 2019-0… Willi… No_Perl 34121 werl…
2 2 12346 2019-02-26 001543 StatusEvent Metro… AU NA NA NA NA 34151 NA
3 3 12347 2019-02-26 001543 StatusEvent LoWiV… Pass… NA NA NA NA NA NA
4 4 12349 2019-02-26 001542 DomainEvent NA NA NA NA NA NA NA NA
5 5 12351 2019-02-26 001545 CollectionCallEve… NA NA NA NA NA NA NA NA
6 6 12352 2019-02-26 001546 CollectionCallEve… NA NA NA NA NA NA NA NA
# … with abbreviated variable names ¹​MacAddress, ²​SecondaryDomain, ³​SerialNumber
Data:
txt <- 'Statements {
"{ID=12345;TimeStamp=""2019-02-26 00:15:42"";Event=StatusEvent;Status=""WiLoMonitorStart"";Text=""mnew inactivity failure on cable"";}"
"{ID=12346;TimeStamp=""2019-02-26 00:15:43"";Event=StatusEvent;Status=""MetroCode"";Text=""AU"";}"
"{ID=12347;TimeStamp=""2019-02-26 00:15:43"";Event=StatusEvent;Status=""LoWiValidation"";Text=""Password validation 2.5GHz for AES: BigBong"";}"
"{ID=12349;TimeStamp=""2019-02-26 00:15:42"";Event=DomainEvent;MacAddress=""AB:23:34:EF:YN:OT"";LogTime=""2019-02-26 00:15:48"";Domain=""Willing ind"";SecondaryDomain=""No_Perl"";}"
"{ID=12351;TimeStamp=""2019-02-26 00:15:45"";Event=CollectionCallEvent;SerialNumber=""34121"";}"
"{ID=12352;TimeStamp=""2019-02-26 00:15:46"";Event=CollectionCallEvent;SerialNumber=""34151"";Url=""werlkdfa/vierjwerret/vre34f3/df343rsdf343+t45rf/dfgr3443"";}"
} '

How to check for skipped values in a series in a R dataframe column?

I have a dataframe price1 in R that has four columns:
Name Week Price Rebate
Car 1 1 20000 500
Car 1 2 20000 400
Car 1 5 20000 400
---- -- ---- ---
Car 1 54 20400 450
There are ten Car names in all in price1, so the above is just to give an idea about the structure. Each car name should have 54 observations corresponding to 54 weeks. But there are some weeks for which no observation exists (e.g. Weeks 3 and 4 in the above case). For these missing weeks, I need to plug in information from another dataframe price2:
Name AveragePrice AverageRebate
Car 1 20000 500
Car 2 20000 400
Car 3 20000 400
---- ---- ---
Car 10 20400 450
So, I need to identify the missing week for each Car name in price1, capture the row corresponding to that Car name in price2, and insert the row in price1. I just can't wrap my head around a possible approach, so unfortunately I do not have a code snippet to share. Most of my search in SO is leading me to answers regarding handling missing values, which is not what I am looking for. Can someone help me out?
I am also indicating the desired output below:
Name Week Price Rebate
Car 1 1 20000 500
Car 1 2 20000 400
Car 1 3 20200 410
Car 1 4 20300 420
Car 1 5 20000 400
---- -- ---- ---
Car 1 54 20400 450
---- -- ---- ---
Car 10 54 21400 600
Note that the output now has Car 1 info for Weeks 3 and 4, which I should fetch from price2. The final output should contain 54 observations for each of the 10 car names, so 540 rows in total.
try this, good luck
library(data.table)
carNames <- paste('Car', 1:10)
df <- data.table(Name = rep(carNames, each = 54), Week = rep(1:54, times = 10))
df <- merge(df, price1, by = c('Name', 'Week'), all.x = TRUE)
df <- merge(df, price2, by = 'Name', all.x = TRUE)
df[, `:=`(Price = ifelse(is.na(Price), AveragePrice, Price),
          Rebate = ifelse(is.na(Rebate), AverageRebate, Rebate))]
df[, 1:4]
So if I understand your problem correctly, you basically have two data frames and you want to make sure the dataframe "price1" has the correct values filled in for every car name in the 'Name' column?
Here's what I would do, but it probably isn't the optimal way:
#create a loop with length = number of rows in your frame
for (i in 1:nrow(price1)) {
  #check if the value is NA
  if (is.na(price1$Price[i])) {
    #if it is NA, replace it with the corresponding value in price2, matched by car name
    j <- match(price1$Name[i], price2$Name)
    price1$Price[i] <- price2$AveragePrice[j]
    price1$Rebate[i] <- price2$AverageRebate[j]
  }
}
Hope this helps (:
If I understand your question correctly, you only want to see what is in the 2nd table and not in the first. You will just want to use an anti_join. Note that the order you feed the tables into the anti_join matters.
library(tidyverse)
complete_table <- price2 %>%
  anti_join(price1)
To expand your first table to cover all 54 weeks use complete() or you can even fudge it and right_join a table that you will purposely build with all 54 weeks in it. Then anything that doesn't join to this second table gets an NA in that column.
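Not from the original answer, but a sketch of that complete() route, filling the gaps from price2 with coalesce() (column names taken from the question):
library(tidyverse)
price1 %>%
  complete(Name, Week = 1:54) %>%    # one row per car per week
  left_join(price2, by = "Name") %>% # bring in the averages
  mutate(Price = coalesce(Price, AveragePrice),
         Rebate = coalesce(Rebate, AverageRebate)) %>%
  select(Name, Week, Price, Rebate)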

Delete row from data.frame based on condition

I have some repeated measures data I'm trying to clean in R. At this point, it is in the long format and I'm trying to fix some entries before I move to a wide format - for example, if people took my survey too many times I'm going to drop the rows. I have two main problems that I'm trying to solve:
Changing an entry
If someone took the survey from the "pre-test link" when it was actually supposed to be a post-test, I'm fixing it with the following code:
data[data$UserID == 52118254, "Prepost"][2] <- 2
This filters out the entries from that person based on ID, then changes the second entry to be coded as a post-test. This code has enough meaning that reviewing it tells me what is happening.
Dropping a row
I'm struggling to get meaningful code to delete extra rows - for example if someone accidentally clicked on my link twice. I have data like the following:
UserID Prepost Duration..in.seconds.
1 52118250 1 357
2 52118284 1 226
3 52118284 1 11 #This is an extra attempt to remove
4 52118250 2 261
5 52118284 2 151
#to reproduce:
structure(list(UserID = c(52118250, 52118284, 52118284, 52118250, 52118284), Prepost = c("1", "1", "1", "2", "2"), Duration..in.seconds. = c("357", "226", "11", "261", "151")), class = "data.frame", row.names = c(NA, -5L), .Names = c("UserID", "Prepost", "Duration..in.seconds."))
I can filter by UserID to see who has taken it too many times and I'm looking for a way to easily remove those rows from the dataset. In this case, UserID 52118284 has taken it three times and the second attempt needs to be removed. If it is "readable" like the other fix that is better.
I'd use a collection of dplyr functions as shown below. To explain:
group_by(UserID) will help to apply functions separately to each User.
mutate(click_n = row_number()) iteratively counts User appearances and saves it as a new variable click_n.
library(dplyr)
data %>%
  group_by(UserID) %>%
  mutate(click_n = row_number())
#> Source: local data frame [5 x 4]
#> Groups: UserID [4]
#>
#> UserID Prepost Duration..in.seconds. click_n
#> <dbl> <chr> <chr> <int>
#> 1 52118254 1 357 1
#> 2 52118284 1 226 1
#> 3 52118284 1 11 2
#> 4 52118250 2 261 1
#> 5 52118280 2 151 1
filter(click_n == 1) can then be used to keep only 1st attempts as shown below.
data <- data %>%
  group_by(UserID) %>%
  mutate(click_n = row_number()) %>%
  filter(click_n == 1)
data
#> Source: local data frame [4 x 4]
#> Groups: UserID [4]
#>
#> UserID Prepost Duration..in.seconds. click_n
#> <dbl> <chr> <chr> <int>
#> 1 52118254 1 357 1
#> 2 52118284 1 226 1
#> 3 52118250 2 261 1
#> 4 52118280 2 151 1
Note that this approach assumes that your data frame is ordered. I.e., first clicks appear close to the top.
If you're unfamiliar with %>%, look for help on the "pipe operator".
EXTRA:
To bring the comment into the answer: once you're comfortable with what's going on here, you can skip the mutate line and just do the following:
data %>% group_by(UserID) %>% filter(row_number() == 1)
A simple solution to remove duplicates is below:
subset(data, !duplicated(data$UserID))
However, you may want to consider also subsetting by duration, such as if the duration is less than 30 seconds.
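For instance (a sketch; the 30-second cutoff is illustrative, and note that Duration..in.seconds. is stored as character in the sample data):
subset(data, !duplicated(UserID) & as.numeric(Duration..in.seconds.) > 30)
Depending on your data, you may prefer to apply the duration filter before deduplicating, so that a too-short first attempt does not shadow a valid later one.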
Thanks @Simon for the suggestions. One criterion I wanted was that the code makes sense as I "read" it. As I thought more, another criterion is that I wanted to be deliberate about which changes to make. So I incorporated Simon's recommendation to make a separate column and then use dplyr::filter() to exclude those rows. Here's what an example segment of code looked like:
#Change pre/post entries
data[data$UserID == 52118254, "Prepost"][2] <- 2
#Mark rows to delete
data$toDelete <- NA #Makes new empty column for marking deletions
data[data$UserID == 52118284,][2, "toDelete"] <- 1 #Marks row for deletion
#Filter to exclude rows
data %>% filter(is.na(toDelete))
#Optionally add "%>% select(-toDelete)" to remove the extra column
In my context, advantages here are that everything is deliberate rather than automatic and changes are anchored to data rather than row numbers that might change. I'd still welcome any feedback or other ways of achieving this (maybe in a single step).
