I have some repeated measures data I'm trying to clean in R. At this point, it is in the long format and I'm trying to fix some entries before I move to a wide format - for example, if people took my survey too many times I'm going to drop the rows. I have two main problems that I'm trying to solve:
Changing an entry
If someone took the survey from the "pre-test link" when it was actually supposed to be a post-test, I'm fixing it with the following code:
data[data$UserID == 52118254, "Prepost"][2] <- 2
This filters out the entries from that person based on ID, then changes the second entry to be coded as a post-test. This code has enough meaning that reviewing it tells me what is happening.
Dropping a row
I'm struggling to get meaningful code to delete extra rows - for example if someone accidentally clicked on my link twice. I have data like the following:
UserID Prepost Duration..in.seconds.
1 52118250 1 357
2 52118284 1 226
3 52118284 1 11 #This is an extra attempt to remove
4 52118250 2 261
5 52118284 2 151
#to reproduce:
structure(list(UserID = c(52118250, 52118284, 52118284, 52118250, 52118284), Prepost = c("1", "1", "1", "2", "2"), Duration..in.seconds. = c("357", "226", "11", "261", "151")), class = "data.frame", row.names = c(NA, -5L), .Names = c("UserID", "Prepost", "Duration..in.seconds."))
I can filter by UserID to see who has taken it too many times and I'm looking for a way to easily remove those rows from the dataset. In this case, UserID 52118284 has taken it three times and the second attempt needs to be removed. If it is "readable" like the other fix that is better.
I'd use a collection of dplyr functions as shown below. To explain:
group_by(UserID) will help to apply functions separately to each User.
mutate(click_n = row_number()) iteratively counts User appearances and saves it as a new variable click_n.
library(dplyr)
data %>%
group_by(UserID) %>%
mutate(click_n = row_number())
#> Source: local data frame [5 x 4]
#> Groups: UserID [4]
#>
#> UserID Prepost Duration..in.seconds. click_n
#> <dbl> <chr> <chr> <int>
#> 1 52118254 1 357 1
#> 2 52118284 1 226 1
#> 3 52118284 1 11 2
#> 4 52118250 2 261 1
#> 5 52118280 2 151 1
filter(click_n == 1) can then be used to keep only 1st attempts as shown below.
data <- data %>%
group_by(UserID) %>%
mutate(click_n = row_number()) %>%
filter(click_n == 1)
data
#> Source: local data frame [4 x 4]
#> Groups: UserID [4]
#>
#> UserID Prepost Duration..in.seconds. click_n
#> <dbl> <chr> <chr> <int>
#> 1 52118254 1 357 1
#> 2 52118284 1 226 1
#> 3 52118250 2 261 1
#> 4 52118280 2 151 1
Note that this approach assumes that your data frame is ordered. I.e., first clicks appear close to the top.
If you're unfamiliar with %>%, look for help on the "pipe operator".
EXTRA:
To bring the comment into answer, once you're comfortable with what's going on here, you can skip the mutate line a just do the following:
data %>% group_by(UserID) %>% filter(row_number() == 1)
A simple solution to remove duplicates is below:
subset(data, !duplicated(data$UserID))
However, you may want to consider also subsetting by duration, such as if the duration is less than 30 seconds.
Thanks #Simon for the suggestions. One criteria I wanted was that the code made sense as I "read" it. As I thought more, another criteria is that I wanted to be deliberate about what changes to make. So I incorporated Simon's recommendation to make a separate column and then use dplyr::filter() to exclude those variables. Here's what an example segment of code looked like:
#Change pre/post entries
data[data$UserID == 52118254, "Prepost"][2] <- 2
#Mark rows to delete
data$toDelete <- NA #Makes new empty column for marking deletions
data[data$UserID == 52118284,][2, "toDelete"] <- 1 #Marks row for deletion
#Filter to exclude rows
data %>% filter(is.na(toDelete))
#Optionally add "%>% select(-toDelete)" to remove the extra column
In my context, advantages here are that everything is deliberate rather than automatic and changes are anchored to data rather than row numbers that might change. I'd still welcome any feedback or other ways of achieving this (maybe in a single step).
Related
I have a data frame with 80 existing rows and 6 variables, they are:
Row_ID
CatName
CatAge
Request
Friends
ID,
and I need to add some outliers to the dataset of generated data by adding a row on to the end containing specific data.
I attempted the following but it does not work. Any tips on how to get this to work?
```{r, create row 1, echo=TRUE,include=TRUE}
Cat_dataframe %>%
add_row(Row_ID = "30",CatName = "Carla",CatAge="30",Request="30",Friends="8",ID="500000")
```
Your command looks pretty good to me:
library(tidyverse)
df <- tribble(~"Row_ID", ~"CatName", ~"CatAge", ~"Request", ~"Friends", ~"ID",
"1", "name1", "31", "request1", "2", "051245")
df %>%
add_row(Row_ID = "30",CatName = "Carla",CatAge="30",Request="30",Friends="8",ID="500000")
#> # A tibble: 2 × 6
#> Row_ID CatName CatAge Request Friends ID
#> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 1 name1 31 request1 2 051245
#> 2 30 Carla 30 30 8 500000
Created on 2022-04-03 by the reprex package (v2.0.1)
You may have an issue with your chunk title (i.e. try {r create_row_1, echo=TRUE, include=TRUE} instead of {r, create row 1, echo=TRUE,include=TRUE}) and you may have an issue with different data types, e.g. if "CatAge" is an integer in your original dataframe and a character string in your add_rows() command (age=31 and age="31" are different types).
If you edit your original question to include the error message/s you're getting it will very likely make it easier troubleshoot your problem.
So I am trying to write an automated report in R with Functions. One of the questions I am trying to answer is this " During the first week of the month, what were the 10 most viewed products? Show the results in a table with the product's identifier, category, and count of the number of views.". To to this I wrote the following function
most_viewed_products_per_week <- function (month,first_seven_days, views){
month <- views....February.2020.2
first_seven_days <- function( month, date_1, date_2){
date_1 <-2020-02-01
date_2 <- 2020-02-07
return (first_seven_days)}
views <-function(views, desc){
return (views.head(10))}
}
print(most_viewed_products_per_week)
However the output I get is this:
function (month,first_seven_days, views){
month <- views....February.2020.2
first_seven_days <- function( month, date_1, date_2){
date_1 <-2020-02-01
date_2 <- 2020-02-07
return (first_seven_days)}
views <-function(views, desc){
return (views.head(10))}
How do I fix that?
This report has more questions like this, so I am trying to get my function writing as correct as possible from the start.
Thanks in advance,
Edo
It is a good practice to code in functions. Still I recommend you get your code doing what you want and then think about what parts you want to wrap in a function (for future re-use). This is to get you going.
In general: to support your analysis, make sure that your data is in the right class. I.e. dates are formatted as dates, numbers as double or integers, etc. This will give you access to many helper functions and packages.
For the case at hand, read up on {tidyverse}, in particular {dplyr} which can help you with coding pipes.
simulate data
As mentioned - you will find many friends on Stackoverflow, if you provide a reproducible example.
Your questions suggests your data look a bit like the following simulated data.
Adapt as appropriate (or provide example)
library(tibble) # tibble are modern data frames
library(dplyr) # for crunching tibbles/data frames
library(lubridate) # tidyverse package for date (and time) handling
df <- tribble( # create row-tibble
~date, ~identifier, ~category, ~views
,"2020-02-01", 1, "TV", 27
,"2020-02-02", 2, "PC", 40
,"2020-02-03", 1, "TV", 12
,"2020-02-03", 2, "PC", 2
,"2020-02-08", 3, "UV", 200
) %>%
mutate(date = ymd(date)) # date is read in a character - lubridate::ymd() for date
This yields
> df
# A tibble: 5 x 4
date identifier category views
<date> <dbl> <chr> <dbl>
1 2020-02-01 1 TV 27
2 2020-02-02 2 PC 40
3 2020-02-03 1 TV 12
4 2020-02-03 2 PC 2
5 2020-02-08 3 UV 200
Notice: date-column is in date-format.
work your algorithm
From your attempt it follows you want to extract the first 7 days.
Since we have a "date"-column, we can use a date-function to help us here.
{lubridate}'s day() extracts the "day-number".
> df %>% filter(day(date) <= 7)
# A tibble: 4 x 4
date identifier category views
<date> <dbl> <chr> <dbl>
1 2020-02-01 1 TV 27
2 2020-02-02 2 PC 40
3 2020-02-03 1 TV 12
4 2020-02-03 2 PC 2
Anything outside the first 7 days is gone.
Next you want to summarise to get your product views total.
df %>%
## ---------- c.f. above ------------
filter(day(date) <= 7) %>%
## ---------- summarise in bins that you need := groups -------
group_by(identifier, category) %>%
summarise(total_views = sum(views)
, .groups = "drop" ) # if grouping is not needed "drop" it
This gives you:
# A tibble: 2 x 3
identifier category total_views
<dbl> <chr> <dbl>
1 1 TV 39
2 2 PC 42
Now pick the top-10 and sort the order:
df %>%
## ---------- c.f. above ------------
filter(day(date) <= 7) %>%
group_by(identifier, category) %>%
summarise(total_views = sum(views), .groups = "drop" ) %>%
## ---------- make use of another helper function of dplyr
top_n(n = 10, total_views) %>% # note top-10 makes here no "real" sense :), try top_n(1, total_views)
arrange(desc(total_views)) # arrange in descending order on total_views
wrap in function
Now that the workflow is in place, think about breaking your code into the blocks you think are useful.
I leave this to you. You can assign interim results to new data frames and wrap the preparation of the data into a function and then the top_n() %>% arrange() in another function, ...
This yields:
# A tibble: 2 x 3
identifier category total_views
<dbl> <chr> <dbl>
1 2 PC 42
2 1 TV 39
I'm working with R for the first time for a class in college. To preface this: I don't know enough to know what I don't know, so I'm sorry if this question has been asked before. I am trying to predict the results of the Texas state house elections in 2020, and I think the best prior for that is the results of the 2018 state house elections. There are 150 races, so I can't bare to input them all by hand, but I can't find any spreadsheet that has data formatted how I want it. I want it in a pretty standard table format:
My desired table format. However, the table from the Secretary of state I have looks like the following:
Gross ugly table.
I wrote some psuedo code:
Here's the Psuedo Code, basically we want to construct a new CSV:
'''%First, we want to find a district, the house races are always preceded by a line of dashes, so I will need a function like this:
Create a New CSV;
for(x=1; x<151 ; x +=1){
Assign x to the cell under the district number cloumn;
Find "---------------" ;
Go down one line;
Go over two lines;
% We should now be in the third column and now want to read in which party got how many votes. The number of parties is not consistant, so we need to account for uncontested races, libertarians, greens, and write ins. I want totals for Republicans, Democrats, and Other.
while(cell is not empty){
Party <- function which reads cell (but I want to read a string);
go right one column;
Votes <- function which reads cell (but I want to read an integer);
if(Party = Rep){
put this data in place in new CSV;
else if (Party = Dem)
put this data in place in new CSV;
else
OtherVote += Votes;
};
};
Assign OtherVote to the column for other party;
OtherVote <- 0;
%Now I want to assign 0 to null cells (ones where no rep, or no Dem, or no other party contested
read through single row 4 spaces, if its null assign it 0;
Party <- null
};'''
But I don't know enough to google what to do! Here's what I need help with: Can I create a new CSV in Rstudio, how? How can I read specific cells in a table, hopefully indexing? Lastly, how do I write to a table in R. Any help is appreciated! Thank you!
Can I create a new CSV in Rstudio, how?
Yes you can. Use the "write.csv" function.
write.csv(df, file = "df.csv") #see help for more information.
How can I read specific cells in a table?
Use the brackets after df,example below.
df <- data.frame(x = c(1,2,3), y = c("A","B","C"), z = c(15,25,35))
df[1,1]
#[1] 1
df[1,1:2]
# x y
#1 1 A
How do I write to a table in R?
If you want to write a table in xlsx use the function write.xlsx from openxlsx package.
Wikipedia seems to have a table that is closer to the format you are looking for.
In order to get to the table you are looking for we need a few steps:
Download data from Wikipedia and extract table.
Clean up table.
Select columns.
Calculate margins.
1. Download data from wikipedia and extract table.
The rvest table helps with downloading and parsing websites into R objects.
First we download the HTML of the whole website.
library(dplyr)
library(rvest)
wiki_html <-
read_html(
"https://en.wikipedia.org/wiki/2018_United_States_House_of_Representatives_elections_in_Texas"
)
There are a few ways to get a specific object from an HTML file in this case
I dedided to look for the table that has the class name “wikitable plainrowheaders sortable”,
as I learned from inspecting the code, that the only table with that class is
the one we want to extract.
library(purrr)
html_nodes(wiki_html, "table") %>%
map_lgl( ~ html_attr(., "class") == "wikitable plainrowheaders sortable") %>%
which()
#> [1] 20
Then we can select table number 20 and convert it to a dataframe with html_table()
raw_table <-
html_nodes(wiki_html, "table")[[20]] %>%
html_table(fill = TRUE)
2. Clean up table.
The table has duplicated names, we can change that by using as_tibble() and its .name_repair argument. We then usedplyr::select() to get the columns. Furthermore we usedplyr::filter() to delete the first two rows, that have "District" as a value in theDistrictcolumn. Now the columns are still characters
vectors, but we need them to be numeric, therefore we first delete commas from
all columns and then transform columns 2 to 4 to numeric.
clean_table <-
raw_table %>%
as_tibble(.name_repair = "unique") %>%
filter(District != "District") %>%
mutate_all( ~ gsub(",", "", .)) %>%
mutate_at(2:4, as.numeric)
3. Select columns and 4. Calculate margins.
We use dplyr::select() to select the columns you are interested in and give them more helpful names.
Finally we calculate the margin between democratic and republican votes by first adding up there votes
as total_votes and then dividing the difference by total_votes.
clean_table %>%
select(District,
RepVote = Republican...2,
DemVote = Democratic...4,
OthVote = Others...6) %>%
mutate(
total_votes = RepVote + DemVote,
margin = abs(RepVote - DemVote) / total_votes * 100
)
#> # A tibble: 37 x 6
#> District RepVote DemVote OthVote total_votes margin
#> <chr> <dbl> <dbl> <chr> <dbl> <dbl>
#> 1 District 1 168165 61263 3292 229428 46.6
#> 2 District 2 139188 119992 4212 259180 7.41
#> 3 District 3 169520 138234 4604 307754 10.2
#> 4 District 4 188667 57400 3178 246067 53.3
#> 5 District 5 130617 78666 224 209283 24.8
#> 6 District 6 135961 116350 3731 252311 7.77
#> 7 District 7 115642 127959 0 243601 5.06
#> 8 District 8 200619 67930 4621 268549 49.4
#> 9 District 9 0 136256 16745 136256 100
#> 10 District 10 157166 144034 6627 301200 4.36
#> # … with 27 more rows
Edit: In case you want to go with the data provided by the state, it looks to me as if the data you are looking for is in the first, third and fourth column. So what you want to do is.
(All the code below is not tested, as I do not have the original data.)
read data into R
library(readr)
tx18 <- read_csv("filename.csv")
select relevant columns
tx18 <- tx18 %>%
select(c(1,3,4))
clean table
tx18 <- tx18 %>%
filter(!is.na(X3),
X3 != "Party",
X3 != "Race Total")
Group and summarize data by party
tx18 <- tx18 %>%
group_by(X3) %>%
summarise(votes = sum(X3))
Pivot/ Reshape data to wide format
tx18 %>$
pivot_wider(names_from = X3,
values_from = votes)
After this you could then calculate the margin similarly as I did with the Wikipedia data.
I got an XLSX with data from a questionnaire for my master thesis.
The questions and answers for an interviewee are in one row in the second column. The first column contains the date.
The data of the second column comes in a form like this:
"age":"52","height":"170","Gender":"Female",...and so on
I started with:
test12 <- read_xlsx("Testdaten.xlsx")
library(splitstackshape)
test13 <- concat.split(data = test12, split.col= "age", sep =",")
Then I got the questions and the answers as a column divided by a ":".
For e.g. column 1: "age":"52" and column2:"height":"170".
But the data is so messy that sometimes in the column of the age question and answer there is a height question and answer and for some questionnaires questions and answers double.
I would need the questions as variables and the answers as observations. But I have no clue how to get there. I could clean the data in excel first, but with the fact that columns are not constant and there are for e.g. some height questions in the age column I see no chance to do it as I will get new data regularly, formated the same way.
Here is an example of the data:
A tibble: 5 x 2
partner.createdAt partner.wphg.info
<chr> <chr>
1 2019-11-09T12:13:11.099Z "{\"age_years\":\"50\",\"job_des\":\"unemployed\",\"height_cm\":\"170\",\"Gender\":\"female\",\"born_in\":\"Italy\",\"Alcoholic\":\"false\",\"knowledge_selfass\":\"5\",\"total_wealth\":\"200000\""
2 2019-11-01T06:43:22.581Z "{\"age_years\":\"34\",\"job_des\":\"self-employed\",\"height_cm\":\"158\",\"Gender\":\"male\",\"born_in\":\"Germany\",\"Alcoholic\":\"true\",\"knowledge_selfass\":\"3\",\"total_wealth\":\"10000\""
3 2019-11-10T07:59:46.136Z "{\"age_years\":\"24\",\"height_cm\":\"187\",\"Gender\":\"male\",\"born_in\":\"England\",\"Alcoholic\":\"false\",\"knowledge_selfass\":\"3\",\"total_wealth\":\"150000\""
4 2019-11-11T13:01:48.488Z "{\"age_years\":\"59\",\"job_des\":\"employed\",\"height_cm\":\"167\",\"Gender\":\"female\",\"born_in\":\"United States\",\"Alcoholic\":\"false\",\"knowledge_selfass\":\"2\",\"total_wealth\":\"1000000~
5 2019-11-08T14:54:26.654Z "{\"age_years\":\"36\",\"height_cm\":\"180\",\"born_in\":\"Germany\",\"Alcoholic\":\"false\",\"knowledge_selfass\":\"5\",\"total_wealth\":\"170000\",\"job_des\":\"employed\",\"Gender\":\"male\""
Thank you so much for your time!
You can loop through each entry, splitting at , as you did. Then you can loop through them all again, splitting at :.
The result will be a bunch of variable/value pairings. This can be all done stacked. Then you just want to pivot back into columns.
data
Updated the data based on your edit.
data <- tribble(~partner.createdAt, ~partner.wphg.info,
'2019-11-09T12:13:11.099Z', '{\"age_years\":\"50\",\"job_des\":\"unemployed\",\"height_cm\":\"170\",\"Gender\":\"female\",\"born_in\":\"Italy\",\"Alcoholic\":\"false\",\"knowledge_selfass\":\"5\",\"total_wealth\":\"200000\"',
'2019-11-01T06:43:22.581Z', '{\"age_years\":\"34\",\"job_des\":\"self-employed\",\"height_cm\":\"158\",\"Gender\":\"male\",\"born_in\":\"Germany\",\"Alcoholic\":\"true\",\"knowledge_selfass\":\"3\",\"total_wealth\":\"10000\"',
'2019-11-10T07:59:46.136Z', '{\"age_years\":\"24\",\"height_cm\":\"187\",\"Gender\":\"male\",\"born_in\":\"England\",\"Alcoholic\":\"false\",\"knowledge_selfass\":\"3\",\"total_wealth\":\"150000\"',
'2019-11-11T13:01:48.488Z', '{\"age_years\":\"59\",\"job_des\":\"employed\",\"height_cm\":\"167\",\"Gender\":\"female\",\"born_in\":\"United States\",\"Alcoholic\":\"false\",\"knowledge_selfass\":\"2\",\"total_wealth\":\"1000000\"',
'2019-11-08T14:54:26.654Z', '{\"age_years\":\"36\",\"height_cm\":\"180\",\"born_in\":\"Germany\",\"Alcoholic\":\"false\",\"knowledge_selfass\":\"5\",\"total_wealth\":\"170000\",\"job_des\":\"employed\",\"Gender\":\"male\"')
libraries
We need a few here. Or you can just call tidyverse.
library(stringr)
library(purrr)
library(dplyr)
library(tibble)
library(tidyr)
function
This function will create a data frame (or tibble) for each question. The first column is the date, the second is the variable, the third is the value.
clean_record <- function(date, text) {
clean_records <- str_split(text, pattern = ",", simplify = TRUE) %>%
str_remove_all(pattern = "\\\"") %>% # remove double quote
str_remove_all(pattern = "\\{|\\}") %>% # remove curly brackets
str_split(pattern = ":", simplify = TRUE)
tibble(date = as.Date(date), variable = clean_records[,1], value = clean_records[,2])
}
iteration
Now we use pmap_dfr from purrr to loop over the rows, outputting each row with an id variable named record.
This will stack the data as described in the function. The mutate() line converts all variable names to lowercase. The distinct() line will filter out rows that are exact duplicates.
What we do then is just pivot on the variable column. Of course, replace data with whatever you name your data frame.
data_clean <- pmap_dfr(data, ~ clean_record(..1, ..2), .id = "record") %>%
mutate(variable = tolower(variable)) %>%
distinct() %>%
pivot_wider(names_from = variable, values_from = value)
result
The result is something like this. Note how I had reordered some of the columns, but it still works. You are probably not done just yet. All columns are now of type character. You need to figure out the desired type for each and convert.
# A tibble: 5 x 10
record date age_years job_des height_cm gender born_in alcoholic knowledge_selfass total_wealth
<chr> <date> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 1 2019-11-09 50 unemployed 170 female Italy false 5 200000
2 2 2019-11-01 34 self-employed 158 male Germany true 3 10000
3 3 2019-11-10 24 NA 187 male England false 3 150000
4 4 2019-11-11 59 employed 167 female United States false 2 1000000
5 5 2019-11-08 36 employed 180 male Germany false 5 170000
For example, convert age_years to numeric.
data_clean %>%
mutate(age_years = as.numeric(age_years))
I am sure you may run into other things, but this should be a start.
I have a dataframe that contains survey responses with each row representing a different person. One column - "Text" - is an open-ended text question. I would like to use Tidytext::unnest_tokens so that I do text analysis by each row, including sentiment scores, word counts, etc.
Here is the simple dataframe for this example:
Satisfaction<-c ("Satisfied","Satisfied","Dissatisfied","Satisfied","Dissatisfied")
Text<-c("I'm very satisfied with the services", "Your service providers are always late which causes me a lot of frustration", "You should improve your staff training, service providers have bad customer service","Everything is great!","Service is bad")
Gender<-c("M","M","F","M","F")
df<-data.frame(Satisfaction,Text,Gender)
I then turned the Text column into character...
df$Text<-as.character(df$Text)
Next I grouped by the id column and nested the dataframe.
df<-df%>%mutate(id=row_number())%>%group_by(id)%>%unnest_tokens(word,Text)%>%nest(-id)
Getting this far seems to have worked ok, but now how do I use purrr::map functions to work on the nested list column "word"? For example, if I want to create a new column using dplyr::mutate with word counts for each row?
Also, is there a better way to nest the dataframe so that only the "Text" column is a nested list?
I love using purrr::map to do modeling for different groups, but for what you are talking about doing, I think you can stick with just straight dplyr.
You can set up your dataframe like this:
library(dplyr)
library(tidytext)
Satisfaction <- c("Satisfied",
"Satisfied",
"Dissatisfied",
"Satisfied",
"Dissatisfied")
Text <- c("I'm very satisfied with the services",
"Your service providers are always late which causes me a lot of frustration",
"You should improve your staff training, service providers have bad customer service",
"Everything is great!",
"Service is bad")
Gender <- c("M","M","F","M","F")
df <- data_frame(Satisfaction, Text, Gender)
tidy_df <- df %>%
mutate(id = row_number()) %>%
unnest_tokens(word, Text)
Then to find, for example, the number of words per line, you can use group_by and mutate.
tidy_df %>%
group_by(id) %>%
mutate(num_words = n()) %>%
ungroup
#> # A tibble: 37 × 5
#> Satisfaction Gender id word num_words
#> <chr> <chr> <int> <chr> <int>
#> 1 Satisfied M 1 i'm 6
#> 2 Satisfied M 1 very 6
#> 3 Satisfied M 1 satisfied 6
#> 4 Satisfied M 1 with 6
#> 5 Satisfied M 1 the 6
#> 6 Satisfied M 1 services 6
#> 7 Satisfied M 2 your 13
#> 8 Satisfied M 2 service 13
#> 9 Satisfied M 2 providers 13
#> 10 Satisfied M 2 are 13
#> # ... with 27 more rows
You can do sentiment analysis by implementing an inner join; check out some examples here.