Using R to redistribute character values across multiple columns in a dataset

I am a new R user with a survey dataset that asks respondents (indicated by their IDs 100,101,102, and 103 in rows) to list common crimes and security problems in their neighborhoods. Types of incidents, crime1-crime3, are distributed across the columns. I know that the dataset is not well organized, but this is the structure of the output generated by Google form surveys for "select all that apply" questions.
I would like to write R code to reconfigure the dataset so that each type of crime/problem (for example, theft) has its own column. Then the character values could be replaced with 1. I have reorganized a small excerpt of the larger dataset by hand to show the end result I am looking for. Any suggestions would be greatly appreciated!
I initially tried to use gather() to collect all character values into one column and then redistribute them into new columns, but was unable to get it to work.
Original dataset:
respondentID crime1 crime2 crime3
100 vandalism other 0
101 other 0 0
102 drugs theft other
103 drugs theft vandalism
Trying to convert to:
respondentID drugs theft vandalism other
100 0 0 1 1
101 0 0 0 1
102 1 1 0 1
103 1 1 1 0

First, convert from wide to long:
df <- structure(list(respondentID = 100:103,
                     crime1 = c("vandalism", "other", "drugs", "drugs"),
                     crime2 = c("other", "0", "theft", "theft"),
                     crime3 = c("0", "0", "other", "vandalism")),
                class = "data.frame", row.names = c(NA, -4L))
library(dplyr)
library(tidyr)

df_long <- df %>%
  gather(key = "crime_no", value = "crime", -respondentID) %>%
  select(-crime_no) %>%
  filter(crime != "0")
Explanation: the first block loads your data (you should use dput next time you ask a question). gather() then converts it to a long format with one row per respondent-crime entry. We do not need the "crime_no" column, as you are not interested in whether a crime was listed 1st, 2nd or 3rd. Finally, we drop the "0" entries (we will fill those back in automatically later).
Now, calculate the statistics:
df_stat <- df_long %>% group_by(respondentID, crime) %>% summarise(n=n())
Explanation: while in your example data each respondent listed each type of crime at most once, I assume that this is not the general case, and sometimes you will see other numbers. We first group the data by respondent and crime, and then we count how many times each combination occurs.
And back to wide format:
df_wide <- df_stat %>% spread(key=crime, value=n, fill=0)
Explanation: we now convert to the "wide" format, where each crime has its own column. We use parameter fill=0, so if there is missing data (i.e., a person did not commit a given crime), then we will insert 0 rather than NA.
And here is your result:
respondentID drugs other theft vandalism
<int> <dbl> <dbl> <dbl> <dbl>
1 100 0 1 0 1
2 101 0 1 0 0
3 102 1 1 1 0
4 103 1 0 1 1
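As a side note, newer versions of tidyr (1.0.0 and later) supersede gather() and spread() with pivot_longer() and pivot_wider(). A sketch of the same pipeline in that style, assuming tidyr >= 1.1 (for values_fill):

library(dplyr)
library(tidyr)

df_wide <- df %>%
  # one row per respondent/crime pair
  pivot_longer(-respondentID, names_to = "crime_no", values_to = "crime") %>%
  select(-crime_no) %>%
  filter(crime != "0") %>%
  # count() is shorthand for group_by() + summarise(n = n())
  count(respondentID, crime) %>%
  # one column per crime, with 0 instead of NA for missing combinations
  pivot_wider(names_from = crime, values_from = n, values_fill = 0)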
Next time you ask a question, please:
- use dput so we can easily load your example data
- show some code: how did you try to tackle the problem on your own?

Related

How can I group by one variable in terms of status of a different variable in a longitudinal situation in R?

I'm new to R, so please go easy on me... I have some longitudinal data with an ID, a Location and a Completion_status for each record (a sample is reproduced at the bottom of the answer below).
Basically, I'm trying to find a way to get a table with a) the number of unique cases that have all complete data and b) the number of unique cases that have at least one incomplete or missing record, per location. My attempts so far:
df<- df %>% group_by(Location)
df1<- df %>% group_by(any(Completion_status=='Incomplete' | 'Missing'))
I'm not sure exactly what you want, since there seems to be some inconsistency between your request and the desired output, but let's try: it seems you need a kind of frequency table, which you can manage with base R. At the bottom of the answer you can find some data similar to yours.
# You have two cases, Complete and everything else, so first a new column flagging them:
data$case <- ifelse(data$Completion_status == 'Complete', 'Complete', 'MorIn')
# Now a frequency table of Location vs. case; as.data.frame.matrix() makes it a data.frame:
result <- as.data.frame.matrix(table(data$Location, data$case))
# Move the location from the rownames into a proper column:
result$Location <- rownames(result)
# And lastly a data.frame with the final results; note that you can change the column
# names, but if you want spaces in them a tibble is better:
result <- data.frame(Location = result$Location,
                     Number.complete = result$Complete,
                     Number.incomplete.missing = result$MorIn)
result
Location Number.complete Number.incomplete.missing
1 London 0 1
2 Los Angeles 0 1
3 Paris 3 1
4 Phoenix 0 2
5 Toronto 1 1
Or if you prefer a dplyr chain:
data %>%
  mutate(case = ifelse(Completion_status == 'Complete', 'Complete', 'MorIn')) %>%
  do(as.data.frame.matrix(table(.$Location, .$case))) %>%
  mutate(Location = rownames(.)) %>%
  select(3, 1, 2) %>%
  `colnames<-`(c("Location", "Number of complete", "Number of incomplete or missing"))
     Location Number of complete Number of incomplete or missing
1      London                  0                               1
2 Los Angeles                  0                               1
3       Paris                  3                               1
4     Phoenix                  0                               2
5     Toronto                  1                               1
With data:
# here is your data (next time, try to include it in a usable way in the question)
data <- data.frame(ID = c("A1","A1","A2","A2","B1","C1","C2","D1","D2","E1"),
                   Location = c('Paris','Paris','Paris','Paris','London','Toronto',
                                'Toronto','Phoenix','Phoenix','Los Angeles'),
                   Completion_status = c('Complete','Complete','Incomplete','Complete',
                                         'Incomplete','Missing','Complete','Incomplete',
                                         'Incomplete','Missing'))
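For completeness, the same table can also be built without do(), using count() plus pivot_wider() from tidyr — a sketch, assuming tidyr >= 1.1 (for values_fill):

library(dplyr)
library(tidyr)

data %>%
  mutate(case = ifelse(Completion_status == 'Complete', 'Complete', 'MorIn')) %>%
  # one row per Location/case combination with its count n
  count(Location, case) %>%
  # one column per case, 0 where a combination never occurs
  pivot_wider(names_from = case, values_from = n, values_fill = 0)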

binning, grouping data in r using values within a specific range to determine an event or change for survival analysis

I have a dataframe as follows:
df <- data.frame(as.date=c("14/06/2016","15/06/2016","16/06/2016","17/06/2016","18/06/2016","19/06/2016","20/06/2016","21/06/2016","22/06/2016","23/06/2016",
"24/06/2016","04/07/2016","05/07/2016","06/07/2016","07/07/2016","08/07/2016","09/07/2016","10/07/2016","11/07/2016","12/07/2016",
"13/07/2016","14/07/2016","15/07/2016","17/07/2016","18/07/2016","19/07/2016","20/07/2016","21/07/2016","22/07/2016","01/08/2016",
"02/08/2016","03/08/2016","04/08/2016","05/08/2016","06/08/2016","07/08/2016","08/08/2016","09/08/2016","10/08/2016","11/08/2016",
"12/08/2016","13/08/2016","14/08/2016","15/08/2016","16/08/2016","17/08/2016","18/08/2016","19/08/2016","20/08/2016","21/08/2016",
"22/08/2016","23/08/2016","24/08/2016","25/08/2016","26/08/2016","27/08/2016","28/08/2016","29/08/2016","30/08/2016","31/08/2016",
"01/09/2016","02/09/2016","03/09/2016","04/09/2016","05/09/2016","06/09/2016","07/09/2016","08/09/2016","09/09/2016","10/09/2016",
"11/09/2016","12/09/2016","13/09/2016","14/09/2016","15/09/2016","16/09/2016","17/09/2016","18/09/2016","19/09/2016","20/09/2016"),
wear=c("0","55","0","0","0","0","8","8","15","25","30","37","43","49","52","52","55","57","57","61","67","69","2","2","7",
"10","13","14","16","16","19","22","22","24","25","26","29","29","33","34","34","36","38","44","45","48","50","55",
"56","58","0","4","0","4","4","6","9","9","12","14","16","17","25","25","33","36","44","46","48","52","55","59",
"8","9","9","12","24","33","36","44"))
The data is an example of the wear rate of a type of metal part on a machine: the wear increases over time, then drops, indicating an event or a change. The problem I have is that the wear value doesn't always drop to 0, as you can see from the data. There are 2 variables:
as.date = date over time
wear = wear of the metal part over time
The ranges between changes are: 55-0, 69-2, 58-0, 59-8.
When the wear drops from a large number to 0, it is easy to code. I use the following code to detect the change and to add two new variables called Status & id:
# Creates 2 new columns, Status & id
df$Status <- 0                # fill Status with zeros
df$Status[df$wear == 0] <- 1  # set Status to 1 where wear == 0
df$id <- 1                    # start the id counter at 1
for (i in 2:nrow(df)) {
  if (df$Status[i - 1] == 0) {
    df$id[i] <- df$id[i - 1]
  } else {
    df$id[i] <- df$id[i - 1] + 1
  }
}
It works OK to catch a drop in wear values to 0, but not when there isn't one: as in the data example, the drops take place from 55-0, 69-2, 58-0, and 59-8, and within the real data set there are occasions when the drop in wear goes to negative values. I am not sure of the correct way to handle this; I tried messing around with binning and grouping the data but was unsuccessful.
This is a sample of the data; in the real data set there are 100+ events, mostly a wear value dropping to 0, but on 10-20 occasions it drops either to negative values or to values < 10.
I think a for-loop is inefficient here. We can do something like this using the dplyr and lubridate packages.
library(dplyr)
library(lubridate)
df2 <- df %>%
  # Convert the as.date column to date class
  # and the wear column to numeric
  mutate(as.date = dmy(as.date),
         wear = as.numeric(as.character(wear))) %>%
  # Create a column showing the wear of the previous record
  mutate(wear2 = lag(wear)) %>%
  mutate(Diff = wear - wear2)
The idea is to shift the wear column by 1 and then calculate the difference between the wear on each date and on the previous date. The results are saved in the new column Diff. Here is what the new data frame looks like.
head(df2)
# as.date wear wear2 Diff
# 1 2016-06-14 0 NA NA
# 2 2016-06-15 55 0 55
# 3 2016-06-16 0 55 -55
# 4 2016-06-17 0 0 0
# 5 2016-06-18 0 0 0
# 6 2016-06-19 0 0 0
After this, you can define a threshold on Diff to identify the end of a period. For example, here I defined the threshold to be -50. You can see that the filter function successfully identifies all four periods.
# Filter Diff <= -50
df2 %>% filter(Diff <= -50)
# as.date wear wear2 Diff
# 1 2016-06-16 0 55 -55
# 2 2016-07-15 2 69 -67
# 3 2016-08-22 0 58 -58
# 4 2016-09-13 8 59 -51
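If you also want the id column from your original for-loop (a running event number), the same threshold can drive a vectorised cumsum() instead — a sketch, reusing df2 and the -50 threshold from above:

df3 <- df2 %>%
  # TRUE whenever a new period starts, i.e. wear dropped by 50 or more
  mutate(new_period = !is.na(Diff) & Diff <= -50,
         id = cumsum(new_period) + 1)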
One final note: in your original data frame, the wear column is a factor, but you calculate with it as if it were numeric. This is dangerous. I used wear = as.numeric(as.character(wear)) to convert the column to numeric, but it would be better if you created a numeric column in the first place.

How to check for skipped values in a series in an R dataframe column?

I have a dataframe price1 in R that has four columns:
Name Week Price Rebate
Car 1 1 20000 500
Car 1 2 20000 400
Car 1 5 20000 400
---- -- ---- ---
Car 1 54 20400 450
There are ten Car names in all in price1, so the above is just to give an idea of the structure. Each car name should have 54 observations corresponding to 54 weeks. But there are some weeks for which no observation exists (e.g., Weeks 3 and 4 in the above case). For these missing weeks, I need to plug in information from another dataframe price2:
Name AveragePrice AverageRebate
Car 1 20000 500
Car 2 20000 400
Car 3 20000 400
---- ---- ---
Car 10 20400 450
So, I need to identify the missing week for each Car name in price1, capture the row corresponding to that Car name in price2, and insert the row in price1. I just can't wrap my head around a possible approach, so unfortunately I do not have a code snippet to share. Most of my search in SO is leading me to answers regarding handling missing values, which is not what I am looking for. Can someone help me out?
I am also indicating the desired output below:
Name Week Price Rebate
Car 1 1 20000 500
Car 1 2 20000 400
Car 1 3 20200 410
Car 1 4 20300 420
Car 1 5 20000 400
---- -- ---- ---
Car 1 54 20400 450
---- -- ---- ---
Car 10 54 21400 600
Note that the output now has Car 1 info for Weeks 3 and 4, which I should fetch from price2. The final output should contain 54 observations for each of the 10 car names, so 540 rows in total.
Try this, good luck:
library(data.table)
carNames <- paste('Car', 1:10)
df <- data.table(Name = rep(carNames, each = 54), Week = rep(1:54, times = 10))
df <- merge(df, price1, by = c('Name', 'Week'), all.x = TRUE)
df <- merge(df, price2, by = 'Name', all.x = TRUE)
df[, `:=`(Price = ifelse(is.na(Price), AveragePrice, Price),
          Rebate = ifelse(is.na(Rebate), AverageRebate, Rebate))]
df[, 1:4]
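If your data.table version is recent enough (>= 1.12.4, which provides fcoalesce()), the NA-filling step above can be written a bit more compactly — a sketch:

# assumes Price/AveragePrice and Rebate/AverageRebate share the same type
df[, `:=`(Price = fcoalesce(Price, AveragePrice),
          Rebate = fcoalesce(Rebate, AverageRebate))]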
So if I understand your problem correctly, you basically have 2 dataframes and you want to make sure the dataframe "price1" has the correct values (the names of the cars) in the 'Name' column?
Here's what I would do, but it probably isn't the optimal way:
# create a loop over the rows of your frame
for (i in 1:nrow(price1)) {
  # check whether the Price value is NA
  if (is.na(price1$Price[i])) {
    # if it is NA, replace Price and Rebate with the corresponding
    # values for that car in price2
    j <- match(price1$Name[i], price2$Name)
    price1$Price[i] <- price2$AveragePrice[j]
    price1$Rebate[i] <- price2$AverageRebate[j]
  }
}
Hope this helps (:
If I understand your question correctly, you only want to see what is in the 2nd table and not in the first. You will just want to use an anti_join. Note that the order you feed the tables into the anti_join matters.
library(tidyverse)
complete_table <- price2 %>%
  anti_join(price1)
To expand your first table to cover all 54 weeks, use complete(), or you can even fudge it and right_join() a table that you purposely build with all 54 weeks in it. Then anything that doesn't join to this second table gets an NA in that column.
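A sketch of the complete() route, assuming price1 and price2 are structured as in the question and the price columns are numeric (coalesce() then fills each missing value from the car's average):

library(dplyr)
library(tidyr)

price_filled <- price1 %>%
  # one row per Name/Week combination, with NAs where a week was missing
  complete(Name, Week = 1:54) %>%
  # bring in the per-car averages
  left_join(price2, by = "Name") %>%
  # fill the gaps from the averages
  mutate(Price = coalesce(Price, AveragePrice),
         Rebate = coalesce(Rebate, AverageRebate)) %>%
  select(Name, Week, Price, Rebate)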

Delete row from data.frame based on condition

I have some repeated measures data I'm trying to clean in R. At this point, it is in the long format and I'm trying to fix some entries before I move to a wide format - for example, if people took my survey too many times I'm going to drop the rows. I have two main problems that I'm trying to solve:
Changing an entry
If someone took the survey from the "pre-test link" when it was actually supposed to be a post-test, I'm fixing it with the following code:
data[data$UserID == 52118254, "Prepost"][2] <- 2
This filters out the entries from that person based on ID, then changes the second entry to be coded as a post-test. This code has enough meaning that reviewing it tells me what is happening.
Dropping a row
I'm struggling to get meaningful code to delete extra rows - for example if someone accidentally clicked on my link twice. I have data like the following:
UserID Prepost Duration..in.seconds.
1 52118250 1 357
2 52118284 1 226
3 52118284 1 11 #This is an extra attempt to remove
4 52118250 2 261
5 52118284 2 151
#to reproduce:
structure(list(UserID = c(52118250, 52118284, 52118284, 52118250, 52118284),
               Prepost = c("1", "1", "1", "2", "2"),
               Duration..in.seconds. = c("357", "226", "11", "261", "151")),
          class = "data.frame", row.names = c(NA, -5L),
          .Names = c("UserID", "Prepost", "Duration..in.seconds."))
I can filter by UserID to see who has taken it too many times and I'm looking for a way to easily remove those rows from the dataset. In this case, UserID 52118284 has taken it three times and the second attempt needs to be removed. If it is "readable" like the other fix that is better.
I'd use a collection of dplyr functions as shown below. To explain:
group_by(UserID, Prepost) will help to apply functions separately to each user within each test phase (so legitimate post-tests are not treated as duplicates of pre-tests).
mutate(click_n = row_number()) iteratively counts each user's appearances within a phase and saves the result as a new variable click_n.
library(dplyr)
data %>%
  group_by(UserID, Prepost) %>%
  mutate(click_n = row_number())
#> Source: local data frame [5 x 4]
#> Groups: UserID, Prepost [4]
#>
#>     UserID Prepost Duration..in.seconds. click_n
#>      <dbl>   <chr>                 <chr>   <int>
#> 1 52118250       1                   357       1
#> 2 52118284       1                   226       1
#> 3 52118284       1                    11       2
#> 4 52118250       2                   261       1
#> 5 52118284       2                   151       1
filter(click_n == 1) can then be used to keep only 1st attempts as shown below.
data <- data %>%
  group_by(UserID, Prepost) %>%
  mutate(click_n = row_number()) %>%
  filter(click_n == 1)
data
#> Source: local data frame [4 x 4]
#> Groups: UserID, Prepost [4]
#>
#>     UserID Prepost Duration..in.seconds. click_n
#>      <dbl>   <chr>                 <chr>   <int>
#> 1 52118250       1                   357       1
#> 2 52118284       1                   226       1
#> 3 52118250       2                   261       1
#> 4 52118284       2                   151       1
Note that this approach assumes that your data frame is ordered. I.e., first clicks appear close to the top.
If you're unfamiliar with %>%, look for help on the "pipe operator".
EXTRA:
To bring the comment into the answer: once you're comfortable with what's going on here, you can skip the mutate line and just do the following:
data %>% group_by(UserID, Prepost) %>% filter(row_number() == 1)
A simple solution to remove duplicates is below:
subset(data, !duplicated(data$UserID))
However, you may want to consider also subsetting by duration, such as if the duration is less than 30 seconds.
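Given the pre/post structure above, you may also want to define a duplicate by the UserID/Prepost pair rather than by UserID alone — a base R sketch of that variant:

# duplicated() on a two-column data.frame flags repeated UserID/Prepost pairs
subset(data, !duplicated(data[c("UserID", "Prepost")]))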
Thanks @Simon for the suggestions. One criterion I wanted was that the code made sense as I "read" it. As I thought more, another criterion was that I wanted to be deliberate about what changes to make. So I incorporated Simon's recommendation to make a separate column and then used dplyr::filter() to exclude those rows. Here's what an example segment of code looked like:
#Change pre/post entries
data[data$UserID == 52118254, "Prepost"][2] <- 2
#Mark rows to delete
data$toDelete <- NA #Makes new empty column for marking deletions
data[data$UserID == 52118284,][2, "toDelete"] <- 1 #Marks row for deletion
#Filter to exclude rows
data <- data %>% filter(is.na(toDelete))
#Optionally add "%>% select(-toDelete)" to remove the extra column
In my context, advantages here are that everything is deliberate rather than automatic and changes are anchored to data rather than row numbers that might change. I'd still welcome any feedback or other ways of achieving this (maybe in a single step).

Looping through csv and dumping to data frame in R

Using R, I am trying to take a csv file, loop through it, extract values, and dump them into a data frame. There are four columns in the csv: ID, UG_inst, Freq, and Year. Specifically, I want to loop through the UG_inst column by institution name for each year (2010-11,2011-12,2012-13,and 2013-14) and put the value at that cell into the respective "cell" in the R data frame. Right now, the csv just has a Year column, but the data frame I've created has a column for each year. The ultimate idea is to be able to create bar graphs representing the frequency per institution per year. Currently, the code below throws up NO errors, but appears to do nothing to the R data frame "j".
A couple of caveats: 1) Doing a nested for loop was making my head spin, so I decided to just use 2010-11 for now and just loop through the institution name. Since there are only 4 years, I can rewrite this four times, each time with a different year. 2) Also, in the csv, there are repeat names. So, if an institution name appears twice (will be adjacent rows in the csv due to alphabetical arrangement), is there a way to dump the SUM of these into the data frame in R?
All relevant info is below. Thanks so much for any help!!!!
Here is a link to the .csv file: https://www.dropbox.com/s/9et7muchkrgtgz7/UG_inst_ALL.csv
And here is the R code I am trying:
abc <- read.csv(insert file path to above csv here)
inst_string <- unique(abc$UG_inst)
j <- data.frame("UG_inst"=inst_string,"2010-11"=NA,"2011-12"=NA,"2012-13"=NA,"2013-14"=NA)
for (i in inst_string) {
inst.index <- which(abc$UG_inst == i && abc$Year == "2010-11")
j$X2010.11[j$Ug_inst==i] <- abc$Freq[inst.index]
}
Instead of using a nested loop (or a loop at all) I suggest using the reshape() function in base R.
abc <- read.csv("UG_inst_ALL.csv")
abc <- abc[2:4]
reshape(data = abc,
v.names = "Freq",
timevar = "Year",
idvar = "UG_inst",
direction = "wide")
This is known as "reshaping" your data, and you are going from a "long" format to a "wide" format.
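Note that reshape() names the new columns by pasting v.names and the year together (e.g. Freq.2010-11). A sketch of stripping that prefix afterwards, assuming the call above is assigned to wide:

wide <- reshape(data = abc, v.names = "Freq", timevar = "Year",
                idvar = "UG_inst", direction = "wide")
# drop the "Freq." prefix so the columns are just the years
names(wide) <- sub("^Freq\\.", "", names(wide))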
In addition to base R's reshape function, here are a few other options to consider.
I'll assume that we are starting with data read in like the following.
abc <- read.csv("~/Downloads/UG_inst_ALL.csv", row.names = 1)
head(abc)
# UG_inst Freq Year
# 1 Abilene Christian University 0 2010-11
# 2 Adams State University 0 2010-11
# 3 Adrian College 1 2010-11
# 4 Agnes Scott College 0 2010-11
# 5 Alabama A&M University 1 2010-11
# 6 Albion College 1 2010-11
Option 1: xtabs
out <- as.data.frame.matrix(xtabs(Freq ~ UG_inst + Year, abc))
head(out)
# 2010-11 2011-12 2012-13 2013-14
# Abilene Christian University 0 1 0 0
# Adams State University 0 0 0 1
# Adrian College 1 0 0 0
# Agnes Scott College 0 0 1 0
# Alabama A&M University 1 3 1 2
# Albion College 1 0 0 0
Option 2: dcast from "reshape2"
library(reshape2)
head(dcast(abc, UG_inst ~ Year, value.var = "Freq"))
Option 3: spread from "tidyr"
library(dplyr)
library(tidyr)
abc %>% spread(Year, Freq)
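As a footnote, spread() has since been superseded in tidyr (>= 1.0.0) by pivot_wider(), which also handles the duplicate-institution caveat from the question directly — a sketch, where values_fn = sum adds up repeated institution/year rows and values_fill = 0 fills missing combinations:

library(tidyr)
pivot_wider(abc, names_from = Year, values_from = Freq,
            values_fn = sum, values_fill = 0)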
