R cleaning, change data format horizontal to vertical, repeating some data - r

Right now, my data looks like this:
Coder Bill Witness1name Witness1job Witness2name Witness2Job
Joe 123 Fred Plumber Bob Coach
Karen 122 Sally Barista Helen Translator
Harry 431 Lisa Swimmer N/A N/A
Frank 301 N/A N/A N/A N/A
But I want my data to look like this:
Coder Bill WitnessName WitnessJob
Joe 123 Fred Plumber
Joe 123 Bob Coach
Karen 122 Sally Barista
Karen 122 Helen Translator
Harry 431 Lisa Swimmer
Frank 301 N/A N/A
So I want to take it from the coder/bill level to the "witness" level. Some coder/bills have up to 10 witnesses in their rows. Some have no witnesses, but I do not want to completely drop them from the dataset (see Frank).
All help is appreciated! I am familiar with the tidyverse package.

For those interested, I figured it out.
I had to change all the column names like this:
Witness1Name to Witness1_Name
Witness1Job to Witness1_Job
etc.
Then I ran this:
library(tidyr)
cleandata <- pivot_longer(mddata, cols = -c(Coder, Bill),
names_to = c("Witness", ".value"),
# \\d+ rather than \\d, so that two-digit witnesses such as Witness10 also match
names_pattern = 'Witness(\\d+)_(.*)') %>%
drop_na(Name)
And it gave me this:
Coder Bill Witness Name Job
Joe 123 1 Fred Plumber
Joe 123 2 Bob Coach
Karen 122 1 Sally Barista
Karen 122 2 Helen Translator
Harry 431 1 Lisa Swimmer
Close enough to what I wanted
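The result above drops Frank, because drop_na(Name) removes every row without a witness. A minimal sketch of one way to keep such coders, assuming the column names have already been changed to the Witness1_Name form and that missing witnesses are real NA values rather than the string "N/A" (the small mddata below is rebuilt by hand from the question for illustration):

```r
library(dplyr)
library(tidyr)

# Example data rebuilt from the question; NA marks a missing witness.
mddata <- tibble(
  Coder = c("Joe", "Karen", "Harry", "Frank"),
  Bill = c(123, 122, 431, 301),
  Witness1_Name = c("Fred", "Sally", "Lisa", NA),
  Witness1_Job = c("Plumber", "Barista", "Swimmer", NA),
  Witness2_Name = c("Bob", "Helen", NA, NA),
  Witness2_Job = c("Coach", "Translator", NA, NA)
)

cleandata <- mddata %>%
  pivot_longer(cols = -c(Coder, Bill),
               names_to = c("Witness", ".value"),
               names_pattern = "Witness(\\d+)_(.*)") %>%
  group_by(Coder, Bill) %>%
  # Keep every real witness; if a coder/bill has none at all,
  # keep only its first (empty) witness row instead of dropping it.
  filter(!is.na(Name) | (Witness == "1" & all(is.na(Name)))) %>%
  ungroup()
```

With this filter, Harry keeps only Lisa, while Frank keeps a single row whose Name and Job are NA.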

Related

R- How to reshape Long to Wide with multiple variables/columns

I started off with the following subset of my data
UserID labelnospaces responses
1 Were you given any info? yes
1 By using this service..? yes
1 How satisfied are you? Very satisfied
2 Were you given any info? no
2 By using this service..? no
2 How satisfied are you? unsatisfied
By using the code below, I was able to get from long to wide perfectly
service_L_to_W<- reshape(data=service, idvar="UserID",
timevar = "labelnospaces",
direction = "wide")
Using the code above, I got (this is what I wanted)
UserID Were you given any info? By using this service..? How satisfied are you?
1 yes yes very satisfied
2 no no unsatisfied
My question is how do I edit my code so that I can convert my data (with the extra variables/columns) from long to wide:
UserID Full Name DOB EncounterID QuestionID Name Type labelnospaces responses
1 John Smith 1-1-90 13 505 Intro Check Were you given any info? yes
1 John Smith 1-1-90 13 506 Care Check By using this service.. yes
1 John Smith 1-1-90 13 507 Out Check How satisfied are you? vsat
2 Jane Doe 2-2-80 14 505 Intro Check Were you given any info? no
2 Jane Doe 2-2-80 14 506 Care Check By using this service.. no
2 Jane Doe 2-2-80 14 507 Out Check How satisfied are you? unsat
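Some of the extra variables can be combined into the new column names (QName is assumed here to be the question's Name column, renamed so that it does not clash with Full.Name):
library(tidyr)
df %>%
pivot_wider(id_cols = c(UserID, Full.Name, DOB, EncounterID), names_from = c(QuestionID, QName, labelnospaces), values_from = responses)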
UserID Full.Name DOB EncounterID `505_Intro_Were you given any info?` `506_Care_By using this service..`
<int> <chr> <chr> <int> <chr> <chr>
1 1 John Smith 1-1-90 13 yes yes
2 2 Jane Doe 2-2-80 14 no no
`507_Out_How satisfied are you?`
<chr>
1 vsat
2 unsat
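The original base R reshape() call can also be adapted, as a rough sketch: pass all the per-user columns as idvar and leave out the columns that vary per question (such as QuestionID), since reshape() would otherwise try to widen those too. The data frame below is typed in by hand from the question, with Full.Name as an assumed column name:

```r
# Example data rebuilt from the question (a subset of its columns).
service <- data.frame(
  UserID = c(1, 1, 1, 2, 2, 2),
  Full.Name = rep(c("John Smith", "Jane Doe"), each = 3),
  DOB = rep(c("1-1-90", "2-2-80"), each = 3),
  EncounterID = rep(c(13, 14), each = 3),
  QuestionID = rep(c(505, 506, 507), times = 2),
  labelnospaces = rep(c("Were you given any info?",
                        "By using this service..",
                        "How satisfied are you?"), times = 2),
  responses = c("yes", "yes", "vsat", "no", "no", "unsat"),
  stringsAsFactors = FALSE
)

id_cols <- c("UserID", "Full.Name", "DOB", "EncounterID")

# Keep only the id columns, the question label and the response,
# then spread the responses into one column per question.
service_wide <- reshape(
  service[, c(id_cols, "labelnospaces", "responses")],
  idvar = id_cols,
  timevar = "labelnospaces",
  direction = "wide"
)
```

The new columns are named by prefixing each question label with "responses.", so they may need renaming afterwards.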

How to FILL DOWN (autofill) value , eg replace NA with first value in group, using data.table in R?

Very simple and common task:
I need to FILL DOWN in data.table (similar to autofill function in MS Excel) so that
library(data.table)
DT <- fread(
"Paul 32
NA 45
NA 56
John 1
NA 5
George 88
NA 112")
becomes
Paul 32
Paul 45
Paul 56
John 1
John 5
George 88
George 112
Thank you!
Yes, the best way to do this is to use @Rui Barradas's idea of the zoo package. You can do it in one line of code with the na.locf function.
library(zoo)
# na.rm = FALSE keeps a leading NA in place instead of dropping it,
# so the result always has the same length as the column
DT[, V1 := na.locf(V1, na.rm = FALSE)]
Replace V1 with whatever you name your column after reading in the data with fread. Good luck!
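If adding zoo is not an option, the same fill-down can be sketched in base R. fill_down below is a hypothetical helper name, not part of any package; it carries the last non-NA value forward and leaves leading NAs untouched:

```r
# Hypothetical helper: last observation carried forward, base R only.
fill_down <- function(x) {
  # Position of the most recent non-NA value at each element (0 if none yet).
  idx <- cummax(seq_along(x) * (!is.na(x)))
  # No earlier value to copy: index with NA so the result stays NA.
  idx[idx == 0] <- NA
  x[idx]
}

V1 <- c("Paul", NA, NA, "John", NA, "George", NA)
filled <- fill_down(V1)
```

Inside a data.table this could be used as DT[, V1 := fill_down(V1)].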
For example 2, you can consider using stats::spline for extrapolation as follows:
DT2[is.na(V2), V2 :=
as.integer(DT2[, spline(.I[!is.na(V2)], V2[!is.na(V2)], xout=.I[is.na(V2)]), by=.(V1)]$y)]
output:
V1 V2
1: Paul 1
2: Paul 2
3: Paul 3
4: Paul 4
5: John 100
6: John 110
7: John 120
8: John 130
data:
DT2 <- fread(
"Paul, 1
Paul, 2
Paul, NA
Paul, NA
John, 100
John, 110
John, NA
John, NA")

If conditions and copying values from different rows

I have the following data:
Data <- data.frame(Project=c(123,123,123,123,123,123,124,124,124,124,124,125,125,125),
Name=c("Harry","David","David","Harry","Peter","Peter","John","Alex","Alex","Mary","Mary","Dan","Joe","Joe"),
Value=c(1,4,7,3,8,9,8,3,2,5,6,2,2,1),
OldValue=c("","Open","In Progress","Complete","Open","In Progress","Complete","Open","In Progress","System Declined","In Progress","","Open","In Progress"),
NewValue=c("Open","In Progress","Complete","Open","In Progress","Complete","Open","In Progress","System Declined","In Progress","Complete","Open","In Progress","Complete"))
The data should look like this
I want to create another column called EditedBy that applies the following logic.
IF the project in row 1 equals the project in row 2 AND the NewValue in row 1 equals "Open", THEN take the name from row 2. If either of these two conditions is false, then stick with the name in the first row.
So the data should look like this
How can I do this?
We can do this with data.table
library(data.table)
setDT(Data)[, EditedBy := Name[2L] ,.(Project, grp=cumsum(NewValue == "Open"|
shift(NewValue == "System Declined", fill=TRUE)))]
Data
# Project Name Value OldValue NewValue EditedBy
# 1: 123 Harry 1 Open David
# 2: 123 David 4 Open In Progress David
# 3: 123 David 7 In Progress Complete David
# 4: 123 Harry 3 Complete Open Peter
# 5: 123 Peter 8 Open In Progress Peter
# 6: 123 Peter 9 In Progress Complete Peter
# 7: 124 John 8 Complete Open Alex
# 8: 124 Alex 3 Open In Progress Alex
# 9: 124 Alex 2 In Progress System Declined Alex
#10: 124 Mary 5 System Declined In Progress Mary
#11: 124 Mary 6 In Progress Complete Mary
#12: 125 Dan 2 Open Joe
#13: 125 Joe 2 Open In Progress Joe
#14: 125 Joe 1 In Progress Complete Joe
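For comparison, the rule can also be applied literally, row by row, in base R. This is only a sketch of the rule as worded in the question (compare each row with the one below it), not a restatement of the data.table grouping above; it leaves already-edited rows such as "In Progress" with their own name:

```r
Data <- data.frame(
  Project = c(123,123,123,123,123,123,124,124,124,124,124,125,125,125),
  Name = c("Harry","David","David","Harry","Peter","Peter","John","Alex",
           "Alex","Mary","Mary","Dan","Joe","Joe"),
  NewValue = c("Open","In Progress","Complete","Open","In Progress","Complete",
               "Open","In Progress","System Declined","In Progress","Complete",
               "Open","In Progress","Complete"),
  stringsAsFactors = FALSE
)

# Values from the following row; the last row has no successor, hence NA.
next_project <- c(Data$Project[-1], NA)
next_name <- c(Data$Name[-1], NA)

# TRUE where the next row is in the same project and this row is "Open".
take_next <- !is.na(next_project) &
  Data$Project == next_project &
  Data$NewValue == "Open"

Data$EditedBy <- ifelse(take_next, next_name, Data$Name)
```

On this particular data set the result matches the EditedBy column shown above.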

How to see n rows after a specific row

I'm trying to track user actions but I'd like to see what they do AFTER a specific event. How do I get the next n amount of lines?
For example below, I'd like to know what the user is doing after they "Get mushroom" to see if they are eating it. I'd want to reference the "Get mushroom" for each User and see the next few lines after that.
User Action
Bob Enter chapter 1
Bob Attack
Bob Jump
Bob Get mushroom
Bob Open inventory
Bob Eat mushroom
Bob Close inventory
Bob Run
Mary Enter chapter 1
Mary Get mushroom
Mary Attack
Mary Jump
Mary Attack
Mary Open inventory
Mary Close inventory
I'm not sure how to approach this after grouping by users. The expected result would be something like the below if I wanted 3 lines after the event:
User Action
Bob Get mushroom # Action I want to find and the next 3 lines below it
Bob Open inventory
Bob Eat mushroom
Bob Close inventory
Mary Get mushroom # Action I want to find and the next 3 lines below it
Mary Attack
Mary Jump
Mary Attack
Thank you.
Two alternatives with dplyr and data.table:
library(dplyr)
df1 %>%
group_by(User) %>%
slice(rep(which(Action == 'Get-mushroom'), each=4) + 0:3)
library(data.table)
setDT(df1)[df1[, rep(.I[Action == 'Get-mushroom'], each=4) + 0:3, User]$V1]
both result in:
User Action
1: Bob Get-mushroom
2: Bob Open-inventory
3: Bob Eat-mushroom
4: Bob Close-inventory
5: Mary Get-mushroom
6: Mary Attack
7: Mary Jump
8: Mary Attack
Try this:
df
User Action
1 Bob Enterchapter1
2 Bob Attack
3 Bob Jump
4 Bob Getmushroom
5 Bob Openinventory
6 Bob Eatmushroom
7 Bob Closeinventory
8 Bob Run
9 Mary Enterchapter1
10 Mary Getmushroom
11 Mary Attack
12 Mary Jump
13 Mary Attack
14 Mary Openinventory
15 Mary Closeinventory
indices <- which(df$Action == 'Getmushroom')
n <- 3
# ensure that x + n does not go beyond the #rows of df
do.call(rbind, lapply(indices, function(x)df[x:min(x+n, nrow(df)),]))
User Action
4 Bob Getmushroom
5 Bob Openinventory
6 Bob Eatmushroom
7 Bob Closeinventory
10 Mary Getmushroom
11 Mary Attack
12 Mary Jump
13 Mary Attack
First, use which to find the indices of the rows whose Action is "Get mushroom".
Then use lapply over those indices and build each run of the next 3 indices with seq.
args <- which(df$Action == "Get mushroom")
df[unlist(lapply(args, function(x) seq(x, x+3))), ]
# User Action
#4 Bob Get mushroom
#5 Bob Open inventory
#6 Bob Eat mushroom
#7 Bob Close inventory
#10 Mary Get mushroom
#11 Mary Attack
#12 Mary Jump
#13 Mary Attack
Or a similar approach (as suggested by @Sotos in the comments)
df[sapply(args, function(x) seq(x, x+3)), ]
This sapply solution works on a data.frame but not on a data.table, because sapply returns a matrix (one column per match) and data.table does not accept a matrix as a row index.
For it to work on a data.table, you can flatten the matrix into a vector using c
df[c(sapply(args, function(x) seq(x, x+3))), ]
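The index-based answers above can run past the end of the data or into the next user's rows when the event is close to the end of a user's block. A hedged base R sketch that guards against both (rows_after is a made-up helper name, and the data frame is typed in from the question):

```r
df <- data.frame(
  User = c(rep("Bob", 8), rep("Mary", 7)),
  Action = c("Enter chapter 1", "Attack", "Jump", "Get mushroom",
             "Open inventory", "Eat mushroom", "Close inventory", "Run",
             "Enter chapter 1", "Get mushroom", "Attack", "Jump", "Attack",
             "Open inventory", "Close inventory"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: each matching row plus up to n rows after it,
# never crossing into another user's rows or past the end of df.
rows_after <- function(df, action, n) {
  hits <- which(df$Action == action)
  idx <- unlist(lapply(hits, function(i) {
    candidates <- i:min(i + n, nrow(df))
    candidates[df$User[candidates] == df$User[i]]
  }))
  df[unique(idx), ]
}

result <- rows_after(df, "Get mushroom", 3)
```

This assumes the rows are already sorted so that each user's actions are contiguous and in order.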

cbind for multiple table() functions

I'm trying to count the frequency of multiple columns in a data.frame.
I used the table function on each column and bound them all by cbind, and was going to use the aggregate function after to calculate the means by my identifier.
Example:
df1
V1 V2 V3
George Mary Mary
George Mary Mary
George Mary George
Mary Mary George
Mary George George
Mary
Frequency<- as.data.frame(cbind(table(df1$V1), table(df1$V2), table(df1$V3)))
row.names V1
George 3
Mary 3
1
George 1
Mary 4
1
George 3
Mary 2
The result looks like a 2-column data frame, but when I check the dimensions of Frequency, the result implies that only the 2nd column exists.
This causes trouble when I try to rename the columns and run the aggregate function. The error I get for the rename:
colnames(Frequency) <- c("Name", "Frequency")
Error in names(Frequency) <- c("Name", "Frequency") :
'names' attribute [2] must be the same length as the vector [1]
The Final purpose is to run an aggregate command and get the mean by name:
Name.Mean <- aggregate(Frequency$Frequency, list(Frequency$Name), mean)
Desired output:
Name Mean
George Value
Mary Value
Using mtabulate (data from @user3169080's post)
library(qdapTools)
d1 <- mtabulate(df1)
is.na(d1) <- d1==0
colMeans(d1, na.rm=TRUE)
# Alice George Mary
# 4.0 3.0 2.5
I hope this is what you were looking for:
> df1
V1 V2 V3
1 George George George
2 Mary Mary Alice
3 George George George
4 Mary Mary Alice
5 <NA> George George
6 <NA> Mary Alice
7 <NA> <NA> George
8 <NA> <NA> Alice
> ll=unlist(lapply(df1,table))
> nn=names(ll)
> nn1=sapply(nn,function(x) substr(x,4,nchar(x)))
> mm=data.frame(ll)
> mm$names=nn1
> tapply(mm$ll,mm$names,mean)
> Mean=tapply(mm$ll,mm$names,mean)
> data.frame(Mean)
Mean
Alice 4.0
George 3.0
Mary 2.5
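The same means can be reached while keeping the names in a real column, which is what the original cbind attempt was missing (table() stores the names as row names, so the resulting data frame has no Name column to rename). A minimal base R sketch, using the example data from the answer above:

```r
df1 <- data.frame(
  V1 = c("George", "Mary", "George", "Mary", NA, NA, NA, NA),
  V2 = c("George", "Mary", "George", "Mary", "George", "Mary", NA, NA),
  V3 = c("George", "Alice", "George", "Alice", "George", "Alice", "George", "Alice"),
  stringsAsFactors = FALSE
)

# One long data frame: a row per (column, name) pair with its count.
# table() ignores NA by default.
counts <- do.call(rbind, lapply(names(df1), function(col) {
  tab <- table(df1[[col]])
  data.frame(Column = col, Name = names(tab), Frequency = as.vector(tab),
             stringsAsFactors = FALSE)
}))

# Mean count per name across the columns in which the name appears.
Name.Mean <- aggregate(Frequency ~ Name, data = counts, FUN = mean)
```

Name.Mean then has a Name column and a Frequency column holding the mean, with a name counted only in the columns where it occurs.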
