R: Order only one factor level (or one column, if reordering afterwards) to control column order going long to wide (using spread)

I have a problem after changing my dataset from long to wide (using spread, from the tidyr library on the Result_Type column). I have the following example df:
Group<-c("A","A","A","B","B","B","C","C","C","D", "D")
Result_Type<-c("Final.Result", "Verification","Test", "Verification","Final.Result","Fast",
"Verification","Fast", "Final.Result", "Test", "Final.Result")
Result<-c(7,1,8,7,NA,9,10,12,17,50,11)
df<-data.frame(Group, Result_Type, Result)
df
Group Result_Type Result
1 A Final.Result 7
2 A Verification 1
3 A Test 8
4 B Verification 7
5 B Final.Result NA
6 B Fast 9
7 C Verification 10
8 C Fast 12
9 C Final.Result 17
10 D Test 50
11 D Final.Result 11
In the column Result_Type there are many possible result types, and some datasets contain Result_Types that do not occur in others. However, one level, Final.Result, does occur in every dataset.
Also: this is example data, but the actual data has many different columns, and because these differ across the datasets I use, I used spread (from the tidyr library) so I don't have to name any columns other than my target columns.
library("tidyr")
df_spread<-spread(df, key = Result_Type, value = Result)
df_spread
Group Fast Final.Result Test Verification
1 A NA 7 8 1
2 B 9 NA NA 7
3 C 12 17 NA 10
4 D NA 11 50 NA
What I would like is that once I convert the dataset from long to wide, Final.Result is the first of the spread columns; how the rest of the columns are arranged doesn't matter. So I would like it to look like this (without calling any names of the other columns that are spread, or using order index numbers):
Group Final.Result Fast Test Verification
1 A 7 NA 8 1
2 B NA 9 NA 7
3 C 17 12 NA 10
4 D 11 NA 50 NA
I saw some answers indicating that you can reverse the order of the spread columns, or turn off spread's ordering, but that doesn't guarantee that Final.Result is always the first of the spread columns.
I hope I am making myself clear, it's a little complicated to explain. If someone needs extra info I will be happy to explain more!

spread creates columns in the order of the key column's factor levels. Within the tidyverse, forcats::fct_relevel is a convenience function for rearranging factor levels. The default is that the level(s) you specify will be moved to the front.
library(dplyr)
library(tidyr)
...
levels(df$Result_Type)
#> [1] "Fast" "Final.Result" "Test" "Verification"
Calling fct_relevel will put "Final.Result" as the first level, keeping the rest of the levels in their previous order.
reordered <- df %>%
  mutate(Result_Type = forcats::fct_relevel(Result_Type, "Final.Result"))
levels(reordered$Result_Type)
#> [1] "Final.Result" "Fast" "Test" "Verification"
Adding that into your pipeline puts Final.Result as the first column after spreading.
df %>%
  mutate(Result_Type = forcats::fct_relevel(Result_Type, "Final.Result")) %>%
  spread(key = Result_Type, value = Result)
#> Group Final.Result Fast Test Verification
#> 1 A 7 NA 8 1
#> 2 B NA 9 NA 7
#> 3 C 17 12 NA 10
#> 4 D 11 NA 50 NA
Created on 2018-12-14 by the reprex package (v0.2.1)
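If you only care about the column order of the result (rather than the factor levels themselves), a sketch of an alternative is to reorder after spreading with dplyr's select() plus everything(); it likewise only has to name Final.Result:
library(dplyr)
df_spread <- spread(df, key = Result_Type, value = Result)
# everything() pulls in all remaining columns in their existing order,
# so none of the other spread columns need to be named
df_spread %>% select(Group, Final.Result, everything())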

One option is to re-level Result_Type to put Final.Result as the first level:
df$Result_Type <- factor(df$Result_Type,
                         levels = c("Final.Result",
                                    as.character(unique(df$Result_Type)[unique(df$Result_Type) != "Final.Result"])))
spread(df, key = Result_Type, value = Result)
Group Final.Result Verification Test Fast
1 A 7 1 8 NA
2 B NA 7 NA 9
3 C 17 10 NA 12
4 D 11 NA 50 NA
If you'd like you can use this opportunity to also sort the rest of the columns whichever way you want.
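For example, a sketch that puts Final.Result first and sorts the remaining observed types alphabetically:
lvls <- unique(as.character(df$Result_Type))
# Final.Result first, then every other observed type in alphabetical order
df$Result_Type <- factor(df$Result_Type,
                         levels = c("Final.Result", sort(setdiff(lvls, "Final.Result"))))
spread(df, key = Result_Type, value = Result)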

Related

Remove duplicates while keeping NA in R

I have data that looks like the following:
a<-data.frame(ID=c("A","B","C","C",NA,NA),score=c(1,2,3,3,5,6),stringsAsFactors=FALSE)
print(a)
ID score
A 1
B 2
C 3
C 3
<NA> 5
<NA> 6
I am trying to remove duplicates without R treating <NA> as duplicates to get the following:
b<-data.frame(ID=c("A","B","C",NA,NA),score=c(1,2,3,5,6),stringsAsFactors=FALSE)
print(b)
ID score
A 1
B 2
C 3
<NA> 5
<NA> 6
I have tried the following:
b<-a[!duplicated(a$ID),]
library(dplyr)
b<-distinct(a,ID)
print(b)
But both treat <NA> as a duplicate ID and remove one, but I want to keep all instances of <NA>. Thoughts? Thank you!
A straightforward approach is to break the original data frame into two parts, where ID is NA and where it is not, perform your distinct filter on the non-NA part, and then combine the data frames back together:
a<-data.frame(ID=c("A","B","C","C",NA,NA),score=c(1,2,3,3,5,6),stringsAsFactors=FALSE)
aprime<-a[!is.na(a$ID),]
aNA<-a[is.na(a$ID),]
b<-aprime[!duplicated(aprime$ID),]
b<-rbind(b, aNA)
With a little work, one can reduce this down to one or two lines of code, as sketched below.
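For instance, the same split-dedupe-recombine idea compresses to one line (a sketch; it appends the NA rows at the end, which happens to match this data's ordering):
# keep non-NA rows whose ID isn't duplicated, then append all NA rows
b <- rbind(a[!is.na(a$ID) & !duplicated(a$ID), ], a[is.na(a$ID), ])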
Using dplyr:
a %>% group_by(ID, score) %>% distinct()
# A tibble: 5 x 2
# Groups: ID, score [5]
ID score
<chr> <dbl>
1 A 1
2 B 2
3 C 3
4 <NA> 5
5 <NA> 6
Found a very simple way to do this using the base duplicated() function.
b<-a[!duplicated(a$ID, incomparables = NA),]
Setting incomparables = NA makes duplicated() treat NA values as never matching each other (it returns FALSE for them), so all NA rows are kept in the result dataset.

R: Transposing from long to wide and aggregating rows with matching ID

This is something I've been working around for a while by just making separate data frames and doing full_join, but I think there's an easier way.
Overall, I'm wanting to calculate the differences between an individual ID's value from time 1 to time 2 by type from a long form data frame. This is one of the ways I think I could do it but if other people have other techniques or ideas I'd like to hear them too.
However, I'd also like to know how to address this transposing issue anyway because I'm curious.
Here's my issue.
I have a data frame in long form with 5 different measures for two different time periods. I want to convert this data frame from long form into a wide form so that instead of having a DF look like this (note, not all types are included -- just did 2 for sake of length):
(example df1)
ID Time Value Type
1 1 7 Type1
1 2 8 Type1
2 1 9 Type1
2 2 10 Type1
1 1 13 Type2
1 2 15 Type2
2 1 17 Type2
2 2 19 Type2
I want it to look more like this:
(example df 2)
ID Type1.1 Type1.2 Type2.1 Type2.2
1 7 8 13 15
2 9 10 17 19
I use:
library(dplyr)
library(tidyr)
df.new <- df %>%
  spread(Type, Value)
and get this from example df 1 which is on the right track:
(example df 3)
ID Time Type1 Type2
1 1 7 13
1 2 8 15
2 1 9 17
2 2 10 19
But now I want to spread the time for each type. When I do something like this on example df3:
newer.df <- df.new %>%
  spread(Time, Type1)
to make this:
ID Type1.1 Type1.2
1 7 NA
1 NA 8
2 9 NA
2 NA 10
So, it's producing an NA for each row -- is there a way I can collapse rows onto each other by ID? I think I'm missing something.
Remember, in my example code I'm only using 2 types but in reality I have 5 types -- just wanted to give simplified code.
We can use dcast() from the reshape2 package.
library(reshape2)
dcast(df, ID ~ Type + Time, value.var = "Value")
# ID Type1_1 Type1_2 Type2_1 Type2_2
#1 1 7 8 13 15
#2 2 9 10 17 19
Or, staying within tidyr, we could paste Type and Time together first and then spread:
library(tidyr)
df$Type <- paste(df$Type, df$Time, sep="_")
df$Time <- NULL
spread(df, key=Type, value=Value)
ID Type1_1 Type1_2 Type2_1 Type2_2
1 7 8 13 15
2 9 10 17 19
Nulling the Time column did the trick for me: spread treats every column it doesn't otherwise use as an identifier (what dcast would call id.vars), so the leftover Time column was keeping the rows apart. There might be a more elegant solution using tidyr, though.
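One such tidyr-only route (a sketch using tidyr::unite) is to merge Type and Time into a single key column, so spread never sees a leftover identifier:
library(dplyr)
library(tidyr)
# unite() drops Type and Time and replaces them with a combined Type_Time key
df %>%
  unite(Type_Time, Type, Time, sep = "_") %>%
  spread(key = Type_Time, value = Value)
# ID Type1_1 Type1_2 Type2_1 Type2_2
#1 1 7 8 13 15
#2 2 9 10 17 19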

Loop or apply for sum of rows based on multiple conditions in R dataframe

I've hacked together a quick solution to my problem, but I have a feeling it's quite obtuse. Moreover, it uses for loops, which, from what I've gathered, should be avoided at all costs in R. Any and all advice to tidy up this code is appreciated. I'm still pretty new to R, but I fear I'm making a relatively simple problem much too convoluted.
I have a dataset as follows:
id count group
2 6 A
2 8 A
2 6 A
8 5 A
8 6 A
8 3 A
10 6 B
10 6 B
10 6 B
11 5 B
11 6 B
11 7 B
16 6 C
16 2 C
16 0 C
18 6 C
18 1 C
18 6 C
I would like to create a new dataframe that contains, for each unique ID, the sum of the first two counts of that ID (e.g. 6+8=14 for ID 2). I also want to attach the correct group identifier.
In general you might need to do this when you measure a value on consecutive days for different subjects and treatments, and you want to compute the total for each subject for the first x days of measurement.
This is what I've come up with:
id <- c(rep(c(2,8,10,11,16,18),each=3))
count <- c(6,8,6,5,6,3,6,6,6,5,6,7,6,2,0,6,1,6)
group <- c(rep(c("A","B","C"),each=6))
df <- data.frame(id,count,group)
newid <- c()
newcount <- c()
newgroup <- c()
for (i in 1:length(unique(df$id))) {
  newid[i] <- unique(df$id)[i]
  newcount[i] <- sum(df[df$id == unique(df$id)[i], 2][1:2])
  newgroup[i] <- as.character(df$group[df$id == newid[i]][1])
}
newdf <- data.frame(newid, newcount, newgroup)
Some possible improvements/alternatives I'm not sure about:
For loops vs apply functions
Can I create a dataframe directly inside a for loop, or should I stick to creating vectors I can later assign to a dataframe?
More consistent approaches to accessing/subsetting vectors/columns ($, [], [[]], subset?)
You could do this using data.table:
library(data.table)
setDT(df)[, list(newcount = sum(count[1:2])), by = .(id, group)]
# id group newcount
#1: 2 A 14
#2: 8 A 11
#3: 10 B 12
#4: 11 B 11
#5: 16 C 8
#6: 18 C 7
You could use dplyr:
library(dplyr)
df %>% group_by(id,group) %>% slice(1:2) %>% summarise(newcount=sum(count))
The pipe syntax makes it easy to read: group your data by id and group, take the first two rows of each group, then sum the counts.
You can try using a self-defined function with aggregate:
sum1sttwo <- function(x) {
  return(x[1] + x[2])
}
aggregate(count ~ id + group, data = df, FUN = sum1sttwo)
and the output is:
id group count
1 2 A 14
2 8 A 11
3 10 B 12
4 11 B 11
5 16 C 8
6 18 C 7
04/2015 edit: dplyr and data.table are definitely better choices when your data set is large. One of the most important drawbacks of base R is that data frame operations are slow. However, if you just need to aggregate a very simple/small data set, the aggregate function in base R serves its purpose.
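One small caveat with sum1sttwo: if an id ever has only one row, x[2] is NA and the result is NA. A sketch that guards against that with head():
# head(x, 2) returns just x[1] for a single-row id, so the sum stays valid
aggregate(count ~ id + group, data = df, FUN = function(x) sum(head(x, 2)))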
library(plyr)
# Keep the first 2 rows for each id and group
df2 <- ddply(df, c("id", "group"), function(x) x$count[1:2])
# Aggregate by id and group
df3 <- ddply(df2, c("id", "group"), summarize, count = V1 + V2)
df3
id group count
1 2 A 14
2 8 A 11
3 10 B 12
4 11 B 11
5 16 C 8
6 18 C 7
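The two ddply passes can also collapse into one (a sketch using plyr's summarise directly on the first two counts):
library(plyr)
# sum the first two counts per (id, group) in a single pass
ddply(df, c("id", "group"), summarise, count = sum(count[1:2]))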

How to get na.omit with data.table to only omit NAs in each column

Let's say I have
library(data.table)
az <- data.table(a = 1:6, b = 6:1, c = 4)
az[b==4,c:=NA]
az
a b c
1: 1 6 4
2: 2 5 4
3: 3 4 NA
4: 4 3 4
5: 5 2 4
6: 6 1 4
I can get the sum of all the columns with
az[,lapply(.SD,sum)]
a b c
1: 21 21 NA
This is what I want for a and b but c is NA. This is seemingly easy enough to fix by doing
az[,lapply(na.omit(.SD),sum)]
a b c
1: 18 17 20
This is what I want for c but I didn't want to omit the values of a and b where c is NA. This is a contrived example in my real data there could be 1000+ columns with random NAs throughout. Is there a way to get na.omit or something else to act per column instead of on the whole table without relying on looping through each column as a vector?
Expanding on my comment:
Many base functions allow you to decide how to treat NA. For example, sum has the argument na.rm:
az[,lapply(.SD,sum,na.rm=TRUE)]
In general, you can also use the function na.omit on each vector individually:
az[,lapply(.SD,function(x) sum(na.omit(x)))]

Skip NA values using "FUN=first"

There's probably a really simple explanation for what I'm doing wrong, but I've been working on this for quite some time today and I still cannot get it to work. I thought this would be a walk in the park; however, my code isn't quite working as expected.
So for this example, let's say I have a data frame as followed.
df
Row# user columnB
1 1 NA
2 1 NA
3 1 NA
4 1 31
5 2 NA
6 2 NA
7 2 15
8 3 18
9 3 16
10 3 NA
Basically, I would like to create a new column that uses the first (as well as last) function (from the TTR package) to obtain the first non-NA value for each user. So my desired data frame would be this:
df
Row# user columnB firstValue
1 1 NA 31
2 1 NA 31
3 1 NA 31
4 1 31 31
5 2 NA 15
6 2 NA 15
7 2 15 15
8 3 18 18
9 3 16 18
10 3 NA 18
I've looked around, mainly using Google, but I couldn't really find my exact answer.
Here's some of the code that I've tried, but I didn't get the results I wanted (note: I'm writing these from memory, so I tried quite a few more variations, but these are the general forms):
df$firstValue<-ave(df$columnB,df$user,FUN=first,na.rm=True)
df$firstValue<-ave(df$columnB,df$user,FUN=function(x){x,first,na.rm=True})
df$firstValue<-ave(df$columnB,df$user,FUN=function(x){first(x,na.rm=True)})
df$firstValue<-by(df,df$user,FUN=function(x){x,first,na.rm=True})
These failed; they just give the first value of each group, which would be NA.
Again, these are just a few examples off the top of my head; I played around with na.rm, na.exclude, na.omit, na.action(na.omit), etc.
Any help would be greatly appreciated. Thanks.
A data.table solution
require(data.table)
DT <- data.table(df, key="user")
DT[, firstValue := na.omit(columnB)[1], by=user]
Here is a solution with plyr:
library(plyr)
ddply(df, .(user), transform, firstValue=na.omit(columnB)[1])
Which gives:
Row user columnB firstValue
1 1 1 NA 31
2 2 1 NA 31
3 3 1 NA 31
4 4 1 31 31
5 5 2 NA 15
6 6 2 NA 15
7 7 2 15 15
8 8 3 18 18
9 9 3 16 18
10 10 3 NA 18
If you want to capture the last value instead, you can do:
ddply(df, .(user), transform, firstValue=tail(na.omit(columnB),1))
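The same na.omit(columnB)[1] idea also drops straight into a dplyr pipeline (a sketch, for anyone preferring that syntax):
library(dplyr)
df %>%
  group_by(user) %>%
  mutate(firstValue = na.omit(columnB)[1]) %>%
  ungroup()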
Using data.table:
library(data.table)
DT <- data.table(df, key = "user")
# join the per-user first non-NA columnB back onto the full table
DT <- DT[DT[!is.na(columnB), list(first = columnB[1]), by = "user"]]
Using a very small helper function
finite <- function(x) x[is.finite(x)]
here is a one-liner using only standard R functions:
df <- cbind(df, firstValue = unlist(sapply(unique(df[,1]), function(user) rep(finite(df[df[,1] == user,2])[1], sum(df[,1] == user)))))
For a better overview, here is the one-liner unfolded into a "multi-liner":
# for each user, find the first finite (in this case non-NA) value of the second column and replicate it as many times as the user has rows
# then, the results of all users are joined into one vector (unlist) and appended to the data frame as column
df <- cbind(
  df,
  firstValue = unlist(
    sapply(
      unique(df[, 1]),
      function(user) {
        rep(
          finite(df[df[, 1] == user, 2])[1],
          sum(df[, 1] == user)
        )
      }
    )
  )
)
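Finally, since the question specifically tried ave(): those attempts returned NA because first() sees the whole group vector, NAs included. Giving ave() a function that skips NAs works (a sketch):
# na.omit(x)[1] is the first non-NA value; ave() replicates it across each user's rows
df$firstValue <- ave(df$columnB, df$user, FUN = function(x) na.omit(x)[1])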
