How to use "cast" in reshape without aggregation - r

In many uses of cast I've seen, an aggregation function such as mean is used.
What if you simply want to reshape without information loss?
For example, if I want to take this long format:
ID condition Value
John a 2
John a 3
John b 4
John b 5
John a 6
John a 2
John b 1
John b 4
to this wide format without any aggregation:
ID a b
John 2 4
John 3 5
Alex 6 1
Alex 2 4
I suppose this assumes that observations are paired, and a missing value would mess this up, but any insight is appreciated.

In such cases you can add a sequence number:
library(reshape2)
DF$seq <- with(DF, ave(Value, ID, condition, FUN = seq_along))
dcast(ID + seq ~ condition, data = DF, value.var = "Value")
The last line gives:
ID seq a b
1 John 1 2 4
2 John 2 3 5
3 John 3 6 1
4 John 4 2 4
(Note that we used the sample input from the question but the sample output in the question does not correspond to the sample input.)
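For readers without reshape2 installed, the same idea can be sketched in base R. This is only an illustration, assuming DF holds the question's sample input; the column names Value.a and Value.b are generated by reshape() and are not part of the original answer.

```r
# Sample input from the question (assumed to be the data frame DF).
DF <- data.frame(
  ID        = rep("John", 8),
  condition = c("a", "a", "b", "b", "a", "a", "b", "b"),
  Value     = c(2, 3, 4, 5, 6, 2, 1, 4),
  stringsAsFactors = FALSE
)

# Number the observations within each ID/condition group: 1, 2, 3, ...
DF$seq <- with(DF, ave(Value, ID, condition, FUN = seq_along))

# Base R counterpart of the dcast() call: one row per ID/seq pair,
# one value column per condition.
wide <- reshape(DF, idvar = c("ID", "seq"), timevar = "condition",
                direction = "wide")
print(wide)
```

If one condition has fewer observations than the other, the unmatched cells come out as NA instead of being dropped, which covers the missing-value concern raised in the question.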

Related

How to make a true ranking by giving same ranking to same score while the rest follow the true order?

Here is a toy sample of 8 students with grades from A to D. I would like to assign a ranking that reflects the true order, while students with the same grade share the same rank.
It seems that .GRP is most likely the right approach, but it numbers the groups consecutively. How can I skip the positions occupied by students with the same grade, using data.table? Thanks.
library(data.table)
DT <- data.table(GRADE = c("A","B","B","C",rep("D",4)))
DT[, GRP:=.GRP, by = GRADE][, RANK:= c(1,2,2,4,5,5,5,5)]
# GRADE GRP RANK
#1: A 1 1
#2: B 2 2
#3: B 2 2
#4: C 3 4
#5: D 4 5
#6: D 4 5
#7: D 4 5
#8: D 4 5
An option is frank
DT[, RANK := frank(GRADE, ties.method = 'min')]
DT$RANK
#[1] 1 2 2 4 5 5 5 5
Or in dplyr with min_rank
library(dplyr)
DT %>%
mutate(RANK = min_rank(GRADE))
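As a rough base R comparison, the difference between the two kinds of ranking can be sketched without any packages. The variable names rank_min and rank_dense are made up for this illustration.

```r
grade <- c("A", "B", "B", "C", rep("D", 4))

# "min" ranking: tied grades share the lowest rank of their group,
# and the ranks they occupy are skipped afterwards.
rank_min <- rank(grade, ties.method = "min")

# Dense ranking for comparison: consecutive group numbers with no gaps,
# which is what .GRP produced in the question.
rank_dense <- match(grade, sort(unique(grade)))

print(rank_min)
print(rank_dense)
```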

subsetting columns by the name of rows of another dataframe

I need to subset the columns of a dataframe based on the row names of another dataframe (in R).
I'm trying to select the representative species of the Brazilian Amazon by subsetting a large Brazilian database according to the percentage of representative locations, information which is in another dataframe.
> a <- data.frame("John" = c(2,1,1,2), "Dora" = c(1,1,3,2), "camilo" = c(1:4),"alex"=c(1,2,1,2))
> a
John Dora camilo alex
1 2 1 1 1
2 1 1 2 2
3 1 3 3 1
4 2 2 4 2
> b <- data.frame("SN" = 1:3, "Age" = c(15,31,2), "Name" = c("John","Dora","alex"))
> b
SN Age Name
1 1 15 John
2 2 31 Dora
3 3 2 alex
> result <- a[,rownames(b)[1:3]]
Error in `[.data.frame`(a, , rownames(b)[1:3]) :
undefined columns selected
I want to get this dataframe
John Dora alex
1 2 1 1
2 1 1 2
3 1 3 1
4 2 2 2
The simple a[,b$Name] does not work when b$Name is a factor, which was the data.frame() default before R 4.0.0. Be careful: it won't throw an error, but it indexes by the factor's integer codes, so you will get the wrong columns!
This is easy to fix by using a[,as.character(b$Name)] instead.
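A small sketch of the whole fix, using the question's data. The error in the question arises because rownames(b) returns "1", "2", "3", which are not column names of a; the names live in b$Name. The intersect() guard is an addition for illustration, not part of the answer.

```r
a <- data.frame(John = c(2, 1, 1, 2), Dora = c(1, 1, 3, 2),
                camilo = 1:4, alex = c(1, 2, 1, 2))
# stringsAsFactors = TRUE makes Name a factor, as in the answer
# (this was the default before R 4.0.0).
b <- data.frame(SN = 1:3, Age = c(15, 31, 2),
                Name = c("John", "Dora", "alex"),
                stringsAsFactors = TRUE)

print(rownames(b))               # "1" "2" "3": not column names of a
wanted <- as.character(b$Name)   # the actual names to select

# Keep only names that really exist in a, so a missing name cannot
# trigger "undefined columns selected".
wanted <- intersect(wanted, names(a))
result <- a[, wanted, drop = FALSE]
print(result)
```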

Assign ID across 2 columns of variable

I have a data frame in which each individual (row) has two data points per variable.
Example data:
df1 <- read.table(text = "IID L1.1 L1.2 L2.1 L2.2
1 1 38V1 38V1 48V1 52V1
2 2 36V1 38V2 50V1 48Y1
3 3 37Y1 36V1 50V2 48V1
4 4 38V2 36V2 52V1 50V2",
stringsAsFactors = FALSE, header = TRUE)
I have many more columns than this in the full dataset and would like to recode these values to label unique identifiers across the two columns. I know how to get identifiers and relabel a single column from previous questions (Creating a unique ID and How to assign a unique ID number to each group of identical values in a column) but I don't know how to include the information for two columns, as R identifies and labels factors per column.
Ultimately I want something that would look like this for the above data:
(df2)
IID L1.1 L1.2 L2.1 L2.2
1 1 1 1 1 4
2 2 2 4 2 5
3 3 3 2 3 1
4 4 4 5 4 3
It doesn't really matter what the numbers are, as long as they indicate unique values across both columns. I've tried creating a function based on the output from:
unique(df1[,1:2])
but am struggling as this still looks at unique entries per column, not across the two.
Something like this would work...
pairs <- (ncol(df1)-1)/2
for(i in 1:pairs){
refs <- unique(c(df1[,2*i],df1[,2*i+1]))
df1[,2*i] <- match(df1[,2*i],refs)
df1[,2*i+1] <- match(df1[,2*i+1],refs)
}
df1
IID L1.1 L1.2 L2.1 L2.2
1 1 1 1 1 4
2 2 2 4 2 5
3 3 3 2 3 1
4 4 4 5 4 3
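The loop above relies on each pair sitting in adjacent columns. A possible variant, sketched here under the assumption that columns are named prefix.suffix (L1.1, L1.2, ...), groups columns by prefix instead. The helper name recode_by_prefix and its id_col argument are hypothetical.

```r
df1 <- data.frame(
  IID  = 1:4,
  L1.1 = c("38V1", "36V1", "37Y1", "38V2"),
  L1.2 = c("38V1", "38V2", "36V1", "36V2"),
  L2.1 = c("48V1", "50V1", "50V2", "52V1"),
  L2.2 = c("52V1", "48Y1", "48V1", "50V2"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: recode values to integer IDs that are shared
# across all columns with the same prefix (the part before the last dot).
recode_by_prefix <- function(df, id_col = "IID") {
  value_cols <- setdiff(names(df), id_col)
  prefix <- sub("\\.[^.]*$", "", value_cols)
  for (p in unique(prefix)) {
    cols <- value_cols[prefix == p]
    # Reference table: unique values over every column of this group.
    refs <- unique(unlist(df[cols], use.names = FALSE))
    df[cols] <- lapply(df[cols], match, table = refs)
  }
  df
}

df2 <- recode_by_prefix(df1)
print(df2)
```

Because the reference table is built per prefix, a third group such as L3.1 and L3.2, or more than two columns per group, would be handled by the same call.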
You could reshape it to long format, assign the groups and then recast it to wide:
library(data.table)
df_m <- melt(as.data.table(df1), id.vars = "IID")
# group by column prefix (variable minus its last character) and value
df_m[, id := .GRP, by = .(gsub("(.*).", "\\1", variable), value)]
dcast(df_m, IID ~ variable, value.var = "id")
# IID L1.1 L1.2 L2.1 L2.2
#1 1 1 1 6 9
#2 2 2 4 7 10
#3 3 3 2 8 6
#4 4 4 5 9 8
This should also be easily expandable to multiple groups of columns. I.e. if you have L3. it should work with that as well.

Combining Rows - Summing Certain Columns and Not Others in R

I have a data set that has repeated names in column 1 and then 3 other columns that are numeric.
I want to combine the rows of repeated names into one row and sum 2 of the columns while leaving the other alone. Is there a simple way to do this? I have been trying to figure it out with sapply and lapply and have read a lot of the Q&As here, but can't seem to find a solution.
Name <- c("Jeff", "Hank", "Tom", "Jeff", "Hank", "Jeff",
"Jeff", "Bill", "Mark")
data.Point.1 <- c(3,4,3,3,4,3,3,6,2)
data.Point.2 <- c(6,9,2,5,7,4,8,2,9)
data.Point.3 <- c(2,2,8,6,4,3,3,3,1)
data <- data.frame(Name, data.Point.1, data.Point.2, data.Point.3)
The data looks like this:
Name data.Point.1 data.Point.2 data.Point.3
1 Jeff 3 6 2
2 Hank 4 9 2
3 Tom 3 2 8
4 Jeff 3 5 6
5 Hank 4 7 4
6 Jeff 3 4 3
7 Jeff 3 8 3
8 Bill 6 2 3
9 Mark 2 9 1
I'd like it to look like this (summing columns 3 and 4 and leaving data.Point.1 alone):
Name data.Point.1 data.Point.2 data.Point.3
1 Jeff 3 23 14
2 Hank 4 16 6
3 Tom 3 2 8
8 Bill 6 2 3
9 Mark 2 9 1
Any help would be great. Thanks!
Another solution, which is a bit more straightforward, uses the dplyr package:
library(dplyr)
data <- data %>% group_by(Name, data.Point.1) %>% # group the columns you want to "leave alone"
summarize(data.Point.2=sum(data.Point.2), data.Point.3=sum(data.Point.3)) # sum columns 3 and 4
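If you want to sum over all other columns except those you want to "leave alone", replace summarize(data.Point.2=sum(data.Point.2), data.Point.3=sum(data.Point.3)) with summarise_each(funs(sum)). In recent dplyr versions summarise_each() and funs() are deprecated; summarise(across(everything(), sum)) is the current equivalent.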
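The same "group by the columns you leave alone" idea can also be sketched in base R with aggregate(). This is an illustrative alternative, not part of the original answer, and it assumes data.Point.1 is constant within each Name, as it is in the sample data.

```r
Name <- c("Jeff", "Hank", "Tom", "Jeff", "Hank", "Jeff",
          "Jeff", "Bill", "Mark")
data <- data.frame(
  Name,
  data.Point.1 = c(3, 4, 3, 3, 4, 3, 3, 6, 2),
  data.Point.2 = c(6, 9, 2, 5, 7, 4, 8, 2, 9),
  data.Point.3 = c(2, 2, 8, 6, 4, 3, 3, 3, 1)
)

# Group by the columns to "leave alone" and sum the remaining two.
res <- aggregate(cbind(data.Point.2, data.Point.3) ~ Name + data.Point.1,
                 data = data, FUN = sum)
print(res)
```

The rows come back ordered by the grouping columns, not in order of first appearance in the input.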
I'd do it this way using data.table:
setDT(data)[, c(data.Point.1 = data.Point.1[1L],
lapply(.SD, sum)), by=Name,
.SDcols = -"data.Point.1"]
# Name data.Point.1 data.Point.2 data.Point.3
# 1: Jeff 3 23 14
# 2: Hank 4 16 6
# 3: Tom 3 2 8
# 4: Bill 6 2 3
# 5: Mark 2 9 1
We group by Name and, for each group, take the first element of data.Point.1. For the rest of the columns, we compute the sum by looping the base function lapply over the columns of .SD, which stands for Subset of Data. The columns in .SD are set by .SDcols, from which we exclude data.Point.1 so that all the other columns are provided to .SD.
Check the HTML vignettes for detailed info.
You could try
library(data.table)
setDT(data)[, list(data.Point.1=data.Point.1[1L],
data.Point.2=sum(data.Point.2), data.Point.3=sum(data.Point.3)), by=Name]
# Name data.Point.1 data.Point.2 data.Point.3
#1: Jeff 3 23 14
#2: Hank 4 16 6
#3: Tom 3 2 8
#4: Bill 6 2 3
#5: Mark 2 9 1
or using base R
data$Name <- factor(data$Name, levels=unique(data$Name))
res <- do.call(rbind,lapply(split(data, data$Name), function(x) {
# write the column sums into the first row only, then keep that row
x[1, 3:4] <- colSums(x[3:4])
x[1,]} ))
Or using dplyr, you can use summarise_each to apply the function that needs to be applied on multiple columns, and cbind the output with the 'summarise' output for a single column
library(dplyr)
res1 <- data %>%
group_by(Name) %>%
summarise(data.Point.1=data.Point.1[1L])
res2 <- data %>%
group_by(Name) %>%
summarise_each(funs(sum), 3:4)
cbind(res1, res2[-1])
# Name data.Point.1 data.Point.2 data.Point.3
#1 Jeff 3 23 14
#2 Hank 4 16 6
#3 Tom 3 2 8
#4 Bill 6 2 3
#5 Mark 2 9 1
EDIT
The data created and the data shown initially differed in the original post. After the edit to the OP's post (by @dimitris_ps), you can get the expected result by replacing group_by(Name) with group_by(Name, data.Point.1) in the res2 <- .. code.

Adding column with a value based on entry order of specific factor [duplicate]

This question already has answers here:
In R, how do I create consecutive ID numbers for each repetition in a separate variable?
(3 answers)
Closed 9 years ago.
I would like to add a column to a data frame where the values in the column are based upon the entry order for a specific factor in another column. Specifically, for my data I would like to have a "1" for the first visit to a point, a "2" for the second visit, a "3" for the third, etc. However, some points have repeated visits on a given date, and those should share the same visit number.
The data frame is pre-sorted and looks something like this:
Transect Point Date
1 BEN 1 5/7/12
2 BEN 1 5/10/12
3 BEN 1 5/10/12
4 BEN 2 5/8/12
5 BEN 2 5/11/12
6 BEN 2 5/13/12
I would like to get something like this:
Transect Point Date Visit
1 BEN 1 5/7/12 1
2 BEN 1 5/10/12 2
3 BEN 1 5/10/12 2
4 BEN 2 5/8/12 1
5 BEN 2 5/11/12 2
6 BEN 2 5/13/12 3
Assuming your data.frame is called SODF, use ave:
within(SODF, {
Visit <- ave(Point, Point, FUN = seq_along)
})
# Transect Point Date Visit
# 1 BEN 1 5/7/12 1
# 2 BEN 1 5/10/12 2
# 3 BEN 1 5/13/12 3
# 4 BEN 2 5/8/12 1
# 5 BEN 2 5/11/12 2
If you are grouping by more than one column, for example "Transect" and "Point", change the ave statement to:
ave(Point, Transect, Point, FUN = seq_along)
There are, of course, other approaches, both using base R and using packages. Several of these are summarized and benchmarked by @Arun in his answer here.
Update to address new question requirements
One quick solution that comes to mind considering your new requirement is to first extract the unique cases, perform the index generation as done above, and merge the resulting table with your original table.
SODFunique <- SODF[!duplicated(SODF), ]
SODFunique <- within(SODFunique, {
Visit <- ave(Point, Transect, Point, FUN = seq_along)
})
merge(SODF, SODFunique, sort = FALSE)
# Transect Point Date Visit
# 1 BEN 1 5/7/12 1
# 2 BEN 1 5/10/12 2
# 3 BEN 1 5/10/12 2
# 4 BEN 2 5/8/12 1
# 5 BEN 2 5/11/12 2
# 6 BEN 2 5/13/12 3
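A possible one-step alternative to the deduplicate-and-merge approach is sketched below. It assumes, as the question states, that the data frame is pre-sorted, and it rebuilds the sample data so the snippet runs on its own.

```r
SODF <- data.frame(
  Transect = "BEN",
  Point    = c(1, 1, 1, 2, 2, 2),
  Date     = c("5/7/12", "5/10/12", "5/10/12",
               "5/8/12", "5/11/12", "5/13/12"),
  stringsAsFactors = FALSE
)

# Within each Transect/Point group, number the distinct dates in order
# of first appearance; repeated dates get the same visit number.
SODF$Visit <- with(SODF, ave(seq_along(Date), Transect, Point,
                             FUN = function(i) match(Date[i], unique(Date[i]))))
print(SODF)
```

Passing row indices to ave() and looking the dates up inside FUN keeps the result an integer vector; calling ave() on the character Date column directly would return the visit numbers as strings.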
