How do I select distinct from a dataframe in R? - r

I have a dataframe in R. I want to see what groups are in the dataframe. If this were a SQL database, I would do Select distinct group from dataframe. Is there a way to perform a similar operation in R?
> head(orl.df)
long lat order hole piece group id
1 3710959 565672.3 1 FALSE 1 0.1 0
2 3710579 566171.1 2 FALSE 1 0.1 0

The unique() function should do the trick:
> dat <- data.frame(x=c(1,1,2),y=c(1,1,3))
> dat <- data.frame(x=c(1,1,2),y=c(1,1,3))
> dat
x y
1 1 1
2 1 1
3 2 3
> unique(dat)
x y
1 1 1
3 2 3
Edit: For your example (didn't see the group part)
unique(orl.df$group)

I think the table() function is also a good choice.
table(orl.df$group)
It also tell you the number the items in each group.

Related

Expand.grid with unknown number of columns

I have the following data frame:
map_value LDGroup ComboNum
1 1 1
1 1 2
1 1 3
1 2 1
1 2 2
1 3 1
1 3 2
I want to find all combinations, selecting one from each LD group. Expand.grid seems to work for this, doing
expand.grid(df[df$LDGroup==1,3],df[df$LDGroup==2,3],df[df$LDGroup==3,3])
My problem is that I have about 500 map_values I need to do this for and I do not know what number of LDGroups will exist for each map_value. Is there a way to dynamically provide the function arguments?
We can split the 3rd column by the 'LDGroup' and apply the expand.grid
out <- expand.grid(split(df$ComboNum, df$LDGroup))
names(out) <- paste0("Var", names(out))

R Conditional Counter based on multiple columns

I have a data frame with multiple responses from subjects (subid), which are in the column labelled trials. The trials count up and then start over within one subject.
Here's an example dataframe:
subid <- rep(1:2, c(10,10))
trial <- rep(1:5, 4)
response <- rnorm(20, 10, 3)
df <- as.data.frame(cbind(subid,trial, response))
df
subid trial response
1 1 1 3.591832
2 1 2 8.980606
3 1 3 12.943185
4 1 4 9.149388
5 1 5 10.192392
6 1 1 15.998124
7 1 2 13.288248
I want a column that increments every time trial starts over within one subject ID (subid):
df$block <- c(rep(1:2, c(5,5)),rep(1:2, c(5,5)))
df
subid trial response block
1 1 1 3.591832 1
2 1 2 8.980606 1
3 1 3 12.943185 1
4 1 4 9.149388 1
5 1 5 10.192392 1
6 1 1 15.998124 2
7 1 2 13.288248 2
The trials are not predictable in where they will start over. My solution so far is messy and utilizes a for loop.
Solution:
block <- 0
blocklist <- 0
for (i in seq_along(df$trial)){
if (df$trial[i]==1){
block = block + 1}else
if (df$trial!=1){
block = block}
blocklist<- c(blocklist, block)
}
blocklist <- blocklist[-1]
df$block <- blocklist
This solution does not start over at a new subid. Before I came to this I was attempting to use Wickham's tidyverse with mutate() and ifelse() in a pipe. If somebody knows a way to accomplish this with that package I would appreciate it. However, I'll use a solution from any package. I've searched for about a day now and don't think this is a duplicate question to other questions like this.
We can do this with ave from base R
df$block <- with(df, ave(trial, subid, FUN = function(x) cumsum(x==1)))
Or with dplyr
library(dplyr)
df %>%
group_by(subid) %>%
mutate(block = cumsum(trial==1))

Assigning values to correlative series in r

I hope you can help me with this issue I have.
I have a big dataframe, to simplify it, it look like this:
df <- data.frame(radius = c (2,3,5,7,4,6,9,8,3,7,8,9,2,4,5,2,6,7,8,9,1,10,8))
df$num <- c(1,2,3,4,5,6,7,8,9,10,11,1,12,13,1,14,15,16,17,18,19,1,1)
df
The column $num has correlative series (1-11, 1, 12-13, 1, 14-19,1,1)
I would like to assign a value (sorted) per each correlative serie as a column. the outcome should be like this:
df$outcome <- c(1,1,1,1,1,1,1,1,1,1,1,2,3,3,4,5,5,5,5,5,5,6,7)
df
thanks a lot!
A.
We can get the difference between adjacent elements in 'num' using diff and check whether it is not equal to 1. The logical output will be one less than the length of the 'num' vector. We pad with 'TRUE' and cumsum to get the expected output.
df$outcome <- cumsum(c(TRUE,diff(df$num)!=1))
df$outcome
#[1] 1 1 1 1 1 1 1 1 1 1 1 2 3 3 4 5 5 5 5 5 5 6 7

increase in one variable nested within another column in R + setting 0 as starting value

I'm trying to use the diff function to calculate the increase in a variable ("damage") in this dataset (df). I want to fill the column "damage_new" with this new variable. The values that you see now are the values I would like to have.
df = data.frame(id=c(1,1,1,2,2), trial=c(1,3,4,1,2), damage=(1,NA,3,1,5))
df
ID TRIAL DAMAGE DAMAGE_NEW
1 1 1 0
1 3 NA NA
1 4 3 NA
2 1 1 0
2 2 5 4
If I run
diff(df$damage) it will calculate the difference in the whole dataset.
two things that I haven't managed are:
-how to nest the difference within the values of another column? Specifically, I want to calculate the damage increase (for the whole dataset), but within a single individual (ID), of which I have repeated measurements.
-I also would like to have the damage_new column to be the same length as the rest of the dataset (to attach it), and for each individual, have the first value of damage_new set to 0, since obviously the first measurement has no reference.
-To further describe the dataset, I have NAs in the 'damage" column, which I suspect will lead to more NAs in the damage_new column, but I would like to keep them (and I wonder how the function deals with them?). I also don't have the same number of measurements per individual (they will have a different number of trials, with some missing in between).
thanks a lot for the always fast and efficient answers!
The dplyr package is great for this kind of things:
library(dplyr)
df %>% group_by(id) %>% mutate(damage_new=c(0,diff(damage)))
Source: local data frame [5 x 4]
Groups: id
id trial damage damage_new
1 1 1 1 0
2 1 3 NA NA
3 1 4 3 NA
4 2 1 1 0
5 2 2 5 4
You can read more about dplyr usage here
Update
If you'd like to go with the base R, you could do:
df$damage_new <- ave(df$damage,df$id,FUN=function(v) c(0,diff(v)))
which will produce the same df.
Library data.table is your friend there:
> library(data.table)
> setDT(df)
> setkey(df, id, trial)
> df[,new_damage:=c(0,diff(damage)),by=id]
> df
id trial damage new_damage
1: 1 1 1 0
2: 1 3 NA NA
3: 1 4 3 NA
4: 2 1 1 0
5: 2 2 5 4
On the diff working with NA, anything you withdraw from NA gives NA:
> diff(c(1,3,4,NA,5,7))
[1] 2 1 NA NA 2

Data frame "expand" procedure in R?

This is not a real statistical question, but rather a data preparation question before performing the actual statistical analysis. I have a data frame which consists of sparse data. I would like to "expand" this data to include zeroes for missing values, group by group.
Here is an example of the data (a and b are two factors defining the group, t is the sparse timestamp and xis the value):
test <- data.frame(
a=c(1,1,1,1,1,1,1,1,1,1,1),
b=c(1,1,1,1,1,2,2,2,2,2,2),
t=c(0,2,3,4,7,3,4,6,7,8,9),
x=c(1,2,1,2,2,1,1,2,1,1,3))
Assuming I would like to expand the values between t=0 and t=9, this is the result I'm hoping for:
test.expanded <- data.frame(
a=c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1),
b=c(1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2),
t=c(0,1,2,3,4,5,6,7,8,9,0,1,2,3,4,5,6,7,8,9),
x=c(1,0,2,1,2,0,0,2,0,0,0,0,0,1,1,0,2,1,1,3))
Zeroes have been inserted for all missing values of t. This makes it easier to use.
I have a quick and dirty implementation which sorts the dataframe and loops through each of its lines, adding missing lines one at a time. But I'm not entirely satisfied by the solution. Is there a better way to do it?
For those who are familiar with SAS, it is similar to the proc expand.
Thanks!
As you noted in a comment to the other answer, doing it by group is easy with plyr which just leaves how to "fill in" the data sets. My approach is to use merge.
library("plyr")
test.expanded <- ddply(test, c("a","b"), function(DF) {
DF <- merge(data.frame(t=0:9), DF[,c("t","x")], all.x=TRUE)
DF[is.na(DF$x),"x"] <- 0
DF
})
merge with all.x=TRUE will make the missing values NA, so the second line of the function is needed to replace those NAs with 0's.
This is convoluted but works fine:
test <- data.frame(
a=c(1,1,1,1,1,1,1,1,1,1,1),
b=c(1,1,1,1,1,2,2,2,2,2,2),
t=c(0,2,3,4,7,3,4,6,7,8,9),
x=c(1,2,1,2,2,1,1,2,1,1,3))
my.seq <- seq(0,9)
not.t <- !(my.seq %in% test$t)
test[nrow(test)+seq(length(my.seq[not.t])),"t"] <- my.seq[not.t]
test
#------------
a b t x
1 1 1 0 1
2 1 1 2 2
3 1 1 3 1
4 1 1 4 2
5 1 1 7 2
6 1 2 3 1
7 1 2 4 1
8 1 2 6 2
9 1 2 7 1
10 1 2 8 1
11 1 2 9 3
12 NA NA 1 NA
13 NA NA 5 NA
Not sure if you want it sorted by t afterwards or not. If so, easy enough to do:
https://stackoverflow.com/a/6871968/636656

Resources