How to Index subjects using R [duplicate]

This question already has answers here:
Numbering rows within groups in a data frame
(10 answers)
Closed 7 years ago.
I am working in R and I have a data set that has multiple entries for each subject. I want to create an index variable that numbers the entries within each subject. For example:
Subject Index
1 A 1
2 A 2
3 B 1
4 C 1
5 C 2
6 C 3
7 D 1
8 D 2
9 E 1
The first A entry is indexed as 1, while the second A entry is indexed as 2. The first B entry is indexed as 1, etc.
Any help would be excellent!

Here's a quick data.table approach
library(data.table)
setDT(df)[, Index := seq_len(.N), by = Subject][]
# Subject Index
# 1: A 1
# 2: A 2
# 3: B 1
# 4: C 1
# 5: C 2
# 6: C 3
# 7: D 1
# 8: D 2
# 9: E 1
Or with base R
with(df, ave(seq_along(Subject), Subject, FUN = seq_along))
## [1] 1 2 1 1 2 3 1 2 1
Or with dplyr (don't run this on a data.table class)
library(dplyr)
df %>%
group_by(Subject) %>%
mutate(Index = row_number())

Using dplyr
library(dplyr)
df %>% group_by(Subject) %>% mutate(Index = 1:n())
You get:
#Source: local data frame [9 x 2]
#Groups: Subject
#
# Subject Index
#1 A 1
#2 A 2
#3 B 1
#4 C 1
#5 C 2
#6 C 3
#7 D 1
#8 D 2
#9 E 1
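Neither answer defines df, so here is a minimal self-contained sketch in base R. The data frame is a hypothetical reconstruction of the example in the question (Subject stored as character), not the asker's real data. It numbers the rows within each subject using ave and seq_along.

```r
# Hypothetical reconstruction of the question's example data
df <- data.frame(
  Subject = c("A", "A", "B", "C", "C", "C", "D", "D", "E"),
  stringsAsFactors = FALSE
)

# ave() splits the first argument by Subject, applies seq_along to each
# piece, and puts the results back in the original row order
df$Index <- ave(seq_along(df$Subject), df$Subject, FUN = seq_along)

print(df)
```

Because ave keeps the original row order, this should also work when the rows of a subject are not adjacent.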

Manipulating large dataset with dcast

Apologies if this is a repeat question but I could not find the specific answer I am looking for. I have a dataframe with counts of different species caught on a given trip. A simplified example with 5 trips and 4 species is below:
trip = c(1,1,1,2,2,3,3,3,3,4,5,5)
species = c("a","b","c","b","d","a","b","c","d","c","c","d")
count = c(5,7,3,1,8,10,1,4,3,1,2,10)
dat = cbind.data.frame(trip, species, count)
dat
> dat
trip species count
1 1 a 5
2 1 b 7
3 1 c 3
4 2 b 1
5 2 d 8
6 3 a 10
7 3 b 1
8 3 c 4
9 3 d 3
10 4 c 1
11 5 c 2
12 5 d 10
I am only interested in the counts of species b for each trip. So I want to manipulate this data frame so I end up with one that looks like this:
trip2 = c(1,2,3,4,5)
species2 = c("b","b","b","b","b")
count2 = c(7,1,1,0,0)
dat2 = cbind.data.frame(trip2, species2, count2)
dat2
> dat2
trip2 species2 count2
1 1 b 7
2 2 b 1
3 3 b 1
4 4 b 0
5 5 b 0
I want to keep all trips, including trips where species b was not observed. So I can't just subset the data by species b. I know I can cast the data so species are the columns and then just remove the columns for the other species like so:
library(dplyr)
library(reshape2)
test = dcast(dat, trip ~ species, value.var = "count", fun.aggregate = sum)
test
> test
trip a b c d
1 1 5 7 3 0
2 2 0 1 0 8
3 3 10 1 4 3
4 4 0 0 1 0
5 5 0 0 2 10
However, my real dataset has several hundred species caught on thousands of trips, and if I try to cast that many species to columns R chokes. There are way too many columns. Is there a way to specify in dcast that I only want to cast species b? Or is there another way to do this that doesn't require casting the data? Thank you.

Here is a data.table approach which I suspect will be very fast for you:
library(data.table)
setDT(dat)
result <- dat[,.(species = "b", count = sum(.SD[species == "b",count])),by = trip]
result
trip species count
1: 1 b 7
2: 2 b 1
3: 3 b 1
4: 4 b 0
5: 5 b 0

We can use tidyverse
library(dplyr)
library(tidyr)
dat %>%
filter(species == 'b') %>%
group_by(trip, species) %>%
summarise(count = sum(count)) %>%
ungroup %>%
complete(trip = unique(dat$trip), fill = list(species = 'b', count = 0))
# A tibble: 5 x 3
# trip species count
# <dbl> <chr> <dbl>
#1 1 b 7
#2 2 b 1
#3 3 b 1
#4 4 b 0
#5 5 b 0
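For comparison, the same result can be sketched in base R without casting: aggregate the rows for species b, then left-join them onto the full list of trips so that trips without b are kept with a count of 0. The intermediate names b_counts and result are illustrative only.

```r
dat <- data.frame(
  trip    = c(1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 5, 5),
  species = c("a", "b", "c", "b", "d", "a", "b", "c", "d", "c", "c", "d"),
  count   = c(5, 7, 3, 1, 8, 10, 1, 4, 3, 1, 2, 10),
  stringsAsFactors = FALSE
)

# Sum the counts of species "b" per trip (only trips where "b" occurs)
b_counts <- aggregate(count ~ trip, data = dat[dat$species == "b", ], FUN = sum)

# Left join onto every trip so that trips without "b" are not dropped
all_trips <- data.frame(trip = sort(unique(dat$trip)))
result <- merge(all_trips, b_counts, by = "trip", all.x = TRUE)

# Trips with no "b" rows come back as NA; turn them into zero counts
result$count[is.na(result$count)] <- 0
result$species <- "b"
result <- result[, c("trip", "species", "count")]

print(result)
```

Only the rows for one species are ever aggregated, so the wide table with several hundred species columns is never built.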

R reset counter based on two columns [duplicate]

This question already has an answer here:
R code to assign a sequence based off of multiple variables [duplicate]
(1 answer)
Closed 3 years ago.
I have the following kind of data and I need the output shown in the second data frame:
a <- c(1,1,1,1,2,2,2,2,2,2,2)
b <- c(1,1,1,2,3,3,3,3,4,5,6)
d <- c(1,2,3,4,1,2,3,4,5,6,7)
df <- as.data.frame(cbind(a,b,d))
output <- c(1,1,1,2,1,1,1,1,2,3,4)
df_output <- as.data.frame(cbind(df,output))
I have tried cumsum but I am not able to get the desired result. Please guide. Regards, Enthu.
The counter should reset to 1 whenever the value in column a changes.
Within the same value of a, rows with the same value of b should get the same counter, and each change in b should increment the counter by 1.
For example, in the 5th record a changes and b has the value 3, so the counter resets to 1. Rows 5 to 8 all have the same b, so they all get 1, and each later change in b increments the counter by 1.

We can group by column 'a' and then create the new column by matching each value of 'b' against the unique values of 'b' within that group
library(dplyr)
df2 <- df %>%
group_by(a) %>%
mutate(out = match(b, unique(b)))
df2
# A tibble: 11 x 4
# Groups: a [2]
# a b d out
# <dbl> <dbl> <dbl> <int>
# 1 1 1 1 1
# 2 1 1 2 1
# 3 1 1 3 1
# 4 1 2 4 2
# 5 2 3 1 1
# 6 2 3 2 1
# 7 2 3 3 1
# 8 2 3 4 1
# 9 2 4 5 2
#10 2 5 6 3
#11 2 6 7 4
Or another option is to coerce a factor variable to integer
df %>%
group_by(a) %>%
mutate(out = as.integer(factor(b)))
data
df <- data.frame(a, b, d)
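The match-against-unique idea can also be sketched in base R with ave. The example below additionally illustrates one way the two dplyr variants can differ: match(x, unique(x)) numbers values in order of first appearance, while as.integer(factor(x)) numbers them in sorted order. The vector x is a made-up illustration, not part of the question.

```r
a <- c(1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2)
b <- c(1, 1, 1, 2, 3, 3, 3, 3, 4, 5, 6)
df <- data.frame(a, b)

# Within each value of a, number the distinct values of b
# in the order in which they first appear
df$out <- ave(df$b, df$a, FUN = function(x) match(x, unique(x)))
print(df)

# Illustration with unsorted values (hypothetical data)
x <- c(5, 3, 5, 9)
by_appearance <- match(x, unique(x))     # order of first appearance
by_sorted     <- as.integer(factor(x))   # order of sorted levels
print(by_appearance)
print(by_sorted)
```

For the data in the question, b is already increasing within each a, so both variants give the same counter.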

R: reshape dataframe with duplicated variable names labeled var.1, var.2 [duplicate]

This question already has answers here:
R: reshaping wide to long [duplicate]
(1 answer)
Using tidyr to combine multiple columns [duplicate]
(1 answer)
Reshaping multiple sets of measurement columns (wide format) into single columns (long format)
(8 answers)
Closed 4 years ago.
I'm hoping to reshape a dataframe in R so that a set of columns read in with duplicated names, and then renamed as var, var.1, var.2, anothervar, anothervar.1, anothervar.2 etc. can be treated as independent observations. I would like the number appended to the variable name to be used as the observation so that I can melt my data.
For example,
dat <- data.frame(ID=1:3, var=c("A", "A", "B"),
anothervar=c(5,6,7), var.1=c("C", "D", "E"),
anothervar.1 = c(1,2,3))
> dat
ID var anothervar var.1 anothervar.1
1 1 A 5 C 1
2 2 A 6 D 2
3 3 B 7 E 3
How can I reshape the data so it looks like the following:
ID obs var anothervar
1 1 A 5
1 2 C 1
2 1 A 6
2 2 D 2
3 1 B 7
3 2 E 3
Thank you for your help!

We can use melt from data.table that takes multiple patterns in the measure
library(data.table)
melt(setDT(dat), measure = patterns("^var", "anothervar"),
variable.name = "obs", value.name = c("var", "anothervar"))[order(ID)]
# ID obs var anothervar
#1: 1 1 A 5
#2: 1 2 C 1
#3: 2 1 A 6
#4: 2 2 D 2
#5: 3 1 B 7
#6: 3 2 E 3

As for a tidyverse solution, we can use unite with gather
library(dplyr)
library(tidyr)
dat %>%
unite("1", var, anothervar) %>%
unite("2", var.1, anothervar.1) %>%
gather(obs, value, -ID) %>%
separate(value, into = c("var", "anothervar"))
# ID obs var anothervar
#1 1 1 A 5
#2 2 1 A 6
#3 3 1 B 7
#4 1 2 C 1
#5 2 2 D 2
#6 3 2 E 3
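A base R sketch of the same wide-to-long reshape, using stats::reshape with one vector of column names per measured variable. The column pairs are listed by hand here, which is an assumption that suits this small example. With many numbered columns the lists would have to be built programmatically.

```r
dat <- data.frame(
  ID = 1:3,
  var = c("A", "A", "B"),
  anothervar = c(5, 6, 7),
  var.1 = c("C", "D", "E"),
  anothervar.1 = c(1, 2, 3),
  stringsAsFactors = FALSE
)

# Each list element names the wide columns that stack into one long column
long <- reshape(
  dat,
  direction = "long",
  varying = list(c("var", "var.1"), c("anothervar", "anothervar.1")),
  v.names = c("var", "anothervar"),
  timevar = "obs",
  times = 1:2,
  idvar = "ID"
)

# Order by ID, then by observation number, and drop the generated row names
long <- long[order(long$ID, long$obs), ]
rownames(long) <- NULL

print(long)
```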

convert dataframe to wide format in r ?(transpose and concatenate) [duplicate]

This question already has answers here:
Transpose / reshape dataframe without "timevar" from long to wide format
(9 answers)
Closed 4 years ago.
let's say
df:
user actions
1 A
1 B
1 c
2 A
2 D
3 B
4 C
4 D
I want to convert to this format
new_df:
user action1 action2 action3
1 A B C
2 A D NA
3 B NA NA
4 C D NA
Please note that the number of action columns in new_df is equal to the maximum number of actions for any user. NA should be inserted where a user has fewer than the maximum number of actions.
How can I do it?

You can use rle to create a column to store action1, action2, etc. Then use dcast from data.table package to turn the data into a wide format.
df$coln <- paste0("actions", unlist(lapply(rle(df$user)$lengths, seq_len)))
data.table::dcast(df, user ~ coln, value.var="actions")
In response to OP's comment, you can pad the beginning with 0 as follows:
df$coln <- paste0("actions", sprintf("%02d", unlist(lapply(rle(df$user)$lengths, seq_len))))
Using data.table package:
df <- read.table(text="user actions
1 A
1 B
1 C
1 D
1 E
1 F
1 G
1 H
1 I
1 J
1 K
2 A
2 D
3 B
4 C
4 D", header=TRUE)
library(data.table)
setDT(df)
dcast(setDT(df)[, coln := sprintf("actions%02d", seq_len(.N)), by=.(user)],
user ~ coln, value.var="actions")

A solution using tidyverse approach
df <- read.table(text = "user actions
1 A
1 B
1 c
2 A
2 D
3 B
4 C
4 D", header = TRUE)
library(tidyr)
library(dplyr)
df %>%
group_by(user) %>%
mutate(index = paste0("action", row_number())) %>%
spread(index, actions)
#> # A tibble: 4 x 4
#> # Groups: user [4]
#> user action1 action2 action3
#> <int> <fct> <fct> <fct>
#> 1 1 A B c
#> 2 2 A D <NA>
#> 3 3 B <NA> <NA>
#> 4 4 C D <NA>
Created on 2018-04-11 by the reprex package (v0.2.0).
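The same two steps (number the actions within each user, then spread them into columns) can be sketched in base R with ave and reshape. The helper column idx and the final renaming from actions1 to action1 are illustrative choices made to match the layout asked for in the question.

```r
df <- data.frame(
  user    = c(1, 1, 1, 2, 2, 3, 4, 4),
  actions = c("A", "B", "c", "A", "D", "B", "C", "D"),
  stringsAsFactors = FALSE
)

# Step 1: running index of each action within its user
df$idx <- ave(seq_along(df$user), df$user, FUN = seq_along)

# Step 2: one row per user, one column per index; missing cells become NA
wide <- reshape(
  df,
  direction = "wide",
  idvar = "user",
  timevar = "idx",
  v.names = "actions",
  sep = ""
)

# Rename actions1, actions2, ... to action1, action2, ...
names(wide) <- sub("^actions", "action", names(wide))
rownames(wide) <- NULL

print(wide)
```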

R group by key get max value for multiple columns

I want to do something like this:
How to make a unique in R by column A and keep the row with maximum value in column B
Except my data.table has one key column, and multiple value columns. So say I have the following:
a b c
1: 1 1 1
2: 1 2 1
3: 1 2 2
4: 2 1 1
5: 2 2 5
6: 2 3 3
7: 3 1 4
8: 3 2 1
If the key is column a, I want for each unique a to return the row with the maximum b, and if there is more than one unique max b, get the one with the max c and so on for multiple columns. So the result should be:
a b c
1: 1 2 2
2: 2 3 3
3: 3 2 1
I'd also like this to be done for an arbitrary number of columns. So if my data.table had 20 columns, I'd want the max function to be applied in order from left to right.

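Here is a suggested data.table solution. You might want to consider using data.table::frankv as follows:
DT[, .SD[which.max(frankv(.SD, ties.method = "first"))], by = a]
frankv returns the rank of each row within the group, based on the remaining columns from left to right. which.max then gives the position of the row with the largest rank, and .SD[...] subsets to that particular row, so the result does not depend on the rows already being sorted.
Please let me know if it fails for your larger dataset.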

To make this work for any number of columns, a possible dplyr solution would be to use arrange_all
library(dplyr)
df <- data.frame(a = c(1,1,1,2,2,2,3,3), b = c(1,2,2,1,2,3,1,2),
c = c(1,1,2,1,5,3,4,1))
df %>% group_by(a) %>% arrange_all() %>% filter(row_number() == n())
# A tibble: 3 x 3
# Groups: a [3]
# a b c
# 1 1 2 2
# 2 2 3 3
# 3 3 2 1

A generic solution for an arbitrary number of columns can be achieved using arrange_at. In the example below, c("a","b","c") stands for the arbitrary columns, with the key column first.
library(dplyr)
df %>% arrange_at(.vars = vars(c("a","b","c"))) %>%
mutate(changed = ifelse(a != lead(a), TRUE, FALSE)) %>%
filter(is.na(changed) | changed ) %>%
select(-changed)
a b c
1 1 2 2
2 2 3 3
3 3 2 1
Another option is to use max with dplyr as below. The approach is to first group_by on a and then filter for the max value of b. Then group_by again on both a and b and filter for the rows with the max value of c.
library(dplyr)
df %>% group_by(a) %>%
filter(b == max(b)) %>%
group_by(a, b) %>%
filter(c == max(c))
# Groups: a, b [3]
# a b c
# <int> <int> <int>
#1 1 2 2
#2 2 3 3
#3 3 2 1
Data
df <- read.table(text = "a b c
1: 1 1 1
2: 1 2 1
3: 1 2 2
4: 2 1 1
5: 2 2 5
6: 2 3 3
7: 3 1 4
8: 3 2 1", header = TRUE, stringsAsFactors = FALSE)

dat <- data.frame(a = c(1,1,1,2,2,2,3,3),
b = c(1,2,2,1,2,3,1,2),
c = c(1,1,2,1,5,3,4,1))
library(sqldf)
sqldf("with d as (select * from 'dat' group by a order by b, c desc) select * from d order by a")
a b c
1 1 2 2
2 2 3 3
3 3 2 1
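A base R sketch of the sort-then-take-last idea for any number of columns, assuming the key is a single column named a and the remaining columns are compared from left to right: order the rows by every column, then keep the last row of each key with duplicated(..., fromLast = TRUE). The rows are deliberately shuffled first to show that the input order does not matter.

```r
df <- data.frame(
  a = c(1, 1, 1, 2, 2, 2, 3, 3),
  b = c(1, 2, 2, 1, 2, 3, 1, 2),
  c = c(1, 1, 2, 1, 5, 3, 4, 1)
)

# Reverse the rows so the example does not rely on pre-sorted input
df <- df[rev(seq_len(nrow(df))), ]

# Sort by all columns from left to right; unname() stops column names
# from being matched to named arguments of order()
sorted <- df[do.call(order, unname(as.list(df))), ]

# After sorting, the last row of each key holds the lexicographic maximum
result <- sorted[!duplicated(sorted$a, fromLast = TRUE), ]
rownames(result) <- NULL

print(result)
```

Note that this picks the maximum of b first and only uses c to break ties, which is the behaviour described in the question.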
