Add a column for rank, then ranking by group - r

I have managed to add a ranking column to my data frame:
lowest.mortality.upper<-nrow(lowest.mortality)
## Add a ranking column
lowest.mortality$ranking<-c(1:lowest.mortality.upper)
However, now I have to rank a bigger dataset within groups defined by another column, state. So it would read:
AK 1
AK 2
TX 1
TX 2
TX 3
I could use a for loop, but that's so 1980s. I'm sure that subset or lapply should work, but I can't figure out how.
Thanks

Seems like you want to add a sequence column by group. There are several options.
A base R solution using ave is
df1$indx <- with(df1, ave(seq_along(grp), grp, FUN=seq_along))
Or this can be done with getanID from splitstackshape
library(splitstackshape)
getanID(df1, 'grp')[]
# grp .id
#1: AK 1
#2: AK 2
#3: TX 1
#4: TX 2
#5: TX 3
Or
library(dplyr)
df1 %>%
group_by(grp) %>%
mutate(indx = row_number())
data
df1 <- structure(list(grp = c("AK", "AK", "TX", "TX", "TX")), .Names =
"grp", class = "data.frame", row.names = c(NA, -5L))
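The options above number the rows of each group in their order of appearance. If the rank should instead follow a value column within each group, the same ave idea can be combined with rank. This is only a minimal sketch: the data frame hosp and its columns state and rate are hypothetical names, not taken from the question.

```r
# Hypothetical data: 'state' and 'rate' are illustrative column names
hosp <- data.frame(
  state = c("TX", "AK", "TX", "AK", "TX"),
  rate  = c(12.1, 9.4, 10.3, 8.7, 11.0)
)

# Rank within each state by rate (1 = lowest rate);
# ties, if any, are broken by order of appearance
hosp$ranking <- ave(hosp$rate, hosp$state,
                    FUN = function(x) rank(x, ties.method = "first"))

# Show the result sorted by state and rank
hosp[order(hosp$state, hosp$ranking), ]
```

Because ave returns a vector in the original row order, the new column can be assigned directly without sorting the data first.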

Related

How to count unique values over multiple columns using R?

Let's say I have the following df:
1 | 2 | 3
home, work | work, home | home, work
leisure, work | work, home, leisure | work, home
home, leisure | work, home | home, work
I want to count all unique values over the entire data.frame (not by column or row; I'm interested in the cell values).
So the output should look like this:
freq
home, work 3
leisure, work 1
home, leisure 1
work, home 3
work, home, leisure 1
I have not found a way to do that. The count() function seems to work only with single columns.
Thank you very much for the help:)
You could unlist the data frame and use table to get the counts in base R:
stack(table(unlist(df)))
#Same as
#stack(table(as.matrix(df)))
If you prefer the tidyverse, get the data in long format using pivot_longer and then count.
library(dplyr)
df %>%
tidyr::pivot_longer(cols = everything()) %>%
dplyr::count(value)
# A tibble: 5 x 2
# value n
# <chr> <int>
#1 home,leisure 1
#2 home,work 3
#3 leisure,work 1
#4 work,home 3
#5 work,home,leisure 1
data
df <- structure(list(X1 = c("home,work", "leisure,work", "home,leisure"
), X2 = c("work,home", "work,home,leisure", "work,home"), X3 = c("home,work",
"work,home", "home,work")), class = "data.frame", row.names = c(NA, -3L))
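If the counts are needed as an ordinary data frame with a freq column, as in the desired output, a base R variant of the table approach could look like the sketch below. The names counts, value and freq are my own choices for illustration.

```r
# Same example data as above, rebuilt so the snippet is self-contained
df <- data.frame(X1 = c("home,work", "leisure,work", "home,leisure"),
                 X2 = c("work,home", "work,home,leisure", "work,home"),
                 X3 = c("home,work", "work,home", "home,work"),
                 stringsAsFactors = FALSE)

# Flatten all cells into one vector, tabulate, and convert to a data frame
counts <- as.data.frame(table(value = unlist(df)),
                        responseName = "freq",
                        stringsAsFactors = FALSE)

# Most frequent values first
counts[order(-counts$freq), ]
```

Note that "home,work" and "work,home" are counted as different values here, exactly as in the answers above.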
With tidyverse, we can also use gather (superseded by pivot_longer since tidyr 1.0.0, but still available)
library(dplyr)
library(tidyr)
df %>%
gather %>%
count(value)
# value n
#1 home,leisure 1
#2 home,work 3
#3 leisure,work 1
#4 work,home 3
#5 work,home,leisure 1

Reshape dataframe that has years in column names

I am trying to reshape a wide dataframe in R into a long dataframe. The functions I have read about in reshape2 and tidyr all seem to handle only a single variable being split, whereas I have ~10. Each column name holds both the variable name and the year. I would like to split them so that the year becomes a column (a factor) with one row per year, leaving significantly fewer columns and an easier data set to work with.
Currently the table looks something like this.
State Rank Name V1_2016 V1_2017 V1_2018 V2_2016 V2_2017 V2_2018
TX 1 Company 1 2 3 4 5 6
I have tried to melt the data with reshape2 but it came out looking like garbage and being 127k rows when it should only be about 10k.
I am trying to get the data to look something like this.
State Rank Name Year V1 V2
1 TX 1 Company 2016 1 4
2 TX 1 Company 2017 2 5
3 TX 1 Company 2018 3 6
One option is melt from data.table, which can take multiple measure columns based on patterns in the column names:
library(data.table)
nm1 <- unique(sub(".*_", "", names(df)[-(1:3)]))
melt(setDT(df), measure = patterns("V1", "V2"),
value.name = c("V1", "V2"), variable.name = "Year")[,
Year := nm1[Year]][]
# State Rank Name Year V1 V2
#1: TX 1 Company 2016 1 4
#2: TX 1 Company 2017 2 5
#3: TX 1 Company 2018 3 6
data
df <- structure(list(State = "TX", Rank = 1L, Name = "Company", V1_2016 = 1L,
V1_2017 = 2L, V1_2018 = 3L, V2_2016 = 4L, V2_2017 = 5L, V2_2018 = 6L),
class = "data.frame", row.names = c(NA,
-1L))
One dplyr and tidyr possibility could be:
library(dplyr)
library(tidyr)
df %>%
gather(var, val, -c(1:3)) %>%
separate(var, c("var", "Year")) %>%
spread(var, val)
State Rank Name Year V1 V2
1 TX 1 Company 2016 1 4
2 TX 1 Company 2017 2 5
3 TX 1 Company 2018 3 6
First, it transforms the data from wide to long format, excluding the first three columns. Second, it separates the original variable names into two new variables: one containing the variable prefix, the other containing the year. Finally, it spreads the data back out so that each prefix becomes its own column.
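For comparison, the same wide-to-long step can be sketched in base R with reshape, which can guess the variable names and years from the column names when given the separator. This is a sketch under the assumption that every measure column follows the name_year pattern shown in the question; the object name long is mine.

```r
# Example data from the question, rebuilt so the snippet is self-contained
df <- data.frame(State = "TX", Rank = 1L, Name = "Company",
                 V1_2016 = 1L, V1_2017 = 2L, V1_2018 = 3L,
                 V2_2016 = 4L, V2_2017 = 5L, V2_2018 = 6L,
                 stringsAsFactors = FALSE)

id_cols <- c("State", "Rank", "Name")

# Split each measure column name at "_" into variable (V1, V2) and year,
# then stack the years into rows
long <- reshape(df,
                direction = "long",
                varying   = setdiff(names(df), id_cols),
                sep       = "_",
                idvar     = id_cols,
                timevar   = "Year")
long
```

The result has one row per State/Rank/Name/Year combination with V1 and V2 as separate columns; the row names are generated by reshape and can be reset with rownames(long) <- NULL if they get in the way.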

Creating a new column of consecutive token (like n-gram) in R

I have this dataset;
A B
URBAN 1
PLAN 2
I would like a new column A' to be added like this:
A A' B
URBAN URB 1
URBAN RBA 1
URBAN BAN 1
PLAN PLA 2
PLAN LAN 2
How do I make the A' column in R?
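dat <- read.table(text="A B
URBAN 1
PLAN 2", header=TRUE, stringsAsFactors=FALSE)
library(zoo)
# For each word, slide a window of width 3 over the character positions
d <- lapply(dat$A, function(y)
rollapply(1:nchar(y), 3, function(x) substr(y, min(x), max(x))))
# Repeat each row once per window (index by row number, not by the values in B)
data.frame(dat[rep(seq_len(nrow(dat)), lengths(d)), ], A1=unlist(d), row.names = NULL)
A B A1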
1 URBAN 1 URB
2 URBAN 1 RBA
3 URBAN 1 BAN
4 PLAN 2 PLA
5 PLAN 2 LAN
Here is one possible way. I am sure there are much more concise ways to handle this job, but I think the following will do. For each row in mydf, I applied substr() to create three-letter elements; the Map() part produces them. Since Map() recycles the shorter argument, some of the resulting elements are not three letters long, so I filtered those out with another lapply(). Finally, unnest() splits the elements of each list and creates long-format data.
library(tidyverse)
mydf %>%
mutate(whatever = lapply(1:nrow(mydf), function(x) {
unlist(Map(function(j, k) substr(mydf$A[x], start = j, stop = k),
1:nchar(mydf$A[x]), 3:nchar(mydf$A[x])))
}) %>%
lapply(function(x) x[nchar(x) ==3])) %>%
unnest(whatever)
A B whatever
1 URBAN 1 URB
2 URBAN 1 RBA
3 URBAN 1 BAN
4 PLAN 2 PLA
5 PLAN 2 LAN
DATA
mydf <- structure(list(A = c("URBAN", "PLAN"), B = 1:2), .Names = c("A",
"B"), class = "data.frame", row.names = c(NA, -2L))
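Here is an option with str_match_all from stringr. The lookahead pattern (?=(...)) captures every overlapping run of three characters.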
library(stringr)
merge(stack(lapply(setNames(str_match_all(mydf$A, "(?=(...))"),
mydf$A), `[`, , 2))[2:1], mydf, by.x = 'ind', by.y = 'A')
Or using a similar idea with tidyverse
library(purrr)
library(dplyr)
library(tidyr)
mydf %>%
mutate(Anew = str_match_all(A, "(?=(...))") %>%
map(~.x[,2])) %>%
unnest(Anew)
# A B Anew
#1 URBAN 1 URB
#2 URBAN 1 RBA
#3 URBAN 1 BAN
#4 PLAN 2 PLA
#5 PLAN 2 LAN
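The three-letter windows can also be built without any packages. The following is a minimal base R sketch; the names n, grams and res are my own, and it assumes every word has at least n characters.

```r
mydf <- data.frame(A = c("URBAN", "PLAN"), B = 1:2, stringsAsFactors = FALSE)

n <- 3  # window width, i.e. characters per token

# For each word, take every substring of width n
# (assumes nchar(w) >= n for every word)
grams <- lapply(mydf$A, function(w) {
  starts <- seq_len(nchar(w) - n + 1)
  substring(w, starts, starts + n - 1)
})

# Repeat each original row once per token and attach the tokens
res <- data.frame(mydf[rep(seq_len(nrow(mydf)), lengths(grams)), ],
                  Anew = unlist(grams),
                  row.names = NULL,
                  stringsAsFactors = FALSE)
res
```

Changing n gives windows of a different width, for example n <- 2 for two-letter tokens.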

Join two data frames based on one column of a frame and two columns of another

So I have two data frames, info and towers, with examples in the following:
Info:
ID Date
1132 01/09/2015
1156 02/09/2015
1132 04/09/2015
1101 04/09/2015
Towers:
Tower ID1 ID2
1 1132 1101
2 1520 1156
The values in the ID column of Info will always match either ID1 or ID2 in Towers. I want to join the frames based on that information, so my joined frame should be:
ID Date Tower
1132 01/09/2015 1
1156 02/09/2015 2
1132 04/09/2015 1
1101 04/09/2015 1
I know dplyr's semi_join makes something like what I need, but I understand it requires a match in both value and column name. Given that these columns have different names, I don't know if it will work properly. Is there a method I could use here?
library(dplyr)
tidyr::gather(df2, Tower2, ID, -Tower) %>% select(-Tower2) %>% right_join(df, "ID")
data
df <- structure(list(ID = c(1132, 1156, 1132, 1101), Date = structure(c(1L,
2L, 3L, 3L), .Label = c("01/09/2015", "02/09/2015", "04/09/2015"
), class = "factor")), .Names = c("ID", "Date"), row.names = c(NA,
-4L), class = "data.frame")
df2 <- structure(list(Tower = 1:2, ID1 = c(1132L, 1520L), ID2 = c(1101L,
1156L)), .Names = c("Tower", "ID1", "ID2"), class = "data.frame", row.names = c(NA,
-2L))
We can use melt from data.table. Convert the 'data.frame' to 'data.table' (setDT(df2)), melt from 'wide' to 'long' format and join with the original dataset 'df' on 'ID'.
library(data.table)
melt(setDT(df2), id.var="Tower", value.name = "ID")[df, on = "ID"][, variable := NULL][]
# Tower ID Date
#1: 1 1132 01/09/2015
#2: 2 1156 02/09/2015
#3: 1 1132 04/09/2015
#4: 1 1101 04/09/2015
We could also do this without any join, using only base R (no external packages and no loops; sapply is a loop in disguise). The idea is to build a named lookup vector: replicate the 'Tower' column of 'df2' once for each ID column (here 2), name the result with the unlisted ID columns (unlist(df2[-1])), and then index that vector by the 'ID' column of the first dataset (as.character(df$ID)) to return the 'Tower' that corresponds to each 'ID'.
df$Tower <- setNames( rep(df2$Tower, 2), unlist(df2[-1]))[as.character(df$ID)]
df$Tower
#[1] 1 2 1 1
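The lookup idea can be written so that the number of ID columns is not hard-coded. This is just a sketch with match instead of a named vector; the names id_cols and lookup are mine, and it assumes every ID appears in only one tower.

```r
Info <- data.frame(ID   = c(1132, 1156, 1132, 1101),
                   Date = c("01/09/2015", "02/09/2015",
                            "04/09/2015", "04/09/2015"),
                   stringsAsFactors = FALSE)
Towers <- data.frame(Tower = 1:2,
                     ID1 = c(1132, 1520),
                     ID2 = c(1101, 1156))

# Every column except 'Tower' is treated as an ID column
id_cols <- setdiff(names(Towers), "Tower")

# Long lookup table: one row per (Tower, ID) pair
lookup <- data.frame(Tower = rep(Towers$Tower, times = length(id_cols)),
                     ID    = unlist(Towers[id_cols], use.names = FALSE))

# Find each Info ID in the lookup table and take the matching tower
Info$Tower <- lookup$Tower[match(Info$ID, lookup$ID)]
Info
```

An ID that appears in no tower gets NA, which makes missing matches easy to spot.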
You really don't actually need to join; you can just make a new column, as long as you evaluate grouped by row:
Info %>% rowwise() %>%
mutate(Tower = Towers[ID == Towers$ID1 | ID == Towers$ID2, 'Tower'])
## Source: local data frame [4 x 3]
## Groups: <by row>
##
## # A tibble: 4 x 3
## ID Date Tower
## <int> <fctr> <int>
## 1 1132 01/09/2015 1
## 2 1156 02/09/2015 2
## 3 1132 04/09/2015 1
## 4 1101 04/09/2015 1
or equivalently in full base R,
Info$Tower <- sapply(Info$ID, function(x){Towers[x == Towers$ID1 | x == Towers$ID2, 'Tower']})
Another approach uses melt from the reshape2 package (also suggested by @SymbolixAU in a comment), with df and df2 from @Sumedh's post.
library(reshape2)
library(dplyr)
melt(df2,value.name = "ID",id.vars = "Tower") %>% right_join(df,by = "ID") %>% select(-variable)
We can also do the reshaping step with the base R reshape function (the join and select steps still use dplyr):
reshape(data = df2,direction = "long",varying = c("ID1","ID2"),v.names = "ID") %>% right_join(df,by = "ID") %>% select(-c(time,id))

restructure data frame in R

I'm wondering if there is an easy way to restructure some data I have. I currently have a data frame that looks like this...
Year Cat Number
2001 A 15
2001 B 2
2002 A 4
2002 B 12
But what I ultimately want is to have it in this shape...
Year Cat Number Cat Number
2001 A 15 B 2
2002 A 4 B 12
Is there a simple way to do this?
Thanks in advance
:)
One way would be to use dcast/melt from reshape2. In the code below, I first create a sequence of numbers (the indx column) within each Year using transform and ave. The transformed dataset is then melted, keeping Year and indx as id.var. Finally, the long-format dataset is reshaped to wide format using dcast. If you don't need the numeric suffix (_1, _2), you can use gsub to remove that part.
library(reshape2)
res <- dcast(melt(transform(df, indx=ave(seq_along(Year), Year, FUN=seq_along)),
id.var=c("Year", "indx")), Year~variable+indx, value.var="value")
colnames(res) <- gsub("\\_.*", "", colnames(res))
res
# Year Cat Cat Number Number
#1 2001 A B 15 2
#2 2002 A B 4 12
Or using dplyr/tidyr. The idea is similar to the one above. After grouping by the Year column, generate an indx column using mutate, reshape to long format with gather, unite two columns into a single column VarIndx, and then reshape back to wide format with spread. In the last step, mutate_each converts the columns whose names start with Number to numeric (mutate_each and funs are deprecated in current dplyr; mutate(across(starts_with("Number"), as.numeric)) is the modern equivalent).
library(dplyr)
library(tidyr)
res1 <- df %>%
group_by(Year) %>%
mutate(indx=row_number()) %>%
gather("Var", "Val", Cat:Number) %>%
unite(VarIndx, Var, indx) %>%
spread(VarIndx, Val) %>%
mutate_each(funs(as.numeric), starts_with("Number"))
res1
# Source: local data frame [2 x 5]
# Year Cat_1 Cat_2 Number_1 Number_2
#1 2001 A B 15 2
#2 2002 A B 4 12
Or you can create an index variable .id using getanID from splitstackshape (from comments made by @Ananda Mahto, the author of splitstackshape) and use reshape from base R:
library(splitstackshape)
reshape(getanID(df, "Year"), direction="wide", idvar="Year", timevar=".id")
# Year Cat.1 Number.1 Cat.2 Number.2
#1: 2001 A 15 B 2
#2: 2002 A 4 B 12
data
df <- structure(list(Year = c(2001L, 2001L, 2002L, 2002L), Cat = c("A",
"B", "A", "B"), Number = c(15L, 2L, 4L, 12L)), .Names = c("Year",
"Cat", "Number"), class = "data.frame", row.names = c(NA, -4L
))
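The last option also works without splitstackshape if the index is created with ave, as in the first answer. A minimal base R sketch (the object name wide is mine):

```r
df <- data.frame(Year   = c(2001L, 2001L, 2002L, 2002L),
                 Cat    = c("A", "B", "A", "B"),
                 Number = c(15L, 2L, 4L, 12L),
                 stringsAsFactors = FALSE)

# Sequence number of each row within its Year (1, 2, ...)
df$indx <- ave(seq_along(df$Year), df$Year, FUN = seq_along)

# One row per Year, with Cat and Number repeated for each index
wide <- reshape(df, direction = "wide", idvar = "Year", timevar = "indx")
wide
```

The columns come out as Year, Cat.1, Number.1, Cat.2, Number.2, which keeps each Cat next to its Number as in the desired layout.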

Resources