Reshaping different variables for selecting values from one column in R - r

Below is a sample of my data; the real data has more R and O columns.
A R1 O1 R2 O2 R3 O3
1 3 3 5 3 6 4
2 3 3 5 4 7 4
3 4 4 5 5 6 5
I want to get the following data
A R O Value
1 3 1 3
1 5 2 3
1 6 3 4
2 3 1 3
2 5 2 4
2 7 3 4
3 4 1 4
3 5 2 5
3 6 3 5
I tried the melt function, but without success. Any help would be very much appreciated.

A solution using dplyr and tidyr. The key is to use gather to collect all the columns other than A, then use extract to split the column names into a letter part and a number part, and then use spread to convert the data frame back to wide format.
library(dplyr)
library(tidyr)
dt2 <- dt %>%
  gather(Column, Number, -A) %>%
  extract(Column, into = c("Column", "ID"), regex = "([A-Z]+)([0-9]+)") %>%
  spread(Column, Number) %>%
  select(A, R, O = ID, Value = O)
dt2
# A R O Value
# 1 1 3 1 3
# 2 1 5 2 3
# 3 1 6 3 4
# 4 2 3 1 3
# 5 2 5 2 4
# 6 2 7 3 4
# 7 3 4 1 4
# 8 3 5 2 5
# 9 3 6 3 5
DATA
dt <- read.table(text = "A R1 O1 R2 O2 R3 O3
1 3 3 5 3 6 4
2 3 3 5 4 7 4
3 4 4 5 5 6 5",
header = TRUE)
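For readers who prefer to avoid extra packages, the same wide-to-long step can be sketched with base R's reshape(). This is only an illustration on the sample data above: the column lists passed to varying are written out by hand (an assumption that there are exactly three R/O pairs) and would need extending for more pairs.

```r
dt <- read.table(text = "A R1 O1 R2 O2 R3 O3
1 3 3 5 3 6 4
2 3 3 5 4 7 4
3 4 4 5 5 6 5",
header = TRUE)

# Stack R1..R3 into "R" and O1..O3 into "Value";
# the pair index (1, 2, 3) goes into the new column "O".
long <- reshape(dt,
                direction = "long",
                idvar     = "A",
                varying   = list(c("R1", "R2", "R3"), c("O1", "O2", "O3")),
                v.names   = c("R", "Value"),
                timevar   = "O",
                times     = 1:3)

# reshape() returns rows ordered by pair index; sort by A, then O,
# and put the columns in the requested order.
long <- long[order(long$A, long$O), c("A", "R", "O", "Value")]
rownames(long) <- NULL
long
```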

Related

Convert an integer from a scraped string in a database in r

I am struggling to find a way to convert a string that has both numbers and letters into just a number in R. I web-scraped data and now want to convert one column from a string into a number. The last column of my df, Clean.data$Drafted..tm.rnd.yr., currently reads like "Arizona / 1st / 5th pick / 2011". I am trying to extract the pick number, so for that example I would want to extract just "5". Is there any way to do this? I am fairly new to R.
library(rvest)
library(magrittr)
library(dplyr)
library(purrr)
years <- 2010:2020
urls <- paste0(
  'https://www.pro-football-reference.com/draft/',
  years,
  '-combine.htm')
combine.data <- map(
  urls,
  ~read_html(.x) %>%
    html_nodes(".stats_table") %>%
    html_table() %>%
    as.data.frame()
) %>%
  set_names(years) %>%
  bind_rows(.id = "year") %>%
  filter(Pos == 'CB' | Pos == "S")
Clean.data <- combine.data[!rowSums(combine.data == "")> 0,]
This is my code so far.
You can use regex to extract the relevant number from the data.
Clean.data$pick_number <- as.integer(sub('.*?/\\s(\\d+).*', '\\1',
Clean.data$Drafted..tm.rnd.yr.))
Clean.data$pick_number
# [1] 5 2 5 3 1 1 4 1 5 3 3 4 1 4 3 5 3 2 2 4 3 1 5 1 5 7 2
# [28] 5 3 7 1 2 3 4 7 7 2 3 3 5 3 5 7 3 2 2 5 3 5 4 4 6 1 3
# [55] 6 7 6 4 2 4 3 2 6 5 2 3 5 3 1 2 2 4 3 1 3 6 4 6 2 2 2
# [82] 4 1 6 3 3 4 5 2 1 3 3 7 3 1 2 1 4 4 5 3 1 2 4 3 2 7 3
#[109] 3 4 5 2 4 5 1 7 2 6 5 4 2 6 4 4 5 4
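This extracts the digits after the first "/". For a string such as "Arizona / 1st / 5th pick / 2011" that is the round (1), not the pick (5), so adjust the pattern if the pick is what you need.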
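If the number in front of the word "pick" is what is wanted, one possible pattern is sketched below. The vector drafted and its second string are made up for illustration; only the first string comes from the question, and the pattern assumes every value follows the same "team / round / Nth pick / year" layout.

```r
# Example strings: the first is from the question, the second is hypothetical.
drafted <- c("Arizona / 1st / 5th pick / 2011",
             "Team A / 3rd / 72nd pick / 2015")

# Capture the digits that are followed by a two-letter ordinal suffix
# (st, nd, rd, th) and the word "pick".
pick <- as.integer(sub(".*/\\s*(\\d+)[a-z]{2} pick.*", "\\1", drafted))
pick
```

Strings that do not contain "pick" are returned unchanged by sub(), so they would not convert to a meaningful integer; filter such rows out first if they can occur.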

numbering duplicated rows in dplyr [duplicate]

I ran into an issue with numbering the duplicated rows in a data.frame and could not find a similar post.
Let's say we have data like this:
df <- data.frame(gr=gl(7,2),x=c("a","a","b","b","c","c","a","a","c","c","d","d","a","a"))
> df
gr x
1 1 a
2 1 a
3 2 b
4 2 b
5 3 c
6 3 c
7 4 a
8 4 a
9 5 c
10 5 c
11 6 d
12 6 d
13 7 a
14 7 a
and want to add a new column called x_dupl that numbers each occurrence of an x value: 1 for the first occurrence, 2 for the second, 3 for the third, and so on.
Thanks in advance!
The expected output:
> df
gr x x_dupl
1 1 a 1
2 1 a 1
3 2 b 1
4 2 b 1
5 3 c 1
6 3 c 1
7 4 a 2
8 4 a 2
9 5 c 2
10 5 c 2
11 6 d 1
12 6 d 1
13 7 a 3
14 7 a 3
Your example data (plus the rows where gr = 7, as in your output), named df1 rather than df:
df1 <- data.frame(gr = gl(7,2),
x = c("a","a","b","b","c","c","a","a","c","c","d","d","a","a"))
library(dplyr)
df1 %>%
  group_by(x) %>%
  mutate(x_dupl = dense_rank(gr)) %>%
  ungroup()
# A tibble: 14 x 3
gr x x_dupl
<fctr> <fctr> <int>
1 1 a 1
2 1 a 1
3 2 b 1
4 2 b 1
5 3 c 1
6 3 c 1
7 4 a 2
8 4 a 2
9 5 c 2
10 5 c 2
11 6 d 1
12 6 d 1
13 7 a 3
14 7 a 3
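A base R solution:
df <- data.frame(gr=gl(7,2),x=c("a","a","b","b","c","c","a","a","c","c","d","d","a","a"))
# factor() makes this work whether x is stored as character or as factor
# (since R 4.0, data.frame() no longer converts strings to factors by default)
x <- rle(as.numeric(factor(df$x)))
# number the runs of each distinct value: 1st run, 2nd run, ...
x$values <- ave(x$values, x$values, FUN = seq_along)
# expand the run numbers back to one entry per row
df$x_dupl <- inverse.rle(x)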
# gr x x_dupl
# 1 1 a 1
# 2 1 a 1
# 3 2 b 1
# 4 2 b 1
# 5 3 c 1
# 6 3 c 1
# 7 4 a 2
# 8 4 a 2
# 9 5 c 2
# 10 5 c 2
# 11 6 d 1
# 12 6 d 1
# 13 7 a 3
# 14 7 a 3
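As a further illustration, the dense-rank idea from the dplyr answer can also be sketched in base R with ave() and match(). It assumes, as in the example data, that each block of repeated x values carries its own gr value; without such a block identifier the rle approach above is the safer choice.

```r
df <- data.frame(gr = gl(7, 2),
                 x  = c("a","a","b","b","c","c","a","a","c","c","d","d","a","a"))

# Within each x, number the distinct gr values in order of first appearance.
df$x_dupl <- ave(as.integer(df$gr), df$x,
                 FUN = function(g) match(g, unique(g)))
df
```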

group cases by shared values in r [duplicate]

I have a dataset like this:
case x y
1 4 5
2 4 5
3 8 9
4 7 9
5 6 3
6 6 3
I would like to create a grouping variable.
This variable should have the same value whenever both x and y are the same.
I do not care what this value is; it only serves to group the cases. In my dataset, if x and y are the same for two cases, they are probably part of the same organization, and I want to see which organizations there are.
So my preferred dataset would look like this:
case x y org
1 4 5 1
2 4 5 1
3 8 9 2
4 7 9 3
5 6 3 4
6 6 3 4
How would I have to program this in R?
Since you said "I do not care what this value is", you can just do the following:
dt$new=as.numeric(as.factor(paste(dt$x,dt$y)))
dt
case x y new
1 1 4 5 1
2 2 4 5 1
3 3 8 9 4
4 4 7 9 3
5 5 6 3 2
6 6 6 3 2
A dplyr solution using group_indices.
library(dplyr)
dt2 <- dt %>%
  mutate(org = group_indices(., x, y))
dt2
case x y org
1 1 4 5 1
2 2 4 5 1
3 3 8 9 4
4 4 7 9 3
5 5 6 3 2
6 6 6 3 2
If the group numbers need to be in order, we can use the rleid from the data.table package after we create the org column as follows.
library(dplyr)
library(data.table)
dt2 <- dt %>%
mutate(org = group_indices(., x, y)) %>%
mutate(org = rleid(org))
dt2
case x y org
1 1 4 5 1
2 2 4 5 1
3 3 8 9 2
4 4 7 9 3
5 5 6 3 4
6 6 6 3 4
Update
Here is how to sort the rows by a column in dplyr.
library(dplyr)
dt %>%
arrange(x)
case x y
1 1 4 5
2 2 4 5
3 5 6 3
4 6 6 3
5 4 7 9
6 3 8 9
We can also do this for more than one column, such as arrange(x, y), or use desc to reverse the order, like arrange(desc(x)).
DATA
dt <- read.table(text = " case x y
1 4 5
2 4 5
3 8 9
4 7 9
5 6 3
6 6 3",
header = TRUE)
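A base R sketch of the same grouping, numbering the groups in order of first appearance (which matches the output requested in the question), is given below. It builds a text key from x and y with paste(), so it assumes the pasted keys are unambiguous, which holds for plain numbers like these.

```r
dt <- read.table(text = " case x y
1 4 5
2 4 5
3 8 9
4 7 9
5 6 3
6 6 3",
header = TRUE)

# One key per row; rows with identical x and y share a key.
key <- paste(dt$x, dt$y)

# match() against the unique keys numbers groups by first appearance.
dt$org <- match(key, unique(key))
dt
```

Unlike the factor-based approach, the numbering here does not depend on the alphabetical order of the keys.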

Assign value to group based on condition in column

I have a data frame that looks like the following:
> df = data.frame(group = c(1,1,1,2,2,2,3,3,3),
date = c(1,2,3,4,5,6,7,8,9),
value = c(3,4,3,4,5,6,6,4,9))
> df
group date value
1 1 1 3
2 1 2 4
3 1 3 3
4 2 4 4
5 2 5 5
6 2 6 6
7 3 7 6
8 3 8 4
9 3 9 9
I want to create a new column that contains the date value per group that is associated with the value "4" from the value column.
The following data frame shows what I hope to accomplish.
group date value newValue
1 1 1 3 2
2 1 2 4 2
3 1 3 3 2
4 2 4 4 4
5 2 5 5 4
6 2 6 6 4
7 3 7 6 8
8 3 8 4 8
9 3 9 9 8
As we can see, group 1 has the newValue "2" because that is the date associated with the value "4". Similarly, group two has newValue 4 and group three has newValue 8.
I assume there is an easy way to do this using ave() or a range of dplyr/data.table functions, but I have been unsuccessful with my many attempts.
Here's a quick data.table one
library(data.table)
setDT(df)[, newValue := date[value == 4L], by = group]
df
# group date value newValue
# 1: 1 1 3 2
# 2: 1 2 4 2
# 3: 1 3 3 2
# 4: 2 4 4 4
# 5: 2 5 5 4
# 6: 2 6 6 4
# 7: 3 7 6 8
# 8: 3 8 4 8
# 9: 3 9 9 8
Here's a similar dplyr version
library(dplyr)
df %>%
  group_by(group) %>%
  mutate(newValue = date[value == 4L])
Or a possible base R solution using merge after filtering the data (will need some renaming afterwards)
merge(df, df[df$value == 4, c("group", "date")], by = "group")
Here is a base R option
df$newValue = rep(df$date[which(df$value == 4)], table(df$group))
Another alternative using lapply
do.call(rbind, lapply(split(df, df$group),
function(x){x$newValue = rep(x$date[which(x$value == 4)],
each = length(x$group)); x}))
# group date value newValue
#1.1 1 1 3 2
#1.2 1 2 4 2
#1.3 1 3 3 2
#2.4 2 4 4 4
#2.5 2 5 5 4
#2.6 2 6 6 4
#3.7 3 7 6 8
#3.8 3 8 4 8
#3.9 3 9 9 8
One more base R path:
df$newValue <- ave(`names<-`(df$value==4,df$date), df$group, FUN=function(x) as.numeric(names(x)[x]))
df
group date value newValue
1 1 1 3 2
2 1 2 4 2
3 1 3 3 2
4 2 4 4 4
5 2 5 5 4
6 2 6 6 4
7 3 7 6 8
8 3 8 4 8
9 3 9 9 8
10 3 11 7 8
I tested this on groups of varying length. I assigned the date column as the names of the logical vector marking where value equals 4, then picked out the matching date within each group.
Data
df = data.frame(group = c(1,1,1,2,2,2,3,3,3,3),
date = c(1,2,3,4,5,6,7,8,9,11),
value = c(3,4,3,4,5,6,6,4,9,7))
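Another base R possibility, sketched here on the variable-length test data, is a lookup with match(). It assumes each group has at most one row where value is 4 (match() would take the first if there were several); groups without such a row would get NA.

```r
df <- data.frame(group = c(1,1,1,2,2,2,3,3,3,3),
                 date  = c(1,2,3,4,5,6,7,8,9,11),
                 value = c(3,4,3,4,5,6,6,4,9,7))

# Lookup table: one row per group, holding the date where value == 4.
key <- df[df$value == 4, c("group", "date")]

# For every row, fetch the date stored for its group.
df$newValue <- key$date[match(df$group, key$group)]
df
```

This does not rely on the rows being sorted by group or on the groups having equal sizes.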

remove i+1th term if reoccuring

Say we have the following data
A <- c(1,2,2,2,3,4,8,6,6,1,2,3,4)
B <- c(1,2,3,4,5,1,2,3,4,5,1,2,3)
data <- data.frame(A,B)
How would one write a function so that, for A, if the same value appears again in the (i+1)th position, the recurring row is removed?
The output should therefore look like
data.frame(c(1,2,3,4,8,6,1,2,3,4), c(1,2,5,1,2,3,5,1,2,3))
My best guess would be to use a for statement, but I have no experience with these.
You can try
data[c(TRUE, data[-1,1]!= data[-nrow(data), 1]),]
Another option, dplyr-esque:
library(dplyr)
dat1 <- data.frame(A=c(1,2,2,2,3,4,8,6,6,1,2,3,4),
B=c(1,2,3,4,5,1,2,3,4,5,1,2,3))
dat1 %>% filter(A != lag(A, default=FALSE))
## A B
## 1 1 1
## 2 2 2
## 3 3 5
## 4 4 1
## 5 8 2
## 6 6 3
## 7 1 5
## 8 2 1
## 9 3 2
## 10 4 3
Using diff, which calculates the differences between consecutive elements (lag 1):
data[c( TRUE, diff(data[,1]) != 0), ]
output:
A B
1 1 1
2 2 2
5 3 5
6 4 1
7 8 2
8 6 3
10 1 5
11 2 1
12 3 2
13 4 3
Using rle
A <- c(1,2,2,2,3,4,8,6,6,1,2,3,4)
B <- c(1,2,3,4,5,1,2,3,4,5,1,2,3)
data <- data.frame(A,B)
X <- rle(data$A)
Y <- cumsum(c(1, X$lengths[-length(X$lengths)]))
View(data[Y, ])
row.names A B
1 1 1 1
2 2 2 2
3 5 3 5
4 6 4 1
5 7 8 2
6 8 6 3
7 10 1 5
8 11 2 1
9 12 3 2
10 13 4 3
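Since the question asks for a function, the comparison-based answer can be wrapped up as in the sketch below. The name drop_repeats and its arguments are made up for illustration; the function assumes the column contains no NA values.

```r
# Keep a row unless its value in `col` equals the value in the previous row.
drop_repeats <- function(d, col) {
  v <- d[[col]]
  if (length(v) < 2) return(d)           # nothing to compare
  keep <- c(TRUE, v[-1] != v[-length(v)])
  d[keep, , drop = FALSE]
}

A <- c(1,2,2,2,3,4,8,6,6,1,2,3,4)
B <- c(1,2,3,4,5,1,2,3,4,5,1,2,3)
data <- data.frame(A, B)

result <- drop_repeats(data, "A")
result
```

No explicit for loop is needed, because the comparison v[-1] != v[-length(v)] checks every element against its predecessor in one vectorized step.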
