I have a simple data frame column transformation which can be done using an if/else loop, but I was wondering if there was a better way to do this.
The initial data frame is,
df <-data.frame(cbind(x=rep(10:15,3), y=0:8))
df
x y
1 10 0
2 11 1
3 12 2
4 13 3
5 14 4
6 15 5
7 10 6
8 11 7
9 12 8
10 13 0
11 14 1
12 15 2
13 10 3
14 11 4
15 12 5
16 13 6
17 14 7
18 15 8
What I need to do is replace the values in column 'y' such that
'0' gets replaced with '2',
'1' gets replaced with '2.2',
'2' gets replaced with '2.4',
...
...
'6' gets replaced with '3.2'
'7' gets replaced with '3.3'
'8' gets replaced with '10'
so that I end up with something like,
> df
x y
1 10 2.0
2 11 2.2
3 12 2.4
4 13 2.6
5 14 2.8
6 15 3.0
7 10 3.2
8 11 3.3
9 12 10.0
10 13 2.0
11 14 2.2
12 15 2.4
13 10 2.6
14 11 2.8
15 12 3.0
16 13 3.2
17 14 3.3
18 15 10.0
I have searched and found several proposals but couldn't get them to work. One of the attempts was something like,
> levels(factor(df$y)) <- c(2,2.2,2.4,2.6,2.8,3,3.2,3.3,10)
Error in levels(factor(df$y)) <- c(2, 2.2, 2.4, 2.6, 2.8, 3, 3.2, 3.3, :
could not find function "factor<-"
But I get the error message shown above.
Can anyone help me with this?
Use the fact that y+1 is an index into the vector of replacement values, something like:
replacement <- c(2,2.2,2.4,2.6,2.8,3,3.2,3.3,10)
df <- within(df, z <- replacement[y+1])
Or, using data.table for syntactic sugar and memory efficiency:
library(data.table)
DT <- as.data.table(df)
DT[, z := replacement[y+1]]
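The y+1 trick only works because the codes in y happen to be the consecutive integers 0 to 8. If that were not the case, a minimal sketch of the same lookup with match() could look like this (old_values and new_values are names I made up for illustration):

```r
# Rebuild the example data from the question
df <- data.frame(x = rep(10:15, 3), y = 0:8)

# Hypothetical lookup table: old code -> new value
old_values <- 0:8
new_values <- c(2, 2.2, 2.4, 2.6, 2.8, 3, 3.2, 3.3, 10)

# match() returns the position of each y in old_values,
# which is then used to index new_values
df$z <- new_values[match(df$y, old_values)]
head(df, 3)
```

Any y that is not listed in old_values ends up as NA in z, which makes unexpected codes easy to spot.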
How about:
mylevels <- c(2,2.2,2.4,2.6,2.8,3,3.2,3.3,10)
df$z <- as.numeric(as.character(factor(df$y,labels=mylevels)))
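As far as I can tell, the attempt in the question fails because levels<- is a replacement function and needs a named object on its left-hand side, not the result of a call like factor(df$y). A minimal sketch of the same idea with the factor stored in a variable first (f is just a name chosen here):

```r
df <- data.frame(x = rep(10:15, 3), y = 0:8)
mylevels <- c(2, 2.2, 2.4, 2.6, 2.8, 3, 3.2, 3.3, 10)

f <- factor(df$y)        # levels are "0", "1", ..., "8"
levels(f) <- mylevels    # relabel the levels in order

# Convert via character so we get the labels, not the integer codes
df$z <- as.numeric(as.character(f))
```

This relies on the sorted levels of y lining up with the order of mylevels, which holds for 0 to 8.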
This also matches your desired outcome:
transform(df,z=ifelse(y==7,3.3,ifelse(y==8,10,2+y/5)))
I've downloaded a table from Wikipedia and in some columns there are reference links next to the numbers. Is it possible to delete them?
In column in Rstudio it looks like this:
402[38]
[38] - this is what I don't want.
We can do this easily in base R with Regex:
a <- data.frame(V1 = paste0(1:20, sprintf("[%s]", 50:70)))
a$V2 <- gsub("\\[.*?\\]","", a$V1)
V1 V2
1 1[50] 1
2 2[51] 2
3 3[52] 3
4 4[53] 4
5 5[54] 5
6 6[55] 6
7 7[56] 7
8 8[57] 8
9 9[58] 9
10 10[59] 10
11 11[60] 11
12 12[61] 12
13 13[62] 13
14 14[63] 14
15 15[64] 15
16 16[65] 16
17 17[66] 17
18 18[67] 18
19 19[68] 19
20 20[69] 20
21 1[70] 1
And this conveniently works for the case of multiple references as well:
a <- data.frame(V1 = paste0(1:20, sprintf("[%s][%s]", 50:70, 80:100)))
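For completeness, here is a small sketch of that multiple-reference case run end to end on three rows. The as.numeric() step is my own addition, on the assumption that you want numbers rather than strings afterwards:

```r
# Three values, each followed by two bracketed references
a <- data.frame(V1 = paste0(1:3, sprintf("[%s][%s]", 50:52, 80:82)))

# Non-greedy match: each "[...]" group is removed separately
a$V2 <- gsub("\\[.*?\\]", "", a$V1)

# Assumed follow-up: convert the cleaned strings to numbers
a$V2 <- as.numeric(a$V2)
a
```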
I want to replicate observations based on the values of the variable iptw to create pseudo-populations for further analysis.
For example, if iptw=4.5, then weight=5 should be created, and the observation/row multiplied by 5. Likewise, if iptw=2.3, then weight=2, and that row is multiplied by 2, which is equivalent to adding the corresponding observation twice to the data frame.
Here is my dataset:
dtNEW <- data.table(id = 1:4, x1 = 10:13, x2=21:24, iptw=c(2.3,0.6,4.5,0.1))
There is a similar question here but the solutions there do not answer my question.
Assuming you want to replicate the ith row round(iptw[i]) times:
dtNEW[rep(1:.N, round(iptw)), ]
giving:
id x1 x2 iptw
1: 1 10 21 2.3
2: 1 10 21 2.3
3: 2 11 22 0.6
4: 3 12 23 4.5
5: 3 12 23 4.5
6: 3 12 23 4.5
7: 3 12 23 4.5
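Note that id 3 appears four times, not five: round() in R rounds halves to the even digit, so round(4.5) is 4. If you really want 4.5 to become 5 as described in the question, one possible sketch (using a plain data.frame here so it runs without data.table) is to round half up with floor(x + 0.5):

```r
dtNEW <- data.frame(id = 1:4, x1 = 10:13, x2 = 21:24,
                    iptw = c(2.3, 0.6, 4.5, 0.1))

# Round half up (valid for non-negative weights): 2.3 -> 2, 0.6 -> 1, 4.5 -> 5, 0.1 -> 0
weight <- floor(dtNEW$iptw + 0.5)

# Repeat each row index 'weight' times; rows with weight 0 are dropped
expanded <- dtNEW[rep(seq_len(nrow(dtNEW)), weight), ]
table(expanded$id)
```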
Another option is uncount from tidyr
library(tidyr)
uncount(dtNEW, round(iptw))
# id x1 x2 iptw
#1: 1 10 21 2.3
#2: 1 10 21 2.3
#3: 2 11 22 0.6
#4: 3 12 23 4.5
#5: 3 12 23 4.5
#6: 3 12 23 4.5
#7: 3 12 23 4.5
My dataset has as features: players IDs, team, weeks and points.
I want to calculate the mean of TEAM points for previous weeks, but not all past weeks, just to the last 5 or less (if the current week is smaller than 5).
Example: For team = A, week = 7, the result will be the average of POINTS for team = A and weeks 2, 3, 4, 5 and 6.
The dataset can be created using the following code:
# set the seed for reproducibility
set.seed(123)
player_id<-c(rep(1,15),rep(2,15),rep(3,15),rep(4,15))
week<-1:15
team<-c(rep("A",30),rep("B",30))
points<-round(runif(60,1,10),0)
mydata<- data.frame(player_id=player_id,team=team,week=rep(week,4),points)
I would like to have a solution without a heavy looping, because the dataset is huge.
I have asked related questions here that may help, but I could not adapt them to this case.
Question 1
Question 2
Thank you!
If you want a dplyr solution, we can adapt the approach from my answer to one of your other questions:
library(dplyr)
library(zoo)
# set the seed for reproducibility
set.seed(123)
player_id<-c(rep(1,15),rep(2,15),rep(3,15),rep(4,15))
week<-1:15
team<-c(rep("A",30),rep("B",30))
points<-round(runif(60,1,10),0)
mydata<- data.frame(player_id=player_id,team=team,week=rep(week,4),points)
roll_mean <- function(x, k) {
result <- rollapplyr(x, k, mean, partial=TRUE, na.rm=TRUE)
result[is.nan(result)] <- NA
return( result )
}
It might first be easier to aggregate by team:
team_data <- mydata %>%
select(-player_id) %>%
group_by(team, week) %>%
arrange(week) %>%
summarise(team_points = sum(points)) %>%
mutate(rolling_team_mean = roll_mean(lag(team_points), k=5)) %>%
arrange(team)
team_data
# A tibble: 30 x 4
# Groups: team [2]
team week team_points rolling_team_mean
<fctr> <int> <dbl> <dbl>
1 A 1 13 NA
2 A 2 11 13.00
3 A 3 6 12.00
4 A 4 13 10.00
5 A 5 19 10.75
6 A 6 10 12.40
7 A 7 13 11.80
8 A 8 16 12.20
9 A 9 16 14.20
10 A 10 12 14.80
# ... with 20 more rows
Then, if you like we can put everything back together:
mydata <- inner_join(mydata, team_data) %>%
arrange(week, team, player_id)
mydata[1:12, ]
player_id team week points team_points rolling_team_mean
1 1 A 1 4 13 NA
2 2 A 1 9 13 NA
3 3 B 1 10 12 NA
4 4 B 1 2 12 NA
5 1 A 2 8 11 13
6 2 A 2 3 11 13
7 3 B 2 9 12 12
8 4 B 2 3 12 12
9 1 A 3 5 6 12
10 2 A 3 1 6 12
11 3 B 3 7 12 12
12 4 B 3 5 12 12
Here's one way:
# compute points per team per week
pts <- with(mydata, tapply(points, list(team, week), sum, default = 0))
pts
# 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
#A 13 11 6 13 19 10 13 16 16 12 17 11 13 10 4
#B 12 12 12 11 10 6 13 11 6 9 5 7 13 13 6
# compute the 5-week averages
sapply(setNames(seq(2, ncol(pts)), seq(2, ncol(pts))),
function(i) {
apply(pts[, seq(max(1, i - 5), i - 1), drop = FALSE], 1, mean)
})
# 2 3 4 5 6 7 8 9 10 11 12 13 14 15
#A 13 12 10 10.75 12.4 11.8 12.2 14.2 14.8 13.4 14.8 14.4 13.8 12.6
#B 12 12 12 11.75 11.4 10.2 10.4 10.2 9.2 9.0 8.8 7.6 8.0 9.4
This will give the wrong result if the week variable has gaps.
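If gaps are a concern, one option is to select the previous weeks by their week value instead of by column position. The following is only a sketch under that assumption, in base R; prev5_mean is a name I made up, and the per-row lookup is fine for a small team-by-week table but may be slow on very large ones:

```r
set.seed(123)
mydata <- data.frame(player_id = rep(1:4, each = 15),
                     team = rep(c("A", "B"), each = 30),
                     week = rep(1:15, 4),
                     points = round(runif(60, 1, 10), 0))

# Total points per team per week
team_pts <- aggregate(points ~ team + week, data = mydata, FUN = sum)

# For each team/week, average the team totals of weeks in [week - 5, week - 1]
team_pts$prev5_mean <- mapply(function(tm, wk) {
  prev <- team_pts$points[team_pts$team == tm &
                          team_pts$week >= wk - 5 &
                          team_pts$week < wk]
  if (length(prev) == 0) NA_real_ else mean(prev)
}, as.character(team_pts$team), team_pts$week, USE.NAMES = FALSE)

team_pts[team_pts$team == "A" & team_pts$week <= 7, ]
```

A missing week then simply contributes nothing to the average instead of shifting the window.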
Simple question, I think. Basically, I want to use the concept "less than or equal to a number" as the condition to select the row of one column, and then find the value on the same row in another column. But what happens if the number stated in the condition isn't found in the first column?
Let's assume this is my data frame:
df<-as.data.frame((matrix(c(1:10,11:20), nrow = 10, ncol = 2)))
df
V1 V2
1 1 11
2 2 12
3 3 13
4 4 14
5 5 15
6 6 16
7 7 17
8 8 18
9 9 19
10 10 20
Let's assume I want to use the condition <=5 in df$V1 to obtain the row that is used to find the value of the same row in df$V2.
df[which(df$V1 <= 5),2]
15
But what happens if the number used in the condition isn't found? Let's assume this is my new data.frame
V1 V2
1 1 11
2 2 12
3 3 13
4 4 14
5 6 15
6 7 16
7 8 17
8 9 18
9 10 19
10 11 20
Using the same command as above, df[which(df$V1 <= 5),2], I obtain a different answer. For some reason I obtain the entire column instead of one number.
11 12 13 14 15 16 17 18 19 20
Any suggestions?
Use the subset operator:
df[df[,1] <= 5, 2]
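A logical subset returns the V2 value of every row that satisfies the condition, not just one. If what you are after is the single V2 value for the largest V1 that is still <= 5, a possible sketch, assuming V1 is sorted in increasing order as in your example, is:

```r
# Second data frame from the question: V1 skips the value 5
df <- data.frame(V1 = c(1:4, 6:11), V2 = 11:20)

# All V2 values whose V1 satisfies the condition
all_matches <- df$V2[df$V1 <= 5]

# Only the last of them; tail() returns an empty vector if nothing matches
last_match <- tail(all_matches, 1)
last_match
```

Here the last row with V1 <= 5 is the one with V1 = 4, so the value picked is the V2 entry of that row.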
I want to make new column in my data set with the values determined by values in another data set, but it's not as simple as the values in one column being a function of the values in the other. Here's an example:
>df1
chromosome position
1 1 1
2 1 2
3 1 4
4 1 5
5 1 7
6 1 12
7 1 13
8 1 15
9 1 21
10 1 23
11 1 24
12 2 1
13 2 5
14 2 7
15 2 8
16 2 12
17 2 15
18 2 18
19 2 21
20 2 22
and
>df2
chromosome segment_start segment_end segment.number
1 1 1 5 1.1
2 1 6 20 1.2
3 1 21 25 1.3
4 2 1 7 2.1
5 2 8 16 2.2
6 2 18 22 2.3
I want to make a new column in df1 called 'segment', and the value in segment is to be determined by which segment (as determined by 'segment_start', 'segment_end', and 'chromosome' from df2) the value in 'position' belongs to. For example, in df1, row 7, position=13, and chromosome=1. Because 13 is between 6 and 20, the entry in my hypothetical 'segment' column would be 1.2, from row 2 of df2, because 13 falls between segment_start and segment_end from that row (6 and 20, respectively), and the 'chromosome' value from df1 row 7 is 1, just as 'chromosome' in df2 row 2 is 1.
Each row in df1 belongs to one of the segments described in df2; that is, it lies on the same chromosome as one of the segments, and its 'position' is >=segment_start and <=segment_end. And I want to get that information into df1, so it says what segment each position belongs to.
I was thinking of using an if function, and started with:
if(df1$position>=df2$segment_start & df1$position<=df2$segment_end & df1$chromosome==df2$chromosome) df1$segment<-df2$segment.number
But am not sure that way will be feasible. If nothing else maybe the code can help illustrate what it is I'm trying to do. Basically, I want match each row by its position and chromosome to a segment in df2. Thanks.
This appears to be a rolling join. You can use data.table for this
require(data.table)
DT1 <- data.table(df1, key = c('chromosome','position'))
DT2 <- data.table(df2, key = c('chromosome','segment_start'))
# DT2[DT1, roll=TRUE] performs the join you want, but the result keeps
# the column names of DT2 (the position ends up in segment_start),
# which is why I have renamed and subset the columns here
DT2[DT1, roll=TRUE][ ,list(chromosome,position = segment_start,segment.number)]
# chromosome position segment.number
# 1: 1 1 1.1
# 2: 1 2 1.1
# 3: 1 4 1.1
# 4: 1 5 1.1
# 5: 1 7 1.2
# 6: 1 12 1.2
# 7: 1 13 1.2
# 8: 1 15 1.2
# 9: 1 21 1.3
# 10: 1 23 1.3
# 11: 1 24 1.3
# 12: 2 1 2.1
# 13: 2 5 2.1
# 14: 2 7 2.1
# 15: 2 8 2.2
# 16: 2 12 2.2
# 17: 2 15 2.2
# 18: 2 18 2.3
# 19: 2 21 2.3
# 20: 2 22 2.3
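One thing to keep in mind: the rolling join above matches each position to the nearest preceding segment_start and never looks at segment_end. To cross-check the result, or if positions can fall into gaps between segments, here is a rough base R sketch that tests both bounds explicitly (candidates, inside and result are names chosen for illustration). It builds every position/segment pair per chromosome, so it is meant for small data:

```r
df1 <- data.frame(
  chromosome = rep(1:2, c(11, 9)),
  position = c(1, 2, 4, 5, 7, 12, 13, 15, 21, 23, 24,
               1, 5, 7, 8, 12, 15, 18, 21, 22))
df2 <- data.frame(
  chromosome = rep(1:2, each = 3),
  segment_start = c(1, 6, 21, 1, 8, 18),
  segment_end = c(5, 20, 25, 7, 16, 22),
  segment.number = c(1.1, 1.2, 1.3, 2.1, 2.2, 2.3))

# Pair every position with every segment on the same chromosome
candidates <- merge(df1, df2, by = "chromosome")

# Keep only the pairs where the position lies inside the segment
inside <- with(candidates, position >= segment_start & position <= segment_end)
result <- candidates[inside, c("chromosome", "position", "segment.number")]
result <- result[order(result$chromosome, result$position), ]
```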
You really need to check out the GenomicRanges package from Bioconductor. It provides the data structures that are appropriate for your use case.
First, we create the GRanges objects:
library(GenomicRanges)
gr1 <- with(df1, GRanges(chromosome, IRanges(position, width=1L)))
gr2 <- with(df2, GRanges(chromosome, IRanges(segment_start, segment_end),
segment.number=segment.number))
Then we find the overlaps and do the merge:
hits <- findOverlaps(gr1, gr2)
gr1$segment[queryHits(hits)] <- gr2$segment.number[subjectHits(hits)]
I'm going to assume that the regions in df2 are non-overlapping, continuous and complete (not missing any positions from df1). I seem to do this differently every time I try, so here's my latest idea.
First, make sure chromosome is a factor in both data sets
df1$chromosome<-factor(df1$chromosome)
df2$chromosome<-factor(df2$chromosome)
Now I want to unwrap the chromosome/position pairs into a single overall generic position. I'll do that with
ends<-with(df2, tapply(segment_end, chromosome, max))
offset<-head(c(0,cumsum(ends)),-1)
names(offset)<-names(ends)
This assigns unique position values to all positions across all chromosomes and it tracks the offset to the beginning of each chromosome in this new system. Now we will build a translation function from the data in df2
seglookup <- approxfun(with(df2, offset[chromosome]+segment_start), 1:nrow(df2),
method="constant", rule=2)
We use approxfun to find the right interval in the generic position space for each segment. Now we use this function on df1
segid <- with(df1, seglookup(offset[chromosome]+position))
Now we have the correct ID for each position. We can verify this by merging the data and looking at the results
cbind(df1, df2[segid,-1])
chromosome position segment_start segment_end segment.number
1 1 1 1 5 1.1
2 1 2 1 5 1.1
3 1 4 1 5 1.1
4 1 5 1 5 1.1
5 1 7 6 20 1.2
6 1 12 6 20 1.2
7 1 13 6 20 1.2
8 1 15 6 20 1.2
9 1 21 21 25 1.3
10 1 23 21 25 1.3
11 1 24 21 25 1.3
12 2 1 1 7 2.1
13 2 5 1 7 2.1
14 2 7 1 7 2.1
15 2 8 8 16 2.2
16 2 12 8 16 2.2
17 2 15 8 16 2.2
18 2 18 18 22 2.3
19 2 21 18 22 2.3
20 2 22 18 22 2.3
So it looks like we did alright.
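If the approxfun step looks opaque, it may help to see it in isolation on a single chromosome. This is just a toy sketch with made-up names (starts, lookup, pos): with method="constant" the function is a step function that returns the index of the last start at or before each position, which is the same thing findInterval() computes.

```r
# Segment starts for chromosome 1 in df2
starts <- c(1, 6, 21)

# Step function: position -> index of the segment whose start precedes it
lookup <- approxfun(starts, seq_along(starts), method = "constant", rule = 2)

pos <- c(1, 5, 7, 13, 21, 24)
lookup(pos)                # segment index as a numeric vector
findInterval(pos, starts)  # same indices as integers
```

As with the full version, segment_end is never checked, so this only works under the stated assumption that the segments leave no gaps that contain positions.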