Rank function to rank multiple variables in R - r

I am trying to rank multiple numeric variables ( around 700+ variables) in the data and am not sure exactly how to do this as I am still pretty new to using R.
I do not want to overwrite the ranked values in the same variable and hence need to create a new rank variable for each of these numeric variables.
From reading the posts, I believe assign and transform function along with rank maybe able to solve this. I tried implementing as below ( sample data and code) and am struggling to get it to work.
The output dataset in addition to variables xcount, xvisit, ysales need to be populated
With variables xcount_rank, xvisit_rank, ysales_rank containing the ranked values.
input <- read.table(header=F, text="101 2 5 6
102 3 4 7
103 9 12 15")
colnames(input) <- c("id","xcount","xvisit","ysales")
input1 <- input[,2:4] #need to rank the numeric variables besides id
for (i in 1:3)
{
transform(input1,
assign(paste(input1[,i],"rank",sep="_")) =
FUN = rank(-input1[,i], ties.method = "first"))
}
input[paste(names(input)[2:4], "rank", sep = "_")] <-
lapply(input[2:4], cut, breaks = 10)
The problem with this approach is that it's creating the rank values as (101, 230] , (230, 450] etc whereas I would like to see the values in the rank variable to be populated as 1, 2 etc up to 10 categories as per the splits I did. Is there any way to achieve this? input[5:7] <- lapply(input[5:7], rank, ties.method = "first")
The approach I tried from the solutions provided below is:
input <- read.table(header=F, text="101 20 5 6
102 2 4 7
103 9 12 15
104 100 8 7
105 450 12 65
109 25 28 145
112 854 56 93")
colnames(input) <- c("id","xcount","xvisit","ysales")
input[paste(names(input)[2:4], "rank", sep = "_")] <-
lapply(input[2:4], cut, breaks = 3)
Current output I get is:
id xcount xvisit ysales xcount_rank xvisit_rank ysales_rank
1 101 20 5 6 (1.15,286] (3.95,21.3] (5.86,52.3]
2 102 2 4 7 (1.15,286] (3.95,21.3] (5.86,52.3]
3 103 9 12 15 (1.15,286] (3.95,21.3] (5.86,52.3]
4 104 100 8 7 (1.15,286] (3.95,21.3] (5.86,52.3]
5 105 450 12 65 (286,570] (3.95,21.3] (52.3,98.7]
6 109 25 28 145 (1.15,286] (21.3,38.7] (98.7,145]
7 112 854 56 93 (570,855] (38.7,56.1] (52.3,98.7]
Desired output:
id xcount xvisit ysales xcount_rank xvisit_rank ysales_rank
1 101 20 5 6 1 1 1
2 102 2 4 7 1 1 1
3 103 9 12 15 1 1 1
4 104 100 8 7 1 1 1
5 105 450 12 65 2 1 2
6 109 25 28 145 1 2 3
Would like to see the records in the group they would fall under if I try to rank the interval values.

Using dplyr
library(dplyr)
nm1 <- paste("rank", names(input)[2:4], sep="_")
input[nm1] <- mutate_each(input[2:4],funs(rank(., ties.method="first")))
input
# id xcount xvisit ysales rank_xcount rank_xvisit rank_ysales
#1 101 2 5 6 1 2 1
#2 102 3 4 7 2 1 2
#3 103 9 12 15 3 3 3
Update
Based on the new input and using cut
input[nm1] <- mutate_each(input[2:4], funs(cut(., breaks=3, labels=FALSE)))
input
# id xcount xvisit ysales rank_xcount rank_xvisit rank_ysales
#1 101 20 5 6 1 1 1
#2 102 2 4 7 1 1 1
#3 103 9 12 15 1 1 1
#4 104 100 8 7 1 1 1
#5 105 450 12 65 2 1 2
#6 109 25 28 145 1 2 3
#7 112 854 56 93 3 3 2

Related

Get the average of the values of one column for the values in another

I was not so sure how to ask this question. i am trying to answer what is the average tone when an initiative is mentioned and additionally when a topic, and a goal( or achievement) are mentioned. My dataframe (df) has many mentions of 70 initiatives (rows). meaning my df has 500+ rows of data, but only 70 Initiatives.
My data looks like this
> tabmean
Initiative Topic Goals Achievements Tone
1 52 44 2 2 2
2 294 42 2 2 2
3 103 31 2 2 2
4 52 41 2 2 2
5 87 26 2 1 1
6 52 87 2 2 2
7 136 81 2 2 2
8 19 7 2 2 1
9 19 4 2 2 2
10 0 63 2 2 2
11 0 25 2 2 2
12 19 51 2 2 2
13 52 51 2 2 2
14 108 94 2 2 1
15 52 89 2 2 2
16 110 37 2 2 2
17 247 25 2 2 2
18 66 95 2 2 2
19 24 49 2 2 2
20 24 110 2 2 2
I want to find what is the mean or average Tone when an Initiative is mentioned. as well as what is the Tone when an Initiative, a Topic and a Goal are mentioned at the same time. The code options for Tone are : positive(coded: 1), neutral(2), negative (coded:3), and both positive and negative(4). Goals and Achievements are coded yes(1) and no(2).
I have used this code:
GoalMeanTone <- tabmean %>%
group_by(Initiative,Topic,Goals,Tone) %>%
summarize(averagetone = mean(Tone))
With Solution output :
GoalMeanTone
# A tibble: 454 x 5
# Groups: Initiative, Topic, Goals [424]
Initiative Topic Goals Tone averagetone
<chr> <chr> <chr> <chr> <dbl>
1 0 104 2 0 NA
2 0 105 2 0 NA
3 0 22 2 0 NA
4 0 25 2 0 NA
5 0 29 2 0 NA
6 0 30 2 1 NA
7 0 31 1 1 NA
8 0 42 1 0 NA
9 0 44 2 0 NA
10 0 44 NA 0 NA
# ... with 444 more rows
note that for Initiative Value 0 means "other initiative".
and I've also tried this code
library(plyr)
GoalMeanTone2 <- ddply( tabmean, .(Initiative), function(x) mean(tabmean$Tone) )
with solution output
> GoalMeanTone2
Initiative V1
1 0 NA
2 1 NA
3 101 NA
4 102 NA
5 103 NA
6 104 NA
7 105 NA
8 107 NA
9 108 NA
10 110 NA
Note that in both instances, I do not get an average for Tone but instead get NA's
I have removed the NAs in the df from the column "Tone" also have tried to remove all the other mission values in the df ( its only about 30 values that i deleted).
and I have also re-coded the values for Tone :
tabmean<-Meantable %>% mutate(Tone=recode(Tone,
`1`="1",
`2`="0",
`3`="-1",
`4`="2"))
I still cannot manage to get the average tone for an initiative. Maybe the solution is more obvious than i think, but have gotten stuck and have no idea how to proceed or solve this.
i'd be super grateful for a better code to get this. Thanks!
I'm not completely sure what you mean by 'the average tone when an initiative is mentioned', but let's say that you'd want to get the average tone for when initiative=1, you could try the following:
tabmean %>% filter(initiative==1) %>% summarise(avg_tone=mean(tone, na.rm=TRUE)
Note that (1) you have to add na.rm==TRUE to the summarise call if you have missing values in the column that you are summarizing, otherwise it will only produce NA's, and (2) check that the columns are of type numeric (you could check that with str(tabmean) and for example change tone to numeric with tabmean <- tabmean %>% mutate(tone=as.numeric(tone)).

R - Issues while calling a user-defined function

I have the following dataframe named "dataset"
> dataset
V1 V2 V3 V4 V5 V6 V7
1 A 29 27 0 14 21 163
2 W 70 40 93 63 44 1837
3 E 11 1 11 49 17 315
4 S 20 59 36 23 14 621
5 C 12 7 48 24 25 706
6 B 14 8 78 27 17 375
7 G 12 7 8 4 4 257
8 T 0 0 0 0 0 0
9 N 32 6 9 14 17 264
10 R 28 46 49 55 38 608
11 O 12 2 8 12 11 450
I have two helper functions as below
get_A <- function(p){
return(data.frame(Scorecard = p,
Results = dataset[nrow(dataset),(p+1)]))
} #Pulls the value from the last row for a given value of (p and offset by 1)
get_P <- function(p){
return(data.frame(Scorecard= p,
Results = dataset[p,ncol(dataset)]))
} #Pulls the value from the last column for a given value of p
I have the following dataframe on which I need to run the above helper functions. There will be NAs because I'm reading this "data_sub" dataframe from an excel file which can have unequal rows for the two columns.
> data_sub
Key_P Key_A
1 2 1
2 3 3
3 4 5
4 NA NA
When I call the helper functions, I get some strange results as shown below:
> get_P(data_sub[complete.cases(data_sub$Key_P),]$Key_P)
Scorecard Results
1 2 1837
2 3 315
3 4 621
> get_A(data_sub[complete.cases(data_sub$Key_A),]$Key_A)
Scorecard Results.V2 Results.V4 Results.V6
1 1 12 8 11
2 3 12 8 11
3 5 12 8 11
Warning message:
In data.frame(Scorecard = p, Results = dataset[nrow(dataset), (p + :
row names were found from a short variable and have been discarded
The call to the get_P() helper function is working the way I want. I'm getting the "Results" for each non-NA value in data_sub$Key_P as a dataframe.
But the call to the get_A() helper function is giving strange results and also a warning.I was expecting it to give a similar dataframe as given the call to get_P(). Why is this happening and how can I make get_A() to give the correct dataframe? Basically, the output of this should be
Scorecard Results
1 1 12
2 3 8
3 5 11
I found this link related to the warning but it's unhelpful in solving my issue.
The following works
get_P <- function(df, data_sub) {
data_sub <- data_sub[complete.cases(data_sub), ]
data.frame(
Scorecard = data_sub$Key_P,
Results = df[data_sub$Key_P, ncol(df)])
}
get_P(df, data_sub)
# Scorecard Results
#1 2 1837
#2 3 315
#3 4 621
get_A <- function(df, data_sub) {
data_sub <- data_sub[complete.cases(data_sub), ];
data.frame(
Scorecard = data_sub$Key_A,
Results = as.numeric(df[nrow(df), data_sub$Key_A + 1]))
}
get_A(df, data_sub)
# Scorecard Results
#1 1 12
#2 3 8
#3 5 11
To avoid the warning, we need to strip rownames with as.numeric in get_A.
Another tip: It's better coding practice to make get_P and get_A a function of both df and data_sub to avoid global variables.
Sample data
df <- read.table(text =
" V1 V2 V3 V4 V5 V6 V7
1 A 29 27 0 14 21 163
2 W 70 40 93 63 44 1837
3 E 11 1 11 49 17 315
4 S 20 59 36 23 14 621
5 C 12 7 48 24 25 706
6 B 14 8 78 27 17 375
7 G 12 7 8 4 4 257
8 T 0 0 0 0 0 0
9 N 32 6 9 14 17 264
10 R 28 46 49 55 38 608
11 O 12 2 8 12 11 450", header = T, row.names = 1)
data_sub <- read.table(text =
" Key_P Key_A
1 2 1
2 3 3
3 4 5
4 NA NA", header = T, row.names = 1)

adding rnorm to a column in loop

I am doing simulations and am trying to add error to a column repeatedly, specifically to the column titled Ao. In my output, the first 30 rows are correct; we have the initial data, the first year of altered data (error added to Ao), but then afterwards, where I would like to have 30 years of added error, I get repeats of Year 2 for Ao up to year 30. My goal is that I add error after each year of sampling. Ie. Year 2 is Year 1 Ao + error. Year 3 is Year 2 Ao + error, so on and so forth. Any helpers? Cheers.
for(t in 1:30){
Error<-rnorm(1000,0,1)
m<-rep(year1data$m,30)
r<-rep(year1data$r,30)
a<-rep(year1data$a,30)
g<-rep(year1data$g,30)
Year<-rep(2:31, each=TotSpecies)
Species<-1:TotSpecies
Ao<-year1data$Ao+sample(Error,TotSpecies,replace=FALSE)
TotSpeciesdata<-data.frame(Species,Year,Ao,m,r,a,g)
TotSpeciesdata<-rbind(year1data,TotSpeciesdata)
}
> TotSpeciesdata
Species Year Ao m r a g
1 1 1 25.770783 43 119.110786 3.2305180 2.6526471
2 2 1 53.908914 138 161.894541 0.7342070 0.1151602
3 3 1 2.010732 226 193.820489 2.2890904 3.6248105
4 4 1 23.742254 332 17.315335 1.4009572 2.0037931
5 5 1 4.291080 63 187.591209 0.2563995 2.1553908
6 6 1 4.691113 343 116.267867 0.3899113 3.3950085
7 7 1 604.133044 224 132.240197 3.0410743 0.7985524
8 8 1 13.332567 166 5.367118 0.7921644 1.7861011
9 9 1 3.759268 141 212.340970 2.8733737 2.7123141
10 10 1 3.647390 209 259.400858 0.1249936 0.6594659
11 11 1 23.731109 10 114.171147 2.2437372 0.9867591
12 12 1 85.116996 69 167.412993 0.8306823 2.8905148
13 13 1 31.684280 277 177.025460 2.7618332 2.9245554
14 14 1 30.657523 205 21.710438 2.7661347 1.5911379
15 15 1 12.240410 85 210.121109 2.8827455 3.0418454
16 1 2 27.038097 43 119.110786 3.2305180 2.6526471
17 2 2 54.251600 138 161.894541 0.7342070 0.1151602
18 3 2 2.010636 226 193.820489 2.2890904 3.6248105
19 4 2 22.699369 332 17.315335 1.4009572 2.0037931
20 5 2 4.542589 63 187.591209 0.2563995 2.1553908
21 6 2 3.607833 343 116.267867 0.3899113 3.3950085
22 7 2 604.480756 224 132.240197 3.0410743 0.7985524
23 8 2 13.663513 166 5.367118 0.7921644 1.7861011
24 9 2 2.138715 141 212.340970 2.8733737 2.7123141
25 10 2 3.642769 209 259.400858 0.1249936 0.6594659
26 11 2 22.897993 10 114.171147 2.2437372 0.9867591
27 12 2 85.490897 69 167.412993 0.8306823 2.8905148
28 13 2 31.689202 277 177.025460 2.7618332 2.9245554
29 14 2 30.644419 205 21.710438 2.7661347 1.5911379
30 15 2 12.050207 85 210.121109 2.8827455 3.0418454
31 1 3 27.038097 43 119.110786 3.2305180 2.6526471
32 2 3 54.251600 138 161.894541 0.7342070 0.1151602
33 3 3 2.010636 226 193.820489 2.2890904 3.6248105
34 4 3 22.699369 332 17.315335 1.4009572 2.0037931
35 5 3 4.542589 63 187.591209 0.2563995 2.1553908
36 6 3 3.607833 343 116.267867 0.3899113 3.3950085
37 7 3 604.480756 224 132.240197 3.0410743 0.7985524
38 8 3 13.663513 166 5.367118 0.7921644 1.7861011
39 9 3 2.138715 141 212.340970 2.8733737 2.7123141
40 10 3 3.642769 209 259.400858 0.1249936 0.6594659
41 11 3 22.897993 10 114.171147 2.2437372 0.9867591
42 12 3 85.490897 69 167.412993 0.8306823 2.8905148
43 13 3 31.689202 277 177.025460 2.7618332 2.9245554
44 14 3 30.644419 205 21.710438 2.7661347 1.5911379
45 15 3 12.050207 85 210.121109 2.8827455 3.0418454
The main problem you have with your approach is the line:
TotSpeciesdata<-data.frame(Species,Year,Ao,m,r,a,g)
Because Year is a 30 * TotSpecies vector, but all the others are just TotSpecies long. So in effect, you are recycling all columns except Year 30 times when you create the data frame, which will lead to the year 2 data repeated 30 times, among other things. If you just have Year <- rep(i + 1, TotSpecies) I think your logic will work fine. That said, here is an alternate approach:
This will, for each species, create an incrementing random walk starting with Ao for that species for 5 years (just did that for display purposes):
set.seed(1)
year1data <- data.frame(species=1:10, year=1, Ao=runif(10, 1, 700))
TotSpeciesData <- do.call(
rbind,
lapply(
split(year1data, year1data$species),
function(data)
with(
data,
data.frame(species=species, year=year, Ao=c(Ao, Ao + cumsum(rnorm(5)))
) ) ) )
head(TotSpeciesData, 15)
Note I excluded columns m-g since they don't seem directly relevant to your particular question, but you can add them relatively easily. I also only did 5 years in addition to year 1 so you can see the results here, but that is also easy to change:
species year Ao
1.1 1 1 186.5906
1.2 1 1 185.7701
1.3 1 1 186.2575
1.4 1 1 186.9958
1.5 1 1 187.5716
1.6 1 1 187.2662
2.1 2 1 261.1146
2.2 2 1 262.6264
2.3 2 1 263.0162
2.4 2 1 262.3950
2.5 2 1 260.1803
2.6 2 1 261.3052
3.1 3 1 401.4245
3.2 3 1 401.3796
3.3 3 1 401.3634
It has been pointed out that the code that you provided above, or at least that I have edited, repeats itself every 15 years, rather than being unique year year in a step-wise fashion. I edited it as shown below:
TotSpeciesData <- do.call(
rbind, #bind the table by rows
lapply( #applying the function in list form
split(year1data, year1data$Species), #splits data into groups by species
function(data)
with(
data,
data.frame(Species=Species, Year=1:Community, Ao=c(Ao, Ao + cumsum(rnorm((TotSpecies-1),0,2))),m=m, r=r, a=a, g=g) #data frame is Species, Year,
) ) )
TotSpeciesData$Ao[TotSpeciesData$Ao<0]<-0 #any values less than 0 go to 0
TotSpeciesData<-TotSpeciesData[order(TotSpeciesData$Year),] #orders the data frame by Year
When I do this code:
TotSpeciesData[TotSpeciesData$Species==1 & TotSpeciesData$Year %in% c(1,2,16,17),]
I end up with an output showing that the data is repeating itself.
Species Year Ao m r a g
1.1 1 1 48.49161 239 332.9625 3.791778 2.723104
1.2 1 2 49.62851 239 332.9625 3.791778 2.723104
1.16 1 16 48.49161 239 332.9625 3.791778 2.723104
1.17 1 17 49.62851 239 332.9625 3.791778 2.723104
Any comments toward this?

R: subset a data frame based on conditions from another data frame

Here is a problem I am trying to solve. Say, I have two data frames like the following:
observations <- data.frame(id = rep(rep(c(1,2,3,4), each=5), 5),
time = c(rep(1:5,4), rep(6:10,4), rep(11:15,4), rep(16:20,4), rep(21:25,4)),
measurement = rnorm(100,5,7))
sampletimes <- data.frame(location = letters[1:20],
id = rep(1:4,5),
time1 = rep(c(2,7,12,17,22), each=4),
time2 = rep(c(4,9,14,19,24), each=4))
They both contain a column named id, which links the data frames. I want to have the measurements from observationss for whichtimeis betweentime1andtime2from thesampletimesdata frame. Additionally, I'd like to connect the appropriatelocation` to each measurement.
I have successfully done this by converting my sampletimes to a wide format (i.e. all the time1 and time2 information in one row per entry for id), merging the two data frames by the id variable, and using conditional statements to take only instances when the time falls between at least one of the time intervals in the row, and then assigning location to the appropriate measurement.
However, I have around 2 million rows in observations and doing this takes a really long time. I'm looking for a better way where I can keep the data in long format. The example dataset is very simple, but in reality, my data contains variable numbers of intervals and locations per id.
For our example, the data frame I would hope to get back would be as follows:
id time measurement letters[1:20]
1 3 10.5163892 a
2 3 5.5774119 b
3 3 10.5057060 c
4 3 14.1563179 d
1 8 2.2653761 e
2 8 -1.0905546 f
3 8 12.7434161 g
4 8 17.6129261 h
1 13 10.9234673 i
2 13 1.6974481 j
3 13 -0.3664951 k
4 13 13.8792198 l
1 18 6.5038847 m
2 18 1.2032935 n
3 18 15.0889469 o
4 18 0.8934357 p
1 23 3.6864527 q
2 23 0.2404074 r
3 23 11.6028766 s
4 23 20.7466908 t
Here's a proposal with merge:
# merge both data frames
dat <- merge(observations, sampletimes, by = "id")
# extract valid rows
dat2 <- dat[dat$time > dat$time1 & dat$time < dat$time2, seq(4)]
# sort
dat2[order(dat2$time, dat2$id), ]
The result:
id time measurement location
11 1 3 7.086246 a
141 2 3 6.893162 b
251 3 3 16.052627 c
376 4 3 -6.559494 d
47 1 8 11.506810 e
137 2 8 10.959782 f
267 3 8 11.079759 g
402 4 8 11.082015 h
83 1 13 5.584257 i
218 2 13 -1.714845 j
283 3 13 -11.196792 k
418 4 13 8.887907 l
99 1 18 1.656558 m
234 2 18 16.573179 n
364 3 18 6.522298 o
454 4 18 1.005123 p
125 1 23 -1.995719 q
250 2 23 -6.676464 r
360 3 23 10.514282 s
490 4 23 3.863357 t
Not efficient , but do the job :
subset(merge(observations,sampletimes), time > time1 & time < time2)
id time measurement location time1 time2
11 1 3 3.180321 a 2 4
47 1 8 6.040612 e 7 9
83 1 13 -5.999317 i 12 14
99 1 18 2.689414 m 17 19
125 1 23 12.514722 q 22 24
137 2 8 4.420679 f 7 9
141 2 3 11.492446 b 2 4
218 2 13 6.672506 j 12 14
234 2 18 12.290339 n 17 19
250 2 23 12.610828 r 22 24
251 3 3 8.570984 c 2 4
267 3 8 -7.112291 g 7 9
283 3 13 6.287598 k 12 14
360 3 23 11.941846 s 22 24
364 3 18 -4.199001 o 17 19
376 4 3 7.133370 d 2 4
402 4 8 13.477790 h 7 9
418 4 13 3.967293 l 12 14
454 4 18 12.845535 p 17 19
490 4 23 -1.016839 t 22 24
EDIT
Since you have more than 5 millions rows, you should give a try to a data.table solution:
library(data.table)
OBS <- data.table(observations)
SAM <- data.table(sampletimes)
merge(OBS,SAM,allow.cartesian=TRUE,by='id')[time > time1 & time < time2]

R: How to do fastest replacement in R?

I have a input dataframe like this (the real one is very large, so I want to do it faster):
df1 <- data.frame(A=c(1:5), B=c(5:9), C=c(9:13))
A B C
1 1 5 9
2 2 6 10
3 3 7 11
4 4 8 12
5 5 9 13
I have a dataframe with replacement code like this (the entries here maybe more than df1):
df2 <- data.frame(X=c(1:15), Y=c(101:115))
X Y
1 1 101
2 2 102
3 3 103
4 4 104
5 5 105
6 6 106
7 7 107
8 8 108
9 9 109
10 10 110
11 11 111
12 12 112
13 13 113
14 14 114
15 15 115
By matching df2$X with value in df1$A and df1$B, I want to get a new_df1 by replace df1$A and df1$B with the corresponding values in df2$Y, i.e. resulting this new_df1
A B C
1 101 105 9
2 102 106 10
3 103 107 11
4 104 108 12
5 105 109 13
Could you mind to give me some guidance how to do it faster in R, as my dataframe is very large? Many thanks.
As Thilo mentioned Nico's answer assumes that df2 is ordered by X and X contains every integer 1,2,3....
I would prefer to use match() as a more general case:
df1 <- data.frame(A=c(1:5), B=c(5:9), C=c(9:13))
df2 <- data.frame(X=c(1:15), Y=c(101:115))
new_df1 <- df1
new_df1$A <- df2$Y[match(df1$A,df2$X)]
new_df1$B <- df2$Y[match(df1$B,df2$X)]
A B C
1 101 105 9
2 102 106 10
3 103 107 11
4 104 108 12
5 105 109 13
It's supereasy! You just need to get the proper offsets in the array.
So for instance, to get the Y column of df2 corresponding to the values in the A column of df1 you'll write df2$Y[df1$A]
Hence, your code will be:
df_new <- data.frame("A" = df2$Y[df1$A], "B" = df2$Y[df1$B], "C" = df1$C)
Here is another (one-liner) way of doing it.
> with(c(df2,df1),data.frame(A = Y[match(A,X)],B = Y[match(B,X)],C))
A B C
1 101 105 9
2 102 106 10
3 103 107 11
4 104 108 12
5 105 109 13
However I am not sure whether it will be faster than the other suggestions

Resources