Add scale column to data frame by factor - r

I'm attempting to add a column to a data frame that consists of normalized values by a factor.
For example:
'data.frame': 261 obs. of 3 variables:
$ Area : Factor w/ 29 levels "Antrim","Ards",..: 1 1 1 1 1 1 1 1 1 2 ...
$ Year : Factor w/ 9 levels "2002","2003",..: 1 2 3 4 5 6 7 8 9 1 ...
$ Arrests: int 18 54 47 70 62 85 96 123 99 38 ...
I'd like to add a column containing the Arrests values normalized within each Area group.
The best I've come up with is:
data$Arrests.norm <- unlist(unname(by(data$Arrests,data$Area,function(x){ scale(x)[,1] } )))
This command runs, but the data is scrambled, i.e., the normalized values don't match the correct Areas in the data frame.
Appreciate your tips.
EDIT: Just to clarify what I mean by scrambled data: subsetting the data frame after running my code gives output like the following, where the normalized values clearly belong to another factor group.
Area Year Arrests Arrests.norm
199 Larne 2002 92 -0.992843957
200 Larne 2003 124 -0.404975825
201 Larne 2004 89 -1.169204397
202 Larne 2005 94 -0.581336264
203 Larne 2006 98 -0.228615385
204 Larne 2007 8 0.006531868
205 Larne 2008 31 0.418039561
206 Larne 2009 25 0.947120880
207 Larne 2010 22 2.005283518

Following up on your by attempt:
df <- data.frame(A = factor(rep(c("a", "b"), each = 4)),
B = sample(1:4, 8, TRUE))
ll <- by(data = df, df$A, function(x){
x$B_scale <- scale(x$B)
x
}
)
df2 <- do.call(rbind, ll)

data <- transform(data, Arrests.norm = ave(Arrests, Area, FUN = scale))
will do the trick.
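A minimal sketch of why the by() attempt scrambles the values, using made-up toy data (the data frame df and its values are assumptions, not the original data set). by() returns its groups in factor-level order, so unlist() only lines up with the rows if the data frame is already sorted by Area, whereas ave() writes each group's result back to the original row positions.

```r
# Toy data (hypothetical) whose rows are deliberately NOT sorted by Area
df <- data.frame(Area    = factor(c("b", "a", "b", "a", "b", "a")),
                 Arrests = c(10L, 1L, 20L, 2L, 30L, 3L))

# by() + unlist(): values come back grouped in level order ("a" first, then "b"),
# which does not match the row order of df
by_order <- unlist(unname(by(df$Arrests, df$Area, function(x) scale(x)[, 1])))

# ave(): each group's scaled values are placed back in their original rows
df$Arrests.norm <- ave(as.numeric(df$Arrests), df$Area,
                       FUN = function(x) scale(x)[, 1])

by_order
df
```

Comparing by_order with df$Arrests.norm shows the same numbers in a different order; only the ave() version is aligned with the rows.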

Related

A concise way to extract some elements of a "survfit" object into a data frame

I load a data set from the survival library, and generate a survfit object:
library(survival)
data(lung)
lung$SurvObj <- with(lung, Surv(time, status == 2))
fit <- survfit(SurvObj ~ 1, data = lung, conf.type = "log-log")
This object is a list:
> str(fit)
List of 13
$ n : int 228
$ time : int [1:186] 5 11 12 13 15 26 30 31 53 54 ...
$ n.risk : num [1:186] 228 227 224 223 221 220 219 218 217 215 ...
$ n.event : num [1:186] 1 3 1 2 1 1 1 1 2 1 ...
...
Now I specify some members (all same length) that I want to turn into a data frame:
members <- c("time", "n.risk", "n.event")
I'm looking for a concise way to make a data frame with the three list members as columns, named time, n.risk, n.event (not fit$time, fit$n.risk, fit$n.event).
Thus the resulting data frame should look like this:
time n.risk n.event
[1,] 5 228 1
[2,] 11 227 3
[3,] 12 224 1
...
This works:
data.frame(unclass(fit)[members])
Another (more canonical) way is
with(fit, data.frame(time, n.risk, n.event))
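The list-subsetting idea does not depend on survfit specifically. Here is a small sketch using a hypothetical plain list (obj is a stand-in, not a real survfit object) with equal-length members, to show that selecting members by name and converting keeps the bare names as column names.

```r
# Hypothetical stand-in for a fitted object: a list whose selected
# members all have the same length
obj <- list(n       = 3L,
            time    = c(5, 11, 12),
            n.risk  = c(228, 227, 224),
            n.event = c(1, 3, 1))

members <- c("time", "n.risk", "n.event")

# Subset the list by name, then convert; column names come from the list names
res <- as.data.frame(obj[members])
res
```

For a classed object such as a survfit fit, calling unclass() first (as in the answer above) makes sure plain list subsetting is used.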
The broom package contains functions to tidy up the results of regression models and present them as an object of class data.frame. For those unfamiliar with the tidy philosophy, please see Tidy data [1].
library(broom)
#create tidy dataframe and subset by the columns saved in members
df <- tidy(fit)[,members]
head(df)
# time n.risk n.event
#1 5 228 1
#2 11 227 3
#3 12 224 1
#4 13 223 2
#5 15 221 1
#6 26 220 1
[1] Wickham, Hadley. "Tidy Data." Journal of Statistical Software [Online], 59.10 (2014): 1-23. Web. 16 Jun. 2017.
I used cbind to bind the data frames, then used names to rename the columns:
time=as.data.frame(fit$time)
n.risk=as.data.frame(fit$n.risk)
n.event=as.data.frame(fit$n.event)
members2=cbind(time,n.risk,n.event)
names(members2)=c("time","n.risk","n.event")
head(members2)
time n.risk n.event
1 5 228 1
2 11 227 3
3 12 224 1
4 13 223 2
5 15 221 1
6 26 220 1
library(survival)
data(lung)
lung$SurvObj <- with(lung, Surv(time, status == 2))
fit <- survfit(SurvObj ~ 1, data = lung, conf.type = "log-log")
str(fit)
members<-data.frame(time=fit$time,n.risk=fit$n.risk,n.event=fit$n.event)
members

Observations becoming NA when ordering levels of factors in R with ordered()

I have a longitudinal data frame p that contains 5 variables and looks like this:
> head(p)
date.1 County.x providers beds price
1 Jan/2011 essex 258 5545 251593.4
2 Jan/2011 greater manchester 108 3259 152987.7
3 Jan/2011 kent 301 7191 231985.7
4 Jan/2011 tyne and wear 103 2649 143196.6
5 Jan/2011 west midlands 262 6819 149323.9
6 Jan/2012 essex 2 27 231398.5
The structure of my variables is the following:
'data.frame': 259 obs. of 5 variables:
$ date.1 : Factor w/ 66 levels "Apr/2011","Apr/2012",..: 23 23 23 23 23 24 24 24 25 25 ...
$ County.x : Factor w/ 73 levels "avon","bedfordshire",..: 22 24 32 65 67 22 32 67 22 32 ...
$ providers: int 258 108 301 103 262 2 9 2 1 1 ...
$ beds : int 5545 3259 7191 2649 6819 27 185 24 70 13 ...
$ price : num 251593 152988 231986 143197 149324 ...
I want to order date.1 chronologically. Before applying ordered(), this variable does not contain any NA observations.
> summary(is.na(p$date.1))
Mode FALSE NA's
logical 259 0
However, once I apply my function for ordering the levels corresponding to date.1:
p$date.1 = with(p, ordered(date.1, levels = c("Jun/2010", "Jul/2010",
"Aug/2010", "Sep/2010", "Oct/2010", "Nov/2010", "Dec/2010", "Jan/2011", "Feb/2011",
"Mar/2011","Apr/2011", "May/2011", "Jun/2011", "Jul/2011", "Aug/2011", "Sep/2011",
"Oct/2011", "Nov/2011", "Dec/2011" ,"Jan/2012", "Feb/2012" ,"Mar/2012" ,"Apr/2012",
"May/2012", "Jun/2012", "Jul/2012", "Aug/2012", "Sep/2012", "Oct/2012", "Nov/2012",
"Dec/2012", "Jan/2013", "Feb/2013", "Mar/2013", "Apr/2013", "May/2013",
"Jun/2013", "Jul/2013", "Aug/2013", "Sep/2013", "Oct/2013", "Nov/2013",
"Dec/2013", "Jan/2014",
"Feb/2014", "Mar/2014", "Apr/2014", "May/2014", "Jun/2014", "Jul/2014" ,"Aug/2014",
"Sep/2014", "Oct/2014", "Nov/2014", "Dec/2014", "Jan/2015", "Feb/2015", "Mar/2015",
"Apr/2015","May/2015", "Jun/2015" ,"Jul/2015" ,"Aug/2015", "Sep/2015", "Oct/2015",
"Nov/2015")))
It seems I lose some observations:
> summary(is.na(p$date.1))
Mode FALSE TRUE NA's
logical 250 9 0
Has anyone come across this problem when using ordered()? Alternatively, is there any other way to order my observations chronologically?
It is possible that some values of p$date.1 don't match any of the levels. Try using this ord.mon as the levels.
ord.mon <- do.call(paste, c(expand.grid(month.abb, 2010:2015), sep = "/"))
Then, you can try this to see if there's any mismatch between the two.
p$date.1 %in% ord.mon
Lastly, you can also sort the data frame after converting the date.1 column to Date (note that you have to add an actual day of the month beforehand):
p <- p[order(as.Date(paste0("01/", p$date.1), "%d/%b/%Y")), ]
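The mismatch check described above can be sketched on a small made-up vector (x and its values are hypothetical; the lowercase "jan/2012" is a deliberately planted mismatch). Any value that is not among the supplied levels silently becomes NA in ordered(), and %in% shows which values are affected.

```r
# Hypothetical month labels; "jan/2012" does not match any level (lowercase j)
x <- factor(c("Jan/2011", "Feb/2011", "jan/2012", "Mar/2011"))

# Chronological levels: months vary fastest, years slowest
ord.mon <- do.call(paste, c(expand.grid(month.abb, 2010:2015), sep = "/"))

# Which values would be lost?
bad <- !(x %in% ord.mon)
unique(as.character(x[bad]))

# Unmatched values turn into NA once the levels are imposed
x_ord <- ordered(x, levels = ord.mon)
sum(is.na(x_ord))
```

Typical causes of such mismatches are stray whitespace, different capitalisation, or a month/year that was left out of the levels vector.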

Sorting data.frame in r [duplicate]

I am new to R, and want to sort a data frame called "weights". Here are the details:
>str(weights)
'data.frame': 57 obs. of 1 variable:
$ attr_importance: num 0.04963 0.09069 0.09819 0.00712 0.12543 ...
> names(weights)
[1] "attr_importance"
> dim(weights)
[1] 57 1
> head(weights)
attr_importance
make 0.049630556
address 0.090686474
all 0.098185517
num3d 0.007122618
our 0.125433292
over 0.075182467
I want to sort by decreasing order of attr_importance BUT I want to preserve the corresponding row names also.
I tried:
> weights[order(-weights$attr_importance),]
but it gives me a "numeric" back.
I want a data frame back - which is sorted by attr_importance and has CORRESPONDING row names intact. How can I do this?
Thanks in advance.
Since your data.frame only has one column, you need to set drop=FALSE to prevent the dimensions from being dropped:
weights[order(-weights$attr_importance),,drop=FALSE]
# attr_importance
# our 0.125433292
# all 0.098185517
# address 0.090686474
# over 0.075182467
# make 0.049630556
# num3d 0.007122618
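A small sketch of the drop behaviour, using a made-up three-row stand-in for weights (the values are illustrative assumptions). With a single column, `[` simplifies the result to a plain vector unless drop = FALSE is given.

```r
# Hypothetical one-column data frame with meaningful row names
weights <- data.frame(attr_importance = c(0.05, 0.09, 0.10),
                      row.names = c("make", "address", "all"))

idx <- order(-weights$attr_importance)

# Default: the single column is dropped to a bare numeric vector
dropped <- weights[idx, ]
class(dropped)

# drop = FALSE keeps the data frame and its row names
sorted <- weights[idx, , drop = FALSE]
sorted
```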
For a broad comparison of data.frame sorting approaches, see the question "How to sort a dataframe by column(s)?"
Using my now-preferred solution arrange:
dd <- data.frame(b = factor(c("Hi", "Med", "Hi", "Low"),
levels = c("Low", "Med", "Hi"), ordered = TRUE),
x = c("A", "D", "A", "C"), y = c(8, 3, 9, 9),
z = c(1, 1, 1, 2))
library(plyr)
arrange(dd,desc(z),b)
b x y z
1 Low C 9 2
2 Med D 3 1
3 Hi A 8 1
4 Hi A 9 1
Another example, using a file rankdata.txt with the following contents:
regno name total maths science social cat
1 SUKUMARAN 400 78 89 73 S
2 SHYAMALA 432 65 79 87 S
3 MANOJ 500 90 129 78 C
4 MILYPAULOSE 383 59 88 65 G
5 ANSAL 278 39 77 60 O
6 HAZEENA 273 45 55 56 O
7 MANJUSHA 374 50 99 52 C
8 BILBU 408 81 97 72 S
9 JOSEPHROBIN 374 57 85 68 G
10 SHINY 381 70 79 70 S
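z <- read.table("rankdata.txt", header = TRUE)  # read the file shown above
z[with(z, order(-total, maths)), ]  # sort by total (descending), then maths (ascending)
z[with(z, order(name)), ]           # sort by name
# Each call returns a sorted copy; z itself is unchanged unless you assign the result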

How to obtain a new table after filtering only one column in an existing table in R?

I have a data frame with 20 columns. I need to filter (remove noise from) one column. After filtering with the convolve function I get a new vector of values. Many values of the original column are lost (become NA) in the filtering process. The problem is that I need the whole table (for later analysis), restricted to the rows where the filtered column has values, but I can't bind the filtered column to the original table because the numbers of rows differ. Let me illustrate using the 'age' column in the 'Orange' data set in R:
> head(Orange)
Tree age circumference
1 1 118 30
2 1 484 58
3 1 664 87
4 1 1004 115
5 1 1231 120
6 1 1372 142
Convolve filter used
smooth <- function (x, D, delta){
z <- exp(-abs(-D:D/delta))
r <- convolve (x, z, type='filter')/convolve(rep(1, length(x)),z,type='filter')
r <- head(tail(r, -D), -D)
r
}
Filtering the 'age' column
age2 <- smooth(Orange$age, 5,10)
data.frame(age2)
The age column has 35 rows and the age2 column has 15. The original data set has 2 more columns, and I would like to work with them as well. I only need the 15 rows of each column that correspond to the 15 rows of age2. The filter here removed the first and last ten values of the age column. How can I apply the filter so that I get a truncated data set with all columns and only the filtered rows?
You need to figure out how the variables line up. If you can pad age2 with NAs, then Orange$age2 <- age2 followed by na.omit(Orange) should give you what you want. Or, equivalently, perhaps this is what you are looking for?
df <- tail(head(Orange, -10), -10) # chop off the first and last 10 observations
df$age2 <- age2
df
Tree age circumference age2
11 2 1004 156 915.1678
12 2 1231 172 876.1048
13 2 1372 203 841.3156
14 2 1582 203 911.0914
15 3 118 30 948.2045
16 3 484 51 1008.0198
17 3 664 75 955.0961
18 3 1004 108 915.1678
19 3 1231 115 876.1048
20 3 1372 139 841.3156
21 3 1582 140 911.0914
22 4 118 32 948.2045
23 4 484 62 1008.0198
24 4 664 112 955.0961
25 4 1004 167 915.1678
Edit: If you know that the first and last x observations will be removed (x = 10 in the example above), then the following works:
x <- 10
df <- tail(head(Orange, -x), -x) # chop off the first and last x observations
df$age2 <- age2
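The NA-padding route mentioned in the answer can be sketched as follows. This assumes, as the answer does, that the filter trims the same number of rows from each end; n_drop and Orange2 are names introduced here for illustration only.

```r
# The smoothing function from the question
smooth <- function(x, D, delta) {
  z <- exp(-abs(-D:D / delta))
  r <- convolve(x, z, type = "filter") /
       convolve(rep(1, length(x)), z, type = "filter")
  head(tail(r, -D), -D)
}

age2 <- smooth(Orange$age, 5, 10)

# Rows lost at each end (assumed symmetric): (35 - 15) / 2 = 10
n_drop <- (nrow(Orange) - length(age2)) / 2

# Pad with NA so the filtered column has one value per original row,
# then drop the incomplete rows
Orange2 <- as.data.frame(Orange)
Orange2$age2 <- c(rep(NA, n_drop), age2, rep(NA, n_drop))
res <- na.omit(Orange2)
```

Computing n_drop from the lengths avoids hard-coding the number of trimmed rows when D changes.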

duplicate rows and create new data frame in R

I have an R data frame called intraPByGroup as follows:
group, week1, week2, week3, week4
kiwi,23,43,54,23
eggplant,22,32,33,63
jasmine,23,454,12,654
coconut,32,56,22,31
What I want to do is create a new data frame like the following:
user,week1,week2,week3,week4
eggplantA,22,32,33,63
eggplantB,22,32,33,63
eggplantC,22,32,33,63
jasmineA,23,454,12,654
jasmineB,23,454,12,654
jasmineC,23,454,12,654
Basically, the idea is: from the original data set, I select two groups (eggplant and jasmine), and I want to create a new dataframe. This new data frame has "user" variable instead of "group". Each user name is actually "groupname+A(B or C)", and all the rest values are duplicated for all users in the same group.
How should I do that in R?
My idea is to first drop the group name and select a row, compose one new row from it, and then repeat this for each selected group.
eggFrame <- intraPByGroup[intraPByGroup$group=="eggplant",-1]
eggFrame1 <- eggFrame
eggFrame1["user"] <- "Eggplant-A"
eggFrame2 <- eggFrame
eggFrame2["user"] <- "Eggplant-B"
total <- rbind(eggFrame1,eggFrame2)
Calling rbind repeatedly like this seems clumsy. Is there a faster way to do it?
You can do something like this
data <- subset(data, group %in% c("eggplant", "jasmine"))[rep(1:2, each = 3), ]
data$group <- factor(paste0(data$group, LETTERS[1:3]))
data
## group week1 week2 week3 week4
## 2 eggplantA 22 32 33 63
## 2.1 eggplantB 22 32 33 63
## 2.2 eggplantC 22 32 33 63
## 3 jasmineA 23 454 12 654
## 3.1 jasmineB 23 454 12 654
## 3.2 jasmineC 23 454 12 654
If for any reason you don't like row names like these, and you want to rename "group" to "user":
rownames(data) <- NULL
names(data)[1] <- "user"
data
## user week1 week2 week3 week4
## 1 eggplantA 22 32 33 63
## 2 eggplantB 22 32 33 63
## 3 eggplantC 22 32 33 63
## 4 jasmineA 23 454 12 654
## 5 jasmineB 23 454 12 654
## 6 jasmineC 23 454 12 654
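For completeness, here is a self-contained sketch of the same approach that builds the example data itself and derives the row indices instead of hard-coding 1:2 (keep, suffixes, sel and out are names introduced here for illustration).

```r
# The example data from the question
intraPByGroup <- data.frame(
  group = c("kiwi", "eggplant", "jasmine", "coconut"),
  week1 = c(23, 22, 23, 32),
  week2 = c(43, 32, 454, 56),
  week3 = c(54, 33, 12, 22),
  week4 = c(23, 63, 654, 31)
)

keep     <- c("eggplant", "jasmine")  # groups to duplicate
suffixes <- LETTERS[1:3]              # one copy per suffix

# Select the groups, then repeat each selected row once per suffix
sel <- intraPByGroup[intraPByGroup$group %in% keep, ]
out <- sel[rep(seq_len(nrow(sel)), each = length(suffixes)), ]

# The suffixes recycle across the repeated rows: A, B, C, A, B, C
out$group <- paste0(out$group, suffixes)
names(out)[1] <- "user"
rownames(out) <- NULL
out
```

Using seq_len(nrow(sel)) keeps the code correct if more groups are added to keep.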
