I am trying to create some descriptive statistics and histograms out of ordered variables (range 0 to 10). I used the following commands:
class(data$var1)
describe(as.numeric(data$var1))
But R starts counting from 1 and treats the "Refusal" values as an additional numeric value.
How can I make R start from 0 and ignore the "Refusal" values?
Thank you.
Edit: I was able to make R ignore the "Refusal" values using the following command:
is.na(data$var1[data$var1=="Refusal"]) <- TRUE
But when I search for a possible solution for the 0 values, I only find suggestions on how to ignore/remove 0 values...
Edit2: This is a sample of my data,
[1] 5 8 8 8 Refusal 10 8 Refusal 7
[10] 7 8 7 8 8 8 8 8 8
[19] 8 0 9 Refusal 6 10 7 7 9
As you can see, the range is from 0 to 10, but using the R library "psych" and the command "describe" the reported range is always 1 to 11, which invalidates the whole set of statistics.
> class(data$var1)
[1] "factor"
> describe(as.numeric(data$var1), na.rm=TRUE)
vars n mean sd median trimmed mad min max range skew kurtosis se
1 1 1115 8.38 1.94 9 8.57 1.48 1 11 10 -1.06 1.42 0.06
Sorry for the ongoing editing, but I am new to stackoverflow.com.
Have a look at how factors work, with ?factor, or by looking at the example question here. In essence, each level is given a number starting at 1, hence ending at 11 if you have 11 unique values (0-10 plus "Refusal"). Converting a factor to numeric returns these codes rather than the underlying numbers they relate to. To recover the underlying numbers, first convert to character, then to numeric. See the difference between these code snippets:
#create data
set.seed(0)
a <- factor(sample(c(0:10, "refusal"), 50, replace = TRUE)) #some dummy data
class(a)
# [1] "factor"
snippet 1 - how you're doing it
describe(as.numeric(a),na.rm=TRUE)
#as.numeric(a)
#n missing unique Mean .05 .10 .25 .50 .75 .90 .95
#50 0 11 6.28 2.00 2.00 4.00 6.00 8.75 10.00 11.00
#
#1 2 3 4 5 6 7 8 9 10 11
#Frequency 2 5 5 4 2 8 6 5 3 6 4
#% 4 10 10 8 4 16 12 10 6 12 8
snippet 2 - correct way
describe(as.numeric(as.character(a)),na.rm=TRUE)
#as.numeric(as.character(a))
#n missing unique Mean .05 .10 .25 .50 .75 .90 .95
#46 4 10 5.304 1.0 1.0 3.0 5.0 8.0 9.5 10.0
#
#0 1 2 3 4 5 7 8 9 10
#Frequency 2 5 4 2 8 6 5 3 6 5
#% 4 11 9 4 17 13 11 7 13 11
#Warning message:
# In describe(as.numeric(as.character(a)), na.rm = TRUE) :
# NAs introduced by coercion
Note the difference in range (even if my describe function isn't the same as yours). The warning refers to the "refusal" values, which are converted to NAs as they don't represent a number.
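For the data frame from the question (column name taken from the asker's code, so treat this as a sketch rather than a tested solution), the full fix might look like:
library(psych)                                    # for describe()
var1_num <- as.numeric(as.character(data$var1))   # "Refusal" becomes NA, with a coercion warning
describe(var1_num, na.rm = TRUE)                  # min/max now reflect the real 0-10 scale
hist(var1_num, breaks = seq(-0.5, 10.5, by = 1))  # one bar per value from 0 to 10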
I have learned about imputation of NA values in R: we normally compute the average of a column (if it is numeric) and put that in place of the NAs in that column. But what should I do if, instead of NA, the cell is simply empty?
Please help me.
Let's start with some test data:
person_id <- c("1","2","3","4","5","6","7","8","9","10")
inches <- as.numeric(c("56","58","60","62","64","","68","70","72","74"))
height <- data.frame(person_id,inches)
height
person_id inches
1 1 56
2 2 58
3 3 60
4 4 62
5 5 64
6 6 NA
7 7 68
8 8 70
9 9 72
10 10 74
The blank was already converted to NA when as.numeric() coerced the empty string for height$inches.
If the column were still stored as character, you could also do the replacement yourself:
height$inches[height$inches == ""] <- NA
Now to fill in the NA with the average from the non-missing values of inches.
options(digits=4)
height$inches[is.na(height$inches)] <- mean(height$inches,na.rm=T)
height
person_id inches
1 1 56.00
2 2 58.00
3 3 60.00
4 4 62.00
5 5 64.00
6 6 64.89
7 7 68.00
8 8 70.00
9 9 72.00
10 10 74.00
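If the column instead arrives as character with empty cells (so a comparison against "" actually applies), a minimal sketch of the same idea on the inches values used above:
inches_chr <- c("56","58","60","62","64","","68","70","72","74")
inches_chr[inches_chr == ""] <- NA                               # empty cells -> NA
inches_num <- as.numeric(inches_chr)                             # then convert to numeric
inches_num[is.na(inches_num)] <- mean(inches_num, na.rm = TRUE)  # mean imputation
inches_num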
I have three independent measures of a variable, and they are subject to a lot of noise and sporadic sources of error that can be quite large. I would like to discard the value furthest away from the others, remember which one is discarded, and then calculate the mean with the remaining two. For example,
a b c
15 6 7
11 10 3
5 12 6
would become
a b c ave discard
15 6 7 6.5 15
11 10 3 10.5 3
5 12 6 5.5 12
Try:
#reconstruct the data frame from the question
ddf <- data.frame(a = c(15, 11, 5), b = c(6, 10, 12), c = c(7, 3, 6))
ddf
a b c
1 15 6 7
2 11 10 3
3 5 12 6
ddf$ave = apply(ddf[1:3], 1, function(x) {
    x = sort(x)
    # average the two values that are closest to each other
    ifelse(abs(x[1]-x[2]) > abs(x[2]-x[3]), mean(x[2:3]), mean(x[1:2]))
})
ddf$discard = apply(ddf[1:3], 1, function(x) {
    x = sort(x)
    # discard the value that is furthest from the other two
    ifelse(abs(x[1]-x[2]) > abs(x[2]-x[3]), x[1], x[3])
})
ddf
a b c ave discard
1 15 6 7 6.5 15
2 11 10 3 10.5 3
3 5 12 6 5.5 12
Your question is underspecified. Say the three values are 1000, 2000 and 3000. Which would you discard? Should the answer be 1500 or 2500?
If all you're looking for is a robust measure of central tendency, the median might be a good start (?median in R).
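For example, a row-wise median over the same ddf used above (a sketch, and not a replacement if you really do need to know which value was discarded):
ddf$med <- apply(ddf[1:3], 1, median)   # robust row-wise centre; nothing needs to be discarded
ddf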
I would highly appreciate it if somebody could help me out with this. This looks simple, but I have no clue how to go about it.
I am trying to work out the percentage change in one row with respect to the previous one. For example: my data frame looks like this:
day value
1 21
2 23.4
3 10.7
4 5.6
5 3.2
6 35.2
7 12.9
8 67.8
. .
. .
. .
365 27.2
What I am trying to do is to calculate the percentage change in each row with respect to the previous row. For example:
day value
1 21
2 ((day2-day1)/day1)*100
3 ((day3-day2)/day2)*100
4 ((day4-day3)/day3)*100
5 ((day5-day4)/day4)*100
6 ((day6-day5)/day5)*100
7 ((day7-day6)/day6)*100
8 ((day8-day7)/day7)*100
. .
. .
. .
365 ((day365-day364)/day364)*100
and then print out only those days where there was a percentage increase of >50% over the previous row.
Many thanks
You are looking for diff(). See its help page by typing ?diff. Here are the indices of days that fulfill your criterion:
> value <- c(21,23.4,10.7,5.6,3.2,35.2,12.9,67.8)
> which(diff(value)/head(value,-1)>0.5)+1
[1] 6 8
Use diff; divide by the previous value, not the current one:
pct <- 100 * diff(value) / value[-length(value)]
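With the sample values from the question, for example:
value <- c(21, 23.4, 10.7, 5.6, 3.2, 35.2, 12.9, 67.8)
pct <- 100 * diff(value) / value[-length(value)]   # change relative to the previous day
round(pct, 1)
# [1]   11.4  -54.3  -47.7  -42.9 1000.0  -63.4  425.6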
Here's one way:
dat <- data.frame(day = 1:10, value = 1:10)
dat2 <- transform(dat, value2 = c(value[1], diff(value) / head(value, -1) * 100))
day value value2
1 1 1 1.00000
2 2 2 100.00000
3 3 3 50.00000
4 4 4 33.33333
5 5 5 25.00000
6 6 6 20.00000
7 7 7 16.66667
8 8 8 14.28571
9 9 9 12.50000
10 10 10 11.11111
dat2[dat2$value2 > 50, ]
day value value2
2 2 2 100
You're looking for the diff function:
x<-c(3,1,4,1,5)
diff(x)
[1] -2 3 -3 4
Here is another way:
#dummy data
df <- read.table(text="day value
1 21
2 23.4
3 10.7
4 5.6
5 3.2
6 35.2
7 12.9
8 67.8", header=TRUE)
#get index for 50% change
x <- sapply(2:nrow(df),function(i)((df$value[i]-df$value[i-1])/df$value[i-1])>0.5)
#output
df[c(FALSE,x),]
# day value
#6 6 35.2
#8 8 67.8
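The same rows can also be picked out without the sapply() loop, using diff() on the df built above (sketch):
#fractional change relative to the previous row (NA for day 1)
pct <- c(NA, diff(df$value) / head(df$value, -1))
#output
df[which(pct > 0.5), ]
# day value
#6 6 35.2
#8 8 67.8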
I have two data sets; one is a subset of the other, but the subset has an additional column and fewer observations.
Basically, I have a unique ID assigned to each participant, and then an HHID, the ID of the house from which they were recruited (e.g. 15 participants recruited from 11 houses).
> Healthdata <- data.frame(ID = gl(15, 1), HHID = c(1,2,2,3,4,5,5,5,6,6,7,8,9,10,11))
> Healthdata
Now, I have a subset of the data with only one participant per household, chosen as the one who spent the most hours watching television. In this subset, I have computed a socioeconomic score (SSE) for each house.
> set.seed(1)
> Healthdata.1<- data.frame(ID=sample(1:15,11, replace=F), HHID=gl(11,1), SSE = sample(-6.5:3.5, 11, replace=TRUE))
> Healthdata.1
Now, I want to assign the SSE from the subset (Healthdata.1) to the participants in the bigger data set (Healthdata) such that participants from the same house get the same score.
I can't simply merge these, because the data sets have different numbers of observations: 15 in the bigger one but only 11 in the subset.
Is there any way to do this in R? I am very new to it and I am stuck with this.
I want the output to look something like the example below, i.e. IDs (participants) from the same HHID (house) should have the same SSE score. The following output is just an example of what I need; the seed above will not reproduce it.
ID HHID SSE
1 1 -6.5
2 2 -5.5
3 2 -5.5
4 3 3.3
5 4 3.0
6 5 2.58
7 5 2.58
8 5 2.58
9 6 -3.05
10 6 -3.05
11 7 -1.2
12 8 2.5
13 9 1.89
14 10 1.88
15 11 -3.02
Thanks.
You can use merge. By default it will merge on the intersection of the column names.
merge(Healthdata,Healthdata.1,all.x=TRUE)
ID HHID SSE
1 1 1 NA
2 2 2 NA
3 3 2 NA
4 4 3 NA
5 5 4 NA
6 6 5 NA
7 7 5 NA
8 8 5 NA
9 9 6 0.7
10 10 6 NA
11 11 7 NA
12 12 8 NA
13 13 9 NA
14 14 10 NA
15 15 11 NA
Or you can choose which column to merge by:
merge(Healthdata,Healthdata.1,all.x=TRUE,by='ID')
You need to merge by HHID, not ID. Note this is somewhat confusing because the IDs from the supergroup are from a different set than those from the subgroup, i.e. ID.x == 4 is not the same as ID.y == 4 (in fact, in this case they are in different households). Because of that I left both ID columns here to avoid ambiguity, but you can easily subset the result to show only the ID.x one:
> merge(Healthdata, Healthdata.1, by='HHID')
HHID ID.x ID.y SSE
1 1 1 4 -5.5
2 2 2 6 0.5
3 2 3 6 0.5
4 3 4 8 -2.5
5 4 5 11 1.5
6 5 6 3 -1.5
7 5 7 3 -1.5
8 5 8 3 -1.5
9 6 9 9 0.5
10 6 10 9 0.5
11 7 11 10 3.5
12 8 12 14 -2.5
13 9 13 5 1.5
14 10 14 1 3.5
15 11 15 2 -4.5
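If you only want each participant's own ID together with the household score, you can also merge in just the household-level columns (a sketch of the same idea):
merged <- merge(Healthdata, Healthdata.1[, c("HHID", "SSE")], by = "HHID", all.x = TRUE)
merged[order(merged$ID), c("ID", "HHID", "SSE")]   # one row per participant, SSE shared within a household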
library(plyr)
join(Healthdata, Healthdata.1)
# Inner Join
join(Healthdata, Healthdata.1, type = "inner", by = "ID")
# Left Join
# I believe this is what you are after
join(Healthdata, Healthdata.1, type = "left", by = "ID")
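As the other answer notes, the score follows the household, so with plyr you probably want to join on HHID instead. A sketch (converting HHID first so both data frames store it the same way):
Healthdata.1$HHID <- as.numeric(as.character(Healthdata.1$HHID))   # factor -> numeric, to match Healthdata
join(Healthdata, Healthdata.1[, c("HHID", "SSE")], type = "left", by = "HHID")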
I'd like to do a cut with a guaranteed number of levels returned, so I'd like to take any vector of cumulative percentages and get a cut into deciles. I've tried using cut and it works well in most situations, but when the data jump over one of the bins it fails to return the desired number of unique cuts, which is 10. Any ideas on how to ensure that the number of cuts is guaranteed to be 10?
In the included example there is no occurrence of decile 7.
> (x <- c(0.04,0.1,0.22,0.24,0.26,0.3,0.35,0.52,0.62,0.66,0.68,0.69,0.76,0.82,1.41,6.19,9.05,18.34,19.85,20.5,20.96,31.85,34.33,36.05,36.32,43.56,44.19,53.33,58.03,72.46,73.4,77.71,78.81,79.88,84.31,90.07,92.69,99.14,99.95))
[1] 0.04 0.10 0.22 0.24 0.26 0.30 0.35 0.52 0.62 0.66 0.68 0.69 0.76 0.82 1.41 6.19 9.05 18.34 19.85 20.50 20.96 31.85 34.33
[24] 36.05 36.32 43.56 44.19 53.33 58.03 72.46 73.40 77.71 78.81 79.88 84.31 90.07 92.69 99.14 99.95
> (cut(x,seq(0,max(x),max(x)/10),labels=FALSE))
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 3 3 4 4 4 4 5 5 6 6 8 8 8 8 8 9 10 10 10 10
> (as.integer(cut2(x,seq(0,max(x),max(x)/10))))
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 3 3 4 4 4 4 5 5 6 6 8 8 8 8 8 9 10 10 10 10
> (findInterval(x,seq(0,max(x),max(x)/10),rightmost.closed=TRUE,all.inside=TRUE))
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 3 3 4 4 4 4 5 5 6 6 8 8 8 8 8 9 10 10 10 10
I would like to get 10 approximately equally sized intervals, constructed in such a way that I am assured of getting 10. cut et al. give 9 distinct bins with this example; I want 10. So I'm looking for an algorithm that would recognize that the gap between 58.03 and 72.46, 73.4 is large. Instead of assigning these cases to bins 6, 8, 8 it would assign them to bins 6, 7, 8.
xx <- cut(x, breaks=quantile(x, (1:10)/10, na.rm=TRUE) )
table(xx)
#------------------------
xx
(0.256,0.58] (0.58,0.718] (0.718,6.76] (6.76,20.5]
4 4 4 4
(20.5,35.7] (35.7,49.7] (49.7,75.1] (75.1,85.5]
3 4 4 4
(85.5,100]
4
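Note that quantile(x, (1:10)/10) only supplies the ten upper breakpoints, so the table above has nine intervals and the values below the 10% quantile become NA. To be assured of ten quantile bins (assuming the quantiles are all distinct), a sketch:
xx <- cut(x, breaks = quantile(x, probs = seq(0, 1, 0.1), na.rm = TRUE),
          include.lowest = TRUE, labels = FALSE)
table(xx)   # ten bins of roughly equal size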
numBins = 10
cut(x, breaks = seq(from = min(x), to = max(x), length.out = numBins+1))
Output:
...
...
...
10 Levels: (0.04,10] (10,20] (20,30] (30,40] (40,50] (50,60] ... (90,100]
This will make 10 bins that are approximately equally spaced. Note that by changing the numBins variable, you may obtain any number of approximately equally spaced bins.
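One caveat: with breaks starting exactly at min(x), cut() uses half-open intervals, so the minimum itself becomes NA unless you add include.lowest = TRUE, e.g.:
cut(x, breaks = seq(from = min(x), to = max(x), length.out = numBins + 1),
    include.lowest = TRUE)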
Not sure I understand what you need, but if you drop the labels=FALSE and use table to make a frequency table of your data, you will get the number of categories desired:
> table(cut(x, breaks=seq(0, 100, 10)))
(0,10] (10,20] (20,30] (30,40] (40,50] (50,60] (60,70] (70,80] (80,90] (90,100]
17 2 2 4 2 2 0 5 1 4
Notice that there is no data in the 7th category, (60,70].
What is the problem you are trying to solve? If you don't want quantiles, then your cutpoints are pretty much arbitrary, so you could just as easily create ten bins by sampling without replacement from your original dataset. I realize that's an absurd method, but I want to make a point: you may be way off track but we can't tell because you haven't explained what you intend to do with your bins. Why, for example, is it so bad that one bin has no content?