I'm looking for a way to remove rows in a data frame that belong to combinations with fewer than 3 observations. Let me explain the problem in more detail.
I have a data frame with 6 independent variables and 1 dependent variable. I'm making a density plot in ggplot2 with faceting, and combinations with fewer than 3 observations are not plotted (obviously). I'm looking for a way to delete the rows belonging to these combinations. This is an example of the data:
'data.frame': 432 obs. of 6 variables:
$ ID : Factor w/ 439 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
$ Forno : Factor w/ 8 levels "Micro","Macro",..: 1 1 1 6 6 6 4 4 4 5 ...
$ Varieta: Factor w/ 11 levels "cc","dd",..: 11 11 11 6 6 6 1 1 1 6 ...
$ Impiego: Factor w/ 5 levels "aperto","chiuso",..: 2 2 2 3 3 3 2 2 2 5 ...
$ MediaL : num 60.7 58.9 60.5 55.9 56.1 ...
$ MediaL.sd : num 4.81 4.79 4.84 5.27 5.64 ...
ggplot code:
ggplot(d1,aes(MediaL))+geom_density(aes(fill=Varieta),colour=NA,alpha=0.5)+
scale_fill_brewer(palette="Set1")+facet_grid(Forno~Impiego)+
theme(axis.text.x=element_text(angle=90,hjust=1))+theme_mio +xlim(45,65)+
stat_bin(geom="text",aes(y=0,label=..count..),size=2,binwidth=2)
I would like to remove all the interactions with fewer than 3 observations.
Providing the actual output of your sample data would be useful. You can provide this via dput(yourObject) instead of the text representation you provided. However, it does seem like the same basic approach below works equally well with a matrix, data.frame, and table data structure.
#Matrix
x <- matrix(c(5,4,4,3,1,5,1,8,2), ncol = 3, byrow = TRUE)
x[x < 3] <- NA
#----
     [,1] [,2] [,3]
[1,]    5    4    4
[2,]    3   NA    5
[3,]   NA    8   NA
#data.frame
xd <- as.data.frame(matrix(c(5,4,4,3,1,5,1,8,2), ncol = 3, byrow = TRUE))
xd[xd < 3] <- NA
#----
  V1 V2 V3
1  5  4  4
2  3 NA  5
3 NA  8 NA
#Table. Simulate some data first
set.seed(1)
samp <- data.frame(x1 = sample(c("acqua", "fango", "neve"), 20, TRUE),
x2 = sample(c("pippo", "pluto", "paperino"), 20, TRUE))
x2 <-table(samp)
x2[x2 < 3] <- NA
#----
x2
x1 paperino pippo pluto
acqua 3
fango 3
neve 3 3
ggplot generally likes data to be in long format, most often achieved via the melt() command in reshape2. If you provide your plotting code, that may illustrate a better way to remove the data you don't want to plot.
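If the aim is to drop every Forno/Impiego/Varieta combination that has fewer than 3 rows (rather than values smaller than 3), one possible base R sketch counts the rows per combination with ave() and keeps only the larger groups. The toy data frame below is invented for illustration; only the column names come from the question.

```r
# Toy stand-in for the question's data (values are invented for illustration)
d1 <- data.frame(
  Forno   = c("Micro", "Micro", "Micro", "Macro", "Macro", "Micro"),
  Impiego = c("chiuso", "chiuso", "chiuso", "aperto", "aperto", "aperto"),
  Varieta = c("cc", "cc", "cc", "dd", "dd", "cc"),
  MediaL  = c(60.7, 58.9, 60.5, 55.9, 56.1, 57.2)
)

# Number of rows in each Forno x Impiego x Varieta combination,
# repeated on every row that belongs to that combination
n_per_group <- ave(d1$MediaL, d1$Forno, d1$Impiego, d1$Varieta, FUN = length)

# Keep only the combinations with at least 3 observations
d1_filtered <- d1[n_per_group >= 3, ]
```

d1_filtered could then be passed to ggplot() in place of d1.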
Related
I am trying to convert missing factor values to NA in a data frame, and create a new data frame with replaced values but when I try to do that, previously character factors are all converted to numbers. I cannot figure out what I am doing wrong and cannot find a similar question. Could anybody please help?
Here is my code:
orders <- c('One','Two','Three', '')
ids <- c(1, 2, 3, 4)
values <- c(1.5, 100.6, 19.3, '')
df <- data.frame(orders, ids, values)
new.df <- as.data.frame(matrix( , ncol = ncol(df), nrow = 0))
names(new.df) <- names(df)
for(i in 1:nrow(df)){
row.df <- df[i, ]
print(row.df$orders) # "One", "Two", "Three", ""
print(str(row.df$orders)) # Factor
# Want to replace "orders" value in each row with NA if it is missing
row.df$orders <- ifelse(row.df$orders == "", NA, row.df$orders)
print(row.df$orders) # Converted to number
print(str(row.df$orders)) # int or logi
# Add the row with new value to the new data frame
new.df[nrow(new.df) + 1, ] <- row.df
}
and I get this:
> new.df
orders ids values
1 2 1 2
2 4 2 3
3 3 3 4
4 NA 4 1
but I want this:
> new.df
orders ids values
1 One 1 1.5
2 Two 2 100.6
3 Three 3 19.3
4 NA 4
Convert empty values to NA and use type.convert to change their class.
df[df == ''] <- NA
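df <- type.convert(df, as.is = FALSE)  # as.is = FALSE turns strings into factors; recent R versions expect as.is to be set explicitly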
df
# orders ids values
#1 One 1 1.5
#2 Two 2 100.6
#3 Three 3 19.3
#4 <NA> 4 NA
str(df)
#'data.frame': 4 obs. of 3 variables:
#$ orders: Factor w/ 4 levels "","One","Three",..: 2 4 3 1
#$ ids : int 1 2 3 4
#$ values: num 1.5 100.6 19.3 NA
Thanks to the hint from Ronak Shah, I did this and it gave me what I wanted.
df$orders[df$orders == ''] <- NA
This will give me:
> df
orders ids values
1 One 1 1.5
2 Two 2 100.6
3 Three 3 19.3
4 <NA> 4
> str(df)
'data.frame': 4 obs. of 3 variables:
$ orders: Factor w/ 4 levels "","One","Three",..: 2 4 3 NA
$ ids : num 1 2 3 4
$ values: Factor w/ 4 levels "","1.5","100.6",..: 2 3 4 1
In case you are curious about the difference between NA and <NA> as I was, you can find the answer here.
Your suggestion
df$orders[is.na(df$orders)] <- NA
did not work, maybe because the missing entry is not NA?
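A minimal sketch of what is probably going on, assuming orders is a factor as in the question: the empty string is an ordinary level, so is.na() does not flag it, and ifelse() on a factor returns the underlying integer codes unless the factor is converted to character first. The names orders2 and as_text are only illustrative.

```r
# Recreate the 'orders' column as a factor, as in the question
orders <- factor(c("One", "Two", "Three", ""))

# An empty string is an ordinary value, not a missing one
any(is.na(orders))        # FALSE
which(orders == "")       # 4

# Assigning NA by position keeps the factor class intact
orders[orders == ""] <- NA

# ifelse() drops the factor class, so convert to character first
# if ifelse() is preferred
orders2 <- factor(c("One", "Two", "Three", ""))
as_text <- ifelse(orders2 == "", NA, as.character(orders2))
```

After the assignment, the unused "" level is still listed in levels(orders); droplevels(orders) removes it if that matters.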
I have a data frame like this:
> str(dynamics)
'data.frame': 3517 obs. of 3 variables:
$ id : int 1 2 3 4 5 6 7 8 9 10 ...
$ y2015: int 245 129 301 162 123 125 115 47 46 135 ...
$ y2016: int NA 385 420 205 215 295 130 NA NA 380 ...
I take out the 3 vectors and name them differently,
Column 1:
> plantid <- dynamics$id
> head(plantid)
[1] 1 2 3 4 5 6
Column 2:
(I divide it into different classes and label them 2,3,4 and 5)
> y15 <- dynamics$y2015
> year15 <- cut(y15, breaks = c(-Inf, 50, 100, 150, Inf), labels = c("2", "3", "4", "5"))
> str(year15)
Factor w/ 4 levels "2","3","4","5": 4 3 4 4 3 3 3 1 1 3 ...
> head(year15)
[1] 5 4 5 5 4 4
Levels: 2 3 4 5
Column 3:
(Same here)
> y16 <- dynamics$y2016
> year16 <- cut(y16, breaks = c(-Inf, 50, 100, 150, Inf), labels = c("2", "3", "4", "5"))
> str(year16)
Factor w/ 4 levels "2","3","4","5": NA 4 4 4 4 4 3 NA NA 4 ...
> head(year16)
[1] <NA> 5 5 5 5 5
Levels: 2 3 4 5
So far so good!
The problem arises when I combine the three vectors above with cbind() to form a new data frame: the levels of the newly created factors are gone.
Look at my code:
SD1 = data.frame(cbind(plantid, year15, year16))
head(SD1)
and I get a data frame like this:
> head(SD1)
plantid year15 year16
1 1 4 NA
2 2 3 4
3 3 4 4
4 4 4 4
5 5 3 4
6 6 3 4
As you can see, the values in the 2nd and 3rd columns have changed from 2, 3, 4, 5 back to 1, 2, 3, 4.
How do I fix that?
cbind is most commonly used to combine objects into matrices. It strips out special attributes from the inputs to help ensure that they are compatible for combining into a single object. This means that data types that rely on special attributes (such as the levels and class attributes of factors, or the class attribute of Dates) will be simplified to their underlying numerical representations. This is why cbind turns your factors into numbers.
Conversely, data.frame() by itself will preserve the individual object attributes. In this case, your use of cbind is unnecessary. To preserve your factor levels, simply use:
SD1 <- data.frame(plantid, year15, year16)
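The difference can be seen on a small example. This is only a sketch with three invented values; the breaks and labels are the ones from the question, and the column year15_num is a hypothetical addition for the case where numeric 2 to 5 values are wanted.

```r
plantid <- 1:3
year15  <- cut(c(245, 129, 47), breaks = c(-Inf, 50, 100, 150, Inf),
               labels = c("2", "3", "4", "5"))

# cbind() builds an integer matrix, so only the factor's integer codes survive
m <- cbind(plantid, year15)

# data.frame() keeps year15 as a factor with its labels
SD1 <- data.frame(plantid, year15)

# If numeric values 2..5 are wanted, convert via the labels, not the codes
SD1$year15_num <- as.numeric(as.character(SD1$year15))
```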
I have a dataset with 49 columns.
'data.frame': 1351 obs. of 47 variables:
$ ID : Factor w/ 1351 levels "PID0001","PID0002",..: 1 2 3 4 5 6 7 8 9 10 ...
$ Survey: int 1 2 1 1 2 2 2 1 1 2 ...
$ hsinc1: int 2 4 4 4 5 4 3 3 1 1 ...
$ hsinc2: int 2 3 3 3 4 3 3 3 1 1 ...
$ hsinc3: int 4 4 2 3 3 4 5 4 5 5 ...
$ hsinc4: int 4 4 4 4 4 4 4 4 5 4 ...
$ hfair1: int 2 2 2 1 1 1 1 2 1 2 ...
$ hfair2: int 4 5 5 4 5 5 5 5 5 5 ...
$ hfair3: int 4 5 4 3 5 4 3 3 5 5 ...
etc ...
I want to reverse code columns 5,6,8,9,10,12,13,14,17 and 18 such that a score of 5 becomes a score of 1, and 4 becomes 2 etc.
At first, I thought this was achievable by using the psych::reverse.code() function, so I tried this:
With the -1's being the 5,6,8,9,10,12,13,14,17 and 18 columns.
library('psych')
keys <-c(1,1,1,1,-1,-1,1,-1,-1,-1,1,-1,-1,-1,1,1,-1,-1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1)
df_rev <- reverse.code(keys, items = df, mini = rep(1,49), maxi = rep(5,49))
However, when I run this code, I get the following error:
Error in items %*% keys.d :
requires numeric/complex matrix/vector arguments
Can anybody help with this, please?
Another method I have just been trying is to create a subset of the original data frame, with just the columns I want to reverse code:
data_to_rev <- df[c(5,6,8,9,10,12,13,14,17,18)]
And then reverse coding this subset:
keys <- c(-1,-1,-1,-1,-1,-1,-1,-1,-1,-1)
df_rev <- reverse.code(keys, items = data_to_rev, mini = rep(1,10), maxi = rep(5,10))
This works successfully. All variables are now reverse coded like I need them. However, how do I get this subset of reverse coded values and place it back into the original data frame - overwriting the old (non-reversed) columns?
Any help would be hugely appreciated, thank you!
EDIT - SOLUTION
I think I have managed to solve it using @MikeH's help.
I created a subset of just the participant ID's (the factor variable) data_ID <- df[1]
And then used:
data_rev <- reverse.code(keys, items = df[,-1], mini = rep(1,46), maxi = rep(5,46))
This leaves me with 2 data frames/subsets:
1 with all the participant ID's.
1 with all their data and columns 5,6,8,9,10,12,13,14,17 and 18 reverse coded.
I then used: data_final <- cbind(data_ID, data_rev) to join the 2 subsets back together.
Can anyone see anything wrong with this method? I think it has worked upon visual inspection...
df[c(5,6,8,9,10,12,13,14,17,18)] <- 6 - df[c(5,6,8,9,10,12,13,14,17,18)]
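The one-liner above relies on the fact that, on a 1 to 5 scale, the reversed score is min + max - old, i.e. 6 - old. A small self-contained sketch of the same idea follows; the data values and the choice of columns in rev_cols are invented, and scale_min and scale_max are illustrative names.

```r
# Toy data: an ID factor plus three 1-5 items (invented values)
df <- data.frame(
  ID     = factor(c("PID0001", "PID0002", "PID0003")),
  hsinc1 = c(2L, 4L, 5L),
  hsinc2 = c(1L, 3L, 3L),
  hfair1 = c(5L, 2L, 1L)
)

rev_cols  <- c(3, 4)   # columns to reverse (hypothetical choice)
scale_min <- 1
scale_max <- 5

# Reversing a min..max scale: new = min + max - old
df[rev_cols] <- scale_min + scale_max - df[rev_cols]
```

Because only the selected columns are overwritten in place, the ID column and the non-reversed items stay where they are and no cbind() step is needed afterwards.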
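An efficient way to do it is to write the reverse function yourself and apply it to the columns you want:
library(data.table)
start <- 1
end <- 5
# reverse a score on a start..end scale
myrev <- function(x) end + start - x
dt <- data.table(x = c(1, 2, 1, 4), y = c(2, 5, 4, 1))
cols <- 1:2
dt[, (cols) := lapply(.SD, myrev), .SDcols = cols]
Or, operating on the selected columns directly (use .SD here, not cols, which only holds the column positions):
dt[, (cols) := end + start - .SD, .SDcols = cols]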
I'm sure there's a super-easy answer to this. I am trying to combine ratings on subjects based on their unique ID. Here is a test dataset I created (called Aggregate_Test), where the ID is unique to the subject and the StaticScore values were given by different raters:
ID StaticScore
1 6
2 7
1 5
2 6
3 7
4 8
3 4
4 5
After reading other posts carefully, I used aggregate to create the following dataset with new columns:
StaticAggregate<-aggregate(StaticScore ~ ID, Aggregate_Test, c)
> StaticAggregate
ID StaticScore.1 StaticScore.2
1 1 6 5
2 2 7 6
3 3 7 4
4 4 8 5
This data frame has the following str:
> str(StaticAggregate)
'data.frame': 4 obs. of 2 variables:
$ ID : num 1 2 3 4
$ StaticScore: num [1:4, 1:2] 6 7 7 8 5 6 4 5
If I try to create a new variable by subtracting StaticScore.1 from StaticScore.2, I get the following error:
Staticdiff<-StaticScore.1-StaticScore.2
Error: object 'StaticScore.1' not found
So, please help me - what is this data structure created by aggregate? A matrix? How could I convert StaticScore.1 and StaticScore.2 to separate variables, or barring that, what is the notation to subtract one from the other to create a new variable?
We can use dcast to reshape from long to wide format and then subtract the columns to create 'StaticDiff':
library(data.table)
dcast(setDT(Aggregate_Test), ID ~ paste0("StaticScore", rowid(ID)),
      value.var = "StaticScore")[, StaticDiff := StaticScore1 - StaticScore2]
Regarding the specific question about the aggregate behavior, we are just concatenating (c) the 'StaticScore' by 'ID'. The default behavior is to create a matrix column in aggregate
StaticAggregate<-aggregate(StaticScore ~ ID, Aggregate_Test, c)
This can be checked by looking at the str(StaticAggregate)
str(StaticAggregate)
#'data.frame': 4 obs. of 2 variables:
#$ ID : int 1 2 3 4
#$ StaticScore: int [1:4, 1:2] 6 7 7 8 5 6 4 5
How do we change it to normal columns?
It can be done with do.call(data.frame, ...):
StaticAggregate <- do.call(data.frame, StaticAggregate)
Check the str again
str(StaticAggregate)
#'data.frame': 4 obs. of 3 variables:
# $ ID : int 1 2 3 4
# $ StaticScore.1: int 6 7 7 8
# $ StaticScore.2: int 5 6 4 5
Now we can do the calculation as shown in the OP's post:
StaticAggregate$Staticdiff <- with(StaticAggregate, StaticScore.1-StaticScore.2)
StaticAggregate
# ID StaticScore.1 StaticScore.2 Staticdiff
#1 1 6 5 1
#2 2 7 6 1
#3 3 7 4 3
#4 4 8 5 3
As the str output shown in the question indicates, StaticAggregate is a two column data.frame whose second column is a two column matrix, StaticScore. We can display the matrix like this:
StaticAggregate$StaticScore
## [,1] [,2]
## [1,] 6 5
## [2,] 7 6
## [3,] 7 4
## [4,] 8 5
To create a new column with the difference:
transform(StaticAggregate, diff = StaticScore[, 1] - StaticScore[, 2])
## ID StaticScore.1 StaticScore.2 diff
## 1 1 6 5 1
## 2 2 7 6 1
## 3 3 7 4 3
## 4 4 8 5 3
Note that there are no columns in StaticAggregate or in StaticAggregate$StaticScore named StaticScore.1 and StaticScore.2. StaticScore.1 in the heading of the data.frame print output just denotes the first column of the StaticScore matrix.
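This can be checked directly. The sketch below rebuilds the test data from the question and inspects the result; it assumes, as in the question, that every ID has the same number of scores (otherwise aggregate would not return a matrix column).

```r
Aggregate_Test <- data.frame(
  ID          = c(1, 2, 1, 2, 3, 4, 3, 4),
  StaticScore = c(6, 7, 5, 6, 7, 8, 4, 5)
)

StaticAggregate <- aggregate(StaticScore ~ ID, Aggregate_Test, c)

ncol(StaticAggregate)                     # 2: ID and the matrix column
is.matrix(StaticAggregate$StaticScore)    # TRUE
names(StaticAggregate)                    # "ID" "StaticScore"

# Difference computed directly on the matrix column
Staticdiff <- StaticAggregate$StaticScore[, 1] - StaticAggregate$StaticScore[, 2]
```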
The reason that the matrix has no column names is that the aggregate function c does not produce them. If we change the original aggregate to this then they would have names:
StaticAggregate2 <- aggregate(StaticScore ~ ID, Aggregate_Test, setNames, c("A", "B"))
StaticAggregate2
## ID StaticScore.A StaticScore.B
## 1 1 6 5
## 2 2 7 6
## 3 3 7 4
## 4 4 8 5
Now we can write this using the column names of the matrix:
StaticAggregate2$StaticScore[, "A"]
## [1] 6 7 7 8
StaticAggregate2$StaticScore[, "B"]
## [1] 5 6 4 5
Note that there is a significant advantage of the way R's aggregate works as it allows simpler access to the results -- the kth column of the matrix is the kth result of the aggregate function. This is in contrast to having the k+1st column of the data.frame representing the kth result of the aggregate function. This may not seem like much of a simplification here but for more complex problems it can be a significant simplification if you need to access the statistics matrix. Of course, you can always flatten it to 3 columns if you want
do.call(data.frame, StaticAggregate)
but once you think about it for a while you may find that the structure it provides is actually more convenient.
Suppose I have generated a vector using the following statement:
x1 <- rep(4:1, sample(1:100,4))
Now, when I try to count the number of occurrences using the following commands
count(x1)
x freq
1 1 40
2 2 57
3 3 3
4 4 46
or
as.data.frame(table(x1))
x1 Freq
1 1 40
2 2 57
3 3 3
4 4 46
In both cases, the order of occurrence is not preserved. I want to preserve the order of occurrence, i.e. the output should be like this
x1 Freq
1 4 46
2 3 3
3 2 57
4 1 40
What is the cleanest way to do this? Also, is there a way to coerce a particular order?
You are looking for the rle function:
rle(x1)
## Run Length Encoding
## lengths: int [1:4] 12 2 23 52
## values : int [1:4] 4 3 2 1
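To get the rle result into the data frame layout shown in the question, something like the following sketch may work. It uses a fixed vector instead of sample() so the result is reproducible, and the name counts is just illustrative. Note that rle counts consecutive runs, so it matches overall frequencies only when each value occurs in a single contiguous block, as it does with rep().

```r
x1 <- rep(4:1, c(3, 1, 2, 4))   # deterministic example instead of sample()

r <- rle(x1)

# One row per run, in order of appearance
counts <- data.frame(x1 = r$values, Freq = r$lengths)
```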
You can order the table like this:
set.seed(42)
x1 <- rep(4:1, sample(1:100,4))
table(x1)[as.character(unique(x1))]  # index by name, in order of first appearance
# x1
# 4 3 2 1
# 92 93 29 81
One way is to convert your variable to a factor and specify the desired order with the levels argument. From ?table: "table uses the cross-classifying factors to build a contingency table of the counts at each combination of factor levels"; "It is best to supply factors rather than rely on coercion.". So by converting to a factor yourself, you are in charge of the coercion and of the order set by levels.
x1 <- rep(factor(4:1, levels = 4:1), sample(1:100,4))
table(x1)
# x1
# 4 3 2 1
# 90 72 11 16
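The same idea also covers vectors whose values are not grouped into contiguous runs. As a sketch with an invented vector: setting levels = unique(x1) gives the order of first appearance, and listing the levels explicitly forces any other order. The names f, tab and tab_custom are illustrative.

```r
x1 <- c(3, 3, 1, 2, 2, 2, 1)    # invented values, not in contiguous runs

# Levels in order of first appearance
f   <- factor(x1, levels = unique(x1))
tab <- as.data.frame(table(x1 = f))

# A specific order can be forced by listing the levels explicitly
tab_custom <- as.data.frame(table(x1 = factor(x1, levels = c(2, 3, 1))))
```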