How to drop unused labels from a memisc:data.set in R? - r

I want to drop all unused labels from a data.set.
Let's assume this example data.set (a class from the memisc package).
library(memisc)
d <- data.set(a = sample(1:10), b=rep(c(14,72),5))
labels(d$b) <- c('First' = 14, 'no-use' = 33, 'Second' = 72)
The resulting data.set:
Data set with 10 observations and 2 variables
a b
1 4 First
2 1 Second
3 9 First
4 8 Second
5 7 First
6 10 Second
7 5 First
8 3 Second
9 2 First
10 6 Second
As you can see, only two values of b are used, but it has three labels.
> labels(d$b)
Values and labels:
14 'First'
33 'no-use'
72 'Second'
How can I drop the unused label (33) from there? The point is that all unused labels should be dropped, and I don't know in advance which ones are unused. I know how to remove 33 explicitly, but that is not the goal.
For base R data.frames I know the function droplevels(). It would be nice to have something like droplabels().

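This isn't very compact, but you could use the following (note that it simply keeps the first n labels, where n is the number of distinct values, regardless of which values are actually used)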
labels(d$b) <- labels(d$b)[seq_len(length(unique(d$b)))]
update
Your question states you want to drop '72' when it looks like you want to drop '33'. Regardless, the following will drop any unused labels by keeping only those whose value occurs in the data
labels(d$b) <- labels(d$b)[labels(d$b)@values %in% unique(d$b)]
The following will drop all unused labels for all elements of a list
for (i in seq_along(d)) {
if(!is.null(labels(d[[i]]))) {
labels(d[[i]]) <- labels(d[[i]])[labels(d[[i]])@values %in% unique(d[[i]])]
}
}
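Since the question asks for something like droplabels(), the expression above could be wrapped in a small helper. This is only a sketch: droplabels is a hypothetical name, not a memisc function, and it assumes, as the code above does, that labels() returns an object with a values slot and NULL for unlabelled variables.

```r
library(memisc)

# Hypothetical helper (not part of memisc): keep only the labels whose
# value actually occurs in x. Unlabelled variables are returned unchanged.
droplabels <- function(x) {
  lab <- labels(x)
  if (!is.null(lab)) {
    labels(x) <- lab[lab@values %in% unique(x)]
  }
  x
}

# Example data from the question
d <- data.set(a = sample(1:10), b = rep(c(14, 72), 5))
labels(d$b) <- c('First' = 14, 'no-use' = 33, 'Second' = 72)

# Apply the helper to every variable of the data.set
for (i in seq_along(d)) {
  d[[i]] <- droplabels(d[[i]])
}
labels(d$b)
```

Keeping the logic in one function means the same call works for a single variable, droplabels(d$b), and inside the loop.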

Related

Function to recode multiple variables conditional on other variables

I have a dataset with multiple variables. Each question has the actual survey answer and three other characteristics. So there are four variables for each question. I want to specify if Q135_L ==1 , leave Q135_RT as it is, otherwise code it as NA. I can do that with an ifelse statement.
df$Q135_RT <- ifelse(df$Q135_L == 1, df$Q135_RT, NA)
However, I have hundreds of variables and the names are not related. For example, in the picture we can see Q135, SG1_1 and so on. How can I specify for the whole dataset if a variable ends at _L, then for the same variable ending at _RT should remain as it is, otherwise the variable ending at _RT should be coded as NA.
I tried this but it only returns NAs
ifelse(grepl("//b_L" ==1, df), "//b_RT" , NA)
If I understand your problem correctly, you have a data frame whose columns represent survey question variables. Each column name contains two identifiers: a survey question number (134, 135, etc.) and a variable suffix (L, RT, etc.). Because you did not provide a reproducible example, I made a simplified version of your data frame:
set.seed(5)
DF <- data.frame(array(sample(1:4, 24, replace = TRUE), c(4,6)))
colnames(DF) <- c("Q134_L","Q135_L", "Q134_RT", "Q135_RT", "Q_L1", "Q134_S")
DF
# Q134_L Q135_L Q134_RT Q135_RT Q_L1 Q134_S
# 1 2 3 2 3 1 1
# 2 3 1 3 2 4 4
# 3 1 1 3 2 4 3
# 4 3 1 3 3 2 1
What you want is that if Q135_L == 1, leave Q135_RT as it is, otherwise code it as NA. Here is a function that implements this recoding logic:
recode <- function(yourdf, questnums) {
for (k in 1:length(questnums)) {
charnum <- as.character(questnums)
col_end_L_k <- yourdf[grepl("_L\\b", colnames(yourdf)) &
grepl(charnum[k], colnames(yourdf))]
col_end_R_k <- yourdf[grepl("_RT\\b", colnames(yourdf)) &
grepl(charnum[k], colnames(yourdf))]
row_is_1 <- which(col_end_L_k == 1)
col_end_R_k[-row_is_1, ] <- NA
yourdf[, colnames(col_end_R_k)] <- col_end_R_k
}
return(yourdf)
}
This function takes a data frame and a vector of question numbers, and then returns the data frame that has been recoded.
What this function does:
It loops over the question numbers with for.
It uses grepl to find the column whose name contains the selected number and ends in _L.
It does the same for the column whose name ends in _RT.
It uses which to find the rows of the _L column that contain 1.
It keeps the values of the matching _RT column (the one with the same question number) in those rows and sets the values in all other rows to NA.
The result:
recode(DF, 134:135)
# Q134_L Q135_L Q134_RT Q135_RT Q_L1 Q134_S
# 1 2 3 NA NA 1 1
# 2 3 1 NA 2 4 4
# 3 1 1 3 2 4 3
# 4 3 1 NA 3 2 1
Note that the Q_L1 column is not affected because _L is not at the end of that column name.
As for how to define questnums, the question numbers, you just need to create a numeric vector. Examples:
Your questnums are 1 to 200. Then use 1:200 or seq(200), so recode(DF, 1:200).
Your questnums are 1, 3, 134, 135. Then, use recode(DF, c(1, 3, 134, 135)).
You can also assign the question numbers to an object first, such as n = c(25, 135, 145), and then use it: recode(DF, n)
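If the variable names do not share a numeric pattern (the question mentions names such as SG1_1), one option is to derive the pairs from the column names themselves instead of passing question numbers. The following is a minimal sketch under that assumption; recode_by_suffix is a hypothetical name, and the data frame is written out by hand with the same values as the example above so that it does not depend on the random number generator.

```r
# Hypothetical alternative: for every column ending in "_L", blank out the
# column with the same stem ending in "_RT" wherever the "_L" value is not 1.
recode_by_suffix <- function(df) {
  l_cols <- grep("_L$", names(df), value = TRUE)
  for (l_col in l_cols) {
    rt_col <- sub("_L$", "_RT", l_col)
    if (rt_col %in% names(df)) {
      # Rows to keep: _L is 1 (an NA in _L counts as "not 1")
      keep <- !is.na(df[[l_col]]) & df[[l_col]] == 1
      df[[rt_col]][!keep] <- NA
    }
  }
  df
}

DF <- data.frame(
  Q134_L  = c(2, 3, 1, 3),
  Q135_L  = c(3, 1, 1, 1),
  Q134_RT = c(2, 3, 3, 3),
  Q135_RT = c(3, 2, 2, 3),
  Q_L1    = c(1, 4, 4, 2),
  Q134_S  = c(1, 4, 3, 1)
)
out <- recode_by_suffix(DF)
out
```

Because the rows are selected with a logical vector rather than a negative index, a pair in which no row has _L equal to 1 is set to NA in full.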

Using match and apply in R

> df = data.frame(id = 1:5, ch_1 = 11:15,ch_2= 10:14,selection = c(11,13,12,14,12))
> df
id ch_1 ch_2 selection
1 1 11 10 11
2 2 12 11 13
3 3 13 12 12
4 4 14 13 14
5 5 15 14 12
Given this data set I need an additional column that follow the rules:
if selection is one of the two choices (ch_1 and ch_2), return the number of the choice (1 or 2)
if the selection is not of the two choices, return 3
I need a way to do this for every row. For a single row, the following code works just fine, but I can't find a way to use it with apply to run it on each row of a data frame. I am looking for a solution that can be applied to more than just two columns and that runs faster than a traditional loop.
df=df[1,]
if (df$selection %in% df[,paste("ch_",1:2,sep="")]) {
a = which(df[,paste("ch_",1:2,sep="")]==df$selection)
} else {
a = 3
}
# OR
ifelse(df$selection %in% df[,paste("ch_",1:2,sep="")],1,3)
# OR
match(df$selection,df[,paste("ch_",1:2,sep="")])
Compare the vector to the other columns with ==, add a final column which is always TRUE, and then take the index of the first TRUE in each row using max.col
max.col(cbind(df$selection == df[c("ch_1","ch_2")], TRUE), "first")
#[1] 1 3 2 1 3
This should easily extend to n columns then.
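As a sketch of that extension, the choice columns can be picked up by name so the same code handles any number of them. The pattern "^ch_" and the new column name choice are assumptions for illustration; the "no match" code is the number of choice columns plus one.

```r
df <- data.frame(id = 1:5, ch_1 = 11:15, ch_2 = 10:14,
                 selection = c(11, 13, 12, 14, 12))

# All columns whose name starts with "ch_" are treated as choices
choice_cols <- grep("^ch_", names(df), value = TRUE)

# Logical matrix: TRUE where a choice equals the row's selection
hits <- as.matrix(df[choice_cols]) == df$selection

# Append an always-TRUE column so rows without a match fall through to it,
# then take the position of the first TRUE in each row
df$choice <- max.col(cbind(hits, TRUE), ties.method = "first")
df$choice
#[1] 1 3 2 1 3
```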
You could do this with nested ifelse,
with(df, ifelse(selection == ch_1, 1L, ifelse(selection == ch_2, 2L, 3L)))
# [1] 1 3 2 1 3
but I'm rarely fond of nesting them. If this is all you need (and you never need more than two), then this might suffice.
One alternative is using dplyr::case_when,
with(df, dplyr::case_when(selection == ch_1 ~ 1, selection == ch_2 ~ 2, TRUE ~ 3))
and it can be easily used within a dplyr::mutate if you are already using the package.

How to construct and add to a data frame with named columns?

I cannot figure out how to do this without throwing errors. I have a set of column names for my data frame I want to create and add to that looks like this:
x <- c("A", "B", "C")
So, I go down through the loop and I calculate some numerical values in a vector, say:
z <- c(1, 5, 7, 8, 34, 5)
z is the same dimension each time through the loop.
The first time through (or even outside the loop) I want to initialize a data frame by doing something like:
df$x[1] <- z
so I have a data frame that looks like:
A
1 1
2 5
3 7
4 8
5 34
6 5
The next time through the loop I want to add another column to df with a column heading being the second element of x, and a set of new z values. If the data frame has to be completely dimensioned ahead of time, I could calculate variables outside the loop to do this, say, M and N, but these may change from one run to the next.
I cannot seem to figure out how to do this. Suggestions much appreciated.
Try this:
set.seed(1)
#set the column names
x <- c("A", "B", "C")
#create the list that later we will convert to a data.frame
df<-setNames(vector("list",length(x)),x)
#loop to produce the various z
for (i in 1:length(x)) {
#do some stuff to evaluate z
z<-sample(5)
#assign to an element of df
df[[i]]<-z
}
#coerce to a data.frame
df<-as.data.frame(df)
# A B C
#1 2 5 2
#2 5 4 1
#3 4 2 3
#4 3 3 4
#5 1 1 5
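The same idea can be written without an explicit loop. This is a minimal sketch in which compute_z is a hypothetical stand-in for whatever calculation produces z on each pass; the only requirement is that it returns a vector of the same length every time.

```r
x <- c("A", "B", "C")

# Hypothetical stand-in for the real calculation of z in iteration i
compute_z <- function(i) i * c(1, 5, 7, 8, 34, 5)

# One vector per column name, collected in a list
cols <- lapply(seq_along(x), compute_z)

# Name the list elements and convert the list to a data frame in one step
df <- as.data.frame(setNames(cols, x))
df
```

Building the list first and converting once at the end avoids having to dimension the data frame ahead of time.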

Merge two columns into one, delete colnames

I have a table like:
a
n_msi2010 n_msi2011
1 -0.122876 1.818750
2 1.328930 0.931426
3 -0.111653 4.400060
4 1.222900 4.500450
5 3.604160 6.110930
I would like to merge these two columns into one column to obtain (I don't want to keep column names):
a
n_msi2010
1 -0.122876
2 1.328930
3 -0.111653
4 1.222900
5 3.604160
6 1.818750
7 0.931426
8 4.400060
9 4.500450
10 6.110930
When I am using prefabricated data like
x <- cbind(c(1, 2, 3), c(4, 5, 6))
colnames(x)<-c("a","b")
c(t(x))
# 1 4 2 5 3 6
c((x))
# 1 2 3 4 5 6
the column merging works fine. Only in the "a" example it doesn't work and it creates 2 separate vectors. I don't really understand why. Any help? Thanks
It seems like your question is about column versus row order vector creation from a data.frame.
Using t() on a data.frame converts the data.frame to a matrix, and using c() on the matrix removes its dimensions.
With that knowledge, you can try:
# create a vector of values, column by column
c(as.matrix(a)) # you are missing the `as.matrix` in your current approach
# create a vector of values, row by row
c(t(a)) # you already know this works
Other approaches to get the "column by column" result would be:
unlist(a, use.names = FALSE)
stack(a)[, "values"] # add `drop = FALSE` if you want to retain a data.frame
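To illustrate the difference with the data from the question, here is a small sketch. The data frame a is retyped from the table above, and the column name n_msi in the result is an arbitrary choice.

```r
a <- data.frame(
  n_msi2010 = c(-0.122876, 1.328930, -0.111653, 1.222900, 3.604160),
  n_msi2011 = c(1.818750, 0.931426, 4.400060, 4.500450, 6.110930)
)

# c() on a data.frame returns a plain list with one element per column,
# which is why the two columns stay separate
class(c(a))

# Flatten column by column and store the result as a single column
stacked <- data.frame(n_msi = unlist(a, use.names = FALSE))
nrow(stacked)
```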
Not an elegant way, but it can combine two or more columns into one.
n_msi2010 <- 1:5
n_msi2011 <- 6:10
a <- data.frame(n_msi2010, n_msi2011)
vector <- vector()
for (i in 1:dim(a)[2]){
vector <- append(vector, as.vector(a[,i]))
vector
}
You may do
as.matrix(vector) or data.frame(vector)

Remove the rows of data frame whose cells match a given vector

I have a big data frame with varying numbers of columns and rows. I would like to search the data frame for the values of a given vector and remove the rows in which any cell matches one of those values. I'd like to have this as a function because I have to run it on multiple data frames with different numbers of rows and columns, and I would like to avoid for loops.
for example
ff<-structure(list(j.1 = 1:13, j.2 = 2:14, j.3 = 3:15), .Names = c("j.1","j.2", "j.3"), row.names = c(NA, -13L), class = "data.frame")
Remove all rows that have cells containing the values 8, 9 or 10.
I guess I could use ff[ !ff[,1] %in% c(8, 9, 10), ] or subset(ff, !ff[,1] %in% c(8,9,10) )
but in order to remove all the values from the dataset I would have to go through each column (probably with a for loop, something I wish to avoid).
Is there any other (cleaner) way?
Thanks a lot
apply your test to each row:
keeps <- apply(ff, 1, function(x) !any(x %in% 8:10))
which gives a boolean vector. Then subset with it:
ff[keeps,]
j.1 j.2 j.3
1 1 2 3
2 2 3 4
3 3 4 5
4 4 5 6
5 5 6 7
11 11 12 13
12 12 13 14
13 13 14 15
I suppose the apply strategy may turn out to be the most economical but one could also do either of these:
ff[ !rowSums( sapply( ff, function(x) x %in% 8:10) ) , ]
ff[ !Reduce("+", lapply( ff, function(x) x %in% 8:10) ) , ]
Vector addition of logical vectors, (equivalent to any) followed by negation. I suspect the first one would be faster.
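Because the question asks for a reusable function, the column-wise approach could be wrapped as follows. This is a sketch only: drop_rows_with is a hypothetical name, and it uses a logical OR across columns in place of the addition shown above, which gives the same result.

```r
# Hypothetical helper: drop every row in which any cell matches `values`
drop_rows_with <- function(df, values) {
  # One logical vector per column, combined with OR across columns
  hit <- Reduce(`|`, lapply(df, function(col) col %in% values))
  df[!hit, , drop = FALSE]
}

ff <- data.frame(j.1 = 1:13, j.2 = 2:14, j.3 = 3:15)
res <- drop_rows_with(ff, 8:10)
res
```

The drop = FALSE argument keeps the result a data frame even when the input has a single column.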
