Changing columns positions in a data frame without total reassignment - r

I want to swap two columns in a data.frame. I know I could do something like:
dd <- dd[c(1:4, 6:5, 7:10)]
But I find it inelegant, potentially slow (?) and not program-friendly (you need to know length(dd), and even have some cases if the swapped columns are close or not to that value...)
Is there an easy way to do it without reassigning the whole data frame?
dd[2:3] <- dd[3:2]
Turns out to be very "lossy" because the [ <- only concerns the values, and not the attributes. So for instance:
(dd <- data.frame( A = 1:4, Does = 'really', SO = 'rock' ) )
dd[3:2]
dd[2:3] <- dd[3:2]
print(dd)
The column names are obviously not flipped...
Any idea? I could also add a small custom function to my very long list, but grrr... should be a way. ;-)

It's not a single function, but relatively simple:
dd[replace(seq(dd), 2:3, 3:2)]
A SO Does
1 1 rock really
2 2 rock really
3 3 rock really
4 4 rock really

This:
dd[,2:3] <- dd[,3:2]
works, but you have to update the names as well:
names(dd)[2:3] <- names(dd)[3:2]
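If this comes up often, the reordering can be wrapped in a small helper. The following is only a sketch, assuming the columns are identified by name; `swap_cols` is a hypothetical name, not a base R function.

```r
# Hypothetical helper: return `df` with the positions of two named columns exchanged.
swap_cols <- function(df, a, b) {
  idx <- seq_along(df)
  pos <- match(c(a, b), names(df))   # positions of the two columns
  stopifnot(!anyNA(pos))             # both names must exist
  idx[pos] <- rev(pos)               # exchange the two positions
  df[idx]                            # reorder; the names travel with the columns
}

dd <- data.frame(A = 1:4, Does = "really", SO = "rock")
names(swap_cols(dd, "Does", "SO"))
# [1] "A"    "SO"   "Does"
```

Like the replace() one-liner, this returns a reordered copy, so the result still has to be assigned back to dd.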


Generate a new column based on data in multiple columns

I have a dataset from a colleague.
In the dataset we record the location where a given skin problem is.
We record up to 20 locations for the skin problem.
i.e
scaloc1 == 2
scaloc2 == 24
scaloc3 == NA
scalocn......
Would mean the skin problem was in places 2 and 24 and nowhere else
I want to reorganise the data so that instead of being like this it is
face 1/0 torso 1/0 etc
So for example if any of scaloc1 to scalocn contain the value 3 then set the value of face to be 1.
I had previously done this in STATA using:
foreach var in scaloc1 scaloc2 scaloc3 scaloc4 scaloc5 scaloc6 scaloc7 scaloc8 scaloc9 scal10 scal11 scal12 scal13 scal14 scal15 scal16 scal17 scal18 scal19 scal20{
replace facescalp=1 if (`var'>=1 & `var'<=6) | (`var'>=21 & `var'<=26)
}
I feel like I should be able to do this using either a dreaded for loop or possibly something from the apply family?
I tried
dataframe$facescalp <-0
#Default to zero
apply(dataframe[,c("scaloc1","scaloc2","scalocn")],2,function(X){
dataframe$facescalp[X>=1 & X<7] <-1
})
#I thought this would look at location columns 1 to n and if the value was between 1 and 7 then assign face-scalp to 1
But didn't work....
I've not really used apply before but did have a good root around examples here and can't find one which accurately describes my current issue.
An example dataset is available:
https://www.dropbox.com/s/0lkx1tfybelc189/example_data.xls?dl=0
If anything not clear or there is a good explanation for this already in a different answer please do let me know.
If I understand your problem correctly, the easiest way to solve it would probably be the following (this assumes the example data set you provided has been read in and stored as df):
# Add an ID column to identify each patient or skin problem
df$ID <- row.names(df)
# Gather rows other than ID into a long-format data frame
library(tidyr)
dfl <- gather(df, locID, loc, -ID)
# Order by ID
dfl <- dfl[order(dfl$ID), ]
# Keep only the rows where a skin problem location is present
dfl <- dfl[!is.na(dfl$loc), ]
# Set `face` to 1 where `locID` is 'scaloc1' and `loc` is 3
dfl$face <- ifelse(dfl$locID == 'scaloc1' & dfl$loc == 3, 1, 0)
Because you have a lot of conditions that you will need to apply in order to fill the various body part columns, the most efficient route would probably be to create a lookup table and use the match function. There are many examples on SO that describe using match for situations like this.
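A lookup table combined with match could look like the sketch below. The code-to-region mapping is an assumption for illustration: only the face/scalp codes (1 to 6 and 21 to 26) come from the question, and the `torso` codes are invented.

```r
# Assumed mapping from location code to body region
# (only the facescalp codes come from the question; torso is made up).
lookup <- data.frame(
  loc    = c(1:6, 21:26, 7:10),
  region = c(rep("facescalp", 12), rep("torso", 4)),
  stringsAsFactors = FALSE
)

# Toy long-format data in the shape produced by gather().
dfl <- data.frame(ID = c("1", "1", "2"), loc = c(2, 24, 8))

# match() gives the row of `lookup` for each code (NA if the code is unknown).
dfl$region <- lookup$region[match(dfl$loc, lookup$loc)]
dfl$region
# [1] "facescalp" "facescalp" "torso"
```

Adding a new body region then only means adding rows to the table, not writing another condition.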
Very helpful.
I ended up using a variant of this approach
library(tidyr)
library(dplyr)
data_loc <- gather(data, "site", "location", c("scaloc1", "scaloc2", "scaloc3", "scaloc4", "scaloc5", "scaloc6", "scaloc7", "scaloc8", "scaloc9", "scal10", "scal11", "scal12", "scal13", "scal14", "scal15", "scal16", "scal17", "scal18", "scal19", "scal20"))
#Make a single long dataframe
data_loc$facescalp <- 0
data_loc$facescalp[data_loc$location >=1 & data_loc$location <=6] <-1
#These two lines were repeated for each of the eventual categories I wanted
locations <- group_by(data_loc,ID) %>% summarise(facescalp = max(facescalp), upperarm = max(upperarm), lowerarm = max(lowerarm), hand = max(hand),buttockgroin = max(buttockgroin), upperleg = max(upperleg), lowerleg = max(lowerleg), feet = max(feet))
#Generate per individual the maximum value for each category, hence if in any of locations 1 to 20 they had a value corresponding to face then this ends up giving a 1
data <- inner_join(data,locations, by = "ID")
#This brings the data back together
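For comparison, the same 0/1 flag can be built without reshaping, by testing every location column and checking per row whether any of them matched. This is a sketch on made-up data with three location columns; the `"^scal"` pattern is an assumption based on the column names in the question, and the code ranges follow the Stata snippet.

```r
# Toy data: one row per patient, NA means no further location was recorded.
data <- data.frame(
  ID      = 1:3,
  scaloc1 = c(2, 30, NA),
  scaloc2 = c(24, NA, NA),
  scaloc3 = c(NA, 5, NA)
)

loc_cols   <- grep("^scal", names(data), value = TRUE)
face_codes <- c(1:6, 21:26)

# One logical column per location variable; %in% treats NA as "no match".
hits <- sapply(data[loc_cols], function(x) x %in% face_codes)

# A row gets 1 if any of its location columns matched.
data$facescalp <- as.integer(rowSums(hits) > 0)
data$facescalp
# [1] 1 1 0
```

Note that sapply only returns a matrix here because the data has more than one row; a single-row data frame would need extra care.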

R Include lists of Strings in Dataframe

I am trying to create an artificial dataframe of words contributed and deleted by users of Wikipedia for each edit that they make, the end result should look like this:
I created some artificial data to build such a frame, but I'm having problems with the variables "Tokens Added" and "Tokens deleted".
I thought creating them as lists of lists would allow me to include them in dataframes even if the elements do not always have equal length. But apparently that's not the case. Instead, R creates a variable for each individual token. That's not feasible because it would create millions of variables. Here is some code to exemplify:
a <- c(1,2,3)
e <- list(b = as.list(c("a","b")),c = as.list(c(1L,3L,5L,4L)),d = as.list(c(TRUE,FALSE,TRUE)))
DF <- cbind(a,e)
U <- data.frame(a,e)
I would like to have it like this:
Is this possible at all in R with dataframes (I tried searching for answers already but they were either for different questions or too technical for me)? Any help is much appreciated!
You can do exactly what you want if you are willing to use library(tibble):
library(tibble)
a <- c(1,2,3)
e <- list(b = as.list(c("a","b")),c = as.list(c(1L,3L,5L,4L)),d = as.list(c(TRUE,FALSE,TRUE)))
tibble(a,e)
# A tibble: 3 × 2
a e
<dbl> <list>
1 1 <list [2]>
2 2 <list [4]>
3 3 <list [3]>
A tibble or tbl_df will behave just like you are used to with a traditional data.frame but allow you some nice extra functionality like storing lists of various lengths in a column.
I don't think what you want is possible using a vector of lists (as you suggest in your question). This is mainly because you can't create a vector of lists in R (see: How to create a vector of lists in R?)
However, one option (if you really want a data.frame) would be to coerce everything to a character (the most flexible type in R). Something like this might work for you:
e <- c(paste0(c("a","b"),collapse=","), paste0(c(1L,3L,5L,4L), collapse = ","), paste0(c(TRUE,FALSE,TRUE), collapse = ","))
U <- data.frame(a,e, stringsAsFactors = FALSE)
U
# a e
#1 1 a,b
#2 2 1,3,5,4
#3 3 TRUE,FALSE,TRUE
Then you can back out the value of each cell with a split. Something like:
strsplit(U$e, ",")
Thanks for all the suggestions everyone! I think I found a simpler solution though. Just in case anyone else has a similar problem in the future, this is what I did:
a <- c(1,2,3)
b <- c("a","b")
c <- c(1L,3L,5L,4L)
d <- c(TRUE,FALSE,TRUE)
e <- list(b,c,d);e
DF <- data.frame(a,I(e));DF
The I() inhibit function apparently prevents the lists from being converted and the column behaves just like a list of lists as far as I can tell so far. The class of the e column is however not "list" but "AsIs". I don't know whether this might cause problems further down the line, if so, I will update this answer!
EDIT
So it turns out that some functions do not take the AsIs class as input. To convert it back to a useful character string, you can simply use unlist() on every row.
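As a sketch of what that conversion might look like, the following builds the I() list column and then collapses each cell into one string. The column name `e_chr` is made up for this example.

```r
a <- c(1, 2, 3)
e <- list(c("a", "b"), c(1L, 3L, 5L, 4L), c(TRUE, FALSE, TRUE))
DF <- data.frame(a, e = I(e))

# Each cell of the list column keeps its own length and type.
lengths(DF$e)   # 2 4 3
DF$e[[2]]       # 1 3 5 4

# Collapse every cell into a single comma-separated string.
DF$e_chr <- vapply(DF$e, function(x) paste(x, collapse = ","), character(1))
DF$e_chr
# [1] "a,b"             "1,3,5,4"         "TRUE,FALSE,TRUE"
```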
Try this:
cbind(a,lapply(e,function(x) paste(unlist(x),collapse=",")))

R list/table manipulation replacing "for" loop with sapply?

I am attempting in R to just add a simple constant to a column of a table with e.g.
dim(exampletable)
[1] 3900 2
to add a value on the second column, what I do and works is:
newtable <- exampletable
for (i in 1:nrow(newtable)){newtable[i,2] <- exampletable[i,2] + constant}
but this seems a bit overkill. Is there a more elegant way to do it with, say sapply?
Thanks, Johannes
R is vectorised and has very handy syntax for operations that tend to be more verbose in other languages. What you have described is possibly the worst implementation of what you want to do, and pretty much the antithesis of what R is about. Instead, use R's inbuilt vectorisation and live a happy long life!
There are so many ways to do this, but the canonical way (excepting the use of column index integers rather than column names) is:
newtable[,2] <- newtable[,2] + constant
e.g.
df <- data.frame( x = 1:3 )
df$y <- df$x + 1
df
# x y
#1 1 2
#2 2 3
#3 3 4
I recommend reading up on the basics of R. There are several good tutorials on the info page of the r tag.
Try this:
#Dummy data
exampletable <- data.frame(x=runif(3900), y=runif(3900))
#Define new constant
MyConstant <- 10
#Make newtable with MyConstant update
newtable <- exampletable
newtable$y <- newtable$y + MyConstant
These are the basics of the R language; read some manuals.
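To check that the vectorised form gives the same result as the loop from the question, here is a small comparison on made-up data; the table contents and the value of `constant` are placeholders.

```r
exampletable <- data.frame(x = 1:5, y = c(0.5, 1.5, 2.5, 3.5, 4.5))
constant <- 10

# Loop version from the question: one assignment per row.
looped <- exampletable
for (i in seq_len(nrow(looped))) {
  looped[i, 2] <- exampletable[i, 2] + constant
}

# Vectorised version: one assignment for the whole column.
vectorised <- exampletable
vectorised[, 2] <- vectorised[, 2] + constant

identical(looped$y, vectorised$y)
# [1] TRUE
```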

Recoding variables in R using the %in% operator to avoid NAs

I am scoring a psychometric instrument at work and want to recode a few variables. Basically, each question has five possible responses, worth 0 to 4 respectively. That is how they were coded into our database, so I don't need to do anything except sum those. However, there are three questions that have reversed scores (so, when someone answers 0, we score that as 4). Thus, I am "reversing" those ones.
The data frame basically looks like this:
studyid timepoint date inst_q01 inst_q02 ... inst_q20
1 2 1995-03-13 0 2 ... 4
2 2 1995-06-15 1 3 ... 4
Here's what I've done so far.
# Survey Processing
# Find missing values (-9) and confusions (-1), and sum them
project_f03$inst_nmiss <- rowSums(project_f03[,4:23]==-9)
project_f03$inst_nconfuse <- rowSums(project_f03[,4:23]==-1)
project_f03$inst_nmisstot <- project_f03$inst_nmiss + project_f03$inst_nconfuse
# Recode any missing values into NAs
for(x in 4:23) {project_f03[project_f03[,x]==-9 | project_f03[,x]==-1,x] <- NA}
rm(x)
Now, everything so far is pretty fine, I am about to recode the three reversed ones. Now, my initial thought was to do a simple loop through the three variables, and do a series of assignment statements something like below:
# Questions 3, 11, and 16 are reversed
for(x in c(3,11,16)+3) {
project_f03[project_f03[,x]==4,x] <- 5
project_f03[project_f03[,x]==3,x] <- 6
project_f03[project_f03[,x]==2,x] <- 7
project_f03[project_f03[,x]==1,x] <- 8
project_f03[project_f03[,x]==0,x] <- 9
project_f03[,x] <- project_f03[,x]-5
}
rm(x)
So, the five assignment statements just reassign new values, and the loop just takes it through all three of the variables in question. Since I was reversing the scale, I thought it was easiest to offset everything by 5 and then just subtract five after all recodes were done. The main issue, though, is that there are NAs and those NAs result in errors in the loop (naturally, NA==4 returns an NA in R). Duh - forgot a basic rule!
I've come up with three alternatives, but I'm not sure which is the best.
First, I could obviously just move the NA-creating code after the loop, and it should work fine. Pros: easiest to implement. Cons: Only works if I am receiving data with no innate (versus created) NAs.
Second, I could change the logic statement to be something like:
project_f03[!is.na(project_f03[,x]) & project_f03[,x]==4,x] which should eliminate the logic conflict. Pros: not too hard, I know it works. Cons: A lot of extra code, seems like a kludge.
Finally, I could change the logic from
project_f03[project_f03[,x]==4,x] <- 5 to
project_f03[project_f03[,x] %in% 4,x] <- 5. This seems to work fine, but I'm not sure if it's a good practice, and wanted to get thoughts. Pros: quick fix for this issue and seems to work; preserves general syntactic flow of "blah blah LOGIC blah <- bleh". Cons: Might create black hole? Not sure what the potential implications of using %in% like this might be.
EDITED TO MAKE CLEAR
This question has one primary component: Is it safe to utilize %in% as described in the third point above when doing logical operations, or are there reasons not to do so?
The second component is: What are recommended ways of reversing the values, like some have described in answers and comments?
The straightforward answer is that there is no black hole to using %in%. But in instances where I want to just discard the NA values, I'd use which: project_f03[which(project_f03[,x]==4),x] <- 5
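The difference between the three forms is easiest to see on a short vector that contains an NA. A minimal sketch with made-up values:

```r
x <- c(4, NA, 2, 4)

x == 4          # TRUE NA FALSE TRUE: the NA propagates into the index
x %in% 4        # TRUE FALSE FALSE TRUE: NA simply counts as "no match"
which(x == 4)   # 1 4: which() drops NA along with FALSE

# Both NA-free forms are usable as a row index in a data frame assignment.
df <- data.frame(q = x)
df[df$q %in% 4, "q"] <- 0
df[which(df$q == 2), "q"] <- 1
df$q
# [1]  0 NA  1  0
```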
%in% could shorten that earlier bit of code you had:
for(x in 4:23) {project_f03[project_f03[,x]==-9 | project_f03[,x]==-1,x] <- NA}
#could be
for(x in 4:23) {project_f03[project_f03[,x] %in% c(-9,-1), x] <- NA}
Like @flodel suggested, you can replace that whole block of code in your for-loop with project_f03[,x] <- rev(0:4)[match(project_f03[,x], 0:4, nomatch=10)]. It should preserve NA. And there are probably more opportunities to simplify code.
It doesn't answer your question, but should fix your problem:
cols <- c(3,11,16)+3
project_f03[, cols] <- abs(project_f03[, cols]-4)
## or a lot easier (as @TylerRinker suggested); na.rm = TRUE is needed once the NAs have been created:
project_f03[, cols] <- max(project_f03[, cols], na.rm = TRUE) - project_f03[, cols]
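The subtraction can also use the known top of the scale instead of the observed maximum, so the result does not depend on whether anyone actually answered 4. A sketch on made-up scores, assuming the 0 to 4 scale described in the question; the column names are illustrative only:

```r
scores <- data.frame(inst_q03 = c(0, 4, NA, 1),
                     inst_q11 = c(2, NA, 3, 4))
max_score <- 4  # assumed top of the 0 to 4 response scale

# Arithmetic on a data frame is element-wise and leaves NA as NA.
reversed <- max_score - scores
reversed$inst_q03   # 4 0 NA 3
reversed$inst_q11   # 2 NA 1 0
```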

R- Please help. Having trouble writing for loop to lag date

I am attempting to write a for loop which will take subsets of a dataframe by person id and then lag the EXAMDATE variable by one for comparison. So a given row will have the original EXAMDATE and also a variable EXAMDATE_LAG which will contain the value of the EXAMDATE one row before it.
for (i in length(uniquerid))
{
temp <- subset(part2test, RID==uniquerid[i])
temp$EXAMDATE_LAG <- temp$EXAMDATE
temp2 <- data.frame(lag(temp, -1, na.pad=TRUE))
temp3 <- data.frame(cbind(temp,temp2))
}
It seems that I am creating the new variable just fine but I know that the lag won't work properly because I am missing steps. Perhaps I have also misunderstood other peoples' examples on how to use the lag function?
So that this can be fully answered: there are a handful of things wrong with your code. Lucaino has pointed one out. Each time through your loop you are going to create temp, temp2, and temp3 (or overwrite the old ones), and thus you'll be left with only the output of the last time through the loop.
However, this isn't something that needs a loop. Instead you can make use of the vectorized nature of R:
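x <- 1:10
x[c(NA, seq_len(length(x) - 1))]
[1] NA 1 2 3 4 5 6 7 8 9
So if you combine that notion with a library like plyr that splits data nicely you should have a workable solution. If I've missed something or this doesn't solve your problem, please provide a reproducible example.
library(plyr)
# Shift values down one position so each row gets the previous row's value.
# Indexing (rather than c(NA, ...)) keeps the class of Date vectors.
myLag <- function(x) {
x[c(NA, seq_len(length(x) - 1))]
}
# Group by the RID column; uniquerid is only the vector of unique ids.
ddply(part2test, .(RID), transform, EXAMDATE_LAG=myLag(EXAMDATE))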
You could also do this in base R using split or the data.table package using its by= argument.
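A base R version with split might look like the sketch below. The data is made up, only the column names RID and EXAMDATE come from the question, and `lag_one` is a hypothetical helper that takes the previous row's value within each person.

```r
part2test <- data.frame(
  RID      = c(1, 1, 2, 2, 2),
  EXAMDATE = as.Date(c("2020-01-01", "2020-06-01",
                       "2019-03-01", "2019-09-01", "2020-03-01"))
)

# Previous element of x, NA for the first one; indexing keeps the Date class.
lag_one <- function(x) x[c(NA, seq_len(length(x) - 1))]

# Split by person, add the lagged column to each piece, then stack the pieces.
pieces <- lapply(split(part2test, part2test$RID), function(g) {
  g$EXAMDATE_LAG <- lag_one(g$EXAMDATE)
  g
})
result <- do.call(rbind, pieces)
result$EXAMDATE_LAG
# [1] NA           "2020-01-01" NA           "2019-03-01" "2019-09-01"
```

The first exam of each person has no earlier date, so its lag is NA.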
