I have a dataset from a colleague.
In the dataset we record the location where a given skin problem is.
We record up to 20 locations for the skin problem.
i.e.
scaloc1 == 2
scaloc2 == 24
scaloc3 == NA
scalocn......
would mean the skin problem was in locations 2 and 24 and nowhere else.
I want to reorganise the data so that instead of being like this it is
face 1/0 torso 1/0 etc
So for example if any of scaloc1 to scalocn contain the value 3 then set the value of face to be 1.
I had previously done this in STATA using:
foreach var in scaloc1 scaloc2 scaloc3 scaloc4 scaloc5 scaloc6 scaloc7 scaloc8 scaloc9 scal10 scal11 scal12 scal13 scal14 scal15 scal16 scal17 scal18 scal19 scal20{
replace facescalp=1 if (`var'>=1 & `var'<=6) | (`var'>=21 & `var'<=26)
}
I feel like I should be able to do this using either a dreaded for loop or possibly something from the apply family?
I tried
dataframe$facescalp <-0
#Default to zero
apply(dataframe[,c("scaloc1","scaloc2","scalocn")],2,function(X){
dataframe$facescalp[X>=1 & X<7] <-1
})
#I thought this would look at location columns 1 to n and, if the value was at least 1 and below 7, set facescalp to 1
But it didn't work.
I've not really used apply before but did have a good root around examples here and can't find one which accurately describes my current issue.
An example dataset is available:
https://www.dropbox.com/s/0lkx1tfybelc189/example_data.xls?dl=0
If anything not clear or there is a good explanation for this already in a different answer please do let me know.
If I understand your problem correctly, the easiest way to solve it would probably be the following (this uses the example data set you provided, read in and stored as df):
# Add an ID column to identify each patient or skin problem
df$ID <- row.names(df)
# Gather rows other than ID into a long-format data frame
library(tidyr)
dfl <- gather(df, locID, loc, -ID)
# Order by ID
dfl <- dfl[order(dfl$ID), ]
# Keep only the rows where a skin problem location is present
dfl <- dfl[!is.na(dfl$loc), ]
# Set `face` to 1 where `locID` is 'scaloc1' and `loc` is 3
dfl$face <- ifelse(dfl$locID == 'scaloc1' & dfl$loc == 3, 1, 0)
Because you have a lot of conditions to apply in order to fill the various body part columns, the most efficient route would probably be to create a lookup table and use the match function. There are many examples on SO that describe using match for situations like this.
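A minimal sketch of the lookup-table idea, assuming toy data: only the facescalp codes (1-6 and 21-26) come from the question, while the torso codes, the `lookup` table and `dfl_toy` are invented for illustration.

```r
# Hypothetical lookup table: only the facescalp codes (1-6, 21-26) come from
# the question; the torso codes are invented for illustration.
lookup <- data.frame(
  loc  = c(1:6, 21:26, 7:10),
  part = c(rep("facescalp", 12), rep("torso", 4))
)

# Toy long-format data in the shape of dfl (made-up values)
dfl_toy <- data.frame(
  ID  = c("1", "1", "2", "3", "3"),
  loc = c(2, 24, 8, 3, 9)
)

# match() returns, for each location code, its row in the lookup table
dfl_toy$part <- lookup$part[match(dfl_toy$loc, lookup$loc)]

# Cross-tabulate and cap at 1 to get one 0/1 indicator per ID and body part
flags <- (table(dfl_toy$ID, dfl_toy$part) > 0) * 1
flags
```

Adding a body part then only means adding rows to the lookup table, rather than writing another condition.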
Very helpful.
I ended up using a variant of this approach
library(tidyr)  # gather()
library(dplyr)  # group_by(), summarise(), inner_join(), %>%
data_loc <- gather(data, "site", "location", c("scaloc1", "scaloc2", "scaloc3", "scaloc4", "scaloc5", "scaloc6", "scaloc7", "scaloc8", "scaloc9", "scal10", "scal11", "scal12", "scal13", "scal14", "scal15", "scal16", "scal17", "scal18", "scal19", "scal20"))
#Make a single long dataframe
data_loc$facescalp <- 0
data_loc$facescalp[data_loc$location >=1 & data_loc$location <=6] <-1
#These two lines were repeated for each of the eventual categories I wanted
locations <- group_by(data_loc,ID) %>% summarise(facescalp = max(facescalp), upperarm = max(upperarm), lowerarm = max(lowerarm), hand = max(hand),buttockgroin = max(buttockgroin), upperleg = max(upperleg), lowerleg = max(lowerleg), feet = max(feet))
#Generate per individual the maximum value for each category, hence if in any of locations 1 to 20 they had a value corresponding to face then this ends up giving a 1
data <- inner_join(data,locations, by = "ID")
#This brings the data back together
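For comparison, the same indicator can be built in base R without reshaping, which is close to what the original Stata loop does. This is only a sketch on made-up data: `dat` and its values are invented, and the `^scaloc` pattern would need adjusting (or replacing with an explicit vector of names) for columns such as scal10.

```r
# Toy wide data (made-up values) shaped like the question's scaloc columns
dat <- data.frame(
  scaloc1 = c(2, 15, NA),
  scaloc2 = c(24, NA, NA),
  scaloc3 = c(NA, 30, NA)
)

loc_cols   <- grep("^scaloc", names(dat), value = TRUE)
face_codes <- c(1:6, 21:26)   # codes taken from the Stata snippet

# %in% treats NA as "no match", so no special NA handling is needed
hits <- sapply(dat[loc_cols], function(x) x %in% face_codes)

# A row gets 1 if any of its location columns holds a face/scalp code
dat$facescalp <- as.integer(rowSums(hits) > 0)
dat$facescalp
```

The two lines building `hits` and the indicator would be repeated with a different code vector for each body part.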
I am trying to measure political ideology on Twitter (by using Rtweet). I now have a dataframe consisting of +100 politicians user_id's along with two ideal point scores on 'factor 1' and 'factor 2' (both factors have a range of 1-4). It looks like this (called kandidat):
Navne           Faktor 1   Faktor 2
"Politician1"   3.5        1.0
"Politician2"   2.0        4.0
Etc...          X          X
I would then like to detect whether random Twitter users follow one or more of the politicians from my dataset. If they e.g. follow two of the politicians in my dataset - "Politician1" and "Politician2" - I will then assign the user the mean of the two politicians' ideal point scores on each factor. For a Twitter user following these two politicians, that gives factor 1 = (3.5+2.0)/2 = 2.75 and factor 2 = (1.0+4.0)/2 = 2.50.
So I've tried to create a simplified loop including only two journalists from Twitter called 'testusers', who both follow a large share of the politicians in my dataset. The loop should then check whether the respective journalists follow one or more of the politicians: If they follow, then the loop should assign the mean of the values like described above. If not, they should be automatically removed from the dataset. The loop below does run, but unfortunately provides a wrong output (see table below the code):
### loop ###
for(i in 1:ncol(testusers)){
#pick politician1 of dataset
politician1_friends <- get_friends(testusers$Navne[1])
#intersect with candidate data
ids_intersect = intersect(politician1_friends$user_id, kandidat$user_id)
if(length(ids_intersect == 0)){
testusers[i, "anyFriends"] <- FALSE #user has no friends in the politicians df
} else {
#assign values to user based on intersect
politicians_friends = kandidat[kandidat$user_id %in% ids_intersect,]
s1_mean <- mean(politicians_friends$faktor1, na.rm=TRUE)
s2_mean <- mean(politicians_friends$faktor2, na.rm=TRUE)
testusers[i, "faktor1"] <- s1_mean
testusers[i, "faktor2"] <- s2_mean
testusers[i, "anyFriends"] <- TRUE #user has friends in the politicians dataset
}
# etc.
}
The code above gives me this output:
Navne           anyFriends
"Politician1"   FALSE
"Politician2"   NA
The structure of testusers is: structure(list(Navne = c("Politician1", "Politician2"), anyFriends = c(FALSE, NA)), row.names = 1:2, class = "data.frame"). I can't post the whole structure of kandidat, since it's too big, but it's a dataframe consisting of politicians (with all the information from the function look_up(), like user_id, screen_name, text etc.).
So I guess the code needs some minor changes, but I haven't figured them out yet. Ideally the output (df) should consist of "only" three dataframe columns: 1) UserID/Name 2) Faktor1 3) Faktor2?
I think what you want is another data.frame or so containing your users, and their 'scores'. R likes to work with such data frames rather than with lists.
I am now assuming that you have a data.frame containing your politicians and their scores along the two dimensions, as well as a data.frame with the users you're interested in, such as
kandidat <- data.frame(user_id = 1:2, name = c("Politician1", "Politician2"), Faktor1 = c(3.5, 2), Faktor2 = c(1,4))
my_users <- data.frame(name = c("Max", "Mara"))
Now if you want to work with a for-loop, you can do something like
library(rtweet)  # provides get_friends()
find_f <- function(df){
  n <- nrow(df)
  # One slot per user; NA scores mark users who follow none of the politicians
  F1_mean <- rep(NA_real_, n)
  F2_mean <- rep(NA_real_, n)
  anyFriends <- rep(FALSE, n)
  for(i in seq_len(n)){
    # Get the accounts followed by user i
    user_friends <- get_friends(df$name[i])
    # Intersect with our candidate data
    ids_intersect <- intersect(user_friends$user_id, kandidat$user_id)
    if(length(ids_intersect) > 0){
      # Average the scores of the politicians this user follows
      kandidat_friends <- kandidat[kandidat$user_id %in% ids_intersect, ]
      F1_mean[i] <- mean(kandidat_friends$Faktor1, na.rm=TRUE)
      F2_mean[i] <- mean(kandidat_friends$Faktor2, na.rm=TRUE)
      anyFriends[i] <- TRUE # user has friends in the politicians dataset
    }
  }
  df$Faktor1 <- F1_mean
  df$Faktor2 <- F2_mean
  df$anyFriends <- anyFriends
  # Keep only users who follow at least one politician
  return(df[df$anyFriends, ])
}
my_users2 <- find_f(my_users)
This is far from the briefest solution, but I think it is easy to understand. The most important thing is that you work with data.frames rather than lists, which is much easier in R. In each iteration, we get the friends of the user and check whether there is any intersection with the politicians. If not, we assign FALSE to anyFriends for that user, so we can easily filter them out at the end. If there is an intersection, we take the mean of each of the two scores over the selected politicians and assign them to the respective user entry.
No need for the IDEOLOGISCORE list in my opinion. Also, please be aware that I didn't test the code above and it might be that there are typos. Just check whether it works for you :)
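Once the friend lists have been downloaded, the averaging step can also be written without a loop. The sketch below assumes a hypothetical `friends` data frame with one row per (user, followed account) pair, which would have to be built from the get_friends() results; all values are made up.

```r
# Made-up politician scores
kandidat <- data.frame(user_id = c("1", "2"),
                       Faktor1 = c(3.5, 2.0),
                       Faktor2 = c(1.0, 4.0))

# Hypothetical friend lists, one row per (user, followed account) pair
friends <- data.frame(
  user    = c("Max", "Max", "Max", "Mara", "Tom"),
  user_id = c("1", "2", "99", "2", "99")
)

# The inner join keeps only follows that point at a politician, so users
# who follow none of them (Tom) drop out automatically
matched <- merge(friends, kandidat, by = "user_id")

# Mean score per user on both factors
scores <- aggregate(cbind(Faktor1, Faktor2) ~ user, data = matched, FUN = mean)
scores
```

Here Max follows both politicians and gets (3.5 + 2.0) / 2 = 2.75 and (1.0 + 4.0) / 2 = 2.5, while Mara simply inherits the scores of the one politician she follows.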
I have created a while loop that is being executed across a sizable data set. The loop is as such:
i = 1
while(i<=m){
Date = London.Events$start_time[i]
j=1
while(j<=n){
Hotel = London.Hotels$AS400.ID[j]
Day.Zero[i,j] = sum(London.Bookings$No.of.Rooms[London.Bookings$Stay.Date == Date & London.Bookings$Legacy.Hotel.Code == Hotel])
j=j+1
}
i=i+1
}
Where:
m = 9957 #Number of Events
n = 814 #Number of Hotels
Day.Zero = as.data.frame(matrix(0, 9957, 814))
Briefly explained, for each combination of date and hotel (pulled from two other data frames), produce the sum from the column London.Bookings$No.of.Rooms and deposit that into the corresponding row of the matrix.
The loop appears to run without error, however when stopping it after 5 mins+ it is still running and nowhere near complete!
I would like to know how one of the apply family of functions could be used as a replacement here for much faster completion.
Thanks!
Probably,
xtabs(No.of.Rooms ~ Stay.Date + Legacy.Hotel.Code, data = London.Bookings)
gets you something similar to what you want.
Using library dplyr, you can do something like the following (assuming your input data frame has such column names - vaguely interpreted from your code / question):
library(dplyr)
London.Bookings %>% group_by(Legacy.Hotel.Code, Stay.Date) %>% summarise(Total.No.of.Rooms = sum(No.of.Rooms))
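To see how a single aggregation replaces the nested loop, here is a small sketch on made-up bookings. The column names follow the question; the dates, hotel codes and room counts are invented.

```r
# Made-up bookings using the column names from the question
London.Bookings <- data.frame(
  Stay.Date         = c("2015-01-01", "2015-01-01", "2015-01-02", "2015-01-02"),
  Legacy.Hotel.Code = c("H1", "H1", "H1", "H2"),
  No.of.Rooms       = c(2, 3, 1, 4)
)

# One pass over the data: rows are dates, columns are hotels
totals <- xtabs(No.of.Rooms ~ Stay.Date + Legacy.Hotel.Code, data = London.Bookings)

# The value the loop computes for one event date and one hotel (2 + 3 rooms)
totals["2015-01-01", "H1"]
```

Date and hotel combinations with no bookings come out as 0, which matches what sum() returns in the loop for an empty selection.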
I need to select data for 3 variables and place them in a new data frame using a single line of code. The data frame I'm pulling from is Dance, the 3 variables are Lindy, Blues and Contra.
I have this:
Dance$new<-subset(Dance$Type==Lindy, Dance$Type==Blues, Dance$Type==Contra)
Can you tell what I'm doing wrong?
There are a number of ways you can do this, but I'd forget the subset part
danceNew <- Dance[Dance$Type=="Lindy"|Dance$Type=="Blues"|Dance$Type=="Contra",]
If you only want specific columns
danceNew <- Dance[Dance$Type=="Lindy"|Dance$Type=="Blues"|Dance$Type=="Contra",c("Col1", "Col2")]
Alternatively
danceNew <- Dance[Dance$Type %in% c("Blues", "Contra", "Lindy"),]
Again, if you only want specific columns, do the same. The advantage of the final option is that you can pass the values in as a variable, thereby making it more dynamic, e.g.
danceNames <- c("Lindy", "Blues", "Contra")
danceNew <- Dance[Dance$Type %in% danceNames,]
You're mixing up the variables and the data frames.
This should do the trick.
If your initial data frame is called "Dance" and the new data frame is called "Dance.new":
Dance.new <- subset(Dance, Dance$Type=="Lindy" | Dance$Type=="Blues" | Dance$Type=="Contra"); row.names(Dance.new) <- NULL
Note that the conditions are combined with | (or): a row can only have one Type, so combining them with & would return no rows. I like adding the "row.names(Dance.new) <- NULL" line so that the new data frame gets fresh sequential row names instead of keeping the row numbers of the original.
Thanks for your help everyone. This is what ended up working for me.
dancenew<-subset(Dance, Type=="Lindy" | Type== "Blues" | Type=="Contra")
I want to swap two columns in a data.frame. I know I could do something like:
dd <- dd[c(1:4, 6:5, 7:10)]
But I find it inelegant, potentially slow (?) and not program-friendly (you need to know length(dd), and even have some cases if the swapped columns are close or not to that value...)
Is there an easy way to do it without reassigning the whole data frame?
dd[2:3] <- dd[3:2]
turns out to be very "lossy", because `[<-` only replaces the values, not the attributes. So for instance:
(dd <- data.frame( A = 1:4, Does = 'really', SO = 'rock' ) )
dd[3:2]
dd[2:3] <- dd[3:2]
print(dd)
The column names are obviously not flipped...
Any idea? I could also add a small custom function to my very long list, but grrr... should be a way. ;-)
It's not a single function, but relatively simple:
dd[replace(seq(dd), 2:3, 3:2)]
  A   SO   Does
1 1 rock really
2 2 rock really
3 3 rock really
4 4 rock really
This:
dd[,2:3] <- dd[,3:2]
works, but you have to update the names as well:
names(dd)[2:3] <- names(dd)[3:2]
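If this comes up often, the index-reordering idea can be wrapped in a small helper. `swap_cols` is a hypothetical name, not a base R function, and the result still has to be assigned back to the data frame.

```r
# Hypothetical helper: swap two columns, given by name or position,
# by reordering the column index so names and attributes move with the data
swap_cols <- function(df, a, b) {
  idx <- seq_along(df)
  names(idx) <- names(df)
  i <- idx[[a]]
  j <- idx[[b]]
  idx[c(i, j)] <- c(j, i)
  df[unname(idx)]
}

dd <- data.frame(A = 1:4, Does = "really", SO = "rock")
names(swap_cols(dd, "Does", "SO"))
```

Because the helper looks the positions up itself, the caller does not need to know length(dd) or how close the two columns are to each other.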
I deal with a great deal of survey data and the like in my work, and I often have to make various scoring programs that process data on a row-by-row level. For instance, I am dealing with a table right now that contains 12 columns with subscale scores from a psychometric instrument. These will be converted to normalized scores using tables provided by the instrument's creator. Seems straightforward so far.
However, there are four tables - the instrument is scored differently depending on gender and age range. So, for instance, a 14-year-old female and a 10-year-old male get different normalization tables. All of the normalization data is stored in an R data frame.
What I would like to do is write a function which can be applied over rows, which returns a vector looked up from the normalization data. So, something vaguely like this:
converter <- function(rawscores,gender,age) {
if(gender=="Male") {
if(8 <= age & age <= 11) {convertvec <- c(1:12)}
if(12 <= age & age <= 14) {convertvec <- c(13:24)}
}
else if(gender=="Female") {
if(8 <= age & age <= 11) {convertvec <- c(25:36)}
if(12 <= age & age <= 14) {convertvec <- c(37:48)}
}
converted_scores <- rep(0,12)
for(z in 1:12) {
converted_scores[z] <- conversion_table[(unlist(rawscores)+1)[z],
convertvec[z]]
}
rm(z)
return(converted_scores)
}
EDITED: I updated this with the code I actually got to work yesterday. This version returns a simple vector with the scores. Here's how I then implemented it.
mydata[,21:32] <- 0
for(x in 1:dim(mydata)[1]) {
mydata[x,21:32] <- converter(mydata[x,7:18],
mydata[x,"gender"],
mydata[x,"age"])
}
This works, but like I said, I'm given to understand that it is bad practice?
Side note: the reason rawscores+1 is there is that the data frame has a score of zero in the first index.
Fundamentally, the function doesn't seem very complicated, and I know I could just implement it using a loop where I would do for(x in 1:number_of_records), but my understanding is that doing so is poor practice. I had hoped to simply use apply() to do this, like as follows:
apply(X=mydata[,1:12],MARGIN=1,
FUN=converter,gender=mydata[,"gender"],age=mydata[,"age"])
Unfortunately, R doesn't seem to approve of this approach, as it does not iterate through the vectors passed to subsequent arguments, but rather tries to take them as the argument as a whole. The solution would appear to be mapply(), but I can't figure out if there's a way to use mapply() over rows, instead of columns.
So, I guess my questions are threefold. One, is there a way to use mapply() over rows? Two, is there a way to make apply() iterate over arguments? And three, is there a better option out there? I've seen and heard a lot about the plyr package, but I didn't want to jump to that before I fully investigated the options present in Base R.
You could rewrite 'converter' so that it takes vectors of gender and age plus a row index, which you then use to do lookups and assignments to converted_scores using a conversion array and a data array that is just the numeric score columns. There is an additional problem with using apply: it coerces its X argument to a matrix, so every value becomes "character" if a character column such as gender is included. It wasn't clear whether your code normdf[ rawscores+1, convertvec] was supposed to be an array extraction or a function call.
Untested in absence of working example (with normdf, mydata):
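# Column indices into normdf for each gender/age group. Rows are ordered
# Male 8-11, Female 8-11, Male 12-14, Female 12-14 (the genders alternate)
cvt.arr <- matrix(1:48, nrow=4, byrow=TRUE)[c(1,3,2,4), ]
# Assumption: columns 7:18 of mydata hold the 12 numeric raw scores
rawscores <- as.matrix(mydata[, 7:18])
converter <- function(idx, gender, age) {
  gidx <- match(gender, c("Male", "Female"))   # 1 or 2
  aidx <- findInterval(age, c(8, 12, 15))      # 1 for ages 8-11, 2 for 12-14
  ag.idx <- gidx + 2*(aidx - 1)                # row of cvt.arr, 1 to 4
  cvt <- cvt.arr[ag.idx, ]
  # Array extraction: row = raw score + 1, column = that subscale's column
  normdf[cbind(rawscores[idx, ] + 1, cvt)]
}
# mapply() returns one column per respondent, so transpose to one row each
cvt.scores <- t(mapply(converter, 1:NROW(mydata), mydata$gender, mydata$age))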
I'd advise against applying this stuff by row, but would rather apply this by column. The reason is that there are only 12 columns, but there might be many rows.
The following piece of code works for me. There might be better ways, but it might be interesting for you nevertheless.
offset <- with(mydata, 24*(gender == "Female") + 12*(age >= 12))
idxs <- expand.grid(row = 1:nrow(mydata), col = 1:12)
idxs$off <- idxs$col + offset
idxs$val <- as.numeric(mydata[as.matrix(idxs[c("row", "col")])]) + 1
idxs$norm <- normdf[as.matrix(idxs[c("val", "off")])]
converted <- mydata
converted[,1:12] <- matrix(idxs$norm, ncol=12)
The tricky part here is the idxs data frame, which combines all the rest. It has the following columns:
row and col: Position in the original data
off: column in normdf, based on gender and age
val: row in normdf, based on original value + 1
norm: corresponding normalized value
I'll post this here as a first thought, and see whether I can come up with a better answer, either based on joran's comment, or using a three- or four-dimensional array for normdf. Not sure yet.
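The column-offset lookup can be tried out on a toy table. In this sketch the contents and size of `normdf`, the raw score range 0 to 5 and the two respondents are all invented; only the offset rule (12 columns per gender and age block) follows the code above.

```r
# Made-up normalization table: rows are raw scores 0..5 (stored at row score + 1),
# columns 1:48 are the four gender/age blocks of 12 subscales each
normdf <- matrix(seq_len(6 * 48), nrow = 6)

# Two made-up respondents with 12 raw subscale scores each
raw    <- rbind(rep(0, 12), rep(5, 12))
gender <- c("Male", "Female")
age    <- c(9, 13)

# Block offset: 0 = Male 8-11, 12 = Male 12-14, 24 = Female 8-11, 36 = Female 12-14
offset <- 24 * (gender == "Female") + 12 * (age >= 12)

# Matrix indexing: one (row, column) pair per cell of raw, taken column by column
rows <- as.vector(raw) + 1
cols <- as.vector(col(raw)) + rep(offset, times = ncol(raw))
converted <- matrix(normdf[cbind(rows, cols)], nrow = nrow(raw))
```

The first respondent ends up with the first row of columns 1:12 and the second with the last row of columns 37:48, all looked up in one vectorized step rather than row by row.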