Assigning values through a loop - r

I am trying to measure political ideology on Twitter (using rtweet). I now have a dataframe consisting of more than 100 politicians' user_ids along with two ideal point scores, 'factor 1' and 'factor 2' (both factors range from 1 to 4). It looks like this (called kandidat):
Navne          Faktor 1   Faktor 2
"Politician1"  3.5        1.0
"Politician2"  2.0        4.0
Etc...         X          X
I would then like to detect whether random Twitter users follow one or more of the politicians from my dataset. If they follow, e.g., two of the politicians in my dataset, "Politician1" and "Politician2", I will assign the mean of the two politicians' ideal point scores on each factor to the user. A Twitter user following these two politicians would then get factor 1 = (3.5+2.0)/2 = 2.75 and factor 2 = (1.0+4.0)/2 = 2.50.
So I've tried to create a simplified loop including only two journalists from Twitter called 'testusers', who both follow a large share of the politicians in my dataset. The loop should then check whether the respective journalists follow one or more of the politicians: If they follow, then the loop should assign the mean of the values like described above. If not, they should be automatically removed from the dataset. The loop below does run, but unfortunately provides a wrong output (see table below the code):
### loop ###
for(i in 1:ncol(testusers)){
  #pick politician1 of dataset
  politician1_friends <- get_friends(testusers$Navne[1])
  #intersect with candidate data
  ids_intersect = intersect(politician1_friends$user_id, kandidat$user_id)
  if(length(ids_intersect == 0)){
    testusers[i, "anyFriends"] <- FALSE #user has no friends in the politicians df
  } else {
    #assign values to user based on intersect
    politicians_friends = kandidat[kandidat$user_id %in% ids_intersect,]
    s1_mean <- mean(politicians_friends$faktor1, na.rm=TRUE)
    s2_mean <- mean(politicians_friends$faktor2, na.rm=TRUE)
    testusers[i, "faktor1"] <- s1_mean
    testusers[i, "faktor2"] <- s2_mean
    testusers[i, "anyFriends"] <- TRUE #user has friends in the politicians dataset
  }
  # etc.
}
The code above gives me this output:
Navne          anyFriends
"Politician1"  FALSE
"Politician2"  NA
The structure of testusers is: structure(list(Navne = c("Politician1", "Politician2"), anyFriends = c(FALSE, NA)), row.names = 1:2, class = "data.frame"). I can't post the whole structure of kandidat, since it's too big, but it's a dataframe of politicians with all the information from the function look_up() (user_id, screen_name, text etc.).
So I guess the code needs some minor changes, but I haven't figured them out yet. Ideally the output (df) should consist of only three columns: 1) UserID/Name 2) Faktor1 3) Faktor2.

I think what you want is another data.frame or so containing your users, and their 'scores'. R likes to work with such data frames rather than with lists.
I am assuming that you have a data.frame containing your politicians and their scores along the two dimensions, as well as a data.frame with the users you're interested in, such as
kandidat <- data.frame(user_id = 1:2, name = c("Politician1", "Politician2"), Faktor1 = c(3.5, 2), Faktor2 = c(1,4))
my_users <- data.frame(name = c("Max", "Mara"))
Now if you want to work with a for-loop, you can do something like
library(rtweet) # provides get_friends()
find_f <- function(df){
  F1_mean <- c()
  F2_mean <- c()
  anyFriends <- c()
  for(i in seq_len(nrow(df))){
    # get the accounts followed by user i
    user_friends <- get_friends(df$name[i])
    # intersect with our candidate data (user_id column name as used in the question)
    ids_intersect <- intersect(user_friends$user_id, kandidat$user_id)
    if(length(ids_intersect) == 0){
      # user has no friends in the politicians df: store NA so that every
      # vector keeps exactly one entry per user
      F1_mean <- c(F1_mean, NA)
      F2_mean <- c(F2_mean, NA)
      anyFriends <- c(anyFriends, FALSE)
    } else {
      # average the scores of the politicians this user follows
      kandidat_friends <- kandidat[kandidat$user_id %in% ids_intersect, ]
      F1_mean <- c(F1_mean, mean(kandidat_friends$Faktor1, na.rm=TRUE))
      F2_mean <- c(F2_mean, mean(kandidat_friends$Faktor2, na.rm=TRUE))
      anyFriends <- c(anyFriends, TRUE) # user has friends in the politicians dataset
    }
  }
  df$Faktor1 <- F1_mean
  df$Faktor2 <- F2_mean
  df$anyFriends <- anyFriends
  # drop users who follow none of the politicians
  return(df[df$anyFriends, ])
}
my_users2 <- find_f(my_users)
This is far from the shortest solution, but I think it is easy to understand. The most important thing is that you work with data.frames rather than lists; it is much easier in R. In each iteration, we get the friends of the user and check whether there is any intersection with the politicians. If not, we record FALSE in the anyFriends variable for that user, so we can easily filter them out in the end. If there is an intersection, we take the mean of each of the two scores over the selected politicians and assign them to the respective user entry.
No need for the IDEOLOGISCORE list in my opinion. Also, please be aware that I didn't test the code above and it might be that there are typos. Just check whether it works for you :)
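The averaging step itself can also be written without growing vectors inside a loop, once the follow relationships sit in one table. The following is only a sketch on toy data: `follows` is a hypothetical data frame (one row per user and followed account) that would have to be assembled from the get_friends() results, and nothing here has been run against the Twitter API.

```r
# Toy candidate table (scores taken from the question's example)
kandidat <- data.frame(
  user_id = c("1", "2"),
  name    = c("Politician1", "Politician2"),
  Faktor1 = c(3.5, 2.0),
  Faktor2 = c(1.0, 4.0),
  stringsAsFactors = FALSE
)

# Hypothetical follow table: one row per (user, followed account) pair
follows <- data.frame(
  user    = c("Max", "Max", "Mara", "Mara"),
  user_id = c("1", "2", "2", "99"),   # "99" is not a politician
  stringsAsFactors = FALSE
)

# Keep only follows that point at a politician, then average per user.
# merge() is an inner join by default, so users without any match drop out.
matched <- merge(follows, kandidat, by = "user_id")
scores  <- aggregate(cbind(Faktor1, Faktor2) ~ user, data = matched, FUN = mean)
print(scores)
```

Because the join is an inner join, users who follow none of the politicians never appear in `scores`, which matches the "remove them from the dataset" requirement in the question.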

R unique combinations from given ranges quickly and using less system resource

This is a follow up question from here:
https://stackoverflow.com/a/55912086/3988575
I have a dataset like this:
ID=as.character(c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20))
IQ=c(120.5,128.1,126.5,122.5,127.1,129.7,124.2,123.7,121.7,122.3,120.9,122.4,125.7,126.4,128.2,129.1,121.2,128.4,127.6,125.1)
Section=c("A","A","B","B","A","B","B","A","B","A","B","B","A","A","B","B","A","B","B","A")
zz=data.frame(ID,IQ,Section)
zz_new=do.call("rbind", replicate(zz, n=30, simplify = FALSE))
What I would like to do is to match people by the range of their IQ (which was the previous question).
Now, I want to create multiple levels of the ranges. For example one range can be 10 IQ classes: 120-121,121-122,122-123....129-130. Another example is a single IQ class:120-130. All the possible combinations of the above can be obtained by:
IQ_Class=c(120,121,122,123,124,125,126,127,128,129,130)
n = length(IQ_Class)-2
all_combin=expand.grid(replicate(n, 0:1, simplify = FALSE))
all_combin$First=1
all_combin$Last=1
all_combin_new=all_combin[c("First",names(all_combin)[1:(length(names(all_combin))-2)],"Last")] #Reorder columns
all_combin_new = t((apply(all_combin_new,1,function(x)(x*IQ_Class)))) #Multiply by IQ classes
all_combin_new = apply(all_combin_new, 1, function(x) { x[x!=0] })
Note that the final object all_combin_new is a list with one element per set of classes (512 sets in total).
Now what I want to do is to take one class (one element from all_combin_new) and create all the combinations of ID's in that particular IQ class by their section. Save this dataset and take the next class from all_combin_new and repeat the operation.
Starting from the previous answer, I was able to modify the code to consider the combinations by Section by changing the following:
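library(dplyr) # %>%, mutate, group_by, summarize, bind_cols
library(tidyr) # unnest
zz1=vector("list",length(all_combin_new)) # preallocate one slot per set of classes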
for (i in 1:length(all_combin_new)){ #changed this line to run for all combinations in all_combin_new
zz2=all_combin_new[[i]]
zz11=zz_new%>%
mutate(ID=as.character(ID),vec=as.character(cut(IQ,zz2,right=F)))%>%
group_by(vec,Section)%>% #Changed this line
summarize(if(n()>1)list(data.frame(t(combn(ID,2)),stringsAsFactors = F))
else list(data.frame(X1=ID,X2=ID,stringsAsFactors = F)))%>%
unnest()%>%
bind_cols(read.csv(text=gsub("[^0-9,]","",.$vec),h=F))
zz1[[i]]=as.data.frame(zz11)
}
My actual dataset has about 10K observations (as compared to zz_new here) with 20 Sections (leading to 2^18=262144 ranges of IQ, as compared to the length of the all_combin_new list here = 512). This causes two main issues:
a) Time: The speed is extremely slow. Is there a way to increase the speed?
b) Size of objects created: In my tests, even without considering as high a number of combinations, the lists grow too big and the code fails. What alternative approaches could I use here? Note that I also need to do further computations on the list of lists that I obtain here.
Any help will be appreciated. Thanks in advance.
P.S. Please let me know if any part is unclear or if any part of the code has inadvertent errors.
Edit: Now with loop to go through all IQ combos and to include Section as a key on join.
I used the sample data in the linked question. Instead of making a list and looping, this does everything at once.
Note there is a cartesian product, so it may still run into memory issues. If you're having trouble, you can always try data.table as you can have non-equi joins.
library(tidyverse)
zz <- tibble(ID = 1:12,
             IQ = c(120.5,123,125,122.5,122.1,121.7,123.2,123.7,120.7,122.3,120.1,122),
             Section = c("A","A","B","B","A","B","B","A","B","A","B","B"))
IQ_Class <- c(120,122,124,126)
# One row per combination of kept (1) / dropped (0) interior boundaries
IQ_Classes <- data.frame(First = 1,
                         expand.grid(replicate(length(IQ_Class)-2, 0:1, simplify = FALSE)),
                         Last = 1)
IQ_Classes <- IQ_Classes * IQ_Class[col(IQ_Classes)]
IQ_Classes_List <- apply(IQ_Classes, 1, function(x) { x[x!=0] })
all_combos <- lapply(IQ_Classes_List, function(IQs) {
  # bin each IQ using the boundaries of the current combination
  z_cut <- zz %>%
    mutate(cut_range = cut(IQ, IQs, right = FALSE, labels = FALSE))
  # self-join on bin and Section, keeping each pair of IDs only once
  inner_join(z_cut,
             z_cut %>% select(V2 = ID, cut_range, Section),
             by = c('cut_range', 'Section')) %>%
    filter(V2 > ID) %>%
    mutate(Previous_IQ_class = IQs[cut_range],
           Next_Class = IQs[cut_range + 1])
}) %>%
  bind_rows(.id = 'IQ_List')
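To make the binning step concrete, here is a small base-R sketch of the same idea on toy data. It handles one set of boundaries at a time, so each result could be summarised or written to disk before moving on instead of holding everything in memory. The helper name `pairs_for` and the four-row `zz` are made up for illustration.

```r
IQ_Class <- c(120, 122, 124, 126)

# Every subset of the interior boundaries, always keeping the two end points
inner <- IQ_Class[-c(1, length(IQ_Class))]
keep  <- expand.grid(replicate(length(inner), c(FALSE, TRUE), simplify = FALSE))
breaks_list <- lapply(seq_len(nrow(keep)), function(r) {
  c(IQ_Class[1], inner[unlist(keep[r, ])], IQ_Class[length(IQ_Class)])
})

# Toy data: pair IDs that share a Section and fall in the same IQ bin
zz <- data.frame(ID = 1:4,
                 IQ = c(120.5, 123.0, 121.0, 125.0),
                 Section = c("A", "A", "A", "B"))

pairs_for <- function(breaks) {
  bin    <- cut(zz$IQ, breaks, right = FALSE, labels = FALSE)
  groups <- split(zz$ID, list(bin, zz$Section), drop = TRUE)
  # combn() needs at least two IDs; groups of one contribute no pair
  pairs  <- lapply(groups[lengths(groups) > 1], function(ids) t(combn(ids, 2)))
  do.call(rbind, unname(pairs))
}

coarse <- pairs_for(breaks_list[[1]])                    # breaks 120, 126
fine   <- pairs_for(breaks_list[[length(breaks_list)]])  # breaks 120, 122, 124, 126
```

With the single coarse class all three Section A IDs pair up with each other, while with the finest classes only IDs 1 and 3 share a bin.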

Generate a new column based on data in multiple columns

I have a dataset from a colleague.
In the dataset we record the location where a given skin problem is.
We record up to 20 locations for the skin problem.
e.g.
scaloc1 == 2
scaloc2 == 24
scaloc3 == NA
scalocn......
would mean the skin problem was in places 2 and 24 and nowhere else.
I want to reorganise the data so that instead of being like this it is
face 1/0 torso 1/0 etc
So for example if any of scaloc1 to scalocn contain the value 3 then set the value of face to be 1.
I had previously done this in STATA using:
foreach var in scaloc1 scaloc2 scaloc3 scaloc4 scaloc5 scaloc6 scaloc7 scaloc8 scaloc9 scal10 scal11 scal12 scal13 scal14 scal15 scal16 scal17 scal18 scal19 scal20{
replace facescalp=1 if (`var'>=1 & `var'<=6) | (`var'>=21 & `var'<=26)
}
I feel like I should be able to do this using either a dreaded for loop or possibly something from the apply family?
I tried
dataframe$facescalp <-0
#Default to zero
apply(dataframe[,c("scaloc1","scaloc2","scalocn")],2,function(X){
dataframe$facescalp[X>=1 & X<7] <-1
})
#I thought this would look at location columns 1 to n and if the value was between 1 and 7 then assign face-scalp to 1
But didn't work....
I've not really used apply before but did have a good root around examples here and can't find one which accurately describes my current issue.
An example dataset is available:
https://www.dropbox.com/s/0lkx1tfybelc189/example_data.xls?dl=0
If anything not clear or there is a good explanation for this already in a different answer please do let me know.
If I understand your problem correctly, the easiest way to solve it would probably be the following (this uses the example data set you provided, read in and stored as df):
# Add an ID column to identify each patient or skin problem
df$ID <- row.names(df)
# Gather rows other than ID into a long-format data frame
library(tidyr)
dfl <- gather(df, locID, loc, -ID)
# Order by ID
dfl <- dfl[order(dfl$ID), ]
# Keep only the rows where a skin problem location is present
dfl <- dfl[!is.na(dfl$loc), ]
# Set `face` to 1 where `locID` is 'scaloc1' and `loc` is 3
dfl$face <- ifelse(dfl$locID == 'scaloc1' & dfl$loc == 3, 1, 0)
Because you have a lot of conditions to apply in order to fill the various body part columns, the most efficient route would probably be to create a lookup table and use the match function. There are many examples on SO that describe using match for situations like this.
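A minimal sketch of the lookup-table idea follows. The table contents are partly assumptions: codes 1-6 and 21-26 follow the face/scalp rule in the question's Stata code, while the "torso" codes are invented purely for illustration, and `df` is a three-row toy stand-in for the real data.

```r
# Hypothetical lookup table: location code -> body region
lookup <- data.frame(
  code   = c(1:6, 21:26, 7:8),
  region = c(rep("facescalp", 12), rep("torso", 2)),
  stringsAsFactors = FALSE
)

# Toy data in the question's wide layout (NA = no further location recorded)
df <- data.frame(scaloc1 = c(2, 7, 30),
                 scaloc2 = c(24, NA, NA),
                 scaloc3 = c(NA, NA, NA))

# Translate every code to its region; unmatched codes and NA stay NA
codes   <- as.matrix(df)
regions <- matrix(lookup$region[match(codes, lookup$code)], nrow = nrow(codes))

# One 0/1 indicator column per region: 1 if any location maps to that region
for (r in unique(lookup$region)) {
  df[[r]] <- as.integer(rowSums(regions == r, na.rm = TRUE) > 0)
}
print(df)
```

Adding a further body region then only means adding rows to `lookup`, rather than writing another pair of range conditions.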
Very helpful.
I ended up using a variant of this approach
data_loc <- gather(data, "site", "location", c("scaloc1", "scaloc2", "scaloc3", "scaloc4", "scaloc5", "scaloc6", "scaloc7", "scaloc8", "scaloc9", "scal10", "scal11", "scal12", "scal13", "scal14", "scal15", "scal16", "scal17", "scal18", "scal19", "scal20"))
#Make a single long dataframe
data_loc$facescalp <- 0
data_loc$facescalp[data_loc$location >=1 & data_loc$location <=6] <-1
#These two lines were repeated for each of the eventual categories I wanted
locations <- group_by(data_loc,ID) %>% summarise(facescalp = max(facescalp), upperarm = max(upperarm), lowerarm = max(lowerarm), hand = max(hand),buttockgroin = max(buttockgroin), upperleg = max(upperleg), lowerleg = max(lowerleg), feet = max(feet))
#Generate per individual the maximum value for each category, hence if in any of locations 1 to 20 they had a value corresponding to face then this ends up giving a 1
data <- inner_join(data,locations, by = "ID")
#This brings the data back together

Storing hierarchical data as a list of matrices in R

Noob R question here from a Matlab/Python user. I have a dataset with hundreds of different users, each of whom has a unique number of rows of data, and would like to store the data as a list of matrices. So user 1 may have a matrix of 500 rows, user 2 may have a matrix of 250, and so on. This will be used as the inputs for a hierarchical logit with a mixture of normals to explain each user's betas. Column 1 of my dataset is a user id, and the rest of the cols are numerical values.
data <- read.csv("hierarchical_dataset.csv", header=FALSE)
nlgtt = length(table(data[[1]])) # number of users
users = names(table(data[[1]])) # user ids
All good so far, but here is where I'm getting my error:
TV = matrix()
testdata = list()
for (i in 1:nlgtt)
{ TV[i] = matrix(table(data[[1]])[[i]]) # number of rows per user
print(TV[i]) # should equal the below line
print(dim(data[data[[1]] == users[i], ])) # should equal the above line
testdata[i] = data[data[[1]] == users[i], ] # store hierarchically by user
}
When I run the above, the printed values match, so the correct data are being retrieved, but then I get simple repetitions of the users[i] value (the user id) for each testdata[i], and also a number of items to replace is not a multiple of replacement length error. I'm sure this is just a simple formatting issue, but have looked around and failed to turn up anything that resolves my problem. Help appreciated!
You might want to use base::lapply from the apply family functions. Please see the example below.
dataset <- data.frame(user_id = c(1,1,2), variable_a = c(1,2,3))
lapply(unique(dataset$user_id), function(id) dataset[dataset$user_id == id,])
The output is a 2-element list of data frames, one per unique user_id, each holding the user_id and variable_a rows of that user. Wrap the subset in as.matrix() if you need actual matrices.
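As for the error in the question: a data frame is itself a list of columns, so `testdata[i] = <data frame>` tries to put several columns into a single slot, keeps only the first one (the user id column) and warns about the replacement length. Storing the whole object needs `[[i]]`. The sketch below shows both a split() version and a corrected loop on toy data; the column names V1 to V3 are simply what read.csv(header=FALSE) would produce.

```r
# Toy data: column 1 is the user id, remaining columns are numeric
data <- data.frame(V1 = c(10, 10, 20, 30, 30, 30),
                   V2 = c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6),
                   V3 = c(1, 0, 1, 1, 0, 0))

# split() groups the rows by user id; as.matrix() converts each piece
testdata <- lapply(split(data, data[[1]]), as.matrix)

# Number of rows per user
TV <- sapply(testdata, nrow)
print(TV)

# Inside a loop, store a whole object in a list slot with [[i]], not [i]
users   <- unique(data[[1]])
by_loop <- vector("list", length(users))
for (i in seq_along(users)) {
  by_loop[[i]] <- as.matrix(data[data[[1]] == users[i], ])
}
```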

repeated random samples from subsets within data

I am an R novice and have been running into a brick wall with what should be a simple for loop process. Data consists of a list with dimensions of 81161 by 9. The observations are of individuals over time. The current need is to isolate each unique group of observations and randomly extract one of the observation data points. At this stage I have reviewed and attempted a few options, none of which execute properly: first the for loop, then apply.
To give a better idea of the workflow, I have outlined it here. This should be a relatively straightforward split-apply-combine. The apply step is a sample restricted to unique individual_days. To do this the code defines the overall dimensions, then defines the unique values, then sorts and ranks them (the unique individuals_day are set to an ordinal scale and linked back to the original data using the individuals_day as the key). From this point I have attempted two alternatives with the for loops: first using a split by rank to provide DSrank$'1, 2, 3...n' (attempted in example 2), and using the subset seen in example 1. A single sample would then be extracted at random and collated into a sub-dataset. From this point other analysis will be performed.
### example 1: for loop
SDS <- list()
for(i in 1:length(UID)) {
  SDS[[i]] <- sample(nrow(SplitDS$[i]), 1, replace=FALSE)
  SDS[[i]]["Samples"] < i
}
head(SDS)
### example 2: for loop
SDS <- list()
for(i in 1:length(UID)) {
  SubSDID <- subset(DSID, DSrank == 'i', )
  SDS <- sample(nrow(SubSDID), 1, replace=FALSE)
}
head(SDS)
### example 3: apply subset
bootstrap <- lapply(1:length(UID), function(i) {
  samp <- sample(1:nrow(DSID$DSrank, rep = TRUE)
  DSID$DSrank[sampl, ]
}
These have been based off examples I have found through CRAN, stackoverflow, and other R-code search results.
If you have any suggestions, tips, or tricks that you could share it would be greatly appreciated.
MB
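The split-apply-combine workflow described above can be sketched in base R as follows. This is only an illustration on toy data: `DSID` here is a six-row stand-in, and `individual_day` is a hypothetical name for the grouping key.

```r
set.seed(1)  # only to make the toy example reproducible

# Toy stand-in for the real data; `individual_day` is a hypothetical key column
DSID <- data.frame(individual_day = c("a_1", "a_1", "a_1", "b_1", "b_2", "b_2"),
                   value = 1:6)

# split: row indices per unique individual_day
rows_by_group <- split(seq_len(nrow(DSID)), DSID$individual_day)

# apply: draw one row index per group.
# Indexing with sample.int() avoids the sample(x, 1) pitfall, where a single
# number x is treated as 1:x.
picked <- vapply(rows_by_group,
                 function(idx) idx[sample.int(length(idx), 1)],
                 integer(1))

# combine: one sampled row per group
SDS <- DSID[picked, ]
print(SDS)
```

Repeating the draw, for example inside replicate() or lapply(), would give repeated random sub-datasets of the same shape.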

Using mapply() in R over rows, vs. columns

I deal with a great deal of survey data and the like in my work, and I often have to make various scoring programs that process data on a row-by-row level. For instance, I am dealing with a table right now that contains 12 columns with subscale scores from a psychometric instrument. These will be converted to normalized scores using tables provided by the instrument's creator. Seems straightforward so far.
However, there are four tables: the instrument is scored differently depending on gender and age range. So, for instance, a 14-year-old female and a 10-year-old male get different normalization tables. All of the normalization data is stored in an R data frame.
What I would like to do is write a function which can be applied over rows, which returns a vector looked up from the normalization data. So, something vaguely like this:
converter <- function(rawscores,gender,age) {
if(gender=="Male") {
if(8 <= age & age <= 11) {convertvec <- c(1:12)}
if(12 <= age & age <= 14) {convertvec <- c(13:24)}
}
else if(gender=="Female") {
if(8 <= age & age <= 11) {convertvec <- c(25:36)}
if(12 <= age & age <= 14) {convertvec <- c(37:48)}
}
converted_scores <- rep(0,12)
for(z in 1:12) {
converted_scores[z] <- conversion_table[(unlist(rawscores)+1)[z],
convertvec[z]]
}
rm(z)
return(converted_scores)
}
EDITED: I updated this with the code I actually got to work yesterday. This version returns a simple vector with the scores. Here's how I then implemented it.
mydata[,21:32] <- 0
for(x in 1:dim(mydata)[1]) {
  mydata[x,21:32] <- converter(mydata[x,7:18],
                               mydata[x,"gender"],
                               mydata[x,"age"])
}
This works, but like I said, I'm given to understand that it is bad practice?
Side note: the reason rawscores+1 is there is that the data frame has a score of zero in the first index.
Fundamentally, the function doesn't seem very complicated, and I know I could just implement it using a loop where I would do for(x in 1:number_of_records), but my understanding is that doing so is poor practice. I had hoped to simply use apply() to do this, like as follows:
apply(X=mydata[,1:12],MARGIN=1,
FUN=converter,gender=mydata[,"gender"],age=mydata[,"age"])
Unfortunately, R doesn't seem to approve of this approach, as it does not iterate through the vectors passed to subsequent arguments, but rather tries to take them as the argument as a whole. The solution would appear to be mapply(), but I can't figure out if there's a way to use mapply() over rows, instead of columns.
So, I guess my questions are threefold. One, is there a way to use mapply() over rows? Two, is there a way to make apply() iterate over arguments? And three, is there a better option out there? I've seen and heard a lot about the plyr package, but I didn't want to jump to that before I fully investigated the options present in Base R.
You could rewrite 'converter' so that it takes vectors of gender, age, and a row index, which you then use to do lookups and assignments to converted_scores using a conversion array and a data array that is just the numeric score columns. There is an additional problem with using apply: it coerces its X argument to a matrix, so everything becomes "character" if a column of class "character" such as gender is included. It wasn't clear whether your code normdf[ rawscores+1, convertvec] was supposed to be an array extraction or a function call.
Untested in absence of working example (with normdf, mydata):
converter <- function(idx, gender, age) {
  gidx <- match(gender, c("Male", "Female"))
  aidx <- findInterval(age, c(8, 12, 15))  # 1 = ages 8-11, 2 = ages 12-14
  # row of cvt.arr: 1 = Male 8-11, 2 = Female 8-11, 3 = Male 12-14, 4 = Female 12-14
  # the multiplier on aidx is the number of gender categories (2 here)
  ag.idx <- gidx + 2*(aidx - 1)
  cvt <- cvt.arr[ag.idx, ]                 # the 12 normdf columns for this record
  raw <- unlist(mydata[idx, 7:18])         # assumes the raw scores sit in columns 7:18, as in the question's loop
  normdf[cbind(raw + 1, cvt)]              # one (row, column) lookup per subscale
}
cvt.arr <- matrix(1:48, nrow=4, byrow=TRUE)[c(1,3,2,4), ] # reorder rows so the genders alternate
cvt.scores <- mapply(converter, 1:NROW(mydata), mydata$gender, mydata$age) # 12 x n matrix, one column per record
I'd advise against applying this stuff by row, but would rather apply this by column. The reason is that there are only 12 columns, but there might be many rows.
The following piece of code works for me. There might be better ways, but it might be interesting for you nevertheless.
offset <- with(mydata, 24*(gender == "Female") + 12*(age >= 12))
idxs <- expand.grid(row = 1:nrow(mydata), col = 1:12)
idxs$off <- idxs$col + offset
idxs$val <- as.numeric(mydata[as.matrix(idxs[c("row", "col")])]) + 1
idxs$norm <- normdf[as.matrix(idxs[c("val", "off")])]
converted <- mydata
converted[,1:12] <- matrix(idxs$norm, ncol=12) # reshape the long vector back to one row per record
The tricky part here is the idxs data frame, which combines all the rest. It has the following columns:
row and col: position in the original data
off: column in normdf, based on gender and age
val: row in normdf, based on original value + 1
norm: corresponding normalized value
I'll post this here with this first thought, and see whether I can come up with a better answer, either based on joran's comment, or using a three- or four-dimensional array for normdf. Not sure yet.
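The key mechanism is that indexing a matrix with a two-column (row, column) matrix picks out exactly one cell per pair. A scaled-down sketch, assuming a toy `normdf` with 4 blocks of 2 subscales instead of 4 blocks of 12, and made-up scores:

```r
# Toy normalization table: rows = raw score + 1, columns = 4 blocks of 2 subscales
n_sub  <- 2
normdf <- matrix(1:24, nrow = 3, ncol = 4 * n_sub)  # raw scores 0..2

mydata <- data.frame(s1 = c(0, 2), s2 = c(1, 1),
                     gender = c("Male", "Female"), age = c(9, 13))

# Column offset per record; block order is
# Male 8-11, Male 12-14, Female 8-11, Female 12-14
offset <- with(mydata, 2 * n_sub * (gender == "Female") + n_sub * (age >= 12))

raw  <- as.matrix(mydata[, 1:n_sub])
rows <- raw + 1                             # row of normdf
cols <- outer(offset, seq_len(n_sub), `+`)  # column of normdf

# One (row, col) pair per cell of the score matrix, looked up in a single call
converted <- matrix(normdf[cbind(as.vector(rows), as.vector(cols))],
                    nrow = nrow(mydata))
print(converted)
```

No explicit loop over records is needed, which is why this scales with the number of rows better than calling a converter once per record.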