I have a list of responses to 7 questions from a survey, each their own column, and am trying to find the response within the first 6 that is closest (numerically) to the 7th. Some won't be the exact same, so I want to create a new variable that produces the difference between the closest number in the first 6 and the 7th. The example below would produce 0.
s <- c(1,2,3,4,5,6,3)
s <- t(s)
s <- as.data.frame(s)
s
Any help is deeply appreciated. I apologize for not having attempted code as nothing I have tried has actually gotten close.
How about this?
which.min( abs(s[1, 1:6] - s[1, 7]))
I'm assuming you want it generalized somehow, but you'd need to provide more info for that. Or just run it through a loop :-)
EDIT: added the loop from the comment and changed exactly 2 tiny things.
s <- c(1,2,3,4,5,6,3)
t <- c(1,2,3,4,5,6,7)
p <- c(1,2,3,4,5,6,2)
s <- data.frame(s,t,p)
k <- t(s)
k <- as.data.frame(k)
k$t <- NA ### need to initialize the column
for(i in 1:3){
## need to refer to each line of k when populating the t column
k[i,]$t <- which.min(abs(k[i, 1:6] - k[i, 7])) }
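Note that which.min gives the position of the closest response, not the difference itself. If the difference is what you need, a minimal sketch could look like the following. It assumes one respondent per row, with questions 1 to 6 in the first six columns and question 7 in the seventh; closest_diff is a made-up column name.

```r
# One respondent per row: columns 1-6 are the candidates, column 7 is the target.
s <- data.frame(rbind(c(1, 2, 3, 4, 5, 6, 3),
                      c(1, 2, 3, 4, 5, 6, 7),
                      c(1, 2, 3, 4, 5, 6, 2)))

# Subtracting a vector of length nrow(s) from the matrix recycles it down each
# column, so every row is compared against its own 7th value.
abs_diff <- abs(as.matrix(s[, 1:6]) - s[, 7])

# Smallest absolute difference per row (closest_diff is a hypothetical name).
s$closest_diff <- apply(abs_diff, 1, min)
s$closest_diff
```

For the three example rows this gives 0, 1 and 0, and no loop is needed.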
I would like to try out a normalisation method a friend recommended, in which every other column of a df is subtracted from the first column, then from the second column, and so on for each column in turn.
eg:
df <- data.frame(replicate(9,1:4))
x_df_1 <- df[,1] - df[2:ncol(df)]
x_df_2 <- df[,2] - df[c(1, 3:ncol(df))]
x_df_3 <- df[,3] - df[c(1:2, 4:ncol(df))]
# ... and so on, up to the last column:
x_df_last <- df[,ncol(df)] - df[1:(ncol(df)-1)]
As the df has 90 cols, doing this by hand would be terrible (and very bad coding). I am sure there must be an elegant way to solve this and to receive at the end a list containing all the dfs, but I am totally stuck how to get there. I would appreciate a dplyr method (for familiarity) but any working solution would be fine.
Thanks a lot for your help!
Sebastian
I may have found a solution that I am sharing here.
Please correct me if I'm wrong.
This is a permutation without replacement task.
The original df has 90 cols.
Let's first check how many combinations are possible:
(from: https://davetang.org/muse/2013/09/09/combinations-and-permutations-in-r/)
comb_with_replacement <- function(n, r){
return( factorial(n + r - 1) / (factorial(r) * factorial(n - 1)) )
}
comb_with_replacement(90,2) #4095 combinations
Now using a modified answer from here: https://stackoverflow.com/a/16921442/10342689
(df has 90 cols. don't know how to create this proper as an example df here.)
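cc_90 <- combn(colnames(df), 2) # all unordered pairs of column names (m = 2, not 90)
result <- apply(cc_90, 2, function(x) df[[x[1]]]-df[[x[2]]])
dim(result) # nrow(df) x choose(90, 2) = 4005 pairs (4095 also counts each column paired with itself)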
That should work.
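A small check of the pairwise idea on a toy data frame, assuming the goal is one difference per unordered pair of columns (so combn is called with m = 2). The three-column df and the "a-b" style labels are made up for illustration.

```r
# Toy data frame standing in for the 90-column one.
df <- data.frame(a = c(1, 2), b = c(10, 20), c = c(100, 200))

# Each column of `pairs` holds one unordered pair of column names.
pairs <- combn(colnames(df), 2)

# One result column per pair: first column of the pair minus the second.
result <- apply(pairs, 2, function(x) df[[x[1]]] - df[[x[2]]])
colnames(result) <- apply(pairs, 2, paste, collapse = "-")

dim(result)  # nrow(df) rows, choose(ncol(df), 2) columns
```

Each pair appears only once here, so a-b is computed but b-a is not.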
In R one can index using negative indices to represent "all except this index".
So we can re-write the first of your normalization rows:
x_df_1 <- df[,1] - df[2:ncol(df)]
# rewrite as:
x_df_1 <- df[,1] - df[,-1]
From this, it's a pretty easy next step to write a loop to generate the 90 new dataframes that you generated 'by hand':
list_of_dfs=lapply(seq_len(ncol(df)),function(x) df[,x]-df[,-x])
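Here is the same idea as a small runnable sketch. The 3-column toy df and the x_df_ names are only for illustration, and drop = FALSE is added so the result stays a data frame even if only one column is left.

```r
# Toy data frame: V1 = 1:4, V2 = 5:8, V3 = 9:12.
df <- as.data.frame(matrix(1:12, nrow = 4))

# For each column i, subtract every other column from column i.
list_of_dfs <- lapply(seq_len(ncol(df)), function(i) {
  df[, i] - df[, -i, drop = FALSE]
})

# Name the list elements after the reference column (hypothetical naming scheme).
names(list_of_dfs) <- paste0("x_df_", seq_len(ncol(df)))

list_of_dfs$x_df_1
```

Each element has ncol(df) - 1 columns, so for the 90-column df you get a list of 90 data frames with 89 columns each.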
This seems to be somewhat different to what you're proposing in your own answer to your question, though...
Every time I try to calculate this line "DHS <- mean(ahebachelors2008) - mean(ahebachelors1992)" I receive an NA answer. Calculating mean(ahe2008) works but calculating mean(ahebachelors2008) does not work.
setwd("~/Google Drive/R Data")
data <- read.csv('cps92_08.csv')
year <- data$year
year1992 <- subset(data,year<2000)
year2008 <- subset(data,year>2000)
ahe1992 <- (year1992$ahe)
ahe2008 <- (year2008$ahe)
max(ahe1992)
min(ahe1992)
mean(ahe1992)
median(ahe1992)
sd(ahe1992)
max(ahe2008)
min(ahe2008)
mean(ahe2008)
median(ahe2008)
sd(ahe2008)
adjahe <- ahe1992*(215.2/140.3)
max(adjahe)
min(adjahe)
mean(adjahe)
median(adjahe)
sd(adjahe)
D <- mean(ahe2008) - mean(adjahe)
education <- data$bachelor
ahebachelors1992 <- subset(adjahe, education>0)
ahehighschool1992 <- subset(adjahe,education<1)
ahebachelors2008 <- subset(ahe2008,education>0)
ahehighschool2008 <- subset(ahe2008,education<1)
DHS <- mean(ahebachelors2008) - mean(ahebachelors1992)
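education is the same length as data, whereas ahe2008 is only a subset of data. So when you pass education as the condition on ahe2008, the condition is longer than the vector being subset, and every TRUE at a position beyond the end of ahe2008 returns NA. (The positions that are in range generally no longer line up with the right rows either.)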
Here's a simpler example:
d1<-c(1:5)
d2<-c(1:5,1:5)
subset(d1,d2==1)
[1] 1 NA
Possible solutions would be to create separate bachelor vectors for each year, or to not continuously subset but just use multiple conditions where you need them.
If you're trying to avoid typing the full data$something every time, consider using with(), or even better - the dplyr package.
For example, all the code leading up to the last line could be replaced with this (assuming I didn't miss anything):
DHS <- mean(with(data,ahe[year>2000 & bachelor>0])) -
mean(with(data,ahe[year<2000 & bachelor>0]*(215.2/140.3)))
(If you're new to R, note that the [] structure is a simpler way to call on subset).
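To illustrate the "separate vectors for each year" option, here is a minimal sketch on invented data. The six rows below are made up and only mimic the year, ahe and bachelor columns of cps92_08.csv; cpi_adj is a hypothetical name for the 215.2/140.3 adjustment factor.

```r
# Invented stand-in for the real CSV (values are not from the actual data).
data <- data.frame(
  year     = c(1992, 1992, 1992, 2008, 2008, 2008),
  ahe      = c(10, 12, 20, 15, 25, 30),
  bachelor = c(0, 0, 1, 0, 1, 1)
)

cpi_adj <- 215.2 / 140.3  # price adjustment factor taken from the question

# Subset the whole data frame per year, so ahe and bachelor stay aligned.
y1992 <- subset(data, year < 2000)
y2008 <- subset(data, year > 2000)

DHS <- mean(y2008$ahe[y2008$bachelor > 0]) -
  mean(y1992$ahe[y1992$bachelor > 0] * cpi_adj)
DHS
```

Because the condition and the values come from the same subset, they always have the same length and no NAs are introduced.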
You might also want to consider using summary, which will give you min, median, mean, and max, leaving you with just sd to add manually:
summary(with(data,ahe[year>2000]))
If the values you are trying to calculate mean on contain NA then the output will be NA. You can overcome it by adding na.rm = TRUE to your mean:
DHS <- mean(ahebachelors2008, na.rm=TRUE) - mean(ahebachelors1992, na.rm=TRUE)
Given a matrix (m),
I want to remove from it the subjects listed in a changing vector.
I am trying to do this with a loop, but it only removes the last entry:
m= matrix(1:4,10,3);
changing_vector = c(2,1) # or c(1,4), etc.
for(j in 1:length(changing_vector))
{
a = subData[!(subData$subject== changing_vector [j]),]
}
Does anyone know why it does not work? Can you suggest another way to do it?
Thanks in advance for your help,
G.
Always try to post reproducible examples, so that others can see what you are trying to do. Also try to be very precise, as it is sometimes very hard to understand what people want to do (as in your case).
Maybe this can help you with your problem:
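m <- matrix(1:5, 15, 5)
vec <- c(2, 1) # example values to remove, e.g. your changing_vector
z <- logical(nrow(m)) # one flag per row, must exist before the loop
for(i in 1:nrow(m)){
z[i] <- any(m[i,] %in% vec)
}
m <- m[!z,] # keep only the rows that were not flagged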
I appreciate your help. Although it did not solve the issue, here is what I did:
# removing subjects who did not reach a performance > 70 % (for example; it is easier
# to understand this way)
subjectsTOremove= which(performance<70)
vector_poz = c();
for(j in 1:length(subjectsTOremove))
{
S_to_remove= subjectsTOremove[j]
a = data[!(data$subject== S_to_remove),]
aa = which(data$subid == subjectsTOremove[j])
vector_poz = c(vector_poz,aa)
}
# then these subjects' rows are set to NaN and the NaN rows removed
data[vector_poz,]=NaN # this transformation allows checking the data visually
data= data[complete.cases(data),]
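For reference, the loop in the question keeps only the last removal because each iteration filters the original subData again and overwrites a. A loop-free sketch with %in% is shown below; the toy subData, its score column and the name kept are made up for illustration.

```r
# Toy data: 4 subjects with 2 rows each (invented for illustration).
subData <- data.frame(subject = rep(1:4, each = 2), score = 1:8)
changing_vector <- c(2, 1)

# %in% tests every row against the whole vector at once,
# so all listed subjects are dropped in a single step.
kept <- subData[!(subData$subject %in% changing_vector), ]
kept
```

The same line works unchanged for c(1,4) or any other vector of subject ids.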
The question is: Create a function that takes in a numeric vector. The output should be a vector with running mean values. The i-th element of the output vector should be the mean of the values in the input vector from 1 to i.
My main problem is in the for loop, which is as follows:
x1 <- c(2,4,6,8,10)
for (i in 2: length(x1)){
ma <- sum(x1[i-1] , x1[i]) / i
print(ma)
mresult <- rbind(ma)
}
View(ma)
I know there must be something wrong in it. But I am just not sure what it is.
As you have noticed, there are more efficient ways to achieve what you are trying to do using already available functions and packages. But here is how you would go about fixing your loop:
x1 <- c(2,4,6,8,10)
mresult = numeric(0) #Initiate mresult as an empty vector
for (i in 1: length(x1)){ #Start at 1 so the first running mean (x1[1] itself) is included
ma <- sum(x1[1:i])/i #You were originally dividing the sum of (i-1)th and ith value by i
print(ma) #This is optional
mresult <- c(mresult,ma) #Since you have only an array, there is no need to rbind
}
View(ma) #The last computed average
View(mresult) #All averages
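Since the task asks for a function, here is one possible loop-free sketch using cumsum. The name running_mean is just an illustration, not something from the original exercise.

```r
# Running mean: element i is the mean of x[1], ..., x[i].
running_mean <- function(x) {
  stopifnot(is.numeric(x))
  # cumsum(x)[i] is the sum of the first i values; seq_along(x)[i] is i.
  cumsum(x) / seq_along(x)
}

running_mean(c(2, 4, 6, 8, 10))
# 2 3 4 5 6
```

The output has the same length as the input, with the first element equal to x[1].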
I have a set of sequences with 2 variables for each value of a 3rd variable (device). Now I want to break the sequence for each device into sets of 300. dsl is a data frame that contains d, the device id, and s, the number of sequences of length 300.
First, I am labelling (column Sid) all the sequences rep(1,300) followed by rep(2,300) and so on till rep(s,300). Whatever remains unlabelled i.e. with initialized labels(=0) needs to be ignored. The actual labelling happens with seqid vector though.
I had to do this as I want to stack the sets of 300 data points and then transpose them. This would form one row of my predata data.frame. For each predata data frame I am doing a k-means to generate 5 clusters that I am storing in finaldata.
Essentially for every device I will have 5 clusters that I can then pull by referencing the row number in final data (mapped to device id).
#subset processed data by device
for (ds in 1:387){
d <- dsl[ds,1]
s <- dsl[ds,3]
temp.data <- subset(data,data$Device==d)
temp.data$Sid <- 0
temp.data[1:(s*300),4] <- rep(1:300,s)
temp.data <- subset(temp.data,temp.data$Sid!="0")
seqid <- NA
for (j in 1:s){ seqid[(300*(j-1)+1):(300*j)] <- j }
temp.data$Sid <- seqid
predata <- as.data.frame(matrix(numeric(0),s,600))
for(k in 1:s){
temp.data2 <- subset(temp.data[,c(1,2)], temp.data$Sid==k)
predata[k,] <- t(stack(temp.data2)[,1])
}
ob <- kmeans(predata,5,iter.max=10,algorithm="Hartigan-Wong")
finaldata <- rbind(finaldata,(unique(fitted(ob,method="centers"))))
}
Being a noob to R, I ended up with 3 nested loops (the function did work for the outermost loop being one value). This has taken 5h and running. Need a faster way to go about this.
Any help will be appreciated.
Thanks
Ok, I am going to suggest a radical simplification of your code within the loop. However, it is hard to verify that I in fact did assume the right thing without having sample data. So please ensure that my predata in fact equals yours.
First the code:
for (ds in 1:387){
d <- dsl[ds,1]
s <- dsl[ds,3]
temp.data <- subset(data,data$Device==d)
temp.data <- temp.data[1:(s*300),]
predata <- cbind(matrix(temp.data[,1], byrow=TRUE, ncol=300), matrix(temp.data[,2], byrow=TRUE, ncol=300))
ob <- kmeans(predata,5,iter.max=10,algorithm="Hartigan-Wong")
finaldata <- rbind(finaldata,(unique(fitted(ob,method="centers"))))
}
What I understand you are doing: Take the first 300*s elements from your subset(data, data$Device == d). This might easily be done using the command
temp.data <- temp.data[1:(s*300),]
Afterwards, you collect a matrix that has the first row c(temp.data[1:300, 1], temp.data[1:300, 2]), and so on for all further rows. I do this using the matrix command as above.
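To see what the matrix call does, here is the same reshaping on a tiny example, with chunks of 3 instead of 300. The data frame and the name chunk are invented for illustration.

```r
chunk <- 3  # stands in for 300
s <- 2      # number of complete chunks for this device

# 7 rows, so the last row does not fill a chunk and is dropped.
temp.data <- data.frame(a = 1:7, b = 11:17)
temp.data <- temp.data[1:(s * chunk), ]

# byrow = TRUE puts each consecutive chunk into its own row;
# cbind then appends the second variable's chunk to the first variable's.
predata <- cbind(matrix(temp.data[, 1], byrow = TRUE, ncol = chunk),
                 matrix(temp.data[, 2], byrow = TRUE, ncol = chunk))
predata
```

Row 1 comes out as 1 2 3 11 12 13 and row 2 as 4 5 6 14 15 16, which is the layout the stack/transpose loop builds one row at a time.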
I assume that your outer loop could be transformed into a call to tapply or something similar, but for that we would need more context.