I have a list of row numbers identifying the rows that contain outliers in a data set. I would like to add an "outlier" column to the original data set that flags those rows, but I can't figure out how to use row numbers as a criterion in R.
Example:
I have a dataframe like this:
id <-c("a","b","c","d")
values <-c(10,11,22,33)
df<-data.frame(id,values)
id values
1 a 10
2 b 11
3 c 22
4 d 33
And a list like this containing row numbers (more correctly, "row names"):
outliers <-c(2,4)
I'd like to find a way to use the list of row numbers as criteria in something like:
df$outlier_test<-ifelse( if row number is on my list, "outlier","")
to produce something like this:
id values outlier_test
1 a 10
2 b 11 outlier
3 c 22
4 d 33 outlier
Spent quite a while trying to puzzle this out and had inspiration as soon as I posted the question. For anyone else who comes here with this question:
First:
df$rownumber<- row.names(df)
then:
df$outlier_test<- ifelse(df$rownumber %in% outliers,"outlier","")
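The same flag can probably be built without the helper column. The sketch below assumes the numbers in outliers are row positions (1 to nrow(df)); if the data frame has been subset or reordered, row names and positions can differ, and matching on row.names(df) as above would be the safer choice.

```r
# Rebuild the example data from the question
df <- data.frame(id = c("a", "b", "c", "d"), values = c(10, 11, 22, 33))
outliers <- c(2, 4)

# seq_len(nrow(df)) gives the row positions 1..n; %in% tests each one
# against the outlier list
df$outlier_test <- ifelse(seq_len(nrow(df)) %in% outliers, "outlier", "")
print(df)
```

Replacing "outlier" and "" with TRUE and FALSE (that is, dropping ifelse altogether) gives a logical column, which is often easier to filter on later.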
I would like to use table() in R to count how many times each value occurs in a column. However, some cells contain several values separated by colons. An example is below:
example <- data.frame(c("A","B","A:::B"))
table(example)
the result is:
A A:::B B
1 1 1
but I want something like this:
A B
2 2
I tried duplicating the rows that have this characteristic, but the dataset is already very large and duplicating rows makes it impossible to use. How can I do this?
thanks
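We can split the column values on the colons and tabulate the result. The as.character() call guards against the column being a factor, which is the default in R versions before 4.0:
table(unlist(strsplit(as.character(example[[1]]), ":+")))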
# A B
# 2 2
I have a dataframe composed of several paired columns. For example, the first column is a list of names and the second column contains numeric values quantifying the variables in the first column. The third column is again a list of names and the fourth column is numeric and quantifies the variables in the third column, and so on.
I now want to automatically subset the first two columns to make a separate dataframe and the third-fourth columns to make a second dataframe. The final aim is to align the rows by name.
For example, from dataframe a
names_a<-c("a","b","c","d")
values_a<-c(1,2,3,4)
names_b<-c("a","b","e","f")
values_b<-c(5,6,7,8)
a<-as.data.frame(cbind(names_a,values_a,names_b,values_b))
I would like to obtain one dataframe containing names_a and values_a and another containing names_b and values_b, then align them to get dataframe a1:
names_a1<-c("a","b","c","d","e","f")
values_a1<-c(1,2,3,4,0,0)
values_b1<-c(5,6,0,0,7,8)
a1<-as.data.frame(cbind(names_a1,values_a1,values_b1))
Any suggestion?
Thanks in advance for any help
I can help with the first part of your request. Here is how to create the separate data frames.
names_a<-c("a","b","c","d")
values_a<-c(1,2,3,4)
names_b<-c("a","b","e","f")
values_b<-c(5,6,7,8)
a<-as.data.frame(cbind(names_a,values_a,names_b,values_b))
# Here the split is by variables (columns) rather than by observations (rows), so you can create 2 new data frames out of the existing one.
# a34 contains variables 3 and 4
a34 <- data.frame(cbind(as.vector(a$names_b),as.vector(a$values_b)))
colnames(a34) <-c("names_b","values_b")
#then "subset" a (in fact you create a new one and replace it)
a <- data.frame(cbind(as.vector(a$names_a),as.vector(a$values_a)))
colnames(a) <-c("names_a","values_a")
This results in:
> a
names_a values_a
1 a 1
2 b 2
3 c 3
4 d 4
> a34
names_b values_b
1 a 5
2 b 6
3 e 7
4 f 8
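For the second part of the request (aligning the rows by name), one possible approach is a full outer join with merge(). This is only a sketch: the names a_part and b_part and the shared key column name are made up for illustration, and the values are built as numbers directly, because cbind() on mixed vectors turns everything into character.

```r
# The two halves, built directly so the values stay numeric
a_part <- data.frame(name = c("a", "b", "c", "d"), values_a = c(1, 2, 3, 4),
                     stringsAsFactors = FALSE)
b_part <- data.frame(name = c("a", "b", "e", "f"), values_b = c(5, 6, 7, 8),
                     stringsAsFactors = FALSE)

# all = TRUE keeps names that appear in only one of the two data frames
a1 <- merge(a_part, b_part, by = "name", all = TRUE)

# Names missing from one side come back as NA; the question wants 0 instead
a1$values_a[is.na(a1$values_a)] <- 0
a1$values_b[is.na(a1$values_b)] <- 0
print(a1)
```

With more than two pairs of columns, the same merge() call could be applied repeatedly, for example with Reduce() over a list of two-column data frames.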
Here is my question.
I have a dataframe with 30 rows (corresponding to 30 questions in a questionnaire) with values from 1 to 5 as answers.
I would like to sum all the values equal to 1 that appear in the 30 rows.
I tried the aggregate command, but it doesn't work.
The question could use more clarity, and code would help, but I will give you a theoretical example of what I believe you are asking for.
If you have a data frame df such that:
questions ob1 ob2 ob3
q1 5 3 1
q2 2 1 1
q3 4 1 5
and you want to add up all the values equal to an answer of 1, you have a number of options, but the most obvious is simply to subset with a logical condition:
sumob1 <- sum(df$ob1[df$ob1 == 1])
Note that there is no comma inside the []: df$ob1 is already a single column (a vector), so it takes only one index. A comma is needed only when indexing the data frame itself, as in df[rows, columns].
This basically says: make sumob1 equal to the sum of the cells in column ob1 that have a value of 1.
You can do that for each column.
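Rather than repeating this column by column, the comparison can probably be applied to all answer columns at once. The sketch below uses the small example table from this answer; since every matching value is 1, summing those values is the same as counting them.

```r
# The example table from above
df <- data.frame(
  questions = c("q1", "q2", "q3"),
  ob1 = c(5, 2, 4),
  ob2 = c(3, 1, 1),
  ob3 = c(1, 1, 5)
)

# Keep only the answer columns, then compare every cell with 1
answers <- df[, c("ob1", "ob2", "ob3")]
ones_per_column <- colSums(answers == 1)  # number of 1s in each column
total_ones <- sum(answers == 1)           # number of 1s in the whole table
print(ones_per_column)
print(total_ones)
```

rowSums(answers == 1) would give the same count per question instead of per column.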
I would like to create a subset of a large data frame. I would like to select one row with each value for column 1 "class", based on having the lowest number for column 2 "random number".
For example, rows 1, 2, and 3 all have the value 2 in column 1, and I would like to keep row 3 as it has the lowest random number (3.456456). For this sample I would like to subset rows 3, 4, 7, 8, 9, 10, and 12.
My dataset has over 10,000 rows, so is there a way of coding for this? I'm using RStudio.
Thanks very much,
Class Random_number Score_1 Score_2 Score_3
2 5.575475 0.78464 0.747847 0.6746464
2 7.738382 0.73273 0.747474 0.6734652
2 3.456456 0.78464 0.747847 0.6746464
3 6.939399 0.23363 0.123555 0.6476384
4 10.99993 0.66654 0.565757 0.6565633
4 6.894898 0.54295 0.825264 0.2357674
4 5.575475 0.78464 0.747847 0.6746464
5 3.738382 0.73273 0.747474 0.6734652
6 3.456456 0.78464 0.747847 0.6746464
7 6.932119 0.23363 0.123555 0.6476384
7 17.11993 0.66654 0.565757 0.6565633
8 6.895898 0.54295 0.825264 0.2357674
Try ordering the data set by random number:
data<-data[order(data$Random_number),]
Then keep only the first row for each value of Class by dropping the duplicates:
data<-subset(data, !duplicated(Class))
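As a check, here is the same idea run on the sample rows from the question (only the first two columns are reproduced). This sketch orders by Class first and then by Random_number, which is a small variation on the steps above so that the result stays grouped by class; the names ordered and lowest are made up for illustration.

```r
# First two columns of the sample data from the question
data <- data.frame(
  Class = c(2, 2, 2, 3, 4, 4, 4, 5, 6, 7, 7, 8),
  Random_number = c(5.575475, 7.738382, 3.456456, 6.939399, 10.99993,
                    6.894898, 5.575475, 3.738382, 3.456456, 6.932119,
                    17.11993, 6.895898)
)

# Sort by class, and within each class by random number (smallest first)
ordered <- data[order(data$Class, data$Random_number), ]

# duplicated() marks every repeat of a Class value after the first, so
# negating it keeps the row with the lowest random number per class
lowest <- ordered[!duplicated(ordered$Class), ]
print(lowest)
```

The row names of lowest still refer to the original row positions, which makes it easy to confirm which rows were kept.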
I printed out the summary of a column variable like this:
summary(document$subject)
A, B, C, D, E, F, ... are the subjects in one column of a data.frame, where each subject appears many times in the column, and the summary above shows the number of times (frequency) each subject appears in the file. The term "OTHER" refers to the subjects that appear only once in the file; I also need to assign "1" to these subjects.
There are so many different subjects that it would be difficult to list them all by hand with c().
I want to build up a new column (or data.frame) and then assign these corresponding numbers (scores) to the subjects. Ideally, it will become this in the file:
A 198
B 113
C 96
D 69
A 198
E 65
F 62
A 198
C 113
BZ 21
BC 1
CJ 1
...
I wonder what command I should use to take the scores/values from the summary table and then build a new column to assign these values to the corresponding subjects in the file.
Plus, since it's a summary table printed by R, I don't know how to build it into a table in a file, or take out the values and subject names from the table. I also wonder how I could find out the subject names which appeared only once in the file, so that the summary table added them up into "OTHER".
Your question is hard to interpret without a reproducible example. Please take a look at this thread for tips on how to do that:
How to make a great R reproducible example?
Having said that, here is how I interpret your question. You have two data frames, one with a score per subject and another with the subjects multiple times in a column:
Sum <- data.frame(subject=c("A","B"),score=c(1,2))
foo <- data.frame(subject=c("A","B","A"))
> Sum
subject score
1 A 1
2 B 2
> foo
subject
1 A
2 B
3 A
You can then use match() to match the subjects in one data frame to the other and create the new variable in the second data frame:
foo$score <- Sum$score[match(foo$subject, Sum$subject)]
> foo
subject score
1 A 1
2 B 2
3 A 1
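If the scores are simply the frequencies of each subject, the lookup table may not need to be typed in at all: table() can compute it from the column itself. The sketch below assumes that interpretation and uses a made-up document data frame; the names freq and singletons are hypothetical.

```r
# Hypothetical stand-in for the real file
document <- data.frame(
  subject = c("A", "B", "A", "C", "A", "B", "BC"),
  stringsAsFactors = FALSE
)

# Frequency of every subject, including the ones summary() lumps together
freq <- table(document$subject)

# Look up each row's subject in the frequency table by name
document$score <- as.integer(freq[document$subject])

# Subjects that appear exactly once
singletons <- names(freq)[freq == 1]
print(document)
print(singletons)
```

as.data.frame(freq) turns the frequency table into an ordinary two-column data frame, which could then be written to a file with write.csv().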