Subsetting to remove rows only removes one column's values - r

I have data that looks something like this:
name age profit
Ann -3 10
Ann -2 5
Ann 1 23
Ann 2 15
Josh -2 12
Josh -1 34
Josh 0 1
Josh 1 21
Josh 2 26
I want to remove those rows for which age is negative.
After using
subset(profitData,age>0,select=c(name,age,profit))
I get this:
name age profit
Ann 1 10
Ann 2 5
Ann 3 23
Ann 4 15
Josh 1 12
Josh 2 34
Josh 3 1
Josh 4 21
Josh 5 26
So, only the values from the age column are removed but not the entire row.
Any suggestions?

It seems like this will be OK:
profitData[profitData$age>0,]

To answer your specific subset question...
There's something hinky going on. I run your code and get your desired output. Start a clean session maybe:
profitData <- read.table(text="name age profit
Ann -3 10
Ann -2 5
Ann 1 23
Ann 2 15
Josh -2 12
Josh -1 34
Josh 0 1
Josh 1 21
Josh 2 26", header=TRUE)
subset(profitData,age>0,select=c(name,age,profit))
## > subset(profitData,age>0,select=c(name,age,profit))
## name age profit
## 3 Ann 1 23
## 4 Ann 2 15
## 8 Josh 1 21
## 9 Josh 2 26
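A minimal sketch, assuming the data frame is built exactly as above, that compares bracket indexing with subset(). The names by_bracket and by_subset are only illustrative:

```r
# Rebuild the example data from the question
profitData <- read.table(text = "name age profit
Ann -3 10
Ann -2 5
Ann 1 23
Ann 2 15
Josh -2 12
Josh -1 34
Josh 0 1
Josh 1 21
Josh 2 26", header = TRUE)

# Two ways to keep only the rows with a positive age
by_bracket <- profitData[profitData$age > 0, ]
by_subset  <- subset(profitData, age > 0)

# Both keep whole rows (original rows 3, 4, 8 and 9)
print(by_bracket)
print(all.equal(by_bracket, by_subset))
```

One difference worth knowing: if age contained NA values, bracket indexing would return all-NA rows for them, whereas subset() drops those rows.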

Related

Get the average of the values of one column for the values in another

I was not sure how to phrase this question. I am trying to find the average tone when an initiative is mentioned, and additionally when a topic and a goal (or achievement) are mentioned. My data frame (df) contains many mentions of 70 initiatives: it has 500+ rows of data, but only 70 distinct initiatives.
My data looks like this
> tabmean
Initiative Topic Goals Achievements Tone
1 52 44 2 2 2
2 294 42 2 2 2
3 103 31 2 2 2
4 52 41 2 2 2
5 87 26 2 1 1
6 52 87 2 2 2
7 136 81 2 2 2
8 19 7 2 2 1
9 19 4 2 2 2
10 0 63 2 2 2
11 0 25 2 2 2
12 19 51 2 2 2
13 52 51 2 2 2
14 108 94 2 2 1
15 52 89 2 2 2
16 110 37 2 2 2
17 247 25 2 2 2
18 66 95 2 2 2
19 24 49 2 2 2
20 24 110 2 2 2
I want to find the mean Tone when an Initiative is mentioned, as well as the mean Tone when an Initiative, a Topic and a Goal are mentioned at the same time. Tone is coded as positive (1), neutral (2), negative (3), and both positive and negative (4). Goals and Achievements are coded yes (1) and no (2).
I have used this code:
GoalMeanTone <- tabmean %>%
group_by(Initiative,Topic,Goals,Tone) %>%
summarize(averagetone = mean(Tone))
which gives this output:
GoalMeanTone
# A tibble: 454 x 5
# Groups: Initiative, Topic, Goals [424]
Initiative Topic Goals Tone averagetone
<chr> <chr> <chr> <chr> <dbl>
1 0 104 2 0 NA
2 0 105 2 0 NA
3 0 22 2 0 NA
4 0 25 2 0 NA
5 0 29 2 0 NA
6 0 30 2 1 NA
7 0 31 1 1 NA
8 0 42 1 0 NA
9 0 44 2 0 NA
10 0 44 NA 0 NA
# ... with 444 more rows
Note that an Initiative value of 0 means "other initiative".
I've also tried this code:
library(plyr)
GoalMeanTone2 <- ddply( tabmean, .(Initiative), function(x) mean(tabmean$Tone) )
which gives this output:
> GoalMeanTone2
Initiative V1
1 0 NA
2 1 NA
3 101 NA
4 102 NA
5 103 NA
6 104 NA
7 105 NA
8 107 NA
9 108 NA
10 110 NA
Note that in both cases I do not get an average for Tone but NAs instead.
I have removed the NAs from the Tone column of the df, and have also tried removing all the other missing values in the df (only about 30 values were deleted).
I have also recoded the values for Tone:
tabmean<-Meantable %>% mutate(Tone=recode(Tone,
`1`="1",
`2`="0",
`3`="-1",
`4`="2"))
I still cannot get the average tone for an initiative. Maybe the solution is more obvious than I think, but I am stuck and have no idea how to proceed.
I'd be super grateful for better code to do this. Thanks!
I'm not completely sure what you mean by 'the average tone when an initiative is mentioned', but let's say that you'd want to get the average tone for when initiative=1, you could try the following:
tabmean %>% filter(Initiative == 1) %>% summarise(avg_tone = mean(Tone, na.rm = TRUE))
Note that (1) you have to add na.rm = TRUE to the mean() call inside summarise if the column you are summarising has missing values, otherwise the result will be NA, and (2) the column has to be numeric: you can check that with str(tabmean) and, for example, convert Tone with tabmean <- tabmean %>% mutate(Tone = as.numeric(Tone)). In the output above Tone is shown as <chr>, and the mean of a character column is NA.
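To get the average for every initiative at once rather than one at a time, a base R sketch with aggregate() could look like the following. It uses a toy subset of the rows from the question, with Tone deliberately stored as character; the names by_initiative and by_combo are illustrative. Tone is left out of the grouping, since grouping by Tone itself would make each group's mean trivial.

```r
# Toy subset of the rows shown in the question; Tone is character on purpose
tabmean <- data.frame(
  Initiative = c(52, 294, 103, 52, 87, 52, 136, 19, 19),
  Topic      = c(44, 42, 31, 41, 26, 87, 81, 7, 4),
  Goals      = c(2, 2, 2, 2, 2, 2, 2, 2, 2),
  Tone       = c("2", "2", "2", "2", "1", "2", "2", "1", "2"),
  stringsAsFactors = FALSE
)

# mean() needs a numeric column, so convert first
tabmean$Tone <- as.numeric(tabmean$Tone)

# Mean Tone per Initiative (the formula interface drops rows with NA by default)
by_initiative <- aggregate(Tone ~ Initiative, data = tabmean, FUN = mean)

# Mean Tone per combination of Initiative, Topic and Goals
by_combo <- aggregate(Tone ~ Initiative + Topic + Goals, data = tabmean, FUN = mean)

print(by_initiative)
```

With dplyr, the equivalent would be group_by(Initiative) followed by summarise() with mean(Tone, na.rm = TRUE).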

Removing certain values from a data frame

I know there are already some threads like this, but I could not find any solutions.
I have a dataframe that looks like this:
Name Age Sex Survived
1 Allison 0.17 female 1
2 Leah 0.33 female 0
3 David 0.8 male 1
4 Daniel 0.83 male 1
5 Alex 0.83 male 1
6 Jay 0.92 male 1
7 Sara 16 female 1
8 Jade 15 female 1
9 Connor 17 male 1
10 Jon 18 male 1
11 Mary 8 female 1
I want to remove ages that are below 1. I want the data to look like this:
Name Age Sex Survived
1 Allison NA female 1
2 Leah NA female 0
3 David NA male 1
4 Daniel NA male 1
5 Alex NA male 1
6 Jay NA male 1
7 Sara 16 female 1
8 Jade 15 female 1
9 Connor 17 male 1
10 Jon 18 male 1
11 Mary 8 female 1
Or to just remove the rows with ages < 1 altogether.
Following other solutions I tried this, but it didn't work:
mydata[mydata$Age<"1"&&mydata$Age>"0"] <- NA
Here are three ways to remove the rows (the last one needs dplyr):
mydata[mydata$Age >= 1, ]
subset(mydata, Age >= 1)
filter(mydata, Age >= 1)
Here is how to make them NA:
mydata$Age[mydata$Age < 1] <- NA
Your issue is that you are using 1 as a character (in quotes). Less-than and greater-than comparisons on characters work differently from comparisons on numbers, so be careful. Also make sure your Age column is numeric. The safest way to do that is
mydata$Age <- as.numeric(as.character(mydata$Age))
so you don't accidentally mess up factor variables.
Edit: I had the wrong signs; fixed now.
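To see why the quoted "1" matters, here is a small sketch of how character comparison and factor conversion behave. The vectors ages_chr, ages_fac and ages_num are made up for illustration:

```r
# Character comparison is done string-wise, character by character
"16" < "2"   # TRUE, because "1" sorts before "2"
16 < 2       # FALSE, numeric comparison

ages_chr <- c("0.17", "16", "8")
ages_fac <- factor(ages_chr)

as.numeric(ages_fac)                # internal level codes, not the ages
as.numeric(as.character(ages_fac))  # the actual values: 0.17 16 8

# & is vectorised; && expects a single TRUE/FALSE, so it cannot build a row mask
ages_num <- as.numeric(ages_chr)
below_one <- ages_num > 0 & ages_num < 1
ages_num[below_one] <- NA
print(ages_num)
```

The last three lines are the numeric, vectorised version of what the attempt in the question was aiming for.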
> mydata[mydata$Age<1, "Age"] <- NA
> mydata
Name Age Sex Survived
1 Allison NA female 1
2 Leah NA female 0
3 David NA male 1
4 Daniel NA male 1
5 Alex NA male 1
6 Jay NA male 1
7 Sara 16 female 1
8 Jade 15 female 1
9 Connor 17 male 1
10 Jon 18 male 1
11 Mary 8 female 1
Update
If Age is a factor, you can use:
mydata[as.numeric(as.character(mydata$Age))<1, "Age"] <- NA

How to sort with multiple conditions in R [duplicate]

This question already has answers here:
Sort (order) data frame rows by multiple columns
I have a very simple dataframe in R:
x <- data.frame("SN" = 1:7, "Age" = c(21,15,22,33,21,15,25), "Name" = c("John","Dora","Paul","Alex","Bud","Chad","Anton"))
My goal is to sort the data frame by Age and Name. I can partially achieve this with the following command:
x[order(x[, 'Age']),]
which returns:
SN Age Name
2 2 15 Dora
6 6 15 Chad
1 1 21 John
5 5 21 Bud
3 3 22 Paul
7 7 25 Anton
4 4 33 Alex
As you can see, the data frame is ordered by Age but not by Name.
Question: how can I order the data frame by Age and Name at the same time? This is what the result should look like:
SN Age Name
6 6 15 Chad
2 2 15 Dora
5 5 21 Bud
1 1 21 John
3 3 22 Paul
7 7 25 Anton
4 4 33 Alex
Note: I would like to avoid additional packages and use only the default ones.
With dplyr:
library(dplyr)
x %>%
arrange(Age, Name)
SN Age Name
1 6 15 Chad
2 2 15 Dora
3 5 21 Bud
4 1 21 John
5 3 22 Paul
6 7 25 Anton
7 4 33 Alex
With base R:
x[with(x, order(Age, Name)), ]
SN Age Name
6 6 15 Chad
2 2 15 Dora
5 5 21 Bud
1 1 21 John
3 3 22 Paul
7 7 25 Anton
4 4 33 Alex
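If one key should run in the other direction, the same base R idiom extends naturally. This sketch reuses the data frame from the question and sorts Age descending while breaking ties by Name ascending. The names asc and mixed are illustrative, and negating the column works here only because Age is numeric:

```r
x <- data.frame(
  SN   = 1:7,
  Age  = c(21, 15, 22, 33, 21, 15, 25),
  Name = c("John", "Dora", "Paul", "Alex", "Bud", "Chad", "Anton")
)

# Both keys ascending: Age first, Name breaks ties
asc <- x[order(x$Age, x$Name), ]

# Age descending, Name ascending within equal ages
mixed <- x[order(-x$Age, x$Name), ]

print(asc)
print(mixed)
```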

R Dataframe: Add new column as sum of certain other columns

Name Trial# Result ResultsSoFar
1 Bob 1 14 14
2 Bob 2 22 36
3 Bob 3 3 39
4 Bob 4 18 57
5 Nancy 2 33 33
6 Nancy 3 87 120
Hello, say I have the dataframe above. What's the best way to generate the "ResultsSoFar" column which is a sum of that person's results up to and including that trial (Bob's results do not include Nancy's and vice versa).
With data.table you can do:
library(data.table)
setDT(df)[, ResultsSoFar:=cumsum(Result), by=Name]
df
Name Trial. Result ResultsSoFar
1: Bob 1 14 14
2: Bob 2 22 36
3: Bob 3 3 39
4: Bob 4 18 57
5: Nancy 2 33 33
6: Nancy 3 87 120
Note:
If Trial# is not sorted, sort the rows first, for example with setorder(setDT(df), Name, Trial.), and then run the line above so that the cumsum follows the trial order.
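For comparison, a base R sketch of the same idea using ave(). It assumes the trial column is called Trial (a simplified stand-in for Trial#) and starts from deliberately shuffled rows, sorting them first so that the running total follows trial order:

```r
# Same values as in the question, but with the rows shuffled
df <- data.frame(
  Name   = c("Bob", "Nancy", "Bob", "Bob", "Nancy", "Bob"),
  Trial  = c(2, 3, 1, 4, 2, 3),
  Result = c(22, 87, 14, 18, 33, 3)
)

# Sort by person and trial so the cumulative sum runs in trial order
df <- df[order(df$Name, df$Trial), ]

# ave() applies cumsum separately within each Name
df$ResultsSoFar <- ave(df$Result, df$Name, FUN = cumsum)

print(df)
```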

R: Increment Rank when the column group changes

Newbie to R here. I've tried Googling but I'm failing to find a solution.
Here's my data frame:
Name Value
Bob 50
Mary 55
John 51
Todd 50
Linda 56
Tom 55
I've sorted it, but I need to add a rank column so that it looks like this:
Name Value Rank
Bob 50 1
Todd 50 1
John 51 2
Mary 55 3
Tom 55 3
Linda 56 4
So what I found is:
resultset$Rank <- ave(resultset$Name, resultset$Value, FUN = rank)
But this gives me:
Name Value Rank
Bob 50 1
Todd 50 2
John 51 1
Mary 55 1
Tom 55 2
Linda 56 1
So close but yet so far...
Here's a base-R solution:
uv <- unique(df$Value)
merge(df,data.frame(uv,r=rank(uv)),by.x="Value",by.y="uv")
which gives
Value Name r
1 50 Bob 1
2 50 Todd 1
3 51 John 2
4 55 Mary 3
5 55 Tom 3
6 56 Linda 4
This is memory-inefficient and has the side effect of re-sorting your data. Alternatively, you could do:
require(data.table)
DT <- data.table(df)
DT[order(Value),r:=.GRP,by=Value]
which gives
Name Value r
1: Bob 50 1
2: Mary 55 3
3: John 51 2
4: Todd 50 1
5: Linda 56 4
6: Tom 55 3
No need to sort... You can use dense_rank from "dplyr":
> library(dplyr)
> mydf %>% mutate(rank = dense_rank(Value))
Name Value rank
1 Bob 50 1
2 Mary 55 3
3 John 51 2
4 Todd 50 1
5 Linda 56 4
6 Tom 55 3
I guess your rank variable can be obtained from 1:length(unique(df$value)). Below is my attempt.
df <- data.frame(name = c("Bob", "Mary", "John", "Todd", "Linda", "Tom"),
value = c(50, 55, 51, 50, 56, 55))
# rank by lengths of unique values
rank <- data.frame(rank = 1:length(unique(df$value)), value = sort(unique(df$value)))
merge(df, rank, by="value")
value name rank
1 50 Bob 1
2 50 Todd 1
3 51 John 2
4 55 Mary 3
5 55 Tom 3
6 56 Linda 4
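The same lookup can also be written without merge(), which keeps the original row order. This is a base R sketch that assumes the column names Name and Value from the question: each value is matched against the sorted unique values, and its position there is the rank.

```r
df <- data.frame(
  Name  = c("Bob", "Mary", "John", "Todd", "Linda", "Tom"),
  Value = c(50, 55, 51, 50, 56, 55)
)

# Position of each Value among the sorted distinct values = dense rank
df$Rank <- match(df$Value, sort(unique(df$Value)))

# Optional: order the rows by rank, then by name, as in the desired output
sorted <- df[order(df$Rank, df$Name), ]
print(sorted)
```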
