Removing certain values from a data frame - r

I know there are already some threads like this, but I could not find any solutions.
I have a dataframe that looks like this:
Name Age Sex Survived
1 Allison 0.17 female 1
2 Leah 0.33 female 0
3 David 0.8 male 1
4 Daniel 0.83 male 1
5 Alex 0.83 male 1
6 Jay 0.92 male 1
7 Sara 16 female 1
8 Jade 15 female 1
9 Connor 17 male 1
10 Jon 18 male 1
11 Mary 8 female 1
I want to remove ages that are below 1. I want the data to look like this:
Name Age Sex Survived
1 Allison NA female 1
2 Leah NA female 0
3 David NA male 1
4 Daniel NA male 1
5 Alex NA male 1
6 Jay NA male 1
7 Sara 16 female 1
8 Jade 15 female 1
9 Connor 17 male 1
10 Jon 18 male 1
11 Mary 8 female 1
Or to just remove the rows with ages < 1 altogether.
Following other solutions, I tried this, but it didn't work:
mydata[mydata$Age<"1"&&mydata$Age>"0"] <- NA

Here are three ways to remove the rows (keeping ages of 1 and above):
mydata[mydata$Age >= 1, ]
subset(mydata, Age >= 1)
filter(mydata, Age >= 1)  # filter() comes from dplyr
Here is how to make them NA:
mydata$Age[mydata$Age < 1] <- NA
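Your issue is that you are using 1 as a character (in quotes). Less-than and greater-than comparisons on characters compare text rather than numeric values, so be careful. Your attempt also uses &&, which is meant for single values; use the vectorized & when comparing a whole column. Also make sure your Age column is numeric. A safe way to do that is
mydata$Age <- as.numeric(as.character(mydata$Age))
so that a factor column is converted through its labels rather than its internal integer codes.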
Edit: I had put the wrong signs; fixed now.
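To make the two pitfalls above concrete, here is a minimal sketch. The factor age_f is made up for illustration and simply stands in for an Age column that was read in as a factor.

```r
# Comparing strings compares text, character by character
"16" < "2"   # TRUE, because "1" sorts before "2"
16 < 2       # FALSE

# A made-up factor column standing in for Age
age_f <- factor(c("0.17", "16", "8"))
codes  <- as.numeric(age_f)                # internal level codes: 1 2 3
values <- as.numeric(as.character(age_f))  # the real numbers: 0.17 16 8
```

This is why the conversion goes through as.character first.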

> mydata[mydata$Age<1, "Age"] <- NA
> mydata
Name Age Sex Survived
1 Allison NA female 1
2 Leah NA female 0
3 David NA male 1
4 Daniel NA male 1
5 Alex NA male 1
6 Jay NA male 1
7 Sara 16 female 1
8 Jade 15 female 1
9 Connor 17 male 1
10 Jon 18 male 1
11 Mary 8 female 1
Update
If Age is a factor, you can use:
mydata[as.numeric(as.character(mydata$Age))<1, "Age"] <- NA
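One thing worth checking if you combine both approaches: once some ages are NA, indexing with [ returns NA-filled rows for them, while subset() drops them. A small sketch on a made-up four-row data frame (the values are only for illustration):

```r
# Hypothetical miniature version of the data
mydata <- data.frame(
  Name = c("Allison", "Leah", "Sara", "Mary"),
  Age  = c(0.17, 0.33, 16, 8),
  stringsAsFactors = FALSE
)

# Replace ages below 1 with NA
masked <- mydata
masked$Age[masked$Age < 1] <- NA

# Logical indexing keeps a (blank) row for every NA in the condition
with_na_rows <- masked[masked$Age >= 1, ]

# subset() and which() both skip rows where the condition is NA
clean_subset <- subset(masked, Age >= 1)
clean_which  <- masked[which(masked$Age >= 1), ]
```

Here with_na_rows still has four rows, two of them all NA, while clean_subset and clean_which contain only Sara and Mary.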

Related

How to change an NA value in a specific row in R?

I am very new to R and still learning. My data is Titanic.csv, which has 891 observations and 13 variables. I would like to change the NA value in column 12 (column name "Embarked") of observation 62 (PassengerId 62) to "S", and the NA in observation 830 to "C".
I found similar postings, but they didn't give me what I need.
How to replace certain values in a specific rows and columns with NA in R?
How to change NA value in a specific row and column?
My assignment asks me to use the function below.
boat<-within(boat,Embarked[is.na(Embarked)]<-"your choice here")
If I do this
boat<-within(boat,Embarked[is.na(Embarked)]<- "S")
or "C" in where it says "your choice here" it replaces both observations with either "S" or "C".
Below is the example of the Titanic.csv file.
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
1 0 3 Braund, Owen male 22 1 0 A/5 1717.25 S
2 1 1 Cumings,John female 38 1 0 PC 9971.28 C85 C
17 0 3 Rice, Eugene male 2 4 1 382 29.125 Q
18 1 2 Williams,Charles male 0 0 2443 13 S
60 0 3 Goodwin, William male 11 5 2 CA 21 46.9 S
61 0 3 Sirayanian, Orsen male 22 0 0 2669 7.2292 C
62 1 1 Icard, Amelie female 38 0 0 11357 80 B28 NA
63 0 1 Harris, Henry male 45 1 0 36973 83.475 C83 S
My apologies if the sample dataframe is somewhat condensed.
# df is your data frame; the first index is the row (e.g. 62), the second is the column (e.g. 12)
df[62, 12]
# Now assign "S" with the `<-` operator
df[62, 12] <- "S"
# and check if NA is changed to S
df[62, 12]
#Embarked
#<chr>
# 1 S
# Same with
df[830, 12] <- "C"
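If the assignment requires within(), one possible sketch is to add the passenger id to the condition so that each NA gets its own value. The tiny boat data frame below is made up for illustration, and it assumes Embarked is a character column (with a factor, "S" and "C" would have to be existing levels).

```r
# Hypothetical stand-in for the Titanic data
boat <- data.frame(
  PassengerId = c(61, 62, 63, 830),
  Embarked    = c("C", NA, "S", NA),
  stringsAsFactors = FALSE
)

# Restrict each replacement to one passenger
boat <- within(boat, Embarked[is.na(Embarked) & PassengerId == 62]  <- "S")
boat <- within(boat, Embarked[is.na(Embarked) & PassengerId == 830] <- "C")
boat
```

Selecting by PassengerId rather than by row position also keeps working if the rows are reordered.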

Counting the number of changes of a categorical variable during repeated measurements within a category

I'm working with a dataset about migration across the country with the following columns:
i birth gender race region urban wage year educ
1 58 2 3 1 1 4620 1979 12
1 58 2 3 1 1 4620 1980 12
1 58 2 3 2 1 4620 1981 12
1 58 2 3 2 1 4700 1982 12
.....
i birth gender race region urban wage year educ
45 65 2 3 3 1 NA 1979 10
45 65 2 3 3 1 NA 1980 10
45 65 2 3 4 2 11500 1981 10
45 65 2 3 1 1 11500 1982 10
i = individual id. They follow a large group of people for 25 years and record changes in 'region' (categorical variables, 1-4) , 'urban' (dummy), 'wage' and 'educ'.
How do I count the aggregate number of times 'region' or 'urban' has changed (eg: from region 1 to region 3 or from urban 0 to 1) during the observation period (25 year period) within each subject? I also have some NA's in the data (which should be ignored)
A simplified version of expected output:
i changes in region
1 1
...
45 2
i changes in urban
1 0
...
45 2
I would then like to sum up the number of changes for region and urban.
I came across these answers: Count number of changes in categorical variables during repeated measurements and Identify change in categorical data across datapoints in R but I still don't get it.
Here's a part of the data for i=4.
i birth gender race region urban wage year educ
4 62 2 3 1 1 NA 1979 9
4 62 2 3 NA NA NA 1980 9
4 62 2 3 4 1 0 1981 9
4 62 2 3 4 1 1086 1982 9
4 62 2 3 1 1 70 1983 9
4 62 2 3 1 1 0 1984 9
4 62 2 3 1 1 0 1985 9
4 62 2 3 1 1 7000 1986 9
4 62 2 3 1 1 17500 1987 9
4 62 2 3 1 1 21320 1988 9
4 62 2 3 1 1 21760 1989 9
4 62 2 3 1 1 0 1990 9
4 62 2 3 1 1 0 1991 9
4 62 2 3 1 1 30500 1992 9
4 62 2 3 1 1 33000 1993 9
4 62 2 3 NA NA NA 1994 9
4 62 2 3 4 1 35000 1996 9
Here, output should be:
i change_reg change_urban
4 3 0
Here is something I hope will get you closer to what you need.
First, group by i. Then create a column that holds a 1 for each change in region, by comparing the current value of region with the previous one (using lag). Note that if the previous value is NA (as it is for the first row of a given i), it is counted as no change.
The same approach is taken for urban. Then summarize, totaling up all the changes for each i. I left in these temporary variables so you can check whether you are getting the desired results.
Edit: If you wish to remove rows that have NA for region or urban, you can add drop_na first.
library(dplyr)
library(tidyr)
df_tot <- df %>%
drop_na(region, urban) %>%
group_by(i) %>%
mutate(reg_change = ifelse(region == lag(region) | is.na(lag(region)), 0, 1),
urban_change = ifelse(urban == lag(urban) | is.na(lag(urban)), 0, 1)) %>%
summarize(tot_region = sum(reg_change),
tot_urban = sum(urban_change))
# A tibble: 3 x 3
i tot_region tot_urban
<int> <dbl> <dbl>
1 1 1 0
2 4 3 0
3 45 2 2
Edit: Afterwards, to get a grand total for both tot_region and tot_urban columns, you can use colSums. (Store your earlier result as df_tot as above.)
colSums(df_tot[-1])
tot_region tot_urban
6 2
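For comparison, the same counting can be sketched in base R with diff() after dropping NAs. The helper name count_changes and the miniature data frame are assumptions made for illustration (the values are taken from the rows shown in the question), and the sketch assumes rows are already sorted by year within each i.

```r
# Count how often consecutive non-missing values differ
count_changes <- function(x) {
  x <- x[!is.na(x)]
  sum(diff(x) != 0)
}

# Small example built from the rows shown in the question
df <- data.frame(
  i      = c(rep(1, 4), rep(4, 7), rep(45, 4)),
  region = c(1, 1, 2, 2,  1, NA, 4, 4, 1, NA, 4,  3, 3, 4, 1),
  urban  = c(1, 1, 1, 1,  1, NA, 1, 1, 1, NA, 1,  1, 1, 2, 1)
)

# Apply the helper per individual
reg <- tapply(df$region, df$i, count_changes)
urb <- tapply(df$urban,  df$i, count_changes)

changes <- data.frame(
  i            = as.integer(names(reg)),
  change_reg   = as.vector(reg),
  change_urban = as.vector(urb)
)
changes
colSums(changes[-1])  # grand totals
```

Unlike the drop_na version, this removes NAs per column, so a row with a missing urban value still counts toward region.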

R: combine two 2-dimensional crosstabs

> t <- read.csv("data.csv", sep=';')
> t
sex pacemaker smoker
1 female no never
2 female no never
3 male no never
4 male no former
5 male yes former
6 male yes former
7 female yes current
8 female yes former
9 female no current
> xtabs(~smoker+sex, data=t)
sex
smoker female male
current 2 0
former 1 3
never 2 1
> xtabs(~smoker+pacemaker, data=t)
pacemaker
smoker no yes
current 1 1
former 1 3
never 3 0
How can I combine two 2-dimensional crosstabs in R ?
Desired output:
| sex | pacemaker
smoker | female male | no yes
current | 2 0 | 1 1
former | 1 3 | 1 3
never | 2 1 | 3 0
I have renamed your data.frame to be df. This code should work for you.
XTab <- cbind(xtabs(~smoker+sex, data=df), xtabs(~smoker+pacemaker, data=df))
XTab
female male no yes
current 2 0 1 1
former 1 3 1 3
never 2 1 3 0
You might want to rename the pacemaker column headers.
colnames(XTab)[3:4] = c("Pacemaker_no", "Pacemaker_yes")
XTab
female male Pacemaker_no Pacemaker_yes
current 2 0 1 1
former 1 3 1 3
never 2 1 3 0
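As a variation, the column names can be prefixed before binding, so nothing depends on column positions. This is only a sketch: the data frame is rebuilt from the rows shown in the question, and the names tab_sex, tab_pm and combined are made up for the example.

```r
# Data from the question, typed in directly
df <- data.frame(
  sex       = c("female", "female", "male", "male", "male", "male",
                "female", "female", "female"),
  pacemaker = c("no", "no", "no", "no", "yes", "yes", "yes", "yes", "no"),
  smoker    = c("never", "never", "never", "former", "former", "former",
                "current", "former", "current")
)

tab_sex <- xtabs(~ smoker + sex, data = df)
tab_pm  <- xtabs(~ smoker + pacemaker, data = df)

# Prefix each block's columns with its variable name
colnames(tab_sex) <- paste("sex", colnames(tab_sex), sep = "_")
colnames(tab_pm)  <- paste("pacemaker", colnames(tab_pm), sep = "_")

combined <- cbind(tab_sex, tab_pm)
combined
```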

Need help in data cleaning using R

I need some help in data cleaning using R.
my CSV file looks as follows.
"id","gender","age","category1","category2","category3","category4","category5","category6","category7","category8","category9","category10"
1,"Male",22,"movies","music","travel","cloths","grocery",,,,,
2,"Male",28,"travel","books","movies",,,,,,,
3,"Female",27,"rent","fuel","grocery","cloths",,,,,,
4,"Female",22,"rent","grocery","travel","movies","cloths",,,,,
5,"Female",22,"rent","online-shopping","utiliy",,,,,,,
I need to reformat as follows.
id gender age category rank
1 Male 22 movies 1
1 Male 22 music 2
1 Male 22 travel 3
1 Male 22 cloths 4
1 Male 22 grocery 5
1 Male 22 books NA
1 Male 22 rent NA
1 Male 22 fuel NA
1 Male 22 utility NA
1 Male 22 online-shopping NA
...................................
5 Female 22 movies NA
5 Female 22 music NA
5 Female 22 travel NA
5 Female 22 cloths NA
5 Female 22 grocery NA
5 Female 22 books NA
5 Female 22 rent 1
5 Female 22 fuel NA
5 Female 22 utility NA
5 Female 22 online-shopping 2
So far My efforts are as follows.
mini <- read.csv("~/MS/coding/mini.csv", header=FALSE)
mini_clean <- mini[-1,]
df_mini <- melt(mini_clean, id.vars=c("V1","V2","V3"))
sqldf('select * from df_mini order by "V1"')
Now I want to know what is the best way to fill all missing categories and also how do I rank the categories as per their position in CSV file.
For more clarity please refer the above CSV file and expected output.
text1='"id","gender","age","category1","category2","category3","category4","category5","category6","category7","category8","category9","category10"
1,"Male",22,"movies","music","travel","cloths","grocery",,,,,
2,"Male",28,"travel","books","movies",,,,,,,
3,"Female",27,"rent","fuel","grocery","cloths",,,,,,
4,"Female",22,"rent","grocery","travel","movies","cloths",,,,,
5,"Female",22,"rent","online-shopping","utiliy",,,,,,,'
d1 <- read.table(text=text1, sep=",", head=T, as.is=T)
library(reshape2)
d2 <- melt(d1, id.vars=c("id","gender","age"))
names(d2)[5] <- "category"
names(d2)[4] <- "rank"
d2$rank <- gsub("category", "", d2$rank)
head(d2)
# id gender age rank category
# 1 1 Male 22 1 movies
# 2 2 Male 28 1 travel
# 3 3 Female 27 1 rent
# 4 4 Female 22 1 rent
# 5 5 Female 22 1 rent
# 6 1 Male 22 2 music
We can use gather from tidyr
library(tidyr)
d2 <- gather(d1, rank, category, -(1:3)) %>%
# match the literal prefix so that both digits of "category10" are captured
extract(rank, into='rank', 'category(\\d+)')
head(d2)
# id gender age rank category
#1 1 Male 22 1 movies
#2 2 Male 28 1 travel
#3 3 Female 27 1 rent
#4 4 Female 22 1 rent
#5 5 Female 22 1 rent
#6 1 Male 22 2 music
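One thing to watch when pulling the rank out of names like category10: a greedy .* in front of the digits swallows everything except the last digit. A quick base R check with sub() (the vector cols is made up for illustration):

```r
cols <- c("category1", "category9", "category10")

# Greedy ".*" leaves only the final digit for the capture group
greedy <- sub(".*(\\d+)", "\\1", cols)

# Matching the literal prefix keeps all digits
anchored <- sub("category(\\d+)", "\\1", cols)

greedy    # "1" "9" "0"
anchored  # "1" "9" "10"
ranks <- as.integer(anchored)
```

Converting to integer afterwards also makes the rank sort numerically (10 after 9) instead of as text.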

Help needed in Data cleaning using R

"id","gender","age","category1","category2","category3","category4","category5","category6","category7","category8","category9","category10"
1,"Male",22,"movies","music","travel","cloths","grocery",,,,,
2,"Male",28,"travel","books","movies",,,,,,,
3,"Female",27,"rent","fuel","grocery","cloths",,,,,,
4,"Female",22,"rent","grocery","travel","movies","cloths",,,,,
5,"Female",22,"rent","online-shopping","utiliy",,,,,,,
I need to reformat as follows.
id gender age category rank
1 Male 22 movies 1
1 Male 22 music 2
1 Male 22 travel 3
1 Male 22 cloths 4
1 Male 22 grocery 5
1 Male 22 books NA
1 Male 22 rent NA
1 Male 22 fuel NA
1 Male 22 utility NA
1 Male 22 online-shopping NA
So far my efforts are as follows.
mini <- read.csv("coding/mini.csv", header=FALSE)
mini_clean <- mini[-1,]
df_mini <- melt(mini_clean, id.vars=c("V1","V2","V3"))
sqldf('select * from df_mini order by "V1"')
Now I want to know what is the best way to fill all missing categories for each user.
Any help in this regard is appreciated.
library(reshape2)
library(tidyr)
mdf <- melt(df, c("id","gender","age"))
# nesting() keeps only the id/gender/age combinations that occur in the data
complete(na.omit(mdf), nesting(id, gender, age), value)
# Source: local data frame [50 x 5]
#
# id gender age value variable
# (int) (fctr) (int) (chr) (fctr)
# 1 1 Male 22 books NA
# 2 1 Male 22 cloths category4
# 3 1 Male 22 fuel NA
# 4 1 Male 22 grocery category5
# 5 1 Male 22 movies category1
# 6 1 Male 22 music category2
# 7 1 Male 22 online-shopping NA
# 8 1 Male 22 rent NA
# 9 1 Male 22 travel category3
# 10 1 Male 22 utiliy NA
# .. ... ... ... ... ...
Explanation
We first melt the data.frame, specifying the id columns. Next, tidyr has a helper function, complete, that expands the data to every combination of person and category, as your output describes; categories a person did not choose get NA.
Data
df <- read.csv(text='"id","gender","age","category1","category2","category3","category4","category5","category6","category7","category8","category9","category10"
1,"Male",22,"movies","music","travel","cloths","grocery",,,,,
2,"Male",28,"travel","books","movies",,,,,,,
3,"Female",27,"rent","fuel","grocery","cloths",,,,,,
4,"Female",22,"rent","grocery","travel","movies","cloths",,,,,
5,"Female",22,"rent","online-shopping","utiliy",,,,,,,')
is.na(df) <- is.na(df) | df== ""
Consider using the base function reshape, as this is a standard example of wide-to-long dataset reshaping/pivoting:
reshapedf <- reshape(df, varying = c(4:13),
v.names = c("category"),
timevar=c("rank"),
times = c(1:10),
idvar = c("id", "gender", "age"),
new.row.names = 1:1000,
direction = "long")
# ORDER RESULTING DATA FRAME
reshapedf <- reshapedf[with(reshapedf , order(id, gender, age)), ]
# RESET ROW NAMES
row.names(reshapedf) <- 1:nrow(reshapedf)
OUTPUT
id gender age rank category
1 1 Male 22 1 movies
2 1 Male 22 2 music
3 1 Male 22 3 travel
4 1 Male 22 4 cloths
5 1 Male 22 5 grocery
6 1 Male 22 6 NA
7 1 Male 22 7 NA
8 1 Male 22 8 NA
9 1 Male 22 9 NA
10 1 Male 22 10 NA
...
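The reshape output above keeps the empty slots as NA categories rather than listing every category for every user. One possible way to get the layout from the question in base R is to cross-join the users with the distinct categories and merge the ranks back in. This is a sketch on a made-up two-user, two-column data frame; the names wide, long, grid and full are only for illustration.

```r
# Hypothetical miniature of the wide data
wide <- data.frame(
  id        = 1:2,
  gender    = c("Male", "Female"),
  category1 = c("movies", "rent"),
  category2 = c("music", NA),
  stringsAsFactors = FALSE
)

# Wide to long; the position of each column becomes the rank
long <- reshape(wide,
                varying   = c("category1", "category2"),
                v.names   = "category",
                timevar   = "rank",
                times     = 1:2,
                idvar     = "id",
                direction = "long")
long <- long[!is.na(long$category), ]

# merge() with no shared columns gives every user x category pair
grid <- merge(wide[c("id", "gender")],
              data.frame(category = unique(long$category),
                         stringsAsFactors = FALSE))

# Left join: categories a user did not pick get rank NA
full <- merge(grid, long, all.x = TRUE)
full <- full[order(full$id, full$rank), ]
row.names(full) <- NULL
full
```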

Resources