R: merging copies of the same variable

I have data like this in R:
subjID = c(1,2,3,4)
var1 = c(3,8,NA,6)
var1.copy = c(NA,NA,5,NA)
fake = data.frame(subjID = subjID, var1 = var1, var1 = var1.copy)
which looks like this:
> fake
subjID var1 var1.1
1 1 3 NA
2 2 8 NA
3 3 NA 5
4 4 6 NA
var1 and var1.1 represent the same variable, so each subject has NA in one column and a numeric value in the other (no one has two NAs or two numbers). I want to merge the columns to get a single var1: (3, 8, 5, 6).
Any tips on how to do this?

If you're only dealing with two columns, and there are never two numbers or two NAs, you can calculate the row mean and ignore missing values:
fake$fixed <- rowMeans(fake[, c("var1", "var1.1")], na.rm=TRUE)
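To see why the "never two numbers or two NAs" condition matters, here is a small sketch on made-up values (the edge data frame is hypothetical, not from the question). With na.rm=TRUE, a row holding two numbers is averaged, and a row holding two NAs comes back as NaN rather than NA:

```r
# Hypothetical data: row 2 has two numbers, row 3 has two NAs
edge <- data.frame(var1   = c(3, 8, NA, NA),
                   var1.1 = c(NA, 4, NA, 5))

# Same call as above, applied to the edge cases
edge$fixed <- rowMeans(edge[, c("var1", "var1.1")], na.rm = TRUE)
edge$fixed
```

If your data can contain such rows, one of the is.na-based approaches is probably safer.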

You can use is.na, which can be vectorised as:
# get all the ones we can from var1
var.merged = var1;
# which ones are available in var1.copy but not in var1?
ind = is.na(var1) & !is.na(var1.copy);
# use those to fill in the blanks
var.merged[ind] = var1.copy[ind];

It depends on how you want to merge if there are conflicts.
You could simply copy every non-NA value of var1.1 (var1.copy in the code) into the corresponding slot of var1. In case of conflicts, this will favour var1.1.
var1[!is.na(var1.copy)] <- var1.copy[!is.na(var1.copy)]
Or you could fill in only the NA values in var1 with the corresponding values of var1.1. In case of conflict, this will favour var1.
var1[is.na(var1)] <- var1.copy[is.na(var1)]
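A small sketch of how the two policies differ when there actually is a conflict. The vectors a and b are made up for illustration (position 2 has a value in both), and the results are kept in separate variables so the originals are not overwritten:

```r
# Hypothetical vectors with one conflict (position 2)
a <- c(3, 8, NA, 6)
b <- c(NA, 7, 5, NA)

# Policy 1: b overwrites a wherever b is present (favours b)
favour_b <- a
favour_b[!is.na(b)] <- b[!is.na(b)]

# Policy 2: b only fills the gaps in a (favours a)
favour_a <- a
favour_a[is.na(a)] <- b[is.na(a)]

favour_b  # 3 7 5 6
favour_a  # 3 8 5 6
```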

Related

Exclude variables based on pattern and melt

My data look like this
id var1 var1_a var2 var2_a var3 var3_a
1 1 7 7 8 9 4
2 2 4 8 7 6 5
3 5 5 1 2 3 4
4 6 9 5 6 7 8
I want to select var1, var2, and var3 only, and exclude var1_a, var2_a, and var3_a. The variable names may vary in length.
I know I can use something like
dt.m <- melt(dt, id = 1, measure.vars = c(2, 4, 6), na.rm = TRUE)
but I don't want to use this approach because I have too many variables.
How can I do this using patterns or a similar approach?
If the measure column names have a pattern to them then use grep to find which they are. In the example, the variables of interest all end in a digit so we could use this:
melt(dt, id = 1, measure = grep("\\d$", names(dt)), na.rm = TRUE)
Or, if the columns of interest are in predictable positions, use seq or a similar approach to generate the column numbers:
melt(dt, id = 1, measure = seq(2, 6, 2), na.rm = TRUE)
Other ways to pick out the names that work in the example are:
# pick out column names that have 4 characters
which(nchar(names(dt)) == 4)
# pick out names having no underscore and that are not first
grep("_", names(dt), invert = TRUE)[-1]
# pick out even positions
which( (1:ncol(dt)) %% 2 == 0)
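To check that these selectors agree, here is a base-R sketch that rebuilds the example data from the question (assuming the printed table has no row names, so id is the first of seven columns) and runs each one. The by_* variable names are just for illustration:

```r
# Rebuild the example data from the question
dt <- data.frame(id = c(1, 2, 3, 4),
                 var1 = c(1, 2, 5, 6), var1_a = c(7, 4, 5, 9),
                 var2 = c(7, 8, 1, 5), var2_a = c(8, 7, 2, 6),
                 var3 = c(9, 6, 3, 7), var3_a = c(4, 5, 4, 8))

by_digit    <- grep("\\d$", names(dt))                  # names ending in a digit
by_length   <- which(nchar(names(dt)) == 4)             # names with 4 characters
by_no_under <- grep("_", names(dt), invert = TRUE)[-1]  # no underscore, drop id
by_position <- which((1:ncol(dt)) %% 2 == 0)            # even positions

# Each selector should pick columns 2, 4 and 6
names(dt)[by_digit]
```

Any of these index vectors can then be passed as the measure argument of melt. Note that the length-based and position-based selectors depend on this particular layout, so the pattern-based ones are likely the more robust choice when names vary in length.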
Sorry, I'd comment but I don't have enough rep yet. If your variables are actually named var1, var1_a, etc., you can use gsub to keep only the names that are unchanged after removing "_a":
names1 = paste0("var",seq(1,100))
names2 = paste0("var",seq(1,100),"_a")
names = sample(c(names1, names2))
x = matrix(rnorm(200*10),nrow=10)
d = data.frame(x)
names(d) = names
d.m <- d[,which(gsub("_a","",names(d)) == names(d))]
print(names(d.m))

How to create a variable based on the values from more than one column in r

I have a data frame that has three variables with the valid values of 1,2,3,4,5,6,7 for each variable. If there isn't a numeric value assigned to the variable, it will show NA. The data frame a looks like below:
ak_eth co_eth pa_eth
1 NA 1 NA
2 NA NA 1
3 NA NA NA
4 2 NA NA
5 NA NA 4
6 NA NA NA
Each row could have NA across all three variables or have only one value in one of the three variables. I want to create a new variable called recode that takes values from the existing three variables. If all three existing variables are NA, the new value is NA; if one of the three existing variables has a value, then take that value for the new variable.
I've tried this, but it didn't seem to work for me.
a$recode[is.na(a$ak_eth) & is.na(a$co_eth) & is.na(a$pa_eth)] <- "NA"
library(car)
a$recode <- recode(a$ak_eth, "1=1;2=2;3=3;4=4;5=5;6=6;7=7")
a$recode <- recode(a$co_eth, "1=1;2=2;3=3;4=4;5=5;6=6;7=7")
a$recode <- recode(a$pa_eth, "1=1;2=2;3=3;4=4;5=5;6=6;7=7")
Any suggestions will be appreciated. Thanks!
We can use pmax
a$Recode_Var <- do.call(pmax, c(a, na.rm = TRUE))
Or use pmin
a$Recode_Var <- do.call(pmin, c(a, na.rm = TRUE))
Or another option is rowSums
r1 <- rowSums(a, na.rm = TRUE)
a$Recode_Var <- replace(r1, r1==0, NA)
NOTE: All three options rely on the OP's statement that each row either has NA across all three variables or has a value in only one of them, so the maximum, the minimum, and the row sum all return that single value.
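For reference, a self-contained sketch on the data from the question, comparing pmax with a coalesce-style alternative built from Reduce and ifelse. The Reduce version is not from the answer above; it is an assumption about what you might want if a row could ever hold more than one value, in which case it takes the leftmost non-NA value instead of the largest:

```r
# Data frame from the question
a <- data.frame(ak_eth = c(NA, NA, NA, 2, NA, NA),
                co_eth = c(1, NA, NA, NA, NA, NA),
                pa_eth = c(NA, 1, NA, NA, 4, NA))

# pmax across the three columns, ignoring NAs
recode_pmax <- do.call(pmax, c(a, na.rm = TRUE))

# Coalesce-style: first non-NA value per row, scanning columns left to right
recode_first <- Reduce(function(x, y) ifelse(is.na(x), y, x), a)

recode_pmax   # 1 1 NA 2 4 NA
recode_first  # 1 1 NA 2 4 NA
```

Rows that are NA in all three columns stay NA in both versions.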

Assigning unique variable from a data.frame

This is a similar question to this but my output results are different.
Take the data:
example <- data.frame(var1 = c(2,3,3,2,4,5),
var2 = c(2,3,5,4,2,5),
var3 = c(3,3,4,3,4,5))
Now I want to create example$Identity, which takes a value from 1:x for each unique var1 value.
I have used
example$Identity <- apply(example[,1], 2, function(x)(unique(x)))
But I am not familiar with the correct way to write function().
The output of example$Identity should be 1,2,2,1,3,4
This:
example$Identity <- as.numeric(as.factor(example$var1))
will give you the desired result:
> example$Identity
[1] 1 2 2 1 3 4
as.factor assigns each distinct value a level, and wrapping it in as.numeric returns the integer level codes, which start at 1 and follow the sorted order of the values.
Or you can use match
example$Identity <- with(example, match(var1, unique(var1)))
If the unique values already appear in increasing order, as in this vector, findInterval can also be used
findInterval(example$var1, unique(example$var1))
#[1] 1 2 2 1 3 4
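The factor and match approaches only agree here because the distinct values of var1 first appear in increasing order. A sketch on a made-up vector x (not from the question) shows the difference: factor codes follow sorted order, while match numbers values by first appearance:

```r
# Hypothetical vector whose distinct values do not first appear in sorted order
x <- c(5, 2, 5, 3)

id_sorted <- as.numeric(as.factor(x))  # levels are 2, 3, 5
id_first  <- match(x, unique(x))       # unique(x) is 5, 2, 3

id_sorted  # 3 1 3 2
id_first   # 1 2 1 3
```

Which one you want depends on whether the IDs should reflect the ordering of the values or the order in which they occur.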

How to Preprocess data to handle missing values in R

I am trying to pre-process my data in R so that I can use the "attribute mean for all samples belonging to the same class as the given tuple" to fill in missing values.
The missing or out-of-range values have already been given the value -1 by the data source provider. But I want to replace those missing values according to the data mining principle quoted above. The column that is my class decider is "Accident severity", and I want to use the attribute mean of all samples with the same level of accident severity as the tuple with the missing attribute value.
As there are multiple columns with missing values, I guess I will have to do the task repeatedly, one column at a time. What R command should I use?
There are mostly two data types (vectors) in my data frame: factor for the Date and Time columns, and integer for most of the other columns.
Is there a way that I can upload a subset of the data set here on stack overflow?
here is the link to the reproducible data set https://drive.google.com/file/d/0B3cafW7J7xSfSkRTYWRWMHhaU2c/edit?usp=sharing
Update 2: Now that the data set is there, please help me change the values where there is a "-1" in any of the columns to the mean of all tuples that have the same value for the attribute "Accident_severity" as the tuple with the missing value.
Update 3: Please ignore the columns "X2_roadclass" and "X2_Road_type" as they are mostly blank and I am dropping them. Thanks.
Please see if this is close to your need
library(ggplot2)
library(reshape)
library(plyr)
Create some data
set.seed(1)
df <- data.frame(severity=rep(c('high', 'moderate', 'low'), each = 3),
factor1 = rep(c(1,2,3), each = 6),
factor2 = rep(c(4,5,6), times = 3),
date=rep(c('2011-01-01','2011-01-03','2011-01-10'),
times = 3), stringsAsFactors = F)
With some -1
df$factor2[3] <- -1
df$factor1[1] <- -1
Replace them with NA
df[df == -1] <- NA
Reshape it
mdf <- melt(df, id.vars= c("severity", 'date'))
Summarize
ddply(mdf, .(severity, variable), summarise, mean=mean(value, na.rm = T))
severity variable mean
1 high factor1 1.6
2 high factor2 4.8
3 low factor1 2.5
4 low factor2 5.0
5 moderate factor1 2.0
6 moderate factor2 5.0
With the data provided, I'd do something like this
dt <- read.csv('./Stackoverflow/datatry1.csv')
#head(dt[ , -c(1:3) ]) # Exclude some unwanted columns
mdt <- melt(dt[ , -c(1:3) ], id.vars= c("Accident_Severity", 'Date',
'Day_of_Week', 'Time'))
dts <- ddply(mdt, .(Accident_Severity, variable), summarise,
mean=mean(value, na.rm = T))
dts
Accident_Severity variable mean
1 1 Number_of_Vehicles 1.00000000
2 1 X1st_Road_Class 3.00000000
3 1 X1st_Road_Number 503.00000000
4 1 Road_Type 6.00000000
5 1 Speed_limit 30.00000000
6 1 Junction_Detail 3.00000000
7 1 X2nd_Road_Class -1.00000000
...
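The summaries above report the per-class means but don't write them back into the data. One way to do the actual replacement in base R is ave(). This is a minimal sketch on made-up values, not the linked data set: the toy data frame, the column name factor1, and the helper fill_with_group_mean are all hypothetical.

```r
# Toy data: -1 marks a missing value, severity is the class column
df <- data.frame(severity = rep(c("high", "moderate", "low"), each = 3),
                 factor1  = c(1, -1, 3, 2, 2, 2, 5, 7, -1),
                 stringsAsFactors = FALSE)

# Treat -1 as missing
df$factor1[df$factor1 == -1] <- NA

# Hypothetical helper: replace NAs in x with the mean of x within each group
fill_with_group_mean <- function(x, group) {
  group_mean <- ave(x, group, FUN = function(v) mean(v, na.rm = TRUE))
  ifelse(is.na(x), group_mean, x)
}

df$factor1 <- fill_with_group_mean(df$factor1, df$severity)
df$factor1  # 1 2 3 2 2 2 5 7 6
```

For several columns you could call the helper once per numeric column, for example inside a loop or lapply over the column names. If a class has no observed values at all for a column, the mean is NaN and the gap stays unfilled.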

Creating new variable from three existing variables in R

I have a dataset that looks like the one below, and I would like to create a new variable based on these variables, which can be used with the other variables in the dataset.
The first variable, ID, is a respondent identification number. The med variable takes the values 1 and 2, indicating different treatments. Var1_v1 and Var1_v2 have four real options, 1, 2, 3, or 9, and these options are only given to those with med == 1. If med == 2, NA appears in the Var1s. Var2 is NA when med == 1 and has real values ranging from 1 to 3 when med == 2.
ID <- c(1,2,3,4,5,6,7,8,9,10,11)
med <- c(1,1,1,1,1,1,2,2,2,2,2)
Var1_v1 <- c(2,2,3,9,9,9,NA,NA,NA,NA,NA) #ranges from 1-3, and 9
Var1_v2 <- c(9,9,9,1,3,2,NA,NA,NA,NA,NA) #ranges from 1-3, and 9
Var2 <- c(NA,NA,NA,NA,NA,NA,3,3,1,3,2)
#tables to show you what data looks like relative to med var
table(Var1_v1, med)
table(Var1_v2, med)
table(Var2, med)
I've been looking around for a while to figure out a recoding/new variable creation code, but I have had no luck.
Ultimately, I would like to create a new variable, say Var3, based on three conditions:
Uses the values from Var1_v1 if the value = 1, 2, or 3
Uses the values from Var1_v2 if the value = 1, 2, or 3
Uses the values from Var2 if the value = 1, 2, or 3
And this variable should be able to match up with the ID number, so that it can be used within the dataset.
So the final variable should look like:
Var3 <- c(2,2,3,1,3,2,3,3,1,3,2)
Thanks!
Something like
v <- Var1_v1
v[Var1_v2 %in% 1:3] <- Var1_v2[Var1_v2 %in% 1:3]
v[Var2 %in% 1:3] <- Var2[Var2 %in% 1:3]
v
[1] 2 2 3 1 3 2 3 3 1 3 2
which uses one of them as a base (you could also use a pure NA vector) and simply fills in only parts that match.
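The pure NA vector variant mentioned above might look like the sketch below, using the vectors from the question. Starting from NA means a 9 that has no valid counterpart ends up as NA instead of carrying over from Var1_v1. The loop and the names v, src and ok are just one way to write it:

```r
# Vectors from the question
Var1_v1 <- c(2, 2, 3, 9, 9, 9, NA, NA, NA, NA, NA)
Var1_v2 <- c(9, 9, 9, 1, 3, 2, NA, NA, NA, NA, NA)
Var2    <- c(NA, NA, NA, NA, NA, NA, 3, 3, 1, 3, 2)

# Start from an all-NA base, then fill in only the values in 1:3
v <- rep(NA_real_, length(Var1_v1))
for (src in list(Var1_v1, Var1_v2, Var2)) {
  ok <- src %in% 1:3   # TRUE where this source has a valid value
  v[ok] <- src[ok]     # later sources overwrite earlier ones
}
v
```

Because the vectors stay in the same order as ID, v can be attached to the dataset directly as the new Var3 column.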
