Creating new variable from three existing variables in R - r

I have a dataset that looks like the one below, and I would like to create a new variable based on these variables, which can be used with the other variables in the dataset.
The first variable, ID, is a respondent identification number. The med variable takes the values 1 and 2, indicating different treatments. Var1_v1 and Var1_v2 have four real options (1, 2, 3, or 9), and these options are only given to respondents with med == 1. If med == 2, the Var1 variables are NA. Var2 is NA when med == 1 and has real values ranging from 1 to 3 when med == 2.
ID <- c(1,2,3,4,5,6,7,8,9,10,11)
med <- c(1,1,1,1,1,1,2,2,2,2,2)
Var1_v1 <- c(2,2,3,9,9,9,NA,NA,NA,NA,NA) #ranges from 1-3, and 9
Var1_v2 <- c(9,9,9,1,3,2,NA,NA,NA,NA,NA) #ranges from 1-3, and 9
Var2 <- c(NA,NA,NA,NA,NA,NA,3,3,1,3,2)
#tables to show you what data looks like relative to med var
table(Var1_v1, med)
table(Var1_v2, med)
table(Var2, med)
I've been looking around for a while to figure out a recoding/new variable creation code, but I have had no luck.
Ultimately, I would like to create a new variable, say Var3, based on three conditions:
Uses the values from Var1_v1 if the value = 1, 2, or 3
Uses the values from Var1_v2 if the value = 1, 2, or 3
Uses the values from Var2 if the value = 1, 2, or 3
And this variable should be able to match up with the ID number, so that it can be used within the dataset.
So the final variable should look like:
Var3 <- c(2,2,3,1,3,2,3,3,1,3,2)
Thanks!

Something like
v <- Var1_v1
v[Var1_v2 %in% 1:3] <- Var1_v2[Var1_v2 %in% 1:3]
v[Var2 %in% 1:3] <- Var2[Var2 %in% 1:3]
v
[1] 2 2 3 1 3 2 3 3 1 3 2
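This uses one of the vectors as a base (you could also start from a pure NA vector) and fills in only the elements that match. Any 9 in the base that no later vector overwrites stays in the result, so the NA base is the safer choice in general.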
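The pure NA base mentioned above can be sketched as follows. This is only an illustration in base R: the loop variable src and the data frame name dat are assumptions made for the example, not part of the original answer.

```r
ID      <- 1:11
Var1_v1 <- c(2, 2, 3, 9, 9, 9, NA, NA, NA, NA, NA)
Var1_v2 <- c(9, 9, 9, 1, 3, 2, NA, NA, NA, NA, NA)
Var2    <- c(NA, NA, NA, NA, NA, NA, 3, 3, 1, 3, 2)

# start from an all-NA vector so that no 9 can leak into the result
Var3 <- rep(NA_real_, length(ID))

# copy over only the elements that are 1, 2, or 3 (NA %in% 1:3 is FALSE)
for (src in list(Var1_v1, Var1_v2, Var2)) {
  keep <- src %in% 1:3
  Var3[keep] <- src[keep]
}

# keep the new variable next to ID so it lines up with the other columns
dat <- data.frame(ID, Var3)
```

Because Var3 is built position by position, it stays aligned with ID and can be added to the existing data frame as a column.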

Related

Exclude variables based on pattern and melt

My data look like this
id var1 var1_a var2 var2_a var3 var3_a
1 1 7 7 8 9 4
2 2 4 8 7 6 5
3 5 5 1 2 3 4
4 6 9 5 6 7 8
I want to select var1, var2, and var3 only, and exclude var1_a, var2_a and var3_a. The variable names may vary in length.
I know I can use something like
dt.m<-melt(dt, id=1, measure.vars=c(2, 4, 6), na.rm=TRUE)
but I don't want to use this approach because I have too many variables.
How can I do this using patterns or a similar approach?
If the measure column names have a pattern to them then use grep to find which they are. In the example, the variables of interest all end in a digit so we could use this:
melt(dt, id = 1, measure = grep("\\d$", names(dt)), na.rm = TRUE)
or if the columns of interest are in predictable positions use seq or similar approach to generate the column numbers.
melt(dt, id = 1, measure = seq(2, 6, 2), na.rm = TRUE)
Other ways to pick out the names that work in the example are:
# pick out column names that have 4 characters
which(nchar(names(dt)) == 4)
# pick out names having no underscore and that are not first
grep("_", names(dt), invert = TRUE)[-1]
# pick out even positions
which( (1:ncol(dt)) %% 2 == 0)
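As a quick check, the selectors above can be compared on a copy of the example data. This is a minimal base R sketch: dt is rebuilt here from the table in the question, and the names by_digit, by_nchar and so on are made up for the illustration.

```r
# the example data from the question
dt <- data.frame(id = 1:4,
                 var1 = c(1, 2, 5, 6), var1_a = c(7, 4, 5, 9),
                 var2 = c(7, 8, 1, 5), var2_a = c(8, 7, 2, 6),
                 var3 = c(9, 6, 3, 7), var3_a = c(4, 5, 4, 8))

by_digit  <- grep("\\d$", names(dt))                  # names ending in a digit
by_seq    <- seq(2, 6, 2)                             # fixed positions
by_nchar  <- which(nchar(names(dt)) == 4)             # names with 4 characters
by_no_us  <- grep("_", names(dt), invert = TRUE)[-1]  # no underscore, not first
by_even   <- which((1:ncol(dt)) %% 2 == 0)            # even positions

# every selector should pick the same measure columns
names(dt)[by_digit]
```

Only the grep on the trailing digit survives a change in column order or name length, which is the situation the question describes.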
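Sorry, I'd comment but I don't have enough rep yet. If your variables are actually named var1, var1_a, etc., you can use gsub: a name that is unchanged after removing "_a" is one of the columns you want.
# build 200 shuffled column names: var1..var100 and var1_a..var100_a
names1 <- paste0("var", seq(1, 100))
names2 <- paste0("var", seq(1, 100), "_a")
col_names <- sample(c(names1, names2))
# 10 rows x 200 columns of random data
x <- matrix(rnorm(200 * 10), nrow = 10)
d <- data.frame(x)
names(d) <- col_names
# keep only the columns whose name contains no "_a"
d.m <- d[, gsub("_a", "", names(d)) == names(d)]
print(names(d.m))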

apply function by name of list

Imagine that I have a list
l <- list("a" = 1, "b" = 2)
and a data frame
id value
a 3
b 4
I want to match id against the list names and apply a function to the matching list element and the value in the data frame. For example, if I want the sum of value in the data frame and the corresponding value in the list, I should get
id value
a 4
b 6
Does anyone have a clue?
Edit:
I want to expand the question a little. Now I have more than one value in every element of the list.
l <- list("a" = c(1, 2), "b" =c(1, 2))
I still want the sum
id value
a 6
b 7
We can match the names of the list with id of dataframe, unlist the list accordingly and add it to value
df$value <- unlist(l[match(df$id, names(l))]) + df$value
df
# id value
#1 a 4
#2 b 6
EDIT
If we have multiple entries in list we need to sum every list after matching. We can do
df$value <- df$value + sapply(l[match(df$id, names(l))], sum)
df
# id value
#1 a 6
#2 b 7
You just need
df$value <- df$value + unlist(l)[as.character(df$id)]  # a named vector can be indexed by name
df
  id value
1  a     4
2  b     6
Trying it alongside Ronak's answer, with the list in a different order:
l <- list("b" = 2, "a" = 1)
unlist(l)[as.character(df$id)]  # as.character() is needed if id in df is a factor
a b
1 2
Update
df$value <- df$value + unlist(lapply(l, sum))[as.character(df$id)]
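Putting the pieces from both answers together, the multi-value case can be sketched as below. This is an illustration under assumptions: df is rebuilt from the question with id stored as character, and sums is a name introduced only for the example.

```r
l  <- list(a = c(1, 2), b = c(1, 2))
df <- data.frame(id = c("a", "b"), value = c(3, 4), stringsAsFactors = FALSE)

# one total per list element, returned as a named numeric vector
sums <- vapply(l, sum, numeric(1))

# look the totals up by name; as.character() also covers a factor id,
# which would otherwise be used as integer positions
df$value <- df$value + unname(sums[as.character(df$id)])
df
```

Indexing by name rather than by position means the result does not depend on the order of the list elements.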

R: get corresponding value from another data frame

I'm new to R and here and I need some help to structure my data.
I have two data sets:
One of them is a long format within subjects data set which is large and looks a little bit like this:
long.format <- data.frame(subject.no = c(1, 1, 1, 1, 2, 2, 2, 2), condition = c("prime", "prime", "prime", "prime", "control", "control","control","control"), response = c(1,1,1,0,1,1,1,0))
  subject.no condition response
1          1     prime        1
2          1     prime        1
3          1     prime        1
4          1     prime        0
5          2   control        1
6          2   control        1
7          2   control        1
8          2   control        0
The other one is already in wide format and looks like this
wide.format <- data.frame(subject = c(1, 2), age = c(26,27), gender = c("m","f"))
  subject age gender
1       1  26      m
2       2  27      f
The only thing I want to do now is to get the value in "condition" (and only this!) from the long format data frame to the corresponding subject in the wide data frame by adding a new column in the wide data frame (by using the columns subject.no and subject, respectively).
So the final data frame should look like this:
wide.format.aim <- data.frame(subject = c(1, 2), age = c(26,27), gender = c("m","f"), condition = c("prime","control"))
  subject age gender condition
1       1  26      m     prime
2       2  27      f   control
I've tried merging but this ended up with a long format data frame added with the information from the wide format data frame... but I want it the other way around...
This is what I've tried:
test.it <- merge(x=wide.format, y=long.format[,c("subject.no", "condition")], all.x=T, by.x="subject", by.y="subject.no")
Any suggestions?
Thanks in advance!
You are interested in merging the unique values from long.format[,c("subject.no", "condition")]:
unique(long.format[,c("subject.no", "condition")])
# subject.no condition
#1 1 prime
#5 2 control
You can merge using those values
merge(x = wide.format,
      y = unique(long.format[, c("subject.no", "condition")]),
      by.x = "subject",
      by.y = "subject.no")
# subject age gender condition
#1 1 26 m prime
#2 2 27 f control
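If only the single column is needed, a lookup with match is another option. This is a sketch rather than part of the answer above; it assumes that every subject has the same condition on all of its rows, because match returns only the first matching row.

```r
long.format <- data.frame(
  subject.no = c(1, 1, 1, 1, 2, 2, 2, 2),
  condition  = rep(c("prime", "control"), each = 4),
  response   = c(1, 1, 1, 0, 1, 1, 1, 0),
  stringsAsFactors = FALSE
)
wide.format <- data.frame(subject = c(1, 2), age = c(26, 27),
                          gender = c("m", "f"), stringsAsFactors = FALSE)

# position of the first long-format row for each subject in wide.format
first_row <- match(wide.format$subject, long.format$subject.no)

# copy the condition found at those rows into a new column
wide.format$condition <- long.format$condition[first_row]
wide.format
```

Unlike merge, this keeps the row order of wide.format and yields NA for a subject that does not appear in long.format.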

How to Preprocess data to handle missing values in R

I am trying to pre-process my data in R so that I can fill in missing values using the "attribute mean for all samples belonging to the same class as the given tuple".
The missing or out-of-range values have already been given the value -1 by the data source provider. I want to replace those missing values according to the data mining principle quoted above. The column that decides my class is "Accident severity", and I want to use the attribute mean over all samples that have the same level of accident severity as the tuple with the missing attribute value.
As there are multiple columns with missing values, I guess I will have to do the task repeatedly, one column at a time. What R command should I use?
There are mostly two data types (vectors) in my data frame: factor for the Date and Time columns, and integer for most of the other columns.
Is there a way that I can upload a subset of the data set here on Stack Overflow?
Here is the link to the reproducible data set: https://drive.google.com/file/d/0B3cafW7J7xSfSkRTYWRWMHhaU2c/edit?usp=sharing
Update 2: Now that the data set is there, please help me change the values where there is a "-1" in any of the columns to the mean of all tuples that have the same value for the attribute "Accident_severity" as the tuple with the missing value.
Update 3: Please ignore the columns "X2_roadclass" and "X2_Road_type", as they are mostly blank and I am dropping them. Thanks.
Please see if this is close to your need
library(ggplot2)
library(reshape)
library(plyr)
Create some data
set.seed(1)
# the 9-element columns are recycled to match factor1, giving 18 rows
df <- data.frame(severity = rep(c('high', 'moderate', 'low'), each = 3),
                 factor1 = rep(c(1, 2, 3), each = 6),
                 factor2 = rep(c(4, 5, 6), times = 3),
                 date = rep(c('2011-01-01', '2011-01-03', '2011-01-10'),
                            times = 3), stringsAsFactors = FALSE)
With some -1
df$factor2[3] <- -1
df$factor1[1] <- -1
Replace them with NA
df[df == -1] <- NA
Reshape it
mdf <- melt(df, id.vars= c("severity", 'date'))
Summarize
ddply(mdf, .(severity, variable), summarise, mean=mean(value, na.rm = T))
severity variable mean
1 high factor1 1.6
2 high factor2 4.8
3 low factor1 2.5
4 low factor2 5.0
5 moderate factor1 2.0
6 moderate factor2 5.0
With the data provided, I'd do something like this
dt <- read.csv('./Stackoverflow/datatry1.csv')
#head(dt[ , -c(1:3) ]) # Exclude some unwanted columns
mdt <- melt(dt[ , -c(1:3) ], id.vars = c("Accident_Severity", 'Date',
                                         'Day_of_Week', 'Time'))
dts <- ddply(mdt, .(Accident_Severity, variable), summarise,
             mean = mean(value, na.rm = TRUE))
dts
Accident_Severity variable mean
1 1 Number_of_Vehicles 1.00000000
2 1 X1st_Road_Class 3.00000000
3 1 X1st_Road_Number 503.00000000
4 1 Road_Type 6.00000000
5 1 Speed_limit 30.00000000
6 1 Junction_Detail 3.00000000
7 1 X2nd_Road_Class -1.00000000
...
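The summaries above compute the per-class means but do not write them back. One way to do the replacement itself is sketched below in base R with ave, on the same toy data. The names num_cols, x and grp_mean are assumptions for the example; for the real file, num_cols would have to list the actual numeric columns and df$severity would become the Accident_Severity column.

```r
# same toy data as above, written out at its full 18 rows
df <- data.frame(
  severity = rep(c("high", "moderate", "low"), each = 3, times = 2),
  factor1  = rep(c(1, 2, 3), each = 6),
  factor2  = rep(c(4, 5, 6), times = 6),
  stringsAsFactors = FALSE
)
df$factor1[1] <- -1
df$factor2[3] <- -1

num_cols <- c("factor1", "factor2")  # columns to repair (assumed names)
for (col in num_cols) {
  x <- df[[col]]
  x[which(x == -1)] <- NA            # treat the -1 placeholder as missing
  # mean of the non-missing values within each severity level,
  # returned as a vector with one entry per row
  grp_mean <- ave(x, df$severity, FUN = function(v) mean(v, na.rm = TRUE))
  x[is.na(x)] <- grp_mean[is.na(x)]  # fill each gap with its class mean
  df[[col]] <- x
}
head(df, 3)
```

Looping over the column names handles all affected columns in one pass, so the task does not have to be repeated by hand for each column.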

R: merging copies of the same variable

I have data like this in R:
subjID = c(1,2,3,4)
var1 = c(3,8,NA,6)
var1.copy = c(NA,NA,5,NA)
fake = data.frame(subjID = subjID, var1 = var1, var1 = var1.copy)
which looks like this:
> fake
subjID var1 var1.1
1 1 3 NA
2 2 8 NA
3 3 NA 5
4 4 6 NA
var1 and var1.1 represent the same variable, so each subject has NA in one column and a numerical value in the other (no one has two NAs or two numbers). I want to merge the columns to get a single var1: (3, 8, 5, 6).
Any tips on how to do this?
If you're only dealing with two columns, and there are never two numbers or two NAs, you can calculate the row mean and ignore missing values:
fake$fixed <- rowMeans(fake[, c("var1", "var1.1")], na.rm=TRUE)
You can use is.na, which is vectorised:
# get all the ones we can from var1
var.merged <- var1
# which ones are available in var1.copy but not in var1?
ind <- is.na(var1) & !is.na(var1.copy)
# use those to fill in the blanks
var.merged[ind] <- var1.copy[ind]
It depends on how you want to merge if there are conflicts.
You could simply put all non-NA values in var1.1 into the corresponding slot of var1. In case of conflicts, this will favour var1.1.
var1[!is.na(var1.copy)] <- var1.copy[!is.na(var1.copy)]
You could just fill in all NA values in var1 with corresponding values of var1.1. In case of conflict, this will favour var1.
var1[is.na(var1)] <- var1.copy[is.na(var1)]
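The second option (favouring var1) can also be written against the data frame in one step with ifelse. This is a small sketch; the column name merged is made up for the example.

```r
fake <- data.frame(subjID = c(1, 2, 3, 4),
                   var1   = c(3, 8, NA, 6),
                   var1.1 = c(NA, NA, 5, NA))

# take var1 where it is present, otherwise fall back to var1.1
fake$merged <- ifelse(is.na(fake$var1), fake$var1.1, fake$var1)
fake
```

On this data the result agrees with the rowMeans approach, since no row has two numbers; with conflicts, ifelse keeps var1 while rowMeans would average the two values.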

Resources