Search through range of 'education' variables and assign a number based on the highest qualification contained within these variables - r

I'm currently an R novice and have come across an issue while recoding variables in my dataset. I would really appreciate any advice you may have on this issue. I have several different "education_code" variables that contain information about a given individual's educational qualifications.
#create simulated data
df = data.frame(ID = c(1001,1002,1003,1004, 1005,1006,1007,1008,1009,1010,1011),
education_code_1 = c('1','2','1','1','NA', '5', '2', '3', 'NA','2','5'),
education_code_2 = c('2','4','3','4','5', '2','1','2','5','1','3'),
education_code_3 = c('3', '3','NA', '4','2', '1','NA','3','4','NA','2'))
Which looks like this:
ID education_code_1 education_code_2 education_code_3
1 1001 1 2 3
2 1002 2 4 3
3 1003 1 3 NA
4 1004 1 4 4
5 1005 NA 5 2
6 1006 5 2 1
7 1007 2 1 NA
8 1008 3 2 3
9 1009 NA 5 4
10 1010 2 1 NA
11 1011 5 3 2
Assuming that a higher value represents a higher educational level, I would like to create a new variable "Highest_degree_obtained" (below) that assigns a number based on the highest value contained within columns 2:4.
df$Highest_degree_obtained <- NA
Any suggestions on how to go about doing this?

You could just use apply
df$Highest_degree_obtained <- apply(df[, -1], 1, function(x) {
max(as.numeric(as.character(x)), na.rm = TRUE)
})
df$Highest_degree_obtained
[1] 3 4 3 4 5 5 2 3 5 2 5
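If the apply call gets slow on a large dataset, a vectorised variant is possible. This is only a sketch, assuming the codes are stored as character with literal 'NA' strings as in the simulated data; edu_cols and codes are names I made up for illustration.

```r
# Same simulated data as in the question (codes stored as character, with "NA" strings)
df <- data.frame(
  ID = c(1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008, 1009, 1010, 1011),
  education_code_1 = c('1', '2', '1', '1', 'NA', '5', '2', '3', 'NA', '2', '5'),
  education_code_2 = c('2', '4', '3', '4', '5', '2', '1', '2', '5', '1', '3'),
  education_code_3 = c('3', '3', 'NA', '4', '2', '1', 'NA', '3', '4', 'NA', '2'),
  stringsAsFactors = FALSE
)

# Pick the education columns by name rather than by position
edu_cols <- grep("^education_code_", names(df), value = TRUE)

# Turn the literal "NA" strings into real missing values, then convert to numeric
codes <- lapply(df[edu_cols], function(x) as.numeric(replace(x, x == "NA", NA)))

# pmax() takes the element-wise (here: row-wise) maximum across the columns
df$Highest_degree_obtained <- do.call(pmax, c(codes, na.rm = TRUE))
df$Highest_degree_obtained
```

Replacing the 'NA' strings explicitly also avoids the "NAs introduced by coercion" warning that as.numeric() gives on them.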

Related

creating adjacency network matrix (or list) from large csv dataset using igraph

I am trying to do network analysis in igraph but am having some issues transforming my dataset into an edge list (with weights), given the differing number of columns.
The data set looks as follows (much larger, of course). The first column is the main operator ID (a main operator can also be a partner and vice versa, so the IDs stay the same in the adjacency). The challenge is that the number of partners varies (from 0 to 40).
IdMain IdPartner1 IdPartner2 IdPartner3 IdPartner4 .....
1 4 3 2 NA
2 3 1 NA NA
3 1 4 7 6
4 9 6 3 NA
.
.
My question is how to transform this into an undirected edge list with weights (just expressing interaction):
Id1 Id2 weight
1 2 2
1 3 2
1 4 1
2 3 1
3 4 2
. .
Does anyone have a tip on the best way to do this? Many thanks in advance!
This is a classic reshaping task. You can use the reshape2 package for this.
text <- "IdMain IdPartner1 IdPartner2 IdPartner3 IdPartner4
1 4 3 2 NA
2 3 NA NA NA
3 1 4 7 6
4 9 NA NA NA"
data <- read.delim(text = text, sep = "")
library(reshape2)
data_melt <- reshape2::melt(data, id.vars = "IdMain")
edgelist <- data_melt[!is.na(data_melt$value), c("IdMain", "value")]
head(edgelist, 4)
# IdMain value
# 1 1 4
# 2 2 3
# 3 3 1
# 4 4 9
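The melted result lists directed (main, partner) pairs without weights. Here is a rough base R sketch of the remaining step, assuming the weight is meant to be the number of times an unordered pair appears in the data (that is my reading of the question, not something it states). The names pairs and edges are made up for illustration.

```r
text <- "IdMain IdPartner1 IdPartner2 IdPartner3 IdPartner4
1 4 3 2 NA
2 3 NA NA NA
3 1 4 7 6
4 9 NA NA NA"
data <- read.table(text = text, header = TRUE)

# Long format: one row per (main, partner) pair.
# as.vector() on a matrix goes column by column, so IdMain is repeated once per column.
partners <- as.matrix(data[, -1])
pairs <- data.frame(
  main    = rep(data$IdMain, times = ncol(partners)),
  partner = as.vector(partners)
)
pairs <- pairs[!is.na(pairs$partner), ]

# Undirected: order each pair so that Id1 <= Id2
pairs$Id1 <- pmin(pairs$main, pairs$partner)
pairs$Id2 <- pmax(pairs$main, pairs$partner)

# Assumed weight: how often each unordered pair occurs
pairs$weight <- 1
edges <- aggregate(weight ~ Id1 + Id2, data = pairs, FUN = sum)
edges <- edges[order(edges$Id1, edges$Id2), ]
edges
```

A data frame shaped like edges (two ID columns followed by a weight column) can then be passed to igraph::graph_from_data_frame(edges, directed = FALSE).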

aggregate dataframe subsets in R

I have the dataframe ds
CountyID ZipCode Value1 Value2 Value3 ... Value25
1 1 0 etc etc etc
2 1 3
3 1 0
4 1 1
5 2 2
6 3 3
7 4 7
8 4 2
9 5 1
10 6 0
and would like to aggregate based on ds$ZipCode and set ds$CountyID equal to the primary county based on the highest ds$Value1. For the above example, it would look like this:
CountyID ZipCode Value1 Value2 Value3 ... Value25
2 1 4 etc etc etc
5 2 2
6 3 3
7 4 9
9 5 1
10 6 0
All the ValueX columns are the sum of that column grouped by ZipCode.
I've tried a bunch of different strategies over the last couple days, but none of them work. The best I've come up with is
#initialize the dataframe
ds_temp = data.frame()
#loop through each subset based on unique zipcodes
for (zip in unique(ds$ZipCode)) {
sub <- subset(ds, ds$ZipCode == zip)
len <- length(sub)
maxIndex <- which.max(sub$Value1)
#do the aggregation
row <- aggregate(sub[3:27], FUN=sum, by=list(
CountyID = rep(sub$CountyID[maxIndex], len),
ZipCode = sub$ZipCode))
rbind(ds_temp, row)
}
ds <- ds_temp
I haven't been able to test this on the real data, but with dummy datasets (such as the one above), I keep getting the error "arguments must have the same length". I've messed around with rep() and fixed vectors (e.g. c(1,2,3,4)), but no matter what I do, the error persists. I also occasionally get an error to the effect of
cannot subset data of type 'closure'.
Any ideas? I've also tried messing around with data.frame(), ddply(), data.table(), dcast(), etc.
You can try this:
data.frame(aggregate(df[,3:27], by=list(df$ZipCode), sum),
CountyID = unlist(lapply(split(df, df$ZipCode),
function(x) x$CountyID[which.max(x$Value1)])))
Fully reproducible sample data:
df<-read.table(text="
CountyID ZipCode Value1
1 1 0
2 1 3
3 1 0
4 1 1
5 2 2
6 3 3
7 4 7
8 4 2
9 5 1
10 6 0", header=TRUE)
data.frame(aggregate(df[,3], by=list(df$ZipCode), sum),
CountyID = unlist(lapply(split(df, df$ZipCode),
function(x) x$CountyID[which.max(x$Value1)])))
# Group.1 x CountyID
#1 1 4 2
#2 2 2 5
#3 3 3 6
#4 4 9 7
#5 5 1 9
#6 6 0 10
In response to your comment on Frank's answer, you can preserve the column names by using the formula method in aggregate. Using Frank's data df, this would be
> cbind(aggregate(Value1 ~ ZipCode, df, sum),
CountyID = sapply(split(df, df$ZipCode), function(x) {
with(x, CountyID[Value1 == max(Value1)]) }))
# ZipCode Value1 CountyID
# 1 1 4 2
# 2 2 2 5
# 3 3 3 6
# 4 4 9 7
# 5 5 1 9
# 6 6 0 10
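One thing to watch in both answers is ties: if two counties in a zip code share the maximum Value1, CountyID[Value1 == max(Value1)] returns more than one ID. Below is a hedged sketch that sums every column whose name starts with "Value" and keeps the first county on a tie; value_cols, sums, primary and result are names invented for this example.

```r
df <- read.table(text = "
CountyID ZipCode Value1
1 1 0
2 1 3
3 1 0
4 1 1
5 2 2
6 3 3
7 4 7
8 4 2
9 5 1
10 6 0", header = TRUE)

# All value columns, however many there are (Value1 ... Value25 in the real data)
value_cols <- grep("^Value", names(df), value = TRUE)

# Column sums per zip code, keeping the original column names
sums <- aggregate(df[value_cols], by = list(ZipCode = df$ZipCode), FUN = sum)

# Primary county: sort by zip code and descending Value1, keep the first row per zip
ord <- df[order(df$ZipCode, -df$Value1), ]
primary <- ord[!duplicated(ord$ZipCode), c("ZipCode", "CountyID")]

result <- merge(primary, sums, by = "ZipCode")
result
```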

counting occurrences in column and create variable in R

I am new to R and I have a data.frame called "CT" with a column called "ID" that holds several hundred different identification numbers (these are patients). Most numbers appear once, but some appear two or three times (and therefore in different rows).
In the CT data.frame, I would like to insert a new variable called "countID" that gives the number of occurrences of each patient (patients with multiple records should still appear several times).
I tried two different strategies after reading this forum:
1st strategy:
CT <- cbind(CT, countID=sequence(rle(CT.long$ID)$lengths))
But this doesn't work, I get only one count.
2nd strategy: create a data frame with two columns (one is ID, one is count) and then match this data frame with CT:
tabs <- table(CT.long$ID)
out <- data.frame(item=names(unlist(tabs)),count=unlist(tabs)[],stringsAsFactors=FALSE)
rownames(out) = c()
head(out)
# item count
# 1 1.312 1
# 2 1.313 2
# 3 1.316 1
# 4 1.317 1
# 5 1.321 1
# 6 1.322 1
So this works fine, but I can't merge the two data.frames: the number of rows doesn't match between "out" and "CT" (out has fewer rows, of course).
Maybe someone has an elegant solution to add the number of occurrences directly in the data.frame CT, or correctly match the two data.frames?
You were almost there! rle will work very nicely, you just need to sort your table on ID before computing rle:
CT <- data.frame( value = runif(10) , id = sample(5, 10, replace = TRUE) )
# sort on ID when calculating rle
Count <- rle( sort( CT$id ) )
# match values
CT$Count <- Count[[1]][ match( CT$id , Count[[2]] ) ]
CT
# value id Count
#1 0.94282600 1 4
#2 0.12170165 2 2
#3 0.04143461 1 4
#4 0.76334609 3 2
#5 0.87320740 4 1
#6 0.89766749 1 4
#7 0.16539820 1 4
#8 0.98521044 5 1
#9 0.70609853 3 2
#10 0.75134208 2 2
data.table usually provides the quickest way
set.seed(3)
library(data.table)
ct <- data.table(id=sample(1:10,15,replace=TRUE),item=round(rnorm(15),3))
st <- ct[,countid:=.N,by=id]
id item countid
1: 2 0.953 2
2: 9 0.535 2
3: 4 -0.584 2
4: 4 -2.161 2
5: 7 -1.320 3
6: 7 0.810 3
7: 2 1.342 2
8: 3 0.693 1
9: 6 -0.323 5
10: 7 -0.117 3
11: 6 -0.423 5
12: 6 -0.835 5
13: 6 -0.815 5
14: 6 0.794 5
15: 9 0.178 2
If you don't feel the need to use base R, plyr makes this task easy:
> set.seed(3)
> library(plyr)
> ct <- data.frame(id=sample(1:10,15,replace=TRUE),item=round(rnorm(15),3))
> ct <- ddply(ct,.(id),transform,idcount=length(id))
> head(ct)
id item idcount
1 2 0.953 2
2 2 1.342 2
3 3 0.693 1
4 4 -0.584 2
5 4 -2.161 2
6 6 -0.323 5
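For completeness, base R can also do this without sorting or extra packages by using ave(). A small sketch on made-up data (the IDs and values below are invented, only the column names ID and countID follow the question):

```r
# Made-up example data; 1.313 appears three times
CT <- data.frame(ID = c(1.312, 1.313, 1.313, 1.316, 1.317, 1.313),
                 value = c(10, 20, 30, 40, 50, 60))

# ave() applies length() within each ID group and returns one result per row,
# so the row order of CT is left untouched
CT$countID <- ave(seq_along(CT$ID), CT$ID, FUN = length)
CT
```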

R - Create a column with entries only for the first row of each subset

For instance if I have this data:
ID Value
1 2
1 2
1 3
1 4
1 10
2 9
2 9
2 12
2 13
And my goal is to find the smallest value for each ID subset, and I want the number to be in the first row of the ID group while leaving the other rows blank, such that:
ID Value Start
1 2 2
1 2
1 3
1 4
1 10
2 9 9
2 9
2 12
2 13
My first instinct is to create an index for the IDs using
A <- transform(A, INDEX=ave(ID, ID, FUN=seq_along)) ## A being the name of my data
Since I am a noob, I get stuck at this point. For each ID=n, I want to find the min(A$Value) for that ID subset and place that into the cell matching condition of ID=n and INDEX=1.
Any help is much appreciated! I am sorry that I keep asking questions :(
Here's a solution:
within(A, INDEX <- "is.na<-"(ave(Value, ID, FUN = min), c(FALSE, !diff(ID))))
ID Value INDEX
1 1 2 2
2 1 2 NA
3 1 3 NA
4 1 4 NA
5 1 10 NA
6 2 9 9
7 2 9 NA
8 2 12 NA
9 2 13 NA
Update:
How it works? The command ave(Value, ID, FUN = min) applies the function min to each subset of Value along the values of ID. For the example, it returns a vector of five times 2 and four times 9. Since all values except the first in each subset should be NA, the function "is.na<-" replaces all values at the logical index defined by c(FALSE, !diff(ID)). This index is TRUE if a value is identical with the preceding one.
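To make that explanation easier to follow, here is the same logic unpacked into separate steps. It is a sketch that assumes, like the one-liner, that rows with the same ID are adjacent; group_min and not_first are names chosen for illustration.

```r
A <- data.frame(ID    = c(1, 1, 1, 1, 1, 2, 2, 2, 2),
                Value = c(2, 2, 3, 4, 10, 9, 9, 12, 13))

# Step 1: the group minimum, repeated for every row of the group
group_min <- ave(A$Value, A$ID, FUN = min)

# Step 2: TRUE for rows whose ID equals the ID of the previous row,
# i.e. every row except the first one of each group
not_first <- c(FALSE, !diff(A$ID))

# Step 3: set those positions to NA (this is what "is.na<-" does)
is.na(group_min) <- not_first

A$INDEX <- group_min
A
```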
You're almost there. We just need to make a custom function instead of seq_along and to split value by ID (not ID by ID).
first_min <- function(x){
nas <- rep(NA, length(x))
nas[which.min(x)] <- min(x, na.rm=TRUE)
nas
}
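This function makes a vector of NAs and puts the minimum of Value at the position where that minimum first occurs. In the example the values are sorted within each ID, so that position is the first row of the group.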
transform(dat, INDEX=ave(Value, ID, FUN=first_min))
## ID Value INDEX
## 1 1 2 2
## 2 1 2 NA
## 3 1 3 NA
## 4 1 4 NA
## 5 1 10 NA
## 6 2 9 9
## 7 2 9 NA
## 8 2 12 NA
## 9 2 13 NA
You can achieve this with a tapply one-liner
df$Start<-as.vector(unlist(tapply(df$Value,df$ID,FUN = function(x){ return (c(min(x),rep("",length(x)-1)))})))
I keep going back to this question and the above answers helped me greatly.
There is a basic solution for beginners too:
A$Start<-NA
A[!duplicated(A$ID),]$Start<-A[!duplicated(A$ID),]$Value
Thanks.
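A caveat on the duplicated() approach: it copies the Value of the first row of each ID, which is the minimum only if the data are sorted by Value within ID. A sketch that puts the group minimum in the first row regardless of order, using deliberately unsorted made-up data:

```r
# Made-up data, not sorted by Value within ID
A <- data.frame(ID    = c(1, 1, 1, 2, 2),
                Value = c(3, 2, 4, 12, 9))

# Group minimum for every row, then blank out everything but the first row per ID
A$Start <- ave(A$Value, A$ID, FUN = min)
A$Start[duplicated(A$ID)] <- NA
A
```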

Subset data frame for rows equal to one value but not the other

Suppose I have a data frame like this:
df <- data.frame(id = rep(c(1001:1004), each = 3), value = c(1,1,4,1,2,3,2,2,5,1,5,6))
df
id value
1 1001 1
2 1001 1
3 1001 4
4 1002 1
5 1002 2
6 1002 3
7 1003 2
8 1003 2
9 1003 5
10 1004 1
11 1004 5
12 1004 6
What is a good way to return the IDs that have a value equal to 1 but none equal to 3? That is, an ID must have at least one value equal to 1, and any ID that has a value equal to 3 is excluded. In this case, ID 1002 has 1 but also has 3 and shall be excluded. ID 1003 doesn't have any value equal to 1 and shall be excluded too. So IDs 1001 and 1004 will be returned. Thanks!
You can get the ID's that contain 1 with df$id[df$value == 1], and likewise for 3 df$id[df$value == 3]. To exclude one set from the other, you can use setdiff.
In one command: with(df,setdiff(id[value == 1],id[value == 3]))
This is another alternative.
unique(df$id[df$value==1][! df$id[df$value==1] %in% df$id[df$value==3]])
[1] 1001 1004
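Both answers return the IDs. If the goal is the matching rows of the data frame, as the title suggests, the IDs can be used as a filter. A small sketch; keep_ids and result are names made up here.

```r
df <- data.frame(id = rep(1001:1004, each = 3),
                 value = c(1, 1, 4, 1, 2, 3, 2, 2, 5, 1, 5, 6))

# IDs with at least one value of 1, minus IDs with any value of 3
keep_ids <- with(df, setdiff(id[value == 1], id[value == 3]))

# Keep every row belonging to those IDs
result <- df[df$id %in% keep_ids, ]
result
```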
