processing data frame in R


I have this data frame. I would like to list each unique Dept and place the corresponding Name values under it. As you can see, each Dept appears more than once. For example, the final document should look like this:
Internet
  Public-Web
  Intranet
BackOffice
  Batch
  BackEnd
BackEnd
  WebLogic
  Oracle
dput(x)
structure(list(ID = c(1234L, 2345L, 6789L, 3456L, 7890L, 1987L
), Name = structure(c(5L, 3L, 2L, 1L, 6L, 4L), .Label = c("BackEnd",
"Batch", "Intranet", "Oracle", "Public-Web", "WebLogic"), class = "factor"),
Dept = structure(c(3L, 3L, 2L, 2L, 1L, 1L), .Label = c("BackEnd",
"BackOffice", "Internet"), class = "factor")), .Names = c("ID",
"Name", "Dept"), class = "data.frame", row.names = c(NA, -6L))
Any ideas how I would do this in R?

I'll assume you may have duplicates, and therefore use unique:
for (dept in unique(x$Dept)) {
  print(dept)
  x2 <- subset(x, subset = Dept == dept)
  for (name in unique(x2$Name)) {
    print(paste(sep = "", " ", name))
  }
}
Replace the print with whatever you need.

You can use split to achieve this (assuming the data frame is named df):
split(as.character(df$Name), df$Dept)
# $BackEnd
# [1] "WebLogic" "Oracle"
#
# $BackOffice
# [1] "Batch" "BackEnd"
#
# $Internet
# [1] "Public-Web" "Intranet"
If you want unique entries, then just do:
df <- unique(df[, 2:3])
split(as.character(df$Name), df$Dept)
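
Note that split orders the groups by factor level, so BackEnd comes first rather than Internet as in the question's layout. The following is a minimal sketch, not the answerer's code, that keeps departments in order of first appearance and builds the indented listing. The names dept_order, groups and report are illustrative, and the data frame is rebuilt from the dput in the question.

```r
# Data frame from the question, rebuilt from its dput output
x <- data.frame(
  ID   = c(1234L, 2345L, 6789L, 3456L, 7890L, 1987L),
  Name = c("Public-Web", "Intranet", "Batch", "BackEnd", "WebLogic", "Oracle"),
  Dept = c("Internet", "Internet", "BackOffice", "BackOffice", "BackEnd", "BackEnd"),
  stringsAsFactors = TRUE
)

# Departments in order of first appearance rather than alphabetical level order
dept_order <- unique(as.character(x$Dept))

# One character vector of names per department, in that order
groups <- split(as.character(x$Name), factor(x$Dept, levels = dept_order))

# Each department followed by its unique names, indented by two spaces
report <- unlist(lapply(names(groups), function(dept) {
  c(dept, paste0("  ", unique(groups[[dept]])))
}))

writeLines(report)
```

writeLines can also take a file path or connection as its second argument if the listing should go to a file instead of the console.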

Related

trying to summarize survey data for questions with 'select all that apply' using R

We have a survey with a 'select all that apply' question, so each response is a single string with the values separated by commas, e.g. "red, black,green".
There is another question about income, so I also have a factor with the levels 'low, medium, high'.
I want to be able to answer questions such as: what percentage selected 'red', grouped by income?
I can split the string with
df4 <- c("black,silver,green")
I can create a data frame with a timestamp and the split string with
t2 <- as.data.frame(c(df2[2],l2))
I am not able to understand how to do this for all rows at one time.
Here is the dput of the input:
structure(list(RespData = structure(1:2, .Label = c("1/20/2020",
"1/21/2020"), class = "factor"), CarColor = c("red,blue,green,yellow",
"black,silver,green")), row.names = c(NA, -2L), class = "data.frame")
and here is the dput of the desired output:
structure(list(RespData = structure(c(1L, 1L, 1L, 1L, 2L, 2L,
2L), .Label = c("1/20/2020", "1/21/2020"), class = "factor"),
Cars = structure(c(3L, 1L, 2L, 4L, 5L, 6L, 2L), .Label = c("blue",
"green", "red", "yellow", "black", "silver"), class = "factor")), row.names = c(NA,
-7L), class = "data.frame")
Example of Function:
MySplitFunc <- function(ListIn) {
# build an empty data frame and set the column names
x1.all <- ListIn[0,]
names(x1.all) <- c("ResponseTime", "Descriptive")
# for each row build the data and combine to growing list
for(x in 1:nrow(ListIn)) {
#print(x)
r1 <- ListIn[x,1]
c1 <- strsplit(ListIn[x,2],",")
x1 <- as.data.frame(c(r1,c1))
# set the names and combine to all
names(x1) <- c("ResponseTime", "Descriptive")
x1.all <- rbind(x1.all,x1)
}
# strip the whitespace
x1.all <- data.frame(lapply(x1.all, trimws), stringsAsFactors = TRUE)
return(x1.all)
}
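
One possible way to handle all rows at once, sketched here in base R under the assumption that the input has the two columns shown in the dput above: strsplit is vectorised over rows, so each response date only needs to be repeated once per colour. The names survey, parts and long are illustrative and not taken from the question.

```r
# Input rebuilt from the dput in the question
survey <- data.frame(
  RespData = factor(c("1/20/2020", "1/21/2020")),
  CarColor = c("red,blue,green,yellow", "black,silver,green"),
  stringsAsFactors = FALSE
)

# Split every row's string in one call; the result is a list with one vector per row
parts <- strsplit(survey$CarColor, ",", fixed = TRUE)

# Repeat each response date once per colour, then flatten and trim the colours
long <- data.frame(
  RespData = rep(survey$RespData, lengths(parts)),
  Cars     = trimws(unlist(parts)),
  stringsAsFactors = TRUE
)

long
```

With the data in this long shape, a count per colour is a call to table(long$Cars). Grouping by income would additionally need the income column carried along in the same way as RespData.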

how to remove duplicated strings and merge all columns strings in one?

I have data that looks like the following df:
df<- structure(list(V1 = structure(c(5L, 1L, 2L, 3L, 4L), .Label = c("DNAJC11;FGOTG",
"MAPK14", "PPIB", "RBX1", "USP14"), class = "factor"), V2 = structure(c(4L,
3L, 2L, 1L, 1L), .Label = c("", "DNAJC9", "MAPK14", "USP14"), class = "factor"),
V3 = structure(c(3L, 2L, 4L, 5L, 1L), .Label = c("", "DNAJC11;FGOTG",
"GCLC", "GSR", "STIP1"), class = "factor")), .Names = c("V1",
"V2", "V3"), class = "data.frame", row.names = c(NA, -5L))
I want to merge all columns into one and then keep only the unique values.
For example, the output should look like this:
USP14
DNAJC11;FGOTG
MAPK14
PPIB
RBX1
DNAJC9
GCLC
GSR
STIP1
I tried to use the melt function but could not figure out how to do this. Any comment is appreciated. Thanks

unique(as.vector(as.matrix(df)))
To remove the entries with no characters:
vec <- unique(as.vector(as.matrix(df)))
# logical indexing also works when there are no empty strings to drop
vec[vec != ""]
or, courtesy of @rawr:
Filter(nzchar, unique(as.vector(as.matrix(df))))

Extracting data frames from a list based on column names in r

I am looking at extracting df's from within a list of multiple df's into separate data frames based on a condition (if the column names of a df within the list contains the name I am looking for).
For illustration purposes I have created an example which resembles the situation I am in.
I have list with multiple data frames and the dput of that list is given below:
structure(list(V1 = structure(list(lvef = c(0.965686195194885,
0.0806777632648268, -0.531729196500083, -0.511913109608259, -0.413670941196816,
-0.0501899795864357, -0.337583918771946, 1.16086745780346, -0.478358865835724,
-1.95009138673888), hbc = c(-0.389950511350405, -0.904388183933348,
0.811821977223064, -0.868381700124344, -0.637307418402866, -1.04703715824204,
-0.394340445217658, -0.194653869597247, 0.00822402232044511,
-0.145032587618231), id = structure(c(1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L), .Label = "NA", class = "factor")), .Names = c("lvef",
"hbc", "id"), row.names = c(NA, -10L), class = "data.frame"),
V2 = structure(list(ersta = c(-0.254360310986174, 0.3859806928747,
-0.135741797055127, 1.03929145413636, -0.484219739337178,
0.255476285148917, 1.0479422937128, 0.146613094683722, -0.914377222535014,
1.75052418161618, -0.275059500684816, 2.34861397588234, 0.00183723766664941,
0.97612891408903, 0.278868537504227, 0.456979477254684, 1.46323739326792,
0.664511602217853, 0.870420202897545, 1.38228375734407),
pgrsta = c(-1.49129812271989, 0.820330747101906, -0.0469488167129374,
0.471549380446308, -1.71312120132398, 0.0578140025416816,
1.67016363826724, 0.226180835709491, -2.00294530465909,
-0.0464857361954717, 0.306942902768782, -0.785096914460742,
0.283822632249141, -0.260774679911329, -1.2865970194309,
0.307972619170242, 0.223715024597144, -1.01642533651475,
-0.12229427204957, 0.223326519096996), id = structure(c(7L,
7L, 7L, 7L, 4L, 1L, 3L, 5L, 6L, 2L, 7L, 7L, 7L, 7L, 4L,
1L, 3L, 5L, 6L, 2L), class = "factor", .Label = c("-0.10863576856322",
"-0.317324527228699", "-0.422764348315332", "0.285132258310185",
"1.23305496219042", "1.39326602279981", "NA"))), .Names = c("ersta",
"pgrsta", "id"), row.names = c(NA, -20L), class = "data.frame"),
V3 = structure(list(hormrec = 1:15, event = structure(c(10L,
10L, 10L, 10L, 10L, 10L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L,
9L), .Label = c("1", "2", "3", "4", "5", "6", "7", "8", "9",
"NA"), class = "factor")), .Names = c("hormrec", "event"), row.names = c(NA,
-15L), class = "data.frame"), V4 = structure(list(asat = c(-0.321423784000631,
0.181345361079582, 0.389158724418319, -1.15251833725336,
-0.351981383678293, -0.506888212379408, 0.870705917350059,
-0.626883041051641, -0.321843006223371, -0.674564527029912,
-0.609383943267379, -0.181661119817784, -1.63676077872658
), lab = structure(c(1L, 1L, 1L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 3L, 2L), .Label = c("btest", "NA", "rtest"), class = "factor")), .Names = c("asat",
"lab"), row.names = c(NA, -13L), class = "data.frame")), .Names = c("V1",
"V2", "V3", "V4"))
I am trying to extract data frames from the list based on the condition that if a data frame within the list contains the column name/s required then that data frame from the list should go into a separate data frame. So far, I have been able to extract the data frames into a list using the following code:
# function to extract required df's
trial <- function(x)
{
reqname <- c("hbc","ersta") # column names to check for
data <- x
lapply(seq(data), function(i){ # loop through all the data frames in the list
y <- data.frame(data[[i]]) # extract df in y
names <- names(y) # extract names of df
for(a in 1:length(reqname)) # loop through the length of reqname
{
if(reqname[a]%in%names) # check if column name/s present in current df
{
z <- y # extract df into another df
return(z) # return df
}
}
}
)
}
The above function returns a list of matching df's along with nulls where there was not a match. I am looking for a modification so that the selected data frame comes out separately. If there are two df's matching the requirement then the output should be two separate data frames.
I will appreciate all and any help in finding a solution.

You can use lapply() (or vapply()) with a small custom function to flag the data frames you want. For instance, if x is your list:
trial <- function(x) {
  reqnames <- c("hbc", "ersta")  # column names to check for
  # TRUE for each data frame that has at least one of the required columns
  keep <- vapply(x, function(d) any(names(d) %in% reqnames), logical(1))
  x[keep]
}
This returns a list containing only the data frames that have at least one of the names in reqnames.

We can remove the NULL elements with Filter
lst1 <- Filter(length, trial(lst))
If we need multiple data.frame objects in the global environment, use list2env after renaming the list elements with the object names
names(lst1) <- paste0('dat', seq_along(lst1))
list2env(lst1, envir = .GlobalEnv)
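
The two steps (select by column name, then turn the matches into separate objects) can be sketched end to end as follows. This is an illustration on a small made-up list, not the data from the question. The names lst, matched and out are hypothetical, and the objects are placed in a fresh environment rather than .GlobalEnv so the example does not overwrite anything in your workspace.

```r
# Small stand-in for the list in the question (values are made up)
lst <- list(
  V1 = data.frame(lvef = c(0.1, 0.2), hbc = c(1, 2)),
  V2 = data.frame(ersta = c(3, 4), pgrsta = c(5, 6)),
  V3 = data.frame(hormrec = 1:2, event = c("a", "b"))
)

reqnames <- c("hbc", "ersta")

# Keep only the data frames that contain at least one required column
keep <- vapply(lst, function(d) any(names(d) %in% reqnames), logical(1))
matched <- lst[keep]

# Name the matches dat1, dat2, ... and create one object per data frame
names(matched) <- paste0("dat", seq_along(matched))
out <- list2env(matched, envir = new.env())

ls(out)
```

Passing envir = .GlobalEnv instead of new.env() would create dat1 and dat2 directly in the workspace, as in the answer above.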

Reproduce a dataset to different format in R

I have a dataset Data like below:
dput(Data)
structure(list(FN = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L), .Label = "20131202-0985 ", class = "factor"), Values = structure(c(1L,
8L, 7L, 6L, 5L, 9L, 2L, 4L, 3L), .Label = c("|639778|21|NANYANG CIRCLE|103.686721631628|1.34640300329567",
"|8121|B01|SOMERSET STN", "|96942883", "|SN30|SMRT\n", "CENTRAL",
"FOUR SEASONS HOTEL", "HOTEL", "IKEA", "nanyang avenue"), class = "factor"),
IND = structure(c(4L, 1L, 1L, 1L, 1L, 6L, 3L, 2L, 5L), .Label = c("BN",
"BR", "BS", "LOC", "PN", "RN"), class = "factor")), .Names = c("FN",
"Values", "IND"), class = "data.frame", row.names = c(NA, -9L
))
I want the above dataset converted into the format below as a data frame (out_data).
Presently my Data has 3 columns, and I need to convert these into 16 columns in the format below.
I need to reshape my input to exactly what is given in the screenshot, as a data frame.
I cannot change the structure below:
colnames(out_data) <- c("FN","H_BLK","S_N/R_N","B_N","FL_N","U_N","PC","XC","YC","BS","BRF","LCT_DEC","BRN","BO","PN","S_TY_CD")
The multi-value entries in the input are always in the format below:
|639778|21|NANYANG CIRCLE|103.686721631628|1.34640300329567 -> |PC|H_BLK|S_N/R_N|XC|YC
|8121|B01|SOMERSET STN -> |BS|BRF|LCT_DEC
|SN30|SMRT -> |BRN|BO
If
IND = LOC, then |PC|H_BLK|S_N/R_N|XC|YC get updated, with S_TY_CD=LOC
IND = BN, then the B_N column should be updated, with S_TY_CD=BN
IND = RN, then the S_N/R_N column should be updated, with S_TY_CD=RN
IND = BS, then |BS|BRF|LCT_DEC should be updated, with S_TY_CD=BS
IND = BR, then |BRN|BO should be updated, with S_TY_CD=BR
IND = PN, then PN should be updated, with S_TY_CD=PN
Is there an efficient way of doing this?

Here's one method of transformation. First I define some helper functions for the various sub problems.
# define output columns
outcols <- c("FN", "H_BLK", "S_N/R_N", "B_N", "FL_N", "U_N", "PC",
             "XC", "YC", "BS", "BRF", "LCT_DEC", "BRN", "BO", "PN", "S_TY_CD")
# identify parts for each compound value
namevals <- function(ind, vals) {
  names <- if (ind == "LOC") {
    c("PC", "H_BLK", "S_N/R_N", "XC", "YC")
  } else if (ind == "BN") {
    c("B_N")
  } else if (ind == "RN") {
    c("S_N/R_N")
  } else if (ind == "BS") {
    c("BS", "BRF", "LCT_DEC")
  } else if (ind == "BR") {
    c("BRN", "BO")
  } else if (ind == "PN") {
    c("PN")
  }
  stopifnot(length(names) == length(vals))
  stopifnot(all(names %in% outcols))
  names(vals) <- names
  vals
}
# add missing values for row
fillrow <- function(nvals) {
  r <- rep(NA, length(outcols))
  r[match(names(nvals), outcols)] <- nvals
  r
}
Now I apply these to each row of the data with mapply, which returns a character matrix with one column per input row. Here we make sure to remove the leading pipe from the "Values" column and then split it on the remaining pipes.
# combine rows into character matrix
dt <- mapply(function(fn, vals, ind) {
    x <- c(FN = fn, namevals(ind, vals), "S_TY_CD" = ind)
    fillrow(x)
  },
  as.character(Data$FN),
  strsplit(gsub("^\\|", "", as.character(Data$Values)), "|", fixed = TRUE),
  as.character(Data$IND)
)
Finally we tidy the data up so it can be written out to a file with write.table. Note that all missing values are true R NA values. In the write.table, you can set na = "" if you'd rather they print out as blank values than the default "NA" value.
# turn matrix into data.frame with proper names
dd <- data.frame(unname(t(dt)), stringsAsFactors = FALSE)
names(dd) <- outcols
dd
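
As a small illustration of the na = "" option mentioned above, the sketch below writes a tiny stand-in data frame to a temporary file. The pipe separator, the temporary file and the three-column dd are assumptions made for the example, since the question does not say how the result should be written out.

```r
# Tiny stand-in for dd with a true NA in one cell (only three of the 16 columns)
dd <- data.frame(
  FN      = c("20131202-0985", "20131202-0985"),
  B_N     = c(NA, "IKEA"),
  S_TY_CD = c("LOC", "BN"),
  stringsAsFactors = FALSE
)

# Write to a temporary file, printing NA cells as empty fields
out_file <- tempfile(fileext = ".txt")
write.table(dd, out_file, sep = "|", na = "", quote = FALSE, row.names = FALSE)

# Read the raw lines back to inspect what was written
written <- readLines(out_file)
written
```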

Replacing rows in R

In R, I am reading a tab-separated file that contains comment lines, using read.csv:
read.data.raw = read.csv(inputfile, sep='\t', header=F, comment.char='')
The file looks like this:
#comment line 1
data 1<tab>x<tab>y
#comment line 2
data 2<tab>x<tab>y
data 3<tab>x<tab>y
Now I extract the uncommented lines using
comment_ind = grep( '^#.*', read.data.raw[[1]])
read.data = read.data.raw[-comment_ind,]
Which leaves me:
data 1<tab>x<tab>y
data 2<tab>x<tab>y
data 3<tab>x<tab>y
I am modifying this data through some separate script which maintains the number of rows/cols and would like to put it back into the original read data (with the user comments) and return it to the user like this
#comment line 1
modified data 1<tab>x<tab>y
#comment line 2
modified data 2<tab>x<tab>y
modified data 3<tab>x<tab>y
Since the data I extracted in read.data preserves the row names row.names(read.data), I tried
original.read.data[as.numeric(row.names(read.data)),] = read.data
But that didn't work, and I got a bunch of NAs.
Any ideas?

Does this do what you want?
read.data.raw <- structure(list(V1 = structure(c(1L, 3L, 2L, 4L, 5L),
.Label = c("#comment line 1", "#comment line 2", "data 1", "data 2",
"data 3"), class = "factor"), V2 = structure(c(1L, 2L, 1L, 2L, 2L),
.Label = c("", "x"), class = "factor"), V3 = structure(c(1L, 2L, 1L,
2L, 2L), .Label = c("", "y"), class = "factor")), .Names = c("V1",
"V2", "V3"), class = "data.frame", row.names = c(NA, -5L))
comment_ind = grep( '^#.*', read.data.raw[[1]])
read.data <- read.data.raw[-comment_ind,]
# modify V1
read.data$V1 <- gsub("data", "DATA", read.data$V1)
# rbind() and then order() comments into original places
new.data <- rbind(read.data.raw[comment_ind,], read.data)
new.data <- new.data[order(as.numeric(rownames(new.data))),]
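
One likely cause of the NAs in the question's attempt is that the columns were read as factors, which was the read.csv default before R 4.0.0: assigning a string that is not an existing level into a factor produces NA. The sketch below assumes the data is held as character columns instead (for example by reading with stringsAsFactors = FALSE), in which case assigning back by row name works directly. The names raw, dat and result are illustrative.

```r
# Stand-in for the file contents, with character (not factor) columns
raw <- data.frame(
  V1 = c("#comment line 1", "data 1", "#comment line 2", "data 2", "data 3"),
  V2 = c("", "x", "", "x", "x"),
  V3 = c("", "y", "", "y", "y"),
  stringsAsFactors = FALSE
)

# Rows that are comments, and the remaining data rows
comment_ind <- grep("^#", raw$V1)
dat <- raw[-comment_ind, ]

# Modify the data; the original row names ("2", "4", "5") are preserved
dat$V1 <- paste("modified", dat$V1)

# Put the modified rows back into their original positions
result <- raw
result[as.numeric(row.names(dat)), ] <- dat

result$V1
```

If the file might contain no comment lines at all, comment_ind would be empty and raw[-comment_ind, ] would select zero rows, so that case needs a separate check.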