I have 1 row of data and 50 columns in the row from a csv which I've put into a dataframe. The data is arranged across the spreadsheet like this:
"FSEG-DFGS-THDG", "SGDG-SGRE-JJDF", "DIDC-DFGS-LEMS"...
How would I select only the middle part of each element (e.g., "DFGS" in the first one, "SGRE" in the second, etc.), count their occurrences, and display the results?
I have tried using the strsplit function, but I couldn't get it to work for the entire row of data. I'm thinking a loop of some kind might be what I need.
You can do unlist(strsplit(x, '-'))[seq(2, length(x)*3, 3)] (assuming your data is consistently of the form A-B-C).
# E.g.
fun <- function(x) unlist(strsplit(x, '-'))[seq(2, length(x)*3, 3)]
fun(c("FSEG-DFGS-THDG", "SGDG-SGRE-JJDF", "DIDC-DFGS-LEMS"))
# [1] "DFGS" "SGRE" "DFGS"
Edit
# Data frame
df <- structure(list(a = "FSEG-DFGS-THDG", b = "SGDG-SGRE-JJDF", c = "DIDC-DFGS-LEMS"),
class = "data.frame", row.names = c(NA, -1L))
fun(t(df[1,]))
# [1] "DFGS" "SGRE" "DFGS"
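The question also asks for counts, which the extraction above does not produce on its own. A minimal sketch, assuming every element has the form A-B-C and using a plain character vector in place of the data frame row (the names `x`, `middle`, and `counts` are illustrative, not from the original post):

```r
# Sample values copied from the question
x <- c("FSEG-DFGS-THDG", "SGDG-SGRE-JJDF", "DIDC-DFGS-LEMS")

# Split each string on "-" and keep the second piece of each
middle <- vapply(strsplit(x, "-", fixed = TRUE), `[`, character(1), 2)

# Tabulate how often each middle part occurs
counts <- table(middle)
counts
```

For a one-row data frame, passing `unlist(df[1, ])` instead of `x` should give the same kind of character vector.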
First we create a function strng() and then we apply() it on every column of df. strsplit() splits a string by "-" and strng() returns the second part.
df = data.frame(a = "ab-bc-ca", b = "gn-bc-ca", c = "kj-ll-mn")
strng = function(x) {
strsplit(x,"-")[[1]][2]
}
# table() outputs frequency of elements in the input
table(apply(df, MARGIN = 2, FUN = strng))
# output:
# bc ll
#  2  1
Related
I have a dataframe, df, with several columns in it. I would like to write a function that creates new columns dynamically from existing column names. Part of it is using the last four characters of an existing column name. For example, I would like to create a variable named df$rev_2002 like so:
df$rev_2002 <- df$avg_2002 * df$quantity
The problem is I would like to be able to run the function every time a new column (say, df$avg_2003) is appended to the dataframe.
To this end, I used the following function to extract the last 4 characters of the df$avg_2002 variable:
substRight <- function (x,n) {
substr(x, nchar(x)-n+1, nchar(x))
}
I tried putting together another function to create the columns:
revved <- function(x, y, z){
z = x * y
names(z) <- paste('revenue', substRight(x,4), sep = "_")
return(x)
}
But when I try it on actual data I don't get new columns in my df. The desired result is a series of variables in my df such as:
df$rev_2002, df$rev_2003...df$rev_2020 or whatever is the largest value of the last four characters of the x variable (df$avg_2002 in example above).
Any help or advice would be truly appreciated. I'm really in the woods here.
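dat <- data.frame(id = 1:2, quantity = 3:4, avg_2002 = 5:6, avg_2003 = 7:8, avg_2020 = 9:10)
func <- function(dat, overwrite = FALSE) {
  # find the avg_<year> columns and derive the matching rev_<year> names
  nms <- grep("^avg_[0-9]+$", names(dat), value = TRUE)
  revnms <- sub("^avg_", "rev_", nms)
  # unless overwriting, skip rev_ columns that already exist,
  # dropping the same positions from nms so the two stay aligned
  if (!overwrite) {
    keep <- !(revnms %in% names(dat))
    nms <- nms[keep]
    revnms <- revnms[keep]
  }
  if (!length(nms)) return(dat)  # nothing new to compute
  # dat[nms] stays a data frame even when only one column matches
  dat[revnms] <- lapply(dat[nms], `*`, dat$quantity)
  dat
}
func(dat)
#   id quantity avg_2002 avg_2003 avg_2020 rev_2002 rev_2003 rev_2020
# 1  1        3        5        7        9       15       21       27
# 2  2        4        6        8       10       24       32       40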
I have a data set where the names of the columns are very messy, and I want to simplify them. Example data below:
structure(list(MemberID = 1L, This.was.the.first.question = "ABC",
This.was.the.first.date = 1012018L, This.was.the.first.city = "New York",
This.was.the.second.question = "XYZ", This.was.the.second.date = 11052018L,
This.was.the.second.city = "Boston"), .Names = c("MemberID",
"This.was.the.first.question", "This.was.the.first.date", "This.was.the.first.city",
"This.was.the.second.question", "This.was.the.second.date", "This.was.the.second.city"
), class = "data.frame", row.names = c(NA, -1L))
MemberID This was the first question This was the first date This was the first city This was the second question This was the second date This was the second city
1 ABC 1012018 New York XYZ 11052018 Boston
This is what I want the columns to look like:
MemberID Question_1 Date_1 City_1 Question_2 Date_2 City_2
So essentially the column name stays the same, but the number increases by 1 every third column. How would I do this? While this example data set is small, my real data set is much larger, and I want to learn how to do this by column indexing and iteration.
An easier option is to remove the substring except the last word and use make.unique
names(df1)[-1] <- make.unique(sub(".*\\.", "", names(df1)[-1]), sep="_")
names(df1)
#[1] "MemberID" "question" "date" "city" "question_1" "date_1" "city_1"
Or if we need the exact output as expected, extract the last word with sub and use ave to create the sequence based on duplicate names
v1 <- sub(".*\\.(\\w)", "\\U\\1", names(df1)[-1], perl = TRUE)
names(df1)[-1] <- paste(v1, ave(v1, v1, FUN = seq_along), sep="_")
names(df1)
#[1] "MemberID" "Question_1" "Date_1" "City_1"
#[5] "Question_2" "Date_2" "City_2"
# create vector of question name triplets
theList <- c("question_","date_","city_")
# create enough headings for 10 questions
questions <- rep(theList,10)
idNumbers <- 1:length(questions)
library(numbers)
# use mod function to group ids into triplets
idNumbers <- as.character(ifelse(mod(idNumbers,3)>0,floor(idNumbers/3)+1,floor(idNumbers/3)))
# concatenate question stems with numbers and add MemberID column at start of vector
questionHeaders <- c("MemberID",paste0(questions,idNumbers))
head(questionHeaders)
...and the output:
[1] "MemberID" "question_1" "date_1" "city_1" "question_2" "date_2"
Use the colnames() or names() function to assign this vector as the column names of the data frame.
As noted in the comments on the OP, the question ID numbers can be generated by using the each= argument in rep(), eliminating the need for the mod() function.
idNumbers <- rep(1:10,each = 3)
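Putting the rep() idea together with the question's data, the headers can be built from the number of columns rather than a hard-coded 10. This is only a sketch: it assumes the first column is the ID and that every remaining column belongs to a question/date/city triplet in that order. The names `stems`, `n_groups`, and `new_names` are illustrative.

```r
# Example data from the question, rebuilt with data.frame()
df1 <- data.frame(MemberID = 1L,
                  This.was.the.first.question = "ABC",
                  This.was.the.first.date = 1012018L,
                  This.was.the.first.city = "New York",
                  This.was.the.second.question = "XYZ",
                  This.was.the.second.date = 11052018L,
                  This.was.the.second.city = "Boston")

stems <- c("Question", "Date", "City")

# Number of complete triplets after the ID column
n_groups <- (ncol(df1) - 1) %/% length(stems)

# Repeat the stems once per group and number each group 1, 2, ...
new_names <- paste(rep(stems, times = n_groups),
                   rep(seq_len(n_groups), each = length(stems)),
                   sep = "_")

names(df1)[-1] <- new_names
names(df1)
```

Because the names are generated by position, this approach does not check that a column really holds a question, date, or city; it relies on the column order being consistent.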
I want to (manually) overwrite cells in column x of the dataframe df. But R produces an error. Consider
m = 1:2
n = 3:4
names(m)= c("o", "we")
names(n)= c("bn","lt")
s = c( "bb", "cc")
b = c( FALSE, TRUE)
df = data.frame( s, b)
df$x= list(m,n)
Now replace column x first row:
k = 5:6
names(k)= c("jh","jh")
df[1,"x"] = k ## error occurs here
Try this: df$x[[1]] = k
The column x is a list column, so each cell holds a whole vector rather than a single value. df[1,"x"] = k tries to fit both elements of k into one cell of the data frame, which is why the assignment fails. Instead, extract the column with $ so that you are working with the list itself.
Then access the element of that list with [[1]] (or whichever position you need) and assign the new vector to it.
Hope this helps.
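A self-contained sketch of the replacement, rebuilding the question's data frame and then checking that only the first cell of x changed (integer literals are written out here for clarity; otherwise the names match the question):

```r
# Rebuild the data frame from the question
m <- c(o = 1L, we = 2L)
n <- c(bn = 3L, lt = 4L)
df <- data.frame(s = c("bb", "cc"), b = c(FALSE, TRUE))
df$x <- list(m, n)   # x is a list column: one vector per row

# Replacement value for the first row of x
k <- c(jh = 5L, jh = 6L)

# Replace the first element of the list column
df$x[[1]] <- k

df$x[[1]]   # now the vector k
df$x[[2]]   # unchanged
```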
How can I store the output of sapply() in a data frame where the index value is stored in the first column and its value in the corresponding second column? For illustration, I have shown only 2 elements here, but there are 110 columns in my data. "loan" is the data frame.
cols <- sapply(loan,function(x) sum(is.na(x)))
cols
       id member_id
        0         7
I want output as:
var value
id 0
member_id 7
I know that sapply() returns a vector, but when I print the vector, the values are printed along with an "index", e.g., the column name if it is applied to a data frame. So, when I want to store it as a data frame with two columns, where the first column contains the index part and the second column contains the value, how can I do it?
I found an answer to my question. For those who actually did understand my problem, this answer might make sense:
cols <- data.frame(sapply(loan ,function(x) sum(is.na(x))))
cols <- cbind(variable = row.names(cols), cols)
I wanted the row.names to be in a column of the same data frame corresponding to the values obtained from sapply.
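The "index" that sapply() prints is the names attribute of the returned vector, so the two-column layout can also be built directly from names() and the values. A minimal sketch, using a small made-up `loan` data frame (the real data is not shown in the question) and the illustrative names `na_counts` and `result`:

```r
# Hypothetical stand-in for the real 'loan' data
loan <- data.frame(id = 1:3, member_id = c(NA, 5, NA))

# Named integer vector: one NA count per column
na_counts <- sapply(loan, function(x) sum(is.na(x)))

# Names go in one column, counts in the other
result <- data.frame(var = names(na_counts),
                     value = unname(na_counts),
                     stringsAsFactors = FALSE)
result
```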
We can use stack
stack(mylist)[2:1]
data
mylist <- list(df = 1, rf = 2)
Is this what you want?
Your original list:
L <- c("df",1,"rf",2)
L
[1] "df" "1" "rf" "2"
As a data frame:
N <- length(L)
df <- data.frame( var = L[seq(1,N,2)], value = L[seq(2,N,2)] )
df
var value
1 df 1
2 rf 2
I have a list similar to this one:
set.seed(1602)
l <- list(data.frame(subst_name = sample(LETTERS[1:10]), perc = runif(10), crop = rep("type1", 10)),
data.frame(subst_name = sample(LETTERS[1:7]), perc = runif(7), crop = rep("type2", 7)),
data.frame(subst_name = sample(LETTERS[1:4]), perc = runif(4), crop = rep("type3", 4)),
NULL,
data.frame(subst_name = sample(LETTERS[1:9]), perc = runif(9), crop = rep("type5", 9)))
Question: How can I extract the subst_name-column of each data.frame and combine them with cbind() (or similar functions) to a new data.frame without messing up the order of each column? Additionally the columns should be named after the corresponding crop type (this is possible 'cause the crop types are unique for each data.frame)
EDIT: The output should look as follows:
Having read the comments, I'm aware that this doesn't make much sense within R, but for the sake of having a look at the output, the data.frame's View option is quite handy.
With the help of this SO question I came up with the following solution. (There's probably room for improvement.)
a <- lapply(l, '[[', 1) # extract the first element of the dfs in the list
a <- Filter(function(x) !is.null(unlist(x)), a) # remove NULLs
a <- lapply(a, as.character)
max.length <- max(sapply(a, length))
## Add NA values to list elements
b <- lapply(a, function(v) { c(v, rep(NA, max.length-length(v)))})
e <- as.data.frame(do.call(cbind, b)) # bind the padded vectors in b as columns
names(e) <- unlist(lapply(lapply(lapply(l, '[[', "crop"), '[[', 2), as.character))
It is not really correct to do this with the given example because the number of rows is not the same in each of the list's data frames, so cbind recycles the shorter columns. But if you don't care, you can do:
nullElements = unlist(sapply(l,is.null))
l = l[!nullElements] #delete useless null elements in list
columns=lapply(l,function(x) return(as.character(x$subst_name)))
newDf = as.data.frame(Reduce(cbind,columns))
If you don't want recycled elements in the columns you can do
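for(i in 1:ncol(newDf)){
  colLength = nrow(l[[i]])
  # only blank out shorter columns; for a full-length column
  # (colLength+1):nrow(newDf) would count backwards and erase real data
  if(colLength < nrow(newDf)) newDf[(colLength+1):nrow(newDf),i] = NA
}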
newDf = newDf[1:max(unlist(sapply(l,nrow))),] #remove possible extra NA rows
Note that I edited my previous code to remove NULL entries from l to simplify things
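An alternative that avoids recycling altogether is to pad each column with NA before combining. The following is a sketch under the question's assumptions (each non-NULL data frame has a subst_name column and a single crop type); the small list `l` here is made up for illustration, and `dfs`, `cols`, `padded`, and `out` are illustrative names.

```r
# Small stand-in for the list in the question, including a NULL entry
l <- list(
  data.frame(subst_name = c("A", "B", "C"), crop = "type1", stringsAsFactors = FALSE),
  NULL,
  data.frame(subst_name = c("D", "E"), crop = "type2", stringsAsFactors = FALSE)
)

# Drop NULL entries, then pull out subst_name as character vectors
dfs  <- Filter(Negate(is.null), l)
cols <- lapply(dfs, function(d) as.character(d$subst_name))

# Extending a vector's length pads it with NA
n <- max(lengths(cols))
padded <- lapply(cols, function(v) { length(v) <- n; v })

# Name each column after the crop type of its data frame
names(padded) <- vapply(dfs, function(d) as.character(d$crop[1]), character(1))

out <- as.data.frame(padded, stringsAsFactors = FALSE)
out
```

The original order within each column is preserved because nothing is sorted or merged; shorter columns simply end in NA.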