data extraction from object list frq - r

The task is simple, but I am doing something wrong. I use the sjmisc package and its frq function (frequency table). I would like to access the column valid.prc and store it as a variable. The last part is easy, but the first part causes trouble: a$valid.prc doesn't work and the result is NULL.
Sample data:
library(sjmisc)
a <- sample(seq(from = 1, to = 7), size = 100, replace = TRUE)
frequencytable <- frq(a)
How to extract data from column valid.prc? Many thanks for help.

frequencytable is a list. Use [[ to pull out the data frame it contains, then extract the column valid.prc as usual:
class(frequencytable)
#[1] "sjmisc_frq" "list"
frequencytable[[1]]$valid.prc
#[1] 17 11 14 19 15 11 13 NA
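The difference between $ on the list and [[ followed by $ can be seen without sjmisc installed. This is only a sketch: freq_list is a hypothetical stand-in that mimics the shape reported above (a list holding one data frame), and the numbers are made up.

```r
# Hypothetical stand-in for the frq() result: a list holding one data frame
freq_list <- list(data.frame(val = 1:3, valid.prc = c(50, 30, 20)))

# $ looks for a list element named valid.prc; there is none, so this is NULL
freq_list$valid.prc

# [[1]] returns the data frame itself, and $ then finds the column
valid_prc <- freq_list[[1]]$valid.prc
valid_prc
```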

Related

Why won't R recognize data frame column names within lists?

HEADLINE: Is there a way to get R to recognize data.frame column names contained within lists in the same way that it can recognize free-floating vectors?
SETUP: Say I have a vector named varA:
(varA <- 1:6)
# [1] 1 2 3 4 5 6
To get the length of varA, I could do:
length(varA)
#[1] 6
and if the variable was contained within a larger list, the variable and its length could still be found by doing:
list <- list(vars = "varA")
length(get(list$vars[1]))
#[1] 6
PROBLEM:
This is not the case when I substitute a data frame column for the vector, and I don't know how to work around this:
rows <- 1:6
cols <- c("colA")
(df <- data.frame(matrix(NA,
nrow = length(rows),
ncol = length(cols),
dimnames = list(rows, cols))))
# colA
# 1 NA
# 2 NA
# 3 NA
# 4 NA
# 5 NA
# 6 NA
list <- list(vars = "varA",
cols = "df$colA")
length(get(list$vars[1]))
#[1] 6
length(get(list$cols[1]))
#Error in get(list$cols[1]) : object 'df$colA' not found
Though this contrived example seems inane, because I could always use the simple length(variable) approach, I'm actually interested in writing data from hundreds of variables varying in lengths onto respective dataframe columns, and so keeping them in a list that I could iterate through would be very helpful. I've tried everything I could think of, but it may be the case that it's just not possible in R, especially given that I cannot find any posts with solutions to the issue.
You could try:
> length(eval(parse(text = list$cols[1])))
[1] 6
Or:
list <- list(vars = "varA",
cols = "colA")
length(df[, list$cols[1]])
[1] 6
Or with regex:
list <- list(vars = "varA",
cols = "df$colA")
length(df[, sub(".*\\$", "", list$cols[1])])
[1] 6
If you are truly working with a data frame d, then nrow(d) is the length of all of the variables in d. There should be no reason to use length in this case.
If you are actually working with a list x containing variables of potentially different lengths, then you should use the [[ operator to extract those variables by name (see ?Extract):
x <- list(a = 1:10, b = rnorm(20L))
l <- list(vars = "a")
length(x[[l$vars[1L]]]) # 10
If you insist on using get (you shouldn't), then you need to supply a second argument telling it where to look for the variable (see ?get):
length(get(l$vars[1L], x)) # 10
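For the stated goal of iterating over many variables of different lengths, the [[ approach extends naturally to a loop over names. The following is a minimal sketch, assuming the variables live in a named list; var_names, lens and padded are illustrative names, and padding with NA is just one possible way to fit unequal lengths into a single data frame.

```r
# Variables of different lengths, kept in one named list
x <- list(a = 1:10, b = seq(0.5, 10, by = 0.5))
var_names <- c("a", "b")

# Look each variable up by name with [[ and record its length
lens <- vapply(var_names, function(nm) length(x[[nm]]), integer(1))

# Pad every variable with NA up to the longest length, then build the data frame
n <- max(lens)
padded <- lapply(x[var_names], function(v) { length(v) <- n; v })
df <- as.data.frame(padded)
```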

Creating a data frame from looping through a list of data frames in R

I have a large data set that is organized as a list of 1044 data frames. Each data frame is a profile that holds the same data for a different station and time. I am trying to create a data frame that holds the output of my function fitsObs, but my current code only goes through a single data frame. Any ideas?
i=1
start=1
for(i in 1:1044){
station1 <- surveyCTD$stations[[i]]
df1 <- surveyCTD$data[[i]]
date1 <- surveyCTD$dates[[i]]
fitObs <- fitTp2(-df1$depth, df1$temp)
if(start==1){
start=0
dfout <- data.frame(
date=date1
,station=station1
)
names(fitObs) <- paste0(names(fitObs),"o")
dfout<-cbind(dfout, df1$temp, df1$depth)
dfout <- cbind(dfout, fitObs)
}
}
At first glance I would try two things to debug it. First, print the head of a data frame inside the loop to see what the loop is doing. Then check where dfout is assigned: it is only created inside the if(start==1) block, and start is set to 0 there, so dfout is built on the first iteration only.
Moreover, setting i before the loop does not change anything, because for assigns i itself.
I have created a reproducible example of my best guess as to what you are asking. I also assume that you are able to adjust the concepts in this general example to suit your own problem. It's easier if you provide an example of your list in future.
First we create some reproducible data
a <- c(10,20,30,40)
b <- c(5,10,15,20)
c <- c(20,25,30,35)
df1 <- data.frame(x=a+1,y=b+1,z=c+1)
df2 <- data.frame(x=a,y=b,z=c)
ls1 <- list(df1,df2)
Which looks like this
print(ls1)
[[1]]
x y z
1 11 6 21
2 21 11 26
3 31 16 31
4 41 21 36
[[2]]
x y z
1 10 5 20
2 20 10 25
3 30 15 30
4 40 20 35
So we now have two data frames within a single list. The following code goes through the columns of each data frame in the list and applies the mean() function to the data in each column. To work across rows instead, pass 1 rather than 2 as the second argument to apply().
df <- as.data.frame(do.call("rbind", lapply(ls1, function(x) apply(x,2,mean))))
print(df)
x y z
1 26 13.5 28.5
2 25 12.5 27.5
You should be able to replace mean() with whatever function you have written for your data. Let me know if this helps.
Consider building a generalized function to be called with Map (a wrapper to mapply, the multiple, elementwise iterator member of the apply family) to build a list of data frames, each with your fitObs output. Pass only objects of equal length into the data.frame() constructor.
Then, outside the iteration, run do.call for a final, single appended data frame of all 1,044 smaller data frames (assuming each has exactly the same names and number of columns):
# GENERALIZED FUNCTION
add_fit_obs <- function(dt, st, df) {
fitObs <- fitTp2(-df$depth, df$temp)
names(fitObs) <- paste0(names(fitObs),"o")
tmp <- data.frame(
date = dt,
station = st,
depth = df$depth,
temp = df$temp,
fitObs
)
return(tmp)
}
# LIST OF DATA FRAMES
df_list <- Map(add_fit_obs, surveyCTD$dates, surveyCTD$stations, surveyCTD$data)
# EQUIVALENTLY:
# df_list <- mapply(add_fit_obs, surveyCTD$dates, surveyCTD$stations, surveyCTD$data, SIMPLIFY=FALSE)
# SINGLE DATAFRAME
master_df <- do.call(rbind, df_list)
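Since fitTp2 and surveyCTD are not available here, the pattern can be tried on toy data. This is a sketch under assumptions: fit_stub is a hypothetical stand-in for fitTp2 that just returns a named vector, survey is made-up data with the same three components, and the result holds one row per profile rather than one row per depth.

```r
# Hypothetical stand-in for fitTp2(): returns a named numeric vector
fit_stub <- function(z, temp) c(tmin = min(temp), tmax = max(temp))

# Toy data shaped like the question's list (values are made up)
survey <- list(
  stations = list("S1", "S2"),
  dates    = list(as.Date("2020-01-01"), as.Date("2020-01-02")),
  data     = list(data.frame(depth = c(1, 2, 3), temp = c(10, 9, 8)),
                  data.frame(depth = c(1, 2),    temp = c(12, 11)))
)

# One row per profile: date, station, and the suffixed fit values
summarise_profile <- function(dt, st, df) {
  fit <- fit_stub(-df$depth, df$temp)
  names(fit) <- paste0(names(fit), "o")
  cbind(data.frame(date = dt, station = st), as.data.frame(as.list(fit)))
}

# Argument order must match the function signature: dates, stations, data
rows <- Map(summarise_profile, survey$dates, survey$stations, survey$data)
out <- do.call(rbind, rows)
```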

Parsing a string efficiently

So I've got a column in my data frame that is essentially one long character string that encodes a number of variables for each record. It might look something like this:
string<-c('001034002025003996','001934002199004888')
But much longer.
The strings are structured so that every 6 characters belong together. So you can look at the strings above like this:
001034 002025 003996
001934 002199 004888
The first three characters of each group are a code identifying a variable, and the next three are the value of that variable. So the above can be broken down into columns that look like this:
var001 var002 var003 var004
1 034 025 996 NA
2 934 199 NA 888
I need a way to parse this string and return a data frame with the expanded columns.
I wrote a nested loop that looks like this:
for(i in 1:length(string)){
text <- string[i]
for(j in seq(1,505,6)){
var <- substr(text,j, j+2)
var.value <- substr(text, j+3, j+5)
index <- (as.numeric(var))
df[i, index] <- var.value
}
}
where df is an empty data frame created to receive the data. This works, but is slow on larger amounts of data. Is there a better way to do this?
1) This one-liner produces a character matrix (which can easily be converted to a data.frame if need be). No packages are used.
read.dcf(textConnection(gsub("(...)(...)", "\\1: \\2\n", string)))
giving:
001 002 003 004
[1,] "034" "025" "996" NA
[2,] "934" "199" NA "888"
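If a data frame with var-prefixed column names is wanted, as in the question, the matrix can be converted afterwards. A minimal sketch using the sample strings from the question; the names con, m and wide are illustrative, and the connection is closed explicitly to avoid a warning about an unused connection.

```r
string <- c("001034002025003996", "001934002199004888")

# Rewrite each "cccvvv" group as a "ccc: vvv" line and parse it as DCF records
con <- textConnection(gsub("(...)(...)", "\\1: \\2\n", string))
m <- read.dcf(con)
close(con)

# Character matrix -> data frame, prefixing the codes to get var001, var002, ...
wide <- as.data.frame(m, stringsAsFactors = FALSE)
names(wide) <- paste0("var", colnames(m))
wide
```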
2) This alternative produces the same matrix. The read.table produces a long form data.frame and then tapply reshapes it to a wide matrix.
long <- read.table(text = gsub("(...)(...)", "\\1 \\2\n", string),
colClasses = "character", col.names = c("id", "var"))
tapply(long$var, list(gl(length(string), nchar(string[1])/6), long$id), c)

Linking two datasets

I have a dataset called "J_BL5H1", this includes :
Var1 Freq
4 10
8 10
10 13
11 7
13 3
17 10
19 10
25 1
26 4
27 8
53 13
From this dataset, I want to extract the value for each Var1 separately and store it in a new object named like J_BL5H1JNVar1Number, where Var1Number stands for the specific Var1 value, e.g. 4, 8, 10.
I will use this :
J_BL5H1JNVar1Number <- J_BL5H1$Freq[1]
Here, I want Var1Number to be replaced by the Var1 values from the old data.
For example, if I want Freq[4], the new object should be called J_BL5H1JN11: Var1Number is automatically replaced by the Var1 that belongs to Freq[4], in this case 11.
I hope I have stated my problem clearly. Thanks.
First use paste to create the names of the data.sets:
data.string <- "J_BL5H1JN"
split.var <- "Var1"
data.sets <- paste(data.string, J_BL5H1[, split.var], sep = "")
Then use a loop to assign the corresponding values to the data sets:
for( i in seq_along(data.sets) ) assign(data.sets[i], J_BL5H1[i, "Freq"])
Now you have the data sets in your workspace:
ls()
By the way, if you want to access the different data sets without typing each name out, you can look them up by name using the get function:
sapply(data.sets, get)
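A different technique from assign/get, shown here only as an alternative sketch, is to keep the values in one named vector and look them up by name, which avoids creating many objects in the workspace. The data frame below holds just the first four rows from the question, and freq_by_var1 is an illustrative name.

```r
# First four rows of the question's data
J_BL5H1 <- data.frame(Var1 = c(4, 8, 10, 11), Freq = c(10, 10, 13, 7))

# One named vector instead of one workspace object per Var1 value
freq_by_var1 <- setNames(J_BL5H1$Freq, paste0("J_BL5H1JN", J_BL5H1$Var1))

# Look a value up by its constructed name
freq_by_var1[["J_BL5H1JN11"]]
```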

R: losing column names when adding rows to an empty data frame

I am just starting with R and encountered a strange behaviour: when inserting the first row in an empty data frame, the original column names get lost.
example:
a<-data.frame(one = numeric(0), two = numeric(0))
a
#[1] one two
#<0 rows> (or 0-length row.names)
names(a)
#[1] "one" "two"
a<-rbind(a, c(5,6))
a
# X5 X6
#1 5 6
names(a)
#[1] "X5" "X6"
As you can see, the column names one and two were replaced by X5 and X6.
Could somebody please tell me why this happens and is there a right way to do this without losing column names?
A shotgun solution would be to save the names in an auxiliary vector and then add them back when finished working on the data frame.
Thanks
Context:
I created a function which gathers some data and adds them as a new row to a data frame received as a parameter.
I create the data frame, iterate through my data sources, passing the data.frame to each function call to be filled up with its results.
The rbind help pages specifies that :
For ‘cbind’ (‘rbind’), vectors of zero
length (including ‘NULL’) are ignored
unless the result would have zero rows
(columns), for S compatibility.
(Zero-extent matrices do not occur in
S3 and are not ignored in R.)
So, in fact, a is ignored in your rbind instruction. Not totally ignored, it seems, because as it is a data frame the rbind function is called as rbind.data.frame :
rbind.data.frame(c(5,6))
# X5 X6
#1 5 6
Maybe one way to insert the row could be :
a[nrow(a)+1,] <- c(5,6)
a
# one two
#1 5 6
But there may be a better way to do it depending on your code.
I was close to giving up on this issue.
1) Create the data frame with stringsAsFactors set to FALSE, or you run straight into the next issue.
2) Don't use rbind; I have no idea why it messes up the column names. Simply do it this way:
df[nrow(df)+1,] <- c("d","gsgsgd",4)
df <- data.frame(a = character(0), b=character(0), c=numeric(0))
df[nrow(df)+1,] <- c("d","gsgsgd",4)
#Warning messages:
#1: In `[<-.factor`(`*tmp*`, iseq, value = "d") :
# invalid factor level, NAs generated
#2: In `[<-.factor`(`*tmp*`, iseq, value = "gsgsgd") :
# invalid factor level, NAs generated
df <- data.frame(a = character(0), b=character(0), c=numeric(0), stringsAsFactors=F)
df[nrow(df)+1,] <- c("d","gsgsgd",4)
df
# a b c
#1 d gsgsgd 4
Workaround would be:
a <- rbind(a, data.frame(one = 5, two = 6))
?rbind states that merging objects demands matching names:
It then takes the classes of the
columns from the first data frame, and
matches columns by name (rather than
by position)
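Applied to the context described in the question (a function that receives the data frame and appends a row), the workaround might look like the sketch below. add_row is a hypothetical helper name and the values are placeholders.

```r
# Hypothetical helper: append one row as a one-row data frame so names are matched
add_row <- function(df, one, two) {
  rbind(df, data.frame(one = one, two = two))
}

a <- data.frame(one = numeric(0), two = numeric(0))
a <- add_row(a, 5, 6)
a <- add_row(a, 7, 8)
names(a)
```

The function returns the extended data frame, so the caller has to assign the result back, as in the two calls above.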
FWIW, an alternative design might have your functions building vectors for the two columns, instead of rbinding to a data frame:
ones <- c()
twos <- c()
Modify the vectors in your functions:
ones <- append(ones, 5)
twos <- append(twos, 6)
Repeat as needed, then create your data.frame in one go:
a <- data.frame(one=ones, two=twos)
One way to make this work generically and with the least amount of re-typing the column names is the following. This method doesn't require hacking the NA or 0.
rs <- data.frame(i=numeric(), square=numeric(), cube=numeric())
for (i in 1:4) {
calc <- c(i, i^2, i^3)
# append calc to rs
names(calc) <- names(rs)
rs <- rbind(rs, as.list(calc))
}
rs will have the correct names
> rs
i square cube
1 1 1 1
2 2 4 8
3 3 9 27
4 4 16 64
>
Another way to do this more cleanly is to use data.table:
> library(data.table)
> df <- data.frame(a=numeric(0), b=numeric(0))
> rbind(df, list(1,2)) # column names are messed up
  X1 X2
1  1  2
> df <- data.table(a=numeric(0), b=numeric(0))
> rbind(df, list(1,2)) # column names are preserved
   a b
1: 1 2
Notice that a data.table is also a data.frame.
> class(df)
"data.table" "data.frame"
You can do this:
Give the initial data frame one row:
df=data.frame(matrix(nrow=1,ncol=length(newrow)))
Add your new row and take out the NAs:
newdf=na.omit(rbind(newrow,df))
But watch out that your newrow does not contain NAs, or it will be removed too.
Cheers
Agus
I use the following solution to add a row to an empty data frame:
d_dataset <-
data.frame(
variable = character(),
before = numeric(),
after = numeric(),
stringsAsFactors = FALSE)
d_dataset <-
rbind(
d_dataset,
data.frame(
variable = "test",
before = 9,
after = 12,
stringsAsFactors = FALSE))
print(d_dataset)
variable before after
1 test 9 12
HTH.
Kind regards
Georg
Researching this venerable R annoyance brought me to this page. I wanted to add a bit more explanation to Georg's excellent answer (https://stackoverflow.com/a/41609844/2757825), which not only solves the problem raised by the OP (losing field names) but also prevents the unwanted conversion of all fields to factors. For me, those two problems go together. I wanted a solution in base R that doesn't involve writing extra code but preserves the two distinct operations: define the data frame, append the row(s)--which is what Georg's answer provides.
The first two examples below illustrate the problems and the third and fourth show Georg's solution.
Example 1: Append the new row as vector with rbind
Result: loses column names AND converts all variables to factors
my.df <- data.frame(
name = character(0),
score = numeric(0),
stringsAsFactors=FALSE
)
my.df <- rbind(
my.df,
c("Bob", 250)
)
my.df
X.Bob. X.250.
1 Bob 250
str(my.df)
'data.frame': 1 obs. of 2 variables:
$ X.Bob.: Factor w/ 1 level "Bob": 1
$ X.250.: Factor w/ 1 level "250": 1
Example 2: Append the new row as a data frame inside rbind
Result: keeps column names but still converts character variables to factors (in R versions before 4.0.0, where stringsAsFactors defaulted to TRUE).
my.df <- data.frame(
name = character(0),
score = numeric(0),
stringsAsFactors=FALSE
)
my.df <- rbind(
my.df,
data.frame(name="Bob", score=250)
)
my.df
name score
1 Bob 250
str(my.df)
'data.frame': 1 obs. of 2 variables:
$ name : Factor w/ 1 level "Bob": 1
$ score: num 250
Example 3: Append the new row inside rbind as a data frame, with stringsAsFactors=FALSE
Result: problem solved.
my.df <- data.frame(
name = character(0),
score = numeric(0),
stringsAsFactors=FALSE
)
my.df <- rbind(
my.df,
data.frame(name="Bob", score=250, stringsAsFactors=FALSE)
)
my.df
name score
1 Bob 250
str(my.df)
'data.frame': 1 obs. of 2 variables:
$ name : chr "Bob"
$ score: num 250
Example 4: Like example 3, but adding multiple rows at once.
my.df <- data.frame(
name = character(0),
score = numeric(0),
stringsAsFactors=FALSE
)
my.df <- rbind(
my.df,
data.frame(
name=c("Bob", "Carol", "Ted"),
score=c(250, 124, 95),
stringsAsFactors=FALSE)
)
str(my.df)
'data.frame': 3 obs. of 2 variables:
$ name : chr "Bob" "Carol" "Ted"
$ score: num 250 124 95
my.df
name score
1 Bob 250
2 Carol 124
3 Ted 95
Instead of constructing the data.frame with numeric(0) I use as.numeric(0).
a<-data.frame(one=as.numeric(0), two=as.numeric(0))
This creates an extra initial row
a
# one two
#1 0 0
Bind the additional rows
a<-rbind(a,c(5,6))
a
# one two
#1 0 0
#2 5 6
Then use negative indexing to remove the first (bogus) row
a<-a[-1,]
a
# one two
#2 5 6
Note: it messes up the index (far left). I haven't figured out how to prevent that (anyone else?), but most of the time it probably doesn't matter.
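The row index on the far left is the data frame's row names, and in base R they can be reset by assigning NULL to rownames(). A short sketch that repeats the steps above and then renumbers the rows:

```r
a <- data.frame(one = as.numeric(0), two = as.numeric(0))
a <- rbind(a, c(5, 6))
a <- a[-1, ]          # drop the placeholder row; the remaining row is labelled "2"

rownames(a) <- NULL   # reset row names to the default 1, 2, ...
a
```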

Resources