Iterate over Factors in a dataframe in R - r

I am rather new to R and struggling at the moment with a specific issue. I need to iterate over a dataframe with 1 variable returned from a SQL database so that I can ultimately issue additional SQL queries using the information in the 1 variable. I need help understanding how to do this.
Here is what I have
> dt
Col
1 5D2D3F03-286E-4643-8F5B-10565608E5F8
2 582771BE-811E-4E45-B770-42A98EB5D7FB
3 4EB4D553-C680-4576-A854-54ED817226B0
4 80D53D5D-80D1-4A60-BD86-C85F6D53390D
5 9EF6CABF-0A4F-4FA9-9FD9-132589CAAC31
when trying to access by using it prints the entire list just as above
> dt[1]
Col
1 5D2D3F03-286E-4643-8F5B-10565608E5F8
2 582771BE-811E-4E45-B770-42A98EB5D7FB
3 4EB4D553-C680-4576-A854-54ED817226B0
4 80D53D5D-80D1-4A60-BD86-C85F6D53390D
5 9EF6CABF-0A4F-4FA9-9FD9-132589CAAC31
when trying to access by dt[1,] it brings additional unwanted information.
> a<-dt[1,]
> a
[1] 5D2D3F03-286E-4643-8F5B-10565608E5F8
5 Levels: 4EB4D553-C680-4576-A854-54ED817226B0 ... 9EF6CABF-0A4F-4FA9-9FD9-132589CAAC31
I need to isolate just the '5D2D3F03-286E-4643-8F5B-10565608E5F8' information and not the '5 levels......'.
I am sure this is simple, I just can't find it. any help is appreciated!
thanks!

There are two issues you need to address. One is that you want character data, not a factor variable (a factor is essentially a category variable). The other is that you want a simple vector of the values, not a data.frame.
1) To get the first column as a vector, use double-brackets or the $ notation:
a <- dt[[1]]
a <- dt[['Col']]
a <- dt$Col
Your notation dt[1,] does actually return the column as a vector too, but using the somewhat obscure fact that the [ method for data.frame objects will silently "drop" its value to a vector when using the two-index form dt[i,j], but not when using the one-index form dt[i]:
When [ and [[ are used with a single vector index (x[i] or x[[i]]), they index the data frame as if it were a list. In this usage a drop argument is ignored, with a warning.
Think of "dropping" like unboxing the data - instead of getting a data.frame with a single column, you're just getting the column data itself.
2) To convert to character data, use one of the suggestions in the comments from #akrun or #Vlo:
a <- as.character(dt[[1]])
a <- as.character(dt[['Col']])
a <- as.character(dt$Col)
or use the API of whatever you're using to make the SQL query - or to read in the results of the query - not convert the strings to factors in the first place.

Related

Trying to find a better way to sorting the data in R

In my data frame I am trying to sort the data in descending order. I am using the below line of code for sorting my data and it works as intended.
CNS25VOL <- CNS25VOL[order(-CNS25VOL$MATVOL22), ]
However if I refer to the same column by it's index number, the code throws an error
CNS25VOL <- CNS25VOL[order(-CNS25VOL[, 2]), ]
Error thrown is
Error in CNS25VOL[, 2] : incorrect number of dimensions
While I do have a solution to what I am intending to do, but issue I see is if all of a sudden name of my column changes the code won't work. I know that their position will stay same in the data frame.
How can we handle it.
order(-CNS25VOL[, 2]) order here does expect a vector which you try to construct via the [] in CNS25VOL[, 2]. Normal dataframes will return a vector consisting only of the 2nd column. A tibble however will return a tibble with only one column.
You can reproduce the behaviour of normal data.frames with the drop = FALSE argument to [] as in
CNS25VOL[, 2, drop = TRUE]
Try to always be aware whether you are using a standard data.frame or a tibble or a data.table because they look very similar and are not in the details. Also see https://tibble.tidyverse.org/reference/subsetting.html
dplyr functions tend to give you a tibble back even if you fed them a classical data.frame.

Dynamically change part of variable name in R

I am trying to automatise some post-hoc analysis, but I will try to explain myself with a metaphor that I believe will illustrate what I am trying to do.
Suppose I have a list of strings in two lists, in the first one I have a list of names and in the other a list of adjectives:
list1 <- c("apt", "farm", "basement", "lodge")
list2 <- c("tiny", "noisy")
Let's suppose also I have a data frame with a bunch of data that I have named something like this as they are the results of some previous linear analysis.
> head(df)
qt[apt_tiny,Intercept] qt[apt_noisy,Intercept] qt[farm_tiny,Intercept]
1 4.196321 -0.4477012 -1.0822793
2 3.231220 -0.4237787 -1.1433449
3 2.304687 -0.3149331 -0.9245896
4 2.768691 -0.1537728 -0.9925387
5 3.771648 -0.1109647 -0.9298861
6 3.370368 -0.2579591 -1.0849262
and so on...
Now, what I am trying to do is make some automatic operations where the strings in the previous lists dynamically change as they go in a for loop. I have made a list with all the distinct combinations and called it distinct. Now I am trying to do something like this:
for (i in 1:nrow(distinct)){
var1[[i]] <- list1[[i]]
var2[[i]] <- list2[[i]]
#this being the insertable name part for the rest of the variables and parts of variable,
#i'll put it inside %var[[i]]% for the sake of the explanation.
%var1[[i]]%_%var2[[i]]%_INT <- df$`qt[%var1[[i]]%_%var2[[i]]%,Intercept]`+ df$`qt[%var1[[i]]%,Intercept]`
}
The difficult thing for me here is %var1[[i]]% is at the same time inside a variable and as the name of a column inside a data frame.
Any help would be much appreciated.
You cannot use $ to extract column values with a character variable. So df$`qt[%var1[[i]]%_%var2[[i]]%,Intercept] will not work.
Create the name of the column using sprintf and use [[ to extract it. For example to construct "qt[apt_tiny,Intercept]" as column name you can do :
i <- 1
sprintf('qt[%s_%s,Intercept]', list1[i], list2[i])
#[1] "qt[apt_tiny,Intercept]"
Now use [[ to subset that column from df
df[[sprintf('qt[%s_%s,Intercept]', list1[i], list2[i])]]
You can do the same for other columns.

R returns list instead of filling in dataframe column

I am trying to use apply() to fill in an additional column in a dataframe and by calling a function I created with each row of the data frame.
The dataframe is called Hit.Data has 2 columns Zip.Code and Hits. Here are a few rows
Zip.Code , Hits
97222 , 20
10100 , 35
87700 , 23
The apply code is the following:
Hit.Data$Zone = apply(Hit.Data, 1, function(x) lookupZone("89000", x["Zip.Code"]))
The lookupZone() function is the following:
lookupZone <- function(sourceZip, destZip){
sourceKey = substr(sourceZip, 1, 3)
destKey = substr(destZips, 1, 3)
return(zipToZipZoneMap[[sourceKey]][[destKey]])
}
All the lookupZone() function does is take the 2 strings, truncates to the required characters and looks up the values. What happens when I run this code though is that R assigns a list to Hit.Data$Zone instead of filling in data row by row.
> typeof(Hit.Data$Zone)
[1] "list
What baffles me is that when I use apply and just tell it to put a number in it works correctly:
> Hit.Data$Zone = apply(Hit.Data, 1, function(x) 2)
> typeof(Hit.Data$Zone)
[1] "double"
I know R has a lot of strange behavior around dropping dimensions of matrices and doing odd things with lists but this looks like it should be pretty straightforward. What am I missing? I feel like there is something fundamental about R I am fighting, and so far it is winning.
Your problem is that you are occasionally looking up non-existing entries in your hashmap, which causes hash to silently return NULL. Consider:
> hash("890", hash("972"=3, "101"=3, "877"=3))[["890"]][["101"]]
[1] 3
> hash("890", hash("972"=3, "101"=3, "877"=3))[["890"]][["100"]]
NULL
If apply encounters any NULL values, then it can't coerce the result to a vector, so it will return a list. Same will happen with sapply.
You have to ensure that all possible combinations of the first three zip code digits in your data are present in your hash, or you need logic in your code to return NA instead of NULL for missing entries.
As others have said, it's hard to diagnose without knowing what ZiptoZipZoneMap(...) is doing, but you could try this:
Hit.Data$Zone <- sapply(Hit.Data$Zip.Code, function(x) lookupZone("89000", x))

R equivalent to the MATLAB structure?

Is there an R type equivalent to the Matlab structure type?
I have a few named vectors and I try to store them in a data frame. Ideally, I would simply access one element of an object and it would return the named vectors (like a structure in Matlab). I feel that using a data frame is not the right thing to do since it can store the values of the named vectors but not the names when they differ from one vector to the other.
More generally, is it possible to store a bunch of different objects in a single one in R?
Edit: As Joran said I think that list does the job.
l = list()
l$vec1 = namedVector1
l$vec2 = namedVector2
...
If I have a list of names
name1 = 'vec1'
name2 = 'vec2'
is there any way for the interpreter to understand that when I use a variable name like name1, I am not referring to the variable name but to its content? I have tried get(name1) but it does not work.
I could still be wrong about what you're trying to do, but I think this is the best you're going to get in terms of accessing each list element by name:
l <- list(a= 1:3,b = 1:10)
> ind <- "a"
> l[[ind]]
[1] 1 2 3
Namely, you're going to have to use [[ explicitly.

Getting at grouping term from within ddply function?

I have the following example data frame x
id val
a 1
a 2
a 3
b 2
b 4
b 9
I have a simple function that I apply while doing my SAC. Something like,
f.x <- function(x){
cat(x$id,"\n")
}
Note, all that I'm trying to do is print out the grouping term. This function is called using ddply as follows
ddply(x,.(id),f.x)
However, when I run this I get integer outputs which I suspect are indices of a list. How do I get the actual grouping term within f.x, in this case 'a' & 'b'?
Thanks much in advance.
Never be fooled into thinking that just because a variable looks like a bunch of characters, that it really is just a bunch of characters.
Factors are not characters. They are integer codes and will often behave like integer codes.
Coerce the values to character using as.character().
If every thing you want to do is print you should use d_ply.
In ddply (d_ply) actual argument passed to function (f.x here) is data.frame (subset of x data.frame), so you should modify your f.x function.
f.x <- function(x) cat(x[1,"id"],"\n")
d_ply(x,.(id),f.x)

Resources