R spread function (error in ... undefined columns selected)

I googled my error, but that didn't help me.
I have a data frame with a column x.
unique(df$x)
The result is:
[1] "fc_social_media" "fc_banners" "fc_nat_search"
[4] "fc_direct" "fc_paid_search"
When I try this:
df <- spread(data = df, key = x, value = x, fill = "0")
I got the error:
Error in `[.data.frame`(data, setdiff(names(data), c(key_var, value_var))) :
undefined columns selected
That is very strange, because I have used the spread function several times elsewhere in the same script.
So I googled and tried some suggested solutions:
I removed all the "special" characters. As you can see, my unique values no longer contain special characters (I cleaned them), but this didn't help.
I checked whether any columns share the same name (a quick sketch of such a check is shown below), but all column names are unique.
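For reference, a minimal sketch of such a name check (df is the data frame from above):

any(duplicated(names(df)))   # TRUE if two columns share a name
which(names(df) == "")       # positions of columns with an empty name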
@Gregor, @Akrun:
> str(df)
'data.frame': 100 obs. of 22 variables:
$ visitor_id : chr "321012312666671237877-461170125342559040419" "321012366667112237877-461121705342559040419" "321012366661271237877-461170534255901240419" "321012366612671237877-461170534212559040419" ...
$ visit_num : chr "1" "1" "1" "1" ...
$ ref_domain : chr "l.facebook.com" "X.co.uk" "x.co.uk" "" ...
$ x : chr "fc_social_media" "fc_social_media" "fc_social_media" "fc_social_media" ...
$ va_closer_channel : chr "Social Media" "Social Media" "Social Media" "Social Media" ...
$ row : int 1 2 3 4 5 6 7 8 9 10 ...
$ : chr "0" "0" "0" "0" ...
$ Hard Drive : chr "0" "0" "0" "0" ...

The error could be due to a column without a name, i.e. "". Using a reproducible example:
library(tidyr)
spread(df, x, x)
Error in [.data.frame(data, setdiff(names(data), c(key_var,
value_var))) : undefined columns selected
We can make it work by repairing the empty column name with make.names:
names(df) <- make.names(names(df))
spread(df, x, x, fill = "0")
# X fc_banners fc_direct fc_nat_search fc_paid_search fc_social_media
#1 1 0 0 0 0 fc_social_media
#2 2 fc_banners 0 0 0 0
#3 3 0 0 fc_nat_search 0 0
#4 4 0 fc_direct 0 0 0
#5 5 0 0 0 fc_paid_search 0
data
df <- data.frame(x = c("fc_social_media", "fc_banners",
"fc_nat_search", "fc_direct", "fc_paid_search"), x1 = 1:5, stringsAsFactors = FALSE)
names(df)[2] <- ""
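As an aside, spread() has since been superseded in tidyr; with the names repaired, a roughly equivalent reshape can be sketched with pivot_wider():

library(tidyr)
pivot_wider(df, names_from = x, values_from = x, values_fill = "0")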

Related

Loop across rows and columns with nested data

I have the following data structure: meetings of persons within groups. The groups met with varying frequency, and the number of group members varied from meeting to meeting.
$ GroupID : chr "1" "1" "1" "1" ...
$ groupnames : chr "A&M" "A&M" "A&M" "A&M" ...
$ MeetiID : chr "1" "1" "2" "2" ...
$ Date_Meetings : chr "43293" "43293" "43298" "43298" ...
$ PersonID : num 171 185 171 185 185 113 135 113 135 113 ...
$ v_165 : chr "3" "3" "4" "3" ...
$ v_166 : chr "2" "2" "3" "3" ...
$ v_167 : chr "2" "4" "4" "3" ...
$ v_168 : chr "6" "7" "4" "5" ...
$ problemtypes_categories: chr "Knowledgeproblem" "Knowledgeproblem" "Motivationalproblem" "Coordinationproblem" ...
$ v_165_dicho : num 0 0 0 0 1 1 1 0 0 1 ...
$ v_166_dicho : num 0 0 0 0 0 0 0 0 0 0 ...
$ v_167_dicho : num 0 0 0 0 1 1 0 0 0 0 ...
Now I need to create a new binary (0/1) variable named agreement_levels. Whenever a person in a group has, for the same learning meeting, the same problem type category as the other learner(s) of that group at that meeting, all of those learners (two, three, or four, depending on the group size at that meeting) should get the value 1 on the agreement variable; otherwise they should all get 0. As soon as one person (e.g., among four learners) has a different problem category than the others, all of them get 0 on the agreement variable.
If only one person is in the data set for a given meeting, the agreement variable must be NA. If one person has NA on the problem type variable and there are two people in the data set for that meeting, both get 0; but if there are four people in the data set for that meeting and one of them has NA on the problem type, then only that person, and not the others, gets NA.
I have already written some code, but it is not working yet and still does not handle the NAs:
GroupID1 <- df$GroupID
TreffID1 <- df$TreffID

for (i in 1:(nrow(df) - 1)) {
  if (df[i, 3] == df[i + 1, 3]) {       # adjacent rows belong to the same meeting
    if (df[i, 15] == df[i + 1, 15]) {   # and have the same problem type category
      df[c(i, i + 1), 28] <- 1
    } else {
      df[c(i, i + 1), 28] <- 0
    }
  }
}
Many thanks in advance.
dput(head(df))
structure(list(GroupID = c("1", "1", "1", "1", "1", "2"), TreffID = c("1", "1",
"2", "2", "3", "1"), PersonID = c(171, 185, 171, 185,
185, 113), problemtypen_oberkategorien = c("Verständnisprobleme",
"Verständnisprobleme", "Motivationsprobleme", "Motivationsprobleme",
"Motivationsprobleme", "Motivationsprobleme"), passung.exkl = c("0",
"0", "0", "0", "1", "1")), row.names = c(NA, -6L), class = c("tbl_df",
"tbl", "data.frame"))
Instead of loops, I used dplyr. I'm not sure I got all of your logic right, since there was a lot there; for example, you didn't specify what should happen with an NA problem type and 3 people. But here is a starting point. It uses group_by, so that we look within each set of rows with the same GroupID and TreffID; mutate and case_when, which assign values to a new column according to criteria; and helper functions such as n(), which counts how many rows are in the group, and n_distinct(), which counts distinct values, so that if it equals 1 we know they are all the same.
library(tidyverse)

df <- df %>%
  group_by(GroupID, TreffID) %>%
  mutate(agreement_levels = case_when(
           n() == 1 ~ -1,
           is.na(problemtypen_oberkategorien) & n() == 2 ~ 0,
           is.na(problemtypen_oberkategorien) & n() > 2 ~ -1,
           n_distinct(problemtypen_oberkategorien, na.rm = FALSE) == 1 ~ 1,
           n_distinct(problemtypen_oberkategorien, na.rm = FALSE) > 1 ~ 0,
           TRUE ~ -1),
         agreement_levels = na_if(agreement_levels, -1)) %>%
  select(GroupID, TreffID, problemtypen_oberkategorien, agreement_levels, everything())
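On the dput() sample above, and assuming I have traced the logic correctly, this gives 1 for the two two-person meetings (both rows share a problem type) and NA for the two single-person meetings:

df$agreement_levels
# [1]  1  1  1  1 NA NA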

colnames() function in R - Treating table values as independent objects/variables

I have a list of values which I would like to use as names for separate tables scraped from separate URLs on a certain website.
> Fac_table
[[1]]
[1] "fulltime_fac_table"
[[2]]
[1] "parttime_fac_table"
[[3]]
[1] "honorary_fac_table"
[[4]]
[1] "retired_fac_table"
I would like to loop through the list to automatically generate 4 tables with the respective names.
The result should look like this:
> fulltime_fac_table
職稱
V1 "教授兼系主任"
V2 "教授"
V3 "教授"
V4 "教授"
V5 "特聘教授"
> parttime_fac_table
職稱 姓名
V1 "教授" "XXX"
V2 "教授" "XXX"
V3 "教授" "XXX"
V4 "教授" "XXX"
V5 "教授" "XXX"
V6 "教授" "XXX"
I have another list, named 'headers', containing column headings of the respective tables online.
> headers
[[1]]
[1] "職稱" "姓名" "    研究領域"
[4] "聯絡方式"
[[2]]
[1] "職稱" "姓名" "研究領域" "聯絡方式"
I was able to assign values to the respective tables with this code:
> assign(eval(parse(text="Fac_table[[i]]")), as_tibble(matrix(fac_data,
>   nrow = length(headers[[i]]))))
This results in a populated table, without column headings, like this one:
> honorary_fac_table
[,1] [,2]
V1 "名譽教授" "XXX"
V2 "名譽教授" "XXX"
V3 "名譽教授" "XXX"
V4 "名譽教授" "XXX"
But I was unable to assign column names to each table.
None of the code below worked:
> assign(colnames(eval(parse(text="Fac_table[1]"))), c(gsub("\\s", "", headers[[1]])))
Error in assign(colnames(eval(parse(text = "Fac_table[1]"))), c(gsub("\\s", :
invalid first argument
> colnames(eval(parse(text="Fac_table[i]"))) <- c(gsub("\\s", "", headers[[i]]))
Error in colnames(eval(parse(text = "Fac_table[i]"))) <- c(gsub("\\s", :
target of assignment expands to non-language object
> do.call("<-", colnames(eval(parse(text="Fac_table[i]"))), c(gsub("\\s", "", headers[[i]])))
Error in do.call("<-", colnames(eval(parse(text = "Fac_table[i]"))), c(gsub("\\s", :
second argument must be a list
To simplify the issue, a reproducible example is as follows:
> varNamelist <- list(c("tbl1","tbl2","tbl3","tbl4"))
> colHeaderlist <- list(c("col1","col2","col3","col4"))
> tableData <- matrix(1:12, ncol=4)
This works:
> assign(eval(parse(text="varNamelist[[1]][1]")), matrix(tableData, ncol
> = length(colHeaderlist[[1]])))
But this doesn't:
> colnames(as.name(varNamelist[[1]][1])) <- colHeaderlist[[1]]
Error in `colnames<-`(`*tmp*`, value = c("col1", "col2", "col3", "col4" :
attempt to set 'colnames' on an object with less than two dimensions
It seems that the colnames() function in R cannot treat the strings represented by "Fac_table[i]" as variable names in which independent data (separate from Fac_table) can be stored.
> colnames(as.name(Fac_table[[1]])) <- headers[[1]]
Error in `colnames<-`(`*tmp*`, value = c("a", "b", "c", :
attempt to set 'colnames' on an object with less than two dimensions
Substituting 'fulltime_fac_table' directly works fine:
> colnames(fulltime_fac_table) <- headers[[1]]
Is there any way around this issue?
Thanks!
There is a solution to this, but if I understand correctly, the current setup may be more complex than necessary, so I'll try to make the task easier.
If you're working with one-dimensional data, I'd recommend using vectors, as they're more appropriate than lists for that purpose. So for this project, I'd begin by storing the table names and the headers like this:
varNamelist <- c("tbl1","tbl2","tbl3","tbl4")
colHeaderlist <- c("col1","col2","col3","col4")
It's still difficult to tell from your question what the format and origin of the input data for these tables is, but in general a data frame can be easier to work with than a matrix, as long as you're not working with Big Data. The assign function is also typically not necessary for these sorts of steps. Instead, when setting up a data frame, we can supply the name of the data frame, the column names, and the data contents all at once, like this:
tbl1 <- data.frame("col1" = c(1, 2, 3),
                   "col2" = c(4, 5, 6),
                   "col3" = c(7, 8, 9),
                   "col4" = c(10, 11, 12))
Again, we're using vectors, noted by the c() instead of list(), to fill each column, since each column is its own single dimension.
To check the output of tbl1, we can then use print():
print(tbl1)
col1 col2 col3 col4
1 1 4 7 10
2 2 5 8 11
3 3 6 9 12
If creating the tables closer to the way shown here is an option, that might make things easier than using so many lists and assign() calls, which quickly becomes overly complicated.
But if, at the end, you want to store all the tables in a single place, you could put them in a list:
tableList <- list(tbl1 = tbl1, tbl2 = tbl2, tbl3 = tbl3, tbl4 = tbl4)
str(tableList)
List of 4
$ tbl1:'data.frame': 3 obs. of 4 variables:
..$ col1: num [1:3] 1 2 3
..$ col2: num [1:3] 4 5 6
..$ col3: num [1:3] 7 8 9
..$ col4: num [1:3] 10 11 12
$ tbl2:'data.frame': 3 obs. of 4 variables:
..$ col1: num [1:3] 1 2 3
..$ col2: num [1:3] 4 5 6
..$ col3: num [1:3] 7 8 9
..$ col4: num [1:3] 10 11 12
$ tbl3:'data.frame': 3 obs. of 4 variables:
..$ col1: num [1:3] 1 2 3
..$ col2: num [1:3] 4 5 6
..$ col3: num [1:3] 7 8 9
..$ col4: num [1:3] 10 11 12
$ tbl4:'data.frame': 3 obs. of 4 variables:
..$ col1: num [1:3] 1 2 3
..$ col2: num [1:3] 4 5 6
..$ col3: num [1:3] 7 8 9
..$ col4: num [1:3] 10 11 12
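Once the tables live in a list like this, the original column-naming problem also becomes simpler: you can set the names on the list elements directly, with no assign() or eval(parse()) needed. A minimal sketch, reusing colHeaderlist from above:

tableList <- lapply(tableList, function(tb) {
  colnames(tb) <- colHeaderlist  # same four headers for every table
  tb
})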
I've found a workaround based on @Ryan's recommendation, given by this code:
library(rvest)   # read_html(), html_nodes(), html_text(), %>%

for (i in seq_along(url)) {
  webpage <- read_html(url[i])   # loop through the URL list to access the html data
  fac_data <- html_nodes(webpage, '.tableunder') %>% html_text()
  fac_data1 <- html_nodes(webpage, '.tableunder1') %>% html_text()
  fac_data <- c(fac_data, fac_data1)   # store the table data from each URL in a variable
  x <- fac_data %>% matrix(ncol = length(headers[[i]]), byrow = TRUE)   # make a matrix to extract column data
  for (j in seq_along(headers[[i]])) {
    y <- cbind(x[, j])   # extract column data and store it in a temporary variable
    colnames(y) <- as.character(headers[[i]][j])   # add the column name
    print(cbind(y))   # loop through the headers list to print the column data in sequence.
                      # ** cbind(y) will be overwritten when I try to store the result in a list with 'z <- cbind(y)'.
  }
}
I am now able to print out all values, complete with headers of the data in question.
Follow-up questions have been posted here.
The final code solved this problem as well.
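Regarding the comment in the loop above about cbind(y) being overwritten: one way to keep each named column instead of only printing it (a sketch, building on the same x and headers objects; the name fac_table is just illustrative) is to collect the pieces in a list and bind them afterwards:

cols <- vector("list", length(headers[[i]]))
for (j in seq_along(headers[[i]])) {
  y <- cbind(x[, j])
  colnames(y) <- as.character(headers[[i]][j])
  cols[[j]] <- y                    # keep the column instead of overwriting y
}
fac_table <- do.call(cbind, cols)   # full table with all headers attached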

call REST API per row of data.frame in R

I'm trying to make an API call for each row of my data.frame, passing the value of a row as a parameter. The return value should be added to my data frame.
However, I'm failing to make my code work. Any help is appreciated.
Contents of APPLYapiCallTest.csv:
id title returnValue
01 "wine" ""
02 "beer" ""
03 "coffee" ""
Essentially, call the API with callAPI("wine") and replace the empty field with the result, e.g. the beverage.
Here's what I've got so far.
# call api per row using apply
library(jsonlite)
library(httr)

callAPI <- function(x) {
  findWhat <- as.character(x)
  # create URL
  url1 <- "http://api.nal.usda.gov/ndb/search/?format=json&q="
  url2 <- "&sort=n&max=1&offset=0&api_key=KYG9lsu0nz31SG5yHGdAsM28IuCEGxWWlvdYqelI&location=Chicago%2BIL"
  fURL <- paste(url1, findWhat, url2, sep = "")
  apiRet <- data.frame(fromJSON(txt = fURL, flatten = TRUE))
  result <- apiRet[1, c(9)]
  return(result)
}

tData <- data.frame(read.delim("~/Documents/R-SCRIPTS/DATA/APPLYapiCallTest"))
# this is the part that fails: tData[, c('title')] drops to a plain character
# vector, and apply() needs a matrix-like object with dimensions
apply(tData[, c('title')], 1, function(x) callAPI(x))
I think you'd be better off doing something like this:
library(jsonlite)
library(httr)
library(pbapply)

callAPI <- function(x) {
  res <- GET("http://api.nal.usda.gov/ndb/search/",
             query = list(format = "json",
                          q = x,
                          sort = "n",
                          offset = 0,
                          max = 16,
                          api_key = Sys.getenv("NAL_API_KEY"),
                          location = "Berwick+ME"))
  stop_for_status(res)
  return(data.frame(fromJSON(content(res, as = "text"), flatten = TRUE), stringsAsFactors = FALSE))
}
tData <- data.frame(id = c("01", "02", "03"),
                    title = c("wine", "beer", "coffee"),
                    returnValue = c("", "", ""),
                    stringsAsFactors = FALSE)

dat <- merge(do.call(rbind.data.frame, pblapply(tData$title, callAPI)),
             tData[, c("title", "id")], by.x = "list.q", by.y = "title", all.x = TRUE)
str(dat)
## 'data.frame': 45 obs. of 12 variables:
## $ list.q : chr "beer" "beer" "beer" "beer" ...
## $ list.sr : chr "28" "28" "28" "28" ...
## $ list.start : int 0 0 0 0 0 0 0 0 0 0 ...
## $ list.end : int 13 13 13 13 13 13 13 13 13 13 ...
## $ list.total : int 13 13 13 13 13 13 13 13 13 13 ...
## $ list.group : chr "" "" "" "" ...
## $ list.sort : chr "n" "n" "n" "n" ...
## $ list.item.offset: int 0 1 2 3 4 5 6 7 8 9 ...
## $ list.item.group : chr "Beverages" "Beverages" "Beverages" "Beverages" ...
## $ list.item.name : chr "Alcoholic beverage, beer, light" "Alcoholic beverage, beer, light, BUD LIGHT" "Alcoholic beverage, beer, light, BUDWEISER SELECT" "Alcoholic beverage, beer, light, higher alcohol" ...
## $ list.item.ndbno : chr "14006" "14007" "14005" "14248" ...
## $ id : chr "02" "02" "02" "02" ...
This way you're making a structured call via httr::GET for each title in tData and then binding all the results together into a data frame and finally adding back the ID.
This also allows you to put NAL_API_KEY into .Renviron and avoid exposing it in your workflows (and on SO :-)
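For reference, a minimal sketch of that setup (NAL_API_KEY matches the code above; the value shown is a placeholder):

# Add this line to ~/.Renviron, then restart R:
# NAL_API_KEY=your_api_key_here
Sys.getenv("NAL_API_KEY")   # callAPI() reads the key from here at run time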
You can trim out the API response fields you don't need, either in the function or outside it.
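For example, keeping only a few of the columns shown in str(dat) above (a sketch; pick whichever fields you actually need):

dat_trim <- dat[, c("id", "list.q", "list.item.name", "list.item.ndbno")]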

Transform a column with "NULL" values to numeric

I've imported a dataset into R in which a column that is supposed to contain numeric values contains "NULL" entries. This makes R set the column class to character or factor, depending on whether or not you use the stringsAsFactors argument.
To give you an idea, this is the structure of the dataset:
> str(data)
'data.frame': 1016 obs. of 10 variables:
$ Date : Date, format: "2014-01-01" "2014-01-01" "2014-01-01" "2014-01-01" ...
$ Name : chr "Chi" "Chi" "Chi" "Chi" ...
$ Impressions: chr "229097" "3323" "70171" "1359" ...
$ Revenue : num 533.78 11.62 346.16 3.36 1282.28 ...
$ Clicks : num 472 13 369 1 963 161 1 7 317 21 ...
$ CTR : chr "0.21" "0.39" "0.53" "0.07" ...
$ PCC : chr "32" "2" "18" "0" ...
$ PCOV : chr "3470.52" "94.97" "2176.95" "0" ...
$ PCROI : chr "6.5" "8.17" "6.29" "NULL" ...
$ Dimension : Factor w/ 11 levels "100x72","1200x627",..: 1 3 4 5 7 8 9 10 11 1 ...
I would like to convert the PCROI column to numeric, but the "NULL" entries it contains make this harder.
I've tried to get around the issue by setting the value 0 for all observations whose current value is "NULL", but I got the following error message:
> data$PCROI[which(data$PCROI == "NULL"), ] <- 0
Error in data$PCROI[which(data$PCROI == "NULL"), ] <- 0 :
incorrect number of subscripts on matrix
My idea was to change all the "NULL" observations to 0 and afterwards convert the whole column to numeric using the as.numeric function.
You have a syntax error:
data$PCROI[which(data$PCROI == "NULL"), ] <- 0 # will not work
data$PCROI[which(data$PCROI == "NULL")] <- 0 # will work
By the way, you can simply do:
data$PCROI = as.numeric(data$PCROI)
It will convert your "NULL" values to NA automatically.
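A minimal sketch of that coercion, using values from the str() output above:

x <- c("6.5", "8.17", "6.29", "NULL")
as.numeric(x)
# [1] 6.50 8.17 6.29   NA
# Warning message:
# NAs introduced by coercion

Note that if the column were a factor rather than character, you would want as.numeric(as.character(data$PCROI)) instead; otherwise you get the factor level codes.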

What is R table function max size?

I'm using the R table() function, but it only gives me 4222 rows. Is there some kind of configuration to allow more rows?
The table function is not limited to 4222 rows. Most likely, it is the printing limit that is giving you trouble.
Try:
options(max.print = 20000)
also, check the "real" number of rows:
tbl <- table(state.division, state.region)
nrow(tbl)
There is nothing wrong with larger tables. What gave you the impression there was?
> set.seed(123)
> fac <- factor(sample(10000, 10000, rep = TRUE))
> fac2 <- factor(sample(10000, 10000, rep = TRUE))
> tab <- table(fac, fac2)
> str(tab)
'table' int [1:6282, 1:6279] 0 0 0 0 0 0 0 0 0 0 ...
- attr(*, "dimnames")=List of 2
..$ fac : chr [1:6282] "1" "5" "7" "9" ...
..$ fac2: chr [1:6279] "1" "2" "3" "4" ...
Printing tab will cause problems - it takes a while to generate and then you'll get this message:
[ reached getOption("max.print") -- omitted 6267 rows ]
You can alter that by setting options(max.print = XXXXX), where XXXXX is some large number. But I don't see what is gained by printing such a large table. If you were trying to do this to check, size-wise, that the correct table had been produced, then
> dim(tab)
[1] 6282 6279
> str(tab)
'table' int [1:6282, 1:6279] 0 0 0 0 0 0 0 0 0 0 ...
- attr(*, "dimnames")=List of 2
..$ fac : chr [1:6282] "1" "5" "7" "9" ...
..$ fac2: chr [1:6279] "1" "2" "3" "4" ...
help with that.
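If the goal is to inspect the counts or hand them to another tool rather than print the whole table, one option (a sketch) is to flatten it to long format first:

tab_long <- as.data.frame(tab)   # columns: fac, fac2, Freq
head(tab_long)
# write.csv(tab_long, "tab_counts.csv", row.names = FALSE)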
