I am working on economic research and have a data frame of regression coefficients, built with melt and the tidy function from the broom package. My df:
> head(LmModGDP, 10)
Country variable term estimate std.error statistic p.value
1 Netherlands FDI_InFlow_MilUSD (Intercept) 5.354083e+02 5.974760e+01 8.961167 1.976417e-09
2 Netherlands FDI_InFlow_MilUSD value 2.400677e-03 1.409779e-03 1.702875 1.005189e-01
3 Netherlands FDI_InFlow_percGDP (Intercept) 6.184273e+02 6.723554e+01 9.197923 1.173719e-09
4 Netherlands FDI_InFlow_percGDP value -1.261933e+00 1.008740e+01 -0.125100 9.014067e-01
5 Netherlands FDI_InStock_MilUSD (Intercept) 3.110956e+02 2.719577e+01 11.439116 1.201802e-11
6 Netherlands FDI_InStock_MilUSD value 7.025298e-04 5.307147e-05 13.237429 4.620706e-13
7 Netherlands FDI_OutFlow_MilUSD (Intercept) 5.106762e+02 5.939921e+01 8.597356 4.465840e-09
8 Netherlands FDI_OutFlow_MilUSD value 1.920313e-03 8.646908e-04 2.220808 3.528536e-02
9 Netherlands FDI_OutFlow_percGDP (Intercept) 2.593453e+02 5.334202e+01 4.861932 4.838082e-05
10 Netherlands FDI_OutFlow_percGDP value 3.931491e+00 5.332541e-01 7.372641 7.896681e-08
After I filter the df using any method (even simply by subsetting, or with the dplyr package):
LmModGDP[LmModGDP$variable == "FDI_InStock_MilUSD",]
or
LmModGDP %>%
filter(variable == "FDI_InStock_MilUSD")
It returns the desired df, but when I hover the mouse over the last column (p.value) in the RStudio viewer, it says "Unknown Column", although the data are still correct. Also, when I call str or class on that column, it is reported as numeric, but the viewer shows something else.
My desired df:
Country variable term estimate std.error statistic p.value
5 Netherlands FDI_InStock_MilUSD (Intercept) 3.110956e+02 2.719577e+01 11.439116 1.201802e-11
6 Netherlands FDI_InStock_MilUSD value 7.025298e-04 5.307147e-05 13.237429 4.620706e-13
19 Romania FDI_InStock_MilUSD (Intercept) 3.122229e+01 3.313134e+00 9.423796 7.188216e-10
20 Romania FDI_InStock_MilUSD value 2.128223e-03 7.035679e-05 30.249006 8.588104e-22
When I use the kable function to display it in a markdown report, the p.value column shows only 0 values, not the actual ones.
Can someone help me?
UPDATE:
Here is the output of str:
Classes ‘grouped_df’, ‘tbl_df’, ‘tbl’ and 'data.frame': 28 obs. of 7 variables:
$ Country : chr "Netherlands" "Netherlands" "Netherlands" "Netherlands" ...
$ variable : Factor w/ 7 levels "FDI_InFlow_MilUSD",..: 1 1 2 2 3 3 4 4 5 5 ...
$ term : chr "(Intercept)" "value" "(Intercept)" "value" ...
$ estimate : num 535.4083 0.0024 618.4273 -1.2619 311.0956 ...
$ std.error: num 59.7476 0.00141 67.23554 10.0874 27.19577 ...
$ statistic: num 8.961 1.703 9.198 -0.125 11.439 ...
$ p.value : num 1.98e-09 1.01e-01 1.17e-09 9.01e-01 1.20e-11 ...
- attr(*, "vars")= chr "Country" "variable"
- attr(*, "drop")= logi TRUE
- attr(*, "indices")=List of 14
..$ : int 0 1
..$ : int 2 3
..$ : int 4 5
..$ : int 6 7
..$ : int 8 9
..$ : int 10 11
..$ : int 12 13
..$ : int 14 15
..$ : int 16 17
..$ : int 18 19
..$ : int 20 21
..$ : int 22 23
..$ : int 24 25
..$ : int 26 27
- attr(*, "group_sizes")= int 2 2 2 2 2 2 2 2 2 2 ...
- attr(*, "biggest_group_size")= int 2
- attr(*, "labels")='data.frame': 14 obs. of 2 variables:
..$ Country : chr "Netherlands" "Netherlands" "Netherlands" "Netherlands" ...
..$ variable: Factor w/ 7 levels "FDI_InFlow_MilUSD",..: 1 2 3 4 5 6 7 1 2 3 ...
..- attr(*, "vars")= chr "Country" "variable"
..- attr(*, "drop")= logi TRUE
I cannot comment yet, which is why I am writing this as an answer.
Could you show us the output of str(LmModGDP)? Maybe the df is nested, or maybe it is not a plain data frame but has special properties. Have you tried forcing LmModGDP <- as.data.frame(LmModGDP)?
Have you tried forcing LmModGDP$p.value <- as.numeric(LmModGDP$p.value)?
Have you tried converting to a data.table to see whether the behavior is different after applying your filter to it?
UPDATE1:
Thanks for posting the str(). Your object is a "grouped_df". Have you tried ungroup(LmModGDP)?
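A minimal sketch of the ungroup() suggestion, using a hypothetical four-row stand-in for LmModGDP (the name coefs and its contents are made up for illustration). The zeros in the kable output are probably a rounding effect, since kable rounds numeric columns to its digits argument; that diagnosis is an assumption, not something confirmed in this thread. The sketch therefore also turns p.value into text in scientific notation before printing.

```r
library(dplyr)

# Hypothetical stand-in for LmModGDP: a grouped tibble with very small p-values
coefs <- tibble(
  Country  = c("Netherlands", "Netherlands", "Romania", "Romania"),
  variable = factor(rep("FDI_InStock_MilUSD", 4)),
  term     = rep(c("(Intercept)", "value"), 2),
  p.value  = c(1.201802e-11, 4.620706e-13, 7.188216e-10, 8.588104e-22)
) %>%
  group_by(Country, variable)

flat <- coefs %>%
  ungroup() %>%                                   # drop the grouped_df class
  filter(variable == "FDI_InStock_MilUSD") %>%
  mutate(p.value = formatC(p.value, format = "e", digits = 3))  # keep as text

knitr::kable(flat)
```

If the column should stay numeric, passing a larger digits value to kable may be enough, but a value such as 8.6e-22 would still be rounded to 0 at any practical setting.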
My data has 1,000 entries and here is the str of the first 2 elements:
> str(my_boots[1:2])
List of 2
$ :List of 4
..$ result : Named num [1:10] 0.118 0.948 4.317 1.226 1.028 ...
.. ..- attr(*, "names")= chr [1:10] "(Intercept)" "pvi2" "freqchal" "sexexp" ...
..$ output : chr "list()"
..$ warnings: chr(0)
..$ messages: chr(0)
$ :List of 4
..$ result : Named num [1:10] 0.202 0.995 2.512 1.057 0.5 ...
.. ..- attr(*, "names")= chr [1:10] "(Intercept)" "pvi2" "freqchal" "sexexp" ...
..$ output : chr "list()"
..$ warnings: chr(0)
..$ messages: chr(0)
The fields of interest are $result and $warnings. I want to return a tibble whose columns are named after the entries of the named vector in result, keeping only the elements where warnings is empty (no warning).
I'm new to purrr, but I got most of the way there with map_dfr(my_boots[1:2], "result"). This returns a tibble with the column names taken from the named numeric vector, but I would like to return only the rows where the entry under warnings is blank.
I wasn't sure how to create this structure manually, but I was able to create a single element of my_boots:
test <- list(
list("result" = c("alpha" = 1.1, "beta" = 2.1, "theta" = 3.1, "blah" = 4.1),
"warnings" = c("blah"))
)
Also: I'm using the tidyverse - thank you.
Starting with some dummy data.
library(tidyverse)
l <- list(
list(
result = 1:10,
warnings = character(0)
),
list(
result = 2:20,
warnings = "warn"
),
list(
result = 3:30,
warnings = character(0)
),
list(
result = 4:40,
warnings = "warn"
)
)
Use keep to keep only elements without warnings. map("result") pulls the result element out of each list.
l %>%
keep(~is_empty(.$warnings)) %>%
map("result")
#> [[1]]
#> [1] 1 2 3 4 5 6 7 8 9 10
#>
#> [[2]]
#> [1] 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
#> [22] 24 25 26 27 28 29 30
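To get the tibble the question asks for, the same keep step can be followed by a row bind. This is a sketch under the assumption that every result is a named numeric vector with the same names, as in the str output of my_boots; the list boots and its values are made up for illustration.

```r
library(purrr)
library(dplyr)

# Hypothetical stand-in for my_boots: named results, some with warnings
boots <- list(
  list(result = c(alpha = 1.1, beta = 2.1), warnings = character(0)),
  list(result = c(alpha = 1.3, beta = 2.3), warnings = "did not converge"),
  list(result = c(alpha = 1.5, beta = 2.5), warnings = character(0))
)

clean <- boots %>%
  keep(~ length(.x$warnings) == 0) %>%  # drop elements that carry a warning
  map("result") %>%                     # pull out the named vectors
  bind_rows()                           # one row per element, one column per name

clean
```

Testing length(.x$warnings) == 0 rather than comparing to "" matters here, because an element without warnings holds character(0), not an empty string.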
I'm definitely a noob, though I have used R for various small tasks for several years.
For the life of me, I cannot figure out how to get the results from the "Desc" function into something I can work with. When I save the x<-Desc(mydata) the class(x) shows up as "Desc." In R studio it is under Values and says "List of 1." Then when I click on x it says ":List of 25" in the first line. There is a list of data in this object, but I cannot for the life of me figure out how to grab any of it.
Clearly I have a severe misunderstanding of the R data structures, but I have been searching for the past 90 minutes to no avail so figured I would reach out.
In short, I just want to pull certain aspects (N, mean, UB, LB, median) of the descriptive statistics provided from the Desc results for multiple datasets and build a little table that I can then work with.
Thanks for the help.
Say you have a dataframe, x, where:
x <- data.frame(i=c(1,2,3),j=c(4,5,6))
You could set:
desc.x <- Desc(x)
And access the info on any given column like:
desc.x$i
desc.x$i$mean
desc.x$j$sd
And any other stats Desc comes up with. The $ is the key here, it's how you access the named fields of the list that Desc returns.
Edit: If you pass a single column (as the asker does), or simply a vector, to Desc, it returns a one-item list. The same principle applies, but the syntax is slightly different. Now you would use:
desc.x <- Desc(df$my.col)
desc.x[[1]]$mean
In the future, the way to attack this is to either look in the environment window in RStudio and play around trying to figure out how to access the fields, check the source code on github or elsewhere, or (best first choice) use str(desc.x), which gives us:
> str(desc.x)
List of 1
$ :List of 25
..$ xname : chr "data.frame(i = c(1, 2, 3), j = c(4, 5, 6))$i"
..$ label : NULL
..$ class : chr "numeric"
..$ classlabel: chr "numeric"
..$ length : int 3
..$ n : int 3
..$ NAs : int 0
..$ main : chr "data.frame(i = c(1, 2, 3), j = c(4, 5, 6))$i (numeric)"
..$ unique : int 3
..$ 0s : int 0
..$ mean : num 2
..$ meanSE : num 0.577
..$ quant : Named num [1:9] 1 1.1 1.2 1.5 2 2.5 2.8 2.9 3
.. ..- attr(*, "names")= chr [1:9] "min" ".05" ".10" ".25" ...
..$ range : num 2
..$ sd : num 1
..$ vcoef : num 0.5
..$ mad : num 1.48
..$ IQR : num 1
..$ skew : num 0
..$ kurt : num -2.33
..$ small :'data.frame': 3 obs. of 2 variables:
.. ..$ val : num [1:3] 1 2 3
.. ..$ freq: num [1:3] 1 1 1
..$ large :'data.frame': 3 obs. of 2 variables:
.. ..$ val : num [1:3] 3 2 1
.. ..$ freq: num [1:3] 1 1 1
..$ freq :Classes ‘Freq’ and 'data.frame': 3 obs. of 5 variables:
.. ..$ level : Factor w/ 3 levels "1","2","3": 1 2 3
.. ..$ freq : int [1:3] 1 1 1
.. ..$ perc : num [1:3] 0.333 0.333 0.333
.. ..$ cumfreq: int [1:3] 1 2 3
.. ..$ cumperc: num [1:3] 0.333 0.667 1
..$ maxrows : num 12
..$ x : num [1:3] 1 2 3
- attr(*, "class")= chr "Desc"
"List of 1" means you access it by desc.x[[1]], and below that follow the $s. When you see something like num[1:3] that means it's an atomic vector so you access the first member like var$field$numbers[1]
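Building the small table the question asks for could then look like the sketch below. To keep it runnable without DescTools, it uses stand-in lists shaped like the str(desc.x) output above; fake_desc, results, and pull_stats are hypothetical names, and with the package loaded the elements of results would come from Desc() instead. The median is taken as the fifth entry of quant by position, which matches the output above (min, .05, .10, .25, then the middle value), and the confidence bounds (LB, UB) are left out because they do not appear among the fields listed.

```r
# Stand-in with the same shape as a Desc result for one numeric vector:
# a list of length 1 whose single element holds the statistics.
fake_desc <- function(x) {
  list(list(
    n     = length(x),
    mean  = mean(x),
    sd    = sd(x),
    quant = quantile(x, c(0, .05, .10, .25, .50, .75, .90, .95, 1), names = FALSE)
  ))
}

results <- list(
  first  = fake_desc(c(1, 2, 3)),
  second = fake_desc(c(4, 5, 6, 9))
)

# Pull the fields of interest out of one result as a one-row data frame
pull_stats <- function(d) {
  s <- d[[1]]                      # "List of 1": step into the first element
  data.frame(n = s$n, mean = s$mean, sd = s$sd, median = s$quant[5])
}

# One row per dataset, row names taken from the list names
stats_table <- do.call(rbind, lapply(results, pull_stats))
stats_table
```

Field names can differ between DescTools versions, so it is worth checking str() on a real result before relying on them.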
str(list) # the list
List of 11
$ : int [1:62850] 1013128473 1010310348 1048245573 1034384956 1041152164 1044038741 1018034270 1028472668 1028965885 1009487677 ...
$ : int [1:76934] 1013175201 1008463364 1016595579 1015077603 1036297925 1033985605 1004670509 1002708962 1035740487 1033948421 ...
$ : int [1:63141] 1023522277 1028419750 1035072196 1015895913 1044665345 1045384789 1003817549 1007103029 1034294940 1048731747 ...
$ : int [1:66286] 1004375117 1015143512 1013554405 1029388459 1042758662 1002010773 1014659880 1010136990 1042787992 1034111995 ...
$ : int [1:59295] 1026598712 1046781801 1047773468 1029647490 1000445831 1004654396 1026574333 1028210894 1031396631 1017077460 ...
$ : int [1:39513] 1008628321 1031342452 1036618138 1025299916 1059540334 1044636981 1025831775 1020671796 1016064196 1000573822 ...
$ : int [1:52616] 1007104357 1035072196 1045300736 1013342439 1021471188 1014648594 1047521123 1006283327 1018237501 1052887674 ...
$ : int [1:53865] 1043482304 1006375883 1065831792 1025658285 1025898360 1042188555 1010986410 1036297925 1016468595 1042017564 ...
$ : int [1:74030] 1049026709 1076616323 1013343981 1009441716 1004974596 1032515221 1059905172 1011514112 1005423064 1006931636 ...
$ : int [1:62171] 1024128835 1006168791 1003374715 1042188555 1016219766 1002708962 1035781234 1039706286 1011430434 1055809196 ...
$ : int [1:66560] 1020967137 1029327077 1026256246 1046334023 1035156221 1017504075 1035065786 1043426434 1034294940 1019105475 ...
str(df) # the data frame
'data.frame': 3727518 obs. of 5 variables:
$ A: int 10001676 10001676 10002575 10002990 10003466 10005485 10005736 10005949 10006562 10007119 ...
$ 1: int 1020565642 1020565642 1008628321 1038358741 1045031612 1025102185 1011873328 1002079752 1028579827 1026598712 ...
$ 2: Factor w/ 2 levels "ÇäËì","ÐßÑ": 2 2 2 2 2 2 2 2 2 2 ...
$ 3: int 1 4 1 1 1 1 20 1 1 1 ...
$ 4: int 64 64 66 63 69 59 84 83 65 64 ...
I want to merge each vector in the list with the data frame by "A".
What I tried was:
for(n in 1:length(list))
{
newlist[[n]] <- merge(df, list[[n]], by.x = "A")
}
Error in merge.data.frame(rd_info, newengagementspermonth[[n]], by.x = "NEWNINUMBER") :
'by.x' and 'by.y' specify different numbers of columns
The input is a list of 11 vectors and a data frame. The output should be a list of 11 data frames, each with a number of rows equal to the length of the corresponding vector.
You could do something like this. First, explicitly transform each object in the list into a data.frame. Then, merge it with df. You need to specify by.x and by.y since the data.frames do not have the same names.
newlist <- lapply(lapply(list, as.data.frame), function(x) merge(x, df, by.x = "X[[i]]", by.y = "A", all.x = TRUE))
With sample data:
list <- list(1:8,1:10,2:15)
df <- data.frame(A=1:15,
b=rnorm(15))
output
str(newlist)
List of 3
$ :'data.frame': 8 obs. of 2 variables:
..$ X[[i]]: int [1:8] 1 2 3 4 5 6 7 8
..$ b : num [1:8] 0.0127 0.2082 -0.271 0.421 -0.538 ...
$ :'data.frame': 10 obs. of 2 variables:
..$ X[[i]]: int [1:10] 1 2 3 4 5 6 7 8 9 10
..$ b : num [1:10] 0.0127 0.2082 -0.271 0.421 -0.538 ...
$ :'data.frame': 14 obs. of 2 variables:
..$ X[[i]]: int [1:14] 2 3 4 5 6 7 8 9 10 11 ...
..$ b : num [1:14] 0.208 -0.271 0.421 -0.538 0.506 ...
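A variant of the same idea that does not rely on the automatically generated column name "X[[i]]": wrap each vector in a data frame whose single column is already called A, then merge on that name. This is a sketch on the sample data above; id_list and merged are hypothetical names, and set.seed is added only to make the random column reproducible.

```r
id_list <- list(1:8, 1:10, 2:15)

set.seed(1)
df <- data.frame(A = 1:15, b = rnorm(15))

merged <- lapply(id_list, function(ids) {
  # Name the key column explicitly so both sides share "A"
  merge(data.frame(A = ids), df, by = "A", all.x = TRUE)
})

sapply(merged, nrow)
```

Each result has as many rows as its vector only when the values of A are unique in df; if a key is repeated in df, merge returns one row per match.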
I have managed to subset and lapply a list of data frames as follows:
subsetDeathHA <-
na.omit(subset(outcome, select = c("Hospital Name", "mortalityRate", "State")))
orderSubsetDeathHA <-
subsetDeathHA[order(subsetDeathHA$mortalityRate, subsetDeathHA$"Hospital Name", subsetDeathHA$State), ]
splitOrderSubsetDeahtHA <-
split(orderSubsetDeathHA, orderSubsetDeathHA$'State')
aa<- lapply(splitOrderSubsetDeahtHA, function(x) { x[num,] })
num is the ranking number on a per State basis.
Using str(aa) shows that this object is a list of 54 data frames, where each data frame holds one observation of 3 variables, as follows:
List of 54
$ AK:'data.frame': 1 obs. of 3 variables:
..$ Hospital Name : chr NA
..$ mortalityRate : num NA
..$ State : chr NA
..- attr(*, "na.action")=Class 'omit' Named int [1:1986] 4 5 6 10 13 17 19 23 27 28 ...
.. .. ..- attr(*, "names")= chr [1:1986] "4" "5" "6" "10" ...
$ AL:'data.frame': 1 obs. of 3 variables:
..$ Hospital Name : chr "D C H REGIONAL MEDICAL CENTER"
..$ mortalityRate : num 15.8
..$ State : chr "AL"
..- attr(*, "na.action")=Class 'omit' Named int [1:1986] 4 5 6 10 13 17 19 23 27 28 ...
.. .. ..- attr(*, "names")= chr [1:1986] "4" "5" "6" "10" ...
What I can't seem to do is the following
1) Subset out the Hospital Name and the State by removing the mortalityRate variable and return a list of the resulting 54 objects/data frames.
2) Place row.names =F appropriately to suppress the indexing that R provides.
3) Even though I thought I had 'na'd out' the NA values in the first sub-setting operation,
when I print(aa), what follows is a sample of the output.
$AK
Hospital Name mr State
NA NA <NA> NA <NA>
$AL
Hospital Name mr State
56 D C H REGIONAL MEDICAL CENTER 15.8 AL
etc...
Any help/suggestions appreciated
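One possible approach to the three points, sketched on a hypothetical two-state stand-in for the split list (the hospitals other than the one shown above are made up). It assumes the NA rows come from x[num, ] rather than from the original data: indexing a data frame past its last row returns a row of NAs, so a state with fewer than num hospitals produces NA even after na.omit.

```r
# Hypothetical stand-in for splitOrderSubsetDeahtHA
ranked <- list(
  AK = data.frame(`Hospital Name` = "A HOSPITAL",
                  mortalityRate = 14.1, State = "AK", check.names = FALSE),
  AL = data.frame(`Hospital Name` = c("D C H REGIONAL MEDICAL CENTER", "B HOSPITAL"),
                  mortalityRate = c(15.8, 16.2), State = "AL", check.names = FALSE)
)
num <- 2

pick <- lapply(ranked, function(x) {
  # 1) keep only the two wanted columns for the requested rank
  row <- x[num, c("Hospital Name", "State"), drop = FALSE]
  rownames(row) <- NULL
  row
})

# 3) AK has a single hospital, so rank 2 comes back as a row of NAs
is.na(pick$AK$State)

# 2) row.names = FALSE is an argument of print for data frames
invisible(lapply(pick, print, row.names = FALSE))
```

If the NA entries are unwanted, they can be filtered out afterwards or replaced with a message, depending on what the ranking function is supposed to return for short states.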
I am trying to run an LDA using the topicmodels package in R. The example given in the manual uses Associated Press data and works nicely. However, when I try it on my own data, I get topics whose terms are the document names. I have traced the problem to the fact that my term-document matrix is the transpose of what it should be (rows and columns are swapped).
The example TDM:
str(AssociatedPress)
List of 6
$ i : int [1:302031] 1 1 1 1 1 1 1 1 1 1 ...
$ j : int [1:302031] 116 153 218 272 299 302 447 455 548 597 ...
$ v : int [1:302031] 1 2 1 1 1 1 2 1 1 1 ...
$ nrow : int 2246
$ ncol : int 10473
$ dimnames:List of 2
..$ Docs : NULL
..$ Terms: chr [1:10473] "aaron" "abandon" "abandoned" "abandoning" ...
- attr(*, "Weighting")= chr [1:2] "term frequency" "tf"
- attr(*, "class")= chr [1:2] "DocumentTermMatrix" "simple_triplet_matrix"
Whereas my TDM has Terms as rows and Docs as columns:
List of 6
$ i : int [1:10489] 1 3 4 13 20 24 25 26 27 28 ...
$ j : int [1:10489] 1 1 1 1 1 1 1 1 1 1 ...
$ v : num [1:10489] 1 1 1 1 2 1 67 1 44 3 ...
$ nrow : int 5903
$ ncol : int 9
$ dimnames:List of 2
..$ Terms: chr [1:5903] "\u2439aa" "aars" "\u2439ab" "\u242dab" ...
..$ Docs : chr [1:9] "art111130.txt" "art111131.txt" "art111132.txt" "art111133.txt" ...
- attr(*, "class")= chr [1:2] "TermDocumentMatrix" "simple_triplet_matrix"
- attr(*, "Weighting")= chr [1:2] "term frequency" "tf"
This causes LDA(art_tdm, 3) to build topics based on document names, not on the terms within the documents. Is this a change in the codebase of the tm package? I can't see what in my code would cause this transposition:
art_cor<-Corpus(DirSource(directory = "tmptxts"))
art_tdm<-TermDocumentMatrix(art_cor)
Any help would be appreciated.
On the one hand you have an object of class "TermDocumentMatrix", and on the other you have one of class "DocumentTermMatrix".
You probably just need to do this:
art_tdm<-DocumentTermMatrix(art_cor)
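A small sketch of the difference, using two made-up documents in place of the files in tmptxts (the texts and the names docs and art_dtm are assumptions for illustration). It builds both matrices from the same corpus and shows that transposing the term-document matrix with t() gives the document-term layout as well.

```r
library(tm)

# Two hypothetical documents standing in for the files in "tmptxts"
docs <- c(a = "economic growth and investment",
          b = "investment flows and growth rates")
art_cor <- VCorpus(VectorSource(docs))

art_dtm <- DocumentTermMatrix(art_cor)   # rows = documents, columns = terms
art_tdm <- TermDocumentMatrix(art_cor)   # rows = terms, columns = documents

dim(art_dtm)
class(t(art_tdm))                        # transposing flips the class too
```

Either art_dtm or t(art_tdm) has documents as rows, which is the layout the AssociatedPress example uses.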