How to write unique variable names with write_sav - r

I'm trying to write an SPSS file that contains some haven_labelled variables and the factors created from them. For my use case it is convenient to use nearly identical variable names: all lower case for the haven_labelled variables and title case for the corresponding factors.
When I export the data frame with write_sav, SPSS records the title-case factor as var1 rather than under its own name, in this case Francophone. Note that when I change the name of the variable substantially (Franco below), the name is written as given.
#This makes the data frame of haven labelled variable and a corresponding factor
test<-structure(list(francophone = structure(c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0), labels = c(Francophone = 1, `Not Francophone` = 0), label = "Dichotomous variable, R is francophone", class = c("haven_labelled", "vctrs_vctr", "double")), Francophone = structure(c(1L, 1L, 1L,1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("Not Francophone", "Francophone"), label = "Dichotomous variable, R is francophone", class = "factor")), row.names = c(NA,-10L), class = c("tbl_df", "tbl", "data.frame"))
#Create a second factor variable equivalent to the first factor, but with a different variable name
test$Franco<-test$Francophone
library(tidyverse)
library(haven)
#Write out the file; sorry I do not know how to use tmpfile() in this case.
test %>%
write_sav(., path="~/Desktop/test2.sav")

To close the loop: variable names in SPSS have to be unique, and SPSS compares names without regard to case, so francophone and Francophone count as the same name: https://www.ibm.com/docs/en/spss-statistics/version-missing?topic=list-variable-names
This has always been the case (and will probably not change).
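A minimal sketch of how one might check for this before exporting. The helper case_collisions and the "_f" suffix are made up for illustration; they are not part of haven.

```r
# Hypothetical helper: return the names that collide when case is ignored.
case_collisions <- function(nms) {
  lower <- tolower(nms)
  nms[lower %in% lower[duplicated(lower)]]
}

nms <- c("francophone", "Francophone", "Franco")
collisions <- case_collisions(nms)
print(collisions)

# One possible fix: append a suffix to every later name that repeats an
# earlier one (the "_f" suffix is an arbitrary choice for illustration).
fixed <- nms
dup <- duplicated(tolower(nms))
fixed[dup] <- paste0(nms[dup], "_f")
print(fixed)

# Assumed usage with the data frame from the question (not run here):
# names(test) <- fixed
# out_path <- tempfile(fileext = ".sav")  # avoids writing to the Desktop
# haven::write_sav(test, out_path)
```

Renaming before the export keeps control of the names in your hands instead of leaving the replacement name to the writer.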

Related

How to take data from one dataframe and copy it into existing columns in another dataframe based on the shared ID of a third column

So what I have is two different data frames: the one I've been working on (df1) and the one with all the new data I need to put in the first one (df2). Df1 has several columns of zeroes, waiting for the data to be added in. Df2 has the data I need, and several more rows and columns that I don't care about beyond that data. Here is a small subset of the type of data I'm working with.
This is my first time posting my data so I hope I'm doing it right. Let me know if you need a different format.
df1:
structure(list(season = c(" FA15", " FA15", " FA15", " FA15",
" FA15", " FA15", " FA15", " FA15", " FA15", " FA15"), year = c("2015",
"2015", "2015", "2015", "2015", "2015", "2015", "2015", "2015",
"2015"), territory.name = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0), plot = c("0",
"0", "0", "0", "0", "0", "0", "0", "0", "0"), color.band = c("APGBY",
"APGGU", "APGPW", "APGPW", "APGR", "APGUO", "APGUO", "APGUO",
"APGUO", "APGYR")), row.names = c(NA, -10L), class = c("tbl_df",
"tbl", "data.frame"))
df2:
structure(list(bandnum = c(157328052, 160379101, 157328094, 151313455,
170364680, 160379104, 151373458, 157328066, 160379103, 160379105
), project = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L), .Label = c("*ISSJ", "ISSJ"), class = "factor"), color.band = c("PAWR",
"WYWAR", "APGP", "APGO", "ABYG", "URYAR", "APBW", "WABG", "OBWAR",
"GBGAR"), sex = structure(c(3L, 2L, 3L, 3L, 3L, 1L, 1L, 1L, 1L,
2L), .Label = c("?", "F", "M"), class = "factor"), age = structure(c(2L,
1L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 2L), .Label = c("AHY", "ASY",
"HY", "N", "SY"), class = "factor")), row.names = c(NA, 10L), class = "data.frame")
I've been chewing on this problem for a few days, trying different things and reading so many answers on stack overflow, but I'm failing to come up with a clear answer on how to take data from one dataframe and copy it into existing columns in another dataframe based on the shared ID of a third column.
Essentially, I want R to see that both data frames list the band ABCDEF in the color.band column, then take the value of df2$bandnum in that row and copy it into df1$bandnum in the ABCDEF row there.
I don't want to copy rows that are in df2 but not df1 into df1. I want to mark entries that exist in df1 but not df2 as N/A in the bandnum column.
Column names and data format for color band and band number have been standardized between the two data frames so everything should line up.
What I have so far with code is this:
> practicedf <- left_join(x=df1, y=df2, by = "color.band", all.x = TRUE)
%>% mutate(y = ifelse(is.na(df1$color.band), df1$bandnum, df1$color.band)) %>% select(df2$bandnum)
left_join seems to be the right one because it keeps all rows in the left (df1) data frame and only matching rows from the right (df2) data frame.
I get this error though:
Error in `[[<-.data.frame`(`*tmp*`, col, value = c("APGBY", "APGGU", "APGPW", :
replacement has 1261 rows, data has 2559
color.band is a character vector while bandnum is numerical, is that a problem? What could be the problem here?
Edit: I had an error with having the column bandnum in both dataframes so I changed df2$bandnum to bandnum.y. My code is now
df1_test <- left_join(x=df1, y=df2, by = "color.band") %>% mutate(y =
ifelse(is.na(color.band), bandnum, color.band)) %>% select(bandnum.y)
but when I view(df1_test) it only shows me the column bandnum.y and it's not the same number of entries as my original df1
Here's a subset of df1_test (not the whole thing because it's 2600 entries)
Any way I can make it show the rest of my data as well?
structure(list(bandnum.y = c("171324972", "171324972", "171324972",
"178324697", "178324697", "178324697", "178324697", "178324697",
"178324697", "178324697", "170364505", "170364505", "170364505",
"170364505", "170364505", "170364505", NA, "178324692", "178324692",
"178324692")), row.names = c(NA, -20L), class = c("tbl_df", "tbl",
"data.frame"))
After the join, the columns live in the joined result, so inside mutate we refer to them by their unquoted names rather than as df1$... or df2$.... Also, left_join has no all.x argument; that argument belongs to base R's merge.
library(dplyr)
left_join(x=df1, y=df2, by = "color.band") %>%
mutate(y = ifelse(is.na(color.band), bandnum, color.band))
left_join does not have an all.x argument; all.x = TRUE is part of base R's merge.
You could do the following in base R :
df1_test <- transform(merge(df1, df2, by = "color.band", all.x = TRUE),
y = ifelse(is.na(color.band), bandnum, color.band))
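If the goal is only to copy bandnum into df1 without adding rows, a lookup with match may be enough. This is a sketch on made-up toy data shaped like the question's, and it assumes each color.band appears at most once in df2 (match returns only the first hit).

```r
# Toy data; the values are borrowed from the question but rearranged.
df1 <- data.frame(color.band = c("APGBY", "APGGU", "APGPW", "APGPW"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(color.band = c("APGGU", "APGPW", "WYWAR"),
                  bandnum = c(160379101, 157328094, 151313455),
                  stringsAsFactors = FALSE)

# For each row of df1, find the position of its color.band in df2
# (NA when there is no match) and pull bandnum from that position.
df1$bandnum <- df2$bandnum[match(df1$color.band, df2$color.band)]
print(df1)
```

df1 keeps its original number of rows, bands missing from df2 get NA, and bands that exist only in df2 are ignored.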
If I understand correctly, you want to update an old df (df1) with information from a new df (df2).
In data.table, you can try this:
library(data.table)
setDT(df1)
setDT(df2)
update.vars = intersect(names(df1), names(df2)) # update only common variables
df1[df2, c(update.vars) := df2[,update.vars, with=FALSE], on= 'color.band']
Generally this should work. But in the given data the 'merge' ids (color.band column) are not unique, which may affect the results.
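For the specific case of bringing in a column that exists only in df2, a data.table update join with the i. prefix is a common idiom. A sketch on toy data, assuming the data.table package is installed and that color.band is unique in df2:

```r
library(data.table)

df1 <- data.table(color.band = c("APGBY", "APGGU", "APGPW"))
df2 <- data.table(color.band = c("APGGU", "APGPW", "WYWAR"),
                  bandnum = c(160379101, 157328094, 151313455))

# Update join: for each row of df2 (i), find the matching rows of df1 (x)
# and write df2's bandnum (i.bandnum) into df1 by reference.
# Rows of df1 without a match keep NA; unmatched rows of df2 are ignored.
df1[df2, on = "color.band", bandnum := i.bandnum]
print(df1)
```

Because := modifies df1 in place, no reassignment is needed.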

R Convert categorical data to dummy set by other variable

I have this data set. I'm posting a screenshot of the real data rather than code; sorry for the mess, I'm new to R.
[screenshot of the original data]
I want to convert the categorical column "13 Source" into a set of dummy variables, summarized by "HH No", so that the result looks like this:
[screenshot of the desired output]
I've tried to.dummy from the varhandle package and model.matrix, but both left me with a messy data set.
Could anybody help me with this?
Thanks a million in advance.
There are a number of ways to make dummy variables from factors - here is one way to create a summary presence table.
Assume df is your data frame. You can use xtabs to start with, which will create a frequency table from your 2 columns.
By comparing to see if your values are > 0, you will get TRUE if > 0, and FALSE otherwise. Adding 0 at the end will make TRUE the number 1 and FALSE the number 0.
(xtabs(~ HH_No + Source, df) > 0) + 0
Output
     Source
HH_No Deep_well Rainwater
    1         1         1
    3         1         1
    4         0         1
Data
df <- structure(list(HH_No = c(1, 1, 1, 1, 1, 1, 1, 3, 3, 3, 3, 3,
3, 3, 4, 4), Source = structure(c(2L, 2L, 2L, 2L, 1L, 1L, 1L,
2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 2L), .Label = c("Deep_well",
"Rainwater"), class = "factor")), class = "data.frame", row.names = c(NA,
-16L))
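If you need the result as an ordinary data frame with HH_No as a column rather than as a table, something along these lines should work. It is a sketch on a smaller made-up data set; the names counts, presence and out are only for illustration.

```r
df <- data.frame(
  HH_No = c(1, 1, 3, 4),
  Source = factor(c("Rainwater", "Deep_well", "Rainwater", "Rainwater"))
)

# Frequency table of households by source.
counts <- xtabs(~ HH_No + Source, df)

# TRUE/FALSE for "appears at least once", turned into 1/0 by adding 0.
presence <- (counts > 0) + 0

# Convert the two-way table into a plain data frame, one row per household.
out <- as.data.frame.matrix(presence)
out <- cbind(HH_No = rownames(out), out)
rownames(out) <- NULL
print(out)
```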

Log response ratios with 'Metafor' with zeros

I'm using the 'metafor' package in R to compute log response ratios. Some of my means are zero, which seems to be the cause of a warning after my escalc command (since log(0) is -Inf). The metafor package provides a method of adding a small value to zero to avoid this. The documentation states:
"Cell entries with a zero can be problematic especially for the relative risk and the odds ratio. Adding a small constant to the cells of the 2 × 2 tables is a common solution to this problem [...] When to = "only0", the value of add is added to each cell of the 2 × 2 tables only in those tables with at least one cell equal to 0."
For some reason this is not resolving my error, perhaps because my data is not a 2x2 table? (It is output from summarise with ddply from the plyr package, similar to the formatting in this example.) Must I replace the zero values with a small number manually, or is there a more elegant way? (Note that in this example the rows with zero also have a sample size of 1, thus no variance, and will be dropped from the analysis anyway. I just want to know how this works for the future.)
Reproducible example:
# Output of dput(Bin_Y_count_summary_wide), assigned to dat
dat <-
structure(list(Species.ID = c("CAFERANA", "TR11", "TR118", "TR500",
"TR504", "TR9", "TR9_US1"), Y_num_mean.early = c(2, 147.375,
4.5, 0.5, 12.5, 93.4523809523809, 5), N.early = c(1L, 4L, 2L,
4L, 4L, 7L, 2L), sd.early = c(NA, 174.699444284558, 6.36396103067893,
1, 22.4127939653523, 137.506118190001, 7.07106781186548), se.early = c(NA,
87.3497221422789, 4.5, 0.5, 11.2063969826762, 51.9724274972283,
5), Y_num_mean.late = c(0, 3.625, 2.98482142857143, 0.8, 3, 47.2,
0), N.late = c(1L, 4L, 7L, 10L, 10L, 8L, 1L), sd.late = c(NA,
7.25, 5.10407804830748, 1.75119007154183, 8.03118920210451, 40.7351024477486,
NA), se.late = c(NA, 3.625, 1.9291601697265, 0.553774924194538,
2.53968501984006, 14.4020335865659, NA), Y_num_mean.wet = c(NA,
71.5, 0, 12, 27, 0, NA), N.wet = c(NA, 2L, 1L, 2L, 2L, 2L, NA
), sd.wet = c(NA, 17.6776695296637, NA, 9.89949493661167, 38.1837661840736,
0, NA), se.wet = c(NA, 12.5, NA, 7, 27, 0, NA)), row.names = c(NA,
7L), .Names = c("Species.ID", "Y_num_mean.early", "N.early",
"sd.early", "se.early", "Y_num_mean.late", "N.late", "sd.late",
"se.late", "Y_num_mean.wet", "N.wet", "sd.wet", "se.wet"), class = "data.frame", reshapeWide = structure(list(
v.names = c("Y_num_mean", "N", "sd", "se"), timevar = "early_or_late",
idvar = "Species.ID", times = c("early", "late", "wet"),
varying = structure(c("Y_num_mean.early", "N.early", "sd.early",
"se.early", "Y_num_mean.late", "N.late", "sd.late", "se.late",
"Y_num_mean.wet", "N.wet", "sd.wet", "se.wet"), .Dim = c(4L,
3L))), .Names = c("v.names", "timevar", "idvar", "times",
"varying")))
# Warning produced from this command
library(metafor)
test <- escalc(measure="ROM", m1i=Y_num_mean.early, sd1i=sd.early, n1i=N.early, m2i=Y_num_mean.late, sd2i=sd.late, n2i=N.late, data=dat, add=1/2, to="only0")
The paragraph you are quoting applies to measures that one can calculate based on 2x2 tables (i.e., RR, OR, RD, AS, and PETO). The add and to arguments do not have any effect for measures such as SMD and ROM.
The only way you can get a mean of 0 for a ratio scale variable (which is what use of response ratios assumes) is if every value is equal to 0. Therefore, by definition, the variance must also be 0. This applies whether the sample size is 1 (in which case the variance is of course also 0) or whether you have a larger sample size.
In general, whenever at least one of the two means is 0, one cannot calculate the log response ratio. Of course, one could start adding some kind of constant to the means manually (and the same for the SDs), but this seems rather arbitrary. The adjustments we can do to counts in 2x2 tables are motivated based on statistical theory (those adjustments are actually bias reductions, which also happen to make the calculation of certain measures possible when there is a 0 count).
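To see why a zero mean breaks the calculation, here is a base R sketch that does not call metafor at all. It only uses the fact that the log response ratio is the log of the ratio of the two means; the three rows are taken from the early and late means in the question.

```r
# Early and late means for three of the species in the question.
m1 <- c(147.375, 4.5, 2)              # Y_num_mean.early
m2 <- c(3.625, 2.98482142857143, 0)   # Y_num_mean.late

# Log response ratio: not finite as soon as one of the means is 0.
yi <- log(m1 / m2)
print(yi)

# Flag the rows where the ratio is computable, rather than adding an
# arbitrary constant to the means.
computable <- m1 > 0 & m2 > 0
print(yi[computable])
```

Filtering on a flag like computable before the analysis makes the exclusion explicit instead of relying on a warning.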

summary and descriptive table for mixed data in R

I want to write a function that calculates some predetermined summary statistics and that I can apply to any dataset. I'll start with an example here, but the function is meant for datasets with a variety of data types, such as character, factor, numeric, and date columns, possibly containing null values.
I can do this easily enough if the data is all numeric, but handling the IF scenarios with apply, sapply, etc. is where I run into trouble with the syntax.
When it's all numeric I'm fine, since I can just do new_df = data.frame(min = sapply(mydf, min).....etc....etc). I just can't get the syntax right when it's more complicated, as in my example below.
In the example below I have a data frame of 3 columns:
all numerical
numerical with a null
categorical column of data coded as a factor
I want to calculate the:
type...(character, factor, date, numeric, etc)
mean...when the data-type is numeric obviously , and excluding nulls
number of null values in the dataset
I think this is simple enough and I can run with it from here..
copy and paste this code and name as a variable for the data frame:
structure(list(allnumeric = c(10, 20, 30, 40), char_or_factor = structure(c(2L,
3L, 3L, 1L), .Label = c("bird", "cat", "dog"), class = "factor"),
num_with_null = c(10, 100, NA, NA)), .Names = c("allnumeric",
"char_or_factor", "num_with_null"), row.names = c(NA, -4L), class = "data.frame")
expected solution data frame (copy and assign to a variable):
structure(list(allnumeric = structure(c(3L, 2L, 1L), .Label = c("0",
"25", "numeric"), class = "factor"), char_or_factor = structure(c(2L,
NA, 1L), .Label = c("0", "character"), class = "factor"), num_with_null = structure(c(3L,
2L, 1L), .Label = c("2", "55", "numeric"), class = "factor")), .Names = c("allnumeric",
"char_or_factor", "num_with_null"), row.names = c("type", "mean",
"num_nulls"), class = "data.frame")
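We can use sapply to loop over the columns, get the class, the mean, and the number of NA elements for each, concatenate them with c(), and convert the result to a data.frame. Note that c() coerces everything to character, and that mean() returns NA with a warning for non-numeric columns such as factors.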
as.data.frame(sapply(df1, function(x) c(class(x), mean(x, na.rm=TRUE),
sum(is.na(x)))), stringsAsFactors=FALSE)
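A variant of the same idea that only computes the mean for numeric columns, which avoids the warning on factors. The function name summarise_column is made up for this sketch, and df1 is rebuilt here from the question's example data.

```r
# Hypothetical per-column summary: type, mean (numeric columns only),
# and number of NA values. Everything is returned as character so the
# three results can sit in one vector.
summarise_column <- function(x) {
  c(type = class(x)[1],
    mean = if (is.numeric(x)) as.character(mean(x, na.rm = TRUE)) else NA_character_,
    num_nulls = as.character(sum(is.na(x))))
}

df1 <- data.frame(
  allnumeric = c(10, 20, 30, 40),
  char_or_factor = factor(c("cat", "dog", "dog", "bird")),
  num_with_null = c(10, 100, NA, NA)
)

res <- as.data.frame(sapply(df1, summarise_column), stringsAsFactors = FALSE)
print(res)
```

Other statistics (min, max, number of distinct values) can be added as extra named elements inside summarise_column.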

Dataframe manipulation: Convert certain columns of a dataframe into a list based on a key value column

I have a DF like the example created by the code below.
a = data.frame( name = c(rep("Tim",5),rep("John",3)),id = c(rep(1,5),rep(2,3)), value = 1:8)
And I want to transform it into a result that looks like this.
b = data.frame( name = c("Tim","John"), ID = c(1:2), b = NA)
b$value = list(c(1:5),c(6:8))
How would I go about doing this transformation?
For the actual data frame, I will have many columns to the left of the ID column, which I will want to perform calculations on with the columns of lists that will be created on the right side of the ID field.
For example, on the DF b above, I might want to perform a function call with "Tim" as an argument and loop through each individual element in the list = {1,2,3,4,5} and the output of that loop is another list with the same number of elements.
Try
aggregate(value~.,a, FUN=c)
# name id value
#1 Tim 1 1, 2, 3, 4, 5
#2 John 2 6, 7, 8
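Another way to get the same shape, sketched with split, which also shows one way to loop over each list element afterwards. It assumes id identifies each group; the doubling function is just a stand-in for whatever calculation you need.

```r
a <- data.frame(name = c(rep("Tim", 5), rep("John", 3)),
                id = c(rep(1, 5), rep(2, 3)),
                value = 1:8)

# One row per name/id pair.
b <- unique(a[c("name", "id")])

# split() returns a list of value vectors named by id; index it by id so
# the list elements line up with the rows of b.
b$value <- split(a$value, a$id)[as.character(b$id)]

# Apply a function to each name together with its vector of values; the
# result is a list whose elements have the same lengths as the inputs.
doubled <- Map(function(nm, v) v * 2L, b$name, b$value)
print(lengths(doubled))
```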
data
a <- structure(list(name = structure(c(2L, 2L, 2L, 2L, 2L, 1L, 1L,
1L), .Label = c("John", "Tim"), class = "factor"), id = c(1,
1, 1, 1, 1, 2, 2, 2), value = 1:8), .Names = c("name", "id",
"value"), row.names = c(NA, -8L), class = "data.frame")
