left_join duplicates even after troubleshooting - r

Sample data:
full<-structure(list(Location = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = c("AKS",
"AOK", "BTX", "GTX", "HKS", "JKS", "LOK", "MKS", "MOK", "PKS",
"SKS", "VTX"), class = "factor"), CT_NT = structure(c(1L, 1L,
1L, 1L, 1L, 1L), .Label = c("CT", "NT"), class = "factor"), Depth = c(5L,
10L, 15L, 5L, 10L, 15L), Site = c(1L, 1L, 1L, 1L, 1L, 1L), PW = c(22.8,
21.5, 18.2, 22.5, 20.5, 19.2), BD = c(1.1, 1.2, 1.1, 1.3, 1.3,
1.5)), .Names = c("Location", "CT_NT", "Depth", "Site", "PW",
"BD"), row.names = c(NA, 6L), class = "data.frame")
osu<-structure(list(Location = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = c("AKS",
"AOK", "BTX", "GTX", "HKS", "JKS", "LOK", "MKS", "MOK", "PKS",
"SKS", "VTX"), class = "factor"), CT_NT = structure(c(1L, 1L,
1L, 2L, 2L, 2L), .Label = c("CT", "NT"), class = "factor"), Depth = c(5L,
10L, 15L, 5L, 10L, 15L), pH = c(5.1, 5.4, 5.9, 5.2, 5.9, 6.2),
N = c(50, 31, 22, 35, 17, 8), P = c(122, 55, 34, 107, 23,
17), K = c(1301, 1202, 1078, 1196, 1028, 948), OM = c(2.3,
1.8, 1.5, 2.1, 1.4, 1.2), NH4 = c(19.3, 14.5, 11.6, 12.3,
8.6, 8.4), Sand = c(22.5, 25, 25, 25, 22.5, 18.8), Silt = c(56.3,
52.5, 50, 51.3, 52.5, 51.3), Clay = c(21.3, 22.5, 25, 23.8,
25, 30)), .Names = c("Location", "CT_NT", "Depth", "pH",
"N", "P", "K", "OM", "NH4", "Sand", "Silt", "Clay"), row.names = c(NA,
6L), class = "data.frame")
I am trying to join two datasets using left_join in dplyr. To my astonishment, I'm getting duplicate rows that are somehow not being identified as such. After reading all the other answers I could get my hands on here that seemed to address "join" issues (at least I'm not the only one who has them...?), I have tried:
- Checking the group types of the joining variables in the two datasets to ensure they match
- Checking that I don't have duplicates within f1 or f2 (see the sketch after this list)
- Checking that the categorical columns I'm using to join are, in fact, the same length and have the same contents. They're EXACTLY the same, all the way down to the order I put them in
- Explicitly specifying to dplyr to use Location, CT_NT, and Depth to join
- Letting dplyr figure out the joining variables itself
- Joining in both orders
- Using inner_join (I ended up with f1 only)
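For the duplicate check in particular, this is roughly what I ran (a minimal sketch against the sample data above; the same check applies to f1 and f2 once they exist):
library(dplyr)
# n > 1 means that key combination is not unique in that table
full %>% count(Location, CT_NT, Depth) %>% filter(n > 1)
osu %>% count(Location, CT_NT, Depth) %>% filter(n > 1)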
I've used left_join before and not had this issue, and it was with a very similar dataset (the pilot data to this full study, in fact). I thought I understood what left_join was doing, but now I'm wondering if I don't actually. I'm trying to get better with using dplyr, but unfortunately it's a lot of me bashing away at things until something works and I can figure out why it worked so I can reproduce it again later as needed.
Given my inexperience, I'm sure the answer is going to be frustratingly straightforward and simple, to the annoyance of everyone involved. Such is the life of learning to code, I guess. Thank you in advance for dealing with a rookie's doofy questions!
Here's my code:
f1 <- full %>% #Build the f1 summary. Pipe full to...
  group_by(Location, CT_NT, Depth, Site) %>% #group_by to work on CT or NT at each site
  summarise_at(5:6, funs(mean)) %>% #calculate site means for PW and BD (columns 5:6)
  ungroup()
f1$Depth <- as.factor(f1$Depth)
f1$Site <- NULL

osu$Texture_Class <- NULL #Take out the texture class column (not present in the sample data above)
f2 <- osu %>%
  group_by(Location, CT_NT, Depth) %>% #group because otherwise R tries to crash on the next line of code...
  arrange(Location, CT_NT, Depth) %>% #Put everything in order like f1, just in case
  ungroup()
f2$Depth <- as.factor(f2$Depth)

full_summary <- left_join(f1, f2)
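For reference, the explicitly keyed version of the join I also tried looks like this (a sketch; the by columns are the three named above):
full_summary <- left_join(f1, f2, by = c("Location", "CT_NT", "Depth"))
# If the join fans out, the result has more rows than f1
nrow(f1); nrow(full_summary)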

Related

convert dataframe to time series for arima

I am having problems converting the following dataset to ts to be used with stats::arima. I was able to convert it to an xts object, but arima does not seem to like it. Can someone guide me on how to convert it to ts? I really need to use an ARIMA model here. Thanks.
library(ggfortify)
library(xts)
wt <- structure(list(SampleDate = structure(c(13687, 13694, 13701,
13708, 13715, 13722, 13729, 13736, 13743, 13750, 13757, 13764,
13771, 13778, 13785), class = "Date"), DOC = c(3, 10, 17, 24,
31, 38, 45, 52, 59, 66, 73, 80, 87, 94, 101), AvgWeight = c(1,
1.66666666666667, 2.06666666666667, 2.275, 3.83333333333333,
6.2, 7.4, 8.5, 10.25, 11.1, 13.625, 15.2, 16.375, 17.8, 21.5),
PondName = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L), .Label = "Pond01", class = "factor")), row.names = c(NA,
15L), class = "data.frame")
pond <- as.xts(wt$AvgWeight, order.by = seq(as.Date("2007-06-23"), by = 7, len = 15))
d.arima <- arima(pond)
#arima is not recognized.....probably because I need a ts and not a xts object here.....
autoplot(d.arima,
         predict = predict(d.arima, n.ahead = 3, prediction.interval = TRUE, level = 0.95),
         ts.colour = 'dodgerblue', predict.colour = 'green',
         predict.linetype = 'dashed', ts.size = 1.5, conf.int.fill = 'azure3') +
  xlab('DOC') + ylab('AvgWeight-grs') +
  theme_bw()
I get this weird plot...
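A minimal sketch of one way to get a plain ts object from this data (treating the weekly samples as frequency = 52 is an assumption, and the ARIMA order below is only a placeholder, not a tuned model):
# Weekly observations starting 2007-06-23 (roughly week 25 of 2007)
pond_ts <- ts(wt$AvgWeight, start = c(2007, 25), frequency = 52)
d.arima <- arima(pond_ts, order = c(0, 1, 0))  # placeholder order
predict(d.arima, n.ahead = 3)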

Using conditional selection to create a subset of data

I have a dataset called dietox which has missing values (NA) for the Feed variable. I need to use conditional selection to create a subset of the data for which the rows with missing values are deleted.
The code I tried was:
dietox[!is.NA[dietox$Feed, ]
... but am not sure if that is right to create a subset.
dput(head(dietox))
dietox <- structure(list(Weight = c(26.5, 27.59999, 36.5, 40.29999, 49.09998,
55.39999), Feed = c(NA, 5.200005, 17.6, 28.5, 45.200001, 56.900002 ),
Time = 1:6, Pig = c(4601L, 4601L, 4601L, 4601L, 4601L, 4601L ),
Evit = c(1L, 1L, 1L, 1L, 1L, 1L), Cu = c(1L, 1L, 1L, 1L, 1L, 1L),
Litter = c(1L, 1L, 1L, 1L, 1L, 1L)),
.Names = c("Weight", "Feed", "Time", "Pig", "Evit", "Cu", "Litter"),
row.names = c(NA, 6L), class = "data.frame")
You have the right idea, but is.na is a function (note the lower case), so it needs to be called with parentheses rather than square brackets:
dietox[!is.na(dietox$Feed), ]
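A couple of equivalent alternatives, in case they read more naturally (note that complete.cases drops rows with an NA in any column, which only matches here because Feed is the only column with missing values):
subset(dietox, !is.na(Feed))        # base R subset()
dietox[complete.cases(dietox), ]    # drops rows with NA in any column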

Re-levelling in R for a xtab based on a condition

For a sample dataframe:
df <- structure(list(region = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L,
2L, 2L, 2L, 2L), .Label = c("a", "b", "c", "d"), class = "factor"),
result = c(1L, 0L, 1L, 1L, 0L, 1L, 0L, 0L, 0L, 1L, 0L), weight = c(0.126,
0.5, 0.8, 1.5, 5.3, 2.2, 3.2, 1.1, 0.1, 1.3, 2.5)), .Names = c("region",
"result", "weight"), row.names = c(NA, 11L), class = "data.frame")
I draw a cross tabulation using:
df$region <- factor(df$region)
result <- xtabs(weight ~ region + result, data=df)
result
However, I want the regions of the xtab ordered by the percentage of 1s in each region (1s represent 29% of region a and 33% of region b). I would therefore like the xtab reordered so that region b comes first, then a.
I know I could use relevel, however this would be dependent on me looking at the result and re-levelling where appropriate.
Instead I want this to be automatic in the code and not dependent on the user (as this code will be running lots of times, and completing further analysis on the resulting xtab).
If anyone has any ideas, I would greatly appreciate it.
You can reorder the xtab by the values of its second column (the weighted counts for result == 1) using order:
result[order(result[, 2], decreasing = TRUE), ]
order returns the permutation that sorts those values; decreasing = TRUE puts the largest first.
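If the ordering should follow the proportion of 1s rather than the raw weighted counts (the two can disagree when regions have very different totals), a small sketch:
prop1 <- result[, "1"] / rowSums(result)   # weighted share of 1s per region
result[order(prop1, decreasing = TRUE), ]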

R - Changing colnames with a loop with names from another table

I am trying to:
- split a large dataset,
- assign colnames with a loop, and
- save all the individual data back again in a single stacked file.
I am using some sample data as follows. Firstly, I split the dataset into 2 based on the source names in the first column, which gives me a list:
out <- split(sample, f = sample$Source)
Now I am struggling to set up a loop that changes the colnames of columns 2 to 8 by matching the existing colnames against the 'info' table (the stack_info dput below) and replacing them based on the source name in its first column.
I am just wondering if anyone who has done something similar could advise me?
Also, when I try to join the tables back together I can only set the colnames once using the merge function. Is there any way to stack them so that the colnames for each table are preserved?
My sample input files are:
> dput(sample)
structure(list(Source = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L,
2L, 2L), .Label = c("Stack 1", "Stack 2"), class = "factor"),
year = c(2010L, 2010L, 2010L, 2010L, 2010L, 2010L, 2010L,
2010L, 2010L), day = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L
), hour = c(0L, 1L, 2L, 3L, 0L, 1L, 2L, 3L, 4L), `EXIT VEL` = c(26.2,
26.2, 26.2, 26.2, 22.4, 22.4, 22.4, 22.4, 22.4), TEMP = c(341L,
341L, 341L, 341L, 328L, 328L, 328L, 328L, 328L), `STACK DIAM` = c(1.5,
1.5, 1.5, 1.5, 2.5, 2.5, 2.5, 2.5, 2.5), W = c(0L, 0L, 0L,
0L, 15L, 15L, 15L, 15L, 15L), Nox = c(39, 39, 39, 39, 33.3,
33.3, 33.3, 33.3, 33.3), Sox = c(15.5, 15.5, 15.5, 15.5,
17.9, 17.9, 17.9, 17.9, 17.9)), .Names = c("Source", "year",
"day", "hour", "EXIT VEL", "TEMP", "STACK DIAM", "W", "Nox",
"Sox"), class = "data.frame", row.names = c(NA, -9L))
> dput(stack_info)
structure(list(SNAME = structure(1:2, .Label = c("Stack 1", "Stack 2"
), class = "factor"), ISVARY = c(1L, 4L), VELVOL = c(1L, 4L),
TEMPDENS = c(0L, 2L), `DUM 1` = c(999L, 999L), `DUM 2` = c(999L,
999L), NPOL = c(2L, 2L), `EXIT VEL` = c(26.2, 22.4), TEMP = c(341L,
328L), `STACK DIAM` = c(1.5, 2.5), W = c(0L, 15L), Nox = c(39,
33.3), Sox = c(15.5, 17.9)), .Names = c("SNAME", "ISVARY",
"VELVOL", "TEMPDENS", "DUM 1", "DUM 2", "NPOL", "EXIT VEL", "TEMP",
"STACK DIAM", "W", "Nox", "Sox"), class = "data.frame", row.names = c(NA,
-2L))
thanks in advance
The best I ended up with is this:
out <- split(sample, f = sample$Source) # your original step
stack_info[, 1] <- as.character(stack_info[, 1]) # so we get strings later, not factor level numbers
out <- lapply(names(out), function(x) {
  # Get the future names
  new_cnames <- unname(unlist(stack_info[stack_info$SNAME == x, 1:7]))
  # Replace the column names
  colnames(out[[x]]) <- c("Source", new_cnames, colnames(out[[x]])[9:10])
  # Return the modified version without the first column
  out[[x]][, -1]
})
# Write (change "" to the file name you wish, sep to your desired separator; see ?write.table)
sapply(out, write.table, append = TRUE, file = "", row.names = FALSE, sep = "|")
The main idea is to loop over the data frames to change their colnames; here I update the list and then loop again to write, but you may prefer to append to the file inside the first loop.
I hope the comments are enough to follow the code; tell me if you need more detail.
Output on screen (omitting warnings):
"Stack 1"|"1"|"1.1"|"0"|"999"|"999.1"|"2"|"Nox"|"Sox"
2010|1|0|26.2|341|1.5|0|39|15.5
2010|1|1|26.2|341|1.5|0|39|15.5
2010|1|2|26.2|341|1.5|0|39|15.5
2010|1|3|26.2|341|1.5|0|39|15.5
"Stack 2"|"4"|"4.1"|"2"|"999"|"999.1"|"2.1"|"Nox"|"Sox"
2010|1|0|22.4|328|2.5|15|33.3|17.9
2010|1|1|22.4|328|2.5|15|33.3|17.9
2010|1|2|22.4|328|2.5|15|33.3|17.9
2010|1|3|22.4|328|2.5|15|33.3|17.9
2010|1|4|22.4|328|2.5|15|33.3|17.9
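If you want the stacked result in an actual file rather than on screen, the same write step with a hypothetical file name (each table is appended with its own header row, which is what preserves the per-table colnames):
outfile <- "stacked_output.txt"                 # hypothetical file name
if (file.exists(outfile)) file.remove(outfile)  # start clean, since we append
invisible(lapply(out, write.table, file = outfile, append = TRUE,
                 row.names = FALSE, sep = "|"))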

Error in cor(data[, -1], use = "complete.obs") : 'x' must be numeric

I'm completely new to R - really have no clue what I'm doing to be honest. But I really need to run bivariate/multivariate regressions with this data following someone's advice and I'm stuck. Any help is greatly appreciated.
rm(list=ls())
setwd("C:/Users/Bogi/Documents/School/Honors Thesis/Voting and Economic Data")
data<-read.csv("BOGDAN_DATA1.csv")
head(data)
round(cor(data[,-1],use="complete.obs"),1)
Error in cor(data[, -1], use = "complete.obs") : 'x' must be numeric
dput output:
structure(list(REGION = structure(1:6, .Label = c("Altai Republic",
"Altai Territory", "Amur Region", "Arkhangelsk Region", "Astrakhan region",
"Belgorod region"), class = "factor"), PCT_CHANGE_VOTE = structure(c(2L,
3L, 5L, 4L, 6L, 1L), .Label = c("-13%", "-16%", "-17%", "-25%",
"-26%", "2%"), class = "factor"), PCT_CHANGE_GRP = structure(c(2L,
1L, 4L, 3L, 3L, 4L), .Label = c("10%", "17%", "19%", "27%"), class = "factor"),
PCT_CHANGE_INFLATION = structure(c(1L, 2L, 1L, 3L, 3L, 2L
), .Label = c("-2%", "-3%", "-4%"), class = "factor"), PCT_CHANGE_UNEMP = structure(c(5L,
4L, 1L, 2L, 6L, 3L), .Label = c("-13%", "-14%", "-17%", "-3%",
"5%", "7%"), class = "factor"), POVERTY = c(18.6, 22.6, 20.4,
14.4, 14.2, 8.6), POP_AGE1 = c(25.8, 16.9, 18.5, 17.1, 17.8,
15.2), POP_AGE2 = c(58.8, 59.6, 61.3, 60.4, 60.8, 60.3),
POP_AGE3 = c(15.4, 23.5, 20.2, 22.5, 21.4, 24.5), POP_URBAN = c(28.7,
55.2, 67, 76.2, 66.7, 66.4), POP_RURAL = c(71.3, 44.8, 33,
23.8, 33.3, 33.6), COMPUTER = c(46.4, 54.5, 66.1, 74, 65.1,
55.2), INTERNET = c(32.1, 41, 50.7, 66.5, 60, 50.7)), .Names = c("REGION",
"PCT_CHANGE_VOTE", "PCT_CHANGE_GRP", "PCT_CHANGE_INFLATION",
"PCT_CHANGE_UNEMP", "POVERTY", "POP_AGE1", "POP_AGE2", "POP_AGE3",
"POP_URBAN", "POP_RURAL", "COMPUTER", "INTERNET"), row.names = c(NA,
6L), class = "data.frame")
You could loop over columns 2:5 (lapply(data[2:5], ...)), remove the % in those columns (gsub('[%]', '', ...)), and convert them to numeric. The output from gsub is character, so convert it with as.numeric:
data[2:5] <- lapply(data[2:5], function(x)
as.numeric(gsub('[%]', '', x)))
Cor1 <- round(cor(data[-1],use="complete.obs"),1)
Or you could remove the % in those columns with awk in the shell (assuming , as the delimiter):
awk 'BEGIN {OFS=FS=","} function SUB(F) {sub(/\%/,"", $F)}{SUB(2);SUB(3);SUB(4);SUB(5)}1' Bogdan.csv > Bogdan2.csv
Then read the file with read.csv and run cor again:
dat1 <- read.csv('Bogdan2.csv')
Cor2 <- round(cor(dat1[-1], use='complete.obs'), 1)
identical(Cor1, Cor2)
#[1] TRUE
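Another option, assuming the readr package is available, is parse_number, which strips the % for you; the columns come in as factors from this dput, so convert them to character first:
library(readr)
data[2:5] <- lapply(data[2:5], function(x) parse_number(as.character(x)))
Cor3 <- round(cor(data[-1], use = "complete.obs"), 1)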
