How can I ignore null headers in a .csv file?
I have a csv file like this
http://190.12.101.70/~digicelc/gestion/reportes/import/liquidacion/13958642917519.csv
and my code is
data <- read.csv('1.csv',header = T, sep=";")
So R tells me
more columns than column names
And I don't want to skip the header of the file
thank you!
I don't see the same behavior here. R adds default column names and NA to unavailable data.
> data <- read.csv("test.csv", header = TRUE, sep = ";")
> data
col1 col2 col3 col4 X X.1
1 val1 val2 val3 val4 val5 NA
2 val1 val2 val3 val4 val5 NA
Are you using the latest version?
But the error message tells you exactly what the problem is: you have more columns than column names.
download.file("http://190.12.101.70/~digicelc/gestion/reportes/import/liquidacion/13958642917519.csv", destfile="1.csv")
D1 <- read.csv2("1.csv", skip=1, header=FALSE)
firstlines <- readLines("1.csv", 3)
splitthem <- strsplit(firstlines, ";")
sapply(splitthem, length)
# [1] 28 42 42
So you have 42 data columns (separated by semicolons) but 28 column names (again, separated by semicolons). How would R know which name you would want to go with which column? ("Computers are good at following instructions, but not at reading your mind." - Donald Knuth).
You need to edit the source file so that each column has a name, or skip the first row and then get the column names from somewhere else.
edit
Yes, the idea is to take the first names and then standard variable names like V1, V2, or
whatever. Otherwise, is there a way to skip those?
Ok then I would just use the above with slight modification:
download.file("http://190.12.101.70/~digicelc/gestion/reportes/import/liquidacion/13958642917519.csv", destfile="1.csv")
D <- read.csv2("1.csv", skip=1, header=FALSE)
header <- strsplit(readLines("1.csv", 3), ";")[[1]]
names(D)[1:length(header)] <- header
Now you have the first 28 variables named, and the rest named V29-V42.
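A minimal, self-contained sketch of the same idea, for anyone who cannot download the file. The temporary file and its contents (two header names, four data columns) are made up for illustration and only stand in for the real CSV:

```r
# Hypothetical stand-in for the real file: 2 header names but 4 data columns
tmp <- tempfile(fileext = ".csv")
writeLines(c("id;name", "1;x;10;20", "2;y;30;40"), tmp)

# Read the header line separately and split it on semicolons
header <- strsplit(readLines(tmp, n = 1), ";")[[1]]

# Read the data without a header, so R assigns V1, V2, ... to every column
D <- read.csv2(tmp, skip = 1, header = FALSE)

# Overwrite only as many names as the header provides
names(D)[seq_along(header)] <- header
names(D)
```

The remaining columns keep their default V names, so every column stays addressable by name.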
You can "skip" the rest of the names in various ways. If you do as suggested in another answer (Dave's), basically as
names(D) <- header
... then variables 29-42 will have NA as their name. This is not a usable name, and you can address these variables only by column number. Or you can do:
names(D)[29:42] <- ""
Now you can't use these names either.
> D[[""]]
NULL
I think it is useful to give them names, as many data frame operations presume names. For example, suppose you have empty names ("" as above) and try to see the first few rows of your data frame:
head(D)
# skipped most of the output, keeping only column 42:
structure(c("-1", "70", ".5", "70", "266", "70"), class = "AsIs")
1 -1
2 70
3 .5
4 70
5 266
6 70
So when using head, you will see your data frame with funny names. Or another example:
D[1:3,29:31]
.1 .2
1 C_COMPONENTE_LIQ_DESDE_CO 243 LIQUIDACION TOPE CO
2 C_COMPONENTE_LIQ_DESDE_CO 243 RESIDUAL CO
3 C_COMPONENTE_LIQ_DESDE_CO 243 RESIDUAL CO
The first component now is named "", the second one ".1", and the third one ".2". Have a look at a quote from data.frame help file below:
The column names should be non-empty, and attempts to use empty names will have
unsupported results. Duplicate column names are allowed, but you need to use check.names
= FALSE for data.frame to generate such a data frame. However, not all operations on
data frames will preserve duplicated column names: for example matrix-like subsetting
will force column names in the result to be unique.
Or suppose you add some columns to the beginning of your data frame; if you have col names then you can still address what was previously 29th column as D$V29, but with D[,29] you will get something else.
Probably there are other examples. In other words, you can have "unnamed" columns in a data frame but I don't think it is a good idea. And technically, all columns in a data frame will always have a name (it can just be "" or NA), so why not have meaningful names? (Even V29 is better than nothing.)
I want to modify the longitudinal data based on ID.
I want to check whether the IDs in the wave 1 data (A) and the wave 2 data (B) match properly. I also want to combine the data of A and B into one file based on ID.
I tried to merge the files using merge() and to check whether the IDs matched by comparing the sex variable. However, this cannot be done when the two waves have no variable in common, and it does not check each ID directly.
ID <- c(1012,1102,1033,1204,1555)
sex <- c(1,0,1,0,1)
A <- cbind(ID,sex)
A <- as.data.frame(A)
ID <- c(1006,1102,1001,1033,1010,1234,1506,1999)
sex <- c(1,0,1,1,1,0,0,0)
B <- cbind(ID,sex)
B <- as.data.frame(B)
merge.AB<-merge(A,B,by="ID")
all(merge.AB$sex.x == merge.AB$sex.y)
1. Is there any way to merge the A (wave 1) and B (wave 2) files by ID other than merge()?
Since there are 2 or 3 wave1 files other than A, it would be nice to be able to combine several files at once.
2. Is there a way to check if two frames have the same ID directly?
I tried to combine the files and check matching IDs through cbind() code for combining the A and B. But I couldn't check them together because the number of rows between the A and B dataframe is different.
It would be helpful to use a loop(e.g. if, for, etc.), but it would be nice if there was a way to do it with a package or simple code.
3. How do I locate a row number with a mismatched ID?
I want to know the row numbers of all mismatched IDs in the example.
e.g.
mismatched ID in A: 1012,1204,1555
mismatched ID in B: 1006,1001,1010,1234,1506,1999
Question 1 : you can merge multiple dataframes with merge. You first need to create a list of the df you want to merge and then you could use Reduce.
df_list <- list(df1, df2, ..., dfn)  # placeholder: list every data frame you want to merge
data <- Reduce(function(x, y) merge(x = x, y = y, by = "ID", all = TRUE), df_list)
Alternatively using tidyverse:
library(tidyverse)
df_list %>% reduce(full_join, by='ID')
In your example, note that A and B share the sex variable, which holds the same information in both. Merging by ID alone keeps it twice, as sex.x and sex.y. You could simply use
data <- Reduce(function(x, y) merge(x = x, y = y, all = TRUE), df_list)
which merges on all common columns and so removes the redundant information from the merged df.
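Here is a small runnable sketch of the Reduce approach. A and B are shortened versions of the data in the question, and C is a made-up third wave added only for illustration:

```r
# Shortened example data; C is a hypothetical third file
A <- data.frame(ID = c(1012, 1102, 1033), sex = c(1, 0, 1))
B <- data.frame(ID = c(1102, 1033, 1006), sex = c(0, 1, 1))
C <- data.frame(ID = c(1033, 1999),       sex = c(1, 0))

df_list <- list(A, B, C)

# With no 'by', merge() joins on all shared column names (here ID and sex);
# all = TRUE keeps rows that appear in only one of the data frames
merged <- Reduce(function(x, y) merge(x, y, all = TRUE), df_list)
merged
```

Because the join uses both ID and sex, an ID recorded with different sex values in two waves would show up as two separate rows, which is one way to spot inconsistencies.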
Question 2 : check IDs with setdiff() and intersect()
intersect() gives you the common values between two vectors
setdiff(x,y) gives you the values in x that are not present in y
intersect(A$ID,B$ID)
[1] 1102 1033
setdiff(A$ID,B$ID)
[1] 1012 1204 1555
setdiff(B$ID,A$ID)
[1] 1006 1001 1010 1234 1506 1999
Question 3 : a simple which() including %in% test will give you the position in the dataframe
which(!(A$ID %in% B$ID))
[1] 1 4 5
which(!(B$ID %in% A$ID))
[1] 1 3 5 6 7 8
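If you want the row numbers and the IDs side by side, the two steps above can be combined. This is only a sketch using the IDs from the question; the names rows_A and mismatch_A are made up here:

```r
A <- data.frame(ID = c(1012, 1102, 1033, 1204, 1555))
B <- data.frame(ID = c(1006, 1102, 1001, 1033, 1010, 1234, 1506, 1999))

# Row positions in A whose ID does not occur in B
rows_A <- which(!(A$ID %in% B$ID))

# Small lookup table: row number next to the mismatched ID
mismatch_A <- data.frame(row = rows_A, ID = A$ID[rows_A])
mismatch_A
```

Swapping A and B gives the same table for the IDs that appear only in B.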
I am trying to interpolate values from a data frame I have imported from Excel. In the table are two columns (Value1 and Value2) that I am trying to interpolate for each unique "Name". The data frame contains 550 rows with 90 unique Names (so each Name has more than one "Value1" value and more than one "Value2" value). There are also a bunch of irrelevant columns in the data frame which I don't have a use for.
Example of data frame:
Name Value1 Value2 NotImportantvalue1 NotImportantvalue2
A 1 1 ABC ABC
A 2 1 ABC ABC
B 40 40 ABC ABC
C 30 30 ABC ABC
C 1 2 ABC ABC
D 2 400 ABC ABC
D 3 500 ABC ABC
D 40 2 ABC ABC
I've been messing around with for loops that cycles through a dataframe containing the unique values of Names trying to make it go through the Value1/Value2 columns in my.data where the name in the "Name" column matches the name in the unique dataframe, but I'm not getting the results I want.
Where I'm currently at with my code is to try to get Value1 and Value2 when i in the dataframe "Name" matches the value in the column "Name" in my.data and saving as a dataframe with the same name. After that I have to figure out how to interpolate the values in each dataframe.
#Set working directory
setwd("H:\\R-project")
#Set file path
file <- file.path("Data.xlsx")
#set library
library(XLConnect)
#Read data
my.data <- readWorksheetFromFile(file,sheet=1,startRow=1)
#Unique Names
Name <- data.frame(unique(my.data$Name))
colnames(Name) <- "Name"
for (i in Name$Name) {
# one data frame per name, holding only Value1 and Value2 for that name
assign(i, data.frame(Value1 = my.data$Value1[my.data$Name == i], Value2 = my.data$Value2[my.data$Name == i]))
}
I'm also not sure if using 90 individual data frames is the way to go or if I should use something like
name_list <- split(my.data, as.factor(my.data$Name))
and interpolate from the list directly (although I don't know exactly how to do that either; for loops aren't my strong point).
Any guidance or help on how to continue would be greatly appreciated!
As you suggested
name_list <- split(my.data, my.data$Name)
will give you a list of data frames that have been split by name.
You can operate on that list using something like the following
lapply(name_list, function(x) approx(x$Value1, x$Value2))
You will need to provide more details on the desired output if you want a more specific answer.
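As a rough sketch of how this could look on the example table from the question: approx() needs at least two distinct x values, so groups with a single point (such as B) are skipped here by returning NULL. The choice of n = 5 output points is arbitrary and only for illustration:

```r
# Example data from the question, without the irrelevant columns
my.data <- data.frame(
  Name   = c("A", "A", "B", "C", "C", "D", "D", "D"),
  Value1 = c(1, 2, 40, 30, 1, 2, 3, 40),
  Value2 = c(1, 1, 40, 30, 2, 400, 500, 2)
)

name_list <- split(my.data, my.data$Name)

# Linear interpolation of Value2 over Value1, one result per Name
interp <- lapply(name_list, function(x) {
  if (length(unique(x$Value1)) < 2) return(NULL)  # not enough points to interpolate
  approx(x$Value1, x$Value2, n = 5)
})

interp$D
```

Each non-NULL element is a list with components x and y, so interp$D$y holds the interpolated Value2 values for name D.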
I've written an apply where I want to 'loop' over a subset of the columns in a dataframe and print some output. For the sake of an example I'm just transforming based on dividing one column by another (I know there are other ways to do this) so we have:
apply(df[c("a","b","c")],2,function(x){
z <- x/df[["divisor"]]
}
)
I'd like to print the column name currently being operated on, but colnames(x) (for example) doesn't work.
Then I want to save a new column, based on each colname (a.exp,b.exp or whatever) into the same df.
For example, take
df <- data.frame(a = 1:3, b = 11:13, c = 21:23)
I'd like to print the column name currently being operated on, but
colnames(x) (for example) doesn't work.
Use sapply with column indices:
sapply(seq_len(ncol(df)), function(x) names(df)[x])
# [1] "a" "b" "c"
I want to save a new column, based on each colname (a.exp,b.exp or
whatever) into the same df.
Here is one way to do it:
(df <- cbind(df, setNames(as.data.frame(apply(df, 2, "^", 2)), paste(names(df), "sqr", sep = "."))))
# a b c a.sqr b.sqr c.sqr
# 1 1 11 21 1 121 441
# 2 2 12 22 4 144 484
# 3 3 13 23 9 169 529
I think a lot of people will look for this same issue, so I'm answering my own question (having eventually found the answers). As below, there are other answers to both parts (thanks!) but none combining these issues (and some of the examples are more complex).
First, it seems the "colnames" element really isn't something you can get around (seems weird to me!), so you 'loop' over the column names, and within the function call the actual vectors by name [c(x)].
Then the key thing is that to assign, and so create your new columns, within an apply, you use '<<-'
# loop over the column names (lapply, since apply() needs a matrix and a MARGIN)
lapply(colnames(ChISEQCIS[c("a","b","c")]), function(x) {
z <- (ChISEQCIS[c(paste0(x))]/ChISEQCIS[c("V1")])
ChISEQCIS[c(paste0(x,"ind"))] <<- z
}
)
The <<- operator is discussed at e.g. https://stackoverflow.com/questions/2628621/how-do-you-use-scoping-assignment-in-r
I got confused because I only vaguely thought about wanting to save the outputs initially, and I figured I needed both the column (I incorrectly assumed apply worked like a loop, so I could use a counter as an index or something) and that there should be some way to get the name separately (e.g. colnames(x)).
There are a couple of related stack questions:
https://stackoverflow.com/questions/9624866/access-to-column-name-of-dataframe-with-apply-function
https://stackoverflow.com/questions/21512041/printing-a-column-name-inside-lapply-function
https://stackoverflow.com/questions/10956873/how-to-print-the-name-of-current-row-when-using-apply-in-r
https://stackoverflow.com/questions/7681013/apply-over-matrix-by-column-any-way-to-get-column-name (easiest to understand)
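For comparison, the same result can be had without the global assignment by using a plain for loop over the column names. This is only a sketch: the divisor column and the ".exp" suffix are assumptions taken from the wording of the question, not from real data:

```r
# Example data; 'divisor' is a hypothetical column added for illustration
df <- data.frame(a = 1:3, b = 11:13, c = 21:23, divisor = c(1, 2, 4))

for (nm in c("a", "b", "c")) {
  message("processing column: ", nm)            # the name is available directly
  df[[paste0(nm, ".exp")]] <- df[[nm]] / df[["divisor"]]
}

names(df)
```

Since the loop runs in the calling environment, ordinary assignment is enough and the new columns land in df itself.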
There are other questions here addressing the same issue, but I can't work out how to solve my problem based on them. I have 5 data frames whose rows I want to combine into a single data frame using rbind, but it returns the error:
"Error in row.names<-.data.frame(*tmp*, value = value) :
'row.names' duplicated not allowed
In addition: Warning message:
non-unique values when setting 'row.names': ‘1’, ‘10’, ‘100’, ‘1000’, ‘10000’, ‘100000’, ‘1000000’, ‘1000001 [....]"
The data frames have the same columns but different numbers of rows. I thought the rbind command took the first column as row.names, so I tried to put a sequential id in the five data frames, but it doesn't work. I've also tried to specify sequential row names across the data frames via row.names(), with no success either. The merge command is not an option, I think, because there are 5 data frames and successive merges would overwrite the preceding ones. I've created a new data frame with only the ids and tried to join, but the resulting data frame doesn't append the columns of the joined df.
Follows an extract of df 1:
id image power value pol class
1 1 tsx_sm_hh 0.1834515 -7.364787 hh FR
2 2 tsx_sm_hh 0.1834515 -7.364787 hh FR
3 3 tsx_sm_hh 0.1991938 -7.007242 hh FR
4 4 tsx_sm_hh 0.1991938 -7.007242 hh FR
5 5 tsx_sm_hh 0.2079365 -6.820693 hh FR
6 6 tsx_sm_hh 0.2079365 -6.820693 hh FR
[...]
1802124 1802124 tsx_sm_hh 0.1991938 -7.007242 hh FR
The four other df's have the same structure, except that the 'id' columns have no duplicated numbers among them. The 'pol' and 'image' columns are factors.
And all.pol <- rbind(df1,df2,df3,df4,df5) returns this duplicated row.names error.
Any idea?
Thanks in advance
I had the same error recently. What turned out to be the problem in my case was one of the attributes of the data frame was a list. After casting it to basic object (e.g. numeric) rbind worked just fine.
By the way row name is the "row numbers" to the left of the first variable. In your example, it is 1, 2, 3, ... (the same as your id variable).
You can see it using rownames(df) and set it using rownames(df) <- name_vector (name_vector must have one element per row of df, i.e. length nrow(df), and its elements must be unique).
I had the same error.
My problem was that one of the columns in the data frames was itself a data frame, and I couldn't easily find the offending column.
data.table::rbindlist() helped to locate it
library(data.table)
rbindlist(a)
# Error in rbindlist(a) :
# Column 25 of item 1 is length 2 inconsistent with column 1 which is length 16. Only length-1 columns are recycled.
class(a[[1]][[25]]) # "data.frame" - this should obviously be converted to a regular column or removed
After removing the errant column, do.call(rbind, a) worked as expected.
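If you would rather find such columns without relying on the error message, a quick base R check is to test every column with is.list(), which is TRUE for both list columns and nested data frames. A sketch on a made-up data frame (df1 and its nested column are invented for illustration):

```r
# Made-up data frame with a nested data-frame column, as described above
df1 <- data.frame(id = 1:2)
df1$nested <- data.frame(u = 1:2, v = 3:4)

# Names of all columns that are lists or data frames rather than plain vectors
bad <- names(df1)[vapply(df1, is.list, logical(1))]
bad

# Drop the offending columns before calling rbind
df1_clean <- df1[setdiff(names(df1), bad)]
```

Running the same check over each element of the list of data frames shows which ones need cleaning before do.call(rbind, ...).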
So I have three data frames we will call them a,b,c
within each data frame there are columns called 1,2,3,4 with 54175 rows of data
Column 1 has id names that are the same in each data frame but not necessarily in the same order
Columns 2,3,4 are just numeric values
I want to pull out all the information from column 2 for a,b,c based on ID from column 1 so each values for a,b,c will correlate to the correct ID
I tried something like
m1 <- merge(A[,'2'], b[,'2'], c[,'2'], by='1')
I get this error
Error in fix.by(by.x, x) : 'by' must match numbers of columns
Thank you for your help!
Couple problems:
Merge works two-at-a-time, no more.
You need to have the by column in the data.frames that are merged.
Fix these like this:
m1 <- merge(A[,c("1", "2")], B[,c("1", "2")])
m2 <- merge(m1, C[, c("1", "2")])
Then m2 should be the result you're looking for.
As an aside, it's pretty weird to use column names that are just characters of numbers. If they're in order, just use column indices (no quotes), and otherwise put something in them to indicate that they're names not numbers, e.g., R's default of "V1", "V2", "V3". Of course, the best is a meaningful name, like "id", "MeasureDescription", ...
You can either use merge two times:
merge(merge(a[1:2], b[1:2], by = "1"), c[1:2])
or Reduce with merge:
Reduce(function(...) merge(..., by = "1"), list(a[1:2], b[1:2], c[1:2]))
You have to merge them 2 at a time:
a<-data.frame(sample(1:100,100),100*runif(100),100*runif(100),100*runif(100))
colnames(a)<-1:4
b<-data.frame("C1"=sample(1:100,100),"C2"=100*runif(100),"C3"=100*runif(100),"C4"=100*runif(100))
colnames(b)<-1:4
c<-data.frame("C1"=sample(1:100,100),"C2"=100*runif(100),"C3"=100*runif(100),"C4"=100*runif(100))
colnames(c)<-1:4
f<-merge(a[,1:2],b[,1:2],by=(1))
f<-merge(f,c[,1:2],by=(1))
colnames(f)<-c(1,"A2","B2","C2")
head(f)
1 A2 B2 C2
1 1 54.63326 39.23676 28.10989
2 2 10.10024 56.08021 69.44268
3 3 45.02948 14.69028 22.44243
4 4 90.50883 33.61303 98.00917
5 5 13.80767 80.93382 77.22679
6 6 80.72241 27.22139 51.34516
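I think the easiest way to answer this question is that by can be given as a column position instead of a name. Instead of
m1 <- merge(A[,'2'], b[,'2'], c[,'2'], by='1')
keep the key column in the selection, merge two data frames at a time, and use by=(1):
m1 <- merge(a[,1:2], b[,1:2], by=(1))
m1 <- merge(m1, c[,1:2], by=(1))
Quotes are only needed when you want to merge by a column name, for example, if the key column were called ID:
m1 <- merge(a[,c('ID','2')], b[,c('ID','2')], by='ID')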