Standardizing for Open Data

Ivan Herman, W3C
Open Data Week, Marseille, France, June 26, 2013
Slides at: http://www.w3.org/2013/Talks/0626-Marseille-IH/

(1)

Data is everywhere on the Web!
- Public, private, behind enterprise firewalls
- Ranges from informal to highly curated
- Ranges from machine readable to human readable
  - HTML tables, Twitter feeds, local vocabularies, spreadsheets, …
- Expressed in diverse models
  - tree, graph, table, …
- Serialized in many ways
  - XML, CSV, RDF, PDF, HTML tables, microdata, …

(2)


W3C's standardization focus was, traditionally, on Web-scale integration of data
- Some basic principles (a small sketch follows below):
  - use of URIs everywhere (to uniquely identify things)
  - relate resources to one another (to connect things on the Web)
  - discover new relationships through inferences
- This is what the Semantic Web technologies are all about

(8)
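A minimal sketch of these principles (not part of the original slides), using Python with the rdflib library; all URIs and names are invented for illustration:

```python
from rdflib import Graph

# Two resources, identified by URIs, related to each other and linked to a
# resource published in another (hypothetical) dataset.
data = """
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix owl:  <http://www.w3.org/2002/07/owl#> .

<http://example.org/people/alice>
    foaf:name  "Alice" ;
    foaf:knows <http://example.org/people/bob> ;
    owl:sameAs <http://other.example.com/id/alice> .
"""

g = Graph()
g.parse(data=data, format="turtle")

# Every statement is a (subject, predicate, object) triple about a resource.
for s, p, o in g:
    print(s, p, o)
```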

We have a number of standards
- RDF 1.1: data model, links, basic assertions; different serializations (Turtle, RDFa, RDF/XML, JSON-LD)
- SPARQL 1.1: querying data
- all built on URIs

A fairly stable set of technologies by now! (A small serialization example follows.)

(9)
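As an illustration of "one data model, several serializations", the sketch below (invented data, not from the talk) parses a Turtle document with rdflib and re-serializes the same graph as JSON-LD and RDF/XML; the JSON-LD output assumes rdflib 6+ or the rdflib-jsonld plugin.

```python
from rdflib import Graph

turtle_doc = """
@prefix dct: <http://purl.org/dc/terms/> .

<http://example.org/dataset/1>
    dct:title   "A sample dataset" ;
    dct:creator <http://example.org/people/alice> .
"""

g = Graph()
g.parse(data=turtle_doc, format="turtle")

# The same triples, written out in two other concrete syntaxes.
print(g.serialize(format="json-ld"))  # assumes rdflib 6+ (or the rdflib-jsonld plugin)
print(g.serialize(format="xml"))      # RDF/XML
```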

We have a number of standards
- RDF 1.1: data model, links, basic assertions; different serializations (Turtle, RDFa, RDF/XML, JSON-LD)
- RDFS 1.1: simple vocabularies
- OWL 2: complex vocabularies, ontologies
- SPARQL 1.1: querying data
- RDB2RDF: relational databases to RDF
- all built on URIs

A fairly stable set of technologies by now! (A small vocabulary example follows.)

(10)
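To illustrate the vocabulary layers, here is a hedged sketch of a tiny RDFS vocabulary and the entailment it licenses; the terms are invented, and the reasoning step assumes the third-party owlrl package.

```python
from rdflib import Graph, URIRef, RDF
import owlrl  # third-party package implementing RDFS / OWL-RL entailment

data = """
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix ex:   <http://example.org/vocab#> .

# a very small "simple vocabulary"
ex:Report rdfs:subClassOf ex:Document .

# instance data using that vocabulary
<http://example.org/data/r1> a ex:Report .
"""

g = Graph()
g.parse(data=data, format="turtle")

# RDFS reasoning adds the triples the vocabulary implies.
owlrl.DeductiveClosure(owlrl.RDFS_Semantics).expand(g)

r1 = URIRef("http://example.org/data/r1")
doc = URIRef("http://example.org/vocab#Document")
print((r1, RDF.type, doc) in g)  # True: r1 is now also an ex:Document
```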

We  have  Linked  Data  principles  

(11)

Integration is done in different ways
- Very roughly:
  - data is accessed directly as RDF and turned into something useful
    - relies on data being "preprocessed" and published as RDF
  - data is collected from different sources and integrated internally
    - using, say, a triple store (see the sketch below)

(12)
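A minimal stand-in for the second path, assuming rdflib and two invented source URLs: several RDF sources are merged into one in-memory graph and queried with SPARQL (a real deployment would use a persistent triple store behind a SPARQL endpoint).

```python
from rdflib import Graph

g = Graph()

# Collect data from different (hypothetical) RDF sources into one graph…
for url in [
    "http://stats.example.org/population.ttl",
    "http://geo.example.org/regions.ttl",
]:
    g.parse(url, format="turtle")

# …and integrate by querying across all of them at once.
query = """
PREFIX ex: <http://example.org/vocab#>
SELECT ?region ?population WHERE {
    ?region ex:population ?population .
}
"""
for row in g.query(query):
    print(row.region, row.population)
```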


However…
- There is a price to pay: a relatively heavy ecosystem
  - many developers shy away from using RDF and related tools
- Not all applications need this!
  - data may be used directly, no need for integration concerns
  - the emphasis may be on easy production and manipulation of data with simple tools

(15)

Typical situation on the Web
- Data published in CSV, JSON, XML
- An application uses only 1-2 datasets; integration by direct programming is straightforward (see the sketch below)
  - e.g., in a Web application
- Data is often very large; direct manipulation is more efficient

(16)
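A sketch of that "direct programming" style, with made-up file and field names: the application reads one CSV and one JSON dataset with the standard library and joins them on a shared key, with no RDF machinery involved.

```python
import csv
import json

# Two small, hypothetical datasets published as raw files.
with open("hospitals.csv", newline="") as f:
    hospitals = {row["hospital_id"]: row for row in csv.DictReader(f)}

with open("waiting_times.json") as f:
    waiting_times = json.load(f)  # e.g. [{"hospital_id": "H1", "weeks": 4}, ...]

# "Integration" is just a join written directly in application code.
for record in waiting_times:
    hospital = hospitals.get(record["hospital_id"])
    if hospital:
        print(hospital["name"], record["weeks"])
```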

Non-RDF Data
- In some settings the data can be converted into RDF
- But, in many cases, it is not done:
  - e.g., the CSV data is way too big
  - RDF tooling may not be adequate for the task at hand
  - integration is not a major issue

(17)

(18) [image: an example application built on published NHS data]

What that application does…
- Gets the data published by the NHS
- Processes the data (e.g., through Hadoop)
- Integrates the result of the analysis with geographical data

I.e., the raw data is used without integration

(19)

The reality of data on the Web…
- It is still a fairly messy space out there ☹
  - many different formats are used
  - data is difficult to find
  - published data are messy, erroneous, …
  - tools are complex, unfinished…

(20)

How do developers perceive this?

'When transportation agencies consider data integration, one pervasive notion is that the analysis of existing information needs and infrastructure, much less the organization of data into viable channels for integration, requires a monumental initial commitment of resources and staff. Resource-scarce agencies identify this perceived major upfront overhaul as "unachievable" and "disruptive".'

-- Data Integration Primer: Challenges to Data Integration, US Dept. of Transportation

(21)

One may look at the problem through different goggles
- Two alternatives come to the fore:
  1. provide tools, environments, etc., to help outsiders publish Linked Data (in RDF) easily
     - a typical example is the Datalift project
  2. forget about RDF, Linked Data, etc., and concentrate on the raw data instead

(22)

But religions and cultures can coexist… ☺

(24)

Open Data on the Web Workshop
- We had a successful workshop in London in April:
  - around 100 participants
  - coming from different horizons: publishers and users of Linked Data, CSV, PDF, …

(25)

We also talked to our "stakeholders"
- Member organizations and companies
- Open Data Institute, Open Knowledge Foundation, Schema.org
- …

(26)

Some takeaways
- The Semantic Web community needs stability of the technology
  - do not add yet another technology block ☺
  - existing technologies should be maintained

(27)

Some takeaways
- Look at the more general space, too:
  - importance of metadata
  - dealing with non-RDF data formats
  - best practices are necessary to raise the quality of published data

(28)

We  need  to  meet  app  developers   where  they  are!  

(29)

Metadata is of major importance
- Metadata describes the characteristics of the dataset (see the sketch below):
  - structure, datatypes used
  - access rights, licenses
  - provenance, authorship
  - etc.
- Vocabularies are also key for Linked Data

(30)
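As an illustration of such metadata, here is a small, hand-written dataset description using W3C's Data Catalog vocabulary (DCAT) and Dublin Core terms; the dataset, publisher, license, and URLs are all invented.

```python
from rdflib import Graph

metadata = """
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct:  <http://purl.org/dc/terms/> .

<http://example.org/dataset/hospital-stats> a dcat:Dataset ;
    dct:title     "Hospital statistics (example)" ;
    dct:publisher <http://example.org/org/health-agency> ;
    dct:license   <http://example.org/licenses/open> ;
    dcat:distribution [
        a dcat:Distribution ;
        dcat:downloadURL <http://example.org/data/hospital-stats.csv> ;
        dcat:mediaType   "text/csv"
    ] .
"""

g = Graph()
g.parse(data=metadata, format="turtle")
print(len(g), "metadata triples")
```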

Vocabulary Management Action
- Standard vocabularies are necessary to describe data
  - there are already some initiatives: W3C's Data Cube, Data Catalog (DCAT), PROV, schema.org, DCMI, …
- At the moment, it is a fairly chaotic world…
  - many, possibly overlapping vocabularies
  - difficult to locate the one that is needed
  - vocabularies may not be properly managed, maintained, versioned, or given persistence…

(31)

W3C's plan:
- Provide a space whereby:
  - communities can develop vocabularies
  - vocabularies can be hosted at W3C if requested
  - vocabularies can be annotated with a proper set of metadata terms
  - a vocabulary directory can be established
- The exact structure is still being discussed: http://www.w3.org/2013/04/vocabs/

(32)

CSV on the Web
- Planned work areas:
  - a metadata vocabulary to describe CSV data (a hypothetical sketch follows below)
    - structure, reference to access rights, annotations, etc.
  - methods to find the metadata
    - part of an HTTP header, special rows and columns, packaging formats…
  - mapping content to RDF, JSON, XML
- Possibly at a later phase:
  - API standards to access CSV data

(34)
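The metadata format itself was still to be defined, so the following is only a hypothetical sketch of the idea: a small description of a CSV file's columns, datatypes, and license, used to type the rows while reading them; the file name and column names are invented.

```python
import csv

# Hypothetical metadata describing a CSV file (structure, datatypes, license).
metadata = {
    "url": "hospitals.csv",
    "license": "http://example.org/licenses/open",
    "columns": [
        {"name": "hospital_id", "datatype": "string"},
        {"name": "beds",        "datatype": "integer"},
    ],
}

casts = {"string": str, "integer": int}

with open(metadata["url"], newline="") as f:
    for row in csv.DictReader(f):
        # Apply the declared datatypes to each cell.
        typed = {col["name"]: casts[col["datatype"]](row[col["name"]])
                 for col in metadata["columns"]}
        print(typed)
```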

Open Data Best Practices
- Document best practices for data publishers:
  - management of persistence, versioning, URI design
  - use of core vocabularies (provenance, access control, ownership, annotations, …)
  - business models
- Specialized metadata vocabularies:
  - quality description (quality of the data, update frequencies, correction policies, etc.)
  - description of data access APIs
  - …

(36)

Summary
- Data on the Web has many different facets
- We have concentrated on the integration aspects in the past years
- We have to take a more general view and look at other types of data published on the Web

(37)

In future…
- We should look at other formats, not only CSV
  - MARC, GIS, ABIF, …
- Better outreach to data publishing communities and organizations
  - WF, RDA, ODI, OKFN, …

(38)

Enjoy the event!